# 06 — Ingestion Pipelines

The ingestion layer is where the world enters the engine. There are two fundamentally different kinds of input:

1. **Free prose** — chronicles, novels, short stories, dialogue logs, Discord messages. The engine reads the text, extracts entities and relations, embeds chunks. On Cognee, this is the `cognee.add()` + `cognee.cognify()` pipeline, with a custom extraction prompt that emits the Lore Engine's 36 typed labels.
2. **Structured lore** — timelines, family trees, gazetteers, bestiaries, magic-system descriptions, written in YAML by the world-builder. The Lore Engine's structured parser materializes typed graph edges directly. **No LLM is required for these.**

The structured path is the one that makes the engine *historically accurate*. Prose extraction is fuzzy by nature; YAML ingestion is exact. Both paths exist; structured is preferred for anything that becomes a load-bearing fact (lineage, era boundaries, faction rules).

## Ingestion paths overview

```
                          ┌─────────────────────────────────┐
                          │      World-Builder Authoring     │
                          │   (markdown, YAML, dialogue)     │
                          └────────────┬────────────────────┘
                                       │
       ┌───────────────────────────────┼───────────────────────────────┐
       │                               │                               │
       ▼                               ▼                               ▼
  prose path                      timeline.yaml                   family_tree.yaml
  cognee.add()                    Lore Engine YAML parser          Lore Engine YAML parser
  cognee.cognify()                (no LLM, exact)                  (no LLM, exact)
       │                               │                               │
       ▼                               ▼                               ▼
  Cognee chunks + vectors        Date, Era, Event nodes         Person, Lineage nodes
  Typed triples                  OCCURRED_AT, OCCURRED_DURING   PARENT_OF edges
  (Lore Engine extraction        PARTICIPATED_IN edges          EXISTED_DURING edges
   prompt emits 36 labels)             │                               │
       │                               │                               │
       └───────────────────────────────┴───────────────────────────────┘
                                       │
                                       ▼
                              Cognee-managed graph
                              (Neo4j — ADR 0008)
                                       │
                                       ▼
                            Consistency pipeline runs
                            (live + nightly batch)
```

## Path 1: Free prose (via Cognee)

The prose path goes through Cognee's standard `add` + `cognify` pipeline. The Lore Engine registers a custom extraction prompt with Cognee; the prompt tells the LLM to emit the Lore Engine's 36 typed labels and the ~70 edge types instead of Cognee's default `Entity`/`DataPoint` types.

```python
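# World-builder's ingestion script
import asyncio

import cognee


async def main() -> None:
    await cognee.add("chapters/aldric_origin.md")   # raw markdown
    await cognee.cognify()                            # extract + embed + index


asyncio.run(main())  # `await` is only valid inside a coroutine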
```

The pipeline:

1. **Cognee watcher** detects a new file (or receives a `cognee.add()` call).
2. **Cognee ingestion worker** chunks the text (512-token windows, 64-token overlap), generates embeddings, writes `Chunk` and `Dataset` nodes.
3. **Lore Engine extraction prompt** runs on each chunk. The LLM is told to emit triples using the Lore Engine's typed ontology. The response is parsed and validated against the schema.
4. **Entity resolution** matches extracted entity names against known canonical names (Cognee's `loadKnownEntities` helper, with a `lore_engine` namespace prefix).
5. **Cypher writer** materializes entities and relations into the graph through Cognee's graph adapter, adding a `:FEATURES` edge from the source to each of them.
6. **Contradiction detection** runs on the new edges (see `04-consistency.md`).
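
The windowing in step 2 can be sketched as below. This is a minimal illustration, not Cognee's actual chunker: it assumes the text has already been tokenized into a list, and `chunk_tokens` is a hypothetical helper name.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], window: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap  # each new window starts this many tokens later
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks share their boundary 64 tokens, so an entity mention that straddles a window edge still appears whole in at least one chunk.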

### Extraction prompt (Lore Engine extension to Cognee)

Cognee's default extraction prompt emits `Entity` and `DataPoint` types. The Lore Engine replaces this with a prompt that teaches the LLM the Lore Engine's 36 typed labels and the ~70 edge types:

```
You are extracting structured information from a passage of high-fantasy fiction
for the Lore Engine knowledge graph.

Emit a list of triples. Each triple is (subject, relation, object).

Subject and object must be one of the Lore Engine typed labels:
  Person, Faction, Location, Item, Era, Date, Lineage, Culture, Deity,
  Language, MagicSystem, Title, Region, Material, Creature, Spell,
  Plane, Setting, NPC, PC, Human, DomainEntity, Event, ... (full list in 01-ontology.md)

Relation must be one of the Lore Engine typed edge types:
  RULED, PARENT_OF, MEMBER_OF, LOCATED_IN, OCCURRED_AT, OCCURRED_DURING,
  PARTICIPATED_IN, ALLIED_WITH, ENEMY_OF, POSSESSES, SPOUSE_OF, WORSHIPS,
  PRACTICES, SPEAKS, BELONGS_TO, CLAIMS_TITLE, CAUSED, PRECEDED,
  CONCURRENT_WITH, WITNESSED, LOGGED_IN, GIVEN_BY, TARGETS, PAID_BY,
  PART_OF, ... (full list in 01-ontology.md)

For Event nodes, the temporal_hint field is REQUIRED. Format: {era}.{year}[.month_N][.day_N].
For Person nodes, birth and death years are STRONGLY PREFERRED in temporal_hint.
For Faction nodes, founded and dissolved years are STRONGLY PREFERRED.

If the passage describes a person, also extract their MEMBER_OF, WORSHIPS,
SPEAKS, BELONGS_TO, POSSESSES if explicitly stated. Prefer specific
faction/religion/culture names over generic descriptions.

If a fact is too vague to assign a time, emit temporal_hint: "unknown"
and set source_confidence: 0.5.
```

The Cognee pipeline runs this prompt per chunk and parses the result. The Lore Engine validates the parsed triples against its typed ontology (rejecting triples that reference unknown labels) before writing to the graph.
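
A rough sketch of that validation step follows. The `Triple` shape, the `validate_triples` name, and the abbreviated label and relation sets are assumptions made for illustration; the source only states that triples with unknown labels are rejected, so the relation check here is an extra assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Abbreviated for illustration; the full sets live in 01-ontology.md.
KNOWN_LABELS = {"Person", "Faction", "Location", "Item", "Era", "Lineage", "Deity"}
KNOWN_RELATIONS = {"RULED", "PARENT_OF", "MEMBER_OF", "LOCATED_IN", "ENEMY_OF", "WORSHIPS"}


@dataclass(frozen=True)
class Triple:
    subject: str
    subject_label: str
    relation: str
    object: str
    object_label: str


def validate_triples(triples: Iterable[Triple]) -> Tuple[List[Triple], List[Tuple[Triple, str]]]:
    """Partition triples into accepted ones and (triple, reason) rejections."""
    accepted: List[Triple] = []
    rejected: List[Tuple[Triple, str]] = []
    for t in triples:
        if t.subject_label not in KNOWN_LABELS:
            rejected.append((t, f"unknown subject label: {t.subject_label}"))
        elif t.object_label not in KNOWN_LABELS:
            rejected.append((t, f"unknown object label: {t.object_label}"))
        elif t.relation not in KNOWN_RELATIONS:
            rejected.append((t, f"unknown relation: {t.relation}"))
        else:
            accepted.append(t)
    return accepted, rejected
```

Keeping the rejection reason next to each triple makes it possible to log what the LLM got wrong per chunk, rather than dropping bad output silently.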

### What prose is good for

- Color, character voice, cultural texture.
- The kind of information that doesn't have a clean structure: *"Aldric was known for his sharp wit and his hatred of the Crimson Pact."*
- In-fiction dialogue logs.

### What prose is bad for

- Lineage. *"Aldric was the son of Maric, who was the son of Theron..."* extracted by an LLM is correct maybe 80% of the time, and silent errors are catastrophic. **Use a `family_tree.yaml` for lineage. Always.**
- Era boundaries. *"The Third Age began in 1 TA..."* — the LLM will sometimes parse this as `1st_age` or `first_age` or `third_age_1`. Use a `timeline.yaml`.
- Magic system taxonomy. Free text describing spells is fine; the spell-to-system mapping is a `magic_system.yaml`.

## Path 2: Structured YAML ingestion

This is the new pipeline. Each YAML type has a dedicated extractor that parses the structure and writes typed Cypher directly — no LLM in the loop.

### `timeline.yaml` — era boundaries + named events

```yaml
era: "3rd_age"
parent_era: null
start: -100
end: 600
description: "The Third Age. The age of iron crowns and broken gods."

events:
  - slug: "battle_of_black_spire"
    label: "Battle of Black Spire"
    in_fiction_date: "17 Hearthmoon, 340 TA"
    era: "3rd_age.age_of_iron"
    year: 340
    month: 3
    day: 17
    location: "black_spire_pass"
    participants: ["house_vyr", "crimson_pact"]
    description: "House Vyr's decisive victory over the Crimson Pact."
    significance: "End of the Border Wars."
```

The timeline extractor:

1. Creates/updates the `Era` node.
2. For each event, creates a `Date` node, an `Event` node, and `OCCURRED_AT` + `OCCURRED_DURING` + `PARTICIPATED_IN` edges.
3. Sets `valid_from` / `valid_until` on each `Event` based on its date.
4. Tags the `LoreSource` as `source_type: timeline`.
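
The steps above could be planned as parameterized Cypher along these lines. This is a sketch under stated assumptions, not the engine's actual extractor: `plan_timeline` is a hypothetical name, the input is the dict a YAML loader would return, node property names such as `slug` and `key` are invented, `OCCURRED_AT` is assumed to point at the `Date` node, and a single-day event is assumed to have `valid_from` equal to `valid_until`. The `Date` key reuses the `{era}.{year}[.month_N][.day_N]` format from the extraction prompt.

```python
from typing import Any, Dict, List, Tuple

Statement = Tuple[str, Dict[str, Any]]  # (Cypher text, parameters)


def plan_timeline(doc: Dict[str, Any]) -> List[Statement]:
    """Turn a parsed timeline.yaml document into parameterized Cypher statements."""
    stmts: List[Statement] = [(
        "MERGE (e:Era {slug: $slug}) "
        "SET e.parent_era = $parent_era, e.start = $start, e.end = $end",
        {"slug": doc["era"], "parent_era": doc.get("parent_era"),
         "start": doc["start"], "end": doc["end"]},
    )]
    for ev in doc.get("events", []):
        # Build the date key: era and year are required, month and day optional.
        date_key = f"{ev['era']}.{ev['year']}"
        if ev.get("month") is not None:
            date_key += f".month_{ev['month']}"
            if ev.get("day") is not None:
                date_key += f".day_{ev['day']}"
        stmts.append((
            "MERGE (ev:Event {slug: $slug}) "
            "SET ev.label = $label, ev.valid_from = $date, ev.valid_until = $date "
            "MERGE (d:Date {key: $date}) "
            "MERGE (era:Era {slug: $era}) "
            "MERGE (ev)-[:OCCURRED_AT]->(d) "
            "MERGE (ev)-[:OCCURRED_DURING]->(era)",
            {"slug": ev["slug"], "label": ev["label"], "date": date_key, "era": ev["era"]},
        ))
        for participant in ev.get("participants", []):
            # MATCH (not MERGE): participants are assumed to exist already.
            stmts.append((
                "MATCH (ev:Event {slug: $slug}), (p {slug: $participant}) "
                "MERGE (p)-[:PARTICIPATED_IN]->(ev)",
                {"slug": ev["slug"], "participant": participant},
            ))
    return stmts
```

Using `MERGE` keeps re-ingestion of the same file idempotent, and passing values as parameters avoids building Cypher by string concatenation.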

### `family_tree.yaml` — direct lineage

```yaml
founding_ancestor: "theron_ashveil"
lineage: "house_vyr_bloodline"
description: "The bloodline of House Vyr, from Theron Ashveil to the present."

members:
  - id: "theron_ashveil"
    name: "Theron Ashveil"
    born: "1st_age.year_412"
    died: "2nd_age.year_87"
    spouse_of: ["mara_ashveil"]
    
  - id: "maric_vyr"
    name: "Maric Vyr"
    born: "2nd_age.year_70"
    died: "3rd_age.year_15"
    parents: ["theron_ashveil", "mara_ashveil"]
    
  - id: "aldric_raventhorne"
    name: "Aldric Raventhorne"
    born: "3rd_age.year_300"
    died: "3rd_age.year_360"
    parents: ["cael_vyr", "yssa_raventhorne"]
    spouse_of: ["elara_raventhorne"]
```

The family-tree extractor:

1. Creates/updates `Person` nodes.
2. Creates/updates the `Lineage` node with `founding_ancestor`.
3. Writes `PARENT_OF` edges (with `valid_from` set to the child's birth, `valid_until` set to the parent's death).
4. Writes `MEMBER_OF` edges from each person to the lineage.
5. Runs an anachronism check on every `Person` node: do the parents' lifespans cover the child's birth?
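
The anachronism check in step 5 might look roughly like this. It is a simplified sketch: `parse_date` and `anachronisms` are hypothetical names, eras are assumed to be ordered by their leading ordinal (`1st_age` before `2nd_age`), sub-eras are not handled, and every member is assumed to carry both `born` and `died`.

```python
import re
from typing import Any, Dict, List, Tuple

_DATE = re.compile(r"^(\d+)(?:st|nd|rd|th)_age\.year_(\d+)$")


def parse_date(value: str) -> Tuple[int, int]:
    """Parse '2nd_age.year_70' into a sortable (era_number, year) tuple."""
    match = _DATE.match(value)
    if match is None:
        raise ValueError(f"unrecognized date: {value!r}")
    return int(match.group(1)), int(match.group(2))


def anachronisms(members: List[Dict[str, Any]]) -> List[str]:
    """Report children born outside a parent's lifespan, and undefined parents."""
    by_id = {m["id"]: m for m in members}
    issues: List[str] = []
    for child in members:
        born = parse_date(child["born"])
        for parent_id in child.get("parents", []):
            parent = by_id.get(parent_id)
            if parent is None:
                issues.append(f"{child['id']}: parent {parent_id} is not defined")
                continue
            # Tuples compare era first, then year.
            if not parse_date(parent["born"]) <= born <= parse_date(parent["died"]):
                issues.append(f"{child['id']}: born outside the lifespan of {parent_id}")
    return issues
```

A stricter version would allow a short posthumous-birth window for one parent; this sketch follows the rule exactly as stated in step 5.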

### `gazetteer.yaml` — locations, regions, geography

```yaml
locations:
  - id: "thornwall_keep"
    name: "Thornwall Keep"
    type: "fortress"
    part_of: "valdorn"
    culture_of: "valdorni"
    coordinates: {x: 1240, y: 870}
    description: "..."
    events_held: ["coronation_of_aelric"]

regions:
  - id: "northern_reaches"
    name: "Northern Reaches"
    parent_region: null
    contains: ["valdorn", "mardsville", "frosthollow"]
```

The gazetteer extractor:

1. Creates `Location` and `Region` nodes.
2. Writes `PART_OF` edges.
3. Writes `CULTURE_OF` edges.
4. Writes an `OCCURRED_AT` edge from each event named in `events_held` to its location.
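
As an illustration, the edges implied by a gazetteer file could be derived like this. `gazetteer_edges` is a hypothetical helper, the input is the dict a YAML loader would return, and the edge directions (location to culture for `CULTURE_OF`, contained place to region for `PART_OF`) are assumptions not confirmed by the source.

```python
from typing import Any, Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source id, relation, target id)


def gazetteer_edges(doc: Dict[str, Any]) -> List[Edge]:
    """List the typed edges a parsed gazetteer.yaml document implies."""
    edges: List[Edge] = []
    for loc in doc.get("locations", []):
        if loc.get("part_of"):
            edges.append((loc["id"], "PART_OF", loc["part_of"]))
        if loc.get("culture_of"):
            edges.append((loc["id"], "CULTURE_OF", loc["culture_of"]))
        for event in loc.get("events_held", []):
            edges.append((event, "OCCURRED_AT", loc["id"]))
    for region in doc.get("regions", []):
        if region.get("parent_region"):
            edges.append((region["id"], "PART_OF", region["parent_region"]))
        for child in region.get("contains", []):
            edges.append((child, "PART_OF", region["id"]))
    return edges
```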

### `bestiary.yaml` — creatures

```yaml
creatures:
  - id: "pale_worm"
    name: "The Pale Worm"
    species: "worm"
    alignment: "chaotic_evil"
    habitat: "frosthollow"
    first_appeared: "3rd_age.year_120"
    description: "A massive frost-worm that haunts the Frosthollow tundra."
    defeated_by: ["aldric_raventhorne"]  # creates DEFEATED edges
```

### `magic_system.yaml` — magic taxonomy

```yaml
systems:
  - id: "the_weave"
    name: "The Weave"
    source: "natural_law"
    practitioners: ["valdorni_mage", "sisterhood_of_silver"]
    description: "..."
    
  - id: "divine_miracles"
    name: "Divine Miracles"
    source: "aelar_the_patient"
    practitioners: ["cleric_of_aelar"]
    description: "..."

spells:
  - id: "emberlance"
    name: "Emberlance"
    system: "the_weave"
    level: 3
    school: "evocation"
    practitioners: ["valdorni_mage"]
```

### `culture.yaml` — cultures, languages, deities

```yaml
cultures:
  - id: "valdorni"
    name: "Valdorni"
    language: "old_valdorni"
    homeland: "valdorn"
    description: "..."

languages:
  - id: "old_valdorni"
    name: "Old Valdorni"
    script: "runic"
    speakers: ["valdorni", "house_vyr"]

deities:
  - id: "aelar_the_patient"
    name: "Aelar the Patient"
    domain: ["healing", "patience", "winter"]
    alignment: "neutral_good"
    symbol: "a single open eye"
    worshipped_by: ["valdorni", "sisterhood_of_silver"]
```

## Path 3: Dialogue logs (in-fiction)

For when a player/NPC says something in-character that should be recorded as lore:

```http
POST /ingest/dialogue
Content-Type: application/json

{
  "speaker": "aldric_raventhorne",
  "text": "I will not rest until the Crimson Pact is broken.",
  "in_fiction_date": "3rd_age.year_345",
  "location": "thornwall_keep"
}
```

This creates a `Message` node (or a dedicated `Dialogue` node) and links the speaker to the location at the given in-fiction date. It is useful for building up first-person perspective in `narrate_arc`.

## Why YAML, not JSON or TOML

YAML wins because:

- **Comments.** Every world-builder annotates their lore. JSON forces them out.
- **References.** `parents: ["theron_ashveil"]` is readable; `{"parents": ["theron_ashveil"]}` is noise.
- **Multi-line strings.** `description: |` blocks handle prose naturally.
- **Standard tooling.** `pyyaml` is a mature, widely used package. It is third-party (not part of the standard library), so it adds one small dependency at most.

The downside is YAML's gotchas (Norway problem, tab/space sensitivity). The extractor is strict and rejects ambiguous inputs with line numbers. Better to fail loudly than to silently parse an unquoted `NO` (the country code for Norway) as the boolean `false`.
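
One way to catch these gotchas before the YAML loader ever runs is a line-based pre-check. The sketch below is a heuristic under stated assumptions, not the engine's real validator: `lint_yaml` is a hypothetical name, it only inspects simple `key: value` lines (list items and flow sequences are ignored), and it flags the YAML 1.1 boolean-like words other than `true`/`false`.

```python
from typing import List

# Unquoted scalars that YAML 1.1 loaders may coerce to booleans.
_AMBIGUOUS = {"y", "n", "yes", "no", "on", "off"}


def lint_yaml(text: str) -> List[str]:
    """Return human-readable issues, each prefixed with its line number."""
    issues: List[str] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            issues.append(f"line {lineno}: tab in indentation")
        _key, sep, value = line.partition(":")
        value = value.split(" #", 1)[0].strip()  # drop a trailing comment
        # Quoted values keep their quotes here, so they never match the set.
        if sep and value.lower() in _AMBIGUOUS:
            issues.append(f"line {lineno}: unquoted {value!r} may be read as a boolean; quote it")
    return issues
```

For example, `lint_yaml('country: NO')` reports line 1, while `country: "NO"` passes because the quotes make the value an unambiguous string.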

## The structured-ingestor (Lore Engine on Cognee)

The structured-YAML parser lives in the Lore Engine extension as a Python module:

```python
# lore_engine/parsers/timeline.py
# Validates the timeline.yaml schema
# Emits MERGE (e:Era {slug, parent_era, start, end}) and similar
# Calls Cognee's graph adapter to execute the Cypher
# Tags the LoreSource with source_type: timeline

# lore_engine/parsers/family_tree.py
# Same pattern, different schema

# ... one parser per YAML type
```

The structured path is **fast and deterministic** — typical ingest is <500ms per YAML file, no GPU, no LLM latency. The parser is a thin wrapper over Cognee's graph adapter; the schema validation is strict and rejects ambiguous inputs with line numbers.
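
The one-parser-per-type layout suggests a small registry that routes each file to its parser. The following is only a sketch of that idea: `register`, `dispatch`, and `PARSERS` are hypothetical names, routing by file stem is an assumption, and the parser body is a stand-in that returns a description instead of executing Cypher.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

Parser = Callable[[Dict[str, Any]], List[str]]
PARSERS: Dict[str, Parser] = {}


def register(kind: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser under a YAML type name."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[kind] = fn
        return fn
    return wrap


@register("timeline")
def parse_timeline(doc: Dict[str, Any]) -> List[str]:
    # Stand-in body: a real parser would validate the schema and emit Cypher.
    return [f"Era {doc['era']}"]


def dispatch(path: str, doc: Dict[str, Any]) -> List[str]:
    """Pick the parser from the file stem, e.g. 'timeline.yaml' -> 'timeline'."""
    kind = Path(path).stem
    if kind not in PARSERS:
        raise ValueError(f"{path}: no parser registered for YAML type {kind!r}")
    return PARSERS[kind](doc)
```

Failing on an unknown stem, rather than falling back to a generic parser, matches the strict fail-loudly stance of the rest of the structured path.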

## What this means for the LLM

The reasoning LLM never has to ingest. It only reads. (The extraction LLM on the prose path is a separate, write-side component.) World-builders ingest via:

- `cognee.add()` + `cognee.cognify()` (prose — markdown, dialogue)
- `POST /ingest/structured` (YAML — new)
- `POST /ingest/dialogue` (JSON — new)
- `tea add-source <file>` (CLI wrapper — new, optional)
- Direct MCP tool calls (`add_lore_source`, `add_entity`, `add_relation`)

The LLM is told (in the reasoning harness): *"You do not write lore. You do not modify the graph. You query it. If you believe a fact is missing, you say so to the user; the world-builder adds it."*

## Risk: YAML drift from prose

A common failure mode: the prose says "Aldric's father was Theron" but the `family_tree.yaml` has his father as "Maric." The engine flags this as a contradiction. The world-builder picks one. The LLM is told never to resolve the contradiction itself.

**Mitigation:** the consistency engine treats prose-derived lineage as `confidence: 0.6` and YAML-derived lineage as `confidence: 1.0` by default. When they conflict, the YAML value stands as the default and a `Contradiction` node is created citing the prose source, so the world-builder can confirm or overturn it.
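
The default resolution rule can be sketched as follows. The `Claim` and `Contradiction` dataclasses and the `resolve` function are hypothetical shapes invented for this example; only the two confidence values and the YAML-wins default come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DEFAULT_CONFIDENCE = {"prose": 0.6, "yaml": 1.0}


@dataclass(frozen=True)
class Claim:
    subject: str
    relation: str
    value: str
    source_type: str  # "prose" or "yaml"
    source: str       # file the claim came from

    @property
    def confidence(self) -> float:
        return DEFAULT_CONFIDENCE[self.source_type]


@dataclass(frozen=True)
class Contradiction:
    kept: Claim
    rejected: Claim  # cited so the world-builder can review it


def resolve(a: Claim, b: Claim) -> Tuple[Claim, Optional[Contradiction]]:
    """Keep the higher-confidence claim; record a Contradiction if values differ."""
    kept, rejected = sorted((a, b), key=lambda c: c.confidence, reverse=True)
    if a.value == b.value:
        return kept, None  # sources agree, nothing to flag
    return kept, Contradiction(kept=kept, rejected=rejected)
```

The rejected claim is kept on the record rather than discarded, which is what lets the world-builder, not the LLM, make the final call.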

## Roadmap note

Structured ingestion is the highest-leverage part of this design. **It is also the part that requires the most world-builder discipline.** We can't enforce YAML authoring, but we can make it easy and rewarding (validation, preview, auto-completion in a future UI).