lore-engine/docs/06-ingestion.md
Kaysser Kayyali 45ca1d962d docs: sweep remaining 'Neo4j or Kuzu' references after ADR 0008
The earlier commit missed 8 spots that still presented the graph
backend as undecided (00-overview x2, 12-storage x2, 13-microservice
x2, 06-ingestion, plan/05). All now pinned to Neo4j per ADR 0008.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-17 22:57:16 -04:00

06 — Ingestion Pipelines

The ingestion layer is where the world enters the engine. There are two fundamentally different kinds of input:

  1. Free prose — chronicles, novels, short stories, dialogue logs, Discord messages. The engine reads the text, extracts entities and relations, embeds chunks. On Cognee, this is the cognee.add() + cognee.cognify() pipeline, with a custom extraction prompt that emits the Lore Engine's 36 typed labels.
  2. Structured lore — timelines, family trees, gazetteers, bestiaries, magic-system descriptions, written in YAML by the world-builder. The Lore Engine's structured parser materializes typed graph edges directly. No LLM is required for these.

The structured path is the one that makes the engine historically accurate. Prose extraction is fuzzy by nature; YAML ingestion is exact. Both paths exist; structured is preferred for anything that becomes a load-bearing fact (lineage, era boundaries, faction rules).

Ingestion paths overview

                          ┌─────────────────────────────────┐
                          │      World-Builder Authoring     │
                          │   (markdown, YAML, dialogue)     │
                          └────────────┬────────────────────┘
                                       │
       ┌───────────────────────────────┼───────────────────────────────┐
       │                               │                               │
       ▼                               ▼                               ▼
  prose path                      timeline.yaml                   family_tree.yaml
  cognee.add()                    Lore Engine YAML parser          Lore Engine YAML parser
  cognee.cognify()                (no LLM, exact)                  (no LLM, exact)
       │                               │                               │
       ▼                               ▼                               ▼
  Cognee chunks + vectors        Date, Era, Event nodes         Person, Lineage nodes
  Typed triples                  RULES, OCCURRED_DURING         PARENT_OF edges
  (Lore Engine extraction        PARTICIPATED_IN edges          EXISTED_DURING edges
   prompt emits 36 labels)             │                               │
       │                               │                               │
       └───────────────────────────────┴───────────────────────────────┘
                                       │
                                       ▼
                              Cognee-managed graph
                              (Neo4j — ADR 0008)
                                       │
                                       ▼
                            Consistency pipeline runs
                            (live + nightly batch)

Path 1: Free prose (via Cognee)

The prose path goes through Cognee's standard add + cognify pipeline. The Lore Engine registers a custom extraction prompt with Cognee; the prompt tells the LLM to emit the Lore Engine's 36 typed labels and the ~70 edge types instead of Cognee's default Entity/DataPoint types.

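# World-builder's ingestion script
import asyncio
import cognee

async def main():
    await cognee.add("chapters/aldric_origin.md")   # raw markdown
    await cognee.cognify()                            # extract + embed + index

asyncio.run(main())   # top-level await only works inside a coroutine or an async REPL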

The pipeline:

  1. Cognee watcher detects a new file (or receives a cognee.add() call).
  2. Cognee ingestion worker chunks the text (512-token windows, 64-token overlap), generates embeddings, writes Chunk and Dataset nodes.
  3. Lore Engine extraction prompt runs on each chunk. The LLM is told to emit triples using the Lore Engine's typed ontology. The response is parsed and validated against the schema.
  4. Entity resolution matches extracted entity names against known canonical names (Cognee's loadKnownEntities helper, with a lore_engine namespace prefix).
  5. Cypher writer materializes entities and relations into the graph using Cognee's graph adapter, applying the :FEATURES edge from the source.
  6. Contradiction detection runs on the new edges (see 04-consistency.md).
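
The windowing in step 2 can be sketched as below. This is only an illustration of 512-token windows with a 64-token overlap, not Cognee's actual chunker: `window_chunks` is a hypothetical name, and the whitespace split stands in for a real tokenizer.

```python
def window_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # 448 with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting is only a stand-in for a real tokenizer.
tokens = ("Aldric rode north " * 400).split()  # 1200 tokens
chunks = window_chunks(tokens)
```

With the defaults, each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear as the first 64 of the next.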

Extraction prompt (Lore Engine extension to Cognee)

Cognee's default extraction prompt emits Entity and DataPoint types. The Lore Engine replaces this with a prompt that teaches the LLM the Lore Engine's 36 typed labels and the ~70 edge types:

You are extracting structured information from a passage of high-fantasy fiction
for the Lore Engine knowledge graph.

Emit a list of triples. Each triple is (subject, relation, object).

Subject and object must be one of the Lore Engine typed labels:
  Person, Faction, Location, Item, Era, Date, Lineage, Culture, Deity,
  Language, MagicSystem, Title, Region, Material, Creature, Spell,
  Plane, Setting, NPC, PC, Human, DomainEntity.

Relation must be one of the Lore Engine typed edge types:
  RULED, PARENT_OF, MEMBER_OF, LOCATED_IN, OCCURRED_AT, OCCURRED_DURING,
  PARTICIPATED_IN, ALLIED_WITH, ENEMY_OF, POSSESSES, SPOUSE_OF, WORSHIPS,
  PRACTICES, SPEAKS, BELONGS_TO, CLAIMS_TITLE, CAUSED, PRECEDED,
  CONCURRENT_WITH, WITNESSED, LOGGED_IN, GIVEN_BY, TARGETS, PAID_BY,
  PART_OF, ... (full list in 01-ontology.md)

For Event nodes, the temporal_hint field is REQUIRED. Format: {era}.{year}[.month_N][.day_N].
For Person nodes, birth and death years are STRONGLY PREFERRED in temporal_hint.
For Faction nodes, founded and dissolved years are STRONGLY PREFERRED.

If the passage describes a person, also extract their MEMBER_OF, WORSHIPS,
SPEAKS, BELONGS_TO, POSSESSES if explicitly stated. Prefer specific
faction/religion/culture names over generic descriptions.

If a fact is too vague to assign a time, emit temporal_hint: "unknown"
and set source_confidence: 0.5.

The Cognee pipeline runs this prompt per chunk and parses the result. The Lore Engine validates the parsed triples against its typed ontology (rejecting triples that reference unknown labels) before writing to the graph.
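
A minimal sketch of that validation step, assuming each parsed triple is a dict with `subject`, `relation`, and `object` keys (the real parsed shape is not specified here). `validate_triples` is a hypothetical name, the label and relation sets are only the subsets quoted in the prompt above, and rejecting unknown relations as well as unknown labels is an assumption of this sketch.

```python
# Subsets copied from the prompt above; the complete sets live in 01-ontology.md.
LABELS = frozenset({
    "Person", "Faction", "Location", "Item", "Era", "Date", "Lineage", "Culture",
    "Deity", "Language", "MagicSystem", "Title", "Region", "Material", "Creature",
    "Spell", "Plane", "Setting", "NPC", "PC", "Human", "DomainEntity",
})
RELATIONS = frozenset({
    "RULED", "PARENT_OF", "MEMBER_OF", "LOCATED_IN", "OCCURRED_AT", "OCCURRED_DURING",
    "PARTICIPATED_IN", "ALLIED_WITH", "ENEMY_OF", "POSSESSES", "SPOUSE_OF", "WORSHIPS",
    "PRACTICES", "SPEAKS", "BELONGS_TO", "CLAIMS_TITLE", "CAUSED", "PRECEDED",
    "CONCURRENT_WITH", "WITNESSED", "LOGGED_IN", "GIVEN_BY", "TARGETS", "PAID_BY",
    "PART_OF",
})


def validate_triples(triples: list[dict]) -> tuple[list[dict], list[tuple[dict, str]]]:
    """Return (accepted, rejected); each rejected entry carries the reason."""
    accepted, rejected = [], []
    for triple in triples:
        reasons = [
            f"unknown {role} label {triple[role]['label']!r}"
            for role in ("subject", "object")
            if triple[role]["label"] not in LABELS
        ]
        if triple["relation"] not in RELATIONS:
            reasons.append(f"unknown relation {triple['relation']!r}")
        if reasons:
            rejected.append((triple, "; ".join(reasons)))
        else:
            accepted.append(triple)
    return accepted, rejected


parsed = [
    {"subject": {"label": "Person", "name": "Aldric"}, "relation": "ENEMY_OF",
     "object": {"label": "Faction", "name": "Crimson Pact"}},
    {"subject": {"label": "Wizard", "name": "Aldric"}, "relation": "HATES",
     "object": {"label": "Faction", "name": "Crimson Pact"}},
]
accepted, rejected = validate_triples(parsed)
```

Keeping the rejection reason next to each rejected triple makes it possible to log what the LLM got wrong without writing anything to the graph.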

What prose is good for

  • Color, character voice, cultural texture.
  • The kind of information that doesn't have a clean structure: "Aldric was known for his sharp wit and his hatred of the Crimson Pact."
  • In-fiction dialogue logs.

What prose is bad for

  • Lineage. "Aldric was the son of Maric, who was the son of Theron..." extracted by an LLM is correct maybe 80% of the time, and silent errors are catastrophic. Use a family_tree.yaml for lineage. Always.
  • Era boundaries. "The Third Age began in 1 TA..." — the LLM will sometimes parse this as 1st_age or first_age or third_age_1. Use a timeline.yaml.
  • Magic system taxonomy. Free text describing spells is fine; the spell-to-system mapping is a magic_system.yaml.

Path 2: Structured YAML ingestion

This is the new pipeline. Each YAML type has a dedicated extractor that parses the structure and writes typed Cypher directly — no LLM in the loop.

timeline.yaml — era boundaries + named events

era: "3rd_age"
parent_era: null
start: -100
end: 600
description: "The Third Age. The age of iron crowns and broken gods."

events:
  - slug: "battle_of_black_spire"
    label: "Battle of Black Spire"
    in_fiction_date: "17 Hearthmoon, 340 TA"
    era: "3rd_age.age_of_iron"
    year: 340
    month: 3
    day: 17
    location: "black_spire_pass"
    participants: ["house_vyr", "crimson_pact"]
    description: "House Vyr's decisive victory over the Crimson Pact."
    significance: "End of the Border Wars."

The timeline extractor:

  1. Creates/updates the Era node.
  2. For each event, creates a Date node, an Event node, and OCCURRED_AT + OCCURRED_DURING + PARTICIPATED_IN edges.
  3. Sets valid_from / valid_until on each Event based on its date.
  4. Tags the LoreSource as source_type: timeline.
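
Steps 1 and 2 could look roughly like the sketch below, which takes an already-parsed timeline document and returns plain node and edge tuples. `timeline_to_graph`, the Date id format, and the edge endpoints (event to location for OCCURRED_AT, event to era for OCCURRED_DURING, participant to event for PARTICIPATED_IN) are assumptions for illustration; how the Date node is linked to the Event is not specified here, so the sketch leaves it unlinked.

```python
def timeline_to_graph(doc: dict) -> tuple[list[tuple[str, str]], list[tuple[str, str, str]]]:
    """Derive (label, id) nodes and (source, edge_type, target) edges from a timeline document."""
    nodes = [("Era", doc["era"])]
    edges = []
    for event in doc.get("events", []):
        slug = event["slug"]
        # Assumed Date id: era, then year, then optional month and day.
        date_id = f"{event['era']}.year_{event['year']}"
        if "month" in event:
            date_id += f".month_{event['month']}"
            if "day" in event:
                date_id += f".day_{event['day']}"
        nodes += [("Date", date_id), ("Event", slug)]
        edges.append((slug, "OCCURRED_DURING", event["era"]))
        if "location" in event:
            edges.append((slug, "OCCURRED_AT", event["location"]))
        edges += [(who, "PARTICIPATED_IN", slug) for who in event.get("participants", [])]
    return nodes, edges


# The dict mirrors the timeline.yaml example above, as a YAML loader would return it.
doc = {
    "era": "3rd_age",
    "events": [{
        "slug": "battle_of_black_spire",
        "era": "3rd_age.age_of_iron",
        "year": 340, "month": 3, "day": 17,
        "location": "black_spire_pass",
        "participants": ["house_vyr", "crimson_pact"],
    }],
}
nodes, edges = timeline_to_graph(doc)
```

Separating "derive nodes and edges" from "write Cypher" keeps the derivation testable without a running graph database.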

family_tree.yaml — direct lineage

founding_ancestor: "theron_ashveil"
lineage: "house_vyr_bloodline"
description: "The bloodline of House Vyr, from Theron Ashveil to the present."

members:
  - id: "theron_ashveil"
    name: "Theron Ashveil"
    born: "1st_age.year_412"
    died: "2nd_age.year_87"
    spouse_of: ["mara_ashveil"]
    
  - id: "maric_vyr"
    name: "Maric Vyr"
    born: "2nd_age.year_70"
    died: "3rd_age.year_15"
    parents: ["theron_ashveil", "mara_ashveil"]
    
  - id: "aldric_raventhorne"
    name: "Aldric Raventhorne"
    born: "3rd_age.year_300"
    died: "3rd_age.year_360"
    parents: ["cael_vyr", "yssa_raventhorne"]
    spouse_of: ["elara_raventhorne"]

The family-tree extractor:

  1. Creates/updates Person nodes.
  2. Creates/updates the Lineage node with founding_ancestor.
  3. Writes PARENT_OF edges (with valid_from set to the child's birth, valid_until set to the parent's death).
  4. Writes MEMBER_OF edges from each person to the lineage.
  5. Runs an anachronism check on every Person node: do the parents' lifespans cover the child's birth?
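
Steps 3 and 5 can be sketched together. The sketch assumes dates use the `era.year_N` form from the example above and that eras sort in the order given by `ERA_ORDER`; both `ERA_ORDER` and the function names are hypothetical, and the `late_child` entry is invented test data that is deliberately anachronistic.

```python
ERA_ORDER = ["1st_age", "2nd_age", "3rd_age"]  # assumed chronological order


def date_key(stamp: str) -> tuple[int, int]:
    """Turn '2nd_age.year_70' into a sortable (era index, year) pair."""
    era, year = stamp.split(".year_")
    return ERA_ORDER.index(era), int(year)


def parent_edges(members: list[dict]) -> tuple[list[dict], list[str]]:
    """Build PARENT_OF edges and report children born outside a parent's lifespan."""
    by_id = {m["id"]: m for m in members}
    edges, problems = [], []
    for child in members:
        for parent_id in child.get("parents", []):
            parent = by_id.get(parent_id)
            if parent is None:
                problems.append(f"{child['id']}: parent {parent_id} is not defined")
                continue
            edges.append({
                "from": parent_id, "type": "PARENT_OF", "to": child["id"],
                "valid_from": child["born"],        # the child's birth
                "valid_until": parent.get("died"),  # the parent's death, if known
            })
            born = date_key(child["born"])
            covered = date_key(parent["born"]) <= born
            if "died" in parent:
                covered = covered and born <= date_key(parent["died"])
            if not covered:
                problems.append(f"{child['id']}: born outside the lifespan of {parent_id}")
    return edges, problems


members = [
    {"id": "theron_ashveil", "born": "1st_age.year_412", "died": "2nd_age.year_87"},
    {"id": "maric_vyr", "born": "2nd_age.year_70", "died": "3rd_age.year_15",
     "parents": ["theron_ashveil"]},
    # Invented entry: born long after the listed parent died.
    {"id": "late_child", "born": "3rd_age.year_300", "parents": ["theron_ashveil"]},
]
edges, problems = parent_edges(members)
```

The edge is still emitted for the anachronistic child so the problem can be reported against a concrete edge; whether the real extractor writes or withholds such edges is not specified here.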

gazetteer.yaml — locations, regions, geography

locations:
  - id: "thornwall_keep"
    name: "Thornwall Keep"
    type: "fortress"
    part_of: "valdorn"
    culture_of: "valdorni"
    coordinates: {x: 1240, y: 870}
    description: "..."
    events_held: ["coronation_of_aelric"]

regions:
  - id: "northern_reaches"
    name: "Northern Reaches"
    parent_region: null
    contains: ["valdorn", "mardsville", "frosthollow"]

The gazetteer extractor:

  1. Creates Location and Region nodes.
  2. Writes PART_OF edges.
  3. Writes CULTURE_OF edges.
  4. Materializes named events as OCCURRED_AT edges.

bestiary.yaml — creatures

creatures:
  - id: "pale_worm"
    name: "The Pale Worm"
    species: "worm"
    alignment: "chaotic_evil"
    habitat: "frosthollow"
    first_appeared: "3rd_age.year_120"
    description: "A massive frost-worm that haunts the Frosthollow tundra."
    defeated_by: ["aldric_raventhorne"]  # creates DEFEATED edges

magic_system.yaml — magic taxonomy

systems:
  - id: "the_weave"
    name: "The Weave"
    source: "natural_law"
    practitioners: ["valdorni_mage", "sisterhood_of_silver"]
    description: "..."
    
  - id: "divine_miracles"
    name: "Divine Miracles"
    source: "aelar_the_patient"
    practitioners: ["cleric_of_aelar"]
    description: "..."

spells:
  - id: "emberlance"
    name: "Emberlance"
    system: "the_weave"
    level: 3
    school: "evocation"
    practitioners: ["valdorni_mage"]

culture.yaml — cultures, languages, deities

cultures:
  - id: "valdorni"
    name: "Valdorni"
    language: "old_valdorni"
    homeland: "valdorn"
    description: "..."

languages:
  - id: "old_valdorni"
    name: "Old Valdorni"
    script: "runic"
    speakers: ["valdorni", "house_vyr"]

deities:
  - id: "aelar_the_patient"
    name: "Aelar the Patient"
    domain: ["healing", "patience", "winter"]
    alignment: "neutral_good"
    symbol: "a single open eye"
    worshipped_by: ["valdorni", "sisterhood_of_silver"]

Path 3: Dialogue logs (in-fiction)

For when a player/NPC says something in-character that should be recorded as lore:

POST /ingest/dialogue
{
  "speaker": "aldric_raventhorne",
  "text": "I will not rest until the Crimson Pact is broken.",
  "in_fiction_date": "3rd_age.year_345",
  "location": "thornwall_keep"
}

This creates a Message (or a special Dialogue node) and links the speaker to the location at the time. Useful for building up first-person perspective in narrate_arc.
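
For illustration, a client could build that request with the standard library as sketched below. The host and port are assumptions, `build_dialogue_request` is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_dialogue_request(base_url: str, speaker: str, text: str,
                           in_fiction_date: str, location: str) -> urllib.request.Request:
    """Build (but do not send) a POST /ingest/dialogue request."""
    body = json.dumps({
        "speaker": speaker,
        "text": text,
        "in_fiction_date": in_fiction_date,
        "location": location,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ingest/dialogue",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_dialogue_request(
    "http://localhost:8000",  # assumed host and port
    speaker="aldric_raventhorne",
    text="I will not rest until the Crimson Pact is broken.",
    in_fiction_date="3rd_age.year_345",
    location="thornwall_keep",
)
# urllib.request.urlopen(req) would send it to a running server.
```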

Why YAML, not JSON or TOML

YAML wins because:

  • Comments. Every world-builder annotates their lore. JSON forces them out.
  • References. parents: ["theron_ashveil"] is readable; {"parents": ["theron_ashveil"]} is noise.
  • Multi-line strings. description: | blocks handle prose naturally.
  • Standard tooling. PyYAML is mature and widely used. It is a third-party package rather than part of the Python standard library, so it is one small dependency to pin.

The downside is YAML's gotchas (the Norway problem, tab/space sensitivity). The extractor is strict and rejects ambiguous inputs with line numbers: better to fail loudly than to silently parse an unquoted NO (the country code for Norway) as the boolean false.
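
One way such a strictness check could work is a pre-parse lint over the raw text, sketched below with the standard library only. `find_ambiguous_lines` is a hypothetical name and this is not the extractor's actual implementation; it only inspects simple `key: value` and `- item` lines, so flow collections such as `[NO, SE]` are out of scope for the sketch.

```python
# Unquoted words that YAML 1.1 loaders resolve to booleans instead of strings.
AMBIGUOUS = {"yes", "no", "on", "off"}


def find_ambiguous_lines(text: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for inputs a strict extractor should reject."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((lineno, "tab in indentation"))
        body = line.strip()
        if not body or body.startswith("#"):
            continue
        if body.startswith("- "):
            body = body[2:].strip()
        key, sep, value = body.partition(":")
        value = value.split(" #")[0]  # drop a trailing comment, if any
        candidates = [key, value] if sep else [body]
        for token in candidates:
            token = token.strip()
            if token.lower() in AMBIGUOUS:
                problems.append((lineno, f"unquoted {token!r} would load as a boolean"))
    return problems


sample = 'country: NO  # Norway\nname: "NO"\n\tbad: 1\nhomeland: valdorn\n'
problems = find_ambiguous_lines(sample)
```

Quoted values such as `"NO"` pass because the quotes are part of the token being compared, which matches how a YAML loader treats them as plain strings.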

The structured-ingestor (Lore Engine on Cognee)

The structured-YAML parser lives in the Lore Engine extension as a Python module:

# lore_engine/parsers/timeline.py
# Validates the timeline.yaml schema
# Emits MERGE (e:Era {slug, parent_era, start, end}) and similar
# Calls Cognee's graph adapter to execute the Cypher
# Tags the LoreSource with source_type: timeline

# lore_engine/parsers/family_tree.py
# Same pattern, different schema

# ... one parser per YAML type

The structured path is fast and deterministic — typical ingest is <500ms per YAML file, no GPU, no LLM latency. The parser is a thin wrapper over Cognee's graph adapter; the schema validation is strict and rejects ambiguous inputs with line numbers.
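
As a sketch of what the timeline parser's Era write might look like, the function below builds a parameterized Cypher statement from a parsed document. `era_merge_statement` is a hypothetical name, and the call into Cognee's graph adapter is left out because its interface is not described here. The sketch merges on `slug` alone and sets the remaining properties afterwards, on the assumption that `parent_era` can be null (as in the example timeline) and that Neo4j does not accept null property values inside a MERGE pattern.

```python
def era_merge_statement(doc: dict) -> tuple[str, dict]:
    """Return a parameterized Cypher statement and its parameters for one Era."""
    # MERGE on the identifying key only; other properties are applied with SET.
    query = "MERGE (e:Era {slug: $slug}) SET e += $props"
    params = {
        "slug": doc["era"],
        "props": {
            "parent_era": doc.get("parent_era"),
            "start": doc["start"],
            "end": doc["end"],
        },
    }
    return query, params


query, params = era_merge_statement(
    {"era": "3rd_age", "parent_era": None, "start": -100, "end": 600}
)
```

Passing values as parameters rather than formatting them into the query string avoids quoting bugs and keeps re-ingesting the same file idempotent.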

What this means for the LLM

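The reasoning LLM never has to ingest. It only reads. (The extraction LLM in Path 1 runs inside the ingestion pipeline, not in the reasoning harness.) World-builders ingest via: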

  • cognee.add() + cognee.cognify() (prose — markdown, dialogue)
  • POST /ingest/structured (YAML — new)
  • POST /ingest/dialogue (JSON — new)
  • tea add-source <file> (CLI wrapper — new, optional)
  • Direct MCP tool calls (add_lore_source, add_entity, add_relation)

The LLM is told (in the reasoning harness): "You do not write lore. You do not modify the graph. You query it. If you believe a fact is missing, you say so to the user; the world-builder adds it."

Risk: YAML drift from prose

A common failure mode: the prose says "Aldric's father was Theron" but the family_tree.yaml has his father as "Maric." The engine flags this as a contradiction. The world-builder picks one. The LLM is told never to resolve the contradiction itself.

Mitigation: the consistency engine treats prose-derived lineage as confidence: 0.6 and YAML-derived lineage as confidence: 1.0 by default. When they conflict, the YAML value is used by default and a Contradiction node is created with the prose source cited, so the world-builder can review it and pick one.

Roadmap note

The structured ingestion is the most leveraged thing in this design. It is also the part that requires the most world-builder discipline. We can't enforce YAML authoring; we can make it easy and rewarding (validation, preview, auto-completion in a future UI).