06 — Ingestion Pipelines
The ingestion layer is where the world enters the engine. There are two fundamentally different kinds of input:
- Free prose: chronicles, novels, short stories, dialogue logs, Discord messages. The engine reads the text, extracts entities and relations, and embeds chunks. On Cognee, this is the cognee.add() + cognee.cognify() pipeline, with a custom extraction prompt that emits the Lore Engine's 36 typed labels.
- Structured lore: timelines, family trees, gazetteers, bestiaries, magic-system descriptions, written in YAML by the world-builder. The Lore Engine's structured parser materializes typed graph edges directly. No LLM is required for these.
The structured path is the one that makes the engine historically accurate. Prose extraction is fuzzy by nature; YAML ingestion is exact. Both paths exist; structured is preferred for anything that becomes a load-bearing fact (lineage, era boundaries, faction rules).
Ingestion paths overview
                       ┌─────────────────────────────────┐
                       │     World-Builder Authoring     │
                       │   (markdown, YAML, dialogue)    │
                       └────────────────┬────────────────┘
                                        │
        ┌───────────────────────────────┼───────────────────────────────┐
        │                               │                               │
        ▼                               ▼                               ▼
prose path                      timeline.yaml                   family_tree.yaml
cognee.add()                    Lore Engine YAML parser         Lore Engine YAML parser
cognee.cognify()                (no LLM, exact)                 (no LLM, exact)
        │                               │                               │
        ▼                               ▼                               ▼
Cognee chunks + vectors         Date, Era, Event nodes          Person, Lineage nodes
Typed triples                   RULES, OCCURRED_DURING          PARENT_OF edges
(Lore Engine extraction         PARTICIPATED_IN edges           EXISTED_DURING edges
prompt emits 36 labels)                 │                               │
        │                               │                               │
        └───────────────────────────────┴───────────────────────────────┘
                                        │
                                        ▼
                              Cognee-managed graph
                              (Neo4j, per ADR 0008)
                                        │
                                        ▼
                            Consistency pipeline runs
                             (live + nightly batch)
Path 1: Free prose (via Cognee)
The prose path goes through Cognee's standard add + cognify pipeline. The Lore Engine registers a custom extraction prompt with Cognee; the prompt tells the LLM to emit the Lore Engine's 36 typed labels and the ~70 edge types instead of Cognee's default Entity/DataPoint types.
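# World-builder's ingestion script
import asyncio

import cognee


async def main() -> None:
    await cognee.add("chapters/aldric_origin.md")  # raw markdown
    await cognee.cognify()                         # extract + embed + index


# `await` is only valid inside a coroutine, so drive it with an event loop
asyncio.run(main())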
The pipeline:
- Cognee watcher detects a new file (or receives a cognee.add() call).
- Cognee ingestion worker chunks the text (512-token windows, 64-token overlap), generates embeddings, writes Chunk and Dataset nodes.
- Lore Engine extraction prompt runs on each chunk. The LLM is told to emit triples using the Lore Engine's typed ontology. The response is parsed and validated against the schema.
- Entity resolution matches extracted entity names against known canonical names (Cognee's loadKnownEntities helper, with a lore_engine namespace prefix).
- Cypher writer materializes entities and relations into the graph using Cognee's graph adapter, applying the :FEATURES edge from the source.
- Contradiction detection runs on the new edges (see 04-consistency.md).
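The chunking step above (512-token windows, 64-token overlap) can be illustrated with a small sliding-window function. This is a sketch only: the function name is hypothetical, plain list items stand in for whatever tokenizer Cognee actually uses, and none of this is Cognee's own code.

```python
from typing import List, Sequence


def sliding_windows(
    tokens: Sequence[str], size: int = 512, overlap: int = 64
) -> List[List[str]]:
    """Split a token sequence into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new window advances
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


# Stand-in tokens; a real pipeline would tokenize the markdown first.
tokens = [f"t{i}" for i in range(1000)]
chunks = sliding_windows(tokens)
```

With 1000 tokens the windows start at offsets 0, 448 and 896, so each chunk repeats the last 64 tokens of its predecessor and the final chunk is shorter than the rest.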
Extraction prompt (Lore Engine extension to Cognee)
Cognee's default extraction prompt emits Entity and DataPoint types. The Lore Engine replaces this with a prompt that teaches the LLM the Lore Engine's 36 typed labels and the ~70 edge types:
You are extracting structured information from a passage of high-fantasy fiction
for the Lore Engine knowledge graph.
Emit a list of triples. Each triple is (subject, relation, object).
Subject and object must be one of the Lore Engine typed labels:
Person, Faction, Location, Item, Era, Date, Lineage, Culture, Deity,
Language, MagicSystem, Title, Region, Material, Creature, Spell,
Plane, Setting, NPC, PC, Human, DomainEntity.
Relation must be one of the Lore Engine typed edge types:
RULED, PARENT_OF, MEMBER_OF, LOCATED_IN, OCCURRED_AT, OCCURRED_DURING,
PARTICIPATED_IN, ALLIED_WITH, ENEMY_OF, POSSESSES, SPOUSE_OF, WORSHIPS,
PRACTICES, SPEAKS, BELONGS_TO, CLAIMS_TITLE, CAUSED, PRECEDED,
CONCURRENT_WITH, WITNESSED, LOGGED_IN, GIVEN_BY, TARGETS, PAID_BY,
PART_OF, ... (full list in 01-ontology.md)
For Event nodes, the temporal_hint field is REQUIRED. Format: {era}.{year}[.month_N][.day_N].
For Person nodes, birth and death years are STRONGLY PREFERRED in temporal_hint.
For Faction nodes, founded and dissolved years are STRONGLY PREFERRED.
If the passage describes a person, also extract their MEMBER_OF, WORSHIPS,
SPEAKS, BELONGS_TO, POSSESSES if explicitly stated. Prefer specific
faction/religion/culture names over generic descriptions.
If a fact is too vague to assign a time, emit temporal_hint: "unknown"
and set source_confidence: 0.5.
The Cognee pipeline runs this prompt per chunk and parses the result. The Lore Engine validates the parsed triples against its typed ontology (rejecting triples that reference unknown labels) before writing to the graph.
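The validation step might look roughly like the sketch below. The Triple shape and the validate function are hypothetical, the label set is copied from the prompt, and the relation set contains only the edge types the prompt spells out (the full list lives in 01-ontology.md). Checking the relation as well as the labels is an assumption; the text above only mentions rejecting unknown labels.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Labels exactly as listed in the extraction prompt.
LABELS = frozenset({
    "Person", "Faction", "Location", "Item", "Era", "Date", "Lineage",
    "Culture", "Deity", "Language", "MagicSystem", "Title", "Region",
    "Material", "Creature", "Spell", "Plane", "Setting", "NPC", "PC",
    "Human", "DomainEntity",
})

# Only the edge types named in the prompt; the real list is longer.
RELATIONS = frozenset({
    "RULED", "PARENT_OF", "MEMBER_OF", "LOCATED_IN", "OCCURRED_AT",
    "OCCURRED_DURING", "PARTICIPATED_IN", "ALLIED_WITH", "ENEMY_OF",
    "POSSESSES", "SPOUSE_OF", "WORSHIPS", "PRACTICES", "SPEAKS",
    "BELONGS_TO", "CLAIMS_TITLE", "CAUSED", "PRECEDED", "CONCURRENT_WITH",
    "WITNESSED", "LOGGED_IN", "GIVEN_BY", "TARGETS", "PAID_BY", "PART_OF",
})


@dataclass(frozen=True)
class Triple:  # hypothetical shape for one parsed triple
    subject: str
    subject_label: str
    relation: str
    obj: str
    obj_label: str


def validate(
    triples: Iterable[Triple],
) -> Tuple[List[Triple], List[Tuple[Triple, str]]]:
    """Split triples into accepted ones and (triple, reason) rejections."""
    accepted: List[Triple] = []
    rejected: List[Tuple[Triple, str]] = []
    for t in triples:
        if t.subject_label not in LABELS:
            rejected.append((t, f"unknown subject label {t.subject_label!r}"))
        elif t.obj_label not in LABELS:
            rejected.append((t, f"unknown object label {t.obj_label!r}"))
        elif t.relation not in RELATIONS:
            rejected.append((t, f"unknown relation {t.relation!r}"))
        else:
            accepted.append(t)
    return accepted, rejected


good = Triple("Aldric", "Person", "MEMBER_OF", "House Vyr", "Faction")
bad = Triple("Aldric", "Wizard", "MEMBER_OF", "House Vyr", "Faction")
accepted, rejected = validate([good, bad])
```

Keeping the rejection reason next to each dropped triple makes it possible to log what the LLM got wrong instead of losing it silently.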
What prose is good for
- Color, character voice, cultural texture.
- The kind of information that doesn't have a clean structure: "Aldric was known for his sharp wit and his hatred of the Crimson Pact."
- In-fiction dialogue logs.
What prose is bad for
- Lineage. "Aldric was the son of Maric, who was the son of Theron..." extracted by an LLM is correct maybe 80% of the time, and silent errors are catastrophic. Use a family_tree.yaml for lineage. Always.
- Era boundaries. "The Third Age began in 1 TA...": the LLM will sometimes parse this as 1st_age or first_age or third_age_1. Use a timeline.yaml.
- Magic system taxonomy. Free text describing spells is fine; the spell-to-system mapping is a magic_system.yaml.
Path 2: Structured YAML ingestion
This is the new pipeline. Each YAML type has a dedicated extractor that parses the structure and writes typed Cypher directly — no LLM in the loop.
timeline.yaml — era boundaries + named events
era: "3rd_age"
parent_era: null
start: -100
end: 600
description: "The Third Age. The age of iron crowns and broken gods."
events:
  - slug: "battle_of_black_spire"
    label: "Battle of Black Spire"
    in_fiction_date: "17 Hearthmoon, 340 TA"
    era: "3rd_age.age_of_iron"
    year: 340
    month: 3
    day: 17
    location: "black_spire_pass"
    participants: ["house_vyr", "crimson_pact"]
    description: "House Vyr's decisive victory over the Crimson Pact."
    significance: "End of the Border Wars."
The timeline extractor:
- Creates/updates the Era node.
- For each event, creates a Date node, an Event node, and OCCURRED_AT + OCCURRED_DURING + PARTICIPATED_IN edges.
- Sets valid_from/valid_until on each Event based on its date.
- Tags the LoreSource as source_type: timeline.
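As a rough illustration of those steps, the sketch below turns an already-parsed timeline document into abstract node and edge tuples that a Cypher writer could then materialize. Everything here is an assumption made for illustration: the function names, the tuple shapes, the Date key format (borrowed from the temporal_hint pattern), OCCURRED_AT pointing at the Date node, and OCCURRED_DURING pointing at the file's top-level era.

```python
from typing import Any, Dict, List, Tuple

Node = Tuple[str, str, Dict[str, Any]]  # (label, key, properties)
Edge = Tuple[str, str, str]             # (source key, relation, target key)


def date_key(ev: Dict[str, Any]) -> str:
    """Build an {era}.{year}[.month_N][.day_N] key from one event entry."""
    parts = [ev["era"], str(ev["year"])]
    if "month" in ev:
        parts.append(f"month_{ev['month']}")
        if "day" in ev:
            parts.append(f"day_{ev['day']}")
    return ".".join(parts)


def timeline_ops(doc: Dict[str, Any]) -> Tuple[List[Node], List[Edge]]:
    """Derive the nodes and edges the timeline extractor would write."""
    era = doc["era"]
    nodes: List[Node] = [(
        "Era", era,
        {"parent_era": doc.get("parent_era"),
         "start": doc["start"], "end": doc["end"]},
    )]
    edges: List[Edge] = []
    for ev in doc.get("events", []):
        when = date_key(ev)
        nodes.append(("Date", when, {"year": ev["year"]}))
        # A point-in-time event is valid from and until its own date.
        nodes.append(("Event", ev["slug"],
                      {"label": ev["label"],
                       "valid_from": when, "valid_until": when}))
        edges.append((ev["slug"], "OCCURRED_AT", when))
        edges.append((ev["slug"], "OCCURRED_DURING", era))
        for who in ev.get("participants", []):
            edges.append((who, "PARTICIPATED_IN", ev["slug"]))
    return nodes, edges


doc = {
    "era": "3rd_age", "parent_era": None, "start": -100, "end": 600,
    "events": [{
        "slug": "battle_of_black_spire", "label": "Battle of Black Spire",
        "era": "3rd_age.age_of_iron", "year": 340, "month": 3, "day": 17,
        "participants": ["house_vyr", "crimson_pact"],
    }],
}
nodes, edges = timeline_ops(doc)
```

Separating "what to write" from "how to write it" keeps the extractor deterministic and easy to test without a running graph database.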
family_tree.yaml — direct lineage
founding_ancestor: "theron_ashveil"
lineage: "house_vyr_bloodline"
description: "The bloodline of House Vyr, from Theron Ashveil to the present."
members:
  - id: "theron_ashveil"
    name: "Theron Ashveil"
    born: "1st_age.year_412"
    died: "2nd_age.year_87"
    spouse_of: ["mara_ashveil"]
  - id: "maric_vyr"
    name: "Maric Vyr"
    born: "2nd_age.year_70"
    died: "3rd_age.year_15"
    parents: ["theron_ashveil", "mara_ashveil"]
  - id: "aldric_raventhorne"
    name: "Aldric Raventhorne"
    born: "3rd_age.year_300"
    died: "3rd_age.year_360"
    parents: ["cael_vyr", "yssa_raventhorne"]
    spouse_of: ["elara_raventhorne"]
The family-tree extractor:
- Creates/updates Person nodes.
- Creates/updates the Lineage node with founding_ancestor.
- Writes PARENT_OF edges (with valid_from set to the child's birth, valid_until set to the parent's death).
- Writes MEMBER_OF edges from each person to the lineage.
- Runs anachronism check on every node: do the parents' lifespans cover the child's birth?
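The anachronism check in the last step can be sketched as a comparison of (era, year) pairs. The era ordering, the assumption that years count upward within an era, and all function names below are illustrative guesses, not the engine's actual implementation. Parents that are not defined in the same file are skipped here rather than reported.

```python
import re
from typing import Dict, List, Tuple

ERA_ORDER = ["1st_age", "2nd_age", "3rd_age"]  # assumed chronological order
_DATE = re.compile(r"^(?P<era>\w+)\.year_(?P<year>-?\d+)$")


def parse_date(value: str) -> Tuple[int, int]:
    """Turn '2nd_age.year_70' into a sortable (era index, year) pair."""
    m = _DATE.match(value)
    if m is None or m["era"] not in ERA_ORDER:
        raise ValueError(f"unrecognised date: {value!r}")
    return ERA_ORDER.index(m["era"]), int(m["year"])


def anachronisms(members: List[Dict]) -> List[str]:
    """List children born outside the lifespan of a parent in the same file."""
    by_id = {m["id"]: m for m in members}
    problems: List[str] = []
    for child in members:
        born = parse_date(child["born"])
        for pid in child.get("parents", []):
            parent = by_id.get(pid)
            if parent is None:
                continue  # parent not defined in this file; nothing to compare
            alive_from = parse_date(parent["born"])
            alive_until = parse_date(parent["died"])
            if not alive_from <= born <= alive_until:
                problems.append(f"{child['id']}: born outside lifespan of {pid}")
    return problems


members = [
    {"id": "theron_ashveil", "born": "1st_age.year_412", "died": "2nd_age.year_87"},
    {"id": "maric_vyr", "born": "2nd_age.year_70", "died": "3rd_age.year_15",
     "parents": ["theron_ashveil", "mara_ashveil"]},
]
problems = anachronisms(members)
```

For the two members taken from the example file the list is empty, because Maric's birth falls inside Theron's lifespan.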
gazetteer.yaml — locations, regions, geography
locations:
  - id: "thornwall_keep"
    name: "Thornwall Keep"
    type: "fortress"
    part_of: "valdorn"
    culture_of: "valdorni"
    coordinates: {x: 1240, y: 870}
    description: "..."
    events_held: ["coronation_of_aelric"]
regions:
  - id: "northern_reaches"
    name: "Northern Reaches"
    parent_region: null
    contains: ["valdorn", "mardsville", "frosthollow"]
The gazetteer extractor:
- Creates Location and Region nodes.
- Writes PART_OF edges.
- Writes CULTURE_OF edges.
- Materializes named events as OCCURRED_AT edges.
bestiary.yaml — creatures
creatures:
  - id: "pale_worm"
    name: "The Pale Worm"
    species: "worm"
    alignment: "chaotic_evil"
    habitat: "frosthollow"
    first_appeared: "3rd_age.year_120"
    description: "A massive frost-worm that haunts the Frosthollow tundra."
    defeated_by: ["aldric_raventhorne"]  # creates DEFEATED edges
magic_system.yaml — magic taxonomy
systems:
  - id: "the_weave"
    name: "The Weave"
    source: "natural_law"
    practitioners: ["valdorni_mage", "sisterhood_of_silver"]
    description: "..."
  - id: "divine_miracles"
    name: "Divine Miracles"
    source: "aelar_the_patient"
    practitioners: ["cleric_of_aelar"]
    description: "..."
spells:
  - id: "emberlance"
    name: "Emberlance"
    system: "the_weave"
    level: 3
    school: "evocation"
    practitioners: ["valdorni_mage"]
culture.yaml — cultures, languages, deities
cultures:
  - id: "valdorni"
    name: "Valdorni"
    language: "old_valdorni"
    homeland: "valdorn"
    description: "..."
languages:
  - id: "old_valdorni"
    name: "Old Valdorni"
    script: "runic"
    speakers: ["valdorni", "house_vyr"]
deities:
  - id: "aelar_the_patient"
    name: "Aelar the Patient"
    domain: ["healing", "patience", "winter"]
    alignment: "neutral_good"
    symbol: "a single open eye"
    worshipped_by: ["valdorni", "sisterhood_of_silver"]
Path 3: Dialogue logs (in-fiction)
For when a player/NPC says something in-character that should be recorded as lore:
POST /ingest/dialogue
{
  "speaker": "aldric_raventhorne",
  "text": "I will not rest until the Crimson Pact is broken.",
  "in_fiction_date": "3rd_age.year_345",
  "location": "thornwall_keep"
}
This creates a Message (or a special Dialogue node) and links the speaker to the location at the time. Useful for building up first-person perspective in narrate_arc.
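A client for this endpoint could be sketched with the standard library alone. The base URL, the helper name, and the rule that all four fields are mandatory are assumptions for illustration; only the path and the field names come from the request shown above. The function builds the request without sending it.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical service address


def dialogue_request(
    speaker: str,
    text: str,
    in_fiction_date: str,
    location: str,
    base_url: str = BASE_URL,
) -> Request:
    """Build (but do not send) a POST /ingest/dialogue request."""
    payload = {
        "speaker": speaker,
        "text": text,
        "in_fiction_date": in_fiction_date,
        "location": location,
    }
    # Assumption: every field is required, so reject empty values up front.
    missing = [name for name, value in payload.items() if not value]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return Request(
        f"{base_url}/ingest/dialogue",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = dialogue_request(
    "aldric_raventhorne",
    "I will not rest until the Crimson Pact is broken.",
    "3rd_age.year_345",
    "thornwall_keep",
)
```

Passing req to urllib.request.urlopen would submit it to a running service.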
Why YAML, not JSON or TOML
YAML wins because:
- Comments. Every world-builder annotates their lore. JSON forces them out.
- References. parents: ["theron_ashveil"] is readable; {"parents": ["theron_ashveil"]} is noise.
- Multi-line strings. description: | blocks handle prose naturally.
- Standard tooling. pyyaml is already present in most Python environments, although it is a third-party package rather than part of the standard library.
The downside is YAML's gotchas (the Norway problem, tab/space sensitivity). The extractor is strict and rejects ambiguous inputs with line numbers: better to fail loudly than to silently parse an unquoted NO (meant as the country code for Norway) as the boolean false.
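One way to get line-numbered rejections is a lint pass over the raw text before it reaches the YAML loader. The sketch below is an illustration, not the extractor's actual code: the function name is hypothetical, it catches only tab indentation and the unquoted YAML 1.1 boolean words yes/no/on/off, and its simple regex can produce false positives on unusual lines.

```python
import re
from typing import List

# YAML 1.1 resolves these unquoted words to booleans.
AMBIGUOUS = {"yes", "no", "on", "off"}

# A bare word at the end of a line, after "key: " or "- ",
# optionally followed by a trailing comment.
_VALUE = re.compile(r"(?::|-)\s+([A-Za-z]+)\s*(?:#.*)?$")


def lint_yaml(text: str) -> List[str]:
    """Return human-readable problems, each prefixed with its line number."""
    problems: List[str] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {lineno}: tab used for indentation")
        match = _VALUE.search(line)
        if match and match.group(1).lower() in AMBIGUOUS:
            problems.append(
                f"line {lineno}: unquoted {match.group(1)!r} would load "
                "as a boolean; quote it"
            )
    return problems


sample = 'country: NO\nname: "NO"\n\tcode: x\nflags:\n  - off\n'
problems = lint_yaml(sample)
```

Quoted values such as "NO" pass untouched, so the fix the world-builder is pushed toward is simply adding quotes.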
The structured-ingestor (Lore Engine on Cognee)
The structured-YAML parser lives in the Lore Engine extension as a Python module:
# lore_engine/parsers/timeline.py
# Validates the timeline.yaml schema
# Emits MERGE (e:Era {slug, parent_era, start, end}) and similar
# Calls Cognee's graph adapter to execute the Cypher
# Tags the LoreSource with source_type: timeline
# lore_engine/parsers/family_tree.py
# Same pattern, different schema
# ... one parser per YAML type
The structured path is fast and deterministic — typical ingest is <500ms per YAML file, no GPU, no LLM latency. The parser is a thin wrapper over Cognee's graph adapter; the schema validation is strict and rejects ambiguous inputs with line numbers.
What this means for the LLM
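The reasoning LLM never has to ingest. It only reads. (The extraction LLM on the prose path runs inside the ingestion pipeline, not in the reasoning harness.) World-builders ingest via: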
- cognee.add() + cognee.cognify() (prose: markdown, dialogue)
- POST /ingest/structured (YAML, new)
- POST /ingest/dialogue (JSON, new)
- tea add-source <file> (CLI wrapper, new, optional)
- Direct MCP tool calls (add_lore_source, add_entity, add_relation)
The LLM is told (in the reasoning harness): "You do not write lore. You do not modify the graph. You query it. If you believe a fact is missing, you say so to the user; the world-builder adds it."
Risk: YAML drift from prose
A common failure mode: the prose says "Aldric's father was Theron" but the family_tree.yaml has his father as "Maric." The engine flags this as a contradiction. The world-builder picks one. The LLM is told never to resolve the contradiction itself.
Mitigation: the consistency engine treats prose-derived lineage as confidence: 0.6 and YAML-derived lineage as confidence: 1.0 by default. When they conflict, the YAML wins and a Contradiction node is created with the prose source cited.
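The default-confidence rule can be sketched as follows. The Claim and Contradiction shapes, the "key/value" representation of a fact, and the resolve function are hypothetical; only the 1.0 and 0.6 defaults and the "YAML wins, Contradiction is recorded" behaviour come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Defaults stated above: YAML-derived lineage 1.0, prose-derived 0.6.
DEFAULT_CONFIDENCE = {"yaml": 1.0, "prose": 0.6}


@dataclass(frozen=True)
class Claim:  # hypothetical shape for one sourced fact
    key: str         # what the fact is about, e.g. "aldric_raventhorne.father"
    value: str       # the asserted value, e.g. "maric_vyr"
    source: str      # "yaml" or "prose"
    source_ref: str  # file or chunk the claim came from


@dataclass(frozen=True)
class Contradiction:  # stands in for the Contradiction node
    kept: Claim
    rejected: Claim


def resolve(a: Claim, b: Claim) -> Tuple[Claim, Optional[Contradiction]]:
    """Pick the higher-confidence claim and record the loser, if they differ."""
    if a.key != b.key:
        raise ValueError("claims are about different facts")
    if a.value == b.value:
        return a, None  # the sources agree; nothing to flag
    # sorted() is stable, so on a confidence tie the first argument is kept.
    winner, loser = sorted(
        (a, b), key=lambda c: DEFAULT_CONFIDENCE[c.source], reverse=True
    )
    return winner, Contradiction(kept=winner, rejected=loser)


prose = Claim("aldric_raventhorne.father", "theron_ashveil",
              "prose", "chapters/aldric_origin.md")
structured = Claim("aldric_raventhorne.father", "maric_vyr",
                   "yaml", "family_tree.yaml")
fact, contradiction = resolve(prose, structured)
```

The Contradiction record keeps the rejected prose claim and its source reference, so the world-builder can still see what was overridden and choose differently.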
Roadmap note
Structured ingestion is the highest-leverage part of this design. It is also the part that requires the most world-builder discipline. We can't enforce YAML authoring; we can make it easy and rewarding (validation, preview, auto-completion in a future UI).