lore-engine/docs/10-critique.md
Kaysser Kayyali c3fa2f7ce4 docs: ELI5 'how it works' page + soften false-precise '36 labels' arithmetic
- Add docs/how-it-works.html: self-contained explainer with inline SVG
  diagrams, ELI5 tone. Covers the big idea, plain-AI vs Lore Engine,
  the cast (Cognee/Neo4j/Minimax-M3/45 tools), question flow, why time
  matters, disputed edges + confidence, how lore gets in, the
  consistency safety net, and how it differs from a wiki.
- Soften the false-precise '36 labels' bucket arithmetic to honest
  'roughly 36' across 00-overview, 11-extensibility, 10-critique
  (sub-arithmetic didn't reconcile across docs).

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-17 23:40:50 -04:00


10 — Critique

Pressure-test of the design. What could break, what's weak, where this could fail. I tried to find the holes; here they are, ranked by severity.

Severity 1 (blockers)

S1.1 — The current token is a global mutable

The current reserved time token resolves against a single :Now config node. This is a single point of failure and a synchronization nightmare in multi-user scenarios.

  • If two sessions are running in different in-fiction times (e.g., a flashback scene and a present-day scene), they cannot both use current correctly.
  • If the world-builder forgets to update :Now after a time skip, every current query is wrong.

Fix: every current query must accept an optional world_time parameter that overrides the :Now node. The LLM is told to use it for flashback scenes. The MCP server tracks per-session world_time in the active context.

Status: acknowledged, fix in scope of Phase 2.
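The override described in the fix can be sketched as below. This is a minimal illustration, not the engine's code: `SessionContext`, `resolve_time`, and the precedence order (explicit argument, then session value, then :Now) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

CURRENT_TOKEN = "current"  # the reserved time token discussed above


@dataclass
class SessionContext:
    """Hypothetical per-session state tracked by the MCP server."""
    world_time: Optional[str] = None  # None means "no override for this session"


def resolve_time(token: str, now_node_time: str, session: SessionContext,
                 world_time: Optional[str] = None) -> str:
    """Resolve a time token to a concrete in-fiction time.

    Assumed precedence: explicit world_time argument, then the session's
    world_time, then the value stored on the global :Now node.
    """
    if token != CURRENT_TOKEN:
        return token  # already a concrete time, nothing to resolve
    if world_time is not None:
        return world_time
    if session.world_time is not None:
        return session.world_time
    return now_node_time


# Two sessions running at different in-fiction times against the same :Now value.
flashback = SessionContext(world_time="2nd_age.310")
present = SessionContext()
```

With this shape, a flashback session and a present-day session can both use current without touching the shared :Now node.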

S1.2 — lore_verified: false is a boolean, but reality is a spectrum

A fact from one provisional source should be weighted differently from a fact from five contradictory sources. The boolean is too coarse.

Fix: add a source_confidence float on every node and every edge, weighted by source document confidence, source agreement, and rule-engine consistency results. The LLM sees the float and phrases its answer accordingly ("high confidence" / "reported by one source, unconfirmed").

Status: partially fixed in 01-ontology.md (we have source_confidence as a property). The lore_verified boolean stays as a coarse filter; the float is for nuanced reasoning. Need to formalize the weighting formula.

S1.3 — Entity resolution at scale

loadKnownEntities in the existing extractor loads 100 names and injects them into the prompt. At 10,000 entities, this won't fit. At 100,000, the prompt is unusable.

Fix: the structured ingestion path bypasses this entirely (YAML is exact). For the prose path, we need a different strategy:

  • Pre-compute embeddings of entity names, retrieve top-K by similarity to the chunk being extracted.
  • Or: extract first, resolve second via a separate entity-linking model (cheaper LLM call).
  • Or: accept that the prose path doesn't scale beyond ~10K entities and force the world-builder to use YAML for the long tail.

Status: known, fix required before Phase 2 ships at scale. Mitigation: structured YAML is exact; prose is fuzzy; the design is robust as long as the high-stakes data goes through structured paths.
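The first option (top-K retrieval by embedding similarity) might look roughly like this. It is a sketch under assumptions: the vectors are toy values, `top_k_entities` is a hypothetical helper, and a real deployment would use a vector index rather than a linear scan.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_entities(chunk_vec: Sequence[float],
                   entity_vecs: Dict[str, Sequence[float]],
                   k: int = 100) -> List[str]:
    """Return the k entity names whose embeddings are closest to the chunk."""
    ranked = sorted(entity_vecs.items(),
                    key=lambda item: cosine(chunk_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings, for illustration only.
entity_vecs = {
    "Aelar": [1.0, 0.0, 0.0],
    "Vyrs": [0.0, 1.0, 0.0],
    "Crimson Throne": [0.9, 0.1, 0.0],
}
candidates = top_k_entities([1.0, 0.0, 0.0], entity_vecs, k=2)
```

Only the returned candidates would be injected into the extraction prompt, so prompt size depends on k rather than on the total entity count.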

Severity 2 (design risks)

S2.1 — Time model precision vs. lore granularity mismatch

We chose {era}.{year} as the canonical format. Most prose says things like "in the late Third Age" — no year. If we force precision, the prose extractor either guesses (hallucination risk) or stamps everything as 3rd_age (lossy).

Fix: the prose extractor is told to use the least specific valid time the source supports. "Late Third Age" → 3rd_age with a precision: low flag on the edge. The LLM is told that low-precision edges are not safe to use for narrow time-window claims.

Status: documented, not yet implemented in the extraction prompt. Add to the prose-extractor update in Phase 2.
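A minimal sketch of the "least specific valid time" rule, assuming the {era}.{year} format above. The `TimeStamp` type, the `stamp` helper, and the "year" precision label are hypothetical; only the precision: low flag comes from the text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches "3rd_age" or "3rd_age.2140" (era slug, optional year).
_TIME = re.compile(r"^(?P<era>[a-z0-9_]+)(?:\.(?P<year>\d+))?$")


@dataclass(frozen=True)
class TimeStamp:
    era: str
    year: Optional[int]
    precision: str  # "low" when only the era is known, "year" otherwise (assumed labels)

    def canonical(self) -> str:
        return self.era if self.year is None else f"{self.era}.{self.year}"


def stamp(value: str) -> TimeStamp:
    """Parse a time value without inventing precision the source did not give."""
    match = _TIME.match(value)
    if match is None:
        raise ValueError(f"not a valid time value: {value!r}")
    year = match.group("year")
    if year is None:
        return TimeStamp(match.group("era"), None, "low")
    return TimeStamp(match.group("era"), int(year), "year")


def safe_for_narrow_window(ts: TimeStamp) -> bool:
    """Low-precision edges should not back narrow time-window claims."""
    return ts.precision != "low"
```

Here "late Third Age" would be stamped as 3rd_age with precision low, and a query over a narrow year range would skip it or report it as uncertain.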

S2.2 — The Consistency Engine will over-flag

High-fantasy worlds are full of valid temporal overlaps (a person ruling two kingdoms through marriage, a faction that is both allied and at war with a third party via different treaties). The Category A rules will produce a flood of Contradiction nodes.

Fix:

  • Default severity is warn, not error. A warn contradiction is "the world-builder should look at this," not "this is wrong."
  • The world-builder can mark a warn as acknowledged (a property on the Contradiction node), which suppresses future flagging.
  • Rules have a confidence_threshold parameter; below it, no violation is created.
  • A disable_rules[] list on the world config to silence specific rules per era or per region.

Status: fix design complete, implementation in Phase 7.
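The four suppression mechanisms above compose into a single gate. The sketch below assumes hypothetical `RuleConfig`, `WorldConfig`, and `Violation` types and a default threshold of 0.5; it also simplifies disable_rules to a global set, whereas the text scopes it per era or per region.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class RuleConfig:
    rule_id: str
    severity: str = "warn"             # default severity is warn, not error
    confidence_threshold: float = 0.5  # hypothetical default value


@dataclass
class WorldConfig:
    disable_rules: Set[str] = field(default_factory=set)


@dataclass
class Violation:
    rule_id: str
    confidence: float
    acknowledged: bool = False  # set by the world-builder on the Contradiction node


def should_flag(v: Violation, rule: RuleConfig, world: WorldConfig) -> Optional[str]:
    """Return the severity to record, or None if the violation is suppressed."""
    if v.rule_id in world.disable_rules:
        return None  # rule silenced in the world config
    if v.confidence < rule.confidence_threshold:
        return None  # below threshold: no violation is created
    if v.acknowledged:
        return None  # already reviewed, suppress future flagging
    return rule.severity
```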

S2.3 — LLM cost on summarize_chain and narrate_arc

These are the only LLM-in-the-loop read tools. They make multiple Cypher queries, then call an LLM to render prose. At session scale, this is the single biggest cost driver.

Fix:

  • Default to no internal-LLM path. The LLM the user is talking to can do its own narrative synthesis from raw tool output.
  • summarize_chain is opt-in: the LLM must explicitly request it.
  • Future: cache summarize_chain results per (entity, depth, style, world_time) tuple. The world doesn't change for 95% of queries.

Status: documented, gated behind explicit LLM request.
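The proposed cache could be as simple as a dictionary keyed on the tuple named above. This is an illustrative sketch: `SummaryCache` and the `render` callable (a stand-in for the Cypher queries plus the LLM call) are assumptions, and invalidation is reduced to clearing everything when the world changes.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, int, str, str]  # (entity, depth, style, world_time)


class SummaryCache:
    """Memoizes summarize_chain-style results per (entity, depth, style, world_time)."""

    def __init__(self, render: Callable[[str, int, str, str], str]) -> None:
        self._render = render  # stand-in for the expensive query + LLM path
        self._store: Dict[CacheKey, str] = {}
        self.misses = 0

    def get(self, entity: str, depth: int, style: str, world_time: str) -> str:
        key = (entity, depth, style, world_time)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._render(*key)
        return self._store[key]

    def invalidate(self) -> None:
        """Drop all entries, e.g. after an ingestion run changes the world."""
        self._store.clear()


def fake_render(entity: str, depth: int, style: str, world_time: str) -> str:
    # Placeholder for the real rendering path; returns a deterministic string.
    return f"{entity}|{depth}|{style}|{world_time}"


cache = SummaryCache(fake_render)
first = cache.get("Aelar", 2, "bullets", "3rd_age.2140")
second = cache.get("Aelar", 2, "bullets", "3rd_age.2140")
```

Including world_time in the key matters: the same entity summarized at two in-fiction times must not share an entry.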

S2.4 — The 45-tool surface is past the LLM's tool-use ceiling

Empirically, LLMs start making poor tool choices past ~25 tools in the same system prompt. The current catalog is 8 inherited + 37 new = 45 tools, well past the ceiling.

Fix:

  • Phase 6 test: measure tool selection accuracy with all 45 vs. with the 8 most-used. If 8 is dramatically better, collapse the long tail.
  • Group tools by function in the system prompt (we already do this) and instruct the LLM to look at the relevant group first.
  • If still bad: collapse state_at into entity_context (with optional comprehensive: true), and summarize_chain into narrate_arc (with optional style: bullets).

Status: acknowledged, in scope of Phase 10.

Severity 3 (known limitations)

S3.1 — Prophecy and unreliable narration aren't first-class

A claim like "the prophecy says the Crimson Throne will fall" is in the graph only as ordinary ingested content (if at all): the engine doesn't model who said it, how reliable they are, or whether it has come true.

Fix (v2): add a Claim node label with claimant, reliability, verification_status, and claimed_event edges. The cite tool can return claims, not just chunks, and the LLM can answer "is the prophecy true?" with "the prophecy claims X, source: Aelar's temple, reliability: contested, no verification."

Status: out of scope for v1. Documented as a v2 feature.

S3.2 — Cross-world queries

The engine is per-world. A future version might want to query across two worlds (for a multi-world campaign or a comparison). The schema doesn't support this — the :Era slugs aren't namespaced.

Fix (v1.2, resolved): the v1.2 Setting and Plane graph nodes + EXISTS_IN edges replace the v1.1 flat world_id string namespace. Multi-setting queries are now supported via Setting filters and EXISTS_IN traversal. The "deferred to v2" framing in the v1 review is no longer accurate — the resolution is the v1.2 plane model. See 17-planes.md.

Status: resolved in v1.2 by the plane model (see 17-planes.md).

S3.3 — The reasoning harness depends on the LLM reading it

The system prompt is instruction, not constraint. The LLM can ignore it, especially under adversarial user pressure ("just give me an answer, don't worry about citations").

Fix:

  • The MCP server can enforce some rules (e.g., refuse cite-less answers via a "force citation" mode).
  • A "consistency-required" mode that rejects LLM tool calls inconsistent with the latest :ConsistencyRun result.
  • A user-facing UI that shows the LLM's tool-call trace, so a human can audit violations.

Status: enforcement is a v2 feature. v1 relies on the LLM being well-behaved.
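As a rough illustration of the "force citation" mode, the server could inspect the tool-call trace before accepting an answer. Everything here is hypothetical (`ToolCall`, `CitationRequired`, `enforce_citation`); the only name taken from the design is the cite tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    tool: str  # name of the tool the LLM invoked


class CitationRequired(Exception):
    """Raised when an answer arrives without any supporting cite call."""


def enforce_citation(trace: List[ToolCall], force_citation: bool) -> None:
    """Reject the answer if force-citation mode is on and cite was never called."""
    if force_citation and not any(call.tool == "cite" for call in trace):
        raise CitationRequired("answer rejected: no cite call in the tool trace")
```

This checks only that a citation was requested, not that the answer is faithful to it; the audit UI in the third bullet covers that gap.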

S3.4 — The structured YAML format is a maintenance burden

A world-builder has to learn YAML, follow a strict schema, and update it as the world evolves. The prose path is much easier: just write a story.

Fix:

  • Phase 5: build a CLI tea worldbuilder with autocomplete, validation, and preview.
  • Phase 5: a web UI for editing YAML with type-ahead from existing entity names.
  • Phase 5: import-from-prose via the LLM (read a markdown chapter, propose a YAML diff, world-builder approves).

Status: tooling is in scope but not the core design.

Severity 4 (philosophical issues)

S4.1 — The engine models the written world, not the imagined world

A world-builder's mental model of their world is always richer than what's in any document. The engine can only reason about what's been ingested. The LLM can never answer "what is the secret history of the Vyrs that the world-builder hasn't written down?" — because the engine has no record of unwritten facts.

This isn't a bug, it's a feature. The engine is bounded by its sources. The LLM should never invent to fill the gap.

Status: explicit design choice. The system prompt says so.

S4.2 — The "best" tool for the LLM is the one it actually uses

We designed 45 tools (8 inherited + 37 new). The LLM might use 8 of them 95% of the time. In that case the other 37 are dead weight: they bloat the system prompt and confuse the tool-selection logic.

Fix: measure tool usage in Phase 6. Tools with <2% usage in test sessions are either folded into a higher-level tool or pruned. The design is a floor, not a ceiling: we add tools, and we take them away only when evidence says we should.

Status: ongoing. Re-evaluate after Phase 10.
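The usage measurement reduces to counting calls per tool across test sessions. A small sketch, assuming the call log is available as a flat list of tool names; `low_usage_tools` is a hypothetical helper and the counts are made up for the example.

```python
from collections import Counter
from typing import Iterable, List


def low_usage_tools(calls: Iterable[str], catalog: Iterable[str],
                    threshold: float = 0.02) -> List[str]:
    """Return catalog tools whose share of all calls is below the threshold."""
    counts = Counter(calls)
    total = sum(counts.values())
    if total == 0:
        return sorted(catalog)  # no data yet: every tool is unproven
    return sorted(tool for tool in catalog if counts[tool] / total < threshold)


# Illustrative log of 100 calls, not measured data.
calls = ["entity_context"] * 60 + ["state_at"] * 39 + ["narrate_arc"] * 1
catalog = ["entity_context", "state_at", "narrate_arc", "summarize_chain"]
review_list = low_usage_tools(calls, catalog)
```

Tools that were never called show up with a zero share, so they land on the review list alongside rarely used ones.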

S4.3 — "Historically accurate" is a moving target

A world-builder changes the lore. The engine must absorb the change without breaking prior reasoning. We don't have a versioning model.

Fix (v2): every node and edge has a valid_from_version / valid_until_version pair. Old queries can be replayed against a snapshot. The consistency engine can diff two versions and surface what changed.

Status: deferred. v1 expects the world to evolve by MERGE, not by version.
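To make the v2 idea concrete, here is a sketch of snapshot replay and diffing over versioned edges. It assumes integer version numbers and a half-open interval (valid_until_version is exclusive, None means still valid); neither detail is specified in the design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class VersionedEdge:
    name: str
    valid_from_version: int
    valid_until_version: Optional[int] = None  # None: still valid


def visible_at(edge: VersionedEdge, version: int) -> bool:
    """True if the edge exists in the given world version."""
    if version < edge.valid_from_version:
        return False
    return edge.valid_until_version is None or version < edge.valid_until_version


def snapshot(edges: List[VersionedEdge], version: int) -> List[VersionedEdge]:
    """The subset of edges an old query would see when replayed at `version`."""
    return [e for e in edges if visible_at(e, version)]


def diff(edges: List[VersionedEdge], old: int, new: int) -> Dict[str, List[str]]:
    """Names of edges added and removed between two versions."""
    before = {e.name for e in snapshot(edges, old)}
    after = {e.name for e in snapshot(edges, new)}
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Toy data: edge_a is retired in version 3, where edge_b appears.
edges = [VersionedEdge("edge_a", 1, 3), VersionedEdge("edge_b", 3)]
```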

Open questions

These are decisions I couldn't make alone. The world-builder should answer them before Phase 1.

  1. How granular is the time model in practice? Resolved (Q1): year-level precision is the default, with optional month/day/event precision when the source supports it. The UDF and the storage cost are unchanged.
  2. Are there multi-world / planar structures? Resolved (Q2): yes. The engine adds Setting and Plane graph nodes (v1.2); the v1.1 flat world_id string namespace is deprecated. Multi-setting queries are supported via Setting filters; planar relationships via Plane, EXISTS_IN, and the four plane-relation edge types (REFLECTS, LAYER_OF, ADJACENT_TO, ACCESSIBLE_VIA). See 17-planes.md.
  3. How are NPCs and PC players modeled? Resolved (Q3): separately. The NPC, PC, and Human labels in 01-ontology.md cover this. The in-fiction Person is canonical; the wrappers track who controls it.
  4. What's the policy on retconning? Resolved (Q4): preserve history by default. Old edges/nodes are marked retconned with a snapshot in the retcon Postgres table (12-storage-strategy.md#postgres-schema). Explicit DELETE is the only way to remove something permanently.
  5. How is the world bootstrapped? Resolved (Q5): organically over a long period. The engine supports partial worlds (some eras defined, some not), and the consistency engine surfaces missing structural data as :Orphan nodes. No need to pre-define everything.
  6. What's the confidence weighting formula? Resolved (Q6): more recent source wins. The source_uploaded_at (or source_published_at when known) is the tiebreaker. The engine stores both. When two prose sources disagree and both are recent, the rule engine surfaces the contradiction; it does not pick a winner automatically.
  7. Are contradiction nodes user-facing? Resolved (Q7): the local engine is read-only for contradictions — the world-builder reviews them in a queue. An external source may be authorized to resolve contradictions later (e.g. a community lore-council with write access). The local engine never auto-resolves.

Resolved-by-Kay decisions in v1.1

All 7 open questions are now resolved and reflected in:

  • 01-ontology.md: adds Plane, NPC, PC, Human labels
  • 02-time-model.md: year-level precision is the default
  • 12-storage-strategy.md: retcon Postgres table for retcon history
  • 09-roadmap.md: Phase 0 (pre-flight) now includes resolving these

What this design is good at

For balance:

  • Time-aware queries. The time model is the strongest part. The time_in_window UDF + era-tree membership + current resolution is a real primitive that solves the most common failure mode.
  • Source attribution. Every claim traces to a document. The LLM is told to cite.
  • Structured ingestion. The YAML path makes high-stakes data (lineage, era boundaries, faction rules) exact, not fuzzy.
  • Modular tools. Each tool does one job. Higher-level patterns are compositions, not mega-tools.
  • Consistency surfacing. The engine reports what it doesn't know as loudly as what it does.
  • Polymorphic extension. v1.1's DomainEntity + TypeTemplate model lets the world-builder add new domain types (thieves-guild missions, war campaigns, black markets) without code changes.

What this design is not good at (yet)

  • Scaling beyond ~10K entities on the prose path. Entity resolution via prompt-injection doesn't scale. The structured path scales; the prose path doesn't.
  • Prophecy, deception, unreliable narration. v1 doesn't model these as first-class.
  • Forcing the LLM to behave. The reasoning harness is a contract, not an enforcement mechanism.
  • User experience for world-builders. v1 is CLI + YAML. UI is a v2.
  • Versioning and retcon handling at the v1 level. v1 mutates in place; v1.1's retcon table preserves history but the in-graph nodes still get MERGE'd. A v2 might use temporal versioning on the graph itself.
  • Auto-resolution of cross-source conflicts. v1.1 surfaces them; the world-builder resolves.

v1.1 critique additions

After the v1 review, the modularization question surfaced four new design risks worth recording.

S1.4 (NEW, blocker) — Closed-world ontology ceiling

The v1 ontology has roughly 36 hard-coded labels (7 base + v1 core incl. Relation per ADR 0009 + 2 v1.2 planes + 5 v1.1 polymorphic + 5 consistency). A thieves-guild mission is forced into Event, a war campaign is forced into Faction-with-properties, a black-market trade log is forced into Item-with-properties. The LLM can talk about these things, but the engine can't reason over their structure.

Fix: the polymorphic DomainEntity wrapper + TypeTemplate data-defined schemas. See 11-extensibility.md. This is the load-bearing change for "arbitrary new concept, define how it associates with larger constructs, but also have flexibility to get as detailed as we need."

Status: resolved in v1.2 design. The polymorphic extension model is shipped in the MVP (it's how the v1 ontology becomes extensible without code). The template-watcher is a Cognee data-pipeline; the dynamic tool generator is part of the Lore Engine extension. Implementation is Phase 5 of the Cognee roadmap in 09-roadmap.md.

S1.5 (NEW, blocker) — Single mcp-server binary blocks iteration

The original GraphMCP-Example mcp-server/main.go was a 1144-line single file. Adding a new tool meant editing main.go, recompiling, and redeploying. For a world that is going to grow indefinitely, that iteration loop dominates the cost of the entire program.

Fix: switching to Cognee as the substrate. Cognee is the gateway; the Lore Engine is one in-process Python extension (one tool per Group file, registered at startup). Adding a new tool is a Python edit + Cognee restart (5 minutes). Adding a new domain type is a YAML file + hot-reload (no restart). See 13-microservice-decomposition.md.

Status: N/A in v1.2. The substrate switch resolves this completely. The Lore Engine does not own the mcp-server; Cognee does.

S2.5 (NEW, design risk) — The polymorphic wrapper adds query complexity

Every DomainEntity query is now polymorphic — the engine has to look up the template, get the field names, build the right query. The performance overhead is small for typed queries, but for expand_context and graph_traverse, the engine has to follow relations through the Relation label and re-resolve the template for each step.

Fix: Cognee caches TypeTemplate lookups in its in-process store. The first time a template is referenced, its spec is loaded; subsequent queries use the cached version. Cache invalidation is on template reload (hot-reload event from the template-watcher data-pipeline). Cognee's caching layer handles this without us writing a custom cache.

Status: acknowledged, fix designed, implementation in Phase 5 of the Cognee roadmap.
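The caching behavior described in the fix (load on first reference, reuse afterwards, invalidate on reload) can be illustrated with a small stand-in. This is not Cognee's API: `TemplateCache`, `load_spec`, and `on_template_reload` are hypothetical names used only to show the lifecycle.

```python
from typing import Callable, Dict


class TemplateCache:
    """Stand-in for a TypeTemplate lookup cache with reload invalidation."""

    def __init__(self, load_spec: Callable[[str], dict]) -> None:
        self._load_spec = load_spec  # e.g. reads and parses the template file
        self._specs: Dict[str, dict] = {}

    def get(self, template_name: str) -> dict:
        # Load on first reference, then serve from memory.
        if template_name not in self._specs:
            self._specs[template_name] = self._load_spec(template_name)
        return self._specs[template_name]

    def on_template_reload(self, template_name: str) -> None:
        # Called on a hot-reload event: drop the stale spec so the next get() reloads it.
        self._specs.pop(template_name, None)


loads = []


def fake_loader(name: str) -> dict:
    loads.append(name)  # record each real load so cache hits are observable
    return {"name": name, "fields": ["title", "status"]}


templates = TemplateCache(fake_loader)
templates.get("mission")
templates.get("mission")             # served from cache, no second load
templates.on_template_reload("mission")
templates.get("mission")             # reloaded after invalidation
```

In a multi-step traversal, each hop would call get() with the template name, so the per-step cost after the first reference is a dictionary lookup.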

S2.6 (NEW, design risk) — Cross-store consistency is genuinely hard

When the world-builder writes a new mission, we touch the Cognee graph (entity, relations) and the operational Postgres tables (mission_log row). These two writes are not atomic. A partial failure leaves the world in an inconsistent state.

Fix: the saga pattern is no longer needed. Cognee manages its own transaction model for the graph + Postgres + vector store. The Lore Engine's operational tables are in Cognee's Postgres, so writes that touch the graph and the operational tables are managed by Cognee's atomicity guarantees. We do not need a custom saga layer.

Status: N/A in v1.2. Cognee handles this. The v1.1 saga-pattern section in 12-storage-strategy.md has been removed.

Conclusion

The design is viable for v1 on Cognee, with a clear scope of 16 days for the MVP (Phases 0-3) and 33 days for the full v1 + extensions (Phases 0-6). The 7 open questions are resolved. The biggest remaining risks are scale (entity resolution), over-flagging (consistency engine), Cognee-specific substrate quirks, and LLM misbehavior (harness enforcement). Each has a documented mitigation.

I would build the Cognee spike first (Phase 0, 2 days), validate the substrate, then proceed to the MVP (Phases 1-3, 14 days). The polymorphic extension model (Phase 5) and the consistency engine (Phase 4) are the highest-leverage v1.1 additions and ship in the same ~33-day window.