lore-engine/docs/15-related-work.md
Kaysser Kayyali 50d8deab55 docs: reframe consistency engine as from-scratch on Cognee; add CONTEXT.md glossary
Research into Cognee's actual API (docs.cognee.ai) confirmed the
docs made a load-bearing false claim: that the Lore Engine
'inherits and generalizes' a Contradiction node, get_contradictions
tool, 8 inherited MCP tools, and neo4j-init.cypher from the substrate.

Cognee ships NONE of that. Cognee provides DataPoint + custom graph
models + remember/recall + a Cypher/APOC graph-rule pattern. So:
  - Slice 2 (consistency) is a from-scratch BUILD, not a generalization
  - Categories A/B/D (Contradiction/Anachronism/Orphan) are ours
  - Category C (declarative OntologyRule) rides Cognee's Cypher pattern
  - '8 inherited tools' -> '8 base tools' (one wraps cognee.recall)
  - '7 inherited labels' -> '7 base types' (Lore Engine originals on DataPoint)

Fixed across 04-consistency, 01-ontology, 05-mcp-tools, 00-overview,
09-roadmap, 15-related-work, 16-comparison. Historical GraphMCP
comparisons left intact.

Added CONTEXT.md (glossary) — the grill-with-docs skill mandates it
and 6 ADRs' worth of resolved terms (Lineage/Faction/Region/Plane/
LoreSource/extraction+source confidence/disputed edge/retcon/Setting/
ConsistencyRun/Cognee) had no single home. New readers no longer mine
ADR prose for the vocabulary.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-17 22:36:07 -04:00


15 — Related Work: How Similar Systems Reason

This document surveys the landscape of knowledge-graph-augmented LLM reasoning systems that overlap with the Lore Engine's goals. Each section profiles one system with a focus on:

  • What it does and why it was built
  • How it stores and reasons over its knowledge
  • What it does well, what it does poorly, what it doesn't do at all
  • How it compares to the Lore Engine — including where the Lore Engine is worse

Sources are linked inline. Star counts and version numbers are as of 2026-06-16 unless noted.


1. Microsoft GraphRAG

Citation: Edge, D., et al. "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv:2404.16130, April 2024. Repo: github.com/microsoft/graphrag — 33,779 stars, MIT license, latest release v3.1.0 (2026-05-28). Microsoft Research blog post: "GraphRAG: Unlocking LLM discovery on narrative private data." Status: Production. Microsoft calls it a "data pipeline and transformation suite ... to extract meaningful, structured data from unstructured text using the power of LLMs."

How it works

GraphRAG's pipeline has two stages. First, it uses an LLM to extract an entity-relationship knowledge graph from a corpus of unstructured text (private documents, news, etc.). Then, it runs community detection (Leiden algorithm) on the resulting graph to find clusters of closely-related entities, and pre-generates a hierarchical summary for each community.

At query time, instead of doing traditional RAG (chunk similarity → top-K), GraphRAG uses the pre-generated community summaries as the retrieval unit. For a global question like "What are the main themes in this corpus?", every community summary contributes a partial answer, and the LLM synthesizes a final response.
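The query-time step can be sketched as a map-reduce over community summaries. This is a minimal illustration, not GraphRAG's actual code: `global_query`, `toy_map`, and `toy_reduce` are hypothetical names, the two callables stand in for LLM calls, and the character budget stands in for a token budget. The 0-100 helpfulness score and the dropping of zero-score answers follow my reading of the paper and should be treated as assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures: in a real system both would wrap LLM calls.
MapFn = Callable[[str, str], Tuple[str, int]]  # (question, summary) -> (partial answer, score 0-100)
ReduceFn = Callable[[str, List[str]], str]     # (question, partial answers) -> final answer


@dataclass
class PartialAnswer:
    community_id: str
    text: str
    score: int


def global_query(question: str, summaries: Dict[str, str],
                 map_fn: MapFn, reduce_fn: ReduceFn,
                 budget_chars: int = 2000) -> str:
    # Map: every community summary gets a chance to answer.
    partials = []
    for community_id, summary in summaries.items():
        text, score = map_fn(question, summary)
        if score > 0:  # drop summaries that had nothing useful to say
            partials.append(PartialAnswer(community_id, text, score))

    # Reduce: most helpful first, until the context budget is full.
    partials.sort(key=lambda p: p.score, reverse=True)
    selected, used = [], 0
    for p in partials:
        if used + len(p.text) > budget_chars:
            break
        selected.append(p.text)
        used += len(p.text)
    return reduce_fn(question, selected)


# Toy stand-ins for the LLM so the sketch runs without a model.
def toy_map(question: str, summary: str) -> Tuple[str, int]:
    overlap = set(question.lower().split()) & set(summary.lower().split())
    return summary, min(100, 25 * len(overlap))


def toy_reduce(question: str, partials: List[str]) -> str:
    return " | ".join(partials)


summaries = {
    "c0": "border wars between rival houses",
    "c1": "trade routes and harvest festivals",
    "c2": "wars of succession among the houses",
}
answer = global_query("what wars shaped the houses", summaries, toy_map, toy_reduce)
```

With the toy scorer, the trade-routes community scores zero and is dropped, and the remaining two partial answers reach the reducer in descending score order. The point of the sketch is the shape: the expensive work (summaries) happens at index time, and the query fans out over every community rather than over the top-K chunks.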

The paper's measured win is global sensemaking questions over 1M-token datasets — questions that traditional RAG fails on because they require understanding the whole corpus, not just finding similar chunks.

Strengths

  • Global summarization. This is the genuine contribution. Most RAG systems are local (find relevant chunks). GraphRAG can answer corpus-level questions because of the community-summary precomputation.
  • Microsoft's distribution. Production-grade code, 33k+ stars, real users. The stack the Lore Engine builds on has a much smaller community.
  • Battle-tested at scale. Microsoft uses this on real datasets in their research.
  • Citations now supported (added 2025-03).

Weaknesses

  • No temporal model. GraphRAG is "global at a moment in time." There's no concept of "what was true at T." Their time-related work is in the visualization layer, not the data layer. The Lore Engine's time_in_window UDF has no analog.
  • No closed-world ontology. GraphRAG extracts whatever the LLM finds. It does not enforce a typed ontology like the Lore Engine's 14 core labels or the TypeTemplate system. For a fictional world, this means entity types drift and the consistency story is weak.
  • No consistency engine. No contradiction detection at the engine level. If two sources disagree, GraphRAG doesn't notice.
  • The "global summarization" wins are narrow. The paper's results are for a specific class of question (corpus-level sensemaking). For the specific entity questions the Lore Engine is built for ("What did Aldric do in 340 TA?"), GraphRAG is no better than standard RAG.
  • Expensive indexing. Microsoft's own README warns: "GraphRAG indexing can be an expensive operation ... please read all of the documentation to understand the process and costs involved, and start small." The Lore Engine's structured YAML path ingests at ~50ms per file with no LLM.
  • No source-attribution provenance. The pre-generated summaries lose the source chunks. The Lore Engine's cite tool always traces back to a specific LoreSource and LoreChunk.

How it compares to the Lore Engine

GraphRAG is one of the most widely used RAG-with-KG systems right now. The Lore Engine could not realistically compete with it on general private-corpus Q&A. The two systems are solving different problems:

| Question class | GraphRAG | Lore Engine |
| --- | --- | --- |
| "What are the main themes in this corpus?" | Designed for this | Not the focus |
| "What did Aldric do at time T?" | Same as standard RAG | Designed for this |
| "Was X true at T?" | No temporal model | Time is a first-class concept |
| "Are these two sources consistent?" | No consistency engine | Contradiction engine |
| "Add a new domain type without code" | Schema-less extraction | TypeTemplate YAML |
| Cost of indexing 1M tokens | $$$ (LLM for every chunk) | Free for structured YAML |

Honest assessment: the Lore Engine could adopt GraphRAG's community-detection idea for corpus-level "what is the shape of this world?" questions. The hierarchical summarization is genuinely useful and the Lore Engine currently lacks anything equivalent. This is on the v2 roadmap (it's listed in 10-critique.md#what-this-design-is-not-good-at-yet).

Where the Lore Engine is worse: every other axis. GraphRAG is shipping, supported, and used in production at Microsoft. The Lore Engine is a design. The Lore Engine also doesn't have a global-summarization primitive; for "what is this world about?" questions, the closest we have is state_at(world_root, current) which only works if the world is well-modeled.


2. Cognee

Citation: Markovic et al. "Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning." arXiv:2505.24478, May 2025. Repo: github.com/topoteretes/cognee — 17,843 stars, last update 2026-06-16. Self-described as "the open-source AI memory platform for agents ... a self-hosted knowledge graph engine." Also a Claude Code plugin and an OpenClaw plugin. Status: Production. The README is enterprise-grade. They have integrations, a CLI, a UI, and a hosted offering.

How it works

Cognee's core abstraction is the ECL pipeline: Extract → Cognify → Load. You feed it documents (any format: text, PDFs, code, audio transcripts). It extracts entities and relations, builds a knowledge graph + vector embeddings, and exposes a small API:

await cognee.remember("Some fact.")
results = await cognee.recall("What was the fact?")
await cognee.forget(dataset="main_dataset")

The key design choice is "cognitive-science-grounded ontology generation." Cognee doesn't extract a flat entity graph — it builds a typed ontology inspired by cognitive science (it borrows from ACT-R, SOAR, etc.) and uses it to organize the graph. Their docs claim this gives better retrieval than flat KGs.

The paper evaluates on three multi-hop QA benchmarks (HotPotQA, TwoWikiMultiHop, MuSiQue) and studies the hyperparameter space of the pipeline (chunking, graph construction, retrieval, prompting). It's a systems paper, not a model paper.

Strengths

  • Cognitive ontology. This is the genuinely interesting part. Most KG-RAG systems treat the graph as a bag of triples. Cognee imposes a typed structure derived from cognitive science, which means the LLM can reason over "kinds of things" not just "things."
  • Real production deployment. 17k+ stars. They have paying customers (Cognee Cloud). The system is robust, not a research demo.
  • LLM-provider-agnostic. Works with OpenAI, Anthropic, Ollama, anything. The Lore Engine is currently hard-wired to LiteLLM via the Cognee stack.
  • Pluggable storage. Their storage layer supports multiple backends. The Lore Engine's v1.1 multi-store strategy is similar in spirit.
  • Agent-native API. remember / recall / forget is exactly what an LLM agent wants. The Lore Engine's MCP tools are more numerous but less ergonomic for direct agent use.

Weaknesses

  • Generic, not fictional. Cognee is built for any documents — company wikis, code, transcripts. It does not have a domain ontology for fictional worlds, eras, lineages, time-bounded relationships, or NPC knowledge scoping. The Lore Engine is purpose-built for these.
  • No temporal model. Like GraphRAG, Cognee has no concept of "was X true at time T." Time-aware queries are a known gap. Their changelog doesn't mention this.
  • Source attribution is a stretch. Cognee cites sources but the provenance graph is shallow — it doesn't distinguish "verified by a lore document" from "extracted from an email."
  • Closed-source hosted offering. Cognee Cloud exists; the open-source repo doesn't include the cloud bits. The Lore Engine is fully self-hosted.
  • Heavy on LLM calls. Cognee uses an LLM at every pipeline stage. The Lore Engine's structured YAML path uses no LLM at all.

How it compares to the Lore Engine

Cognee is the closest functional comparable to the Lore Engine. Both are knowledge-graph backends for LLM reasoning. Both have ingestion pipelines. Both have retrieval APIs. The differences are:

| Dimension | Cognee | Lore Engine |
| --- | --- | --- |
| Domain | Generic (any documents) | Fictional worlds (high-fantasy) |
| Ontology | Cognitive science (generic) | High-fantasy typed (Person, Faction, Era, etc.) |
| Temporal model | None | First-class (time_in_window UDF) |
| Closed-world enforcement | No (extracts whatever the LLM finds) | Yes (typed ontology + consistency engine) |
| Source attribution | Basic | Deep (every node has sources[] + lore_verified) |
| Self-hosted | Yes | Yes (built on Cognee) |
| LLM at ingest | Yes (every stage) | No (structured YAML is exact) |
| Production maturity | High (17k stars, paying users) | None yet (design phase) |
| License | Apache 2.0 | MIT (planned for the Lore Engine) |
| Pluggable LLM providers | Yes | Yes (Cognee is provider-agnostic) |

Honest assessment (v1.2 update): the Lore Engine is built on Cognee, not in competition with it. Cognee provides the substrate — the storage abstraction, the extraction pipeline, the embedding store, the agent-native API. The Lore Engine provides the fictional-world ontology, the temporal model, the consistency engine, and the TypeTemplate polymorphism as a Cognee extension. This is the v1.2 substrate decision. The "v2 option" framing in the previous version of this doc is no longer hypothetical; it's the architecture.

Where the Lore Engine is worse: Cognee has 17k stars, paying customers, a hosted offering, a Discord, integrations, and a Claude Code plugin. The Lore Engine is a design in a Gitea repo. If Kay wanted to use a knowledge-graph backend for LLM reasoning tomorrow, Cognee is the right answer. The Lore Engine is the right answer for the specific problem of reasoning about a fictional world with historical accuracy.


3. LightRAG

Citation: Guo, Z., et al. "LightRAG: Simple and Fast Retrieval-Augmented Generation." arXiv:2410.05779, October 2024. Repo: github.com/HKUDS/LightRAG — 36,622 stars, MIT license. Active development through 2026. Status: Production. From HKU Data Science (University of Hong Kong).

How it works

LightRAG's pitch: existing RAG systems use "flat data representations" (chunks) which fail to capture inter-dependencies. LightRAG integrates graph structures into text indexing and retrieval. It uses a dual-level retrieval system (low-level entity retrieval + high-level thematic retrieval) and supports incremental updates without full re-indexing.

It's faster than Microsoft GraphRAG, supports Neo4j as a storage backend, has a WebUI, integrates Langfuse for tracing and RAGAS for evaluation. Their recent work is multimodal (RAG-Anything for PDFs, images, tables, equations).

Strengths

  • Speed. The "Light" in the name is earned. Significantly faster than GraphRAG for equivalent tasks.
  • Polyglot storage. Neo4j, PostgreSQL, MongoDB, OpenSearch. The Lore Engine's v1.1 multi-store strategy is the same idea.
  • Production grade. 36k+ stars, real users, real documentation, real Discord. By star count it is the most popular of the three KG-RAG systems profiled here (36,622 vs. 33,779 for GraphRAG and 17,843 for Cognee).
  • Multimodal. RAG-Anything handles non-text content natively. The Lore Engine's S3 path is similar in spirit but the multimodal tooling is less developed.
  • Citations supported (since 2025-03).
  • Incremental updates. Adding a document doesn't re-index everything. The Lore Engine has this for templates but not yet for lore sources.

Weaknesses

  • No temporal model. Same gap as GraphRAG and Cognee.
  • No typed ontology. LightRAG's graph is untyped at the storage level. Types emerge from the LLM extraction. The Lore Engine enforces types at the schema level.
  • No fictional-world awareness. Same generic-document problem as the others.
  • No consistency engine. If two sources disagree, LightRAG doesn't notice.

How it compares to the Lore Engine

LightRAG is faster and more polished than the Lore Engine will be for a long time. The two are solving different problems: LightRAG is a general-purpose KG-RAG system, optimized for throughput and breadth. The Lore Engine is a specialized world-reasoning substrate, optimized for historical accuracy and temporal consistency.

| Dimension | LightRAG | Lore Engine |
| --- | --- | --- |
| Speed | Optimized | Not measured yet |
| Storage backends | 4 (Neo4j, Postgres, Mongo, OpenSearch) | 5 planned (Neo4j, Postgres, pgvector, Redis, MinIO) |
| Multimodal | Yes (RAG-Anything) | Via S3 attachments |
| Temporal model | None | First-class |
| Typed ontology | No | Yes (14 core + TypeTemplate) |
| Fictional-world specific | No | Yes |
| Production maturity | High (36k stars) | None |
| Incremental update | Yes | Templates only |

Honest assessment: LightRAG could be the Lore Engine's storage layer. The Lore Engine's time_in_window UDF + ontology + TypeTemplate system could sit on top of LightRAG's fast retrieval. Integration is realistic.

Where the Lore Engine is worse: maturity, speed, multimodal, community, polish. LightRAG has a 3-engineer team at HKU and is iterating fast.


4. Stanford Generative Agents

Citation: Park, J. S., et al. "Generative Agents: Interactive Simulacra of Human Behavior." arXiv:2304.03442, April 2023. (Published at UIST 2023.) Status: Academic paper. Code released as a sandbox demo. Massive cultural impact (the "25 agents in a town" demo).

How it works

This is the famous paper. Generative agents are LLM-driven characters that "wake up, cook breakfast, and head to work; form opinions, notice each other, and initiate conversations; remember and reflect on days past as they plan the next day."

The architecture has three components:

  1. Memory stream — a chronological log of every experience the agent has, in natural language.
  2. Reflection — periodically, the LLM synthesizes higher-level observations from the memory stream ("Alice is now my friend," "the party is at 2pm").
  3. Planning — the agent uses reflections and recent memories to plan the next action.

The retrieval is over the memory stream using a combination of recency, importance, and relevance scoring. The LLM is asked to score importance on each new memory; recency is just time-decay; relevance is embedding similarity.
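The three-factor retrieval score can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the `Memory` fields and the `rank_memories` name are hypothetical, the 0.995 hourly decay and the equal weighting of min-max-normalized factors follow my recollection of the paper, and the two-dimensional embeddings are toy values.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    hours_since_access: float   # input to the recency term
    importance: float           # LLM-assigned score, e.g. 1-10
    embedding: Sequence[float]  # input to the relevance term


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def min_max(values: List[float]) -> List[float]:
    # Rescale to [0, 1] so the three factors are comparable.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_memories(memories: List[Memory], query_embedding: Sequence[float],
                  decay: float = 0.995, top_k: int = 3) -> List[Memory]:
    recency = min_max([decay ** m.hours_since_access for m in memories])
    importance = min_max([m.importance for m in memories])
    relevance = min_max([cosine(m.embedding, query_embedding) for m in memories])
    # Equal weights: the score is the plain sum of the three factors.
    scored = [(r + i + v, m) for r, i, v, m in zip(recency, importance, relevance, memories)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


memories = [
    Memory("ate breakfast", 1.0, 1.0, (1.0, 0.0)),
    Memory("Isabella is planning a party", 48.0, 8.0, (0.0, 1.0)),
    Memory("brushed teeth", 2.0, 1.0, (1.0, 0.0)),
]
ranked = rank_memories(memories, query_embedding=(0.0, 1.0))
```

In this toy stream the two-day-old party memory outranks the fresher chores because importance and relevance together outweigh recency. Note what the score does not contain: any notion of type, source, or validity window, which is the gap the Weaknesses section below describes.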

The famous result: starting from the single seed that "Isabella is throwing a Valentine's Day party," 25 agents autonomously spread invitations, made new acquaintances, asked each other out, and coordinated to show up at the right time. Emergent social behavior from LLM agents.

Strengths

  • Believability. The paper's main contribution is showing that simple LLM-driven agents with memory + reflection produce emergent believable behavior. This is a real, validated finding.
  • Simplicity. No knowledge graph. No ontology. Just a memory stream and an LLM with a smart prompt. Anyone can build a version of this in a weekend.
  • Cultural impact. "Generative Agents" is now a category. Hundreds of follow-up papers build on it.
  • Reflection mechanism. The synthesis of high-level observations from low-level experiences is genuinely useful and the Lore Engine doesn't have an analog.

Weaknesses

  • Memory stream, not knowledge graph. Memories are unstructured natural language. There's no way to ask "what was Aldric's lineage?" because lineage isn't a typed thing in the memory stream. The Lore Engine's typed ontology makes this a single Cypher query.
  • No temporal reasoning beyond recency. The memory stream is chronological. The LLM has to infer that "X happened before Y" from the text. The Lore Engine's time_in_window UDF makes this a single function call.
  • Reflections drift. The paper acknowledges that reflections can be wrong, biased, or stale, and it has no consistency engine to catch them. The Lore Engine's consistency engine exists precisely to catch this class of error; the Cognee substrate ships no consistency layer of its own.
  • Scales badly. 25 agents with a few days of memory is the published limit. The system slows down as memories accumulate. The Lore Engine's Postgres + Neo4j split is designed to scale to many years of history.
  • No source attribution. Memories are generated by the LLM; there's no record of why the agent believes something.

How it compares to the Lore Engine

This is the most interesting comparison because the goals overlap more than they look. The Lore Engine's query_as_npc tool is essentially a generative-agent pattern: scope the LLM's knowledge to what the NPC has personally witnessed. But:

| Dimension | Generative Agents | Lore Engine |
| --- | --- | --- |
| Knowledge representation | Memory stream (text) | Typed graph (Person, Faction, Era, etc.) |
| Temporal model | Recency + LLM inference | UDF (time_in_window) |
| Consistency checking | None (reflections can be wrong) | Engine (4 violation node types) |
| Scales to | ~25 agents, days of history | Designed for thousands of entities, centuries of history (not yet measured) |
| Reflection synthesis | LLM-based | Not implemented yet (v2) |
| Source attribution | None | Deep |
| Self-hosted | Sandboxed demo | Designed for production |

Honest assessment: the Lore Engine could learn a lot from generative agents. The reflection mechanism is missing from the Lore Engine. "Aldric is brooding today" can be inferred from reflections on recent events; the Lore Engine currently has no way to synthesize this. The NPC behavior layer (what an NPC says/does in a scene) is exactly what generative agents do well, and the Lore Engine's query_as_npc is the substrate but not the implementation. The Lore Engine's summarize_chain and narrate_arc tools could borrow the reflection pattern.

Where the Lore Engine is worse: believability. The 25-agents demo is visceral. The Lore Engine is a substrate; using it to make believable NPCs requires the world-builder to write good templates. The "magic" of generative agents is the prompt engineering, not the architecture. The Lore Engine deliberately leaves prompt engineering to the world-builder (via the llm_hints field in templates).


5. IVIE: Incremental & Validated Interactive Experiences

Citation: Vaucher, M., et al. "IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds." arXiv:2606.13348, June 2026. Status: Academic paper, recent. Builds on the PAYADOR neuro-symbolic framework.

How it works

This is the most directly comparable system to the Lore Engine. IVIE is built specifically to generate complete, playable interactive fiction worlds (interconnected locations, functional items, NPCs, puzzles) from scratch.

The architecture is neuro-symbolic: an LLM does the creative work (setting design, character creation, puzzle design) and a symbolic validator grounds the world state. The four-stage pipeline is:

  1. Setting + character generation (LLM)
  2. Location + item generation (LLM)
  3. Puzzle + goal generation (LLM)
  4. Symbolic validation (deterministic, rules-based)

The validator checks that:

  • Locations are interconnected (you can reach B from A)
  • Items are functional (the key actually opens the door)
  • NPCs are consistent (their stated personality matches their actions)
  • Goals are achievable (the player can complete them)
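The first of these checks (every location can be reached) is a plain graph-reachability problem. The sketch below is illustrative only and is not taken from the IVIE paper: the `exits` mapping, the function names, and the example locations are all hypothetical, and it covers only the interconnection check, not items, NPCs, or goals.

```python
from collections import deque
from typing import Dict, List, Set


def reachable_from(start: str, exits: Dict[str, List[str]]) -> Set[str]:
    # Breadth-first search over one-way exits.
    seen = {start}
    queue = deque([start])
    while queue:
        here = queue.popleft()
        for nxt in exits.get(here, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def unreachable_locations(start: str, exits: Dict[str, List[str]]) -> List[str]:
    # A location counts if it appears as a source or as a destination.
    all_locations = set(exits) | {d for dests in exits.values() for d in dests}
    return sorted(all_locations - reachable_from(start, exits))


exits = {
    "cellar": ["hall"],
    "hall": ["cellar", "tower"],
    "tower": [],
    "vault": ["hall"],  # you can leave the vault, but nothing leads into it
}
problems = unreachable_locations("cellar", exits)
```

Here the validator would flag the vault: the LLM generated a room with an exit but no entrance. A deterministic check like this is cheap, which is the appeal of running it after every generation stage instead of asking the LLM to self-verify.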

Strengths

  • Purpose-built for interactive fiction. Unlike the generic systems above, IVIE is designed for exactly the use case the Lore Engine targets.
  • Neuro-symbolic split. LLM for creativity, symbolic system for consistency. This is exactly the Lore Engine's design: the LLM does the inference, the engine does the consistency checks.
  • Validated results. Human evaluation shows "immersive, thematically coherent worlds with high player engagement."
  • Recent paper (June 2026). Reflects current thinking.

Weaknesses

  • Generation, not retrieval. IVIE generates a world from scratch. The Lore Engine ingests a world that already exists. If the world-builder has 1000 pages of lore, IVIE can't use it; if the world-builder has nothing, the Lore Engine is useless.
  • No persistent world state. Each IVIE session starts fresh. There's no continuity across sessions, no "remember the party Isabella threw last week."
  • No cross-world reasoning. IVIE generates one world at a time. The Lore Engine supports multi-world/planar.
  • No source attribution. IVIE's world is LLM-generated, not LLM-reasoned-over.
  • Puzzle validation is shallow. The paper itself notes: "LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals." The Lore Engine's consistency engine is designed to catch exactly this — but for retrieved facts, not generated ones.
  • No temporal reasoning. "Was the key here at time T?" is not a question IVIE answers.

How it compares to the Lore Engine

The two systems are complementary, not competitive. IVIE generates worlds; the Lore Engine reasons over worlds. The neuro-symbolic split is the right design for both. The most interesting cross-pollination:

| Dimension | IVIE | Lore Engine |
| --- | --- | --- |
| Purpose | Generate IF worlds | Reason about existing worlds |
| Knowledge representation | LLM-generated, symbolically validated | Typed graph, ingested |
| Neuro-symbolic split | Yes (LLM + validator) | Yes (LLM + consistency engine) |
| Persistence | None (one-shot) | Yes (Neo4j + Postgres) |
| Source attribution | None | Deep |
| Temporal reasoning | No | First-class |
| Cross-world | No | Yes (Setting + Plane graph model, v1.2) |
| LLM at read time | Yes (generation) | Optional (narrative tools) |
| Closed-world enforcement | Yes (symbolic validator) | Yes (consistency engine) |

Honest assessment: the Lore Engine could import IVIE's validator as a starter consistency engine. The "LLM inconsistencies occasionally bypass puzzle constraints" finding is the exact failure mode the Lore Engine's secrecy-honors-npc-tier rule from 14-examples.md is designed to catch. The two projects are at opposite ends of the same problem: IVIE generates and validates during world creation; the Lore Engine ingests and validates after lore is written.

Where the Lore Engine is worse: world generation. The Lore Engine can't generate a world from scratch. If Kay's world is empty, IVIE (or the world-builder writing a 50-page prose document) is the right starting point, not the Lore Engine.


6. WikiChat

Citation: Semnani, S. J., et al. "WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia." arXiv:2305.14292, May 2023. Status: Academic paper (Stanford). Code released.

How it works

WikiChat is a chatbot that grounds every response in Wikipedia. The LLM generates a draft, then the system retains only the grounded facts and combines them with additional retrieved context. It's a hybrid: an LLM generates, retrieval grounds.

Headline result: 97.3% factual accuracy in simulated conversations, 97.9% in conversations with human users. The paper also reports that WikiChat is "55.0% better than GPT-4" on factual accuracy for recent topics. The 7B distilled version has minimal loss of quality.

Strengths

  • Factual accuracy. 97.3% is excellent. This is the best factual-accuracy number in the academic literature that I'm aware of.
  • Few-shot approach. No fine-tuning required. The whole system runs on top of an off-the-shelf LLM.
  • Distillation-friendly. The 7B version is competitive with the GPT-4 version, making it cheap to run.

Weaknesses

  • Wikipedia-specific. The grounding corpus is Wikipedia. The Lore Engine's corpus is fictional — Wikipedia doesn't have the in-fiction facts.
  • No temporal model. Like the others, no time awareness.
  • No fictional-world awareness. The system is for factual queries. It would happily tell you that elves are fictional because Wikipedia says so.
  • No consistency engine. Contradictions are not detected.

How it compares to the Lore Engine

WikiChat is a research result, not a system you can use for fictional worlds. The relevant takeaway is the factual-accuracy number (97.3%) as a target for the Lore Engine. The Lore Engine's accuracy will be lower because:

  • The corpus is smaller and more idiosyncratic.
  • The LLM has to answer in character, not just be factual.
  • Time-aware queries add another axis where errors can hide.

But the design pattern is borrowed: the Lore Engine's cite tool + the consistency engine + the structured YAML path should aim for >95% factual accuracy on the world-builder's test set. This is a measurable target.

Where the Lore Engine is worse: measured accuracy. WikiChat has a published number. The Lore Engine has none.


7. Temporal Knowledge Graph methods (TLogic, Chain of History, TGL-LLM)

These are a family of methods, not one system. The closest to the Lore Engine is:

TLogic (arXiv:2112.08025, 2022): learns temporal logical rules from a temporal knowledge graph, uses them for link forecasting. "Explainable Link Forecasting on Temporal Knowledge Graphs." Pure symbolic.

Chain of History (arXiv:2401.06072, January 2024): uses LLMs for temporal knowledge graph completion. Parameter-efficient fine-tuning of an LLM to predict missing future events based on observed history.

TGL-LLM (arXiv:2501.11911, January 2025): integrates temporal graph learning (a learned graph encoder) with an LLM for temporal KG forecasting.

These are forecasting systems, not consistency systems. They predict what will happen; the Lore Engine checks what's plausible given what did happen.

Strengths

  • TKG formalism. The temporal-knowledge-graph community has a clean data model: (head, relation, tail, timestamp). The Lore Engine's {era}.{year} format is the same idea.
  • Symbolic + neural hybrid. TLogic is pure rules; Chain of History is pure LLM. The Lore Engine uses Cypher UDFs (symbolic) for time and LLM only at the narrative layer (correct division of labor).
  • Forecasting accuracy. TGL-LLM reports SOTA on three benchmarks. The Lore Engine doesn't do forecasting at all — it's a retrieval system, not a prediction system.

Weaknesses

  • Forecasting ≠ consistency. These systems predict missing facts. The Lore Engine checks existing facts for consistency. Different problems, different output.
  • Open-world KGs. TKGC methods assume the world is partially observed and the task is to fill in gaps. The Lore Engine assumes the world is closed (we have all the lore) and the task is to check that the lore is self-consistent.
  • No source attribution. Predicted facts don't have a "this was predicted because..." chain.

How it compares to the Lore Engine

The TKG methods provide clean primitives that the Lore Engine's time_in_window UDF implements in a more domain-specific way. The Lore Engine's era-tree membership and current resolution are novel relative to TKG; the basic time-window comparison is well-trodden.
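The basic time-window comparison can be sketched as follows. This is not the Lore Engine's time_in_window UDF, whose semantics are defined elsewhere in the docs; it is a hypothetical Python illustration. Assumptions: timestamps are written `ERA.year` following the {era}.{year} format, eras have a fixed total order (`ERA_ORDER` is invented for the example), years count upward within an era, and a fact carries a validity interval, which already extends the single-timestamp TKG quadruple.

```python
from typing import NamedTuple, Optional, Tuple

# Hypothetical era ordering; a real world would load this from its own era tree.
ERA_ORDER = {"FA": 1, "SA": 2, "TA": 3}


def parse_time(stamp: str) -> Tuple[int, int]:
    # "TA.340" -> (3, 340); tuples compare era first, then year.
    era, year = stamp.split(".")
    return ERA_ORDER[era], int(year)


class Fact(NamedTuple):
    head: str
    relation: str
    tail: str
    valid_from: str
    valid_to: Optional[str] = None  # None means "still true"


def in_window(fact: Fact, at: str) -> bool:
    t = parse_time(at)
    if t < parse_time(fact.valid_from):
        return False
    return fact.valid_to is None or t <= parse_time(fact.valid_to)


# Illustrative fact; the RULES relation and the dates are made up.
fact = Fact("Aldric", "RULES", "House Vyr", "TA.320", "TA.355")
```

With this fact, a query at `TA.340` falls inside the window, while `TA.400` and `SA.900` do not. Everything beyond this comparison (era-tree membership, resolving the current token) is where the Lore Engine departs from the standard TKG model.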

Honest assessment: the Lore Engine does not need to invent a new temporal model. It adopts the TKG formalism, extends it with era-tree membership and the current token, and adds the consistency engine. The result is a consistency-checking system built on a well-understood temporal-data foundation. This is a feature, not a bug.

Where the Lore Engine is worse: no forecasting. "What events are likely to happen in the next 50 years of the Third Age?" is not a question the Lore Engine can answer. It only answers "what is the world like as defined by the sources." A world-builder might want a forecasting layer for sandbox campaigns; that's a v2.


8. Chain-of-Knowledge (CoK)

Citation: Wang, J., et al. "Boosting Language Models Reasoning with Chain-of-Knowledge Prompting." arXiv:2306.06427, June 2023. Status: Academic paper.

How it works

CoK is a prompting technique. Instead of asking the LLM to "think step by step" (Chain-of-Thought), CoK asks the LLM to generate explicit knowledge evidence as structured triples before answering. Then a F²-Verification step checks the evidence is factual and faithful.

Example: instead of "Let's think step by step about who won the war", the prompt is "Generate knowledge triples about the war, then answer". The LLM produces [(House Vyr, FOUGHT, Crimson Pact), (Battle of Black Spire, RESULT_OF, Border Wars), ...] and the verifier checks that the answer follows from the triples.

Strengths

  • Interpretability. The triples are visible. The reader can see what the LLM "knew" when it answered.
  • F²-Verification. The faithfulness check is a real contribution; many CoT chains hallucinate intermediate steps.
  • Generic. Works on any LLM, any domain.

Weaknesses

  • Triples are in-prompt, not in a graph. CoK triples are generated and discarded each query. The Lore Engine's triples are persistent in Neo4j.
  • No source attribution. Triples come from the LLM, not from sources.
  • Doesn't scale to large worlds. CoK is for one-shot question answering. The Lore Engine is for a persistent world.

How it compares to the Lore Engine

The F²-Verification pattern is interesting and could be borrowed. A v2 could add a CoK-style prompt layer that asks the LLM to generate triples before answering, then verifies the triples against the graph before letting the answer through. This would catch a class of LLM hallucinations that the consistency engine currently misses.
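A rough sketch of what that gate could look like, assuming the graph can be exposed as a set of (head, relation, tail) triples. `verify_triples` and `gate_answer` are hypothetical names, not existing Lore Engine tools, and exact-match lookup only covers the factuality half of F²-Verification; checking that the answer actually follows from the triples (faithfulness) would need a separate step.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def verify_triples(claimed: Iterable[Triple],
                   graph: Set[Triple]) -> Tuple[List[Triple], List[Triple]]:
    # Split the LLM's claimed evidence into graph-backed and unbacked triples.
    supported, unsupported = [], []
    for triple in claimed:
        (supported if triple in graph else unsupported).append(triple)
    return supported, unsupported


def gate_answer(answer: str, claimed: Iterable[Triple], graph: Set[Triple]) -> str:
    # Let the answer through only if every cited triple exists in the graph.
    _, unsupported = verify_triples(claimed, graph)
    if unsupported:
        return f"WITHHELD: {len(unsupported)} unsupported triple(s)"
    return answer


graph = {
    ("House Vyr", "FOUGHT", "Crimson Pact"),
    ("Battle of Black Spire", "RESULT_OF", "Border Wars"),
}
claimed = [
    ("House Vyr", "FOUGHT", "Crimson Pact"),
    ("House Vyr", "WON", "Border Wars"),  # not in the graph
]
result = gate_answer("House Vyr won the Border Wars.", claimed, graph)
```

In this example the second triple has no backing in the graph, so the answer is withheld. A production version would more likely return the unsupported triples to the LLM for a retry than block outright, but that is a design choice for v2.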

Where the Lore Engine is worse: no in-prompt structured reasoning. The LLM in the Lore Engine just answers; in CoK, the LLM shows its work. The latter is more auditable.


9. Other systems I checked briefly

  • Long Story Generation via Knowledge Graph and Literary Theory (arXiv:2508.03137, 2025): uses a multi-agent system with a knowledge graph for long-form story generation. Reports "inevitable theme drift" and "incoherent logic" as known problems. The Lore Engine's consistency engine is designed to address the second.
  • STORYTELLER (arXiv:2506.02347, 2025): plot-planning framework. Not a knowledge graph. Different problem.
  • Hybrid AgentGroupChat (arXiv:2403.13433): multi-agent chat simulacra. Extension of generative agents. Doesn't address the Lore Engine's problem.
  • ReAct / ReDoc (earlier papers): tool-use reasoning with KG lookups. The Lore Engine's MCP-tool pattern is the same shape. ReAct is for general tool use; the Lore Engine's tools are world-specific.
  • Anthropic Constitutional AI (2022): self-correction via constitutional principles. The Lore Engine's reasoning harness does something similar via explicit rules ("MUST NOT resolve contradictions yourself").

The comparison matrix

8 systems × 10 dimensions. Legend: first-class, ◐ partial, not present, — not applicable.

System Year Stars Domain Storage Temporal Ontology Consistency Extensibility Source Attribution Self-Hosted Production
Lore Engine (v1.1) 2026 (designed) 0 Fictional worlds Neo4j+PG+pgvector+Redis+S3 UDF typed engine TypeTemplate deep design
Microsoft GraphRAG 2024 33,779 Private corpora KG (NetworkX/Neo4j) + vectors
Cognee 2025 17,843 Agent memory KG (Kuzu/Neo4j) + vectors ◐ (cognitive)
LightRAG 2024 36,622 General RAG KG (Neo4j/PG/Mongo/OS) + vectors
Generative Agents 2023 ~paper NPC behavior Memory stream (text) ◐ recency Demo
IVIE 2026 ~paper Interactive fiction LLM + symbolic validator ◐ validator
WikiChat 2023 ~paper Factual QA Wikipedia (paragraph-level) Demo
TKG methods 2022-2025 ~papers Forecasting TKG
Chain-of-Knowledge 2023 ~paper Generic reasoning In-prompt triples ◐ F²-verify

What this tells us

The Lore Engine is not in a crowded space for its specific goal. The closest functional comparables (Cognee, LightRAG, GraphRAG) are all generic, open-world, and lack a temporal model. The closest in-spirit comparable (Stanford Generative Agents) lacks a knowledge graph. The closest by use case (IVIE) is a world generator, not a world reasoner. Nobody has shipped a closed-world, temporally-consistent, contradiction-checking, fictional-world knowledge graph for LLM reasoning.

The opportunity is real. The risk is that the Lore Engine builds something the world-builder doesn't actually want. The validation step (build a minimum-viable version, ingest one world, see if the LLM produces better narrative than the world-builder could alone) is the only way to know.

The Lore Engine is also late to the party on general KG-RAG maturity. GraphRAG/Cognee/LightRAG are production systems with paying users. The Lore Engine's value proposition has to be: for the specific problem of reasoning about a fictional world with historical accuracy, we do things these systems don't. That's the bar.

End of related work. Comparison continues in 16-comparison.md with a more direct head-to-head and a critical-thinking section.