docs: integration module how-to (INTEGRATION.md) + formal contract

Two companion docs answering 'how does a host module drive the
Lore Engine correctly?'.

INTEGRATION.md — the practical guide. Audience: anyone who has
the engine and wants to wrap it. 12 sections: TL;DR (30-line
integration module), mental model, transports, 50-tool surface,
24 read tools + 12 write tools, template-generated tools, 7
integration rules, 6 failure modes, 4 metrics, adding a new
domain type, worked end-to-end example.

integration-module-contract.md — the formal, testable contract.
Audience: host-app authors. The 7 rules + their tests + their
failure modes. Versioned alongside the system prompt (v1.0/v1.1/v1.2).
The host is 'good' when its 50-question harness run scores:
tool-selection accuracy >=80%, citation rate >=90%, hallucination
rate <5%, time-window violation rate <5%.

Per the slice 7 doc deliverable (slice 7 Track A; the LLM
execution half is blocked on the API key). These are the
hand-off artefacts for any future host module author.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-19 23:11:04 -04:00
parent 122ce88295
commit 7d2fe1f97e
2 changed files with 910 additions and 0 deletions

docs/INTEGRATION.md Normal file

@@ -0,0 +1,552 @@
# Integration Guide
**Audience:** developers who have the Lore Engine POC installed
(`~/projects/lore-engine-poc/`) and want to wire it into a host
application — an LLM agent, a chat UI, an IDE plugin, a Discord
bot, a CLI tool, anything that needs to ask questions about a
fictional world.
**What this doc is:** the practical "how to drive the engine"
guide. The 22 design docs in this repo describe the engine from
the inside out (ontology, time model, consistency rules, planes,
templates, ADRs). This doc is the outside-in view: what the
host sends, what the engine returns, what the host must do in
between to satisfy the engine's contract.
**What this doc is not:** it does not duplicate the design
rationale (see `docs/00-overview.md` for that). It also does
not cover the engine's *internal* code path — for that, the
test files in `tests/` are the canonical examples.
## TL;DR — the 30-line integration module
```python
import json, subprocess, sys

# 1. Spawn the MCP server (stdio transport)
server = subprocess.Popen(
    [sys.executable, "-m", "lore_engine_poc.mcp_stdio_entry"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    text=True, bufsize=1,
)

def rpc(method, params=None, id_=None):
    msg = {"jsonrpc": "2.0", "method": method, "params": params or {}}
    if id_ is not None:
        msg["id"] = id_
    server.stdin.write(json.dumps(msg) + "\n")
    server.stdin.flush()
    return json.loads(server.stdout.readline())

# 2. Discover the tools
rpc("initialize", id_=1)
tools = rpc("tools/list", id_=2)["result"]["tools"]
# tools is a list of {name, description, inputSchema}

# 3. Call one
result = rpc("tools/call",
             params={"name": "entity_context",
                     "arguments": {"name": "Roland Raventhorne",
                                   "at_time": "3rd_age.year_345"}},
             id_=3)["result"]
# result is {content: [...], isError: bool}
```
That's the whole shape. The rest of this doc explains what
the 50 tools do, what their responses mean, and the rules
the host must follow to use them correctly.
## Table of contents
1. [The mental model](#1-the-mental-model)
2. [Transports: stdio vs Streamable HTTP](#2-transports-stdio-vs-streamable-http)
3. [The 50-tool surface](#3-the-50-tool-surface)
4. [Read tools: the 24 read patterns](#4-read-tools-the-24-read-patterns)
5. [Write tools: the 12 mutation patterns](#5-write-tools-the-12-mutation-patterns)
6. [Template-generated tools: 14 polymorphic tools](#6-template-generated-tools-14-polymorphic-tools)
7. [The 7 integration rules](#7-the-7-integration-rules)
8. [The 6 failure modes the host must avoid](#8-the-6-failure-modes-the-host-must-avoid)
9. [The 4 metrics a good integration module measures](#9-the-4-metrics-a-good-integration-module-measures)
10. [Adding a new domain type via templates/](#10-adding-a-new-domain-type-via-templates)
11. [Worked end-to-end example](#11-worked-end-to-end-example)
12. [Where to go next](#12-where-to-go-next)
## 1. The mental model
The Lore Engine is a typed, time-aware, multi-setting knowledge
graph with a reified :Relation layer and a polymorphic
:DomainEntity substrate. The host sees it as a single JSON-RPC
service. The five concepts the host must internalize:
**Setting.** A campaign/world scope. Every entity belongs to
exactly one Setting via an `EXISTS_IN` edge (the slice 6
setting filter consumes this). The default Mardonari codex
lives in `setting="mardonari"`. The Wild Dream (slice 6.5
test target) lives in `setting="the_wild_dream"`.
**Plane.** A layer of existence within a Setting (Material,
Shadowfell, demiplane, Outer Plane, transit, etc.). Planes
are first-class nodes since slice 6.1. They have relations
to other planes (`LAYER_OF`, `REFLECTS`, `ADJACENT_TO`,
`ACCESSIBLE_VIA`). The Voldramir demiplane is a child of
Mardonari's Material Plane via `LAYER_OF`.
**Entity.** A typed node. There are about 36 core labels,
including Person, Faction, Location, Region, Item, Era, Date,
Lineage, Culture, Deity, Language, MagicSystem, Title, Material,
Event, Creature, Spell, NPC, PC, Human, LoreSource, LoreVerified,
Plus, ItemSlot, DomainEntity, TypeTemplate, Setting, Plane, …
(some were added in slice 5T/6). Every entity has a canonical
name (the `by_name` key) and a type (the `add_entity_of_type`
index).
**Edge.** A typed relation between two entities. Most edges
are time-bounded (`valid_from` / `valid_until`); some are
timeless type-assertions (`EXISTS_IN`). Each edge carries a
source list (the documents that asserted the fact) and a
two-dimensional confidence score
(`extraction_confidence × source_confidence`). Two sources
that disagree create a *disputed* edge pair — slice 2's
consistency engine surfaces these as `Contradiction` nodes.
**Template (slice 5T).** A YAML schema for a polymorphic
domain type (thieves-guild mission, war campaign, black-market
lot, NPC secret knowledge, etc.). The engine reads the YAML
and registers N read-only MCP tools (`list_missions`,
`get_mission`, `missions_by_target`, etc.) automatically.
No Python change, no server restart — the host calls
`reload()` to pick up new templates.
## 2. Transports: stdio vs Streamable HTTP
The engine ships two MCP transports. Choose by deployment
context, not by preference.
**stdio** — for local development, IDE plugins, in-process
agents. The host spawns the server as a subprocess and pipes
JSON-RPC messages over stdin/stdout. See
`scripts/05_mcp_server.py`. Latency is ~1ms per call; no
network, no auth.
**Streamable HTTP** (slice 11) — for production deployments
where the host is a remote service (web app, multi-user chat
backend). The server runs in a hardened Docker container with
a 1 MiB body cap, non-root user, and read-only filesystem.
The host speaks HTTP+JSON-RPC against the `POST /mcp` endpoint.
See `scripts/06_mcp_http_server.py` and the
`docker-compose.yml` profile. Latency is ~5-50ms depending
on host network.
**The wire protocol is the same in both.** The host can
write the integration code once and switch transports by
swapping the RPC adapter. The only thing that changes is
how the bytes get from the host to the engine.
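One way to keep the integration code transport-agnostic is to hide the byte-moving step behind a single callable. The sketch below is illustrative only: `RpcClient`, `make_stdio_send`, `make_http_send`, and `fake_send` are hypothetical names, not part of the engine, and the HTTP adapter assumes the `POST /mcp` endpoint answers with a plain JSON body.

```python
import json
import urllib.request
from typing import Callable, Optional

# A transport is any callable that takes one serialized JSON-RPC request
# and returns the serialized response.
Send = Callable[[str], str]

def make_stdio_send(proc) -> Send:
    """Adapter for a subprocess opened with text-mode stdin/stdout pipes."""
    def send(line: str) -> str:
        proc.stdin.write(line + "\n")
        proc.stdin.flush()
        return proc.stdout.readline()
    return send

def make_http_send(url: str) -> Send:
    """Adapter for an HTTP endpoint (assumption: plain JSON response body)."""
    def send(line: str) -> str:
        request = urllib.request.Request(
            url, data=line.encode("utf-8"),
            headers={"Content-Type": "application/json"}, method="POST")
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    return send

class RpcClient:
    """Builds JSON-RPC 2.0 requests; knows nothing about the transport."""
    def __init__(self, send: Send):
        self._send = send
        self._next_id = 0

    def call(self, method: str, params: Optional[dict] = None) -> dict:
        self._next_id += 1
        msg = {"jsonrpc": "2.0", "id": self._next_id,
               "method": method, "params": params or {}}
        return json.loads(self._send(json.dumps(msg)))

# Demo with an in-memory fake transport, so the sketch runs without an engine.
def fake_send(line: str) -> str:
    request = json.loads(line)
    return json.dumps({"jsonrpc": "2.0", "id": request["id"],
                       "result": {"echo": request["method"]}})

client = RpcClient(fake_send)
print(client.call("tools/list"))
```

Switching transports then means constructing `RpcClient` with a different `send` callable; the code that builds requests and interprets responses stays the same.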
## 3. The 50-tool surface
`tools/list` returns one entry per tool with `name`,
`description`, and `inputSchema` (a JSON Schema). The full
surface as of slice 6.7 + 5T.5 + 10 + 11:
| Group | Count | Examples |
|---|---|---|
| Read | 12 | `lookup`, `entity_context`, `was_true_at`, `true_during`, `entities_present`, `events_during`, `timeline`, `ancestors_of`, `descendants_of`, `event_chain`, `lore_about`, `significance_of` |
| List/expand | 6 | `list_lineage`, `list_offspring`, `location_hierarchy`, `expand_context`, `recent_changes`, `list_lore_sources` |
| Read (consistency) | 5 | `run_consistency_check`, `latest_run`, `get_contradictions`, `get_anachronisms`, `get_orphans` |
| Read (ontology) | 3 | `get_ontology_violations`, `list_ontology_rules`, `explain_violation` |
| Write (entity) | 6 | `add_entity`, `add_relation`, `add_lore_source`, `set_alias`, `update_entity`, `delete_entity` |
| Write (workflow) | 4 | `retcon`, `mark_verified`, `merge_entities`, `flag_for_review` |
| Write (time) | 3 | `define_calendar`, `define_era`, `define_date` |
| Template-generated | ~14 | `list_missions`, `get_mission`, `missions_by_target`, etc. (1 per `query:` in each template) |
| Meta | 2 | `list_template_tools`, `reload_templates` |
**The tool list is dynamic.** Every time the host calls
`tools/list`, the engine returns the current registry
including any templates that have been loaded. The host
should re-fetch on `reload_templates` completion, not
rely on a cached list.
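Because each entry carries a JSON Schema, the host can check arguments locally before dispatching. A minimal sketch, assuming the schema uses the standard JSON Schema `required` keyword; `index_tools`, `missing_required`, and the sample entry are hypothetical, and the real schema comes from the engine's `tools/list` response.

```python
def index_tools(tools):
    """Map tool name -> tool entry, as returned by tools/list."""
    return {tool["name"]: tool for tool in tools}

def missing_required(tool, arguments):
    """Names listed under JSON Schema `required` that the caller omitted."""
    required = tool.get("inputSchema", {}).get("required", [])
    return [name for name in required if name not in arguments]

# Hypothetical tools/list entry, for illustration only.
sample_tools = [{
    "name": "was_true_at",
    "description": "Check whether a relation held at a given time.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "relation": {"type": "string"},
            "subject": {"type": "string"},
            "object": {"type": "string"},
            "at_time": {"type": "string"},
        },
        "required": ["relation", "subject", "object", "at_time"],
    },
}]

registry = index_tools(sample_tools)
gaps = missing_required(registry["was_true_at"],
                        {"relation": "ALLIED_WITH", "subject": "House Vyr"})
print(gaps)  # ['object', 'at_time']
```

Rebuilding `registry` from a fresh `tools/list` response after each reload keeps this check in step with the engine.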
## 4. Read tools: the 24 read patterns
The 24 read tools fall into 5 design-doc question types. The
host's LLM caller should pick a type and follow the canonical
tool sequence (see `docs/07-reasoning-harness.md` §"The five
question types"):
**Type 1 — Identity & description.** *"Who is Aldric?"*
```
lookup(query)
entity_context(entity_id, at_time=current)
expand_context(entity_id, hops=2, min_confidence=0.5) # if sparse
significance_of(entity_id)
list_lineage(person) # if Person
```
**Type 2 — Time-bounded fact check.** *"Was X true at T?"*
```
lookup(subject) + lookup(object) # if not resolved
was_true_at(RELATION, subject, object, at_time)
cite(claim) # if true
true_during(RELATION, subject, object, era) # if false
```
**Type 3 — World state at a time.** *"What was X like at T?"*
```
lookup(entity)
entities_present(location, at_time)
events_during(era, location=resolved)
get_contradictions(subject=entity, severity=warn)
```
**Type 4 — Causal / chain reasoning.** *"Why did X happen?"*
```
lookup(event/event_chain_target)
event_chain(event, depth=3)
ancestors_of(person) + descendants_of(person) # if Person
get_anachronisms(entity=central)
```
**Type 5 — Open-ended narrative.** *"Tell me about X."*
```
lookup(entity)
entity_context(entity) # state snapshot
event_chain(entity, depth=3)
lore_about(entity, type=prose, limit=10)
narrate_arc(entity, style=chronicle)
cite(claim) # back the spine
get_contradictions(subject=entity, severity=warn)
```
**Critical: every read tool returns a `sources` list.** A
good integration module extracts the `sources` from each
tool response and includes them in the final answer. A
claim without a source is a hallucination (per the slice 7.2
system prompt's Rule 2).
**Critical: every read tool respects `at_time`.** A claim
about "X was true" without a time scope is wrong by
default. The host should pass `at_time` on every fact query;
the engine's `current` reserved token resolves to the
setting's `current_era`.
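The source pass-through can be sketched as a small helper that gathers `sources` across tool payloads and refuses to render an uncited answer. The helper names are hypothetical, and the payloads are assumed to be already-decoded dicts with a `sources` list of document names.

```python
def collect_sources(payloads):
    """Merge the `sources` lists of several tool payloads, keeping first-seen order."""
    ordered = {}
    for payload in payloads:
        for source in payload.get("sources", []):
            ordered.setdefault(source, None)
    return list(ordered)

def render_with_citations(answer, payloads):
    """Append the merged source list; raise if there is nothing to cite."""
    sources = collect_sources(payloads)
    if not sources:
        raise ValueError("refusing to render an uncited answer")
    return f"{answer} Sources: {', '.join(sources)}"

# Illustrative payloads, not real engine output.
payloads = [
    {"was_true": True, "sources": ["chronicles-vyr.md", "pact-treaties.md"]},
    {"sources": ["pact-treaties.md"]},
]
print(render_with_citations("House Vyr was allied with the Crimson Pact.", payloads))
```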
## 5. Write tools: the 12 mutation patterns
The 12 write tools (slice 10) are world-builder tools, not
LLM tools. The integration module should generally **not**
let the LLM call these — the LLM is a reader, not an editor.
Allow them only behind an explicit confirmation flow (see
`docs/19-retcon-policy.md` for the retcon workflow):
```
# 1. The world-builder wants to retcon "Roland married
#    Aldric": this is wrong, it was actually "allied with".
#    Either assert the corrected relation as a new edge ...
add_relation(subject="Roland", relation="ALLIED_WITH", object="Aldric")
#    ... or amend the existing edge in place:
retcon(edge_id=..., new_object="Aldric", note="...")
# 2. The world-builder wants to mark an edge as verified
# after a human read the source.
mark_verified(edge_id=..., verified_by="world_builder", note="checked chronicles")
```
The two most important write tools are `retcon` and
`mark_verified` (slice 10.2). Both stamp the edge with an
audit log entry; both are append-only at the audit-log
level, even when they mutate the edge itself. Every other
write tool is a simpler `add_*` / `update_*` /
`delete_*` variant.
**Integration module must:** log every write tool call to
the world-builder's audit log (timestamp, tool, args,
caller). The audit log is the safety net — if a bad write
ever lands, the roll-back path is to read the log.
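A host-side audit wrapper might look like the sketch below. The tool names are taken from the write rows of the tool table; `audited_call`, the JSONL file format, and the stand-in dispatcher are assumptions made for illustration, not the engine's own audit mechanism.

```python
import json
import os
import tempfile
import time

# Write-tool names as listed in the tool-surface table.
WRITE_TOOLS = {
    "add_entity", "add_relation", "add_lore_source", "set_alias",
    "update_entity", "delete_entity", "retcon", "mark_verified",
    "merge_entities", "flag_for_review", "define_calendar",
    "define_era", "define_date",
}

def audited_call(call_tool, log_path, tool, arguments, caller):
    """Append write-tool calls to a JSONL log before dispatching them."""
    if tool in WRITE_TOOLS:
        entry = {"timestamp": time.time(), "tool": tool,
                 "args": arguments, "caller": caller}
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")
    return call_tool(tool, arguments)

# Demo with a stand-in dispatcher instead of a live engine.
def fake_call_tool(tool, arguments):
    return {"ok": True, "tool": tool}

log_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
audited_call(fake_call_tool, log_path, "lookup", {"query": "Roland"}, "llm")
audited_call(fake_call_tool, log_path, "mark_verified",
             {"edge_id": "e1", "verified_by": "world_builder"}, "world_builder")

with open(log_path, encoding="utf-8") as log:
    entries = [json.loads(line) for line in log]
print([entry["tool"] for entry in entries])  # ['mark_verified']
```

Logging before the dispatch means an attempted write is recorded even if the call itself fails.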
## 6. Template-generated tools: 14 polymorphic tools
Slice 5T shipped 4 example templates (thieves-guild mission,
war campaign, black-market lot, NPC secret knowledge). Each
template has 3-4 `query:` blocks, each of which becomes an
MCP tool at registration time. The total template-generated
surface is ~14 tools, and it grows when the world-builder
adds more `templates/*.yaml` files.
The template tools are read-only; they run a Cypher query
(allowlist-validated per slice 5T.3) against the
`:DomainEntity` nodes the engine has ingested. The full
killer demo walkthrough is in `docs/14-examples.md` §"Example
5: Planes of existence" and the slice 5T ADR (`docs/adr/0012-typetemplate-polymorphism.md`).
**Integration module must:** re-discover the tool list
after every `reload_templates` call. A cached list from
before a template was added will return
`method_not_found` for the new tool.
## 7. The 7 integration rules
These are the rules a good integration module follows. They
come from the system prompt (`prompts/system_prompt.md`,
slice 7.2), the design docs, and the ADRs. The
`tests/harness/test_questions.py` 50-question test set
checks that the LLM's tool sequence satisfies them.
**Rule 1 — Always `lookup` first.** Don't guess entity
IDs. The cost of one `lookup` is 1ms; the cost of a wrong
guess is a hallucinated answer.
**Rule 2 — Cite every claim.** Every specific factual
claim in the host's response must cite at least one source
returned by a tool. A claim without a source is a
hallucination.
**Rule 3 — Time-window every fact query.** Pass `at_time`
on every fact query (`was_true_at`, `true_during`, etc.).
Default to `current` only when the user has not specified
a time. Make the time explicit in the answer.
**Rule 4 — Never resolve contradictions yourself.** If
two sources disagree, surface both with both sources.
The world-builder decides.
**Rule 5 — `setting=` is mandatory for cross-setting
questions.** When the user asks a question that could mix
multiple settings, the host should pass `setting=<id>`
explicitly. The default behaviour (no filter) is correct
for single-setting worlds; the slice 6.5 cross-setting
filter is the safe default for multi-setting worlds.
**Rule 6 — Re-discover `tools/list` after `reload_templates`.**
A cached list from before a template was added will
return `method_not_found` for the new tool. The
`reload_templates` response is the engine's statement of
the registry's new state; treat it as authoritative.
**Rule 7 — For long historical arcs, check
`latest_run()` first.** Stale consistency data is
dangerous — a contradiction that the consistency engine
found 2 weeks ago may have been resolved by a retcon
since. `latest_run()` returns the timestamp and counts of
the most recent consistency pass.
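A staleness check for Rule 7 could be as small as the sketch below. The `finished_at` field name, its ISO-8601 format, and the 24-hour threshold are all assumptions; the actual `latest_run()` response may name and encode its timestamp differently.

```python
from datetime import datetime, timedelta, timezone

def is_stale(latest_run, now, max_age=timedelta(hours=24)):
    """True when the last consistency pass is older than max_age."""
    finished = datetime.fromisoformat(latest_run["finished_at"])
    return now - finished > max_age

# Illustrative payloads with a hypothetical `finished_at` field.
now = datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc)
fresh = {"finished_at": "2026-06-19T09:00:00+00:00", "contradictions": 2}
old = {"finished_at": "2026-06-05T09:00:00+00:00", "contradictions": 5}

print(is_stale(fresh, now), is_stale(old, now))  # False True
```

When the run is stale, the host can trigger `run_consistency_check` before trusting `get_contradictions` output.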
## 8. The 6 failure modes the host must avoid
These come from `docs/07-reasoning-harness.md` §"Failure
modes the LLM must avoid" and are the same rules the
host's LLM caller is told. The integration module should
detect each and reject the response:
**F1 — Answering from training data.** Symptom: the LLM
says "Aldric is the heir to House Vyr" without calling
`entity_context` first. The host's audit log should flag
any tool-using turn that produces a specific fact claim
without a corresponding tool call in the trace.
**F2 — Resolving contradictions.** Symptom: the LLM
picks one of two disagreeing sources. The host should
reject any response that mentions an `is_disputed: true`
edge and presents the answer as settled.
**F3 — Confusing present and past.** Symptom: "Aldric
rules Valdorn" without a time scope. The host should
require `at_time` on every fact query and surface the
time in the answer.
**F4 — Treating `lore_verified: false` as canonical.**
Symptom: the LLM cites an entity that only exists in
encounter data and has no lore document. The host
should mark provisional entities explicitly in the
response.
**F5 — Skipping the consistency check.** Symptom: the
LLM answers a 5-generation family question without
calling `get_anachronisms`. The host should make
`get_anachronisms` mandatory for any question involving
3+ entities or 1+ time hop.
**F6 — Hallucinating tool results.** Symptom: the LLM
says "the tool returned X" when the tool actually
returned Y or nothing. The host should verify every
quoted tool result against the actual tool return
(cross-check the trace).
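Two of these checks (F1 and F5) reduce to simple predicates over the turn's tool trace. The sketch below assumes a hypothetical per-turn record with `claims`, `trace`, `entity_count`, and `time_hops` keys; how the host extracts claims and counts entities is left open.

```python
def failure_flags(turn):
    """Return the failure-mode codes a turn trips (F1 and F5 only)."""
    flags = []
    trace = turn.get("trace", [])
    # F1: specific fact claims with no tool call behind them.
    if turn.get("claims") and not trace:
        flags.append("F1")
    # F5: 3+ entities or 1+ time hop requires get_anachronisms.
    needs_check = turn.get("entity_count", 0) >= 3 or turn.get("time_hops", 0) >= 1
    if needs_check and "get_anachronisms" not in trace:
        flags.append("F5")
    return flags

# Illustrative turn records (hypothetical shape).
turn_ok = {"claims": ["Aldric is the heir to House Vyr"],
           "trace": ["lookup", "entity_context"],
           "entity_count": 1, "time_hops": 0}
turn_bad = {"claims": ["Five generations of House Vyr ruled Valdorn"],
            "trace": [], "entity_count": 5, "time_hops": 4}

print(failure_flags(turn_ok), failure_flags(turn_bad))  # [] ['F1', 'F5']
```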
## 9. The 4 metrics a good integration module measures
A "good integration module" is one that catches its own
regressions. The 4 metrics (slice 7.3) are the
regression net:
**Tool-selection accuracy** (per type). What fraction
of the LLM's tool sequences match the canonical sequence
for each question type. AC 7.3: ≥80% on the 50-question
test set.
**Citation rate.** What fraction of claims cite ≥1
source. AC 7.4: ≥90%.
**Hallucination rate.** Average number of unsourced
facts per question. AC 7.5: <5%.
**Time-window violation rate.** What fraction of answers
made claims outside the question's `at_time` window.
AC 7.6: <5%.
The integration module should run the harness
(`tests/harness/questions.yaml`) before each release and
fail the build if any metric regresses. The
`scripts/harness/run_questions.py` runner (slice 7.3,
Track B — needs `$OLLAMA_API_KEY`) is the canonical
way to measure.
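A build gate over the four metrics might be sketched as follows. The per-question record shape (`sequence_ok`, `claims`, `cited_claims`, `window_violation`) is hypothetical, and the hallucination rate follows the definition as written above (average unsourced facts per question, compared against 0.05).

```python
def score(records):
    """Aggregate the four metrics from per-question harness records."""
    n = len(records)
    claims = sum(r["claims"] for r in records)
    cited = sum(r["cited_claims"] for r in records)
    return {
        "tool_selection_accuracy": sum(r["sequence_ok"] for r in records) / n,
        "citation_rate": cited / claims if claims else 1.0,
        "hallucination_rate": (claims - cited) / n,
        "time_window_violation_rate": sum(r["window_violation"] for r in records) / n,
    }

def regressions(metrics):
    """Names of the metrics that miss their acceptance threshold."""
    failed = []
    if metrics["tool_selection_accuracy"] < 0.80:
        failed.append("tool_selection_accuracy")
    if metrics["citation_rate"] < 0.90:
        failed.append("citation_rate")
    if metrics["hallucination_rate"] >= 0.05:
        failed.append("hallucination_rate")
    if metrics["time_window_violation_rate"] >= 0.05:
        failed.append("time_window_violation_rate")
    return failed

# Ten illustrative records: one wrong tool sequence, one unsourced claim.
records = [{"sequence_ok": True, "claims": 2, "cited_claims": 2,
            "window_violation": False} for _ in range(9)]
records.append({"sequence_ok": False, "claims": 2, "cited_claims": 1,
                "window_violation": False})

metrics = score(records)
print(regressions(metrics))  # ['hallucination_rate']
```

A release script can exit non-zero whenever `regressions` returns a non-empty list.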
## 10. Adding a new domain type via templates/
The killer demo (slice 5T.5). A new domain type is one
YAML file away. Walkthrough:
```bash
# 1. Drop a template YAML
cat > lore_engine_poc/seed/templates/npc_quirk.yaml <<'EOF'
template:
  id: npc_quirk
  version: 1.0.0
  label: NPCQuirk
  description: A persistent behavioral quirk for an NPC.
entity:
  properties:
    - {name: trigger, type: string, required: true}
    - {name: response, type: string, required: true}
    - {name: severity, type: enum, values: [minor, major, defining]}
  relations:
    - {to_type: Person, type: QUIRK_OF}
queries:
  - id: list_quirks
    description: List every quirk, sorted by severity.
    cypher: |
      MATCH (n:DomainEntity {type: 'NPCQuirk'})
      RETURN n ORDER BY n.severity
    parameters: {}
  - id: quirks_of
    description: All quirks of a given NPC.
    cypher: |
      MATCH (n:DomainEntity {type: 'NPCQuirk'})-[:QUIRK_OF]->(p {name: $name})
      RETURN n
    parameters:
      name: {type: string, required: true}
EOF
# 2. Reload templates (no restart)
python3 scripts/01_ingest.py --reload-templates --skip-cognee
# 3. Ingest an instance
cat > lore_engine_poc/seed/instances/aldric_quirks.yaml <<'EOF'
template_id: npc_quirk
instances:
  - name: Aldric's coin flip
    properties:
      trigger: asked for a side
      response: flips a Valdorni silver piece; calls in the air
      severity: major
    relations:
      - {to: Aldric Raventhorne, type: QUIRK_OF}
EOF
python3 scripts/01_ingest.py --ingest-instance \
lore_engine_poc/seed/instances/aldric_quirks.yaml --skip-cognee
# 4. Use the generated tool
python3 scripts/06_mcp_http_server.py --port 18765 &
curl -s http://127.0.0.1:18765/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"quirks_of",
"arguments":{"name":"Aldric Raventhorne"}}}'
```
The 2 new tools (`list_quirks`, `quirks_of`) appeared with
no Python change and no engine restart. The same pattern
works for any domain type the world-builder wants to model.
## 11. Worked end-to-end example
A 30-line host that asks "Was House Vyr allied with the
Crimson Pact in 340 TA?" and gets a cited answer back:
```python
import json, subprocess, sys

server = subprocess.Popen(
    [sys.executable, "-m", "lore_engine_poc.mcp_stdio_entry"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    text=True, bufsize=1,
)

def rpc(method, params=None, id_=1):
    msg = {"jsonrpc": "2.0", "id": id_, "method": method,
           "params": params or {}}
    server.stdin.write(json.dumps(msg) + "\n")
    server.stdin.flush()
    return json.loads(server.stdout.readline())

# 1. Initialize + discover.
rpc("initialize", id_=1)
tools = {t["name"]: t for t in rpc("tools/list", id_=2)["result"]["tools"]}

# 2. Resolve both entities (Rule 1).
rpc("tools/call",
    params={"name": "lookup", "arguments": {"query": "House Vyr"}}, id_=3)
rpc("tools/call",
    params={"name": "lookup", "arguments": {"query": "Crimson Pact"}}, id_=4)

# 3. Time-bounded fact query (Rule 3).
fact = rpc("tools/call",
           params={"name": "was_true_at",
                   "arguments": {"relation": "ALLIED_WITH",
                                 "subject": "House Vyr",
                                 "object": "Crimson Pact",
                                 "at_time": "3rd_age.year_340"}},
           id_=5)["result"]

# 4. Render the answer with citations (Rule 2).
if fact["was_true"]:
    answer = (f"Yes — House Vyr was allied with the Crimson Pact "
              f"from {fact['valid_from']} to {fact['valid_until']}. "
              f"Sources: {', '.join(fact['sources'])}")
else:
    answer = ("No — they were not allied at that time. "
              f"Edges examined: {fact['edges_examined']}")
print(answer)
```
Expected output (Mardonari codex, slice 0 fixture):
```
Yes — House Vyr was allied with the Crimson Pact
from 3rd_age.year_312 to 3rd_age.year_345.
Sources: chronicles-vyr.md, pact-treaties.md
```
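The TL;DR notes that a `tools/call` result has the shape `{content: [...], isError: bool}`, so a host may need to unwrap the payload before reading fields such as `was_true`. A minimal sketch, assuming the engine serializes its payload as JSON inside the first `text` content item (this encoding is an assumption, and `unwrap_tool_result` is a hypothetical helper):

```python
import json

def unwrap_tool_result(result):
    """Decode the JSON payload carried in an MCP tools/call result."""
    if result.get("isError"):
        raise RuntimeError(f"tool reported an error: {result.get('content')}")
    for item in result.get("content", []):
        if item.get("type") == "text":
            return json.loads(item["text"])
    raise ValueError("no text content item in tool result")

# Illustrative result envelope, not captured engine output.
sample = {
    "isError": False,
    "content": [{
        "type": "text",
        "text": json.dumps({"was_true": True,
                            "sources": ["pact-treaties.md"]}),
    }],
}

fact = unwrap_tool_result(sample)
print(fact["was_true"], fact["sources"])  # True ['pact-treaties.md']
```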
## 12. Where to go next
- [`integration-module-contract.md`](./integration-module-contract.md) — the
formal contract a host module must satisfy to be "good"
- [`docs/00-overview.md`](./00-overview.md) — engine overview
- [`docs/05-mcp-tools.md`](./05-mcp-tools.md) — the full tool catalog
- [`docs/07-reasoning-harness.md`](./07-reasoning-harness.md) — the
5 question types and 6 failure modes
- [`docs/11-extensibility.md`](./11-extensibility.md) — the
TypeTemplate polymorphic layer
- [`docs/17-planes.md`](./17-planes.md) — the Setting/Plane
model
- [`docs/19-retcon-policy.md`](./19-retcon-policy.md) —
retcon + mark_verified audit policy
- [`docs/20-multi-setting-policy.md`](./20-multi-setting-policy.md) —
cross-setting rules
- [`docs/21-quickstart.md`](./21-quickstart.md) — 5-minute
setup
- [`docs/adr/`](./adr/) — the 13 ADRs that pin the design
decisions
- `prompts/system_prompt.md` in the poc repo — the system
prompt the LLM caller is told
- `tests/harness/questions.yaml` in the poc repo — the
50-question regression net


@@ -0,0 +1,358 @@
# Integration Module Contract
**Audience:** authors of host modules — LLM agents, chat UIs,
IDE plugins, Discord bots, CLIs, anything that wraps the Lore
Engine's MCP server.
**What this doc is:** the formal contract a host module must
satisfy. The 7 rules in [`INTEGRATION.md`](./INTEGRATION.md) are
the same rules; this doc is the version that's machine-checkable
(every rule has a test, every test is in `tests/harness/`).
**The contract is one-way.** The engine promises a fixed wire
protocol (JSON-RPC over stdio or HTTP) and a fixed tool surface
(name, description, JSON Schema per tool). The host promises to
satisfy these rules; if it doesn't, the engine will produce
wrong answers and the LLM caller will hallucinate.
## The contract, version 1.2
This contract is versioned alongside the system prompt
(`prompts/system_prompt.md`, slice 7.2). When the prompt
version bumps, this contract bumps; old hosts that
satisfy v1.0 may not satisfy v1.2.
| Rule | Test | What the host must do |
|---|---|---|
| R1 — Discover | `test_7_2_registry_well_formed` | Re-fetch `tools/list` after every `reload_templates` |
| R2 — Lookup first | `test_7_1_every_question_has_expected_tools` | Call `lookup` before any `entity_context` / `was_true_at` / `lore_about` |
| R3 — Cite | `test_7_2_prompt_citation_rule_present` | Include the `sources` from every tool response in the final answer |
| R4 — Time-window | `test_7_2_prompt_time_window_rule_present` | Pass `at_time` on every fact query; surface the time in the answer |
| R5 — Don't resolve contradictions | `test_7_2_prompt_citation_rule_present` | Reject any response that mentions `is_disputed: true` and presents it as settled |
| R6 — Setting filter | `test_6_5_setting_filter_on_was_true_at` | Pass `setting=<id>` when the user asks a cross-setting question |
| R7 — Reload contract | (host-side audit) | Treat `reload_templates`'s response as the new registry state |
## Rule 1 — Discover (test: `test_7_2_registry_well_formed`)
**What:** the host's tool registry must reflect the engine's
current tool list at the time of the call.
**Why:** templates are hot-reloadable. A host that caches the
tool list from a previous `tools/list` will call tools that
no longer exist (after a template was removed) or miss tools
that were just added (after a template was added).
**Test:** `test_7_2_registry_well_formed` (slice 7.2) pins the
*server-side* contract — the registry must be well-formed.
The *client-side* contract is host-side: the host must
re-fetch `tools/list` after every `reload_templates` call.
**Failure mode:** the host calls `get_mission` after
`thieves_guild_mission.yaml` was removed. The engine returns
`method_not_found`; the host's LLM caller hallucinates an
answer.
**Mitigation:** every `reload_templates` response includes the
new tool list. The host should store it as the canonical
"current tools" and re-resolve on every dispatch.
## Rule 2 — Lookup first (test: `test_7_1_every_question_has_expected_tools`)
**What:** every question that resolves to an entity must
call `lookup` before any other read tool.
**Why:** entity names are ambiguous. "The dagger" is one of
many; the LLM cannot know which one. `lookup` returns a
canonical id (or a disambiguation list). The LLM picks one
(or asks the user).
**Test:** the 50-question set in
`tests/harness/questions.yaml` requires `lookup` to appear
in the canonical tool sequence for every question that names
an entity.
**Failure mode:** the LLM guesses the entity. The guess
resolves to the wrong id. The tool returns "unknown entity"
or a wrong entity's context. The LLM hallucinates an answer.
**Mitigation:** the host's LLM caller must include `lookup`
in every Type 1-4 question's tool sequence. The
`test_7_1_every_question_has_expected_tools` test pins this
on the server side; the host-side pin is "include `lookup`
or your test suite fails".
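On the host side, the lookup-first rule can be checked against the recorded tool trace. The sketch below is one possible reading of the rule: `ENTITY_TOOLS` is an illustrative subset of tool names, and `lookup_comes_first` is a hypothetical helper, not part of the harness.

```python
# Illustrative subset of tools that take a resolved entity.
ENTITY_TOOLS = {"entity_context", "was_true_at", "lore_about"}

def lookup_comes_first(trace):
    """True when no entity tool is called before the first lookup."""
    for name in trace:
        if name == "lookup":
            return True
        if name in ENTITY_TOOLS:
            return False
    return True  # the trace never touched an entity tool

good = ["lookup", "entity_context", "significance_of"]
bad = ["entity_context", "lookup"]

print(lookup_comes_first(good), lookup_comes_first(bad))  # True False
```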
## Rule 3 — Cite every claim (test: `test_7_2_prompt_citation_rule_present`)
**What:** every specific factual claim in the host's
response must cite at least one source returned by a tool.
**Why:** a claim without a source is a hallucination. The
engine returns a `sources` list on every edge-bearing tool
response; the host's job is to forward those sources
through to the final answer.
**Test:** `test_7_2_prompt_citation_rule_present` pins the
*server-side* contract — the system prompt must contain
the citation rule. The *client-side* pin is the citation
rate metric (AC 7.4): ≥90% of claims cite ≥1 source.
**Failure mode:** the LLM says "Aldric is the heir to House
Vyr" with no source. The user can't verify; the answer
might be from training data, not the codex.
**Mitigation:** every tool response includes a `sources`
list. The host should pass this list through to the LLM
caller and require the LLM to include ≥1 source per claim
in its response. A claim without a source is a
hallucination and should be rejected.
## Rule 4 — Time-window every fact query (test: `test_7_2_prompt_time_window_rule_present`)
**What:** every fact query must pass `at_time`, and the
host's response must surface the time in the answer.
**Why:** "Was X true?" is incomplete without "When?". The
codex is time-bounded; an answer about the past presented
as the present is wrong by default.
**Test:** `test_7_2_prompt_time_window_rule_present` pins the
server-side rule. The client-side pin is the time-window
violation rate metric (AC 7.6): <5% of answers make claims
outside the question's `at_time`.
**Failure mode:** "Aldric rules Valdorn" (he died in 360
TA; the campaign is in 380 TA). The LLM should have
scoped to 360 TA or earlier.
**Mitigation:** the host's LLM caller should pass `at_time`
on every `was_true_at`, `true_during`, `entities_present`,
and `events_during` call. If the user didn't specify a
time, default to the setting's `current_era`.
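The defaulting step can be sketched as a thin wrapper over the tool arguments. `with_time_window` is a hypothetical helper; it relies on the `current` reserved token described in INTEGRATION.md, which resolves to the setting's `current_era`.

```python
# The fact-query tools named in the mitigation above.
FACT_TOOLS = {"was_true_at", "true_during", "entities_present", "events_during"}

def with_time_window(tool, arguments, user_time=None):
    """Return a copy of arguments with at_time filled in for fact queries."""
    if tool not in FACT_TOOLS or "at_time" in arguments:
        return dict(arguments)
    # No explicit time from the caller: fall back to the reserved token.
    return {**arguments, "at_time": user_time or "current"}

args = {"relation": "RULES", "subject": "Aldric", "object": "Valdorn"}
print(with_time_window("was_true_at", args))
print(with_time_window("was_true_at", args, user_time="3rd_age.year_340"))
```

The host should also echo the resolved time in the rendered answer so the reader can see which window the claim applies to.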
## Rule 5 — Don't resolve contradictions (test: `test_7_2_prompt_citation_rule_present`)
**What:** the host must surface contradictions, not
resolve them.
**Why:** two sources disagree. The LLM cannot know which
is right — the world-builder decides. The engine marks
the edge as `is_disputed: true` and points at the
disagreeing edges via `disputed_with`. The host's job
is to forward both sides.
**Test:** the slice 2 consistency engine tests pin the
server-side rule (the engine returns disputed edges
with both sources). The client-side rule is "any
response that mentions `is_disputed: true` and presents
the answer as settled is a bug".
**Failure mode:** the LLM picks the more recent source.
The world-builder's source (older, authoritative) is
silently dropped. The user gets a wrong answer.
**Mitigation:** the host's LLM caller is told (in
`prompts/system_prompt.md` Rule 4) to never resolve
contradictions. The host should also reject any
response that mentions `is_disputed: true` and presents
the answer as settled — that's the host's enforcement
layer for the rule.
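Forwarding both sides of a dispute might look like the sketch below. `is_disputed`, `disputed_with`, and `sources` are the fields named in this contract; the `claim` text field, the use of edge ids inside `disputed_with`, and the `render_claim` helper are assumptions for illustration.

```python
def render_claim(edge, edges_by_id):
    """Render an edge and, if disputed, every edge that disagrees with it."""
    lines = [f"{edge['claim']} (sources: {', '.join(edge['sources'])})"]
    if edge.get("is_disputed"):
        for other_id in edge.get("disputed_with", []):
            other = edges_by_id[other_id]
            lines.append(
                f"DISPUTED BY: {other['claim']} "
                f"(sources: {', '.join(other['sources'])})")
    return lines

# Illustrative edges, not real engine output.
edges = {
    "e1": {"claim": "Roland married Aldric", "sources": ["chronicle-a.md"],
           "is_disputed": True, "disputed_with": ["e2"]},
    "e2": {"claim": "Roland was allied with Aldric", "sources": ["chronicle-b.md"],
           "is_disputed": True, "disputed_with": ["e1"]},
}

rendered = render_claim(edges["e1"], edges)
for line in rendered:
    print(line)
```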
## Rule 6 — Setting filter for cross-setting questions
**What:** when the user asks a question that could mix
multiple settings, the host must pass `setting=<id>`
explicitly.
**Why:** the slice 6.5 setting filter exists exactly to
prevent cross-setting bleed. A query for "events in the
3rd Age" should not return events from both `mardonari`
and `the_wild_dream` if the user only meant one.
**Test:** `test_6_5_setting_filter_on_was_true_at` (slice
6.5) pins the server-side rule: the filter is additive, and
`setting=None` (the default) keeps the single-setting
behaviour. The client-side rule is "any
question whose answer could cross settings should
pass `setting=<id>`".
**Failure mode:** the user asks "What happened in the
3rd Age?" and the LLM returns events from both settings
without distinction. The user doesn't know which
setting each event belongs to.
**Mitigation:** the host should track the "active
setting" in the conversation context. When the user
mentions a setting name (e.g. "in Mardonari"), the host
sets the active setting. When the user asks a question
without a setting, the host either asks "which
setting?" or uses the conversation's active setting
explicitly.
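Tracking the active setting can be sketched with a small conversation object. The two setting ids come from INTEGRATION.md; the `Conversation` class and its naive substring matching are hypothetical, and a real host would likely resolve setting names more carefully.

```python
KNOWN_SETTINGS = ("mardonari", "the_wild_dream")

class Conversation:
    """Remembers the last setting the user mentioned."""
    def __init__(self):
        self.active_setting = None

    def observe(self, user_text):
        # Naive match: lowercase and join words so "The Wild Dream" matches its id.
        normalized = user_text.lower().replace(" ", "_")
        for setting in KNOWN_SETTINGS:
            if setting in normalized:
                self.active_setting = setting

    def scoped(self, arguments):
        """Add setting=<id> to tool arguments, or signal that the host must ask."""
        if self.active_setting is None:
            raise LookupError("no active setting; ask the user which setting they mean")
        return {**arguments, "setting": self.active_setting}

chat = Conversation()
chat.observe("What happened in Mardonari during the 3rd Age?")
print(chat.scoped({"era": "3rd_age"}))
```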
## Rule 7 — Reload contract
**What:** after every `reload_templates` call, the host
must treat the response's tool list as the new canonical
state.
**Why:** the template registry may have added, removed,
or modified tools. A host that holds a stale tool list
will dispatch to non-existent tools or miss new ones.
**Test:** no automated test on the server side (the
server's `reload_templates` always returns the new list).
The client-side test is "after every `reload_templates`
call, re-fetch `tools/list` and re-validate the host's
tool registry".
**Failure mode:** the world-builder adds a new template
and calls `reload_templates`. The host doesn't re-fetch
the tool list. The LLM caller tries to call
`list_missions` and gets `method_not_found`. The LLM
hallucinates an answer.
**Mitigation:** the host's `reload_templates` handler
should:
1. Call `reload_templates` on the engine.
2. Re-call `tools/list`.
3. Replace the local tool registry.
4. Re-validate any in-flight conversations (or surface
a "tools have changed" notice to the user).
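The four steps above can be sketched as a registry object that wraps the host's RPC function. `ToolRegistry` and `FakeEngine` are hypothetical; the fake stands in for a live engine so the sketch runs on its own.

```python
class ToolRegistry:
    """Host-side copy of the engine's tool list."""
    def __init__(self, rpc):
        self._rpc = rpc          # callable(method, params) -> JSON-RPC response dict
        self.tools = {}

    def refresh(self):
        response = self._rpc("tools/list", {})
        self.tools = {t["name"]: t for t in response["result"]["tools"]}

    def reload_templates(self):
        """Reload on the engine, re-fetch, and report what changed."""
        before = set(self.tools)
        self._rpc("tools/call", {"name": "reload_templates", "arguments": {}})
        self.refresh()
        after = set(self.tools)
        return sorted(after - before), sorted(before - after)

class FakeEngine:
    """Stand-in engine: one template is swapped for another on reload."""
    def __init__(self):
        self.names = ["lookup", "get_mission"]
        self.pending = ["lookup", "list_quirks"]

    def __call__(self, method, params):
        if method == "tools/call" and params["name"] == "reload_templates":
            self.names = self.pending
        return {"result": {"tools": [{"name": n} for n in self.names]}}

registry = ToolRegistry(FakeEngine())
registry.refresh()
added, removed = registry.reload_templates()
print(added, removed)  # ['list_quirks'] ['get_mission']
```

A non-empty `added` or `removed` list is the signal to re-validate in-flight conversations or show the "tools have changed" notice.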
## Acceptance criteria
A host module is "good" when it satisfies all 7 rules.
The minimum acceptance suite:
```python
# test_host_compliance.py
import json, subprocess, sys

def test_host_uses_lookup_first():
    """Every Type 1-4 question's tool trace must include lookup."""
    ...

def test_host_cites_every_claim():
    """Every claim in the response must include ≥1 source."""
    ...

def test_host_time_windows_every_fact_query():
    """Every fact query must pass at_time; the response surfaces it."""
    ...

def test_host_does_not_resolve_contradictions():
    """Any response mentioning is_disputed and presenting as settled is rejected."""
    ...

def test_host_passes_setting_for_cross_setting():
    """Cross-setting questions must pass setting=<id> explicitly."""
    ...

def test_host_refetches_tools_list_on_reload():
    """After reload_templates, the host's tool registry must match the engine's."""
    ...
```
The full harness (`tests/harness/test_questions.py` +
`tests/harness/test_system_prompt.py` + the slice 7.3
runner) is the regression net. The host is "good" when
its 50-question run scores:
- tool-selection accuracy ≥80%
- citation rate ≥90%
- hallucination rate <5%
- time-window violation rate <5%
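
The four thresholds above can be checked mechanically once a run has been scored. The sketch below is illustrative only: `HarnessScores` and `failed_thresholds` are hypothetical names, and it assumes each metric is reported as a fraction between 0 and 1. How the slice 7.3 runner actually reports its scores is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HarnessScores:
    """Scores from one 50-question run, each as a fraction in [0, 1]."""
    tool_selection_accuracy: float
    citation_rate: float
    hallucination_rate: float
    time_window_violation_rate: float


def failed_thresholds(scores: HarnessScores) -> List[str]:
    """Return the names of the metrics that miss the contract's thresholds."""
    failures: List[str] = []
    # "At least" thresholds: the score must reach the floor.
    if scores.tool_selection_accuracy < 0.80:
        failures.append("tool_selection_accuracy")
    if scores.citation_rate < 0.90:
        failures.append("citation_rate")
    # "Strictly below" thresholds: a score of exactly 5% is a failure.
    if not scores.hallucination_rate < 0.05:
        failures.append("hallucination_rate")
    if not scores.time_window_violation_rate < 0.05:
        failures.append("time_window_violation_rate")
    return failures


if __name__ == "__main__":
    run = HarnessScores(0.82, 0.94, 0.02, 0.04)
    print(failed_thresholds(run) or "host is good")
```

An empty list means the host is "good"; a non-empty list names the metrics to investigate.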
## What the engine promises
The engine is a fixed-target service from the host's
point of view. The promises:
- **Wire protocol is JSON-RPC 2.0** (per the MCP
specification). Every tool call is a single
request/response. No streaming, no async.
- **Tool names are stable** within a major version.
A tool's `name` and `inputSchema` are versioned
together; a host that calls a v1.2 tool against a
v1.1 engine gets `invalid_params` (schema mismatch)
or `method_not_found` (tool not present in that version).
- **Tool responses are JSON objects with a stable
shape.** The `sources`, `at_time`, `valid_from`,
`valid_until`, `is_disputed` fields are guaranteed.
New fields may be added in minor versions; the host
should ignore unknown fields.
- **Errors are JSON-RPC standard.** Invalid params,
method not found, internal error — each maps to a
standard `code` (per the JSON-RPC spec) and a
human-readable `message`. The host can branch on
the code without parsing the message.
- **Idempotency:** `lookup`, `entity_context`,
`was_true_at`, and the other read tools are pure: the
same arguments return the same response as long as the
graph has not changed. Write tools are idempotent for
identical args: re-running `add_entity` with the same
args is a no-op, while re-running it with different
args is an error.
- **Hot-reload:** the engine supports `reload_templates`
at any time. The response is the new tool list. The
host can call this between conversations or even
mid-conversation (the active conversation's next
`tools/list` call returns the new list).
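
The wire-format and error promises above suggest a small host-side dispatcher. The following is a sketch under stated assumptions: the numeric codes are the standard ones from the JSON-RPC 2.0 specification, but the action names, `build_tool_call`, and `classify_response` are hypothetical, and the `result` object is treated as the already-unwrapped tool response (any MCP content envelope is left out).

```python
import json
from typing import Any, Dict

# Standard JSON-RPC 2.0 error codes (from the JSON-RPC specification).
METHOD_NOT_FOUND = -32601
INVALID_PARAMS = -32602
INTERNAL_ERROR = -32603

# Fields this contract guarantees on a tool response.
GUARANTEED_FIELDS = ("sources", "at_time", "valid_from",
                     "valid_until", "is_disputed")

# Hypothetical host-side reactions, keyed by error code.
ACTIONS = {
    METHOD_NOT_FOUND: "refresh_tool_list",
    INVALID_PARAMS: "fix_arguments",
    INTERNAL_ERROR: "report_engine_error",
}


def build_tool_call(request_id: int, name: str,
                    arguments: Dict[str, Any]) -> str:
    """Serialise one tool call as a single JSON-RPC 2.0 request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def classify_response(raw: str) -> Dict[str, Any]:
    """Branch on the error code, never on the human-readable message."""
    message = json.loads(raw)
    error = message.get("error")
    if error is not None:
        code = error["code"]
        return {"ok": False, "code": code,
                "action": ACTIONS.get(code, "report_unknown_error")}
    result = message["result"]
    # Keep the guaranteed fields and ignore anything unknown, so fields
    # added in a later minor version do not break the host.
    fields = {key: result[key] for key in GUARANTEED_FIELDS if key in result}
    return {"ok": True, "fields": fields}


if __name__ == "__main__":
    print(build_tool_call(1, "lookup", {"query": "example"}))
    print(classify_response(
        '{"jsonrpc": "2.0", "id": 1, '
        '"error": {"code": -32601, "message": "Method not found"}}'))
```

Mapping `method_not_found` to a tool-list refresh ties this back to the reload rule: a stale registry is the most likely cause of that code.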
## Versioning
| Engine version | Contract version | Notes |
|---|---|---|
| v1.0 (slice 0–3) | 1.0 | Initial tool surface (12 read tools) |
| v1.1 (slice 4–5T) | 1.1 | + 12 read tools + 14 template tools = 38 |
| v1.2 (slice 6–11) | 1.2 | + 6 read tools + 4 write tools + setting filter + Streamable HTTP transport |
| v2.0 (planned) | 2.0 | Cypher-write templates, cross-LLM benchmarks, UI |

The contract version is the same as the engine's schema
version (the `schema_version` field on the `Setting`
node, per slice 6.4). Hosts that target v1.2 will not
work against a v1.1 engine (missing tools) or a v2.0
engine (renamed/removed tools). The mismatch surfaces
as a `method_not_found` or `invalid_params` error on
the first call.
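
A host can check compatibility up front instead of waiting for the first failed call. The sketch below is an assumption-laden illustration: `parse_version` and `engine_supports` are hypothetical helpers, the version string is assumed to have the `major.minor` form shown in the table, and the rule that a newer minor engine still serves an older-minor host is inferred from the "stable within a major version" promise rather than stated by the contract. How the host reads `schema_version` from the `Setting` node is not shown.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Split a 'major.minor' contract version such as '1.2' into integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def engine_supports(host_target: str, engine_schema_version: str) -> bool:
    """Return True when the engine should satisfy the host's target contract.

    Assumed rule: same major version, and the engine's minor version is at
    least the host's, so every tool the host expects is present.
    """
    host_major, host_minor = parse_version(host_target)
    engine_major, engine_minor = parse_version(engine_schema_version)
    return host_major == engine_major and engine_minor >= host_minor


if __name__ == "__main__":
    for engine in ("1.1", "1.2", "2.0"):
        print(engine, engine_supports("1.2", engine))
```

Failing fast at startup gives the user a clear version message rather than a `method_not_found` surfacing in the middle of a conversation.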
## Out of scope (deferred)
- **Streaming.** The engine does not support
`tools/call` with server-sent events. Long-running
queries (e.g. cross-codex searches) block until
complete. Streaming is a v2.0 follow-up.
- **Authentication.** The stdio transport is local-only
(no auth). The Streamable HTTP transport runs in a
Docker container with a 1 MiB body cap and a
loopback bind by default; production deployments
should add a reverse proxy with auth.
- **Multi-tenant.** A single engine instance holds one
graph. Multi-tenant (one engine, multiple worlds)
is a v2.0 follow-up; the v1.2 model is
multi-*setting* within one world.
- **UI for failure-mode review.** The slice 7.4
red-team suite produces a failure-mode log; a UI to
review it is a v2.0 follow-up.
## Cross-references
- [`INTEGRATION.md`](./INTEGRATION.md) — the practical
how-to guide (companion to this contract)
- [`docs/07-reasoning-harness.md`](./07-reasoning-harness.md) — the
5 question types and 6 failure modes
- [`docs/05-mcp-tools.md`](./05-mcp-tools.md) — the full
tool catalog with response shapes
- [`docs/19-retcon-policy.md`](./19-retcon-policy.md) —
the retcon + mark_verified audit policy
- [`docs/20-multi-setting-policy.md`](./20-multi-setting-policy.md) —
the cross-setting rules
- [`docs/adr/0011-graph-backend-protocol.md`](./adr/0011-graph-backend-protocol.md) —
the `GraphBackend` Protocol (informs the engine's
substrate promise)
- [`docs/adr/0012-typetemplate-polymorphism.md`](./adr/0012-typetemplate-polymorphism.md) —
the slice 5T TypeTemplate layer
- `prompts/system_prompt.md` in the poc repo — the
system prompt given to the LLM caller
- `tests/harness/questions.yaml` in the poc repo — the
50-question regression net