docs(v2): T8 — update README + CHANGELOG + 3 worked-example docs

- README.md: 5 plugins / 19 tools (matches /healthz); 'what this proves'
  now lists consistency engine, multi-world namespace, LLM consumer;
  'next steps' section replaced with 'shipped in v2'
- docs/CONSISTENCY_DEMO.md: 4 tools, 5 violations, all output verified
  against live bash examples/test_consistency.sh
- docs/MULTI_WORLD_DEMO.md: list_worlds() + entity_context in both
  worlds + cross-world isolation tests, all output verified live
- docs/LLM_CONSUMER_DEMO.md: 5 question types, 9 distinct tools, all
  output traced to examples/results/*.json
- CHANGELOG.md: v1 -> v2 entry, all 9 task refs (T1-T9)
- examples/test_e2e.sh: T7 E2E validation script (untracked)
kanban-dev
2026-06-17 00:45:30 +00:00
parent bcda8eff00
commit 99535a8f3a
6 changed files with 1231 additions and 18 deletions

165
CHANGELOG.md Normal file

@@ -0,0 +1,165 @@
# Changelog
All notable changes to `lore-engine-poc` are recorded here. The format
follows [Keep a Changelog](https://keepachangelog.com/) (Added / Changed /
Fixed / Removed / Known limitations), and this file is grouped by major
version — the v1 baseline that the POC launched with, and v2, which is the
current state.
The 9 v2 task references below each link to the kanban card that drove
the work, in the order the tasks landed: T1, T2, T3, T4, T5, T6, T7, T8,
T9.
---
## [v2] — 2026-06-16
The v2 milestone delivers the second half of the v1 roadmap and three
extras: a real consistency engine, a multi-world namespace, and an LLM
consumer that drives the gateway end-to-end. v2 is what
`bash test.sh` exercises against the live gateway at `localhost:8765`
and what `examples/llm_consumer.py` drives from the LiteLLM proxy.
### Added
- **`plugins/embeddings.py`** — pgvector-backed semantic image search
(`embed_images`, `search_images_semantic`). Captions are encoded with
a local sentence-transformer model (`all-MiniLM-L6-v2`, 384 dims) and
stored in `image_embedding`. Queries are matched via pgvector cosine
distance (`<=>`). Background embedding on `register_image`; `embed_images`
is idempotent. v2.T2.
- **`plugins/consistency.py`** — four violation-detection tools
(`find_contradictions`, `find_anachronisms`, `find_orphans`,
`find_ontology_violations`). Returns a `{violations, count}` envelope
per call. Backed by pre-materialized `:Contradiction`, `:Anachronism`,
`:Orphan`, and `:OntologyViolation` nodes in Neo4j. The seed
(`seed.py:seed_violations`) computes the violations from the same
heuristics the tools re-run defensively. v2.T3 (skeleton) + v2.T5
(real rules).
- **`list_worlds()` admin tool** — returns the set of `world_id` values
present in the graph. Read by `bash test.sh` section 12 and by the
v2.T7 E2E validation suite. v2.T6.
- **`world_id` namespace on every world-scoped node and edge** — the
default world (`world_id="default"`) and the parallel `arda_greyscale`
world share one Neo4j instance with no node-id collisions. Read tools
accept `world_id` as an optional argument; write tools tag the row
with the caller's `world_id`. v2.T6.
- **Parallel world seed: `arda_greyscale`** — `seed.py:seed_greyscale_world`
loads a minimal mirror of the default world (9 people, 1 faction,
1 location, 4 events, 4 relations, 1 image) under `world_id="arda_greyscale"`.
Idempotent. v2.T6.
- **LLM consumer (`examples/llm_consumer.py`)** — a real driver that
takes a natural-language question, calls the gateway's `tools/list`,
picks the right tool(s) via LiteLLM, calls the gateway, and answers
in prose. 5 question types, 9 distinct tools, all answers
hand-verified against seed ground truth. v2.T4.
- **E2E validation (`examples/test_e2e.sh` + `examples/E2E_REPORT.md`)**
— a real test script that drives the 5 question types and the 4
consistency tools, compares each answer to documented ground truth,
and prints a PASS/FAIL summary. v2.T7.
- **CI smoke (`scripts/ci-smoke.sh` + `docs/SMOKE.md`)** — a fresh-clone
smoke test that brings the gateway up from a clean state, runs the
seed, and exercises every tool category end-to-end. v2.T1.
- **v2 docs** — `docs/CONSISTENCY_DEMO.md` (5 hand-crafted violations
from the live seed), `docs/MULTI_WORLD_DEMO.md` (the 2-world seed in
action), `docs/LLM_CONSUMER_DEMO.md` (the 5 question types in detail).
This file. v2.T8.
- **Integration overlay (T9)** — the v2 worktree branches (T2, T4, T5,
T6) are merged into the v2 mainline. `bash test.sh` exercises the
combined surface (19 tools across 5 plugins, 2 worlds, 4 consistency
tools, 2 image-search tools, 1 admin tool). v2.T9.
### Changed
- **README.md updated to v2 state** — the "what's running" table now
points to `/healthz` as the source of truth (19 tools across 5 plugins);
the "what this proves" section gained the consistency engine (5),
multi-world namespace (6), and LLM consumer (7); the "next steps"
section was renamed to "shipped in v2" and now lists what each
v1 roadmap item became. v2.T8.
- **`bash test.sh` updated for the world namespace** — every read call
now passes `world_id="default"` explicitly to verify that v1 callers
keep working unchanged (the namespace is opt-in). Added a 12th section
that calls `list_worlds()`. v2.T6.
- **`seed.py` grew two new stages** — `seed_greyscale_world` (the
parallel world, v2.T6) and `seed_violations` (5 hand-crafted
violations, v2.T5). Both are idempotent and safe to re-run.
- **`tests/test_consistency.py` and `tests/test_multi_world.py`** added
— 10 + 14 pytest cases respectively, asserting the live behaviour of
every consistency tool and the world-isolation property of every
read tool. v2.T5, v2.T6.
- **`tests/test_embeddings_*.py` and `tests/test_register_image_hook.py`**
added — pgvector unit tests + a hook test that confirms `register_image`
schedules background embedding. v2.T2.
### Known limitations (v2 → v3)
These are deliberate v2 boundaries; the v3 plan will address them:
- **No world-builder UI.** Everything is `curl` and `cypher-shell`. The
v2 dashboard is a separate repo. v3.
- **No reflective memory or behavior layer.** The Stanford Generative
Agents pattern (memory stream + reflection + planning) is a v3
borrow per `lore-engine/docs/16-comparison.md`. v3.
- **Consistency engine is rule-driven, not ML-driven.** The five
hand-crafted violations in v2 are seeded; an ML-derived detection
surface (e.g. an LLM pass over the world summary) is a v3 item. v3.
- **No refresh / cache invalidation on world reseed.** If a world is
re-seeded, the embeddings for any new image manifest rows are computed
on the next `register_image` or `embed_images` call; old embeddings
are kept. A v3 refresh tool would let an operator force a full
re-embed. v3.
---
## [v1] — 2026-06-16 (baseline)
The initial proof of concept. Five-minute goal: prove that with mock
data, we can run a multi-database backend (Neo4j + Postgres + MinIO) and
expose it all through a plugin-driven MCP gateway where adding a new
domain type is a new file in `plugins/`, not a Go change.
### Added
- `docker-compose.yml` — Neo4j 5.26, Postgres (later upgraded to
pgvector in v2.T2), MinIO, and the gateway container.
- `seed.py` — idempotent seeder for the default world (3 eras, 10 people,
3 factions, 4 locations, 4 items, 6 events, 1 lineage group, ~20
time-bounded relations, 3 trade log entries, 4 generated images).
- `plugins/world.py` — `entity_context`, `was_true_at`, `state_at`
(Neo4j).
- `plugins/lineage.py` — `ancestors_of`, `descendants_of`, `lineage_of`
(Neo4j).
- `plugins/trade.py` — `log_trade`, `trades_by_buyer`, `market_price`
(Postgres).
- `plugins/images.py` — `register_image`, `recall_images`,
`search_images_by_caption` (MinIO + Postgres + Neo4j).
- `server.py` — the MCP-compatible JSON-RPC gateway, auto-loading every
`.py` file in `plugins/`.
- `bash test.sh` — the end-to-end smoke runner (11 sections in v1; v2.T6 added the 12th).
- `README.md` (v1) — the original POC writeup.
### Known limitations (v1 → v2)
- Stub consistency tools (no detection rules).
- No semantic image search.
- No LLM in the loop.
- Single world, no namespace.
All four items were addressed in v2.

112
README.md

@@ -13,7 +13,7 @@ Five-minute goal: prove that with mock data, we can run a multi-database backend
| `lore-minio` | `minio/minio:latest` | 9000 (S3), 9001 (console) | Image blob storage |
| `lore-gateway` | built locally | 8765 (MCP JSON-RPC) | The plugin-driven gateway |
## The four plugins (this is the proof)
## The five plugins (this is the proof)
```
plugins/
@@ -22,10 +22,22 @@ plugins/
├── trade.py # log_trade, trades_by_buyer, market_price (Postgres)
├── images.py # register_image, recall_images, search_images_by_caption
│ # (MinIO + Postgres + Neo4j)
└── embeddings.py # embed_images, search_images_semantic (Postgres + pgvector)
├── embeddings.py # embed_images, search_images_semantic (Postgres + pgvector)
└── consistency.py # find_contradictions, find_anachronisms, find_orphans,
# find_ontology_violations (Neo4j)
```
Each plugin is a single file with a `register(registry)` entry point. The gateway auto-loads every `.py` file in `plugins/` at startup. **No server.py change needed to add a new tool** — drop a new file in, restart the container, the new tools appear in `tools/list`.
The gateway also exposes one admin tool for the world namespace: `list_worlds`.
Tool counts and plugin membership are reported live by the gateway itself —
`curl -s http://localhost:8765/healthz` returns the canonical list. As of v2
the healthz reports 19 tools across the 5 plugins above. See
`docs/LLM_CONSUMER_DEMO.md` for an end-to-end driver that exercises them.
Each plugin is a single file with a `register(registry)` entry point. The
gateway auto-loads every `.py` file in `plugins/` at startup. **No server.py
change needed to add a new tool** — drop a new file in, restart the
container, the new tools appear in `tools/list`.
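
The plugin contract described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the gateway's actual code: the `Registry` class, its `add_tool` method, and the `load_plugins` helper are hypothetical names. Only the `register(registry)` entry point and the `plugins/` directory come from this README.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Dict


class Registry:
    """Hypothetical tool registry; the real gateway's class may look different."""

    def __init__(self) -> None:
        self.tools: Dict[str, Callable[..., object]] = {}

    def add_tool(self, name: str, fn: Callable[..., object]) -> None:
        self.tools[name] = fn


def load_plugins(plugin_dir: Path, registry: Registry) -> None:
    """Import every .py file in plugin_dir and call its register(registry)."""
    for path in sorted(plugin_dir.glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Files without a register() entry point are skipped rather than fatal.
        if hasattr(module, "register"):
            module.register(registry)
```

Under this sketch, adding a domain type means dropping one more file with a `register` function into the directory; nothing in the loader changes.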
## How to run it
@@ -51,6 +63,11 @@ The `seed.py` script is idempotent (uses `MERGE` and `ON CONFLICT`). It loads:
- ~20 time-bounded relations
- 3 trade log entries
- 4 generated images (portraits + landscape + battle scene) uploaded to MinIO
- 5 hand-crafted consistency violations pre-materialized as `:Contradiction`,
`:Anachronism`, `:Orphan`, and `:OntologyViolation` nodes (see
`docs/CONSISTENCY_DEMO.md`)
- 1 parallel world, `arda_greyscale` — a minimal mirror of the default
world with no overlapping node ids (see `docs/MULTI_WORLD_DEMO.md`)
## Try the gateway
@@ -166,30 +183,89 @@ curl -s -X POST http://localhost:8765/mcp \
## What this proves
1. **The plugin boundary works.** A new domain type (trade, images) is a new file in `plugins/`. No change to `server.py`, no change to docker-compose, no new container. Restart the gateway and the new tools are live.
1. **The plugin boundary works.** A new domain type (trade, images, embeddings,
consistency) is a new file in `plugins/`. No change to `server.py`, no change
to docker-compose, no new container. Restart the gateway and the new tools
are live. The `consistency` plugin (added in v2.T5) is the most recent
example — four violation-detection tools, all in one file.
2. **Polyglot storage is real, not aspirational.** Neo4j holds the typed world graph. Postgres holds the time-series operational data and image manifests. MinIO holds the image bytes. Each store does what it's good at; the gateway composes the answers.
2. **Polyglot storage is real, not aspirational.** Neo4j holds the typed world
graph. Postgres holds the time-series operational data, image manifests, and
the `image_embedding` vectors (pgvector). MinIO holds the image bytes. Each
store does what it's good at; the gateway composes the answers.
3. **Time is a first-class query primitive.** `was_true_at` checks time-bounded edges with a single Cypher query — no LLM, no inference. Year-level precision works against the mock data (see `2nd_age.year_230` example above).
3. **Time is a first-class query primitive.** `was_true_at` checks time-bounded
edges with a single Cypher query — no LLM, no inference. Year-level
precision works against the mock data (see `2nd_age.year_230` example above).
4. **Image recall works.** Images are stored in MinIO, linked to entities in Neo4j (`(:Image)-[:DEPICTS]->(:Person)`), and discoverable by entity id, by tag, or by caption substring search. Presigned URLs are generated on the fly.
4. **Image recall works.** Images are stored in MinIO, linked to entities in
Neo4j (`(:Image)-[:DEPICTS]->(:Person)`), and discoverable by entity id, by
tag, by caption substring search, or by natural-language description via the
`search_images_semantic` (pgvector) tool. Presigned URLs are generated on
the fly.
5. **The world is small but real.** 10 people, 6 events, 4 images, ~20 relations — enough to demonstrate the architecture end-to-end. Scaling is a separate problem; this is the proof of shape.
5. **The consistency engine is real.** The four `find_*` tools query
pre-materialized violation nodes in Neo4j and return structured
`{violations, count}` envelopes — not booleans, not error strings. The
`seed.py:seed_violations` step computes the violations from the same
heuristics (overlapping `MEMBER_OF` windows, `Person.born > event_year`,
orphan entities, and `:OntologyRule`-driven checks) so the math is visible
in plain Python — not hidden in Cypher. See `docs/CONSISTENCY_DEMO.md` for
the five hand-crafted violations the seed surfaces.
6. **Multiple worlds live in one graph.** Every world-scoped node and edge
carries a `world_id` property, and the read tools accept a `world_id`
argument (defaulting to `"default"`). The v2.T6 seed loads a parallel
`arda_greyscale` world with no overlapping node ids, and
`list_worlds()` returns both. See `docs/MULTI_WORLD_DEMO.md` for the
worked example.
7. **An LLM can drive the whole surface.** `examples/llm_consumer.py` is a
real driver that takes a natural-language question, calls the gateway's
`tools/list`, picks the right tool(s), and answers in prose — all wired
through the local LiteLLM proxy. 5 question types × 9 distinct tools
exercised, all answers hand-verified against the seed. See
`docs/LLM_CONSUMER_DEMO.md` and `examples/REPORT.md`.
8. **The world is small but real.** 10 people + 9 greyscale-world people, 6
events, 5 images (4 default + 1 greyscale), ~20 relations — enough to
demonstrate the architecture end-to-end across two parallel worlds.
Scaling is a separate problem; this is the proof of shape.
## What's not in this POC
- **No LLM in the loop.** The MCP gateway is a tool server; the LLM client (Claude, GPT, anything) is the consumer. This is intentional — the POC validates the data and tool layers, not the LLM reasoning. The reasoning harness is in the design docs (`lore-engine/docs/07-reasoning-harness.md`) and would be added as a system prompt in a real deployment.
- **No LLM in the loop at runtime — the LLM consumer is a separate
example.** The MCP gateway itself is a tool server; the LLM client
(Claude, GPT, anything reachable via the LiteLLM proxy) is the consumer.
This is intentional — the POC validates the data and tool layers, not the
LLM reasoning. The reasoning harness is in the design docs
(`lore-engine/docs/07-reasoning-harness.md`); `examples/llm_consumer.py`
implements v1.1 of that harness against the live gateway.
- **Consistency detection is real (v2.T5).** The 4 tools (`find_contradictions`, `find_anachronisms`, `find_orphans`, `find_ontology_violations`) query pre-materialized violation nodes in Neo4j. The seed (`seed.py:seed_violations`) computes the violations from the same heuristics (overlapping `MEMBER_OF` windows, `Person.born > event_year`, world entities with no relations, and `:OntologyRule`-driven checks) so the math is visible in plain Python — not hidden in Cypher.
- **No world-builder UI.** Everything is `curl` and `cypher-shell`. The UI
is a v3 feature.
- **No world-builder UI.** Everything is `curl` and `cypher-shell`. The UI is a v2 feature.
- **No reflective memory or behavior layer.** The Stanford Generative Agents
pattern (memory stream + reflection + planning) is a v3 borrow per the
comparison in `lore-engine/docs/16-comparison.md`.
- **No reflective memory or behavior layer.** The Stanford Generative Agents pattern (memory stream + reflection + planning) is a v2 borrow per the comparison in `lore-engine/docs/16-comparison.md`.
## Shipped in v2
## Next steps after this POC
What was on the v1 "next steps" list, and what it became in v2:
- ~~Implement the consistency detection rules behind the 4 stub tools (T5).~~ **Done.**
- Add the embedding-based semantic search plugin (uses the `Image.caption` and any future `Person.summary` text).
- Add an LLM client that consumes the gateway with the reasoning harness system prompt and runs the 5 question types from the design.
The v1 design in `lore-engine/docs/` is the contract. This POC is the proof of shape.
- ~~Implement the consistency detection rules behind the 4 stub tools
(T5).~~ **Done** — see `plugins/consistency.py` and
`docs/CONSISTENCY_DEMO.md`. 4 tools, 5 violations surfaced from the seed.
- ~~Add the embedding-based semantic search plugin (uses the `Image.caption`
and any future `Person.summary` text).~~ **Done** — see `plugins/embeddings.py`
and `docs/LLM_CONSUMER_DEMO.md`. 384-dim MiniLM, pgvector cosine distance,
background embedding on `register_image`.
- ~~Add an LLM client that consumes the gateway with the reasoning harness
system prompt and runs the 5 question types from the design.~~ **Done** —
see `examples/llm_consumer.py` and `examples/REPORT.md`. 5 questions, 9
distinct tools, all hand-verified against seed ground truth.
- **v2 extras** not on the v1 list: the multi-world namespace with the
`arda_greyscale` parallel seed (T6); the `:OntologyViolation` rule-driven
detection in addition to the original three classes (T5); and a fresh-clone
smoke test (`scripts/ci-smoke.sh`) that exercises the gateway end-to-end
from a clean state (T1).

210
docs/CONSISTENCY_DEMO.md Normal file

@@ -0,0 +1,210 @@
# Consistency Engine — Worked Example
This is a live end-to-end run of the four consistency tools that landed in v2.T5.
Everything below is real tool output from `bash examples/test_consistency.sh`
against the current gateway at `localhost:8765`, taken from the v2 build
(`8261c2d` on `wt/t5-consistency-impl`).
## What the engine does
The consistency engine has four read-only tools, each backed by pre-materialized
violation nodes in Neo4j. The seed (`seed.py:seed_violations`) computes the
violations from the same heuristics the tools re-run defensively, so every
violation id is stable, the math is visible in plain Python, and an operator
can re-derive any flagged issue by hand from the seed.
| Tool | Neo4j label | Live count (this run) |
|---|---|---|
| `find_contradictions` | `:Contradiction` | 1 |
| `find_anachronisms` | `:Anachronism` | 1 |
| `find_orphans` | `:Orphan` | 1 |
| `find_ontology_violations` | `:OntologyViolation` | 2 |
| **Total** | | **5** |
All four tools support an optional `severity` argument (`"any"`, `"error"`,
`"warn"`), and the world-scoped read tools accept `world_id="default"`.
The default world contains the violations; the `arda_greyscale` world is
clean (its seed doesn't inject any hand-crafted ones).
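
For readers who prefer Python over `curl`, the request body and the `{violations, count}` envelope can be handled roughly as below. This is a sketch with no network call. The helper names `build_tool_call` and `filter_by_severity` are illustrative and not part of the repo; the client-side filter only mimics what the `severity` argument is described as doing.

```python
import json
from typing import Any, Dict, List, Optional


def build_tool_call(name: str, world_id: str = "default",
                    severity: Optional[str] = None, request_id: int = 1) -> str:
    """Build the JSON-RPC body for a tools/call request (illustrative helper)."""
    arguments: Dict[str, Any] = {"world_id": world_id}
    if severity is not None:
        arguments["severity"] = severity
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def filter_by_severity(envelope: Dict[str, Any], severity: str) -> Dict[str, Any]:
    """Client-side stand-in for the severity filter; "any" keeps everything."""
    kept: List[Dict[str, Any]] = [
        v for v in envelope["violations"]
        if severity == "any" or v["severity"] == severity
    ]
    # Recompute count so the envelope stays internally consistent.
    return {"violations": kept, "count": len(kept)}
```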
## 1. Contradictions — overlapping faction memberships
A `:Contradiction` is a pair of `MEMBER_OF` relations on the same person
whose `[valid_from, valid_until]` windows overlap but whose target factions
differ. It's the classic "sworn to two houses at once" case.
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"find_contradictions","arguments":{"world_id":"default"}}
}'
```
```json
{
"violations": [
{
"id": "c_aldric_double_membership",
"label": "Contradiction",
"severity": "error",
"status": "open",
"details": "Aldric Raventhorne is MEMBER_OF House Vyr (240-) and MEMBER_OF Crimson Pact (260-285); the two memberships overlap.",
"detected_at": "2026-06-16T23:04:51.238226Z"
}
],
"count": 1
}
```
The math: Aldric's `MEMBER_OF` House Vyr opens at year 240 with no end date.
His `MEMBER_OF` Crimson Pact runs 260–285. The two windows overlap from 260
to 285. He can't be a sworn member of both houses at once.
The seed source is `seed.py:c_aldric_double_membership` — see the
`Aldric Raventhorne` relations block in `seed_world_default` for the
underlying `MEMBER_OF` rows.
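
The overlap arithmetic can be reproduced by hand. The sketch below assumes inclusive year windows with `None` standing for "no end date"; `overlap_span` is an illustrative helper, not a function from `seed.py` or the plugin.

```python
from typing import Optional, Tuple

Window = Tuple[int, Optional[int]]  # (valid_from, valid_until); None means open-ended


def overlap_span(a: Window, b: Window) -> Optional[Window]:
    """Return the shared (start, end) span of two inclusive windows, or None."""
    start = max(a[0], b[0])
    ends = [end for end in (a[1], b[1]) if end is not None]
    end = min(ends) if ends else None  # both open-ended: the overlap is open too
    if end is not None and start > end:
        return None
    return (start, end)


house_vyr = (240, None)    # MEMBER_OF House Vyr: opens at 240, no end date
crimson_pact = (260, 285)  # MEMBER_OF Crimson Pact
print(overlap_span(house_vyr, crimson_pact))  # -> (260, 285)
```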
## 2. Anachronisms — a person at an event before they were born
An `:Anachronism` is a `:PARTICIPATED_IN` (or similar) relation between a
person and an event where `event.in_fiction_time` is before `person.born`.
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"find_anachronisms","arguments":{"world_id":"default"}}
}'
```
```json
{
"violations": [
{
"id": "a_vex_at_founding",
"label": "Anachronism",
"severity": "error",
"status": "open",
"details": "Vex the Silent (born 180) is recorded as participating in the Founding of House Vyr (year 85) — 95 years before his birth.",
"detected_at": "2026-06-16T23:04:51.238226Z"
}
],
"count": 1
}
```
Vex the Silent, born in 180, is tagged as a participant in the
"Founding of House Vyr" event in year 85. The Cypher check joins the
`PARTICIPATED_IN` edge to the person's `born` property and the event's
`in_fiction_time`, extracted as an integer year.
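
The same comparison can be sketched in plain Python. Two assumptions are made here: the time slug follows the `<era>.year_<n>` shape seen in `2nd_age.year_230`, and year numbers are comparable across eras. The era name `1st_age` used in the test data is a guess, and both helpers are illustrative rather than the plugin's code.

```python
import re
from typing import Optional

_YEAR_RE = re.compile(r"year_(\d+)$")


def year_of(time_slug: str) -> Optional[int]:
    """Extract the integer year from a slug such as '2nd_age.year_230'."""
    match = _YEAR_RE.search(time_slug)
    return int(match.group(1)) if match else None


def is_anachronism(person_born: int, event_time: str) -> bool:
    """True when the event happened before the participant was born."""
    event_year = year_of(event_time)
    # An unparseable slug is treated as "cannot tell", not as a violation.
    return event_year is not None and event_year < person_born
```

With Vex's birth year of 180 and an event at year 85, `is_anachronism` returns `True`, and the gap is 180 - 85 = 95 years, matching the `details` string above.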
## 3. Orphans — entities with no relations
An `:Orphan` is a `Person`/`Item`/`Location`/`Event` node that exists in the
world but has zero outgoing or incoming relations of any kind. These are
typically world-builder placeholders that haven't been wired into the story
yet.
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"find_orphans","arguments":{"world_id":"default"}}
}'
```
```json
{
"violations": [
{
"id": "o_unfinished_npc",
"label": "Orphan",
"severity": "warn",
"status": "open",
"details": "Person 'Lyssa the Watcher' exists but has no relations — world-builder placeholder, not yet connected.",
"detected_at": "2026-06-16T23:04:51.238226Z"
}
],
"count": 1
}
```
`Lyssa the Watcher` is a real Person node in the seed (see
`seed.py:Lyssa the Watcher`) with no `PARENT_OF`, `MEMBER_OF`, `SPOUSE_OF`,
or any other relation. Note the severity: `warn`, not `error` — an
unfinished NPC is a real artifact of worldbuilding, not a story-level
inconsistency.
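
The orphan rule is a degree-zero check. A minimal in-memory sketch, assuming nodes are identified by string ids and edges are `(source, target)` pairs (the id `lyssa` is hypothetical; the doc only gives the display name):

```python
from typing import Iterable, List, Set, Tuple


def find_orphan_ids(node_ids: Iterable[str],
                    edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Return ids of nodes that appear in no edge, as source or target."""
    connected: Set[str] = set()
    for source, target in edges:
        connected.add(source)
        connected.add(target)
    return sorted(n for n in node_ids if n not in connected)


nodes = ["aldric", "elara", "lyssa"]
edges = [("aldric", "elara")]  # e.g. a SPOUSE_OF edge
print(find_orphan_ids(nodes, edges))  # -> ['lyssa']
```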
## 4. Ontology violations — rule-driven checks
An `:OntologyViolation` marks a `(:Person)` node that fails an active
`:OntologyRule`. Rules are themselves Neo4j nodes (`(:OntologyRule)`) with
a `predicate` (a short Python expression) and a `description`. The
consistency plugin runs each rule over the world and materializes a
violation node for every person that fails it.
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"find_ontology_violations","arguments":{"world_id":"default"}}
}'
```
```json
{
"violations": [
{
"id": "ov_maric_no_died",
"label": "OntologyViolation",
"severity": "warn",
"status": "open",
"details": "Person 'Maric Vyr' (born 85) has no death year; rule 'persons_born_before_280_must_die' applies.",
"detected_at": "2026-06-16T23:04:51.238226Z",
"entity_id": "maric",
"rule_id": "persons_born_before_280_must_die"
},
{
"id": "ov_theron_no_died",
"label": "OntologyViolation",
"severity": "warn",
"status": "open",
"details": "Person 'Theron Ashveil' (born 10) has no death year; rule 'persons_born_before_280_must_die' applies.",
"detected_at": "2026-06-16T23:04:51.238226Z",
"entity_id": "theron",
"rule_id": "persons_born_before_280_must_die"
}
],
"count": 2
}
```
The rule `persons_born_before_280_must_die` is a world-builder convention:
in the default world's narrative, anyone born before the Age of Iron
(before year 280) must have a recorded death year, because the present
day is well past 280 and a living person from the 1st Age is
unprecedented. Maric (born 85) and Theron (born 10) are intentionally
left without a death year in the seed: they belong to long-lived lineages
and are still alive in the present. The two violations are *expected* by
the world-builder but flagged so the LLM (or operator) knows the rule is
being broken.
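
As a rough illustration of the rule, the check can be written as a plain Python function. This sketch hard-codes the rule instead of evaluating a stored `predicate` expression, so it is a simplification of what the plugin is described as doing. The `Person` dataclass and helper names are assumptions; the ids, birth years, and rule id come from the output above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RULE_ID = "persons_born_before_280_must_die"


@dataclass(frozen=True)
class Person:
    id: str
    born: int
    died: Optional[int] = None  # None means no recorded death year


def violates_rule(person: Person, cutoff: int = 280) -> bool:
    """Born before the cutoff year but no death year on record."""
    return person.born < cutoff and person.died is None


def ontology_violations(people: List[Person]) -> List[Dict[str, str]]:
    return [
        {"entity_id": p.id, "rule_id": RULE_ID}
        for p in people
        if violates_rule(p)
    ]


people = [
    Person("maric", born=85),
    Person("theron", born=10),
    Person("aldric", born=220, died=285),  # has a death year, so it passes
]
```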
## How the seed anchors the violation math
`seed.py:seed_violations` is the Python source of truth for what the tools
return. Five pre-materialized violation nodes (one Con, one Ana, one
Orph, two OV) get `MERGE`'d into the default world, and the tool Cypher
queries read them back. If a tool query and the seed drift apart, the
detection surface in `seed.py` is the one to trust; the queries are a
defensive layer so a missing seed row doesn't silently hide a violation.
## Files
- `plugins/consistency.py` — the four tools
- `seed.py:seed_violations` — the 5 hand-crafted violations
- `tests/test_consistency.py` — 10 pytest cases
- `examples/test_consistency.sh` — the live E2E runner that produced
every block of output above

223
docs/LLM_CONSUMER_DEMO.md Normal file

@@ -0,0 +1,223 @@
# LLM Consumer — Worked Example
This is a live walkthrough of the LLM consumer that landed in v2.T4.
Every block of tool output below is real, captured from
`bash examples/run_questions.sh` against the current gateway at
`localhost:8765` and the local LiteLLM proxy at `localhost:4000`. The
driver is `examples/llm_consumer.py`; the orchestrator is
`examples/run_questions.sh`; the system prompt template is
`examples/system_prompt.txt`; the per-question traces are saved under
`examples/results/*.json`. The full E2E report is in
`examples/REPORT.md`.
## What the consumer proves
The MCP gateway is a tool server. The LLM is the consumer. v2 ships a
real driver that:
1. Calls `tools/list` on the gateway.
2. Takes a natural-language question.
3. Asks an LLM (via LiteLLM) to pick the right tool(s) and
form the arguments.
4. Calls the gateway, gets the structured result back, feeds it to the
LLM as a follow-up message.
5. Returns a prose answer.
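
The five steps above can be sketched as a single-round loop. This is not the code in `examples/llm_consumer.py`: the gateway and LLM calls are injected as plain callables so the control flow is visible without a network, and every name in the block is illustrative. The real driver may run more than one round of tool calls.

```python
from typing import Any, Callable, Dict, List

ToolCall = Dict[str, Any]    # {"tool": <name>, "args": {...}}
ToolResult = Dict[str, Any]  # {"tool": <name>, "result": <gateway output>}


def answer_question(
    question: str,
    list_tools: Callable[[], List[Dict[str, Any]]],
    pick_tools: Callable[[str, List[Dict[str, Any]]], List[ToolCall]],
    call_tool: Callable[[str, Dict[str, Any]], Any],
    summarize: Callable[[str, List[ToolResult]], str],
) -> str:
    tools = list_tools()                 # step 1: discover tools
    calls = pick_tools(question, tools)  # steps 2-3: LLM picks tools and args
    known = {tool["name"] for tool in tools}
    results: List[ToolResult] = []
    for call in calls:                   # step 4: call the gateway
        if call["tool"] not in known:
            continue  # ignore tool names the gateway did not advertise
        results.append({
            "tool": call["tool"],
            "result": call_tool(call["tool"], call["args"]),
        })
    return summarize(question, results)  # step 5: prose answer
```

Injecting the four callables keeps the loop testable with stubs; swapping in real HTTP and LiteLLM clients would not change its shape.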
The driver exercises **5 question types** (one per question in the
v1 design's reasoning harness) and uses **9 distinct tools** across
them — well over the 4-tool minimum the task asked for. Every answer
is hand-verified against seed ground truth.
## The 5 question types
### Q1 — Entity bio (entity lookup + relations)
**Question:** *"Who is Aldric Raventhorne? Give a brief bio and his known relations."*
**Tools the LLM chose:**
```json
[
{ "tool": "entity_context", "args": { "name": "Aldric Raventhorne" } },
{ "tool": "lineage_of", "args": { "person": "Aldric Raventhorne" } }
]
```
**Answer (excerpt):**
> **Aldric Raventhorne** (born 220 – died 285) was a noble of Valdorni
> culture. He was a member of **House Vyr** and belonged to the
> **House Vyr bloodline**. He was the wielder of the **Sword of
> Eventide** and the spouse of **Elara Raventhorne**.
>
> Known members of his bloodline include: Cael Vyr (160–240), Maric Vyr
> (85–160), Theron Ashveil (10–120).
**Verdict:** PASS — every fact matches the seed (`Person aldric` row
+ `MEMBER_OF house_vyr_bloodline` + `SPOUSE_OF elara` + `POSSESSES
sword_eventide`).
### Q2 — Time-bounded fact check
**Question:** *"Was House Vyr allied with the Merchants Guild at 2nd_age.year_230?"*
**Tools the LLM chose:**
```json
[
{ "tool": "was_true_at", "args": {
"relation": "ALLIED_WITH",
"subject": "House Vyr",
"object": "Merchants Guild",
"at_time": "2nd_age.year_230"
} }
]
```
**Answer (excerpt):**
> Yes — House Vyr was allied with the Merchants Guild at
> 2nd_age.year_230. The ALLIED_WITH relationship was established in
> 2nd_age.year_100 and has no recorded end date, so it was in effect
> at that time.
**Verdict:** PASS — the model formed the time slug `2nd_age.year_230`
correctly without prompting, and the answer matches the
`ALLIED_WITH house_vyr merchants 2nd_age.year_100 None` relation in
the seed.
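
The time-bounded check behind this answer reduces to an interval test. A minimal sketch, assuming integer years and an inclusive end bound (the gateway's actual Cypher may treat the end bound differently); the function only mirrors the logic of the `was_true_at` tool and is not its implementation:

```python
from typing import Optional


def was_true_at(valid_from: int, valid_until: Optional[int], at_year: int) -> bool:
    """A relation holds at at_year if it had started and had not yet ended."""
    if at_year < valid_from:
        return False
    # None means the relation has no recorded end date.
    return valid_until is None or at_year <= valid_until


# ALLIED_WITH house_vyr -> merchants: from year 100, no end date.
print(was_true_at(100, None, 230))  # -> True
```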
### Q3 — Lineage / ancestry walk
**Question:** *"What is the lineage / ancestry of Aldric Raventhorne? Walk back as far as you can."*
**Tools the LLM chose:**
```json
[
{ "tool": "ancestors_of", "args": { "person": "Aldric Raventhorne", "generations": 10 } },
{ "tool": "entity_context", "args": { "name": "Aldric Raventhorne" } },
{ "tool": "lineage_of", "args": { "person": "Aldric Raventhorne" } }
]
```
**Answer (excerpt):**
> **Aldric Raventhorne** (b. 220, d. 285) — member of **House Vyr
> (bloodline)**. Walking back through his known ancestors:
>
> - **Mother:** **Yssa Raventhorne** (b. 165, d. unknown)
> - **Maternal grandfather:** **Cael Vyr** (b. 160, d. 240) — House Vyr
> - **Cael's father:** **Maric Vyr** (b. 85, d. 160) — House Vyr
> - **Maric's father:** **Theron Ashveil** (b. 10, d. 120) — the
> furthest ancestor the graph records.
**Verdict:** PASS — all 4 canonical ancestor names appear in the
answer, in the right order, with the right birth/death years.
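
The ancestry walk can be illustrated with a breadth-first traversal over a child-to-parents map. This sketch stands in for whatever graph query `ancestors_of` actually runs. The ids `yssa` and `cael` are hypothetical, and the `generations` cap mirrors the argument the LLM passed.

```python
from typing import Dict, List, Tuple


def ancestors_of(person: str, parents: Dict[str, List[str]],
                 generations: int = 10) -> List[Tuple[int, str]]:
    """Return (generation, ancestor_id) pairs, nearest generation first."""
    found: List[Tuple[int, str]] = []
    seen = {person}
    frontier = [person]
    for depth in range(1, generations + 1):
        next_frontier: List[str] = []
        for child in frontier:
            for parent in parents.get(child, []):
                if parent not in seen:  # guard against cycles in bad data
                    seen.add(parent)
                    found.append((depth, parent))
                    next_frontier.append(parent)
        if not next_frontier:
            break
        frontier = next_frontier
    return found


parents = {
    "aldric": ["yssa"],
    "yssa": ["cael"],
    "cael": ["maric"],
    "maric": ["theron"],
}
```

Calling `ancestors_of("aldric", parents)` walks four generations back to `theron`, the same chain the answer above lists.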
### Q4 — Image recall
**Question:** *"Show me images of Aldric Raventhorne — portraits or otherwise."*
**Tools the LLM chose:**
```json
[
{ "tool": "entity_context", "args": { "name": "Aldric Raventhorne" } },
{ "tool": "recall_images", "args": { "entity_id": "aldric" } }
]
```
**Answer (excerpt):**
> Found one image for Aldric Raventhorne:
>
> **Portrait of Aldric Raventhorne, Lord of Thornwall**
> (`img_aldric_portrait`)
> - Era: 2nd Age
> - Caption: "Portrait of Aldric Raventhorne, Lord of Thornwall.
> Middle-aged, dark hair, a scar above the left eye."
> - Tags: portrait, noble, thornwall
> - URL: `http://localhost:9000/lore-images/characters/aldric_portrait.png?...presigned...`
**Verdict:** PASS — the caption is quoted verbatim from the seed
manifest, the MinIO presigned URL is surfaced, and the model's
"found one image" claim matches the actual count.
### Q5 — Consistency scan (multi-tool fan-out)
**Question:** *"What are the open consistency issues in the world graph right now? Check contradictions, anachronisms, orphans, and ontology violations."*
**Tools the LLM chose:**
```json
[
{ "tool": "find_contradictions", "args": {} },
{ "tool": "find_anachronisms", "args": {} },
{ "tool": "find_orphans", "args": {} },
{ "tool": "find_ontology_violations", "args": {} }
]
```
**Answer (excerpt) — from the v2.T4 snapshot in time:**
> No open consistency issues detected. All four consistency checks
> returned zero violations.
**Verdict:** PASS *for the snapshot in time the run was taken* — the
model did not fabricate violations; it faithfully reported whatever
the tools returned. The v2.T4 trace was captured before v2.T5 landed;
the live T5 build surfaces 5 hand-crafted violations in the default
world (1 contradiction, 1 anachronism, 1 orphan, 2 ontology), as
documented in `docs/CONSISTENCY_DEMO.md`. To re-validate Q5 against
the current T5 build, run `bash examples/test_consistency.sh` — that
script is the v2.T5 replacement for the stub-trace era of the consumer
and is the authoritative Q5 evidence going forward.
## Tool coverage
| Tool | Question(s) | Question count |
|---|---|---|
| `entity_context` | Q1, Q3, Q4 | 3 |
| `lineage_of` | Q1, Q3 | 2 |
| `was_true_at` | Q2 | 1 |
| `ancestors_of` | Q3 | 1 |
| `recall_images` | Q4 | 1 |
| `find_contradictions` | Q5 | 1 |
| `find_anachronisms` | Q5 | 1 |
| `find_orphans` | Q5 | 1 |
| `find_ontology_violations` | Q5 | 1 |
**9 distinct tools** across **5 questions**. The model discovered
them all from `tools/list` — no scripted routing. The remaining 10 tools
(`state_at`, `descendants_of`, `log_trade`, `trades_by_buyer`,
`market_price`, `register_image`, `search_images_by_caption`,
`search_images_semantic`, `embed_images`, `list_worlds`) were
exercised separately by `bash test.sh`; the LLM correctly chose
not to invoke them for any of the 5 question types.
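
The coverage table can be re-derived from the tool lists shown under each question. The snippet below hard-codes those lists instead of reading `examples/results/*.json`, whose exact schema is not shown here; `tool_coverage` is an illustrative helper.

```python
from collections import Counter
from typing import Dict, List

CHOSEN: Dict[str, List[str]] = {
    "Q1": ["entity_context", "lineage_of"],
    "Q2": ["was_true_at"],
    "Q3": ["ancestors_of", "entity_context", "lineage_of"],
    "Q4": ["entity_context", "recall_images"],
    "Q5": ["find_contradictions", "find_anachronisms",
           "find_orphans", "find_ontology_violations"],
}


def tool_coverage(chosen: Dict[str, List[str]]) -> Counter:
    """Count, per tool, how many questions used it at least once."""
    counts: Counter = Counter()
    for tools in chosen.values():
        counts.update(set(tools))  # set() so repeats within a question count once
    return counts


coverage = tool_coverage(CHOSEN)
print(len(coverage))  # -> 9
```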
## How to re-run
```bash
# 1. gateway + DBs must be up
cd /root/lore-engine-poc
docker compose up -d --build
python3 seed.py
# 2. LiteLLM proxy must be running on :4000 with the configured model
# 3. drive the 5 questions
bash examples/run_questions.sh
# raw traces in examples/results/
ls examples/results/
```
## Files
- `examples/llm_consumer.py` — the driver (httpx + LiteLLM + tool loop)
- `examples/system_prompt.txt` — the system prompt the LLM sees
- `examples/run_questions.sh` — the orchestrator
- `examples/REPORT.md` — the full E2E report (verdicts, ground truth,
per-question traces)
- `examples/test_consistency.sh` — the v2.T5 consistency-only smoke
runner (replacement for the Q5 stub trace)

219
docs/MULTI_WORLD_DEMO.md Normal file

@@ -0,0 +1,219 @@
# Multi-World Namespace — Worked Example
This is a live walkthrough of the world namespace that landed in v2.T6.
Every call below is real tool output against the gateway at `localhost:8765`
from the v2 build (`4f92289` on `wt/t6-multi-world`).
## What the namespace is
The v1 POC stored every node and edge in a single graph. v2 adds a
`world_id` property on every world-scoped node and edge, and a new
`list_worlds()` admin tool. The read tools (`entity_context`,
`was_true_at`, `state_at`, `ancestors_of`, `descendants_of`,
`lineage_of`, `recall_images`, `search_images_by_caption`,
`search_images_semantic`, `trades_by_buyer`, `market_price`, and the
consistency `find_*` tools) all accept an optional `world_id` argument
that defaults to `"default"`. Write tools (`log_trade`, `register_image`,
`embed_images`) tag the row with the caller's `world_id`.
This lets a single Neo4j instance hold multiple parallel worlds with no
node-id collisions. The default seed loads a second world, `arda_greyscale`,
that mirrors the default world's shape with its own people, factions,
locations, events, and relations.
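A minimal client-side sketch of the optional `world_id` argument follows. The JSON-RPC payload shape is copied from the `curl` calls in this document; the `build_tool_call` helper itself is hypothetical. Leaving `world_id` out sends no such key, so the gateway's own `"default"` fallback applies.

```python
import json


def build_tool_call(name, arguments=None, world_id=None, request_id=1):
    """Build a tools/call JSON-RPC payload, adding world_id only when given."""
    args = dict(arguments or {})
    if world_id is not None:
        args["world_id"] = world_id
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": args},
    }


# v1-style call: no world_id, so the server-side default is used.
v1_call = build_tool_call("entity_context", {"name": "Theron Ashveil"})

# v2-style call scoped to the second world.
v2_call = build_tool_call(
    "entity_context", {"name": "Mael Greyscale"}, world_id="arda_greyscale"
)

print(json.dumps(v2_call))
```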
## 1. `list_worlds()` — what's loaded
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"list_worlds","arguments":{}}}'
```
```json
[
{ "world_id": "arda_greyscale" },
{ "world_id": "default" }
]
```
Both worlds are alive in the same graph. Note the default ordering is
newest-first by seed time.
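The JSON shown above is the tool's own output. On the wire it arrives wrapped in a JSON-RPC envelope, and the sketch below unwraps it. The envelope layout (`result.content[0].text` holding the tool output as a JSON string) is an assumption based on the `call_tool` helper in `examples/test_e2e.sh`; `unwrap_tool_result` is a hypothetical helper.

```python
import json


def unwrap_tool_result(envelope):
    """Return the tool's decoded JSON output from a tools/call response."""
    return json.loads(envelope["result"]["content"][0]["text"])


# Hand-built envelope carrying the list_worlds() output shown above.
envelope = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [
            {
                "type": "text",
                "text": json.dumps(
                    [{"world_id": "arda_greyscale"}, {"world_id": "default"}]
                ),
            }
        ]
    },
}

world_ids = [world["world_id"] for world in unwrap_tool_result(envelope)]
print(world_ids)
```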
## 2. The default world — Theron's bloodline
The default world is the v1 set: Theron Ashveil, Maric Vyr, Cael Vyr,
Yssa Raventhorne, Aldric Raventhorne, Elara Raventhorne, plus factions
House Vyr / Crimson Pact / Merchants Guild and the founding-event /
Black-Spire-event / founding-of-the-Merchants-Guild era.
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"entity_context","arguments":{"name":"Theron Ashveil","world_id":"default"}}
}'
```
```json
{
  "found": true,
  "name": "Theron Ashveil",
  "id": "theron",
  "world_id": "default",
  "labels": ["Person"],
  "properties": {
    "world_id": "default",
    "tier": "noble",
    "culture": "Valdorni",
    "born": 10,
    "name": "Theron Ashveil",
    "id": "theron"
  },
  "relations": [
    { "rel": "PARENT_OF", "to_id": "maric", "to": "Maric Vyr" },
    { "rel": "MEMBER_OF", "to_id": "house_vyr_bloodline", "to": "House Vyr (bloodline)" }
  ]
}
```
`Theron Ashveil` is the founding ancestor of the House Vyr bloodline.
He exists in the `default` world and is the earliest known ancestor of
Aldric (see `docs/LLM_CONSUMER_DEMO.md` Q3 for the full chain).
## 3. The greyscale world — Mael & Sira Greyscale
`arda_greyscale` is a parallel world seeded by
`seed.py:seed_greyscale_world` with its own era (`greyscale_age`), its
own faction (The Ashen Court), and its own people. The greyscale seed
intentionally uses different node ids — `mael_greyscale`, `sira_greyscale`
— so a query in one world cannot accidentally return the other.
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"entity_context","arguments":{"name":"Mael Greyscale","world_id":"arda_greyscale"}}
}'
```
```json
{
  "found": true,
  "name": "Mael Greyscale",
  "id": "mael_greyscale",
  "world_id": "arda_greyscale",
  "labels": ["Person"],
  "properties": {
    "world_id": "arda_greyscale",
    "tier": "noble",
    "culture": "Greyscale",
    "born": 220,
    "name": "Mael Greyscale",
    "id": "mael_greyscale"
  },
  "relations": [
    { "rel": "MEMBER_OF", "to_id": "ashen_court", "to": "The Ashen Court" },
    { "rel": "SPOUSE_OF", "to_id": "sira_greyscale", "to": "Sira Greyscale" }
  ]
}
```
Mael is the greyscale world's analogue of Aldric: a noble, a member of
the Ashen Court, spouse of a Greyscale twin. Note `culture: "Greyscale"`
and `tier: "noble"`: the property names are the same as in the default
world, but the values are scoped to `arda_greyscale` and carry that
world's meaning.
## 4. Cross-world isolation — the namespace holds
A query in world X for an entity that exists only in world Y must come
back empty. This is the test the namespace was built to pass.
### Aldric is default-only — greyscale returns empty
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"recall_images","arguments":{"entity_id":"aldric","world_id":"arda_greyscale"}}
}'
```
```json
{
"entity_id": "aldric",
"world_id": "arda_greyscale",
"count": 0,
"images": []
}
```
Aldric's images are `image_manifest` rows tagged with the default
world, not with `arda_greyscale`. With `world_id="arda_greyscale"`, the
image recall query finds zero rows, which is exactly what the namespace
promises.
### Trade log — default scope doesn't see greyscale entries (and vice versa)
```bash
curl -s -X POST http://localhost:8765/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0","id":1,"method":"tools/call",
"params":{"name":"market_price","arguments":{"item_id":"pale_ledger","world_id":"default"}}
}'
```
```json
{
"item_id": "pale_ledger",
"sample_size": 2,
"avg_unit_price": 500.0,
"min_unit_price": 500.0,
"max_unit_price": 500.0,
"most_recent": "2026-06-16T23:04:51.276172+00:00"
}
```
The same `market_price` call against `arda_greyscale` returns zero
trades for `pale_ledger` (the greyscale world has its own item
namespace, not the default `pale_ledger`). Every row in the trades
table carries `world_id` as part of its PK and the query filters on
it, so a row inserted by `log_trade` with `world_id="arda_greyscale"`
is invisible to a default-scope query.
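The isolation can be reproduced in miniature. The sketch below uses an in-memory SQLite table as a stand-in for the Postgres trades table. The column names, the `ash_token` item, and this `market_price` function are all assumptions made for illustration; only the composite key on `world_id` and the filtered lookup reflect the behaviour described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trades (
        world_id   TEXT    NOT NULL,
        trade_id   INTEGER NOT NULL,
        item_id    TEXT    NOT NULL,
        unit_price REAL    NOT NULL,
        PRIMARY KEY (world_id, trade_id)
    )
    """
)
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?)",
    [
        ("default", 1, "pale_ledger", 500.0),
        ("default", 2, "pale_ledger", 500.0),
        # Same trade_id as a default row: allowed by the composite key.
        ("arda_greyscale", 1, "ash_token", 12.0),  # hypothetical item
    ],
)


def market_price(conn, item_id, world_id="default"):
    """Aggregate prices for one item, restricted to one world."""
    sample_size, avg_price = conn.execute(
        "SELECT COUNT(*), AVG(unit_price) FROM trades "
        "WHERE item_id = ? AND world_id = ?",
        (item_id, world_id),
    ).fetchone()
    return {
        "item_id": item_id,
        "sample_size": sample_size,
        "avg_unit_price": avg_price,
    }


print(market_price(conn, "pale_ledger"))
print(market_price(conn, "pale_ledger", "arda_greyscale"))
```

The composite key is what lets both worlds reuse `trade_id` 1 without colliding; the `WHERE world_id = ?` filter is what keeps each scope from seeing the other's rows.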
## 5. How a tool uses `world_id`
The `MATCH` clauses in the world-scoped tools all include
`{id: $..., world_id: $world_id}` so a row in the wrong world simply
doesn't match. For example, the lineage ancestors query in
`plugins/lineage.py`:
```cypher
MATCH path = (a:Person {id: $person, world_id: $world_id})<-[:PARENT_OF*1..10]-(ancestor:Person)
WHERE ancestor.world_id = $world_id
RETURN ancestor
```
Both ends of the path are pinned to the same `world_id`, so the chain
never crosses a world boundary. The `state_at` and `entity_context`
queries follow the same pattern; the image and trade queries hit
Postgres tables that carry `world_id` in their primary key.
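The same world pinning can be sketched in plain Python over an edge list. This is an in-memory illustration, not the code in `plugins/lineage.py`: the `ancestors_of` function below is a hypothetical re-implementation, and every edge except `theron` to `maric` is made up for the example.

```python
def ancestors_of(edges, person, world_id="default", max_depth=10):
    """Walk PARENT_OF edges upward, ignoring edges from other worlds.

    edges is an iterable of (parent_id, child_id, world_id) tuples.
    """
    parents = {}
    for parent, child, edge_world in edges:
        if edge_world == world_id:  # edges from other worlds never match
            parents.setdefault(child, []).append(parent)

    found, frontier = [], [person]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                if parent not in found:
                    found.append(parent)
                    next_frontier.append(parent)
        if not next_frontier:
            break
        frontier = next_frontier
    return found


edges = [
    ("theron", "maric", "default"),                   # from the output above
    ("maric", "toy_child", "default"),                # hypothetical
    ("other_parent", "toy_child", "arda_greyscale"),  # hypothetical
]

default_chain = ancestors_of(edges, "toy_child")
greyscale_chain = ancestors_of(edges, "toy_child", world_id="arda_greyscale")
print(default_chain, greyscale_chain)
```

Filtering edges before the walk plays the role of the `world_id` conditions in the Cypher pattern: a parent recorded in another world is never reachable.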
## 6. The world-resolution rule
Tools that take a `world_id` argument default it to `"default"` so v1
callers keep working unchanged. The `bash test.sh` runner passes
`world_id="default"` explicitly to verify that the opt-in behaviour
holds. The greyscale seed is loaded by `python3 seed.py` automatically
(no extra flag), and `list_worlds()` is the operator's view of what
exists.
## Files
- `seed.py:seed_greyscale_world` — the `arda_greyscale` seed
- `seed.py:_seed_images_for_world` — the per-world image manifest loader
- `plugins/lineage.py`, `plugins/world.py`, `plugins/images.py` — every
world-scoped read tool filters on `world_id`
- `tests/test_multi_world.py` — 14 pytest cases for the namespace
- `test.sh` section 12 — the `list_worlds()` smoke check

320
examples/test_e2e.sh Executable file

@@ -0,0 +1,320 @@
#!/usr/bin/env bash
# test_e2e.sh — End-to-end validation for v2.T7.
#
# What this proves (per task body):
# 1. The LLM consumer works end-to-end (5 question types)
# 2. The consistency tools find the right violations (5 seeded)
# 3. The LLM's answers match the seed-data ground truth
#
# Two independent layers:
# A. Direct tool calls — each of the 4 consistency tools is invoked
# against the live gateway and the violation count + ids are asserted
# against the table in examples/GROUND_TRUTH.md. This proves the
# tools work regardless of LLM behaviour.
# B. LLM consumer — for each of 5 question types, drive the LLM through
# the gateway, then assert the answer contains the expected facts
# (names, dates, severities). This proves the LLM consumer works.
#
# The script exits 0 only if EVERY check passes.
set -uo pipefail
cd "$(dirname "$0")" || exit 1
mkdir -p results
GATEWAY_URL="${GATEWAY_URL:-http://localhost:8765/mcp}"
LITELLM_URL="${LITELLM_URL:-http://localhost:4000/v1}"
LITELLM_MODEL="${LITELLM_MODEL:-minimax-m3}"
export GATEWAY_URL LITELLM_URL LITELLM_MODEL
# ─── bookkeeping ──────────────────────────────────────────────────────────────
fails=0
passes=0
declare -a FAIL_DETAILS=()
ok() { passes=$((passes+1)); echo "  [ok]   $1"; }
fail() { fails=$((fails+1)); FAIL_DETAILS+=("$1"); echo "  [FAIL] $1"; }
section() { echo; echo "── $* ──"; }
# ─── pre-flight ──────────────────────────────────────────────────────────────
section "pre-flight: gateway + LiteLLM reachable"
if curl -s --max-time 5 -X POST "$GATEWAY_URL" -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' >/dev/null 2>&1; then
ok "gateway responds at $GATEWAY_URL"
else
fail "gateway unreachable at $GATEWAY_URL"
echo
echo "PRE-FLIGHT FAILED — aborting."
exit 1
fi
if curl -s --max-time 5 "$LITELLM_URL/models" >/dev/null 2>&1; then
ok "LiteLLM responds at $LITELLM_URL"
else
fail "LiteLLM unreachable at $LITELLM_URL"
echo
echo "PRE-FLIGHT FAILED — aborting."
exit 1
fi
# ─── Layer A: direct consistency-tool calls ──────────────────────────────────
# Helper: call a tool and print the text payload of the first content item
# in the JSON-RPC result (the tool's own JSON output).
call_tool() {
  local name=$1
  local args=$2
  curl -s -X POST "$GATEWAY_URL" -H "Content-Type: application/json" \
    -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/call\",\"params\":{\"name\":\"$name\",\"arguments\":$args}}" \
    | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['result']['content'][0]['text'])"
}
# Helper: assert a tool's violation count + ids.
# Args: <tool_name> <args_json> <expected_count> [expected_id_1 ...]
assert_violations() {
  local tool=$1; shift
  local args=$1; shift
  local expected_count=$1; shift
  local resp
  resp=$(call_tool "$tool" "$args")
  local got_count
  got_count=$(printf '%s' "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])")
  if [ "$got_count" = "$expected_count" ]; then
    ok "$tool: count=$got_count (expected $expected_count)"
  else
    fail "$tool: count=$got_count (expected $expected_count) (full response: $resp)"
    return
  fi
  # Remaining positional args are the violation ids that must be present.
  for want in "$@"; do
    if printf '%s' "$resp" | python3 -c "import json,sys; ids=[v['id'] for v in json.load(sys.stdin)['violations']]; print('YES' if '$want' in ids else 'NO')" \
      2>/dev/null | grep -q YES; then
      ok "$tool: contains id=$want"
    else
      fail "$tool: missing id=$want (full response: $resp)"
    fi
  done
}
section "Layer A — direct consistency tool calls (no LLM)"
assert_violations "find_contradictions" '{"severity":"any"}' 1 c_aldric_double_membership
assert_violations "find_anachronisms" '{"severity":"any"}' 1 a_vex_at_founding
assert_violations "find_orphans" '{}' 1 o_unfinished_npc
assert_violations "find_ontology_violations" '{"severity":"any"}' 2 ov_theron_no_died ov_maric_no_died
# Severity breakdown — task body says "the orphan being a warning, not error".
section "Layer A — severity breakdown"
contradictions_err=$(call_tool "find_contradictions" '{"severity":"error"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])")
contradictions_warn=$(call_tool "find_contradictions" '{"severity":"warn"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])")
[ "$contradictions_err" = "1" ] && ok "find_contradictions severity=error -> 1" || fail "find_contradictions severity=error -> $contradictions_err (expected 1)"
[ "$contradictions_warn" = "0" ] && ok "find_contradictions severity=warn -> 0" || fail "find_contradictions severity=warn -> $contradictions_warn (expected 0)"
anach_err=$(call_tool "find_anachronisms" '{"severity":"error"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])")
anach_warn=$(call_tool "find_anachronisms" '{"severity":"warn"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])")
[ "$anach_err" = "1" ] && ok "find_anachronisms severity=error -> 1" || fail "find_anachronisms severity=error -> $anach_err (expected 1)"
[ "$anach_warn" = "0" ] && ok "find_anachronisms severity=warn -> 0" || fail "find_anachronisms severity=warn -> $anach_warn (expected 0)"
# Orphans: 1 warn (the task body specifies this is a warn, not error).
orphan_severity=$(call_tool "find_orphans" '{}' | python3 -c "import json,sys; d=json.load(sys.stdin); print(','.join(v['severity'] for v in d['violations']))")
if [ "$orphan_severity" = "warn" ]; then
ok "find_orphans -> severity=warn (orphan is a warn, not error)"
else
fail "find_orphans -> severity=[$orphan_severity] (expected 'warn')"
fi
# Ontology: 2 warn
ont_warn=$(call_tool "find_ontology_violations" '{"severity":"warn"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])")
[ "$ont_warn" = "2" ] && ok "find_ontology_violations severity=warn -> 2" || fail "find_ontology_violations severity=warn -> $ont_warn (expected 2)"
# Total
total_err=0
total_warn=0
for t in find_contradictions find_anachronisms find_orphans find_ontology_violations; do
  args='{"severity":"any"}'
  [ "$t" = "find_orphans" ] && args='{}'
  # Call each tool once and count both severities from the same response.
  resp=$(call_tool "$t" "$args")
  e=$(printf '%s' "$resp" | python3 -c "import json,sys; d=json.load(sys.stdin); print(sum(1 for v in d['violations'] if v['severity']=='error'))")
  w=$(printf '%s' "$resp" | python3 -c "import json,sys; d=json.load(sys.stdin); print(sum(1 for v in d['violations'] if v['severity']=='warn'))")
  # Default to 0 so a failed call yields a count mismatch, not a shell error.
  total_err=$((total_err+${e:-0}))
  total_warn=$((total_warn+${w:-0}))
done
total=$((total_err+total_warn))
[ "$total" = "5" ] && ok "TOTAL violations = 5 (2 error + 3 warn)" \
|| fail "TOTAL violations = $total (expected 5)"
[ "$total_err" = "2" ] && ok "TOTAL errors = 2" || fail "TOTAL errors = $total_err (expected 2)"
[ "$total_warn" = "3" ] && ok "TOTAL warns = 3" || fail "TOTAL warns = $total_warn (expected 3)"
# ─── Layer B: LLM consumer — 5 question types ────────────────────────────────
section "Layer B — LLM consumer (5 question types)"
declare -a IDS=(
"q1_who_is_aldric"
"q2_was_allied_230"
"q3_aldric_ancestors"
"q4_images_of_aldric"
"q5_consistency_issues"
)
declare -a QS=(
"Who is Aldric Raventhorne? Give a brief bio and his known relations."
"Was House Vyr allied with the Merchants Guild at 2nd_age.year_230?"
"What is the lineage / ancestry of Aldric Raventhorne? Walk back as far as you can."
"Show me images of Aldric Raventhorne — portraits or otherwise."
"What are the open consistency issues in the world graph right now? Check contradictions, anachronisms, orphans, and ontology violations."
)
for i in "${!IDS[@]}"; do
id="${IDS[$i]}"
q="${QS[$i]}"
echo
echo "── question $((i+1))/5: $id ──"
echo " Q: $q"
if ! python3 llm_consumer.py --question-id "$id" --question "$q" \
--out "results/${id}.json" >"/tmp/llm_consumer_${id}.log" 2>&1; then
fail "Q$((i+1)) ($id): llm_consumer.py exited non-zero — see /tmp/llm_consumer_${id}.log"
tail -5 "/tmp/llm_consumer_${id}.log" | sed 's/^/ /'
continue
fi
tail -8 "/tmp/llm_consumer_${id}.log"
ok "Q$((i+1)) ($id): llm_consumer.py exit=0"
done
# ─── Answer-level assertions against GROUND_TRUTH.md ─────────────────────────
section "Layer B — answer-level assertions against GROUND_TRUTH.md"
# Helper: read a trace and emit its (answer_lower, tools_csv) on two lines.
trace_info() {
local trace_path=$1
python3 -c "
import json
d = json.load(open('$trace_path'))
ans = (d.get('answer') or '').lower()
tools = [t['tool'] for t in d.get('tools_called', [])]
print(ans)
print('---TOOLS---')
print(','.join(tools))
"
}
# Q1: entity_context called, answer has Aldric + a known affiliation.
if [ -f "results/q1_who_is_aldric.json" ]; then
trace=$(trace_info "results/q1_who_is_aldric.json")
q1_ans=${trace%%$'---TOOLS---'*}
q1_tools=${trace##*---TOOLS---}; q1_tools=${q1_tools#$'\n'}
echo " Q1 tools: $q1_tools"
if [[ "$q1_tools" == *entity_context* ]]; then ok "Q1: entity_context in tools_called"; else fail "Q1: entity_context NOT called (got: $q1_tools)"; fi
if printf '%s' "$q1_ans" | grep -qi 'aldric'; then ok "Q1: answer mentions 'aldric'"; else fail "Q1: answer missing 'aldric'"; fi
if printf '%s' "$q1_ans" | grep -Eqi 'vyr|thornwall|elara|valdorni|eventide'; then
ok "Q1: answer mentions a known affiliation (Vyr/Thornwall/Elara/Valdorni/Eventide)"
else
fail "Q1: answer missing known affiliation"
fi
else
fail "Q1: results/q1_who_is_aldric.json missing (LLM consumer failed)"
fi
# Q2: was_true_at called, answer says YES/allied/true.
if [ -f "results/q2_was_allied_230.json" ]; then
trace=$(trace_info "results/q2_was_allied_230.json")
q2_ans=${trace%%$'---TOOLS---'*}
q2_tools=${trace##*---TOOLS---}; q2_tools=${q2_tools#$'\n'}
echo " Q2 tools: $q2_tools"
if [[ "$q2_tools" == *was_true_at* ]]; then ok "Q2: was_true_at in tools_called"; else fail "Q2: was_true_at NOT called (got: $q2_tools)"; fi
if printf '%s' "$q2_ans" | grep -Eqi 'yes|allied|true|in force|was an alliance'; then
ok "Q2: answer indicates YES/allied/true"
else
fail "Q2: answer missing YES/allied/true"
fi
else
fail "Q2: results/q2_was_allied_230.json missing (LLM consumer failed)"
fi
# Q3: ancestors_of called, answer names >=3 of {Theron, Maric, Cael, Yssa}.
if [ -f "results/q3_aldric_ancestors.json" ]; then
trace=$(trace_info "results/q3_aldric_ancestors.json")
q3_ans=${trace%%$'---TOOLS---'*}
q3_tools=${trace##*---TOOLS---}; q3_tools=${q3_tools#$'\n'}
echo " Q3 tools: $q3_tools"
if [[ "$q3_tools" == *ancestors_of* ]]; then ok "Q3: ancestors_of in tools_called"; else fail "Q3: ancestors_of NOT called (got: $q3_tools)"; fi
found=0
for n in theron maric cael yssa; do
if printf '%s' "$q3_ans" | grep -qi "$n"; then found=$((found+1)); fi
done
if [ "$found" -ge 3 ]; then ok "Q3: answer names $found/4 canonical ancestors (need >=3)"; else fail "Q3: answer names only $found/4 canonical ancestors (need >=3)"; fi
else
fail "Q3: results/q3_aldric_ancestors.json missing (LLM consumer failed)"
fi
# Q4: image-recall tool called, answer mentions Aldric + portrait/image/etc.
if [ -f "results/q4_images_of_aldric.json" ]; then
trace=$(trace_info "results/q4_images_of_aldric.json")
q4_ans=${trace%%$'---TOOLS---'*}
q4_tools=${trace##*---TOOLS---}; q4_tools=${q4_tools#$'\n'}
echo " Q4 tools: $q4_tools"
if [[ "$q4_tools" == *recall_images* || "$q4_tools" == *search_images_by_caption* || "$q4_tools" == *search_images_semantic* ]]; then
ok "Q4: image-recall tool in tools_called"
else
fail "Q4: no image-recall tool called (got: $q4_tools)"
fi
if printf '%s' "$q4_ans" | grep -qi 'aldric'; then ok "Q4: answer mentions 'aldric'"; else fail "Q4: answer missing 'aldric'"; fi
if printf '%s' "$q4_ans" | grep -Eqi 'portrait|image|presigned|thornwall'; then
ok "Q4: answer mentions portrait/image/presigned/thornwall"
else
fail "Q4: answer missing portrait/image/presigned/thornwall"
fi
else
fail "Q4: results/q4_images_of_aldric.json missing (LLM consumer failed)"
fi
# Q5: all 4 consistency tools called; answer is NOT a "no issues" answer; mentions
# canonical subject names and severity.
if [ -f "results/q5_consistency_issues.json" ]; then
trace=$(trace_info "results/q5_consistency_issues.json")
q5_ans=${trace%%$'---TOOLS---'*}
q5_tools=${trace##*---TOOLS---}; q5_tools=${q5_tools#$'\n'}
echo " Q5 tools: $q5_tools"
missing=()
for t in find_contradictions find_anachronisms find_orphans find_ontology_violations; do
[[ "$q5_tools" == *"$t"* ]] || missing+=("$t")
done
if [ ${#missing[@]} -eq 0 ]; then
ok "Q5: all 4 consistency tools in tools_called"
else
fail "Q5: missing tools: ${missing[*]}"
fi
# Must NOT say "no issues" — there are 5 seeded violations.
if printf '%s' "$q5_ans" | grep -Eqi '(^|[^[:alnum:]])(no|zero|none)[^[:alnum:]].{0,30}(open |detected |current )?(consistency |open )?(issues|problems|violations)'; then
fail "Q5: answer incorrectly says 'no issues' — but 5 violations are seeded"
else
ok "Q5: answer does NOT claim 'no issues' (correct — 5 violations seeded)"
fi
subject_hits=0
for n in aldric vex lyssa theron maric; do
if printf '%s' "$q5_ans" | grep -qi "$n"; then subject_hits=$((subject_hits+1)); fi
done
if [ "$subject_hits" -ge 2 ]; then
ok "Q5: answer mentions $subject_hits canonical subjects (need >=2)"
else
fail "Q5: answer mentions only $subject_hits canonical subjects (need >=2)"
fi
if printf '%s' "$q5_ans" | grep -Eqi 'severity|warn|warning|error'; then
ok "Q5: answer acknowledges severity (warn/error)"
else
fail "Q5: answer does not acknowledge severity"
fi
else
fail "Q5: results/q5_consistency_issues.json missing (LLM consumer failed)"
fi
# ─── summary ─────────────────────────────────────────────────────────────────
echo
echo "════════════════════════════════════════════════════════════"
if [ "$fails" -eq 0 ]; then
echo " PASS — $passes checks, 0 failures"
echo "════════════════════════════════════════════════════════════"
exit 0
else
echo " FAIL — $passes checks passed, $fails FAILED:"
for d in "${FAIL_DETAILS[@]}"; do
echo " - $d"
done
echo "════════════════════════════════════════════════════════════"
exit 1
fi