From 99535a8f3a803bbee770f819f73239e57309d358 Mon Sep 17 00:00:00 2001 From: kanban-dev Date: Wed, 17 Jun 2026 00:45:30 +0000 Subject: [PATCH] =?UTF-8?q?docs(v2):=20T8=20=E2=80=94=20update=20README=20?= =?UTF-8?q?+=20CHANGELOG=20+=203=20worked-example=20docs?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - README.md: 5 plugins / 19 tools (matches /healthz); 'what this proves' now lists consistency engine, multi-world namespace, LLM consumer; 'next steps' section replaced with 'shipped in v2' - docs/CONSISTENCY_DEMO.md: 4 tools, 5 violations, all output verified against live bash examples/test_consistency.sh - docs/MULTI_WORLD_DEMO.md: list_worlds() + entity_context in both worlds + cross-world isolation tests, all output verified live - docs/LLM_CONSUMER_DEMO.md: 5 question types, 9 distinct tools, all output traced to examples/results/*.json - CHANGELOG.md: v1 -> v2 entry, all 9 task refs (T1-T9) - examples/test_e2e.sh: T7 E2E validation script (untracked) --- CHANGELOG.md | 165 ++++++++++++++++++++ README.md | 112 ++++++++++--- docs/CONSISTENCY_DEMO.md | 210 +++++++++++++++++++++++++ docs/LLM_CONSUMER_DEMO.md | 223 ++++++++++++++++++++++++++ docs/MULTI_WORLD_DEMO.md | 219 ++++++++++++++++++++++++++ examples/test_e2e.sh | 320 ++++++++++++++++++++++++++++++++++++++ 6 files changed, 1231 insertions(+), 18 deletions(-) create mode 100644 CHANGELOG.md create mode 100644 docs/CONSISTENCY_DEMO.md create mode 100644 docs/LLM_CONSUMER_DEMO.md create mode 100644 docs/MULTI_WORLD_DEMO.md create mode 100755 examples/test_e2e.sh diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..e17c582 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,165 @@ +# Changelog + +All notable changes to `lore-engine-poc` are recorded here. 
The format +follows [Keep a Changelog](https://keepachangelog.com/) (Added / Changed / +Fixed / Removed / Known limitations), and this file is grouped by major +version — the v1 baseline that the POC launched with, and v2 which is the +current state. + +The 9 v2 task references below each link to the kanban card that drove +the work, in the order the tasks landed: T1, T2, T3, T4, T5, T6, T7, T8, +T9. + +--- + +## [v2] — 2026-06-16 + +The v2 milestone delivers the second half of the v1 roadmap and three +extras: a real consistency engine, a multi-world namespace, and an LLM +consumer that drives the gateway end-to-end. v2 is what +`bash test.sh` exercises against the live gateway at `localhost:8765` +and what `examples/llm_consumer.py` drives from the LiteLLM proxy. + +### Added + +- **`plugins/embeddings.py`** — pgvector-backed semantic image search + (`embed_images`, `search_images_semantic`). Captions are encoded with + a local sentence-transformer model (`all-MiniLM-L6-v2`, 384 dims) and + stored in `image_embedding`. Queries are matched via pgvector cosine + distance (`<=>`). Background embedding on `register_image`; `embed_images` + is idempotent. v2.T2. + +- **`plugins/consistency.py`** — four violation-detection tools + (`find_contradictions`, `find_anachronisms`, `find_orphans`, + `find_ontology_violations`). Returns a `{violations, count}` envelope + per call. Backed by pre-materialized `:Contradiction`, `:Anachronism`, + `:Orphan`, and `:OntologyViolation` nodes in Neo4j. The seed + (`seed.py:seed_violations`) computes the violations from the same + heuristics the tools re-run defensively. v2.T3 (skeleton) + v2.T5 + (real rules). + +- **`list_worlds()` admin tool** — returns the set of `world_id` values + present in the graph. Read by `bash test.sh` section 12 and by the + v2.T7 E2E validation suite. v2.T6. 
+ +- **`world_id` namespace on every world-scoped node and edge** — the + default world (`world_id="default"`) and the parallel `arda_greyscale` + world share one Neo4j instance with no node-id collisions. Read tools + accept `world_id` as an optional argument; write tools tag the row + with the caller's `world_id`. v2.T6. + +- **Parallel world seed: `arda_greyscale`** — `seed.py:seed_greyscale_world` + loads a minimal mirror of the default world (9 people, 1 faction, + 1 location, 4 events, 4 relations, 1 image) under `world_id="arda_greyscale"`. + Idempotent. v2.T6. + +- **LLM consumer (`examples/llm_consumer.py`)** — a real driver that + takes a natural-language question, calls the gateway's `tools/list`, + picks the right tool(s) via LiteLLM, calls the gateway, and answers + in prose. 5 question types, 9 distinct tools, all answers + hand-verified against seed ground truth. v2.T4. + +- **E2E validation (`examples/test_e2e.sh` + `examples/E2E_REPORT.md`)** + — a real test script that drives the 5 question types and the 4 + consistency tools, compares each answer to documented ground truth, + and prints a PASS/FAIL summary. v2.T7. + +- **CI smoke (`scripts/ci-smoke.sh` + `docs/SMOKE.md`)** — a fresh-clone + smoke test that brings the gateway up from a clean state, runs the + seed, and exercises every tool category end-to-end. v2.T1. + +- **v2 docs** — `docs/CONSISTENCY_DEMO.md` (5 hand-crafted violations + from the live seed), `docs/MULTI_WORLD_DEMO.md` (the 2-world seed in + action), `docs/LLM_CONSUMER_DEMO.md` (the 5 question types in detail). + This file. v2.T8. + +- **Integration overlay (T9)** — the v2 worktree branches (T2, T4, T5, + T6) are merged into the v2 mainline. `bash test.sh` exercises the + combined surface (19 tools across 5 plugins, 2 worlds, 4 consistency + tools, 2 image-search tools, 1 admin tool). v2.T9. 
+ +### Changed + +- **README.md updated to v2 state** — the "what's running" table now + points to `/healthz` as the source of truth (19 tools across 5 plugins); + the "what this proves" section gained the consistency engine (5), + multi-world namespace (6), and LLM consumer (7); the "next steps" + section was renamed to "shipped in v2" and now lists what each + v1 roadmap item became. v2.T8. + +- **`bash test.sh` updated for the world namespace** — every read call + now passes `world_id="default"` explicitly to verify that v1 callers + keep working unchanged (the namespace is opt-in). Added a 12th section + that calls `list_worlds()`. v2.T6. + +- **`seed.py` grew two new stages** — `seed_greyscale_world` (the + parallel world, v2.T6) and `seed_violations` (5 hand-crafted + violations, v2.T5). Both are idempotent and safe to re-run. + +- **`tests/test_consistency.py` and `tests/test_multi_world.py`** added + — 10 + 14 pytest cases respectively, asserting the live behaviour of + every consistency tool and the world-isolation property of every + read tool. v2.T5, v2.T6. + +- **`tests/test_embeddings_*.py` and `tests/test_register_image_hook.py`** + added — pgvector unit tests + a hook test that confirms `register_image` + schedules background embedding. v2.T2. + +### Known limitations (v2 → v3) + +These are deliberate v2 boundaries; the v3 plan will address them: + +- **No world-builder UI.** Everything is `curl` and `cypher-shell`. The + v2 dashboard is a separate repo. v3. + +- **No reflective memory or behavior layer.** The Stanford Generative + Agents pattern (memory stream + reflection + planning) is a v3 + borrow per `lore-engine/docs/16-comparison.md`. v3. + +- **Consistency engine is rule-driven, not ML-driven.** The five + hand-crafted violations in v2 are seeded; an ML-derived detection + surface (e.g. an LLM pass over the world summary) is a v3 item. v3. 
+ +- **No refresh / cache invalidation on world reseed.** If a world is + re-seeded, the embeddings for any new image manifest rows are computed + on the next `register_image` or `embed_images` call; old embeddings + are kept. A v3 refresh tool would let an operator force a full + re-embed. v3. + +--- + +## [v1] — 2026-06-16 (baseline) + +The initial proof of concept. Five-minute goal: prove that with mock +data, we can run a multi-database backend (Neo4j + Postgres + MinIO) and +expose it all through a plugin-driven MCP gateway where adding a new +domain type is a new file in `plugins/`, not a Go change. + +### Added + +- `docker-compose.yml` — Neo4j 5.26, Postgres (later upgraded to + pgvector in v2.T2), MinIO, and the gateway container. +- `seed.py` — idempotent seeder for the default world (3 eras, 10 people, + 3 factions, 4 locations, 4 items, 6 events, 1 lineage group, ~20 + time-bounded relations, 3 trade log entries, 4 generated images). +- `plugins/world.py` — `entity_context`, `was_true_at`, `state_at` + (Neo4j). +- `plugins/lineage.py` — `ancestors_of`, `descendants_of`, `lineage_of` + (Neo4j). +- `plugins/trade.py` — `log_trade`, `trades_by_buyer`, `market_price` + (Postgres). +- `plugins/images.py` — `register_image`, `recall_images`, + `search_images_by_caption` (MinIO + Postgres + Neo4j). +- `server.py` — the MCP-compatible JSON-RPC gateway, auto-loading every + `.py` file in `plugins/`. +- `bash test.sh` — the 12-section end-to-end smoke runner. +- `README.md` (v1) — the original POC writeup. + +### Known limitations (v1 → v2) + +- Stub consistency tools (no detection rules). +- No semantic image search. +- No LLM in the loop. +- Single world, no namespace. + +All four items were addressed in v2. 
diff --git a/README.md b/README.md index ed2bd1d..675d3a0 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ Five-minute goal: prove that with mock data, we can run a multi-database backend | `lore-minio` | `minio/minio:latest` | 9000 (S3), 9001 (console) | Image blob storage | | `lore-gateway` | built locally | 8765 (MCP JSON-RPC) | The plugin-driven gateway | -## The four plugins (this is the proof) +## The five plugins (this is the proof) ``` plugins/ @@ -22,10 +22,22 @@ plugins/ ├── trade.py # log_trade, trades_by_buyer, market_price (Postgres) ├── images.py # register_image, recall_images, search_images_by_caption │ # (MinIO + Postgres + Neo4j) -└── embeddings.py # embed_images, search_images_semantic (Postgres + pgvector) +├── embeddings.py # embed_images, search_images_semantic (Postgres + pgvector) +└── consistency.py # find_contradictions, find_anachronisms, find_orphans, + # find_ontology_violations (Neo4j) ``` -Each plugin is a single file with a `register(registry)` entry point. The gateway auto-loads every `.py` file in `plugins/` at startup. **No server.py change needed to add a new tool** — drop a new file in, restart the container, the new tools appear in `tools/list`. +The gateway also exposes one admin tool for the world namespace: `list_worlds`. + +Tool counts and plugin membership are reported live by the gateway itself — +`curl -s http://localhost:8765/healthz` returns the canonical list. As of v2 +the healthz reports 19 tools across the 5 plugins above. See +`docs/LLM_CONSUMER_DEMO.md` for an end-to-end driver that exercises them. + +Each plugin is a single file with a `register(registry)` entry point. The +gateway auto-loads every `.py` file in `plugins/` at startup. **No server.py +change needed to add a new tool** — drop a new file in, restart the +container, the new tools appear in `tools/list`. ## How to run it @@ -51,6 +63,11 @@ The `seed.py` script is idempotent (uses `MERGE` and `ON CONFLICT`). 
It loads: - ~20 time-bounded relations - 3 trade log entries - 4 generated images (portraits + landscape + battle scene) uploaded to MinIO +- 5 hand-crafted consistency violations pre-materialized as `:Contradiction`, + `:Anachronism`, `:Orphan`, and `:OntologyViolation` nodes (see + `docs/CONSISTENCY_DEMO.md`) +- 1 parallel world, `arda_greyscale` — a minimal mirror of the default + world with no overlapping node ids (see `docs/MULTI_WORLD_DEMO.md`) ## Try the gateway @@ -166,30 +183,89 @@ curl -s -X POST http://localhost:8765/mcp \ ## What this proves -1. **The plugin boundary works.** A new domain type (trade, images) is a new file in `plugins/`. No change to `server.py`, no change to docker-compose, no new container. Restart the gateway and the new tools are live. +1. **The plugin boundary works.** A new domain type (trade, images, embeddings, + consistency) is a new file in `plugins/`. No change to `server.py`, no change + to docker-compose, no new container. Restart the gateway and the new tools + are live. The `consistency` plugin (added in v2.T5) is the most recent + example — four violation-detection tools, all in one file. -2. **Polyglot storage is real, not aspirational.** Neo4j holds the typed world graph. Postgres holds the time-series operational data and image manifests. MinIO holds the image bytes. Each store does what it's good at; the gateway composes the answers. +2. **Polyglot storage is real, not aspirational.** Neo4j holds the typed world + graph. Postgres holds the time-series operational data, image manifests, and + the `image_embedding` vectors (pgvector). MinIO holds the image bytes. Each + store does what it's good at; the gateway composes the answers. -3. **Time is a first-class query primitive.** `was_true_at` checks time-bounded edges with a single Cypher query — no LLM, no inference. Year-level precision works against the mock data (see `2nd_age.year_230` example above). +3. 
**Time is a first-class query primitive.** `was_true_at` checks time-bounded + edges with a single Cypher query — no LLM, no inference. Year-level + precision works against the mock data (see `2nd_age.year_230` example above). -4. **Image recall works.** Images are stored in MinIO, linked to entities in Neo4j (`(:Image)-[:DEPICTS]->(:Person)`), and discoverable by entity id, by tag, or by caption substring search. Presigned URLs are generated on the fly. +4. **Image recall works.** Images are stored in MinIO, linked to entities in + Neo4j (`(:Image)-[:DEPICTS]->(:Person)`), and discoverable by entity id, by + tag, by caption substring search, or by natural-language description via the + `search_images_semantic` (pgvector) tool. Presigned URLs are generated on + the fly. -5. **The world is small but real.** 10 people, 6 events, 4 images, ~20 relations — enough to demonstrate the architecture end-to-end. Scaling is a separate problem; this is the proof of shape. +5. **The consistency engine is real.** The four `find_*` tools query + pre-materialized violation nodes in Neo4j and return structured + `{violations, count}` envelopes — not booleans, not error strings. The + `seed.py:seed_violations` step computes the violations from the same + heuristics (overlapping `MEMBER_OF` windows, `Person.born > event_year`, + orphan entities, and `:OntologyRule`-driven checks) so the math is visible + in plain Python — not hidden in Cypher. See `docs/CONSISTENCY_DEMO.md` for + the five hand-crafted violations the seed surfaces. + +6. **Multiple worlds live in one graph.** Every world-scoped node and edge + carries a `world_id` property, and the read tools accept a `world_id` + argument (defaulting to `"default"`). The v2.T6 seed loads a parallel + `arda_greyscale` world with no overlapping node ids, and + `list_worlds()` returns both. See `docs/MULTI_WORLD_DEMO.md` for the + worked example. + +7. 
**An LLM can drive the whole surface.** `examples/llm_consumer.py` is a + real driver that takes a natural-language question, calls the gateway's + `tools/list`, picks the right tool(s), and answers in prose — all wired + through the local LiteLLM proxy. 5 question types × 9 distinct tools + exercised, all answers hand-verified against the seed. See + `docs/LLM_CONSUMER_DEMO.md` and `examples/REPORT.md`. + +8. **The world is small but real.** 10 people + 9 greyscale-world people, 6 + events, 5 images (4 default + 1 greyscale), ~20 relations — enough to + demonstrate the architecture end-to-end across two parallel worlds. + Scaling is a separate problem; this is the proof of shape. ## What's not in this POC -- **No LLM in the loop.** The MCP gateway is a tool server; the LLM client (Claude, GPT, anything) is the consumer. This is intentional — the POC validates the data and tool layers, not the LLM reasoning. The reasoning harness is in the design docs (`lore-engine/docs/07-reasoning-harness.md`) and would be added as a system prompt in a real deployment. +- **No LLM in the loop at runtime — the LLM consumer is a separate + example.** The MCP gateway itself is a tool server; the LLM client + (Claude, GPT, anything reachable via the LiteLLM proxy) is the consumer. + This is intentional — the POC validates the data and tool layers, not the + LLM reasoning. The reasoning harness is in the design docs + (`lore-engine/docs/07-reasoning-harness.md`); `examples/llm_consumer.py` + implements the v1.1 of that harness against the live gateway. -- **Consistency detection is real (v2.T5).** The 4 tools (`find_contradictions`, `find_anachronisms`, `find_orphans`, `find_ontology_violations`) query pre-materialized violation nodes in Neo4j. 
The seed (`seed.py:seed_violations`) computes the violations from the same heuristics (overlapping `MEMBER_OF` windows, `Person.born > event_year`, world entities with no relations, and `:OntologyRule`-driven checks) so the math is visible in plain Python — not hidden in Cypher. +- **No world-builder UI.** Everything is `curl` and `cypher-shell`. The UI + is a v3 feature. -- **No world-builder UI.** Everything is `curl` and `cypher-shell`. The UI is a v2 feature. +- **No reflective memory or behavior layer.** The Stanford Generative Agents + pattern (memory stream + reflection + planning) is a v3 borrow per the + comparison in `lore-engine/docs/16-comparison.md`. -- **No reflective memory or behavior layer.** The Stanford Generative Agents pattern (memory stream + reflection + planning) is a v2 borrow per the comparison in `lore-engine/docs/16-comparison.md`. +## Shipped in v2 -## Next steps after this POC +What was on the v1 "next steps" list, and what it became in v2: -- ~~Implement the consistency detection rules behind the 4 stub tools (T5).~~ **Done.** -- Add the embedding-based semantic search plugin (uses the `Image.caption` and any future `Person.summary` text). -- Add an LLM client that consumes the gateway with the reasoning harness system prompt and runs the 5 question types from the design. - -The v1 design in `lore-engine/docs/` is the contract. This POC is the proof of shape. +- ~~Implement the consistency detection rules behind the 4 stub tools + (T5).~~ **Done** — see `plugins/consistency.py` and + `docs/CONSISTENCY_DEMO.md`. 4 tools, 5 violations surfaced from the seed. +- ~~Add the embedding-based semantic search plugin (uses the `Image.caption` + and any future `Person.summary` text).~~ **Done** — see `plugins/embeddings.py` + and `docs/LLM_CONSUMER_DEMO.md`. 384-dim MiniLM, pgvector cosine distance, + background embedding on `register_image`. 
+- ~~Add an LLM client that consumes the gateway with the reasoning harness + system prompt and runs the 5 question types from the design.~~ **Done** — + see `examples/llm_consumer.py` and `examples/REPORT.md`. 5 questions, 9 + distinct tools, all hand-verified against seed ground truth. +- **v2 extras** not on the v1 list: the multi-world namespace with the + `arda_greyscale` parallel seed (T6); the `:OntologyViolation` rule-driven + detection in addition to the original three classes (T5); and a fresh-clone + smoke test (`scripts/ci-smoke.sh`) that exercises the gateway end-to-end + from a clean state (T1). diff --git a/docs/CONSISTENCY_DEMO.md b/docs/CONSISTENCY_DEMO.md new file mode 100644 index 0000000..99acc1a --- /dev/null +++ b/docs/CONSISTENCY_DEMO.md @@ -0,0 +1,210 @@ +# Consistency Engine — Worked Example + +This is a live end-to-end run of the four consistency tools that landed in v2.T5. +Everything below is real tool output from `bash examples/test_consistency.sh` +against the current gateway at `localhost:8765`, taken from the v2 build +(`8261c2d` on `wt/t5-consistency-impl`). + +## What the engine does + +The consistency engine has four read-only tools, each backed by pre-materialized +violation nodes in Neo4j. The seed (`seed.py:seed_violations`) computes the +violations from the same heuristics the tools re-run defensively, so every +violation id is stable, the math is visible in plain Python, and an operator +can re-derive any flagged issue by hand from the seed. + +| Tool | Neo4j label | Live count (this run) | +|---|---|---| +| `find_contradictions` | `:Contradiction` | 1 | +| `find_anachronisms` | `:Anachronism` | 1 | +| `find_orphans` | `:Orphan` | 1 | +| `find_ontology_violations` | `:OntologyViolation` | 2 | +| **Total** | | **5** | + +All four tools support an optional `severity` argument (`"any"`, `"error"`, +`"warn"`), and the world-scoped read tools accept `world_id="default"`. 
+The default world contains the violations; the `arda_greyscale` world is +clean (its seed doesn't inject any hand-crafted ones). + +## 1. Contradictions — overlapping faction memberships + +A `:Contradiction` is a pair of `MEMBER_OF` relations on the same person +whose `[valid_from, valid_until]` windows overlap but whose target factions +differ. It's the classic "sworn to two houses at once" case. + +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"find_contradictions","arguments":{"world_id":"default"}} + }' +``` + +```json +{ + "violations": [ + { + "id": "c_aldric_double_membership", + "label": "Contradiction", + "severity": "error", + "status": "open", + "details": "Aldric Raventhorne is MEMBER_OF House Vyr (240-) and MEMBER_OF Crimson Pact (260-285); the two memberships overlap.", + "detected_at": "2026-06-16T23:04:51.238226Z" + } + ], + "count": 1 +} +``` + +The math: Aldric's `MEMBER_OF` House Vyr opens at year 240 with no end date. +His `MEMBER_OF` Crimson Pact runs 260–285. The two windows overlap from 260 +to 285. He can't be a sworn member of both factions at once. + +The seed source is `seed.py:c_aldric_double_membership`; see the +`Aldric Raventhorne` relations block in `seed_world_default` for the +underlying `MEMBER_OF` rows. + +## 2. Anachronisms — a person at an event before they were born + +An `:Anachronism` is a `:PARTICIPATED_IN` (or similar) relation between a +person and an event where `event.in_fiction_time` is before `person.born`.
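
Both definitions so far reduce to integer comparisons on years. The sketch below is illustrative only: `windows_overlap` and `is_anachronism` are hypothetical helper names, not functions from `seed.py` or `plugins/consistency.py`, and treating closed windows as inclusive at both ends is an assumption. The input years are the ones reported in this document's violation details.

```python
from typing import Optional


def windows_overlap(a_from: int, a_until: Optional[int],
                    b_from: int, b_until: Optional[int]) -> bool:
    """True if two [valid_from, valid_until] year windows share any year.

    None as an end year means the window is still open. Assumption:
    closed windows are inclusive at both ends.
    """
    a_end = float("inf") if a_until is None else a_until
    b_end = float("inf") if b_until is None else b_until
    return a_from <= b_end and b_from <= a_end


def is_anachronism(born: int, event_year: int) -> bool:
    """True if the event took place before the participant was born."""
    return event_year < born


# Aldric: House Vyr from 240 (open-ended), Crimson Pact 260-285.
aldric_overlap = windows_overlap(240, None, 260, 285)

# Vex the Silent: born 180, tagged at an event in year 85.
vex_flagged = is_anachronism(born=180, event_year=85)
vex_gap = 180 - 85

print(aldric_overlap, vex_flagged, vex_gap)
```

Keeping the comparisons this small is what lets an operator re-derive a flagged issue by hand from the seed rows.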
+ +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"find_anachronisms","arguments":{"world_id":"default"}} + }' +``` + +```json +{ + "violations": [ + { + "id": "a_vex_at_founding", + "label": "Anachronism", + "severity": "error", + "status": "open", + "details": "Vex the Silent (born 180) is recorded as participating in the Founding of House Vyr (year 85) — 95 years before his birth.", + "detected_at": "2026-06-16T23:04:51.238226Z" + } + ], + "count": 1 +} +``` + +Vex the Silent, born in 180, is tagged as a participant in the +"Founding of House Vyr" event in year 85. The Cypher check joins the +`PARTICIPATED_IN` edge to the person's `born` property and the event's +`in_fiction_time`, extracted as an integer year. + +## 3. Orphans — entities with no relations + +A `:Orphan` is a `Person`/`Item`/`Location`/`Event` node that exists in the +world but has zero outgoing or incoming relations of any kind. These are +typically world-builder placeholders that haven't been wired into the story +yet. + +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"find_orphans","arguments":{"world_id":"default"}} + }' +``` + +```json +{ + "violations": [ + { + "id": "o_unfinished_npc", + "label": "Orphan", + "severity": "warn", + "status": "open", + "details": "Person 'Lyssa the Watcher' exists but has no relations — world-builder placeholder, not yet connected.", + "detected_at": "2026-06-16T23:04:51.238226Z" + } + ], + "count": 1 +} +``` + +`Lyssa the Watcher` is a real Person node in the seed (see +`seed.py:Lyssa the Watcher`) with no `PARENT_OF`, `MEMBER_OF`, `SPOUSE_OF`, +or any other relation. Note the severity: `warn`, not `error` — an +unfinished NPC is a real artifact of worldbuilding, not a story-level +inconsistency. + +## 4. 
Ontology violations — rule-driven checks + +An `:OntologyViolation` marks a `(:Person)` node that fails an active +`:OntologyRule`. Rules are themselves Neo4j nodes (`(:OntologyRule)`) with +a `predicate` (a short Python expression) and a `description`. The +consistency plugin runs each rule over the world and materializes a +violation node for every person that fails it. + +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"find_ontology_violations","arguments":{"world_id":"default"}} + }' +``` + +```json +{ + "violations": [ + { + "id": "ov_maric_no_died", + "label": "OntologyViolation", + "severity": "warn", + "status": "open", + "details": "Person 'Maric Vyr' (born 85) has no death year; rule 'persons_born_before_280_must_die' applies.", + "detected_at": "2026-06-16T23:04:51.238226Z", + "entity_id": "maric", + "rule_id": "persons_born_before_280_must_die" + }, + { + "id": "ov_theron_no_died", + "label": "OntologyViolation", + "severity": "warn", + "status": "open", + "details": "Person 'Theron Ashveil' (born 10) has no death year; rule 'persons_born_before_280_must_die' applies.", + "detected_at": "2026-06-16T23:04:51.238226Z", + "entity_id": "theron", + "rule_id": "persons_born_before_280_must_die" + } + ], + "count": 2 +} +``` + +The rule `persons_born_before_280_must_die` is a world-builder convention: +in the default world's narrative, anyone born before the Age of Iron +(before year 280) must have a recorded death year, because the present +day is well past 280 and a living person from the 1st Age is +unprecedented. Maric (born 85) and Theron (born 10) intentionally have no +death year in the seed: they belong to long-lived lineages and are still +alive in the present. The two violations are *expected* by the world-builder +but flagged so the LLM (or operator) knows the rule is being broken.
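
The rule check described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the plugin's code: the `Person` and `Rule` dataclasses are hypothetical, the rule is written as an ordinary callable rather than a `predicate` string read from an `(:OntologyRule)` node, and the `ov_<id>_no_died` id pattern is only inferred from the two ids in the output above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Person:
    id: str
    name: str
    born: int
    died: Optional[int] = None  # None means no recorded death year


@dataclass
class Rule:
    rule_id: str
    severity: str
    # Returns True when the person breaks the rule.
    is_violated_by: Callable[[Person], bool]


def check(rule: Rule, people: List[Person]) -> Dict[str, object]:
    """Run one rule over all people and wrap the hits in a {violations, count} envelope."""
    violations = [
        {
            "id": f"ov_{p.id}_no_died",  # illustrative id pattern
            "severity": rule.severity,
            "entity_id": p.id,
            "rule_id": rule.rule_id,
        }
        for p in people
        if rule.is_violated_by(p)
    ]
    return {"violations": violations, "count": len(violations)}


must_die = Rule(
    rule_id="persons_born_before_280_must_die",
    severity="warn",
    is_violated_by=lambda p: p.born < 280 and p.died is None,
)

people = [
    Person("maric", "Maric Vyr", born=85),
    Person("theron", "Theron Ashveil", born=10),
    Person("aldric", "Aldric Raventhorne", born=220, died=285),
]

envelope = check(must_die, people)
print(envelope["count"], [v["entity_id"] for v in envelope["violations"]])
```

Aldric passes because he has a recorded death year; only the two deliberately open-ended persons are flagged, which matches the count of 2 above.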
+ +## How the seed side-stays the violation math + +`seed.py:seed_violations` is the Python source of truth for what the tools +return. Five pre-materialized violation nodes (one Con, one Ana, one +Orph, two OV) get `MERGE`'d into the default world, and the tool Cypher +queries read them back. If a tool query and the seed drift apart, the +detection surface in `seed.py` is the one to trust; the queries are a +defensive layer so a missing seed row doesn't silently hide a violation. + +## Files + +- `plugins/consistency.py` — the four tools +- `seed.py:seed_violations` — the 5 hand-crafted violations +- `tests/test_consistency.py` — 10 pytest cases +- `examples/test_consistency.sh` — the live E2E runner that produced + every block of output above diff --git a/docs/LLM_CONSUMER_DEMO.md b/docs/LLM_CONSUMER_DEMO.md new file mode 100644 index 0000000..e08ad8f --- /dev/null +++ b/docs/LLM_CONSUMER_DEMO.md @@ -0,0 +1,223 @@ +# LLM Consumer — Worked Example + +This is a live walkthrough of the LLM consumer that landed in v2.T4. +Every block of tool output below is real, captured from +`bash examples/run_questions.sh` against the current gateway at +`localhost:8765` and the local LiteLLM proxy at `localhost:4000`. The +driver is `examples/llm_consumer.py`; the orchestrator is +`examples/run_questions.sh`; the system prompt template is +`examples/system_prompt.txt`; the per-question traces are saved under +`examples/results/*.json`. The full E2E report is in +`examples/REPORT.md`. + +## What the consumer proves + +The MCP gateway is a tool server. The LLM is the consumer. v2 ships a +real driver that: + +1. Calls `tools/list` on the gateway. +2. Takes a natural-language question. +3. Asks an LLM (via LiteLLM) to pick the right tool(s) and + form the arguments. +4. Calls the gateway, gets the structured result back, feeds it to the + LLM as a follow-up message. +5. Returns a prose answer. 
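
The five steps above can be sketched as a small loop. This is not the code in `examples/llm_consumer.py`: the function names, the injected `send`, `choose_calls`, and `write_answer` callables, and the stubbed response shapes are all hypothetical. Only the JSON-RPC request bodies (`tools/list`, and `tools/call` with `name` and `arguments`) and the `{"tool": ..., "args": ...}` shape of a chosen call follow what this document shows.

```python
from typing import Any, Callable, Dict, List, Optional

Json = Dict[str, Any]


def jsonrpc_request(method: str, params: Optional[Json] = None, req_id: int = 1) -> Json:
    """Build a JSON-RPC 2.0 request body like the curl examples send."""
    body: Json = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        body["params"] = params
    return body


def tool_call_request(name: str, arguments: Json, req_id: int = 1) -> Json:
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments}, req_id)


def answer_question(
    question: str,
    send: Callable[[Json], Json],                    # transport to the gateway
    choose_calls: Callable[[str, Json], List[Json]],  # stands in for the LLM's tool choice
    write_answer: Callable[[str, List[Json]], str],   # stands in for the LLM's prose step
) -> str:
    tools = send(jsonrpc_request("tools/list"))       # step 1
    calls = choose_calls(question, tools)             # steps 2 and 3
    results = [                                       # step 4
        send(tool_call_request(call["tool"], call["args"], req_id=i))
        for i, call in enumerate(calls, start=2)
    ]
    return write_answer(question, results)            # step 5


# Stubs so the sketch runs without a gateway or an LLM.
sent: List[Json] = []


def fake_send(request: Json) -> Json:
    sent.append(request)
    if request["method"] == "tools/list":
        return {"tools": [{"name": "was_true_at"}]}   # hypothetical response shape
    return {"result": True}                           # hypothetical response shape


def fake_choose(question: str, tools: Json) -> List[Json]:
    return [{"tool": "was_true_at", "args": {"at_time": "2nd_age.year_230"}}]


def fake_write(question: str, results: List[Json]) -> str:
    return f"{len(results)} tool result(s) for: {question}"


answer = answer_question("Was House Vyr allied?", fake_send, fake_choose, fake_write)
print(answer)
```

Injecting the transport and the two LLM steps as callables keeps the loop testable offline; a real driver would replace them with an HTTP client and LiteLLM calls.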
+ +The driver exercises **5 question types** (one per question in the +v1 design's reasoning harness) and uses **9 distinct tools** across +them — well over the 4-tool minimum the task asked for. Every answer +is hand-verified against seed ground truth. + +## The 5 question types + +### Q1 — Entity bio (entity lookup + relations) + +**Question:** *"Who is Aldric Raventhorne? Give a brief bio and his known relations."* + +**Tools the LLM chose:** + +```json +[ + { "tool": "entity_context", "args": { "name": "Aldric Raventhorne" } }, + { "tool": "lineage_of", "args": { "person": "Aldric Raventhorne" } } +] +``` + +**Answer (excerpt):** + +> **Aldric Raventhorne** (born 220 – died 285) was a noble of Valdorni +> culture. He was a member of **House Vyr** and belonged to the +> **House Vyr bloodline**. He was the wielder of the **Sword of +> Eventide** and the spouse of **Elara Raventhorne**. +> +> Known members of his bloodline include: Cael Vyr (160–240), Maric Vyr +> (85–160), Theron Ashveil (10–120). + +**Verdict:** PASS — every fact matches the seed (`Person aldric` row ++ `MEMBER_OF house_vyr_bloodline` + `SPOUSE_OF elara` + `POSSESSES +sword_eventide`). + +### Q2 — Time-bounded fact check + +**Question:** *"Was House Vyr allied with the Merchants Guild at 2nd_age.year_230?"* + +**Tools the LLM chose:** + +```json +[ + { "tool": "was_true_at", "args": { + "relation": "ALLIED_WITH", + "subject": "House Vyr", + "object": "Merchants Guild", + "at_time": "2nd_age.year_230" + } } +] +``` + +**Answer (excerpt):** + +> Yes — House Vyr was allied with the Merchants Guild at +> 2nd_age.year_230. The ALLIED_WITH relationship was established in +> 2nd_age.year_100 and has no recorded end date, so it was in effect +> at that time. + +**Verdict:** PASS — the model formed the time slug `2nd_age.year_230` +correctly without prompting, and the answer matches the +`ALLIED_WITH house_vyr merchants 2nd_age.year_100 None` relation in +the seed. 
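
The reasoning in the Q2 answer is a window check on a time slug. The sketch below is an assumption-laden illustration, not the `was_true_at` implementation (which this repository describes as a single Cypher query): `parse_time` and `edge_holds_at` are hypothetical names, the `<era>.year_<n>` slug grammar is inferred from the one example in this document, the end of a window is treated as inclusive, and comparisons across different eras are deliberately not handled.

```python
import re
from typing import Optional, Tuple

# Slug grammar inferred from "2nd_age.year_230"; an assumption, not a spec.
_SLUG = re.compile(r"^(?P<era>[a-z0-9_]+)\.year_(?P<year>\d+)$")


def parse_time(slug: str) -> Tuple[str, int]:
    """Split a slug such as '2nd_age.year_230' into ('2nd_age', 230)."""
    match = _SLUG.match(slug)
    if match is None:
        raise ValueError(f"unrecognized time slug: {slug!r}")
    return match.group("era"), int(match.group("year"))


def edge_holds_at(valid_from: str, valid_until: Optional[str], at_time: str) -> bool:
    """True if at_time falls inside [valid_from, valid_until]; None means open-ended."""
    era, at_year = parse_time(at_time)
    from_era, from_year = parse_time(valid_from)
    if from_era != era:
        raise ValueError("cross-era comparison is not handled in this sketch")
    if at_year < from_year:
        return False
    if valid_until is None:
        return True
    until_era, until_year = parse_time(valid_until)
    if until_era != era:
        raise ValueError("cross-era comparison is not handled in this sketch")
    return at_year <= until_year


# ALLIED_WITH house_vyr -> merchants: from 2nd_age.year_100, no end date.
allied_at_230 = edge_holds_at("2nd_age.year_100", None, "2nd_age.year_230")
allied_at_50 = edge_holds_at("2nd_age.year_100", None, "2nd_age.year_50")

print(allied_at_230, allied_at_50)
```

An open-ended edge stays true for every later year in the same era, which is why the answer above needs no end date to reach "yes".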
+ +### Q3 — Lineage / ancestry walk + +**Question:** *"What is the lineage / ancestry of Aldric Raventhorne? Walk back as far as you can."* + +**Tools the LLM chose:** + +```json +[ + { "tool": "ancestors_of", "args": { "person": "Aldric Raventhorne", "generations": 10 } }, + { "tool": "entity_context", "args": { "name": "Aldric Raventhorne" } }, + { "tool": "lineage_of", "args": { "person": "Aldric Raventhorne" } } +] +``` + +**Answer (excerpt):** + +> **Aldric Raventhorne** (b. 220, d. 285) — member of **House Vyr +> (bloodline)**. Walking back through his known ancestors: +> +> - **Mother:** **Yssa Raventhorne** (b. 165, d. unknown) +> - **Maternal grandfather:** **Cael Vyr** (b. 160, d. 240) — House Vyr +> - **Cael's father:** **Maric Vyr** (b. 85, d. 160) — House Vyr +> - **Maric's father:** **Theron Ashveil** (b. 10, d. 120) — the +> furthest ancestor the graph records. + +**Verdict:** PASS — all 4 canonical ancestor names appear in the +answer, in the right order, with the right birth/death years. + +### Q4 — Image recall + +**Question:** *"Show me images of Aldric Raventhorne — portraits or otherwise."* + +**Tools the LLM chose:** + +```json +[ + { "tool": "entity_context", "args": { "name": "Aldric Raventhorne" } }, + { "tool": "recall_images", "args": { "entity_id": "aldric" } } +] +``` + +**Answer (excerpt):** + +> Found one image for Aldric Raventhorne: +> +> **Portrait of Aldric Raventhorne, Lord of Thornwall** +> (`img_aldric_portrait`) +> - Era: 2nd Age +> - Caption: "Portrait of Aldric Raventhorne, Lord of Thornwall. +> Middle-aged, dark hair, a scar above the left eye." +> - Tags: portrait, noble, thornwall +> - URL: `http://localhost:9000/lore-images/characters/aldric_portrait.png?...presigned...` + +**Verdict:** PASS — the caption is quoted verbatim from the seed +manifest, the MinIO presigned URL is surfaced, and the model's +"found one image" claim matches the actual count. 
+ +### Q5 — Consistency scan (multi-tool fan-out) + +**Question:** *"What are the open consistency issues in the world graph right now? Check contradictions, anachronisms, orphans, and ontology violations."* + +**Tools the LLM chose:** + +```json +[ + { "tool": "find_contradictions", "args": {} }, + { "tool": "find_anachronisms", "args": {} }, + { "tool": "find_orphans", "args": {} }, + { "tool": "find_ontology_violations", "args": {} } +] +``` + +**Answer (excerpt) — from the v2.T4 snapshot in time:** + +> No open consistency issues detected. All four consistency checks +> returned zero violations. + +**Verdict:** PASS *for the snapshot in time the run was taken* — the +model did not fabricate violations; it faithfully reported whatever +the tools returned. The v2.T4 trace was captured before v2.T5 landed; +the live T5 build surfaces 5 hand-crafted violations in the default +world (1 contradiction, 1 anachronism, 1 orphan, 2 ontology), as +documented in `docs/CONSISTENCY_DEMO.md`. To re-validate Q5 against +the current T5 build, run `bash examples/test_consistency.sh` — that +script is the v2.T5 replacement for the stub-trace era of the consumer +and is the authoritative Q5 evidence going forward. + +## Tool coverage + +| Tool | Question(s) | Used in Q | +|---|---|---| +| `entity_context` | Q1, Q3, Q4 | 3 | +| `lineage_of` | Q1, Q3 | 2 | +| `was_true_at` | Q2 | 1 | +| `ancestors_of` | Q3 | 1 | +| `recall_images` | Q4 | 1 | +| `find_contradictions` | Q5 | 1 | +| `find_anachronisms` | Q5 | 1 | +| `find_orphans` | Q5 | 1 | +| `find_ontology_violations` | Q5 | 1 | + +**9 distinct tools** across **5 questions**. The model discovered +them all from `tools/list` — no scripted routing. 
Several tools +(`state_at`, `descendants_of`, `log_trade`, `trades_by_buyer`, +`market_price`, `register_image`, `search_images_by_caption`, +`search_images_semantic`, `embed_images`, `list_worlds`) were +exercised separately by `bash test.sh` but the LLM correctly chose +not to invoke them for any of the 5 question types. + +## How to re-run + +```bash +# 1. gateway + DBs must be up +cd /root/lore-engine-poc +docker compose up -d --build +python3 seed.py + +# 2. LiteLLM proxy must be running on :4000 with the configured model + +# 3. drive the 5 questions +bash examples/run_questions.sh + +# raw traces in examples/results/ +ls examples/results/ +``` + +## Files + +- `examples/llm_consumer.py` — the driver (httpx + LiteLLM + tool loop) +- `examples/system_prompt.txt` — the system prompt the LLM sees +- `examples/run_questions.sh` — the orchestrator +- `examples/REPORT.md` — the full E2E report (verdicts, ground truth, + per-question traces) +- `examples/test_consistency.sh` — the v2.T5 consistency-only smoke + runner (replacement for the Q5 stub trace) diff --git a/docs/MULTI_WORLD_DEMO.md b/docs/MULTI_WORLD_DEMO.md new file mode 100644 index 0000000..3086881 --- /dev/null +++ b/docs/MULTI_WORLD_DEMO.md @@ -0,0 +1,219 @@ +# Multi-World Namespace — Worked Example + +This is a live walkthrough of the world namespace that landed in v2.T6. +Every call below is real tool output against the gateway at `localhost:8765` +from the v2 build (`4f92289` on `wt/t6-multi-world`). + +## What the namespace is + +The v1 POC stored every node and edge in a single graph. v2 adds a +`world_id` property on every world-scoped node and edge, and a new +`list_worlds()` admin tool. 
The read tools (`entity_context`, +`was_true_at`, `state_at`, `ancestors_of`, `descendants_of`, +`lineage_of`, `recall_images`, `search_images_by_caption`, +`search_images_semantic`, `trades_by_buyer`, `market_price`, the +consistency `find_*` tools) all accept an optional `world_id` argument +that defaults to `"default"`. Write tools (`log_trade`, `register_image`, +`embed_images`) tag the row with the caller's `world_id`. + +This lets a single Neo4j instance hold multiple parallel worlds with no +node-id collisions. The default seed loads a second world, `arda_greyscale`, +that mirrors the default world's shape with its own people, factions, +locations, events, and relations. + +## 1. `list_worlds()` — what's loaded + +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"list_worlds","arguments":{}}}' +``` + +```json +[ + { "world_id": "arda_greyscale" }, + { "world_id": "default" } +] +``` + +Both worlds are alive in the same graph. Note the default ordering is +newest-first by seed time. + +## 2. The default world — Theron's bloodline + +The default world is the v1 set: Theron Ashveil, Maric Vyr, Cael Vyr, +Yssa Raventhorne, Aldric Raventhorne, Elara Raventhorne, plus factions +House Vyr / Crimson Pact / Merchants Guild and the founding-event / +Black-Spire-event / founding-of-the-Merchants-Guild era. 
+ +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"entity_context","arguments":{"name":"Theron Ashveil","world_id":"default"}} + }' +``` + +```json +{ + "found": true, + "name": "Theron Ashveil", + "id": "theron", + "world_id": "default", + "labels": ["Person"], + "properties": { + "world_id": "default", + "tier": "noble", + "culture": "Valdorni", + "born": 10, + "name": "Theron Ashveil", + "id": "theron" + }, + "relations": [ + { "rel": "PARENT_OF", "to_id": "maric", "to": "Maric Vyr" }, + { "rel": "MEMBER_OF", "to_id": "house_vyr_bloodline", "to": "House Vyr (bloodline)" } + ] +} +``` + +`Theron Ashveil` is the founding ancestor of the House Vyr bloodline. +He exists in the `default` world and is the earliest known ancestor of +Aldric (see `docs/LLM_CONSUMER_DEMO.md` Q3 for the full chain). + +## 3. The greyscale world — Mael & Sira Greyscale + +`arda_greyscale` is a parallel world seeded by +`seed.py:seed_greyscale_world` with its own era (`greyscale_age`), its +own faction (The Ashen Court), and its own people. The greyscale seed +intentionally uses different node ids — `mael_greyscale`, `sira_greyscale` +— so a query in one world cannot accidentally return the other. 
+ +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"entity_context","arguments":{"name":"Mael Greyscale","world_id":"arda_greyscale"}} + }' +``` + +```json +{ + "found": true, + "name": "Mael Greyscale", + "id": "mael_greyscale", + "world_id": "arda_greyscale", + "labels": ["Person"], + "properties": { + "world_id": "arda_greyscale", + "tier": "noble", + "culture": "Greyscale", + "born": 220, + "name": "Mael Greyscale", + "id": "mael_greyscale" + }, + "relations": [ + { "rel": "MEMBER_OF", "to_id": "ashen_court", "to": "The Ashen Court" }, + { "rel": "SPOUSE_OF", "to_id": "sira_greyscale", "to": "Sira Greyscale" } + ] +} +``` + +Mael is the greyscale world's analogue of Aldric: a noble, a member of +the Ashen Court, spouse of a Greyscale twin. Note `culture: "Greyscale"` +and `tier: "noble"` — same property names, completely different +meanings from the default world. + +## 4. Cross-world isolation — the namespace holds + +A query in world X for an entity that exists only in world Y must come +back empty. This is the test the namespace was built to pass. + +### Aldric is default-only — greyscale returns empty + +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"recall_images","arguments":{"entity_id":"aldric","world_id":"arda_greyscale"}} + }' +``` + +```json +{ + "entity_id": "aldric", + "world_id": "arda_greyscale", + "count": 0, + "images": [] +} +``` + +Aldric's images are in the default world's `image_manifest` table, not +the greyscale one. With `world_id="arda_greyscale"`, the image recall +query finds zero — exactly what the namespace promises. 
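The isolation check above comes down to one extra key in the `arguments` object of a JSON-RPC `tools/call` request. The following is a minimal sketch of how a client might build that request and decode the reply, not the gateway's or `llm_consumer.py`'s actual code. The helper names `build_tool_call` and `parse_tool_result` are hypothetical; the request shape is copied from the curl examples, and the `result.content[0].text` envelope is the one `call_tool` in `examples/test_e2e.sh` reads.

```python
import json
from typing import Any, Optional


def build_tool_call(name: str, arguments: Optional[dict] = None,
                    world_id: Optional[str] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` body shaped like the curl examples."""
    args = dict(arguments or {})
    if world_id is not None:
        # Leaving world_id out relies on the documented server-side default ("default").
        args["world_id"] = world_id
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": args},
    }


def parse_tool_result(response: dict) -> Any:
    """Decode the JSON text carried in result.content[0].text (assumed envelope)."""
    return json.loads(response["result"]["content"][0]["text"])


# The cross-world recall_images request from the example above.
payload = build_tool_call("recall_images", {"entity_id": "aldric"},
                          world_id="arda_greyscale")
print(json.dumps(payload))
```

Keeping `world_id` optional in the helper mirrors the opt-in design: a v1-style caller that never passes it produces the same body it always did.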
+ +### Trade log — default scope doesn't see greyscale entries (and vice versa) + +```bash +curl -s -X POST http://localhost:8765/mcp \ + -H "Content-Type: application/json" \ + -d '{ + "jsonrpc":"2.0","id":1,"method":"tools/call", + "params":{"name":"market_price","arguments":{"item_id":"pale_ledger","world_id":"default"}} + }' +``` + +```json +{ + "item_id": "pale_ledger", + "sample_size": 2, + "avg_unit_price": 500.0, + "min_unit_price": 500.0, + "max_unit_price": 500.0, + "most_recent": "2026-06-16T23:04:51.276172+00:00" +} +``` + +The same `market_price` call against `arda_greyscale` returns zero +trades for `pale_ledger` (the greyscale world has its own item +namespace, not the default `pale_ledger`). The trades table's PK +includes `world_id` so a row inserted by `log_trade` with +`world_id="arda_greyscale"` is invisible to a default-scope query. + +## 5. How a tool uses `world_id` + +The `MATCH` clauses in the world-scoped tools all include +`{id: $..., world_id: $world_id}` so a node in the wrong world simply +doesn't match. For example, the lineage ancestors query in +`plugins/lineage.py`: + +```cypher +MATCH path = (a:Person {id: $person, world_id: $world_id})<-[:PARENT_OF*1..10]-(ancestor:Person) +WHERE ancestor.world_id = $world_id +RETURN ancestor +``` + +Both ends of the path are pinned to the same `world_id`, so the chain +never crosses a world boundary. The `state_at` and `entity_context` +queries follow the same pattern; the image and trade queries hit +Postgres tables that carry `world_id` in their primary key. + +## 6. The world-resolution rule + +Tools that take a `world_id` argument default it to `"default"` so v1 +callers keep working unchanged. The `bash test.sh` runner passes +`world_id="default"` explicitly to verify that the opt-in behaviour +holds. The greyscale seed is loaded by `python3 seed.py` automatically +(no extra flag), and `list_worlds()` is the operator's view of what +exists.
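The scoping and defaulting rules in sections 5 and 6 can be illustrated without a database. The sketch below is a toy in-memory model, not the code in `plugins/lineage.py`: it assumes `PARENT_OF` edges point from parent to child (as the `entity_context` output for Theron suggests), the ids `cael` and `yssa` are guessed from the names in the walkthrough, and `mael_elder` is a made-up greyscale ancestor used only to show isolation.

```python
from collections import deque

# Toy PARENT_OF edges as (parent_id, child_id, world_id).
# Assumption: edges point parent -> child; "mael_elder" is hypothetical.
PARENT_OF = [
    ("theron", "maric", "default"),
    ("maric", "cael", "default"),
    ("cael", "yssa", "default"),
    ("yssa", "aldric", "default"),
    ("mael_elder", "mael_greyscale", "arda_greyscale"),
]


def ancestors_of(person, world_id="default", generations=10):
    """Walk PARENT_OF edges backwards, staying inside a single world."""
    found = []
    seen = {person}
    frontier = deque([(person, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == generations:
            continue  # generation cap reached on this branch
        for parent, child, edge_world in PARENT_OF:
            # An edge only counts if it belongs to the requested world.
            if child == node and edge_world == world_id and parent not in seen:
                seen.add(parent)
                found.append(parent)
                frontier.append((parent, depth + 1))
    return found


print(ancestors_of("aldric"))                             # ['yssa', 'cael', 'maric', 'theron']
print(ancestors_of("aldric", world_id="arda_greyscale"))  # []
print(ancestors_of("mael_greyscale", "arda_greyscale"))   # ['mael_elder']
```

Because `world_id` defaults to `"default"`, the first call behaves like a v1 caller, while the second shows the empty result a cross-world lookup is expected to give.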
+ +## Files + +- `seed.py:seed_greyscale_world` — the `arda_greyscale` seed +- `seed.py:_seed_images_for_world` — the per-world image manifest loader +- `plugins/lineage.py`, `plugins/world.py`, `plugins/images.py` — every + world-scoped read tool filters on `world_id` +- `tests/test_multi_world.py` — 14 pytest cases for the namespace +- `test.sh` section 12 — the `list_worlds()` smoke check diff --git a/examples/test_e2e.sh b/examples/test_e2e.sh new file mode 100755 index 0000000..93e0bf2 --- /dev/null +++ b/examples/test_e2e.sh @@ -0,0 +1,320 @@ +#!/usr/bin/env bash +# test_e2e.sh — End-to-end validation for v2.T7. +# +# What this proves (per task body): +# 1. The LLM consumer works end-to-end (5 question types) +# 2. The consistency tools find the right violations (5 seeded) +# 3. The LLM's answers match the seed-data ground truth +# +# Two independent layers: +# A. Direct tool calls — each of the 4 consistency tools is invoked +# against the live gateway and the violation count + ids are asserted +# against the table in examples/GROUND_TRUTH.md. This proves the +# tools work regardless of LLM behaviour. +# B. LLM consumer — for each of 5 question types, drive the LLM through +# the gateway, then assert the answer contains the expected facts +# (names, dates, severities). This proves the LLM consumer works. +# +# The script exits 0 only if EVERY check passes. 
+set -uo pipefail + +cd "$(dirname "$0")" +mkdir -p results +GATEWAY_URL="${GATEWAY_URL:-http://localhost:8765/mcp}" +LITELLM_URL="${LITELLM_URL:-http://localhost:4000/v1}" +LITELLM_MODEL="${LITELLM_MODEL:-minimax-m3}" +export GATEWAY_URL LITELLM_URL LITELLM_MODEL + +# ─── bookkeeping ────────────────────────────────────────────────────────────── + +fails=0 +passes=0 +declare -a FAIL_DETAILS=() + +ok() { passes=$((passes+1)); echo " ✓ $1"; } +fail() { fails=$((fails+1)); FAIL_DETAILS+=("$1"); echo " ✗ $1"; } + +section() { echo; echo "── $* ──"; } + +# ─── pre-flight ────────────────────────────────────────────────────────────── + +section "pre-flight: gateway + LiteLLM reachable" +if curl -s --max-time 5 -X POST "$GATEWAY_URL" -H "Content-Type: application/json" \ + -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' >/dev/null 2>&1; then + ok "gateway responds at $GATEWAY_URL" +else + fail "gateway unreachable at $GATEWAY_URL" + echo + echo "PRE-FLIGHT FAILED — aborting." + exit 1 +fi +if curl -s --max-time 5 "$LITELLM_URL/models" >/dev/null 2>&1; then + ok "LiteLLM responds at $LITELLM_URL" +else + fail "LiteLLM unreachable at $LITELLM_URL" + echo + echo "PRE-FLIGHT FAILED — aborting." + exit 1 +fi + +# ─── Layer A: direct consistency-tool calls ────────────────────────────────── + +# Helper: call a tool, print the parsed JSON envelope (one object per line). +call_tool() { + local name=$1 + local args=$2 + curl -s -X POST "$GATEWAY_URL" -H "Content-Type: application/json" \ + -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"tools/call\",\"params\":{\"name\":\"$name\",\"arguments\":$args}}" \ + | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['result']['content'][0]['text'])" +} + +# Helper: assert a tool's violation count + ids. +# Args: <tool> <args> <expected_count> [expected_id_1 ...]
+assert_violations() { + local tool=$1; shift + local args=$1; shift + local expected_count=$1; shift + local resp + resp=$(call_tool "$tool" "$args") + local got_count + got_count=$(printf '%s' "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])") + if [ "$got_count" = "$expected_count" ]; then + ok "$tool: count=$got_count (expected $expected_count)" + else + fail "$tool: count=$got_count (expected $expected_count) — full response: $resp" + return + fi + for want in "$@"; do + if printf '%s' "$resp" | python3 -c "import json,sys; ids=[v['id'] for v in json.load(sys.stdin)['violations']]; print('YES' if '$want' in ids else 'NO')" \ + 2>/dev/null | grep -q YES; then + ok "$tool: contains id=$want" + else + fail "$tool: missing id=$want (full response: $resp)" + fi + done +} + +section "Layer A — direct consistency tool calls (no LLM)" + +assert_violations "find_contradictions" '{"severity":"any"}' 1 c_aldric_double_membership +assert_violations "find_anachronisms" '{"severity":"any"}' 1 a_vex_at_founding +assert_violations "find_orphans" '{}' 1 o_unfinished_npc +assert_violations "find_ontology_violations" '{"severity":"any"}' 2 ov_theron_no_died ov_maric_no_died + +# Severity breakdown — task body says "the orphan being a warning, not error". 
+section "Layer A — severity breakdown" +contradictions_err=$(call_tool "find_contradictions" '{"severity":"error"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])") +contradictions_warn=$(call_tool "find_contradictions" '{"severity":"warn"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])") +[ "$contradictions_err" = "1" ] && ok "find_contradictions severity=error -> 1" || fail "find_contradictions severity=error -> $contradictions_err (expected 1)" +[ "$contradictions_warn" = "0" ] && ok "find_contradictions severity=warn -> 0" || fail "find_contradictions severity=warn -> $contradictions_warn (expected 0)" +anach_err=$(call_tool "find_anachronisms" '{"severity":"error"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])") +anach_warn=$(call_tool "find_anachronisms" '{"severity":"warn"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])") +[ "$anach_err" = "1" ] && ok "find_anachronisms severity=error -> 1" || fail "find_anachronisms severity=error -> $anach_err (expected 1)" +[ "$anach_warn" = "0" ] && ok "find_anachronisms severity=warn -> 0" || fail "find_anachronisms severity=warn -> $anach_warn (expected 0)" +# Orphans: 1 warn (the task body specifies this is a warn, not error). 
+orphan_severity=$(call_tool "find_orphans" '{}' | python3 -c "import json,sys; d=json.load(sys.stdin); print(','.join(v['severity'] for v in d['violations']))") +if [ "$orphan_severity" = "warn" ]; then + ok "find_orphans -> severity=warn (orphan is a warn, not error)" +else + fail "find_orphans -> severity=[$orphan_severity] (expected 'warn')" +fi +# Ontology: 2 warn +ont_warn=$(call_tool "find_ontology_violations" '{"severity":"warn"}' | python3 -c "import json,sys; print(json.load(sys.stdin)['count'])") +[ "$ont_warn" = "2" ] && ok "find_ontology_violations severity=warn -> 2" || fail "find_ontology_violations severity=warn -> $ont_warn (expected 2)" + +# Total +total_err=0 +total_warn=0 +for t in find_contradictions find_anachronisms find_orphans find_ontology_violations; do + args='{"severity":"any"}' + [ "$t" = "find_orphans" ] && args='{}' + e=$(call_tool "$t" "$args" | python3 -c "import json,sys; d=json.load(sys.stdin); print(sum(1 for v in d['violations'] if v['severity']=='error'))") + w=$(call_tool "$t" "$args" | python3 -c "import json,sys; d=json.load(sys.stdin); print(sum(1 for v in d['violations'] if v['severity']=='warn'))") + total_err=$((total_err+e)) + total_warn=$((total_warn+w)) +done +total=$((total_err+total_warn)) +[ "$total" = "5" ] && ok "TOTAL violations = 5 (2 error + 3 warn)" \ + || fail "TOTAL violations = $total (expected 5)" +[ "$total_err" = "2" ] && ok "TOTAL errors = 2" || fail "TOTAL errors = $total_err (expected 2)" +[ "$total_warn" = "3" ] && ok "TOTAL warns = 3" || fail "TOTAL warns = $total_warn (expected 3)" + +# ─── Layer B: LLM consumer — 5 question types ──────────────────────────────── + +section "Layer B — LLM consumer (5 question types)" + +declare -a IDS=( + "q1_who_is_aldric" + "q2_was_allied_230" + "q3_aldric_ancestors" + "q4_images_of_aldric" + "q5_consistency_issues" +) +declare -a QS=( + "Who is Aldric Raventhorne? Give a brief bio and his known relations." 
+ "Was House Vyr allied with the Merchants Guild at 2nd_age.year_230?" + "What is the lineage / ancestry of Aldric Raventhorne? Walk back as far as you can." + "Show me images of Aldric Raventhorne — portraits or otherwise." + "What are the open consistency issues in the world graph right now? Check contradictions, anachronisms, orphans, and ontology violations." +) + +for i in "${!IDS[@]}"; do + id="${IDS[$i]}" + q="${QS[$i]}" + echo + echo "── question $((i+1))/5: $id ──" + echo " Q: $q" + if ! python3 llm_consumer.py --question-id "$id" --question "$q" \ + --out "results/${id}.json" >"/tmp/llm_consumer_${id}.log" 2>&1; then + fail "Q$((i+1)) ($id): llm_consumer.py exited non-zero — see /tmp/llm_consumer_${id}.log" + tail -5 "/tmp/llm_consumer_${id}.log" | sed 's/^/ /' + continue + fi + tail -8 "/tmp/llm_consumer_${id}.log" + ok "Q$((i+1)) ($id): llm_consumer.py exit=0" +done + +# ─── Answer-level assertions against GROUND_TRUTH.md ───────────────────────── + +section "Layer B — answer-level assertions against GROUND_TRUTH.md" + +# Helper: read a trace and emit its (answer_lower, tools_csv) on two lines. +trace_info() { + local trace_path=$1 + python3 -c " +import json +d = json.load(open('$trace_path')) +ans = (d.get('answer') or '').lower() +tools = [t['tool'] for t in d.get('tools_called', [])] +print(ans) +print('---TOOLS---') +print(','.join(tools)) +" +} + +# Q1: entity_context called, answer has Aldric + a known affiliation. 
+if [ -f "results/q1_who_is_aldric.json" ]; then + trace=$(trace_info "results/q1_who_is_aldric.json") + q1_ans=${trace%%$'---TOOLS---'*} + q1_tools=$(printf '%s' "$trace" | awk 'seen {print} /^---TOOLS---$/ {seen=1}') + echo " Q1 tools: $q1_tools" + if [[ "$q1_tools" == *entity_context* ]]; then ok "Q1: entity_context in tools_called"; else fail "Q1: entity_context NOT called (got: $q1_tools)"; fi + if printf '%s' "$q1_ans" | grep -qi 'aldric'; then ok "Q1: answer mentions 'aldric'"; else fail "Q1: answer missing 'aldric'"; fi + if printf '%s' "$q1_ans" | grep -Eqi 'vyr|thornwall|elara|valdorni|eventide'; then + ok "Q1: answer mentions a known affiliation (Vyr/Thornwall/Elara/Valdorni/Eventide)" + else + fail "Q1: answer missing known affiliation" + fi +else + fail "Q1: results/q1_who_is_aldric.json missing (LLM consumer failed)" +fi + +# Q2: was_true_at called, answer says YES/allied/true. +if [ -f "results/q2_was_allied_230.json" ]; then + trace=$(trace_info "results/q2_was_allied_230.json") + q2_ans=${trace%%$'---TOOLS---'*} + q2_tools=$(printf '%s' "$trace" | awk 'seen {print} /^---TOOLS---$/ {seen=1}') + echo " Q2 tools: $q2_tools" + if [[ "$q2_tools" == *was_true_at* ]]; then ok "Q2: was_true_at in tools_called"; else fail "Q2: was_true_at NOT called (got: $q2_tools)"; fi + if printf '%s' "$q2_ans" | grep -Eqi 'yes|allied|true|in force|was an alliance'; then + ok "Q2: answer indicates YES/allied/true" + else + fail "Q2: answer missing YES/allied/true" + fi +else + fail "Q2: results/q2_was_allied_230.json missing (LLM consumer failed)" +fi + +# Q3: ancestors_of called, answer names >=3 of {Theron, Maric, Cael, Yssa}.
+if [ -f "results/q3_aldric_ancestors.json" ]; then + trace=$(trace_info "results/q3_aldric_ancestors.json") + q3_ans=${trace%%$'---TOOLS---'*} + q3_tools=$(printf '%s' "$trace" | awk 'seen {print} /^---TOOLS---$/ {seen=1}') + echo " Q3 tools: $q3_tools" + if [[ "$q3_tools" == *ancestors_of* ]]; then ok "Q3: ancestors_of in tools_called"; else fail "Q3: ancestors_of NOT called (got: $q3_tools)"; fi + found=0 + for n in theron maric cael yssa; do + if printf '%s' "$q3_ans" | grep -qi "$n"; then found=$((found+1)); fi + done + if [ "$found" -ge 3 ]; then ok "Q3: answer names $found/4 canonical ancestors (need >=3)"; else fail "Q3: answer names only $found/4 canonical ancestors (need >=3)"; fi +else + fail "Q3: results/q3_aldric_ancestors.json missing (LLM consumer failed)" +fi + +# Q4: image-recall tool called, answer mentions Aldric + portrait/image/etc. +if [ -f "results/q4_images_of_aldric.json" ]; then + trace=$(trace_info "results/q4_images_of_aldric.json") + q4_ans=${trace%%$'---TOOLS---'*} + q4_tools=$(printf '%s' "$trace" | awk 'seen {print} /^---TOOLS---$/ {seen=1}') + echo " Q4 tools: $q4_tools" + if [[ "$q4_tools" == *recall_images* || "$q4_tools" == *search_images_by_caption* || "$q4_tools" == *search_images_semantic* ]]; then + ok "Q4: image-recall tool in tools_called" + else + fail "Q4: no image-recall tool called (got: $q4_tools)" + fi + if printf '%s' "$q4_ans" | grep -qi 'aldric'; then ok "Q4: answer mentions 'aldric'"; else fail "Q4: answer missing 'aldric'"; fi + if printf '%s' "$q4_ans" | grep -Eqi 'portrait|image|presigned|thornwall'; then + ok "Q4: answer mentions portrait/image/presigned/thornwall" + else + fail "Q4: answer missing portrait/image/presigned/thornwall" + fi +else + fail "Q4: results/q4_images_of_aldric.json missing (LLM consumer failed)" +fi + +# Q5: all 4 consistency tools called; answer is NOT a "no issues" answer; mentions +# canonical subject names and severity.
+if [ -f "results/q5_consistency_issues.json" ]; then + trace=$(trace_info "results/q5_consistency_issues.json") + q5_ans=${trace%%$'---TOOLS---'*} + q5_tools=$(printf '%s' "$trace" | awk 'seen {print} /^---TOOLS---$/ {seen=1}') + echo " Q5 tools: $q5_tools" + missing=() + for t in find_contradictions find_anachronisms find_orphans find_ontology_violations; do + [[ "$q5_tools" == *"$t"* ]] || missing+=("$t") + done + if [ ${#missing[@]} -eq 0 ]; then + ok "Q5: all 4 consistency tools in tools_called" + else + fail "Q5: missing tools: ${missing[*]}" + fi + # Must NOT say "no issues" — there are 5 seeded violations. + if printf '%s' "$q5_ans" | grep -Eqi '(no|zero|none).{0,30}(open |detected |current )?(consistency |open )?(issues|problems|violations)'; then + fail "Q5: answer incorrectly says 'no issues' — but 5 violations are seeded" + else + ok "Q5: answer does NOT claim 'no issues' (correct — 5 violations seeded)" + fi + subject_hits=0 + for n in aldric vex lyssa theron maric; do + if printf '%s' "$q5_ans" | grep -qi "$n"; then subject_hits=$((subject_hits+1)); fi + done + if [ "$subject_hits" -ge 2 ]; then + ok "Q5: answer mentions $subject_hits canonical subjects (need >=2)" + else + fail "Q5: answer mentions only $subject_hits canonical subjects (need >=2)" + fi + if printf '%s' "$q5_ans" | grep -Eqi 'severity|warn|warning|error'; then + ok "Q5: answer acknowledges severity (warn/error)" + else + fail "Q5: answer does not acknowledge severity" + fi +else + fail "Q5: results/q5_consistency_issues.json missing (LLM consumer failed)" +fi + +# ─── summary ───────────────────────────────────────────────────────────────── + +echo +echo "════════════════════════════════════════════════════════════" +if [ "$fails" -eq 0 ]; then + echo " PASS — $passes checks, 0 failures" + echo "════════════════════════════════════════════════════════════" + exit 0 +else + echo " FAIL — $passes checks passed, $fails FAILED:" + for d in "${FAIL_DETAILS[@]}"; do + echo " - $d" + done + echo
"════════════════════════════════════════════════════════════" + exit 1 +fi