docs(plan): refresh slice index, add missing Status, ship exec roadmaps

Three doc fixes so the plan directory matches the shipped state:

* docs/plan/README.md: the slice index was stale (still listed
  slices 1-7 as "planned"). Refresh the table to show what actually
  shipped (1, 2, 3, 4, 5a-Neo4j, 5b-TypeTemplate, 2.6.1, 10, 11) and
  what remains (6, 7, 8). Drop the old "33 days" cumulative estimate.
  Update the dependency graph to reflect the shipped substrate.
* docs/plan/05-slice-neo4j-backend.md: missing Status header (the
  only shipped slice plan without one). Add it.
* docs/plan/exec/: three execution roadmaps so the next loop picks
  up the remaining slices without re-deriving the sub-slice
  decomposition:
  - 06-planes.md (22 tests, 6 sub-slices, no blockers)
  - 07-harness.md (12 tests Track A unblocked; Track B gated on
    $OLLAMA_API_KEY)
  - 08-polish.md (22 tests engine-side, blocked on 6+7)
  Plus README.md indexing them.

Each exec file follows the slice 1 impl-plan template: scope, what
ships today, decisions locked, sub-slice ordering with test names,
critical files to read, acceptance check, cross-references.

TaskCreate entries #26-29 mirror the four tasks.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-19 11:30:31 -04:00
parent 12c2237d1d
commit e2dc179c82
6 changed files with 649 additions and 36 deletions
--- a/docs/plan/05-slice-neo4j-backend.md
+++ b/docs/plan/05-slice-neo4j-backend.md
@@ -1,5 +1,14 @@
 # Slice 5 — Neo4j 5 GraphBackend Adapter

+**Status:** ✅ shipped 2026-06-18. The Neo4j storage substrate per
+ADR 0011. 8 sub-slices (5.1–5.8): GraphBackend Protocol +
+InMemoryGraph rename + write-tool chokepoint + Neo4jGraph skeleton
+w/ reified `:Relation` + full read-tool parity + full-codex
+round-trip + `--write-neo4j` dual-write flag + `LORE_GRAPH_BACKEND`
+env wiring + docker-compose neo4j svc. 559 → 632 tests
+(+73). ADR 0011 written. Suite green with `docker compose
+--profile neo4j up`.
+
 ## Context

 The Lore Engine POC's in-memory `Graph` (`lore_engine_poc/tools.py`) was the
--- a/docs/plan/README.md
+++ b/docs/plan/README.md
@@ -3,49 +3,58 @@
 The Lore Engine on Cognee, sliced into independently shippable units.
 Each slice has its own file with acceptance criteria and a test plan.

-| # | Slice | Goal | Status | Effort |
+The execution roadmap (what's done, what's next, what's blocked)
+lives in `docs/plan/exec/` and the TaskCreate list. Slice plan
+files are the *design*; exec files are the *roadmap*.
+
+| # | Slice | Goal | Status | Shipped |
 |---|---|---|---|---|
-| 0 | [POC](00-slice-0-poc.md) | Validate the substrate; one tool end-to-end | ✅ done | 1 day |
-| 1 | [Structured YAML](01-slice-structured-yaml.md) | Real `valid_from`/`valid_until` on edges | 📋 planned | 3-5 days |
-| 2 | [Consistency engine](02-slice-consistency.md) | 4-category rule system | 📋 planned | 5-7 days |
-| 3 | [LLM extraction](03-slice-llm-extraction.md) | Cognee cognify actually runs | 📋 planned | 3-5 days |
-| 4 | [Remaining 44 tools](04-slice-tools.md) | Full 45-tool MCP surface | 📋 planned | 5-7 days |
-| 5 | [TypeTemplate](05-slice-typetemplate.md) | Polymorphic extension model | 📋 planned | 5-7 days |
-| 6 | [Plane model](06-slice-planes.md) | Setting + Plane graph nodes | 📋 planned | 2-3 days |
-| 7 | [Reasoning harness](07-slice-harness.md) | 50-question validation gate | 📋 planned | 3-5 days |
+| 0 | [POC](00-slice-0-poc.md) | Validate the substrate; one tool end-to-end | ✅ done | 2026-06-17 |
+| 1 | [Structured YAML](01-slice-structured-yaml.md) | Real `valid_from`/`valid_until` on edges | ✅ shipped | 2026-06-18 |
+| 2 | [Consistency engine](02-slice-consistency.md) | 4-category rule system | ✅ shipped | 2026-06-18 |
+| 3 | [LLM extraction](03-slice-llm-extraction.md) | Cognee cognify actually runs | ✅ shipped | 2026-06-18 |
+| 4 | [Remaining tools](04-slice-tools.md) | Read + write tool surface | ✅ shipped | 2026-06-18 |
+| 5a | [Neo4j backend](05-slice-neo4j-backend.md) | GraphBackend Protocol + Neo4j adapter | ✅ shipped | 2026-06-18 |
+| 5b | [TypeTemplate](05-slice-typetemplate.md) | Polymorphic extension model | ✅ shipped | 2026-06-19 |
+| 2.6.1 | [MCP surface expansion](2.6.1-slice-mcp-surface-expansion.md) | read_tools over the MCP wire | ✅ shipped | 2026-06-18 |
+| 10 | [Write tools](10-slice-write-tools-deferred.md) | 9 deferred write_tools | ✅ shipped | 2026-06-18 |
+| 11 | [MCP HTTP + Docker](11-slice-mcp-http-docker.md) | Streamable HTTP transport | ✅ shipped | 2026-06-18 |
+| 6 | [Plane model](06-slice-planes.md) | Setting + Plane graph nodes | 📋 planned | — |
+| 7 | [Reasoning harness](07-slice-harness.md) | 50-question validation gate | 📋 planned (blocked: `$OLLAMA_API_KEY`) | — |
 | 8 | [Polish](08-slice-polish.md) | UI, export, enforcement | 📋 open-ended | — |

-**Cumulative:** MVP at end of slice 2 (~10 days), full v1 at end
-of slice 4 (~21 days), v1 + extensions at end of slice 7
-(~33 days).
+**Total shipped so far:** 712 tests green (lore-engine-poc).
+**Remaining:** 3 planned slices (6, 7, 8) + follow-up polish items
+deferred from shipped slices (write tools for templates,
+collapse-to-one-tool when count > 50, services/template-watcher/
+and services/template-registry/ Go services per ADR 0012).

-## Dependency graph
+## Dependency graph (post-shipment)

 ```
-0 (POC) ──┬──> 1 (YAML)  ──┐
-          │                ├──> 2 (Consistency) ──┐
-          └──> 3 (LLM)  ───┘                       │
-                                                   ├──> 4 (Tools) ──┐
-                                                   │                │
-                                                   │   ┌────────────┘
-                                                   │   │
-                                                   ▼   ▼
-                                                   5 (TypeTemplate)
-                                                   │
-                                                   ▼
-                                                   6 (Planes)
-                                                   │
-                                                   ▼
-                                                   7 (Harness)
-                                                   │
-                                                   ▼
-                                                   8 (Polish)
+shipped: 0 -> 1 -> 2 -> 4 -> 5a (Neo4j)  ............ 11 (MCP HTTP)
+                  \-> 3  /                  2.6.1
+                         \-> 5b (TypeTemplate) -> 10 (write tools)
+remaining:                 6 (Planes) -> 7 (Harness) -> 8 (Polish)
 ```

-Slices 1 and 3 can run in parallel after slice 0. Slice 2
-needs both 1 and 3 (it operates on the typed graph and the
-prose-extracted graph). Slices 4-7 each depend on the prior
-slice. Slice 8 is unbounded.
+The shipped slices form the substrate (parsers, graph storage,
+MCP wire, Docker). The remaining slices layer on top:
+
+- **Slice 6 (Planes)** is foundational — adds the multi-setting
+  first-class nodes that downstream codex tooling will use.
+  Blocks 7 only insofar as the harness tests need a stable
+  setting_id contract.
+- **Slice 7 (Harness)** is the validation gate — it measures
+  whether the LLM can answer correctly. Currently **blocked on
+  `$OLLAMA_API_KEY`** (slice 7 was always tied to a live LLM
+  provider per `docs/07-reasoning-harness.md`).
+- **Slice 8 (Polish)** is open-ended — fill in based on what
+  the world-builder actually needs after slices 6 + 7 land.
+
+Slice 6 has no LLM dependency, so it can run in parallel with
+slice 7; the only coupling is the `setting_id` contract noted
+above. Slice 8 is downstream of both.

 ## What each slice proves

@@ -56,7 +65,8 @@ slice. Slice 8 is unbounded.
 | 2 | Engine flags its first real contradiction |
 | 3 | Prose path is fuzzy but useful for color/character voice |
 | 4 | LLM can answer most question types in a single tool call |
-| 5 | New domain types are a YAML exercise, not a code change |
+| 5a | Production storage works (Neo4j round-trips) |
+| 5b | New domain types are a YAML exercise, not a code change |
 | 6 | Multi-setting worlds are first-class |
 | 7 | The LLM, with the harness, answers correctly ≥80% of the time |
 | 8 | The engine is a usable product, not just a working engine |
--- a/docs/plan/exec/06-planes.md
+++ b/docs/plan/exec/06-planes.md
@@ -0,0 +1,161 @@
+# Slice 6 — Implementation Roadmap
+
+**Owner:** this loop (Claude).
+**Scope:** `docs/plan/06-slice-planes.md` (the AC table is the
+contract — 11 ACs, 6.1 through 6.11). Implementation lives in
+`~/projects/lore-engine-poc/`.
+**TDD rule:** every new behaviour ships with a failing test first;
+test names follow `test_<AC>_<description>`.
+
+## What ships today (the substrate)
+
+The slice 1–5 work already shipped Setting as a string field
+(`setting_id`) on every entity, but **`Setting` is not yet a
+first-class graph node**, and **`Plane` does not exist at all**.
+The 5.x substrate (per ADR 0011) is in place: GraphBackend
+Protocol, InMemoryGraph + Neo4jGraph parity, the 36 core labels,
+and the polymorphic Layer 2 (`:DomainEntity`, `:Relation`,
+`:TypeTemplate`) from slice 5T.
+
+**Already done (do not redo):**
+
+- `setting_id` is on every entity (Person, Faction, Location,
+  Region, etc.). Per ADR 0004 (region ≠ plane), a v1.2 graph
+  distinguishes the two via the new `Plane` label.
+- The Mardonari codex's Voldramir entry currently lives as a
+  `Region` node with `plane: true` in frontmatter; slice 6
+  promotes it to a `Plane` node.
+- Migration from `world_id` to `setting_id` already happened in
+  the v1.2 doc rename. The slice 6 migration is the *graph node*
+  migration, not the doc rename.
+
+## Decisions locked before coding starts
+
+| # | Decision | Source | Implication |
+|---|---|---|---|
+| D1 | `EXISTS_IN` is a reified `:Relation` node, not a native edge | ADR 0009 | Time-bounded (planar travel) — the in-memory `Edge` from slice 0 stays the substrate; the new `EXISTS_IN` relation carries `valid_from`/`valid_until`. |
+| D2 | `:Setting` and `:Plane` are top-level labels (Layer 1 core) | docs/01-ontology.md, ADR 0011 | They're added to `NODE_LABELS` in `lore_engine_poc/ontology.py`, same as `:DomainEntity` was in 5T.1. |
+| D3 | Default plane for an entity with no `plane:` is the Material Plane of the default Setting | AC 6.10 | Migration must auto-create a Material Plane when none is declared. |
+| D4 | Region vs Plane split via frontmatter | AC 6.8 | Entries with `plane: true` OR under `Campaign Codex - Planes/` are `:Plane`; everything else stays `:Region`. |
+| D5 | Migration is idempotent | AC 6.9 | `MERGE` on `(Setting {id})`, `(Plane {id})`, and `(:Relation {type:'EXISTS_IN'})` keyed by `(entity_id, setting_id)`. |
+
+## Sub-slice ordering and parallelisation
+
+The six code sub-slices below (6.1–6.6) respect the dependency
+order `schema → nodes → edges → backfill → tools → migration`;
+6.7 is a docs-only cleanup.
+
+### 6.1 — Setting + Plane schema + ontology (AC 6.1, 6.2)
+
+- Add `:Setting` and `:Plane` to `NODE_LABELS`
+- Add `:EXISTS_IN`, `:REFLECTS`, `:LAYER_OF`, `:ADJACENT_TO`,
+  `:ACCESSIBLE_VIA` to `ALLOWED_LABELS` (they may already be
+  present in write_tools.py:48; verify before adding)
+- `lore_engine_poc/setting.py` (new) — dataclasses `Setting(id,
+  kind, current_era, schema_version, created_at)` and `Plane(id,
+  setting_id, name, kind)`
+- Test: `test_6_1_setting_and_plane_in_node_labels`
+
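A minimal sketch of the two dataclasses named in 6.1. The field names come from the roadmap; the field types, the `frozen=True` choice, and every example value are assumptions made for illustration, not the shipped `lore_engine_poc/setting.py`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Setting:
    # Field names from the roadmap; the types are assumed.
    id: str
    kind: str
    current_era: str
    schema_version: str
    created_at: datetime


@dataclass(frozen=True)
class Plane:
    id: str
    setting_id: str  # refers to Setting.id
    name: str
    kind: str


# Illustrative values only; none of these come from the real codex.
mardonari = Setting(
    id="mardonari",
    kind="campaign",
    current_era="example-era",
    schema_version="1.2",
    created_at=datetime(2026, 6, 19, tzinfo=timezone.utc),
)
material = Plane(
    id="mardonari:material",
    setting_id=mardonari.id,
    name="Material Plane",
    kind="material",
)
```

Freezing the dataclasses is one way to keep node identity stable once a node is in the graph; a mutable variant would work equally well for the label test.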
+### 6.2 — GraphBackend methods (Layer 1 extensions)
+
+- Extend `GraphBackend` Protocol with: `add_setting`,
+  `add_plane`, `add_exists_in`, `find_setting(id)`,
+  `planes_in_setting(setting_id)`, `entity_planes(entity_id)`
+- InMemoryGraph: storage dicts + endpoint indexes
+- Neo4jGraph: mirror via `CREATE CONSTRAINT` + `session.run()`
+- Tests: protocol conformance + parity
+- Test count: +5
+
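The Protocol extension in 6.2 could look roughly like the sketch below. Only the six method names are taken from the roadmap; the signatures, the `PlaneAwareBackend` name, and the toy `DictGraph` class are hypothetical stand-ins, not the real `GraphBackend` or `InMemoryGraph`.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class PlaneAwareBackend(Protocol):
    # Method names come from the roadmap; the signatures are assumed.
    def add_setting(self, setting_id: str) -> None: ...
    def add_plane(self, plane_id: str, setting_id: str) -> None: ...
    def add_exists_in(self, entity_id: str, plane_id: str) -> None: ...
    def find_setting(self, setting_id: str) -> Optional[str]: ...
    def planes_in_setting(self, setting_id: str) -> list: ...
    def entity_planes(self, entity_id: str) -> list: ...


class DictGraph:
    """Toy in-memory backend: plain dicts, no time bounds."""

    def __init__(self) -> None:
        self._settings = set()
        self._plane_to_setting = {}
        self._entity_to_planes = {}

    def add_setting(self, setting_id):
        self._settings.add(setting_id)

    def add_plane(self, plane_id, setting_id):
        if setting_id not in self._settings:
            raise KeyError(f"unknown setting: {setting_id}")
        self._plane_to_setting[plane_id] = setting_id

    def add_exists_in(self, entity_id, plane_id):
        if plane_id not in self._plane_to_setting:
            raise KeyError(f"unknown plane: {plane_id}")
        self._entity_to_planes.setdefault(entity_id, set()).add(plane_id)

    def find_setting(self, setting_id):
        return setting_id if setting_id in self._settings else None

    def planes_in_setting(self, setting_id):
        return sorted(
            p for p, s in self._plane_to_setting.items() if s == setting_id
        )

    def entity_planes(self, entity_id):
        return sorted(self._entity_to_planes.get(entity_id, ()))


graph = DictGraph()
graph.add_setting("mardonari")
graph.add_plane("voldramir", "mardonari")
graph.add_exists_in("some-entity", "voldramir")
```

A conformance test in this style only checks that the methods exist (`isinstance` against a runtime-checkable Protocol does not verify signatures), so the parity tests between backends still carry the behavioural weight.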
+### 6.3 — Plane-relation edge types (AC 6.2)
+
+- `REFLECTS`, `LAYER_OF`, `ADJACENT_TO`, `ACCESSIBLE_VIA` as
+  edge-label tests (typed edges, not reified — these are not
+  time-bounded)
+- Test: `test_6_3_plane_relation_round_trip` for each
+
+### 6.4 — Backfill of EXISTS_IN (AC 6.3, 6.10)
+
+- Migration helper `migrate_setting_id_to_exists_in()` walks
+  every entity with `setting_id`, ensures a `:Setting` node
+  exists, ensures a default `Material Plane` exists in that
+  setting, creates one reified `:Relation {type: EXISTS_IN}` node
+- Idempotent: re-running produces the same graph
+- Test: `test_6_4_backfill_idempotent` + `test_6_4_default_material_plane`
+
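The backfill steps in 6.4 can be sketched against plain Python containers. This is an assumption-laden illustration of the idempotency property only: the argument list, the `:material-plane` id scheme, and the dict-based "graph" are hypothetical, and the real helper would go through the `GraphBackend`.

```python
def migrate_setting_id_to_exists_in(entities, settings, planes, relations):
    """Backfill EXISTS_IN from each entity's setting_id (in-memory sketch)."""
    for entity in entities:
        setting_id = entity.get("setting_id")
        if setting_id is None:
            continue
        settings.add(setting_id)  # adding an existing member is a no-op
        plane_id = f"{setting_id}:material-plane"  # hypothetical id scheme
        planes.setdefault(
            plane_id, {"setting_id": setting_id, "name": "Material Plane"}
        )
        # Keyed by (entity_id, setting_id), mirroring the MERGE key in D5.
        relations.setdefault(
            (entity["id"], setting_id),
            {"type": "EXISTS_IN", "plane_id": plane_id},
        )


entities = [
    {"id": "person-1", "setting_id": "mardonari"},
    {"id": "faction-1", "setting_id": "mardonari"},
    {"id": "draft-note"},  # no setting_id: skipped
]
settings, planes, relations = set(), {}, {}

migrate_setting_id_to_exists_in(entities, settings, planes, relations)
first_pass = (set(settings), dict(planes), dict(relations))

migrate_setting_id_to_exists_in(entities, settings, planes, relations)
second_pass = (set(settings), dict(planes), dict(relations))
```

Comparing `first_pass` with `second_pass` is the same check `test_6_4_backfill_idempotent` would make: a second run must leave the graph unchanged.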
+### 6.5 — Setting filter on read tools (AC 6.5, 6.6)
+
+- Add `setting` parameter to `entities_present`, `was_true_at`,
+  `true_during`, `events_during`, `lookup`, `entity_context`
+- Filter resolves via `EXISTS_IN` traversal
+- Test: `test_6_5_setting_filter_on_*` for each tool (6 tests)
+- Test count: +6
+
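One way to read "filter resolves via `EXISTS_IN` traversal" is sketched below. The function signature and the lookup tables are assumptions; the real `entities_present` also takes a time argument, which this sketch leaves out, and `dream-core` is a made-up plane id.

```python
def entities_present(entity_ids, exists_in, plane_to_setting, setting=None):
    """Keep entities that exist in at least one plane of `setting`."""
    if setting is None:
        return list(entity_ids)  # no filter requested
    return [
        eid
        for eid in entity_ids
        if any(
            plane_to_setting.get(plane) == setting
            for plane in exists_in.get(eid, ())
        )
    ]


# Hypothetical data: two settings, one entity that spans both.
plane_to_setting = {"voldramir": "mardonari", "dream-core": "the-wild-dream"}
exists_in = {
    "a": ["voldramir"],
    "b": ["dream-core"],
    "c": ["voldramir", "dream-core"],
}
present = entities_present(
    ["a", "b", "c"], exists_in, plane_to_setting, setting="mardonari"
)
```

Here `present` keeps `a` and `c`; `b` lives only in the other setting and is filtered out.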
+### 6.6 — Region ↔ Plane migration (AC 6.8, 6.11)
+
+- `scripts/05_migrate_planes.py` reads the codex frontmatter
+- Entries with `plane: true` or under `Campaign Codex - Planes/`
+  become `:Plane` nodes (deleting their `:Region` form if any)
+- All `[[Underdark]]` body-text references in Voldramir's
+  markdown become `LAYER_OF` edges
+- `--dry-run` mode prints the list of changes without applying
+- Test: `test_6_6_migration_distinguishes_region_from_plane` +
+  `test_6_6_voldramir_becomes_plane`
+- Test count: +2
+
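The Region/Plane split from D4 reduces to a small predicate. The sketch below assumes frontmatter arrives as a dict and paths use forward slashes; `label_for` and the `Campaign Codex - Regions/` directory are hypothetical names used only for the example.

```python
from pathlib import PurePosixPath

PLANES_DIR = "Campaign Codex - Planes"  # directory name from D4


def label_for(frontmatter, path):
    """Return "Plane" for plane entries, "Region" for everything else."""
    in_planes_dir = PLANES_DIR in PurePosixPath(path).parts
    is_plane = frontmatter.get("plane") is True or in_planes_dir
    return "Plane" if is_plane else "Region"


# Hypothetical codex paths, for illustration only.
voldramir = label_for({"plane": True}, "Campaign Codex - Regions/Voldramir.md")
by_folder = label_for({}, "Campaign Codex - Planes/Example.md")
ordinary = label_for({"plane": False}, "Campaign Codex - Regions/Example.md")
```

Matching on path parts rather than a substring avoids misclassifying an entry whose file name merely contains the directory name.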
+### 6.7 — docs cleanup (AC 6.7)
+
+- `grep -r "world_id" docs/` and remove every reference outside
+  the migration section
+- Update `docs/01-ontology.md`, `docs/11-extensibility.md`,
+  `docs/14-examples.md` to use `setting_id` and reference the
+  new `Plane` model
+
+### Final — end-to-end demo + ADR
+
+- ADR 0013 — the v1.2 plane-model migration story
+- Killer demo: seed two settings (Mardonari + The Wild Dream),
+  cross-setting query returns only the requested setting's events
+- Total test count: ~712 + 22 = 734
+
+## Critical files to read before implementing
+
+- `docs/01-ontology.md` — current 36 labels + the v1.2 addendum
+- `docs/17-planes.md` — the full plane-model design
+- `docs/04-consistency.md` — consistency rules will need
+  `setting_id` on violation nodes (per slice 6 risks)
+- `lore_engine_poc/ontology.py` — NODE_LABELS, ALLOWED_LABELS
+- `lore_engine_poc/graph_backend.py` — Protocol + InMemoryGraph
+- `lore_engine_poc/neo4j_graph.py` — Neo4j substrate
+- `lore_engine_poc/tools.py` — read tools that need setting
+  filter
+- `lore_engine_poc/templates/schema.py` — templates will need
+  Plane awareness (deferred to slice 5T.6 if user asks)
+
+## Out of scope (deferred)
+
+- Plane model in the UI (slice 8).
+- Cross-setting consistency rules (deferred; would re-open
+  slice 2's consistency engine).
+- Templates get a `plane_id` field (slice 5T.6 — separate
+  request).
+
+## Acceptance check
+
+`python3 -m pytest tests/ -q` → 734 passed. Cross-setting
+query `entities_present(setting=mardonari, at_time=...)` returns
+only Mardonari's entities. Voldramir is a `:Plane` node in the
+v1.2 graph; the Underdark is `LAYER_OF` it.
+
+## Effort estimate
+
+22 tests, 6 sub-slices, ~3-5 days.
+
+## Cross-references
+
+- `docs/plan/06-slice-planes.md` — the design plan
+- `docs/17-planes.md` — plane-model design
+- `docs/09-roadmap.md#v12-migration` — migration plan
+- `docs/10-critique.md#S3.2` — cross-world queries
+- ADR 0004 — region ≠ plane (rationale)
+- ADR 0009 — reified `:Relation` (used for `:EXISTS_IN`)
+- ADR 0011 — GraphBackend Protocol
+- ADR 0012 — slice 5T (precedent for adding to NODE_LABELS)
--- a/docs/plan/exec/07-harness.md
+++ b/docs/plan/exec/07-harness.md
@@ -0,0 +1,195 @@
+# Slice 7 — Implementation Roadmap
+
+**Owner:** this loop (Claude) **once `$OLLAMA_API_KEY` is
+available**.
+**Scope:** `docs/plan/07-slice-harness.md` (the AC table is the
+contract — 8 ACs, 7.1 through 7.8). Implementation lives in
+`~/projects/lore-engine-poc/`.
+**TDD rule:** the 50-question harness has golden answers;
+the system prompt is iterated until ≥80% accuracy.
+
+## Blocker
+
+This slice is **blocked on `$OLLAMA_API_KEY`** (slice 7 was
+always tied to a live LLM provider per `docs/07-reasoning-harness.md`).
+The primary model is **Minimax-M3** (per ADR 0005), reached via
+the OpenAI-compatible endpoint. The harness cannot run end-to-end
+without API access.
+
+The slice splits into two tracks:
+
+- **Track A — no LLM required** — author the 50 questions,
+  the system prompt, the failure-mode log, the red-team suite,
+  the per-tool test cases. All of these are offline artifacts
+  and ship without an API key.
+- **Track B — needs API key** — execute the harness against
+  the live LLM, measure accuracy, iterate on the system prompt.
+
+Track A can ship now. Track B is gated.
+
+## What ships today (the substrate)
+
+- The MCP server exposes 36 core tools + 14 template-generated
+  tools = 50 total (per slice 5T.5 + slice 10 + slice 11).
+- The system prompt lives in
+  `lore_engine_poc/prompts/system_prompt.md` (per slice 3 AC 3.2).
+- The 5 question types are documented in
+  `docs/07-reasoning-harness.md` (5 × 10 = 50 questions).
+
+**Already done (do not redo):**
+
+- `scripts/03_demo.py` is the demo loop (per slice 0).
+- `lore_engine_poc/llm.py` has the LiteLLM adapter + the
+  FakeProvider for tests (per slice 3).
+- The Ollama Cloud `minimax-m3:cloud` model is wired through
+  per ADR 0005.
+
+## Decisions locked before coding starts
+
+| # | Decision | Source | Implication |
+|---|---|---|---|
+| D1 | Primary model: **Minimax-M3** (via Ollama Cloud) | ADR 0005 | The harness scripts default to `minimax-m3:cloud`; switchable via env. |
+| D2 | Thinking mode: `adaptive` | 07-reasoning-harness.md §measurement | M3 supports adaptive; no other model is calibrated. |
+| D3 | The 50 questions are **versioned** (tests/harness/questions.json has a `version` field) | per critique | Old results stay comparable when the prompt iterates. |
+| D4 | Tool-selection accuracy is measured per question type, not aggregate | per critique | A 70% aggregate can hide a 50% on time-window questions. |
+| D5 | Failure-mode log is **mandatory**: every wrong answer is logged with a hypothesis | 07-reasoning-harness.md §failure-mode | The log is the input to system-prompt iteration, not a post-hoc artefact. |
+
+## Sub-slice ordering
+
+The 4 sub-slices below respect the dependency order
+`questions → prompt → runner → iteration`.
+
+### 7.1 — Author the 50-question test set (Track A, no API key needed)
+
+- 5 question types × 10 questions = 50 total
+- Each question: `id, type, query, expected_tools (sequence),
+  expected_answer_shape, expected_citations`
+- Build script `scripts/harness/build_questions.py` generates
+  the JSON from a YAML source (`tests/harness/questions.yaml`)
+- Tests: `test_7_1_questions_match_schema`,
+  `test_7_1_50_questions_total`,
+  `test_7_1_10_per_type`,
+  `test_7_1_every_question_has_expected_tools`
+- Test count: +4
+
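The four 7.1 tests amount to one schema-and-count check, sketched here. The required field names are from the roadmap; the validator itself, the placeholder type names (`type-0` and so on), and the example records are assumptions.

```python
from collections import Counter

REQUIRED_FIELDS = {
    "id",
    "type",
    "query",
    "expected_tools",
    "expected_answer_shape",
    "expected_citations",
}


def validate_questions(questions, n_types=5, per_type=10):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for q in questions:
        missing = REQUIRED_FIELDS - set(q)
        if missing:
            errors.append(f"{q.get('id', '?')}: missing {sorted(missing)}")
        elif not q["expected_tools"]:
            errors.append(f"{q['id']}: expected_tools is empty")
    counts = Counter(q.get("type") for q in questions)
    if len(counts) != n_types:
        errors.append(f"expected {n_types} question types, found {len(counts)}")
    for qtype, n in sorted(counts.items(), key=str):
        if n != per_type:
            errors.append(f"type {qtype!r}: expected {per_type}, found {n}")
    return errors


def _example(qid, qtype):
    # Placeholder record; real questions would carry real queries.
    return {
        "id": qid,
        "type": qtype,
        "query": "placeholder",
        "expected_tools": ["lookup"],
        "expected_answer_shape": "text",
        "expected_citations": [],
    }


questions = [_example(f"q{t}-{i}", f"type-{t}") for t in range(5) for i in range(10)]
problems = validate_questions(questions)
```

Returning a list of problems rather than raising on the first one makes the build script's output more useful when several questions are malformed at once.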
+### 7.2 — System prompt + version registry (Track A)
+
+- `lore_engine_poc/prompts/system_prompt.md` — the canonical
+  prompt (the 5 question types, citation rule, time-window
+  rule, contradiction rule)
+- Versioned in `prompts/registry.json`; the harness reads
+  `prompts/system_prompt.v{N}.md`
+- The slice 3 prompt-mirror is updated to include the v1.2
+  TypeTemplate tools (per the slice 5T.5 follow-up note in
+  recent memory)
+- Tests: `test_7_2_prompt_has_five_question_types`,
+  `test_7_2_prompt_citation_rule_present`,
+  `test_7_2_prompt_time_window_rule_present`,
+  `test_7_2_prompt_mentions_template_tools`
+- Test count: +4
+
+### 7.3 — Harness runner (Track A; Track B for execution)
+
+- `scripts/harness/run_questions.py` — runs the 50 questions
+  against the MCP server + the LLM, measures:
+  - Tool-selection accuracy per question type
+  - Citation rate (every claim cites ≥1 source)
+  - Hallucination rate (no fact without a source)
+  - Time-window violations (no claim outside `valid_from`/`valid_until`)
+- Output: `tests/harness/results/run-{NNN}.json`
+- Tests: `test_7_3_runner_parses_results` (offline test with
+  FakeProvider fixture), `test_7_3_runner_aggregates_metrics`
+- Test count: +2
+
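Decision D4 (per-type rather than aggregate accuracy) can be illustrated with a small aggregation. The result-record shape (`type`, `expected_tools`, `called_tools`), the exact-sequence match rule, and the type names are assumptions, not the runner's actual output format.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Tool-selection accuracy per question type (exact sequence match)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["type"]] += 1
        if r["called_tools"] == r["expected_tools"]:
            hits[r["type"]] += 1
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}


# Hypothetical results: the tool names appear in the roadmap, the type
# names do not.
results = [
    {"type": "time-window", "expected_tools": ["was_true_at"], "called_tools": ["was_true_at"]},
    {"type": "time-window", "expected_tools": ["true_during"], "called_tools": ["lookup"]},
    {"type": "lookup", "expected_tools": ["lookup"], "called_tools": ["lookup"]},
]
per_type = accuracy_by_type(results)
aggregate = sum(r["called_tools"] == r["expected_tools"] for r in results) / len(results)
```

In this toy run the aggregate is two out of three while the `time-window` type sits at one out of two, which is the kind of gap the per-type breakdown is meant to expose.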
+### 7.4 — Red-team suite (Track A; Track B for execution)
+
+- `scripts/harness/run_redteam.py` — 20 adversarial questions
+  per the plan doc (time-window trap, ambiguous name,
+  contradiction trap, hallucination trap, citation bypass)
+- Output: `tests/harness/redteam/run-{NNN}.json`
+- Tests: `test_7_4_redteam_20_questions`,
+  `test_7_4_redteam_failure_modes_logged`
+- Test count: +2
+
+### Final — execute and iterate (Track B)
+
+Once `$OLLAMA_API_KEY` is available:
+
+```bash
+export LORE_LLM_PROVIDER=ollama
+export LORE_LLM_MODEL=minimax-m3:cloud
+: "${OLLAMA_API_KEY:?OLLAMA_API_KEY must be set before running the harness}"
+export OLLAMA_API_KEY
+python3 scripts/harness/run_questions.py \
+  --questions tests/harness/questions.json \
+  --out tests/harness/results/run-001.json
+```
+
+Pass criteria: tool-selection accuracy ≥80% on the 50 questions
+(per AC 7.3), citation rate ≥90% (AC 7.4), hallucination rate <5%
+(AC 7.5), time-window violations <5% (AC 7.6).
+
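The four thresholds above can be encoded as a single gate. The thresholds and AC numbers are taken from the roadmap; the metric key names and the `failed_criteria` helper are hypothetical, and rates are assumed to be fractions in the range 0 to 1.

```python
import operator

# (metric key, comparison, threshold, AC); key names are assumed.
CRITERIA = [
    ("tool_selection_accuracy", operator.ge, 0.80, "AC 7.3"),
    ("citation_rate", operator.ge, 0.90, "AC 7.4"),
    ("hallucination_rate", operator.lt, 0.05, "AC 7.5"),
    ("time_window_violation_rate", operator.lt, 0.05, "AC 7.6"),
]


def failed_criteria(metrics):
    """Return the ACs whose threshold is not met; empty means pass."""
    return [
        ac
        for key, compare, threshold, ac in CRITERIA
        if not compare(metrics[key], threshold)
    ]


# Made-up numbers, not measured results.
example_run = {
    "tool_selection_accuracy": 0.82,
    "citation_rate": 0.88,
    "hallucination_rate": 0.03,
    "time_window_violation_rate": 0.05,
}
failures = failed_criteria(example_run)
```

Note the strict `<` on the last two criteria: a violation rate of exactly 5% fails, while an accuracy of exactly 80% passes.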
+If accuracy is below 80% with the full 50-tool surface: collapse per critique
+S2.4 (`state_at` → `entity_context(comprehensive=true)`,
+`summarize_chain` → `narrate_arc(style=bullets)`, drop tools
+used <2% of the time). This becomes a separate "tool collapse"
+sub-slice before re-running.
+
+### Final — ADR + results writeup
+
+- ADR 0014 — the validation gate: which models, what accuracy,
+  what failure modes, what prompt iteration loop
+- Results: `docs/harness/run-001.md` — the baseline numbers,
+  the failure-mode log, the prompt iteration history
+
+## Critical files to read before implementing
+
+- `docs/07-reasoning-harness.md` — the full system prompt and
+  the 5 question types
+- `docs/05-mcp-tools.md` — the 45-tool surface (now 50 with
+  slice 5T.5)
+- `docs/10-critique.md#S3.3` — LLM misbehavior
+- `lore_engine_poc/prompts/` — existing prompt registry
+- `lore_engine_poc/llm.py` — LiteLLM adapter + FakeProvider
+- `scripts/03_demo.py` — the demo loop (harness piggybacks
+  on this)
+
+## Risks
+
+1. **S2.4 — tool-selection accuracy.** 45 tools is past the
+   empirical ceiling. If the harness shows poor selection,
+   collapse the long tail (see Final above).
+2. **S3.3 — LLM misbehavior.** The system prompt is *instruction*,
+   not *constraint*. Mitigation: an enforcement layer in the
+   MCP server that rejects tool calls inconsistent with the
+   latest `:ConsistencyRun` (per slice 8 polish).
+3. **Test set overfitting.** If the 50 questions are tuned to
+   M3 and only scored by M3, the numbers lie. Mitigate by
+   running a subset against `gpt-4o` and `claude-sonnet-4-6`
+   as a sanity check — large divergence between vendors is a
+   red flag.
+4. **Cost.** M3 at $0.30 / $1.20 per 1M tokens makes the
+   50×3 harness + red-team ~$5–10 total. Not a budget item.
+
+## Out of scope
+
+- Production enforcement (slice 8).
+- UI for failure-mode review (slice 8).
+- Cross-LLM benchmarks (deferred — pick a target LLM first).
+
+## Acceptance check
+
+`python3 -m pytest tests/ -q` → 712 + 12 = 724 passed (Track A
+only). Track B requires `python3 scripts/harness/run_questions.py`
+to print ≥80% tool-selection accuracy.
+
+## Effort estimate
+
+12 tests (Track A) + Track B execution. 3-5 days.
+
+## Cross-references
+
+- `docs/plan/07-slice-harness.md` — the design plan
+- `docs/07-reasoning-harness.md` — the system prompt
+- `docs/05-mcp-tools.md` — the tool catalog
+- ADR 0005 — Minimax-M3 primary LLM
+- ADR 0012 — TypeTemplate (harness must test template tools too)
--- a/docs/plan/exec/08-polish.md
+++ b/docs/plan/exec/08-polish.md
@@ -0,0 +1,195 @@
+# Slice 8 — Implementation Roadmap
+
+**Owner:** this loop (Claude) **after slice 6 + slice 7 ship**.
+**Scope:** `docs/plan/08-slice-polish.md` (the AC table is the
+contract — 9 ACs, 8.1 through 8.9). Implementation lives in
+`~/projects/lore-engine-poc/` (engine side) and a separate UI
+repo (frontend side — TBD).
+**TDD rule:** every UI feature ships with a Playwright/Selenium
+test; every export ships with a render-and-verify test.
+
+## What ships today (the substrate)
+
+The slice 1–7 work has produced a working engine: structured
+ingest, consistency rules, LLM extraction, 50 MCP tools,
+Neo4j backend, TypeTemplate polymorphism, MCP HTTP transport,
+Docker. Slice 8 makes it a *usable product*.
+
+**Already done (do not redo):**
+
+- The engine is end-to-end functional (slice 11's
+  docker-compose stack runs the full pipeline).
+- `scripts/03_demo.py` is the demo loop.
+- The consistency engine emits violation nodes (slice 2).
+- The graph has `setting_id` on every entity (slice 1,
+  promoted to first-class `:Setting` in slice 6).
+
+## Decisions locked before coding starts
+
+| # | Decision | Source | Implication |
+|---|---|---|---|
+| D1 | Slice 8 is **open-ended** by design | 08-slice-polish.md | The polish list is filled in based on what the world-builder actually needs; this exec file is the *first* cut, not the last. |
+| D2 | The UI is a **separate repo** | per critique S3.4 | The engine stays headless; the UI talks to the MCP HTTP server (slice 11). |
+| D3 | The enforcement layer is **per-call, not session-wide** | per critique S3.3 | The MCP server checks every tool call's response against the latest `:ConsistencyRun`; the LLM cannot bypass it. |
+| D4 | Export format priority: **HTML first**, then Markdown, then PDF | 08-slice-polish.md AC 8.7 | HTML is the most useful for review; PDF is the highest-friction (needs pagination). |
+
+## Sub-slice ordering
+
+The polish list is filled in iteratively. This exec file
+captures the **first cut** — the work that ships before the
+slice 8 AC table is closed.
+
+### 8.1 — Export to a single HTML file (AC 8.7)
+
+- `scripts/06_export.py` — walks the graph, emits a single
+  HTML with internal `[[wiki links]]`, time-bounded facts
+  showing their window, contradictions flagged inline,
+  citations linked to `LoreSource` nodes
+- Pagination + search + TOC for 10K+ entity worlds
+  (per risk: "Export completeness")
+- Tests: `test_8_1_export_renders_html`,
+  `test_8_1_internal_links_resolve`,
+  `test_8_1_time_bounded_facts_show_window`,
+  `test_8_1_contradictions_flagged_inline`,
+  `test_8_1_citations_linked`,
+  `test_8_1_pagination_for_10k_entities`
+- Test count: +6
+
+### 8.2 — Enforcement layer in MCP server (AC 8.8)
+
+- Per-call check: every tool response is cross-referenced
+  against the latest `:ConsistencyRun`; if a response
+  contains a claim that the consistency engine has flagged,
+  the server returns a JSON-RPC `Invalid params` error (code -32602)
+- Tool-call trace: every LLM tool call logged with
+  arguments, response, latency, source citations
+- Tests: `test_8_2_enforcement_rejects_unsourced_claim`,
+  `test_8_2_enforcement_rejects_contradiction_claim`,
+  `test_8_2_tool_call_trace_logged`
+- Test count: +3
+
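A rough sketch of the per-call check in 8.2, assuming tool results carry a list of claim ids and the latest consistency run exposes the ids it flagged. Both of those shapes, and the `enforce` helper, are hypothetical; only the JSON-RPC 2.0 envelope and the -32602 code are standard.

```python
INVALID_PARAMS = -32602  # JSON-RPC 2.0 "Invalid params"


def enforce(request_id, result, flagged_claim_ids):
    """Wrap a tool result, or reject it if it cites a flagged claim."""
    cited = set(result.get("claim_ids", []))  # assumed result field
    blocked = sorted(cited & set(flagged_claim_ids))
    if not blocked:
        return {"jsonrpc": "2.0", "id": request_id, "result": result}
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {
            "code": INVALID_PARAMS,
            "message": "Invalid params",
            "data": {"flagged_claims": blocked},
        },
    }


flagged = {"claim-7"}  # hypothetical output of the latest consistency run
ok = enforce(1, {"claim_ids": ["claim-1"], "text": "..."}, flagged)
rejected = enforce(2, {"claim_ids": ["claim-1", "claim-7"], "text": "..."}, flagged)
```

Putting the flagged ids in the error's `data` member gives the trace log something concrete to record for each rejection.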
+### 8.3 — Tool-call trace UI (AC 8.9)
+
+- The trace is persisted to `tests/harness/traces/{run_id}.jsonl`
+- UI: a sortable table view (latency, error, citation count)
+- Tests: `test_8_3_trace_jsonl_format`,
+  `test_8_3_trace_ui_sortable_by_latency`
+- Test count: +2
+
+### 8.4 — Graph snapshot + restore (AC 8.4, 8.5)
+
+- `scripts/07_snapshot.py` — exports the full graph to
+  `snapshots/{version_id}.json` (Neo4j: `apoc.export.json.all`;
+  InMemoryGraph: pickle)
+- `scripts/08_restore.py` — restores a snapshot
+- `scripts/09_diff.py` — diffs two snapshots, lists
+  added/removed/changed nodes and edges
+- Tests: `test_8_4_snapshot_round_trip`,
+  `test_8_4_diff_lists_changes`,
+  `test_8_4_restore_is_idempotent`
+- Test count: +3
+
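The diff step in 8.4 can be sketched for nodes as below, assuming a snapshot loads into a dict keyed by node id. That shape, the `diff_snapshots` name, and the example node ids are assumptions; edges would follow the same pattern with an edge key.

```python
def diff_snapshots(old, new):
    """List node ids that were added, removed, or changed between snapshots."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "changed": sorted(k for k in old_ids & new_ids if old[k] != new[k]),
    }


# Hypothetical before/after snapshots around a slice 6 style migration.
before = {
    "voldramir": {"label": "Region"},
    "underdark": {"label": "Region"},
    "old-town": {"label": "Location"},
}
after = {
    "voldramir": {"label": "Plane"},
    "underdark": {"label": "Region"},
    "material": {"label": "Plane"},
}
changes = diff_snapshots(before, after)
```

Sorting each list keeps the diff output stable across runs, which matters if the diff itself is checked into a results directory.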
+### 8.5 — Cross-setting query at the tool level (AC 8.6)
+
+- The setting filter from slice 6 is exposed in the
+  read tools. Add `events_in_setting(setting)`,
+  `list_settings()`, `planes_in_setting(setting)`,
+  `reflections_of(plane)` (for the Voldramir question)
+- Tests: `test_8_5_cross_setting_query`,
+  `test_8_5_reflections_of_voldramir`
+- Test count: +2
+
+### 8.6 — Consistency-queue UI (AC 8.1)
+
+- Web page that lists `:Contradiction`, `:Anachronism`,
+  `:Orphan`, `:OntologyViolation` nodes
+- Per-violation actions: acknowledge, dismiss (false positive),
+  drill into source documents side by side
+- Tests: Playwright (UI repo)
+- Backend test count: +2 (API contract)
+- UI test count: tracked in UI repo
+
+### 8.7 — YAML editor with live schema validation (AC 8.2)
+
+- VSCode extension OR Monaco-based web editor
+- Live schema validation with line numbers
+- Autocomplete from existing entity names (uses
+  `entities_present()` MCP tool)
+- Tests: backend test for the autocomplete API contract
+  (+1); UI tests in UI repo
+
+### 8.8 — Import-from-prose (AC 8.3)
+
+- Reads a markdown chapter
+- Proposes a YAML diff (the LLM uses the slice 3 extraction
+  prompt, but only proposes — never auto-merges)
+- World-builder reviews and approves per-entity
+- All proposed entities marked `proposed: true` until approved
+- Tests: `test_8_8_import_proposes_diff`,
+  `test_8_8_proposed_entities_marked`,
+  `test_8_8_auto_merge_rejected`
+- Test count: +3
+
+### Final — close the slice 8 AC table
+
+After the above sub-slices ship, the slice 8 AC table is
+revisited. New ACs are added for features the world-builder
+asks for after using the engine day-to-day.
+
+## Critical files to read before implementing
+
+- `docs/09-roadmap.md#phase-7-polish` — the original polish
+  list
+- `docs/10-critique.md#S3.4` — YAML authoring UX
+- `docs/10-critique.md#S3.3` — LLM enforcement
+- `docs/10-critique.md#S4.3` — versioning
+- `lore_engine_poc/mcp_http.py` — the MCP HTTP server
+  (slice 11) — the enforcement layer lives here
+- `lore_engine_poc/consistency_runner.py` — slice 2's
+  consistency engine (the enforcement layer reads its
+  output)
+- `scripts/` — the entry-point scripts
+
+## Risks
+
+1. **UI work is unbounded.** Each UI feature could be its own
+   project. Ship the smallest usable version of each, then
+   iterate.
+2. **YAML editor schema sync.** When the YAML schema evolves
+   (slice 1, slice 5T), the editor must follow. Ship the editor
+   *after* the schema is stable.
+3. **Import-from-prose hallucination.** The LLM that proposes
+   the diff can invent facts. Mitigation: every proposed entity
+   and edge must be marked `proposed: true` and shown to the
+   world-builder for explicit approval. Never auto-merge.
+4. **Export completeness.** A 10K-entity world is too large
+   for a single HTML file in a useful way. Needs pagination,
+   search, and a TOC. Don't ship export without these.
+
+## Out of scope
+
+- Multi-user collaboration (real-time editing, presence).
+- Authentication / authorization beyond the v1 single-user model.
+- Cloud hosting. The engine is local-first; cloud is a separate
+  project.
+- Mobile UI. The polish slice is desktop-first.
+
+## Acceptance check
+
+`python3 -m pytest tests/ -q` → 712 + 22 + 12 + 22 = 768 passed
+(slice 6, slice 7 Track A, and the slice 8 first cut). UI tests
+pass in the UI repo.
+
+## Effort estimate
+
+22 tests (engine side) + ~3-4 weeks of UI work. The slice
+is open-ended; the world-builder's actual day-to-day needs
+drive the priority order.
+
+## Cross-references
+
+- `docs/plan/08-slice-polish.md` — the design plan
+- `docs/09-roadmap.md#phase-7-polish` — the original list
+- `docs/10-critique.md#S3.4` — YAML authoring UX
+- `docs/10-critique.md#S3.3` — LLM enforcement
+- `docs/10-critique.md#S4.3` — versioning
--- a/docs/plan/exec/README.md
+++ b/docs/plan/exec/README.md
@@ -0,0 +1,43 @@
+# Slice Execution Roadmaps
+
+The slice plan files in the parent directory (`../0X-slice-*.md`)
+are *design* documents: what each slice achieves, what its AC
+table looks like, what its risks are. They're the contract.
+
+The exec files in this directory (`./0X-*.md`) are *roadmaps*
+for the next loop to pick up. Each one says:
+
+- **What ships today** — which slices are already done so the
+  loop doesn't redo them.
+- **Decisions locked** — the non-negotiable decisions from ADRs
+  and prior slice plans.
+- **Sub-slice ordering** — the sub-slices to ship, in dependency
+  order, with concrete test names (`test_<AC>_<description>`).
+- **Critical files to read** — what to load first.
+- **Acceptance check** — the green condition.
+- **Cross-references** — the design plan, the ADRs, related
+  docs.
+
+## When the loop picks one up
+
+1. Read the exec file end-to-end.
+2. Read the design plan (`../0X-slice-*.md`) for the AC table.
+3. Read the ADRs and other cross-references.
+4. Walk the sub-slice ordering, TDD-first.
+5. Commit each sub-slice with a `slice <N>.<M>: <title>` message.
+6. Push to `git.homelab.local` when the slice is complete.
+7. Update `docs/plan/README.md` to mark the slice shipped.
+
+## Files
+
+| Slice | Exec | Status | Blocker |
+|---|---|---|---|
+| 6 (Planes) | [06-planes.md](06-planes.md) | 📋 planned | none |
+| 7 (Harness) | [07-harness.md](07-harness.md) | 📋 planned (Track A unblocked) | `$OLLAMA_API_KEY` for Track B |
+| 8 (Polish) | [08-polish.md](08-polish.md) | 📋 open-ended | slices 6 + 7 first |
+
+Track A of slice 7 (the 50-question test set + system prompt
+authoring + harness runner scripts) is **not blocked** on the
+API key — it ships artifacts that can be tested offline with
+the FakeProvider. Track B (executing the harness against the
+live LLM and iterating on the system prompt) is gated.