docs(plan): refresh slice index, add missing Status, ship exec roadmaps

Three doc fixes so the plan directory matches the shipped state:

* docs/plan/README.md — the slice index was stale (still listed
  slices 1-7 as "planned"). Refresh the table to show what
  actually shipped (1, 2, 3, 4, 5a-Neo4j, 5b-TypeTemplate,
  2.6.1, 10, 11) and what remains (6, 7, 8). Drop the old
  "33 days" cumulative estimate. Update the dependency graph
  to reflect the shipped substrate.

* docs/plan/05-slice-neo4j-backend.md — missing Status header
  (the only shipped slice plan without one). Add it.

* docs/plan/exec/ — three execution roadmaps so the next loop
  picks up the remaining slices without re-deriving the
  sub-slice decomposition:
    - 06-planes.md (22 tests, 6 sub-slices, no blockers)
    - 07-harness.md (12 tests Track A unblocked; Track B
      gated on $OLLAMA_API_KEY)
    - 08-polish.md (22 tests engine-side, blocked on 6+7)
  Plus README.md indexing them.

Each exec file follows the slice 1 impl-plan template: scope,
what ships today, decisions locked, sub-slice ordering with
test names, critical files to read, acceptance check,
cross-references.

TaskCreate entries #26-29 mirror the four tasks.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-19 11:30:31 -04:00
parent 12c2237d1d
commit e2dc179c82
6 changed files with 649 additions and 36 deletions


@@ -1,5 +1,14 @@
# Slice 5 — Neo4j 5 GraphBackend Adapter
**Status:** ✅ shipped 2026-06-18. The Neo4j storage substrate per
ADR 0011. 8 sub-slices (5.1–5.8): GraphBackend Protocol +
InMemoryGraph rename + write-tool chokepoint + Neo4jGraph skeleton
w/ reified `:Relation` + full read-tool parity + full-codex
round-trip + `--write-neo4j` dual-write flag + `LORE_GRAPH_BACKEND`
env wiring + docker-compose neo4j svc. 559 → 632 tests
(+73). ADR 0011 written. Suite green with `docker compose
--profile neo4j up`.
## Context
The Lore Engine POC's in-memory `Graph` (`lore_engine_poc/tools.py`) was the


@@ -3,49 +3,58 @@
The Lore Engine on Cognee, sliced into independently shippable units.
Each slice has its own file with acceptance criteria and a test plan.
| # | Slice | Goal | Status | Effort |
The execution roadmap (what's done, what's next, what's blocked)
lives in `docs/plan/exec/` and the TaskCreate list. Slice plan
files are the *design*; exec files are the *roadmap*.
| # | Slice | Goal | Status | Shipped |
|---|---|---|---|---|
| 0 | [POC](00-slice-0-poc.md) | Validate the substrate; one tool end-to-end | ✅ done | 1 day |
| 1 | [Structured YAML](01-slice-structured-yaml.md) | Real `valid_from`/`valid_until` on edges | 📋 planned | 3-5 days |
| 2 | [Consistency engine](02-slice-consistency.md) | 4-category rule system | 📋 planned | 5-7 days |
| 3 | [LLM extraction](03-slice-llm-extraction.md) | Cognee cognify actually runs | 📋 planned | 3-5 days |
| 4 | [Remaining 44 tools](04-slice-tools.md) | Full 45-tool MCP surface | 📋 planned | 5-7 days |
| 5 | [TypeTemplate](05-slice-typetemplate.md) | Polymorphic extension model | 📋 planned | 5-7 days |
| 6 | [Plane model](06-slice-planes.md) | Setting + Plane graph nodes | 📋 planned | 2-3 days |
| 7 | [Reasoning harness](07-slice-harness.md) | 50-question validation gate | 📋 planned | 3-5 days |
| 0 | [POC](00-slice-0-poc.md) | Validate the substrate; one tool end-to-end | ✅ done | 2026-06-17 |
| 1 | [Structured YAML](01-slice-structured-yaml.md) | Real `valid_from`/`valid_until` on edges | ✅ shipped | 2026-06-18 |
| 2 | [Consistency engine](02-slice-consistency.md) | 4-category rule system | ✅ shipped | 2026-06-18 |
| 3 | [LLM extraction](03-slice-llm-extraction.md) | Cognee cognify actually runs | ✅ shipped | 2026-06-18 |
| 4 | [Remaining tools](04-slice-tools.md) | Read + write tool surface | ✅ shipped | 2026-06-18 |
| 5a | [Neo4j backend](05-slice-neo4j-backend.md) | GraphBackend Protocol + Neo4j adapter | ✅ shipped | 2026-06-18 |
| 5b | [TypeTemplate](05-slice-typetemplate.md) | Polymorphic extension model | ✅ shipped | 2026-06-19 |
| 2.6.1 | [MCP surface expansion](2.6.1-slice-mcp-surface-expansion.md) | read_tools over the MCP wire | ✅ shipped | 2026-06-18 |
| 10 | [Write tools](10-slice-write-tools-deferred.md) | 9 deferred write_tools | ✅ shipped | 2026-06-18 |
| 11 | [MCP HTTP + Docker](11-slice-mcp-http-docker.md) | Streamable HTTP transport | ✅ shipped | 2026-06-18 |
| 6 | [Plane model](06-slice-planes.md) | Setting + Plane graph nodes | 📋 planned | — |
| 7 | [Reasoning harness](07-slice-harness.md) | 50-question validation gate | 📋 planned (blocked: `$OLLAMA_API_KEY`) | — |
| 8 | [Polish](08-slice-polish.md) | UI, export, enforcement | 📋 open-ended | — |
**Cumulative:** MVP at end of slice 2 (~10 days), full v1 at end
of slice 4 (~21 days), v1 + extensions at end of slice 7
(~33 days).
**Total shipped so far:** 712 tests green (lore-engine-poc).
**Remaining:** 3 planned slices (6, 7, 8) + follow-up polish items
deferred from shipped slices (write tools for templates,
collapse-to-one-tool when count > 50, services/template-watcher/
and services/template-registry/ Go services per ADR 0012).
## Dependency graph
## Dependency graph (post-shipment)
```
0 (POC) ──┬──> 1 (YAML) ──┐
├──> 2 (Consistency) ──┐
└──> 3 (LLM) ───┘ │
├──> 4 (Tools) ──┐
│ │
│ ┌────────────┘
│ │
▼ ▼
5 (TypeTemplate)
6 (Planes)
7 (Harness)
8 (Polish)
shipped: 0 -> 1 -> 2 -> 4 -> 5a (Neo4j) ............ 11 (MCP HTTP)
\-> 3 / 2.6.1
\-> 5b (TypeTemplate) -> 10 (write tools)
remaining: 6 (Planes) -> 7 (Harness) -> 8 (Polish)
```
Slices 1 and 3 can run in parallel after slice 0. Slice 2
needs both 1 and 3 (it operates on the typed graph and the
prose-extracted graph). Slices 4-7 each depend on the prior
slice. Slice 8 is unbounded.
The shipped slices form the substrate (parsers, graph storage,
MCP wire, Docker). The remaining slices layer on top:
- **Slice 6 (Planes)** is foundational — adds the multi-setting
first-class nodes that downstream codex tooling will use.
Blocks 7 only insofar as the harness tests need a stable
setting_id contract.
- **Slice 7 (Harness)** is the validation gate — it measures
whether the LLM can answer correctly. Currently **blocked on
`$OLLAMA_API_KEY`** (slice 7 was always tied to a live LLM
provider per `docs/07-reasoning-harness.md`).
- **Slice 8 (Polish)** is open-ended — fill in based on what
the world-builder actually needs after slices 6 + 7 land.
Slice 6 has no LLM dependency, so slices 6 and 7 can run in
parallel. Slice 8 is downstream of both.
## What each slice proves
@@ -56,7 +65,8 @@ slice. Slice 8 is unbounded.
| 2 | Engine flags its first real contradiction |
| 3 | Prose path is fuzzy but useful for color/character voice |
| 4 | LLM can answer most question types in a single tool call |
| 5 | New domain types are a YAML exercise, not a code change |
| 5a | Production storage works (Neo4j round-trips) |
| 5b | New domain types are a YAML exercise, not a code change |
| 6 | Multi-setting worlds are first-class |
| 7 | The LLM, with the harness, answers correctly ≥80% of the time |
| 8 | The engine is a usable product, not just a working engine |

161
docs/plan/exec/06-planes.md Normal file

@@ -0,0 +1,161 @@
# Slice 6 — Implementation Roadmap
**Owner:** this loop (Claude).
**Scope:** `docs/plan/06-slice-planes.md` (the AC table is the
contract — 11 ACs, 6.1 through 6.11). Implementation lives in
`~/projects/lore-engine-poc/`.
**TDD rule:** every new behaviour ships with a failing test first;
test names follow `test_<AC>_<description>`.
## What ships today (the substrate)
The slice 1–5 work already shipped Setting as a string field
(`setting_id`) on every entity, but **`Setting` is not yet a
first-class graph node**, and **`Plane` does not exist at all**.
The 5.x substrate (per ADR 0011) is in place: GraphBackend
Protocol, InMemoryGraph + Neo4jGraph parity, the 36 core labels,
and the polymorphic Layer 2 (`:DomainEntity`, `:Relation`,
`:TypeTemplate`) from slice 5T.
**Already done (do not redo):**
- `setting_id` is on every entity (Person, Faction, Location,
Region, etc.). Per ADR 0004 (region ≠ plane), a v1.2 graph
distinguishes the two via the new `Plane` label.
- The Mardonari codex's Voldramir entry currently lives as a
`Region` node with `plane: true` in frontmatter; slice 6
promotes it to a `Plane` node.
- Migration from `world_id` to `setting_id` already happened in
the v1.2 doc rename. The slice 6 migration is the *graph node*
migration, not the doc rename.
## Decisions locked before coding starts
| # | Decision | Source | Implication |
|---|---|---|---|
| D1 | `EXISTS_IN` is a reified `:Relation` node, not a native edge | ADR 0009 | Time-bounded (planar travel) — the in-memory `Edge` from slice 0 stays the substrate; the new `EXISTS_IN` relation carries `valid_from`/`valid_until`. |
| D2 | `:Setting` and `:Plane` are top-level labels (Layer 1 core) | docs/01-ontology.md, ADR 0011 | They're added to `NODE_LABELS` in `lore_engine_poc/ontology.py`, same as `:DomainEntity` was in 5T.1. |
| D3 | Default plane for an entity with no `plane:` is the Material Plane of the default Setting | AC 6.10 | Migration must auto-create a Material Plane when none is declared. |
| D4 | Region vs Plane split via frontmatter | AC 6.8 | Entries with `plane: true` OR under `Campaign Codex - Planes/` are `:Plane`; everything else stays `:Region`. |
| D5 | Migration is idempotent | AC 6.9 | `MERGE` on `(Setting {id})`, `(Plane {id})`, and `(:Relation {type:'EXISTS_IN'})` keyed by `(entity_id, setting_id)`. |
## Sub-slice ordering and parallelisation
The 6 code sub-slices below (6.1 to 6.6) respect the dependency
order `schema → nodes → edges → backfill → tools → migration`;
6.7 is docs-only cleanup.
### 6.1 — Setting + Plane schema + ontology (AC 6.1, 6.2)
- Add `:Setting` and `:Plane` to `NODE_LABELS`
- Add `:EXISTS_IN`, `:REFLECTS`, `:LAYER_OF`, `:ADJACENT_TO`,
`:ACCESSIBLE_VIA` to `ALLOWED_LABELS` (they may already be
present in `write_tools.py:48`; verify before adding)
- `lore_engine_poc/setting.py` (new) — dataclasses `Setting(id,
kind, current_era, schema_version, created_at)` and `Plane(id,
setting_id, name, kind)`
- Test: `test_6_1_setting_and_plane_in_node_labels`
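A minimal sketch of the two dataclasses named above, assuming plain string fields and a timezone-aware `created_at`. The field types, the `_utc_now` helper, and the sample values are illustrative assumptions, not the shipped `lore_engine_poc/setting.py`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _utc_now() -> datetime:
    """Hypothetical helper: timezone-aware creation timestamp."""
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class Setting:
    """First-class setting node; field names follow sub-slice 6.1."""
    id: str
    kind: str              # assumed to be a free-form string
    current_era: str       # assumed to be an era identifier
    schema_version: str
    created_at: datetime = field(default_factory=_utc_now)


@dataclass(frozen=True)
class Plane:
    """Plane node that belongs to exactly one setting."""
    id: str
    setting_id: str
    name: str
    kind: str


# Hypothetical values, for illustration only.
setting = Setting(id="mardonari", kind="campaign",
                  current_era="era-1", schema_version="1.2")
plane = Plane(id="voldramir", setting_id=setting.id,
              name="Voldramir", kind="plane")
```

Freezing the dataclasses is a design choice in this sketch: it keeps node identity stable once the object is handed to a backend.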
### 6.2 — GraphBackend methods (Layer 1 extensions)
- Extend `GraphBackend` Protocol with: `add_setting`,
`add_plane`, `add_exists_in`, `find_setting(id)`,
`planes_in_setting(setting_id)`, `entity_planes(entity_id)`
- InMemoryGraph: storage dicts + endpoint indexes
- Neo4jGraph: mirror via `CREATE CONSTRAINT` + `session.run()`
- 5 tests: protocol conformance + parity
- Test count: +5
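The Protocol extension in 6.2 could look roughly like the sketch below. The method names come from the list above; the signatures, the `SettingAwareBackend` name, and the `ToyGraph` class are assumptions made for illustration and do not describe the real `GraphBackend` or `InMemoryGraph`.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class SettingAwareBackend(Protocol):
    """Hypothetical slice of the Protocol; signatures are assumed."""
    def add_setting(self, setting_id: str) -> None: ...
    def add_plane(self, plane_id: str, setting_id: str) -> None: ...
    def add_exists_in(self, entity_id: str, plane_id: str) -> None: ...
    def find_setting(self, setting_id: str) -> Optional[str]: ...
    def planes_in_setting(self, setting_id: str) -> list: ...
    def entity_planes(self, entity_id: str) -> list: ...


class ToyGraph:
    """Dict-backed stand-in used only to show conformance."""

    def __init__(self) -> None:
        self._settings = set()
        self._plane_setting = {}   # plane_id -> setting_id
        self._entity_planes = {}   # entity_id -> [plane_id, ...]

    def add_setting(self, setting_id: str) -> None:
        self._settings.add(setting_id)

    def add_plane(self, plane_id: str, setting_id: str) -> None:
        if setting_id not in self._settings:
            raise KeyError(f"unknown setting: {setting_id}")
        self._plane_setting[plane_id] = setting_id

    def add_exists_in(self, entity_id: str, plane_id: str) -> None:
        planes = self._entity_planes.setdefault(entity_id, [])
        if plane_id not in planes:   # repeated calls are no-ops
            planes.append(plane_id)

    def find_setting(self, setting_id: str) -> Optional[str]:
        return setting_id if setting_id in self._settings else None

    def planes_in_setting(self, setting_id: str) -> list:
        return sorted(p for p, s in self._plane_setting.items()
                      if s == setting_id)

    def entity_planes(self, entity_id: str) -> list:
        return list(self._entity_planes.get(entity_id, []))
```

A conformance test can then assert `isinstance(backend, SettingAwareBackend)` for each backend, which only checks that the methods exist, not that their behaviour matches.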
### 6.3 — Plane-relation edge types (AC 6.2)
- `REFLECTS`, `LAYER_OF`, `ADJACENT_TO`, `ACCESSIBLE_VIA` as
edge-label tests (typed edges, not reified — these are not
time-bounded)
- Test: `test_6_3_plane_relation_round_trip` for each
### 6.4 — Backfill of EXISTS_IN (AC 6.3, 6.10)
- Migration helper `migrate_setting_id_to_exists_in()` walks
every entity with `setting_id`, ensures a `:Setting` node
exists, ensures a default `Material Plane` exists in that
setting, creates one `:Relation {type: EXISTS_IN}` edge
- Idempotent: re-running produces the same graph
- Test: `test_6_4_backfill_idempotent` + `test_6_4_default_material_plane`
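The idempotency requirement can be illustrated with plain dicts standing in for the graph. This is a sketch under stated assumptions: the argument list, the `"<setting>/material-plane"` id scheme, and the dict layout are hypothetical, and the real helper would go through the `GraphBackend`.

```python
MATERIAL_PLANE_NAME = "Material Plane"


def migrate_setting_id_to_exists_in(entities, settings, planes, relations):
    """Backfill EXISTS_IN for every entity that carries a setting_id.

    `settings`, `planes`, and `relations` are dicts mutated in place.
    `setdefault` plays the role of MERGE, so re-running is a no-op.
    """
    for entity in entities:
        setting_id = entity.get("setting_id")
        if not setting_id:
            continue  # nothing to backfill for this entity
        settings.setdefault(setting_id, {"id": setting_id})
        plane_id = f"{setting_id}/material-plane"  # assumed id scheme
        planes.setdefault(plane_id, {
            "id": plane_id,
            "setting_id": setting_id,
            "name": MATERIAL_PLANE_NAME,
        })
        # Keyed by (entity_id, setting_id), as in decision D5.
        relations.setdefault((entity["id"], setting_id), {
            "type": "EXISTS_IN",
            "plane_id": plane_id,
        })
```

Running the helper twice over the same entity list leaves all three dicts unchanged after the first pass, which is the property `test_6_4_backfill_idempotent` is meant to pin down.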
### 6.5 — Setting filter on read tools (AC 6.5, 6.6)
- Add `setting` parameter to `entities_present`, `was_true_at`,
`true_during`, `events_during`, `lookup`, `entity_context`
- Filter resolves via `EXISTS_IN` traversal
- Test: `test_6_5_setting_filter_on_*` for each tool (6 tests)
- Test count: +6
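One way to picture the `EXISTS_IN` traversal behind the `setting` parameter, assuming two lookup maps are available. The function name and both maps are hypothetical; the real read tools would query the backend instead.

```python
def filter_by_setting(entity_ids, setting, exists_in, plane_setting):
    """Keep entities that exist in at least one plane of `setting`.

    exists_in:     entity_id -> iterable of plane_ids (EXISTS_IN targets)
    plane_setting: plane_id  -> setting_id
    A `setting` of None means "no filter", which keeps the parameter
    backwards compatible for callers that never pass it.
    """
    if setting is None:
        return list(entity_ids)
    return [
        entity_id
        for entity_id in entity_ids
        if any(plane_setting.get(plane_id) == setting
               for plane_id in exists_in.get(entity_id, ()))
    ]
```

Treating `None` as "unfiltered" is an assumption of this sketch; the plan does not say whether the parameter is optional.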
### 6.6 — Region ↔ Plane migration (AC 6.8, 6.11)
- `scripts/05_migrate_planes.py` reads the codex frontmatter
- Entries with `plane: true` or under `Campaign Codex - Planes/`
become `:Plane` nodes (deleting their `:Region` form if any)
- All `[[Underdark]]` body-text references in Voldramir's
markdown become `LAYER_OF` edges
- `--dry-run` mode prints the list of changes without applying
- Test: `test_6_6_migration_distinguishes_region_from_plane` +
`test_6_6_voldramir_becomes_plane`
- Test count: +2
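The Region vs Plane rule (decision D4) reduces to a small classifier. The sketch below assumes frontmatter has already been parsed into a dict and that codex paths use forward slashes; `classify_entry` is a hypothetical name.

```python
from pathlib import PurePosixPath

PLANES_DIR = "Campaign Codex - Planes"


def classify_entry(path: str, frontmatter: dict) -> str:
    """Return "Plane" or "Region" for a codex entry.

    An entry is a Plane when its frontmatter has `plane: true` OR it
    sits anywhere under the planes directory; otherwise it stays a
    Region.
    """
    if frontmatter.get("plane") is True:
        return "Plane"
    if PLANES_DIR in PurePosixPath(path).parts:
        return "Plane"
    return "Region"
```

A `--dry-run` mode could call this for every entry and print only the entries whose label would change.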
### 6.7 — docs cleanup (AC 6.7)
- `grep -r "world_id" docs/` and remove every reference outside
the migration section
- Update `docs/01-ontology.md`, `docs/11-extensibility.md`,
`docs/14-examples.md` to use `setting_id` and reference the
new `Plane` model
### Final — end-to-end demo + ADR
- ADR 0013 — the v1.2 plane-model migration story
- Killer demo: seed two settings (Mardonari + The Wild Dream),
cross-setting query returns only the requested setting's events
- Total test count: ~712 + 22 = 734
## Critical files to read before implementing
- `docs/01-ontology.md` — current 36 labels + the v1.2 addendum
- `docs/17-planes.md` — the full plane-model design
- `docs/04-consistency.md` — consistency rules will need
`setting_id` on violation nodes (per slice 6 risks)
- `lore_engine_poc/ontology.py` — NODE_LABELS, ALLOWED_LABELS
- `lore_engine_poc/graph_backend.py` — Protocol + InMemoryGraph
- `lore_engine_poc/neo4j_graph.py` — Neo4j substrate
- `lore_engine_poc/tools.py` — read tools that need setting
filter
- `lore_engine_poc/templates/schema.py` — templates will need
Plane awareness (deferred to slice 5T.6 if user asks)
## Out of scope (deferred)
- Plane model in the UI (slice 8).
- Cross-setting consistency rules (deferred; would re-open
slice 2's consistency engine).
- Templates get a `plane_id` field (slice 5T.6 — separate
request).
## Acceptance check
`python3 -m pytest tests/ -q` → 734 passed. Cross-setting
query `entities_present(setting=mardonari, at_time=...)` returns
only Mardonari's events. Voldramir is a `:Plane` node in the
v1.2 graph; the Underdark is `LAYER_OF` it.
## Effort estimate
22 tests, 6 sub-slices, ~3-5 days.
## Cross-references
- `docs/plan/06-slice-planes.md` — the design plan
- `docs/17-planes.md` — plane-model design
- `docs/09-roadmap.md#v12-migration` — migration plan
- `docs/10-critique.md#S3.2` — cross-world queries
- ADR 0004 — region ≠ plane (rationale)
- ADR 0009 — reified `:Relation` (used for `:EXISTS_IN`)
- ADR 0011 — GraphBackend Protocol
- ADR 0012 — slice 5T (precedent for adding to NODE_LABELS)


@@ -0,0 +1,195 @@
# Slice 7 — Implementation Roadmap
**Owner:** this loop (Claude) **once `$OLLAMA_API_KEY` is
available**.
**Scope:** `docs/plan/07-slice-harness.md` (the AC table is the
contract — 8 ACs, 7.1 through 7.8). Implementation lives in
`~/projects/lore-engine-poc/`.
**TDD rule:** the 50-question harness has golden answers;
the system prompt is iterated until ≥80% accuracy.
## Blocker
This slice is **blocked on `$OLLAMA_API_KEY`** (slice 7 was
always tied to a live LLM provider per `docs/07-reasoning-harness.md`).
The primary model is **Minimax-M3** (per ADR 0005), reached via
the OpenAI-compatible endpoint. The harness cannot run end-to-end
without API access.
The slice can proceed in two parallel tracks:
- **Track A — no LLM required** — author the 50 questions,
the system prompt, the failure-mode log, the red-team suite,
the per-tool test cases. All of these are pure-Python artifacts
and ship without an API key.
- **Track B — needs API key** — execute the harness against
the live LLM, measure accuracy, iterate on the system prompt.
Track A can ship now. Track B is gated.
## What ships today (the substrate)
- The MCP server exposes 36 core tools + 14 template-generated
tools = 50 total (per slice 5T.5 + slice 10 + slice 11).
- The system prompt lives in
`lore_engine_poc/prompts/system_prompt.md` (per slice 3 AC 3.2).
- The 5 question types are documented in
`docs/07-reasoning-harness.md` (5 × 10 = 50 questions).
**Already done (do not redo):**
- `scripts/03_demo.py` is the demo loop (per slice 0).
- `lore_engine_poc/llm.py` has the LiteLLM adapter + the
FakeProvider for tests (per slice 3).
- The Ollama Cloud `minimax-m3:cloud` model is wired through
per ADR 0005.
## Decisions locked before coding starts
| # | Decision | Source | Implication |
|---|---|---|---|
| D1 | Primary model: **Minimax-M3** via Ollama Cloud | ADR 0005 | The harness scripts default to `minimax-m3:cloud`; switchable via env. |
| D2 | Thinking mode: `adaptive` | 07-reasoning-harness.md §measurement | M3 supports adaptive; no other model is calibrated. |
| D3 | The 50 questions are **versioned** (tests/harness/questions.json has a `version` field) | per critique | Old results stay comparable when the prompt iterates. |
| D4 | Tool-selection accuracy is measured per question type, not aggregate | per critique | A 70% aggregate can hide a 50% on time-window questions. |
| D5 | Failure-mode log is **mandatory**: every wrong answer is logged with a hypothesis | 07-reasoning-harness.md §failure-mode | The log is the input to system-prompt iteration, not a post-hoc artefact. |
## Sub-slice ordering
The 4 sub-slices below respect the dependency order
`questions → prompt → runner → iteration`.
### 7.1 — Author the 50-question test set (Track A, no API key needed)
- 5 question types × 10 questions = 50 total
- Each question: `id, type, query, expected_tools (sequence),
expected_answer_shape, expected_citations`
- Build script `scripts/harness/build_questions.py` generates
the JSON from a YAML source (`tests/harness/questions.yaml`)
- Tests: `test_7_1_questions_match_schema`,
`test_7_1_50_questions_total`,
`test_7_1_10_per_type`,
`test_7_1_every_question_has_expected_tools`
- Test count: +4
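The four 7.1 tests boil down to checks like the ones sketched here. The required field names are taken from the bullet above; the function name, the error-message wording, and the default counts as parameters are assumptions.

```python
from collections import Counter

REQUIRED_FIELDS = {
    "id", "type", "query", "expected_tools",
    "expected_answer_shape", "expected_citations",
}


def validate_questions(questions, n_types=5, per_type=10):
    """Return a list of human-readable problems (empty means valid)."""
    errors = []
    for question in questions:
        missing = REQUIRED_FIELDS - set(question)
        if missing:
            errors.append(
                f"{question.get('id', '?')}: missing {sorted(missing)}")
        elif not question["expected_tools"]:
            errors.append(f"{question['id']}: expected_tools is empty")

    counts = Counter(q.get("type") for q in questions)
    if len(counts) != n_types:
        errors.append(f"expected {n_types} types, found {len(counts)}")
    for qtype, count in sorted(counts.items(), key=lambda kv: str(kv[0])):
        if count != per_type:
            errors.append(
                f"type {qtype!r}: expected {per_type}, found {count}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a single test run report every malformed question at once.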
### 7.2 — System prompt + version registry (Track A)
- `lore_engine_poc/prompts/system_prompt.md` — the canonical
prompt (the 5 question types, citation rule, time-window
rule, contradiction rule)
- Versioned in `prompts/registry.json`; the harness reads
`prompts/system_prompt.v{N}.md`
- The slice 3 prompt-mirror is updated to include the v1.2
TypeTemplate tools (per the slice 5T.5 follow-up note in
recent memory)
- Tests: `test_7_2_prompt_has_five_question_types`,
`test_7_2_prompt_citation_rule_present`,
`test_7_2_prompt_time_window_rule_present`,
`test_7_2_prompt_mentions_template_tools`
- Test count: +4
### 7.3 — Harness runner (Track A; Track B for execution)
- `scripts/harness/run_questions.py` — runs the 50 questions
against the MCP server + the LLM, measures:
- Tool-selection accuracy per question type
- Citation rate (every claim cites ≥1 source)
- Hallucination rate (no fact without a source)
- Time-window violations (no claim outside `valid_from`/`valid_until`)
- Output: `tests/harness/results/run-{NNN}.json`
- Tests: `test_7_3_runner_parses_results` (offline test with
FakeProvider fixture), `test_7_3_runner_aggregates_metrics`
- Test count: +2
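Decision D4 asks for tool-selection accuracy per question type rather than in aggregate. A minimal aggregation sketch follows, assuming each result record carries `type`, `expected_tools`, and `called_tools`, and that a hit means an exact sequence match. The record shape and the matching rule are both assumptions.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Map question type -> fraction of questions with the right tools."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for result in results:
        qtype = result["type"]
        totals[qtype] += 1
        # Assumed rule: the called sequence must equal the expected one.
        if list(result["called_tools"]) == list(result["expected_tools"]):
            hits[qtype] += 1
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}
```

Reporting one number per type is what exposes the case the decision table warns about, where a passable aggregate hides a weak category.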
### 7.4 — Red-team suite (Track A; Track B for execution)
- `scripts/harness/run_redteam.py` — 20 adversarial questions
per the plan doc (time-window trap, ambiguous name,
contradiction trap, hallucination trap, citation bypass)
- Output: `tests/harness/redteam/run-{NNN}.json`
- Tests: `test_7_4_redteam_20_questions`,
`test_7_4_redteam_failure_modes_logged`
- Test count: +2
### Final — execute and iterate (Track B)
Once `$OLLAMA_API_KEY` is available:
```bash
export LORE_LLM_PROVIDER=ollama
export LORE_LLM_MODEL=minimax-m3:cloud
# Fail fast if the key is not already set in the calling shell.
export OLLAMA_API_KEY="${OLLAMA_API_KEY:?OLLAMA_API_KEY must be set}"
python3 scripts/harness/run_questions.py \
  --questions tests/harness/questions.json \
  --out tests/harness/results/run-001.json
```
Pass criterion: tool-selection accuracy ≥80% on the 50 questions
(per AC 7.3), citation rate ≥90% (AC 7.4), hallucination rate <5%
(AC 7.5), time-window violations <5% (AC 7.6).
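The four pass criteria can be encoded as a small gate. The thresholds are the ones stated above; the metric key names and the `failed_criteria` function are hypothetical, and rates are assumed to be fractions in the range 0 to 1.

```python
import operator

# metric name -> (comparison, threshold), mirroring AC 7.3 to 7.6
CRITERIA = {
    "tool_selection_accuracy": (operator.ge, 0.80),
    "citation_rate": (operator.ge, 0.90),
    "hallucination_rate": (operator.lt, 0.05),
    "time_window_violation_rate": (operator.lt, 0.05),
}


def failed_criteria(metrics):
    """Return the names of criteria that are missing or not met."""
    failed = []
    for name, (compare, threshold) in CRITERIA.items():
        value = metrics.get(name)
        if value is None or not compare(value, threshold):
            failed.append(name)
    return failed
```

A missing metric counts as a failure here, so a runner that silently skips a measurement cannot pass the gate.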
If accuracy is below 80% with 45 tools: collapse per critique
S2.4 (`state_at` → `entity_context(comprehensive=true)`,
`summarize_chain` → `narrate_arc(style=bullets)`, drop tools
used <2% of the time). This becomes a separate "tool collapse"
sub-slice before re-running.
### Final — ADR + results writeup
- ADR 0014 — the validation gate: which models, what accuracy,
what failure modes, what prompt iteration loop
- Results: `docs/harness/run-001.md` — the baseline numbers,
the failure-mode log, the prompt iteration history
## Critical files to read before implementing
- `docs/07-reasoning-harness.md` — the full system prompt and
the 5 question types
- `docs/05-mcp-tools.md` — the 45-tool surface (now 50 with
slice 5T.5)
- `docs/10-critique.md#S3.3` — LLM misbehavior
- `lore_engine_poc/prompts/` — existing prompt registry
- `lore_engine_poc/llm.py` — LiteLLM adapter + FakeProvider
- `scripts/03_demo.py` — the demo loop (harness piggybacks
on this)
## Risks
1. **S2.4 — tool-selection accuracy.** 45 tools is past the
empirical ceiling. If the harness shows poor selection,
collapse the long tail (see Final above).
2. **S3.3 — LLM misbehavior.** The system prompt is *instruction*,
not *constraint*. Mitigation: an enforcement layer in the
MCP server that rejects tool calls inconsistent with the
latest `:ConsistencyRun` (per slice 8 polish).
3. **Test set overfitting.** If the 50 questions are tuned to
M3 and only scored by M3, the numbers lie. Mitigate by
running a subset against `gpt-4o` and `claude-sonnet-4-6`
as a sanity check — large divergence between vendors is a
red flag.
4. **Cost.** M3 at $0.30 / $1.20 per 1M tokens makes the
50×3 harness + red-team ~$5–10 total. Not a budget item.
## Out of scope
- Production enforcement (slice 8).
- UI for failure-mode review (slice 8).
- Cross-LLM benchmarks (deferred — pick a target LLM first).
## Acceptance check
`python3 -m pytest tests/ -q` → 712 + 12 = 724 passed (Track A
only). Track B requires `python3 scripts/harness/run_questions.py`
to print ≥80% tool-selection accuracy.
## Effort estimate
12 tests (Track A) + Track B execution. 3-5 days.
## Cross-references
- `docs/plan/07-slice-harness.md` — the design plan
- `docs/07-reasoning-harness.md` — the system prompt
- `docs/05-mcp-tools.md` — the tool catalog
- ADR 0005 — Minimax-M3 primary LLM
- ADR 0012 — TypeTemplate (harness must test template tools too)

195
docs/plan/exec/08-polish.md Normal file

@@ -0,0 +1,195 @@
# Slice 8 — Implementation Roadmap
**Owner:** this loop (Claude) **after slice 6 + slice 7 ship**.
**Scope:** `docs/plan/08-slice-polish.md` (the AC table is the
contract — 9 ACs, 8.1 through 8.9). Implementation lives in
`~/projects/lore-engine-poc/` (engine side) and a separate UI
repo (frontend side — TBD).
**TDD rule:** every UI feature ships with a Playwright/Selenium
test; every export ships with a render-and-verify test.
## What ships today (the substrate)
The slice 1–7 work has produced a working engine: structured
ingest, consistency rules, LLM extraction, 50 MCP tools,
Neo4j backend, TypeTemplate polymorphism, MCP HTTP transport,
Docker. Slice 8 makes it a *usable product*.
**Already done (do not redo):**
- The engine is end-to-end functional (slice 11's
docker-compose stack runs the full pipeline).
- `scripts/03_demo.py` is the demo loop.
- The consistency engine emits violation nodes (slice 2).
- The graph has `setting_id` on every entity (slice 1,
promoted to first-class `:Setting` in slice 6).
## Decisions locked before coding starts
| # | Decision | Source | Implication |
|---|---|---|---|
| D1 | Slice 8 is **open-ended** by design | 08-slice-polish.md | The polish list is filled in based on what the world-builder actually needs; this exec file is the *first* cut, not the last. |
| D2 | The UI is a **separate repo** | per critique S3.4 | The engine stays headless; the UI talks to the MCP HTTP server (slice 11). |
| D3 | The enforcement layer is **per-call, not session-wide** | per critique S3.3 | The MCP server checks every tool call's response against the latest `:ConsistencyRun`; the LLM cannot bypass it. |
| D4 | Export format priority: **HTML first**, then Markdown, then PDF | 08-slice-polish.md AC 8.7 | HTML is the most useful for review; PDF is the highest-friction (needs pagination). |
## Sub-slice ordering
The polish list is filled in iteratively. This exec file
captures the **first cut** — the work that ships before the
slice 8 AC table is closed.
### 8.1 — Export to a single HTML file (AC 8.7)
- `scripts/06_export.py` — walks the graph, emits a single
HTML with internal `[[wiki links]]`, time-bounded facts
showing their window, contradictions flagged inline,
citations linked to `LoreSource` nodes
- Pagination + search + TOC for 10K+ entity worlds
(per risk: "Export completeness")
- Tests: `test_8_1_export_renders_html`,
`test_8_1_internal_links_resolve`,
`test_8_1_time_bounded_facts_show_window`,
`test_8_1_contradictions_flagged_inline`,
`test_8_1_citations_linked`,
`test_8_1_pagination_for_10k_entities`
- Test count: +6
### 8.2 — Enforcement layer in MCP server (AC 8.8)
- Per-call check: every tool response is cross-referenced
against the latest `:ConsistencyRun`; if a response
contains a claim that the consistency engine has flagged,
the server returns a JSON-RPC `invalid_params` error
- Tool-call trace: every LLM tool call logged with
arguments, response, latency, source citations
- Tests: `test_8_2_enforcement_rejects_unsourced_claim`,
`test_8_2_enforcement_rejects_contradiction_claim`,
`test_8_2_tool_call_trace_logged`
- Test count: +3
### 8.3 — Tool-call trace UI (AC 8.9)
- The trace is persisted to `tests/harness/traces/{run_id}.jsonl`
- UI: a sortable table view (latency, error, citation count)
- Tests: `test_8_3_trace_jsonl_format`,
`test_8_3_trace_ui_sortable_by_latency`
- Test count: +2
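A sketch of what one trace record and the latency sort might look like. The logged fields are the ones listed in 8.2 (arguments, response, latency, source citations); the key names, the millisecond unit, and the helper functions are assumptions.

```python
import json


def trace_line(tool, arguments, response, latency_ms, citations):
    """Serialise one tool call as a single JSONL line (no newline)."""
    record = {
        "tool": tool,
        "arguments": arguments,
        "response": response,
        "latency_ms": latency_ms,
        "citations": citations,
    }
    return json.dumps(record, sort_keys=True)


def parse_trace(text):
    """Parse JSONL text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def sort_by_latency(records, descending=True):
    """Order records for the table view, slowest first by default."""
    return sorted(records, key=lambda r: r["latency_ms"],
                  reverse=descending)
```

One JSON object per line keeps the file appendable during a run, so a crashed harness still leaves a readable partial trace.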
### 8.4 — Graph snapshot + restore (AC 8.4, 8.5)
- `scripts/07_snapshot.py` — exports the full graph to
`snapshots/{version_id}.json` (Neo4j: `apoc.export.json`;
InMemoryGraph: pickle)
- `scripts/08_restore.py` — restores a snapshot
- `scripts/09_diff.py` — diffs two snapshots, lists
added/removed/changed nodes and edges
- Tests: `test_8_4_snapshot_round_trip`,
`test_8_4_diff_lists_changes`,
`test_8_4_restore_is_idempotent`
- Test count: +3
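The diff step can be sketched over two already-loaded snapshots. This assumes each snapshot is a dict with `nodes` and `edges` maps keyed by id; the real snapshot layout is not specified in this plan, so the shape is hypothetical.

```python
def diff_snapshots(old, new):
    """List added, removed, and changed ids for nodes and edges.

    Assumed snapshot shape:
    {"nodes": {id: props}, "edges": {id: props}}.
    """
    report = {}
    for kind in ("nodes", "edges"):
        before = old.get(kind, {})
        after = new.get(kind, {})
        shared = before.keys() & after.keys()
        report[kind] = {
            "added": sorted(after.keys() - before.keys()),
            "removed": sorted(before.keys() - after.keys()),
            "changed": sorted(k for k in shared if before[k] != after[k]),
        }
    return report
```

Sorting the id lists makes the output deterministic, which keeps `test_8_4_diff_lists_changes` stable across runs.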
### 8.5 — Cross-setting query at the tool level (AC 8.6)
- The setting filter from slice 6 is exposed in the
read tools. Add `events_in_setting(setting)`,
`list_settings()`, `planes_in_setting(setting)`,
`reflections_of(plane)` (for the Voldramir question)
- Tests: `test_8_5_cross_setting_query`,
`test_8_5_reflections_of_voldramir`
- Test count: +2
### 8.6 — Consistency-queue UI (AC 8.1)
- Web page that lists `:Contradiction`, `:Anachronism`,
`:Orphan`, `:OntologyViolation` nodes
- Per-violation actions: acknowledge, dismiss (false positive),
drill into source documents side by side
- Tests: Playwright (UI repo)
- Backend test count: +2 (API contract)
- UI test count: tracked in UI repo
### 8.7 — YAML editor with live schema validation (AC 8.2)
- VSCode extension OR Monaco-based web editor
- Live schema validation with line numbers
- Autocomplete from existing entity names (uses
`entities_present()` MCP tool)
- Tests: backend test for the autocomplete API contract
(+1); UI tests in UI repo
### 8.8 — Import-from-prose (AC 8.3)
- Reads a markdown chapter
- Proposes a YAML diff (the LLM uses the slice 3 extraction
prompt, but only proposes — never auto-merges)
- World-builder reviews and approves per-entity
- All proposed entities marked `proposed: true` until approved
- Tests: `test_8_8_import_proposes_diff`,
`test_8_8_proposed_entities_marked`,
`test_8_8_auto_merge_rejected`
- Test count: +3
### Final — close the slice 8 AC table
After the above sub-slices ship, the slice 8 AC table is
revisited. New ACs are added for features the world-builder
asks for after using the engine day-to-day.
## Critical files to read before implementing
- `docs/09-roadmap.md#phase-7-polish` — the original polish
list
- `docs/10-critique.md#S3.4` — YAML authoring UX
- `docs/10-critique.md#S3.3` — LLM enforcement
- `docs/10-critique.md#S4.3` — versioning
- `lore_engine_poc/mcp_http.py` — the MCP HTTP server
(slice 11) — the enforcement layer lives here
- `lore_engine_poc/consistency_runner.py` — slice 2's
consistency engine (the enforcement layer reads its
output)
- `scripts/` — the entry-point scripts
## Risks
1. **UI work is unbounded.** Each UI feature could be its own
project. Ship the smallest usable version of each, then
iterate.
2. **YAML editor schema sync.** When the YAML schema evolves
(slice 1, slice 5T), the editor must follow. Ship the editor
*after* the schema is stable.
3. **Import-from-prose hallucination.** The LLM that proposes
the diff can invent facts. Mitigation: every proposed entity
and edge must be marked `proposed: true` and shown to the
world-builder for explicit approval. Never auto-merge.
4. **Export completeness.** A 10K-entity world is too large
for a single HTML file in a useful way. Needs pagination,
search, and a TOC. Don't ship export without these.
## Out of scope
- Multi-user collaboration (real-time editing, presence).
- Authentication / authorization beyond the v1 single-user model.
- Cloud hosting. The engine is local-first; cloud is a separate
project.
- Mobile UI. The polish slice is desktop-first.
## Acceptance check
`python3 -m pytest tests/ -q` → 712 + 22 + 12 + 22 = 768 passed
(slice 6 + slice 7 Track A + slice 8 first cut). UI tests pass in
the UI repo.
## Effort estimate
22 tests (engine side) + ~3-4 weeks of UI work. The slice
is open-ended; the world-builder's actual day-to-day needs
drive the priority order.
## Cross-references
- `docs/plan/08-slice-polish.md` — the design plan
- `docs/09-roadmap.md#phase-7-polish` — the original list
- `docs/10-critique.md#S3.4` — YAML authoring UX
- `docs/10-critique.md#S3.3` — LLM enforcement
- `docs/10-critique.md#S4.3` — versioning

43
docs/plan/exec/README.md Normal file

@@ -0,0 +1,43 @@
# Slice Execution Roadmaps
The slice plan files in the parent directory (`../0X-slice-*.md`) are
*design* documents — what each slice achieves, what its AC table
looks like, what its risks are. They're the contract.
The exec files in this directory (`./0X-*.md`) are *roadmaps*
for the next loop to pick up. Each one says:
- **What ships today** — which slices are already done so the
loop doesn't redo them.
- **Decisions locked** — the non-negotiable decisions from ADRs
and prior slice plans.
- **Sub-slice ordering** — the sub-slices to ship, in dependency
order, with concrete test names (`test_<AC>_<description>`).
- **Critical files to read** — what to load first.
- **Acceptance check** — the green condition.
- **Cross-references** — the design plan, the ADRs, related
docs.
## When the loop picks one up
1. Read the exec file end-to-end.
2. Read the design plan (`../0X-slice-*.md`) for the AC table.
3. Read the ADRs and other cross-references.
4. Walk the sub-slice ordering, TDD-first.
5. Commit each sub-slice with a `slice <N>.<M>: <title>` message.
6. Push to `git.homelab.local` when the slice is complete.
7. Update `docs/plan/README.md` to mark the slice shipped.
## Files
| Slice | Exec | Status | Blocker |
|---|---|---|---|
| 6 (Planes) | [06-planes.md](06-planes.md) | 📋 planned | none |
| 7 (Harness) | [07-harness.md](07-harness.md) | 📋 planned (Track A unblocked) | `$OLLAMA_API_KEY` for Track B |
| 8 (Polish) | [08-polish.md](08-polish.md) | 📋 open-ended | slices 6 + 7 first |
Track A of slice 7 (the 50-question test set + system prompt
authoring + harness runner scripts) is **not blocked** on the
API key — it ships artifacts that can be tested offline with
the FakeProvider. Track B (executing the harness against the
live LLM and iterating on the system prompt) is gated.