docs(plan): refresh slice index, add missing Status, ship exec roadmaps
Three doc fixes so the plan directory matches the shipped state:
* docs/plan/README.md — the slice index was stale (still listed
slices 1-7 as "planned"). Refresh the table to show what
actually shipped (1, 2, 3, 4, 5a-Neo4j, 5b-TypeTemplate,
2.6.1, 10, 11) and what remains (6, 7, 8). Drop the old
"33 days" cumulative estimate. Update the dependency graph
to reflect the shipped substrate.
* docs/plan/05-slice-neo4j-backend.md — missing Status header
(the only shipped slice plan without one). Add it.
* docs/plan/exec/ — three execution roadmaps so the next loop
picks up the remaining slices without re-deriving the
sub-slice decomposition:
- 06-planes.md (22 tests, 6 sub-slices, no blockers)
- 07-harness.md (12 tests Track A unblocked; Track B
gated on $OLLAMA_API_KEY)
- 08-polish.md (22 tests engine-side, blocked on 6+7)
Plus README.md indexing them.
Each exec file follows the slice 1 impl-plan template: scope,
what ships today, decisions locked, sub-slice ordering with
test names, critical files to read, acceptance check,
cross-references.
TaskCreate entries #26-29 mirror the four tasks.
Co-Authored-By: Claude <noreply@anthropic.com>
@@ -1,5 +1,14 @@
|
||||
# Slice 5 — Neo4j 5 GraphBackend Adapter
|
||||
|
||||
**Status:** ✅ shipped 2026-06-18. The Neo4j storage substrate per
|
||||
ADR 0011. 8 sub-slices (5.1–5.8): GraphBackend Protocol +
|
||||
InMemoryGraph rename + write-tool chokepoint + Neo4jGraph skeleton
|
||||
w/ reified `:Relation` + full read-tool parity + full-codex
|
||||
round-trip + `--write-neo4j` dual-write flag + `LORE_GRAPH_BACKEND`
|
||||
env wiring + docker-compose neo4j svc. 559 → 632 tests
|
||||
(+73). ADR 0011 written. Suite green with `docker compose
|
||||
--profile neo4j up`.
|
||||
|
||||
## Context
|
||||
|
||||
The Lore Engine POC's in-memory `Graph` (`lore_engine_poc/tools.py`) was the
|
||||
|
||||
@@ -3,49 +3,58 @@
|
||||
The Lore Engine on Cognee, sliced into independently shippable units.
|
||||
Each slice has its own file with acceptance criteria and a test plan.
|
||||
|
||||
| # | Slice | Goal | Status | Effort |
|
||||
The execution roadmap (what's done, what's next, what's blocked)
|
||||
lives in `docs/plan/exec/` and the TaskCreate list. Slice plan
|
||||
files are the *design*; exec files are the *roadmap*.
|
||||
|
||||
| # | Slice | Goal | Status | Shipped |
|
||||
|---|---|---|---|---|
|
||||
| 0 | [POC](00-slice-0-poc.md) | Validate the substrate; one tool end-to-end | ✅ done | 1 day |
|
||||
| 1 | [Structured YAML](01-slice-structured-yaml.md) | Real `valid_from`/`valid_until` on edges | 📋 planned | 3-5 days |
|
||||
| 2 | [Consistency engine](02-slice-consistency.md) | 4-category rule system | 📋 planned | 5-7 days |
|
||||
| 3 | [LLM extraction](03-slice-llm-extraction.md) | Cognee cognify actually runs | 📋 planned | 3-5 days |
|
||||
| 4 | [Remaining 44 tools](04-slice-tools.md) | Full 45-tool MCP surface | 📋 planned | 5-7 days |
|
||||
| 5 | [TypeTemplate](05-slice-typetemplate.md) | Polymorphic extension model | 📋 planned | 5-7 days |
|
||||
| 6 | [Plane model](06-slice-planes.md) | Setting + Plane graph nodes | 📋 planned | 2-3 days |
|
||||
| 7 | [Reasoning harness](07-slice-harness.md) | 50-question validation gate | 📋 planned | 3-5 days |
|
||||
| 0 | [POC](00-slice-0-poc.md) | Validate the substrate; one tool end-to-end | ✅ done | 2026-06-17 |
|
||||
| 1 | [Structured YAML](01-slice-structured-yaml.md) | Real `valid_from`/`valid_until` on edges | ✅ shipped | 2026-06-18 |
|
||||
| 2 | [Consistency engine](02-slice-consistency.md) | 4-category rule system | ✅ shipped | 2026-06-18 |
|
||||
| 3 | [LLM extraction](03-slice-llm-extraction.md) | Cognee cognify actually runs | ✅ shipped | 2026-06-18 |
|
||||
| 4 | [Remaining tools](04-slice-tools.md) | Read + write tool surface | ✅ shipped | 2026-06-18 |
|
||||
| 5a | [Neo4j backend](05-slice-neo4j-backend.md) | GraphBackend Protocol + Neo4j adapter | ✅ shipped | 2026-06-18 |
|
||||
| 5b | [TypeTemplate](05-slice-typetemplate.md) | Polymorphic extension model | ✅ shipped | 2026-06-19 |
|
||||
| 2.6.1 | [MCP surface expansion](2.6.1-slice-mcp-surface-expansion.md) | read_tools over the MCP wire | ✅ shipped | 2026-06-18 |
|
||||
| 10 | [Write tools](10-slice-write-tools-deferred.md) | 9 deferred write_tools | ✅ shipped | 2026-06-18 |
|
||||
| 11 | [MCP HTTP + Docker](11-slice-mcp-http-docker.md) | Streamable HTTP transport | ✅ shipped | 2026-06-18 |
|
||||
| 6 | [Plane model](06-slice-planes.md) | Setting + Plane graph nodes | 📋 planned | — |
|
||||
| 7 | [Reasoning harness](07-slice-harness.md) | 50-question validation gate | 📋 planned (blocked: `$OLLAMA_API_KEY`) | — |
|
||||
| 8 | [Polish](08-slice-polish.md) | UI, export, enforcement | 📋 open-ended | — |
|
||||
|
||||
**Cumulative:** MVP at end of slice 2 (~10 days), full v1 at end
|
||||
of slice 4 (~21 days), v1 + extensions at end of slice 7
|
||||
(~33 days).
|
||||
**Total shipped so far:** 712 tests green (lore-engine-poc).
|
||||
**Remaining:** 3 planned slices (6, 7, 8) + follow-up polish items
|
||||
deferred from shipped slices (write tools for templates,
|
||||
collapse-to-one-tool when count > 50, services/template-watcher/
|
||||
and services/template-registry/ Go services per ADR 0012).
|
||||
|
||||
## Dependency graph
|
||||
## Dependency graph (post-shipment)
|
||||
|
||||
```
|
||||
0 (POC) ──┬──> 1 (YAML) ──┐
|
||||
│ ├──> 2 (Consistency) ──┐
|
||||
└──> 3 (LLM) ───┘ │
|
||||
├──> 4 (Tools) ──┐
|
||||
│ │
|
||||
│ ┌────────────┘
|
||||
│ │
|
||||
▼ ▼
|
||||
5 (TypeTemplate)
|
||||
│
|
||||
▼
|
||||
6 (Planes)
|
||||
│
|
||||
▼
|
||||
7 (Harness)
|
||||
│
|
||||
▼
|
||||
8 (Polish)
|
||||
shipped: 0 -> 1 -> 2 -> 4 -> 5a (Neo4j) ............ 11 (MCP HTTP)
|
||||
\-> 3 / 2.6.1
|
||||
\-> 5b (TypeTemplate) -> 10 (write tools)
|
||||
remaining: 6 (Planes) -> 7 (Harness) -> 8 (Polish)
|
||||
```
|
||||
|
||||
Slices 1 and 3 can run in parallel after slice 0. Slice 2
|
||||
needs both 1 and 3 (it operates on the typed graph and the
|
||||
prose-extracted graph). Slices 4-7 each depend on the prior
|
||||
slice. Slice 8 is unbounded.
|
||||
The shipped slices form the substrate (parsers, graph storage,
|
||||
MCP wire, Docker). The remaining slices layer on top:
|
||||
|
||||
- **Slice 6 (Planes)** is foundational — adds the multi-setting
|
||||
first-class nodes that downstream codex tooling will use.
|
||||
Blocks 7 only insofar as the harness tests need a stable
|
||||
setting_id contract.
|
||||
- **Slice 7 (Harness)** is the validation gate — it measures
|
||||
whether the LLM can answer correctly. Currently **blocked on
|
||||
`$OLLAMA_API_KEY`** (slice 7 was always tied to a live LLM
|
||||
provider per `docs/07-reasoning-harness.md`).
|
||||
- **Slice 8 (Polish)** is open-ended — fill in based on what
|
||||
the world-builder actually needs after slices 6 + 7 land.
|
||||
|
||||
Slice 6 has no LLM dependency, so slices 6 and 7 can run in
parallel. Slice 8 is downstream of both.
|
||||
|
||||
## What each slice proves
|
||||
|
||||
@@ -56,7 +65,8 @@ slice. Slice 8 is unbounded.
|
||||
| 2 | Engine flags its first real contradiction |
|
||||
| 3 | Prose path is fuzzy but useful for color/character voice |
|
||||
| 4 | LLM can answer most question types in a single tool call |
|
||||
| 5 | New domain types are a YAML exercise, not a code change |
|
||||
| 5a | Production storage works (Neo4j round-trips) |
|
||||
| 5b | New domain types are a YAML exercise, not a code change |
|
||||
| 6 | Multi-setting worlds are first-class |
|
||||
| 7 | The LLM, with the harness, answers correctly ≥80% of the time |
|
||||
| 8 | The engine is a usable product, not just a working engine |
|
||||
|
||||
161
docs/plan/exec/06-planes.md
Normal file
@@ -0,0 +1,161 @@
|
||||
# Slice 6 — Implementation Roadmap
|
||||
|
||||
**Owner:** this loop (Claude).
|
||||
**Scope:** `docs/plan/06-slice-planes.md` (the AC table is the
|
||||
contract — 11 ACs, 6.1 through 6.11). Implementation lives in
|
||||
`~/projects/lore-engine-poc/`.
|
||||
**TDD rule:** every new behaviour ships with a failing test first;
|
||||
test names follow `test_<AC>_<description>`.
|
||||
|
||||
## What ships today (the substrate)
|
||||
|
||||
The slice 1–5 work already shipped Setting as a string field
|
||||
(`setting_id`) on every entity, but **`Setting` is not yet a
|
||||
first-class graph node**, and **`Plane` does not exist at all**.
|
||||
The 5.x substrate (per ADR 0011) is in place: GraphBackend
|
||||
Protocol, InMemoryGraph + Neo4jGraph parity, the 36 core labels,
|
||||
and the polymorphic Layer 2 (`:DomainEntity`, `:Relation`,
|
||||
`:TypeTemplate`) from slice 5T.
|
||||
|
||||
**Already done (do not redo):**
|
||||
|
||||
- `setting_id` is on every entity (Person, Faction, Location,
|
||||
Region, etc.). Per ADR 0004 (region ≠ plane), a v1.2 graph
|
||||
distinguishes the two via the new `Plane` label.
|
||||
- The Mardonari codex's Voldramir entry currently lives as a
|
||||
`Region` node with `plane: true` in frontmatter; slice 6
|
||||
promotes it to a `Plane` node.
|
||||
- Migration from `world_id` to `setting_id` already happened in
|
||||
v1.2 doc rename. The slice 6 migration is the *graph node*
|
||||
migration, not the doc rename.
|
||||
|
||||
## Decisions locked before coding starts
|
||||
|
||||
| # | Decision | Source | Implication |
|
||||
|---|---|---|---|
|
||||
| D1 | `EXISTS_IN` is a reified `:Relation` node, not a native edge | ADR 0009 | Time-bounded (planar travel) — the in-memory `Edge` from slice 0 stays the substrate; the new `EXISTS_IN` relation carries `valid_from`/`valid_until`. |
|
||||
| D2 | `:Setting` and `:Plane` are top-level labels (Layer 1 core) | docs/01-ontology.md, ADR 0011 | They're added to `NODE_LABELS` in `lore_engine_poc/ontology.py`, same as `:DomainEntity` was in 5T.1. |
|
||||
| D3 | Default plane for an entity with no `plane:` is the Material Plane of the default Setting | AC 6.10 | Migration must auto-create a Material Plane when none is declared. |
|
||||
| D4 | Region vs Plane split via frontmatter | AC 6.8 | Entries with `plane: true` OR under `Campaign Codex - Planes/` are `:Plane`; everything else stays `:Region`. |
|
||||
| D5 | Migration is idempotent | AC 6.9 | `MERGE` on `(Setting {id})`, `(Plane {id})`, and `(:Relation {type:'EXISTS_IN'})` keyed by `(entity_id, setting_id)`. |
|
||||
|
||||
## Sub-slice ordering and parallelisation
|
||||
|
||||
The 6 sub-slices below respect the dependency order
|
||||
`schema → nodes → edges → backfill → tools → migration`.
|
||||
|
||||
### 6.1 — Setting + Plane schema + ontology (AC 6.1, 6.2)
|
||||
|
||||
- Add `:Setting` and `:Plane` to `NODE_LABELS`
|
||||
- Add `:EXISTS_IN`, `:REFLECTS`, `:LAYER_OF`, `:ADJACENT_TO`,
|
||||
`:ACCESSIBLE_VIA` to `ALLOWED_LABELS` (already may be present
|
||||
in write_tools.py:48 — verify before adding)
|
||||
- `lore_engine_poc/setting.py` (new) — dataclasses `Setting(id,
|
||||
kind, current_era, schema_version, created_at)` and `Plane(id,
|
||||
setting_id, name, kind)`
|
||||
- Test: `test_6_1_setting_and_plane_in_node_labels`
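A minimal sketch of the two dataclasses listed above. The field names come from the bullet; the field types, the frozen setting, and the sample values are assumptions made for illustration, not the shipped `lore_engine_poc/setting.py`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Setting:
    """A first-class :Setting node (field names from 6.1; types assumed)."""
    id: str
    kind: str
    current_era: str
    schema_version: str
    created_at: datetime


@dataclass(frozen=True)
class Plane:
    """A first-class :Plane node that belongs to exactly one Setting."""
    id: str
    setting_id: str
    name: str
    kind: str


# Hypothetical values, for illustration only.
mardonari = Setting(
    id="mardonari",
    kind="campaign",
    current_era="third-age",
    schema_version="1.2",
    created_at=datetime(2026, 6, 18, tzinfo=timezone.utc),
)
material = Plane(
    id="mardonari:material-plane",
    setting_id=mardonari.id,
    name="Material Plane",
    kind="material",
)
```

Frozen dataclasses compare by value and are hashable, which keeps them usable as keys in in-memory indexes.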
|
||||
|
||||
### 6.2 — GraphBackend methods (Layer 1 extensions)
|
||||
|
||||
- Extend `GraphBackend` Protocol with: `add_setting`,
|
||||
`add_plane`, `add_exists_in`, `find_setting(id)`,
|
||||
`planes_in_setting(setting_id)`, `entity_planes(entity_id)`
|
||||
- InMemoryGraph: storage dicts + endpoint indexes
|
||||
- Neo4jGraph: mirror via `CREATE CONSTRAINT` + `session.run()`
|
||||
- 4-5 tests: protocol conformance + parity
|
||||
- Test count: +5
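The protocol-conformance test could look roughly like the sketch below. The method names are the ones listed in 6.2; their signatures, the `SettingBackend` and `InMemorySettings` class names, and the use of plain string ids are assumptions, not the real `GraphBackend`.

```python
from typing import Dict, List, Optional, Protocol, Set, runtime_checkable


@runtime_checkable
class SettingBackend(Protocol):
    """Hypothetical subset of GraphBackend: only the methods named in 6.2."""

    def add_setting(self, setting_id: str) -> None: ...
    def add_plane(self, plane_id: str, setting_id: str) -> None: ...
    def add_exists_in(self, entity_id: str, plane_id: str) -> None: ...
    def find_setting(self, setting_id: str) -> Optional[str]: ...
    def planes_in_setting(self, setting_id: str) -> List[str]: ...
    def entity_planes(self, entity_id: str) -> List[str]: ...


class InMemorySettings:
    """Dict-backed stand-in for the InMemoryGraph side of a parity test."""

    def __init__(self) -> None:
        self._settings: Set[str] = set()
        self._plane_to_setting: Dict[str, str] = {}
        self._planes_by_setting: Dict[str, List[str]] = {}  # endpoint index
        self._planes_by_entity: Dict[str, List[str]] = {}   # endpoint index

    def add_setting(self, setting_id: str) -> None:
        self._settings.add(setting_id)
        self._planes_by_setting.setdefault(setting_id, [])

    def add_plane(self, plane_id: str, setting_id: str) -> None:
        if setting_id not in self._settings:
            raise KeyError(f"unknown setting: {setting_id}")
        if plane_id not in self._plane_to_setting:  # re-adding is a no-op
            self._plane_to_setting[plane_id] = setting_id
            self._planes_by_setting[setting_id].append(plane_id)

    def add_exists_in(self, entity_id: str, plane_id: str) -> None:
        if plane_id not in self._plane_to_setting:
            raise KeyError(f"unknown plane: {plane_id}")
        planes = self._planes_by_entity.setdefault(entity_id, [])
        if plane_id not in planes:
            planes.append(plane_id)

    def find_setting(self, setting_id: str) -> Optional[str]:
        return setting_id if setting_id in self._settings else None

    def planes_in_setting(self, setting_id: str) -> List[str]:
        return list(self._planes_by_setting.get(setting_id, []))

    def entity_planes(self, entity_id: str) -> List[str]:
        return list(self._planes_by_entity.get(entity_id, []))
```

`isinstance(backend, SettingBackend)` only checks that the methods exist, so a parity test would still need to run the same calls against both backends and compare results.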
|
||||
|
||||
### 6.3 — Plane-relation edge types (AC 6.2)
|
||||
|
||||
- `REFLECTS`, `LAYER_OF`, `ADJACENT_TO`, `ACCESSIBLE_VIA` as
|
||||
edge-label tests (typed edges, not reified — these are not
|
||||
time-bounded)
|
||||
- Test: `test_6_3_plane_relation_round_trip` for each
|
||||
|
||||
### 6.4 — Backfill of EXISTS_IN (AC 6.3, 6.10)
|
||||
|
||||
- Migration helper `migrate_setting_id_to_exists_in()` walks
|
||||
every entity with `setting_id`, ensures a `:Setting` node
|
||||
exists, ensures a default `Material Plane` exists in that
|
||||
setting, and creates one reified `:Relation {type: 'EXISTS_IN'}` node
|
||||
- Idempotent: re-running produces the same graph
|
||||
- Test: `test_6_4_backfill_idempotent` + `test_6_4_default_material_plane`
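The idempotence requirement can be sketched with plain dicts standing in for the graph. Only the function name comes from the plan; the signature, the container shapes, and the `"<setting>:material-plane"` id scheme are assumptions. Keying the relation by `(entity_id, setting_id)` follows decision D5.

```python
from typing import Dict, Set, Tuple


def migrate_setting_id_to_exists_in(
    entities: Dict[str, str],               # entity_id -> setting_id
    settings: Set[str],                     # ids of :Setting nodes
    planes: Dict[str, str],                 # plane_id -> setting_id
    exists_in: Dict[Tuple[str, str], str],  # (entity_id, setting_id) -> plane_id
) -> None:
    """Backfill EXISTS_IN from the setting_id string field, MERGE-style."""
    for entity_id, setting_id in entities.items():
        # Ensure the :Setting node exists (adding to a set is a no-op on repeat).
        settings.add(setting_id)
        # Ensure the default Material Plane exists for that setting.
        default_plane = f"{setting_id}:material-plane"  # hypothetical id scheme
        planes.setdefault(default_plane, setting_id)
        # One relation per (entity, setting); setdefault never overwrites.
        exists_in.setdefault((entity_id, setting_id), default_plane)
```

Because every write is a `set.add` or `dict.setdefault`, a second run leaves all three containers unchanged, which is the property `test_6_4_backfill_idempotent` would assert.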
|
||||
|
||||
### 6.5 — Setting filter on read tools (AC 6.5, 6.6)
|
||||
|
||||
- Add `setting` parameter to `entities_present`, `was_true_at`,
|
||||
`true_during`, `events_during`, `lookup`, `entity_context`
|
||||
- Filter resolves via `EXISTS_IN` traversal
|
||||
- Test: `test_6_5_setting_filter_on_*` for each tool (6 tests)
|
||||
- Test count: +6
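One way to picture the filter is as a post-filter over a tool's result set. This is a sketch under assumptions: `filter_by_setting`, its arguments, and the dict-based indexes are hypothetical, and the real tools may push the filter into the backend query instead.

```python
from typing import Dict, Iterable, List, Optional


def filter_by_setting(
    entity_ids: Iterable[str],
    entity_planes: Dict[str, List[str]],  # entity_id -> plane ids (EXISTS_IN targets)
    plane_setting: Dict[str, str],        # plane_id -> setting_id
    setting: Optional[str] = None,
) -> List[str]:
    """Keep entities that exist in at least one plane of the given setting."""
    if setting is None:
        # No filter requested: behave exactly like the pre-slice-6 tool.
        return list(entity_ids)
    return [
        entity_id
        for entity_id in entity_ids
        if any(
            plane_setting.get(plane_id) == setting
            for plane_id in entity_planes.get(entity_id, [])
        )
    ]
```

Defaulting `setting` to `None` keeps existing callers working unchanged, which matters when six tools gain the parameter at once.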
|
||||
|
||||
### 6.6 — Region ↔ Plane migration (AC 6.8, 6.11)
|
||||
|
||||
- `scripts/05_migrate_planes.py` reads the codex frontmatter
|
||||
- Entries with `plane: true` or under `Campaign Codex - Planes/`
|
||||
become `:Plane` nodes (deleting their `:Region` form if any)
|
||||
- All `[[Underdark]]` body-text references in Voldramir's
|
||||
markdown become `LAYER_OF` edges
|
||||
- `--dry-run` mode prints the list of changes without applying
|
||||
- Test: `test_6_6_migration_distinguishes_region_from_plane` +
|
||||
`test_6_6_voldramir_becomes_plane`
|
||||
- Test count: +2
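The Region/Plane split from decision D4 reduces to a small classifier. The sketch below assumes frontmatter has already been parsed into a dict and that codex paths use forward slashes; `node_label` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Any, Dict

PLANES_DIR = "Campaign Codex - Planes"


def node_label(path: str, frontmatter: Dict[str, Any]) -> str:
    """Return "Plane" or "Region" for one codex entry, per decision D4."""
    if frontmatter.get("plane") is True:
        return "Plane"
    # Only directories count, so a file that merely shares the name is ignored.
    if PLANES_DIR in PurePosixPath(path).parent.parts:
        return "Plane"
    return "Region"
```

A `--dry-run` mode could call the same function and print the label changes without touching the graph.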
|
||||
|
||||
### 6.7 — docs cleanup (AC 6.7)
|
||||
|
||||
- `grep -r "world_id" docs/` and remove every reference outside
|
||||
the migration section
|
||||
- Update `docs/01-ontology.md`, `docs/11-extensibility.md`,
|
||||
`docs/14-examples.md` to use `setting_id` and reference the
|
||||
new `Plane` model
|
||||
|
||||
### Final — end-to-end demo + ADR
|
||||
|
||||
- ADR 0013 — the v1.2 plane-model migration story
|
||||
- Killer demo: seed two settings (Mardonari + The Wild Dream),
|
||||
cross-setting query returns only the requested setting's events
|
||||
- Total test count: ~712 + 22 = 734
|
||||
|
||||
## Critical files to read before implementing
|
||||
|
||||
- `docs/01-ontology.md` — current 36 labels + the v1.2 addendum
|
||||
- `docs/17-planes.md` — the full plane-model design
|
||||
- `docs/04-consistency.md` — consistency rules will need
|
||||
`setting_id` on violation nodes (per slice 6 risks)
|
||||
- `lore_engine_poc/ontology.py` — NODE_LABELS, ALLOWED_LABELS
|
||||
- `lore_engine_poc/graph_backend.py` — Protocol + InMemoryGraph
|
||||
- `lore_engine_poc/neo4j_graph.py` — Neo4j substrate
|
||||
- `lore_engine_poc/tools.py` — read tools that need setting
|
||||
filter
|
||||
- `lore_engine_poc/templates/schema.py` — templates will need
|
||||
Plane awareness (deferred to slice 5T.6 if user asks)
|
||||
|
||||
## Out of scope (deferred)
|
||||
|
||||
- Plane model in the UI (slice 8).
|
||||
- Cross-setting consistency rules (deferred; would re-open
|
||||
slice 2's consistency engine).
|
||||
- Templates get a `plane_id` field (slice 5T.6 — separate
|
||||
request).
|
||||
|
||||
## Acceptance check
|
||||
|
||||
`python3 -m pytest tests/ -q` → 734 passed. Cross-setting
|
||||
query `entities_present(setting=mardonari, at_time=...)` returns
|
||||
only Mardonari's events. Voldramir is a `:Plane` node in the
|
||||
v1.2 graph; the Underdark is `LAYER_OF` it.
|
||||
|
||||
## Effort estimate
|
||||
|
||||
22 tests, 6 sub-slices, ~3-5 days.
|
||||
|
||||
## Cross-references
|
||||
|
||||
- `docs/plan/06-slice-planes.md` — the design plan
|
||||
- `docs/17-planes.md` — plane-model design
|
||||
- `docs/09-roadmap.md#v12-migration` — migration plan
|
||||
- `docs/10-critique.md#S3.2` — cross-world queries
|
||||
- ADR 0004 — region ≠ plane (rationale)
|
||||
- ADR 0009 — reified `:Relation` (used for `:EXISTS_IN`)
|
||||
- ADR 0011 — GraphBackend Protocol
|
||||
- ADR 0012 — slice 5T (precedent for adding to NODE_LABELS)
|
||||
195
docs/plan/exec/07-harness.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Slice 7 — Implementation Roadmap
|
||||
|
||||
**Owner:** this loop (Claude) **once `$OLLAMA_API_KEY` is
|
||||
available**.
|
||||
**Scope:** `docs/plan/07-slice-harness.md` (the AC table is the
|
||||
contract — 8 ACs, 7.1 through 7.8). Implementation lives in
|
||||
`~/projects/lore-engine-poc/`.
|
||||
**TDD rule:** the 50-question harness has golden answers;
|
||||
the system prompt is iterated until ≥80% accuracy.
|
||||
|
||||
## Blocker
|
||||
|
||||
This slice is **blocked on `$OLLAMA_API_KEY`** (slice 7 was
|
||||
always tied to a live LLM provider per `docs/07-reasoning-harness.md`).
|
||||
The primary model is **Minimax-M3** (per ADR 0005), reached via
|
||||
the OpenAI-compatible endpoint. The harness cannot run end-to-end
|
||||
without API access.
|
||||
|
||||
The slice can proceed in two parallel tracks:
|
||||
|
||||
- **Track A — no LLM required** — author the 50 questions,
|
||||
the system prompt, the failure-mode log, the red-team suite,
|
||||
the per-tool test cases. All of these are pure-Python artifacts
|
||||
and ship without an API key.
|
||||
- **Track B — needs API key** — execute the harness against
|
||||
the live LLM, measure accuracy, iterate on the system prompt.
|
||||
|
||||
Track A can ship now. Track B is gated.
|
||||
|
||||
## What ships today (the substrate)
|
||||
|
||||
- The MCP server exposes 36 core tools + 14 template-generated
|
||||
tools = 50 total (per slice 5T.5 + slice 10 + slice 11).
|
||||
- The system prompt lives in
|
||||
`lore_engine_poc/prompts/system_prompt.md` (per slice 3 AC 3.2).
|
||||
- The 5 question types are documented in
|
||||
`docs/07-reasoning-harness.md` (5 × 10 = 50 questions).
|
||||
|
||||
**Already done (do not redo):**
|
||||
|
||||
- `scripts/03_demo.py` is the demo loop (per slice 0).
|
||||
- `lore_engine_poc/llm.py` has the LiteLLM adapter + the
|
||||
FakeProvider for tests (per slice 3).
|
||||
- The Ollama Cloud `minimax-m3:cloud` model is wired through
|
||||
per ADR 0005.
|
||||
|
||||
## Decisions locked before coding starts
|
||||
|
||||
| # | Decision | Source | Implication |
|
||||
|---|---|---|---|
|
||||
| D1 | Primary model: **Minimax-M3**, served via Ollama Cloud | ADR 0005 | The harness scripts default to `minimax-m3:cloud`; switchable via env. |
|
||||
| D2 | Thinking mode: `adaptive` | 07-reasoning-harness.md §measurement | M3 supports adaptive; no other model is calibrated. |
|
||||
| D3 | The 50 questions are **versioned** (tests/harness/questions.json has a `version` field) | per critique | Old results stay comparable when the prompt iterates. |
|
||||
| D4 | Tool-selection accuracy is measured per question type, not aggregate | per critique | A 70% aggregate can hide a 50% on time-window questions. |
|
||||
| D5 | Failure-mode log is **mandatory**: every wrong answer is logged with a hypothesis | 07-reasoning-harness.md §failure-mode | The log is the input to system-prompt iteration, not a post-hoc artefact. |
|
||||
|
||||
## Sub-slice ordering
|
||||
|
||||
The 4 sub-slices below respect the dependency order
|
||||
`questions → prompt → runner → iteration`.
|
||||
|
||||
### 7.1 — Author the 50-question test set (Track A, no API key needed)
|
||||
|
||||
- 5 question types × 10 questions = 50 total
|
||||
- Each question: `id, type, query, expected_tools (sequence),
|
||||
expected_answer_shape, expected_citations`
|
||||
- Build script `scripts/harness/build_questions.py` generates
|
||||
the JSON from a YAML source (`tests/harness/questions.yaml`)
|
||||
- Tests: `test_7_1_questions_match_schema`,
|
||||
`test_7_1_50_questions_total`,
|
||||
`test_7_1_10_per_type`,
|
||||
`test_7_1_every_question_has_expected_tools`
|
||||
- Test count: +4
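The four tests above amount to one schema check plus two count checks. A rough sketch, assuming the questions are loaded as a list of dicts; `validate_questions` and the error wording are hypothetical, while the required field names are the ones listed in this sub-slice.

```python
from collections import Counter
from typing import Any, Dict, List

REQUIRED_FIELDS = {
    "id", "type", "query",
    "expected_tools", "expected_answer_shape", "expected_citations",
}


def validate_questions(
    questions: List[Dict[str, Any]], n_types: int = 5, per_type: int = 10
) -> List[str]:
    """Return a list of problems; an empty list means the set is well-formed."""
    errors: List[str] = []
    for q in questions:
        missing = REQUIRED_FIELDS - set(q)
        if missing:
            errors.append(f"{q.get('id', '?')}: missing {sorted(missing)}")
        elif not q["expected_tools"]:
            errors.append(f"{q['id']}: expected_tools is empty")

    counts = Counter(q.get("type") for q in questions)
    if len(counts) != n_types:
        errors.append(f"expected {n_types} question types, found {len(counts)}")
    for qtype, n in sorted(counts.items(), key=lambda kv: str(kv[0])):
        if n != per_type:
            errors.append(f"type {qtype!r}: expected {per_type} questions, found {n}")
    return errors
```

Returning every problem at once, rather than raising on the first, makes a failing build of `questions.json` easier to fix in one pass.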
|
||||
|
||||
### 7.2 — System prompt + version registry (Track A)
|
||||
|
||||
- `lore_engine_poc/prompts/system_prompt.md` — the canonical
|
||||
prompt (the 5 question types, citation rule, time-window
|
||||
rule, contradiction rule)
|
||||
- Versioned in `prompts/registry.json`; the harness reads
|
||||
`prompts/system_prompt.v{N}.md`
|
||||
- The slice 3 prompt-mirror is updated to include the v1.2
|
||||
TypeTemplate tools (per the slice 5T.5 follow-up note in
|
||||
recent memory)
|
||||
- Tests: `test_7_2_prompt_has_five_question_types`,
|
||||
`test_7_2_prompt_citation_rule_present`,
|
||||
`test_7_2_prompt_time_window_rule_present`,
|
||||
`test_7_2_prompt_mentions_template_tools`
|
||||
- Test count: +4
|
||||
|
||||
### 7.3 — Harness runner (Track A; Track B for execution)
|
||||
|
||||
- `scripts/harness/run_questions.py` — runs the 50 questions
|
||||
against the MCP server + the LLM, measures:
|
||||
- Tool-selection accuracy per question type
|
||||
- Citation rate (every claim cites ≥1 source)
|
||||
- Hallucination rate (no fact without a source)
|
||||
- Time-window violations (no claim outside `valid_from`/`valid_until`)
|
||||
- Output: `tests/harness/results/run-{NNN}.json`
|
||||
- Tests: `test_7_3_runner_parses_results` (offline test with
|
||||
FakeProvider fixture), `test_7_3_runner_aggregates_metrics`
|
||||
- Test count: +2
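Decision D4 asks for accuracy per question type, not only in aggregate. A minimal sketch of that aggregation, assuming each result record carries `type`, `expected_tools`, and `actual_tools` keys (the record shape and `accuracy_by_type` are hypothetical) and that a question counts as correct only on an exact tool-sequence match:

```python
from collections import defaultdict
from typing import Any, Dict, List


def accuracy_by_type(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Tool-selection accuracy for each question type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["type"]] += 1
        # Exact sequence match; a looser rule (e.g. set equality) is possible.
        if r["actual_tools"] == r["expected_tools"]:
            hits[r["type"]] += 1
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}
```

Reporting the per-type dictionary next to the aggregate is what exposes a weak question type hiding behind a passable overall score.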
|
||||
|
||||
### 7.4 — Red-team suite (Track A; Track B for execution)
|
||||
|
||||
- `scripts/harness/run_redteam.py` — 20 adversarial questions
|
||||
per the plan doc (time-window trap, ambiguous name,
|
||||
contradiction trap, hallucination trap, citation bypass)
|
||||
- Output: `tests/harness/redteam/run-{NNN}.json`
|
||||
- Tests: `test_7_4_redteam_20_questions`,
|
||||
`test_7_4_redteam_failure_modes_logged`
|
||||
- Test count: +2
|
||||
|
||||
### Final — execute and iterate (Track B)
|
||||
|
||||
Once `$OLLAMA_API_KEY` is available:
|
||||
|
||||
```bash
export LORE_LLM_PROVIDER=ollama
export LORE_LLM_MODEL=minimax-m3:cloud
# Fail fast if the key is missing instead of exporting an empty value.
export OLLAMA_API_KEY="${OLLAMA_API_KEY:?OLLAMA_API_KEY must be set}"
python3 scripts/harness/run_questions.py \
  --questions tests/harness/questions.json \
  --out tests/harness/results/run-001.json
```
|
||||
|
||||
Pass criterion: tool-selection accuracy ≥80% on the 50 questions
|
||||
(per AC 7.3), citation rate ≥90% (AC 7.4), hallucination rate <5%
|
||||
(AC 7.5), time-window violations <5% (AC 7.6).
|
||||
|
||||
If accuracy is below 80% with the full tool surface: collapse per critique
|
||||
S2.4 (`state_at` → `entity_context(comprehensive=true)`,
|
||||
`summarize_chain` → `narrate_arc(style=bullets)`, drop tools
|
||||
used <2% of the time). This becomes a separate "tool collapse"
|
||||
sub-slice before re-running.
|
||||
|
||||
### Final — ADR + results writeup
|
||||
|
||||
- ADR 0014 — the validation gate: which models, what accuracy,
|
||||
what failure modes, what prompt iteration loop
|
||||
- Results: `docs/harness/run-001.md` — the baseline numbers,
|
||||
the failure-mode log, the prompt iteration history
|
||||
|
||||
## Critical files to read before implementing
|
||||
|
||||
- `docs/07-reasoning-harness.md` — the full system prompt and
|
||||
the 5 question types
|
||||
- `docs/05-mcp-tools.md` — the 45-tool surface (now 50 with
|
||||
slice 5T.5)
|
||||
- `docs/10-critique.md#S3.3` — LLM misbehavior
|
||||
- `lore_engine_poc/prompts/` — existing prompt registry
|
||||
- `lore_engine_poc/llm.py` — LiteLLM adapter + FakeProvider
|
||||
- `scripts/03_demo.py` — the demo loop (harness piggybacks
|
||||
on this)
|
||||
|
||||
## Risks
|
||||
|
||||
1. **S2.4 — tool-selection accuracy.** 45 tools is past the
|
||||
empirical ceiling. If the harness shows poor selection,
|
||||
collapse the long tail (see Final above).
|
||||
2. **S3.3 — LLM misbehavior.** The system prompt is *instruction*,
|
||||
not *constraint*. Mitigation: an enforcement layer in the
|
||||
MCP server that rejects tool calls inconsistent with the
|
||||
latest `:ConsistencyRun` (per slice 8 polish).
|
||||
3. **Test set overfitting.** If the 50 questions are tuned to
|
||||
M3 and only scored by M3, the numbers lie. Mitigate by
|
||||
running a subset against `gpt-4o` and `claude-sonnet-4-6`
|
||||
as a sanity check — large divergence between vendors is a
|
||||
red flag.
|
||||
4. **Cost.** M3 at $0.30 / $1.20 per 1M tokens makes the
|
||||
50×3 harness + red-team ~$5–10 total. Not a budget item.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Production enforcement (slice 8).
|
||||
- UI for failure-mode review (slice 8).
|
||||
- Cross-LLM benchmarks (deferred — pick a target LLM first).
|
||||
|
||||
## Acceptance check
|
||||
|
||||
`python3 -m pytest tests/ -q` → 712 + 12 = 724 passed (Track A
|
||||
only). Track B requires `python3 scripts/harness/run_questions.py`
|
||||
to print ≥80% tool-selection accuracy.
|
||||
|
||||
## Effort estimate
|
||||
|
||||
12 tests (Track A) + Track B execution. 3-5 days.
|
||||
|
||||
## Cross-references
|
||||
|
||||
- `docs/plan/07-slice-harness.md` — the design plan
|
||||
- `docs/07-reasoning-harness.md` — the system prompt
|
||||
- `docs/05-mcp-tools.md` — the tool catalog
|
||||
- ADR 0005 — Minimax-M3 primary LLM
|
||||
- ADR 0012 — TypeTemplate (harness must test template tools too)
|
||||
195
docs/plan/exec/08-polish.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Slice 8 — Implementation Roadmap
|
||||
|
||||
**Owner:** this loop (Claude) **after slice 6 + slice 7 ship**.
|
||||
**Scope:** `docs/plan/08-slice-polish.md` (the AC table is the
|
||||
contract — 9 ACs, 8.1 through 8.9). Implementation lives in
|
||||
`~/projects/lore-engine-poc/` (engine side) and a separate UI
|
||||
repo (frontend side — TBD).
|
||||
**TDD rule:** every UI feature ships with a Playwright/Selenium
|
||||
test; every export ships with a render-and-verify test.
|
||||
|
||||
## What ships today (the substrate)
|
||||
|
||||
The slice 1–7 work has produced a working engine: structured
|
||||
ingest, consistency rules, LLM extraction, 50 MCP tools,
|
||||
Neo4j backend, TypeTemplate polymorphism, MCP HTTP transport,
|
||||
Docker. Slice 8 makes it a *usable product*.
|
||||
|
||||
**Already done (do not redo):**
|
||||
|
||||
- The engine is end-to-end functional (slice 11's
|
||||
docker-compose stack runs the full pipeline).
|
||||
- `scripts/03_demo.py` is the demo loop.
|
||||
- The consistency engine emits violation nodes (slice 2).
|
||||
- The graph has `setting_id` on every entity (slice 1,
|
||||
promoted to first-class `:Setting` in slice 6).
|
||||
|
||||
## Decisions locked before coding starts
|
||||
|
||||
| # | Decision | Source | Implication |
|
||||
|---|---|---|---|
|
||||
| D1 | Slice 8 is **open-ended** by design | 08-slice-polish.md | The polish list is filled in based on what the world-builder actually needs; this exec file is the *first* cut, not the last. |
|
||||
| D2 | The UI is a **separate repo** | per critique S3.4 | The engine stays headless; the UI talks to the MCP HTTP server (slice 11). |
|
||||
| D3 | The enforcement layer is **per-call, not session-wide** | per critique S3.3 | The MCP server checks every tool call's response against the latest `:ConsistencyRun`; the LLM cannot bypass it. |
|
||||
| D4 | Export format priority: **HTML first**, then Markdown, then PDF | 08-slice-polish.md AC 8.7 | HTML is the most useful for review; PDF is the highest-friction (needs pagination). |
|
||||
|
||||
## Sub-slice ordering
|
||||
|
||||
The polish list is filled in iteratively. This exec file
|
||||
captures the **first cut** — the work that ships before the
|
||||
slice 8 AC table is closed.
|
||||
|
||||
### 8.1 — Export to a single HTML file (AC 8.7)
|
||||
|
||||
- `scripts/06_export.py` — walks the graph, emits a single
|
||||
HTML with internal `[[wiki links]]`, time-bounded facts
|
||||
showing their window, contradictions flagged inline,
|
||||
citations linked to `LoreSource` nodes
|
||||
- Pagination + search + TOC for 10K+ entity worlds
|
||||
(per risk: "Export completeness")
|
||||
- Tests: `test_8_1_export_renders_html`,
|
||||
`test_8_1_internal_links_resolve`,
|
||||
`test_8_1_time_bounded_facts_show_window`,
|
||||
`test_8_1_contradictions_flagged_inline`,
|
||||
`test_8_1_citations_linked`,
|
||||
`test_8_1_pagination_for_10k_entities`
|
||||
- Test count: +6
|
||||
|
||||
### 8.2 — Enforcement layer in MCP server (AC 8.8)
|
||||
|
||||
- Per-call check: every tool response is cross-referenced
|
||||
against the latest `:ConsistencyRun`; if a response
|
||||
contains a claim that the consistency engine has flagged,
|
||||
the server returns a JSON-RPC `Invalid params` error (code -32602)
|
||||
- Tool-call trace: every LLM tool call logged with
|
||||
arguments, response, latency, source citations
|
||||
- Tests: `test_8_2_enforcement_rejects_unsourced_claim`,
|
||||
`test_8_2_enforcement_rejects_contradiction_claim`,
|
||||
`test_8_2_tool_call_trace_logged`
|
||||
- Test count: +3
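A sketch of the per-call check, assuming each tool response can be reduced to a set of claim ids and the latest `:ConsistencyRun` to a set of flagged claim ids. Both reductions, and the `enforce` function, are hypothetical. The error envelope follows JSON-RPC 2.0, where -32602 is the predefined "Invalid params" code.

```python
from typing import Any, Dict, Iterable, Optional

INVALID_PARAMS = -32602  # JSON-RPC 2.0 predefined error code


def enforce(
    request_id: Any,
    response_claim_ids: Iterable[str],
    flagged_claim_ids: Iterable[str],
) -> Optional[Dict[str, Any]]:
    """Return a JSON-RPC error object if the response repeats a flagged claim.

    Returns None when the response is clean and may be sent as-is.
    """
    blocked = sorted(set(response_claim_ids) & set(flagged_claim_ids))
    if not blocked:
        return None
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {
            "code": INVALID_PARAMS,
            "message": "Invalid params",
            # "data" is free-form in JSON-RPC; the key name here is assumed.
            "data": {"flagged_claims": blocked},
        },
    }
```

Listing the offending claims in `data` gives the tool-call trace something concrete to log alongside the rejection.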
|
||||
|
||||
### 8.3 — Tool-call trace UI (AC 8.9)
|
||||
|
||||
- The trace is persisted to `tests/harness/traces/{run_id}.jsonl`
|
||||
- UI: a sortable table view (latency, error, citation count)
|
||||
- Tests: `test_8_3_trace_jsonl_format`,
|
||||
`test_8_3_trace_ui_sortable_by_latency`
|
||||
- Test count: +2
|
||||
|
||||
### 8.4 — Graph snapshot + restore (AC 8.4, 8.5)
|
||||
|
||||
- `scripts/07_snapshot.py` — exports the full graph to
|
||||
`snapshots/{version_id}.json` (Neo4j: `apoc.export.json.all`;
|
||||
InMemoryGraph: pickle)
|
||||
- `scripts/08_restore.py` — restores a snapshot
|
||||
- `scripts/09_diff.py` — diffs two snapshots, lists
|
||||
added/removed/changed nodes and edges
|
||||
- Tests: `test_8_4_snapshot_round_trip`,
|
||||
`test_8_4_diff_lists_changes`,
|
||||
`test_8_4_restore_is_idempotent`
|
||||
- Test count: +3
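The diff step can be sketched over two snapshots loaded as `id -> properties` mappings. That snapshot shape and the `diff_snapshots` name are assumptions; the sketch covers nodes only, and edges would follow the same pattern with an edge key.

```python
from typing import Any, Dict, List


def diff_snapshots(
    old: Dict[str, Dict[str, Any]], new: Dict[str, Dict[str, Any]]
) -> Dict[str, List[str]]:
    """List node ids that were added, removed, or changed between snapshots."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        # Present in both, but with different properties.
        "changed": sorted(i for i in old_ids & new_ids if old[i] != new[i]),
    }
```

Sorting the ids keeps the output stable across runs, so two diffs of the same snapshots can themselves be compared byte for byte.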
|
||||
|
||||
### 8.5 — Cross-setting query at the tool level (AC 8.6)
|
||||
|
||||
- The setting filter from slice 6 is exposed in the
|
||||
read tools. Add `events_in_setting(setting)`,
|
||||
`list_settings()`, `planes_in_setting(setting)`,
|
||||
`reflections_of(plane)` (for the Voldramir question)
|
||||
- Tests: `test_8_5_cross_setting_query`,
|
||||
`test_8_5_reflections_of_voldramir`
|
||||
- Test count: +2
|
||||
|
||||
### 8.6 — Consistency-queue UI (AC 8.1)
|
||||
|
||||
- Web page that lists `:Contradiction`, `:Anachronism`,
|
||||
`:Orphan`, `:OntologyViolation` nodes
|
||||
- Per-violation actions: acknowledge, dismiss (false positive),
|
||||
drill into source documents side by side
|
||||
- Tests: Playwright (UI repo)
|
||||
- Backend test count: +2 (API contract)
|
||||
- UI test count: tracked in UI repo
|
||||
|
||||
### 8.7 — YAML editor with live schema validation (AC 8.2)
|
||||
|
||||
- VSCode extension OR Monaco-based web editor
|
||||
- Live schema validation with line numbers
|
||||
- Autocomplete from existing entity names (uses
|
||||
`entities_present()` MCP tool)
|
||||
- Tests: backend test for the autocomplete API contract
|
||||
(+1); UI tests in UI repo
|
||||
|
||||
### 8.8 — Import-from-prose (AC 8.3)
|
||||
|
||||
- Reads a markdown chapter
|
||||
- Proposes a YAML diff (the LLM uses the slice 3 extraction
|
||||
prompt, but only proposes — never auto-merges)
|
||||
- World-builder reviews and approves per-entity
|
||||
- All proposed entities marked `proposed: true` until approved
|
||||
- Tests: `test_8_8_import_proposes_diff`,
|
||||
`test_8_8_proposed_entities_marked`,
|
||||
`test_8_8_auto_merge_rejected`
|
||||
- Test count: +3
|
||||
|
||||
### Final — close the slice 8 AC table
|
||||
|
||||
After the above sub-slices ship, the slice 8 AC table is
|
||||
revisited. New ACs are added for features the world-builder
|
||||
asks for after using the engine day-to-day.
|
||||
|
||||
## Critical files to read before implementing

- `docs/09-roadmap.md#phase-7-polish` — the original polish
  list
- `docs/10-critique.md#S3.4` — YAML authoring UX
- `docs/10-critique.md#S3.3` — LLM enforcement
- `docs/10-critique.md#S4.3` — versioning
- `lore_engine_poc/mcp_http.py` — the MCP HTTP server
  (slice 11) — the enforcement layer lives here
- `lore_engine_poc/consistency_runner.py` — slice 2's
  consistency engine (the enforcement layer reads its
  output)
- `scripts/` — the entry-point scripts

## Risks

1. **UI work is unbounded.** Each UI feature could be its own
   project. Ship the smallest usable version of each, then
   iterate.
2. **YAML editor schema sync.** When the YAML schema evolves
   (slice 1, slice 5T), the editor must follow. Ship the editor
   *after* the schema is stable.
3. **Import-from-prose hallucination.** The LLM that proposes
   the diff can invent facts. Mitigation: every proposed entity
   and edge must be marked `proposed: true` and shown to the
   world-builder for explicit approval. Never auto-merge.
4. **Export completeness.** A 10K-entity world is too large to
   be useful as a single HTML file. The export needs pagination,
   search, and a TOC. Don't ship export without these.

## Out of scope

- Multi-user collaboration (real-time editing, presence).
- Authentication / authorization beyond the v1 single-user model.
- Cloud hosting. The engine is local-first; cloud is a separate
  project.
- Mobile UI. The polish slice is desktop-first.

## Acceptance check

`python3 -m pytest tests/ -q` → 724 + 22 = 746 passed
(slice 6 + slice 7 Track A + slice 8 first cut). UI tests pass
in the UI repo.

## Effort estimate

22 tests (engine side) + ~3-4 weeks of UI work. The slice
is open-ended; the world-builder's actual day-to-day needs
drive the priority order.

## Cross-references

- `docs/plan/08-slice-polish.md` — the design plan
- `docs/09-roadmap.md#phase-7-polish` — the original list
- `docs/10-critique.md#S3.4` — YAML authoring UX
- `docs/10-critique.md#S3.3` — LLM enforcement
- `docs/10-critique.md#S4.3` — versioning
43
docs/plan/exec/README.md
Normal file
@@ -0,0 +1,43 @@
# Slice Execution Roadmaps

The slice plan files in the parent directory (`../0X-slice-*.md`)
are *design* documents: what each slice achieves, what its AC
table looks like, what its risks are. They're the contract.

The exec files in this directory (`./0X-*.md`) are *roadmaps*
for the next loop to pick up. Each one says:

- **What ships today** — which slices are already done so the
  loop doesn't redo them.
- **Decisions locked** — the non-negotiable decisions from ADRs
  and prior slice plans.
- **Sub-slice ordering** — the sub-slices to ship, in dependency
  order, with concrete test names (`test_<AC>_<description>`).
- **Critical files to read** — what to load first.
- **Acceptance check** — the green condition.
- **Cross-references** — the design plan, the ADRs, related
  docs.

## When the loop picks one up

1. Read the exec file end-to-end.
2. Read the design plan (`../0X-slice-*.md`) for the AC table.
3. Read the ADRs and other cross-references.
4. Walk the sub-slice ordering, TDD-first.
5. Commit each sub-slice with a `slice <N>.<M>: <title>` message.
6. Push to `git.homelab.local` when the slice is complete.
7. Update `docs/plan/README.md` to mark the slice shipped.

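The two naming conventions used by the roadmaps (`test_<AC>_<description>` for tests and `slice <N>.<M>: <title>` for commits) are easy to check mechanically. The sketch below is a hypothetical helper, not part of the repo, and it assumes purely numeric slice and sub-slice numbers, so identifiers such as `5a` or `2.6.1` would need a looser pattern.

```python
import re

# test_<AC>_<description>, e.g. test_8_8_import_proposes_diff (AC 8.8)
TEST_NAME = re.compile(r"^test_(\d+)_(\d+)_([a-z0-9_]+)$")

# slice <N>.<M>: <title>, e.g. "slice 8.8: import-from-prose"
COMMIT_SUBJECT = re.compile(r"^slice (\d+)\.(\d+): (\S.*)$")


def parse_test_name(name):
    """Return (AC id, description) for a conforming test name, else None."""
    match = TEST_NAME.match(name)
    if match is None:
        return None
    return f"{match[1]}.{match[2]}", match[3]


def is_valid_commit_subject(subject):
    """True when the commit subject follows the sub-slice convention."""
    return COMMIT_SUBJECT.match(subject) is not None
```

A check like this could run in a pre-commit hook or in CI to catch sub-slices committed under a free-form message.
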
## Files

| Slice | Exec | Status | Blocker |
|---|---|---|---|
| 6 (Planes) | [06-planes.md](06-planes.md) | 📋 planned | none |
| 7 (Harness) | [07-harness.md](07-harness.md) | 📋 planned (Track A unblocked) | `$OLLAMA_API_KEY` for Track B |
| 8 (Polish) | [08-polish.md](08-polish.md) | 📋 open-ended | slices 6 + 7 first |

Track A of slice 7 (the 50-question test set + system prompt
authoring + harness runner scripts) is **not blocked** on the
API key: it ships artifacts that can be tested offline with
the FakeProvider. Track B (executing the harness against the
live LLM and iterating on the system prompt) is gated on
`$OLLAMA_API_KEY`.