The 18 design docs were the engine spec. These 12 new files are the operational layer that goes with it:

Design (docs/):
- 18-eval-policy.md        thresholds + cadence (lifted from slice 7)
- 19-retcon-policy.md      declare/preserve/surface retcons; codex syntax stub
- 20-multi-setting-policy  settings.yaml; cross-setting queries
- 21-quickstart.md         1-page 'first 5 minutes'
- 22-cognee-boundary.md    substrate vs domain contract
- cognee-integration.md    recipe for prompt override + LiteLLM routing

Registries (docs/prompts/, docs/models/):
- prompts/README.md        convention; system-prompt.md mirror, extraction-prompt.md stub
- models/README.md         convention; minimax-m3.md primary notes

ADRs (docs/adr/):
- 0006-cognee-version-pin.md  Cognee pinned at 1.1.2; harness is the upgrade gate

Index updates:
- 00-overview.md           full doc set with categories
- 07-reasoning-harness.md  link to prompt mirror
- 09-roadmap.md            link to operational docs

Co-Authored-By: Claude <noreply@anthropic.com>
18 — Eval policy
Status: 📋 planned. Codifies the threshold and cadence story
behind the 50-question test set in docs/plan/07-slice-harness.md.
Goal
The reasoning harness is a measurement. The eval policy is the gate: what the numbers mean, when we re-run, and what action each result triggers.
Thresholds (lifted from slice 7)
| Metric | Ship gate | Re-iterate if |
|---|---|---|
| Tool-selection accuracy | ≥80% | < 80% |
| Citation rate | ≥90% | < 90% |
| Hallucination rate | <5% | ≥5% |
| Time-window violation rate | <5% | ≥5% |
A regression on any single metric holds the change. Regressions on two or more metrics trigger a rollback to the last green commit.
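The table and the hold/rollback rule can be sketched as a small decision function. This is a minimal illustration, not the harness's actual code: the metric keys, the `Gate` class, and the `decide` function are hypothetical names, and the sketch assumes that a metric missing its ship gate counts as a regression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    threshold: float
    higher_is_better: bool

    def passes(self, value: float) -> bool:
        # "At least" gates pass at the threshold itself;
        # "below" gates must be strictly under it.
        if self.higher_is_better:
            return value >= self.threshold
        return value < self.threshold


# Thresholds copied from the table above; the key names are assumptions.
GATES = {
    "tool_selection_accuracy": Gate(0.80, higher_is_better=True),
    "citation_rate": Gate(0.90, higher_is_better=True),
    "hallucination_rate": Gate(0.05, higher_is_better=False),
    "time_window_violation_rate": Gate(0.05, higher_is_better=False),
}


def failing_metrics(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that miss their ship gate."""
    return [name for name, gate in GATES.items() if not gate.passes(metrics[name])]


def decide(metrics: dict[str, float]) -> str:
    """Map a harness run to 'ship', 'hold', or 'rollback'."""
    failed = failing_metrics(metrics)
    if not failed:
        return "ship"
    # One failing metric holds the change; two or more roll back.
    return "hold" if len(failed) == 1 else "rollback"
```

Keeping the thresholds in one table-like mapping means a policy change touches a single place, and the boundary cases (exactly 80%, exactly 5%) follow the inequalities in the table.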
Cadence
- Pre-merge. Every PR that touches `prompts/`, `models/`, the MCP tool layer, or the extraction pipeline runs the full 50-question harness.
- Weekly. Full harness + red-team on `main`, results committed to `tests/harness/results/` with a date-stamped filename.
- Model swap. Full harness, twice, 24h apart. Variance > 5% on any metric blocks the swap.
- Cognee version bump. Full harness + 1000-chunk extraction sanity check (latency + label-conformance rate).
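The model-swap rule compares two full runs taken 24h apart. A rough sketch of that comparison follows, assuming "variance" means the absolute difference between the two runs on each metric, expressed as a fraction (0.05 for 5%). The function names and metric keys are hypothetical.

```python
def unstable_metrics(
    first_run: dict[str, float],
    second_run: dict[str, float],
    max_delta: float = 0.05,
) -> list[str]:
    """Return metrics whose run-to-run difference exceeds max_delta."""
    if first_run.keys() != second_run.keys():
        raise ValueError("both runs must report the same metrics")
    return sorted(
        name
        for name in first_run
        if abs(first_run[name] - second_run[name]) > max_delta
    )


def swap_allowed(first_run: dict[str, float], second_run: dict[str, float]) -> bool:
    """A swap is blocked if any metric moved by more than the allowed delta."""
    return not unstable_metrics(first_run, second_run)
```

This check covers only run-to-run stability; both runs would still need to clear the ship gates on their own.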
Who runs it
- The harness is a script (`scripts/harness/run_questions.py`), not a manual process.
- CI runs the harness on every PR; the result is posted as a PR comment with the four metrics.
- A weekly cron job runs the harness on `main` and writes a markdown summary to `tests/harness/results/weekly-<date>.md`.
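As an illustration of the weekly output, the sketch below builds the date-stamped path and renders the four metrics as a markdown table. The ISO date format, the table layout, and both function names are assumptions; the policy only specifies the `weekly-<date>.md` pattern.

```python
from datetime import date
from pathlib import Path

RESULTS_DIR = Path("tests/harness/results")


def weekly_summary_path(run_date: date) -> Path:
    # ISO 8601 (YYYY-MM-DD) is an assumed date format; it sorts
    # chronologically when the directory is listed by name.
    return RESULTS_DIR / f"weekly-{run_date.isoformat()}.md"


def render_summary(metrics: dict[str, float]) -> str:
    """Render metric fractions (0.93) as a markdown table of percentages."""
    lines = ["| Metric | Value |", "|---|---|"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value:.1%} |")
    return "\n".join(lines) + "\n"


def write_weekly_summary(metrics: dict[str, float], run_date: date) -> Path:
    """Write the summary file, creating the results directory if needed."""
    path = weekly_summary_path(run_date)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_summary(metrics), encoding="utf-8")
    return path
```

The same `render_summary` output could be reused as the body of the PR comment, so the pre-merge and weekly reports stay in one format.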
How the eval set grows
- Every failed production question (caught by the world-builder or flagged in a consistency run) becomes a new harness question.
- The red-team set grows by 5 questions per month, drawn from new failure modes.
- Net target: 50 worked + 20 red-team at v1 ship; 100 worked + 40 red-team by v1.1.
What "pass" means
- Pre-merge: the four metrics are all green on the candidate commit, and no previously green question has regressed.
- Weekly: the four metrics are all green. A regression on `main` blocks the next release tag.
- Model swap: the four metrics are all green, with variance <5% between the two runs.
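The pre-merge definition adds a per-question check on top of the aggregate metrics. A minimal sketch of that check, assuming each run is stored as a mapping from question ID to pass/fail (the ID format and function names are hypothetical):

```python
def regressed_questions(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
) -> list[str]:
    """Questions that were green on the baseline but are not green now.

    A question missing from the candidate run is treated as a failure,
    so silently dropping a question cannot hide a regression.
    """
    return sorted(
        qid
        for qid, was_green in baseline.items()
        if was_green and not candidate.get(qid, False)
    )


def pre_merge_passes(
    metrics_all_green: bool,
    baseline: dict[str, bool],
    candidate: dict[str, bool],
) -> bool:
    """Both conditions must hold: green aggregates and no per-question regressions."""
    return metrics_all_green and not regressed_questions(baseline, candidate)
```

The per-question check matters because aggregate metrics can stay above threshold while one previously passing question starts failing.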
Cross-references
- docs/plan/07-slice-harness.md: the test set
- docs/19-retcon-policy.md: what happens when a retcon invalidates a previously-green question
- docs/22-cognee-boundary.md: what we re-test when Cognee bumps