Kaysser Kayyali c8a8dcef2e docs: operational layer — eval policy, retcon policy, multi-setting, quickstart, Cognee boundary, prompt + model registries

The 18 design docs were the engine spec. These 12 new files are
the operational layer that goes with it:

Design (docs/):
  - 18-eval-policy.md        thresholds + cadence (lifted from slice 7)
  - 19-retcon-policy.md      declare/preserve/surface retcons; codex syntax stub
  - 20-multi-setting-policy  settings.yaml; cross-setting queries
  - 21-quickstart.md         1-page 'first 5 minutes'
  - 22-cognee-boundary.md    substrate vs domain contract
  - cognee-integration.md    recipe for prompt override + LiteLLM routing

Registries (docs/prompts/, docs/models/):
  - prompts/README.md        convention; system-prompt.md mirror, extraction-prompt.md stub
  - models/README.md         convention; minimax-m3.md primary notes

ADRs (docs/adr/):
  - 0006-cognee-version-pin.md   Cognee pinned at 1.1.2; harness is the upgrade gate

Index updates:
  - 00-overview.md           full doc set with categories
  - 07-reasoning-harness.md  link to prompt mirror
  - 09-roadmap.md            link to operational docs

Co-Authored-By: Claude <noreply@anthropic.com>

2026-06-17 22:16:07 -04:00


18 — Eval policy

Status: 📋 planned. Codifies the threshold and cadence story behind the 50-question test set in docs/plan/07-slice-harness.md.

Goal

The reasoning harness is a measurement. The eval policy is the gate: what the numbers mean, when we re-run, and what action each result triggers.

Thresholds (lifted from slice 7)

| Metric | Ship gate | Re-iterate if |
| --- | --- | --- |
| Tool-selection accuracy | ≥80% | <80% |
| Citation rate | ≥90% | <90% |
| Hallucination rate | <5% | ≥5% |
| Time-window violation rate | <5% | ≥5% |

A regression on any single metric holds the change. Two or more regressions roll back to the last green.
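
The thresholds and the hold/rollback rule above can be sketched as a small gate function. This is a minimal sketch, not the harness's actual code: the `Metrics` class, its field names, and the action strings are hypothetical, rates are assumed to be fractions in [0, 1], and "regression" is read here as a metric missing its ship gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Hypothetical container for the four harness metrics, as fractions in [0, 1]."""
    tool_selection_accuracy: float
    citation_rate: float
    hallucination_rate: float
    time_window_violation_rate: float


def failing_metrics(m: Metrics) -> list[str]:
    """Return the names of the metrics that miss their ship gate."""
    checks = {
        "tool_selection_accuracy": m.tool_selection_accuracy >= 0.80,
        "citation_rate": m.citation_rate >= 0.90,
        "hallucination_rate": m.hallucination_rate < 0.05,
        "time_window_violation_rate": m.time_window_violation_rate < 0.05,
    }
    return [name for name, ok in checks.items() if not ok]


def gate_action(m: Metrics) -> str:
    """Map the number of failing metrics to an action (assumed reading of the rule)."""
    failed = failing_metrics(m)
    if not failed:
        return "ship"
    if len(failed) == 1:
        return "hold"      # one metric off: hold the change
    return "rollback"      # two or more: roll back to the last green


print(gate_action(Metrics(0.85, 0.95, 0.02, 0.01)))
```

Note that the boundaries are asymmetric: exactly 80% tool-selection accuracy passes, while exactly 5% hallucination rate does not.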

Cadence

Pre-merge. Every PR that touches prompts/, models/, the MCP tool layer, or the extraction pipeline runs the full 50-question harness.
Weekly. Full harness + red-team on main, results committed to tests/harness/results/ with a date-stamped filename.
Model swap. Full harness, twice, 24h apart. Variance > 5% on any metric blocks the swap.
Cognee version bump. Full harness + 1000-chunk extraction sanity check (latency + label-conformance rate).
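
The model-swap rule above compares two full runs. A minimal sketch of that comparison follows, assuming "variance" means the absolute run-to-run difference per metric, measured in percentage points on rates stored as fractions. The function name, metric keys, and dict layout are hypothetical.

```python
METRIC_NAMES = (
    "tool_selection_accuracy",
    "citation_rate",
    "hallucination_rate",
    "time_window_violation_rate",
)


def swap_blockers(first_run: dict[str, float],
                  second_run: dict[str, float],
                  max_spread: float = 0.05) -> dict[str, float]:
    """Return the metrics whose spread between the two runs exceeds max_spread."""
    blockers = {}
    for name in METRIC_NAMES:
        # Round to suppress floating-point noise near the threshold.
        spread = round(abs(first_run[name] - second_run[name]), 6)
        if spread > max_spread:
            blockers[name] = spread
    return blockers


first = {"tool_selection_accuracy": 0.86, "citation_rate": 0.97,
         "hallucination_rate": 0.02, "time_window_violation_rate": 0.01}
second = {"tool_selection_accuracy": 0.82, "citation_rate": 0.90,
          "hallucination_rate": 0.03, "time_window_violation_rate": 0.01}

# A non-empty result would block the swap under this reading of the rule.
print(swap_blockers(first, second))
```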

Who runs it

The harness is a script (scripts/harness/run_questions.py), not a manual process.
CI runs the harness on every PR that meets the pre-merge trigger under Cadence and posts the result as a PR comment with the four metrics.
A weekly cron job runs the harness on main and writes a markdown summary to tests/harness/results/weekly-<date>.md.
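
The weekly output could be produced along these lines. This is a sketch under assumptions: the policy fixes only the `tests/harness/results/weekly-<date>.md` pattern, so the ISO 8601 date format, the function names, and the layout of the markdown summary are all hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath

RESULTS_DIR = PurePosixPath("tests/harness/results")


def weekly_summary_path(run_date: date) -> PurePosixPath:
    """Build the date-stamped filename (ISO date format is an assumption)."""
    return RESULTS_DIR / f"weekly-{run_date.isoformat()}.md"


def render_summary(run_date: date, metrics: dict[str, float]) -> str:
    """Render the metrics as a small markdown table (layout is illustrative)."""
    lines = [
        f"# Weekly harness run {run_date.isoformat()}",
        "",
        "| Metric | Value |",
        "| --- | --- |",
    ]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value:.1%} |")
    return "\n".join(lines) + "\n"


run_day = date(2026, 6, 17)
print(weekly_summary_path(run_day))
print(render_summary(run_day, {"citation_rate": 0.92}))
```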

How the eval set grows

Every failed production question (caught by the world-builder or flagged in a consistency run) becomes a new harness question.
The red-team set grows by 5 questions per month, drawn from new failure modes.
Net target: 50 worked + 20 red-team at v1 ship; 100 worked + 40 red-team by v1.1.
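
As a quick arithmetic check on the red-team numbers above: at 5 new questions per month, growing from 20 to 40 takes four months. The helper below is a hypothetical illustration of that calculation, not a schedule stated by the policy, and it says nothing about the worked set, whose growth depends on how many production questions fail.

```python
import math


def months_to_target(current: int, target: int, per_month: int) -> int:
    """Whole months needed to grow a question set from current to target."""
    if per_month <= 0:
        raise ValueError("per_month must be positive")
    return max(0, math.ceil((target - current) / per_month))


# Red-team set: 20 at v1 ship, 40 by v1.1, growing by 5 per month.
print(months_to_target(current=20, target=40, per_month=5))
```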

What "pass" means

Pre-merge: all four metrics are green on the candidate commit, and no question that was previously green has regressed.
Weekly: the four metrics are all green. A regression on main blocks the next release tag.
Model swap: the four metrics are all green, with variance of at most 5% between the two runs.
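
The pre-merge definition adds a per-question check on top of the aggregate metrics. A minimal sketch of that check follows, assuming each run is stored as a mapping from question ID to pass/fail. The function names and IDs are hypothetical, and treating a question that is missing from the candidate run as regressed is an assumption.

```python
def regressed_questions(baseline: dict[str, bool],
                        candidate: dict[str, bool]) -> list[str]:
    """Return IDs that were green in the baseline but are not green now."""
    return sorted(
        qid for qid, was_green in baseline.items()
        # A question absent from the candidate run counts as regressed (assumption).
        if was_green and not candidate.get(qid, False)
    )


def pre_merge_pass(metrics_green: bool,
                   baseline: dict[str, bool],
                   candidate: dict[str, bool]) -> bool:
    """Pass only if all metrics are green and no green question regressed."""
    return metrics_green and not regressed_questions(baseline, candidate)


baseline = {"q01": True, "q02": False, "q03": True}
candidate = {"q01": True, "q02": True, "q03": False}

print(regressed_questions(baseline, candidate))
print(pre_merge_pass(True, baseline, candidate))
```

In this example the aggregate pass count is unchanged (two of three), yet the merge would still be blocked because `q03` flipped from green to red.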

Cross-references

docs/plan/07-slice-harness.md — the test set
docs/19-retcon-policy.md — what happens when a retcon invalidates a previously-green question
docs/22-cognee-boundary.md — what we re-test when Cognee bumps
