docs(plan): switch runtime examples + cost/test-overfit paragraphs to minimax-m3

Harness now boots with OPENAI_BASE_URL pointed at minimax.io, LLM_MODEL=openai/minimax-m3, thinking-mode=adaptive.

Cost risk downgraded: 50x3 + red-team is ~$5-10 at M3 pricing.

Test-overfitting mitigation: subset cross-checked on gpt-4o and claude-sonnet-4-6 as a sanity check, not a parallel target.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-17 19:31:05 -04:00
parent 552ad29fcd
commit 3f6cdaf17d
2 changed files with 16 additions and 8 deletions
--- a/docs/plan/03-slice-llm-extraction.md
+++ b/docs/plan/03-slice-llm-extraction.md
@@ -72,8 +72,10 @@ Each test:
 ### Integration

 ```bash
-export ANTHROPIC_API_KEY=sk-ant-...
-export LLM_MODEL=anthropic/claude-sonnet-4-6
+export LLM_PROVIDER=openai
+export LLM_MODEL=openai/minimax-m3
+export OPENAI_BASE_URL=https://api.minimax.io/v1
+export OPENAI_API_KEY=$MINIMAX_API_KEY

 python3 scripts/01_ingest.py  # full run with cognify
 python3 scripts/02_demo.py --query "MEMBER_OF,Elysia Petalbrooke,Petalbrooke Enclave,..."
--- a/docs/plan/07-slice-harness.md
+++ b/docs/plan/07-slice-harness.md
@@ -62,10 +62,13 @@ python3 scripts/harness/build_questions.py \
 # expected_answer_shape, expected_citations

 # 2. Run the harness against the live LLM
-export LLM_PROVIDER=anthropic
-export LLM_MODEL=claude-sonnet-4-6
+export LLM_PROVIDER=openai
+export LLM_MODEL=openai/minimax-m3
+export OPENAI_BASE_URL=https://api.minimax.io/v1
+export OPENAI_API_KEY=$MINIMAX_API_KEY
 python3 scripts/harness/run_questions.py \
  --questions tests/harness/questions.json \
+  --thinking-mode adaptive \
  --out tests/harness/results/run-001.json
 # Tool selection, answer shape, citation rate, hallucination rate
 # all measured per-question and aggregated.
@@ -147,10 +150,13 @@ ADVERSARIAL_QUESTIONS = [
   MCP server that rejects tool calls inconsistent with the latest
   `:ConsistencyRun`.
 3. **Test set overfitting.** If the 50 questions are tuned to
-   the same LLM that scores them, the numbers lie. Mitigate by
-   running against 2-3 different LLMs and comparing.
-4. **Cost.** Running 50 questions × 3 iterations × 3 LLMs is
-   non-trivial. Use Haiku-tier models for the bulk of the harness.
+   M3 and only scored by M3, the numbers lie. Mitigate by
+   running a subset against `gpt-4o` and `claude-sonnet-4-6`
+   as a sanity check — large divergence between vendors is a
+   red flag.
+4. **Cost.** M3 at $0.30 input / $1.20 output per 1M tokens
+   makes the 50×3 harness + red-team ~$5–10 total. Not a
+   budget item.

 ## Out of scope
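
Review note on the cost paragraph (item 4 in 07-slice-harness.md): the ~$5-10 figure can be sanity-checked with a back-of-envelope estimate. The sketch below uses the per-1M-token prices quoted in the diff; the function name `estimate_cost` and the per-call token counts are hypothetical placeholders, not measured values, and should be replaced with numbers from an actual harness run.

```python
# Back-of-envelope cost check for the 50x3 harness.
# Prices are the per-1M-token figures quoted in the plan.
# Token counts per call are ASSUMPTIONS for illustration only.
M3_INPUT_USD_PER_MTOK = 0.30
M3_OUTPUT_USD_PER_MTOK = 1.20


def estimate_cost(
    calls: int,
    input_tokens: int,
    output_tokens: int,
    input_price: float = M3_INPUT_USD_PER_MTOK,
    output_price: float = M3_OUTPUT_USD_PER_MTOK,
) -> float:
    """Return the total USD cost for `calls` calls of the given average size."""
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return calls * per_call


if __name__ == "__main__":
    calls = 50 * 3  # 50 questions x 3 iterations, as in the plan
    # Hypothetical averages: 60k input and 10k output tokens per call.
    print(f"${estimate_cost(calls, 60_000, 10_000):.2f}")
```

Under those assumed token counts the 150 harness calls come to about $4.50, before the red-team pass. If real per-call usage is much lower or higher, the plan's estimate should be adjusted accordingly.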