docs(plan): switch runtime examples + cost/test-overfit paragraphs to minimax-m3

Harness now boots with OPENAI_BASE_URL pointed at minimax.io,
LLM_MODEL=openai/minimax-m3, thinking-mode=adaptive.
Cost risk downgraded: 50x3 + red-team is ~$5-10 at M3 pricing.
Test-overfitting mitigation: subset cross-checked on gpt-4o
and claude-sonnet-4-6 as a sanity check, not a parallel target.

Co-Authored-By: Claude <noreply@anthropic.com>
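
The ~$5-10 figure in the message can be sanity-checked with per-token arithmetic. Below is a minimal sketch, assuming the $0.30 input / $1.20 output per 1M token prices quoted in the plan diff. The per-question token counts (60,000 in, 8,000 out) and the names `Pricing` and `harness_cost` are illustrative assumptions, not measurements or code from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD prices per 1M tokens."""
    input_per_mtok: float
    output_per_mtok: float


# Prices as quoted in the plan; verify against the vendor's current price list.
M3 = Pricing(input_per_mtok=0.30, output_per_mtok=1.20)


def harness_cost(questions: int, iterations: int,
                 in_tokens: int, out_tokens: int,
                 pricing: Pricing = M3) -> float:
    """Estimated USD cost of questions x iterations calls.

    in_tokens / out_tokens are the ASSUMED average tokens per call.
    """
    calls = questions * iterations
    per_call = (in_tokens * pricing.input_per_mtok
                + out_tokens * pricing.output_per_mtok) / 1_000_000
    return calls * per_call


if __name__ == "__main__":
    # Hypothetical averages for an agentic, tool-calling question.
    print(f"${harness_cost(50, 3, 60_000, 8_000):.2f}")
```

Under these assumed token counts the 50x3 run lands around $4, which leaves headroom for the red-team pass inside the quoted range. Swapping in measured token averages from a real run would tighten the estimate.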
2026-06-17 19:31:05 -04:00
parent 552ad29fcd
commit 3f6cdaf17d
2 changed files with 16 additions and 8 deletions
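
The cross-vendor sanity check described in the commit message could be scored along these lines. This is a sketch only: it assumes each run can be reduced to a mapping from question id to pass/fail, which is a hypothetical shape and not the harness's actual results format. The 0.2 threshold and the name `divergence` are likewise placeholders.

```python
from typing import Dict, List, Tuple


def divergence(primary: Dict[str, bool],
               other: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Compare pass/fail verdicts on the questions both runs share.

    Returns (disagreement rate, sorted ids of disagreeing questions).
    """
    shared = primary.keys() & other.keys()
    if not shared:
        raise ValueError("runs share no question ids")
    disagreements = sorted(q for q in shared if primary[q] != other[q])
    return len(disagreements) / len(shared), disagreements


if __name__ == "__main__":
    # Hypothetical verdicts: full run on M3, subset on another vendor.
    m3 = {"q1": True, "q2": True, "q3": False, "q4": True}
    gpt4o = {"q1": True, "q2": False, "q3": False}
    rate, ids = divergence(m3, gpt4o)
    THRESHOLD = 0.2  # placeholder; pick a value that suits the test set
    if rate > THRESHOLD:
        print(f"red flag: {rate:.0%} disagreement on {ids}")
```

Only the shared subset is compared, which matches the "sanity check, not a parallel target" framing: the second vendor never needs to run all 50 questions.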

@@ -72,8 +72,10 @@ Each test:
### Integration
```bash
-export ANTHROPIC_API_KEY=sk-ant-...
-export LLM_MODEL=anthropic/claude-sonnet-4-6
+export LLM_PROVIDER=openai
+export LLM_MODEL=openai/minimax-m3
+export OPENAI_BASE_URL=https://api.minimax.io/v1
+export OPENAI_API_KEY=$MINIMAX_API_KEY
python3 scripts/01_ingest.py # full run with cognify
python3 scripts/02_demo.py --query "MEMBER_OF,Elysia Petalbrooke,Petalbrooke Enclave,..."

@@ -62,10 +62,13 @@ python3 scripts/harness/build_questions.py \
# expected_answer_shape, expected_citations
# 2. Run the harness against the live LLM
-export LLM_PROVIDER=anthropic
-export LLM_MODEL=claude-sonnet-4-6
+export LLM_PROVIDER=openai
+export LLM_MODEL=openai/minimax-m3
+export OPENAI_BASE_URL=https://api.minimax.io/v1
+export OPENAI_API_KEY=$MINIMAX_API_KEY
python3 scripts/harness/run_questions.py \
--questions tests/harness/questions.json \
+--thinking-mode adaptive \
--out tests/harness/results/run-001.json
# Tool selection, answer shape, citation rate, hallucination rate
# all measured per-question and aggregated.
@@ -147,10 +150,13 @@ ADVERSARIAL_QUESTIONS = [
MCP server that rejects tool calls inconsistent with the latest
`:ConsistencyRun`.
3. **Test set overfitting.** If the 50 questions are tuned to
-the same LLM that scores them, the numbers lie. Mitigate by
-running against 2-3 different LLMs and comparing.
-4. **Cost.** Running 50 questions × 3 iterations × 3 LLMs is
-non-trivial. Use Haiku-tier models for the bulk of the harness.
+M3 and only scored by M3, the numbers lie. Mitigate by
+running a subset against `gpt-4o` and `claude-sonnet-4-6`
+as a sanity check; large divergence between vendors is a
+red flag.
+4. **Cost.** M3 at $0.30 input / $1.20 output per 1M tokens
+makes the 50×3 harness + red-team ~$5-10 total. Not a
+budget item.
## Out of scope