docs(plan): runtime examples + cost/test-overfit paragraphs to minimax-m3
Harness now boots with OPENAI_BASE_URL pointed at minimax.io, LLM_MODEL=openai/minimax-m3, and thinking-mode=adaptive.
Cost risk downgraded: 50×3 + red-team is ~$5–10 at M3 pricing.
Test-overfitting mitigation: a subset is cross-checked on gpt-4o and claude-sonnet-4-6 as a sanity check, not a parallel target.

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -72,8 +72,10 @@ Each test:
 ### Integration
 
 ```bash
-export ANTHROPIC_API_KEY=sk-ant-...
-export LLM_MODEL=anthropic/claude-sonnet-4-6
+export LLM_PROVIDER=openai
+export LLM_MODEL=openai/minimax-m3
+export OPENAI_BASE_URL=https://api.minimax.io/v1
+export OPENAI_API_KEY=$MINIMAX_API_KEY
 
 python3 scripts/01_ingest.py # full run with cognify
 python3 scripts/02_demo.py --query "MEMBER_OF,Elysia Petalbrooke,Petalbrooke Enclave,..."
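The exports in this hunk are easy to get half-right when switching providers, so a preflight check may be worth running before the scripts. The following is a minimal sketch, not part of the commit: the helper name `missing_llm_env` is hypothetical, and the assumption that all four variables are required is taken only from the example above.

```python
import os
from typing import List, Mapping, Optional

# Variables exported in the example above; treating all four as
# required is an assumption, not something the commit states.
REQUIRED_VARS = ("LLM_PROVIDER", "LLM_MODEL", "OPENAI_BASE_URL", "OPENAI_API_KEY")


def missing_llm_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_llm_env()` with no arguments inspects the current process environment; an empty list means every variable is set and non-empty.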
@@ -62,10 +62,13 @@ python3 scripts/harness/build_questions.py \
 # expected_answer_shape, expected_citations
 
 # 2. Run the harness against the live LLM
-export LLM_PROVIDER=anthropic
-export LLM_MODEL=claude-sonnet-4-6
+export LLM_PROVIDER=openai
+export LLM_MODEL=openai/minimax-m3
+export OPENAI_BASE_URL=https://api.minimax.io/v1
+export OPENAI_API_KEY=$MINIMAX_API_KEY
 python3 scripts/harness/run_questions.py \
 --questions tests/harness/questions.json \
+--thinking-mode adaptive \
 --out tests/harness/results/run-001.json
 # Tool selection, answer shape, citation rate, hallucination rate
 # all measured per-question and aggregated.
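The closing comment says the four metrics are measured per question and then aggregated. One way that aggregation could look is sketched below. The per-question schema (boolean flags named `tool_ok`, `shape_ok`, `cited`, `hallucinated`) is an assumption made for illustration; the real layout of `run-001.json` is not shown in this diff.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical per-question flags; the real results schema may differ.
METRICS = ("tool_ok", "shape_ok", "cited", "hallucinated")


def aggregate(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Return, for each metric, the fraction of questions where it is true."""
    if not results:
        raise ValueError("no results to aggregate")
    return {
        metric: mean(1.0 if row[metric] else 0.0 for row in results)
        for metric in METRICS
    }
```

Each returned value is a rate between 0 and 1, so `cited` is the citation rate and `hallucinated` is the hallucination rate over the question set.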
@@ -147,10 +150,13 @@ ADVERSARIAL_QUESTIONS = [
 MCP server that rejects tool calls inconsistent with the latest
 `:ConsistencyRun`.
 3. **Test set overfitting.** If the 50 questions are tuned to
-the same LLM that scores them, the numbers lie. Mitigate by
-running against 2-3 different LLMs and comparing.
-4. **Cost.** Running 50 questions × 3 iterations × 3 LLMs is
-non-trivial. Use Haiku-tier models for the bulk of the harness.
+M3 and only scored by M3, the numbers lie. Mitigate by
+running a subset against `gpt-4o` and `claude-sonnet-4-6`
+as a sanity check: large divergence between vendors is a
+red flag.
+4. **Cost.** M3 at $0.30 input / $1.20 output per 1M tokens
+makes the 50×3 harness + red-team ~$5–10 total. Not a
+budget item.
 
 ## Out of scope
 
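Risks 3 and 4 in the hunk above both reduce to small calculations, sketched here under stated assumptions. The prices are the ones quoted in the Cost paragraph. The function names, the per-run token averages in the usage note, and the 0.15 divergence threshold are all hypothetical and do not come from the commit.

```python
from typing import Dict

# Prices quoted in the Cost paragraph (USD per 1M tokens).
INPUT_USD_PER_M = 0.30
OUTPUT_USD_PER_M = 1.20


def harness_cost(questions: int, iterations: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost, given average token counts per question run."""
    per_run = (input_tokens * INPUT_USD_PER_M
               + output_tokens * OUTPUT_USD_PER_M) / 1_000_000
    return questions * iterations * per_run


def divergent_metrics(by_model: Dict[str, Dict[str, float]],
                      threshold: float = 0.15) -> Dict[str, float]:
    """Return metrics whose max-min spread across models exceeds threshold.

    The threshold is a hypothetical default, not a value from the plan.
    """
    if not by_model:
        return {}
    scores = list(by_model.values())
    # Only compare metrics that every model reported.
    shared = set.intersection(*(set(s) for s in scores))
    spreads = {
        name: max(s[name] for s in scores) - min(s[name] for s in scores)
        for name in shared
    }
    return {name: gap for name, gap in spreads.items() if gap > threshold}
```

With hypothetical averages of 60,000 input and 8,000 output tokens per run, `harness_cost(50, 3, 60_000, 8_000)` comes to about $4.14 before the red-team pass. `divergent_metrics` takes one aggregated score dict per model and returns only the metrics on which the vendors disagree by more than the threshold, which is the "red flag" case described in item 3.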