docs(plan,adr): primary LLM is Minimax-M3 (per ADR 0005)

Minimax-M3 (released June 2026, OpenAI-compatible API at
api.minimax.io, 1M context, 428B-param MoE with 23B activated)
is now the primary LLM. Cognee routes to it via LiteLLM with
model id openai/minimax-m3.

Slice 3 (LLM extraction) and slice 7 (harness) updated to
reference M3 specifically:

- LiteLLM routing via OPENAI_BASE_URL
- M3's 1M context means the 45-tool catalog + system prompt
  fit in one context
- Harness uses thinking mode 'adaptive'
- Cost risk downgraded (M3 is cheap enough that the 50x3
  harness is ~$5-10, not a budget item)
- Cross-vendor sanity check (gpt-4o, claude-sonnet-4-6)
  becomes a test-set-overfitting mitigation, not a parallel
  target

Co-Authored-By: Claude <noreply@anthropic.com>
48
docs/adr/0005-primary-llm-minimax-m3.md
Normal file
@@ -0,0 +1,48 @@
+# Primary LLM is Minimax-M3
+
+**Status:** accepted.
+
+The Lore Engine's primary reasoning model is **Minimax-M3**
+(released June 2026, OpenAI-compatible API at
+`https://api.minimax.io/v1/text/chatcompletion_v2`, 1M context
+window, 128K output, 428B-parameter MoE with 23B activated).
+Cognee talks to it through LiteLLM with the model id
+`openai/minimax-m3` and `OPENAI_BASE_URL` pointed at the
+Minimax endpoint.
+
+Why M3 and not the obvious alternatives:
+
+- **1M context.** The 45-tool catalog, the reasoning harness
+  system prompt, and the 50-question test set all fit in a
+  single context. No need for prompt compression or selective
+  tool loading.
+- **Thinking mode.** M3 has a toggleable "thinking" mode
+  (`enabled | adaptive | disabled`). Slice 7's harness uses
+  `adaptive`, letting the model decide when to think more deeply
+  (e.g. on the adversarial red-team questions) and when to
+  answer directly (e.g. on the time-window tool lookups).
+- **SWE-Bench Pro 59%.** Beats most other models on
+  agentic/coding benchmarks, which is a reasonable proxy for
+  tool-selection accuracy on a structured 45-tool surface.
+- **Cost.** $0.30 / $1.20 per 1M tokens is cheap enough to
+  run the full harness (50 questions × 3 iterations × the
+  red-team set) without a separate budget for a Haiku-tier
+  bulk model.
+
+What we deliberately *don't* promise:
+
+- **Cross-vendor parity.** Slice 7 measures selection
+  accuracy on M3 only. Running the harness against `gpt-4o`
+  or `claude-sonnet-4-6` is a separate exercise: useful for
+  the test-set-overfitting mitigation but not in scope.
+- **Local-model support.** M3 is too large to run locally at
+  acceptable latency. A future local-model tier would need a
+  different harness and a different tool budget.
+- **Older-model compatibility.** Anthropic Claude 3.x, GPT-3.5,
+  and Llama 2 are out of scope.
+
+The "45-tool ceiling" critique (S2.4) is re-tested with M3 in
+slice 7. The empirical ceiling may have shifted upward; if M3
+selects well from all 45 tools without collapsing, slice 4
+ships the full surface as designed. If M3 starts confusing
+tools, collapse per the existing plan.
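The cost claim in ADR 0005 and the "~$5-10" figure in the commit message can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the quoted $0.30 / $1.20 per 1M tokens are input and output prices respectively (the ADR does not say so explicitly), and the per-run token counts are illustrative placeholders rather than measurements.

```python
# Rough harness cost estimate. Prices are the figures quoted in ADR 0005;
# the per-run token counts used below are hypothetical, not measured.
INPUT_PRICE_PER_M = 0.30   # USD per 1M input tokens (assumed meaning of the first figure)
OUTPUT_PRICE_PER_M = 1.20  # USD per 1M output tokens (assumed meaning of the second figure)


def harness_cost(questions: int, iterations: int,
                 input_tokens_per_run: int, output_tokens_per_run: int) -> float:
    """Return the estimated USD cost of running every question `iterations` times."""
    runs = questions * iterations
    input_cost = runs * input_tokens_per_run / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = runs * output_tokens_per_run / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # Assumption: ~100K input tokens per run (tool catalog, system prompt,
    # tool round-trips) and ~5K output tokens per run.
    estimate = harness_cost(50, 3, 100_000, 5_000)
    print(f"estimated harness cost: ${estimate:.2f}")
```

Under those assumed token counts the estimate comes to about $5.40, at the low end of the commit's range; heavier tool round-trips or longer thinking output would raise it.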
@@ -16,19 +16,29 @@ Wire up an LLM-backed extraction pipeline that:
 
 ## What's in the slice
 
-1. LLM provider configuration (Anthropic, OpenAI, or local Ollama
-   via LiteLLM, Cognee's existing path).
+1. LLM provider configuration via LiteLLM. The primary model is
+   **Minimax-M3** (per ADR 0005), reached via the
+   OpenAI-compatible endpoint at `https://api.minimax.io/v1`. The
+   Cognee config uses `LLM_MODEL=openai/minimax-m3` and
+   `OPENAI_BASE_URL=https://api.minimax.io/v1`. Older
+   `claude-*` and `gpt-4o` configs remain supported via the
+   same LiteLLM routing but are not the primary target.
 2. Custom extraction prompt that emits the 36 typed labels from
    `docs/01-ontology.md`.
 3. Custom relation extraction prompt that emits the ~70 typed edge
    types.
 4. Entity resolution: pre-computed embeddings of entity names,
    top-K by similarity to the chunk being extracted (addresses
-   critique S1.3).
+   critique S1.3). M3's 1M context window means the prompt can
+   carry every canonical entity name up to ~10K names; beyond
+   that, embeddings + top-K are still required.
 5. `lore_engine_extraction_prompt.txt`, registered with Cognee
    as the default extraction prompt for this dataset.
 6. Cost gate: extraction is opt-in per chunk; bulk extraction
-   runs offline, not in user-facing tool calls.
+   runs offline, not in user-facing tool calls. M3's $0.30 /
+   $1.20 per 1M tokens makes the cost much lower than with earlier
+   models, but the gate stays because extraction is still the
+   dominant cost driver at scale.
 
 ## Acceptance criteria
 
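Item 4 in the hunk above describes two regimes for entity resolution: put every canonical name in the prompt while the catalog is small, and fall back to embedding top-K beyond roughly 10K names. A minimal sketch of that switch follows. The function names, the `top_k` default, and the use of cosine similarity are assumptions made for illustration; the plan only says "top-K by similarity".

```python
import math
from typing import Dict, List, Sequence

# Assumption: threshold taken from the "~10K" figure in the plan.
MAX_NAMES_IN_PROMPT = 10_000


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_entities(chunk_vec: Sequence[float],
                       entity_vecs: Dict[str, Sequence[float]],
                       top_k: int = 20,
                       max_in_prompt: int = MAX_NAMES_IN_PROMPT) -> List[str]:
    """Pick the entity names to include in the extraction prompt for one chunk."""
    if len(entity_vecs) <= max_in_prompt:
        # Small catalog: the whole name list fits in context.
        return sorted(entity_vecs)
    # Large catalog: keep only the names most similar to the chunk.
    ranked = sorted(entity_vecs,
                    key=lambda name: cosine(chunk_vec, entity_vecs[name]),
                    reverse=True)
    return ranked[:top_k]
```

The embeddings are passed in pre-computed, matching the plan's wording; how they are produced and stored is left out of this sketch.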
@@ -29,6 +29,10 @@ hallucinate? **This is what tells us the design actually works.**
    ambiguous names, contradiction traps, "ignore the system prompt"
    attacks).
 5. Tool-selection accuracy measurement across the 45-tool surface.
+   Calibrated for **Minimax-M3** (per ADR 0005) with thinking
+   mode `adaptive`. M3's 1M context means the entire 45-tool
+   catalog + system prompt + test question can fit in a single
+   context, so no tool-loading tricks are needed.
 6. Failure-mode log: every wrong answer is recorded with the
    question, the actual answer, the expected answer, and a
    one-line hypothesis for the failure.
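Items 5 and 6 in the hunk above can be pictured as one record type feeding two small functions: one for tool-selection accuracy and one for the failure-mode log. This is a sketch under assumptions: the `HarnessResult` type and its field names are hypothetical, chosen to mirror the four fields item 6 lists, and are not taken from the harness itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HarnessResult:
    """One harness run for one question (hypothetical record shape)."""
    question: str
    expected_tool: str
    selected_tool: str
    expected_answer: str
    actual_answer: str
    hypothesis: str = ""  # one-line guess at the failure cause, for wrong answers


def selection_accuracy(results: List[HarnessResult]) -> float:
    """Fraction of runs in which the model picked the expected tool."""
    if not results:
        return 0.0
    hits = sum(r.selected_tool == r.expected_tool for r in results)
    return hits / len(results)


def failure_log(results: List[HarnessResult]) -> List[Dict[str, str]]:
    """Collect the fields item 6 asks for, for every wrong answer."""
    return [
        {
            "question": r.question,
            "actual": r.actual_answer,
            "expected": r.expected_answer,
            "hypothesis": r.hypothesis,
        }
        for r in results
        if r.actual_answer != r.expected_answer
    ]
```

Keeping tool selection and answer correctness as separate fields lets the log distinguish "picked the wrong tool" from "picked the right tool but answered wrongly", which is the kind of one-line hypothesis item 6 calls for.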