docs(plan,adr): primary LLM is Minimax-M3 (per ADR 0005)

Minimax-M3 (released June 2026, OpenAI-compatible API at
api.minimax.io, 1M context, 428B-param MoE with 23B activated).
Cognee routes to it via LiteLLM with model id openai/minimax-m3.

Slice 3 (LLM extraction) and slice 7 (harness) updated to
reference M3 specifically:
  - LiteLLM routing via OPENAI_BASE_URL
  - M3's 1M context means the 45-tool catalog + system prompt
    fit in one context
  - Harness uses thinking mode 'adaptive'
  - Cost risk downgraded (M3 is cheap enough that the 50x3
    harness is ~$5-10, not a budget item)
  - Cross-vendor sanity check (gpt-4o, claude-sonnet-4-6)
    becomes a test-set-overfitting mitigation, not a parallel
    target

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
2026-06-17 19:28:41 -04:00
parent b8dcc13585
commit 552ad29fcd
3 changed files with 66 additions and 4 deletions

View File

@@ -0,0 +1,48 @@
# Primary LLM is Minimax-M3
**Status:** accepted.
The Lore Engine's primary reasoning model is **Minimax-M3**
(released June 2026, OpenAI-compatible API at
`https://api.minimax.io/v1/text/chatcompletion_v2`, 1M context
window, 128K output, 428B-parameter MoE with 23B activated).
Cognee talks to it through LiteLLM with the model id
`openai/minimax-m3` and `OPENAI_BASE_URL` pointed at the
Minimax endpoint.
Why M3 and not the obvious alternatives:
- **1M context.** The 45-tool catalog, the reasoning harness
system prompt, and the 50-question test set all fit in a
single context. No need for prompt compression or selective
tool loading.
- **Thinking mode.** M3 has a toggleable "thinking" mode
(`enabled | adaptive | disabled`). Slice 7's harness uses
`adaptive` — let the model decide when to think more deeply
(e.g. on the adversarial red-team questions) and when to
answer directly (e.g. on the time-window tool lookups).
- **SWE-Bench Pro 59%.** Beats most other models on
agentic/coding benchmarks, which is a reasonable proxy for
tool-selection accuracy on a structured 45-tool surface.
- **Cost.** $0.30 / $1.20 per 1M tokens is cheap enough to
run the full harness (50 questions × 3 iterations × the
red-team set) without separate budget for a Haiku-tier
bulk model.
What we deliberately *don't* promise:
- **Cross-vendor parity.** Slice 7 measures selection
accuracy on M3 only. Running the harness against `gpt-4o`
or `claude-sonnet-4-6` is a separate exercise — useful for
the test-set-overfitting mitigation but not in scope.
- **Local-model support.** M3 is too large to run locally at
acceptable latency. A future local-model tier would need a
different harness and a different tool budget.
- **Older-model compatibility.** Anthropic Claude 3.x, GPT-3.5,
Llama 2 — out of scope.
The "45-tool ceiling" critique (S2.4) is re-tested with M3 in
slice 7. The empirical ceiling may have shifted upward; if M3
selects well from all 45 tools without collapsing, slice 4
ships the full surface as designed. If M3 starts confusing
tools, collapse per the existing plan.

View File

@@ -16,19 +16,29 @@ Wire up an LLM-backed extraction pipeline that:
## What's in the slice
1. LLM provider configuration (Anthropic, OpenAI, or local Ollama
via LiteLLM — Cognee's existing path).
1. LLM provider configuration via LiteLLM. The primary model is
**Minimax-M3** (per ADR 0005), reached via the OpenAI-
compatible endpoint at `https://api.minimax.io/v1`. The
Cognee config uses `LLM_MODEL=openai/minimax-m3` and
`OPENAI_BASE_URL=https://api.minimax.io/v1`. Older
`claude-*` and `gpt-4o` configs remain supported via the
same LiteLLM routing but are not the primary target.
2. Custom extraction prompt that emits the 36 typed labels from
`docs/01-ontology.md`.
3. Custom relation extraction prompt that emits the ~70 typed edge
types.
4. Entity resolution: pre-computed embeddings of entity names,
top-K by similarity to the chunk being extracted (addresses
critique S1.3).
critique S1.3). M3's 1M context window means the prompt can
carry all canonical entity names up to ~10K; beyond that,
embeddings + top-K is still required.
5. `lore_engine_extraction_prompt.txt` — registered with Cognee
as the default extraction prompt for this dataset.
6. Cost gate: extraction is opt-in per chunk; bulk extraction
runs offline, not in user-facing tool calls.
runs offline, not in user-facing tool calls. M3's $0.30 /
$1.20 per 1M tokens makes the cost much lower than earlier
models, but the gate stays because extraction is still the
dominant cost driver at scale.
## Acceptance criteria

View File

@@ -29,6 +29,10 @@ hallucinate? **This is what tells us the design actually works.**
ambiguous names, contradiction traps, "ignore the system prompt"
attacks).
5. Tool-selection accuracy measurement across the 45-tool surface.
Calibrated for **Minimax-M3** (per ADR 0005) with thinking
mode `adaptive`. M3's 1M context means the entire 45-tool
catalog + system prompt + test question can fit in a single
context — no tool-loading tricks needed.
6. Failure-mode log: every wrong answer is recorded with the
question, the actual answer, the expected answer, and a
one-line hypothesis for the failure.