docs(plan,adr): primary LLM is Minimax-M3 (per ADR 0005)

Minimax-M3 (released June 2026) is a 428B-param MoE with 23B activated, a 1M context window, and an OpenAI-compatible API at api.minimax.io. Cognee routes to it via LiteLLM with model id openai/minimax-m3.

Slice 3 (LLM extraction) and slice 7 (harness) updated to reference M3 specifically:

- LiteLLM routing via OPENAI_BASE_URL
- M3's 1M context means the 45-tool catalog + system prompt fit in one context
- Harness uses thinking mode 'adaptive'
- Cost risk downgraded (M3 is cheap enough that the 50x3 harness is ~$5-10, not a budget item)
- Cross-vendor sanity check (gpt-4o, claude-sonnet-4-6) becomes a test-set-overfitting mitigation, not a parallel target

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-17 19:28:41 -04:00
parent b8dcc13585
commit 552ad29fcd
3 changed files with 66 additions and 4 deletions
--- a/docs/adr/0005-primary-llm-minimax-m3.md
+++ b/docs/adr/0005-primary-llm-minimax-m3.md
@@ -0,0 +1,48 @@
+# Primary LLM is Minimax-M3
+
+**Status:** accepted.
+
+The Lore Engine's primary reasoning model is **Minimax-M3**
+(released June 2026, OpenAI-compatible API with base URL
+`https://api.minimax.io/v1`, 1M context window, 128K output
+limit, 428B-parameter MoE with 23B activated).
+Cognee talks to it through LiteLLM with the model id
+`openai/minimax-m3` and `OPENAI_BASE_URL` pointed at that
+base URL.
+
+Why M3 and not the obvious alternatives:
+
+- **1M context.** The 45-tool catalog, the reasoning harness
+  system prompt, and the 50-question test set all fit in a
+  single context. No need for prompt compression or selective
+  tool loading.
+- **Thinking mode.** M3 has a toggleable "thinking" mode
+  (`enabled | adaptive | disabled`). Slice 7's harness uses
+  `adaptive` — let the model decide when to think more deeply
+  (e.g. on the adversarial red-team questions) and when to
+  answer directly (e.g. on the time-window tool lookups).
+- **SWE-Bench Pro 59%.** A strong score on an agentic coding
+  benchmark, which we treat as a reasonable proxy for
+  tool-selection accuracy on a structured 45-tool surface.
+- **Cost.** $0.30 / $1.20 per 1M input / output tokens is
+  cheap enough to run the full harness (50 questions × 3
+  iterations, red-team set included) on M3 alone, without
+  a separate Haiku-tier bulk model to save budget.
+
+What we deliberately *don't* promise:
+
+- **Cross-vendor parity.** Slice 7 measures selection
+  accuracy on M3 only. Running the harness against `gpt-4o`
+  or `claude-sonnet-4-6` is a separate exercise — useful for
+  the test-set-overfitting mitigation but not in scope.
+- **Local-model support.** M3 is too large to run locally at
+  acceptable latency. A future local-model tier would need a
+  different harness and a different tool budget.
+- **Older-model compatibility.** Anthropic Claude 3.x, GPT-3.5,
+  Llama 2 — out of scope.
+
+The "45-tool ceiling" critique (S2.4) is re-tested with M3 in
+slice 7. The empirical ceiling may have shifted upward; if M3
+selects well from all 45 tools without collapsing, slice 4
+ships the full surface as designed. If M3 starts confusing
+tools, collapse per the existing plan.
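Reviewer note on the cost bullet in ADR 0005: the arithmetic behind "cheap enough" can be sketched as below. This is a rough estimate, not a measurement. The two prices come from the ADR and are read here as input / output prices; the per-run token counts and the function name `harness_cost_usd` are hypothetical placeholders to be replaced with measured values.

```python
# Prices from ADR 0005, read here as USD per 1M input / output tokens.
INPUT_USD_PER_M = 0.30
OUTPUT_USD_PER_M = 1.20


def harness_cost_usd(
    questions: int,
    iterations: int,
    input_tokens_per_run: int,
    output_tokens_per_run: int,
) -> float:
    """Estimate the cost of questions x iterations harness runs."""
    runs = questions * iterations
    input_cost = runs * input_tokens_per_run / 1_000_000 * INPUT_USD_PER_M
    output_cost = runs * output_tokens_per_run / 1_000_000 * OUTPUT_USD_PER_M
    return input_cost + output_cost


# ASSUMPTION: 100K input tokens per run (tool catalog + system prompt +
# question) and 2K output tokens per run. Neither figure is in the ADR.
estimate = harness_cost_usd(
    questions=50,
    iterations=3,
    input_tokens_per_run=100_000,
    output_tokens_per_run=2_000,
)
print(f"Estimated cost of the 50x3 harness: ${estimate:.2f}")
```

Under those assumed token counts the estimate is about $4.86, near the low end of the ~$5-10 range quoted in the commit message. The red-team set is not counted here and would add to the total.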
--- a/docs/plan/03-slice-llm-extraction.md
+++ b/docs/plan/03-slice-llm-extraction.md
@@ -16,19 +16,29 @@ Wire up an LLM-backed extraction pipeline that:

 ## What's in the slice

-1. LLM provider configuration (Anthropic, OpenAI, or local Ollama
-   via LiteLLM — Cognee's existing path).
+1. LLM provider configuration via LiteLLM. The primary model is
+   **Minimax-M3** (per ADR 0005), reached via the
+   OpenAI-compatible endpoint at `https://api.minimax.io/v1`.
+   The Cognee config uses `LLM_MODEL=openai/minimax-m3` and
+   `OPENAI_BASE_URL=https://api.minimax.io/v1`. Older
+   `claude-*` and `gpt-4o` configs remain supported via the
+   same LiteLLM routing but are not the primary target.
 2. Custom extraction prompt that emits the 36 typed labels from
   `docs/01-ontology.md`.
 3. Custom relation extraction prompt that emits the ~70 typed edge
   types.
 4. Entity resolution: pre-computed embeddings of entity names,
   top-K by similarity to the chunk being extracted (addresses
-   critique S1.3).
+   critique S1.3). With M3's 1M context window the prompt can
+   carry every canonical entity name up to ~10K names; beyond
+   that, embeddings + top-K are still required.
 5. `lore_engine_extraction_prompt.txt` — registered with Cognee
   as the default extraction prompt for this dataset.
 6. Cost gate: extraction is opt-in per chunk; bulk extraction
-   runs offline, not in user-facing tool calls.
+   runs offline, not in user-facing tool calls. M3's $0.30 /
+   $1.20 per 1M tokens makes extraction much cheaper than with
+   earlier models, but the gate stays because extraction is
+   still the dominant cost driver at scale.

 ## Acceptance criteria

--- a/docs/plan/07-slice-harness.md
+++ b/docs/plan/07-slice-harness.md
@@ -29,6 +29,10 @@ hallucinate? **This is what tells us the design actually works.**
   ambiguous names, contradiction traps, "ignore the system prompt"
   attacks).
 5. Tool-selection accuracy measurement across the 45-tool surface.
+   Calibrated for **Minimax-M3** (per ADR 0005) with thinking
+   mode `adaptive`. M3's 1M context means the entire 45-tool
+   catalog + system prompt + test question can fit in a single
+   context — no tool-loading tricks needed.
 6. Failure-mode log: every wrong answer is recorded with the
   question, the actual answer, the expected answer, and a
   one-line hypothesis for the failure.
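Reviewer note on items 5 and 6 of slice 7: one possible shape for the accuracy measurement and the failure-mode log, assuming each trial records the expected and the selected tool. The class names, field names, and example tool names are all hypothetical; the plan does not define a schema.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    question: str
    expected_tool: str
    selected_tool: str


@dataclass
class FailureEntry:
    question: str
    actual: str
    expected: str
    hypothesis: str  # one-line guess at why the model chose wrongly


def selection_accuracy(trials: list[Trial]) -> float:
    """Fraction of trials where the selected tool matches the expected one."""
    if not trials:
        return 0.0
    correct = sum(t.selected_tool == t.expected_tool for t in trials)
    return correct / len(trials)


def failure_log(trials: list[Trial], hypotheses: dict[str, str]) -> list[FailureEntry]:
    """One entry per wrong selection; hypotheses are keyed by question text."""
    return [
        FailureEntry(
            question=t.question,
            actual=t.selected_tool,
            expected=t.expected_tool,
            hypothesis=hypotheses.get(t.question, "unreviewed"),
        )
        for t in trials
        if t.selected_tool != t.expected_tool
    ]


# Hypothetical trials; the tool names are placeholders, not the real catalog.
trials = [
    Trial("Who ruled in year 300?", "time_window_lookup", "time_window_lookup"),
    Trial("Which Aldric is meant?", "resolve_entity", "search_entities"),
]
accuracy = selection_accuracy(trials)
failures = failure_log(trials, {"Which Aldric is meant?": "ambiguous name"})
```

Keying hypotheses by question text keeps the sketch short; a real log would more likely use a stable question id.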