⚠ ADR-022/023 superseded the model choices below. Current topology: qwen3.5:27b (extraction) + Qwen3.5-9B Opus-Distilled (Annie Voice) + qwen3:32b (contradiction only). See summary note and Phase 1.2 section below for current state.
Context Engine — Extraction Pass (Single LLM Call)
Qwen3.5-35B-A3B A− 8.2 ← ADR-019 primary
One call extracts everything: entities, promises, emotions, and summary are all fields of a single structured JSON response. 15.5s/transcript (7× faster than Qwen3). Background Lane Queue processing. Privacy-critical: raw conversation data never leaves the machine. Thinking OFF required.
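A minimal sketch of what the single-call request could look like against a local OpenAI-compatible endpoint. The schema field names, model id, and prompt are illustrative assumptions, not the real ones:

```python
# Illustrative JSON schema for the single extraction call; the real field
# names and structure are assumptions here.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {"type": "array", "items": {"type": "string"}},
        "promises": {"type": "array", "items": {"type": "string"}},
        "emotions": {"type": "array", "items": {"type": "string"}},
        "summary": {"type": "string"},
    },
    "required": ["entities", "promises", "emotions", "summary"],
}

def build_extraction_request(transcript: str) -> dict:
    """One structured-output request covering every extraction field at once."""
    return {
        "model": "qwen3.5:27b",  # current topology per ADR-022
        "messages": [
            {"role": "system",
             "content": "Extract entities, promises, emotions, and a summary. Reply with JSON only."},
            {"role": "user", "content": transcript},
        ],
        # Constrain decoding so one call yields all fields in a guaranteed schema.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "extraction", "schema": EXTRACTION_SCHEMA},
        },
        # Thinking must stay OFF to hit the 15.5s/transcript target.
        "chat_template_kwargs": {"enable_thinking": False},
    }

req = build_extraction_request("Rajesh promised to send the report by Friday.")
```

Batching every field into one schema is what makes this a single call instead of four: the model fills the whole object in one constrained decode.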
Rules + Qwen3.5-35B-A3B ← ADR-019 routes here
Combined: Qwen3.5 A−(8.0) > Qwen3 A−(7.8) > Sarvam A−(7.5) > Gemini B+(7.2) > Haiku B+(6.5)
5-tier classification: Open → Sensitive → Guarded → Inferred → Forbidden. Most classification is handled by rules (health data = Guarded, family conflicts = Sensitive); the LLM handles only ambiguous cases. Qwen3.5 is #1 (8.0), beating Qwen3 (7.8) and the Sarvam API (7.5). Faster and more accurate, while staying local and private.
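The rules-first, LLM-fallback split can be sketched as below; the rule patterns are illustrative stand-ins for the real rule set:

```python
TIERS = ("Open", "Sensitive", "Guarded", "Inferred", "Forbidden")

# Illustrative rules only; the real rule set is much larger. Pattern -> tier.
RULES = [
    ("health", "Guarded"),
    ("diagnosis", "Guarded"),
    ("family conflict", "Sensitive"),
]

def classify_sensitivity(text: str, llm_fallback=None) -> str:
    """Rules first; only ambiguous text falls through to the local LLM."""
    lowered = text.lower()
    for pattern, tier in RULES:
        if pattern in lowered:
            return tier
    if llm_fallback is not None:
        # e.g. a qwen3.5 call constrained to return one of TIERS
        return llm_fallback(text)
    return "Open"
```

Keeping the rules in front means the LLM is only paid for (in latency) on the ambiguous minority of spans.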
Qwen3.5-35B-A3B
Links entities across conversations. "Arun" from conversation 1 → "Arun Krishnamurthy" from conversation 5. Coreference resolution, relationship type inference, temporal edge creation. Incremental graph updates after each extraction pass. Same model as entity extraction — no context switch.
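A toy version of the mention-to-canonical linking step, assuming a simple token-subset heuristic (the real coreference resolution is LLM-driven and richer):

```python
def link_mention(graph: dict, mention: str) -> str:
    """Resolve a short mention to an existing canonical entity when its
    tokens are a subset of a known full name; otherwise create a new node."""
    tokens = set(mention.lower().split())
    for canonical, node in graph.items():
        if tokens and tokens <= set(canonical.lower().split()):
            node["corroboration"] += 1  # incremental update, no graph rebuild
            return canonical
    graph[mention] = {"corroboration": 1}
    return mention

# "Arun" from conversation 1 resolves to "Arun Krishnamurthy" from conversation 5.
g = {"Arun Krishnamurthy": {"corroboration": 1}}
```

The incremental shape matters: each extraction pass touches only the nodes it mentions, rather than recomputing the whole graph.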
Algorithmic + Qwen3 32B ← ADR-019 specialist
Manual: Qwen3 A(10.0) > Sarvam A(8.7) > Qwen3.5 B+(7.0) — Qwen3.5 over-escalates
This is WHY dual-model routing exists. Five factors: frequency, recency, consistency, source quality, corroboration. LLM for contradiction detection. Qwen3 32B is perfect (10/10) in manual evaluation. Qwen3.5 has systematic over-escalation (3/10 errors: treats role changes as contradictions, continued behavior as updates). Knowledge graph integrity depends on this task — false alerts cause user fatigue. Routed to Qwen3 32B despite slower speed (110s).
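The five-factor blend can be sketched as a weighted score; the weights below are illustrative assumptions, since the document does not specify them:

```python
# Illustrative weights for the five factors; the real weighting is not
# specified in this document.
WEIGHTS = {
    "frequency": 0.25,
    "recency": 0.25,
    "consistency": 0.20,
    "source_quality": 0.15,
    "corroboration": 0.15,
}

def fact_confidence(factors: dict) -> float:
    """Weighted blend; each factor is assumed pre-normalised to [0, 1]."""
    return sum(w * factors[name] for name, w in WEIGHTS.items())
```

Only facts that both score well here and genuinely conflict are worth the slow 110s specialist call, which is what keeps false-alert fatigue down.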
ADR-019 Dual-Model Routing in the Context Engine: Qwen3.5 handles entity extraction, sensitivity classification, and graph building (~76% of calls, 7× faster). Qwen3 32B handles contradiction detection only (~24%, 10.0 quality). Both local, $0, 67 GiB total. Why not Sarvam-M? Extraction requires structured JSON output with a guaranteed schema, and Sarvam-M has no native JSON mode. Sarvam-M remains #1 overall (8.3/10) for text generation — used for briefing and nudge, not extraction.
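The ADR-019 split reduces to a small routing table; task names below are paraphrased from the text:

```python
# ADR-019 dual-model split as a routing table.
ROUTES = {
    "entity_extraction": "qwen3.5",          # ~76% of calls, 7x faster
    "sensitivity_classification": "qwen3.5",
    "graph_building": "qwen3.5",
    "contradiction_detection": "qwen3:32b",  # ~24% of calls, 10.0 quality, 110s
}

def route(task: str) -> str:
    """Unknown tasks default to the fast primary, never the slow specialist."""
    return ROUTES.get(task, "qwen3.5")
```

Defaulting unknown tasks to the primary is a deliberate bias: the specialist is only ever reached by explicit opt-in.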
Agentic Actions & Tool Calling (MCP Portals)
Claude Haiku 4.5 A 8.7
Requires native function calling. Selects which portal (Search, Calendar, Email, Browser, GitHub, Slack, etc.) to invoke, constructs parameters, interprets results. Sarvam-M and Qwen3 lack reliable tool-use support. Claude's native tool_use API is purpose-built for this.
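A sketch of a portal catalogue in Anthropic's tool_use request shape; the portal names, descriptions, and parameter schemas are illustrative, not the real MCP definitions:

```python
# Hypothetical portal definitions in Anthropic tool_use format; names and
# schemas are assumptions for illustration.
PORTALS = [
    {"name": "search_portal",
     "description": "Web search for entity enrichment.",
     "input_schema": {"type": "object",
                      "properties": {"query": {"type": "string"}},
                      "required": ["query"]}},
    {"name": "calendar_portal",
     "description": "Read or create calendar events.",
     "input_schema": {"type": "object",
                      "properties": {"action": {"type": "string"},
                                     "date": {"type": "string"}},
                      "required": ["action"]}},
]

def build_tool_request(user_text: str) -> dict:
    """Claude selects the portal and constructs parameters; we only supply
    the tool catalogue and the user turn."""
    return {"model": "claude-haiku-4-5", "max_tokens": 1024,
            "tools": PORTALS,
            "messages": [{"role": "user", "content": user_text}]}

req = build_tool_request("What's on my calendar tomorrow?")
```

The response then carries structured `tool_use` blocks that an MCP server can execute directly, which is the differentiator over models without native function calling.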
Claude Haiku 4.5 A 8.7
Query formulation + result synthesis. "Who is that person Rajesh mentioned?" → formulate search query → call Search portal → interpret results → update entity. Part of MCP tool calling pipeline. Triggered during extraction (0.8 weight) or on-demand by user query.
Claude Haiku 4.5 A 8.7
Multi-step web navigation. "Book that restaurant Suresh recommended" → plan steps → execute via Browser MCP → confirm result. 3 execution channels: browser, voice, desktop API. 5-tier approval (auto → quick-approve → full review). Commerce and communication skills need function calling.
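The approval ladder can be sketched as a gating function. The document names only auto, quick-approve, and full review; the middle tier names and the gating logic below are assumptions:

```python
# Sketch of a 5-tier approval ladder; "notify" and "confirm" and the
# decision logic are illustrative assumptions.
TIERS = ("auto", "notify", "quick-approve", "confirm", "full-review")

def approval_tier(spends_money: bool, sends_message: bool, reversible: bool) -> str:
    if spends_money:
        return "full-review"   # commerce always gets the strictest gate
    if sends_message:
        return "confirm"       # outbound communication needs an explicit OK
    if not reversible:
        return "quick-approve" # irreversible but harmless: one-tap approval
    return "auto"              # read-only / reversible steps run unattended
```

The point of the ladder is that most browser steps are reversible reads and run at "auto", while the restaurant booking itself climbs to a human gate.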
Direct API calls (no LLM)
Simple REST calls to known APIs. Fetch today's schedule, check weather for debrief. No LLM reasoning needed — just API integration code. Calendar events feed into Morning Debrief context. Weather data is a template variable.
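As a sketch of the no-LLM path, the weather and calendar data drop straight into a string template; the field names (`temp_c`, `summary`, `title`) are illustrative:

```python
# No LLM in this path: plain API data substituted into a template.
def render_debrief(template: str, weather: dict, events: list) -> str:
    return template.format(
        weather=f"{weather['temp_c']} C, {weather['summary']}",
        schedule="; ".join(e["title"] for e in events),
    )

text = render_debrief(
    "Good morning. Weather: {weather}. Today: {schedule}.",
    {"temp_c": 24, "summary": "clear"},
    [{"title": "Standup 9:30"}, {"title": "1:1 with Rajesh"}],
)
```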
Why Claude for tool calling? Native function calling (tool_use API) is the key differentiator. Claude sends structured tool invocations that MCP servers can execute directly. Sarvam-M lacks tool-use support. Qwen3 has experimental function calling but unvalidated on our MCP topology. When tool calling matures in open models, this can shift local.
MCP Portal topology (from Mindscape observability): 9+ services connected via weighted routes. Extraction→Search (0.8), Email→Email (0.9), Briefing→Calendar (0.7), Briefing→Weather (0.8), Nudge→Slack (0.5), Moltbook→Browser (0.9), Backup→Filesystem (0.9).
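The topology above as a weight table, with a gating sketch. The 0.6 firing threshold and the weight × signal gating are assumptions for illustration:

```python
# The weighted routes from the Mindscape topology, (agent, portal) -> weight.
ROUTES = {
    ("extraction", "search"): 0.8,
    ("email", "email"): 0.9,
    ("briefing", "calendar"): 0.7,
    ("briefing", "weather"): 0.8,
    ("nudge", "slack"): 0.5,
    ("moltbook", "browser"): 0.9,
    ("backup", "filesystem"): 0.9,
}

def should_invoke(agent: str, portal: str, signal: float = 1.0,
                  threshold: float = 0.6) -> bool:
    """Fire a portal call when edge weight x signal strength clears the
    threshold; both the threshold and the gating rule are assumptions."""
    return ROUTES.get((agent, portal), 0.0) * signal >= threshold
```

Under these assumed numbers, the low-weight Nudge→Slack edge (0.5) stays quiet unless the signal is boosted, while high-weight edges fire routinely.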
Image & Video Understanding
Qwen3.5-35B-A3B VALIDATED ×2 + VLM OCR
Speed: 61 tok/s (Q4) / 30 tok/s (BF16)
VRAM: ~21 GiB (Q4) / 94 GiB (BF16)
Cost: $0.00
Primary local model (~80% of calls) + VLM. Text + image + video in one MoE model (35B total, 3B active). ADR-019 Dual-Model Routing: handles entity extraction, sensitivity, briefing, nudge, email triage, chat QA, and all vision tasks.
Two validated paths: Q4_K_M via llama.cpp (61 tok/s, 21 GiB, FORCE_CUBLAS build) and BF16 via vLLM nightly (30 tok/s, 94 GiB). Q4 recommended. 262K context, 201 languages, Apache 2.0.
Benchmarks: OmniDocBench 89.3, MMMU 81.4, VideoMME 86.6. Entity extraction A− (8.2), #5/16 overall, beats Qwen3 32B A− (8.0). Task benchmark 8.0 (#4/9) — wins sensitivity, nudge, email triage.
VLM OCR tested (session 73): English doc OCR 9/10 (scanned legal doc near-perfect), JSON extraction 9/10 (valid structured output), screenshot understanding 8/10. Kannada OCR: 2/10 (pure) / 5/10 (code-mixed) — needs a separate model for Kannada image text.
Known weakness: contradiction detection (7.0) — over-escalation pattern, routed to Qwen3 32B instead. Thinking mode must be OFF — chat_template_kwargs: {enable_thinking: false} via --jinja flag.
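The two validated launch paths, sketched as launch commands. Flag spellings and the Hugging Face model id are assumptions; check them against your llama.cpp and vLLM versions:

```shell
# Q4 path: llama.cpp built with -DGGML_CUDA_FORCE_CUBLAS=ON; thinking
# disabled through the Jinja chat template.
./llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf --jinja \
    --chat-template-kwargs '{"enable_thinking": false}' \
    -ngl 99 --port 8080

# BF16 path: vLLM nightly. Disable thinking per request by sending
# "chat_template_kwargs": {"enable_thinking": false} in the chat payload.
vllm serve Qwen/Qwen3.5-35B-A3B --dtype bfloat16
```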
Cosmos Reason2 8B FAILED
NIM: nvcr.io/nim/nvidia/cosmos-reason2-8b
Error: cudaErrorStreamCaptureInvalidated
CUDA graph crash on Blackwell SM_120. NIM container auto-detected DGX Spark, selected FP8 profile, model loaded — crashed during torch.compile/CUDA graph compilation. vLLM in NIM is too old for Blackwell. Superseded by Qwen3.5-35B-A3B.
Nemotron Nano 12B VL FAILED
NIM: nvcr.io/nim/nvidia/nemotron-nano-12b-v2-vl
Error: Hung — 0% GPU after profile select
Container hung indefinitely. Selected BF16 profile, then no progress for 10+ minutes. No GPU utilization. Same root cause: NIM vLLM incompatible with Blackwell. Superseded by Qwen3.5-35B-A3B (OmniDocBench 89.3).
Architecture (ADR-022/023, supersedes ADR-019 above): Consolidated to qwen3.5:27b dense (Ollama) for extraction + background tasks and Qwen3.5-9B Opus-Distilled (llama-server :8003) for Annie Voice. Qwen3 32B is retained only for contradiction detection (config-only, rarely loaded). A GPU queue defers extraction during voice conversations.
Memory budget (current — see RESOURCE-REGISTRY.md): Idle ~19 GB. With extraction (qwen3.5:27b 40 GB) = ~59 GB. With embeddings (+14 GB) = ~73 GB. Peak (+ qwen3:32b) = ~105 GB. Budget limit: 110 GB, 18 GB reserved for OS/CUDA.
Critical build flag: -DGGML_CUDA_FORCE_CUBLAS=ON is required. The native MMQ kernels crash on Blackwell with MXFP4 tensors (#18331 — CLOSED); cuBLAS bypasses this and is actually faster (61 tok/s).
NIM VLM status: Both the Cosmos Reason2 and Nemotron Nano VL NIM containers FAIL on DGX Spark (Blackwell SM_121). Qwen3.5 via llama.cpp/vLLM is the validated alternative.
Current model topology (ADR-022/023, supersedes ADR-019 dual-model routing above):
• qwen3.5:27b dense (Ollama, $0, creature: unicorn/lion) — Primary extraction + background tasks: entity extraction, Graphiti graph-building, daily reflection, nudge, wonder, comic. 40 GB Q4_K_M. ADR-022: dense 27B beats 35B MoE on IFEval (95.0 vs 91.9) and structured output quality.
• Qwen3.5-9B Opus-Distilled (llama-server :8003, creature: minotaur) — Annie Voice chat: 94% tool calling accuracy (16 cases), 6.6 GB. ADR-023: chosen over Llama 3.1 8B for tool reliability (no bogus searches/text leaks). --reasoning-budget 0 required.
• qwen3:32b (Ollama, creature: fairy) — Contradiction detection specialist: 10.0/10 (perfect). Config-only, rarely loaded. ADR-019 v2.
• qwen3-embedding:8b (Ollama, 14 GB) — Matryoshka 1024-dim embeddings. ADR-016 v2.
• Nemotron Speech 0.6B (NeMo RNNT, creature: serpent) — Annie Voice STT, 2.5 GB. Replaced Qwen3-ASR-1.7B.
• Kokoro v0.19 (in-process GPU, creature: leviathan) — TTS, 0.5 GB, ~30ms. ELO 1059 (#1 TTS Arena).
VRAM budget (canonical: RESOURCE-REGISTRY.md): Idle ~19 GB (llama-server 6.6 + audio 7.3 + SER 1.2 + Kokoro 0.5 + Nemotron STT 2.5). Extraction: +40 GB = ~59 GB. Peak (+ embed + contradiction): ~105 GB. Budget limit: 110 GB.
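Re-deriving the budget arithmetic above as a quick sanity check. The itemised figures sum slightly under the rounded totals (~19/~59/~105 GB), and the 32 GB for qwen3:32b is inferred from the peak delta rather than stated directly:

```python
# Recomputing the VRAM budget from the figures above (GB, approximate).
idle = 6.6 + 7.3 + 1.2 + 0.5 + 2.5  # llama-server + audio + SER + Kokoro + STT
extraction = idle + 40.0             # + qwen3.5:27b
peak = extraction + 14.0 + 32.0      # + qwen3-embedding:8b + qwen3:32b (inferred)
BUDGET_GB = 110.0

headroom = BUDGET_GB - peak          # margin left under the 110 GB limit
```

The check that matters operationally is that peak stays under the 110 GB limit with the 18 GB OS/CUDA reservation already carved out.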
GPU queue: Voice-active lease defers extraction + embedding during Annie conversations. 300s lease expires naturally after disconnect. Prevents GPU contention.
The key trade-off: privacy (data stays on the machine) vs. capability. All extraction and search run locally at $0; the Claude API is used only as the alternate Annie Voice backend.