⚠ ADR-022/023 superseded the model choices below. Current topology: qwen3.5:27b (extraction) + Qwen3.5-9B Opus-Distilled (Annie Voice) + qwen3:32b (contradiction only). See summary note and Phase 1.2 section below for current state.
Context Engine — Extraction Pass (Single LLM Call)
Qwen3.5-35B-A3B A− 8.2 ← ADR-019 primary
One call extracts everything: entities, promises, emotions, and summary are all fields of a single structured JSON response. 15.5s/transcript (7× faster than Qwen3). Background Lane Queue processing. Privacy-critical: raw conversation data never leaves the machine. Thinking OFF required.
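A minimal sketch of what the single-call request could look like against a local OpenAI-compatible endpoint. The schema field names, model id, and prompt are illustrative assumptions, not the real ones:

```python
# Illustrative JSON schema for the single extraction call; the real field
# names and structure are assumptions here.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {"type": "array", "items": {"type": "string"}},
        "promises": {"type": "array", "items": {"type": "string"}},
        "emotions": {"type": "array", "items": {"type": "string"}},
        "summary": {"type": "string"},
    },
    "required": ["entities", "promises", "emotions", "summary"],
}

def build_extraction_request(transcript: str) -> dict:
    """One structured-output request covering every extraction field at once."""
    return {
        "model": "qwen3.5:27b",  # current topology per ADR-022
        "messages": [
            {"role": "system",
             "content": "Extract entities, promises, emotions, and a summary. Reply with JSON only."},
            {"role": "user", "content": transcript},
        ],
        # Constrain decoding so one call yields all fields in a guaranteed schema.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "extraction", "schema": EXTRACTION_SCHEMA},
        },
        # Thinking must stay OFF to hit the 15.5s/transcript target.
        "chat_template_kwargs": {"enable_thinking": False},
    }

req = build_extraction_request("Rajesh promised to send the report by Friday.")
```

Batching every field into one schema is what makes this a single call instead of four: the model fills the whole object in one constrained decode.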
Rules + Qwen3.5-35B-A3B ← ADR-019 routes here
Combined: Qwen3.5 A−(8.0) > Qwen3 A−(7.8) > Sarvam A−(7.5) > Gemini B+(7.2) > Haiku B+(6.5)
5-tier classification: Open → Sensitive → Guarded → Inferred → Forbidden. Most classification is handled by rules (health data = Guarded, family conflicts = Sensitive); the LLM handles only ambiguous cases. Qwen3.5 is #1 (8.0), beating Qwen3 (7.8) and the Sarvam API (7.5). Faster and more accurate, while staying local and private.
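The rules-first, LLM-fallback split can be sketched as below; the rule patterns are illustrative stand-ins for the real rule set:

```python
TIERS = ("Open", "Sensitive", "Guarded", "Inferred", "Forbidden")

# Illustrative rules only; the real rule set is much larger. Pattern -> tier.
RULES = [
    ("health", "Guarded"),
    ("diagnosis", "Guarded"),
    ("family conflict", "Sensitive"),
]

def classify_sensitivity(text: str, llm_fallback=None) -> str:
    """Rules first; only ambiguous text falls through to the local LLM."""
    lowered = text.lower()
    for pattern, tier in RULES:
        if pattern in lowered:
            return tier
    if llm_fallback is not None:
        # e.g. a qwen3.5 call constrained to return one of TIERS
        return llm_fallback(text)
    return "Open"
```

Keeping the rules in front means the LLM is only paid for (in latency) on the ambiguous minority of spans.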
Qwen3.5-35B-A3B
Links entities across conversations. "Arun" from conversation 1 → "Arun Krishnamurthy" from conversation 5. Coreference resolution, relationship type inference, temporal edge creation. Incremental graph updates after each extraction pass. Same model as entity extraction — no context switch.
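A toy version of the mention-to-canonical linking step, assuming a simple token-subset heuristic (the real coreference resolution is LLM-driven and richer):

```python
def link_mention(graph: dict, mention: str) -> str:
    """Resolve a short mention to an existing canonical entity when its
    tokens are a subset of a known full name; otherwise create a new node."""
    tokens = set(mention.lower().split())
    for canonical, node in graph.items():
        if tokens and tokens <= set(canonical.lower().split()):
            node["corroboration"] += 1  # incremental update, no graph rebuild
            return canonical
    graph[mention] = {"corroboration": 1}
    return mention

# "Arun" from conversation 1 resolves to "Arun Krishnamurthy" from conversation 5.
g = {"Arun Krishnamurthy": {"corroboration": 1}}
```

The incremental shape matters: each extraction pass touches only the nodes it mentions, rather than recomputing the whole graph.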
Algorithmic + Qwen3 32B ← ADR-019 specialist
Manual: Qwen3 A(10.0) > Sarvam A(8.7) > Qwen3.5 B+(7.0) — Qwen3.5 over-escalates
This is WHY dual-model routing exists. Five factors: frequency, recency, consistency, source quality, corroboration. LLM for contradiction detection. Qwen3 32B is perfect (10/10) in manual evaluation. Qwen3.5 has systematic over-escalation (3/10 errors: treats role changes as contradictions, continued behavior as updates). Knowledge graph integrity depends on this task — false alerts cause user fatigue. Routed to Qwen3 32B despite slower speed (110s).
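The five-factor blend can be sketched as a weighted score; the weights below are illustrative assumptions, since the document does not specify them:

```python
# Illustrative weights for the five factors; the real weighting is not
# specified in this document.
WEIGHTS = {
    "frequency": 0.25,
    "recency": 0.25,
    "consistency": 0.20,
    "source_quality": 0.15,
    "corroboration": 0.15,
}

def fact_confidence(factors: dict) -> float:
    """Weighted blend; each factor is assumed pre-normalised to [0, 1]."""
    return sum(w * factors[name] for name, w in WEIGHTS.items())
```

Only facts that both score well here and genuinely conflict are worth the slow 110s specialist call, which is what keeps false-alert fatigue down.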
ADR-019 Dual-Model Routing in the Context Engine: Qwen3.5 handles entity extraction, sensitivity classification, and graph building (~76% of calls, 7× faster). Qwen3 32B handles contradiction detection only (~24%, 10.0 quality). Both local, $0, 67 GiB total. Why not Sarvam-M? Extraction requires structured JSON output with a guaranteed schema, and Sarvam-M has no native JSON mode. Sarvam-M remains #1 overall (8.3/10) for text generation — used for briefing and nudge, not extraction.
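The ADR-019 split reduces to a small routing table; task names below are paraphrased from the text:

```python
# ADR-019 dual-model split as a routing table.
ROUTES = {
    "entity_extraction": "qwen3.5",          # ~76% of calls, 7x faster
    "sensitivity_classification": "qwen3.5",
    "graph_building": "qwen3.5",
    "contradiction_detection": "qwen3:32b",  # ~24% of calls, 10.0 quality, 110s
}

def route(task: str) -> str:
    """Unknown tasks default to the fast primary, never the slow specialist."""
    return ROUTES.get(task, "qwen3.5")
```

Defaulting unknown tasks to the primary is a deliberate bias: the specialist is only ever reached by explicit opt-in.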
Agentic Actions & Tool Calling (MCP Portals)
Claude Haiku 4.5 A 8.7
Requires native function calling. Selects which portal (Search, Calendar, Email, Browser, GitHub, Slack, etc.) to invoke, constructs parameters, interprets results. Sarvam-M and Qwen3 lack reliable tool-use support. Claude's native tool_use API is purpose-built for this.
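A sketch of a portal catalogue in Anthropic's tool_use request shape; the portal names, descriptions, and parameter schemas are illustrative, not the real MCP definitions:

```python
# Hypothetical portal definitions in Anthropic tool_use format; names and
# schemas are assumptions for illustration.
PORTALS = [
    {"name": "search_portal",
     "description": "Web search for entity enrichment.",
     "input_schema": {"type": "object",
                      "properties": {"query": {"type": "string"}},
                      "required": ["query"]}},
    {"name": "calendar_portal",
     "description": "Read or create calendar events.",
     "input_schema": {"type": "object",
                      "properties": {"action": {"type": "string"},
                                     "date": {"type": "string"}},
                      "required": ["action"]}},
]

def build_tool_request(user_text: str) -> dict:
    """Claude selects the portal and constructs parameters; we only supply
    the tool catalogue and the user turn."""
    return {"model": "claude-haiku-4-5", "max_tokens": 1024,
            "tools": PORTALS,
            "messages": [{"role": "user", "content": user_text}]}

req = build_tool_request("What's on my calendar tomorrow?")
```

The response then carries structured `tool_use` blocks that an MCP server can execute directly, which is the differentiator over models without native function calling.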
Claude Haiku 4.5 A 8.7
Query formulation + result synthesis. "Who is that person Rajesh mentioned?" → formulate search query → call Search portal → interpret results → update entity. Part of MCP tool calling pipeline. Triggered during extraction (0.8 weight) or on-demand by user query.
Claude Haiku 4.5 A 8.7
Multi-step web navigation. "Book that restaurant Suresh recommended" → plan steps → execute via Browser MCP → confirm result. 3 execution channels: browser, voice, desktop API. 5-tier approval (auto → quick-approve → full review). Commerce and communication skills need function calling.
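The approval ladder can be sketched as a gating function. The document names only auto, quick-approve, and full review; the middle tier names and the gating logic below are assumptions:

```python
# Sketch of a 5-tier approval ladder; "notify" and "confirm" and the
# decision logic are illustrative assumptions.
TIERS = ("auto", "notify", "quick-approve", "confirm", "full-review")

def approval_tier(spends_money: bool, sends_message: bool, reversible: bool) -> str:
    if spends_money:
        return "full-review"   # commerce always gets the strictest gate
    if sends_message:
        return "confirm"       # outbound communication needs an explicit OK
    if not reversible:
        return "quick-approve" # irreversible but harmless: one-tap approval
    return "auto"              # read-only / reversible steps run unattended
```

The point of the ladder is that most browser steps are reversible reads and run at "auto", while the restaurant booking itself climbs to a human gate.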
Direct API calls (no LLM)
Simple REST calls to known APIs. Fetch today's schedule, check weather for debrief. No LLM reasoning needed — just API integration code. Calendar events feed into Morning Debrief context. Weather data is a template variable.
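As a sketch of the no-LLM path, the weather and calendar data drop straight into a string template; the field names (`temp_c`, `summary`, `title`) are illustrative:

```python
# No LLM in this path: plain API data substituted into a template.
def render_debrief(template: str, weather: dict, events: list) -> str:
    return template.format(
        weather=f"{weather['temp_c']} C, {weather['summary']}",
        schedule="; ".join(e["title"] for e in events),
    )

text = render_debrief(
    "Good morning. Weather: {weather}. Today: {schedule}.",
    {"temp_c": 24, "summary": "clear"},
    [{"title": "Standup 9:30"}, {"title": "1:1 with Rajesh"}],
)
```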
Why Claude for tool calling? Native function calling (tool_use API) is the key differentiator. Claude sends structured tool invocations that MCP servers can execute directly. Sarvam-M lacks tool-use support. Qwen3 has experimental function calling but unvalidated on our MCP topology. When tool calling matures in open models, this can shift local.
MCP Portal topology (from Mindscape observability): 9+ services connected via weighted routes. Extraction→Search (0.8), Email→Email (0.9), Briefing→Calendar (0.7), Briefing→Weather (0.8), Nudge→Slack (0.5), Moltbook→Browser (0.9), Backup→Filesystem (0.9).
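The topology above as a weight table, with a gating sketch. The 0.6 firing threshold and the weight × signal gating are assumptions for illustration:

```python
# The weighted routes from the Mindscape topology, (agent, portal) -> weight.
ROUTES = {
    ("extraction", "search"): 0.8,
    ("email", "email"): 0.9,
    ("briefing", "calendar"): 0.7,
    ("briefing", "weather"): 0.8,
    ("nudge", "slack"): 0.5,
    ("moltbook", "browser"): 0.9,
    ("backup", "filesystem"): 0.9,
}

def should_invoke(agent: str, portal: str, signal: float = 1.0,
                  threshold: float = 0.6) -> bool:
    """Fire a portal call when edge weight x signal strength clears the
    threshold; both the threshold and the gating rule are assumptions."""
    return ROUTES.get((agent, portal), 0.0) * signal >= threshold
```

Under these assumed numbers, the low-weight Nudge→Slack edge (0.5) stays quiet unless the signal is boosted, while high-weight edges fire routinely.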
Image & Video Understanding
Qwen3.5-35B-A3B VALIDATED ×2 + VLM OCR
Speed: 61 tok/s (Q4) / 30 tok/s (BF16)
VRAM: ~21 GiB (Q4) / 94 GiB (BF16)
Cost: $0.00
Primary local model (~80% of calls) + VLM. Text + image + video in one MoE model (35B total, 3B active). ADR-019 Dual-Model Routing: handles entity extraction, sensitivity, briefing, nudge, email triage, chat QA, and all vision tasks.
Two validated paths: Q4_K_M via llama.cpp (61 tok/s, 21 GiB, FORCE_CUBLAS build) and BF16 via vLLM nightly (30 tok/s, 94 GiB). Q4 recommended. 262K context, 201 languages, Apache 2.0.
Benchmarks: OmniDocBench 89.3, MMMU 81.4, VideoMME 86.6. Entity extraction A− (8.2), #5/16 overall, beats Qwen3 32B A− (8.0). Task benchmark 8.0 (#4/9) — wins sensitivity, nudge, email triage.
VLM OCR tested (session 73): English doc OCR 9/10 (scanned legal doc near-perfect), JSON extraction 9/10 (valid structured output), screenshot understanding 8/10. Kannada OCR: 2/10 (pure) / 5/10 (code-mixed) — needs a separate model for Kannada image text.
Known weakness: contradiction detection (7.0) — over-escalation pattern, routed to Qwen3 32B instead. Thinking mode must be OFF — chat_template_kwargs: {enable_thinking: false} via --jinja flag.
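The two validated launch paths, sketched as launch commands. Flag spellings and the Hugging Face model id are assumptions; check them against your llama.cpp and vLLM versions:

```shell
# Q4 path: llama.cpp built with -DGGML_CUDA_FORCE_CUBLAS=ON; thinking
# disabled through the Jinja chat template.
./llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf --jinja \
    --chat-template-kwargs '{"enable_thinking": false}' \
    -ngl 99 --port 8080

# BF16 path: vLLM nightly. Disable thinking per request by sending
# "chat_template_kwargs": {"enable_thinking": false} in the chat payload.
vllm serve Qwen/Qwen3.5-35B-A3B --dtype bfloat16
```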
Cosmos Reason2 8B FAILED
NIM: nvcr.io/nim/nvidia/cosmos-reason2-8b
Error: cudaErrorStreamCaptureInvalidated
CUDA graph crash on Blackwell SM_120. NIM container auto-detected DGX Spark, selected FP8 profile, model loaded — crashed during torch.compile/CUDA graph compilation. vLLM in NIM is too old for Blackwell. Superseded by Qwen3.5-35B-A3B.
Nemotron Nano 12B VL FAILED
NIM: nvcr.io/nim/nvidia/nemotron-nano-12b-v2-vl
Error: Hung — 0% GPU after profile select
Container hung indefinitely. Selected BF16 profile, then no progress for 10+ minutes. No GPU utilization. Same root cause: NIM vLLM incompatible with Blackwell. Superseded by Qwen3.5-35B-A3B (OmniDocBench 89.3).
Architecture (ADR-022/023, supersedes ADR-019 above): Consolidated to qwen3.5:27b dense (Ollama) for extraction + background tasks and Qwen3.5-9B Opus-Distilled (llama-server :8003) for Annie Voice. Qwen3 32B is retained only for contradiction detection (config-only, rarely loaded). A GPU queue defers extraction during voice conversations.
Memory budget (current — see RESOURCE-REGISTRY.md): Idle ~19 GB. With extraction (qwen3.5:27b 40 GB) = ~59 GB. With embeddings (+14 GB) = ~73 GB. Peak (+ qwen3:32b) = ~105 GB. Budget limit: 110 GB, 18 GB reserved for OS/CUDA.
Critical build flag: -DGGML_CUDA_FORCE_CUBLAS=ON is required. The native MMQ kernels crash on Blackwell with MXFP4 tensors (#18331 — CLOSED); cuBLAS bypasses this and is actually faster (61 tok/s).
NIM VLM status: Both the Cosmos Reason2 and Nemotron Nano VL NIM containers FAIL on DGX Spark (Blackwell SM_121). Qwen3.5 via llama.cpp/vLLM is the validated alternative.
Current model topology (ADR-022/023, supersedes ADR-019 dual-model routing above):
• qwen3.5:27b dense (Ollama, $0, creature: unicorn/lion) — Primary extraction + background tasks: entity extraction, Graphiti graph-building, daily reflection, nudge, wonder, comic. 40 GB Q4_K_M. ADR-022: dense 27B beats 35B MoE on IFEval (95.0 vs 91.9) and structured output quality.
• Qwen3.5-9B Opus-Distilled (llama-server :8003, creature: minotaur) — Annie Voice chat: 94% tool calling accuracy (16 cases), 6.6 GB. ADR-023: chosen over Llama 3.1 8B for tool reliability (no bogus searches/text leaks). --reasoning-budget 0 required.
• qwen3:32b (Ollama, creature: fairy) — Contradiction detection specialist: 10.0/10 (perfect). Config-only, rarely loaded. ADR-019 v2.
• qwen3-embedding:8b (Ollama, 14 GB) — Matryoshka 1024-dim embeddings. ADR-016 v2.
• Nemotron Speech 0.6B (NeMo RNNT, creature: serpent) — Annie Voice STT, 2.5 GB. Replaced Qwen3-ASR-1.7B.
• Kokoro v0.19 (in-process GPU, creature: leviathan) — TTS, 0.5 GB, ~30ms. ELO 1059 (#1 TTS Arena).
VRAM budget (canonical: RESOURCE-REGISTRY.md): Idle ~19 GB (llama-server 6.6 + audio 7.3 + SER 1.2 + Kokoro 0.5 + Nemotron STT 2.5). Extraction: +40 GB = ~59 GB. Peak (+ embed + contradiction): ~105 GB. Budget limit: 110 GB.
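Re-deriving the budget arithmetic above as a quick sanity check. The itemised figures sum slightly under the rounded totals (~19/~59/~105 GB), and the 32 GB for qwen3:32b is inferred from the peak delta rather than stated directly:

```python
# Recomputing the VRAM budget from the figures above (GB, approximate).
idle = 6.6 + 7.3 + 1.2 + 0.5 + 2.5  # llama-server + audio + SER + Kokoro + STT
extraction = idle + 40.0             # + qwen3.5:27b
peak = extraction + 14.0 + 32.0      # + qwen3-embedding:8b + qwen3:32b (inferred)
BUDGET_GB = 110.0

headroom = BUDGET_GB - peak          # margin left under the 110 GB limit
```

The check that matters operationally is that peak stays under the 110 GB limit with the 18 GB OS/CUDA reservation already carved out.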
GPU queue: Voice-active lease defers extraction + embedding during Annie conversations. 300s lease expires naturally after disconnect. Prevents GPU contention.
The key trade-off: privacy (data stays on the machine) vs. capability. All extraction and search run locally at $0; the Claude API is used only as the alternate Annie Voice backend.