Context Retrieval Pipeline (target: <100ms, NGC Docker)

Embedding (8B, primary): 75.1ms (↓78% from 342ms)
cuVS vector search (100K): 0.17ms (~same)
cuGraph BFS (10K nodes): 2.14ms (~same)
Total (8B primary): 77.4ms PASS
ADR-016 v2: 8B for all embedding (75.1ms NGC vs 342ms bare-metal). 0.6B fallback for first boot only.

Voice Pipeline (target: <900ms excl. LLM, NGC Docker)

Nemotron Speech 0.6B STT (streaming RNNT): 75-645ms (replaced Whisper)
Kokoro TTS (short text, GPU in-process): ~30ms (~same)
Total (excl. LLM): ~105-675ms PASS

All Benchmark Results (p50 latency, NGC Docker, 2026-02-25, certified)

Legend: Great = blazing, massive headroom · Good = meets target comfortably · Note = see context
GPU / ML Models
Embedding 0.6B (single): 9.79ms (↓86%)
Embedding 0.6B (batch/sent): 1.08ms (↓77%)
Embedding 8B (single): 75.1ms (↓78%)
Embedding 8B (batch/sent): 8.05ms (↓84%)
cuVS search (100K vectors): 0.17ms (~same)
cuGraph PageRank (10K): 1.0ms (~same)
cuGraph BFS (10K): 2.14ms (~same)
GLiNER NER (6 labels): 53.8ms (↓59%)
Voice + Infrastructure
Whisper STT (5s audio): 284.1ms (↓41%)
Whisper STT (10s audio): 286.7ms
Whisper STT (30s audio): 286.1ms (↓41%)
Kokoro TTS (short): 28.1ms (~same)
Kokoro TTS (medium): 55.9ms (~same)
Kokoro TTS (long): 134.5ms
Redis GET: 0.028ms (↓18%)
Redis SET: 0.030ms (↓17%)
PostgreSQL query: 0.137ms (↓69%)
Integration Tests
Qdrant vector search: 3.9ms
Neo4j Cypher query: 6.9ms
SQLAlchemy async ORM: 0.5ms
Import Validation (10/10)
✓ mem0ai ✓ graphiti ✓ pipecat ✓ httpx ✓ anthropic ✓ faiss ✓ pydantic ✓ orjson ✓ uvloop ✓ alembic
VRAM idle: llama-server 6.6 GB + Audio 7.3 GB + SER 1.2 GB + Kokoro 0.5 GB + Nemotron STT 2.5 GB ≈ 19 GB of 128 GB. Headroom: ~109 GB idle, ~69 GB with extraction.
NGC PyTorch 4-7x faster than bare-metal pip — cuBLAS/cuDNN tuned for Blackwell SM_121. Diffs vs bare-metal (2026-02-24). 12 components + 10 imports = zero untested packages.

Speech Emotion Recognition Run 6 · 4 models · 9 speech-segmented clips · DGX Spark 2026-02-27

VAD-segmented (silencedetect at −30dB), gaps <1.5s merged, clips <3.0s filtered out (MSP-Podcast research minimum). Avg clip: 3.8s. Whisper-SER re-evaluated (Run 4's happy bias was music contamination, not a model flaw).
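The merge-and-filter step described above can be sketched in a few lines of pure Python over (start, end) second offsets; the silencedetect pass itself is assumed to have already produced the raw speech segments.

```python
def merge_and_filter(segments, max_gap=1.5, min_len=3.0):
    """Merge speech segments separated by short gaps, then drop short clips.

    segments: list of (start, end) times, e.g. from inverting ffmpeg
    silencedetect intervals. Gaps shorter than max_gap seconds are
    bridged; merged clips shorter than min_len seconds are discarded
    (the MSP-Podcast research minimum cited above).
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], end)   # bridge the short gap
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len]

# A 0.9s gap gets bridged; the resulting 0.6s fragment is dropped.
clips = merge_and_filter([(4.8, 6.0), (6.9, 8.4), (19.5, 20.1)])
```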
Model | Decision | Params | VRAM | Latency p50 | Load | Output
emotion2vec+ base | BACKUP | 90M | 359 MB | 32 ms | 102s | 9 categorical
emotion2vec+ large | ACCEPTED | 300M | 627 MB | 33 ms | 195s | 9 categorical
Whisper-SER (firdhokk) | REHABILITATED | 0.6B | 2496 MB | 129 ms | 100s | 7 categorical
audeering A/V/D | ACCEPTED | 200M | 631 MB | 10 ms | 33s | 3 continuous
Clip-by-Clip Predictions — all 4 models, speech-only segments
Clip | e2v+ base | e2v+ large | Whisper-SER | audeering A/V/D
4.8-8.4s | fearful | fearful | sad | 0.47/0.34 (sad/tired)
8.4-12.0s | sad | sad | happy | 0.59/0.44 (neutral)
19.5-22.6s | surprised | neutral | surprised | 0.60/0.70 (happy/excited)
26.9-31.4s | happy | happy | happy | 0.44/0.76 (calm/content)
37.9-41.1s | neutral | neutral | surprised | 0.37/0.41 (neutral)
51.1-55.3s | neutral | happy | happy | 0.45/0.67 (calm/content)
58.8-62.9s | neutral | neutral | happy | 0.49/0.53 (neutral)
62.9-66.9s | neutral | happy | surprised | 0.55/0.80 (happy/excited)
66.9-71.0s | neutral | happy | happy | 0.49/0.73 (calm/content)

Verdict (Run 6 — 4-model re-evaluation)

Speech segmentation rehabilitated models previously rejected. Run 4's "happy bias" was music contamination, not model flaws. With speech-only clips, all 4 models produce defensible results. The question is now sensitivity vs specificity, not broken vs working.

emotion2vec+ large: ACCEPTED — 33ms p50, 627 MB VRAM, 300M params. Detects subtle positive affect that base collapses to "neutral" — corroborated by audeering's elevated valence (V=0.67-0.80) on the same clips. For an emotional awareness system, sensitivity > conservatism. Bake into container.

emotion2vec+ base: BACKUP — 32ms p50, 359 MB VRAM, 90M params. Conservative: labels ambiguous clips "neutral" rather than guessing. 5 distinct emotions. Falls back to this if large proves too sensitive in production.

Whisper-SER (firdhokk): REHABILITATED — 129ms p50, 2496 MB VRAM (7x base). The Run 4 rejection (10/14 clips labeled happy) traced to music contamination. Run 6: 4/9 happy, 3/9 surprised, only 1/9 sad. Still over-predicts aroused emotions and misses "neutral" entirely (0/9 vs base's 5/9). Useful as a second opinion but too resource-heavy for primary.

audeering A/V/D: ACCEPTED (Phase 1, alongside large) — 10ms p50 (fastest), 631 MB VRAM, 33s load. Run both in parallel: 43ms combined, ~1.26GB total. Confidence-weighted fusion at audio layer — categories tell WHAT, dimensions tell HOW MUCH. Single enriched output: primary + intensity + confidence + valence + arousal. CC-BY-NC-SA license (personal use OK).

Optimal SER clip: 3-5s (MSP-Podcast research). Pipeline guards: skip <1s, flag <2s as low confidence, split >11s at pauses. Fusion: agreement factor adjusts confidence when categorical & dimensional signals conflict.
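The clip-length guards above reduce to a single dispatch function. This is a sketch; the return labels are illustrative names, not the pipeline's actual API.

```python
def clip_guard(duration_s: float) -> str:
    """Apply the SER pipeline guards described above.

    Returns one of: 'skip' (<1s), 'low_confidence' (<2s),
    'split' (>11s, split at pauses upstream), or 'ok'.
    The 3-5s band is the optimal clip length per MSP-Podcast research.
    """
    if duration_s < 1.0:
        return "skip"
    if duration_s < 2.0:
        return "low_confidence"
    if duration_s > 11.0:
        return "split"
    return "ok"
```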

Fused Pipeline Output — emotion2vec+ large × audeering → single enriched result per segment
Full Conversation Audio 71.0s · 16kHz mono · Omi wearable capture
Clip | Fused Emotion | Confidence | Intensity | Valence | Arousal
4.8-8.4s | fearful | 0.63 | mild | negative | moderate
8.4-12.0s | sad | 0.52 | moderate | negative | moderate
19.5-22.6s | neutral | 0.45 | moderate | positive | moderate
26.9-31.4s | happy | 0.75 | mild | positive | moderate
37.9-41.1s | neutral | 0.55 | mild | negative | low
51.1-55.3s | happy | 0.52 | mild | positive | moderate
58.8-62.9s | neutral | 0.58 | mild | neutral | moderate
62.9-66.9s | happy | 0.58 | moderate | positive | moderate
66.9-71.0s | happy | 0.52 | mild | positive | moderate
Pipeline: WAV → Whisper STT + emotion2vec+ large + audeering A/V/D → fusion → {primary, intensity, confidence, valence, arousal}
Architecture: STT container (port 9100) + SER sidecar (port 9101) | ~10.3 GB VRAM | 43ms SER overhead
Fusion: e2v softmax × agreement_factor(e2v_polarity, aud_valence) → confidence | arousal → intensity | valence → polarity
Note: Confidence values computed from Run 6 data with estimated softmax scores (~0.45-0.65 range). Exact values require live pipeline run on Titan.
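A hedged sketch of the fusion rule stated above: the categorical softmax score is scaled by an agreement factor between categorical polarity and audeering valence, and arousal is bucketed into an intensity label. The polarity sets, bucket thresholds, and 0.8/1.1 factors are illustrative assumptions, not the pipeline's tuned values.

```python
POSITIVE = {"happy", "surprised"}
NEGATIVE = {"sad", "angry", "fearful", "disgusted"}

def polarity(label: str) -> int:
    return 1 if label in POSITIVE else -1 if label in NEGATIVE else 0

def agreement_factor(cat_polarity: int, valence: float, lo=0.8, hi=1.1) -> float:
    """Boost confidence when categorical polarity and dimensional valence
    agree; damp it when they conflict. Valence is 0..1 with 0.5 neutral."""
    signed_valence = (valence - 0.5) * 2.0        # map to -1..1
    return hi if cat_polarity * signed_valence >= 0 else lo

def fuse(label: str, softmax: float, valence: float, arousal: float) -> dict:
    """Single enriched output per segment: category tells WHAT,
    dimensions tell HOW MUCH (arousal drives the intensity bucket)."""
    conf = min(1.0, softmax * agreement_factor(polarity(label), valence))
    intensity = "low" if arousal < 0.4 else "moderate" if arousal < 0.7 else "high"
    return {"primary": label, "confidence": round(conf, 2),
            "intensity": intensity, "valence": valence, "arousal": arousal}
```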

ADR-016 v2: 8B-Primary Embedding revised from dual-model — NGC makes 8B real-time

Primary — 8B for Everything

75.1ms · 14.1 GB · MTEB 69.44

Single model for all embedding: real-time queries, batch indexing, search. No confidence routing, no nightly re-index. +12% better quality than 0.6B.

First-Boot Fallback (0.6B)

9.78ms · 1.1 GB

Baked into Docker image. Used only while 8B downloads on first boot (~15 min). Never used after.
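The first-boot fallback reduces to a single existence check. This is a sketch: the model directory layout is an assumption, and the tags follow the Ollama names used elsewhere in this document.

```python
from pathlib import Path

def pick_embedding_model(model_dir: Path = Path("/models")) -> str:
    """First-boot fallback per ADR-016 v2: serve the baked-in 0.6B only
    until the 8B weights finish downloading (~15 min), then 8B for
    everything. The /models layout and directory name are assumptions."""
    if (model_dir / "qwen3-embedding-8b").exists():
        return "qwen3-embedding:8b"    # 75.1ms, 14.1 GB, MTEB 69.44
    return "qwen3-embedding:0.6b"      # baked into the image, first boot only
```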

What Changed

−1.1 GB VRAM

NGC Docker gives 4.6x on 8B (342ms → 75.1ms). Eliminates: progressive retrieval, confidence routing, async re-search, nightly re-index. Strictly better on every dimension.

LLM Evaluation — Multi-Task Benchmarks ADR-019 · 1042 evaluations · 18 models (10 API + 8 local)

18 models × 8 transcripts (5 English + 3 Kannada-heavy) · All models understood Kannada-English code-mixing · Local models run on Titan (DGX Spark) · 5 tasks: entities, promises, emotions, summary, latency · 4 local models show "—" for sub-tasks (entity extraction only)
Model | Provider | Entities | Promises | Emotions | Summary | $/call

Key Findings

All models understood Kannada-English code-mixed text. No model choked on "TSH 6.5 ide ante. Normal range 4.5 varegu" or "naalke phone maadthini Appa ge."

Speed tiers: Flash-Lite (2.8s) > Sarvam-M (3.7s, FREE) > Haiku (5.7s) > Gemini 3 Flash (9.6s) > Sonnet/Opus (15s) > Pro models (18-24s).

Quality tiers (entity extraction on Kannada): Sonnet = richest structured output (relationships as objects, actions captured). Opus = added "subclinical hypothyroidism" medical inference. Sarvam-M = 12 entities extracted, captured TSH 6.5 numeric value, Thyroxine 50mg dosage, BP 120/80, Dr. Lakshmi, Lalbagh — most detailed medical data. $0.00/call. No native JSON mode but produced valid JSON on all 33 calls. Gemini 3 Flash = clean, correct, good structure. Flash-Lite = fast but shallow (missed Dr. Lakshmi's name).

Cost comparison: Sarvam-M $0.0000 > Flash-Lite $0.0003 > Gemini 3 Flash $0.0022 > Haiku $0.0044 > Sonnet $0.0170 > Opus $0.0786 per call.

Verdict (ADR-019): Qwen3.5 = primary local model (10/13 tasks), Qwen3 = specialist (contradiction + email draft). Sarvam-M remains a serious contender — free, fast, Kannada-native, extracted the most medical detail. See Model Recommendations below for full routing table. Raw results: scripts/eval-llm/results/

Quality Review — Entity Extraction Outputs Compare actual model outputs side-by-side

How to Evaluate Quality

  • Cyan names = entity names the model found. Did it find ALL people, places, orgs?
  • Purple types = how it classified each entity. person vs location vs health_metric — correct?
  • Amber attributes = extracted details (role, value, description). Specificity matters: "TSH 6.5" beats "thyroid test"
  • What makes a good extraction: All entities found, correct types, specific values (numbers, dates, dosages), relationships between entities captured
  • Red flags: Missing people/places, hallucinated entities not in transcript, vague descriptions, missed numeric values (TSH, BP, amounts)

Model Recommendations ADR-019 v2 benchmarks (historical) · 1042 evals + 198 judge + manual scoring (sessions 70-71, 167)

⚠ ADR-022/023 superseded the model choices below. Current topology: qwen3.5:27b (extraction) + Qwen3.5-9B Opus-Distilled (Annie Voice) + qwen3:32b (contradiction only). See summary note and Phase 1.2 section below for current state.
Entity Extraction
English only · Fully local · Zero cost
LOCAL $0
Qwen3.5-35B-A3B A− 8.2 ← ADR-019 routes here
Speed: 15.5s VRAM: 21 GiB Cost: $0.00
ADR-019: Primary local model for entity extraction. Beats Qwen3 32B on quality (8.2 vs 8.0) AND speed (15.5s vs 110s, 7× faster). Rich attributes with v1_strict_schema prompt. MoE architecture = fast + accurate. Thinking OFF required.
Specialist fallback: Qwen3 32B (A− 8.0, 110s) — available on Titan for contradiction detection
Indic languages (Kannada/Hindi) · Fully local · Zero cost
LOCAL $0
Qwen3.5-35B-A3B A− 8.2
Speed: 15.5s VRAM: 21 GiB Languages: 201 supported (incl. Kannada)
Multilingual MoE. 201 languages including Kannada, Hindi, Tamil. Same quality as English extraction. 7× faster than Qwen3 32B. Code-mixed text handled natively. Apache 2.0.
Verified Kannada: Qwen3 32B found 14/15 facts on health transcript (A 9.4). Both models handle Indic well.
Indic languages · Best quality · Cheapest API
API
Claude Haiku 4.5 A 8.7
Speed: 5.7s Cost: $0.001/call Kannada: A (9.4)
Best quality at the lowest Claude price. Consistent A/A− across all 8 transcripts. Rich attributes, clean schema, precise medical values. ~$0.30/day at 300 conversations.
Free alternative: Sarvam-M (A− 7.9, $0.00) — Kannada-native, but no structured output mode
English only · Best quality · Cheapest API
API
Claude Haiku 4.5 A 8.7
Speed: 5.7s Cost: $0.001/call Quality: Rank #1 of 16
Same recommendation regardless of language. Haiku dominates on quality/cost ratio. 20x faster than any local model. Only consideration: data leaves the machine (sent to Anthropic API).
If zero-latency matters: Gemini 3 Flash (A− 8.4, $0.002/call) — competitive quality, different privacy profile
Context Engine — Extraction Pass (Single LLM Call)
Entities + Promises + Emotions + Summary
LOCAL $0
Qwen3.5-35B-A3B A− 8.2 ← ADR-019 primary
One call extracts everything. Entities, promises, emotions, and summary are all fields in a single structured JSON response. 15.5s/transcript (7× faster than Qwen3). Background Lane Queue processing. Privacy-critical: raw conversation data never leaves the machine. Thinking OFF required.
Sensitivity Classification
LOCAL
Rules + Qwen3.5-35B-A3B ← ADR-019 routes here
Combined: Qwen3.5 A−(8.0) > Qwen3 A−(7.8) > Sarvam A−(7.5) > Gemini B+(7.2) > Haiku B+(6.5)
5-tier classification: Open → Sensitive → Guarded → Inferred → Forbidden. Most classification by rules (health data = Guarded, family conflicts = Sensitive). LLM for ambiguous cases. Qwen3.5 is #1 (8.0), beating Qwen3 (7.8) and Sarvam API (7.5). Faster + more accurate. Local and private.
Graph Building & Entity Resolution
LOCAL
Qwen3.5-35B-A3B
Links entities across conversations. "Arun" from conversation 1 → "Arun Krishnamurthy" from conversation 5. Coreference resolution, relationship type inference, temporal edge creation. Incremental graph updates after each extraction pass. Same model as entity extraction — no context switch.
Confidence Scoring & Contradiction Detection
LOCAL
Algorithmic + Qwen3 32B ← ADR-019 specialist
Manual: Qwen3 A(10.0) > Sarvam A(8.7) > Qwen3.5 B+(7.0) — Qwen3.5 over-escalates
This is WHY dual-model routing exists. Five factors: frequency, recency, consistency, source quality, corroboration. LLM for contradiction detection. Qwen3 32B is perfect (10/10) in manual evaluation. Qwen3.5 has systematic over-escalation (3/10 errors: treats role changes as contradictions, continued behavior as updates). Knowledge graph integrity depends on this task — false alerts cause user fatigue. Routed to Qwen3 32B despite slower speed (110s).
ADR-019 Dual-Model Routing in the Context Engine: Qwen3.5 handles entity extraction, sensitivity classification, and graph building (~76% of calls, 7× faster). Qwen3 32B handles contradiction detection only (~24%, 10.0 quality). Both local, $0, 67 GiB total. Why not Sarvam-M? Requires structured JSON output (guaranteed schema). Sarvam-M has no native JSON mode. Sarvam-M is #1 overall (8.3/10) for text generation — used for briefing and nudge, not extraction.
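The dual-model routing above amounts to a small lookup table with a default. The task keys here are hypothetical names; the model split and scores come from the benchmarks in this section.

```python
# Illustrative ADR-019 routing table; task keys are assumptions.
ROUTES = {
    "contradiction_detection": "qwen3:32b",  # 10.0 vs 7.0 manual score
    "email_draft": "qwen3:32b",              # 8.7 vs 8.0, see Email section
}
DEFAULT_MODEL = "qwen3.5-35b-a3b"            # everything else, 7x faster

def route(task: str) -> str:
    """Send each request to the best model; most calls take the default."""
    return ROUTES.get(task, DEFAULT_MODEL)
```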
Proactive Intelligence
Morning Debrief Generation
API
Claude Haiku 4.5 A 8.8
Combined: Haiku A(8.8) > Sonnet A(8.7) > Nemotron A−(8.4) > Gemini A−(8.4) > Sarvam A−(8.2)
Data says Haiku wins on briefings. Warmth, relevance, and actionability scored highest. Synthesizes pre-extracted context into personalized morning message. $0.002/debrief. Annie's personality encoded in system prompt.
Local fallback: Qwen3.5 (7.8, 5s ← ADR-019 routes here) or Qwen3 (7.9, 70s). Free: Sarvam-M (A− 8.2, $0.00).
Nudge Engine (Habit-Aware Messaging)
$0
Sarvam-M A 8.6
Combined: Sarvam A(8.6) > Ministral A−(8.3) > Sonnet A−(8.3) > Haiku/Gemini/Qwen3/Nemotron A−(8.2)
Data confirms: Sarvam #1 for nudges. Best compassion score (9.4 on celebrate tier), best brevity. Free API = unlimited nudges. 5-tier consent: Silent → Celebrate → Remind → Friction → Never. Kannada-native tone.
Local fallback: Qwen3.5 (A 8.6 ← ADR-019 routes here) or Ministral 14B (A− 8.3) if Sarvam API is down.
Annie's Meditation (Self-Reflection)
API
Claude Sonnet 4.6 A 8.9
Chat QA: Ministral A(9.0) > Sonnet A(8.9) > Gemini/Sarvam A(8.8) > Qwen3/Nemotron/Mistral Sm A(8.7)
Needs deep reasoning + personality fidelity. Daily self-reflection: reviews observability data, identifies behavioral patterns, proposes soul.md changes. Sonnet chosen over Ministral (which scored 9.0 on Chat QA) because meditation requires tool calling + soul.md modification capabilities. Soul.md changes require Rajesh's explicit approval. Max 3 self-modifications/week.
Nightly Memory Decay
LOCAL
Algorithmic (no LLM)
Half-life temporal decay formula. Non-evergreen entities gradually fade in salience. Nothing is deleted — just de-prioritized in retrieval. Evergreen facts (birthdays, allergies, preferences) never decay. Pure math, runs nightly, <1 second.
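The decay rule is a standard half-life formula; the 30-day half-life used here is an illustrative assumption, not a documented value.

```python
def decayed_salience(salience: float, age_days: float,
                     half_life_days: float = 30.0,
                     evergreen: bool = False) -> float:
    """Half-life temporal decay: salience halves every half_life_days.

    Evergreen facts (birthdays, allergies, preferences) never decay.
    Nothing is deleted; low salience only de-prioritizes retrieval.
    """
    if evergreen:
        return salience
    return salience * 0.5 ** (age_days / half_life_days)
```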
Communication & Real-time
Voice Calls (Real-time Conversation)
API
Claude Haiku 4.5 A 8.7
Latency target: 600–900ms First token: <500ms
Latency-critical — needs <1s first token. Pipecat pipeline: Whisper STT (local) → Claude reasoning (API) → Kokoro TTS (local). Only the reasoning step hits API. Sarvam-M at 3.7s total is too slow for conversational flow. ~$0.01/min.
Email Triage & Drafting
LOCAL API
3-Layer Pipeline (ADR-019: triage→Qwen3.5, draft→Qwen3)
Triage: Qwen3.5 A−(8.0) ← routes here | Draft: Qwen3 A(8.7) ← routes here (manual, no Kannada)
Rules (65%) → local (30%) → Claude (5%). Triage routed to Qwen3.5 (8.0, fast). Drafts routed to Qwen3 (8.7 manual score — Qwen3.5 produces placeholder text in formal emails). Kannada email excluded per design decision. Complex drafts escalate to Sonnet. Per-recipient tone profiles from sent mail. 5-tier autonomy: T0 observe → T4 auto-send.
Channel Watching (Gmail, WhatsApp, Telegram, Discord)
LOCAL
Protocol Handlers (no LLM)
Multi-headed listener — one per channel. Gmail (Pub/Sub), WhatsApp (Business API webhook), Telegram (Bot API), Discord (Gateway WebSocket). Pure protocol handling, message dedup, and routing to Email Agent or Lane Queue. No LLM involved.
Moltbook Observer (770K AI Agents)
$0
Sarvam-M A− 7.9
Content filtering + relevance scoring. Silent observer on Moltbook feed. Filters 41/47 posts as irrelevant, scores remainder against Rajesh's interests. 0.6 confidence ceiling on external insights. Identity anchor checking (IDENTITY.md). Free API handles the volume of daily feed scanning.
Alternative: Qwen3.5 local ← ADR-019 primary if privacy of browsing patterns matters
Agentic Actions & Tool Calling (MCP Portals)
MCP Portal Router (Tool Selection & Execution)
API
Claude Haiku 4.5 A 8.7
Requires native function calling. Selects which portal (Search, Calendar, Email, Browser, GitHub, Slack, etc.) to invoke, constructs parameters, interprets results. Sarvam-M and Qwen3 lack reliable tool-use support. Claude's native tool_use API is purpose-built for this.
Web Search & Contextual Enrichment
API
Claude Haiku 4.5 A 8.7
Query formulation + result synthesis. "Who is that person Rajesh mentioned?" → formulate search query → call Search portal → interpret results → update entity. Part of MCP tool calling pipeline. Triggered during extraction (0.8 weight) or on-demand by user query.
Browser Automation (SKILL.md Actions)
API
Claude Haiku 4.5 A 8.7
Multi-step web navigation. "Book that restaurant Suresh recommended" → plan steps → execute via Browser MCP → confirm result. 3 execution channels: browser, voice, desktop API. 5-tier approval (auto → quick-approve → full review). Commerce and communication skills need function calling.
Calendar & Weather Integration
LOCAL
Direct API calls (no LLM)
Simple REST calls to known APIs. Fetch today's schedule, check weather for debrief. No LLM reasoning needed — just API integration code. Calendar events feed into Morning Debrief context. Weather data is a template variable.
Why Claude for tool calling? Native function calling (tool_use API) is the key differentiator. Claude sends structured tool invocations that MCP servers can execute directly. Sarvam-M lacks tool-use support. Qwen3 has experimental function calling but unvalidated on our MCP topology. When tool calling matures in open models, this can shift local.

MCP Portal topology (from Mindscape observability): 9+ services connected via weighted routes. Extraction→Search (0.8), Email→Email (0.9), Briefing→Calendar (0.7), Briefing→Weather (0.8), Nudge→Slack (0.5), Moltbook→Browser (0.9), Backup→Filesystem (0.9).
Memory & Search Infrastructure
Semantic Embedding (Vector Search)
LOCAL $0
Qwen3-Embedding-8B PULLED
Speed: 74ms (NGC Docker) VRAM: 14.1 GB Dims: 4096 (Matryoshka → 1024) Ollama: qwen3-embedding:8b (4.7 GB)
Not an LLM — embedding model. Converts text to 4096-dim vectors for similarity search. #1 MTEB multilingual. Matryoshka property: truncate to 1024 dims for fast search, 4096 for precision. Powers the 70% vector side of hybrid retrieval. Pulled on Titan via ollama pull qwen3-embedding:8b. Served via Ollama embed API on :11434.
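The Matryoshka property means truncation is just a prefix slice plus re-normalization, so cosine similarity stays meaningful at 1024 dims. A pure-Python sketch:

```python
import math

def matryoshka_truncate(vec, dims=1024):
    """Keep the first `dims` components of a Matryoshka embedding and
    re-normalize to unit length (4096 dims for precision, 1024 for
    fast search, per the hybrid retrieval design above)."""
    v = vec[:dims]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```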
Graph Traversal & Memory Search
LOCAL $0
FalkorDB + BM25
Graph: 0.16ms Context retrieval: <100ms target
No LLM needed for search. Hybrid retrieval: vector similarity (70%) + BM25 keyword (30%) + graph traversal for relationship expansion. FalkorDB validated at 0.16ms (43x faster than Neo4j). LLM only involved if natural language query needs expansion.
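The 70/30 hybrid score is a weighted sum over per-source scores. This sketch assumes both inputs are already normalized to 0..1; graph traversal expansion happens upstream of scoring.

```python
def hybrid_score(vector_sim: float, bm25: float,
                 w_vec: float = 0.70, w_bm25: float = 0.30) -> float:
    """Weighted hybrid retrieval score: 70% vector similarity,
    30% BM25 keyword score (both assumed pre-normalized to 0..1)."""
    return w_vec * vector_sim + w_bm25 * bm25

def rank(candidates):
    """candidates: {doc_id: (vector_sim, bm25_norm)} -> ids, best first."""
    return sorted(candidates,
                  key=lambda d: hybrid_score(*candidates[d]), reverse=True)
```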
Consistency Validation
LOCAL
Graph Integrity (no LLM)
Automated integrity checks. Cross-references entity files ↔ cuVS vectors ↔ PostgreSQL records ↔ graph edges. Detects orphan vectors, stale edges, missing embeddings. Pure validation logic, <200ms. Runs nightly alongside memory decay.
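The orphan checks above reduce to set differences over entity IDs drawn from each store. A sketch with hypothetical ID sets standing in for the cuVS, PostgreSQL, and graph stores:

```python
def find_orphans(vector_ids: set, db_ids: set, graph_ids: set) -> dict:
    """Cross-reference the three stores (as sets of entity IDs):
    vectors with no DB record, DB records with no embedding, and
    graph nodes whose entity is missing from the DB."""
    return {
        "orphan_vectors": vector_ids - db_ids,
        "missing_embeddings": db_ids - vector_ids,
        "stale_graph_nodes": graph_ids - db_ids,
    }
```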
Speaker Diarization & Attribution
LOCAL $0
Speaker Embeddings + Heuristics
Voice fingerprinting, not LLM. Three tasks: detect speech activity per speaker, extract voice embeddings, attribute to known person entities. Contextual inference (who usually talks at this time?) augments matching. Runs before extraction in Lane Queue pipeline.
Image & Video Understanding
Photo + Document + Video — Unified VLM
LOCAL VLM $0
Qwen3.5-35B-A3B VALIDATED ×2 + VLM OCR
Speed: 61 tok/s (Q4) / 30 tok/s (BF16) VRAM: ~21 GiB (Q4) / 94 GiB (BF16) Cost: $0.00
Primary local model (~80% of calls) + VLM. Text + Image + Video in one MoE model (35B total, 3B active). ADR-019 Dual-Model Routing: handles entity extraction, sensitivity, briefing, nudge, email triage, chat QA, and all vision tasks. Two validated paths: Q4_K_M via llama.cpp (61 tok/s, 21 GiB, FORCE_CUBLAS build) and BF16 via vLLM nightly (30 tok/s, 94 GiB). 262K context, 201 languages, Apache 2.0. OmniDocBench: 89.3, MMMU: 81.4, VideoMME: 86.6. Q4 recommended. Entity extraction A−(8.2), #5/16 overall, beats Qwen3 32B A−(8.0). Task benchmark 8.0 (#4/9) — wins sensitivity, nudge, email triage. VLM OCR tested (session 73): English doc OCR 9/10 (scanned legal doc near-perfect), JSON extraction 9/10 (valid structured output), screenshot understanding 8/10. Kannada OCR: 2/10 (pure) / 5/10 (code-mixed) — needs separate model for Kannada image text. Known weakness: contradiction detection (7.0) — over-escalation pattern, routed to Qwen3 32B instead. Thinking mode must be OFF: chat_template_kwargs: {enable_thinking: false} via the --jinja flag.
Photo Interpretation (Previously Planned)
LOCAL VLM
Cosmos Reason2 8B FAILED
NIM: nvcr.io/nim/nvidia/cosmos-reason2-8b Error: cudaErrorStreamCaptureInvalidated
CUDA graph crash on Blackwell SM_120. NIM container auto-detected DGX Spark, selected FP8 profile, model loaded — crashed during torch.compile/CUDA graph compilation. vLLM in NIM is too old for Blackwell. Superseded by Qwen3.5-35B-A3B.
Document OCR (Previously Planned)
LOCAL VLM
Nemotron Nano 12B VL FAILED
NIM: nvcr.io/nim/nvidia/nemotron-nano-12b-v2-vl Error: Hung — 0% GPU after profile select
Container hung indefinitely. Selected BF16 profile, then no progress for 10+ minutes. No GPU utilization. Same root cause: NIM vLLM incompatible with Blackwell. Superseded by Qwen3.5-35B-A3B (OmniDocBench 89.3).
Architecture (ADR-022/023, supersedes ADR-019 above): Consolidated to qwen3.5:27b dense (Ollama) for extraction + background, Qwen3.5-9B Opus-Distilled (llama-server :8003) for Annie Voice. Qwen3 32B retained only for contradiction detection (config-only, rarely loaded). GPU queue defers extraction during voice conversations.
Memory budget (current — see RESOURCE-REGISTRY.md): Idle ~19 GB. With extraction (qwen3.5:27b 40 GB) = ~59 GB. With embeddings (+14 GB) = ~73 GB. Peak (+ qwen3:32b) = ~105 GB. Budget limit: 110 GB, 18 GB reserved for OS/CUDA.
Critical build flag: -DGGML_CUDA_FORCE_CUBLAS=ON required. Native MMQ kernels crash on Blackwell with MXFP4 tensors (#18331 — CLOSED). cuBLAS bypasses this and is actually faster (61 tok/s).
NIM VLM status: Both Cosmos Reason2 and Nemotron Nano VL NIM containers FAIL on DGX Spark (Blackwell SM_121). Qwen3.5 via llama.cpp/vLLM is the validated alternative.
Qwen3 32B vs Qwen3.5-35B-A3B — Head-to-Head Comparison
Why two Qwen models? Different architectures, different strengths.
LOCAL $0
Dimension | Qwen3 32B | Qwen3.5-35B-A3B | Winner
Architecture | 32B dense | 35B total, 3B active (MoE) | Qwen3
VRAM (Q4_K_M) | 20 GB | 21 GiB | Tie
Speed (entity extraction) | 110s/transcript | 15.5s/transcript | Qwen3.5 (7×)
Vision (VLM) | None | Image + Video + OCR | Qwen3.5
Entity Extraction | A− (8.0) | A− (8.2) | Qwen3.5 → routes here
Task Benchmark (7 tasks) | 8.4 | 8.0 | Qwen3
Sensitivity | 7.8 | 8.0 | Qwen3.5 → routes here
Briefing (manual) | 7.9 | 7.8 | Tie → Qwen3.5 (speed)
Nudge | 8.2 | 8.6 | Qwen3.5 → routes here
Email Triage | 7.2 | 8.0 | Qwen3.5 → routes here
Email Draft (manual, no Kannada) | 8.7 | 8.0 | Qwen3 → routes here
Chat QA | 8.7 | 8.7 | Tie → Qwen3.5 (speed)
Contradiction (manual) | 10.0 | 7.0 | Qwen3 → routes here
Promises (manual, 8 transcripts) | 8.75 | 9.1 | Qwen3.5 → routes here
Emotions (manual, 8 transcripts) | 8.25 (3/8 malformed JSON) | 8.75 | Qwen3.5 → routes here
Summary (manual, 8 transcripts) | 8.56 | 9.0 | Qwen3.5 → routes here
Decision (ADR-019 — Dual-Model Routing): Both models run on Titan simultaneously. Task router sends each request to the best model:
• Qwen3 32B: contradiction detection (10.0 vs 7.0 — Qwen3.5 over-escalates systematically) + email drafts (8.7 vs 8.0 — Qwen3.5 produces placeholder text).
• Qwen3.5: everything else — entity extraction, promises (9.1), emotions (8.75), summary (9.0), sensitivity, briefing, nudge, email triage, chat QA, VLM. ~80% of calls, 5-80× faster.
• Net effect: best-of-both at 41 GiB total (leaves 87 GiB free). No quality compromise.

Manual scores (session 70): Briefing/email_draft/contradiction re-scored by human judgment, not automated scripts. Kannada email excluded per Rajesh (“We will never write Kannada email”). Contradiction ground truth for contra_consistent_infosys is debatable — both models correctly identify promotion as update.
Qwen3.5 over-escalation pattern: Treats role changes as contradictions (should be updates) and continued behavior as updates (should be consistent). 3/10 errors, all same pattern. Critical gap for knowledge graph — false contradiction alerts cause user fatigue.
Thinking mode OFF is critical: Qwen3.5 with thinking disabled scores 8.0 avg (was 7.4). 5-80× faster, 0% failures vs 8%.
Qwen3 JSON reliability issue: 3/8 emotion responses have malformed JSON (person data outside analysis object, duplicate keys). Qwen3.5 = 0 malformed across all 24 responses. This reinforces Qwen3.5 as primary.
Current model topology (ADR-022/023, supersedes ADR-019 dual-model routing above):
qwen3.5:27b dense (Ollama, $0, creature: unicorn/lion) — Primary extraction + background tasks: entity extraction, Graphiti graph-building, daily reflection, nudge, wonder, comic. 40 GB Q4_K_M. ADR-022: dense 27B beats 35B MoE on IFEval (95.0 vs 91.9) and structured output quality.
Qwen3.5-9B Opus-Distilled (llama-server :8003, creature: minotaur) — Annie Voice chat: 94% tool calling accuracy (16 cases), 6.6 GB. ADR-023: chosen over Llama 3.1 8B for tool reliability (no bogus searches/text leaks). --reasoning-budget 0 required.
qwen3:32b (Ollama, creature: fairy) — Contradiction detection specialist: 10.0/10 (perfect). Config-only, rarely loaded. ADR-019 v2.
qwen3-embedding:8b (Ollama, 14 GB) — Matryoshka 1024-dim embeddings. ADR-016 v2.
Nemotron Speech 0.6B (NeMo RNNT, creature: serpent) — Annie Voice STT, 2.5 GB. Replaced Qwen3-ASR-1.7B.
Kokoro v0.19 (in-process GPU, creature: leviathan) — TTS, 0.5 GB, ~30ms. ELO 1059 (#1 TTS Arena).

VRAM budget (canonical: RESOURCE-REGISTRY.md): Idle ~19 GB (llama-server 6.6 + audio 7.3 + SER 1.2 + Kokoro 0.5 + Nemotron STT 2.5). Extraction: +40 GB = ~59 GB. Peak (+ embed + contradiction): ~105 GB. Budget limit: 110 GB.
GPU queue: Voice-active lease defers extraction + embedding during Annie conversations. 300s lease expires naturally after disconnect. Prevents GPU contention.
The key trade-off: Privacy (data stays on machine) vs capability. All extraction and search are local, $0. Claude API used for Annie Voice alt backend only.

Phase 1.2 — Full Stack Deployed 2026-03-13 LIVE

Full stack running on Titan. Entity extraction via qwen3.5:27b (ADR-022), Annie Voice with Qwen3.5-9B Opus-Distilled (ADR-023), GPU queue protects voice latency, cross-encoder reranker, Nemotron Speech STT.
Running Services
Audio Pipeline (WhisperX+pyannote+SER): 9100 ✓
Context Engine (FastAPI): 8100 ✓
PostgreSQL (GIN indexes): 5432 ✓
Ollama (GPU — qwen3.5:27b + qwen3-embedding:8b): 11434 ✓
llama-server (Qwen3.5-9B Opus-Distilled Q4_K_M): 8003 ✓
Annie Voice (Pipecat+Nemotron STT+Kokoro TTS): 7860 ✓
SearXNG (web search): 8888 ✓
SER Sidecar (emotion2vec+ large + audeering): 9101 ✓
Dashboard (Vite dev server): 5174 ✓
Qdrant (vector search): 6333 ✓
Tests & Model Topology
Context Engine tests: 1,250 ✓
Audio Pipeline tests: 196 ✓
Annie Voice tests: 1,070 ✓
Dashboard tests: 1,829 ✓
Telegram Bot tests: 306 ✓
MCP Server tests: 95 ✓
Extraction model (unicorn): qwen3.5:27b · 40 GB
Annie Voice LLM (minotaur): Qwen3.5-9B · 6.6 GB
Annie Voice STT (serpent): Nemotron 0.6B · 2.5 GB
TTS (leviathan): Kokoro v0.19 · 0.5 GB
Data Flow
Omi/Flutter → Audio Pipeline (WhisperX+pyannote+SER) → JSONL → Context Engine (watcher) → PostgreSQL (tsvector+GIN) → qwen3.5:27b (entities) → Qwen3-Embedding-8B (vectors) → Qdrant → Annie (/v1/context)
VRAM Budget (see RESOURCE-REGISTRY.md for canonical budget)
IDLE: ~19 GB (109 GB free)
EXTRACTION: ~59 GB (69 GB free)
+ EMBEDDINGS: ~73 GB (55 GB free)
PEAK: ~105 GB (23 GB free, caution)
~4,816 tests passing (1,250 CE + 196 AP + 1,070 AV + 1,829 Dashboard + 306 Telegram + 95 MCP + others). ADR-022: qwen3.5:27b (dense 27B, 40GB) replaced 35B-A3B MoE for extraction — better IFEval (95.0 vs 91.9), better tool calling. ADR-023: Qwen3.5-9B Opus-Distilled for Annie Voice — 94% tool accuracy, no bogus searches. GPU queue protects voice latency from extraction contention. See RESOURCE-REGISTRY.md for canonical VRAM budget.