# Research: Nemotron 3 Super vs Nano — Suitability for Annie on DGX Spark

**Date:** 2026-03-19
**Status:** Complete
**Question:** Are we missing out by using Nemotron 3 Nano instead of Super for Annie's agentic framework?
**Context:** NemoClaw (NVIDIA's enterprise agent stack) exclusively showcases Nemotron 3 Super. Our Annie agentic framework uses Nemotron 3 Nano.

---

## TL;DR — The Verdict

**We are NOT missing out for voice. We ARE missing out for background agents. And we have the hardware to fix it NOW.**

| Workload | Best Model | Why |
|----------|-----------|-----|
| Voice (latency-critical) | **Nano on Titan** (keep current) | 48-65 tok/s vs 14-19 tok/s. Voice needs speed, not smarts. |
| Background agents (accuracy-critical) | **Super on Beast** (deploy now) | +22 pts on SWE-Bench, 97% tool accuracy, 1M context |
| Dual-model feasibility | **Titan + Beast** | Both DGX Spark (128 GB each), LAN link (~75ms WiFi today; ConnectX-7 <1ms once configured). No VRAM conflict. |

**Recommended: Deploy Super NVFP4 on Beast, route background agents to it.** This is exactly what NemoClaw does (Super as brain, Nano as edge executor) — except we run both locally for privacy. ~2 hours of setup, ~20 lines of code change, zero voice regression risk.

---

## Why vLLM, Not Ollama?

NemoClaw uses Ollama for local inference. We chose vLLM. Here's why:

| Factor | vLLM (our choice) | Ollama |
|--------|-------------------|--------|
| **Model format** | NVFP4 — **native training precision**, zero quality loss | GGUF — re-quantized, potential quality degradation |
| **MoE compatibility** | Works via Marlin backend (spark-vllm-docker) | **Known issue**: Ollama MoE GGUF blobs NOT compatible with upstream llama.cpp |
| **Tool calling** | `--tool-call-parser qwen3_coder` — battle-tested | Basic support, less reliable for complex chains |
| **Reasoning parser** | `super_v3` plugin — cleanly separates thinking from content | None — raw `<think>` tags leak into responses |
| **Consistency** | Same as Titan (Nano on vLLM) — identical API + client code | Different API quirks, different debugging surface |
| **Setup effort** | Medium (spark-vllm-docker build + recipe) | Easy (`ollama pull nemotron-3-super`) |
| **NemoClaw stance** | "Experimental" for local | "Experimental" for local — default is cloud API |

**Bottom line:** For background agents where quality is the whole point (97% tool accuracy, +22 SWE-Bench), using the native NVFP4 precision via vLLM is the right call. Ollama would be fine for a quick demo but not for production agents.

---

## 1. Model Specifications — Side by Side

| Property | Nemotron 3 Nano (30B-A3B) | Nemotron 3 Super (120B-A12B) |
|----------|--------------------------|------------------------------|
| **Total parameters** | 31.6B | 120B |
| **Active parameters** | 3.2B per token | 12B per token (4x Nano) |
| **Architecture** | Hybrid Mamba-2 + Transformer + MoE | Hybrid Mamba-2 + Transformer + **Latent** MoE |
| **Expert count** | ~128 experts, ~6 active | **512 experts, 22 active** |
| **Context window** | 1M tokens | 1M tokens |
| **Training precision** | BF16 → post-training quant | **Native NVFP4** (trained from scratch in FP4) |
| **Multi-Token Prediction** | No | **Yes** (built-in speculative decoding) |
| **NVFP4 model size** | ~7.5 GB (our QAT v4: 18 GB) | ~80 GB |
| **Release** | 2025 | March 11, 2026 (GTC 2026) |
| **License** | Nemotron Open (commercial) | Nemotron Open (commercial) |
| **Open artifacts** | Weights | Weights + 10T training data + RL environments + recipes |

### Architecture Deep Dive: What Makes Super Different

Super has three innovations that Nano lacks:

1. **LatentMoE** — Tokens are compressed to 1/4 dimension (4096→1024) before expert routing. This allows 512 experts (vs ~128) with 22 active per token (vs ~6). Result: 4x more expert capacity for the same compute. Nano uses standard MoE without latent compression.

2. **Multi-Token Prediction (MTP)** — Predicts multiple future tokens per forward pass. Acts as built-in speculative decoding without needing a separate draft model. On SPEED-Bench, Super achieves the highest average acceptance length (3.45 tokens), beating DeepSeek-R1. This gives 2-3x speedup on structured generation (tool calls, JSON). Nano has no MTP.

3. **Native NVFP4 Training** — Super was pre-trained from scratch in FP4 precision. This means NVFP4 quantization is lossless (the model never knew higher precision). Nano was trained in BF16 and quantized post-training, which introduces quality loss (we experienced this — our QAT v4 was needed to recover behavioral quality after NVFP4 quantization).
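
The MTP speedup claim in point 2 can be checked with quick arithmetic. A rough sketch, assuming speculative decoding accepts `acceptance_length` tokens per verification pass at some per-pass overhead; the 20% overhead factor here is a hypothetical illustration, not a measured number:

```python
# Back-of-envelope for MTP-style speculative decoding. "acceptance_length"
# is the average tokens accepted per forward pass (3.45 for Super on
# SPEED-Bench, per the text); "step_overhead" is an ASSUMED relative cost
# of running the MTP head plus verification each pass.

def effective_speedup(acceptance_length: float, step_overhead: float = 1.2) -> float:
    """Tokens accepted per pass divided by relative cost per pass."""
    return acceptance_length / step_overhead

# With an assumed 20% per-pass overhead, 3.45 accepted tokens/pass lands
# in the same ballpark as the "2-3x on structured generation" figure above.
print(f"{effective_speedup(3.45):.2f}x")
```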

---

## 2. Quality Benchmarks: Super vs Nano

### Head-to-Head

| Benchmark | Nano (30B-A3B) | Super (120B-A12B) | Delta | Significance |
|-----------|----------------|-------------------|-------|-------------|
| **MMLU-Pro** | 78.30% | 83.73% | **+5.43** | General knowledge |
| **GPQA (no tools)** | 73.04% | 79.23% | **+6.19** | Graduate-level science |
| **GPQA (with tools)** | 75.00% | 82.70% | **+7.70** | Tool-augmented reasoning |
| **LiveCodeBench** | 68.25% | 81.19% | **+12.94** | Non-contaminated coding |
| **SWE-Bench Verified** | 38.76% | 60.47% | **+21.71** | Multi-file code changes |
| **SWE-Bench Multilingual** | N/A | 45.78% | — | Cross-language coding |
| **Tool calling accuracy** | Not quantified | **97.0%** | — | vs GPT-4.1 at 96.3% |
| **PinchBench (agentic)** | Not reported | **85.6%** | — | #1 open model |
| **RULER (1M context)** | Not reported | **91.75%** | — | Context retention |

### Where Super Dominates

The gap widens dramatically on **agentic tasks**:
- **SWE-Bench**: +21.71 points — this tests exactly what background agents do (multi-step reasoning, tool use, code generation)
- **Tool calling**: 97% accuracy — critical for meditation, self-improvement, and extraction agents that rely on structured tool outputs
- **LiveCodeBench**: +12.94 points — complex multi-step problem solving
- **PinchBench**: 85.6% — NVIDIA's own benchmark for OpenClaw-style agents. Super beats Claude Opus 4.6 (80.8%) and GPT-5.4 (80.5%)

### Where Nano is Sufficient

- **MMLU-Pro**: Only 5.43 points behind — for conversational knowledge, Nano is fine
- **Simple tool calls**: Nano handles single-round `web_search` and `save_note` reliably (our QAT training ensured this)
- **Voice conversation**: Personality, warmth, conciseness — all behavioral, not capability-driven. Nano + QAT nails this.

### Comparison with Frontier Models

| Benchmark | N3 Super | Claude Opus 4.6 | GPT-5.4 | Qwen3.5-122B |
|-----------|----------|-----------------|---------|--------------|
| **PinchBench** | **85.6%** | 80.8% | 80.5% | N/A |
| **MMLU-Pro** | 83.73 | ~92+ | ~90+ | ~80.9 |
| **SWE-Bench** | 60.47 | ~65+ | ~60+ | 66.40 |
| **RULER (1M)** | 91.75 | N/A | N/A | 91.33 |
| **Throughput** | 1x | API only | API only | 0.13x |

Super is competitive with frontier models on agentic tasks while being fully local (privacy-preserving).

---

## 3. Performance on DGX Spark — Head to Head

### Our Measured Numbers (Nano) vs Community Reports (Super)

| Metric | Nano (measured on Titan) | Super (community DGX Spark) | Impact |
|--------|------------------------|----------------------------|--------|
| **NVFP4 model size** | 18 GB (QAT v4) | ~80 GB | 4.4x more VRAM |
| **TTFT (1 req)** | 90 ms | ~200-400 ms (estimated) | 2-4x slower first token |
| **Decode speed** | 48-65 tok/s | 14-19.5 tok/s | **2.5-4.6x slower** |
| **Prefill speed** | Not measured | ~1,000 tok/s | — |
| **Max concurrent** | 2-3 before degradation | 1 (VRAM-limited) | Less headroom |
| **VRAM headroom** | 87 GB free | ~0-48 GB free | Much tighter |
| **KV cache budget** | Abundant | Limited (depends on MTP) | Shorter effective context |

### Throughput Analysis for Voice

**Voice latency budget**: TTFT < 700ms, decode > 25 tok/s for natural speech pacing.

- **Nano**: 90ms TTFT + 48 tok/s decode = **well within budget**. Comfortable margin.
- **Super**: ~300ms TTFT + 14-19.5 tok/s decode = **below the 25 tok/s decode threshold**. Would feel sluggish. A 50-word (~65-token) response takes ~4.6s at 14 tok/s vs ~1.4s at 48 tok/s.

**Verdict for voice: Nano wins decisively.** Super is too slow for real-time voice on DGX Spark.
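
The budget math above can be sketched as a quick check. The TTFT and decode figures come from the tables in this section; the ~1.3 tokens-per-word ratio is an assumption (a common rule of thumb, not measured here):

```python
# Latency-budget check for a voice reply.
# ASSUMPTION: ~1.3 tokens per English word.

def reply_seconds(words: int, ttft_ms: float, decode_tok_s: float,
                  tokens_per_word: float = 1.3) -> float:
    """Total wall time: time-to-first-token plus decode time."""
    tokens = words * tokens_per_word
    return ttft_ms / 1000.0 + tokens / decode_tok_s

nano = reply_seconds(50, ttft_ms=90, decode_tok_s=48)     # ~1.4 s
super_ = reply_seconds(50, ttft_ms=300, decode_tok_s=14)  # ~4.9 s
print(f"Nano: {nano:.1f}s  Super: {super_:.1f}s")
```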

### Throughput Analysis for Background Agents

Background agents (meditation, self-improvement, extraction) have no latency requirement — they run while voice is idle.

- **Nano**: Good enough for simple extraction, but struggles with complex multi-step agent tasks (38.76% on SWE-Bench)
- **Super**: 60.47% on SWE-Bench, 97% tool accuracy, 85.6% PinchBench. Even at 14 tok/s, a 500-token meditation response takes ~36s — perfectly acceptable for a background job.

**Verdict for background agents: Super wins decisively.** Latency doesn't matter; quality does.

---

## 4. VRAM Budget: Can We Run Both?

### Current Budget (Nano only)

| Component | VRAM |
|-----------|------|
| Nemotron 3 Nano (QAT v4 NVFP4) | 18 GB |
| Audio pipeline (Whisper + pyannote + speaker) | 7.3 GB |
| SER (emotion2vec + wav2vec2) | 1.2 GB |
| Nemotron Speech STT 0.6B | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| qwen3-embedding:8b (on-demand) | 14 GB |
| **Total peak** | **~41 GB** |
| **Free** | **87 GB** |

### Hypothetical: Dual-Model (Nano + Super)

| Component | VRAM |
|-----------|------|
| Nemotron 3 Nano (voice) | 18 GB |
| Nemotron 3 Super NVFP4 (agents) | 80 GB |
| Audio pipeline | 7.3 GB |
| SER | 1.2 GB |
| Nemotron Speech STT | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| **Total (without embeddings)** | **~109.5 GB** |
| **Free** | **18.5 GB** |

**Problem**: This leaves essentially no margin under our 110 GB safety ceiling. And we haven't loaded qwen3-embedding:8b (14 GB), which would push peak usage to ~123.5 GB.

### Option: Super replaces Nano + Ollama extraction

| Component | VRAM |
|-----------|------|
| Nemotron 3 Super NVFP4 (everything) | 80 GB |
| Audio pipeline | 7.3 GB |
| SER | 1.2 GB |
| Nemotron Speech STT | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| qwen3-embedding:8b (on-demand) | 14 GB |
| **Total peak** | **~105.5 GB** |
| **Free** | **22.5 GB** |

This fits under 110 GB, but voice latency suffers (14-19 tok/s). **Not viable.**

### Option: Super loaded on-demand (swapped with embeddings)

| Mode | Models Loaded | VRAM |
|------|--------------|------|
| **Voice active** | Nano (18) + audio (11.5) + Kokoro (0.5) | ~30 GB |
| **Agent active** (voice idle) | Super (80) + audio (11.5) | ~91.5 GB |
| **Embedding burst** | Nano (18) + embedding (14) + audio (11.5) | ~43.5 GB |

This works if Super is loaded/unloaded on demand. But model loading takes 2-5 minutes for an 80 GB model — not practical for frequent switching.
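
The mode table above can be expressed as a small budget check (component sizes in GB are taken from the tables in this section; the 110 GB ceiling is our stated safety limit):

```python
# Peak-VRAM check per operating mode, against the 110 GB safety ceiling.
# Component sizes (GB) come from the budget tables in this section.

CEILING_GB = 110

MODES = {
    "voice_active":    {"nano": 18, "audio": 11.5, "kokoro": 0.5},
    "agent_active":    {"super": 80, "audio": 11.5},
    "embedding_burst": {"nano": 18, "embedding": 14, "audio": 11.5},
}

for mode, parts in MODES.items():
    total = sum(parts.values())
    status = "OK" if total <= CEILING_GB else "OVER"
    print(f"{mode}: {total:.1f} GB [{status}]")
```

Every mode fits individually; the problem is purely the 2-5 minute load time when switching modes.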

### Verdict: Dual-Model Not Practical on Single DGX Spark

The VRAM budget is too tight for running both simultaneously. On-demand swapping is too slow. **Super as a standalone replacement kills voice latency.** On a single machine, the viable options are:

1. **Keep Nano for everything today** (it's working well)
2. **Use Super via API** when needed (NVIDIA NIM, or a second machine)
3. **Wait for DGX Spark Pro** (256 GB) or use a cloud GPU for Super workloads

---

## 5. NemoClaw vs Annie's Agentic Framework

### Architecture Comparison

| Feature | NemoClaw (NVIDIA) | Annie Agentic Framework |
|---------|-------------------|------------------------|
| **LLM brain** | Nemotron 3 Super (120B-A12B) | Nemotron 3 Nano (30B-A3B) |
| **Edge executor** | Nemotron 3 Nano (sidecar) | Same Nano (single model) |
| **Agent framework** | OpenClaw (TypeScript) | Custom Python (6 modules) |
| **Workspace files** | SOUL.md, RULES.md, USER.md, TOOLS.md | Same pattern (adopted from OpenClaw) |
| **Self-improvement** | Yes (OpenClaw pattern) | Yes (self_improve.py) |
| **Meditation/reflection** | Not mentioned | Yes (meditation.py — daily/weekly/monthly) |
| **Security sandbox** | OpenShell (process isolation, YAML policies) | Workspace allowlist + path traversal guard |
| **Privacy routing** | Differential privacy router (local vs cloud) | No cloud fallback (ADR-004: privacy-first) |
| **Tool calling** | Super (97% accuracy) | Nano (good but unquantified) |
| **Agent discovery** | Plugin manifest registry | YAML agent definitions (hot-reload) |
| **Scheduling** | Not documented (OpenClaw is on-demand) | Cron/interval/one-shot scheduler |
| **Voice integration** | Telephony (Twilio/Telnyx) | WebRTC + local GPU (Pipecat) |
| **Hardware target** | DGX Station / cloud | DGX Spark (128 GB) |

### What NemoClaw Has That We Don't

1. **OpenShell Security Runtime** — Process-level isolation per agent session with YAML-based policy enforcement. Our workspace_io.py uses file allowlists, but doesn't sandbox agent execution at the OS level. For a single-user personal AI, this is overkill, but for multi-tenant scenarios it matters.

2. **Differential Privacy Router** — Routes sensitive data to local models, general tasks to cloud. We don't need this because we never use cloud models (ADR-004), but the concept of routing by data sensitivity is interesting for future multi-model setups.

3. **Super's 97% Tool Accuracy** — This is the one genuine capability gap. Our Nano handles simple tool calls (web_search, save_note) well, but for complex multi-step agent tasks (meditation producing structured JSON, extraction parsing entities), Super's 97% accuracy would reduce errors and retries.

4. **PinchBench-Validated Agent Patterns** — NVIDIA specifically optimized Super for the tool-calling patterns that OpenClaw agents use. Since our agentic framework is modeled after OpenClaw, we'd benefit from a model tuned for exactly these patterns.

### What We Have That NemoClaw Doesn't

1. **Meditation System** — Multi-timescale self-reflection (daily/weekly/monthly) with journaling. NemoClaw's OpenClaw base has self-improvement but not structured reflection.

2. **Voice-First Architecture** — Our entire system is designed around real-time voice with GPU-local STT/TTS. NemoClaw targets telephony (Twilio), which adds 200-500ms of network latency.

3. **Emotional Awareness** — SER pipeline (emotion2vec + wav2vec2) feeds emotional context into prompts. NemoClaw has no emotion recognition.

4. **Ambient Context (Omi)** — Continuous ambient awareness from wearable device. NemoClaw agents are on-demand, not always-listening.

5. **Scheduled Background Agents** — Our AgentScheduler supports cron/interval/one-shot scheduling. NemoClaw's OpenClaw agents are triggered, not scheduled.

### Assessment: Are We Missing Out?

**For voice conversations**: No. Nano at 48-65 tok/s with QAT behavioral training is excellent. NemoClaw doesn't even target real-time voice on local hardware.

**For agentic background tasks**: Partially. Super's quality advantage is real:
- Meditation prompts are complex (multi-section structured output with journal entries, fact extraction, SOUL.md proposals). Super's 97% tool accuracy and +21 pts on SWE-Bench would produce higher-quality self-reflections.
- Self-improvement requires parsing conversations for subtle learnings. Super's larger active parameter count (12B vs 3B) gives it more capacity for nuanced analysis.
- But these are **background tasks with no latency requirement**. The question is whether Nano's quality is "good enough" or whether the errors compound over time.

**For security**: OpenShell is overkill for a single-user personal AI. Our allowlist + sanitization approach is appropriate.

---

## 6. NemoClaw's Nemotron 3 Super — Key Technical Details

### Architecture: Why It's Fast Despite 120B Parameters

```
88-layer hybrid stack:
├── Mamba-2 layers (bulk) ← O(n) scaling, not O(n²)
│   └── Paired with LatentMoE (512 experts, 22 active)
│       └── Latent compression: 4096→1024 before routing (4x savings)
├── Self-attention "anchor" layers (sparse) ← Precision reasoning
└── Multi-Token Prediction head ← 2-3x speedup on structured output
```

**Why 120B total but only 12B active**: MoE activates only 22 of 512 experts per token. Each expert is small (~230M params). The "120B" is the sum of all experts, but compute cost per token is comparable to a 12B dense model.

**Why LatentMoE matters**: Standard MoE routes full-dimension tokens (4096-dim). LatentMoE compresses to 1024-dim first, allowing 4x more experts for the same all-to-all communication cost. This is why Super can have 512 experts vs typical 128.
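
A toy sketch of the latent-compression idea described above: compress the token before routing and expert computation, then project back. This is an illustration, not NVIDIA's implementation, and the dimensions here are scaled down for readability (the real model uses 4096→1024, 512 experts, 22 active; expert internals are placeholder linear maps):

```python
# Toy LatentMoE: route and compute in a compressed latent space.
# Real Super: D_MODEL=4096 -> D_LATENT=1024 (4x), 512 experts, 22 active.
# Scaled down here (same 4x ratio) so the sketch runs instantly.
import numpy as np

D_MODEL, D_LATENT = 512, 128   # 4x compression, as in the text
N_EXPERTS, TOP_K = 64, 4       # stand-ins for 512 / 22

rng = np.random.default_rng(0)
W_down = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02   # compress
W_up = rng.standard_normal((D_LATENT, D_MODEL)) * 0.02     # expand back
router = rng.standard_normal((D_LATENT, N_EXPERTS)) * 0.02
experts = rng.standard_normal((N_EXPERTS, D_LATENT, D_LATENT)) * 0.02

def latent_moe(x: np.ndarray) -> np.ndarray:
    z = x @ W_down                         # routing happens in latent dim,
    logits = z @ router                    # so all-to-all cost shrinks 4x
    top = np.argsort(logits)[-TOP_K:]      # pick top-k experts
    weights = np.exp(logits[top])
    gates = weights / weights.sum()        # softmax over selected experts
    out = sum(g * (z @ experts[i]) for g, i in zip(gates, top))
    return out @ W_up                      # back to model dimension

y = latent_moe(rng.standard_normal(D_MODEL))
print(y.shape)  # (512,)
```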

### PinchBench — What It Tests

PinchBench is NVIDIA's agent-specific benchmark designed for OpenClaw patterns:

| Model | PinchBench Score |
|-------|-----------------|
| **Nemotron 3 Super** | **84.7-85.6%** |
| Claude Opus 4.6 | 80.8% |
| GPT-5.4 | 80.5% |

It tests:
- Multi-step tool calling chains
- Goal maintenance across long contexts
- Error recovery and retry patterns
- Structured output consistency

### Failure Mode Mitigation

Super specifically targets two agentic failure modes:

1. **Goal drift** — Agent loses alignment as context grows. Mitigated by Mamba-2's O(n) state-space layers (no quadratic attention cost growth with length) + 91.75% RULER score at 1M tokens.

2. **Tool-call failures** — Malformed function calls. Mitigated by trajectory-based RL post-training (trained on full agent sessions, not isolated completions).

### Available Formats

| Format | Size | DGX Spark Fit? | Notes |
|--------|------|----------------|-------|
| **NVFP4** | ~80 GB | Yes (tight) | Recommended. Native precision — no quality loss. |
| **FP8** | ~87 GB | Barely | No room for other models |
| **BF16** | ~240 GB | No | Needs 4x H100 or 8x A100 |
| **GGUF Q4_K_M** | ~64-72 GB | Yes | Ollama MoE GGUF not compatible with llama.cpp |

### Serving on DGX Spark — VERIFIED WORKING RECIPE

**Critical: The official NVIDIA vLLM containers (25.12, 26.02) do NOT support Super NVFP4.**
Super uses `MIXED_PRECISION` quant format (Mamba layers FP8, MoE experts NVFP4). NVIDIA's vLLM images
only support `FP8` or `NVFP4` as standalone modes, not mixed. Also, `--quantization modelopt_fp4`
and `--num-speculative-tokens` are invalid flags for this model/version.

**Solution: Use [eugr/spark-vllm-docker](https://github.com/eugr/spark-vllm-docker)** — a community
Docker setup specifically for DGX Spark that builds vLLM from nightly wheels with all necessary patches.

#### One-Time Setup (on Beast)

```bash
# 1. Clone the repo
cd ~ && git clone https://github.com/eugr/spark-vllm-docker.git

# 2. Build the Docker image (~25 GB, uses nvidia/pytorch:26.01-py3 base)
cd spark-vllm-docker && ./build-and-copy.sh

# 3. Download the model (~75 GB, 17 safetensor shards)
pip3 install --user --break-system-packages huggingface_hub[cli]
~/.local/bin/hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --local-dir ~/models/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

# 4. Create eager-mode recipe (skips 45+ min CUDA graph compilation)
cp recipes/nemotron-3-super-nvfp4.yaml recipes/nemotron-3-super-nvfp4-eager.yaml
sed -i 's|--tensor-parallel-size {tensor_parallel}|--tensor-parallel-size {tensor_parallel} --enforce-eager|' \
  recipes/nemotron-3-super-nvfp4-eager.yaml
```

#### Launch Command

```bash
cd ~/spark-vllm-docker
python3 run-recipe.py nemotron-3-super-nvfp4-eager \
  --solo --tensor-parallel 1 \
  --max-model-len 32768 --port 8003 \
  --gpu-memory-utilization 0.70 \
  --name vllm-nemotron-super -d
```

#### What the Recipe Does

The recipe sets these critical env vars and flags:
```yaml
env:
  VLLM_NVFP4_GEMM_BACKEND: "marlin"      # Marlin backend for NVFP4 (not FlashInfer)
  VLLM_TEST_FORCE_FP8_MARLIN: "1"        # Force Marlin for FP8 layers too
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"         # Atomic add for MoE routing

command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --load-format fastsafetensors \
  --reasoning-parser-plugin super_v3_reasoning_parser.py \
  --reasoning-parser super_v3 \
  --enforce-eager
```

The `mods/nemotron-super/run.sh` downloads `super_v3_reasoning_parser.py` from the model's
HuggingFace repo — this is a custom vLLM reasoning parser that separates thinking from content.

#### Startup Time

| Mode | Startup Time | Notes |
|------|-------------|-------|
| `--enforce-eager` | **~13 min** | Model load only, no graph compilation |
| Default (CUDA graphs) | **45+ min** | 88 layers × multiple seq lengths |

#### Measured GPU Usage

```
nvidia-smi output:
VLLM::EngineCore    71,765 MiB (70 GB)
```

#### API Response Format

The `super_v3` reasoning parser separates thinking from content:
```json
{
  "choices": [{
    "message": {
      "content": "Four",           // actual answer
      "reasoning": "The user asks... so answer is four."  // thinking process
    }
  }]
}
```

Our `OpenAICompatClient.create_completion()` reads `message.content` — the thinking is automatically
separated by the parser. The client-side regex also strips any leaked `<think>` tags as a safety net.

#### Important Gotchas (learned the hard way)

1. **DO NOT use `nvcr.io/nvidia/vllm:25.12.post1-py3` or `26.02-py3`** — they reject `MIXED_PRECISION`
2. **DO NOT pass `--quantization modelopt_fp4`** — model config says `modelopt`, not `modelopt_fp4`
3. **DO NOT pass `--num-speculative-tokens`** — not a valid arg in this vLLM version
4. **DO NOT pass `--reasoning-parser nemotron_v3`** — doesn't exist in NVIDIA images; use `super_v3` from the model's HF repo via `--reasoning-parser-plugin`
5. **Driver 580 is FINE** — the spark-vllm-docker image enables CUDA Forward Compatibility mode. Driver 590 actually has a CUDAGraph deadlock bug on GB10.
6. **Docker `--gpus all` works without `--runtime=nvidia`** — Beast uses CDI mode (Docker 29+)
7. **Model name is the full HF path** — `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`, not `nemotron-super`

---

## 7. Nano vs Super — Decision Matrix for Annie

| Criterion | Nano Wins | Super Wins | Weight for Annie |
|-----------|-----------|-----------|-----------------|
| Voice decode speed | ✅ 48-65 tok/s | ❌ 14-19 tok/s | **Critical** |
| Voice TTFT | ✅ 90ms | ❌ ~300ms | **Critical** |
| VRAM footprint | ✅ 18 GB | ❌ 80 GB | **High** |
| Tool accuracy | ❌ Unquantified | ✅ 97% | **Medium** (agents) |
| Complex reasoning | ❌ 38.76% SWE | ✅ 60.47% SWE | **Medium** (agents) |
| Long context retention | ❓ Not tested | ✅ 91.75% RULER | **Low** (we compact) |
| Multi-step agents | ❌ 3B active | ✅ 12B active | **Medium** |
| Quantization quality | ❌ Needs QAT | ✅ Native NVFP4 | **Low** (QAT works) |
| Cost to adopt | ✅ Already deployed | ❌ VRAM redesign | **High** |

**Score**: For voice-first + single DGX Spark, **Nano is the right choice today**.

---

## 8. Dual-Machine Architecture: Titan + Beast (RECOMMENDED)

### Hardware Available

| Machine | Spec | ConnectX-7 | Current Role |
|---------|------|-----------|-------------|
| **Titan** | DGX Spark, 128 GB unified, GB10 Blackwell | Yes | Annie voice + Context Engine + all services |
| **Beast** | DGX Spark, 128 GB unified, GB10 Blackwell | Yes | **Nemotron 3 Super (background agents)** |
| **Link** | WiFi (192.168.68.x), ~75ms latency | Available but not configured | Would drop link latency to <1ms |

### Why This Changes Everything

The single-machine VRAM constraint was the entire reason we couldn't run Super. With Beast available:

| Setup | Titan VRAM | Beast VRAM | Total Available |
|-------|-----------|-----------|-----------------|
| Current (Nano only) | 41 GB used / 87 GB free | **128 GB idle** | 215 GB free |
| Dual-model | 41 GB (unchanged) | ~80 GB Super | 135 GB free total |

**Both machines have plenty of headroom.** No VRAM juggling needed.

### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        TITAN (128 GB)                           │
│                                                                 │
│  ┌─────────────────────────────────────────┐                    │
│  │  Voice Pipeline (latency-critical)      │                    │
│  │  Nemotron Speech STT → Nano → Kokoro TTS│                    │
│  │  Nano: 18 GB, 48-65 tok/s, 90ms TTFT   │                    │
│  └─────────────────────────────────────────┘                    │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ Context      │  │ Audio        │  │ Dashboard    │          │
│  │ Engine       │  │ Pipeline     │  │ + Telegram   │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                 │
│  ┌─────────────────────────────────────────┐                    │
│  │  Agent Runtime (agent_context.py)       │                    │
│  │  ├── Voice queries → Nano (localhost)   │                    │
│  │  └── Agent queries → Super (beast:8003) │◄── NEW ROUTING     │
│  └─────────────────────────────────────────┘                    │
│                           │                                     │
└───────────────────────────┼─────────────────────────────────────┘
                            │ LAN (WiFi ~75ms today; ConnectX-7 <1ms planned)
┌───────────────────────────┼─────────────────────────────────────┐
│                        BEAST (128 GB)                           │
│                           ▼                                     │
│  ┌─────────────────────────────────────────┐                    │
│  │  Nemotron 3 Super NVFP4 (vLLM :8003)   │                    │
│  │  80 GB, 14-19 tok/s, 97% tool accuracy  │                    │
│  │  1M context, native FP4                 │                    │
│  └─────────────────────────────────────────┘                    │
│                                                                 │
│  Free: ~48 GB (for future services)                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### What Runs Where

| Workload | Machine | Model | Why |
|----------|---------|-------|-----|
| Voice conversation | **Titan** | Nano | Latency-critical, 48-65 tok/s |
| Text chat | **Titan** | Nano | Same latency requirements |
| Daily meditation | **Beast** | Super | Accuracy-critical, 97% tool accuracy |
| Weekly/monthly meditation | **Beast** | Super | Complex multi-section structured output |
| Self-improvement (post-session) | **Beast** | Super | Nuanced conversation analysis |
| Entity extraction | **Beast** | Super | Better at structured JSON extraction |
| Daily reflection generation | **Beast** | Super | Richer, more insightful reflections |
| Omi ambient processing | **Beast** | Super | Complex context understanding |
| Context compaction | **Titan** | Nano | Runs during voice session, needs to be local |
| Embedding generation | **Titan** | qwen3-embedding | Stays on Ollama, co-located with Context Engine |

### Code Changes (IMPLEMENTED — commits `f776a53`, `82dad39`)

**1. `agent_context.py` — Dual LLM client with health-checked fallback:**
```python
# __init__: reads env vars
self._beast_base_url = os.getenv("BEAST_LLM_BASE_URL", "")
self._beast_model = os.getenv("BEAST_LLM_MODEL",
    "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4")
self._beast_healthy = False

# _check_beast_health(): httpx GET to beast:8003/health (5s timeout)
# Called on startup + before each agent execution (auto-recovery)

# _get_client(): returns Beast client if healthy, else Nano (fallback)
# All AgentRunner calls are background agents — voice never goes through here
```
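
The commented excerpt above summarizes the real implementation. A self-contained sketch of the same health-checked fallback pattern follows; it uses stdlib `urllib` in place of the httpx call the source mentions, and stubs out client construction (the Beast URL default matches the address used elsewhere in this doc):

```python
# Sketch of the health-checked routing pattern: prefer Super on Beast,
# fall back to local Nano when Beast is unreachable. The real code uses
# httpx; urllib is used here so the sketch runs anywhere.
import os
import urllib.request
import urllib.error

BEAST_URL = os.getenv("BEAST_LLM_BASE_URL", "http://192.168.68.58:8003")

def beast_healthy(base_url: str = BEAST_URL, timeout_s: float = 5.0) -> bool:
    """GET /health on Beast's vLLM server; any failure means 'use Nano'."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_backend() -> str:
    # Re-checked before each agent execution, so Beast is picked up
    # automatically when it comes back online (auto-recovery).
    return "beast" if beast_healthy() else "nano"
```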

**2. `start.sh` — Auto-detects Beast on Annie startup:**
```bash
# In start_annie(): checks if Beast vLLM is healthy
# If yes → passes BEAST_LLM_BASE_URL and BEAST_LLM_MODEL env vars to Annie
# If no → agents use local Nano (current behavior, zero change)
```

**3. `start.sh` — `start_beast_llm()` uses spark-vllm-docker recipe:**
```bash
./start.sh beast     # Start Super on Beast
./start.sh status    # Shows Beast health in status display
./stop.sh beast      # Stop Super on Beast
```

**4. Voice-priority gate:** Still works but now optional — Beast GPU is independent from Titan.

### VRAM Budget (Dual-Machine)

**Titan (unchanged):**
| Component | VRAM |
|-----------|------|
| Nemotron 3 Nano (QAT v4) | 18 GB |
| Audio pipeline | 7.3 GB |
| SER | 1.2 GB |
| Nemotron Speech STT | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| qwen3-embedding (on-demand) | 14 GB |
| **Peak** | **~41 GB** |
| **Free** | **87 GB** |

**Beast (measured):**
| Component | VRAM |
|-----------|------|
| Nemotron 3 Super NVFP4 (measured) | **70 GB** (71,765 MiB per nvidia-smi) |
| vLLM overhead | included above |
| **Peak** | **~70 GB** |
| **Free** | **58 GB** |

**Combined**: 111 GB used out of 256 GB total. **145 GB headroom across both machines.**

### Latency Analysis (measured)

For background agent calls (Titan → Beast → Titan):
- Network round-trip (WiFi): **~75ms** (ConnectX-7 not configured yet, would be <1ms)
- vLLM TTFT on Beast: **~200-400ms** (estimated)
- Decode at 14-19 tok/s: **~26-36s for 500-token response**
- Total: **~30s per agent call**

This is perfectly fine for background agents (meditation, self-improvement, extraction). They have no real-time requirement — they run while voice is idle.

### Advantages Over Single-Machine

| Dimension | Single (Titan only) | Dual (Titan + Beast) |
|-----------|-------------------|---------------------|
| Voice latency | 90ms TTFT, 48 tok/s | **Same** (Nano stays on Titan) |
| Agent quality | Nano (38.76% SWE) | **Super (60.47% SWE, 97% tool)** |
| VRAM pressure | 41 GB / 128 GB | **41 + 70 = 111 / 256 GB** |
| GPU contention | Voice vs agents share GPU | **Zero contention** |
| Voice-priority gate | Required (busy-wait) | **Optional** (separate GPUs) |
| Privacy | Local | **Still fully local** (LAN only) |
| Failure isolation | Single point of failure | **Independent** (voice works if Beast down) |

### Failure Handling

If Beast goes down:
- Voice continues unaffected (Nano on Titan)
- Background agents fail gracefully (existing timeout + skip pattern in `agent_context.py`)
- Alert via observability (agent_complete events stop)
- Manual failback: point agent client to Titan's Nano temporarily

---

## 9. Other Future Paths

### Path A: Keep Nano for Everything (Fallback)
- **When**: If Beast is unavailable or Super proves unreliable
- Our current setup works. It's just not optimal for complex agents.

### Path B: Super-Distilled QAT (Complementary)
- Even with Super on Beast, we should still distill its behavior into Nano
- Use Super to generate 1000+ high-quality agent conversations
- QAT v5 Nano with Super-distilled data → better Nano for when Beast is offline
- **Effort**: Medium (we have the QAT v4 pipeline)

### Path C: Wait for DGX Spark Pro
- Rumored 256 GB unified memory
- Would fit Nano + Super on a single machine
- But dual-machine is available NOW

---

## 10. Current Status — DEPLOYED (2026-03-20)

### What's Running

| Machine | Model | Container | Image | Status |
|---------|-------|-----------|-------|--------|
| **Titan** | Nemotron 3 Nano 30B-A3B NVFP4 | `vllm-nemotron` | `nvcr.io/nvidia/vllm:25.12.post1-py3` | ✅ Running |
| **Beast** | Nemotron 3 Super 120B-A12B NVFP4 | `vllm-nemotron-super` | `vllm-node` (spark-vllm-docker) | ✅ Running |

### How to Start/Stop

```bash
# From laptop (start.sh SSHes into machines)
./start.sh beast          # Start Super on Beast
./start.sh annie          # Start Annie (auto-detects Beast)
./stop.sh beast           # Stop Super on Beast
./stop.sh annie           # Stop Annie

# Status check
./start.sh status         # Shows Beast + Titan health

# Manual health check
curl -sf http://192.168.68.58:8003/health   # Beast Super
curl -sf http://192.168.68.52:8003/health   # Titan Nano
```

### If Beast Needs Restart

```bash
# Option 1: via start.sh (handles everything)
./stop.sh beast && ./start.sh beast

# Option 2: manually on Beast
ssh beast
cd ~/spark-vllm-docker
docker rm -f vllm-nemotron-super
python3 run-recipe.py nemotron-3-super-nvfp4-eager \
  --solo --tensor-parallel 1 \
  --max-model-len 32768 --port 8003 \
  --gpu-memory-utilization 0.70 \
  --name vllm-nemotron-super -d
# Wait ~13 min for model load
curl -sf http://localhost:8003/health  # should return healthy
```
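Rather than eyeballing the ~13-minute load, a small polling helper can block until `/health` responds. This is a sketch (the `wait_healthy` name and retry defaults are made up, not part of `start.sh`); it assumes `curl` is available:

```shell
# wait_healthy: poll a vLLM /health endpoint until it responds OK.
# Usage: wait_healthy <url> [max_attempts] [sleep_seconds]
wait_healthy() {
  url="$1"
  max="${2:-90}"        # 90 attempts x 10 s = 15 min, covers the ~13 min load
  sleep_s="${3:-10}"
  i=1
  while [ "$i" -le "$max" ]; do
    if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$sleep_s"
  done
  echo "gave up after $max attempts" >&2
  return 1
}

# Example, on Beast after restarting the container:
# wait_healthy http://localhost:8003/health
```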

### If Beast Is Down

Annie auto-detects the outage and falls back to Nano; no restart needed. When Beast comes back, either restart Annie (`./stop.sh annie && ./start.sh annie`) to pick it up immediately, or just wait — agent_context.py re-checks Beast health before each agent execution.
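The probe-then-route pattern can be sketched as follows. The endpoint URLs come from the health checks above; the function names are hypothetical and this is not the actual `agent_context.py` code:

```python
import urllib.request

BEAST_SUPER = "http://192.168.68.58:8003/v1"   # Super on Beast (background agents)
TITAN_NANO = "http://192.168.68.52:8003/v1"    # Nano on Titan (voice, fallback)

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Probe the vLLM /health endpoint; any error counts as unhealthy."""
    url = base_url.rsplit("/v1", 1)[0] + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_backend(beast_ok: bool) -> str:
    """Prefer Super on Beast; fall back to Nano on Titan."""
    return BEAST_SUPER if beast_ok else TITAN_NANO

# Re-checked before each agent execution:
# base_url = choose_backend(is_healthy(BEAST_SUPER))
```

Keeping the probe timeout short (a couple of seconds) matters here: the check runs before every agent execution, so a slow or hanging probe would stall agents even when Titan is healthy.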

### Next Steps

1. **Super-Distilled QAT v5** — Use Super to generate 1000+ high-quality agent conversations, then run QAT v5 on Nano with this data → a better fallback Nano for when Beast is offline
2. **Configure ConnectX-7** — Direct cable link would reduce latency from 75ms (WiFi) to <1ms
3. **Enable CUDA graphs** — Once startup time isn't a concern (e.g., always-on Beast), remove `--enforce-eager` for better inference speed
4. **Benchmark agent quality** — Compare meditation/self-improvement output quality between Nano and Super

---

## Sources

- [spark-vllm-docker: DGX Spark vLLM Docker (community, USED FOR DEPLOYMENT)](https://github.com/eugr/spark-vllm-docker)
- [NVIDIA NemoClaw GitHub](https://github.com/NVIDIA/NemoClaw)
- [NVIDIA Nemotron Super vLLM Cookbook](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/usage-cookbook/Nemotron-3-Super/AdvancedDeploymentGuide)
- [NVIDIA Technical Blog: Introducing Nemotron 3 Super](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/)
- [NVIDIA Blog: Nemotron 3 Super Delivers 5x Higher Throughput](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/)
- [NVIDIA Research: Nemotron 3 Super](https://research.nvidia.com/labs/nemotron/Nemotron-3-Super/)
- [NVIDIA Research: Nemotron 3 Family](https://research.nvidia.com/labs/nemotron/Nemotron-3/)
- [Nemotron 3 Super Technical Report (PDF)](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)
- [NVIDIA NIM: Nemotron 3 Super Model Card](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b/modelcard)
- [HuggingFace: Nemotron 3 Super NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
- [HuggingFace: Nemotron 3 Super FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8)
- [HuggingFace: Nemotron 3 Super BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16)
- [vLLM Blog: Nemotron 3 Super](https://vllm.ai/blog/nemotron-3-super)
- [Daily.co: Nemotron 3 Super for Voice AI](https://www.daily.co/blog/nvidia-nemotron-3-super/)
- [Running Nemotron 3 Super on DGX Spark](https://blog.kubesimplify.com/nemotron3-on-dgx-spark)
- [NVIDIA Forums: DGX Spark Nemotron3 NVFP4 65+ tps](https://forums.developer.nvidia.com/t/dgx-spark-nemotron3-and-nvfp4-getting-to-65-tps/355261)
- [NVIDIA Forums: Nemotron 3 Super NVFP4 on DGX Spark](https://forums.developer.nvidia.com/t/nvidia-nemotron-3-super-120b-a12b-nvfp4/363175)
- [VentureBeat: NemoClaw](https://venturebeat.com/technology/nvidia-lets-its-claws-out-nemoclaw-brings-security-scale-to-the-agent)
- [The New Stack: NemoClaw = OpenClaw with Guardrails](https://thenewstack.io/nemoclaw-openclaw-with-guardrails/)
- [NVIDIA: NemoClaw Official Page](https://www.nvidia.com/en-us/ai/nemoclaw/)
- [NVIDIA OpenShell Technical Blog](https://developer.nvidia.com/blog/run-autonomous-self-evolving-agents-more-safely-with-nvidia-openshell/)
- [NVIDIA: Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/)
- [Unsloth: Nemotron 3 Super How To Run Guide](https://unsloth.ai/docs/models/nemotron-3-super)
- [Ollama: nemotron-3-super](https://ollama.com/library/nemotron-3-super)
- [DataCamp: Nemotron 3 Architecture, Benchmarks, Comparisons](https://www.datacamp.com/blog/nvidia-nemotron-3)
- [llm-stats: Nemotron 3 Super Launch](https://llm-stats.com/blog/research/nemotron-3-super-launch)
- [Artificial Analysis: Nemotron 3 Super](https://artificialanalysis.ai/models/nvidia-nemotron-3-super-120b-a12b)
- [OpenClaw Report: Nemotron 3 Super PinchBench Leader](https://openclaw.report/ecosystem/nemotron-3-super-pinchbench-leader)
- [Saiyam Pathak: Running Nemotron 3 Super on DGX Spark](https://saiyampathak.substack.com/p/heres-what-i-learned-about-nemotron)
- [NVIDIA GTC 2026: RTX AI Garage NemoClaw](https://blogs.nvidia.com/blog/rtx-ai-garage-gtc-2026-nemoclaw/)
