# Research: Nemotron 3 Super vs Nano — Suitability for Annie on DGX Spark

**Date:** 2026-03-19
**Status:** Complete
**Question:** Are we missing out by using Nemotron 3 Nano instead of Super for Annie's agentic framework?
**Context:** NemoClaw (NVIDIA's enterprise agent stack) exclusively showcases Nemotron 3 Super. Our Annie agentic framework uses Nemotron 3 Nano.

---

## TL;DR — The Verdict

**We are NOT missing out for voice. We ARE missing out for background agents. And we have the hardware to fix it NOW.**

| Workload | Best Model | Why |
|----------|-----------|-----|
| Voice (latency-critical) | **Nano on Titan** (keep current) | 48-65 tok/s vs 14-19 tok/s. Voice needs speed, not smarts. |
| Background agents (accuracy-critical) | **Super on Beast** (deploy now) | +22 pts on SWE-Bench, 97% tool accuracy, 1M context |
| Dual-model feasibility | **Titan + Beast** | Both DGX Spark (128 GB each), LAN link (~75ms WiFi today; ConnectX-7 <1ms once configured). No VRAM conflict. |

**Recommended: Deploy Super NVFP4 on Beast, route background agents to it.** This is exactly what NemoClaw does (Super as brain, Nano as edge executor) — except we run both locally for privacy. ~2 hours of setup, ~20 lines of code change, zero voice regression risk.

---

## Why vLLM, Not Ollama?

NemoClaw uses Ollama for local inference. We chose vLLM. Here's why:

| Factor | vLLM (our choice) | Ollama |
|--------|-------------------|--------|
| **Model format** | NVFP4 — **native training precision**, zero quality loss | GGUF — re-quantized, potential quality degradation |
| **MoE compatibility** | Works via Marlin backend (spark-vllm-docker) | **Known issue**: Ollama MoE GGUF blobs NOT compatible with upstream llama.cpp |
| **Tool calling** | `--tool-call-parser qwen3_coder` — battle-tested | Basic support, less reliable for complex chains |
| **Reasoning parser** | `super_v3` plugin — cleanly separates thinking from content | None — raw `<think>` tags leak into responses |
| **Consistency** | Same as Titan (Nano on vLLM) — identical API + client code | Different API quirks, different debugging surface |
| **Setup effort** | Medium (spark-vllm-docker build + recipe) | Easy (`ollama pull nemotron-3-super`) |
| **NemoClaw stance** | "Experimental" for local | "Experimental" for local — default is cloud API |

**Bottom line:** For background agents where quality is the whole point (97% tool accuracy, +22 SWE-Bench), using the native NVFP4 precision via vLLM is the right call. Ollama would be fine for a quick demo but not for production agents.

---

## 1. Model Specifications — Side by Side

| Property | Nemotron 3 Nano (30B-A3B) | Nemotron 3 Super (120B-A12B) |
|----------|--------------------------|------------------------------|
| **Total parameters** | 31.6B | 120B |
| **Active parameters** | 3.2B per token | 12B per token (4x Nano) |
| **Architecture** | Hybrid Mamba-2 + Transformer + MoE | Hybrid Mamba-2 + Transformer + **Latent** MoE |
| **Expert count** | ~128 experts, ~6 active | **512 experts, 22 active** |
| **Context window** | 1M tokens | 1M tokens |
| **Training precision** | BF16 → post-training quant | **Native NVFP4** (trained from scratch in FP4) |
| **Multi-Token Prediction** | No | **Yes** (built-in speculative decoding) |
| **NVFP4 model size** | ~7.5 GB (our QAT v4: 18 GB) | ~80 GB |
| **Release** | 2025 | March 11, 2026 (GTC 2026) |
| **License** | Nemotron Open (commercial) | Nemotron Open (commercial) |
| **Open artifacts** | Weights | Weights + 10T training data + RL environments + recipes |

### Architecture Deep Dive: What Makes Super Different

Super has three innovations that Nano lacks:

1. **LatentMoE** — Tokens are compressed to 1/4 dimension (4096→1024) before expert routing. This allows 512 experts (vs ~128) with 22 active per token (vs ~6). Result: 4x more expert capacity for the same compute. Nano uses standard MoE without latent compression.

2. **Multi-Token Prediction (MTP)** — Predicts multiple future tokens per forward pass. Acts as built-in speculative decoding without needing a separate draft model. On SPEED-Bench, Super achieves the highest average acceptance length (3.45 tokens), beating DeepSeek-R1. This gives 2-3x speedup on structured generation (tool calls, JSON). Nano has no MTP.

3. **Native NVFP4 Training** — Super was pre-trained from scratch in FP4 precision. This means NVFP4 quantization is lossless (the model never knew higher precision). Nano was trained in BF16 and quantized post-training, which introduces quality loss (we experienced this — our QAT v4 was needed to recover behavioral quality after NVFP4 quantization).
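
The MTP speedup claim in point 2 can be checked with quick arithmetic. A rough sketch, assuming speculative decoding accepts `acceptance_length` tokens per verification pass at some per-pass overhead; the 20% overhead factor here is a hypothetical illustration, not a measured number:

```python
# Back-of-envelope for MTP-style speculative decoding. "acceptance_length"
# is the average tokens accepted per forward pass (3.45 for Super on
# SPEED-Bench, per the text); "step_overhead" is an ASSUMED relative cost
# of running the MTP head plus verification each pass.

def effective_speedup(acceptance_length: float, step_overhead: float = 1.2) -> float:
    """Tokens accepted per pass divided by relative cost per pass."""
    return acceptance_length / step_overhead

# With an assumed 20% per-pass overhead, 3.45 accepted tokens/pass lands
# in the same ballpark as the "2-3x on structured generation" figure above.
print(f"{effective_speedup(3.45):.2f}x")
```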

---

## 2. Quality Benchmarks: Super vs Nano

### Head-to-Head

| Benchmark | Nano (30B-A3B) | Super (120B-A12B) | Delta | Significance |
|-----------|----------------|-------------------|-------|-------------|
| **MMLU-Pro** | 78.30% | 83.73% | **+5.43** | General knowledge |
| **GPQA (no tools)** | 73.04% | 79.23% | **+6.19** | Graduate-level science |
| **GPQA (with tools)** | 75.00% | 82.70% | **+7.70** | Tool-augmented reasoning |
| **LiveCodeBench** | 68.25% | 81.19% | **+12.94** | Non-contaminated coding |
| **SWE-Bench Verified** | 38.76% | 60.47% | **+21.71** | Multi-file code changes |
| **SWE-Bench Multilingual** | N/A | 45.78% | — | Cross-language coding |
| **Tool calling accuracy** | Not quantified | **97.0%** | — | vs GPT-4.1 at 96.3% |
| **PinchBench (agentic)** | Not reported | **85.6%** | — | #1 open model |
| **RULER (1M context)** | Not reported | **91.75%** | — | Context retention |

### Where Super Dominates

The gap widens dramatically on **agentic tasks**:
- **SWE-Bench**: +21.71 points — this tests exactly what background agents do (multi-step reasoning, tool use, code generation)
- **Tool calling**: 97% accuracy — critical for meditation, self-improvement, and extraction agents that rely on structured tool outputs
- **LiveCodeBench**: +12.94 points — complex multi-step problem solving
- **PinchBench**: 85.6% — NVIDIA's own benchmark for OpenClaw-style agents. Super beats Claude Opus 4.6 (80.8%) and GPT-5.4 (80.5%)

### Where Nano is Sufficient

- **MMLU-Pro**: Only 5.43 points behind — for conversational knowledge, Nano is fine
- **Simple tool calls**: Nano handles single-round `web_search` and `save_note` reliably (our QAT training ensured this)
- **Voice conversation**: Personality, warmth, conciseness — all behavioral, not capability-driven. Nano + QAT nails this.

### Comparison with Frontier Models

| Benchmark | N3 Super | Claude Opus 4.6 | GPT-5.4 | Qwen3.5-122B |
|-----------|----------|-----------------|---------|--------------|
| **PinchBench** | **85.6%** | 80.8% | 80.5% | N/A |
| **MMLU-Pro** | 83.73 | ~92+ | ~90+ | ~80.9 |
| **SWE-Bench** | 60.47 | ~65+ | ~60+ | 66.40 |
| **RULER (1M)** | 91.75 | N/A | N/A | 91.33 |
| **Throughput** | 1x | API only | API only | 0.13x |

Super is competitive with frontier models on agentic tasks while being fully local (privacy-preserving).

---

## 3. Performance on DGX Spark — Head to Head

### Our Measured Numbers (Nano) vs Community Reports (Super)

| Metric | Nano (measured on Titan) | Super (community DGX Spark) | Impact |
|--------|------------------------|----------------------------|--------|
| **NVFP4 model size** | 18 GB (QAT v4) | ~80 GB | 4.4x more VRAM |
| **TTFT (1 req)** | 90 ms | ~200-400 ms (estimated) | 2-4x slower first token |
| **Decode speed** | 48-65 tok/s | 14-19.5 tok/s | **2.5-4.6x slower** |
| **Prefill speed** | Not measured | ~1,000 tok/s | — |
| **Max concurrent** | 2-3 before degradation | 1 (VRAM-limited) | Less headroom |
| **VRAM headroom** | 87 GB free | ~0-48 GB free | Much tighter |
| **KV cache budget** | Abundant | Limited (depends on MTP) | Shorter effective context |

### Throughput Analysis for Voice

**Voice latency budget**: TTFT < 700ms, decode > 25 tok/s for natural speech pacing.

- **Nano**: 90ms TTFT + 48 tok/s decode = **well within budget**. Comfortable margin.
- **Super**: ~300ms TTFT + 14-19.5 tok/s decode = **below the 25 tok/s decode threshold**. Would feel sluggish. A 50-word (~65-token) response takes ~4.6s at 14 tok/s vs ~1.4s at 48 tok/s.

**Verdict for voice: Nano wins decisively.** Super is too slow for real-time voice on DGX Spark.
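
The budget math above can be sketched as a quick check. The TTFT and decode figures come from the tables in this section; the ~1.3 tokens-per-word ratio is an assumption (a common rule of thumb, not measured here):

```python
# Latency-budget check for a voice reply.
# ASSUMPTION: ~1.3 tokens per English word.

def reply_seconds(words: int, ttft_ms: float, decode_tok_s: float,
                  tokens_per_word: float = 1.3) -> float:
    """Total wall time: time-to-first-token plus decode time."""
    tokens = words * tokens_per_word
    return ttft_ms / 1000.0 + tokens / decode_tok_s

nano = reply_seconds(50, ttft_ms=90, decode_tok_s=48)     # ~1.4 s
super_ = reply_seconds(50, ttft_ms=300, decode_tok_s=14)  # ~4.9 s
print(f"Nano: {nano:.1f}s  Super: {super_:.1f}s")
```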

### Throughput Analysis for Background Agents

Background agents (meditation, self-improvement, extraction) have no latency requirement — they run while voice is idle.

- **Nano**: Good enough for simple extraction, but struggles with complex multi-step agent tasks (38.76% on SWE-Bench)
- **Super**: 60.47% on SWE-Bench, 97% tool accuracy, 85.6% PinchBench. Even at 14 tok/s, a 500-token meditation response takes ~36s — perfectly acceptable for a background job.

**Verdict for background agents: Super wins decisively.** Latency doesn't matter; quality does.

---

## 4. VRAM Budget: Can We Run Both?

### Current Budget (Nano only)

| Component | VRAM |
|-----------|------|
| Nemotron 3 Nano (QAT v4 NVFP4) | 18 GB |
| Audio pipeline (Whisper + pyannote + speaker) | 7.3 GB |
| SER (emotion2vec + wav2vec2) | 1.2 GB |
| Nemotron Speech STT 0.6B | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| qwen3-embedding:8b (on-demand) | 14 GB |
| **Total peak** | **~41 GB** |
| **Free** | **87 GB** |

### Hypothetical: Dual-Model (Nano + Super)

| Component | VRAM |
|-----------|------|
| Nemotron 3 Nano (voice) | 18 GB |
| Nemotron 3 Super NVFP4 (agents) | 80 GB |
| Audio pipeline | 7.3 GB |
| SER | 1.2 GB |
| Nemotron Speech STT | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| **Total (without embeddings)** | **~109.5 GB** |
| **Free** | **18.5 GB** |

**Problem**: This leaves essentially no margin under our 110 GB safety ceiling. And we haven't loaded qwen3-embedding:8b (14 GB), which would push peak usage to ~123.5 GB.

### Option: Super replaces Nano + Ollama extraction

| Component | VRAM |
|-----------|------|
| Nemotron 3 Super NVFP4 (everything) | 80 GB |
| Audio pipeline | 7.3 GB |
| SER | 1.2 GB |
| Nemotron Speech STT | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| qwen3-embedding:8b (on-demand) | 14 GB |
| **Total peak** | **~105.5 GB** |
| **Free** | **22.5 GB** |

This fits under 110 GB, but voice latency suffers (14-19 tok/s). **Not viable.**

### Option: Super loaded on-demand (swapped with embeddings)

| Mode | Models Loaded | VRAM |
|------|--------------|------|
| **Voice active** | Nano (18) + audio (11.5) + Kokoro (0.5) | ~30 GB |
| **Agent active** (voice idle) | Super (80) + audio (11.5) | ~91.5 GB |
| **Embedding burst** | Nano (18) + embedding (14) + audio (11.5) | ~43.5 GB |

This works if Super is loaded/unloaded on demand. But model loading takes 2-5 minutes for an 80 GB model — not practical for frequent switching.
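
The mode table above can be expressed as a small budget check (component sizes in GB are taken from the tables in this section; the 110 GB ceiling is our stated safety limit):

```python
# Peak-VRAM check per operating mode, against the 110 GB safety ceiling.
# Component sizes (GB) come from the budget tables in this section.

CEILING_GB = 110

MODES = {
    "voice_active":    {"nano": 18, "audio": 11.5, "kokoro": 0.5},
    "agent_active":    {"super": 80, "audio": 11.5},
    "embedding_burst": {"nano": 18, "embedding": 14, "audio": 11.5},
}

for mode, parts in MODES.items():
    total = sum(parts.values())
    status = "OK" if total <= CEILING_GB else "OVER"
    print(f"{mode}: {total:.1f} GB [{status}]")
```

Every mode fits individually; the problem is purely the 2-5 minute load time when switching modes.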

### Verdict: Dual-Model Not Practical on Single DGX Spark

The VRAM budget is too tight for running both simultaneously. On-demand swapping is too slow. **Super as a standalone replacement kills voice latency.** On a single machine, the viable options are:

1. **Keep Nano for everything today** (it's working well)
2. **Use Super via API** when needed (NVIDIA NIM, or a second machine)
3. **Wait for DGX Spark Pro** (256 GB) or use a cloud GPU for Super workloads

---

## 5. NemoClaw vs Annie's Agentic Framework

### Architecture Comparison

| Feature | NemoClaw (NVIDIA) | Annie Agentic Framework |
|---------|-------------------|------------------------|
| **LLM brain** | Nemotron 3 Super (120B-A12B) | Nemotron 3 Nano (30B-A3B) |
| **Edge executor** | Nemotron 3 Nano (sidecar) | Same Nano (single model) |
| **Agent framework** | OpenClaw (TypeScript) | Custom Python (6 modules) |
| **Workspace files** | SOUL.md, RULES.md, USER.md, TOOLS.md | Same pattern (adopted from OpenClaw) |
| **Self-improvement** | Yes (OpenClaw pattern) | Yes (self_improve.py) |
| **Meditation/reflection** | Not mentioned | Yes (meditation.py — daily/weekly/monthly) |
| **Security sandbox** | OpenShell (process isolation, YAML policies) | Workspace allowlist + path traversal guard |
| **Privacy routing** | Differential privacy router (local vs cloud) | No cloud fallback (ADR-004: privacy-first) |
| **Tool calling** | Super (97% accuracy) | Nano (good but unquantified) |
| **Agent discovery** | Plugin manifest registry | YAML agent definitions (hot-reload) |
| **Scheduling** | Not documented (OpenClaw is on-demand) | Cron/interval/one-shot scheduler |
| **Voice integration** | Telephony (Twilio/Telnyx) | WebRTC + local GPU (Pipecat) |
| **Hardware target** | DGX Station / cloud | DGX Spark (128 GB) |

### What NemoClaw Has That We Don't

1. **OpenShell Security Runtime** — Process-level isolation per agent session with YAML-based policy enforcement. Our workspace_io.py uses file allowlists, but doesn't sandbox agent execution at the OS level. For a single-user personal AI, this is overkill, but for multi-tenant scenarios it matters.

2. **Differential Privacy Router** — Routes sensitive data to local models, general tasks to cloud. We don't need this because we never use cloud models (ADR-004), but the concept of routing by data sensitivity is interesting for future multi-model setups.

3. **Super's 97% Tool Accuracy** — This is the one genuine capability gap. Our Nano handles simple tool calls (web_search, save_note) well, but for complex multi-step agent tasks (meditation producing structured JSON, extraction parsing entities), Super's 97% accuracy would reduce errors and retries.

4. **PinchBench-Validated Agent Patterns** — NVIDIA specifically optimized Super for the tool-calling patterns that OpenClaw agents use. Since our agentic framework is modeled after OpenClaw, we'd benefit from a model tuned for exactly these patterns.

### What We Have That NemoClaw Doesn't

1. **Meditation System** — Multi-timescale self-reflection (daily/weekly/monthly) with journaling. NemoClaw's OpenClaw base has self-improvement but not structured reflection.

2. **Voice-First Architecture** — Our entire system is designed around real-time voice with GPU-local STT/TTS. NemoClaw targets telephony (Twilio), which adds 200-500ms of network latency.

3. **Emotional Awareness** — SER pipeline (emotion2vec + wav2vec2) feeds emotional context into prompts. NemoClaw has no emotion recognition.

4. **Ambient Context (Omi)** — Continuous ambient awareness from wearable device. NemoClaw agents are on-demand, not always-listening.

5. **Scheduled Background Agents** — Our AgentScheduler supports cron/interval/one-shot scheduling. NemoClaw's OpenClaw agents are triggered, not scheduled.

### Assessment: Are We Missing Out?

**For voice conversations**: No. Nano at 48-65 tok/s with QAT behavioral training is excellent. NemoClaw doesn't even target real-time voice on local hardware.

**For agentic background tasks**: Partially. Super's quality advantage is real:
- Meditation prompts are complex (multi-section structured output with journal entries, fact extraction, SOUL.md proposals). Super's 97% tool accuracy and +21 pts on SWE-Bench would produce higher-quality self-reflections.
- Self-improvement requires parsing conversations for subtle learnings. Super's larger active parameter count (12B vs 3B) gives it more capacity for nuanced analysis.
- But these are **background tasks with no latency requirement**. The question is whether Nano's quality is "good enough" or whether the errors compound over time.

**For security**: OpenShell is overkill for a single-user personal AI. Our allowlist + sanitization approach is appropriate.

---

## 6. NemoClaw's Nemotron 3 Super — Key Technical Details

### Architecture: Why It's Fast Despite 120B Parameters

```
88-layer hybrid stack:
├── Mamba-2 layers (bulk) ← O(n) scaling, not O(n²)
│   └── Paired with LatentMoE (512 experts, 22 active)
│       └── Latent compression: 4096→1024 before routing (4x savings)
├── Self-attention "anchor" layers (sparse) ← Precision reasoning
└── Multi-Token Prediction head ← 2-3x speedup on structured output
```

**Why 120B total but only 12B active**: MoE activates only 22 of 512 experts per token. Each expert is small (~230M params). The "120B" is the sum of all experts, but compute cost per token is comparable to a 12B dense model.

**Why LatentMoE matters**: Standard MoE routes full-dimension tokens (4096-dim). LatentMoE compresses to 1024-dim first, allowing 4x more experts for the same all-to-all communication cost. This is why Super can have 512 experts vs typical 128.
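
A toy sketch of the latent-compression idea described above: compress the token before routing and expert computation, then project back. This is an illustration, not NVIDIA's implementation, and the dimensions here are scaled down for readability (the real model uses 4096→1024, 512 experts, 22 active; expert internals are placeholder linear maps):

```python
# Toy LatentMoE: route and compute in a compressed latent space.
# Real Super: D_MODEL=4096 -> D_LATENT=1024 (4x), 512 experts, 22 active.
# Scaled down here (same 4x ratio) so the sketch runs instantly.
import numpy as np

D_MODEL, D_LATENT = 512, 128   # 4x compression, as in the text
N_EXPERTS, TOP_K = 64, 4       # stand-ins for 512 / 22

rng = np.random.default_rng(0)
W_down = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02   # compress
W_up = rng.standard_normal((D_LATENT, D_MODEL)) * 0.02     # expand back
router = rng.standard_normal((D_LATENT, N_EXPERTS)) * 0.02
experts = rng.standard_normal((N_EXPERTS, D_LATENT, D_LATENT)) * 0.02

def latent_moe(x: np.ndarray) -> np.ndarray:
    z = x @ W_down                         # routing happens in latent dim,
    logits = z @ router                    # so all-to-all cost shrinks 4x
    top = np.argsort(logits)[-TOP_K:]      # pick top-k experts
    weights = np.exp(logits[top])
    gates = weights / weights.sum()        # softmax over selected experts
    out = sum(g * (z @ experts[i]) for g, i in zip(gates, top))
    return out @ W_up                      # back to model dimension

y = latent_moe(rng.standard_normal(D_MODEL))
print(y.shape)  # (512,)
```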

### PinchBench — What It Tests

PinchBench is NVIDIA's agent-specific benchmark designed for OpenClaw patterns:

| Model | PinchBench Score |
|-------|-----------------|
| **Nemotron 3 Super** | **84.7-85.6%** |
| Claude Opus 4.6 | 80.8% |
| GPT-5.4 | 80.5% |

It tests:
- Multi-step tool calling chains
- Goal maintenance across long contexts
- Error recovery and retry patterns
- Structured output consistency

### Failure Mode Mitigation

Super specifically targets two agentic failure modes:

1. **Goal drift** — Agent loses alignment as context grows. Mitigated by Mamba-2's O(n) state-space layers (no quadratic attention cost growth with length) + 91.75% RULER score at 1M tokens.

2. **Tool-call failures** — Malformed function calls. Mitigated by trajectory-based RL post-training (trained on full agent sessions, not isolated completions).

### Available Formats

| Format | Size | DGX Spark Fit? | Notes |
|--------|------|----------------|-------|
| **NVFP4** | ~80 GB | Yes (tight) | Recommended. Native precision — no quality loss. |
| **FP8** | ~87 GB | Barely | No room for other models |
| **BF16** | ~240 GB | No | Needs 4x H100 or 8x A100 |
| **GGUF Q4_K_M** | ~64-72 GB | Yes | Ollama MoE GGUF not compatible with llama.cpp |

### Serving on DGX Spark — VERIFIED WORKING RECIPE

**Critical: The official NVIDIA vLLM containers (25.12, 26.02) do NOT support Super NVFP4.**
Super uses `MIXED_PRECISION` quant format (Mamba layers FP8, MoE experts NVFP4). NVIDIA's vLLM images
only support `FP8` or `NVFP4` as standalone modes, not mixed. Also, `--quantization modelopt_fp4`
and `--num-speculative-tokens` are invalid flags for this model/version.

**Solution: Use [eugr/spark-vllm-docker](https://github.com/eugr/spark-vllm-docker)** — a community
Docker setup specifically for DGX Spark that builds vLLM from nightly wheels with all necessary patches.

#### One-Time Setup (on Beast)

```bash
# 1. Clone the repo
cd ~ && git clone https://github.com/eugr/spark-vllm-docker.git

# 2. Build the Docker image (~25 GB, uses nvidia/pytorch:26.01-py3 base)
cd spark-vllm-docker && ./build-and-copy.sh

# 3. Download the model (~75 GB, 17 safetensor shards)
pip3 install --user --break-system-packages huggingface_hub[cli]
~/.local/bin/hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --local-dir ~/models/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

# 4. Create eager-mode recipe (skips 45+ min CUDA graph compilation)
cp recipes/nemotron-3-super-nvfp4.yaml recipes/nemotron-3-super-nvfp4-eager.yaml
sed -i 's|--tensor-parallel-size {tensor_parallel}|--tensor-parallel-size {tensor_parallel} --enforce-eager|' \
  recipes/nemotron-3-super-nvfp4-eager.yaml
```

#### Launch Command

```bash
cd ~/spark-vllm-docker
python3 run-recipe.py nemotron-3-super-nvfp4-eager \
  --solo --tensor-parallel 1 \
  --max-model-len 32768 --port 8003 \
  --gpu-memory-utilization 0.70 \
  --name vllm-nemotron-super -d
```

#### What the Recipe Does

The recipe sets these critical env vars and flags:
```yaml
env:
  VLLM_NVFP4_GEMM_BACKEND: "marlin"      # Marlin backend for NVFP4 (not FlashInfer)
  VLLM_TEST_FORCE_FP8_MARLIN: "1"        # Force Marlin for FP8 layers too
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"         # Atomic add for MoE routing

command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --load-format fastsafetensors \
  --reasoning-parser-plugin super_v3_reasoning_parser.py \
  --reasoning-parser super_v3 \
  --enforce-eager
```

The `mods/nemotron-super/run.sh` downloads `super_v3_reasoning_parser.py` from the model's
HuggingFace repo — this is a custom vLLM reasoning parser that separates thinking from content.

#### Startup Time

| Mode | Startup Time | Notes |
|------|-------------|-------|
| `--enforce-eager` | **~13 min** | Model load only, no graph compilation |
| Default (CUDA graphs) | **45+ min** | 88 layers × multiple seq lengths |

#### Measured GPU Usage

```
nvidia-smi output:
VLLM::EngineCore    71,765 MiB (70 GB)
```

#### API Response Format

The `super_v3` reasoning parser separates thinking from content:
```json
{
  "choices": [{
    "message": {
      "content": "Four",           // actual answer
      "reasoning": "The user asks... so answer is four."  // thinking process
    }
  }]
}
```

Our `OpenAICompatClient.create_completion()` reads `message.content` — the thinking is automatically
separated by the parser. The client-side regex also strips any leaked `<think>` tags as a safety net.

#### Important Gotchas (learned the hard way)

1. **DO NOT use `nvcr.io/nvidia/vllm:25.12.post1-py3` or `26.02-py3`** — they reject `MIXED_PRECISION`
2. **DO NOT pass `--quantization modelopt_fp4`** — model config says `modelopt`, not `modelopt_fp4`
3. **DO NOT pass `--num-speculative-tokens`** — not a valid arg in this vLLM version
4. **DO NOT pass `--reasoning-parser nemotron_v3`** — doesn't exist in NVIDIA images; use `super_v3` from the model's HF repo via `--reasoning-parser-plugin`
5. **Driver 580 is FINE** — the spark-vllm-docker image enables CUDA Forward Compatibility mode. Driver 590 actually has a CUDAGraph deadlock bug on GB10.
6. **Docker `--gpus all` works without `--runtime=nvidia`** — Beast uses CDI mode (Docker 29+)
7. **Model name is the full HF path** — `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`, not `nemotron-super`

---

## 7. Nano vs Super — Decision Matrix for Annie

| Criterion | Nano Wins | Super Wins | Weight for Annie |
|-----------|-----------|-----------|-----------------|
| Voice decode speed | ✅ 48-65 tok/s | ❌ 14-19 tok/s | **Critical** |
| Voice TTFT | ✅ 90ms | ❌ ~300ms | **Critical** |
| VRAM footprint | ✅ 18 GB | ❌ 80 GB | **High** |
| Tool accuracy | ❌ Unquantified | ✅ 97% | **Medium** (agents) |
| Complex reasoning | ❌ 38.76% SWE | ✅ 60.47% SWE | **Medium** (agents) |
| Long context retention | ❓ Not tested | ✅ 91.75% RULER | **Low** (we compact) |
| Multi-step agents | ❌ 3B active | ✅ 12B active | **Medium** |
| Quantization quality | ❌ Needs QAT | ✅ Native NVFP4 | **Low** (QAT works) |
| Cost to adopt | ✅ Already deployed | ❌ VRAM redesign | **High** |

**Score**: For voice-first + single DGX Spark, **Nano is the right choice today**.

---

## 8. Dual-Machine Architecture: Titan + Beast (RECOMMENDED)

### Hardware Available

| Machine | Spec | ConnectX-7 | Current Role |
|---------|------|-----------|-------------|
| **Titan** | DGX Spark, 128 GB unified, GB10 Blackwell | Yes | Annie voice + Context Engine + all services |
| **Beast** | DGX Spark, 128 GB unified, GB10 Blackwell | Yes | **Nemotron 3 Super (background agents)** |
| **Link** | WiFi (192.168.68.x), ~75ms latency | Available but not configured | Would drop link latency to <1ms |

### Why This Changes Everything

The single-machine VRAM constraint was the entire reason we couldn't run Super. With Beast available:

| Setup | Titan VRAM | Beast VRAM | Total Available |
|-------|-----------|-----------|-----------------|
| Current (Nano only) | 41 GB used / 87 GB free | **128 GB idle** | 215 GB free |
| Dual-model | 41 GB (unchanged) | ~80 GB Super | 135 GB free total |

**Both machines have plenty of headroom.** No VRAM juggling needed.

### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        TITAN (128 GB)                           │
│                                                                 │
│  ┌─────────────────────────────────────────┐                    │
│  │  Voice Pipeline (latency-critical)      │                    │
│  │  Nemotron Speech STT → Nano → Kokoro TTS│                    │
│  │  Nano: 18 GB, 48-65 tok/s, 90ms TTFT   │                    │
│  └─────────────────────────────────────────┘                    │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │ Context      │  │ Audio        │  │ Dashboard    │          │
│  │ Engine       │  │ Pipeline     │  │ + Telegram   │          │
│  └──────────────┘  └──────────────┘  └──────────────┘          │
│                                                                 │
│  ┌─────────────────────────────────────────┐                    │
│  │  Agent Runtime (agent_context.py)       │                    │
│  │  ├── Voice queries → Nano (localhost)   │                    │
│  │  └── Agent queries → Super (beast:8003) │◄── NEW ROUTING     │
│  └─────────────────────────────────────────┘                    │
│                           │                                     │
└───────────────────────────┼─────────────────────────────────────┘
                            │ LAN (WiFi ~75ms today; ConnectX-7 <1ms planned)
┌───────────────────────────┼─────────────────────────────────────┐
│                        BEAST (128 GB)                           │
│                           ▼                                     │
│  ┌─────────────────────────────────────────┐                    │
│  │  Nemotron 3 Super NVFP4 (vLLM :8003)   │                    │
│  │  80 GB, 14-19 tok/s, 97% tool accuracy  │                    │
│  │  1M context, native FP4                 │                    │
│  └─────────────────────────────────────────┘                    │
│                                                                 │
│  Free: ~48 GB (for future services)                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### What Runs Where

| Workload | Machine | Model | Why |
|----------|---------|-------|-----|
| Voice conversation | **Titan** | Nano | Latency-critical, 48-65 tok/s |
| Text chat | **Titan** | Nano | Same latency requirements |
| Daily meditation | **Beast** | Super | Accuracy-critical, 97% tool accuracy |
| Weekly/monthly meditation | **Beast** | Super | Complex multi-section structured output |
| Self-improvement (post-session) | **Beast** | Super | Nuanced conversation analysis |
| Entity extraction | **Beast** | Super | Better at structured JSON extraction |
| Daily reflection generation | **Beast** | Super | Richer, more insightful reflections |
| Omi ambient processing | **Beast** | Super | Complex context understanding |
| Context compaction | **Titan** | Nano | Runs during voice session, needs to be local |
| Embedding generation | **Titan** | qwen3-embedding | Stays on Ollama, co-located with Context Engine |

### Code Changes (IMPLEMENTED — commits `f776a53`, `82dad39`)

**1. `agent_context.py` — Dual LLM client with health-checked fallback:**
```python
# __init__: reads env vars
self._beast_base_url = os.getenv("BEAST_LLM_BASE_URL", "")
self._beast_model = os.getenv("BEAST_LLM_MODEL",
    "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4")
self._beast_healthy = False

# _check_beast_health(): httpx GET to beast:8003/health (5s timeout)
# Called on startup + before each agent execution (auto-recovery)

# _get_client(): returns Beast client if healthy, else Nano (fallback)
# All AgentRunner calls are background agents — voice never goes through here
```
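
The commented excerpt above summarizes the real implementation. A self-contained sketch of the same health-checked fallback pattern follows; it uses stdlib `urllib` in place of the httpx call the source mentions, and stubs out client construction (the Beast URL default matches the address used elsewhere in this doc):

```python
# Sketch of the health-checked routing pattern: prefer Super on Beast,
# fall back to local Nano when Beast is unreachable. The real code uses
# httpx; urllib is used here so the sketch runs anywhere.
import os
import urllib.request
import urllib.error

BEAST_URL = os.getenv("BEAST_LLM_BASE_URL", "http://192.168.68.58:8003")

def beast_healthy(base_url: str = BEAST_URL, timeout_s: float = 5.0) -> bool:
    """GET /health on Beast's vLLM server; any failure means 'use Nano'."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_backend() -> str:
    # Re-checked before each agent execution, so Beast is picked up
    # automatically when it comes back online (auto-recovery).
    return "beast" if beast_healthy() else "nano"
```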

**2. `start.sh` — Auto-detects Beast on Annie startup:**
```bash
# In start_annie(): checks if Beast vLLM is healthy
# If yes → passes BEAST_LLM_BASE_URL and BEAST_LLM_MODEL env vars to Annie
# If no → agents use local Nano (current behavior, zero change)
```

**3. `start.sh` — `start_beast_llm()` uses spark-vllm-docker recipe:**
```bash
./start.sh beast     # Start Super on Beast
./start.sh status    # Shows Beast health in status display
./stop.sh beast      # Stop Super on Beast
```

**4. Voice-priority gate:** Still works but now optional — Beast GPU is independent from Titan.

### VRAM Budget (Dual-Machine)

**Titan (unchanged):**
| Component | VRAM |
|-----------|------|
| Nemotron 3 Nano (QAT v4) | 18 GB |
| Audio pipeline | 7.3 GB |
| SER | 1.2 GB |
| Nemotron Speech STT | 2.49 GB |
| Kokoro TTS | 0.5 GB |
| qwen3-embedding (on-demand) | 14 GB |
| **Peak** | **~41 GB** |
| **Free** | **87 GB** |

**Beast (measured):**
| Component | VRAM |
|-----------|------|
| Nemotron 3 Super NVFP4 (measured) | **70 GB** (71,765 MiB per nvidia-smi) |
| vLLM overhead | included above |
| **Peak** | **~70 GB** |
| **Free** | **58 GB** |

**Combined**: 111 GB used out of 256 GB total. **145 GB headroom across both machines.**

### Latency Analysis (measured)

For background agent calls (Titan → Beast → Titan):
- Network round-trip (WiFi): **~75ms** (ConnectX-7 not configured yet, would be <1ms)
- vLLM TTFT on Beast: **~200-400ms** (estimated)
- Decode at 14-19 tok/s: **~26-36s for 500-token response**
- Total: **~30s per agent call**

This is perfectly fine for background agents (meditation, self-improvement, extraction). They have no real-time requirement — they run while voice is idle.

### Advantages Over Single-Machine

| Dimension | Single (Titan only) | Dual (Titan + Beast) |
|-----------|-------------------|---------------------|
| Voice latency | 90ms TTFT, 48 tok/s | **Same** (Nano stays on Titan) |
| Agent quality | Nano (38.76% SWE) | **Super (60.47% SWE, 97% tool)** |
| VRAM pressure | 41 GB / 128 GB | **41 + 70 = 111 / 256 GB** |
| GPU contention | Voice vs agents share GPU | **Zero contention** |
| Voice-priority gate | Required (busy-wait) | **Optional** (separate GPUs) |
| Privacy | Local | **Still fully local** (LAN only) |
| Failure isolation | Single point of failure | **Independent** (voice works if Beast down) |

### Failure Handling

If Beast goes down:
- Voice continues unaffected (Nano on Titan)
- Background agents fail gracefully (existing timeout + skip pattern in `agent_context.py`)
- Alert via observability (agent_complete events stop)
- Manual failback: point agent client to Titan's Nano temporarily

---

## 9. Other Future Paths

### Path A: Keep Nano for Everything (Fallback)
- **When**: If Beast is unavailable or Super proves unreliable
- Our current setup works. It's just not optimal for complex agents.

### Path B: Super-Distilled QAT (Complementary)
- Even with Super on Beast, we should still distill its behavior into Nano
- Use Super to generate 1000+ high-quality agent conversations
- QAT v5 Nano with Super-distilled data → better Nano for when Beast is offline
- **Effort**: Medium (we have the QAT v4 pipeline)

### Path C: Wait for DGX Spark Pro
- Rumored 256 GB unified memory
- Would fit Nano + Super on a single machine
- But dual-machine is available NOW

---

## 10. Current Status — DEPLOYED (2026-03-20)

### What's Running

| Machine | Model | Container | Image | Status |
|---------|-------|-----------|-------|--------|
| **Titan** | Nemotron 3 Nano 30B-A3B NVFP4 | `vllm-nemotron` | `nvcr.io/nvidia/vllm:25.12.post1-py3` | ✅ Running |
| **Beast** | Nemotron 3 Super 120B-A12B NVFP4 | `vllm-nemotron-super` | `vllm-node` (spark-vllm-docker) | ✅ Running |

### How to Start/Stop

```bash
# From laptop (start.sh SSHes into machines)
./start.sh beast          # Start Super on Beast
./start.sh annie          # Start Annie (auto-detects Beast)
./stop.sh beast           # Stop Super on Beast
./stop.sh annie           # Stop Annie

# Status check
./start.sh status         # Shows Beast + Titan health

# Manual health check
curl -sf http://192.168.68.58:8003/health   # Beast Super
curl -sf http://192.168.68.52:8003/health   # Titan Nano
```

### If Beast Needs Restart

```bash
# Option 1: via start.sh (handles everything)
./stop.sh beast && ./start.sh beast

# Option 2: manually on Beast
ssh beast
cd ~/spark-vllm-docker
docker rm -f vllm-nemotron-super
python3 run-recipe.py nemotron-3-super-nvfp4-eager \
  --solo --tensor-parallel 1 \
  --max-model-len 32768 --port 8003 \
  --gpu-memory-utilization 0.70 \
  --name vllm-nemotron-super -d
# Wait ~13 min for model load
curl -sf http://localhost:8003/health  # should return healthy
```
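Rather than eyeballing the ~13-minute load, a small polling helper can block until `/health` responds. This is a sketch (the `wait_healthy` name and retry defaults are made up, not part of `start.sh`); it assumes `curl` is available:

```shell
# wait_healthy: poll a vLLM /health endpoint until it responds OK.
# Usage: wait_healthy <url> [max_attempts] [sleep_seconds]
wait_healthy() {
  url="$1"
  max="${2:-90}"        # 90 attempts x 10 s = 15 min, covers the ~13 min load
  sleep_s="${3:-10}"
  i=1
  while [ "$i" -le "$max" ]; do
    if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$sleep_s"
  done
  echo "gave up after $max attempts" >&2
  return 1
}

# Example, on Beast after restarting the container:
# wait_healthy http://localhost:8003/health
```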

### If Beast Is Down

Annie auto-detects the outage and falls back to Nano; no restart needed. When Beast comes back, either restart Annie (`./stop.sh annie && ./start.sh annie`) to pick it up immediately, or just wait — agent_context.py re-checks Beast health before each agent execution.
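The probe-then-route pattern can be sketched as follows. The endpoint URLs come from the health checks above; the function names are hypothetical and this is not the actual `agent_context.py` code:

```python
import urllib.request

BEAST_SUPER = "http://192.168.68.58:8003/v1"   # Super on Beast (background agents)
TITAN_NANO = "http://192.168.68.52:8003/v1"    # Nano on Titan (voice, fallback)

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Probe the vLLM /health endpoint; any error counts as unhealthy."""
    url = base_url.rsplit("/v1", 1)[0] + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_backend(beast_ok: bool) -> str:
    """Prefer Super on Beast; fall back to Nano on Titan."""
    return BEAST_SUPER if beast_ok else TITAN_NANO

# Re-checked before each agent execution:
# base_url = choose_backend(is_healthy(BEAST_SUPER))
```

Keeping the probe timeout short (a couple of seconds) matters here: the check runs before every agent execution, so a slow or hanging probe would stall agents even when Titan is healthy.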

### Next Steps

1. **Super-Distilled QAT v5** — Use Super to generate 1000+ high-quality agent conversations, then run QAT v5 on Nano with this data → a better fallback Nano for when Beast is offline
2. **Configure ConnectX-7** — Direct cable link would reduce latency from 75ms (WiFi) to <1ms
3. **Enable CUDA graphs** — Once startup time isn't a concern (e.g., always-on Beast), remove `--enforce-eager` for better inference speed
4. **Benchmark agent quality** — Compare meditation/self-improvement output quality between Nano and Super

---

## Sources

- [spark-vllm-docker: DGX Spark vLLM Docker (community, USED FOR DEPLOYMENT)](https://github.com/eugr/spark-vllm-docker)
- [NVIDIA NemoClaw GitHub](https://github.com/NVIDIA/NemoClaw)
- [NVIDIA Nemotron Super vLLM Cookbook](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/usage-cookbook/Nemotron-3-Super/AdvancedDeploymentGuide)
- [NVIDIA Technical Blog: Introducing Nemotron 3 Super](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/)
- [NVIDIA Blog: Nemotron 3 Super Delivers 5x Higher Throughput](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/)
- [NVIDIA Research: Nemotron 3 Super](https://research.nvidia.com/labs/nemotron/Nemotron-3-Super/)
- [NVIDIA Research: Nemotron 3 Family](https://research.nvidia.com/labs/nemotron/Nemotron-3/)
- [Nemotron 3 Super Technical Report (PDF)](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)
- [NVIDIA NIM: Nemotron 3 Super Model Card](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b/modelcard)
- [HuggingFace: Nemotron 3 Super NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
- [HuggingFace: Nemotron 3 Super FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8)
- [HuggingFace: Nemotron 3 Super BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16)
- [vLLM Blog: Nemotron 3 Super](https://vllm.ai/blog/nemotron-3-super)
- [Daily.co: Nemotron 3 Super for Voice AI](https://www.daily.co/blog/nvidia-nemotron-3-super/)
- [Running Nemotron 3 Super on DGX Spark](https://blog.kubesimplify.com/nemotron3-on-dgx-spark)
- [NVIDIA Forums: DGX Spark Nemotron3 NVFP4 65+ tps](https://forums.developer.nvidia.com/t/dgx-spark-nemotron3-and-nvfp4-getting-to-65-tps/355261)
- [NVIDIA Forums: Nemotron 3 Super NVFP4 on DGX Spark](https://forums.developer.nvidia.com/t/nvidia-nemotron-3-super-120b-a12b-nvfp4/363175)
- [VentureBeat: NemoClaw](https://venturebeat.com/technology/nvidia-lets-its-claws-out-nemoclaw-brings-security-scale-to-the-agent)
- [The New Stack: NemoClaw = OpenClaw with Guardrails](https://thenewstack.io/nemoclaw-openclaw-with-guardrails/)
- [NVIDIA: NemoClaw Official Page](https://www.nvidia.com/en-us/ai/nemoclaw/)
- [NVIDIA OpenShell Technical Blog](https://developer.nvidia.com/blog/run-autonomous-self-evolving-agents-more-safely-with-nvidia-openshell/)
- [NVIDIA: Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/)
- [Unsloth: Nemotron 3 Super How To Run Guide](https://unsloth.ai/docs/models/nemotron-3-super)
- [Ollama: nemotron-3-super](https://ollama.com/library/nemotron-3-super)
- [DataCamp: Nemotron 3 Architecture, Benchmarks, Comparisons](https://www.datacamp.com/blog/nvidia-nemotron-3)
- [llm-stats: Nemotron 3 Super Launch](https://llm-stats.com/blog/research/nemotron-3-super-launch)
- [Artificial Analysis: Nemotron 3 Super](https://artificialanalysis.ai/models/nvidia-nemotron-3-super-120b-a12b)
- [OpenClaw Report: Nemotron 3 Super PinchBench Leader](https://openclaw.report/ecosystem/nemotron-3-super-pinchbench-leader)
- [Saiyam Pathak: Running Nemotron 3 Super on DGX Spark](https://saiyampathak.substack.com/p/heres-what-i-learned-about-nemotron)
- [NVIDIA GTC 2026: RTX AI Garage NemoClaw](https://blogs.nvidia.com/blog/rtx-ai-garage-gtc-2026-nemoclaw/)
