# Research: NVIDIA Nemotron 3 LLM — Suitability for DGX Spark

**Date:** 2026-03-13 (Session 299), updated 2026-03-17 (Session 349)
**Status:** Research complete, updated with NemoClaw + vLLM vs Ollama benchmarks
**Context:** Evaluate Nemotron 3 family (Nano, Super, Ultra) as potential replacement/complement for Qwen3.5-27B on DGX Spark (128GB unified memory).

---

## 1. Executive Summary

Nemotron 3 is NVIDIA's first fully custom LLM — a **hybrid Mamba-2 + Transformer + MoE** architecture. Released Dec 2025 (Nano) and Mar 2026 (Super). Optimized for Blackwell with native NVFP4 training.

**Verdict:** Qwen3.5-27B remains the better choice for her-os entity extraction and Annie voice (stronger tool calling, Kannada support). Nemotron 3 Super is worth watching if NVFP4 fits in ~65 GB on Spark.

---

## 2. Model Family

| Tier | Total Params | Active/Token | Experts | Context | Status |
|------|-------------|-------------|---------|---------|--------|
| **Nano** | 31.6B | ~3.2B | 128+1 shared | 1M | Released Dec 2025 |
| **Super** | 120B | 12B | 512 (22 active) | 1M | Released Mar 2026 |
| **Ultra** | ~500B | ~50B | Unknown | 1M | Expected H1 2026 |

## 3. Architecture

- **Hybrid Mamba-2 / Transformer MoE** — NOT based on Llama
- Three interleaved layer types: Mamba-2 (efficient sequence), Transformer attention (precision reasoning), MoE (scalable compute)
- **LatentMoE** (Super/Ultra): Token embeddings are projected into a compressed latent space before routing → 4x more experts at the same compute cost
- **Multi-Token Prediction (MTP)**: Native speculative decoding, avg 3.45 tokens/step (Super)
- **NVFP4 training**: Natively trained in Blackwell 4-bit floating point — zero quantization degradation
- **1M context**: Mamba layers keep a constant-size state, so memory does not grow with context length (a transformer KV cache grows linearly with context; attention compute grows quadratically)
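
The memory argument can be made concrete with a back-of-envelope comparison. All dimensions below (layer count, heads, state size, model width) are illustrative assumptions for the sketch, not Nemotron's actual config:

```python
# Back-of-envelope: transformer KV cache grows with context length,
# while a Mamba-2 SSM state stays fixed. Dimensions are assumptions.

def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=1):
    """KV cache: K and V tensors per layer, one entry per token (FP8 here)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

def mamba_state_bytes(n_layers=32, d_state=128, d_model=4096, dtype_bytes=2):
    """Mamba-2 state: fixed-size per layer, independent of context length."""
    return n_layers * d_state * d_model * dtype_bytes

for ctx in (8_192, 131_072, 1_048_576):
    kv_gb = kv_cache_bytes(ctx) / 2**30
    ssm_gb = mamba_state_bytes() / 2**30
    print(f"{ctx:>9} tokens: KV cache ≈ {kv_gb:6.2f} GiB, Mamba state ≈ {ssm_gb:.2f} GiB (constant)")
```

At these toy dimensions the KV cache hits 64 GiB at 1M tokens while the SSM state stays at ~0.03 GiB, which is why the hybrid can afford the 1M window at all.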

## 4. DGX Spark VRAM Budget

Current Qwen3.5-27B uses 40 GB via Ollama. Comparison:

| Model | Precision | VRAM | Fits with 19 GB idle stack? | Free |
|-------|-----------|------|---------------------------|------|
| Nemotron 3 Nano Q4_K_M GGUF | Q4 | ~26 GB | Yes | 83 GB |
| Nemotron 3 Nano NVFP4 | FP4 native | ~20 GB | Yes | 89 GB |
| Nemotron 3 Nano BF16 | BF16 | ~63 GB | Yes (tight) | 46 GB |
| Nemotron 3 Super NVFP4 | FP4 native | ~64-72 GB | Tight | 37-45 GB |
| Nemotron 3 Super BF16 | BF16 | 240+ GB | No (needs 8xH100) | — |
| **Qwen3.5-27B Q4_K_M** | Q4 | **40 GB** | **Yes** | **69 GB** |

DGX Spark verified: Nemotron 3 Nano NVFP4 achieves ~67 t/s via vLLM.
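
The "Free" column above is simple unified-memory arithmetic; a minimal sketch reproducing it (128 GB total and the 19 GB idle stack are the figures from this section):

```python
# VRAM budget arithmetic behind the table: 128 GB unified memory
# minus the always-loaded stack (19 GB) minus model weights.

TOTAL_GB = 128
IDLE_STACK_GB = 19

def free_after(model_gb):
    """Unified memory left once the idle stack and model weights are resident."""
    return TOTAL_GB - IDLE_STACK_GB - model_gb

for name, size in [("Nano Q4_K_M", 26), ("Nano NVFP4", 20),
                   ("Nano BF16", 63), ("Qwen3.5-27B Q4_K_M", 40)]:
    print(f"{name:<20} {size:>3} GB -> {free_after(size)} GB free")
```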

## 5. Benchmarks vs Qwen3.5

### Nano (3.2B active) vs Qwen3.5-27B (27B dense)

| Benchmark | Nemotron 3 Nano | Qwen3.5-27B (est.) | Winner |
|-----------|----------------|---------------------|--------|
| MMLU-Pro | 78.3% | ~83% | Qwen |
| GPQA Diamond | 73.0% | ~82% | Qwen |
| AIME 2025 | 89.1% | ~89% | Tie |
| SWE-Bench | 38.8% | ~55% | **Qwen (+16)** |
| TAU2-Bench (Agentic) | 49.0% | ~70% | **Qwen (+21)** |
| 1M context RULER-100 | **86.3%** | N/A (131K) | **Nemotron** |

### Super (12B active) vs Qwen3.5-27B

| Benchmark | Nemotron 3 Super | Qwen3.5-27B | Notes |
|-----------|-----------------|-------------|-------|
| MMLU-Pro | 83.7% | ~83% | Tie |
| SWE-Bench | 60.5% | ~55% | Super wins |
| LiveCodeBench v5 | 81.2% | ~72% | Super wins |
| 1M context | 91.8% | N/A | Super |
| Throughput | **7.5x faster** | baseline | vs Qwen3.5-122B (Artificial Analysis), not the 27B |

## 6. Tool Calling

Nemotron 3 supports OpenAI-compatible function calling, but benchmarks show a significant gap:

| Benchmark | Nemotron 3 Nano | Qwen3.5-35B-A3B |
|-----------|----------------|-----------------|
| BFCL v4 | 53.8% | — |
| TAU2-Bench V2 | 49.0% | 81.2% |
| SWE-Bench | 38.8% | 69.2% |

**Critical for Annie Voice:** Tool selection accuracy matters for `web_search` vs `get_entity_details` routing. Weaker tool calling would worsen the Session 298 bug.
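
For reference, the routing decision the model has to make looks like a standard OpenAI-compatible tool-calling request over Annie's three-tool menu. The tool schemas below are illustrative assumptions sketched for the example, not the real her-os definitions:

```python
# Illustrative function-calling request with Annie's three-tool menu.
# Tool parameter schemas are assumptions, not the actual her-os definitions.
import json

TOOLS = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for fresh or external information.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "search_memory",
        "description": "Search stored conversation memory for what someone said.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "get_entity_details",
        "description": "Fetch the stored profile of a known person or place.",
        "parameters": {"type": "object",
                       "properties": {"entity": {"type": "string"}},
                       "required": ["entity"]}}},
]

payload = {
    "model": "qwen3.5-27b",  # assumed served-model name for the sketch
    "messages": [{"role": "user", "content": "Tell me about Arun"}],
    "tools": TOOLS,
    "tool_choice": "auto",  # the model should pick get_entity_details here
}
print(json.dumps(payload)[:120])
```

A weaker model answers "Tell me about Arun" as free text instead of emitting a `get_entity_details` call, which is exactly the failure mode benchmarked in §12b.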

## 7. Throughput on DGX Spark

| Model | Hardware | Tokens/sec |
|-------|----------|-----------|
| **Nemotron 3 Nano NVFP4** | **DGX Spark** | **~67 t/s** |
| Nemotron 3 Super | H200 NIM | ~450 t/s |
| Qwen3.5-27B Q4_K_M | DGX Spark (Ollama) | ~25-35 t/s (est.; Session 301 measured 11.1 t/s, see §12) |

## 8. Blockers for her-os

1. **Tool calling gap** — 49% vs 81% on TAU2-Bench. Annie needs reliable tool routing.
2. **Kannada gap** — Nemotron lacks Kannada. Dealbreaker for multilingual entity extraction.
3. **Super VRAM** — NVFP4 needs ~64-72 GB, leaving only ~37-45 GB for the rest of the stack.
4. **Ecosystem maturity** — Ollama support for Nano exists, but NVFP4 requires vLLM with Blackwell-specific flags.

## 9. License

NVIDIA Open Model License — commercial use allowed, royalty-free, derivatives permitted. Stripping guardrails terminates the license. More permissive than the Llama license, less so than Apache 2.0.

## 10. DGX Spark Benchmark Attempt (Session 301)

**Nemotron 3 Super on DGX Spark: NOT PRACTICAL.**
- Ollama Q4 model size: **86 GB**
- With always-loaded stack (19 GB): **105 GB total** — only 23 GB free
- Cannot coexist with qwen3-embedding:8b (14 GB) or any other Ollama model
- Download attempted and killed after confirming impracticality
- Super is designed for 8x H100 or as sole model, not multi-model orchestration

**Nemotron 3 Nano (already on Titan, 24.3 GB):** Available for benchmark but tool calling weakness (TAU2-Bench 49%) is a concern for Annie Voice's multi-tool pipeline.

## 11. Alternative: Qwen3.5-27B NVFP4 (Best Long-Term Option)

Instead of switching models, quantize the existing Qwen3.5-27B to NVFP4 for Blackwell-native inference.

**Available community quantizations (NOT official NVIDIA):**
- [txn545/Qwen3.5-27B-NVFP4](https://huggingface.co/txn545/Qwen3.5-27B-NVFP4) — via NVIDIA Model Optimizer tool
- [kaitchup/Qwen3.5-27B-NVFP4](https://huggingface.co/kaitchup/Qwen3.5-27B-NVFP4) — via llm-compressor
- [AxionML/Qwen3.5-27B-NVFP4](https://huggingface.co/AxionML/Qwen3.5-27B-NVFP4) — via TensorRT Model Optimizer
- [osoleve/Qwen3.5-27B-Text-NVFP4-MTP](https://huggingface.co/osoleve/Qwen3.5-27B-Text-NVFP4-MTP) — with MTP speculative decoding head

**Potential gains:**
- VRAM: **~18.5 GB** (vs 40 GB Q4 Ollama) — saves 21.5 GB
- Speed: **3.3x faster** than BF16 on Spark
- Quality: **identical** — same model, same tool calling, same Kannada support
- MTP variant adds speculative decoding (multiple tokens per forward pass)

**Current blocker:** [vLLM NVFP4 crash on ARM64/GB10](https://github.com/vllm-project/vllm/issues/35519) — CUDA illegal instruction. SM_121 needs SM_120-forward-compatible workarounds. Not officially supported by vLLM yet. Community patches exist but are fragile.

**Important:** These are community quantizations, not official NVIDIA releases. NVIDIA published NVFP4 for their own models and Qwen3.5-397B-A17B, but not for Qwen3.5-27B.

**Action:** Revisit when vLLM officially supports DGX Spark GB10. This is the optimal upgrade path — same model quality, half the VRAM, faster inference.

## 12. Nemotron 3 Nano vs Qwen3.5-27B Benchmark (Session 301)

Benchmark script: `scripts/benchmark_nemotron_nano.py`. Run on DGX Spark via Ollama.

### Throughput

| Metric | Qwen3.5-27B | Nemotron 3 Nano | Advantage |
|--------|-------------|-----------------|-----------|
| Generation tok/s | 11.1 | **67.3** | **6.1x** |
| Prompt eval tok/s (small) | 222.5 | 200.3 | ~equal |
| Wall time (512 gen tokens) | 59.8s | **39.1s** | 1.5x |

### Large Context Prompt Eval (3,700 tokens — Graphiti bottleneck)

| Metric | Qwen3.5-27B | Nemotron 3 Nano | Advantage |
|--------|-------------|-----------------|-----------|
| Prompt eval tok/s | 624.5 | **1,869.1** | **3x** |
| Generation tok/s | 10.8 | **71.6** | **6.6x** |
| Wall time | 100.9s | **22.6s** | **4.5x** |

### Tool Calling (8 test cases with Annie's 3-tool menu)

| Model | Score | Avg latency |
|-------|-------|-------------|
| **Qwen3.5-27B** | **8/8 (100%)** | 10.5s |
| Nemotron 3 Nano | 6/8 (75%) | 3.7s |

Nano failures: "What did Priya say about the product launch?" (text response instead of `search_memory`) and "Tell me about Arun" (text response instead of `get_entity_details`).

### Entity Extraction & Kannada

Both models hit the 2,048-token generation cap (Qwen spent much of its budget on thinking tokens). Nano's partial output was structurally correct (Priya/person, Bangalore/place, Arun/person). Not a model-quality issue — the benchmark needs a higher `num_predict`.
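
The cap is set per request via Ollama's `options.num_predict`. A minimal sketch of the `/api/generate` payload the benchmark would need (model tag and prompt are assumptions; the endpoint and `num_predict` option are standard Ollama API):

```python
# Lift Ollama's generation cap per request via options.num_predict.
# Model tag and prompt are illustrative; the API shape is standard Ollama.
import json
import urllib.request

payload = {
    "model": "qwen3.5:27b",  # assumed tag; substitute the installed one
    "prompt": "Extract entities: Priya met Arun in Bangalore.",
    "stream": False,
    "options": {"num_predict": 8192},  # default cap of 2048 truncated both models
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment where an Ollama server is running
```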

### Key Finding: Hybrid Opportunity

Nano could replace Qwen for **Graphiti graph-building only** (no tool calling needed — just entity parsing + relationship inference). This would cut Graphiti sync from 5-15 min to 1-3 min. Annie Voice stays on Qwen for tool accuracy.
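
A minimal sketch of that split, with backend URLs and model tags as illustrative assumptions:

```python
# Hybrid routing sketch: Graphiti graph-building goes to Nano (fast, no
# tools needed); anything that calls tools stays on Qwen. Endpoints and
# model tags below are assumptions for illustration.

BACKENDS = {
    "nano": {"url": "http://localhost:11434", "model": "nemotron-3-nano:30b"},
    "qwen": {"url": "http://localhost:11434", "model": "qwen3.5:27b"},
}

def pick_backend(task: str, needs_tools: bool) -> str:
    """Route tool-using (voice) traffic to Qwen; pure extraction to Nano."""
    if needs_tools:
        return "qwen"          # 100% vs 75% tool accuracy in the §12 benchmark
    if task == "graphiti_sync":
        return "nano"          # 4.5x faster wall time on large prompts
    return "qwen"

assert pick_backend("graphiti_sync", needs_tools=False) == "nano"
assert pick_backend("voice_turn", needs_tools=True) == "qwen"
```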

## 12b. Nemotron 3 Nano vs Qwen3.5-9B Tool Calling (Session 301)

The real comparison for Annie Voice — Nano (Ollama) vs the actual 9B Opus-Distilled (llama-server :8003).

**16 test cases** covering all 3 tools (`web_search`, `search_memory`, `get_entity_details`).

| Metric | Nemotron 3 Nano | Qwen3.5-9B | Winner |
|--------|----------------|------------|--------|
| **Tool accuracy** | **12/16 (75%)** | **15/16 (94%)** | **Qwen (+3 cases)** |
| Avg latency/call | 3.2s | 2.9s | Qwen (slightly) |
| Total wall time | 51.6s | 46.1s | Qwen |

### Failure analysis

**Nano failures (4):**
- "What did Priya say about product launch?" → text response (should: `search_memory`)
- "Tell me about Arun" → `search_memory` (should: `get_entity_details`)
- "What does Deepak work on?" → text response (should: `get_entity_details`)
- "Look up Dr. Sharma's details" → text response (should: `get_entity_details`)

Pattern: **Nano consistently fails `get_entity_details`** — either answers directly or falls back to `search_memory`. 0/3 correct on `get_entity_details` calls.

**Qwen 9B failure (1):**
- "What did Priya say about product launch?" → `get_entity_details` (should: `search_memory`)

Pattern: Qwen 9B confuses `search_memory` vs `get_entity_details` on one ambiguous case. Strong on all other categories.

### Verdict

Qwen3.5-9B is clearly better for Annie Voice: **94% vs 75%** tool accuracy, and **faster** (2.9s vs 3.2s per call). Nano's `get_entity_details` blindness makes it unsuitable for Annie's multi-tool pipeline.

## 13. Recommendation

| Use Case | Choice | Rationale |
|----------|--------|-----------|
| Entity extraction (unicorn) | **Keep Qwen3.5-27B** | Better tool calling, Kannada, proven |
| Annie voice chat (minotaur) | **Keep Qwen3.5-9B distill** | Tool accuracy critical; Nano is weaker |
| Background LLM | **Keep Qwen3.5-27B** | Shared with extraction |
| **Best upgrade path** | **Qwen3.5-27B NVFP4** | Same model, ~18.5 GB VRAM, 3.3x faster. Blocked by vLLM ARM64 bug |
| Future experiment | **Watch Super NVFP4** | If fits in ~65 GB, beats Qwen on coding with 1M context |

## 14. NemoClaw — NVIDIA's OpenClaw Sandbox Platform (Session 349)

**NemoClaw** is NVIDIA's open-source enterprise AI agent platform, announced at GTC 2026 (March 16). It wraps OpenClaw in a secure sandbox with NVIDIA inference routing.

### Architecture

```
User → nemoclaw CLI (TypeScript plugin)
  → Blueprint Runner (Python orchestrator)
    → OpenShell CLI (sandbox + policy + inference)
      → OpenClaw running inside sandboxed container
```

- **Plugin** (`nemoclaw/src/`): Thin TypeScript CLI, registers under `openclaw nemoclaw`
- **Blueprint** (`nemoclaw-blueprint/`): Python orchestrator with declarative YAML manifest
- **Policy**: Default-deny filesystem/network with explicit allowlists (Landlock + seccomp)
- **Repo**: https://github.com/NVIDIA/NemoClaw (cloned to `vendor/NemoClaw/`)

### NemoClaw Default Models

| Model | Context | Max Output |
|-------|---------|-----------|
| Nemotron 3 Super 120B (default) | 131K | 8,192 |
| Nemotron Ultra 253B | 131K | 4,096 |
| Nemotron Super 49B v1.5 | 131K | 4,096 |
| Nemotron 3 Nano 30B | 131K | 4,096 |

### Inference Profiles (blueprint.yaml)

| Profile | Where | Notes |
|---------|-------|-------|
| `default` | build.nvidia.com | Recommended, zero infra |
| `ncp` | NVIDIA Cloud Partner | Dedicated capacity |
| `nim-local` | Self-hosted NIM container | Experimental |
| `vllm` | Local vLLM | Experimental |
| `ollama` | Local Ollama | Experimental |

### Relevance to her-os

- The `vllm` and `ollama` profiles could point at our existing vLLM/Ollama on Titan
- Blueprint versioning pattern (immutable, digest-verified) could inspire Annie deployment configs
- Early-stage: APIs may change without notice

## 15. Ollama Available Tags for Nemotron 3 Nano (Session 349)

**NVFP4 is NOT available on Ollama.** Ollama doesn't support the NVFP4 format (requires FlashInfer MoE FP4 kernels).

| Tag | Size | Context |
|-----|------|---------|
| `nemotron-3-nano:30b` (latest) | 24 GB | 1M |
| `nemotron-3-nano:30b-a3b-q4_K_M` | 24 GB | 1M |
| `nemotron-3-nano:30b-a3b-q8_0` | 34 GB | 1M |
| `nemotron-3-nano:30b-a3b-fp16` | 63 GB | 1M |
| `nemotron-3-nano:4b` | 2.8 GB | 256K |
| `nemotron-3-nano:4b-q8_0` | 4.2 GB | 256K |
| `nemotron-3-nano:4b-bf16` | 8.0 GB | 256K |

Source: https://ollama.com/library/nemotron-3-nano/tags

## 16. vLLM vs Ollama Performance on DGX Spark (Session 349)

### DGX Spark Benchmarks (from NVIDIA Developer Forums)

| Metric | vLLM NVFP4 | vLLM FP8 | Ollama Q4_K_M (est.) |
|--------|-----------|----------|---------------------|
| Output throughput | **167 tok/s** | 154 tok/s | ~40-60 tok/s |
| Peak throughput | **248 tok/s** | 208 tok/s | — |
| Median TTFT | **292 ms** | 414 ms | — |
| Per-token latency (TPOT) | **36 ms** | 48 ms | ~15-25 ms |
| Request throughput | **1.31 req/s** | 1.21 req/s | — |

**NVFP4 wins on DGX Spark** — faster throughput AND lower median TTFT than FP8.

### Why vLLM is Faster

- **PagedAttention**: Virtual memory for KV cache, eliminates fragmentation
- **Continuous batching**: Aggregates concurrent requests into unified GPU ops
- **FlashInfer FP4 kernels**: Purpose-built for Blackwell Tensor Cores (4x FLOPS over BF16)
- **At scale**: 16.6x throughput advantage over Ollama on Blackwell (8,033 vs 484 tok/s on 70B Llama)

### Stability Caveat

A DGX Spark user reported NVFP4 crashes after 20-60 minutes with GPU memory errors. Recommended: `--gpu-memory-utilization 0.7` (not 0.9), because the GPU shares the 128 GB unified memory pool with the CPU and the rest of the stack.

### vLLM NVFP4 Serving Recipe (Nemotron 3 Nano)

```bash
# Download custom reasoning parser
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py

# Launch vLLM server (256K context)
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7
```

**Docker image**: `nvcr.io/nvidia/vllm:25.12.post1-py3` (DGX Spark/Jetson Thor)

**Required env vars**:
- `VLLM_USE_FLASHINFER_MOE_FP4=1` — Enable FP4 MoE support
- `VLLM_FLASHINFER_MOE_BACKEND=throughput` — Throughput optimization
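
Once serving, the recipe exposes a standard OpenAI-compatible endpoint. A minimal smoke-test sketch matching the recipe's own settings (`--served-model-name model`, `--port 8000`); the prompt is an arbitrary placeholder:

```python
# Smoke-test payload for the vLLM server launched by the recipe above.
# "model" matches --served-model-name; port 8000 matches --port.
import json
import urllib.request

payload = {
    "model": "model",
    "messages": [{"role": "user", "content": "Reply with OK."}],
    "max_tokens": 16,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment on the Spark once vLLM is serving
```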

## 17. NVFP4 Quantization Quality (QAD vs BF16)

NVFP4 uses Quantization-Aware Distillation (QAD) rather than simple post-training quantization (PTQ). Accuracy loss is minimal:

| Benchmark | BF16 | FP8 | NVFP4 | Loss vs BF16 |
|-----------|------|-----|-------|---------------|
| MMLU-Pro | 78.3 | 78.1 | 77.4 | -0.9 |
| AIME25 | 89.1 | 87.7 | 86.7 | -2.4 |
| GPQA | 73.0 | 72.5 | 71.9 | -1.1 |
| LiveCodeBench | 68.3 | 67.6 | 65.4 | -2.9 |
| TauBench Avg | 49.0 | 47.0 | 45.6 | -3.4 |

Selective strategy: Attention + Mamba→Attention layers stay BF16. KV cache: FP8.
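
A quick consistency check that the "Loss vs BF16" column above matches its BF16 and NVFP4 inputs:

```python
# Verify the reported loss column: loss = NVFP4 score - BF16 score.
SCORES = {  # benchmark: (BF16, NVFP4, reported loss)
    "MMLU-Pro":      (78.3, 77.4, -0.9),
    "AIME25":        (89.1, 86.7, -2.4),
    "GPQA":          (73.0, 71.9, -1.1),
    "LiveCodeBench": (68.3, 65.4, -2.9),
    "TauBench Avg":  (49.0, 45.6, -3.4),
}
for name, (bf16, nvfp4, loss) in SCORES.items():
    assert round(nvfp4 - bf16, 1) == loss, name
print("loss column consistent")
```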

## 18. Updated VRAM Comparison for Extraction Model Swap (Session 349)

| Option | VRAM | Server | Speed (DGX Spark) | Tool Calling |
|--------|------|--------|-------------------|--------------|
| **Current: Qwen3.5-27B Q4_K_M** | **40 GB** | Ollama | ~11 tok/s gen | **100% (8/8)** |
| Nemotron 3 Nano Q4_K_M | 24 GB | Ollama | ~67 tok/s gen | 75% (6/8) |
| Nemotron 3 Nano NVFP4 | ~18 GB | vLLM | ~167 tok/s out | 75% (6/8)* |
| Qwen3.5-27B NVFP4 | ~18.5 GB | vLLM | ~3.3x BF16** | 100%** |

\* Tool calling accuracy not re-benchmarked with NVFP4; assumed same as base model.
\** Blocked by vLLM ARM64/SM_121 bug (see Section 11).

**Recommendation (unchanged)**: Qwen3.5-27B NVFP4 remains the optimal upgrade path (same quality, half VRAM, faster). Nemotron 3 Nano is faster but tool calling weakness (75%) makes it unsuitable for extraction. Revisit when vLLM ARM64 support stabilizes.

## 19. Sources (updated Session 349)

- [NVIDIA Nemotron 3 Family Launch](https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models)
- [Nemotron 3 Super Technical Blog](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/)
- [Inside Nemotron 3 Architecture](https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/)
- [Nemotron 3 Research Page](https://research.nvidia.com/labs/nemotron/Nemotron-3/)
- [Nemotron 3 Super Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)
- [Nemotron 3 Nano Technical Report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf)
- [DGX Spark + NVFP4 Forum Thread](https://forums.developer.nvidia.com/t/dgx-spark-nemotron3-and-nvfp4-getting-to-65-tps/355261)
- [Nemotron 3 Nano GGUF (Unsloth)](https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF)
- [Nemotron 3 Super NIM](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b/modelcard)
- [Qwen3.5-35B vs Nemotron Nano Comparison](https://awesomeagents.ai/tools/qwen-3-5-35b-a3b-vs-nemotron-3-nano/)
- [Artificial Analysis: Super vs Qwen3.5-122B](https://artificialanalysis.ai/models/comparisons/nvidia-nemotron-3-super-120b-a12b-vs-qwen3-5-122b-a10b)
- [GitHub - NVIDIA/NemoClaw](https://github.com/NVIDIA/NemoClaw) (Session 349)
- [NemoClaw Cookbook](https://build.nvidia.com/spark/nemoclaw) (Session 349)
- [HuggingFace - Nemotron 3 Nano NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4) (Session 349)
- [vLLM Recipes - Nemotron 3 Nano](https://docs.vllm.ai/projects/recipes/en/latest/NVIDIA/Nemotron-3-Nano-30B-A3B.html) (Session 349)
- [DGX Spark Nemotron 3 Nano vLLM Benchmarks (NVIDIA Forums)](https://forums.developer.nvidia.com/t/testing-nemotron-3-nano-models-on-nvidia-dgx-spark-jetson-thor-with-vllm-and-flashinfer/360642) (Session 349)
- [Ollama vs vLLM Performance Benchmark 2026 (SitePoint)](https://www.sitepoint.com/ollama-vs-vllm-performance-benchmark-2026/) (Session 349)
- [Red Hat - Ollama vs vLLM Deep Dive](https://developers.redhat.com/articles/2025/08/08/ollama-vs-vllm-deep-dive-performance-benchmarking) (Session 349)
- [Ollama - nemotron-3-nano tags](https://ollama.com/library/nemotron-3-nano/tags) (Session 349)
- [NVFP4 QAD Research](https://research.nvidia.com/labs/nemotron/nemotron-qad/) (Session 349)
