# Nemotron 3 Nano 30B NVFP4 vs Qwen3.5-27B NVFP4 — Benchmark Results

**Date:** 2026-03-17
**Platform:** DGX Spark (128 GB unified LPDDR5x, Blackwell SM 12.1)
**GPU Isolation:** TOTAL — all services stopped, only vLLM running per model
**Methodology:** 2 warmup + 5 measured runs, thinking disabled (`enable_thinking=False`)
**Context length:** 32,768 tokens
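
A minimal sketch of how the two headline metrics (TTFT and decode tok/s) can be derived from a streamed completion; the helper names, timestamps, and token counts below are illustrative, not taken from the actual harness:

```python
import statistics

def ttft_ms(start: float, first_token: float) -> float:
    """Time to first token, in milliseconds."""
    return (first_token - start) * 1000.0

def decode_tok_s(first_token: float, end: float, n_tokens: int) -> float:
    """Decode throughput: tokens after the first, over the decode window."""
    return (n_tokens - 1) / (end - first_token)

# Illustrative run: first token 125 ms in, 89 tokens streamed by t = 2.125 s
print(ttft_ms(0.0, 0.125))             # 125.0 (ms)
print(decode_tok_s(0.125, 2.125, 89))  # 44.0 (tok/s)

# The reported p50 over the 5 measured runs is just the median
print(statistics.median([44.0, 43.5, 44.1, 44.8, 43.2]))  # 44.0
```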

## Docker Images (Different by necessity)

| Model | Docker Image | Reason |
|-------|-------------|--------|
| Nemotron 3 Nano NVFP4 | `nvcr.io/nvidia/vllm:25.12.post1-py3` | NGC image has FlashInfer MoE FP4 kernels |
| Qwen3.5-27B NVFP4 | `hellohal2064/vllm-qwen3.5-gb10:latest` | NGC lacks `qwen3_5` model type support |

**Confound note:** The two images ship different vLLM builds, which may shift performance slightly. Both target Blackwell SM 12.1.

## GPU Memory (Solo, DGX Spark Unified Memory)

| Model | GPU Process Memory | Architecture |
|-------|-------------------|--------------|
| **Nemotron 3 Nano 30B NVFP4** | **44,732 MB** (0.5 util) | MoE: 31.6B total, 3.2B active |
| **Qwen3.5-27B NVFP4** | **28,612 MB** (0.3 util) | Dense: 27B all active |

*Note: GPU memory includes model weights + KV cache pre-allocation. Different utilization settings make direct comparison approximate. Model weights: Nemotron ~18.6 GB, Qwen ~19 GB.*
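
A back-of-envelope check on the weight figures, assuming NVFP4 stores 4-bit values plus roughly one FP8 block scale per 16 values (~4.5 bits/weight); the exact scale layout and which layers stay in higher precision are assumptions here:

```python
def nvfp4_weight_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough NVFP4 weight footprint in GB. Assumes 4-bit values plus ~0.5
    bit/weight of FP8 block scales; layers kept in higher precision
    (e.g. embeddings, lm_head) are not counted."""
    return params_billions * bits_per_weight / 8  # 1e9 params * bits / 8 = GB

print(nvfp4_weight_gb(31.6))  # ~17.8 GB; observed ~18.6 GB includes
                              # non-quantized layers and buffers
```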

## Performance Comparison

### Latency & Throughput (Standard Tests)

| Test | Metric | Nemotron Nano | Qwen3.5-27B | Winner | Speedup |
|------|--------|--------------|-------------|---------|---------|
| **simple_greeting** | TTFT p50 | **125 ms** | 282 ms | Nemotron | 2.3x |
| | tok/s p50 | **44.1** | 11.5 | Nemotron | 3.8x |
| **factual_question** | TTFT p50 | **233 ms** | 282 ms | Nemotron | 1.2x |
| | tok/s p50 | **59.6** | 12.6 | Nemotron | 4.7x |
| **conversational** | TTFT p50 | **197 ms** | 286 ms | Nemotron | 1.5x |
| | tok/s p50 | **64.5** | 12.9 | Nemotron | 5.0x |
| **reasoning** | TTFT p50 | **139 ms** | 291 ms | Nemotron | 2.1x |
| | tok/s p50 | **36.4** | 9.8 | Nemotron | 3.7x |
| **long_response** | TTFT p50 | **131 ms** | 286 ms | Nemotron | 2.2x |
| | tok/s p50 | **51.3** | 13.2 | Nemotron | 3.9x |

**Average decode speed: Nemotron 51.2 tok/s vs Qwen 12.0 tok/s = 4.3x faster**
**Average TTFT: Nemotron 165 ms vs Qwen 285 ms = 1.7x faster**
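
The headline averages can be reproduced directly from the per-test p50s in the table:

```python
nemotron_tok_s = [44.1, 59.6, 64.5, 36.4, 51.3]
qwen_tok_s     = [11.5, 12.6, 12.9,  9.8, 13.2]
nemotron_ttft  = [125, 233, 197, 139, 131]
qwen_ttft      = [282, 282, 286, 291, 286]

def avg(xs):
    return sum(xs) / len(xs)

print(round(avg(nemotron_tok_s), 1))  # 51.2 tok/s, vs Qwen ≈ 12.0 (≈ 4.3x)
print(avg(nemotron_ttft))             # 165.0 ms, vs Qwen 285.4 (≈ 1.7x)
```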

### Multi-turn TTFT Scaling (5 turns)

| Turn | Nemotron Nano | Qwen3.5-27B | Notes |
|------|--------------|-------------|-------|
| Turn 1 | N/A* | 282 ms | |
| Turn 2 | N/A* | 293 ms | |
| Turn 3 | N/A* | 320 ms | |
| Turn 4 | N/A* | 365 ms | |
| Turn 5 | N/A* | 412 ms | +46% from turn 1 |

*Nemotron multi-turn TTFT returned 0 ms due to a parser bug: the streaming fix was applied only to the standard tests, not the multi-turn suite. Qwen shows the expected DeltaNet linear-attention scaling: 282 ms → 412 ms over 5 turns (+46%).*

### Multi-turn TTFT Scaling (10 turns, Qwen only)

| Turn | TTFT p50 | Growth |
|------|----------|--------|
| Turn 1 | 285 ms | baseline |
| Turn 5 | 469 ms | +65% |
| Turn 8 | 659 ms | +131% |
| Turn 10 | 821 ms | +188% |

*Qwen3.5-27B shows near-linear TTFT growth with context length. Nemotron's Mamba-2 architecture should show near-constant TTFT (O(1) state), but multi-turn data was not captured in this run.*
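
The "near-linear" claim can be checked with an ordinary least-squares fit over the four measured turns (a sketch; the fit itself was not part of the benchmark):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept (no numpy needed)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

turns = [1, 5, 8, 10]
ttft  = [285, 469, 659, 821]  # ms, from the table above

slope, intercept = linear_fit(turns, ttft)
print(f"{slope:.1f} ms per turn")  # ≈ 59 ms/turn of accumulated context
```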

## Tool Calling (16-case 3-tool suite)

| Model | Score | Accuracy | Notes |
|-------|-------|----------|-------|
| **Nemotron Nano** | **5/16** | **31%** | Passed 5 of the first 6 tests; vLLM then crashed for the remaining 10 |
| **Qwen3.5-27B** | **0/16** | **0%** | Used `TOOL_SYSTEM_PROMPT` not QAT-trained prompt; model answered directly instead of calling tools |

**IMPORTANT CAVEAT:** Both tool-16 results are unreliable:
- Nemotron: vLLM crashed mid-suite (tests 7-16 all failed). First 6: 5/6 = 83%
- Qwen: The 16-case suite uses `TOOL_SYSTEM_PROMPT` (generic), not the QAT-v4 production prompt. The 2-tool standard tests show Qwen correctly calling `web_search` (5/5 ✓) but failing `search_memory` (0/5 — answered directly). This is a benchmark prompt issue, not a model issue.
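
The pass/fail distinction above (tool called vs "answered directly") can be sketched as a classifier over the OpenAI-style message shape that vLLM returns; the function name is ours, not the harness's:

```python
def classify_tool_response(message: dict, expected_tool: str) -> str:
    """Classify one chat-completion message from a tool-calling test.
    'pass' = expected tool called; 'wrong_tool' = a different tool called;
    'answered_directly' = no tool call at all (the Qwen failure mode above).
    Assumes the OpenAI-style message shape."""
    calls = message.get("tool_calls") or []
    names = {c["function"]["name"] for c in calls}
    if not names:
        return "answered_directly"
    return "pass" if expected_tool in names else "wrong_tool"

# The search_memory failures looked like a direct answer, no tool call:
print(classify_tool_response({"content": "You mentioned it yesterday."},
                             "search_memory"))  # answered_directly
```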

**Prior benchmark data (more reliable):**
- Nemotron Nano (Ollama Q4_K_M, Session 301): 12/16 = 75%
- Qwen3.5-9B QAT-v4 (production): 15/16 = 94%

## Entity Extraction (Graphiti Use Case)

| Test | Nemotron Nano | Qwen3.5-27B |
|------|--------------|-------------|
| English transcript (7 persons, 3 places) | JSON parse failed* | **7/7 persons, valid JSON** |
| Kannada mixed (6 entities) | JSON parse failed* | **6/6 entities, valid JSON** |

*Nemotron entity extraction returned empty responses — likely the model's thinking went entirely to `reasoning_content` with no visible content output. Needs reasoning-ON mode for extraction tasks.*
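
The suspected failure mode is easy to illustrate: with a reasoning parser active, tokens can be routed entirely to `reasoning_content`, leaving `content` empty even though the model generated output (the message shape is assumed, per vLLM's OpenAI-compatible responses):

```python
def visible_text(message: dict) -> str:
    """User-visible output from a chat message; reasoning tokens that the
    parser routed to `reasoning_content` are deliberately not included."""
    return (message.get("content") or "").strip()

# Suspected Nemotron case: all extraction output landed in reasoning_content
msg = {"content": None, "reasoning_content": '{"persons": ["..."]}'}
print(repr(visible_text(msg)))  # '' -> downstream JSON parse fails
```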

## Quality Gates (All Standard Tests)

| Gate | Nemotron Nano | Qwen3.5-27B |
|------|--------------|-------------|
| Non-empty | 25/25 (100%) | 25/25 (100%) |
| No markdown | 15/15 (100%) | 15/15 (100%) |
| Factual correct | 10/10 (100%) | 10/10 (100%) |
| Thinking leak | 0/25 (0%) | 0/25 (0%) |

Both models pass all quality gates perfectly.
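
Three of the four gates reduce to simple predicates; the patterns below are illustrative stand-ins for the harness's actual definitions, and the "factual correct" gate needs a per-test answer key so it is omitted:

```python
import re

THINK_LEAK = re.compile(r"</?think>", re.IGNORECASE)
MARKDOWN   = re.compile(r"(\*\*|__|^#{1,6}\s|^[-*]\s|```)", re.MULTILINE)

def quality_gates(text: str) -> dict:
    """Gate predicates: True means the response passes that gate."""
    return {
        "non_empty": bool(text.strip()),
        "no_markdown": not MARKDOWN.search(text),
        "no_thinking_leak": not THINK_LEAK.search(text),
    }

print(quality_gates("Hello! How can I help?"))  # all three True
```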

## Summary Comparison

| Metric | Nemotron 3 Nano 30B | Qwen3.5-27B | Winner |
|--------|--------------------|--------------|----|
| **TTFT (p50)** | **125-233 ms** | 282-291 ms | **Nemotron (2x)** |
| **Decode speed** | **36-65 tok/s** | 10-13 tok/s | **Nemotron (4x)** |
| **GPU Memory (model)** | ~18.6 GB | ~19 GB | Tie |
| **Tool calling** | 31-83%* | 0-100%** | Inconclusive |
| **Entity extraction** | Failed (parser) | **100%** | **Qwen** |
| **Kannada** | Failed (parser) | **100%** | **Qwen** |
| **Multi-turn scaling** | N/A (parser bug) | 282→821ms | Need retest |
| **Quality gates** | 100% | 100% | Tie |
| **Startup time** | ~545s (9 min) | ~165s (2.8 min) | Qwen (3.3x faster) |
| **Stability** | Crashed at 0.3 util | Stable | **Qwen** |

\* Nemotron tool-16: 5/6 before crash = 83%, but only 6 tests completed
\** Qwen tool-16: 0/16 due to wrong system prompt. Standard 2-tool: web_search 5/5 ✓, search_memory 0/5 ✗

## Recommendation

**For Annie Voice (primary use case): Keep Qwen3.5-9B QAT-v4** (current production)
- The 9B QAT model is behaviorally tuned with 999 adversarial conversations
- 10/10 behavioral pass rate
- 35 tok/s is sufficient for voice (Annie speaks ~3 words/sec)

**For entity extraction (Context Engine): Nemotron Nano is worth investigating**
- 4x throughput (51 tok/s vs 12 tok/s) would dramatically speed up Graphiti extraction
- But entity extraction needs reasoning-ON mode (which our benchmark disabled)
- Re-benchmark with `enable_thinking=True` and `reasoning_budget=512` for extraction tasks

**For future Annie Voice 2.0 (when speed matters more):**
- Nemotron's 125ms TTFT + 51 tok/s would enable real-time voice (2x faster first word, 4x faster response)
- But needs QAT/fine-tuning for behavioral alignment (tool calling, persona)
- The MoE architecture (3.2B active) is fundamentally more efficient for single-user voice

**Action items:**
1. Re-run Nemotron benchmark with `enable_thinking=True` for extraction tasks
2. Fix multi-turn parser to capture Nemotron's reasoning_content properly
3. Benchmark Nemotron with production QAT-v4 system prompt for tool calling accuracy
4. Test Nemotron stability at 0.5-0.7 GPU utilization for sustained workloads

## Production Viability Test (All Services Running)

Ran Nemotron benchmark with ALL production services simultaneously:
- vllm-annie (Qwen3.5-9B QAT-v4, port 8003)
- vllm-nemotron (Nemotron Nano NVFP4, port 8004)
- Ollama (qwen3-embedding:8b)
- Context Engine, Audio Pipeline, SearXNG

**Stability fixes applied:**
- `--enforce-eager` (disables CUDA graphs — fixes SM 12.1 crash)
- `VLLM_FLASHINFER_MOE_BACKEND=latency` (stable backend for SM 12.1)
- `--gpu-memory-utilization 0.35`
- Removed `--load-format fastsafetensors` (not in NGC image)

### Production Results (1 warmup + 3 measured, 0 errors)

| Test | TTFT | tok/s | Tokens |
|------|------|-------|--------|
| simple_greeting | **109-115ms** | 32-37 | 15-30 |
| factual_question | 120-123ms | 43-48 | 7-47 |
| conversational_medium | 121-126ms | 52-53 | 111-140 |
| tool_call_memory | 155-158ms | 19-48 | 19-57 |
| tool_call_web | 156-165ms | 1.4 | tool call only |
| reasoning | 132-138ms | 25-32 | 20-33 |
| long_response | 122-126ms | 46-49 | 130-156 |

### MCP Tool Calling (Chrome Browser): 8/10 (80%)

| Test | Tool | Result |
|------|------|--------|
| "Open Google" | browser_navigate | PASS |
| "Click search button" | browser_click | PASS |
| "Type in search box" | browser_fill_form | PASS |
| "What does the page say?" | browser_get_page_text | PASS |
| "Take a screenshot" | browser_take_screenshot | PASS |
| "Go to news.ycombinator.com" | browser_navigate | PASS |
| "Click first article" | browser_click | PASS |
| "Fill login form with email" | browser_fill_form | FAIL (answered directly) |
| "Read main article" | browser_get_page_text | FAIL (answered directly) |
| "Navigate to NVIDIA forums" | browser_navigate | PASS |

### Tool-16 (3-tool suite): 12/16 (75%)

Failures: `get_entity_details` confused with `search_memory` (2), `web_search` skipped (1), `search_memory` skipped (1).

### Production VRAM Profile

| Process | GPU Memory |
|---------|-----------|
| vLLM Nemotron Nano (port 8004) | 27,648 MiB |
| vLLM Annie QAT-v4 (port 8003) | 50,432 MiB |
| **Total** | **76.2 GiB** |
| **Headroom** | **51.8 GiB** (for audio, embeddings, OS) |

### Startup Time

| Model | Startup | Notes |
|-------|---------|-------|
| Nemotron (with --enforce-eager) | **~3-4 min** | No torch.compile needed |
| Annie QAT-v4 | ~1 min | Cached CUDA graphs |

**VERDICT: Production-ready.** Zero crashes, 126ms TTFT, 40 tok/s, 80% MCP accuracy, 52 GB headroom.

## Sources

- [NVIDIA Nemotron 3 Nano NVFP4 (HuggingFace)](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4)
- [DGX Spark NVFP4 65+ tps thread](https://forums.developer.nvidia.com/t/dgx-spark-nemotron3-and-nvfp4-getting-to-65-tps/355261)
- [vLLM Nemotron Nano Recipe](https://docs.vllm.ai/projects/recipes/en/latest/NVIDIA/Nemotron-3-Nano-30B-A3B.html)
- [DGX Spark vLLM + FlashInfer benchmarks](https://forums.developer.nvidia.com/t/testing-nemotron-3-nano-models-on-nvidia-dgx-spark-jetson-thor-with-vllm-and-flashinfer/360642)
