# Research: NVIDIA Nemotron Voice Agent Stack on DGX Spark

**Date:** 2026-03-12
**Status:** Research complete -- no code written
**Repo:** [pipecat-ai/nemotron-january-2026](https://github.com/pipecat-ai/nemotron-january-2026)

---

## Executive Summary

NVIDIA + Daily.co released a complete open-source voice agent stack in January 2026 built on three models: **Nemotron Speech ASR 0.6B** (STT), **Nemotron 3 Nano 30B-A3B** (LLM), and **Magpie TTS 357M** (TTS). All three run together in a single unified Docker container targeting DGX Spark (SM_121/Blackwell). The stack is orchestrated via Pipecat and achieves **~1.2s median voice-to-voice latency on DGX Spark** (Q8 mode).

**Key finding for her-os:** The stack is fully modular -- each component (ASR, LLM, TTS) runs as an independent service on separate ports and can be swapped independently. We could adopt just the ASR (Nemotron Speech 0.6B) and keep our Qwen3.5-9B LLM + Kokoro TTS. In fact, we already did this in Session 293 (Nemotron Speech 0.6B is now live in Annie Voice as creature `serpent`).

---

## 1. The Three Models

### 1.1 Nemotron Speech ASR 0.6B

| Property | Value |
|----------|-------|
| Architecture | FastConformer-CacheAware-RNNT (24 encoder layers) |
| Parameters | 600M |
| VRAM | ~2.4 GB (FP16) |
| Sample rate | 16 kHz mono |
| Streaming latency modes | 80ms, 160ms, 560ms, 1.12s (runtime configurable) |
| Output | English text with native punctuation + capitalization |
| Training data | 285k hours (NVIDIA Granary + public datasets) |
| License | NVIDIA Open Model License (commercial OK) |
| HuggingFace | [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) |

**WER Benchmarks (by chunk size):**

| Dataset | 80ms | 160ms | 560ms | 1.12s |
|---------|------|-------|-------|-------|
| LibriSpeech test-clean | 2.71 | 2.40 | 2.40 | 2.31 |
| LibriSpeech test-other | 5.48 | 4.97 | 4.97 | 4.75 |
| TEDLIUM | 4.76 | 4.46 | 4.46 | 4.50 |
| AMI | 12.12 | 11.69 | 11.69 | 11.58 |
| Earnings22 | 13.02 | 12.61 | 12.61 | 12.48 |
| Gigaspeech | 11.87 | 11.43 | 11.43 | 11.45 |
| **Average** | **8.53** | **7.84** | **7.22** | **7.16** |

**Key innovation:** 8x downsampling (vs the traditional 4x) halves the encoder frame rate, reducing VRAM per stream and boosting throughput. The cache-aware design carries encoder hidden states across chunks, so overlapping audio never has to be recomputed. On H100, the model supports ~560 concurrent streams at 320ms chunk size.
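
The frame-rate effect of the downsampling change can be sketched with back-of-envelope arithmetic. This assumes the standard 10ms mel-spectrogram hop; actual NeMo feature configs may differ.

```python
# Back-of-envelope effect of 8x vs 4x encoder downsampling on per-stream state.
# Assumes a 10 ms mel hop (a common default), which is an assumption here.

def encoder_frames_per_second(mel_hop_ms: float, downsample: int) -> float:
    """Encoder output frames per second of audio."""
    mel_frames_per_s = 1000.0 / mel_hop_ms   # 100 frames/s at a 10 ms hop
    return mel_frames_per_s / downsample

fps_4x = encoder_frames_per_second(10.0, 4)  # traditional FastConformer
fps_8x = encoder_frames_per_second(10.0, 8)  # Nemotron Speech

print(fps_4x, fps_8x)  # 25.0 12.5 -> half the frames to cache per stream
```

Half the encoder frames per second means roughly half the cached state per concurrent stream, which is where the throughput gain comes from.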

**Hotword/biasing support:** The model itself has NO native hotword biasing. However, the NeMo framework provides CTC-based context biasing (word spotter + hotword_weight parameter) that works with FastConformer RNNT models. This is a post-decode biasing approach, not the native prompt-level biasing that Qwen3-ASR offers.

### 1.2 Nemotron 3 Nano 30B-A3B (LLM)

| Property | Value |
|----------|-------|
| Architecture | Hybrid Mamba-2 + Transformer MoE |
| Total parameters | 31.6B |
| Active per token | ~3.6B (128 routed experts + 1 shared, 6 active) |
| Layers | 52 total (23 Mamba-2, 23 MoE, 6 GQA attention) |
| Context window | 1M tokens |
| Reasoning | ON/OFF modes, configurable thinking budget |
| License | NVIDIA Open Model License |

**VRAM by quantization:**

| Quantization | VRAM | Recommended hardware |
|-------------|------|---------------------|
| BF16 | ~72 GB | Multi-GPU (H100/A100) |
| FP8 | ~32 GB | DGX Spark |
| Q8 (GGUF) | ~32 GB | DGX Spark (default) |
| NVFP4 | ~20 GB | DGX Spark (compact) |
| Q4 (GGUF) | ~16-24 GB | RTX 5090 |

**DGX Spark performance:**
- Prefill: 1,000 tokens/second
- Generation: 14 tokens/second (native), up to 30 t/s with SM_121 kernel optimization
- 1M token context with stable performance (Mamba linear state, no KV cache blowup)

**Benchmarks vs Qwen3-30B-A3B:**

| Benchmark | Nemotron 3 Nano | Qwen3-30B-A3B | Winner |
|-----------|----------------|---------------|--------|
| MATH | 82.88% | 61.14% | Nemotron (+21.7pp) |
| AIME 2025 | 89.1% | 85.0% | Nemotron |
| LiveCodeBench v6 | 68.3% | 66.0% | Nemotron |
| Arena-Hard-v2 | 67.7% | 57.8% | Nemotron |
| MMLU-Pro | 78.3% | 80.9% | Qwen3 |
| RULER-100 (256K) | 92.9% | 89.4% | Nemotron |
| RULER-100 (1M) | 86.3% | 77.5% | Nemotron |
| Throughput (H200) | 3.3x baseline | 1x baseline | Nemotron |

**Voice agent specific (Daily.co aiewf eval):**
- 91.4% pass rate (287/304 tool use, 286/304 instruction following, 298/304 knowledge grounding)
- Nemotron 3 Super (bigger sibling) matches GPT-4.1 on 30-turn voice conversations

### 1.3 Magpie TTS 357M

| Property | Value |
|----------|-------|
| Architecture | Transformer encoder-decoder with local refinement |
| Parameters | 357M |
| Encoder | 6-layer causal transformer |
| Decoder | 12-layer causal transformer |
| Audio codec | NanoCodec (8 codebooks, 22kHz, 1.89kbps) |
| VRAM | ~1.4 GB model weights (NIM recommends 16+ GB total) |
| Languages | 9 (En, Es, De, Fr, Vi, It, Zh, Hi, Ja) |
| Voices | 5 (Sofia, Aria, Jason, Leo, John Van Stan) |
| Max duration | 20 seconds per generation (standard mode) |
| License | NVIDIA Open Model License |

**Quality metrics:**

| Metric | Value |
|--------|-------|
| ELO (Artificial Analysis) | 1,014 (#17 on leaderboard) |
| CER English (LibriTTS) | 0.34% |
| SV-SSIM English | 0.835 |

**Latency on DGX Spark:**
- Batch mode: ~600ms per sentence
- Streaming TTFB: ~185ms (P50), up to 1171ms (P90)
- With adaptive streaming optimization: 2.3x improvement on DGX Spark

---

## 2. The Unified Stack (pipecat-ai/nemotron-january-2026)

### 2.1 Architecture

```
                      +---------+
                      |  Client |  (WebRTC / Daily / Twilio)
                      +----+----+
                           |
                    +------+------+
                    | Pipecat Bot |  (Python orchestrator)
                    +------+------+
                     /     |      \
            +-------+ +-------+ +-------+
            |  ASR  | |  LLM  | |  TTS  |
            | :8080 | | :8000 | | :8001 |
            +-------+ +-------+ +-------+
               WS       HTTP    HTTP+WS

        All three run inside ONE Docker container
        (Dockerfile.unified, shared CUDA context)
```
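
Since the three services sit on fixed local ports, a quick reachability probe is easy to sketch. Ports are taken from the diagram above; note a TCP connect only confirms the port is open, not that the service behind it is healthy.

```python
# Quick reachability check for the three services in the diagram above.
# A successful TCP connect proves the port is open, nothing more.
import socket

SERVICES = {"asr": 8080, "llm": 8000, "tts": 8001}

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_stack(host: str = "localhost") -> dict[str, bool]:
    return {name: port_open(host, port) for name, port in SERVICES.items()}

if __name__ == "__main__":
    for name, up in check_stack().items():
        print(f"{name}: {'up' if up else 'down'}")
```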

### 2.2 Docker Build (Dockerfile.unified)

```bash
docker build -f Dockerfile.unified -t nemotron-unified:cuda13 .
```

- **Build time:** 2-3 hours (compiles from source)
- **What gets compiled:** PyTorch (with NVRTC), torchaudio, NeMo toolkit, vLLM, llama.cpp
- **Target:** CUDA 13.1 / Blackwell SM_121 (DGX Spark ARM64) or CUDA 13.0 / SM_120 (RTX 5090 x86_64)
- **Base image:** NVIDIA NGC container
- **SM_121 critical:** `cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"` for llama.cpp

### 2.3 Start Script (scripts/nemotron.sh)

```bash
./scripts/nemotron.sh start [--mode MODE] [--model PATH]
./scripts/nemotron.sh stop
./scripts/nemotron.sh restart
./scripts/nemotron.sh status
./scripts/nemotron.sh logs [SERVICE]
./scripts/nemotron.sh shell
```

**Modes:**
- `llamacpp-q8` (default) -- 32 GB VRAM, DGX Spark
- `llamacpp-q4` -- 16 GB VRAM, RTX 5090
- `vllm` -- 72+ GB VRAM, multi-GPU production

**Service disable flags:** `--no-asr`, `--no-tts`, `--no-llm`

**Model auto-detection:** Searches `~/.cache/huggingface/hub/` for Q8 GGUF first, falls back to Q4.

### 2.4 Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `NVIDIA_ASR_URL` | `ws://localhost:8080` | ASR WebSocket endpoint |
| `NVIDIA_LLAMA_CPP_URL` | `http://localhost:8000` | LLM HTTP endpoint |
| `NVIDIA_TTS_URL` | `http://localhost:8001` | TTS HTTP+WS endpoint |
| `ENABLE_RECORDING` | `false` | Stereo audio capture |
| `LLM_MODE` | `llamacpp-q8` | Backend selection |
| `SERVICE_TIMEOUT` | `60` (600 first run) | Startup timeout (seconds) |
| `PYTORCH_CUDA_ALLOC_CONF` | `expandable_segments:True` | Memory allocator config |
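
A typical pre-launch environment, using the defaults from the table above (the extended first-run timeout is the one value changed from its default):

```shell
# Point the Pipecat bot at the three local services (defaults from the table).
export NVIDIA_ASR_URL="ws://localhost:8080"
export NVIDIA_LLAMA_CPP_URL="http://localhost:8000"
export NVIDIA_TTS_URL="http://localhost:8001"
export LLM_MODE="llamacpp-q8"
export SERVICE_TIMEOUT=600   # first run downloads models; allow 10 minutes
```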

### 2.5 Model Downloads

**Auto-downloaded on first run:**
- `nvidia/nemotron-speech-streaming-en-0.6b` (~2.4 GB)
- `nvidia/magpie_tts_multilingual_357m` (~1.4 GB)

**Manual download required (LLM):**
```bash
# Q8 for DGX Spark
huggingface-cli download unsloth/Nemotron-3-Nano-30B-A3B-GGUF

# BF16 for vLLM
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

### 2.6 Bot Variants

| File | LLM Backend | Turn Detection | Use Case |
|------|-------------|----------------|----------|
| `bot_interleaved_streaming.py` | llama.cpp buffered | SmartTurnAnalyzerV3 | Lowest latency (single GPU) |
| `bot_simple_vad.py` | llama.cpp buffered | Fixed silence threshold | Simple deployment |
| `bot_vllm.py` | vLLM + SentenceAggregator | SmartTurnAnalyzerV3 | Production multi-GPU |

**Transport options:** `-t webrtc` (default, localhost:7860), `-t daily`, `-t twilio`

### 2.7 Interleaved Streaming Pipeline

The key latency optimization for single-GPU deployments:

1. ASR streams at 160ms chunks via WebSocket
2. Smart Turn model (CPU) detects end-of-turn on 200ms pause
3. LLM generates **first 24 tokens** quickly (small segment for fast TTFB)
4. TTS streams first audio chunk (~370ms TTFB in streaming mode)
5. LLM continues generating in larger 96-token segments
6. TTS processes subsequent segments in batch mode (higher quality)
7. GPU time-slices between LLM and TTS (never concurrent)

**KV cache reuse:** Single-slot llama.cpp operation achieves 100% KV cache reuse between turns (no context re-evaluation).
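
The 24/96-token split in steps 3-5 reduces to a simple segmentation of the LLM token stream. The sketch below is illustrative only, not the repo's actual implementation:

```python
# Sketch of the interleaved segmentation: a small first segment for fast TTS
# TTFB, larger segments afterwards. Not the repo's actual code.
from typing import Iterable, Iterator

def segment_tokens(tokens: Iterable[str], first: int = 24,
                   rest: int = 96) -> Iterator[list[str]]:
    """Yield token segments sized for interleaved LLM -> TTS handoff."""
    buf: list[str] = []
    limit = first
    for tok in tokens:
        buf.append(tok)
        if len(buf) >= limit:
            yield buf
            buf, limit = [], rest   # subsequent segments are larger
    if buf:
        yield buf                   # flush the tail at end of turn

sizes = [len(s) for s in segment_tokens(["tok"] * 200)]
print(sizes)  # [24, 96, 80]
```

The first segment reaches TTS after only 24 generated tokens, while later segments are large enough for batch-mode synthesis.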

---

## 3. Voice-to-Voice Latency Benchmarks

### 3.1 RTX 5090 (Best Single-GPU)

| Component | P50 | Range |
|-----------|-----|-------|
| ASR | 19ms | 13-70ms |
| LLM | 171ms | 71-255ms |
| TTS | 108ms | 99-146ms |
| **Total V2V** | **508ms** | **415-639ms** |

### 3.2 DGX Spark (Our Hardware)

| Component | P50 | Range |
|-----------|-----|-------|
| ASR | 27ms | 24-122ms |
| LLM | 750ms | 343-1669ms |
| TTS | 185ms | 158-1171ms |
| **Total V2V** | **1,180ms** | **759-2,981ms** |

### 3.3 Production (3x H100, 64 concurrent streams)

| Metric | Value |
|--------|-------|
| End-to-End | 1.0s |
| ASR | 67ms |
| TTS TTFB | 110ms |
| LLM TTFT | 156ms |
| LLM first sentence | 386ms |

### 3.4 Adaptive TTS Improvement (DGX Spark)

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| P50 | 421ms | 185ms | 2.3x |
| Mean | 1,365ms | 950ms (estimated) | ~1.4x |
| P90 | 3,836ms | 3,000ms (estimated) | ~1.3x |

---

## 4. Total VRAM Requirements

### 4.1 Pipecat Unified Stack (All Three Models)

| Configuration | ASR | LLM | TTS | Total | Hardware |
|--------------|-----|-----|-----|-------|----------|
| Q4 (compact) | 2.4 GB | ~16 GB | 1.4 GB | **~20 GB** | RTX 5090 |
| Q8 (default) | 2.4 GB | ~32 GB | 1.4 GB | **~36 GB** | DGX Spark |
| BF16 (production) | 2.4 GB | ~72 GB | 1.4 GB | **~76 GB** | Multi-GPU |

### 4.2 NIM Blueprint (Enterprise, Separate Containers)

| Component | GPU Allocation | VRAM |
|-----------|---------------|------|
| ASR + TTS | 1x GPU | 16+ GB each |
| LLM (Nano) | 1-2x GPU | 48-80 GB |
| **Total** | **2-3x GPUs** | **80-160 GB** |

### 4.3 Comparison with Current her-os Annie Voice Stack

| Component | her-os Current | NVIDIA Stack (Q8) | Delta |
|-----------|---------------|-------------------|-------|
| STT | Nemotron 0.6B (2.49 GB) | Nemotron 0.6B (2.4 GB) | Same model |
| LLM | Qwen3.5-9B Q4_K_M (6.6 GB) | Nemotron 3 Nano Q8 (32 GB) | +25.4 GB |
| TTS | Kokoro v0.19 (0.5 GB) | Magpie TTS (1.4 GB) | +0.9 GB |
| **Voice total** | **~9.6 GB** | **~36 GB** | **+26.4 GB** |

---

## 5. Key Questions Answered

### Q1: Total VRAM for all 3 models?

**~36 GB in Q8 mode** (default for DGX Spark). The LLM dominates: 32 GB for Nemotron 3 Nano Q8, plus ~4 GB combined for ASR + TTS. In Q4 mode the total drops to ~20 GB; in BF16 it jumps to ~76 GB.

On our DGX Spark (128 GB unified), the Q8 stack would consume 36 GB for voice alone, vs our current 9.6 GB. That leaves less headroom for extraction (40 GB qwen3.5:27b) and embeddings (14 GB).

### Q2: How does Nemotron 3 Nano compare to Qwen3.5-9B for conversation quality?

Nemotron 3 Nano is a **much more capable model** -- it's a 30B MoE with 3.6B active parameters vs Qwen3.5-9B's 9B dense parameters. Benchmarks show it beating Qwen3-30B (a larger sibling of our 9B) on math (+21.7pp), coding, reasoning, and long-context tasks. It only trails on MMLU-Pro by 2.6pp.

However, on DGX Spark it generates at only **14 t/s** (our Qwen3.5-9B runs faster simply because it is smaller). On the upside, the 1M-token context window and Mamba layers mean it does not slow down as conversations grow, the way pure-transformer KV caches do.

**For voice specifically:** Daily.co's aiewf benchmark shows 91.4% pass rate on tool use + instruction following + knowledge grounding in multi-turn voice conversations. The 512-token thinking budget balances accuracy and latency.

**Trade-off:** Better quality, but 5x more VRAM and slower generation. Our Qwen3.5-9B is adequate for current conversations and costs only 6.6 GB.

### Q3: Does Magpie TTS compare favorably to Kokoro?

**No. Kokoro is better by measurable margins:**

| Metric | Kokoro 82M | Magpie 357M |
|--------|-----------|-------------|
| ELO (Artificial Analysis) | 1,059 (#9) | 1,014 (#17) |
| Parameters | 82M | 357M |
| VRAM | ~0.5 GB | ~1.4 GB |
| Latency (our DGX Spark) | ~30ms | ~185ms (P50) batch: ~600ms |
| Languages | English (+ limited multilingual) | 9 languages |
| Voice cloning | No | No (5 fixed voices) |

Kokoro achieves higher quality with 4.4x fewer parameters and dramatically lower latency. Magpie's advantage is multilingual support (Hindi, Japanese, etc.) -- relevant if we add Kannada/Hindi TTS later. But for English voice quality, Kokoro wins.

### Q4: Is there hotword/biasing support in the stack?

**Nemotron Speech ASR: NO native hotword biasing.** The model is a FastConformer-RNNT -- it does not accept prompt text or hotword lists at inference time. The NeMo framework provides a CTC-based word spotter that can bias decoding, but this is a post-decode approach (not as effective as Qwen3-ASR's native prompt-level biasing with context paragraphs).

This is a significant gap. The "Claude" -> "cloud" problem that motivated our STT work would NOT be solved by Nemotron Speech ASR alone. We would still need post-processing correction.
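
A minimal sketch of that post-processing correction: snap ASR output words to a hotword list by string similarity. This is an illustration, not part of the NVIDIA stack; a real fix would use phonetic distance rather than `difflib`, and the hotword list here is hypothetical.

```python
# Minimal post-decode hotword correction sketch (illustrative only).
# Snaps ASR output words to a hotword list by edit similarity.
import difflib

HOTWORDS = ["Claude", "Pipecat", "Nemotron"]  # hypothetical example list

def correct(text: str, hotwords: list[str] = HOTWORDS,
            cutoff: float = 0.7) -> str:
    lowered = {w.lower(): w for w in hotwords}  # match case-insensitively
    out = []
    for word in text.split():
        match = difflib.get_close_matches(word.lower(), lowered,
                                          n=1, cutoff=cutoff)
        out.append(lowered[match[0]] if match else word)
    return " ".join(out)

print(correct("ask cloud about the new model"))
# -> ask Claude about the new model
```

A similarity cutoff around 0.7 catches "cloud" -> "Claude" while leaving ordinary words untouched; too low a cutoff starts corrupting normal text.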

**Comparison:**

| Feature | Nemotron Speech 0.6B | Qwen3-ASR 1.7B | WhisperX large-v3 |
|---------|---------------------|----------------|-------------------|
| Hotword biasing | NeMo CTC word spotter (post-decode) | Native prompt-level + context paragraphs | initial_prompt (weak) |
| Streaming | Yes (80ms-1.12s configurable) | No (offline only) | No (offline only) |
| WER (average) | 7.16-8.53% | ~4.2% (claimed) | ~8-10% |
| VRAM | 2.4 GB | 3.5 GB | 4.8 GB |
| Latency to final transcript | 24ms median | N/A (batch) | N/A (batch) |
| Language | English only | Multilingual | Multilingual |

### Q5: What's the voice-to-voice latency on DGX Spark specifically?

**Median: 1,180ms (P50).** Range: 759ms to 2,981ms. This is with all three NVIDIA models running on DGX Spark in Q8 mode with interleaved streaming.

The LLM is the bottleneck: 750ms P50 for LLM alone. Our current stack (Qwen3.5-9B + Kokoro) achieves faster V2V because the 9B model generates much faster than the 30B Nemotron on the same hardware.

### Q6: Can we swap just STT and keep our LLM + TTS?

**Yes, absolutely.** The stack is fully modular:

1. Each model runs as an independent service on its own port (ASR:8080, LLM:8000, TTS:8001)
2. The `nemotron.sh` script supports `--no-asr`, `--no-tts`, `--no-llm` flags to disable individual components
3. Pipecat's architecture is designed for component swapping -- "high flexibility -- easy to swap out STT, TTS, and LLM independently"
4. The ASR service exposes a standard WebSocket interface, LLM is OpenAI-compatible HTTP, TTS is HTTP+WebSocket

**We already did this** -- Session 293 integrated Nemotron Speech 0.6B into Annie Voice as the `serpent` creature, keeping Qwen3.5-9B (minotaur) and Kokoro (leviathan). This is the optimal configuration for our VRAM budget.

---

## 6. Comparison: Full NVIDIA Stack vs Current her-os Stack

| Dimension | her-os (Current) | NVIDIA Full Stack (Q8) | Winner |
|-----------|-----------------|----------------------|--------|
| **STT model** | Nemotron 0.6B | Nemotron 0.6B | Tie (same) |
| **STT streaming** | Yes (NeMo RNNT) | Yes (NeMo RNNT) | Tie |
| **STT hotword biasing** | Post-decode only | Post-decode only | Tie |
| **LLM model** | Qwen3.5-9B Q4 | Nemotron 3 Nano Q8 | NVIDIA (quality) / her-os (VRAM) |
| **LLM VRAM** | 6.6 GB | 32 GB | her-os (5x less) |
| **LLM quality** | Good for casual chat | Matches GPT-4.1 on voice tasks | NVIDIA |
| **LLM speed (DGX Spark)** | Fast (~40+ t/s) | Slow (14 t/s) | her-os |
| **TTS model** | Kokoro v0.19 | Magpie 357M | her-os (quality) |
| **TTS quality (ELO)** | 1,059 (#9) | 1,014 (#17) | her-os |
| **TTS latency** | ~30ms | ~185ms (P50) | her-os (6x faster) |
| **TTS multilingual** | English only | 9 languages (inc. Hindi) | NVIDIA |
| **Total voice VRAM** | ~9.6 GB | ~36 GB | her-os (3.7x less) |
| **V2V latency (est.)** | <800ms | ~1,180ms (P50) | her-os |
| **Docker build** | Standard pip | 2-3 hour source compile | her-os |
| **Modularity** | Pipecat | Pipecat | Tie |
| **Context window** | 32K | 1M | NVIDIA |

---

## 7. Adoption Recommendations

### Already Done
- **Nemotron Speech ASR 0.6B** -- Already integrated in Session 293 as creature `serpent`. 2.49 GB VRAM, 431ms avg latency. This was the right call.

### Consider Later
- **Nemotron 3 Nano** -- When conversation quality becomes the bottleneck. The 32 GB Q8 VRAM cost means we'd need to either:
  - Drop Qwen3.5-9B (6.6 GB) and use Nemotron for voice (net +25.4 GB)
  - Or use NVFP4 quantization (~20 GB) at some quality cost
  - With extraction (40 GB) + Nemotron voice (32 GB) + audio pipeline (10 GB) = 82 GB, still under 110 GB budget
- **Magpie TTS** -- Only if Hindi/Kannada TTS becomes a priority. Kokoro is strictly better for English.

### Do Not Adopt
- **Full unified container** -- Our modular approach (separate services) is better for our use case. The 2-3 hour Docker build is painful, and we need fine-grained control over VRAM allocation.
- **llama.cpp backend for LLM** -- We already have llama-server running. No need to switch.

### Worth Stealing
- **Interleaved streaming pattern** -- The 24-token first segment + 96-token subsequent segments approach could improve our Annie Voice TTFB.
- **Adaptive TTS** -- Streaming mode for first chunk, batch mode for subsequent chunks. Could apply to Kokoro.
- **SmartTurnAnalyzerV3** -- Pipecat's turn detection model running on CPU. Better than fixed silence threshold.
- **KV cache reuse** -- Single-slot llama.cpp with 100% cache reuse. Check if our llama-server already does this.

---

## 8. PyPI Packages

| Package | Version | Purpose |
|---------|---------|---------|
| `nvidia-pipecat` | 0.2.0 | NVIDIA services for Pipecat (Riva ASR/TTS, NIM LLM, NAT) |
| `pipecat-ai` | 0.0.98+ | Core Pipecat framework |
| `nemo_toolkit[asr]` | 25.11+ | NeMo for Nemotron Speech inference |
| `nemo_toolkit[tts]` | 25.11+ | NeMo for Magpie TTS inference |

Note: `pipecat-ai[nvidia]` is NOT a valid extra. Use `nvidia-pipecat` separately. Requires Python 3.12.
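
An install line consistent with the table above (versions used as lower bounds or pins per the table; run under Python 3.12):

```shell
# nvidia-pipecat is a separate package, NOT a pipecat-ai extra.
pip install "pipecat-ai>=0.0.98" "nvidia-pipecat==0.2.0" \
            "nemo_toolkit[asr,tts]>=25.11"
```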

---

## 9. VRAM Budget Impact Analysis (DGX Spark 128 GB)

### Scenario A: Current her-os (status quo)

| Component | VRAM |
|-----------|------|
| Audio pipeline (Whisper + pyannote + SER) | ~7.9 GB |
| Annie Voice (Nemotron 0.6B + Qwen3.5-9B + Kokoro) | ~9.6 GB |
| Extraction (qwen3.5:27b on-demand) | +40 GB |
| Embeddings (qwen3-embedding:8b on-demand) | +14 GB |
| **Peak** | **~71.5 GB** |
| **Free** | **56.5 GB** |

### Scenario B: Swap LLM to Nemotron 3 Nano Q8

| Component | VRAM |
|-----------|------|
| Audio pipeline (Whisper + pyannote + SER) | ~7.9 GB |
| Annie Voice (Nemotron 0.6B + Nemotron Nano Q8 + Kokoro) | ~35 GB |
| Extraction (qwen3.5:27b on-demand) | +40 GB |
| Embeddings (qwen3-embedding:8b on-demand) | +14 GB |
| **Peak** | **~96.9 GB** |
| **Free** | **31.1 GB** |

Scenario B is feasible but tight. Peak is under 110 GB budget, but leaves only 31 GB free vs current 56.5 GB. The voice-active GPU lease would become even more critical.

### Scenario C: Full NVIDIA stack (all three models)

Same as Scenario B since we already use Nemotron ASR and Kokoro > Magpie. Swapping to Magpie would save no VRAM (it's larger than Kokoro).
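
The scenario tables above reduce to simple sums against the 128 GB budget; a small sketch reproducing them:

```python
# Reproduce the Scenario A/B arithmetic against the 128 GB unified budget.
TOTAL_GB = 128.0

def peak_and_free(components: dict[str, float]) -> tuple[float, float]:
    """Return (peak VRAM, free VRAM), rounded to 0.1 GB."""
    peak = round(sum(components.values()), 1)
    return peak, round(TOTAL_GB - peak, 1)

scenario_a = {"audio_pipeline": 7.9, "annie_voice": 9.6,
              "extraction": 40.0, "embeddings": 14.0}
scenario_b = {**scenario_a, "annie_voice": 35.0}  # Nemotron Nano Q8 swap

print(peak_and_free(scenario_a))  # (71.5, 56.5)
print(peak_and_free(scenario_b))  # (96.9, 31.1)
```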

---

## Sources

- [pipecat-ai/nemotron-january-2026 (GitHub)](https://github.com/pipecat-ai/nemotron-january-2026)
- [Building Voice Agents with NVIDIA Open Models (Daily.co)](https://www.daily.co/blog/building-voice-agents-with-nvidia-open-models/)
- [NVIDIA Nemotron 3 Super for Voice AI (Daily.co)](https://www.daily.co/blog/nvidia-nemotron-3-super/)
- [Nemotron Speech ASR model card (HuggingFace)](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
- [Magpie TTS model card (HuggingFace)](https://huggingface.co/nvidia/magpie_tts_multilingual_357m)
- [Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR (HuggingFace blog)](https://huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents)
- [Nemotron 3 Nano: Efficient, Open, Intelligent (HuggingFace blog)](https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models)
- [Nemotron Voice Agent Blueprint (NVIDIA NIM)](https://build.nvidia.com/nvidia/nemotron-voice-agent)
- [NVIDIA Nemotron 3 Family (research)](https://research.nvidia.com/labs/nemotron/Nemotron-3/)
- [Tier 0 DGX Spark Findings (NVIDIA Forums)](https://forums.developer.nvidia.com/t/tier-0-findings-on-dgx-spark-why-hybrid-mamba-nemotron-beats-120b-for-agents-plus-sm121-fix/359275)
- [ASR Benchmark: Nemotron vs Whisper vs Deepgram (GitHub)](https://github.com/QbitLoop/RealtimeVoice)
- [Best TTS APIs for Real-Time Voice Agents 2026 (Inworld)](https://inworld.ai/resources/best-voice-ai-tts-apis-for-real-time-voice-agents-2026-benchmarks)
- [nvidia-pipecat (PyPI)](https://pypi.org/project/nvidia-pipecat/)
- [NVIDIA voice-agent-examples (GitHub)](https://github.com/NVIDIA/voice-agent-examples)
- [DeepWiki: nemotron-january-2026 analysis](https://deepwiki.com/pipecat-ai/nemotron-january-2026)
- [Nemotron 3 Nano Technical Report (PDF)](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf)
