# Research: NVIDIA Voice/AI Components on DGX Spark (GB10, aarch64, SM 12.1)

**Date:** 2026-03-18
**Status:** Research complete
**Context:** Evaluating NVIDIA's official voice/AI stack for potential adoption on her-os running on DGX Spark (Titan).

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [NIM Containers on DGX Spark aarch64](#2-nim-containers-on-dgx-spark-aarch64)
3. [Riva ASR NIMs on DGX Spark](#3-riva-asr-nims-on-dgx-spark)
4. [Riva TTS NIMs on DGX Spark](#4-riva-tts-nims-on-dgx-spark)
5. [Nemotron Speech ASR (Open-Source Alternative)](#5-nemotron-speech-asr-open-source-alternative)
6. [NeMo Models on DGX Spark](#6-nemo-models-on-dgx-spark)
7. [DGX Spark Compatibility Issues Compendium](#7-dgx-spark-compatibility-issues-compendium)
8. [NVIDIA ACE on DGX Spark](#8-nvidia-ace-on-dgx-spark)
9. [TensorRT on GB10/SM 12.1](#9-tensorrt-on-gb10sm-121)
10. [Riva ASR vs Our Whisper PyTorch — Latency Comparison](#10-riva-asr-vs-our-whisper-pytorch--latency-comparison)
11. [Riva TTS vs Kokoro — Quality and Latency](#11-riva-tts-vs-kokoro--quality-and-latency)
12. [Community Voice Pipelines on DGX Spark](#12-community-voice-pipelines-on-dgx-spark)
13. [NVIDIA's Recommended Workloads for DGX Spark](#13-nvidias-recommended-workloads-for-dgx-spark)
14. [Nemotron Voice Agent Blueprint](#14-nemotron-voice-agent-blueprint)
15. [VRAM Budget Analysis — NVIDIA Stack vs Current Stack](#15-vram-budget-analysis--nvidia-stack-vs-current-stack)
16. [Verdict and Recommendations](#16-verdict-and-recommendations)

---

## 1. Executive Summary

**Bottom line:** NVIDIA's voice stack on DGX Spark is partially functional but immature. The aarch64 + SM 12.1 + CUDA 13.0 triple creates a compatibility minefield. Our current stack (Nemotron Speech 0.6B + Kokoro + Qwen3.5-9B) is more reliable and battle-tested on this hardware than NVIDIA's official NIM-based offerings.

| Component | NVIDIA Offering | DGX Spark Status | Our Current Stack | Recommendation |
|-----------|----------------|------------------|-------------------|----------------|
| **ASR** | Riva Parakeet 1.1B NIM | Officially supported (2 models only) | Nemotron Speech 0.6B (NeMo RNNT) | Keep current — our stack works, Riva has deployment issues |
| **TTS** | Riva Magpie TTS NIM | Officially supported but 600ms latency | Kokoro v0.19 (~30ms) | Keep Kokoro — 20x faster than Magpie on Spark |
| **LLM** | Nemotron 3 Nano 30B NIM | Limited NIM availability | Qwen3.5-9B NVFP4 QAT-v4 via vLLM | Keep current — working, tuned for Annie |
| **Full pipeline** | Nemotron Voice Agent Blueprint | Needs 3x H100 GPUs | Pipecat + Whisper + Claude/vLLM + Kokoro | Keep current |
| **ACE** | Full avatar stack | Not available for DGX Spark | N/A | Not viable |

**Key finding:** NVIDIA markets DGX Spark as a "personal AI supercomputer" but the voice AI NIM ecosystem lags significantly behind LLM NIM support. The community is filling gaps with open-source alternatives (VibeVoice, faster-whisper, Kokoro).

---

## 2. NIM Containers on DGX Spark aarch64

### What Actually Works

NVIDIA says it is "in the process of updating our NIMs models to be compatible with DGX Spark," but coverage is thin:

**Confirmed working LLM NIMs:**
- `nvcr.io/nim/meta/llama-3.1-8b-instruct-dgx-spark:latest` — official DGX Spark variant
- `nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2-dgx-spark:latest` — requires NIM license
- Qwen3 32B — referenced in NIM playbook

**Confirmed NOT working:**
- Llama 3.3 Nemotron Super 49B — crashes on sm_121 kernel compilation ("sm_121 is not a recognized processor")
- Biology NIM containers — all fail on DGX Spark
- Embedding NIMs (llama-3.2-nv-embedqa-1b-v2) — CUDA symbol errors on aarch64
- Stable Diffusion 3.5-large — amd64 only, no ARM64 build
- Most NIM containers — built only for amd64

**The core problem:** Most NIMs ship x86-64 Docker images only. The `-dgx-spark` suffix denotes the rare ARM64 variant. Users on the NVIDIA forums describe the experience as "complex, manual workarounds and community-driven improvisation instead of having a ready-to-use solution."

**Sources:**
- [NIM LLM Containers Fail on DGX Spark](https://forums.developer.nvidia.com/t/nim-llm-containers-fail-on-dgx-spark-gb10-triton-vllm-crash-on-sm-121-and-ngc-permission-errors/353346)
- [Missing official native ARM64 NIM images](https://forums.developer.nvidia.com/t/missing-official-native-arm64-nim-images-for-essential-ai-models/350681)
- [NIMs should be built multiplatform](https://forums.developer.nvidia.com/t/nims-should-be-built-multiplatform/348914)
- [NIM on Spark playbook](https://build.nvidia.com/spark/nim-llm)

---

## 3. Riva ASR NIMs on DGX Spark

### Supported Models (as of Release 1.8.0 / Speech NIM 26.02.0)

| Model | Size | DGX Spark | Blackwell RTX | Languages | Streaming |
|-------|------|-----------|---------------|-----------|-----------|
| Parakeet 1.1B CTC English | 1.1B | Yes | Yes | en-US only | Yes (160ms chunks) |
| Parakeet 1.1B RNNT Multilingual | 1.1B | Yes | Yes | 30+ languages (incl. Indian) | Yes (160ms chunks) |
| Parakeet 0.6B TDT v2 | 0.6B | No | No (not on Blackwell) | en-US | Yes |
| Canary 1B | 1B | No | — | Multilingual | Yes |
| Whisper Large v3 | 1.5B | No | — | 99 languages | No (offline) |
| Nemotron ASR Streaming 0.6B | 0.6B | No (NIM path; runs natively via NeMo) | — | en-US | Yes |

**Critical:** Only 2 of 9+ ASR models have DGX Spark NIM support. The Whisper Large v3 NIM does NOT support DGX Spark.

### Deployment Reality

The old `riva_start` deployment path has known issues on DGX Spark:
- 4 TensorRT models fail with missing `.plan` files
- Log warning: "Detected NVIDIA GB10 GPU, which may not yet be supported in this version of the container"
- Some ONNX-based models (FastPitch encoder, punctuation) load successfully
- TensorRT-compiled models (conformer, HiFi-GAN) fail — likely missing SM 12.1 TRT engine builds

**Recommendation:** Skip the `riva_start` path. The new Speech NIM containers are the forward-looking approach, but only 2 ASR models are supported.

### Parakeet ASR Performance (A100 reference — no DGX Spark numbers published)

| Model | 1 Stream Latency | 1 Stream RTFX | 32 Stream RTFX |
|-------|-------------------|---------------|----------------|
| Parakeet CTC 0.6B | 12ms | 1.0x | 31.86x |
| Parakeet CTC 1.1B | 13ms | 1.0x | 31.9x |
| Parakeet RNNT 1.1B | 16ms | 1.0x | 31.65x |
| Whisper Large (Riva) | 49ms | 1.0x | 7.95x (8 streams) |

**Note:** These are A100 numbers. DGX Spark GB10 has fewer SMs (48 vs 108) and lower memory bandwidth (273 GB/s unified vs 2 TB/s HBM). Expect 2-3x worse throughput but single-stream latency should be comparable.
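RTFX in these tables is the inverse real-time factor: audio duration divided by wall-clock processing time. A one-line helper (ours, for illustration only) makes the unit concrete:

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second
    of wall-clock compute. Values > 1 mean faster than realtime."""
    return audio_seconds / wall_seconds
```

For example, transcribing 60 s of audio in ~1.88 s of compute gives ~31.9x, the 32-stream Parakeet CTC 1.1B figure above (the 1.88 s is back-derived from that figure, not a measurement).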

### Parakeet Running Natively (Not via NIM)

A community member got Parakeet running directly via NeMo on DGX Spark (not through NIM containers):
- Parakeet TDT 0.6B: **282.91x RTFX** (offline benchmark)
- Parakeet TDT 1.1B: **223.95x RTFX** (offline benchmark)
- WER: 0.035-0.055

Setup requires PyTorch 2.9 (NGC 25.10 container, NOT 25.12 which breaks Lhotse). Fragile dependency chain.

**Sources:**
- [Riva ASR Support Matrix](https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html)
- [ASR Performance Benchmarks](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-performance.html)
- [Running Parakeet STT on Spark](https://forums.developer.nvidia.com/t/running-parakeet-speech-to-text-on-spark/356353)
- [Feature Request: ARM64 Riva + Whisper Large-v3 Turbo](https://forums.developer.nvidia.com/t/feature-request-arm64-grace-cpu-support-for-riva-with-whisper-large-v3-turbo/354519)

---

## 4. Riva TTS NIMs on DGX Spark

### Supported Models (Speech NIM 26.02.0)

| Model | DGX Spark | Blackwell | Notes |
|-------|-----------|-----------|-------|
| Magpie TTS Multilingual | Yes | Yes (batch_size=1 limit on Blackwell) | 7 languages, 5 voices |
| Magpie TTS Zeroshot | Listed | — | Restricted access |
| Magpie TTS Flow | Listed | — | Offline only, restricted |
| FastPitch HiFi-GAN en-US | Listed | Listed | Legacy pipeline |

### Magpie TTS Performance on DGX Spark

| Metric | DGX Spark | RTX 5090 | H100 |
|--------|-----------|----------|------|
| Single sentence latency (batch mode) | **~600ms** | ~300ms | ~17ms (FastPitch ref) |
| Max concurrent streams | 5 | — | — |
| Default batch size | 8 (all hardware) | — | — |
| VRAM at batch_size=8 | ~10.87 GB | — | — |
| VRAM at batch_size=32 | ~31.55 GB | — | — |

### Known Issues on DGX Spark

- "Magpie TTS Multilingual may return partial audio for short utterances (one word texts) and incomplete audio on DGX Spark platform"
- batch_size=1 constraint on Blackwell in some releases (later updated to batch_size=8 default)

### Quality Assessment

Magpie TTS Multilingual is NVIDIA's production TTS model, supporting 7 languages with emotion control. Quality is likely competitive with cloud TTS services. However, at 600ms latency per sentence on DGX Spark, it's significantly slower than Kokoro's ~30ms.

**Sources:**
- [Riva TTS Support Matrix](https://docs.nvidia.com/nim/riva/tts/latest/support-matrix.html)
- [Riva TTS Release Notes](https://docs.nvidia.com/nim/riva/tts/latest/release-notes.html)
- [Speech NIM Release Notes](https://docs.nvidia.com/nim/speech/latest/about/release-notes.html)
- [Riva start download failures on DGX Spark](https://forums.developer.nvidia.com/t/riva-start-cant-download-all-models-on-dgx-spark/352021)

---

## 5. Nemotron Speech ASR (Open-Source Alternative)

Nemotron Speech Streaming 0.6B is NVIDIA's newest ASR model, designed for voice agents. It's the same architecture family as Parakeet (FastConformer RNNT) but with cache-aware streaming optimization.

### DGX Spark Compatibility

- **Officially listed as supported hardware** (added Dec 2025 with commit message "Added DGX spark after testing")
- Runs via NeMo framework directly (not as NIM container)
- We already use this model in production for Annie Voice STT (2.49 GB VRAM, 431ms avg latency)

### Performance Characteristics

| Metric | Value | Context |
|--------|-------|---------|
| Parameters | 600M | FastConformer RNNT |
| Streaming latency | sub-100ms | On supported hardware |
| Median time to final transcription | 24ms | Hardware unspecified |
| Chunk sizes | 80ms, 160ms, 560ms, 1120ms | Configurable |
| WER | Competitive with Parakeet 1.1B | NVIDIA benchmark |

### Community NIM Deployment Issues

Forum thread documents failed attempts to deploy via NIM on DGX Spark:
- nemo2riva model conversion fails (nvidia-eff package unavailable for ARM64)
- Suggested workaround: convert model on x86_64 system, transfer to Spark
- The NIM container path for Nemotron Speech is not DGX Spark compatible as of March 2026

**Our approach (running via NeMo directly) is correct.** The NIM container path has blockers.

**Sources:**
- [Nemotron Speech Streaming HuggingFace](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
- [ASR on Spark with Nemotron Speech](https://forums.developer.nvidia.com/t/asr-on-spark-with-nemotron-speech-streaming-en-0-6b/358614)
- [Scaling Voice Agents with Cache-Aware Streaming ASR](https://huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents)

---

## 6. NeMo Models on DGX Spark

NeMo is NVIDIA's framework for training and deploying speech/NLP models. On DGX Spark:

### What Works
- NeMo fine-tuning — official DGX Spark playbook exists
- Parakeet models via NeMo inference — working (with correct PyTorch version)
- Nemotron Speech 0.6B via NeMo — working (our production setup)

### What Doesn't Work
- nemo2riva conversion — ARM64 dependency gap (nvidia-eff not available)
- FastPitch fine-tuning — reported issues loading pretrained weights on DGX Spark
- TTS models requiring conformer decoder — ARM64 compatibility issues with s3tokenizer/unit vocoders

### Key Gotcha: PyTorch Version

NeMo on DGX Spark is fragile around PyTorch versions:
- PyTorch 2.12 (NGC 25.12) — breaks NeMo due to Sampler API removal
- PyTorch 2.10.0 — removed `data_source` arg from Sampler, breaking Lhotse
- **PyTorch 2.9 (NGC 25.10)** — the working version for NeMo + Parakeet on DGX Spark
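The gotcha above can be encoded as a pre-flight check before importing NeMo. This is a sketch; the function name is ours, and the threshold (anything ≥ 2.10 breaks Lhotse's Sampler usage) is taken from the forum reports:

```python
def nemo_compatible(torch_version: str) -> bool:
    """True if this PyTorch version predates the 2.10 Sampler API change
    (removal of the `data_source` arg) that breaks Lhotse, and therefore
    NeMo ASR, on DGX Spark. Accepts versions like "2.9.0a0+nvXX"."""
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    return (major, minor) < (2, 10)
```

A startup script could call this with `torch.__version__` and refuse to proceed outside the NGC 25.10 container.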

**Sources:**
- [Fine-tune with NeMo on DGX Spark](https://build.nvidia.com/spark/nemo-fine-tune)
- [Can we fine-tune FastPitch on DGX Spark](https://forums.developer.nvidia.com/t/can-we-fine-tune-fastpitch-on-dgx-spark-using-nemo/358214)

---

## 7. DGX Spark Compatibility Issues Compendium

The aarch64 + CUDA 13.0 + SM 12.1 triple causes widespread compatibility issues. This is a comprehensive list of what we know breaks:

### Category 1: No aarch64 Wheels / Binaries

| Library | Issue | Workaround |
|---------|-------|------------|
| vLLM (pip) | No cu130 aarch64 stable wheel | Use NGC Docker container OR cu130 nightly wheel |
| SGLang (pip) | Same as vLLM | NGC Docker container |
| flash-attention | No aarch64 wheel, slower than SDPA on Blackwell | Use `attn_implementation="sdpa"` (cuDNN 9.13 outperforms flash-attn) |
| CTranslate2 (PyPI) | aarch64 wheel is CPU-only, no CUDA | Build from source or use Docker with compiled version |
| mamba-ssm | Does not compile on aarch64 | No workaround (blocks Mamba-based models) |
| torchaudio (PyPI) | Fetches CPU-only wheel on aarch64 | `pip install torchaudio --index-url https://download.pytorch.org/whl/cu130` |
| nvidia-eff | Not available for ARM64 | Cannot convert NeMo models to Riva format on DGX Spark |
| Audio2Face-3D SDK | Downloads x86_64 dependencies | No known workaround |
| s3tokenizer | Incompatible on aarch64 | Blocks some TTS vocoders |

### Category 2: SM 12.1 / CUDA 13.0 Issues

| Library | Issue | Workaround |
|---------|-------|------------|
| NVRTC JIT compilation | Rejects `sm_121` for some operations | First run takes 1+ hours for JIT cache warm-up; sm_120 is binary compatible |
| TensorRT-LLM CUTLASS | FP4 GEMM tiles request more SMEM than GB10 has (99 KiB vs B200's 228 KiB) | Fixed in PR #12141 (runtime SMEM detection) |
| Kokoro TTS | `torch.stft` complex abs triggers Jiterator NVRTC error | TorchSTFT monkey-patch (our `blackwell_patch.py`) |
| PyTorch Jiterator | JIT-compiles elementwise CUDA kernels, fails on unrecognized sm_121 | Use cu130 PyTorch builds |
| Most pip packages | Ship CUDA 12.x wheels, fail with `libcudart.so.12` missing | Install from cu130 index |
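The Kokoro workaround above swaps the complex-abs path, which NVRTC refuses to JIT for sm_121, for plain float math. A minimal sketch of the idea, assuming PyTorch; our production `blackwell_patch.py` may differ in detail:

```python
import torch

def safe_complex_abs(x: torch.Tensor) -> torch.Tensor:
    """Magnitude without the complex-abs Jiterator kernel that NVRTC
    rejects on sm_121: compute it from real/imag parts, which stays
    on ordinary float elementwise kernels."""
    if x.is_complex():
        return torch.sqrt(x.real ** 2 + x.imag ** 2)
    return torch.abs(x)

def stft_magnitude(audio: torch.Tensor, n_fft: int = 1024) -> torch.Tensor:
    """torch.stft followed by the safe magnitude, mirroring the patched path."""
    spec = torch.stft(audio, n_fft=n_fft, return_complex=True)
    return safe_complex_abs(spec)
```

The patch then routes Kokoro's spectrogram code through `stft_magnitude` instead of calling `.abs()` on the complex STFT output directly.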

### Category 3: TTS Frameworks on DGX Spark

| Framework | Status | Notes |
|-----------|--------|-------|
| Kokoro v0.19 | **Working** (with monkey-patch) | ~30ms latency, 82M params, our production TTS |
| VibeVoice 0.5B | **Working** | 0.48x RTF (2x realtime), ~300ms TTFA, 7 voices |
| Magpie TTS (Riva NIM) | **Working** (officially supported) | ~600ms per sentence, partial audio bugs |
| XTTS | Broken | aarch64 compatibility issues |
| AllTalk | Broken | aarch64 compatibility issues |
| F5-TTS | Broken | aarch64 compatibility issues |
| Qwen3-TTS | Unknown | Community reports suggest possible on Spark |
| Spark TTS | Poor quality | "Inferior to Kokoro in nearly every way" per community |

### Category 4: General Ecosystem Issues

| Issue | Details |
|-------|---------|
| Unified memory swap death spiral | Training exhausts 128 GB, system freezes instead of clean OOM. Fix: `sudo swapoff -a` + memory-capped systemd scopes |
| NGC container PyTorch version fragility | NeMo works on 25.10 (PyTorch 2.9), breaks on 25.12 (PyTorch 2.12) |
| torchaudio missing from NGC containers | Critical issue for ASR/TTS workflows, causes 10-50x slowdown when falling back to CPU |
| JIT compilation overhead | First execution: 1+ hours to compile ~42 kernels (~436 MB cache). Subsequent runs use cache. |
| No stable cu130 aarch64 wheels | Ecosystem pins to specific nightly wheels that may be removed |
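The JIT cache rows above can be checked on the box. `CUDA_CACHE_PATH` and `CUDA_CACHE_MAXSIZE` are standard CUDA environment variables; the 1 GiB ceiling is our suggestion to keep the ~436 MB kernel set resident (defaults vary by CUDA version):

```shell
# Inspect the CUDA JIT kernel cache that the 1+ hour first run populates.
# CUDA_CACHE_PATH defaults to ~/.nv/ComputeCache if unset.
CACHE_DIR="${CUDA_CACHE_PATH:-$HOME/.nv/ComputeCache}"
du -sh "$CACHE_DIR" 2>/dev/null || echo "cache not populated yet"

# Raise the cache size limit so the compiled kernels are never evicted.
export CUDA_CACHE_MAXSIZE=1073741824   # 1 GiB
```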

**Sources:**
- [Architecture and library compatibility on aarch64](https://forums.developer.nvidia.com/t/architecture-and-library-compatibility-on-aarch64/350389)
- [No sm_121 support on aarch64 (vLLM)](https://github.com/vllm-project/vllm/issues/36821)
- [DGX Spark setup guide (natolambert)](https://github.com/natolambert/dgx-spark-setup)
- [FP4 CUTLASS GEMM on GB10](https://github.com/NVIDIA/TensorRT-LLM/issues/11368)
- [torchaudio incompatibility on Blackwell GB10](https://github.com/pytorch/audio/issues/4169)

---

## 8. NVIDIA ACE on DGX Spark

### What is ACE?

NVIDIA ACE (Avatar Cloud Engine) is a suite of microservices for building digital humans/avatars:
- **Audio2Face-3D** — facial animation from audio
- **Animation Graph** — procedural animation
- **Omniverse Renderer** — real-time rendering
- **Riva ASR/TTS** — speech services
- **ACE Agent** — conversational AI
- **Voice Font** — voice transfer

### DGX Spark Compatibility: NOT VIABLE

**ACE is NOT designed for DGX Spark.** Key blockers:
1. Audio2Face-3D SDK downloads x86_64 dependencies — ARM64 incompatible
2. Omniverse Renderer requires discrete GPU with dedicated VRAM (not unified memory architecture)
3. ACE documentation lists no aarch64 or DGX Spark platform support
4. The full ACE stack targets multi-GPU servers (H100/A100) or RTX workstations
5. No VRAM requirements published for individual ACE components, but the full stack would exceed DGX Spark's capacity alongside an LLM

### Could ACE Fit Alongside Nemotron 30B on 128GB?

**No.** The Nemotron Voice Agent Blueprint (which uses only ASR + LLM + TTS, a subset of ACE) requires:
- GPU 0: ASR + TTS = 48 GB
- GPU 1-2: LLM = 48-80 GB per GPU
- Total: ~3x H100 (240 GB HBM)

Even the minimal voice pipeline without avatar rendering needs more than 128 GB when using NVIDIA's reference model sizes. Our stack (Nemotron STT 2.5 GB + Kokoro 0.5 GB + Qwen3.5-9B 7.5 GB = ~10.5 GB for voice) is 23x more memory-efficient.

**Sources:**
- [NVIDIA ACE Overview](https://docs.nvidia.com/ace/overview/2025.03.06/index.html)
- [Nemotron Voice Agent Blueprint](https://build.nvidia.com/nvidia/nemotron-voice-agent)
- [Nemotron Voice Agent GitHub](https://github.com/NVIDIA-AI-Blueprints/nemotron-voice-agent)

---

## 9. TensorRT on GB10/SM 12.1

### Does TensorRT Work?

**Yes, TensorRT works on DGX Spark,** but with important caveats:

- TensorRT is included in the DGX Spark software stack (CUDA 13.0)
- NVFP4 quantization support for Blackwell architecture is available
- TensorRT-LLM had a CUTLASS GEMM bug on GB10 (FP4 tiles requesting more SMEM than available) — **fixed in PR #12141**
- Riva's TensorRT-compiled models (.plan files) may need recompilation for SM 12.1 (the `riva_start` failures suggest pre-compiled plans target only A100/H100)

### TensorRT-LLM on DGX Spark

- Official playbook exists: "TRT LLM for Inference"
- NVFP4 quantization works (we use it via vLLM which has TensorRT integration)
- GB10 has only 99 KiB shared memory per SM (vs B200's 228 KiB), so some CUTLASS tile configurations fail
- The runtime SMEM detection fix ensures correct tile selection for GB10
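The runtime-SMEM fix boils down to filtering tile configurations against the device's actual shared memory instead of assuming B200's. A toy selector illustrating that logic (tile names and byte counts are made up; CUTLASS's real heuristics are more involved):

```python
def pick_gemm_tile(smem_available: int, tiles: list[tuple[str, int]]) -> str:
    """Return the first (largest) tile whose shared-memory footprint fits.
    `tiles` is assumed sorted from largest to smallest footprint."""
    for name, smem_needed in tiles:
        if smem_needed <= smem_available:
            return name
    raise RuntimeError("no GEMM tile fits the available shared memory")

# Illustrative table: the big tile fits B200 (228 KiB/SM) but not GB10 (99 KiB/SM).
TILES = [("fp4_128x256", 200 * 1024), ("fp4_64x128", 90 * 1024)]
```

Before the fix, the 200 KiB tile was requested unconditionally and the kernel launch failed on GB10; with runtime detection, the smaller tile is selected instead.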

### Practical Impact

TensorRT is the backbone of Riva ASR/TTS acceleration. Its SM 12.1 support is now functional after bug fixes, but Riva's pre-compiled model artifacts still may not include GB10 targets. This explains why `riva_start` fails to load TRT-compiled models.

**Sources:**
- [TensorRT Support Matrix](https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/support-matrix.html)
- [TensorRT for Blackwell DGX Spark forum](https://forums.developer.nvidia.com/t/tensorrt-for-blackwell-dgx-spark/355279)
- [FP4 CUTLASS GEMM fix](https://github.com/NVIDIA/TensorRT-LLM/issues/11368)

---

## 10. Riva ASR vs Our Whisper PyTorch — Latency Comparison

### Our Current Stack

| Component | Model | VRAM | Latency | RTF |
|-----------|-------|------|---------|-----|
| Audio Pipeline STT | WhisperX large-v3 (CTranslate2 in Docker) | 4.8 GB | — | 62x realtime |
| Annie Voice STT | Nemotron Speech 0.6B (NeMo RNNT) | 2.49 GB | 431ms avg | — |

### Riva Parakeet NIM (if deployed on DGX Spark)

| Model | Est. VRAM | Streaming Latency | RTFX (A100 ref) |
|-------|-----------|-------------------|-----------------|
| Parakeet CTC 1.1B | ~2-4 GB | ~13ms (A100), est. ~20-30ms (Spark) | 1.0x (1 stream) |
| Parakeet RNNT 1.1B Multilingual | ~2-4 GB | ~16ms (A100), est. ~25-40ms (Spark) | 1.0x (1 stream) |

### Comparison

| Aspect | Our Stack | Riva Parakeet NIM |
|--------|-----------|-------------------|
| Deployment reliability | Production-proven on Spark | Unknown — NIM deployment untested by us |
| Streaming support | Yes (Nemotron 0.6B) | Yes (both models) |
| Language support | English only (Nemotron 0.6B) | English or 30+ languages (RNNT Multilingual) |
| VRAM | 2.49 GB (Nemotron) / 4.8 GB (WhisperX) | Est. 2-4 GB |
| Latency | 431ms (Nemotron, our measurement) | Est. 20-40ms (extrapolated from A100) |
| Diarization | pyannote 3.1 (separate) | Built-in Sortformer diarization |
| VAD | WebRTC VAD (separate) | Built-in Silero VAD |

**Potential win for Riva:** Built-in VAD + diarization could simplify the pipeline. But deployment reliability on DGX Spark is unproven for our use case.

---

## 11. Riva TTS vs Kokoro — Quality and Latency

### Our Current Stack

| Component | Model | VRAM | Latency | Notes |
|-----------|-------|------|---------|-------|
| Annie Voice TTS | Kokoro v0.19 | 0.5 GB | ~30ms | 82M params, monkey-patched for Blackwell |

### Riva Magpie TTS (on DGX Spark)

| Metric | Value |
|--------|-------|
| Latency | ~600ms per sentence |
| VRAM (batch_size=8) | ~10.87 GB |
| VRAM (batch_size=32) | ~31.55 GB |
| Languages | 7 (multilingual) |
| Voices | 5 preset + emotion control |
| Known bugs | Partial audio for short utterances on DGX Spark |

### Alternative: VibeVoice (Microsoft, open source)

| Metric | Value |
|--------|-------|
| RTF | 0.48x (2x faster than realtime) |
| Time to first audio | ~300ms (streaming) |
| VRAM | Not documented (0.5B model) |
| Voices | 7 preset (0.5B) or voice cloning (1.5B) |
| DGX Spark status | Working (confirmed in community) |

### Comparison

| Aspect | Kokoro (ours) | Magpie TTS (Riva) | VibeVoice |
|--------|---------------|-------------------|-----------|
| Latency | **~30ms** | ~600ms | ~300ms |
| VRAM | **0.5 GB** | 10.87 GB | ~2-4 GB est. |
| Voice quality | Good (single voice) | Good (multilingual, emotion) | Good (multi-speaker) |
| Languages | English | 7 languages | English (0.5B), multilingual (1.5B) |
| Voice cloning | No | Zeroshot model (restricted) | Yes (1.5B variant) |
| DGX Spark stability | Proven (with patch) | Officially supported but buggy | Community-confirmed |

**Verdict:** Kokoro is 20x faster and uses 22x less VRAM than Magpie on DGX Spark. Magpie's advantage is multilingual support and emotion control. Unless we need multilingual TTS, Kokoro remains the superior choice.

**VibeVoice is worth monitoring** — 0.5B model with 300ms latency and voice cloning at 1.5B could be interesting for future Annie voice customization.

---

## 12. Community Voice Pipelines on DGX Spark

### Arm + NVIDIA Reference Pipeline

Architecture: faster-whisper (CPU) + vLLM (GPU) + unspecified TTS
- ASR: 70-90ms transcription latency (CPU, faster-whisper)
- Full pipeline: ~4 seconds end-to-end voice-to-voice
- Uses unified memory to skip CPU-GPU data transfers

### Logos-Flux Spark Voice Pipeline

Architecture: whisper.cpp + Ollama (LLM) + VibeVoice 0.5B
- Time to first audio: **766ms**
- TTS RTF: 0.48x (2x realtime)
- WebSocket streaming throughout
- Sentence-level buffering between LLM and TTS

### Key Community Insights

1. **CPU for ASR is viable** — faster-whisper on ARM Cortex-X cores achieves 70-90ms, competitive with GPU Whisper on this hardware
2. **Unified memory advantage** — no PCIe transfer overhead between ASR (CPU) and LLM (GPU)
3. **Sentence-level streaming** is the standard pattern — buffer LLM tokens until sentence boundary, then send to TTS while generation continues
4. **Our pipeline holds up well** — we achieve sub-1s voice-to-voice with GPU ASR + GPU TTS, while the community pipelines range from 766ms time-to-first-audio (Logos-Flux) to ~4s end-to-end (Arm reference)
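Insight 3 (sentence-level buffering between LLM and TTS) is simple to sketch. The regex boundary detection below is deliberately naive — it will mis-split abbreviations like "Dr." — but it shows the buffering shape:

```python
import re
from typing import Iterable, Iterator

# A sentence ends at . ! or ? followed by whitespace (naive heuristic).
SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens and yield complete sentences for TTS,
    so synthesis starts while the LLM is still generating."""
    buf = ""
    for tok in tokens:
        buf += tok
        while True:
            m = SENTENCE_END.search(buf)
            if not m:
                break
            end = m.end(1)            # include the terminator itself
            yield buf[:end].strip()
            buf = buf[end:]
    if buf.strip():                   # flush the trailing fragment
        yield buf.strip()
```

In a real pipeline each yielded sentence would be handed to the TTS worker immediately while the token stream continues.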

**Sources:**
- [Arm blog: Rethinking voice AI at the edge](https://developer.arm.com/community/arm-community-blogs/b/ai-blog/posts/rethinking-voice-ai-at-the-edge-a-practical-offline-pipeline-on-dgx-spark)
- [Spark Voice Pipeline (GitHub)](https://github.com/Logos-Flux/spark-voice-pipeline)
- [Arm learning path: Offline voice chatbot](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/1_offline_voice_assistant/)
- [VibeVoice on DGX Spark setup](https://forums.developer.nvidia.com/t/dgx-spark-vibevoice-tts-streaming-voice-pipeline-setup-guide/356424)

---

## 13. NVIDIA's Recommended Workloads for DGX Spark

### Official Marketing Position

NVIDIA positions DGX Spark as a "personal AI supercomputer" for:
- Large context window inference (30K-250K tokens)
- Multi-agent concurrent processing
- Fine-tuning up to 120B parameters (single node)
- Local AI factory operations
- Voice-based AI assistants (explicitly mentioned)

### Officially Tested Model Configurations

| Model | Quantization | Single Node | Dual Node |
|-------|-------------|-------------|-----------|
| Nemotron 3 Nano 30B | NVFP4 | Yes | — |
| Nemotron 3 Super 120B | NVFP4 | — | Yes |
| Qwen3.5 35B | — | Yes | — |
| Qwen3 Coder Next 80B | — | — | Yes |
| Llama 3.3 70B Instruct | — | — | Yes |
| Qwen-235B MoE | NVFP4 | — | Yes (2 nodes) |
| FLUX.2 | — | Yes (90 GB) | — |

### Voice-Specific Official Guidance

NVIDIA mentions voice assistants in marketing but provides **no official DGX Spark voice assistant reference architecture.** The closest is:

1. **Nemotron Voice Agent Blueprint** — targets 3x H100, not DGX Spark
2. **DGX Spark Playbooks** — includes NeMo fine-tuning and NemoClaw, but no dedicated voice pipeline playbook
3. **Community contributions** — the actual voice pipeline work is coming from Arm, Logos-Flux, and individual developers on the forums

**The gap between marketing ("personal AI supercomputer for voice assistants") and reality (no official voice pipeline playbook, 2 of 9 ASR NIMs supported) is significant.**

**Sources:**
- [Scaling Autonomous AI Agents with DGX Spark](https://developer.nvidia.com/blog/scaling-autonomous-ai-agents-and-workloads-with-nvidia-dgx-spark/)
- [Software Optimizations for DGX Spark](https://developer.nvidia.com/blog/new-software-and-model-optimizations-supercharge-nvidia-dgx-spark/)
- [DGX Spark Playbooks (GitHub)](https://github.com/NVIDIA/dgx-spark-playbooks)
- [NVIDIA DGX Spark product page](https://www.nvidia.com/en-us/products/workstations/dgx-spark/)

---

## 14. Nemotron Voice Agent Blueprint

The Nemotron Voice Agent Blueprint is NVIDIA's reference implementation for an end-to-end voice agent.

### Architecture

```
User Mic → Nemotron ASR (RNNT/CTC) → Nemotron LLM (Nano 30B or Super 49B) → Magpie TTS → Speaker
           + VAD + SVAD + EOU
           Pipeline: Pipecat + WebRTC
```

### Hardware Requirements

| Service | GPU | VRAM |
|---------|-----|------|
| ASR + TTS | L40/A100/H100 | 48 GB |
| LLM (Nemotron Nano 30B) | H100 | 48 GB |
| LLM (Nemotron Super 49B) | H100 | 80 GB |
| **Minimum total** | **2-3x H100** | **96-128 GB HBM** |

### Performance (3x H100 reference)

| Metric | 1 Stream | 64 Streams |
|--------|----------|------------|
| End-to-end latency | 0.79s | ~1.0s |
| ASR latency | 0.04s | 0.067s |
| TTS TTFB | 0.066s | 0.11s |
| LLM TTFT | 0.061s | 0.156s |

### Can It Run on DGX Spark?

**Not as designed.** 48 GB for ASR+TTS alone exceeds what we'd want to allocate on a 128 GB unified memory system that also runs an LLM + extraction pipeline.

However, the **architecture is reusable:**
- Pipecat orchestration (we already use this)
- WebRTC transport (we already use this)
- Speculative speech processing (we could adopt this)
- The model choices need downsizing (our stack already does this)

**What we could adopt:**
- Speculative speech processing (start LLM inference before ASR finalizes)
- Cache-aware streaming ASR patterns from Nemotron Speech
- VAD + SVAD + EOU pipeline stages
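A toy asyncio sketch of speculative speech processing as we understand it from the Blueprint (the event shapes and function names are ours): start the LLM on interim ASR hypotheses, keep the in-flight reply when the final transcript matches, otherwise restart:

```python
import asyncio

async def speculative_llm(asr_events, llm_call):
    """Run the LLM speculatively on interim ASR hypotheses.

    asr_events yields ("interim" | "final", transcript) pairs; llm_call is
    an async function taking a transcript. On a speculation hit, the reply
    is already in flight when the final transcript arrives.
    """
    task, speculated = None, None
    async for kind, text in asr_events:
        if kind == "interim":
            if text != speculated:          # hypothesis changed: restart
                if task:
                    task.cancel()
                speculated = text
                task = asyncio.create_task(llm_call(text))
        else:                               # "final"
            if task and text == speculated:
                return await task           # speculation hit
            if task:
                task.cancel()               # speculation miss: start fresh
            return await llm_call(text)
```

The payoff is that ASR finalization latency overlaps with LLM time-to-first-token instead of adding to it.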

**Sources:**
- [Nemotron Voice Agent Blueprint](https://build.nvidia.com/nvidia/nemotron-voice-agent)
- [Blueprint GitHub](https://github.com/NVIDIA-AI-Blueprints/nemotron-voice-agent)
- [Building Voice Agents with NVIDIA Open Models (Daily.co)](https://www.daily.co/blog/building-voice-agents-with-nvidia-open-models/)

---

## 15. VRAM Budget Analysis — NVIDIA Stack vs Current Stack

### Current her-os Voice Stack (Production)

| Component | Model | VRAM |
|-----------|-------|------|
| Annie Voice STT | Nemotron Speech 0.6B | 2.49 GB |
| Annie Voice TTS | Kokoro v0.19 | 0.5 GB |
| Annie Voice LLM | Qwen3.5-9B NVFP4 QAT-v4 (vLLM) | 7.55 GB |
| **Voice total** | | **10.54 GB** |

### Hypothetical NVIDIA NIM Voice Stack

| Component | Model | VRAM |
|-----------|-------|------|
| ASR | Riva Parakeet 1.1B RNNT NIM | ~3 GB est. |
| TTS | Riva Magpie TTS Multilingual NIM | ~10.87 GB |
| LLM | Nemotron 3 Nano 30B NVFP4 NIM | ~20 GB est. |
| **Voice total** | | **~34 GB** |

### Comparison

| Aspect | Current Stack | NVIDIA NIM Stack |
|--------|---------------|-----------------|
| Voice VRAM | **10.54 GB** | ~34 GB |
| Room for extraction (qwen3.5:27b) | 68 GB free after voice | ~44 GB free after voice |
| TTS latency | **~30ms** | ~600ms |
| ASR streaming | Yes | Yes |
| Multilingual ASR | No (English only) | Yes (30+ languages with RNNT) |
| Multilingual TTS | No | Yes (7 languages) |
| LLM personality | QAT-tuned for Annie | Generic (would need tuning) |
| Deployment reliability on Spark | Proven | Untested |

**The NVIDIA NIM stack costs 3.2x more VRAM and is 20x slower on TTS, with questionable deployment reliability on DGX Spark.** The only clear win is multilingual support.

---

## 16. Verdict and Recommendations

### Keep Current Stack (No Changes)

Our production voice pipeline is well-optimized for DGX Spark:
- **10.5 GB total voice VRAM** — leaves maximum room for extraction and other services
- **Proven reliability** — battle-tested through 350+ sessions
- **Low latency** — Kokoro at ~30ms TTS is industry-leading for local deployment
- **Nemotron Speech 0.6B** — NVIDIA's own model, running natively via NeMo (not NIM)

### Monitor for Future Adoption

| Component | Watch For | Timeline |
|-----------|-----------|----------|
| Riva Parakeet 1.1B RNNT NIM | Reliable DGX Spark deployment with built-in VAD+diarization | When Speech NIM aarch64 matures |
| VibeVoice 1.5B | Voice cloning for Annie personalization | When we need custom voice |
| Nemotron Voice Agent Blueprint | Speculative speech processing techniques | Adopt patterns, not the full stack |
| NVIDIA Speech NIM consolidation | More models getting DGX Spark support | Quarterly NIM releases |
| Magpie TTS | Latency improvement on DGX Spark | Currently ~20x slower than Kokoro |

### Specific Opportunities to Investigate

1. **Speculative speech processing** from the Nemotron Voice Agent Blueprint — start LLM inference before ASR finalizes the utterance. Could reduce perceived latency.

2. **Parakeet 1.1B RNNT Multilingual** for the audio-pipeline (not Annie Voice) — if it runs reliably, the built-in 30+ language support and Sortformer diarization could replace our WhisperX + pyannote stack with a single model.

3. **VibeVoice 0.5B** as Kokoro alternative — 300ms latency (10x slower than Kokoro) but has 7 preset voices and confirmed DGX Spark compatibility. Worth testing if we want Annie voice variety.

### What NOT to Pursue

- **NVIDIA ACE** — not compatible with DGX Spark, wrong use case
- **Full NIM-based voice pipeline** — deployment reliability is too low on aarch64
- **Riva `riva_start` deployment** — broken on DGX Spark (missing TRT plans for SM 12.1)
- **Replacing Kokoro with Magpie TTS** — 20x latency regression for marginal quality gain
- **Nemotron 3 Nano 30B as Annie LLM** — 20 GB VRAM vs our 7.5 GB, and lacks our QAT personality tuning

---

## Appendix: DGX Spark Software Compatibility Quick Reference

For developers working with DGX Spark aarch64 + CUDA 13.0 + SM 12.1:

### Must-Use Package Sources

```bash
# PyTorch with CUDA 13.0 support (critical for aarch64 GPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

# vLLM (nightly with cu130 support)
pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly/cu130

# Or use NGC Docker containers (most reliable path)
docker pull nvcr.io/nvidia/vllm:25.12.post1-py3
```

### Critical Environment Variables

```bash
export TORCH_CUDA_ARCH_LIST="12.1a"
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
```

### Memory Safety (Unified Memory)

```bash
# Disable swap to prevent death spiral
sudo swapoff -a

# Run training/inference with memory cap
sudo systemd-run --scope -p MemoryMax=100G -p MemorySwapMax=0 bash -lc 'your_command'

# Keep GPU memory utilization low for vLLM
--gpu-memory-utilization 0.15  # Not 0.85!
```

### Key Compatibility Notes

- SM 12.0 and SM 12.1 are binary compatible — sm_120 kernels run on sm_121
- SDPA (cuDNN 9.13) outperforms flash-attention on Blackwell — use `attn_implementation="sdpa"`
- First-time JIT compilation takes 1+ hours, subsequent runs use cached kernels (~436 MB)
- NeMo requires PyTorch 2.9 (NGC 25.10), NOT 25.12
