# Research: NVIDIA Voice AI Pipeline vs her-os Custom Pipeline

**Date:** 2026-03-18
**Status:** Research complete
**Context:** Comprehensive comparison of NVIDIA's end-to-end voice AI stack (ACE, NIM, Riva, Audio2Face) with our custom Annie voice pipeline on DGX Spark (128GB unified memory, GB10 Blackwell, aarch64).

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [NVIDIA's Complete Voice AI Stack](#2-nvidias-complete-voice-ai-stack)
3. [Our Current Annie Pipeline](#3-our-current-annie-pipeline)
4. [Component-by-Component Comparison](#4-component-by-component-comparison)
5. [NVIDIA Pipecat Integration](#5-nvidia-pipecat-integration)
6. [NVIDIA NIM Agent Blueprint](#6-nvidia-nim-agent-blueprint)
7. [DGX Spark Compatibility — The aarch64 Reality](#7-dgx-spark-compatibility--the-aarch64-reality)
8. [Licensing and Cost Analysis](#8-licensing-and-cost-analysis)
9. [Performance Comparison](#9-performance-comparison)
10. [What We Would Gain by Switching](#10-what-we-would-gain-by-switching)
11. [What We Would Lose](#11-what-we-would-lose)
12. [Hybrid Approach — Best of Both Worlds](#12-hybrid-approach--best-of-both-worlds)
13. [Recommendation](#13-recommendation)
14. [References](#14-references)

---

## 1. Executive Summary

NVIDIA's voice AI stack is a deep, enterprise-grade ecosystem built around ACE (Avatar Cloud Engine) microservices, NIM containers, and Riva speech services. It is designed for scalable, multi-user deployments with visual avatar integration — fundamentally different from our single-user personal companion use case.

### Key Findings

| Dimension | NVIDIA Stack | Our Stack (Annie) | Winner for her-os |
|-----------|-------------|-------------------|-------------------|
| **ASR quality** | Parakeet TDT 0.6B v2: 6.05% WER (English best-in-class) | WhisperX large-v3 + Nemotron Speech 0.6B | **Draw** — Parakeet wins English, Whisper wins multilingual |
| **ASR latency** | Parakeet: RTFx 3,386 (50x faster than alternatives) | Nemotron Speech: 431ms avg, Whisper: ~62x real-time | **NVIDIA** (Parakeet is blazing fast) |
| **TTS quality** | Magpie TTS / Chatterbox 0.35B (voice cloning, paralinguistic) | Kokoro v0.19 (~30ms, natural voice) | **Draw** — different strengths |
| **TTS latency** | Magpie TTS: <200ms (with Dynamo-Triton) | Kokoro: ~30ms in-process | **Annie** (6x faster) |
| **LLM** | Nemotron family (various sizes) | Nemotron 3 Nano 30B NVFP4 via vLLM (48-65 tok/s) | **Draw** — we already use Nemotron |
| **Orchestration** | ACE Agent (Colang + NeMo Guardrails) | Pipecat (Python, open-source) | **Annie** (simpler, more flexible) |
| **SER** | Audio2Emotion (ACE component) | emotion2vec+ large | **Annie** (proven, validated) |
| **DGX Spark compat** | **Broken** — ARM64 NIM issues, many containers x86 only | **Fully working** — all components validated | **Annie** (it actually runs) |
| **Customizability** | Limited (enterprise containers, Colang DSL) | Full control (Python, open source) | **Annie** |
| **Cost** | Free for dev (16 GPU limit), enterprise license for production | Zero cost, fully open source | **Annie** |
| **Multi-user scale** | Excellent (Kubernetes, autoscaling, Triton) | Single-user only | **NVIDIA** (but irrelevant for her-os) |

**Bottom line:** Our custom pipeline is better suited for her-os today. NVIDIA's stack excels at enterprise scale and visual avatars but offers limited advantage for a single-user personal AI companion on DGX Spark, especially given the ongoing ARM64 compatibility issues.

---

## 2. NVIDIA's Complete Voice AI Stack

### 2.1 ACE (Avatar Cloud Engine)

ACE is NVIDIA's suite of AI technologies for building conversational digital humans. It is NOT just a voice pipeline — it includes visual avatar rendering, lip-sync, and facial animation alongside speech.

**Core microservices:**

| Microservice | Function | Category |
|-------------|----------|----------|
| **Riva ASR** | Speech-to-text (Parakeet, Canary, Whisper, Nemotron Speech) | Speech |
| **Riva TTS** | Text-to-speech (Magpie, Chatterbox, FastPitch+HiFi-GAN) | Speech |
| **Riva NMT** | Neural machine translation (36 languages) | Speech |
| **ACE Agent** | Conversational orchestrator (NeMo Guardrails + Colang) | Intelligence |
| **Audio2Face** | Audio-driven facial animation + lip-sync | Animation |
| **Audio2Emotion** | Emotion inference from audio | Emotion |
| **AnimGraph** | Animation state machine controller | Animation |
| **Omniverse RTX** | Real-time pixel streaming renderer | Rendering |
| **Maxine Speech Live Portrait** | 2D lip-sync and animation | Animation |

**Key insight:** At least half of ACE is about **visual avatars** — not voice. For a voice-only companion like Annie, most of ACE is irrelevant overhead.

### 2.2 ACE Agent — The Orchestrator

ACE Agent 4.1 is the conversational controller that wires together ASR, LLM, and TTS. It is the closest analog to our Pipecat pipeline.

**Architecture:**
- **Chat Controller** — primary orchestrator, creates the ASR → Chat Engine → TTS pipeline
- **Chat Engine** — built on NVIDIA NeMo Guardrails, uses **Colang** (a domain-specific language for conversation flows)
- **NLP Server** — RESTful interfaces for NLP models
- **Plugin Server** — custom business logic integration
- **Web App** — frontend with voice and text I/O

**How it compares to Pipecat:**

| Feature | ACE Agent | Pipecat |
|---------|-----------|---------|
| Language | Colang (DSL) + Python | Pure Python |
| Ecosystem lock-in | NVIDIA-only backends | 60+ providers (NVIDIA, OpenAI, Google, etc.) |
| Learning curve | Steep (Colang, NeMo Guardrails, Kubernetes) | Moderate (standard Python, pip install) |
| Deployment | Kubernetes with NVIDIA UCS | Any Python environment |
| Guardrails | Built-in (NeMo Guardrails) | BYO (or add NeMo Guardrails separately) |
| RAG | LangChain/LlamaIndex integration | BYO |
| Community | NVIDIA enterprise forum | Active open-source (Daily.co backed) |
| Real-time streaming | Yes (via Riva gRPC) | Yes (frames-based pipeline) |
| Interruption handling | Yes | Yes (VAD-based, automatic) |

### 2.3 Riva — Speech Services

Riva is NVIDIA's speech AI platform, now available as NIM containers. It provides ASR, TTS, and NMT as gRPC microservices.

**ASR Models (current, as of March 2026):**

| Model | Params | Architecture | Languages | WER (avg) | RTFx | Streaming |
|-------|--------|-------------|-----------|-----------|------|-----------|
| **Parakeet TDT 0.6B v2** | 600M | FastConformer-TDT | English | 6.05% (#1 on HF ASR leaderboard) | 3,386 | Yes |
| **Parakeet CTC 1.1B** | 1.1B | FastConformer-CTC | English | ~6.2% | High | Yes |
| **Parakeet RNNT 1.1B** | 1.1B | FastConformer-RNNT | 25 languages | — | — | Yes |
| **Nemotron-ASR-Streaming 0.6B** | 600M | Cache-Aware FastConformer-RNNT | English | — | — | Yes (low latency) |
| **Canary-Qwen 2.5B** | 2.5B | FastConformer + Qwen3-1.7B LLM | English | 5.63% (best) | 418 | Batch (not streaming) |
| **Canary 1B** | 1B | FastConformer + LLM decoder | Multilingual | — | — | Yes |
| **Whisper Large v3** | 1.55B | Transformer encoder-decoder | 99 languages | — | — | Batch |

**TTS Models (current):**

| Model | Params | Architecture | Languages | Features | Latency |
|-------|--------|-------------|-----------|----------|---------|
| **Magpie TTS Multilingual** | — | Streaming encoder-decoder transformer | EN, ES, FR, DE | Production voice agents | <200ms (with Dynamo-Triton) |
| **Magpie TTS Zeroshot** | — | Streaming encoder-decoder transformer | English | 5-sec voice cloning | <200ms |
| **Magpie TTS Flow** | — | Flow matching decoder | English | 3-sec voice cloning, studio quality | Offline only |
| **Chatterbox-Turbo** | 350M | Distilled decoder (1-step) | English | `[laugh]`, `[cough]` paralinguistic tags, zero-shot cloning | Production-grade |
| **Chatterbox-Multilingual** | 500M | — | 23 languages | Zero-shot cloning | — |
| **FastPitch + HiFi-GAN** | ~50M | Mel spectrogram + vocoder | English | Fast, lightweight | Very low |

### 2.4 NVIGI SDK — In-Process Inference

The **NVIDIA In-Game Inferencing (NVIGI) SDK** is a newer component designed for running AI models in-process (not as separate microservices). It provides a C++ API with CUDA acceleration for running STT, LLM, and TTS directly inside applications.

This is conceptually similar to what we do with Kokoro (in-process TTS) and Nemotron Speech (in-process STT), but using NVIDIA's proprietary C++ SDK instead of Python.

---

## 3. Our Current Annie Pipeline

### Architecture

```
User speaks → WebRTC (SmallWebRTCTransport)
  → Silero VAD (voice activity detection)
  → Nemotron Speech 0.6B (streaming RNNT ASR, 2.49 GB VRAM)
  → Context Aggregator (conversation history + memory)
  → Nemotron 3 Nano 30B NVFP4 via vLLM (LLM, 7.55 GB VRAM)
  → SpeechTextFilter (strip markdown, think tags)
  → Kokoro v0.19 GPU TTS (~30ms, 0.5 GB VRAM)
  → WebRTC audio output to user
```
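The frames-based pattern Pipecat uses can be illustrated with a toy chain of processors (plain Python, not the real Pipecat API — the class names, frame kinds, and `build_pipeline` helper here are invented for the sketch):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str   # e.g. "audio", "transcript", "llm_text"
    data: str

class Processor:
    """Base stage: transform a frame, then pass it downstream."""
    def __init__(self):
        self.next = None
    def process(self, frame):
        if self.next:
            self.next.process(frame)

class FakeASR(Processor):
    def process(self, frame):
        if frame.kind == "audio":
            frame = Frame("transcript", f"text({frame.data})")
        super().process(frame)

class FakeLLM(Processor):
    def process(self, frame):
        if frame.kind == "transcript":
            frame = Frame("llm_text", f"reply({frame.data})")
        super().process(frame)

class Sink(Processor):
    def __init__(self):
        super().__init__()
        self.frames = []
    def process(self, frame):
        self.frames.append(frame)

def build_pipeline(stages):
    # Wire stages into a linear chain, like Pipeline([...]) in Pipecat
    for a, b in zip(stages, stages[1:]):
        a.next = b
    return stages[0]

sink = Sink()
head = build_pipeline([FakeASR(), FakeLLM(), sink])
head.process(Frame("audio", "pcm"))
print(sink.frames[0].kind)  # llm_text
```

Each real stage (VAD, SpeechTextFilter, context aggregator) is just another processor in this chain, which is why individual components can be swapped or debugged in isolation.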

### Supporting Components

| Component | Implementation | VRAM | Notes |
|-----------|---------------|------|-------|
| **Orchestration** | Pipecat (open-source Python) | 0 | Frames-based pipeline |
| **STT (voice chat)** | Nemotron Speech 0.6B (NeMo RNNT) | 2.49 GB | Cache-aware streaming, English |
| **STT (transcription)** | WhisperX large-v3 + pyannote | 4.8 GB + 1.9 GB | Multi-language, diarization |
| **LLM** | Nemotron 3 Nano 30B-A3B NVFP4, QLoRA-finetuned | 7.55 GB | Annie persona via QAT v4 |
| **TTS** | Kokoro v0.19 (in-process GPU) | 0.5 GB | Blackwell SM_121 patched |
| **SER** | emotion2vec+ large + wav2vec2-large | 0.6 + 0.6 GB | Dual-model emotion recognition |
| **Context Engine** | Custom FastAPI + PostgreSQL | CPU | Entity extraction, BM25 retrieval |
| **Web search** | SearXNG (Docker) | 0 | Federated meta-search |
| **Transport** | Pipecat SmallWebRTCTransport | 0 | Browser-based, no app needed |

### Total VRAM for Voice Pipeline

Always-loaded voice components: ~13.7 GB (Nemotron STT 2.49 + vLLM 7.55 + SER 1.2 + Kokoro 0.5 + pyannote 1.9)
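A quick sanity check of the total, using the component figures from the table above:

```python
# Per-component VRAM in GB, from the supporting-components table
vram = {
    "nemotron_stt": 2.49,
    "vllm_llm": 7.55,
    "ser": 1.2,       # emotion2vec+ + wav2vec2
    "kokoro_tts": 0.5,
    "pyannote": 1.9,
}
total = sum(vram.values())
print(round(total, 2))  # 13.64
```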

---

## 4. Component-by-Component Comparison

### 4.1 ASR: Parakeet/Nemotron Speech vs WhisperX/Nemotron Speech

**What NVIDIA offers:**
- **Parakeet TDT 0.6B v2** is the current state-of-the-art for English ASR. 6.05% average WER across 8 benchmarks, RTFx 3,386 (processes audio 3,386x faster than real-time). Trained on 120K hours. Supports punctuation, capitalization, and word-level timestamps. English only.
- **Parakeet RNNT 1.1B** supports 25 languages but is English-focused for best quality.
- **Canary-Qwen 2.5B** combines FastConformer encoder with Qwen3-1.7B LLM decoder. Best WER (5.63% average) but 418 RTFx and batch-only (not streaming). English only.
- **Nemotron-ASR-Streaming 0.6B** (which we already use) is their streaming-optimized model for voice agents. Cache-aware, low latency, English.

**What we have:**
- **Nemotron Speech 0.6B** for real-time voice chat (streaming, 431ms avg latency, 2.49 GB). Already an NVIDIA model.
- **WhisperX large-v3** for background transcription (99 languages, diarization, 62x real-time). Used in audio-pipeline for ambient capture from Omi wearable.

**Verdict:** We already use NVIDIA's streaming ASR (Nemotron Speech) for voice chat. Upgrading to Parakeet TDT 0.6B v2 could improve WER, but we would need to validate DGX Spark compatibility. The multilingual gap matters — WhisperX handles Kannada ambient transcripts that no NVIDIA ASR model can.

**Potential upgrade:** Swap Nemotron Speech 0.6B for Parakeet TDT 0.6B v2 in voice chat path. Same 600M param class. Would need NeMo toolkit integration test on aarch64.

### 4.2 TTS: Magpie/Chatterbox vs Kokoro

**What NVIDIA offers:**
- **Magpie TTS Multilingual** — streaming TTS for voice agents. EN/ES/FR/DE. <200ms latency with Dynamo-Triton optimization. Available as Riva NIM.
- **Magpie TTS Zeroshot** — 5-second voice cloning. Could clone a specific voice for Annie.
- **Chatterbox-Turbo 0.35B** — from Resemble AI (NVIDIA partnership). Paralinguistic tags (`[laugh]`, `[cough]`, `[chuckle]`), zero-shot voice cloning, 350M params. Open source (Apache 2.0). NVIDIA watermarking. Designed for production voice agents.
- **FastPitch + HiFi-GAN** — lightweight traditional TTS. Fast but less natural.

**What we have:**
- **Kokoro v0.19** — in-process GPU TTS, ~30ms latency, 0.5 GB VRAM. Natural voice quality. Required a custom `TorchSTFT` monkey-patch for Blackwell SM_121 (`blackwell_patch.py`). No voice cloning.

**Verdict:** This is the most interesting comparison.

| Feature | Magpie/Chatterbox | Kokoro |
|---------|-------------------|--------|
| Latency | <200ms (NIM container) | ~30ms (in-process) |
| Voice cloning | Yes (5-sec sample) | No |
| Paralinguistic | Yes (`[laugh]`, `[cough]`) | No |
| Emotional control | Yes (via tags) | Limited |
| Languages | EN/ES/FR/DE (Magpie), 23 (Chatterbox-Multi) | English + limited |
| VRAM | Unknown (NIM container, likely 2-4 GB) | 0.5 GB |
| Voice quality | High (enterprise-grade) | High (natural, warm) |
| Open source | Chatterbox yes, Magpie NIM only | Yes |
| DGX Spark | Unvalidated (NIM aarch64 issues) | Validated (with patch) |

**Key insight:** Chatterbox-Turbo's paralinguistic features (`[laugh]`, `[cough]`) are compelling for a personal companion. Annie could laugh at jokes, sigh sympathetically, etc. This is a genuine capability gap in our Kokoro setup. However, Kokoro's 30ms latency is 6x faster than Magpie's <200ms — latency matters enormously for conversational feel.

### 4.3 LLM: Same Model

We already use **Nemotron 3 Nano 30B-A3B NVFP4** — an NVIDIA model. Our QLoRA fine-tune (Annie persona) is what makes it special. NVIDIA's NIM would serve the same base model but without our behavioral fine-tuning. No advantage to switching.

### 4.4 SER: Audio2Emotion vs emotion2vec+

**What NVIDIA offers:**
- **Audio2Emotion** — ACE microservice that infers emotional state from audio. Designed to drive facial animation blendshapes (7 basic emotions). Part of ACE's avatar pipeline.

**What we have:**
- **emotion2vec+ large** + **wav2vec2-large** — dual-model SER pipeline. emotion2vec+ is the current SOTA for speech emotion recognition. Outputs valence, arousal, and categorical emotions with confidence scores.

**Verdict:** Audio2Emotion is designed for avatar animation (mapping emotion to facial blendshapes), not for enriching conversation context. emotion2vec+ is purpose-built for SER and is the community SOTA. Our pipeline is better for our use case.

### 4.5 Orchestration: ACE Agent vs Pipecat

**What NVIDIA offers:**
- ACE Agent 4.1 — orchestrates ASR → LLM → TTS using Colang DSL and NeMo Guardrails. Kubernetes deployment. Tight integration with all NVIDIA microservices. TensorRT optimization. Triton Inference Server.

**What we have:**
- Pipecat — Python-based, frames pipeline, 60+ provider integrations. Simple, flexible, well-documented. WebRTC built in. No Kubernetes needed.

**Verdict for her-os:**
ACE Agent is designed for enterprise deployments with multiple concurrent users, Kubernetes orchestration, and visual avatars. For a single-user personal companion on one machine, Pipecat is dramatically simpler and equally capable. ACE Agent's Colang DSL adds complexity without benefit for our use case.

---

## 5. NVIDIA Pipecat Integration

NVIDIA has **official Pipecat service integrations** in the Pipecat framework. Located at `pipecat/services/nvidia/`:

### Available Services

| Service | File | Backend | Default Model |
|---------|------|---------|---------------|
| **NvidiaSTTService** | `stt.py` | Riva ASR (gRPC, cloud) | `parakeet-ctc-1.1b-asr` |
| **NvidiaSegmentedSTTService** | `stt.py` | Riva ASR (gRPC, cloud) | `canary-1b-asr` |
| **NvidiaLLMService** | `llm.py` | NIM API (OpenAI-compat) | `nvidia/llama-3.1-nemotron-70b-instruct` |
| **NvidiaTTSService** | `tts.py` | Riva TTS (gRPC, cloud) | `magpie-tts-multilingual` (voice: `Magpie-Multilingual.EN-US.Aria`) |

### Architecture Details

- **STT:** Streaming via gRPC to `grpc.nvcf.nvidia.com:443` (NVIDIA Cloud Functions). Supports 30+ language variants. Features: profanity filter, automatic punctuation, LM word boosting.
- **LLM:** OpenAI-compatible API at `integrate.api.nvidia.com/v1`. Extends `OpenAILLMService` with NVIDIA-specific token accumulation (NVIDIA reports tokens incrementally during streaming, unlike OpenAI's final summary).
- **TTS:** Riva gRPC with `synthesize_online()` streaming. Quality parameter (0-100, default 20). SSL support.
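The LLM bullet's token-accumulation difference can be sketched with a toy reducer. This is one plausible reading (treating NVIDIA's mid-stream counts as cumulative); the chunk shape and field names are illustrative, not the actual `OpenAILLMService` internals:

```python
def accumulate_usage(chunks):
    """Streams that report cumulative token counts on many chunks:
    keep the running maximum so mid-stream updates don't double-count."""
    prompt, completion = 0, 0
    for c in chunks:
        usage = c.get("usage")
        if usage:
            prompt = max(prompt, usage.get("prompt_tokens", 0))
            completion = max(completion, usage.get("completion_tokens", 0))
    return {"prompt_tokens": prompt, "completion_tokens": completion,
            "total_tokens": prompt + completion}

# Hypothetical NVIDIA-style stream: usage arrives on every chunk,
# unlike OpenAI's single final summary.
stream = [
    {"delta": "Hel", "usage": {"prompt_tokens": 12, "completion_tokens": 1}},
    {"delta": "lo",  "usage": {"prompt_tokens": 12, "completion_tokens": 2}},
    {"delta": "!",   "usage": {"prompt_tokens": 12, "completion_tokens": 3}},
]
print(accumulate_usage(stream))
```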

### Pipecat Examples

The Pipecat repo includes 6 NVIDIA-specific examples:
- `07r-interruptible-nvidia.py` — full voice agent with NVIDIA ASR/TTS/LLM
- `01c-nvidia-riva-tts.py` — NVIDIA Riva TTS demo
- Settings update examples for STT, segmented STT, LLM, and TTS

### Critical Limitation for her-os

**All NVIDIA Pipecat services default to cloud endpoints** (`grpc.nvcf.nvidia.com`). For self-hosted/local deployment, you would need to:
1. Run Riva NIM containers locally on DGX Spark (ARM64 compatibility issues)
2. Point the Pipecat services at local gRPC endpoints

In short, these services are designed cloud-first, local-second.

Our current pipeline runs entirely in-process or via local Docker containers — zero cloud dependencies.

---

## 6. NVIDIA NIM Agent Blueprint

### Nemotron Voice Agent Blueprint

NVIDIA offers a **Nemotron Voice Agent** blueprint on `build.nvidia.com` — a pre-configured voice agent that combines:
- Riva ASR (speech-to-text)
- Nemotron LLM (response generation)
- Riva TTS (text-to-speech)

This is a "launchable" blueprint — deploy with a single command and get a working voice agent.

### How It Works

The blueprint packages multiple NIM containers into a single deployable unit:
1. **ASR NIM** — Riva streaming ASR for real-time transcription
2. **LLM NIM** — Nemotron for response generation
3. **TTS NIM** — Riva TTS for speech synthesis
4. Unified API gateway with WebSocket support

### Relevance to her-os

**Low relevance.** The blueprint is designed for generic voice agents, not personal companions. It lacks:
- Custom persona (our QLoRA-trained Annie)
- Emotion recognition
- Personal context engine
- Tool calling (web search, memory save)
- Session management with compaction
- All the custom behaviors we built over 300+ sessions

It would be a starting point for a new project, not a replacement for our mature pipeline.

---

## 7. DGX Spark Compatibility — The aarch64 Reality

This is the **most important section** for practical decision-making.

### Known ARM64/DGX Spark Issues

| Issue | Status | Impact |
|-------|--------|--------|
| **Missing ARM64 NIM images** | Many containers x86_64 only | Cannot pull/run on DGX Spark |
| **CUDA 13 + ONNX Runtime** | `cudaErrorSymbolNotFound` on ARM64 | Embedding NIMs broken |
| **vLLM SM_121 support** | Partially fixed (cu130-nightly works) | Workaround available |
| **Nemotron 3 Nano NIM** | Initially broken, fixed with pre-built container | Now working with workaround |
| **Riva ASR NIM** | Unvalidated on ARM64 | Unknown |
| **Riva TTS NIM** | Unvalidated on ARM64 | Unknown |
| **ACE Agent containers** | x86_64 assumed | Likely broken on ARM64 |
| **Audio2Face** | Requires Omniverse, x86_64 | Not available on ARM64 |

### What We Know Works on DGX Spark

| Component | Status | How |
|-----------|--------|-----|
| **Nemotron Speech 0.6B** | Working | NeMo Python (not NIM container) |
| **WhisperX large-v3** | Working | PyTorch direct |
| **Kokoro v0.19 GPU** | Working | In-process Python (with SM_121 patch) |
| **Nemotron 3 Nano NVFP4** | Working | vLLM Docker (cu130-nightly) |
| **emotion2vec+** | Working | PyTorch direct |
| **Pipecat** | Working | Pure Python |
| **SearXNG** | Working | Docker (multi-arch) |

### The Pattern

Models loaded via **NeMo Python** or **PyTorch directly** work on DGX Spark. Models loaded via **NIM containers** have ARM64 compatibility risks. This is because NIM containers bundle specific CUDA libraries and TensorRT versions that may not have ARM64 builds.

**NVIDIA is actively fixing ARM64 NIM support** (DGX Spark is their flagship personal AI hardware, so they have strong incentive). But as of March 2026, it is not production-ready for the full speech stack.

---

## 8. Licensing and Cost Analysis

### Our Stack: Zero Cost

| Component | License | Cost |
|-----------|---------|------|
| Pipecat | BSD-2 | $0 |
| WhisperX | BSD | $0 |
| Nemotron Speech 0.6B | NVIDIA Open Model License | $0 |
| Kokoro v0.19 | Apache 2.0 | $0 |
| emotion2vec+ | MIT | $0 |
| Nemotron 3 Nano 30B | NVIDIA Open Model License | $0 |
| vLLM | Apache 2.0 | $0 |
| PostgreSQL | PostgreSQL License | $0 |
| SearXNG | AGPL | $0 |
| **Total** | All open source | **$0/month** |

### NVIDIA Stack: Tiered

| Tier | Cost | Limits | Includes |
|------|------|--------|----------|
| **Developer Program (free)** | $0 | Up to 2 nodes / 16 GPUs | NIM container download for dev/test/research |
| **API Catalog (cloud)** | Free credits, then per-token | Rate-limited | Cloud-hosted NIM endpoints |
| **AI Enterprise (90-day trial)** | $0 for 90 days | — | Full production license |
| **AI Enterprise (production)** | ~$4,500/GPU/year (list price) | — | Enterprise support, security patches, SLAs |

### Verdict for her-os

Our use case (1 personal DGX Spark, non-commercial) falls within the **free Developer Program** tier. We can use NIM containers for free — the cost concern is not about money but about ARM64 compatibility and enterprise complexity overhead.

The real "cost" of NVIDIA's stack is **complexity** — Kubernetes, container orchestration, gRPC service mesh, NeMo Guardrails configuration. Our Python-first approach is dramatically simpler to develop, debug, and iterate on.

---

## 9. Performance Comparison

### 9.1 End-to-End Voice Latency

The critical metric for a voice companion is **voice-to-voice round-trip time** — how long from when the user stops speaking until they hear the first word of the response.

**Our pipeline (measured):**

| Stage | Latency | Notes |
|-------|---------|-------|
| VAD silence detection | ~300ms | Silero VAD, configurable |
| Nemotron Speech ASR | ~431ms | Streaming, first transcript |
| Context aggregation | ~5ms | In-memory |
| vLLM TTFT | ~88-130ms | Nemotron 3 Nano NVFP4, first token |
| First sentence accumulation | ~200-400ms | Depends on response length |
| Kokoro TTS (first chunk) | ~30ms | In-process GPU |
| WebRTC transport | ~20-50ms | Network dependent |
| **Total (estimated)** | **~1,100-1,350ms** | **First word heard** |

**NVIDIA optimized pipeline (theoretical, from docs):**

| Stage | Latency | Notes |
|-------|---------|-------|
| VAD | ~200ms | Riva VAD or client-side |
| Riva ASR (Parakeet/Nemotron) | ~100-200ms | TensorRT optimized, streaming |
| LLM TTFT | ~50-100ms | TensorRT-LLM optimized |
| First sentence accumulation | ~200-400ms | Same constraint |
| Riva TTS (Magpie) | <200ms | Dynamo-Triton optimized |
| Transport | ~20-50ms | Same |
| **Total (theoretical)** | **~770-1,150ms** | **First word heard** |

**Comparison:** NVIDIA's optimized pipeline could be ~200-400ms faster end-to-end, primarily from TensorRT optimization of ASR and LLM. However:
1. These numbers are from NVIDIA marketing materials, not validated on DGX Spark ARM64
2. Our Kokoro TTS at 30ms is actually faster than Riva TTS at <200ms
3. The LLM TTFT advantage assumes TensorRT-LLM (which has ARM64 issues)
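Summing the per-stage figures reproduces both totals above (a quick arithmetic check; Riva TTS's "<200ms" is treated as a flat 200ms on both ends of the range):

```python
def band(stages):
    # Sum per-stage (min_ms, max_ms) tuples into an end-to-end range
    return (sum(lo for lo, _ in stages), sum(hi for _, hi in stages))

# Our measured pipeline: VAD, ASR, context, LLM TTFT,
# first-sentence accumulation, Kokoro TTS, WebRTC
ours = [(300, 300), (431, 431), (5, 5), (88, 130),
        (200, 400), (30, 30), (20, 50)]

# NVIDIA's theoretical pipeline: VAD, Riva ASR, LLM TTFT,
# first-sentence accumulation, Riva TTS (assumed 200ms), transport
nvidia = [(200, 200), (100, 200), (50, 100),
          (200, 400), (200, 200), (20, 50)]

print(band(ours))    # (1074, 1346)
print(band(nvidia))  # (770, 1150)
```

Most of the spread in both pipelines comes from first-sentence accumulation, which no serving stack can optimize away — it is a property of the LLM's output, not the infrastructure.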

### 9.2 ASR Quality (WER)

| Model | LibriSpeech Clean | LibriSpeech Other | Average (8 benchmarks) |
|-------|-------------------|-------------------|----------------------|
| **Parakeet TDT 0.6B v2** | 1.69% | 3.19% | **6.05%** (SOTA) |
| **Parakeet CTC 0.6B** | 1.87% | 3.76% | ~8.5% |
| **Parakeet CTC 1.1B** | 1.83% | 3.54% | ~7.5% |
| **Canary-Qwen 2.5B** | 1.60% | 3.10% | **5.63%** (best overall) |
| **Whisper Large v3** | ~2.5% | ~5.5% | ~10-12% |
| **Nemotron Speech 0.6B** | — | — | — (no published WER on standard benchmarks) |

**Key finding:** Parakeet TDT 0.6B v2 and Canary-Qwen 2.5B significantly outperform Whisper Large v3 on English. However, Whisper supports 99 languages (including Kannada for ambient transcription). For English-only voice chat, Parakeet is strictly better.

### 9.3 TTS Quality

No standardized benchmarks allow direct comparison. Subjective assessment:

| Dimension | Magpie TTS | Chatterbox-Turbo | Kokoro v0.19 |
|-----------|-----------|-----------------|-------------|
| Naturalness | High | High | High |
| Expressiveness | Good | Excellent (paralinguistic tags) | Good |
| Voice variety | Multi-voice, multi-language | Zero-shot cloning + 23 langs | Limited voices |
| Latency | <200ms | Production-grade | ~30ms |
| Emotional range | Standard | `[laugh]`, `[cough]`, `[chuckle]` | Limited |
| Warmth/personality | Enterprise neutral | Configurable | Warm, natural |

### 9.4 VRAM Comparison

| Component | Our Stack | NVIDIA NIM Stack (estimated) |
|-----------|-----------|------------------------------|
| ASR | 2.49 GB (Nemotron Speech) | ~2-3 GB (Riva ASR NIM) |
| LLM | 7.55 GB (vLLM NVFP4) | ~7-8 GB (NIM NVFP4) |
| TTS | 0.5 GB (Kokoro) | ~2-4 GB (Riva TTS NIM) |
| SER | 1.2 GB (emotion2vec+ + wav2vec2) | ~1-2 GB (Audio2Emotion NIM) |
| Overhead | ~2 GB (vLLM, Docker) | ~4-6 GB (multiple NIM containers, Triton) |
| **Total voice pipeline** | **~13.7 GB** | **~16-23 GB** |

NVIDIA's NIM containers carry more overhead (each is a separate Docker container with its own CUDA runtime, TensorRT engine, and Triton server). Our in-process approach (Kokoro, emotion2vec+, Pipecat) is more memory-efficient.

---

## 10. What We Would Gain by Switching

### 10.1 TensorRT Optimization

TensorRT provides optimized inference kernels that can be 2-5x faster than vanilla PyTorch for specific model architectures. For ASR and TTS models, TensorRT optimization could reduce latency.

**But:** We already get TensorRT-level optimization for the LLM via vLLM's CUDA graph acceleration. The ASR (Nemotron Speech via NeMo) already uses optimized RNNT decoding. The main gain would be in TTS.

### 10.2 Paralinguistic TTS

Chatterbox-Turbo's `[laugh]`, `[cough]`, `[chuckle]` tags are genuinely novel. Annie could:
- Laugh when the user tells a joke
- Sigh sympathetically during emotional conversations
- Use verbal filler like "hmm" naturally

This is a capability we cannot replicate with Kokoro today.

### 10.3 Voice Cloning

Magpie TTS Zeroshot and Chatterbox both support voice cloning from 5-second audio samples. We could:
- Clone a specific voice for Annie (consistent identity across sessions)
- Let users choose Annie's voice
- Match voice characteristics to emotional state

### 10.4 Enterprise-Grade Reliability

NIM containers include:
- Continuous vulnerability fixes
- Production-tested configurations
- Health monitoring and metrics
- Kubernetes autoscaling (irrelevant for single user)

### 10.5 Future TensorRT-LLM Integration

As NVIDIA improves ARM64 NIM support, TensorRT-LLM could provide significant speedup for the Nemotron 3 Nano model, potentially pushing throughput from 48-65 tok/s to 80-100+ tok/s.

---

## 11. What We Would Lose

### 11.1 Kokoro's Ultra-Low Latency

Kokoro's ~30ms in-process TTS is 6x faster than Riva TTS (<200ms). In conversational AI, TTS latency directly impacts perceived responsiveness. Switching to Riva TTS would make Annie feel slower.

### 11.2 Full Pipeline Control

Our entire pipeline is Python. Every component can be debugged, modified, or replaced independently. ACE Agent's Colang DSL and NIM containers are opaque boxes — harder to debug, harder to customize.

**Example from our experience:** Over 300+ sessions, we have:
- Built a custom ThinkBlockFilter to strip `<think>` tags from streaming
- Created a SpeechTextFilter to clean markdown without content loss
- Implemented anti-Alzheimer compaction with tiered escalation
- Added context dirtying prevention for tool results
- Built a custom text chat path with independent system prompt

None of these would be possible with ACE Agent's opaque pipeline.
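The streaming `<think>` stripping is the fiddly one: a tag can arrive split across two chunks. A minimal stateful filter illustrating the idea (a toy sketch, not the actual ThinkBlockFilter implementation):

```python
def _partial_tail(s: str, tag: str) -> int:
    """Length of the longest suffix of s that is a proper prefix of tag."""
    for k in range(min(len(tag) - 1, len(s)), 0, -1):
        if s.endswith(tag[:k]):
            return k
    return 0

class ThinkStripper:
    """Strip <think>...</think> spans from streamed text, handling
    tags that arrive split across chunk boundaries."""
    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buf = ""
        self.inside = False

    def feed(self, chunk: str) -> str:
        self.buf += chunk
        out = []
        while True:
            tag = self.CLOSE if self.inside else self.OPEN
            i = self.buf.find(tag)
            if i == -1:
                # Hold back any partial tag at the end of the buffer
                k = _partial_tail(self.buf, tag)
                if not self.inside:
                    out.append(self.buf[:len(self.buf) - k])
                self.buf = self.buf[len(self.buf) - k:]
                return "".join(out)
            if not self.inside:
                out.append(self.buf[:i])          # speakable text
            self.buf = self.buf[i + len(tag):]    # drop the tag itself
            self.inside = not self.inside
```

Feeding `["Hello <th", "ink>secret</th", "ink>world"]` yields `"Hello world"` — the tag halves never reach the TTS.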

### 11.3 emotion2vec+ (Speech Emotion Recognition)

emotion2vec+ large is the community SOTA for speech emotion recognition. It provides:
- Valence/arousal continuous values
- 9 categorical emotions with confidence scores
- Works on DGX Spark (validated)

Audio2Emotion is designed for avatar facial animation blendshapes, not for enriching conversation context. Switching would degrade our emotional awareness capability.

### 11.4 Zero-Dependency Architecture

Our stack has zero cloud dependencies. Every component runs locally on DGX Spark. NVIDIA's Pipecat integration defaults to cloud endpoints. Self-hosting the full NIM stack locally adds container management complexity.

### 11.5 Annie's Persona (QLoRA Fine-Tune)

Our Nemotron 3 Nano is QLoRA-fine-tuned with 999 Annie conversations (90% behavioral pass rate). This persona training does not transfer to NIM containers — we would need to serve our custom model weights regardless, making NIM's "pre-optimized model" advantage irrelevant.

### 11.6 Multilingual Ambient Transcription

WhisperX handles Kannada-English code-mixed ambient transcription from the Omi wearable. No NVIDIA ASR model supports Kannada. Switching to all-NVIDIA ASR would break ambient transcription.

---

## 12. Hybrid Approach — Best of Both Worlds

Rather than all-or-nothing, we can selectively adopt NVIDIA components where they provide clear advantage.

### Tier 1: Already Done (Keep)

| Component | Current | Status |
|-----------|---------|--------|
| **LLM** | Nemotron 3 Nano 30B NVFP4 | Already NVIDIA model |
| **Voice STT** | Nemotron Speech 0.6B | Already NVIDIA model |
| **Orchestration** | Pipecat | Superior for our use case |
| **SER** | emotion2vec+ | Superior for our use case |

### Tier 2: Evaluate When ARM64 Matures

| Component | Current | Potential Upgrade | When |
|-----------|---------|-------------------|------|
| **Voice STT** | Nemotron Speech 0.6B | **Parakeet TDT 0.6B v2** (via NeMo Python, NOT NIM) | Now — test via NeMo toolkit |
| **TTS** | Kokoro v0.19 | **Chatterbox-Turbo 0.35B** (for paralinguistic features) | When validated on aarch64 |
| **LLM serving** | vLLM (cu130-nightly) | **TensorRT-LLM** (when ARM64 NIM works) | When NIM ARM64 is stable |

### Tier 3: Future Consideration

| Component | Current | Potential | When |
|-----------|---------|-----------|------|
| **Voice cloning** | None | Magpie TTS Zeroshot or Chatterbox | When Annie needs a unique voice identity |
| **ASR (ambient)** | WhisperX | Canary 1B (if multilingual+Kannada) | When Canary supports Kannada |
| **NIM containers** | Not used | Full NIM stack | When ARM64 is production-ready |

### Specific Hybrid Recommendations

#### 1. Try Parakeet TDT 0.6B v2 via NeMo Python (LOW RISK)

Parakeet TDT 0.6B v2 is the #1 ASR model on HuggingFace's Open ASR Leaderboard. Same 600M param class as our current Nemotron Speech. Can be loaded via NeMo Python (not NIM container), so ARM64 compatibility is likely.

**Benefit:** Better English WER (6.05% vs unknown for Nemotron Speech). Word-level timestamps. Punctuation and capitalization.

**Risk:** Need to validate streaming latency on DGX Spark. Nemotron Speech's cache-aware architecture may still be faster for real-time voice chat.

**Effort:** ~1 session to test. Wrap Parakeet in a Pipecat STT processor.

#### 2. Evaluate Chatterbox-Turbo for Paralinguistic TTS (MEDIUM RISK)

Chatterbox-Turbo is open source (GitHub: resemble-ai/chatterbox). 350M params. Paralinguistic tags (`[laugh]`, `[cough]`). Zero-shot voice cloning.

**Benefit:** Annie could laugh, sigh, use verbal fillers. More human-like interaction.

**Risk:** Latency unknown on DGX Spark. May need CUDA/PyTorch compatibility work. Kokoro's 30ms is hard to beat.

**Approach:** Run both TTS models. Use Kokoro for normal speech (low latency), Chatterbox for paralinguistic moments (when emotion2vec+ detects humor, sadness, etc.). Dual-TTS with emotion-driven routing.
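The routing policy itself is trivial; the engine names, emotion set, and threshold below are illustrative, not validated values:

```python
def pick_tts(emotion: str, confidence: float, threshold: float = 0.7) -> str:
    """Route to the paralinguistic engine only when the SER signal is
    both expressive and confident; otherwise take the low-latency path."""
    expressive = {"happy", "sad", "surprised", "angry"}
    if emotion in expressive and confidence >= threshold:
        return "chatterbox"   # paralinguistic tags, higher latency
    return "kokoro"           # ~30ms default path

print(pick_tts("happy", 0.92))    # chatterbox
print(pick_tts("neutral", 0.99))  # kokoro
print(pick_tts("sad", 0.40))      # kokoro
```

The threshold guards the latency budget: a low-confidence emotion read should never cost the user an extra ~170ms of TTS latency.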

#### 3. Watch TensorRT-LLM ARM64 Maturity (NO ACTION NOW)

TensorRT-LLM could push our LLM throughput from 48-65 tok/s to 80-100+ tok/s. But ARM64 NIM support is not ready. Monitor quarterly.

---

## 13. Recommendation

### Decision: Stay with Custom Pipeline, Selectively Adopt Components

**Do NOT switch to NVIDIA's full ACE/NIM voice stack.** The reasons:

1. **ARM64 compatibility is not production-ready** for the speech NIM stack on DGX Spark
2. **Our pipeline already works** — 13.5 GB VRAM, ~1.1-1.35s voice-to-voice latency, validated over 300+ sessions
3. **ACE is designed for enterprise multi-user with visual avatars** — massive overkill for single-user voice companion
4. **Kokoro's 30ms TTS beats Riva's <200ms** — latency matters for conversational feel
5. **Our QLoRA-trained persona cannot run in vanilla NIM** — we need custom model serving regardless
6. **Pipecat is superior to ACE Agent** for our use case — simpler, more flexible, better ecosystem

### What TO Do

| Action | Priority | Effort | Impact |
|--------|----------|--------|--------|
| **Test Parakeet TDT 0.6B v2** via NeMo Python on DGX Spark | **HIGH** | 1 session | Better English ASR WER |
| **Evaluate Chatterbox-Turbo** for paralinguistic TTS | **MEDIUM** | 1-2 sessions | Annie can laugh/sigh |
| **Monitor ARM64 NIM maturity** quarterly | **LOW** | Ongoing | Future TensorRT gains |
| **Keep Kokoro as primary TTS** | **FIRM** | 0 | 30ms latency is irreplaceable |
| **Keep Pipecat as orchestrator** | **FIRM** | 0 | Full control, proven stability |
| **Keep emotion2vec+ for SER** | **FIRM** | 0 | Superior to Audio2Emotion for our use case |
| **Keep WhisperX for ambient** | **FIRM** | 0 | Kannada support, no NVIDIA alternative |

### What NOT To Do

1. **Do NOT deploy Riva NIM containers** on DGX Spark until ARM64 is validated
2. **Do NOT switch to ACE Agent** — Colang adds complexity without benefit
3. **Do NOT replace Kokoro with Magpie TTS** — trading 30ms latency for <200ms is a downgrade
4. **Do NOT abandon WhisperX** — no NVIDIA ASR supports Kannada
5. **Do NOT invest in Kubernetes/Triton** setup — single user does not need it

---

## 14. References

### NVIDIA Official

1. **NVIDIA ACE overview:** https://developer.nvidia.com/ace
2. **ACE GitHub repository:** https://github.com/NVIDIA/ACE
3. **ACE Agent 4.1 README:** https://github.com/NVIDIA/ACE/blob/main/microservices/ace_agent/4.1/README.md
4. **NVIDIA NIM documentation:** https://docs.nvidia.com/nim/index.html
5. **NVIDIA Speech NIM docs:** https://docs.nvidia.com/nim/speech/latest/index.html
6. **NIM free for developers:** https://developer.nvidia.com/blog/access-to-nvidia-nim-now-available-free-to-developer-program-members/
7. **NVIDIA Speech AI performance blog:** https://developer.nvidia.com/blog/nvidia-speech-ai-models-deliver-industry-leading-accuracy-and-performance/
8. **Riva TTS multilingual blog:** https://developer.nvidia.com/blog/enhancing-multilingual-human-like-speech-and-voice-cloning-with-nvidia-riva-tts/
9. **DGX Spark optimizations blog:** https://developer.nvidia.com/blog/new-software-and-model-optimizations-supercharge-nvidia-dgx-spark/
10. **DGX Spark performance blog:** https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/
11. **DGX Spark agents blog:** https://developer.nvidia.com/blog/scaling-autonomous-ai-agents-and-workloads-with-nvidia-dgx-spark/

### Models (HuggingFace)

12. **Parakeet CTC 0.6B:** https://huggingface.co/nvidia/parakeet-ctc-0.6b
13. **Parakeet TDT 0.6B v2:** https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
14. **Parakeet CTC 1.1B:** https://huggingface.co/nvidia/parakeet-ctc-1.1b
15. **Canary-Qwen 2.5B:** https://huggingface.co/nvidia/canary-qwen-2.5b
16. **Chatterbox (Resemble AI):** https://github.com/resemble-ai/chatterbox
17. **NVIDIA Nemotron 3 Nano:** https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

### Pipecat Integration

18. **Pipecat NVIDIA services:** https://github.com/pipecat-ai/pipecat/tree/main/src/pipecat/services/nvidia
19. **Pipecat NVIDIA STT:** https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nvidia/stt.py
20. **Pipecat NVIDIA TTS:** https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nvidia/tts.py
21. **Pipecat NVIDIA LLM:** https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nvidia/llm.py

### her-os Internal

22. **NVIDIA open models research:** `docs/RESEARCH-NVIDIA-OPEN-MODELS.md`
23. **Resource registry (VRAM budget):** `docs/RESOURCE-REGISTRY.md`
24. **Pipecat voice agent research:** `docs/RESEARCH-PIPECAT-VOICE-AGENT.md`
25. **NVFP4 lessons learned:** `docs/NVFP4-LESSONS-LEARNED.md`
26. **ARM64 NIM issues forum:** https://forums.developer.nvidia.com/t/missing-official-native-arm64-nim-images-for-essential-ai-models/350681
