# Research: Hume AI TADA TTS — Suitability for Annie Voice Pipeline

**Date:** 2026-03-13 (Session 301)
**Status:** Research complete
**Context:** Evaluate TADA (Text-Acoustic Dual Alignment) as potential replacement for Kokoro TTS in Annie Voice pipeline on DGX Spark (128GB unified memory).

---

## 1. Executive Summary

TADA is Hume AI's first open-source TTS — a **Llama 3.2-based speech-language model** with a novel 1:1 text-acoustic token alignment. Released March 2026 under the MIT license. It reports zero hallucinations and a roughly 5x lower (i.e., faster) RTF than comparable LLM-based TTS systems.

**Verdict:** TADA is architecturally impressive but **NOT suitable for Annie Voice** due to extreme VRAM cost (17.4 GB vs Kokoro's 0.5 GB = 35x more), no Pipecat integration, no streaming support, and lower naturalness scores than Kokoro. Kokoro remains the optimal choice.

---

## 2. Model Family

| Variant | Base Model | Total Params | Languages | VRAM | RTF | License |
|---------|-----------|-------------|-----------|------|-----|---------|
| **TADA-1B** | Llama 3.2 1B | 2B (incl. encoder) | English | **17.4 GB** | 0.09 | MIT |
| **TADA-3B-ml** | Llama 3.2 3B | 4B (incl. encoder) | 10 (en, ar, ch, de, es, fr, it, ja, pl, pt) | **26.1 GB** | 0.13 | MIT |
| **TADA-Codec** | — | — | — | — | — | MIT (shared encoder) |

## 3. Architecture — What Makes TADA Novel

### 3.1 Text-Acoustic Dual Alignment

Traditional LLM-based TTS (like XTTS, CosyVoice) tokenizes audio at 12.5–75 Hz fixed frame rates. A 10-second clip = 125–750 tokens. This wastes context window and causes alignment drift.

TADA's key innovation: **1:1 token alignment** — every text subword token has exactly one corresponding acoustic vector. The tokenizer encodes audio into a sequence whose length matches the text token count.

- Frame rate: **2–3 Hz** (vs 12.5–75 Hz baselines)
- Context efficiency: **~700 seconds in 2048 tokens** (vs ~70 seconds in conventional systems = 10x more)
- One autoregressive step = one text token + its full speech segment (dynamic duration)
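The context-efficiency claim above is simple arithmetic. A quick sketch — only the 2048-token context size comes from the text; the frame rates are illustrative assumptions (the ~29 Hz baseline is just the mid-range rate implied by the "~70 seconds" figure):

```python
# Seconds of audio that fit in one context window at a given acoustic
# token rate. Only the 2048-token context size comes from the text above;
# the frame rates below are illustrative assumptions.
CONTEXT_TOKENS = 2048

def audio_seconds_per_context(frame_rate_hz: float,
                              context_tokens: int = CONTEXT_TOKENS) -> float:
    return context_tokens / frame_rate_hz

tada = audio_seconds_per_context(3.0)           # ~683 s -> the "~700 s" claim
conventional = audio_seconds_per_context(29.0)  # ~71 s  -> the "~70 s" baseline
print(f"TADA ~{tada:.0f} s vs conventional ~{conventional:.0f} s "
      f"(~{tada / conventional:.0f}x)")
```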

### 3.2 Dynamic Duration Synthesis

Each step dynamically determines:
- Duration of the speech segment for that token
- Prosody and intonation
- Full waveform generation regardless of phoneme length

This eliminates the need for separate duration predictors (like in FastSpeech) and prevents transcript hallucination (skipped/inserted words).

### 3.3 Flow Matching Decoder

TADA uses a flow matching model to decode acoustic vectors into waveforms:

| FM Steps | RTF | LLM Time (ms) | FM Time (ms) | CER | Speaker Sim | oMOS |
|----------|-----|---------------|-------------|-----|-------------|------|
| 2 | 0.05 | 31 | 8 | 1.82 | 77.3 | 2.82 |
| 4 | 0.05 | 31 | 14 | 0.85 | 79.1 | 2.96 |
| **10** | **0.09** | **31** | **23** | **0.55** | **80.2** | **3.11** |
| 20 | 0.13 | 31 | 54 | 0.63 | 79.8 | 3.11 |

Default: 10 steps. LLM inference is only 31ms — the flow matching decode is the variable cost.
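The table is effectively a quality/latency dial. A small table-driven helper — the numbers are copied from the table above, but the selection logic is our own sketch, not part of TADA's API:

```python
# Pick the fewest flow-matching steps that meet a quality floor.
# Profile values copied from the FM-steps table above; the helper
# itself is a sketch, not TADA's API.
FM_PROFILES = [
    # (steps, rtf, fm_ms, cer, omos)
    (2, 0.05, 8, 1.82, 2.82),
    (4, 0.05, 14, 0.85, 2.96),
    (10, 0.09, 23, 0.55, 3.11),
    (20, 0.13, 54, 0.63, 3.11),
]

def pick_fm_steps(max_cer: float = 1.0, min_omos: float = 2.9) -> int:
    for steps, rtf, fm_ms, cer, omos in FM_PROFILES:
        if cer <= max_cer and omos >= min_omos:
            return steps
    return 10  # fall back to the documented default
```

With the defaults this lands on 4 steps; tightening the CER ceiling to 0.6 selects the 10-step default.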

## 4. Benchmarks vs Competing TTS Systems

### 4.1 Voice Cloning Quality (SeedTTS-Eval)

| Model | VRAM | RTF | Training Hours | CER | Speaker Sim | oMOS |
|-------|------|-----|----------------|-----|-------------|------|
| XTTS v2 | 2.1 GB | 0.19 | 30K | 2.10 | 70.3 | 2.67 |
| Index-TTS2 | 10.1 GB | 0.58 | 55K | **0.31** | **79.8** | 2.95 |
| Higgs Audio v2 | 16.3 GB | 0.44 | 10M | 9.57 | 75.3 | **2.98** |
| VibeVoice 1.5B | 5.2 GB | 0.51 | 0.5–1.6M | 3.07 | 72.3 | 2.92 |
| FireRedTTS-2 | 12.2 GB | 0.76 | 1.1–1.4M | 0.81 | 75.1 | 2.96 |
| **TADA-1B** | **17.4 GB** | **0.09** | 270K | 0.73 | 77.9 | 2.79 |
| **TADA-3B-ml** | **26.1 GB** | **0.13** | 900K | 0.76 | 75.1 | 2.85 |

### 4.2 Clean Speech Quality (LibriTTS-Clean)

| Model | CER | Speaker Sim | oMOS |
|-------|-----|-------------|------|
| XTTS v2 | 0.59 | 74.0 | 3.17 |
| Index-TTS2 | **0.23** | **83.3** | **3.34** |
| Higgs Audio v2 | 1.88 | 79.7 | 3.28 |
| VibeVoice 1.5B | 1.25 | 79.5 | 3.24 |
| FireRedTTS-2 | 1.44 | 81.2 | 3.28 |
| **TADA-1B** | 0.55 | 80.2 | 3.11 |
| **TADA-3B-ml** | **0.40** | 79.9 | 3.17 |

### 4.3 Long-Form Expressive (EARS Dataset — Human Evaluation)

| Model | CER | Speaker Sim | oMOS | Subj. Speaker Sim | Subj. MOS |
|-------|-----|-------------|------|--------------------|-----------|
| Index-TTS | 1.90 | 76.9 | 2.84 | **4.25** | 3.61 |
| VibeVoice 1.5B | 2.51 | 73.3 | 2.54 | 3.92 | **3.91** |
| FireRedTTS-2 | 21.6 | 73.8 | 2.84 | 3.98 | 3.58 |
| **TADA-3B + Online RS** | 2.74 | 74.7 | 2.84 | 4.18 | 3.78 |

### 4.4 Hallucination Rate (Critical Metric)

| Model | Hallucinated Samples (CER > 0.15) |
|-------|-----------------------------------|
| **TADA** | **0** |
| VibeVoice 1.5B | 17 |
| Higgs Audio V2 | 24 |
| FireRedTTS-2 | 41 |

TADA's 1:1 alignment makes content hallucination structurally impossible.
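The metric in the table can be reproduced directly: a sample counts as hallucinated when its character error rate (CER) against the reference transcript exceeds 0.15. The edit-distance CER below is the standard definition, not TADA-specific code:

```python
# Standard character error rate: Levenshtein distance over reference length.
def char_error_rate(ref: str, hyp: str) -> float:
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

def hallucinated_samples(pairs, threshold: float = 0.15) -> int:
    """Count (reference, hypothesis) pairs whose CER exceeds the threshold."""
    return sum(char_error_rate(r, h) > threshold for r, h in pairs)
```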

## 5. TADA vs Kokoro — Head-to-Head for Annie Voice

### 5.1 The Numbers That Matter

| Metric | Kokoro v0.19 (Current) | TADA-1B | TADA-3B-ml |
|--------|----------------------|---------|-----------|
| **Parameters** | **82M** | 2B (24x larger) | 4B (49x larger) |
| **VRAM** | **0.5 GB** | 17.4 GB (35x more) | 26.1 GB (52x more) |
| **Latency (per chunk)** | **~30ms** | ~54ms (LLM 31ms + FM 23ms) | ~44ms (est.) |
| **RTF** | **~0.04** (est.) | 0.09 | 0.13 |
| **TTS Arena ELO** | **1059** (1st place) | Not ranked | Not ranked |
| **Sample rate** | 24 kHz | Not specified (likely 24 kHz) | Not specified (likely 24 kHz) |
| **Pipecat integration** | **Yes (custom TTSService)** | **No** | **No** |
| **Streaming** | **Yes (chunk-by-chunk)** | **No (batch)** | **No (batch)** |
| **Voice cloning** | No (preset voices) | Yes (reference audio) | Yes (reference audio) |
| **Multilingual** | English only | English only | 10 languages |
| **Hallucination** | Low (82M = no LLM drift) | **Zero** | **Zero** |
| **License** | Apache 2.0 | MIT | MIT |
| **Blackwell SM_121** | **Working (with patch)** | **Unknown** | **Unknown** |

### 5.2 VRAM Impact on DGX Spark

Current idle stack: **19 GB** (audio pipeline + llama-server + SER + Kokoro + Nemotron STT).

| Scenario | With Kokoro (0.5 GB) | With TADA-1B (17.4 GB) | With TADA-3B (26.1 GB) |
|----------|---------------------|------------------------|------------------------|
| Idle stack | 19 GB | **35.9 GB** (+16.9 GB) | **44.6 GB** (+25.6 GB) |
| + Extraction (27B) | 59 GB | **75.9 GB** | **84.6 GB** |
| + Embeddings (8B) | 73 GB | **89.9 GB** | **98.6 GB** |
| Peak (all models) | 105 GB | **121.9 GB !!** | **130.6 GB !!** |

**TADA-1B pushes peak to 121.9 GB — exceeding the 110 GB safety limit.**
**TADA-3B-ml pushes peak to 130.6 GB — exceeding the 128 GB physical limit.**
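The peak numbers above reduce to one constant plus the TTS model's footprint. A headroom check — the 104.5 GB non-TTS peak is implied by the table (19 GB idle stack minus Kokoro's 0.5 GB, plus 40 GB extraction, 14 GB embeddings, 32 GB remaining models); the 110 GB safety limit is this project's own budget:

```python
# VRAM headroom check for the peak scenario in the table above.
SAFETY_LIMIT_GB = 110.0
PHYSICAL_GB = 128.0
NON_TTS_PEAK_GB = 104.5  # everything except the TTS model at peak

def peak_with(tts_gb: float) -> float:
    return NON_TTS_PEAK_GB + tts_gb

for name, gb in [("Kokoro", 0.5), ("TADA-1B", 17.4), ("TADA-3B-ml", 26.1)]:
    peak = peak_with(gb)
    print(f"{name:11s} peak {peak:6.1f} GB  "
          f"safe={peak <= SAFETY_LIMIT_GB}  fits={peak <= PHYSICAL_GB}")
```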

### 5.3 Quality Comparison

Kokoro advantages:
- **TTS Arena ELO 1059** — ranked #1 among open-source TTS models
- **Naturalness**: Users consistently rate Kokoro as more natural-sounding
- **Consistency**: 82M parameter model = deterministic, no LLM sampling drift
- **Proven on Blackwell**: Running in production with `blackwell_patch.py`

TADA advantages:
- **Zero hallucinations** — structurally guaranteed by 1:1 alignment
- **Voice cloning** — can clone any voice from a ~10s reference
- **Long-form**: 700 seconds context vs Kokoro's per-sentence processing
- **Multilingual** (3B variant): 10 languages vs Kokoro's English-only

### 5.4 Real-Time Voice Pipeline Compatibility

Annie Voice uses a **Pipecat streaming pipeline**: `STT → LLM → TTS → WebRTC`. Each component must stream incrementally:

| Requirement | Kokoro | TADA |
|-------------|--------|------|
| Pipecat TTSService | `kokoro_tts.py` (working) | **Not implemented** |
| Chunk-by-chunk streaming | Yes (yields `TTSAudioRawFrame` per sentence) | **No (batch generation)** |
| First-byte latency | ~30ms | Unknown (~54ms+ batch) |
| Interruption handling | Works (Pipecat manages) | **N/A (no streaming)** |

**Building a Pipecat adapter for TADA** would require:
1. Custom `TTSService` subclass (like `kokoro_tts.py`)
2. Sentence-level batching (TADA can't stream mid-sentence)
3. Async generation to avoid blocking the pipeline
4. Blackwell SM_121 compatibility testing (TADA uses Llama, which should work, but flow matching decoder is untested)
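Requirements 2–3 above can be sketched framework-agnostically: a sentence-splitting async generator that wraps a blocking batch `synthesize()` call, roughly what a custom Pipecat `TTSService.run_tts()` would contain. All names here are hypothetical stand-ins, not TADA's or Pipecat's real API:

```python
# Batch-to-stream adapter sketch: split text into sentences, run each
# blocking batch synthesis in a worker thread, then yield fixed-size PCM
# chunks so downstream transport can start playback early.
import asyncio
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

async def stream_tts(text, synthesize, chunk_bytes=4800):
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        if not sentence:
            continue
        # Batch generation blocks, so push it off the event loop.
        pcm = await asyncio.to_thread(synthesize, sentence)
        for i in range(0, len(pcm), chunk_bytes):
            yield pcm[i:i + chunk_bytes]
```

In a real adapter each yielded chunk would be wrapped in a `TTSAudioRawFrame` before being pushed downstream; note first-byte latency is still bounded below by one full sentence of batch generation.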

## 6. Configuration Options

TADA's API is minimal compared to commercial TTS:

```python
import torchaudio

from tada.modules.encoder import Encoder
from tada.modules.tada import TadaForCausalLM

# Load encoder (shared across all TADA models)
encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder")

# Load model
model = TadaForCausalLM.from_pretrained("HumeAI/tada-1b")

# Voice cloning: provide reference audio + transcript
audio, sr = torchaudio.load("reference.wav")
prompt = encoder(audio, text=["Reference transcript here"], sample_rate=sr)

# Generate
output = model.generate(
    prompt=prompt,                # Voice reference (required)
    text="Text to synthesize",   # Target text
    num_extra_steps=50,          # For speech continuation mode
)

# Multilingual (3B only)
encoder = Encoder.from_pretrained(
    "HumeAI/tada-codec", subfolder="encoder",
    language="ja"  # Language-specific aligner
)
```

**Available controls:**
- Voice selection via reference audio (voice cloning)
- Language selection (3B variant: ar, ch, de, es, fr, it, ja, pl, pt)
- Flow matching steps (2–20, controls quality vs speed)
- `num_extra_steps` for speech continuation

**Missing controls (compared to Kokoro):**
- No preset voices — must provide reference audio
- No speed control parameter
- No emotion/expressiveness control
- No phoneme-level control
- No temperature/sampling parameters documented

## 7. Community Reception

From Hacker News discussion:
- Audio quality noted as having "subtle modulation" artifacts (harmonic/phase shifts)
- **Mac/MPS not working** — "script just hangs" (NVIDIA-only)
- Content creators question emotional consistency across scenes
- Limited real-world deployment reports (very new)

## 8. Blockers for Annie Voice

| # | Blocker | Severity | Notes |
|---|---------|----------|-------|
| 1 | **VRAM: 17.4 GB** (1B) or **26.1 GB** (3B) vs 0.5 GB Kokoro | **CRITICAL** | Pushes peak above 110 GB safety limit |
| 2 | **No Pipecat integration** | **HIGH** | Would need custom TTSService + batch-to-stream adapter |
| 3 | **No streaming** — batch generation only | **HIGH** | Adds latency to first byte; breaks interruption handling |
| 4 | **Lower naturalness than Kokoro** | MEDIUM | oMOS 3.11 vs Kokoro ELO 1059 (different scales, but Kokoro is #1) |
| 5 | **Blackwell SM_121 untested** | MEDIUM | Flow matching decoder may need patching like Kokoro did |
| 6 | **No Kannada** (3B has 10 langs, no Indic) | LOW | Not currently needed for TTS, but future consideration |
| 7 | **No preset voices** — requires reference audio | LOW | Would need to create/maintain voice reference files |

## 9. What TADA Does Better

Despite being wrong for Annie Voice, TADA has genuinely novel properties:

1. **Zero hallucination guarantee** — The 1:1 alignment makes it impossible to skip/insert words. No other LLM-based TTS can claim this.
2. **Voice cloning** — 10-second reference audio → cloned voice. Kokoro only has preset voices.
3. **700-second context** — Could generate an entire podcast in one call. Kokoro processes sentence-by-sentence.
4. **Unified speech-language model** — Can generate both text AND speech simultaneously (speech continuation mode). Interesting for future "thinking out loud" features.
5. **Multilingual voice cloning** — Same voice, different language (3B variant).

## 10. Recommendation

| Use Case | Choice | Rationale |
|----------|--------|-----------|
| Annie Voice TTS | **Keep Kokoro** | 35x less VRAM, lower latency, Pipecat integrated, proven on Blackwell |
| Voice cloning feature | **Watch TADA** | If we ever need "speak as someone" — TADA is the best open option |
| Long-form narration | **Consider TADA** | Daily reflections, stories — batch generation is fine, voice cloning adds personality |
| Multilingual TTS | **Consider TADA-3B** | If Annie needs to speak 10 languages — but VRAM cost is extreme |

**Best upgrade path for Annie Voice TTS:** Wait for Kokoro updates (or Kokoro v2) that may add voice cloning. Alternatively, explore **Orpheus-TTS** (Llama-based, 3B, emotional speech) as a middle ground.

## 11. Benchmark Plan (If Desired)

If we want to benchmark TADA on Titan anyway, here's what we'd test:

1. **VRAM verification** — Confirm 17.4 GB measurement on DGX Spark/Blackwell
2. **Latency** — First-byte and total generation time for typical Annie sentences (10-30 words)
3. **Quality A/B** — Generate same sentences with Kokoro and TADA, compare naturalness
4. **Blackwell compatibility** — Test flow matching decoder on SM_121
5. **Voice cloning quality** — Clone Annie's voice from Kokoro reference, compare

**Estimated benchmark time:** ~30 min (model download + setup + tests)
**Risk:** 17.4 GB may not coexist with Ollama models during benchmark
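Benchmark items 1–2 need only a minimal timing harness. In this sketch `generate` is a placeholder for the actual TADA call from section 6 — this measures latency and derives RTF, nothing here is TADA's API:

```python
# Minimal timing harness: total generation time and real-time factor
# (RTF = generation time / audio duration) for one blocking call.
import time

def bench(generate, text: str, audio_seconds: float):
    """Return (total_ms, rtf) for one blocking synthesis call."""
    t0 = time.perf_counter()
    generate(text)
    total = time.perf_counter() - t0
    return total * 1000.0, total / audio_seconds

# Stand-in generator that "takes" ~10 ms, for a 1.5 s utterance:
total_ms, rtf = bench(lambda t: time.sleep(0.01), "Hello from Annie.", 1.5)
```

First-byte latency would need the streaming adapter from section 5.4 to be meaningful; with batch generation, first byte equals total time.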

## 12. Sources

- [Hume AI Blog: Opensourcing TADA](https://www.hume.ai/blog/opensource-tada)
- [TADA Paper (arXiv:2602.23068)](https://arxiv.org/abs/2602.23068)
- [TADA GitHub](https://github.com/HumeAI/tada)
- [TADA-1B on HuggingFace](https://huggingface.co/HumeAI/tada-1b)
- [TADA-3B-ml on HuggingFace](https://huggingface.co/HumeAI/tada-3b-ml)
- [TADA-Codec on HuggingFace](https://huggingface.co/HumeAI/tada-codec)
- [Hacker News Discussion](https://news.ycombinator.com/item?id=47332054)
- [Open Source For You: TADA Launch](https://www.opensourceforu.com/2026/03/hume-ai-launches-open-source-tada-tts-model-supporting-long-context-speech/)
- [TestingCatalog: TADA Release](https://www.testingcatalog.com/hume-ai-releases-its-first-open-source-tts-model-tada/)
