# Next Session: TTS Alternatives Research — Replacing Chatterbox 500M

**Created:** 2026-04-14 (session 101)
**Type:** Research session (no code — survey + recommend)
**Expected duration:** 1 session, ~30 min
**Blocks on:** Nothing (can run independently of hardware upgrade decisions)

---

## The ask

User wants to survey **alternatives to the current Chatterbox TTS (original 500M)** running on Panda for English phone call TTS. Specifically asked about "Voxtral 4B TTS 2603" as one example — investigate that, but also do a broader sweep of 2025-2026 TTS models.

Goal: produce a ranked shortlist of 3-5 realistic alternatives with explicit tradeoffs vs the incumbent Chatterbox 500M, so user can decide whether to swap.

---

## Context the next session needs (load-bearing)

### Current deployment (verified 2026-04-14)

- **Model:** ResembleAI/chatterbox (original, NOT Turbo)
- **Parameters:** 500M (T3 speech transformer + S3Gen audio decoder + voice encoder)
- **Runtime:** PyTorch CUDA BF16 via `chatterbox_server.py` (FastAPI, port 8772)
- **Live VRAM:** 3,730 MiB on RTX 5070 Ti
- **Reference audio:** `samantha_evolving.wav` (voice-cloned "Her"-inspired Samantha persona)
- **Tuning parameters:** `cfg_weight=0.3` (slow pace), `exaggeration=0.3` (calm), `temperature=0.6`
- **Latency:** ~500ms TTFB on GPU (community-benchmarked, not yet measured in her-os)
- **Use case:** Annie's English phone calls (primary), command responses, notification TTS
- **Highest-stakes user:** Mom — her phone calls require the Samantha voice identity + natural prosody
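For concreteness, the deployed tuning above can be sketched as a request payload for the FastAPI server on :8772. The endpoint path and field names here are assumptions (check `chatterbox_server.py` for the real API); the tuning values mirror the deployed configuration.

```python
# Sketch of a synthesis request to the Chatterbox server on Panda (:8772).
# Field names are assumed, not verified against chatterbox_server.py.
import json

PANDA_TTS_URL = "http://panda:8772/tts"  # hypothetical endpoint path

def build_tts_request(text: str) -> dict:
    """Build a synthesis payload with the deployed Samantha tuning."""
    return {
        "text": text,
        "reference_audio": "samantha_evolving.wav",  # voice-clone reference
        "cfg_weight": 0.3,      # slow pace
        "exaggeration": 0.3,    # calm delivery
        "temperature": 0.6,
    }

payload = build_tts_request("Hi Mom, it's Annie. How was your day?")
print(json.dumps(payload, indent=2))
```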

### Prior constraints from her-os

- **English-only is fine** — IndicF5 retired session 67, Kannada not needed
- **Quality > speed** in most cases — phone calls are tolerant of ~500ms TTFB
- **Voice cloning is REQUIRED** — must preserve the Samantha voice identity
- **Paralinguistic expressiveness is NICE TO HAVE** — `[laugh]`, `[sigh]`, etc. make conversation more natural
- **CPU viability is STRATEGICALLY VALUABLE** — frees 3.7 GB VRAM for NVFP4 E4B nav VLM (see `docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md`)
- **ONNX availability** is a big plus — enables CPU INT8 inference path on Ryzen 9 9900X3D (AVX-512 VNNI)

### What we already know about Chatterbox-Turbo (already researched — don't re-do)

Chatterbox-Turbo (350M, 1-step decoder) is the direct successor to original Chatterbox. Research already complete in `docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md`. Summary:
- **Gains:** native paralinguistic tags (9 of them), sub-200ms TTFB, ONNX + INT8 availability, CPU viable
- **Losses:** cfg_weight and exaggeration parameters removed, English-only (already OK for us), some distillation quality loss (subjective)

The broader survey this session does should **include Chatterbox-Turbo as the baseline alternative** but go beyond it.
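Since some candidates support bracketed paralinguistic tags and others don't, the survey should assume tagged and untagged script text will coexist. A minimal sketch of stripping tags so the same utterance can be sent to a model without native tag support (the tag list here is illustrative — the notes name `[laugh]`/`[sigh]`/`[cough]`; Turbo ships 9 in total):

```python
# Detect and strip bracketed paralinguistic tags from an utterance.
# The tag vocabulary below is a partial, assumed list.
import re

TAG_RE = re.compile(r"\[(laugh|sigh|cough|gasp|chuckle)\]")

def split_tags(text: str) -> tuple[str, list[str]]:
    """Return (plain_text, tags_found) for a tagged utterance."""
    tags = TAG_RE.findall(text)
    plain = TAG_RE.sub("", text)
    # collapse the double spaces left behind by removed tags
    return re.sub(r"\s{2,}", " ", plain).strip(), tags

plain, tags = split_tags("Oh [laugh] that's wonderful. [sigh] I miss you too.")
print(plain)  # "Oh that's wonderful. I miss you too."
print(tags)   # ['laugh', 'sigh']
```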

---

## Candidates to investigate

### User's explicit ask

1. **Voxtral 4B TTS 2603** (user's specific example) — ✅ **DEPLOYED on Titan (RTF 0.79×)**, but ❌ **CANNOT CLONE SAMANTHA**: encoder weights were withheld from the open-source release (confirmed by a Mistral org member on HF Discussion #17, 2026-03-27: "voice cloning feature is not included in the current release, and we don't yet have a timeline"). Self-hosted Voxtral is restricted to the 20 preset voices (neutral_female, casual_female, etc.). Not the right tool for the Samantha voice goal — drop it; keep Chatterbox or evaluate the next candidate.
   - **Full results:** `docs/BENCHMARK-VOXTRAL-TITAN-20260415.json`. 10/10 utterances synthesized, all HTTP 200, all 24 kHz mono PCM valid. Mean latency 4.4 s, RTF 0.79× (matches TrevorS's Burn-based DGX Spark benchmark).
   - **Verified runtime**: vllm-omni v0.18.0 (git tag) + vllm-openai:v0.18.0-aarch64-cu130 Docker base image. This is the only verified-working combo — HEAD vllm-omni + current vllm nightlies have API skew. Use paired tags, per the project's own Dockerfile.ci.
   - **Unblock chain** (gotchas we hit):
     - `SETUPTOOLS_SCM_PRETEND_VERSION=0.18.0` env var required (detached HEAD can't resolve tags)
     - Stage-configs YAML hardcodes `gpu_memory_utilization: 0.8`; CLI flag doesn't override; `sed -i 's/0.8/0.5/'` if tight on GPU memory
     - Install common.txt deps separately (`aenum pydub omegaconf diffusers accelerate==1.12.0 torchsde x-transformers einops cache-dit janus openai-whisper av soundfile resampy sox prettytable imageio`) before `pip install --no-deps -e .`
     - Mandatory flags: `--tokenizer-mode mistral --omni --stage-configs-path vllm_omni/model_executor/stage_configs/voxtral_tts.yaml`
   - **Samantha voice clone**: PARTIAL. Upload via `/v1/audio/voices` with `consent` field works (trim to 28s, 30s max). Synthesis with uploaded voice name fails with 400 "Unknown voice" — v0.18.0 bug, open PR #2790 fixes it. Inline `ref_audio` base64 crashes orchestrator. Cherry-pick PR #2790 or wait for next release to unblock voice-clone end-to-end.
   - **Next concrete steps:** (a) cherry-pick PR #2790 → run Samantha A/B vs Chatterbox, (b) test `/v1/audio/speech/stream` for TTFA to check <500 ms phone-call budget, (c) `docker commit` the working install as a persistent image to skip 45 s install cycle.
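The unblock chain above can be consolidated into one script, run inside the verified `vllm-openai:v0.18.0-aarch64-cu130` base image from a `vllm-omni` checkout at tag v0.18.0. The exact launcher command is an assumption; the env var, deps, sed patch, and flags come straight from the notes.

```shell
# Detached-HEAD checkouts can't resolve tags, so pin the version explicitly
export SETUPTOOLS_SCM_PRETEND_VERSION=0.18.0

# Install common.txt deps first, then the package itself with --no-deps
pip install aenum pydub omegaconf diffusers accelerate==1.12.0 torchsde \
    x-transformers einops cache-dit janus openai-whisper av soundfile \
    resampy sox prettytable imageio
pip install --no-deps -e .

# The stage config hardcodes gpu_memory_utilization: 0.8 and the CLI flag
# doesn't override it; patch the YAML directly if GPU memory is tight
sed -i 's/gpu_memory_utilization: 0.8/gpu_memory_utilization: 0.5/' \
    vllm_omni/model_executor/stage_configs/voxtral_tts.yaml

# Mandatory flags per the notes (serve entrypoint shape is an assumption)
vllm serve mistralai/Voxtral-4B-TTS-2603 \
    --tokenizer-mode mistral \
    --omni \
    --stage-configs-path vllm_omni/model_executor/stage_configs/voxtral_tts.yaml
```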

#### Historical analysis (for context — superseded by 2026-04-14 benchmark finding above)

- ✅ **VERIFIED REAL (session 102, 2026-04-14)**
   - Mistral released Voxtral TTS on **2026-03-26** (hence "2603"), completing their Voxtral speech stack (STT was July 2025)
   - HF: `mistralai/Voxtral-4B-TTS-2603` • Demo: `https://huggingface.co/spaces/mistralai/voxtral-tts-demo` • Paper: `https://mistral.ai/static/research/voxtral-tts.pdf`
   - **4B params, BF16, ≥16 GB GPU required** (tight fit on Panda 16.3 GB — would require stopping Chatterbox + something else)
   - **Cost: $0 self-hosted** (weights publicly downloadable, runs on Panda GPU free forever). Mistral's hosted API is $0.016/1K chars but optional — self-hosting skips it entirely
   - **License: CC BY-NC 4.0 (non-commercial only)** — restricts *use*, not cost. Fine for personal her-os use (Rajesh + Mom + family). Only an issue if her-os is ever commercialized. Chatterbox (MIT) has no such restriction
   - Zero-shot voice cloning from 3s reference, 9 languages (incl. English + Hindi), 20 preset voices, emotion-steering
   - Claims parity with ElevenLabs v3, beats ElevenLabs Flash v2.5 on naturalness at similar TTFA
   - Output: 24 kHz WAV/PCM/FLAC/MP3/AAC/Opus with streaming + batch
   - Pure-C impl exists: `mudler/voxtral-tts.c` — may enable CPU INT8 path (strategically important vs VRAM squeeze)
   - MLX 4-bit variant: `mlx-community/Voxtral-4B-TTS-2603-mlx-4bit` (Apple-only, not useful for Panda)
   - **Deployment options (verified session 102, 2026-04-14):**
     - **Option A — Panda:** Does NOT fit (needs ~10 GB, Panda has 3.7 GB free after stopping Chatterbox; would also require stopping nav VLM)
     - **Option B — Titan (DGX Spark, aarch64 Blackwell):** ✅ **59 GB free at peak load**, Voxtral needs ~10 GB (~15% of budget). BUT aarch64 support for `vLLM Omni >= 0.18.0` is UNVERIFIED (Mistral benchmarks are H200 x86). Fallbacks: HF transformers, build from source, or `mudler/voxtral-tts.c` pure C.
     - **Latency trade-off if on Titan:** +2-10 ms LAN hop BUT Voxtral streams (Chatterbox batches). Streaming first-chunk can beat Chatterbox's 500 ms TTFA. Phone-call budget: TTFA must be <300 ms — Mistral claims "low TTFA" on H200, unknown on aarch64.
     - **Side benefit of Titan hosting:** Frees 3.73 GB on Panda → unlocks NVFP4 E4B nav VLM path from session 101 (see `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md`)
   - **First concrete test (5 min):** `ssh titan && uv pip install vllm-omni` → resolves aarch64 question, unblocks Phase B
   - **Tier-1 shortlist candidate — VRAM fits on Titan, license is non-issue for personal use. Main unknowns: aarch64 vLLM Omni wheel + actual TTFA on DGX Spark**

### Fresh release-date snapshot (2026-04-15 primary-source check — session 106)

| Model | Latest version | Release date | Active (<30d commit)? | Status |
|---|---|---|---|---|
| Chatterbox (Resemble AI) | v0.1.2 tag; master `59bc590` | tag 2025-06-13; master 2026-03-26 | ✅ yes | **Already deployed**. No "Chatterbox-Turbo" variant exists upstream (folklore: earlier sessions applied that name to the next-gen variant; corrected here). |
| CosyVoice 2 (Alibaba) | v2.0 tag; master `ace7c47` | master 2026-03-16 | ⚠️ borderline (29d) | **DEFERRED** (session 107) — weights Apache-2.0 ✅ but `requirements.txt` pins `torch==2.3.1` + `tensorrt-cu12==10.13.3.9`, **incompatible with Blackwell SM_121**. Porting cost 1-8h; not worth the spend vs Chatterbox. See `docs/RESEARCH-COSYVOICE2.md`. |
| F5-TTS (SWivid) | **v1.1.18** | **2026-03-24** | ✅ yes | **REJECTED** (session 107) — code MIT ✅ but **weights CC-BY-NC-4.0** ❌ (Emilia training data contamination). Non-commercial block. See `docs/RESEARCH-F5-TTS.md`. |
| XTTS-v2 (Coqui) | v0.22.0 | 2023-12-12 | ❌ Coqui defunct 2024-08 | **Abandoned** — skip. Community fork `idiap/coqui-ai-TTS` is successor. |
| StyleTTS 2 | no tags; HEAD `5cedc71` | 2024-03-07 | ❌ dormant 2+ years | **Dormant** — skip. |

**Final recommendation (2026-04-15, post-F5-TTS, post-CosyVoice-2 primary-source checks — session 107)**: **KEEP CHATTERBOX**. 2026-04 TTS survey complete. No candidate passes both the license gate AND the Blackwell SM_121 infrastructure gate at acceptable install cost. Chatterbox (MIT, deployed, working on Panda :8772) retains primacy for Samantha voice-clone. See `docs/BENCHMARK-F5-TTS-TITAN-20260415.json` and `docs/BENCHMARK-COSYVOICE2-TITAN-20260415.json` for full verdicts.

### 2025-2026 TTS models worth surveying

**Large models (high quality, GPU-hungry):**
2. **Chatterbox-Turbo** (350M) — baseline alternative, already researched
3. **CosyVoice 2** (Alibaba, ~900M) — multilingual, streaming, voice cloning, strong quality
4. **F5-TTS** (2024) — non-autoregressive, very fast
5. **E2 TTS** (Microsoft) — flow-matching based, zero-shot cloning
6. **StyleTTS 2** — fast, high quality, voice cloning
7. **XTTS-v2** (Coqui) — multilingual, established
8. **Parler-TTS Large** — natural prosody, controllable via text description
9. **Fish Speech 1.5** (Fish Audio) — multilingual, voice cloning, popular
10. **GPT-SoVITS v2** — voice cloning, community-popular
11. **MeloTTS** — multilingual, CPU-friendly
12. **Llasa TTS** (2025) — Llama-based TTS
13. **Spark-TTS** (2025) — controllable generation
14. **IndexTTS-2** (ByteDance, 2026) — emotion + paralinguistic
15. **Orpheus TTS** (Canopy Labs, 2026) — ultra-natural
16. **Maya1** / **Sesame CSM-1B** (2025) — natural conversation
17. **Zonos** (2025) — fast voice cloning
18. **NaturalSpeech 3** (Microsoft) — research/demo-only mostly

**Lightweight / edge models (CPU-first, low latency):**
19. **Kokoro** (already deployed on Panda at 0.5 GB) — very lightweight, English-only
20. **Piper** (Rhasspy) — ONNX, CPU-only, very fast, lower quality
21. **OpenVoice v2** — fast voice cloning
22. **MetaVoice-1B** — balanced CPU/GPU

**Closed-source API (for quality reference, not deployment):**
23. **ElevenLabs** — closed API, highest quality reference
24. **Play.ht** — closed API
25. **OpenAI TTS HD** — closed API

---

## Evaluation criteria (must score each candidate)

| Dimension | What to check |
|-----------|---------------|
| **Voice cloning** | Zero-shot from ~10s audio? Quality of Samantha voice reproduction? |
| **Quality / naturalness** | MOS published? Demos? Comparison clips vs Chatterbox? |
| **Latency** | TTFB on GPU and on CPU, RTF (real-time factor), sustained throughput |
| **Paralinguistic tags** | Native support for `[laugh]`, `[sigh]`, `[cough]`, etc.? |
| **VRAM (GPU)** | How much GPU memory in FP16/BF16? Q4/INT8 options? |
| **CPU viability** | ONNX/ggml/MLX variants? Int8 support? Expected RTF on Ryzen 9 9900X3D? |
| **License** | Apache 2.0, MIT, CC, research-only, commercial restrictions? |
| **Language support** | English at minimum (we're English-only now) |
| **Streaming** | Can it stream audio chunks while generating, or only batch? |
| **Community & maintenance** | Active repo, recent commits, production users? |
| **Fit for her-os** | 1-5 ranking of overall suitability for Annie's phone calls with Mom |

---

## Required deliverables

Produce a new file: `docs/RESEARCH-TTS-ALTERNATIVES.md` with:

1. **TL;DR** table: top 5 candidates ranked, with 1-line pitch + critical tradeoff for each
2. **Full matrix** of all ~20 candidates scored on the criteria above
3. **Demo links** for each — Hugging Face Space, GitHub demos, YouTube comparisons
4. **Voice-cloning quality** assessment from published samples (since we can't actually run 20 models this session)
5. **India-specific pricing** for any closed-source/API options (per user memory `user_location_india.md`)
6. **Resolution of the Voxtral question** — does Voxtral TTS exist, what is "2603"?
7. **3-candidate shortlist** for Phase B (actual deployment benchmark on Panda) with justification
8. **Decision tree** for next-next-session: how to narrow the shortlist via listening tests

## Must NOT happen in this session

- Do NOT write benchmark scripts yet — that's Phase B after the shortlist is chosen
- Do NOT deploy any model to Panda yet — that comes after A/B listening tests
- Do NOT recommend a specific model without showing the tradeoff table — avoid hand-waving

## Must happen

- Listen to at least one demo per top-5 candidate (via HF Spaces or GitHub demos) and write 1-2 sentence impressions
- Cite sources for every latency / VRAM / licensing claim
- Use India-sourced pricing for any API/SaaS options (per user memory)
- Flag any candidates that require commercial license for production use (if any user-facing)
- Explicitly compare each shortlisted candidate to current Chatterbox 500M on the dimensions above

---

## Links to read first (in-repo context)

- `docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md` — current Chatterbox analysis + Turbo comparison
- `docs/RESOURCE-REGISTRY.md:74+` — Panda hardware, current voice pipeline, VRAM budget
- `docs/ARCHITECTURE-PANDA-VOICE-PIPELINE.md` — why BF16 (Vocos complex ops), EPSS 7-step technique
- `services/annie-voice/chatterbox_server.py` — current TTS server implementation
- `services/annie-voice/tts_backends.py` — TTS backend abstraction (where a new TTS would plug in)
- `services/annie-voice/phone_loop.py` — how TTS is used in phone calls
- MEMORY.md "Infrastructure decisions" — IndicF5 retirement rationale (English-only is OK)
- MEMORY.md "User is based in India" — pricing sources

---

## Output format of the research doc

```markdown
# Research — TTS Alternatives for Panda (2026 Survey)

**Date:** <next session>
**Status:** Complete, awaiting user decision on shortlist
**Driver:** User wants to evaluate alternatives to current Chatterbox 500M

## TL;DR — top 5 ranked shortlist

| Rank | Model | Size | CPU? | Voice clone? | Paralinguistic? | Quality vs Chatterbox | Swap verdict |
|------|-------|-----:|:----:|:---:|:---:|-----------------------|-------------|
| 1 | ... | ... | ... | ... | ... | ... | ... |
| ... |

## Full matrix

... (20 rows × 11 columns)

## Demo impressions

- Model X: "Listened to HF Space demo. Samantha reference produced..."

## Voxtral question resolved

...

## Shortlist for Phase B (deployment benchmark)

1. Candidate A: justification...
2. Candidate B: ...
3. Candidate C: ...

## Decision tree

Phase A (next session): pick 3 candidates from shortlist above for actual deployment test
Phase B: deploy each on Panda parallel server, A/B listening test
Phase C: narrow to 1 winner, full latency benchmark + concurrent-load test
Phase D: swap production if winner clearly beats Chatterbox
```

---

## Meta-notes for future Claude

- **India pricing only** — user memory `user_location_india.md` is load-bearing
- **English-only is accepted** — don't over-weight multilingual models
- **Voice cloning is non-negotiable** — models without zero-shot cloning are disqualified unless they offer superior voice training quality for <1 hour of reference audio
- **Samantha voice identity** is load-bearing — any model must preserve the "Her"-inspired calm, slow, warm voice
- **Don't let the paralinguistic tags drive the decision** — nice to have, not a requirement. Voice quality > expressiveness
- If the user says "just pick one and swap it" — DON'T. Always present tradeoffs and let them decide

---

## Expected next-next-session

After this research session produces the 3-candidate shortlist, next session:

**Phase B — Deployment benchmark on Panda**
1. Download each of 3 shortlist candidates
2. Write `services/annie-voice/{model}_server.py` for each on distinct ports
3. Generate test corpus: 10 utterances covering conversational range
4. Record each model's output to WAV
5. A/B listening test (user subjectively evaluates Samantha voice identity)
6. Latency measurement (TTFB, RTF)
7. VRAM measurement
8. **Decision:** keep Chatterbox, OR swap to winner
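For step 6, the two metrics are simple to pin down ahead of time. RTF here is generation time divided by audio duration, so RTF < 1.0 means faster than real time (matching the "RTF 0.79×" convention used in the Titan Voxtral benchmark); the audio-duration figure in the example is illustrative.

```python
# Latency metric definitions for the Phase B benchmark.
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: < 1.0 means faster than real time."""
    return generation_seconds / audio_seconds

def ttfb(request_sent: float, first_chunk_received: float) -> float:
    """Time to first byte/audio: delay before playback can start."""
    return first_chunk_received - request_sent

# e.g. 4.4 s to synthesize ~5.6 s of audio lands near the benchmarked 0.79x
print(round(rtf(4.4, 5.57), 2))  # 0.79
```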
