# Research: Voice Cloning TTS Models — English + Kannada (April 2026)

**Date:** 2026-04-06
**Status:** Complete. Recommendation: Chatterbox Turbo (English) + IndicF5 (Kannada) dual-TTS.
**Motivation:** IndicF5 English pronunciation was rejected in Session 454. Need a model (or proven pair) that handles both English AND Kannada with speaker cloning from a 5-second Samantha reference clip.

---

## Hardware Constraints

| Machine | GPU | VRAM | Role |
|---------|-----|------|------|
| Panda | RTX 5070 Ti | 16 GB | TTS inference hub |
| Titan | GB10 (DGX Spark) | 128 GB | LLM reasoning |

**VRAM budget on Panda:** ~9 GB free after other services (STT, IndicConformerASR, etc.)
**Latency budget:** <500ms TTFB for conversational voice (phone calls via BT HFP)
**Reference clip:** `~/.her-os/annie/voice-references/samantha_evolving.wav` (5s, 24kHz, clean Samantha dialogue from "Her")

---

## Evaluation Criteria

| Criterion | Weight | Notes |
|-----------|--------|-------|
| English quality | CRITICAL | Must be natural, warm, conversational |
| Voice cloning quality | CRITICAL | Must capture Samantha's timbre from 5s clip |
| Kannada support | HIGH | Native training data or proven multilingual transfer |
| VRAM fit on Panda (16 GB) | HIGH | Must fit alongside other Panda services (~9 GB free) |
| Inference speed (RTF) | HIGH | <0.5 RTF for real-time conversation |
| Streaming support | MEDIUM | Chunked output for low TTFB |
| License | MEDIUM | Apache/MIT preferred, no commercial restrictions |
| Active maintenance | LOW | Recent commits, responsive issues |

---

## TIER 1 — Primary Candidates

### 1. F5-TTS (SWivid/F5-TTS)
- **Size**: 336M params / 6-8 GB VRAM
- **English quality**: 4.5/5 — MOS 4.1, among the best open-source for naturalness
- **Voice cloning**: 4.5/5 — zero-shot from 3-10s reference audio (5s is the sweet spot)
- **Kannada/Indian langs**: NO (base model). IndicF5 fork adds 11 Indian langs but **breaks English** (confirmed in our production testing)
- **Speed**: RTF 0.15 (base, 32 NFE), RTF 0.03 (Fast F5-TTS EPSS 7-step on RTX 3090)
- **Streaming**: Partial — batch chunks only, no native streaming API. Community has requested it (Issue #700)
- **License**: Code = MIT, **models = CC-BY-NC** (Emilia dataset restriction)
- **Fit on Panda**: YES (6-8 GB)
- **Key insight**: Great English + cloning, but no Indian langs. IndicF5 fork trades English quality for Indian langs — a fundamental tradeoff in the training data
- **Gotchas**: Long text hallucinations at >8-10s, noisy reference causes artifacts, CC-BY-NC on model weights
- **Links**: [GitHub](https://github.com/SWivid/F5-TTS), [Paper](https://arxiv.org/abs/2410.06885), [Fast F5-TTS EPSS](https://arxiv.org/html/2505.19931v1)

### 2. CosyVoice 2 (FunAudioLLM/CosyVoice2-0.5B)
- **Size**: 0.5B params / ~8 GB VRAM
- **English quality**: 4/5 — good streaming quality, Alibaba's production model
- **Voice cloning**: 4/5 — zero-shot cross-lingual cloning
- **Kannada/Indian langs**: **NO. No Indian languages at all.** Not even Hindi. CosyVoice 3 has the same gap. Languages: Chinese, English, Japanese, Cantonese, Korean
- **Speed**: RTF ~0.7 unoptimized, 150ms first-packet streaming
- **Streaming**: YES — native streaming, best streaming architecture among all models evaluated
- **License**: Apache-2.0 (best commercial license)
- **Fit on Panda**: YES (~8 GB)
- **Key insight**: Best streaming architecture and Apache license, but zero Indian language support is a fatal blocker
- **Gotchas**: Complex setup (requires FlowMatching, LLM, and vocoder components), v1 streaming had quality issues
- **Links**: [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B), [GitHub](https://github.com/FunAudioLLM/CosyVoice)

### 3. XTTS v2 (coqui/XTTS-v2)
- **Size**: ~467M params / 2-6 GB VRAM
- **English quality**: 4/5 — still decent but dated (released 2023)
- **Voice cloning**: 4/5 — zero-shot from 6s reference, 17 languages
- **Kannada/Indian langs**: **NO. Hindi only** among Indian languages
- **Speed**: RTF ~0.25-0.3
- **Streaming**: YES (native)
- **License**: **CPML (non-commercial). Coqui shut down late 2024. No commercial license path exists anymore.**
- **Fit on Panda**: YES (2-6 GB, lightest of Tier 1)
- **Key insight**: Dead project — Coqui is gone, no maintenance, permanently non-commercial license
- **Gotchas**: End-of-sentence hallucinations, RAM bloat with long inference, no future updates
- **Links**: [HuggingFace](https://huggingface.co/coqui/XTTS-v2), [GitHub (archived)](https://github.com/coqui-ai/TTS)

### 4. Chatterbox (ResembleAI/chatterbox) ★ ENGLISH RECOMMENDATION
- **Size**: Turbo variant: 350M params / ~4.5 GB VRAM
- **English quality**: 4.5/5 — **beats ElevenLabs in blind evaluations** (63.75% preference). #1 trending TTS on HuggingFace at time of research
- **Voice cloning**: 4.5/5 — zero-shot from seconds of reference audio. First open-source model with emotion control via `exaggeration` parameter
- **Kannada/Indian langs**: **NO Kannada. Hindi only** among Indian languages (23 total languages, Hindi quality rated "very good")
- **Speed**: Sub-200ms inference, up to 6x real-time on GPU. RTF ~0.5 streaming on RTX 4090
- **Streaming**: YES — community server available, Turbo variant designed for agent/conversational workflows
- **License**: **MIT** (fully permissive — best license of all models evaluated)
- **Fit on Panda**: YES (4.5 GB, leaves headroom for IndicF5)
- **Key insight**: Best English quality + MIT license + emotion control + voice cloning. No Kannada kills it as a single-model solution, but perfect for the English side of a dual-TTS stack
- **Gotchas**: Perth watermarking baked into output audio, no fine-tuning scripts released yet, non-English quality can degrade for languages not in training set
- **Links**: [GitHub](https://github.com/resemble-ai/chatterbox), [HuggingFace](https://huggingface.co/ResembleAI/chatterbox)

### 5. Spark-TTS (SparkAudio/Spark-TTS-0.5B)
- **Size**: 0.5B params / ~4 GB VRAM
- **English quality**: 3.5/5 — reasonable but below F5-TTS and Chatterbox
- **Voice cloning**: 3.5/5 — zero-shot + controllable parameters (pitch, rate, gender)
- **Kannada/Indian langs**: **NO. Chinese + English only.** Confirmed in GitHub issues
- **Speed**: RTF 0.07 (with TensorRT-LLM optimization), 25-30s without optimization
- **Streaming**: NO (no streaming support)
- **License**: **CC-BY-NC-SA** (non-commercial, ShareAlike)
- **Fit on Panda**: YES (~4 GB)
- **Key insight**: Interesting controllability (pitch/rate sliders) but no Indian langs, non-commercial, no streaming
- **Gotchas**: Max 150 chars per inference call, device auto-select fails, immature project
- **Links**: [GitHub](https://github.com/SparkAudio/Spark-TTS), [HuggingFace](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)

---

## TIER 2 — Worth Checking

### 6. Dia 1.6B (nari-labs/Dia-1.6B)
- **Size**: 1.6B params / ~6-8 GB VRAM
- **English quality**: 4/5 — excellent multi-speaker dialogue generation
- **Voice cloning**: 3/5 — context-based (not reference-audio cloning like F5/Chatterbox)
- **Kannada/Indian langs**: NO — English only
- **Speed**: Moderate
- **Streaming**: Partial
- **License**: Apache 2.0
- **Fit on Panda**: YES
- **Key insight**: Best for dialogue/multi-speaker scenarios, but no voice cloning from reference audio and no Indian languages
- **Links**: [GitHub](https://github.com/nari-labs/dia)

### 7. CSM-1B (Sesame/csm-1b)
- **Size**: 1B params / ~4 GB VRAM
- **English quality**: 3.5/5 — lightweight, decent for conversational
- **Voice cloning**: 2/5 — context-based only, weakest cloning of all models evaluated
- **Kannada/Indian langs**: NO — English only
- **Speed**: Fast
- **Streaming**: NO
- **License**: Apache 2.0
- **Fit on Panda**: YES
- **Key insight**: Lightweight conversational model but weakest cloning capability
- **Links**: [GitHub](https://github.com/SesameAILabs/csm)

### 8. Orpheus-TTS 3B (canopyai/orpheus-tts)
- **Size**: 3B params / ~8-12 GB VRAM
- **English quality**: 4/5 — good with excellent emotion control via tags (`<laugh>`, `<sigh>`, `<gasp>`)
- **Voice cloning**: 3.5/5
- **Kannada/Indian langs**: Hindi only, no Kannada
- **Speed**: Moderate (LLM-based, autoregressive)
- **Streaming**: YES — LLM-based architecture makes streaming natural
- **License**: Apache 2.0
- **Fit on Panda**: TIGHT — 3B at fp16 needs ~8-12 GB, leaves little headroom alongside IndicF5
- **Key insight**: Best emotion control of any model (inline tags), but too large for Panda dual-model setup. 1B variant announced but not yet released
- **Gotchas**: 3B is memory-hungry, Llama-based so needs llama.cpp or vLLM serving
- **Links**: [GitHub](https://github.com/canopyai/Orpheus-TTS)

### 9. Mars5-TTS (Camb-AI/mars5-tts) — SKIP
- **License**: AGPL (restrictive)
- **Status**: Abandoned (last commit July 2024)
- **English only**, no Indian languages
- **Key insight**: Dead project with restrictive license. Not worth evaluating.
- **Links**: [GitHub](https://github.com/Camb-ai/MARS5-TTS)

### 10. MaskGCT (Amphion)
- **Size**: ~400M params / 14+ GB VRAM (multiple components)
- **English quality**: 4/5 — strong academic pedigree (ICLR 2025 paper)
- **Voice cloning**: 4/5 — non-autoregressive approach, fast generation
- **Kannada/Indian langs**: NO — English + Chinese only
- **Speed**: Fast inference (non-autoregressive) but high VRAM due to multi-component architecture
- **Streaming**: NO
- **License**: MIT
- **Fit on Panda**: NO — 14+ GB VRAM exceeds budget
- **Key insight**: Interesting non-autoregressive approach but VRAM-hungry and no Indian languages
- **Links**: [Paper](https://arxiv.org/abs/2409.00750), [GitHub](https://github.com/open-mmlab/Amphion)

### 11. OuteTTS 1.0 (OuteAI/OuteTTS-1.0-1B)
- **Size**: 0.6B-1B params / ~4-6 GB VRAM
- **English quality**: 3.5/5
- **Voice cloning**: 3.5/5 — LLM-based audio token approach (interesting architecture)
- **Kannada/Indian langs**: Tamil and Bengali (but NOT Kannada) in the 1B model. Closest to Kannada support among non-IndicF5 models
- **Speed**: Moderate
- **Streaming**: Partial
- **License**: Apache 2.0 (0.6B variant)
- **Fit on Panda**: YES
- **Key insight**: Pure LLM approach to TTS (extends any text LLM with audio tokens). Has Tamil/Bengali but not Kannada. Interesting architecture but not yet competitive on quality
- **Links**: [Blog](https://outeai.com/blog/outetts-1-0-release), [GitHub](https://github.com/OuteAI/OuteTTS)

---

## TIER 3 — Check for Recent Developments

### 12. Fish Speech S2 (fishaudio/fish-speech) ★ ONLY DUAL-LANGUAGE MODEL
- **Size**: 4.4B params / ~8-16 GB VRAM (needs INT4 quantization for 16 GB)
- **English quality**: 4.5/5 — beats Seed-TTS on Audio Turing Test
- **Voice cloning**: 4.5/5 — excellent zero-shot from short reference
- **Kannada/Indian langs**: **YES — 80+ languages including Kannada.** The ONLY model besides IndicF5 that natively supports Kannada
- **Speed**: Fast with quantization, streaming supported
- **Streaming**: YES
- **License**: **CC-BY-NC-SA (research-only for free use). Commercial license requires paid agreement.**
- **Fit on Panda**: TIGHT — needs INT4 quantization to fit, leaves no headroom for other models
- **Key insight**: The dream model — covers both English AND Kannada with cloning in a single model. **But the non-commercial license is a blocker.** If Fish Speech goes Apache/MIT, this becomes the instant choice
- **Gotchas**: License, 4.4B model size needs aggressive quantization, paid commercial license
- **Links**: [GitHub](https://github.com/fishaudio/fish-speech), [HuggingFace](https://huggingface.co/fishaudio)

### 13. KokoClone (Ashish-Patnaik/kokoclone)
- **Size**: <500M params / <2 GB VRAM (inherits Kokoro's lightweight footprint)
- **English quality**: 4/5 — same as Kokoro base
- **Voice cloning**: 3/5 — adds voice cloning layer on top of Kokoro's architecture
- **Kannada/Indian langs**: Hindi but NO Kannada
- **Speed**: Fast (inherits Kokoro speed)
- **Streaming**: YES (inherits Kokoro streaming)
- **License**: Apache 2.0
- **Fit on Panda**: YES (ultra-lightweight)
- **Key insight**: Closest to "just add cloning to Kokoro" — but no Kannada. Could be interesting if Kokoro adds Indian language support
- **Links**: [GitHub](https://github.com/Ashish-Patnaik/kokoclone)

### 14. MetaVoice — SKIP
- **Status**: Abandoned (last commit July 2024); development has not continued.
- **Key insight**: Dead project, do not evaluate.
- **Links**: [GitHub](https://github.com/metavoiceio/metavoice-src)

### 15. Voxtral TTS (mistralai/Voxtral-4B-TTS-2603)
- **Size**: 4B params / ~8-16 GB VRAM
- **English quality**: 4/5
- **Voice cloning**: 3.5/5
- **Kannada/Indian langs**: Hindi but NO Kannada
- **Speed**: 9.7× real-time generation (fastest throughput of any model in this evaluation)
- **Streaming**: YES
- **License**: CC-BY-NC (non-commercial)
- **Fit on Panda**: NO — 4B too large alongside other services
- **Key insight**: Mistral's TTS entry. Impressively fast but no Kannada, non-commercial, and too large for Panda
- **Links**: [HuggingFace](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)

### Bonus: Additional 2025-2026 Models Checked

| Model | Verdict | Why |
|-------|---------|-----|
| Higgs Audio V2 (boson-ai) | Skip | 3B, Apache 2.0, most expressive (speech+music+humming), but NO Indian languages |
| Qwen3-TTS (QwenLM) | Skip | 0.6B/1.7B, no Indian langs, torchaudio ARM64 blocker, already researched |
| GLM-TTS (Zhipu/zai-org) | Skip | Chinese-focused, no Indian langs |
| IndexTTS-2 (index-tts) | Skip | Chinese + English only |
| Indic Parler-TTS (ai4bharat) | Skip | Already rejected — 5-6s for 10 tokens, streaming broken, 880M params, too slow |
| Piper TTS | Skip | Already confirmed — no Kannada support |
| Gemma 4 E2B/E4B | Skip | Audio INPUT only, NO TTS output (text-only generation). Already researched in `RESEARCH-GEMMA4-E2B-E4B-AUDIO.md` |

---

## Summary: Kannada Support Landscape

**The Kannada gap is the defining constraint.** Only 2 of 17 models support Kannada:

| Kannada Support | Models |
|----------------|--------|
| **YES** (native training data) | IndicF5, Fish Speech S2 |
| Hindi only (closest Indian) | Chatterbox, XTTS v2, Orpheus, KokoClone, Voxtral |
| Some Indian (Tamil/Bengali) | OuteTTS 1.0 |
| No Indian langs at all | F5-TTS, CosyVoice 2, Spark-TTS, Dia, CSM, MaskGCT |

---

## Recommendation: Dual-TTS Architecture

### ★ RECOMMENDED: Chatterbox Turbo (English) + IndicF5 (Kannada)

| Aspect | English | Kannada |
|--------|---------|---------|
| **Model** | Chatterbox Turbo 350M | IndicF5 400M |
| **Quality** | 4.5/5 (beats ElevenLabs) | 4/5 (good Kannada) |
| **Voice cloning** | YES (Samantha ref clip) | YES (Samantha ref clip) |
| **Emotion control** | YES (`exaggeration` param) | NO |
| **VRAM** | ~4.5 GB | ~1.7-6 GB |
| **License** | MIT | MIT (code) |

**Combined VRAM:** ~6-10.5 GB (fits Panda's 16 GB)

**Why this combo:**
- Chatterbox has the best English quality of any open-source model (beats ElevenLabs)
- MIT license — no commercial restrictions
- Voice cloning from 5s reference — Samantha identity preserved in both languages
- Emotion control in English (warmth, playfulness)
- Both models fit on Panda's 16 GB with headroom
- IndicF5 server infrastructure already exists — Chatterbox follows same pattern

### Alternative: Keep Current Stack (Kokoro + IndicF5)

If voice cloning in English isn't a priority:
- Kokoro 82M (0.5 GB) + IndicF5 400M (1.7-6 GB) = ~2-6.5 GB (lowest footprint)
- Kokoro has no cloning, so Annie sounds different between English and Kannada

### Future Watch: Fish Speech S2

If Fish Speech changes to Apache/MIT license:
- Single model for both English AND Kannada
- 4.4B params needs INT4 quantization
- Would simplify architecture from dual-model to single-model

---

## Architecture Decision: Language Detection Routing

```
User speaks → STT → detect language
  ├─ English detected  → Chatterbox Turbo (Samantha clone, emotion control)
  └─ Kannada detected  → IndicF5 (Samantha clone)
```

Both models clone from the same `samantha_evolving.wav` reference, so Annie maintains consistent Samantha voice identity across languages.

**Language detection**: Already available from STT output (Whisper auto-detects language; IndicConformerASR is per-language). A simple heuristic also works: if the input contains characters from the Kannada Unicode block (U+0C80-U+0CFF), route to IndicF5.
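The Unicode-range heuristic fits in a few lines. A minimal sketch (the function name and backend labels are illustrative, not existing her-os identifiers):

```python
def route_tts_backend(text: str) -> str:
    """Route to IndicF5 if the text contains any character in the
    Kannada Unicode block (U+0C80-U+0CFF); otherwise use the English backend."""
    if any("\u0c80" <= ch <= "\u0cff" for ch in text):
        return "indicf5"
    return "chatterbox"
```

Mixed-language input routes to IndicF5 as soon as a single Kannada character appears, which is the conservative choice for pronunciation.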

---

## Quick Prototype Plan (Chatterbox Turbo)

### Step 1: Install on Panda
```bash
ssh panda
cd ~/workplace/her/her-os
source .venv/bin/activate
pip install chatterbox-tts
```

### Step 2: Test voice cloning with Samantha reference
```python
from pathlib import Path

import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
# Expand "~" explicitly — the library won't do it for us
ref_audio = str(Path("~/.her-os/annie/voice-references/samantha_evolving.wav").expanduser())

wav = model.generate(
    "Hey, how's your day going? I was just thinking about you.",
    audio_prompt_path=ref_audio,
)
torchaudio.save("test_english_samantha.wav", wav, model.sr)
```

### Step 3: Create Chatterbox server (same pattern as indicf5_server.py)
- `chatterbox_server.py` — FastAPI, asyncio.Semaphore(1), auth, health check
- `chatterbox_tts.py` — Pipecat TTSService HTTP bridge
- Add `TTS_BACKEND=chatterbox` option to bot.py

### Step 4: Language-aware routing
- bot.py detects language from STT output
- Routes to Chatterbox (English) or IndicF5 (Kannada)
- Fallback: Kokoro if both are unreachable
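The routing-plus-fallback step might look like this (function name and backend labels are illustrative; `available` would be populated from health checks against each server):

```python
def pick_backend(stt_language: str, available: set) -> str:
    """Map the STT-detected language to a TTS backend, falling back to
    Kokoro when the preferred server is unreachable (per the plan above)."""
    preferred = "indicf5" if stt_language == "kn" else "chatterbox"
    if preferred in available:
        return preferred
    return "kokoro"
```

Note this deliberately skips cross-routing (e.g. Kannada text to Chatterbox) since Chatterbox has no Kannada; Kokoro at least degrades gracefully for English-ish output.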

---

## Key References

### Models
- [F5-TTS](https://github.com/SWivid/F5-TTS) — MIT code, CC-BY-NC weights
- [CosyVoice 2](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B) — Apache 2.0
- [XTTS v2](https://huggingface.co/coqui/XTTS-v2) — CPML (dead project)
- [Chatterbox](https://github.com/resemble-ai/chatterbox) — MIT
- [Spark-TTS](https://github.com/SparkAudio/Spark-TTS) — CC-BY-NC-SA
- [Fish Speech](https://github.com/fishaudio/fish-speech) — CC-BY-NC-SA
- [IndicF5](https://huggingface.co/ai4bharat/IndicF5) — MIT (our current Kannada model)
- [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) — Apache 2.0 (our current English model)

### Benchmarks & Comparisons
- [BentoML TTS Roundup](https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models)
- [SiliconFlow Voice Cloning Guide](https://www.siliconflow.com/articles/en/best-open-source-models-for-voice-cloning)
- [Fast F5-TTS EPSS Paper](https://arxiv.org/html/2505.19931v1)

### Prior her-os Research
- `docs/RESEARCH-INDICF5-TTS-OPTIMIZATION.md` — IndicF5 latency optimization (NFE, BF16, EPSS)
- `docs/RESEARCH-INDIAN-LANGUAGE-SPEECH.md` — Indian language STT/TTS landscape
- `docs/RESEARCH-GEMMA4-E2B-E4B-AUDIO.md` — Gemma 4 audio (input only, no TTS)
- `docs/NEXT-SESSION-SAMANTHA-VOICE-B.md` — Phase B deployment (IndicF5 + Kokoro dual stack)
- `docs/RESEARCH-QWEN35-OMNI-PLUS.md` — Qwen Omni (no open weights, wait)
