# Next Session: Voice Cloning TTS Research — English + Kannada with Samantha's Voice

## What

Find the best open-source TTS model that can clone Samantha's voice for **both English AND Kannada**. IndicF5 has already been tried: its Kannada is good, but its English is poor. Kokoro is great for English but has no voice cloning. We need a single model (or a proven pair) that handles both languages with speaker cloning from a 5-second reference clip.

## Context

- **Current stack**: Kokoro 82M (English, fast, no cloning) + IndicF5 400M (Kannada cloning, bad English)
- **Hardware**: Panda (DGX Spark, 16 GB VRAM, ARM/aarch64 Jetson) for TTS, Titan (128 GB H100) for LLM
- **Reference clip**: `~/.her-os/annie/voice-references/samantha_evolving.wav` (5s, 24kHz, clean Samantha dialogue from "Her")
- **IndicF5 server already running**: Panda:8771, FastAPI, asyncio.Semaphore GPU serialization, auth, streaming. Can be swapped to a different model with minimal code changes.
- **Latency budget**: <500 ms TTFB (time to first audio byte) for conversational voice (phone calls via Bluetooth HFP)
- **The code infrastructure is ready** — just need to find the right model to plug in

## Research Scope

### Models to Evaluate (prioritized)

**Tier 1 — Most likely candidates:**
1. **F5-TTS** (SWivid/F5-TTS) — The original that IndicF5 forked from. English should be much better. Check multilingual + Indian language support.
2. **CosyVoice 2** (FunAudioLLM/CosyVoice2-0.5B) — Alibaba, strong multilingual, zero-shot cloning. Check Kannada/Indian language coverage.
3. **XTTS v2** (coqui-ai/XTTS-v2) — Battle-tested voice cloning, 17 languages. Check if Indian languages included.
4. **Chatterbox** (ResembleAI/chatterbox) — ResembleAI's open-source, emotion control, voice cloning. Very recent (2025).
5. **Spark-TTS** (SparkAudio/Spark-TTS-0.5B) — Controllable, voice cloning, small model.

**Tier 2 — Worth checking:**
6. **Dia 1.6B** (Nari Labs) — Conversational focus, multi-speaker, emotion. Check cloning capability.
7. **CSM-1B** (Sesame/csm-1b) — Conversational Speech Model, watermarked. Check cloning.
8. **Orpheus-TTS** (Canopy Labs) — Llama-based, emotion tags. The 3B model may be too large for Panda.
9. **Mars5-TTS** (Camb-AI/mars5-tts) — Camb AI, voice cloning focus.
10. **MaskGCT** (Amphion) — Non-autoregressive, fast. Check quality.
11. **OuteTTS** (OuteAI/OuteTTS-0.3-1B) — Pure LLM approach, interesting architecture.
12. **Parler-TTS** (parler-tts/parler-tts-large-v1) — Text-described voice control, not reference-based cloning.

**Tier 3 — Check for recent developments:**
13. **Kokoro voice cloning** — Any forks/extensions adding cloning to Kokoro's architecture?
14. **Qwen-TTS / Qwen3-TTS** — Any Qwen audio-generation models?
15. **Fish Speech** (fishaudio/fish-speech) — Fast, multilingual, voice cloning.
16. **MetaVoice-1B** (metavoiceio) — from the MetaVoice startup, not Meta, despite the name; check whether the project is still maintained.
17. **Piper + voice training** — Can we fine-tune Piper on Samantha clips?
18. Any other 2025-2026 models found during research.

### Evaluation Criteria (for each model)

| Criterion | Weight | Notes |
|-----------|--------|-------|
| English quality | CRITICAL | Must be natural, warm, conversational |
| Voice cloning quality | CRITICAL | Must capture Samantha's timbre from 5s clip |
| Kannada support | HIGH | Native training data or proven multilingual transfer |
| VRAM fit on Panda (16 GB) | HIGH | Must fit alongside other Panda services (~9 GB free) |
| Inference speed (RTF) | HIGH | RTF < 0.5 (synthesis time ÷ audio duration) for real-time conversation |
| Streaming support | MEDIUM | Chunked output for low TTFB |
| ARM/aarch64 compatibility | MEDIUM | Panda is Jetson-based. Fallback: run on Titan |
| License | MEDIUM | Apache/MIT preferred, no commercial restrictions |
| Active maintenance | LOW | Recent commits, responsive issues |
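The two numeric criteria (RTF and TTFB) should be measured the same way for every candidate. A hypothetical helper, assuming each model wrapper can be driven as a generator of raw 16-bit PCM chunks (names and defaults are illustrative, not from the existing server):

```python
import time
from typing import Iterator, Tuple

def measure_tts(stream: Iterator[bytes], sample_rate: int = 24_000,
                bytes_per_sample: int = 2) -> Tuple[float, float]:
    """Return (ttfb_seconds, rtf) for one streaming synthesis call.

    `stream` is assumed to yield raw PCM chunks as they are produced.
    RTF = wall-clock synthesis time / duration of audio produced.
    """
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    for chunk in stream:
        if ttfb is None:
            ttfb = time.perf_counter() - start  # time to first chunk
        total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    return (ttfb if ttfb is not None else elapsed), elapsed / audio_seconds
```

A model clears the table's bar when `rtf < 0.5` and, per the latency budget, `ttfb < 0.5`; running the same English and Kannada sentences through every candidate keeps the numbers comparable.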

### Report Format (for each model)

```
### Model Name (org/repo)
- **Size**: params / VRAM
- **English quality**: X/5 (source: benchmarks, demos, user reports)
- **Voice cloning**: X/5 — method (reference audio / fine-tune / prompt)
- **Kannada/Indian langs**: YES/NO/UNTESTED — which languages trained on
- **Speed**: RTF on GPU (which GPU?)
- **Streaming**: YES/NO
- **ARM**: YES/NO/UNKNOWN
- **License**: ...
- **Fit on Panda**: YES/NO (with VRAM estimate)
- **Key insight**: one-line takeaway
- **Gotchas**: known issues
```

### Decision Output

After evaluating all models, produce:

1. **Recommendation table** — top 3 candidates ranked by overall fit
2. **Quick prototype plan** — for the top candidate: exact commands to download, load, and test synthesis with `samantha_evolving.wav` as reference, producing both English and Kannada output
3. **Architecture decision** — single model for both languages, or dual-model with language detection?
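If the answer to point 3 is dual-model, language detection for this particular pair is nearly free: Kannada script occupies its own Unicode block (U+0C80–U+0CFF), so a per-sentence character check can route between backends. A sketch (the backend names are placeholders):

```python
KANNADA_START, KANNADA_END = 0x0C80, 0x0CFF

def route(text: str) -> str:
    """Pick a TTS backend for one sentence.

    Any character in the Kannada Unicode block sends the whole
    sentence to the Kannada model; everything else defaults to
    English. Mixed sentences go to the Kannada model, which is
    the one that has seen both scripts.
    """
    if any(KANNADA_START <= ord(ch) <= KANNADA_END for ch in text):
        return "kannada"
    return "english"

print(route("Hello, Samantha"))  # -> english
print(route("ನಮಸ್ಕಾರ"))          # -> kannada
```

This keeps the router model-free and deterministic; the open question for the architecture decision is whether two models can share one cloned timbre, which the cloning-quality criterion above has to answer per language.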

## Start Command

```
cat docs/NEXT-SESSION-VOICE-CLONING-RESEARCH.md
```

Then research each model. Use web search, HuggingFace model cards, GitHub READMEs, and any benchmark papers. Write findings to `docs/RESEARCH-VOICE-CLONING-TTS-2026.md`.

## Key Constraints

- **No API-only models** — must be self-hosted (privacy requirement)
- **No SCP to Panda** — git for code, SSH pipe for data
- **IndicF5 server pattern is reusable** — FastAPI + semaphore + auth + health check. New model just replaces the `_synthesize_gpu()` function.
- **samantha_evolving.wav is the golden reference** — 5s, clean, confirmed good quality
