# TTS Models Research: Tier 2 & Tier 3 (Voice Cloning Focus)

**Date:** 2026-04-06
**Purpose:** Evaluate open-source TTS models for voice cloning quality, multilingual support (especially Kannada/Indian languages + English), and self-hosted deployment on 16 GB GPU (Panda).
**Context:** Annie currently uses Kokoro (English) + IndicF5 (Kannada). This research explores alternatives and complements.

---

## TIER 2 MODELS

---

### 6. Dia 1.6B (nari-labs/Dia-1.6B)

- **Size**: 1.6B params / ~10 GB VRAM (full precision)
- **English quality**: 4/5 — Ultra-realistic dialogue generation, natural non-verbal sounds (laughter, coughing). Competes with ElevenLabs on demo benchmarks. Multi-speaker dialogue in a single pass is the killer feature.
- **Voice cloning**: 3/5 — Zero-shot via audio prompt (provide reference audio + transcript). 5-second clip sufficient. No fine-tuning needed, but no voice consistency by default (must use audio prompt or fix seed).
- **Kannada/Indian langs**: NO — English only. No multilingual roadmap published.
- **Speed**: ~40 tok/s on an A4000 (86 tokens ≈ 1 second of audio), i.e. roughly 0.46x real-time. About 2x real-time on an RTX 4090. Slower on older GPUs.
- **Streaming**: NO — No native streaming support. Users have requested it in GitHub issues.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Needs ~10 GB VRAM. Quantized version planned for further reduction.
- **Key insight**: Best-in-class for multi-speaker dialogue generation. If you need two speakers talking naturally in one pass, nothing else comes close.
- **Gotchas**: (1) English only — hard blocker for Indian languages. (2) GPU-only, no CPU inference. (3) No streaming. (4) Voice inconsistency without audio prompt or seed fixing. (5) CUDA 12.6 required. (6) First run slow due to Descript Audio Codec download.
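The throughput figures above reduce to simple token arithmetic. A small sketch (the 86 tokens-per-audio-second figure is from Dia's documentation; the helper name is ours, not part of Dia's API):

```python
def realtime_speed(gen_tokens_per_sec: float, tokens_per_audio_sec: float = 86.0) -> float:
    """Seconds of audio produced per wall-clock second of generation.
    Values below 1.0 mean synthesis runs slower than real-time."""
    return gen_tokens_per_sec / tokens_per_audio_sec

# A4000 at ~40 tok/s -> ~0.46x real-time; "2x real-time" on an RTX 4090
# implies a generation rate of roughly 172 tok/s.
print(round(realtime_speed(40), 2))
```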

**Sources:**
- [GitHub](https://github.com/nari-labs/dia)
- [HuggingFace](https://huggingface.co/nari-labs/Dia-1.6B)
- [MarkTechPost](https://www.marktechpost.com/2025/04/22/open-source-tts-reaches-new-heights-nari-labs-releases-dia-a-1-6b-parameter-model-for-real-time-voice-cloning-and-expressive-speech-synthesis-on-consumer-device/)

---

### 7. CSM-1B (sesame/csm-1b)

- **Size**: 1B params / ~4 GB VRAM (per user testing)
- **English quality**: 3.5/5 — Conversational tone is the focus. Llama backbone + Mimi audio codec. Quality is decent but not top-tier for isolated sentences — shines in multi-turn dialogue context.
- **Voice cloning**: 2/5 — Not traditional cloning. Uses context-based speaker conditioning via `Segment` objects (provide prior audio segments with speaker IDs). No dedicated voice cloning pipeline. Community forks (isaiahbjork/csm-voice-cloning) add better cloning.
- **Kannada/Indian langs**: NO — English only. Docs explicitly state: "some capacity for non-English languages due to data contamination, but it likely won't do well."
- **Speed**: No official RTF published. Generation produces complete audio with configurable `max_audio_length_ms`.
- **Streaming**: NO — Generates complete audio outputs.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Only ~4 GB VRAM needed.
- **Key insight**: Lightweight and conversational, but weak on voice cloning and English-only. The "conversational context" approach (feeding prior turns) is architecturally interesting but limited in practice.
- **Gotchas**: (1) English only. (2) Base model not fine-tuned on any specific voice. (3) Voice cloning is indirect (context-based, not reference-audio-based). (4) Windows needs `triton-windows` instead of `triton`. (5) Works best with short phrases.
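The "conversational context" approach is easiest to see abstractly. A hypothetical sketch of the `Segment`-based conditioning pattern (field names modeled loosely on the repo's README; this is not a runnable CSM call):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    speaker: int          # stable speaker id across turns
    text: str             # transcript of the prior turn
    audio: List[float] = field(default_factory=list)  # samples for that turn

def build_context(history: List[Segment], max_turns: int = 4) -> List[Segment]:
    """CSM conditions generation on prior dialogue turns rather than on a
    single reference clip; keep only the most recent turns as context."""
    return history[-max_turns:]

history = [Segment(speaker=i % 2, text=f"turn {i}") for i in range(6)]
context = build_context(history)  # last 4 turns, alternating speakers
```

This is why cloning is "indirect": the voice emerges from prior turns tagged with a speaker id, not from a dedicated reference-audio pathway.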

**Sources:**
- [GitHub](https://github.com/SesameAILabs/csm)
- [HuggingFace](https://huggingface.co/sesame/csm-1b)
- [VRAM Issue](https://github.com/SesameAILabs/csm/issues/9)

---

### 8. Orpheus-TTS (canopyai/orpheus-3b-0.1-ft)

- **Size**: 3B params (primary) / ~15 GB VRAM full precision, ~8 GB with FP8 quantization. Planned variants: 1B, 400M, 150M.
- **English quality**: 4.5/5 — Trained on 100K+ hours of English. Exceptional prosody and emotion. Emotion tags (`<laugh>`, `<sigh>`, `<chuckle>`, `<cough>`, `<yawn>`, `<gasp>`) are a standout feature.
- **Voice cloning**: 3.5/5 — Zero-shot voice cloning supported. Pretrained variant can condition on text-speech pairs. Quality is good but not the primary focus (emotion/prosody is).
- **Kannada/Indian langs**: PARTIAL — Multilingual research release includes Hindi, Chinese, Korean, Spanish (7 language pairs). No Kannada. No Indian languages beyond Hindi.
- **Speed**: ~200ms streaming latency, reducible to ~100ms with input streaming. Streaming supported via vLLM backend.
- **Streaming**: YES — Native streaming via vLLM at 24kHz sample rate.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: TIGHT — 3B at full precision needs ~15 GB. With FP8 quantization + SNAC model, fits on 24 GB (RTX 3090). On 16 GB, would need aggressive quantization (IQ3_XS). The 1B variant (when released) would fit easily.
- **Key insight**: The emotion tag system is unique and powerful. If you want expressive, emotionally-controlled speech, Orpheus is the best open-source option. But 3B is too large for a 16 GB GPU running other services.
- **Gotchas**: (1) 3B model too large for 16 GB GPU with other services. (2) vLLM version sensitivity — v0.7.3 recommended (later versions buggy). (3) KV cache errors with PyPI version. (4) Smaller variants (1B/400M/150M) not yet released. (5) Hindi supported but not Kannada. (6) Occasional frame-skipping glitch in streaming.
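Since the tag set is fixed, a small guard keeps typos out of prompts. A sketch (only the tag names come from Orpheus' docs; the helper itself is ours):

```python
# Emotion tags documented for Orpheus-TTS.
ORPHEUS_EMOTION_TAGS = {"laugh", "sigh", "chuckle", "cough", "yawn", "gasp"}

def tag(name: str) -> str:
    """Return an inline emotion tag, rejecting names Orpheus won't recognize."""
    if name not in ORPHEUS_EMOTION_TAGS:
        raise ValueError(f"unknown Orpheus emotion tag: {name}")
    return f"<{name}>"

prompt = f"Oh no {tag('gasp')} ... that is actually funny {tag('chuckle')}"
```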

**Sources:**
- [GitHub](https://github.com/canopyai/Orpheus-TTS)
- [HuggingFace](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft)
- [VRAM Issue](https://github.com/canopyai/Orpheus-TTS/issues/9)
- [Multilingual Release](https://canopylabs.ai/releases/orpheus_can_speak_any_language)

---

### 9. Mars5-TTS (Camb-AI/MARS5-TTS)

- **Size**: ~1.2B total (AR: ~750M + NAR: ~450M) / ~6 GB VRAM estimated (fp16)
- **English quality**: 3/5 — Variable consistency. Developers themselves acknowledge quality/stability needs improvement. Good for sports commentary, anime, prosodically hard scenarios. Not best-in-class for general TTS.
- **Voice cloning**: 4/5 — This is the primary focus. Two modes: shallow clone (fast, no transcript) and deep clone (higher quality, needs reference transcript). Best with ~6 seconds of reference audio.
- **Kannada/Indian langs**: NO — Open-source version is English-only. Commercial platform (CAMB.AI Studio) supports 140+ languages. The open-source release does NOT include multilingual weights.
- **Speed**: No official RTF published. Speed optimization listed as an "area for improvement."
- **Streaming**: NO — Not mentioned in documentation.
- **License**: GNU AGPL 3.0 (IMPORTANT: copyleft, requires derivative works to be open-source)
- **Fit on 16 GB GPU**: YES — ~6 GB estimated. But MPS (Apple Silicon) not supported.
- **Key insight**: Focused on voice cloning quality, but English-only in open-source version. AGPL license is a concern for any proprietary integration. Last significant update was July 2024 — maintenance appears stale.
- **Gotchas**: (1) AGPL license — copyleft poison pill. (2) English-only open-source. (3) Variable output consistency. (4) Long-form generation not implemented. (5) Last commit July 2024 — likely abandoned. (6) Apple Silicon unsupported.

**Sources:**
- [GitHub](https://github.com/Camb-ai/MARS5-TTS)
- [HuggingFace](https://huggingface.co/CAMB-AI/MARS5-TTS)
- [VentureBeat](https://venturebeat.com/ai/exclusive-camb-takes-on-elevenlabs-with-open-voice-cloning-ai-model-mars5-offering-higher-realism-support-for-140-languages)

---

### 10. MaskGCT (amphion/MaskGCT)

- **Size**: ~1.3B total (T2S-Large: 695M + S2A: 353M + Semantic Codec: 44M + Acoustic Codec: 170M). ~10-16 GB VRAM (Gradio demo fails on 8 GB GPUs; 16 GB minimum per GitHub issues).
- **English quality**: 4/5 — ICLR 2025 paper. Outperforms VALL-E, VoiceBox, NaturalSpeech 3, VoiceCraft, XTTS-v2 on LibriSpeech/SeedTTS benchmarks. SMOS scores 4.27-4.33.
- **Voice cloning**: 4/5 — Zero-shot via speaker prompt. No fine-tuning needed. Strong benchmark numbers on speaker similarity (0.687-0.777).
- **Kannada/Indian langs**: NO — English and Mandarin Chinese only (50K hours each from Emilia dataset).
- **Speed**: Non-autoregressive = fixed inference steps regardless of speech length (25-50 steps for T2S). Approximately 2x real-time on RTX 4090 (per community reports). Significantly faster than autoregressive models for long utterances.
- **Streaming**: NO — Parallel generation means full output is produced at once.
- **License**: Not explicitly stated in repo. Part of Amphion project (MIT license for framework, but model weights may differ).
- **Fit on 16 GB GPU**: MAYBE — Reports of 8 GB being insufficient. Likely needs 10-14 GB with full pipeline. Tight on 16 GB with other services.
- **Key insight**: Best speed-quality tradeoff among non-autoregressive models. The parallel generation approach means inference time doesn't scale with output length. Academic pedigree (ICLR 2025) is strong.
- **Gotchas**: (1) English + Chinese only. (2) VRAM-hungry for the full pipeline (w2v-bert-2.0 + T2S + S2A + codec). (3) Complex multi-component setup. (4) No streaming (inherent to non-autoregressive design). (5) License ambiguity. (6) No active maintenance updates.

**Sources:**
- [GitHub](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)
- [HuggingFace](https://huggingface.co/amphion/MaskGCT)
- [arXiv Paper](https://arxiv.org/abs/2409.00750)
- [VRAM Issue](https://github.com/open-mmlab/Amphion/issues/328)

---

### 11. OuteTTS (OuteAI/OuteTTS-0.3-1B → now OuteTTS-1.0)

- **Size**: 0.3-1B: 1B params on OLMo-1B base. 1.0-0.6B: 600M on Qwen3-0.6B base. 1.0-1B: 1B on Llama. Estimated ~3-5 GB VRAM for 0.6B, ~5-7 GB for 1B.
- **English quality**: 3.5/5 — Improved significantly in v1.0 with new DAC encoder and automatic word alignment. 150 tokens/second audio encoding rate (up from 75).
- **Voice cloning**: 3.5/5 — Create speaker from 5-10 second reference audio. v1.0 has "noticeably more accurate voice reproduction." Save/load speaker profiles (JSON). Works across languages.
- **Kannada/Indian langs**: PARTIAL — 1.0-1B covers 23+ languages including Bengali, Tamil (moderate coverage). No Kannada. 1.0-0.6B covers 14 languages (no Indian languages).
- **Speed**: No published RTF. Token rate of 150 tokens/s suggests near-real-time on decent GPU. Best with 30-second generation batches.
- **Streaming**: NO — Not mentioned in documentation.
- **License**: 1.0-1B: CC-BY-NC-SA-4.0 (non-commercial). 1.0-0.6B: Apache 2.0 (commercial OK). 0.3-1B: CC-BY-NC-SA-4.0 (non-commercial).
- **Fit on 16 GB GPU**: YES — Both variants fit easily (3-7 GB estimated).
- **Key insight**: The pure-LLM architecture is elegant — it extends any existing LLM with TTS capability. The 0.6B Apache 2.0 variant is interesting for lightweight deployment. Tamil and Bengali (but not Kannada) in the 1B model.
- **Gotchas**: (1) No Kannada support. (2) 1B model is non-commercial license. (3) Repetition penalty must only apply to 64-token recent window — full context penalty breaks output. (4) 30-second generation window is limiting. (5) No streaming. (6) Quality below Kokoro/Dia for English.
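Gotcha (3) is worth spelling out: the repetition penalty must be restricted to a recent window. A minimal sketch of the windowed variant on plain floats (real samplers operate on logit tensors; the penalty value is illustrative):

```python
def windowed_repetition_penalty(logits, history, penalty=1.3, window=64):
    """Apply the classic repetition penalty, but only to token ids seen in
    the last `window` generated tokens. Penalizing the full context is
    exactly what breaks OuteTTS output."""
    out = list(logits)
    for tok in set(history[-window:]):
        # standard formulation: shrink positive logits, grow negative ones
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```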

**Sources:**
- [HuggingFace 0.3](https://huggingface.co/OuteAI/OuteTTS-0.3-1B)
- [OuteTTS 1.0 Blog](https://outeai.com/blog/outetts-1-0-release)
- [HuggingFace 1.0-1B](https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B)
- [HuggingFace 1.0-0.6B](https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B)

---

## TIER 3 MODELS

---

### 12. Fish Speech S2 (fishaudio/fish-speech)

- **Size**: 4B (Slow AR) + 400M (Fast AR) = ~4.4B total. ~17 GB VRAM at full precision. 24 GB recommended, 12 GB absolute minimum. With 4-bit NF4 quantization: ~10-11 GB. GGUF q8_0: ~7 GB.
- **English quality**: 4.5/5 — Top-tier. 0.515 posterior mean on Audio Turing Test (beats Seed-TTS 0.417 and MiniMax 0.387). 91.61% win rate on paralinguistics. Trained on 10M+ hours.
- **Voice cloning**: 4.5/5 — 10-30 second reference audio. Captures timbre, speaking style, emotional tendencies. Multi-speaker support with speaker ID tokens. No fine-tuning needed.
- **Kannada/Indian langs**: YES — Explicitly lists Kannada, Tamil, Telugu, Malayalam, Hindi, Bengali, Punjabi, Gujarati, Marathi, Assamese, Sinhala, Nepali, Urdu among 80+ languages. Quality tier is lower than English/Chinese/Japanese but present. Trained on global corpus.
- **Speed**: RTF 0.195 on H200. TTFA ~100ms. 3000+ acoustic tokens/s. Streaming via SGLang with continuous batching + CUDA Graph.
- **Streaming**: YES — Full streaming support with prefix caching, paged KV cache, continuous batching.
- **License**: FISH AUDIO RESEARCH LICENSE (non-commercial). Commercial use requires separate written license from Fish Audio.
- **Fit on 16 GB GPU**: BARELY — Full precision needs ~17 GB (won't fit). NF4 quantization: ~10-11 GB (fits, but slow). GGUF q8_0: ~7 GB (fits). Quality degrades with quantization.
- **Key insight**: The ONLY model in this entire survey that explicitly supports Kannada with voice cloning. 80+ languages, production-grade streaming, top-tier quality. The catch is the restrictive research-only license and large VRAM footprint.
- **Gotchas**: (1) Research-only license — commercial use requires paid license from Fish Audio. (2) 4.4B params is large for 16 GB GPU. (3) Quantization needed for 16 GB, with quality/speed tradeoff. (4) Kannada quality is untested in production (lower tier than English). (5) No community Kannada voice samples to reference.
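The VRAM figures above are runtime totals; the weights-only portion follows from bits-per-parameter arithmetic. A sketch (the q8_0/NF4 bit widths are approximate, since quantized formats carry scale metadata; the gap to the quoted totals is activations, KV cache, and the codec):

```python
def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Weights-only memory in GB; runtime VRAM adds KV cache, activations,
    and codec models on top of this figure."""
    return params_billions * bits_per_param / 8

for fmt, bits in [("fp16", 16.0), ("gguf q8_0", 8.5), ("nf4", 4.5)]:
    print(f"{fmt}: ~{weight_footprint_gb(4.4, bits):.1f} GB weights")
```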

**Sources:**
- [GitHub](https://github.com/fishaudio/fish-speech)
- [Fish Audio](https://fish.audio/)
- [S2 Technical Report](https://arxiv.org/html/2603.08823v1)
- [VRAM Discussion](https://huggingface.co/fishaudio/s2-pro/discussions/10)
- [License](https://github.com/fishaudio/fish-speech/blob/main/LICENSE)

---

### 13. Kokoro Voice Cloning Extensions

#### KokoClone (Ashish-Patnaik/kokoclone)

- **Size**: Kokoro-82M base + Kanade voice conversion model. Estimated ~0.5-1 GB VRAM total.
- **English quality**: 4/5 — Inherits Kokoro's quality (TTS Arena #2 behind ElevenLabs). Very low WER.
- **Voice cloning**: 3/5 — Two methods: (a) TTS + zero-shot voice transfer (3-10 second reference), (b) Audio-to-audio via Kanade voice conversion (no transcription needed). Chunked processing for long recordings with overlap smoothing.
- **Kannada/Indian langs**: PARTIAL — Supports Hindi (via Kokoro's language list). 8 languages total: English, Hindi, French, Japanese, Chinese, Italian, Portuguese, Spanish. No Kannada.
- **Speed**: Fast — Kokoro-ONNX is one of the fastest open-source TTS engines. GPU chunk sizing auto-calibrated to 50% of GPU memory.
- **Streaming**: NO — Not mentioned. Likely inherits Kokoro's non-streaming nature.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Very lightweight. Well under 2 GB VRAM.
- **Key insight**: Adds voice cloning to the already-excellent Kokoro engine with minimal VRAM overhead. Hindi support but no Kannada. The audio-to-audio conversion mode (Kanade) is interesting — re-voice any recording without transcription.
- **Gotchas**: (1) No Kannada. (2) Cloning quality is "transfer" not "true clone" — may not capture all voice characteristics. (3) RoPE ceiling caps chunks at ~8.9s. (4) Young project, limited testing.
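The chunk-with-overlap scheme is generic enough to sketch. A linear crossfade joining two chunks of float samples (illustrative only; KokoClone's actual smoothing may differ):

```python
def crossfade(a, b, overlap):
    """Join chunks a and b, blending the last `overlap` samples of a with
    the first `overlap` samples of b via a linear ramp. This hides the
    seam when long recordings are synthesized in chunks."""
    assert 0 <= overlap <= min(len(a), len(b))
    head = a[:len(a) - overlap]
    blended = [
        a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return head + blended + b[overlap:]
```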

#### KVoiceWalk (RobViren/kvoicewalk)

- Random walk algorithm + hybrid scoring to create new Kokoro voice style tensors matching target voices. More experimental, research-quality.

**Sources:**
- [KokoClone GitHub](https://github.com/Ashish-Patnaik/kokoclone)
- [KVoiceWalk GitHub](https://github.com/RobViren/kvoicewalk)

---

### 14. MetaVoice-1B (metavoiceio/metavoice-1B-v0.1)

- **Size**: 1.2B params / ~12 GB GPU RAM minimum
- **English quality**: 3.5/5 — Trained on 100K hours. Good emotional speech rhythm and tone in English. Not top-tier compared to newer models (Dia, Fish Speech).
- **Voice cloning**: 3.5/5 — Zero-shot for American & British voices with 30-second reference. Cross-lingual cloning with fine-tuning. Success with 1 minute training data for Indian speakers specifically noted.
- **Kannada/Indian langs**: PARTIAL — Cross-lingual cloning with fine-tuning has been tested on Indian speakers. No native Indian language training. English-focused.
- **Speed**: RTF < 1.0 on Ampere/Ada/Hopper GPUs (once compiled). int4 quantization is ~2x faster than bf16.
- **Streaming**: UNCLEAR — Not documented in current repo.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Needs ~12 GB, but tight with other services.
- **Key insight**: Early mention of Indian speaker fine-tuning is promising. But the project appears abandoned — last significant commit July 2024. Not recommended for new deployments.
- **Gotchas**: (1) Last commit July 2024 — effectively abandoned. (2) 12 GB minimum is tight on 16 GB GPU. (3) Cross-lingual Indian voice cloning requires fine-tuning (not zero-shot). (4) Not competitive with 2025-2026 models on quality. (5) ElevenLabs acquisition rumor was false, but project is still unmaintained.

**Sources:**
- [GitHub](https://github.com/metavoiceio/metavoice-src)
- [HuggingFace](https://huggingface.co/metavoiceio/metavoice-1B-v0.1)

---

### 15. Notable Additional Models (2025-2026)

---

#### 15a. Chatterbox (resemble-ai/chatterbox)

- **Size**: Original: size unspecified. Turbo: 350M params / ~4-8 GB VRAM.
- **English quality**: 4.5/5 — First open-source model with emotion exaggeration control (monotone to dramatic). Excellent naturalness.
- **Voice cloning**: 4/5 — 5-second reference audio. Zero-shot. Works across 23 languages.
- **Kannada/Indian langs**: PARTIAL — Hindi is in the 23-language list. No Kannada. Languages: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, Chinese.
- **Speed**: Sub-200ms latency (original). Turbo: RTF 0.499 on RTX 4090, first-chunk latency ~472ms.
- **Streaming**: YES — Community streaming implementations exist (davidbrowne17/chatterbox-streaming).
- **License**: MIT (very permissive!)
- **Fit on 16 GB GPU**: YES — Turbo (350M) easily. Original needs 8-16 GB.
- **Key insight**: MIT license + emotion control + 23 languages + voice cloning makes this extremely compelling. Perth neural watermarking is built-in (every generated audio is watermarked). Hindi but no Kannada.
- **Gotchas**: (1) No Kannada. (2) Watermarking is always-on (Perth watermarks survive MP3 compression). (3) Hindi quality untested in depth. (4) Turbo sacrifices some quality for speed.

**Sources:**
- [GitHub](https://github.com/resemble-ai/chatterbox)
- [Resemble AI](https://www.resemble.ai/chatterbox/)
- [Turbo](https://www.resemble.ai/chatterbox-turbo/)

---

#### 15b. Higgs Audio V2 (bosonai/higgs-audio-v2-generation-3B-base)

- **Size**: 3B base (Llama-3.2-3B) + 2.2B DualFFN audio adapter. Model weights ~5.8 GB fp16. 4-bit: ~8 GB VRAM. Full precision: ~24 GB.
- **English quality**: 4.5/5 — 75.7% win rate over gpt-4o-mini-tts on emotions. SOTA on Seed-TTS Eval and ESD benchmarks.
- **Voice cloning**: 4/5 — 3-10 second reference audio. Multi-speaker dialogue. Can hum melodies in cloned voice. Generate speech + background music simultaneously.
- **Kannada/Indian langs**: NO — 32 languages seen in pre-training but quality is best for en-US, zh-CN, es-ES. No Indian languages mentioned.
- **Speed**: 1.3x real-time on RTX 4090. Streaming mode experimental.
- **Streaming**: EXPERIMENTAL — Mentioned in roadmap, not production-ready.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: TIGHT — 4-bit inference at ~8 GB fits, but quality tradeoff. Full precision needs 24 GB.
- **Key insight**: Most expressive open-source TTS — can generate music + speech simultaneously, hum melodies in cloned voices. Li Mu's team (credible). But 3B is heavy and no Indian languages.
- **Gotchas**: (1) No Indian language support. (2) 3B params means 16 GB GPU is tight. (3) Streaming still experimental. (4) 24kHz output (not 44.1kHz — planned for V3). (5) Distilled smaller models promised but not yet delivered.

**Sources:**
- [GitHub](https://github.com/boson-ai/higgs-audio)
- [HuggingFace](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base)
- [Blog](https://www.boson.ai/blog/higgs-audio-v2)

---

#### 15c. Voxtral TTS (mistralai/Voxtral-4B-TTS-2603)

- **Size**: 4B params (3.4B Transformer decoder + components). ~8 GB weights (BF16). Needs 16 GB+ VRAM for inference.
- **English quality**: 4.5/5 — 68.4% win rate against ElevenLabs Flash v2.5 in human preference tests.
- **Voice cloning**: 4/5 — 5-25 second reference audio (accepts as little as 3 seconds). 20 built-in preset voices.
- **Kannada/Indian langs**: PARTIAL — Supports Hindi and Arabic among 9 languages (English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic). No Kannada.
- **Speed**: 70ms model latency. RTF ~9.7x (generates audio 9.7x faster than real-time). Low-latency streaming.
- **Streaming**: YES — Native low-latency streaming with ~90ms processing time.
- **License**: CC BY-NC 4.0 (NON-COMMERCIAL for self-hosted weights). Commercial use requires Mistral API.
- **Fit on 16 GB GPU**: BARELY — Needs 16 GB+ VRAM. Won't fit alongside other services. Quantized versions (~3 GB weights) available.
- **Key insight**: Mistral-quality engineering. Fastest RTF in this survey (9.7x). Hindi support. But CC BY-NC license and 4B size are blockers for the 16 GB Panda deployment.
- **Gotchas**: (1) CC BY-NC 4.0 — non-commercial only for self-hosting. (2) 4B too large for 16 GB GPU with other services. (3) No Kannada. (4) Very new (March 2026) — limited community testing. (5) Hindi quality not independently verified.

**Sources:**
- [HuggingFace](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
- [Mistral Blog](https://mistral.ai/news/voxtral-tts)
- [arXiv](https://arxiv.org/html/2603.25551v1)

---

#### 15d. Spark-TTS (SparkAudio/Spark-TTS-0.5B)

- **Size**: 0.5B params (Qwen2.5 base). ~4 GB+ VRAM minimum.
- **English quality**: 3.5/5 — Good for its tiny size. Eliminates need for flow matching by directly predicting audio codes from LLM.
- **Voice cloning**: 3.5/5 — Zero-shot with 5-10 second reference (15-30 second recommended). Cross-lingual and code-switching scenarios supported.
- **Kannada/Indian langs**: NO — Chinese and English only.
- **Speed**: RTF 0.0704 with TensorRT-LLM + Triton optimization. Extremely fast when optimized.
- **Streaming**: UNCLEAR — Not explicitly documented.
- **License**: Research/academic use stated. Not clearly Apache/MIT. Users told to ensure local law compliance.
- **Fit on 16 GB GPU**: YES — Only ~4 GB needed.
- **Key insight**: Tiny, fast, and the TensorRT-optimized RTF (0.07) is remarkable. But Chinese + English only, and the license is restrictive/ambiguous.
- **Gotchas**: (1) Chinese + English only. (2) License is unclear/restrictive. (3) Qwen2.5 base means Chinese-first design. (4) VoxBox training dataset (100K hours) is not publicly released for reproduction.

**Sources:**
- [GitHub](https://github.com/SparkAudio/Spark-TTS)
- [HuggingFace](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)

---

#### 15e. Qwen3-TTS (QwenLM/Qwen3-TTS)

- **Size**: 1.7B (full) / 0.6B (lightweight). VRAM not officially published but estimated ~6-10 GB for 1.7B, ~3-5 GB for 0.6B.
- **English quality**: 4/5 — Good quality with voice design capability (describe a voice in natural language to create it).
- **Voice cloning**: 4/5 — 3-second minimum (10-15 seconds recommended for best quality). Also supports voice design via natural language description.
- **Kannada/Indian langs**: NO — 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.
- **Speed**: RTF 3-5 on CPU, i.e. slower than real-time (GPU recommended). No official GPU RTF published.
- **Streaming**: YES — Supports streaming speech generation.
- **License**: Not explicitly stated in searches. Alibaba Cloud model.
- **Fit on 16 GB GPU**: YES — 0.6B variant easily. 1.7B fits too.
- **Key insight**: Voice design via natural language is a unique feature ("make a warm female voice with slight raspiness"). Good cloning. But no Indian languages at all.
- **Gotchas**: (1) No Indian languages. (2) Released January 2026 — still new. (3) RTF on GPU not well documented. (4) Alibaba Cloud licensing may have restrictions.

**Sources:**
- [GitHub](https://github.com/QwenLM/Qwen3-TTS)
- [Simon Willison](https://simonwillison.net/2026/Jan/22/qwen3-tts/)

---

#### 15f. GLM-TTS (zai-org/GLM-TTS)

- **Size**: Not published (Llama-based LLM + Flow Matching + Vocoder). ~8 GB VRAM, ~9 GB disk for weights.
- **English quality**: 3.5/5 — Secondary to Chinese. CER 0.89 (best among open-source, close to MiniMax's 0.83).
- **Voice cloning**: 4/5 — 3-second zero-shot cloning. Reinforcement learning (GRPO) optimizes cloning quality via similarity reward function.
- **Kannada/Indian langs**: NO — Chinese primary, English secondary. Mixed Chinese-English supported. Tokenizer optimized for Pinyin.
- **Speed**: Not published. Streaming inference supported.
- **Streaming**: YES — Native streaming inference.
- **License**: Open-source (specific license not found in searches). Zhipu AI.
- **Fit on 16 GB GPU**: YES — ~8 GB VRAM.
- **Key insight**: The RL-based optimization approach (GRPO with multi-reward: similarity, CER, emotion, laughter) is architecturally innovative. But Chinese-first, English-secondary, no Indian languages.
- **Gotchas**: (1) Chinese-first — English is secondary. (2) Pinyin-optimized tokenizer. (3) No Indian languages. (4) Zhipu AI model — license terms unclear for commercial use.

**Sources:**
- [GitHub](https://github.com/zai-org/GLM-TTS)
- [arXiv](https://arxiv.org/html/2512.14291v1)

---

#### 15g. IndexTTS-2 (IndexTeam/IndexTTS-2)

- **Size**: Not published (transformer + BigVGANv2 vocoder). ~8 GB VRAM estimated. RTX 3060 12 GB works but slow.
- **English quality**: 4/5 — "Most realistic and expressive" per community. Emotional expression breakthrough.
- **Voice cloning**: 4.5/5 — Zero-shot. Independent control of timbre and emotion (disentangled). Duration control for dubbing.
- **Kannada/Indian langs**: NO — Chinese, English, Japanese only.
- **Speed**: RTF 13.1 on RTX 3060 12 GB (very slow!). FP16 + DeepSpeed + CUDA kernels can improve. RTX 4090 likely much better.
- **Streaming**: NO — Batch generation.
- **License**: Open-source (specific terms not found).
- **Fit on 16 GB GPU**: YES — ~8 GB, but inference is slow.
- **Key insight**: Best emotion-timbre disentanglement — can keep the same voice but change emotion independently. Great for dubbing. But Chinese/English/Japanese only, and very slow on consumer GPUs.
- **Gotchas**: (1) Extremely slow on RTX 3060 (RTF 13.1). (2) Chinese + English + Japanese only. (3) No streaming. (4) License unclear. (5) DeepSpeed optimization needed for acceptable speed.

**Sources:**
- [GitHub](https://github.com/index-tts/index-tts)
- [HuggingFace](https://huggingface.co/IndexTeam/IndexTTS-2)

---

#### 15h. CosyVoice2-0.5B (FunAudioLLM/CosyVoice2-0.5B)

- **Size**: 0.5B params. VRAM not published but ~3-5 GB estimated for 0.5B.
- **English quality**: 4/5 — Alibaba's FunAudioLLM team. Good quality, low latency.
- **Voice cloning**: 4/5 — Zero-shot voice cloning. Cross-lingual synthesis.
- **Kannada/Indian langs**: NO — 9 languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) + 18+ Chinese dialects. No Indian languages.
- **Speed**: ~300ms latency for interactive use. 4x acceleration with TensorRT-LLM.
- **Streaming**: YES — Low-latency streaming synthesis.
- **License**: Open-source (specific license varies by component).
- **Fit on 16 GB GPU**: YES — 0.5B fits easily.
- **Key insight**: Ultra-lightweight (0.5B) with streaming and TensorRT optimization. Alibaba's production pedigree. But no Indian languages and CosyVoice3 is already in research.
- **Gotchas**: (1) No Indian languages. (2) CosyVoice3 paper already published — v2 may be superseded soon. (3) Chinese dialect focus doesn't help for Indian languages.

**Sources:**
- [GitHub](https://github.com/FunAudioLLM/CosyVoice)
- [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)

---

## COMPARATIVE SUMMARY TABLE

| Model | Params | VRAM | English | Cloning | Kannada | Hindi | Streaming | License | 16 GB Fit |
|-------|--------|------|---------|---------|---------|-------|-----------|---------|-----------|
| **Dia 1.6B** | 1.6B | ~10 GB | 4/5 | 3/5 | NO | NO | NO | Apache 2.0 | YES |
| **CSM-1B** | 1B | ~4 GB | 3.5/5 | 2/5 | NO | NO | NO | Apache 2.0 | YES |
| **Orpheus 3B** | 3B | ~15 GB | 4.5/5 | 3.5/5 | NO | YES | YES | Apache 2.0 | TIGHT |
| **Mars5-TTS** | 1.2B | ~6 GB | 3/5 | 4/5 | NO | NO | NO | AGPL 3.0 | YES |
| **MaskGCT** | 1.3B | ~14 GB | 4/5 | 4/5 | NO | NO | NO | Unclear | TIGHT |
| **OuteTTS 1.0** | 0.6-1B | ~5 GB | 3.5/5 | 3.5/5 | NO | NO | NO | Mixed* | YES |
| **Fish Speech S2** | 4.4B | ~17 GB | 4.5/5 | 4.5/5 | **YES** | **YES** | YES | Research Only | BARELY** |
| **KokoClone** | ~500M | <2 GB | 4/5 | 3/5 | NO | YES | NO | Apache 2.0 | YES |
| **MetaVoice** | 1.2B | ~12 GB | 3.5/5 | 3.5/5 | NO | NO | NO | Apache 2.0 | YES |
| **Chatterbox** | 350M-? | 4-8 GB | 4.5/5 | 4/5 | NO | YES | YES | **MIT** | YES |
| **Higgs Audio V2** | 3B+ | 8-24 GB | 4.5/5 | 4/5 | NO | NO | Exp. | Apache 2.0 | TIGHT |
| **Voxtral TTS** | 4B | 16 GB+ | 4.5/5 | 4/5 | NO | YES | YES | CC BY-NC | BARELY |
| **Spark-TTS** | 0.5B | ~4 GB | 3.5/5 | 3.5/5 | NO | NO | ? | Restrictive | YES |
| **Qwen3-TTS** | 0.6-1.7B | 3-10 GB | 4/5 | 4/5 | NO | NO | YES | Unclear | YES |
| **GLM-TTS** | ~?B | ~8 GB | 3.5/5 | 4/5 | NO | NO | YES | Unclear | YES |
| **IndexTTS-2** | ~?B | ~8 GB | 4/5 | 4.5/5 | NO | NO | NO | Unclear | YES |
| **CosyVoice2** | 0.5B | ~4 GB | 4/5 | 4/5 | NO | NO | YES | Mixed | YES |

\* OuteTTS: 0.6B variant is Apache 2.0; 1B variant is CC-BY-NC-SA-4.0
\** Fish Speech: Fits with NF4 quantization (~10 GB) or GGUF q8_0 (~7 GB)

---

## KEY FINDINGS FOR ANNIE

### The Kannada Problem
**Only Fish Speech S2 explicitly supports Kannada** among all models surveyed. IndicF5 (our current solution) remains the best option for Kannada TTS. No other model in this survey has trained on Kannada speech data.

### The Hindi Opportunity
Several models support Hindi: Orpheus (multilingual release), KokoClone, Chatterbox, Voxtral TTS, Fish Speech S2. If Annie needs Hindi TTS in the future, Chatterbox (MIT license, lightweight) or Orpheus (Apache 2.0, streaming) are strong candidates.

### Best English TTS (if we wanted to replace Kokoro)
1. **Chatterbox Turbo** — MIT license, 350M params, emotion control, voice cloning, RTF ~0.5 on an RTX 4090
2. **Dia 1.6B** — Best for dialogue, Apache 2.0, fits on 16 GB
3. **Fish Speech S2** — Best overall quality but restrictive license and heavy

### Best Voice Cloning Quality
1. **Fish Speech S2** — 4.5/5, but research-only license
2. **IndexTTS-2** — 4.5/5, emotion-timbre disentanglement
3. **Chatterbox** — 4/5, MIT license, 23 languages

### Recommendation for Annie's Dual-TTS Architecture
**Current setup (Kokoro English + IndicF5 Kannada) remains optimal.** No model beats this combination for our specific needs:
- Kokoro: 82M params, <2 GB VRAM, English quality 4.5/5, fastest RTF
- IndicF5: 6 GB VRAM, 11 Indian languages including Kannada, voice cloning
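Keeping this concrete: a back-of-envelope VRAM budget for the 16 GB card under the current pair, using the estimates quoted above (approximations, not measurements):

```python
BUDGET_GB = 16.0
services = {"kokoro": 2.0, "indicf5": 6.0}   # upper-bound estimates from above

headroom = BUDGET_GB - sum(services.values())
print(f"headroom: {headroom:.1f} GB")         # room left for other workloads

def fits(candidate_gb: float) -> bool:
    """Would an additional model fit alongside the current dual-TTS pair?"""
    return candidate_gb <= headroom

# e.g. Dia (~10 GB) would not fit next to Kokoro + IndicF5; CSM (~4 GB) would.
```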

**Watch list for future evaluation:**
1. **Chatterbox** — If MIT-licensed Hindi TTS is needed, or to replace Kokoro for English with cloning
2. **Fish Speech S2** — If Fish Audio releases a permissive license, this would be the ultimate multilingual TTS
3. **Orpheus 1B** — When the 1B variant ships, it could be a compelling English + Hindi option
4. **OuteTTS 1.0** — If Tamil/Bengali TTS is needed (no Kannada though)
5. **Voxtral TTS** — If Mistral releases under Apache 2.0

---

## MODELS TO DEFINITIVELY SKIP

| Model | Reason to Skip |
|-------|---------------|
| Mars5-TTS | AGPL license + stale (July 2024) + English-only |
| MetaVoice | Abandoned (July 2024) + no Indian languages |
| CSM-1B | Weak cloning + English-only + no streaming |
| Spark-TTS | Chinese + English only + unclear license |
| GLM-TTS | Chinese-first + no Indian languages |
| MaskGCT | VRAM-hungry + English/Chinese only + no streaming |
