# TTS Models Research: Tier 2 & Tier 3 (Voice Cloning Focus)

**Date:** 2026-04-06
**Purpose:** Evaluate open-source TTS models for voice cloning quality, multilingual support (especially Kannada/Indian languages + English), and self-hosted deployment on 16 GB GPU (Panda).
**Context:** Annie currently uses Kokoro (English) + IndicF5 (Kannada). This research explores alternatives and complements.

---

## TIER 2 MODELS

---

### 6. Dia 1.6B (nari-labs/Dia-1.6B)

- **Size**: 1.6B params / ~10 GB VRAM (full precision)
- **English quality**: 4/5 — Ultra-realistic dialogue generation, natural non-verbal sounds (laughter, coughing). Competes with ElevenLabs on demo benchmarks. Multi-speaker dialogue in a single pass is the killer feature.
- **Voice cloning**: 3/5 — Zero-shot via audio prompt (provide reference audio + transcript). 5-second clip sufficient. No fine-tuning needed, but no voice consistency by default (must use audio prompt or fix seed).
- **Kannada/Indian langs**: NO — English only. No multilingual roadmap published.
- **Speed**: ~40 tok/s on an A4000 (86 tokens ≈ 1 second of audio), i.e. roughly 0.46x real-time. About 2x real-time on an RTX 4090. Slower on older GPUs.
- **Streaming**: NO — No native streaming support. Users have requested it in GitHub issues.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Needs ~10 GB VRAM. Quantized version planned for further reduction.
- **Key insight**: Best-in-class for multi-speaker dialogue generation. If you need two speakers talking naturally in one pass, nothing else comes close.
- **Gotchas**: (1) English only — hard blocker for Indian languages. (2) GPU-only, no CPU inference. (3) No streaming. (4) Voice inconsistency without audio prompt or seed fixing. (5) CUDA 12.6 required. (6) First run slow due to Descript Audio Codec download.
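The throughput figures above reduce to simple token arithmetic. A small sketch (the 86 tokens-per-audio-second figure is from Dia's documentation; the helper name is ours, not part of Dia's API):

```python
def realtime_speed(gen_tokens_per_sec: float, tokens_per_audio_sec: float = 86.0) -> float:
    """Seconds of audio produced per wall-clock second of generation.
    Values below 1.0 mean synthesis runs slower than real-time."""
    return gen_tokens_per_sec / tokens_per_audio_sec

# A4000 at ~40 tok/s -> ~0.46x real-time; "2x real-time" on an RTX 4090
# implies a generation rate of roughly 172 tok/s.
print(round(realtime_speed(40), 2))
```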

**Sources:**
- [GitHub](https://github.com/nari-labs/dia)
- [HuggingFace](https://huggingface.co/nari-labs/Dia-1.6B)
- [MarkTechPost](https://www.marktechpost.com/2025/04/22/open-source-tts-reaches-new-heights-nari-labs-releases-dia-a-1-6b-parameter-model-for-real-time-voice-cloning-and-expressive-speech-synthesis-on-consumer-device/)

---

### 7. CSM-1B (sesame/csm-1b)

- **Size**: 1B params / ~4 GB VRAM (per user testing)
- **English quality**: 3.5/5 — Conversational tone is the focus. Llama backbone + Mimi audio codec. Quality is decent but not top-tier for isolated sentences — shines in multi-turn dialogue context.
- **Voice cloning**: 2/5 — Not traditional cloning. Uses context-based speaker conditioning via `Segment` objects (provide prior audio segments with speaker IDs). No dedicated voice cloning pipeline. Community forks (isaiahbjork/csm-voice-cloning) add better cloning.
- **Kannada/Indian langs**: NO — English only. Docs explicitly state: "some capacity for non-English languages due to data contamination, but it likely won't do well."
- **Speed**: No official RTF published. Generation produces complete audio with configurable `max_audio_length_ms`.
- **Streaming**: NO — Generates complete audio outputs.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Only ~4 GB VRAM needed.
- **Key insight**: Lightweight and conversational, but weak on voice cloning and English-only. The "conversational context" approach (feeding prior turns) is architecturally interesting but limited in practice.
- **Gotchas**: (1) English only. (2) Base model not fine-tuned on any specific voice. (3) Voice cloning is indirect (context-based, not reference-audio-based). (4) Windows needs `triton-windows` instead of `triton`. (5) Works best with short phrases.
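The "conversational context" approach is easiest to see abstractly. A hypothetical sketch of the `Segment`-based conditioning pattern (field names modeled loosely on the repo's README; this is not a runnable CSM call):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    speaker: int          # stable speaker id across turns
    text: str             # transcript of the prior turn
    audio: List[float] = field(default_factory=list)  # samples for that turn

def build_context(history: List[Segment], max_turns: int = 4) -> List[Segment]:
    """CSM conditions generation on prior dialogue turns rather than on a
    single reference clip; keep only the most recent turns as context."""
    return history[-max_turns:]

history = [Segment(speaker=i % 2, text=f"turn {i}") for i in range(6)]
context = build_context(history)  # last 4 turns, alternating speakers
```

This is why cloning is "indirect": the voice emerges from prior turns tagged with a speaker id, not from a dedicated reference-audio pathway.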

**Sources:**
- [GitHub](https://github.com/SesameAILabs/csm)
- [HuggingFace](https://huggingface.co/sesame/csm-1b)
- [VRAM Issue](https://github.com/SesameAILabs/csm/issues/9)

---

### 8. Orpheus-TTS (canopyai/orpheus-3b-0.1-ft)

- **Size**: 3B params (primary) / ~15 GB VRAM full precision, ~8 GB with FP8 quantization. Planned variants: 1B, 400M, 150M.
- **English quality**: 4.5/5 — Trained on 100K+ hours of English. Exceptional prosody and emotion. Emotion tags (`<laugh>`, `<sigh>`, `<chuckle>`, `<cough>`, `<yawn>`, `<gasp>`) are a standout feature.
- **Voice cloning**: 3.5/5 — Zero-shot voice cloning supported. Pretrained variant can condition on text-speech pairs. Quality is good but not the primary focus (emotion/prosody is).
- **Kannada/Indian langs**: PARTIAL — Multilingual research release includes Hindi, Chinese, Korean, Spanish (7 language pairs). No Kannada. No Indian languages beyond Hindi.
- **Speed**: ~200ms streaming latency, reducible to ~100ms with input streaming. Streaming supported via vLLM backend.
- **Streaming**: YES — Native streaming via vLLM at 24kHz sample rate.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: TIGHT — 3B at full precision needs ~15 GB. With FP8 quantization + SNAC model, fits on 24 GB (RTX 3090). On 16 GB, would need aggressive quantization (IQ3_XS). The 1B variant (when released) would fit easily.
- **Key insight**: The emotion tag system is unique and powerful. If you want expressive, emotionally-controlled speech, Orpheus is the best open-source option. But 3B is too large for a 16 GB GPU running other services.
- **Gotchas**: (1) 3B model too large for 16 GB GPU with other services. (2) vLLM version sensitivity — v0.7.3 recommended (later versions buggy). (3) KV cache errors with PyPI version. (4) Smaller variants (1B/400M/150M) not yet released. (5) Hindi supported but not Kannada. (6) Occasional frame-skipping glitch in streaming.
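Since the tag set is fixed, a small guard keeps typos out of prompts. A sketch (only the tag names come from Orpheus' docs; the helper itself is ours):

```python
# Emotion tags documented for Orpheus-TTS.
ORPHEUS_EMOTION_TAGS = {"laugh", "sigh", "chuckle", "cough", "yawn", "gasp"}

def tag(name: str) -> str:
    """Return an inline emotion tag, rejecting names Orpheus won't recognize."""
    if name not in ORPHEUS_EMOTION_TAGS:
        raise ValueError(f"unknown Orpheus emotion tag: {name}")
    return f"<{name}>"

prompt = f"Oh no {tag('gasp')} ... that is actually funny {tag('chuckle')}"
```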

**Sources:**
- [GitHub](https://github.com/canopyai/Orpheus-TTS)
- [HuggingFace](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft)
- [VRAM Issue](https://github.com/canopyai/Orpheus-TTS/issues/9)
- [Multilingual Release](https://canopylabs.ai/releases/orpheus_can_speak_any_language)

---

### 9. Mars5-TTS (Camb-AI/MARS5-TTS)

- **Size**: ~1.2B total (AR: ~750M + NAR: ~450M) / ~6 GB VRAM estimated (fp16)
- **English quality**: 3/5 — Variable consistency. Developers themselves acknowledge quality/stability needs improvement. Good for sports commentary, anime, prosodically hard scenarios. Not best-in-class for general TTS.
- **Voice cloning**: 4/5 — This is the primary focus. Two modes: shallow clone (fast, no transcript) and deep clone (higher quality, needs reference transcript). Best with ~6 seconds of reference audio.
- **Kannada/Indian langs**: NO — Open-source version is English-only. Commercial platform (CAMB.AI Studio) supports 140+ languages. The open-source release does NOT include multilingual weights.
- **Speed**: No official RTF published. Speed optimization listed as an "area for improvement."
- **Streaming**: NO — Not mentioned in documentation.
- **License**: GNU AGPL 3.0 (IMPORTANT: copyleft, requires derivative works to be open-source)
- **Fit on 16 GB GPU**: YES — ~6 GB estimated. But MPS (Apple Silicon) not supported.
- **Key insight**: Focused on voice cloning quality, but English-only in open-source version. AGPL license is a concern for any proprietary integration. Last significant update was July 2024 — maintenance appears stale.
- **Gotchas**: (1) AGPL license — copyleft poison pill. (2) English-only open-source. (3) Variable output consistency. (4) Long-form generation not implemented. (5) Last commit July 2024 — likely abandoned. (6) Apple Silicon unsupported.

**Sources:**
- [GitHub](https://github.com/Camb-ai/MARS5-TTS)
- [HuggingFace](https://huggingface.co/CAMB-AI/MARS5-TTS)
- [VentureBeat](https://venturebeat.com/ai/exclusive-camb-takes-on-elevenlabs-with-open-voice-cloning-ai-model-mars5-offering-higher-realism-support-for-140-languages)

---

### 10. MaskGCT (amphion/MaskGCT)

- **Size**: ~1.3B total (T2S-Large: 695M + S2A: 353M + Semantic Codec: 44M + Acoustic Codec: 170M). ~10-16 GB VRAM (Gradio demo fails on 8 GB GPUs; 16 GB minimum per GitHub issues).
- **English quality**: 4/5 — ICLR 2025 paper. Outperforms VALL-E, VoiceBox, NaturalSpeech 3, VoiceCraft, XTTS-v2 on LibriSpeech/SeedTTS benchmarks. SMOS scores 4.27-4.33.
- **Voice cloning**: 4/5 — Zero-shot via speaker prompt. No fine-tuning needed. Strong benchmark numbers on speaker similarity (0.687-0.777).
- **Kannada/Indian langs**: NO — English and Mandarin Chinese only (50K hours each from Emilia dataset).
- **Speed**: Non-autoregressive = fixed inference steps regardless of speech length (25-50 steps for T2S). Approximately 2x real-time on RTX 4090 (per community reports). Significantly faster than autoregressive models for long utterances.
- **Streaming**: NO — Parallel generation means full output is produced at once.
- **License**: Not explicitly stated in repo. Part of Amphion project (MIT license for framework, but model weights may differ).
- **Fit on 16 GB GPU**: MAYBE — Reports of 8 GB being insufficient. Likely needs 10-14 GB with full pipeline. Tight on 16 GB with other services.
- **Key insight**: Best speed-quality tradeoff among non-autoregressive models. The parallel generation approach means inference time doesn't scale with output length. Academic pedigree (ICLR 2025) is strong.
- **Gotchas**: (1) English + Chinese only. (2) VRAM-hungry for the full pipeline (w2v-bert-2.0 + T2S + S2A + codec). (3) Complex multi-component setup. (4) No streaming (inherent to non-autoregressive design). (5) License ambiguity. (6) No active maintenance updates.

**Sources:**
- [GitHub](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)
- [HuggingFace](https://huggingface.co/amphion/MaskGCT)
- [arXiv Paper](https://arxiv.org/abs/2409.00750)
- [VRAM Issue](https://github.com/open-mmlab/Amphion/issues/328)

---

### 11. OuteTTS (OuteAI/OuteTTS-0.3-1B → now OuteTTS-1.0)

- **Size**: 0.3-1B: 1B params on OLMo-1B base. 1.0-0.6B: 600M on Qwen3-0.6B base. 1.0-1B: 1B on Llama. Estimated ~3-5 GB VRAM for 0.6B, ~5-7 GB for 1B.
- **English quality**: 3.5/5 — Improved significantly in v1.0 with new DAC encoder and automatic word alignment. 150 tokens/second audio encoding rate (up from 75).
- **Voice cloning**: 3.5/5 — Create speaker from 5-10 second reference audio. v1.0 has "noticeably more accurate voice reproduction." Save/load speaker profiles (JSON). Works across languages.
- **Kannada/Indian langs**: PARTIAL — 1.0-1B covers 23+ languages including Bengali, Tamil (moderate coverage). No Kannada. 1.0-0.6B covers 14 languages (no Indian languages).
- **Speed**: No published RTF. Token rate of 150 tokens/s suggests near-real-time on decent GPU. Best with 30-second generation batches.
- **Streaming**: NO — Not mentioned in documentation.
- **License**: 1.0-1B: CC-BY-NC-SA-4.0 (non-commercial). 1.0-0.6B: Apache 2.0 (commercial OK). 0.3-1B: CC-BY-NC-SA-4.0 (non-commercial).
- **Fit on 16 GB GPU**: YES — Both variants fit easily (3-7 GB estimated).
- **Key insight**: The pure-LLM architecture is elegant — it extends any existing LLM with TTS capability. The 0.6B Apache 2.0 variant is interesting for lightweight deployment. Tamil and Bengali (but not Kannada) in the 1B model.
- **Gotchas**: (1) No Kannada support. (2) 1B model is non-commercial license. (3) Repetition penalty must only apply to 64-token recent window — full context penalty breaks output. (4) 30-second generation window is limiting. (5) No streaming. (6) Quality below Kokoro/Dia for English.
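Gotcha (3) is worth spelling out: the repetition penalty must be restricted to a recent window. A minimal sketch of the windowed variant on plain floats (real samplers operate on logit tensors; the penalty value is illustrative):

```python
def windowed_repetition_penalty(logits, history, penalty=1.3, window=64):
    """Apply the classic repetition penalty, but only to token ids seen in
    the last `window` generated tokens. Penalizing the full context is
    exactly what breaks OuteTTS output."""
    out = list(logits)
    for tok in set(history[-window:]):
        # standard formulation: shrink positive logits, grow negative ones
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```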

**Sources:**
- [HuggingFace 0.3](https://huggingface.co/OuteAI/OuteTTS-0.3-1B)
- [OuteTTS 1.0 Blog](https://outeai.com/blog/outetts-1-0-release)
- [HuggingFace 1.0-1B](https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B)
- [HuggingFace 1.0-0.6B](https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B)

---

## TIER 3 MODELS

---

### 12. Fish Speech S2 (fishaudio/fish-speech)

- **Size**: 4B (Slow AR) + 400M (Fast AR) = ~4.4B total. ~17 GB VRAM at full precision. 24 GB recommended, 12 GB absolute minimum. With 4-bit NF4 quantization: ~10-11 GB. GGUF q8_0: ~7 GB.
- **English quality**: 4.5/5 — Top-tier. 0.515 posterior mean on Audio Turing Test (beats Seed-TTS 0.417 and MiniMax 0.387). 91.61% win rate on paralinguistics. Trained on 10M+ hours.
- **Voice cloning**: 4.5/5 — 10-30 second reference audio. Captures timbre, speaking style, emotional tendencies. Multi-speaker support with speaker ID tokens. No fine-tuning needed.
- **Kannada/Indian langs**: YES — Explicitly lists Kannada, Tamil, Telugu, Malayalam, Hindi, Bengali, Punjabi, Gujarati, Marathi, Assamese, Sinhala, Nepali, Urdu among 80+ languages. Quality tier is lower than English/Chinese/Japanese but present. Trained on global corpus.
- **Speed**: RTF 0.195 on H200. TTFA ~100ms. 3000+ acoustic tokens/s. Streaming via SGLang with continuous batching + CUDA Graph.
- **Streaming**: YES — Full streaming support with prefix caching, paged KV cache, continuous batching.
- **License**: FISH AUDIO RESEARCH LICENSE (non-commercial). Commercial use requires separate written license from Fish Audio.
- **Fit on 16 GB GPU**: BARELY — Full precision needs ~17 GB (won't fit). NF4 quantization: ~10-11 GB (fits, but slow). GGUF q8_0: ~7 GB (fits). Quality degrades with quantization.
- **Key insight**: The ONLY model in this entire survey that explicitly supports Kannada with voice cloning. 80+ languages, production-grade streaming, top-tier quality. The catch is the restrictive research-only license and large VRAM footprint.
- **Gotchas**: (1) Research-only license — commercial use requires paid license from Fish Audio. (2) 4.4B params is large for 16 GB GPU. (3) Quantization needed for 16 GB, with quality/speed tradeoff. (4) Kannada quality is untested in production (lower tier than English). (5) No community Kannada voice samples to reference.
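The VRAM figures above are runtime totals; the weights-only portion follows from bits-per-parameter arithmetic. A sketch (the q8_0/NF4 bit widths are approximate, since quantized formats carry scale metadata; the gap to the quoted totals is activations, KV cache, and the codec):

```python
def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Weights-only memory in GB; runtime VRAM adds KV cache, activations,
    and codec models on top of this figure."""
    return params_billions * bits_per_param / 8

for fmt, bits in [("fp16", 16.0), ("gguf q8_0", 8.5), ("nf4", 4.5)]:
    print(f"{fmt}: ~{weight_footprint_gb(4.4, bits):.1f} GB weights")
```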

**Sources:**
- [GitHub](https://github.com/fishaudio/fish-speech)
- [Fish Audio](https://fish.audio/)
- [S2 Technical Report](https://arxiv.org/html/2603.08823v1)
- [VRAM Discussion](https://huggingface.co/fishaudio/s2-pro/discussions/10)
- [License](https://github.com/fishaudio/fish-speech/blob/main/LICENSE)

---

### 13. Kokoro Voice Cloning Extensions

#### KokoClone (Ashish-Patnaik/kokoclone)

- **Size**: Kokoro-82M base + Kanade voice conversion model. Estimated ~0.5-1 GB VRAM total.
- **English quality**: 4/5 — Inherits Kokoro's quality (TTS Arena #2 behind ElevenLabs). Very low WER.
- **Voice cloning**: 3/5 — Two methods: (a) TTS + zero-shot voice transfer (3-10 second reference), (b) Audio-to-audio via Kanade voice conversion (no transcription needed). Chunked processing for long recordings with overlap smoothing.
- **Kannada/Indian langs**: PARTIAL — Supports Hindi (via Kokoro's language list). 8 languages total: English, Hindi, French, Japanese, Chinese, Italian, Portuguese, Spanish. No Kannada.
- **Speed**: Fast — Kokoro-ONNX is one of the fastest open-source TTS engines. GPU chunk sizing auto-calibrated to 50% of GPU memory.
- **Streaming**: NO — Not mentioned. Likely inherits Kokoro's non-streaming nature.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Very lightweight. Well under 2 GB VRAM.
- **Key insight**: Adds voice cloning to the already-excellent Kokoro engine with minimal VRAM overhead. Hindi support but no Kannada. The audio-to-audio conversion mode (Kanade) is interesting — re-voice any recording without transcription.
- **Gotchas**: (1) No Kannada. (2) Cloning quality is "transfer" not "true clone" — may not capture all voice characteristics. (3) RoPE ceiling caps chunks at ~8.9s. (4) Young project, limited testing.
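The chunk-with-overlap scheme is generic enough to sketch. A linear crossfade joining two chunks of float samples (illustrative only; KokoClone's actual smoothing may differ):

```python
def crossfade(a, b, overlap):
    """Join chunks a and b, blending the last `overlap` samples of a with
    the first `overlap` samples of b via a linear ramp. This hides the
    seam when long recordings are synthesized in chunks."""
    assert 0 <= overlap <= min(len(a), len(b))
    head = a[:len(a) - overlap]
    blended = [
        a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return head + blended + b[overlap:]
```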

#### KVoiceWalk (RobViren/kvoicewalk)

- Random walk algorithm + hybrid scoring to create new Kokoro voice style tensors matching target voices. More experimental, research-quality.

**Sources:**
- [KokoClone GitHub](https://github.com/Ashish-Patnaik/kokoclone)
- [KVoiceWalk GitHub](https://github.com/RobViren/kvoicewalk)

---

### 14. MetaVoice-1B (metavoiceio/metavoice-1B-v0.1)

- **Size**: 1.2B params / ~12 GB GPU RAM minimum
- **English quality**: 3.5/5 — Trained on 100K hours. Good emotional speech rhythm and tone in English. Not top-tier compared to newer models (Dia, Fish Speech).
- **Voice cloning**: 3.5/5 — Zero-shot for American & British voices with 30-second reference. Cross-lingual cloning with fine-tuning. Success with 1 minute training data for Indian speakers specifically noted.
- **Kannada/Indian langs**: PARTIAL — Cross-lingual cloning with fine-tuning has been tested on Indian speakers. No native Indian language training. English-focused.
- **Speed**: RTF < 1.0 on Ampere/Ada/Hopper GPUs (once compiled). int4 quantization is ~2x faster than bf16.
- **Streaming**: UNCLEAR — Not documented in current repo.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: YES — Needs ~12 GB, but tight with other services.
- **Key insight**: Early mention of Indian speaker fine-tuning is promising. But the project appears abandoned — last significant commit July 2024. Not recommended for new deployments.
- **Gotchas**: (1) Last commit July 2024 — effectively abandoned. (2) 12 GB minimum is tight on 16 GB GPU. (3) Cross-lingual Indian voice cloning requires fine-tuning (not zero-shot). (4) Not competitive with 2025-2026 models on quality. (5) ElevenLabs acquisition rumor was false, but project is still unmaintained.

**Sources:**
- [GitHub](https://github.com/metavoiceio/metavoice-src)
- [HuggingFace](https://huggingface.co/metavoiceio/metavoice-1B-v0.1)

---

### 15. Notable Additional Models (2025-2026)

---

#### 15a. Chatterbox (resemble-ai/chatterbox)

- **Size**: Original: size unspecified. Turbo: 350M params / ~4-8 GB VRAM.
- **English quality**: 4.5/5 — First open-source model with emotion exaggeration control (monotone to dramatic). Excellent naturalness.
- **Voice cloning**: 4/5 — 5-second reference audio. Zero-shot. Works across 23 languages.
- **Kannada/Indian langs**: PARTIAL — Hindi is in the 23-language list. No Kannada. Languages: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, Chinese.
- **Speed**: Sub-200ms latency (original). Turbo: RTF 0.499 on RTX 4090, first-chunk latency ~472ms.
- **Streaming**: YES — Community streaming implementations exist (davidbrowne17/chatterbox-streaming).
- **License**: MIT (very permissive!)
- **Fit on 16 GB GPU**: YES — Turbo (350M) easily. Original needs 8-16 GB.
- **Key insight**: MIT license + emotion control + 23 languages + voice cloning makes this extremely compelling. Perth neural watermarking is built-in (every generated audio is watermarked). Hindi but no Kannada.
- **Gotchas**: (1) No Kannada. (2) Watermarking is always-on (Perth watermarks survive MP3 compression). (3) Hindi quality untested in depth. (4) Turbo sacrifices some quality for speed.

**Sources:**
- [GitHub](https://github.com/resemble-ai/chatterbox)
- [Resemble AI](https://www.resemble.ai/chatterbox/)
- [Turbo](https://www.resemble.ai/chatterbox-turbo/)

---

#### 15b. Higgs Audio V2 (bosonai/higgs-audio-v2-generation-3B-base)

- **Size**: 3B base (Llama-3.2-3B) + 2.2B DualFFN audio adapter. Model weights ~5.8 GB fp16. 4-bit: ~8 GB VRAM. Full precision: ~24 GB.
- **English quality**: 4.5/5 — 75.7% win rate over gpt-4o-mini-tts on emotions. SOTA on Seed-TTS Eval and ESD benchmarks.
- **Voice cloning**: 4/5 — 3-10 second reference audio. Multi-speaker dialogue. Can hum melodies in cloned voice. Generate speech + background music simultaneously.
- **Kannada/Indian langs**: NO — 32 languages seen in pre-training but quality is best for en-US, zh-CN, es-ES. No Indian languages mentioned.
- **Speed**: 1.3x real-time on RTX 4090. Streaming mode experimental.
- **Streaming**: EXPERIMENTAL — Mentioned in roadmap, not production-ready.
- **License**: Apache 2.0
- **Fit on 16 GB GPU**: TIGHT — 4-bit inference at ~8 GB fits, but quality tradeoff. Full precision needs 24 GB.
- **Key insight**: Most expressive open-source TTS — can generate music + speech simultaneously, hum melodies in cloned voices. Li Mu's team (credible). But 3B is heavy and no Indian languages.
- **Gotchas**: (1) No Indian language support. (2) 3B params means 16 GB GPU is tight. (3) Streaming still experimental. (4) 24kHz output (not 44.1kHz — planned for V3). (5) Distilled smaller models promised but not yet delivered.

**Sources:**
- [GitHub](https://github.com/boson-ai/higgs-audio)
- [HuggingFace](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base)
- [Blog](https://www.boson.ai/blog/higgs-audio-v2)

---

#### 15c. Voxtral TTS (mistralai/Voxtral-4B-TTS-2603)

- **Size**: 4B params (3.4B Transformer decoder + components). ~8 GB weights (BF16). Needs 16 GB+ VRAM for inference.
- **English quality**: 4.5/5 — 68.4% win rate against ElevenLabs Flash v2.5 in human preference tests.
- **Voice cloning**: 4/5 — 5-25 second reference audio (accepts as little as 3 seconds). 20 built-in preset voices.
- **Kannada/Indian langs**: PARTIAL — Supports Hindi and Arabic among 9 languages (English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic). No Kannada.
- **Speed**: 70ms model latency. RTF ~9.7x (generates audio 9.7x faster than real-time). Low-latency streaming.
- **Streaming**: YES — Native low-latency streaming with ~90ms processing time.
- **License**: CC BY-NC 4.0 (NON-COMMERCIAL for self-hosted weights). Commercial use requires Mistral API.
- **Fit on 16 GB GPU**: BARELY — Needs 16 GB+ VRAM. Won't fit alongside other services. Quantized versions (~3 GB weights) available.
- **Key insight**: Mistral-quality engineering. Fastest RTF in this survey (9.7x). Hindi support. But CC BY-NC license and 4B size are blockers for the 16 GB Panda deployment.
- **Gotchas**: (1) CC BY-NC 4.0 — non-commercial only for self-hosting. (2) 4B too large for 16 GB GPU with other services. (3) No Kannada. (4) Very new (March 2026) — limited community testing. (5) Hindi quality not independently verified.

**Sources:**
- [HuggingFace](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
- [Mistral Blog](https://mistral.ai/news/voxtral-tts)
- [arXiv](https://arxiv.org/html/2603.25551v1)

---

#### 15d. Spark-TTS (SparkAudio/Spark-TTS-0.5B)

- **Size**: 0.5B params (Qwen2.5 base). ~4 GB+ VRAM minimum.
- **English quality**: 3.5/5 — Good for its tiny size. Eliminates need for flow matching by directly predicting audio codes from LLM.
- **Voice cloning**: 3.5/5 — Zero-shot with 5-10 second reference (15-30 second recommended). Cross-lingual and code-switching scenarios supported.
- **Kannada/Indian langs**: NO — Chinese and English only.
- **Speed**: RTF 0.0704 with TensorRT-LLM + Triton optimization. Extremely fast when optimized.
- **Streaming**: UNCLEAR — Not explicitly documented.
- **License**: Research/academic use stated. Not clearly Apache/MIT. Users told to ensure local law compliance.
- **Fit on 16 GB GPU**: YES — Only ~4 GB needed.
- **Key insight**: Tiny, fast, and the TensorRT-optimized RTF (0.07) is remarkable. But Chinese + English only, and the license is restrictive/ambiguous.
- **Gotchas**: (1) Chinese + English only. (2) License is unclear/restrictive. (3) Qwen2.5 base means Chinese-first design. (4) VoxBox training dataset (100K hours) is not publicly released for reproduction.

**Sources:**
- [GitHub](https://github.com/SparkAudio/Spark-TTS)
- [HuggingFace](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)

---

#### 15e. Qwen3-TTS (QwenLM/Qwen3-TTS)

- **Size**: 1.7B (full) / 0.6B (lightweight). VRAM not officially published but estimated ~6-10 GB for 1.7B, ~3-5 GB for 0.6B.
- **English quality**: 4/5 — Good quality with voice design capability (describe a voice in natural language to create it).
- **Voice cloning**: 4/5 — 3-second minimum (10-15 seconds recommended for best quality). Also supports voice design via natural language description.
- **Kannada/Indian langs**: NO — 10 languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.
- **Speed**: RTF 3-5 on CPU, i.e. slower than real-time (GPU recommended). No official GPU RTF published.
- **Streaming**: YES — Supports streaming speech generation.
- **License**: Not explicitly stated in searches. Alibaba Cloud model.
- **Fit on 16 GB GPU**: YES — 0.6B variant easily. 1.7B fits too.
- **Key insight**: Voice design via natural language is a unique feature ("make a warm female voice with slight raspiness"). Good cloning. But no Indian languages at all.
- **Gotchas**: (1) No Indian languages. (2) Released January 2026 — still new. (3) RTF on GPU not well documented. (4) Alibaba Cloud licensing may have restrictions.

**Sources:**
- [GitHub](https://github.com/QwenLM/Qwen3-TTS)
- [Simon Willison](https://simonwillison.net/2026/Jan/22/qwen3-tts/)

---

#### 15f. GLM-TTS (zai-org/GLM-TTS)

- **Size**: Not published (Llama-based LLM + Flow Matching + Vocoder). ~8 GB VRAM, ~9 GB disk for weights.
- **English quality**: 3.5/5 — Secondary to Chinese. CER 0.89 (best among open-source, close to MiniMax's 0.83).
- **Voice cloning**: 4/5 — 3-second zero-shot cloning. Reinforcement learning (GRPO) optimizes cloning quality via similarity reward function.
- **Kannada/Indian langs**: NO — Chinese primary, English secondary. Mixed Chinese-English supported. Tokenizer optimized for Pinyin.
- **Speed**: Not published. Streaming inference supported.
- **Streaming**: YES — Native streaming inference.
- **License**: Open-source (specific license not found in searches). Zhipu AI.
- **Fit on 16 GB GPU**: YES — ~8 GB VRAM.
- **Key insight**: The RL-based optimization approach (GRPO with multi-reward: similarity, CER, emotion, laughter) is architecturally innovative. But Chinese-first, English-secondary, no Indian languages.
- **Gotchas**: (1) Chinese-first — English is secondary. (2) Pinyin-optimized tokenizer. (3) No Indian languages. (4) Zhipu AI model — license terms unclear for commercial use.

**Sources:**
- [GitHub](https://github.com/zai-org/GLM-TTS)
- [arXiv](https://arxiv.org/html/2512.14291v1)

---

#### 15g. IndexTTS-2 (IndexTeam/IndexTTS-2)

- **Size**: Not published (transformer + BigVGANv2 vocoder). ~8 GB VRAM estimated. RTX 3060 12 GB works but slow.
- **English quality**: 4/5 — "Most realistic and expressive" per community. Emotional expression breakthrough.
- **Voice cloning**: 4.5/5 — Zero-shot. Independent control of timbre and emotion (disentangled). Duration control for dubbing.
- **Kannada/Indian langs**: NO — Chinese, English, Japanese only.
- **Speed**: RTF 13.1 on RTX 3060 12 GB (very slow!). FP16 + DeepSpeed + CUDA kernels can improve. RTX 4090 likely much better.
- **Streaming**: NO — Batch generation.
- **License**: Open-source (specific terms not found).
- **Fit on 16 GB GPU**: YES — ~8 GB, but inference is slow.
- **Key insight**: Best emotion-timbre disentanglement — can keep the same voice but change emotion independently. Great for dubbing. But Chinese/English/Japanese only, and very slow on consumer GPUs.
- **Gotchas**: (1) Extremely slow on RTX 3060 (RTF 13.1). (2) Chinese + English + Japanese only. (3) No streaming. (4) License unclear. (5) DeepSpeed optimization needed for acceptable speed.

**Sources:**
- [GitHub](https://github.com/index-tts/index-tts)
- [HuggingFace](https://huggingface.co/IndexTeam/IndexTTS-2)

---

#### 15h. CosyVoice2-0.5B (FunAudioLLM/CosyVoice2-0.5B)

- **Size**: 0.5B params. VRAM not published but ~3-5 GB estimated for 0.5B.
- **English quality**: 4/5 — Alibaba's FunAudioLLM team. Good quality, low latency.
- **Voice cloning**: 4/5 — Zero-shot voice cloning. Cross-lingual synthesis.
- **Kannada/Indian langs**: NO — 9 languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) + 18+ Chinese dialects. No Indian languages.
- **Speed**: ~300ms latency for interactive use. 4x acceleration with TensorRT-LLM.
- **Streaming**: YES — Low-latency streaming synthesis.
- **License**: Open-source (specific license varies by component).
- **Fit on 16 GB GPU**: YES — 0.5B fits easily.
- **Key insight**: Ultra-lightweight (0.5B) with streaming and TensorRT optimization. Alibaba's production pedigree. But no Indian languages and CosyVoice3 is already in research.
- **Gotchas**: (1) No Indian languages. (2) CosyVoice3 paper already published — v2 may be superseded soon. (3) Chinese dialect focus doesn't help for Indian languages.

**Sources:**
- [GitHub](https://github.com/FunAudioLLM/CosyVoice)
- [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)

---

## COMPARATIVE SUMMARY TABLE

| Model | Params | VRAM | English | Cloning | Kannada | Hindi | Streaming | License | 16 GB Fit |
|-------|--------|------|---------|---------|---------|-------|-----------|---------|-----------|
| **Dia 1.6B** | 1.6B | ~10 GB | 4/5 | 3/5 | NO | NO | NO | Apache 2.0 | YES |
| **CSM-1B** | 1B | ~4 GB | 3.5/5 | 2/5 | NO | NO | NO | Apache 2.0 | YES |
| **Orpheus 3B** | 3B | ~15 GB | 4.5/5 | 3.5/5 | NO | YES | YES | Apache 2.0 | TIGHT |
| **Mars5-TTS** | 1.2B | ~6 GB | 3/5 | 4/5 | NO | NO | NO | AGPL 3.0 | YES |
| **MaskGCT** | 1.3B | ~14 GB | 4/5 | 4/5 | NO | NO | NO | Unclear | TIGHT |
| **OuteTTS 1.0** | 0.6-1B | ~5 GB | 3.5/5 | 3.5/5 | NO | NO | NO | Mixed* | YES |
| **Fish Speech S2** | 4.4B | ~17 GB | 4.5/5 | 4.5/5 | **YES** | **YES** | YES | Research Only | BARELY** |
| **KokoClone** | ~500M | <2 GB | 4/5 | 3/5 | NO | YES | NO | Apache 2.0 | YES |
| **MetaVoice** | 1.2B | ~12 GB | 3.5/5 | 3.5/5 | NO | NO | NO | Apache 2.0 | YES |
| **Chatterbox** | 350M-? | 4-8 GB | 4.5/5 | 4/5 | NO | YES | YES | **MIT** | YES |
| **Higgs Audio V2** | 3B+ | 8-24 GB | 4.5/5 | 4/5 | NO | NO | Exp. | Apache 2.0 | TIGHT |
| **Voxtral TTS** | 4B | 16 GB+ | 4.5/5 | 4/5 | NO | YES | YES | CC BY-NC | BARELY |
| **Spark-TTS** | 0.5B | ~4 GB | 3.5/5 | 3.5/5 | NO | NO | ? | Restrictive | YES |
| **Qwen3-TTS** | 0.6-1.7B | 3-10 GB | 4/5 | 4/5 | NO | NO | YES | Unclear | YES |
| **GLM-TTS** | ~?B | ~8 GB | 3.5/5 | 4/5 | NO | NO | YES | Unclear | YES |
| **IndexTTS-2** | ~?B | ~8 GB | 4/5 | 4.5/5 | NO | NO | NO | Unclear | YES |
| **CosyVoice2** | 0.5B | ~4 GB | 4/5 | 4/5 | NO | NO | YES | Mixed | YES |

\* OuteTTS: 0.6B variant is Apache 2.0; 1B variant is CC-BY-NC-SA-4.0
\** Fish Speech: Fits with NF4 quantization (~10 GB) or GGUF q8_0 (~7 GB)

---

## KEY FINDINGS FOR ANNIE

### The Kannada Problem
**Only Fish Speech S2 explicitly supports Kannada** among all models surveyed. IndicF5 (our current solution) remains the best option for Kannada TTS. No other model in this survey has trained on Kannada speech data.

### The Hindi Opportunity
Several models support Hindi: Orpheus (multilingual release), KokoClone, Chatterbox, Voxtral TTS, Fish Speech S2. If Annie needs Hindi TTS in the future, Chatterbox (MIT license, lightweight) or Orpheus (Apache 2.0, streaming) are strong candidates.

### Best English TTS (if we wanted to replace Kokoro)
1. **Chatterbox Turbo** — MIT license, 350M params, emotion control, voice cloning, RTF ~0.5 on an RTX 4090
2. **Dia 1.6B** — Best for dialogue, Apache 2.0, fits on 16 GB
3. **Fish Speech S2** — Best overall quality but restrictive license and heavy

### Best Voice Cloning Quality
1. **Fish Speech S2** — 4.5/5, but research-only license
2. **IndexTTS-2** — 4.5/5, emotion-timbre disentanglement
3. **Chatterbox** — 4/5, MIT license, 23 languages

### Recommendation for Annie's Dual-TTS Architecture
**Current setup (Kokoro English + IndicF5 Kannada) remains optimal.** No model beats this combination for our specific needs:
- Kokoro: 82M params, <2 GB VRAM, English quality 4.5/5, fastest RTF
- IndicF5: 6 GB VRAM, 11 Indian languages including Kannada, voice cloning
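Keeping this concrete: a back-of-envelope VRAM budget for the 16 GB card under the current pair, using the estimates quoted above (approximations, not measurements):

```python
BUDGET_GB = 16.0
services = {"kokoro": 2.0, "indicf5": 6.0}   # upper-bound estimates from above

headroom = BUDGET_GB - sum(services.values())
print(f"headroom: {headroom:.1f} GB")         # room left for other workloads

def fits(candidate_gb: float) -> bool:
    """Would an additional model fit alongside the current dual-TTS pair?"""
    return candidate_gb <= headroom

# e.g. Dia (~10 GB) would not fit next to Kokoro + IndicF5; CSM (~4 GB) would.
```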

**Watch list for future evaluation:**
1. **Chatterbox** — If MIT-licensed Hindi TTS is needed, or to replace Kokoro for English with cloning
2. **Fish Speech S2** — If Fish Audio releases a permissive license, this would be the ultimate multilingual TTS
3. **Orpheus 1B** — When the 1B variant ships, it could be a compelling English + Hindi option
4. **OuteTTS 1.0** — If Tamil/Bengali TTS is needed (no Kannada though)
5. **Voxtral TTS** — If Mistral releases under Apache 2.0

---

## MODELS TO DEFINITIVELY SKIP

| Model | Reason to Skip |
|-------|---------------|
| Mars5-TTS | AGPL license + stale (July 2024) + English-only |
| MetaVoice | Abandoned (July 2024) + no Indian languages |
| CSM-1B | Weak cloning + English-only + no streaming |
| Spark-TTS | Chinese + English only + unclear license |
| GLM-TTS | Chinese-first + no Indian languages |
| MaskGCT | VRAM-hungry + English/Chinese only + no streaming |
