# Research: IndicF5 TTS Latency Optimization on Panda

**Date:** 2026-03-31
**Status:** Research complete. Recommended optimization order established.
**Target:** RTF < 0.5 (from current 0.808)
**Hardware:** Panda desktop — RTX 5070 Ti (16 GB VRAM, Blackwell, 175.8 TFLOPS FP16), x86_64, Ubuntu

---

## Current Baseline

| Metric | Value |
|--------|-------|
| Model | IndicF5 (400M params, F5-TTS based, AI4Bharat) |
| RTF | 0.808 (2112ms for 2.61s audio) |
| VRAM | 1.7 GB |
| Precision | FP32 |
| NFE steps | 32 (default) |
| torch.compile | Disabled (meta tensor crash) |
| PyTorch | 2.11.0+cu130, CUDA 13.0 |
| Vocoder | Vocos |
| Output | 24000 Hz WAV |

---

## Optimization Analysis (10 approaches)

### 1. Reduce NFE Steps (32 → 16)

**Expected RTF:** ~0.40 (2x speedup — linear relationship between NFE and compute)
**Difficulty:** 1/5
**Quality impact:** Minor (WER 2.37→2.44 on LibriSpeech; negligible for Indian languages)
**VRAM change:** None

The single biggest lever. F5-TTS default is 32 NFE. The upstream repo comments `nfe_step = 32 # 16, 32` — 16 is an officially supported value. On RTX 3090, RTF drops from ~0.12 to ~0.06 at 16 NFE. Since our baseline is ~6.5x slower than RTX 3090 benchmarks (likely due to FP32 + no compile), the same 2x ratio should apply.

**Code change:** In the inference call, pass `steps=16` instead of 32. In IndicF5's `utils_infer.py`, the global `nfe_step = 32` can be overridden at call site:

```python
# In the model.sample() call or infer_batch_process()
audio = model_obj.sample(cond=audio, text=..., duration=..., steps=16, ...)
```

Or if using the AutoModel interface:
```python
# Modify the global before inference
from f5_tts.infer import utils_infer
utils_infer.nfe_step = 16
```

**Risks:** Slight quality degradation on short utterances. Test with Kannada samples and verify subjectively.

---

### 2. Half Precision (FP16/BF16)

**Expected RTF:** ~0.40-0.50 (1.6-2x speedup from FP32)
**Difficulty:** 2/5
**Quality impact:** None to minor
**VRAM change:** 1.7 GB → ~0.9 GB

The RTX 5070 Ti has 175.8 TFLOPS FP16 vs ~44 TFLOPS FP32 — a 4x theoretical advantage. Practical speedup is 1.5-2x because the model is small enough that memory bandwidth, not raw compute, is the dominant bottleneck.

The upstream F5-TTS `utils_infer.py` already has conditional FP16 support:
```python
# From upstream F5-TTS (NOT IndicF5's fork):
if torch.cuda.get_device_properties(device).major >= 7:
    dtype = torch.float16  # Uses FP16 on Volta+ GPUs
```

IndicF5's fork does NOT enable this — it always loads in FP32. The fix:

```python
# After loading the model
model = model.half()  # Convert DiT to FP16

# For the vocoder (Vocos) — keep in FP32 to avoid artifacts
# The upstream F5-TTS also keeps BigVGAN vocoder in FP32
# Vocos should be safe in FP16, but test first
```

Alternatively, use `torch.autocast`:
```python
with torch.autocast(device_type='cuda', dtype=torch.float16):
    audio = model(text, ref_audio_path=ref_path, ref_text=ref_text)
```

**BF16 vs FP16:** RTX 5070 Ti supports both. BF16 has better dynamic range (less overflow risk), same speed. Use `torch.bfloat16` if FP16 produces artifacts.
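The choice can be made at runtime with a capability check; a minimal sketch (the commented `model.to(dtype)` line marks the hypothetical conversion point):

```python
import torch

# Prefer BF16 where supported (better dynamic range at the same speed),
# otherwise fall back to FP16.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16

# model = model.to(dtype)  # hypothetical: convert the loaded DiT in place
```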

**Risks:**
- Vocos vocoder may produce artifacts in FP16. If so, keep vocoder in FP32 and only convert the DiT transformer.
- The F5-TTS ONNX project noted "silence output when using float16" that was later fixed. Test carefully.

---

### 3. NFE 16 + FP16 Combined

**Expected RTF:** ~0.20-0.25 (combined 3-4x speedup)
**Difficulty:** 2/5
**Quality impact:** Minor
**VRAM change:** 1.7 GB → ~0.9 GB

Combining approaches 1 and 2. This is the recommended first step — zero dependencies, no retraining, minimal code changes.

**Projected latency:** 2112ms × (16/32) × (1/1.7) ≈ 620ms for 2.61s audio, assuming a 1.7x FP16 speedup. RTF ≈ 0.24.
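The combination can be sketched as a single wrapper, assuming an upstream-style `model.sample()` interface (`synthesize_fast` is a hypothetical helper; the exact argument names on IndicF5's fork may differ):

```python
import torch

def synthesize_fast(model, text, ref_audio, nfe=16):
    """Hypothetical wrapper combining both quick wins. `model.sample()`
    mirrors the upstream F5-TTS CFM interface; `text` should include the
    reference transcript followed by the target text, per upstream
    convention."""
    with torch.inference_mode():
        # Autocast runs eligible matmuls in FP16 without permanently
        # converting the weights (safer than model.half() for a first test).
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            return model.sample(
                cond=ref_audio,
                text=text,
                steps=nfe,          # 16 instead of the default 32
                cfg_strength=2.0,   # upstream default
            )
```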

---

### 4. EPSS (Fast F5-TTS) — 7-Step Generation

**Expected RTF:** ~0.15-0.20 (with FP16)
**Difficulty:** 3/5
**Quality impact:** Minor (WER 2.37→2.45, SIM-o maintained at 0.66)
**VRAM change:** None

Fast F5-TTS (Empirically Pruned Step Sampling) is a training-free, plug-and-play method that reduces NFE from 32 to 7 while maintaining quality. On RTX 3090 it achieves RTF 0.030.

**How it works:** Instead of uniform time steps, EPSS uses non-uniform steps that concentrate on the early phases (high curvature) and skip the later linear phases:
- 7-NFE time steps: `{0, 1/16, 1/8, 3/16, 1/4, 1/2, 3/4, 1}`
- Derived from indices `[0, 2, 4, 6, 8, 16, 24, 32]` scaled by 1/32

**Implementation:** Modify the ODE solver's time step schedule in the sampling function. The method is architecture-agnostic — it works with any F5-TTS variant including IndicF5.

```python
# Replace uniform steps [0/32, 1/32, 2/32, ..., 32/32]
# With EPSS steps [0, 2/32, 4/32, 6/32, 8/32, 16/32, 24/32, 32/32]
epss_steps = torch.tensor([0, 2/32, 4/32, 6/32, 8/32, 16/32, 24/32, 1.0])
```
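A minimal sketch of Euler integration over the non-uniform EPSS grid. The real F5-TTS sampler integrates with `torchdiffeq.odeint` and also applies CFG and sway sampling; `velocity_fn` here stands in for the DiT's velocity prediction:

```python
import torch

# EPSS schedule from the Fast F5-TTS paper: indices [0,2,4,6,8,16,24,32] / 32
EPSS_T = torch.tensor([0.0, 2/32, 4/32, 6/32, 8/32, 16/32, 24/32, 1.0])

def ode_sample_epss(velocity_fn, x0, t_schedule=EPSS_T):
    """Euler integration over a non-uniform time grid (sketch)."""
    x = x0
    for t_cur, t_next in zip(t_schedule[:-1], t_schedule[1:]):
        dt = t_next - t_cur
        x = x + dt * velocity_fn(x, t_cur)  # one Euler step per interval
    return x
```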

**Status:** Paper published (Interspeech 2025). Code availability uncertain — authors said "We will release our model and code." The time step schedule itself is fully specified in the paper and trivial to implement.

**Risks:** Below 6 NFE, quality degrades sharply. 7 is the sweet spot.

---

### 5. torch.compile (DiT Only, Not Vocoder)

**Expected RTF improvement:** 1.3-1.5x on top of other optimizations
**Difficulty:** 3/5
**Quality impact:** None
**VRAM change:** +200-400 MB (compilation cache)

The crash occurs because the Vocos vocoder uses meta tensors that are incompatible with `torch.compile` + Python 3.12. The fix: compile ONLY the DiT transformer, not the vocoder.

```python
# After loading the model, compile just the transformer backbone
model.transformer = torch.compile(model.transformer, mode="reduce-overhead")
# Leave vocoder uncompiled
```

`mode="reduce-overhead"` uses CUDA graphs — best for repeated inference with same input shapes. `mode="max-autotune"` tries more kernel variants but takes longer to warm up.

**First-run penalty:** 30-60 seconds for compilation. Subsequent runs benefit. For a persistent server process this is fine.
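Since the penalty is paid once per process, a warm-up pass at server startup hides it from users. A sketch, with `dummy_batch` standing in for a representative input:

```python
import torch

def warm_up(model, dummy_batch, n_runs=3):
    """Run a few inference passes at startup so compilation (and CUDA
    graph capture) happens before the first real request. `dummy_batch`
    should match the shapes seen in production."""
    with torch.inference_mode():
        for _ in range(n_runs):
            model(dummy_batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure all warm-up work has finished
```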

**Risks:**
- Python 3.12 + PyTorch 2.11 may have other compilation issues. Test.
- Dynamic sequence lengths may reduce benefits (CUDA graphs work best with fixed shapes).
- If the transformer submodule is not cleanly separable in IndicF5's `INF5Model`, some refactoring may be needed.

---

### 6. ONNX Export + TensorRT

**Expected RTF:** ~0.10-0.15 (with TensorRT on RTX 5070 Ti)
**Difficulty:** 4/5
**Quality impact:** None to minor
**VRAM change:** Similar or slightly less

DakeQQ's [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX) project provides ONNX export for the upstream F5-TTS. The ONNX model has ~1281 nodes. FP16 export is supported (`use_fp16_transformer = True`).

TensorRT RTX Execution Provider supports RTX 30xx+ (Ampere and later). The RTX 5070 Ti (Blackwell) qualifies. TensorRT provides:
- Kernel auto-tuning for specific GPU
- Layer fusion
- Dynamic shape specialization (cached kernels)

**Blockers:**
- No existing ONNX export for IndicF5 specifically. Would need to adapt DakeQQ's export script.
- IndicF5 has a custom vocab (Indian languages) — export must include the text tokenizer.
- TensorRT compilation adds startup time and requires engine caching.
- The community has requested ONNX for IndicF5 but AI4Bharat has not provided one.

**Implementation path:**
1. Clone DakeQQ/F5-TTS-ONNX
2. Point it at IndicF5's checkpoint and vocab
3. Export with `use_fp16_transformer = True`
4. Use `onnxruntime-gpu` with CUDAExecutionProvider or TensorRTExecutionProvider
5. Use I/O binding for maximum performance
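Steps 4-5 can be sketched as follows; the TensorRT EP option names (`trt_fp16_enable`, `trt_engine_cache_enable`, `trt_engine_cache_path`) follow onnxruntime's documented provider options:

```python
def build_providers(trt_cache="./trt_cache"):
    """Provider priority: TensorRT > CUDA > CPU. Engine caching keeps
    tuned kernels across process restarts, avoiding re-compilation."""
    return [
        ("TensorrtExecutionProvider", {
            "trt_fp16_enable": True,          # match the FP16 export
            "trt_engine_cache_enable": True,
            "trt_engine_cache_path": trt_cache,
        }),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]

def make_session(onnx_path):
    import onnxruntime as ort  # requires onnxruntime-gpu
    return ort.InferenceSession(onnx_path, providers=build_providers())
```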

---

### 7. Shorter Reference Audio

**Expected RTF improvement:** Modest (10-20%)
**Difficulty:** 1/5
**Quality impact:** Minor (shorter reference = slightly less voice fidelity)

F5-TTS generates audio by filling in a "masked" portion of a combined reference+target sequence. Shorter reference = shorter total sequence = fewer computations in the attention mechanism (quadratic scaling).

Current: 3-second reference. Minimum: ~1 second (enough for voice timbre). The generation duration (target text) matters more than reference length for total compute.

For a 25-char Kannada utterance generating 2.61s audio with 3s reference: the DiT processes ~5.6s total (3s ref + 2.61s gen). With 1s reference: ~3.6s total — ~36% reduction in sequence length, which translates to roughly 20-30% speedup due to quadratic attention.

**Code change:** Use a 1-1.5 second reference audio clip instead of 3 seconds.
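A minimal trim helper (a sketch; in practice, pick a 1.5s window containing clean speech rather than blindly keeping the head of the file, which may be silence):

```python
import torch

def trim_reference(wav: torch.Tensor, sample_rate: int,
                   max_seconds: float = 1.5) -> torch.Tensor:
    """Keep only the first `max_seconds` of the reference clip."""
    max_samples = int(max_seconds * sample_rate)
    return wav[..., :max_samples]
```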

**Risks:** Voice cloning quality degrades with very short references. 1.5s is likely the sweet spot.

---

### 8. Model Quantization (INT8/INT4)

**Expected RTF improvement:** 1.3-1.5x (INT8), uncertain for INT4
**Difficulty:** 4/5
**Quality impact:** Minor (INT8), potentially significant (INT4)
**VRAM change:** 1.7 GB → ~0.5-0.9 GB

IndicF5's `requirements.txt` includes `bitsandbytes`, suggesting quantization was considered. However:
- No documented INT8 inference for F5-TTS exists
- DiT models (diffusion transformers) are more sensitive to quantization than language models because they operate on continuous signals
- bitsandbytes INT8 uses vector-wise quantization with mixed-precision decomposition for outlier features — decent for LLMs, untested for TTS

**Implementation:**
```python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("ai4bharat/IndicF5", 
    quantization_config=quantization_config,
    trust_remote_code=True)
```

**Risks:** High risk of audio artifacts. Diffusion models generate spectrograms where small numerical errors become audible. INT4 is almost certainly too aggressive. INT8 might work but needs careful quality testing.

---

### 9. Indic Parler-TTS Alternative

**Expected RTF:** Worse (~2-3x slower than IndicF5)
**Difficulty:** 2/5
**Quality impact:** Different (text-prompted vs voice-cloned)
**VRAM:** ~4-6 GB

Indic Parler-TTS (ai4bharat/indic-parler-tts, 880M params) supports 21 languages including Kannada with emotion control. However:
- Users report 5-6 seconds for 10 tokens on G5 instances (much slower than IndicF5)
- 10-15 seconds for a 10-15 word sentence reported
- **Streaming mode is NOT working** — users attempted the Parler-TTS streaming guide but found it incompatible
- Flash attention support claimed but not verified
- It is autoregressive (token-by-token), unlike F5-TTS's non-autoregressive generation
- Larger model (880M vs 400M) = more compute

**Verdict:** NOT recommended for latency optimization. It's slower than IndicF5, not faster. Useful for emotion control in non-latency-critical scenarios.

---

### 10. Sarvam Bulbul v3 API / Smallest.ai Lightning API

**Sarvam Bulbul v3:**
- **Expected latency:** ~600ms (reported, REST API). Streaming WebSocket available — TTFB likely 100-300ms.
- **Difficulty:** 1/5 (API integration)
- **Quality:** High (30+ Indian voices, 11 languages including Kannada)
- **Cost:** Rs 30 per 10K characters. Free tier available.
- **Streaming:** Yes, WebSocket. `min_buffer_size` controls latency/quality tradeoff.
- **Voice cloning:** No (preset voices only)
- **Pipecat integration:** Yes, native support.

**Smallest.ai Lightning v3.1:**
- **Expected latency:** Sub-100ms TTFB. RTF 0.3. Generates 10s audio in 100ms.
- **Difficulty:** 1/5 (API integration)
- **Quality:** High (WVMOS 5.06, 44.1 kHz native)
- **Languages:** 15 including Kannada. Auto language detection. Code-switching support.
- **Cost:** $0.0135 per 1K characters (~Rs 1.13/1K chars). Much cheaper than Sarvam.
- **Streaming:** HTTP, SSE, WebSocket. Geo-routed servers in Hyderabad.
- **Voice cloning:** Yes, built-in.

**Verdict:** Smallest.ai Lightning is the strongest API fallback — sub-100ms TTFB, Kannada support, voice cloning, cheaper than Sarvam. Use as streaming fallback while local IndicF5 generates the full response.

**Hybrid strategy:** Stream first ~500ms of audio from Lightning API (TTFB ~100ms), then switch to locally generated IndicF5 audio once it's ready. Perceived latency = ~100ms.
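A heavily simplified sketch of the handoff logic. `stream_api_tts`, `local_tts`, and `player` are hypothetical placeholders for the Lightning streaming client, the IndicF5 pipeline, and the audio sink; a real implementation must resume local audio from the current playback offset rather than replaying from the start:

```python
import asyncio

async def hybrid_tts(text, stream_api_tts, local_tts, player):
    # Start local IndicF5 generation immediately in the background.
    local_task = asyncio.create_task(local_tts(text))
    # Meanwhile, play API audio as it streams in (~100 ms TTFB).
    async for chunk in stream_api_tts(text):
        if local_task.done():
            break  # local audio is ready; stop consuming the API stream
        await player.play(chunk)
    # Simplified handoff: real code must align to the playback position.
    await player.play(await local_task)
```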

---

### Bonus: Piper TTS / eSpeak

**Piper does NOT support Kannada.** Piper's language coverage is primarily European languages. Not an option.

eSpeak has Kannada phonemes but extremely robotic quality — not suitable for Annie's voice.

---

## Optimization Ranking (Impact/Effort Ratio)

| Rank | Approach | Expected RTF | Effort | Quality Loss | Recommended? |
|-----:|----------|-------------:|-------:|-----------:|:------------:|
| 1 | NFE 32→16 | ~0.40 | 1/5 | Negligible | **YES** |
| 2 | + FP16 | ~0.20-0.25 | 2/5 | None | **YES** |
| 3 | Shorter ref audio (1.5s) | -20% further | 1/5 | Minor | **YES** |
| 4 | EPSS 7-step | ~0.15-0.20 | 3/5 | Minor | YES (after 1+2) |
| 5 | torch.compile (DiT only) | -30% further | 3/5 | None | YES (after 1+2) |
| 6 | Lightning API fallback | ~100ms TTFB | 1/5 | None | YES (streaming) |
| 7 | ONNX + TensorRT | ~0.10-0.15 | 4/5 | None | Later |
| 8 | INT8 quantization | ~0.15-0.20 | 4/5 | Unknown | Risky |
| 9 | Indic Parler-TTS | Worse | 2/5 | Different | **NO** |
| 10 | Piper/eSpeak | N/A | N/A | Terrible | **NO** |

---

## Recommended Optimization Order

### Phase 1: Quick Wins (30 min, target RTF ~0.20-0.25)

1. **Reduce NFE to 16** — single line change, 2x speedup
2. **Convert model to FP16** — `model.half()` or `torch.autocast`, another 1.5-2x
3. **Trim reference audio to 1.5s** — free 20% bonus

**Expected result:** RTF ~0.20-0.25. Latency ~520-650ms for 2.61s audio. **Target achieved.**

### Phase 2: Further Optimization (2-4 hours, target RTF ~0.12-0.18)

4. **Implement EPSS 7-step schedule** — replace uniform ODE steps with `[0, 2/32, 4/32, 6/32, 8/32, 16/32, 24/32, 1.0]`
5. **torch.compile the DiT backbone** — compile only the transformer, skip vocoder

**Expected result:** RTF ~0.12-0.18. Latency ~310-470ms for 2.61s audio.

### Phase 3: API Fallback for Streaming (2 hours)

6. **Integrate Smallest.ai Lightning** as streaming fallback — sub-100ms TTFB while IndicF5 generates full audio

**Expected result:** Perceived latency ~100ms. Full audio from local IndicF5 is available by the time the first API chunk finishes playing.

### Phase 4: Maximum Performance (1-2 days, optional)

7. **ONNX export + TensorRT** — adapt DakeQQ's F5-TTS-ONNX for IndicF5 weights

**Expected result:** RTF ~0.10-0.15. Likely overkill if Phase 1-2 achieves target.

---

## Key References

### F5-TTS Upstream
- [F5-TTS Paper](https://arxiv.org/abs/2410.06885) — Architecture, sway sampling, RTF 0.15 at 16 NFE
- [F5-TTS GitHub](https://github.com/SWivid/F5-TTS) — v1.1.18 (Mar 2026), v1 model (Mar 2025)
- [F5-TTS utils_infer.py](https://github.com/SWivid/F5-TTS/blob/main/src/f5_tts/infer/utils_infer.py) — FP16 support, NFE config
- [F5-TTS Issue #224](https://github.com/SWivid/F5-TTS/issues/224) — Speed discussion, maintainer recommends reducing NFE
- [F5-TTS Issue #700](https://github.com/SWivid/F5-TTS/issues/700) — Streaming: "chunk inference... welcome pr~"

### Fast F5-TTS (EPSS)
- [Fast F5-TTS Paper](https://arxiv.org/html/2505.19931v1) — 7-step, RTF 0.030 on RTX 3090, training-free
- [Fast F5-TTS Demo](https://fast-f5-tts.github.io/) — Audio samples, PCA trajectory analysis

### ONNX
- [F5-TTS-ONNX (DakeQQ)](https://github.com/DakeQQ/F5-TTS-ONNX) — ONNX export, FP16 fix, I/O binding
- [F5-TTS-ONNX on HuggingFace](https://huggingface.co/huggingfacess/F5-TTS-ONNX) — Pre-exported ONNX models

### IndicF5
- [IndicF5 GitHub](https://github.com/AI4Bharat/IndicF5) — Source code, requirements
- [IndicF5 HuggingFace](https://huggingface.co/ai4bharat/IndicF5) — Model card, 400M params, 1417h training data
- [IndicF5 Discussion #5](https://huggingface.co/ai4bharat/IndicF5/discussions/5) — "Need of faster backend" (unresolved)
- [IndicF5 Discussion #2](https://huggingface.co/ai4bharat/IndicF5/discussions/2) — ONNX request (no response)
- [IndicF5 Discussion #1](https://huggingface.co/ai4bharat/IndicF5/discussions/1) — Upstream F5-TTS compatibility

### API Alternatives
- [Sarvam Bulbul v3](https://www.sarvam.ai/blogs/bulbul-v3) — 11 Indian languages, streaming WebSocket
- [Sarvam TTS API](https://docs.sarvam.ai/api-reference-docs/api-guides-tutorials/text-to-speech/overview) — REST + streaming
- [Sarvam Pricing](https://www.sarvam.ai/api-pricing) — Rs 30/10K chars
- [Sarvam Pipecat TTS](https://reference-server.pipecat.ai/en/stable/api/pipecat.services.sarvam.tts.html) — Native integration
- [Smallest.ai Lightning](https://smallest.ai/text-to-speech) — Sub-100ms TTFB, 15 languages, Kannada
- [Smallest.ai Lightning Blog](https://smallest.ai/blog/lightning-fastest-text-to-speech-model-by-smallestai) — RTF 0.3, voice cloning
- [Smallest.ai vs Sarvam](https://smallest.ai/blog/smallest-ai-vs-sarvam-ai) — Feature comparison

### Alternatives Evaluated
- [Indic Parler-TTS](https://huggingface.co/ai4bharat/indic-parler-tts) — 21 languages, emotions, but slow (5-6s for 10 tokens)
- [Indic Parler-TTS Speed Discussion](https://huggingface.co/ai4bharat/indic-parler-tts/discussions/10) — Streaming not working
- [Piper TTS](https://github.com/rhasspy/piper) — No Kannada support

### Hardware
- [RTX 5070 Ti Specs](https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5070-family/) — 175.8 TFLOPS FP16, 16 GB GDDR7, 896 GB/s bandwidth
- [Blackwell Architecture Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf) — 5th gen Tensor Cores, FP4 support
