# Research: Indian Language Speech for Annie on Pixel 9a

**Date:** 2026-03-31
**Status:** Models installed and benchmarked on Panda. E2E pipeline validated. TTS latency optimized (EPSS7+BF16, RTF 0.082).
**Relevance:** Annie speaking/understanding Kannada, Hindi, English on the Pixel 9a via Panda

---

## Architecture: Audio Flow

Annie doesn't process speech on the Pixel. The Pixel is a dumb terminal (mic + speaker). Audio flows:

```
Pixel 9a (mic) → USB ADB / WebSocket → Panda (RTX 5070 Ti, 16 GB VRAM)
                                              ↕ STT + TTS here
Pixel 9a (speaker) ← USB ADB / WebSocket ← Panda ← response audio
```

Panda is the speech processing hub (x86_64, no aarch64 issues). Titan handles LLM reasoning via SSH.
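A minimal sketch of how the Pixel→Panda audio frames might be packaged over the WebSocket link, assuming 16 kHz mono 16-bit PCM. The header layout and function names are illustrative, not the deployed protocol:

```python
import struct

# Illustrative framing for streaming mic audio over the WebSocket bridge.
# Header: little-endian (sequence number, sample count); payload: 16-bit PCM.
HEADER = struct.Struct("<IH")

def pack_frame(seq: int, pcm: bytes) -> bytes:
    """Prefix a raw PCM chunk with a sequence number and sample count."""
    return HEADER.pack(seq, len(pcm) // 2) + pcm

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Recover (sequence number, PCM payload) on the Panda side."""
    seq, n_samples = HEADER.unpack_from(frame)
    return seq, frame[HEADER.size : HEADER.size + n_samples * 2]
```

Sequence numbers let the Panda side detect dropped frames over a lossy link before they show up as garbled ASR input.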

---

## Key Insight: Meaning Over Accuracy

**The ASR transcript is intermediate — no human reads it.** The real metric is whether the LLM responds correctly, NOT transcription WER. Nemotron Nano is multilingual and handles noisy/garbled code-mixed text. Don't over-optimize ASR sub-components — test end-to-end.

**Implication for mixed Kannada-English (Rajesh):** Whisper's imperfect transcription may be sufficient as long as Nemotron understands the intent.

**Implication for pure Kannada (Mom):** IndicConformerASR gives best accuracy for single-language Kannada.
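The routing these two implications suggest can be captured in a few lines. Model names and the `code_mixed` flag are illustrative; the latency comments reference the benchmarks later in this document:

```python
def pick_stt(language_hint: str, code_mixed: bool) -> str:
    """Route per the meaning-over-accuracy principle: the fast
    single-language model when we can, Whisper's auto-detect when
    speech may code-switch."""
    if code_mixed:
        return "whisper-large-v3"      # handles code-switching, ~805 ms
    if language_hint in {"kn", "hi"}:  # 22 scheduled languages in practice
        return "indic-conformer-600m"  # ~145 ms, best single-language accuracy
    return "whisper-large-v3"          # English and everything else
```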

---

## STT (Speech Recognition) — BENCHMARKED ON PANDA

### Comparison

| Model | Params | Languages | Kannada | Code-Mixed | VRAM | License |
|-------|-------:|-----------|---------|------------|-----:|---------|
| **IndicConformerASR 600M** ★ | 600M | 22 Indian | Best | No (single-lang only) | **303 MB** | Open (MIT) |
| Whisper medium | 769M | 99 | Weak (garbled) | Yes (auto-detect) | 2,915 MB | MIT |
| **Whisper large-v3** ★ | 1.55B | 99 | **Perfect** | Yes | **6,029 MB** | MIT |
| Whisper large-v3-turbo | 809M | 99 | Good (minor errors) | Yes (misdetects as Tamil!) | 3,171 MB | MIT |
| Sarvam Saaras v3 | API | 22 Indian | Good | **Yes (native)** | 0 | Paid (₹30/hr) |
| Pingala V1 Universal | 800M | 204 | Claimed | No | ~3 GB | RAIL-M |
| Meta Omnilingual ASR | 300M-7B | 1600+ | Unknown | Unclear | 1-14 GB | Apache 2.0 |
| Google Cloud STT | API | 73 | Good | Partial | 0 | Paid ($0.024/min) |

### IndicConformerASR 600M — BENCHMARK RESULTS (Panda RTX 5070 Ti)

```
Model load:     2.2s (cached), 107s (first time, 404 files)
VRAM:           303 MB (266 MB model + 38 MB inference buffers)
Execution:      GPU (ONNX Runtime 1.24.4, CUDAExecutionProvider confirmed)
```

| Audio Duration | CTC Latency | RNNT Latency | RTF (RNNT) |
|---------------:|------------:|-------------:|-----------:|
| 1s (Kannada) | 275ms | 109ms | 0.109 |
| 3s (Kannada) | 135ms | 149ms | 0.050 |
| 5s (Kannada) | 142ms | 262ms | 0.052 |
| 10s (Kannada) | 303ms | 285ms | 0.028 |
| 3s (Hindi) | 126ms | 120ms | 0.040 |

**Warm run (10x, 3s Kannada RNNT):** Avg 145ms, P50 146ms, Min 117ms, Max 165ms, **RTF 0.048 (21x real-time)**
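For reference, RTF is just processing time divided by audio duration — a quick sanity check of the warm-run figure:

```python
def rtf(latency_s: float, audio_s: float) -> float:
    """Real-time factor: processing time / audio duration (lower is better)."""
    return latency_s / audio_s

warm = rtf(0.145, 3.0)   # 145 ms average on 3 s of Kannada audio
print(round(warm, 3))    # 0.048
print(round(1 / warm))   # 21 → "21x real-time"
```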

**English:** NOT supported. KeyError on `en`/`eng`/`english`. Model covers 22 Indian scheduled languages only.

**Code-mixed:** NOT supported (official docs confirmed). Single-language per inference call. No auto-detect mode.

### Code-Mixed Speech: The Gap

**No open-source Kannada-English code-mixed ASR exists.** Only Hindi-English has one (`shunyalabs/zero-stt-hinglish`). Options for Rajesh's mixed Kannada-English:

1. **Sarvam Saaras v3 API** (best) — ₹30/hr, ₹1K free credits (~33 hrs). Handles code-mixed natively. 5 output formats (transcribe, codemix, verbatim, translit, translate). Streaming WebSocket. Pipecat integration. **Latency: 567ms for 3s audio** (transcribe mode), ~2s for codemix/verbatim. API key in `.env.eval`.

2. **Whisper** (free fallback) — auto-detect, handles code-switching but struggles with Indian accents. Medium model installed on Panda GPU.

3. **IndicConformerASR in `kn` mode** — Kannada words OK, English garbled. But per the "meaning over accuracy" insight, the LLM may still understand.
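A sketch of the fallback order for Rajesh's path — try the paid Sarvam API first, drop to local Whisper on any API failure. `sarvam_call` and `whisper_call` are stand-ins for the real clients, not actual SDK names:

```python
def transcribe_code_mixed(audio: bytes, sarvam_call, whisper_call):
    """Option 1 with option 2 as fallback: try the Sarvam API first,
    fall back to local Whisper if the API errors or the key is missing.
    Returns (backend_used, transcript)."""
    try:
        return "sarvam", sarvam_call(audio)
    except Exception:
        return "whisper", whisper_call(audio)
```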

### Installation Notes (Panda)

```
Dependencies: transformers 4.57.6, torchaudio 2.11.0, onnxruntime-gpu 1.24.4, torchcodec 0.11.0, soundfile
HF token:     ~/.cache/huggingface/token (gated model, must accept at HuggingFace)
Model cache:  ~/.cache/huggingface/hub/models--ai4bharat--indic-conformer-600m-multilingual/
ONNX CUDA:    onnxruntime-gpu 1.20.1 had CUDA 12/13 mismatch → upgraded to 1.24.4 (fixed)
```

---

## TTS (Speech Generation) — BENCHMARKED ON PANDA

### Comparison

| Model | Params | Languages | Kannada | Voice Cloning | VRAM | RTF | License |
|-------|-------:|-----------|---------|---------------|-----:|----:|---------|
| **IndicF5** ★ | 400M | 11 Indian | Yes | Yes (3-sec ref) | **1.3 GB** | **0.082 (EPSS7+BF16)** | MIT |
| Sarvam Bulbul v3 | API | 11 Indian | Yes | No | 0 | API | Paid |
| Indic Parler-TTS Mini | ~880M | 21 Indian | Yes + emotions | No | ~4-6 GB | — | Open (free) |
| Kokoro (current Annie) | 82M | English | No | No | ~2 GB | — | Apache 2.0 |
| Google Cloud TTS | API | 75+ | Yes | No | 0 | API | Paid |

### IndicF5 TTS — BENCHMARK RESULTS (Panda RTX 5070 Ti)

```
Model load:     2.0s (INF5Model with patched model.py)
VRAM:           1,347 MB
Output:         24000 Hz WAV
```

| Config | Latency | Audio | RTF | Speedup |
|--------|---------|-------|-----|---------|
| NFE32 FP32 (original) | 2284ms | 3.51s | 0.651 | 1.0x |
| NFE16 FP32 | 2287ms | 3.51s | 0.652 | 1.0x |
| NFE16 + FP16 | 1185ms | 3.51s | 0.338 | 1.9x |
| EPSS7 FP32 | 527ms | 3.51s | 0.150 | 4.3x |
| **EPSS7 + BF16** | **285ms** | **3.51s** | **0.082** | **8.0x** |

**RTF 0.082 with EPSS7+BF16 = 12x faster than real-time.** TTS is no longer the bottleneck.

**CRITICAL: FP16 is BROKEN** — Vocos vocoder uses complex numbers that overflow in FP16 (ComplexHalf). Use BF16 only (`torch.autocast("cuda", torch.bfloat16)`).
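A toy illustration of the failure mode, stdlib only (`struct`'s `'e'` format is IEEE float16): values past 65504 can't be represented in FP16, which is exactly where large STFT intermediates like Vocos' can land, while BF16 keeps FP32's 8-bit exponent (range ~3e38) and does not overflow:

```python
import struct

def fits_in_fp16(x: float) -> bool:
    """True if x is representable as IEEE float16 (max ~65504)."""
    try:
        struct.pack("e", x)
        return True
    except OverflowError:
        return False

print(fits_in_fp16(6.0e4))  # True  — within float16 range
print(fits_in_fp16(1.0e5))  # False — overflows, like ComplexHalf in Vocos
```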

**EPSS 7-step schedule:** Non-uniform time steps `[0, 2/32, 4/32, 6/32, 8/32, 16/32, 24/32, 1.0]` replace 32 uniform steps. Training-free, from Fast F5-TTS paper (Interspeech 2025).
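Restating the schedule from the paper as code — dense steps early, where flow-matching error accumulates fastest, coarse steps later:

```python
# EPSS 7-step schedule (Fast F5-TTS, Interspeech 2025) vs 32 uniform steps.
EPSS = [0, 2/32, 4/32, 6/32, 8/32, 16/32, 24/32, 1.0]
UNIFORM = [i / 32 for i in range(33)]

assert len(EPSS) - 1 == 7 and len(UNIFORM) - 1 == 32
assert EPSS[0] == 0.0 and EPSS[-1] == 1.0

# Early step sizes are small (1/16), later ones large (1/4):
print(EPSS[1] - EPSS[0], EPSS[-1] - EPSS[-2])  # 0.0625 0.25
```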

### Whisper STT — BENCHMARK RESULTS (Panda RTX 5070 Ti)

Test audio: gTTS Kannada "ನಮಸ್ಕಾರ, ನಾನು ಆನಿ. ನಿಮಗೆ ಹೇಗೆ ಸಹಾಯ ಮಾಡಬಹುದು?" (5.42s)

| Model | Latency | VRAM | Kannada Transcription | Auto-detect |
|-------|---------|------|-----------------------|-------------|
| medium | 521ms | 2,915 MB | ನಾಮಸ್ಕರಾ ನಾನೆ ವಾನೆ... (garbled) | kn ✓ |
| **large-v3** | **805ms** | **6,029 MB** | **ನಮಸ್ಕಾರ ನಾನು ಆನೀ ನಿಮಗೆ ಹೇಗೆ ಸಹಾಯ ಮಾಡಬಹುದು?** (perfect) | kn ✓ |
| large-v3-turbo | 226ms | 3,171 MB | ನಮಸ್ಕಾರ ನಾನುವಾನಿ... (minor errors) | ta ✗ |

**Whisper large-v3 is the clear winner for Kannada.** Perfect transcription. Medium garbles almost every word.
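One operational takeaway worth encoding: pin `language="kn"` instead of trusting auto-detect, since turbo misread the clip as Tamil. A wrapper sketch — `model_transcribe` stands in for whisper's `model.transcribe`, and the `language` keyword matches the openai-whisper API:

```python
def transcribe_kannada(model_transcribe, audio_path: str):
    """Force Kannada decoding rather than auto-detect; large-v3-turbo
    misidentified the test clip as Tamil (see table above)."""
    return model_transcribe(audio_path, language="kn")

# Usage with the real library would be roughly:
#   model = whisper.load_model("large-v3")
#   result = transcribe_kannada(model.transcribe, "clip.wav")
```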

### Installation Notes (Panda)

```
pip install git+https://github.com/ai4bharat/IndicF5.git
Model.py patched (backup at model.py.bak):
  1. torch.compile removed (crashes Python 3.12)
  2. Safetensors loading uncommented + key remapping:
     state_dict = {k.replace("ema_model._orig_mod.", ""): v for k, v in state_dict.items() if k.startswith("ema_model.")}
  3. strict=True in load_state_dict (catches key mismatches)
  4. transformers 4.57.6 (downgraded from 5.4)
  5. torchcodec 0.11.0 installed
  6. HF gated model — must accept at https://huggingface.co/ai4bharat/IndicF5
Reference prompts: Only PAN_F_HAPPY_00001.wav (Punjabi, 8.1s) and MAR_F_HAPPY_00001.wav (Marathi).
  Prompt determines VOICE STYLE, not output language. Kannada text → Kannada speech.
  For production: record 3s of Annie's desired voice as reference.
See memory/project_indicf5_loading_gotchas.md for full loading bug documentation.
```
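The key remapping in patch step 2 can be exercised standalone — it strips the `torch.compile` wrapper prefix from EMA checkpoint keys and drops non-EMA entries:

```python
def remap_ema_keys(state_dict: dict) -> dict:
    """Mirror the model.py patch: keep only ema_model.* keys and strip
    the torch.compile '_orig_mod.' wrapper prefix."""
    return {
        k.replace("ema_model._orig_mod.", ""): v
        for k, v in state_dict.items()
        if k.startswith("ema_model.")
    }

raw = {"ema_model._orig_mod.proj.weight": 1, "optimizer.step": 2}
print(remap_ema_keys(raw))  # {'proj.weight': 1}
```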

---

## E2E Pipeline — VALIDATED

### Pure Kannada (Mom's use case)

Tested by sending Kannada text to Nemotron Nano on Titan:
- Input: "ನನಗೆ ತಲೆನೋವು ಬಂದಿದೆ ಮಾತ್ರೆ ತೆಗೆದುಕೊಳ್ಳಬೇಕಾ" (I have a headache, should I take medicine?)
- Output: Responded in Kannada with step-by-step advice (rest, low light, etc.)
- **LLM understands and responds in Kannada.** Minor issue: occasionally mixes Arabic script. Fix with system prompt: "respond ONLY in Kannada script."
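A sketch of the system-prompt fix as an OpenAI-style chat payload for the vLLM endpoint on Titan (the prompt wording is illustrative, not the deployed one):

```python
SYSTEM_PROMPT = (
    "You are Annie. Respond ONLY in Kannada script (no Latin, no Arabic "
    "script). Keep answers short and spoken-style."
)

def build_messages(user_text: str) -> list[dict]:
    """Chat-completions message list pinning the output script."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]
```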

### Full Pipeline Latency

| Step | Component | Location | Latency | Status |
|------|-----------|----------|--------:|--------|
| Audio capture | BT HFP (pw-record) | Panda | **~30ms** | Validated with iPhone |
| Speech → Text | IndicConformerASR 600M | Panda GPU | **145ms** | Verified, 21x real-time |
| Speech → Text | Whisper large-v3 (code-mixed) | Panda GPU | **805ms** | Perfect Kannada |
| Text → Response | Nemotron Nano 30B | Titan vLLM | **~500ms** | Verified, understands Kannada |
| Response → Speech | IndicF5 TTS (EPSS7+BF16) | Panda GPU | **~285ms** | Verified, RTF 0.082 |
| Audio playback | BT HFP (pw-play) | Panda | **~30ms** | Validated with iPhone |
| **Total (IndicConformer)** | | | **~1.0s** | Fast path (pure Kannada) |
| **Total (Whisper large-v3)** | | | **~1.6s** | Code-mixed path |
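The totals are just the column sums — worth keeping as executable arithmetic so the budget stays honest as components change (numbers from the table above):

```python
# Shared components (ms): capture, LLM, TTS, playback.
BASE_MS = {"capture": 30, "llm": 500, "tts": 285, "playback": 30}

fast = sum(BASE_MS.values()) + 145     # IndicConformer STT path
codemix = sum(BASE_MS.values()) + 805  # Whisper large-v3 STT path

print(fast, codemix)  # 990 1650  (~1.0 s and ~1.6 s)
```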

### Mixed Kannada-English (Rajesh's use case)

- Whisper medium on Panda GPU for STT (auto-detect, code-mixed)
- OR Sarvam Saaras v3 API (best quality, ₹30/hr)
- Nemotron Nano handles garbled code-mixed text adequately
- Per "meaning over accuracy" principle, ASR quality is less critical than E2E conversation quality

---

## VRAM Budget on Panda (RTX 5070 Ti, 16 GB) — MEASURED

| Model | VRAM (Measured) | Status |
|-------|----------------:|--------|
| IndicConformerASR 600M | 303 MB | Installed, GPU verified |
| IndicF5 TTS | 1,347 MB | Installed, EPSS7+BF16 working |
| Whisper large-v3 | 6,029 MB | Installed, perfect Kannada |
| Qwen3-VL-2B (vision) | 1,900 MB | Installed via Ollama |
| **Total (all 4 loaded)** | **~9,579 MB** | |
| Headroom | **~6,724 MB** | Room for future models |
| **Panda RTX 5070 Ti** | **16,303 MB** | 41% free |

Note: Not all models loaded simultaneously. STT + TTS = ~7.4 GB concurrent. Vision loaded on-demand.
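The budget rows check out against the measured numbers above:

```python
vram_mb = {
    "indic_conformer_600m": 303,
    "indicf5_tts": 1347,
    "whisper_large_v3": 6029,
    "qwen3_vl_2b": 1900,
}
total = sum(vram_mb.values())
headroom = 16303 - total
print(total, headroom, f"{100 * headroom / 16303:.0f}% free")  # 9579 6724 41% free
```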

---

## Live Demo

`scripts/live_asr_demo.py` on Panda port 8765. Three models:
- **Sarvam Saaras v3** (default) — mixed language, API-based, 5 output modes
- **Whisper medium** — local GPU fallback, auto-detect
- **IndicConformer** — pure Indian language, fastest (145ms)

Accessible via HTTPS cloudflared tunnel for browser mic access. WebM→WAV conversion via ffmpeg.

---

## Phone Call Audio Flow (Mom ↔ Annie)

```
[Mom calls Annie's Airtel SIM]
    ↓
Pixel 9a auto-answers (ADB keyevent)
    ↓
Audio captured → WebSocket/scrcpy → Panda
    ↓
IndicConformerASR (Kannada STT) → text
    ↓
SSH to Titan → Nemotron LLM → response text
    ↓
Back to Panda → IndicF5 (Kannada TTS) → audio
    ↓
Audio injected → Pixel speaker/call → Mom hears Annie
```
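The auto-answer step might look like this — `KEYCODE_CALL` picks up an incoming call on most Android builds, but this needs verifying on the 9a. The sketch only constructs the command; the actual `subprocess.run` is commented out:

```python
import subprocess

def answer_call(serial: str) -> list[str]:
    """Build the ADB keyevent that auto-answers an incoming call.
    KEYCODE_CALL (keycode 5) answers on most Android builds."""
    cmd = ["adb", "-s", serial, "shell", "input", "keyevent", "KEYCODE_CALL"]
    # subprocess.run(cmd, check=True)  # uncomment on a host with adb + the Pixel attached
    return cmd
```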

---

## Annie's Voice Identity Strategy

Two options:
1. **Keep Kokoro on Titan for English** (existing Pipecat pipeline) + IndicF5 on Panda for Indian languages
2. **Use IndicF5 on Panda for ALL languages** — voice-cloned Annie identity sounds the same in English, Kannada, Hindi

Option 2 is more elegant — Annie has ONE voice identity across all languages.

To create Annie's voice: record 3 seconds of desired voice → IndicF5 clones it across all 11 languages.

---

## Original Decision Rationale

### Why IndicConformerASR 600M (STT)

- Built by IIT Madras (AI4Bharat) specifically for Indian languages
- First open-source ASR covering all 22 scheduled Indian languages
- Conformer architecture (attention + convolution hybrid)
- NeMo-based, well-documented self-hosting
- GitHub: https://github.com/AI4Bharat/IndicConformerASR
- HuggingFace: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual

### Why IndicF5 (TTS)

- Trained on 1,417 hours of Indian speech data
- 11 languages: Hindi, Kannada, Bengali, Gujarati, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, Assamese
- **Voice cloning** — 3 seconds of reference audio → Annie speaks in that voice
- Based on F5-TTS architecture
- Can create Annie's own voice identity that speaks Kannada!
- GitHub: https://github.com/AI4Bharat/IndicF5
- HuggingFace: https://huggingface.co/ai4bharat/IndicF5

### Runner-up TTS: Indic Parler-TTS Mini

- 21 languages (most comprehensive — covers Bodo, Dogri, Santali too)
- Text-prompted voice control: "Speak in a warm, calm Kannada female voice"
- 10 languages support emotion prompts (including Kannada)
- No voice cloning but natural-sounding with descriptive prompts
- HuggingFace: https://huggingface.co/ai4bharat/indic-parler-tts

### Fallback STT: Google Cloud STT

- $0.024/min, 60 min/month free
- Use for phone calls to mom where accuracy is critical

### Fallback TTS: Google Cloud TTS

- kn-IN (Kannada), hi-IN (Hindi) voices
- WaveNet/Neural2 quality is industry-leading
- Free tier: 1M WaveNet chars/month (~2-3 hours speech)
- $4/M chars (Standard), $16/M chars (WaveNet/Neural2)

### Additional STT Models Considered

- **IndicConformerASR 130M** — lighter variant of the 600M model, 22 Indian languages, ~1 GB VRAM
- **NVIDIA Parakeet 1.1B** — Hindi confirmed, Kannada uncertain, ~3 GB VRAM, Apache 2.0

### Additional TTS Models Considered

- **Vakyansh TTS** — ~100M params, Hindi/Kannada/Tamil, ~2 GB, Docker+GPU, open source
- **AI4Bharat Indic-TTS** — ~50M params, 13 Indian languages, ~1-2 GB, FastPitch+HiFi-GAN

---

## Implementation Steps

1. Buy Pixel 9a at Croma (pending)
2. ~~Install IndicF5 + IndicConformerASR on Panda~~ (**DONE** — both installed and benchmarked)
3. Create Annie's voice identity (3-sec reference audio → IndicF5 clones it)
4. Build audio bridge (Android app or scrcpy audio routing to stream Pixel audio ↔ Panda)
5. Test Kannada quality — have Annie say a few sentences, iterate on voice
6. Wire into Annie's kernel as new speech tools
7. ~~Optimize IndicF5 TTS latency~~ (**DONE** — EPSS7+BF16 achieves RTF 0.082, well under the < 0.5 target)

---

## Open Issues

1. **IndicF5 TTS latency — RESOLVED.** Initial RTF was 0.808 (already faster than real-time, but too slow for conversational turn-taking). Full optimization research: see `docs/RESEARCH-INDICF5-TTS-OPTIMIZATION.md`. The EPSS 7-step schedule + BF16 autocast landed RTF 0.082 (benchmarked above). Notes on the other options evaluated:
   - NFE 32→16 alone: no measurable gain in FP32 (see benchmark table)
   - FP16 (`model.half()`): broken — Vocos complex-number overflow; use BF16 only
   - Trim reference audio to 1.5s: estimated 20% bonus, not yet applied
   - Also evaluated: torch.compile, ONNX/TensorRT, INT8, Indic Parler-TTS (slower), Smallest.ai Lightning API (sub-100ms TTFB fallback)

2. **No Kannada reference prompt** — IndicF5 ships only Punjabi + Marathi prompts. Need to record Annie's voice as reference.

3. **transformers version conflict** — Downgraded to 4.57.6 for IndicF5. May conflict with IndicConformerASR or other models if loaded in same process. Consider separate venvs or process isolation.

4. **torch.compile disabled** — IndicF5 uses torch.compile for vocoder + model, but crashes on Python 3.12 + torchaudio meta tensors. Performance may improve with Python 3.10 or when upstream fixes land.

---

## Sources

- [AI4Bharat IndicF5 — HuggingFace](https://huggingface.co/ai4bharat/IndicF5)
- [AI4Bharat IndicF5 — GitHub](https://github.com/AI4Bharat/IndicF5)
- [AI4Bharat Indic Parler-TTS — HuggingFace](https://huggingface.co/ai4bharat/indic-parler-tts)
- [AI4Bharat IndicConformerASR — GitHub](https://github.com/AI4Bharat/IndicConformerASR)
- [IndicConformer 600M Multilingual — HuggingFace](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)
- [AI4Bharat Models Portal](https://models.ai4bharat.org/)
- [Sarvam Saaras v3 STT — Blog](https://www.sarvam.ai/blogs/asr)
- [Sarvam STT API](https://www.sarvam.ai/apis/speech-to-text)
- [Sarvam Pricing](https://www.sarvam.ai/api-pricing)
- [Sarvam Pipecat Integration](https://docs.pipecat.ai/server/services/stt/sarvam)
- [Sarvam Bulbul v3 TTS](https://www.sarvam.ai/blogs/bulbul-v3)
- [Pingala V1 Universal — HuggingFace](https://huggingface.co/shunyalabs/pingala-v1-universal)
- [Zero-STT-Hinglish — HuggingFace](https://huggingface.co/shunyalabs/zero-stt-hinglish)
- [Meta Omnilingual ASR — GitHub](https://github.com/facebookresearch/omnilingual-asr)
- [Google Cloud TTS Pricing](https://cloud.google.com/text-to-speech/pricing)
- [Google Cloud STT Pricing](https://cloud.google.com/speech-to-text/pricing)
- [F5-TTS on RTX 5070 WSL2 Guide](https://sneekes.app/posts/f5-tts-installation-guide-for-rtx-5070-on-wsl2/)
- [Whisper Indian Language Enhancement Research](https://arxiv.org/html/2412.19785v1)
- [Whisper Code-Switching Adaptation](https://arxiv.org/html/2412.16507v2)
- [Vakyansh TTS API](https://open-speech-ekstep.github.io/tts_model_api/)
- [Sarvam Bulbul v3 + Pipecat Integration](https://dev.to/agent_paaru/indian-language-tts-for-your-ai-agent-integrating-sarvamai-bulbul-v3-with-openclaw-1fdg)
