# Annie Voice Pipeline — Architecture & Benchmarks

**Date:** 2026-03-31 (Session 380)
**Status:** All components benchmarked. BT HFP validated. Ready to build conversation loop.

---

## Architecture

```
                    ANNIE VOICE PIPELINE — COMPLETE ARCHITECTURE
                    ════════════════════════════════════════════

    ┌─────────────────────────────────────────────────────────────────┐
    │                     PHONE (Pixel 9a / iPhone)                   │
    │                                                                 │
    │   Mom calls ──→ Airtel SIM ──→ Auto-answer                     │
    │                                                                 │
    │   ┌──────────┐                              ┌──────────┐       │
    │   │   MIC    │──── BT HFP downlink ────────→│ SPEAKER  │       │
    │   └──────────┘     (~30ms)                  └──────────┘       │
    │        ▲                                         │              │
    │        │              Bluetooth SCO              │              │
    │        └─────── BT HFP uplink ──────────────────┘              │
    │                    (~30ms)                                      │
    └────────────────────────┬──────────────────────┬─────────────────┘
                             │                      ▲
                             ▼                      │
    ┌────────────────────────────────────────────────────────────────┐
    │                PANDA (RTX 5070 Ti, 16 GB VRAM)                 │
    │                Ubuntu, x86_64, 192.168.68.57                   │
    │                                                                │
    │   ┌─────────────────────────────────────────────────────────┐  │
    │   │  pw-record --target bluez_input.<MAC>.0                 │  │
    │   │  --rate 16000 --channels 1 --format s16                 │  │
    │   └───────────────────────┬─────────────────────────────────┘  │
    │                           │ PCM 16kHz mono                     │
    │                           ▼                                    │
    │   ┌──────────────── STT (choose one) ──────────────────────┐  │
    │   │                                                         │  │
    │   │  PATH A: Pure Kannada (Mom)                             │  │
    │   │  ┌──────────────────────────────┐                       │  │
    │   │  │ IndicConformerASR 600M       │                       │  │
    │   │  │ ONNX Runtime GPU             │                       │  │
    │   │  │ 303 MB VRAM                  │                       │  │
    │   │  │ ⏱ 145ms (RTF 0.048)          │                       │  │
    │   │  └──────────────────────────────┘                       │  │
    │   │                                                         │  │
    │   │  PATH B: Code-Mixed Kannada+English (Rajesh)            │  │
    │   │  ┌──────────────────────────────┐                       │  │
    │   │  │ Whisper large-v3             │                       │  │
    │   │  │ PyTorch GPU                  │                       │  │
    │   │  │ 6,029 MB VRAM               │                       │  │
    │   │  │ ⏱ 805ms (perfect Kannada)    │                       │  │
    │   │  └──────────────────────────────┘                       │  │
    │   └─────────────────────┬───────────────────────────────────┘  │
    │                         │ text (Kannada/English)               │
    │                         ▼                                      │
    │   ┌─────────────────────────────────────────────────────────┐  │
    │   │  SSH to Titan ──→ HTTP POST to vLLM :8003               │  │
    │   └───────────────────────┬─────────────────────────────────┘  │
    │                           │                                    │
    └───────────────────────────┼────────────────────────────────────┘
                                │
                                ▼
    ┌────────────────────────────────────────────────────────────────┐
    │          TITAN (DGX Spark, 128 GB Unified Memory)              │
    │                                                                │
    │   ┌──────────────────────────────┐                             │
    │   │ Nemotron Nano 30B NVFP4      │                             │
    │   │ vLLM port 8003               │                             │
    │   │ 18 GB VRAM                   │                             │
    │   │ ⏱ ~500ms                     │                             │
    │   │ Understands Kannada + English │                             │
    │   └──────────────┬───────────────┘                             │
    │                  │ response text                                │
    └──────────────────┼─────────────────────────────────────────────┘
                       │
                       ▼
    ┌────────────────────────────────────────────────────────────────┐
    │                PANDA (continued)                                │
    │                                                                │
    │   ┌──────────────────────────────┐                             │
    │   │ IndicF5 TTS (EPSS7 + BF16)  │                             │
    │   │ 1,347 MB VRAM               │                             │
    │   │ ⏱ 285ms (RTF 0.082)         │                             │
    │   │ 24kHz WAV output            │                             │
    │   │ Voice cloning (3s ref)      │                             │
    │   └──────────────┬───────────────┘                             │
    │                  │ PCM 24kHz                                   │
    │                  ▼                                              │
    │   ┌─────────────────────────────────────────────────────────┐  │
    │   │  pw-play --target bluez_output.<MAC>.1                  │  │
    │   └─────────────────────────────────────────────────────────┘  │
    │                  │                                              │
    │   ┌──────────────┴───────────────────────────────────────┐    │
    │   │  Also on Panda (available, not in voice loop):       │    │
    │   │  • Qwen3-VL-2B (1,900 MB) — phone screen vision     │    │
    │   │  • Ollama 0.19.0 runtime                             │    │
    │   └──────────────────────────────────────────────────────┘    │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
                                │
                                ▼ BT HFP uplink
                          Mom hears Annie
```

---

## Latency Benchmarks — All Paths

### PATH A: Pure Kannada (Mom's call) — ~1.0s total

| Step | Component | Latency |
|------|-----------|--------:|
| BT capture | pw-record | 30ms |
| STT | IndicConformerASR 600M | 145ms |
| LLM | Nemotron Nano 30B (Titan SSH) | 500ms |
| TTS | IndicF5 EPSS7+BF16 | 285ms |
| BT play | pw-play | 30ms |
| **Total** | | **990ms** |

### PATH B: Code-Mixed Kannada+English (Rajesh) — ~1.6s total

| Step | Component | Latency |
|------|-----------|--------:|
| BT capture | pw-record | 30ms |
| STT | Whisper large-v3 | 805ms |
| LLM | Nemotron Nano 30B (Titan SSH) | 500ms |
| TTS | IndicF5 EPSS7+BF16 | 285ms |
| BT play | pw-play | 30ms |
| **Total** | | **1650ms** |

---

## VRAM Budget — Panda RTX 5070 Ti (16,303 MB)

| Model | VRAM | Status |
|-------|-----:|--------|
| Whisper large-v3 | 6,029 MB | Deployed |
| Qwen3-VL-2B | 1,900 MB | Deployed (Ollama, stopped during voice test) |
| IndicF5 TTS | 1,347 MB | Deployed (Kannada) |
| Kokoro 82M | 554 MB | Deployed (English) |
| IndicConformerASR | 303 MB | Deployed |
| **Total (voice test)** | **8,233 MB** | **50%** (without Qwen3-VL) |
| **Total (all)** | **10,133 MB** | **62%** |
| **Free** | **6,170 MB** | **38%** |

---

## Component Benchmarks

### TTS Comparison (same Kannada text, Panda RTX 5070 Ti)

| Engine | Latency | RTF | Type | Voice Clone | Quality |
|--------|--------:|----:|------|:-----------:|---------|
| **IndicF5 EPSS7+BF16** | **285ms** | 0.082 | Local GPU | Yes | Good (Kannada) |
| **Kokoro 82M** | **~30ms** | 0.008 | Local GPU | No | **Best (English)** |
| gTTS (Google) | 519ms | 0.096 | Cloud API | No | Nice (both) |
| IndicF5 EPSS7 FP32 | 527ms | 0.150 | Local GPU | Yes | Good (Kannada) |
| Sarvam Bulbul v3 | 1112ms | 0.366 | Cloud API | No | Not bad |
| IndicF5 NFE32 FP32 | 2284ms | 0.651 | Local GPU | Yes | Good (Kannada) |
| IndicF5 EPSS7+FP16 | 286ms | — | Local GPU | — | **BROKEN** (noise) |

**IndicF5 for Kannada, Kokoro for English.** IndicF5 produces unintelligible output for English (Indian phoneme processing). Kokoro 82M: ~554 MB VRAM, ~30ms/sentence after warmup.

**FP16 is dead. Use BF16 only** — Vocos vocoder uses complex numbers that overflow in FP16.

### STT Comparison (same Kannada audio, Panda RTX 5070 Ti)

| Model | Latency | VRAM | Kannada Quality | Auto-detect |
|-------|--------:|-----:|-----------------|:-----------:|
| IndicConformerASR | 145ms | 303 MB | Best (pure Kannada only) | N/A |
| Whisper large-v3-turbo | 226ms | 3,171 MB | Good (minor errors) | Tamil ✗ |
| Whisper medium | 521ms | 2,915 MB | Poor (garbled) | Kannada ✓ |
| **Whisper large-v3** | **805ms** | **6,029 MB** | **Perfect** | **Kannada ✓** |

---

## Key Technical Decisions

1. **BT HFP over VoIP/scrcpy**: Phone is a phone. Panda acts as Bluetooth headset. Bidirectional call audio. Zero extra hardware.
2. **EPSS 7-step**: Non-uniform time steps `[0, 2/32, 4/32, 6/32, 8/32, 16/32, 24/32, 1.0]` — training-free 4.3x speedup.
3. **BF16 over FP16**: BF16 has 8-bit exponent (same as FP32), handles Vocos complex ops. FP16 overflows.
6. **IndicF5 for Kannada, Kokoro for English**: IndicF5 speaks English with Indian phonemes (unintelligible). Kokoro 82M handles English natively (~30ms, 554 MB). Dual-TTS routing by language.
4. **Whisper large-v3 over medium**: 3x more VRAM but perfect vs garbled Kannada. Worth it.
5. **IndicF5 model.py patched**: `torch.compile` removed, safetensors key remapping (`ema_model._orig_mod.` stripped), `strict=True`.

---

## BT HFP Commands

```bash
# Pair phone (temporary, 180s)
bluetoothctl discoverable on && bluetoothctl pairable on

# Record from phone call
pw-record --target bluez_input.<MAC>.0 --rate 16000 --channels 1 --format s16 /tmp/call.wav

# Play into phone call
pw-play --target bluez_output.<MAC>.1 /tmp/response.wav
```