# Research — Small Open-Source STT Alternatives to Whisper (2026-04-15)

**Status:** Research complete; no implementation.
**Session:** 108
**Driver:** User asked "is there a SOTA STT smaller than Whisper open-source available" for the Panda phone-call path.

---

## Context

Current production STT: **OpenAI Whisper large-v3-turbo** (PyTorch reference impl, not faster-whisper), 809M params, ~1.6 GB FP16 weights, **5.1 GB live VRAM** on Panda RTX 5070 Ti (x86_64, SM 12.0 Blackwell, CUDA 12.8+). Loaded in `services/annie-voice/phone_audio.py:509-519` by the `phone_call.py auto` daemon.

Constraints that shaped the search:
- English primary (Mom speaks English, Indic retired).
- Real-time phone-call latency target.
- Must coexist with Chatterbox 3.7 GB + E2B nav 3.2 GB + panda-nav sidecar on a 16 GB Panda.
- CTranslate2 x86_64 CUDA wheels DO exist on PyPI (the aarch64 gap from session 101 only affects Titan/Beast DGX Spark).
- License: commercial-compatible (MIT / Apache / CC-BY; NOT CC-BY-NC).

Research was conducted via three parallel Agent deep-dives covering: (a) 2024-2026 model landscape, (b) licensing + Blackwell/x86 compat, (c) real-world WER + streaming.

---

## TL;DR recommendation

**NVIDIA Parakeet TDT 0.6B v2** is the primary replacement candidate. All three research agents converged on it independently.

| Metric | Whisper large-v3-turbo (current) | Parakeet TDT 0.6B v2 | Delta |
|---|---:|---:|---:|
| Params | 809M | 600M | −26% |
| FP16 weight size | 1.6 GB | 1.2 GB | −25% |
| Live VRAM (Panda estimate) | 5.1 GB | ~3 GB | **~2 GB freed** |
| Open ASR Leaderboard avg WER | 7.75% | **6.05%** | **−22% rel.** |
| LibriSpeech test-clean | ~2.1% | **1.69%** | −19% |
| LibriSpeech test-other | ~4.4% | **3.19%** | −28% |
| Hallucination class | Seq2seq silence-phantom, repetition-loop | **Structurally immune** (RNN-T can't emit past acoustic frames) | qualitative |
| μ-law 8 kHz (phone) WER delta vs 16 kHz studio | ~+1-2 pp (not phone-trained) | **+0.27 pp** | Purpose-built |
| License | MIT | CC-BY-4.0 | both commercial-OK |
| Runtime | `whisper` PyPI pkg (heavy PyTorch ref impl) | `transformers>=4.52` (mainlined `NvidiaParakeetTDT` class — no NeMo needed) | Both pip-installable |

Secondary candidates (fallbacks if Parakeet has an unforeseen issue):

1. **NVIDIA Canary 1B Flash** — 883M, CC-BY-4.0, 6.36% avg WER, LS-clean 1.48% (best in table). Seq2seq, so hallucination-capable. No native streaming.
2. **Parakeet TDT 0.6B v3** — same 600M, 25-language, **native streaming** (2 s chunks, frame-sync token emission). Slightly worse WER than v2 (6.34% vs 6.05%) but unlocks partial hypotheses during a live call.
3. **Nemotron Speech Streaming 0.6B** (March 2026 release) — explicitly optimized for voice agents, runtime-configurable 80/160/560/1120 ms chunks. License claimed CC-BY-4.0, **verify on live card**.
4. **distil-large-v3.5** (MIT, released 2026-04-13) — drop-in faster-whisper replacement; trivial size win (~6% smaller) but 1.5× faster than turbo. Safe if you want minimal-disruption path.
5. **Canary 180M Flash** — 182M, 0.37 GB, 6.91% WER. Great size/quality tradeoff if you need a lot of VRAM headroom.
6. **Moonshine base** — 61M, MIT, ~0.12 GB, native streaming, <200 ms TTFA. But ~10% avg WER — CPU/edge fallback only, not a Whisper replacement.

---

## Open ASR Leaderboard snapshot (arXiv 2510.06961v3, Mar 2026)

Top 12 by average WER across AMI, Earnings22, GigaSpeech, LibriSpeech clean/other, SPGISpeech, TED-LIUM, VoxPopuli:

| Rank | Model | Params | Avg WER | License | Fit for her-os? |
|---|---|---:|---:|---|---|
| 1 | Canary-Qwen-2.5B (NVIDIA) | 2.5B | 5.63% | CC-BY-4.0 | ❌ 40-s cap, LLM hallucinations |
| 2 | IBM Granite Speech 3.3 8B | 8B | 5.74% | Apache-2.0 | ❌ too big |
| 3 | IBM Granite Speech 3.3 2B | 2B | 6.00% | Apache-2.0 | ⚠ Earnings-22 fails (WER 280) |
| 4 | Phi-4 Multimodal | 5.6B | 6.02% | MIT | ❌ breaks VRAM budget |
| 5 | **Parakeet-TDT-0.6B-v2** | 600M | **6.05%** | **CC-BY-4.0** | ✅ **WINNER** |
| 6 | Parakeet-TDT-0.6B-v3 (25-lang) | 600M | 6.34% | CC-BY-4.0 | ✅ streaming path |
| 7 | Canary-1B-Flash | 883M | 6.36% | CC-BY-4.0 | ✅ accuracy fallback |
| 8 | Parakeet-CTC-1.1B | 1.1B | 6.43-6.68% | CC-BY-4.0 | ✅ but larger |
| 9 | Canary-180M-Flash | 182M | 6.91% | CC-BY-4.0 | ✅ edge/size |
| 10 | **Whisper-large-v3** | 1.55B | **7.44%** | MIT | current tier |
| 11 | **Whisper-large-v3-turbo** | **809M** | **7.75%** | **MIT** | **baseline (what we run today)** |
| 12 | Distil-Large-v3 | 756M | ~7.5% | MIT | flat swap, no real gain |

**Read:** Whisper-turbo sits at rank 11. Everything above it in the commercial-OK tier is NVIDIA Parakeet/Canary family. The Whisper ecosystem is no longer SOTA by this benchmark.

---

## Why Parakeet-TDT-0.6B-v2 specifically

### Accuracy

- **LS test-clean 1.69%, test-other 3.19%** — a real, measurable step up from Whisper-turbo (2.1/4.4).
- Average WER across 8 benchmark datasets is 6.05% vs turbo's 7.75% — 22% relative reduction.
- Gains are consistent across conversational (AMI), financial (Earnings22, SPGI), and read-speech (LS, TED-LIUM) domains.

### VRAM

- FP16 weights 1.2 GB (vs turbo 1.6 GB).
- Live footprint estimated ~3 GB (vs turbo's observed 5.1 GB on Panda).
- ~2 GB saving on a tight 16 GB card is enough to:
  - Run Gemma E4B nav (needs ~5.5 GB) alongside if you ever switch off E2B.
  - Restore IndicF5 if Indic support becomes relevant again.
  - Leave permanent headroom for peak-concurrent operation.
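The headroom arithmetic above can be sanity-checked with a trivial sketch (all figures come from this document; driver and sidecar overhead are ignored, so these numbers are optimistic):

```python
def headroom_gb(stt_gb: float, card_gb: float = 16.0,
                resident_gb: tuple = (3.7, 3.2)) -> float:
    """Free VRAM once the always-on stack (Chatterbox 3.7 GB,
    E2B nav 3.2 GB) and the STT model are resident."""
    return card_gb - sum(resident_gb) - stt_gb

whisper_free = headroom_gb(5.1)   # Whisper-turbo today: ~4.0 GB free
parakeet_free = headroom_gb(3.0)  # Parakeet v2 estimate: ~6.1 GB free
```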

### Hallucinations

- **Whisper's silence-phantom and repetition-loop failures are seq2seq artifacts.** The LM prior decodes plausible-sounding text even when the audio is silent or noisy — hence the "thank you for watching subscribe to my channel" hallucinations that users routinely hit.
- **TDT (Token-and-Duration Transducer) cannot emit tokens without acoustic support.** It's frame-synchronous; no hidden decoder LM. Structurally different, not just better-tuned.
- NVIDIA also publishes an explicit MUSAN-48-hour hallucination-rate benchmark. Canary-1B-Flash scores 60.92 chars/min; Canary-Qwen-2.5B scores 138.1 (Qwen LLM decoder makes it worse). TDT/RNN-T transducers are expected to score dramatically lower but I didn't find a directly comparable Parakeet number.
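The structural claim can be illustrated with a toy greedy transducer decode (everything here is hypothetical scaffolding, not Parakeet code; a real TDT joint also predicts token durations, which this omits). Tokens can only be emitted while an acoustic frame is being consumed, so decoding over silence or exhausted audio yields nothing:

```python
import numpy as np

BLANK = 0  # transducer blank symbol

def greedy_transducer_decode(frames, joint, max_symbols_per_frame=4):
    """Frame-synchronous greedy decode: at each frame, emit tokens
    until the joint prefers blank (or a per-frame cap is hit), then
    advance. Once frames run out, no further tokens can appear."""
    hyp, prev = [], BLANK
    for frame in frames:
        for _ in range(max_symbols_per_frame):
            tok = int(np.argmax(joint(frame, prev)))
            if tok == BLANK:
                break  # blank: move on to the next acoustic frame
            hyp.append(tok)
            prev = tok
    return hyp

def toy_joint(frame, prev_token):
    """Hypothetical stand-in for the joint network: prefers blank on
    near-silent frames, an arbitrary non-blank token otherwise."""
    logp = np.full(5, -10.0)
    logp[BLANK if np.abs(frame).mean() < 1e-3 else 1] = 0.0
    return logp
```

Contrast with a seq2seq decoder, whose LM head can keep generating tokens conditioned on its own previous output even after the encoder states carry no signal.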

### Phone telephony

- NVIDIA card reports **only +0.27 pp WER on μ-law 8 kHz** vs 16 kHz studio. That's the phone codec — what calls actually arrive as.
- Whisper was trained primarily on 16 kHz broadband audio; telephony degradation is worse and largely unpublished.
- Parakeet was included in the Granary training mix with phone-quality audio explicitly.

### Integration cost

- `transformers >= 4.52` merged `NvidiaParakeetTDT` / `NvidiaParakeetCTC` / `NvidiaParakeetRNNT` mainline in 2025. **No NeMo toolkit install required.**
- `phone_audio.py:509-519` currently does:
  ```python
  import whisper
  self._model = whisper.load_model("large-v3-turbo", device="cuda")
  # ...
  self._model.transcribe(tmp_wav)
  ```
- Parakeet-via-transformers is a roughly equivalent 3-line change (the exact Auto class for the TDT head should be verified against the model card; `AutoModelForCTC` here may need to be the mainlined TDT class instead):
  ```python
  from transformers import AutoModelForCTC, AutoProcessor
  self._processor = AutoProcessor.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
  self._model = AutoModelForCTC.from_pretrained("nvidia/parakeet-tdt-0.6b-v2").to("cuda")
  ```
- Estimated A/B integration: **~2 hours** to fork `phone_audio.py`, write an A/B harness, and compare outputs on a batch of archived call audio.
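The harness itself is model-agnostic and can be sketched up front (the `transcribe` callables and clip paths are placeholders to be wired to the real engines):

```python
import time

def ab_harness(clip_paths, engines):
    """Replay every archived call clip through each STT engine,
    recording the transcript and wall-clock latency per clip.

    `engines` maps an engine name to a `transcribe(path) -> str`
    callable, e.g. thin wrappers around Whisper-turbo and Parakeet.
    """
    results = {name: [] for name in engines}
    for path in clip_paths:
        for name, transcribe in engines.items():
            t0 = time.perf_counter()
            text = transcribe(path)
            results[name].append({
                "clip": path,
                "text": text,
                "latency_s": time.perf_counter() - t0,
            })
    return results
```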

---

## Hazards (things that can go wrong)

1. **License verification on Nemotron-Speech-Streaming-0.6B** — agent 1 flagged CC-BY-4.0 "verify on live card". Confirm before committing to streaming path.
2. **`nvidia/canary-1b` (no "-flash") is CC-BY-NC-4.0** — non-commercial. Only the "-flash" variant is CC-BY-4.0. Easy to mix up.
3. **SenseVoice is blocked on Blackwell** — FunASR `requirements.txt` pins `torch<=2.3`, which has no SM_120 kernels (same trap that killed CosyVoice 2 in session 107). Viable only via ONNX runtime, which is a 4h+ port.
4. **Phi-4-multimodal is 5.6B total params** — breaks Panda's VRAM budget. Also ASR quality is generation-dependent and weaker than dedicated transducers on tail-of-distribution audio. Do not use.
5. **k2 / icefall directly** — Python package hasn't shipped wheels since Nov 2023. Use `sherpa-onnx` (Apache-2.0) as the Zipformer runtime if you go that direction.
6. **Granite-4.0-1b-speech (Apache 2.0, Mar 2026)** has an outlier Earnings-22 WER of 280 — catastrophic long-form decoder hallucination. Short-utterance only until fixed.
7. **NeMo toolkit install is heavy** (~5 GB deps, pulls Lhotse, pyannote, webdataset). **Always prefer the `transformers`-native Parakeet/Canary class** over installing NeMo.

---

## Why Whisper isn't SOTA anymore

- Last substantive OpenAI release was **large-v3-turbo in Oct 2024**.
- No "Whisper v4" announced as of 2026-04-15. OpenAI's official Whisper repo README hasn't added a new model since turbo.
- Whisper ecosystem is a stable plateau; NVIDIA's Parakeet (May 2025), Canary (Mar 2025), Nemotron-Speech (Mar 2026) have passed it.
- `distil-whisper/distil-large-v3.5` (MIT, refreshed 2026-04-13) is the last meaningful update *in* the Whisper ecosystem — 1.5× faster than turbo, ~same WER. It's a Whisper stack upgrade, not a leap.

---

## Recommended next step

A ~2-hour Panda-side A/B test:

1. `pip install "transformers>=4.52"` in the phone daemon venv (or a new venv if deps conflict); quote the spec so the shell doesn't treat `>` as a redirect.
2. Fork the `PhoneSTT` class in `services/annie-voice/phone_audio.py` into a new `parakeet_stt.py` using `AutoModelForCTC.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")`.
3. Harness: replay N archived call audio files through both Whisper-turbo and Parakeet.
4. Compare on:
   - WER against a held-out ground-truth transcript (if available), or Cohen's kappa on Annie's downstream tool-call decisions (weaker, but proxies end-to-end behavior);
   - live VRAM footprint via `nvidia-smi --query-compute-apps`;
   - latency (mean wall-time per transcription).
5. Decision gate: if WER within 10% of Whisper AND tool-call kappa ≥ 0.9 AND VRAM savings ≥ 1 GB → swap. Else investigate or stay.
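For step 4, the WER metric itself is just word-level Levenshtein distance normalised by reference length (standard definition, independent of any STT library):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: minimum word edits (sub/ins/del) needed to
    turn `hyp` into `ref`, divided by the reference word count."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # edit distances vs empty reference
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)
```

The later G2 "cross-WER" gate is this same function with the Whisper transcript standing in as the reference.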

If the swap proceeds, next session after that is switching to **Parakeet-TDT-0.6B-v3** for native streaming partials — unlocks real-time partial-hypothesis rendering while the caller is still speaking.

---

## Primary sources

- [Open ASR Leaderboard (HF Space)](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
- [Open ASR Leaderboard paper — arXiv 2510.06961v3](https://arxiv.org/html/2510.06961v3)
- [nvidia/parakeet-tdt-0.6b-v2 model card](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
- [nvidia/parakeet-tdt-0.6b-v3 model card](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- [nvidia/canary-1b-flash model card](https://huggingface.co/nvidia/canary-1b-flash)
- [nvidia/canary-180m-flash model card](https://huggingface.co/nvidia/canary-180m-flash)
- [distil-whisper/distil-large-v3.5 model card](https://huggingface.co/distil-whisper/distil-large-v3.5)
- [Moonshine paper — arXiv 2410.15608](https://arxiv.org/pdf/2410.15608v2)
- [Moonshine v2 streaming paper — arXiv 2602.12241](https://arxiv.org/html/2602.12241v1)
- [openai/whisper-large-v3-turbo model card](https://huggingface.co/openai/whisper-large-v3-turbo)
- [Northflank 2026 open STT benchmarks](https://northflank.com/blog/best-open-source-speech-to-text-stt-model-in-2026-benchmarks)

---

## Measured benchmark results (session 113, 2026-04-15)

**Verdict: `stay-whisper (strict)`** — pre-committed G1–G4 gates require all to pass; G3 (VRAM saving ≥ 1 GB) fails for both Parakeet variants.

### Summary table

| Gate | Threshold | whisper | v2 | v3_offline | Status |
|---|---|---|---|---|---|
| **G1 WER (ls-clean)** | Parakeet ≤ whisper + 1 pp | 2.05% | **1.62%** | 2.34% | v2 PASS, v3 PASS |
| **G1 WER (phone-sim)** | Parakeet ≤ whisper + 1 pp | 2.68% | **1.67%** | 2.20% | both PASS |
| **G2 cross-WER vs whisper (phone-sim)** | ≤ 10% | — | 1.16% | 1.69% | both PASS |
| **G3 VRAM peak** | ≥ 1 GB less than whisper | 4702 MB | 4757 MB (+55) | 4827 MB (+125) | both FAIL |
| **G4 mean latency (phone-sim)** | ≤ whisper | 99 ms | 30 ms | 30 ms | both PASS (3.3× faster) |
| **G5 streaming TTFT** | p50 ≤ 300 ms | — | — | — | **N/A** — distributed checkpoint lacks streaming |
| **G6 LLM-proxy** | 4/5 agree with whisper | — | — | — | **Deferred** — out of session-113 scope |
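The strict verdict logic is mechanical and can be written down as a predicate (thresholds from the table; treating the 1 GB gate as 1024 MB is an assumption):

```python
def strict_verdict(wer_cand, wer_base, cross_wer_pct,
                   vram_cand_mb, vram_base_mb,
                   lat_cand_ms, lat_base_ms):
    """All pre-committed gates must pass to adopt; any single
    failure means stay-whisper."""
    gates = {
        "G1_wer": wer_cand <= wer_base + 1.0,           # within +1 pp
        "G2_cross": cross_wer_pct <= 10.0,
        "G3_vram": vram_base_mb - vram_cand_mb >= 1024,  # >= 1 GB saved
        "G4_latency": lat_cand_ms <= lat_base_ms,
    }
    return ("adopt" if all(gates.values()) else "stay-whisper"), gates

# Parakeet v2, phone-sim numbers from the table above
verdict, gates = strict_verdict(1.67, 2.68, 1.16, 4757, 4702, 30, 99)
```

With the measured numbers, G1/G2/G4 pass but G3 fails (+55 MB instead of −1 GB), forcing stay-whisper.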

### Load-bearing findings for future sessions

1. **`nvidia/parakeet-tdt-0.6b-v3` on HuggingFace is offline-only.** The distributed `.nemo` checkpoint has `encoder.att_context_size=[-1, -1]` and `EncDecRNNTBPEModel.transcribe_simulate_cache_aware_streaming` raises `NotImplementedError`. Session 108's assumption that v3 offers a streaming path on this checkpoint was incorrect — a streaming variant requires a different model ID or encoder reconfiguration at init time.
2. **VRAM: NeMo full model class loads more than just weights.** Measured Parakeet v2 = 4.76 GB, v3 = 4.83 GB; session 108 expected ~3 GB. The gap (≈ 1.8 GB) is NeMo training scaffolding (Lhotse dataloader, decoder cache). A minimal inference-only wrapper that retains only encoder + predictor + joint + tokenizer could likely recover this, which would flip G3 and therefore the overall verdict to adopt-v2.
3. **WER: v2 is materially better than v3 on clean English audio.** On ls-clean, v2 = 1.62% vs v3 = 2.34% — v3's multilingual training trades some English accuracy. If the only deployment target is English phone audio, v2 is the better checkpoint.
4. **Latency: Parakeet is 3.3× faster than Whisper per clip.** 30 ms vs 99 ms mean on phone-sim. Latency wins alone are significant for phone-turn perceived responsiveness, independent of WER.
5. **Phone-sim is a proxy, not ground truth.** `sox` was not installed on Panda; the bench used `torchaudio.functional.mu_law_{en,de}coding` for μ-law 8 kHz round-trip. Real BT HFP channel degradation (packet loss, AGC, echo) is not simulated. Any adoption decision should re-validate on live phone clips.
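For reference, the μ-law proxy in finding 5 can be reproduced without torchaudio (a numpy sketch of continuous G.711-style companding with 8-bit quantisation; the 2:1 decimation and repeat-upsampling are deliberately as naive as the bench, with no anti-alias filtering):

```python
import numpy as np

MU = 255.0  # mu-law compression parameter (8-bit telephony)

def mu_law_encode(x):
    """Compand [-1, 1] audio, then quantise to 256 integer levels."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.floor((y + 1.0) / 2.0 * MU + 0.5).astype(np.int64)

def mu_law_decode(q):
    """Invert the quantisation and the companding curve."""
    y = 2.0 * (q / MU) - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def phone_sim(x):
    """Crude 16 kHz -> 8 kHz -> 16 kHz phone-channel proxy."""
    x8 = np.clip(x[::2], -1.0, 1.0)          # naive 2:1 decimation
    x8 = mu_law_decode(mu_law_encode(x8))    # codec round-trip
    return np.repeat(x8, 2)                  # naive 2x upsample
```

torchaudio's `mu_law_encoding`/`mu_law_decoding` implement the same continuous companding; only rounding at the integer boundaries may differ.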

### Follow-ups (not in session 113 scope)

- **Minimal NeMo wrapper investigation** — ship a runtime that retains only encoder/predictor/joint modules and strips Lhotse/training hooks; measure actual VRAM at runtime. If ≤ 3.5 GB, verdict flips to adopt-v2.
- **Real BT HFP corpus** — capture ~30 live phone-path clips during next bench window; re-run G1/G2/G4 on that set.
- **v3 streaming path** — evaluate `parakeet-tdt_ctc-110m-streaming` or a v3 checkpoint with explicit streaming config if streaming partials become a product requirement.
- **G6 LLM-proxy gate** — once a real-phone corpus exists, run 5-clip rubric on Gemma 4 `:11435` with Annie's production system prompt.

Full raw data: `benchmark-results/stt-2026-04-15/` (ab-results.json, SUMMARY.md).
