# Next Session: Parakeet v2 vs v3 A/B Benchmark on Panda (STT Replacement)

**Created:** 2026-04-15 (session 108)
**Status:** Handoff draft — execution NOT started.
**Predecessor research:** `docs/RESEARCH-STT-ALTERNATIVES.md` (session 108 — Parakeet TDT 0.6B v2 identified as primary Whisper replacement)
**User bias:** Leaning toward **v3** for UX (streaming partials + faster perceived Annie response), but wants data from both before committing.

---

## What (one paragraph)

A/B test `nvidia/parakeet-tdt-0.6b-v2` (English-only, offline-optimized, avg WER 6.05% per Open ASR Leaderboard) vs `nvidia/parakeet-tdt-0.6b-v3` (25-lang, native streaming, avg WER 6.34%) against the current **OpenAI Whisper large-v3-turbo** baseline on Panda's production phone-call audio. Measure offline WER, live-call TTFT (time-to-first-token), streaming-vs-offline quality delta for v3, and VRAM footprint. Decision gate: pick the model that balances (a) WER within ≤1 pp of Whisper, (b) ≥1 GB VRAM savings, (c) lowest perceived-latency Annie response for UX. Whisper stays loaded and untouched during the benchmark — zero production risk.

## Why this design

- **Whisper stays live throughout.** Don't stop the production phone daemon. Run Parakeet models in a parallel test harness; compare transcripts offline.
- **User's v3 bias is informed by UX, not raw accuracy.** The ~0.29 pp avg WER gap favouring v2 is below the noise floor for conversational English. The streaming partials v3 enables (200-500 ms TTFT vs Whisper's "wait-for-end-of-utterance") could feel dramatically faster even if WER is slightly worse. Plan must actually measure the TTFT — don't let the user's hypothesis override data.
- **Decision criteria are pre-committed in this doc** to prevent confirmation bias at read-time.

## Deltas from prior benchmark plans (sessions 104 / 108)

| Aspect | E4B benchmark (s104/108) | STT benchmark (this session) |
|---|---|---|
| What's stopped | E2B nav + Chatterbox for window | **Nothing** — Whisper + phone daemon stay live |
| New service port | :11437 temporary | **None** — Parakeet runs in a test venv, out-of-band |
| Tested workload | 1 image × 139 prompts | Real archived phone audio clips (need to source) + LibriSpeech reference set |
| Decision risk | Low (pure measurement) | Medium — production component swap candidate |
| Audio format | n/a (vision) | 16 kHz mono PCM (preferred) and μ-law 8 kHz (actual phone audio) |

## Pre-execution gates (MUST pass before starting)

### Gate 1 — Test audio inventory
Find real phone-call audio to evaluate against. Options in order of preference:
```bash
# Check for archived phone audio on Panda
ssh panda 'ls -lh ~/.her-os/phone-call-logs/ 2>/dev/null | head -20'
ssh panda 'find ~ -name "*.wav" -path "*phone*" -mtime -30 2>/dev/null | head -20'
ssh panda 'ls -lh /tmp/phone-auto.log ~/workplace/her/her-os/services/audio-pipeline/recordings/ 2>/dev/null'
```
If none exist, **Path A fallback**: record 10-20 short clips manually (user + Mom) via `arecord -f S16_LE -r 16000 -c 1 test-NN.wav`. This is the decision-quality path; LibriSpeech alone won't reflect phone-call distribution.

### Gate 2 — `transformers >= 4.52` availability
```bash
ssh panda 'cd ~/workplace/her/her-os && source .venv/bin/activate && python -c "import transformers; print(transformers.__version__); from transformers import AutoModelForCTC; print(\"OK\")"'
```
If < 4.52: create a separate test venv (`~/parakeet-bench-venv`) with `pip install "transformers>=4.52" "torch>=2.6"` (quote the specifiers so the shell doesn't treat `>` as redirection; a `+cu128` local-version tag can't be used in a `>=` range — the CUDA wheel comes from the torch index). **Do not upgrade transformers in the main `.venv`** — would risk Whisper or other services breaking.

### Gate 3 — Ground-truth transcripts
For each test clip, we need a reference transcript. Options:
- **Human transcription** (highest quality, required for final-decision WER)
- **Current Whisper-turbo output as proxy** (biases toward Whisper — use only for cross-model agreement, not absolute WER)
- **LibriSpeech test-clean ground truth** (download + cache — distribution-shifted from phone audio but unimpeachable reference)

Default plan: **LibriSpeech test-clean for absolute WER**, **Cohen's kappa between all three models on archived phone audio** (no ground truth needed — just measures agreement).
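The agreement metric can be sketched without external deps: Cohen's kappa strictly applies to categorical labels, so mutual word-level WER between model outputs (the same edit-distance metric jiwer computes) is the practical stand-in on un-transcribed phone audio. A hedged sketch — `pairwise_agreement` and the model labels are illustrative, not final harness names:

```python
from itertools import combinations

def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance over reference length (what jiwer's wer() computes)."""
    r, h = ref.lower().split(), hyp.lower().split()
    # d[i][j] = edits to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[len(r)][len(h)] / max(len(r), 1)

def pairwise_agreement(transcripts: dict[str, str]) -> dict[tuple[str, str], float]:
    """Mutual WER for every model pair; low values = high agreement."""
    return {(a, b): word_error_rate(transcripts[a], transcripts[b])
            for a, b in combinations(sorted(transcripts), 2)}
```

If two models never disagree on what Annie hears, their downstream behavior can't differ — which is exactly what G2 is a proxy for.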

### Gate 4 — VRAM headroom snapshot
```bash
ssh panda 'nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv'
```
Expected: phone(5286) + Chatterbox(~3740) + E2B(3222) = ~12.2 GB used / 4 GB free. Each Parakeet model is ~1.5 GB live → **one model at a time** fits, two simultaneously does not. Plan loads v2 and v3 **serially**, not in parallel.

### Gate 5 — User-bias pre-commit
Before running anything, document in this session's working notes:
> "I (user/Claude) expect v3 to win on UX grounds. If v2's offline WER is > 1 pp better AND streaming TTFT on v3 is > 500 ms, the data overrides the bias."

This pre-commits the decision gate so we don't rationalize the outcome after the numbers come in.

## Execution Plan

### Phase 0 — Contract verification (read-only, ~2 min)
1. SSH reachable: `ssh panda 'hostname'`.
2. Baseline VRAM matches expected 12.2 GB ± 0.5 GB, 3 expected processes.
3. Whisper currently healthy (phone daemon answering, Chatterbox healthy).
4. Gate 1-4 checks above.

### Phase 1 — Setup test venv & download weights (~5-10 min)
5. `ssh panda 'python3 -m venv ~/parakeet-bench-venv && source ~/parakeet-bench-venv/bin/activate && pip install --upgrade pip && pip install "transformers>=4.52" "torch>=2.6" "torchaudio>=2.6" librosa soundfile jiwer'`
6. Pre-download both models: `ssh panda 'source ~/parakeet-bench-venv/bin/activate && python -c "from transformers import AutoModelForCTC, AutoProcessor; [(AutoProcessor.from_pretrained(m), AutoModelForCTC.from_pretrained(m)) for m in [\"nvidia/parakeet-tdt-0.6b-v2\", \"nvidia/parakeet-tdt-0.6b-v3\"]]"'` — caches weights and processors to `~/.cache/huggingface`. ~2.4 GB total.
7. Fetch LibriSpeech test-clean reference subset: `ssh panda 'mkdir -p ~/parakeet-bench && cd ~/parakeet-bench && huggingface-cli download openslr/librispeech_asr --repo-type dataset --include "test.clean/*" --local-dir ./librispeech'` or use HF datasets streaming.

### Phase 2 — Benchmark harness (~30 min)
Write `scripts/benchmark_stt_ab.py` that:
- Loads a test clip
- Runs Whisper-turbo via HTTP call to the running phone daemon (OR loads it in a separate process — **do not** load it in the same process as Parakeet; OOM risk)
- Loads Parakeet v2, transcribes all clips serially, releases GPU
- Loads Parakeet v3 offline mode, transcribes all clips serially, releases GPU
- Loads Parakeet v3 streaming mode (`speech_to_text_streaming_infer_rnnt.py` pattern), transcribes all clips with TTFT measurement, releases GPU
- Writes all outputs to a single JSON with per-clip rows: `{clip_id, whisper_text, v2_text, v3_offline_text, v3_streaming_text, v3_streaming_ttft_ms, reference_text?}`

Critical design:
- **One model in VRAM at a time.** `del model; torch.cuda.empty_cache()` between loads.
- **Same clips for all four passes.** Deterministic comparison.
- **Record wall-time for each clip, each model.** Latency is a decision metric, not just WER.
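The serial-pass structure can be sketched as below. `load_fn` / `transcribe_fn` are placeholders for whatever loader the Gate 2 check settles on; the point here is the one-model-at-a-time discipline and the row schema, not the model API:

```python
import gc

def make_row(clip_id, texts, ttft_ms=None, reference=None):
    """One output row per clip, matching the JSON schema above."""
    return {
        "clip_id": clip_id,
        "whisper_text": texts.get("whisper"),
        "v2_text": texts.get("v2"),
        "v3_offline_text": texts.get("v3_offline"),
        "v3_streaming_text": texts.get("v3_streaming"),
        "v3_streaming_ttft_ms": ttft_ms,
        "reference_text": reference,
    }

def serial_passes(passes, clips):
    """passes: list of (label, load_fn, transcribe_fn). One model in VRAM at a time."""
    out = {clip_id: {} for clip_id in clips}
    for label, load_fn, transcribe_fn in passes:
        model = load_fn()
        for clip_id, audio in clips.items():
            out[clip_id][label] = transcribe_fn(model, audio)
        del model                      # critical: release before the next load
        gc.collect()
        try:
            import torch
            torch.cuda.empty_cache()   # reclaim freed VRAM between models
        except ImportError:
            pass                       # allows dry-running the harness off-GPU
    return out
```

Keeping `make_row` pure makes the schema unit-testable without a GPU in the loop.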

### Phase 3 — Run benchmark (~15-20 min)
8. Run `scripts/benchmark_stt_ab.py` on LibriSpeech test-clean (100 clips) + any archived phone audio.
9. Monitor `nvidia-smi` in parallel — confirm serial loading, no concurrent OOM.

### Phase 4 — Analysis (~20 min)
Compute per model:
- **Absolute WER** against LibriSpeech reference (jiwer)
- **Cross-model agreement** on phone audio without ground truth (Cohen's-kappa-style agreement, plus WER against Whisper's output as a proxy reference)
- **Mean / p50 / p95 latency** per clip (wall time)
- **v3 streaming TTFT distribution** (p50, p95) — THE key UX metric
- **Streaming-vs-offline WER delta for v3** (streaming often sacrifices accuracy for latency; quantify this)
- **VRAM footprint during inference** (peak per model)
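The latency summary needs nothing beyond the stdlib; at 100 clips a nearest-rank percentile is plenty (`numpy.percentile` would do the same job). A minimal sketch:

```python
def percentile(values, p):
    """Nearest-rank percentile (p in [0, 100]) over a sorted copy."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def latency_summary(ms_per_clip):
    """Mean / p50 / p95 wall time, the per-model decision metrics."""
    return {
        "mean": sum(ms_per_clip) / len(ms_per_clip),
        "p50": percentile(ms_per_clip, 50),
        "p95": percentile(ms_per_clip, 95),
    }
```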

### Phase 5 — Decision gate
Pre-committed gates (in order — first to fail kills the candidate):

| Gate | Requirement | v2 | v3 offline | v3 streaming |
|---|---|---|---|---|
| G1 Absolute WER | ≤ Whisper WER + 1 pp on LS-clean | ? | ? | ? |
| G2 Tool-call agreement | Cohen's kappa ≥ 0.90 vs Whisper on phone audio (proxy for "Annie's downstream behavior doesn't change") | ? | ? | ? |
| G3 VRAM saving | ≥ 1 GB less than Whisper's 5.1 GB live | ? | ? | ? |
| G4 Latency | mean wall-time per clip ≤ Whisper's current | ? | ? | ? |
| G5 TTFT (streaming-only) | p50 ≤ 300 ms, p95 ≤ 500 ms on phone-quality audio | — | — | ? |

**Adoption rules:**
- If **v3 streaming passes all G1-G5** → adopt v3 streaming. UX wins.
- If **v3 streaming fails only G5** (TTFT too slow) → fall back to v3 offline (if G1-G4 pass) or v2 offline.
- If **v3 offline passes G1-G4** → adopt v3 offline. Gain multilingual grace + streaming upgrade path later.
- If **only v2 passes G1-G4** → adopt v2. Accuracy wins over UX upgrade.
- If **none pass** → stay on Whisper. Document why.
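The adoption rules read as executable logic, and encoding them keeps the final call mechanical once the gate table is filled in. A sketch (candidate labels are illustrative):

```python
def adoption_decision(gates):
    """gates: {candidate: {"G1": bool, ..., "G5": bool}} -> (choice, reason).

    Implements the pre-committed rules above: first failing gate kills a candidate."""
    def passes(cand, names):
        return all(gates.get(cand, {}).get(g, False) for g in names)

    if passes("v3_streaming", ["G1", "G2", "G3", "G4", "G5"]):
        return "v3_streaming", "all gates pass; UX wins"
    if passes("v3_streaming", ["G1", "G2", "G3", "G4"]):  # failed only G5 (TTFT)
        if passes("v3_offline", ["G1", "G2", "G3", "G4"]):
            return "v3_offline", "streaming TTFT too slow; offline v3 passes"
        if passes("v2", ["G1", "G2", "G3", "G4"]):
            return "v2", "streaming TTFT too slow; v2 passes"
    if passes("v3_offline", ["G1", "G2", "G3", "G4"]):
        return "v3_offline", "multilingual + streaming upgrade path later"
    if passes("v2", ["G1", "G2", "G3", "G4"]):
        return "v2", "accuracy wins over UX upgrade"
    return "whisper", "no candidate passed; stay on Whisper"
```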

### Phase 6 — Writeup (no production changes yet)
10. Update `docs/RESEARCH-STT-ALTERNATIVES.md` with measured numbers section.
11. Update `docs/RESOURCE-REGISTRY.md` Change Log with session N entry.
12. Session MEMORY block.
13. **Do not modify `services/annie-voice/phone_audio.py` in the benchmark session** — that's a follow-up session with full restoration plan (adoption isn't just a code change; it's a service-swap window like E2B was). If decision is "adopt", write `docs/NEXT-SESSION-PARAKEET-DEPLOY.md` handoff.

## Files to create/modify

| Path | Change |
|---|---|
| `scripts/benchmark_stt_ab.py` | **NEW** — harness (Whisper client + Parakeet v2/v3 loaders + WER + TTFT + JSON output) |
| `scripts/benchmark_stt_streaming.py` | **NEW** — isolated v3 streaming harness (transcribes with partial emission, logs TTFT per token) |
| `benchmark-results/stt-2026-MM-DD/` | **NEW** — JSON outputs + per-model transcript artifacts |
| `docs/RESEARCH-STT-ALTERNATIVES.md` | Append "Measured benchmark results (session N)" section after decision made |
| `docs/RESOURCE-REGISTRY.md` | Change Log entry for the benchmark |
| `MEMORY.md` | Session summary + next steps |

**Not modified this session:** `services/annie-voice/phone_audio.py` — that's a separate deployment session after adoption decision.

## VRAM plan

- At rest: 12.2 GB used (phone + Chatterbox + E2B), 4 GB free.
- During benchmark: one Parakeet model at a time (~1.5 GB each). Peak simultaneous: 12.2 + 1.5 = 13.7 GB → ~2.5 GB headroom. Safe.
- **Do not load v2 and v3 simultaneously.** `del` + `torch.cuda.empty_cache()` between.
- **Do not stop Whisper during benchmark.** The whole point is production stays up.

## Rollback / restoration

None needed — this is a pure measurement session. At end:
1. `rm -rf ~/parakeet-bench-venv` (or keep if deployment session follows)
2. Model weights stay in `~/.cache/huggingface` (2.4 GB, useful for follow-up)
3. Whisper, phone daemon, Chatterbox, E2B all untouched

## Critical design decisions inherited from prior sessions

1. **Use `transformers`-native Parakeet class**, not NeMo. NeMo pulls ~5 GB of deps and has no integration benefit for inference.
2. **`hf` CLI is venv-scoped on Panda** (session 104) — must activate venv.
3. **Parakeet-via-transformers may need `trust_remote_code=True`** — check model card. If so, document as a supply-chain consideration.
4. **Same audio clip through all models = fair comparison.** Load clip once, pass to each model's preprocessor.
5. **Streaming TTFT is per-token, not per-utterance.** Measure time from `send_audio_chunk(0)` to `first_token_received` on EACH utterance; report distribution.
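Design decision 5 can be sketched as a small wrapper, assuming the streaming API reduces to a send-chunk / poll-partial pair (hypothetical names — the real entry points come from the `speech_to_text_streaming_infer_rnnt.py` example):

```python
import time

def measure_ttft(stream_chunks, send_chunk, poll_partial):
    """Feed audio chunks; return ms from first chunk sent to first non-empty partial.

    send_chunk / poll_partial are hypothetical adapters over the streaming API."""
    t0 = None
    for chunk in stream_chunks:
        if t0 is None:
            t0 = time.monotonic()      # clock starts at the FIRST chunk of the utterance
        send_chunk(chunk)
        partial = poll_partial()
        if partial:                    # first token observed
            return (time.monotonic() - t0) * 1000.0
    return None                        # utterance ended with no token produced
```

Run it once per utterance and feed the resulting list into the p50/p95 distribution for gate G5.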

## Time budget

- Phase 0-1 (setup): 15 min
- Phase 2 (harness): 30 min  
- Phase 3 (run): 20 min
- Phase 4 (analysis): 20 min
- Phase 5-6 (decision + docs): 20 min
- **Total: ~2 hours** if all gates pass cleanly. Up to 4 hours if gate 1 (test audio) requires manual recording.

## Start command

```bash
cat docs/NEXT-SESSION-PARAKEET-STT-BENCH.md                     # this doc
cat docs/RESEARCH-STT-ALTERNATIVES.md                           # predecessor research
```

Then run Gate 1 (audio inventory) — that's the decision on whether this is a 2h or 4h session.

## References

- Research predecessor: `docs/RESEARCH-STT-ALTERNATIVES.md`
- Current STT code: `services/annie-voice/phone_audio.py:509-519` (PhoneSTT Whisper loader)
- Prior benchmark structure: `docs/NEXT-SESSION-GEMMA4-E4B-BENCH.md` (Panda), `docs/NEXT-SESSION-GEMMA4-E4B-BEAST-BENCH.md` (Beast)
- Open ASR Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
- Model cards:
  - https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
  - https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
- Streaming inference example: `nvidia/parakeet-tdt-0.6b-v3/speech_to_text_streaming_infer_rnnt.py`
