# Next Session: Parakeet v2/v3 vs Whisper STT Benchmark on Panda (V2 — post-adversarial-review)

**Created:** 2026-04-15 (session 112)
**Supersedes:** `docs/NEXT-SESSION-PARAKEET-STT-BENCH.md` (original handoff draft from session 108)
**Plan file (authoritative):** `/home/rajesh/.claude/plans/radiant-noodling-dolphin.md`
**Status:** Approved + stress-tested via `/planning-with-review`. Ready to execute.

---

## What

Benchmark NVIDIA Parakeet TDT 0.6B `v2` (English, offline) and `v3` (25-lang, offline + streaming via NeMo) against OpenAI Whisper large-v3-turbo on Panda. Decide whether to replace Whisper in the phone daemon. The bench takes the phone daemon DOWN for 2-4 hours (user is oracle; no calls expected); Chatterbox/E2B/panda-nav/WhatsApp all stay UP. No production STT code is modified this session — a minimal temporary endpoint is added to `phone_api.py` for baseline capture and reverted at the end.

## Plan

**Read `/home/rajesh/.claude/plans/radiant-noodling-dolphin.md` first — it has the full 8-phase implementation, Stage 0 known gotchas (13 entries), Stage 1B state machines (3 machines), Stage 1C pre-mortem (13 scenarios), and Stage 6 adversarial-review response table (22 findings all Implemented). The execution phases section has a SUPERSEDE NOTICE — Stage 6 is authoritative where they conflict.**

## Key Design Decisions (from adversarial review)

The numbered decisions below are load-bearing and MUST survive into execution. A fresh session that accidentally reverts any of them is implementing the pre-review version.

1. **Temporary `/v1/phone/debug/transcribe` endpoint** added to `services/annie-voice/phone_api.py` in Phase 0.5, auth'd by `X-Internal-Token: $CHATTERBOX_TOKEN`. Reuses the phone daemon's already-loaded Whisper; no second-process OOM risk. Reverted in Phase 7.3. Phase 1.4 calls this endpoint only — no alternate whisper route.

2. **Torch install order:** Phase 2.4 installs `torch==2.11.0+cu128 torchaudio==2.11.0+cu128 --index-url https://download.pytorch.org/whl/cu128` FIRST, THEN `pip install nemo_toolkit[asr] --no-deps` followed by NeMo's non-torch deps resolved from `pip show`. Reversing this order triggers NeMo C-extension ABI mismatch — silent numerical corruption, not a clean ImportError.

3. **Phase 2.6 verification runs a CUDA tensor op**, not just `import`. Required: `torch.zeros(1).cuda()`, `torch.fft.fft(torch.randn(1024).cuda())`, `assert torch.version.cuda == "12.8" and torch.cuda.get_device_capability() == (12, 0)`.

4. **Phase 2.8a resolves the correct transformers class via AutoConfig** — Parakeet TDT is a Transducer, NOT CTC. `AutoModelForCTC` is WRONG; use `AutoConfig.from_pretrained(model_id).architectures[0]`. Do not hardcode a class name.

5. **Subprocess-per-pass (SM-2 revised):** Each of the 3 passes (v2, v3-offline, v3-streaming) runs in its own Python subprocess via `subprocess.run([sys.executable, "-m", "scripts.stt_bench_pass", ...])`. OS process death reclaims VRAM deterministically — the earlier `del model + empty_cache + ≤1 GB residual` gate was flakiness-prone (NeMo leaves 1.5-2.5 GB residual routinely).

6. **TTFT measurement:** Phase 3.1 Pass 3 uses **80 ms chunks** for the TTFT pass (Parakeet TDT's minimum look-ahead). Each clip runs **3×** after warmup; TTFT distribution = all 3N samples. Phase 3.1 Pass 3b runs 10 clips with Chatterbox active-load contention. G5 threshold = `chunk-floor + 50 ms ≤ 300 ms`, not absolute 300 ms.

7. **Phone-path corpus is primary for G1**, LibriSpeech secondary. Phase 1.2b has user record 20 clips from a read-aloud script; clips flow through sox μ-law 8 kHz round-trip to approximate BT HFP degradation. LibriSpeech test-clean capped at 30 clips as a smoke baseline.

8. **End-to-end LLM proxy (Phase 4.3) — new G6 gate:** 5 phone clips × 4 transcript paths → Gemma 4 at `:11435/v1/chat/completions` → compare `(content_first_160, tool_calls[].name, tool_calls[].args)` tuple equality. Rubric: ≥ 4/5 agree with Whisper baseline → G6 passes. This is the ASR-meaning-over-accuracy check (`feedback_asr_meaning_not_accuracy`).

9. **"Cohen's kappa" was a mislabel** — the metric is pairwise cross-WER. All gate tables + SUMMARY.md say "cross-WER ≤ 10%" not "kappa ≥ 0.90". Analysis uses `jiwer.Compose([RemovePunctuation(), ToLowerCase(), Strip(), RemoveMultipleSpaces()])` identically across all paths.

10. **PII gate on commit (Phase 7.6):** `.gitignore` appended with `benchmark-results/**/phone/` + `benchmark-results/**/*phone*.wav` + `benchmark-results/**/ab-results.json`. Pre-commit guard: `git diff --cached --name-only | grep -Eq '\.wav$|phone/' && { echo "ABORT"; exit 1; }`. Only `SUMMARY.md` + aggregate JSONs + LS-clean outputs get committed. Phone audio + transcripts stay on Panda.

11. **Phone restart budget is empirically measured** in new pre-flight Gate 14 before Phase 0 (a single dry `stop+start` cycle); stored in `$PHONE_RESTART_BUDGET_S` (min 180, max 600). Phase 5.1 uses this variable, wraps `$HER_OS/start.sh phone` in `timeout $PHONE_RESTART_BUDGET_S ...`.

12. **Phase 5.1 ESCALATE fallback** is a concrete laptop-side command (written into SM-1 ESCALATE row), not a prose "user intervenes." Exact command: `ssh $PANDA_HOST 'cd ~/workplace/her/her-os && source .venv/bin/activate && <env vars> nohup python3 scripts/phone_call.py auto >> /tmp/phone-auto.log 2>&1 &'`.

13. **Verdict extracted programmatically** from SUMMARY.md's `Verdict: <tag>` line via grep. Empty verdict → commit ABORTS. No `<verdict>` placeholder ever ends up in a real commit message.

14. **`$HER_OS/stop.sh`/`start.sh` runs from LAPTOP only** — SSH-to-Panda-then-run-stop is FORBIDDEN (pgrep self-match + CWD non-persistence combo breaks it).

15. **Pre-flight Gate 15** verifies HF token acceptance of `openslr/librispeech_asr` before bench begins. If auth-gated → fall back to `mozilla-foundation/common_voice` test subset.
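
The subprocess isolation in decision 5 can be sketched as a generic runner (the module path and CLI flags come from the plan; the generic `run_in_subprocess` helper and the `timeout_s` default are illustrative assumptions):

```python
import subprocess
import sys

def run_in_subprocess(module, args, timeout_s=3600):
    """Run one bench pass as a child interpreter: when the child exits,
    the OS reclaims all of its VRAM deterministically."""
    cmd = [sys.executable, "-m", module, *args]
    return subprocess.run(cmd, timeout=timeout_s).returncode

# Orchestrator usage (pass names illustrative):
# for model in ("v2", "v3-offline", "v3-streaming"):
#     rc = run_in_subprocess("scripts.stt_bench_pass", ["--model", model])
#     assert rc == 0, f"pass {model} failed"
```

Because each pass is a fresh process, no pass can see another's residual allocator state — which is exactly why the `≤1 GB residual` gate became unnecessary.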
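
Decision 6's TTFT aggregation leans on `percentile_small_n` from `scripts/_stt_bench_common.py`; a plausible nearest-rank version, plus the relative G5 check, might look like this (the exact gate arithmetic is an assumption here — the plan's `chunk-floor + 50 ms` wording is authoritative):

```python
import math

def percentile_small_n(samples, p):
    """Nearest-rank percentile; well-defined even for small 3N sample counts."""
    xs = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(xs)) - 1)
    return xs[k]

def g5_passes(ttft_ms_samples, chunk_floor_ms=80, margin_ms=50):
    """Relative gate sketch: p95 TTFT must stay within chunk-floor + margin."""
    return percentile_small_n(ttft_ms_samples, 95) <= chunk_floor_ms + margin_ms
```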
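
Decision 8's tuple comparison, assuming an OpenAI-style chat-completions response shape (`function.name` / `function.arguments`; the doc's `tool_calls[].args` is taken to mean the serialized arguments):

```python
def proxy_signature(resp):
    """Reduce one chat-completions response to the tuple G6 compares."""
    msg = resp["choices"][0]["message"]
    calls = msg.get("tool_calls") or []
    return (
        (msg.get("content") or "")[:160],
        tuple(c["function"]["name"] for c in calls),
        tuple(c["function"]["arguments"] for c in calls),
    )

def g6_passes(whisper_resps, candidate_resps, need=4):
    """G6 rubric: at least `need` of the clip pairs must agree exactly."""
    agree = sum(proxy_signature(w) == proxy_signature(c)
                for w, c in zip(whisper_resps, candidate_resps))
    return agree >= need
```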
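
Phase 4 uses jiwer itself; as a self-contained illustration of decision 9's normalization and the cross-WER metric, here is a word-level Levenshtein sketch (the symmetric averaging in `cross_wer` is one convention, an assumption rather than something the plan specifies):

```python
import re

def wer_normalize(text):
    """Approximate the jiwer Compose: drop punctuation, lowercase, squeeze spaces."""
    return re.sub(r"[^\w\s]", " ", text).lower().split()

def wer(ref_words, hyp_words):
    """Word error rate = word-level Levenshtein distance / reference length."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / max(1, len(ref_words))

def cross_wer(a, b):
    """Pairwise cross-WER between two transcript paths."""
    aw, bw = wer_normalize(a), wer_normalize(b)
    return (wer(aw, bw) + wer(bw, aw)) / 2
```

The point of identical normalization across all paths is that punctuation and casing differences never count as errors — only word-level disagreements do.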
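
The pre-commit guard in decision 10 is shell, but its pattern is easy to mirror (and unit-test) in Python — a sketch for clarity, not part of the plan's file list:

```python
import re

# Same pattern as the shell guard: any staged .wav or phone/ path aborts.
BLOCKED = re.compile(r"\.wav$|phone/")

def staged_paths_ok(paths):
    """True only when no staged path matches a PII pattern."""
    return not any(BLOCKED.search(p) for p in paths)
```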
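
Decision 13's verdict extraction could look like this in Python (the four valid tags come from the Verification section; the plan's `grep` does the same job):

```python
import re

VALID = {"adopt-v2", "adopt-v3-offline", "adopt-v3-streaming", "stay-whisper"}

def extract_verdict(summary_text):
    """Pull the Verdict tag out of SUMMARY.md; abort the commit on anything odd."""
    m = re.search(r"^Verdict:\s*(\S+)\s*$", summary_text, re.MULTILINE)
    if not m or m.group(1) not in VALID:
        raise ValueError("ABORT: missing or invalid Verdict line in SUMMARY.md")
    return m.group(1)
```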

## Files to Modify

Ordered by phase:

1. `services/annie-voice/phone_api.py` — Phase 0.5: add `POST /v1/phone/debug/transcribe` (temporary; reverted in 7.3). ~20 LOC.
2. `.gitignore` — Phase 7.6 pre-step: append PII patterns.
3. `scripts/_stt_bench_common.py` — **NEW**: `load_audio_16k_mono`, `load_corpus`, `percentile_small_n`, `wer_normalize`, `write_results_atomic`.
4. `scripts/stt_bench_pass.py` — **NEW**: single-pass worker (invoked via `python -m scripts.stt_bench_pass --model <id> ...`).
5. `scripts/bench_whisper_baseline.py` — **NEW**: Phase 1.4 baseline via the Phase 0.5 endpoint.
6. `scripts/benchmark_stt_ab.py` — **NEW**: Phase 3 orchestrator (spawns 3 subprocess passes + nvsmi monitor).
7. `scripts/analyze_stt_ab.py` — **NEW**: Phase 4 WER/cross-WER/latency/TTFT + Phase 4.3 G6 proxy.
8. `benchmark-results/stt-2026-04-15/` — **NEW DIR** (only SUMMARY.md + aggregates get committed).
9. `docs/RESEARCH-STT-ALTERNATIVES.md` — Phase 7.1 append measured results.
10. `docs/RESOURCE-REGISTRY.md` — Phase 7.2 Change Log + optional Panda table update.
11. `MEMORY.md` — Phase 7.3 new session-112 block.
12. `docs/NEXT-SESSION-PARAKEET-DEPLOY.md` — only if verdict = ADOPT.

**Explicitly NOT modified:**
- `services/annie-voice/phone_audio.py` (the STT engine) — adoption is a separate session.
- `services/annie-voice/phone_call.py` — unchanged.
- `start.sh`, `stop.sh` — unchanged.
- Main Panda `~/workplace/her/her-os/.venv` — unchanged (bench venv is isolated per user decision).

## Start Command

```bash
cat /home/rajesh/.claude/plans/radiant-noodling-dolphin.md
```

Then execute pre-flight gates 1-15 in order. First failure aborts. After gates pass: Phase 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7. **All adversarial findings are already addressed in the plan's Stage 6 table — do not re-design any of the 15 decisions above without explicit user re-approval.**

## Verification

1. Pre-flight Gates 1-15 all green (including Gate 14 restart-budget measurement).
2. Phase 1 `whisper-baseline.json` covers ≥ 90% of the corpus.
3. Phase 2 bench venv: `python -c "import torch; assert torch.version.cuda == '12.8' and torch.cuda.get_device_capability() == (12, 0)"` passes; NeMo imports + runs CUDA op.
4. Phase 3 `ab-results.json` has all 3 subprocess passes complete with ≥ 90% coverage, no OOM.
5. Phase 4 `SUMMARY.md` contains all 6 gates (G1-G6) filled, `Verdict:` line with one of `adopt-v2|adopt-v3-offline|adopt-v3-streaming|stay-whisper`.
6. Phase 5 phone daemon restored: `/v1/phone/status` → 200, Chatterbox/:8772, E2B/:11435, panda-nav/:11436 all green. `[AUTO] Waiting for incoming call` in `/tmp/phone-auto.log`. Phase 0.5 temporary endpoint reverted (grep phone_api.py on Panda confirms).
7. Phase 7 draft PR opened against `main` with title `bench(stt): Parakeet v2/v3 Panda A/B — <verdict>`; PR body references plan path; no `.wav` or phone-transcript files in the diff.
8. User places a test call via Telegram (or schedules one within 30 min of bench end) to confirm Annie is answering normally.

## Retrospective (after adoption decision lands)

After the deploy-or-stay decision is implemented in a follow-up session, tag each of the 22 Stage 6 findings as HIT / MISS / PARTIAL / N/A per the `planning-with-review` retrospective protocol. Findings that consistently HIT are elevated to CLAUDE.md or memory; findings that MISS get a note for future reviewers. This closes the feedback loop on the review process itself.
