# Next session — streaming ASR + barge-in for the her-os phone loop

**Written:** session 113, 2026-04-15
**Executes against:** `parakeet-stt-bench-20260415` branch (PR #4 open, parent of this work)
**Predecessor:** `docs/NEXT-SESSION-PARAKEET-STT-BENCH-V2.md` (bench complete, verdict: stay-whisper strict; live-swapped to Parakeet v2 anyway per user override)

---

## Starting state (inherited from session 113)

- **Phone daemon runs Parakeet v2 (`nvidia/parakeet-tdt-0.6b-v2`)** via an HTTP sidecar at `http://localhost:11438` on Panda. Whisper is NOT loaded (skipped when `PARAKEET_URL` env var is set).
- **Parakeet v3 tested live** and verified equivalent to v2 by the user ("works fine, I didn't find any difference"). Either checkpoint is acceptable.
- **Sidecar implementation**: `scripts/parakeet_stt_server.py` using aiohttp, running in `~/parakeet-bench-venv` (NeMo + lhotse + pyannote + lightning + nv_one_logger stub + torchcodec + openai-whisper). Uses `PARAKEET_MODEL_PATH` + `PARAKEET_MODEL_NAME` env vars for checkpoint choice.
- **Phone daemon hook**: `services/annie-voice/phone_audio.py:PhoneSTT` — `load()` skips Whisper when `PARAKEET_URL` is set; `transcribe()` POSTs WAV bytes to sidecar. Whisper code path preserved for one-line revert.
- **Current behavior is half-duplex**: VAD-segmented whole utterances → sidecar → text → LLM → TTS → playback-to-completion → repeat. **No barge-in.**
- **Sidecar is NOT supervised** (manual `nohup` launch, no systemd). Promote to a real service before adding more complexity.

### Revert recipe
```
# Back to Whisper in-process (the empty override must reach start.sh too;
# `VAR= cmd1 && cmd2` applies the assignment only to cmd1):
PARAKEET_URL_OVERRIDE= ./stop.sh phone && PARAKEET_URL_OVERRIDE= ./start.sh phone
# Kill sidecar harmlessly:
ssh panda 'pkill -f parakeet_stt_server'
```

---

## Why this session exists

The user asked about `parakeet-tdt_ctc-110m-streaming` after the live v2 swap. Research shows **that model ID doesn't exist**: 110m is the offline Hybrid TDT+CTC model, not a streaming variant. But the underlying asks are real and coupled:

1. **Streaming ASR** — chunk-by-chunk partial transcripts for lower perceived turn latency.
2. **Barge-in** — user can interrupt Annie mid-TTS and be understood immediately.

These are **coupled** because the natural trigger for barge-in is VAD on the post-AEC mic during TTS playback, and the natural way to render "Annie heard me" is streaming partials surfaced to the LLM as they arrive.

Neither requires a model swap in isolation — the current setup could do barge-in with VAD alone. But if we're reworking `phone_loop.py` anyway, moving to a streaming-native checkpoint is cheaper than bolting streaming onto v2 after the fact.

---

## Selected streaming checkpoint (locked by user session 113)

**`nvidia/nemotron-speech-streaming-en-0.6b`** — NVIDIA's current-generation cache-aware streaming ASR (released 2026-03-12).

| Attribute | Value |
|---|---|
| Params | 600 M |
| Architecture | FastConformer RNN-T, 24 encoder layers, 8× depth-wise separable subsampling |
| Streaming native | ✅ cache-aware; encoder state carried across chunks with zero recompute |
| Runtime-configurable chunk sizes | 80 ms / 160 ms / 560 ms / 1120 ms via `att_context_size=[70,R]` where R ∈ {0,1,6,13} |
| Partial transcripts | ⚠ **NOT exposed in public NeMo API** — internal state is there but `transcribe()` returns finals only. Workaround decided below. |
| EOU signal | ❌ (not part of this model — use our own VAD) |
| Languages | English (US) only |
| WER | 2.32% LS-clean @ 1120 ms mode, 2.80% @ 80 ms mode, 6.93% average @ 1120 ms |
| Time-to-final (NVIDIA H100) | 24 ms median |
| License | NVIDIA Open Model License — self-host OK |
| Blackwell SM_121 | Listed as tested in model card |

### How we work around "no native partial transcripts"

NeMo's `transcribe_simulate_cache_aware_streaming` does the cache-aware passes internally but only returns finals. Two strategies — **pre-flight smoke test picks one**:

1. **Preferred:** patch the inference loop to surface `best_hyp_text` from the decoder state at each chunk boundary. We've read the streaming script; it's ~100 lines of Python and the hypothesis is in scope inside the chunk loop — just not returned. Saves one day vs. going deeper.
2. **Fallback:** skip per-chunk partials entirely. Treat the streaming API as "same transcribe API, just faster finalization (24 ms vs. Whisper's 99 ms)." Barge-in trigger becomes pure VAD on post-AEC mic, no ASR involvement until user finishes. This is simpler; loses speculative-LLM potential but still delivers the core UX win.

Pre-flight Gate 4 smoke tests both. Decision between them is driven by whichever the NeMo streaming script actually exposes in practice on Blackwell.
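If the Preferred patch works, the contract the sidecar would expose is a per-chunk stream of `(is_final, hypothesis)` pairs. A minimal sketch of that contract only, with `stream_chunks` as a hypothetical stand-in for the patched NeMo loop (the fake hypothesis growth below replaces the real encoder-cache feed, which is the part Gate 4 has to prove out):

```python
from typing import Iterable, Iterator, Tuple

def stream_chunks(chunks: list) -> Iterator[Tuple[bool, str]]:
    """Stand-in for the patched NeMo loop: yields (is_final, hypothesis)
    after each chunk. Real code would feed the encoder cache and read
    best_hyp_text; here we fake monotonically growing hypotheses."""
    text = ""
    for i, chunk in enumerate(chunks):
        text += chunk
        yield i == len(chunks) - 1, text

def consume(chunks: list):
    """Phone-daemon side of the contract: forward partials as they
    arrive, return the final transcript."""
    partials, final = [], None
    for is_final, text in stream_chunks(chunks):
        if is_final:
            final = text
        else:
            partials.append(text)
    return partials, final
```

The Fallback strategy collapses `stream_chunks` to a single final yield, so the consumer side stays identical either way; that symmetry is one argument for writing the phone-daemon code against this shape regardless of which strategy Gate 4 picks.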

### Why not the alternatives (brief, for future-you if you reconsider)
- **`parakeet_realtime_eou_120m-v1`** — 5× smaller and ships EOU for free, but the 120 M model's WER on real phone audio is a live unknown, while the 600 M checkpoint's English quality is proven. User chose 600 M explicitly.
- **Keep v2 + custom chunked wrapper** — we'd reinvent cache-aware streaming badly. Don't.

---

## What "barge-in" actually means in this codebase

**Required components:**
1. **AEC during playback** — already present via PulseAudio `BT_INPUT_NODE='echo_cancel_source'`. Verify it's echo-cancelling the Chatterbox playback reference properly; tune if mic picks up Annie's own voice.
2. **VAD or EOU-detector tap on post-AEC mic stream while TTS is playing.** Currently the phone loop drains echo frames between turns (`phone_loop.py:1321`) rather than VAD-listening during playback. This is the **main rework**.
3. **TTS cancellation primitive** — close the in-flight Chatterbox HTTP request + drop queued audio frames from the output buffer. Currently `phone_loop.py:677` silently drops TTS on error but has no explicit cancel.
4. **Interrupted-utterance capture** — from barge-in trigger until user stops, send to STT (streaming or whole-utterance), hand to LLM.
5. **LLM-in-flight cancellation** (optional, for speculative LLM) — abort the current Gemma 4 `/v1/chat/completions` call when user barges in with a redirection.

**Variants to pick from during planning:**
- **Full barge-in**: any user speech during TTS cancels it (most natural, highest false-trigger risk from background noise).
- **Confirmation barge-in**: only cancel on specific phrases ("Annie stop", "wait"). Safer but requires STT to parse barge-in audio before cancelling — adds latency.
- **EOU-only detection** (if using the 120m EOU model): the model itself signals turn-boundary; we cancel TTS when EOU flips from False → True on barge-in audio.

Primary recommendation: **full barge-in, AEC-tuned VAD trigger, and the in-flight Chatterbox HTTP request aborted by closing the connection** (note `httpx` defines no `CancelledError`; the close-connection route is what pre-flight Gate 5 validates).
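Components 2 and 3 reduce to a race: a playback task against a VAD watcher, and whichever finishes first cancels the other. A minimal asyncio sketch of that shape (all names are hypothetical; the real loop would write PCM frames to the output device and poll webrtcvad on post-AEC frames instead of these stubs):

```python
import asyncio

async def play_frames(frames, played, frame_s=0.005):
    """Feed queued TTS frames to the output device (stubbed as a list)."""
    for f in frames:
        played.append(f)
        await asyncio.sleep(frame_s)  # simulate audio-write pacing

async def watch_vad(is_speech, poll_s=0.002):
    """Poll the post-AEC mic VAD; return once speech is detected."""
    while not is_speech():
        await asyncio.sleep(poll_s)

async def speak_with_barge_in(frames, is_speech):
    """Play TTS, but cancel immediately if the VAD fires.
    Returns (playback_completed, frames_actually_played)."""
    played = []
    playback = asyncio.create_task(play_frames(frames, played))
    barge = asyncio.create_task(watch_vad(is_speech))
    done, pending = await asyncio.wait(
        {playback, barge}, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()  # cancels playback on barge-in, or the watcher on completion
    await asyncio.gather(*pending, return_exceptions=True)
    return playback in done, played
```

The real TTS-cancel primitive (closing the Chatterbox connection, flushing the PCM buffer) would live inside the playback task's cancellation handler rather than in a stubbed list.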

---

## Scope for next session

### In-scope
- Research the 120m EOU model + 600m streaming model: actual HF weights available, license check, NeMo API for streaming loop, Blackwell compatibility (SM_121).
- Extend `scripts/parakeet_stt_server.py` (or add a new server) to:
  - Accept a WebSocket or chunked-POST stream of audio frames.
  - Emit per-chunk partial transcripts (if the model supports) or at minimum an EOU signal.
  - Maintain per-session cache state.
- Rework `services/annie-voice/phone_loop.py` to:
  - Run VAD/EOU on the post-AEC mic stream DURING TTS playback (not just between turns).
  - Issue explicit Chatterbox cancel on barge-in trigger.
  - Handle the interrupted-utterance turn as a normal STT → LLM → TTS cycle.
- One benchmark session (similar to session 113's pattern) comparing the streaming model vs. current Parakeet v2 on live phone audio: latency, WER, perceived responsiveness, barge-in false-trigger rate.
- Revert plan at every step — barge-in can be env-gated via `ENABLE_BARGE_IN=1` so default remains half-duplex until validated.
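Whichever transport the planning session picks (WebSocket or chunked POST), the message framing can be sketched now. A hedged draft of one possible wire format; the message types and field names below are placeholders, not a decided protocol:

```python
import base64
import json

def encode_audio(seq: int, pcm: bytes) -> str:
    """Phone daemon -> sidecar: one chunk of mono PCM, base64 over JSON."""
    return json.dumps({"type": "audio", "seq": seq,
                       "pcm": base64.b64encode(pcm).decode("ascii")})

def encode_partial(text: str) -> str:
    """Sidecar -> phone daemon: per-chunk hypothesis (Preferred strategy only)."""
    return json.dumps({"type": "partial", "text": text})

def encode_final(text: str) -> str:
    """Sidecar -> phone daemon: utterance-final transcript."""
    return json.dumps({"type": "final", "text": text})

def decode(msg: str) -> dict:
    """Parse any message; audio payloads come back as raw bytes."""
    d = json.loads(msg)
    if d["type"] == "audio":
        d["pcm"] = base64.b64decode(d["pcm"])
    return d
```

A binary framing would avoid the ~33% base64 overhead on audio, but JSON keeps the cancel/partial/final control messages trivially debuggable during the bench session.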

### Out-of-scope (defer to follow-up sessions)
- Speculative LLM invocation (starting Gemma 4 completion on partial transcripts).
- Multi-user/multi-session cache management in the streaming sidecar.
- Full systemd/supervisor hardening of the sidecar (a basic unit should still land this session if time permits, since the exit criteria assume one, but it is not a blocker).
- Mom-voice tuning passes.

---

## Pre-flight gates for the new session

1. **Current v2 deploy is still healthy.** `curl :11438/health` returns ok; phone `/v1/phone/status` returns 200. If not, revert to Whisper first and debug.
2. **User-oracle "no calls for 2-3 h"** — this rework will require multiple phone-daemon restarts. Same discipline as session 113.
3. **Panda VRAM headroom check** — the streaming model is ≤ 2.4 GB (600 M FP16) or ≤ 500 MB (120 M). Should fit alongside the existing Chatterbox + E2B allocations without evicting anything. Verify with `nvidia-smi`.
4. **NeMo streaming-loop example runs** — before touching her-os code, confirm `examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py` works end-to-end on a test WAV in `~/parakeet-bench-venv`. If this fails on Blackwell/SM_121, stop and investigate before any phone-loop edits.
5. **Chatterbox cancellation path validated** — write a tiny smoke test that starts a long TTS synth, cancels 500 ms in, verifies the output buffer drops cleanly. Without this primitive, barge-in can't be implemented.
6. **AEC echo-cancellation sanity check** — record 5 s of Annie speaking via Chatterbox through the phone's speaker-to-mic loop; capture `echo_cancel_source` output; verify Annie's voice is attenuated ≥ 20 dB. If not, the VAD during playback will constantly false-trigger.
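Gate 6's "≥ 20 dB" check is a straight RMS comparison between the playback reference and the post-AEC capture. A stdlib-only sketch (the function names and 16-bit PCM framing are our assumptions, not from the daemon code):

```python
import math
from array import array

def rms(pcm16: array) -> float:
    """Root-mean-square level of 16-bit PCM samples."""
    if not pcm16:
        return 0.0
    return math.sqrt(sum(s * s for s in pcm16) / len(pcm16))

def attenuation_db(reference: array, post_aec: array) -> float:
    """How much quieter the post-AEC capture is than the playback
    reference, in dB. Gate 6 wants >= 20 dB with only Annie speaking."""
    r, p = rms(reference), rms(post_aec)
    if p == 0.0:
        return float("inf")
    return 20.0 * math.log10(r / p)
```

In practice both buffers come from 5 s captures (Chatterbox output vs. `echo_cancel_source`), trimmed to the overlapping window before comparing.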

---

## Key load-bearing decisions (some resolved, rest for planning)

| Decision | Status | Default | Alternative | Impact |
|---|---|---|---|---|
| Streaming model | **LOCKED (session 113)** | `nemotron-speech-streaming-en-0.6b` | — | User chose the 600 M checkpoint for WER confidence on English phone audio. |
| Barge-in trigger | **LOCKED** (given no EOU in the 600 m model) | VAD on post-AEC mic stream | — | Reuse webrtcvad (already a daemon dep) or Silero VAD. Tune on real phone audio. |
| Chunk size | Open | 160 ms (balances latency, throughput, and ~7.5% WER) | 80 ms (24 ms TTF, +1.5 pp WER) or 560 ms / 1120 ms (best WER, highest TTF) | Cascades through VAD tuning + TTS cancel latency. Decide empirically in Phase 3 of the plan. |
| Partial transcripts surfaced | Open | **Finals-only (Fallback)** | Hack NeMo loop (~1 day) to surface per-chunk hyps | Decided by pre-flight Gate 4 smoke test. |
| Protocol to phone daemon | Open | WebSocket | Chunked HTTP POST | WebSocket cleanest for bidirectional (partials / cancel / final). Pick WebSocket unless a NeMo-inside-asyncio gotcha forces HTTP. |
| Whisper fallback retention | Resolved | Keep code path in `phone_audio.py` until streaming is live 1 week | Delete now | Defensive — session 113's Parakeet swap is still young. |
| TTS cancellation mechanism | Open | `httpx.Client.close()` on in-flight synth + drop queued PCM | Add explicit `/cancel/<session_id>` endpoint on Chatterbox | Pre-flight Gate 5 validates the close-connection route; if buffer-flush proves fragile, escalate to the explicit endpoint. |
| Phone-side gate | **LOCKED** | `ENABLE_BARGE_IN=1` env var in `start.sh` phone block; **default OFF** until validated | — | Mirrors the `PARAKEET_URL_OVERRIDE` pattern used for v2 rollback. |
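One tuning detail behind the locked VAD-trigger row: a single spurious speech frame must not cancel TTS. A minimal hangover/debounce sketch (the frame counts are placeholders to tune on real phone audio; the per-frame booleans would come from webrtcvad on 20 ms post-AEC frames):

```python
class BargeInDetector:
    """Fire only after `trigger_frames` consecutive speech frames;
    re-arm after `hangover_frames` consecutive silence frames."""

    def __init__(self, trigger_frames: int = 5, hangover_frames: int = 10):
        self.trigger_frames = trigger_frames
        self.hangover_frames = hangover_frames
        self._speech_run = 0
        self._silence_run = 0
        self.triggered = False

    def update(self, frame_is_speech: bool) -> bool:
        """Feed one VAD verdict; returns True on the frame that trips."""
        if frame_is_speech:
            self._speech_run += 1
            self._silence_run = 0
            if not self.triggered and self._speech_run >= self.trigger_frames:
                self.triggered = True
                return True
        else:
            self._silence_run += 1
            self._speech_run = 0
            if self.triggered and self._silence_run >= self.hangover_frames:
                self.triggered = False  # user finished; ready for next turn
        return False
```

With 20 ms frames, `trigger_frames=5` means roughly 100 ms of sustained speech before TTS is cut, which trades a little barge-in latency for resistance to the TV/radio false-positive risk flagged below.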

---

## Risks + gotchas (seed for /planning-with-review adversarial stage)

1. **110m / 120m models have materially higher WER than the 600 M Parakeet.** Session 113 measured v2 at 1.67% WER (phone-sim); the 120m EOU model on the same corpus is likely 3–5%. If recognition of Mom's voice falls below an intelligibility threshold, we regress.
2. **The 600m streaming model's NeMo API doesn't expose partials.** "Partial transcripts" may require reaching into the streaming encoder state between chunks — NeMo internals churn. Budget a day just for this if we go that route.
3. **AEC tuning is brittle.** PulseAudio `echo_cancel` is good for phone-pose mic-to-speaker distances but struggles with speakerphone mode. Validate on Mom's actual device acoustics before trusting VAD-during-playback.
4. **TTS-in-flight cancellation may leave Chatterbox in a bad state.** The server keeps model state per request; aborted HTTP may not release GPU buffers promptly. Watch for slow leak over many barge-ins.
5. **Full barge-in false-positives on TV / radio / background speech.** Users in noisy environments will cancel Annie constantly. Consider an AEC sensitivity knob + fallback to confirmation barge-in.
6. **EOU detection on 120m model may require a different inference script than standard NeMo streaming.** The model card should be re-read carefully — some models need `pipeline_builder` with `cache_aware_rnnt.yaml` config.
7. **Sidecar supervision missing.** If we're extending the sidecar to stateful WebSockets, a crash loses all active call contexts. systemd unit becomes non-optional.
8. **nv_one_logger stub from session 113 still needed** in the streaming bench venv — NeMo 2.7.2 imports it unconditionally. Reuse the stub template from `~/parakeet-bench-venv/lib/python3.12/site-packages/nv_one_logger/`.
9. **Blackwell SM_121 compatibility** — Parakeet v2 / v3 / Whisper all work after session 113's torch 2.11.0+cu128 setup, but the streaming models are newer (March 2026). Verify before planning further.

---

## Suggested planning-session prompt

When starting the next session, paste this as the opening message (after the usual SessionStart hook):

```
Read docs/NEXT-SESSION-PARAKEET-STREAMING-BARGE-IN.md.

Goal this session: produce an executable plan (no code edits yet) that takes
the her-os phone daemon from half-duplex Parakeet v2 to full-duplex streaming
ASR + barge-in using nvidia/nemotron-speech-streaming-en-0.6b (model already
locked in the handoff doc — don't re-litigate), without regressing production
until validated.

Required plan stages:
1. Pre-flight gates: Panda health, nemotron-speech-streaming-en-0.6b weights
   downloadable, NeMo streaming script runs end-to-end on Blackwell SM_121 in
   ~/parakeet-bench-venv, Chatterbox cancel-in-flight smoke test, AEC sanity
   against Mom's device acoustics, partial-transcript exposability check
   (picks Preferred vs. Fallback strategy from the handoff doc).
2. Streaming sidecar design: WebSocket protocol between phone daemon and
   sidecar; per-call cache_last_channel state lifecycle; partial + final
   emission rules if the Preferred patch works; systemd unit; token auth
   (same X-Internal-Token pattern as current v2 sidecar); model + chunk size
   env-configurable.
3. phone_loop.py rework: state-machine changes to add PLAYING_WITH_VAD state;
   AEC-side VAD listener; Chatterbox cancel-in-flight; interrupted-utterance
   normal turn handling; ENABLE_BARGE_IN=1 env gate (default OFF) mirroring
   PARAKEET_URL_OVERRIDE pattern.
4. Live bench protocol: LibriSpeech + phone-sim + real phone audio (session
   113 pattern). Metrics: WER delta vs. current v2, chunk-size latency, VAD
   false-trigger rate during Annie TTS, barge-in success rate, perceived user
   satisfaction oracle.
5. Rollback plan at every step. ENABLE_BARGE_IN=0 revert; PARAKEET_URL_OVERRIDE
   back to old sidecar revert; Whisper-in-process revert (keep code path until
   this session fully validated).
6. Adversarial review via /planning-with-review. Address all CRITICAL / HIGH /
   MEDIUM findings with 0 deferrals.

Before planning: check MEMORY.md session 113 block for the load-bearing
findings from the bench. Do NOT re-research ground already covered in
docs/RESEARCH-STT-ALTERNATIVES.md "Measured benchmark results" section.
Do NOT re-open the model-selection debate — the handoff doc locks
nemotron-speech-streaming-en-0.6b based on user session-113 decision.
```

---

## Useful prior-art pointers

- NeMo streaming example: `examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py`
- Cache-aware conformer paper: arXiv 2312.17279
- NVIDIA scaling blog with latency numbers: huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents
- 120m EOU model card: huggingface.co/nvidia/parakeet_realtime_eou_120m-v1
- 600m streaming model card: huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b

---

## Files likely to change

| Path | Change type | Notes |
|---|---|---|
| `scripts/parakeet_stt_server.py` | REWORK (or split into 2 servers) | Add streaming endpoint; may become `scripts/parakeet_streaming_server.py` alongside existing offline sidecar. |
| `services/annie-voice/phone_audio.py` | MODIFY | Switch from whole-utterance POST to chunked WebSocket / or register a streaming client in PhoneSTT. |
| `services/annie-voice/phone_loop.py` | HEAVY REWORK | Turn state machine: add "PLAYING_WITH_VAD" state; handle BARGE_IN_DETECTED → CANCEL_TTS → BACK_TO_LISTENING transitions. |
| `services/annie-voice/chatterbox_tts.py` | MODIFY | Add cancel-in-flight primitive. |
| `start.sh` | MODIFY | `ENABLE_BARGE_IN` env gate; streaming sidecar launch; systemd unit deploy. |
| `config/systemd/panda-parakeet-streaming.service` | **NEW** | systemd unit for the streaming sidecar. |
| `benchmark-results/streaming-<date>/` | **NEW DIR** | live-call validation traces + latency samples + false-trigger counts. |
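For the new unit in the table above, a starting-point sketch only: the paths, script name, and env values below are guesses assembled from this doc's context (the bench venv, the `PARAKEET_MODEL_NAME` convention, the proposed `parakeet_streaming_server.py` split), not a vetted unit.

```ini
# config/systemd/panda-parakeet-streaming.service (sketch; adjust paths/env)
[Unit]
Description=Nemotron streaming STT sidecar for the phone daemon
After=network-online.target
Wants=network-online.target

[Service]
# Hypothetical paths: venv from session 113, script name from the files table.
ExecStart=%h/parakeet-bench-venv/bin/python %h/scripts/parakeet_streaming_server.py
Environment=PARAKEET_MODEL_NAME=nvidia/nemotron-speech-streaming-en-0.6b
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```

`Restart=on-failure` addresses risk 7 (a crash losing active call contexts is bad enough without also requiring a manual relaunch); per-call cache state recovery after a restart is still out of scope.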

---

## Session exit criteria

**Plan session ends when:**
- Approved plan written at `~/.claude/plans/<name>.md`, reviewed via `/planning-with-review`, all findings Implemented.
- Handoff / execution session doc written at `docs/NEXT-SESSION-PARAKEET-STREAMING-BARGE-IN-EXEC.md`.
- User oracle'd that execution is scoped for a future bench window (likely a 4-6 h window given the phone_loop complexity).

**Execution session ends when:**
- Streaming sidecar live under systemd.
- `phone_loop.py` barge-in implemented behind the env gate, **default OFF**; Annie can enable it with `ENABLE_BARGE_IN=1`.
- Test-call transcript with at least one successful barge-in captured in the benchmark dir.
- PR merged (or closed with explicit "defer") against `main`.
