# Next Session: Streaming ASR + Barge-in — EXECUTION

**Created:** session 114, 2026-04-15
**Supersedes:** `docs/NEXT-SESSION-PARAKEET-STREAMING-BARGE-IN.md` (session-113 handoff, pre-review)
**Plan file (authoritative):** `~/.claude/plans/reflective-wobbling-blossom.md`
**Status:** Approved + stress-tested via `/planning-with-review`. Ready to execute.

---

## What

Rework the her-os phone daemon from half-duplex Parakeet v2 to full-duplex streaming ASR + barge-in. New pieces: a streaming sidecar (`nemotron-speech-streaming-en-0.6b` on `:11439`); an atomic end-to-end abort chain (Chatterbox GPU-level cancel + llama-server slot release + STT WebSocket close + playback kill); systemd supervision for both sidecars; and an env-gated rollout defaulting to `ENABLE_BARGE_IN=0` until live-call validation passes. Rollback to the session-113 state is preserved at every step.

## Plan

**Read `~/.claude/plans/reflective-wobbling-blossom.md` first — it has the full 7-phase implementation, Stage 0 known gotchas (28 entries), Stage 1B state machines (3 machines), Stage 1C pre-mortem (22 scenarios), and Stage 6 adversarial-review response table (19 findings all Implemented + 2 alternatives rejected with concrete follow-up TODOs).**

## Key Design Decisions (from adversarial review — DO NOT REVERT)

Load-bearing decisions that MUST survive into execution. A fresh session that accidentally reverts any of these is implementing the pre-review version.

1. **Chatterbox cancel requires thread-side interruption, not asyncio cancel.** `_synthesize_gpu` runs inside `loop.run_in_executor(None, ...)`; `asyncio.Task.cancel()` cannot interrupt the thread. Phase 2.0 prototype decides Option A (patch generate loop with `threading.Event` checkpoint) vs Option B (accept `synth_duration + cleanup` budget and re-oracle user on latency). Never silently ship a cancel that only returns 200 fast while GPU keeps synthesizing.
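
   If Gate 2.0 lands on Option A, the shape is roughly this minimal sketch (the names `SynthSession`, `SynthAborted`, and `generate` are illustrative stand-ins, not the real Chatterbox API):

   ```python
   import threading

   class SynthAborted(Exception):
       """Raised inside the worker thread when a cancel lands mid-synthesis."""

   class SynthSession:
       def __init__(self):
           self._abort = threading.Event()  # set from the event loop thread

       def cancel(self):
           self._abort.set()

       def generate(self, n_chunks):
           # Stand-in for the real per-chunk generate loop. The checkpoint is
           # the point: the GPU loop polls the Event every iteration and bails
           # early instead of synthesizing to completion.
           chunks = []
           for i in range(n_chunks):
               if self._abort.is_set():
                   raise SynthAborted(f"aborted after {i} chunks")
               chunks.append(i)  # real code: decode one audio chunk here
           return chunks
   ```

   The cancel endpoint sets the Event from the asyncio side; the thread notices at the next chunk boundary, which is what bounds the GPU-release latency.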

2. **Re-entrancy primitive is `asyncio.Lock`, never `asyncio.Event`.** An Event can be left set by an unhandled exception, silencing all future barge-ins. Use `async with cancel_lock:` which releases on any exit path.
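
   The difference is mechanical: the context manager releases on every exit path, including exceptions. A minimal sketch (function names are illustrative):

   ```python
   import asyncio

   cancel_lock = asyncio.Lock()  # re-entrancy guard for the whole cancel chain

   async def run_cancel_chain(fail=False):
       # `async with` releases on EVERY exit path. An asyncio.Event left set
       # by an unhandled exception would instead silence all future barge-ins.
       async with cancel_lock:
           if fail:
               raise RuntimeError("a cancel leg blew up")
           return "cancelled"

   async def demo():
       try:
           await run_cancel_chain(fail=True)
       except RuntimeError:
           pass
       # The lock was released despite the exception; the next barge-in works.
       return await run_cancel_chain()
   ```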

3. **Global `_cancel_any_pending: asyncio.Event` set before ANY cancel leg fires.** Prevents the race where barge-in triggers before `session_id` is registered on the Chatterbox server. The TTS client checks this event before issuing new synth POSTs and aborts if set.
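
   A sketch of the client-side guard (`synth_post` is a stand-in for the real TTS client call):

   ```python
   import asyncio

   _cancel_any_pending = asyncio.Event()  # set before ANY cancel leg fires

   async def synth_post(text):
       # Guard against the race where barge-in lands before session_id is
       # registered server-side: refuse to start a new synth at all.
       if _cancel_any_pending.is_set():
           return None  # caller treats None as "turn already aborted"
       return f"POST /synthesize {text!r}"  # stand-in for the real HTTP call
   ```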

4. **`asyncio.gather(*ops, return_exceptions=True)` in cancel_chain must inspect results and re-raise `CancelledError` if the chain itself was cancelled externally.** Standard asyncio gather trap; swallowing this deadlocks shutdown paths.
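
   The trap and its fix in miniature (nothing beyond stdlib asyncio assumed):

   ```python
   import asyncio

   async def cancel_chain(*ops):
       results = await asyncio.gather(*ops, return_exceptions=True)
       for r in results:
           # return_exceptions=True hands a leg's CancelledError back as a
           # plain result; swallowing it here would deadlock shutdown paths.
           if isinstance(r, asyncio.CancelledError):
               raise r
       return results
   ```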

5. **WebSocket auth is a first-message protocol** (`{"type":"auth","token":"<t>"}` within 2 s of connect), because Authorization headers can leak into the aiohttp access log. Constant-time token compare; close with code 1008 on mismatch.
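
   A framework-free sketch of the validation logic (`EXPECTED_TOKEN` and the function name are illustrative; real code loads the token from secrets.env and closes the WS with 1008, RFC 6455 policy violation):

   ```python
   import hmac
   import json

   EXPECTED_TOKEN = "dev-only-token"  # illustrative; real value from secrets.env
   AUTH_DEADLINE_S = 2.0

   def check_first_message(raw, connect_ts, now):
       """Return (ok, close_code). close_code 1008 = policy violation."""
       if now - connect_ts > AUTH_DEADLINE_S:
           return False, 1008
       try:
           msg = json.loads(raw)
       except ValueError:
           return False, 1008
       if msg.get("type") != "auth":
           return False, 1008
       # Constant-time compare so the token doesn't leak via timing.
       if not hmac.compare_digest(str(msg.get("token", "")), EXPECTED_TOKEN):
           return False, 1008
       return True, None
   ```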

6. **Audio frames over WS are binary int16 PCM, not base64 JSON.** Saves 33% wire + CPU encode. Control messages stay JSON text frames.
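
   The wire encoding in miniature (pure stdlib; real code hands the `bytes` straight to aiohttp's binary-frame send):

   ```python
   import array

   def encode_frame(samples):
       # int16 PCM, native-endian (LE on the target hardware), sent as a
       # binary WS frame: 2 bytes/sample vs ~2.67 for base64-in-JSON, and no
       # JSON parse on the hot path. Control messages stay JSON text frames.
       return array.array("h", samples).tobytes()

   def decode_frame(buf):
       a = array.array("h")
       a.frombytes(buf)
       return a.tolist()
   ```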

7. **Sidecar does NOT accept audio until the model is loaded.** Module-level `_MODEL_READY: asyncio.Event` gates the WS handler. `/health` returns 503 until ready. Prevents the 3 a.m.-pager scenario where the first post-reboot call silently drops Mom's first word on a `NoneType.transcribe`.
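
   The gate in miniature (HTTP plumbing stripped out; these handlers return status codes directly instead of aiohttp responses):

   ```python
   import asyncio

   _MODEL_READY = asyncio.Event()  # set exactly once, after model load completes

   async def health():
       # /health handler: 503 until the model is actually usable.
       return 200 if _MODEL_READY.is_set() else 503

   async def ws_handler(first_frame):
       if not _MODEL_READY.is_set():
           return "refused"  # real code: close the WS instead of transcribing
       return f"transcribing {len(first_frame)} bytes"
   ```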

8. **Strategy pattern for dual-duplex**, not conditional branches: `HalfDuplexOrchestrator` (verbatim copy of session-113 path) + `FullDuplexOrchestrator` (new). Regression test `test_phone_loop_halfduplex_regression.py` replays recorded session-113 fixture to prove no drift.

9. **Systemd ordering: `After=panda-llamacpp.service` + `Requires=panda-llamacpp.service`** on both STT sidecar units. Without this, Panda reboots silently drop the first call.

10. **Interrupt-to-silence budget is honestly `$LLM_CANCEL_FLOOR_S + 500 ms`,** not a fictional flat 500 ms. Gate 0.8 measures the actual llama-server slot-release floor on Panda E2B before budgeting.

11. **Explicit `try/finally` + `del tensor; await loop.run_in_executor(None, torch.cuda.empty_cache)` for nemotron `cache_last_channel` cleanup.** Weakref-based GC is not guaranteed pre-collection. Phase 1.4 includes 500-session VRAM-stable test.
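
   The cleanup shape, with stand-ins so it runs without torch (`free_gpu_cache` plays the role of `torch.cuda.empty_cache`; `model.new_cache` / `model.transcribe` are illustrative, not the NeMo API):

   ```python
   import asyncio

   async def run_streaming_session(loop, model, free_gpu_cache):
       cache_last_channel = model.new_cache()  # GPU-resident stream state
       try:
           return model.transcribe(cache_last_channel)
       finally:
           # Deterministic release: drop the only reference, then flush the
           # CUDA caching allocator off the event loop. Relying on GC is not
           # enough; pre-collection, the tensor can keep VRAM pinned.
           del cache_last_channel
           await loop.run_in_executor(None, free_gpu_cache)
   ```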

12. **`torch.cuda.empty_cache()` always wrapped in `run_in_executor`** to keep the event loop responsive during cancel.

13. **`_gpu_sem.release()` guarded by `sem_acquired` flag** — calling release without matching acquire permanently leaks the semaphore above its cap. Kills the serialization guarantee silently.
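
   The guard in miniature (`gpu_turn` is an illustrative wrapper, not the real call site):

   ```python
   import asyncio

   _gpu_sem = asyncio.Semaphore(1)  # serializes all GPU work in the process

   async def gpu_turn(work, timeout=0.05):
       sem_acquired = False
       try:
           await asyncio.wait_for(_gpu_sem.acquire(), timeout)
           sem_acquired = True
           return await work()
       finally:
           # A bare release() here would fire even when acquire() timed out,
           # pushing the counter above its cap and silently breaking the
           # GPU-serialization guarantee for the rest of the process lifetime.
           if sem_acquired:
               _gpu_sem.release()
   ```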

14. **`MAX_CONCURRENT_SESSIONS=2` is per-process.** Assertion at startup blocks `--workers > 1`. Code comment documents the single-process assumption.

15. **200 ms grace delay before reopening the STT WebSocket after a cancel.** Prevents a reconnect storm from hitting the session cap (WS close code 1013) while the old `cache_last_channel` is still draining.

16. **Phase 4 soak test runs with ALL 4 services live simultaneously** (Chatterbox + E2B + Parakeet v2 + nemotron), NOT in isolation. VRAM delta ≤ 200 MB required — real fragmentation only manifests under contention.

17. **Post-call real TTS smoke (byte-level, not `/health`)** after every test call per session-103 MUTE-NOT-CRASH gotcha.

18. **Env flags are sticky for process lifetime.** Restart required between every flag flip. Document in `start.sh` comment + Phase 5 operator steps.

19. **Dev-prod secrets parity:** `config/her-os-secrets.env.example` in repo + `config/dev-setup.sh` generates dev-local `~/.her-os/secrets.env`. Sidecar hard-refuses empty tokens (no more `NO-LAN-ONLY` default).
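
   The hard refusal at startup could look like this (the env var name `STT_WS_TOKEN` is illustrative):

   ```python
   import os
   import sys

   def load_required_token(var="STT_WS_TOKEN"):
       # Hard-refuse empty/missing tokens at startup: no LAN-only default,
       # no silently-unauthenticated sidecar.
       token = os.environ.get(var, "").strip()
       if not token:
           sys.exit(f"{var} is empty; source ~/.her-os/secrets.env first")
       return token
   ```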

## Files to Modify

Ordered by phase. See plan file Files-to-Modify table for LOC estimates and NEW-vs-MODIFY.

**Phase 1 (sidecar):**
- `scripts/nemotron_stt_server.py` **NEW** — aiohttp WebSocket sidecar
- `scripts/_stt_stream_common.py` **NEW** — client lib
- `scripts/tests/test_nemotron_stt_server.py` **NEW** — unit + 500-session leak test

**Phase 2 (Chatterbox cancel):**
- `services/annie-voice/chatterbox_server.py` — add `/cancel/<session_id>` + `SynthSession` tracking + Option A/B generate-loop interrupt
- `services/annie-voice/chatterbox_tts.py` — add `session_id` + `cancel(session_id)` + `_cancel_any_pending` guard
- `services/annie-voice/tests/test_chatterbox_cancel.py` **NEW**

**Phase 3 (phone_loop):**
- `services/annie-voice/phone_loop.py` — `TurnState` enum, `CancellableTurn`, Strategy pattern split, `cancel_chain`
- `services/annie-voice/phone_audio.py` — `PhoneSTT.stream_ws()` alongside existing `transcribe()`
- `services/annie-voice/tests/test_phone_loop_bargein.py` **NEW**
- `services/annie-voice/tests/test_phone_loop_halfduplex_regression.py` **NEW**

**Phase 4 (bench):**
- `scripts/bench_streaming_bargein.py` **NEW**
- `scripts/analyze_streaming_bargein.py` **NEW**
- `benchmark-results/streaming-2026-04-15/` **NEW DIR**

**Phase 6 (systemd):**
- `config/systemd/panda-nemotron-stt.service` **NEW**
- `config/systemd/panda-parakeet-stt.service` **NEW** (retroactive for v2)
- `config/her-os-secrets.env.example` **NEW**
- `config/dev-setup.sh` **NEW**
- `start.sh` — env block additions + `~/.her-os/secrets.env` sourcing

**Phase 7 (docs):**
- `docs/RESEARCH-STT-ALTERNATIVES.md` — append streaming validation section
- `docs/RESOURCE-REGISTRY.md` — Active Models + Change Log
- `MEMORY.md` — session-114 block + new topic files
- `docs/TODO-INPROCESS-STREAMING-STT.md` **NEW** (A1 defer)
- `docs/TODO-PIPECAT-MIGRATION.md` **NEW** (A2 defer)

## Rejected Alternatives (know these exist before pivoting)

The Stage 2 adversarial reviewer proposed two architectural alternatives. Both were rejected for this session with concrete follow-up TODO files. **Do not silently switch to either mid-execution** — if Phase 1 measurements surprise you, pause and re-oracle the user with the trade-off, then edit the plan.

### A1 — In-process streaming STT via `multiprocessing.Process`
- **TODO file:** `docs/TODO-INPROCESS-STREAMING-STT.md`
- **What:** replace the HTTP/WebSocket sidecar with a subprocess using `torch.multiprocessing.Queue` for audio + `multiprocessing.Event` for abort.
- **Rejected because:** session 113 just shipped the HTTP sidecar specifically to keep NeMo's 7 GB of deps out of the main `~/.venv`. Reverting risks venv contamination unless subprocess is launched from `~/parakeet-bench-venv/bin/python` — at which point it's closer to "sidecar under a thinner veneer" anyway.
- **Re-open trigger:** Phase 1 cancel-chain measurements show p95 > `LLM_CANCEL_FLOOR_S + 400 ms`.

### A2 — Wholesale Pipecat migration
- **TODO file:** `docs/TODO-PIPECAT-MIGRATION.md`
- **What:** rewrite `phone_loop.py` as a Pipecat `Pipeline` with `InterruptionHandler` + `STTMuteFilter` + `TTS.interrupt()`, reusing `bot.py` infrastructure.
- **Rejected because:** `phone_loop.py` has ~1500 lines of accumulated domain logic over 113 sessions (tool calling, compaction, contact book, SEARXNG, thinking cues). Porting each as a `FrameProcessor` subclass is a multi-session migration.
- **Re-open trigger:** 2+ weeks of production stability post-session-114 ships; then start Phase M1 (parity spec).

### Which is more performant?

**A1 is more performant on cancel-chain latency.** A2's advantage is maintainability, not raw performance.

Concrete comparison:

| Axis | A1 (in-process subprocess) | A2 (Pipecat) | Session-114 baseline (sidecar + cancel_chain) |
|---|---|---|---|
| Cancel latency floor | **~1 event loop tick** (`multiprocessing.Event.set()` + next inference-loop iteration check) — best in class | ~10-50 ms (Pipecat `InterruptionHandler` fires `InterruptionFrame` through pipeline queue, each processor acks) | ~50-200 ms depending on TCP + thread pool scheduling (mitigated by session-114's `_cancel_any_pending` guard) |
| Audio-frame wire overhead | **Near-zero** (shared-memory queue, binary tensors) | Framework overhead per Frame object (~microseconds, but ~12/s × N processors adds up) | 33% saved vs base64-JSON (we already chose binary WS frames); still has TCP framing |
| LLM cancel propagation | Same as baseline — `httpx` on the subprocess side doesn't change llama-server slot release | Pipecat's `LLMResponseAggregator.interrupt()` is battle-tested against OpenAI-compatible endpoints; slightly tighter than our `current_llm_task.cancel()` | Gate 0.8 measures actual floor; likely ~2-4 s for 200-token response — this dominates everything else regardless of architecture |
| TTS GPU release | Same as baseline — Chatterbox is a separate process in all 3 architectures, so thread-side interrupt (Rule 1) is still required | Same | Phase 2 Option A (threading.Event in generate loop) or Option B (honest budget) — unchanged |
| Startup cost | **~300 ms per subprocess fork** with `spawn` start method (required for CUDA) — bad for per-call fresh process, fine if daemon-reused | Same as baseline (Pipecat pipeline runs persistent) | Persistent sidecar, zero per-call startup |
| VRAM allocator isolation from Chatterbox | Yes (separate OS process) | No (same `bot.py` process if we collapse) — this is a REGRESSION vs baseline | Yes (sidecar is separate process) |

**The honest answer:** for the specific "barge-in latency" metric that matters to UX, **A1 beats A2 beats baseline**, but the win is dominated by the LLM-cancel floor (Gate 0.8). If llama-server releases the slot in 2 s, no architecture change below that floor matters to Mom.

**A2 would be chosen for non-performance reasons** — specifically, cutting ~400 lines of state-machine code + gaining Pipecat's ecosystem (telemetry, transport plugins, interrupt primitives). A2 is the right long-term move once phone_loop's accumulated logic justifies the migration cost.

**A1 would be chosen if** cancel-chain latency post-session-114 measures above budget AND we decide sub-process model beats the current sidecar pattern without unacceptable venv-contamination risk.

Both TODO files contain 4-phase execution plans with measurement gates — read them before pivoting.

## Start Command

```bash
cat ~/.claude/plans/reflective-wobbling-blossom.md
```

Then execute pre-flight Gates 0.1-0.9 in order. First failure aborts. After gates pass: Phase 0 → 1 → 2 (Gate 2.0 prototype first) → 3 → 4 → 5 → 6 → 7. **All adversarial findings are already addressed in the plan's Stage 6 response table — do not re-design any of the 19 decisions above without explicit user re-approval.**

## Verification

1. Pre-flight gates 0.1-0.9 all green, including Gate 0.8 LLM-cancel floor measurement and Gate 0.9 VRAM fragmentation baseline.
2. Phase 1: `pytest scripts/tests/test_nemotron_stt_server.py -q` passes including 500-session VRAM-stable test.
3. Phase 2: Gate 2.0 prototype decides Option A vs Option B; `pytest services/annie-voice/tests/test_chatterbox_cancel.py -q` passes incl. nvidia-smi VRAM asserts.
4. Phase 3: `pytest services/annie-voice/tests/test_phone_loop_{bargein,halfduplex_regression}.py -q` passes; regression fixture byte-identical to session-113 baseline.
5. Phase 4: `SUMMARY.md` has verdict `ship-barge-in | ship-streaming-stt-only | defer`; all 6 metrics + 4-services-live soak filled; VRAM delta ≤ 200 MB.
6. Phase 5: 3 stable test calls over 24 h; at least 1 live barge-in captured with cancel-chain trace; post-call real-TTS byte-level smoke passes after each.
7. Phase 6: `systemctl status panda-{parakeet,nemotron}-stt.service` both `active (running)` with `After=panda-llamacpp.service`; deliberate kill triggers systemd restart within 10 s.
8. Phase 7: PR opens against `main`; CI green; all tests pass; `docs/TODO-{INPROCESS-STREAMING-STT,PIPECAT-MIGRATION}.md` exist with concrete future-session plans.

## Retrospective (after deploy)

Tag each of the 19 Stage 6 findings + 2 alternatives as HIT / MISS / PARTIAL / N/A per the `/planning-with-review` retrospective protocol. Findings that consistently HIT get elevated to CLAUDE.md or memory; findings that MISS get a note for future reviewers. This closes the feedback loop on the review process itself.
