# TODO — In-Process Streaming STT (post session-114 follow-up)

**Created:** session 114, 2026-04-15
**Source:** `/planning-with-review` Stage 2 alternative A1, rejected for session 114, filed here per skill rule ("deferred LOW must have concrete follow-up").
**Target session:** 115+ after streaming-sidecar work is validated and stable.
**Parent plan:** `~/.claude/plans/reflective-wobbling-blossom.md`

---

## The proposal

Replace the HTTP/WebSocket sidecar architecture (current: `scripts/parakeet_stt_server.py`, planned: `scripts/nemotron_stt_server.py`) with **in-process streaming STT** inside the phone daemon using `multiprocessing.Process` + `torch.multiprocessing.Queue` + `multiprocessing.Event` for abort.

## Why this was proposed

The adversarial architecture reviewer for session 114's plan observed that the separate-sidecar model costs us **IPC latency on the cancel path**:

- WebSocket framing + serialization overhead per audio frame
- TCP FIN → server-side teardown → GPU tensor free requires multiple async hops
- Cancel propagation depends on TCP state + asyncio event loop scheduling + thread pool scheduling simultaneously

An in-process subprocess model would:
- Replace TCP with shared-memory queues (near-zero serialization)
- Replace "WebSocket close + wait for server finally" with `multiprocessing.Event.set()` (next inference-loop iteration checks it)
- Keep GPU allocator isolated from Chatterbox (separate OS process still preserved)
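The cancel path above can be sketched as a minimal worker loop. This is a hypothetical stand-in, not the planned implementation: `stt_worker` is invented, plain `multiprocessing` queues replace `torch.multiprocessing` ones so it runs without a GPU, and the inference step is stubbed out.

```python
import multiprocessing as mp
import queue

def stt_worker(frames, results, abort):
    """Inference loop: drain PCM frames, emit transcripts, honor abort.

    The abort event is checked once per chunk, so cancel latency is
    bounded by one loop iteration rather than TCP/WebSocket teardown.
    """
    while not abort.is_set():
        try:
            frame = frames.get(timeout=0.05)  # int16 PCM bytes
        except queue.Empty:
            continue
        if frame is None:  # sentinel: end of utterance
            break
        # Hypothetical stand-in for one streaming-inference step.
        results.put({"partial": f"<{len(frame)} bytes>"})

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # CUDA-safe start method
    frames, results, abort = ctx.Queue(), ctx.Queue(), ctx.Event()
    p = ctx.Process(target=stt_worker, args=(frames, results, abort))
    p.start()
    frames.put(b"\x00\x00" * 160)  # one 10 ms frame at 16 kHz
    print(results.get(timeout=10))
    abort.set()                    # barge-in: worker exits at next check
    p.join(timeout=10)
```

Contrast with the WebSocket path, where the same barge-in needs a close frame, server-side teardown, and `finally` cleanup before GPU state is freed.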

## Why rejected for session 114

1. **Session 113 just shipped the HTTP sidecar pattern.** The explicit motivation was keeping NeMo's ~7 GB dependency tree (Lhotse, pyannote, lightning, torchcodec, kaldialign, the `nv_one_logger` stub) out of the main `~/.venv`. Reverting to in-process risks re-contaminating that venv unless the subprocess runs from `~/parakeet-bench-venv` with its own Python interpreter, which is closer to "subprocess with a different Python" than "in-process," and arguably the sidecar model again under a thinner veneer.

2. **Scope creep.** The session-114 goal is "add streaming ASR + barge-in." The subprocess-vs-sidecar question is architectural refactoring that deserves its own session with dedicated measurement of actual cancel-latency delta under real phone load.

3. **Unproven win.** The reviewer's claim that cancel latency drops from "it depends" to "next event-loop tick" is plausible but unmeasured. Session 114's WebSocket cancel path with the `_cancel_any_pending` pattern plus server-side `finally` cleanup may already hit the sub-500 ms target, making the subprocess rework unnecessary. Measure before rewriting.

## Proposed follow-up plan (4 phases, ~1 session each)

### Phase 1 — Measure session-114's actual cancel-chain latency
After session 114 ships, instrument `cancel_chain()` with per-leg timestamps. Run 100 programmatic barge-ins. Compute p50/p95/p99 for:
- barge-in VAD trigger → WebSocket close() returns
- barge-in VAD trigger → `cache_last_channel` freed (verified via sidecar debug endpoint)
- barge-in VAD trigger → next LISTEN state entered

**Gate:** if p95 of the full chain is ≤ `LLM_CANCEL_FLOOR_S + 400 ms`, the sidecar IS fast enough. Close this TODO as "unnecessary."
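A sketch of that instrumentation, assuming only the stdlib; the class name, leg names, and helper are all hypothetical, not existing code:

```python
import statistics
import time

class CancelChainTimer:
    """Per-leg timestamps for one barge-in, relative to the VAD trigger."""

    def __init__(self):
        self.t0 = time.monotonic()  # barge-in VAD trigger
        self.legs = {}

    def mark(self, leg):
        """Record elapsed seconds for a named leg, e.g. 'ws_close'."""
        self.legs[leg] = time.monotonic() - self.t0

def percentiles(samples):
    """p50/p95/p99 per leg over a list of per-run latency dicts (seconds)."""
    out = {}
    for leg in samples[0]:
        xs = sorted(s[leg] for s in samples)
        qs = statistics.quantiles(xs, n=100)  # qs[i] ~ (i+1)th percentile
        out[leg] = {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
    return out
```

One `CancelChainTimer.legs` dict per programmatic barge-in, 100 runs, then read the full-chain p95 out of `percentiles` and compare it against the gate.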

### Phase 2 — Prototype in-process subprocess
If Phase 1 shows measurable win potential, prototype `phone_audio.py::StreamingSTTSubprocess` using `multiprocessing.get_context('spawn').Process`. Key design constraints:
- Runs from `~/parakeet-bench-venv/bin/python` (preserves venv isolation) via explicit `sys.executable` override
- Frame queue: `torch.multiprocessing.Queue` for int16 PCM bytes
- Partial/final transcript queue: second `multiprocessing.Queue` for JSON
- Abort: `multiprocessing.Event` checked in inference loop at chunk boundaries
- Session lifecycle: the manager starts with the phone daemon; either one inference process is spawned per call on demand, or a single long-lived process is reused across calls (open question)
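The constraints above could be wrapped roughly as follows. Only the class name comes from this plan; everything else is illustrative. A real version would call `multiprocessing.set_executable()` with the bench-venv interpreter before spawning and use `torch.multiprocessing` queues; this sketch omits both so it runs anywhere.

```python
import multiprocessing as mp

class StreamingSTTSubprocess:
    """Hypothetical wrapper: one spawn-started inference process per call.

    A real version would first point the spawn interpreter at the
    isolated venv, e.g.:
        mp.set_executable(os.path.expanduser(
            "~/parakeet-bench-venv/bin/python"))
    """

    def __init__(self, worker):
        self._ctx = mp.get_context("spawn")   # CUDA-safe start method
        self._worker = worker                 # must be importable for spawn
        self.frames = self._ctx.Queue()       # in: int16 PCM bytes
        self.transcripts = self._ctx.Queue()  # out: JSON-able partial/final dicts
        self._abort = self._ctx.Event()
        self._proc = None

    def start(self):
        self._proc = self._ctx.Process(
            target=self._worker,
            args=(self.frames, self.transcripts, self._abort),
            daemon=True,
        )
        self._proc.start()

    def cancel(self):
        """Barge-in: flip the event; worker exits at its next chunk boundary."""
        self._abort.set()

    def close(self, timeout=5.0):
        self.cancel()
        if self._proc is not None:
            self._proc.join(timeout)
```

Note the cancel path is a single `Event.set()` visible to the worker at its next loop check, with no network round trip.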

### Phase 3 — A/B bench
Replay a recorded phone-call fixture through both architectures. Compare cancel latency, throughput, VRAM residency, and startup time (a `spawn`-started subprocess costs on the order of ~300 ms per start; the sidecar is persistent).

### Phase 4 — Migration or close
If subprocess wins by ≥ 100 ms p95 with no regression elsewhere, migrate. Otherwise close TODO with evidence.
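The decision rule above can be written down directly; the function and threshold constant are hypothetical names, and the "no regression elsewhere" half of the gate still needs a human looking at the Phase 3 numbers.

```python
import statistics

P95_WIN_THRESHOLD_S = 0.100  # Phase 4 gate: >= 100 ms p95 improvement

def p95(samples):
    """95th-percentile latency of a list of samples in seconds."""
    return statistics.quantiles(sorted(samples), n=100)[94]

def migration_verdict(sidecar_cancel_s, subprocess_cancel_s):
    """'migrate' or 'close' per the cancel-latency half of the Phase 4 rule.

    Throughput, VRAM residency, and startup regressions are not modeled
    here and must be checked separately before migrating.
    """
    delta = p95(sidecar_cancel_s) - p95(subprocess_cancel_s)
    return "migrate" if delta >= P95_WIN_THRESHOLD_S else "close"
```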

## Risks if adopted prematurely

- `torch.multiprocessing` with CUDA requires the `spawn` (or `forkserver`) start method, because CUDA contexts cannot safely survive `fork`; `spawn` slows process startup
- `nv_one_logger` stub and other NeMo import-time side effects might not survive `spawn` cleanly; needs verification
- Subprocess crash recovery is harder than sidecar crash recovery (systemd handles the latter)

## Non-goals

- Do NOT revisit the in-process vs. out-of-process decision for Chatterbox. Chatterbox is a separate concern with different contention patterns (GPU synthesis is long-running; STT is chunk-streaming).
- Do NOT bundle this with the Pipecat migration (see `docs/TODO-PIPECAT-MIGRATION.md`) — orthogonal changes.

## Owner / trigger

- **Owner:** whoever picks up post-114 STT performance work
- **Trigger:** Phase 1 metrics from this TODO are the gate — start Phase 2 only if p95 > budget
- **Deadline:** none (deferred indefinitely; only act if measured need)
