# TODO — Chatterbox GPU Interrupt (Option A)

**Filed:** session 115, 2026-04-15
**Context:** follow-up to `~/.claude/plans/reflective-wobbling-blossom.md` Gate 2.0 decision.

## Why this exists

Session 115 chose **Option B** for Chatterbox cancel: `/cancel/<session_id>` marks the session cancelled and the HTTP handler returns `204 No Content` once the in-flight synth completes, but the underlying `ChatterboxTTS.generate()` thread keeps burning GPU until its own EOS or `max_new_tokens`.

For Mom's phone UX this is **invisible** — `phone_loop.kill_playback()` silences audio within ~50 ms regardless. But the Chatterbox GPU wastes 1-3 s of compute per barge-in, and during that tail the `_gpu_sem(1)` semaphore blocks any NEW synth for the next sentence. On a barge-in-heavy call this surfaces as stutter: the first barge-in succeeds, but the next sentence's synth queues behind the wasted tail.

## What Option A adds

Cut Chatterbox's actual GPU time on cancel from "full synth duration" to "≤ 50 ms" by injecting a `threading.Event` checkpoint into the autoregressive token loop.

## Concrete implementation plan

### Phase A.1 — Locate + verify loop hook

File: `~/workplace/her/her-os/.venv/lib/python3.12/site-packages/chatterbox/models/t3/t3.py`

Target: line ~352 inside `T3.inference()`:

```python
# ---- Generation Loop using kv_cache ----
for i in tqdm(range(max_new_tokens), desc="Sampling", dynamic_ncols=True):
    logits_step = output.logits[:, -1, :]
    ...
    if next_token.view(-1) == self.hp.stop_speech_token:
        logger.info(f"✅ EOS token detected! Stopping generation at step {i+1}")
        break
    ...
```

The loop is **per-token autoregressive** — ~20-50 ms per iteration on Panda. A `threading.Event.is_set()` check at the top of the loop gives us per-token cancel granularity.
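Assuming the loop shape above, the checkpoint reduces to this pattern — a sketch, where `abort_event`, `step_fn`, and `is_eos` are placeholders for the server-supplied event, the forward-pass + sampling step, and the `stop_speech_token` comparison:

```python
import threading

def generation_loop_sketch(max_new_tokens, abort_event, step_fn, is_eos):
    """Shape of the patched loop: one Event check per token."""
    tokens = []
    for i in range(max_new_tokens):
        if abort_event.is_set():   # per-token cancel: exits within one step
            break
        token = step_fn(i)         # stand-in for forward pass + sampling
        tokens.append(token)
        if is_eos(token):          # EOS check, as in the stock loop
            break
    return tokens
```

At ~20-50 ms per iteration, the worst-case cancel latency is one token's compute, which is where the "≤ 50 ms" budget comes from.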

### Phase A.2 — Choose injection strategy

**Option A.1 — monkey-patch at server startup** (simplest, most fragile)

```python
# chatterbox_server.py startup
import threading

import chatterbox.models.t3.t3 as t3_mod

_original_inference = t3_mod.T3.inference

def _abortable_inference(self, *args, **kwargs):
    # Abort event is stashed on the worker thread by _synthesize_gpu
    abort_event = threading.current_thread().__abort_event__
    # ... walk the function body, patch in abort.is_set() check before each forward ...
    ...

t3_mod.T3.inference = _abortable_inference  # swap in at startup
```

This requires re-implementing the full `T3.inference` body in our code — ~100 lines copied from chatterbox. Brittle: any chatterbox version bump drifts from our copy and we silently regress.

**Option A.2 — subclass T3** (cleaner, still invasive)

Subclass `T3`, override `inference()` with a near-clone that checks abort. Instantiate the subclass instead of the stock class at model load time.

**Option A.3 — custom StoppingCriteria (Huggingface pattern)**

The loop is hand-rolled (it doesn't go through HF `generate()`), so `StoppingCriteria` doesn't apply; we would need to fork the loop anyway.

**Recommended:** A.2 with a version-pin assertion. On chatterbox upgrade, CI fails loudly instead of drifting silently.

### Phase A.3 — Wire abort_event into SynthSession

`services/annie-voice/chatterbox_server.py`:

```python
@dataclass
class SynthSession:
    session_id: str
    abort_event: threading.Event = field(default_factory=threading.Event)  # NEW
    cancelled: bool = False
    created_at: float = field(default_factory=time.monotonic)
    finalized_at: float = 0.0
```

Cancel handler sets both `cancelled` (for 204 semantics) and `abort_event` (for thread-side interrupt).

`_synthesize_gpu` receives the event and stores it somewhere the patched loop can reach (a `threading.local`, an instance attribute, or a kwarg threaded through the subclass).

### Phase A.4 — Tests

- Mid-synth cancel returns 204 within 100 ms AND the `_synthesize_gpu` thread exits within 100 ms of cancel (not after natural EOS).
- `nvidia-smi` VRAM returns to pre-synth baseline within 500 ms.
- Long-text synth (> 2 s) still completes when no cancel fires.
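The thread-exit timing check can be prototyped against a stub before touching the real server — `fake_synth` below stands in for the patched token loop (~25 ms per "token", abort-aware):

```python
import threading
import time

def fake_synth(abort_event: threading.Event, done: threading.Event,
               step_s: float = 0.025, steps: int = 200) -> None:
    """Stand-in for the patched loop: sleeps one 'token' per iteration."""
    for _ in range(steps):
        if abort_event.is_set():   # the Option A checkpoint
            break
        time.sleep(step_s)
    done.set()

abort, done = threading.Event(), threading.Event()
worker = threading.Thread(target=fake_synth, args=(abort, done))
worker.start()
time.sleep(0.1)                    # let a few "tokens" run
t0 = time.monotonic()
abort.set()
done.wait(timeout=1.0)
exit_ms = (time.monotonic() - t0) * 1000  # should be under one token budget
worker.join()
```

The real test swaps `fake_synth` for a live `/synthesize` + `/cancel/<session_id>` round-trip and keeps the same `exit_ms < 100` assertion.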

## When to open this

Open when ANY of:
- **Barge-in during TTS synth queue-wait causes audible stutter.** This is the user-visible symptom of Option B's GPU-tail blocking the semaphore. Measure in Phase 4 soak (session 115 did not run it).
- **Panda VRAM headroom shrinks < 1 GB** and Chatterbox's post-cancel CUDA allocations push it over. Option A's `empty_cache()` hits immediately after cancel instead of after natural completion.
- **Barge-in latency budget is formally tightened** (e.g., moving to a real-time pipeline like Pipecat) and `synth_duration + cleanup` becomes too loose.

Until then, Option B is honest: the user hears silence in 50 ms, and the wasted GPU is invisible.

## Related

- Plan file: `~/.claude/plans/reflective-wobbling-blossom.md` (Gate 2.0 + Phase 2.2)
- Exec doc: `docs/NEXT-SESSION-PARAKEET-STREAMING-BARGE-IN-EXEC.md` (rule 1)
- Adversarial-review finding: `H-ARCH-1` in Stage 6 response table
- Deferred alternative: `docs/TODO-INPROCESS-STREAMING-STT.md` (A1) moves STT out-of-process but doesn't address Chatterbox cancel
