# Next Session — Chatterbox Titan redundancy follow-ups

**Supersedes:** the pre-execution V2 handoff, which described how to run this bench. That plan ran in **session 110 on 2026-04-15** and landed verdict `titan_chatterbox_synthesis_parity_with_panda`. This revision of V2 describes only what remains.

**Plan file (execution already complete for phases 0–5):** `/home/rajesh/.claude/plans/replicated-cuddling-duckling.md`
**Verdict JSON:** `docs/BENCHMARK-CHATTERBOX-TITAN-20260415.json`
**Research doc:** `docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md`
**Branch:** `chatterbox-titan-bench-20260415`

## What's still open

### A) Phase 6 — staged HTTP failover dry-run (opt-in, ~60–90 min)

Flag: `--staged-failover-dry-run` in `scripts/run-chatterbox-titan-bench.sh`, once that runner exists. The plan referenced the runner, but session 110 did not implement it; what actually ran was a direct invocation of `scripts/benchmark_chatterbox_titan.py` with the Phase 5 Gemma restore cycle.

Purpose: exercise the phone-daemon → HTTP-boundary → Chatterbox-on-Titan codepath end-to-end. Without Phase 6, "redundancy" means "the model works" — it does NOT mean "failover is safe to flip."

**Pre-reqs** (all must be met before starting; only the first is met so far):
- [x] Phase 0–5 complete with `titan_chatterbox_synthesis_parity_with_panda` emitted.
- [ ] User explicitly opts in (late-night quiet window, ~2h budget, user available to observe test calls).
- [ ] Panda phone-daemon is healthy and idle during the window.
- [ ] `CHATTERBOX_TOKEN` value matches across Panda and the Titan shim (same `~/.her-os/.env`).
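
One way to confirm the token pre-req without printing the secret is to compare a short hash of `CHATTERBOX_TOKEN` on each host. The sketch below is only an illustration: the helper names `read_env_value` and `token_fingerprint` are hypothetical, and it assumes `~/.her-os/.env` uses plain `KEY=value` lines (optionally prefixed with `export` or quoted).

```python
import hashlib
from typing import Optional


def read_env_value(env_text: str, key: str) -> Optional[str]:
    """Return the value for `key` from dotenv-style text, or None if absent."""
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("export "):
            line = line[len("export "):]
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Assumption: values are at most wrapped in one pair of quotes.
            return value.strip().strip('"').strip("'")
    return None


def token_fingerprint(env_text: str, key: str = "CHATTERBOX_TOKEN") -> Optional[str]:
    """Short SHA-256 fingerprint of the token: safe to log and compare."""
    value = read_env_value(env_text, key)
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
```

Running `token_fingerprint(open(path).read())` against the `.env` file on Panda and on Titan should give the same 12-character string if the tokens match.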

**Sub-phases (from plan):**
1. **6a (20 min)** — Create `services/annie-voice/chatterbox_titan_shim.py`: FastAPI wrapper binding Titan:8773, mirrors Panda `chatterbox_server.py` API byte-for-byte. Request body `{text, reference_audio, exaggeration, cfg_weight, temperature}`, auth via `X-Internal-Token: $CHATTERBOX_TOKEN`, response is raw int16 PCM (24 kHz mono) with `X-Audio-Duration`, `X-Sample-Rate`, `X-Channels`, `X-Format` headers. Apply both Blackwell patches in the shim's startup hook. Reference audio files must be placed in Titan's `~/.her-os/annie/voice-references/` with matching `metadata.json` allowlist (or skip server-side allowlist by binding to Titan-only reference paths).
2. **6b (10 min)** — Smoke the shim from Panda: `ssh panda 'set -a && . ~/.her-os/.env && curl -sf -o /tmp/titan_shim.pcm -w "%{http_code}" -X POST http://<titan-ip>:8773/v1/tts -H "X-Internal-Token: $CHATTERBOX_TOKEN" -H "Content-Type: application/json" -d "{\"text\":\"hi\",\"reference_audio\":\"samantha_evolving.wav\"}"'`. Verify: 200 status, >24 KB body, and non-silent audio. Chatterbox's ~37 ms prefix is expected, so judge silence by full-file peak/RMS rather than by the first 40 bytes.
3. **6c (30 min, supervised)** — Repoint ONE phone line's `CHATTERBOX_URL` to `http://<titan-ip>:8773`. Rollback command documented: `kill -HUP $(pgrep -f phone_call.py) ; CHATTERBOX_URL=http://localhost:8772 nohup python3 ~/workplace/her/her-os/scripts/phone_call.py auto &`. User makes 2–3 test calls. Watch `journalctl -u phone-daemon` for HTTP errors, dropped frames, drift.
4. **6d (10 min)** — Rollback + capture: latency samples, subjective audio quality, any warnings in phone-daemon logs.
5. **6e** — If clean, emit additional verdict `titan_chatterbox_failover_dry_run_clean` and update `docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md` "Failover runbook" section to `validated by dry-run on <date>`. Otherwise `titan_chatterbox_failover_dry_run_partial` with gap list.
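
The full-file peak/RMS check from 6b could be sketched as follows. This is a minimal illustration, not part of the bench: `pcm_stats` and `looks_like_speech` are hypothetical names, the payload is assumed to be little-endian int16 mono at 24 kHz as described in 6a, ">24 KB" is read as 24 × 1024 bytes, and the `min_peak` / `min_rms` thresholds are placeholder guesses that would need tuning against real output.

```python
import array
import math
import sys


def pcm_stats(pcm: bytes, sample_rate: int = 24000) -> dict:
    """Peak, RMS, and duration for raw int16 mono PCM (assumed little-endian)."""
    if len(pcm) % 2:
        raise ValueError("int16 PCM must have an even byte length")
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # convert little-endian payload to native order
    n = len(samples)
    peak = max((abs(s) for s in samples), default=0)
    rms = math.sqrt(sum(s * s for s in samples) / n) if n else 0.0
    return {"samples": n, "duration_s": n / sample_rate, "peak": peak, "rms": rms}


def looks_like_speech(pcm: bytes, min_bytes: int = 24 * 1024,
                      min_peak: int = 1000, min_rms: float = 100.0) -> bool:
    """Whole-file check: large enough and not silent (thresholds are guesses)."""
    stats = pcm_stats(pcm)
    return (len(pcm) > min_bytes
            and stats["peak"] >= min_peak
            and stats["rms"] >= min_rms)
```

For the smoke in 6b this would be applied to the bytes of `/tmp/titan_shim.pcm`; the `duration_s` value can also be compared against the `X-Audio-Duration` response header.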

### B) Quarterly canary re-bench (2026-07-15)

Re-run the same bench (phases 0–5) without modification to catch silent drift from:
- Chatterbox version bumps (reject if the installed version ≠ 0.1.7 without explicit re-planning)
- torch bumps past 2.11.x (especially 2.12+ — Jiterator behavior may change)
- Titan firmware/driver upgrades
- Panda voice-ref swap (production default was `samantha_evolving.wav`)
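
The first two drift sources lend themselves to an automatic pre-check. The following is a rough sketch under stated assumptions: `parse_version` and `canary_version_gate` are hypothetical helpers, and the caller is expected to supply the version strings (for example `torch.__version__`, and the installed Chatterbox distribution's version, whose distribution name is not stated in this document).

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Leading numeric components, e.g. '2.11.0+cu128' -> (2, 11, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def canary_version_gate(chatterbox_version: str, torch_version: str) -> List[str]:
    """Return a list of problems; an empty list means the canary may proceed."""
    problems = []
    # Exact match: anything other than 0.1.7 needs explicit re-planning.
    if chatterbox_version != "0.1.7":
        problems.append(f"chatterbox {chatterbox_version} != 0.1.7")
    # torch is acceptable up to and including the 2.11.x series.
    if parse_version(torch_version)[:2] > (2, 11):
        problems.append(f"torch {torch_version} is past 2.11.x")
    return problems
```

A non-empty return value would be a reason to stop and re-plan rather than to continue the bench.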

Start command:
```bash
ssh titan 'cd ~/workplace/her/her-os && git fetch origin && \
  git worktree add ../her-os-chatterbox-canary-$(date +%Y%m%d) -b chatterbox-titan-canary-$(date +%Y%m%d) origin/main'
# Follow the install recipe in docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md
# Compare the resulting mean_cosine vs 0.9199 baseline — flag if drift > 0.05
```
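
The comparison in the last comment can be sketched in a few lines. This is an illustration only: it assumes speaker embeddings are already available as plain float sequences (the bench obtains them via resemblyzer), treats drift as the absolute difference from the 0.9199 baseline (the source does not say whether the gate is one-sided), and uses hypothetical function names.

```python
import math
from typing import Iterable, Sequence, Tuple

BASELINE_MEAN_COSINE = 0.9199  # session 110 result
DRIFT_THRESHOLD = 0.05         # flag when exceeded


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def mean_cosine(pairs: Iterable[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average similarity over paired (Panda, Titan) embeddings."""
    scores = [cosine_similarity(a, b) for a, b in pairs]
    if not scores:
        raise ValueError("no embedding pairs supplied")
    return sum(scores) / len(scores)


def drift_exceeded(mean_cos: float,
                   baseline: float = BASELINE_MEAN_COSINE,
                   threshold: float = DRIFT_THRESHOLD) -> bool:
    """True when the canary result moved more than `threshold` from baseline."""
    return abs(mean_cos - baseline) > threshold
```

If only a drop in similarity matters, `baseline - mean_cos > threshold` would be the one-sided variant.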

### C) Cleanups session 110 left open

- [ ] **`scripts/run-chatterbox-titan-bench.sh` runner** — the plan referenced a wrapper that exposes `--gemma-mode`, `--staged-failover-dry-run`, `--override-drift`, `--override-venv`. Session 110 invoked the python bench directly with manual Gemma pause/restart + manual drift gate. A runner is only worth building if another session actually re-runs this outside the canary cadence; otherwise delete the reference.
- [ ] **`test_blackwell_patch.py` coverage** — the existing 6 tests cover `patch_stft`. The two new helpers (`patch_chatterbox_xvector_cpu_fbank`, `patch_chatterbox_s3tokenizer_log_mel`) have no unit tests, because mocking chatterbox + torchaudio adds more complexity than value; they are validated only implicitly, by the bench smoke on Titan. Add coverage if a Phase 6 shim brings these helpers into production.
- [ ] **Cleanup decision on the Titan venv** — session 110 left `~/workplace/her/her-os-chatterbox-bench/.venv-chatterbox-bench/` in place (~7 GB including torch stacks) per plan's rule "keep on validated verdict." Delete after a Phase 6 dry-run if not needed; otherwise the venv is the canary's starting point.
- [ ] **Pre-commit allowlist** — the plan's Phase 5 allowlist regex did NOT include `services/annie-voice/kokoro_tts.py`, `tts_backends.py`, `server.py` (the three call sites migrated from side-effect `import blackwell_patch` to explicit `patch_stft(...)`). These were legitimately in the 110 commit. Future sessions on this plan should extend the allowlist regex accordingly if they re-run the pre-commit filter.
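
An allowlist extension along those lines might look like the sketch below. It is hypothetical throughout: the plan's actual Phase 5 regex is not reproduced in this document, so the base allowlist is passed in by the caller, and the pattern assumes all three call sites live under `services/annie-voice/` (only `kokoro_tts.py` is given with a full path above).

```python
import re
from typing import Iterable, List, Pattern

# Hypothetical extension covering the three migrated call sites.
# Assumption: tts_backends.py and server.py sit next to kokoro_tts.py.
EXTRA_ALLOWED = re.compile(
    r"^services/annie-voice/(kokoro_tts|tts_backends|server)\.py$"
)


def disallowed_paths(staged: Iterable[str], base_allowlist: Pattern[str]) -> List[str]:
    """Return staged paths matched by neither the base nor the extra allowlist."""
    return [
        path for path in staged
        if not (base_allowlist.search(path) or EXTRA_ALLOWED.search(path))
    ]
```

An empty result means every staged file is covered; anything returned would still be rejected by the pre-commit filter.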

## Key design decisions from session 110 (do NOT re-litigate)

1. **Two Blackwell patches required, not one.** The original plan expected a single `patch_stft` application (as with Kokoro). In practice, Chatterbox's voice-clone path hits NVRTC in two distinct locations — `xvector.extract_feature` (via torchaudio's Kaldi.fbank) and `s3tokenizer.log_mel_spectrogram`. Both have dedicated helpers in `blackwell_patch.py`.
2. **Voice reference changed from `samantha_movie_primary.wav` to `samantha_evolving.wav`.** The plan's 34.7 s primary ref is NOT in Panda's server-side allowlist; production defaults to the 5 s `samantha_evolving.wav`. Using the production ref on both sides is more production-faithful and also eliminates the Samantha-30s cap concern (CODE-11/PM-3).
3. **A/B scoring is resemblyzer cosine similarity, not human MOS.** The bench loop had no human listener. `scripts/tts_identity_score.py --mode human` is implemented for future audit — it reuses the same paired WAVs already in the samples dir.
4. **nvidia-smi VRAM query returns [N/A] on GB10 unified memory.** `torch.cuda.max_memory_allocated()` is the primary reading; per-process `nvidia-smi --query-compute-apps=pid,used_memory` is the cross-reference. The plan's `--query-gpu=memory.used` path is unusable here.
5. **Gemma post-flight drift ratio written to Registry.** 1.005× this session — cleanly under the 1.2× gate. Next session's Phase 0 prior-drift gate can proceed without `--override-drift`.
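
For the cross-reference reading in decision 4, the per-process query output can be parsed defensively so that `[N/A]` fields do not crash the bench. The sketch below is an assumption-laden illustration: `parse_compute_apps` and `query_compute_apps` are hypothetical names, and it assumes the query is run with `--format=csv,noheader,nounits`, so that numeric values are in MiB.

```python
import subprocess
from typing import Dict, Optional

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> Dict[int, Optional[int]]:
    """Map pid -> used memory in MiB, or None when the field is not numeric."""
    usage: Dict[int, Optional[int]] = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        pid_field, _, mem_field = line.partition(",")
        try:
            pid = int(pid_field.strip())
        except ValueError:
            continue  # skip malformed rows rather than fail the bench
        mem = mem_field.strip()
        usage[pid] = int(mem) if mem.isdigit() else None  # '[N/A]' -> None
    return usage


def query_compute_apps() -> Dict[int, Optional[int]]:
    """Run nvidia-smi and parse its per-process memory report."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

The in-process figure from `torch.cuda.max_memory_allocated()` is reported in bytes, so it needs dividing by `2**20` before comparing against these MiB values.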

## Not doing in this follow-up

- Moving Chatterbox production from Panda to Titan (ADR-017 stands).
- Chatterbox-Turbo evaluation (not released).
- Multilingual tests.
- Automated (non-manual) failover.
- `docker commit`-based persistent Titan Chatterbox service.