# Next Session: Chatterbox 500M Benchmark on Titan DGX Spark

**Created:** 2026-04-15 (session 107, post-TTS-survey-close)
**Type:** Redundancy-path validation benchmark
**Expected duration:** 2–3 hours (Chatterbox is known-working elsewhere; Samantha refs + A/B harness already on main)
**Blocks on:** Nothing. All prerequisites landed in PR #1 + #2 (now on main).
**Relates to:** 2026-04 TTS survey (Voxtral/F5/CosyVoice2 closed; Chatterbox retained primacy). This bench answers a separate question: does Chatterbox also work *natively on Titan* as a failover path?

---

## What

Install `chatterbox-tts` on Titan DGX Spark aarch64 (Blackwell SM_121 / CUDA 13) and benchmark it with the **same Samantha reference WAV and 10 utterances** used in the session-106 Voxtral run. Compare audio output, latency, and VRAM against the Panda RTX 5070 Ti production deployment.

**Goal:** Validate that Chatterbox-on-Titan is a viable **redundancy path** if Panda becomes unavailable (hardware failure, scheduled maintenance, phone daemon crash loop, etc.). Today, losing Panda means phone calls are answered but deliver silence (MEMORY.md "MUTE-NOT-CRASH" gotcha). A Titan-native Chatterbox install gives us a manual failover target.

**Secondary goal:** Produce a clean-host A/B baseline audio set (no LAN-hop artifacts) for future TTS candidate comparisons — today the baseline is always "Chatterbox via Panda over LAN," which is fine for ears but mixes two variables.

**Non-goal:** Moving Chatterbox production to Titan. ADR-017 keeps TTS on the phone-adjacent host for minimum audio round-trip. This bench is redundancy only.

---

## Why now

- The 2026-04 TTS survey closed with "Chatterbox wins" (session 107). The survey compared *new* candidates to Chatterbox-via-Panda. We never validated whether Chatterbox itself runs on the Titan Blackwell SM_121 stack — that's a gap in the redundancy story.
- Session 105 Kokoro findings surfaced a real Blackwell hazard: `TorchSTFT` fails with nvrtc errors on SM_121 and requires a monkey-patch (`services/annie-voice/blackwell_patch.py`). Chatterbox's S3Gen audio decoder uses STFT for vocoding — it **might hit the same bug**. Finding this out now (small test) is cheaper than discovering it during a Panda outage.
- Samantha reference WAV + A/B scorer + 10 canonical utterances are all on main now. Zero setup cost beyond the install itself.

---

## What's already in place (inherited from sessions 106–107 via main)

- **Samantha voice reference:** `services/audio-pipeline/voice-references/samantha_movie_primary.wav` (34.7s, 24 kHz mono, user-confirmed 100% Samantha, volume-normalized). Chatterbox accepts arbitrary reference length; no trimming needed.
- **Blind A/B scorer:** `scripts/tts_ab_score.py` — unchanged, point at two sample dirs.
- **Panda Chatterbox baseline generator:** `scripts/generate_chatterbox_baseline.sh` — calls Panda :8772. Produces the reference audio set to compare Titan-native output against.
- **10 canonical utterances** embedded in `scripts/benchmark_voxtral_titan.py` — fork this for `benchmark_chatterbox_titan.py`.
- **Blackwell STFT fix reference:** `services/annie-voice/blackwell_patch.py` — the TorchSTFT monkey-patch from session 105. Apply the same technique if Chatterbox's S3Gen hits nvrtc errors.
- **Gemma outage mode (pause-gemma):** `docker stop vllm-gemma4` → run bench → `docker start vllm-gemma4`. Post-flight p50 drift must stay within 1.2× of ~190ms baseline. Chatterbox is small (~3.7 GB on Panda BF16), so **parallel mode is likely viable** — ask user in Phase 0.
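
The pause-gemma drift gate above can be sketched as a small helper. A minimal sketch only — the endpoint URL, baseline constant, and helper names (`probe_latencies`, `drift_ratio`, `gate_ok`) are assumptions drawn from this plan, not the actual post-flight harness:

```python
import statistics
import time
import urllib.request

GEMMA_URL = "http://localhost:8003/v1/models"  # assumed health endpoint per this plan
BASELINE_P50_S = 0.190                          # ~190 ms p50 baseline from prior sessions
MAX_DRIFT = 1.2                                 # post-flight gate: p50 within 1.2x baseline

def probe_latencies(url: str = GEMMA_URL, n: int = 5) -> list[float]:
    """Time n GET requests; mirrors the '5-curl median' pre-flight check."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        urllib.request.urlopen(url, timeout=5).read()
        samples.append(time.perf_counter() - t0)
    return samples

def drift_ratio(samples: list[float], baseline: float = BASELINE_P50_S) -> float:
    """Median of the samples over the baseline p50."""
    return statistics.median(samples) / baseline

def gate_ok(samples: list[float]) -> bool:
    """True iff the post-flight drift gate passes."""
    return drift_ratio(samples) <= MAX_DRIFT
```

Run `probe_latencies()` once in Phase 0 (baseline capture) and again in Phase 5, then compare via `gate_ok`.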

---

## Primary-source checks to do FIRST (before any install)

Before writing code on Titan:

1. **Chatterbox pip package + HF weights:**
   - https://github.com/resemble-ai/chatterbox — README, LICENSE, install instructions
   - https://huggingface.co/ResembleAI/chatterbox — model card, license (should be MIT for both code + weights per session 106 memory)
   - Confirm MIT license on weights (distinguishes it from F5-TTS's CC-BY-NC trap)

2. **aarch64 / Blackwell SM_121 readiness:**
   - Search `chatterbox-tts` on PyPI — are there aarch64 wheels, or source-only?
   - Grep GitHub issues for: "aarch64", "ARM", "DGX Spark", "Blackwell", "nvrtc", "SM_121", "TorchSTFT"
   - Check requirements/pyproject for `torch` version pin — must be ≥2.5 for SM_121; if pinned to 2.3.x/2.4.x we have a port problem similar to CosyVoice 2

3. **Dependency graph sanity:**
   - `resemble-perth` (watermarker) — does it have aarch64 wheel?
   - `s3tokenizer`, `conformer` — aarch64 / cu13 status?
   - `torch` — does the package allow the NGC container's torch 2.6+ to satisfy the requirement without a downgrade?

4. **Chatterbox internals worth knowing:**
   - S3Gen audio decoder uses STFT — likely vulnerable to the same Blackwell nvrtc issue Kokoro had
   - 10-step decoder (NOT the 1-step Turbo variant) — per session 101 RESEARCH-CHATTERBOX-CPU-BENCHMARK.md
   - Uses voice encoder + T3 transformer + S3Gen decoder — three model files load from HF

**Expected install shape:** a plain pip install (like session 105 Kokoro), NOT a Docker container. The FastAPI wrapper is OUR code (`services/annie-voice/chatterbox_server.py`), not upstream.

---

## Phased execution

### Phase 0 — pre-flight (10 min)
- SSH Titan, verify Gemma at :8003 healthy, capture p50 baseline (5-curl median). Should be ~190ms.
- `ssh panda 'curl -sf http://localhost:8772/health'` — confirm Panda Chatterbox alive (needed for baseline audio generation).
- Check Titan disk: Chatterbox weights ~3 GB.
- User decision: `--gemma-mode=parallel` (likely OK — 3.7 GB model fits alongside Gemma) vs `pause-gemma`.

### Phase 1 — install chatterbox-tts on Titan (30–60 min)
- Fresh venv on Titan (NOT inside Gemma container; Chatterbox is a plain PyTorch pip install).
- `pip install chatterbox-tts` — observe if it resolves cleanly on aarch64 with SM_121-compatible torch.
- If torch pin conflicts: create venv from NGC `nvcr.io/nvidia/pytorch:25.12-py3` container base, or use the existing annie-voice venv recipe.
- Download model files: `hf download ResembleAI/chatterbox` (~3 GB: t3_cfg, s3gen, ve, conds).
- **Known Blackwell hazard:** import + dummy synthesis will likely fail with `nvrtc: invalid PTX` error in S3Gen STFT. Apply the TorchSTFT monkey-patch from `services/annie-voice/blackwell_patch.py` pattern. This is the same fix session 105 used for Kokoro.
- Smoke-test: 1-utterance synthesis with the default (non-Samantha) voice to confirm the install works before measuring anything.
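
If the nvrtc failure does appear, the monkey-patch technique boils down to wrapping the CUDA code path with a CPU retry. The sketch below only illustrates the pattern — `with_cpu_fallback` and the stand-in functions are hypothetical names, and the actual `blackwell_patch.py` (which operates on `torch.stft` tensors) remains the source of truth:

```python
import functools

def with_cpu_fallback(gpu_fn, cpu_fn, trigger="nvrtc"):
    """Wrap a CUDA code path so nvrtc-style compile failures retry on a CPU path.

    gpu_fn and cpu_fn take identical arguments; `trigger` is matched against
    the exception text (the session-105 Kokoro failure surfaced as an nvrtc
    error). Unrelated RuntimeErrors are re-raised, not masked.
    """
    @functools.wraps(gpu_fn)
    def wrapper(*args, **kwargs):
        try:
            return gpu_fn(*args, **kwargs)
        except RuntimeError as exc:
            if trigger not in str(exc):
                raise  # unrelated failure: surface it
            return cpu_fn(*args, **kwargs)
    return wrapper
```

In the real patch, `gpu_fn` would be the CUDA STFT call and `cpu_fn` would move the tensors to CPU, run the STFT there, and move the result back to the device.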

### Phase 2 — benchmark Titan-native Chatterbox (30 min)
- Fork `scripts/benchmark_voxtral_titan.py` → `scripts/benchmark_chatterbox_titan.py`. Reuse `UTTERANCES` + `bench()`. Replace HTTP POST payload with direct `ChatterboxTTS.from_pretrained(...)` + `.generate(text, audio_prompt_path=samantha_primary)` calls (in-process, no HTTP).
- Run 10 utterances warmup + 10 timed with Samantha ref.
- Match Panda tuning: `cfg_weight=0.3, exaggeration=0.3, temperature=0.6`.
- Pull WAVs to `docs/chatterbox-titan-bench-samples-<timestamp>/`.
- Capture: RTF per utterance, peak VRAM (`nvidia-smi` snapshot mid-synthesis), install wheel/source status.
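
The timing core of the forked benchmark can be sketched as follows. A hedged outline, not the real script: `time_synthesis`, `rtf`, and `bench` are illustrative names, and the `synth` callable stands in for the in-process `ChatterboxTTS.generate(...)` call described above:

```python
import time
from statistics import median

def time_synthesis(synth, text: str) -> tuple[float, float]:
    """Run one synthesis call; return (wall_seconds, audio_seconds).
    `synth` is any callable returning the generated audio's duration in
    seconds — in the real script it wraps ChatterboxTTS.generate()."""
    t0 = time.perf_counter()
    audio_s = synth(text)
    return time.perf_counter() - t0, audio_s

def rtf(wall_s: float, audio_s: float) -> float:
    """Real-time factor: <1.0 means faster than real time."""
    return wall_s / audio_s

def bench(synth, utterances, warmup=None):
    """Warmup runs (untimed), then timed runs; per-utterance RTF + median."""
    for u in (warmup or []):
        synth(u)
    results = [rtf(*time_synthesis(synth, u)) for u in utterances]
    return results, median(results)
```

Capture the `nvidia-smi` VRAM snapshot from a second shell mid-`bench`, since peak usage occurs during synthesis.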

### Phase 3 — Panda baseline (parallel, 15 min)
- `bash scripts/generate_chatterbox_baseline.sh --out docs/chatterbox-titan-bench-samples-<timestamp>/panda_baseline/`
- Same Samantha ref, same 10 utterances, same tuning params. The output directory will contain `titan_native/` and `panda_baseline/` for A/B comparison.

### Phase 4 — A/B audio equivalence (15 min, human in loop)
- `python3 scripts/tts_ab_score.py --samples-dir docs/chatterbox-titan-bench-samples-<timestamp>/ --output docs/BENCHMARK-CHATTERBOX-TITAN-<timestamp>.json`
- User scores 10 pairs 1–5 on **audio identity similarity** (not preference) — goal is "these should sound identical" not "which is better."
- Decision rule:
  - **|titan_mean − panda_mean| < 0.5** AND user-perceived voice-identity matches → `titan_chatterbox_redundancy_validated`
  - Else → `titan_chatterbox_diverges_from_panda_investigate`
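
The decision rule above, expressed as a small function for the verdict JSON step — the verdict strings come from this plan; `verdict` and its parameters are illustrative:

```python
from statistics import mean

VALIDATED = "titan_chatterbox_redundancy_validated"
INVESTIGATE = "titan_chatterbox_diverges_from_panda_investigate"

def verdict(titan_scores, panda_scores, identity_matches: bool,
            threshold: float = 0.5) -> str:
    """Phase-4 rule: mean similarity scores must sit within `threshold`
    of each other AND the user must confirm voice identity."""
    close = abs(mean(titan_scores) - mean(panda_scores)) < threshold
    return VALIDATED if (close and identity_matches) else INVESTIGATE
```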

### Phase 5 — teardown + post-flight (10 min)
- If paused: restart Gemma, verify post-flight p50 drift ≤1.2×.
- Commit WAVs + verdict JSON on new branch `chatterbox-titan-bench-<timestamp>`, open PR.
- Document the Titan install recipe (venv, NGC base if used, Blackwell STFT patch applied or not) in `docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md`.

---

## Success criteria

1. ✅ `chatterbox-tts` installed on Titan aarch64, `from chatterbox.tts import ChatterboxTTS` imports without errors
2. ✅ Blackwell STFT issue either doesn't manifest OR is fixed via monkey-patch (same pattern as `blackwell_patch.py`)
3. ✅ 10/10 Samantha-cloned utterances synthesize without errors
4. ✅ Panda baseline generated for same utterances
5. ✅ A/B similarity score committed with user's ratings
6. ✅ Verdict JSON: `titan_chatterbox_redundancy_validated` OR `titan_chatterbox_diverges_from_panda_investigate` OR install-blocker verdict
7. ✅ Gemma post-flight drift ≤ 1.2×
8. ✅ `docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md` with install recipe + VRAM number + RTF + failover runbook stub
9. ✅ PR opened

---

## Load-bearing lessons to carry in

- **Blackwell STFT hazard is real** — don't be surprised when nvrtc errors appear. Apply the session-105 TorchSTFT monkey-patch pattern. File to read: `services/annie-voice/blackwell_patch.py`.
- **MIT license confirmed on Chatterbox** (session 106 memory: *"Chatterbox retains primacy for Samantha voice — it already works, already deployed, ships the clone encoder openly, MIT-licensed"*). No license gate concerns.
- **Don't productionize on Titan.** ADR-017 and phone-adjacency keep Chatterbox on Panda. This bench only validates a failover option.
- **A/B scorer target is "identical-sounding"**, not "which is better" — if Titan-native diverges noticeably from Panda, that's a bug to investigate (dtype? STFT patch artifacts? CUDA kernel difference?), not a preference.
- **Parallel-with-Gemma is likely viable.** Chatterbox on Panda uses 3.7 GB; Titan has 128 GB unified. Unlike the Voxtral bench, pause-gemma is probably unnecessary. But confirm with user before running parallel — session 106 showed Gemma drift ratios compound across container restarts.

---

## What WILL NOT be in scope

- Moving Chatterbox production from Panda to Titan (ADR-017 forbids it; this is redundancy, not migration).
- Quality tuning — `cfg_weight/exaggeration/temperature` stay at production values (0.3/0.3/0.6).
- Chatterbox-Turbo evaluation — still not released upstream per session 106 folklore correction.
- Kannada / multilingual tests — Chatterbox is English-only (session 67 IndicF5 retirement stands).

---

## Files to create

| File | Purpose |
|---|---|
| `scripts/benchmark_chatterbox_titan.py` | In-process Chatterbox benchmark (fork of `benchmark_voxtral_titan.py`, but no HTTP — direct `ChatterboxTTS.generate()`) |
| `scripts/run-chatterbox-titan-bench.sh` | Dispatcher (much simpler than Voxtral — no stage-configs, no container) |
| `docs/chatterbox-titan-bench-samples-<timestamp>/titan_native/` | Titan-generated WAVs |
| `docs/chatterbox-titan-bench-samples-<timestamp>/panda_baseline/` | Panda-generated WAVs (same ref, same utterances) |
| `docs/BENCHMARK-CHATTERBOX-TITAN-<timestamp>.json` | Verdict + RTF + VRAM + A/B identity score |
| `docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md` | Install recipe, Blackwell patches applied, failover runbook, revisit triggers |

---

## Files to update on completion

- `MEMORY.md` — add session entry with verdict + load-bearing install recipe
- `docs/PROJECT.md` — if validated, add a small decision-log entry: "Chatterbox redundancy path on Titan validated (<timestamp>); failover recipe in RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md"
- `docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md` — add cross-reference to Titan recipe if useful context

---

## Open questions for the next session to answer

1. Does `pip install chatterbox-tts` resolve on aarch64 Python 3.10+? Any native-extension compile failures?
2. Does Chatterbox's S3Gen STFT trigger the same Blackwell nvrtc bug that hit Kokoro? If yes, does the session-105 `blackwell_patch.py` pattern fix it (drop-in), or need adaptation?
3. What's the actual Titan VRAM footprint? Panda uses 3.7 GB on RTX 5070 Ti BF16; expect similar on Titan but validate.
4. Is `resemble-perth` watermarker a hard dependency, and does it have aarch64 support?
5. RTF on Titan vs Panda — is the Blackwell tensor-core path faster for this workload, or indifferent for 500M models?
6. Can the bench run **in parallel with Gemma** safely (no VRAM contention, no Gemma latency drift), or does it require pause-gemma?

---

## Start command

```bash
# 1. Fresh branch off main (no stacking this time — main already has session 107 artifacts)
cd ~/workplace/her/her-os
git checkout main && git pull
git checkout -b chatterbox-titan-bench-$(date +%Y%m%d)

# 2. Read primary sources FIRST (no code yet)
#    - https://github.com/resemble-ai/chatterbox
#    - https://huggingface.co/ResembleAI/chatterbox
#    - services/annie-voice/blackwell_patch.py (session-105 TorchSTFT monkey-patch reference)
#    - services/annie-voice/chatterbox_server.py (Panda production server for tuning reference)
#    - docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md (session-101 CPU research, known-facts section)

# 3. Start Phase 0 pre-flight
ssh titan "curl -sf http://localhost:8003/v1/models && df -h ~"
ssh panda "curl -sf http://localhost:8772/health"

# 4. Proceed phase-by-phase per the plan above
```

---

## Decision tree (pre-committed)

```
Phase 1 install succeeds cleanly on aarch64 SM_121 (with or without blackwell_patch)
  AND Phase 2 produces 10/10 utterances
  AND Phase 4 A/B identity similarity |titan_mean − panda_mean| < 0.5

  → titan_chatterbox_redundancy_validated
    → document recipe, open PR, no production change

Phase 1 blocks on a Blackwell-specific issue that requires >1h to patch
  → deferred_blackwell_port_cost  (document exact error, Kokoro-style patch attempt, stop)

Phase 1 succeeds but Phase 4 A/B identity diverges significantly
  → titan_chatterbox_diverges_investigate
    → compare dtype/STFT output/cuda-kernel-versions between hosts, do not claim redundancy until root-caused
```

---

## If Chatterbox also fails on Titan Blackwell

Document the exact error + attempted patches in `docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md`. This is MATERIALLY IMPORTANT — it means the Panda Chatterbox deployment has no native failover, and a Panda outage drops all phone TTS. Escalate to Rajesh for redundancy-strategy discussion (e.g., second Panda? Titan-CPU chatterbox? Cloud TTS fallback?).

Do NOT spend more than 1 hour on install friction per the session-106 paired-tag lesson.
