# Next Session: Voxtral-4B-TTS-2603 Benchmark on Titan

**Created:** 2026-04-14 (session 102, post-adversarial-review)
**Executed:** 2026-04-14 (session 105) — VERDICT: `keep_chatterbox_voxtral_tts_unsupported_upstream`. See `docs/BENCHMARK-VOXTRAL-TITAN-20260414-2300.json`. This doc is preserved for the re-run that'll happen once vLLM upstream adds `voxtral_tts` model_type support.
**Type:** Benchmark execution — research + measurement, no production code change
**Expected duration:** 2-3 hours across Phase A-1 through Phase 5
**Blocks on:** Nothing. Parallel-safe with user's Panda E4B benchmark (different machine, different GPU).

---

## What

Run a voice-identity + latency benchmark of Mistral's new **Voxtral-4B-TTS-2603** (released 2026-03-26) on **Titan** (NVIDIA DGX Spark, aarch64 Blackwell GB10, 128 GB unified memory) to decide whether it should replace Panda's current **Chatterbox 500M** TTS.

**Primary metric (load-bearing):** Samantha voice-cloning quality via blind A/B listening against Chatterbox baseline. Per user: "The key I am using all these TTS is to have Samantha's voice instead of Kokoro for example."

**Secondary metrics:** Streaming TTFA on aarch64, RTF, peak VRAM, streaming chunk granularity.

**Outcome:** Go/no-go verdict per a pre-committed decision rule: `Voxtral mean ≥ Chatterbox mean + 0.5 AND Voxtral TTFA p50 ≤ 500 ms → swap candidate`. Otherwise keep Chatterbox.
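
The pre-committed rule reduces to a small pure function. A minimal sketch (function and argument names are illustrative, not the actual `tts_ab_score.py` API):

```python
def verdict(voxtral_mean, chatterbox_mean, voxtral_ttfa_p50_ms, streaming_capable=True):
    """Pre-committed go/no-go rule; returns the verdict string written to JSON."""
    if not streaming_capable:
        # 2C/2D install paths cannot measure TTFA (see F5 below)
        return "aarch64_vllm_omni_unverified"
    if voxtral_mean >= chatterbox_mean + 0.5 and voxtral_ttfa_p50_ms <= 500:
        return "swap_candidate"
    return "keep_chatterbox"
```

Committing to this before listening is the point: the numbers decide, not post-hoc impressions.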

---

## Plan

**READ FIRST:** `/home/rajesh/.claude/plans/logical-kindling-crown.md`

This plan went through the full `planning-with-review` skill:
- Stage 0 prior-lessons check (12 gotchas from prior sessions mapped)
- Stage 1B state machine (7 states, 11 transitions, enforced via `/tmp/voxtral-bench-state`)
- Stage 1C pre-mortem (12 failure scenarios with mitigations)
- Stage 2 adversarial architecture review (3 CRITICAL + 3 HIGH + 3 MEDIUM + 2 alternatives found)
- Stage 3 adversarial code review (4 CRITICAL + 4 HIGH + 4 MEDIUM found)
- Stage 6 feedback table: **22 findings, 22 implemented, 0 rejected, 0 deferred** (per `feedback_no_reject_defer.md`)

The plan has every review-driven fix baked in. Do NOT revert these — they are load-bearing.

---

## Load-bearing design decisions (do NOT accidentally revert)

1. **Latency-based health gate, NOT binary 200-check (F1)**
   Unified memory on DGX Spark has no hard VRAM wall. Gemma 4's p50 baseline is captured in Phase 0; BENCHING phase aborts if Gemma latency exceeds 2× baseline. A `curl /health → 200` check is insufficient — Gemma can degrade silently from 300 ms to 3000 ms while reporting "healthy."

2. **Dedicated git branch for benchmark artifacts (F6)**
   `git checkout -b voxtral-bench-<YYYYMMDD-HHMMSS> origin/main` before any commit. Benchmark WAVs + JSON live on this branch. User merges to main via explicit PR after review. Prevents binary churn on main and push-conflict race.

3. **`--env-file`, NOT `-e HF_TOKEN=...` (F7)**
   `ps aux` shows process args → `-e HF_TOKEN=<raw>` exposes the token. Use mode-0600 `/tmp/voxtral-bench-envfile` with `--env-file`, delete after container starts.

4. **tmux-wrapped dispatcher (F8)**
   `ssh titan 'tmux new-session -d -s voxtral-bench "bash scripts/run-voxtral-bench.sh"'` — survives SSH drop. Re-attach via `ssh titan -t tmux attach -t voxtral-bench`. Systemd 6am timer on Titan sweeps orphaned containers >4h old.

5. **Script-enforced blind A/B (F20)**
   `scripts/tts_ab_score.py` randomizes Voxtral-vs-Chatterbox WAV pairs with labels hidden. User scores 1-5 per pair. Pre-committed decision rule runs automatically. No free-form "I think Voxtral sounds better" judgment — avoids sunk-cost bias.

6. **TTFA semantics differ per install branch (F5)**
   - 2A/2B (vLLM HTTP streaming): TTFA = first byte via `httpx.stream()`
   - 2C (HF transformers in-process): `total_synthesis_latency_ms` only, NOT TTFA
   - 2D (pure-C batch): same as 2C
   - Results JSON tags `install_path` + `streaming_capable`. Go/no-go verdict requires 2A or 2B — 2C/2D produce `verdict: "aarch64_vllm_omni_unverified"`.

7. **User picks `--gemma-mode=parallel|pause-gemma` before first run (F21)**
   `parallel` = default per user constraint "Nothing on Titan should be broken"; Gemma keeps running.
   `pause-gemma` = adversarial reviewer's safer recommendation; ~3 min Gemma outage but true isolation.
   AskUserQuestion at dispatcher start. Honest safety/constraint trade-off surfaced.

8. **Optional Phase A-1 Mistral API pre-screen (F22)**
   Flag `--allow-api-prescreen` (default OFF, preserves local-first privacy). ~$0.05 for 3 test utterances. If Voxtral obviously fails Samantha voice identity → abort before Titan work, save ~1 hour.

9. **VTT parser auto-detects speaker-marker format (F11)**
   The plan's initial assumption (`- ` prefix) is probably wrong for yt-dlp auto-subtitles. `extract_samantha_from_movie.py` first inspects the file: checks for `<v Speaker>` voice spans (WebVTT canonical), falls back to `- ` (SRT-style), falls back to silence-gap segmentation.

10. **Deterministic container teardown, no `--rm` (F2)**
    `docker run --rm` races with explicit `docker rm -f` + EXIT trap. Plan drops `--rm`; teardown is: `docker stop → wait 10s → docker rm -f → verify via docker container inspect`.
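
The core of the latency gate in item 1 reduces to two small functions. This is an illustrative sketch, not the dispatcher's actual code — the probe callable and the `2×` threshold layout are assumptions drawn from the description above:

```python
import statistics
import time

def p50_ms(probe, n=5):
    """Median latency of n probe calls, in milliseconds.
    probe would be e.g. a GET against Gemma's inference endpoint."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        probe()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

def degraded(current_p50_ms, baseline_p50_ms):
    """The F1 gate: True once Gemma exceeds 2x its Phase 0 baseline.
    The background checker writes /tmp/voxtral-bench-abort when this fires."""
    return current_p50_ms > 2 * baseline_p50_ms
```

Note this catches exactly the failure mode a 200-check misses: a 300 ms → 3000 ms degradation trips `degraded()` while `/health` still returns 200.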

---

## Files to create

| # | File | Purpose |
|---|---|---|
| 1 | `scripts/benchmark_utils.py` | Shared `bench(fn, n=5)` helper with **explicit warmup call outside `bench()`** (F4). Import from both `benchmark_titan.py` and `benchmark_voxtral_titan.py`. Do NOT copy-paste verbatim. |
| 2 | `scripts/extract_samantha_from_movie.py` | Adaptive VTT parser + ffmpeg clip extraction. Path-sanitized via `pathlib.Path.resolve()` + startswith-check (F3). Integer-second filenames only. Asserts `stat().st_size > 2048` after each ffmpeg. |
| 3 | `scripts/benchmark_voxtral_titan.py` | Per-install-path TTFA measurement with `httpx.Timeout(connect=5.0, read=30.0, write=5.0, pool=5.0)` (F10). Explicit warmup call before timed loop. Branch selector via env/flag. |
| 4 | `scripts/tts_ab_score.py` | Blind-randomized listener. Pre-committed decision rule: `voxtral_mean ≥ chatterbox_mean + 0.5 AND voxtral_ttfa_p50 ≤ 500` → swap. Writes verdict to JSON. |
| 5 | `scripts/run-voxtral-bench.sh` | Dispatcher. Implements state machine via `/tmp/voxtral-bench-state` with `assert_state` checks per phase. Background health-check PID tracked in `/tmp/voxtral-bench-healthcheck.pid`. EXIT trap kills PID. Flags: `--gemma-mode`, `--allow-api-prescreen`, `--skip-baseline`. |
| 6 | `scripts/generate_chatterbox_baseline.sh` | Resumable Chatterbox baseline generator. Called by Phase 4 if Chatterbox is reachable; deferred to a later run if Panda is busy with E4B benchmark (F12). |
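
The shared helper in file 1 might look like the following — the real `benchmark_utils.py` signature may differ; this only illustrates the F4 requirement that warmup is an explicit call outside `bench()`:

```python
import statistics
import time

def bench(fn, n=5):
    """Time n calls of fn; returns (p50_ms, samples).
    No hidden warmup: callers MUST invoke fn() once themselves first (F4),
    so one-off costs (CUDA context init, kernel JIT, model load) never
    pollute the timed samples."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples), samples

# Usage: warmup is explicit and visible at the call site.
# fn()                       # warmup, untimed
# p50, raw = bench(fn, n=5)  # timed loop
```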

---

## Files to modify

- `.gitattributes` — add `docs/voxtral-bench-samples-*/**/*.wav filter=lfs diff=lfs merge=lfs` (F15)
- `services/audio-pipeline/voice-references/` (new dir) — the user-confirmed Samantha movie clip lives here, committed to the repo
- `services/audio-pipeline/voice-references/metadata.json` — add the new clip to allowlist (Chatterbox pattern)
- `MEMORY.md` (post-benchmark only) — add a session continuation with measured numbers + go/no-go

---

## One-time Titan setup (before first run)

1. `ssh titan 'git lfs --version'` — if missing: `sudo apt install git-lfs && git lfs install`
2. `ssh titan 'which tmux'` — if missing: `sudo apt install tmux`
3. Install the systemd 6am orphan-cleanup timer (safety net for F8 / pre-mortem #9):
   ```bash
   ssh titan 'sudo tee /etc/systemd/system/voxtral-bench-cleanup.service << EOF
   [Unit]
   Description=Sweep orphaned Voxtral benchmark containers
   [Service]
   Type=oneshot
   ExecStart=/bin/bash -c "docker ps --filter name=vllm-voxtral-bench --filter status=running --format \"{{.Names}} {{.RunningFor}}\" | grep -E \"([4-9]|[0-9][0-9]) hours|days|weeks\" | cut -d\" \" -f1 | xargs -r docker rm -f"
   EOF
   sudo tee /etc/systemd/system/voxtral-bench-cleanup.timer << EOF
   [Unit]
   Description=Daily cleanup of stale benchmark containers
   [Timer]
   OnCalendar=*-*-* 06:00:00
   [Install]
   WantedBy=timers.target
   EOF
   sudo systemctl daemon-reload
   sudo systemctl enable --now voxtral-bench-cleanup.timer'
   ```

---

## Start command

```bash
# 1. Read the full review-driven plan
less /home/rajesh/.claude/plans/logical-kindling-crown.md

# 2. AskUserQuestion to surface the two user-decision flags:
#    - --gemma-mode=parallel (default) vs pause-gemma
#    - --allow-api-prescreen (default off) vs on

# 3. Begin Phase A0 (Samantha extraction) via:
python3 scripts/extract_samantha_from_movie.py \
  --video ~/workplace/her/her-player/downloads/Iroq6EzQcl4/video.mp4 \
  --vtt ~/workplace/her/her-player/downloads/Iroq6EzQcl4/subtitles.vtt \
  --out /tmp/samantha_candidates/

# 4. User listens to candidates + confirms which is Samantha

# 5. Run dispatcher on Titan:
ssh titan 'cd ~/workplace/her/her-os && git fetch origin && \
  tmux new-session -d -s voxtral-bench \
  "bash scripts/run-voxtral-bench.sh --gemma-mode=parallel"'

# 6. Monitor via: ssh titan -t tmux attach -t voxtral-bench
```

---

## Verification (success criteria)

1. ✅ Phase 0 captures `GEMMA_P50_BASELINE_MS` cleanly (Titan production healthy)
2. ✅ Phase A0 produces at least 1 user-confirmed Samantha clip ≥ 5 s, saved to `services/audio-pipeline/voice-references/`
3. ✅ Phase 2: at least one of 2A/2B/2C/2D branches succeeds (aarch64 feasibility answered)
4. ✅ Phase 3: 10 utterances × 2 voice conditions complete with <3/20 timeouts
5. ✅ Phase 4: verdict written to JSON (`swap_candidate` OR `keep_chatterbox` OR `aarch64_vllm_omni_unverified`)
6. ✅ Phase 5 post-flight: Gemma p50 within 1.2× of pre-baseline, no orphan containers, benchmark branch pushed to origin
7. ✅ Results visible: `docs/BENCHMARK-VOXTRAL-TITAN-<timestamp>.json` + `docs/voxtral-bench-samples-<timestamp>/*.wav` committed on `voxtral-bench-<timestamp>` branch
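
Criteria 5 and the F5 tagging rule can be sanity-checked against the results JSON after the run. A sketch, assuming the field names described above (`verdict`, `install_path`, `streaming_capable`) — the benchmark's actual schema may differ:

```python
import json
from pathlib import Path

ALLOWED_VERDICTS = {"swap_candidate", "keep_chatterbox", "aarch64_vllm_omni_unverified"}

def validate_results(path):
    """Assert the verdict JSON has the shape the success criteria expect."""
    results = json.loads(Path(path).read_text())
    assert results["verdict"] in ALLOWED_VERDICTS
    assert results["install_path"] in {"2A", "2B", "2C", "2D"}
    # A go/no-go verdict requires a streaming-capable install path (F5)
    if results["verdict"] in {"swap_candidate", "keep_chatterbox"}:
        assert results["streaming_capable"]
    return results
```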

---

## If something fails

**Phase 1 (aarch64 feasibility) fails on ALL 4 branches:**
- Plan: report "Voxtral unusable on aarch64 at this time." Do NOT force-merge fallback results as if they answer the question. Document in `docs/RESEARCH-TTS-ALTERNATIVES.md` and move to next candidate (Chatterbox-Turbo, CosyVoice 2, F5-TTS from the prep doc).

**Phase 3 hits NVRTC/Jiterator error (F17):**
- Diagnostic log written to `voxtral_nvrtc_diagnostic.log` with full stack trace + tensor ops context.
- This becomes a CONCRETE follow-up: write a targeted monkey-patch inspired by `services/annie-voice/blackwell_patch.py`. Not a hand-wave.

**Gemma degrades during BENCHING (F1):**
- Background health-check writes `/tmp/voxtral-bench-abort`
- Main loop transitions to ERROR_ABORTED, runs Phase 5 teardown immediately
- Post-flight verifies Gemma recovered within 60 s; if not → user manually investigates

**SSH drops mid-run (F8):**
- Dispatcher survives inside tmux. Re-attach via `ssh titan -t tmux attach -t voxtral-bench`.
- If tmux session also died (kernel panic etc.), systemd 6am timer sweeps orphans.

**Chatterbox unreachable in Phase 4 (F12):**
- Phase 4 writes `deferred-baseline-utterances.txt` and skips.
- After user's Panda E4B benchmark completes + Chatterbox restarts, run:
  ```bash
  bash scripts/generate_chatterbox_baseline.sh
  ```

---

## Related files in repo (read-only context)

- `/home/rajesh/.claude/plans/logical-kindling-crown.md` — the reviewed plan (PRIMARY reference)
- `docs/NEXT-SESSION-TTS-ALTERNATIVES.md` — original Voxtral verification + 20-model survey context
- `docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md` — current Chatterbox state + Turbo comparison
- `docs/RESOURCE-REGISTRY.md` — Titan VRAM budget (59 GB free at peak)
- `docs/TITAN-SETUP-RECIPES.md` — aarch64 gotchas (CUDA 12/13 trap, DNS drop, numpy cascade)
- `docs/RESEARCH-TTS-GPU-DGX-SPARK.md` — SM_121 NVRTC patch history
- `services/annie-voice/chatterbox_server.py` — reference pattern for TTS server (auth, Semaphore, BF16)
- `services/annie-voice/tts_backends.py` — `TTSBackend` Protocol (future `VoxtralBackend` target, OUT OF SCOPE this session)
- `services/annie-voice/blackwell_patch.py` — Kokoro SM_121 monkey-patch pattern (template for Voxtral if NVRTC hits)
- `scripts/benchmark_titan.py` — existing `bench()` helper pattern
- `scripts/run-benchmark.sh` — existing dispatcher pattern (LD_LIBRARY_PATH + HF_TOKEN setup)
- `~/workplace/her/her-player/downloads/Iroq6EzQcl4/` — *Her* movie assets (video.mp4 340 MB, subtitles.vtt 134 KB)

---

## Post-benchmark

After verdict is written:
1. Update `MEMORY.md` session continuation with: TTFA numbers, VRAM on aarch64, verdict, which install branch worked
2. Update `docs/NEXT-SESSION-TTS-ALTERNATIVES.md` Voxtral entry: replace "unknown on aarch64" with actuals
3. If verdict = `swap_candidate` → create `docs/NEXT-SESSION-VOXTRAL-DEPLOY.md` for the deployment plan (adds `VoxtralBackend` to `tts_backends.py`, integrates into `start.sh`, updates `RESOURCE-REGISTRY.md`)
4. If verdict = `keep_chatterbox` → close out the thread; next TTS candidate from the 20-model survey becomes the next-session target

**Do NOT modify `MEMORY.md` during the benchmark run** — only after verdict. Keeps session-continuation entries aligned with measured reality.

---

## Post-implementation retrospective (Stage 9 of planning-with-review)

After the benchmark completes, tag each review finding in `MEMORY.md` or a session-note:

| Finding | Source | Verdict (HIT/MISS/PARTIAL/N/A) | Notes |
|---|---|---|---|
| F1 GPU latency degradation | Arch/Code CRIT-1 | TBD | Did Gemma actually slow during BENCHING? |
| F5 TTFA semantics mismatch | Arch CRIT-3 | TBD | Did the benchmark actually need branch-tagged metrics? |
| F11 VTT format heuristic | Arch MED-1 | TBD | Did the file use `<v>` or `- ` or neither? |
| (all others...) | | | |

If a finding category consistently returns MISS → note for future reviewers. If same gotcha HITs 3+ times → promote to `CLAUDE.md` or root-level feedback memory.
