# Next Session: Voxtral-4B-TTS-2603 Benchmark on Titan

**Created:** 2026-04-14 (session 102, post-adversarial-review)
**Executed:** 2026-04-14 (session 105) — VERDICT: `keep_chatterbox_voxtral_tts_unsupported_upstream`. See `docs/BENCHMARK-VOXTRAL-TITAN-20260414-2300.json`. This doc is preserved for the re-run that'll happen once vLLM upstream adds `voxtral_tts` model_type support.
**Type:** Benchmark execution — research + measurement, no production code change
**Expected duration:** 2-3 hours across Phase A-1 through Phase 5
**Blocks on:** Nothing. Parallel-safe with user's Panda E4B benchmark (different machine, different GPU).

---

## What

Run a voice-identity + latency benchmark of Mistral's new **Voxtral-4B-TTS-2603** (released 2026-03-26) on **Titan** (NVIDIA DGX Spark, aarch64 Blackwell GB10, 128 GB unified memory) to decide whether it should replace Panda's current **Chatterbox 500M** TTS.

**Primary metric (load-bearing):** Samantha voice-cloning quality via blind A/B listening against Chatterbox baseline. Per user: "The key I am using all these TTS is to have Samantha's voice instead of Kokoro for example."

**Secondary metrics:** Streaming TTFA on aarch64, RTF, peak VRAM, streaming chunk granularity.

**Outcome:** Go/no-go verdict per a pre-committed decision rule: `Voxtral mean ≥ Chatterbox mean + 0.5 AND Voxtral TTFA p50 ≤ 500 ms → swap candidate`. Otherwise keep Chatterbox.
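
The pre-committed rule reduces to a small pure function. A minimal sketch (function and argument names are illustrative, not the actual `tts_ab_score.py` API):

```python
def verdict(voxtral_mean, chatterbox_mean, voxtral_ttfa_p50_ms, streaming_capable=True):
    """Pre-committed go/no-go rule; returns the verdict string written to JSON."""
    if not streaming_capable:
        # 2C/2D install paths cannot measure TTFA (see F5 below)
        return "aarch64_vllm_omni_unverified"
    if voxtral_mean >= chatterbox_mean + 0.5 and voxtral_ttfa_p50_ms <= 500:
        return "swap_candidate"
    return "keep_chatterbox"
```

Committing to this before listening is the point: the numbers decide, not post-hoc impressions.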

---

## Plan

**READ FIRST:** `/home/rajesh/.claude/plans/logical-kindling-crown.md`

This plan went through the full `planning-with-review` skill:
- Stage 0 prior-lessons check (12 gotchas from prior sessions mapped)
- Stage 1B state machine (7 states, 11 transitions, enforced via `/tmp/voxtral-bench-state`)
- Stage 1C pre-mortem (12 failure scenarios with mitigations)
- Stage 2 adversarial architecture review (3 CRITICAL + 3 HIGH + 3 MEDIUM + 2 alternatives found)
- Stage 3 adversarial code review (4 CRITICAL + 4 HIGH + 4 MEDIUM found)
- Stage 6 feedback table: **22 findings, 22 implemented, 0 rejected, 0 deferred** (per `feedback_no_reject_defer.md`)

The plan has every review-driven fix baked in. Do NOT revert these — they are load-bearing.

---

## Load-bearing design decisions (do NOT accidentally revert)

1. **Latency-based health gate, NOT binary 200-check (F1)**
   Unified memory on DGX Spark has no hard VRAM wall. Gemma 4's p50 baseline is captured in Phase 0; BENCHING phase aborts if Gemma latency exceeds 2× baseline. A `curl /health → 200` check is insufficient — Gemma can degrade silently from 300 ms to 3000 ms while reporting "healthy."

2. **Dedicated git branch for benchmark artifacts (F6)**
   `git checkout -b voxtral-bench-<YYYYMMDD-HHMMSS> origin/main` before any commit. Benchmark WAVs + JSON live on this branch. User merges to main via explicit PR after review. Prevents binary churn on main and push-conflict race.

3. **`--env-file`, NOT `-e HF_TOKEN=...` (F7)**
   `ps aux` shows process args → `-e HF_TOKEN=<raw>` exposes the token. Use mode-0600 `/tmp/voxtral-bench-envfile` with `--env-file`, delete after container starts.

4. **tmux-wrapped dispatcher (F8)**
   `ssh titan 'tmux new-session -d -s voxtral-bench "bash scripts/run-voxtral-bench.sh"'` — survives SSH drop. Re-attach via `ssh titan -t tmux attach -t voxtral-bench`. Systemd 6am timer on Titan sweeps orphaned containers >4h old.

5. **Script-enforced blind A/B (F20)**
   `scripts/tts_ab_score.py` randomizes Voxtral-vs-Chatterbox WAV pairs with labels hidden. User scores 1-5 per pair. Pre-committed decision rule runs automatically. No free-form "I think Voxtral sounds better" judgment — avoids sunk-cost bias.

6. **TTFA semantics differ per install branch (F5)**
   - 2A/2B (vLLM HTTP streaming): TTFA = first byte via `httpx.stream()`
   - 2C (HF transformers in-process): `total_synthesis_latency_ms` only, NOT TTFA
   - 2D (pure-C batch): same as 2C
   - Results JSON tags `install_path` + `streaming_capable`. Go/no-go verdict requires 2A or 2B — 2C/2D produce `verdict: "aarch64_vllm_omni_unverified"`.

7. **User picks `--gemma-mode=parallel|pause-gemma` before first run (F21)**
   `parallel` = default per user constraint "Nothing on Titan should be broken"; Gemma keeps running.
   `pause-gemma` = adversarial reviewer's safer recommendation; ~3 min Gemma outage but true isolation.
   AskUserQuestion at dispatcher start. Honest safety/constraint trade-off surfaced.

8. **Optional Phase A-1 Mistral API pre-screen (F22)**
   Flag `--allow-api-prescreen` (default OFF, preserves local-first privacy). ~$0.05 for 3 test utterances. If Voxtral obviously fails Samantha voice identity → abort before Titan work, save ~1 hour.

9. **VTT parser auto-detects speaker-marker format (F11)**
   The plan's initial assumption (`- ` prefix) is probably wrong for yt-dlp auto-subtitles. `extract_samantha_from_movie.py` first inspects the file: checks for `<v Speaker>` voice spans (WebVTT canonical), falls back to `- ` (SRT-style), falls back to silence-gap segmentation.

10. **Deterministic container teardown, no `--rm` (F2)**
    `docker run --rm` races with explicit `docker rm -f` + EXIT trap. Plan drops `--rm`; teardown is: `docker stop → wait 10s → docker rm -f → verify via docker container inspect`.
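
The core of the latency gate in item 1 reduces to two small functions. This is an illustrative sketch, not the dispatcher's actual code — the probe callable and the `2×` threshold layout are assumptions drawn from the description above:

```python
import statistics
import time

def p50_ms(probe, n=5):
    """Median latency of n probe calls, in milliseconds.
    probe would be e.g. a GET against Gemma's inference endpoint."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        probe()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

def degraded(current_p50_ms, baseline_p50_ms):
    """The F1 gate: True once Gemma exceeds 2x its Phase 0 baseline.
    The background checker writes /tmp/voxtral-bench-abort when this fires."""
    return current_p50_ms > 2 * baseline_p50_ms
```

Note this catches exactly the failure mode a 200-check misses: a 300 ms → 3000 ms degradation trips `degraded()` while `/health` still returns 200.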

---

## Files to create

| # | File | Purpose |
|---|---|---|
| 1 | `scripts/benchmark_utils.py` | Shared `bench(fn, n=5)` helper with **explicit warmup call outside `bench()`** (F4). Import from both `benchmark_titan.py` and `benchmark_voxtral_titan.py`. Do NOT copy-paste verbatim. |
| 2 | `scripts/extract_samantha_from_movie.py` | Adaptive VTT parser + ffmpeg clip extraction. Path-sanitized via `pathlib.Path.resolve()` + startswith-check (F3). Integer-second filenames only. Asserts `stat().st_size > 2048` after each ffmpeg. |
| 3 | `scripts/benchmark_voxtral_titan.py` | Per-install-path TTFA measurement with `httpx.Timeout(connect=5.0, read=30.0, write=5.0, pool=5.0)` (F10). Explicit warmup call before timed loop. Branch selector via env/flag. |
| 4 | `scripts/tts_ab_score.py` | Blind-randomized listener. Pre-committed decision rule: `voxtral_mean ≥ chatterbox_mean + 0.5 AND voxtral_ttfa_p50 ≤ 500` → swap. Writes verdict to JSON. |
| 5 | `scripts/run-voxtral-bench.sh` | Dispatcher. Implements state machine via `/tmp/voxtral-bench-state` with `assert_state` checks per phase. Background health-check PID tracked in `/tmp/voxtral-bench-healthcheck.pid`. EXIT trap kills PID. Flags: `--gemma-mode`, `--allow-api-prescreen`, `--skip-baseline`. |
| 6 | `scripts/generate_chatterbox_baseline.sh` | Resumable Chatterbox baseline generator. Called by Phase 4 if Chatterbox is reachable; deferred to a later run if Panda is busy with E4B benchmark (F12). |
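
The shared helper in file 1 might look like the following — the real `benchmark_utils.py` signature may differ; this only illustrates the F4 requirement that warmup is an explicit call outside `bench()`:

```python
import statistics
import time

def bench(fn, n=5):
    """Time n calls of fn; returns (p50_ms, samples).
    No hidden warmup: callers MUST invoke fn() once themselves first (F4),
    so one-off costs (CUDA context init, kernel JIT, model load) never
    pollute the timed samples."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples), samples

# Usage: warmup is explicit and visible at the call site.
# fn()                       # warmup, untimed
# p50, raw = bench(fn, n=5)  # timed loop
```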

---

## Files to modify

- `.gitattributes` — add `docs/voxtral-bench-samples-*/**/*.wav filter=lfs diff=lfs merge=lfs` (F15)
- `services/audio-pipeline/voice-references/` (new dir) — the user-confirmed Samantha movie clip lives here, committed to the repo
- `services/audio-pipeline/voice-references/metadata.json` — add the new clip to allowlist (Chatterbox pattern)
- `MEMORY.md` (post-benchmark only) — add a session continuation with measured numbers + go/no-go

---

## One-time Titan setup (before first run)

1. `ssh titan 'git lfs --version'` — if missing: `sudo apt install git-lfs && git lfs install`
2. `ssh titan 'which tmux'` — if missing: `sudo apt install tmux`
3. Install the systemd 6am orphan-cleanup timer (safety net for F8 / pre-mortem #9):
   ```bash
   ssh titan 'sudo tee /etc/systemd/system/voxtral-bench-cleanup.service << EOF
   [Unit]
   Description=Sweep orphaned Voxtral benchmark containers
   [Service]
   Type=oneshot
   ExecStart=/bin/bash -c "docker ps --filter name=vllm-voxtral-bench --filter status=running --format \"{{.Names}} {{.RunningFor}}\" | grep -E \"([4-9]|[0-9][0-9]) hours|days|weeks\" | cut -d\" \" -f1 | xargs -r docker rm -f"
   EOF
   sudo tee /etc/systemd/system/voxtral-bench-cleanup.timer << EOF
   [Unit]
   Description=Daily cleanup of stale benchmark containers
   [Timer]
   OnCalendar=*-*-* 06:00:00
   [Install]
   WantedBy=timers.target
   EOF
   sudo systemctl daemon-reload
   sudo systemctl enable --now voxtral-bench-cleanup.timer'
   ```

---

## Start command

```bash
# 1. Read the full review-driven plan
less /home/rajesh/.claude/plans/logical-kindling-crown.md

# 2. AskUserQuestion to surface the two user-decision flags:
#    - --gemma-mode=parallel (default) vs pause-gemma
#    - --allow-api-prescreen (default off) vs on

# 3. Begin Phase A0 (Samantha extraction) via:
python3 scripts/extract_samantha_from_movie.py \
  --video ~/workplace/her/her-player/downloads/Iroq6EzQcl4/video.mp4 \
  --vtt ~/workplace/her/her-player/downloads/Iroq6EzQcl4/subtitles.vtt \
  --out /tmp/samantha_candidates/

# 4. User listens to candidates + confirms which is Samantha

# 5. Run dispatcher on Titan:
ssh titan 'cd ~/workplace/her/her-os && git fetch origin && \
  tmux new-session -d -s voxtral-bench \
  "bash scripts/run-voxtral-bench.sh --gemma-mode=parallel"'

# 6. Monitor via: ssh titan -t tmux attach -t voxtral-bench
```

---

## Verification (success criteria)

1. ✅ Phase 0 captures `GEMMA_P50_BASELINE_MS` cleanly (Titan production healthy)
2. ✅ Phase A0 produces at least 1 user-confirmed Samantha clip ≥ 5 s, saved to `services/audio-pipeline/voice-references/`
3. ✅ Phase 2: at least one of 2A/2B/2C/2D branches succeeds (aarch64 feasibility answered)
4. ✅ Phase 3: 10 utterances × 2 voice conditions complete with <3/20 timeouts
5. ✅ Phase 4: verdict written to JSON (`swap_candidate` OR `keep_chatterbox` OR `aarch64_vllm_omni_unverified`)
6. ✅ Phase 5 post-flight: Gemma p50 within 1.2× of pre-baseline, no orphan containers, benchmark branch pushed to origin
7. ✅ Results visible: `docs/BENCHMARK-VOXTRAL-TITAN-<timestamp>.json` + `docs/voxtral-bench-samples-<timestamp>/*.wav` committed on `voxtral-bench-<timestamp>` branch
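
Criteria 5 and the F5 tagging rule can be sanity-checked against the results JSON after the run. A sketch, assuming the field names described above (`verdict`, `install_path`, `streaming_capable`) — the benchmark's actual schema may differ:

```python
import json
from pathlib import Path

ALLOWED_VERDICTS = {"swap_candidate", "keep_chatterbox", "aarch64_vllm_omni_unverified"}

def validate_results(path):
    """Assert the verdict JSON has the shape the success criteria expect."""
    results = json.loads(Path(path).read_text())
    assert results["verdict"] in ALLOWED_VERDICTS
    assert results["install_path"] in {"2A", "2B", "2C", "2D"}
    # A go/no-go verdict requires a streaming-capable install path (F5)
    if results["verdict"] in {"swap_candidate", "keep_chatterbox"}:
        assert results["streaming_capable"]
    return results
```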

---

## If something fails

**Phase 1 (aarch64 feasibility) fails on ALL 4 branches:**
- Plan: report "Voxtral unusable on aarch64 at this time." Do NOT force-merge fallback results as if they answer the question. Document in `docs/RESEARCH-TTS-ALTERNATIVES.md` and move to next candidate (Chatterbox-Turbo, CosyVoice 2, F5-TTS from the prep doc).

**Phase 3 hits NVRTC/Jiterator error (F17):**
- Diagnostic log written to `voxtral_nvrtc_diagnostic.log` with full stack trace + tensor ops context.
- This becomes a CONCRETE follow-up: write a targeted monkey-patch inspired by `services/annie-voice/blackwell_patch.py`. Not a hand-wave.

**Gemma degrades during BENCHING (F1):**
- Background health-check writes `/tmp/voxtral-bench-abort`
- Main loop transitions to ERROR_ABORTED, runs Phase 5 teardown immediately
- Post-flight verifies Gemma recovered within 60 s; if not → user manually investigates

**SSH drops mid-run (F8):**
- Dispatcher survives inside tmux. Re-attach via `ssh titan -t tmux attach -t voxtral-bench`.
- If tmux session also died (kernel panic etc.), systemd 6am timer sweeps orphans.

**Chatterbox unreachable in Phase 4 (F12):**
- Phase 4 writes `deferred-baseline-utterances.txt` and skips.
- After user's Panda E4B benchmark completes + Chatterbox restarts, run:
  ```bash
  bash scripts/generate_chatterbox_baseline.sh
  ```

---

## Related files in repo (read-only context)

- `/home/rajesh/.claude/plans/logical-kindling-crown.md` — the reviewed plan (PRIMARY reference)
- `docs/NEXT-SESSION-TTS-ALTERNATIVES.md` — original Voxtral verification + 20-model survey context
- `docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md` — current Chatterbox state + Turbo comparison
- `docs/RESOURCE-REGISTRY.md` — Titan VRAM budget (59 GB free at peak)
- `docs/TITAN-SETUP-RECIPES.md` — aarch64 gotchas (CUDA 12/13 trap, DNS drop, numpy cascade)
- `docs/RESEARCH-TTS-GPU-DGX-SPARK.md` — SM_121 NVRTC patch history
- `services/annie-voice/chatterbox_server.py` — reference pattern for TTS server (auth, Semaphore, BF16)
- `services/annie-voice/tts_backends.py` — `TTSBackend` Protocol (future `VoxtralBackend` target, OUT OF SCOPE this session)
- `services/annie-voice/blackwell_patch.py` — Kokoro SM_121 monkey-patch pattern (template for Voxtral if NVRTC hits)
- `scripts/benchmark_titan.py` — existing `bench()` helper pattern
- `scripts/run-benchmark.sh` — existing dispatcher pattern (LD_LIBRARY_PATH + HF_TOKEN setup)
- `~/workplace/her/her-player/downloads/Iroq6EzQcl4/` — *Her* movie assets (video.mp4 340 MB, subtitles.vtt 134 KB)

---

## Post-benchmark

After verdict is written:
1. Update `MEMORY.md` session continuation with: TTFA numbers, VRAM on aarch64, verdict, which install branch worked
2. Update `docs/NEXT-SESSION-TTS-ALTERNATIVES.md` Voxtral entry: replace "unknown on aarch64" with actuals
3. If verdict = `swap_candidate` → create `docs/NEXT-SESSION-VOXTRAL-DEPLOY.md` for the deployment plan (adds `VoxtralBackend` to `tts_backends.py`, integrates into `start.sh`, updates `RESOURCE-REGISTRY.md`)
4. If verdict = `keep_chatterbox` → close out the thread; next TTS candidate from the 20-model survey becomes the next-session target

**Do NOT modify `MEMORY.md` during the benchmark run** — only after verdict. Keeps session-continuation entries aligned with measured reality.

---

## Post-implementation retrospective (Stage 9 of planning-with-review)

After the benchmark completes, tag each review finding in `MEMORY.md` or a session-note:

| Finding | Source | Verdict (HIT/MISS/PARTIAL/N/A) | Notes |
|---|---|---|---|
| F1 GPU latency degradation | Arch/Code CRIT-1 | TBD | Did Gemma actually slow during BENCHING? |
| F5 TTFA semantics mismatch | Arch CRIT-3 | TBD | Did the benchmark actually need branch-tagged metrics? |
| F11 VTT format heuristic | Arch MED-1 | TBD | Did the file use `<v>` or `- ` or neither? |
| (all others...) | | | |

If a finding category consistently returns MISS → note for future reviewers. If same gotcha HITs 3+ times → promote to `CLAUDE.md` or root-level feedback memory.
