# Next Session: Voxtral-4B-TTS-2603 via `vllm-omni` source build on Titan

**Created:** 2026-04-15 (post-session-105 verdict correction)
**EXECUTED:** 2026-04-15 (session 106) — SUCCESS. 10/10 utterances, RTF 0.79×. See `docs/BENCHMARK-VOXTRAL-TITAN-20260415.json` for full results. This doc is preserved for the PR #2790 cherry-pick follow-up run.
**Type:** Source build + benchmark execution
**Expected duration:** 2-4 hours (uncertain — first verified DGX Spark aarch64 vllm-omni build)
**Supersedes:** `docs/NEXT-SESSION-VOXTRAL-BENCHMARK.md` (mainline-vLLM path, confirmed dead)
**Blocks on:** nothing — weights already cached on Titan (7.5 GB), Samantha ref already committed, benchmark scripts already written.

---

## What changed since session 105

Session 105 concluded "voxtral_tts unsupported upstream" after testing mainline `vllm`. **That was wrong.** The research agent on 2026-04-15 traced the actual support path:

- `voxtral_tts` model_type is supported, but only via the **separate `vllm-omni` pip package**, not mainline `vllm`.
- Primary source: https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/model_executor/models/registry.py — registers `VoxtralTTSForConditionalGeneration`, `VoxtralTTSAudioGeneration`, `VoxtralTTSAudioTokenizer` under module `voxtral_tts`.
- Issue #2388 is our exact error, resolved by `vllm-omni serve … --omni` (NOT `vllm serve`).
- The `--omni` flag is **mandatory** — without it, `vllm-omni serve` falls back to the mainline registry and reproduces our failed test.

**aarch64 caveat:** vllm-omni publishes x86_64 CUDA wheels only (stable v0.18.0, 2026-03-28; RC v0.19.0rc1, 2026-04-04), so Titan needs a source build. The maintainers explicitly recommend building from source anyway, given the project's fast iteration pace.
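A quick local gate for the wheel-vs-source decision can be scripted (a sketch; the "x86_64 wheels only" fact is from the caveat above, everything else is illustrative):

```shell
#!/usr/bin/env bash
# Sketch: decide wheel-vs-source-build from the machine architecture.
# vllm-omni ships x86_64 CUDA wheels only, so any other arch (Titan is
# aarch64) means a source build.
needs_source_build() {
  # True (exit 0) when no prebuilt wheel matches the given architecture.
  local arch="$1"
  [ "$arch" != "x86_64" ]
}

if needs_source_build "$(uname -m)"; then
  echo "source build required"
else
  echo "prebuilt wheel may apply"
fi
```

On Titan this prints `source build required`; on the dev box it depends on the local arch.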

## Plan (compressed — no adversarial re-review needed; this is the corrected path)

### Phase 0 — pre-flight
1. Gemma baseline (reuse session-105 method): p50=190ms, abort threshold 380ms.
2. Verify vllm-omni git repo accessible; confirm HEAD commit SHA.
3. Check disk space on Titan (last check: 1.5 TB free).
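The pre-flight numbers above can be wrapped in a small gate script (a sketch: the 2× multiplier is inferred from the 190 ms → 380 ms pair in step 1, and the `df` flags assume GNU coreutils on Titan; the minimum-free-GB figure is an assumed margin, not a measured requirement):

```shell
#!/usr/bin/env bash
# Sketch of the Phase 0 gates: abort threshold is 2x the Gemma baseline p50,
# and free disk must cover the source build.
abort_threshold_ms() {
  # 190 -> 380, matching step 1 above.
  awk -v p50="$1" 'BEGIN { print p50 * 2 }'
}

disk_ok() {
  # min_gb is an assumed comfort margin, not a measured number.
  local path="$1" min_gb="$2"
  local free_gb
  free_gb="$(df -BG --output=avail "$path" | tail -1 | tr -dc '0-9')"
  [ "$free_gb" -ge "$min_gb" ]
}

echo "abort if Gemma p50 > $(abort_threshold_ms 190) ms"
```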

### Phase 1 — source build
```bash
ssh titan
cd ~
git clone https://github.com/vllm-project/vllm-omni
cd vllm-omni
git log -1 --format='%H %s'   # capture commit SHA for the results JSON

# Build in a dedicated venv — never in the existing vllm-gemma4 runtime env (see gotcha 7)
uv venv /tmp/vllm-omni-bench-venv
source /tmp/vllm-omni-bench-venv/bin/activate
uv pip install -e .
# Expected pain: CUDA 13 / aarch64-specific issues; Triton kernels for SM_121.
# Watch for: missing wheel for flash-attn, numba, etc. Be ready to pin transformers per release notes.
```
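Once the build succeeds, it's worth freezing the provenance for the results JSON (a sketch; the field names are illustrative, not a schema the benchmark script requires):

```shell
#!/usr/bin/env bash
# Sketch: record build provenance so the benchmark JSON can name the exact
# commit and interpreter that produced the numbers.
build_provenance() {
  local sha="$1" pyver="$2"
  printf '{"vllm_omni_commit": "%s", "python": "%s", "arch": "%s"}\n' \
    "$sha" "$pyver" "$(uname -m)"
}

# Usage on Titan, from ~/vllm-omni with the venv active:
#   build_provenance "$(git rev-parse HEAD)" "$(python3 -V | cut -d' ' -f2)" > build-info.json
```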

### Phase 2 — stop Gemma + launch vllm-omni (pause-gemma mode)
```bash
ssh titan docker stop vllm-gemma4
# On Titan, from ~/vllm-omni with the build venv active:
source /tmp/vllm-omni-bench-venv/bin/activate
# Launch — NOTE: the --omni flag is mandatory
vllm-omni serve mistralai/Voxtral-4B-TTS-2603 \
  --tokenizer-mode mistral \
  --omni \
  --stage-configs-path "$(pwd)/vllm_omni/model_executor/stage_configs/voxtral_tts.yaml" \
  --host 0.0.0.0 --port 8004 \
  --gpu-memory-utilization 0.30 \
  --max-num-seqs 1 &
```
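Before starting Phase 3, poll until the server actually answers (a sketch: it assumes a standard vLLM-style `/health` endpoint on :8004, which should be verified on the day; the probe command is injectable so the loop itself is testable):

```shell
#!/usr/bin/env bash
# Sketch: block until a readiness probe succeeds, or give up after timeout_s.
wait_for_ready() {
  local probe="$1" timeout_s="$2" waited=0
  until eval "$probe"; do
    sleep 1
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout_s" ]; then return 1; fi
  done
  return 0
}

# e.g. wait_for_ready 'curl -sf http://192.168.68.52:8004/health >/dev/null' 600
```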

### Phase 3 — benchmark
Reuse existing scripts:
```bash
# From dev box (current machine)
python3 scripts/benchmark_voxtral_titan.py \
  --endpoint http://192.168.68.52:8004 \
  --install-path 2E_vllm_omni_source \
  --voice-ref services/audio-pipeline/voice-references/samantha_movie_primary.wav \
  --n 10 \
  --out docs/voxtral-bench-samples-<timestamp>/
```
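The RTF figure reported in the header (0.79×) is simply synthesis wall time over audio duration; a one-liner for spot-checking individual utterances (a sketch, assuming the benchmark script reports both numbers):

```shell
#!/usr/bin/env bash
# Sketch: real-time factor = wall-clock synthesis time / audio duration.
# < 1.0 means faster than real time.
rtf() {
  LC_ALL=C awk -v wall="$1" -v audio="$2" 'BEGIN { printf "%.2f\n", wall / audio }'
}

# e.g. rtf 7.9 10.0  ->  0.79
```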

### Phase 4 — Chatterbox baseline + A/B score
- Assumes Chatterbox healthy on Panda (:8772). If Panda is in the parallel E4B benchmark state, defer.
- `bash scripts/generate_chatterbox_baseline.sh --out <out_dir>`
- `python3 scripts/tts_ab_score.py --samples-dir <out_dir> --output docs/BENCHMARK-VOXTRAL-VLLM-OMNI-<ts>.json`

### Phase 5 — teardown + restart Gemma
- Stop vllm-omni container/process
- `docker start vllm-gemma4`
- Verify post-flight Gemma p50 is within 1.2× the Phase 0 baseline (drift gate from session 105)
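The drift gate transcribes directly into a check (a sketch; with the 190 ms Phase 0 baseline the ceiling works out to 228 ms):

```shell
#!/usr/bin/env bash
# Sketch of the post-flight drift gate: Gemma p50 after restart must be
# within 1.2x the pre-flight baseline.
drift_ok() {
  local post_ms="$1" base_ms="$2"
  awk -v post="$post_ms" -v base="$base_ms" 'BEGIN { exit !(post <= 1.2 * base) }'
}

# e.g. drift_ok 200 190 passes; drift_ok 240 190 fails (ceiling 228 ms)
```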

## Gotchas (baked in from research)

1. **`--omni` flag mandatory** — without it, reproduces session-105 failure exactly.
2. **Stage-configs YAML required** — `vllm_omni/model_executor/stage_configs/voxtral_tts.yaml`. Not optional.
3. **`--tokenizer-mode mistral` required** — without it, the Mistral tokenizer extras fail silently.
4. **PR #2405 patch may be needed** — if transformers is recent, `@strict` decorator triggers `VoxtralTTSConfig` AttributeError. Cherry-pick or pin transformers per release notes.
5. **PR #2790 may be needed** — for uploading our Samantha wav as ref_audio. If not merged by run-time, either cherry-pick or use the preset `neutral_female` voice first (rules out cloning issues separately from runtime issues).
6. **aarch64 flash-attn / triton compile** — SM_121 Blackwell has historical NVRTC issues (`blackwell_patch.py` precedent). Voxtral's flow-matching transformer + codec decoder may hit similar patterns. Have `blackwell_patch.py` pattern ready as a template.
7. **Don't source-build inside the existing Gemma Python env.** Use a fresh venv or temp directory — vllm-omni and mainline vllm are separate packages and will conflict.

## Success criteria

- At least 10 Samantha utterances synthesized without errors
- Audio files open and play back (non-empty, correct sample rate)
- Voice identity recognizable as Samantha via A/B listening
- TTFA p50 ≤ 500 ms (streaming; per session-105 decision rule)
- Gemma post-flight drift ≤ 1.2×

## Pre-committed decision rule (unchanged from session 105)

`voxtral_omni_mean ≥ chatterbox_mean + 0.5 AND voxtral_ttfa_p50 ≤ 500ms → swap_candidate`
Else → `keep_chatterbox_defer_voxtral_tts`.
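The rule transcribes mechanically into a script, which keeps the session from re-litigating the call in the moment (a sketch; the means come from `tts_ab_score.py`, and the 0.5 margin and 500 ms threshold are exactly the pre-committed values above):

```shell
#!/usr/bin/env bash
# Sketch of the pre-committed decision rule, transcribing the inequality above.
decide() {
  local voxtral_mean="$1" chatterbox_mean="$2" ttfa_p50_ms="$3"
  if awk -v v="$voxtral_mean" -v c="$chatterbox_mean" -v t="$ttfa_p50_ms" \
       'BEGIN { exit !(v >= c + 0.5 && t <= 500) }'; then
    echo swap_candidate
  else
    echo keep_chatterbox_defer_voxtral_tts
  fi
}

# e.g. decide 4.2 3.5 420  ->  swap_candidate
```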

## What's already in place

- Weights cached: `~/.cache/huggingface/hub/models--mistralai--Voxtral-4B-TTS-2603/` on Titan (7.5 GB)
- Samantha reference: `services/audio-pipeline/voice-references/samantha_movie_primary.wav` (user-confirmed, volume-normalized)
- Benchmark client script: `scripts/benchmark_voxtral_titan.py`
- A/B scorer: `scripts/tts_ab_score.py`
- Chatterbox baseline generator: `scripts/generate_chatterbox_baseline.sh`
- Branch: `voxtral-bench-20260414` — base for continuation

## Reference commands for copy-paste

```bash
# Quick source-build sanity probe before committing to stop Gemma
ssh titan 'cd ~/vllm-omni 2>/dev/null || { cd ~ && git clone https://github.com/vllm-project/vllm-omni && cd vllm-omni; }; \
  git log -1 --format="%H %s"'
# Braces, not parentheses: a (…) subshell would drop the cd, so on the
# fresh-clone path git log would run in ~ and fail.

# Full run (once source build succeeds)
ssh titan 'tmux new-session -d -s voxtral-omni-bench \
  "bash -c \"cd ~/vllm-omni && source /tmp/vllm-omni-bench-venv/bin/activate && \
  vllm-omni serve mistralai/Voxtral-4B-TTS-2603 --tokenizer-mode mistral --omni \
  --stage-configs-path vllm_omni/model_executor/stage_configs/voxtral_tts.yaml \
  --host 0.0.0.0 --port 8004 --gpu-memory-utilization 0.30 --max-num-seqs 1\""'
```
