# Research: Voxtral-4B-TTS-2603 for her-os (Samantha voice)

**Status:** Complete (2026-04-14 → 2026-04-15, session 106)
**Verdict:** Voxtral runs on Titan DGX Spark aarch64 at RTF 0.79× via `vllm-omni v0.18.0`. **Samantha voice clone is IMPOSSIBLE** — Mistral withheld encoder weights from the open-source release. Self-hosted Voxtral is restricted to 20 preset voices. **Drop Voxtral from the Samantha-voice path; keep Chatterbox.**

This file is the consolidated record. See also:
- `docs/BENCHMARK-VOXTRAL-TITAN-20260415.json` — full structured benchmark results
- `docs/voxtral-voice-samples-20260415/` — 20 WAV samples (same phrase, each preset) + README
- `docs/NEXT-SESSION-TTS-ALTERNATIVES.md` — broader 20-model TTS survey context
- MEMORY.md session-106 entry — chronological session log

---

## 1. Headline findings

| Finding | Primary source |
|---|---|
| Voxtral-4B-TTS-2603 released 2026-03-26 by Mistral | HF: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603 |
| Self-hosted arbitrary voice clone is NOT POSSIBLE on the open-source checkpoint | HF Discussion #17 (see §3) |
| vllm-omni v0.18.0 runs on Titan aarch64 with paired base image | Verified session 106; `docker/Dockerfile.ci` pins `VLLM_BASE_TAG=v0.18.0` |
| Mean RTF 0.79× (faster than real-time on DGX Spark GB10) | 10-utterance benchmark, `BENCHMARK-VOXTRAL-TITAN-20260415.json` |
| License: CC-BY-NC 4.0 (restricts commercial use, not personal) | Model card |

---

## 2. Runtime recipe (what actually works)

**Paired-tag requirement** (non-negotiable — API skew on either side breaks it):

```
vllm-omni v0.18.0 (git tag, 2026-03-28)
vllm/vllm-openai:v0.18.0-aarch64-cu130 (Docker Hub base image)
```

**Launch on Titan** (assumes Gemma is stopped or co-resident with lowered mem utilization):

```bash
cd ~/vllm-omni
git checkout v0.18.0   # detached HEAD

# Lower mem utilization from 0.8 → 0.5 (stage-configs hardcodes it; CLI flag doesn't override)
sed -i 's/gpu_memory_utilization: 0.8/gpu_memory_utilization: 0.5/' \
  vllm_omni/model_executor/stage_configs/voxtral_tts.yaml

docker run -d --name vllm-voxtral-omni \
  --gpus all -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/vllm-omni:/opt/vllm-omni \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -p 8004:8000 --ipc=host --entrypoint bash \
  vllm/vllm-openai:v0.18.0-aarch64-cu130 \
  -c 'set -e
cd /opt/vllm-omni
pip install --quiet $(grep -v "^#" requirements/common.txt | grep -v "^$" | head -30 | tr "\n" " ")
SETUPTOOLS_SCM_PRETEND_VERSION=0.18.0 pip install --quiet --no-deps -e .
exec vllm-omni serve mistralai/Voxtral-4B-TTS-2603 \
  --tokenizer-mode mistral --omni \
  --stage-configs-path vllm_omni/model_executor/stage_configs/voxtral_tts.yaml \
  --host 0.0.0.0 --port 8000 --max-num-seqs 1'
```
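Once the container is up, readiness can be confirmed by polling `/v1/models` (cold start takes minutes while deps install). A minimal sketch — the `{"data": [{"id": ...}]}` response shape is the standard OpenAI-compatible one and is assumed here, and `titan:8004` is a placeholder host:

```python
import json
import time
import urllib.error
import urllib.request

MODEL_ID = "mistralai/Voxtral-4B-TTS-2603"

def model_listed(models_json: dict, model_id: str = MODEL_ID) -> bool:
    """True once the served model appears in the /v1/models payload."""
    return any(m.get("id") == model_id for m in models_json.get("data", []))

def wait_until_ready(base_url: str = "http://titan:8004",
                     timeout_s: float = 600.0) -> bool:
    """Poll /v1/models until Voxtral is registered; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if model_listed(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container still installing deps / warming up
        time.sleep(10)
    return False
```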

**Gotchas** (each cost us an iteration in session 106):

| Gotcha | Symptom | Fix |
|---|---|---|
| HEAD of vllm-omni + mainline vllm nightly | `ImportError: cannot import name 'OpenAIServingPooling'` | Use paired tags v0.18.0 on both sides |
| `SETUPTOOLS_SCM_PRETEND_VERSION` missing | `InvalidVersion: 'dev'` | Set env var to `0.18.0` before `pip install -e .` |
| `pip install -e . --no-deps` only | `ModuleNotFoundError: aenum` / `pydub` / ... | Install `requirements/common.txt` deps separately first |
| CLI `--gpu-memory-utilization 0.3` flag | Stage-configs YAML hardcodes 0.8, overrides CLI | `sed` the YAML, not the CLI |
| `hf_transfer` enabled but not installed | `ValueError: hf_transfer package not available` | `unset HF_HUB_ENABLE_HF_TRANSFER` before download |
| `hf` CLI not in non-interactive PATH | `command not found` | Prepend `~/.local/bin` to PATH |

**Observed endpoints** (Voxtral server at `:8004`):
```
GET  /v1/models
GET  /v1/audio/voices
POST /v1/audio/voices             (upload custom voice — requires 'consent' field)
DELETE /v1/audio/voices/{name}
POST /v1/audio/speech
POST /v1/audio/speech/stream      (streaming — not yet benchmarked for TTFA)
POST /v1/audio/speech/batch
```
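A minimal client sketch against the observed `/v1/audio/speech` endpoint. The body fields (`model`, `input`, `voice`, `response_format`) follow the usual OpenAI speech schema and are an assumption here, not verified against vllm-omni's `serving_speech.py`; `titan:8004` is a placeholder host:

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str = "casual_female",
                         fmt: str = "wav") -> dict:
    """OpenAI-style /v1/audio/speech body; field names assumed, not verified."""
    return {
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "input": text,
        "voice": voice,
        "response_format": fmt,
    }

def synthesize(text: str, voice: str = "casual_female",
               base_url: str = "http://titan:8004") -> bytes:
    """POST to /v1/audio/speech and return the raw audio bytes."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(build_speech_request(text, voice)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read()
```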

---

## 3. Why Samantha voice cloning is impossible (the definitive block)

**Runtime error** hit when we tried to clone Samantha:
```
RuntimeError: encode_waveforms requires encoder weights which are not available in the open-source checkpoint.
```
Raised by `mistral_common/tokens/tokenizers/audio.py:440` (internal `assert`).

**Primary-source confirmation** from Mistral themselves (HF Discussion #17, 2026-03-27, org-member `y123456y78`):
> "The voice cloning feature is not included in the current release, and we don't yet have a timeline for its availability."
>
> "While we didn't release the encoder weights in this version, all the details about the encoder are available in the paper: https://arxiv.org/pdf/2603.25551"

Source URL: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603/discussions/17

**Architectural confirmation** (repo file listing):
- `consolidated.safetensors` — decoder / LM only
- `voice_embedding/*.pt` × 20 — precomputed speaker embeddings for the 20 presets
- **NO encoder file**

The 20 preset voices work because embeddings are pre-baked. Arbitrary audio cannot be encoded into an embedding without the missing weights.
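A toy illustration of that split (illustrative only, not Mistral's code): preset lookup is just a file read against the shipped `voice_embedding/*.pt` files, while encoding any new reference audio dead-ends exactly the way the runtime does:

```python
from pathlib import Path

# 3 of the 20 shipped presets, for illustration
PRESETS = {"neutral_female", "casual_female", "cheerful_female"}

def load_preset_embedding(name: str, repo: Path) -> Path:
    """Presets work: their speaker embeddings ship pre-baked in the repo."""
    if name not in PRESETS:
        raise KeyError(f"unknown preset: {name}")
    return repo / "voice_embedding" / f"{name}.pt"

def encode_reference(wav_path: Path) -> None:
    """Cloning fails: mapping new audio to an embedding needs the encoder,
    and the open-source checkpoint ships no encoder weights."""
    raise RuntimeError(
        "encode_waveforms requires encoder weights which are not available "
        "in the open-source checkpoint."
    )
```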

**PR #2790 does NOT fix this.** PR #2790 ("handle uploaded voice as ref_audio in Voxtral TTS", open, 2026-04-14) only wires uploaded-voice name → `ref_audio` routing. Session 106 hand-patched the same routing in `_build_voxtral_prompt` (`serving_speech.py:1095-1108`) and confirmed: the crash just moves from tokenizer-time to engine-core-time, same missing-weights root cause.

**Community reaction** (same gap):
- HF #16 (17 comments, 11 👍): "Why not open source" — https://huggingface.co/mistralai/Voxtral-4B-TTS-2603/discussions/16
- HF #11: "How to make new voices?"
- HF #5: "Finetuning code?" (closed)
- Community research attempt: `MarvinRomson/voxtral-tts-codes-for-audio` (9 ⭐, 2026-04-13) — reverse-engineering, not a working clone

**All community forks inherit the gap** — `AITRADER/Voxtral-4B-TTS-2603-bf16/mxfp4/mxfp8`, `idontkwow/Voxtral-4B-TTS-2603`, MLX/NVFP4 quants. Can't quantize weights that don't exist.

**Where the encoder lives**: Mistral's paid hosted API at `https://console.mistral.ai/build/audio/text-to-speech` (AI Studio) runs the encoder server-side. Self-hosted users only get the decoder + 20 precomputed embeddings.

---

## 4. Benchmark numbers (session 106)

**Setup**: Titan DGX Spark GB10 (aarch64 Blackwell SM_121, 128 GB unified memory), vllm-omni v0.18.0, `gpu_memory_utilization: 0.5`, `max_num_seqs: 1`, Gemma 4 26B paused.

**10-utterance benchmark, preset `neutral_female`**:

| Metric | Value |
|---|---|
| n | 10 |
| Warmup latency (1st call, torch.compile cost) | 18.3 s |
| Mean synthesis latency (post-warmup) | 4.4 s |
| Min / max | 1.98 s / 9.43 s |
| Mean RTF | **0.79× (faster than real-time)** |
| HTTP 200 rate | 10 / 10 |
| Audio validity | 10 / 10 (24 kHz mono PCM) |
| Output format | WAV (PCM / FLAC / MP3 / AAC / Opus also supported) |
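For reference, RTF here is wall-clock synthesis time divided by output audio duration, so the 4.4 s mean latency at RTF 0.79 implies roughly 5.6 s of audio per utterance. A minimal helper for recomputing it from the saved WAVs:

```python
import wave

def wav_duration_s(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

def rtf(synth_wall_clock_s: float, audio_duration_s: float) -> float:
    """Real-time factor: <1.0 means synthesis is faster than playback."""
    return synth_wall_clock_s / audio_duration_s
```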

**Consistency check**: TrevorS's Burn-based implementation reports ~1.0× RTF at 3 Euler steps on DGX Spark GB10 per `github.com/TrevorS/voxtral-mini-realtime-rs` README. Our vllm-omni measurement is slightly faster (different config / backend).

**NOT measured yet**: streaming TTFA via `/v1/audio/speech/stream`. The phone-call budget check (<500 ms to first audio) is still pending.
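When that test happens, TTFA reduces to timing the first non-empty chunk of the streamed response body. A backend-agnostic sketch:

```python
import time
from typing import Iterable, Optional

def time_to_first_audio(chunks: Iterable[bytes]) -> Optional[float]:
    """Seconds from iteration start until the first non-empty chunk.
    Feed it the chunked body of /v1/audio/speech/stream."""
    start = time.monotonic()
    for chunk in chunks:
        if chunk:
            return time.monotonic() - start
    return None  # stream ended without producing audio
```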

**Gemma impact**: baseline p50 190 ms, post-flight p50 222 ms, drift 1.17× (within 1.2× gate).

---

## 5. The 20 preset voices + audition samples

Same phrase across all 20: *"Hi Rajesh. I have been thinking about the way we talked yesterday. How are you feeling today?"*

Samples committed at `docs/voxtral-voice-samples-20260415/voxtral_<voice>.wav`.

| # | Voice | Gender | Language | Samantha-fit |
|---|---|---|---|---|
| 1 | **neutral_female** | F | English | ⭐⭐⭐ baseline candidate |
| 2 | **casual_female** | F | English | ⭐⭐⭐ warm, conversational |
| 3 | **cheerful_female** | F | English | ⭐⭐⭐ Samantha's playful energy |
| 4 | neutral_male | M | English | ❌ wrong gender |
| 5 | casual_male | M | English | ❌ wrong gender |
| 6 | hi_female | F | Hindi | ⭐ useful for Hindi phrases |
| 7 | hi_male | M | Hindi | ❌ |
| 8 | de_female | F | German | ⚠️ accent |
| 9 | de_male | M | German | ❌ |
| 10 | es_female | F | Spanish | ⚠️ accent |
| 11 | es_male | M | Spanish | ❌ |
| 12 | fr_female | F | French | ⚠️ accent |
| 13 | fr_male | M | French | ❌ |
| 14 | it_female | F | Italian | ⚠️ accent |
| 15 | it_male | M | Italian | ❌ |
| 16 | nl_female | F | Dutch | ⚠️ accent |
| 17 | nl_male | M | Dutch | ❌ |
| 18 | pt_female | F | Portuguese | ⚠️ accent |
| 19 | pt_male | M | Portuguese | ❌ |
| 20 | ar_male | M | Arabic | ❌ |

**10 female voices, 3 of them English** = the real Samantha-candidate shortlist.

**USER SELECTION (2026-04-15)**: if we ever deploy Voxtral (e.g., Chatterbox fallback or decision reversal), the chosen preset is **`casual_female`**.
```bash
ffplay -nodisp -autoexit docs/voxtral-voice-samples-20260415/voxtral_casual_female.wav
```

**Listen to all 3 candidates** (local Ubuntu):
```bash
ffplay -nodisp -autoexit docs/voxtral-voice-samples-20260415/voxtral_neutral_female.wav
ffplay -nodisp -autoexit docs/voxtral-voice-samples-20260415/voxtral_casual_female.wav   # USER PICK
ffplay -nodisp -autoexit docs/voxtral-voice-samples-20260415/voxtral_cheerful_female.wav
```

**Mistral-hosted AI Studio** (for hosted-quality comparison): https://console.mistral.ai/build/audio/text-to-speech

---

## 6. Samantha voice reference (salvageable work product)

Even though Voxtral can't clone it, the reference audio extracted from *Her* (2013) is valuable for **Chatterbox** (which ships the voice-clone encoder openly) and any future TTS that supports voice cloning.

**File**: `services/audio-pipeline/voice-references/samantha_movie_primary.wav`
- Duration: 34.7 s
- Sample rate: 24 kHz mono PCM
- Volume: normalized via `ffmpeg loudnorm I=-16 LRA=11 TP=-1.5` (9.5× boost from original)
- Source: Her (2013) video.mp4 at 5750s, clipped via VTT monologue clustering + F0 pitch filter
- User-confirmed 100% Samantha (2026-04-14)
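For re-running the normalization on a new clip, the settings above map onto a single ffmpeg invocation. A sketch (filenames are hypothetical; the 24 kHz mono PCM output matches the primary reference's format):

```python
import subprocess

def loudnorm_cmd(src: str, dst: str) -> list:
    """ffmpeg command matching the settings used for the primary reference:
    loudnorm I=-16 LRA=11 TP=-1.5, output as 24 kHz mono 16-bit PCM."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", "loudnorm=I=-16:LRA=11:TP=-1.5",
        "-ar", "24000", "-ac", "1", "-c:a", "pcm_s16le",
        dst,
    ]

# subprocess.run(loudnorm_cmd("raw_clip.wav", "normalized.wav"), check=True)
```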

**Alternates** (same dir):
- `samantha_movie_v2_keepwalking_38s.wav` — camera-directing scene
- `samantha_movie_v3_dearTheodore_35s.wav` — letter scene (== primary pre-normalization)
- `samantha_movie_v1_goodish_41s.wav` — user-rejected

**Extraction tools** (committed on branch):
- `scripts/extract_samantha_from_movie.py` — VTT monologue clustering
- `scripts/filter_samantha_by_pitch.py` — F0 pitch filter (>160 Hz = female). **Critical**: text heuristics alone ranked Theodore as top 4 of 5. Pitch analysis flipped the ranking decisively. 118–140 Hz = Theodore (Joaquin Phoenix), 180–210 Hz = Samantha (Scarlett Johansson).
- `scripts/rank_samantha_by_voice_similarity.py` — Resemblyzer d-vector ranking (inconclusive alone — 0.05 spread across all 5; pitch was the decisive signal)
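The pitch bands are separated widely enough that the decisive filter reduces to a single threshold. A sketch using the values measured above (the >160 Hz cut from `filter_samantha_by_pitch.py`):

```python
def classify_by_f0(median_f0_hz: float) -> str:
    """Speaker guess from median F0, per the session-106 measurements:
    118-140 Hz -> Theodore, 180-210 Hz -> Samantha; 160 Hz is the cut."""
    if median_f0_hz <= 0:
        raise ValueError("need a positive F0 estimate")
    return "samantha" if median_f0_hz > 160.0 else "theodore"
```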

---

## 7. Paths forward (ranked)

1. **KEEP CHATTERBOX (recommended)** — already deployed on Panda, clones Samantha zero-shot from reference audio, MIT-licensed, no restrictions. Swap `samantha_movie_primary.wav` into its voice-clone reference path for immediate Samantha unlock.
2. **Voxtral preset voices (fallback)** — if Chatterbox becomes unavailable or quality-insufficient, self-hosted Voxtral with preset **`casual_female`** (user-selected 2026-04-15) is ready to serve. RTF 0.79× on Titan.
3. **Mistral hosted API** — has voice cloning (encoder runs on their servers). ~$0.016/1K chars. CC-BY-NC license still restricts commercial use — fine for personal her-os, blocks any future commercialization.
4. **Wait for Mistral to release the encoder** — no timeline per HF Discussion #17. Could be months. Our scripts + paired-image recipe are ready for a re-run if that ships.
5. **Evaluate next TTS candidate** — `NEXT-SESSION-TTS-ALTERNATIVES.md` lists Chatterbox-Turbo, CosyVoice 2, F5-TTS, and others. Pick one that ships the voice-clone encoder openly.

**NOT VIABLE**:
- Fine-tuning Voxtral on Samantha audio — requires training a speaker embedding through the missing encoder.
- Community forks — all inherit the same missing-encoder gap.
- Reverse-engineering the encoder — community attempt (`MarvinRomson`) is research-only, not a working clone.

---

## 8. Related / dead-end references

- `docs/NEXT-SESSION-VOXTRAL-BENCHMARK.md` — original benchmark plan targeting mainline vllm. Superseded (mainline doesn't support voxtral_tts).
- `docs/NEXT-SESSION-VOXTRAL-VLLM-OMNI-SOURCE-BUILD.md` — source-build handoff. EXECUTED in session 106.
- `docs/BENCHMARK-VOXTRAL-TITAN-20260414-2300.json` — wrong initial verdict (preserved for audit trail, superseded by 20260415 JSON).
- `vllm-omni` issue tracker: no RFC or roadmap item for releasing the encoder weights.
- TrevorS/voxtral-mini-realtime-rs: targets STT/realtime variant, NOT TTS-2603.
- mudler/voxtral-tts.c (renamed to voxtral-cpp): targets STT/transcription primarily.

---

## 9. Primary-source citation list

1. https://huggingface.co/mistralai/Voxtral-4B-TTS-2603 — model card
2. https://huggingface.co/mistralai/Voxtral-4B-TTS-2603/discussions/17 — encoder-withheld confirmation
3. https://huggingface.co/mistralai/Voxtral-4B-TTS-2603/discussions/16 — community frustration thread
4. https://github.com/vllm-project/vllm-omni — runtime package
5. https://github.com/vllm-project/vllm-omni/blob/v0.18.0/vllm_omni/model_executor/models/registry.py — voxtral_tts class registration
6. https://github.com/vllm-project/vllm-omni/blob/main/docker/Dockerfile.ci — paired-base-tag pinning
7. https://github.com/vllm-project/vllm-omni/pull/2790 — routing fix (does NOT unblock clone)
8. https://arxiv.org/pdf/2603.25551 — Voxtral TTS paper (encoder described here but weights not released)
9. https://hub.docker.com/r/vllm/vllm-openai/tags — `v0.18.0-aarch64-cu130` base image (the unlock)
10. https://console.mistral.ai/build/audio/text-to-speech — Mistral AI Studio (hosted encoder + preset playground)
