# Next Session: F5-TTS → CosyVoice 2 TTS Benchmark for Samantha Clone

**Created:** 2026-04-15 (session 106, post-Voxtral-bench, post-PR-#1)
**Type:** TTS evaluation with fallback chain — F5-TTS primary, CosyVoice 2 fallback
**Expected duration:** 2–4 hours per candidate (much faster than Voxtral — infrastructure already built)
**Blocks on:** Nothing. Samantha reference, A/B scorer, paired-tag Docker recipe pattern all exist.
**Relates to:** PR #1 (https://github.com/myidentity/her-os/pull/1) — scripts + Samantha refs land there.

---

## What

Benchmark **F5-TTS v1.1.18** (SWivid, released 2026-03-24) for Samantha voice cloning on Titan DGX Spark aarch64. If quality is unsatisfactory (per the pre-committed decision rule in Phase 4), fall back to **CosyVoice 2** (Alibaba, master HEAD `ace7c47`, 2026-03-16).

**Primary goal**: Samantha voice identity fidelity. This was the whole point of the TTS survey — Voxtral failed (encoder weights withheld), Chatterbox works but we want to know if there's a better-quality option that ships the clone encoder openly.

**Secondary goals**: TTFA (streaming capable? <500 ms?), RTF, peak VRAM, aarch64 install friction.

---

## Why F5-TTS first

Per `docs/NEXT-SESSION-TTS-ALTERNATIVES.md` fresh-date snapshot (session 106):

| Pick | Latest | Fresh? | License | Key properties |
|---|---|---|---|---|
| **F5-TTS v1.1.18** | 2026-03-24 | ✅ 3 weeks | MIT | Flow-matching DiT, open voice-clone encoder, streaming support added in latest release |
| CosyVoice 2 | master 2026-03-16 | ✅ live | Apache-2.0 | Frontier quality per benchmarks, v3 TRT-LLM in flight |

F5-TTS wins on freshness (latest release), license (MIT, same terms as Chatterbox, no restrictions), and footprint (smaller and simpler than CosyVoice 2 judging by repo structure, with fewer dependencies).

**Chatterbox remains the baseline** to beat. If F5-TTS doesn't clearly exceed Chatterbox quality, we keep Chatterbox.

---

## What's already in place (from PR #1 / session 106)

- **Samantha voice reference**: `services/audio-pipeline/voice-references/samantha_movie_primary.wav` (34.7s, 24 kHz mono, user-confirmed, volume-normalized). Trim to 28s or 10s depending on F5-TTS's ref_audio cap.
- **Blind A/B scorer**: `scripts/tts_ab_score.py` — already works; just point it at two sample dirs.
- **Chatterbox baseline generator**: `scripts/generate_chatterbox_baseline.sh` — same 10 utterances used across all TTS evals.
- **Paired-tag Docker pattern** learned from Voxtral: always use the TTS project's own recommended base image, not mainline or nightly. Their `Dockerfile.ci` is the source of truth.
- **Gemma outage mode (pause-gemma)**: `docker stop vllm-gemma4` → run TTS → `docker start vllm-gemma4`. Verify post-flight p50 within 1.2× of baseline ~190 ms.
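
The reference clip may need trimming to whatever ref_audio cap F5-TTS enforces (an open question for this session). A minimal stdlib-only sketch of that trim step, assuming plain PCM WAV input; `trim_wav` is an illustrative name, not an existing script:

```python
# Sketch: truncate the Samantha reference WAV to a target duration,
# in case F5-TTS caps ref_audio length (check the docs for the limit).
import wave

def trim_wav(src: str, dst: str, max_seconds: float) -> float:
    """Copy src to dst, truncated to max_seconds. Returns the new duration."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        rate = r.getframerate()
        keep = min(r.getnframes(), int(max_seconds * rate))
        frames = r.readframes(keep)
    with wave.open(dst, "wb") as w:
        w.setparams(params)  # nframes is re-patched on close
        w.writeframes(frames)
    return keep / rate
```

For the real run, `ffmpeg -t <seconds>` on Titan would do the same job without extra code.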

---

## Ground in primary sources FIRST (do this before anything else)

Before writing any code:

1. Read F5-TTS README + latest release notes:
   - https://github.com/SWivid/F5-TTS
   - https://github.com/SWivid/F5-TTS/releases/tag/1.1.18
   - https://huggingface.co/SWivid/F5-TTS (model weights + card)
2. Check inference examples for aarch64 / CUDA 13 / SM_121 Blackwell compat. Note any GPU-capability gating (F5-TTS uses flow-matching — may hit NVRTC / complex tensor issues like session 105 did with Kokoro).
3. Find the recommended Docker base image or install path. If there's no official Docker image, check if pip install works or if it's a from-source build.
4. Check voice-clone API — does F5-TTS want a reference WAV + transcript, or just a reference WAV? The `ref_text` requirement matters for our Samantha clip (transcript is in the VTT).
5. Check output sample rate (F5-TTS typically outputs 24 kHz or 44.1 kHz — align with Chatterbox's output for A/B comparison).
6. Find PRs / issues mentioning aarch64, DGX Spark, Blackwell, or voice cloning from custom audio.
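
If the output rates do differ (step 5), both candidates' WAVs should be brought to one rate before the blind A/B, so raters aren't scoring resampling artifacts against each other. A naive linear-interpolation sketch of that alignment step (the real pipeline should use a proper resampler such as ffmpeg/soxr; this only illustrates the idea):

```python
# Sketch: linear-interpolation resample so Chatterbox and F5-TTS samples
# share one rate before A/B scoring. Illustrative only, not a production
# resampler (no anti-aliasing filter).
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate   # fractional position in the source
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a + (b - a) * frac)
    return out
```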

**Expect a different runtime than Voxtral.** F5-TTS is NOT a vLLM model — it's a standalone PyTorch inference package. Launch pattern is likely `f5-tts_infer-cli` or a FastAPI wrapper.

---

## Phased execution

### Phase 0 — pre-flight (10 min)
- SSH Titan, verify Gemma at :8003 healthy, capture p50 baseline (reuse session-106 method: 5 curl requests → median). Expect ~190 ms.
- Check disk space (F5-TTS weights are ~1–2 GB — tiny vs Voxtral 7.5 GB).
- Confirm Chatterbox on Panda at :8772 is reachable (baseline dependency).
- User decision: `--gemma-mode=parallel` vs `pause-gemma`. F5-TTS is smaller than Voxtral; `parallel` is likely viable.
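
The session-106 p50 method and the 1.2× post-flight gate can be sketched as below; the probe shown in the comment assumes the Gemma endpoint named in this plan:

```python
# Sketch: median-of-N latency probe (pre-flight baseline) plus the
# 1.2x post-flight drift gate used in Phases 0 and 5.
import statistics
import time

def p50_latency_ms(probe, n: int = 5) -> float:
    """probe() performs one request; returns the median wall time in ms."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        probe()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

def drift_ok(baseline_ms: float, postflight_ms: float, factor: float = 1.2) -> bool:
    return postflight_ms <= baseline_ms * factor

# Example probe against the Gemma endpoint (run on Titan):
# import urllib.request
# probe = lambda: urllib.request.urlopen("http://localhost:8003/v1/models", timeout=5).read()
```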

### Phase 1 — install F5-TTS on Titan (30–60 min)
- Follow F5-TTS's recommended install path from step 3 above. If official Docker image exists → pull. If pip install → fresh venv on Titan (NOT inside Gemma container).
- Check aarch64 wheel availability. If not pre-built, source install is expected given Blackwell. Watch for `flash-attn` / `torch.compile` issues.
- Download model weights via `hf download SWivid/F5-TTS` (or whatever the repo specifies).
- Smoke-test: generate one utterance with F5-TTS's default voice to confirm the install works before trying Samantha clone.

### Phase 2 — benchmark (30 min)
- Fork `scripts/benchmark_voxtral_titan.py` → `scripts/benchmark_f5_tts_titan.py`. Swap HTTP POST payload for F5-TTS's API shape (probably CLI or FastAPI wrapper). Reuse `UTTERANCES` list + `bench()` helper.
- Run 10 utterances with preset/default voice (warmup + timed).
- Run 10 utterances with Samantha ref_audio (`samantha_movie_primary.wav` or trimmed version if F5-TTS has a length cap).
- If F5-TTS supports streaming, also run streaming TTFA measurement.
- Pull WAVs to `docs/f5-tts-bench-samples-<timestamp>/` locally.
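
The forked benchmark script needs two metrics per utterance: RTF (synthesis wall time over audio duration) and, if streaming works, TTFA. A sketch of both helpers; names are illustrative and not the real `benchmark_voxtral_titan.py` API:

```python
# Sketch: per-utterance RTF and streaming TTFA measurement for the
# forked benchmark script. RTF < 1.0 means faster than real time.
import time
import wave

def rtf(wall_seconds: float, wav_path: str) -> float:
    """Real-time factor: synthesis wall time / duration of produced audio."""
    with wave.open(wav_path, "rb") as r:
        audio_seconds = r.getnframes() / r.getframerate()
    return wall_seconds / audio_seconds

def ttfa_ms(chunk_iter) -> float:
    """Time from call to first non-empty audio chunk, in ms."""
    t0 = time.perf_counter()
    for chunk in chunk_iter:
        if chunk:
            return (time.perf_counter() - t0) * 1000
    raise RuntimeError("stream produced no audio")
```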

### Phase 3 — Chatterbox baseline (parallel, 15 min)
- `bash scripts/generate_chatterbox_baseline.sh --out docs/f5-tts-bench-samples-<timestamp>/`
- Requires Chatterbox reachable on Panda :8772. If Panda is busy, defer this phase and run the baseline later (the scorer handles a deferred baseline).

### Phase 4 — blind A/B score (30 min, human in loop)
- `python3 scripts/tts_ab_score.py --samples-dir docs/f5-tts-bench-samples-<timestamp>/ --output docs/BENCHMARK-F5-TTS-TITAN-<timestamp>.json`
- User scores 10 pairs, 1–5 each. Pre-committed decision rule:
  - `f5_mean ≥ chatterbox_mean + 0.5 AND f5_ttfa_p50_ms ≤ 500` → `swap_candidate`
  - Else → `keep_chatterbox, go_to_cosyvoice2`

### Phase 5 — teardown + post-flight (10 min)
- Restart Gemma if paused, verify p50 drift <1.2×.
- Commit WAVs + verdict JSON on new branch `f5-tts-bench-<timestamp>`, open follow-up PR.

### Phase 6 — FALLBACK: CosyVoice 2 (only if F5-TTS fails gate)
- Repeat phases 1–5 with CosyVoice 2:
  - Repo: https://github.com/FunAudioLLM/CosyVoice (master, not v2 tag — v3 features are on master)
  - HF: `FunAudioLLM/CosyVoice2-0.5B`
  - Launch pattern: likely `cosyvoice_infer` or their server script
- Compare CosyVoice 2 vs Chatterbox (same decision rule).
- If CosyVoice 2 also fails → keep Chatterbox, write a "2026-04 TTS survey complete, Chatterbox wins" summary.

---

## Load-bearing lessons from session 106 (don't repeat these)

- **Paired-tag Docker discipline**: use the project's own `Dockerfile.ci` or README-recommended base image. Don't mix mainline + nightly + HEAD — API skew will burn hours.
- **Stage-configs YAML / launch configs often override CLI flags**. Inspect the config file before assuming `--gpu-memory-utilization X` works.
- **Always install from the requirements file**, not just `pip install -e . --no-deps`. Missing deps like `aenum`, `pydub`, etc. cost restart cycles.
- **Gemma post-flight p50 drifts upward with each restart cycle** (cudagraph cache invalidation). Minimize container restarts by getting the launch command right on the first try.
- **Voice-clone encoder status matters MORE than RTF / license / streaming**. Check primary source before benchmarking. If the model says "encoder withheld from open release" anywhere, stop and reconsider.
- **F0 pitch filter is the right way to validate voice gender** on any clone output. Our `scripts/filter_samantha_by_pitch.py` pattern can also audit the synthesized output to confirm the clone isn't drifting male.
- **User confirmation is the ONLY gate for voice identity**. Don't trust d-vector similarity alone (Resemblyzer clustered too tightly in session 106); human ears are the scorer.
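
The F0 audit from the pitch-filter lesson can be sketched stdlib-only with a crude autocorrelation pitch estimate; the ~140 Hz flag threshold here is an illustrative assumption, not the threshold from `filter_samantha_by_pitch.py`:

```python
# Sketch: crude autocorrelation F0 estimate to audit a cloned utterance,
# mirroring the filter_samantha_by_pitch.py idea. A median F0 well below
# ~140 Hz on voiced frames would flag male-drifting output.
def estimate_f0(samples: list[float], rate: int,
                fmin: float = 70.0, fmax: float = 350.0) -> float:
    """Return the best-correlating lag within [fmin, fmax], as Hz."""
    lo = int(rate / fmax)                 # shortest candidate period
    hi = int(rate / fmin)                 # longest candidate period
    best_lag, best_score = lo, float("-inf")
    for lag in range(lo, min(hi, len(samples) - 1) + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_score, best_lag = score, lag
    return rate / best_lag
```

The production version should use a proper tracker (e.g. pYIN) framewise; this is only to show the audit shape.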

---

## Decision rule (pre-committed)

```
f5_mean ≥ chatterbox_mean + 0.5
  AND f5_ttfa_p50_ms ≤ 500
→ swap_candidate (F5-TTS becomes new primary)

else
→ try CosyVoice 2

IF CosyVoice 2 also fails same rule
→ keep_chatterbox (survey complete, Chatterbox wins)
```
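
The rule above, expressed as a function the verdict-JSON step could call (field names are illustrative; the return strings match the success criteria below):

```python
# Sketch: the pre-committed decision rule as code, so the verdict JSON
# is produced mechanically rather than re-judged per run.
def verdict(f5_mean: float, chatterbox_mean: float, f5_ttfa_p50_ms: float,
            cosyvoice_ran: bool = False) -> str:
    if f5_mean >= chatterbox_mean + 0.5 and f5_ttfa_p50_ms <= 500:
        return "swap_candidate"
    if cosyvoice_ran:
        return "keep_chatterbox_survey_complete"
    return "keep_chatterbox_try_cosyvoice2"
```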

---

## Files to create

| File | Purpose |
|---|---|
| `scripts/benchmark_f5_tts_titan.py` | Fork of `benchmark_voxtral_titan.py` with F5-TTS API shape |
| `scripts/run-f5-tts-bench.sh` | Dispatcher (fork of `run-voxtral-bench.sh`, simpler — no stage-configs needed) |
| `docs/f5-tts-bench-samples-<timestamp>/` | Output WAVs + Chatterbox baselines |
| `docs/BENCHMARK-F5-TTS-TITAN-<timestamp>.json` | Verdict |
| `docs/RESEARCH-F5-TTS.md` | Consolidated research record (mirror of RESEARCH-VOXTRAL.md structure) |

If F5-TTS fails → add equivalent files for CosyVoice 2.

---

## Files to update on completion

- `docs/NEXT-SESSION-TTS-ALTERNATIVES.md` — F5-TTS entry with verdict + numbers
- `MEMORY.md` — session continuation with findings
- `docs/RESEARCH-VOXTRAL.md` §7 paths-forward — reflect the new state

---

## Start command

```bash
# 1. Checkout main (PR #1 should be merged by now)
cd ~/workplace/her/her-os
git checkout main && git pull

# 2. Create new branch
git checkout -b f5-tts-bench-$(date +%Y%m%d)

# 3. Read the F5-TTS primary sources FIRST (no code yet)
#    - https://github.com/SWivid/F5-TTS (latest README)
#    - https://github.com/SWivid/F5-TTS/releases/tag/1.1.18
#    - HuggingFace SWivid/F5-TTS model card

# 4. Start Phase 0 pre-flight checks
ssh titan "curl -sf http://localhost:8003/v1/models && nvidia-smi --query-gpu=memory.used,memory.free --format=csv && df -h ~"

# 5. Then proceed phase-by-phase per the plan above
```

---

## Success criteria

1. ✅ F5-TTS installed on Titan aarch64 (paired-tag or source-build, whichever primary source recommends)
2. ✅ 10/10 Samantha-cloned utterances synthesized without errors
3. ✅ Chatterbox baseline generated (or deferred if Panda busy)
4. ✅ Blind A/B score complete with user's 1–5 ratings
5. ✅ Verdict JSON committed: `swap_candidate` OR `keep_chatterbox_try_cosyvoice2` OR `keep_chatterbox_survey_complete`
6. ✅ Gemma post-flight drift ≤ 1.2×
7. ✅ RESEARCH-F5-TTS.md consolidates findings with primary-source citations
8. ✅ Follow-up PR opened (base `main`, head `f5-tts-bench-<timestamp>`)

---

## If F5-TTS fails aarch64 install entirely

Document in `RESEARCH-F5-TTS.md` with the exact error. Move directly to CosyVoice 2 (Phase 6). Do NOT spend more than 1 hour on aarch64 friction — per session 106's paired-tag lesson, if there's no aarch64-ready path, there's no path.

---

## Open questions for the next session to answer

1. Does F5-TTS v1.1.18 officially support aarch64? Any issues mentioning DGX Spark / Blackwell SM_121?
2. What's the reference-audio format + length cap?
3. Does F5-TTS require `ref_text` (transcript of ref audio)? If so, extract the Samantha transcript from the VTT manifest in `/tmp/samantha_female_candidates/manifest_pitch_filtered.json`.
4. Streaming support: is TTFA actually measurable, or is it batch-only like Voxtral's default mode?
5. License confirmed MIT — but does it include any commercial-use restrictions on the pretrained checkpoint? (Primary sources may differ from the license file.)
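
If question 3 resolves to "yes, `ref_text` is required," a minimal WebVTT cue-text extractor like the sketch below could pull the transcript for the reference clip. The manifest's actual structure must be checked first; this handles only a plain `.vtt` payload:

```python
# Sketch: strip WebVTT structure (header, cue ids, timings, NOTE blocks)
# and return the joined cue text, as a possible ref_text source.
def vtt_text(vtt: str) -> str:
    lines = []
    for line in vtt.splitlines():
        line = line.strip()
        if (not line or line == "WEBVTT" or "-->" in line
                or line.isdigit() or line.startswith("NOTE")):
            continue
        lines.append(line)
    return " ".join(lines)
```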
