# RESEARCH-F5-TTS — verdict: rejected on license (2026-04-15)

**Session:** 107 (F5-TTS bench attempt)
**Verdict:** `rejected_weight_license_cc_by_nc_4_0`
**Action:** No benchmark run. Pivot to CosyVoice 2 deferred (see RESEARCH-COSYVOICE2.md).

---

## TL;DR

F5-TTS v1.1.18 (SWivid, 2026-03-24) has a **split license**:

- **Code:** MIT ✅
- **Pretrained weights:** CC-BY-NC-4.0 ❌ (due to Emilia training dataset contamination)

CC-BY-NC-4.0 forbids commercial use of the weights. her-os targets personal + commercial distribution paths ([RESEARCH-DATA-STORAGE.md](./RESEARCH-DATA-STORAGE.md) "household, revenue" references). Chatterbox (MIT across the board) already satisfies the Samantha voice-clone requirement on Panda :8772. No reason to invest aarch64/Blackwell install hours in a model that can't be shipped.

---

## Primary sources consulted

| Source | Key fact |
|---|---|
| https://github.com/SWivid/F5-TTS (README) | "The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset." |
| https://huggingface.co/SWivid/F5-TTS (model card) | License tag: **CC-BY-NC-4.0** |
| https://github.com/SWivid/F5-TTS/releases/tag/1.1.18 | 2026-03-24 release; adds streaming/non-streaming inference split in `utils_infer.py`, removes ThreadPoolExecutor from batch inference |

---

## Session-plan questions answered

Reference: `docs/NEXT-SESSION-F5-TTS-BENCH.md` §201 "Open questions."

1. **aarch64 / DGX Spark / Blackwell SM_121 support?** — No mention in the README, issues, or PRs. With only CUDA 12.4/12.8 tested upstream and no cu13 wheels published, an aarch64 Blackwell install would require a source build plus PyTorch version patches. *Not investigated further; the license blocker resolved the session first.*
2. **Reference audio format + length cap?** — CLI shape: `f5-tts_infer-cli --ref_audio path.wav --ref_text "..." --gen_text "..."`. The length cap is not documented in the README excerpt (likely 10-30 s per community threads; not verified).
3. **`ref_text` required?** — Yes, CLI signature includes `--ref_text`.
4. **Streaming support?** — v1.1.18 explicitly separates streaming and non-streaming inference functions. Streaming IS supported.
5. **Commercial-use restriction?** — **YES**, CC-BY-NC-4.0 on weights. Blocker.
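
For the record, the CLI shape from point 2 written out as a one-shot invocation. This is a usage sketch only (never run, per the verdict); the flags are the ones quoted from the README, while the file names and text are illustrative:

```shell
# Hypothetical one-shot voice clone with F5-TTS (NOT run; license-blocked).
# --ref_text (transcript of the reference clip) is required per point 3.
f5-tts_infer-cli \
  --ref_audio ref.wav \
  --ref_text "transcript of the reference clip" \
  --gen_text "Hello, this is a voice-clone smoke test."
```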

---

## Why this is a hard block (not "just pick a different license tag")

CC-BY-NC is the weights' license because **training data (Emilia) is CC-BY-NC**. Re-licensing the weights would require retraining from scratch on a permissive corpus. No F5-TTS fork on HF or GitHub has done this. The gap is upstream at the dataset level, analogous to session 106's Voxtral situation (encoder weights withheld — unfixable in user code).

---

## What F5-TTS confirmed about the broader TTS survey

- **Flow-matching DiT + streaming inference is now mainstream** (Kokoro, F5-TTS, Chatterbox all support it).
- **License hygiene is the bottleneck, not acoustics.** State-of-the-art models that clone from a 10-30 s reference all perform well; the question is which ones are deployable.
- **Code-MIT-but-weights-CC-BY-NC is a common trap.** Always check the HuggingFace model card license tag, not just the GitHub LICENSE file.
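
The model-card check in the last bullet is automatable. A minimal sketch, assuming Hugging Face's `license:<id>` tag convention; in practice the tag list comes from `huggingface_hub.HfApi().model_info(repo_id).tags`, and the helper names here are ours, not part of any library:

```python
def license_from_tags(tags):
    """Extract the license id from a Hugging Face model-card tag list.

    HF model cards expose the license as a 'license:<id>' tag;
    returns None when no such tag is present.
    """
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None


def weights_shippable(tags):
    # Block anything carrying a NonCommercial clause, regardless of
    # what the GitHub LICENSE file says about the *code*.
    lic = license_from_tags(tags)
    return lic is not None and "nc" not in lic.split("-")


# Tags as shown on the SWivid/F5-TTS model card at review time.
f5_tags = ["text-to-speech", "license:cc-by-nc-4.0"]
print(license_from_tags(f5_tags))          # cc-by-nc-4.0
print(weights_shippable(f5_tags))          # False
print(weights_shippable(["license:mit"]))  # True
```

Wiring this into a pre-benchmark gate would have caught the CC-BY-NC tag before any install time was spent.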

---

## What remains load-bearing from the Samantha reference work

- `services/audio-pipeline/voice-references/samantha_movie_primary.wav` (34.7s, 24 kHz mono, user-confirmed) — Chatterbox-compatible, already in use.
- `scripts/filter_samantha_by_pitch.py` — F0 pitch-gender validator, reusable on any TTS output to flag voice drift.
- `scripts/tts_ab_score.py`, `scripts/generate_chatterbox_baseline.sh` — A/B harness, preserved for any future candidate (Chatterbox-Turbo, new permissive-license model).
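
The pitch-gender validator itself is not reproduced in this note; a minimal stand-in in the same spirit, using a pure-NumPy autocorrelation F0 estimate and an illustrative female-range band (the real script's method and thresholds may differ):

```python
import numpy as np


def estimate_f0(x, sr, fmin=80.0, fmax=400.0):
    """Crude F0 estimate: autocorrelation peak over a plausible speech range."""
    x = x - x.mean()
    lags = np.arange(int(sr / fmax), int(sr / fmin) + 1)
    # Dot-product autocorrelation at candidate lags only (cheap per lag).
    ac = np.array([np.dot(x[:-lag], x[lag:]) for lag in lags])
    return sr / lags[np.argmax(ac)]


def flags_voice_drift(f0, lo=165.0, hi=255.0):
    # Illustrative adult-female F0 band; outside it counts as drift.
    return not (lo <= f0 <= hi)


# Synthetic sanity check: a 220 Hz tone should land inside the band.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
f0 = estimate_f0(tone, sr)
print(round(f0, 1), flags_voice_drift(f0))
```

On real TTS output this would run per-utterance, flagging any clip whose estimated F0 falls outside the reference speaker's band.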

---

## Next candidates if Chatterbox quality proves insufficient

Per `docs/NEXT-SESSION-TTS-ALTERNATIVES.md`:

- **Chatterbox-Turbo** (if/when released) — same license, same clone encoder, lower latency claim
- **Any future MIT/Apache-2.0 flow-matching TTS** — monitor SWivid, Resemble, ElevenLabs open-weight releases
- **Source-permissive retrain of F5-TTS** — unlikely, but the MIT code + permissive dataset (e.g., LibriSpeech) could produce a shippable fork

Do **not** pursue F5-TTS on CC-BY-NC weights regardless of quality.
