# RESEARCH-COSYVOICE2 — verdict: deferred on Blackwell SM_121 incompatibility (2026-04-15)

**Session:** 107 (CosyVoice 2 fallback evaluation)
**Verdict:** `deferred_blackwell_sm121_pin_incompatibility`
**Action:** No install attempted. Survey complete. Chatterbox wins 2026-04 TTS survey.

---

## TL;DR

CosyVoice 2 is Apache-2.0 for both code and weights — **passes license gate** ✅.

But `requirements.txt` pins:
- `torch==2.3.1` — predates Blackwell SM_121 support (torch 2.5+ minimum for SM_121; torch 2.6+ recommended)
- `tensorrt-cu12==10.13.3.9` — CUDA 12 build; Blackwell SM_121 requires CUDA 13
- `onnxruntime-gpu==1.18.0` — aarch64+cu13 wheel status unclear; likely not available

Porting to Blackwell would require forking `requirements.txt`, upgrading torch to 2.6+ with cu13 wheels (NVIDIA NGC container path), replacing `tensorrt-cu12` with `tensorrt-cu13` (different API in places), and testing the `diffusers==0.29.0` + `transformers==4.51.3` + `x-transformers==2.11.24` + `deepspeed==0.15.1` stack for downstream breakage. Estimated: **1-3 hours best case, 6+ hours if TRT-LLM code paths break**.

Per session 106's paired-tag discipline and the F5-TTS-bench plan §196 ("Do NOT spend more than 1 hour on aarch64 friction"): not worth the spend when Chatterbox (MIT, deployed, working on Panda :8772) already covers Samantha voice cloning.

---

## Primary sources consulted

| Source | Key fact |
|---|---|
| https://github.com/FunAudioLLM/CosyVoice (README) | Apache-2.0 code; conda Python 3.10; `pip install -r requirements.txt`; Docker build instructions; 150ms TTFA streaming claim |
| https://github.com/FunAudioLLM/CosyVoice/blob/main/requirements.txt | torch==2.3.1, tensorrt-cu12==10.13.3.9, onnxruntime-gpu==1.18.0, deepspeed==0.15.1 |
| https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B | License tag: **Apache-2.0**; 9 languages + 18 Chinese dialects; zero-shot cross-lingual clone |
| https://github.com/vllm-project/vllm/issues/36821 | "No sm_121 (Blackwell) support on aarch64 — NVIDIA DGX Spark" — confirms ecosystem-wide cu12→cu13 wheel gap |
| https://forums.developer.nvidia.com/t/.../349881 (DGX Spark CUDA install pitfalls) | Pip defaults to +cpu torch on aarch64; must use NGC containers |
| https://forums.developer.nvidia.com/t/.../357663 (DGX Spark SM121 software support) | "Software support is severely lacking" — community consensus |

---

## Specific version-pin blockers

### torch==2.3.1 (2024-06 release)

- Blackwell SM_121 (consumer Blackwell, DGX Spark GB10) landed compute-capability registration in torch 2.5 nightly and stabilized in torch 2.6+
- DGX Spark-recommended path: `nvcr.io/nvidia/pytorch:25.12-py3`, which ships a custom torch 2.6.0+ build with SM_121 kernels
- Downgrading torch to 2.3.1 inside an NGC container is not safe — NVIDIA's torch build has ABI differences; community wheels fail with `OSError: undefined symbol`

### tensorrt-cu12==10.13.3.9

- TensorRT 10.x has `-cu12` and `-cu13` wheel variants
- Blackwell SM_121 requires `-cu13`, which is `tensorrt-cu13==10.x` on PyPI (available, aarch64 wheels present)
- CosyVoice 2 code imports `tensorrt` and calls TRT builder APIs; swap to cu13 likely works but needs smoke test
- If TRT-LLM optimization path is enabled (master HEAD feature per session 106 memory), the API may have drifted between 10.13.x cu12 and cu13 variants — needs source-level patch

### onnxruntime-gpu==1.18.0

- PyPI's last `onnxruntime-gpu` aarch64 wheel is 1.17.x (cu12); no 1.18+ aarch64 wheel has been published
- Would need to build ORT from source on Titan, or use NGC's bundled ORT, or drop ONNX codepath

### Cascade risks (not yet investigated)

- `deepspeed==0.15.1` — DeepSpeed's CUDA extension compilation with cu13 has known issues on 0.15.x; 0.16+ is safer
- `diffusers==0.29.0` — old; may have transformers version incompatibilities once transformers is bumped
- Total: 4-6 interlocking pins to re-resolve
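The interlocking pins above can be enumerated mechanically by auditing `requirements.txt` against the Blackwell minimums. A minimal sketch; the minimum-version table encodes this survey's working assumptions, not anything published upstream:

```python
import re

# Working assumptions for Blackwell SM_121 minimums (from this survey, not upstream docs)
MINIMUMS = {
    "torch": (2, 5),       # SM_121 registration landed in 2.5 nightly; 2.6+ recommended
    "deepspeed": (0, 16),  # 0.15.x has known cu13 extension-build issues
}
CUDA12_ONLY = {"tensorrt-cu12"}  # needs the -cu13 wheel variant instead

def parse_pin(line):
    """Parse 'name==x.y.z' into (name, version tuple); return None otherwise."""
    m = re.match(r"^([A-Za-z0-9_.-]+)==([\d.]+)", line.strip())
    if not m:
        return None
    name, ver = m.group(1).lower(), m.group(2)
    return name, tuple(int(p) for p in ver.split(".") if p.isdigit())

def audit(requirements_text):
    """Return human-readable Blackwell SM_121 blocker strings for a pin file."""
    blockers = []
    for line in requirements_text.splitlines():
        parsed = parse_pin(line)
        if not parsed:
            continue
        name, ver = parsed
        if name in CUDA12_ONLY:
            blockers.append(f"{name}: CUDA 12 build; Blackwell needs cu13")
        elif name in MINIMUMS and ver[:2] < MINIMUMS[name]:
            want = ".".join(map(str, MINIMUMS[name]))
            blockers.append(f"{name}: pinned below {want}+")
    return blockers

pins = "torch==2.3.1\ntensorrt-cu12==10.13.3.9\nonnxruntime-gpu==1.18.0\ndeepspeed==0.15.1\n"
print(audit(pins))
```

Running it against the four pins from the upstream `requirements.txt` flags torch, tensorrt-cu12, and deepspeed; the ORT gap is a wheel-availability problem rather than a version-floor one, so it needs the manual check above.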

---

## Why this isn't "just use the Docker image"

CosyVoice 2's own Docker instructions (`docker build -t cosyvoice:v1.0 .`) use the project's `Dockerfile`, which starts from a pinned CUDA 12 PyTorch base. Their CI does not target aarch64 or Blackwell. Building that Dockerfile on Titan would fail at `pip install -r requirements.txt` on exactly the pins listed above.

The right approach (if we ever do this) is:
1. Start from `nvcr.io/nvidia/pytorch:25.12-py3` (NGC Blackwell-compatible torch)
2. Fork `requirements.txt` → drop the torch/torchaudio pins, replace `tensorrt-cu12` with `tensorrt-cu13`, replace or build an ORT aarch64 wheel, bump deepspeed to 0.16
3. `pip install --no-deps` the fork, then add any missing dependencies manually
4. Smoke-test CosyVoice2-0.5B zero-shot clone with Samantha ref
5. Benchmark 10 utterances vs Chatterbox
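Steps 1-3 would look roughly like the following Dockerfile. This is an untested sketch: the NGC tag comes from this note, and `requirements-blackwell.txt` is a hypothetical name for the forked pin set:

```dockerfile
# Untested sketch: NGC Blackwell-compatible torch base + forked pins (steps 1-3)
FROM nvcr.io/nvidia/pytorch:25.12-py3

WORKDIR /opt/cosyvoice
COPY . .

# requirements-blackwell.txt is the hypothetical fork of requirements.txt:
#   - torch/torchaudio pins dropped (use the NGC build)
#   - tensorrt-cu12 -> tensorrt-cu13
#   - onnxruntime-gpu removed (build from source, or rely on NGC's bundled ORT)
#   - deepspeed bumped to 0.16.x
RUN pip install --no-deps -r requirements-blackwell.txt && \
    pip check  # surface any dependencies still missing after --no-deps
```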

Total effort: session-long (4-8 hours). **Deferred** until/unless Chatterbox quality proves insufficient for a specific use case.

---

## What CosyVoice 2 got right (for future re-evaluation)

- **Apache-2.0 weights** — clean commercial license, no Emilia contamination
- **150ms TTFA streaming claim** — well under the 500ms TTFA gate from our decision rule
- **9 languages + cross-lingual zero-shot clone** — broader language coverage than Chatterbox (English only)
- **Frontend model (0.5B)** — small, fits comfortably on Titan alongside Gemma
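If this re-evaluation ever happens, the 150ms TTFA claim should be measured against the 500ms gate rather than taken from the README. A generic harness sketch with a stubbed stream; no CosyVoice or Chatterbox API is assumed here:

```python
import time

TTFA_GATE_MS = 500  # gate from our decision rule

def measure_ttfa_ms(stream_chunks):
    """Time from request to first audio chunk, in milliseconds.

    `stream_chunks` is any iterable yielding audio chunks; a stand-in for a
    real streaming TTS call (no specific TTS API is assumed).
    """
    start = time.perf_counter()
    for _chunk in stream_chunks:
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream produced no audio")

def fake_stream(first_chunk_delay_s):
    """Stub synthesizer: sleeps, then yields one dummy PCM chunk."""
    time.sleep(first_chunk_delay_s)
    yield b"\x00" * 3200

ttfa = measure_ttfa_ms(fake_stream(0.05))  # 50 ms stub delay
print(f"TTFA: {ttfa:.0f} ms, gate {'PASS' if ttfa <= TTFA_GATE_MS else 'FAIL'}")
```

The same harness would wrap both engines for the 10-utterance benchmark in the porting plan, so the TTFA numbers are measured identically.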

If Rajesh later wants Indic-language TTS for Mom or others in the household, CosyVoice 2 is the candidate to port, despite the install friction.

---

## Session-plan questions answered

Reference: `docs/NEXT-SESSION-F5-TTS-BENCH.md` Phase 6 (CosyVoice 2 fallback).

1. **aarch64 / DGX Spark support?** — No upstream support. Would require NGC base + requirements fork.
2. **Install path?** — Upstream: conda + requirements.txt (fails on Blackwell). Viable path: NGC pytorch container + requirements fork.
3. **Zero-shot clone API?** — `--mode zero_shot` in CLI; Python API in `example.py` (not yet read).
4. **Streaming?** — Yes, 150ms TTFA claimed (README).
5. **License?** — Apache-2.0 code + weights. ✅

---

## Decision-log entry

**Date:** 2026-04-15
**Decision:** Do not port CosyVoice 2 to Blackwell SM_121 in this session.
**Rationale:** Chatterbox (MIT, deployed, working) meets the current Samantha voice-clone requirement. Porting cost (4-8h) has no payoff until a specific quality gap or language need is identified.
**Revisit trigger:** Either (a) Chatterbox Samantha quality drops below user-acceptable on production phone calls, OR (b) Indic-language TTS becomes required.
