# Research — Chatterbox TTS on CPU (Panda) to Free VRAM for NVFP4 E4B

**Date:** 2026-04-14 (session 101)
**Status:** Research complete — benchmark not yet run
**Driver:** If Chatterbox can run on CPU with acceptable latency, freeing 3.7 GB VRAM unlocks NVFP4 E4B for nav (~10 GB), which is architecturally superior to Q4_K_M GGUF on Blackwell's FP4 tensor cores.

## ⚠️ Session 101 verified facts (on-machine check)

**We are running the ORIGINAL Chatterbox, NOT Chatterbox-Turbo.**

- HF cache path: `~/.cache/huggingface/hub/models--ResembleAI--chatterbox/` (3.0 GB on disk)
- Model files: `t3_cfg.safetensors`, `s3gen.safetensors`, `ve.safetensors`, `conds.pt`, `tokenizer.json`
- Class used: `ChatterboxTTS` (NOT `ChatterboxTurboTTS`)
- Live VRAM (verified via `nvidia-smi --query-compute-apps`): **3,730 MiB**
- Parameters: 500M (per Resemble's GitHub readme)
- Audio decoder: **10-step** (not the 1-step Turbo variant)

**Panda hardware (verified this session):**
- CPU: **AMD Ryzen 9 9900X3D**, 12C/24T, Zen 5, 3D V-Cache
- Critical flags: `avx512_vnni`, `avx_vnni`, `avx512_bf16`, `avx512vl`, `avx512dq`, `avx512f`
- GPU: RTX 5070 Ti, 16,303 MiB total
- **Current free VRAM: only 2,193 MiB** (tighter than MEMORY.md suggested)

**Current GPU process list:**
| PID | Process | VRAM | Note |
|-----|---------|-----:|------|
| 985615 | `python3` | 5,286 MiB | phone_call.py (Whisper+IndicConformer+Kokoro) |
| 985874 | `python3` | 3,730 MiB | chatterbox_server.py |
| 1296045 | `/app/llama-server` | 4,420 MiB | E2B nav VLM (more than bare model — includes CUDA context + KV) |
| **Total used** | | **13,436 MiB** | |
| **Free** | | **2,193 MiB** | process rows + free sum to 15,629, not 16,303: ~674 MiB of driver/context overhead is attributed to no PID |

**Correction to VRAM planning:** MEMORY.md quoted "E2B Q4_K_M = 3.2 GB", which is the bare-weights number; live VRAM is ~4.4 GB once CUDA context and KV-cache allocation are included. All VRAM arithmetic below uses LIVE numbers, not bare weights.

---

## The strategic thesis (updated with verified live numbers)

Today Panda's GPU VRAM is almost full:

| Service | LIVE VRAM | Can it move to CPU? |
|---------|----------:|---------------------|
| phone_call.py (Whisper + IndicConformer + Kokoro) | 5,286 MiB | Whisper: yes (slow). Full pipeline: risky. |
| **Chatterbox TTS** | **3,730 MiB** | **MAYBE — this research answers it** |
| **Total voice pipeline** | 9,016 MiB | |
| Nav VLM slot (current E2B Q4_K_M live) | 4,420 MiB | — |
| **Total used** | 13,436 MiB | |
| **Free headroom** | **2,193 MiB** | tight |

If Chatterbox moves to CPU cleanly:

| Service | LIVE VRAM | |
|---------|----------:|---|
| phone_call.py | 5,286 MiB | (unchanged) |
| Chatterbox | **0 MiB** | **3.7 GB freed** |
| Replace E2B slot with NVFP4 E4B | ~10,200 MiB | (swap, not add — E2B must stop) |
| **New total** | ~15,486 MiB | |
| **New free** | ~800 MiB | very tight (and the ~674 MiB of unattributed overhead in the process table would cut it further); needs headroom test |

**Two-step change warning:** to reach NVFP4 E4B, Chatterbox moves to CPU AND E2B nav VLM is replaced with E4B NVFP4. Total headroom after both swaps is razor-thin. Validate with live load test (voice + nav concurrent) before committing.

Alternative lighter swap if NVFP4 doesn't fit:
- Chatterbox → CPU (free 3.7 GB)
- Keep E2B in place (4.4 GB)
- Free headroom: 5.9 GB — comfortable, but nav VLM stays at E2B quality
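The headroom arithmetic for both scenarios can be sanity-checked in a few lines (numbers taken from the live tables above; the ~10,200 MiB E4B figure is an estimate, not a measurement):

```python
TOTAL = 16_303                       # RTX 5070 Ti, MiB
phone, chatterbox, e2b = 5_286, 3_730, 4_420
measured_free = 2_193

# nvidia-smi's free reading implies VRAM not attributed to any PID
overhead = TOTAL - (phone + chatterbox + e2b) - measured_free
print(overhead)                      # 674 MiB (display/driver/CUDA context)

# Scenario A: Chatterbox -> CPU, E2B slot replaced by NVFP4 E4B (~10,200 MiB est.)
free_e4b = TOTAL - overhead - (phone + 10_200)
print(free_e4b)                      # 143 MiB once overhead is counted

# Scenario B: Chatterbox -> CPU, keep E2B in place
free_lighter = measured_free + chatterbox
print(free_lighter)                  # 5,923 MiB, i.e. ~5.9 GB
```

If the overhead-adjusted number is right, Scenario A has essentially no margin, which is exactly why the concurrent voice + nav load test is mandatory before committing.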

---

## Original Chatterbox vs Chatterbox-Turbo — what we lose and gain

**Full feature comparison (verified via DeepWiki source + Resemble AI docs):**

| Aspect | Original Chatterbox (deployed) | Chatterbox-Turbo (candidate) |
|--------|-------------------------------|------------------------------|
| Parameters | 500M | 350M |
| Audio decoder | 10-step diffusion | **1-step (distilled)** |
| On-disk size | 3.0 GB | ~1.5 GB (estimated) |
| Live VRAM | 3,730 MiB (measured) | Not measured |
| Class | `ChatterboxTTS` | `ChatterboxTurboTTS` |
| ONNX variant | **Not published** | [chatterbox-turbo-ONNX](https://huggingface.co/ResembleAI/chatterbox-turbo-ONNX) |
| ONNX INT8 | Not available | Available (q8) |
| CPU feasibility | Poor | **Primary path (INT8 ONNX)** |
| Latency | ~500ms TTFB (4090 ref) | **<200ms TTFB** |
| Voice cloning (zero-shot) | ✅ (5-10s reference) | ✅ (10s reference recommended) |
| Perth watermarking | ✅ | ✅ |
| **Multilingual** | **✅ 23+ languages** | ❌ English only |
| **`cfg_weight` (pace control)** | **✅ (we use 0.3 for slow Samantha)** | ❌ **REMOVED** |
| **`exaggeration` (emotion control)** | **✅ (we use 0.3-0.5)** | ❌ **REMOVED** |
| `temperature` | ✅ | ✅ |
| `top_p`, `top_k`, `min_p`, `repetition_penalty` | — | ✅ (new in Turbo) |
| **Paralinguistic tags** | **❌ Not native** | **✅ 9 tags: `[laugh]`, `[chuckle]`, `[cough]`, `[sigh]`, `[gasp]`, +4 more** |

### What we'd LOSE moving to Turbo

**1. `cfg_weight` parameter** — exclusive to original. Our current server code explicitly documents:
```python
# services/annie-voice/chatterbox_server.py:66-69 (docstring excerpt)
# cfg_weight=0.3    (pace: lower = slower; docs recommend 0.3 for fast refs)
# exaggeration=0.3  (emotion: lower = calmer, less rushed)
```
This parameter is how we slow Samantha down for the "Her"-inspired calm cadence. Turbo has no equivalent knob — pace control must come from the reference audio itself or be accepted at Turbo defaults.

**2. `exaggeration` parameter** — also exclusive to original. Used in `bot.py` and `tts_backends.py` (`CHATTERBOX_EXAGGERATION` env var, default 0.5). Controls emotion expressiveness. No Turbo equivalent.

**3. Multilingual support** — Turbo is English-only. **But we already committed to English-only** (MEMORY.md: "IndicF5 RETIRED — Mom speaks English, Chatterbox+Kokoro cover all TTS"). So this loss is nominal for her-os.

**4. Some voice quality** — distillation trades fidelity for speed. Resemble doesn't publish MOS/speaker-similarity numbers, but they describe Turbo as "trades some quality for speed" while retaining "high-fidelity audio output."

**5. Potential Samantha voice identity drift** — distilled models often have subtle changes in voice characteristics that voice-cloning recipients may or may not notice. Requires A/B listening test.

**6. Reference audio requirement bumped** — Turbo recommends 10 seconds (original works with 5-10s). Our current `samantha_evolving.wav` should already be long enough, but worth verifying.

### What we'd GAIN moving to Turbo

**1. Native paralinguistic tags** — This is a **new capability**, not just a speedup. Original Chatterbox doesn't natively support `[laugh]`, `[chuckle]`, `[cough]`, `[sigh]`, `[gasp]` etc. Turbo does. For Annie's phone calls with Mom, this could be genuinely expressive — "I had such a long day [sigh]" or "That's hilarious [chuckle]".

**2. Latency** — sub-200ms TTFB vs ~500ms on original. Perceptually indistinguishable from "instant" per conversational thresholds.

**3. CPU viability** — unlocks the entire NVFP4 E4B strategic path.

**4. ONNX ecosystem** — INT8/INT4 quantization, graph optimization, deployment to embedded devices if ever needed.

**5. Smaller model** — ~1.5 GB vs 3.0 GB on disk (faster cold starts, smaller backup/restore).

**6. New sampling parameters** — `top_p`, `top_k`, `min_p`, `repetition_penalty` give LLM-style control over generation diversity. Not a direct substitute for cfg_weight/exaggeration but provides different knobs.

### Net assessment

The tradeoff is asymmetric:
- **Biggest loss:** Fine-grained cfg_weight/exaggeration control for pace/emotion. This is how we got the "Her"-inspired Samantha cadence. **Mitigation:** re-record `samantha_evolving.wav` with the target cadence/emotion baked in, since Turbo voice-clones from reference characteristics.
- **Biggest gain:** Paralinguistic tags — a new expressiveness capability original Chatterbox doesn't have. Plus the entire NVFP4 path.

**Decision risk:** whether Turbo's voice cloning can faithfully preserve Samantha's identity without the cfg_weight/exaggeration knobs. This is empirical — a 5-minute A/B listening test will tell us.

---

**Consequence:** "Move Chatterbox to CPU" is actually a two-step change:
1. **Switch model family** from original Chatterbox → Chatterbox-Turbo (quality/voice retuning needed — see losses above)
2. **Move inference from GPU PyTorch → CPU ONNX INT8** (latency benchmark required)

Each step is a separate risk — need to test them independently.

### GPU baseline (reference numbers from Resemble / community)

| Metric | Value | Hardware |
|--------|------:|----------|
| RTF (streaming) | 0.499 | RTX 4090 |
| TTFB | ~472 ms | RTX 4090 |
| Production inference latency | <200 ms | Commercial setup |
| VRAM (PyTorch BF16) | **3.6 GB** | Panda RTX 5070 Ti (our measurement) |

RTF 0.499 means "generates 1 sec of audio in 0.5 sec of wall-clock" — 2× real-time on 4090.
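RTF and TTFB are the two numbers every benchmark below reports; a minimal sketch of how RTF falls out of timestamps (the function name is ours, not from any Chatterbox API):

```python
def rtf(wall_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: below 1.0 means faster than real time."""
    return wall_seconds / audio_seconds

# 4090 reference point: 0.5 s of wall-clock per 1.0 s of audio
assert rtf(0.5, 1.0) == 0.5   # about 2x real time
# rtf > 1.0 means the speaker outruns the synthesizer: unusable for live calls
```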

---

## Current Panda deployment

**File:** `services/annie-voice/chatterbox_server.py` (verified this session)

```python
_model = ChatterboxTTS.from_pretrained(device="cuda")   # hard-coded CUDA
...
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    wav = model.generate(text, audio_prompt_path=ref_path, ...)
```

- Port `8772`, FastAPI, `asyncio.Semaphore(1)` serializes GPU access
- BF16 autocast (FP16 would overflow Vocos complex ops — documented in architecture doc)
- Eager preload at startup to avoid OOM under concurrent first-request load
- Reference audio allowlist from `metadata.json`
- Token auth (`X-Internal-Token`)

**Observed tuning (from docstring):**
- `cfg_weight=0.3` — slower pace (Samantha cadence)
- `exaggeration=0.3` — calmer emotion
- `temperature=0.6` — consistent prosody
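For the planned `chatterbox_cpu_server.py` A/B fork, the hard-coded `device="cuda"` is the first thing to parametrize. A sketch, assuming a `CHATTERBOX_DEVICE` env var that is our invention, not an existing knob:

```python
import os

# Hypothetical override so one server file can drive both A/B deployments
DEVICE = os.environ.get("CHATTERBOX_DEVICE", "cuda")
if DEVICE not in ("cuda", "cpu"):
    raise ValueError(f"unsupported device: {DEVICE}")

# _model = ChatterboxTTS.from_pretrained(device=DEVICE)
# On CPU, the torch.autocast("cuda", ...) context must also be dropped or
# switched to torch.autocast("cpu", dtype=torch.bfloat16); Zen 5 has avx512_bf16.
```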

---

## ONNX variant — the real CPU path

**[ResembleAI/chatterbox-turbo-ONNX](https://huggingface.co/ResembleAI/chatterbox-turbo-ONNX)** ships the same 350M model in ONNX format with multiple quantizations:

| Precision | Use case |
|-----------|----------|
| **fp32** | Reference — highest quality, slowest CPU |
| **fp16** | Half precision — GPU-friendly, moderate CPU speedup |
| **q8 (INT8)** | **Primary CPU target** — INT8 integer math hits AVX2/AVX-512 VNNI |
| **q4** | Aggressive — may degrade voice quality noticeably |
| **q4f16** | Mixed — INT4 weights with FP16 activations |

Dependencies: `onnxruntime`, `transformers`, `librosa`, `soundfile`. Optional `perth` for watermark detection.

### Why ONNX INT8 should win on modern AMD CPUs

- **AVX2 (all Zen)** — baseline 8-way INT8 SIMD
- **AVX-512 VNNI (Zen 4+, e.g. Ryzen 7000/9000)** — native INT8 dot-product acceleration, ~2-3× AVX2 throughput
- **ONNX Runtime graph optimization** — kernel fusion, constant folding, memory planning
- **One-step decoder** — Chatterbox-Turbo's key speedup stays intact on CPU

### Why PyTorch CPU would lose

- Eager-mode kernel dispatch overhead per operator
- No INT8 path in standard PyTorch TTS inference code
- No graph-level fusion (unless you `torch.compile`, which has its own issues on CPU)

**Expected ordering on AMD CPU:**
```
ONNX INT8 (AVX-512 VNNI) >>> ONNX INT8 (AVX2) >> ONNX FP16 > ONNX FP32 > PyTorch CPU
```

---

## Phone call latency budget

The TTS output feeds Annie's phone conversations with Mom. Human conversational tolerance:

| Latency | Perception |
|---------|------------|
| <200 ms TTFB | Indistinguishable from "instant" |
| 200-500 ms | Natural pause, fine |
| 500-1000 ms | Noticeable lag, usable for thoughtful replies |
| 1-2 sec | Awkward, feels robotic |
| >2 sec | Broken experience |

**Target for CPU Chatterbox:** TTFB < 500 ms AND RTF < 1.0.
If we can't hit 500 ms TTFB, CPU is not viable for live phone; only for pre-synthesized thinking-sound cues.

---

## Benchmark plan (what to actually run)

### Panda CPU specs — VERIFIED (session 101)

Already checked. Result: **best-case scenario for ONNX INT8 inference.**

- **CPU:** AMD Ryzen 9 9900X3D — Zen 5, 12 cores / 24 threads
- **3D V-Cache:** stacked L3 cache (large working set stays on-die, reduces memory-bound stall)
- **Key flags present:** `avx512_vnni` ✅, `avx_vnni` ✅, `avx512_bf16` ✅, `avx512vl`, `avx512dq`, `avx512f`, `avx512_vbmi2`, `avx512_vp2intersect`
- **Implication:** ONNX Runtime INT8 will hit VNNI-accelerated INT8 dot product (~2-3× AVX2). BF16 inference also supported natively.

Community ONNX Runtime benchmarks put Zen 5 + VNNI on 12 cores roughly in the MacBook M2/M3 tier for transformer inference, and around 5× the Core i7 / Ryzen 5 class CPUs cited in typical Chatterbox CPU benchmarks. Treat both multipliers as ballpark, not measurements.

**Practical consequence:** If Chatterbox-Turbo CPU doesn't hit <500 ms TTFB on Ryzen 9 9900X3D with INT8 + 24 threads, it likely can't hit it on any consumer CPU. This benchmark is the decisive test.
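The flag check is repeatable from Python if the box ever changes (Linux-only; reads `/proc/cpuinfo`):

```python
# Re-verify the AVX-512 flags that make the INT8 path fast on this CPU
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()

REQUIRED = ["avx512_vnni", "avx512f", "avx512_bf16"]
present = {flag: flag in cpuinfo for flag in REQUIRED}
print(present)   # all True on the 9900X3D per the session-101 check
```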

### Phase 1 — install ONNX variant alongside current GPU server

```bash
# On Panda, in a new virtualenv to avoid disturbing the running chatterbox_server.py
ssh panda 'cd ~ && python3 -m venv chatterbox-cpu-venv && \
  source chatterbox-cpu-venv/bin/activate && \
  pip install onnxruntime transformers librosa soundfile numpy'

# Download the ONNX variants (start with q8)
ssh panda 'source chatterbox-cpu-venv/bin/activate && \
  huggingface-cli download ResembleAI/chatterbox-turbo-ONNX \
    --include "*q8*" "*config*" "*tokenizer*" \
    --local-dir ~/chatterbox-onnx-q8'
```

### Phase 2 — CPU latency benchmark script

Write `scripts/benchmark_chatterbox_cpu.py` (stdlib + onnxruntime only):

- 5 reference utterances covering length spectrum:
  - Short: `"Hi Ma, how are you?"` (~1.5 sec)
  - Medium: `"I'm doing well, I just had breakfast and I'm about to head to work."` (~4 sec)
  - Long: ~20-word sentence (~8 sec)
  - With paralinguistic: `"That's so funny [chuckle]"` (~2 sec)
  - Number-heavy: `"The meeting is at 3:45 PM on April 17th."` (~3 sec)
- 10 runs per utterance after warmup
- For each run, capture:
  - TTFB (first audio sample produced)
  - Total wall-clock
  - RTF = wall-clock / audio duration
  - Peak RAM (via `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`; `tracemalloc` only sees Python-heap allocations, not ONNX Runtime's native buffers)
- Thread-count sweep: 1, 2, 4, 8, all-cores (ONNX Runtime `intra_op_num_threads`)
- Test each precision: q8 first, then fp16, then fp32 if q8 is promising
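A sketch of the measurement core for `scripts/benchmark_chatterbox_cpu.py`. Here `synthesize` is a stand-in for whatever interface the ONNX pipeline ends up exposing; it is assumed to yield audio chunks so TTFB is observable, and the 24 kHz output rate is an assumption to confirm against the real model:

```python
import resource
import statistics
import time

def bench(synthesize, text, runs=10, warmup=2, sample_rate=24_000):
    """Time synthesize(text), a hypothetical generator of audio-sample chunks."""
    results = []
    for i in range(runs + warmup):
        t0 = time.perf_counter()
        ttfb = None
        n_samples = 0
        for chunk in synthesize(text):
            if ttfb is None:
                ttfb = time.perf_counter() - t0   # first audio out
            n_samples += len(chunk)
        wall = time.perf_counter() - t0
        if i >= warmup:                           # discard warmup runs
            audio_s = n_samples / sample_rate
            results.append((ttfb, wall, wall / audio_s))
    # ru_maxrss is KiB on Linux (bytes on macOS)
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    ttfbs, walls, rtfs = zip(*results)
    return {"ttfb_ms": statistics.median(ttfbs) * 1e3,
            "rtf": statistics.median(rtfs),
            "peak_ram_mb": peak_mb}
```

Medians are deliberate: a single slow run from scheduler noise should not flip a TTFB threshold decision.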

### Phase 3 — A/B quality comparison

- Synthesize the same 5 utterances on GPU (current setup) and CPU ONNX
- Save to WAV files, subjective listening test
- Key quality checks:
  - Samantha voice identity preserved?
  - Natural prosody on question intonation?
  - Paralinguistic tags render correctly?
  - No artifacts / robotic sections?

### Phase 4 — decision matrix

| CPU result | Action |
|------------|--------|
| TTFB < 300 ms AND RTF < 0.5 AND quality preserved | **Swap to CPU** — unlocks NVFP4 E4B |
| TTFB 300-500 ms AND RTF < 1.0 AND quality OK | **Swap to CPU** — still enables NVFP4, minor UX cost |
| TTFB 500-1000 ms AND RTF < 1.0 | **Partial move** — use CPU only during active nav sessions, GPU otherwise |
| RTF > 1.0 (slower than realtime) | **Keep on GPU** — CPU is infeasible for live phone |
| Quality degradation (q8) | Try fp16; if still bad, keep on GPU |
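The matrix translates mechanically into code; a sketch for the benchmark script's summary step, with thresholds copied from the table (`quality_ok` comes from the Phase 3 listening test, so it is a human input):

```python
def decide(ttfb_ms: float, rtf: float, quality_ok: bool) -> str:
    """Map benchmark results to the Phase 4 decision matrix."""
    if rtf > 1.0:
        return "keep on GPU (CPU infeasible for live phone)"
    if not quality_ok:
        return "try fp16; if still bad, keep on GPU"
    if ttfb_ms < 300 and rtf < 0.5:
        return "swap to CPU: unlocks NVFP4 E4B"
    if ttfb_ms < 500:
        return "swap to CPU: minor UX cost"
    if ttfb_ms < 1000:
        return "partial move: CPU only during active nav sessions"
    return "keep on GPU (CPU infeasible for live phone)"
```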

---

## Risks & known concerns

1. **ONNX Chatterbox-Turbo port may lag PyTorch features.** Check if the ONNX variant supports voice cloning (`audio_prompt_path`) or only pre-baked voices. If pre-baked only, need to re-bake Samantha reference to the ONNX-compatible format.

2. **Vocos vocoder complexity.** The Vocos vocoder uses complex-number operations that need FP32/BF16 for stability. ONNX INT8 quantization may not apply to the vocoder stage — confirm by inspecting the exported graph.

3. **Thread contention with other CPU work.** Panda CPU also runs: FastAPI servers, llama-server mmproj image decode, WhatsApp agent, context-engine, dashboard. Benchmark should include "realistic CPU load" variant, not just isolated runs.

4. **Model cold-load time.** GPU Chatterbox is eager-loaded at server start. ONNX CPU cold-load may be slower — if it's measured in seconds, keep the model resident just like the GPU version.

5. **Memory footprint.** 350M params × INT8 = ~350 MB model + ~1-2 GB activations/cache. Should be fine on Panda RAM but measure.

6. **Paralinguistic tags.** The feature matters for Mom calls ("[laugh]", "[chuckle]"). Confirm ONNX variant preserves this.

---

## Files to create

1. `scripts/benchmark_chatterbox_cpu.py` — new benchmark script
2. `services/annie-voice/chatterbox_cpu_server.py` — parallel CPU server for A/B testing (port 8773)
3. Update `docs/RESOURCE-REGISTRY.md` — if we commit to CPU, update VRAM table and add CPU row

## Files NOT to modify yet

- `services/annie-voice/chatterbox_server.py` — leave running on GPU for rollback safety
- `start.sh` — leave the GPU server in the startup path until CPU is proven
- `docker-compose.yml` — no container changes until CPU verified

---

## Decision tree for next session

```
1. Verify Panda CPU specs → DONE (session 101: Ryzen 9 9900X3D, VNNI ✅)
          │
2. Run ONNX q8 CPU benchmark on Panda
          │
   ┌──────┴──────┐
   │             │
TTFB<500ms?   TTFB>500ms?
   │             │
   ▼             ▼
3a. A/B quality test     3b. Try fp16
    │                        │
    ├─quality good→          ├─fast enough→ swap
    │   swap to CPU,         │
    │   deploy NVFP4 E4B     └─still slow→
    │                             keep on GPU,
    └─quality bad→                E4B stuck at Q4_K_M
        try fp16 or
        stay on GPU
```

---

## Cross-references

- `services/annie-voice/chatterbox_server.py` — current GPU implementation (what we're trying to move)
- `docs/ARCHITECTURE-PANDA-VOICE-PIPELINE.md` — why BF16 over FP16 (Vocos complex ops)
- `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md` — the E4B research this unblocks
- `docs/RESOURCE-REGISTRY.md` — Panda VRAM budget (mandatory update if CPU swap ships)
- `docs/NEXT-SESSION-CHATTERBOX-VOICE-TUNING.md` — prior tuning work (cfg_weight, exaggeration defaults)
- `services/annie-voice/chatterbox_server.py:54-91` — `_synthesize_gpu()` — the function to port
- `services/annie-voice/generate_thinking_cues.py` — pre-synthesized TTS cues (reference for offline-CPU TTS patterns)

---

## Sources

- [ResembleAI/chatterbox-turbo-ONNX](https://huggingface.co/ResembleAI/chatterbox-turbo-ONNX) — ONNX variants (fp32/fp16/q8/q4/q4f16)
- [resemble-ai/chatterbox (GitHub)](https://github.com/resemble-ai/chatterbox) — official repo
- [Chatterbox-Turbo model card](https://www.resemble.ai/chatterbox-turbo/) — architecture, 350M params, one-step decoder
- [devnen/Chatterbox-TTS-Server](https://github.com/devnen/Chatterbox-TTS-Server) — third-party server explicitly supporting CUDA / ROCm / **CPU** modes
- [chatterbox-streaming fork](https://github.com/davidbrowne17/chatterbox-streaming) — streaming + fine-tuning (RTF 0.499 on 4090 baseline)
- [BentoML TTS comparison 2026](https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models) — positions Chatterbox as GPU-dependent vs Kokoro/Piper CPU-friendly
- [Picovoice TTS latency benchmark](https://github.com/Picovoice/tts-latency-benchmark) — template for methodology
