# Research: GPU-Accelerated Kokoro TTS on DGX Spark

**Date:** 2026-02-24
**Status:** Research complete
**Context:** Phase 0 — Technology validation on Titan (DGX Spark GB10)

## 1. Problem Statement

Kokoro 0.9.4 (installed via `pip install kokoro`) fails in GPU mode with:
```
nvrtc: error: invalid value for --gpu-architecture (-arch)
```
The NVRTC JIT compiler does not recognize SM_120/SM_121 (Blackwell). CPU mode works at 0.39x real-time factor (2.5x faster than real-time). The GPU sits idle.

**Hardware:** NVIDIA DGX Spark (GB10 Grace Blackwell Superchip)
- GPU: Blackwell, compute capability 12.1, SM_121
- CPU: ARM64 (aarch64) — 10x Cortex-X925 + 10x Cortex-A725
- CUDA: 13.0, driver 580.126.09
- Memory: 128GB unified (shared CPU/GPU via NVLink-C2C)

**The core challenge:** aarch64 + CUDA 13.0 + Blackwell SM_121 is a rare triple. The nvrtc error comes from PyTorch's Jiterator attempting to JIT-compile a CUDA kernel for an architecture it does not recognize.

---

## 2. Root Cause Analysis

### 2.1 Where the Error Originates

Kokoro uses PyTorch under the hood. Its architecture is based on **StyleTTS 2 + ISTFTNet** — a decoder-only TTS model with 82M parameters. The critical code path is in `kokoro/istftnet.py`:

```python
# TorchSTFT.transform() — line 88-93
forward_transform = torch.stft(
    input_data,
    self.filter_length, self.hop_length, self.win_length,
    window=self.window.to(input_data.device),
    return_complex=True)
return torch.abs(forward_transform), torch.angle(forward_transform)
```

**The `.abs()` call on a complex-valued tensor is the trigger.** When PyTorch computes the absolute value of a complex tensor on GPU, it dispatches to the **Jiterator** — PyTorch's CUDA TensorIterator interface that JIT-compiles elementwise CUDA kernels using NVRTC. The Jiterator calls `nvrtcCompileProgram()` with `--gpu-architecture=sm_121`, and NVRTC (which ships with the installed CUDA toolkit) rejects this unknown architecture.

**Key insight:** Standard Python monkeypatching of `torch.cuda.get_device_capability()` FAILS because the Jiterator queries the hardware architecture directly from C++, bypassing the Python-level API.
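An illustrative pure-Python analogy (not kokoro or PyTorch code — the names below are stand-ins): a caller that bound the original function at startup is unaffected by a later monkeypatch of the module attribute, just as the Jiterator's C++ code never consults the Python-level API.

```python
import types

# Stand-in "driver module"; names are purely illustrative.
cuda = types.SimpleNamespace(get_device_capability=lambda: (12, 1))

# The "C++ side" binds the real function once, at startup.
native_query = cuda.get_device_capability

# A Python-level monkeypatch replaces only the module attribute.
cuda.get_device_capability = lambda: (9, 0)

print(cuda.get_device_capability())  # Python callers see the spoof: (9, 0)
print(native_query())                # the bound reference still reports (12, 1)
```

This is why Approach C (architecture spoofing) cannot work on its own.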

### 2.2 The Jiterator Explained

The Jiterator ([PyTorch Dev Blog](https://dev-discuss.pytorch.org/t/keeping-pytorchs-ops-maintainable-the-jiterator/468)) is a mechanism that JIT-compiles elementwise CUDA kernels at runtime using NVRTC. Instead of pre-compiling all possible kernel variants at build time, PyTorch compiles them on-demand when the operation is first invoked. This is efficient for most architectures but fails when NVRTC does not recognize the target GPU.

Operations that trigger the Jiterator include:
- `.abs()` on complex tensors
- Various elementwise operations on complex-valued tensors
- Certain fused kernel paths

### 2.3 The nvrtc Error is NOT From espeak-ng or phonemizer

The text processing pipeline (espeak-ng -> phonemizer -> misaki G2P) runs on CPU and does not touch CUDA at all. The error occurs exclusively during the **audio synthesis phase** — specifically in the ISTFTNet decoder when it performs spectral transformations on GPU tensors.

---

## 3. Kokoro's Architecture — Relevant CUDA Operations

### 3.1 Model Pipeline

```
Text -> misaki G2P -> phonemes -> CustomAlbert -> TextEncoder
    -> ProsodyPredictor -> F0/Energy
    -> Decoder (ISTFTNet) -> waveform
```

The Decoder contains the problematic STFT/iSTFT operations.

### 3.2 The `disable_complex` Parameter

Kokoro's ISTFTNet has a built-in escape hatch:

```python
# istftnet.py, Generator.__init__
self.stft = (
    CustomSTFT(...)  if disable_complex
    else TorchSTFT(...)
)
```

When `disable_complex=True`:
- Uses `CustomSTFT` — a real-valued STFT implementation that avoids complex tensors entirely
- Originally created for ONNX export compatibility (ONNX does not support complex tensors)
- **This avoids the Jiterator/nvrtc code path completely**

When `disable_complex=False` (default):
- Uses `TorchSTFT` — native `torch.stft()` / `torch.istft()` with complex tensors
- Triggers `.abs()` and `.angle()` on complex results
- **This is the path that hits the nvrtc error**
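A minimal model of this dispatch (stand-in classes, not kokoro's real ones), to make the two paths concrete:

```python
# Stand-ins for kokoro's STFT classes; only the safety property is modeled.
class CustomSTFT:
    uses_complex = False   # real-valued path -> no Jiterator/nvrtc

class TorchSTFT:
    uses_complex = True    # complex path -> .abs()/.angle() hit nvrtc on SM_121

def select_stft(disable_complex: bool):
    """Mirrors the conditional in Generator.__init__ (istftnet.py excerpt above)."""
    return CustomSTFT() if disable_complex else TorchSTFT()

assert select_stft(True).uses_complex is False   # safe on Blackwell
assert select_stft(False).uses_complex is True   # default path, hits nvrtc
```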

### 3.3 Other CUDA Operations in Kokoro

Beyond the STFT path, Kokoro uses standard PyTorch ops that have pre-compiled kernels:
- `nn.Conv1d`, `nn.ConvTranspose1d` (upsampling)
- `nn.Linear`, matrix multiplications
- Activation functions (Snake, LeakyReLU)
- `torch.repeat_interleave()`, `torch.bmm()`

These all use pre-compiled CUDA kernels from the PyTorch binary and do NOT trigger nvrtc JIT compilation. The problem is isolated to the complex tensor operations in the STFT path.

---

## 4. Solution Approaches (Ranked by Feasibility)

### 4.1 APPROACH A: Use `disable_complex=True` (BEST — Zero Risk)

**Feasibility: HIGH | Effort: LOW | Risk: NONE**

Kokoro already supports a real-valued STFT path via `CustomSTFT`. This bypasses all complex tensor operations and therefore all Jiterator/nvrtc calls.

**Implementation:**
```python
import kokoro
# When loading the model, pass disable_complex=True to the Generator/Decoder
# This requires either:
# 1. Patching the KModel initialization to pass disable_complex=True
# 2. Or monkey-patching after model load
```

The `CustomSTFT` was designed for ONNX compatibility, but it is functionally equivalent to `TorchSTFT` for inference. Minor numerical precision differences at the float32 level may exist, but they are inaudible in TTS output.

**Status:** Needs verification — the `kokoro` pip package may or may not expose this parameter through its high-level API. If not, a small monkey-patch of the model's `stft` attribute after loading would work.

### 4.2 APPROACH B: Monkey-Patch `.abs()` on Complex Tensors (PROVEN)

**Feasibility: HIGH | Effort: LOW | Risk: LOW**

The [whisperx-blackwell](https://github.com/Mekopa/whisperx-blackwell) project proved this approach for DGX Spark. Instead of calling `.abs()` on a complex tensor (which triggers the Jiterator), compute the magnitude manually:

```python
# BEFORE (triggers nvrtc):
magnitude = torch.abs(complex_tensor)

# AFTER (avoids nvrtc):
magnitude = torch.sqrt(complex_tensor.real**2 + complex_tensor.imag**2)
```
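The patch relies only on the polar-decomposition identities |z| = sqrt(re² + im²) and arg(z) = atan2(im, re); a stdlib sanity check (no torch required):

```python
import cmath
import math

z = complex(3.0, -4.0)

# Magnitude without calling abs() on the complex value
magnitude = math.sqrt(z.real ** 2 + z.imag ** 2)
assert math.isclose(magnitude, abs(z))          # 5.0

# Phase without a complex-valued angle() call
phase = math.atan2(z.imag, z.real)
assert math.isclose(phase, cmath.phase(z))
```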

**For Kokoro specifically**, patch `istftnet.py`:

```python
# Original TorchSTFT.transform():
# return torch.abs(forward_transform), torch.angle(forward_transform)

# Patched version:
real = forward_transform.real
imag = forward_transform.imag
magnitude = torch.sqrt(real**2 + imag**2)
phase = torch.atan2(imag, real)
return magnitude, phase
```

This avoids the Jiterator entirely. `torch.sqrt()`, `torch.atan2()`, `.real`, `.imag` all use pre-compiled kernels.

**Proven by:** [Mekopa/whisperx-blackwell](https://github.com/Mekopa/whisperx-blackwell) achieved ~115x speedup on DGX Spark (Blackwell) with this exact technique for torchaudio's identical `.abs()` issue.

### 4.3 APPROACH C: Architecture Spoofing (get_device_capability Patch)

**Feasibility: MEDIUM | Effort: MEDIUM | Risk: MEDIUM**

Force PyTorch's `get_device_capability()` to return `(9, 0)` (Hopper) instead of `(12, 1)` (Blackwell):

```python
import torch.cuda
_original = torch.cuda.get_device_capability

def patched_get_device_capability(device=None):
    major, minor = _original(device)
    if major == 12:
        return (9, 0)  # Spoof as Hopper
    return (major, minor)

torch.cuda.get_device_capability = patched_get_device_capability
```

**WARNING:** Standard Python monkeypatching alone is INSUFFICIENT because the Jiterator queries hardware architecture from C++ code, bypassing Python. This patch only works when combined with the `.abs()` patch (Approach B) or when using PyTorch builds that have the Jiterator fixed.

Code compiled for SM_90 (Hopper) is **forward compatible** with SM_121 (Blackwell) via PTX: the driver JIT-compiles Hopper-targeted PTX for the newer architecture at load time. NVIDIA and PyTorch maintainers confirm this path works on Blackwell hardware.

### 4.4 APPROACH D: Install PyTorch cu130 with SM_121 Support

**Feasibility: MEDIUM-HIGH | Effort: MEDIUM | Risk: LOW**

PyTorch now publishes cu130 wheels that include SM_120 support:

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```

**Critical caveat for DGX Spark:** The aarch64 + CUDA 13.0 combination is fragile:
- Standard `pip install torchaudio` fetches the **CPU-only** version for aarch64 (no pre-compiled +cuXXX wheels)
- NGC PyTorch container 25.12 ships with PyTorch 2.10, but torchaudio is broken on aarch64 ([pytorch/audio#4169](https://github.com/pytorch/audio/issues/4169))
- Nightly builds from `https://download.pytorch.org/whl/nightly/cu130` may have aarch64 CUDA wheels

**Recommended install sequence:**
```bash
# Create venv
uv venv .venv --python 3.12
source .venv/bin/activate

# Set environment
export TORCH_CUDA_ARCH_LIST="12.1a"
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Install PyTorch cu130
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

# If torchaudio gets CPU-only version, try nightly:
uv pip install torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130
```
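After installation, it is worth confirming the wheel actually ships Blackwell kernels. `torch.cuda.get_arch_list()` reports the compute architectures the binary was compiled for; `has_blackwell_kernels` below is a hypothetical convenience wrapper, and the snippet degrades gracefully when torch is absent:

```python
def has_blackwell_kernels(arch_list):
    """True if the binary ships precompiled SM_120/SM_121 kernels."""
    return any(arch in ("sm_120", "sm_121") for arch in arch_list)

try:
    import torch
    print(torch.__version__, torch.version.cuda)
    archs = torch.cuda.get_arch_list()
    print(archs, "->", "Blackwell OK" if has_blackwell_kernels(archs) else "no SM_120")
except ImportError:
    pass  # torch not installed; the helper is still usable standalone

assert has_blackwell_kernels(["sm_90", "sm_120"])
assert not has_blackwell_kernels(["sm_80", "sm_90"])
```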

If cu130 aarch64 wheels include SM_120/SM_121 pre-compiled kernels, the Jiterator may not need to JIT-compile at all, solving the problem at the root.

### 4.5 APPROACH E: Pre-Built PyTorch Wheels for Blackwell

**Feasibility: MEDIUM | Effort: LOW | Risk: MEDIUM**

Community pre-built wheels exist:
- [cypheritai/pytorch-blackwell](https://github.com/cypheritai/pytorch-blackwell) — PyTorch 2.11.0a0 built with `TORCH_CUDA_ARCH_LIST="12.1"` and `CUDA_HOME=/usr/local/cuda-13.0`
- Build time: ~45 minutes from source
- Confirmed working for core PyTorch ops (matmul, convolutions, training)

**Limitation:** These wheels may not include torchaudio — but that does not matter here, since Kokoro does not depend on torchaudio; it uses PyTorch's built-in `torch.stft`/`torch.istft` directly.

### 4.6 APPROACH F: ONNX Runtime with CUDA Execution Provider

**Feasibility: LOW-MEDIUM | Effort: HIGH | Risk: MEDIUM**

Kokoro has an ONNX export path ([kokoro-onnx](https://github.com/thewh1teagle/kokoro-onnx)):

```bash
pip install kokoro-onnx[gpu]  # Installs onnxruntime-gpu
```

ONNX Runtime with `CUDAExecutionProvider` uses pre-compiled CUDA kernels and does NOT use NVRTC JIT compilation. However:

- ONNX Runtime GPU 1.24.0 is **NOT compiled with SM_120** kernels by default
- Flash Attention is disabled in the ONNX Runtime SM_120 build
- Custom community builds exist ([ussoewwin/onnxruntime-gpu-1.24.0](https://huggingface.co/ussoewwin/onnxruntime-gpu-1.24.0)) but are Windows-only
- No aarch64 CUDA build of ONNX Runtime is available

**Verdict:** The ONNX path avoids complex tensor issues (uses `CustomSTFT` / `disable_complex=True` by design), but getting ONNX Runtime GPU working on DGX Spark aarch64 is harder than fixing the PyTorch path.

### 4.7 APPROACH G: NGC PyTorch Container

**Feasibility: MEDIUM | Effort: LOW-MEDIUM | Risk: MEDIUM**

NVIDIA's NGC container `nvcr.io/nvidia/pytorch:25.12-py3` ships PyTorch 2.10 optimized for DGX Spark:

```bash
docker pull nvcr.io/nvidia/pytorch:25.12-py3
docker run --gpus all -it nvcr.io/nvidia/pytorch:25.12-py3
pip install kokoro
```

**Known issues:**
- torchaudio is broken in NGC 25.12 on aarch64 ([NVIDIA forums](https://forums.developer.nvidia.com/t/incompatibility-of-torchaudio-in-ngc-pytorch-container-25-12-on-dgx-spark-blackwell-gb10/357745))
- ABI mismatches between NVIDIA's custom PyTorch build and pip packages
- Container overhead is overkill for an 82M-parameter model

**Verdict:** Worth trying if Approach A/B fail, but adds Docker complexity.

---

## 5. Answers to Specific Questions

### Q1: Does Kokoro use PyTorch under the hood?

**Yes.** Kokoro is a pure PyTorch model. Dependencies: `torch`, `transformers` (for CustomAlbert), `phonemizer`/`misaki` (G2P), `scipy`, `munch`, `soundfile`. The model itself is a `torch.nn.Module`. No custom CUDA extensions — all operations use standard PyTorch ops.

### Q2: Can we force pre-compiled PyTorch CUDA kernels instead of nvrtc JIT?

**Yes, via two mechanisms:**
1. **`disable_complex=True`** — Kokoro's own `CustomSTFT` avoids complex tensors entirely, meaning no Jiterator path is triggered.
2. **Manual magnitude computation** — Replace `.abs()` on complex tensors with `torch.sqrt(real**2 + imag**2)`, which uses pre-compiled kernels.

### Q3: Is there a `torch.compile()` or `torch.jit.trace()` approach?

**No.** `torch.compile()` actually makes things **worse** — it uses Triton as a backend, and Triton also does not support SM_120 in current versions ([PyTorch issue #149570](https://github.com/pytorch/pytorch/issues/149570) confirms `torch.compile` fails specifically with Kokoro). The error changes from nvrtc to "sm_120 is not defined for option 'gpu-name'" in Triton.

`torch.jit.trace()` would also eventually hit the same Jiterator path when executing the traced model on SM_121.

### Q4: Does `PYTORCH_CUDA_ALLOC_CONF` help?

**No.** This environment variable controls CUDA memory allocation strategies (e.g., `expandable_segments`, `max_split_size_mb`), not kernel compilation. It has no effect on the nvrtc architecture recognition issue.

### Q5: Is there a newer version of Kokoro (>0.9.4)?

The pip package `kokoro>=0.9.4` is the latest recommended version. The model weights are versioned separately from the pip package: Kokoro v0.19 weights were released Dec 2024 under Apache 2.0, and v1.0 weights also exist. **No version of Kokoro fixes the Blackwell issue** because the problem is in PyTorch's Jiterator, not in Kokoro's code.

### Q6: What about Kokoro-FastAPI — Blackwell solutions?

[Kokoro-FastAPI issue #365](https://github.com/remsky/Kokoro-FastAPI/issues/365) confirms the same problem. The issue was opened July 2025. No Blackwell-specific fix has been merged into Kokoro-FastAPI. The issue identifies this as a PyTorch ecosystem problem, not a Kokoro-FastAPI problem.

Kokoro-FastAPI does support **ONNX CPU mode** (~2.4x realtime) and **PyTorch GPU mode** (~35x realtime on supported architectures).

### Q7: Can we pre-compile/cache CUDA kernels for SM_120?

**Not directly.** The Jiterator compiles kernels at runtime. There is no built-in mechanism to pre-compile and cache Jiterator kernels. However, if you install a PyTorch build that includes SM_120 pre-compiled kernels (cu130 wheels), the Jiterator may not need to compile anything new.

### Q8: Does `torch.backends.cuda.matmul.allow_tf32 = True` help?

**No.** TF32 controls the precision format for matrix multiplications, not kernel compilation. It will not affect the nvrtc architecture recognition issue.

### Q9: What PyTorch version fully supports SM_120 without JIT issues?

Per PyTorch maintainer ptrblck: **"All nightly and stable PyTorch binaries 2.7.0+ using CUDA 12.8+ support Blackwell GPUs."** However:

- PyTorch 2.7.0+ with cu128 supports SM_120 for **x86_64**
- For **aarch64** (DGX Spark), you need cu130 wheels, and the ecosystem is fragile
- The Jiterator/nvrtc issue may still persist even with 2.7.0+ if the NVRTC bundled with the wheel does not recognize SM_121
- PyTorch nightly `2.10.0.dev+cu130` or `2.11.0.dev+cu130` are the most likely to work

### Q10: What about `CUDA_HOME` or `NVRTC_FLAGS`?

- **`CUDA_HOME`** — Setting this correctly (`/usr/local/cuda` or `/usr/local/cuda-13.0`) ensures PyTorch can find the CUDA toolkit, but does not change what architectures NVRTC supports.
- **`NVRTC_FLAGS`** — There is no such environment variable. NVRTC architecture flags are passed programmatically via the API, not via environment variables.
- **`NVRTC_DISABLE_CONCURRENT_NVVM`** — Only controls concurrent NVVM invocations, not architecture support.

### Q11: Can we bypass nvrtc with `torch.cuda.set_device()`?

**No.** `torch.cuda.set_device()` selects which GPU to use on multi-GPU systems. It does not affect kernel compilation paths. The nvrtc issue is about the compiler not recognizing the architecture, not about device selection.

### Q12: Does `PYTORCH_JIT=0` help?

**Partially.** Setting `PYTORCH_JIT=0` disables TorchScript JIT compilation, but it does NOT disable the Jiterator (they are different systems). The Jiterator operates at a lower level in ATen's CUDA backend. However, combined with other approaches, disabling JIT can reduce the surface area for problems.

---

## 6. PyTorch SM_120/SM_121 Timeline

| Version | CUDA | SM_120 Status | Platform |
|---------|------|---------------|----------|
| PyTorch 2.5.1 (stable) | cu121 | NOT supported | x86_64 |
| PyTorch 2.6.x (stable) | cu124 | NOT supported | x86_64 |
| PyTorch 2.7.0 (stable) | cu128 | Supported (with Triton 3.3) | x86_64 |
| PyTorch 2.9.0 (stable) | cu128 | Supported | x86_64 |
| PyTorch 2.9.1 (stable) | cu130 | Supported | x86_64, aarch64 (fragile) |
| PyTorch nightly | cu130 | Best support | x86_64, aarch64 |
| NGC 25.11/25.12 | cu130 | Custom build, SM_121 | aarch64 (torchaudio broken) |

**SM_120 and SM_121 are binary compatible.** A build containing SM_120 kernels runs fine on SM_121. This is confirmed by NVIDIA and PyTorch maintainers.

---

## 7. Recommended Strategy for DGX Spark

### Phase 1: Quick Win (Approach B — `.abs()` Patch)

This is the fastest path to GPU-accelerated Kokoro TTS on DGX Spark:

```python
#!/usr/bin/env python3
"""
Kokoro TTS Blackwell Bridge Patch
Patches the complex tensor .abs() call in ISTFTNet to avoid
triggering PyTorch's Jiterator/NVRTC on SM_121 (Blackwell).
"""

import torch
import kokoro

# After model is loaded, patch the TorchSTFT.transform method
def patched_transform(self, input_data):
    """Compute STFT magnitude and phase without complex .abs()"""
    forward_transform = torch.stft(
        input_data,
        self.filter_length, self.hop_length, self.win_length,
        window=self.window.to(input_data.device),
        return_complex=True)
    # Avoid .abs() on complex tensor (triggers Jiterator/nvrtc)
    real = forward_transform.real
    imag = forward_transform.imag
    magnitude = torch.sqrt(real ** 2 + imag ** 2)
    phase = torch.atan2(imag, real)
    return magnitude, phase

# Apply the patch to the TorchSTFT class (module path per kokoro/istftnet.py;
# confirm against the installed kokoro version before relying on it)
from kokoro.istftnet import TorchSTFT
TorchSTFT.transform = patched_transform
```

**Expected speedup:** Based on whisperx-blackwell benchmarks on identical hardware (DGX Spark), GPU inference achieves ~115x speedup over CPU. For Kokoro specifically, GPU should achieve **20-35x real-time** (vs 2.5x on CPU).

### Phase 2: Proper Fix (Approach D — cu130 Wheels)

Install PyTorch cu130 with native SM_121 support:

```bash
export TORCH_CUDA_ARCH_LIST="12.1a"
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

uv pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install kokoro
```

If cu130 aarch64 wheels include SM_120 pre-compiled kernels, the `.abs()` patch may no longer be needed.

### Phase 3: Production (Approach A — `disable_complex=True`)

For production deployment, use Kokoro's built-in `CustomSTFT` to avoid complex tensors entirely. This is the cleanest solution — no patches, no environment variables, no fragile wheel dependencies.

---

## 8. Alternative: kokoro-onnx Path

If PyTorch GPU remains problematic, the [kokoro-onnx](https://github.com/thewh1teagle/kokoro-onnx) project provides a complete ONNX inference path:

```bash
pip install kokoro-onnx[gpu]
```

The ONNX model uses `CustomSTFT` (no complex tensors) by design. With ONNX Runtime's `CUDAExecutionProvider`, this could work if an aarch64 CUDA build of ONNX Runtime exists. Currently:
- CPU ONNX: ~2.4x realtime (available now)
- GPU ONNX: ~35x realtime (needs ONNX Runtime with SM_120 CUDA — not available for aarch64 yet)

---

## 9. Key Resources

### DGX Spark Setup Guides
- [natolambert/dgx-spark-setup](https://github.com/natolambert/dgx-spark-setup) — ML training setup guide (PyTorch cu130, vLLM nightly)
- [NVIDIA/dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) — Official NVIDIA playbooks for DGX Spark AI/ML workloads
- [cypheritai/pytorch-blackwell](https://github.com/cypheritai/pytorch-blackwell) — Pre-built PyTorch wheels for SM_121
- [martimramos/dgx-spark-ml-guide](https://github.com/martimramos/dgx-spark-ml-guide) — Community DGX Spark ML guide

### Blackwell Bridge Patches
- [Mekopa/whisperx-blackwell](https://github.com/Mekopa/whisperx-blackwell) — Proven `.abs()` patch + capability spoof for DGX Spark
- [whisperX#1326](https://github.com/m-bain/whisperX/issues/1326) — Solved: GPU diarization on Blackwell SM_121

### PyTorch SM_120 Issues
- [pytorch#164342](https://github.com/pytorch/pytorch/issues/164342) — Official support for SM_120 in stable builds
- [pytorch#159207](https://github.com/pytorch/pytorch/issues/159207) — Add SM_120 support (RTX 5090)
- [pytorch#149570](https://github.com/pytorch/pytorch/issues/149570) — torch.compile fails with Kokoro

### Kokoro-Specific
- [Kokoro-FastAPI#365](https://github.com/remsky/Kokoro-FastAPI/issues/365) — Blackwell not supported yet
- [hexgrad/kokoro](https://github.com/hexgrad/kokoro) — Official Kokoro source
- [kokoro/istftnet.py](https://github.com/hexgrad/kokoro/blob/main/kokoro/istftnet.py) — The STFT module with `disable_complex` parameter
- [thewh1teagle/kokoro-onnx](https://github.com/thewh1teagle/kokoro-onnx) — ONNX inference path

### NVIDIA/PyTorch Ecosystem
- [NVIDIA Blackwell Compatibility Guide](https://docs.nvidia.com/cuda/blackwell-compatibility-guide/)
- [NVIDIA Software Migration Guide for Blackwell](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)
- [pytorch/audio#4169](https://github.com/pytorch/audio/issues/4169) — torchaudio broken on Blackwell aarch64
- [DGX Spark SM121 Software Support Lacking](https://forums.developer.nvidia.com/t/dgx-spark-sm121-software-support-is-severely-lacking-official-roadmap-needed/357663)

---

## 10. Summary Decision Matrix

| Approach | Speed | Effort | Risk | Recommendation |
|----------|-------|--------|------|----------------|
| A. `disable_complex=True` | GPU (~20-35x RT) | Low | None | **BEST for production** |
| B. `.abs()` monkey-patch | GPU (~20-35x RT) | Low | Low | **BEST for quick validation** |
| C. Capability spoof | GPU (~20-35x RT) | Medium | Medium | Combine with B |
| D. PyTorch cu130 wheels | GPU (~20-35x RT) | Medium | Medium | **BEST long-term** (if aarch64 wheels work) |
| E. Pre-built wheels | GPU (~20-35x RT) | Low | Medium | Fallback if D fails |
| F. ONNX Runtime GPU | GPU (unknown) | High | High | Not ready for aarch64 |
| G. NGC container | GPU (~20-35x RT) | Medium | Medium | Heavy for 82M model |
| (current) CPU mode | CPU (2.5x RT) | None | None | Already working |

**Recommended execution order:** B (validate GPU works) -> D (proper cu130) -> A (production clean-up)

---

## 11. Expected Performance

Based on benchmarks from similar hardware and model size:

| Mode | Real-time Factor | Latency (1s audio) | Notes |
|------|-------------------|---------------------|-------|
| CPU (current) | 0.39x (2.5x RT) | ~400ms | Working on DGX Spark |
| GPU (patched) | ~0.03x (35x RT) | ~30ms | Expected with `.abs()` patch |
| GPU + batching | ~0.01x (100x RT) | ~10ms | Expected with batch inference |

For a voice call pipeline targeting 600-900ms total latency, the GPU path (30ms for TTS) leaves ample budget for STT + Claude API + pipeline overhead.
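These numbers follow directly from the real-time-factor definition (synthesis time divided by audio duration); a quick arithmetic check of the latency-budget claim:

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: < 1.0 means faster than real time."""
    return synthesis_seconds / audio_seconds

def latency_ms(rtf_value: float, audio_seconds: float = 1.0) -> float:
    """Time in ms to synthesize the given duration of audio."""
    return rtf_value * audio_seconds * 1000.0

cpu_rtf = 0.39                  # measured on DGX Spark CPU
print(latency_ms(cpu_rtf))      # ~390 ms per second of audio
print(1 / cpu_rtf)              # ~2.56x faster than real time

gpu_rtf = 1 / 35                # expected with the .abs() patch
tts_ms = latency_ms(gpu_rtf)    # ~29 ms
budget_ms = 600                 # lower bound of the 600-900 ms pipeline budget
print(budget_ms - tts_ms)       # headroom left for STT + Claude API + overhead
```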
