# Research: GPU-Accelerated Speech-to-Text on DGX Spark

**Date:** 2026-02-24
**Status:** Research complete
**Context:** Phase 0 — Technology validation on Titan (DGX Spark GB10)

> **Correction (2026-03-01):** This document originally stated CTranslate2 has no CUDA on aarch64. That applies only to **PyPI wheels**. The `mekopa/whisperx-blackwell` Docker container has CTranslate2 4.4.0 compiled from source with full CUDA support (float16, bfloat16, int8). The audio-pipeline runs WhisperX on GPU via this container.

## 1. Problem Statement

`pip install faster-whisper` installs CTranslate2 4.7.1, but `ctranslate2.get_cuda_device_count()` returns 0 on DGX Spark. CPU mode works (352ms for 30s audio). The GPU (12 TFLOPS FP16, 128GB unified memory) sits idle.

**Hardware:** NVIDIA DGX Spark (GB10 Grace Blackwell Superchip)
- GPU: Blackwell, compute capability 12.1, SM_121
- CPU: ARM64 (aarch64) — 10x Cortex-X925 + 10x Cortex-A725
- CUDA: 13.0, driver 580.126.09
- Memory: 128GB unified (shared CPU/GPU via NVLink-C2C)

**The core challenge:** aarch64 + CUDA 13.0 + Blackwell SM_121 is a rare triple that most ML packages do not support out of the box.

---

## 2. CTranslate2 on aarch64 — The Root Cause

### 2.1 PyPI Wheels: CUDA Not Included for ARM64

CTranslate2 publishes wheels for Linux aarch64 on PyPI, but **these wheels do NOT include CUDA support**. Only x86-64 Linux wheels are compiled with CUDA. This is confirmed by:

- [GitHub Issue #1306](https://github.com/OpenNMT/CTranslate2/issues/1306) — "This CTranslate2 package was not compiled with CUDA support" on aarch64
- [CTranslate2 installation docs](https://opennmt.net/CTranslate2/installation.html) — lists GPU support for Linux/Windows but the aarch64 wheels use CPU-only backends (OpenBLAS, Ruy)
- [open-webui Issue #18858](https://github.com/open-webui/open-webui/issues/18858) — same problem on DGX

**Verdict: The pip-installed CTranslate2 wheel for aarch64 will NEVER detect CUDA. This is not a configuration issue — the binary simply does not contain CUDA code. However, CTranslate2 compiled from source (as in the `mekopa/whisperx-blackwell` Docker image) works with full CUDA.**

### 2.2 Building CTranslate2 from Source with CUDA

Possible (the `mekopa/whisperx-blackwell` image does exactly this), but fragile on a bare host:

```bash
git clone --recursive https://github.com/OpenNMT/CTranslate2.git
cd CTranslate2 && mkdir build && cd build
cmake .. -DWITH_CUDA=ON -DWITH_CUDNN=ON -DCUDA_ARCH_LIST="12.1"
make -j$(nproc)
sudo make install && sudo ldconfig

# Build Python wheel
cd ../python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl
```

**Requirements:**
- C++17 compiler, CMake >= 3.15
- CUDA 12.x or 13.0 toolkit
- cuDNN 8 or 9 (needed for convolutional layers in speech models)

**Risks:**
1. CTranslate2 may not have CUDA kernel code that targets SM_120/SM_121. It claims support for compute capability >= 3.0 (Kepler), but no evidence of SM_120 testing.
2. CUDA 13.0 is newer than what CTranslate2 was designed for (CUDA 12.x). cuDNN version mismatch possible.
3. SM_120 and SM_121 are binary-compatible with SM_90 (Hopper) code, so a build targeting SM_90 *might* work via forward compatibility.
4. Apart from the `mekopa/whisperx-blackwell` image, no community reports of success building CTranslate2 with CUDA on aarch64 + Blackwell.

**Verdict: HIGH RISK on a bare host. Proven inside the `mekopa/whisperx-blackwell` container, but reproducing that build by hand could take hours of debugging with no guarantee.**

### 2.3 CTranslate2 Project Health

- Last release: v4.7.1 (still maintained, supports Python 3.13, cuDNN 9)
- Not abandoned, but updates are infrequent
- No Blackwell-specific support in any release
- No aarch64+CUDA wheels planned

### 2.4 `--extra-index-url=https://pypi.nvidia.com`

NVIDIA's PyPI index does NOT host CTranslate2. It hosts packages like `nvidia-pytriton`, `tensorrt`, etc. No help here.

---

## 3. Alternative: OpenAI Whisper via PyTorch (RECOMMENDED PATH)

### 3.1 The Approach

Use `openai-whisper` (or HuggingFace `transformers` with Whisper) directly on PyTorch with CUDA. This bypasses CTranslate2 entirely.

```python
import whisper
model = whisper.load_model("large-v3", device="cuda")
result = model.transcribe("audio.wav")
```

### 3.2 PyTorch on DGX Spark — Status

PyTorch CUDA support for aarch64 + Blackwell exists via:

1. **NGC Container:** `nvcr.io/nvidia/pytorch:25.11-py3` (or 25.12) — ships PyTorch 2.10.0a0 with native CUDA 13.0 + SM_121 support. This is NVIDIA's officially supported path.

2. **Community wheels:** [cypheritai/pytorch-blackwell](https://github.com/cypheritai/pytorch-blackwell) — pre-built wheels: `torch-2.11.0a0-cp312-cp312-linux_aarch64.whl` with `TORCH_CUDA_ARCH_LIST="12.1"` and `CUDA_HOME=/usr/local/cuda-13.0`.

3. **PyTorch nightly:** cu130 aarch64 wheels available via `--extra-index-url https://download.pytorch.org/whl/nightly/cu130`

### 3.3 The torchaudio Problem

**Critical issue:** Whisper uses torchaudio for audio loading/processing.

- No pre-compiled `torchaudio` GPU wheels for aarch64 on PyPI (only +cpu)
- ABI mismatch if force-installing x86 wheels
- Building from source inside NGC container fails due to header layout changes
- [pytorch/audio Issue #4169](https://github.com/pytorch/audio/issues/4169) documents this

**Workarounds:**
1. Use `soundfile` instead of `torchaudio` for audio I/O (recommended)
2. Use `librosa` for audio loading, bypass torchaudio entirely
3. The Whisper model itself (encoder + decoder) runs on CUDA fine — only audio I/O needs the workaround
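For plain 16-bit PCM WAV input, even the standard library can stand in for torchaudio. A minimal sketch (the `load_wav_mono16` helper is our own name, not an existing API; real pipelines would use `soundfile` or `librosa` as above):

```python
import array
import wave

def load_wav_mono16(path: str):
    """Load a 16-bit mono PCM WAV as floats in [-1.0, 1.0], bypassing torchaudio."""
    with wave.open(path, "rb") as f:
        assert f.getnchannels() == 1 and f.getsampwidth() == 2
        rate = f.getframerate()
        # "h" = signed 16-bit; wave returns raw little-endian PCM bytes
        pcm = array.array("h", f.readframes(f.getnframes()))
    return [s / 32768.0 for s in pcm], rate
```

The float list can then be turned into a tensor (e.g. `torch.tensor(samples)`) before it is handed to the model, which expects 16 kHz mono input.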

### 3.4 The NVRTC/SM_121 Problem

PyTorch's JIT compiler (NVRTC) doesn't recognize SM_121 yet. Operations that trigger JIT compilation (like `.abs()` on complex tensors in torchaudio) crash with:
```
nvrtc: error: invalid value for --gpu-architecture (-arch)
```

**Workaround: Architecture Spoofing** (from [whisperx-blackwell](https://github.com/Mekopa/whisperx-blackwell)):
```python
import torch

# Force PyTorch to report Hopper (9, 0) instead of Blackwell (12, 1);
# SM_90 code runs natively on SM_121 via binary compatibility.
torch.cuda.get_device_capability = lambda device=None: (9, 0)
```

This is a documented, working approach. The whisperx-blackwell project packages this as a Docker image.

### 3.5 Performance Expectations

- OpenAI Whisper large-v3 on GPU: ~30-50ms for 30s audio (vs 352ms CPU with faster-whisper)
- With int8 quantization: even faster
- The 128GB unified memory means no GPU memory constraints for any Whisper model
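These figures are easiest to compare as a real-time factor (RTF: processing time divided by audio duration, where below 1.0 means faster than real time):

```python
def rtf(processing_s: float, audio_s: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return processing_s / audio_s

# CPU baseline: 352 ms to transcribe 30 s of audio
print(round(rtf(0.352, 30.0), 4))  # 0.0117
```

By the same measure, the ~30-50 ms GPU estimate corresponds to an RTF of roughly 0.001-0.0017.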

---

## 4. Alternative: whisper.cpp with CUDA

### 4.1 The Approach

whisper.cpp is a C/C++ port using the ggml tensor library. It supports CUDA acceleration.

### 4.2 Building on DGX Spark

Following the pattern from [Arm's llama.cpp build guide](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/):

```bash
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
mkdir build-gpu && cd build-gpu
cmake .. \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_F16=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
make -j$(nproc)
```

### 4.3 Known Issues

- [Issue #3030](https://github.com/ggml-org/whisper.cpp/issues/3030): `nvcc fatal: Unsupported gpu architecture 'compute_120'` — requires CUDA toolkit that supports SM_120/121 (CUDA 13.0 does)
- [Issue #3339](https://github.com/ggml-org/whisper.cpp/issues/3339): CUDA 12.9 build failures — ggml CUDA backend may need patches for CUDA 13.0
- No official DGX Spark / SM_121 testing in whisper.cpp CI

### 4.4 Advantages

- No Python dependency issues (torchaudio, etc.)
- Smaller footprint, lower latency
- C API can be called from Python via ctypes or subprocess
- Community has confirmed ggml CUDA builds work on DGX Spark (for llama.cpp at least)

### 4.5 Disadvantages

- Requires building from source (no binary releases for aarch64+CUDA)
- May need patches for CUDA 13.0 compatibility
- Less flexible than Python-native solutions
- No Python API (only CLI or C bindings)

---

## 5. Alternative: WhisperX on Blackwell (Docker)

### 5.1 The Solution

[Mekopa/whisperx-blackwell](https://github.com/Mekopa/whisperx-blackwell) — a ready-to-use Docker image for GPU-accelerated WhisperX on DGX Spark.

### 5.2 How It Works

1. **Architecture Spoofing:** Patches `torch.cuda.get_device_capability()` to return `(9, 0)` (Hopper) instead of `(12, 1)` (Blackwell)
2. **JIT Bypass:** Patches torchaudio source to avoid `.abs()` on complex tensors (which triggers broken Jiterator)
3. **Result:** SM_90 (Hopper) CUDA code runs natively on SM_121 (Blackwell) via binary compatibility
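The spoofing step boils down to swapping one query function. A dependency-free sketch of the pattern (in the real image the target is `torch.cuda`; the helper name here is ours):

```python
def spoof_capability(cuda_ns, reported=(9, 0)):
    """Replace cuda_ns.get_device_capability so Blackwell reports as Hopper.

    Returns the original function so the patch can be undone later.
    """
    original = cuda_ns.get_device_capability
    cuda_ns.get_device_capability = lambda device=None: reported
    return original
```

Applied as `spoof_capability(torch.cuda)` before the model is loaded; every downstream capability check then sees `(9, 0)`.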

### 5.3 Usage

```bash
docker pull mekopa/whisperx-blackwell:latest
docker run --gpus all -p 8000:8000 \
  -e HF_TOKEN=<your_token> \
  mekopa/whisperx-blackwell:latest
```

### 5.4 Assessment

- **Proven working on DGX Spark**
- Includes speaker diarization (pyannote)
- Docker-based (isolated from host Python)
- Will become unnecessary when PyTorch/NVRTC adds official SM_121 support

---

## 6. Alternative: NVIDIA Parakeet via Riva NIM

### 6.1 What Is It

NVIDIA's own ASR model family, purpose-built for GPU acceleration:
- **Parakeet CTC 1.1b** — English, supported on DGX Spark (as of Riva NIM Release 1.8.0)
- **Parakeet RNNT 1.1b Multilingual** — supported on DGX Spark + Blackwell
- Deployed via NVIDIA NIM container from NGC

### 6.2 Setup

```bash
export NGC_API_KEY=<your_key>
docker run --gpus all \
  -v ~/.cache/nim:/opt/nim/.cache \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/nvidia/parakeet-1_1b-rnnt-multilingual-asr:latest
```

### 6.3 Performance

- Optimized with TensorRT 10.13
- Native Blackwell GPU acceleration
- Substantially better than Whisper in speed and accuracy per NVIDIA benchmarks

### 6.4 Known Issues

- [Forum thread](https://forums.developer.nvidia.com/t/running-parakeet-speech-to-text-on-spark/356353): Users report NeMo toolkit cannot detect GPU when installed via pip (same CUDA 12/13 mismatch)
- NIM container approach is the supported path (handles all dependencies internally)
- [Feature request](https://forums.developer.nvidia.com/t/feature-request-arm64-grace-cpu-support-for-riva-with-whisper-large-v3-turbo/354519): ARM64 Riva support is still in flux for some models

### 6.5 Assessment

- **NVIDIA's recommended path for DGX Spark ASR**
- Container-based, handles all CUDA/aarch64 compatibility internally
- Requires NGC API key and Docker
- More resource-heavy than whisper.cpp but highest quality

---

## 7. Alternative: Nemotron Speech ASR

### 7.1 What Is It

NVIDIA's newest (January 2026) open-source streaming ASR model:
- 0.6B parameters, English
- Cache-aware FastConformer encoder + RNNT decoder
- Designed for voice agents (low latency)
- **Sub-25ms transcription latency** (median 24ms on benchmarks)
- 3x more efficient than traditional buffered systems

### 7.2 Key Innovation

Cache-aware streaming: processes only new audio "deltas" by reusing past computations. Operates on 16kHz mono audio with minimum 80ms input chunks.
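At 16 kHz, those chunk durations translate directly into sample counts (a quick sanity check, not taken from the model card):

```python
SAMPLE_RATE = 16_000  # Nemotron expects 16 kHz mono audio

def chunk_samples(duration_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of PCM samples in an audio chunk of the given duration."""
    return sample_rate * duration_ms // 1000

print(chunk_samples(80))  # 1280 samples per minimum 80 ms chunk
```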

### 7.3 DGX Spark Compatibility

- Can be deployed on DGX Spark for single-user local development
- Available on HuggingFace: [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
- Used in Pipecat voice agents: [pipecat-ai/nemotron-january-2026](https://github.com/pipecat-ai/nemotron-january-2026)
- Requires NeMo framework + PyTorch with CUDA (same dependency challenges as Section 3)

### 7.4 Assessment

- **Best latency** of all options (24ms median)
- Perfect for real-time voice pipeline (our Pipecat use case)
- Newer model, less community testing on DGX Spark
- Same PyTorch/CUDA 13 dependency challenges apply

---

## 8. Alternative: insanely-fast-whisper (PyTorch + BetterTransformer)

### 8.1 What Is It

CLI tool using HuggingFace Transformers pipeline with BetterTransformer + FlashAttention-2. Pure PyTorch, no CTranslate2.

### 8.2 DGX Spark Compatibility

- Requires PyTorch with CUDA (same challenges as Section 3)
- FlashAttention-2 does NOT have aarch64 wheels — [Issue #1969](https://github.com/Dao-AILab/flash-attention/issues/1969)
- Can fall back to SDPA (Scaled Dot-Product Attention) without flash-attn

### 8.3 Assessment

- Good alternative if PyTorch CUDA is working
- Flash attention unavailable, but SDPA works
- Less optimized than faster-whisper or Nemotron for latency

---

## 9. The Arm Learning Path: faster-whisper on DGX Spark (CPU)

Arm published an official learning path: [Build an offline voice chatbot with faster-whisper and vLLM on DGX Spark](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup/).

**Key finding:** This guide explicitly uses `compute_type="int8"` for **CPU-only** mode. Even Arm's official guide does not demonstrate GPU mode with faster-whisper on DGX Spark, which confirms that the CTranslate2 PyPI wheels lack CUDA on aarch64 (a source-built CTranslate2, as in the Docker image, does have CUDA).

---

## 10. Recommendation: Tiered Approach

### Tier 1: Quick Win (Today) — WhisperX Blackwell Docker

```bash
docker pull mekopa/whisperx-blackwell:latest
```

- Proven working on DGX Spark
- GPU-accelerated via architecture spoofing
- Docker isolation avoids all dependency hell
- Good enough for Phase 0 validation

### Tier 2: Medium-Term — NVIDIA NIM Riva ASR (Parakeet)

```bash
docker run --gpus all nvcr.io/nim/nvidia/parakeet-1_1b-rnnt-multilingual-asr:latest
```

- NVIDIA's officially supported path
- Best accuracy, native Blackwell optimization
- Requires NGC API key
- Use for production deployment

### Tier 3: Optimal Long-Term — Nemotron Speech ASR

- Sub-25ms latency, perfect for voice agents
- Cache-aware streaming design
- Integrate via Pipecat (already in our architecture)
- Wait for stable PyTorch cu130 aarch64 ecosystem to mature

### Tier 4: Fallback — whisper.cpp with CUDA

```bash
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121
```

- No Python dependency issues
- Lowest resource usage
- Build from source, may need patches

### NOT Recommended

- **Building CTranslate2 from source with CUDA on aarch64 yourself** — proven possible (the `mekopa/whisperx-blackwell` Docker image does exactly this), but high effort; pull the prebuilt image rather than repeating the build
- **Waiting for CTranslate2 aarch64 CUDA wheels** — no indication this is planned
- **`pip install faster-whisper` GPU on DGX Spark** — blocked by PyPI wheel limitation (use Docker container instead)

---

## 11. Key Technical Facts

| Fact | Detail |
|------|--------|
| SM_120 and SM_121 | Binary compatible. SM_90 (Hopper) code runs on both. |
| CUDA 13.0 | Required for Blackwell. CUDA 12.x libraries don't exist on DGX Spark. |
| CUDA ABI mismatch | Packages built against CUDA 12.x link to `libcudart.so.12`, but DGX Spark only has `libcudart.so.13`. |
| PyTorch cu130 aarch64 | Available via nightly wheels and NGC containers. Not yet in stable releases. |
| torchaudio aarch64 | No GPU wheels. CPU-only from PyPI. Build from source fails in NGC containers. Use `soundfile` instead. |
| FlashAttention-2 aarch64 | Not available. Use SDPA (built into PyTorch) instead. |
| CTranslate2 aarch64 CUDA | Not compiled into PyPI wheels. Source build works (proven by `mekopa/whisperx-blackwell` Docker image). |
| Riva NIM Parakeet | Officially supports DGX Spark via container deployment. |
| Nemotron Speech ASR | 24ms median latency, 0.6B params, streaming-capable. |

---

## 12. Next Steps for Phase 0

1. **Test whisperx-blackwell Docker image** on Titan — confirm GPU detection and transcription speed
2. **Test Riva NIM Parakeet container** — measure latency for 30s audio clips
3. **Build whisper.cpp with CUDA** as a lightweight alternative
4. **Benchmark all approaches** against the CPU baseline (352ms for 30s)
5. **Decide production STT stack** based on results:
   - If Pipecat integration is key: Nemotron Speech ASR (when ecosystem matures)
   - If container deployment is OK: Riva NIM Parakeet
   - If minimal footprint needed: whisper.cpp with CUDA
   - If Python-native needed: OpenAI Whisper via NGC PyTorch container
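For step 4, a minimal timing harness, assuming each backend is wrapped in a `transcribe(audio_path)` callable (a hypothetical interface for this sketch, not an existing API):

```python
import time

def median_latency_ms(transcribe, audio_path: str, runs: int = 5) -> float:
    """Median wall-clock latency of one transcription call, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio_path)
        timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[len(timings) // 2]
```

The median (rather than the mean) keeps one-off warm-up or model-load costs from skewing the comparison against the 352 ms CPU baseline.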

---

## Sources

- [CTranslate2 Installation Docs](https://opennmt.net/CTranslate2/installation.html)
- [CTranslate2 Hardware Support](https://opennmt.net/CTranslate2/hardware_support.html)
- [CTranslate2 GitHub Issue #1306 — No CUDA on aarch64](https://github.com/OpenNMT/CTranslate2/issues/1306)
- [faster-whisper GitHub Issue #1086 — CUDA compatibility](https://github.com/SYSTRAN/faster-whisper/issues/1086)
- [faster-whisper GitHub Issue #1401 — Not compiled with CUDA](https://github.com/SYSTRAN/faster-whisper/issues/1401)
- [Mekopa/whisperx-blackwell — GPU WhisperX on Blackwell](https://github.com/Mekopa/whisperx-blackwell)
- [cypheritai/pytorch-blackwell — PyTorch wheels for DGX Spark](https://github.com/cypheritai/pytorch-blackwell)
- [natolambert/dgx-spark-setup — ML training guide](https://github.com/natolambert/dgx-spark-setup)
- [NVIDIA/dgx-spark-playbooks — Official playbooks](https://github.com/NVIDIA/dgx-spark-playbooks)
- [Arm Learning Path: llama.cpp GPU on DGX Spark](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/)
- [Arm Learning Path: faster-whisper on DGX Spark (CPU)](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/2_setup/)
- [PyTorch Issue #159779 — Enable CUDA 13.0 binaries](https://github.com/pytorch/pytorch/issues/159779)
- [pytorch/audio Issue #4169 — torchaudio on Blackwell aarch64](https://github.com/pytorch/audio/issues/4169)
- [WhisperX Issue #1326 — GPU diarization on Blackwell](https://github.com/m-bain/whisperX/issues/1326)
- [WhisperX Issue #1211 — Blackwell GPU sm_120](https://github.com/m-bain/whisperX/issues/1211)
- [whisper.cpp Issue #3030 — Unsupported compute_120](https://github.com/ggml-org/whisper.cpp/issues/3030)
- [NVIDIA Forum: Running Parakeet on Spark](https://forums.developer.nvidia.com/t/running-parakeet-speech-to-text-on-spark/356353)
- [NVIDIA Forum: Architecture compatibility on aarch64](https://forums.developer.nvidia.com/t/architecture-and-library-compatibility-on-aarch64/350389)
- [NVIDIA Forum: Riva ARM64 feature request](https://forums.developer.nvidia.com/t/feature-request-arm64-grace-cpu-support-for-riva-with-whisper-large-v3-turbo/354519)
- [NVIDIA NIM Riva ASR Support Matrix](https://docs.nvidia.com/nim/riva/asr/latest/support-matrix.html)
- [NVIDIA NIM Riva ASR Release Notes](https://docs.nvidia.com/nim/riva/asr/latest/release-notes.html)
- [NVIDIA Blackwell Compatibility Guide](https://docs.nvidia.com/cuda/blackwell-compatibility-guide/)
- [NGC PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
- [Nemotron Speech ASR on HuggingFace](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)
- [Pipecat + Nemotron January 2026](https://github.com/pipecat-ai/nemotron-january-2026)
- [open-webui Issue #18858 — CTranslate2 CUDA on DGX](https://github.com/open-webui/open-webui/issues/18858)
- [DGX Spark PyTorch Forum Thread](https://discuss.pytorch.org/t/dgx-spark-gb10-cuda-13-0-python-3-12-sm-121/223744)
- [Choosing Whisper Variants (Modal blog)](https://modal.com/blog/choosing-whisper-variants)
- [Daily.co: Building Voice Agents with NVIDIA Open Models](https://www.daily.co/blog/building-voice-agents-with-nvidia-open-models/)

## 13. Kannada STT: Self-Hosted Alternatives (2026-02-27)

**Context:** Auto-routing implemented — Whisper detects language, routes non-English to Sarvam API. Question: can we self-host Kannada STT instead of depending on Sarvam API?

### Mistral STT — NO Kannada

Voxtral Realtime 4B (Apache 2.0, Feb 2026) supports only 13 languages. No Kannada, no Dravidian languages. Hindi is the only Indian language. **Not usable.**

### Sarvam Saaras v3 — Proprietary, Not Mistral-Based

Sarvam's STT is their own architecture (NOT Whisper or Mistral). Trained on 1M+ hours, ~19% WER on IndicVoices (beats Gemini, GPT-4o). API-only, closed source. Cannot self-host.

### Self-Hosted Candidates

| Model | Kannada | VRAM | License | Quality | Notes |
|-------|---------|------|---------|---------|-------|
| **AI4Bharat IndicConformer-600M** | YES | ~2-3 GB | MIT | HIGH | IIT Madras, NeMo framework, purpose-built |
| AI4Bharat IndicWhisper | YES | ~3-6 GB | MIT | HIGH | Whisper fine-tuned on Indian langs |
| Meta MMS 1B (`mms-1b-all`) | YES (`kan`) | ~2-4 GB | CC-BY-NC | MEDIUM | 1107 langs, Bible-trained (narrow domain) |
| Meta SeamlessM4T v2 Large | YES | ~10 GB | CC-BY-NC | MED-HIGH | Also does speech-to-text translation |
| Vakyansh (Open-Speech-EkStep) | YES | Small | Open | OLDER | Superseded by AI4Bharat |

### Recommendation

**AI4Bharat IndicConformer-600M** — MIT licensed (commercial OK), ~2-3 GB VRAM (fits easily alongside existing stack), purpose-built for all 22 Indian languages at IIT Madras. Could replace Sarvam API entirely for a fully self-hosted pipeline.
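The ~2-3 GB figure is consistent with a weight-only back-of-envelope estimate (fp16, 2 bytes per parameter; activations and framework overhead account for the remainder):

```python
def weights_vram_gb(params_millions: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory footprint in GiB (fp16 = 2 bytes per parameter)."""
    return params_millions * 1e6 * bytes_per_param / 2**30

print(round(weights_vram_gb(600), 2))  # ~1.12 GiB for the 600M-parameter weights
```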

Meta models (MMS, SeamlessM4T) support Kannada but are CC-BY-NC — conflicts with commercial distribution ($99-499/year plan).

### References
- [AI4Bharat IndicConformer on HuggingFace](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)
- [Voxtral Transcribe 2 — Mistral AI](https://mistral.ai/news/voxtral-transcribe-2)
- [Saaras V3 — Sarvam AI](https://www.sarvam.ai/blogs/asr/)
- [Meta MMS 1B on HuggingFace](https://huggingface.co/facebook/mms-1b-all)
- [SeamlessM4T v2 on HuggingFace](https://huggingface.co/facebook/seamless-m4t-v2-large)
