# Research: Qwen3-ASR — Speech Recognition Replacement for Whisper

**Date:** 2026-03-12 (initial), 2026-03-12 (updated with verified API details)
**Status:** Research complete, API verified from source code, not yet implemented
**Motivation:** Whisper large-v3-turbo consistently misrecognizes "Claude" as "cloud/plot/goal". Whisper's `initial_prompt` hotword biasing is too weak for proper nouns. Need a model with native contextual biasing.

## Source Verification Summary

| Source | URL | Status |
|--------|-----|--------|
| vLLM recipe | https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-ASR.html | Fetched, code extracted |
| HuggingFace model card | https://huggingface.co/Qwen/Qwen3-ASR-1.7B | Fetched, code extracted |
| GitHub repo (source code) | https://github.com/QwenLM/Qwen3-ASR | Fetched via API, full source read |
| Qwen blog | https://qwen.ai/blog?id=qwen3asr | **Failed** (SPA, JS-only rendering) |

---

## 1. Package & Installation (VERIFIED)

The package is called `qwen-asr` (hyphen, not underscore). Import as `qwen_asr`.

```bash
# Fresh env recommended
conda create -n qwen3-asr python=3.12 -y

# Minimal (transformers backend)
pip install -U qwen-asr

# With vLLM backend (quotes keep zsh from globbing the extra)
pip install -U "qwen-asr[vllm]"

# From source
git clone https://github.com/QwenLM/Qwen3-ASR.git
cd Qwen3-ASR
pip install -e .

# FlashAttention 2 (strongly recommended)
pip install -U flash-attn --no-build-isolation
```

### Pinned Dependencies (from `pyproject.toml`)

```
transformers==4.57.6
nagisa==0.2.11        # Japanese tokenizer (for forced aligner)
soynlp==0.0.493       # Korean tokenizer (for forced aligner)
accelerate==1.12.0
qwen-omni-utils
librosa
soundfile
sox
gradio
flask
pytz
```

**Python:** >=3.9, recommended 3.12
**vLLM pin:** `vllm==0.14.0` (optional dependency)

---

## 2. Model Classes (VERIFIED from source code)

| Class | Package | Purpose |
|-------|---------|---------|
| `Qwen3ASRModel` | `qwen_asr` | Main inference wrapper (transformers + vLLM backends) |
| `Qwen3ForcedAligner` | `qwen_asr` | Word-level timestamp alignment |
| `Qwen3ASRForConditionalGeneration` | `qwen_asr.core.transformers_backend` | Raw HuggingFace model |
| `Qwen3ASRProcessor` | `qwen_asr.core.transformers_backend` | Feature extractor + tokenizer |
| `Qwen3ASRConfig` | `qwen_asr.core.transformers_backend` | Model config |
| `parse_asr_output` | `qwen_asr` | Parse raw model output into `(language, text)` |

**NOT** `AutoModelForCausalLM`. It uses custom registration:
```python
AutoConfig.register("qwen3_asr", Qwen3ASRConfig)
AutoModel.register(Qwen3ASRConfig, Qwen3ASRForConditionalGeneration)
AutoProcessor.register(Qwen3ASRConfig, Qwen3ASRProcessor)
```

---

## 3. Audio Input Format (VERIFIED from `utils.py`)

The `AudioLike` type alias:
```python
AudioLike = Union[
    str,                      # wav path / URL / base64
    Tuple[np.ndarray, int],   # (waveform, sr)
]
```

Accepted formats:
- **Local file path:** `"./audio.wav"` (loaded via `librosa.load`)
- **HTTPS URL:** `"https://example.com/audio.wav"` (downloaded via `urllib`)
- **Base64 data URI:** `"data:audio/wav;base64,..."` (decoded and read via soundfile)
- **Raw base64 string:** auto-detected when the string is >256 chars and contains no `/` or `\` (i.e. cannot be a file path or URL)
- **NumPy array tuple:** `(np.ndarray, sample_rate)` where array is float32 waveform

All audio is normalized to: **mono, 16kHz, float32, [-1, 1]**

Resampling uses `librosa.resample()`.
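The normalization contract can be sketched without the library. This is illustrative only; `downmix_to_mono` and `pcm16_to_float32` are hypothetical helper names, not `qwen_asr` APIs, and the real package does loading/resampling via `librosa`:

```python
def downmix_to_mono(channels):
    """Average per-channel sample lists into a single mono waveform."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]

def pcm16_to_float32(samples):
    """Scale int16 PCM into the [-1, 1] float range."""
    return [s / 32768.0 for s in samples]

left  = [0.5, 0.5, -0.5]
right = [0.5, -0.5, -0.5]
print(downmix_to_mono([left, right]))     # [0.5, 0.0, -0.5]
print(pcm16_to_float32([16384, -32768]))  # [0.5, -1.0]
```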

---

## 4. Core API: `model.transcribe()` (VERIFIED from source code)

```python
def transcribe(
    self,
    audio: Union[AudioLike, List[AudioLike]],
    context: Union[str, List[str]] = "",        # <-- THIS IS THE BIASING MECHANISM
    language: Optional[Union[str, List[Optional[str]]]] = None,
    return_time_stamps: bool = False,
) -> List[ASRTranscription]:
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio` | `AudioLike` or `List[AudioLike]` | required | Audio input(s) |
| `context` | `str` or `List[str]` | `""` | **Contextual biasing text** (injected as system message) |
| `language` | `str`, `List[str]`, or `None` | `None` | Force language. `None` = auto-detect |
| `return_time_stamps` | `bool` | `False` | Enable word-level timestamps (requires forced aligner) |

### Return Type

```python
@dataclass
class ASRTranscription:
    language: str          # e.g. "Chinese" or "Chinese,English"
    text: str              # Transcribed text
    time_stamps: Optional[Any] = None  # ForcedAlignResult (if requested)
```

---

## 5. Contextual Biasing: How It ACTUALLY Works (VERIFIED from source code)

**Previous assumption was WRONG.** There is no "hotword list" parameter. The mechanism is:

### The `context` parameter is injected as the system message in the chat template.

From `_build_messages()`:
```python
def _build_messages(self, context: str, audio_payload: Any) -> List[Dict[str, Any]]:
    return [
        {"role": "system", "content": context or ""},
        {"role": "user", "content": [{"type": "audio", "audio": audio_payload}]},
    ]
```

The chat template (from `chat_template.json`) renders this as:
```
<|im_start|>system
{context_text}<|im_end|>
<|im_start|>user
<|audio_start|><|audio_pad|><|audio_end|><|im_end|>
<|im_start|>assistant
```

When `language` is forced, the assistant prefix becomes:
```
<|im_start|>assistant
language English<asr_text>
```
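Both prompt layouts above can be reproduced with a small formatter. This is an illustrative sketch; the real rendering comes from `chat_template.json` via the processor, and `render_prompt` is not a `qwen_asr` function:

```python
from typing import Optional

def render_prompt(context: str, language: Optional[str] = None) -> str:
    """Render the Qwen3-ASR chat layout shown above for one audio sample."""
    prompt = (
        f"<|im_start|>system\n{context}<|im_end|>\n"
        "<|im_start|>user\n"
        "<|audio_start|><|audio_pad|><|audio_end|><|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if language is not None:
        # Forcing a language pre-fills the assistant turn up to <asr_text>,
        # so the model only generates the transcript body.
        prompt += f"language {language}<asr_text>"
    return prompt

print(render_prompt("Claude Code, Rajesh", language="English"))
```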

### What this means for her-os

The `context` parameter is a **free-form text string** passed as the system message. There is NO structured hotword list API. You can pass:

1. **A word list:** `context="Claude Rajesh Bangalore Sarthak"`
2. **A sentence:** `context="This is a conversation between Rajesh and his AI assistant Claude Code."`
3. **Domain terms:** `context="交易 停滞"` (as shown in the official examples)
4. **Any text:** The model uses it as prior context to bias recognition

### Evidence from official examples

In `example_qwen3_asr_transformers.py` and `example_qwen3_asr_vllm.py`:
```python
results = asr.transcribe(
    audio=[URL_ZH, zh_b64, (en_wav, en_sr)],
    context=["", "交易 停滞", ""],            # <-- contextual biasing
    language=[None, "Chinese", "English"],
    return_time_stamps=False,
)
```

The context `"交易 停滞"` biases the model toward recognizing these Chinese terms.

### CRITICAL CORRECTION from Session 289 research

The session 289 research doc stated:
> - **Hotword list:** Pass `["Claude Code", "Rajesh", "Bangalore", "Sarthak"]`
> - **Full context paragraphs:** Can pass background text

**Correction:** There is NO hotword list parameter. It's a single `context: str` parameter per audio sample, injected as a system message. The model treats it as domain context. The correct usage would be:

```python
# For her-os: pass known names as context string
results = model.transcribe(
    audio="recording.wav",
    context="Claude Code, Rajesh, Bangalore, Sarthak, Omi, Annie",
    language="English",
)
```

---

## 6. Word-Level Timestamps (VERIFIED from source code)

Timestamps require a **separate model**: `Qwen3-ForcedAligner-0.6B` (~1.2 GB bf16).

### Setup

```python
model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
    max_new_tokens=256,
)

results = model.transcribe(
    audio="audio.wav",
    language="English",
    return_time_stamps=True,
)

# Access timestamps
for ts in results[0].time_stamps:
    print(ts.text, ts.start_time, ts.end_time)  # seconds (float)
```

### Timestamp data structure (from `qwen3_forced_aligner.py`)

```python
@dataclass(frozen=True)
class ForcedAlignItem:
    text: str           # word or CJK character
    start_time: float   # seconds (rounded to 3 decimal places)
    end_time: float     # seconds (rounded to 3 decimal places)

@dataclass(frozen=True)
class ForcedAlignResult:
    items: List[ForcedAlignItem]
    # Supports iteration, len(), and indexing
```

### Forced aligner limitations (VERIFIED)

- **Max audio duration:** 180 seconds (3 minutes, from `MAX_FORCE_ALIGN_INPUT_SECONDS = 180` in source)
  - Note: earlier docs said 5 minutes; source code says 3 minutes
- **Supported languages:** 11 only (Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish)
- **Hindi NOT supported** for timestamps
- **Non-autoregressive** inference only
- **Requires `nagisa`** (Japanese) and **`soynlp`** (Korean) for tokenization

### Direct aligner usage

```python
from qwen_asr import Qwen3ForcedAligner

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = aligner.align(
    audio="audio.wav",
    text="The transcribed text goes here.",
    language="English",
)
# results[0][0].text, results[0][0].start_time, results[0][0].end_time
```

---

## 7. Long Audio Handling (VERIFIED from source code)

The model auto-splits long audio into chunks at low-energy boundaries:

| Mode | Max chunk duration | Constant |
|------|-------------------|----------|
| Without timestamps | 1200 seconds (20 min) | `MAX_ASR_INPUT_SECONDS = 1200` |
| With timestamps | 180 seconds (3 min) | `MAX_FORCE_ALIGN_INPUT_SECONDS = 180` |
| Minimum chunk | 0.5 seconds | `MIN_ASR_INPUT_SECONDS = 0.5` |

The splitting algorithm (`split_audio_into_chunks`) uses a sliding-window energy minimization to find optimal split points, ensuring zero-gap, zero-overlap concatenation.
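The idea can be sketched dependency-free (this is not the package's actual `split_audio_into_chunks` implementation): search a window of frame energies around the max-length boundary and cut at the quietest frame, so chunks concatenate back with zero gap and zero overlap:

```python
def pick_split_point(energies, target, window):
    """Index of the lowest-energy frame within +/-window of the target boundary."""
    lo = max(0, target - window)
    hi = min(len(energies), target + window + 1)
    return min(range(lo, hi), key=lambda i: energies[i])

def split_indices(energies, max_len, window=2):
    """Greedily choose split points so no chunk exceeds max_len frames."""
    cuts, start = [], 0
    while len(energies) - start > max_len:
        cut = pick_split_point(energies, start + max_len, window)
        cuts.append(cut)
        start = cut
    return cuts

# The quiet frame (energy 1) near the 5-frame boundary attracts the first cut.
energies = [9, 8, 9, 7, 1, 9, 8, 9, 9, 9]
print(split_indices(energies, max_len=5))  # [4, 7]
```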

---

## 8. Streaming API (VERIFIED from source code)

Streaming is **vLLM backend only**. No timestamps. No batching.

```python
# Init
state = model.init_streaming_state(
    context="",                    # contextual biasing
    language=None,                 # or "English"
    unfixed_chunk_num=2,           # first N chunks: no prefix prompt
    unfixed_token_num=5,           # rollback last K tokens for prefix
    chunk_size_sec=2.0,            # audio chunk size
)

# Feed audio incrementally (16kHz mono float32)
while has_audio:
    pcm_chunk = get_next_audio_chunk()  # any length
    model.streaming_transcribe(pcm_chunk, state)
    print(state.language, state.text)   # updated after each full chunk

# Flush remaining buffer
model.finish_streaming_transcribe(state)
print(state.language, state.text)       # final result
```

### Streaming internals

- Audio is buffered until `chunk_size_sec` worth of samples accumulate
- Each decode step re-feeds ALL accumulated audio from the start (not incremental)
- Prefix rollback strategy reduces boundary jitter
- Accepts `int16` or `float32` PCM input
- Context biasing works in streaming mode (passed at `init_streaming_state`)
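The buffering behavior in the first bullet can be sketched as follows (an illustrative toy, not the package's internals): complete chunks are released for decoding, the remainder waits, and the finish step flushes it:

```python
class ChunkBuffer:
    """Accumulate PCM samples; release fixed-size chunks for decoding."""
    def __init__(self, chunk_size_sec=2.0, sample_rate=16000):
        self.chunk_len = int(chunk_size_sec * sample_rate)
        self.buffer = []

    def feed(self, samples):
        """Append samples; return every complete chunk now available."""
        self.buffer.extend(samples)
        chunks = []
        while len(self.buffer) >= self.chunk_len:
            chunks.append(self.buffer[: self.chunk_len])
            self.buffer = self.buffer[self.chunk_len :]
        return chunks

    def finish(self):
        """Flush whatever remains (the final partial chunk)."""
        tail, self.buffer = self.buffer, []
        return tail

buf = ChunkBuffer(chunk_size_sec=0.001, sample_rate=4000)  # chunk_len = 4 samples
print([len(c) for c in buf.feed([0.0] * 10)])  # [4, 4] -> two full chunks
print(len(buf.finish()))                       # 2 leftover samples flushed
```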

---

## 9. vLLM Server Deployment (VERIFIED)

### CLI serve command
```bash
qwen-asr-serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
```

### OpenAI-compatible endpoints

**Chat completions:**
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[{
        "role": "user",
        "content": [{"type": "audio_url", "audio_url": {"url": "https://..."}}]
    }],
)
# Parse: language, text = parse_asr_output(response.choices[0].message.content)
```

**Transcription API:**
```python
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="Qwen/Qwen3-ASR-1.7B",
        file=f,  # the SDK expects a file object (or a (name, bytes) tuple), not raw bytes
    )
print(transcription.text)
```

**Note:** The OpenAI-compatible server endpoints do NOT expose the `context` parameter. Contextual biasing requires using the Python SDK directly or the system message in chat completions.

### vLLM offline (Python SDK)
```python
model = Qwen3ASRModel.LLM(
    model="Qwen/Qwen3-ASR-1.7B",
    gpu_memory_utilization=0.7,
    max_inference_batch_size=128,
    max_new_tokens=4096,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(dtype=torch.bfloat16, device_map="cuda:0"),
)
```

---

## 10. Output Format (VERIFIED from source code)

### Raw model output format

The model generates text in this format:
```
language English<asr_text>The transcribed text here.
```

When language is forced (via `language` parameter), the model outputs plain text only.

### `parse_asr_output()` function

```python
from qwen_asr import parse_asr_output
language, text = parse_asr_output(raw_output)
# language: "English"
# text: "The transcribed text here."
```

Special cases:
- `"language None<asr_text>"` = empty/silent audio, returns `("", "")`
- No `<asr_text>` tag = treat entire string as text, language unknown
- Built-in repetition detection and cleanup (`detect_and_fix_repetitions`)
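The parsing rules just listed can be sketched as follows (repetition cleanup omitted); this is illustrative, not the package's actual `parse_asr_output` source:

```python
def parse_asr_output_sketch(raw: str) -> tuple:
    """Split 'language X<asr_text>Y' into (language, text) per the rules above."""
    if "<asr_text>" not in raw:
        # No tag: treat the whole string as text, language unknown.
        return "", raw
    head, text = raw.split("<asr_text>", 1)
    language = head.removeprefix("language ").strip()
    if language == "None" and not text:
        # Empty/silent-audio sentinel.
        return "", ""
    return language, text

print(parse_asr_output_sketch("language English<asr_text>The transcribed text here."))
# ('English', 'The transcribed text here.')
print(parse_asr_output_sketch("language None<asr_text>"))
# ('', '')
```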

---

## 11. Model Variants (VERIFIED)

| Model | Parameters | VRAM (bf16) | Key Trait |
|-------|-----------|-------------|-----------|
| Qwen3-ASR-1.7B | 1.7B | ~3.4 GB | SOTA accuracy |
| Qwen3-ASR-0.6B | 0.6B | ~1.2 GB | 2000 audio-sec/s throughput at concurrency 128 |
| Qwen3-ForcedAligner-0.6B | 0.6B | ~1.2 GB | Word-level timestamps, 11 languages |

**API is identical** between 1.7B and 0.6B. Same `Qwen3ASRModel` class, same `transcribe()` signature.

---

## 12. Supported Languages (VERIFIED from source code)

### ASR: 30 languages
Chinese, English, Cantonese, Arabic, German, French, Spanish, Portuguese, Indonesian, Italian, Korean, Russian, Thai, Vietnamese, Japanese, Turkish, Hindi, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Romanian, Hungarian, Macedonian

Plus 22 Chinese dialects (Anhui, Dongbei, Fujian, etc.)

### Forced Aligner: 11 languages
Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish

**Kannada: NOT supported** (relevant for Rajesh's multilingual needs)
**Hindi: Supported for ASR, NOT for timestamps**
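Given the split above (30 ASR languages vs 11 aligner languages), a small guard avoids requesting timestamps for unsupported languages. The set is transcribed from this section; `qwen_asr` does not export it:

```python
# Transcribed from the forced-aligner language list above (not a qwen_asr export).
ALIGNER_LANGUAGES = {
    "Chinese", "English", "Cantonese", "French", "German", "Italian",
    "Japanese", "Korean", "Portuguese", "Russian", "Spanish",
}

def can_timestamp(language: str) -> bool:
    """True if Qwen3-ForcedAligner-0.6B supports word timestamps for this language."""
    return language in ALIGNER_LANGUAGES

print(can_timestamp("English"))  # True
print(can_timestamp("Hindi"))    # False: ASR yes, timestamps no
```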

---

## 13. Architecture Details (from config)

- Audio encoder: Whisper-style (128 mel bins, 32 encoder layers, 20 attention heads, d_model=1280)
- Feature extraction: `WhisperFeatureExtractor` (reused)
- Tokenizer: `Qwen2TokenizerFast`
- Processor: `Qwen3ASRProcessor` (combines feature extractor + tokenizer)
- Special tokens: `<|audio_start|>`, `<|audio_pad|>`, `<|audio_end|>`, `<asr_text>`, `<non_speech>`

---

## 14. aarch64/ARM Support

**Not mentioned anywhere** in documentation or source code. The dependencies (`transformers`, `torch`, `librosa`, `soundfile`) all support aarch64, but:
- FlashAttention 2 may need compilation from source on aarch64
- vLLM aarch64 support is experimental
- The DGX Spark (Blackwell, aarch64) would need testing

---

## 15. Known Limitations (VERIFIED)

1. **No structured hotword API** -- contextual biasing is free-form text in system message
2. **Streaming: vLLM only**, no timestamps, no batching
3. **Forced aligner: 3-minute max**, 11 languages only
4. **vLLM requires `if __name__ == '__main__':` guard** (multiprocessing spawn)
5. **FlashAttention 2: float16/bfloat16 only**
6. **Repetition hallucination** -- model has built-in `detect_and_fix_repetitions()` for repeated patterns (threshold=20)
7. **No diarization** -- ASR only, no speaker identification

---

## 16. Integration Plan for her-os (UPDATED)

### For Annie Voice (`whisper_stt.py` replacement)

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=1,  # single utterance at a time
    max_new_tokens=256,
)

# Contextual biasing with known entity names
context = "Claude Code, Rajesh, Annie, Bangalore, Sarthak, Omi"

results = model.transcribe(
    audio=(pcm_array, 16000),   # numpy array + sample rate
    context=context,
    language="English",
)
transcript = results[0].text
```

### For Audio Pipeline (`services/audio-pipeline/` -- batch processing)

```python
# vLLM backend for higher throughput
model = Qwen3ASRModel.LLM(
    model="Qwen/Qwen3-ASR-1.7B",
    gpu_memory_utilization=0.5,
    max_inference_batch_size=32,
    max_new_tokens=1024,
)

# Batch with per-file context
results = model.transcribe(
    audio=[file1, file2, file3],
    context=["entity list 1", "entity list 2", "entity list 3"],
    language=None,  # auto-detect
    return_time_stamps=True,  # if aligner loaded
)
```

### Context from Context Engine

```python
# Fetch known entities from CE to build context string
import httpx
resp = httpx.get("http://localhost:8100/v1/corrections/hotwords",
                 headers={"X-Internal-Token": TOKEN})
hotwords = resp.json()  # ["Claude", "Rajesh", ...]
context_str = ", ".join(hotwords)
```

### VRAM Impact

| Config | VRAM |
|--------|------|
| ASR only (1.7B bf16) | ~3.4 GB |
| ASR + ForcedAligner (1.7B + 0.6B) | ~4.6 GB |
| ASR only (0.6B bf16) | ~1.2 GB |

Comparable to current Whisper large-v3-turbo (~3 GB).

---

## 17. Open Questions — ANSWERED (Session 290–291 Empirical Testing)

1. **How effective is `context` biasing for "Claude" vs "cloud"?** — **WORKS, but context format matters.** Session 290's test was flawed (test audio didn't contain "Claude"). Session 291 retest with Kokoro TTS-generated audio proved:
   - Sparse context (`"Claude, Code"`) is insufficient — biases capitalization only ("Cloud Code")
   - Rich context (`"Claude Code, Rajesh, Annie, Bangalore"`) or full sentence context **successfully biases "cloud" → "Claude"** in acoustically ambiguous audio
   - Context has no effect when audio is unambiguous (clear pronunciation of "Claude")
   - Context correctly does NOT hallucinate "Claude" when the audio genuinely says "cloud computing"
2. **aarch64 Blackwell compatibility?** — **WORKS.** Model loads on Blackwell aarch64 with bfloat16, no FlashAttention needed. Used transformers backend (not vLLM).
3. **Latency vs Whisper?** — **~4x slower for long audio, comparable for short.** 71s audio: Qwen3-ASR 4.94s (14x RT) vs Whisper ~1.15s (62x RT). But for 2.6s utterances (typical VAD segments): Qwen3-ASR 483ms, which is acceptable for real-time voice.
4. **Can context be passed through vLLM server?** — yes, via system message in chat completions API
5. **Quality with streaming?** -- streaming re-feeds all audio each chunk (may have latency implications)

### Prototype Results (Session 290)

| Metric | Value | Gate | Result |
|--------|-------|------|--------|
| Model loads on Blackwell | Yes (31s warm, 207s cold) | MUST PASS | PASS |
| Basic transcription | Accurate, clean | MUST PASS | PASS |
| "Claude" recognized | No — "cloud code" | MUST PASS | **FAIL** (flawed test) |
| Latency (2.6s utterance) | 483ms | ≤2x Whisper | PASS |
| VRAM | 3.83 GB | ≤6 GB | PASS |
| NumPy input | Works | MUST PASS | PASS |
| Short audio (<0.3s) | No crash (hallucinated "Yeah.") | No crash | PASS |

### Controlled Retest Results (Session 291)

Test script: `scripts/retest_qwen3_biasing.py` — Kokoro TTS-generated audio, 6 context variations per clip.

| Audio | No context | Sparse ("Claude, Code") | Rich ("Claude Code, Rajesh, Annie...") | Full sentence |
|-------|-----------|------------------------|---------------------------------------|---------------|
| "Tell me about Claude Code" | Claude Code ✓ | Claude Code ✓ | Claude Code ✓ | Claude Code ✓ |
| "cloud computing code" | cloud ✓ | cloud ✓ | cloud ✓ | cloud ✓ |
| "I use Claude Code to write cloud software" | **cloud** code ✗ | **Cloud** Code (caps only) | **Claude** Code ✓ | **Claude** Code ✓ |
| "Claude is great" | Claude ✓ | Claude ✓ | Claude ✓ | Claude ✓ |

**Key finding:** Rich context (multiple entity names) or full sentence context successfully biases acoustically ambiguous speech. Sparse word lists are insufficient. Session 290's live failure was likely caused by an empty or too-sparse context string.

**Decision: RE-EVALUATE.** Biasing works. Next step: enrich the context string in `qwen3_asr_stt.py` (build richer context from hotwords + entity names) and re-deploy.

---

## 18. Quantization & Runtime Speed Optimization (Session 292 Research)

**Date:** 2026-03-12
**Goal:** Reduce Qwen3-ASR-1.7B latency on DGX Spark (Blackwell GB10, aarch64, SM121, 128GB unified memory)
**Baseline:** 266-1143ms per utterance with transformers backend, torch.bfloat16

---

### 18.1 Option A: vLLM Backend (RECOMMENDED — highest impact)

**What:** Replace `Qwen3ASRModel.from_pretrained()` (transformers backend) with `Qwen3ASRModel.LLM()` (vLLM backend). vLLM uses CUDA Graphs + continuous batching + optimized attention kernels.

**Expected speedup:** 2-5x for single utterances; massive throughput gains at concurrency.
- Official benchmarks (vLLM v0.14.0, bfloat16, CUDA Graphs enabled):
  - 1.7B offline batch: RTF 0.131 (throughput 980 audio sec/sec) at concurrency 128
  - 1.7B online async: RTF 0.105, TTFT avg 3392ms at concurrency 128
  - 0.6B online async: TTFT as low as **92ms**, RTF 0.064, throughput **2000** audio sec/sec

**DGX Spark compatibility:**
- vLLM has CUDA 13.0 aarch64 wheels since v0.13.0 (closed issue [#31128](https://github.com/vllm-project/vllm/issues/31128))
- Pre-built Docker images available: [NVIDIA DGX Spark vLLM](https://build.nvidia.com/spark/vllm)
- Community images ship vLLM v0.16.0rc2 with SM121 patches
- **CRITICAL CAVEAT:** FlashAttention does NOT support SM121. vLLM falls back to FlashInfer (community patches route SM121 to SM120 paths) or SDPA. Must test for correctness — early SM121 vLLM had "garbled output" from attention backend incompatibility
- Installation: `uv pip install -U vllm --pre` with cu130 nightly wheels from `wheels.vllm.ai/nightly/cu130`
- May require `--enforce-eager` flag if Triton/torch.compile misbehaves on SM121

**Code change:**
```python
# Current (transformers backend)
model = Qwen3ASRModel.from_pretrained("Qwen/Qwen3-ASR-1.7B", dtype=torch.bfloat16, device_map="cuda:0", max_new_tokens=256)

# vLLM backend (requires if __name__ == '__main__' guard for multiprocessing)
model = Qwen3ASRModel.LLM(model="Qwen/Qwen3-ASR-1.7B", gpu_memory_utilization=0.5, max_inference_batch_size=1, max_new_tokens=256)
```

**Risk:** HIGH (vLLM on SM121 is bleeding-edge, may have correctness issues). Needs thorough testing.

---

### 18.2 Option B: Static KV Cache + CUDA Graphs (transformers backend)

**What:** Apply the `faster-qwen3-tts` optimization pattern to the transformers backend:
1. Replace dynamic KV cache with HuggingFace `StaticCache` (pre-allocated fixed-size tensors)
2. Capture decode loop with `torch.cuda.CUDAGraph`
3. Replay graph instead of re-launching kernels each step

**Expected speedup:** 2-4x for autoregressive decode phase (based on [faster-qwen3-tts](https://github.com/andimarafioti/faster-qwen3-tts) results on RTX 4090: RTF improved from ~1x to 5.6x).

**Key insight from faster-qwen3-tts:**
- `torch.compile` alone gave **zero speedup** because dynamic KV cache shapes defeat the compiler
- Static cache is the prerequisite — it both accelerates by itself AND enables CUDA graph capture
- CUDA graphs eliminate GPU-CPU synchronization overhead (the main bottleneck for small models generating short sequences)

**DGX Spark compatibility:**
- Static KV cache: Pure PyTorch, no SM121 issues
- CUDA graphs: Should work on SM121 (basic CUDA feature, not SM-specific)
- torch.compile: **PROBLEMATIC** on SM121 — Triton treats SM12x as SM80 (Ampere), disabling Blackwell optimizations; may need to skip compilation entirely and run eager
- No FlashAttention dependency (uses SDPA which works on SM121)

**Implementation complexity:** MEDIUM-HIGH. Requires modifying the `qwen-asr` package's internal generate loop or monkey-patching. The `Qwen3ASRForConditionalGeneration` model uses a custom architecture (audio encoder + LLM decoder); static cache must be wired correctly for both.

**Risk:** MEDIUM. Pure PyTorch approach, but custom model architecture may not play well with StaticCache out of the box.
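The static-cache idea can be illustrated framework-agnostically: a dynamic cache reallocates a larger buffer every decode step (changing shapes, which defeats graph capture), while a static cache writes into one pre-allocated fixed-shape buffer. This is a conceptual toy, not the HuggingFace `StaticCache` API:

```python
import numpy as np

MAX_LEN, HEADS, DIM = 8, 2, 4

def dynamic_append(cache, kv):
    """Dynamic cache: shape grows each step -> fresh allocation, fresh kernel shapes."""
    step = kv[np.newaxis]
    return step if cache is None else np.concatenate([cache, step], axis=0)

class StaticKVCache:
    """Static cache: one fixed-shape buffer plus a write index, so tensor shapes
    never change -- the precondition for capturing the decode loop as a CUDA graph."""
    def __init__(self):
        self.buf = np.zeros((MAX_LEN, HEADS, DIM), dtype=np.float32)
        self.pos = 0

    def append(self, kv):
        self.buf[self.pos] = kv
        self.pos += 1

    def valid(self):
        return self.buf[: self.pos]

rng = np.random.default_rng(0)
dyn, stat = None, StaticKVCache()
for _ in range(5):
    kv = rng.standard_normal((HEADS, DIM)).astype(np.float32)
    dyn = dynamic_append(dyn, kv)
    stat.append(kv)
print(np.allclose(dyn, stat.valid()))  # same contents, different allocation behavior
```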

---

### 18.3 Option C: Qwen3-ASR-0.6B (smaller model)

**What:** Drop from 1.7B to 0.6B parameters. Same API, same `Qwen3ASRModel` class.

**Expected speedup:** ~1.2x for offline, ~1.6x for online (from technical report benchmarks):

| Metric | 0.6B | 1.7B | Ratio |
|--------|------|------|-------|
| Offline RTF (c=128) | 0.113 | 0.131 | 1.16x faster |
| Online RTF (c=128) | 0.064 | 0.105 | 1.64x faster |
| Online throughput | 2000/s | 1220/s | 1.64x higher |
| VRAM (bf16) | ~1.2 GB | ~3.4 GB | 2.8x less |

**WER accuracy tradeoff (from technical report):**

| Benchmark | 0.6B WER | 1.7B WER | Degradation |
|-----------|----------|----------|-------------|
| LibriSpeech clean | 2.11 | 1.63 | +0.48 |
| LibriSpeech other | 4.55 | 3.38 | +1.17 |
| GigaSpeech | 8.88 | 8.45 | +0.43 |
| Fleurs (multilingual avg) | 5.69 | 4.90 | +0.79 |
| Language ID accuracy | 96.9% | 97.9% | -1.0% |

**Contextual biasing:** Same `context` parameter, same mechanism. May be slightly less effective with smaller model capacity.

**Code change:** Single line:
```python
model = Qwen3ASRModel.from_pretrained("Qwen/Qwen3-ASR-0.6B", ...)
```

**DGX Spark compatibility:** Identical to 1.7B. No new dependencies.

**Risk:** LOW. Drop-in replacement. Only risk is quality degradation on edge cases.

---

### 18.4 Option D: Pre-Quantized Variants

**What's available on HuggingFace (as of 2026-03-12):**

| Model | Format | Publisher | Notes |
|-------|--------|-----------|-------|
| weiren119/Qwen3-ASR-1.7B-CoreML | CoreML | Community | Apple only, not applicable |
| aufklarer/Qwen3-ASR-1.7B-MLX-8bit | MLX 8-bit | Community | Apple only, not applicable |
| aufklarer/Qwen3-ASR-1.7B-MLX-4bit | MLX 4-bit | Community | Apple only, not applicable |
| aitytech/Qwen3-ASR-1.7B-MLX-8bit | MLX 8-bit | Community | Apple only, not applicable |
| andrewleech/qwen3-asr-1.7b-onnx | ONNX | Community | Needs ONNX Runtime + CUDA EP |
| aoiandroid/Qwen3-ASR-1.7B-CoreML | CoreML | Community | Apple only, not applicable |

**No GPTQ, AWQ, INT8, INT4, or FP8 quantized variants exist for Qwen3-ASR.**

The base Qwen3-1.7B LLM has GPTQ-Int8 and GGUF variants, but the ASR model has a custom architecture (audio encoder + LLM decoder) that prevents direct use of standard LLM quantization tools.

**NVFP4 (Blackwell-native 4-bit):**
- NVFP4 works on DGX Spark via community patches (CUTLASS software E2M1 conversion path)
- 20% faster than AWQ on DGX Spark benchmarks
- **Not available for Qwen3-ASR** — no one has created an NVFP4 variant
- Creating one would require: (1) quantize the LLM decoder weights to FP4, (2) keep audio encoder in bf16/fp16, (3) custom vLLM integration
- **Status:** Theoretically possible but no existing implementation

**ONNX Runtime path:**
- `andrewleech/qwen3-asr-1.7b-onnx` exists
- Would need ONNX Runtime with CUDA Execution Provider compiled for aarch64/CUDA 13
- Uncertain compatibility with SM121
- No benchmark data available

**DIY quantization (using LLM Compressor or AutoGPTQ):**
- Could quantize the LLM decoder portion to INT8 or INT4
- Audio encoder should remain in bf16 (quantizing audio features degrades WER significantly)
- Requires calibration dataset and testing
- Expected VRAM savings: 1.7B INT8 ~1.7 GB (vs 3.4 GB bf16), INT4 ~0.9 GB
- Expected speed: INT8 ~1.2-1.5x faster, INT4 ~1.5-2x faster (with proper kernel support)

**Risk:** HIGH for DIY quantization (untested territory for ASR models). LOW for ONNX (if it works at all on SM121).

---

### 18.5 Option E: antirez/qwen-asr (C implementation)

**What:** Pure C inference implementation by antirez (Redis creator). Supports both 0.6B and 1.7B.

**Optimizations used:**
- BLAS acceleration (OpenBLAS on Linux)
- NEON SIMD for aarch64 (directly applicable to DGX Spark ARM CPU)
- Memory-mapped safetensors (near-instant loading)
- Encoder window caching in streaming mode
- Decoder prefix capping (~150 tokens)

**Benchmarks (Apple M3 Max, single-threaded, 0.6B):**
- 11s audio: 7.99x realtime
- 45s audio: 13.38x realtime
- Streaming with cache: 4.69x realtime

**DGX Spark applicability:**
- Runs on CPU only (no GPU acceleration)
- NEON support means it works on aarch64
- Would NOT benefit from the 128GB unified GPU memory
- For short VAD segments (2-3s), CPU inference on 0.6B may be fast enough
- Frees GPU entirely for LLM/TTS

**Risk:** MEDIUM. Would need integration with Pipecat (C library → Python bindings or subprocess). Loses contextual biasing (`context` parameter not implemented in C version).

---

### 18.6 Summary: Recommendation Matrix

| Option | Expected Speedup | VRAM Impact | DGX Spark Risk | Implementation Effort | Recommended? |
|--------|-----------------|-------------|----------------|----------------------|-------------|
| **A: vLLM backend** | 2-5x | Same (3.4 GB) | HIGH (SM121 bleeding-edge) | LOW (API change) | YES — try first |
| **B: Static cache + CUDA graphs** | 2-4x | Same | MEDIUM (no FA dependency) | HIGH (patch qwen-asr internals) | MAYBE — if vLLM fails |
| **C: 0.6B model** | 1.4-1.6x | -2.2 GB (saves 65%) | LOW (drop-in) | TRIVIAL | YES — combine with A or B |
| **D: Quantization** | 1.2-2x (theoretical) | -1.7 to -2.5 GB | HIGH (no existing variants) | HIGH (DIY calibration) | NO — too risky for now |
| **E: C implementation** | Unknown (CPU-only) | 0 GB GPU | LOW (CPU, aarch64 native) | HIGH (Pipecat integration) | NO — loses biasing |

### Recommended Action Plan

1. **Immediate (low risk):** Switch to **0.6B model** — single line change, saves 2.2 GB VRAM, ~1.2-1.6x faster depending on workload, WER penalty is minor (+0.48 on LibriSpeech clean, +1.17 on other)
2. **Short-term (medium risk):** Test **vLLM backend on DGX Spark** — install vLLM cu130 aarch64 wheel, verify correctness with existing test audio, measure actual latency improvement
3. **If vLLM works:** Combine 0.6B + vLLM for maximum speed (estimated TTFT < 100ms for short utterances)
4. **If vLLM fails on SM121:** Investigate static KV cache + CUDA graphs on the transformers backend (Option B)
5. **Skip:** DIY quantization and C implementation — too much effort for uncertain gain

### DGX Spark SM121 Compatibility Cheat Sheet

| Component | SM121 Status | Notes |
|-----------|-------------|-------|
| PyTorch (CUDA 13) | Works | cu130 wheels for aarch64 |
| torch.compile / Triton | Degraded | Triton treats SM12x as SM80, disables Blackwell opts |
| FlashAttention 2 | BROKEN | No SM12x support, no plans |
| FlashAttention 4 | N/A | SM100 only (datacenter Blackwell) |
| FlashInfer | Patched | Community patches route SM121 to SM120 |
| CUDA Graphs | Works | Basic CUDA feature |
| vLLM | Works (patched) | cu130 aarch64 wheels, may need `--enforce-eager` |
| SDPA (PyTorch native) | Works | Fallback attention, slower than FA but correct |
| NVFP4 | Patched | Software E2M1 conversion (no hardware instruction on GB10) |
| bitsandbytes | Untested | Not recommended for inference (poor perf vs GPTQ/AWQ) |

---

## 19. Nemotron Speech ASR — Comparison & Rejection (Session 293)

**Date:** 2026-03-12
**Goal:** Evaluate NVIDIA Nemotron Speech ASR (0.6B) as potential Qwen3-ASR replacement.
**Verdict:** REJECTED — no contextual biasing, cannot solve the "Claude"/"cloud" problem.

### 19.1 Nemotron 3 vs Nemotron Speech (disambiguation)

- **Nemotron 3** (March 2026): 120B LLM (Mamba-Transformer MoE). Agentic reasoning. **Zero speech capabilities.** Not relevant to ASR.
- **Nemotron Speech ASR** (January 2026): 0.6B streaming FastConformer + RNNT. **This is the ASR model.**

### 19.2 Nemotron Speech ASR — Key Specs

| Spec | Value |
|------|-------|
| Model | `nvidia/nemotron-speech-streaming-en-0.6b` |
| Parameters | 600M |
| Architecture | Cache-aware FastConformer encoder (24 layers) + RNNT decoder |
| Training data | 285k hours English |
| Language | English only |
| License | nvidia-open-model-license (commercial OK) |
| Streaming | Native, cache-aware (80ms/160ms/560ms/1120ms chunk sizes) |
| VRAM | ~1.2 GB |
| Contextual biasing | **NONE** |

### 19.3 Latency Comparison

| Metric | Nemotron Speech | Qwen3-ASR-1.7B | Ratio |
|--------|----------------|-----------------|-------|
| Median time-to-transcript | **24ms** | 266-1143ms | 10-47x faster |
| Architecture | Streaming RNNT | Autoregressive LLM | Fundamentally different |
| Pipecat integration | Official repo exists | Custom SegmentedSTTService | Both work |

Independent benchmark (QbitLoop/RealtimeVoice): Nemotron 43ms vs Whisper 916ms — 21x faster.

### 19.4 Accuracy Comparison (WER)

| Dataset | Nemotron 0.6B (560ms) | Qwen3-ASR-1.7B | Winner |
|---------|----------------------|----------------|--------|
| LibriSpeech clean | 2.40 | **1.63** | Qwen3-ASR |
| LibriSpeech other | 4.97 | **3.38** | Qwen3-ASR |
| GigaSpeech | 11.43 | **8.45** | Qwen3-ASR |
| TEDLIUM | **4.46** | 4.50 | Nemotron (marginal) |

Qwen3-ASR has consistently lower WER, especially on noisy/difficult audio.

### 19.5 The Dealbreaker: No Contextual Biasing (corrected in 19.9)

Nemotron Speech's RNNT decoder appeared to have no mechanism for:
- Hotword lists
- Context parameters (system messages)
- Word boosting
- Any form of vocabulary biasing

NeMo framework has CTC-WS (Word Spotter) context biasing, but it requires a CTC decoder head — Nemotron Speech uses RNNT, so CTC-WS is not applicable.

**This means Nemotron Speech cannot solve the "Claude"→"cloud" misrecognition** — the primary reason we switched to Qwen3-ASR.

### 19.6 DGX Spark Compatibility

| Component | Status |
|-----------|--------|
| NeMo model loading | Works (PyTorch) |
| nemo2riva conversion | Broken on aarch64 (dependency conflicts) |
| NIM container (Nemotron Speech) | Not listed in NIM support matrix for Spark |
| Bare NeMo inference | Should work (standard PyTorch + SDPA) |

Forum reports: a developer hit `nemo2riva` dependency conflicts on DGX Spark; the thread remains unresolved.

### 19.7 Other Notable NVIDIA ASR Models

| Model | Params | WER (avg) | Biasing? | Notes |
|-------|--------|-----------|----------|-------|
| **Canary-Qwen-2.5B** | 2.5B | **5.63%** (#1 OpenASR) | Implicit (LLM decoder) | FastConformer + Qwen3-1.7B LLM |
| **Parakeet TDT 1.1B v2** | 1.1B | ~8.0% | CTC-WS available | Very fast (RTFx >2000) |
| **Canary 1B Flash** | 1B | ~7% | Via NeMo CTC-WS | 25 EU languages |

**Canary-Qwen-2.5B** is most interesting: pairs FastConformer encoder with Qwen LLM decoder. Implicit biasing through LLM world knowledge, but no explicit hotword API. 2.5B params (~5GB VRAM).

### 19.8 Decision

**Initial verdict was: stay with Qwen3-ASR.** However, Section 19.9 below revises this after discovering GPU-PB boosting.

### 19.9 CORRECTION: Nemotron Speech HAS Native Word Boosting (GPU-PB)

**Date:** 2026-03-12 (same session, follow-up research)

**Previous assumption was WRONG.** Nemotron Speech supports native word boosting via NeMo's GPU-PB (GPU-accelerated Phrase-Boosting) system. This was missed in initial research because:
1. The HuggingFace model card doesn't mention it
2. The Pipecat server.py uses a decoding config (`strategy: 'greedy'`, `loop_labels: False`) that is incompatible with boosting
3. The feature lives in NeMo's decoder layer, not the model card

**How GPU-PB works:** Shallow fusion — applies a weighted prefix tree (acceptor) at each RNNT decoding step. Biases token probabilities toward tokens that continue a match in the boosting tree. NOT beam search — modifies greedy decoding scores.
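
As an illustration only (this is not NeMo's code), the shallow-fusion idea can be sketched in a few lines of pure Python: a prefix tree over the boost phrases, plus a per-step bonus (mirroring the `context_score` field below) added to any token that extends a current match. The `build_trie`/`boosted_step` helpers are hypothetical and word-level; real GPU-PB operates on subword tokens with a batched GPU acceptor.

```python
# Toy sketch of shallow-fusion phrase boosting (illustration only;
# NeMo's GPU-PB builds a batched acceptor on the GPU).

def build_trie(phrases):
    """Prefix tree over token sequences of the boost phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for tok in phrase.split():  # toy: word-level "tokens"
            node = node.setdefault(tok, {})
    return root

def boosted_step(logprobs, state, trie, context_score=1.0):
    """One greedy step: add a bonus to tokens that extend a phrase match,
    pick the argmax, then advance (or reset) the match state."""
    scores = dict(logprobs)
    for tok in state:  # tokens that continue a match get boosted
        scores[tok] = scores.get(tok, float('-inf')) + context_score
    best = max(scores, key=scores.get)
    # advance inside the tree, else try starting a fresh match from the root
    next_state = state.get(best) or trie.get(best, {})
    return best, next_state

trie = build_trie(["Claude", "Claude Code"])
state = trie
# "cloud" barely outscores "Claude" acoustically; the bonus flips it
tok, state = boosted_step({"cloud": -1.0, "Claude": -1.2}, state, trie)
print(tok)  # Claude
# "Code" now continues the "Claude" match, so it gets boosted too
tok, state = boosted_step({"code": -0.9, "Code": -1.1}, state, trie)
print(tok)  # Code
```

Note how the bonus only nudges the score: a phrase that is acoustically far off still loses, which is why shallow fusion does not force-hallucinate hotwords.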

**API (from NeMo `boosting_graph_batched.py`):**
```python
# BoostingTreeModelConfig dataclass
key_phrases_list: list[str]  # e.g., ['Claude', 'Claude Code', 'Rajesh', 'Annie']
context_score: float = 1.0   # Per-token arc weight
depth_scaling: float = 2.0   # 2.0 for RNNT
boosting_tree_alpha: float   # Fusion weight (set on GreedyBatchedRNNTInferConfig)
```

**Compatibility matrix (from `rnnt_greedy_decoding.py`):**

| Strategy | loop_labels | Boosting | Per-Stream Biasing |
|----------|-------------|----------|-------------------|
| `greedy` (non-batched) | N/A | NO (NotImplementedError) | NO |
| `greedy_batch` + `loop_labels=True` | Yes | **YES** | **YES** |
| `greedy_batch` + `loop_labels=False` | No | NO (NotImplementedError) | NO |
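
A hypothetical pre-flight guard (not part of NeMo) that encodes this matrix, so an incompatible config fails fast instead of surfacing as a `NotImplementedError` deep inside decoding:

```python
def supports_boosting(strategy: str, loop_labels: bool = False) -> bool:
    """Per the compatibility matrix: GPU-PB boosting (and per-stream
    biasing) require strategy='greedy_batch' with loop_labels=True."""
    return strategy == "greedy_batch" and loop_labels

assert supports_boosting("greedy_batch", loop_labels=True)
assert not supports_boosting("greedy_batch", loop_labels=False)
assert not supports_boosting("greedy")  # non-batched greedy: no boosting
```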

**Proposed DGX Spark config (untested):**
```python
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'greedy_batch',
        'greedy': {
            'max_symbols': 10,
            'loop_labels': True,              # REQUIRED for boosting
            'use_cuda_graph_decoder': False,   # REQUIRED for Blackwell SM_121
            'boosting_tree': {
                'key_phrases_list': ['Claude', 'Rajesh', 'Annie'],
                'context_score': 1.0,
                'depth_scaling': 2.0,
            },
            'boosting_tree_alpha': 0.5,
        }
    })
)
```

**Risk:** MEDIUM. `loop_labels=True` + `use_cuda_graph_decoder=False` are logically independent settings, but this specific combination is untested on SM121. The Pipecat server.py may have used `loop_labels=False` for a reason not documented.

**Revised verdict:** Nemotron Speech ASR is a serious contender. GPU-PB boosting CONFIRMED working on Blackwell — see the Section 19.10 empirical results.

### 19.10 Empirical Results — Nemotron Speech on DGX Spark (Session 293)

**Date:** 2026-03-12
**Hardware:** DGX Spark GB10, SM121, CUDA 13.0, PyTorch 2.12.0.dev+cu130
**Model:** nvidia/nemotron-speech-streaming-en-0.6b
**NeMo version:** 2.8.0rc0 (from GitHub main)

#### Gate Test Results (Run 2 — clean GPU, no contention)

| Test | Result | Details |
|------|--------|---------|
| Model loads | **PASS** | 6,018ms (vs Qwen3-ASR 10,317ms) |
| Basic transcription | **PASS** | Clean output from 71s movie clip (Her opening scene) |
| Boosting config | **PASS** | `greedy_batch` + `loop_labels=True` + `use_cuda_graph_decoder=False` + GPU-PB all work on SM121 |
| "Claude" biasing | **PASS** | See detailed results below |
| Latency (offline, 71s audio) | **PASS** | avg=431ms, RTF=0.006 |
| VRAM | **2.49 GB** allocated, 3.10 GB reserved | Clean GPU (no contention from other models) |
| Short audio | **PASS** | Empty string, no crash |

> **Note:** Run 1 showed 4.79 GB VRAM with ollama/llama-server competing for GPU memory. Run 2 with clean GPU shows true VRAM: **2.49 GB** — 27% less than Qwen3-ASR's 3.40 GB.
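
For reference, the RTF figure is simply processing time over audio duration; a hand check of the 431 ms / 71 s numbers above:

```python
# Real-time factor = processing time / audio duration.
avg_latency_s = 0.431   # avg offline latency from the gate test
audio_s = 71.0          # length of the test clip
rtf = avg_latency_s / audio_s
print(round(rtf, 3))    # 0.006, i.e. ~165x faster than real time
```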

#### Claude Biasing Test (THE KEY RESULT)

Test audio: 10.9s Kokoro TTS-generated clip with 3 sentences containing "Claude" and "cloud".

**With GPU-PB boosting** (`key_phrases_list=["Claude", "Claude Code", "Rajesh", "Annie", "Bangalore"]`, `boosting_tree_alpha=0.5`):
> "I need to ask **Claude** about the project deadline. The **cloud** storage is almost full, but **Claude Code** can help. **Rajesh** told **Claude** to check the **Bangalore** weather forecast."

**Without boosting** (same audio, same model):
> "I need to ask Claude about the project deadline. The cloud storage is almost full, but **Cloud Code** can help. Rajesh told Claude to check the Bangalore weather forecast."

| Phrase | Without Boosting | With Boosting | Boosting Impact |
|--------|-----------------|---------------|-----------------|
| "ask Claude about" | Claude | Claude | No change (clear context) |
| "cloud storage" | cloud | cloud | No change (correct already) |
| "Claude Code can help" | **Cloud Code** | **Claude Code** | **FIXED by boosting** |
| "told Claude to check" | Claude | Claude | No change (clear context) |

**Key finding:** GPU-PB boosting fixes the ambiguous "Claude Code" case where the acoustic signal is borderline. In clear contexts ("ask Claude about"), Nemotron gets it right even without boosting.

#### Her Movie WAV Test (Run 2 — THE DECISIVE TEST)

**Audio:** `services/audio-pipeline/test-her-movie.wav` — 71s clip from "Her" (2013) opening scene.

**Transcription (with boosting, 456ms):**
> Hello, I'm here Hi, how you doing? I'm well. How's everything with you? Pretty good actually. It's really nice to meet you. It's nice to meet you too. Oh, what do I call you? Do you have a name? Yes. Samantha, where'd you get that name from? I gave it to myself, actually. How come? Wait, when did you give it to yourself? Well, right when you asked me if I had a name, I thought, yeah, he's right. I do need a name, but I wanted to pick a good one. So I read a book called How to Name Your Baby, and out of 180,000 names, that's the one I like the best.

**Quality assessment:** Perfect transcription of conversational dialogue. Captures the Samantha naming scene flawlessly — proper nouns, numbers ("180,000"), contractions, natural speech patterns all correct.

#### Claude Audio Test — Boosting Speed Advantage

| Config | Time | "Claude Code" | Correct? |
|--------|------|---------------|----------|
| Without boosting (basic greedy) | 534ms | Claude Code | Yes |
| **With boosting (greedy_batch)** | **346ms** | Claude Code | Yes |

**Surprise:** Boosting with `greedy_batch` is **35% faster** than non-boosted basic greedy. The batched decoder is inherently more efficient.

#### Side-by-Side Comparison with Qwen3-ASR

| Metric | Qwen3-ASR-1.7B | Nemotron Speech 0.6B | Winner |
|--------|----------------|---------------------|--------|
| Load time | 10,317ms (singleton: <10ms) | 6,018ms | Nemotron (first load); Qwen3-ASR (singleton) |
| Offline latency (71s audio) | 266-1143ms | **431ms avg** | **Nemotron** |
| Streaming latency | N/A | **24ms median** (native) | Nemotron |
| Boosted latency (11s audio) | Not tested | **346ms** | Nemotron |
| "Claude Code" biasing | context param (works) | GPU-PB key_phrases_list (works) | **Tie** |
| VRAM | 3.40 GB | **2.49 GB** (clean GPU) | **Nemotron** |
| WER (LibriSpeech clean) | **1.63** | 2.40 | Qwen3-ASR |
| Architecture | Autoregressive LLM | Streaming RNNT | Nemotron (streaming-native) |
| NeMo dependency | No (lightweight qwen-asr) | Yes (NeMo toolkit) | Qwen3-ASR |
| Pipecat integration | Custom SegmentedSTTService | Official pipecat-ai-services-nvidia repo | **Nemotron** |

#### Verdict (UPDATED 2026-03-12)

**DECISION: Switch to Nemotron Speech.** After two prototype runs on DGX Spark:

1. **VRAM: 2.49 GB** (27% less than Qwen3-ASR's 3.40 GB) — frees ~1 GB for other models
2. **Latency: 431ms avg** (offline) — comparable to Qwen3-ASR's 266-1143ms range, with 24ms streaming capability
3. **Boosting works and is FASTER** — greedy_batch + GPU-PB gives 346ms vs 534ms without boosting
4. **Claude/cloud disambiguation: SOLVED** — both with and without boosting in clear contexts
5. **Her movie WAV: perfect transcription** — Samantha scene captured flawlessly
6. **Official Pipecat integration exists** — `pipecat-ai-services-nvidia` repo, reduces custom code
7. **Streaming-native architecture** — can deliver 24ms latency for real-time voice UX

The WER gap (2.40 vs 1.63 on LibriSpeech clean) is the only downside, but empirical tests show excellent quality on conversational speech. NeMo dependency is heavier than qwen-asr, but the Pipecat integration handles it.

**Sources:**
- [NeMo Word Boosting Docs](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/asr_customization/word_boosting.html)
- [NeMo `rnnt_greedy_decoding.py`](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py) -- compatibility matrix
- [NeMo `boosting_graph_batched.py`](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/parts/context_biasing/boosting_graph_batched.py) -- BoostingTreeModelConfig
- [Modal Nemotron ASR Implementation](https://github.com/modal-projects/modal-nvidia-asr) -- confirms `greedy_batch` + boosting tree usage
- [NeMo Issue #14772](https://github.com/NVIDIA-NeMo/NeMo/issues/14772) -- confirms GPU-PB is the recommended approach for RNNT
- [TurboBias: Universal ASR Context-Biasing (arXiv)](https://arxiv.org/abs/2508.07014)

---

## Sources

- [Qwen3-ASR GitHub](https://github.com/QwenLM/Qwen3-ASR) -- source code verified
- [Qwen3-ASR-1.7B on HuggingFace](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) -- model card read, 6 quantized variants listed
- [Qwen3-ASR Technical Report](https://arxiv.org/html/2601.21337v2) -- speed benchmarks, WER tables, architecture details
- [vLLM Recipe for Qwen3-ASR](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-ASR.html) -- deployment guide
- [vLLM Blackwell SM121 Issue #31128](https://github.com/vllm-project/vllm/issues/31128) -- closed as COMPLETED, cu130 aarch64 wheels available
- [FlashAttention SM121 Issue #1969](https://github.com/Dao-AILab/flash-attention/issues/1969) -- not supported, no SM12x backend
- [FlashAttention Blackwell Issue #1853](https://github.com/Dao-AILab/flash-attention/issues/1853) -- SM120/SM121 not supported
- [vLLM for DGX Spark](https://build.nvidia.com/spark/vllm) -- NVIDIA official vLLM Docker images
- [NVFP4 on DGX Spark](https://forums.developer.nvidia.com/t/we-unlocked-nvfp4-on-the-dgx-spark-20-faster-than-awq/361163) -- 20% faster than AWQ
- [DGX Spark SM121 Software Gaps](https://forums.developer.nvidia.com/t/dgx-spark-sm121-software-support-is-severely-lacking-official-roadmap-needed/357663) -- ecosystem issues
- [faster-qwen3-tts](https://github.com/andimarafioti/faster-qwen3-tts) -- static cache + CUDA graphs pattern (transferable to ASR)
- [antirez/qwen-asr](https://github.com/antirez/qwen-asr) -- C inference, NEON aarch64, 0.6B benchmarks
- [Qwen3 Quantization Study](https://arxiv.org/html/2505.02214v1) -- INT8 near-lossless, INT4 degrades
- [Qwen3-ASR Blog](https://qwen.ai/blog?id=qwen3asr) -- failed to fetch (SPA)
- [NVIDIA Nemotron 3 Newsroom](https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models) -- LLM, not ASR
- [Nemotron Speech Streaming 0.6B (HuggingFace)](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) -- model card, WER benchmarks
- [Scaling Voice Agents with Cache-Aware ASR (HuggingFace Blog)](https://huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents) -- latency benchmarks
- [Pipecat Nemotron Integration](https://github.com/pipecat-ai/nemotron-january-2026) -- official Pipecat + Nemotron Speech demo
- [Canary-Qwen-2.5B (HuggingFace)](https://huggingface.co/nvidia/canary-qwen-2.5b) -- #1 OpenASR leaderboard
- [ASR on DGX Spark Forum Thread](https://forums.developer.nvidia.com/t/asr-on-spark-with-nemotron-speech-streaming-en-0-6b/358614) -- compatibility issues
- [RealtimeVoice Benchmark](https://github.com/QbitLoop/RealtimeVoice) -- Nemotron 43ms vs Whisper 916ms
