# Benchmark: NVFP4 vs Q4_K_M on DGX Spark

**Machine:** DGX Spark (GB10), SM121, 128 GB unified LPDDR5x
**Benchmark series:** v1-v4 across sessions 337-339 (2026-03-14 to 2026-03-15)

| Version | Date | Models | Key Finding |
|---------|------|--------|-------------|
| v1 | 2026-03-14 | 9B NVFP4 vs 9B Q4_K_M | TTFT 3-15x faster (but thinking tokens leaked) |
| v2 | 2026-03-14 | Same + reasoning parser | Thinking adds 3-16s, masked real TTFT |
| v3 | 2026-03-14 | Same, thinking disabled | **NVFP4 = gibberish** (base model can't function without thinking) |
| **v4** | **2026-03-15** | **27B NVFP4 vs 9B Q4_K_M** | **Both excellent quality; 27B faster TTFT, 9B faster decode** |

**Raw data v1:** `scripts/benchmark_nvfp4_vs_q4km_results.json`
**Raw data v2:** `scripts/benchmark_nvfp4_vs_q4km_v2_results.json`
**Raw data v3:** `scripts/benchmark_quant_v3_results.json`
**Raw data v4:** `scripts/benchmark_27b_nvfp4_vs_9b_q4km_results.json`

---

## Models Compared

| | NVFP4 (candidate) | Q4_K_M (current) |
|---|---|---|
| **Model** | AxionML/Qwen3.5-9B-NVFP4 | Jackrong/Qwen3.5-9B-Claude-Opus-Distilled-Q4_K_M |
| **Fine-tuning** | Base Qwen3.5-9B (none) | SFT on ~3,950 Claude Opus 4.6 reasoning traces |
| **Quantization** | NVFP4 (E2M1 + FP8 block scaling) | Q4_K_M (mixed 4/6-bit GGUF) |
| **Runtime** | vLLM 0.16 (cu130-nightly, --enforce-eager) | llama-server (llama.cpp, SM121-compiled) |
| **Disk size** | 8.8 GB | 5.3 GB |
| **Context** | 262K (default) | 32K (configured) |

**Important:** This is NOT an apples-to-apples comparison. The models have
different fine-tuning (base vs Opus-Distilled). The NVFP4 model is base Qwen3.5
while the Q4_K_M model has Claude reasoning distillation. Performance differences
reflect both quantization format AND model quality.

---

## Results Summary

### Time to First Token (TTFT) — NVFP4 wins 3-11x

| Test | NVFP4 p50 | Q4_K_M p50 | Winner | Speedup |
|------|-----------|------------|--------|---------|
| Simple greeting | 82 ms | 287 ms | **NVFP4** | 3.5x |
| Factual question | 82 ms | 278 ms | **NVFP4** | 3.4x |
| Conversational | 83 ms | 367 ms | **NVFP4** | 4.4x |
| Reasoning | 84 ms | 371 ms | **NVFP4** | 4.4x |
| Long response | 83 ms | 366 ms | **NVFP4** | 4.4x |
| Multi-turn 1 | 83 ms | 336 ms | **NVFP4** | 4.0x |
| Multi-turn 3 | 102 ms | 762 ms | **NVFP4** | 7.5x |
| Multi-turn 5 | 153 ms | 1638 ms | **NVFP4** | 10.7x |

**Analysis:** NVFP4's TTFT is remarkably consistent at ~82ms regardless of prompt
complexity. This is because Blackwell Tensor Cores process FP4 data natively —
prompt processing is hardware-accelerated. Q4_K_M's TTFT grows linearly with
prompt size (287ms → 1638ms over 5 turns), reflecting llama.cpp's sequential
prompt evaluation.
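The per-test numbers come from streaming each response and timestamping token arrivals. A minimal sketch of the metric computation (hypothetical helper, not the actual benchmark script):

```python
def stream_metrics(send_time, token_times):
    """Derive TTFT and decode rate from one streamed completion.

    send_time: monotonic timestamp when the request was sent.
    token_times: monotonic timestamps at which each streamed token arrived.
    Returns (ttft_ms, decode_tok_s).
    """
    ttft_ms = (token_times[0] - send_time) * 1000.0
    if len(token_times) < 2:
        return ttft_ms, 0.0
    # Decode rate excludes the first token: that interval is the TTFT.
    span_s = token_times[-1] - token_times[0]
    return ttft_ms, (len(token_times) - 1) / span_s
```

Feeding it `time.monotonic()` stamps captured per SSE chunk, then aggregating across runs, reproduces the p50 tables.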

### Decode Speed (tok/s) — Q4_K_M wins by ~20%

| Test | NVFP4 p50 | Q4_K_M p50 | Winner | Ratio |
|------|-----------|------------|--------|-------|
| Simple greeting | 29.3 | 31.9 | Q4_K_M | 0.9x |
| Factual question | 30.0 | 32.8 | Q4_K_M | 0.9x |
| Conversational | 31.3 | 37.3 | Q4_K_M | 0.8x |
| Reasoning | 25.6 | 19.0 | **NVFP4** | 1.3x |
| Long response | 31.5 | 38.3 | Q4_K_M | 0.8x |
| Multi-turn 5 | 28.7 | 11.7 | **NVFP4** | 2.5x |

**Analysis:** Q4_K_M generally decodes faster (~32-38 tok/s vs ~26-32 tok/s). This
is likely because:
1. `--enforce-eager` disables CUDA graphs on vLLM, adding Python overhead per step
2. llama.cpp's C++ runtime has lower per-token overhead than vLLM's Python engine
3. The NVFP4 model was not optimized for DGX Spark SM121 specifically

**Exception — Multi-turn 5:** NVFP4 is 2.5x faster at the 5th turn because
DeltaNet's linear attention (75% of layers) doesn't grow KV cache, so context
length has minimal impact on decode speed.

### Total Latency — Q4_K_M wins for short, NVFP4 competitive for long context

| Test | NVFP4 p50 | Q4_K_M p50 | Winner | Delta |
|------|-----------|------------|--------|-------|
| Simple greeting | 4682 ms | 1536 ms | Q4_K_M | 3146 ms |
| Factual question | 3137 ms | 1553 ms | Q4_K_M | 1584 ms |
| Conversational | 9370 ms | 3371 ms | Q4_K_M | 5999 ms |
| Reasoning | 9425 ms | 1734 ms | Q4_K_M | 7692 ms |
| Long response | 15661 ms | 4775 ms | Q4_K_M | 10886 ms |
| Multi-turn 4 | 6339 ms | 6594 ms | **NVFP4** | 256 ms |

**Analysis:** NVFP4's total latency is 2-3x worse for short responses because the
NVFP4 model generates much longer outputs (more verbose, includes thinking tokens).
This is a model quality difference (base vs Opus-Distilled), not a quantization
difference. The Opus-Distilled model is trained to be concise; base Qwen3.5
generates longer chains of thought.

At multi-turn 4+ (growing context), NVFP4 becomes competitive because its constant
TTFT advantage outweighs the decode speed gap.

### Tool Calling — Both work (v2, with `--tool-call-parser qwen3_coder`)

**v1 result:** 400 Bad Request — caused by missing `--enable-auto-tool-choice`
and `--tool-call-parser qwen3_coder` flags on vLLM. Not a model limitation.

**v2 result (corrected):**

| Test | NVFP4 TTFT p50 | Q4_K_M TTFT p50 | NVFP4 Tool | Q4_K_M Tool |
|------|----------------|-----------------|------------|-------------|
| search_memory | 1394 ms | 823 ms | search_memory | search_memory |
| web_search | 1308 ms | 824 ms | web_search | web_search |

Both models correctly identified and called the right tool. Q4_K_M is ~1.7x
faster at tool call generation. NVFP4 latency includes thinking time from
`--reasoning-parser qwen3` (model reasons internally before generating tool call).

### v2 Benchmark: With Reasoning Parser + Tool Calling

Adding `--reasoning-parser qwen3` and `--tool-call-parser qwen3_coder` to vLLM
changes the NVFP4 numbers significantly. The model now thinks internally before
producing visible output. The "TTFT" becomes "time to first visible token after
thinking completes."

| Test | NVFP4 v1 TTFT | NVFP4 v2 TTFT | Q4_K_M TTFT | Notes |
|------|---------------|---------------|-------------|-------|
| greeting | 82 ms | 4761 ms | 304 ms | v2 includes ~4.7s thinking |
| factual | 82 ms | 3204 ms | 288 ms | v2 includes ~3.1s thinking |
| conversational | 83 ms | 9564 ms | 384 ms | v2 includes ~9.5s thinking |
| reasoning | 84 ms | 9602 ms | 370 ms | v2 includes ~9.5s thinking |
| long_response | 83 ms | 16065 ms | 380 ms | v2 includes ~16s thinking |
| multi-turn 5 | 153 ms | 6371 ms | 1703 ms | v2 includes ~6.2s thinking |

**Interpretation:** v1 (no reasoning parser) shows raw inference speed. v2 shows
user-perceived latency with thinking mode enabled. For Annie Voice (which uses
`--reasoning-budget 0` on llama-server, disabling thinking), v1 is the fairer
comparison. v2 shows what happens if thinking is enabled — NVFP4 generates
extensive thinking chains that add 3-16 seconds before the visible response.

**Key finding:** The base Qwen3.5-9B model "thinks" much more extensively than
the Opus-Distilled model. The Opus-Distilled model has been trained to be concise
and skip unnecessary reasoning. With thinking disabled (v1), NVFP4 TTFT is 82ms —
dramatically faster than Q4_K_M's 280-370ms.
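With the reasoning parser enabled, vLLM streams thinking as `reasoning_content` deltas and the answer as `content` deltas, so v2's "TTFT" is the arrival of the first non-empty `content` delta. A sketch (field names follow vLLM's reasoning-parser output; the helper itself is illustrative):

```python
def visible_ttft_ms(send_time, deltas):
    """Time to the first non-empty visible `content` delta, skipping
    `reasoning_content` (thinking) deltas.

    deltas: list of (timestamp, delta_dict) parsed from the SSE stream.
    Returns None if the model never emitted visible content.
    """
    for t, delta in deltas:
        if delta.get("content"):
            return (t - send_time) * 1000.0
    return None
```

Under v1's flags there are no `reasoning_content` deltas, so the same helper degenerates to plain TTFT, which is why v1 and v2 disagree so sharply.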

---

## Key Findings

### 1. NVFP4 has dramatically better TTFT

The 82ms constant TTFT vs Q4_K_M's 280-1600ms (growing with context) is the
standout metric. For a voice assistant like Annie, TTFT directly determines
response latency perceived by the user. At 5 conversation turns, NVFP4's TTFT
is **10.7x faster**.

### 2. Q4_K_M decodes faster per-token

llama.cpp's lean C++ runtime decodes ~20% faster than vLLM with `--enforce-eager`.
This gap might shrink with CUDA graphs enabled (currently disabled for SM121
compatibility).

### 3. Total latency is dominated by output length, not TTFT

The base Qwen3.5 model generates 2-3x more tokens per response than the
Opus-Distilled model. This makes total latency comparisons misleading — the
models are producing fundamentally different output volumes.

### 4. Multi-turn context is where NVFP4 + DeltaNet shines

At 5 turns, NVFP4 TTFT is 153ms (v1) while Q4_K_M is 1638ms. DeltaNet's linear
attention (24/32 layers) means KV cache grows only in 8 layers, giving
near-constant prompt processing regardless of conversation length.
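The KV-cache asymmetry can be made concrete: per-token cache growth comes only from the full-attention layers (the KV head count and head dim below are hypothetical, for illustration only):

```python
def kv_bytes_per_token(full_attn_layers, kv_heads, head_dim, bytes_per_el=2):
    """Per-token KV-cache growth: K and V tensors per full-attention layer.
    DeltaNet (linear attention) layers keep fixed-size state and add nothing."""
    return full_attn_layers * 2 * kv_heads * head_dim * bytes_per_el

# 9B hybrid: 8 of 32 layers are full attention (hypothetical 8 KV heads, dim 128)
hybrid = kv_bytes_per_token(8, 8, 128)      # 32768 bytes/token
standard = kv_bytes_per_token(32, 8, 128)   # 4x the growth
```

Whatever the true head dimensions, the 8-vs-32 layer ratio means the hybrid's cache (and thus prompt-processing cost) grows at a quarter of the standard rate.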

### 5. Tool calling works on both models

v2 benchmark confirmed: with `--enable-auto-tool-choice --tool-call-parser qwen3_coder`,
NVFP4 correctly calls `search_memory` and `web_search` tools. The v1 "failure"
was a server configuration issue, not a model limitation.
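Both servers expose the OpenAI chat-completions API, so the tool tests boil down to sending a standard `tools` array. A sketch of the request body (the tool schemas here are illustrative, not the benchmark's exact definitions):

```python
def tool_request(model, user_msg):
    """Build an OpenAI-style chat request advertising the two tools
    exercised in the benchmark (schemas are illustrative)."""
    def tool(name, desc):
        return {
            "type": "function",
            "function": {
                "name": name,
                "description": desc,
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [
            tool("search_memory", "Search Annie's long-term memory."),
            tool("web_search", "Search the web for current information."),
        ],
        "tool_choice": "auto",
    }
```

The accuracy gates then just check that the returned `tool_calls` entry names the expected function.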

### 6. The fine-tuning gap matters more than quantization

The biggest performance differences (output verbosity, thinking depth, response
style) come from base vs Opus-Distilled, not from NVFP4 vs Q4_K_M.
An NVFP4 version of the Opus-Distilled model would be the true comparison.

---

## Recommendation

### Do NOT switch to AxionML/Qwen3.5-9B-NVFP4 as a drop-in replacement

1. **Different model** — Base Qwen3.5 lacks Opus reasoning distillation. Annie's
   personality, conciseness, and tool-calling accuracy would degrade.
2. **Slower decode** — ~30 vs ~38 tok/s, so every response takes longer to stream.
3. **Tool calling works** (v2) — requires `--enable-auto-tool-choice --tool-call-parser qwen3_coder`.
   Both search_memory and web_search correctly triggered. Q4_K_M is ~1.7x faster at tool generation.
4. **More verbose** — Base model generates 2-3x more tokens, increasing latency
   and cost.
5. **Docker dependency** — Adds vLLM container lifecycle vs llama-server binary.

### What would change this recommendation

1. **NVFP4 quantization of Opus-Distilled model** — Eliminates the fine-tuning
   gap. We attempted this but serving is blocked by transformers version mismatch
   (see `BENCHMARK-NVFP4-EXECUTION-LOG.md`). Revisit when SGLang/vLLM update.
2. **CUDA graphs enabled** — `--enforce-eager` was required for SM121. With
   CUDA graph support, NVFP4 decode speed could match or exceed Q4_K_M.
3. **vLLM SM121 fix** — When NVIDIA's PyTorch build properly supports SM121
   (12.1 currently capped at 12.0), CUDA graphs should work.

### Keep Q4_K_M on llama-server for Annie Voice

The current setup (Opus-Distilled Q4_K_M on llama-server port 8003) remains
the best option:
- Proven in production
- Correct fine-tuning for Annie's personality
- Tool calling works
- ~35-38 tok/s decode
- Single binary, no container dependency

---

## Raw Benchmark Data

### NVFP4 (vLLM) — 3 runs per test

| Test | Run 1 TTFT | Run 2 TTFT | Run 3 TTFT | Run 1 tok/s | Run 2 tok/s | Run 3 tok/s |
|------|-----------|-----------|-----------|------------|------------|------------|
| greeting | 9360 ms* | 82 ms | 81 ms | 10.0 | 29.3 | 29.9 |
| factual | 82 ms | 82 ms | 82 ms | 29.3 | 31.2 | 30.0 |
| conversational | 82 ms | 83 ms | 83 ms | 31.3 | 29.8 | 32.2 |
| reasoning | 89 ms | 84 ms | 84 ms | 25.9 | 25.6 | 25.1 |
| long_response | 82 ms | 81 ms | 85 ms | 31.7 | 31.5 | 30.5 |

*Run 1 greeting: cold start (first inference after server start)

### Q4_K_M (llama-server) — 3 runs per test

| Test | Run 1 TTFT | Run 2 TTFT | Run 3 TTFT | Run 1 tok/s | Run 2 tok/s | Run 3 tok/s |
|------|-----------|-----------|-----------|------------|------------|------------|
| greeting | 287 ms | 279 ms | 307 ms | 31.9 | 33.5 | 22.3 |
| factual | 278 ms | 283 ms | 277 ms | 32.8 | 5.0 | 40.1 |
| conversational | 372 ms | 367 ms | 362 ms | 48.9 | 35.7 | 37.3 |
| reasoning | 371 ms | 351 ms | 427 ms | 17.5 | 23.0 | 19.0 |
| long_response | 373 ms | 366 ms | 365 ms | 36.4 | 38.3 | 39.2 |

### Multi-turn latency (5 turns, total per turn)

| Turn | NVFP4 p50 | Q4_K_M p50 |
|------|-----------|------------|
| 1 | 6285 ms | 2104 ms |
| 2 | 6313 ms | 4222 ms |
| 3 | 6287 ms | 6034 ms |
| 4 | 6339 ms | 6594 ms |
| 5 | 6384 ms | 2921 ms |

Note: NVFP4 multi-turn latency is remarkably constant (~6300ms per turn)
regardless of context growth. Q4_K_M varies significantly (2104-6594ms).

---

## v3 Benchmark: Proper Methodology (thinking disabled)

**Date:** 2026-03-14 (late night)
**Script:** `scripts/benchmark_quant_v3.py`
**Data:** `scripts/benchmark_quant_v3_results.json`

### Methodology Fixes from v1/v2

| Issue | v1/v2 | v3 |
|-------|-------|-----|
| **Thinking mode** | v1: leaked as content (fake 82ms TTFT); v2: hidden but consumed all tokens | **Disabled** via `enable_thinking: false` (matches Annie production `--reasoning-budget 0`) |
| **Token counting** | `len(text) // 4` estimate | API `usage.completion_tokens` preferred, estimate as fallback |
| **Warmup** | No exclusion (cold starts in data) | 2 warmup runs excluded, cold starts flagged |
| **Sample size** | n=3 | n=5 measured runs |
| **Quality gates** | None | Non-empty, tool accuracy, factual correctness, no-markdown, thinking leak |
| **Model mismatch** | Not acknowledged | Explicitly flagged as confounded comparison |
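The token-counting and warmup fixes amount to a few lines of harness logic. A sketch (hypothetical helpers, not `benchmark_quant_v3.py` itself):

```python
import statistics

def completion_token_count(usage, text):
    """Prefer the server-reported count (v3 behavior); fall back to the
    rough len // 4 estimate that v1/v2 used."""
    if usage and usage.get("completion_tokens"):
        return usage["completion_tokens"]
    return len(text) // 4

def p50_excluding_warmup(values, warmup=2):
    """Drop the first `warmup` runs (cold start, cache effects), then
    take the median of the remaining measured runs."""
    measured = values[warmup:]
    return statistics.median(measured)
```

With n=5 measured runs after 2 warmups, a single cold-start outlier (like v1's 9360 ms greeting) no longer contaminates the p50.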

### v3 Results: TTFT — NVFP4 still 3-15x faster

| Test | NVFP4 p50 | Q4_K_M p50 | TTFT Ratio | Winner |
|------|-----------|------------|------------|--------|
| Simple greeting | 93 ms | 276 ms | 0.34x | **NVFP4** |
| Factual question | 87 ms | 293 ms | 0.30x | **NVFP4** |
| Conversational | 81 ms | 354 ms | 0.23x | **NVFP4** |
| Reasoning | 78 ms | 360 ms | 0.22x | **NVFP4** |
| Long response | 82 ms | 357 ms | 0.23x | **NVFP4** |
| Tool call (memory) | 104 ms | 752 ms | 0.14x | **NVFP4** |
| Tool call (web) | 103 ms | 757 ms | 0.14x | **NVFP4** |
| Multi-turn 1 | 90 ms | 297 ms | 0.30x | **NVFP4** |
| Multi-turn 3 | 94 ms | 602 ms | 0.16x | **NVFP4** |
| Multi-turn 5 | 92 ms | 1378 ms | 0.07x | **NVFP4** |

NVFP4 TTFT is real — with thinking disabled, there's no measurement artifact.
Blackwell Tensor Cores process FP4 natively, giving ~80-100ms constant TTFT.

### v3 Results: Decode Speed — Q4_K_M ~2x faster

| Test | NVFP4 tok/s | Q4_K_M tok/s | Ratio |
|------|-------------|--------------|-------|
| Simple greeting | 22.0 | 32.5 | 0.68x |
| Factual question | 17.0 | 38.2 | 0.45x |
| Conversational | 20.1 | 43.1 | 0.47x |
| Reasoning | 23.3 | 23.4 | 1.00x |
| Long response | 19.2 | 38.3 | 0.50x |
| Multi-turn 5 | 20.7 | 33.3 | 0.62x |

Q4_K_M decodes ~2x faster (32-43 tok/s vs 17-23 tok/s). The gap widened from
v1's ~20% to v3's ~50-55% — likely because v1 measured thinking tokens in the
NVFP4 decode rate, inflating it.

### v3 Results: Quality Gates — NVFP4 CATASTROPHIC FAILURE

| Gate | NVFP4 | Q4_K_M |
|------|-------|--------|
| **Non-empty responses** | 34/35 (97%) | 35/35 (100%) |
| **Factual correctness** (Tokyo) | **0/5 (0%)** | 5/5 (100%) |
| **Reasoning correctness** (15) | **0/5 (0%)** | 5/5 (100%) |
| **Tool call accuracy** (memory) | **0/5 (0%)** | 4/5 (80%) |
| **Tool call accuracy** (web) | **0/5 (0%)** | 5/5 (100%) |
| **No markdown** | 4/15 (27%) | 15/15 (100%) |
| **Thinking leaked** | 0/35 (0%) | 0/35 (0%) |

**NVFP4 output is garbled gibberish.** Sample factual_question response:

```
.isantts     E_OVER
 S copr; '...CSiscoslmogn不辞 продать / '^当 /M刘邦opIS ( S Z-s优化e'
```

vs Q4_K_M:

```
Tokyo is the capital of Japan. It's a massive metropolis that blends
ultramodern skyscrapers with ancient temples...
```

**Root cause:** Base Qwen3.5-9B was trained to reason in `<think>` blocks
before generating content. With `enable_thinking: false`, the model has no
internal reasoning step and produces degenerate output. The Opus-Distilled
model was fine-tuned to produce high-quality output *without* thinking,
which is why it works with `--reasoning-budget 0`.
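For reference, v3 disabled thinking per request via the chat template: vLLM's OpenAI server accepts `chat_template_kwargs` in the request body, which mirrors llama-server's `--reasoning-budget 0` flag. A sketch of the request payload:

```python
def no_thinking_request(model, user_msg):
    """Chat request with Qwen-style thinking disabled via the chat
    template. vLLM forwards `chat_template_kwargs` to the template."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "chat_template_kwargs": {"enable_thinking": False},
        "stream": True,
    }
```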

### v3 Results: Multi-turn TTFT — Context scaling confirmed

| Turn | NVFP4 p50 (TTFT) | Q4_K_M p50 (TTFT) | Ratio |
|------|-------------------|--------------------|-------|
| 1 | 90 ms | 297 ms | 0.30x |
| 2 | 95 ms | 356 ms | 0.27x |
| 3 | 94 ms | 602 ms | 0.16x |
| 4 | 92 ms | 877 ms | 0.10x |
| 5 | 92 ms | 1378 ms | 0.07x |

NVFP4 TTFT stays constant at ~92ms across 5 turns (DeltaNet linear attention).
Q4_K_M TTFT grows linearly: 297ms → 1378ms (4.6x increase over 5 turns).

---

## Updated Recommendation (post-v3)

### NVFP4 base model is NOT viable — quality is zero

v3 definitively kills the NVFP4 base model as a candidate:

1. **Gibberish output** — With thinking disabled (Annie's config), the base
   model produces unintelligible garbled text. Not "lower quality" — zero quality.
2. **Zero factual accuracy** — 0/5 on "What is the capital of Japan?"
3. **Zero reasoning** — 0/5 on "3 boxes × 5 balls = ?"
4. **Zero tool calling** — 0/10 across search_memory and web_search
5. **TTFT is real but irrelevant** — 80ms response time means nothing when
   the response is gibberish.

### Why v1/v2 were misleading

v1 showed NVFP4 producing seemingly coherent text — but that was **thinking
tokens leaking as content**. The model was reasoning out loud, not answering
the question. v2 hid the thinking, revealing the model produced zero visible
output. v3 disabled thinking entirely, revealing the model can't function
without it.

### The path forward (updated after v4)

1. **Keep Q4_K_M on llama-server** — proven quality, 32-43 tok/s, proper
   tool calling, correct personality.
2. ~~**NVFP4 of Opus-Distilled**~~ — blocked by ecosystem. See v4 below for
   the 27B instruct alternative.
3. **27B NVFP4 instruct** — viable quality but 13 tok/s decode is too slow
   for voice. Revisit when MTP speculative decoding is enabled (~20 tok/s).

---

## v4 Benchmark: 27B NVFP4 (instruct) vs 9B Opus-Distilled Q4_K_M

**Date:** 2026-03-15
**Script:** `scripts/benchmark_27b_nvfp4_vs_9b_q4km.py`
**Data:** `scripts/benchmark_27b_nvfp4_vs_9b_q4km_results.json`

### What changed from v3

v3 proved the base 9B model produces gibberish with thinking disabled. v4
tests a **different model**: the full **27B instruct** model (not Opus-Distilled,
not base). This is a proper instruct-tuned model that works without thinking.

| | 27B NVFP4 (candidate) | 9B Q4_K_M (current) |
|---|---|---|
| **Model** | surogate/Qwen3.5-27B-NVFP4 | Jackrong/Qwen3.5-9B-Opus-Distilled Q4_K_M |
| **Parameters** | 27B (dense) | 9B (dense) |
| **Fine-tuning** | Official instruct (Qwen) | SFT on Claude Opus 4.6 traces |
| **Quantization** | NVFP4 (modelopt, group_size=16) | Q4_K_M (mixed 4/6-bit GGUF) |
| **Runtime** | vLLM 0.16rc2 (Docker cu130-nightly) | llama-server (llama.cpp) |
| **Disk size** | 19 GB | 5.3 GB |
| **Architecture** | 64 layers, DeltaNet hybrid (3:1 linear-to-full attention) | 32 layers, standard attention |

**This is NOT a pure quantization comparison.** Different models, different sizes,
different runtimes. It's a practical "which backend is better for Annie" decision.

### v4 Results: Quality — Both excellent

| Gate | 27B NVFP4 | 9B Q4_K_M |
|------|-----------|-----------|
| Non-empty | 5/5 all tests | 5/5 all tests |
| Factual (Tokyo) | **5/5** | **5/5** |
| Reasoning (15) | **5/5** | **5/5** |
| No markdown | **5/5** | **5/5** |
| Tool: search_memory | 3/5 | 3/5 |
| Tool: web_search | 4/5 | **5/5** |
| Thinking leaked | 0/5 | 0/5 |

Both produce coherent, conversational, markdown-free responses. The 27B is
a proper instruct model — unlike v3's base 9B, it works perfectly with
thinking disabled.

### v4 Results: TTFT — 27B wins 1.5-4.5x

| Test | 27B NVFP4 p50 | 9B Q4_K_M p50 | Ratio | Winner |
|------|---------------|---------------|-------|--------|
| Simple greeting | 192 ms | 304 ms | 0.63x | **27B** |
| Factual question | 195 ms | 296 ms | 0.66x | **27B** |
| Conversational | 199 ms | 367 ms | 0.54x | **27B** |
| Tool call (memory) | 276 ms | 762 ms | 0.36x | **27B** |
| Tool call (web) | 276 ms | 764 ms | 0.36x | **27B** |
| Reasoning | 200 ms | 357 ms | 0.56x | **27B** |
| Long response | 199 ms | 355 ms | 0.56x | **27B** |
| **Multi-turn 1** | 193 ms | 345 ms | 0.56x | **27B** |
| **Multi-turn 3** | 231 ms | 657 ms | **0.35x** | **27B** |
| **Multi-turn 5** | 321 ms | 1434 ms | **0.22x** | **27B** |

DeltaNet linear attention keeps TTFT nearly constant: 193ms → 321ms (1.7x
growth over 5 turns). Standard attention: 345ms → 1434ms (4.2x growth).

Note: 27B TTFT is ~192ms (vs v3's 9B at ~82ms). The 27B has 3x the parameters but
only 2.3x the TTFT — DeltaNet's O(1) attention offsets the larger model.

### v4 Results: Decode Speed — 9B wins 2.5x

| Test | 27B NVFP4 tok/s | 9B Q4_K_M tok/s | Ratio |
|------|-----------------|-----------------|-------|
| Simple greeting | 12.2 | **29.9** | 0.41x |
| Factual question | 13.6 | **35.1** | 0.39x |
| Conversational | 14.3 | **38.1** | 0.38x |
| Reasoning | 10.6 | **25.6** | 0.41x |
| Long response | 14.7 | **37.2** | 0.40x |

27B at ~13 tok/s is unacceptable for voice. Annie needs ~20+ tok/s for
natural-feeling TTS streaming.

### v4 Results: Multi-turn TTFT Scaling

| Turn | 27B NVFP4 p50 | 9B Q4_K_M p50 | Ratio |
|------|---------------|---------------|-------|
| 1 | 193 ms | 345 ms | 0.56x |
| 2 | 202 ms | 377 ms | 0.54x |
| 3 | 231 ms | 657 ms | 0.35x |
| 4 | 278 ms | 1022 ms | 0.27x |
| 5 | 321 ms | 1434 ms | 0.22x |

The crossover for **total latency** depends on response length:
- 50-token response: 9B wins all turns (decode speed dominates)
- 200-token response: 27B wins at turn ~8+ (TTFT dominance at ~2000ms)
- Long conversations (>10 turns): 27B clearly wins
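These crossover estimates follow from a simple additive latency model, total = TTFT + decode time. A sketch with the v4 turn-5 p50 numbers:

```python
def total_latency_ms(ttft_ms, n_tokens, decode_tok_s):
    """Additive model: total latency = TTFT + decode time for n_tokens."""
    return ttft_ms + n_tokens / decode_tok_s * 1000.0

# Turn-5 p50s, 50-token reply: 27B (321 ms TTFT, ~13 tok/s)
# vs 9B (1434 ms TTFT, ~35 tok/s). Decode speed dominates here.
lat_27b = total_latency_ms(321, 50, 13)   # ~4167 ms
lat_9b = total_latency_ms(1434, 50, 35)   # ~2863 ms
```

The 27B only wins once its ~1100 ms TTFT advantage exceeds its decode penalty, which for short voice replies requires many more turns of context growth.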

### v4 Verdict

**Keep 9B Q4_K_M for Annie voice.** The 13 tok/s decode speed of the 27B
makes Annie feel sluggish. Annie's typical responses are 30-80 tokens — at
13 tok/s that's 2-6 seconds of decode time vs 1-2 seconds with the 9B.

**The 27B becomes compelling if:**
1. **MTP speculative decoding** is enabled — model card claims ~19.7 tok/s
   (50% boost). Needs `--speculative-config '{"method":"mtp","num_speculative_tokens":1}'`.
2. **Conversations get long** (>10 turns where TTFT dominance kicks in).
3. A proper Opus-Distilled 27B NVFP4 exists (combines Annie personality
   with DeltaNet TTFT advantage).

---

## Serving NVFP4 on DGX Spark: The Hard-Won Recipe

### The Problem

DGX Spark's GB10 is SM 12.1 (Blackwell). Pre-built wheels for SGLang,
vLLM, and flashinfer target SM ≤ 12.0. The NVFP4 FP4 GEMM kernels
(`cutlass_scaled_fp4_mm` and `flashinfer.gemm.mm_fp4`) fail with:

```
torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
```

### What doesn't work

| Approach | Failure |
|----------|---------|
| SGLang + original VL config | Multimodal processor init fails (tokenizer.convert_ids_to_tokens) |
| SGLang + text-only config (`qwen3_5_text`) | transformers 4.57 doesn't recognize model type |
| SGLang + transformers 5.3 upgrade | `Qwen3_5ForCausalLM` not in EntryClass |
| SGLang + patched EntryClass | `config.num_experts` missing (dense model ≠ MoE) |
| SGLang + patched expert location | `config.layers_block_type` attr name changed in transformers 5.3 |
| SGLang + VL config + preprocessor_config.json | Weight shape mismatch (text-only weights vs VL model paths) |
| vLLM pip install (separate venv) | CPU-only torch (no CUDA aarch64 wheels on PyPI) |
| vLLM pip install (SGLang env, --no-deps) | ABI mismatch (vLLM 0.17.1 built against torch 2.10, SGLang has 2.9.1) |
| vLLM pip install + uv CUDA torch | SM 12.1 kernel missing in `flashinfer-cutlass` backend |
| vLLM + `VLLM_NVFP4_GEMM_BACKEND=cutlass` | SM 12.1 kernel missing in vLLM's own CUTLASS |
| vLLM + `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cudnn` | Same SM 12.1 kernel error |
| Docker `vllm/vllm-openai:qwen3_5` | SM 12.1 kernel missing (CUDA 12.x image) |
| Docker `hellohal2064/vllm-qwen3.5-gb10` | Old transformers doesn't know `qwen3_5` model type |

### What works: Docker cu130-nightly

```bash
# Download the model (~19 GB)
huggingface-cli download surogate/Qwen3.5-27B-NVFP4 \
  --local-dir ~/models/Qwen3.5-27B-NVFP4

# Create preprocessor_config.json (model doesn't ship one)
cat > ~/models/Qwen3.5-27B-NVFP4/preprocessor_config.json << 'EOF'
{
    "size": {"longest_edge": 16777216, "shortest_edge": 65536},
    "patch_size": 16,
    "temporal_patch_size": 2,
    "merge_size": 2,
    "image_mean": [0.5, 0.5, 0.5],
    "image_std": [0.5, 0.5, 0.5],
    "processor_class": "Qwen3VLProcessor",
    "image_processor_type": "Qwen2VLImageProcessorFast"
}
EOF

# Serve with vLLM cu130-nightly Docker image
docker run -d --name vllm-27b \
  --runtime=nvidia --gpus all \
  -v ~/models/Qwen3.5-27B-NVFP4:/model \
  -p 8000:8000 \
  --shm-size=16g \
  vllm/vllm-openai:cu130-nightly \
  --model /model \
  --quantization modelopt \
  --enforce-eager \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --language-model-only \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192

# Wait ~3-4 minutes for model loading, then verify:
curl http://localhost:8000/v1/models
```
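Beyond `curl`, a stdlib-only Python smoke check can confirm the served model id (assumes the container above is up on port 8000):

```python
import json
import urllib.request

def served_model_ids(models_json):
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_json.get("data", [])]

def smoke_check(base_url="http://localhost:8000"):
    """Query the running container's /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return served_model_ids(json.load(resp))
```

For the run above, `smoke_check()` should return `["/model"]`, since vLLM names the served model after its `--model` path.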

### Why cu130-nightly works

- **CUDA 13.0** natively supports SM 12.1 (GB10 Blackwell)
- The image includes flashinfer compiled with SM 12.1 CUTLASS targets
- `--language-model-only` skips vision encoder init (model has VL config
  but text-only weights)
- `--enforce-eager` disables CUDA graphs (not yet stable on SM 12.1)
- `preprocessor_config.json` is needed because the model config declares
  `Qwen3_5ForConditionalGeneration` (VL architecture) — the processor init
  runs even with `--language-model-only`

### Key flags explained

| Flag | Why |
|------|-----|
| `--quantization modelopt` | Tells vLLM to use NVFP4 weight format (E2M1 + FP8 scales) |
| `--language-model-only` | Skips vision encoder — only loads text decoder weights |
| `--enforce-eager` | Disables CUDA graphs (SM 12.1 compat) |
| `--reasoning-parser qwen3` | Enables Qwen3 thinking token parsing |
| `--enable-auto-tool-choice --tool-call-parser qwen3_coder` | Enables tool calling |
| `--gpu-memory-utilization 0.85` | Uses 85% of 128GB unified memory (~108GB) |
| `--max-model-len 8192` | Limits context to save KV cache memory |

### Environment notes

- **vLLM env at `~/vllm-env`**: Has CUDA torch 2.10.0+cu129 (installed via uv).
  FP4 kernels fail on SM 12.1 — use Docker instead.
- **SGLang env at `~/sglang-env`**: transformers upgraded to 5.3 + vLLM 0.17.1
  installed (--no-deps). Both partially broken — use Docker instead.
- **Model at `~/models/Qwen3.5-27B-NVFP4`**: 19GB safetensors + config files.
  `config.json.orig` is backup of original VL config.
