# Research: Running Gemma 4 on DGX Spark with vLLM

> **Date:** 2026-04-05
> **Status:** RESEARCH COMPLETE — clear path forward identified
> **Hardware:** NVIDIA DGX Spark (GB10, SM121, aarch64, 128 GB unified LPDDR5x)
> **Target model:** nvidia/Gemma-4-31B-IT-NVFP4 (pre-quantized NVFP4)
> **Blocking issue:** Our vllm-node image is v0.17.2, which lacks Gemma 4 architecture support

---

## Executive Summary

**There are THREE viable paths to run Gemma 4 on DGX Spark. No source build is needed.**

1. **Official vLLM Docker image** — `vllm/vllm-openai:gemma4-cu130` (arm64 native, released April 2)
2. **eugr/spark-vllm-docker recipe** — builds `vllm-node-tf5` with FP8 + tool parser fixes (community gold standard)
3. **Official vLLM v0.19.0** — `vllm/vllm-openai:v0.19.0-cu130` (arm64 native, released April 3)

The `_C_stable_libtorch` build failure and the cu132/cu131 mismatch are **non-issues** for us — both arise only when building vLLM from source against NGC PyTorch containers. All three paths above use pre-built images/wheels and avoid that failure mode entirely.

---

## 1. Why Our Current Setup Cannot Run Gemma 4

| Component | Our Version | Required | Gap |
|-----------|-------------|----------|-----|
| vLLM | 0.17.2 (vllm-node) | >= 0.19.0 | No `Gemma4ForConditionalGeneration` architecture |
| Transformers | < 5.0 | >= 5.5.0 | Gemma 4 model type unrecognized |
| FlashInfer | < 0.6.6 | >= 0.6.6 | Missing Gemma 4 attention support |

Gemma 4 was released April 2, 2026. vLLM added Gemma 4 support around v0.19.0. The model also requires `transformers >= 5.5.0` (the `--tf5` flag in eugr's build system overrides vLLM's `transformers < 5` constraint).
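The version gap can be caught early in launch scripts before a container even pulls the model; a minimal sketch of such a gate (plain `X.Y.Z` string comparison, no vLLM import needed — the helper names are ours):

```python
def version_tuple(v: str) -> tuple:
    """Parse a plain dotted version string like '0.17.2' into a comparable
    tuple. (Does not handle pre-release suffixes such as '0.19.0rc1'.)"""
    return tuple(int(part) for part in v.split("."))

def supports_gemma4(vllm_version: str, min_version: str = "0.19.0") -> bool:
    """True if the given vLLM version meets the Gemma 4 minimum."""
    return version_tuple(vllm_version) >= version_tuple(min_version)

print(supports_gemma4("0.17.2"))  # our vllm-node image -> False
print(supports_gemma4("0.19.0"))  # -> True
```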

---

## 2. The Three Viable Paths

### Path A: `vllm/vllm-openai:gemma4-cu130` (Simplest)

**Official vLLM image with Gemma 4 baked in.** Released April 2, 2026.

```
Tag:           vllm/vllm-openai:gemma4-cu130
Architecture:  arm64 + amd64 (multi-arch)
Size:          ~9 GB (arm64)
Released:      2026-04-02
CUDA:          13.0
```

Confirmed working on DGX Spark by multiple NVIDIA forum users (damien.rufus, WilliamD, serapis).

**Serve command for NVFP4:**
```bash
docker run -d --name gemma4-31b-nvfp4 \
  --gpus all --ipc host --shm-size 64gb -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-cu130 \
  nvidia/Gemma-4-31B-IT-NVFP4 \
    --served-model-name gemma4-31b \
    --port 8000 --host 0.0.0.0 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.70 \
    --kv-cache-dtype fp8 \
    --load-format safetensors \
    --enable-prefix-caching
```

**Serve command for 26B-A4B with FP8:**
```bash
docker run -d --name gemma4-26b \
  --gpus all --ipc host --shm-size 64gb -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-cu130 \
  google/gemma-4-26B-A4B-it \
    --served-model-name gemma4-26b \
    --port 8000 --host 0.0.0.0 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.70 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --load-format safetensors \
    --enable-prefix-caching
```

**Pros:** Zero build time, official image, proven on Spark.
**Cons:** Frozen at the April 2 vLLM snapshot. No tool parser fix (PR #38909). No `--load-format fastsafetensors` support (use `safetensors` instead). PR #38919's compilation fix is absent but also irrelevant here, since the image ships pre-built.

**Tool calling workaround:** Use `--tool-call-parser pythonic` instead of `--tool-call-parser gemma4` (the gemma4 parser has a known bug, vLLM #38837 / PR #38909).
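For context on the workaround: the `pythonic` parser expects the model to emit tool calls as a Python-style list of calls, e.g. `[get_weather(city="Paris")]`. A minimal sketch of parsing that format (illustrative only — this is not vLLM's actual parser implementation):

```python
import ast

def parse_pythonic_tool_calls(text: str) -> list:
    """Parse a pythonic-style tool-call string such as
    '[get_weather(city="Paris"), get_time(tz="CET")]' into
    [{"name": ..., "arguments": {...}}, ...]."""
    tree = ast.parse(text.strip(), mode="eval")
    calls = []
    for node in tree.body.elts:  # expects a list literal of Call nodes
        calls.append({
            "name": node.func.id,
            "arguments": {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords},
        })
    return calls

print(parse_pythonic_tool_calls('[get_weather(city="Paris")]'))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```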

---

### Path B: eugr/spark-vllm-docker Recipe (Best Performance)

**Community-maintained build system, actively updated (commits as recent as April 4).**

```bash
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker

# Build the vllm-node-tf5 image (uses pre-built wheels, ~3 min)
./build-and-copy.sh --tf5

# Run Gemma 4 26B-A4B recipe
./run-recipe.sh gemma4-26b-a4b --solo
```

This builds a `vllm-node-tf5` Docker image from:
- Base: `nvidia/cuda:13.2.0-devel-ubuntu24.04` (NOT NGC PyTorch)
- PyTorch: nightly cu130 wheels
- vLLM: pre-built wheels from eugr's nightly CI (compiled for SM121)
- FlashInfer: pre-built wheels from eugr's CI
- Transformers: >= 5.0.0 (via `--tf5` override)

The recipe applies:
- vLLM PR #35568 (FP8 kernel fix)
- vLLM PR #38919 (compilation fix for CUDA 13.0)
- Mod: `fix-gemma4-tool-parser` (applies PR #38909 at runtime)

**Gemma 4 26B recipe config (from `recipes/gemma4-26b-a4b.yaml`):**
```yaml
model: google/gemma-4-26B-A4B-it
container: vllm-node-tf5
defaults:
  port: 8000
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 8192
command: |
  vllm serve google/gemma-4-26B-A4B-it \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    -tp {tensor_parallel} --distributed-executor-backend ray
```
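The `{tensor_parallel}` placeholder in the recipe's command is filled in at launch time; a minimal sketch of that substitution, assuming plain `str.format`-style templating (the actual `run-recipe.sh` mechanics may differ):

```python
# Abbreviated copy of the recipe command, with the placeholder intact
RECIPE_COMMAND = (
    "vllm serve google/gemma-4-26B-A4B-it "
    "--quantization fp8 --kv-cache-dtype fp8 "
    "-tp {tensor_parallel} --distributed-executor-backend ray"
)

# --solo runs on a single Spark node, so tensor parallelism is 1
rendered = RECIPE_COMMAND.format(tensor_parallel=1)
print(rendered)
```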

**Pros:** Latest patches, FP8 + tool calling fixed, `fastsafetensors` support, community-tested SM121 optimized wheels, recipe system handles everything.
**Cons:** Requires build (~3 min with pre-built wheels, 20-40 min from source). No 31B NVFP4 recipe yet (only 26B-A4B).

**For 31B NVFP4 with eugr's build (manual):**
```bash
./launch-cluster.sh --solo -t vllm-node-tf5 exec \
  vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.7 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    --load-format fastsafetensors \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --host 0.0.0.0 --port 8000
```

---

### Path C: `vllm/vllm-openai:v0.19.0-cu130` (Official Stable)

**Official vLLM v0.19.0 release, one day newer than the gemma4 tag.**

```
Tag:           vllm/vllm-openai:v0.19.0-cu130
Architecture:  arm64 + amd64 (multi-arch)
Released:      2026-04-03
```

This is the official v0.19.0 stable release. It includes Gemma 4 architecture support natively.

**Pros:** Official stable release, includes Gemma 4 support, arm64 native.
**Cons:** May not include the tool parser fix (PR #38909 is still open). Whether it ships `transformers >= 5` (the equivalent of eugr's `--tf5` override) still needs verification.

---

## 3. The cu132/cu131 Mismatch — Explained and Resolved

### What happened

The spark-vllm-docker Dockerfile uses:
- **Base image:** `nvidia/cuda:13.2.0-devel-ubuntu24.04` (CUDA 13.2)
- **PyTorch install:** `--index-url https://download.pytorch.org/whl/nightly/cu130` (CUDA 13.0 nightly)

The pre-built wheels on eugr's GitHub releases were compiled against `cu132` PyTorch at one point, but NGC 26.01/26.02 containers ship `cu131` PyTorch. This caused the `_ZN3c1013MessageLoggerC1ENS_14SourceLocationEib` undefined symbol error (issue #135).

### Why it is no longer an issue

eugr's Dockerfile was restructured (around late March 2026) to **not use NGC PyTorch containers at all**. Instead, it:
1. Starts from bare `nvidia/cuda:13.2.0-devel-ubuntu24.04`
2. Installs PyTorch cu130 nightly wheels directly
3. Builds vLLM and FlashInfer against those same PyTorch wheels

This eliminates the version mismatch. The pre-built wheels and the runner stage use the same PyTorch from the same source (cu130 nightly).
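A cheap guard against this class of mismatch is to check the local-version tag on the installed torch wheel before building anything against it, assuming upstream-style version strings such as `2.10.0.dev20260401+cu130` (the helper name is ours; NVIDIA's custom builds may use a different suffix scheme):

```python
from typing import Optional

def cuda_tag(torch_version: str) -> Optional[str]:
    """Extract the CUDA local-version tag ('cu130', 'cu131', ...) from a
    PyTorch wheel version string, or None if there is no +cuXXX suffix."""
    _, sep, local = torch_version.partition("+")
    return local if sep and local.startswith("cu") else None

# In a real build script: import torch; tag = cuda_tag(torch.__version__)
print(cuda_tag("2.10.0.dev20260401+cu130"))  # cu130 -> matches eugr's wheels
print(cuda_tag("2.11.0a0+cu131"))            # cu131 -> would mismatch
```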

### If you hit this error anyway

```bash
# Clean slate rebuild
cd spark-vllm-docker
git pull
rm wheels/*.whl
rm wheels/.*-commit
./build-and-copy.sh --tf5
```

---

## 4. The `_C_stable_libtorch` Build Failure — Explained

### What it is

When building vLLM HEAD from source inside NGC 26.01/26.02 containers, CMake fails on the `_C_stable_libtorch` target. This target uses PyTorch's Stable ABI (Application Binary Interface) features that require specific header versions.

### Why it happens

- NGC 26.01: CUDA 13.1.1, PyTorch 2.10.0a0
- NGC 26.02: CUDA 13.1.1, PyTorch 2.11.0a0
- vLLM HEAD: expects PyTorch headers with newer Stable ABI support

The PyTorch in NGC containers is a custom NVIDIA build that may not include all upstream stable ABI changes that vLLM HEAD expects.

### Resolution

**Do not build from source inside NGC containers.** Use one of the three paths above, all of which provide pre-built binaries. eugr's build system uses bare CUDA 13.2 + upstream PyTorch nightly, which avoids the NGC PyTorch ABI mismatch entirely.

If you must build from source, PR #38919 addresses a related compilation failure (the `cuMemcpyBatchAsync` signature change between CUDA 12.x and 13.0).

---

## 5. NGC Container Version Reference

| Container | CUDA | PyTorch | Status for Gemma 4 |
|-----------|------|---------|-------------------|
| `nvcr.io/nvidia/pytorch:26.01-py3` | 13.1.1 | 2.10.0a0 | Cannot build vLLM HEAD (ABI mismatch) |
| `nvcr.io/nvidia/pytorch:26.02-py3` | 13.1.1 | 2.11.0a0 | Cannot build vLLM HEAD (ABI mismatch) |
| `nvcr.io/nvidia/pytorch:26.03-py3` | 13.2.0 | 2.11.0a0 | Used by TurboQuant research (vLLM 0.19.1 built successfully) |
| `nvidia/cuda:13.2.0-devel-ubuntu24.04` | 13.2.0 | (none) | eugr's Dockerfile base; works with cu130 nightly PyTorch |

**Key insight:** NGC 26.03 (CUDA 13.2) aligns with eugr's Dockerfile base and is where successful source builds have been reported. If you ever need to build from source, start from 26.03 or bare CUDA 13.2.

---

## 6. Performance Numbers from the Community

### Gemma 4 26B-A4B (MoE) on Single DGX Spark

| Configuration | Decode tok/s | Prefill pp2048 | Source |
|---------------|-------------|----------------|--------|
| BF16 (gemma4-cu130) | 23.7 | 3,105 t/s | WilliamD (NVIDIA forums) |
| FP8 on-the-fly (eugr recipe) | 36-40 | ~7,600 t/s | eugr, mikee.gwu |
| FP8 (latest builds) | **45-57** | ~7,600 t/s | serapis, blainesworld, lujun1255 |

### Gemma 4 31B Dense on Single DGX Spark

| Configuration | Decode tok/s | Prefill pp2048 | Source |
|---------------|-------------|----------------|--------|
| BF16 | 3.7 | 1,066 t/s | WilliamD |
| AWQ INT8 | 6.5 | 430 t/s | WilliamD |
| AWQ INT4 | 10.6 | 810 t/s | WilliamD |
| **NVFP4** | **~6.9** | ~1,550 t/s | damien.rufus, eugr |

### NVFP4 Performance Assessment

eugr's assessment of the 31B NVFP4: **"Performance is meh, as based on the size, most of the weights weren't really quantized to 4 bits."** The NVFP4 checkpoint achieves only ~6.9 tok/s decode — barely faster than AWQ INT8 (6.5 tok/s) and significantly slower than AWQ INT4 (10.6 tok/s).

Forum user trithemius: **"31b nvfp4 is painfully slow."**

**This is a critical finding for our benchmark plan.** The 31B NVFP4 may not be worth benchmarking if it only achieves ~6.9 tok/s. The 26B-A4B with FP8 at 45-57 tok/s is dramatically faster.

---

## 7. NVFP4 Kernel Status on SM121

From the "PSA: State of FP4/NVFP4 Support" forum thread:

- CUTLASS v4.4.2 (in vLLM PR #38423) enables SM120/SM121 NVFP4 tile constraints
- FlashInfer 0.6.7+ includes CUTLASS 4.4.2 for native NVFP4 on Blackwell
- Earlier CUTLASS versions caused `IllegalInstruction` on SM121 with NVFP4 MoE
- A critical GDC (Grid Dependency Control) race condition was fixed in FlashInfer PR #2913

However, even with proper kernel support, the 31B Dense NVFP4 is fundamentally bandwidth-limited on DGX Spark's 273 GB/s memory bus (vs 3.35 TB/s on H100). With 31B active params per token, it must stream ~15-20 GB of weights per forward pass, limiting decode to ~7-14 tok/s regardless of quantization format.
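The ceiling above follows from simple roofline arithmetic; a sketch of the estimate (the weight-stream size and the 50-75% effective-bandwidth range are assumptions consistent with the figures stated above):

```python
BANDWIDTH_GBPS = 273        # DGX Spark LPDDR5x peak memory bandwidth
WEIGHTS_GB = (15, 20)       # ~GB streamed per forward pass, 31B dense NVFP4
EFFICIENCY = (0.50, 0.75)   # assumed fraction of peak bandwidth achieved

# Decode is memory-bound: each token requires one full pass over the weights
lo = BANDWIDTH_GBPS / WEIGHTS_GB[1] * EFFICIENCY[0]  # worst case
hi = BANDWIDTH_GBPS / WEIGHTS_GB[0] * EFFICIENCY[1]  # best case
print(f"decode ceiling: {lo:.1f}-{hi:.1f} tok/s")
# Roughly matches both the ~7-14 tok/s bound and the observed ~6.9 tok/s
```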

---

## 8. Gemma 4 Quirks on DGX Spark

1. **TRITON_ATTN forced:** Gemma 4 has heterogeneous head dimensions (256 + 512). vLLM auto-detects this and forces TRITON_ATTN backend. No user action needed.

2. **Tool calling bug:** `Gemma4ToolParser.__init__()` takes wrong arguments (vLLM issue #38837). Fixed by PR #38909 (open, not merged). eugr's recipe applies this as a runtime mod. Workaround: use `--tool-call-parser pythonic`.

3. **Thinking mode:** Gemma 4 supports thinking/reasoning. Use `--reasoning-parser gemma4` to properly parse thinking tokens. If thinking is enabled by default, pass `think: false` in the request to disable.

4. **`fastsafetensors` availability:** Not available in the `gemma4-cu130` image. Use `--load-format safetensors` with the official image. eugr's builds include `fastsafetensors`.
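Combining quirks 2 and 3, a chat request against the workaround configuration might look like the sketch below (the `think` field follows the description in quirk 3; treat its exact name and placement as unverified against the server version in use):

```python
import json

payload = {
    "model": "gemma4-31b",
    "messages": [
        {"role": "user", "content": "Summarize the Spark memory layout."}
    ],
    # Quirk 3: disable thinking-mode output for latency-sensitive calls
    "think": False,
}
print(json.dumps(payload, indent=2))
```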

---

## 9. Recommended Path for Our Benchmark

Given our goals (benchmark Gemma 4 31B NVFP4 on Beast vs Super, and 26B FP8 on Titan vs Nano):

### For Titan (26B-A4B FP8):

**Use eugr's recipe (Path B).**
```bash
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh --tf5
./run-recipe.sh gemma4-26b-a4b --solo
```

This gives FP8 + tool calling + latest patches. Expected: 45-57 tok/s (vs Nano's 48-65 tok/s).

### For Beast (31B NVFP4):

**Use `vllm/vllm-openai:gemma4-cu130` (Path A) or build eugr's `vllm-node-tf5`.**

```bash
docker pull vllm/vllm-openai:gemma4-cu130

docker run -d --name gemma4-31b-nvfp4 \
  --gpus all --ipc host --shm-size 64gb -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-cu130 \
  nvidia/Gemma-4-31B-IT-NVFP4 \
    --served-model-name gemma4-31b \
    --port 8000 --host 0.0.0.0 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.70 \
    --kv-cache-dtype fp8 \
    --load-format safetensors \
    --enable-prefix-caching
```

**WARNING:** Community reports ~6.9 tok/s for 31B NVFP4. This is dramatically slower than Super (which we need to baseline first). Consider whether the 31B NVFP4 benchmark is even worth running, given the 26B-A4B FP8 achieves 45-57 tok/s.

### Alternative for Beast: 26B-A4B FP8 instead of 31B NVFP4

Given the NVFP4 performance data, benchmarking 26B-A4B FP8 on Beast may be more informative:
- 45-57 tok/s (close to or exceeding Super)
- MoE = more memory efficient (86 GB loaded, 3.8B active params)
- Better for concurrent requests
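Whichever model lands on Beast, decode throughput should be computed the same way across boxes so the numbers are comparable with the community tables above; a minimal helper we might use (the name and method are ours, not taken from the forum posts):

```python
def decode_tok_per_s(completion_tokens: int, first_token_s: float,
                     total_s: float) -> float:
    """Decode throughput: completion tokens after the first, divided by
    time spent decoding (total wall time minus time-to-first-token).
    Excluding TTFT separates prefill cost from decode speed."""
    if total_s <= first_token_s or completion_tokens < 2:
        raise ValueError("need >= 2 tokens and total_s > first_token_s")
    return (completion_tokens - 1) / (total_s - first_token_s)

# e.g. 512 completion tokens, TTFT 0.8 s, total 12.0 s -> ~45.6 tok/s
print(f"{decode_tok_per_s(512, 0.8, 12.0):.1f} tok/s")
```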

---

## 10. Also Available: vLLM v0.19.0 + cu130-nightly

For reference, these additional official images exist on Docker Hub with arm64 support:

| Tag | Date | Notes |
|-----|------|-------|
| `vllm/vllm-openai:v0.19.0-cu130` | 2026-04-03 | Official v0.19.0 stable |
| `vllm/vllm-openai:latest-cu130` | 2026-04-03 | Same as v0.19.0 |
| `vllm/vllm-openai:cu130-nightly` | 2026-04-05 | Latest nightly (updated daily) |
| `vllm/vllm-openai:gemma4-cu130` | 2026-04-02 | Gemma 4 specific build |

All have `arm64` architecture support confirmed via Docker Hub API.

---

## Sources

- [eugr/spark-vllm-docker](https://github.com/eugr/spark-vllm-docker) — community build system, Gemma 4 recipe
- [NVIDIA DGX Spark vLLM Playbook](https://build.nvidia.com/spark/vllm) — official `gemma4-cu130` image reference
- [Gemma 4 Day-1 Benchmarks](https://forums.developer.nvidia.com/t/gemma-4-day-1-inference-on-nvidia-dgx-spark-preliminary-benchmarks/365503) — WilliamD's day-1 numbers
- [Gemma 4 vLLM Version Discussion](https://forums.developer.nvidia.com/t/gemma-4-models-which-vllm-version-any-prs-spotted/365490) — 61+ posts, community performance data
- [How to Run Gemma-4-NVFP4](https://forums.developer.nvidia.com/t/how-to-run-gemma-4-nvfp4-in-vllm-docker/365513) — NVFP4 specific instructions
- [Gemma 4 26B at 45-60 tok/s](https://forums.developer.nvidia.com/t/someone-post-this-gemma-4-26b-a4b-moe-running-at-45-60-tok-s-on-dgx-spark/365547) — FP8 performance confirmations
- [NVFP4 Support PSA](https://forums.developer.nvidia.com/t/psa-state-of-fp4-nvfp4-support-for-dgx-spark-in-vllm/353069) — SM121 NVFP4 kernel status
- [TurboQuant vLLM 0.19.1 on Spark](https://forums.developer.nvidia.com/t/dgx-spark-gb10-vllm-0-19-1-turboquant-kv-cache-integration-results-on-qwen3-5-and-nemotron/365627) — NGC 26.03 + vLLM 0.19.1 source build
- [davistroy/spark Gemma 4 Experiment Plan](https://github.com/davistroy/spark/blob/main/GEMMA4_EXPERIMENT_PLAN.md) — detailed A/B experiment plan
- [vLLM PR #38919](https://github.com/vllm-project/vllm/pull/38919) — cuMemcpyBatchAsync fix for CUDA 13.0
- [vLLM PR #38909](https://github.com/vllm-project/vllm/pull/38909) — Gemma4 tool parser streaming fix
- [spark-vllm-docker Issue #135](https://github.com/eugr/spark-vllm-docker/issues/135) — cu132/cu131 symbol mismatch
- [spark-vllm-docker Issue #158](https://github.com/eugr/spark-vllm-docker/issues/158) — Gemma 4 recipe missing --tf5
- [Docker Hub vllm/vllm-openai tags](https://hub.docker.com/r/vllm/vllm-openai/tags) — all available images
