# Research: vLLM Migration — NGC 0.12 → spark-vllm-docker 0.17

**Date:** 2026-03-20
**Session:** 353
**Status:** Complete — deployed on both Titan and Beast

## Summary

Migrated both DGX Sparks from a hand-crafted NGC vLLM 0.12 container to NVIDIA's spark-vllm-docker framework (vLLM 0.17.2). Startup time dropped ~10x (10 min → 60-89s), prefix caching is now enabled, and there is no inference speed regression.

## Background

Titan was running `nvcr.io/nvidia/vllm:25.12.post1-py3` (vLLM 0.12.0) with manually specified Docker flags in start.sh. Beast was already using `spark-vllm-docker` (vLLM 0.17.2rc1) with NVIDIA's optimized recipes. Investigation started when prefix caching was found to be OFF on Titan, and a YouTube video (QIZz4AF0U24) raised questions about whether Unsloth models had special caching.

## Key Findings

### 1. Prefix Caching Status (pre-migration)

| Machine | vLLM Version | Prefix Caching | Source |
|---------|-------------|----------------|--------|
| Titan | 0.12.0 (NGC) | **OFF** (`enable_prefix_caching=False`) | docker logs |
| Beast | 0.17.2 (spark-vllm-docker) | **ON** (`enable_prefix_caching=True`) | docker logs |

Beast got prefix caching from the spark-vllm-docker recipe. Titan never had it: the flag was missing from start.sh, and the vLLM 0.12 V1 engine does not default it to True, despite prefix caching often being described as a V1 default.

### 2. Unsloth Prompt Caching — Debunked

A YouTube video (QIZz4AF0U24: "How Prompt Caching Makes Local LLMs Fly") showed Nemotron 3 Nano from LM Studio Community had broken prompt caching, while the Unsloth GGUF version worked. Investigation revealed:

- **Unsloth does NOT embed prompt caching into GGUF files.** GGUF is a static weight format.
- The issue was **GGUF metadata compatibility** with llama.cpp's KV cache checkpoint system. LM Studio Community's converter produced files that broke checkpoint save/restore. Unsloth's converter preserved the right metadata.
- Unsloth's "2x faster" claim is about **training speed** (custom Triton kernels for backprop), not inference speed.
- **Irrelevant to us**: We use vLLM + NVFP4 (not GGUF, not llama.cpp). vLLM V1 has its own hash-based Automatic Prefix Caching (APC).
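The hash-based APC mentioned in the last bullet keys KV-cache blocks by a hash chain over token blocks, so requests sharing a prefix map to the same block keys. An illustrative sketch (vLLM's real implementation uses its own block size and hashes extra inputs such as LoRA IDs):

```python
from hashlib import sha256

BLOCK = 4  # illustrative block size; vLLM's real KV block size is larger (e.g. 16)

def block_hashes(tokens: list[int], block: int = BLOCK) -> list[str]:
    """Chain-hash full token blocks the way vLLM V1's automatic prefix
    caching keys its KV blocks: each block's key commits to the parent
    block's key plus the block's own tokens."""
    hashes, parent = [], ""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        parent = sha256((parent + str(tokens[i:i + block])).encode()).hexdigest()
        hashes.append(parent)
    return hashes

# Two prompts sharing a prefix reuse the same leading block key:
a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9])
assert a[0] == b[0] and a[1] != b[1]
```

Because the chain includes the parent key, a divergence in any block invalidates all later blocks, which is why only stable shared prefixes (like a fixed system prompt) produce hits.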

### 3. Why NGC 0.12 Was Slow at Startup

Comparison of startup phases on identical DGX Spark hardware:

| Phase | NGC 0.12 (Titan) | spark-vllm 0.17 (Beast) | Ratio |
|-------|-----------------|------------------------|-------|
| Weight loading | 100s (5 shards, 20s/shard, `auto` loader) | 84s (17 shards, 5s/shard, `fastsafetensors`) | 1.2x |
| torch.compile + warmup | **478s** | **49s** | **10x** |
| CUDA graph capture | 25s (24s/graph decode) | 2s (5 graphs/sec) | **12x** |
| **Init engine total** | **478s** | **49s** | **10x** |

Root cause: vLLM 0.12 lacks the `fast_moe_cold_start` path, the optimized torch.compile passes, and the efficient CUDA graph capture that 0.17 ships with.

### 4. spark-vllm-docker Migration

**What is spark-vllm-docker:** Community project (`github.com/eugr/spark-vllm-docker`) that builds vLLM from source on PyTorch 26.01 base, with DGX Spark patches. Provides YAML recipes for common models.

**Migration steps:**
1. Cloned repo on Titan (`~/spark-vllm-docker`)
2. Transferred pre-built `vllm-node` image from Beast (25.5 GB, via `docker save | pv | ssh titan docker load`)
3. Used existing `nemotron-3-nano-nvfp4.yaml` recipe (already in repo)
4. Added `--served-model-name nemotron-nano` to recipe on Titan (needed for Annie Voice compatibility)
5. Updated `start.sh` to use `run-recipe.py` instead of raw `docker run`

**First startup was slow (~19 min):** HuggingFace model download (19 GB, one-time) plus the first torch.compile cache build. The recipe references the HF model ID, not a local `/models/` path.

**Subsequent startups: 60-89 seconds.**

### 5. Beast: --enforce-eager Removed

Beast was using a custom "eager" recipe that disabled CUDA graphs for faster startup. Switched to the original recipe:

| | Before (eager) | After (original) |
|---|---|---|
| CUDA graphs | OFF | **ON** (FULL_AND_PIECEWISE) |
| Prefix caching | ON | ON |
| Startup | ~10 min | **3 min** |
| `enforce_eager` | True | **False** |

CUDA graphs work fine with Nemotron Super's Mamba+MoE hybrid architecture on vLLM 0.17.

### 6. Marlin vs FlashInfer Benchmark

Systematic A/B test on Titan with 4 configurations. Simple prompt: system prompt (Annie, 1 sentence) + "What is 2+2?" with max_tokens=30. 5 runs each, 3 warmup requests.

| Test | Backend | Prefix Cache | E2E Latency | Decode Speed | Startup |
|------|---------|-------------|-------------|-------------|---------|
| **D** | **Marlin** | **ON** | **561ms** | **53.4 tok/s** | 89s |
| A | Marlin | OFF | 560ms | 53.5 tok/s | 85s |
| B2 | FlashInfer-cutlass | ON | 595ms | 50.4 tok/s | 65s |
| C | FlashInfer-cutlass | OFF | 596ms | 50.3 tok/s | 60s |

**Findings:**
- **Marlin is 6% faster than FlashInfer-cutlass** (561 vs 595ms) despite the "weight-only decompression" warning
- **Prefix caching has zero overhead** (D vs A: 561 vs 560ms) — safe to enable
- **All configs produce ~560-596ms** — the previous 88-130ms "TTFT" was measured differently (likely streaming first-token with a tiny prompt, not full E2E with reasoning)
- **Winner: Marlin + prefix caching ON** (Test D)
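The timing methodology above (3 untimed warmups, then 5 timed end-to-end runs) can be sketched as a small endpoint-agnostic harness; `send` is a stand-in for the actual chat-completions request, not part of the real test script:

```python
import statistics
import time
from typing import Callable

def bench(send: Callable[[], None], runs: int = 5, warmup: int = 3) -> dict:
    """Fire `warmup` untimed requests to prime caches/compile paths,
    then time `runs` requests end-to-end and report mean/stdev in ms."""
    for _ in range(warmup):
        send()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        send()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return {"mean_ms": statistics.mean(samples),
            "stdev_ms": statistics.stdev(samples)}

# Dummy workload standing in for the HTTP request:
stats = bench(lambda: time.sleep(0.01))
```

Warming up first matters here because the first requests after startup pay one-off costs (compile caches, prefix-cache population) that would skew the mean.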

Note: The old NGC 0.12 config used `VLLM_USE_FLASHINFER_MOE_FP4=1` + `VLLM_FLASHINFER_MOE_BACKEND=latency`. These env vars don't exist in vLLM 0.17 — the valid NVFP4 backends are: `flashinfer-cudnn`, `flashinfer-trtllm`, `flashinfer-cutlass`, `cutlass`, `marlin`.
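Since the old env vars fail silently rather than erroring, a small guard when porting recipes can catch them. A sketch using only the facts above (the stale 0.12 vars and the 0.17 backend list); `check_recipe_env` is a hypothetical helper, not part of spark-vllm-docker:

```python
# Backend values accepted for VLLM_NVFP4_GEMM_BACKEND in vLLM 0.17 (per note above)
VALID_NVFP4_BACKENDS = {"flashinfer-cudnn", "flashinfer-trtllm",
                        "flashinfer-cutlass", "cutlass", "marlin"}
# Env vars from the NGC 0.12 config that no longer exist in 0.17
STALE_VARS = {"VLLM_USE_FLASHINFER_MOE_FP4", "VLLM_FLASHINFER_MOE_BACKEND"}

def check_recipe_env(env: dict[str, str]) -> list[str]:
    """Flag env vars that silently do nothing on 0.17 and backend
    values outside the 0.17 NVFP4 backend list."""
    problems = [f"stale env var: {k}" for k in env if k in STALE_VARS]
    backend = env.get("VLLM_NVFP4_GEMM_BACKEND")
    if backend is not None and backend not in VALID_NVFP4_BACKENDS:
        problems.append(f"unknown NVFP4 backend: {backend}")
    return problems

print(check_recipe_env({"VLLM_USE_FLASHINFER_MOE_FP4": "1"}))
# ['stale env var: VLLM_USE_FLASHINFER_MOE_FP4']
```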

### 7. Mamba Cache Mode Warning

When prefix caching is enabled on Nemotron (Mamba+MoE hybrid), vLLM 0.17 forces:
```
Mamba cache mode is set to 'all' for NemotronHForCausalLM by default
when prefix caching is enabled. Its support for Mamba layers is experimental.
```

This saves and restores the full SSM state for every Mamba layer. The benchmark shows no measurable overhead (Test D vs Test A); monitor for correctness issues.

## Final Configuration

### Titan (Nemotron 3 Nano 30B NVFP4)

```yaml
# ~/spark-vllm-docker/recipes/nemotron-3-nano-nvfp4.yaml
env:
  VLLM_NVFP4_GEMM_BACKEND: "marlin"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"

command: |
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
  --enable-prefix-caching
  --load-format fastsafetensors
  --kv-cache-dtype fp8
  --enable-auto-tool-choice
  --tool-call-parser qwen3_coder
  --reasoning-parser nano_v3
  --served-model-name nemotron-nano
```

Launch: `run-recipe.py nemotron-3-nano-nvfp4 --solo --gpu-mem 0.25 --max-model-len 32768 --port 8003 --max-num-seqs 8`

### Beast (Nemotron 3 Super 120B NVFP4)

Uses original `nemotron-3-super-nvfp4.yaml` recipe (no --enforce-eager). Same Marlin backend, prefix caching ON, CUDA graphs ON.

## Performance Summary

| Metric | Before (NGC 0.12) | After (spark-vllm 0.17) |
|--------|-------------------|------------------------|
| Startup (Titan) | ~10 min | **60-89s** |
| Startup (Beast) | ~10 min | **3 min** |
| Prefix caching | OFF (Titan) | **ON (both)** |
| Decode speed | 48-65 tok/s | 50-54 tok/s (same) |
| E2E latency (30 tok) | ~500ms | ~560ms (same ballpark) |
| vLLM version | 0.12.0 | 0.17.2rc1 |
| Weight loading | 100s | **7.6s** (fastsafetensors) |
| torch.compile | 478s | **22s** |
| CUDA graphs | 25s | **1-3s** |

## Open Items

1. **HF model download on first start**: Recipe uses HF model ID, causing 19 GB download on first launch. Could symlink local `/home/rajesh/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4` into HF cache to avoid this.
2. **Prefix cache hit rate**: Currently 0.0% in logs — need sustained multi-turn traffic to see hits. Dashboard context polling uses different prompts each time, so hits are expected only during voice sessions with stable system prompts.
3. **Marlin "no native FP4" warning**: Marlin does weight-only FP4 decompression on Blackwell instead of native FP4 compute. Despite the warning, it benchmarks faster than FlashInfer-cutlass. Worth re-testing when vLLM adds native Blackwell FP4 support.
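For item 1, the HF hub cache could be pre-populated by hand instead of re-downloading. An untested sketch, assuming huggingface_hub's documented `models--{org}--{name}/snapshots/{rev}` cache layout; `"local"` is a placeholder revision (the upstream commit hash is unknown), and the server may still need `HF_HUB_OFFLINE=1` to skip Hub lookups:

```python
import os
from pathlib import Path

def link_local_model(local_dir: str, repo_id: str,
                     hf_hub: str = "~/.cache/huggingface/hub",
                     revision: str = "local") -> Path:
    """Expose an already-downloaded model directory through the HF hub
    cache layout (models--{org}--{name}/snapshots/{rev} + refs/main),
    so `vllm serve <repo_id>` can resolve it without re-downloading."""
    root = Path(hf_hub).expanduser() / ("models--" + repo_id.replace("/", "--"))
    (root / "refs").mkdir(parents=True, exist_ok=True)
    (root / "snapshots").mkdir(exist_ok=True)
    (root / "refs" / "main").write_text(revision)
    snap = root / "snapshots" / revision
    if not snap.exists():
        os.symlink(Path(local_dir).expanduser().resolve(), snap)
    return snap
```

Usage would be `link_local_model("/home/rajesh/models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4", "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4")` before the first `run-recipe.py` launch.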

## Commits

- `32a08eb` — Beast: switch to original recipe (CUDA graphs + prefix caching)
- `981e1e1` — Titan: switch to spark-vllm-docker recipe
