# Benchmark Execution Log: Qwen3.5-9B Opus-Distilled NVFP4 vs Q4_K_M

**Date:** 2026-03-14
**Machine:** DGX Spark (GB10), SM121, 128 GB unified LPDDR5x, CUDA 13.0
**Operator:** Claude Code (automated)
**Goal:** Apples-to-apples benchmark — same Opus-Distilled fine-tuning, different
quantization (NVFP4 vs Q4_K_M) and runtime (vLLM/SGLang vs llama-server).

**Status: ABANDONED** — Quantization succeeded (3.8 min, 7.4 GB), but no inference
runtime supports Qwen3.5-9B NVFP4 on DGX Spark GB10 today. The ecosystem is not
ready for the combination of Qwen3.5 DeltaNet architecture + NVFP4 quantization +
SM121 GPU. Keeping Q4_K_M on llama-server.

---

## Phase 0: Research (Complete)

- Created `docs/PLAN-NVFP4-BENCHMARK.md` — 6-phase plan
- Created `docs/RESEARCH-NVFP4-QUANTIZATION.md` — quantization methodology
- Confirmed no existing NVFP4 of Opus-Distilled 9B on HuggingFace
- Identified source model: `Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled`

---

## Phase 1: Quantization (Complete)

### Step 1.1: Download source model

```bash
# huggingface-cli is in Annie voice's venv on Titan
/home/rajesh/workplace/her/her-os/services/annie-voice/.venv/bin/huggingface-cli \
    download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled \
    --local-dir ~/models/Qwen3.5-9B-Opus-Distilled-BF16
```

**Result:** 18 GB downloaded in ~15 minutes. 4 safetensors shards + tokenizer + config.

### Step 1.2: Pre-installed containers inventory

DGX Spark ships these containers (NOT pip packages):

| Container | Tag | Size | Has ModelOpt? | Has Qwen3.5? |
|-----------|-----|------|---------------|--------------|
| `nvcr.io/nvidia/pytorch` | 25.11-py3 | 19.5 GB | **Yes (0.37.0)** | No (needs transformers upgrade) |
| `vllm/vllm-openai` | cu130-nightly | 20.3 GB | No | No (transformers 4.57) |
| `lmsysorg/sglang` | spark | 25.5 GB | No | No (transformers 4.57, SGLang 0.5.4) |
| `nvcr.io/nim/qwen/qwen3-32b-dgx-spark` | latest | 21.6 GB | N/A | Yes (NIM) |

**Key finding:** ModelOpt 0.37.0 in PyTorch container ALREADY supports NVFP4
(has `NVFP4_DEFAULT_CFG`, `NVFP4_MLP_ONLY_CFG`, etc.). No need to pull the
TensorRT container.

### Step 1.3: Quantization

```bash
# Pre-create output directory (IMPORTANT: Docker writes as root)
mkdir -p ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4

# Run inside PyTorch NGC container
docker run --gpus all --rm --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/models:/models \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install -q --root-user-action=ignore transformers accelerate datasets && \
           python3 /models/quantize_nvfp4.py"
```

**Script:** `scripts/quantize_nvfp4.py`
- Loads BF16 model (16.7 GB GPU)
- Creates 512 calibration samples from cnn_dailymail (manual tokenization to
  avoid transformers 5.x/ModelOpt 0.37 `batch_encode_plus` API mismatch)
- Quantizes with `mtq.NVFP4_DEFAULT_CFG` (843 quantizers inserted)
- Exports to HuggingFace checkpoint format

**Results:**
| Metric | Value |
|--------|-------|
| Model load time | 55.9s |
| Calibration prep | 8.7s |
| **Quantization time** | **229.2s (3.8 minutes)** |
| Export time | 13.3s |
| **Total time** | **4.9 minutes** |
| Output size | 7.4 GB (one safetensors file + configs) |
| GPU memory during quant | 16.7 GB |

**Note:** Estimated 45-90 min from research; actual 3.8 min. DGX Spark's unified
memory (zero CPU-GPU transfer) and well-optimized ModelOpt explain the speed.
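As a sanity check on the output size, a back-of-envelope estimate (the ~9B parameter count and NVFP4's bit layout of 4-bit values plus one FP8 scale per 16-element block are assumptions for illustration, not read from the checkpoint):

```python
# Rough NVFP4 size estimate; parameter count and bit layout are assumptions.
params = 9.0e9                 # ~9B weights
bits_per_weight = 4 + 8 / 16   # 4-bit values + one 8-bit (FP8 E4M3) scale per 16-block
quantized_gb = params * bits_per_weight / 8 / 1e9
print(f"{quantized_gb:.2f} GB")  # 5.06 GB for the quantized tensors alone
```

The gap to 7.4 GB is plausibly the tensors left unquantized (embeddings, lm_head, norms) remaining in BF16.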

### Step 1.4: Fix file ownership

Docker container runs as root, so output files are owned by root:root.
```bash
docker run --rm -v ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4:/fix \
    alpine chown -R 1000:1000 /fix
```
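If you would rather skip the throwaway Alpine container, the same fix can be sketched with the Python stdlib, run as root on the host (hypothetical helper; UID/GID 1000 taken from the chown command above):

```python
import os

def chown_tree(root: str, uid: int, gid: int) -> int:
    """Recursively chown a directory tree; returns the number of entries changed."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        os.chown(dirpath, uid, gid)   # the directory itself
        count += 1
        for name in filenames:
            os.chown(os.path.join(dirpath, name), uid, gid)
            count += 1
    return count

# e.g. chown_tree(os.path.expanduser("~/models/Qwen3.5-9B-Opus-Distilled-NVFP4"), 1000, 1000)
```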

**Output files:**
```
~/models/Qwen3.5-9B-Opus-Distilled-NVFP4/
  config.json             3.0K
  hf_quant_config.json      267
  model.safetensors       7.5G
  tokenizer.json           20M
  tokenizer_config.json   1.2K
  chat_template.jinja     4.0K
  generation_config.json    142
```

---

## Phase 2: Serve NVFP4 Model (BLOCKED)

### Problem 1: Transformers version mismatch (FIXED with config patches)

The quantization ran with transformers 5.3.0 (PyTorch 25.11 container), which
wrote new-style class names. ALL inference containers have transformers ~4.57.

| Field | transformers 5.3 wrote | Fix for 4.57 |
|-------|------------------------|--------------|
| `model_type` | `"qwen3_5_text"` | `"qwen3_5"` |
| `architectures` | `["Qwen3_5ForCausalLM"]` | `["Qwen3_5ForConditionalGeneration"]` |
| `tokenizer_class` | `"TokenizersBackend"` | `"Qwen2Tokenizer"` |
| `processor_class` | `"Qwen3VLProcessor"` | `"Qwen2VLProcessor"` |
| `image_processor_type` | `"Qwen2VLImageProcessorFast"` | `"Qwen2VLImageProcessor"` |
| `video_processor_type` | `"Qwen3VLVideoProcessor"` | `"Qwen2VLVideoProcessor"` |

**Resolution:** Patched config.json, tokenizer_config.json, and added
preprocessor_config.json (copied from BF16 source). After patches, vLLM
recognizes the model, loads weights, and begins warmup.
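The patch step is mechanical enough to script; a minimal stdlib sketch (field values copied from the table above; point it at the real `config.json` and `tokenizer_config.json` to apply):

```python
import json
from pathlib import Path

# Patches from the table above: transformers 5.3 names -> 4.57 names.
CONFIG_PATCHES = {
    "model_type": "qwen3_5",
    "architectures": ["Qwen3_5ForConditionalGeneration"],
}
TOKENIZER_PATCHES = {
    "tokenizer_class": "Qwen2Tokenizer",
    "processor_class": "Qwen2VLProcessor",
}

def patch_json(path: Path, patches: dict) -> dict:
    """Overwrite selected top-level fields of a JSON config in place."""
    data = json.loads(path.read_text())
    data.update(patches)
    path.write_text(json.dumps(data, indent=2) + "\n")
    return data
```

All other fields (quantization config, dtype, rope settings) pass through untouched.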

### Problem 2: vLLM rotary embedding crash (BLOCKER)

After config fixes, vLLM crashes during model warmup:

```
RuntimeError: query, key and positions must have the same batch_size and seq_len
```

**Location:** `vllm/model_executor/layers/rotary_embedding.py` →
`torch.ops._C.rotary_embedding()`

**Root cause:** Qwen3.5's DeltaNet architecture uses `partial_rotary_factor: 0.25`
(only 25% of head dimensions get rotary embeddings). vLLM's NVFP4 kernel
produces mismatched tensor shapes with this non-standard rotary factor.

**Known issue:** [vLLM #35519](https://github.com/vllm-project/vllm/issues/35519)
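To make the mismatch concrete, here is the channel split a partial rotary factor implies (pure-stdlib sketch; the head_dim of 128 is an illustrative assumption, not read from the actual config):

```python
def rotary_split(head_dim: int, partial_rotary_factor: float) -> tuple[int, int]:
    """Only the first rotary_dim channels of each head get rotary position
    embeddings; the remaining channels pass through unrotated."""
    rotary_dim = int(head_dim * partial_rotary_factor)
    return rotary_dim, head_dim - rotary_dim

rot, passthrough = rotary_split(128, 0.25)
print(rot, passthrough)  # 32 96
```

A fused kernel that implicitly assumes `rotary_dim == head_dim` would slice q/k into shapes that no longer match the positions tensor, which is consistent with the error above.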

### All containers tested

| Container | Version | Config Fix | Result |
|-----------|---------|------------|--------|
| `lmsysorg/sglang:spark` | SGLang 0.5.4, tf 4.57.1 | None | `model type qwen3_5_text not recognized` |
| SGLang + `pip install transformers` | SGLang 0.5.4, tf 5.3.0 | None | PEP 668 blocked pip; `--break-system-packages` breaks `AutoImageProcessor.register()` |
| `vllm/vllm-openai:cu130-nightly` | vLLM 0.16.0rc2, tf 4.57.6 | Step-by-step | Config → tokenizer → preprocessor → **rotary crash** |
| `vllm/vllm-openai:qwen3_5` | vLLM 0.16.0rc2, tf 4.57.6 | Full | Same **rotary crash** |
| `hellohal2064/vllm-qwen3.5-gb10` | vLLM 0.16.0rc1, tf 4.57.3 | Full | Same **rotary crash** |
| `vllm cu130-nightly + --enforce-eager` | vLLM 0.16.0rc2, tf 4.57.6 | Full | Loaded model, passed SM121 check, but **out-of-memory** error (llama-server was occupying the GPU) |
| `vllm cu130-nightly + enforce-eager + 0.5 mem` | Same | Full | **rotary crash** (same root cause) |

### SGLang separately blocked

SGLang 0.5.4 pins `transformers==4.57.1`. Upgrading breaks SGLang because
the `AutoImageProcessor.register()` signature changed between transformers
4.57 → 5.x (`exist_ok` parameter position moved).

---

## Phase 3-6: Benchmark (BLOCKED)

Benchmark script ready at `scripts/benchmark_nvfp4_vs_q4km.py` but cannot
execute because no runtime can serve the NVFP4 model.
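The core number the comparison would report is decode throughput; here is a hypothetical helper in that spirit (this is a sketch, not the actual `benchmark_nvfp4_vs_q4km.py`):

```python
import statistics

def decode_throughput(completion_tokens: list[int], decode_seconds: list[float]) -> dict:
    """Summarize per-request decode rate (tokens/s) across benchmark runs."""
    rates = [n / s for n, s in zip(completion_tokens, decode_seconds)]
    return {
        "mean_tok_s": statistics.fmean(rates),
        "p50_tok_s": statistics.median(rates),
        "min_tok_s": min(rates),
    }

print(decode_throughput([100, 120], [2.0, 2.0]))  # mean 55.0, p50 55.0, min 50.0
```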

### Baseline (Q4_K_M on llama-server) — Works fine

```bash
~/llama-cpp-latest/build-gpu/bin/llama-server \
    --host 0.0.0.0 --port 8003 \
    -m ~/models/Qwen3.5-9B-Claude-Opus-Distilled-Q4_K_M.gguf \
    --alias qwen3.5-9b --ctx-size 32768 --n-gpu-layers 999 -fa auto --jinja \
    --reasoning-budget 0 --metrics
```

---

## Conclusion & Recommendation

### NVFP4 is NOT a viable replacement today

The Qwen3.5-9B NVFP4 format cannot be served on DGX Spark (GB10) because:

1. **No inference runtime supports Qwen3.5 + NVFP4 + SM121 together**
   - vLLM has a rotary embedding bug with Qwen3.5's DeltaNet architecture
   - SGLang Spark image is too old (0.5.4, transformers 4.57)
   - No TensorRT-LLM container was tested (potential avenue)

2. **The ecosystem is immature for this combination**
   - Qwen3.5 is very new (released after current container builds)
   - DeltaNet hybrid architecture is non-standard (24 linear-attention + 8 full-attention layers)
   - SM121 is a newer compute capability than most containers support

### Keep Q4_K_M on llama-server

The current setup works reliably:
- llama.cpp **does** support Qwen3.5 DeltaNet on SM121
- Q4_K_M at 32K context uses ~6.6 GB VRAM
- Performance is proven in production (Annie Voice)

### When to revisit

- **vLLM fixes #35519** — rotary embedding support for partial_rotary_factor
- **SGLang ships 0.6+** with transformers 5.x and Qwen3.5 support
- **NVIDIA releases updated NGC containers** for DGX Spark with Qwen3.5 support

### Quantized model preserved

The NVFP4 model is saved at `~/models/Qwen3.5-9B-Opus-Distilled-NVFP4/` (7.4 GB)
and ready to benchmark once a compatible runtime becomes available.

---

## Lessons Learned

1. **DGX Spark ships AI stack as Docker containers, not system packages.**
   No pip-installed SGLang/vLLM/TensorRT. Always check `docker images`.

2. **ModelOpt 0.37 (in PyTorch 25.11) already supports NVFP4.**
   No need for separate TensorRT-LLM container for quantization.

3. **Qwen3.5 is too new for current Spark containers.**
   `model_type: "qwen3_5_text"` needs transformers 5.x;
   all inference containers ship ~4.57.

4. **Upgrading transformers inside containers breaks them.**
   SGLang 0.5.4 and vLLM 0.16 have internal API dependencies on exact tf version.

5. **Docker root ownership bites.**
   Pre-create output directories on host before running quantization.

6. **NVFP4 quantization is blazingly fast on DGX Spark.**
   3.8 minutes for 9B model (vs estimated 45-90 min).
   Unified memory eliminates CPU-GPU data transfer.

7. **The DeltaNet architecture is the real blocker.**
   Qwen3.5's `partial_rotary_factor: 0.25` breaks vLLM's rotary embedding
   kernel when combined with NVFP4 weight packing. Standard Qwen3 models
   (without DeltaNet) likely work fine.

---

## Files Created

| File | Purpose |
|------|---------|
| `docs/PLAN-NVFP4-BENCHMARK.md` | 6-phase benchmark plan |
| `docs/RESEARCH-NVFP4-QUANTIZATION.md` | Quantization methodology research |
| `docs/BENCHMARK-NVFP4-EXECUTION-LOG.md` | This file (audit trail) |
| `scripts/quantize_nvfp4.py` | Quantization script (reproducible) |
| `scripts/benchmark_nvfp4_vs_q4km.py` | Benchmark comparison script (ready to use) |
| `~/models/Qwen3.5-9B-Opus-Distilled-NVFP4/` | Quantized model output (on Titan) |
| `~/models/Qwen3.5-9B-Opus-Distilled-BF16/` | BF16 source model (on Titan) |
