# Research: NVFP4 Quantization of Opus-Distilled Qwen3.5-9B

**Date:** 2026-03-14
**Goal:** Quantize `Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled` to NVFP4 format
for an apples-to-apples benchmark against the Q4_K_M GGUF currently served via llama-server.
**Model name:** `rajesh/Qwen3.5-9B-Opus-Distilled-NVFP4`

---

## Source Model

| Field | Value |
|-------|-------|
| **HuggingFace** | `Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled` |
| **Format** | Safetensors (BF16, F32) |
| **Parameters** | 10B (dense, DeltaNet hybrid: 24 linear + 8 standard attention) |
| **Base model** | `Qwen/Qwen3.5-9B-Base` |
| **Fine-tuning** | LoRA + SFT on ~3,950 Claude Opus 4.6 reasoning traces (Unsloth) |
| **Context** | 16K SFT, 262K native architecture |
| **License** | Apache 2.0 |
| **GGUF variant** | `Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF` (what we currently run) |

---

## Existing NVFP4 Models (HuggingFace Search, 2026-03-14)

| Model | NVFP4 exists? | By whom |
|-------|---------------|---------|
| Qwen3.5-9B (base) | Yes | AxionML |
| Qwen3.5-9B **Opus-Distilled** | **No — we are first** | — |
| Qwen3.5-27B (base) | Yes | kaitchup, txn545 |
| Qwen3.5-27B **Opus-Distilled** | Yes | mconcat |

---

## NVFP4 Format

NVFP4 stores weights as 4-bit E2M1 values (1 sign bit, 2 exponent bits, 1 mantissa bit),
with an E4M3 FP8 scale per 16-element micro-block plus a second-level FP32 per-tensor scalar.

- **Key advantage:** Blackwell Tensor Cores have native FP4 datapaths (hardware-accelerated)
- **vs Q4_K_M:** Both ~4-bit, but Q4_K_M uses generic CUDA kernels while NVFP4 uses Tensor Cores
- **Model size:** ~3.5x smaller than BF16

---

## Approach 1: NVIDIA ModelOpt (FAILED)

### What We Tried
Used `nvcr.io/nvidia/pytorch:25.11-py3` container (has ModelOpt 0.37.0 pre-installed with
`NVFP4_DEFAULT_CFG` support).

```bash
docker run --gpus all --rm --ipc=host \
  -v ~/models:/models \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate datasets && python3 /models/quantize_nvfp4.py"
```

### Results
- **Quantization SUCCEEDED** in 3.8 minutes (not the estimated 45-90 min)
- Output: 7.4 GB NVFP4 safetensors
- 843 quantizers inserted, GPU used 16.7 GB

### Why It Failed at Serving
The exported `config.json` has `"model_type": "qwen3_5_text"` — a model type added in
transformers 5.x. But:
- **SGLang Spark container** (`lmsysorg/sglang:spark`, v0.5.4) has transformers 4.57.1
- **vLLM Spark container** (`vllm/vllm-openai:cu130-nightly`, v0.16.0rc2) has transformers 4.57.6
- Neither recognizes `qwen3_5_text` → `ValueError: Transformers does not recognize this architecture`

**Attempted fix:** Upgrading transformers inside SGLang container broke SGLang's
`AutoImageProcessor.register()` API (incompatible with transformers 5.x).

### Lessons Learned
1. ModelOpt quantization works great on DGX Spark (fast, low memory)
2. But the exported format is incompatible with pre-installed serving containers
3. Pre-installed containers on DGX Spark are from mid-2025, before Qwen3.5 was released
4. Cannot simply upgrade transformers inside SGLang — breaks internal APIs

---

## Approach 2: AutoRound + llm_compressor (CURRENT)

### Why AutoRound
- Exports in the `llm_compressor` format, which SGLang/vLLM load natively
- Has `--ignore_layers "linear_attn"` to keep DeltaNet linear attention in higher precision
- Handles chat template preservation properly

### Command
```bash
docker run --gpus all -d --name autoround-nvfp4 \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/models:/models \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "
    pip install auto-round llm-compressor &&
    auto-round \
      --model /models/Qwen3.5-9B-Opus-Distilled-BF16 \
      --scheme NVFP4 \
      --format llm_compressor \
      --output_dir /models/Qwen3.5-9B-Opus-Distilled-NVFP4 \
      --ignore_layers linear_attn \
      --batch_size 1
  "
```

### Key Flags
| Flag | Purpose |
|------|---------|
| `--scheme NVFP4` | NVIDIA FP4 (E2M1) with FP8 block scaling |
| `--format llm_compressor` | **REQUIRED** for SGLang/vLLM compatibility (AutoRound's default NVFP4 export is NOT supported) |
| `--ignore_layers linear_attn` | Keep DeltaNet linear attention layers in higher precision (same as AxionML's approach) |
| `--batch_size 1` | Prevent OOM on lower-VRAM machines |

### Expected Output
- `config.json` — should have compatible model_type
- `tokenizer.json`, `tokenizer_config.json` — must match source (Opus-Distilled chat template)
- `quantization_config.json` — llm_compressor metadata
- Model safetensors shards — ~4.5-5 GB expected

---

## DGX Spark Environment (Verified)

### Pre-installed Containers
| Container | Image | Size | Transformers | Status |
|-----------|-------|------|--------------|--------|
| PyTorch | `nvcr.io/nvidia/pytorch:25.11-py3` | 19.5 GB | 5.3.0 | Has ModelOpt 0.37.0 |
| vLLM | `vllm/vllm-openai:cu130-nightly` | 20.3 GB | 4.57.6 | Too old for Qwen3.5 |
| SGLang | `lmsysorg/sglang:spark` | 25.5 GB | 4.57.1 | Too old for Qwen3.5 |
| Qwen3-32B NIM | `nvcr.io/nim/qwen/qwen3-32b-dgx-spark` | 21.6 GB | — | NIM format |

### Key Finding
DGX Spark ships with the AI stack as **Docker containers**, not system packages.
None of the pre-installed containers ship the transformers 5.x that Qwen3.5's
`qwen3_5_text` model type requires. To serve it, SGLang/vLLM must be installed
from source or run from a newer container image.

### Hardware Specs (Verified)
- GPU: NVIDIA GB10
- Unified memory: 121.7 GB usable (128 GB physical)
- Model load time: ~52s for 9B BF16
- Quantization VRAM: 16.7 GB for 9B model
- Disk: 2.7 TB free

---

## Serving Options After Quantization

### Option A: SGLang from source
```bash
uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'
```

### Option B: vLLM with upgraded transformers
Need a newer vLLM container or build from source with Qwen3.5 support.

### Option C: TensorRT-LLM
```bash
trtllm-serve ./Qwen3.5-9B-Opus-Distilled-NVFP4 --backend pytorch --max_batch_size 4 --port 8000
```

---

## Execution Log

### Session 337 Timeline
| Time | Action | Result |
|------|--------|--------|
| 18:02 | Started Docker pulls (TensorRT, SGLang) | Both ~20-25 GB |
| 18:05 | Started BF16 model download | 18 GB, completed in ~15 min |
| 18:20 | Discovered PyTorch container has ModelOpt 0.37 with NVFP4 | Skipped TensorRT pull |
| 18:30 | Started ModelOpt quantization | 3.8 min, 7.4 GB output |
| 18:35 | SGLang failed: `qwen3_5_text` not recognized | transformers too old |
| 18:40 | Tried upgrading transformers in SGLang | Broke SGLang internals |
| 18:42 | Tried vLLM | Same `qwen3_5_text` error |
| 18:50 | Switched to AutoRound + llm_compressor approach | Also blocked — same transformers issue |
| 18:55 | Config patch: model_type→qwen3_5, arch→ForConditionalGeneration | vLLM then needed preprocessor_config.json |
| 19:00 | Re-ran quantization (output was lost — Docker root ownership + rm) | 3.7 min, 7.4 GB, output verified this time |
| 19:10 | User provided SGLang-from-source approach | Correct: install on bare metal host |
| 19:15 | Created `~/sglang-env` venv on Titan host | Python 3.12.3, pip 26.0.1, uv 0.10.10 |
| 19:20 | Installing SGLang from source (main branch) | In progress |

---

## Approach 3: SGLang from Source on Bare Metal (CURRENT)

### Why This Works
Install SGLang directly on the DGX Spark host OS in a dedicated venv.
This pulls transformers 5.x as a dependency, avoiding the container
compatibility deadlock.

### Setup
```bash
# Create venv (avoid polluting system Python)
python3 -m venv ~/sglang-env
source ~/sglang-env/bin/activate

# Install SGLang from source (main branch, pulls transformers 5.x)
pip install --upgrade pip uv
uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'

# Verify
python -c "import transformers; print('transformers:', transformers.__version__)"
python -c "import sglang; print('sglang:', sglang.__version__)"
```

### Launch
```bash
source ~/sglang-env/bin/activate
python -m sglang.launch_server \
  --model-path ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 1 \
  --reasoning-parser qwen3 \
  --port 8000 \
  --host 0.0.0.0
```

### Do NOT
- Do NOT upgrade transformers inside SGLang/vLLM containers (breaks internals)
- Do NOT install SGLang system-wide on the host (use venv)

---

## Sources

- [AxionML/Qwen3.5-9B-NVFP4](https://huggingface.co/AxionML/Qwen3.5-9B-NVFP4)
- [mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4](https://huggingface.co/mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4)
- [Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled)
- [SGLang DGX Spark SM121 tracking](https://github.com/sgl-project/sglang/issues/11658)
- [vLLM Qwen3.5 NVFP4 crash on ARM64 GB10](https://github.com/vllm-project/vllm/issues/35519)
- [NVIDIA ModelOpt NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
- [SGLang for DGX Spark](https://build.nvidia.com/spark/sglang)
- [Pre-built SGLang Docker for Spark](https://forums.developer.nvidia.com/t/new-pre-built-sglang-docker-images-for-nvidia-dgx-spark/360656)
- [FlashInfer 12.1f on DGX Spark](https://forums.developer.nvidia.com/t/from-20-to-35-tps-on-qwen3-next-nvfp4-w-flashinfer-12-1f/356153)
