# NVFP4 Quantization: Lessons Learned & Playbook

**Date:** 2026-03-15 (Sessions 337-340)
**Hardware:** DGX Spark GB10 (Blackwell, SM 12.1, 128 GB unified memory)
**Goal:** Preserve Opus-distilled behavioral fine-tuning through FP4 quantization

---

## Executive Summary

NVFP4 quantization delivers transformative TTFT (a constant ~90ms vs Q4_K_M's 300-1600ms, which grows with context)
but **destroys behavioral fine-tuning** (tool calling, markdown suppression, conciseness) regardless
of calibration data. The quantization algorithm itself, not the calibration data, is the bottleneck.
The AWQ-Lite algorithm fixes this but produces a format vLLM can't serve.

---

## What We Tried (5 Benchmark Rounds + 2 Quantization Attempts)

| Round | What | Result |
|-------|------|--------|
| v1 | Base 9B NVFP4 vs Opus-Distilled Q4_K_M (no reasoning parser) | Misleading — thinking tokens leaked as "content", fake 82ms TTFT |
| v2 | Same + reasoning parser + tool calling flags | Thinking hidden correctly, but model produced ZERO visible output |
| v3 | Same + thinking disabled | **Gibberish** — base model can't function without thinking |
| v4 | 27B NVFP4 (instruct) vs 9B Q4_K_M | Quality excellent, but 13 tok/s decode too slow for voice |
| v4b | Same-model comparison: Opus-Distilled 9B NVFP4 vs Q4_K_M | NVFP4 speed proven (90ms TTFT, 33 tok/s) but behavioral quality destroyed |
| Phase 4 | Hardened prompt engineering on v1 NVFP4 | 43% reliability — behavioral signal partially present but unreliable |
| v2d | Annie calibration data (430 convos) + DEFAULT_CFG | **Same failures as v1** — calibration data alone doesn't help |

---

## Key Insights

### 1. FP4 Preserves Knowledge, Destroys Manners

The most important finding: factual accuracy and reasoning survive FP4 perfectly (5/5 on both).
What dies is the behavioral fine-tuning overlay — the patterns that make a model follow formatting
rules, suppress thinking, call tools in the right format, and stay concise.

**Why:** Behavioral fine-tuning creates subtle weight adjustments on top of base capabilities.
These sit at small magnitudes that FP4's reduced precision (E2M1: one mantissa bit, so only eight
non-negative representable values: 0, 0.5, 1, 1.5, 2, 3, 4, 6) rounds away first. Knowledge (large,
well-distributed weights) survives; manners (small, precise adjustments) don't.
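The effect is easy to see numerically. A toy sketch (pure Python round-to-nearest onto the E2M1 grid, not the ModelOpt kernel): a "knowledge"-sized weight and the same weight plus a small fine-tuning delta land on the same FP4 code.

```python
# E2M1 (FP4) non-negative code points; negatives mirror these.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x, scale):
    """Toy round-to-nearest onto the E2M1 grid after scaling.
    Real NVFP4 also carries per-16-element FP8 block scales."""
    t = abs(x) / scale
    code = min(E2M1, key=lambda c: abs(c - t))
    return (code if x >= 0 else -code) * scale

scale = 0.01          # block scale derived from the block's largest weight
base = 0.031          # a "knowledge" weight, large on this scale
tuned = base + 0.002  # plus a small behavioral fine-tuning delta

# Both land on the same code (3.0 * scale): the delta is rounded away.
assert quantize_e2m1(base, scale) == quantize_e2m1(tuned, scale)
```

The grid between adjacent codes is coarse (0.5 of a scale unit near 1.0, a full unit near 3.0), so any adjustment smaller than roughly half a code width simply disappears.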

### 2. Calibration Data Is Necessary But Not Sufficient

We generated 430 Annie-specific conversations using Claude Opus 4.6 as teacher:
- 100 greetings, 100 factual, 100 memory tools, 100 web search tools, 30 multi-turn
- 200/430 (47%) included proper tool_calls
- Zero markdown contamination

Despite this domain-perfect calibration data, the v2d model (DEFAULT_CFG + Annie data) failed in
exactly the same ways as v1 (DEFAULT_CFG + CNN DailyMail). The "max" calibration algorithm in
DEFAULT_CFG simply records each tensor's observed activation range; the data domain doesn't matter
because calibration only sets the scale and never changes how individual weights are rounded.

**The algorithm matters more than the data.** AWQ-Lite's activation-aware weighting is what
actually preserves behavioral patterns, not the calibration data domain.
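A toy illustration of that indifference (an idealized sketch of "max" calibration, not ModelOpt's code): the calibration pass contributes exactly one number per tensor, the observed absolute max, so two very different corpora with similar dynamic range yield the same scale and therefore identical rounding.

```python
def amax_calibrate(activations):
    """'Max' calibration: keep only the largest absolute activation seen."""
    return max(abs(a) for a in activations)

def scale_from_amax(amax, fmt_max=6.0):
    """Map the observed range onto the format's largest code (6.0 for E2M1)."""
    return amax / fmt_max

news_acts = [0.1, -2.9, 1.4, 3.0]    # generic corpus (e.g. news summaries)
annie_acts = [0.2, -3.0, 0.8, 2.7]   # domain-specific corpus

# Different content, same amax -> same scale -> identical weight rounding.
assert scale_from_amax(amax_calibrate(news_acts)) == \
       scale_from_amax(amax_calibrate(annie_acts)) == 0.5
```

AWQ-style methods differ precisely here: they use the activations to reweight which channels get rounding protection, not just to pick one range.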

### 3. AWQ-Lite Works But Can't Be Served

AWQ-Lite (NVFP4_AWQ_LITE_CFG) took 63 minutes (vs DEFAULT_CFG's 2.7 min) and produced
a `MIXED_PRECISION` format with per-layer algorithm assignments (NVFP4 vs NVFP4_AWQ).
vLLM cu130-nightly only supports the simple `NVFP4` top-level format.

**Blocked by serving infrastructure, not by quantization quality.**

### 4. DGX Spark SM 12.1 Is Barely Supported

The GB10 Blackwell chip (SM 12.1 / compute 12.1) is so new that:
- PyTorch caps CUDA arch at 12.0 (SM 12.1 returns false for `is_available()` in some builds)
- CUDA graphs don't work (`--enforce-eager` mandatory)
- Pre-built wheels for SGLang, vLLM, flashinfer all lack SM 12.1 kernels
- Only `vllm/vllm-openai:cu130-nightly` Docker image works (CUDA 13.0 natively supports SM 12.1)
- pip-installed vLLM/SGLang fail with "no kernel image" errors

### 5. Qwen3.5 VL Config Wrapper Is Required

The NGC-quantized model exports with `model_type: "qwen3_5_text"` (transformers 5.3+).
The vLLM Docker image has transformers 4.57 which doesn't know this type. Solution:
wrap the config in a VL-style structure with `model_type: "qwen3_5"` and nest the text
config under `text_config`. Also need `preprocessor_config.json` stub and
`--language-model-only` flag.
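A minimal sketch of the rewrap. Assumptions: the hoisted top-level fields below are sufficient, and nothing else in the config must change; verify against a known-good VL config (e.g. the 27B model) before trusting it.

```python
def wrap_text_config(text_cfg):
    """Rewrap a text-only "qwen3_5_text" config dict in a VL-style shell so
    older transformers can route it. Which top-level fields must be hoisted
    is an assumption; compare with a working VL config.json."""
    wrapped = {
        "model_type": "qwen3_5",
        "architectures": ["Qwen3_5ForConditionalGeneration"],
        "text_config": {**text_cfg, "model_type": "qwen3_5_text"},
    }
    # Hoist fields commonly read from the top level.
    for key in ("torch_dtype", "transformers_version"):
        if key in text_cfg:
            wrapped[key] = text_cfg[key]
    return wrapped
```

Apply it to `config.json` in place (after backing it up), then add the `preprocessor_config.json` stub and serve with `--language-model-only`.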

### 6. Tokenizer Compatibility Matters

The NGC container's transformers 5.3 exports a tokenizer with `TokenizersBackend` class.
The vLLM Docker image's older transformers doesn't have this class. Solution: copy the
tokenizer files from a working model (the 27B NVFP4 or a pre-existing v1 model).

---

## Serving Recipe (Proven on DGX Spark)

```bash
# Prerequisites for any NVFP4 model on DGX Spark:
# 1. config.json must use VL wrapper (model_type: "qwen3_5", arch: "Qwen3_5ForConditionalGeneration")
# 2. preprocessor_config.json must exist (copy from 27B model)
# 3. tokenizer files must be compatible with older transformers (copy from working model)
# 4. hf_quant_config.json must have quant_algo: "NVFP4" (not "MIXED_PRECISION")

docker run --gpus all -d --name vllm-nvfp4 \
  --runtime=nvidia \
  -v ~/models/YOUR_MODEL:/model \
  -p 8000:8000 --shm-size=16g \
  vllm/vllm-openai:cu130-nightly \
  --model /model \
  --quantization modelopt_fp4 \
  --enforce-eager \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --language-model-only \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768
```
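Once the container is healthy, a stdlib-only smoke test (the payload shape is the standard OpenAI chat-completions contract vLLM exposes; `"/model"` as the model id matches the `--model /model` mount above):

```python
import json
import urllib.request

def build_chat_payload(prompt, model="/model", max_tokens=64):
    """OpenAI-style chat-completions body; vLLM registers the model under
    the --model path, so "/model" is the model id for the container above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, url="http://localhost:8000/v1/chat/completions"):
    """Fire one request at the running container (server must be up)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["message"]
```

With `--reasoning-parser qwen3` active, check that the returned message has visible `content` and that thinking is not leaking into it (the v1/v2 benchmark failure mode).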

---

## Quantization Recipe

```bash
# Run inside NGC container on DGX Spark:
docker run --gpus all --rm --ipc=host \
  -v ~/models:/models \
  -v ~/workplace/her/her-os:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate datasets && \
  PYTHONUNBUFFERED=1 python3 -u /workspace/scripts/quantize_nvfp4_v2.py \
    --strategy default \
    --model /models/SOURCE_BF16 \
    --calib-data /workspace/data/calibration/annie_calibration.jsonl \
    --output /models/OUTPUT_NVFP4"
```

**Post-quantization checklist:**
1. Wrap `config.json` in VL format (see script in session notes)
2. Copy `preprocessor_config.json` from 27B model
3. Copy tokenizer files from a working model
4. Verify `hf_quant_config.json` has `quant_algo: "NVFP4"` (not `MIXED_PRECISION`)
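Item 4 can be verified mechanically. A small helper (the nesting of `quant_algo` inside `hf_quant_config.json` varies across ModelOpt versions, so checking both plausible spots is an assumption):

```python
def servable_by_vllm(hf_quant_config):
    """True iff the export records the simple "NVFP4" algo that vLLM
    cu130-nightly can load. Checks the top level and a nested
    "quantization" block, since nesting varies by ModelOpt version."""
    algo = hf_quant_config.get("quant_algo") or \
           hf_quant_config.get("quantization", {}).get("quant_algo")
    return algo == "NVFP4"
```

Run it on the parsed JSON before spending time on the Docker launch; a `MIXED_PRECISION` export fails here immediately instead of at engine startup.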

---

## Calibration Data Generation

```bash
# Disable hooks first:
mv ~/.claude/hooks/stop-self-check.sh{,.disabled}

# Generate (10 conversations per batch, Opus 4.6 as teacher):
python3 scripts/generate_calibration_data.py --per-batch 10 --model opus --timeout 120

# Restore hooks:
mv ~/.claude/hooks/stop-self-check.sh{.disabled,}
```

**CLI gotchas:**
- `--no-session-persistence` causes HANGS with Opus — never use it
- `--append-system-prompt` is REQUIRED to prevent the CLI from treating prompts as tasks to describe
- Max 10 conversations per batch (50 returns empty — output size limit)
- Multi-turn conversations need >120s timeout (5 turns × 11 messages is a lot of output)
- Run from `/tmp` to avoid project hooks interfering with `-p` mode

---

## Root Cause & Solution (discovered post-v2d)

**PTQ (Post-Training Quantization) is fundamentally unable to preserve behavioral fine-tuning.**
PTQ only measures activation ranges and rounds weights — it never adjusts weights. The behavioral
patterns are small, precise weight adjustments that FP4 rounds away regardless of calibration data.

**QAT (Quantization-Aware Training) is the correct approach.**
LMSYS research shows QAT achieves 97% pass rate on behavioral tasks where PTQ gets only 59%.
QAT inserts fake quantizers into the model, then fine-tunes with quantization simulation active:
- Forward pass: weights are quantized (simulating FP4)
- Backward pass: straight-through gradient estimator updates full-precision weights
- Result: weights learn to produce correct output DESPITE quantization noise
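The straight-through trick fits in a few lines of plain Python (a toy scalar model on a 0.5-step grid standing in for FP4; not ModelOpt's implementation). The true gradient of rounding is zero almost everywhere, but STE treats the quantizer as identity on the backward pass, so the full-precision weight still learns:

```python
def quantize(w, step=0.5):
    """Round-to-nearest on a coarse grid -- a stand-in for FP4."""
    return round(w / step) * step

# Fit y = w*x to data generated by w_true = 1.0, starting off-grid.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
w, lr = 0.3, 0.05
for _ in range(200):
    qw = quantize(w)                # forward pass uses the QUANTIZED weight
    # STE backward: gradient of the loss w.r.t. qw, passed straight
    # through to w as if quantize() were the identity function.
    grad = sum(2 * (qw * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad                  # update the FULL-precision weight

print(quantize(w))  # prints 1.0: the weight settled on the nearest grid point
```

Without the identity trick, `d(quantize)/dw` is zero almost everywhere and `w` would never move, which is exactly why PTQ (which never updates weights at all) can't recover behavior that quantization noise has broken.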

**QAT steps with ModelOpt:**
1. Load BF16 model
2. Insert quantizers: `mtq.quantize(model, NVFP4_DEFAULT_CFG, forward_loop)` (same as PTQ step)
3. Keep model trainable (don't call `model.eval()` after quantization)
4. Fine-tune on Annie conversations with low LR (1e-5), ~1-3 epochs
5. Export with `export_hf_checkpoint()`

**Source:** [LMSYS QAT blog](https://lmsys.org/blog/2025-08-28-gpt-oss-qat/) — PTQ 59% vs QAT 97%
on safety alignment. Also: [NVIDIA QAD Report](https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf).

**Original model training (Jackrong):**
- Unsloth + LoRA (train_on_responses_only, masking instructions)
- ~3,950 Claude Opus 4.6 reasoning traces from 3 datasets
- 16K context, SFT loss on `<think>` + answer tokens only
- LoRA rank/alpha/target modules NOT published

---

## What To Try Next

### Path 0: QAT (RECOMMENDED — highest probability of success)
Use ModelOpt's QAT mode to fine-tune the model WITH quantization active. Our 430 Annie
conversations become the fine-tuning data (not calibration data). LR 1e-5, 1-3 epochs.
This is the LMSYS-proven approach that achieves 97% behavioral preservation.

### Path 1: Wait for vLLM MIXED_PRECISION Support
The AWQ-Lite algorithm produces the right quality but wrong format. When vLLM adds
`MIXED_PRECISION` support, the v2a model (already quantized, sitting at
`~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2a/`) is ready to serve.

### Path 2: SGLang with modelopt_fp4
SGLang might support the MIXED_PRECISION format. Build SGLang from source with
transformers 5.3+ on bare metal (not Docker — the Docker images lack SM 12.1 kernels).

### Path 3: Convert MIXED_PRECISION to Simple NVFP4
Write a script that reads the v2a model's per-layer quant config and converts the
NVFP4_AWQ layers back to simple NVFP4 format. The weights are already quantized —
we just need to change the metadata.
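A sketch of what that script might look like. The schema here (a nested `"quantization"` block, a hypothetical `"per_layer_config"` key for the per-layer algorithm map) is an assumption; inspect the real v2a `hf_quant_config.json` for the actual key names before using it.

```python
def demote_to_simple_nvfp4(cfg):
    """Rewrite MIXED_PRECISION export metadata to plain NVFP4 without
    touching weights. Key names below are ASSUMED; verify against the
    actual v2a hf_quant_config.json first."""
    quant = dict(cfg.get("quantization", {}))
    quant["quant_algo"] = "NVFP4"
    quant.pop("per_layer_config", None)  # drop the per-layer NVFP4/NVFP4_AWQ map
    return {**cfg, "quantization": quant}
```

Load `hf_quant_config.json`, pass it through, write it back. The bet this path makes is that the NVFP4_AWQ layers' tensors already match the plain-NVFP4 on-disk layout, so only the metadata needs to change.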

### Path 4: GPTQ or AWQ Instead of NVFP4
These quantization formats have mature vLLM support. The trade-off: lose the
Blackwell Tensor Core FP4 acceleration (no hardware-native prefill), but gain
proven serving infrastructure.

### Path 5: MTP Speculative Decoding for 27B
The 27B NVFP4 model had excellent quality but 13 tok/s decode. MTP speculative
decoding claims ~20 tok/s (~50% boost). If that's true, the 27B becomes viable
for voice at 200ms TTFT + 20 tok/s decode.

### Path 6: Generate 1000+ Calibration Conversations
More calibration data might help AWQ-Lite preserve more behavioral signal.
Increase per-batch to 15-20 (we proved 10 works), increase multi-turn timeout
to 300s, add more categories (error handling, context switching, personality).

---

## Files Reference

| File | Purpose |
|------|---------|
| `scripts/generate_calibration_data.py` | Python CLI generator (Claude Opus 4.6 teacher) |
| `scripts/quantize_nvfp4_v2.py` | Multi-strategy quantization (default, awq_lite, mlp_only, auto_mixed) |
| `scripts/quantize_nvfp4.py` | Original v1 quantization (CNN DailyMail, DEFAULT_CFG) |
| `scripts/benchmark_quant_v3.py` | Benchmark with quality gates and proper methodology |
| `data/calibration/annie_calibration.jsonl` | 430 Annie conversations (Opus 4.6 generated) |
| `docs/NVFP4-RESEARCH-JOURNEY.md` | Complete research narrative |
| `docs/NVFP4-EXECUTION-PLAN.md` | Original 5-phase plan |
| `docs/RESEARCH-NVFP4-VS-Q4KM.md` | All benchmark data |

---

## Anti-Patterns (Don't Do These)

1. **Don't compare different models and call it a quantization benchmark.** Confounds fine-tuning quality with quantization quality. Always compare the same model in two formats.

2. **Don't measure TTFT without checking what "first token" contains.** Without `--reasoning-parser`, thinking tokens leak as content. The 82ms "TTFT" was actually time-to-first-thinking-token.

3. **Don't use `--no-session-persistence` with Claude CLI Opus.** Causes hangs. Use `--append-system-prompt` instead.

4. **Don't pip-install vLLM/SGLang on DGX Spark.** Use Docker cu130-nightly. All pip builds lack SM 12.1 kernels.

5. **Don't expect Docker Python output in real-time.** Use `PYTHONUNBUFFERED=1` and `python3 -u`. Even then, some buffering occurs.

6. **Don't add extra keys to ModelOpt quantization configs.** v0.37.0 uses Pydantic validation that rejects unknown fields. Layer exclusions need a different API path.

7. **Don't assume calibration data domain matters for DEFAULT_CFG.** The "max" algorithm only measures activation ranges — it doesn't care about the content semantics.
