# NVFP4 Research Journey: From Naive Quantization to Calibration-Aware Strategy

## Project Context

**Goal:** Determine if NVFP4 quantization is viable for Annie Voice on DGX Spark (GB10, SM121, 128GB unified memory).
**Current production:** Qwen3.5-9B-Opus-Distilled Q4_K_M on llama-server (port 8003).
**Annie** is Rajesh's personal AI voice companion — warm, concise, no markdown, tool-calling capable.

**Hardware:** ASUS ProArt P16 (dev) + DGX Spark "Titan" (inference server)
**DGX Spark specs:** NVIDIA GB10 (Blackwell, SM121/compute 12.1), 128GB unified LPDDR5x, 2.7TB free disk

---

## Session 337: Initial NVFP4 Quantization Attempt

### What we tried
Quantize `Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled` to NVFP4 using NVIDIA ModelOpt inside NGC PyTorch container.

### What worked
- ModelOpt quantization succeeded in 3.8 minutes
- Output: 7.4 GB NVFP4 safetensors
- 843 quantizers inserted, GPU used 16.7 GB

### What failed — the transformers 5.x deadlock
The exported config.json has `"model_type": "qwen3_5_text"` — introduced in transformers 5.x. But:
- SGLang container has transformers 4.57.1
- vLLM container has transformers 4.57.6
- Neither recognizes `qwen3_5_text`
- Upgrading transformers inside the containers breaks their internals (e.g., SGLang's `AutoImageProcessor.register()` call)

### Solution found
Install SGLang from source on bare metal in a venv (`~/sglang-env`), which pulls transformers 5.x as a dependency.

---

## Benchmark v1: Base Qwen3.5-9B NVFP4 vs Opus-Distilled Q4_K_M (no reasoning parser)

### Critical flaw: different models
- NVFP4 model: AxionML/Qwen3.5-9B-NVFP4 (BASE, no fine-tuning)
- Q4_K_M model: Jackrong/Qwen3.5-9B-Claude-Opus-Distilled-Q4_K_M (SFT on 3,950 Opus traces)
- This confounded three variables: model weights, quantization format, runtime engine

### Headline result was misleading
- NVFP4 showed 82ms constant TTFT — but this was time-to-first-THINKING-token, not useful content
- Without `--reasoning-parser`, thinking tokens leaked as visible content
- Every response started with `"Thinking Process: 1. **Analyze the Request:**..."`
- The 82ms was measuring the wrong thing

### Tool calling "failed" — but it was a configuration error
- vLLM returned 400 Bad Request on all tool calls
- Root cause: missing `--enable-auto-tool-choice --tool-call-parser qwen3_coder` flags
- NOT a model limitation — confirmed by web research showing Qwen3.5-9B scores 66.1 on BFCL-V4

### Other methodology issues
- Token counting used `len(text) // 4` estimate, not API `usage.completion_tokens`
- n=3 runs with no warmup exclusion
- Cold starts (9.3s, 110s) contaminated statistics
- No quality evaluation at all

---

## Benchmark v2: Added reasoning parser + tool calling flags

### What changed
- Added `--reasoning-parser qwen3 --tool-call-parser qwen3_coder` to vLLM
- Tool calling now worked: 12/12 success across v2 runs

### What broke
- With reasoning parser, thinking tokens were hidden correctly
- But base Qwen3.5-9B spent 3-16 seconds thinking, then produced ZERO visible tokens
- 21/21 non-tool text tests returned empty responses
- The model exhausted its token budget on internal reasoning before generating content
- Only tool calls worked (structured output parsed separately)

### Key insight
The Opus-Distilled model works with thinking disabled because the distillation trained it to produce quality output without a thinking scaffold. Base Qwen3.5 cannot function without thinking.

---

## Benchmark v3: Proper methodology (thinking disabled)

### Methodology fixes
- `enable_thinking: false` in every request (matches Annie's `--reasoning-budget 0`)
- 2 warmup runs excluded, 5 measured runs
- Quality gates: non-empty, factual correctness, tool accuracy, no-markdown, thinking leak
- Real token counts from API usage field
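
The measurement logic behind these fixes can be sketched as follows; `measure_run` and `p50_ttft` are illustrative stand-ins for what `benchmark_quant_v3.py` does, not the script itself:

```python
import statistics
import time

def measure_run(stream):
    """Time one request. `stream` is a callable yielding response text chunks
    (e.g. an SSE stream from the OpenAI-compatible endpoint). TTFT is the
    delay before the first NON-EMPTY chunk, not merely the first event."""
    start = time.perf_counter()
    first_token_ms = None
    chunks = []
    for chunk in stream():
        if first_token_ms is None and chunk:
            first_token_ms = (time.perf_counter() - start) * 1000
        chunks.append(chunk)
    return first_token_ms, "".join(chunks)

def p50_ttft(ttfts_ms, warmup=2):
    """Median TTFT with the first `warmup` runs excluded, so cold starts
    (the 9.3s and 110s outliers from v1) cannot contaminate the statistics."""
    return statistics.median(ttfts_ms[warmup:])
```

Token counts come from the API's `usage.completion_tokens` field on the final event, not from a `len(text) // 4` estimate.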

### Devastating result: base NVFP4 produces gibberish
With thinking disabled, base Qwen3.5-9B-NVFP4 output was garbled:
```
.isantts     E_OVER
 S copr; '...CSiscoslmogn不辞 продать / '^当 /M刘邦opIS ( S Z-s优化e'
```

Quality gates:
- Factual correctness: **0/5** (0%)
- Reasoning correctness: **0/5** (0%)
- Tool call accuracy: **0/10** (0%)
- No markdown: 4/15 (27%)

### This killed the base model as a candidate
The base Qwen3.5-9B was trained to think before answering. Remove thinking = gibberish.
Not a quantization problem — a model training problem.

---

## Detour: Qwen3.5-27B NVFP4 (full instruct model)

### Hypothesis
The 27B is a proper instruct-tuned model (not base, not distilled). It should work with thinking disabled.

### Available models found
- surogate/Qwen3.5-27B-NVFP4 (tested on DGX Spark, includes MTP head, ~19 GB)
- txn545/Qwen3.5-27B-NVFP4 (SGLang-compatible)
- kaitchup/Qwen3.5-27B-NVFP4 (llm-compressor, vLLM-compatible)
- AxionML/Qwen3.5-27B-NVFP4 (MLP-only quantization)

### Serving challenges (session 339)
1. SGLang's FlashInfer NVFP4 kernels didn't work on SM121
2. Switched to vLLM Docker container (`vllm/vllm-openai:cu130-nightly`)
3. Used `VLLM_NVFP4_GEMM_BACKEND=cutlass` → FLASHINFER_CUTLASS backend worked
4. `--language-model-only` bypassed all vision/preprocessor issues
5. Model loaded successfully: 19 GB, ~18.5 GB with MTP

### 27B benchmark result
- Quality: coherent output, but generic (not Annie's personality)
- Decode speed: ~13 tok/s — too slow for voice (Annie needs perceived fluency)
- TTFT: confirmed constant ~90ms (Blackwell FP4 prefill)
- Verdict: right format, wrong model size for single-GPU voice workload

---

## Breakthrough: Same Model, Pure Quantization Comparison

### What we finally tested
- Model A: Opus-Distilled 9B in NVFP4 (self-quantized with ModelOpt)
- Model B: Opus-Distilled 9B in Q4_K_M (Jackrong's GGUF on llama-server)
- Same fine-tuning. Only quantization format differs.

### TTFT — NVFP4 wins 3-12x (CONFIRMED REAL)

| Test | NVFP4 p50 | Q4_K_M p50 | Ratio |
|------|-----------|------------|-------|
| Simple greeting | 92ms | 300ms | 3.3x faster |
| Factual | 88ms | 296ms | 3.4x faster |
| Conversational | 91ms | 372ms | 4.1x faster |
| Tool call | 93ms | 774ms | 8.3x faster |
| Multi-turn 5 | 134ms | 1608ms | **12x faster** |

NVFP4 TTFT grows 1.6x over 5 turns; Q4_K_M grows 4.6x. This matches Qwen3.5's DeltaNet linear attention, which keeps prefill cost nearly flat as conversation context grows.

### Decode Speed — Nearly identical!

| Test | NVFP4 tok/s | Q4_K_M tok/s |
|------|-------------|--------------|
| Greeting | 33.3 | 31.6 |
| Factual | 37.6 | 37.3 |
| Conversational | 33.9 | 41.6 |
| Multi-turn avg | 35.4 | 35.9 |

The 9B NVFP4 decodes at 30-38 tok/s — matching Q4_K_M! The 27B's 13 tok/s was model size, not NVFP4.

### Quality — NVFP4 has behavioral degradation

| Gate | NVFP4 | Q4_K_M |
|------|-------|--------|
| Factual (Tokyo) | **5/5** | 5/5 |
| Reasoning (15) | **5/5** | 5/5 |
| No markdown | 3/15 (20%) | 15/15 (100%) |
| Tool: search_memory | 0/5 (0%) | 1/5 (20%) |
| Tool: web_search | 0/5 (0%) | 5/5 (100%) |
| Thinking leaked | 5/25 (20%) | 0/25 (0%) |

### The critical insight

**FP4 doesn't hurt intelligence — it hurts manners.**

- Factual accuracy: perfect (5/5 both)
- Reasoning: perfect (5/5 both)
- But: markdown leaks, tool calling format broken, thinking tokens leak, 2-6x more verbose

The Opus distillation taught the model behavioral constraints (conciseness, tool format, no markdown). These are subtle learned patterns sitting on top of base capabilities. FP4's reduced precision rounds them away first.

### Root cause
Our quantization used `NVFP4_DEFAULT_CFG` with "max" algorithm and CNN DailyMail calibration data. This is the most basic configuration — wrong domain (news ≠ conversational AI), no layer sensitivity analysis, default calibration.

---

## The Strategy: Calibration-Aware NVFP4 Quantization

### Why this will work
The behavioral signal isn't destroyed — it's being rounded away by calibration that doesn't know which weight ranges matter. If we calibrate with data that exercises the exact structured output patterns Annie needs (tool calls, concise responses, no markdown), ModelOpt will measure the activation ranges that produce correct behavior and preserve them.
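
A toy fake-quantizer makes the mechanism concrete. NVFP4 stores 4-bit E2M1 values under a per-16-element scale; the simulation below ignores the FP8 rounding of the scale itself and only shows how a "max"-calibrated scale, inflated by activations the calibration set exercised but Annie never produces, rounds small in-domain values to zero:

```python
# Intuition-only fake quantizer, NOT ModelOpt's kernel.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def fake_quantize(vals, scale):
    """Snap each value to the nearest representable +/-E2M1 * scale point."""
    grid = sorted({sign * m * scale for m in E2M1 for sign in (1, -1)})
    return [min(grid, key=lambda g: abs(g - x)) for x in vals]

# Small values carrying behavioral signal (tool-call tokens, markdown suppression):
block = [0.20, -0.30, 0.10]

# "max" calibration on in-domain data: scale set from this block's own range.
good = fake_quantize(block, scale=0.30 / 6)  # signal lands on nearby grid points
# Scale inflated by out-of-domain calibration (observed max 6.0 instead of 0.3):
bad = fake_quantize(block, scale=6.0 / 6)    # -> [0.0, -0.5, 0.0], signal gone
```

Same weights, same format; only the calibrated scale differs, and with the wrong one the small values collapse to zero or the wrong grid point.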

### The teacher: Claude Opus 4.6 (via Claude Code CLI)
- Opus 4.6 is the ORIGINAL source — the model Annie was distilled from
- Perfect tool calling, perfect instruction following, zero markdown
- Q4_K_M would be a lossy teacher (already 6/10 on tool calling)
- Claude Code CLI with Max subscription = unlimited Opus usage, zero API cost
- Command: `echo "prompt" | claude -p --model claude-opus-4-6 --output-format text`

### Phase 1: Generate 500 calibration conversations
- 5 categories × 100 conversations each
- Categories: greetings, factual Q&A, memory tool calls, web search tool calls, multi-turn
- Opus generates ideal Annie responses (tool-call JSON, no markdown, concise voice output)
- Format: JSONL with Qwen3.5 chat template
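
Assembling the dataset is mostly plumbing. A sketch of the record format, with the teacher call abstracted behind a function (in practice a subprocess around the `claude -p` command above); the system prompt and schema here are illustrative:

```python
import json

def to_record(user_msg, assistant_msg,
              system="You are Annie, a warm, concise voice companion. No markdown."):
    """One calibration example as chat messages; the Qwen3.5 chat template is
    applied later, at quantization time."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]}

def build_jsonl(prompts, teacher):
    """`teacher` maps a prompt to an ideal Annie reply, e.g. a wrapper around
    `claude -p --model claude-opus-4-6 --output-format text`."""
    return "\n".join(json.dumps(to_record(p, teacher(p))) for p in prompts)
```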

### Phase 2: Calibration-aware quantization
Three strategies in order of expected quality:

**Strategy A — AWQ-Lite:**
- Config: `NVFP4_AWQ_LITE_CFG` + layer exclusions
- Algorithm: "awq_lite"
- Expected: good quality, ~10 min
- Size: ~7.5 GB

**Strategy B — MLP-Only (RECOMMENDED FIRST):**
- Config: `NVFP4_MLP_ONLY_CFG` + layer exclusions
- Keeps ALL attention weights in BF16 (where tool-call formatting decisions happen)
- Algorithm: "awq_lite"
- Expected: best quality for tool calling
- Size: ~8-9 GB

**Strategy C — Auto-Mixed:**
- Config: `auto_quantize(effective_bits=4.8)`
- Algorithm: "kl_div"
- Automatically assigns FP4 to insensitive layers, FP8 to sensitive ones
- Expected: highest accuracy, ~30 min
- Size: ~8-10 GB

### Layer exclusions (all strategies):
```python
LAYER_EXCLUSIONS = {
    "*linear_attn*": {"enable": False},   # DeltaNet layers — architecturally different
    "*layers.0.*": {"enable": False},     # First transformer layer
    "*layers.31.*": {"enable": False},    # Last transformer layer
    "*embed_tokens*": {"enable": False},  # Embedding layer
    "*lm_head*": {"enable": False},       # Output head (already excluded by default)
}
```
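
Applying the exclusions means merging them into the chosen base config. A sketch, under the assumption that ModelOpt configs nest per-pattern overrides under a `quant_cfg` key; `BASE_CFG` here is a hypothetical stand-in, not the real `NVFP4_MLP_ONLY_CFG`:

```python
from copy import deepcopy

# Hypothetical stand-in for a ModelOpt config such as NVFP4_MLP_ONLY_CFG:
# per-pattern overrides live under the "quant_cfg" key.
BASE_CFG = {"quant_cfg": {"*weight_quantizer": {"num_bits": (2, 1)}}}

def with_exclusions(cfg, exclusions):
    """Return a copy of cfg with the given sensitive-layer patterns disabled."""
    out = deepcopy(cfg)
    out["quant_cfg"].update(exclusions)
    return out

cfg = with_exclusions(BASE_CFG, {"*lm_head*": {"enable": False}})
```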

### Phase 3: Benchmark with v3 methodology
- Same quality gates as before
- Must pass: thinking leak 0/25, tool calling ≥6/10, no markdown ≥12/15
- TTFT must stay ≤120ms, decode ≥25 tok/s
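
The two leak gates can be checked mechanically. These heuristics are illustrative, with patterns drawn from the failure modes we actually saw, not the exact gate implementation:

```python
import re

def has_markdown(text):
    """Heuristic for the no-markdown gate: bold/italic markers, code fences,
    headings, or bullet lines. Illustrative patterns, not exhaustive."""
    return bool(re.search(r"\*\*|__|```|^#{1,6} |^[-*] ", text, re.MULTILINE))

def thinking_leaked(text):
    """Heuristic for the thinking-leak gate, matching the v1 failure modes:
    a leaked 'Thinking Process:' preamble or raw think tags."""
    return "<think>" in text or text.lstrip().startswith("Thinking Process")
```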

### Phase 4: Prompt engineering quick test
- Try hardened system prompt on existing v1 NVFP4 model
- If it recovers tool calling, lighter quantization may suffice

### Phase 5: Publish
- HuggingFace: `rajesh/Qwen3.5-9B-Opus-Distilled-NVFP4-v2`
- Blog post: complete DGX Spark NVFP4 cookbook
- ROSCon China 2025 reference material

---

## Serving Recipes (Proven on DGX Spark)

### NVFP4 on vLLM (Docker — works on SM121)
```bash
docker run --gpus all -d --name vllm-nvfp4 \
  --ipc=host -p 8000:8000 \
  -v ~/models:/model \
  vllm/vllm-openai:cu130-nightly \
  /model/Qwen3.5-9B-Opus-Distilled-NVFP4 \
  --quantization modelopt \
  --enforce-eager \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --language-model-only \
  --port 8000 --host 0.0.0.0
```

### NVFP4 on SGLang (bare metal venv)
```bash
source ~/sglang-env/bin/activate
python -m sglang.launch_server \
  --model-path ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 1 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --port 8000 --host 0.0.0.0 \
  --mem-fraction-static 0.8
```

### Q4_K_M on llama-server (current production)
```bash
llama-server \
  -m ~/models/Qwen3.5-9B-Opus-Distilled-Q4_K_M.gguf \
  --port 8003 --host 0.0.0.0 \
  --ctx-size 32768 --reasoning-budget 0
```

---

## Key Lessons Learned

1. **Never compare different models and call it a quantization benchmark.** Three of our benchmark rounds were confounded. Only the same-model comparison gave actionable results.

2. **The 82ms TTFT headline was a measurement artifact.** Without the reasoning parser, thinking tokens leaked as content. Always validate what "first token" actually contains.

3. **Base Qwen3.5-9B cannot function without thinking mode.** It was trained to think before answering — remove thinking and you get gibberish. Opus-Distilled models can work without thinking because the distillation trained that capability.

4. **FP4 preserves knowledge, destroys manners.** Factual accuracy and reasoning survive FP4 perfectly. Behavioral fine-tuning (conciseness, tool format, markdown suppression) is the first casualty.

5. **Calibration data domain matters enormously.** CNN DailyMail (news articles) tells ModelOpt nothing about tool-call JSON formatting or markdown suppression. Domain-specific calibration is not optional.

6. **DGX Spark's SM121 (compute 12.1) is not fully supported.** PyTorch caps at 12.0, CUDA graphs don't work, many containers ship stale transformers. `--enforce-eager` is mandatory for vLLM. FlashInfer-CUTLASS backend works for NVFP4 GEMM.

7. **`--language-model-only` solves 80% of Qwen3.5 serving issues.** The vision encoder, preprocessor config, and Conv3d/CuDNN compatibility problems all disappear when you skip the multimodal pipeline for text-only workloads.

8. **Tool calling requires explicit server flags.** vLLM: `--enable-auto-tool-choice --tool-call-parser qwen3_coder`. SGLang: `--tool-call-parser qwen3_coder`. Without these, you get 400 errors that look like model failures.

---

## Files Reference

| File | Purpose |
|------|---------|
| `scripts/benchmark_quant_v3.py` | Proper benchmark with quality gates |
| `scripts/generate_calibration_data.sh` | Opus 4.6 calibration data generation via Claude Code CLI |
| `scripts/quantize_nvfp4.py` | Original naive quantization (v1) |
| `scripts/quantize_nvfp4_v2.py` | Calibration-aware quantization (to be created) |
| `docs/RESEARCH-NVFP4-VS-Q4KM.md` | Full benchmark results document |
| `data/calibration/annie_calibration.jsonl` | Calibration dataset (to be generated) |
