# Plan: Calibration-Aware NVFP4 Quantization for Opus-Distilled 9B

**Created:** 2026-03-15 (Session 339)
**Status:** Ready for implementation
**Goal:** Re-quantize Qwen3.5-9B-Opus-Distilled with calibration-aware NVFP4 to preserve instruction-following quality while keeping the 90ms TTFT + 33 tok/s decode speed.

---

## Background

Session 339 benchmarks proved:
- **NVFP4 speed is real**: constant 90ms TTFT (vs Q4_K_M's 300-1600ms, which grows with context length), 33 tok/s decode (matches Q4_K_M)
- **Naive quantization destroys behavioral fine-tuning**: thinking leak (5/25), tool calling broken (0/10), markdown leak (12/15)
- **Knowledge/reasoning preserved**: factual 5/5, reasoning 5/5
- **Root cause**: Default `NVFP4_DEFAULT_CFG` + `"max"` algorithm + CNN DailyMail calibration is too aggressive for distilled models

**Data files:**
- `scripts/benchmark_9b_opus_nvfp4_vs_q4km_results.json` — v4b benchmark proving speed + quality issues
- `scripts/benchmark_27b_nvfp4_vs_9b_q4km_results.json` — v4a benchmark (27B instruct comparison)
- `scripts/quantize_nvfp4.py` — existing v1 quantization script (basic config)
- `docs/RESEARCH-NVFP4-VS-Q4KM.md` — full research doc with serving recipe

---

## Phase 0: Documentation & API Discovery (COMPLETE)

### ModelOpt API Summary

**Current script uses:** `NVFP4_DEFAULT_CFG` + `"max"` algorithm + 512 CNN DailyMail samples

**Better presets available:**

| Config | What it quantizes | Algorithm | Expected Quality |
|--------|------------------|-----------|-----------------|
| `NVFP4_DEFAULT_CFG` | All weights + activations | `"max"` | Baseline (current, poor) |
| `NVFP4_AWQ_LITE_CFG` | All weights + activations | `"awq_lite"` | Good — AWQ adapts to weight distributions |
| `NVFP4_MLP_ONLY_CFG` | MLP layers only | `"max"` | Better — attention stays BF16 |
| `NVFP4_SVDQUANT_DEFAULT_CFG` | All (SVD decomposed) | `"svdquant"` | Best — highest accuracy, slowest |
| `auto_quantize()` | Automatic FP4/FP8 mix | `"kl_div"` | Best — per-layer sensitivity search |

**Key API — `mtq.auto_quantize()`:**
```python
model, state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    quantization_formats=[mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG],
    data_loader=calib_dataloader,
    forward_step=lambda model, batch: model(batch),
    loss_func=lambda output, batch: output.loss,
    num_calib_steps=512,
    num_score_steps=128,
    method="kl_div",
)
```
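The `calib_dataloader` argument above is not defined in the snippet. One minimal way to build it from a JSONL of chat conversations is to tokenize each row and right-pad to a common length. This is a sketch, not the project's script; the `encode_fn` indirection stands in for something like `tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")`:

```python
import json

import torch
from torch.utils.data import DataLoader

def build_calib_dataloader(jsonl_path, encode_fn, pad_id=0, batch_size=1):
    """Build a DataLoader of padded input-id batches for calibration.

    encode_fn maps a chat `messages` list to a 1-D LongTensor of token ids,
    e.g. via tokenizer.apply_chat_template(messages, tokenize=True).
    """
    samples = []
    with open(jsonl_path) as f:
        for line in f:
            samples.append(encode_fn(json.loads(line)["messages"]))
    # Right-pad every sample to the longest sequence so batches stack cleanly
    padded = torch.nn.utils.rnn.pad_sequence(
        samples, batch_first=True, padding_value=pad_id)
    return DataLoader(padded, batch_size=batch_size)
```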

**Layer exclusion via wildcard patterns:**
```python
import copy
quant_cfg = copy.deepcopy(mtq.NVFP4_AWQ_LITE_CFG)
quant_cfg["quant_cfg"]["*linear_attn*"] = {"enable": False}   # DeltaNet layers
quant_cfg["quant_cfg"]["*layers.0.*"] = {"enable": False}      # First layer
quant_cfg["quant_cfg"]["*layers.31.*"] = {"enable": False}     # Last layer
quant_cfg["quant_cfg"]["*embed_tokens*"] = {"enable": False}   # Embeddings
```

**Layers already excluded by default:** `lm_head`, MoE routers, `linear_attn.conv1d`, Mamba conv1d

### Quality Failure Analysis

| Failure Mode | Score | Fixable by Prompt? | Root Cause |
|-------------|-------|-------------------|------------|
| Thinking leak | 5/25 leaked | No | Quantization broke `enable_thinking:false` |
| Tool calling | 0/10 | No | Can't emit structured tool-call JSON |
| Markdown | 3/15 pass | Partially | Symptom of thinking leak |
| Verbosity | 2-6.6x over Q4_K_M | No | Leaked preamble consumes tokens |
| Factual | 5/5 | n/a | Knowledge preserved perfectly |
| Reasoning | 5/5 | n/a | Reasoning preserved perfectly |

### Available Calibration Data

- 8 conversation transcripts (`scripts/eval-llm/transcripts.json`) — mixed English/Kannada
- 60 benchmark tasks (`scripts/eval-llm/task_inputs.json`) — sensitivity, triage, QA, etc.
- 1042+ eval results with model responses (`scripts/eval-llm/results/`)
- Annie system prompt + 11 tool definitions (`services/annie-voice/bot.py:115-147`)
- Real session data on Titan (PostgreSQL + JSONL) — needs privacy review
- **Gap**: Need 500+ formatted multi-turn chat + tool-calling examples → Phase 1 generates these

---

## Phase 1: Generate Synthetic Calibration Dataset

**Goal:** 500-1000 Annie-style conversations exercising structured outputs

### Why synthetic data
The Q4_K_M model produces perfect Annie behavior. Use it as a teacher to generate calibration data that captures the activation patterns of correct behavior.

### Tasks

**1.1** Create `scripts/generate_calibration_data.py`:
- Hit Q4_K_M server (port 8003) with diverse prompts
- **5 categories** (100 each = 500 total):
  - Simple greetings/casual chat (no tools, concise responses)
  - Factual Q&A (no tools, concise answers)
  - Memory-triggered ("What did X say about Y?") → should call `search_memory`
  - Web search ("What's the latest on X?") → should call `web_search`
  - Multi-turn (5-turn conversations with growing context)
- Format: `{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "...", "tool_calls": [...]}]}`
- Save to `data/calibration/annie_conversations.jsonl`

**1.2** Include tool-call examples:
- Record responses where Q4_K_M emits `tool_calls` in assistant message
- Include tool_result → follow-up pattern
- These expose the calibration pass to the exact token sequences used for structured output

**1.3** Prompt diversity sources:
- Use questions from `scripts/eval-llm/task_inputs.json`
- Use topics from `scripts/eval-llm/transcripts.json`
- Generate variations with different phrasings, languages (English + Kannada mix)

### Verification
- [ ] 500+ conversations generated
- [ ] ≥100 contain tool calls (JSON format)
- [ ] All responses are markdown-free
- [ ] No thinking tokens in any response
- [ ] Tokenized lengths cover 64-2048 tokens
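Most of the checklist can be automated with a small validator; the `<think>` marker and the markdown regex below are assumptions about what to screen for, not spec:

```python
import json
import re

# Rough markdown screen: bold, headings, bullet/numbered lists, inline code
MARKDOWN_RE = re.compile(r"(\*\*|^#|^\s*[-*] |`|^\s*\d+\. )", re.MULTILINE)

def validate(jsonl_path):
    """Return counts for the Phase 1 verification gates."""
    total = tool_calls = markdown = thinking = 0
    for line in open(jsonl_path):
        total += 1
        for msg in json.loads(line)["messages"]:
            if msg["role"] != "assistant":
                continue
            if msg.get("tool_calls"):
                tool_calls += 1
            text = msg.get("content") or ""
            if "<think>" in text:
                thinking += 1
            if MARKDOWN_RE.search(text):
                markdown += 1
    return {"total": total, "with_tool_calls": tool_calls,
            "with_markdown": markdown, "with_thinking": thinking}
```

The length-coverage check would additionally need the tokenizer, so it is left out of this sketch.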

### References
- System prompt: `services/annie-voice/bot.py:115-147`
- Tool defs: `services/annie-voice/tools.py`, `services/annie-voice/visual_tools.py`
- Q4_K_M server: `http://localhost:8003/v1/chat/completions`

---

## Phase 2: Improved Quantization Script

**Goal:** Create `scripts/quantize_nvfp4_v2.py` with 3 strategies

### Three strategies to try (in order of expected accuracy)

| Strategy | Config | Algorithm | Expected Size | Calibration Time |
|----------|--------|-----------|--------------|-----------------|
| **A: AWQ-Lite** | `NVFP4_AWQ_LITE_CFG` + layer exclusions | `"awq_lite"` | ~7.5 GB | ~10 min |
| **B: MLP-Only** | `NVFP4_MLP_ONLY_CFG` + layer exclusions | `"awq_lite"` | ~8-9 GB | ~8 min |
| **C: Auto-Mixed** | `auto_quantize(effective_bits=4.8)` | `"kl_div"` | ~8-10 GB | ~30 min |

### Layer exclusion (all strategies)
```python
excluded = {
    "*linear_attn*": {"enable": False},      # DeltaNet layers (quantization-sensitive)
    "*layers.0.*": {"enable": False},         # First transformer layer
    "*layers.31.*": {"enable": False},        # Last transformer layer
    "*embed_tokens*": {"enable": False},      # Embedding layer
}
```

### CLI interface
```bash
# Run inside NGC container on Titan
python scripts/quantize_nvfp4_v2.py \
  --strategy awq_lite|mlp_only|auto_mixed \
  --calib-data data/calibration/annie_conversations.jsonl \
  --calib-samples 512 \
  --exclude-layers linear_attn,layers.0,layers.31 \
  --output ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2-{strategy}
```
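All three strategies apply the same exclusion merge, so `quantize_nvfp4_v2.py` could centralize it in one helper. This is a sketch over plain dicts; the presets it would receive are the `mtq.*_CFG` dicts from Phase 0:

```python
import copy

def build_cfg(base_cfg, excluded):
    """Deep-copy a ModelOpt preset and merge wildcard layer exclusions into
    its "quant_cfg" section, leaving the preset itself untouched."""
    cfg = copy.deepcopy(base_cfg)
    cfg["quant_cfg"].update(excluded)
    return cfg
```

Usage would be along the lines of `mtq.quantize(model, build_cfg(mtq.NVFP4_AWQ_LITE_CFG, excluded), forward_loop=calib_loop)`; the deep copy matters because the presets are shared module-level dicts.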

### Post-quantization config surgery
Each output model needs:
1. Config wrapped in VL structure (`Qwen3_5ForConditionalGeneration` architecture)
2. `preprocessor_config.json` added (copy from 27B)
3. `tokenizer_config.json` with `Qwen2Tokenizer` class (copy from 27B)

### Verification
- [ ] All 3 strategies produce valid safetensors
- [ ] `hf_quant_config.json` shows excluded layers in ignore list
- [ ] Each model serves in Docker cu130-nightly with `--language-model-only`

### References
- Existing script: `scripts/quantize_nvfp4.py`
- NGC container: `nvcr.io/nvidia/pytorch:25.11-py3` (has ModelOpt 0.37.0)
- For ModelOpt 0.42.0: `pip install nvidia-modelopt[all]` (may need newer container)

---

## Phase 3: Benchmark All Three Strategies

**Goal:** Find the best quality/speed tradeoff

### For each strategy (A, B, C):
1. Serve with Docker cu130-nightly + `--language-model-only`
2. Run `scripts/benchmark_27b_nvfp4_vs_9b_q4km.py` against it + Q4_K_M baseline
3. Save results to `scripts/benchmark_9b_opus_nvfp4_v2{a,b,c}_results.json`

### Quality gates (must pass to replace Q4_K_M)

| Gate | Required | Current v1 | Q4_K_M baseline |
|------|----------|-----------|-----------------|
| Thinking leak | 0/25 | 5/25 | 0/25 |
| Tool calling | ≥6/10 | 0/10 | 6/10 |
| No markdown | ≥12/15 | 3/15 | 15/15 |
| Factual | 5/5 | 5/5 | 5/5 |
| Reasoning | 5/5 | 5/5 | 5/5 |
| TTFT p50 | ≤120ms | 90ms | 300ms |
| Decode tok/s | ≥25 | 33 | 35 |
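These gates could be enforced mechanically after each benchmark run. The metric key names below are assumptions about the results-JSON schema, not the actual fields emitted by `benchmark_27b_nvfp4_vs_9b_q4km.py`:

```python
# Gate thresholds from the table above: (metric, threshold, higher_is_better)
GATES = [
    ("thinking_leaks", 0, False),    # must be exactly 0/25
    ("tool_calls_passed", 6, True),  # >= 6/10
    ("markdown_free", 12, True),     # >= 12/15
    ("factual", 5, True),
    ("reasoning", 5, True),
    ("ttft_p50_ms", 120, False),     # <= 120 ms
    ("decode_tok_s", 25, True),      # >= 25 tok/s
]

def check_gates(results):
    """Return (metric, value, threshold) for every gate a strategy fails."""
    failed = []
    for metric, threshold, higher in GATES:
        value = results[metric]
        ok = value >= threshold if higher else value <= threshold
        if not ok:
            failed.append((metric, value, threshold))
    return failed
```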

### Verification
- [ ] All 3 models benchmarked
- [ ] Quality comparison table produced
- [ ] Best strategy identified
- [ ] Results added to `docs/RESEARCH-NVFP4-VS-Q4KM.md`

---

## Phase 4: Prompt Engineering Quick Test (parallel with Phase 2)

**Goal:** Test whether a hardened system prompt fixes any issues on the existing v1 model

### Hardened prompt additions
```
CRITICAL RULES (violations will cause errors):
1. NEVER output thinking, reasoning, or analysis text
2. NEVER use markdown: no **, no #, no -, no `, no numbered lists
3. When tools are needed, emit ONLY the tool_call JSON — no explanatory text
4. Respond in 1-3 sentences maximum
```

### Test plan
- Quick test via curl (5 queries, manual inspection)
- If improvement: run full benchmark with hardened prompt
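The quick test can also be scripted; `scan` flags the same failure modes the Phase 3 gates measure. The query list, server port, and verbosity threshold here are illustrative assumptions:

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"  # assumed port for the v1 NVFP4 server

HARDENED_RULES = """CRITICAL RULES (violations will cause errors):
1. NEVER output thinking, reasoning, or analysis text
2. NEVER use markdown: no **, no #, no -, no `, no numbered lists
3. When tools are needed, emit ONLY the tool_call JSON — no explanatory text
4. Respond in 1-3 sentences maximum"""

QUERIES = ["Hi Annie", "What's the capital of France?",
           "What did Mom say about dinner?", "Any news on the weather?",
           "Summarize our last chat"]

def scan(text):
    """Flag thinking leak, markdown leak, and verbosity in one response."""
    flags = []
    if "<think>" in text:
        flags.append("thinking")
    if any(marker in text for marker in ("**", "# ", "`")):
        flags.append("markdown")
    if len(text.split()) > 80:  # rough verbosity proxy
        flags.append("verbose")
    return flags

def quick_test(system_prompt=HARDENED_RULES):
    for q in QUERIES:
        body = json.dumps({"model": "local", "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": q}]}).encode()
        req = urllib.request.Request(URL, body,
                                     {"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            text = json.load(resp)["choices"][0]["message"]["content"] or ""
        print(f"{q!r}: {scan(text) or 'clean'}")
```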

---

## Phase 5: Publish & Document

**Goal:** Publish best model on HuggingFace + blog

### 5.1 HuggingFace model card
- Repo: `rajesh/Qwen3.5-9B-Opus-Distilled-NVFP4-v2`
- Tags: `dgx-spark`, `blackwell`, `nvfp4`, `qwen3.5`, `sm121`
- Include: quantization methodology, calibration details, benchmark results, Docker serving recipe

### 5.2 Blog post outline
1. The problem: Running NVFP4 on DGX Spark SM 12.1
2. The journey: 12 failed approaches → Docker cu130-nightly
3. The finding: naive FP4 preserves knowledge but kills instruction-following
4. The fix: calibration-aware quantization with synthetic conversation data
5. The recipe: complete step-by-step for any model on DGX Spark

### 5.3 Update research doc
- Add v2 quantization results to `docs/RESEARCH-NVFP4-VS-Q4KM.md`

---

## Serving Recipe (proven, from session 339)

```bash
# Works for ANY NVFP4 model on DGX Spark GB10 SM 12.1
docker run -d --name vllm-model \
  --runtime=nvidia --gpus all \
  -v /path/to/model:/model \
  -p 8000:8000 \
  --shm-size=16g \
  vllm/vllm-openai:cu130-nightly \
  --model /model \
  --quantization modelopt \
  --enforce-eager \
  --language-model-only \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768

# Requires:
# 1. config.json wrapped in VL structure (Qwen3_5ForConditionalGeneration)
# 2. preprocessor_config.json (Qwen3VLProcessor + Qwen2VLImageProcessorFast)
# 3. tokenizer_config.json with tokenizer_class: Qwen2Tokenizer
```

---

## Models on Titan

| Path | Size | Status |
|------|------|--------|
| `~/models/Qwen3.5-9B-Opus-Distilled-NVFP4/` | 7.5 GB | v1 (naive quant, quality issues) |
| `~/models/Qwen3.5-9B-Opus-Distilled-BF16/` | ~18 GB | Source model for re-quantization |
| `~/models/Qwen3.5-27B-NVFP4/` | 19 GB | 27B instruct (good quality, slow decode) |
| `~/models/Qwen3.5-9B-NVFP4/` | ~9 GB | AxionML base 9B (gibberish without thinking) |
