# CLAUDE CODE EXECUTION PLAN: Calibration-Aware NVFP4 Quantization

## READ FIRST: Context Files

Before executing, read these files to understand the full context:
- `NVFP4-RESEARCH-JOURNEY.md` — Complete research history and findings
- `scripts/benchmark_quant_v3.py` — The benchmark script with proper methodology
- `scripts/generate_calibration_data.sh` — Opus 4.6 calibration data generator
- `docs/RESEARCH-NVFP4-VS-Q4KM.md` — Current benchmark results

## What's Running on Titan (DGX Spark)

- **Port 8003:** llama-server with Opus-Distilled Q4_K_M (DO NOT STOP — production)
- **Port 8000:** May have a vLLM/SGLang instance from previous testing (kill if present)
- **Envs:** `~/sglang-env` (SGLang from source), `~/vllm-env` (if created)
- **Models:** `~/models/` contains various downloaded models
- **Docker:** `vllm/vllm-openai:cu130-nightly` image available

## The Problem We're Solving

We successfully quantized the Opus-Distilled 9B model to NVFP4 and confirmed:
- TTFT: 90ms constant (vs Q4_K_M's 300-1600ms) — **NVFP4 wins 3-12x**
- Decode: 33-38 tok/s (matches Q4_K_M) — **no speed penalty**
- Knowledge/reasoning: 5/5 factual, 5/5 reasoning — **perfectly preserved**

BUT the behavioral fine-tuning is degraded:
- Tool calling: 0/10 (can't emit JSON format)
- Markdown: leaks in 80% of responses
- Thinking tokens: leak in 20% of responses
- Verbosity: 2-6x more tokens than needed

Root cause: naive quantization with CNN/DailyMail calibration data destroyed the
Opus distillation's behavioral patterns. Fix: calibration-aware quantization with
Annie-specific data generated by Opus 4.6.

---

## Phase 1: Generate Calibration Data with Opus 4.6

### 1.1 Verify Claude Code CLI works

```bash
echo "Say hello in one sentence, no markdown" | claude -p --model claude-opus-4-6 --output-format text --max-turns 1
```

If `--model` flag isn't supported, check `claude --help` and adapt. The key is
using Opus 4.6 (not Sonnet) as the teacher.

### 1.2 Run the calibration data generator

```bash
cd ~/annie-voice
bash scripts/generate_calibration_data.sh
```

This generates 500 conversations in 10 batches across 5 categories:
- 100 greetings/casual (no tools)
- 100 factual Q&A (no tools)
- 100 memory-triggered (must call search_memory)
- 100 web search (must call web_search)
- 100 multi-turn (5 turns each, no tools)

Output: `data/calibration/annie_calibration.jsonl`
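The authoritative schema is whatever `generate_calibration_data.sh` emits; as a rough sketch (field names here are assumptions, verify against the script), each line should be one chat-format conversation, with tool-calling examples carrying a `tool_calls` array:

```python
import json

# Hypothetical entry shapes — field names are assumptions; check them
# against the actual output of generate_calibration_data.sh.
no_tool_entry = {
    "messages": [
        {"role": "system", "content": "You are Annie. Never use markdown."},
        {"role": "user", "content": "Morning! Sleep well?"},
        {"role": "assistant", "content": "Like a rock. What's on deck today?"},
    ]
}
tool_entry = {
    "messages": [
        {"role": "user", "content": "What do you remember about my robot project?"},
        {"role": "assistant", "content": "",
         "tool_calls": [{"type": "function", "function": {
             "name": "search_memory",
             "arguments": json.dumps({"query": "robot project"})}}]},
    ]
}

# Each conversation is one compact JSON line in the .jsonl file
for entry in (no_tool_entry, tool_entry):
    print(json.dumps(entry))
```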

### 1.3 Validate the calibration data

```bash
# Count valid entries. Validate per line: a single bad line makes a
# whole-file `jq` call abort, which silently undercounts.
TOTAL=$(wc -l < data/calibration/annie_calibration.jsonl)
VALID=0
while IFS= read -r line; do
  echo "$line" | jq -e . >/dev/null 2>&1 && VALID=$((VALID + 1))
done < data/calibration/annie_calibration.jsonl
TOOLS=$(grep -c '"tool_calls"' data/calibration/annie_calibration.jsonl)
echo "Total: $TOTAL, Valid JSON: $VALID, With tools: $TOOLS"

# Spot-check a few entries
head -3 data/calibration/annie_calibration.jsonl | jq .
```

**Must have:** ≥400 valid JSON lines, ≥80 with tool_calls, 0 markdown in responses.
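The three thresholds above can be checked mechanically instead of by eye. A minimal sketch (assumes assistant text lives under `messages[*].content`; adjust to the actual schema):

```python
import json
import re

# Heuristic markdown detector: bold, headers, bullets, fences, numbered lists
MARKDOWN_RE = re.compile(r"(\*\*|^#|^- |```|^\d+\. )", re.MULTILINE)

def validate_calibration(lines):
    """Return (valid, with_tools, with_markdown) counts for JSONL lines."""
    valid = with_tools = with_markdown = 0
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        valid += 1
        if '"tool_calls"' in line:
            with_tools += 1
        for msg in entry.get("messages", []):
            if msg.get("role") == "assistant" and MARKDOWN_RE.search(msg.get("content") or ""):
                with_markdown += 1
                break
    return valid, with_tools, with_markdown

sample = [
    json.dumps({"messages": [{"role": "assistant", "content": "Tokyo is the capital."}]}),
    json.dumps({"messages": [{"role": "assistant", "content": "", "tool_calls": []}]}),
    json.dumps({"messages": [{"role": "assistant", "content": "**Bold** leak"}]}),
    "not json at all",
]
print(validate_calibration(sample))  # → (3, 1, 1)
```

Gate on the real file with: valid ≥ 400, with_tools ≥ 80, with_markdown == 0.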

### 1.4 If batches fail

Individual batches may fail if Opus wraps output in markdown fences. Re-run
specific batches or manually clean the output:

```bash
# Remove markdown fences if present
sed -i '/^```/d' data/calibration/batch_*.jsonl
# Re-merge, keeping only lines that parse as JSON. Don't pipe the whole
# stream through one jq call: jq aborts at the first bad line and drops
# everything after it.
: > data/calibration/annie_calibration.jsonl
cat data/calibration/batch_*.jsonl | while IFS= read -r line; do
  echo "$line" | jq -ce . >> data/calibration/annie_calibration.jsonl 2>/dev/null
done
```

---

## Phase 2: Create Calibration-Aware Quantization Script

### 2.1 Create `scripts/quantize_nvfp4_v2.py`

This script must:

1. Load the BF16 Opus-Distilled model from `~/models/Qwen3.5-9B-Opus-Distilled-BF16/`
2. Load calibration data from `data/calibration/annie_calibration.jsonl`
3. Tokenize calibration data with Qwen3.5 tokenizer
4. Apply ModelOpt quantization with the specified strategy
5. Export to safetensors format compatible with vLLM/SGLang
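Steps 2-3 can be sketched as below. The real script should render conversations with the Qwen3.5 tokenizer's `apply_chat_template`; a trivial stand-in template keeps the sketch self-contained:

```python
import json

def render_chat(messages):
    """Stand-in for tokenizer.apply_chat_template(..., tokenize=False).
    The real script must use the Qwen3.5 tokenizer's template instead."""
    return "\n".join(f"<|{m['role']}|>{m.get('content', '')}" for m in messages)

def load_calibration_texts(path, max_samples=512):
    """Return up to max_samples rendered conversations from a JSONL file,
    skipping lines that fail to parse."""
    texts = []
    with open(path) as f:
        for line in f:
            if len(texts) >= max_samples:
                break
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            texts.append(render_chat(entry.get("messages", [])))
    return texts
```

In the container, the returned texts would then go through the tokenizer and into the calibration dataloader consumed by the forward loop.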

### 2.2 ModelOpt API reference

ModelOpt is pre-installed in the NGC container (`nvcr.io/nvidia/pytorch:25.11-py3`).
Key APIs (v0.37.0+):

```python
import copy
import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

# Strategy A: AWQ-Lite with calibration (deepcopy so the shared preset
# isn't mutated by the exclusions below)
config = copy.deepcopy(mtq.NVFP4_AWQ_LITE_CFG)

# Strategy B: MLP-Only (attention stays BF16)
config = copy.deepcopy(mtq.NVFP4_MLP_ONLY_CFG)

# Strategy C: Auto-mixed precision
# Uses modelopt.torch.quantization.auto_quantize()

# Layer exclusions live under the config's "quant_cfg" dict, not the top level:
config["quant_cfg"]["*linear_attn*"] = {"enable": False}
config["quant_cfg"]["*layers.0.*"] = {"enable": False}
config["quant_cfg"]["*layers.31.*"] = {"enable": False}
config["quant_cfg"]["*embed_tokens*"] = {"enable": False}

# Calibration forward loop (inference only, so skip gradient tracking)
def forward_loop(model):
    with torch.no_grad():
        for batch in calibration_dataloader:
            model(batch["input_ids"].cuda())

# Quantize
model = mtq.quantize(model, config, forward_loop)

# Export
export_hf_checkpoint(model, output_dir)
```

### 2.3 Three strategies to implement

| Strategy | Config | Algorithm | Layer exclusions | Expected size |
|----------|--------|-----------|------------------|---------------|
| A: AWQ-Lite | `NVFP4_AWQ_LITE_CFG` | awq_lite | linear_attn, layers.0, layers.31, embed, lm_head | ~7.5 GB |
| B: MLP-Only | `NVFP4_MLP_ONLY_CFG` | awq_lite | Same + all attention stays BF16 | ~8-9 GB |
| C: Auto-Mixed | `auto_quantize(effective_bits=4.8)` | kl_div | Automatic | ~8-10 GB |
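A small dispatch helper keeps the three strategies in one place in the script. The config names and exclusions restate the table above; Strategy C goes through `auto_quantize()`, whose parameters vary by ModelOpt version, so it is left as a marker rather than a guessed call:

```python
# Strategy dispatch for quantize_nvfp4_v2.py — a sketch, not the final script.
# Exclusion patterns restate the table; verify them against the actual
# module names of the loaded model.
DEFAULT_EXCLUSIONS = ["*linear_attn*", "*layers.0.*", "*layers.31.*",
                      "*embed_tokens*", "*lm_head*"]

STRATEGIES = {
    "awq_lite": {"cfg_name": "NVFP4_AWQ_LITE_CFG", "exclusions": DEFAULT_EXCLUSIONS},
    "mlp_only": {"cfg_name": "NVFP4_MLP_ONLY_CFG", "exclusions": DEFAULT_EXCLUSIONS},
    # Strategy C uses mtq.auto_quantize() instead of a static config;
    # check the installed ModelOpt's docstring for its exact signature.
    "auto": {"cfg_name": None, "effective_bits": 4.8},
}

def resolve_strategy(name):
    if name not in STRATEGIES:
        raise ValueError(f"unknown strategy {name!r}, expected one of {sorted(STRATEGIES)}")
    return STRATEGIES[name]
```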

### 2.4 CLI interface

```bash
python scripts/quantize_nvfp4_v2.py \
  --strategy awq_lite \
  --model ~/models/Qwen3.5-9B-Opus-Distilled-BF16 \
  --calib-data data/calibration/annie_calibration.jsonl \
  --calib-samples 512 \
  --exclude-layers "linear_attn,layers.0,layers.31" \
  --output ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2a
```
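An argparse skeleton matching that interface (flag names taken from the command above; defaults are assumptions):

```python
import argparse
import os

def build_parser():
    p = argparse.ArgumentParser(description="Calibration-aware NVFP4 quantization")
    p.add_argument("--strategy", choices=["awq_lite", "mlp_only", "auto"], required=True)
    p.add_argument("--model", type=os.path.expanduser, required=True)
    p.add_argument("--calib-data", required=True)
    p.add_argument("--calib-samples", type=int, default=512)
    p.add_argument("--exclude-layers", default="linear_attn,layers.0,layers.31",
                   help="comma-separated substrings of layers to keep in BF16")
    p.add_argument("--output", type=os.path.expanduser, required=True)
    return p

args = build_parser().parse_args([
    "--strategy", "mlp_only",
    "--model", "~/models/Qwen3.5-9B-Opus-Distilled-BF16",
    "--calib-data", "data/calibration/annie_calibration.jsonl",
    "--output", "/tmp/out",
])
# Turn the comma list into ModelOpt-style wildcard patterns
exclusions = [f"*{name}*" for name in args.exclude_layers.split(",")]
```

Note the `type=os.path.expanduser` on the path flags: it makes `~/models/...` work even when the shell has not expanded it (e.g. when paths come from a config file).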

### 2.5 Run inside NGC container

```bash
docker run --gpus all --rm --ipc=host \
  -v ~/models:/models \
  -v ~/annie-voice:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "
    pip install transformers accelerate datasets &&
    python /workspace/scripts/quantize_nvfp4_v2.py \
      --strategy mlp_only \
      --model /models/Qwen3.5-9B-Opus-Distilled-BF16 \
      --calib-data /workspace/data/calibration/annie_calibration.jsonl \
      --output /models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2b
  "
```

### 2.6 Verification

After quantization:
```bash
ls -lh ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2*/
# Should contain: config.json, tokenizer files, safetensors, quantization_config
```

Check that excluded layers are NOT quantized:
```bash
python -c "
import json, os
# open() does not expand '~' by itself — expanduser is required here
path = os.path.expanduser('~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2b/hf_quant_config.json')
with open(path) as f:
    cfg = json.load(f)
print('Excluded layers:', [k for k, v in cfg.items() if isinstance(v, dict) and not v.get('enable', True)])
"
```
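A deeper check is to open the safetensors shards and confirm the excluded layers kept a 16-bit dtype. The audit logic, written against a plain name→dtype mapping so it is easy to adapt (the tensor naming shown is an assumption about the export, verify against the real shards):

```python
def audit_excluded_layers(tensor_dtypes, excluded_substrings):
    """Given {tensor_name: dtype_str}, return names that match an excluded
    pattern but are NOT stored in a 16-bit dtype (i.e. were quantized anyway)."""
    ok_dtypes = {"bfloat16", "float16", "BF16", "F16"}
    violations = []
    for name, dtype in tensor_dtypes.items():
        if any(sub in name for sub in excluded_substrings) and dtype not in ok_dtypes:
            violations.append(name)
    return violations

# In practice the mapping would come from safetensors, e.g.:
#   from safetensors import safe_open
#   with safe_open(shard, framework="pt") as f:
#       dtypes = {k: str(f.get_tensor(k).dtype).replace("torch.", "") for k in f.keys()}
sample = {
    "model.layers.0.mlp.gate_proj.weight": "bfloat16",
    "model.layers.5.mlp.gate_proj.weight": "uint8",
    "model.layers.31.self_attn.q_proj.weight": "uint8",  # should NOT be quantized
}
print(audit_excluded_layers(sample, ["layers.0.", "layers.31."]))
# → ['model.layers.31.self_attn.q_proj.weight']
```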

---

## Phase 3: Serve and Benchmark Each Strategy

### 3.1 Prepare model for serving

Each quantized model needs the VL wrapper config for Qwen3.5:

```bash
# Copy preprocessor_config.json (text-only stub)
cp ~/models/Qwen3.5-27B-NVFP4/preprocessor_config.json ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2b/
```

If the config.json has `"model_type": "qwen3_5_text"`, it may need wrapping in a
VL-style config for the server. Check the session 339 recipe for this.

### 3.2 Serve with vLLM Docker (proven recipe)

```bash
docker run --gpus all -d --name vllm-nvfp4-v2 \
  --ipc=host -p 8000:8000 \
  -v ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2b:/model \
  vllm/vllm-openai:cu130-nightly \
  --model /model \
  --quantization modelopt \
  --enforce-eager \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --language-model-only \
  --port 8000 --host 0.0.0.0 \
  --gpu-memory-utilization 0.85
```

Verify:
```bash
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```

### 3.3 Quick smoke test

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [
      {"role": "system", "content": "You are Annie, Rajesh'\''s personal AI companion. Speak naturally and concisely. Never use markdown."},
      {"role": "user", "content": "What is the capital of Japan?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

Must return: coherent response containing "Tokyo", no markdown, no thinking preamble.
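Those three conditions can be asserted automatically instead of eyeballed. A checker over the response text (heuristics only; the `<think>` tag and preamble phrasings are assumptions based on Qwen-style templates):

```python
import re

def smoke_check(text, required="Tokyo"):
    """Return a list of failure reasons for a smoke-test response; empty if clean."""
    failures = []
    if required not in text:
        failures.append(f"missing required answer {required!r}")
    # Markdown leak: bold, headers, fences, bullets, numbered lists
    if re.search(r"(\*\*|^#|```|^- |^\d+\. )", text, re.MULTILINE):
        failures.append("markdown leaked")
    # Thinking leak: raw think tag or a reasoning-style opening
    if "<think>" in text or text.lstrip().lower().startswith(("okay, the user", "let me think")):
        failures.append("thinking preamble leaked")
    return failures

print(smoke_check("The capital of Japan is Tokyo."))  # → []
print(smoke_check("<think>User asks...</think> **Tokyo** is the capital."))
# → ['markdown leaked', 'thinking preamble leaked']
```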

### 3.4 Run full v3 benchmark

```bash
python scripts/benchmark_quant_v3.py \
  --backend-a nvfp4_v2 http://localhost:8000 \
  --backend-b q4km_llama http://localhost:8003 \
  --warmup 2 --runs 5 \
  --output scripts/benchmark_9b_opus_nvfp4_v2b_results.json
```

### 3.5 Quality gates (must pass)

| Gate | Threshold | v1 result (to beat) |
|------|-----------|---------------------|
| Thinking leak | 0/25 (0%) | 5/25 (20%) |
| Tool calling | ≥6/10 (60%) | 0/10 (0%) |
| No markdown | ≥12/15 (80%) | 3/15 (20%) |
| Factual | 5/5 (100%) | 5/5 (100%) — maintain |
| Reasoning | 5/5 (100%) | 5/5 (100%) — maintain |
| TTFT | ≤120ms | 90ms — maintain |
| Decode | ≥25 tok/s | 33 tok/s — maintain |
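The gate table translates directly into an automated check on the benchmark output. A sketch (the results-dict keys are assumptions; map them to whatever `benchmark_quant_v3.py` actually writes):

```python
# Thresholds from the gate table above: (metric_key, op, threshold)
GATES = [
    ("thinking_leaks", "<=", 0),    # out of 25
    ("tool_calls_ok",  ">=", 6),    # out of 10
    ("no_markdown_ok", ">=", 12),   # out of 15
    ("factual_ok",     ">=", 5),    # out of 5
    ("reasoning_ok",   ">=", 5),    # out of 5
    ("ttft_ms",        "<=", 120),
    ("decode_tok_s",   ">=", 25),
]

def check_gates(results):
    """Return {metric: passed} for every gate; keys absent from results fail."""
    ops = {"<=": lambda a, b: a <= b, ">=": lambda a, b: a >= b}
    return {key: key in results and ops[op](results[key], threshold)
            for key, op, threshold in GATES}

# The v1 column from the table: passes the perf/knowledge gates,
# fails all three behavioral ones.
v1 = {"thinking_leaks": 5, "tool_calls_ok": 0, "no_markdown_ok": 3,
      "factual_ok": 5, "reasoning_ok": 5, "ttft_ms": 90, "decode_tok_s": 33}
status = check_gates(v1)
print(all(status.values()))  # → False
```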

### 3.6 Repeat for each strategy

Run strategies A, B, C. Compare results. Pick the best quality/speed tradeoff.

---

## Phase 4: Prompt Engineering Quick Test (10 minutes, do first)

### 4.1 Test hardened prompt on existing v1 NVFP4

Before re-quantizing, test if a stronger system prompt fixes issues on the
existing naive NVFP4 model:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [
      {"role": "system", "content": "You are Annie. CRITICAL RULES (violations cause errors):\n1. NEVER output thinking, reasoning, or analysis text\n2. NEVER use markdown: no **, no #, no -, no backticks, no numbered lists\n3. When tools are needed, emit ONLY the tool_call JSON with no explanatory text\n4. Respond in 1-3 sentences maximum\n5. Be warm and conversational"},
      {"role": "user", "content": "What is the capital of Japan?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "tools": [],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

### 4.2 Test tool calling with hardened prompt

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model",
    "messages": [
      {"role": "system", "content": "You are Annie. When the user asks about past conversations, you MUST call search_memory. When they ask about current events, you MUST call web_search. Emit ONLY the tool call, no text."},
      {"role": "user", "content": "What do you remember about my robot project?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7,
    "tools": [{"type":"function","function":{"name":"search_memory","description":"Search conversation history","parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}}],
    "tool_choice": "auto",
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

### 4.3 Interpret results

- If tool calling works with hardened prompt → behavioral signal exists, just needs stronger activation. Lighter quantization (Strategy A) may suffice.
- If still broken → signal is truly destroyed. Need Strategy B (MLP-Only) or C (Auto-Mixed).

---

## Phase 5: Publish to HuggingFace

### 5.1 Upload best model

```bash
huggingface-cli upload rajesh/Qwen3.5-9B-Opus-Distilled-NVFP4-v2 \
  ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-v2b/ \
  --repo-type model
```

### 5.2 Model card content

Include:
- Source model: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
- Quantization: NVFP4 via ModelOpt with AWQ-Lite/MLP-Only calibration
- Calibration: 500 Annie conversations generated by Claude Opus 4.6
- Benchmark results: TTFT, decode speed, quality gates
- Serving recipe: vLLM Docker on DGX Spark
- Tags: dgx-spark, blackwell, nvfp4, qwen3.5, opus-distilled

### 5.3 Update research doc

Add final results to `docs/RESEARCH-NVFP4-VS-Q4KM.md`

---

## Common Errors and Fixes

| Error | Fix |
|-------|-----|
| `qwen3_5_text not recognized` | Upgrade transformers to 5.3+: `pip install 'transformers>=5.3'` (quote it, or the shell treats `>` as a redirect) |
| `400 Bad Request` on tool calls | Add `--enable-auto-tool-choice --tool-call-parser qwen3_coder` |
| CuDNN compatibility warning | `pip install nvidia-cudnn-cu12==9.16.0.29` or `SGLANG_DISABLE_CUDNN_CHECK=1` |
| CUDA graphs fail on SM121 | Use `--enforce-eager` (mandatory on DGX Spark) |
| Conv3d / preprocessor errors | Use `--language-model-only` for text-only workloads |
| Process killed during load | OOM — stop other models first, reduce `--mem-fraction-static` |
| SGLang pins transformers 4.57 | Upgrade with `pip install 'transformers>=5.3' --no-deps` after SGLang install |
| Empty responses from NVFP4 | Thinking consuming all tokens — add `enable_thinking: false` to request |
| Gibberish output | Base model can't work without thinking — need Opus-Distilled model |

---

## Decision Tree

```
Is the v2 quantized model serving?
├── YES → Run smoke test (Tokyo question)
│   ├── Coherent response → Run full v3 benchmark
│   │   ├── Quality gates pass → SHIP IT (Phase 5)
│   │   └── Quality gates fail → Try next strategy (A→B→C)
│   └── Gibberish → Check: is this Opus-Distilled or base model?
│       ├── Base model → WRONG MODEL, need Opus-Distilled BF16
│       └── Opus-Distilled → Quantization too aggressive, try Strategy C
└── NO → Check serving error
    ├── transformers version → pip install 'transformers>=5.3'
    ├── CUDA/SM121 → --enforce-eager + FLASHINFER_CUTLASS
    ├── Memory → --language-model-only + reduce mem-fraction
    └── preprocessor → add preprocessor_config.json stub
```
