# NVFP4 QAT Cookbook — Voice AI on DGX Spark

**Last Updated:** 2026-03-16 (v4 adversarial training added)
**Hardware:** DGX Spark GB10 (Blackwell SM 12.1, 128 GB unified CPU+GPU memory)
**Model:** Qwen3.5-9B-Claude-Opus-Distilled → NVFP4 QAT
**Sessions:** 337-345+

---

## The Journey: PTQ → QAT v1 → v2 → v3

### What We Tried and Why It Failed

| Attempt | Method | Result | Root Cause |
|---------|--------|--------|------------|
| v1 PTQ (DEFAULT_CFG) | max algorithm, CNN DailyMail calibration | Tool calling 0/10, thinking leak 20%, markdown 80% | "max" algorithm rounds away the behavioral fine-tuning signal |
| v2a PTQ (AWQ-Lite) | awq_lite algorithm, Annie calibration | MIXED_PRECISION format, vLLM can't serve | AWQ-Lite produces incompatible quantization format |
| v2d PTQ (DEFAULT_CFG + Annie data) | max algorithm, Annie calibration | Identical failures to v1 | Calibration data domain is irrelevant for PTQ |
| Prompt engineering on v1 | 7 hardened prompts | 43% reliability (3/7 pass) | Signal partially present but unreliable |
| QAT v1 (LoRA) | 1 epoch, LoRA rank 16, full sequence loss | Loss dropped 24% but same failures | `merge_and_unload()` destroys QAT state |
| QAT v2 (full fine-tune) | 3 epochs, assistant-only loss, full FT | **Thinking 0%, markdown 0%, tools 90%** | All 3 v1 bugs fixed |
| QAT v3 (behavioral) | 5 epochs, 1000 targeted conversations | 45% behavioral pass (15 markdown, 9 tool_call XML, 9 JSON leak) | Stale system prompt in training data (missing emoji ban, no-followup, `<rules>`) |
| QAT v4 (adversarial) | 5 epochs, 1000 adversarial conversations | **10/10 behavioral pass, 0 emoji, 0 markdown** | 9 adversarial categories, current production prompt + `<rules>`, loss 1.307→0.329 |

### The Key Insight

**QAT = SFT with quantization active.** One training run does both behavioral training AND quantization adaptation. No separate SFT step needed.

The fake quantizers (simulated FP4) are active during the forward pass. The model "sees" FP4 precision and learns to produce correct outputs despite quantization noise. Gradients flow through a straight-through estimator to full-precision weights, which adapt to survive FP4 rounding.

**FP4 preserves knowledge but destroys manners.** PTQ keeps factual accuracy and reasoning (100%) but destroys formatting compliance, tool calling, and personality. QAT recovers all of them.
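
To make the fake-quantization step concrete, here is a minimal pure-Python sketch of NVFP4 (E2M1) rounding for a single weight block. It is illustrative only: the real forward pass uses ModelOpt's FakeQuantize modules with per-block scales, and `fake_quant_block` is a name invented for this sketch.

```python
# Illustrative sketch, NOT the ModelOpt implementation: simulate what one
# weight block "sees" during the NVFP4 QAT forward pass.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def fake_quant_block(block):
    """Scale the block so its max maps to 6.0, round to the FP4 grid, rescale."""
    amax = max(abs(w) for w in block) or 1.0
    scale = amax / 6.0
    def q(w):
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        return (mag if w >= 0 else -mag) * scale
    return [round(v, 6) for v in (q(w) for w in block)]

print(fake_quant_block([0.91, -0.30, 0.10, 1.20]))
```

During QAT every output the model produces is conditioned on weights that have passed through this kind of rounding, which is why the full-precision weights can drift toward values that survive it.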

---

## Hardware

- **GPU:** NVIDIA GB10 (Blackwell SM 12.1)
- **Memory:** 128 GB unified CPU+GPU (no discrete GPU VRAM)
- **Training:** ~53 min for 430 samples × 3 epochs (v2), ~3.5h estimated for 1000 × 5 (v3)
- **Serving:** vLLM Docker with FlashInfer SM121 backend
- **Serving VRAM:** 7.55 GB model weights at 0.15 gpu-memory-utilization; ~19.5 GB total with KV cache

### Memory Budget for Training

| Phase | GPU Memory | Notes |
|-------|-----------|-------|
| Model load (BF16) | 16.7 GB | 80.8s to load |
| After quantizer insertion | ~18 GB | 843 FakeQuantize modules |
| During training | ~82-92 GB | With gradient checkpointing |
| Peak (optimizer states) | ~90 GB | AdamW: 2x model for momentum + variance |

**Critical:** Must stop vLLM before training — it holds GPU memory that causes OOM.
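
The peak figure is consistent with a rough back-of-envelope (a sketch under the assumption of BF16 weights and FP32 AdamW states; not a measured breakdown):

```python
# Back-of-envelope memory budget (assumptions: BF16 weights at 2 bytes/param,
# FP32 AdamW momentum + variance at 4 bytes/param each).
params = 9.0e9                                # 9B-parameter model
bf16_weights = params * 2 / 2**30             # ~16.8 GiB, matches the load figure
adamw_states = params * 2 * 4 / 2**30         # ~67 GiB for momentum + variance
print(round(bf16_weights + adamw_states, 1))  # steady-state floor, before grads/activations
```

That floor lands inside the observed 82-92 GB training range; gradients and activations account for the transient remainder.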

---

## Dataset Design

### v2 Dataset (430 conversations)
- 5 categories: greetings, factual, memory_tools, web_search_tools, multiturn
- Generated by Opus 4.6 teacher via Claude Code CLI (`claude -p --model opus`)
- Simplified 4-tool set, shortened system prompt
- 47% include tool_calls

### v3 Dataset (1000 conversations)
- **7 categories** targeting 10 behavioral issues from 514-message profiling
- **Full production system prompt** from `bot.py:115-170`
- **Full 17-tool production set** (not simplified)
- **No-thinking pattern:** `<think>\n</think>\n` prefix on all assistant responses

| Category | Count | Purpose | Key Pattern |
|----------|-------|---------|-------------|
| `topic_switch` | 200 | Fix topic persistence | User changes subject, Annie follows immediately |
| `direct_action` | 200 | Fix confirmation-before-action | Annie calls tools immediately, no "shall I?" |
| `concise_voice` | 200 | Fix verbose responses | Every response 1-2 sentences max |
| `honest_memory` | 100 | Fix fabricated memories | "I don't remember" when search_memory returns empty |
| `rajesh_context` | 100 | Fix name/city/identity | Always "Rajesh", Bangalore, DGX Spark |
| `kannada_culture` | 100 | Fix Kannada + Indian context | Correct greetings, festival knowledge |
| `mixed_multiturn` | 100 | Integration test | 5-turn conversations mixing all patterns |

### v4 Dataset (1000 adversarial conversations)

**Why v3 data failed:** The v3 system prompt was stale — it said "Keep responses concise (1-3 sentences)" and "Ask follow-up questions when appropriate", which contradicts the production prompt's "MAXIMUM 2 sentences" and "NEVER ask follow-up questions". The model trained on the wrong constraints.

**v4 fixes:**
1. **Current production prompt** from `bot.py:115-174` + `<rules>` block (4,140 chars total)
2. **3 new adversarial categories** targeting constraint resistance
3. **Stricter validation gates**: emoji=0, trailing questions <=5%, verbose <=5%
4. **`inject_production_prompt()`**: CLI generates with short prompt (avoids planning mode), then replaces system message in JSONL with full production prompt
5. **`--permission-mode default`** flag: overrides global `defaultMode: "plan"` setting that was causing CLI to enter planning mode

| Category | Count | Purpose | New in v4? |
|----------|-------|---------|------------|
| `topic_switch` | 150 | Fix topic persistence + no callbacks | Revised |
| `direct_action` | 150 | Call tools immediately, no "shall I?" | Revised |
| `concise_voice` | 200 | 1-2 sentences on long-answer topics | Revised (stricter) |
| `honest_memory` | 100 | "I don't remember" — no fabrication | Revised |
| `no_emoji_no_markdown` | 100 | Plain text on emotional prompts | **NEW** |
| `no_followup_questions` | 100 | Every response ends with `.` not `?` | **NEW** |
| `rajesh_context` | 100 | Name, Bangalore, robotics context | Revised |
| `kannada_culture` | 50 | Correct Kannada greetings | Same |
| `tool_result_responses` | 50 | Concise spoken summaries after tool results | **NEW** |

**Quality gates (v4 — stricter than v3):**

| Gate | v3 threshold | v4 threshold | Why stricter |
|------|-------------|-------------|-------------|
| Conversations | >= 900 | >= 900 | Same |
| Tool calls | >= 350 | >= 300 | Fewer tool categories |
| Think prefix | >= 95% | >= 95% | Same |
| Markdown | <= 10 | <= 5 | Near zero tolerance |
| Ends with `?` | not checked | **<= 5%** | v3 had 26% — catastrophic |
| `> 2` sentences | not checked | **<= 5%** | v3 had 23% |
| Emoji | not checked | **= 0** | Zero tolerance |
| System prompt length | not checked | **>= 2500** | Ensures full prompt embedded |
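
The per-response gates above can be sketched as a small checker. This is a hypothetical helper mirroring the v4 gates; the real checks live in `generate_qat_v4_data.py`'s `detailed_validation()`, and the names and regexes here are illustrative.

```python
import re

EMOJI_RE = re.compile(r'[\U0001F300-\U0001FAFF\u2600-\u27BF]')
MARKDOWN_RE = re.compile(r'\*\*|__|```|^#{1,6} |^[-*] ', re.MULTILINE)

def gate_violations(assistant_text):
    """Return the set of v4 gate violations for one assistant response."""
    issues = set()
    if not assistant_text.startswith("<think>"):
        issues.add("missing_think_prefix")
    body = assistant_text.split("</think>")[-1].strip()
    if EMOJI_RE.search(body):
        issues.add("emoji")                # v4: zero tolerance
    if MARKDOWN_RE.search(body):
        issues.add("markdown")
    if body.endswith("?"):
        issues.add("trailing_question")    # v4: <= 5% of responses
    if len(re.findall(r'[.!?](?:\s|$)', body)) > 2:
        issues.add("verbose")              # more than 2 sentences
    return issues
```

Run over every assistant turn in the JSONL and aggregate the counts against the thresholds in the table.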

### Data Format (OpenAI chat JSONL)

Standard conversation:
```json
{"messages": [
  {"role": "system", "content": "<full production system prompt>"},
  {"role": "user", "content": "What's water made of?"},
  {"role": "assistant", "content": "<think>\n</think>\nWater is H2O, two hydrogen atoms and one oxygen."}
]}
```

Tool-calling conversation:
```json
{"messages": [
  {"role": "system", "content": "<full production system prompt>"},
  {"role": "user", "content": "Search for NVIDIA stock price"},
  {"role": "assistant", "content": null, "tool_calls": [{"id": "call_001", "type": "function", "function": {"name": "web_search", "arguments": "{\"query\": \"NVIDIA stock price\"}"}}]},
  {"role": "tool", "tool_call_id": "call_001", "content": "NVIDIA (NVDA) is currently trading at $142.50..."},
  {"role": "assistant", "content": "<think>\n</think>\nNVIDIA is currently at $142.50, up 2.3% today."}
]}
```
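
One sanity check worth running on tool-calling records (an illustrative helper, not part of the listed scripts): every `tool` message must reference a `tool_call_id` emitted by an earlier assistant turn.

```python
def tool_links_ok(record):
    """True when every tool message references a previously issued tool_call id."""
    seen = set()
    for msg in record["messages"]:
        for call in (msg.get("tool_calls") or []):
            seen.add(call["id"])
        if msg["role"] == "tool" and msg.get("tool_call_id") not in seen:
            return False
    return True

record = {"messages": [
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_001", "type": "function",
                     "function": {"name": "web_search", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "call_001", "content": "..."},
]}
print(tool_links_ok(record))  # → True
```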

### Generation Method

Zero API cost with Claude Max subscription:
```bash
python3 scripts/generate_calibration_data_v3.py --per-batch 50 --model opus --timeout 900
```

- Uses `claude -p --model opus --output-format text --append-system-prompt "You are a JSONL data generator..."`
- Runs from `/tmp` to avoid project hooks
- 20 batches × 50 conversations = 1000 target
- Each batch takes ~60-90 seconds
- Built-in retry logic: empty batches auto-retry 2x with 3s delay
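
The retry logic can be sketched as follows. `run_teacher` and `generate_batch` are hypothetical names for this sketch; the subprocess command mirrors the CLI invocation shown above.

```python
import subprocess
import time

def run_teacher(prompt):
    """Invoke the Claude Code CLI in print mode (mirrors the command above)."""
    result = subprocess.run(
        ["claude", "-p", "--model", "opus", "--output-format", "text"],
        input=prompt, capture_output=True, text=True, timeout=900)
    return result.stdout

def generate_batch(prompt, retries=2, delay=3, runner=run_teacher):
    """Keep only JSONL conversation lines; retry empty batches after a delay."""
    for attempt in range(retries + 1):
        lines = [l for l in runner(prompt).splitlines()
                 if l.startswith('{"messages"')]
        if lines:
            return lines
        if attempt < retries:
            time.sleep(delay)
    return []
```

The `startswith('{"messages"')` filter is the same validation gate described in the gotchas below: anything that isn't a conversation line (planning chatter, apologies) is dropped, and an empty batch triggers a retry.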

### v4 Generation Method

```bash
# Prerequisites: disable stop hooks that intercept CLI output
chmod -x ~/.claude/hooks/stop-self-check.sh

# Generate (3-5 hours, $0 with Max subscription)
python3 scripts/generate_qat_v4_data.py --per-batch 25 --model opus --timeout 600

# Resume if interrupted
python3 scripts/generate_qat_v4_data.py --resume

# Validate without generating
python3 scripts/generate_qat_v4_data.py --validate-only

# Test single category first
python3 scripts/generate_qat_v4_data.py --category concise_voice --per-batch 3
```

**Key v4 script features:**
- `--permission-mode default` flag on CLI calls (overrides global `defaultMode: "plan"`)
- `inject_production_prompt()` replaces short prompt with full 4,140-char production prompt in JSONL
- Enhanced `detailed_validation()` checks emoji, markdown, verbose, follow-up questions, system prompt length
- Per-category WRONG/CORRECT example pairs in generation prompts (adversarial training principle)
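
The prompt-injection step can be sketched like this (illustrative; the real implementation lives in `generate_qat_v4_data.py`):

```python
import json

def inject_production_prompt(in_path, out_path, production_prompt):
    """Replace the short generation-time system message with the full production prompt."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            for msg in record["messages"]:
                if msg["role"] == "system":
                    msg["content"] = production_prompt
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

This is what lets the CLI generate with a short prompt (avoiding planning mode) while the training JSONL still carries the full 4,140-char production prompt.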

### Claude Code `-p` Mode Gotchas (Important!)

Claude Code in print mode (`-p`) will **enter planning mode** instead of outputting raw data if the prompt:
- Contains full JSON tool definitions (17 tools = ~3000 chars of JSON)
- Includes a very long system prompt (>1000 chars)
- Looks like a coding specification

**Symptoms:** Output says "Plan is ready for your review" or creates files instead of JSONL lines.

**Fixes applied in v3 script:**
1. Keep generation prompts SHORT (<3000 chars total)
2. Use condensed tool lists (just names + params), not full JSON schemas
3. Use shortened system prompt references, not the full production prompt
4. Strengthen `--append-system-prompt` with anti-planning instructions: "Do NOT create files. Do NOT write code. Do NOT make a plan."
5. Validate output contains `{"messages"` before accepting — retry if not
6. The JSONL output itself contains the full system prompt, but the generation prompt only needs a condensed version
7. **v4 fix:** Add `--permission-mode default` to override global `defaultMode: "plan"` in `~/.claude/settings.json` that forces every `-p` session into plan mode
8. **v4 fix:** Disable stop hooks (`chmod -x ~/.claude/hooks/stop-self-check.sh`) — they ask "Have you implemented...?" and Claude answers that instead of generating JSONL

---

## Training Recipe

### Prerequisites

```bash
# On DGX Spark (Titan)
# 1. Stop ALL GPU-consuming services to free memory
docker rm -f vllm-annie        # vLLM (~19 GB)
docker stop ollama              # Ollama (~42 GB if model loaded!)
# Check: nvidia-smi should show ~0% utilization

# 2. Pull NGC container (if not cached)
docker pull nvcr.io/nvidia/pytorch:25.11-py3

# 3. Ensure data is on Titan
git pull  # or scp data/calibration_v3/annie_calibration_v3.jsonl
```

### QAT v2 Command (proven recipe)

```bash
docker run --gpus all --rm -it --ipc=host --shm-size=16g \
  -v ~/models:/models \
  -v ~/workplace/her/her-os:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate datasets && \
  PYTHONUNBUFFERED=1 python3 -u /workspace/scripts/qat_nvfp4_v2.py \
    --model /models/Qwen3.5-9B-Claude-Opus-Distilled \
    --data /workspace/data/calibration/annie_calibration.jsonl \
    --output /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v2 \
    --epochs 3 --lr 1e-5 --simple-masking"
```

### QAT v3 Command (behavioral training — TWO-STEP: train then export)

**Step 1: Train only (saves checkpoints, skips export to avoid OOM)**
```bash
docker run --gpus all -d --name qat-train --ipc=host --shm-size=16g \
  -v ~/models:/models \
  -v ~/workplace/her/her-os:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate datasets && \
  PYTHONUNBUFFERED=1 python3 -u /workspace/scripts/qat_nvfp4_v2.py \
    --model /models/Qwen3.5-9B-Opus-Distilled-BF16 \
    --data /workspace/data/calibration_v3/annie_calibration_v3.jsonl \
    --output /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3 \
    --epochs 5 --lr 1e-5 --max-length 2048 \
    --calib-samples 64 --simple-masking --no-thinking \
    --train-only"
```

**Step 2: Export from checkpoint (separate container, fresh memory)**
```bash
# Wait for training to finish, then fix ownership
docker run --rm -v ~/models:/models alpine \
  chown -R 1000:1000 /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3_checkpoints/

# Find the last valid checkpoint
ls ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3_checkpoints/

# Export from that checkpoint
docker run --gpus all -d --name qat-export --ipc=host --shm-size=16g \
  -v ~/models:/models \
  -v ~/workplace/her/her-os:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate && \
  PYTHONUNBUFFERED=1 python3 -u /workspace/scripts/qat_nvfp4_v2.py \
    --model /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3_checkpoints/checkpoint-625 \
    --data /workspace/data/calibration_v3/annie_calibration_v3.jsonl \
    --output /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3 \
    --export-only --calib-samples 64 --no-thinking"
```

**Why two steps?** Training uses ~90 GB. Export needs to hold the model + write new files. On 128 GB unified memory, doing both in one container causes OOM (exit 137). Separating them gives each step the full 128 GB.

### QAT v4 Command (adversarial behavioral training — TWO-STEP)

**Step 1: Train only**
```bash
docker run --gpus all -d --name qat-v4-train --ipc=host --shm-size=16g \
  -v ~/models:/models -v ~/workplace/her/her-os:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate datasets && \
  PYTHONUNBUFFERED=1 python3 -u /workspace/scripts/qat_nvfp4_v2.py \
    --model /models/Qwen3.5-9B-Opus-Distilled-BF16 \
    --data /workspace/data/calibration_v4/annie_calibration_v4.jsonl \
    --output /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v4 \
    --epochs 5 --lr 1e-5 --max-length 2048 \
    --calib-samples 64 --simple-masking --no-thinking \
    --train-only"
```

**Step 2: Export from checkpoint**
```bash
docker run --gpus all -d --name qat-v4-export --ipc=host --shm-size=16g \
  -v ~/models:/models -v ~/workplace/her/her-os:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate && \
  PYTHONUNBUFFERED=1 python3 -u /workspace/scripts/qat_nvfp4_v2.py \
    --model /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v4_checkpoints/checkpoint-LAST \
    --data /workspace/data/calibration_v4/annie_calibration_v4.jsonl \
    --output /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v4 \
    --export-only --calib-samples 64 --no-thinking"
```

### Hyperparameters

| Parameter | v2 | v3 | v4 | Reason |
|-----------|----|----|----|--------|
| Epochs | 3 | 5 | 5 | More behavioral examples need more passes |
| Max seq len | 1024 | 2048 | 2048 | Full production prompt is ~800 tokens |
| Calib samples | 32 | 64 | 64 | More diverse calibration for better quantizer ranges |
| Dataset size | 430 | 1000 | 1000 | Same size, better quality (adversarial) |
| No-thinking | No | Yes | Yes | Prefix all assistant responses with empty think tags |
| Masking | simple | simple | simple | String-marker masking on assistant tokens only |
| LR | 1e-5 | 1e-5 | 1e-5 | Conservative — prevents catastrophic forgetting |
| Effective batch | 8 | 8 | 8 | Micro-batch 1 × 8 gradient-accumulation steps |
| Grad checkpoint | Yes | Yes | Yes | Saves ~18 GB memory |
| System prompt | v1 (short) | v3 (stale) | **v4 (current + rules)** | Must match production exactly |
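
"Simple masking" means labels are set to -100 for everything except assistant tokens, so only assistant output contributes to the loss. A minimal sketch follows; the marker strings are assumptions about the Qwen chat template, and it operates on pre-split marker strings rather than real token ids.

```python
# Assumed chat-template markers (illustrative, not the script's exact strings).
ASSISTANT_OPEN, TURN_CLOSE = "<|im_start|>assistant", "<|im_end|>"

def assistant_loss_mask(tokens):
    """1 = token contributes to the loss, 0 = masked (label -100 in practice)."""
    mask, inside = [], False
    for tok in tokens:
        if tok == ASSISTANT_OPEN:
            inside = True
            mask.append(0)        # don't train on the marker itself
        elif tok == TURN_CLOSE and inside:
            mask.append(1)        # do learn to emit the end-of-turn token
            inside = False
        else:
            mask.append(1 if inside else 0)
    return mask
```

System and user tokens still flow through the forward pass as context; they just carry no gradient signal.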

### Expected Loss Curve

From v2 (430 samples × 3 epochs):
```
Epoch 0.09: loss=1.588 (start)
Epoch 0.93: loss=0.924 (end epoch 1)
Epoch 1.11: loss=0.502 ← STEEP DROP at epoch boundary
Epoch 1.86: loss=0.547 (end epoch 2)
Epoch 2.60: loss=0.356 ← minimum
Epoch 2.97: loss=0.415 (final)
```

**Key pattern:** The epoch 1→2 boundary shows a steep loss drop. The second pass through the data is where behavioral recovery really happens.

---

## Export Gotchas (The Hard-Won Knowledge)

### Gotcha 1: Container auto-removal

**Never use `--rm` for long-running training.** The container exits before `export_hf_checkpoint` completes, and you lose the in-memory model state.

**Fix:** Use `-it` (interactive) or `-d` (detach), then manually clean up.

### Gotcha 1b: OOM during export corrupts the latest checkpoint

Training may complete successfully but OOM during the export/save step (exit code 137). The checkpoint being written at OOM time will have corrupted safetensors files ("incomplete metadata, file not fully covered"). Use the PREVIOUS checkpoint instead.

**v3 experience:** Training completed 625/625 steps. OOM during `Writing model shards`. checkpoint-624 was corrupted, but checkpoint-468 (step 468, epoch 3.74) was valid.

**Prevention:** Separate training and export into two Docker commands. Never let training auto-export in the same container.
1. Run training in container A — let it finish and save all checkpoints
2. Wait for the final checkpoint to be written (check `ls -la` for the checkpoint directory)
3. Run export in container B — load checkpoint → re-quantize → export

This avoids the peak memory spike (training state + export state simultaneously) that causes OOM.

**Fix for corrupted checkpoints:** Always check validity of every shard before loading:
```python
from pathlib import Path
from safetensors import safe_open

# Corrupt shards raise SafetensorError on open
for shard in sorted(Path("checkpoint-468").glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as sf:
        print(f"{shard.name}: OK, {len(sf.keys())} keys")
```

### Gotcha 2: HuggingFace Trainer drops FakeQuantize state

Checkpoint-162 contains `_amax` keys (FakeQuantize calibration values). But `AutoModelForCausalLM.from_pretrained()` creates a vanilla model without FakeQuantize wrappers, so these keys are UNEXPECTED and dropped.

**Fix:** You cannot `from_pretrained(checkpoint)` to get a quantized model. You must:
1. Load checkpoint (BF16 weights with QAT adjustments)
2. Re-insert quantizers via `mtq.quantize()`
3. Export via `export_hf_checkpoint()`

### Gotcha 3: Tokenizer version mismatch

NGC container has transformers 5.3.0 which exports tokenizer as `TokenizersBackend` class. vLLM Docker has transformers 4.57 which doesn't know this class.

**Fix:** Copy tokenizer files from a working model:
```bash
cp ~/models/Qwen3.5-27B-NVFP4/tokenizer*.json ~/models/YOUR_MODEL/
```

### Gotcha 4: VL config wrapper

NGC transformers 5.3 saves `model_type: "qwen3_5_text"`. vLLM expects `"qwen3_5"` with a VL wrapper.

**Fix:** Wrap `config.json`:
```python
import json
with open("config.json") as f:
    text_config = json.load(f)
wrapper = {
    "model_type": "qwen3_5",
    "text_config": text_config,
    "architectures": ["Qwen3_5ForConditionalGeneration"]
}
with open("config.json", "w") as f:
    json.dump(wrapper, f, indent=2)
```

### Gotcha 5: preprocessor_config.json stub

vLLM conditional generation expects this file to exist.

**Fix:** `cp ~/models/Qwen3.5-27B-NVFP4/preprocessor_config.json YOUR_MODEL/`

### Gotcha 6: Docker file ownership

Files created inside NGC container are owned by `root:root`. Can't modify from host.

**Fix:** `docker run --rm -v ~/models:/models alpine chown -R 1000:1000 /models/YOUR_MODEL/`

### Gotcha 7: quant_algo must be NVFP4

Check `hf_quant_config.json`. If it says `MIXED_PRECISION`, vLLM can't serve it. Only `NVFP4` works.

**Fix:** Use `NVFP4_DEFAULT_CFG` (not AWQ-Lite).

---

## Serving Recipe

### Start vLLM

```bash
docker run -d --name vllm-annie --gpus all --runtime=nvidia \
  -v ~/models:/models -p 8003:8000 --shm-size=16g \
  --restart unless-stopped --entrypoint vllm \
  hellohal2064/vllm-qwen3.5-gb10:latest \
  serve /models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v4 \
  --served-model-name qwen3.5-9b \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --language-model-only --trust-remote-code \
  --port 8000 --gpu-memory-utilization 0.15 --max-model-len 32768
```

### Key Serving Flags

| Flag | Purpose |
|------|---------|
| `--quantization modelopt_fp4` | Load NVFP4 weights (NOT `modelopt`) |
| `--attention-backend flashinfer` | SM121-optimized attention kernels |
| `--reasoning-parser qwen3` | Hide `<think>` blocks from API response |
| `--tool-call-parser hermes` | Extract Hermes-format tool calls |
| `--enable-auto-tool-choice` | Auto-detect tool calls in output |
| `--language-model-only` | Skip vision encoder init |
| `--gpu-memory-utilization 0.15` | ~18 GB (DGX Spark unified memory) |
| `--max-model-len 32768` | Match Annie's context window |

### Health Check

```bash
# Service health
curl http://titan:8003/health
# → {"status":"HEALTHY"}

# Model loaded
curl http://titan:8003/v1/models
# → {"data":[{"id":"qwen3.5-9b",...}]}

# Basic response test
curl -s http://titan:8003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-9b","messages":[{"role":"user","content":"Hi"}],"max_tokens":50}' \
  | python3 -m json.tool
```

---

## Common Errors and Fixes

| Error | Cause | Fix |
|-------|-------|-----|
| `Unsupported gpu architecture 'compute_121a'` | `/usr/bin/nvcc` is CUDA 12.0 | Use `/usr/local/cuda-13/bin/nvcc` or Docker with CUDA 13.0 |
| `TokenizersBackend class does not exist` | Tokenizer from transformers 5.3 | Copy tokenizer from 27B model |
| `Unknown ModelOpt quant algo: MIXED_PRECISION` | Used AWQ-Lite instead of DEFAULT_CFG | Re-quantize with NVFP4_DEFAULT_CFG |
| `model_type 'qwen3_5_text' is not supported` | transformers 5.3 config format | VL config wrapper |
| OOM during training | vLLM holding GPU memory | `docker rm -f vllm-annie` before training |
| `RuntimeError: Invalid device string: 'bfloat16'` | ModelOpt 0.37.0 export bug | Manual re-export (load checkpoint → re-quantize → export) |
| `Tensor.item() cannot be called on meta tensors` | `device_map="auto"` defers to meta | Use `device_map={"":"cuda:0"}` |
| Export container exits early | `--rm` flag | Don't use `--rm` for training |
| tool_calls empty in API response | Wrong parser (`qwen3_coder`) | Use `--tool-call-parser hermes` |

---

## Rollback

The previous model stays intact:
```bash
# v2b model at: ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v2b/
# Q4_K_M model at: ~/models/Qwen3.5-9B-Claude-Opus-Distilled/

# Rollback to v2b:
docker rm -f vllm-annie
# Change model path in docker run command to ...QAT-v2b

# Rollback to llama-server Q4_K_M:
LLM_SERVING_ENGINE=llama-server ./start.sh
```

---

## Files Reference

| File | Purpose |
|------|---------|
| `scripts/generate_qat_v4_data.py` | v4 adversarial data generator (1000 convos, 9 categories) |
| `scripts/generate_calibration_data_v3.py` | v3 data generator (1000 conversations) |
| `scripts/generate_calibration_data.py` | v2 data generator (430 conversations) |
| `scripts/qat_nvfp4_v2.py` | QAT training script (works for v2, v3, v4) |
| `scripts/test_annie_conversations.py` | Automated conversation tester |
| `scripts/benchmark_quant_v3.py` | Quantization quality gate benchmark |
| `data/calibration_v4/annie_calibration_v4.jsonl` | v4 training dataset (adversarial) |
| `data/calibration_v3/annie_calibration_v3.jsonl` | v3 training dataset |
| `data/calibration/annie_calibration.jsonl` | v2 calibration dataset |
| `docs/QAT-V2-EXECUTION-LOG.md` | v2 execution log with errors and fixes |
| `docs/RESEARCH-NVFP4-QAT-DISCOVERY.md` | PTQ vs QAT analysis |
| `services/annie-voice/bot.py` | Production system prompt + tools (source of truth) |
