# Next Session Prompt

Paste this after clearing context:

---

## Task: Execute NVFP4 v2 Calibration-Aware Quantization

Read these files first (in order):
1. `docs/NVFP4-EXECUTION-PLAN.md` — Step-by-step execution plan
2. `docs/NVFP4-RESEARCH-JOURNEY.md` — Full context of what we tried and learned
3. `docs/RESEARCH-NVFP4-VS-Q4KM.md` — Benchmark results and serving recipes

Then execute the plan starting from Phase 1:

**Phase 1:** Generate 500 calibration conversations using `scripts/generate_calibration_data.sh` (uses Claude Code CLI with Opus 4.6 as teacher). Validate the output before moving on.

**Phase 2:** Create `scripts/quantize_nvfp4_v2.py` with three strategies (AWQ-Lite, MLP-Only, Auto-Mixed). Run inside NGC container on Titan.
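A possible CLI skeleton for `scripts/quantize_nvfp4_v2.py` is sketched below. The three strategy names come from the plan above; everything else (flag names, the helper itself) is an assumption, and the actual quantization calls (e.g. via NVIDIA's Model Optimizer inside the NGC container) are deliberately elided.

```python
import argparse

# Strategy names from the execution plan; flag names are hypothetical.
STRATEGIES = ("awq-lite", "mlp-only", "auto-mixed")

def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser for the v2 quantization script."""
    p = argparse.ArgumentParser(
        description="NVFP4 v2 calibration-aware quantization (run in NGC container)"
    )
    p.add_argument("--model", required=True, help="HF model id or local path")
    p.add_argument("--calib", required=True, help="calibration JSONL from Phase 1")
    p.add_argument("--strategy", choices=STRATEGIES, required=True,
                   help="which of the three quantization strategies to apply")
    p.add_argument("--output", required=True, help="output dir for quantized model")
    return p
```

Running the script three times, once per `--strategy`, yields the three candidates that Phase 3 benchmarks head-to-head.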

**Phase 3:** Serve each quantized model with Docker cu130-nightly and benchmark with `scripts/benchmark_quant_v3.py`. Quality gates: thinking leak 0/25, tool calling ≥6/10, no markdown ≥12/15, TTFT ≤120ms, decode ≥25 tok/s.
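The Phase 3 quality gates can be encoded as a single pass/fail check, as in this sketch. The thresholds are taken verbatim from the gates above; the field names are assumptions and need to be mapped to whatever `scripts/benchmark_quant_v3.py` actually reports.

```python
from dataclasses import dataclass

@dataclass
class BenchResult:
    """One model's benchmark summary (field names are illustrative)."""
    thinking_leaks: int   # out of 25 probes; gate: must be 0
    tool_calls_ok: int    # out of 10; gate: >= 6
    no_markdown_ok: int   # out of 15; gate: >= 12
    ttft_ms: float        # gate: <= 120 ms
    decode_tok_s: float   # gate: >= 25 tok/s

def passes_gates(r: BenchResult) -> tuple[bool, list[str]]:
    """Return (passed, list of failed-gate descriptions)."""
    failures = []
    if r.thinking_leaks != 0:
        failures.append(f"thinking leak {r.thinking_leaks}/25 (must be 0)")
    if r.tool_calls_ok < 6:
        failures.append(f"tool calling {r.tool_calls_ok}/10 (need >= 6)")
    if r.no_markdown_ok < 12:
        failures.append(f"no-markdown {r.no_markdown_ok}/15 (need >= 12)")
    if r.ttft_ms > 120:
        failures.append(f"TTFT {r.ttft_ms:.0f} ms (need <= 120)")
    if r.decode_tok_s < 25:
        failures.append(f"decode {r.decode_tok_s:.1f} tok/s (need >= 25)")
    return (not failures, failures)
```

Only models where `passes_gates` returns `True` are candidates for Phase 5 publication.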

**Phase 4:** Run a quick prompt-engineering test on the existing v1 model (~10 min).

**Phase 5:** Publish best model to HuggingFace + update research docs.

### What's running on Titan
- Port 8003: llama-server with Q4_K_M (DO NOT STOP — production)
- Port 8000: May have Docker container `vllm-9b-opus` from last session (stop it)
- Docker image `vllm/vllm-openai:cu130-nightly` is already pulled

### Key constraint
The DGX Spark GB10 is SM 12.1 — only the Docker cu130-nightly image works for NVFP4 serving. Every pip-installed vLLM/SGLang build fails with "no kernel image" errors.
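A preflight guard for the serving scripts could encode this constraint as below. Only the SM 12.1 / cu130-nightly fact comes from this document; the function name, the exact cutoff behavior for other architectures, and passing the capability tuple in explicitly (rather than reading `torch.cuda.get_device_capability()`) are assumptions made to keep the check testable off-GPU.

```python
def nvfp4_serving_path(sm: tuple[int, int]) -> str:
    """Pick a serving route for NVFP4 given a CUDA compute capability.

    On SM 12.1 (GB10), pip-built vLLM/SGLang wheels lack kernels for the
    arch and die with "no kernel image" at runtime, so the Docker image
    is the only option. The >= cutoff for other archs is illustrative.
    """
    if sm >= (12, 1):
        return "docker:vllm/vllm-openai:cu130-nightly"
    return "pip"
```

Calling this at the top of the benchmark/serve scripts would fail fast instead of crashing mid-run inside a pip-installed engine.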

---
