# Plan: Benchmark Qwen3.5-9B Opus-Distilled NVFP4 vs Q4_K_M on DGX Spark

**Goal:** Quantize `Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled` to NVFP4
ourselves, then benchmark it against the current Q4_K_M GGUF on llama-server for
Annie Voice (minotaur creature, port 8003). Apples-to-apples: same fine-tuning,
different quantization format and runtime.

**Output model:** `rajesh/Qwen3.5-9B-Opus-Distilled-NVFP4`

**Hardware:** DGX Spark (GB10), SM121, 128 GB unified, CUDA 13.x

**Research:** See `docs/RESEARCH-NVFP4-QUANTIZATION.md` for full quantization details.

---

## Phase 1: Quantize Opus-Distilled to NVFP4

### Time estimate: ~2 hours total
- Model download: ~18 GB BF16 safetensors, ~5-10 min
- Docker container pull: ~5-10 min
- Quantization (512 calibration samples): **45-90 minutes**
- Export: ~5 min
- **Total wall time: ~1.5-2 hours**

### Resource usage during quantization
- **GPU VRAM:** ~24-30 GB (BF16 model + calibration activations)
- **Disk:** ~18 GB source + ~6 GB output = ~24 GB
- **CPU:** Moderate (data loading)
- **Impact on running services:** Will compete with Annie Voice + Ollama for GPU.
  **Recommendation:** Stop Annie Voice during quantization, or run on a different machine.

### Tasks

1. **Pull NVIDIA's DGX Spark quantization container**
   ```bash
   docker pull nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
   ```

2. **Run quantization inside the container**
   ```bash
   docker run --rm -it --gpus all --ipc=host \
     --ulimit memlock=-1 --ulimit stack=67108864 \
     -v "$HOME/models:/workspace/output_models" \
     -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
     -e HF_TOKEN=$HF_TOKEN \
     nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
     bash -c "
       git clone -b 0.35.0 --single-branch https://github.com/NVIDIA/Model-Optimizer.git /app/TensorRT-Model-Optimizer && \
       cd /app/TensorRT-Model-Optimizer && pip install -e '.[dev]' && \
       export ROOT_SAVE_PATH='/workspace/output_models' && \
       /app/TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
         --model 'Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled' \
         --quant nvfp4 \
         --tp 1 \
         --export_fmt hf
     "
   ```
   The output will be saved to `~/models/` on the host.

3. **Rename output directory**
   ```bash
   mv ~/models/<output_dir> ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4
   ```

4. **Verify output files**
   ```bash
   ls -la ~/models/Qwen3.5-9B-Opus-Distilled-NVFP4/
   # Expect: *.safetensors, config.json, hf_quant_config.json,
   #         tokenizer.json, tokenizer_config.json
   ```

### Where to run this phase

| Machine | Pros | Cons |
|---------|------|------|
| **DGX Spark (Titan)** | Native SM121 container, no cross-compilation | Competes with Annie Voice for GPU. Must stop services. |
| **Cloud GPU (A100/H100)** | No impact on production | Need to transfer ~6 GB output back. Costs money. |
| **Other local GPU** | No production impact | May not have SM121 container. Quantization is arch-independent, but verify. |

**Decision:** If you want zero downtime for Annie, quantize on a different machine.
The quantization itself produces standard safetensors — it doesn't need SM121.
Only the *inference* needs SM121.

### Verification
- [ ] Container pulled successfully
- [ ] Quantization completes without errors (expect 45-90 min)
- [ ] Output directory contains safetensors + config files
- [ ] Output model size is ~6 GB (vs ~18 GB BF16 source)

### Anti-patterns
- Do NOT use `--low_memory_mode` unless VRAM is tight — it slows quantization
- Do NOT interrupt the quantization after "Inserted N quantizers" — it appears
  to hang but is actually computing calibration statistics
- Do NOT use pip-installed ModelOpt on Spark — use the Docker container

---

## Phase 2: Launch SGLang Server + Smoke Test

### Tasks

1. **Pull NGC SGLang container for DGX Spark**
   ```bash
   docker pull lmsysorg/sglang:spark
   # OR: nvcr.io/nvidia/sglang:latest (check build.nvidia.com/spark/sglang)
   ```

2. **Launch SGLang with the quantized model**
   ```bash
   docker run --gpus all --rm -it \
     -v ~/models:/models \
     -p 8000:8000 \
     lmsysorg/sglang:spark \
     python3 -m sglang.launch_server \
       --model-path /models/Qwen3.5-9B-Opus-Distilled-NVFP4 \
       --quantization modelopt_fp4 \
       --tp 1 \
       --reasoning-parser qwen3 \
       --port 8000 \
       --host 0.0.0.0
   ```

3. **Fix config.json if needed**
   If SGLang complains about the quantization config:
   ```bash
   # Remove quantization_config from the model's config.json
   # (the --quantization CLI flag takes precedence)
   python3 - <<'EOF'
   import json, os
   path = os.path.expanduser('~/models/Qwen3.5-9B-Opus-Distilled-NVFP4/config.json')
   cfg = json.load(open(path))
   cfg.pop('quantization_config', None)
   json.dump(cfg, open(path, 'w'), indent=2)
   EOF
   ```

4. **Smoke test**
   ```bash
   curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "Qwen3.5-9B-Opus-Distilled-NVFP4",
       "messages": [{"role": "user", "content": "What is 17 * 43?"}],
       "max_tokens": 512
     }'
   ```
   Expected: correct answer (731)

5. **If SM121 kernel errors occur:**
   - Try `--attention-backend flashinfer` flag
   - Check NVIDIA dev forums for SM121 patches
   - Flag the issue immediately — do not silently switch runtimes

### Verification
- [ ] SGLang container pulled
- [ ] Server starts without CUDA/kernel errors
- [ ] Smoke test returns correct answer
- [ ] No SM121 compatibility warnings in logs

---

## Phase 3: Benchmark — NVFP4 Performance

### Tasks

1. **SGLang built-in benchmark**
   ```bash
   python3 -m sglang.bench_serving \
     --backend sglang \
     --port 8000 \
     --num-prompts 10 \
     --input-len 128 \
     --output-len 256
   ```
   Target: >60 tok/s decode

2. **Annie Voice workload benchmark**
   Write `scripts/benchmark_nvfp4_vs_q4km.py` to exercise Annie's actual workload:
   - Simple greeting (short response, <50 tokens)
   - Conversational response (medium, 100-200 tokens)
   - Tool-calling response (function call generation with Annie's tool defs)
   - Multi-turn conversation (5 turns, growing context)
   - System prompt + briefing load (large system prompt ~2K tokens)

   Measure per test: TTFT (ms), total latency (ms), output tokens, tokens/sec
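
   A minimal sketch of the measurement core for that script (stdlib only; assumes an OpenAI-compatible streaming endpoint; counting SSE chunks only approximates output tokens, since servers may emit more than one token per chunk — prefer the `usage` field when the server returns it):

   ```python
   import json
   import time
   import urllib.request
   from dataclasses import dataclass

   @dataclass
   class CaseResult:
       ttft_ms: float
       total_ms: float
       output_tokens: int

       @property
       def tokens_per_sec(self) -> float:
           # Decode rate excludes time-to-first-token.
           decode_s = (self.total_ms - self.ttft_ms) / 1000.0
           return self.output_tokens / decode_s if decode_s > 0 else 0.0

   def run_case(base_url: str, messages: list, max_tokens: int = 256) -> CaseResult:
       """POST one streaming chat completion and time TTFT and total latency."""
       body = json.dumps({"messages": messages, "max_tokens": max_tokens,
                          "stream": True}).encode()
       req = urllib.request.Request(f"{base_url}/v1/chat/completions", data=body,
                                    headers={"Content-Type": "application/json"})
       start = time.perf_counter()
       ttft_ms, chunks = None, 0
       with urllib.request.urlopen(req) as resp:
           for raw in resp:
               line = raw.decode("utf-8", "replace").strip()
               if not line.startswith("data:") or line.endswith("[DONE]"):
                   continue
               if ttft_ms is None:
                   ttft_ms = (time.perf_counter() - start) * 1000.0
               chunks += 1  # approximates output tokens
       total_ms = (time.perf_counter() - start) * 1000.0
       return CaseResult(ttft_ms or total_ms, total_ms, chunks)
   ```

   Running the same `run_case` against port 8000 (SGLang) and port 8003 (llama-server) keeps the comparison apples-to-apples.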

3. **VRAM measurement**
   ```bash
   nvidia-smi --query-gpu=memory.used --format=csv
   ```
   Record at: idle, 1 request, 5 concurrent requests
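
   To record the three load levels programmatically, a small helper is enough (a sketch using `noheader,nounits` so the output is trivially parseable; returns one MiB value per GPU):

   ```python
   import subprocess

   def parse_vram_mib(smi_output: str) -> list[int]:
       """Parse per-GPU MiB values from
       `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`."""
       return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]

   def sample_vram_mib() -> list[int]:
       out = subprocess.run(
           ["nvidia-smi", "--query-gpu=memory.used",
            "--format=csv,noheader,nounits"],
           capture_output=True, text=True, check=True).stdout
       return parse_vram_mib(out)
   ```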

### Verification
- [ ] SGLang bench_serving completes with results
- [ ] Decode speed >= 60 tok/s
- [ ] VRAM usage recorded at all load levels
- [ ] Annie workload benchmark results saved to JSON

---

## Phase 4: Benchmark — Baseline (Current Q4_K_M on llama-server)

### Tasks

1. **Run same Annie Voice workload against current llama-server**
   - Endpoint: `http://localhost:8003/v1/chat/completions`
   - Same test cases as Phase 3
   - Same metrics: TTFT, total latency, tokens/sec

2. **Record current VRAM usage**
   ```bash
   nvidia-smi --query-gpu=memory.used --format=csv
   ```

3. **Use existing E2E benchmark for additional reference**
   ```bash
   python scripts/benchmark_e2e.py
   ```

### Verification
- [ ] Same test cases run against both models
- [ ] Metrics collected in same JSON format for fair comparison
- [ ] Existing E2E benchmark data incorporated

---

## Phase 5: Quality Comparison

### Tasks

Since both models use the **same Opus-Distilled fine-tuning**, quality differences
should be minimal — any regression is purely from quantization artifacts (NVFP4 vs Q4_K_M).

1. **Thinking mode test (NVFP4)**
   ```bash
   # With thinking (default)
   curl http://localhost:8000/v1/chat/completions \
     -d '{"messages": [{"role": "user", "content": "Solve: x^2 - 5x + 6 = 0"}], "max_tokens": 1024}'

   # Without thinking — re-launch with:
   # --chat-template-kwargs '{"enable_thinking": false}'
   ```
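
   To check the two configurations programmatically, the sketch below assumes SGLang's `--reasoning-parser qwen3` surfaces parsed reasoning in a `reasoning_content` field on the message (verify the field name against your container version); with thinking disabled the field should be absent:

   ```python
   def split_reasoning(message: dict) -> tuple[str, str]:
       """Return (reasoning, answer) from a chat-completion message dict.
       reasoning_content is assumed to be the field the reasoning parser
       emits; it is empty when thinking is disabled."""
       return message.get("reasoning_content") or "", message.get("content") or ""
   ```

   Apply it to `response["choices"][0]["message"]` from both runs: the thinking run should yield a non-empty first element, the non-thinking run an empty one.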

2. **Tool calling accuracy**
   - Test with Annie's actual tool definitions (search_memory, web_search, read_notes)
   - Compare function call format and JSON structure
   - Both models have the same SFT, so tool calling should be equivalent
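
   A minimal validator for the comparison, assuming OpenAI-style `tool_calls` in the response (the tool names come from Annie's definitions listed above; everything else is a sketch):

   ```python
   import json

   ANNIE_TOOLS = {"search_memory", "web_search", "read_notes"}

   def valid_tool_call(call: dict, known_tools: set = ANNIE_TOOLS) -> bool:
       """Validate one OpenAI-style tool call: the function name must be a
       known tool and the arguments must be well-formed JSON."""
       fn = call.get("function", {})
       if fn.get("name") not in known_tools:
           return False
       try:
           json.loads(fn.get("arguments", "{}"))
       except (TypeError, json.JSONDecodeError):
           return False
       return True
   ```

   Run the same tool-eliciting prompts through both servers and compare the pass rate of `valid_tool_call` over each model's emitted calls.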

3. **Conversational quality spot-check**
   - 10 conversational prompts through both models
   - Compare response quality, personality consistency
   - Look for quantization artifacts (repetition, degraded coherence)

4. **Perplexity comparison (optional)**
   - Run a small eval set through both models
   - Compare perplexity scores — NVFP4 should be <1% degradation vs FP16
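
   The score itself is just the exponentiated negative mean token log-probability; collect per-token logprobs over the same eval text from each server (logprobs support on the completions endpoint varies by runtime — confirm both expose it before relying on this):

   ```python
   import math

   def perplexity(token_logprobs: list[float]) -> float:
       """Perplexity = exp(-mean token log-prob). Lower is better;
       identical inputs must be scored by both models for a fair delta."""
       return math.exp(-sum(token_logprobs) / len(token_logprobs))
   ```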

### Verification
- [ ] Thinking mode works in both on/off configurations
- [ ] Tool calling format matches Annie's expectations on NVFP4
- [ ] No quantization-induced quality regressions
- [ ] Conversational quality is subjectively equivalent

---

## Phase 6: Analysis & Recommendation Document

### Tasks

1. **Create comparison document** (`docs/RESEARCH-NVFP4-VS-Q4KM.md`) with:
   - Head-to-head performance table (TTFT, tok/s, VRAM, latency)
   - Quality comparison (tool calling, conversational, thinking mode)
   - VRAM budget impact (update RESOURCE-REGISTRY.md projections)
   - Drop-in compatibility assessment
   - Recommendation with reasoning

2. **Key comparison dimensions:**

   | Dimension | Q4_K_M (current) | NVFP4 (candidate) |
   |-----------|-------------------|---------------------|
   | Runtime | llama-server | SGLang (NGC container) |
   | Fine-tuning | Opus-Distilled | Opus-Distilled (SAME) |
   | Quantization | Mixed 4/6-bit (GGUF) | E2M1 + FP8 scaling |
   | VRAM (model) | 5.3 GB | TBD (~4.5 GB expected) |
   | VRAM (32K ctx) | ~6.6 GB | TBD |
   | Decode tok/s | TBD | TBD (target: >60) |
   | TTFT (greeting) | ~17.7s (E2E) | TBD |
   | Deployment | Single binary | Docker container |
   | Context limit | 262K (32K configured) | 262K (TBD) |
   | HW acceleration | Generic GPU | Blackwell FP4 Tensor Cores |

3. **Drop-in replacement checklist:**
   - [ ] OpenAI-compatible API (same endpoint format)
   - [ ] Tool calling works with Annie's existing `llamacpp_llm.py`
   - [ ] Reasoning/thinking mode controllable
   - [ ] VRAM fits within budget (<110 GB peak)
   - [ ] Latency is equal or better
   - [ ] Quality is equivalent (same fine-tuning, minimal quantization loss)
   - [ ] `start.sh` can launch SGLang container instead of llama-server

### Anti-pattern guards
- Do NOT recommend based on tok/s alone — operational complexity matters too
- Do NOT ignore the runtime change (llama-server binary → Docker container)
- Do NOT forget to update RESOURCE-REGISTRY.md if we switch

### Verification
- [ ] Document created with all comparison data
- [ ] Recommendation includes clear rationale
- [ ] RESOURCE-REGISTRY.md impact assessed
- [ ] start.sh migration path documented if switching
