# Building a Voice AI Companion on DGX Spark: The NVFP4 QAT Journey

> Draft structure for blog post. Fill in benchmarks and results after v3 training.

---

## 1. The Problem: Voice AI Needs Speed AND Personality

Annie is a personal AI voice companion running on an NVIDIA DGX Spark (Blackwell GB10, 128 GB unified memory). She's built on Qwen3.5-9B, fine-tuned with Claude Opus 4.6 reasoning traces (3,950 conversations), and she needs to:

- Respond in under 500ms (voice latency budget)
- Call tools (web search, memory, notes) in the correct format
- Never use markdown (output is spoken by TTS)
- Be concise (1-2 sentences, voice-appropriate)
- Know her user's name, city, preferences

In Q4_K_M format on llama-server, everything works perfectly. But Q4_K_M TTFT grows with context length (300ms → 1600ms over a conversation). NVFP4 with DeltaNet linear attention promises constant 90ms TTFT regardless of context.

The catch: **NVFP4 quantization preserves knowledge but destroys personality.**

---

## 2. PTQ Failed: Calibration Data Doesn't Matter

We tried three PTQ approaches on Qwen3.5-9B-Opus-Distilled:

| Approach | Calibration | Tool Calling | Markdown | Think Leak |
|----------|-------------|-------------|----------|------------|
| v1: CNN DailyMail | Generic news | 0/10 (0%) | 80% leak | 20% leak |
| v2d: Annie conversations | Domain-specific | 0/10 (0%) | 80% leak | 20% leak |
| v2a: AWQ-Lite + Annie | Adaptive weighting | Can't serve | — | — |
| Prompt engineering | Hardened prompts | 43% | Inconsistent | Inconsistent |

**The "max" algorithm doesn't care what you calibrate with.** It measures min/max activation ranges and rounds weights identically regardless of domain. The behavioral fine-tuning — small, precise weight adjustments — is collateral damage.
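The intuition above can be sketched in a few lines of NumPy. This is an illustrative toy of "max"-style absmax calibration onto the FP4 (E2M1) grid, not ModelOpt's actual kernel; the point is that only the activation/weight *range* enters the math, so the domain of the calibration data is irrelevant:

```python
import numpy as np

# Representable magnitudes of the 4-bit E2M1 (FP4) format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_max(w: np.ndarray) -> np.ndarray:
    """Scale by the block's max |w|, snap to the nearest FP4 value, rescale."""
    scale = np.abs(w).max() / FP4_GRID.max()  # "max" calibration: only the range matters
    if scale == 0:
        return w
    idx = np.abs(np.abs(w)[:, None] / scale - FP4_GRID).argmin(axis=1)
    return np.sign(w) * FP4_GRID[idx] * scale

# Tiny behavioral fine-tune deltas next to ordinary large weights:
w = np.array([0.002, -0.004, 1.2, -5.8])
print(fake_quantize_max(w))  # the small deltas round to exactly zero
```

No matter what you calibrate with, those small deltas land below the first non-zero grid point and vanish, which is why the fine-tuning is collateral damage.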

Factual accuracy (100%) and reasoning (100%) survived perfectly. FP4 preserves knowledge but destroys manners.

---

## 3. The Discovery: QAT = SFT + Quantization in One Step

The breakthrough came from two sources:
- **NVIDIA QAT Blog:** QAT matches BF16 on Math-500/AIME/GPQA (100% recovery)
- **LMSYS MXFP4 Research:** QAT 97% vs PTQ 59% on safety alignment

QAT (Quantization-Aware Training) inserts fake quantizers during fine-tuning:
- **Forward pass:** Weights are simulated-FP4 (model "sees" quantization noise)
- **Backward pass:** Straight-through estimator passes gradients to full-precision weights
- **Result:** Weights adapt to produce correct output despite FP4 rounding

This means a single training run simultaneously teaches personality AND adapts to quantization. No separate SFT step needed.
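The forward/backward trick above fits in a few lines of PyTorch. This is a toy sketch using the standard detach trick for the straight-through estimator; ModelOpt's real FakeQuantize modules additionally keep per-block scales:

```python
import torch

# FP4 (E2M1) representable values, both signs.
FP4 = torch.tensor([-6., -4., -3., -2., -1.5, -1., -.5, 0., .5, 1., 1.5, 2., 3., 4., 6.])

def fake_quantize_ste(w: torch.Tensor) -> torch.Tensor:
    """Forward: snap to nearest FP4 value. Backward: identity (straight-through)."""
    q = FP4[(w.detach().unsqueeze(-1) - FP4).abs().argmin(dim=-1)]
    # detach() trick: the forward value is q, but the gradient path is w's,
    # so gradients flow to the full-precision master weights.
    return w + (q - w).detach()

w = torch.tensor([0.7, -2.2], requires_grad=True)
loss = fake_quantize_ste(w).sum()   # model "sees" quantized weights in the loss
loss.backward()
print(w.grad)                       # tensor([1., 1.]) — gradients reach the FP weights
```

During training the optimizer updates the full-precision weights, which drift to values that round well under FP4, which is exactly the "adapts to quantization noise" effect.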

---

## 4. Three Bugs That Took Three Days

### Bug 1: LoRA merge destroys QAT state
QAT v1 used LoRA (rank 16). Training converged (loss dropped 24%). But `merge_and_unload()` folds the LoRA deltas into a fresh plain model, discarding the FakeQuantize calibration state along the way. The re-quantized export was identical to PTQ.

**Fix:** Full fine-tune (8.95B trainable params, ~90 GB with gradient checkpointing).

### Bug 2: Full-sequence loss wastes 73% of gradient
v1 trained on the entire sequence (system prompt, user messages, tool definitions). Only 27% of tokens were assistant responses. The gradient signal was diluted across 73% of tokens that shouldn't change.

**Fix:** Assistant-only loss masking. Labels = -100 for system/user/tool turns.
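The masking itself is simple. A minimal sketch, assuming we can attribute each token to the role that produced it (real implementations locate assistant spans in the chat-templated text; `mask_labels` and the per-token `roles` list are illustrative):

```python
IGNORE = -100  # label value that PyTorch/HF cross-entropy loss ignores

def mask_labels(input_ids: list, roles: list) -> list:
    """Keep labels only for assistant tokens; everything else gets -100."""
    return [tok if role == "assistant" else IGNORE
            for tok, role in zip(input_ids, roles)]

ids   = [1, 2, 3, 4, 5, 6]
roles = ["system", "user", "user", "assistant", "assistant", "tool"]
print(mask_labels(ids, roles))  # [-100, -100, -100, 4, 5, -100]
```

With this mask, 100% of the gradient signal lands on the 27% of tokens we actually want to change.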

### Bug 3: One epoch isn't enough
v1 used 1 epoch × 430 samples. NVIDIA says "less than 1% of pre-training time" but that's for base capabilities. Behavioral recovery needs more signal.

**Fix:** 3 epochs minimum. The loss curve shows a steep drop at the epoch 1→2 boundary.

### v2 Result
With all three bugs fixed: **thinking leak 0% (was 20%), markdown 0% (was 80%), tool calling 90% (was 0%).**

---

## 5. The Serving Stack: vLLM + FlashInfer + Hermes on SM121

DGX Spark runs Blackwell SM 12.1, which has limited support in the vLLM ecosystem:

| Component | Status | Fix |
|-----------|--------|-----|
| vLLM pip install | Broken | Use Docker `cu130-nightly` |
| SGLang pip install | Broken | — |
| FlashInfer SM121 cubins | Missing from pre-built | Use `hellohal2064/vllm-qwen3.5-gb10` |
| CUDA graphs | Work with right build | Remove `--enforce-eager` |
| Tool call parser | `qwen3_coder` doesn't work | Use `--tool-call-parser hermes` |
| Config format | `qwen3_5_text` unknown | VL config wrapper |
| Tokenizer | `TokenizersBackend` class mismatch | Copy from working model |

The serving command that works:
```bash
docker run -d --name vllm-annie --gpus all --runtime=nvidia \
  -v ~/models:/models -p 8003:8000 --shm-size=16g \
  hellohal2064/vllm-qwen3.5-gb10:latest \
  serve /models/YOUR_MODEL \
  --quantization modelopt_fp4 --attention-backend flashinfer \
  --reasoning-parser qwen3 --enable-auto-tool-choice \
  --tool-call-parser hermes --language-model-only \
  --gpu-memory-utilization 0.15 --max-model-len 32768
```
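Once the container is up, a quick smoke test against the OpenAI-compatible endpoint confirms the model answers before wiring in the voice pipeline. A stdlib-only sketch (port 8003 per the docker command above; the `model` value is whatever path you served):

```python
import json
import urllib.request

payload = {
    "model": "/models/YOUR_MODEL",
    "messages": [{"role": "user", "content": "Say hi in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8003/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
except OSError as e:
    print("server not reachable:", e)
```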

---

## 6. 14 Bugs Found Live: From XML Leaks to Infinite Tool Loops

After deploying QAT v2b, we ran 514+ messages through the full voice pipeline (STT → LLM → TTS → WebRTC). Found and fixed 14 infrastructure bugs:

[Table of all 14 bugs with symptoms, fixes, and commits — see QAT-COOKBOOK.md]

Key categories:
- **Streaming issues:** vLLM streams differently from llama-server
- **Tool parsing:** Hermes parser + streaming required new text filter logic
- **Session management:** cross-session "anti-Alzheimer" memory poisoning (a confused session produced garbage that fed back into an infinite loop)
- **GPU memory:** Unified memory on DGX Spark requires different utilization settings
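As one concrete example of the streaming/tool-parsing category: the TTS must never speak reasoning or tool-call markup. A simplified sketch of the kind of text filter the streaming path needed (tag names follow the Hermes/Qwen format; real code also has to buffer tags split across stream chunks, which is glossed over here):

```python
import re

# Drop think blocks and Hermes-style tool-call XML before text reaches the TTS.
DROP = re.compile(r"<think>.*?</think>|<tool_call>.*?</tool_call>", re.DOTALL)

def filter_for_tts(text: str) -> str:
    """Return only the speakable portion of a model reply."""
    return DROP.sub("", text).strip()

print(filter_for_tts("<think>\nplan\n</think>\nHi Rajesh! <tool_call>{...}</tool_call>"))
# → Hi Rajesh!
```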

---

## 7. Profiling at Scale: 514 Messages, 44 Sessions, Playwright Automation

We built two automated testing systems:

### API Test (`test_annie_conversations.py`)
- 31 conversations across 8 categories
- SSE streaming consumption
- Anomaly detection via regex patterns
- JSON report with per-conversation details
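The anomaly detection is just a table of named regexes run over each reply. A minimal sketch (these specific patterns are illustrative, not the tester's exact ones):

```python
import re

ANOMALIES = {
    "think_leak": re.compile(r"</?think>"),                      # reasoning tags leaked
    "markdown":   re.compile(r"(\*\*|^#+\s|^\s*[-*]\s)", re.MULTILINE),  # TTS can't speak this
    "xml_leak":   re.compile(r"</?tool_call>"),                  # raw tool-call markup
}

def detect(reply: str) -> list:
    """Return the names of every anomaly pattern found in a reply."""
    return [name for name, pat in ANOMALIES.items() if pat.search(reply)]

print(detect("**Sure!** Here's a list:\n- item"))   # ['markdown']
print(detect("It's 22 degrees in Bengaluru."))      # []
```

Each flagged conversation then lands in the JSON report with the matched pattern names attached.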

### Browser Test (Playwright MCP)
- Real WebRTC connections through Pipecat Playground
- Full voice pipeline including STT and TTS
- Session management with disconnect/reconnect
- 20-second initialization window

### Results
- **89% overall pass rate** across 514+ messages
- **Zero:** XML leaks, role confusion, hallucinations (after fixes)
- **10 model behavior issues** identified for fine-tuning

---

## 8. Teaching Personality: 1000 Targeted Conversations

The 10 behavioral issues mapped to 7 training categories:

| Category | Count | Fixes |
|----------|-------|-------|
| Topic switching | 200 | Annie follows topic changes immediately |
| Direct action | 200 | Tools called without asking "shall I?" |
| Concise voice | 200 | 1-2 sentences max |
| Honest memory | 100 | "I don't remember" when search fails |
| Rajesh context | 100 | Name, city, work, food preferences |
| Kannada culture | 100 | Correct greetings, festival knowledge |
| Mixed multi-turn | 100 | Integration of all patterns |

Generated at zero marginal cost using Claude Opus 4.6 as the teacher, driven through the Claude Code CLI on a Max subscription. Each batch of 50 conversations takes ~90 seconds.

The no-thinking pattern (`<think>\n</think>\n` prefix) teaches the model to skip internal reasoning, potentially saving ~30% TTFT.
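Applying the pattern to the dataset is a one-line transform over each assistant target (the helper name here is illustrative):

```python
# Prepend an empty think block to every assistant target so the model learns
# to emit it immediately instead of spending tokens on internal reasoning.
NO_THINK = "<think>\n</think>\n"

def to_no_think_target(assistant_text: str) -> str:
    return NO_THINK + assistant_text

print(repr(to_no_think_target("Sure, switching topics. What about Bengaluru?")))
```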

---

## 9. Results

> [To be filled after v3 training and benchmarking]

### Quality Gates

| Gate | PTQ | QAT v2 | QAT v3 |
|------|-----|--------|--------|
| Thinking leak | 20% | 0% | — |
| Markdown | 80% | 0% | — |
| Tool calling | 0% | 90% | — |
| Topic switch | — | — | — |
| Direct action | — | — | — |
| Concise | — | — | — |
| Honest memory | — | — | — |
| Correct name | — | — | — |
| TTFT | 90ms* | ~1.2s | — |

*PTQ's 90ms was "broken thinking" — the model couldn't generate `<think>` tokens.

### Training Metrics

| Metric | v2 | v3 |
|--------|----|----|
| Dataset | 430 conversations | 1000 conversations |
| Epochs | 3 | 5 |
| Training time | 53 min | — |
| Loss reduction | 74% (1.588→0.415) | — |
| Export size | 7.5 GB | — |

---

## 10. Open Source

### Model
- **HuggingFace:** `[to be published]/Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3`
- **Base:** Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
- **Quantization:** NVFP4 QAT via NVIDIA ModelOpt
- **Training data:** 1000 synthetic conversations from Claude Opus 4.6

### Code
- **QAT training script:** `scripts/qat_nvfp4_v2.py`
- **Data generator:** `scripts/generate_calibration_data_v3.py`
- **Conversation tester:** `scripts/test_annie_conversations.py`
- **Serving recipe:** See QAT-COOKBOOK.md

### Key Takeaways
1. QAT = SFT + quantization in one step
2. PTQ's "max" algorithm can't save behavioral fine-tuning regardless of calibration data
3. Assistant-only loss masking is critical (74% vs 24% loss reduction)
4. Full fine-tune > LoRA for QAT (avoids merge_and_unload bug)
5. Pin your Docker images (nightly builds regress)
6. Test at scale (514 messages found 10 issues that 10 messages missed)
7. The dataset IS the personality — every conversation shapes behavior

---

*Built by Rajesh on DGX Spark GB10. Powered by Qwen3.5, Claude Opus 4.6, NVIDIA ModelOpt, vLLM, and Pipecat.*
