# Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3

> Template — fill in benchmark results after training.

## Model Description

Voice AI companion fine-tuned from Qwen3.5-9B with:
- **Opus 4.6 distillation:** 3,950 Claude reasoning traces (base fine-tune by Jackrong)
- **QAT v3 behavioral training:** 1,000 targeted conversations across 7 categories
- **NVFP4 quantization:** 7.55 GB on disk, serves on DGX Spark (19.5 GB VRAM with KV cache)

### Intended Use

Personal voice AI companion running on NVIDIA DGX Spark (Blackwell GB10). Optimized for:
- Real-time voice conversation (concise 1-2 sentence responses)
- Tool calling (web search, memory, notes, code execution)
- Indian cultural context (Bangalore, Kannada, Indian food)

### Architecture

- **Base model:** Qwen/Qwen3.5-9B (DeltaNet hybrid attention)
- **First fine-tune:** LoRA SFT on 3,950 Opus reasoning traces → Q4_K_M (by Jackrong)
- **Second fine-tune:** Full QAT on 1,000 behavioral conversations → NVFP4 (this model)
- **Quantization:** NVIDIA ModelOpt NVFP4_DEFAULT_CFG with QAT
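In QAT, the weights stay in high precision during training while every forward pass applies a quantize-dequantize ("fake quant") step, so the optimizer learns weights that survive NVFP4 rounding. A minimal pure-Python sketch of the idea (a simplification: real NVFP4 uses the E2M1 value grid below but stores an FP8 scale per 16-value block, and ModelOpt handles all of this inside its quantizer modules):

```python
# Sketch of NVFP4-style fake quantization (quantize -> dequantize in the
# forward pass). Simplified: the per-block scale is kept in full precision
# here, whereas real NVFP4 stores it as FP8.

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
FP4_GRID += [-v for v in FP4_GRID[1:]]

def fake_quant_block(block):
    """Quantize-dequantize one block of 16 weights."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0                      # map the block onto [-6, 6]
    return [min(FP4_GRID, key=lambda g: abs(x / scale - g)) * scale
            for x in block]

weights = [0.013, -0.25, 0.4, 1.1, -0.07, 0.9, 0.02, -0.6,
           0.33, -0.15, 0.5, -1.0, 0.8, 0.05, -0.3, 0.12]
fq = fake_quant_block(weights)
```

Because the rounding happens inside the training loop, the gradient (via a straight-through estimator in the real framework) steers weights toward values that lose little accuracy when the fake quantizers are replaced by actual NVFP4 storage at export time.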

## Training Data

### Opus Distillation (base)
- 3,950 samples from Claude Opus 4.6 reasoning datasets
- Sources: nohurry/Opus-4.6-Reasoning-3000x-filtered, TeichAI/claude-4.5-opus-high-reasoning-250x, Jackrong/Qwen3.5-reasoning-700x

### QAT v3 Behavioral Training (this model)
- 1,000 synthetic conversations generated by Claude Opus 4.6 via Claude Code CLI
- Zero incremental API cost (generated under a Claude Max subscription)

| Category | Count | Purpose |
|----------|-------|---------|
| Topic switching | 200 | Follow topic changes immediately |
| Direct action | 200 | Call tools without confirmation |
| Concise voice | 200 | 1-2 sentences max |
| Honest memory | 100 | "I don't remember" when memory empty |
| Rajesh context | 100 | User identity, city, work |
| Kannada culture | 100 | Correct greetings, festival knowledge |
| Mixed multi-turn | 100 | Integration of all patterns |
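The exact record schema is not published here, but one plausible JSONL layout for the category-tagged conversations above looks like this (field names and content are illustrative assumptions, not the actual training format):

```python
import json

# Hypothetical record layout for one behavioral training conversation.
# "category" matches the table above; "messages" is a standard chat turn list.
record = {
    "category": "topic_switching",
    "messages": [
        {"role": "system", "content": "You are a concise voice assistant."},
        {"role": "user",
         "content": "What's the weather? Actually, set a timer for 10 minutes."},
        {"role": "assistant", "content": "Timer set for 10 minutes."},
    ],
}
line = json.dumps(record, ensure_ascii=False)  # one conversation per line
```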

### Training Details
- **Method:** QAT (Quantization-Aware Training) — full fine-tune with NVFP4 fake quantizers active
- **Loss masking:** Assistant tokens only (system/user/tool masked)
- **No-thinking:** Empty `<think>\n</think>\n` prefix trains the model to skip reasoning
- **Epochs:** 5
- **Learning rate:** 1e-5 (cosine schedule)
- **Effective batch size:** 8 (micro-batch 1 × 8 gradient-accumulation steps)
- **Max sequence length:** 2048
- **Hardware:** DGX Spark GB10 (Blackwell SM 12.1, 128 GB unified memory)
- **Training time:** [TBD] hours
- **Final loss:** [TBD]
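The loss-masking and no-thinking steps above can be sketched as follows. This is a simplified illustration, not the actual training script; the `-100` label is the standard Hugging Face ignore index, and the character-level "tokenizer" exists only to keep the sketch self-contained:

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def build_labels(segments, tokenize):
    """segments: list of (role, text). Only assistant tokens keep labels."""
    input_ids, labels = [], []
    for role, text in segments:
        if role == "assistant":
            # Empty think block teaches the model to skip reasoning.
            text = "<think>\n</think>\n" + text
        ids = tokenize(text)
        input_ids += ids
        labels += ids if role == "assistant" else [IGNORE_INDEX] * len(ids)
    return input_ids, labels

# Toy character-level "tokenizer" just to make the sketch runnable.
tok = lambda s: [ord(c) for c in s]
ids, labels = build_labels(
    [("system", "Be brief."), ("user", "Hi!"), ("assistant", "Hello.")], tok)
```

System, user, and tool tokens still condition the model (they remain in `input_ids`) but contribute nothing to the gradient, so the model is only trained on what the assistant says.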

## Serving

### vLLM (recommended)

```bash
docker run -d --name vllm --gpus all --runtime=nvidia \
  -v /path/to/model:/model -p 8000:8000 --shm-size=16g \
  hellohal2064/vllm-qwen3.5-gb10:latest \
  serve /model \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --language-model-only --trust-remote-code \
  --gpu-memory-utilization 0.15 --max-model-len 32768
```
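Once the container is up, vLLM exposes an OpenAI-compatible API on port 8000. A minimal client sketch using only the standard library (the model name `/model` matches the mount path in the command above; the prompt is a placeholder):

```python
import json
import urllib.request

def build_payload(model, user_text, max_tokens=128):
    """OpenAI-style chat payload accepted by vLLM's /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, user_text):
    """Send one chat turn to the running vLLM server and return the reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, user_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
# print(chat("http://localhost:8000", "/model", "Set a timer for ten minutes."))
```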

### Requirements
- NVIDIA Blackwell GPU with SM 12.1 (GB10) or SM 10.0 (datacenter Blackwell) and CUDA 13.0
- vLLM with ModelOpt FP4 support
- FlashInfer attention backend (for SM 12.1)
- ~19.5 GB VRAM at 0.15 utilization

### Post-Download Patches (DGX Spark specific)
1. May need tokenizer from Qwen3.5-27B-NVFP4 (transformers version mismatch)
2. May need VL config wrapper (`model_type: "qwen3_5"`)
3. May need `preprocessor_config.json` stub

## Benchmarks

> [Fill after v3 training]

### Quality Gates

| Gate | PTQ | QAT v2 | QAT v3 | Target |
|------|-----|--------|--------|--------|
| Thinking leak | 20% | 0% | [TBD] | 0% |
| Markdown leak | 80% | 0% | [TBD] | 0% |
| Tool calling | 0% | 90% | [TBD] | ≥90% |
| Factual accuracy | 100% | 100% | [TBD] | 100% |
| Reasoning | 100% | 100% | [TBD] | 100% |

### Behavioral Gates (NEW in v3)

| Gate | Target | v3 Result |
|------|--------|-----------|
| Topic switch | ≥90% | [TBD] |
| Direct action | ≥90% | [TBD] |
| Concise response | ≥90% | [TBD] |
| Honest memory | ≥90% | [TBD] |
| Correct name | 100% | [TBD] |
| Bangalore context | ≥80% | [TBD] |
| TTFT | ≤500ms | [TBD] |

### Performance

| Metric | Q4_K_M (llama-server) | NVFP4 QAT v3 (vLLM) |
|--------|----------------------|---------------------|
| TTFT | 300-1600ms (grows with context) | [TBD]ms (constant) |
| Decode | 33-42 tok/s | [TBD] tok/s |
| VRAM | 7.1 GB | 19.5 GB |
| Model size | 5.7 GB | 7.5 GB |

## Known Limitations

1. **TTFT on DGX Spark:** FlashInfer SM121 cubins are community-built, not officially supported. TTFT may be higher than Hopper/Ada.
2. **No-thinking trade-off:** Empty think blocks improve TTFT but may reduce reasoning quality for complex questions.
3. **Rajesh-specific context:** Training data is specific to one user's name, city, and preferences. Fine-tune on your own data for different users.
4. **Tool calling format:** Requires Hermes tool call parser in vLLM. Other parsers may not extract tool calls correctly.

## Citation

```bibtex
@misc{qwen35-9b-opus-nvfp4-qat-v3,
  title={Qwen3.5-9B-Opus-Distilled-NVFP4-QAT-v3},
  author={Rajesh},
  year={2026},
  howpublished={https://huggingface.co/[TBD]},
  note={Voice AI companion fine-tuned with QAT for NVFP4 on DGX Spark}
}
```

## Acknowledgements

- **Jackrong** for the original Opus distillation (Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled)
- **NVIDIA** for ModelOpt QAT framework, DGX Spark, FlashInfer
- **Anthropic** for Claude Opus 4.6 (teacher model for data generation)
- **hellohal2064** for the vLLM SM121 Docker build
- **Pipecat** for the voice pipeline framework
