# Research: Quantizing Nemotron 3 Nano 30B from BF16 to NVFP4

**Date:** 2026-03-18
**Context:** Fine-tuned Annie QLoRA-v1 merged model at `~/models/Nemotron-3-Nano-Annie-QLoRA-v1/merged_bf16/` (59 GB, 12 shards)
**Goal:** Quantize to NVFP4 for serving with vLLM on DGX Spark (128 GB)

---

## Executive Summary

**PTQ with NVIDIA's exact selective config is the recommended first step.** It fits in 67 GB on DGX Spark, uses the proven ModelOpt pipeline, and NVIDIA's own NVFP4 model used the same approach (PTQ + QAD). If behavioral quality degrades, QAT with LoRA (~80 GB) is the fallback. Full QAT and QAD both exceed 128 GB for a 30B model.

**Key difference from our Qwen 9B experience:** NVIDIA used a sophisticated *selective* quantization strategy for Nemotron -- keeping the attention layers, the Mamba layers that feed them, and all Mamba conv1d modules in BF16. Our previous Qwen script used `NVFP4_DEFAULT_CFG`, which quantizes everything. The Nemotron architecture (MoE + Mamba + Attention hybrid) requires a custom `exclude_modules` config.

---

## 1. Does ModelOpt Support nemotron_h Architecture?

**Yes, confirmed.** Multiple evidence points:

- NVIDIA's official NVFP4 model (`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`) was quantized with ModelOpt v0.29.0 (per `hf_quant_config.json`)
- The `hf_ptq.py` example in TensorRT-Model-Optimizer explicitly handles Nemotron models
- The ModelOpt PTQ README lists "Nemotron-3" as supported for both FP8 and NVFP4
- ModelOpt's `_default_disabled_quantizer_cfg` already has `"*mixer.conv1d*": disable` specifically for Mamba layers
- The model uses `trust_remote_code=True` so HuggingFace loads the custom `modeling_nemotron_h.py` -- ModelOpt wraps the Linear/Conv1d modules regardless of architecture class

**However:** The QAT example's support matrix only lists "LLAMA, CodeLlama, Qwen" -- it does NOT list Nemotron. QAT may need manual configuration for `--fsdp_transformer_layer_cls_to_wrap NemotronHBlock`.
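The claim that ModelOpt wraps Linear/Conv1d modules regardless of the architecture class can be sanity-checked cheaply before a full run. A minimal sketch (the helper name is ours, not a ModelOpt API; the toy layout only mimics a Mamba mixer):

```python
import torch.nn as nn

def quantizable_modules(model: nn.Module) -> list[str]:
    """Names of modules ModelOpt can wrap (nn.Linear / nn.Conv1d),
    independent of the enclosing architecture class."""
    return [name for name, mod in model.named_modules()
            if isinstance(mod, (nn.Linear, nn.Conv1d))]

# Toy module mimicking a Mamba mixer layout (illustrative names):
toy = nn.ModuleDict({
    "in_proj": nn.Linear(8, 16),
    "conv1d": nn.Conv1d(16, 16, kernel_size=4, groups=16),
    "norm": nn.LayerNorm(16),     # not quantizable -- filtered out
    "out_proj": nn.Linear(16, 8),
})
print(quantizable_modules(toy))  # → ['in_proj', 'conv1d', 'out_proj']
```

Running this against the real model (loaded with `trust_remote_code=True`) lists exactly what a quant config can target.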

---

## 2. NVIDIA's Exact Quantization Config

From `hf_quant_config.json` on HuggingFace:

```json
{
  "producer": {
    "name": "modelopt",
    "version": "0.29.0"
  },
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8",
    "group_size": 16,
    "exclude_modules": [
      "lm_head",
      <6 Mamba in_proj/out_proj pairs feeding attention>,
      <6 Attention q/k/v/o_proj sets>,
      <23 Mamba conv1d layers>
    ]
  }
}
```

### Selective Quantization Strategy (from QAD paper Section 3.4)

> "For Nemotron 3 Nano, a Mixture-of-Experts hybrid Mamba-Transformer with only 6 self-attention layers, we keep the 6 self-attention layers and their preceding Mamba-2 layers at BF16, quantize the remaining network to NVFP4, and quantize KV-Cache to FP8."

### Layer Architecture (52 layers total)

Pattern: `MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME`
- M = Mamba-2 (23 layers)
- E = MoE (23 layers) -- 128 routed experts + 1 shared, 6 active per token
- \* = Attention (6 layers) -- GQA with 2 groups
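The excluded layer indices used throughout this doc fall out of the pattern string mechanically:

```python
# Layer pattern from the Nemotron 3 Nano config (M = Mamba-2, E = MoE, * = Attention)
PATTERN = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"

attention = [i for i, c in enumerate(PATTERN) if c == "*"]
mamba     = [i for i, c in enumerate(PATTERN) if c == "M"]
moe       = [i for i, c in enumerate(PATTERN) if c == "E"]
# In this pattern, the layer immediately before each attention layer is always Mamba-2:
feeding_mamba = [i - 1 for i in attention]

print(len(PATTERN), len(mamba), len(moe), len(attention))  # 52 23 23 6
print(attention)      # [5, 12, 19, 26, 33, 42]
print(feeding_mamba)  # [4, 11, 18, 25, 32, 41]
```

Deriving the indices this way (rather than hard-coding them) guards against off-by-one mistakes when building the exclude config.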

### What's Excluded from NVFP4 (kept in BF16)

| Category | Count | Layers | Rationale |
|----------|-------|--------|-----------|
| lm_head | 1 | output projection | Standard -- always excluded |
| Attention q/k/v/o_proj | 24 (6x4) | Layers 5, 12, 19, 26, 33, 42 | Attention precision critical |
| Mamba in_proj/out_proj feeding attention | 12 (6x2) | Layers 4, 11, 18, 25, 32, 41 | Protect signal flowing into attention |
| ALL Mamba conv1d | 23 | All 23 Mamba layers | Conv1d quantization destabilizes SSM state |
| **Total excluded** | **60 modules** | | |

### What IS Quantized to NVFP4

- **MoE expert layers**: 23 MoE layers x (128 routed + 1 shared) expert MLPs -- this is the BULK of the 30B parameters
- **Non-attention-feeding Mamba in_proj/out_proj**: 17 Mamba layers (those not directly before an attention layer)
- **MoE router/gate**: Already disabled by ModelOpt default config (`*router*`, `*block_sparse_moe.gate*`)

This is smart engineering: MoE experts contain most of the 30B parameters but only 6 are active per token, so FP4 quantization has less impact. Attention and Mamba conv1d are kept precise because they affect every token.

---

## 3. NVIDIA Blog Posts, Docs, and Examples

### QAD Paper (arxiv: 2601.20088) -- CRITICAL REFERENCE

**"Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery"** (March 2026)

NVIDIA used a two-step process for the official NVFP4 checkpoint:
1. **PTQ** with selective exclusion (the config above)
2. **QAD** (Quantization-Aware Distillation) for accuracy recovery

Key findings from the paper relevant to our case:

| Finding | Implication for Annie |
|---------|----------------------|
| QAD needs ~2.5B tokens for Nemotron 3 Nano | We don't have this much Annie data (~1000 convos at most) |
| QAD is robust to data quality -- even random tokens maintain PTQ baseline | We can use any reasonable dataset |
| QAD uses KL divergence between BF16 teacher and FP4 student | Requires BOTH models in memory simultaneously |
| QAD outperforms QAT on RL-trained models | Nemotron 3 Nano uses RL -- QAD preferred over QAT |
| QAT BREAKS RL capabilities (Table 3) | Standard QAT is NOT recommended for this model |
| LR: 1e-5 for Nemotron 3 Nano | If we attempt QAD |
| Same-model teacher outperforms larger teacher | Use our BF16 as teacher, not a bigger model |

**Table 3 results for Nemotron 3 Nano:**

| Method | AIME25 | GPQA-D | LiveCodeBench |
|--------|--------|--------|---------------|
| BF16 | 89.1 | 73.0 | 72.1 |
| NVFP4 PTQ | 85.0 | 71.6 | 68.9 |
| NVFP4 QAT | 83.3 | 66.0 | 62.0 |
| **NVFP4 QAD** | **87.9** | **72.7** | **68.9** |

PTQ only loses 2-4 points. **QAT actually makes things WORSE** for this RL-trained model. QAD recovers most of the gap.

### ModelOpt PTQ Example

`TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py`:
- Generic script that works with any HuggingFace model
- Supports `--quant nvfp4` flag
- Has `--low_memory_mode` for FP8/NVFP4 with max calibration
- Already handles Nemotron models (VL variant has special image calibration path)
- Uses `trust_remote_code=True` for custom architectures

### ModelOpt QAT Example

`TensorRT-Model-Optimizer/examples/llm_qat/`:
- Support matrix: LLAMA 2/3/3.1, CodeLlama, Qwen2/2.5/3
- Does NOT explicitly list Nemotron/nemotron_h
- QAD requires `QADTrainer` with `LMLogitsLoss()` criterion
- QAD does NOT support FSDP1, only FSDP2
- 8B model needs "minimum 2 x 80GB GPUs" for QAT

### QAD Code Repositories (from paper)

Three implementations exist:
1. **Megatron-LM version** -- NVIDIA's internal training framework
2. **NeMo version** -- NeMo 2.0 + NeMo-Run
3. **HuggingFace Transformers version** -- in TensorRT-Model-Optimizer

---

## 4. MoE Routing Layers and Mamba/SSM Layers -- Special Handling

### MoE Routing/Gate Layers

**Already handled by ModelOpt defaults.** The `_default_disabled_quantizer_cfg` disables:
- `*block_sparse_moe.gate*`
- `*router*`
- `*mlp.gate.*`
- `*mlp.shared_expert_gate.*`

These are routing decision layers that must stay precise. Only the expert MLP weights (up_proj, down_proj, gate_proj within each expert) get quantized.

### Mamba/SSM Layers

**conv1d must be excluded.** ModelOpt defaults already disable `*mixer.conv1d*`, and NVIDIA's explicit exclude list enumerates all 23. The conv1d is a causal depthwise convolution critical for temporal state -- quantizing it destabilizes the SSM recurrence.

**in_proj/out_proj**: Selectively excluded only for Mamba layers feeding into attention (6 of 23). The other 17 Mamba layers have their in_proj/out_proj quantized. These are standard Linear layers that ModelOpt handles natively.

**SSM parameters (dt_bias, A_log, D)**: These are small Parameters (not nn.Linear), so ModelOpt ignores them by default. They stay in BF16/FP32.

**Key insight from Nemotron config:** `"mamba_ssm_cache_dtype": "float32"` -- the SSM state cache runs in FP32, not BF16. This is extra precision protection for the recurrence.
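ModelOpt resolves these wildcard keys against module names with fnmatch-style globbing, so the exclusion bookkeeping is easy to check offline. A sketch (the `backbone.layers.*` name prefixes are illustrative assumptions, not verified Nemotron module paths):

```python
from fnmatch import fnmatch

# Subset of the disable patterns: ModelOpt defaults plus one of NVIDIA's explicit entries
patterns = ["*lm_head*", "*mixer.conv1d*", "*layers.5.mixer.q_proj*"]

def excluded(name: str) -> bool:
    return any(fnmatch(name, p) for p in patterns)

assert excluded("lm_head")
assert excluded("backbone.layers.0.mixer.conv1d")       # conv1d in any Mamba layer
assert excluded("backbone.layers.5.mixer.q_proj")       # attention layer 5
assert not excluded("backbone.layers.7.mixer.in_proj")  # ordinary Mamba projection: quantized

# The 60-module total from Section 2:
assert 6 * 4 + 6 * 2 + 23 + 1 == 60  # attn projs + feeding-Mamba projs + conv1d + lm_head
```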

---

## 5. Memory Requirements on 128 GB DGX Spark

### Option A: PTQ Only (RECOMMENDED FIRST STEP)

| Component | Memory |
|-----------|--------|
| Model BF16 on GPU | 59 GB |
| Calibration activations | ~5 GB |
| FakeQuantize wrappers | ~3 GB |
| **Total** | **~67 GB** |
| Headroom | 61 GB |

**FITS.** Same approach we used for Qwen 9B, just with NVIDIA's custom config.

### Option B: Full QAT

| Component | Memory |
|-----------|--------|
| Model BF16 | 59 GB |
| AdamW optimizer | 118 GB |
| Gradients | 59 GB |
| Activations (grad ckpt) | ~15 GB |
| **Total** | **~254 GB** |

**DOES NOT FIT.** 2x over budget. Also QAT BREAKS RL-trained models per NVIDIA's paper.

### Option C: QAD (Teacher-Student Distillation)

| Component | Memory |
|-----------|--------|
| Teacher BF16 (frozen) | 59 GB |
| Student BF16 | 59 GB |
| Optimizer (student only) | 118 GB |
| Gradients (student only) | 59 GB |
| Activations (grad ckpt) | ~20 GB |
| **Total** | **~315 GB** |

**DOES NOT FIT.** 2.5x over budget. NVIDIA used multi-GPU clusters for this.

### Option D: QAT with LoRA

| Component | Memory |
|-----------|--------|
| Model BF16 (frozen base) | 59 GB |
| LoRA adapters | ~1.2 GB |
| LoRA optimizer (AdamW) | ~2.4 GB |
| Activations (grad ckpt) | ~15 GB |
| FakeQuantize wrappers | ~3 GB |
| **Total** | **~80 GB** |

**FITS** (48 GB headroom). But caution: LoRA merge may destroy QAT state (known bug from session 340). Also QAT not recommended for RL-trained models.

### Verdict

**PTQ is both the most practical AND the recommended approach for this model.** NVIDIA's own data shows PTQ only loses 2-4 points on Nemotron 3 Nano benchmarks, and QAT actively hurts this RL-trained model. The selective exclusion strategy (keeping attention + feeding-Mamba + all conv1d in BF16) is what preserves quality, not additional training.
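The four budgets above reduce to simple arithmetic against the 128 GB ceiling. Component estimates are copied from the tables; the ~3 GB fake-quant wrapper and ~20 GB QAD activation terms are our assumptions to reconcile the stated totals:

```python
BUDGET_GB = 128
options = {
    "PTQ":        59 + 5 + 3,               # model + calib activations + fake-quant wrappers
    "Full QAT":   59 + 118 + 59 + 15 + 3,   # model + AdamW + grads + activations + ~3 GB wrappers (assumed)
    "QAD":        59 + 59 + 118 + 59 + 20,  # teacher + student + optimizer + grads + ~20 GB activations (assumed)
    "QAT + LoRA": 59 + 1.2 + 2.4 + 15 + 3,  # frozen base + adapters + adapter AdamW + activations + wrappers
}
for name, gb in options.items():
    print(f"{name:10s} ~{round(gb):3d} GB -> {'fits' if gb <= BUDGET_GB else 'does NOT fit'}")
```

Only PTQ (~67 GB) and QAT + LoRA (~81 GB) come in under budget.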

---

## 6. Alternative Quantization Approaches

### GGUF (llama.cpp)

**Available:** `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (120K downloads)

| Format | Size | Notes |
|--------|------|-------|
| Q4_K_M | ~20 GB | Most popular 4-bit |
| Q8_0 | ~34 GB | High quality |
| BF16 | 63 GB | Full precision |

**Pros:**
- Battle-tested by community
- llama.cpp supports nemotron_h/Mamba/MoE
- No GPU-specific kernel requirements
- Runs on any hardware

**Cons:**
- Several bugs were filed and fixed (conv1d crashes, cuBLAS corruption with MoE)
- Chat parser regression still open (#20325)
- CUDA memory access errors on some configs (#20131)
- Cannot use FP4 kernels optimized for Blackwell (slower than NVFP4 on vLLM)
- Our existing serving infrastructure is vLLM-based, not llama.cpp

**Verdict:** GGUF Q4_K_M is a viable fallback if NVFP4/vLLM hits issues. Lower throughput but broader compatibility.

### AWQ

**Not recommended.** AWQ produces `MIXED_PRECISION` format that vLLM cannot serve (proven in session 340 with Qwen 9B). Also no community AWQ models for Nemotron exist.

### GPTQ

**Not tested.** Community GPTQ models may exist but GPTQ doesn't produce NVFP4 format, so no Blackwell FP4 kernel advantage.

---

## Recommended Approach: Step-by-Step

### Step 1: PTQ with NVIDIA's Exact Config

Use our existing export-only flow adapted for Nemotron's custom config:

```python
import copy
import modelopt.torch.quantization as mtq

# Start with NVFP4_DEFAULT_CFG
custom_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)

# Replicate NVIDIA's exclude_modules as disable patterns
# Attention layers (5, 12, 19, 26, 33, 42) - all projections
for layer in [5, 12, 19, 26, 33, 42]:
    for proj in ['q_proj', 'k_proj', 'v_proj', 'o_proj']:
        custom_cfg["quant_cfg"][f"*layers.{layer}.mixer.{proj}*"] = {"enable": False}

# Mamba layers feeding attention (4, 11, 18, 25, 32, 41) - in/out proj
for layer in [4, 11, 18, 25, 32, 41]:
    for proj in ['in_proj', 'out_proj']:
        custom_cfg["quant_cfg"][f"*layers.{layer}.mixer.{proj}*"] = {"enable": False}

# All Mamba conv1d layers (already disabled by default, but explicit is safer)
mamba_layers = [0, 2, 4, 7, 9, 11, 14, 16, 18, 21, 23, 25,
                28, 30, 32, 35, 37, 39, 41, 44, 46, 48, 50]
for layer in mamba_layers:
    custom_cfg["quant_cfg"][f"*layers.{layer}.mixer.conv1d*"] = {"enable": False}

# Quantize
model = mtq.quantize(model, custom_cfg, forward_loop)
```
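The `forward_loop` above is left undefined. A minimal calibration loop might look like the following -- the tokenizer/texts plumbing is our sketch, not a ModelOpt requirement; any callable that runs representative batches through the model works:

```python
import torch

def make_forward_loop(tokenizer, texts, max_length=2048, batch_size=1):
    """Build the calibration callable mtq.quantize() expects: it receives the
    model and runs forward passes so quantizer amax ranges can be observed."""
    def forward_loop(model):
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                                  truncation=True, max_length=max_length,
                                  padding=True)
                model(**{k: v.to(model.device) for k, v in batch.items()})
    return forward_loop

# e.g. 64 Annie conversations rendered through the chat template:
# forward_loop = make_forward_loop(tokenizer, calib_texts[:64])
```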

**Run inside NGC container:**
```bash
docker run --gpus all --rm -it --ipc=host \
  -v ~/models:/models \
  -v ~/workplace/her/her-os:/workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers accelerate datasets && \
  python3 /workspace/scripts/nemotron_nvfp4_ptq.py \
    --model /models/Nemotron-3-Nano-Annie-QLoRA-v1/merged_bf16 \
    --output /models/Nemotron-3-Nano-Annie-NVFP4 \
    --calib-samples 64"
```

### Step 2: Verify Export

Check `hf_quant_config.json`:
- `quant_algo` must be `"NVFP4"` (not `MIXED_PRECISION`)
- `exclude_modules` should list all 60 excluded modules
- Total weight size should be ~17-19 GB (NVIDIA's NVFP4 repo is 19.4 GB including tokenizer files), not the ~7.5 GB we saw for Qwen 9B

### Step 3: Serve with vLLM

```bash
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
docker run --gpus all -d --name vllm-annie-nemotron \
  -v ~/models/Nemotron-3-Nano-Annie-NVFP4:/model \
  -p 8003:8000 \
  nvcr.io/nvidia/vllm:25.12.post1-py3 \
  --model /model \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --enforce-eager \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin /model/nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
```

**Note:** Must use NGC vLLM image (`nvcr.io/nvidia/vllm:25.12.post1-py3`) for FlashInfer MoE FP4 kernels. Our existing custom Docker image (`hellohal2064/vllm-qwen3.5-gb10`) lacks these.
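Once the container is up, a quick smoke test against the OpenAI-compatible endpoint (port 8003 as mapped above; stdlib only, request shape per the OpenAI chat-completions schema that vLLM implements):

```python
import json
import urllib.request

def chat_request(prompt: str,
                 base_url: str = "http://localhost:8003/v1",
                 model: str = "/model") -> urllib.request.Request:
    """Build a POST request for the vLLM server's /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# with urllib.request.urlopen(chat_request("Hello, Annie")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```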

### Step 4: Benchmark

Run existing benchmark suite. If quality is acceptable (PTQ should be close -- NVIDIA shows only 2-4 point loss), we're done.

### Step 5 (IF NEEDED): QAT with LoRA

Only if PTQ quality is unacceptable. This requires adapting the existing `qat_nvfp4_v2.py` script:
- Change assistant markers from `<|im_start|>assistant` to Nemotron's chat template format
- Use custom config instead of `NVFP4_DEFAULT_CFG`
- Set `--fsdp_transformer_layer_cls_to_wrap NemotronHBlock`
- Use LoRA (full fine-tune won't fit at 254 GB)
- Warning: QAT may hurt RL capabilities per NVIDIA paper Table 3

---

## Key Differences from Our Qwen 9B Recipe

| Aspect | Qwen 9B | Nemotron 30B |
|--------|---------|-------------|
| Config | `NVFP4_DEFAULT_CFG` (quantize everything) | Custom config with 60 excluded modules |
| Architecture | Dense transformer | Hybrid MoE + Mamba + Attention |
| Model size | 18 GB BF16 | 59 GB BF16 |
| QAT feasibility | Full fine-tune fits (82 GB) | Only LoRA fits (80 GB), full needs 254 GB |
| QAT recommendation | Recommended (improved v2→v4) | NOT recommended (breaks RL capabilities) |
| Serving image | `hellohal2064/vllm-qwen3.5-gb10` | `nvcr.io/nvidia/vllm:25.12.post1-py3` (NGC) |
| Special env vars | None | `VLLM_USE_FLASHINFER_MOE_FP4=1` |
| KV cache | Default | FP8 (`--kv-cache-dtype fp8`) |
| Reasoning parser | `qwen3` | `nano_v3` (custom plugin) |
| ModelOpt version | 0.37.0 (our container) | 0.29.0 (NVIDIA used), 0.42.0 (latest) |
| Export size | ~7.5 GB | ~17-19 GB (matches NVIDIA's 19.4 GB repo) |

---

## Open Questions

1. **ModelOpt version mismatch:** NVIDIA used v0.29.0, our container has v0.37.0+, latest is v0.42.0. The quantization config format may have changed. Need to verify `export_hf_checkpoint` still produces the same output format.

2. **Does our fine-tuned model's custom behavior survive PTQ?** NVIDIA tested base Nemotron 3 Nano, not a QLoRA-fine-tuned variant. Our Annie behavioral patterns (tool calling, conciseness) may be more fragile.

3. **Calibration data source:** NVIDIA didn't specify their PTQ calibration data. We should use Annie conversations for calibration (domain-relevant activation ranges), even though the QAD paper's data-robustness finding suggests data quality matters little.

4. **VL config wrapper needed?** Our Qwen NVFP4 needed a VL config wrapper for vLLM compatibility. Nemotron uses `trust_remote_code=True` with custom model files -- the serving path may be different.

5. **nano_v3_reasoning_parser.py:** This custom parser extends DeepSeekR1ReasoningParser. Need to copy it into our model directory or download from HuggingFace.

---

## References

- [NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 (HuggingFace)](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4)
- [QAD Paper (arxiv: 2601.20088)](https://arxiv.org/abs/2601.20088)
- [TensorRT-Model-Optimizer PTQ examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq)
- [TensorRT-Model-Optimizer QAT examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_qat)
- [NVIDIA QAT Blog Post](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
- [Our Qwen 9B NVFP4 lessons learned](/home/rajesh/workplace/her/her-os/docs/NVFP4-LESSONS-LEARNED.md)
- [Our QAT v2 script](/home/rajesh/workplace/her/her-os/scripts/qat_nvfp4_v2.py)
