# Next Session: vLLM Model Survey — What Can Replace Nemotron Nano on Titan?

## Context

Session 409 benchmarked **Gemma 4 26B-A4B via Ollama** (GGUF Q4). Quality was excellent (8/8 tools, 7/7 entities, 5/5 vision, 6/6 Kannada), but the VRAM (35 GB) and TTFT (836ms) numbers include Ollama overhead and are not representative of vLLM performance. The NVFP4 checkpoint doesn't exist yet, so we can't do a fair vLLM head-to-head.

**But we shouldn't wait blindly.** There may be OTHER models that already have vLLM-compatible quantizations (NVFP4, GPTQ, AWQ, FP8) and could replace Nemotron Nano on Titan today — or at least be benchmarked now.

## What We're Looking For

A model to replace **Nemotron Nano 30B-A3B** on Titan for voice + extraction. Requirements:

| Dimension | Must Have | Nice to Have |
|-----------|----------|--------------|
| **VRAM** | ≤ 20 GB (quantized) | ≤ 15 GB |
| **Speed** | > 40 tok/s on DGX Spark | > 50 tok/s |
| **TTFT** | < 300ms via vLLM | < 150ms |
| **Tool calling** | > 90% accuracy | > 95% |
| **Architecture** | vLLM-supported on SM_121 aarch64 | MoE (lower active params) |
| **Quantization** | NVFP4, FP8, GPTQ, or AWQ available | NVFP4 preferred |
| **Vision** | Any image understanding | Video too |
| **Multilingual** | Kannada entity extraction | Native Kannada response |
| **Context** | ≥ 128K | ≥ 256K |
| **License** | Permissive (Apache 2.0, MIT) | Not NVIDIA-restricted |

## Current Baseline (Nemotron Nano)

- **Model**: Nemotron 3 Nano 30B-A3B NVFP4
- **VRAM**: 18 GB (vLLM, always loaded)
- **Speed**: 48-65 tok/s
- **TTFT**: 130ms
- **Tool calling**: ~95% in production
- **Vision**: NONE (text only)
- **Multilingual**: English-focused
- **Context**: 128K (vLLM config)

## Research Tasks

### 1. Survey vLLM-compatible models on HuggingFace
Search for models that:
- Have NVFP4/FP8/GPTQ/AWQ quantizations published
- Are in the 7B-30B parameter range (fit in 20 GB quantized)
- Have MoE architecture (lower active params = faster)
- Support vision/multimodal
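
A quick way to pre-filter the Hub is a script over `huggingface_hub`. A minimal sketch, assuming `huggingface_hub` is installed; the repo-name heuristic and the quant-tag list are illustrative, not exhaustive:

```python
# Quant formats vLLM can load directly (per the requirements table above).
QUANT_TAGS = ("nvfp4", "fp8", "gptq", "awq")

def looks_quantized(model_id: str) -> bool:
    """Heuristic: the repo name advertises a vLLM-friendly quant format."""
    name = model_id.lower()
    return any(tag in name for tag in QUANT_TAGS)

def survey(query: str, limit: int = 50) -> list:
    """Most-downloaded repos matching `query` whose names look quantized."""
    from huggingface_hub import HfApi  # pip install huggingface_hub

    api = HfApi()
    models = api.list_models(search=query, sort="downloads",
                             direction=-1, limit=limit)
    return [m.id for m in models if looks_quantized(m.id)]
```

Name matching misses quants published without the format in the repo name, so treat hits as a starting list, not a complete one.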

**Candidates to investigate** (starting list):
- Gemma 4 26B-A4B (NVFP4 pending — check again)
- Gemma 4 31B Dense (NVFP4 exists: `nvidia/Gemma-4-31B-IT-NVFP4` — but 31B all-active, more VRAM)
- Gemma 4 E4B (~8B, has audio input — interesting for edge)
- Qwen3 MoE variants (if any new ones since ADR-028 retired Qwen)
- Mistral/Mixtral small MoE models
- DeepSeek-V3 lite / distilled variants
- Any new NVIDIA NIM models
- Phi-4 or successors
- Llama 4 Scout/Maverick (if available)
- Command-R variants (Cohere)

### 2. Check vLLM DGX Spark compatibility
For each candidate:
- Does vLLM support the architecture? (check vLLM model list)
- Are there SM_121 / aarch64 issues? (check vLLM issues)
- Any community patches needed?
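
A lightweight pre-check before pulling any weights: read each candidate's `config.json` from the Hub and compare its `architectures` entry against the model classes this vLLM build is known to load. A stdlib-only sketch; the `KNOWN_GOOD` seed set is an assumption to be extended as architectures are actually verified inside the `vllm-node` container:

```python
import json
from urllib.request import urlopen

# Assumption: seed with architectures already verified on this vLLM build.
KNOWN_GOOD = {"NemotronForCausalLM"}

def architectures(config: dict) -> list:
    """The HF `architectures` field names the model class vLLM must support."""
    return config.get("architectures", [])

def is_supported(config: dict, known: set = KNOWN_GOOD) -> bool:
    return any(arch in known for arch in architectures(config))

def fetch_config(repo_id: str) -> dict:
    """Download a repo's config.json without pulling the weights."""
    url = f"https://huggingface.co/{repo_id}/resolve/main/config.json"
    with urlopen(url) as resp:
        return json.load(resp)
```

Passing this check only rules out architectures vLLM can't load at all; it says nothing about SM_121/aarch64 kernel availability, which still needs the issue-tracker pass above.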

### 3. Benchmark candidates that have vLLM-ready quantizations
Even if NVFP4 isn't available, FP8/GPTQ/AWQ can give us valid performance comparisons since they run through vLLM (same runtime as Nano). Adapt `scripts/benchmark_gemma4_phase1.py` for each.

Test matrix (same as Phase 1):
- Throughput (tok/s)
- TTFT (streaming)
- VRAM
- Entity extraction (7 persons, 3 places)
- Tool calling (8 cases)
- Vision (if supported)
- Kannada (entities + response)
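
For the throughput and TTFT rows, the streaming measurement is worth isolating into one helper so every candidate is timed identically. A stdlib-only sketch against a vLLM OpenAI-compatible `/v1/completions` endpoint; `base_url`, `max_tokens`, and the one-chunk-per-token approximation are assumptions to align with the benchmark script:

```python
import json
import time
from urllib.request import Request, urlopen

def stream_metrics(stamps: list, start: float) -> dict:
    """TTFT and decode throughput from per-chunk arrival times."""
    ttft_ms = (stamps[0] - start) * 1000
    elapsed = stamps[-1] - stamps[0]
    tok_s = (len(stamps) - 1) / elapsed if elapsed > 0 else 0.0
    return {"ttft_ms": round(ttft_ms, 1), "tok_s": round(tok_s, 1)}

def benchmark(base_url: str, model: str, prompt: str) -> dict:
    """Time one streaming completion; each `data:` SSE chunk ~ one token."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": 256, "stream": True}).encode()
    req = Request(f"{base_url}/v1/completions", data=body,
                  headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    stamps = []
    with urlopen(req) as resp:
        for line in resp:
            if line.startswith(b"data:") and b"[DONE]" not in line:
                stamps.append(time.perf_counter())
    return stream_metrics(stamps, start)
```

Run the same prompt several times and report the median; the first request typically includes warmup effects that inflate TTFT.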

### 4. Update RESOURCE-REGISTRY.md
For any model we benchmark, document VRAM usage and compare against current budget.
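
The budget comparison is a few lines of arithmetic; a sketch assuming the Titan allocations listed in RESOURCE-REGISTRY.md (14 GB embedding, 9.4 GB audio, 18 GB Nano) and the 128 GB unified pool:

```python
TOTAL_GB = 128.0                               # DGX Spark unified memory
OTHER_GB = {"embedding": 14.0, "audio": 9.4}   # from RESOURCE-REGISTRY.md
NANO_GB = 18.0                                 # current Nano allocation

def headroom(candidate_vram_gb: float) -> float:
    """GB left on Titan if the candidate replaces Nano."""
    return TOTAL_GB - candidate_vram_gb - sum(OTHER_GB.values())

def delta_vs_nano(candidate_vram_gb: float) -> float:
    """Positive means the candidate needs more VRAM than Nano."""
    return candidate_vram_gb - NANO_GB
```

Because memory is unified, measure the candidate's real resident footprint under load rather than trusting `nvidia-smi` alone.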

## Key Files

| File | Purpose |
|------|---------|
| `docs/RESEARCH-GEMMA4-BENCHMARK.md` | Gemma 4 research + Phase 1 results |
| `docs/RESOURCE-REGISTRY.md` | Current VRAM budget (Titan: 18 GB Nano + 14 GB embedding + 9.4 GB audio) |
| `scripts/benchmark_gemma4_phase1.py` | Benchmark template (adapt for new models) |
| `scripts/results/benchmark_gemma4_phase1.json` | Gemma 4 baseline results |
| `start.sh` line 276 | Ollama pin (currently 0.20.0) |

## DGX Spark Constraints (CRITICAL)

- **SM_121** (Blackwell) — not all CUDA kernels work. vLLM needs patches for some models.
- **aarch64** (ARM64) — no x86 Docker images. Must use native ARM or build from source.
- **Unified memory** — GPU and CPU share 128 GB; `nvidia-smi` may not report VRAM accurately.
- **vLLM container**: `vllm-node` Docker image with custom mods applied via start.sh.

## Decision Framework

After benchmarking, score each model:

| Metric | Weight | How scored |
|--------|--------|-----|
| Quality (tools + entities + Kannada) | 40% | Must meet thresholds |
| Performance (tok/s + TTFT) | 25% | Compare to Nano baseline |
| Vision capability | 20% | Binary (has it or doesn't) |
| VRAM efficiency | 10% | Lower is better |
| Stability / maturity | 5% | Production track record |

**Swap threshold**: Total score must exceed Nano's by ≥ 15% to justify migration risk.
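
The weighting and swap rule above can be sketched as a small scorer. Assumptions: each metric is pre-normalized to [0, 1], and the hard "must meet thresholds" gate on quality is applied before scoring:

```python
# Weights from the decision-framework table above.
WEIGHTS = {"quality": 0.40, "performance": 0.25, "vision": 0.20,
           "vram": 0.10, "stability": 0.05}

def score(metrics: dict) -> float:
    """Weighted total; every metric must be pre-normalized to [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

def should_swap(candidate: dict, nano: dict, margin: float = 0.15) -> bool:
    """Swap only if the candidate beats Nano's score by at least 15%."""
    return score(candidate) >= score(nano) * (1 + margin)
```

Worked through: Nano with no vision but everything else perfect scores 0.80, so a vision-capable candidate must reach 0.92 to clear the bar — the 20% vision weight alone nearly covers the margin.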
