# Research: Gemma 4 Family — Comparison & Benchmark Plan

> **Date:** 2026-04-04 (Phase 2-3 results appended 2026-04-05)
> **Status:** PHASE 3 COMPLETE. All swap thresholds pass with the community NVFP4 checkpoint (section 12); Phase 4 production integration is next.
> **Relevance:** Potential Nemotron Nano replacement on Titan (voice + extraction)

## 1. Gemma 4 Family Overview

Released **April 2, 2026** by Google under **Apache 2.0** license.
Built on Gemini 3 architecture.

| Model | Total Params | Active Params | Architecture | Context | Modalities | Target |
|-------|-------------|---------------|-------------|---------|------------|--------|
| **E2B** | 5.1B | 2.3B | Dense + PLE | 128K | Text, image, video, **audio** | Phone/edge |
| **E4B** | ~8B | ~4B | Dense + PLE | 128K | Text, image, video, **audio** | Phone/edge |
| **26B-A4B** | 25.2B | 3.8B | MoE (128 experts, 8+1 active) | 256K | Text, image, video | Workstation |
| **31B** | 31B | 31B (all) | Dense | 256K | Text, image, video | Server |

### Key Architecture Details

- **PLE (Per-Layer Embeddings):** E2B/E4B use per-layer embeddings, so the 2.3B active parameters retain the representational capacity of the full 5.1B. Fits in <1.5 GB quantized.
- **MoE:** 26B-A4B has 128 small experts, activates 8 + 1 shared per token. Only 3.8B params fire per forward pass.
- **Hybrid attention:** Interleaves local sliding-window + full global attention. Final layer always global.
- **Audio input:** Only E2B and E4B support native audio (speech recognition built in). 26B and 31B are text+vision only.
- **Languages:** 140+ natively trained.
- **Arena AI text leaderboard:** 31B = #3, 26B = #6 (beating models 20x their size).
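
The MoE arithmetic above can be illustrated with a toy router: top-k selection plus one always-on shared expert. This is a sketch only; Gemma 4's actual routing function and scoring are not documented here.

```python
import random

def route_topk(router_scores, k=8):
    """Return indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]

NUM_EXPERTS, TOP_K = 128, 8                 # per the 26B-A4B row above
scores = [random.random() for _ in range(NUM_EXPERTS)]
routed = route_topk(scores, TOP_K)
active_blocks = len(routed) + 1             # + 1 shared expert, always active
# Only 9 of 129 expert blocks fire per forward pass, which is why
# 25.2B total params cost roughly 3.8B active params per token.
```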

## 2. Side-by-Side Comparison with her-os Models

### Primary comparison: Gemma 4 26B-A4B vs Nemotron Nano 30B-A3B (Titan voice)

| Dimension | **Gemma 4 26B-A4B** | **Nemotron Nano 30B-A3B** (current) |
|-----------|---------------------|-------------------------------------|
| Total params | 25.2B | 30B |
| Active params | 3.8B | 3B |
| Architecture | Standard MoE (transformer) | MoE + Mamba-2 hybrid (linear attention) |
| Context window | 256K | 128K (vLLM config), 1M (model max) |
| Quantization available | BF16, GGUF Q4/Q8, NVFP4 (31B only!) | **NVFP4 (native, verified)** |
| VRAM (quantized) | ~14-16 GB (GGUF Q4 est.) | **18 GB (NVFP4, verified)** |
| Modalities | Text + image + video | Text only |
| Audio input | No | No (separate Nemotron Speech 0.6B) |
| Tool calling | Yes | Yes (trained for tools) |
| Thinking mode | Not documented | Yes (`enable_thinking=false` for voice) |
| Languages | 140+ | English-focused |
| License | Apache 2.0 | NVIDIA custom (permissive) |
| Arena ranking | #6 text | Not ranked |
| DGX Spark perf | **45-60 tok/s** (community report) | **48-65 tok/s** (our verified benchmark) |
| vLLM stability | Main branch only (PR #38826) | **Stable, production-verified** |
| NVFP4 checkpoint | **NOT AVAILABLE for 26B** | **Available, deployed** |

### Secondary comparison: Gemma 4 31B Dense vs Nemotron Super 120B-A12B (Beast)

| Dimension | **Gemma 4 31B Dense** | **Nemotron Super 120B-A12B** (current) |
|-----------|----------------------|----------------------------------------|
| Total params | 31B | 120B |
| Active params | 31B (all) | 12B |
| Architecture | Dense transformer | MoE + Mamba-2 hybrid |
| Context window | 256K | 131K (config) / 1M (model max) |
| VRAM (NVFP4) | ~18-20 GB est. | ~80 GB + 17 GB KV cache |
| Modalities | Text + image + video | Text only |
| Tool accuracy | Unknown | **97% (SWE-Bench verified)** |
| Thinking | Unknown | Full reasoning enabled |
| NVFP4 checkpoint | **YES: `nvidia/Gemma-4-31B-IT-NVFP4`** | Available, deployed |
| License | Apache 2.0 | NVIDIA custom |

**Verdict:** Not a Super replacement. 31B all-active is a different weight class than 120B/12B MoE. Super has 120B knowledge with 12B inference cost. Gemma 31B would be faster but significantly less capable for complex reasoning and tool chains.

### Edge opportunity: Gemma 4 E2B vs our STT pipeline

| Dimension | **Gemma 4 E2B** | **Current STT stack** |
|-----------|-----------------|----------------------|
| STT | Built-in audio input | Nemotron Speech 0.6B (2.49 GB) |
| Vision | Built-in image+video | None |
| Total VRAM | <1.5 GB (quantized) | 2.49 GB (STT alone) |
| Languages (audio) | 140+ | English only |
| Target | Pixel phone / edge | Titan (DGX Spark) |
| Indian language quality | Unknown | N/A (IndicConformerASR planned for Panda) |

**Verdict:** Interesting for Pixel 9a on-device inference. Could handle simple STT+LLM in a single 1.5 GB model, eliminating the Panda hop for simple queries. But it lacks Indian-language TTS, and Kannada quality is unknown.

## 3. NVFP4 Availability — Exhaustive Search (2026-04-04)

### Official NVIDIA checkpoints (HuggingFace)

| Model | NVFP4 Available? | Repository |
|-------|-----------------|------------|
| Gemma 4 31B Dense | **YES** | [`nvidia/Gemma-4-31B-IT-NVFP4`](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) |
| Gemma 4 26B-A4B | **NO** | Not published |
| Gemma 4 E4B | **NO** | Not published |
| Gemma 4 E2B | **NO** | Not published |

### Community quantizations available

| Format | 26B-A4B | 31B | E4B | E2B |
|--------|---------|-----|-----|-----|
| **GGUF** (Unsloth) | [`unsloth/gemma-4-26B-A4B-it-GGUF`](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) | [`unsloth/gemma-4-31B-it-GGUF`](https://huggingface.co/unsloth/gemma-4-31B-it-GGUF) | [`unsloth/gemma-4-E4B-it-GGUF`](https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF) | - |
| **GGUF** (bartowski) | [`bartowski/google_gemma-4-26B-A4B-it-GGUF`](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF) | [`bartowski/google_gemma-4-31B-it-GGUF`](https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF) | - | - |
| **GGUF** (ggml-org) | [`ggml-org/gemma-4-26B-A4B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF) | - | - | - |
| **MLX 4-bit** | [`mlx-community/gemma-4-26b-a4b-it-4bit`](https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-4bit) | - | - | - |
| **MLX NVFP4** (community) | [`mlx-community/gemma-4-26b-a4b-it-nvfp4`](https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-nvfp4) | - | - | - |
| **AWQ** | Not found | Not found | Not found | Not found |
| **GPTQ** | Not found | Not found | Not found | Not found |

### Ollama availability

| Model | Tag | Size | Available? |
|-------|-----|------|-----------|
| gemma4 (default = E4B) | `gemma4` | ~9.6 GB | **YES** |
| gemma4 26B-A4B | `gemma4:26b` | ~14-16 GB | **YES** |
| gemma4 31B | `gemma4:31b` | ~18-20 GB | **YES** |
| gemma4 E2B | `gemma4:e2b` | ~1.5 GB | **YES** |

### Bottom line on NVFP4

**The 26B-A4B does NOT have an official NVIDIA NVFP4 checkpoint.** Only the 31B Dense has one. The MLX community uploaded an "nvfp4" variant but that's Apple Silicon MLX format, not NVIDIA NVFP4 for vLLM/Blackwell. Google said "quantized versions following shortly" at launch — but as of 2026-04-04 (2 days post-launch), only the 31B NVFP4 has shipped.

## 4. DGX Spark Compatibility Status

### vLLM support

| Item | Status |
|------|--------|
| Gemma 4 model support in vLLM | **Main branch only** (PR #38826). NOT in any stable release. |
| vLLM on DGX Spark (sm_121 + aarch64) | **Requires patches** — stock vLLM does NOT work on GB10. |
| Community patches | [`atcuality2021/vllm-gb10-gemma4`](https://github.com/atcuality2021/vllm-gb10-gemma4) — one-command installer with sm_121 + aarch64 fixes. |
| Known issues | NCCL missing sm_121, CUTLASS FP8 tables absent, Ray unified memory OOM threshold. |

### Patches required (from community repo)

1. **`nccl-sm121-build.sh`** — Build NCCL with sm_121 support
2. **`cutlass-fp8-sm121.sh`** — Disable CUTLASS FP8 (fallback to Triton)
3. **`ray-unified-memory.sh`** — Fix Ray OOM threshold for unified memory architecture
4. **`gemma4-backport.sh`** — Backport Gemma 4 model support from vLLM main

### Community benchmarks on DGX Spark

| Model | Quantization | tok/s | Source |
|-------|-------------|-------|--------|
| Gemma 4 26B-A4B | BF16 (est.) | **45-60 tok/s** | [NVIDIA forums](https://forums.developer.nvidia.com/t/someone-post-this-gemma-4-26b-a4b-moe-running-at-45-60-tok-s-on-dgx-spark/365547) |
| Nemotron Nano 30B-A3B | NVFP4 | **48-65 tok/s** | Our verified benchmark |

**Performance is in the same ballpark.** With a proper NVFP4 quantization the 26B-A4B could close the gap or pull ahead, although it activates slightly more params per token (3.8B vs 3B) and lacks Nano's Mamba-2 linear-attention advantage.

## 5. What Gemma 4 26B Would Give Us (vs Nemotron Nano)

### Gains

| Capability | Impact |
|------------|--------|
| **Native vision** (image + video) | Photo understanding, screen reading, visual context — currently not possible |
| **2x context window** | 256K vs 128K — better long conversations |
| **140+ languages** | Potential Kannada understanding (quality unknown) |
| **Apache 2.0 license** | Fully open, no NVIDIA license restrictions |
| **Arena #6** | Independently validated quality ranking |

### Losses / Risks

| Concern | Impact |
|---------|--------|
| **No NVFP4 for 26B** | Must use GGUF Q4 or BF16 — less VRAM efficient |
| **No Mamba-2 hybrid** | Standard transformer, no linear attention for long sequences |
| **vLLM unstable** | Main branch only, requires community patches on DGX Spark |
| **No proven tool calling quality** | Nemotron Nano has verified tool calling in production |
| **No thinking mode** | Can't toggle reasoning on/off like Nemotron |
| **2 days old** | Zero production track record |
| **Voice latency unknown** | No TTFT benchmarks on Blackwell with NVFP4 |

## 6. Phase 1 Benchmark Results (2026-04-04)

**Script:** `scripts/benchmark_gemma4_phase1.py`
**Results JSON:** `scripts/results/benchmark_gemma4_phase1.json`
**Runtime:** Ollama 0.20.0 | Model: `gemma4:26b` (Q4_K_M, 25.8B) | GPU: DGX Spark GB10 128GB (isolated — all services stopped)

> **IMPORTANT**: This is a QUALITY benchmark, not a performance benchmark. Comparing Ollama GGUF Q4 (Gemma) vs vLLM NVFP4 (Nano) tok/s is structurally invalid — different runtimes, different quantization, different tensor core usage. Performance numbers are informational floor only.

### Results Summary

| Test | Result | Verdict | Notes |
|------|--------|---------|-------|
| **Throughput** | **59.7 tok/s** (±0.3) | Meets (floor) | Prompt eval: 901 tok/s |
| **TTFT** | **836 ms** (±113) | Fails (> 500ms) | Ollama overhead; vLLM would be ~130ms |
| **VRAM** | **35.3 GB** | Exceeds | 17 GB model + 18 GB KV cache/CUDA context |
| **Entity extraction** | **7/7 persons, 3/3 places** | Meets | JSON has minor syntax errors (regex fallback needed) |
| **Tool calling** | **8/8 (100% structured)** | Exceeds | Better than Nano's ~95% in production |
| **Vision** | **5/5 colors, accurate layout** | Exceeds | PNG works; BMP silently fails |
| **Kannada entities** | **6/6 key entities** | Exceeds | All: Priya, Arun, Mom, Bangalore, Hubli, Diwali |
| **Kannada response** | **YES (80 chars)** | Exceeds | "ನಮಸ್ಕಾರ! ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ" |

### Key Findings

1. **Thinking mode is ON by default** — Must pass `think: false` in every Ollama API request, otherwise thinking tokens consume all `num_predict` budget and content is empty. Same pattern as Nemotron Nano's `enable_thinking=false`.

2. **Vision WORKS** — Native image understanding with accurate color/layout detection. PNG format required (BMP silently produces empty response). This is the killer feature vs Nano.

3. **Tool calling is excellent** — 8/8 perfect structured tool calls, dispatching correctly between `web_search`, `search_memory`, and `get_entity_details`. No text-JSON fallback needed.

4. **VRAM is concerning** — 35 GB via Ollama (model 17 GB + overhead). Under vLLM with NVFP4, expect ~15-18 GB. Not viable for Titan alongside Audio Pipeline (9.4 GB) + embedding model.

5. **TTFT is high** — ~836ms via Ollama vs Nano's 130ms via vLLM. This is runtime overhead, not model quality. vLLM would dramatically improve this.

6. **Entity extraction quality is good but JSON output has bugs** — The model correctly identifies all entities but occasionally drops a quote character in JSON output (`"properties: {}` instead of `"properties": {}`). Needs lenient parsing.

7. **Kannada is strong** — Perfect entity extraction from mixed Kannada-English text, and natural Kannada conversation response.
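
Finding 1 in practice: a minimal Ollama `/api/chat` request with thinking disabled. The model tag is from this benchmark; the prompt and `num_predict` values are illustrative.

```python
import json
import urllib.request

def build_chat_request(model, prompt, num_predict=256):
    """Ollama /api/chat body; without "think": False the thinking tokens
    silently consume the num_predict budget and content comes back empty."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": False,
        "stream": False,
        "options": {"num_predict": num_predict},
    }

payload = build_chat_request("gemma4:26b", "List the people: Priya met Arun in Hubli.")
# Against a local Ollama 0.20+ instance:
# req = urllib.request.Request("http://localhost:11434/api/chat",
#                              data=json.dumps(payload).encode("utf-8"),
#                              headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req))["message"]["content"])
```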

### Decision Against Plan Criteria

| Metric | Category | Threshold | Result | Status |
|--------|----------|-----------|--------|--------|
| Entity extraction | Quality | ≥ 5/7 persons | **7/7** | PASS |
| Tool calling | Quality | ≥ 50% any format | **100% structured** | PASS |
| Vision | Quality | Describes accurately | **5/5 colors** | PASS |
| Kannada entities | Quality | ≥ 4/6 key entities | **6/6** | PASS |
| tok/s (floor) | Perf | > 25 tok/s | **59.7 tok/s** | PASS |
| TTFT (floor) | Perf | < 500ms | **836 ms** | FAIL (Ollama) |
| VRAM | Resource | < 20 GB | **35.3 GB** | FAIL (Ollama) |

**Quality: ALL PASS. Performance: BLOCKED by Ollama runtime — vLLM NVFP4 needed for fair comparison.**

### Ollama Upgrade Notes

- Upgraded from `0.17.1-rc2` → `0.20.0` (Gemma 4 requires newer Ollama)
- `start.sh` line 276 updated
- All 11 existing models survived upgrade (verified)

## 7. Phase 2 Plan (when NVFP4 available)

### Prerequisites (STILL BLOCKING)

- [ ] **Wait for NVIDIA to publish `nvidia/Gemma-4-26B-A4B-IT-NVFP4`** on HuggingFace
- [ ] **vLLM stable release with Gemma 4 support** (or use community patches)

### Phase 2 Tests (vLLM head-to-head)

| Test | What | Nemotron Nano baseline |
|------|------|----------------------|
| TTFT | Time to first token (empty KV) | 130ms |
| Sustained tok/s | 500-token generation | 48-65 tok/s |
| Tool calling | JSON tool output accuracy | ~95% |
| Long context | 32K input, 500 output | Works |
| Concurrent | 3 simultaneous requests | Works |
| VRAM | Total GPU memory used | 18 GB |
| Vision | Image understanding | N/A (Nano has none) |
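
Assuming the Phase 2 timing tests record per-chunk arrival times from a streaming response, TTFT and sustained tok/s reduce to a small calculation. A sketch; per-chunk token counts would come from the server's usage stats or a tokenizer.

```python
def summarize_stream(chunk_times_s, chunk_tokens):
    """TTFT = arrival time of the first chunk; sustained tok/s = tokens
    after the first chunk divided by the time they took to arrive."""
    if not chunk_times_s:
        raise ValueError("no chunks received")
    ttft_ms = chunk_times_s[0] * 1000.0
    gen_s = chunk_times_s[-1] - chunk_times_s[0]
    gen_tokens = sum(chunk_tokens) - chunk_tokens[0]
    tok_per_s = gen_tokens / gen_s if gen_s > 0 else 0.0
    return ttft_ms, tok_per_s

# Synthetic example: first chunk at 130 ms, then ~50 tok/s for 2 s
ttft, rate = summarize_stream([0.13, 1.13, 2.13], [1, 50, 50])
```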

### Phase 3: Production validation (if Phase 2 passes)

1. Run Annie voice loop with Gemma 4 26B as LLM backend
2. Measure end-to-end voice latency (STT + LLM + TTS)
3. Test all creatures that use minotaur
4. A/B test conversation quality with Rajesh
5. Run tool calling regression suite

## 8. Recommendation (Updated 2026-04-04)

**Do NOT swap today. Wait for NVFP4.** Quality is excellent — matches or exceeds Nemotron Nano on every test. But VRAM (35 GB) and TTFT (836ms) are Ollama runtime limitations, not model limitations. The real comparison requires vLLM with NVFP4.

**What changed from pre-benchmark recommendation:**
- Vision confirmed WORKING and accurate — this alone justifies pursuing the swap
- Tool calling is BETTER than Nano (100% vs ~95%)
- Kannada is surprisingly good (was unknown before)
- JSON output quality needs lenient parsing (minor weakness)

**Next steps:**
1. **WAIT** for `nvidia/Gemma-4-26B-A4B-IT-NVFP4` checkpoint
2. **WATCH** vLLM Gemma 4 stability (PR #38826)
3. **When ready:** Run Phase 2 vLLM benchmark
4. **Decision point:** If VRAM < 20 GB and TTFT < 200ms under vLLM → swap Nemotron Nano for Gemma 4 26B-A4B

## 9. Phase 2 Results — Super Baseline (2026-04-05)

**First formal Nemotron Super benchmark on Beast (DGX Spark).**

Machine: Beast (DGX Spark GB10, 128 GB unified memory)
Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Runtime: vLLM 0.17.2 (custom spark-vllm-docker build)
Config: `--kv-cache-dtype fp8 -tp 1 --gpu-memory-utilization 0.75 --max-model-len 131072 --max-num-seqs 10 --enable-prefix-caching`
Thinking: Disabled via `chat_template_kwargs.enable_thinking: false`
Script: `scripts/benchmark_gemma4_vllm.py` v2.0.0
Results JSON: `scripts/results/benchmark_super_baseline.json`
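
The thinking toggle above rides in the request body: `chat_template_kwargs` is a vLLM extension to the OpenAI-compatible `/v1/chat/completions` endpoint. A minimal body builder, sketched under that assumption:

```python
def chat_body(model, messages, enable_thinking=False):
    """vLLM passes chat_template_kwargs through to the chat template,
    which is where Nemotron's enable_thinking switch lives."""
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

body = chat_body(
    "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    [{"role": "user", "content": "ನಮಸ್ಕಾರ!"}],
)
# With the openai SDK, the same field goes through extra_body:
# client.chat.completions.create(..., extra_body={
#     "chat_template_kwargs": {"enable_thinking": False}})
```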

### Performance

| Metric | Value | Community Reference |
|--------|-------|-------------------|
| **Throughput** | **16.0 tok/s** (0.0 stddev, 25 runs) | 14-19.5 tok/s |
| **TTFT** | **269.0 ms** (45.7 ms stddev, 5 runs) | ~200-400 ms |
| **VRAM** | **90.42 GB** | ~97 GB estimated |

Throughput is remarkably stable: all 25 runs (5 prompts × 5 runs) measured exactly 16.0 tok/s, which points to a hard hardware ceiling on DGX Spark's unified-memory bandwidth rather than software variance.

### Quality

| Test | Score | Detail |
|------|-------|--------|
| **Entity extraction** | 7/7 persons, 2/3 places | Missed Mysore (typed as topic, not place) |
| **Tool calling** | 7/8 structured (0.88 weighted) | Failed: "Tell me about Arun" → search_memory (expected get_entity_details) |
| **Vision** | **NOT SUPPORTED** (HTTP 400) | "not a multimodal model" — confirms the gap Gemma 4 fills |
| **Kannada entities** | 6/6 (100%) | Priya, Arun, Mom, Bangalore, Hubli, Diwali |
| **Kannada response** | YES (47 chars) | ನಮಸ್ಕಾರ! ನಾನು ಚೆನ್ನಾಗಿ ಇದ್ದೇನೆ. ನಿಮಗೆ ಹೇಗೆ ಸಹಾಯ ಮಾಡಬಹುದು? |

### Key Findings

1. **16.0 tok/s is the benchmark to beat** — rock-solid, no variance
2. **Super has NO vision** — HTTP 400 confirms this is a text-only model. Gemma 4's native vision is the differentiator.
3. **Tool calling good but not perfect** — 87.5% (7/8). The "Arun" failure is understandable (search_memory vs get_entity_details is ambiguous).
4. **VRAM is 90.42 GB** — leaves 37.6 GB headroom on 128 GB. A 31B model at ~20 GB could potentially coexist.
5. **Kannada: perfect** — 6/6 entities from mixed code-switched text, correct Kannada response.

### Gemma 4 31B Benchmark Results (2026-04-05)

Machine: Beast (DGX Spark GB10, 128 GB unified memory)
Model: nvidia/Gemma-4-31B-IT-NVFP4
Runtime: vLLM 0.18.2 (`vllm/vllm-openai:gemma4-cu130` — official Docker image)
Config: `--gpu-memory-utilization 0.40 --max-model-len 32768 --tool-call-parser hermes`
Thinking: Disabled via `chat_template_kwargs.enable_thinking: false`
Results JSON: `scripts/results/benchmark_gemma4_31b_vllm.json`

#### Performance

| Metric | Gemma 4 31B | Super | Delta |
|--------|------------|-------|-------|
| **Throughput** | 6.9 tok/s (0.0 stddev) | 16.0 tok/s | **-57%** |
| **TTFT** | 390.4 ms (112.3 stddev) | 269.0 ms | +45% slower |
| **VRAM** | 45.95 GB | 90.42 GB | **-49%** |

#### Quality

| Test | Gemma 4 31B | Super | Winner |
|------|------------|-------|--------|
| Entity extraction | **7/7 persons, 3/3 places** | 7/7 persons, 2/3 places | **Gemma 4** (found Mysore) |
| Tool calling | 0/8 (hermes parser) | 7/8 structured | **Super** |
| **Vision** | **5/5 colors, WORKING** | NOT SUPPORTED (HTTP 400) | **Gemma 4** |
| Kannada entities | 6/6 | 6/6 | Tie |
| Kannada response | YES (71 chars) | YES (47 chars) | Tie |

#### Key Findings

1. **Vision is Gemma 4's killer feature** — perfectly identified all 5 colors and their positions from a 10x10 PNG. Super cannot do this at all.
2. **Tool calling failed completely** — 0/8 with `hermes` parser. Gemma 4 produced text responses instead of structured `tool_calls`. Needs `gemma4` tool-call-parser or prompt engineering.
3. **2.3x slower than Super** — Dense 31B (all params active) vs MoE 120B (only 12B active). Expected on unified memory.
4. **Half the VRAM** — 46 GB vs 90 GB. Could potentially coexist with Super on 128 GB Beast (46+90=136, slight oversubscription but unified memory can handle it).
5. **Better entity extraction** — Found Mysore (Super missed it, classified as topic not place).

#### Decision (per framework)

| tok/s | VRAM | Quality vs Super | Vision | Decision |
|-------|------|-----------------|--------|----------|
| 6.9 (5-15 range) | 45.95 GB (≤30 GB: NO) | Mixed | **Works** | **WAIT — capture data** |

31B Dense is too slow for interactive use (6.9 tok/s) and VRAM exceeds the 30 GB budget for a complementary model. However, the vision quality is excellent and VRAM is half of Super's. Best role: **offline vision tasks** (photo analysis, document reading) where latency doesn't matter.

#### Deployment Solution

The `vllm/vllm-openai:gemma4-cu130` Docker image (ARM64 native) works out of the box on DGX Spark. No building from source needed. Previously attempted approaches (pre-built spark-vllm-docker wheels, source build) all failed due to PyTorch ABI mismatches. See `docs/RESEARCH-GEMMA4-VLLM-DGX-SPARK.md` for full research.

### Gemma 4 26B-A4B FP8 Benchmark Results (2026-04-05)

Machine: Titan (DGX Spark GB10, 128 GB unified memory)
Model: google/gemma-4-26B-A4B-it (BF16 weights, online FP8 quantization)
Runtime: vLLM 0.18.2rc1 (`vllm/vllm-openai:gemma4-cu130` — official Docker image)
Config: `--quantization fp8 --kv-cache-dtype fp8 --gpu-memory-utilization 0.50 --max-model-len 32768 --tool-call-parser pythonic`
Thinking: Disabled via `chat_template_kwargs.enable_thinking: false`
Results JSON: `scripts/results/benchmark_gemma4_26b_vllm.json`

#### Performance

| Metric | Gemma 4 26B FP8 | Nano (production) | Phase 1 (Ollama Q4) | Delta vs Nano |
|--------|-----------------|-------------------|---------------------|---------------|
| **Throughput** | **38.4 tok/s** (0.1 stddev) | 48-65 tok/s | 59.7 tok/s | **-20% to -41%** |
| **TTFT** | **111.3 ms** (5.9 stddev) | ~130 ms | 835.9 ms | **14% faster** |
| **VRAM** | **59.41 GB** | ~18 GB | 35.32 GB | **+230%** |

#### Quality

| Test | Gemma 4 26B FP8 | Nano | Phase 1 (Ollama) | Winner |
|------|-----------------|------|------------------|--------|
| Entity extraction | **7/7 persons, 3/3 places** | 7/7 | 7/7 + 3/3 | Tie |
| Tool calling | **0/8** (pythonic parser) | 7/8 | 8/8 (Ollama) | **Nano** |
| **Vision** | **5/5 colors, WORKING** | NOT SUPPORTED | 5/5 (Ollama) | **Gemma 4** |
| Kannada entities | 6/6 | N/A | 6/6 | Tie |
| Kannada response | YES (71 chars) | N/A | YES (80 chars) | Tie |

#### Key Findings

1. **38.4 tok/s is below Nano but usable** — 80% of Nano's 48 tok/s floor. Remarkably consistent (0.1 stddev across 25 runs). The online FP8 quantization from BF16 weights is less efficient than Nano's native NVFP4 checkpoint.

2. **TTFT is excellent at 111ms** — 14% faster than Nano, 87% faster than Ollama. CUDA graph compilation (28 min cold start) pays off at inference time.

3. **VRAM: 59.41 GB is a dealbreaker for coexistence** — Nano uses 18 GB, leaving room for Audio Pipeline (9.4 GB), embedding model, and Context Engine. The 26B at 59 GB would consume nearly half of Titan's 128 GB unified memory. The `gpu_memory_utilization=0.50` was set conservatively, but even at 0.30, model weights alone are ~27 GB.

4. **Tool calling: 0/8 with ALL parsers** — Tried `hermes` (31B, session 430) and `pythonic` (26B, this session). Both produce text responses, never structured `tool_calls`. Phase 1 Ollama got 8/8. This is a **vLLM-specific issue** — the model CAN do tool calling (Ollama proves it) but vLLM's tool parsers don't extract Gemma 4's output format.

5. **Vision confirmed again** — 5/5 color detection, accurate spatial layout. Consistent with Phase 1 and 31B results. Vision is the feature Nano fundamentally lacks.

6. **Cold start is painful** — 28 min for torch.compile CUDA graph compilation on first load. Subsequent starts would use the cached compilation artifacts. Not a production concern but impactful for benchmarking.

7. **Community numbers (45-57 tok/s) remain unmatched** — May require `--load-format fastsafetensors` (only available in eugr's build, not the official image) or higher `gpu_memory_utilization`. The FP8 online quantization overhead may also contribute.

#### Decision (per framework)

| Actual Quant | tok/s | VRAM | TTFT | Quality | Decision |
|-------------|-------|------|------|---------|----------|
| FP8 (online) | 38.4 (25-40 range) | 59.41 GB (> 35 GB) | 111ms (< 200ms) | Tools FAIL | **WAIT for NVFP4** |

**WAIT.** The 26B-A4B is promising — TTFT beats Nano, vision is excellent, Kannada is strong. But three blockers remain:

1. **No NVFP4 checkpoint** — online FP8 from BF16 wastes memory (59 GB vs expected ~15-18 GB with NVFP4)
2. **Tool calling broken in vLLM** — works perfectly in Ollama but no vLLM parser handles Gemma 4's format. Need vLLM `gemma4` parser fix (PR #38909) or custom parser.
3. **Throughput below Nano** — 38.4 vs 48-65 tok/s. NVFP4 + `fastsafetensors` might close this gap.

**When to re-evaluate:**
- `nvidia/Gemma-4-26B-A4B-IT-NVFP4` published on HuggingFace
- vLLM PR #38909 (gemma4 tool parser fix) merged + released
- eugr's spark-vllm-docker recipes updated with tool parser mod

## 10. NVFP4 Community Checkpoint Found (2026-04-05)

**Status: UNBLOCKED — community NVFP4 exists, needs verification on our hardware.**

[bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4](https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4) by Mario Iseli — first community NVFP4 quantization. 49 GB → 16.5 GB. Built and validated on DGX Spark with Anthropic AI engineering assistance.

### Why standard tools couldn't do it

Gemma 4 MoE fuses expert weights as 3D tensors `[128, dim, dim]` rather than `nn.ModuleList[nn.Linear]`. All existing tools (modelopt, llm-compressor, TensorRT-LLM) silently skip non-Linear parameters — which are **91% of the model**. Mario wrote a custom `_QuantGemma4TextExperts` modelopt plugin that unfuses → quantizes → renames keys for vLLM's FusedMoE format.
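
The unfuse-then-rename step can be sketched in a few lines. Key names here are illustrative and plain nested lists stand in for tensors; the real plugin hooks into modelopt's quantizer and emits vLLM's actual FusedMoE key layout.

```python
def unfuse_experts(state_dict, fused_suffix="experts.weight"):
    """Split a fused [num_experts, d_out, d_in] stack into per-expert 2-D
    weights so a per-Linear quantizer no longer silently skips them."""
    out = {}
    for key, tensor in state_dict.items():
        if key.endswith(fused_suffix):
            for i, expert_w in enumerate(tensor):   # one [d_out, d_in] slice each
                out[key.replace(fused_suffix, f"experts.{i}.weight")] = expert_w
        else:
            out[key] = tensor
    return out

fused = {"model.layers.0.mlp.experts.weight": [[[1, 2]], [[3, 4]]]}  # 2 toy experts
split = unfuse_experts(fused)
```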

### Community benchmark numbers (DGX Spark)

| Metric | BF16 | NVFP4 | Our FP8 | Nano (prod) |
|--------|------|-------|---------|-------------|
| **tok/s** | 23.3 | **48.2** | 38.4 | 48-65 |
| **TTFT** | 97ms | **53ms** | 111ms | ~130ms |
| **VRAM** | ~49 GB | **15.7 GB** | 59.4 GB | 18 GB |

Quality retention: **97.6%** average (GSM8K ~95%, IFEval ~98.5%).

### Requirements to run on our Titan

1. **Build `vllm-node-tf5`** via eugr's spark-vllm-docker (`./build-and-copy.sh --tf5`, ~3 min)
2. **Download NVFP4 checkpoint** (16.5 GB): `huggingface-cli download bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4`
3. **Patch gemma4.py** — mount `gemma4_patched.py` into container (fixes `expert_params_mapping` scale key suffixes, [vLLM #38912](https://github.com/vllm-project/vllm/issues/38912))
4. **Serve command:**
```bash
docker run -d --name vllm-gemma4-26b-nvfp4 \
  --gpus all --ipc=host -p 8004:8004 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/Gemma-4-26B-A4B-it-NVFP4:/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  vllm-node-tf5 \
  vllm serve /model \
    --served-model-name gemma-4-26b \
    --host 0.0.0.0 --port 8004 \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4
```

### Decision framework re-evaluation

| Actual Quant | tok/s | VRAM | TTFT | Quality | Decision |
|-------------|-------|------|------|---------|----------|
| NVFP4 (community) | 48.2 (≥40) | 15.7 GB (≤25 GB) | 53ms (≤200ms) | Tool calling reported working | **SWAP candidate — needs Phase 3 verification** |

**All thresholds PASS per the original framework.** This is the first Gemma 4 configuration that meets all swap criteria.

### Phase 3 plan (verification benchmark)

1. Build `vllm-node-tf5` on Titan (`spark-vllm-docker --tf5`)
2. Download NVFP4 checkpoint + patched gemma4.py
3. Stop Titan services, serve NVFP4 model
4. Run same benchmark script (`benchmark_gemma4_vllm.py`) — verify tok/s, TTFT, VRAM, tool calling, vision
5. If ALL PASS: proceed to Phase 4 (production integration test with Annie voice loop)

### W4A16 variant

[bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4A16](https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4A16) — 4-bit weights with 16-bit activations. Slightly better quality but less throughput improvement. Worth testing if W4A4 quality is insufficient for tool calling.

## 11. Phase 2 Complete — Cross-Model Comparison (2026-04-05)

| Metric | Nano (Titan) | Super (Beast) | 26B FP8 (Titan) | **26B NVFP4 (Titan)** | 31B NVFP4 (Beast) |
|--------|-------------|---------------|-----------------|----------------------|-------------------|
| **tok/s** | 48-65 | 16.0 | 38.4 | **50.4** | 6.9 |
| **TTFT** | ~130ms | 269ms | 111ms | **83.9ms** | 390ms |
| **VRAM (model)** | 18 GB | 90 GB | 59 GB | **15.7 GB** | 46 GB |
| **Vision** | NO | NO | **YES** | **YES** | **YES** |
| **Tools** | 7/8 | 7/8 | 0/8 | **7/8** | 0/8 |
| **Entities** | 7/7 | 7/7+2/3 | 7/7+3/3 | **7/7+3/3** | 7/7+3/3 |
| **Kannada** | N/A | 6/6 | 6/6 | **6/6** | 6/6 |

**Bottom line:** The 26B-A4B NVFP4 **matches or exceeds Nano on every metric** while adding vision and Kannada. ALL swap thresholds PASS — Phase 3 verification confirms community numbers. **Phase 4 (production integration test) is next.** The 31B Dense remains too slow for interactive use (6.9 tok/s) but tool calling failure was just a parser issue (fixable with `--tool-call-parser gemma4`).

## 12. Phase 3 Results — NVFP4 Verification on Titan (2026-04-05)

**Status: ALL THRESHOLDS PASS — SWAP CANDIDATE CONFIRMED.**

Machine: Titan (DGX Spark GB10, 128 GB unified memory)
Model: bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 (community NVFP4, 16.5 GB)
Runtime: vLLM 0.18.2rc1.dev73 (`vllm/vllm-openai:gemma4-cu130` — official Docker image)
Config: `--quantization modelopt --kv-cache-dtype fp8 --gpu-memory-utilization 0.85 --max-model-len 32768 --moe-backend marlin --tool-call-parser gemma4`
Patch: `gemma4_patched.py` mounted (fixes expert_params_mapping for NVFP4, vLLM #38912)
Thinking: Disabled via `chat_template_kwargs.enable_thinking: false`
Results JSON: `scripts/results/benchmark_gemma4_26b_nvfp4.json`

### Performance

| Metric | NVFP4 | FP8 (Phase 2) | Nano (prod) | Community ref | Delta vs Nano |
|--------|-------|---------------|-------------|---------------|---------------|
| **Throughput** | **50.4 tok/s** (0.2 stddev) | 38.4 | 48-65 | 48.2 | **Within range** |
| **TTFT** | **83.9 ms** (4.9 stddev) | 111.3 | ~130 | 53 | **35% faster** |
| **Model VRAM** | **15.74 GB** (vLLM log) | 59.4 | 18 | 15.7 | **12% smaller** |
| **Total GPU alloc** | 102.04 GB (nvidia-smi) | 59.4 | ~27 | N/A | At 0.85 util |

Throughput is remarkably stable: all 25 runs (5 prompts × 5 runs) ranged 50.0-50.7 tok/s. Exceeds community's 48.2 tok/s. Total GPU allocation is high because `gpu_memory_utilization=0.85` reserves memory for KV cache — production would use ~0.25 like Nano.
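
Back-of-envelope check on that allocation, assuming vLLM applies the utilization fraction to the full 128 GB of unified memory:

```python
total_gb = 128.0
budget_gb = 0.85 * total_gb            # ≈ 108.8 GB reserved at --gpu-memory-utilization 0.85
model_gb = 15.74                       # weights, from the vLLM log
kv_headroom_gb = budget_gb - model_gb  # ≈ 93 GB left for KV cache + overhead
measured_gb = 102.04                   # nvidia-smi reading, inside the 108.8 budget
```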

### Quality

| Test | NVFP4 | FP8 (Phase 2) | Nano | Winner |
|------|-------|---------------|------|--------|
| Entity extraction | **7/7 persons, 3/3 places** | 7/7+3/3 | 7/7 | Gemma 4 (places) |
| Tool calling | **7/8 structured (0.88)** | 0/8 | 7/8 | **Tie** |
| **Vision** | **5/5 colors, WORKING** | 5/5 | NOT SUPPORTED | **Gemma 4** |
| Kannada entities | 6/6 | 6/6 | N/A | Tie |
| Kannada response | YES (75 chars) | YES (71 chars) | N/A | Tie |

### Key Findings

1. **Tool calling FIXED: 0/8 → 7/8.** The `--tool-call-parser gemma4` was the missing piece. Phase 2's 0/8 was entirely due to using `hermes`/`pythonic` parsers which can't parse Gemma 4's custom `<|tool_call>call:func{...}<tool_call|>` format. See `docs/RESEARCH-GEMMA4-TOOL-CALLING-VLLM.md` for full root cause analysis.

2. **50.4 tok/s matches Nano's production range (48-65).** The MoE architecture (only 3.8B active of 25.2B total) makes NVFP4 extremely efficient on DGX Spark's bandwidth-limited unified memory.

3. **15.74 GB model memory is 2.3 GB SMALLER than Nano (18 GB).** The NVFP4 compression from 49 GB → 16.5 GB on disk, 15.74 GB in GPU, is remarkably efficient. This means Gemma 4 26B would actually free up VRAM headroom compared to Nano.

4. **83.9 ms TTFT beats everything** — 35% faster than Nano (~130ms), 25% faster than FP8 (111ms), though slower than community's 53ms (may need `fastsafetensors` loader). CUDA graph compilation was cached from the FP8 run, so cold start was ~4 min instead of ~28 min.

5. **Vision remains the killer differentiator** — 5/5 color detection with accurate spatial layout description. Nano has no vision at all. This enables new capabilities like photo analysis, document reading, visual context from Pixel camera.

6. **Entity extraction JSON has minor bugs** — Drops quote characters in `properties` keys (known from Phase 1). The benchmark's lenient parser handles this, and the Context Engine already uses lenient JSON parsing. Not a blocker.

7. **Tool calling failure analysis: the Priya query** — Called `get_entity_details` instead of `search_memory` for "What did Priya say about the product launch?" This is the same ambiguity Super encounters (also 7/8): the distinction between "look up a person" and "search for what someone said" is genuinely hard to draw.

8. **`gemma4-cu130` image has everything needed** — No need to build `vllm-node-tf5`. The official image already has transformers 5.5.0, gemma4 tool parser (PR #38826), and modelopt quantization. Only the `gemma4_patched.py` mount is needed for NVFP4 expert weight loading.

### Decision (per framework)

| Metric | Result | Threshold | Status |
|--------|--------|-----------|--------|
| tok/s | 50.4 | ≥40 | **PASS** |
| Model VRAM | 15.74 GB | ≤25 GB | **PASS** |
| TTFT | 83.9 ms | ≤200 ms | **PASS** |
| Tools | 7/8 (0.88) | ≥6/8 | **PASS** |
| Vision | 5/5 | Working | **PASS** |

**ALL THRESHOLDS PASS. Decision: SWAP Nano for Gemma 4 26B-A4B NVFP4.**

### Phase 4: Production Integration Test (next)

Before production swap:
1. Run Annie voice loop with Gemma 4 as LLM backend
2. Measure end-to-end voice latency (STT + LLM + TTS)
3. Test all creatures that use minotaur
4. A/B test conversation quality with Rajesh
5. Run tool calling regression suite
6. Update `docs/RESOURCE-REGISTRY.md` with new VRAM budget
7. Create production serve recipe for spark-vllm-docker

### Serve Configuration for Production

```bash
# On Titan — using existing gemma4-cu130 image (NO build needed)
docker run -d --name vllm-gemma4 \
  --gpus all --ipc=host --shm-size 64gb -p 8003:8003 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v ~/.cache/huggingface/hub/Gemma-4-26B-A4B-it-NVFP4:/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  vllm/vllm-openai:gemma4-cu130 \
  /model \
    --served-model-name gemma-4-26b \
    --host 0.0.0.0 --port 8003 \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.25 \
    --max-model-len 32768 \
    --max-num-seqs 8 \
    --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --enable-prefix-caching
```

Key differences from benchmark → production:
- `gpu_memory_utilization: 0.85 → 0.25` (match Nano's footprint)
- `port: 8004 → 8003` (replace Nano's port)
- Added `--max-num-seqs 8` and `--enable-prefix-caching`

## 13. 31B Tool Calling Research (2026-04-05)

**Status: RESOLVED — wrong parser was the only issue. 31B remains too slow for interactive use.**

Full findings: `docs/RESEARCH-GEMMA4-TOOL-CALLING-VLLM.md`

### Root Cause of 0/8 Tool Calling

Both 26B and 31B got 0/8 in Phase 2 because we used the wrong tool parsers:
- 31B: `--tool-call-parser hermes` → expects JSON, Gemma 4 emits `<|tool_call>call:func{...}`
- 26B: `--tool-call-parser pythonic` → expects `func(args)`, same mismatch
- Ollama: built-in template → correctly parses native format → 8/8

Fix: `--tool-call-parser gemma4` (confirmed working in Phase 3: 7/8).

### 31B Verdict

| Question | Answer |
|----------|--------|
| Would `gemma4` parser fix 31B tools? | **YES** — PR #38847 was tested with 31B-IT-NVFP4 on DGX Spark |
| Faster 31B checkpoint? | **NO** — 6.9 tok/s is hardware bandwidth-limited (31B all-active) |
| 31B + Super on Beast? | **NOT FEASIBLE** — 141 GB needed, 128 GB available, system zombies on overcommit |

The 31B Dense is best as an **offline vision processor** (where latency doesn't matter), not for interactive use. The 26B MoE at NVFP4 is the right model for Titan.

## Sources

- [Google Blog — Gemma 4 announcement](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/)
- [NVIDIA Blog — Gemma 4 edge/on-device](https://developer.nvidia.com/blog/bringing-ai-closer-to-the-edge-and-on-device-with-gemma-4/)
- [NVIDIA Blog — RTX AI Garage Gemma 4](https://blogs.nvidia.com/blog/rtx-ai-garage-open-models-google-gemma-4/)
- [NVIDIA Forums — Day-1 DGX Spark Benchmarks](https://forums.developer.nvidia.com/t/gemma-4-day-1-inference-on-nvidia-dgx-spark-preliminary-benchmarks/365503)
- [NVIDIA Forums — 26B-A4B 45-60 tok/s on DGX Spark](https://forums.developer.nvidia.com/t/someone-post-this-gemma-4-26b-a4b-moe-running-at-45-60-tok-s-on-dgx-spark/365547)
- [NVIDIA Forums — vLLM version / PRs for Gemma 4](https://forums.developer.nvidia.com/t/gemma-4-models-which-vllm-version-any-prs-spotted/365490)
- [vLLM — Gemma 4 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html)
- [GitHub — vllm-gb10-gemma4 community patches](https://github.com/atcuality2021/vllm-gb10-gemma4)
- [HuggingFace — nvidia/Gemma-4-31B-IT-NVFP4](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4)
- [HuggingFace — unsloth/gemma-4-26B-A4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF)
- [HuggingFace — google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)
- [Ollama — gemma4](https://ollama.com/library/gemma4)
- [Hugging Face Blog — Welcome Gemma 4](https://huggingface.co/blog/gemma4)
- [AI News Silo — Gemma 4 models explained](https://ainewssilo.com/articles/gemma-4-models-explained-hardware-benchmarks)
- [WaveSpeedAI — Gemma 4 architecture](https://wavespeed.ai/blog/posts/what-is-google-gemma-4/)
- [Latent Space — Gemma 4 analysis](https://www.latent.space/p/ainews-gemma-4-the-best-small-multimodal)
- [vLLM Issue #36821 — sm_121 Blackwell aarch64 bug](https://github.com/vllm-project/vllm/issues/36821)
- [vLLM Issue #38887 — Gemma 4 E4B slow on Triton fallback](https://github.com/vllm-project/vllm/issues/38887)
- [HuggingFace — bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4](https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4) — Community NVFP4 quantization (49 GB → 16.5 GB)
- [HuggingFace — bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4A16](https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4A16) — W4A16 variant
- [HuggingFace Discussion — First community NVFP4 quantization](https://huggingface.co/google/gemma-4-26B-A4B-it/discussions/7)
- [vLLM Issue #38912 — expert_params_mapping scale key bug](https://github.com/vllm-project/vllm/issues/38912)
