# Gemma 4 26B-A4B vs Nemotron 3 Super 120B-A12B — Head-to-Head Comparison

> **Date:** 2026-04-06 (updated with published benchmarks)
> **Source data:** `docs/RESEARCH-GEMMA4-BENCHMARK.md` (Phases 1-3) + `docs/RESEARCH-NEMOTRON3-SUPER.md` + official model cards
> **Both benchmarked on DGX Spark GB202 (128 GB unified memory), same script (`benchmark_gemma4_vllm.py`)**
> **Published benchmarks:** [HuggingFace model card](https://huggingface.co/google/gemma-4-26B-A4B-it), [Google DeepMind](https://deepmind.google/models/gemma/gemma-4/), [Google Blog](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/)

---

## 1. Model Architecture

| Property | Gemma 4 26B-A4B | Nemotron 3 Super 120B-A12B |
|----------|-----------------|---------------------------|
| **Total parameters** | 25.2B | 120B |
| **Active parameters** | 3.8B (15%) | 12B (10%) |
| **Architecture** | MoE (128 experts, ~6 active) | Hybrid Mamba-2 + MoE (512 experts, 22 active) + LatentMoE |
| **Special features** | Native vision (image + video) | Multi-Token Prediction (MTP), LatentMoE, O(n) Mamba-2 layers |
| **Modality** | Text + Vision | Text only |
| **Context window** | 128K | 1M (131K configured) |
| **Quantization** | NVFP4 (community, post-training) | NVFP4 (native — trained in FP4, lossless) |
| **Developer** | Google DeepMind | NVIDIA |

---

## 2. Inference Performance (DGX Spark, Measured)

| Metric | Gemma 4 26B NVFP4 | Nemotron Super NVFP4 | Delta | Winner |
|--------|-------------------|---------------------|-------|--------|
| **Throughput** | **50.4 tok/s** (±0.2) | **16.0 tok/s** (±0.0) | **3.15x faster** | 🏆 Gemma 4 |
| **TTFT** | **83.9 ms** (±4.9) | **269.0 ms** (±45.7) | **3.2x faster** | 🏆 Gemma 4 |
| **Model VRAM** | **15.74 GB** | **90.42 GB** | **5.7x smaller** | 🏆 Gemma 4 |
| **VRAM headroom** (of 128 GB) | ~112 GB free | ~37.6 GB free | 3x more room | 🏆 Gemma 4 |
| **Throughput stability** | 50.0-50.7 (25 runs) | 16.0 exactly (25 runs) | Both rock-solid | Tie |
| **Cold start** | ~4 min (cached CUDA graphs) | ~2-3 min | Comparable | Tie |
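
The speed and size deltas above follow directly from the raw measurements; a quick sketch reproducing them (the dict names are illustrative, values copied from this report):

```python
# Reproduce the deltas in the table from the measured values.
gemma = {"tok_s": 50.4, "ttft_ms": 83.9, "vram_gb": 15.74}
super_ = {"tok_s": 16.0, "ttft_ms": 269.0, "vram_gb": 90.42}

throughput_x = gemma["tok_s"] / super_["tok_s"]    # ~3.15x
ttft_x = super_["ttft_ms"] / gemma["ttft_ms"]      # ~3.2x
vram_x = super_["vram_gb"] / gemma["vram_gb"]      # ~5.7x

# Headroom on a 128 GB DGX Spark.
headroom = {name: 128.0 - d["vram_gb"]
            for name, d in (("gemma", gemma), ("super", super_))}

print(f"throughput {throughput_x:.2f}x, ttft {ttft_x:.1f}x, vram {vram_x:.1f}x")
print(headroom)
```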

### Machine Assignments

| Model | Machine | Runtime | Config |
|-------|---------|---------|--------|
| Gemma 4 26B | **Titan** | vLLM 0.18.2 (`gemma4-cu130`) | `--quantization modelopt --kv-cache-dtype fp8 --gpu-memory-utilization 0.85 --tool-call-parser gemma4` |
| Nemotron Super | **Beast** | vLLM 0.17.2 (spark-vllm-docker) | `--kv-cache-dtype fp8 --gpu-memory-utilization 0.75 --max-model-len 131072 --reasoning-parser super_v3` |
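
For reference, the Config column maps onto vLLM launch commands; a minimal sketch, assuming the standard `vllm serve` entrypoint (the model repo IDs are assumptions, and the NVFP4 Super repo name in particular is not confirmed by this report):

```python
# Build the vLLM launch argv for each model from the Config column above.
import shlex

CONFIGS = {
    "gemma4": (
        "google/gemma-4-26B-A4B-it",  # assumed repo ID
        "--quantization modelopt --kv-cache-dtype fp8 "
        "--gpu-memory-utilization 0.85 --tool-call-parser gemma4",
    ),
    "super": (
        "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",  # assumed repo ID
        "--kv-cache-dtype fp8 --gpu-memory-utilization 0.75 "
        "--max-model-len 131072 --reasoning-parser super_v3",
    ),
}

def launch_argv(name: str) -> list[str]:
    """Return the full `vllm serve ...` command for one model."""
    model, flags = CONFIGS[name]
    return ["vllm", "serve", model, *shlex.split(flags)]

print(launch_argv("gemma4"))
print(launch_argv("super"))
```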

---

## 3. Quality Benchmarks

### Functional Tests (Same Benchmark Suite)

| Test | Gemma 4 26B NVFP4 | Super 120B NVFP4 | Winner |
|------|-------------------|------------------|--------|
| **Entity extraction (persons)** | 7/7 (100%) | 7/7 (100%) | Tie |
| **Entity extraction (places)** | 3/3 (100%) | 2/3 (67%) | 🏆 Gemma 4 |
| **Tool calling** | 7/8 (87.5%) | 7/8 (87.5%) | Tie |
| **Vision (color detection)** | **5/5 (100%)** | **NOT SUPPORTED** | 🏆 Gemma 4 |
| **Kannada entities** | 6/6 (100%) | 6/6 (100%) | Tie |
| **Kannada response** | YES (75 chars) | YES (47 chars) | Tie |

### Published Academic Benchmarks — Head to Head

#### Reasoning & Knowledge

| Benchmark | Gemma 4 26B | Super 120B | Delta | Winner |
|-----------|-------------|-----------|-------|--------|
| **MMLU-Pro** | 82.6% | **83.73%** | -1.13 pts | 🏆 Super (barely) |
| **GPQA Diamond** | **82.3%** | 79.23% | **+3.07 pts** | 🏆 **Gemma 4** |
| **GPQA (with tools)** | N/P | 82.70% | — | 🏆 Super |
| **BigBench Extra Hard** | 64.8% | N/P | — | — |
| **MMMLU (multilingual)** | 86.3% | N/P | — | — |
| **HLE (no tools)** | 8.7% | N/P | — | — |
| **HLE (with search)** | 17.2% | N/P | — | — |

#### Coding

| Benchmark | Gemma 4 26B | Super 120B | Delta | Winner |
|-----------|-------------|-----------|-------|--------|
| **LiveCodeBench v6** | 77.1% | **81.19%** | -4.09 pts | 🏆 Super |
| **SWE-Bench Verified** | N/P | **60.47%** | — | 🏆 Super |
| **SWE-Bench Multilingual** | N/P | 45.78% | — | 🏆 Super |
| **Codeforces ELO** | 1718 | N/P | — | — |

#### Math

| Benchmark | Gemma 4 26B | Super 120B | Delta | Winner |
|-----------|-------------|-----------|-------|--------|
| **AIME 2026** | **88.3%** | N/P | — | — |
| **MATH-Vision** | 82.4% | N/P | — | — |

#### Agentic & Tool Calling

| Benchmark | Gemma 4 26B | Super 120B | Delta | Winner |
|-----------|-------------|-----------|-------|--------|
| **τ2-bench Retail** | **85.5%** | 62.83% | **+22.67 pts** | 🏆 **Gemma 4** |
| **τ2-bench Average** | **68.2%** | 61.15% | **+7.05 pts** | 🏆 **Gemma 4** |
| **τ2-bench Airline** | N/P | 56.25% | — | — |
| **τ2-bench Telecom** | N/P | 64.36% | — | — |
| **PinchBench (agentic)** | N/P | **85.6%** | — | 🏆 Super |
| **~~Tool calling "97%"~~** | — | ~~97.0%~~ | — | ~~Not from any official benchmark~~ |
| **RULER (1M context)** | N/P | **91.75%** | — | 🏆 Super |

#### Vision (Gemma 4 exclusive — Super has NO vision)

| Benchmark | Gemma 4 26B | Super 120B |
|-----------|-------------|-----------|
| **MMMU Pro** | 73.8% | ❌ N/A |
| **MATH-Vision** | 82.4% | ❌ N/A |
| **OmniDocBench 1.5** | 0.149 (edit dist, lower=better) | ❌ N/A |
| **MedXPertQA MM** | 58.1% | ❌ N/A |
| **Our vision test** | 5/5 colors | ❌ HTTP 400 |

#### Long Context

| Benchmark | Gemma 4 26B | Super 120B | Winner |
|-----------|-------------|-----------|--------|
| **MRCR v2 128K** | 44.1% | N/P | — |
| **RULER (1M)** | N/P | 91.75% | 🏆 Super |
| **Max context** | 128K | 1M | 🏆 Super |

#### Arena

| Benchmark | Gemma 4 26B | Super 120B | Winner |
|-----------|-------------|-----------|--------|
| **LMArena (text)** | 1441 (#6) | N/P | — |

*N/P = not published by the vendor. "—" = no head-to-head possible (one side unpublished). "N/A" = benchmark not applicable to that model.*

### Key Findings from Published Benchmarks

1. **MMLU-Pro gap is tiny** — 82.6% vs 83.73% (only 1.13 pts). The earlier estimate of 4-6 pts was wrong. Gemma 4's 3.8B active params punch far above their weight.
2. **Gemma 4 WINS on GPQA Diamond** — 82.3% vs 79.23%. A 26B MoE beating a 120B hybrid on graduate-level science is remarkable efficiency.
3. **τ2-bench (apples-to-apples): Gemma 4 CRUSHES Super** — Retail: 85.5% vs 62.83% (+22.7 pts). Average: 68.2% vs 61.15% (+7.1 pts). This is the only standardized agentic benchmark covering both models, and Gemma 4 wins decisively.
4. **Super's "97% tool calling accuracy" is unverifiable** — Does NOT appear on the [HuggingFace model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16), the [NVIDIA blog](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/), or any standard benchmark. Likely from internal NemoClaw evaluation or marketing material.
5. **SWE-Bench remains Super's strongest argument** — 60.47% (Gemma 4 not published). But SWE-Bench measures multi-file code changes, not the tool-calling/ordering tasks Annie actually does.
6. **Vision is Gemma 4's exclusive moat** — 73.8% MMMU Pro, 82.4% MATH-Vision. Super has zero vision capability.

---

## 4. Voice Suitability Analysis

**Voice latency budget:** TTFT < 700ms, decode > 25 tok/s

| Criterion | Gemma 4 26B | Super 120B | Verdict |
|-----------|-------------|-----------|---------|
| TTFT | 83.9 ms ✅ | 269 ms ✅ | Both pass |
| Decode speed | 50.4 tok/s ✅ | 16.0 tok/s ❌ | **Gemma 4 only** |
| Short response (~20 tokens) | ~0.5s | ~1.6s | 3x faster on Gemma 4 |
| Feels natural? | Yes — fluid | No — sluggish pauses | Gemma 4 wins |

**Verdict:** Super is **below the minimum decode threshold** for real-time voice. Gemma 4 is the only viable voice model.
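
The wall-time figures in the table fall out of TTFT plus decode time; a quick check (the ~20-token utterance length is an assumption for a short spoken reply, the other values are this report's measurements):

```python
# End-to-end wall time for a spoken reply = TTFT + tokens / decode rate.
def response_s(ttft_ms: float, tok_per_s: float, n_tokens: int = 20) -> float:
    return ttft_ms / 1000.0 + n_tokens / tok_per_s

gemma = response_s(83.9, 50.4)   # ~0.48 s
nemo = response_s(269.0, 16.0)   # ~1.52 s
print(f"Gemma 4: {gemma:.2f}s, Super: {nemo:.2f}s")
```

At these rates Gemma 4 stays comfortably inside the voice budget while Super spends most of its time in decode, which is why raising TTFT alone would not fix it.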

---

## 5. Background Agent Suitability Analysis

**Agent requirements:** High tool accuracy, complex reasoning, no latency constraint

| Criterion | Gemma 4 26B | Super 120B | Verdict |
|-----------|-------------|-----------|---------|
| τ2-bench Retail (official) | **85.5%** | 62.83% | 🏆 **Gemma 4 (+22.7 pts)** |
| τ2-bench Average (official) | **68.2%** | 61.15% | 🏆 **Gemma 4 (+7.1 pts)** |
| PinchBench (agentic) | N/P | 85.6% | 🏆 Super |
| SWE-Bench (multi-step) | N/P (likely weaker) | 60.47% | 🏆 Super |
| LiveCodeBench v6 | 77.1% | 81.19% | 🏆 Super (-4 pts) |
| Multi-Token Prediction | No | Yes (3.45 avg acceptance) | 🏆 Super |
| Context window | 128K | 1M | 🏆 Super |
| Goal maintenance (RULER) | N/P | 91.75% at 1M | 🏆 Super |
| Latency acceptable? | N/A (not needed) | 16 tok/s fine for background | Both pass |

**Verdict (REVISED):** Gemma 4 **wins on agentic tool use** (τ2-bench: +22.7 pts Retail, +7.1 pts average). Super's advantage narrows to SWE-Bench (multi-file code changes), 1M context, and thinking mode. For Annie's actual workloads (ordering, research, extraction), Gemma 4 is the stronger agentic model.

---

## 6. Unique Capabilities

### Gemma 4 Has, Super Doesn't

| Capability | Impact |
|-----------|--------|
| **Native vision** (image + video) | Photo analysis, document reading, Pixel camera context, screen reading |
| **3.15x throughput** | Real-time voice, streaming TTS, responsive conversation |
| **5.7x smaller VRAM** | Room for audio pipeline, embeddings, STT, TTS on same machine |

### Super Has, Gemma 4 Doesn't

| Capability | Impact |
|-----------|--------|
| **Multi-Token Prediction** | 2-3x speedup on structured output (tool calls, JSON) |
| **1M context window** | Process entire conversation histories, long documents |
| **LatentMoE (512 experts)** | 4x expert diversity for nuanced analysis |
| **Mamba-2 layers** | O(n) scaling — no degradation with long contexts |
| **PinchBench-optimized** | Agent patterns trained via trajectory RL, not isolated completions |
| **Full thinking mode** | Deep reasoning chains for complex analysis |
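
The MTP claim can be sanity-checked with back-of-envelope math: if each verification forward pass emits `accept` tokens on average (3.45 per this report) and the MTP heads add an `overhead` cost factor per pass, decode speedup is roughly `accept / overhead`. The overhead values below are assumptions, chosen only to show the claimed 2-3x range is consistent with a 3.45 acceptance length:

```python
# Rough MTP decode-speedup model: tokens accepted per verification pass,
# divided by the per-pass cost overhead of running the extra MTP heads.
def mtp_speedup(accept: float, overhead: float) -> float:
    return accept / overhead

for ov in (1.15, 1.4, 1.7):  # assumed overhead factors
    print(f"overhead {ov:.2f}x -> ~{mtp_speedup(3.45, ov):.1f}x decode speedup")
```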

---

## 7. Cost & Resource Summary

| Dimension | Gemma 4 26B | Super 120B |
|-----------|-------------|-----------|
| VRAM required | 15.74 GB | 90.42 GB |
| Machine | Titan (shared with audio stack) | Beast (dedicated) |
| Headroom | 112 GB free for other services | 37.6 GB free (tight) |
| Can coexist with audio pipeline? | ✅ Yes (total ~39 GB) | ❌ No (would need 106+ GB on Titan) |
| Network latency (cross-machine) | Local | < 1ms (ConnectX-7 link) |

---

## 8. Why Two Models, Not One?

Neither model alone can serve all of Annie's needs:

| If only Gemma 4 | If only Super |
|-----------------|--------------|
| ✅ Voice is fast and fluid | ❌ Voice is sluggish (16 tok/s) |
| ❌ Agent tasks: no SWE-Bench-class multi-file reasoning | ✅ Agent tasks: 60.47% SWE-Bench Verified |
| ❌ No deep thinking mode | ✅ Full reasoning chains |
| ✅ Vision for photo/document analysis | ❌ No vision at all |
| ✅ Fits alongside entire audio stack | ❌ Consumes 90 GB alone |

**The two-model architecture is optimal:** Gemma 4 on Titan handles latency-critical voice + vision, Super on Beast handles quality-critical background agents. The ConnectX-7 link between machines (< 1ms) makes cross-machine routing transparent.
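
The routing rule this implies is simple enough to sketch; a minimal version, where the hostnames, ports, and task names are hypothetical:

```python
# Route requests between the two machines: latency-critical or visual
# work goes to Gemma 4 on Titan, background agent work to Super on Beast.
ENDPOINTS = {
    "titan": "http://titan:8000/v1",  # Gemma 4 26B (voice + vision)
    "beast": "http://beast:8000/v1",  # Nemotron Super (background agents)
}

def route(task: str, has_image: bool = False) -> str:
    # Interactive or visual requests need the fast, vision-capable model.
    if has_image or task in {"voice", "chat"}:
        return ENDPOINTS["titan"]
    # Long-context, deep-reasoning background jobs go to Super.
    return ENDPOINTS["beast"]

for task in ("voice", "research"):
    print(task, "->", route(task))
```

Because the cross-machine hop is sub-millisecond, the routing decision itself is the only meaningful cost, and it is a constant-time lookup.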

---

## 9. Head-to-Head Summary Card (updated with published benchmarks)

```
GEMMA 4 26B-A4B  vs  NEMOTRON 3 SUPER 120B-A12B  (updated with verified benchmarks)
====================================================================================

Dimension        Gemma 4 26B         Super 120B           Winner
-----------      ---------------     ---------------      ------
SPEED            50.4 tok/s          16.0 tok/s           Gemma 4 (3.15x)
TTFT             83.9 ms             269 ms               Gemma 4 (3.2x)
VRAM             15.7 GB             90.4 GB              Gemma 4 (5.7x)
MMLU-PRO         82.6%               83.7%                ~Tie (-1.1 pts)
GPQA DIAMOND     82.3%               79.2%                Gemma 4 (+3.1)
LIVECODE v6      77.1%               81.2%                Super (-4.1 pts)
SWE-BENCH        N/P                 60.5%                Super
t2 RETAIL        85.5%               62.8%                Gemma 4 (+22.7) ***
t2 AVERAGE       68.2%               61.2%                Gemma 4 (+7.1)  ***
PINCHBENCH       N/P                 85.6%                Super
VISION           73.8% MMMU Pro      N/A (none)           Gemma 4
CONTEXT          128K                1M                   Super
AIME 2026        88.3%               N/P                  Gemma 4
MATH-VISION      82.4%               N/A                  Gemma 4
CODEFORCES       1718 ELO            N/P                  Gemma 4
LM ARENA         1441 (#6)           N/P                  Gemma 4

*** = Only apples-to-apples agentic benchmark covering both models

ROLE:            Voice + Vision      Background Agents (under review)
MACHINE:         Titan               Beast (may be decommissioned)
VERDICT:         Speed, vision,      SWE-Bench, thinking,
                 agentic tool use    long context
```
