# Next Session: Gemma 4 vLLM Benchmark — BOTH COMPLETE (26B + 31B), WAIT for NVFP4

## What
Benchmark two Gemma 4 models on their respective machines, each against the current production LLM:

1. **Benchmark A** — Gemma 4 26B-A4B MoE (FP8) on **Titan** vs Nemotron Nano (48-65 tok/s, 18 GB, 130ms TTFT)
2. **Benchmark B** — Gemma 4 31B Dense (`nvidia/Gemma-4-31B-IT-NVFP4`) on **Beast** vs Nemotron Super (community: 14-19.5 tok/s, ~97 GB, 97% tool accuracy)

Phase 1 (Ollama) proved quality — 100% pass on tools, vision, entities, Kannada. This session gets real vLLM numbers. The 31B is evaluated as a *complementary vision model*, not a full Super replacement.

**Bonus:** Before stopping Super, run the same benchmark against it to capture our own measured Super baseline. We have NO formal benchmark for Super on Beast — only community numbers.

## Plan
`~/.claude/plans/effervescent-mapping-island.md`

**Read the plan first — 34 adversarial findings (all addressed), full recipes, decision frameworks.**

## Key Design Decisions (from adversarial reviews)

1. **Benchmark B runs on Beast** — where Super lives. Direct hardware comparison. Captures our own Super baseline BEFORE swapping to 31B.

2. **Super baseline first** — While Super is still running on Beast:8003, run the same benchmark script. This gives our own measured tok/s, VRAM, TTFT, quality scores. Then stop Super, serve 31B, run again. Same machine, same script = a direct like-for-like comparison.

3. **`--quantization fp8` is REQUIRED for Recipe A (26B)** — `--kv-cache-dtype fp8` alone leaves weights as BF16 (~86 GB → OOM).

4. **`nvidia/Gemma-4-31B-IT-NVFP4` for Recipe B** — official NVIDIA pre-quantized checkpoint. No `--quantization` flag needed.

5. **`vllm-node` likely v0.17.2** — Gemma 4 needs ≥0.19.0. Step 0 checks version on BOTH machines. Build separate image if needed.

6. **`gpu_memory_utilization` conservative** — 0.50 (Titan, 26B), 0.40 (Beast, 31B). DGX Spark unified memory = CPU+GPU share 128 GB.

7. **Thinking disable auto-detected** — Gemma 4 may use different field than Nemotron. Script tries `enable_thinking`, `thinking`, `think`. CLI `--thinking-field` override.

8. **Timeout 300 s for 26B, 600 s for 31B** — the 31B Dense at ~6.9 tok/s needs ~297 s for a 2048-token entity extraction.

9. **Tool call parser** — Verify `gemma4` parser exists in vLLM build. Fallback to `hermes`.

10. **Nothing else runs on Beast during Benchmark B** — stop ALL Beast services (including whisper.cpp). Clean GPU.
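Decision 7's auto-detection can be sketched as a small probe: try each candidate field in order and keep the first one the server accepts. The payload shape, the `chat_template_kwargs` placement, and the `detect_thinking_field` helper below are illustrative assumptions, not the actual script:

```python
# Sketch of the thinking-field auto-detection from decision 7.
# `send(payload)` stands in for POSTing to /v1/chat/completions;
# the real script, payload shape, and field placement may differ.

CANDIDATE_FIELDS = ["enable_thinking", "thinking", "think"]

def detect_thinking_field(send, candidates=CANDIDATE_FIELDS):
    """Return the first field name the server accepts, else None."""
    for field in candidates:
        payload = {
            "model": "gemma-4",  # placeholder model name
            "messages": [{"role": "user", "content": "ping"}],
            "chat_template_kwargs": {field: False},  # assumed disable knob
        }
        try:
            send(payload)  # assumed to raise on HTTP 4xx in this sketch
            return field
        except ValueError:
            continue
    return None
```

The `--thinking-field` CLI override would simply skip this probe and use the given name directly.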

## Files to Create/Modify

1. `scripts/benchmark_gemma4_vllm.py` — **CREATE**: Single script for all benchmarks (26B, 31B, Super baseline), parameterized via CLI
2. `~/spark-vllm-docker/recipes/gemma-4-26b-a4b-fp8.yaml` on Titan — **CREATE**: 26B FP8 recipe
3. `~/spark-vllm-docker/recipes/gemma-4-31b-nvfp4.yaml` on Beast — **CREATE**: 31B NVFP4 recipe
4. `scripts/results/benchmark_gemma4_26b_vllm.json` — Auto-generated (Benchmark A)
5. `scripts/results/benchmark_super_baseline.json` — Auto-generated (Super baseline)
6. `scripts/results/benchmark_gemma4_31b_vllm.json` — Auto-generated (Benchmark B)
7. `docs/RESEARCH-GEMMA4-BENCHMARK.md` — **EDIT**: Add Phase 2 section
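Since one script covers all three runs (26B, 31B, Super baseline), the CLI parameterization might look like the sketch below. Apart from `--thinking-field` (which the plan names) and the timeout values, every flag name here is an assumption:

```python
import argparse

# Hypothetical CLI for scripts/benchmark_gemma4_vllm.py — one script,
# three runs, differing only in parameters. Flag names are illustrative.
def build_parser():
    p = argparse.ArgumentParser(description="Gemma 4 / Super vLLM benchmark")
    p.add_argument("--base-url", required=True,
                   help="OpenAI-compatible endpoint, e.g. http://beast:8003/v1")
    p.add_argument("--model", required=True, help="served model name")
    p.add_argument("--timeout", type=int, default=300,
                   help="per-request timeout in seconds (600 for the 31B)")
    p.add_argument("--thinking-field", default=None,
                   help="override the thinking-disable field (skips auto-detect)")
    p.add_argument("--output", required=True,
                   help="results JSON path under scripts/results/")
    return p
```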

## Execution Order

```
1.  Step 0     → Pre-flight checks (vLLM version, HF tokens, disk) on BOTH machines
2.  Step 0.5   → Build vLLM image if needed (on whichever machine is < 0.19.0)
3.  Step 1     → Create benchmark script
4.  Step 2     → Create recipes (26B on Titan, 31B on Beast)
5.  Step 3     → Deploy script (git push → git pull on Titan + Beast, clear __pycache__)
6.  Step A1-A4 → Benchmark A: stop Titan → serve 26B → benchmark → stop 26B → restart Titan
7.  Step B0    → Capture Super baseline (while Super is still running on Beast:8003)
8.  Step B1    → Stop ALL Beast services (nothing else running)
9.  Step B2-B3 → Benchmark B: serve 31B → benchmark → stop 31B
10. Step B4    → Restart Beast Super (15-20 min)
11. Step 4     → Commit results, update docs
```
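Step 0's version gate can be a plain tuple comparison once the installed version string is in hand (e.g. from `docker exec vllm-node python -c "import vllm; print(vllm.__version__)"` — container name assumed). This sketch assumes simple `X.Y.Z` strings with no pre-release suffixes:

```python
# Sketch of the Step 0 vLLM version gate, run against the version
# string reported inside each machine's vLLM container.
MIN_VLLM = (0, 19, 0)  # Gemma 4 support threshold from the plan

def version_tuple(v: str) -> tuple:
    return tuple(int(part) for part in v.split(".")[:3])

def vllm_ok(installed: str) -> bool:
    return version_tuple(installed) >= MIN_VLLM
```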

## Start Command

```
cat ~/.claude/plans/effervescent-mapping-island.md
```

Then implement. Start with Step 0.

## Super Baseline (known community numbers for reference)

| Metric | Community DGX Spark | Source |
|--------|-------------------|--------|
| Decode speed | 14-19.5 tok/s | `docs/RESEARCH-NEMOTRON3-SUPER.md` |
| TTFT | ~200-400ms (estimated) | Community reports |
| VRAM | ~80 GB model + 17 GB KV | `docs/RESOURCE-REGISTRY.md` |
| Tool accuracy | 97% | SWE-Bench verified |
| SWE-Bench | 60.47% | NVIDIA model card |
| PinchBench | 85.6% | NVIDIA model card |

**Our benchmark will give us OUR numbers on OUR hardware — first formal Super baseline.**
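The decode speeds above translate directly into expected wall-clock for the long tests, which is where the 600 s timeout for the 31B comes from. A back-of-envelope sketch (prefill/TTFT ignored):

```python
# Wall-clock estimate for a 2048-token completion at the decode
# speeds quoted above (prefill and TTFT ignored).
TOKENS = 2048

def seconds_for(tok_per_s: float) -> float:
    return TOKENS / tok_per_s

super_fast = seconds_for(19.5)  # Super, upper community estimate (~105 s)
super_slow = seconds_for(14.0)  # Super, lower community estimate (~146 s)
gemma_31b = seconds_for(6.9)    # 31B Dense estimate from the plan (~297 s)
```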

## Verification

1. **Pre-flight:** vLLM ≥0.19.0 on both machines, HF tokens, disk space
2. **Benchmark A:** 26B on Titan — health, FP8 logs, smoke test, 9 tests, JSON saved
3. **Titan restored:** All 7 services healthy
4. **Super baseline:** Benchmark against running Super on Beast, JSON saved
5. **Beast cleared:** Nothing running, GPU empty
6. **Benchmark B:** 31B on Beast — health, NVFP4 logs, smoke test, 9 tests (timeout 600), JSON saved
7. **Beast restored:** Super healthy after restart (15-20 min)
8. **Docs updated:** Phase 2 with three comparison tables + decisions
