# Next Session: Gemma 4 26B-A4B Phase 1 Benchmark on Titan

## Start Command

```
Execute the Gemma 4 Phase 1 Ollama benchmark on Titan with ISOLATED GPU (all services stopped).

Read the full plan at ~/.claude/plans/fuzzy-wandering-pony.md FIRST — it has the complete 5-phase execution plan with benchmark script design, JSON output schema, adversarial review mitigations, and verification steps.

Then read docs/RESEARCH-GEMMA4-BENCHMARK.md for the research context and decision criteria.

Key context: This plan was stress-tested by 2 adversarial reviewers (architecture + code quality) who found 27 issues. All are addressed in the plan. The top changes from the original plan:
1. Phase 0 added (pre-flight: Ollama version, disk space, git status checks)
2. Phase 1 reframed as QUALITY benchmark (not performance — GGUF Q4 vs NVFP4 is apples-to-oranges)
3. VRAM test reordered (warmup FIRST, then nvidia-smi — was backwards)
4. TTFT uses real streaming (/api/generate stream=true, 1-byte reads, 5-run average)
5. Vision uses committed test image (no network download, no fragile BMP struct.pack)
6. Tool calling 3-tier scoring (structured=1.0, text-JSON=0.5, none=0.0)
7. Ollama version upgrade path (update start.sh pin, verify existing models, rollback plan)
8. Structured JSON output schema for automated Phase 2 comparison
9. 300s timeouts (600s for cold-load warmup), full nvidia-smi path, atomic writes
```
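Change 6's three-tier tool-calling score can be sketched roughly as follows. This is a minimal illustration, not the plan's actual scorer; the `message` dict is assumed to be the message object from an Ollama `/api/chat` response, and the key check (`name`/`function`) is a guess at what a text-emitted tool call would contain:

```python
import json
import re

def score_tool_call(message: dict) -> float:
    """Score one chat response message for tool-calling ability.

    1.0 — structured tool_calls field present (native tool calling)
    0.5 — no tool_calls, but a plausible JSON call found in text content
    0.0 — neither
    """
    # Tier 1: native structured tool calls
    if message.get("tool_calls"):
        return 1.0

    # Tier 2: model emitted the call as JSON inside plain text
    content = message.get("content", "")
    for candidate in re.findall(r"\{.*\}", content, re.DOTALL):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and ("name" in obj or "function" in obj):
            return 0.5

    # Tier 3: no usable call at all
    return 0.0
```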

## Context

Sessions 404-405 researched the Gemma 4 family (released Apr 2, 2026) and designed the benchmark plan. Session 406 ran /planning-with-review, which dispatched 2 adversarial reviewers who found 27 issues (23 implemented, 2 accepted, 1 deferred, 1 rejected). This session executes the hardened plan.

**What we're testing:** Gemma 4 26B-A4B MoE (25.2B total, 3.8B active) as a potential Nemotron Nano replacement on Titan. Key advantage: **native vision** (image+video understanding) that Nano doesn't have.

**Critical reframe:** Phase 1 is a **QUALITY benchmark**, not a performance benchmark. Comparing Ollama GGUF Q4 (Gemma) against vLLM NVFP4 (Nano) tok/s is structurally invalid. Performance numbers are an informational floor. The real questions:
1. Does vision actually work?
2. Does tool calling work (structured or text-based)?
3. How good is Kannada understanding?
4. Does entity extraction quality match Nano?

## Files to Read

| File | Why | Priority |
|------|-----|----------|
| `~/.claude/plans/fuzzy-wandering-pony.md` | **FULL PLAN** with all phases, script design, output schema, review mitigations | **MUST READ** |
| `docs/RESEARCH-GEMMA4-BENCHMARK.md` | Research context, model comparison, decision criteria | **MUST READ** |
| `scripts/benchmark_nemotron_nano.py` | **TEMPLATE** — copy chat(), TOOLS, test cases, entity prompt | **MUST READ** |
| `docs/RESOURCE-REGISTRY.md` | Current VRAM budget (27.4 GB idle, 128 GB total) | Read |
| `start.sh` lines 274-300 | Ollama container config (pinned to 0.17.1-rc2) | Read |
| `stop.sh` | Service shutdown targets | Read |

## Decision Criteria (Quality-Focused)

| Metric | Category | Must Meet | Nice to Have | Dealbreaker |
|--------|----------|-----------|--------------|-------------|
| Entity extraction | Quality | >= 5/7 persons | 7/7 | < 3/7 |
| Tool calling | Quality | >= 50% (any format) | >= 75% structured | 0% |
| Vision | Quality | Describes test image | Reads text in images | Not supported |
| Kannada entities | Quality | >= 4/6 entities | 6/6 | < 2/6 |
| tok/s (informational) | Perf floor | > 25 tok/s | > 40 tok/s | < 15 tok/s |
| TTFT (informational) | Perf floor | < 500ms | < 200ms | > 2000ms |
| VRAM | Resource | < 20 GB | < 15 GB | > 25 GB |
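Once the Phase 1 JSON output exists, these rows can be checked mechanically. A hypothetical grader with thresholds copied from the table; the metric keys are illustrative, and boundary semantics at exact threshold values (`>=` vs `>`) are a judgment call:

```python
# Thresholds copied from the table above. "dir" is +1 when higher is
# better (quality scores, tok/s) and -1 when lower is (TTFT, VRAM).
CRITERIA = {
    "entity_persons":   {"must": 5,    "nice": 7,    "deal": 3,    "dir": +1},
    "tool_call_rate":   {"must": 0.50, "nice": 0.75, "deal": 0.0,  "dir": +1},
    "kannada_entities": {"must": 4,    "nice": 6,    "deal": 2,    "dir": +1},
    "tok_s":            {"must": 25,   "nice": 40,   "deal": 15,   "dir": +1},
    "ttft_ms":          {"must": 500,  "nice": 200,  "deal": 2000, "dir": -1},
    "vram_gb":          {"must": 20,   "nice": 15,   "deal": 25,   "dir": -1},
}

def grade(metric: str, value: float) -> str:
    """Return 'dealbreaker', 'nice', 'pass', or 'fail' for one metric."""
    c = CRITERIA[metric]
    if metric == "tool_call_rate" and value == 0:
        return "dealbreaker"  # the table's tool-calling dealbreaker is exactly 0%
    if c["dir"] > 0:  # higher is better
        if value < c["deal"]:
            return "dealbreaker"
        if value >= c["nice"]:
            return "nice"
        return "pass" if value >= c["must"] else "fail"
    # lower is better
    if value > c["deal"]:
        return "dealbreaker"
    if value < c["nice"]:
        return "nice"
    return "pass" if value < c["must"] else "fail"
```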

## Adversarial Review Summary (27 Issues)

### Top Findings Addressed
1. **VRAM test ordering bug** — nvidia-smi fired before model loaded (captured idle VRAM). Fixed: warmup first.
2. **TTFT measurement broken** — stream=False measures total time, not first token. Fixed: streaming + 1-byte reads.
3. **Vision silent fallback** — Ollama ignores `images` if the model doesn't support them. Fixed: verify the response references image content.
4. **Tool calling false negatives** — Gemma 4 may output JSON in text, not structured tool_calls. Fixed: 3-tier scoring.
5. **Ollama version cascade** — upgrading Ollama for Gemma 4 would be undone when start.sh re-applies the old pin. Fixed: update the pin first.
6. **No structured output** — console-only output can't be diffed across phases. Fixed: JSON schema defined.
7. **Synthetic BMP untestable** — struct.pack BMP is fragile. Fixed: PPM format + committed test image.
8. **Timeout too short** — urllib 60s default vs 120s cold model load. Fixed: 300s/600s.

### Full Review Details
See `~/.claude/plans/fuzzy-wandering-pony.md` for the complete feedback response table (27 rows).
