# Next Session: Gemma 4 26B NVFP4 Verification + 31B Tool Calling Research

## Context

Phase 2 benchmarks are COMPLETE (sessions 430-431). Results:
- **26B FP8 on Titan**: 38.4 tok/s, 111ms TTFT, 59 GB VRAM — too heavy on VRAM, tools broken (0/8)
- **31B NVFP4 on Beast**: 6.9 tok/s, 390ms TTFT, 46 GB — too slow, tools broken (0/8 with `hermes`)
- **Tool calling is a vLLM parser issue, not a model issue**: Ollama gets 8/8, vLLM gets 0/8

Then a community NVFP4 checkpoint was found:
- `bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4` by Mario Iseli
- **Community numbers**: 48.2 tok/s, 53ms TTFT, 15.7 GB VRAM, tool calling WORKING
- ALL swap thresholds PASS — but we need to verify on our hardware

Full research: `docs/RESEARCH-GEMMA4-BENCHMARK.md` (sections 9-10)

## Two Tasks

### Task 1: Phase 3 — Verify 26B NVFP4 on Titan

**Goal**: Run our benchmark script against the community NVFP4 checkpoint. If numbers match community reports, this is a Nano replacement.

**Prerequisites (build vllm-node-tf5)**:
```bash
# On Titan — eugr's spark-vllm-docker build system
ssh titan "cd ~/spark-vllm-docker && git pull && ./build-and-copy.sh --tf5"
# This builds vllm-node-tf5 image (~3 min with pre-built wheels)
# DO NOT modify production vllm-node
```

**Download NVFP4 checkpoint** (~16.5 GB):
```bash
ssh titan "huggingface-cli download bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 --local-dir ~/.cache/huggingface/hub/Gemma-4-26B-A4B-it-NVFP4"
```

**Get patched gemma4.py**:
The NVFP4 model card includes `gemma4_patched.py` which fixes `expert_params_mapping` scale key suffixes (vLLM issue #38912). Download it:
```bash
ssh titan "cd /tmp && huggingface-cli download bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 gemma4_patched.py --local-dir ."
```
Or extract it from the model repo files. Bind-mount it over the stock module at:
`/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py`

**Serve NVFP4 model**:
```bash
# Stop Titan services first: ./stop.sh (from laptop)
# Verify GPU empty: ssh titan "nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader"

ssh titan 'docker run -d --name vllm-gemma4-26b-nvfp4 \
  --gpus all --ipc=host --shm-size 64gb -p 8004:8004 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v ~/.cache/huggingface/hub/Gemma-4-26B-A4B-it-NVFP4:/model \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /tmp/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py \
  vllm-node-tf5 \
  vllm serve /model \
    --served-model-name gemma-4-26b \
    --host 0.0.0.0 --port 8004 \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4'
```
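Before launching the full benchmark, a quick smoke test confirms the server answers at all. A minimal stdlib-only sketch (model name and port match the serve command above; call `smoke_test()` on Titan once the container reports ready — nothing runs on import):

```python
import json
import urllib.request

def build_smoke_payload(model: str = "gemma-4-26b") -> dict:
    """One short completion; enough to prove the server is up and decoding."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 16,
        "temperature": 0.0,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]

def smoke_test(url: str = "http://localhost:8004/v1/chat/completions") -> str:
    """POST one request and return the assistant text. Run this on Titan."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_smoke_payload()).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))
```
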

**Key flags explained**:
- `--quantization modelopt` — NVFP4 checkpoint format (NOT `fp8`)
- `--moe-backend marlin` — Marlin kernel for MoE layers (MANDATORY on SM121)
- `VLLM_NVFP4_GEMM_BACKEND=marlin` — for non-MoE layers
- `--tool-call-parser gemma4` — try the native parser first (the patched gemma4.py may fix it); if it fails, retry with `pythonic`
- `--gpu-memory-utilization 0.85` — more aggressive than FP8 test (model is 3.8x smaller)

**Run benchmark** (via screen to avoid SSH pipe hangs):
```bash
ssh titan 'screen -dmS bench bash -c "cd ~/workplace/her/her-os && python3 -u scripts/benchmark_gemma4_vllm.py \
  --url http://localhost:8004/v1 \
  --model gemma-4-26b \
  --timeout 300 \
  --thinking-field enable_thinking \
  --output scripts/results/benchmark_gemma4_26b_nvfp4.json \
  --baseline scripts/results/benchmark_gemma4_26b_vllm.json \
  > /tmp/gemma4_26b_nvfp4.log 2>&1"'
# Poll: ssh titan "tail -20 /tmp/gemma4_26b_nvfp4.log"
```
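Manual `tail` polling works, but a small helper can classify the log state in one call. A sketch; the marker strings are assumptions about what `benchmark_gemma4_vllm.py` prints and should be adjusted to its real output:

```python
from pathlib import Path

# Marker strings are ASSUMPTIONS -- align with the benchmark script's real output.
DONE_MARKERS = ("Results written",)
FAIL_MARKERS = ("Traceback", "Connection refused")

def poll_log(path: str) -> str:
    """Return 'done', 'failed', or 'running' based on markers in the log file."""
    p = Path(path)
    text = p.read_text(errors="replace") if p.exists() else ""
    if any(m in text for m in FAIL_MARKERS):
        return "failed"
    if any(m in text for m in DONE_MARKERS):
        return "done"
    return "running"
```

Wrap it in a loop (or `watch`) over `ssh titan "cat /tmp/gemma4_26b_nvfp4.log"` piped to a local copy if you want hands-off polling.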

**Cleanup**: `docker rm -f vllm-gemma4-26b-nvfp4`, then `./start.sh` from laptop.

**Decision framework**:

| tok/s | VRAM | TTFT | Tools | Vision | Decision |
|-------|------|------|-------|--------|----------|
| ≥40 | ≤25 GB | ≤200ms | ≥6/8 | Working | **SWAP Nano for Gemma 4 26B** |
| ≥40 | ≤25 GB | ≤200ms | 0/8 | Working | **WAIT** — tools still broken, investigate parser |
| <40 | any | any | any | any | **REJECT** — community numbers don't reproduce |
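The table can be applied mechanically to the result JSON once the run finishes. A minimal sketch of the same thresholds; the `INVESTIGATE` fallthrough for mixed results (e.g. 1-5 tools passing) is my addition, not a row in the table:

```python
def decide(tok_s: float, vram_gb: float, ttft_ms: float,
           tools_passed: int, vision_ok: bool) -> str:
    """Apply the Phase 3 decision table. Thresholds mirror the table above."""
    if tok_s < 40:
        return "REJECT"          # community numbers don't reproduce
    if vram_gb <= 25 and ttft_ms <= 200 and vision_ok:
        if tools_passed >= 6:
            return "SWAP"        # replace Nano with Gemma 4 26B
        if tools_passed == 0:
            return "WAIT"        # tools still broken, investigate parser
    return "INVESTIGATE"         # mixed result, not covered by the table
```

Feed it the metrics parsed out of `benchmark_gemma4_26b_nvfp4.json` (field names there will need checking against the script's output schema).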

### Task 2: Research — Can 31B Benefit from Patched gemma4.py?

**Goal**: Determine if the patched `gemma4.py` + correct tool parser would fix 31B's 0/8 tool calling.

**Research questions**:
1. Does the patched `gemma4.py` fix tool calling for ALL Gemma 4 models or only NVFP4?
2. Is the `gemma4` tool-call-parser (vs `hermes`/`pythonic`) the real fix?
3. Does the `--moe-backend marlin` flag affect non-MoE (31B Dense) models?
4. Is there a community NVFP4 of 31B that's faster than the official `nvidia/Gemma-4-31B-IT-NVFP4`?
5. Could the 31B run on Beast alongside Super (46 + 90 = 136 GB on a 128 GB box)?

**Research approach**:
- Check vLLM issue #38912 comments for 31B mentions
- Check vLLM PR #38909 (gemma4 tool parser) status — may have been merged
- Search NVIDIA forums for 31B tool calling reports
- Check if eugr's spark-vllm-docker has a 31B recipe with tool parser mod
- If promising: quick re-test 31B on Beast with patched gemma4.py + `gemma4` parser

**Optional quick test** (if research is positive):
```bash
# On Beast — serve 31B with patched gemma4.py
# Same docker command as session 430 BUT:
#   - Mount gemma4_patched.py
#   - Use --tool-call-parser gemma4 (not hermes)
#   - Add --moe-backend marlin if applicable to dense models
# Run only the tool calling test (not the full benchmark)
```
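For the tool-calling-only check, a single request with one dummy tool is enough: if the parser works, `choices[0].message.tool_calls` comes back populated. A sketch against the OpenAI-compatible endpoint; the `get_weather` tool and the host/port are placeholders, and `run_tool_check()` is only invoked by hand on Beast:

```python
import json
import urllib.request

def build_tool_payload(model: str = "gemma-4-31b") -> dict:
    """One request with a single hypothetical tool the model should call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What's the weather in Zurich?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # placeholder tool, not from the repo
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
        "max_tokens": 256,
    }

def parsed_tool_calls(response: dict) -> list:
    """Return the parser-emitted tool calls (empty list = parser still broken)."""
    return response["choices"][0]["message"].get("tool_calls") or []

def run_tool_check(url: str) -> int:
    """POST one tool request; returns how many tool calls the parser produced."""
    req = urllib.request.Request(
        url,  # e.g. http://<beast>:<port>/v1/chat/completions
        data=json.dumps(build_tool_payload()).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return len(parsed_tool_calls(json.load(resp)))
```
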

## Execution Order

```
1. Build vllm-node-tf5 on Titan (3 min)
2. Download NVFP4 checkpoint + patched gemma4.py (16.5 GB download)
3. Stop Titan → serve NVFP4 → smoke test
4. Run full benchmark (9 tests, ~10 min)
5. Stop NVFP4 → restart Titan
6. Commit results
7. Research 31B tool calling (web search + issue tracking)
8. If promising: quick 31B re-test on Beast (optional)
9. Update docs/RESEARCH-GEMMA4-BENCHMARK.md with Phase 3 results
```

## Key Learnings from Sessions 430-431 (DO NOT REPEAT)

1. **Run benchmark via `screen`** — SSH pipes cause urllib hangs
2. **`chat_template_kwargs` must be top-level** in JSON payload (not wrapped in `extra_body`)
3. **Cold start ~28 min on SM121** — torch.compile CUDA graph compilation. Cached after first run.
4. **HF token is at `~/.huggingface/token`** on Titan (NOT `~/.cache/huggingface/token`)
5. **`vllm/vllm-openai:gemma4-cu130` lacks tool parser fix** — this session uses `vllm-node-tf5` instead
6. **`VLLM_ATTENTION_BACKEND` env var is ignored** in newer vLLM — Gemma 4 auto-selects TRITON_ATTN
7. **`git pull` on Titan then `find . -name '*.pyc' -delete`** — clears the stale bytecode cache
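Learning 2 in concrete form — a payload builder that keeps `chat_template_kwargs` at the top level of the request body (the `enable_thinking` field name matches the benchmark invocation above; treat the rest as a sketch):

```python
def thinking_payload(model: str, prompt: str, thinking: bool = False) -> dict:
    """Build a chat request with chat_template_kwargs at the TOP LEVEL.
    Wrapping it inside an 'extra_body' key in the JSON silently drops it."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},  # top-level, not nested
    }
```
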

## What's Already Done (DO NOT REDO)

- Benchmark script: `scripts/benchmark_gemma4_vllm.py` (v2.0.0) — deployed on Titan
- FP8 results: `scripts/results/benchmark_gemma4_26b_vllm.json`
- 31B results: `scripts/results/benchmark_gemma4_31b_vllm.json`
- Super baseline: `scripts/results/benchmark_super_baseline.json`
- Phase 1 baseline: `scripts/results/benchmark_gemma4_phase1.json`
- Research doc: `docs/RESEARCH-GEMMA4-BENCHMARK.md` (sections 1-11)
- NVFP4 research: Section 10 of research doc (serve command, requirements, community numbers)
