# Next Session: Gemma 4 26B NVFP4 Production Swap (Phase 4)

## Context

Phase 3 benchmark PASSED ALL thresholds (session 432):
- **50.4 tok/s** (Nano: 48-65) — within production range
- **83.9ms TTFT** (Nano: ~130ms) — 35% faster
- **15.74 GB model VRAM** (Nano: 18 GB) — 2.3 GB smaller
- **7/8 tools** (Nano: 7/8) — tied
- **5/5 vision** (Nano: none) — new capability
- **6/6 Kannada** — new capability
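The headline deltas above can be sanity-checked with a few lines of arithmetic (numbers taken straight from the benchmark bullets; Nano's TTFT is the approximate ~130 ms figure):

```python
# Sanity-check the headline deltas from the Phase 3 benchmark (session 432).
gemma_ttft_ms, nano_ttft_ms = 83.9, 130.0   # TTFT; Nano value is approximate
gemma_vram_gb, nano_vram_gb = 15.74, 18.0   # model VRAM

ttft_speedup_pct = (1 - gemma_ttft_ms / nano_ttft_ms) * 100
vram_saved_gb = nano_vram_gb - gemma_vram_gb

print(f"TTFT: {ttft_speedup_pct:.0f}% faster")   # ~35%
print(f"VRAM: {vram_saved_gb:.2f} GB smaller")   # ~2.3 GB
```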

**Decision: SWAP Nano for Gemma 4 26B-A4B NVFP4.**

## Prerequisites (already done)

- [x] NVFP4 checkpoint: `~/.cache/huggingface/hub/Gemma-4-26B-A4B-it-NVFP4` (16 GB on Titan)
- [x] Patched gemma4.py: `/tmp/gemma4_patch/gemma4_patched.py` on Titan
- [x] Docker image: `vllm/vllm-openai:gemma4-cu130` (already on Titan)
- [x] Benchmark results: `scripts/results/benchmark_gemma4_26b_nvfp4.json`

## Tasks

### Task 1: Create spark-vllm-docker Recipe

Create `recipes/gemma4-26b-a4b-nvfp4.yaml` in `~/spark-vllm-docker` on Titan:
```yaml
recipe_version: "1"
name: Gemma4-26B-A4B-NVFP4
description: vLLM serving Gemma4-26B-A4B community NVFP4
model: bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4
container: vllm/vllm-openai:gemma4-cu130
solo_only: true
mods: []  # gemma4 parser already in image
defaults:
  port: 8003
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.25
  max_model_len: 32768
  max_num_seqs: 8
env:
  VLLM_NVFP4_GEMM_BACKEND: marlin
command: |
  /model \
    --served-model-name gemma-4-26b \
    --host {host} --port {port} \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --max-model-len {max_model_len} \
    --max-num-seqs {max_num_seqs} \
    --moe-backend marlin \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --enable-prefix-caching
```

Note: The image entrypoint is `["vllm", "serve"]` so the command starts with `/model` (the model path).
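Assuming the recipe runner expands `defaults` into the `command` template with plain `{placeholder}` substitution (an assumption about spark-vllm-docker's internals, not confirmed here), the rendered command can be previewed like this:

```python
# Hypothetical preview of how spark-vllm-docker might render the recipe's
# command template from its defaults -- the real runner may differ.
defaults = {
    "port": 8003,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "gpu_memory_utilization": 0.25,
    "max_model_len": 32768,
    "max_num_seqs": 8,
}

# Abbreviated template; the full recipe carries the remaining flags verbatim.
command_template = (
    "/model "
    "--served-model-name gemma-4-26b "
    "--host {host} --port {port} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--max-model-len {max_model_len} "
    "--max-num-seqs {max_num_seqs}"
)

rendered = command_template.format(**defaults)
print(rendered)
```

Since the image entrypoint already supplies `vllm serve`, the rendered string is exactly what gets appended after it.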

### Task 2: Update start.sh + stop.sh

- Change vLLM model from Nemotron Nano to Gemma 4 26B NVFP4
- Mount `gemma4_patched.py` at container runtime
- Update health check model name
- Update served-model-name references in Annie Voice config
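For the health-check update, vLLM's OpenAI-compatible `/v1/models` endpoint lists served model ids; a minimal check of the new name can be sketched as below (the helper name is ours, and we parse a sample response rather than hit a live server):

```python
# Sketch of the start.sh health check after the swap. vLLM's OpenAI-compatible
# /v1/models endpoint returns {"object": "list", "data": [{"id": ...}, ...]};
# here we only validate a sample response dict instead of a live request.
def model_is_served(models_response: dict, expected: str) -> bool:
    """Return True if `expected` appears among the served model ids."""
    return any(m.get("id") == expected for m in models_response.get("data", []))

# Example response as vLLM would report it once Gemma 4 is up:
sample = {"object": "list", "data": [{"id": "gemma-4-26b", "object": "model"}]}

print(model_is_served(sample, "gemma-4-26b"))     # new name present
print(model_is_served(sample, "nemotron-nano"))   # old name gone
```

In start.sh this logic would wrap a `curl http://localhost:8003/v1/models` call.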

### Task 3: Update Annie Voice Config

- `services/annie-voice/bot.py`: Update model name from `nemotron-nano` to `gemma-4-26b`
- Verify `enable_thinking` handling (Gemma 4 uses the same field, so no change is expected)
- Add vision tool support (new capability!)
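Assuming bot.py talks to vLLM through the OpenAI-compatible chat API (the actual client code isn't shown here), the swap is mostly a payload change. The `build_chat_payload` helper is hypothetical; `chat_template_kwargs` is how vLLM's chat endpoint exposes template flags like `enable_thinking`, but verify it matches bot.py's current usage:

```python
# Hypothetical request payload after the swap -- only the model name changes;
# enable_thinking is passed through chat_template_kwargs as it was for Nano.
def build_chat_payload(messages: list, thinking: bool = False) -> dict:
    return {
        "model": "gemma-4-26b",          # was "nemotron-nano"
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

payload = build_chat_payload([{"role": "user", "content": "hello"}])
print(payload["model"])
```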

### Task 4: Production Integration Test

1. Run Annie voice loop with Gemma 4 as LLM backend
2. Measure end-to-end voice latency (STT → LLM → TTS)
3. Test all creatures that use minotaur (the LLM creature)
4. A/B test conversation quality with Rajesh
5. Run tool calling regression suite
6. Verify Telegram bot works
7. Verify dashboard stats
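Step 2 can be made concrete with a small timing harness. The stage names mirror the STT → LLM → TTS pipeline, but the stage functions below are stand-ins (sleeps), not the real Annie Voice calls:

```python
import time

# Illustrative per-stage latency breakdown for the voice loop.
# Replace the lambda stand-ins with real STT/LLM/TTS pipeline calls.
def run_voice_loop(stages) -> dict:
    """Run named stages in order; return per-stage and total latency in ms."""
    latencies = {}
    for name, fn in stages:
        start = time.perf_counter()
        fn()
        latencies[name] = (time.perf_counter() - start) * 1000
    latencies["total"] = sum(latencies.values())
    return latencies

stages = [
    ("stt", lambda: time.sleep(0.01)),
    ("llm", lambda: time.sleep(0.02)),
    ("tts", lambda: time.sleep(0.01)),
]
lat = run_voice_loop(stages)
print({k: round(v, 1) for k, v in lat.items()})
```

Recording the per-stage split (not just the total) makes it easy to attribute any regression to the LLM swap versus the audio stages.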

### Task 5: Update Resource Registry

Update `docs/RESOURCE-REGISTRY.md`:
- Replace Nemotron Nano (18 GB) with Gemma 4 26B NVFP4 (15.74 GB)
- Recalculate steady-state scenarios
- Add to Change Log
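The steady-state recalculation is simple arithmetic. The sketch below uses a placeholder entry and figure for the audio pipeline (substitute whatever the registry actually lists); only the LLM line item changes:

```python
# Hedged sketch of the steady-state VRAM recompute for RESOURCE-REGISTRY.md.
# "audio_pipeline" and its 4.0 GB figure are placeholders, not registry values.
registry = {
    "llm": 18.0,            # Nemotron Nano, being replaced
    "audio_pipeline": 4.0,  # placeholder for the other resident workloads
}

registry["llm"] = 15.74     # Gemma 4 26B NVFP4 (from the Phase 3 benchmark)
steady_state = sum(registry.values())
print(f"steady-state VRAM: {steady_state:.2f} GB")
```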

### Task 6: Commit + Deploy

If all tests pass:
- Commit config changes
- Git pull on Titan
- Restart services with new model

## Key Gotchas

1. **Image entrypoint**: `gemma4-cu130` entrypoint is `["vllm", "serve"]` — command must NOT include `vllm serve`
2. **Mount patched gemma4.py**: Still needed for NVFP4 expert weight loading (vLLM #38912 not yet merged)
3. **gpu_memory_utilization: 0.25** — match Nano's footprint for coexistence with audio pipeline
4. **served-model-name**: Using `gemma-4-26b` — update all references (Annie Voice, dashboard, Telegram bot)
5. **Thinking field**: Gemma 4 uses `enable_thinking` (same as Nano) — no change needed
6. **Tool parser**: `--tool-call-parser gemma4` is MANDATORY (not hermes or pythonic)

## Rollback Plan

If Gemma 4 causes issues:
1. Stop Gemma 4: `docker rm -f vllm-gemma4`
2. Restore Nano: current start.sh still has Nano recipe
3. `./start.sh titan` — back to production in ~2 min
