# Next Session: Swap Nemotron Nano → Gemma 4 26B NVFP4

## Context

Phase 3 benchmark (session 432) PASSED ALL swap thresholds:

| Metric | Gemma 4 26B NVFP4 | Nemotron Nano | Better? |
|--------|-------------------|---------------|---------|
| tok/s | **50.4** | 48-65 | Tied |
| TTFT | **83.9 ms** | ~130 ms | **35% faster** |
| Model VRAM | **15.74 GB** | 18 GB | **2.3 GB smaller** |
| Tools | 7/8 | 7/8 | Tied |
| Vision | **5/5** | None | **New capability** |
| Kannada | **6/6** | N/A | **New capability** |

Full results: `docs/RESEARCH-GEMMA4-BENCHMARK.md` §12.
ADR-031 (pending Phase 4): `docs/PROJECT.md`.

## What's Already on Titan

- **NVFP4 checkpoint**: `~/.cache/huggingface/hub/Gemma-4-26B-A4B-it-NVFP4` (16 GB, 3 safetensors)
- **Patched gemma4.py**: `/tmp/gemma4_patch/gemma4_patched.py` (fixes NVFP4 expert weight loading, vLLM #38912)
- **Docker image**: `vllm/vllm-openai:gemma4-cu130` (has tf5 + gemma4 tool parser + modelopt)
- **Benchmark results**: `scripts/results/benchmark_gemma4_26b_nvfp4.json`

## Execution Order

### Phase A: Config Changes (plan + implement)

#### 1. Move patched gemma4.py to a persistent location
The file at `/tmp/gemma4_patch/gemma4_patched.py` won't survive a reboot. Move it:
```bash
ssh titan "mkdir -p ~/.her-os/patches && cp /tmp/gemma4_patch/gemma4_patched.py ~/.her-os/patches/"
```

#### 2. Create spark-vllm-docker recipe on Titan
Create `~/spark-vllm-docker/recipes/gemma4-26b-a4b-nvfp4.yaml`. A draft of the recipe is in `docs/NEXT-SESSION-GEMMA4-PRODUCTION-SWAP.md`.
Key details:
- Container: `vllm/vllm-openai:gemma4-cu130` (entrypoint is `["vllm", "serve"]` — command must NOT repeat `vllm serve`)
- `--quantization modelopt` (NOT `fp8`)
- `--moe-backend marlin` (MANDATORY on SM121)
- `--tool-call-parser gemma4` (NOT hermes or pythonic)
- `--gpu-memory-utilization 0.25` (match Nano's production footprint)
- `VLLM_NVFP4_GEMM_BACKEND=marlin` env var
- Volume mount: `~/.her-os/patches/gemma4_patched.py` → `/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py`
- Volume mount: `~/.cache/huggingface/hub/Gemma-4-26B-A4B-it-NVFP4` → `/model`

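The recipe schema is project-specific, but the container invocation it must produce looks roughly like this `docker run` equivalent. This is a sketch for review, not the recipe itself: the port mapping and the `--rm`/`--gpus` flags are assumptions; the model-server flags and mounts come from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical docker run equivalent of the gemma4 recipe.
# The image entrypoint is ["vllm", "serve"], so arguments begin at /model.
ARGS=(
  run --rm --gpus all -p 8003:8000
  -e VLLM_NVFP4_GEMM_BACKEND=marlin
  -v "$HOME/.her-os/patches/gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py:ro"
  -v "$HOME/.cache/huggingface/hub/Gemma-4-26B-A4B-it-NVFP4:/model:ro"
  vllm/vllm-openai:gemma4-cu130
  /model
  --quantization modelopt
  --moe-backend marlin
  --tool-call-parser gemma4
  --gpu-memory-utilization 0.25
)
echo "docker ${ARGS[*]}"   # print for review; run with: docker "${ARGS[@]}"
```

Note that the arguments after the image name start directly at `/model` — repeating `vllm serve` there would break against the image entrypoint.
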
#### 3. Update `start.sh`
Change the Titan LLM section to launch Gemma 4 instead of Nemotron Nano. Key changes:
- Recipe: `nemotron-3-nano-nvfp4` → `gemma4-26b-a4b-nvfp4`
- Health check endpoint: same (`/health` on port 8003)
- Startup wait: likely needs a longer timeout (CUDA graph compilation takes ~4 min warm, ~28 min cold)
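A hedged sketch of the longer startup wait — the function name, interval, and 1800 s ceiling (which covers the ~28 min cold-start worst case) are assumptions; adapt to `start.sh`'s existing conventions:

```shell
# Poll the vLLM health endpoint until it answers or a ceiling is hit.
wait_for_llm() {
  local url="$1" timeout="${2:-1800}" interval="${3:-10}" elapsed=0
  until curl -sf "$url" >/dev/null 2>&1; do
    sleep "$interval"
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "LLM not healthy after ${timeout}s" >&2
      return 1
    fi
  done
}
# in start.sh, after launching the recipe:
#   wait_for_llm "http://localhost:8003/health" 1800
```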

#### 4. Update `LLM_MODEL` env var
In the `.env` on Titan (the PROJECT `.env`, not `~/.her-os/.env`):
```
LLM_MODEL=gemma-4-26b  # was: nemotron-nano
```
This flows to `text_llm.py:1662` which uses it for API requests.

#### 5. Check model-name-dependent code paths
These files reference `nemotron-nano` or `nemotron_nano`:
- `services/annie-voice/text_llm.py` — `LLM_MODEL` env var (line 1662)
- `services/annie-voice/server.py` — model name in health/status
- `services/annie-voice/bot.py` — model name logging
- `services/annie-voice/compaction.py` — model name for API calls
- `services/annie-voice/phone_loop.py` — model name for phone LLM
- `services/annie-voice/observability.py` — model name in metrics
- `services/annie-voice/cost_tracker.py` — model name in cost tracking
- `services/annie-voice/agent_context.py` — model name for agent calls
- `services/context-engine/chronicler.py` — model name for extraction
- `services/context-engine/dashboard/src/creatures/registry.ts` — creature display name

**Most of these read `LLM_MODEL` from env** — changing the env var should propagate. But audit each file for hardcoded `nemotron-nano` strings.
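One way to run that audit — the pattern and include globs are assumptions; extend them if other file types reference the model name:

```shell
# Find hardcoded model-name strings across Python and TypeScript sources.
PATTERN='nemotron[-_]nano'
grep -rnE "$PATTERN" services/ --include='*.py' --include='*.ts' || true
```

Any hit that is not reading `LLM_MODEL` from the environment needs a manual edit.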

#### 6. Update `RESOURCE-REGISTRY.md`
- Replace Nemotron Nano (18 GB) with Gemma 4 26B NVFP4 (15.74 GB)
- Recalculate steady-state scenarios (should free 2.3 GB)
- Add Change Log entry

### Phase B: Deploy + Smoke Test

#### 7. Deploy
```bash
# From laptop:
./stop.sh titan
# Wait for GPU clear
./start.sh titan
# Verify health:
ssh titan "curl -s http://localhost:8003/v1/models | python3 -m json.tool"
# Should show: gemma-4-26b (not nemotron-nano)
```
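"Wait for GPU clear" can be made mechanical with a small check. A sketch, assuming single-GPU Titan and an arbitrary 1024 MiB threshold:

```shell
# Returns success once Titan's GPU memory use falls below the threshold.
gpu_clear() {
  local used
  used=$(ssh titan "nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits" 2>/dev/null)
  [ "${used:-99999}" -lt 1024 ]
}
# between stop.sh and start.sh:
#   until gpu_clear; do sleep 5; done
```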

#### 8. Smoke tests
```bash
# Quick chat
ssh titan 'curl -s http://localhost:8003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"gemma-4-26b\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":50,\"chat_template_kwargs\":{\"enable_thinking\":false}}"'

# Tool calling
# (test with a web_search tool call to verify gemma4 parser works in production)

# Vision (NEW — test with a base64 image)
```
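Hedged sketches for the two missing smoke tests. The `web_search` tool schema and `test.jpg` are placeholders; the request shapes follow the OpenAI-compatible chat completions API that vLLM serves. Both payloads avoid single quotes so they survive the `ssh … -d '…'` quoting used above.

```shell
# Helper: POST a JSON payload to the chat completions endpoint on Titan.
run_chat() {
  ssh titan "curl -s http://localhost:8003/v1/chat/completions \
    -H 'Content-Type: application/json' -d '$1'"
}

# Tool calling: expect choices[0].message.tool_calls with a web_search call.
TOOL_PAYLOAD='{"model":"gemma-4-26b",
  "messages":[{"role":"user","content":"Search the web for DGX Spark reviews"}],
  "tools":[{"type":"function","function":{"name":"web_search",
    "description":"Search the web",
    "parameters":{"type":"object",
      "properties":{"query":{"type":"string"}},"required":["query"]}}}],
  "max_tokens":200,"chat_template_kwargs":{"enable_thinking":false}}'
# run_chat "$TOOL_PAYLOAD"

# Vision: base64-encode a small local image into an image_url content part.
IMG=$(base64 -w0 test.jpg 2>/dev/null || true)
VISION_PAYLOAD='{"model":"gemma-4-26b",
  "messages":[{"role":"user","content":[
    {"type":"text","text":"Describe this image in one sentence."},
    {"type":"image_url","image_url":{"url":"data:image/jpeg;base64,'"$IMG"'"}}]}],
  "max_tokens":100,"chat_template_kwargs":{"enable_thinking":false}}'
# run_chat "$VISION_PAYLOAD"
```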

#### 9. Telegram bot verification
Send messages to Annie via Telegram. Verify:
- Normal chat works
- Tool calling works (try "What's the weather in Bangalore?")
- No thinking leaks in responses

### Phase C: Full Integration Test

#### 10. Voice loop test (if Annie Voice is set up)
- Speak to Annie, verify STT → LLM → TTS pipeline
- Check end-to-end latency (should be ~35% faster TTFT)

#### 11. Run existing test suites
```bash
ssh titan "cd ~/workplace/her/her-os/services/annie-voice && python3 -m pytest tests/ -x -q"
```
Some tests may need model-name updates if they hardcode `nemotron-nano`.
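A quick way to list the test files that pin the old name (path taken from the pytest command above):

```shell
# List test files that still hardcode the old model name.
TESTS_DIR="services/annie-voice/tests"
grep -rl 'nemotron' "$TESTS_DIR" 2>/dev/null || echo "no hardcoded model names found"
```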

#### 12. Commit + finalize ADR-031
- Commit all config changes
- Update ADR-031 status: Pending → Accepted
- Update MEMORY.md

## Critical Gotchas (DO NOT REPEAT)

1. **Image entrypoint is `["vllm", "serve"]`** — the recipe command must NOT repeat `vllm serve`; it starts directly with the `/model` path
2. **Mount patched gemma4.py** — still needed for NVFP4 expert weight loading (vLLM #38912 open)
3. **`--tool-call-parser gemma4`** — NOT hermes, NOT pythonic. This was the root cause of 0/8 tools in Phase 2
4. **`--moe-backend marlin`** — MANDATORY on SM121 (DGX Spark Blackwell). `cutlass` crashes
5. **`gpu_memory_utilization: 0.25`** — not 0.85 (benchmark used 0.85, production needs room for audio pipeline)
6. **`enable_thinking: false`** in every request — Gemma 4 has thinking ON by default (same as Nano)
7. **Patched file at `/tmp/` won't survive reboot** — move to `~/.her-os/patches/` first
8. **`LLM_MODEL` env var** goes in PROJECT `.env` on Titan, NOT `~/.her-os/.env`
9. **Run start.sh from laptop**, never directly on Titan
10. **Clear `__pycache__/*.pyc`** after git pull on Titan
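Gotcha 10 can be scripted; the repo path on Titan is taken from the pytest command above, and the helper name is an assumption:

```shell
# Remove stale bytecode after a git pull; safe to run repeatedly.
clear_pyc() {
  find "$1" -type d -name '__pycache__' -prune -exec rm -rf {} +
  find "$1" -type f -name '*.pyc' -delete
}
# on Titan, after git pull:
#   clear_pyc ~/workplace/her/her-os
```
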

## Rollback Plan

If Gemma 4 causes issues in production:
1. Revert `LLM_MODEL=nemotron-nano` in `.env`
2. Revert recipe reference in `start.sh` (or `git checkout start.sh`)
3. `./stop.sh titan && ./start.sh titan` — Nano back in ~2 min

## What NOT to Do

- Do NOT build `vllm-node-tf5` — `gemma4-cu130` has everything needed
- Do NOT use `--tool-call-parser hermes` or `pythonic`
- Do NOT use `--quantization fp8` — this is NVFP4, use `--quantization modelopt`
- Do NOT use `--moe-backend cutlass` — broken on SM121
- Do NOT change Beast/Super — only Titan's voice LLM changes
