# Next Session: Gemma 4 26B NVFP4 Production Swap — Phase A (Config Changes)

## What

Swap the Titan voice LLM from Nemotron Nano 30B to Gemma 4 26B-A4B NVFP4. Phase A = all config + code changes (no deploy). 16 files, ~110 lines. Two adversarial review passes surfaced 3 critical gaps (health monitor, status grep, compaction thinking leak); all three are fixed in the plan.

## Plan

`~/.claude/plans/streamed-roaming-piglet.md`

Read the plan first — it has the full implementation with all adversarial review findings already addressed.

## Key Design Decisions (from adversarial review)

1. **Direct `docker run` instead of `run-recipe.py`** — the recipe system doesn't support custom volume mounts needed for the patched `gemma4.py` (vLLM #38912)
2. **`_LOCAL_LLM_BACKENDS` constant** in `bot.py` — replaces 10+ scattered `"nemotron-nano"` string checks with `backend in _LOCAL_LLM_BACKENDS`. Next model swap is a 1-line change.
3. **Always send `enable_thinking: False`** — `compaction.py` had a guard that skipped this for vLLM (relying on server-side `--reasoning-parser`). Gemma 4 has no reasoning parser. Guard removed.
4. **Both `LLM_BACKEND` AND `LLM_MODEL` must be set** — they're separate env vars used by different files. Both set to `gemma-4-26b` in start.sh.
5. **`health_monitor.py` container name updated** — without this, the health monitor would detect `vllm-nemotron` as missing and fire restart commands in an infinite loop.
6. **`start.sh` status grep updated** — line 991 had `grep '^vllm-nemotron$'` which would always show LLM as offline after the swap.
7. **Creature process strings atomic 3-file rename** — `observability.py`, `chronicler.py`, `registry.ts` must all change from `llm-nemotron-nano-voice` to `llm-gemma4-26b-voice` together or minotaur events silently drop.
8. **Keep `nemotron-nano` entries** in all PRESETS/ALLOWED_BACKENDS dicts for rollback compatibility.
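A minimal sketch of decisions 2, 3, and 8 together (the `PRESETS` contents, port, and payload field placement here are illustrative assumptions — the real definitions live in `bot.py` / `compaction.py` per the plan):

```python
# Sketch of the backend-gating pattern from the adversarial review.
# Dict contents and the exact payload shape are assumptions, not the real config.

# One constant replaces the 10+ scattered `backend == "nemotron-nano"` string
# checks; the next model swap is a one-line change here.
_LOCAL_LLM_BACKENDS = ("nemotron-nano", "gemma-4-26b")

# Keep the old entry alongside the new one for rollback compatibility (decision 8).
PRESETS = {
    "nemotron-nano": {"model": "nemotron-nano-30b", "port": 8003},
    "gemma-4-26b": {"model": "gemma-4-26b", "port": 8003},
}

def is_local_backend(backend: str) -> bool:
    """Membership check that replaces the scattered string comparisons."""
    return backend in _LOCAL_LLM_BACKENDS

def build_chat_body(backend: str, messages: list) -> dict:
    """Build a request body with the compaction guard removed (decision 3)."""
    return {
        "model": PRESETS[backend]["model"],
        "messages": messages,
        # Sent on EVERY request, no vLLM carve-out: Gemma 4 has no
        # server-side --reasoning-parser and thinking is ON by default.
        "enable_thinking": False,
    }
```

With this in place, adding a future backend means one new tuple entry and one new `PRESETS` key, rather than hunting down string comparisons across the codebase.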

## Files to Modify (in order)

1. **SSH to Titan (prep step, not a file change)** — `mkdir -p ~/.her-os/patches && cp /tmp/gemma4_patch/gemma4_patched.py ~/.her-os/patches/`
2. `services/annie-voice/bot.py` — add `_LOCAL_LLM_BACKENDS`, update 10 conditional checks
3. `services/annie-voice/server.py` — add to ALLOWED_BACKENDS, fix line 1323 hardcoded PRESETS key
4. `services/annie-voice/compaction.py` — add preset + remove enable_thinking guard
5. `services/annie-voice/text_llm.py` — atomic 3-line change (mapped + 2083 + 2089)
6. `services/annie-voice/cost_tracker.py` — add pricing entry
7. `services/annie-voice/observability.py` — creature process string
8. `services/annie-voice/llamacpp_llm.py` — docstring only
9. `services/context-engine/chronicler.py` — creature process string
10. `services/context-engine/dashboard/src/creatures/registry.ts` — creature label + process
11. `services/context-engine/llm.py` — docstring only
12. `services/context-engine/config.py` — comment only
13. `services/telegram-bot/health_monitor.py` — container name `vllm-nemotron` → `vllm-gemma4`
14. `start.sh` — docker run + env vars + health check timeout + status grep
15. `stop.sh` — container names
16. `docs/RESOURCE-REGISTRY.md` — VRAM budget (18 → 15.74 GB)
17. `services/annie-voice/tests/` — add gemma-4-26b routing tests
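The status-grep fix (item 14, decision 6) hinges on an anchored exact match against the new container name. A sketch with a hypothetical container list standing in for the real `docker ps --format '{{.Names}}'` output:

```shell
# Hypothetical container list; production pipes in `docker ps --format '{{.Names}}'`.
containers='vllm-gemma4
annie-voice
context-engine'

# Old pattern was grep '^vllm-nemotron$' — always "offline" after the swap.
# -x matches the whole line, -q suppresses output (exit status only).
if printf '%s\n' "$containers" | grep -qx 'vllm-gemma4'; then
  echo "LLM: running"
else
  echo "LLM: offline"
fi
```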

## Start Command

```shell
cat ~/.claude/plans/streamed-roaming-piglet.md
```

Then implement the plan. All adversarial findings are already addressed in it.

## Verification

1. Run tests locally: `cd services/annie-voice && python3 -m pytest tests/ -x -q`
2. Run dashboard tests: `cd services/context-engine/dashboard && npx vitest`
3. Commit + push
4. SSH Titan: move patched file (Step 1)
5. SSH Titan: git pull + clear __pycache__
6. From laptop: `./stop.sh titan && ./start.sh titan`
7. Verify: `ssh titan "curl -s http://localhost:8003/v1/models"` → `gemma-4-26b`
8. Telegram bot test: send message, verify Annie responds
9. `./start.sh status` — verify LLM shows as running
10. Dashboard: minotaur creature shows "Gemma 4 26B Voice"

## Critical Gotchas (DO NOT REPEAT)

- `--tool-call-parser gemma4` NOT hermes/pythonic (0/8 tools without it)
- `--moe-backend marlin` NOT cutlass (crashes on SM121)
- `gpu_memory_utilization: 0.25` NOT 0.85 (production, not benchmark)
- Container entrypoint is `["vllm", "serve"]` — command starts with `/model`, NOT `vllm serve /model`
- `enable_thinking: False` in EVERY request — Gemma 4 thinking ON by default
- Run `start.sh` from laptop, NEVER on Titan
- Clear `__pycache__` after git pull on Titan
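The gotchas above constrain the `start.sh` docker invocation. A hedged sketch — the image tag, port mapping, and patch mount target are placeholders/assumptions from this doc, not the authoritative command (that lives in the plan):

```shell
# Sketch only: VLLM_IMAGE and /PATCH_TARGET/gemma4.py are placeholders.
# Entrypoint is ["vllm", "serve"], so the command starts at /model
# (NOT "vllm serve /model").
docker run -d --name vllm-gemma4 --gpus all \
  -p 8003:8000 \
  -v "$HOME/.her-os/patches/gemma4_patched.py:/PATCH_TARGET/gemma4.py" \
  VLLM_IMAGE \
  /model \
  --served-model-name gemma-4-26b \
  --tool-call-parser gemma4 \
  --moe-backend marlin \
  --gpu-memory-utilization 0.25
```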
