# Next Session: Drop Nemotron Super — Full Migration to Gemma 4

## What

Drop Nemotron Super 120B from Beast (DGX Spark #2) entirely. Route ALL LLM workloads — extraction, daily reflections, Graphiti, contradiction detection, browser tasks, nudges, wonder, comic, text chat — through Gemma 4 26B NVFP4 on Titan. 40+ files across 5 services + 1 runtime YAML. Beast becomes free for other uses.

**Why now:** Session 447 debunked Super's "97% tool accuracy" claim (unverifiable marketing). On τ2-bench, the only apples-to-apples agentic benchmark, Gemma 4 scores 85.5% vs Super's 62.8% on the Retail split. Super's edges (SWE-Bench, 1M context, thinking mode) don't map to Annie's actual workloads.

## Plan

`~/.claude/plans/partitioned-sauteeing-snail.md`

**Read the plan first** — it has the full implementation, all 14 adversarial review findings, and design decisions. Two parallel adversarial reviews (architecture + code quality) found 5 CRITICAL + 5 HIGH + 4 MEDIUM issues, all addressed.

## Key Design Decisions (from adversarial review)

1. **GPU memory: 0.25 → 0.50** — The original `--gpu-memory-utilization 0.25` only allocates 32 GB to vLLM. At 65K max_model_len, KV cache needs far more. Bump to 0.50 (64 GB total). Verified: 64 + 10 (audio) + 10 (OS) = 84 GB of 128 GB.
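   The budget arithmetic above can be sanity-checked in a few lines (the 128 GB total and the 10 GB audio/OS reservations are the figures quoted in this item):

   ```python
   # Sanity-check the Titan memory budget from the review (all figures in GB).
   TOTAL_GB = 128   # Titan unified memory
   AUDIO_GB = 10    # audio stack reservation
   OS_GB = 10       # OS / headroom reservation

   def vllm_allocation(gpu_memory_utilization: float, total_gb: int = TOTAL_GB) -> float:
       """GB that vLLM claims for weights + KV cache at a given utilization."""
       return gpu_memory_utilization * total_gb

   assert vllm_allocation(0.25) == 32.0  # original setting: too little KV cache at 65K
   assert vllm_allocation(0.50) == 64.0  # bumped setting
   assert vllm_allocation(0.50) + AUDIO_GB + OS_GB == 84.0  # fits inside 128 GB
   ```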

2. **Extraction timeout: 30 → 120s** — The 30s value was calibrated for the Claude API. At 50 tok/s, a 4096-token generation takes ~82 seconds, so 30s kills extractions mid-generation.
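   The worst-case math behind the new timeout, assuming the ~50 tok/s decode rate quoted above:

   ```python
   import math

   def generation_seconds(max_tokens: int, tokens_per_second: float) -> float:
       """Worst-case wall time to emit max_tokens at a steady decode rate."""
       return max_tokens / tokens_per_second

   # A full 4096-token extraction at ~50 tok/s takes ~82 s, so the old 30 s
   # timeout is a guaranteed failure; 120 s leaves ~38 s of slack.
   worst_case = generation_seconds(4096, 50)
   assert math.isclose(worst_case, 81.92)
   assert worst_case > 30
   assert worst_case < 120
   ```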

3. **Compaction ctx_size: 32768 → 65536** — `compaction.py` preset for `gemma-4-26b` must match the new `--max-model-len`. Otherwise compaction fires at 65%×32K = ~21K tokens (way too early).

4. **Text chat compaction: new 50K/24-turn threshold** — Beast had 85K/30 turns, Nano had 20K/12 turns. Neither is right for Gemma 4 at 65K context. New: 50K threshold, 24 turns kept.
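   A minimal sketch of the new check — the real logic lives in `text_llm.py`; the function names and signatures here are illustrative only:

   ```python
   # Hypothetical sketch of the new text-chat compaction rule (50K / keep 24).
   COMPACT_THRESHOLD_TOKENS = 50_000  # new threshold for Gemma 4 at 65K context
   TURNS_TO_KEEP = 24                 # most recent turns preserved after compaction

   def should_compact(total_tokens: int) -> bool:
       return total_tokens >= COMPACT_THRESHOLD_TOKENS

   def compact(turns: list[dict]) -> list[dict]:
       """Drop everything except the most recent TURNS_TO_KEEP turns."""
       return turns[-TURNS_TO_KEEP:]

   assert not should_compact(42_000)  # old Nano 20K rule would have fired here
   assert should_compact(50_000)
   assert len(compact([{"role": "user"}] * 30)) == 24
   ```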

5. **Background agent output: 2K → 8K tokens** — "super" tier falls back to "nano" budget (2K tokens). Beast ran at 16K. Bump to 8K — reasonable for Gemma 4 at 50 tok/s.
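   The fallback bug in miniature — with the "super" tier removed, budget lookups silently fell through to the "nano" floor. Names and values are illustrative, not the real `agent_context.py` structures:

   ```python
   # Hypothetical tier-budget table after the "super" entry (16384) is deleted.
   OUTPUT_BUDGETS = {"nano": 2048, "gemma-4-26b": 8192}

   def output_budget(tier: str) -> int:
       # The buggy fallback: any unknown tier silently gets the smallest budget.
       return OUTPUT_BUDGETS.get(tier, OUTPUT_BUDGETS["nano"])

   assert output_budget("super") == 2048         # the silent 2K regression
   assert output_budget("gemma-4-26b") == 8192   # the fix: an explicit 8K entry
   ```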

6. **browser_tasks.py: RuntimeError was a hard crash** — The approval workflow (coffee/TWF/Cremeitalia) would crash with `RuntimeError("BEAST_LLM_BASE_URL not configured")` on every order. Fixed: use `LLAMACPP_BASE_URL`, set `enable_thinking: False`.
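   A hedged sketch of the fixed call path — the endpoint path, payload shape, and `enable_thinking` placement are assumptions, not the real `browser_tasks.py` API:

   ```python
   import os

   def build_llm_request(prompt: str) -> tuple[str, dict]:
       """Build the (hypothetical) request for the browser-task approval LLM."""
       base_url = os.environ.get("LLAMACPP_BASE_URL")
       if base_url is None:
           # Fail loudly at call time instead of crashing every order with the
           # old RuntimeError("BEAST_LLM_BASE_URL not configured").
           raise RuntimeError("LLAMACPP_BASE_URL not configured")
       payload = {
           "messages": [{"role": "user", "content": prompt}],
           "enable_thinking": False,  # Gemma 4 runs without a thinking phase here
       }
       return f"{base_url}/v1/chat/completions", payload

   os.environ["LLAMACPP_BASE_URL"] = "http://titan:8000"  # example value
   url, payload = build_llm_request("Approve the coffee order?")
   assert url == "http://titan:8000/v1/chat/completions"
   assert payload["enable_thinking"] is False
   ```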

7. **proactive-triage.yaml: runtime YAML file** — `on_complete: triage_to_beast` in `~/.her-os/annie/agents/proactive-triage.yaml` is NOT in git. Must be updated manually on Titan. If missed, all proactive ACT escalations silently drop.

8. **5 test assertions hardcode "nemotron"** — In `test_graphiti_client.py` (3), `test_chronicler.py` (1), `test_chronicler_llm.py` (1). Will block CI immediately after config change.

9. **Backend.BEAST enum deletion must be atomic** — All callers (`resource_pool.py`, `text_llm.py`, `bot.py`, tests) must be updated in the same commit, or imports crash at load time.
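   An illustrative after-state (member names are guesses; the real definitions live in `resource_pool.py` and its callers):

   ```python
   from enum import Enum

   class Backend(Enum):   # after the migration: BEAST removed
       NANO = "nano"
       TITAN = "titan"

   def classify_backend(tier: str) -> str:
       # Any caller still asking for "beast" now raises at lookup time, which
       # is why every call site must change in the same commit.
       return Backend[tier.upper()].name

   assert classify_backend("nano") == "NANO"
   try:
       classify_backend("beast")
   except KeyError:
       pass  # a stale caller fails immediately instead of silently misrouting
   ```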

10. **Stale .env cleanup** — Remove `BEAST_*` vars from `.env` on Titan during deploy. Stale `BEAST_LLM_BASE_URL` causes 3-second latency per text chat request (health probe timeout).
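    A minimal sketch of the sweep — drop every `BEAST_*` assignment from a `.env` file (sample variable names are illustrative; in practice this is a manual edit or a deploy-script step):

    ```python
    def strip_beast_vars(env_text: str) -> str:
        """Return env_text with every BEAST_* assignment line removed."""
        kept = [line for line in env_text.splitlines()
                if not line.lstrip().startswith("BEAST_")]
        return "\n".join(kept)

    sample = "TITAN_URL=http://titan:8000\nBEAST_LLM_BASE_URL=http://beast:8000\n"
    cleaned = strip_beast_vars(sample)
    assert "BEAST_LLM_BASE_URL" not in cleaned  # no more 3 s health-probe stalls
    assert "TITAN_URL" in cleaned               # unrelated vars survive
    ```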

## Files to Modify (ordered by phase)

### Phase 0: Prerequisites
1. `start.sh` — `--gpu-memory-utilization 0.50`, `--max-model-len 65536`
2. `services/context-engine/config.py` — `EXTRACTION_TIMEOUT_S=120`
3. `services/annie-voice/compaction.py` — `ctx_size=65536` for gemma-4-26b preset

### Phase 1: Context Engine
4. `services/context-engine/config.py` — 9 defaults `vllm/nemotron-super` → `vllm/gemma-4-26b`, delete `BEAST_VLLM_BASE_URL`
5. `services/context-engine/llm.py` — delete Beast routing block
6. `services/context-engine/docker-compose.yml` — update defaults, delete Beast URL
7. `services/context-engine/chronicler.py` — 3 process names `llm-nemotron-super-*` → `llm-gemma4-*`

### Phase 2: Annie Voice (~12 files)
8. `agent_context.py` — delete Beast health/client (~200 lines), merge "super" tier, bump output to 8192
9. `text_llm.py` — delete Beast routing (~80 lines), new compaction threshold (50K/24)
10. `resource_pool.py` — delete `Backend.BEAST`, `BackendHealthMonitor` (~80 lines)
11. `browser_tasks.py` — replace Beast client with Titan, fix enable_thinking
12. `proactive_pulse.py` — rename `triage_to_beast` → `triage_to_agent`, fix model_tier metadata
13. `~/.her-os/annie/agents/proactive-triage.yaml` — `on_complete: triage_to_agent`
14. `server.py` — delete `_read_beast_stats()`, simplify stats endpoint
15. `cost_tracker.py` — swap nemotron-super entry for gemma-4-26b
16-19. `router_monitor.py`, `phone_loop.py`, `router_report.py`, `browser_agent_tools.py`, `pytest.ini` — comment/config updates

### Phase 3: Telegram Bot
20. `services/telegram-bot/bot.py` — `_classify_backend()` returns `"NANO"` not `"BEAST"`

### Phase 4: Dashboard (~8 files)
21. `systemStats.ts` — remove Beast column, single Titan layout
22. `creatures/registry.ts` — 3 creature labels + process names
23. `task-queue.html` — remove Beast status display + CSS
24. `synthetic.ts` — update process names
25-28. Comment updates: `styles.css`, `main.ts`, `infraStatus.ts`, `silhouettes.ts`

### Phase 5: Infrastructure
29. `start.sh` — delete `start_beast_llm()`, `check_beast()`, Beast env vars (already has Phase 0 changes)
30. `stop.sh` — delete `stop_beast_services()`
31. `scripts/health_check.sh` — delete Beast check

### Phase 6: Tests (~18 files)
32. `test_graphiti_client.py` — 3 "nemotron" assertions → "gemma"
33. `test_chronicler.py` + `test_chronicler_llm.py` — process name assertions
34. `test_context_inspect.py` — `ctx_size == 65536`
35-47. Annie voice tests (model_routing, resource_pool, orchestration_e2e, arc_e2e, proactive_pulse, text_llm, live_acceptance, router_e2e, router_report, router_alerts, cost_tracker) + telegram tests + dashboard tests

### Phase 7: Fixtures
48-51. `synthetic_audit.json`, `synthetic_events.json`, `generate_kernel_test_data.py`, `test_agent_orchestration.py`

### Phase 8: Docs
52. `docs/RESOURCE-REGISTRY.md` — delete Beast, recalculate Titan budget
53. `CLAUDE.md` — single-machine architecture
54. Config comments — Graphiti routing accuracy
55. Memory files — update migration status

## Start Command

```sh
cat ~/.claude/plans/partitioned-sauteeing-snail.md
```

Then implement the plan phase by phase. All adversarial findings are already addressed in it.

## Verification

1. `pytest` in `services/annie-voice/` — all tests pass
2. `pytest` in `services/context-engine/` — all tests pass
3. `npm test` in `services/context-engine/dashboard/` — all tests pass
4. Deploy via `start.sh` from laptop
5. Voice conversation works
6. Entity extraction runs (check unicorn events in dashboard)
7. Dashboard shows single TITAN column
8. Telegram routes correctly
9. Browser task approval works (test Cremeitalia or coffee order)
10. `nvidia-smi` on Titan: vLLM ~50-55 GB with extraction, ~30 GB idle
11. Long transcript (>32K tokens) extracts successfully
