# Next Session: Gemma 4 E4B Benchmark on Beast

**Created:** 2026-04-15 (session 105 planning)
**Status:** Prompt draft — execution NOT yet started
**Predecessor:** `docs/NEXT-SESSION-GEMMA4-E4B-BENCH.md` (Panda version, executed 2026-04-14 session 104)
**Predecessor results:** `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md` — Panda measured 47.9 ms p50 / 20.9 Hz / 100% schema pass

---

## What (one paragraph)

Run the same Gemma 4 E4B Q4_K_M benchmark as session 104 (nav + describe workloads, same image, same production prompt from `services/panda_nav/server.py:131-145`) on **Beast** (NVIDIA GB10 Superchip, aarch64, SM_121, 128 GB unified memory, 192.168.x.x — `ssh beast`) instead of Titan. Beast has a stale 9-day-old `vllm-nemotron-super` container consuming 92 GB that must be stopped first (per `RESOURCE-REGISTRY.md:210` session 449 change log, Nemotron Super was supposed to be retired — verify with user before stopping). The goal is to measure E4B on unified-memory Blackwell silicon (same class as Titan) without disturbing Titan's production 26B workload.

## Why Beast, not Titan

Titan is actively serving Gemma 4 26B NVFP4 at ~69 GB VRAM peak for the entire her-os stack (context engine, voice, WhatsApp, phone, dashboard). An E4B benchmark there risks latency interference on production traffic. Beast has been idle (save the stale Nemotron container) since session 449 and is hardware-identical to Titan — a clean alternative with zero production risk.

## Deltas from Panda plan (session 104)

Read `~/.claude/plans/logical-juggling-parasol.md` first — that plan's 7-phase structure carries over. The following are the ONLY differences for Beast:

| Aspect | Panda (session 104) | Beast (this session) |
|---|---|---|
| Host | `ssh panda` (RTX 5070 Ti, 16 GB discrete VRAM, aarch64 userland on x86_64… verify) | `ssh beast` (GB10 Superchip, 128 GB unified RAM, aarch64) |
| SM version | 12.0 (Blackwell consumer) | 12.1 (Blackwell datacenter/Grace) |
| Free VRAM before benchmark | ~2.2 GB (tight) | 0 GB initial — must stop `vllm-nemotron-super` first to free 92 GB |
| Port | :11437 (avoid panda-nav on :11436) | :11437 — all 11435-11437 currently free on Beast |
| her-os repo on host | Yes, `~/workplace/her/her-os/` | **NO — not cloned.** Options: (a) `scp scripts/benchmark_gemma4_e4b_nav_panda.py beast:/tmp/` and run locally with `python3`, (b) run benchmark FROM laptop against `http://beast:11437` (Beast binds `--host 0.0.0.0` — new risk: unauth endpoint on LAN) |
| `hf` CLI | Venv-scoped at `~/workplace/her/her-os/.venv/bin/hf` | **NOT installed.** Options: (a) install `pip install huggingface_hub[cli]` to a fresh venv, (b) `scp` weights from Panda's `~/gguf-gemma4-e4b/` (still there — 5.6 GB) |
| Camera image | `~/car-view.jpg` on Panda | **NOT on Beast.** Must `scp panda:~/car-view.jpg beast:~/` first. Same frame is required for apples-to-apples with Panda |
| Chatterbox interference | Must stop before, restore after | N/A — Beast has no Chatterbox |
| Phone daemon risk | User-accepted mute-TTS window | N/A — Beast has no phone daemon |
| TTS smoke test in Phase 6 | Required | **NOT required** — no TTS service to restore |
| Restoration scope | Chatterbox + E2B + phone daemon liveness | `vllm-nemotron-super` restart only IF user confirms it was production (see Pre-execution gates) |
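If the run-from-laptop option from the her-os repo row is chosen, the request shape is the same one the session 104 script sends. A minimal sketch — the hostname, field layout, and prompt handling here are assumptions; the authoritative version is `scripts/benchmark_gemma4_e4b_nav_panda.py`:

```python
import base64
import json
import urllib.request

BEAST_URL = "http://beast:11437/v1/chat/completions"  # laptop-side run, option (b)

def build_payload(image_b64: str, prompt: str) -> dict:
    """OpenAI-compatible chat payload for the llama.cpp server.
    Field shapes mirror the session 104 script; the real production
    prompt lives in services/panda_nav/server.py:131-145."""
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        # Honored only when the server was started with --jinja;
        # without it, thinking mode stays ON and inflates p50.
        "chat_template_kwargs": {"enable_thinking": False},
        "temperature": 0.0,
    }

def nav_request(image_path: str, prompt: str) -> dict:
    """Send one nav/describe request and return the parsed JSON reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    req = urllib.request.Request(
        BEAST_URL,
        data=json.dumps(build_payload(b64, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```

Remember the `--host 0.0.0.0` / unauth-on-LAN trade-off from the table above applies only to this laptop-side path.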

## Pre-execution gates (MUST pass before touching Beast)

### Gate 1 — Confirm `vllm-nemotron-super` is safe to stop
```bash
ssh beast 'docker ps --format "{{.Names}} {{.Status}} {{.Command}}" | grep nemotron'
ssh beast 'docker logs --tail 50 vllm-nemotron-super 2>&1 | tail -20'
ssh beast 'docker inspect vllm-nemotron-super | grep -iE "restart|labels|env" | head -20'
```
Ask user: "RESOURCE-REGISTRY claims Beast was freed in session 449, but `vllm-nemotron-super` has been running 9 days on 92 GB. Is this container still in use? Stop it? Restart after? Or leave it be and cancel the Beast benchmark?" **Do not proceed to Phase 3 until user responds.** This is the load-bearing decision of the session.

### Gate 2 — Verify llama.cpp aarch64 container support
```bash
ssh beast 'docker manifest inspect ghcr.io/ggml-org/llama.cpp:server-cuda 2>&1 | head -30'
```
Expected: manifest with `"architecture": "arm64"` entry. If x86_64-only: **abort** and recommend building llama.cpp from source (session 89 zenoh pattern) or switching to vLLM. Do not invent a CUDA wheel path mid-session.
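The manifest check can be scripted rather than eyeballed — a sketch that assumes the multi-arch manifest-list JSON shape (`"manifests"` array with per-entry `"platform"` objects); a single-arch image without that array is treated as no proof of arm64:

```python
import json

def supports_arm64(manifest_json: str) -> bool:
    """Check `docker manifest inspect` output for an arm64 entry.

    Multi-arch images expose a top-level "manifests" list; if it is
    absent (single-arch image), we conservatively report False.
    """
    data = json.loads(manifest_json)
    entries = data.get("manifests", [])
    return any(
        m.get("platform", {}).get("architecture") == "arm64"
        for m in entries
    )
```

Usage would be piping the `ssh beast 'docker manifest inspect …'` output into this check; a `False` result triggers the abort path above.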

### Gate 3 — Model weights strategy
Either (a) `ssh beast 'pip install --user huggingface_hub[cli]'` then download fresh (~4 min for 5.6 GB), OR (b) `scp -r panda:~/gguf-gemma4-e4b/ beast:~/`. Option (b) is faster and avoids re-downloading weights we already have — preferred if ssh panda↔beast is routable.

### Gate 4 — Camera image copy
```bash
scp panda:~/car-view.jpg beast:~/
ssh beast 'stat -c "%y %s" ~/car-view.jpg'
```
Same frame = apples-to-apples with Panda. Do not substitute a different image.

### Gate 5 — Port availability
```bash
ssh beast 'ss -tlnp | grep -E ":(11435|11436|11437)"'
```
All three are currently free. Verify again immediately before launch (Beast has no known services on those ports, but confirm they are idle).

## Key design decisions inherited from session 104 (DO NOT revert)

1. **`--jinja` flag mandatory** — without it, `chat_template_kwargs: {"enable_thinking": false}` is ignored and Gemma 4 thinking mode stays ON, inflating p50 with hidden thinking tokens.
2. **`--host 127.0.0.1` for the container IF benchmark runs on Beast** — if instead you run benchmark from laptop, must use `--host 0.0.0.0` (acknowledge auth risk; tear down container at session end).
3. **Parse strategy 1 or 2 is the schema-pass bar**; strategy 3 (loose fallback) is a schema miss.
4. **Discard first sample as warm-up** — cold-start on first request.
5. **Fork `scripts/benchmark_gemma4_e4b_nav_panda.py` minimally** — the script from session 104 already targets port 11437 and parses the production VLM response. Only change needed: `LLAMACPP_URL` if the benchmark runs from the laptop; otherwise unchanged. `scp` it to Beast or run it from the laptop per the her-os repo delta in the table above.

## VRAM / RAM gates (revised for unified memory)

Beast has **unified memory** — no separate GPU VRAM pool. `nvidia-smi --query-gpu=memory.used` returns "Not Supported" on GB10. Instead:
- Use `nvidia-smi --query-compute-apps=pid,process_name,used_memory` (still works for per-process GPU allocation)
- Cross-reference with `free -h` for system RAM pressure
- Canonical abort gate: E4B should consume **<8 GB process-reported GPU memory**. If >8 GB, same quantization-trap red flag as Panda plan.
- No lower-band check (Panda plan's 5.5 GB floor was based on discrete-VRAM KV cache estimates; unified memory behaves differently — report footprint as data, don't gate on lower bound).
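The abort gate above can be applied mechanically to the per-process query output. A sketch assuming the `--format=csv,noheader` line shape (`pid, process_name, used_memory`) and a `llama`-named server process — both assumptions, verify against actual output:

```python
def e4b_memory_gate(csv_lines, limit_mib=8192):
    """Apply the <8 GB abort gate to `nvidia-smi
    --query-compute-apps=pid,process_name,used_memory --format=csv,noheader`
    output lines. Returns (gate_passed, used_mib); (None, None) means the
    E4B process was not found and needs investigation before proceeding."""
    for line in csv_lines:
        pid, name, mem = (f.strip() for f in line.split(",", 2))
        if "llama" not in name:
            continue
        used_mib = int(mem.split()[0])  # e.g. "5900 MiB" -> 5900
        return used_mib < limit_mib, used_mib
    return None, None
```

A `False` result is the same quantization-trap red flag as the Panda plan; per the note above, a low reading is reported as data, not gated.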

## Files to create/modify

| Path | Change |
|---|---|
| `scripts/benchmark_gemma4_e4b_nav_beast.py` | **NEW** — probably just a symlink or copy of `benchmark_gemma4_e4b_nav_panda.py` with `LLAMACPP_URL` updated if benchmark runs from laptop. Keep script identical if run on Beast locally. |
| `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md` | Add a "Beast measurements" section alongside the existing Panda section. |
| `docs/RESOURCE-REGISTRY.md` | Change Log entry for the benchmark. If user decides to retire Nemotron Super properly: update Active Models table. |
| `MEMORY.md` | Session 105 summary. |

## Verification (adapted from Panda plan)

All must pass at end:
1. `docker rm -f panda-llamacpp-e4b` on Beast succeeded (container gone).
2. If Nemotron Super was stopped: restore per user's Gate 1 answer.
3. Benchmark JSON output saved locally (`scp beast:~/benchmark-results/*.json ./benchmark-results/beast-e4b-YYYY-MM-DD/`).
4. Schema pass rate recorded.
5. p50/p95/p99/tok-s recorded and compared to Panda's 47.9 ms p50 / 62.5 tok/s.

**Decision to make at session end**: is Beast E4B materially different from Panda E4B? If p50 changes >20% or schema pass drops below 95%, investigate (driver difference, KV cache config, etc.). If within noise: this confirms the E4B adoption decision is architecture-independent, and the user picks Panda vs Beast by deployment constraints, not quality.
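The warm-up discard, percentile summary, and 20% delta check above can be sketched as one helper (the 47.9 ms baseline is from the session 104 results; the quantile method is an implementation choice, not taken from the session 104 script):

```python
import statistics

def summarize(latencies_ms, panda_p50=47.9):
    """Drop the first (cold-start) sample, compute p50/p95/p99, and flag
    a >20% p50 delta vs the Panda baseline for investigation."""
    warm = sorted(latencies_ms[1:])  # design decision 4: discard warm-up
    q = statistics.quantiles(warm, n=100, method="inclusive")
    p50, p95, p99 = q[49], q[94], q[98]
    return {
        "p50": p50,
        "p95": p95,
        "p99": p99,
        "investigate": abs(p50 - panda_p50) / panda_p50 > 0.20,
    }
```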

## Start Command

```bash
cat docs/NEXT-SESSION-GEMMA4-E4B-BEAST-BENCH.md
cat ~/.claude/plans/logical-juggling-parasol.md   # Panda plan (structural reference)
```

Then run Gate 1 (ask user about Nemotron Super) before anything else. Gates 2-5 can run in parallel (all read-only). Phases 3-7 same as Panda plan with the Beast-specific deltas above.

## References

- Panda plan (authoritative structure): `~/.claude/plans/logical-juggling-parasol.md`
- Panda measured results: `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md` "Measured benchmark results"
- Benchmark script: `scripts/benchmark_gemma4_e4b_nav_panda.py` (324 lines, session 104)
- Registry drift to confirm: `docs/RESOURCE-REGISTRY.md:210` session-449 Change Log
- llama.cpp aarch64 compatibility: verify via `docker manifest inspect` in Gate 2
