# Next Session: Gemma 4 E4B Benchmark on Panda

**Created:** 2026-04-14 (planning session)
**Status:** Plan approved, ready to execute.
**Plan file (authoritative):** `~/.claude/plans/logical-juggling-parasol.md`

---

## What (one paragraph)

Benchmark Gemma 4 E4B (`unsloth/gemma-4-E4B-it-GGUF`, Q4_K_M) on Panda's RTX 5070 Ti (16 GB, Blackwell SM 12.0) against the existing E2B baseline (18.4 ms p50 / 54 Hz, session 67). Two workloads: nav decision (production `build_vlm_prompt`) and vision describe, both on the same camera image (`~/car-view.jpg`). A temporary docker container `panda-llamacpp-e4b` serves E4B on port **11437**, with Chatterbox and the E2B llama-server stopped for the ~8–18 min window; the user accepted the TTS-mute risk during that window, and the phone daemon stays up. Compare quality and latency; decide whether to adopt E4B for nav.
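
For orientation only, the container invocation implied by the design decisions listed below probably looks something like this. The authoritative command is in the plan file; the image tag, model path, volume mount, and `-ngl` value here are assumptions, while the port, `--host`, `--jinja`, and container name come from this doc:

```bash
# Sketch, NOT the plan's command. Host networking keeps --host 127.0.0.1
# meaningful inside the container (no LAN exposure, per design decision 3).
docker run -d --name panda-llamacpp-e4b --gpus all --network host \
  -v ~/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/gemma-4-E4B-it-Q4_K_M.gguf \
  --host 127.0.0.1 --port 11437 --jinja -ngl 99
```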

## Plan

Read the full plan first — it has the 7-phase execution sequence, state machine, pre-mortem, and all review-driven design decisions:

```bash
cat ~/.claude/plans/logical-juggling-parasol.md
```

The plan went through `planning-with-review` adversarial review: **7 HIGH + 6 MEDIUM issues found and fixed**. All findings are already addressed in the plan. Do not regress them.

## Key Design Decisions (from adversarial review — DO NOT revert)

1. **Port 11437, not 11436.** `services/panda_nav/server.py:70` already binds :11436 for the nav FastAPI sidecar. Using 11436 would cause a silent docker-bind failure.
2. **`--jinja` flag on the docker run command is mandatory.** Without it, `chat_template_kwargs: {"enable_thinking": false}` is ignored and Gemma 4 runs with thinking ON — p50 numbers become meaningless (inflated by hidden thinking tokens).
3. **`--host 127.0.0.1`, not 0.0.0.0.** Benchmark runs locally on Panda; no LAN exposure of an unauth LLM endpoint.
4. **`huggingface-cli download ... --include "file1" --include "file2"` syntax**, not `hf download` with positional filenames. The project uses `huggingface-cli` (see `docs/RESEARCH-PANDA-NAV-VLM-54HZ.md:431-432`).
5. **VRAM canonical gate: healthy 5.5–7.5 GB, warning 7.5–8.0 GB, abort if >8 GB.** The 6.9B active-param E4B has larger KV cache overhead than 2.3B E2B; bare-weights numbers underestimate.
6. **Phone TTS failure mode is MUTE-NOT-CRASH.** Verified in `services/annie-voice/chatterbox_tts.py:159-161`: `(httpx.TimeoutException, httpx.HTTPError)` → `yield TTSStoppedFrame()`. Daemon stays up; calls are answered; every sentence drops to silence. Worse UX than a hang. Post-restore TTS smoke test (synthesize phrase → verify WAV) is mandatory.
7. **Restoration order: E4B `docker rm` → 5-sec VRAM-drain wait → E2B `systemctl start` → Chatterbox `./start.sh` → phone daemon liveness check → TTS synthesize smoke test.** Skipping the drain wait races CUDA context release against the E2B restart.
8. **Extension 20→100 is a statistical refinement, not a quality gate.** Run 100 unless VRAM drifted out of the canonical band OR any sample tripped the `<think>` detector. Do not gate on p50.
9. **Nav uses `services/panda_nav/server.py:131-145` `build_vlm_prompt(goal="something interesting")` verbatim.** Max tokens 20. Parser strategy-1-or-2 must pass on smoke test (strategy-3 loose fallback counts as failure — model isn't following the contract).
10. **Phase 0 contract verification aborts the run before touching services** if: passwordless sudo fails, `chatterbox_tts.py:159` no longer exists (assumption invalidated), :11437 is occupied, or :11436 doesn't show `panda-nav`.
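
The canonical VRAM gate from decision 5 can be sketched as a small classifier over the measured reading; `vram_gate` is a hypothetical name, and only the thresholds come from the plan (readings below 5.5 GB are outside the canonical band and treated here as suspicious):

```bash
# Classify a VRAM reading (in GB) against the canonical E4B gate:
# healthy 5.5-7.5 GB, warning 7.5-8.0 GB, abort above 8.0 GB.
vram_gate() {
  awk -v gb="$1" 'BEGIN {
    if (gb > 8.0)       print "abort"
    else if (gb > 7.5)  print "warning"
    else if (gb >= 5.5) print "healthy"
    else                print "below-band"  # model may not be fully resident
  }'
}

vram_gate 6.9  # prints "healthy"
```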

## Files to Modify

| Path | Change |
|---|---|
| `scripts/benchmark_gemma4_e4b_nav_panda.py` | **NEW.** Fork from `scripts/benchmark_nav_rate_llamacpp.py`. Minimal diff: URL → `http://127.0.0.1:11437/v1/chat/completions`, `MODEL_LABEL = "gemma-4-E4B-it-Q4_K_M"`, output path, request body adds `"chat_template_kwargs": {"enable_thinking": false}`, response validates `content` non-empty and no `<think>` marker, `--workload {nav,describe}` CLI flag. |
| `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md` | Post-benchmark. Replace "no benchmarks exist yet" with measured numbers (p50/p95/p99, tok/s, effective Hz, VRAM footprint, schema adherence). |
| `docs/RESOURCE-REGISTRY.md` | Post-benchmark. Add E4B line to Active Models table. Recalculate steady-state peak. Add Change Log entry. |
| `MEMORY.md` | Post-benchmark session summary. |
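
The response-validation rule for the new script (non-empty `content`, no `<think>` marker) reduces to a tiny helper. This is an illustrative sketch with a hypothetical function name, not the script's actual code:

```bash
# Validate one benchmark sample's content field. Returns "ok" for a usable
# sample, a tagged failure otherwise (empty content or a leaked think block).
validate_sample() {
  content=$1
  if [ -z "$content" ]; then
    echo "fail:empty"; return 1
  fi
  case $content in
    *"<think>"*) echo "fail:think-marker"; return 1 ;;
  esac
  echo "ok"
}

validate_sample "turn left toward the doorway"  # prints "ok"
```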

No production code (`services/panda_nav/`, `services/annie-voice/`, `start.sh`, `stop.sh`) is modified. No systemd unit is edited. Benchmark container is torn down at session end.

## Start Command

```bash
cat ~/.claude/plans/logical-juggling-parasol.md
```

Then execute Phase 0 → Phase 7 in order. Every step has a concrete shell command in the plan.

**Critical pre-execution checks** (run all four; skipping any of them means abort):

```bash
ssh panda 'sudo -n systemctl status panda-llamacpp' >/dev/null 2>&1
ssh panda 'grep -n "TTSStoppedFrame" ~/workplace/her/her-os/services/annie-voice/chatterbox_tts.py' | grep -E "^(159|160|161):"
ssh panda 'ss -tlnp 2>/dev/null | grep -E ":(11437)"'  # must be empty
ssh panda 'ss -tlnp 2>/dev/null | grep -E ":(11436)"'  # must show panda-nav/uvicorn
```

Stop and investigate before touching services if check 1 or 2 fails, if check 3 produces any output (:11437 occupied), or if check 4 shows nothing (panda-nav missing from :11436) — these are the Phase 0 abort conditions from design decision 10. Note that check 3 is inverted: a non-matching grep (empty output) is the pass condition.
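
Fed a captured `ss -tlnp` listing, the two port contracts reduce to a pure helper that can be exercised offline. A sketch with a hypothetical function name and synthetic output format, not from the plan:

```bash
# Evaluate the two port contracts against captured `ss -tlnp` output:
# :11437 must be free; :11436 must be held by the panda-nav uvicorn process.
check_ports() {
  ss_out=$1
  if printf '%s\n' "$ss_out" | grep -q ':11437 '; then
    echo "abort: 11437 already occupied"; return 1
  fi
  if ! printf '%s\n' "$ss_out" | grep ':11436 ' | grep -Eq 'uvicorn|panda-nav'; then
    echo "abort: 11436 not held by panda-nav"; return 1
  fi
  echo "ports ok"
}
```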

## Verification

All eight bullets in the plan's "Rollback verification" section must pass:

1. `nvidia-smi`: 13.0–13.8 GB used, three expected processes (phone, chatterbox, llama-server E2B)
2. `curl :11435/health` → ok (E2B restored)
3. `curl :11436/health` → ok (panda-nav untouched)
4. `curl :11437/health` → **connection refused** (E4B container gone)
5. `curl :8772/health` → `"model_loaded": true` (Chatterbox restored)
6. `curl :8770/...` phone API alive
7. **TTS synthesize smoke test produces a valid WAV > 5 KB**:
   ```bash
   ssh panda 'curl -sf -X POST http://localhost:8772/synthesize \
     -H "Content-Type: application/json" \
     -d "{\"text\": \"test one two\"}" \
     --output /tmp/tts-test.wav && file /tmp/tts-test.wav'
   ```
8. `ssh panda 'pgrep -f "phone_call.py auto"'` returns a PID

Plus:
- Benchmark JSON outputs saved to local `./benchmark-results/` via `scp`
- `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md` updated with measured numbers
- `docs/RESOURCE-REGISTRY.md` Change Log entry added
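
Verification item 7's "valid WAV > 5 KB" can be checked mechanically once the file is pulled back. A local sketch (`check_wav` is a hypothetical name; the 5 KB threshold is from this doc, and the RIFF-header check is an added sanity test):

```bash
# Validate the TTS smoke-test artifact: must exist, start with a RIFF
# container header, and exceed 5 KB (5120 bytes).
check_wav() {
  f=$1
  [ -f "$f" ]                      || { echo "fail: missing";   return 1; }
  [ "$(head -c 4 "$f")" = "RIFF" ] || { echo "fail: not a WAV"; return 1; }
  [ "$(wc -c < "$f")" -gt 5120 ]   || { echo "fail: truncated"; return 1; }
  echo "ok"
}
```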

## Decision to Make at Session End

After numbers are in, decide:
- **Adopt E4B for nav?** Only if schema adherence ≥95% AND qualitative describe output is materially better than E2B AND VRAM footprint stays in the 5.5–7.5 GB band. Latency is not a gate (user decision, session 102 planning).
- **Benchmark NVFP4 next?** Deferred per user's Option A in the earlier clarification. NVFP4 (~10 GB) requires stopping phone daemon too. Scheduled for a separate session.
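
The mechanical half of the adopt decision can be sketched as a gate over the measured numbers; thresholds are from the bullet above, `adopt_gate` is a hypothetical name, and the qualitative describe comparison remains a human judgment the gate cannot make:

```bash
# Mechanical adoption gates: schema adherence >= 95% of samples AND
# VRAM footprint inside the 5.5-7.5 GB canonical band. Latency is not a gate.
adopt_gate() {
  awk -v p="$1" -v n="$2" -v v="$3" 'BEGIN {
    if (100 * p / n < 95)   { print "no: adherence below 95%"; exit }
    if (v < 5.5 || v > 7.5) { print "no: VRAM outside 5.5-7.5 GB"; exit }
    print "gates pass (qualitative review still required)"
  }'
}

adopt_gate 96 100 6.9  # prints "gates pass (qualitative review still required)"
```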

## Time Budget

**Total wall-clock: ~19–37 min.** TTS-down window: ~8–18 min. Hard 25-min overall cap; emergency rollback triggers if exceeded.

## References

- Plan (authoritative): `~/.claude/plans/logical-juggling-parasol.md`
- Prior research: `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md` (session 101 variant catalog)
- E2B baseline: `docs/RESEARCH-PANDA-VLM-INFERENCE-STACK.md` (session 67, 18.4 ms p50)
- Resource budget: `docs/RESOURCE-REGISTRY.md` (Panda VRAM table)
- VRAM math + chatterbox-cpu-path context: `docs/RESEARCH-CHATTERBOX-CPU-BENCHMARK.md`
- Adversarial-review skill: `planning-with-review` (all findings in plan's Feedback Response Table)
