# Research — Panda VLM Inference Stack for Annie's Car Navigation

**Date:** 2026-04-12 (session 67)
**Status:** **RESOLVED — llama.cpp `llama-server` wins with 18.4 ms p50 (54 Hz)**
on Panda using upstream `unsloth/gemma-4-E2B-it-GGUF` Q4_K_M + mmproj-F16.

---

## Outcome (updated end of session 67)

**Winner: Option C (llama.cpp `llama-server` direct)** with the Docker image
`ghcr.io/ggml-org/llama.cpp:server-cuda` serving upstream Unsloth GGUF files.

**Measured against Ollama baseline (5 configs × 20 runs each):**

| Test | llama.cpp p50 | Ollama p50 | Speedup | Effective Hz |
|---|---:|---:|---:|---:|
| Full 554-token nav prompt + image | **20.5 ms** (min 18.4) | 159.1 ms | 7.8× | 48.9 |
| Full prompt repeat (cache warm) | **18.6 ms** | 157.8 ms | 8.5× | 53.8 |
| Full prompt streaming (TTFT) | **18.4 ms** (TTFT 14.4) | 156.5 ms | 8.5× | **54.5** |
| Short 310-token prompt | 33.2 ms | 156.2 ms | 4.7× | 30.1 |
| Short prompt streaming | 33.2 ms (TTFT 27.8) | 155.1 ms | 4.7× | 30.2 |

**Vision quality verified** via live inference on the car's actual camera frame:
*"I see an indoor scene with a light green floor, dark wooden furniture, and a
light green curtain on the left"* — matches the scene on independent visual
inspection. No quality regression from Q4_K_M quantization.

**Key insights discovered in deployment:**
1. **The bottleneck was never compute** — it was Ollama's ~110 ms per-request
   Go wrapper (HTTP marshal, subprocess IPC, JSON double-encode, kernel launch
   serialization). Same llama.cpp engine underneath, 8.5× speedup from removing
   the wrapper.
2. **Prefix caching dominates performance** for repeated-structure prompts.
   Tests A/B/C share a 554-token prefix, and llama-server's cache kept the
   entire text warm between calls; only the image encoder (~14 ms) and the
   first decode (~4 ms) re-run on each call, for ≈ 18 ms total. The real car
   nav loop uses the same prompt structure cycle-to-cycle, so real-world
   performance should match.
3. **Counterintuitive: short prompt is SLOWER than full prompt** because the
   full prompt's cache is warmer across the test series. Once we land in
   production, the prompt is fixed and this won't matter.
4. **Thinking mode must be disabled via `chat_template_kwargs: {"enable_thinking":
   false}`** in the request body. `reasoning_effort: "none"` does NOT work for
   Gemma 4 on llama-server — it routes output to `reasoning_content` and empties
   the visible content field (confirmed empirically).

## Dead end: vLLM option

Attempted `vllm/vllm-openai:gemma4` (official Gemma 4 image, 22.5 GB) with
`unsloth/gemma-4-E2B-it-unsloth-bnb-4bit` (7.5 GB HF download). **OOM at model
load**: PyTorch had already allocated 6.25 GB for the weights alone, leaving
only 90 MB free when it tried to allocate another 128 MB for the KV cache.

**Critical lesson: "4-bit" in a model filename does NOT mean 4 bits for everything
in multimodal models.** Unsloth's bnb-4bit variant keeps:
- Vision encoder (SigLIP): FP16
- Audio encoder (Conformer, unused but can't be skipped at load time): FP16
- Embeddings layer: FP16
- LM head: FP16
- Only the transformer body is 4-bit

Net VRAM footprint: **6.25 GB** (not the ~3 GB a naive read of "bnb-4bit" suggests).
With `phone_call.py` (5.2 GB) + Chatterbox (3.6 GB) both resident and
non-stoppable, Panda's 16.3 GB cap cannot host vLLM with this variant — even
with `--gpu-memory-utilization` dialed down, the model weights are mandatory.

**By contrast, GGUF Q4_K_M is a much more VRAM-compact format for this workload**:
the language-model weights are quantized uniformly to Q4 with only small
overhead tables, and the F16 vision projector adds ~1 GB, so the in-VRAM
footprint matches the on-disk size (~2.2 GB weights + ~1 GB mmproj = 3.2 GB
total in VRAM).

## Failed first attempt: Ollama's GGUF blob

Tried to point llama-server at Ollama's existing GGUF blob
(`/usr/share/ollama/.ollama/models/blobs/sha256-4e30e2665218...`) to avoid
re-downloading. Failed with:

```
llama_model_load: error loading model: done_getting_tensors:
  wrong number of tensors; expected 2012, got 601
```

Ollama's GGUF header claims 2012 tensors (confirmed via direct hex dump —
standard GGUF magic, version 3, tensor count field at offset 8 reads 0x7dc =
2012) but the blob file only physically contains 601 tensors. **Ollama chunks
or streams tensors across multiple internal sources in a way upstream
llama.cpp's loader doesn't understand.** Practical implication: **don't reuse
Ollama's blobs for llama-server** — download directly from HuggingFace.
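
The header check above is easy to reproduce. A minimal sketch, assuming only the fixed GGUF v3 header layout (4-byte magic, uint32 version at offset 4, uint64 tensor count at offset 8); it does not parse the metadata that follows:

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, tensor count."""
    with open(path, "rb") as f:
        magic = f.read(4)                            # b"GGUF"
        version, = struct.unpack("<I", f.read(4))    # uint32 at offset 4
        n_tensors, = struct.unpack("<Q", f.read(8))  # uint64 at offset 8
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "n_tensors": n_tensors}
```

Against Ollama's blob this reports the claimed tensor count (0x7dc = 2012), which is how the header/content mismatch was confirmed.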

## Deployment command (reproducible)

```
docker run -d --name panda-llamacpp --gpus all --network host \
  -v ~/gguf-gemma4-e2b:/models:ro \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  --model /models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj /models/mmproj-F16.gguf \
  --port 11435 --host 0.0.0.0 -ngl 999 --ctx-size 4096
```

Files at `panda:~/gguf-gemma4-e2b/`:
- `gemma-4-E2B-it-Q4_K_M.gguf` — 2.9 GB — main model
- `mmproj-F16.gguf` — 940 MB — vision projector

Startup time: ~5 seconds to "server is listening on http://0.0.0.0:11435".

**Important: requests must disable thinking mode** via the OpenAI-compat
extension:
```json
{
  "chat_template_kwargs": {"enable_thinking": false},
  "messages": [...]
}
```
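
A client-side sketch of the full request body (the helper name and the `max_tokens`/`stream` choices are ours, not from the nav code; the schema follows the OpenAI-compat image_url convention llama-server accepts):

```python
import base64

def build_nav_request(prompt: str, jpeg_bytes: bytes,
                      model: str = "gemma-4-E2B-it-Q4_K_M.gguf") -> dict:
    """Build an OpenAI-compat /v1/chat/completions body for llama-server,
    with Gemma thinking mode disabled via chat_template_kwargs."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,
        "chat_template_kwargs": {"enable_thinking": False},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 8,   # nav decisions are short action tokens
        "stream": True,    # TTFT is the latency that matters at ~54 Hz
    }
```

POST this to `http://192.168.68.57:11435/v1/chat/completions`; without the `chat_template_kwargs` line, Gemma 4 routes output to `reasoning_content` and the visible content comes back empty.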

## Remaining work (not yet done)

1. **Persistence**: `panda-llamacpp` is a one-off `docker run -d`, no restart
   policy. Convert to systemd unit or docker-compose with `restart: always`.
2. **Pi 5 env swap**: update turbopi-server systemd override to point at
   `NAV_VLLM_URL=http://192.168.68.57:11435/v1` with
   `NAV_VLLM_MODEL=gemma-4-E2B-it-Q4_K_M.gguf`.
3. **E2E test**: `/explore this room` via Telegram with the new endpoint,
   measure cycle-to-cycle timing vs the old Titan 26B path.
4. **Restart IndicF5** on Panda (it was stopped during the vLLM investigation
   and is still offline — Mom's Kannada calls broken).
5. **Cleanup**: delete the unused `vllm/vllm-openai:gemma4` image (22.5 GB)
   and `unsloth/gemma-4-E2B-it-unsloth-bnb-4bit` HF cache (7.5 GB) on Panda.
6. **Resource registry**: add a `panda-llamacpp` row to
   `docs/RESOURCE-REGISTRY.md` (3.2 GB VRAM, port 11435, creature TBD).
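
For item 1, a docker-compose sketch equivalent to the `docker run` above (file location and service name are our choice; the flags mirror the deployment command verbatim):

```yaml
# hypothetical services/panda-vlm/docker-compose.yml
services:
  panda-llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: always
    network_mode: host
    volumes:
      - ~/gguf-gemma4-e2b:/models:ro
    command: >
      --model /models/gemma-4-E2B-it-Q4_K_M.gguf
      --mmproj /models/mmproj-F16.gguf
      --port 11435 --host 0.0.0.0 -ngl 999 --ctx-size 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```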

---

## Original research (pre-resolution — kept for context)

**Driver:** Annie's TurboPi robot car currently calls Titan's Gemma 4 26B (port 8003)
for navigation VLM decisions. Moving the nav VLM off Titan frees 26B KV-cache for
voice + extraction contention. Gemma 4 E2B (5.1 GB weights, 2.3B effective MoE) is
the candidate replacement because it has vision, it's 5× smaller, and Panda's
RTX 5070 Ti has spare GPU cycles.

This research answers: **which inference stack should host Gemma 4 E2B on Panda?**
Model choice is fixed — Gemma 4, no fallback to Gemma 3 or Gemma 3n family.
User preference order: **vLLM first, llama.cpp as fallback if vLLM not supported.**
(Confirmed: vLLM does support Gemma 4 — see Options below.)

---

## Hardware map (IMPORTANT — don't confuse these machines)

| | **Panda** (target host for nav VLM) | **Titan** (runs the 26B voice LLM, NOT this nav VLM) |
|---|---|---|
| hostname | `panda` | `physical-ai-titan` |
| arch | **x86_64** | **aarch64** |
| OS | Ubuntu (standard) | Ubuntu 24.04 |
| GPU | **NVIDIA GeForce RTX 5070 Ti** (discrete consumer Blackwell card) | **NVIDIA GB10** (DGX Spark SoC, Blackwell SM_121) |
| Driver | 580.126.09 | 580.142 |
| Compute capability | 12.0 | 12.1 |
| VRAM | 16,303 MiB dedicated | 128 GB unified CPU+GPU memory |
| VRAM free (live check, session 67) | **1,740 MiB** (with benchmark model still in Ollama keep_alive) | 59 GB+ |
| What's already deployed | Voice pipeline: Whisper 6GB + IndicF5 1.3GB + IndicConformer 0.3GB + Kokoro 0.5GB (~8 GB used by PyTorch workers); Ollama with gemma4:e2b (~2.2 GB while keep_alive) | vLLM Gemma 4 26B (15.74 GB + ~48 GB KV cache, port 8003) + Ollama embeddings + Whisper + more |
| CUDA wheel availability | Standard x86_64 PyPI / NGC | aarch64 — requires Blackwell patches for some packages (historical pain point — see MEMORY project_pico_imu_bridge neighbors, session 434 Nemotron Speech Blackwell monkeypatches) |

**Consequence:** Panda is a straightforward x86_64 CUDA 13 host — vLLM and llama.cpp
both ship prebuilt wheels/binaries that work out of the box. Titan is an aarch64
Blackwell SoC with a history of needing custom patches (TorchSTFT, nvrtc, etc.),
so deploying to Titan is a separate research thread and is explicitly NOT the
target of this document.

Any repo or guide that says "DGX Spark" is about Titan, not Panda. Do not copy
those Docker recipes onto Panda.

---

## The bottleneck we're trying to fix

Session 67 rate benchmark (`scripts/benchmark_nav_rate_panda.py`) against the current
Ollama-hosted Gemma 4 E2B on Panda, 5 configs × 20 runs each:

| Metric | Value |
|---|---:|
| Wall clock p50 | **156 ms** |
| Wall clock p95 | 168 ms |
| Effective sustained rate | **6.4 Hz** |
| Prefill time (Ollama reports, 554 tok prompt) | **14 ms** |
| Decode first token (Ollama reports) | **13 ms** |
| Image encoder cost (inferred via text-only A/B) | ~19 ms |
| **Unaccounted "Ollama server floor"** | **~110 ms** |

The bottleneck is **NOT compute** — it's the Ollama server-stack floor (HTTP marshal,
llama.cpp kernel launches, GPU sync, JSON encode/decode). Prefix caching, streaming,
short prompts, and `keep_alive` all converged on the same p50, because the 14 ms
prefill is already dwarfed by a 110 ms per-request fixed tax.

**Compute floor ≈ 27 ms (prefill + decode) ≈ 37 Hz** — tantalizingly close to the
Tesla FSD per-camera 36 Hz target, but **only reachable if the server stack overhead
is removed**.

---

## Target rate context

User's stated goal: **36 Hz (27.8 ms/decision)** decision rate for left/right/forward/brake.
Reference: Tesla FSD per-camera NN inference rate.

Reality check against the car's physical design:
- TurboPi max speed ≈ 1 m/s
- At 36 Hz: 2.8 cm per decision — smaller than the chassis control tolerance
- At 6.4 Hz (current): 15 cm per decision — ~6× over-spec vs Tesla-scaled-by-velocity
- Physical motor command settle time: 50–250 ms, independent of decision rate

**The 36 Hz target is ambitious for a robot car at 1 m/s.** Even if the VLM is that
fast, the classical safety controller (lidar + sonar + safety daemon) already runs
at 100+ Hz natively and handles hard braking — the VLM is a planner, not a reactive
controller. That said, the user asked for the data, and 27 ms compute floor makes
this **barely feasible if the inference stack is right**.
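
The reality-check arithmetic above as a one-liner sketch (speeds and rates from this section):

```python
def cm_per_decision(speed_m_s: float, rate_hz: float) -> float:
    """Distance the car travels between successive VLM decisions."""
    return speed_m_s / rate_hz * 100.0

# TurboPi at ~1 m/s:
assert round(cm_per_decision(1.0, 36.0), 1) == 2.8   # 36 Hz target
assert round(cm_per_decision(1.0, 6.4), 1) == 15.6   # current Ollama baseline
```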

---

## Candidate inference stacks on Panda — Gemma 4 E2B only

Gemma 4 (Google, April 2026, Apache 2.0) has four sizes: E2B, E4B, 26B-A4B MoE, 31B
Dense. E2B is 4.65B parameters, 128K context, supports text + vision + audio (we
only need vision). The Ollama `gemma4:e2b` tag is ~7.16 GB GGUF (confirmed via
registry manifest, SHA `4e30e26652187...`). Hugging Face FP16 weights are ~10 GB.

### Option A — Ollama 0.20.5 (current baseline, NOT the solution)

- **Status**: deployed and benchmarked this session
- **Wall clock**: 156 ms p50 → **6.4 Hz**
- **Why it's slow**: ~110 ms per-request server-stack overhead (Go wrapper, HTTP
  marshal, llama.cpp subprocess IPC, JSON double-encode). NOT the model or GPU.
- **VRAM**: ~2.2 GB resident (Q4-ish GGUF, currently loaded under `keep_alive=10m`)
- **Verdict**: baseline. Everything else below is about beating this.

### Option B — **vLLM with the official Gemma 4 image (PRIMARY PATH)**

- **Gemma 4 support**: **FULL**, via
  [vLLM PRs #38826 and #38847](https://github.com/vllm-project/vllm/pulls). MoE,
  multimodal (vision), reasoning, and tool-use all covered.
- **Official pre-built image**: `vllm/vllm-openai:gemma4`
- **Resources**:
  - [vLLM Gemma 4 announcement blog (2026-04-02)](https://vllm-project.github.io/2026/04/02/gemma4.html)
  - [Official vLLM Gemma 4 recipe](https://github.com/vllm-project/recipes/blob/main/Google/Gemma4.md)
  - [vLLM Gemma 4 setup guide](https://www.gemma4.wiki/install/gemma-4-vllm-setup-guide)
  - [dev.to multimodal Gemma 4 E2B + vLLM walkthrough](https://dev.to/abdulhakkeempa/building-a-multimodal-local-ai-stack-gemma-4-e2b-vllm-and-hermes-agent-k8l)
- **Dependencies**: `transformers>=5.5.0`. For audio (we don't need it):
  `vllm[audio]` extras. Vision + text work without audio extras.
- **Weights format**: HuggingFace safetensors from `google/gemma-4-E2B-it` or similar
  HF repo — **does NOT reuse Ollama's GGUF blob**. A fresh download of ~10 GB FP16.
- **Expected latency**: V1 engine, FlashAttention, paged KV cache, CUDA graphs.
  Per-request overhead typically 5–15 ms (vs Ollama's ~110 ms). **Target wall clock:
  ~30–50 ms (20–33 Hz).** Not guaranteed 36 Hz but closest of all options.
- **VRAM math on Panda (live, end of session 67)**:
  - Panda total: 16,303 MiB (16 GB)
  - `scripts/phone_call.py auto` (Whisper + IndicConformer + Kokoro in ONE PyTorch
    process): **5,158 MiB** — always loaded, critical for phone calls
  - **Chatterbox TTS** (port 8772): **3,654 MiB** — always loaded, **CRITICAL for
    phone call English TTS, CANNOT stop per user**
  - IndicF5 TTS (port 8771): normally 2,864 MiB — **currently STOPPED in session 67
    for this benchmark work**. Mom's Kannada calls will fail until restarted.
  - Ollama gemma4:e2b: already self-unloaded (`keep_alive=10m` expired)
  - **Live VRAM used: 8,996 MiB, free: 6,823 MiB** (with IndicF5 off)
  - When IndicF5 is restarted: free drops back to ~4,000 MiB
- **Locked-in VRAM budget for vLLM**:
  - Must fit alongside phone_call.py (5.2 GB) + Chatterbox (3.6 GB) = **8.8 GB
    baseline**. Neither can be stopped.
  - If IndicF5 is also running (normal state): **11.7 GB baseline, 4.6 GB
    ceiling for vLLM**.
  - If IndicF5 is temporarily off (current state): **8.8 GB baseline, 7.5 GB
    ceiling for vLLM** — but this is not a permanent state, Mom needs Kannada TTS.
- **What actually fits**:
  - **FP16 E2B (~10 GB)**: ❌ does not fit in either 4.6 GB or 7.5 GB.
  - **FP8 E2B (~5 GB weights + ~1.5 GB KV)**: fits in 7.5 GB (IndicF5-off state),
    does NOT fit in 4.6 GB (IndicF5-on state).
  - **4-bit AWQ/GPTQ E2B (~3 GB weights + ~1 GB KV)**: ✅ fits in BOTH states.
    **This is the only configuration that works alongside the full voice pipeline
    in steady state.** Quality loss for 1-token action decisions (`forward`,
    `left`, `right`, etc.) is negligible per 4-bit quantization literature.
- **Deployment order**:
  1. While IndicF5 is off (current state), bring up vLLM with 4-bit quantized
     weights and `--gpu-memory-utilization 0.40` cap.
  2. Benchmark sustained rate against the OpenAI-compat endpoint.
  3. If benchmark passes: restart IndicF5, verify vLLM still has enough headroom
     (total should be ~14 GB used out of 16 GB).
  4. If it OOMs after IndicF5 restart: lower `--gpu-memory-utilization` further,
     or consider stopping one of the TTS servers only during active navigation
     tool calls (service-locality trick).
- **Setup cost**: docker-compose with the `vllm/vllm-openai:gemma4` image + pull
  quantized weights (~3–5 GB). ~20 min first run.
- **Deployment**: new docker-compose next to `services/annie-voice/` or new
  `services/panda-vlm/` folder. Bind to port 11435 on Panda.
- **Verdict**: **PRIMARY PATH per user preference.** Requires the 4-bit weights
  plus a tight `--gpu-memory-utilization` cap to fit. If vLLM's VRAM negotiation
  (`--gpu-memory-utilization 0.25` or similar) can be dialed tight enough, we
  can coexist with the voice pipeline without pauses.
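
The VRAM budget arithmetic above, as a quick sanity-check sketch (all figures come from the bullet list; the helper and dict names are ours):

```python
PANDA_TOTAL_GB = 16.3

# Always-resident voice services (cannot be stopped per user):
RESIDENT_GB = {
    "phone_call.py (Whisper+IndicConformer+Kokoro)": 5.2,
    "chatterbox-tts (port 8772)": 3.6,
}
INDICF5_GB = 2.9  # normally resident; stopped during session 67

def vllm_ceiling_gb(indicf5_running: bool) -> float:
    """Free VRAM left for vLLM given which TTS services are resident."""
    baseline = sum(RESIDENT_GB.values())
    if indicf5_running:
        baseline += INDICF5_GB
    return round(PANDA_TOTAL_GB - baseline, 1)
```

This reproduces the two ceilings in the list: 7.5 GB with IndicF5 off, 4.6 GB in the normal steady state, which is why only the ~4 GB 4-bit configuration fits in both.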

### Option C — llama.cpp `llama-server` (FALLBACK if vLLM can't fit)

- **Gemma 4 support**: day-0 via
  [llama.cpp PR #21309](https://github.com/ggml-org/llama.cpp/pull/21309) — vision
  + MoE supported, audio NOT yet. We don't need audio.
- **Binary**: `llama-server` ships as part of llama.cpp builds. OpenAI-compatible
  HTTP endpoint.
- **GGUF compatibility**: reads the SAME blob Ollama already has on Panda
  (`/usr/share/ollama/.ollama/models/blobs/sha256-4e30e26652187...`). **Zero
  re-download.**
- **VRAM**: same ~2.2 GB as Ollama (Q4-ish GGUF) — fits fine on Panda even with
  voice pipeline loaded.
- **Expected latency**: same llama.cpp inference engine as Ollama underneath, so
  prefill will still be ~14 ms. The question is how much of the 110 ms Ollama
  overhead disappears when we talk to llama-server directly. **High-confidence
  path to ~40–60 ms wall clock** (17–25 Hz). Not as fast as vLLM's V1 engine can
  theoretically go, but guaranteed to fit in VRAM and ~5 min to set up.
- **Setup cost**: 5 min — single binary or apt package. No containerization
  required.
- **Verdict**: **FALLBACK per user preference.** Use only if vLLM's VRAM math
  doesn't work out even with quantization + Ollama unload. Also useful as a
  quick sanity check that llama.cpp itself isn't the floor before investing in
  vLLM setup.

### Not pursued (explicitly)

- **Transformers direct (Python in-process)** — fails the robust-serving bar;
  PyTorch eager mode kernel launches are typically slower than llama.cpp's fused
  ops; no batching. Not recommended.
- **Titan-based deployment** — Titan is DGX Spark aarch64 Blackwell SM_121, NOT
  Panda. Moving the nav VLM to Titan defeats the purpose (freeing Titan's 26B
  KV-cache contention). Also Titan vLLM/llama.cpp builds on aarch64 Blackwell
  have a history of needing custom Torch patches. Separate research thread.
- **Older Gemma generations (Gemma 3, Gemma 3n)** — explicitly ruled out by user.

---

## Recommendation — deploy Option B (vLLM) first, fall back to C (llama-server)

**Per user preference: vLLM is the primary path because it's supported.** llama.cpp
is the fallback if vLLM's VRAM math doesn't work out on Panda.

Primary plan: **vLLM on Panda with quantized Gemma 4 E2B, coexisting with the
voice pipeline.**

Sub-steps (in order, pause between each for user confirmation if anything fails):

1. **Reclaim 2.2 GB of VRAM immediately** — unload Ollama's `gemma4:e2b` from VRAM
   (it's still resident from the earlier benchmarks with `keep_alive=10m`):
   ```
   curl -X POST http://192.168.68.57:11434/api/generate -d '{"model":"gemma4:e2b","keep_alive":0}'
   ```
   Brings Panda from 1.7 GB → ~4.0 GB free.

2. **Identify a quantized Gemma 4 E2B HF repo** that vLLM supports. Candidates:
   - `google/gemma-4-E2B-it-fp8` (if Google publishes it)
   - AWQ 4-bit community quants (`TheBloke/gemma-4-E2B-it-AWQ` or similar)
   - bitsandbytes nf4 (`google/gemma-4-E2B-it` loaded with `--quantization bitsandbytes`)
   Target: 3–5 GB weight footprint.

3. **Write `services/panda-vlm/docker-compose.yml`** using the pre-built
   `vllm/vllm-openai:gemma4` image with:
   - `--model <quantized-hf-repo>`
   - `--gpu-memory-utilization 0.30` (or whatever gives us ~5 GB for weights +
     2 GB KV cache + headroom without OOMing the voice pipeline)
   - `--max-model-len 4096` (nav prompts are ~600 tokens, 4K is plenty)
   - `--port 11435`
   - Volume-mounted HF cache under `~/.cache/huggingface`

4. **Launch vLLM**, wait for model load (first launch downloads ~3–5 GB of weights,
   subsequent starts reuse the HF cache).

5. **Adapt the benchmark script** — vLLM exposes OpenAI-compatible
   `/v1/chat/completions`, not Ollama's `/api/chat`. Need a ~20-line variant of
   `scripts/benchmark_nav_rate_panda.py` that targets the OpenAI schema with
   `image_url: "data:image/jpeg;base64,..."` instead of `images: [b64]`.

6. **Re-run the 5-config × 20-run benchmark** against the vLLM endpoint and compare
   to the Ollama baseline.

7. **Interpret**:
   - p50 < 27.8 ms → **36 Hz target hit**. Deploy vLLM as the nav VLM, swap Pi
     systemd `NAV_VLLM_URL` + `NAV_VLLM_MODEL` env, E2E test.
   - p50 27.8–50 ms → 20–36 Hz. Good enough for layered architecture (VLM planning +
     classical reactive controller). Deploy vLLM, same env swap.
   - p50 50–100 ms → 10–20 Hz. Marginal improvement. Fall through to Option C
     (llama-server) as a second data point.
   - p50 > 100 ms → vLLM's overhead unexpectedly high on E2B. Fall through to
     Option C, and file a note that vLLM's generic multimodal path may not be
     optimized for small MoE with vision.
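
Step 5's payload adaptation could look like this sketch (the converter name is hypothetical; the real point is the schema difference — OpenAI's `image_url` data URI vs Ollama's bare `images` array):

```python
def to_openai_payload(ollama_payload: dict) -> dict:
    """Convert the benchmark's Ollama /api/chat body to the OpenAI
    /v1/chat/completions schema used by both vLLM and llama-server."""
    msg = ollama_payload["messages"][0]
    content = [{"type": "text", "text": msg["content"]}]
    for b64 in msg.get("images", []):   # Ollama: list of bare base64 strings
        content.append({
            "type": "image_url",        # OpenAI: data-URI wrapper
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {
        "model": ollama_payload["model"],
        "messages": [{"role": msg["role"], "content": content}],
        "max_tokens": ollama_payload.get("options", {}).get("num_predict", 8),
    }
```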

**Fallback path (only if vLLM won't fit in VRAM or fails to start):**

1. Install `llama-server` on Panda (either prebuilt binary or `apt install
   llama.cpp-cuda` if available, or build from source with `-DGGML_CUDA=ON`).
2. Launch against the existing GGUF blob — no re-download needed:
   ```
   llama-server \
     --model /usr/share/ollama/.ollama/models/blobs/sha256-4e30e26652187... \
     --port 11435 \
     --n-gpu-layers 99 \
     --ctx-size 4096
   ```
   If the GGUF blob doesn't bundle the vision projector, locate the `mmproj` blob
   in Ollama's manifest and pass `--mmproj <path>`.
3. Adapt benchmark to `/v1/chat/completions` (same as vLLM).
4. Re-run the 5-config benchmark.
5. Compare to Ollama baseline — quick sanity check on whether llama.cpp direct
   beats Ollama's Go wrapper.

---

## Open questions to resolve during deployment

1. **Which quantized Gemma 4 E2B HF repo to pull?** Need to confirm FP8 / AWQ /
   nf4 availability for E2B specifically. If none exist at the required quality,
   fall back to FP16 with `--gpu-memory-utilization 0.50` + temporary voice pipeline
   pause.

2. **Can vLLM coexist with PyTorch workers on the same GPU?** vLLM's CUDA context
   may conflict with the voice pipeline's separate PyTorch processes. Worth
   testing early — start vLLM with `--gpu-memory-utilization 0.25` as a first
   probe, escalate if needed.

3. **What's vLLM's actual per-request overhead on E2B?** The 5–15 ms number is a
   rule of thumb from larger models; on a 2.3B-effective MoE with 1-token output,
   the overhead may be different. The benchmark will reveal this.

4. **Does vLLM's V1 engine multimodal vision path use optimized CUDA kernels for
   Gemma 4's vision tower, or is it falling back to `transformers.AutoModel`?**
   PRs #38826 and #38847 are recent — the first few releases often have generic
   paths that get replaced with custom kernels later. Only way to know: measure.

5. **Do the Pi 5 `robot_tools.py` env vars need to change when we swap from
   Titan 26B to Panda E2B?** Yes: `NAV_VLLM_URL=http://192.168.68.57:11435/v1`
   and `NAV_VLLM_MODEL=gemma-4-e2b` (exact model name depends on how vLLM
   advertises it via `/v1/models`). This is a systemd override edit + restart
   on Pi — clean, reversible.
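
The env swap in question 5 is a systemd drop-in on the Pi; a sketch (the drop-in path follows the standard override convention, and the exact unit name is assumed from context):

```ini
# hypothetical /etc/systemd/system/turbopi-server.service.d/nav-vlm.conf
[Service]
Environment=NAV_VLLM_URL=http://192.168.68.57:11435/v1
Environment=NAV_VLLM_MODEL=gemma-4-e2b
```

Apply with `systemctl daemon-reload && systemctl restart turbopi-server`; reverting is deleting the drop-in and restarting again.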

---

## Appendix — session 67 rate benchmark data (Ollama baseline)

| Test | p50 ms | p95 ms | eff Hz | Notes |
|---|---:|---:|---:|---|
| A full prompt (554 tok + image) | 159 | 256 | 6.3 | baseline |
| B same prompt repeat (cache test) | 158 | 163 | 6.3 | no prefix caching effect |
| C streaming, full prompt | 157 (TTFT 143) | 172 | 6.4 | streaming gives TTFT ≈ wall clock |
| D short prompt (310 tok + image) | 156 | 168 | 6.4 | shorter prompt no speedup |
| E short + streaming | 155 (TTFT 142) | 159 | 6.5 | combined no speedup |
| text-only nav (34 tok, no image) | 137 | — | 7.3 | image adds only 19 ms |

Reported compute: prefill 14 ms + decode 13 ms = **27 ms**.
Unaccounted: **~110–130 ms per request**. That's the target to eliminate.

Raw JSON at `panda:~/benchmark-results/nav_rate_20260411_235455.json`.
Benchmark script: `scripts/benchmark_nav_rate_panda.py` (uncommitted as of writing).

---

## Sources

- [vLLM Supported Models — Gemma3n section (stale cache — mentions Gemma 3n only)](https://docs.vllm.ai/en/latest/models/supported_models)
- [Announcing Gemma 4 on vLLM](https://vllm-project.github.io/2026/04/02/gemma4.html)
- [vLLM Gemma 4 recipe](https://github.com/vllm-project/recipes/blob/main/Google/Gemma4.md)
- [llama.cpp — main repo](https://github.com/ggml-org/llama.cpp)
- [llama.cpp Gemma 3 vision discussion #12348](https://github.com/ggml-org/llama.cpp/discussions/12348)
- [gemma4-llama-dgx-spark (ARM64 CUDA 13 build for Titan)](https://github.com/shamily/gemma4-llama-dgx-spark)
- [Gemma 4 Developer Guide — Lushbinary](https://lushbinary.com/blog/gemma-4-developer-guide-benchmarks-architecture-local-deployment-2026/)
- [Gemma 4 blog post — Google](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/)
- [Ollama gemma4:e2b library](https://ollama.com/library/gemma4:e2b)
- [Gemma 4 vLLM Setup Guide — Gemma 4 Wiki](https://www.gemma4.wiki/install/gemma-4-vllm-setup-guide)
- [Run Gemma 4:E2B Locally with Ollama — Medium/Gabriel Preda](https://medium.com/@gabi.preda/run-gemma-4-e2b-locally-with-ollama-no-cloud-no-limits-7e6c3f6bd860)
- [Building a Multimodal Local AI Stack: Gemma 4 E2B, vLLM, and Hermes — dev.to](https://dev.to/abdulhakkeempa/building-a-multimodal-local-ai-stack-gemma-4-e2b-vllm-and-hermes-agent-k8l)

---

## Cross-references (internal)

- `scripts/benchmark_nav_rate_panda.py` — rate benchmark script (5 configs × 20 runs)
- `scripts/benchmark_gemma4_e2b_panda.py` — session 67 initial benchmark (quality + latency)
- `services/annie-voice/robot_tools.py:280-298` — the actual 500-token nav prompt used in production
- `services/annie-voice/robot_tools.py:32-33` — `NAV_VLLM_URL` + `NAV_VLLM_MODEL` — swap targets
- `docs/RESOURCE-REGISTRY.md` — GPU VRAM budget (Panda + Titan sections)
- `docs/RESEARCH-GEMMA4-E2B-E4B-AUDIO.md` — prior research on Gemma 4 audio capabilities
- `memory/project_turbopi_camera_gotchas.md` — camera constraints (separate from VLM stack)
