# Next Session: Gemma 4 E2B on Raspberry Pi 5

## Goal
Run Gemma 4 E2B (2.3B effective / 5.1B total params) locally on the Pi 5 (16 GB RAM, aarch64).
This would give the TurboPi robot car an on-device LLM with audio + vision understanding.

## Why This Is Interesting
- **E2B has audio input** — the 26B on Titan does NOT. Car could listen + understand speech locally.
- **E2B has vision** — car could describe what it sees without sending frames to Titan.
- **On-device = zero network latency** — car responds even if WiFi drops.
- **16 GB RAM** — E2B at Q4 quantization should fit (~3-4 GB).
- **Hailo-8 (26 TOPS)** is available but only does HEF-compiled models, not LLMs (more below).

## Hardware
- Raspberry Pi 5, 16 GB RAM, aarch64, Debian 13 (Trixie), Python 3.13
- Hailo AI HAT+ 26T (26 TOPS) — connected and verified
- USB speaker (JMTek, `plughw:2,0`) + camera (`/dev/video0`)
- No GPU — CPU-only inference for LLMs

## Key Questions to Answer

### 1. Can Gemma 4 E2B run on Pi 5 via llama.cpp / Ollama?
- Ollama has aarch64 Linux builds — does it have Gemma 4 E2B?
- llama.cpp: does it support Gemma 4 architecture (MoE for E2B)?
- What quantizations are available? Q4_K_M should be ~3 GB.
- Expected tok/s on Pi 5 CPU? (ARM Cortex-A76, 4 cores, 2.4 GHz)
  - Reference: Phi-3 mini (3.8B) gets ~5-8 tok/s on Pi 5 via llama.cpp
  - Gemma 4 E2B (2.3B effective) should be similar or faster
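
A rough sanity check is possible before benchmarking: CPU decoding of a quantized LLM is usually memory-bandwidth bound, so the ceiling is roughly bandwidth divided by bytes streamed per token (about the active quantized weight size; MoE helps because only the active experts are read per token). A sketch with assumed numbers (both the Pi 5 bandwidth and the active-weight size are ballpark assumptions, not measurements):

```python
def est_tokens_per_sec(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound for memory-bandwidth-bound decoding: each generated
    token streams the active weights once from RAM."""
    return bandwidth_gb_s / active_weight_gb

# Ballpark assumptions, not measurements:
PI5_BW_GB_S = 17.0        # Pi 5 LPDDR4X-4267, theoretical peak
E2B_ACTIVE_Q4_GB = 1.3    # ~2.3B active params at ~4.5 bits/param

print(f"~{est_tokens_per_sec(PI5_BW_GB_S, E2B_ACTIVE_Q4_GB):.0f} tok/s ceiling")
```

Real throughput usually lands well below this ceiling (compute overhead, imperfect bandwidth utilization), which is consistent with the Phi-3 reference numbers above.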

### 2. Does audio input work via Ollama/llama.cpp?
- Gemma 4 E2B audio uses a SoundStream encoder — is this supported?
- Previous research (`RESEARCH-GEMMA4-E2B-E4B-AUDIO.md`) found:
  - 30-second max audio input
  - No streaming
  - Audio garbled on E2B (E4B was decent)
  - These tests were on Titan via transformers — does llama.cpp support the audio modality at all?
- If audio doesn't work via llama.cpp, text+vision only is still valuable

### 3. Does vision work?
- E2B is multimodal (text + vision + audio)
- llama.cpp has vision support for some models (LLaVA, etc.)
- Does Gemma 4 E2B vision work in llama.cpp/Ollama?
- Use case: "What do you see?" → car describes the camera frame

### 4. Can the Hailo accelerate E2B inference?
- Hailo runs HEF-compiled neural networks (YOLO, ResNet, etc.)
- LLMs are NOT typical Hailo workloads (the Hailo-8 toolchain compiles fixed-shape vision networks, not autoregressive transformer decoding)
- Verdict is almost certainly NO — but confirm
- Hailo stays for vision (YOLO), LLM runs on CPU separately

### 5. Memory budget
- Pi 5 has 16 GB total
- OS + services: ~1 GB
- Hailo driver: minimal
- Camera/stream server: ~100 MB
- E2B Q4_K_M: ~3 GB estimated
- E2B Q8: ~5 GB estimated
- Available for KV cache: 7-11 GB — plenty for 128K context at small model size?
- **Risk:** Pi 5 has no swap by default. If model + KV cache exceed RAM → OOM kill.
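
The "128K context" question comes down to arithmetic once the model's attention geometry is known. The helper below uses placeholder numbers: the layer count and GQA geometry are assumptions for illustration, not the published E2B config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for a dense attention stack: 2 tensors (K and V)
    per layer, each n_kv_heads * head_dim wide, one slot per context
    position, fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Placeholder geometry -- NOT the published E2B config:
layers, kv_heads, head_dim = 30, 8, 128
for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 2**30
    print(f"ctx {ctx:>7,}: ~{gib:.1f} GiB KV cache")
```

With this (assumed) geometry, a full 128K fp16 KV cache alone is ~15 GiB, so in practice the context window needs to be capped (8K-32K) or the KV cache quantized; "plenty for 128K" is unlikely to hold.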

## Steps

### Step 1: Install Ollama
```bash
ssh pi 'curl -fsSL https://ollama.com/install.sh | sh'
```

### Step 2: Check Gemma 4 E2B availability
```bash
ssh pi 'ollama list'  # shows locally pulled models only, not the registry
ssh pi 'ollama show gemma4:e2b --modelfile 2>&1 || echo "not available"'
# Or search for community GGUF quantizations (e.g. on Hugging Face)
```

### Step 3: Pull and run
```bash
ssh pi 'ollama pull gemma4:e2b'  # or whatever the tag is
ssh pi 'ollama run gemma4:e2b "Hello, what can you do?"'
```

### Step 4: Benchmark
```bash
# Measure tok/s, TTFT, memory usage
ssh pi 'ollama run gemma4:e2b "Write a haiku about a robot car" --verbose'
ssh pi 'free -h'  # check memory after loading
```
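
Ollama's HTTP API exposes the same counters as `--verbose` in machine-readable form: `/api/generate` with `"stream": false` returns `eval_count` (generated tokens) and `eval_duration` (nanoseconds), from which tok/s follows directly. A sketch using made-up numbers, not a real benchmark result:

```python
import json

def decode_tok_per_s(resp: dict) -> float:
    """tok/s from an Ollama /api/generate response: eval_count generated
    tokens over eval_duration nanoseconds."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9

# Illustrative response fragment (values are made up, not a benchmark):
sample = json.loads('{"eval_count": 42, "eval_duration": 7000000000}')
print(f"{decode_tok_per_s(sample):.1f} tok/s")  # 6.0 with these sample values
```

Fetch the real response with e.g. `curl -s http://<pi>:11434/api/generate -d '{"model":"gemma4:e2b","prompt":"Write a haiku","stream":false}'`.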

### Step 5: Test vision (if supported)
```bash
# Capture frame from car camera
ssh pi 'python3 -c "import cv2; cap=cv2.VideoCapture(0); ok, f=cap.read(); cap.release(); assert ok, \"camera read failed\"; cv2.imwrite(\"/tmp/car-view.jpg\", f)"'
# Send to model
ssh pi 'ollama run gemma4:e2b "Describe this image" /tmp/car-view.jpg'
```
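
If the CLI doesn't pick the image path up from the prompt, the HTTP API is the more explicit route: `/api/generate` accepts an `images` array of base64-encoded files (whether gemma4:e2b honors it is exactly what this step tests). A sketch of building that request; the model tag is an assumption:

```python
import base64
import json

def vision_payload(model: str, prompt: str, image_path: str) -> str:
    """Build an Ollama /api/generate request body with one
    base64-encoded image attached."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"model": model, "prompt": prompt,
                       "images": [b64], "stream": False})

# POST the result to http://<pi>:11434/api/generate (curl, urllib, etc.)
```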

### Step 6: Test audio (if supported)
```bash
# Record 5 seconds from USB mic (if available) or use a test WAV
ssh pi 'ollama run gemma4:e2b "Transcribe this audio" /tmp/test.wav'
```

### Step 7: Integration assessment
If it works, decide:
- Should the car have its own local LLM for quick responses?
- Or should it relay to Titan's Gemma 4 26B for quality?
- Hybrid: local E2B for latency-critical (obstacle descriptions, voice commands), Titan for complex reasoning?
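
The hybrid option is cheap to prototype as a routing function: short, latency-critical commands stay on the Pi, everything else goes to Titan. The intent keywords and length cutoff below are hypothetical placeholders, not a tuned policy:

```python
# Hypothetical heuristic for a local-E2B / remote-26B hybrid router.
LOCAL_INTENTS = {"stop", "turn", "speed", "obstacle", "describe"}  # latency-critical

def route(query: str) -> str:
    """'local' for short commands that hit a latency-critical intent,
    'titan' for everything else."""
    words = set(query.lower().split())
    if words & LOCAL_INTENTS and len(words) <= 8:
        return "local"
    return "titan"

print(route("stop now"))                                              # local
print(route("plan tomorrow's patrol schedule around battery life"))  # titan
```

A real version would route on the parsed intent from the voice pipeline rather than raw keywords, but the local/remote split is the decision this step is about.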

## Fallback Options
If E2B doesn't work or is too slow:
1. **Qwen 2.5 0.5B / 1.5B** — known to work on Pi via Ollama, very fast
2. **Phi-3 mini 3.8B** — proven on Pi 5, ~5-8 tok/s
3. **SmolLM2 1.7B** — designed for edge, good at tool calling
4. **TinyLlama 1.1B** — fastest option, basic but functional

## Reference
- `docs/RESEARCH-GEMMA4-E2B-E4B-AUDIO.md` — Audio capabilities deep dive (verdict: can't replace pipeline)
- `docs/RESEARCH-TURBOPI-CAPABILITIES.md` — Full hardware inventory + project ideas
- E2B model card: https://ai.google.dev/gemma/docs/model_card_gemma4
- Ollama Gemma 4: check `ollama.com/library/gemma4`
