# Next Session: Re-benchmark Gemma 4 E2B on Pi 5 with Thinking Disabled

## What

Session 36 benchmarked Gemma 4 E2B on Pi 5 and found it **viable** (6.5 tok/s, vision works, MoE coherent). But thinking mode is ON by default, generating 5-10x more tokens than the visible response — turning 5-10s answers into 40-80s waits. This session disables thinking and re-benchmarks to get the **real production numbers**.

## Context

Ollama v0.20.4 is already installed and running on Pi 5 (`ssh pi`, 192.168.68.61). Gemma 4 E2B (7.2 GB) plus 3 baseline models are already pulled. Ollama configured with `OLLAMA_NUM_THREADS=4`, `OLLAMA_MAX_LOADED_MODELS=1`.

### Session 36 Results (WITH thinking — inflated numbers)

| Model | tok/s | RAM | Vision | Notes |
|-------|-------|-----|--------|-------|
| Qwen 2.5 0.5B | 22.7 | 1.2 GB | No | Baseline floor |
| SmolLM2 1.7B | 6.5 | 3.2 GB | No | Edge-optimized baseline |
| Phi-3 mini 3.8B | 4.7 | 4.2 GB | No | Dense model baseline |
| **Gemma 4 E2B** | **6.2-6.9** | **8.5 GB** | **YES** | ~80% of generated tokens were thinking |

The 6.5 tok/s is the raw generation rate (thinking + visible combined). With thinking disabled, fewer tokens are generated per query, so **wall-clock response time** drops dramatically even if tok/s stays similar.
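In numbers, assuming the raw rate really is unchanged and the visible answer is ~60 tokens (an assumption until measured):

```python
# Why wall-clock drops even if tok/s stays the same: fewer tokens per
# query. 434 tokens is the Session 36 vision result; 60 visible tokens
# without thinking is an assumed figure, not a measurement.

def wall_clock_s(tokens: int, tok_per_s: float) -> float:
    """Generation time only (ignores prompt eval and model load)."""
    return tokens / tok_per_s

TOK_S = 6.5  # raw generation rate, unchanged by the think setting

with_thinking = wall_clock_s(434, TOK_S)
no_thinking = wall_clock_s(60, TOK_S)

print(f"with thinking:    {with_thinking:.0f}s")   # ~67s
print(f"without thinking: {no_thinking:.0f}s")     # ~9s
print(f"speedup:          {with_thinking / no_thinking:.1f}x")
```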

## Key Design Decision

Gemma 4's thinking mode is controlled via the `think` parameter. In Ollama, three approaches:

**Option A: Custom Modelfile** (persistent)
```bash
ollama create gemma4-e2b-nothinker -f - <<EOF
FROM gemma4:e2b
PARAMETER temperature 0.7
SYSTEM "Do not use thinking or chain-of-thought. Respond directly and concisely."
EOF
```

**Option B: API parameter** (per-request)
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "...",
  "think": false,
  "stream": false
}'
```

**Option C: `/no_think` token prefix** (Gemma 4 specific)
Some Gemma 4 builds support a `<start_of_turn>model\n/no_think\n` prefix. Check if Ollama's GGUF supports this.

Try Option B first (simplest); if the API parameter doesn't work, fall back to A, then C.

## Steps

### 1. Verify Pi + Ollama still running
```bash
ssh pi 'ollama --version && ollama list | grep gemma4 && free -h'
```
If Ollama or the model is missing (e.g. after a power outage), restart the service and re-pull:
```bash
ssh pi 'sudo systemctl start ollama && ollama pull gemma4:e2b'
```

### 2. Test `think: false` API parameter
```bash
ssh pi 'curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"gemma4:e2b\",
  \"prompt\": \"What is 7 * 8? Answer with just the number.\",
  \"think\": false,
  \"stream\": false,
  \"options\": {\"num_predict\": 50}
}" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d[\"response\"][:200]); ec=d[\"eval_count\"]; ed=d[\"eval_duration\"]; td=d[\"total_duration\"]; print(f\"tokens: {ec}, tok/s: {ec/(ed/1e9):.2f}, total: {td/1e9:.1f}s\")"'
```
**Expected:** Response should be just "56" with ~2-5 tokens total (no thinking chain). Total time <5s.

If `think: false` is ignored (response still contains thinking), try Option A (Modelfile) or Option C (`/no_think`).
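A quick way to automate that check. This sketch assumes the chain surfaces either as a separate `thinking` field in the JSON (as in recent Ollama releases) or inline as `<think>` tags, depending on the model template — verify both against the actual output:

```python
import json

def has_thinking(resp: dict) -> bool:
    """True if a /api/generate response still carries a thinking chain."""
    # Separate field (newer Ollama) or inline tags (some templates) —
    # both locations are assumptions to confirm on-device.
    if resp.get("thinking"):
        return True
    return "<think>" in resp.get("response", "")

clean = json.loads('{"response": "56"}')
leaky = json.loads('{"response": "<think>7*8=56</think>56"}')
print(has_thinking(clean), has_thinking(leaky))  # False True
```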

### 3. Re-benchmark text generation (no thinking)
```bash
# Haiku
ssh pi 'curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"gemma4:e2b\",
  \"prompt\": \"Write a haiku about a robot car driving through a house\",
  \"think\": false,
  \"stream\": false,
  \"options\": {\"num_predict\": 100}
}" | python3 -c "import json,sys; d=json.load(sys.stdin); r=d[\"response\"]; print(r[:300]); ec=d[\"eval_count\"]; ed=d[\"eval_duration\"]; td=d[\"total_duration\"]; print(f\"\\ntokens: {ec}, tok/s: {ec/(ed/1e9):.2f}, wall: {td/1e9:.1f}s\")"'

# Reasoning
ssh pi 'curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"gemma4:e2b\",
  \"prompt\": \"If a robot moves at 5cm/s toward a wall 30cm away, how many seconds until impact? Be brief.\",
  \"think\": false,
  \"stream\": false,
  \"options\": {\"num_predict\": 100}
}" | python3 -c "import json,sys; d=json.load(sys.stdin); r=d[\"response\"]; print(r[:300]); ec=d[\"eval_count\"]; ed=d[\"eval_duration\"]; td=d[\"total_duration\"]; print(f\"\\ntokens: {ec}, tok/s: {ec/(ed/1e9):.2f}, wall: {td/1e9:.1f}s\")"'

# Longer generation (200 words)
ssh pi 'curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"gemma4:e2b\",
  \"prompt\": \"Explain in 200 words how mecanum wheels enable omnidirectional movement. Include force vectors.\",
  \"think\": false,
  \"stream\": false,
  \"options\": {\"num_predict\": 400}
}" | python3 -c "import json,sys; d=json.load(sys.stdin); r=d[\"response\"]; print(r[:500]); ec=d[\"eval_count\"]; ed=d[\"eval_duration\"]; td=d[\"total_duration\"]; print(f\"\\ntokens: {ec}, tok/s: {ec/(ed/1e9):.2f}, wall: {td/1e9:.1f}s\")"'
```
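The inline `python3 -c` parsers above are fragile to quote. The same extraction as a small reusable function (field names and nanosecond durations follow Ollama's `/api/generate` response format):

```python
import json

def metrics(body: str) -> dict:
    """Extract tokens, tok/s, and wall-clock from a /api/generate body."""
    d = json.loads(body)
    ec = d["eval_count"]        # generated tokens
    ed = d["eval_duration"]     # ns spent generating
    return {
        "response": d["response"],
        "tokens": ec,
        "tok_s": round(ec / (ed / 1e9), 2),
        "wall_s": round(d["total_duration"] / 1e9, 1),  # includes load + prompt eval
    }

# Demo with a fabricated body (illustrative numbers only):
sample = json.dumps({
    "response": "56",
    "eval_count": 3,
    "eval_duration": 460_000_000,
    "total_duration": 1_200_000_000,
})
print(metrics(sample))
```

Saving this on the Pi and piping the curl output through it avoids re-escaping the one-liner for every test.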

### 4. Re-benchmark vision (no thinking)
```bash
# Capture fresh frame (10-frame warmup for auto-exposure)
ssh pi 'python3 -c "
import cv2, time
cap = cv2.VideoCapture(0)
for i in range(10): cap.read(); time.sleep(0.1)  # warmup for auto-exposure
ret, f = cap.read(); cap.release()
assert ret, \"camera capture failed\"
cv2.imwrite(\"/tmp/car-view.jpg\", f)
print(f\"Captured: {f.shape}, Mean: {f.mean():.1f}\")
"'

# Vision with no thinking
ssh pi 'curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"gemma4:e2b\",
  \"prompt\": \"What do you see? 2-3 sentences max.\",
  \"images\": [\"$(base64 -w0 /tmp/car-view.jpg)\"],
  \"think\": false,
  \"stream\": false,
  \"options\": {\"num_predict\": 200}
}" | python3 -c "import json,sys; d=json.load(sys.stdin); r=d[\"response\"]; print(r[:500]); ec=d[\"eval_count\"]; ed=d[\"eval_duration\"]; td=d[\"total_duration\"]; print(f\"\\ntokens: {ec}, tok/s: {ec/(ed/1e9):.2f}, wall: {td/1e9:.1f}s\")"'
```
**Session 36 with thinking:** 434 tokens, 73s. Expected without thinking: ~50-80 tokens, ~8-12s.
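If the `$(base64 ...)` shell interpolation gets unwieldy, the request body can be built in Python instead — a sketch using the same `/api/generate` fields as the curl above:

```python
import base64, json, os, tempfile

def vision_payload(image_path: str, prompt: str) -> str:
    """JSON body for /api/generate with an attached image, thinking off."""
    with open(image_path, "rb") as f:
        img = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": "gemma4:e2b",
        "prompt": prompt,
        "images": [img],
        "think": False,
        "stream": False,
        "options": {"num_predict": 200},
    })

# Demo with a throwaway file standing in for /tmp/car-view.jpg:
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"\xff\xd8fake-jpeg-bytes")
    path = tmp.name
body = json.loads(vision_payload(path, "What do you see? 2-3 sentences max."))
os.unlink(path)
print(body["think"], len(body["images"]))  # False 1
```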

### 5. Compare results table

Fill in:

| Test | WITH thinking (session 36) | WITHOUT thinking | Speedup |
|------|---------------------------|-----------------|---------|
| Math (7*8) | 94 tokens, 15.9s | ? tokens, ?s | ?x |
| Haiku | 331 tokens, 53s | ? tokens, ?s | ?x |
| Reasoning | 154 tokens, ~25s | ? tokens, ?s | ?x |
| Vision | 434 tokens, 73s | ? tokens, ?s | ?x |
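A small helper for filling the Speedup column once the no-think numbers are in (the no-think values below are placeholders, not measurements):

```python
def compare(with_tokens, with_s, without_tokens, without_s):
    """Token reduction and wall-clock speedup for one table row."""
    return {
        "token_cut": round(with_tokens / without_tokens, 1),
        "speedup": round(with_s / without_s, 1),
    }

# Session 36 math row vs a hypothetical no-think result (3 tokens, 2s):
print(compare(94, 15.9, 3, 2.0))
```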

### 6. Quality check — does disabling thinking hurt?

Run the same MoE coherence tests without thinking:
- 7 * 8 should still be 56
- 30cm / 5cm/s should still be 6 seconds
- Vision should still describe the actual scene

If quality drops significantly, consider a hybrid: thinking for complex reasoning, no-think for quick commands.
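That hybrid could start as something as simple as a keyword router — the keyword lists below are placeholders to tune, not a tested classifier:

```python
QUICK_COMMANDS = ("stop", "forward", "backward", "turn left", "turn right",
                  "what do you see")
NEEDS_REASONING = ("why", "plan", "how many", "calculate", "explain")

def wants_thinking(prompt: str) -> bool:
    """Route quick robot commands to think=False, hard prompts to think=True."""
    p = prompt.lower()
    if any(k in p for k in QUICK_COMMANDS):
        return False
    return any(k in p for k in NEEDS_REASONING)

print(wants_thinking("turn left"))                      # False
print(wants_thinking("plan a route around the chair"))  # True
print(wants_thinking("hello"))                          # False (default: no thinking)
```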

### 7. (Optional) Create production Modelfile

If `think: false` works well, create a persistent no-thinking variant:
```bash
ssh pi 'ollama create gemma4-car -f - <<EOF
FROM gemma4:e2b
PARAMETER temperature 0.7
PARAMETER num_predict 200
SYSTEM "You are the brain of a TurboPi robot car. Respond directly and concisely. Never use chain-of-thought or thinking. Describe what you see accurately. Follow driving commands precisely."
EOF'
```

## Decision Criteria

| Metric | Target (no thinking) | Session 36 (with thinking) |
|--------|---------------------|---------------------------|
| Wall-clock for haiku | <10s | 53s |
| Wall-clock for vision | <15s | 73s |
| Wall-clock for math | <5s | 16s |
| Quality (math correct) | Still correct | Correct |
| Quality (vision accurate) | Still accurate | Accurate |
| tok/s (raw rate) | ~6-7 (same) | 6.2-6.9 |

## Verification

- [ ] `think: false` parameter works (or alternative method found)
- [ ] Text benchmarks re-run without thinking
- [ ] Vision re-benchmarked without thinking
- [ ] Quality unchanged (math + vision still correct)
- [ ] Wall-clock times compared in table
- [ ] If viable: production Modelfile created for the robot car

## Files
- `memory/project_pi5_benchmark_state.md` — Session 36 full results
- `~/.claude/plans/woolly-zooming-eich.md` — Original adversarial-reviewed plan
- `docs/RESEARCH-TURBOPI-CAPABILITIES.md` — Hardware inventory + SDK
