# Next Session: Nav VLM Production Deploy

**Pickup prompt for session 68. Self-contained — no context from session 67 required.**

---

## What Happened in Session 67

Session 67 benchmarked two inference backends for the robot car's navigation VLM (Gemma 4 E2B)
on Panda (RTX 5070 Ti 16 GB, 192.168.68.57). Replacing Ollama with llama.cpp's
`llama-server` cut per-decision latency from 156 ms (6.4 Hz) to 18.4 ms (54 Hz) —
an 8.5x speedup — without changing the model, weights, or quantization. The entire
gap was Ollama's Go server-stack overhead (~110 ms), not GPU compute (~27 ms).
The `panda-llamacpp` docker container was started as a one-off `docker run -d` and
verified working. The deployment has NOT been made persistent yet and the Pi has NOT
been switched to point at Panda. That is the work for this session.

Research writeup: `docs/RESEARCH-PANDA-NAV-VLM-54HZ.md`

---

## Current State (Start of Session 68)

### What is running (verified end of session 67)

| Machine | Process | Status | Port |
|---------|---------|--------|------|
| Panda (192.168.68.57) | `phone_call.py` (Whisper+IndicConformer+Kokoro) | Running | — |
| Panda | Chatterbox TTS | Running | 8772 |
| Panda | `panda-llamacpp` docker container (nav VLM) | **Running but NOT persistent** | 11435 |
| Titan (192.168.68.52) | vLLM Gemma 4 26B NVFP4 | Running | 8003 |
| Pi (192.168.68.61) | turbopi-server | Running | 8080 |

### What still needs to happen

1. IndicF5 (port 8771 on Panda) was stopped for benchmarking. It stays down: the retirement is permanent (Step B).
2. `panda-llamacpp` docker container needs a systemd unit (will die on Panda reboot).
3. Pi's turbopi-server still points at Titan (NAV_VLLM_URL=http://192.168.68.52:8003).
4. Dead-end artifacts from session 67 (~30 GB) need cleanup on Panda.
5. Benchmark scripts need to be committed to git.
6. RESOURCE-REGISTRY.md needs the IndicF5 row marked as permanently retired.

### Key facts about the nav VLM setup

- **Model weights on Panda:** `~/gguf-gemma4-e2b/gemma-4-E2B-it-Q4_K_M.gguf` + `mmproj-gemma-4-E2B-it-f16.gguf`
- **Docker image:** `ghcr.io/ggml-org/llama.cpp:server-cuda` (already pulled)
- **Container name:** `panda-llamacpp`
- **Port mapping:** host 11435 → container 8080
- **Critical flag in every request:** `"chat_template_kwargs": {"enable_thinking": false}`
  Without this, Gemma 4 attempts extended thinking and returns empty/malformed responses.
- **Rollback:** change Pi env back to `NAV_VLLM_URL=http://192.168.68.52:8003` and `NAV_VLLM_MODEL=gemma-4-26b` — one restart restores Titan 26B nav.
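
Before wiring anything to the Pi, the thinking flag can be sanity-checked in the payload itself. A minimal sketch (the prompt text is illustrative; the flag and model name are the ones listed above):

```shell
# Minimal nav-style payload; chat_template_kwargs.enable_thinking must be
# false in every request or Gemma 4 attempts extended thinking
payload='{
  "model": "gemma-4-E2B-it",
  "max_tokens": 10,
  "chat_template_kwargs": {"enable_thinking": false},
  "messages": [{"role": "user", "content": "Reply with one word: forward"}]
}'
# Confirm the flag survives a JSON round-trip before sending
echo "$payload" | python3 -c 'import sys, json; print(json.load(sys.stdin)["chat_template_kwargs"]["enable_thinking"])'
# To send it:
# curl -s http://192.168.68.57:11435/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```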

---

## Step-by-Step Deployment Checklist

Work through these in order. Each step has a verification command.

---

### Step A: Confirm panda-llamacpp is still alive

```bash
# On Panda (ssh panda or run on the machine directly)
docker ps --filter name=panda-llamacpp

# If running, you will see a row with panda-llamacpp and status "Up"
# If not running (was rebooted):
docker start panda-llamacpp
# Then wait 10s for model to load, then verify:
curl -s http://localhost:11435/health | python3 -m json.tool
# Expected: {"status": "ok"}
```

If `docker start` fails (container removed), recreate with:
```bash
docker run -d \
  --name panda-llamacpp \
  --gpus all \
  -p 11435:8080 \
  -v ~/gguf-gemma4-e2b:/models:ro \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  --model /models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj /models/mmproj-gemma-4-E2B-it-f16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  --threads 4
```

---

### Step B: Permanently retire IndicF5 on Panda

```bash
# On Panda
# 1. Confirm it is stopped (should be from session 67)
ps aux | grep indicf5
# If running, kill it:
sudo systemctl stop indicf5-server 2>/dev/null || pkill -f indicf5

# 2. Disable from systemd if it was registered
sudo systemctl disable indicf5-server 2>/dev/null
sudo rm -f /etc/systemd/system/indicf5-server.service 2>/dev/null
sudo systemctl daemon-reload

# 3. Check for any startup references in her-os services
grep -r indicf5 ~/workplace/her/her-os/services/ --include="*.py" --include="*.sh" --include="*.service"
# If any files reference indicf5 for startup, comment those lines out

# 4. Verify VRAM freed
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
# Should show: phone_call.py ~5158 MB, Chatterbox ~3654 MB, panda-llamacpp ~3227 MB
# Total ~12039 MB / 16303 MB = 74%
# IndicF5 (2864 MB) should NOT appear
```

---

### Step C: Clean up dead-end artifacts from session 67

These are safe to delete — they were dead ends and nothing depends on them.

```bash
# On Panda — ~30 GB total to recover

# 1. Delete the vLLM docker image (~22.5 GB)
docker images | grep vllm
docker rmi vllm/vllm-openai:latest
# If it complains about dependent containers:
docker ps -a | grep vllm  # find container
docker rm <container_id>   # remove it
docker rmi vllm/vllm-openai:latest

# 2. Delete the Unsloth bnb-4bit HF cache (~7.5 GB)
du -sh ~/.cache/huggingface/hub/models--unsloth--gemma-4-E2B-it-unsloth-bnb-4bit/ 2>/dev/null
rm -rf ~/.cache/huggingface/hub/models--unsloth--gemma-4-E2B-it-unsloth-bnb-4bit/

# 3. Verify disk freed
df -h ~/
# 4. Confirm the GGUF weights (outside the HF cache) are untouched
ls ~/gguf-gemma4-e2b/
# Expected: gemma-4-E2B-it-Q4_K_M.gguf  mmproj-gemma-4-E2B-it-f16.gguf
```

---

### Step D: Create systemd unit for panda-llamacpp (persistence)

```bash
# On Panda
# Create the service file
sudo tee /etc/systemd/system/panda-llamacpp.service > /dev/null << 'EOF'
[Unit]
Description=llama-server Nav VLM (Gemma 4 E2B Q4_K_M)
After=docker.service
Requires=docker.service

[Service]
# Run the container in the foreground (no -d) so systemd tracks its
# lifetime directly and Restart=on-failure can react to crashes
ExecStartPre=-/usr/bin/docker stop panda-llamacpp
ExecStartPre=-/usr/bin/docker rm panda-llamacpp
ExecStart=/usr/bin/docker run \
    --name panda-llamacpp \
    --gpus all \
    -p 11435:8080 \
    -v /home/rajesh/gguf-gemma4-e2b:/models:ro \
    ghcr.io/ggml-org/llama.cpp:server-cuda \
    --model /models/gemma-4-E2B-it-Q4_K_M.gguf \
    --mmproj /models/mmproj-gemma-4-E2B-it-f16.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --n-gpu-layers 99 \
    --ctx-size 4096 \
    --threads 4
ExecStop=/usr/bin/docker stop panda-llamacpp
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable panda-llamacpp
sudo systemctl start panda-llamacpp
sudo systemctl status panda-llamacpp

# Verify the container came up via systemd
sleep 15
curl -s http://localhost:11435/health | python3 -m json.tool
# Expected: {"status": "ok"}
```

---

### Step E: Switch Pi to point at Panda llama-server

```bash
# On Pi (ssh pi or run directly on 192.168.68.61)
# 1. Inspect current systemd env
sudo systemctl cat turbopi-server.service | grep NAV_VLLM

# Should currently show:
#   Environment="NAV_VLLM_URL=http://192.168.68.52:8003"
#   Environment="NAV_VLLM_MODEL=gemma-4-26b"

# 2. Edit the service
sudo systemctl edit --full turbopi-server.service
# In the [Service] section, find the two NAV_VLLM_* lines and change them to:
#   Environment="NAV_VLLM_URL=http://192.168.68.57:11435"
#   Environment="NAV_VLLM_MODEL=gemma-4-E2B-it"
# Save and exit the editor (usually Ctrl+X then Y in nano, or :wq in vi)

# 3. Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart turbopi-server

# 4. Verify the service is running
sudo systemctl status turbopi-server
```

---

### Step F: Verify Pi connectivity to Panda nav VLM

```bash
# On Pi
# 1. Confirm turbopi-server health shows updated nav VLM URL
curl -s http://localhost:8080/health | python3 -m json.tool
# Look for: "nav_vllm_url": "http://192.168.68.57:11435"
# Look for: "nav_vllm_healthy": true  (or similar connectivity indicator)

# 2. Test nav VLM reachability from Pi directly
curl -X POST http://192.168.68.57:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-E2B-it",
    "max_tokens": 10,
    "chat_template_kwargs": {"enable_thinking": false},
    "messages": [
      {"role": "user", "content": "Reply with exactly one word: forward"}
    ]
  }' | python3 -m json.tool
# Expected: response contains "forward" in content field
# Expected latency: <100ms from Pi to Panda
```

---

### Step G: Verify full turbopi-server health on Pi

```bash
# On Pi
curl -s http://localhost:8080/health | python3 -m json.tool
# All these should be true/healthy:
#   slam_healthy: true
#   imu_healthy: true
#   odom_calibrated: true
#   lidar_healthy: true  (or similar)
# nav_vllm_url should show the Panda address
```

---

### Step H: E2E test via Telegram

Send via Telegram to Annie:
```
Annie, can you take a photo and tell me what you see in front of the robot?
```

Then:
```
Annie, drive the robot forward for 2 seconds
```

Then:
```
Annie, explore the room for 5 cycles and report what you found
```

Expected behavior:
- Photo description arrives in <5s (nav VLM call + response)
- Drive command executes (robot moves)
- Exploration returns after 5 cycles with a summary

While the exploration runs, time the cycles. In session 67, cycles were ~1s each
(drive duration dominates, not VLM latency). With 54 Hz nav VLM the per-cycle time
should be identical to session 65/66 — the improvement is headroom, not visible
latency in a 1-cycle-per-second navigation loop.
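
The claim above can be checked with back-of-envelope arithmetic (the ~1 s drive time is the session-67 observation; treat it as an assumption):

```shell
# Per-cycle wall time = drive duration + one VLM decision
drive_ms=1000        # observed ~1 s per cycle in session 67 (assumption)
ollama_ms=156        # old per-decision latency
llamacpp_ms=18       # new per-decision latency (18.4 ms p50, rounded)
echo "cycle (Ollama):    $(( drive_ms + ollama_ms )) ms"
echo "cycle (llama.cpp): $(( drive_ms + llamacpp_ms )) ms"
# About a 12% cycle-time difference: the speedup is headroom for faster
# control loops, not a visible change at 1 cycle per second
```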

If Annie reports an error connecting to the nav VLM, verify:
1. `docker ps` on Panda shows `panda-llamacpp` Up
2. Port 11435 accessible from Pi (`curl http://192.168.68.57:11435/health` from Pi)
3. turbopi-server env has the correct URL (Step E verification)
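
Checks 2 and 3 can be compressed into one quick probe run from the Pi (a sketch covering the HTTP endpoints only; check 1 still needs `docker ps` on Panda):

```shell
# Quick triage: probe both health endpoints and report HTTP status codes
# (000 means unreachable). Run from the Pi.
for url in \
  http://192.168.68.57:11435/health \
  http://localhost:8080/health
do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url" || true)
  echo "$url -> HTTP $code"
done
```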

---

### Step I: Update RESOURCE-REGISTRY.md

Open `docs/RESOURCE-REGISTRY.md` and make these edits:

**In the Panda table:**

1. Find the IndicF5 row and change the "Load Behavior" column to:
   `**PERMANENTLY RETIRED session 68** — user decision: "Mom speaks English, few words". Chatterbox (English) covers all TTS needs.`

2. Find the panda-llamacpp row and update "Load Behavior" to:
   `**Always loaded (systemd panda-llamacpp.service)** — persistent after session 68. Delivers 18.4 ms p50 (54 Hz) for nav prompt.`

3. Update the peak VRAM totals in the free-space commentary:
   - With IndicF5 permanently retired and panda-llamacpp added:
   - Active processes: phone_call.py (5,158) + Chatterbox (3,654) + panda-llamacpp (3,227) = 12,039 MB
   - Free: 16,303 - 12,039 = 4,264 MB (26%)
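
A quick way to recompute these totals from the per-process figures (all numbers as quoted above, in MB):

```shell
# Recompute the Panda VRAM budget from the per-process figures
total_mb=16303
used_mb=$(( 5158 + 3654 + 3227 ))   # phone_call.py + Chatterbox + panda-llamacpp
free_mb=$(( total_mb - used_mb ))
echo "used: ${used_mb} MB, free: ${free_mb} MB ($(( free_mb * 100 / total_mb ))%)"
```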

**In the Change Log table**, add a row at the top:
```
| 2026-04-13 | 68 | IndicF5 permanently retired (user: mom speaks English). panda-llamacpp systemd persisted. Pi NAV_VLLM_URL switched to Panda:11435. Dead-end vLLM image + bnb-4bit cache deleted (~30 GB). | Panda: -2,864 MB (IndicF5 gone). Net with llamacpp: -(2864-3227) = +363 MB vs pre-session-67 baseline. |
```

**In the Pi 5 table:**

Find the Ollama/nav row and update it to note that nav VLM now runs on Panda:
- Old: `Photo description | Gemma 4 E2B (gemma4-car) | 8.5 GB RAM | chariot | Ollama :11434 | On-demand`
- New note: Add "(nav VLM offloaded to Panda panda-llamacpp :11435 since session 68 — local Ollama kept as rollback)"

---

### Step J: Commit benchmark scripts

```bash
# On the dev machine (where her-os git repo is)
cd /home/rajesh/workplace/her/her-os

# Check what's uncommitted
git status

# The benchmark scripts should show up as untracked:
#   scripts/benchmark_gemma4_e2b_panda.py
#   scripts/benchmark_nav_rate_vllm.py  (or similar name)

# Stage them
git add scripts/benchmark_gemma4_e2b_panda.py
git add scripts/benchmark_nav_rate_vllm.py  # adjust name if different

# Also stage the two new docs if not yet committed:
git add docs/RESEARCH-PANDA-NAV-VLM-54HZ.md
git add docs/NEXT-SESSION-NAV-VLM-PRODUCTION-DEPLOY.md

# Commit
git commit -m "feat(nav): llama-server achieves 54 Hz nav VLM on Panda (8.5x vs Ollama)

Add benchmark scripts for Gemma 4 E2B on Panda RTX 5070 Ti.
llama-server (Q4_K_M + mmproj-F16) delivers 18.4 ms p50 vs 156 ms Ollama.
All 110 ms gap was Ollama Go server-stack overhead, not GPU compute.
Add research doc and next-session deploy checklist."

# Push
git push origin main

# Verify on Pi (if Pi is running from git)
ssh pi 'cd ~/her-os && git pull && echo "OK"'
```

---

### Step K: Final verification checklist

Run through all of these before declaring the session complete.

```bash
# === PANDA CHECKS ===
# 1. VRAM budget
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
# Expected: 3 processes, total ~12039 MB, IndicF5 absent

# 2. panda-llamacpp systemd status
sudo systemctl status panda-llamacpp
# Expected: active (running)

# 3. Nav VLM responds correctly
curl -X POST http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-E2B-it",
    "max_tokens": 10,
    "chat_template_kwargs": {"enable_thinking": false},
    "messages": [
      {"role": "user", "content": "Choose: forward, left, right, backward, goal_reached, give_up. Just one word."}
    ]
  }' | python3 -c "import sys,json; r=json.load(sys.stdin); print(r['choices'][0]['message']['content'])"
# Expected: one of the six action words

# 4. Disk cleanup confirmed
df -h ~/
# Should have ~30 GB more than session 67 start

# === PI CHECKS ===
# 5. turbopi-server env updated
sudo systemctl cat turbopi-server.service | grep NAV_VLLM
# Expected:
#   NAV_VLLM_URL=http://192.168.68.57:11435
#   NAV_VLLM_MODEL=gemma-4-E2B-it

# 6. Health check passes
curl -s http://localhost:8080/health | python3 -m json.tool
# All subsystems healthy, odom_calibrated=true

# 7. Cross-machine connectivity
curl -s http://192.168.68.57:11435/health
# Expected: {"status": "ok"}  (from Pi to Panda)

# === GIT CHECKS ===
# 8. Scripts committed
git log --oneline -3  # should show the session 68 feat(nav) commit
git status  # should be clean

# 9. RESOURCE-REGISTRY.md updated
grep -A2 "IndicF5" docs/RESOURCE-REGISTRY.md | head -5
# Should contain "PERMANENTLY RETIRED"
```

---

## Rollback Plan

If llama-server fails in production (e.g., model corruption, container crash loop, or a
network issue between Pi and Panda), revert to Titan 26B nav on the Pi:

```bash
# On Pi: edit the env back and restart
sudo systemctl edit --full turbopi-server.service
# Change:
#   Environment="NAV_VLLM_URL=http://192.168.68.57:11435"
#   Environment="NAV_VLLM_MODEL=gemma-4-E2B-it"
# Back to:
#   Environment="NAV_VLLM_URL=http://192.168.68.52:8003"
#   Environment="NAV_VLLM_MODEL=gemma-4-26b"
sudo systemctl daemon-reload
sudo systemctl restart turbopi-server
```

Titan's vLLM (port 8003) is always running, so rollback immediately restores the
pre-session-67 behavior (~6 Hz nav via Titan's 26B, roughly the old Ollama rate).
No model downloads, no container restarts, no VRAM changes.

**Rollback does NOT affect voice, WhatsApp, or any other Annie feature.** The Pi
nav VLM URL is isolated to turbopi-server's env.
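
As a non-interactive alternative to `systemctl edit --full`, a systemd drop-in can pin the two variables back to Titan (a sketch; drop-ins are parsed after the main unit file, so their `Environment=` assignments win for the same variable):

```shell
# Write the override locally, then install it on the Pi
cat > nav-rollback.conf << 'EOF'
[Service]
Environment="NAV_VLLM_URL=http://192.168.68.52:8003"
Environment="NAV_VLLM_MODEL=gemma-4-26b"
EOF
grep -c '^Environment=' nav-rollback.conf   # sanity check: expect 2
# On the Pi:
# sudo install -D -m 644 nav-rollback.conf \
#   /etc/systemd/system/turbopi-server.service.d/nav-rollback.conf
# sudo systemctl daemon-reload && sudo systemctl restart turbopi-server
# To undo the rollback later, delete the drop-in and daemon-reload again
```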

---

## Files to Reference

| File | Where | Purpose |
|------|-------|---------|
| `docs/RESEARCH-PANDA-NAV-VLM-54HZ.md` | her-os repo | Full technical writeup of session 67 benchmark |
| `docs/RESOURCE-REGISTRY.md` | her-os repo | GPU VRAM budget — update per Step I |
| `scripts/benchmark_gemma4_e2b_panda.py` | her-os repo | 20-run benchmark script |
| `~/benchmark_nav_rate_vllm_results.json` | Panda home dir | Raw timing data from session 67 |
| `services/annie-voice/robot_tools.py` | her-os repo | Nav loop: reads NAV_VLLM_URL env |
| `/etc/systemd/system/turbopi-server.service` | Pi | NAV_VLLM_URL, NAV_VLLM_MODEL env |
| `/etc/systemd/system/panda-llamacpp.service` | Panda | Nav VLM persistence (created Step D) |
| `~/gguf-gemma4-e2b/` | Panda home dir | GGUF weights (do NOT delete) |
