# Next Session: Implement SLAM+Zenoh + Multi-Query VLM

## What

Implement the adversarial-reviewed plan for two independent workstreams:

1. **SLAM+Zenoh (Pi 5):** Replace Python HectorSLAM (`slam.py`, 721 lines, 7.9x rotation drift) with ROS2 slam_toolbox in Docker, connected via Zenoh middleware (10-21 microsecond latency). Native `slam_bridge.py` publishes lidar/IMU/odom via `zenoh_ros2_sdk` → Zenoh router (separate container) → Docker ROS2 (rmw_zenoh_cpp) runs slam_toolbox+rf2o+EKF → publishes pose/map back. Same API as existing SlamDaemon. Switched via `SLAM_BACKEND=ros2` env var (default: `hector`, zero-risk rollback).

2. **Multi-Query VLM (Panda):** NavController rotates through 6 prompt slots per cycle: 3 goal-tracking + 1 scene + 1 obstacle + 1 path. Goal tracking runs at ~27Hz and each auxiliary query at ~9Hz. Feature-flagged via `NAV_MULTI_QUERY=1` (default: off). Motor commands come ONLY from goal-tracking slots.

These share NO code and run on different machines — implement in parallel.
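
The six-slot rotation in item 2 can be sketched as below. This is an illustration, not the plan's code: the slot ordering and the names `SLOT_CYCLE`, `slot_for_frame`, `may_drive_motors`, and `slot_rates` are hypothetical. Only the 3 + 1 + 1 + 1 split and the rule that motor commands come from goal-tracking slots are taken from the description above.

```python
from collections import Counter

# Hypothetical ordering: goal queries are interleaved so the goal is refreshed
# every other frame. Only the 3 + 1 + 1 + 1 split comes from the plan.
SLOT_CYCLE = ("goal", "scene", "goal", "obstacle", "goal", "path")


def slot_for_frame(frame_idx: int) -> str:
    """Return the prompt slot used for a given frame index."""
    return SLOT_CYCLE[frame_idx % len(SLOT_CYCLE)]


def may_drive_motors(slot: str) -> bool:
    """Only goal-tracking slots are allowed to produce motor commands."""
    return slot == "goal"


def slot_rates(frame_rate_hz: float) -> dict[str, float]:
    """Per-slot query rate implied by the cycle at a given overall frame rate."""
    counts = Counter(SLOT_CYCLE)
    return {slot: frame_rate_hz * n / len(SLOT_CYCLE) for slot, n in counts.items()}
```

With an overall rate of 54 queries per second, `slot_rates(54.0)` gives 27 Hz for `goal` and 9 Hz for each auxiliary slot, which matches the ~27Hz / ~9Hz figures above.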

## Plan

**`~/.claude/plans/modular-leaping-teapot.md`** — Read this first. It has the full implementation with all code, all adversarial review fixes, and design decisions.

The plan passed adversarial review (2 parallel reviewers, 22 findings: 5 CRITICAL, 7 HIGH, 10 MEDIUM — ALL fixed in the plan).

## Key Design Decisions (from adversarial review)

These decisions were CHANGED by the adversarial review. Do not revert them:

1. **`_awaiting_reset` boolean flag** — NOT seq comparison. The original `_pending_reset_seq > _reset_seq` was mathematically impossible to trigger (both set equal in `reset()`). Use `_awaiting_reset = True` on reset, `False` on ack.

2. **zenohd as SEPARATE Docker service** (`restart: always`) — NOT inside the SLAM container. slam_toolbox crash must NOT kill the Zenoh router. Native session reconnects when router is alive.

3. **Publish loop wrapped in try/except** with reconnection — on Zenoh error, force `ZenohSession._instance = None`, re-init. Prevents "zombie RUNNING" state where the bridge thread dies but `is_healthy()` returns True.

4. **Message classes cached in `_init_zenoh()`** — NOT called per-message. `get_message_class()` was being called 250-400x/sec (8 calls per IMU message × 50Hz). All 9 message types cached as `self._Header`, `self._Quaternion`, etc.

5. **POSE_TIMEOUT_S = 15.0** — NOT 5.0. Loop closure takes 10-15s with no TF updates. 5s causes false DEGRADED mid-navigation.

6. **POSE_DEAD_S = 60.0** — NOT 30.0. Docker restart takes ~15s; 30s was too tight.

7. **Dockerfile: `source /opt/ros/jazzy/setup.bash` FIRST** — before any `ros2` commands. Original had `ros2 run` before sourcing, which fails immediately.

8. **ALL `_state` reads/writes under `_lock`** — `_check_state()` holds lock during state transitions. `is_healthy()` and `get_pose()` read `_state` under lock.

9. **`bin_scan_to_ranges`: clamp `bin_idx = min(bin_idx, num_bins - 1)`.** Prevents the off-by-one at ccw_rad=+π, where the index computes to bin 360 and the point was dropped.

10. **Chassis reflection filter** — `MIN_SLAM_DISTANCE_MM = 50` applied in `_publish_scan()` before binning. `get_scan_snapshot()` returns unfiltered points but the publisher filters them.

11. **Pre-cache message definitions** (Task 0 Step 0.5) — `zenoh_ros2_sdk` downloads .msg from Git on first use. Air-gapped Pi crashes without pre-caching.

12. **C14 DeprecationWarning** (Task 4 Step 4.3) — monkey-patches `consume_heading_delta_deg()` when `SLAM_BACKEND=ros2` to warn about destructive IMU reads.

13. **`_vlm_request()` helper method** — extracted from existing `_ask_vlm()` inline HTTP call. Multi-query dispatch calls this, not the non-existent method.

14. **`_on_reset_ack` resets `_pose_time = 0.0`** — prevents zombie RUNNING state after reset.

15. **pose_publisher.py actually clears map files** — NOT a TODO stub. Deletes posegraph+data files so slam_toolbox restarts clean.
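
Decisions 1, 5, 6, 8, and 14 interact, so a compact sketch of how they fit together may help. This is not the plan's `slam_bridge.py`: the class name `BridgeHealth`, the method names, the injectable `clock`, and the `"starting"` / `"dead"` state strings are assumptions made for illustration. The two timeout values, the boolean flag, the locking rule, and the `_pose_time = 0.0` reset are from the list above.

```python
import threading
import time

POSE_TIMEOUT_S = 15.0  # loop closure can pause TF updates for 10-15 s
POSE_DEAD_S = 60.0     # headroom over a ~15 s Docker restart


class BridgeHealth:
    """Illustrative state holder; not the plan's SlamBridge class."""

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._clock = clock           # injectable so tests can fake time
        self._state = "starting"      # "starting" and "dead" are assumed names
        self._pose_time = 0.0         # 0.0 means "no pose since start or reset"
        self._awaiting_reset = False

    def on_pose(self) -> None:
        with self._lock:
            self._pose_time = self._clock()

    def reset(self) -> None:
        with self._lock:
            self._awaiting_reset = True   # plain flag, no sequence comparison

    def on_reset_ack(self) -> None:
        with self._lock:
            self._awaiting_reset = False
            self._pose_time = 0.0         # a stale pose must not keep us RUNNING

    def check_state(self) -> str:
        """Recompute the state from pose age; the lock spans the whole transition."""
        with self._lock:
            if self._awaiting_reset or self._pose_time == 0.0:
                self._state = "starting"
            else:
                age = self._clock() - self._pose_time
                if age > POSE_DEAD_S:
                    self._state = "dead"
                elif age > POSE_TIMEOUT_S:
                    self._state = "degraded"
                else:
                    self._state = "running"
            return self._state

    def is_healthy(self) -> bool:
        with self._lock:              # reads also go through the lock
            return self._state == "running"
```

Passing a fake `clock` makes the 15 s and 60 s thresholds testable without sleeping.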

## Files to Modify

### SLAM+Zenoh (Pi 5, Tasks 0-4):
1. `services/turbopi-server/imu.py` — Add `get_gyro_z_dps()` accessor (+4 lines)
2. `services/turbopi-server/lidar.py` — Add `get_scan_snapshot()` atomic read (+6 lines)
3. `services/turbopi-server/safety.py` — Add `os.sched_setaffinity(0, {0})` core pinning (+3 lines)
4. `services/turbopi-server/test_conversions.py` — NEW: sensor accessor + conversion tests
5. `services/ros2-slam/Dockerfile` — NEW: ROS2 Jazzy + slam_toolbox + rf2o + EKF
6. `services/ros2-slam/docker-compose.yml` — NEW: zenohd sidecar + ros2-slam, cpuset=1-3
7. `services/ros2-slam/zenoh_router_config.json5` — NEW: peers_failover_brokering=true
8. `services/ros2-slam/config/slam_toolbox_params.yaml` — NEW: 5cm, async, loop closure
9. `services/ros2-slam/config/ekf_params.yaml` — NEW: 30Hz, 3-source fusion
10. `services/ros2-slam/config/rf2o_params.yaml` — NEW: 10Hz, base_footprint
11. `services/ros2-slam/launch/slam.launch.py` — NEW: all ROS2 nodes + static TF
12. `services/ros2-slam/src/pose_publisher.py` — NEW: TF2→/slam/pose + reset handler
13. `services/turbopi-server/slam_bridge.py` — NEW: Zenoh bridge (~280 lines)
14. `services/turbopi-server/test_slam_bridge.py` — NEW: bridge unit tests (~200 lines)
15. `services/turbopi-server/main.py` — SLAM_BACKEND switch + C14 guard (+30 lines)
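
A rough sketch of what the `main.py` change in item 15 could look like. The helper names `select_backend` and `warn_on_destructive_read` are hypothetical, and raising on an unknown backend value is an assumption (the plan may prefer falling back to `hector`). The env var, its default, and the two IMU method names come from this document.

```python
import functools
import os
import warnings


def select_backend(environ=os.environ) -> str:
    """Read SLAM_BACKEND, defaulting to the existing Hector implementation."""
    backend = environ.get("SLAM_BACKEND", "hector").strip().lower()
    if backend not in ("hector", "ros2"):
        # Assumption: fail loudly rather than silently picking a backend.
        raise ValueError(f"Unknown SLAM_BACKEND: {backend!r}")
    return backend


def warn_on_destructive_read(imu) -> None:
    """Wrap imu.consume_heading_delta_deg so each call emits a DeprecationWarning."""
    original = imu.consume_heading_delta_deg

    @functools.wraps(original)
    def guarded(*args, **kwargs):
        warnings.warn(
            "consume_heading_delta_deg() is destructive; with SLAM_BACKEND=ros2 "
            "use get_heading_deg() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return original(*args, **kwargs)

    # Instance-level monkey-patch: the class and other instances are untouched.
    imu.consume_heading_delta_deg = guarded
```

The guard would only be installed when `select_backend()` returns `"ros2"`, so the default `hector` path stays unchanged.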

### Multi-Query VLM (Panda, Task 5):
16. `services/panda_nav/server.py` — SceneContext, dispatch, parsers, _vlm_request (+100 lines)
17. `services/panda_nav/tests/test_multi_query.py` — NEW: multi-query tests (~150 lines)

## Critical Gotchas (memorize before implementing)

| Gotcha | Why It Matters |
|--------|---------------|
| RPLIDAR C1 baud=460800 | Wrong baud = no lidar data |
| CW→CCW conversion: `ros_rad = -deg2rad(cw_deg)` | Wrong sign = mirrored map |
| Body frame: `ros_vx = turbopi_vy`, `ros_vy = -turbopi_vx` | Wrong axes = robot drives sideways |
| `consume_heading_delta_deg()` is DESTRUCTIVE | Only ONE consumer. SlamBridge uses `get_heading_deg()` (non-destructive) |
| Docker NEVER touches serial ports | `/dev/lidar` and `/dev/imu` are native-only |
| `peers_failover_brokering: true` in router config | Without this, native→Docker messages silently fail |
| `ROS_LOCALHOST_ONLY=1` in Docker env | Without this, DDS/Zenoh exposes ports to LAN |
| Safety daemon on core 0, Docker on cores 1-3 | Prevents loop closure CPU spike from starving obstacle detection |
| `slam.py` is NOT deleted or renamed | Keeps git blame. SLAM_BACKEND=hector is the default fallback. |
| zenoh_ros2_sdk `get_message_class()` downloads from Git | Must pre-cache (Task 0 Step 0.5) or air-gapped Pi crashes |

## Hardware Inventory

| Machine | IP | Role |
|---------|-----|------|
| **Pi 5** | 192.168.68.61 | Robot car — lidar, IMU, camera, motors, Docker SLAM |
| **Panda** | 192.168.68.57 | VLM inference — RTX 5070 Ti, 58Hz Gemma 4 E2B |
| **Titan** | 192.168.68.52 | Main brain — DGX Spark, Gemma 4 26B, Annie |

## Start Command

```
1. Read the plan: cat ~/.claude/plans/modular-leaping-teapot.md
2. Implement using superpowers:subagent-driven-development
   - SLAM (Tasks 0-4) and Multi-Query VLM (Task 5) can run as PARALLEL workstreams
   - Task 0 (pre-flight) must complete before Tasks 1-4
   - Task 6 (deploy) runs after both workstreams complete
3. All adversarial findings are already fixed in the plan code — implement exactly as written
```

## Verification

### SLAM+Zenoh (after deploying to Pi 5):
1. `cd services/turbopi-server && python -m pytest test_conversions.py test_slam_bridge.py -v` — all pass
2. `cd services/ros2-slam && docker compose build` — image builds
3. `docker compose up -d && sleep 15 && docker compose logs --tail=30` — all nodes started
4. `SLAM_BACKEND=ros2 python -m uvicorn main:app --host 0.0.0.0 --port 9090` — bridge starts
5. `curl http://localhost:9090/pose` — returns `{"state": "running", ...}`
6. Drive robot forward 1s, verify y_m changes
7. `curl http://localhost:9090/map --output /tmp/map.png` — PNG with occupancy grid
8. `curl -X POST http://localhost:9090/slam/reset` → pose resets to (0,0)
9. `docker compose down` → `SLAM_BACKEND=hector` still works (fallback)
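10. `taskset -p $(pgrep -f safety)` → reports affinity mask `1` (core 0 only). If `pgrep` matches more than one PID, pass a single PID, since `taskset -p` treats a leading extra argument as a mask to set.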
11. `docker stats` → confirms <3GB memory
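
Step 10 checks the pinning from the shell; the same check can be made from inside the process. A minimal sketch, assuming Linux (where `os.sched_setaffinity` is available) and a hypothetical helper name `pin_to_core`:

```python
import os


def pin_to_core(core: int = 0) -> set[int]:
    """Pin the calling process to one core and return the resulting affinity set.

    pid 0 means "this process". Raises OSError if the core is not available
    to the process (for example, excluded by a cpuset).
    """
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)
```

Logging the returned set at startup makes a failed pin visible without running `taskset`.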

### Multi-Query VLM (after deploying to Panda):
1. `cd services/panda_nav && python -m pytest tests/ -v` — all pass including new tests
2. `NAV_MULTI_QUERY=0` — identical behavior to before (no regression)
3. `NAV_MULTI_QUERY=1` — scene/obstacle/path data in logs and `/v1/nav/status`
4. Compare "find the red ball" with multi-query off vs on — verify no goal-tracking regression
