# Next Session: Replan SLAM + Zenoh + VLM Multi-Query, Then Implement

## What

Replan and implement the SLAM upgrade for Annie's TurboPi robot. Two work items in one plan:

1. **Phase 1: Replace HectorSLAM with slam_toolbox via Zenoh** — The existing Python HectorSLAM (`slam.py`, 721 lines) has 7.9x rotation drift and a 28% distance undercount. Replace it with ROS2 slam_toolbox running in Docker, connected via Zenoh middleware (10-21 microsecond latency, roughly 1000x lower than the 20-50 ms of the WebSocket bridge that was rejected in adversarial review).

2. **Phase 2a: Multi-query VLM pipeline** — Slot in during Phase 1. The 58 Hz VLM currently asks only "Where is the goal?" on every frame. Instead, alternate between goal-tracking, scene classification, and obstacle awareness across frames. With N tasks in strict rotation, each runs at 58/N Hz within the same 58 Hz throughput: about 19 Hz for the three tasks listed here, or ~15 Hz with a fourth. This touches different code (the NavController query pipeline) and is independent of the SLAM infrastructure.
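
The frame rotation in Phase 2a can be sketched as a plain round-robin over a task list. This is a minimal illustration, not the NavController code: `PROMPTS`, `select_task`, and `per_task_rate_hz` are hypothetical names, and every prompt string except "Where is the goal?" is a placeholder.

```python
VLM_RATE_HZ = 58.0  # total VLM throughput quoted in this document

# Hypothetical prompt table. Only the "goal" prompt comes from the current
# pipeline; the other two strings are placeholders for illustration.
PROMPTS = {
    "goal": "Where is the goal?",
    "scene": "What kind of room is this?",
    "obstacle": "Is there an obstacle ahead?",
}
TASKS = list(PROMPTS)  # dicts keep insertion order, so the rotation is stable


def select_task(cycle_count: int) -> str:
    """Round-robin dispatch: frame k is assigned to task k mod N."""
    return TASKS[cycle_count % len(TASKS)]


def per_task_rate_hz(total_hz: float = VLM_RATE_HZ,
                     n_tasks: int = len(TASKS)) -> float:
    """Each task sees every N-th frame, so its rate is total / N."""
    return total_hz / n_tasks


if __name__ == "__main__":
    for frame in range(6):
        task = select_task(frame)
        print(frame, task, PROMPTS[task])
    print(f"per-task rate: {per_task_rate_hz():.1f} Hz")
```

A strict rotation gives every task the same rate. If goal-tracking needs more than its equal share, a weighted schedule (for example, repeating `"goal"` in `TASKS`) would be one way to get it, at the cost of lower rates for the other tasks.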

## Session Workflow

```
1. Read this prompt + all research docs
2. Use superpowers to replan Phase 1 (Zenoh) + Phase 2a (multi-query) as a unified plan
3. Run /planning-with-review (mandatory adversarial review)
4. Address ALL review findings (CRITICAL/HIGH/MEDIUM must be fixed)
5. Get user green light on final plan
6. Execute the plan using parallel agents
```

## Research Assets (read ALL before planning)

### Research Documents
| Doc | Path | Key Content |
|-----|------|-------------|
| SLAM+VLM Hybrid | `docs/RESEARCH-SLAM-VLM-HYBRID.md` | 5 proposals, R1-R6 answers, rf2o+RPLIDAR C1 compatibility, Pi 5 CPU/memory budget, risk register |
| Zenoh Bridge | `docs/RESEARCH-ZENOH-SLAM-BRIDGE.md` | 3 Zenoh options, reusable code patterns, CDR encoding, router config, critical gotchas, cloned repos |
| VLM-Primary Hybrid | `docs/RESEARCH-VLM-PRIMARY-HYBRID-NAV.md` | Multi-query pipeline design, 4-tier fusion architecture, temporal smoothing, Phase 2a-2e roadmap |
| 4-Command Nav Architecture | `docs/RESEARCH-NAV-4CMD-ARCHITECTURE.md` | How the current VLM nav works: perception≠reasoning, 4-command lookup table, closed-loop IMU turns |

### Previous Plan (adversarial-reviewed, needs Zenoh rewrite)
| Item | Path | Notes |
|------|------|-------|
| Plan v1 (WebSocket) | `~/.claude/plans/iterative-baking-cook.md` | 35 adversarial findings addressed. Architecture is sound EXCEPT the WebSocket bridge — replace with Zenoh. State machine, pre-mortem, file list all reusable. |

### Cloned Repos (in vendor/ — READ the key files)
| Repo | Path | Key Files to Read |
|------|------|-------------------|
| **zenoh-demos** | `vendor/zenoh-demos/` | `ROS2/zenoh-python-lidar-plot/ros2-lidar-plot.py` — LaserScan CDR with pycdr2. THE template for publishing scans. |
| **zenoh-plugin-ros2dds** | `vendor/zenoh-plugin-ros2dds/` | `DEFAULT_CONFIG.json5` — topic filtering, frequency limits, localhost-only. Full bridge config template. |
| **zenoh_ros2_sdk (ROBOTIS)** | `vendor/zenoh-ros2-sdk/` | `examples/15_publish_imu.py` — IMU publisher. `zenoh_ros2_sdk/publisher.py` — CDR via rosbags. `zenoh_ros2_sdk/session.py` — Zenoh session management. **This SDK is the native-side solution.** |
| **rmw_zenoh** | `vendor/rmw-zenoh/` | `rmw_zenoh_cpp/config/DEFAULT_RMW_ZENOH_ROUTER_CONFIG.json5` — router config template. `docs/design.md` — key expression format. |

### Existing Code to Understand
| File | Path | What |
|------|------|------|
| **slam.py** (HectorSLAM) | `services/turbopi-server/slam.py` | 721 lines. Being REPLACED (not renamed). Multi-res grids, Gauss-Newton, IMU prior. |
| **imu.py** | `services/turbopi-server/imu.py` | Pico reader. Needs `get_gyro_z_dps()` added. `consume_heading_delta_deg()` is destructive — one consumer only. |
| **lidar.py** | `services/turbopi-server/lidar.py` | RPLIDAR C1 daemon. Needs `get_scan_snapshot()` for atomic epoch+points. CCW→CW conversion at `LIDAR_CCW=False`. |
| **odometry.py** | `services/turbopi-server/odometry.py` | OdometryHint — velocity from /drive. Non-destructive reads. |
| **safety.py** | `services/turbopi-server/safety.py` | Hailo safety daemon. Needs `os.sched_setaffinity(0, {0})` for core pinning. |
| **main.py** | `services/turbopi-server/main.py` | Endpoints: /pose, /map, /slam/reset, /imu, /scan. Startup: lidar→imu→slam→safety. SLAM_BACKEND env flag. |
| **test_slam*.py** | `services/turbopi-server/test_slam*.py` | 1052 lines across 3 files. Do NOT rename slam.py — tests stay unchanged. |
| **MentorPi EKF config** | `vendor/mentorpi/ros2_ws/src/driver/controller/config/ekf.yaml` | Reference EKF: 100Hz, odom_raw + odom_rf2o + imu fusion. |
| **MentorPi slam_toolbox config** | `vendor/mentorpi/ros2_ws/src/orchestrator_launch/config/mapper_params_online_async.yaml` | Reference: 5cm resolution, loop closure, Ceres solver. |
| **NavController** | `services/panda_nav/server.py` | Where multi-query dispatch goes (Phase 2a). Currently sends single prompt per frame. |
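
The `get_scan_snapshot()` requirement for `lidar.py` (epoch and points read atomically) could look roughly like the sketch below. `ScanBuffer`, `ScanSnapshot`, `publish`, and the `(angle_deg, distance_mm)` point layout are assumptions made for illustration. The real daemon's internals are not shown in this document.

```python
import threading
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # assumed layout: (angle_deg, distance_mm)


@dataclass(frozen=True)
class ScanSnapshot:
    """Immutable pair, so a reader can never see a new epoch with old points."""
    epoch: int
    points: Tuple[Point, ...]


class ScanBuffer:
    """Hypothetical holder for the latest complete lidar revolution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._epoch = 0
        self._points: Tuple[Point, ...] = ()

    def publish(self, points: List[Point]) -> None:
        """Called by the lidar thread once per completed revolution."""
        frozen = tuple(points)  # copy outside the lock to keep the hold time short
        with self._lock:
            self._epoch += 1
            self._points = frozen

    def get_scan_snapshot(self) -> ScanSnapshot:
        """Return epoch and points taken under a single lock acquisition."""
        with self._lock:
            return ScanSnapshot(self._epoch, self._points)
```

A consumer such as SlamBridge can compare `snapshot.epoch` with the last epoch it published and skip the frame when nothing new has arrived.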

## Key Design Decisions (already validated — carry forward)

1. **Zenoh, NOT WebSocket** — rmw_zenoh_cpp (Tier 1 in Jazzy) inside Docker + zenoh_ros2_sdk on native side. 10-21us latency vs 20-50ms WebSocket.
2. **zenoh_ros2_sdk handles CDR encoding** — uses `rosbags` library. No custom serialization code needed. `ROS2Publisher(topic="/scan", msg_type="sensor_msgs/msg/LaserScan")`.
3. **Router needs `peers_failover_brokering: true`** — GitHub issue #929. Without this, native→Docker messages silently fail.
4. **rmw_zenoh and zenoh-bridge-ros2dds are INCOMPATIBLE** — different key expressions. Pick one. We pick rmw_zenoh_cpp (Option 1).
5. **Do NOT rename slam.py** — keeps git blame, avoids 80+ test import changes. Add slam_bridge.py as new file.
6. **Safety daemon pinned to core 0** — `os.sched_setaffinity(0, {0})`. Docker cpuset 1-3.
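7. **SlamBridge uses asyncio event loop** — Single event loop with send/recv tasks + asyncio.Queue(maxsize=5). The original rationale (the websockets library is NOT thread-safe) comes from Plan v1; re-check it against the Zenoh transport chosen in decision 1.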
8. **Heading delta computed on native side** — not over network. SlamBridge tracks delta locally, sends with seq numbers.
9. **IMU initial zero guard** — check `is_healthy()` before publishing. Delay INIT→RUNNING until IMU healthy.
10. **Reset ack protocol** — `reset_seq` numbers prevent stale poses after reset.
11. **Map auto-save every 60s** — slam_toolbox serialize to Docker volume. Auto-load on startup.
12. **Multi-query VLM dispatch** — `cycle_count % N` in NavController selects which prompt to send per frame.
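
The reset ack protocol in decision 10 can be illustrated with a small gate that drops any pose tagged with an older `reset_seq`. This is a sketch under assumptions: `Pose`, `ResetGate`, and the idea that the SLAM side echoes `reset_seq` back in every pose are hypothetical, not taken from the existing plan files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pose:
    x: float
    y: float
    theta: float
    reset_seq: int  # assumed: echoed back by the SLAM side with every pose


class ResetGate:
    """Accepts only poses that were produced after the most recent reset."""

    def __init__(self) -> None:
        self._reset_seq = 0
        self._latest: Optional[Pose] = None

    @property
    def latest(self) -> Optional[Pose]:
        return self._latest

    def request_reset(self) -> int:
        """Bump the sequence number and forget the pre-reset pose."""
        self._reset_seq += 1
        self._latest = None
        return self._reset_seq  # value to send along with the reset request

    def on_pose(self, pose: Pose) -> bool:
        """Return True if the pose was accepted, False if it was stale."""
        if pose.reset_seq != self._reset_seq:
            return False  # still in flight from before the reset
        self._latest = pose
        return True
```

Until the first post-reset pose arrives, `latest` stays `None`, which gives an endpoint like `/pose` an unambiguous "no valid pose yet" state instead of a stale one.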

## Critical Gotchas (from adversarial reviews + memory)

| Gotcha | Impact | Mitigation |
|--------|--------|------------|
| RPLIDAR C1 baud=460800 | Wrong baud = no data | Native lidar daemon unchanged |
| Lidar CW angles vs. ROS2 REP-103 (CCW-positive) | Mirrored map | `ros_rad = -deg2rad(cw_deg)`, startup self-test |
| Pico REPL dropout | IMU goes dead | No serial sharing. SlamBridge only reads non-destructive methods |
| Serial port contention | Corrupt data | Docker NEVER touches serial |
| Loop closure CPU spike | Safety daemon starves | Core pinning + Docker cpuset |
| Docker `network_mode: host` exposes ports | Security | `ROS_LOCALHOST_ONLY=1` in Docker env |
| Docker Compose v2 cpuset syntax | Config fails | Use `docker run --cpuset-cpus=1-3` in systemd unit |
| Stale map PNG in DEGRADED state | Old map served as current | `_map_timestamp` staleness check, return 503 if >10s |
| `consume_heading_delta_deg()` supports only one consumer | Heading corruption | Emit DeprecationWarning when SLAM_BACKEND=ros2 |
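
The angle-convention mitigation in the table (`ros_rad = -deg2rad(cw_deg)` plus a startup self-test) might be written as below. The function names are hypothetical, and wrapping the result to [-π, π] is an added assumption rather than something the table specifies. REP-103 defines x forward, y left, and yaw positive counter-clockwise.

```python
import math


def cw_deg_to_ros_rad(cw_deg: float) -> float:
    """Convert a clockwise-positive angle in degrees to REP-103 radians.

    The negation is the formula from the gotcha table. The atan2 wrap to
    [-pi, pi] is an assumption added here for convenience.
    """
    rad = -math.radians(cw_deg)
    return math.atan2(math.sin(rad), math.cos(rad))


def startup_self_test() -> None:
    """Fail fast at startup if the conversion would mirror the map."""
    # 90 degrees clockwise from forward is the robot's right side, which is
    # -y in REP-103, so the converted angle must have a negative sine.
    right = cw_deg_to_ros_rad(90.0)
    assert math.isclose(right, -math.pi / 2, abs_tol=1e-9)
    assert math.sin(right) < 0

    # 270 degrees clockwise is the robot's left side: +y, positive yaw.
    left = cw_deg_to_ros_rad(270.0)
    assert math.isclose(left, math.pi / 2, abs_tol=1e-9)

    # Straight ahead stays straight ahead.
    assert math.isclose(cw_deg_to_ros_rad(0.0), 0.0, abs_tol=1e-9)


if __name__ == "__main__":
    startup_self_test()
    print("angle convention self-test passed")
```

Running the self-test before the first scan is published turns a silently mirrored map into an immediate startup failure.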

## Hardware Inventory

| Machine | IP | Role | Key Resources |
|---------|-----|------|---------------|
| **Pi 5** | 192.168.68.61 | Robot car | 16GB RAM, 4×A76, Hailo-8, RPLIDAR C1, Pico IMU, camera |
| **Panda** | 192.168.68.57 | Fast VLM | RTX 5070 Ti 16GB. VLM at 58Hz. 4.3GB free VRAM. |
| **Titan** | 192.168.68.52 | Main brain | DGX Spark 128GB. Gemma 4 26B. Annie voice+tools. |

## Start Command

```
Read this prompt and ALL research docs listed above.
Read the previous plan at ~/.claude/plans/iterative-baking-cook.md for the adversarial review findings.
Read the key vendor files listed above (zenoh-ros2-sdk examples, rmw-zenoh configs, zenoh-demos LaserScan pattern).

Then:
1. Use superpowers to create a unified plan for Phase 1 (SLAM+Zenoh) + Phase 2a (multi-query VLM)
2. Run /planning-with-review (mandatory adversarial review with at least 2 parallel reviewers)
3. Address ALL findings (CRITICAL/HIGH/MEDIUM must be implemented)
4. Present final plan for user approval
5. On green light: execute using parallel agents (Phase 0 prep → Phase 1 Docker → Phase 2 SlamBridge → Phase 2a multi-query → tests → deploy)
```