# Next Session: Debug rf2o Not Receiving Scans via Zenoh

## What

The Zenoh version fix is **deployed and working** — native Python (`eclipse-zenoh` 1.9.0) publishes LaserScan at 9.9 Hz to Docker ROS2 (`rmw_zenoh_cpp` built from source, zenoh 1.7.1). `ros2 topic echo /scan` and `ros2 topic hz /scan` inside Docker both receive data correctly. **But rf2o_laser_odometry's callback never fires** — it logs "Waiting for laser_scans" indefinitely.

## Evidence Collected (session 89)

| Check | Result |
|-------|--------|
| `ros2 topic hz /scan` inside Docker | 9.9 Hz (correct) |
| `ros2 topic echo /scan --once` inside Docker | Full LaserScan message received |
| `ros2 topic info /scan -v` | 1 publisher (slam_bridge), 1 subscriber (rf2o), BOTH BEST_EFFORT, type hashes match |
| `ros2 node info /rf2o` | Subscribes to `/scan`, `/base_pose_ground_truth`, `/parameter_events` |
| `ros2 node info /rf2o` warning | "There are 2 nodes in the graph with the exact name /rf2o" |
| rf2o logs | "Waiting for laser_scans...." at 10 Hz (100ms timer) |
| zenohd logs | Running, listening on tcp/192.168.68.61:7447 |
| `ZENOH_SESSION_CONFIG` | `/ros2-slam/config/zenoh_session_config.json5` (client mode → router) |
| `RMW_IMPLEMENTATION` | `rmw_zenoh_cpp` (confirmed via `env` inside container) |

## Hypotheses (ordered by likelihood)

### H1: Stale Zenoh liveliness tokens (HIGH likelihood; ruled out in session 90)
The "2 nodes named /rf2o" warning suggests the Zenoh router holds stale liveliness entries from previous container starts. When `docker compose restart` was used (instead of `down` followed by `up`), the old node's liveliness token may not have been cleaned up. The router could then be routing scan data to the STALE rf2o subscription (which no longer has a live callback) instead of the current one.

**Test:** Stop everything (turbopi-server + Docker), wait 10s, start zenohd first, wait for healthy, start ros2-slam, start turbopi-server. Check if "2 nodes" warning disappears.

### H2: rf2o subscription QoS incompatibility at rmw level (MEDIUM likelihood)
`ros2 topic info` shows matching QoS profiles at the ROS2 level. But rmw_zenoh_cpp has an internal QoS mapping that may differ between the source-built version and the node. rf2o uses `KEEP_LAST(1)` depth — if the rmw layer treats this differently than `KEEP_LAST(10)` from the publisher, messages might be dropped silently.

**Test:** Check rf2o_params.yaml for QoS overrides. Try changing publisher depth to match (1 instead of 10) in slam_bridge.py.

### H3: rf2o expects `sensor_msgs/msg/LaserScan` with specific fields populated (MEDIUM likelihood)
rf2o checks `scan.ranges.size()` in its callback — if the ranges array is empty or all-infinity, the callback returns immediately without logging. The "Waiting for laser_scans" message is from a TIMER callback, not the scan callback. The scan callback might be firing but immediately returning because the data doesn't meet rf2o's expectations.

**Test:** `ros2 topic echo /scan --field ranges --once` inside Docker — check if ranges array has valid (non-infinity) values. Also check `angle_increment` is positive and `ranges.size() > 0`.
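The checks in this test can also be scripted once a message has been captured. The following is a minimal sketch, assuming the scan has already been converted to plain Python values. `scan_problems` is a hypothetical helper (not part of rf2o or slam_bridge), and it only encodes the three criteria listed above, not rf2o's actual acceptance logic.

```python
import math
from typing import List, Sequence


def scan_problems(ranges: Sequence[float], angle_increment: float) -> List[str]:
    """Return human-readable problems found in a LaserScan-like message.

    Hypothetical helper: mirrors the manual checks in the H3 test only.
    """
    problems = []
    if len(ranges) == 0:
        problems.append("ranges is empty")
    elif not any(math.isfinite(r) for r in ranges):
        # Every value is inf or NaN, so there is nothing usable to match.
        problems.append("no finite range values")
    # Written as "not > 0" so that NaN is also reported.
    if not angle_increment > 0:
        problems.append("angle_increment is not positive")
    return problems
```

An empty list means the scan passes these basic checks, which would make H3 less likely; any returned entries point at the publisher side (slam_bridge) rather than rf2o.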

### H4: rmw_zenoh_cpp subscription doesn't match external publisher's key expression (LOW likelihood)
The source-built rmw_zenoh_cpp (1.7.1) may use a different key expression format than the `zenoh_ros2_sdk` publisher. `ros2 topic list` shows `/scan` exists, but the internal Zenoh key might differ (e.g., `slam_bridge/scan` vs `ros2_slam/scan`).

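**Test:** List the keys actually in use on the router. Note that `z_scout` only discovers routers and peers, not keys, so subscribe to `**` instead (e.g., with the `z_sub` example from the Zenoh examples, `-k '**'`) or check zenohd's debug logs.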

### H5: Docker container's rmw_zenoh_cpp session connects to router but subscription is on wrong session (LOW likelihood)
The container sources both `/rmw_ws/install/setup.bash` and has `ZENOH_SESSION_CONFIG`. If the session config isn't being picked up correctly, the node might create its own peer session instead of connecting to the router.

**Test:** Check zenohd logs for client connection count. Should show 2+ clients (ros2-slam nodes + slam_bridge).

## Docker + Service State on Pi 5

- **zenohd container:** Running, healthy, port 7447
- **ros2-slam container:** Running, rf2o + slam_toolbox + EKF + pose_publisher all started
- **turbopi-server:** Running with `SLAM_BACKEND=ros2`, slam_bridge publishing at 10 Hz
- **systemd override:** `/etc/systemd/system/turbopi-server.service.d/override.conf` has `SLAM_BACKEND=ros2`

## Files Involved

| File | Purpose |
|------|---------|
| `services/ros2-slam/docker-compose.yml` | Container orchestration, zenohd + ros2-slam |
| `services/ros2-slam/launch/slam.launch.py` | ROS2 node graph, rf2o remapping `laser_scan → /scan` |
| `services/ros2-slam/config/rf2o_params.yaml` | rf2o parameters (check for QoS, frame_id, topic overrides) |
| `services/ros2-slam/config/zenoh_session_config.json5` | rmw_zenoh_cpp session config (client → router) |
| `services/turbopi-server/slam_bridge.py` | Native Zenoh publisher (scan QoS: BEST_EFFORT, depth 10) |

## Research Completed (session 90)

**ROOT CAUSE IDENTIFIED — see `docs/RESEARCH-RF2O-ZENOH-CALLBACK.md` for full analysis.**

H1 (stale liveliness tokens) is RULED OUT. The root cause is that rf2o's zero-timeout `spin_some()` polling loop is incompatible with rmw_zenoh_cpp's asynchronous data delivery. In addition, the env var name is wrong: `ZENOH_SESSION_CONFIG` must be `ZENOH_SESSION_CONFIG_URI`.
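The env var mix-up is easy to reintroduce, so a small guard may be worth running inside the container. This is a minimal sketch; `check_zenoh_env` is a hypothetical helper and the variable names are taken from the finding above.

```python
import os
from typing import List, Mapping

EXPECTED_VAR = "ZENOH_SESSION_CONFIG_URI"  # name given in the session 90 findings
WRONG_VAR = "ZENOH_SESSION_CONFIG"         # name currently set in docker-compose.yml


def check_zenoh_env(env: Mapping[str, str]) -> List[str]:
    """Return warnings about the Zenoh session config variables in env."""
    warnings = []
    if WRONG_VAR in env:
        warnings.append(f"{WRONG_VAR} is set but is not the expected name")
    if EXPECTED_VAR not in env:
        warnings.append(f"{EXPECTED_VAR} is not set")
    return warnings


if __name__ == "__main__":
    # Check the real process environment when run as a script.
    for line in check_zenoh_env(os.environ) or ["Zenoh session env looks OK"]:
        print(line)
```

Running it in the ros2-slam container before and after the docker-compose.yml change should show the warnings disappear.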

## Start Command (IMPLEMENTATION SESSION)

```bash
# 1. Read docs/RESEARCH-RF2O-ZENOH-CALLBACK.md for full root cause analysis
# 2. Implement the 3 fixes below, then rebuild + deploy

# Fix 1: Patch rf2o main loop (create patch file in services/ros2-slam/)
# Replace spin_some() loop with spin() + create_wall_timer() for process()
# Apply patch in Dockerfile BEFORE colcon build step

# Fix 2: Fix env var in docker-compose.yml
# ZENOH_SESSION_CONFIG → ZENOH_SESSION_CONFIG_URI

# Fix 3: Fix rf2o_params.yaml namespace
# rf2o_laser_odometry_node: → rf2o:

# Then rebuild + deploy:
# git add + commit + push
ssh pi "cd ~/workplace/her/her-os && git pull"
ssh pi "cd ~/workplace/her/her-os/services/ros2-slam && sudo docker compose down"
ssh pi "cd ~/workplace/her/her-os/services/ros2-slam && sudo docker compose build --no-cache"
ssh pi "cd ~/workplace/her/her-os/services/ros2-slam && sudo docker compose up -d"
ssh pi "sudo systemctl restart turbopi-server"
# Wait 15s for slam_bridge to connect
curl -s -H "Authorization: Bearer 8cX80yIBws1PfBjFuvPz0k9egPSZD0LvS02oUD6ijfg" http://192.168.68.61:8080/pose
```
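Fix 3 matters because ROS 2 parameter files are keyed by node name (or the `/**` wildcard), with the values nested under `ros__parameters`. A block keyed by a name that does not match the running node is not applied to it. The sketch below checks an already-parsed params file. `params_target_node` is a hypothetical helper, and the matching is deliberately simplified (no namespaces or partial wildcards).

```python
from typing import Any, Mapping


def params_target_node(params: Mapping[str, Any], node_name: str) -> bool:
    """True if a top-level key names node_name (or /**) and holds ros__parameters.

    Simplified assumption: ignores node namespaces and partial wildcards.
    """
    for key, body in params.items():
        matches = key == "/**" or key.lstrip("/") == node_name
        if matches and isinstance(body, dict) and "ros__parameters" in body:
            return True
    return False
```

With the node registered as `/rf2o` (per `ros2 node info`), a file keyed by `rf2o_laser_odometry_node:` fails this check, while one keyed by `rf2o:` passes.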

## Verification (when fixed)

1. rf2o stops logging "Waiting for laser_scans" — logs scan processing instead
2. `ros2 topic hz /odom_rf2o` shows ~10 Hz output from rf2o
3. `curl /pose` returns `state: running` with valid x_m, y_m, heading_deg
4. Drive robot forward → y_m changes in repeated `/pose` calls
5. `curl /map -o /tmp/map.png` → valid PNG occupancy grid
