# Next Session: Fix slam_toolbox MessageFilter Dropping All Scans

## What

slam_toolbox, running in Docker on the Pi 5, drops ALL scans: its MessageFilter cannot resolve the `odom → laser_frame` TF at scan timestamps. Deep research (5 agents across the codebase, the MentorPi vendor reference, official repos, and community issues) plus adversarial review (2 reviewers, 9 CRITICAL + 8 HIGH findings, ALL addressed) identified the most likely root cause chain and **8 fixes**. The chain is still unconfirmed until the Phase 1 diagnostics have been run.

**Most likely root cause chain**: (1) an IMU frame_id mismatch causes the EKF to silently drop ALL IMU data, (2) a possible rmw_zenoh deadlock (#921) freezes the EKF timer thread, (3) a /tf_static transient_local race (#263) means slam_toolbox's TF buffer never receives the base_footprint→laser_frame transform.

## Plan

`~/.claude/plans/mellow-crunching-nebula.md` — **Read this first.** It has the full implementation with all adversarial review findings addressed, state machines, pre-mortem, and verification steps.

## Key Design Decisions (from adversarial review)

These decisions were CHANGED by adversarial review. Do not revert them:

1. **DIAGNOSE FIRST** — Run Phase 1 diagnostic commands (5 min SSH to Pi 5) before implementing ANY fix. Do not implement 8 fixes simultaneously against unconfirmed root causes.

2. **IMU frame_id "base_link" → "base_footprint"** (Fix 1) — slam_bridge publishes IMU with wrong frame_id. No `base_link → base_footprint` TF exists anywhere in the system. robot_localization silently drops ALL IMU data. This is a one-line fix in `slam_bridge.py` line ~395.

3. **Keep static_transform_publisher + add periodic supplement** (Fix 3) — Do NOT remove the existing `static_transform_publisher`. Add a 5 Hz periodic publisher on volatile `/tf` alongside it. Publishing the same frame pair on both /tf_static and /tf is safe — TF2 deduplicates identical frame pairs.

4. **`scan_queue_size` does NOT exist in slam_toolbox** — the MessageFilter queue is hardcoded in C++ source. Adding it to YAML is dead config that will be silently ignored. Do NOT add it.

5. **`dynamic_process_noise_covariance` does NOT exist in robot_localization** — not a valid parameter. Do NOT add it.

6. **Pin specific rmw_zenoh commit, NOT HEAD** (Fix 5) — "latest jazzy HEAD" voids Session 89's wire compatibility verification with native eclipse-zenoh 1.9.0. Must identify specific SHA for the #921 deadlock fix and verify wire compat.

7. **Entrypoint rewrite** (Fix 4) — lifecycle activation in FOREGROUND (not backgrounded `&`), gated on `tf2_echo odom laser_frame` resolving, pose_publisher in restart loop. Current entrypoint has 3 race conditions that the review identified.

8. **Reset protocol must use lifecycle transitions** (Fix 6): the current `_on_reset_cmd` only deletes files. It must run deactivate→cleanup→delete files→configure→activate on slam_toolbox so that the in-memory pose graph is cleared as well.

9. **zenohd healthcheck** (Fix 7) — add `pgrep -f rmw_zenohd` alongside TCP probe to prevent false positives from stale processes on port 7447.

10. **ROS_LOCALHOST_ONLY** (Fix 8): remove `ENV ROS_LOCALHOST_ONLY=1` from the Dockerfile (docker-compose is the source of truth) and add `ROS_LOCALHOST_ONLY=0` to the zenohd service env. This fixes an undocumented mismatch.
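
Decision 2 rests on one property: a sensor's frame_id is only usable if it is connected to the rest of the TF tree. The sketch below illustrates that check in plain Python, with no ROS dependency. The edge list is copied from the architecture diagram in this document; the function name `frames_connected` is hypothetical and is not part of robot_localization or tf2.

```python
from collections import defaultdict, deque


def frames_connected(edges, source, target):
    """Return True if `target` is reachable from `source` in the TF tree.

    TF lookups can walk an edge in either direction, so the tree is
    treated as an undirected graph here.
    """
    graph = defaultdict(set)
    for parent, child in edges:
        graph[parent].add(child)
        graph[child].add(parent)

    seen = {source}
    queue = deque([source])
    while queue:
        frame = queue.popleft()
        if frame == target:
            return True
        for neighbour in graph[frame] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return False


# Edges taken from this document's architecture diagram.
TF_EDGES = [
    ("map", "odom"),
    ("odom", "base_footprint"),
    ("base_footprint", "laser_frame"),
]

# The old IMU frame_id has no path into the tree, so its data is unusable.
print(frames_connected(TF_EDGES, "base_link", "base_footprint"))  # False
# The chain the MessageFilter needs is present once the tree is published.
print(frames_connected(TF_EDGES, "odom", "laser_frame"))  # True
```

With the IMU published as `base_footprint`, no transform is needed at all, which is why Fix 1 changes the frame_id instead of adding a `base_link → base_footprint` TF.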

## Files to Modify (ordered by implementation sequence)

| # | File | Fix | Change |
|---|------|-----|--------|
| 1 | `services/turbopi-server/slam_bridge.py` | Fix 1 | IMU frame_id: `self._make_header("base_link")` → `self._make_header("base_footprint")` (line ~395 in `_publish_imu`) |
| 2 | `services/ros2-slam/config/ekf_params.yaml` | Fix 2 | `frequency: 30.0` → `50.0`, `transform_timeout: 0.2` → `0.5`. Do NOT add `dynamic_process_noise_covariance`. |
| 3 | `services/ros2-slam/src/periodic_static_tf.py` | Fix 3 | NEW file — 5 Hz periodic TF publisher (base_footprint→laser_frame on /tf volatile). Uses `rclpy.spin()` (NOT spin_some). Pattern from `pose_publisher.py`. Code sketch is in the plan. |
| 4 | `services/ros2-slam/launch/slam.launch.py` | Fix 3 | Add `periodic_static_tf` as a Python node. KEEP existing `base_to_laser` static_transform_publisher. |
| 5 | `services/ros2-slam/src/pose_publisher.py` | Fix 6 + M10 | Rewrite `_on_reset_cmd`: deactivate→cleanup→delete files→configure→activate slam_toolbox via lifecycle transitions. Also fix logger `%s` → f-string (line 104). |
| 6 | `services/ros2-slam/Dockerfile` | Fix 4 + Fix 8 | Rewrite entrypoint: TF gate (`tf2_echo odom laser_frame`), foreground lifecycle activation, pose_publisher restart loop, failure logging. Remove `ENV ROS_LOCALHOST_ONLY=1` (line 67). |
| 7 | `services/ros2-slam/docker-compose.yml` | Fix 7 + Fix 8 | zenohd healthcheck: add `pgrep -f rmw_zenohd` check. Add `ROS_LOCALHOST_ONLY=0` to zenohd env. |
| 8 | `services/ros2-slam/Dockerfile` (line 23) | Fix 5 | CONDITIONAL: Update rmw_zenoh `git checkout` to specific SHA with #921 deadlock fix. Only if SHA identified + wire compat verified. |
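
The two-part healthcheck from Fix 7 can be reasoned about as a simple AND of two probes. The following is a minimal Python sketch of that logic, not the actual docker-compose healthcheck. Port 7447 and the `rmw_zenohd` pattern come from this document; the function names and the injectable probe parameters are assumptions made for illustration and testing.

```python
import socket
import subprocess


def port_open(host, port, timeout_s=1.0):
    """TCP probe: True if something accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def process_running(pattern):
    """True if `pgrep -f pattern` finds at least one matching process."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # pgrep is not installed, so the process check cannot pass.
        return False
    return result.returncode == 0


def zenohd_healthy(host="127.0.0.1", port=7447,
                   probe_port=port_open, probe_process=process_running):
    """Healthy only if the router process exists AND the port answers."""
    return probe_process("rmw_zenohd") and probe_port(host, port)


# A stale process holding port 7447 passes the TCP probe but fails the check.
print(zenohd_healthy(probe_port=lambda host, port: True,
                     probe_process=lambda pattern: False))  # False
```

One caveat worth checking when this is written as a shell healthcheck: `pgrep -f` matches full command lines, so a wrapping `sh -c "pgrep -f rmw_zenohd && ..."` may match its own shell. A pattern such as `[r]mw_zenohd` is a common way to avoid that.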

## Architecture Context

```
Native Pi 5 (turbopi-server)              Docker (ros2-slam)
┌────────────────────┐                    ┌─────────────────────────┐
│ slam_bridge.py     │                    │                         │
│  └─ /scan (pub)  ──┼── Zenoh router ──→│ rf2o (/scan sub)        │
│  └─ /odom_raw (pub)┼── Zenoh router ──→│   └─ /odom_rf2o (pub)   │
│  └─ /imu (pub)   ──┼── Zenoh router ──→│                         │
│                    │                    │ EKF (50 Hz)              │
│                    │                    │  └─ /odom_rf2o (sub)     │
│                    │                    │  └─ /odom_raw (sub)      │
│                    │                    │  └─ /imu (sub)           │
│                    │                    │  └─ /tf (pub odom→bf)    │
│                    │                    │                         │
│                    │                    │ static_transform_pub     │
│                    │                    │  └─ /tf_static (pub)     │
│                    │                    │ periodic_static_tf (NEW) │
│                    │                    │  └─ /tf (pub bf→laser)   │
│                    │                    │                         │
│                    │                    │ slam_toolbox (lifecycle)  │
│                    │                    │  └─ /scan (sub)          │
│                    │                    │  └─ MessageFilter        │
│                    │                    │  └─ /tf (pub map→odom)   │
│                    │                    │                         │
│                    │                    │ pose_publisher            │
│  /slam/pose (sub) ←┼── Zenoh router ──←│  └─ /slam/pose (pub)    │
│  └─ /pose endpoint│                    │  └─ lifecycle reset      │
└────────────────────┘                    └─────────────────────────┘
```

## Start Command

```bash
# 1. Read the adversarial-reviewed plan (full implementation details + code sketches):
cat ~/.claude/plans/mellow-crunching-nebula.md

# 2. Run Phase 1 diagnostics FIRST (5 min, SSH to Pi 5):

# Is EKF publishing TF continuously? (should see ~30 Hz with the current `frequency: 30.0`; 50 Hz applies only after Fix 2)
ssh pi "sudo docker exec ros2-slam-ros2-slam-1 bash -c 'source /opt/ros/jazzy/setup.bash && source /rmw_ws/install/setup.bash && source /ros2_ws/install/setup.bash && timeout 10 ros2 topic hz /tf 2>&1'"

# EKF diagnostics — does it show receiving IMU measurements?
ssh pi "sudo docker exec ros2-slam-ros2-slam-1 bash -c 'source /opt/ros/jazzy/setup.bash && source /rmw_ws/install/setup.bash && source /ros2_ws/install/setup.bash && timeout 5 ros2 topic echo /diagnostics 2>&1'"

# Can /tf_static be received via transient_local replay?
ssh pi "sudo docker exec ros2-slam-ros2-slam-1 bash -c 'source /opt/ros/jazzy/setup.bash && source /rmw_ws/install/setup.bash && source /ros2_ws/install/setup.bash && timeout 5 ros2 topic echo /tf_static --qos-reliability reliable --qos-durability transient_local --once 2>&1'"

# Docker logs — any errors?
ssh pi "sudo docker logs ros2-slam-ros2-slam-1 2>&1 | grep -i 'ekf\|error\|warn\|exception\|base_link\|frame' | tail -40"

# 3. Based on diagnostic results, implement fixes 1-8 in order
# 4. Deploy (see Deploy Process below)
# 5. Run all verification steps
```

## Deploy Process

```bash
# 1. Commit
git add services/ros2-slam/ services/turbopi-server/slam_bridge.py
git commit -m "fix(slam): MessageFilter fix — IMU frame, EKF tuning, TF supplement, entrypoint rewrite, reset protocol, healthcheck"

# 2. Push + pull on Pi 5
git push origin main
ssh pi "cd ~/workplace/her/her-os && git pull"

# 3. Docker rebuild (builder cache should hit for rmw_zenoh, so only the ~3 min runtime layer rebuilds; expect a full rmw_zenoh rebuild if Fix 5 changed the pinned SHA)
ssh pi "cd ~/workplace/her/her-os/services/ros2-slam && sudo docker compose build"

# 4. Restart containers
ssh pi "cd ~/workplace/her/her-os/services/ros2-slam && sudo docker compose down && sudo docker compose up -d"

# 5. Restart turbopi-server (for slam_bridge.py IMU frame fix)
ssh pi "sudo systemctl restart turbopi-server"
```

## Verification

### Phase 1: EKF Health (after Fix 1 + Fix 2)
1. `ros2 topic echo /diagnostics` — EKF shows IMU measurements being fused (not silently dropped)
2. `ros2 topic hz /tf` — shows ≥40 Hz continuously (50 Hz EKF - jitter)
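
The rate check in step 2 can be automated by parsing the `average rate:` lines that `ros2 topic hz` prints. This is a minimal sketch: the 40 Hz threshold comes from this document, the helper names are hypothetical, and the sample text is illustrative rather than captured from the robot.

```python
import re

# Matches the "average rate: <float>" lines printed by `ros2 topic hz`.
RATE_RE = re.compile(r"average rate:\s*([0-9]+(?:\.[0-9]+)?)")


def average_rates(hz_output):
    """Extract every reported average rate (Hz) from `ros2 topic hz` output."""
    return [float(value) for value in RATE_RE.findall(hz_output)]


def rate_ok(hz_output, min_hz=40.0):
    """True only if a rate was reported and no sample fell below min_hz."""
    rates = average_rates(hz_output)
    return bool(rates) and min(rates) >= min_hz


# Illustrative text only, not captured from the robot.
SAMPLE = """average rate: 49.712
\tmin: 0.018s max: 0.023s std dev: 0.00110s window: 51
average rate: 48.903
\tmin: 0.017s max: 0.026s std dev: 0.00135s window: 101
"""

print(average_rates(SAMPLE))       # [49.712, 48.903]
print(rate_ok(SAMPLE))             # True
print(rate_ok("no new messages"))  # False
```

Using the minimum instead of the mean makes the check fail if the rate dips at any point, which matches the "continuously" requirement above.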

### Phase 2: TF Supplement (after Fix 3)
3. `ros2 topic echo /tf` — shows base_footprint→laser_frame transforms interspersed with odom→base_footprint

### Phase 3: Entrypoint (after Fix 4)
4. Docker logs show "TF chain odom→laser_frame resolved" BEFORE "slam_toolbox activated"
5. Kill pose_publisher inside container — it restarts automatically within 2s
6. Docker logs show "slam_toolbox activated" (not timed out)
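
The ordering that steps 4 and 6 verify (TF gate first, then foreground activation) can be sketched as follows. The real entrypoint is a shell script in the Dockerfile; this Python version only illustrates the control flow. The two success log strings are the ones this document expects in the Docker logs; the function names, the failure messages, the timeout values, and the injected `tf_ready` and `activate` callables are assumptions. In practice `tf_ready` would wrap `tf2_echo odom laser_frame` and `activate` would perform the lifecycle activation.

```python
import time


def wait_until(probe, timeout_s=60.0, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def start_slam(tf_ready, activate, log=print,
               timeout_s=60.0, interval_s=1.0, sleep=time.sleep):
    """Activate slam_toolbox only after the TF chain resolves."""
    if not wait_until(tf_ready, timeout_s, interval_s, sleep=sleep):
        log("TF chain odom→laser_frame not resolved; slam_toolbox left inactive")
        return False
    log("TF chain odom→laser_frame resolved")
    if not activate():
        log("slam_toolbox activation failed")
        return False
    log("slam_toolbox activated")
    return True


# Simulated run: the TF chain resolves on the third poll.
attempts = iter([False, False, True])
start_slam(lambda: next(attempts), lambda: True, sleep=lambda seconds: None)
```

Because activation runs in the foreground after the gate, a failed gate or a failed activation shows up in the logs instead of being lost in a background job.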

### Phase 4: Reset Protocol (after Fix 6)
7. Publish reset command → slam_toolbox deactivates + re-activates (logs confirm)
8. After reset, `/pose` returns (0,0) — not a stale pre-reset position
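
The reset order from Fix 6 can be expressed as a fixed step list that stops at the first failure. This is a sketch of the sequencing only, not the plan's implementation: `run_reset` and its callables are hypothetical. In the real node, `change_state` would presumably call slam_toolbox's lifecycle `change_state` service, and `delete_files` would be the deletion the current `_on_reset_cmd` already performs.

```python
# Order taken from this document: files are deleted while the node is
# unconfigured, between cleanup and configure.
RESET_STEPS = ("deactivate", "cleanup", "delete_files", "configure", "activate")


def run_reset(change_state, delete_files, log=print):
    """Run the reset steps in order and stop at the first failure.

    change_state(name) and delete_files() must each return True on success.
    Returns the list of steps that completed.
    """
    done = []
    for step in RESET_STEPS:
        ok = delete_files() if step == "delete_files" else change_state(step)
        if not ok:
            log(f"reset aborted at step: {step}")
            break
        done.append(step)
    return done


# Simulated run in which every step succeeds.
calls = []
done = run_reset(
    change_state=lambda name: calls.append(name) or True,
    delete_files=lambda: calls.append("delete_files") or True,
)
print(done)
# ['deactivate', 'cleanup', 'delete_files', 'configure', 'activate']
```

Returning the completed steps lets the caller log exactly where a reset stopped, which is what step 7 checks in the logs. A reset that aborts part-way leaves slam_toolbox in an intermediate lifecycle state, so the caller should treat a short list as an error.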

### Phase 5: E2E SLAM
9. **No "queue is full"**: slam_toolbox stops logging MessageFilter drops
10. **map→odom TF**: `ros2 topic echo /tf` shows map→odom transform from slam_toolbox
11. **Pose endpoint**: `curl http://pi:8080/pose` returns valid `x_m`, `y_m`, `heading_deg`
12. **Map endpoint**: `curl http://pi:8080/map -o /tmp/map.png` returns valid PNG
13. **Drive test**: Move robot forward → `/pose` y_m changes in repeated calls
14. **Stability**: No deadlocks or TF drops after 5 minutes of continuous operation

## Research References

- slam_toolbox #516: MessageFilter queue size 1 drops with TF latency (hardcoded in C++)
- slam_toolbox #794, #806: Queue full — resolved by fixing TF tree
- rmw_zenoh #263: transient_local race condition (PR #269 fix, theoretical race remains)
- rmw_zenoh #921: Deadlock in rmw_wait (confirmed, fixed — check if in our pinned commit)
- robot_localization source: `publish_tf: true` publishes every timer tick after initialization
- robot_localization source: uses `SensorDataQoS` (BEST_EFFORT) for odometry/IMU subs
- MentorPi vendor: EKF at 100 Hz, native DDS, robot_state_publisher for static TFs
- MentorPi EKF config: `vendor/mentorpi/ros2_ws/src/driver/controller/config/ekf.yaml`

## Existing Code to Reuse

- `services/ros2-slam/src/pose_publisher.py` — pattern for Python ROS2 node with timer + rclpy.spin()
- `vendor/mentorpi/ros2_ws/src/driver/controller/config/ekf.yaml` — reference EKF config (100 Hz, working)
- `services/ros2-slam/patches/rf2o-spin-fix.patch` — pattern for patching ROS2 nodes for rmw_zenoh
