# Next Session: Debug slam_toolbox Message Filter Dropping All Scans

## What

slam_toolbox (async_slam_toolbox_node, ROS2 Jazzy, lifecycle node) drops **every** scan with "Message Filter dropping message: frame 'laser_frame' ... discarding message because the queue is full". It never processes a single scan and therefore never publishes the `map → odom` TF, which blocks the whole downstream chain: pose_publisher → `/slam/pose` → slam_bridge → `/pose` endpoint.

## What IS Working (verified session 90)

| Component | Status | Evidence |
|---|---|---|
| slam_bridge → `/scan` via Zenoh | 9.9 Hz | `ros2 topic hz /scan` inside Docker |
| rf2o receives scans + publishes `/odom_rf2o` | 9.7 Hz | Docker logs: "execution time: 0.3ms" at 10 Hz |
| EKF publishes `odom → base_footprint` TF | Working | `ros2 topic echo /tf --once` shows valid transform |
| EKF publishes `/odometry/filtered` | Working | `ros2 topic echo /odometry/filtered --once` shows valid data |
| static TF `base_footprint → laser_frame` | Published | static_transform_publisher running, `tf2_echo` resolves it |
| `tf2_echo odom laser_frame` | **Resolves** | Full TF chain works from a fresh process |
| slam_toolbox lifecycle auto-activation | Working | Entrypoint logs "slam_toolbox activated" |

## What Is NOT Working

slam_toolbox's `tf2_ros::MessageFilter<LaserScan>` drops every scan. The filter subscribes to `/scan`, queues incoming messages, and waits for `canTransform(odom_frame, scan.header.frame_id, scan.header.stamp)` to return true before passing them to slam_toolbox's processing callback. When the queue (size 10) fills up, the oldest unresolved scan is evicted with "queue is full".
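To make the drop behavior concrete, here is a minimal toy model of the queue logic described above. It is a sketch, not the tf2_ros implementation: `ToyMessageFilter` and its `can_transform` callback are hypothetical names, and the real filter also reacts to incoming TF data and timeouts.

```python
from collections import deque
from typing import Callable, Deque, List


class ToyMessageFilter:
    """Toy model of a TF-gated scan queue (hypothetical, not tf2_ros code)."""

    def __init__(self, can_transform: Callable[[float], bool], queue_size: int = 10):
        self.can_transform = can_transform
        self.queue_size = queue_size
        self.pending: Deque[float] = deque()
        self.delivered: List[float] = []
        self.dropped: List[float] = []

    def add(self, stamp: float) -> None:
        # When the queue is already full, the oldest unresolved scan is evicted.
        if len(self.pending) >= self.queue_size:
            self.dropped.append(self.pending.popleft())
        self.pending.append(stamp)
        self.flush()

    def flush(self) -> None:
        # Deliver every pending scan whose transform has become available.
        still_waiting: Deque[float] = deque()
        for stamp in self.pending:
            if self.can_transform(stamp):
                self.delivered.append(stamp)
            else:
                still_waiting.append(stamp)
        self.pending = still_waiting


if __name__ == "__main__":
    # Model the current failure: the transform never resolves.
    stuck = ToyMessageFilter(can_transform=lambda stamp: False, queue_size=10)
    for i in range(25):  # 2.5 s of scans at 10 Hz
        stuck.add(i * 0.1)
    print(len(stuck.delivered), len(stuck.pending), len(stuck.dropped))
```

With a transform that never resolves, the first 10 scans fill the queue and every later scan evicts one, which matches the observed pattern of one "queue is full" message per incoming scan.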

**Key paradox**: `tf2_echo odom laser_frame` resolves fine from a fresh `docker exec` process, but slam_toolbox's internal MessageFilter never resolves.

## Evidence Collected (session 90)

| Check | Result |
|---|---|
| `ros2 topic hz /odom_rf2o` | 9.7 Hz (rf2o working) |
| `ros2 topic echo /tf --once` | Shows odom→base_footprint from EKF |
| `ros2 topic hz /tf` (3s timeout) | **No output** — TF may publish rarely |
| `tf2_echo odom laser_frame` | Resolves after ~1s warmup |
| `tf2_echo map odom` | "frame does not exist" — slam_toolbox never publishes map frame |
| Deactivate + re-activate slam_toolbox | Same "queue is full" after re-activation |
| Set `transform_timeout: 2.0` via `ros2 param set` | Same "queue is full" |
| Publish static TF as volatile on `/tf` (workaround) | Same "queue is full" |
| zenoh_router_config.json5 `timestamping: enabled: true` | Added (commit d579373), rebuilt, still "queue is full" |

## Hypotheses (ordered by likelihood)

### H1: EKF TF publish rate is near-zero (HIGH)
`ros2 topic hz /tf` returned no rate over 3 seconds despite `ros2 topic echo /tf --once` returning a single transform. If the EKF publishes TF very infrequently (e.g., only on first measurement), slam_toolbox's message filter has stale or no TF data to interpolate with.

**Test**: Run `ros2 topic hz /tf` with a longer timeout (10s). If the rate is still 0, EKF is not publishing continuously. Confirm `print_diagnostics: true` is set in ekf_params.yaml, then run `ros2 topic echo /diagnostics` to see EKF internal state. Also check whether EKF is receiving `/odom_rf2o` and `/odom_raw`; its subscriptions may be hitting the same rmw_zenoh callback problem that affected rf2o.

**If confirmed**: EKF (robot_localization) uses `rclcpp::spin()` internally, so it should work with rmw_zenoh and the rf2o `spin_some` bug should not apply. Before changing anything, verify that it is actually processing data rather than stuck.
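Because `/tf` carries transforms from every dynamic publisher, the rate that matters for H1 is the rate of the `odom → base_footprint` edge specifically. The sketch below computes a per-edge rate from captured samples. The `(stamp, parent, child)` tuple format and the function name are assumptions for illustration; how the samples are captured (for example by parsing `ros2 topic echo /tf` output) is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


def edge_rates_hz(samples: Iterable[Tuple[float, str, str]]) -> Dict[Edge, float]:
    """Return the average publish rate per (parent, child) TF edge.

    samples: (stamp_seconds, parent_frame, child_frame) tuples.
    An edge seen only once gets a rate of 0.0.
    """
    stamps: Dict[Edge, List[float]] = defaultdict(list)
    for stamp, parent, child in samples:
        stamps[(parent, child)].append(stamp)

    rates: Dict[Edge, float] = {}
    for edge, ts in stamps.items():
        ts.sort()
        span = ts[-1] - ts[0]
        # N samples cover N-1 intervals.
        rates[edge] = (len(ts) - 1) / span if span > 0 else 0.0
    return rates


if __name__ == "__main__":
    captured = [
        (100.0, "odom", "base_footprint"),  # single sample: consistent with H1
    ]
    for edge, rate in edge_rates_hz(captured).items():
        print(edge, f"{rate:.1f} Hz")
```

A rate of 0.0 for `("odom", "base_footprint")` over a 10 s capture would support H1; a rate near the EKF's configured frequency would rule it out.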

### H2: EKF not receiving input data via rmw_zenoh (HIGH)
The EKF subscribes to `/odom_rf2o` (from rf2o inside Docker) and `/odom_raw` + `/imu` (from slam_bridge via Zenoh). If EKF's subscriptions don't receive data, it won't publish TF. robot_localization uses `rclcpp::spin()` properly, so the rf2o `spin_some` bug shouldn't apply. But verify:

**Test**: `ros2 topic echo /odom_raw --once` and `ros2 topic echo /imu --once` inside Docker — check if slam_bridge data reaches the Docker ROS2 network. Also `ros2 topic info /odom_rf2o -v` to see if EKF is subscribed.

### H3: Scan timestamps out of TF buffer range (MEDIUM)
The scan timestamp comes from Python `time.time()` on the native Pi. The TF timestamp comes from EKF's ROS2 clock inside Docker. Both should read the same system clock (host networking). But if the TF stamps consistently fall behind the scan stamps (e.g., EKF publishes one TF at startup, then stops), the message filter can't interpolate, because it needs TF samples on BOTH sides of the scan's timestamp.

**Test**: Log timestamps from both `/scan` and `/tf` simultaneously. Check if scan stamps are always AHEAD of the latest TF stamp (meaning the pipeline round-trip scan→rf2o→EKF→TF is too slow for real-time matching).
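Once both sets of timestamps are logged, the comparison can be sketched as below. This is a simplified model of the bracketing requirement for the dynamic `odom → base_footprint` edge only; `classify_scan_stamp` and its return labels are hypothetical, and static transforms are ignored because they are valid at any time.

```python
from typing import Sequence


def classify_scan_stamp(scan_stamp: float, tf_stamps: Sequence[float]) -> str:
    """Classify a scan stamp against the TF samples currently buffered.

    Returns one of:
      "no_tf"         - no TF samples at all
      "before_buffer" - scan is older than the oldest TF sample
      "ahead_of_tf"   - scan is newer than the newest TF sample
      "interpolable"  - scan is bracketed by TF samples
    """
    if not tf_stamps:
        return "no_tf"
    oldest, newest = min(tf_stamps), max(tf_stamps)
    if scan_stamp < oldest:
        return "before_buffer"
    if scan_stamp > newest:
        return "ahead_of_tf"
    return "interpolable"


if __name__ == "__main__":
    tf_log = [100.00]  # EKF published once at startup, then stopped
    scan_log = [100.10, 100.20, 100.30]
    for stamp in scan_log:
        print(stamp, classify_scan_stamp(stamp, tf_log))
```

If every logged scan comes back as `ahead_of_tf` and never becomes `interpolable` as later TF samples arrive, the problem is stale TF rather than a clock offset.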

### H4: transient_local still broken for /tf_static (MEDIUM)
Even with `timestamping: enabled: true`, rmw_zenoh_cpp's `AdvancedSubscriber` recovery mechanism may not work for the static TF. The static TF publisher publishes ONCE on `/tf_static` with transient_local durability. If slam_toolbox's TF listener (created during `on_configure`) doesn't receive the replayed message, it will never have the `base_footprint → laser_frame` transform.

**Test**: Inside the Docker container, run:
```bash
ros2 topic echo /tf_static --once --qos-reliability reliable --qos-durability transient_local
```
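If this hangs (no output), transient_local replay is broken. The fix would then be to NOT use static_transform_publisher and instead run a periodic transform publisher on `/tf`. Note that session 90 already tried publishing the static TF as volatile on `/tf` and still saw "queue is full", so this may not be sufficient on its own.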
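A periodic publisher for `base_footprint → laser_frame` could look roughly like the rclpy sketch below. The node name, the 20 Hz rate, and the zero mounting offset are placeholders, not values from this project; the real offset should be copied from the static TF arguments in `slam.launch.py`. ROS imports are deferred into `main()` so the quaternion helper can be checked without a ROS environment.

```python
import math
from typing import Tuple


def yaw_to_quaternion(yaw_rad: float) -> Tuple[float, float, float, float]:
    """Return (x, y, z, w) for a rotation of yaw_rad about the z axis."""
    half = yaw_rad / 2.0
    return (0.0, 0.0, math.sin(half), math.cos(half))


def main() -> None:
    # Deferred imports: these require a sourced ROS2 environment.
    import rclpy
    from rclpy.executors import ExternalShutdownException
    from rclpy.node import Node
    from geometry_msgs.msg import TransformStamped
    from tf2_ros import TransformBroadcaster

    # Placeholder mounting offset (assumption); replace with the real values.
    laser_xyz = (0.0, 0.0, 0.0)
    laser_yaw_rad = 0.0

    class PeriodicLaserTf(Node):
        def __init__(self) -> None:
            super().__init__("periodic_laser_tf")
            self._broadcaster = TransformBroadcaster(self)
            # 20 Hz is an arbitrary choice, faster than the ~10 Hz scan rate.
            self._timer = self.create_timer(0.05, self._publish)

        def _publish(self) -> None:
            msg = TransformStamped()
            msg.header.stamp = self.get_clock().now().to_msg()
            msg.header.frame_id = "base_footprint"
            msg.child_frame_id = "laser_frame"
            msg.transform.translation.x = laser_xyz[0]
            msg.transform.translation.y = laser_xyz[1]
            msg.transform.translation.z = laser_xyz[2]
            qx, qy, qz, qw = yaw_to_quaternion(laser_yaw_rad)
            msg.transform.rotation.x = qx
            msg.transform.rotation.y = qy
            msg.transform.rotation.z = qz
            msg.transform.rotation.w = qw
            self._broadcaster.sendTransform(msg)

    rclpy.init()
    node = PeriodicLaserTf()
    try:
        rclpy.spin(node)
    except (KeyboardInterrupt, ExternalShutdownException):
        pass
    finally:
        node.destroy_node()
        rclpy.try_shutdown()
```

Calling `main()` inside the container starts the node. Because each transform is stamped with the current time, a scan only becomes resolvable once a sample newer than the scan has been published, so the timer period adds up to that much latency.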

### H5: slam_toolbox MessageFilter uses separate internal executor (LOW)
The MessageFilter in slam_toolbox may create its own callback group or use a different threading model that conflicts with rmw_zenoh_cpp's notification mechanism.

**Test**: Check slam_toolbox source for how it creates its MessageFilter subscriber and TF listener. If it uses a MutuallyExclusiveCallbackGroup, the TF subscription callback might be blocked by the scan subscription callback.

## Architecture Context

```
Native Pi 5 (turbopi-server)              Docker (ros2-slam)
┌────────────────────┐                    ┌──────────────────────────┐
│ slam_bridge.py     │                    │                          │
│  └─ /scan (pub)  ──┼── Zenoh router ───→│ rf2o (/scan sub)         │
│  └─ /odom_raw (pub)┼── Zenoh router ───→│   └─ /odom_rf2o (pub)    │
│  └─ /imu (pub)   ──┼── Zenoh router ───→│                          │
│                    │                    │ EKF                      │
│                    │                    │  └─ /odom_rf2o (sub)     │ ← verified working
│                    │                    │  └─ /odom_raw (sub)      │ ← maybe not receiving?
│                    │                    │  └─ /imu (sub)           │ ← maybe not receiving?
│                    │                    │  └─ /tf (pub odom→bf)    │ ← verified once
│                    │                    │                          │
│                    │                    │ static_transform_pub     │
│                    │                    │  └─ /tf_static (pub)     │ ← transient_local
│                    │                    │                          │
│                    │                    │ slam_toolbox (lifecycle) │
│                    │                    │  └─ /scan (sub)          │ ← receives scans
│                    │                    │  └─ MessageFilter        │ ← STUCK: can't resolve TF
│                    │                    │  └─ /tf (pub map→odom)   │ ← NEVER publishes
│                    │                    │                          │
│                    │                    │ pose_publisher           │
│  /slam/pose (sub) ←┼── Zenoh router ───←│  └─ /slam/pose (pub)     │ ← BLOCKED: no map frame
│  └─ /pose endpoint │                    │                          │
└────────────────────┘                    └──────────────────────────┘
```

## Files Involved

| File | Purpose |
|------|---------|
| `services/ros2-slam/config/ekf_params.yaml` | EKF config (frequency, inputs, TF publish) |
| `services/ros2-slam/config/slam_toolbox_params.yaml` | slam_toolbox config (transform_timeout: 1.0) |
| `services/ros2-slam/config/rf2o_params.yaml` | rf2o config (freq: 10.0) |
| `services/ros2-slam/config/zenoh_session_config.json5` | rmw_zenoh session (client mode) |
| `services/ros2-slam/zenoh_router_config.json5` | zenohd router (timestamping: enabled) |
| `services/ros2-slam/launch/slam.launch.py` | Node graph (static TF, rf2o, EKF, slam_toolbox) |
| `services/ros2-slam/Dockerfile` | Entrypoint with slam_toolbox auto-activation |
| `services/turbopi-server/slam_bridge.py` | Native Zenoh publisher + subscriber |

## Commits So Far

- `e5e01d4`: build rmw_zenoh from source (Zenoh version fix)
- `9461ebd`: patch rf2o spin_some→spin+timer + fix ZENOH_SESSION_CONFIG_URI env var
- `324da6d`: regenerate rf2o patch with proper git diff format
- `d579373`: enable timestamping for transient_local + auto-activate slam_toolbox + transform_timeout 1.0s

## Start Command

```bash
# 1. Read this doc + docs/RESEARCH-RF2O-ZENOH-CALLBACK.md for full context
# 2. Investigate H1 first: is EKF actually publishing TF continuously?

# Check EKF TF publish rate (needs longer timeout):
ssh pi "sudo docker exec ros2-slam-ros2-slam-1 bash -c 'source /opt/ros/jazzy/setup.bash && source /rmw_ws/install/setup.bash && source /ros2_ws/install/setup.bash && timeout 10 ros2 topic hz /tf 2>&1'"

# Check if EKF is receiving odom_rf2o:
ssh pi "sudo docker exec ros2-slam-ros2-slam-1 bash -c 'source /opt/ros/jazzy/setup.bash && source /rmw_ws/install/setup.bash && source /ros2_ws/install/setup.bash && ros2 topic info /odom_rf2o -v 2>&1'"

# Check EKF diagnostics:
ssh pi "sudo docker exec ros2-slam-ros2-slam-1 bash -c 'source /opt/ros/jazzy/setup.bash && source /rmw_ws/install/setup.bash && source /ros2_ws/install/setup.bash && timeout 5 ros2 topic echo /diagnostics 2>&1'"

# Test transient_local replay:
ssh pi "sudo docker exec ros2-slam-ros2-slam-1 bash -c 'source /opt/ros/jazzy/setup.bash && source /rmw_ws/install/setup.bash && source /ros2_ws/install/setup.bash && timeout 5 ros2 topic echo /tf_static --qos-reliability reliable --qos-durability transient_local --once 2>&1'"
```

## Verification (when fixed)

1. slam_toolbox stops logging "queue is full" — starts processing scans
2. `ros2 topic echo /tf` shows `map → odom` transform from slam_toolbox
3. pose_publisher resolves `map → base_footprint` TF → publishes on `/slam/pose`
4. slam_bridge receives pose → `state: running` in `/pose` endpoint
5. `curl /pose` returns valid `x_m`, `y_m`, `heading_deg`
6. `curl /map -o /tmp/map.png` returns valid PNG occupancy grid
7. Drive robot forward → `y_m` changes in repeated `/pose` calls
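Steps 4, 5, and 7 can be checked mechanically once `/pose` responds. The sketch below assumes the endpoint returns a flat JSON object containing `state`, `x_m`, `y_m`, and `heading_deg`; that shape, the function names, and the 5 cm movement threshold are assumptions, and fetching the payload is left to `curl` as in the list above.

```python
import json
import math
from typing import Any, Dict, List

REQUIRED_FIELDS = ("x_m", "y_m", "heading_deg")


def pose_problems(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems with a /pose payload (empty list means OK)."""
    problems: List[str] = []
    if payload.get("state") != "running":
        problems.append(f"state is {payload.get('state')!r}, expected 'running'")
    for name in REQUIRED_FIELDS:
        value = payload.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            problems.append(f"{name} is missing or not a finite number")
    return problems


def pose_moved(before: Dict[str, Any], after: Dict[str, Any],
               min_delta_m: float = 0.05) -> bool:
    """True if the planar position changed by at least min_delta_m."""
    dx = after["x_m"] - before["x_m"]
    dy = after["y_m"] - before["y_m"]
    return math.hypot(dx, dy) >= min_delta_m


if __name__ == "__main__":
    # Example payloads in the assumed shape, not captured from the robot.
    first = json.loads('{"state": "running", "x_m": 0.0, "y_m": 0.0, "heading_deg": 90.0}')
    second = json.loads('{"state": "running", "x_m": 0.0, "y_m": 0.3, "heading_deg": 90.0}')
    print(pose_problems(first), pose_moved(first, second))
```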
