# Research: Zenoh as Middleware for Native-to-ROS2 SLAM Bridge

**Session 85, 2026-04-13** | Research-only — no implementation

---

## Executive Summary

The adversarial review of the WebSocket+JSON bridge plan (session 85) found that it adds **20-50ms latency** to a 100ms SLAM loop, is not thread-safe, has backpressure issues, and is a single point of failure. Zenoh is a pub/sub middleware from Eclipse/ZettaScale that can replace this bridge with **10-21 microsecond latency**, roughly a 1000x improvement. It has first-class ROS2 integration (`rmw_zenoh_cpp` is Tier 1 in Jazzy), pre-built aarch64 Docker images, and a pure-Python SDK (`zenoh_ros2_sdk`) that lets our native turbopi-server publish directly to ROS2 topics without installing ROS2.

**Recommendation: Use Zenoh as the bridge layer.** Either via `rmw_zenoh_cpp` (ROS2 uses Zenoh natively) or via `zenoh-bridge-ros2dds` (sidecar bridges DDS↔Zenoh). Both eliminate the WebSocket, JSON serialization, and custom bridge_node entirely.

---

## What is Zenoh

Zenoh is a pub/sub/query protocol designed as a unified data plane. Unlike DDS (ROS2's default middleware, with a heavy RTPS wire protocol and UDP multicast discovery), Zenoh offers:
- **5-byte wire overhead** (vs DDS's ~100+ bytes per message)
- **Centralized router** with gossip discovery (vs DDS's multicast scouting — 97-99% less discovery traffic)
- **Multiple transports**: TCP, UDP, QUIC, Unix sockets, serial, shared memory, WebSocket
- **Built-in flow control** with priority queues (no backpressure collapse)
- **Runs everywhere**: ESP32 (zenoh-pico) to cloud (Rust core, Python/C/C++/Java bindings)

---

## Performance Benchmarks (Published)

Source: [arXiv 2303.09419](https://arxiv.org/pdf/2303.09419)

| Transport | Same-host latency (64B) | Cross-host latency | Notes |
|-----------|:-----------------------:|:------------------:|-------|
| Zenoh peer-to-peer | **10 us** | **16 us** | Rust core, TCP loopback |
| Zenoh brokered (router) | **21 us** | **16 us** | Via zenohd router |
| Zenoh-pico (embedded) | **5 us** | **13 us** | C library, MCU-grade |
| CycloneDDS | **8 us** | **37 us** | UDP multicast, same-host wins |
| MQTT QoS 0 | **27 us** | — | For reference |
| Kafka | **73 us** | — | For reference |
| **WebSocket+JSON (our plan)** | **200-500 us** | — | Serialization + TCP Nagle |

**For our use case (native Python → Docker ROS2 on same Pi 5):**
- Current plan (WebSocket+JSON): ~20-50ms per scan (serialization + deserialization + 5 pub/sub hops)
- Zenoh (P2P or brokered): ~10-21 us, negligible against our 100ms SLAM cycle
- DDS's same-host shortcuts (multicast discovery, local delivery) do not carry across the Docker boundary, so DDS loses its same-host advantage. Zenoh wins.
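
To put those figures in proportion, the sketch below expresses each bridge latency as a share of the 100ms SLAM cycle. It uses only the numbers quoted in this document; the names `SLAM_CYCLE_S`, `BRIDGE_LATENCY_S`, and `budget_share` are illustrative.

```python
# Share of the 100 ms SLAM cycle consumed by each bridge option,
# using the latency ranges quoted in this document.
SLAM_CYCLE_S = 0.100

# (low, high) latency in seconds
BRIDGE_LATENCY_S = {
    "WebSocket+JSON": (20e-3, 50e-3),
    "Zenoh (P2P or brokered)": (10e-6, 21e-6),
}


def budget_share(latency_s: float, cycle_s: float = SLAM_CYCLE_S) -> float:
    """Return the latency as a percentage of the cycle time."""
    return 100.0 * latency_s / cycle_s


for name, (low, high) in BRIDGE_LATENCY_S.items():
    print(f"{name}: {budget_share(low):.3f}% to {budget_share(high):.3f}% of the cycle")
```

The WebSocket path consumes between a fifth and a half of the cycle, while the Zenoh path stays well below a tenth of a percent.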

---

## Three Architecture Options with Zenoh

### Option 1: `rmw_zenoh_cpp` as ROS2 RMW (RECOMMENDED — highest probability of success)

**How it works:** Replace DDS with Zenoh inside the Docker container. slam_toolbox, rf2o, robot_localization all use Zenoh natively — no code changes. Native turbopi-server publishes to Zenoh topics using `eclipse-zenoh` Python library.

```
Pi Native (turbopi-server):
  eclipse-zenoh Python library
  ├─ Publish: /scan (CDR-encoded LaserScan)
  ├─ Publish: /imu (CDR-encoded Imu)
  ├─ Publish: /odom_raw (CDR-encoded Odometry)
  ├─ Subscribe: TF data for pose
  └─ Subscribe: /map for occupancy grid

Pi Docker (ros:jazzy + rmw_zenoh_cpp):
  RMW_IMPLEMENTATION=rmw_zenoh_cpp
  zenohd router (inside container)
  slam_toolbox → /map, /tf
  rf2o → /odom_rf2o
  robot_localization EKF → /odom
```

**Pros:**
- No WebSocket, no JSON, no custom bridge code
- 10-21 us latency (1000x better than WebSocket plan)
- slam_toolbox uses Zenoh natively — no adaptation
- Tier 1 RMW in Jazzy — well-tested, official support
- SHM possible for zero-copy (48 MiB pool by default)

**Cons:**
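- Native Python must CDR-encode ROS2 messages (using the `pycdr2` or `rosbags` library)
- Zenoh key expressions must follow the rmw_zenoh format `<domain_id>/<topic>/<dds_type_name>/<type_hash>`; a plain `rt/<topic>` key will not match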
- Zenoh router must be discoverable by both native and Docker sides
- Newer technology — less community battle-testing than DDS

**Probability of success: 75%**

### Option 2: `zenoh-bridge-ros2dds` Sidecar

**How it works:** ROS2 inside Docker uses CycloneDDS (selected with `RMW_IMPLEMENTATION=rmw_cyclonedds_cpp`; the Jazzy default is Fast DDS). A `zenoh-bridge-ros2dds` sidecar process discovers DDS topics and re-publishes them over Zenoh. Native Python uses zenoh-python to talk to the bridge.

```
Pi Native (turbopi-server):
  eclipse-zenoh Python library
  ├─ Publish/Subscribe via Zenoh topics

Pi Docker:
  ROS2 Jazzy + CycloneDDS (rmw_cyclonedds_cpp)
  slam_toolbox, rf2o, EKF (standard DDS)
  zenoh-bridge-ros2dds (sidecar)
  ├─ Bridges DDS ↔ Zenoh
  └─ Pre-built arm64 Docker image available
```

**Pros:**
- ROS2 side is completely standard — no RMW changes
- Pre-built `eclipse/zenoh-bridge-ros2dds:latest` Docker image (arm64)
- Bridge handles CDR encoding/decoding transparently
- Can filter topics (allow/deny lists in JSON5 config)

**Cons:**
- Extra process (bridge sidecar) — more things that can fail
- Two hops: native → Zenoh → bridge → DDS → ROS2 (adds ~20 us)
- `rmw_zenoh_cpp` and `zenoh-plugin-ros2dds` use different key expressions — CANNOT mix them
- DDS multicast inside Docker still needs `--network=host`

**Probability of success: 70%**

### Option 3: `zenoh_ros2_sdk` (ROBOTIS) — No ROS2 on Native Side

**How it works:** Use ROBOTIS's pure-Python SDK (`zenoh_ros2_sdk`) that wraps zenoh-python with ROS2 message auto-discovery. The SDK automatically downloads message definitions and computes type hashes. API: `ROS2Publisher(topic="/scan", msg_type="sensor_msgs/msg/LaserScan")`.

```
Pi Native (turbopi-server):
  zenoh_ros2_sdk (pure Python, no ROS2 install)
  ├─ ROS2Publisher("/scan", "sensor_msgs/msg/LaserScan")
  ├─ ROS2Publisher("/imu", "sensor_msgs/msg/Imu")
  ├─ ROS2Subscriber("/map", "nav_msgs/msg/OccupancyGrid")
  └─ Zenoh session connects to router

Pi Docker (ros:jazzy + rmw_zenoh_cpp OR CycloneDDS + bridge):
  slam_toolbox, rf2o, EKF (standard ROS2)
```

**Pros:**
- Cleanest Python API — looks like native ROS2 pub/sub
- Auto-handles CDR encoding, type hashes, discovery
- Zero ROS2 dependency on native side

**Cons:**
- `zenoh_ros2_sdk` is relatively new (ROBOTIS, 2025)
- Less tested than raw zenoh-python
- Depends on Zenoh router for discovery
- Message type resolution requires internet access on first use (downloads .msg definitions)

**Probability of success: 60%** (newer, less battle-tested)

---

## Probability of Success Matrix

| Architecture | P(success) | Latency | Complexity | Maturity | Key Risk |
|-------------|:---:|:---:|:---:|:---:|---|
| **Option 1: rmw_zenoh_cpp** | **75%** | 10-21 us | Low (no bridge code) | Tier 1 in Jazzy | CDR encoding from Python; Zenoh router discovery |
| **Option 2: zenoh-bridge sidecar** | **70%** | ~40 us | Medium (sidecar process) | Production (Autoware uses it) | Extra process, two-hop latency, DDS still inside container |
| **Option 3: zenoh_ros2_sdk** | **60%** | 10-21 us | Low (cleanest API) | New (2025) | SDK maturity; internet needed for msg resolution |
| **Original: WebSocket+JSON** | **50%** | 20-50 ms | High (custom bridge_node) | Custom code | All adversarial review findings (35 issues) |
| **Original: WebSocket+msgpack** | **55%** | 5-15 ms | High (custom bridge_node) | Custom code | Thread safety, backpressure, clock sync, reset races |

**Recommendation: Option 1 (rmw_zenoh_cpp) + zenoh_ros2_sdk on native side** with Option 2 as fallback.

---

## Critical Implementation Details (from deep-dive research)

### zenoh_ros2_sdk is the preferred native-side solution

The ROBOTIS `zenoh_ros2_sdk` package handles ALL the protocol complexity:
- CDR serialization via `rosbags` library (not pycdr2 — more robust)
- Correct rmw_zenoh key expression format: `<domain_id>/<topic>/<dds_type>/<type_hash>`
- Liveliness token declaration (publishers appear in `ros2 topic list`)
- 33-byte message attachments (seq number + timestamp + GID)
- Session management + router discovery
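
The 33-byte attachment can be pictured with a small packing sketch. The layout used here is an assumption inferred from the size alone: an 8-byte sequence number, an 8-byte timestamp, a 1-byte length prefix, and a 16-byte GID, all little-endian. The helper names are hypothetical, and the SDK builds this for us in practice.

```python
import os
import struct
import time

GID_SIZE = 16  # assumed GID length in bytes


def build_attachment(seq: int, timestamp_ns: int, gid: bytes) -> bytes:
    """Pack seq number + timestamp + GID (assumed layout, 33 bytes total)."""
    if len(gid) != GID_SIZE:
        raise ValueError(f"GID must be {GID_SIZE} bytes, got {len(gid)}")
    # Two signed 64-bit integers, then a one-byte length prefix, then the GID.
    return struct.pack("<qq", seq, timestamp_ns) + bytes([GID_SIZE]) + gid


def parse_attachment(data: bytes) -> tuple[int, int, bytes]:
    """Inverse of build_attachment under the same assumed layout."""
    seq, timestamp_ns = struct.unpack_from("<qq", data, 0)
    gid_len = data[16]
    return seq, timestamp_ns, data[17:17 + gid_len]


attachment = build_attachment(seq=1, timestamp_ns=time.time_ns(), gid=os.urandom(GID_SIZE))
print(len(attachment))
```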

**Native-side deps (pip only, NO ROS2 install):**
```
pip install zenoh-ros2-sdk eclipse-zenoh rosbags
```

**Usage pattern:**
```python
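from zenoh_ros2_sdk import ROS2Publisher, ROS2Subscriber, get_message_class

# Message classes are resolved by ROS2 type name; no ROS2 install is needed
LaserScan = get_message_class("sensor_msgs/msg/LaserScan")
pub = ROS2Publisher(topic="/scan", msg_type="sensor_msgs/msg/LaserScan")
# Fields are passed as keyword arguments; `...` marks values elided here
pub.publish(header=..., ranges=..., angle_min=...)  # plus the remaining LaserScan fields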
```

### Zenoh Router is REQUIRED + critical config

rmw_zenoh requires a zenohd router for discovery (multicast disabled by default). Start inside Docker:
```bash
ros2 run rmw_zenoh_cpp rmw_zenohd
```

**CRITICAL (from GitHub issue #929):** Default config has `peers_failover_brokering: false`. This means the router does NOT relay messages between peers that haven't directly discovered each other. For cross-boundary (native Python ↔ Docker ROS2), **MUST set `peers_failover_brokering: true`** in router config, OR native Python must connect in `client` mode (not peer).

**Localhost-only router config:**
```json5
{
  mode: "router",
  listen: { endpoints: ["tcp/127.0.0.1:7447"] },
  scouting: {
    multicast: { enabled: false },
    gossip: { enabled: true },
  },
  routing: {
    router: { peers_failover_brokering: true },
  },
}
```
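
The alternative mentioned above, connecting the native side in `client` mode, could look like the sketch below. It builds the client configuration as a plain dictionary (JSON is valid JSON5) and only imports `zenoh` when a session is actually opened. The helper names are hypothetical, and the use of `zenoh.Config.from_json5` assumes a zenoh-python release that provides it.

```python
import json

ROUTER_ENDPOINT = "tcp/127.0.0.1:7447"  # matches the router's listen endpoint above


def client_config(endpoint: str = ROUTER_ENDPOINT) -> dict:
    """Client-mode config: all traffic goes through the router, no peer discovery."""
    return {
        "mode": "client",
        "connect": {"endpoints": [endpoint]},
        "scouting": {"multicast": {"enabled": False}},
    }


def open_client_session(endpoint: str = ROUTER_ENDPOINT):
    """Open a Zenoh session in client mode (requires `pip install eclipse-zenoh`)."""
    import zenoh  # imported lazily so client_config() works without Zenoh installed

    # Assumption: Config.from_json5 accepts a JSON5 string in the installed version.
    conf = zenoh.Config.from_json5(json.dumps(client_config(endpoint)))
    return zenoh.open(conf)


print(json.dumps(client_config(), indent=2))
```

In client mode the router relays everything, so the `peers_failover_brokering` setting no longer matters for the native side.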

### Dockerfile (confirmed working on arm64)

```dockerfile
FROM ros:jazzy-ros-base
RUN apt-get update && apt-get install -y --no-install-recommends \
    ros-jazzy-rmw-zenoh-cpp \
    ros-jazzy-slam-toolbox \
    ros-jazzy-robot-localization \
    ros-jazzy-rf2o-laser-odometry \
    ros-jazzy-tf2-ros \
    && rm -rf /var/lib/apt/lists/*
ENV RMW_IMPLEMENTATION=rmw_zenoh_cpp
```

All packages have arm64 binaries in the Jazzy apt repository. No source compilation needed on Pi 5.

### Shared Memory: DO NOT USE across Docker

From issue #213: SHM across Docker containers is unreliable even with `--ipc=host`. TCP over loopback is fine for our message volumes (~100KB/s total). Do not enable SHM.
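
A rough check of the message volume for the scan topic alone is sketched below. It assumes one scan per 100ms SLAM cycle; the point count (720) and the frame name (`laser`) are illustrative assumptions, not measured values from our RPLIDAR.

```python
def align(offset: int, boundary: int) -> int:
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


def laser_scan_cdr_size(n_ranges: int, n_intensities: int, frame_id: str) -> int:
    """Size in bytes of a CDR-encoded sensor_msgs/msg/LaserScan."""
    size = 8                                   # stamp: sec + nanosec
    size += 4 + len(frame_id.encode()) + 1     # string: length + bytes + NUL
    size = align(size, 4)                      # floats start on a 4-byte boundary
    size += 7 * 4                              # seven float32 scalar fields
    size += 4 + 4 * n_ranges                   # ranges: count + float32 values
    size += 4 + 4 * n_intensities              # intensities: count + float32 values
    return 4 + size                            # plus the 4-byte CDR encapsulation header


SCAN_RATE_HZ = 10   # one scan per 100 ms cycle
N_POINTS = 720      # assumed points per scan (hypothetical)

scan_bytes = laser_scan_cdr_size(N_POINTS, N_POINTS, "laser")
print(f"{scan_bytes} bytes/scan, {scan_bytes * SCAN_RATE_HZ / 1000:.1f} KB/s")
```

Under these assumptions the scan stream stays below the ~100KB/s total, which leaves room for IMU and odometry and is far below what TCP loopback can carry.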

### Topic Key Expression Format (rmw_zenoh)

```
<domain_id>/<topic>/<dds_type_name>/<type_hash>
```
Example: `0/scan/sensor_msgs::msg::dds_::LaserScan_/RIHS01_df668c...`

The zenoh_ros2_sdk computes all of this automatically. Manual construction is not needed.
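
For debugging, for example when matching a key seen in router logs, the format can still be reproduced by hand. The sketch below derives the DDS type name from the ROS2 type name and joins the four parts. The function names are hypothetical, the type hash is left as a placeholder because it must come from the real message definition, and only simple topic names like `/scan` are handled.

```python
def dds_type_name(ros_type: str) -> str:
    """Convert 'sensor_msgs/msg/LaserScan' to 'sensor_msgs::msg::dds_::LaserScan_'."""
    package, kind, name = ros_type.split("/")
    return f"{package}::{kind}::dds_::{name}_"


def topic_keyexpr(domain_id: int, topic: str, ros_type: str, type_hash: str) -> str:
    """Build '<domain_id>/<topic>/<dds_type_name>/<type_hash>'."""
    return f"{domain_id}/{topic.strip('/')}/{dds_type_name(ros_type)}/{type_hash}"


# The hash below is a placeholder, not a real RIHS01 value.
print(topic_keyexpr(0, "/scan", "sensor_msgs/msg/LaserScan", "RIHS01_<hash>"))
```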

### CDR Encoding via rosbags (preferred over pycdr2)

```python
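import numpy as np
from rosbags.typesys import Stores, get_typestore

# Pick the store that matches the target distro; this example uses Humble
typestore = get_typestore(Stores.ROS2_HUMBLE)
LaserScan = typestore.types['sensor_msgs/msg/LaserScan']
# `...` marks values elided here; every LaserScan field must be supplied
scan = LaserScan(header=..., ranges=np.array([...], dtype=np.float32))  # plus remaining fields
cdr_bytes = typestore.serialize_cdr(scan, 'sensor_msgs/msg/LaserScan')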
```

---

## Key References

### Official Zenoh
- [Zenoh documentation](https://zenoh.io/docs/getting-started/first-app/)
- [zenoh-python PyPI](https://pypi.org/project/eclipse-zenoh/) — v1.9.0, aarch64 wheels available
- [rmw_zenoh_cpp](https://github.com/ros2/rmw_zenoh) — Official ROS2 Tier 1 RMW
- [zenoh-plugin-ros2dds](https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds) — 253 stars, arm64 Docker images

### Community
- [ROBOTIS zenoh_ros2_sdk](https://github.com/ROBOTIS-GIT/zenoh_ros2_sdk) — Pure Python ROS2-over-Zenoh SDK
- [Arm Learning Paths: Zenoh on Raspberry Pi](https://learn.arm.com/learning-paths/cross-platform/zenoh-multinode-ros2/) — 7-part tutorial, verified on Pi 4/5
- [Autoware multi-vehicle with Zenoh](https://autoware.org/running-multiple-autoware-powered-vehicles-in-carla-using-zenoh/) — Production-grade Docker bridging
- [zenoh-demos/ROS2/zenoh-python-lidar-plot](https://github.com/eclipse-zenoh/zenoh-demos) — LaserScan over Zenoh example
- [Eclipse Zenoh Docker Hub](https://hub.docker.com/r/eclipse/zenoh-bridge-ros2dds) — Pre-built arm64 images

### Benchmarks
- [arXiv 2303.09419](https://arxiv.org/pdf/2303.09419) — Zenoh vs DDS latency comparison
- [ZettaScale performance page](https://zenoh.io/docs/manual/performance/) — Official throughput/latency numbers

---

## Cloned Repos (in vendor/ for offline reference)

| Repo | Local Path | What | Key Files |
|------|-----------|------|-----------|
| **zenoh-demos** | `vendor/zenoh-demos/` | Official Zenoh examples including ROS2 | `ROS2/zenoh-python-lidar-plot/ros2-lidar-plot.py` — LaserScan CDR deserialization with pycdr2 |
| **zenoh-plugin-ros2dds** | `vendor/zenoh-plugin-ros2dds/` | DDS↔Zenoh bridge (standalone executable) | `DEFAULT_CONFIG.json5` — full config with topic filtering, frequency limits |
| **zenoh_ros2_sdk** | `vendor/zenoh-ros2-sdk/` | ROBOTIS pure-Python ROS2-over-Zenoh SDK | `examples/15_publish_imu.py` — complete IMU publisher example |
| **rmw_zenoh** | `vendor/rmw-zenoh/` | Official ROS2 Zenoh RMW | `rmw_zenoh_cpp/config/DEFAULT_RMW_ZENOH_ROUTER_CONFIG.json5` — router config template |

---

## Reusable Code & Patterns (from cloned repos)

### Pattern 1: LaserScan CDR Encoding with pycdr2

**Source:** `vendor/zenoh-demos/ROS2/zenoh-python-lidar-plot/ros2-lidar-plot.py`

This is the exact pattern we need for publishing RPLIDAR scans:

```python
from dataclasses import dataclass
from typing import List

import zenoh
from pycdr2 import IdlStruct
from pycdr2.types import float32, int32, uint32


@dataclass
class Time(IdlStruct, typename="Time"):
    # builtin_interfaces/msg/Time declares sec as int32 (same 4-byte width on the wire)
    sec: int32
    nsec: uint32


@dataclass
class Header(IdlStruct, typename="Header"):
    stamp: Time
    frame_id: str


# Field order must match sensor_msgs/msg/LaserScan exactly; CDR has no field names
@dataclass
class LaserScan(IdlStruct, typename="LaserScan"):
    header: Header
    angle_min: float32
    angle_max: float32
    angle_increment: float32
    time_increment: float32
    scan_time: float32
    range_min: float32
    range_max: float32
    ranges: List[float32]
    intensities: List[float32]

# Serialize: scan.serialize() → bytes (CDR format)
# Deserialize: LaserScan.deserialize(payload) → LaserScan
# Publish: pub = session.declare_publisher('rt/scan'); pub.put(scan.serialize())
# ('rt/scan' is the key from the demo; rmw_zenoh expects <domain_id>/<topic>/<type>/<hash>)
```

**Dependencies:** `pip install eclipse-zenoh pycdr2 numpy`
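
To make the wire format concrete, here is a minimal standard-library sketch of the same LaserScan encoding: a 4-byte little-endian CDR encapsulation header followed by the fields in declaration order, with the string padded so the floats start on a 4-byte boundary. It is an illustration of the layout, not a replacement for pycdr2 or rosbags, and `encode_laser_scan` is a hypothetical helper name.

```python
import struct

CDR_LE_HEADER = b"\x00\x01\x00\x00"  # encapsulation header: little-endian CDR


def _pad(buf: bytearray, boundary: int) -> None:
    """Pad so the body (everything after the 4-byte header) is aligned."""
    while (len(buf) - len(CDR_LE_HEADER)) % boundary:
        buf.append(0)


def encode_laser_scan(sec, nanosec, frame_id, scalars, ranges, intensities) -> bytes:
    """Encode a LaserScan; `scalars` holds the seven float fields in message order."""
    if len(scalars) != 7:
        raise ValueError("expected angle_min..range_max (7 floats)")
    buf = bytearray(CDR_LE_HEADER)
    buf += struct.pack("<iI", sec, nanosec)           # header.stamp
    fid = frame_id.encode("utf-8") + b"\x00"          # strings are NUL-terminated
    buf += struct.pack("<I", len(fid)) + fid          # header.frame_id
    _pad(buf, 4)
    buf += struct.pack("<7f", *scalars)               # angle_min .. range_max
    for seq in (ranges, intensities):                 # sequence<float32>
        buf += struct.pack("<I", len(seq))
        buf += struct.pack(f"<{len(seq)}f", *seq)
    return bytes(buf)


payload = encode_laser_scan(1, 2, "laser", [0.0] * 7, [1.0, 2.0], [])
print(len(payload))
```

The SLAM loop should still use a library encoder; this sketch is mainly useful for inspecting payloads when a subscriber rejects a message.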

### Pattern 2: IMU Publishing with zenoh_ros2_sdk

**Source:** `vendor/zenoh-ros2-sdk/examples/15_publish_imu.py`

Higher-level SDK that auto-handles CDR encoding and type hashes:

```python
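from math import cos, sin

import numpy as np
from zenoh_ros2_sdk import ROS2Publisher, get_message_class

# sec, nsec, yaw (rad) and gyro_z_rps (rad/s) are supplied by the caller's sensor loop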

Header = get_message_class("std_msgs/msg/Header")
Time = get_message_class("builtin_interfaces/msg/Time")
Quaternion = get_message_class("geometry_msgs/msg/Quaternion")
Vector3 = get_message_class("geometry_msgs/msg/Vector3")
Imu = get_message_class("sensor_msgs/msg/Imu")

pub = ROS2Publisher(topic="/imu", msg_type="sensor_msgs/msg/Imu")

# Publish with named fields
pub.publish(
    header=Header(stamp=Time(sec=sec, nanosec=nsec), frame_id="base_link"),
    orientation=Quaternion(x=0.0, y=0.0, z=sin(yaw/2), w=cos(yaw/2)),
    angular_velocity=Vector3(x=0.0, y=0.0, z=gyro_z_rps),
    linear_acceleration=Vector3(x=0.0, y=0.0, z=9.81),
    orientation_covariance=np.array([1e6,0,0, 0,1e6,0, 0,0,0.01]),
    angular_velocity_covariance=np.array([1e6,0,0, 0,1e6,0, 0,0,0.01]),
    linear_acceleration_covariance=np.array([-1.0]+[0.0]*8),
)
```

**Dependencies:** `pip install zenoh-ros2-sdk` (installs eclipse-zenoh + rosbags + GitPython)

**Key internals (from `vendor/zenoh-ros2-sdk/zenoh_ros2_sdk/`):**
- `publisher.py:261` — Uses `rosbags` library for CDR serialization (not pycdr2)
- `session.py:120` — Maintains a typestore from `rosbags.typesys`
- `keyexpr.py:14-21` — Topic format: `<domain_id>/<topic>/<dds_type_name>/<type_hash>`
- Connects to Zenoh router at `tcp/localhost:7447` by default
- Auto-downloads ROS2 message definitions from Git (cached locally)
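
The orientation and covariance values in the publish call follow a yaw-only convention: rotation about z, with large variances on the axes the sensor does not measure. A small sketch of that arithmetic is shown below; the helper names are hypothetical and the variance values are the ones used in the example above.

```python
import math


def yaw_to_quaternion(yaw_rad: float) -> tuple[float, float, float, float]:
    """Return (x, y, z, w) for a pure rotation of yaw_rad about the z axis."""
    half = yaw_rad / 2.0
    return (0.0, 0.0, math.sin(half), math.cos(half))


def yaw_only_covariance(unused_variance: float = 1e6, yaw_variance: float = 0.01) -> list[float]:
    """Row-major 3x3 covariance: roll and pitch effectively unknown, yaw trusted."""
    return [
        unused_variance, 0.0, 0.0,
        0.0, unused_variance, 0.0,
        0.0, 0.0, yaw_variance,
    ]


x, y, z, w = yaw_to_quaternion(math.pi / 2)
print(f"z={z:.4f}, w={w:.4f}")
```

Setting the first element of a covariance array to -1, as the linear acceleration covariance does above, is the ROS2 convention for "no estimate available".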

### Pattern 3: zenoh-bridge-ros2dds Config for SLAM Topics

**Source:** `vendor/zenoh-plugin-ros2dds/DEFAULT_CONFIG.json5`

Directly usable config template for our Docker setup:

```json5
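{
  plugins: {
    ros2dds: {
      namespace: "/",
      domain: 0,
      ros_localhost_only: true,  // SECURITY: restrict to localhost
      allow: {
        // ROS2-side publishers routed out to Zenoh (read by the native side)
        publishers: ["/odom_rf2o", "/odom", "/tf", "/tf_static", "/map"],
        // ROS2-side subscribers fed from Zenoh (published by the native side)
        subscribers: ["/scan", "/imu", "/odom_raw"],
      },
      pub_max_frequencies: ["/map=1", "/tf=30"],  // rate-limit /map to 1Hz, /tf to 30Hz
    },
  },
}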
```

### Pattern 4: rmw_zenoh Router Config (localhost-only)

**Source:** `vendor/rmw-zenoh/rmw_zenoh_cpp/config/DEFAULT_RMW_ZENOH_ROUTER_CONFIG.json5`

Key settings for our use case:
- `mode: "router"` (line 11)
- `listen/endpoints: ["tcp/[::]:7447"]` (line 90-92) — **Change to `tcp/127.0.0.1:7447` for localhost-only**
- Connect endpoints: empty (no router-to-router)
- Gossip scouting: enabled for peer discovery

---

## Critical Design Decision: rmw_zenoh vs zenoh-bridge-ros2dds

**These two approaches are INCOMPATIBLE.** You must choose one:

| | rmw_zenoh_cpp | zenoh-bridge-ros2dds |
|---|---|---|
| **ROS2 side uses** | Zenoh natively (no DDS) | CycloneDDS (standard DDS) |
| **Key expression format** | `<domain>/<topic>/<type>/<hash>` | `<topic>` (simpler) |
| **Native Python connects to** | Zenoh router (7447) | Zenoh bridge sidecar |
| **CDR encoding by** | Native Python (pycdr2 or rosbags) | Bridge handles automatically |
| **Can mix approaches?** | **NO** — rmw_zenoh and bridge use different key formats | — |
| **Maturity** | Tier 1 in Jazzy (2024+) | Production (Autoware uses it) |

**For Option 1 (rmw_zenoh):** Native Python must encode CDR AND match the exact key expression format including type hash. The zenoh_ros2_sdk handles this automatically.

**For Option 2 (bridge sidecar):** Native Python publishes to simple Zenoh keys. The bridge handles CDR encoding/decoding and DDS discovery. Simpler on the native side but adds a sidecar process.

---

## What This Means for the Plan

The current plan (`~/.claude/plans/iterative-baking-cook.md`) uses a custom WebSocket+msgpack bridge (20 files, ~1715 lines). With Zenoh, we can eliminate:
- `sensor_bridge.py` (200 lines) — replaced by zenoh-python or zenoh_ros2_sdk publishers in turbopi-server
- `pose_bridge.py` (150 lines) — replaced by zenoh subscribers in turbopi-server
- `slam_protocol.py` (60 lines) — CDR encoding replaces custom message format
- Custom WebSocket server/client code in `slam_bridge.py` — replaced by zenoh session
- WebSocket auth token — Zenoh has built-in ACL and TLS
- All backpressure/threading issues — Zenoh handles flow control natively

**What remains:**
- Docker container with ROS2 Jazzy + rmw_zenoh_cpp (or CycloneDDS + bridge sidecar) + slam_toolbox + rf2o + EKF
- `slam_bridge.py` — but now it's a Zenoh publisher/subscriber (much simpler, ~150 lines vs 300)
- CDR encoding via pycdr2 or zenoh_ros2_sdk (zero custom serialization code)
- Same EKF/slam_toolbox/rf2o configs
- Same state machine in SlamBridge (INIT → CONNECTING → RUNNING → DEGRADED)
- Same safety daemon CPU pinning
- Same Phase 0 preparations (imu.py, lidar.py, safety.py additions)
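
The retained state machine could be expressed as below. Only the four state names come from the plan; the transition table and the class name `SlamBridgeState` are assumptions made for illustration.

```python
from enum import Enum


class BridgeState(Enum):
    INIT = "init"
    CONNECTING = "connecting"
    RUNNING = "running"
    DEGRADED = "degraded"


# Hypothetical transition table; the plan only names the states.
ALLOWED = {
    BridgeState.INIT: {BridgeState.CONNECTING},
    BridgeState.CONNECTING: {BridgeState.RUNNING, BridgeState.DEGRADED},
    BridgeState.RUNNING: {BridgeState.DEGRADED},
    BridgeState.DEGRADED: {BridgeState.CONNECTING, BridgeState.RUNNING},
}


class SlamBridgeState:
    """Tracks the bridge state and rejects transitions not in ALLOWED."""

    def __init__(self) -> None:
        self.state = BridgeState.INIT

    def transition(self, new_state: BridgeState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state


bridge = SlamBridgeState()
bridge.transition(BridgeState.CONNECTING)   # Zenoh session being opened
bridge.transition(BridgeState.RUNNING)      # poses arriving from slam_toolbox
print(bridge.state.name)
```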

**Estimated reduction:** ~1715 → ~900 lines. Fewer custom components. More battle-tested middleware. 1000x lower latency.

---

## Additional Repos Discovered (extensive search)

| Repo | URL | What we can reuse |
|------|-----|-------------------|
| **evshary/zenoh-ros-type-python** | github.com/evshary/zenoh-ros-type-python | Pre-built pycdr2 dataclasses for ALL common ROS2 msgs. `pip install zenoh-ros-type`. Import: `from zenoh_ros_type.common_interfaces.sensor_msgs import LaserScan` |
| **fan-ziqi/zenoh_ros2dds_example** | github.com/fan-ziqi/zenoh_ros2dds_example | Minimal Python↔ROS2 via zenoh-bridge. Gotcha: "start ROS2 subscriber first or bridge won't forward" |
| **xopxe/rmw_zenoh_router_docker** | github.com/xopxe/rmw_zenoh_router_docker | Complete Dockerfile + systemd unit for Zenoh router. Directly reusable as our Docker base. |
| **OpenMind/OM1** | github.com/OpenMind/OM1 | **Production RPLidar-over-Zenoh.** `src/zenoh_msgs/idl/sensor_msgs.py` has LaserScan+IMU pycdr2 defs. `turtlebot4_rplidar_provider.py` subscribes to RPLidar scans via Zenoh. |
| **Yadunund/rmw_zenoh_examples** | github.com/Yadunund/rmw_zenoh_examples | By the rmw_zenoh maintainer. SHM config, per-topic priority/congestion control, downsampling rules. |
| **AI-Robot-SW/pedestrian-companion-robot** | github.com/AI-Robot-SW/pedestrian-companion-robot | Full pycdr2 sensor_msgs including LaserScan, IMU |

### Key Gotcha from fan-ziqi example
> For Zenoh→ROS2 direction, you must start the ROS2 subscriber FIRST, otherwise the bridge won't forward messages.

This is because zenoh-bridge-ros2dds only creates routes for discovered DDS endpoints. With rmw_zenoh_cpp (Option 1), this doesn't apply — publishers/subscribers discover each other via Zenoh gossip.

---

## User Decision: Two-Phase Approach

**Phase 1 (next sessions): SLAM Foundation with Zenoh**
- Replace HectorSLAM with slam_toolbox via Zenoh bridge
- Gives ground truth for evaluating VLM perception quality
- Enables spatial memory, map persistence, loop closure
- P(success): 75-80%

**Phase 2 (future): VLM-Primary Hybrid (Waymo/Tesla inspired)**
- 58 Hz VLM as continuous perception backbone
- SLAM as metric reference frame (like Waymo's HD map)
- Requires Phase 1 for ground truth and evaluation framework
- P(success): 40-50% (research problem, not just engineering)
- Same Zenoh middleware serves both phases
