LENS 09

Tradeoff Radar

"What are you sacrificing, and is that the right sacrifice?"

Annie VLM-Primary vs Traditional SLAM-Primary
[Radar chart — Annie VLM-Primary (Gemma 4 E2B, 58 Hz) vs SLAM-Primary (slam_toolbox, lidar-driven), plus a projected Annie + Hailo L1 overlay (26 TOPS on Pi 5, Robustness 35 → 65). Seven axes: Perception Depth, Semantic Richness, Latency, VRAM Efficiency, Robustness, Spatial Accuracy, Implementation Simplicity. Outer edge = 100% = best; latency axis inverted (outer = fastest). Annie shows a 30% spatial gap, SLAM a 20% semantic gap.]
| Axis | Annie VLM-Primary | SLAM-Primary | Justification |
| --- | --- | --- | --- |
| Perception Depth | 85 | 30 | E2B describes furniture, room type, goal position, and occlusion in a single pass. SLAM sees only geometry — no objects, no semantics. |
| Semantic Richness | 90 | 20 | VLM produces room labels, obstacle names, and goal-relative directions in natural language. SLAM produces float coordinates — 20% credit for inferring high-traffic zones from occupancy density. |
| Latency (low = outer) | 80 | 55 | E2B at 18 ms/frame (58 Hz) via llama-server direct. SLAM path-planning adds A* + lifecycle overhead; full tactical cycle ~50–80 ms. Both are faster than the motor response bottleneck (~200 ms). |
| VRAM Efficiency | 45 | 80 | Gemma 4 E2B occupies ~3.5 GB VRAM on Panda. SLAM is CPU-bound (slam_toolbox on Pi 5 ARM), zero GPU footprint. The VLM's VRAM leaves room for a SigLIP sidecar but constrains concurrent workloads. |
| Robustness | 35 | 88 | VLM pipeline: WiFi hop Pi→Panda + Zenoh layer + llama-server process + hallucination risk. SLAM: all-local, no network, deterministic scan-matching. The session 89 Zenoh fix alone took one full session. |
| Spatial Accuracy | 30 | 92 | E2B output is "LEFT MEDIUM" — a qualitative direction, not metric. Cannot localize at mm precision. Lidar-based slam_toolbox returns (x, y, θ) at ~10 mm accuracy — mission-critical for furniture-clearance navigation. |
| Implementation Simplicity | 40 | 30 | VLM: add an `_ask_vlm()` call, parse a 2-token reply, no calibration. SLAM: slam_toolbox lifecycle, rf2o lidar odometry, IMU frame_id, EKF tuning, Zenoh version pinning (which consumed all of session 89). Both score low — this is a complex domain. |

The radar reveals a striking asymmetry: Annie's VLM-primary approach and the traditional SLAM-primary approach are almost perfectly complementary anti-profiles. Where one peaks, the other troughs. Annie scores 85–90 on Perception Depth and Semantic Richness but only 30–35 on Spatial Accuracy and Robustness. SLAM-primary scores 88–92 on Spatial Accuracy and Robustness but collapses to 20–30 on any axis requiring understanding of what things are. This complementarity is exactly the premise for a hybrid — but it also means each approach fails on exactly the axes where the other excels, and the failure modes are not graceful. A SLAM-only robot gets permanently lost when a room rearranges. A VLM-only robot drives confidently into the leg of a chair because it cannot distinguish "the chair is at 250mm" from "the chair is at 600mm".

The tradeoff that researchers consistently decline to acknowledge is that the robustness axis is, in practice, a network-reliability question. Every benchmark in the literature — VLMaps, OK-Robot, NaVid, text2nav — measures VLM accuracy assuming an always-on GPU. None of them measure what happens when the WiFi hop between the robot and its inference node drops for 80ms, or when the Panda llama-server process restarts mid-navigation (session 83: Annie's IMU became REPL-blocked, requiring a soft-reboot Ctrl-D). The research community treats inference latency as the latency problem; the actual production latency problem is network jitter. A 58 Hz VLM pipeline that hiccups for 300ms every 45 seconds due to a 2.4GHz congestion burst is not a 58 Hz system — it is a system that produces bursts of stale commands. The radar's Robustness score of 35 for Annie captures this honestly: the failure mode is not algorithmic, it is infrastructural and invisible in papers.
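A cheap defense against those stale-command bursts is a freshness filter on the command stream: after a jitter burst, queued VLM replies arrive timestamped against a world that has moved on, and should be dropped rather than replayed. A minimal sketch, assuming monotonic timestamps on each command and a 100 ms staleness threshold (both assumptions, not Annie's actual pipeline):

```python
import time

STALE_AFTER_S = 0.10  # assumed threshold: commands older than 100 ms are stale

def fresh_commands(stamped_cmds, now=None):
    """Filter (timestamp, cmd) pairs so a post-hiccup burst of queued VLM
    replies is not replayed against a scene that has since changed."""
    now = time.monotonic() if now is None else now
    return [cmd for ts, cmd in stamped_cmds if now - ts <= STALE_AFTER_S]
```

The filter turns a jitter burst from "5–6 wrong motor commands" into "5–6 dropped frames", which the 1–2 Hz motor loop never notices.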

The cyan dashed polygon shows the single largest structural move available on this radar: activating the idle Hailo-8 AI HAT+ on the Pi 5 as an L1 safety layer (26 TOPS, YOLOv8n at 430 FPS, <10ms local inference, zero WiFi dependency). The Robustness axis jumps from ~35 to ~65 — the biggest single-axis delta any non-hardware-swap move produces on this chart. Why? Safety-critical obstacle detection no longer rides the same WiFi hop as semantic reasoning. The semantic path (Gemma 4 E2B on Panda for "where is the kitchen?") still depends on WiFi, so the robustness ceiling doesn't reach SLAM-primary's 88 — but the compound failure mode collapses: a WiFi brownout no longer simultaneously silences obstacle avoidance and goal reasoning. The IROS dual-process paper (arXiv 2601.21506) measured this exact pattern yielding 66% latency reduction and 67.5% success vs 5.83% VLM-only. The trade is visible on the Implementation Simplicity axis, which edges down from 40 to ~32: HailoRT, TAPPAS, and model compilation add real cognitive load, but the learning curve is days, with working Pi 5 examples at github.com/hailo-ai/hailo-rpi5-examples. This is the cheapest robustness move available on Annie's current hardware, because the hardware is already on the robot.

Two tradeoffs are movable by a fundamentally different approach, not just by tuning along the existing frontier. First: the spatial accuracy deficit (Annie: 30) can be largely eliminated without touching the VLM at all, by using lidar sectors as a pre-filter before the VLM command is issued — the existing NavController already does this via ESTOP gates. The VLM never needs metric precision; it only needs directional intent. Metric precision is the job of the lidar ESTOP. This reframes the tradeoff: Annie does not sacrifice spatial accuracy to gain semantics — it delegates spatial accuracy to a different component. Second: the VRAM efficiency gap (Annie: 45 vs SLAM: 80) is addressable by the embedding-only path described in Part 2 of the research. Running SigLIP 2 ViT-SO400M (~800MB VRAM) for place recognition instead of the full E2B model for embedding extraction changes the cost structure substantially. These are not points on the same frontier — they are structural moves that open new parts of the design space.
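The "delegate spatial accuracy" reframing above can be made concrete: the VLM supplies only directional intent, and the lidar sector minimum supplies the metric veto. A minimal sketch, using the 200 mm ESTOP figure from this document; the sector layout and function name are assumptions, not the existing NavController API:

```python
# Sketch of lidar-as-pre-filter: the VLM's qualitative direction is gated by
# the metric minimum range in the matching lidar sector. ESTOP_MM is the
# 200 mm figure from this document; everything else is illustrative.
ESTOP_MM = 200

def gate_vlm_command(direction: str, sector_min_mm: dict[str, float]) -> str:
    """Pass the VLM's directional intent only if the corresponding lidar
    sector has clearance beyond the ESTOP distance; otherwise stop."""
    if sector_min_mm.get(direction, 0.0) <= ESTOP_MM:
        return "STOP"   # metric precision is the lidar's job, not the VLM's
    return direction
```

Under this gate the VLM's inability to tell 250 mm from 600 mm is harmless: both cases resolve correctly because the lidar, not the VLM, owns the metric decision.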

The user's actual priority ordering diverges from the researcher's in one specific place: Implementation Complexity. The research literature treats complexity as a constant ("one-time engineering cost") and optimizes for runtime metrics. In practice, session 89 shows that a single Zenoh version mismatch (apt package at 0.2.9, source build at 1.7.1) consumed an entire development session. The radar gives SLAM-primary a score of 30 on Implementation Simplicity — not 70 — because "simple in theory" and "simple to deploy on ARM64 with rmw_zenoh_cpp from source" are not the same axis. For a single-developer project, implementation complexity IS a first-class runtime constraint: a system you cannot debug in-field is effectively unavailable. The implicit researcher assumption — that deployment effort amortizes to zero over many robots — does not apply here.

UNACKNOWLEDGED TRADEOFF — Key Finding

Every benchmark in VLM navigation literature measures inference latency. Nobody benchmarks network reliability. The research assumes the inference node is co-located or always reachable. Annie's architecture has a mandatory WiFi hop (Pi 5 → Panda, ~5–15ms round-trip under ideal conditions, potentially 80–300ms under 2.4GHz congestion or llama-server restart). At 58 Hz inference, a single 100ms WiFi hiccup produces 5–6 stale commands issued to the motor controller. The Robustness axis score of 35 for the VLM-primary approach reflects this — but more importantly, it means the “latency advantage” of 58 Hz inference is partially illusory: the effective update rate under realistic home WiFi is closer to 15–20 Hz when packet jitter is accounted for.
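The stale-command arithmetic in the finding above checks out directly (numbers taken from the paragraph, not measured):

```python
# Back-of-envelope check of the key finding's numbers.
RATE_HZ = 58        # nominal VLM inference rate
HICCUP_S = 0.100    # a single WiFi jitter burst

stale_cmds = RATE_HZ * HICCUP_S
# ≈ 5.8 — the "5–6 stale commands" issued to the motor controller per hiccup
```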

Lens 04 finds a WiFi cliff edge at 100ms where VLM rate becomes insensitive above 15 Hz — this is consistent. The implication: investing in inference speed above 15 Hz (e.g., the move from 29 Hz to 58 Hz via single-query optimization) has near-zero user-facing benefit if the bottleneck is network jitter, not GPU throughput.

  • Hailo-8 activation is the single biggest axis-mover on this radar. Robustness 35 → 65 in one structural move — bigger than any tuning along the existing frontier. 26 TOPS on the Pi 5 is already on the robot, idle. YOLOv8n at 430 FPS, <10ms, zero WiFi dependency. The IROS dual-process paper (arXiv 2601.21506) validates this exact split for 66% latency reduction.
  • Why 65, not 88 (SLAM-parity)? L1 safety is now WiFi-independent, but L2 semantic queries ("go to the kitchen") still ride the WiFi hop. The compound failure — obstacle avoidance and goal reasoning both silenced by the same jitter burst — is broken; the residual semantic-path fragility is what keeps the ceiling below SLAM's 88.
  • The complexity tax is real but small. Implementation Simplicity drops from 40 to ~32: HailoRT + TAPPAS + model compilation add non-zero cognitive load, but working Pi 5 examples exist (github.com/hailo-ai/hailo-rpi5-examples) and the learning curve is days.
Where "good enough" is dramatically cheaper than "optimal"
  • Spatial accuracy for home nav: "Chair at 300mm right" is good enough; "chair at 287mm right" costs 10× in SLAM infrastructure. The ESTOP at 200mm makes sub-300mm accuracy irrelevant to safety.
  • Semantic richness: "kitchen / hallway / bedroom" covers 90% of room-routing decisions. Full scene-graph (ConceptGraphs-level) is academic overhead for a single-room navigation robot.
  • Place recognition: text2nav achieved 74% navigation success using frozen SigLIP embeddings — no fine-tuning, no DINOv2, no AnyLoc. For Annie's home environment (10–15 visually distinct places), a K-nearest cosine search over ~100 stored embeddings is computationally trivial and likely sufficient.
  • Multi-query VLM (Lens 07 target): 6-slot dispatch at 9 Hz/slot vs single-query at 58 Hz — the 58 Hz path is only marginally better given that motor commands are issued at 1–2 Hz. "Good enough" is 15 Hz per query, achievable with 4 alternating queries at the current 58 Hz throughput.
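The place-recognition bullet above is small enough to sketch end to end: a nearest-cosine lookup over a handful of stored (place, embedding) pairs. A minimal illustration — the toy 2-d vectors stand in for real SigLIP embeddings, and the function names are assumptions:

```python
import math

# Nearest-cosine place recognition over a small embedding store, as suggested
# above for ~100 stored embeddings. Toy 2-d vectors stand in for SigLIP output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def recognize(query, store):
    """Return the label of the stored embedding most cosine-similar to query."""
    return max(store, key=lambda item: cosine(query, item[1]))[0]
```

At ~100 stored embeddings this is a linear scan of a few hundred dot products per query — computationally trivial next to the inference itself, which is why no ANN index or fine-tuning is needed for 10–15 visually distinct places.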