LENS 18

Decision Tree

"Under what specific conditions is this the best choice?"

VLM-PRIMARY HYBRID NAV — STRUCTURED DECISION TREE (6 BRANCHES)

Level 1: Do you have a camera AND an edge GPU?
(e.g. Raspberry Pi 5 + Panda, Jetson Orin, any SBC with NPU)
  NO → Use lidar-only SLAM: slam_toolbox + Nav2. No VLM path exists without visual input or local inference. Stop here.
  YES → continue to level 2.

Level 2: Is the target a known fiducial?
(ArUco / AprilTag / QR code — a pre-registered marker with known geometry)
  YES → Use classical CV; skip the VLM entirely. cv2.aruco + solvePnP runs in ~78 µs on the Pi ARM CPU — no GPU, no network, no hallucination surface. Annie's own homing path (DICT_6X6_50 id=23) is this exact case. A VLM here would be ~400× slower and strictly worse. Cross-ref: Lens 16 on ArUco.
  NO → continue to level 3.

Level 3: Is your environment mostly static?
(home, office, warehouse — not street, crowd, construction site)
  NO → VLM-primary won't help. Dynamic scenes need trajectory prediction (Waymo MotionLM, occupancy flow). VLM scene-classification latency (~18 ms) is too slow to track moving pedestrians or vehicles. Use a dedicated perception stack.
  YES → continue to level 4.

Level 4: Do you have a local NPU?
(Hailo-8, Coral Edge TPU, on-robot Jetson — any accelerator co-located with the camera, no WiFi hop)
  YES → Dual-process: fast L1 local + slow L2 remote. L1 = YOLOv8n on Hailo-8 at 430 FPS (<10 ms, 26 TOPS, zero WiFi) for reactive safety. L2 = VLM on edge GPU at 10-54 Hz for semantic goals. IROS 2601.21506 validates the split: 66% latency reduction, 67.5% vs 5.83% success rate. The ≥10 Hz question answers itself — L1 delivers it by construction. Skip to level 6 (semantic need).
  NO → continue to level 5 (the VLM-only path).

Level 5: Can your VLM sustain ≥10 Hz on-device?
(Gemma 4 E2B on Panda = 54 Hz. Cloud VLM with network round-trip = 2-5 Hz worst case.)
  NO → Use the VLM for scene labeling only — async, not in the control loop. At <10 Hz, the robot travels >10 cm between decisions at 1 m/s. Lidar-primary SLAM handles reactive control; the VLM annotates SLAM map cells offline (Phase 2c pattern — no real-time fusion).
  YES → continue to level 6.

Level 6: Do you need semantic understanding?
(room names, object categories, "go to the kitchen" — not just "avoid obstacle at 0.3 m")
  NO → Lidar-primary is simpler and more robust; use the VLM as an emergency backup only. Pure obstacle avoidance, go-to-coordinate tasks, and geometric path-following need zero VLM involvement. Lidar + SLAM + A* is a solved problem for this case.
  YES → continue to level 7.

Level 7: Do you have more than one robot?
(fleet = shared demonstration data; single robot = no fleet training signal)
  FLEET → End-to-end VLA training. Fleet-scale demonstration data unlocks RT-2, OpenVLA, pi0. Skip the multi-query hybrid — train a single model end-to-end.
  SINGLE → VLM-primary hybrid is the right choice. Add lidar as the geometry/safety layer. Multi-query pipeline (Phase 2a). Semantic map annotation (Phase 2c). This is Annie's exact configuration.
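The tree above collapses into a small routing function. A minimal sketch, not Annie's actual code — every field name and return string here is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Robot:
    # Hypothetical fields; each maps to one level of the tree.
    has_camera_and_edge_gpu: bool   # level 1 gate
    target_is_fiducial: bool        # level 2
    environment_static: bool        # level 3
    has_local_npu: bool             # level 4
    vlm_hz: float                   # level 5 (VLM-only path)
    needs_semantics: bool           # level 6
    fleet: bool                     # level 7

def recommend(r: Robot) -> str:
    """Walk the tree top to bottom; the first matching exit wins."""
    if not r.has_camera_and_edge_gpu:
        return "lidar-only SLAM (slam_toolbox + Nav2)"
    if r.target_is_fiducial:
        return "classical CV (cv2.aruco + solvePnP)"
    if not r.environment_static:
        return "dedicated detection + prediction stack"
    # A local NPU covers the reactive loop, so the >=10 Hz question
    # only applies on the VLM-only path.
    if not r.has_local_npu and r.vlm_hz < 10.0:
        return "async VLM scene labeling; lidar-primary control"
    if not r.needs_semantics:
        return "lidar-primary; VLM as emergency backup only"
    if r.fleet:
        return "end-to-end VLA training"
    if r.has_local_npu:
        return "dual-process: L1 NPU reactive + L2 VLM semantic"
    return "VLM-primary hybrid (lidar as safety layer)"
```

Annie today (camera + GPU, no fiducial goal, static home, idle NPU, 54 Hz, semantic goals, single robot) routes to the final return; flipping has_local_npu to True routes to dual-process, exactly as the flip table below describes.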
Condition: Target is a known fiducial (ArUco, AprilTag, QR)
Why VLM-primary fails here: cv2.aruco + solvePnP solves pose in ~78 µs on the Pi ARM CPU with zero hallucination surface. A VLM here is ~400× slower and introduces failure modes (lighting, prompt drift) that classical CV has already engineered away. Annie's homing path proves this: DICT_6X6_50 id=23 via solvePnP beats any VLM substitute.
Use instead: cv2.aruco / AprilTag detectors + solvePnP on CPU. No GPU, no network, no VLM.

Condition: Dynamic environment (streets, crowds, warehouses with forklifts)
Why VLM-primary fails here: VLM classification latency (~18 ms) cannot track moving agents. Scene labels go stale before the robot reacts. Waymo needs radar + 3D occupancy flow — unavailable on edge hardware.
Use instead: Dedicated detection + prediction stack (YOLO + Kalman filter + occupancy grids).

Condition: VLM inference <10 Hz AND no local NPU
Why VLM-primary fails here: At 2 Hz the robot travels 50 cm between decisions at 1 m/s. EMA smoothing cannot compensate. Commands arrive too late for reactive steering (Anti-Pattern 4 from Lens 12). Without a local NPU there is no fast layer to cover the gap.
Use instead: Lidar-primary + async VLM scene labeling (not in the control loop). Or: add a Hailo-8 / Coral and flip to dual-process.

Condition: Pure obstacle avoidance (no room names, no object categories)
Why VLM-primary fails here: Lidar + SLAM + A* already solves this completely. Adding VLM complexity without semantic payoff increases the failure surface (glass door problem from Lens 12 Anti-Pattern 3) with no corresponding benefit.
Use instead: Classical SLAM + Nav2 path planner. Zero VLM involvement.

Condition: Fleet of robots (shared training data available)
Why VLM-primary fails here: The multi-query hybrid is optimized for the single-robot, no-training-data constraint. Fleet-scale data unlocks end-to-end VLA training (RT-2, pi0), which achieves better generalization than hand-composed hybrid pipelines.
Use instead: End-to-end VLA training on fleet demonstrations.

Condition: Transparent obstacles (glass doors, mirrors, reflective floors)
Why VLM-primary fails here: A VLM prior cannot distinguish a transparent obstacle from open space. Lidar handles this geometrically — reflected photons are objective. The VLM proposes; the lidar must dispose (safety ESTOP). Never remove the lidar layer.
Use instead: The lidar ESTOP chain remains mandatory even in a VLM-primary architecture.
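The "VLM proposes, lidar disposes" rule reduces to a few lines of arbitration. A sketch under assumed names: the threshold constant and the (linear, angular) command tuple are illustrative, not Annie's actual interfaces:

```python
ESTOP_DISTANCE_M = 0.30  # hypothetical safety threshold

def arbitrate(vlm_cmd, min_lidar_range_m):
    """The VLM proposes a (linear, angular) velocity; the lidar disposes.
    Geometry wins unconditionally: any return inside the ESTOP radius
    zeroes forward motion, even when the VLM sees 'open space'
    (glass door, mirror). Rotation is still allowed so the robot
    can turn away from the obstacle."""
    linear, angular = vlm_cmd
    if min_lidar_range_m < ESTOP_DISTANCE_M:
        return (0.0, angular)
    return (linear, angular)
```

The key design property is that the VLM output never bypasses this function: the safety decision depends only on lidar geometry, so a hallucinated "clear path" cannot command forward motion.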
Minimum Viable Context
The exact configuration where VLM-primary hybrid starts to pay off: not a fiducial target, single robot, static indoor environment, edge GPU sustaining ≥10 Hz VLM inference, semantic goal vocabulary needed (room names / object types), no fleet training data available. If a local NPU is available, the architecture upgrades to dual-process (L1 Hailo-8 reactive + L2 VLM semantic), which relaxes the ≥10 Hz constraint on the VLM because the NPU covers the safety loop independently.

Annie (Panda E2B at 54 Hz, Pi lidar, single home environment) satisfies the non-NPU path end-to-end — and has an idle Hailo-8 AI HAT+ on the Pi 5 that, once activated, flips Annie onto the dual-process branch and eliminates the WiFi-cliff failure mode for obstacle avoidance.
SINGLE CHANGE THAT FLIPS THE DECISION
If this changes: Idle Hailo-8 gets activated on Pi 5
Decision flips to: Dual-process: L1 YOLOv8n on Hailo (430 FPS, <10 ms, local) + L2 VLM on Panda (semantic)
Why: 26 TOPS co-located with the camera delivers ≥10 Hz by construction and removes WiFi from the safety path. The ≥10 Hz branch no longer matters for reactive control — the NPU owns that loop. IROS 2601.21506 (66% latency reduction) is the validation.

If this changes: Target becomes a registered ArUco marker (e.g. home dock)
Decision flips to: Classical CV: cv2.aruco + solvePnP at ~78 µs on CPU
Why: Fiducials short-circuit the entire VLM pipeline. Annie already uses this for homing (DICT_6X6_50 id=23). Any task that can be reframed as "align to a known marker" should exit the tree at level 2.

If this changes: VLM inference drops from 54 Hz to 3 Hz (GPU contention, model upgrade, network latency)
Decision flips to: Async scene labeling only — remove the VLM from the control loop (unless a local NPU exists, in which case L1 covers it)
Why: At 3 Hz, the robot travels 33 cm between decisions. Temporal consistency collapses. EMA has nothing to smooth. With a local NPU, the VLM can slow down freely because it is no longer on the critical path.

If this changes: Goal vocabulary changes from "kitchen / bedroom / hallway" to "point 3.2m at 47°"
Decision flips to: Pure lidar-primary + coordinate nav. Remove the VLM from the steering loop.
Why: Coordinate-based navigation is purely geometric. SLAM + A* solves it optimally without a VLM. Adding a VLM introduces failure modes (hallucination, glass door) with zero benefit.

If this changes: Environment transitions from a static home to a retail store (daily rearrangement)
Decision flips to: VLM for real-time obstacle description, but lidar-primary planning — no persistent semantic map
Why: Semantic map annotation (Phase 2c) assumes labels are stable across sessions. A store rearranges daily — accumulated cell labels go stale. Persistent semantic memory is now a liability, not an asset.

If this changes: Second robot added, same environment (shared home map)
Decision flips to: Shared semantic map (VLMaps-style) with multi-robot coordination — or full VLA training if demo data accumulates
Why: Fleet data changes training-signal availability. Even 2 robots over 6 months generate enough demonstration data to consider VLA fine-tuning on the specific home environment.

The question "Is VLM-primary hybrid navigation good?" is unanswerable and therefore useless. The question "Under what specific conditions?" yields six binary branches, each with a clear landing. Two of those branches are early exits that catch cases the VLM pipeline should never touch in the first place. The first early exit — at level two — is the fiducial branch: if the target is an ArUco, AprilTag, or QR code, classical CV (cv2.aruco + solvePnP at ~78 µs on the Pi ARM CPU) wins by a factor of roughly 400. Annie's own homing path (DICT_6X6_50 id=23) is this exact case; a VLM here would be strictly worse. The second early exit — between the static-environment check and the ≥10 Hz check — is the local NPU branch: if you have a Hailo-8, Coral, or on-robot Jetson, the dual-process architecture (fast L1 local + slow L2 remote) becomes available and the ≥10 Hz question answers itself because the NPU delivers it by construction. IROS 2601.21506 validates this with a 66% latency reduction.

The most important branch — often skipped — is the semantic need check at level six. Lidar + SLAM + A* is a solved problem for pure obstacle avoidance and coordinate navigation. The literature is deep, the tools are mature, and the failure modes are well-characterized. Introducing a VLM into this loop adds a hallucination failure mode, the glass-door transparency problem (Lens 12, Anti-Pattern 3), and the GPU contention problem. None of these costs are worth paying unless the application genuinely requires room-level or object-level semantic understanding. The practical test: if your navigation goals can be expressed as (x, y) coordinates, you don't need a VLM in the control loop. If your navigation goals require natural language — "go to where Mom usually sits" — you do.

The ≥10 Hz threshold is not arbitrary. It comes from the physics of the robot's motion: at 1 m/s, a 10 Hz loop means decisions are at most 10 cm stale when they arrive. EMA smoothing with alpha=0.3 across five consistent frames reduces the 2% single-frame hallucination rate to near-zero. Below 10 Hz, EMA's stabilizing effect breaks down — too few frames arrive inside an ~86 ms window to vote out a bad answer. The research documents this failure experimentally: in session 92, routing nav queries to the 26B Titan model at ~2 Hz produced visibly worse driving than the resident 2B Panda model at 54 Hz. The fast small model plus temporal smoothing strictly dominates the slow large model for reactive steering. The local-NPU branch sits upstream of this check precisely because a Hailo-8 at 430 FPS satisfies it by construction — the question only matters on the VLM-only path. This is Lens 12's Anti-Pattern 4 rendered as a concrete threshold in the decision tree.
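Both the staleness arithmetic (d = v / f) and the EMA vote can be sketched directly. This is an illustrative implementation, not the production smoother; the alpha value and label set are the only assumptions:

```python
def staleness_m(speed_mps, hz):
    """Worst-case distance travelled between decisions: d = v / f."""
    return speed_mps / hz

class EMAVote:
    """Exponential moving average over per-frame one-hot class scores.
    A single hallucinated frame injects at most `alpha` of score mass,
    so it cannot flip the smoothed decision when the surrounding
    frames agree on a different label."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.scores = {}

    def update(self, label):
        # s_k <- (1 - alpha) * s_k + alpha * [k == label]
        for k in self.scores:
            self.scores[k] *= (1.0 - self.alpha)
        self.scores[label] = self.scores.get(label, 0.0) + self.alpha
        return max(self.scores, key=self.scores.get)
```

At 10 Hz and 1 m/s the staleness is 0.10 m; at 2 Hz it is 0.50 m, matching the failure table above. After five consistent "kitchen" frames, one hallucinated "bedroom" frame still loses the vote (0.30 vs ~0.58 residual score).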

The fleet branch at level seven is the most counterintuitive finding: VLM-primary hybrid navigation is specifically optimized for the case where you cannot train an end-to-end model. It is the correct architecture for a constraint set — single robot, no demonstration data, must work from day one — that most robotics research doesn't address because it doesn't make good benchmark papers. The moment you add fleet data, the constraint evaporates and the architecture should change. OK-Robot (Lens 12, Correct Pattern 2) validated this explicitly: "What really matters is not fancy models but clean integration." That finding holds only while training data is absent. With data, training beats integration. The decision tree encodes this transition point precisely: >1 robot, same environment, accumulating data — switch tracks.

The single-change flip table reveals the architecture's brittleness profile. Most flips are triggered by changes to the inference rate, environment dynamics, or target type — not by changes to model quality or algorithm sophistication. This matches the landscape analysis (Lens 07): Annie's position in the "edge compute density, not sensor count" quadrant means the edge GPU is the load-bearing component. The newly added Hailo-8 activation flip is the highest-leverage change available because it adds a second load-bearing component on the Pi side, eliminating the WiFi-cliff failure mode for obstacle avoidance. The explore-dashboard (session 92) should include a VLM inference rate gauge next to the camera feed: if it drops below 10 Hz, the system should automatically demote the VLM from steering to async labeling, not silently degrade — and if an L1 NPU is present, the demotion is free.
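The rate gauge and auto-demotion logic fit in a small class. A sketch with hypothetical names (RateGauge, steering_allowed); the window size and 10 Hz floor are the only parameters:

```python
import time
from collections import deque

class RateGauge:
    """Sliding-window estimate of VLM inference rate over the last
    `window` completions. When the measured rate falls below floor_hz,
    the caller should demote the VLM from steering to async labeling
    instead of letting the control loop silently degrade."""
    def __init__(self, window=20, floor_hz=10.0):
        self.stamps = deque(maxlen=window)
        self.floor_hz = floor_hz

    def tick(self, now=None):
        """Call once per completed inference."""
        self.stamps.append(time.monotonic() if now is None else now)

    def hz(self):
        if len(self.stamps) < 2:
            return 0.0
        span = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / span if span > 0 else 0.0

    def steering_allowed(self):
        return self.hz() >= self.floor_hz
```

The steering loop checks steering_allowed() each cycle; a False result routes VLM output to the map-annotation queue rather than the velocity command, which is exactly the demotion the flip table prescribes.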

The decision tree makes four structural findings that "Is this good?" cannot reveal:

1. The tree now has six branches instead of four. The two new early branches (fiducial target, local NPU) catch entire classes of cases that don't need a VLM at all — or that graduate to a fundamentally better architecture (dual-process). The most useful decision trees reject cases early, and three of the six branches now lead to "don't use VLM-primary hybrid" before level six even runs.

2. VLM-primary hybrid is correct for exactly one constraint set: not a fiducial target, single robot, static indoor, edge GPU ≥10 Hz (or local NPU covering reactive control), semantic goals, no fleet data. Relax any one condition and the correct architecture changes. Annie satisfies this set today — and has an idle Hailo-8 that would upgrade the architecture to dual-process on demand.

3. The ≥10 Hz threshold is a hard boundary only on the VLM-only path. With a local NPU (Hailo-8 at 26 TOPS, 430 FPS on YOLOv8n, <10 ms), the NPU owns the reactive loop and the VLM can run at whatever rate its semantic task tolerates. IROS 2601.21506 measured 66% latency reduction from this split.

4. The architecture has a designed obsolescence point. At fleet scale, it should be replaced by VLA training. Building a clean hybrid integration is the correct intermediate step, not the final destination. Knowing the exit condition in advance prevents the architecture from calcifying into a permanent workaround.

The decision tree has six branches. Four of them lead to "don't use VLM-primary hybrid" — including two new early exits (fiducial target → classical CV at ~78 µs; local NPU available → dual-process). That means the correct recommendation, most of the time, for most robots, is either "don't do this at all" or "do something more specialized first." How confident are you that your task isn't actually a fiducial task pretending to be a navigation task? That your NPU isn't sitting idle while your VLM does work a 26 TOPS chip could do in a couple of milliseconds? That you aren't pattern-matching to "54 Hz sounds impressive" when a 430 FPS local detector would cover the safety loop for free?

The fiducial audit: list every task the robot performs. For each, ask "could this be solved by a printed marker on the object?" If yes, that task does not belong on the VLM path — move it to classical CV. The NPU audit: check the bill of materials for every accelerator (Hailo-8 AI HAT+, Coral, Jetson co-processor) and for each one verify it is actually being exercised by the current code path. Annie's Hailo-8 has been idle since day one; the dual-process upgrade is a configuration change, not a rewrite. The ≥10 Hz honest-check: measure actual inference rate under production load with context-engine, audio pipeline, and panda_nav running simultaneously. Contention may drop E2B from 54 Hz to 20 Hz — still above threshold, but an L1 NPU makes the margin irrelevant by moving safety off the shared-GPU critical path entirely.
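The honest-check itself is a ten-line harness. A sketch: `infer` is a hypothetical stand-in for one real VLM call, issued while the production workloads are actually running:

```python
import time

def measure_hz(infer, n_warmup=5, n_timed=50):
    """Return the sustained rate of `infer` (a zero-arg callable wrapping
    one inference) measured under whatever load is live right now.
    Warmup calls absorb first-call costs (cache fill, lazy init) so the
    timed window reflects steady state."""
    for _ in range(n_warmup):
        infer()
    t0 = time.perf_counter()
    for _ in range(n_timed):
        infer()
    return n_timed / (time.perf_counter() - t0)
```

Run it once with the robot idle and once with context-engine, audio, and nav live; the gap between the two numbers is the contention cost the 54 Hz headline figure hides.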
