LENS 18 — Decision Tree: Under What Specific Conditions Is This the Best Choice?

The question "Is VLM-primary hybrid navigation good?" is unanswerable and therefore useless. The actionable version is: under what specific conditions? The decision tree has six branching questions, up from five. Two new early exits catch entire classes of cases that don't need a VLM at all, or that graduate to a fundamentally better architecture. Four of the six branches terminate in "don't use VLM-primary hybrid." The architecture is correct for exactly one constraint set.

BRANCH ONE: Do you have a camera and an edge GPU?
If no — use lidar-only SLAM: slam_toolbox plus Nav2. No VLM path exists without visual input or local inference. Stop here.
If yes — continue to branch two.

BRANCH TWO — NEW: Is the target a known fiducial?
This means an ArUco tag, AprilTag, or QR code — a pre-registered marker with known geometry.
If yes — use classical computer vision and skip the VLM entirely. The cv2.aruco detector plus solvePnP runs in roughly 78 microseconds on a Pi ARM CPU. No GPU, no network, no hallucination surface. Annie's own homing path — DICT_6X6_50, id 23 — is this exact case. A VLM here would be four hundred times slower and strictly worse.
If no — continue to branch three.

BRANCH THREE: Is your environment mostly static?
Home, office, warehouse — not a street, crowd, or construction site.
If no — VLM-primary won't help. Dynamic scenes need trajectory prediction: Waymo's MotionLM, occupancy flow, object tracking. VLM scene-classification latency of 18 milliseconds is too slow to track moving pedestrians or vehicles. Use a dedicated perception stack.
If yes — continue to branch four.

BRANCH FOUR — NEW: Do you have a local NPU?
This means a Hailo-8, Coral Edge TPU, or on-robot Jetson — any accelerator co-located with the camera, with no WiFi hop to the inference.
If yes — the dual-process architecture becomes available. Fast L1 reactive detection runs locally on the NPU.
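The fast path's job is deliberately narrow: veto or scale motion locally, with no network in the loop, while semantic reasoning happens elsewhere. A minimal sketch of that local gate — names, thresholds, and the linear ramp are all hypothetical illustrations, not Annie's actual code:

```python
# Dual-process fast path (sketch): the L1 loop runs on every NPU detection
# frame and may only stop or slow the robot; goal velocities come from the
# slow L2 path. Thresholds below are illustrative, not tuned values.

STOP_DISTANCE_M = 0.30   # hard-stop range for the reactive layer
SLOW_DISTANCE_M = 0.60   # begin scaling speed below this range

def l1_speed_scale(nearest_obstacle_m: float) -> float:
    """Map the nearest locally detected obstacle range to a speed scale."""
    if nearest_obstacle_m <= STOP_DISTANCE_M:
        return 0.0  # hard stop, decided on-device with no WiFi round-trip
    if nearest_obstacle_m < SLOW_DISTANCE_M:
        # Linear ramp between the stop and slow thresholds.
        return (nearest_obstacle_m - STOP_DISTANCE_M) / (
            SLOW_DISTANCE_M - STOP_DISTANCE_M
        )
    return 1.0

def gated_velocity(l2_cmd_mps: float, nearest_obstacle_m: float) -> float:
    """Clamp the slow path's commanded velocity with the local safety gate."""
    return l2_cmd_mps * l1_speed_scale(nearest_obstacle_m)
```

The key property is that the gate never needs the remote VLM to be reachable: a stale or absent L2 command degrades the goal, not the safety layer.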
Slow L2 semantic reasoning runs remotely on the edge GPU. The Hailo-8 AI HAT+ on the Pi 5 delivers 26 TOPS, YOLOv8n at 430 frames per second, latency under 10 milliseconds, and zero WiFi dependency. The 10 hertz question answers itself — the NPU delivers it by construction. The IROS paper 2601.21506 validates this pattern with a 66% latency reduction and success rates of 67.5% versus 5.83% for VLM-only. Skip to the semantic need check.
If no — continue to branch five, the VLM-only path.

BRANCH FIVE: Can your VLM sustain 10 hertz or more on-device?
Gemma 4 E2B on Panda achieves 54 hertz. A cloud VLM with a network round-trip achieves 2 to 5 hertz in the worst case.
If below 10 hertz — use the VLM for scene labeling only. Run it asynchronously, not in the control loop. Below 10 hertz, a robot moving at one meter per second travels more than 10 centimeters between decisions. Lidar-primary SLAM handles reactive control; the VLM annotates SLAM map cells offline — the Phase 2c pattern without real-time fusion.
If 10 hertz or above — continue to branch six.

BRANCH SIX: Do you need semantic understanding?
Room names, object categories, goals like "go to the kitchen" — not just "avoid obstacle at 0.3 meters."
If no — lidar-primary is simpler and more robust; use the VLM as an emergency backup only. Pure obstacle avoidance, go-to-coordinate tasks, and geometric path-following need zero VLM involvement. Lidar plus SLAM plus A* is a solved problem for this case. Don't add VLM complexity without a semantic payoff.
If yes — continue to the final check.

FINAL CHECK: Do you have more than one robot?
If yes, fleet — end-to-end VLA training. Fleet-scale demonstration data unlocks RT-2, OpenVLA, and pi0. Skip the multi-query hybrid entirely; this research does not apply at fleet scale.
If no, single robot — VLM-primary hybrid is the right choice. Add lidar as the geometry and safety layer. Multi-query pipeline. Semantic map annotation. OK-Robot's central finding applies: clean integration beats custom models.
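The single-robot outcome — a multi-query pipeline with semantic map annotation over a lidar/SLAM base — can be sketched in a few lines. Everything here is hypothetical naming, a sketch of the integration pattern rather than Annie's implementation: the VLM is queried at most once per map cell, answers are cached as labels, and the geometric planner consumes only cells:

```python
# Multi-query hybrid (sketch): SLAM owns geometry and the control loop; the
# slow VLM is queried per grid cell and its labels are cached, so each
# expensive query is paid once. Names are illustrative, not a real API.
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # SLAM occupancy-grid cell index

class SemanticMap:
    def __init__(self, vlm_label: Callable[[Cell], str]):
        self._vlm_label = vlm_label        # slow remote VLM query
        self.labels: Dict[Cell, str] = {}  # cached annotations

    def annotate(self, cell: Cell) -> str:
        # One VLM round-trip per cell, ever; repeat visits hit the cache.
        if cell not in self.labels:
            self.labels[cell] = self._vlm_label(cell)
        return self.labels[cell]

    def resolve_goal(self, word: str) -> List[Cell]:
        # Semantic goal ("go to the kitchen") -> candidate cells that a
        # purely geometric planner (e.g. A* over the grid) can then target.
        return sorted(c for c, lab in self.labels.items() if word in lab)
```

Because annotation happens off the control loop, the VLM's latency bounds how fast the map gains meaning, not how fast the robot reacts.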
The VLM-primary hybrid is Annie's exact configuration.

THE MINIMUM VIABLE CONTEXT
Not a fiducial target; single robot; static indoor environment; edge GPU sustaining 10 hertz or more of VLM inference; semantic goal vocabulary needed; no fleet training data available. If a local NPU is available, the architecture upgrades to dual-process, which relaxes the 10 hertz constraint on the VLM because the NPU covers the safety loop independently. Annie on Panda at 54 hertz with Pi lidar in a single home environment satisfies the non-NPU path end-to-end — and has an idle Hailo-8 AI HAT+ on the Pi 5 that, once activated, flips Annie onto the dual-process branch and eliminates the WiFi cliff-edge failure mode for obstacle avoidance.

THE SINGLE HIGHEST-LEVERAGE FLIP
Activating the idle Hailo-8 on the Pi 5 is the single highest-leverage change available. The hardware exists but is currently unused for navigation. YOLOv8n runs at 430 frames per second with zero WiFi latency. It replaces the WiFi-dependent VLM as the safety layer. It uses HailoRT and TAPPAS. It eliminates the WiFi cliff-edge failure for obstacle avoidance. The upgrade is a configuration change, not an architecture rewrite.

Three of the six branches now lead to "don't use VLM-primary hybrid" before the semantic need check even runs. The most useful decision trees reject cases early. This one now does.
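Read end to end, the tree collapses into a single function. This is a sketch of the branching logic only — the argument names are mine, and the return strings paraphrase the branch outcomes above:

```python
# The six branches plus the final check, in order. Inputs mirror the
# questions as asked in the text; returns paraphrase each endpoint.

def choose_architecture(
    has_camera_and_edge_gpu: bool,  # branch one
    target_is_fiducial: bool,       # branch two
    environment_static: bool,       # branch three
    has_local_npu: bool,            # branch four
    vlm_hz: float,                  # branch five (no-NPU path only)
    needs_semantics: bool,          # branch six
    fleet: bool,                    # final check
) -> str:
    if not has_camera_and_edge_gpu:
        return "lidar-only SLAM (slam_toolbox + Nav2)"
    if target_is_fiducial:
        return "classical CV (cv2.aruco + solvePnP)"
    if not environment_static:
        return "dedicated dynamic-perception stack"
    # An NPU relaxes the 10 Hz VLM constraint; without one it is binding.
    if not has_local_npu and vlm_hz < 10:
        return "lidar-primary SLAM; VLM as async scene labeler"
    if not needs_semantics:
        return "lidar-primary; VLM as emergency backup only"
    if fleet:
        return "end-to-end VLA training"
    if has_local_npu:
        return "dual-process: NPU fast path + remote VLM slow path"
    return "VLM-primary hybrid with lidar safety layer"
```

Annie's current inputs (camera and edge GPU, no fiducial target, static home, no active NPU, 54 hertz, semantic goals, single robot) land on the final return; flipping the NPU flag alone moves her to the dual-process branch.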