LENS 24 — GAP FINDER

"What's not being said — and why?"

---

THE CORE FINDING

The research on VLM-primary hybrid navigation is comprehensive about the fast path and silent about the slow path. Eight things are covered in detail: the multi-query VLM pipeline, the four-tier hierarchical fusion architecture, temporal consistency via exponential moving average, visual place recognition, semantic map annotation, the evaluation framework, the phased implementation roadmap, and the architectural lessons from Waymo and Tesla. Every component of the nominal pipeline is specified with code entry points, hardware assignments, and probability estimates. What the research never addresses is what happens when something goes wrong.

---

THE 18-GAP INVENTORY

GAP 1 — CRITICAL: Camera-lidar extrinsic calibration. This is the most consequential gap because it is a hidden prerequisite for Phase 2c — the architectural centerpiece of the entire research. Phase 2c attaches VLM scene labels to SLAM grid cells "at current pose." This requires knowing the precise spatial transform between the camera's optical axis and the lidar's coordinate frame. Without calibration, a label generated by the camera at angle A lands on a lidar cell at angle B. Semantic labels drift from the obstacles they describe. The research never mentions calibration anywhere. It treats Phase 2c as having 65 percent probability of success — but the actual prerequisite list includes an unlisted item that blocks the entire phase. Calibration requires a checkerboard target, multiple capture poses, and a solver such as Kalibr. It is a 2 to 4 hour process that must be repeated if the camera or lidar is physically moved.

GAP 2 — CRITICAL: VLM hallucination detection and recovery. The research introduces confidence accumulation as a feature: after 5 consistent VLM frames, the system increases speed.
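As context for the critique that follows in this gap, the accumulation mechanism itself is simple. A minimal sketch, where the EMA factor, the 5-frame requirement, the 0.7 trust threshold, and all names are illustrative assumptions rather than the actual implementation:

```python
# Hypothetical sketch of EMA-based confidence accumulation with a
# speed gate. All names and thresholds are illustrative assumptions.

class ConfidenceAccumulator:
    def __init__(self, alpha=0.4, frames_required=5):
        self.alpha = alpha                  # EMA smoothing factor
        self.frames_required = frames_required
        self.ema = 0.0                      # smoothed per-frame VLM confidence
        self.consistent = 0                 # consecutive frames with same label
        self.last_label = None

    def update(self, label, confidence):
        # Exponential moving average of the raw VLM confidence.
        self.ema = self.alpha * confidence + (1 - self.alpha) * self.ema
        # Count consecutive frames that agree on the same scene label.
        self.consistent = self.consistent + 1 if label == self.last_label else 1
        self.last_label = label
        return self.ema

    def speed_scale(self):
        # After enough consistent, confident frames, unlock higher speed.
        if self.consistent >= self.frames_required and self.ema > 0.7:
            return 1.5
        return 1.0
```

Note that nothing in this loop consults the lidar before `speed_scale` unlocks the higher speed; that is exactly the missing cross-check this gap describes.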
But confidence accumulation on a systematically wrong VLM output means the system accelerates toward the hazard it has been confidently misclassifying. There is no cross-check mechanism. VLM says "forward clear," lidar says "blocked at 200 millimeters" — there is no logic to flag this disagreement as a hallucination signal. There is no degraded-mode fallback. The lidar emergency stop will fire at 250 millimeters, but by then the robot is already committed to a collision trajectory at elevated speed.

GAP 3 — HIGH: WiFi fallback and graceful degradation. The four-tier architecture requires the Panda VLM server to be reachable from the Pi over WiFi. Lens 04 identified the WiFi cliff edge at 100 milliseconds latency — above that, navigation decisions arrive stale. This research never describes what happens when WiFi degrades. Does the robot stop? Fall back to lidar-only reactive navigation? Continue on the last valid VLM command? The absence of a degradation protocol means the system has a single point of failure on the WiFi link.

GAP 4 — HIGH: Map persistence and corruption recovery. Phase 1 SLAM builds the occupancy grid that Phase 2c annotates with semantic labels. The research describes building the map but not protecting it. What happens when the map is corrupted by a power loss mid-write? When the map diverges from reality after furniture is rearranged? When the robot is carried to a new location and the prior map is now wrong? Map corruption is silent — the robot will navigate confidently into walls.

GAP 5 — HIGH: Dynamic obstacle tracking — people, pets, moving objects. The research treats obstacles as static. "Nearest obstacle — reply one word: chair, table, wall, door, person, none." A person walking through the frame moves at 1.5 meters per second. A cat moves even faster. The robot navigates at 1 meter per second. These are directly comparable speeds.
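The arithmetic behind "directly comparable speeds" is worth making explicit. A back-of-envelope check, using the lens's own figures (a 1.5 m/s walker, a 1 m/s robot, a 250 mm emergency-stop radius) and the research's 1 to 2 hertz planning rate:

```python
# Back-of-envelope reaction margin for a moving agent (Gap 5).
# Figures are the ones quoted in this lens; the head-on worst case
# is an assumption for illustration.

def closing_distance_per_cycle(agent_speed, robot_speed, planner_hz):
    """Worst-case distance closed between planner decisions (head-on)."""
    return (agent_speed + robot_speed) / planner_hz

gap = closing_distance_per_cycle(1.5, 1.0, planner_hz=1.0)
print(gap)         # 2.5 meters closed per planning cycle
print(0.25 / gap)  # 0.1: the 250 mm estop radius is a tenth of one cycle
```

At 1 hertz, a walking person and the robot close 2.5 meters between decisions, ten times the emergency-stop radius, so the planner cannot react in time without a faster tier.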
The Waymo section explicitly covers MotionLM trajectory prediction for agents, then dismisses it as "not directly applicable — no high-speed agents in a home." This is the most vulnerable sentence in the research. It is simply wrong. A 2-year-old child or a cat IS a high-speed agent in a home that moves faster than the robot can react at a 1 to 2 hertz planning frequency.

GAP 6 — HIGH: Night and low-light operation. A home robot's most frequent use case is lights-off or dim-light navigation — fetching water at night, patrolling while the family sleeps. The VLM requires adequate illumination for scene classification and goal-finding. Below roughly 50 lux, VLM confidence drops dramatically and hallucination rate rises. The research never mentions this. Solutions exist — infrared illumination, lidar-only fallback mode, ambient light sensor gating VLM trust weight — but none are discussed.

GAP 7 — HIGH: Battery management during exploration. The TurboPi with 4 batteries has a runtime of approximately 45 to 90 minutes under load. During Phase 2d embedding extraction, the VLM runs continuously — additional WiFi traffic increases power draw further. There is no power-aware path planning, no return-to-charger trigger, and no low-battery emergency stop. A robot that runs out of power mid-room becomes an obstacle itself.

GAP 8 — HIGH: Glass and transparent surface handling. Glass doors, glass dining tables, and glass-fronted cabinets are invisible to lidar — the laser passes through. The research's fusion rule — "VLM proposes, lidar disposes" — fails here: lidar says "clear," VLM says "blocked," and the fusion rule discards the VLM's correct observation in favor of lidar's false negative. Glass surfaces are the one physical scenario where VLM must override lidar, but the research establishes no mechanism for this exception.

GAP 9 — HIGH: Cost-benefit analysis of each phase. The roadmap provides probability-of-success estimates but no probability-of-worthwhile estimates.
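Gaps 6 and 8 above point at the same structural fix: the fusion rule must become conditional on context, with a glass exception where the VLM overrides lidar and an illumination gate on VLM trust. A sketch of what that could look like; every class name and threshold here is an assumption, not part of the research:

```python
# Context-dependent fusion sketch for Gaps 6 and 8. The rule
# "VLM proposes, lidar disposes" becomes conditional: lidar stays
# authoritative for anything it can see, a confident VLM glass
# detection overrides lidar's false "clear", and low light falls
# back to lidar-only. All names and thresholds are illustrative.

GLASS_CLASSES = {"glass door", "glass table", "glass cabinet"}
MIN_LUX_FOR_VLM = 50  # below this, VLM hallucination rate rises

def fused_blocked(vlm_label, vlm_says_blocked, lidar_says_blocked, lux):
    # Lidar remains authoritative when it reports an obstacle.
    if lidar_says_blocked:
        return True
    # Gap 8 exception: the laser passes through glass, so a VLM
    # glass detection overrides lidar's false negative.
    if vlm_says_blocked and vlm_label in GLASS_CLASSES:
        return True
    # Gap 6 gate: only trust a non-glass VLM "blocked" when there
    # is enough light for the classifier to be reliable; otherwise
    # operate lidar-only.
    if vlm_says_blocked and lux >= MIN_LUX_FOR_VLM:
        return True
    return False
```

The design choice worth noting is that the glass exception fires before the illumination gate; whether a dim-light glass detection should be trusted is itself an open question this sketch does not settle.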
Phase 2c has 65 percent probability of success and requires 2 to 3 sessions of implementation. But what does success actually buy? How much does semantic map annotation improve navigation success rate? The evaluation framework defines metrics but never connects them to phase gates. There is no specification of "if metric X does not reach threshold Y, skip phase Z."

GAP 10 — MEDIUM: Privacy implications of persistent spatial memory. Phase 2c and 2d build a semantically annotated map of the home — every room labeled, every piece of furniture positioned, camera embeddings indexed by location. The research never mentions where this data is stored, who can access it, how long it persists, or whether guests consent to being observed and classified. For her-os specifically, the spatial memory intersects with conversation memory — the system knows both what was said AND where the robot was when it was said.

GAP 11 — MEDIUM: User onboarding and first-run experience. Phase 2 requires Phase 1 SLAM to be deployed first. Phase 1 requires the robot to explore the entire home to build the map. Who drives the robot during this exploration? What does the user experience when the map is empty and navigation is impossible? The research specifies what data Phase 1 must log but not how a non-technical user initiates the mapping process or recovers from a failed mapping run.

GAP 12 — MEDIUM: Acoustic localization as complementary signal. A home robot built around Annie's voice capabilities has access to an unused sensor: sound source localization. A person calling "Annie, come here" provides a bearing to the speaker that neither camera nor lidar can match at distance. Sound travels around corners and through walls. For her-os specifically, voice-directed navigation — "I'm in the kitchen" — is a more natural interaction pattern than visual goal-finding and should be a first-class input to the planner. The research focuses entirely on visual and geometric perception.
The acoustic dimension is completely absent.

GAP 13 — MEDIUM: Long-term map drift correction. SLAM drift is cumulative. After weeks of operation, the occupancy grid will have small errors that compound. Neither the research nor the roadmap specifies a drift correction schedule: How often should the robot re-survey the home? What triggers a global re-localization? How are semantic labels migrated when the underlying occupancy grid is updated?

GAP 14 — MEDIUM: Furniture rearrangement detection. Indian homes rearrange furniture frequently — seasonal, guests, festivals, daily prayer setups. The Phase 1 SLAM map bakes in the furniture layout at time of mapping. When a sofa moves 1 meter, the SLAM system will experience localization failures. The research never describes how the system detects that a map region is stale versus that the robot is lost.

GAP 15 — MEDIUM: Emergency behavior — fire, smoke, medical alert. The research defines emergency stop as absolute priority for obstacle collisions. But it never defines behavior for whole-home emergencies. If a smoke detector triggers, should the robot navigate to the nearest exit? Alert family members via Telegram? The four-tier architecture has no emergency tier above the strategic tier.

GAP 16 — LOW: Multi-floor navigation. The TurboPi cannot climb stairs. This gap is correctly implicit. However, the research never states the single-floor constraint explicitly. Explicit scope declarations matter as much as what is included.

GAP 17 — LOW: Outdoor-to-indoor transition. The research is implicitly scoped to indoor home navigation but never states this boundary. The VLM's scene classifier has no outdoor classes. The correct response is to state the boundary explicitly rather than leave it implicit.

GAP 18 — LOW: Map sharing between robots. If a household has two Annie units in the future, should they share the occupancy grid?
The architecture choice made in Phase 1 — centralized versus per-robot map storage — will determine whether this is possible at all.

---

THE FAST PATH VERSUS SLOW PATH DISTINCTION

The 18-gap inventory reveals a consistent pattern. The research solves every problem in the nominal execution path and ignores every problem in the recovery path. This is not carelessness — it is the standard research paper tradeoff. Papers demonstrate the happy path. Slow path specification belongs to engineering documentation, not academic research.

But her-os is not an academic project. It is a home robot that will run unattended in a real house with real people. The slow path is where the system will spend a significant fraction of its operational lifetime.

The highest-leverage action is to close Gap 1 before Phase 2c begins. Calibrate the camera-lidar transform. Encode it as a static TF transform in the SLAM configuration. Treat it as a physical constant. Then close Gap 2: add a VLM-lidar disagreement detector before enabling confidence-based speed modulation. These two fixes address the most dangerous failure modes with changes that require less than one session each.

---

UPDATE — 2026-04-16: THE MITIGATION PATH AND THE META-GAP

The session 119 hardware audit of April 16 added three items to this lens: a mitigation path for Gap 3, an integration action item, and a meta-gap about process.

Gap 3 — the WiFi fallback gap — is now reclassified from "open" to "mitigation path identified." The Pi 5 already carries a Hailo-8 AI HAT+, 26 tera-operations per second (TOPS), currently idle for navigation. YOLOv8n runs locally on it at roughly 430 frames per second with inference latency under 10 milliseconds and zero WiFi dependency.
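Under this mitigation, tier arbitration reduces to a staleness check on the remote plan. A minimal sketch, assuming the 100 millisecond freshness budget from Lens 04; the function and command names are hypothetical:

```python
import time

# Hypothetical arbiter for a fast-plus-slow dual-process pattern:
# use the remote VLM plan while it is fresh, fall back to the local
# Hailo detector when WiFi latency pushes the plan past the 100 ms
# cliff edge. Names and thresholds are illustrative assumptions.

STALE_AFTER_S = 0.100  # Lens 04's WiFi cliff edge

def pick_command(vlm_cmd, vlm_stamp, local_cmd, now=None):
    """Return (source, command). vlm_cmd is None when WiFi is down."""
    now = time.monotonic() if now is None else now
    if vlm_cmd is not None and (now - vlm_stamp) <= STALE_AFTER_S:
        return ("vlm", vlm_cmd)
    # Degraded mode: local reactive layer only. A real system would
    # also cap top speed here, since semantic context is unavailable.
    return ("local", local_cmd)
```

The lidar emergency stop would sit below this arbiter and remain authoritative regardless of which source wins.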
An IROS paper, arXiv 2601.21506, validates the fast-plus-slow dual-process pattern — fast local reactive layer plus slow remote semantic layer — and reports a 66 percent latency reduction versus continuous VLM. The WiFi cliff edge that Lens 04 quantified no longer terminates in an unsolved safety gap.

Gap 3-a is the new action item: Hailo-8 integration. HailoRT runtime installed on the Pi. YOLOv8n compiled to HEF. A ROS2 or zenoh publisher emitting bounding boxes. A fusion node that keeps lidar ESTOP authoritative for transparent surfaces while using Hailo detections for fast general obstacle classification. And a WiFi-down regression test that proves the robot avoids a chair with Panda unreachable.

Three inventory gaps are added.
INV-1: the Hailo-8 itself, 26 TOPS, unused.
INV-2: Beast, a second DGX Spark with 128 gigabytes of unified memory, always on, workload-idle since April 6.
INV-3: an Orin NX 16-gigabyte unit, 100 TOPS of Ampere, owned but not mounted on a carrier board.

The meta-gap is procedural, not technical. The original 18-gap inventory was derived by reading the design and asking what failure modes were unaddressed. It was not derived by first listing every accelerator Annie already owns and asking which ones the design uses. Had the owned-hardware audit come first, Gap 3 would have noted the Hailo-8 on day one. Dormant owned hardware is the most common unacknowledged gap class in a multi-node system. The fix is not a new architecture — it is a new first step: before any tier is proposed, enumerate the powered-on accelerators in the household and explain which one hosts the new tier, or why none can.

Cross-references. Lens 04 already mapped the WiFi cliff edge; activation of the Hailo-8 now closes that single point of failure.
Lens 15 on tempo should add a local-first tier below the current reactive tier, because sub-10-millisecond local inference rewrites the safety budget. Lens 21 should flag as a contradiction the simultaneous claim of "WiFi fallback unsolved" and a 26 TOPS accelerator idle on the same board. Lens 25 on process gaps is where the owned-hardware audit belongs — promoted from an ad hoc finding to a standing pre-design ritual.

---

END OF LENS 24