LENS 24

Gap Finder

"What's not being said — and why?"

Covered — The Fast Path
Multi-query VLM pipeline (Phase 2a)

Goal-tracking, scene classification, obstacle awareness, place recognition — all on alternating frames at 5–8 Hz. Mechanically complete.

4-tier hierarchical SLAM + VLM fusion

Strategic (Titan LLM) → Tactical (Panda VLM) → Reactive (Pi lidar) → Kinematic (IMU). Fusion rule explicit: VLM proposes, lidar disposes, IMU corrects.

Temporal consistency (EMA + confidence accumulation)

Exponential moving average filters single-frame hallucinations. Variance tracking detects cluttered vs. stable scenes and adjusts speed.
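The EMA-plus-variance pattern described above can be sketched in a few lines. This is an illustrative reconstruction, not the research's implementation — the `alpha` weight and the clutter threshold are assumed values.

```python
class TemporalFilter:
    """Smooths per-frame VLM confidence and tracks scene stability.

    Sketch of the EMA + variance-tracking idea; alpha and
    clutter_threshold are illustrative assumptions.
    """

    def __init__(self, alpha=0.3, clutter_threshold=0.04):
        self.alpha = alpha                    # EMA weight for the newest frame
        self.ema = None                       # smoothed confidence
        self.var = 0.0                        # EMA of squared deviation
        self.clutter_threshold = clutter_threshold

    def update(self, confidence):
        """Fold one frame's confidence into the running estimate."""
        if self.ema is None:
            self.ema = confidence
            return self.ema
        deviation = confidence - self.ema
        self.ema += self.alpha * deviation
        # Variance tracked the same way: high variance signals a
        # cluttered/unstable scene, which should lower the speed cap.
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return self.ema

    def scene_is_stable(self):
        return self.var < self.clutter_threshold
```

A single-frame hallucination (one outlier confidence value) shifts the EMA only by `alpha * deviation`, while a genuinely cluttered scene keeps the variance estimate high and holds speed down.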

Visual place recognition (AnyLoc / SigLIP embeddings)

DINOv2 + VLAD for loop closure confirmation. Cosine similarity topological map. Phases 2d and 2e with clear hardware assignments.
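The cosine-similarity matching step can be sketched as follows. This is a minimal illustration of the topological-map lookup, not the AnyLoc code path; the 0.85 match threshold and the flat-list descriptor format are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_place_match(query, place_db, threshold=0.85):
    """place_db maps place_id -> stored VLAD descriptor.

    Returns (place_id, score) for the best match above the threshold,
    or None when no stored place is similar enough (no loop closure).
    """
    best = max(
        ((pid, cosine_similarity(query, d)) for pid, d in place_db.items()),
        key=lambda t: t[1],
        default=None,
    )
    return best if best and best[1] >= threshold else None
```

A below-threshold best match is the "new place" signal; only above-threshold matches should be forwarded to slam_toolbox as loop-closure candidates.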

Semantic map annotation (VLMaps pattern)

VLM scene labels attached to SLAM grid cells at current pose. Rooms emerge from accumulated labels over time.
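The annotation step — project the current pose into grid indices, accumulate the label at that cell — can be sketched as below. Grid resolution, origin handling, and the majority-vote rule for "rooms emerge from accumulated labels" are illustrative assumptions, not the VLMaps implementation.

```python
from collections import Counter, defaultdict

class SemanticGrid:
    """Accumulates VLM scene labels on occupancy-grid cells."""

    def __init__(self, resolution_m=0.05, origin=(0.0, 0.0)):
        self.resolution = resolution_m
        self.origin = origin
        self.labels = defaultdict(Counter)    # (ix, iy) -> label counts

    def world_to_cell(self, x, y):
        """Map a world-frame position onto integer grid indices."""
        ix = int((x - self.origin[0]) / self.resolution)
        iy = int((y - self.origin[1]) / self.resolution)
        return ix, iy

    def annotate(self, pose_xy, label):
        """Attach one VLM scene label at the robot's current pose."""
        self.labels[self.world_to_cell(*pose_xy)][label] += 1

    def room_label(self, pose_xy):
        """Majority vote over accumulated labels at this cell."""
        counts = self.labels.get(self.world_to_cell(*pose_xy))
        return counts.most_common(1)[0][0] if counts else "unknown"
```

Note that `pose_xy` must already be expressed in the map frame — which is exactly where the camera-lidar extrinsic calibration gap (Gap 1 below) bites: without the calibrated transform, the pose the label is attached at is not where the camera was looking.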

Evaluation framework and Phase 1 logging spec

ATE, VLM obstacle accuracy, scene consistency, place recognition P/R, navigation success rate all defined. Data sources and rates specified.

Phased implementation roadmap (2a–2e) with P(success) estimates

Clear sequencing: 2a/2b before SLAM deployed, 2c–2e after. Probability estimates from 90% down to 50%. Prerequisites explicit.

Waymo/Tesla architectural translation exercise

Explicit "translates / does not translate" analysis. Identifies what to borrow (dual-rate, map-as-prior) and what to skip (custom silicon, 8-camera surround).

Gaps — The Slow Path (Recovery, Edge Cases, Human Factors)
Camera-lidar extrinsic calibration [CRITICAL — hidden Phase 2c prerequisite]

Phase 2c attaches VLM scene labels to SLAM grid cells "at current pose." This requires knowing the precise spatial transform between the camera's optical axis and the lidar's coordinate frame. Without calibration, a label generated by the camera at angle A lands on a lidar cell at angle B — semantic labels drift from the obstacles they describe. The research never mentions this. Calibration requires a checkerboard target, multiple capture poses, and a solver (e.g., Kalibr). It is a multi-hour process that must be repeated if the camera or lidar is physically moved. See also Lens 03 (the llama-server embedding blocker is a similar hidden prerequisite — a dependency that blocks a phase without being named as a prerequisite).

VLM hallucination detection and recovery [CRITICAL]

The research mentions EMA filtering for single-frame noise but never addresses systematic hallucination — when the VLM confidently and persistently reports something false (e.g., "door CENTER LARGE" for a wall). Confidence accumulation makes this worse: after 5 consistent wrong frames the system goes faster toward the obstacle. There is no detection mechanism (e.g., VLM says forward-clear, lidar says blocked at 200mm → flag as hallucination), no recovery protocol, and no degraded-mode fallback. This is the most dangerous gap in the design. See also Lens 10 ("we built the fast path, forgot the slow path") — hallucination recovery IS the slow path for VLM navigation.

WiFi fallback and graceful degradation strategy [HIGH — mitigation path identified]

The 4-tier architecture requires Panda (VLM, Tier 2) to be reachable from the Pi (Tier 3/4) over WiFi. Lens 04 identified the WiFi cliff edge at 100ms latency — above that, nav decisions arrive stale. The research never describes what happens when WiFi degrades: Does the robot stop? Fall back to lidar-only reactive nav? Continue on the last valid VLM command?

Update (2026-04-16, session 119 hardware audit): The gap is now partially closable at zero hardware cost. The Pi 5 already carries a Hailo-8 AI HAT+ (26 TOPS) that is currently idle for navigation. Running YOLOv8n locally on the Hailo delivers ~430 FPS of obstacle detection with <10 ms latency and zero WiFi dependency — exactly the "fast safety layer" the degradation protocol needs. An IROS paper (arXiv 2601.21506) validates the dual-process pattern (fast local reactive + slow semantic remote) and reports a 66% latency reduction vs. continuous-VLM. Status transitions from "open / no mitigation" to "mitigation path identified; integration work pending." The residual gap is no longer "what happens when WiFi drops" — it is Gap 3a below.

Hailo-8 integration into the safety loop [HIGH — action item]

This is the specific closable work implied by Gap 3's mitigation path. The Hailo-8 AI HAT+ sits on the Pi 5's PCIe lane at 26 TOPS, drawing power and occupying physical space, and contributes nothing to navigation today. Activating it requires: HailoRT/TAPPAS runtime install on the Pi, a YOLOv8n HEF model compiled for Hailo-8, a ROS2/zenoh publisher that emits bounding boxes + class IDs, a fusion node that combines Hailo detections with lidar ESTOP (lidar still wins on transparent-surface false positives), and a WiFi-down regression test that verifies the robot can still avoid a chair with Panda unreachable. This is not a research gap — it is a prioritization gap: a 26 TOPS accelerator on-robot is orders of magnitude more capable than the lidar-only fallback that was implicitly assumed for WiFi outages. See also Lens 04 (WiFi cliff-edge) and Lens 25 (process gap — owned-hardware audit was skipped).
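The fusion node's decision logic can be sketched as pure priority arbitration. This is an assumed sketch, not the HailoRT or ROS2 interface: the ESTOP distance is taken from the research, but the message shapes, class set, confidence cutoff, and command names are illustrative.

```python
ESTOP_MM = 250                        # lidar hard-stop distance (from the research)
HAILO_STOP_CLASSES = {"person", "chair", "pet"}   # assumed class set

def safety_decision(lidar_range_mm, hailo_detections, wifi_up, vlm_command):
    """Arbitrate one cycle's motion command.

    Priority order: lidar ESTOP > local Hailo detections > remote VLM
    command (only when WiFi is up) > degraded lidar-only creep.
    """
    # Tier 3 always wins: a close lidar return is an unconditional stop.
    if lidar_range_mm is not None and lidar_range_mm < ESTOP_MM:
        return "ESTOP"
    # Fast local path: Hailo detections need no WiFi at all.
    blocking = [d for d in hailo_detections
                if d["cls"] in HAILO_STOP_CLASSES and d["conf"] > 0.5]
    if blocking:
        return "SLOW_AVOID"
    # Slow semantic path: only trusted while Panda is reachable.
    if wifi_up and vlm_command is not None:
        return vlm_command
    # Degraded mode: no VLM, no detections — proceed cautiously on lidar alone.
    return "CREEP_FORWARD"
```

The WiFi-down regression test described above reduces to asserting that with `wifi_up=False` the robot still avoids a detected chair — which this arbitration satisfies by construction, since the Hailo branch precedes the WiFi check.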

Inventory Gaps — Dormant Owned Hardware (Process Gap: No Audit Was Performed)
INV-1: Hailo-8 AI HAT+ on Pi 5 — 26 TOPS idle for navigation

Already present on the robot. YOLOv8n reference throughput: 430 FPS local, <10 ms, zero WiFi dependency. Closes Gap 3 if integrated (see Gap 3a). The fact that this accelerator was not considered in the original 4-tier architecture is itself the finding — the research specified a Pi tier and a Panda tier without auditing what the Pi was already capable of.

INV-2: Beast (2nd DGX Spark, 128 GB) — always-on, workload-idle since 2026-04-06

A second DGX Spark node with 128 GB of unified memory is powered on 24/7 and carries no production workload following the single-machine consolidation onto Titan. Potentially suitable for: offline benchmark sweeps, a dedicated SLAM/perception compute node, an alternate-region replica for resilience, or a dedicated training surface for vision models (YOLOv8n distillation for Hailo, Gemma LoRA). The research proposed new workloads without first checking what existing compute was unused — a dormant 128 GB node is a larger capacity reservoir than the entire Pi+Panda stack combined.

INV-3: Orin NX 16GB — owned, not yet mounted on a carrier board

Jetson Orin NX 16GB (100 TOPS, Ampere GPU) is user-owned and reserved for a future robot chassis, currently not powered. This is the highest-capacity edge compute unit in the inventory and has no carrier board wired. If a stereo camera were added, the Orin NX becomes the natural on-robot host for nvblox + cuVSLAM — a path the current Pi 5 cannot take. Tracking it here so that Phase 3+ planning starts from "we own a 100 TOPS robotics SOC" rather than re-deriving the hardware budget from scratch.

Map persistence and corruption recovery [HIGH]

Phase 1 SLAM builds the occupancy grid that Phase 2c annotates with semantic labels. The research describes building this map but not protecting it. What happens when slam_toolbox's serialized map is corrupted by a power loss mid-write? When the map diverges from reality after furniture rearrangement (Gap 12)? When the robot is carried to a new location and the prior map is now wrong? Map corruption is silent — the robot will navigate confidently into walls. Recovery requires map versioning, integrity checks, and a "map invalid" detection heuristic (e.g., lidar scan consistently disagrees with map prediction).
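The versioning and integrity-check half of that recovery story can be sketched with an atomic write plus a checksum sidecar: a power loss mid-write leaves the previous map intact, and corruption is detected on load rather than discovered mid-navigation. The file layout (`.meta` sidecar) is an assumption, not slam_toolbox's format.

```python
import hashlib
import json
import os
import tempfile

def save_map(path, map_bytes):
    """Atomically persist a serialized map with a checksum sidecar."""
    digest = hashlib.sha256(map_bytes).hexdigest()
    # Write to a temp file in the same directory, fsync, then rename.
    # os.replace is atomic on the same filesystem, so a power loss
    # mid-write never leaves a half-written map at `path`.
    tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(tmp_fd, "wb") as f:
        f.write(map_bytes)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    with open(path + ".meta", "w") as f:
        json.dump({"sha256": digest, "bytes": len(map_bytes)}, f)

def load_map(path):
    """Load a map, refusing to return silently corrupted data."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".meta") as f:
        meta = json.load(f)
    if hashlib.sha256(data).hexdigest() != meta["sha256"]:
        # Caller should fall back to the previous version or trigger re-mapping.
        raise ValueError("map corrupted: checksum mismatch")
    return data
```

The "map invalid" detection heuristic (lidar consistently disagreeing with map prediction) is a separate runtime check; this sketch only closes the storage-level corruption path.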

Dynamic obstacle tracking — people, pets, moving objects [HIGH]

The research treats obstacles as static ("nearest obstacle? chair/table/wall/door/person/none"). But in a home, a person walks through the frame at 1.5 m/s — 10x the robot's speed. A single-class "person" label tells the robot nothing about trajectory. Should it wait? Predict the path? Follow? The Waymo section explicitly covers MotionLM trajectory prediction for agents, then dismisses it as "not directly applicable (no high-speed agents in a home)." This is the most vulnerable sentence in the research: it is simply wrong. A 2-year-old child or a cat IS a high-speed agent in a home that moves faster than the robot can react at 1-2 Hz planning frequency.
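Even without MotionLM, a constant-velocity extrapolation would give the planner something a static "person" label cannot: whether the agent will enter the robot's path within the next planning cycle. The sketch below is illustrative only — the corridor representation, time step, and horizon are assumptions.

```python
def predict_position(pos, velocity, dt):
    """Constant-velocity extrapolation; pos and velocity are (x, y) tuples."""
    return (pos[0] + velocity[0] * dt, pos[1] + velocity[1] * dt)

def crosses_robot_path(agent_pos, agent_vel, robot_x_corridor,
                       horizon_s, step_s=0.1):
    """True if the agent enters the robot's corridor (an x-interval in the
    robot frame) at any sampled time within the horizon."""
    lo, hi = robot_x_corridor
    t = 0.0
    while t <= horizon_s:
        x, _ = predict_position(agent_pos, agent_vel, t)
        if lo <= x <= hi:
            return True
        t += step_s
    return False
```

A cat at 1.5 m/s starting 2 m to the side crosses a 0.4 m-wide corridor well inside a 2-second horizon — i.e., within two to four planning cycles at 1–2 Hz — which is exactly the case a velocity-free "person/pet" label renders invisible.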

Night and low-light operation [HIGH]

A home robot's most frequent use case is lights-off or dim-light navigation — fetching water at night, patrolling while the family sleeps. The VLM requires adequate illumination for scene classification and goal-finding. Below ~50 lux, VLM confidence drops dramatically and the hallucination rate rises. The research never mentions this. Solutions exist (IR illumination, a lidar-only fallback mode, an ambient light sensor gating the VLM trust weight) but none are discussed. This gap gives the described system a usable-hours ceiling of roughly 8am–10pm — exactly the opposite of when autonomous home navigation is most useful.
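The ambient-light-sensor gating idea mentioned above can be sketched as a simple trust ramp: zero VLM weight below a dark threshold (lidar-only mode), full weight when well lit, linear in between. The lux breakpoints are illustrative assumptions, not measured values.

```python
def vlm_trust_weight(lux):
    """Scale the VLM's influence in fusion from 0 (dark) to 1 (well lit).

    Below DARK_LUX the system should run lidar-only; above BRIGHT_LUX
    the VLM gets full weight. Breakpoints are assumed, to be tuned
    against measured hallucination rates at each light level.
    """
    DARK_LUX, BRIGHT_LUX = 10.0, 50.0
    if lux <= DARK_LUX:
        return 0.0
    if lux >= BRIGHT_LUX:
        return 1.0
    return (lux - DARK_LUX) / (BRIGHT_LUX - DARK_LUX)
```

The same weight can also gate the confidence-accumulation speed-up (Gap 2): a hallucination-prone low-light VLM should never be allowed to raise the speed cap.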

Battery management during exploration [HIGH]

The research describes autonomous exploration for map-building but never addresses the energy budget. The TurboPi with 4 batteries has a runtime of approximately 45–90 minutes under load (motors + Pi 5 + camera + lidar + WiFi). During Phase 2d embedding extraction, the VLM runs continuously on Panda — additional WiFi traffic increases Pi power draw. There is no power-aware path planning (prefer shorter routes when battery low), no return-to-charger trigger, and no low-battery ESTOP. A robot that runs out of power mid-room is worse than one that never moved — it becomes an obstacle itself.
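The two missing triggers — return-to-charger and low-battery ESTOP — can be sketched as a threshold check that accounts for the cost of getting home. The consumption constants are illustrative assumptions; real values would come from the Phase 1 logging spec.

```python
def battery_action(charge_pct, distance_home_m,
                   pct_per_meter=0.05, reserve_pct=10.0, estop_pct=5.0):
    """Decide the power-aware action for this planning cycle.

    pct_per_meter, reserve_pct, and estop_pct are assumed constants;
    they should be calibrated from logged runtime data.
    """
    if charge_pct <= estop_pct:
        # Stop in place before the robot becomes a dead obstacle itself.
        return "ESTOP"
    cost_home = distance_home_m * pct_per_meter
    if charge_pct - cost_home <= reserve_pct:
        # Leave for the charger while the trip is still affordable.
        return "RETURN_TO_CHARGER"
    return "CONTINUE"
```

Power-aware path planning (prefer shorter routes when low) then falls out naturally: scale edge costs by the inverse of the remaining margin `charge_pct - cost_home - reserve_pct`.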

Privacy implications of persistent spatial memory [MEDIUM]

Phase 2c/2d builds a semantically annotated map of the home — every room labeled, every piece of furniture positioned, camera embeddings indexed by location. This is a detailed surveillance record of domestic life. The research never mentions where this data is stored, who can access it, how long it persists, or whether guests consent to being observed and classified ("person" label in the obstacle classifier). For her-os specifically (a personal ambient intelligence system), the spatial memory intersects with conversation memory — the system knows both what was said AND where the robot was when it was said. This combination is more privacy-sensitive than either alone.

User onboarding and first-run experience [MEDIUM]

The research describes a system that requires Phase 1 SLAM to be deployed before Phase 2 can function. Phase 1 requires the robot to explore the entire home to build the map. Who drives the robot during this exploration? What does the user experience when the map is empty and navigation is impossible? The "evaluation framework" section specifies what data Phase 1 must log — but not how a non-technical user initiates the mapping process, monitors its progress, or recovers from a failed mapping run. The first-run experience determines whether users adopt the system or abandon it after the second session.

Acoustic localization as complementary signal [MEDIUM]

A home robot built around Annie's voice capabilities has access to an unused sensor: sound source localization. A person calling "Annie, come here" provides a bearing to the speaker that neither camera nor lidar can match at distance. Sound travels around corners and through walls. The research focuses entirely on visual and geometric perception — the acoustic dimension is completely absent. For her-os specifically, where the robot's primary purpose is conversational companionship, voice-directed navigation ("I'm in the kitchen") is a more natural interaction pattern than visual goal-finding and should be a first-class input to the planner.

Long-term map drift correction [MEDIUM]

SLAM drift is cumulative. After weeks of operation, the occupancy grid will have small errors that compound. slam_toolbox uses scan-matching for loop closure to correct drift, and Phase 2e adds AnyLoc visual confirmation. But neither the research nor the roadmap specifies a drift correction schedule: How often should the robot re-survey the home? What triggers a global re-localization? How are semantic labels migrated when the underlying occupancy grid is updated? The 6-month map becomes less reliable than the 1-week map — and the system has no mechanism to detect or correct this degradation.

Furniture rearrangement detection [MEDIUM]

Indian homes rearrange furniture frequently — seasonal, guests, festivals, daily prayer setups. The Phase 1 SLAM map bakes in the furniture layout at time of mapping. When a sofa moves 1 meter, the SLAM system will experience localization failures as the scan disagrees with the stored map. The research never describes how the system detects that a map region is stale vs. that the robot is lost. This gap connects directly to the map corruption gap (Gap 4) and the long-term drift gap — they share the same failure mode: the map is wrong and the system doesn't know it.

Multi-floor navigation [LOW for current hardware]

The TurboPi cannot climb stairs. This gap is correctly implicit — there is no stair-climbing mechanism, so multi-floor navigation is physically impossible. However, the research's silence is still meaningful: it never establishes the single-floor constraint explicitly, meaning a future implementer reading this document might attempt to path-plan across floors without realizing the physical impossibility. Explicit scope declarations matter as much as what is included.

Outdoor-to-indoor transition [LOW for current scope]

The research is implicitly scoped to indoor home navigation, but never states this boundary. The VLM's scene classifier ("kitchen/hallway/bedroom/bathroom/living/unknown") has no outdoor classes. If the robot is moved outdoors (courtyard, balcony), the SLAM map becomes invalid, the VLM scene labels become "unknown," and the lidar gets confused by vegetation and open space. Like multi-floor, the correct response is to state the boundary explicitly rather than leave it implicit.

Map sharing between robots [LOW for current scope]

The research implicitly assumes a single-robot home. If a household has two Annie units (future), should they share the occupancy grid? Share the semantic annotations? Share place embeddings? Shared maps create a 2x improvement in exploration coverage but require conflict resolution when two robots annotate the same cell with different labels at different times. This gap is low priority now but the architecture choice (centralized vs. per-robot map storage) made in Phase 1 will determine whether this is possible at all.

Emergency behavior — fire, smoke, medical alert [MEDIUM]

The research defines ESTOP as "absolute priority over all tiers" for obstacle collisions. But it never defines behavior for whole-home emergencies. If a smoke detector triggers, should the robot navigate to the nearest exit and wait there as a beacon? Alert family members via Telegram? The 4-tier architecture has no emergency tier above the strategic tier. For a home robot with spatial awareness, emergency wayfinding is a natural capability — and its absence means the most high-stakes scenario is also the least specified.

Glass and transparent surface handling [HIGH]

Glass doors, glass dining tables, and glass-fronted cabinets are common in Indian homes and are invisible to lidar (the laser passes through). The research's "fusion rule — VLM proposes, lidar disposes" fails here: lidar says "clear" (the laser returned nothing), VLM says "BLOCKED" (it can see the glass door), and the fusion rule discards the VLM's correct observation in favor of lidar's false negative. Glass surfaces are the one physical scenario where VLM must override lidar, but the research establishes no mechanism for this exception.

Cost-benefit analysis of each phase [HIGH]

The roadmap provides P(success) estimates but no P(worthwhile) estimates. Phase 2c (semantic map annotation) has P(success)=65% and requires 2–3 sessions of implementation. But what does success actually buy? The research never quantifies: How much does semantic map annotation improve navigation success rate? Does it reduce average path length? Reduce collision frequency? The evaluation framework in Part 7 defines metrics but never connects them to phase gates — there is no specification of "if metric X does not reach threshold Y, skip Phase Z." Each phase is treated as inherently worthwhile if it succeeds, which is not the same thing.

The 18-Gap Inventory: Fast Path vs. Slow Path

The research solves the fast path comprehensively. Multi-query VLM dispatch, temporal EMA smoothing, 4-tier hierarchical fusion, semantic map annotation, visual place recognition — every component of the nominal navigation pipeline is specified with concrete code entry points, hardware assignments, and probability estimates. The system works when everything goes right.

What the research never addresses is the slow path: what happens when something goes wrong. This is not an oversight — it is a conscious scope decision. Research papers optimize for the demonstration case, not the recovery case. But the 18 gaps in this inventory are precisely the slow path: hallucination recovery, map corruption, WiFi degradation, battery depletion, furniture rearrangement, emergency behavior. Each gap is a scenario where the fast path has already failed and the system needs to handle a situation its designers did not fully specify.

The single most consequential gap is camera-lidar extrinsic calibration (Gap 1). It is not mentioned anywhere in the document. Yet Phase 2c — semantic map annotation, the architectural centerpiece that makes Annie's navigation "intelligent" rather than just reactive — cannot function without it. When a VLM label is attached to a grid cell at "current pose," that attachment requires a known transform between the camera frame and the lidar/map frame. Without this transform, labels land in the wrong place. The calibration is a 2–4 hour process with physical targets and specialized software. It must be repeated if hardware moves. The research treats Phase 2c as having P(success)=65% — but the actual prerequisite list includes an unlisted item that blocks the entire phase.

The second most consequential gap is VLM hallucination recovery (Gap 2). The research introduces confidence accumulation as a feature — after 5 consistent VLM frames, the system increases speed. But confidence accumulation on a systematically wrong VLM output means the system accelerates toward the hazard it has been confidently misclassifying. There is no cross-check mechanism (VLM vs. lidar disagreement as hallucination signal), no degraded-mode fallback, and no recovery protocol. The lidar ESTOP will fire at 250mm, but by then the robot is already committed to a collision trajectory at elevated speed.

The glass surface problem (Gap 17) is architecturally interesting because it is the one physical scenario where the research's explicit fusion rule — "VLM proposes, lidar disposes" — produces the wrong answer. Lidar returns nothing through glass (false negative). VLM correctly identifies the glass door (true positive). The fusion rule silences the VLM in favor of lidar. A complete navigation system needs a sensor-disagreement classifier that can identify when lidar's "clear" signal is itself anomalous (e.g., no reflection at expected range → possible transparent surface), and route that signal to VLM for confirmation rather than treating lidar's null return as ground truth.
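The sensor-disagreement classifier described above can be sketched as a three-way decision over the two signals. The max-range convention for a null lidar return, the label vocabulary, and the escalation names are illustrative assumptions.

```python
LIDAR_MAX_RANGE_MM = 8000   # assumed: null/at-max return means "no reflection"

def classify_disagreement(lidar_mm, vlm_label):
    """Classify one cycle's camera-lidar disagreement.

    The key inversion of the "lidar disposes" rule: a null lidar return
    coinciding with a visual BLOCKED is itself anomalous and must not be
    treated as ground-truth "clear".
    """
    lidar_clear = lidar_mm is None or lidar_mm >= LIDAR_MAX_RANGE_MM
    if lidar_clear and vlm_label == "BLOCKED":
        # No reflection + visual obstacle: possible glass. VLM overrides.
        return "SUSPECT_TRANSPARENT_SURFACE"
    if not lidar_clear and vlm_label == "CLEAR":
        # Solid return + visual clear: possible VLM hallucination. Lidar wins.
        return "SUSPECT_VLM_HALLUCINATION"
    return "AGREE"
```

Note the symmetry: the same classifier that protects against glass (Gap 17) also produces the hallucination signal needed for Gap 2 — the two exceptions to "lidar disposes" are the two off-diagonal cells of this decision table.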

Three gaps — dynamic obstacle tracking (Gap 5), acoustic localization (Gap 10), and emergency behavior (Gap 16) — are gaps of ambition, not just implementation. The research deliberately stays within the space of what is achievable with current hardware. A child running through the frame, a voice calling from the kitchen, and a smoke alarm triggering are all events that require capabilities beyond the 4-tier architecture as specified. The architecture has no provision for agent trajectory prediction, no audio input channel, and no emergency escalation tier. These are not bugs — they are scope decisions. But each scope decision, left implicit, becomes an assumption that a future implementer will violate.

The Meta-Gap: No Owned-Hardware Audit

The most structurally revealing gap is not in the checklist — it is in how the checklist was generated. The original 18 gaps were derived by reading the research and asking "what failure modes are unaddressed?" They were not derived by first cataloguing what compute Annie already owns and asking "which of these assets does the design use, and which does it leave idle?"

The session 119 hardware audit (2026-04-16) surfaced three dormant assets — a 26 TOPS Hailo-8 AI HAT+ on the Pi 5, a second DGX Spark ("Beast") with 128 GB unified memory sitting workload-idle since 2026-04-06, and an Orin NX 16GB (100 TOPS, Ampere) owned but not yet on a carrier board. None of these appeared in the 4-tier architecture. Gap 3 (WiFi fallback) was framed as an unsolved problem for months; the Hailo-8 had been on the robot the entire time, capable of running YOLOv8n at 430 FPS with zero WiFi dependency, validated for this exact dual-process pattern by an IROS paper reporting a 66% latency reduction.

The gap was not technical — it was procedural. When the design phase does not begin with an inventory pass over owned hardware, proposed workloads land on new acquisitions while existing accelerators idle. This is the meta-gap: the absence of the audit step that would have prevented half the listed gaps from being listed at all. It is tracked as INV-1/2/3 in the checklist above not because those items are "gaps" in the narrative sense, but because their non-use is the most common unacknowledged gap class in any multi-node system.

Nova — Sharpest Observation

The research's most confident sentence is its most vulnerable: "no high-speed agents in a home." This dismissal of dynamic obstacle prediction occurs in the Waymo section to explain why MotionLM doesn't translate. It is immediately followed by the robot's obstacle classifier, which has a "person" category treated identically to "chair" — a static label with no velocity or trajectory. A 2-year-old child moves at 0.8 m/s. A cat moves at 1.5 m/s. At the pipeline's 1–2 Hz planning frequency, a cat covers 0.75–1.5 meters between planning cycles — more ground than the robot can react to. The sentence that dismissed trajectory prediction is the same sentence that guaranteed the robot will someday corner a pet or block a toddler's path without any mechanism to predict or avoid it. The gap is not that trajectory prediction is missing — it's that the research argued it wasn't needed.

Dormant owned hardware is the most common unacknowledged gap class. The session 119 hardware audit (2026-04-16) surfaced a 26 TOPS Hailo-8 on the Pi 5, an idle 128 GB Beast DGX Spark, and an Orin NX 16GB (100 TOPS) still in a box — none of them referenced by the 4-tier architecture. Gap 3 (WiFi fallback) was treated as unsolved for months while a 430 FPS local obstacle detector sat on the robot's own PCIe lane. An IROS paper (arXiv 2601.21506) had already validated the dual-process pattern and reported a 66% latency reduction. The gap was not a missing capability — it was a missing inventory audit. Whenever a design proposes a new accelerator, the first question should be: which owned accelerator is this replacing, and if none, why is the existing one idle?

Think — Highest-Leverage Gap to Close

Close Gap 1 (camera-lidar calibration) before starting Phase 2c implementation. The calibration procedure takes 2–4 hours. Skipping it produces a system that appears to work — labels attach to cells, rooms accumulate annotations — but every label is spatially offset by the uncalibrated transform. This creates a subtle correctness bug that will not manifest in unit tests or simulation but will cause the robot to navigate toward where the VLM thinks the goal is, which is not where the goal actually is. The fix is Kalibr or a simplified hand-measurement approach (measure the physical offset between camera optical axis and lidar center, encode as a static TF transform). Document the calibration values in the SLAM config. Treat it as a physical constant, not a software parameter.
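The hand-measurement fallback reduces to a fixed 2D rigid transform from the camera frame to the lidar frame. The sketch below shows the math only (not the ROS2 TF API), and the offset constants are placeholders — they must be replaced with the values actually measured on the robot.

```python
import math

# ASSUMED hand-measured constants (meters / radians) — placeholders only.
# Measure on the physical robot, record in the SLAM config, and treat as
# a physical constant: re-measure whenever camera or lidar is remounted.
CAM_TO_LIDAR_DX = 0.03     # e.g., camera 3 cm forward of the lidar center
CAM_TO_LIDAR_DY = 0.00
CAM_TO_LIDAR_YAW = 0.0     # optical axis aligned with lidar zero angle

def camera_to_lidar(x_cam, y_cam):
    """Transform a point from the camera frame into the lidar frame.

    Standard 2D rigid transform: rotate by the mounting yaw, then
    translate by the measured offset.
    """
    c, s = math.cos(CAM_TO_LIDAR_YAW), math.sin(CAM_TO_LIDAR_YAW)
    x = c * x_cam - s * y_cam + CAM_TO_LIDAR_DX
    y = s * x_cam + c * y_cam + CAM_TO_LIDAR_DY
    return x, y
```

Every Phase 2c label attachment should pass through this transform (or its Kalibr-derived replacement) before `world_to_cell` indexing; with identity values the spatial offset bug described above is silently baked into every annotation.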

Close Gap 2 (VLM hallucination recovery) before enabling confidence-based speed modulation. Add a cross-validation check: if VLM reports "CLEAR" and lidar reports obstacle <400mm, treat the VLM output as suspect, reduce confidence to zero for that cycle, and do not increase speed. Log VLM-lidar disagreement events as a new metric. After 100 disagreement events, analyze the distribution — if VLM is right more often than lidar (e.g., glass surfaces), recalibrate the fusion weights. If lidar is right more often, the VLM prompt needs revision.
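The cross-validation check proposed above is small enough to sketch in full: a lidar return under 400mm vetoes a VLM "CLEAR", zeroes that cycle's confidence (blocking any speed increase), and logs the disagreement for the 100-event distribution analysis. The data shapes and the log structure are illustrative assumptions.

```python
DISAGREE_MM = 400           # from the proposal above: lidar veto threshold
disagreement_log = []       # accumulate events for offline fusion-weight analysis

def validate_vlm(vlm_label, vlm_confidence, lidar_mm, cycle):
    """Cross-validate one cycle's VLM output against lidar.

    Returns (trusted_confidence, speed_increase_allowed). A suspect
    VLM output contributes zero confidence and never raises speed.
    """
    if vlm_label == "CLEAR" and lidar_mm is not None and lidar_mm < DISAGREE_MM:
        disagreement_log.append({"cycle": cycle, "lidar_mm": lidar_mm,
                                 "vlm": vlm_label, "conf": vlm_confidence})
        return 0.0, False
    return vlm_confidence, True
```

After 100 logged events, the analysis described above decides which sensor the distribution favors — recalibrate fusion weights if the VLM was right (glass surfaces), revise the VLM prompt if lidar was.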