LENS 25

Blind Spot Scan

"What's invisible because of where you're standing?"

LANGUAGE

The VLM Speaks English

The entire semantic layer — room labels, navigation goals, obstacle names — lives in English. This home speaks Hindi. "Pooja ghar mein jao" ("go to the pooja room") is not a parseable goal. The VLM cannot read Devanagari text on a medicine bottle, a calendar, or a door sign. The spatial vocabulary of the house (including Mom's voice commands) is not the language the model was trained on.

SPATIAL GRAMMAR

Western Floor Plans

Waymo, Tesla, VLMaps, OK-Robot — every cited reference was developed in wide-corridor, Western-layout spaces. Indian homes routinely have 60–70cm passages between furniture, floor-level seating (gadda, takiya), rangoli patterns that confuse floor-texture segmentation, shoes piled at every threshold, and a pooja room with no Western equivalent. The robot was designed for the hallways in the papers, not the hallways in the house.

PERSONA

Mom Is Not a Beta Tester

The research author is the engineer, and the robot's primary mental model is his. Mom — the person who will interact with Annie most — appears only in the goal phrase "bring tea to Mom." She has no voice in the prompt design, no role in the evaluation framework, and no mechanism to correct the robot when it fails. The system is built to satisfy the engineer's definition of success, which may be orthogonal to Mom's.

INFRASTRUCTURE

WiFi as Given

The entire 4-tier architecture routes every VLM inference call from Pi (robot) to Panda (192.168.68.57) over WiFi — a channel that Lens 04 identified as the single cliff-edge parameter. What happens during a power cut? During monsoon interference? During a neighbor's router broadcast storm? The research has no offline-degradation path. The robot cannot navigate at all without the 18 ms Panda VLM response, which requires WiFi, which in turn requires mains power.

LIGHTING

All Testing in Daytime

Session logs, SLAM maps, and VLM evaluation all occurred under normal ambient light. Indian households face load-shedding (scheduled outages), tube-light flicker (40–60Hz interference patterns on monocular cameras), and the transition from daylight to a single incandescent bulb in one room while adjacent rooms go dark. The VLM scene classifier trained on ImageNet-scale indoor datasets has not been evaluated on these lighting regimes. Room classification accuracy at 11pm under load-shedding lighting is completely unknown.

MODALITY

Camera as the Only Eye

The research treats camera-primary as a baseline constraint, but it is actually a choice that was never examined. Rooms in a home have acoustic signatures: the kitchen has exhaust fan noise, the bathroom has reverb, the living room has the TV. Touch at the chassis level already carries information — floor texture, door thresholds, carpet edges. These signals require no GPU, no WiFi, no VLM inference. The research never asks why it chose camera-first rather than sensor-first.

IDLE HARDWARE

We Own a 26 TOPS NPU We Aren't Using

The Hailo-8 AI HAT+ has been installed on the Pi 5 for months. It sits two inches from the camera ribbon cable. It can run YOLOv8n at 430 FPS with <10 ms latency and zero WiFi dependency. The research spent dozens of sessions routing every obstacle-detection frame over WiFi to Panda's RTX 5070 Ti (18–40 ms plus the jitter cliff, Lens 04), while the 26 TOPS NPU on the same board as the camera stayed idle. This is the canonical "missed what we owned" blind spot: the architecture diagrams never listed the Hailo in the inventory, so it was never in the design space. The IROS dual-process paper (arXiv 2601.21506, 66% latency reduction) describes exactly the L1-reactive / L2-semantic split that Hailo-on-Pi plus VLM-on-Panda would provide essentially for free.
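That split can be sketched as two nested decisions: a local L1 verdict that always runs and gates motion, and a best-effort L2 semantic call that may time out. A minimal sketch, assuming hypothetical function names and return shapes (hailo_detect, panda_vlm_query) that stand in for the real inference paths:

```python
# Sketch of an L1-reactive / L2-semantic control step. Both inference
# functions are illustrative stand-ins: hailo_detect for local YOLOv8n on
# the NPU, panda_vlm_query for the WiFi round-trip to the VLM server.
def hailo_detect(frame):
    """L1: local detection on the Hailo-8 (fast, WiFi-free)."""
    return {"obstacle": False, "latency_ms": 8}  # placeholder verdict

def panda_vlm_query(frame, goal):
    """L2: semantic VLM call over WiFi; may time out (the Lens 04 cliff)."""
    raise TimeoutError("WiFi down")  # simulate a network outage

def control_step(frame, goal):
    l1 = hailo_detect(frame)  # L1 always runs and gates every motion
    if l1["obstacle"]:
        return "stop"
    try:
        return panda_vlm_query(frame, goal)  # L2 is best-effort guidance
    except (TimeoutError, ConnectionError):
        return "wander"  # degrade to safe local wander, not to a brick

print(control_step(frame=None, goal="kitchen"))  # -> wander
```

With the network up, the same loop returns the VLM heading; the point is that losing L2 never removes the L1 safety verdict.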

PROCESS

The Audit Pattern Never Asked "What Do We Own?"

Across 26 lenses of self-critique, not one asked: what hardware does the user already own that does not appear in the architecture diagrams? Asked once, that question surfaces the Hailo-8 NPU (26 TOPS, idle for nav), the Beast — a second DGX Spark with 128 GB unified memory, always-on, idle workload — and the Orin NX 16GB (100 TOPS, reserved for a future robot but available for ahead-of-time experimentation). Three pieces of compute capable of transforming the nav stack were invisible because the review process started from the drawn system, not the owned system. This is a meta-blind-spot: the research checklist reviewed everything on the diagram and nothing off it. The fix is a one-line addition to every future audit: "list every powered device in the house; explain why each is or isn't in the diagram."

Session 119 validated this lens in the most literal way possible: the single highest-impact architectural finding of the session was a blind-spot that became visible only because a targeted hardware-audit pass forced a full inventory of powered devices. The Hailo-8 AI HAT+ had been on the Pi 5 for months. Every nav-tuning document, every latency budget, every WiFi cliff-edge diagnosis (Lens 04) was drawn on a canvas that did not include it. The research author was standing inside a pipeline whose architecture-of-record omitted a 26 TOPS accelerator sitting on the same bus as the camera. That is the exact structure this lens predicts — a blind spot is not ignorance, it is position. From the seat of "Pi sensors go to Panda VLM," the Hailo is invisible. From the seat of "list every chip in the house," it is the obvious L1 safety layer. Session 119 is the clean case: the lens's question works.

The language blind spot is the most structurally load-bearing of the eight. It is invisible from the engineer's position because the engineer thinks in English, writes prompts in English, and evaluates results in English. The VLM prompt says "Where is the kitchen?" not "rasoi kahaan hai?" — but Mom, the actual end user, might say the latter. This creates a three-way mismatch: Mom's voice command (Hindi) must be transcribed (STT layer), translated or reframed (invisible middleware), then expressed as an English goal phrase that the VLM can semantically anchor. The research has no such middleware. The Annie voice agent (Pipecat + Whisper) uses an English-primary STT pipeline. Whisper handles Hindi adequately, but the semantic navigation layer downstream expects English room-type tokens — "kitchen," "bedroom," "bathroom" — tokens that appear verbatim in the research's Capability 1 scene classifier. If Mom says "pooja ghar," the scene classifier has no bucket for it. The room will be labeled "unknown," the SLAM map will never annotate it correctly, and language-guided navigation to that room becomes permanently impossible.
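The missing middleware is small. A minimal sketch, assuming romanized Whisper transcripts and an invented Hindi-to-token vocabulary (the phrase list and function name are illustrative, not the research's):

```python
# Sketch of the missing goal-phrase middleware: map Hindi room words
# (romanized, as Whisper might transcribe them) onto the English
# room-type tokens the semantic layer expects. Vocabulary is illustrative.
HINDI_TO_ROOM_TOKEN = {
    "rasoi": "kitchen",
    "sona kamra": "bedroom",
    "gusalkhana": "bathroom",
    "baithak": "living room",
    "pooja ghar": "pooja room",  # a token the classifier would also need
}

def normalize_goal(utterance: str) -> str:
    """Return an English room token, or 'unknown' if no phrase matches."""
    text = utterance.lower()
    # Longest phrase first, so "pooja ghar" wins over any shorter overlap.
    for phrase in sorted(HINDI_TO_ROOM_TOKEN, key=len, reverse=True):
        if phrase in text:
            return HINDI_TO_ROOM_TOKEN[phrase]
    return "unknown"

print(normalize_goal("Rasoi mein jao"))       # -> kitchen
print(normalize_goal("Pooja ghar mein jao"))  # -> pooja room
```

A lookup table is only the first rung; the point is that even this rung does not exist today, so every Hindi goal currently falls through to "unknown."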

The spatial grammar blind spot compounds the language one. Indian homes are not smaller versions of Western ones — they are structurally different. Floor-level living (gadda, floor cushions, low charpais) means a robot navigating at 13cm chassis height will have its sonar constantly triggered by objects that a Western-layout robot would never encounter at that height. Rangoli and kolam floor patterns are specifically designed to be visually striking — they will produce strong floor-texture signals that a VLM-based path classifier trained on hardwood and tile floors will misread as obstacles or clutter. The pooja room, which is a fundamental spatial anchor in tens of millions of Indian homes, does not appear in any of the research's room taxonomy lists. The VLM's training distribution almost certainly contains no examples. This is not a missing feature — it is a category that does not exist in the model's world.

Mom's invisibility as a design actor is the deepest blind spot because it is the most human one. The research is technically sophisticated: it cites Waymo, Tesla, VLMaps, AnyLoc, and OK-Robot. But it mentions Mom only as a delivery destination. She appears as a waypoint, not as a person with preferences, tolerances, and failure modes of her own. Would she find a robot silently approaching from behind alarming? Does she need it to announce itself in Hindi? Does she know that "ESTOP" is a concept? The evaluation framework (Part 7 of the research) defines metrics — ATE, VLM obstacle accuracy, navigation success rate — that are all defined from the engineer's vantage point. None of them measure whether Mom found the interaction comfortable or whether she was able to correct the robot when it made a mistake. A system optimized entirely on engineer-defined metrics can achieve high scores while remaining unusable by its actual primary user.

The WiFi and lighting blind spots are invisible because the development environment is unusually stable. Testing happens when the engineer is present, which is also when lights are on, WiFi is active, and the household is in its daytime configuration. Lens 04 already identified WiFi as the single cliff-edge parameter — below 100ms the system is stable, above it the system collapses. But load-shedding does not just affect WiFi: it takes down the entire network including the Panda inference server. The robot becomes a brick at exactly the moments when having an intelligent household assistant would be most useful. The Hailo-8 discovery sharpens the remedy — once L1 obstacle detection runs locally on the Pi's NPU, loss of WiFi degrades capability from "full semantic nav" to "safe local wander," not from "driving" to "brick." The blind spot is the same; the fix was sitting on the board the whole time.

The camera-first assumption is the most intellectually interesting blind spot because it was never a deliberate decision — it was inherited from the research corpus. Waymo, Tesla, VLMaps, and AnyLoc all use cameras. So Annie uses a camera. But an outside observer — say, a deaf-blind person's assistive device designer — would immediately ask: what other signals does this environment emit? The kitchen emits smell, heat, and fan noise. The bathroom emits humidity and reverb. The living room emits television audio. A robot that listens for a few seconds before navigating would classify rooms with high reliability using $2 of microphone hardware, no GPU inference, and no WiFi. The camera solves a hard problem (visual scene understanding) when easier signals are available. The engineer's training makes camera-based vision feel like the natural starting point. An outsider would find this choice puzzling.
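The acoustic idea can be sketched with two cheap features and a nearest-profile match. The profile numbers below are invented for illustration; real signatures would be recorded from a few seconds of microphone audio per room:

```python
import math

# Illustrative acoustic room signatures: (zero-crossing rate, RMS energy).
# These numbers are made up; real ones would be measured per room.
ROOM_PROFILES = {
    "kitchen":     (0.30, 0.40),  # broadband exhaust-fan hiss
    "bathroom":    (0.10, 0.15),  # quiet, reverberant
    "living room": (0.20, 0.60),  # TV audio, loud and structured
}

def features(samples):
    """Zero-crossing rate and RMS energy of a mono sample buffer."""
    zcr = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    ) / max(len(samples) - 1, 1)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return zcr, rms

def classify_room(samples):
    """Nearest-profile match in feature space: no GPU, no WiFi, no VLM."""
    zcr, rms = features(samples)
    return min(
        ROOM_PROFILES,
        key=lambda r: (ROOM_PROFILES[r][0] - zcr) ** 2
                      + (ROOM_PROFILES[r][1] - rms) ** 2,
    )

# Synthetic "exhaust fan" buffer: rapid alternation, moderate amplitude.
fan = [0.4 * (-1) ** i for i in range(1000)]
print(classify_room(fan))  # -> kitchen
```

Two scalar features will not match a VLM's accuracy, but they run on a Pi Zero in the dark during a power cut, which is exactly the regime where the camera pipeline fails.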

The process blind spot is the one that enables the others. Twenty-six lenses of critique could not see the idle Hailo because none of them asked "what is in the room that is not in the diagram?" The Hailo, the Beast (second DGX Spark, 128 GB, always-on, idle workload), and the Orin NX 16GB (100 TOPS, reserved) are all un-drawn compute. A one-line audit step — list every powered device in the house and state whether it is in the diagram — would have surfaced them. That is the meta-fix this lens produces: don't just scan for what's blind, scan for what's un-drawn.
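The one-line audit step is trivially executable as a set difference between the owned inventory and the drawn diagram. The device strings below just echo the ones named in the text; the helper name is mine:

```python
# Make the audit question executable: diff the powered devices in the
# house against the devices in the architecture diagram.
OWNED = {"Pi 5", "Panda (RTX 5070 Ti)", "Hailo-8 AI HAT+",
         "Beast (DGX Spark, 128 GB)", "Orin NX 16GB"}
DRAWN = {"Pi 5", "Panda (RTX 5070 Ti)"}

def undrawn_compute(owned, drawn):
    """Every powered device that never made it onto the diagram."""
    return sorted(owned - drawn)

for device in undrawn_compute(OWNED, DRAWN):
    print(device)  # the three invisible substrates surface immediately
```

The value is not the set arithmetic; it is that the inventory on the left-hand side gets written down at all.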

Session 119 is the canonical Blind Spot Scan success story. The Hailo-8 AI HAT+ — 26 TOPS, on the Pi, idle for navigation — was the highest-impact discovery of the session. It was invisible for months not because anyone hid it but because the architecture-of-record did not list it. Once listed, YOLOv8n at 430 FPS with <10 ms latency and no WiFi dependency becomes the obvious L1 safety layer, turning Lens 04's WiFi cliff from a "brick" failure mode into a graceful degradation to "safe local wander."

Audit the owned system, not just the drawn system. The highest-leverage process change is adding one line to every architecture review: list every powered device in the house; explain why each is or isn't in the diagram. That single question surfaces the Hailo, the always-on idle Beast (128 GB unified memory), and the dormant Orin NX — three compute substrates that the 26-lens audit could not see because it started from the diagram instead of the house.

Camera-first is inherited, not chosen. The research corpus is vision-centric so the system is vision-centric. An acoustic room classifier using microphone input costs $2 of hardware, requires no GPU, and works in the dark during a power cut — the exact scenario where the camera-first architecture becomes a brick.

If Mom replaced Rajesh as the system's primary evaluator for one week, what would be the first three things she would report as broken?


First: the robot cannot understand Hindi goals. "Rasoi mein jao" ("go to the kitchen") produces no navigation because the VLM semantic layer has no Hindi vocabulary and the goal-parsing middleware was never built. Second: the robot does not announce itself before entering a room, which is alarming when you are not watching it. Annie's voice agent can speak but has no protocol for room-entry announcements — the research treats proximity only as an ESTOP trigger, not as a social cue. Third: the robot stops working entirely during load-shedding, which happens regularly, and there is no graceful degradation mode — no cached last-known map, no simple obstacle avoidance without WiFi, no acoustic-only fallback. These three failures are invisible to the engineer's evaluation framework because none of them appears in any of the Part 7 metrics.