LENS 10 — FAILURE PRE-MORTEM

"It's October 2026 and this failed. What happened?"

THE TIMELINE

April 2026. Phase 2a deploys. The multi-query pipeline is live: 29 Hz goal tracking, 10 Hz scene classification, 58 Hz throughput intact. Annie navigates to the kitchen and finds Mom's tea. The team is optimistic.

May 2026. Pre-monsoon humidity rises. Neighbors' routers add congestion. VLM inference round-trip time climbs from 18 milliseconds to 35–90 milliseconds on roughly 8% of frames. The NavController's timeout fires silently — the robot freezes mid-corridor, resumes after reconnect. The team notes it in a comment but ships no fix. "It usually recovers." No fallback behavior exists. The fast path was engineered to 1-millisecond precision. The failure path was never designed at all.

Partial mitigation, deployed in April: the Hailo-8 L1 safety layer runs YOLOv8 nano at 430 FPS locally on the Pi 5, with zero WiFi dependency. The safety path no longer freezes — Annie still avoids obstacles during brownouts. But the semantic queries — where is the kitchen, what room is this — still degrade silently when VLM round-trip time spikes. The robot keeps moving. It just stops understanding. Mom experiences this as Annie wandering, rather than Annie frozen. A different failure, not a solved one.

June 2026 — INCIDENT ONE. Mom's bedroom has a floor-to-ceiling glass sliding door left partially open at 45 degrees. Annie approaches at 1 meter per second. The VLM reports "CLEAR" — the glass is transparent, the camera sees the room beyond. The lidar beam strikes the door at a glancing angle below the reflectance threshold and returns nothing. The safety rule — "VLM proposes, lidar disposes" — assumes at least one sensor is correct. Both are wrong simultaneously. ESTOP fires at 80 millimeters. Too late. Annie hits the door frame at reduced speed, knocking it off its track. Mom is shaken. No injury. But trust is damaged.
The temporal smoothing had 14 consecutive confident "CLEAR" readings — it amplified the error rather than catching it.

July 2026. The Pico RP2040 drops to REPL during a long navigation session — a known failure mode requiring manual soft-reboot. Without IMU heading, the EKF diverges within 90 seconds. The SLAM map accumulates ghost walls. Three days of room-label training data are corrupted. The map must be rebuilt from scratch. No watchdog or auto-recovery was ever implemented.

August 2026 — INCIDENT TWO. Monsoon peak. WiFi drops 15–20% of frames during the 7-to-9pm window when Mom most wants Annie's help. Annie freezes in the hallway, blocking passage. When it resumes, it has lost goal context and asks: "Where would you like me to go?" After the third freeze in one evening, Mom stops calling Annie. She doesn't complain. She simply stops. The team doesn't notice for two weeks because the dashboard shows a 94% navigation success rate — averaged over all 24 hours, not the evening window. The metric was right. The window was wrong.

September 2026. Phase 2c stalls. Semantic map annotation requires stable SLAM as its pose ground truth. But SLAM is still fragile. The Zenoh fix from session 89 was never deployed. Phase 2c cannot start. Phase 2d cannot start without 2c. Phase 2e cannot start without 2d. Three of five Phase 2 sub-phases are gated behind a prerequisite that is itself gated behind another prerequisite. The roadmap looked like a directed graph. It was actually a single chain.

Also September. SigLIP 2 requires 800 megabytes of VRAM. The E2B VLM already uses 1.8 gigabytes. Panda's GPU has 4 gigabytes total. The two models cannot coexist. Phase 2d — embedding extraction, place recognition, visual loop closure — is shelved. The perception architecture loses its memory layer before it was ever built.

October 2026. The decision is made to route VLM inference to Titan over the home LAN. "Too many moving parts on Panda."
This is exactly the architectural bet the research identified as the risk: if WiFi is unreliable, making it the critical transport makes things worse. The pivot does not solve the glass door problem, the IMU crash, or the prerequisite chain. Six months of edge-first infrastructure work is partially undone in one decision made under time pressure.

2027. THE PAPERWEIGHT. An Orin NX 16-gigabyte module is purchased mid-2027 as a future upgrade path to run Isaac Perceptor — nvblox and cuVSLAM — locally. The module ships in a tray. The carrier board is a separate SKU from a different vendor with a four-to-eight-week lead time. No one orders it. The module sits in a drawer for six months. By the time the carrier arrives, DGX Spark and Panda already handle the workload. The stereo camera that cuVSLAM requires has still not been purchased either. The hardware is not wrong. The bill-of-materials discipline is. One missing 200-dollar part turns a 600-dollar module into a paperweight. Buying into an ecosystem before verifying the full chain works end to end is its own failure mode.

THE KEY INSIGHT

We built the fast path. We forgot the slow path entirely. The research is meticulous about the 58 Hz throughput, the 18-millisecond latency, the 4-tier fusion architecture. These numbers are correct. But the research contains zero specification for what happens when any of them degrades.

What does Annie do when VLM inference times out? The research doesn't say. What does Annie do when the SLAM map diverges? The research doesn't say. What does Annie do when the IMU drops to REPL? The research says "known failure mode" and moves on.

The boring failure, not the interesting one. The system did not fail because the VLM architecture was wrong. It failed because WiFi dropped 8–15% of frames during the hours when the system was most used.
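The missing slow path fits in a few lines. Nothing below comes from the project: `VlmCircuitBreaker`, `Mode`, and every threshold are hypothetical, a minimal sketch of a timeout-aware fallback that trips to lidar-only navigation after repeated slow or dropped VLM round trips and probes the fast path again after a cool-down.

```python
import time
from enum import Enum


class Mode(Enum):
    SEMANTIC = "semantic"      # full pipeline: VLM answers semantic queries
    LIDAR_ONLY = "lidar_only"  # degraded: geometric navigation only


class VlmCircuitBreaker:
    """Trips to lidar-only mode after consecutive slow or dropped VLM calls.

    All names and thresholds are illustrative, not from the original design.
    """

    def __init__(self, timeout_s=0.030, trip_after=3, cooldown_s=30.0):
        self.timeout_s = timeout_s    # per-call budget (nominal RTT is 18 ms)
        self.trip_after = trip_after  # consecutive misses before tripping
        self.cooldown_s = cooldown_s  # how long to stay degraded
        self.misses = 0
        self.mode = Mode.SEMANTIC
        self.tripped_at = 0.0

    def record(self, rtt_s, now=None):
        """Report one VLM round trip (None = dropped frame); return the mode."""
        now = time.monotonic() if now is None else now
        if self.mode is Mode.LIDAR_ONLY:
            if now - self.tripped_at < self.cooldown_s:
                return self.mode  # stay degraded until the cool-down ends
            self.mode, self.misses = Mode.SEMANTIC, 0  # probe the fast path again
        if rtt_s is None or rtt_s > self.timeout_s:
            self.misses += 1
            if self.misses >= self.trip_after:
                self.mode, self.tripped_at = Mode.LIDAR_ONLY, now
        else:
            self.misses = 0
        return self.mode
```

Fed once per frame with the measured round-trip time (or `None` for a dropped frame), three consecutive misses degrade the mode instead of freezing the robot mid-corridor, which is one possible answer to the question the research left open.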
The research spends three pages on AnyLoc loop closure — probability of success: 50%, multi-session effort — and zero words on "what happens when the 18-millisecond VLM call takes 90 milliseconds." The effort allocation was exactly backwards from what the deployment needed.

The glass door failure is epistemically different. Glass is not random noise. Every frame through glass is consistently "CLEAR." The temporal smoothing was designed to filter random hallucinations. It amplifies systematic ones. This is the unknown unknown: a safety rule with a hidden premise — "at least one sensor is truthful" — that glass removes.

What the team wishes they'd built differently: graceful degradation first, throughput optimization second. A WiFi circuit breaker that switches to lidar-only mode and says "I'm navigating carefully — my eyes are slow right now." Glass catalogued as a named hazard class during setup, not discovered during navigation. An IMU watchdog automated on day one. And a per-user, per-hour dashboard that would have caught the 7-to-9pm degradation in the first week — before Mom formed the habit of not asking.

CROSS-LENS CONNECTIONS

This lens connects to Lens 4, which identified the WiFi cliff edge at 100 milliseconds. That lens correctly flagged the risk. This lens shows what happens when the flag is not acted on.

It connects to Lens 13, which covers Mom's real-world usage patterns and trust dynamics. The 94% success rate masked a 75% success rate during the window that mattered to her. Metrics aggregated across time hide time-varying failures.

It connects to Lens 21, which covers the voice-to-ESTOP gap — Mom's inability to say "Stop!" and have Annie respond within 5 seconds. The glass door incident is the concrete realization of that risk. ESTOP fired at 80 millimeters. The gap between sensor blind spot and physical safety margin was smaller than designed.
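The aggregation failure that hid the evening window is a few lines of code to fix. A minimal sketch, assuming nothing about the project's telemetry: the `hourly_success` helper and its input shape are hypothetical.

```python
from collections import defaultdict


def hourly_success(events):
    """events: iterable of (hour_of_day, succeeded) pairs.

    Returns {hour: success_rate}. A single rate averaged over all 24
    hours can look healthy while one window is failing; bucketing by
    hour exposes the bad window.
    """
    buckets = defaultdict(lambda: [0, 0])  # hour -> [successes, attempts]
    for hour, ok in events:
        buckets[hour][1] += 1
        if ok:
            buckets[hour][0] += 1
    return {hour: s / n for hour, (s, n) in buckets.items()}
```

With illustrative numbers mirroring the incident: 20 daytime runs that all succeed plus 8 evening runs with 2 failures give an overall rate above 90%, while the 7pm bucket reads 75% — the masking described above, visible in the first week instead of after two.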