LENS 10

Failure Pre-mortem

"It's October 2026 and this failed. What happened?"

APR 2026

Phase 2a deployed — team optimistic

Multi-query pipeline live. 29 Hz goal tracking + 10 Hz scene classification. 58 Hz throughput intact. Annie successfully navigates to kitchen, finds Mom's tea. Internal Slack: "this is working better than expected."

MAY 2026

WiFi instability begins — dismissed as transient

Pre-monsoon humidity rises. Neighbors' routers add 2.4 GHz congestion. VLM inference RTT to Panda climbs from 18ms to 35–90ms on roughly 8% of frames. The NavController's 200ms command timeout fires silently — robot freezes mid-corridor, resumes after reconnect. Team notes it in a comment but ships no fix: "it usually recovers." No fallback behavior exists. The fast path was engineered to 1ms precision; the failure path was never designed at all.

Partial mitigation (deployed APR 2026): Hailo-8 L1 safety layer runs YOLOv8n at 430 FPS locally on Pi 5 (zero WiFi dependency). The safety path no longer freezes — Annie still avoids obstacles during brownouts. But L2/L3 semantic queries ("where is the kitchen?", "what room is this?") still degrade silently when VLM RTT spikes. The robot keeps moving; it just stops understanding. Mom experiences this as "Annie is wandering" rather than "Annie is frozen" — a different failure, not a solved one.

JUN 2026 — INCIDENT 1

Glass door collision — both sensors wrong simultaneously

Mom's bedroom has a floor-to-ceiling glass sliding door left partially open at 45°. Annie approaches at 1 m/s. VLM reports "CLEAR" — the glass is transparent, the camera sees the room beyond. The lidar beam strikes the door at a glancing angle (below the reflectance threshold) and produces no return. The "VLM proposes, lidar disposes" safety rule assumes at least one sensor is correct. Both are wrong simultaneously. ESTOP fires at 80mm — too late. Annie hits the door frame at reduced speed, knocking it off its track. Mom is shaken. No injury, but trust is damaged. The temporal smoothing (EMA filter) had accumulated 14 consecutive confident "CLEAR" readings — it amplified the error rather than catching it.

JUL 2026

IMU REPL crash corrupts SLAM map — localization lost for 3 days

Pico RP2040 drops to REPL during a long navigation session (known failure mode, requires manual Ctrl-D soft-reboot). Without IMU heading, EKF diverges within 90 seconds. slam_toolbox accumulates ghost walls. The occupancy grid — which Phase 2c semantic annotation was being built on top of — becomes unusable. Three days of room-label training data are corrupted. The map must be rebuilt from scratch. Phase 2c rollout is delayed 3 weeks. This is the second time a Pico REPL crash has blocked a milestone; no watchdog or auto-recovery was ever implemented.
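A watchdog along these lines would turn the manual Ctrl-D into automatic recovery. This is a sketch, not the actual fix — `ImuWatchdog` is a hypothetical name, the recovery callback stands in for a Ctrl-D (0x04) soft-reboot over the Pico's serial console, and a real deployment would also need to re-initialize the EKF after reboot:

```python
import time

class ImuWatchdog:
    """Flags a stale IMU stream and fires a recovery action.

    The RP2040 REPL drop manifests as the IMU topic simply going quiet,
    so staleness of the last message is the only signal needed."""

    def __init__(self, timeout_s: float, recover, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.recover = recover            # callback: e.g. send b'\x04' over serial
        self.clock = clock
        self.last_msg = clock()

    def on_imu_msg(self):
        self.last_msg = self.clock()      # called from the IMU subscriber

    def tick(self) -> bool:
        """Call periodically (e.g. from a 1 Hz timer). True if recovery fired."""
        if self.clock() - self.last_msg > self.timeout_s:
            self.recover()
            self.last_msg = self.clock()  # give the reboot time to take effect
            return True
        return False

# Simulated clock: IMU goes silent, watchdog fires once.
t = [0.0]
fired = []
wd = ImuWatchdog(timeout_s=2.0, recover=lambda: fired.append(t[0]), clock=lambda: t[0])
wd.on_imu_msg()
t[0] = 5.0
wd.tick()
print(fired)   # -> [5.0]
```

Injecting the clock keeps the logic testable without hardware; the EKF diverges within 90 seconds of IMU loss, so a 2-second staleness timeout leaves ample margin.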

AUG 2026 — INCIDENT 2

Mom stops using Annie — "it just freezes"

Monsoon peak. WiFi drops 15–20% of frames during peak household streaming hours (7–9pm, when Mom most often wants tea or the TV remote). Annie freezes in the hallway, blocking passage. When it resumes, it has lost goal context and asks "Where would you like me to go?" Mom has to repeat herself. After the third freeze in one evening, Mom stops calling Annie. She doesn't complain — she simply stops. The team doesn't notice for two weeks because the dashboard shows 94% nav success rate (computed over all hours, not the 7–9pm window). The metric was right; the window was wrong.

SEP 2026

Phase 2c stalls — SLAM prerequisite chain broken

Phase 2c (semantic map annotation) requires Phase 1 SLAM to be stable enough to serve as pose ground truth for labeling. But SLAM is still fragile — the IMU watchdog is unimplemented, map corruption happens roughly monthly, and the Zenoh fix from session 89 was never deployed (the multi-stage Dockerfile buildx build has been "blocked on CI setup" for 3 months). Phase 2c cannot start. Phase 2d (embeddings) cannot start without 2c. Phase 2e (AnyLoc) cannot start without 2d. Three of five Phase 2 sub-phases are gated behind an infrastructure prerequisite that is itself gated behind another prerequisite. The roadmap looked like a DAG; it was actually a single chain.
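The chain is easy to see once written down. A sketch using the dependency edges described above (the prerequisite names are shorthand, and `blocked_by` is illustrative, not project tooling):

```python
# Each phase maps to its prerequisites. What reads as a DAG is, along the
# 2c -> 2d -> 2e spine, a single chain rooted in Phase 1 SLAM stability.
deps = {
    "slam_stable": ["zenoh_fix_deployed", "imu_watchdog", "msgfilter_queue_fix"],
    "2c_semantic_annotation": ["slam_stable"],
    "2d_embeddings": ["2c_semantic_annotation"],
    "2e_anyloc": ["2d_embeddings"],
}

def blocked_by(phase: str, unmet: set[str]) -> set[str]:
    """Transitive closure of unmet prerequisites for a phase."""
    out = set()
    for d in deps.get(phase, []):
        if d in unmet:
            out.add(d)
        out |= blocked_by(d, unmet)
    return out

# One undeployed infrastructure fix blocks three of five Phase 2 sub-phases.
unmet = {"zenoh_fix_deployed"}
for p in ("2c_semantic_annotation", "2d_embeddings", "2e_anyloc"):
    print(p, "blocked by:", sorted(blocked_by(p, unmet)))
```

Every phase in the loop reports the same single blocker — the structure has no parallelism to exploit, which is the definition of a chain rather than a DAG.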

SEP 2026

VRAM ceiling hit — Phase 2d quietly abandoned

SigLIP 2 ViT-SO400M requires ~800MB VRAM on Panda. The E2B VLM already uses ~1.8GB. Panda's GPU has 4GB total. With OS overhead, the two models cannot coexist. The research said "competing with VLM for VRAM" — the competition was never resolved. Phase 2d is deprioritized to "future work." The embedding extraction capability — which would have enabled place recognition, loop closure augmentation, and scene change detection — is shelved. The perception architecture loses its memory layer before it was ever built.
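The arithmetic that killed Phase 2d could have been a one-line check in the planning doc. A sketch using the weight figures above — the peak multiplier and OS overhead are assumptions for illustration, since activations, KV cache, and CUDA context push real usage well past resident weight size:

```python
# Panda GPU budget in MB. Weight sizes are from the research; the peak
# factor and OS overhead are assumed placeholders, not measured values.
TOTAL_MB = 4096
weights = {"e2b_vlm": 1800, "siglip2_so400m": 800}
PEAK_FACTOR = 1.6      # assumed inference-time peak vs. resident weights
OS_OVERHEAD_MB = 600   # assumed CUDA context + display + framework buffers

peak = sum(round(mb * PEAK_FACTOR) for mb in weights.values()) + OS_OVERHEAD_MB
verdict = "fits" if peak <= TOTAL_MB else "does not fit"
print(f"peak demand ~{peak} MB vs {TOTAL_MB} MB budget -> {verdict}")
# -> peak demand ~4760 MB vs 4096 MB budget -> does not fit
```

Even with generous rounding in the models' favor, the two cannot share a 4GB device at inference time — which is the unresolved "competing with VLM for VRAM" note, quantified.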

OCT 2026

Project pivots — edge thesis abandoned, cloud VLM fallback adopted

"Too many moving parts on Panda." The decision is made to route VLM inference to Titan over the home LAN, treating WiFi as the transport layer rather than the failure mode. This is the exact architectural bet the research identified as the risk: if WiFi is unreliable, cloud inference is worse. The pivot does not solve the glass door problem, the IMU crash problem, or the SLAM prerequisite chain. It trades edge latency (18ms) for LAN latency (35–120ms) and makes the system more fragile to the same failure that already caused Mom to stop using Annie. Six months of edge-first infrastructure work is partially undone in one architectural decision made under time pressure.

2027 — THE PAPERWEIGHT

Orin NX 16GB module sits unused for 6 months — no carrier board ordered

Optional/speculative scenario. An Orin NX 16GB SoM is purchased mid-2027 as "future upgrade path" to run Isaac Perceptor (nvblox + cuVSLAM) locally. The module ships in a tray; the carrier board is on a separate SKU from a different vendor with a 4–8 week lead time. No one orders it. The module sits in a drawer for six months. By the time the carrier arrives, DGX Spark + Panda already handle the workload the Orin was meant to absorb, and the stereo camera required by cuVSLAM still hasn't been purchased either. The hardware isn't wrong; the bill-of-materials discipline is. One missing $200 part turns a $600 module into a paperweight. Buying into an ecosystem before verifying the full chain works end-to-end is its own failure mode.

What the Post-mortem Reveals

The KEY INSIGHT: We built the fast path. We forgot the slow path entirely.

The research is meticulous about the fast path: 58 Hz VLM throughput, 18ms inference latency, 4-tier hierarchical fusion, dual-rate architecture (perception at 58 Hz, planning at 1–2 Hz). These numbers are correct and impressive. But the research contains zero specification for what happens when any of these numbers degrades. What does Annie do when VLM inference times out? The research doesn't say. What does Annie do when the SLAM map diverges? The research doesn't say. What does Annie do when the IMU drops to REPL? The research says "known failure mode" and moves on.

The boring failure, not the interesting one: The system did not fail because the VLM architecture was wrong, or because 58 Hz was insufficient, or because Waymo's patterns didn't translate. It failed because WiFi dropped 8–15% of frames during the hours when the system was most used. This was not an exotic failure. Every home robot deployment on consumer WiFi faces this. The research spends three pages on AnyLoc loop closure (P(success) = 50%, multi-session effort) and zero words on "what happens when the 18ms VLM call takes 90ms." The effort allocation was exactly backwards from what the deployment needed.

The glass door failure is the epistemically interesting one: The "VLM proposes, lidar disposes" safety rule is structurally sound — until both sensors share the same blind spot. Glass and mirrors are systematic failures, not random noise. The temporal EMA smoothing (alpha=0.3, 14 frames) was designed to filter random hallucinations. But glass is not random — every frame through glass is consistently "CLEAR." The EMA amplifies systematic errors while filtering random ones. This is the unknown unknown: a failure mode that the safety rule's own design premise left it unable to catch.
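The asymmetry is easy to demonstrate with the stated parameters (alpha=0.3, 14 frames). In this sketch, 1.0 represents a confident "CLEAR" reading and 0.0 a detected obstacle:

```python
def ema(readings, alpha=0.3):
    """Exponential moving average, as used for temporal smoothing."""
    s = readings[0]
    out = [s]
    for r in readings[1:]:
        s = alpha * r + (1 - alpha) * s
        out.append(s)
    return out

# Random hallucination: one spurious CLEAR in a stream of OBSTACLE frames.
random_err = [0.0] * 7 + [1.0] + [0.0] * 6
# Systematic error: glass reads CLEAR on every single frame.
systematic = [1.0] * 14

print(round(max(ema(random_err)), 2))   # -> 0.3   spike damped well below confidence
print(round(ema(systematic)[-1], 2))    # -> 1.0   14 consistent CLEARs stay confident
```

The filter does exactly what it was designed to do: it suppresses uncorrelated noise. A correlated error looks indistinguishable from truth, so smoothing only deepens the system's confidence in it.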

The prerequisite chain was a single point of failure: Phases 2c, 2d, and 2e are each gated on the previous phase, and all three are gated on Phase 1 SLAM being stable. The research acknowledges this ("Prerequisite: Phase 1 SLAM foundation must be deployed first") but treats it as a sequencing note rather than a risk. In practice, SLAM stability is a moving target — the Zenoh version fix, the IMU watchdog, the MessageFilter queue size — each one is a dependency that never fully cleared. The DAG became a chain became a single point of failure. Phase 2 shipped two sub-phases and stalled.

The metric masked the user experience: 94% navigation success rate measured over all 24 hours. But Mom uses Annie 7–9pm, when WiFi contention is highest. The success rate during that window was closer to 75%. Metric aggregation hid the failure from the team for two weeks — long enough for Mom to form the habit of not using Annie. Habits form in two weeks. Trust, once lost in a vulnerable user, takes months to rebuild.
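Disaggregating the same logs by hour would have made the failure visible in week one. A sketch with illustrative per-attempt records (the data is shaped to reproduce the pattern above — healthy aggregate, 75% in the 7–9pm window — not taken from real logs):

```python
from collections import defaultdict

# (hour_of_day, success) per navigation attempt -- illustrative data only.
attempts = [(h, True) for h in range(0, 19) for _ in range(10)] \
         + [(19, True)] * 15 + [(19, False)] * 5 \
         + [(20, True)] * 15 + [(20, False)] * 5 \
         + [(h, True) for h in range(21, 24) for _ in range(10)]

by_hour = defaultdict(list)
for hour, ok in attempts:
    by_hour[hour].append(ok)

overall = sum(ok for _, ok in attempts) / len(attempts)
window = [ok for h in (19, 20) for ok in by_hour[h]]
print(f"overall: {overall:.0%}")                    # -> overall: 96%
print(f"19-21h:  {sum(window) / len(window):.0%}")  # -> 19-21h:  75%
```

The aggregate looks healthy because the failing window is a small slice of a 24-hour denominator. The fix is not a better metric but a better group-by: per-user, per-hour.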

What the team wishes they'd built differently:

  1. Graceful degradation first, throughput optimization second. A robot that navigates at 10 Hz with a defined freeze-and-announce behavior is more trustworthy than one that navigates at 58 Hz with undefined timeout behavior.
  2. A WiFi circuit breaker. When VLM RTT exceeds 50ms for 3 consecutive frames, switch to lidar-only reactive mode and announce "I'm navigating carefully — my eyes are slow right now." Mom would have found this charming. Instead she got silent freezes.
  3. Glass as a named hazard class. Catalog reflective/transparent surfaces in the home during setup. Don't discover them during navigation. This is a one-time manual task that removes a systematic sensor blind spot.
  4. An IMU watchdog on day one. The Pico REPL crash is a known failure mode documented in MEMORY.md since session 83. It's still manual-intervention-only in October 2026.
  5. Measured Mom's actual usage window. The dashboard showed system-wide metrics. A per-user, per-hour breakdown would have caught the 7–9pm degradation in the first week.
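Item 2 above is small enough to sketch in full. A minimal circuit breaker using the thresholds from the list (50ms budget, 3 consecutive frames); `VlmCircuitBreaker` and the `announce` placeholder are hypothetical names, not project code:

```python
class VlmCircuitBreaker:
    """Trip to lidar-only reactive mode when VLM RTT exceeds a budget for
    N consecutive frames; recover after N consecutive good frames."""

    def __init__(self, rtt_budget_ms=50.0, trip_after=3):
        self.rtt_budget_ms = rtt_budget_ms
        self.trip_after = trip_after
        self.bad = 0
        self.good = 0
        self.degraded = False

    def observe(self, rtt_ms: float) -> bool:
        """Feed one frame's RTT; returns True while in degraded mode."""
        if rtt_ms > self.rtt_budget_ms:
            self.bad += 1
            self.good = 0
            if self.bad >= self.trip_after and not self.degraded:
                self.degraded = True
                announce("I'm navigating carefully -- my eyes are slow right now.")
        else:
            self.good += 1
            self.bad = 0
            if self.good >= self.trip_after and self.degraded:
                self.degraded = False
        return self.degraded

def announce(msg: str):   # placeholder for the robot's TTS channel
    print(msg)

cb = VlmCircuitBreaker()
modes = [cb.observe(rtt) for rtt in (18, 18, 90, 85, 70, 18, 18, 18)]
print(modes)   # trips on the 3rd slow frame, recovers after 3 fast ones
```

Requiring consecutive breaches in both directions prevents flapping between modes on a single jittery frame — the announcement fires once per degradation episode, not per slow frame.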

NOVA: The research optimized from the VLM's perspective: how fast can it run, how many tasks can it multiplex, how sophisticated can the fusion be? But the system's actual reliability is set by its weakest path, not its fastest. The weakest path was never designed. "VLM proposes, lidar disposes" is a beautiful safety rule — but it has a hidden premise: "at least one sensor is truthful." Glass removes the premise. WiFi degradation removes the fast path entirely. A pre-mortem asks: what's the most likely way this fails? The answer isn't "the VLM architecture was wrong." It's "WiFi dropped 10% of frames during dinner and Mom stopped asking."

Added pre-mortem findings:
  • The Hailo-8 under the bed. The mitigation that prevents half the WiFi-brownout scenario was already on the robot — 26 TOPS of NPU, idle. The failure mode isn't "we need better hardware"; it's "we didn't inventory what was already there." Safety-path freezes are fixable with a weekend of HailoRT integration; semantic-path degradation needs a fallback policy the research never specified.
  • The paperweight pattern. Optional: Orin NX modules, stereo cameras, carrier boards — an ecosystem purchase that stalls on a $200 missing part. Each "future upgrade" that isn't a full end-to-end BOM becomes a drawer ornament. Cross-ref Lens 25 (ethics of resource allocation) — burnt capital is moral, not just financial.

THINK: The research cites OK-Robot's principle: "What really matters is not fancy models but clean integration." Then it proceeds to design a 4-tier hierarchical fusion architecture with 5 perception capabilities, 4 temporal smoothing mechanisms, and a 5-phase roadmap with 3 SLAM prerequisites. The principle was cited but not applied. Clean integration would have started with: (1) define the degraded-mode behavior first, (2) implement WiFi fallback before adding a second VLM task, (3) automate the IMU watchdog before adding place recognition. The research described what success looks like at 58 Hz. It forgot to describe what the system does at 0 Hz — when the network is gone, the IMU has crashed, and Mom is standing in the hallway waiting.