LENS 01 — CROSS-LENS CONVERGENCE NOTES

• Lens 04 (WiFi Achilles' heel): Lens 04 correctly identifies the 100 ms cliff edge as a binding constraint, but treats it as fixed. First principles dissolves it: WiFi latency matters only because the reactive tier currently requires round-trips. If the lidar ESTOP and last-cached VLM command run locally on the Pi, network jitter affects strategic planning at 1 Hz — not collision avoidance at 10 Hz. The constraint is real; the sensitivity to it is architectural, and therefore voluntary.

• Lens 17 / 18 (multi-query is highest-value, lowest-risk): Lens 01 provides the first-principles justification for why Lenses 17 and 18 are correct. The "one query per frame" assumption is a convention, not a physics constraint. Temporal surplus (1.7 cm of motion between frames at 1 m/s) means the scene barely changes — redundant queries carry no additional information. Breaking the convention via cycle_count % N dispatch is the logical consequence of recognizing that the 58 Hz slot is already paid for.

• Lens 16 (build the map to remember / local-first edge sovereignty): The "map is for navigation" assumption is the most consequential dissolved constraint at the convention layer. First principles reveals that an annotated SLAM grid — once you attach VLM scene labels to cells over time — is a queryable semantic memory of the home's layout. Navigation is the secondary benefit. Lens 16's "build to remember" insight is the natural endpoint of stripping the navigation assumption from the mapping layer.

NEW (after Session 119 hardware audit): Lens 01 now makes the "inference must be remote" assumption explicit as voluntary. Three of the four irreducible constraints — lidar ESTOP, IMU heading, ArUco classical CV at 78 µs on the Pi ARM CPU — already live on the robot edge. Only the VLM goal-tracking signal crosses the network. 75% of the minimum-viable floor is already edge-native. Lens 16 gets a concrete numerator/denominator for its local-first thesis.
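The cycle_count % N dispatch above reduces to a few lines. A minimal sketch — the schedule contents and query names are illustrative assumptions about what the surplus slots might ask, not Annie's actual query set:

```python
# Rotate distinct VLM queries through the already-paid-for 58 Hz slot
# instead of re-asking a question the scene (1.7 cm of motion later)
# cannot answer differently. Query names below are hypothetical.
QUERY_SCHEDULE = [
    "goal_direction",     # the original every-frame question
    "obstacle_labels",    # semantic richness for the map layer
    "scene_summary",      # feeds the annotated SLAM cells (Lens 16)
    "loop_closure_hint",  # place-recognition support
]

def query_for_cycle(cycle_count: int) -> str:
    """cycle_count % N dispatch: each frame carries exactly one query type."""
    return QUERY_SCHEDULE[cycle_count % len(QUERY_SCHEDULE)]
```

Every fourth frame still carries the goal-direction query, so the minimum-viable behavior is unchanged; the other slots are reclaimed temporal surplus.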
• Lens 26 (bypass the text-language layer): The ViT encoder produces 280 continuous embedding tokens in 14 ms; text decoding adds 4 ms and collapses that continuous representation into 1–2 discrete tokens. For place recognition and scene-change detection, the text step is pure information loss. Lens 01 names this as a convention ("VLM must return text"), and Lens 26 names the same observation from the opposite direction ("bypass the language layer"). Both converge on SigLIP embedding extraction as the highest-leverage single architectural change — no hardware required, one model deployment on Panda. Annie's ArUco pipeline is the existence proof that "useful output is a text string" is already broken inside the codebase: solvePnP returns a 6-DoF pose, not a sentence.

• Lens 10 (fast path built, slow path forgotten): The now-4-constraint irreducible minimum (lidar ESTOP + VLM directional query + IMU heading + classical-CV fiducial detection) maps directly onto Lens 10's "fast path." Annie already has the fast path deployed. Everything in Phase 2 is the slow path: semantic richness, loop closure, topological memory, strategic planning. Lens 01 makes explicit that the slow-path items are not prerequisites — they are layered enhancements on a working minimum. This validates Lens 10's critique while explaining why it is safe to proceed: the fast path is solid.

— NEW CONNECTIONS (added 2026-04-16 after Session 119 hardware audit) —

• Lens 02 (Inversion / "What would the opposite approach look like?"): Lens 01 now names classical CV as the fourth irreducible constraint — ArUco + solvePnP at 78 µs, 1.7 cm pose accuracy, no GPU/WiFi. Lens 02 should carry the inverted frame: instead of "upgrade the VLM," the inversion is "downgrade to a classical detector when geometry is known." This is a Pareto improvement, not a compromise — 230× faster than the 18 ms VLM round-trip and incapable of hallucinating.
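The inverted frame — downgrade to the classical detector whenever its preconditions hold — reduces to a small dispatch rule. A minimal sketch, assuming the latency numbers from these notes (78 µs classical, 18 ms VLM round-trip); the function name, return labels, and dead-reckoning fallback are illustrative, not Annie's actual API:

```python
def pick_estimator(aruco_marker_visible: bool, wifi_up: bool) -> str:
    """Prefer classical CV whenever known geometry is in view.

    ArUco + solvePnP (~78 us, 1.7 cm, local ARM CPU) is deterministic and
    cannot hallucinate; the VLM (~18 ms round-trip) is the semantic fallback
    for frames with no fiducial visible, and is unavailable off-WiFi.
    """
    if aruco_marker_visible:
        return "classical_cv"    # ~230x faster than the VLM round-trip
    if wifi_up:
        return "vlm_query"       # semantic, remote, jitter-sensitive
    return "dead_reckoning"      # IMU heading only: degrade, don't stop
```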
The instinct is "bigger, fancier GPU stack," but the inversion that wins is "smaller, dumber, local."

• Lens 12 (Idle Resources / "What is already paid for but unused?"): Lens 01's fourth irreducible constraint points directly at Lens 12's strongest single example — the Hailo-8 AI HAT+ on Annie's Pi (26 TOPS, already on the BOM) has sat idle for nav since day one. YOLOv8n at 430 FPS, sub-10 ms, zero WiFi — the hardware to dissolve the WiFi cliff-edge failure mode has been sitting on the robot the whole time. Also flag the parallel idle compute tiers for Lens 12's enumeration: Beast (DGX Spark, always-on) and the Orin NX 16 GB (future robot SBC). Lens 12 should enumerate all three and ask: what would each do if we stopped treating it as a spare?

• Lens 14 (Dual-Process / Kahneman System 1 + System 2): Lens 01's 4-constraint minimum IS a dual-process architecture in disguise — constraints #1 and #4 (lidar ESTOP, classical-CV fiducial detection) are System 1: fast, reactive, local, no semantics. Constraint #2 (VLM nav query) is System 2: slow, semantic, remote, reflective. The IROS paper (arXiv 2601.21506) validates this split empirically: 66% latency reduction, 67.5% success vs 5.83% for VLM-only. Lens 14 should cite this: the four-constraint minimum is not arbitrary; it maps onto a cognitive architecture with independent research backing.

SOURCE REFERENCES for downstream lenses:
• services/panda_nav/ + aruco_homing implementation — existence proof for the 78 µs / 1.7 cm classical-CV claim.
• docs/RESOURCE-REGISTRY.md — confirms the Hailo-8 line item and its idle-for-nav status.
• IROS paper arXiv 2601.21506 — dual-process validation (66% latency reduction, 67.5% vs 5.83% success).
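The Lens 14 mapping can be made concrete as a two-rate control loop in which System 1 runs every tick and System 2's latest answer is only advisory. A minimal sketch — the 0.25 m ESTOP radius, degree-valued heading, and class/method names are illustrative assumptions, not Annie's real parameters:

```python
class DualProcessNav:
    """System 1 (constraints #1/#4): fast, reactive, local, runs every tick.
    System 2 (constraint #2): slow, semantic, remote; its latest answer is
    cached so WiFi jitter degrades planning freshness, never collision
    avoidance.
    """

    ESTOP_RADIUS_M = 0.25  # illustrative threshold, not Annie's actual value

    def __init__(self) -> None:
        self.cached_vlm_heading_deg = 0.0  # last System 2 answer

    def tick(self, lidar_min_range_m: float, vlm_heading_deg=None):
        # System 2 result arrives at ~1 Hz, and only when the network
        # delivers it; otherwise the cache holds the last command.
        if vlm_heading_deg is not None:
            self.cached_vlm_heading_deg = vlm_heading_deg
        # System 1 runs unconditionally at the 10 Hz reactive rate and
        # never touches the network.
        if lidar_min_range_m < self.ESTOP_RADIUS_M:
            return ("ESTOP", self.cached_vlm_heading_deg)
        return ("FOLLOW_CACHED_HEADING", self.cached_vlm_heading_deg)
```

A dropped VLM reply just leaves the cache stale for a cycle; the ESTOP branch is network-free, which is the architectural point of the 4-constraint split.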