LENS 01 — FIRST PRINCIPLES X-RAY

"What must be true for this to work?"

The single most non-obvious insight from applying first principles to this research is that the architecture is not bandwidth-limited — it is assumption-limited. The VLM runs at 58 frames per second, yet the system acts on barely 10 to 15 commands per second in practice, because the pipeline treats each frame as an independent query requiring a complete round-trip to Panda and back. Every frame that carries the same question as the previous frame is pure redundancy at the physics layer. At one meter per second, consecutive frames differ by only 1.7 centimeters of robot travel (1 m/s ÷ 58 frames/s ≈ 1.7 cm) — the scene is structurally identical, and the VLM's answer to the same question will almost certainly be the same. Temporal surplus is not a nice-to-have; it is the free resource that makes the entire multi-query strategy possible without touching a single piece of hardware.

The research's core argument about multi-query VLM — that you can run four parallel perception tasks at roughly 15 Hz each by time-slicing a 58 Hz pipeline — is the canonical example of breaking a convention disguised as a law. The "one question per frame" assumption was never stated in the codebase; it emerged organically when the nav loop was written for a single task. First principles says: the model accepts any prompt, it runs in 18 milliseconds regardless of which question you ask, and the time slot is already paid for. The only cost of asking a different question on alternating frames is a single modulo operation. That the research assigns this a 90% success probability and one session of implementation effort confirms it is a convention dissolving, not an engineering lift.

What this lens reveals that others miss is the hierarchy of constraint rigidity. Lens 04 correctly identifies WiFi as the Achilles' heel — but treats it as a fixed constraint to work around.
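The "different question on alternating frames" mechanism is small enough to show in full. A minimal sketch of the time-slicing idea, assuming four perception tasks sharing one pipeline; the prompts and the `prompt_for_frame` name are illustrative placeholders, not the project's actual API:

```python
# Sketch of multi-query time-slicing: one 58 Hz VLM pipeline, four
# perception prompts interleaved by frame index. Prompt wording here is
# hypothetical; only the scheduling mechanism is the point.

PROMPTS = [
    "Which direction is the goal?",   # nav query
    "Is there an obstacle ahead?",    # hazard query
    "What room is this?",             # localization query
    "Is the charging dock visible?",  # task-specific query
]

def prompt_for_frame(frame_idx: int) -> str:
    """Round-robin prompt selection: the only per-frame cost is a modulo."""
    return PROMPTS[frame_idx % len(PROMPTS)]

# At 58 frames/s split four ways, each prompt is still answered
# 58 / 4 = 14.5 times per second -- at or above the 10-15 Hz the nav
# loop consumes today.
```

Each task's effective rate is the pipeline rate divided by the number of interleaved prompts, which is why four tasks at roughly 15 Hz fit inside a 58 Hz budget with headroom to spare.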
First principles says: WiFi latency is a constraint only because the current architecture requires round-trips. A system that runs the VLM at the robot edge, co-located with the camera on Panda, caches recent nav commands, and uses the network only for strategic-tier updates would reduce WiFi dependency from a hard real-time constraint to a soft planning constraint. The 100 ms cliff edge becomes a non-issue if the reactive tier operates entirely on-device. The constraint is real, but the assumption that the system must be structured to be sensitive to it is voluntary.

The implications form a 4-constraint minimum viable system — and the fourth constraint only became visible once the Session 119 hardware audit forced a careful look at what Annie's ArUco homing actually does. Strip everything to physics. You need, first, a collision-avoidance signal that cannot be spoofed by VLM hallucination — the lidar ESTOP operating locally on the Pi at 10 Hz. Second, a goal-relative directional signal updated faster than the robot can move into danger — the VLM nav query at any rate above 5 Hz. Third, a heading reference that corrects motor drift — the IMU. And fourth, a local detector for known-shape signals — OpenCV's ArUco detector plus solvePnP, running in about 78 microseconds on the Pi's ARM CPU and returning a six-degree-of-freedom pose accurate to 1.7 centimeters, with no GPU, no model weights, and no network.

When the target geometry is known in advance — fiducial markers, QR codes, charging-dock shapes, known-class obstacles — classical CV is strictly better than a VLM: roughly 230 times faster than Panda's 18-millisecond GPU-plus-WiFi round-trip (18 ms ÷ 78 µs ≈ 230), and it cannot hallucinate. This fast detection path already lives on the Pi, but today it covers only one target. Everything else in the research — SLAM, semantic maps, temporal EMA, AnyLoc, SigLIP embeddings, Titan strategic planning — layers capability on top of this irreducible quartet. Annie already has all four deployed.
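Constraint four's output — the six-degree-of-freedom pose that solvePnP returns — is what a homing behavior actually consumes. A minimal sketch of turning that pose into a heading correction, assuming the OpenCV camera convention (x right, y down, z forward, in meters); the function names, gain, and clamp are hypothetical illustrations, not Annie's actual tuning or code:

```python
import math

def heading_error(tvec):
    """Bearing of the marker in the camera frame, in radians, from a
    solvePnP-style translation vector (x right, y down, z forward, m).
    0.0 means the marker is dead ahead; positive means it is to the right."""
    x, _, z = tvec
    return math.atan2(x, z)

def steer_command(tvec, gain=1.5, max_rate=1.0):
    """Proportional turn-rate command for homing (rad/s), clamped.
    Gain and clamp are illustrative values only."""
    w = gain * heading_error(tvec)
    return max(-max_rate, min(max_rate, w))
```

For example, a marker one meter ahead and ten centimeters to the right produces a small positive turn rate; a marker far off-axis saturates at the clamp. Everything here is stdlib math, which is part of why the whole path fits in microseconds on an ARM CPU.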
The Hailo-8 AI HAT+ on the Pi provides 26 TOPS of idle NPU capacity — enough to run YOLOv8n at 430 frames per second locally, with sub-10-millisecond latency and zero WiFi dependence. The Pi-as-dumb-sensor-frontend architecture was a first-pass implementation decision, not a physics constraint. Activating the Hailo-8 would be the obvious extension of constraint four beyond ArUco — the same known-shape-detector-on-local-silicon principle, widened from fiducials to the 80 COCO classes. The hardware to dissolve the WiFi cliff edge has been sitting idle the whole time.

The entire multi-query Phase 2 research is about enriching layers five through ten, all of which are voluntary enhancements. Phase 2a can be deployed confidently because it does not touch the 4-constraint minimum — it only adds information into the layers above safety. The constraint hierarchy does not just clarify what must be done first; it reveals what cannot fail even if everything else is stripped away.

Cross-reference Lens 02 for why classical CV is a Pareto improvement, Lens 12 for the idle-hardware blind spot, and Lenses 14 and 16 for dual-process and local-first implications.
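The local-first restructuring this lens argues for — act on the most recent network-delivered nav command while it is fresh, and hand control to the local tier the moment the WiFi round-trip goes stale — reduces to a small amount of bookkeeping. A minimal sketch, assuming the 100 ms cliff edge from the analysis as the staleness threshold; the class and method names are illustrative, not the project's real interfaces:

```python
import time

STALE_AFTER_S = 0.100  # the 100 ms WiFi cliff edge from the analysis

class NavCommandCache:
    """Hold the last VLM nav command with a timestamp. While fresh, the
    reactive tier acts on it; once stale, get() returns None and the
    caller falls back to the local tier (lidar ESTOP, ArUco homing,
    or an on-NPU detector). Illustrative sketch only."""

    def __init__(self):
        self._cmd = None
        self._stamp = 0.0

    def update(self, cmd, now=None):
        """Record a fresh command from the network (Panda) side."""
        self._cmd = cmd
        self._stamp = time.monotonic() if now is None else now

    def get(self, now=None):
        """Return the cached command if fresh, else None (go local)."""
        now = time.monotonic() if now is None else now
        if self._cmd is not None and now - self._stamp <= STALE_AFTER_S:
            return self._cmd
        return None
```

The design point is that network loss degrades the system from VLM-guided navigation to local known-shape homing, rather than from navigation to nothing — the WiFi constraint becomes soft instead of hard.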