LENS 01

First Principles X-Ray

"What must be true for this to work?"

CONSTRAINT LAYERS — PHYSICS → HARDWARE → ALGORITHM → CONVENTION (deepest = hardest to dissolve)
PHYSICS
Light travels in straight lines

The camera has zero knowledge of what lies around corners, behind furniture, or above its own plane. Every visual navigation system is imprisoned by this: the robot can only see what the camera sees, and the camera sees only what photons reach it. No algorithm changes this.

PHYSICS
Rotational inertia is real and instant

At speed 30, a 5° IMU turn target yields 37° of actual rotation. Motor torque stores kinetic energy in the chassis, which keeps turning after the signal stops. The overshoot is not a software bug. You cannot wish it away with a tighter control loop; you can only predict and pre-brake.

PHYSICS
Lidar cannot see above its own plane

The RPLIDAR C1 sweeps a single horizontal disc at chassis height (~130mm). Table edges, hanging cords, open dishwasher doors, and chair rungs above 130mm are invisible to it. Glass doors reflect IR and return as walls or as nothing. These are not edge cases — they are the majority of real home obstacles.

HARDWARE
WiFi latency has a cliff edge at ~100ms

Annie's inference runs on Panda (18ms per frame). But the round-trip across household WiFi — Pi sends JPEG, Panda returns command string — adds 30–80ms under load, with occasional 150–300ms spikes. At 1 m/s, a 300ms spike means the robot has moved 30cm with no steering correction. The VLM's 58 Hz frame rate is a local measurement; the effective command rate, network-inclusive, is 10–20 Hz on a good day.
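The cliff-edge arithmetic can be made explicit. A minimal sketch using the figures quoted above (the function names are illustrative, not part of the nav stack):

```python
# Effective command rate and blind-travel distance under WiFi latency.
# All figures are the ones quoted above; this just makes the arithmetic explicit.

INFERENCE_S = 0.018   # Panda VLM inference per frame
SPEED_M_S = 1.0       # robot speed

def blind_travel_m(spike_s: float, speed: float = SPEED_M_S) -> float:
    """Distance covered with no steering correction during a latency spike."""
    return speed * spike_s

def effective_rate_hz(rtt_s: float, inference_s: float = INFERENCE_S) -> float:
    """Commands per second when each frame pays inference + network round-trip."""
    return 1.0 / (inference_s + rtt_s)

print(blind_travel_m(0.300))      # 300 ms spike -> 0.30 m of blind travel
print(effective_rate_hz(0.030))   # light WiFi load: ~20.8 Hz
print(effective_rate_hz(0.080))   # heavy WiFi load: ~10.2 Hz
```

The 58 Hz local figure never survives the network: even the best-case round-trip caps the command rate at roughly a third of it.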

HARDWARE
Panda has 8GB VRAM and one PCIe lane to the camera

The Gemma 4 E2B ViT (150M params) uses ~14ms for vision encoding and ~4ms for text decoding. A second model on Panda (e.g., SigLIP 2 at 800MB) competes for VRAM and thermal budget. There is one camera. You cannot run 6 VLM instances in parallel on 6 different image streams — you must time-slice a single stream.

HARDWARE
Pi 5 has no wheel encoders

TurboPi omits rotary encoders entirely. Dead-reckoning from motor commands is unusable (wheel slip, surface variation). This forced rf2o lidar odometry as the primary odometry source — which turned out to be more accurate in practice. A constraint that looked like a hardware deficiency produced a better architecture than the "standard" approach.

HARDWARE
Hailo-8 NPU on Pi — idle, 26 TOPS of local inference available

The AI HAT+ ships a Hailo-8 accelerator (26 TOPS) physically attached to the same Pi that carries the camera. It is currently unused by the nav stack. YOLOv8n runs at 430 FPS locally on this chip with <10ms latency and zero WiFi dependence. The assumption that "inference must be remote on Panda" was never a physics constraint — it was a first-pass implementation decision made before the Hailo was on the bill of materials. The hardware to run a fast local safety tier has been sitting idle the whole time.

ALGORITHM
SLAM requires loop closure to remain accurate

Without revisiting previously mapped areas, trajectory error accumulates as a random walk. Scan-matching gives relative accuracy (frame-to-frame) but long-range absolute pose error grows unboundedly in linear environments like hallways. This means Annie's SLAM is accurate for short exploratory runs but will drift in large, featureless rooms. Visual loop closure (AnyLoc, SigLIP embeddings) addresses this — but at added VRAM cost on an already-constrained Panda.
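The random-walk claim can be checked numerically. A Monte Carlo sketch assuming i.i.d. per-scan matching noise (the 1 cm sigma is an assumption for illustration, not a measured rf2o figure):

```python
import numpy as np

# Monte Carlo sketch of the random-walk claim: each frame-to-frame scan match
# contributes a small independent error; with no loop closure, the spread of
# the accumulated pose error grows like sqrt(number of scans).
# The 1 cm per-scan sigma is illustrative, not a measured rf2o figure.

def drift_std(n_scans: int, trials: int = 5000,
              sigma: float = 0.01, seed: int = 0) -> float:
    """Std-dev of accumulated 1-D pose error after n_scans matches."""
    rng = np.random.default_rng(seed)
    per_scan = rng.normal(0.0, sigma, size=(trials, n_scans))
    return float(per_scan.sum(axis=1).std())

for n in (100, 400, 1600):
    print(n, round(drift_std(n), 4))   # spread doubles for every 4x scans
```

Relative (frame-to-frame) accuracy stays constant while absolute error grows without bound, which is exactly the hallway failure mode.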

ALGORITHM
VLM text output is discrete, not continuous

The current nav command schema ("LEFT MEDIUM", "CENTER LARGE") maps a continuous visual field onto 9 discrete cells. This is an algorithmic choice, not a physics constraint: the ViT encoder produces 280 continuous feature tokens per image. Discretizing to text sacrifices geometric precision in exchange for human readability and easy downstream parsing. The 9-cell schema is a convention that could be replaced entirely by feeding raw embeddings to a learned steering function.
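What the 9-cell discretization throws away can be made concrete. A minimal sketch of such a scheme (the bin edges and thresholds are assumptions, not the production schema):

```python
# Map a continuous goal bearing and apparent size onto a 3x3 command grid,
# as in "LEFT MEDIUM" / "CENTER LARGE". Bin edges here are illustrative.

HORIZ = ("LEFT", "CENTER", "RIGHT")
SIZE = ("SMALL", "MEDIUM", "LARGE")

def to_cell(bearing_deg: float, frac_of_frame: float) -> str:
    """Quantize continuous measurements into one of 9 discrete commands."""
    h = HORIZ[0 if bearing_deg < -10 else (2 if bearing_deg > 10 else 1)]
    s = SIZE[0 if frac_of_frame < 0.1 else (2 if frac_of_frame > 0.3 else 1)]
    return f"{h} {s}"

# Two clearly different scenes collapse onto the same command:
print(to_cell(-11.0, 0.12))  # LEFT MEDIUM
print(to_cell(-38.0, 0.29))  # LEFT MEDIUM: 27 degrees of bearing precision lost
```

Two scenes 27° apart in bearing emit identical command strings; the continuous embedding preserves that distinction for free.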

CONVENTION
"One query per frame" — the highest-value voluntary constraint

All current systems, including Annie's Phase 1 nav loop, send one question to the VLM per frame. This emerged naturally from single-task systems where one question was all you needed. At 58 Hz, the assumption wastes most of the inference budget: alternating four different queries across frames gives each task 14–15 Hz, faster than Waymo's planning loop. The research shows this is a one-line code change (cycle_count % N dispatch). This convention costs nothing to break.
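The cycle_count % N dispatch can be sketched in a few lines (the task names and prompt strings are illustrative, not the production queries):

```python
# Round-robin multi-query dispatch: one 58 Hz frame stream, four questions.
# Each task sees every 4th frame, i.e. ~14.5 Hz. Prompts are illustrative.

QUERIES = [
    ("nav",      "Which direction is the goal?"),
    ("obstacle", "Any obstacle in the near field?"),
    ("scene",    "What room is this?"),
    ("place",    "Describe this location for recall."),
]

def dispatch(cycle_count: int) -> tuple[str, str]:
    """Pick this frame's query: the entire multi-query change."""
    return QUERIES[cycle_count % len(QUERIES)]

FRAME_HZ = 58
per_task_hz = FRAME_HZ / len(QUERIES)
print(per_task_hz)                            # 14.5 Hz per task
print([dispatch(i)[0] for i in range(8)])     # each task fires every 4th frame
```

Nothing else in the loop changes: the model, the frame source, and the parser are untouched.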

CONVENTION
"VLM must return text to be useful" — proven dissolvable

The 4ms text-decoding step is not technically necessary for place recognition or scene-change detection. The SigLIP ViT encoder output — 280 tokens of high-dimensional embedding — IS the scene representation. Cosine similarity on these vectors finds visually similar locations without any language at all. Text output is a convention inherited from chatbot pipelines, not a requirement of visual intelligence. Annie's own ArUco homing is the existence proof: cv2.aruco.ArucoDetector + solvePnP(SOLVEPNP_ITERATIVE) returns a 6-DoF pose in ~78 µs per call on the Pi ARM CPU — no text, no VLM, no network, and accurate to ~1.7 cm. The "useful output is a text string" convention is already broken inside the codebase; we just haven't generalized it.
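Text-free place recognition on encoder output reduces to vector arithmetic. A sketch in which random matrices stand in for the 280-token embeddings (the token count is from the text above; the per-token dimension is an assumption):

```python
import numpy as np

# Place recognition directly on encoder embeddings, no text decoding.
# Random matrices stand in for pooled ViT outputs; in the real pipeline
# they would come from the SigLIP/Gemma vision encoder.

def pooled(tokens: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings into one scene vector, then L2-normalize."""
    v = tokens.mean(axis=0)
    return v / np.linalg.norm(v)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two pooled scene vectors."""
    return float(pooled(a) @ pooled(b))

rng = np.random.default_rng(42)
kitchen = rng.normal(size=(280, 64))           # 280 tokens (dim 64 is assumed)
kitchen_later = kitchen + rng.normal(scale=0.05, size=kitchen.shape)
hallway = rng.normal(size=(280, 64))           # a different place

print(cos(kitchen, kitchen_later))   # near 1.0: recognized as the same place
print(cos(kitchen, hallway))         # near 0.0: different place
```

The comparison costs a dot product, which is why skipping the 4ms decode step is pure savings for this task.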

CONVENTION
"Map is for navigation" — the dissolved assumption

VLMaps, the research reference for Phase 2c, reframes the map entirely: the occupancy grid is not a navigation substrate, it is a semantic memory surface. Navigation is a secondary benefit. Once you annotate SLAM grid cells with VLM scene labels over time, you have built a queryable model of the home's layout — rooms, furniture positions, traffic patterns. The map becomes the knowledge base Annie consults to answer "where is Mom usually in the morning?"

The single most non-obvious insight from applying first principles to this research: the architecture is not bandwidth-limited — it is assumption-limited. The VLM runs at 58 Hz, producing 58 frames of visual intelligence per second. Yet the system acts on barely 10–15 commands per second in practice, because the pipeline treats each frame as an independent query requiring a complete round-trip. Every frame that carries the same question as the previous frame is pure redundancy at the physics layer. At 1 m/s, consecutive frames differ by 1.7cm of robot travel — the scene is structurally identical. The VLM's answer to the same question will almost certainly be the same. Temporal surplus is not a nice-to-have; it is the free resource that makes the entire multi-query strategy possible without touching a single piece of hardware.

The research's core argument about multi-query VLM — that you can run four parallel perception tasks at 15 Hz each by time-slicing a 58 Hz pipeline — is the canonical example of breaking a convention disguised as a law. The "one question per frame" assumption was never stated in the codebase; it emerged organically when the nav loop was written for a single task. First principles says: the model accepts any prompt. The model runs in 18ms regardless of which question you ask. The time slot is already paid for. The only cost of asking a different question on alternating frames is a single modulo operation. That the research assigns this a 90% success probability and "1 session" of implementation effort confirms it is a convention dissolving, not an engineering lift. This matters because it signals where the next five conventions are hiding: not in the hardware spec, not in the physics, but in the first-pass implementation decisions that were never revisited.

What this lens reveals that others miss is the hierarchy of constraint rigidity. Lens 04 (see cross-lens notes) correctly identifies WiFi as the Achilles' heel, but treats it as a fixed constraint to work around. First principles says: WiFi latency is a constraint only because the current architecture requires round-trips. A system that runs inference at the robot edge (on the Pi, co-located with the camera), caches recent nav commands, and uses the network only for strategic-tier updates would reduce WiFi dependency from a hard real-time constraint to a soft planning constraint. The 100ms cliff edge that Lens 04 fears becomes a non-issue if the reactive tier (10 Hz lidar ESTOP) operates entirely on-device. The constraint is real, but the assumption that the system must be structured to be sensitive to it is voluntary.

The implications form a 4-constraint minimum viable system — and the fourth only became visible once the Session 119 hardware audit forced a careful look at what Annie's ArUco homing actually does. Strip everything to physics: you need (1) a collision-avoidance signal that cannot be spoofed by VLM hallucination — that is the lidar ESTOP operating locally on Pi at 10 Hz; (2) a goal-relative directional signal updated faster than the robot can move into danger — that is the VLM nav query at any rate above ~5 Hz; (3) a heading reference that corrects motor drift — that is the IMU; and (4) a local detector for known-shape signals — that is cv2.aruco + solvePnP running in ~78 µs on the Pi ARM CPU, returning a 6-DoF pose accurate to ~1.7 cm with no GPU, no model weights, and no network. When the target geometry is known in advance (fiducial markers, QR codes, charging-dock shapes, known-class obstacles), classical CV is strictly better than a VLM: 230× faster than Panda's 18 ms GPU+WiFi round-trip, and it cannot hallucinate. Fast detection already lives on Pi and only covers one target today. Everything else in the research — SLAM, semantic maps, temporal EMA, AnyLoc, SigLIP embeddings, Titan strategic planning — layers capability on top of this irreducible quartet. Annie already has all four. The entire multi-query Phase 2 research is about enriching layers 5 through 10, all of which are voluntary enhancements. Hailo-8 activation (430 FPS YOLOv8n, zero WiFi) would be the obvious extension of constraint #4 beyond ArUco: the same "known-shape detector on local silicon" principle, widened from fiducials to the 80 COCO classes. This means Phase 2a (multi-query dispatch) can be deployed confidently because it does not touch the 4-constraint minimum — it only adds information into the layers above safety. (cross-ref Lens 02 for why classical CV is a Pareto improvement, Lens 12 for the idle-hardware blind spot, Lens 14/16 for dual-process and local-first implications.)

Temporal surplus is the free resource. At 1 m/s and 58 Hz, consecutive frames differ by 1.7cm — meaning 57 of 58 frames per second carry near-duplicate scene information. Multi-query time-slicing converts this redundancy into four parallel perception channels at 14–15 Hz each, at zero hardware cost. The research assigns 90% success probability precisely because the physics was always permissive; only the convention was restrictive.

"One query per frame" is the highest-value dissolved constraint. It is a single modulo operation away from yielding scene classification, obstacle awareness, and place-recognition embeddings alongside nav commands. The research (Phase 2a) treats this as a 1-session implementation — accurate, because the hardness is zero once the assumption is named and rejected.

Classical CV is the fourth irreducible constraint. ArUco detection + solvePnP at 78 µs on Pi ARM CPU, pose-accurate to 1.7 cm, is 230× faster than the 18 ms GPU + WiFi VLM round-trip and cannot hallucinate. For any target with known geometry — fiducials, dock shapes, the 80 COCO classes — a local detector beats a remote VLM on latency, reliability, and failure mode. The minimum viable system is a 4-constraint floor, not 3.

The "inference must be remote" assumption is voluntary. The Hailo-8 AI HAT+ on Annie's Pi provides 26 TOPS of idle NPU capacity — enough to run YOLOv8n at 430 FPS locally with sub-10 ms latency and zero WiFi dependence. The Pi-as-dumb-sensor-frontend architecture was a first-pass implementation decision, not a physics constraint. The hardware to dissolve the WiFi-cliff-edge failure mode has been sitting idle the whole time.

The 4-constraint irreducible minimum is already deployed. Lidar ESTOP (collision physics), VLM directional query (goal tracking), IMU heading (drift correction), classical-CV fiducial detection (known-shape grounding). All four run today. Everything in Phase 2 is additive enrichment above this floor, not prerequisite infrastructure — which means the risk profile of the entire research program is lower than it appears.

If you could only keep 4 constraints to make indoor robot navigation work, which 4? And what does the answer reveal about Phase 2's entire roadmap — and about which "remote inference" assumptions are actually voluntary?

The irreducible four are: (1) a local collision gate that operates faster than the robot can hit something — the lidar ESTOP at 10 Hz on Pi, requiring zero network; (2) a directional signal from the VLM faster than ~5 Hz — any query rate above that is sufficient for 1 m/s navigation; (3) an IMU for heading correction, because motor control without heading reference drifts non-deterministically; (4) a local detector for known-shape signals — ArUco + solvePnP at 78 µs on Pi ARM CPU, proving that when target geometry is known, a classical detector is strictly better than a remote VLM on latency (230× faster), reliability (no hallucination), and failure mode (no WiFi). Strip everything else — SLAM, temporal smoothing, semantic maps, AnyLoc, Titan planning — and Annie can still navigate to named goals and dock precisely. The revelation: Phase 2's entire architecture (all 5 phases, 2a through 2e) is about expanding the capability ceiling, not raising the capability floor. The floor is already built. And constraint #4 reveals the bigger voluntary assumption underneath — "inference happens on Panda over WiFi." The Hailo-8 (26 TOPS, idle) could run YOLOv8n at 430 FPS on the Pi itself. The remote-inference architecture is a default, not a requirement. Every Phase 2 element is independently optional and rollback-safe, and there is a whole parallel track — local-silicon detection — that the current roadmap hasn't touched at all.
