LENS 02

Abstraction Elevator

"What do you see at each altitude?"

SAME SYSTEM, SIX ALTITUDES — THE VIEW CHANGES EVERYTHING
30,000 FT
A robot companion that navigates your home by understanding it

"Go to the kitchen" — understands rooms, recognizes places, avoids obstacles, reports what it sees, builds a living semantic map. Faster perception than Tesla FSD (58 Hz vs 36 Hz).

10,000 FT
4-tier hierarchical fusion: strategic → tactical → reactive → kinematic (post-hoc rationalization; should be 5-tier)

Titan LLM (1 Hz) plans routes on SLAM map → Panda VLM (29–58 Hz) tracks goals and classifies scenes → Pi lidar (10 Hz) enforces ESTOP → Pi IMU (100 Hz) corrects heading drift. The "4" count is a description of how the code happens to be wired — not a first-principles derivation. A 5th tier (on-robot Hailo-8 reflex) is missing and the convention "Pi is sensor-only" is hiding it.

CONVENTION (dissolvable)
"Pi is sensor-only; Panda is the perception brain"

This convention made the 4-tier story read cleanly, but the Pi 5 has an idle Hailo-8 NPU at 26 TOPS sitting on the AI HAT+. YOLOv8n runs on it at 430 FPS with <10ms latency and zero WiFi. Activating it dissolves the 4-tier abstraction into a 5-tier one: a new L1 safety reflex slots below the current reactive tier, on-robot, WiFi-independent. The convention is reversible; the hardware was always there.

3,000 FT
Multi-query alternating dispatch: 6 VLM slots per 58-frame second

Frame 0,2,4: "LEFT MEDIUM" goal-tracking at 29 Hz. Frame 1: "hallway" scene label at 9.7 Hz. Frame 3: "chair" obstacle token at 9.7 Hz. Frame 5: 280-dim ViT embedding at 9.7 Hz. EMA alpha=0.3 smooths noise across frames. Scene variance gate: high variance → cautious mode.
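The slot pattern above reduces to a plain modulo dispatch. A minimal sketch, assuming the query callables and state dict shown here as illustrative stand-ins (this is not the actual NavController code):

```python
EMA_ALPHA = 0.3  # smoothing factor quoted in the text

def ema(prev, new, alpha=EMA_ALPHA):
    """Exponential moving average across frames; first sample passes through."""
    if prev is None:
        return list(new)
    return [alpha * n + (1 - alpha) * p for n, p in zip(new, prev)]

def dispatch(frame_idx, queries, state):
    """Route one camera frame to a VLM query by its slot in the 6-frame cycle.

    `queries` maps roles to callables, standing in for the real VLM calls.
    """
    slot = frame_idx % 6
    if slot in (0, 2, 4):                  # 29 Hz: goal tracking ("LEFT MEDIUM")
        state["goal"] = queries["goal"]()
    elif slot == 1:                        # ~9.7 Hz: scene label ("hallway")
        state["scene"] = queries["scene"]()
    elif slot == 3:                        # ~9.7 Hz: obstacle token ("chair")
        state["obstacle"] = queries["obstacle"]()
    else:                                  # slot 5, ~9.7 Hz: ViT embedding, EMA-smoothed
        state["embedding"] = ema(state.get("embedding"), queries["embedding"]())
    return state
```

Three of six slots go to goal tracking, which is why it runs at half the 58 Hz camera rate while the other queries each land at 58/6 ≈ 9.7 Hz.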

GROUND
cycle_count % N dispatch in NavController._run_loop()

Sonar ESTOP fires at 250mm — absolute gate over all tiers. SLAM cells accumulate scene labels at current pose. _consecutive_none counter is crude EMA precursor. sonar_cm is float | None (None disables safety gate — not 999.0 sentinel). WiFi round-trip latency is uncontrolled here.
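The `float | None` detail above is the interesting part: `None` does not fail safe, it stands the gate down. A minimal sketch of those semantics (threshold from the text; function name and return values are illustrative):

```python
ESTOP_THRESHOLD_CM = 25.0  # 250 mm absolute gate from the text

def safety_gate(sonar_cm):
    """Absolute ESTOP gate over all tiers.

    sonar_cm is float | None. None means the reading is absent, which
    (per the text) disables the gate rather than failing safe. A 999.0
    sentinel would hide that distinction inside an ordinary comparison.
    """
    if sonar_cm is None:
        return "GATE_DISABLED"  # no reading: the gate silently stands down
    if sonar_cm <= ESTOP_THRESHOLD_CM:
        return "ESTOP"
    return "CLEAR"
```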

BYTE LEVEL
18ms/frame, 150M-param ViT, 280-token feature vector, 1–2 token text output

llama-server wraps Gemma 4 E2B — text decoder adds ~4ms on top of 14ms vision encoder. Pico RP2040 sends IMU at 100 Hz over USB serial (GP4/GP5, 100kHz I2C). llama-server cannot expose multimodal intermediate embeddings — blocks Phase 2d without a separate SigLIP 2 sidecar.

PHYSICS
WiFi RF, motor momentum, lidar beam geometry, 1.7cm inter-frame travel at 1 m/s

At 1 m/s consecutive VLM frames differ by only ~1.7cm — EMA is physically valid. WiFi latency spikes to 100ms destroy the clean tier timing model. Motor momentum carries 30° past IMU target at speed 30 — kinematic tier cannot correct what physics delivers late. Lidar blind spot: above-plane obstacles (shelves, hanging objects) are invisible.
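The inter-frame travel figure checks out directly from the two numbers quoted above:

```python
# At 1 m/s and a 58 Hz frame rate, how far does the robot move
# between consecutive VLM frames?
speed_m_s = 1.0
frame_rate_hz = 58.0
travel_cm = speed_m_s / frame_rate_hz * 100.0  # ~1.72 cm per frame
```

Consecutive frames see the world from nearly the same pose, which is the physical justification for treating them as samples of one slowly varying signal and smoothing with an EMA.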

The system looks clean at 10,000 ft: four tiers, each with a defined frequency and responsibility, connected by tidy arrows. Drop to ground level and the first thing you notice is that the tiers are not connected by arrows — they are connected by household WiFi. Titan sits in one room, Panda on a shelf in another room (not on the robot — session 119 corrected a long-standing placement error in the lens narratives), Pi inside the chassis. The "1 Hz strategic plan" reaching Panda from Titan traverses the same 2.4 GHz band as a microwave oven. When WiFi spikes to 100ms — a cliff edge identified by Lens 04 — the clean hierarchy stalls: Panda receives no new plan, Pi receives no new tactical waypoint, and the robot's only active layer is the 10 Hz lidar ESTOP. The architecture diagram shows four tiers collaborating; the physics shows three tiers occasionally collaborating and one tier (reactive ESTOP) running solo. Physical placement was always hidden inside the tier abstraction.

The second leak is semantic. At 30,000 ft the pitch is "navigates to named goals" — rich, spatial, intentional. At ground level the VLM outputs "LEFT MEDIUM": a qualitative direction and a qualitative distance. No coordinates. No confidence score. No map reference. The 10,000 ft diagram shows Tier 1 sending waypoints to Tier 2, but Tier 2's actual output vocabulary has three words for position (LEFT/CENTER/RIGHT) and three for distance (NEAR/MEDIUM/FAR). The semantic map that bridges this gap — Phase 2c, where scene labels attach to SLAM grid cells — does not exist yet. Until it does, "go to the kitchen" means "turn and go toward the thing the VLM recognizes as kitchen-like," which only works if the kitchen is currently in frame.
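The entire ground-level goal vocabulary fits in a few lines. This parser is an illustrative sketch, but the nine-combination output space is exactly what the text describes: no coordinates, no confidence, no map reference survive the trip down the elevator.

```python
# The complete Tier-2 goal vocabulary: a direction word and a distance word.
DIRECTIONS = {"LEFT", "CENTER", "RIGHT"}
DISTANCES = {"NEAR", "MEDIUM", "FAR"}

def parse_goal_token(text):
    """Parse a VLM goal token like 'LEFT MEDIUM' into (direction, distance).

    Returns None for anything outside the nine valid combinations; there
    is nothing richer to recover from the token stream.
    """
    parts = text.strip().upper().split()
    if len(parts) != 2:
        return None
    direction, distance = parts
    if direction in DIRECTIONS and distance in DISTANCES:
        return (direction, distance)
    return None
```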

The third leak is in the kinematic tier — specifically at the hardware boundary between software and motor. The IMU reports heading at 100 Hz and _imu_turn reads it faithfully. But at speed 30, motor momentum delivers 37° of actual rotation when 5° was requested. The Pico RP2040 acts as IMU bridge over USB serial — if it drops to REPL (a crash mode where it silently stops publishing), the kinematic tier goes dark without alerting the reactive or tactical tiers. The system's 4-tier safety model implicitly assumes each tier is healthy; the Pico REPL failure is an abstraction leak where the hardware reality (a microcontroller with an interactive console) bleeds through the software assumption (a reliable 100 Hz heading stream). Lens 01 identified the temporal surplus of 58 Hz as free signal; Lens 02 identifies the fragility of the substrate that produces it.

The deepest leak is the tier-count itself. The "4-tier hierarchy" is a post-hoc rationalization of how components happen to be wired, not a derivation from first principles. The Pi 5 carries a Hailo-8 AI HAT+ with 26 TOPS of NPU throughput that is currently idle for navigation. YOLOv8n runs on it at 430 FPS with <10ms latency and zero WiFi dependency. Activating it dissolves the 4-tier story into a 5-tier hierarchy with a new L1 safety reflex sitting below the current tier-3 lidar ESTOP: on-robot obstacle detection that pre-empts the reactive tier, survives WiFi drops, and gives pixel-precise bounding boxes instead of qualitative "BLOCKED" tokens (detail in Lens 16 on hardware substrate, and Lens 18 on dual-process architectures). The description "Pi is sensor-only, Panda is the perception brain" is not a physical constraint — it is a convention inherited from the WiFi-coupled topology. The future Orin-NX-native robot will collapse L1+L2+L3 onto a single onboard device and the 4-tier/5-tier distinction disappears entirely. Abstraction elevators reveal not just what each altitude shows, but where the floor numbers themselves are arbitrary.

WiFi is the load-bearing abstraction violation. The 4-tier hierarchy diagram implies synchronous communication between tiers. The actual substrate is household 2.4 GHz WiFi with uncontrolled latency spikes to 100ms (Lens 04). When WiFi degrades, the architecture does not degrade gracefully tier-by-tier — it collapses to ESTOP-only operation because the reactive tier is the only one that runs locally on Pi.
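What graceful tier-by-tier degradation would require is visible in a small sketch, assuming each WiFi-fed tier timestamps its last received message (the threshold values here are illustrative, roughly two periods per tier, not measured numbers):

```python
# Staleness deadlines per WiFi-fed tier, ~2x each tier's nominal period.
STALE_AFTER_S = {"strategic": 2.0, "tactical": 0.1}

def active_tiers(last_seen, now):
    """Return which tiers still have fresh input at time `now`.

    The reactive ESTOP is always present: it is the only tier that runs
    locally on the Pi and never crosses the household WiFi link.
    """
    tiers = {"reactive_estop"}  # survives any WiFi drop
    for tier, deadline in STALE_AFTER_S.items():
        if now - last_seen.get(tier, float("-inf")) <= deadline:
            tiers.add(tier)
    return tiers
```

When every timestamp goes stale, the set collapses to exactly the ESTOP-only operation the text describes.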

"LEFT MEDIUM" is the semantic glass ceiling. At 30,000 ft the system navigates to named rooms. At ground level it outputs two-token qualitative directions. The entire Phase 2c roadmap exists to bridge this single abstraction gap: scene labels → SLAM grid cells → queryable semantic map. Until Phase 2c deploys, "go to the kitchen" is an aspirational description of a capability that works only when the kitchen is currently in the camera frame.

The Pico REPL crash is an invisible tier failure. No upper tier detects it — imu_healthy=false surfaces only if the caller checks the health flag. The kinematic tier silently disappears and tactical/reactive tiers continue operating without heading correction, accumulating drift that compounds with every turn. This is the canonical abstraction leak: a hardware state (microcontroller in interactive REPL mode) that bypasses every software-layer health model.
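The missing cross-tier check is small. A sketch, assuming the caller polls a timestamped watchdog (the timeout value is illustrative):

```python
import time

IMU_PERIOD_S = 0.01               # 100 Hz stream from the Pico
IMU_TIMEOUT_S = 5 * IMU_PERIOD_S  # 5 missed packets -> declare the stream dead

class ImuWatchdog:
    """Detects a silently dead IMU stream by packet timestamp, not content."""

    def __init__(self, now=None):
        self.last_packet = now if now is not None else time.monotonic()

    def on_packet(self, now):
        self.last_packet = now

    def healthy(self, now):
        """False once the 100 Hz stream goes silent, e.g. the Pico dropped
        to REPL and stopped publishing without any error message."""
        return (now - self.last_packet) < IMU_TIMEOUT_S
```

The point of the sketch is the failure mode it closes: a REPL'd Pico produces no bad data to validate, only an absence, so only a timestamp check can see it.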

4-tier was always 5-tier — the floor was mislabelled. The Pi 5's 26 TOPS Hailo-8 NPU has been idle the entire time the "4-tier hierarchy" diagram has been circulating. YOLOv8n at 430 FPS, <10ms latency, zero WiFi, on-robot. The diagram described how the code was wired, not how the hardware was provisioned. Once activated, the 5th tier (L1 Hailo reflex) pre-empts the lidar ESTOP and decouples safety from WiFi. The lens elevator taught us altitudes; this taught us that the floor numbers can change when you notice hardware you forgot you owned — and that future Orin-NX robots will collapse L1+L2+L3 into one device, making the tier count itself a transient artifact of current deployment.
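The 5-tier arbitration implied above can be sketched as a priority walk. Tier names follow the text; the arbitration rule itself is an assumption, not the shipped logic:

```python
# Lower index = higher authority; the new L1 Hailo reflex pre-empts everything.
TIER_PRIORITY = [
    "l1_hailo_reflex",   # on-robot YOLO, <10ms, zero WiFi
    "l2_lidar_estop",    # 10 Hz reactive gate (the old tier 3)
    "l3_kinematic",      # 100 Hz IMU heading correction
    "l4_tactical",       # 29-58 Hz VLM goal tracking
    "l5_strategic",      # 1 Hz LLM route planning
]

def arbitrate(commands):
    """Pick the command from the highest-priority tier that proposed one.

    `commands` maps tier name -> proposed command; absent or None entries
    mean that tier is silent this cycle (e.g. its WiFi link is stale).
    """
    for tier in TIER_PRIORITY:
        if commands.get(tier) is not None:
            return tier, commands[tier]
    return None, None
```

Note what the priority list encodes: the two tiers that can always win are exactly the two that live on the robot, which is the decoupling of safety from WiFi that the text argues for.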

If the "4-tier hierarchy" was a post-hoc rationalization, what other diagrams in the stack are describing wiring rather than hardware — and which idle capabilities are hiding behind the labels?

ANALYSIS

The Hailo-8 discovery is a specific instance of a general failure mode: architecture diagrams tend to name components by their current software role rather than their physical capability. "Pi is sensor-only" described a code layout; it did not describe the 26 TOPS NPU sitting unused on the AI HAT+. The same audit applied to the rest of the stack surfaces candidates worth re-examining: Panda's RTX 5070 Ti runs llama-server at ~18ms/frame with headroom for a second model (open-vocab detector, whisper, or SLAM acceleration); Titan's DGX Spark GB10 is described as "the LLM box" but natively runs Isaac Perceptor (nvblox + cuVSLAM) which is idle; the Pico RP2040 is "the IMU bridge" but has 3 unused GPIO pins that could drive a buzzer for operator feedback.

Each of these is a convention that became an abstraction once it entered a diagram. The lens elevator lesson is that the diagram is not the territory — every altitude description is a choice about what to include, and every inclusion is a choice about what to leave out.

What would break if we re-derived the architecture from hardware-first instead of code-first? The tier count would change. Possibly the tier names would change (a Hailo-8 YOLO is technically "reactive perception" not "safety reflex"). Possibly the whole 4/5/6-tier vocabulary is itself a post-hoc rationalization of a continuous latency spectrum. The Orin NX migration will force this question explicitly: when L1+L2+L3 collapse onto one device, what does "tier" even mean? It becomes a latency budget, not a physical partition. The abstraction elevator stops being an elevator and becomes a gradient.