LENS 16: Composition Lab

Core question: What if you combined ideas that weren't meant to go together?

The Composition Lab maps every pairwise combination of Annie's eight subsystems. Six are original: the multi-query VLM running at 54 Hz on Panda, the SLAM occupancy grid, the Context Engine conversation memory, the speech emotion recognizer, the voice agent, and the place embedding extractor. Two were added in the 2026-04-16 session-119 hardware audit: the Hailo-8 L1 reflex layer on the Pi 5, and ArUco classical CV. The matrix now has nine HIGH-rated pairings. That density is unusual and meaningful: it signals that the architecture is at a combinatorial inflection point, and that two of those HIGH pairings are crown jewels on orthogonal axes, memory and motion.

THE CROWN JEWEL: SLAM GRID PLUS CONTEXT ENGINE

The single highest-value combination in the entire matrix is the pairing of the SLAM grid with the Context Engine. Neither system was designed with the other in mind. SLAM is a robotics system: it builds a 2D occupancy map and tracks pose. The Context Engine is a conversation memory system: it indexes transcript segments, extracts entities, and makes them retrievable by BM25 search. But their intersection produces something neither was designed to do: every conversation turn tagged to a room and a timestamp. "Mom sounded worried in the hallway at 08:50, then calmer in the kitchen at 09:14." That is now a retrievable fact, not an interpretation. It comes from cross-referencing a SLAM pose log with a Context Engine transcript index. The map stops being a navigation artifact and becomes a household diary.

The robot doesn't build the map to navigate. It builds the map to remember. Navigation is the side effect; memory is the product. This combination is called the spatial-temporal witness: Annie knows WHERE things happened and WHAT WAS SAID there. It has no precedent in either the robotics literature or the conversational AI literature, because it crosses the boundary between the two fields.
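Neither log format is spelled out here, so as a minimal sketch: assuming the SLAM bridge emits timestamped room labels and the Context Engine exposes timestamped transcript segments (both shapes are hypothetical), the cross-reference is a nearest-earlier-timestamp join.

```python
from bisect import bisect_right

# Hypothetical data shapes; the source does not specify either log format.
# pose_log: (unix_ts, room_label) entries from the SLAM bridge, sorted by time.
# segments: (unix_ts, speaker, text) entries from the Context Engine index.
pose_log = [(1000.0, "hallway"), (1900.0, "kitchen")]
segments = [(1010.0, "Mom", "I can't find my glasses"),
            (1950.0, "Mom", "Ah, here they are")]

def room_at(ts, log):
    """Return the last known room at or before ts (None if not yet localized)."""
    times = [t for t, _ in log]
    i = bisect_right(times, ts) - 1
    return log[i][1] if i >= 0 else None

# The join: every conversation turn tagged to a room and a timestamp.
witness = [(ts, speaker, text, room_at(ts, pose_log))
           for ts, speaker, text in segments]
```

The join direction matters: each segment looks up the most recent pose, so a turn spoken while the robot is between rooms inherits the last confirmed location rather than an interpolated guess.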
THE SECOND CROWN JEWEL: HAILO-8 PLUS VLM DUAL-PROCESS NAVIGATION

The second crown jewel sits on the motion axis, not the memory axis. Unlike the first, it is experimentally validated by outside research, and it is implementable today with hardware Annie already owns. The composition: the Hailo-8 AI HAT+ on the Pi 5 as a System One fast reflex layer, paired with the Panda VLM as a System Two slow semantic layer. The Hailo-8 is a 26-TOPS NPU that has been sitting idle for navigation. It runs YOLOv8-nano at 430 frames per second, under 10 milliseconds per inference, with zero WiFi dependency. The Panda VLM runs Gemma 4 E2B at 54 Hz with full semantic reasoning over WiFi. The IROS paper (arXiv:2601.21506) measured this exact pattern for indoor robot navigation and reported a 66 percent latency reduction versus always-on VLM, and a 67.5 percent navigation success rate versus 5.83 percent for VLM-only. Both parts are already on the robot. No hardware purchase is required. The blocker is not procurement; it is activation.

THE PRODUCTION OFFLINE COMPOSITION: ArUco PLUS CLASSICAL CV

Long before the VLM research landed, Annie shipped an ArUco homing system that runs entirely on the Pi ARM CPU. OpenCV's aruco module detects the fiducial marker. solvePnP with iterative refinement recovers the 6-DoF pose. Lidar sector clearance handles the approach. 78 microseconds per call. No GPU. No WiFi. No cloud. Marker id 23 marks the charging station. When Panda is offline, when WiFi has dropped, when the VLM is unreachable, Annie still homes to the dock using this composition. It is the genuine failover perception stack, and it is already in production.

THE 80 PERCENT COMBINATION

The minimal composition that delivers 80 percent of the spatial-temporal witness value is: multi-query VLM plus SLAM plus scene labels. This is Phase 2a and 2c from the roadmap; no place embeddings required.
Scene labels from VLM scene classification, running at roughly 15 Hz via alternating frames, get attached to SLAM grid cells at the current pose. Over time, rooms emerge from accumulated labels: the kitchen is the cluster of cells labeled "kitchen" across many visits. This is the VLMaps pattern from Google's ICRA 2023 work, adapted to Annie's single-camera setup. This composition is enough to support "Annie, what room am I in?" and "Annie, where did you last see the kitchen table?"

The remaining 20 percent, which covers visual similarity queries, loop closure improvement from place embeddings, and voice-triggered map recall, requires the Phase 2d SigLIP 2 deployment on Panda. Worth building eventually. Not required for the core insight to become operational. The 80 percent combination is a one-session code change: add cycle-count modulo-N dispatch in the NavController run loop, and start logging SLAM pose alongside VLM scene labels.

TRIED AND ABANDONED: MULTI-CAMERA BEV

Tesla's multi-camera bird's-eye-view architecture was explicitly checked and discarded. Annie has one camera. BEV feature projection from eight surround cameras requires geometry from multiple viewpoints, geometry that a single camera cannot provide. But something changed since that exclusion: Phase 1 SLAM was deployed. The SLAM occupancy grid IS a bird's-eye view of the environment, built from lidar rather than camera projection. The geometry that Tesla's surround cameras provide is now provided by lidar and slam_toolbox. The combination was correctly abandoned, but the reason for abandoning it no longer holds. The working alternative, Waymo-style map-as-prior with the VLM handling semantics and SLAM handling geometry, is structurally equivalent to what the Tesla multi-camera approach was trying to achieve. The architecture converged on the right answer via a different path.
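The one-session code change described above (modulo-N dispatch plus pose/label logging) is small enough to sketch in full. The NavController shape, the VLM and SLAM interfaces, and N = 4 are all stand-ins; the source names the class and the technique but not the actual method signatures or scheduling constant.

```python
import time

SCENE_EVERY_N = 4  # assumed divisor; run scene classification every Nth cycle

class NavController:
    """Skeleton run loop; vlm, slam, and log stand in for Annie's real interfaces."""
    def __init__(self, vlm, slam, log):
        self.vlm, self.slam, self.log = vlm, slam, log
        self.cycle = 0

    def step(self, frame):
        self.cycle += 1
        # Modulo-N dispatch: scene classification only on every Nth cycle,
        # leaving the other cycles for the existing navigation queries.
        if self.cycle % SCENE_EVERY_N == 0:
            label = self.vlm.classify_scene(frame)       # e.g. "kitchen"
            pose = self.slam.current_pose()              # (x, y, theta)
            self.log.append((time.time(), pose, label))  # labels accumulate on cells later
```

The log line is the whole point: once (timestamp, pose, label) tuples accumulate, the room clusters described above fall out of the data with no further runtime cost.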
WHAT AN ELDER CARE PRACTITIONER WOULD NATURALLY TRY

A geriatric care practitioner, not a roboticist, would immediately combine SER, the Context Engine, and the Voice Agent, and ignore SLAM entirely. Their framing: "I need to know when Mom sounds distressed, what she said just before, and respond gently." They would build the affective loop: SER tags emotion, the Context Engine stores the emotion with the transcript, the Voice Agent retrieves it and responds with care. The map, the lidar, the IMU: irrelevant to their use case.

This combination is HIGH-rated. SER plus Context Engine gives affectively indexed memory. SER plus Voice Agent gives real-time tone adaptation. Neither requires Phase 1 SLAM. Neither requires Phase 2 VLM capabilities. Both are deployable right now on the existing stack. The elder-care practitioner would be frustrated that the team spent twelve sessions on navigation before wiring up the emotion layer. They are not wrong. Navigation and affective care are parallel development paths with no shared prerequisites. They converge at the crown jewel combination, the spatial-temporal witness, but either can be built first. The matrix reveals that building navigation before affective care was a sequencing decision, not a technical dependency.

THE MOST UNDERESTIMATED IMPLEMENTATION STEPS

Nine HIGH-rated combinations. Two of them are crown jewels on orthogonal axes, and both require less effort than a new subsystem. The memory-axis wire, SLAM plus Context Engine, requires one log line and one API call. The SLAM bridge already publishes pose. The Context Engine already stores conversation segments with timestamps. The composition is: when storing a Context Engine segment, look up the current SLAM pose and attach it as metadata. That is the spatial-temporal witness, implemented.
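The memory-axis wire really is one lookup and one extra field. As a sketch, with `context_engine.store` and `slam_bridge.current_pose` as hypothetical names for the real APIs (the source says pose is published and segments are stored, but not the call signatures):

```python
import time

def store_segment_with_pose(context_engine, slam_bridge, speaker, text):
    """The memory-axis wire: look up the current SLAM pose at store time
    and attach it as metadata on the conversation segment."""
    x, y, theta = slam_bridge.current_pose()
    context_engine.store({
        "ts": time.time(),
        "speaker": speaker,
        "text": text,
        # The one added field; everything else is what the Context Engine
        # already stores today.
        "pose": {"x": x, "y": y, "theta": theta},
    })
```

Because the pose is captured at write time, no retroactive join is ever needed: every segment stored after this change is already spatially tagged.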
The motion-axis wire, Hailo-8 L1 plus Panda VLM L2, requires activating the HailoRT runtime on the Pi, loading a YOLOv8-nano HEF file, and defining the handoff protocol: Hailo fires ESTOP on an imminent obstacle, the VLM handles everything else. The hardware is installed. The software stack is documented. The research is validated. The blocker is prioritization.

The Composition Lab lens reveals that the highest-value work is not building new components; it is connecting existing ones at the right interface point. Build the map to remember. Activate the reflex to move safely. The navigation and the memory come for free once the wires are in.
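The handoff protocol above fits in a single dispatch function. The estop distance threshold and the detection/plan shapes below are assumptions; the asymmetry is the point from the source: the fast layer can only veto, never plan.

```python
from dataclasses import dataclass

ESTOP_DISTANCE_M = 0.35  # assumed threshold; the real handoff value is not in the source

@dataclass
class Detection:
    """One System One (Hailo-8 / YOLOv8-nano) detection; shape is assumed."""
    label: str
    distance_m: float

def dispatch(l1_detections, vlm_plan):
    """System One / System Two handoff: the reflex layer fires ESTOP on an
    imminent obstacle; every other command passes through from the slow VLM."""
    for d in l1_detections:
        if d.distance_m < ESTOP_DISTANCE_M:
            return {"cmd": "ESTOP", "reason": f"{d.label} at {d.distance_m:.2f} m"}
    return vlm_plan  # e.g. {"cmd": "GOTO", "target": "kitchen"}
```

Keeping the reflex layer veto-only means a WiFi dropout degrades the robot to "stop safely" rather than "drive blind", which is exactly the dual-process split the cited navigation result measured.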