LENS 16

Composition Lab

"What if you combined ideas that weren't meant to go together?"

Combination matrix: rows and columns are the eight subsystems (six original + two added from the 2026-04-16 session-119 hardware audit: Hailo-8 L1 reflex + ArUco classical CV). Each cell records what emerges from the pairing, rated HIGH / MEDIUM / LOW, and is assessed for the novel capability produced — a capability that neither subsystem has alone. Self-pairings (diagonal) are omitted.

NEW HIGH COMPOSITIONS (experimentally validated):
  • Hailo-8 L1 reflex + Panda VLM L2 reasoning (dual-process nav) — IROS arXiv 2601.21506 measured a 66% latency reduction and 67.5% navigation success vs 5.83% for VLM-only indoor robot navigation. Annie owns the Hailo-8 AI HAT+ (26 TOPS, currently idle on the Pi 5; YOLOv8n @ 430 FPS local, <10 ms, zero WiFi) and the Panda VLM (Gemma 4 E2B @ 54 Hz, 18–40 ms). No new hardware required. See Lens 17 & 18.
  • ArUco (cv2.aruco) + solvePnP + lidar sector clearance — already in production (ArUco homing system). Runs entirely on Pi ARM CPU, 78 µs/call, no WiFi, no GPU. This is Annie's genuine offline-safe fiducial-target composition.
Pairings (upper triangle; the matrix is symmetric, so each pair appears once):

Multi-Query VLM
  × SLAM Grid: HIGH. Scene labels stamped onto grid cells at SLAM pose → rooms emerge over time (VLMaps). Spatial knowledge that neither lidar geometry nor camera pixels hold alone.
  × Context Engine: HIGH. Obstacle + scene labels fed into conversation memory → "you mentioned tea; Annie was in the kitchen at 09:14." Vision becomes a dimension of episodic recall.
  × SER (Emotion): MEDIUM. Emotion state modulates speed and query cadence → Annie slows and defers obstacle-classification frames when Mom sounds distressed. Affective pacing without a separate motion planner.
  × Voice Agent: HIGH. Voice command "go to the kitchen" resolved by real-time scene classification → Annie navigates to the room labeled "kitchen" by the VLM, not to a hard-coded coordinate. Language grounds to live perception.
  × Place Embeddings: HIGH. Text-labeled scene + SigLIP embedding at the same pose → dual-channel place index: retrievable by description ("near the bookcase") AND by visual similarity. text2nav RSS 2025 validates 74% nav success from frozen embeddings alone.
  × Hailo-8 L1 Reflex: HIGH ⭐ CROWN JEWEL (validated). Dual-process navigation. Hailo-8 = System 1 (fast reflex, 430 FPS, <10 ms, local, 26 TOPS idle on Pi 5). VLM = System 2 (semantic reasoning @ 54 Hz on Panda). IROS arXiv 2601.21506 measured a 66% latency reduction vs always-on VLM and 67.5% nav success vs 5.83% VLM-only. Both parts already owned — this is implementable today. See Lens 17 (hardware tiers) & Lens 18 (failover).
  × ArUco + Classical CV: MEDIUM. VLM seeds a semantic goal ("find the docking station"); ArUco takes over at close range for millimeter-precise approach via solvePnP. Semantic coarse + fiducial fine, handing off at ~1 m. Already partially used in ArUco homing.

SLAM Grid (× Multi-Query VLM: see above)
  × Context Engine: HIGH ⭐ CROWN JEWEL (memory axis). Spatial-temporal witness. SLAM provides WHERE. Context Engine provides WHAT WAS SAID. Together: every conversation is anchored to a room and a time. "Mom sounded worried in the hallway at 08:50" is now a retrievable memory, not a lost signal. Build the map to remember, not navigate. Peer crown jewel on the motion axis: Hailo + VLM dual-process (IROS arXiv 2601.21506, 66% latency reduction, 67.5% vs 5.83% success) — see the VLM × Hailo-8 cell.
  × SER (Emotion): LOW. Grid cells tagged with emotion-at-location data. Technically possible but weak value: room acoustics don't predict emotion, and the SER signal is noisy enough that per-cell tagging produces spurious "anxious hallway" labels.
  × Voice Agent: MEDIUM. Voice goal ("go to bedroom") parsed by Titan LLM → SLAM path planned to the room centroid on the annotated map → waypoints executed. Full Tier 1–4 pipeline. Already designed; needs the semantic map from the VLMaps step first.
  × Place Embeddings: HIGH. Embeddings keyed to SLAM (x, y, heading) → visual loop-closure confirmation on top of scan-matching. AnyLoc (RA-L 2023) + DPV-SLAM (arXiv 2601.02723) validate this pattern. Dual-modality loop closure raises confidence and reduces drift.
  × Hailo-8 L1 Reflex: MEDIUM. Hailo-8 bounding boxes projected through SLAM pose → a tracked-object occupancy layer distinct from lidar geometry. People and pets become first-class map entities, not just lidar returns. Complements the static occupancy grid with a dynamic-object layer.
  × ArUco + Classical CV: HIGH (in production). Fiducial-anchored home base. ArUco id=23 at the charging station provides absolute pose correction when detected; between detections SLAM dead-reckons. cv2.aruco + solvePnP (SOLVEPNP_ITERATIVE) + lidar sector clearance, 78 µs/call on the Pi ARM CPU, zero GPU, zero WiFi. Already shipped in Annie's homing system.

Context Engine (earlier pairings: see above)
  × SER (Emotion): HIGH. Emotion tagged to conversation turns → "Mom sounded anxious when discussing the hospital appointment." The Context Engine becomes affectively indexed: retrieve not just what was said but how it felt. Proactive follow-up triggers on stress patterns. (Lens 21 cross-ref.)
  × Voice Agent: HIGH. Pre-session memory load into voice context → Annie begins each call knowing what Mom said last time. Long-term conversational continuity from the Context Engine bridges short voice sessions. Already implemented in context_loader.py.
  × Place Embeddings: MEDIUM. Conversation entity ("Mom's reading glasses") linked to the best-matching place embedding → "glasses" as a concept resolves to a visual-spatial region, not just a text label. Multi-modal grounding of memory entities. Requires Phase 2d embedding infrastructure first.
  × Hailo-8 L1 Reflex: MEDIUM. Hailo-detected "person in frame" logged to the Context Engine with a timestamp → "who was home at 3pm?" becomes answerable from sensor data, not just conversation. Useful for elder-care presence audit.
  × ArUco + Classical CV: LOW. Fiducial detections at known landmarks logged as conversation anchors. Technically possible but redundant with SLAM-anchored turns; adds no new signal.

SER (Emotion) (earlier pairings: see above)
  × Voice Agent: HIGH. Emotion signal modulates voice-agent tone and response strategy in real time → Annie speaks more gently when SER detects stress, more briskly when calm. Latency matches the voice pipeline (~80–120 ms). The most immediately deployable high-value composition on this matrix.
  × Place Embeddings: LOW. Emotion state at a place embedding → "Annie associates the hallway with stress." Conceptually interesting (an emotional topography of the home) but unreliable: SER noise + a small dataset + confounding by conversation topic produce spurious room-emotion links.
  × Hailo-8 L1 Reflex: LOW. Emotion cross-checked with detected persons-in-frame. Low signal: Hailo is 80-class COCO, no face/identity; SER is audio-only. The fusion adds little.
  × ArUco + Classical CV: LOW. Fiducials are static landmarks; emotion is situational. No meaningful coupling.

Voice Agent (earlier pairings: see above)
  × Place Embeddings: MEDIUM. "Annie, show me where you saw that" → place embedding nearest to the described entity → map UI highlights the grid region. Voice triggers visual recall. Requires Phase 2d + map UI integration. High user delight; medium implementation complexity.
  × Hailo-8 L1 Reflex: MEDIUM. "Annie, stop" fires a direct L1 motor halt via the Hailo-anchored reactive loop without round-tripping through Titan. Voice-triggered ESTOP via the reflex layer, WiFi-independent once the command is parsed locally.
  × ArUco + Classical CV: LOW. "Annie, go home" uses ArUco homing today. The coupling is via the homing tool, not a voice-fiducial fusion per se.

Place Embeddings (earlier pairings: see above)
  × Hailo-8 L1 Reflex: MEDIUM. Hailo-detected object classes at pose feed embedding context → "the chair by the window" becomes a compound query across visual similarity AND object-class presence. Cheap object grounding for embedding lookup.
  × ArUco + Classical CV: MEDIUM. ArUco fiducials act as ground-truth anchors for the embedding manifold → known-location embeddings calibrate the learned place representation. Useful for dataset bootstrapping and drift recalibration.

Hailo-8 L1 Reflex (earlier pairings: see above)
  × ArUco + Classical CV: MEDIUM. Both run on the Pi with no WiFi. Hailo suppresses spurious detections around the ArUco marker region (no false-positive "bottle detected" near the fiducial tag). A tight offline-only perception loop: reactive obstacle avoidance + fiducial anchoring, both local, both WiFi-independent.

ArUco + Classical CV: all pairings covered above (upper triangle).

Legend: HIGH = strong novel capability; MEDIUM = real but dependent on prerequisites; LOW = weak or spurious emergent value.
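A quick way to sanity-check the density claims is to encode the matrix as data. A minimal sketch in Python, with the ratings transcribed from the cells above (the short subsystem names are illustrative, not identifiers from Annie's codebase):

```python
# Upper-triangular pairing matrix transcribed from the cells above.
# Keys are (row, column) subsystem pairs; values are the ratings.
RATINGS = {
    ("VLM", "SLAM"): "HIGH",          ("VLM", "ContextEngine"): "HIGH",
    ("VLM", "SER"): "MEDIUM",         ("VLM", "Voice"): "HIGH",
    ("VLM", "PlaceEmb"): "HIGH",      ("VLM", "Hailo8"): "HIGH",
    ("VLM", "ArUco"): "MEDIUM",
    ("SLAM", "ContextEngine"): "HIGH", ("SLAM", "SER"): "LOW",
    ("SLAM", "Voice"): "MEDIUM",      ("SLAM", "PlaceEmb"): "HIGH",
    ("SLAM", "Hailo8"): "MEDIUM",     ("SLAM", "ArUco"): "HIGH",
    ("ContextEngine", "SER"): "HIGH", ("ContextEngine", "Voice"): "HIGH",
    ("ContextEngine", "PlaceEmb"): "MEDIUM",
    ("ContextEngine", "Hailo8"): "MEDIUM",
    ("ContextEngine", "ArUco"): "LOW",
    ("SER", "Voice"): "HIGH",         ("SER", "PlaceEmb"): "LOW",
    ("SER", "Hailo8"): "LOW",         ("SER", "ArUco"): "LOW",
    ("Voice", "PlaceEmb"): "MEDIUM",  ("Voice", "Hailo8"): "MEDIUM",
    ("Voice", "ArUco"): "LOW",
    ("PlaceEmb", "Hailo8"): "MEDIUM", ("PlaceEmb", "ArUco"): "MEDIUM",
    ("Hailo8", "ArUco"): "MEDIUM",
}

def count(rating):
    """Number of pairings carrying the given rating."""
    return sum(1 for r in RATINGS.values() if r == rating)
```

Counting over this encoding gives eleven HIGH, eleven MEDIUM, and six LOW pairings out of the 28 possible pairs of eight subsystems.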

Most of the research focuses on what each component does in isolation: multi-query VLM at 54 Hz, SLAM occupancy grid at 10 Hz, Context Engine conversation memory, SER emotion at the audio pipeline. The Composition Lab question is different: what happens when two of these systems see each other's output? The matrix above now has eleven HIGH-rated pairings (two added from the 2026-04-16 session-119 hardware audit). That density is unusual. It signals that the architecture has reached a combinatorial inflection point: adding one new component produces multiple new capabilities simultaneously, because each new component has high affinity with each existing one. This is the signature of a well-chosen stack. Two of those HIGH pairings are crown jewels on orthogonal axes: the spatial-temporal witness (SLAM + Context Engine, the memory axis) and the dual-process nav loop (Hailo-8 L1 reflex + Panda VLM L2 reasoning, the motion axis). The motion-axis crown jewel is experimentally validated — IROS arXiv 2601.21506 reports a 66% latency reduction versus always-on VLM and 67.5% navigation success versus 5.83% VLM-only — and both components are already owned: the Hailo-8 AI HAT+ sits idle on the Pi 5 (26 TOPS, YOLOv8n @ 430 FPS local, <10 ms, zero WiFi) and the Panda VLM ships Gemma 4 E2B at 54 Hz. No hardware purchase required. The roadmap question is no longer "can we afford dual-process?" but "why haven't we activated the Hailo-8 yet?"
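The dual-process pattern reduces to a small gating loop: the System 1 reflex runs on every frame and can halt immediately, while the System 2 VLM is consulted only on a slow cadence, and only when the reflex path is clear. A minimal sketch, assuming hypothetical `reflex_detect` and `vlm_query` callables and illustrative thresholds (this is not Annie's actual control code):

```python
import time

REFLEX_CONF_STOP = 0.6   # illustrative: obstacle confidence that triggers an immediate halt
VLM_PERIOD_S = 0.5       # illustrative: cadence for slow System 2 semantic queries

class DualProcessNav:
    """System 1 (fast local reflex) gates System 2 (slow semantic VLM)."""

    def __init__(self, reflex_detect, vlm_query):
        self.reflex_detect = reflex_detect  # e.g. a local detector head, <10 ms per frame
        self.vlm_query = vlm_query          # e.g. a remote VLM, tens of ms per query
        self._last_vlm = 0.0
        self._goal_hint = None

    def step(self, frame, now=None):
        now = time.monotonic() if now is None else now
        # System 1: runs every frame, fully local, never skipped.
        obstacles = self.reflex_detect(frame)
        if any(o["conf"] >= REFLEX_CONF_STOP for o in obstacles):
            return "HALT"                   # reflex path: no VLM round-trip
        # System 2: only on the slow cadence, and only when the reflex path is clear.
        if now - self._last_vlm >= VLM_PERIOD_S:
            self._last_vlm = now
            self._goal_hint = self.vlm_query(frame)
        return self._goal_hint or "CRUISE"
```

The latency win comes from the order of the checks: the halt decision never waits on the VLM, and the VLM is only paid for at its own cadence.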

The offline-safe composition, already in production: ArUco + classical CV + lidar sector clearance. Long before the VLM research landed, Annie shipped an ArUco homing system running entirely on the Pi ARM CPU — cv2.aruco.ArucoDetector + cv2.solvePnP with SOLVEPNP_ITERATIVE, 78 µs per call, marker id=23 at the charging station. No GPU. No WiFi. No cloud. When Panda is offline or WiFi has dropped, this composition still homes Annie to the dock. It is the genuine failover composition: a known fiducial target, a closed-form pose solve, and lidar sector clearance for the approach. The matrix flags this as HIGH (SLAM × ArUco) because it is not hypothetical — it is the composition keeping Annie recoverable during every WiFi outage the household has experienced.
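The handoff logic of that failover composition can be sketched without OpenCV: reduce the solvePnP result to a range and bearing toward the marker, gate all motion on forward lidar clearance, and dead-reckon from SLAM between detections. The function name and thresholds below are illustrative, not the shipped homing code:

```python
import math

CLEAR_M = 0.30        # illustrative: minimum forward lidar clearance to allow motion
DOCK_RANGE_M = 0.05   # illustrative: range at which the approach counts as docked

def homing_step(marker_pose, lidar_sector_m, slam_estimate):
    """One control step of fiducial-anchored homing.

    marker_pose: (range_m, bearing_rad) from the fiducial solve, or None if
                 the marker was not detected this frame.
    lidar_sector_m: nearest lidar return in the forward sector, metres.
    slam_estimate: (range_m, bearing_rad) dead-reckoned from SLAM, used
                   between marker detections.
    """
    if lidar_sector_m < CLEAR_M:
        return ("halt", 0.0)                 # the clearance gate wins over everything
    # Absolute correction when the marker is visible; dead-reckon otherwise.
    rng, bearing = marker_pose if marker_pose is not None else slam_estimate
    if rng <= DOCK_RANGE_M:
        return ("docked", 0.0)
    if abs(bearing) > math.radians(10):
        return ("turn", bearing)             # face the marker first
    return ("forward", min(rng, 0.2))        # creep toward the dock
```

The structure mirrors the production behavior described above: lidar clearance is the outermost check, the fiducial overrides dead reckoning whenever it is seen, and the whole step is cheap enough for a CPU-only loop.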

The crown jewel combination: SLAM grid + Context Engine. Call it the spatial-temporal witness. SLAM provides WHERE Annie is. Context Engine provides WHAT WAS SAID and WHAT WAS FELT. Neither system was designed with the other in mind — SLAM is a robotics system, Context Engine is a conversation memory system. But their intersection produces a capability that has no precedent in either: every conversation turn is tagged to a room and a timestamp. "Mom sounded worried in the hallway at 08:50, then calmer in the kitchen at 09:14" is no longer an interpretation — it is a retrievable fact, composed from a SLAM pose log and a Context Engine transcript index. The map stops being a navigation artifact. It becomes a household diary, written by sensor fusion and read by language models. This is what "build the map to remember, not navigate" means in operational terms. Navigation is the side effect. Memory is the product.
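Operationally, the witness really is "one log line, one API call": append each conversation turn with its SLAM pose and room label, then query by place and feeling. A minimal sketch with a hypothetical JSONL-style schema (field names are illustrative, not the Context Engine's actual format):

```python
import json

def tag_turn(log_lines, transcript, emotion, pose, room, ts):
    """Append one pose-anchored conversation turn (hypothetical schema)."""
    log_lines.append(json.dumps({
        "ts": ts, "room": room,
        "pose": {"x": pose[0], "y": pose[1]},
        "emotion": emotion, "text": transcript,
    }))

def recall(log_lines, room=None, emotion=None):
    """Retrieve turns by place and/or feeling: the spatial-temporal query."""
    hits = []
    for line in log_lines:
        turn = json.loads(line)
        if room and turn["room"] != room:
            continue
        if emotion and turn["emotion"] != emotion:
            continue
        hits.append(turn)
    return hits
```

With two turns logged ("worried" in the hallway at 08:50, "calm" in the kitchen at 09:14), `recall(log, room="hallway", emotion="worried")` returns exactly the hallway turn: the sentence from the paragraph above, as a query result rather than an interpretation.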

The minimal 80% combination: Multi-Query VLM + SLAM + scene labels (Phase 2a + 2c, no embeddings). This is the composition that delivers most of the spatial-temporal witness without the Phase 2d embedding infrastructure (SigLIP 2 on Panda, ~800MB VRAM, complex deployment). Scene labels from VLM scene classification (~15 Hz via alternating frames) attached to SLAM grid cells at current pose is enough to support "Annie, what room am I in?" and "Annie, where did you last see the kitchen table?" The topological richness of place embeddings (visual similarity, loop closure confirmation) can be deferred. The 80% value — a queryable spatial map with room labels, tied to conversation memory — is achievable with one code file change (add cycle_count % N dispatch in NavController._run_loop()) and the Phase 1 SLAM groundwork. The embeddings add the remaining 20%: loop closure improvement, visual similarity queries, and "show me where you saw that" from voice. Worth doing eventually; not required for the core insight to become operationally real.
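The cycle_count % N dispatch named above can be sketched as a single loop body: obstacle queries run every cycle, scene classification fires only on every Nth cycle, and the returned label is stamped onto the grid at the current pose. `run_loop_cycle`, the stub callables, and `SCENE_EVERY_N` are illustrative stand-ins, not the actual NavController internals:

```python
SCENE_EVERY_N = 4   # illustrative: classify the scene on every 4th cycle

def run_loop_cycle(cycle_count, frame, detect_obstacles, classify_scene, grid, pose):
    """One cycle of the hypothetical dispatch: obstacle detection at full loop
    rate, scene classification decimated to every Nth frame, label stamped
    onto the occupancy grid cell at the current pose."""
    obstacles = detect_obstacles(frame)          # runs at full loop rate
    if cycle_count % SCENE_EVERY_N == 0:
        label = classify_scene(frame)            # slower semantic query
        grid[pose] = label                       # room label attached to the cell
    return obstacles
```

The decimation is the whole trick: at a 54 Hz VLM budget shared across query types, alternating frames buys scene labels at roughly 15 Hz without touching the obstacle path.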

Tried and abandoned: multi-camera surround view (Tesla-style). The research explicitly excludes this — Annie has one camera. BEV feature projection, 8-camera surround, and 3D voxel occupancy all require geometry from multiple viewpoints. The research checked this architecture and discarded it. Has anything changed? Not on the hardware side. But the spirit of the exclusion — "we need geometry from multiple angles" — has a partial workaround: SLAM provides the geometry that surround cameras would otherwise supply. SLAM gives the global map; the single VLM camera provides local semantic context. This is structurally equivalent to "camera gives semantics, lidar gives geometry, radar gives velocity" from the Waymo principles. Annie's architecture is not Tesla-inspired (no surround cameras) but IS Waymo-inspired (complementary modalities, map-as-prior). The abandoned combination was correct to abandon; the working alternative is already in the design.

What would someone from elder care naturally try? A geriatric care practitioner — not a roboticist — would immediately combine SER + Context Engine + Voice Agent and ignore SLAM entirely. Their framing: "I need to know when Mrs. X sounds distressed, what she said just before, and respond gently." They would build the affective loop (SER tags emotion → Context Engine stores emotion with transcript → Voice Agent retrieves it → responds with care) without caring at all about navigation. This is the emotion-first lens on the same data. The composition is HIGH-rated (SER + Context Engine, SER + Voice Agent). And notably, it requires none of the Phase 1 or Phase 2 navigation infrastructure — it is deployable right now on the existing voice + SER + Context Engine stack. The elder-care practitioner would be horrified that the roboticist spent 12 sessions on navigation before wiring up the emotion layer. Both are correct. The matrix reveals that navigation and affective care are parallel development paths that share no prerequisites but share the crown-jewel combination (spatial-temporal witness) as their convergence point.
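The affective loop the practitioner would build can be sketched in a few lines: SER tags each turn, the store keeps the tag alongside the transcript, and the voice layer picks a tone from the current signal plus recent history. Labels, thresholds, and canned replies are all illustrative:

```python
def respond(transcript, ser_emotion, memory):
    """One turn of the affective loop (illustrative names throughout):
    the SER label is stored with the transcript, and the reply tone is
    chosen from the current emotion plus the recent stress history."""
    memory.append({"text": transcript, "emotion": ser_emotion})
    recent_stress = sum(1 for t in memory[-5:] if t["emotion"] == "distressed")
    if ser_emotion == "distressed":
        return ("gentle", "That sounds hard. I'm here. Tell me more?")
    if recent_stress >= 2:
        # Proactive follow-up on a stress pattern, even when the current turn is calm.
        return ("gentle", "Earlier you sounded worried. How are you feeling now?")
    return ("neutral", "Got it.")
```

Note the second branch: the Context Engine's contribution is exactly the `memory[-5:]` window, which lets a calm-sounding turn still trigger a gentle follow-up when the recent pattern was stressed.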

NOVA (What this lens uniquely reveals):
  • Eleven HIGH-rated combinations in an eight-system matrix is a signal, not a coincidence. Each component was chosen to be maximally composable — standard interfaces (REST, SLAM pose, JSONL), shared infrastructure (Titan LLM, Panda VLM), and complementary modalities (geometry, semantics, conversation, emotion, reflex, fiducial). The combinatorial density means the project's output is not the sum of its components but the product of their interactions.
  • Crown jewel #1 (memory axis): SLAM pose → Context Engine conversation index. One log line, one API call. The most underestimated implementation step in the entire roadmap.
  • Crown jewel #2 (motion axis), implementable today with owned hardware: Hailo-8 L1 reflex (26 TOPS, idle on Pi 5, YOLOv8n @ 430 FPS, <10 ms, zero WiFi) + Panda VLM L2 reasoning (Gemma 4 E2B @ 54 Hz). IROS arXiv 2601.21506 measured 66% latency reduction vs always-on VLM and 67.5% nav success vs 5.83% VLM-only. This is not a research hypothetical — both parts are physically on the robot right now, and the composition is experimentally validated. The blocker is activation, not procurement.
  • Offline-safe composition already in production: ArUco (cv2.aruco) + solvePnP + lidar sector clearance. 78 µs/call on Pi ARM CPU. Runs even if Panda is powered off and WiFi is dead. This is the household's genuine failover perception stack.
  • Most innovations in this research are not new algorithms — they are new pairings of existing algorithms at the right interface point. The Composition Lab lens reveals that the highest-value work remaining is wiring existing components together at two seams: (a) the spatial-temporal seam (SLAM → CE) and (b) the reflex/semantics seam (Hailo L1 → VLM L2).
THINK (Open questions this lens surfaces):
  • The 80% combination (multi-query VLM + SLAM + scene labels) is a one-session code change. What is blocking it from being the next implementation target? Is it the Phase 1 SLAM deployment prerequisite, or is it a prioritization decision?
  • The spatial-temporal witness stores WHERE things were said. Is there consent infrastructure for this? "Mom, I'm tagging your conversations to rooms" is qualitatively different from "I'm logging your conversations." The spatial dimension of the log needs explicit disclosure.
  • SER + Voice Agent is immediately deployable. If the elder-care use case is the most impactful composition, why is it not the current sprint? Does the navigation research crowd out the affective layer, or do they genuinely run in parallel?
  • The crown jewel combination has a failure mode: SLAM drift corrupts the spatial index. A conversation tagged to "hallway" that actually occurred in the kitchen produces a wrong memory. How does the system signal spatial uncertainty in Context Engine queries?
  • Abandoned: Tesla multi-camera BEV. Not abandoned: Waymo map-as-prior. Both were considered at the same time. What made the Waymo pattern obviously superior for this hardware? Could the BEV idea be partially revived using SLAM occupancy as a synthetic BEV? (Occupancy grid IS a bird's-eye-view, just built from lidar rather than camera projection.)
  • Cross-lens (Lens 06): "Build the map to remember" appears in Lens 06's third-order effects as an emergent discovery. Lens 16 treats it as a first principle. Which is correct — is it a design intention or an emergent consequence? The answer determines whether it should be a spec requirement or a design constraint on the spatial-temporal witness implementation.
  • Cross-lens (Lens 20): multi-modal convergence. The matrix shows SER + Context Engine as HIGH. Lens 20 presumably describes what happens when audio emotion, visual scene, and conversational memory converge simultaneously. Is there a triple composition (SER + VLM + Context Engine) that the matrix undersells by only looking at pairs?
  • Cross-lens (Lens 21): voice-to-ESTOP gap. SER + VLM combined: if SER detects sudden panic in voice AND VLM detects a person suddenly in frame, is there a fast-path composition that bypasses Tier 1 planning entirely and fires a direct ESTOP? The two signals together should be more reliable than either alone as a safety trigger.
  • Cross-lens (Lens 04 & 17 & 18): Hailo-8 is idle compute that turns into a new crown-jewel composition the moment it's activated. Lens 04 (asymmetric tech) and Lens 17 (hardware tiers) should both cross-reference the Hailo × VLM cell. Lens 18 (failover) should treat Hailo L1 + ArUco classical-CV as the two WiFi-independent perception compositions. Are those cross-lenses already flagging this, or does the matrix expose a hole in them?
  • Cross-lens (Lens 14): if Lens 14 addresses first-principles re-derivation, then the dual-process pattern (System 1 reflex + System 2 reasoning) is a textbook example — Kahneman-grade psychology mapping directly onto Pi + Panda hardware. The composition was arrived at in Lens 16 via matrix traversal but would also fall out of a first-principles analysis in Lens 14.