LENS 26

Question Horizon

"What new questions become askable because of this research?"

QUESTION HORIZON — BRANCHING INQUIRY MAP
Annie proved 58 Hz monocular VLM navigation on a $200 robot.
Before this, nobody asked what to do with 58 frames per second on a home robot. Now the surplus is the design space.
BRANCH 1 — NEWLY ASKABLE
Can a single VLM frame serve 4 independent tasks simultaneously?
Before Annie, VLMs were assumed to be single-query tools. The 58 Hz result proves the bottleneck is inference frequency, not task count per frame.
Does attention-head specialization exist at 58 Hz? Can some heads be frozen for nav while others serve scene queries?
If query alternation at 29 Hz nav + 10 Hz scene + 10 Hz obstacle works — what is the minimum nav frequency before task performance degrades?
Does temporal interleaving create phantom correlations between tasks that a truly parallel architecture would not?
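The alternation scheme above (29 Hz nav + 10 Hz scene + 10 Hz obstacle inside the 58 Hz budget) can be sketched as a credit-based frame scheduler. A minimal sketch; the task names, rates, and scheduling policy are illustrative, not the research's implementation:

```python
from dataclasses import dataclass

@dataclass
class TaskSlot:
    """One query stream sharing the 58 Hz frame budget."""
    name: str
    target_hz: float
    credit: float = 0.0  # fractional frames owed to this task

def schedule_frames(slots, frame_hz=58.0, n_frames=58):
    """Give each frame to the task with the most accumulated credit.

    Tasks earn credit proportionally to their target rate; a frame is
    spent only when the leading task has earned a full frame of credit,
    otherwise the frame is surplus ("idle") and free for new tasks.
    """
    assert sum(s.target_hz for s in slots) <= frame_hz
    timeline = []
    for _ in range(n_frames):
        for s in slots:
            s.credit += s.target_hz / frame_hz
        best = max(slots, key=lambda s: s.credit)
        if best.credit >= 1.0:
            best.credit -= 1.0
            timeline.append(best.name)
        else:
            timeline.append("idle")
    return timeline

slots = [TaskSlot("nav", 29), TaskSlot("scene", 10), TaskSlot("obstacle", 10)]
tl = schedule_frames(slots)
counts = {n: tl.count(n) for n in ("nav", "scene", "obstacle", "idle")}
```

Over one second this allocates roughly 29/10/10 frames to the three streams and leaves a handful of surplus frames, which is the design space Branch 1 asks about.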
BRANCH 2 — ALMOST ANSWERED
Does EMA temporal consistency make VLM navigation more reliable than sensor fusion?
The research proposes EMA with alpha=0.3, giving 86 ms of consistency memory, and comes close to showing that EMA beats raw per-frame outputs. But it never formally compares against Kalman filtering over IMU + lidar, leaving the key claim unproven.
Can EMA on VLM outputs (pure monocular) beat Kalman over IMU + lidar on heading estimation? If yes, lidar becomes redundant for goal-tracking.
What is the optimal alpha for EMA in each room type? A cluttered living room needs faster EMA decay than an empty hallway.
Does the variance spike from EMA (scene change detection) correlate precisely with SLAM loop closure events? If so, VLM is predicting SLAM events.
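Branch 2's EMA-plus-variance-spike mechanism can be sketched in a few lines. This is a hypothetical reconstruction: alpha=0.3 comes from the research, but the residual window and 3-sigma threshold are illustrative tuning choices:

```python
import statistics
from collections import deque

class EmaSceneFilter:
    """EMA over a scalar scene signal with a variance-spike detector."""

    def __init__(self, alpha=0.3, window=10, spike_sigma=3.0):
        self.alpha = alpha
        self.ema = None                        # filtered estimate
        self.residuals = deque(maxlen=window)  # recent |x - ema| values
        self.spike_sigma = spike_sigma

    def update(self, x):
        """Return (filtered value, scene-change flag) for one frame."""
        if self.ema is None:
            self.ema = x
            return self.ema, False
        residual = abs(x - self.ema)
        self.ema = self.alpha * x + (1 - self.alpha) * self.ema
        spike = False
        if len(self.residuals) >= 3:
            mu = statistics.mean(self.residuals)
            sd = statistics.pstdev(self.residuals) or 1e-9
            spike = (residual - mu) / sd > self.spike_sigma
        self.residuals.append(residual)
        return self.ema, spike
```

If the spike flag fires at the same timestamps SLAM reports loop closures, that is the correlation Branch 2 asks about.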
BRANCH 3 — 10x MULTIPLIER
Can Annie's semantic map transfer between homes?
If the SLAM map is purely metric (coordinates), it cannot transfer — Grandma's kitchen is in a different building. But if the map is stored as semantic embeddings ("kitchen-ness cluster near entrance"), the concept transfers. Annie never asked this question before because she had no semantic map.
If Annie builds a semantic map in Rajesh's home, how many exploration minutes does she need in Grandma's home to orient herself using the transferred concept graph?
Are there universal semantic anchors (refrigerator = kitchen, toilet = bathroom) that survive home transfer? What fraction of the concept graph is home-specific vs universal?
Could a semantic map trained in one home be uploaded to a product SKU, giving new users a head-start on exploration? This is the "map as product" business model question — only askable because Annie proved semantic labeling works.
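The transfer mechanism Branch 3 hypothesizes (match new-home observations against anchors carried over from the old home) reduces to cosine similarity over embeddings. A toy sketch with hand-made 3-dimensional vectors standing in for real SigLIP embeddings; the anchor names and 0.8 threshold are assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_anchor(observation, concept_graph, threshold=0.8):
    """Match one observed embedding against a transferred concept graph.

    concept_graph maps anchor names ('kitchen', 'bathroom') to embeddings
    learned in the source home; an observation in the new home reuses the
    same label only if similarity clears the threshold.
    """
    best_name, best_sim = None, -1.0
    for name, anchor in concept_graph.items():
        sim = cosine(observation, anchor)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return (best_name, best_sim) if best_sim >= threshold else (None, best_sim)

graph = {"kitchen": [0.9, 0.1, 0.0], "bathroom": [0.0, 0.2, 0.9]}
label, sim = match_anchor([0.85, 0.15, 0.05], graph)
```

The fraction of anchors that match above threshold in a second home is the universal-vs-home-specific split the branch asks about.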
BRANCH 4 — CROSS-FIELD
Can this architecture run entirely text-free?
Text2nav showed frozen SigLIP embeddings alone achieve 74% navigation success. The architecture currently routes perception through text ("LEFT MEDIUM") then back to motor commands. What if the VLM output never became text? This connects Annie's nav problem to cognitive science (how do bees navigate without language?) and animal navigation (rat hippocampal place cells store spatial identity directly in activation patterns, not descriptions).
If a 3-neuron readout layer trained on 6 months of Annie's own labeled frames maps ViT embeddings directly to motor commands, does it outperform the text-decoding path? (Convergent with Lens 01's "temporal surplus as training signal".)
What is the minimum representational bottleneck for spatial navigation? Bees navigate 5 km with a brain of 1 million neurons; Annie uses a 2-billion-parameter model. What explains the architectural gap?
Does the text-language bottleneck create alignment with human intent as a side effect? If Annie goes text-free, does she become harder to explain, debug, and correct? (The explainability cost of bypassing language.)
BRANCH 5 — OUTSIDER QUESTION
"Why does the robot need to understand language at all?"
An insider would never ask this — the team chose a VLM because vision-language models are state of the art. But an outsider from animal cognition or robotics theory would immediately point out: the robot's goal (navigate to kitchen, avoid obstacles) is a geometric problem. Language is a communication layer, not a perception layer. The research proves Annie can navigate. The outsider asks whether language was necessary, or just convenient. This question connects to Lens 08's neuroscience mechanisms — specifically the observation that rat hippocampal place cells encode space as activation patterns, not as verbal descriptions.
Does the text layer contribute more to failure modes (hallucinations, tokenization noise, semantic drift) than it contributes to navigation accuracy?
Could Annie navigate as well using the vision encoder only — at 71 Hz (no text decode overhead) — with a learned linear probe mapping ViT patches to 4-command outputs?
If language is retained only for Tier 1 (strategic planning, Annie's goal interpretation), and removed from Tier 2 (tactical VLM perception), what breaks and what gets faster?
BRANCH 6 — DUAL-PROCESS HORIZON (SESSION 119 HARDWARE AUDIT)
A session 119 hardware audit surfaced an idle 26 TOPS accelerator and a peer-reviewed dual-process pattern. Each is a question-generator.
A targeted hardware-inventory pass found the Hailo-8 AI HAT+ already mounted on Pi 5 (26 TOPS, idle for navigation) and the IROS paper (arXiv 2601.21506) validating a System 1 / System 2 dual-process architecture for indoor robot navigation (66% latency reduction, 67.5% vs 5.83% success). Each fact generates questions that did not exist before the audit.
↳ ARCHITECTURAL (TUNING): At what VLM query rate does System 2 gating outperform always-on VLM? IROS validated the pattern at their setup; Annie's specific crossover frequency (Hailo L1 at 30 Hz, VLM L2 decision at N Hz) is unmeasured. The answer sets the VRAM and latency budget for the entire dual-process stack.
↳ ARCHITECTURAL (LAYER RATIOS): Once dual-process lands, what's the right relative rate for L1 (Hailo obstacle, 30+ Hz) / L2 (VLM goal-tracking, 15-27 Hz) / L3 (VLM multi-query scene, 5-9 Hz) / L4 (Titan strategic planning, 1-2 Hz)? IROS gives one answer for their benchmark; Annie's home-robot mix of tasks may tilt the optimum elsewhere.
↳ CAPABILITY (HAILO OPEN-VOCAB): Can Hailo-8 run open-vocabulary detectors like NanoOWL-lite, or is it structurally limited to closed-class YOLO-family models? If open-vocab compiles to Hailo, L1 can take "door", "kitchen", "person" queries locally — fusing System 1 speed with System 2 flexibility. If not, Hailo is a safety layer only and the VLM remains the sole semantic path.
↳ PROCESS (META-QUESTION): What other idle compute is in the household that hasn't been audited? The Hailo discovery was a process success, not a design success — it was already on the robot, paid for, and invisible until a targeted investigation surfaced it. If Hailo hid in plain sight, what else is hiding? Three other tiers are known idle or underused: Beast, Orin NX 16 GB, and whatever unaudited compute lives in the house (phones, laptops, TV SoCs).
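The System 1 / System 2 split the audit points at can be sketched as a confidence-gated escalation rule: the fast local path answers every tick, and the VLM is consulted only when confidence drops and a rate limit allows it. All names and thresholds here are illustrative, not values from the IROS paper:

```python
def dual_process_step(l1_detect, l2_vlm, frame, state,
                      conf_floor=0.6, l2_min_interval_s=1 / 9):
    """One control tick of a hypothetical System 1 / System 2 gate.

    l1_detect: fast local detector (Hailo-side stand-in), returns
               (command, confidence) for every tick.
    l2_vlm:    slow semantic path, returns a command; invoked only when
               L1 confidence is low and the rate limiter permits.
    state:     dict carrying 'now' and 'last_l2_t' in seconds.
    """
    cmd, conf = l1_detect(frame)
    l2_due = state["now"] - state["last_l2_t"] >= l2_min_interval_s
    if conf < conf_floor and l2_due:
        cmd = l2_vlm(frame)           # escalate to System 2
        state["last_l2_t"] = state["now"]
    return cmd
```

Sweeping conf_floor and l2_min_interval_s against an always-on VLM baseline is exactly the tuning question this branch raises.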
CONVERGENCE POINT — THE CROWN JEWEL
Three independent branches converge on: bypass the text-language layer.
BRANCH 1 asks
"What if VLM outputs embeddings instead of text?" → Vision encoder at 71 Hz, no decoding. Attention-head specialization by task.
BRANCH 3 asks
"What if the SLAM map stored visual embeddings instead of occupancy?" → Transferable semantic maps. Place recognition as cosine similarity.
BRANCH 4 asks
"What if place recognition used raw ViT features instead of text descriptions?" → Text2nav (RSS 2025): 74% success with frozen SigLIP alone.
All three point at the same single architectural change: remove the text-decoding step from the Tier 2 perception loop. The text layer adds ~4 ms latency, ~30% VRAM overhead, semantic compression loss, and hallucination risk — in exchange for human-readable intermediate outputs. The question of whether that trade is worth making is newly askable because Annie proved the nav loop works. Before this research, there was nothing to bypass.
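The latency side of that trade can be sanity-checked from the two loop rates the document reports. A back-of-envelope sketch, assuming both the 58 Hz (with text decode) and 71 Hz (encoder-only) figures are accurate:

```python
def period_ms(hz):
    """Per-frame period in milliseconds for a given loop rate."""
    return 1000.0 / hz

# Implied per-frame cost of the text-decode step: the difference between
# the full-pipeline period (58 Hz) and the encoder-only period (71 Hz).
decode_ms = period_ms(58) - period_ms(71)
```

This works out to roughly 3.2 ms per frame, broadly consistent with the ~4 ms figure cited above; the residual gap would be per-frame overheads outside the decode step itself.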

Research is typically evaluated by the answers it provides. The more productive evaluation is the questions it makes possible to ask for the first time. Before Annie proved 58 Hz monocular VLM navigation on a $200 robot, five of the questions in this analysis were not merely unanswered — they were not yet coherent. "Can one VLM frame serve 4 tasks simultaneously?" presupposes a pipeline fast enough that frame allocation is a meaningful design variable. "Can a semantic map transfer between homes?" presupposes a semantic map at all. "Why does the robot need to understand language?" presupposes a working non-language path worth comparing against. None of these could be seriously asked before the 58 Hz result existed. The research created the conditions for its own successors.

The most structurally important of the original five branches is Branch 5: the outsider question "why does the robot need to understand language at all?" It is structurally important because insiders cannot ask it. The team chose a Vision-Language Model — language is in the name. Language is assumed. The outsider, arriving from animal cognition or control theory, immediately sees the mismatch: the navigation problem is geometric (where am I, where is the goal, what is between me and the goal) and the robot is solving it by translating geometry into natural language and then translating language back into geometry. The text layer is a relay station between two signal types that don't need an interpreter. An ant colony navigating complex terrain does not pass its pheromone gradients through a language model. Lens 08 makes the same observation from neuroscience: rat hippocampal place cells encode spatial identity directly as activation patterns, not as verbal descriptions of the place. The text-language layer is the architecturally interesting thing to remove — and that question only becomes askable once the research proves the vision encoder already has everything needed for navigation without it.

Three branches converge on the same answer from independent starting points: bypass the text-language layer. Branch 1 arrives there through task-parallelism (what if embeddings instead of text for each frame?), Branch 3 arrives through map transfer (what if SLAM cells stored embeddings instead of text labels?), and Branch 4 arrives through cross-field comparison to cognitive science and animal navigation (what if place recognition used raw ViT features rather than text descriptions?). The text2nav result (RSS 2025) — 74% navigation success with frozen SigLIP embeddings alone — is the empirical anchor for all three. These three lines of inquiry converge on one architectural change: remove the text-decoding step from the Tier 2 (tactical, 58 Hz) perception loop while retaining text at Tier 1 (strategic, 1-2 Hz) where language is actually needed to interpret human goals. The convergence is not coincidence. It reflects the structure of the research: the research built a system that works, and the bottleneck that now stands between "working" and "excellent" is the translation overhead the system inherited from its model class rather than from its task.

Branch 2 — the almost-answered question about EMA temporal consistency — is worth examining precisely because the research stops just short of its most important implication. The research proposes EMA alpha=0.3 producing 86 ms of consistency memory, and notes this filters single-frame hallucinations. What it never asks: does EMA on VLM outputs predict SLAM loop closure events? If Annie's scene variance spikes every time SLAM independently detects a revisited location, the VLM is doing place recognition through the text layer without being asked to. This would mean the 150M-parameter vision encoder already detects "I've been here before" as a byproduct of its scene stability signal, and the text decoding pipeline is the barrier preventing that signal from being used directly. The almost-answered question points at the convergence point from yet another direction. The research got within one analysis step of discovering that EMA variance is already a text-mediated place recognition signal.

Branch 3 — the 10x multiplier question — is the one with the clearest business consequence. If Annie's semantic map transfers between homes (because it stores concept embeddings rather than room coordinates), the map becomes a product distinct from the robot. A new user's Annie could bootstrap orientation in an unfamiliar environment from a pre-trained concept graph rather than requiring full blind exploration. "Kitchen-ness," "bathroom-ness," and "living-room-ness" are not home-specific — they are culturally stable semantic clusters. The split between the fraction of the concept graph that transfers (hypothesis: 60-70%) and the fraction that is home-specific (hypothesis: 30-40%) determines the commercial value of semantic map sharing. That calculation could not be set up before this research existed. It now can.

Branch 6 — the dual-process horizon opened by session 119 — is the first branch that was not visible at the time of the primary research and became visible only because a targeted hardware-inventory pass ran in parallel with a literature sweep. Two findings emerged at once: the IROS 2601.21506 result (System 1 / System 2 dual-process, 66% latency reduction, 67.5% vs 5.83% success on indoor robot nav) and an idle 26 TOPS Hailo-8 AI HAT+ already paid for and mounted on Annie's Pi 5 — running zero inferences for navigation, capable of YOLOv8n at 430 FPS in under 10 ms with no WiFi dependency. The pair is load-bearing: IROS supplies the architectural pattern and Hailo supplies the substrate that makes the pattern free to adopt.

Four new questions became askable in a single session: the tuning question (at what query rate does System 2 gating win?), the layer-ratio question (what are the optimal relative Hz for L1/L2/L3/L4 once dual-process lands?), the Hailo capability question (can it run NanoOWL-lite open-vocabulary, or only closed-class YOLO?), and the meta-question (what other idle compute is in the house that nobody has audited?).

The meta-question is the one that propagates beyond this research. The Hailo-8 was not a design success — nobody designed Annie to use it; it came with the Pi 5 AI kit. It was a process success: a targeted audit found a previously-invisible resource. The explicit question "what else is idle?" is the durable output of session 119, and it points at Beast, Orin NX 16 GB, and unaudited household compute (phones, laptops, TV SoCs) as the next places to look.

Nova: The convergence finding is the most actionable output of this lens. Three question branches independently reach the same answer: remove text decoding from Tier 2 perception. The implementation path is sequenced: (1) profile text-decode latency separately from vision-encode latency in the current llama-server pipeline to confirm the 4 ms claim; (2) deploy a SigLIP 2 ViT-SO400M as a dedicated embedding extractor on Panda (~800 MB VRAM, already identified in Part 2 of the research); (3) train a small probe (a linear readout or shallow MLP) mapping SigLIP embeddings to {LEFT, CENTER, RIGHT} × {SMALL, MEDIUM, LARGE} using 6 months of Annie's labeled frame logs; (4) A/B test the embedding path vs the text path on identical routes. The question "does the text layer help or hurt Tier 2 navigation?" is now answerable with 3 months of Annie's existing data. Before this research, there was no question to test.
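Step (3) can be sketched as a 9-way readout over frame embeddings. A toy stand-in: the class set matches the {LEFT, CENTER, RIGHT} × {SMALL, MEDIUM, LARGE} grid above, but the 4-dimensional vectors and perceptron-style updates are illustrative placeholders for real SigLIP embeddings and real training:

```python
DIRS = ("LEFT", "CENTER", "RIGHT")
MAGS = ("SMALL", "MEDIUM", "LARGE")
CLASSES = [f"{d} {m}" for d in DIRS for m in MAGS]  # 9-way output head

def predict(probe_w, probe_b, embedding):
    """Linear readout: argmax over the 9 class logits for one embedding."""
    logits = {c: sum(w * x for w, x in zip(probe_w[c], embedding)) + probe_b[c]
              for c in CLASSES}
    return max(logits, key=logits.get)

def perceptron_step(probe_w, probe_b, embedding, label, lr=0.1):
    """One multiclass perceptron update (toy stand-in for real training)."""
    pred = predict(probe_w, probe_b, embedding)
    if pred != label:
        probe_w[label] = [w + lr * x for w, x in zip(probe_w[label], embedding)]
        probe_b[label] += lr
        probe_w[pred] = [w - lr * x for w, x in zip(probe_w[pred], embedding)]
        probe_b[pred] -= lr
    return pred
```

The A/B test in step (4) then compares this readout's commands against the text-decoded commands on identical routes.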

Session 119 addendum — discovered questions that were previously invisible:
  • The idle Hailo-8 (26 TOPS) on Pi 5 was structurally invisible until a targeted hardware-inventory pass surfaced it as an alternative System 1 substrate. It runs YOLOv8n at 430 FPS with sub-10 ms latency and zero WiFi dependency — making the System 1 / System 2 split (IROS arXiv 2601.21506, 66% latency reduction, 67.5% vs 5.83% success) a hardware-feasible architecture today, not a future aspiration.
  • The dual-process tuning question (at what VLM query rate does System 2 gating beat always-on VLM?) is now the blocking unknown for adopting the IROS pattern. Measurement setup: run the Hailo L1 + VLM L2 stack and sweep L2 rates from 1 Hz to 27 Hz on identical routes, measuring success-rate and p95 decision latency.
  • The layer-ratio question (L1 30 Hz / L2 15-27 Hz / L3 5-9 Hz / L4 1-2 Hz — optimal for Annie's actual task mix?) is separable from the tuning question and can be swept independently once L1 is live.
  • The meta-question is the durable output: what other idle compute is in the household? Four known tiers are active or idle today — Panda (active), Beast (idle), Orin NX 16 GB (idle), Titan (active). Unaudited: phones, laptops, TV SoCs, router NPUs. Propose a "household compute census" as a one-shot exercise; its output is a durable resource-registry appendix.
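The measurement setup in the second bullet can be sketched as a small sweep harness. `run_route` is a hypothetical callable standing in for the real Hailo-L1 + VLM-L2 stack; the rate grid and repeat count are illustrative:

```python
import math

def p95(samples):
    """95th percentile via nearest-rank on the sorted samples."""
    s = sorted(samples)
    return s[max(0, math.ceil(0.95 * len(s)) - 1)]

def sweep_l2_rates(run_route, rates_hz=(1, 3, 9, 15, 27), repeats=20):
    """Sweep System 2 query rates over identical routes.

    run_route(hz) must return (success: bool, decision_latencies_ms: list)
    for one route run at the given L2 rate.
    """
    results = {}
    for hz in rates_hz:
        outcomes = [run_route(hz) for _ in range(repeats)]
        successes = sum(1 for ok, _ in outcomes if ok)
        latencies = [l for _, ls in outcomes for l in ls]
        results[hz] = {"success_rate": successes / repeats,
                       "p95_ms": p95(latencies)}
    return results
```

The crossover rate — the lowest hz whose success_rate matches always-on VLM within noise — answers the tuning question; p95_ms tracks the latency cost of each setting.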
Think: The outsider question (Branch 5) carries a hidden cost that insiders should not dismiss. If language is removed from the Tier 2 perception loop, the intermediate representation becomes opaque to debugging. When Annie navigates incorrectly, the current pipeline produces a human-readable trace: "frame 247: VLM said LEFT MEDIUM, but EMA said CENTER, so planner chose CAUTIOUS." That trace is why bugs are findable. A text-free embedding pipeline produces: "frame 247: cosine similarity 0.73 to goal cluster, routing to sector 2." The numeric trace is less interpretable. The question of whether to bypass text is not purely about navigation accuracy — it is about the explainability cost of removing the language relay.

Lens 14 observed that the research describes the Waymo pattern (lidar-primary) then does the opposite (VLM-primary). There is an analogous inversion here: the research builds toward language-grounded semantic maps (VLMaps pattern) and simultaneously identifies reasons to remove language from the perception loop. Both cannot be maximally true. The question horizon forces the explicit choice: is language in the loop for human debugging convenience, or for navigation performance? That question, now askable, deserves an explicit answer before Phase 2 commits to an architecture.

Cross-reference Lens 01 (temporal surplus as free signal): the 86 ms EMA window is itself a form of temporal surplus, and the question of whether that surplus is being used optimally (smoothing vs prediction vs place-recognition) is unresolved. Cross-reference Lens 05: if the semantic map transfers between homes, the privacy model changes — the transferred map carries behavioral signals about how people organize their living spaces.