LENS 26 — QUESTION HORIZON

"What new questions become askable because of this research?"

---

Research is typically evaluated by the answers it provides. The more productive evaluation is the questions it makes possible to ask for the first time. Before Annie proved 58 hertz monocular VLM navigation on a two-hundred-dollar robot, five of the questions in this analysis were not merely unanswered — they were not yet coherent. "Can one VLM frame serve 4 tasks simultaneously?" presupposes a pipeline fast enough that frame allocation is a meaningful design variable. "Can a semantic map transfer between homes?" presupposes a semantic map at all. "Why does the robot need to understand language?" presupposes a working non-language path worth comparing against. None of these could be seriously asked before the 58 hertz result existed. The research created the conditions for its own successors.

---

BRANCH 1 — NEWLY ASKABLE

Can a single VLM frame serve 4 independent tasks simultaneously?

Before Annie, VLMs were assumed to be single-query tools. The 58 hertz result proves the bottleneck is inference frequency, not task count per frame. The research proposes alternating queries across frames: goal tracking at 29 hertz, with scene classification, obstacle awareness, and place recognition at roughly 10 hertz each. This opens three questions that did not exist before.

First: does attention-head specialization exist at 58 hertz? Can some heads be frozen for navigation while others serve scene queries in the same forward pass?

Second: if query alternation at 29 hertz navigation plus roughly 10 hertz each for scene and obstacle works, what is the minimum navigation frequency before task performance degrades? Is 15 hertz enough? 8 hertz?

Third: does temporal interleaving create phantom correlations between tasks that a truly parallel architecture would not? The alternating-frame design produces outputs where frame 3's obstacle report arrived one frame after frame 2's navigation command.
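That interleaving schedule is concrete enough to sketch. The allocator below is an illustration, not Annie's pipeline: the task names, the six-frame pattern, and the idea that each query is stamped with its frame's capture time are assumptions layered on the published 58 hertz figure.

```python
from itertools import cycle

FRAME_HZ = 58                # end-to-end VLM inference rate from the research
FRAME_DT = 1.0 / FRAME_HZ    # ~17.2 ms between frames

# Every other frame goes to goal tracking (29 Hz); the remaining frames
# rotate through the three auxiliary tasks (~9.7 Hz each, close to the
# stated 10 hertz allocation).
PATTERN = ["goal", "scene", "goal", "obstacle", "goal", "place"]

def schedule(num_frames):
    """Assign one VLM query per frame; record each query's capture time."""
    tasks = cycle(PATTERN)
    return [(i, next(tasks), i * FRAME_DT) for i in range(num_frames)]

for frame, task, t in schedule(6):
    print(f"frame {frame}: {task:<8} captured at t = {t * 1000:5.1f} ms")
```

The timestamps make the lag explicit: the obstacle report on frame 3 is stamped roughly 17 milliseconds after frame 2's navigation command, which is precisely the artifact the third question asks about.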
In a fast-moving scenario, those frames captured different spatial moments. Is the interleaving introducing a systematic lag artifact that true parallelism would avoid?

---

BRANCH 2 — ALMOST ANSWERED

Does EMA temporal consistency make VLM navigation more reliable than sensor fusion?

The research proposes an exponential moving average with alpha = 0.3, producing 86 milliseconds of consistency memory. It almost shows that EMA beats the naive approach, but it never formally compares against Kalman filtering over IMU plus lidar, leaving the key claim unproven. The research gets within one analysis step of its most important implication.

Here is what it almost found: if the EMA variance spike (scene-change detection) correlates precisely with SLAM loop-closure events, the VLM is doing place recognition through the text layer without being asked to. The 150-million-parameter vision encoder would be detecting "I've been here before" as a byproduct of its scene-stability signal. The text-decoding pipeline would be the barrier preventing that signal from being used directly.

The almost-answered question points at the convergence finding from a fourth independent direction. The research got within one step of discovering that EMA variance is already a text-mediated place recognition signal.

---

BRANCH 3 — 10x MULTIPLIER

Can Annie's semantic map transfer between homes?

If the SLAM map is purely metric — defined by coordinates — it cannot transfer. Grandma's kitchen is in a different building. But if the map stores semantic embeddings, "kitchen-ness" is a cluster of visual features that appears near an entrance, adjacent to a refrigerator, with a particular texture profile. That concept is not home-specific. It is culturally stable.

Annie could not ask this question before the research existed. There was no semantic map to transfer. Now there is. Three sub-questions follow.
First: if Annie builds a semantic map in Rajesh's home, how many exploration minutes does she need in Grandma's home to orient herself using the transferred concept graph? This is measurable today with existing hardware.

Second: are there universal semantic anchors — refrigerator equals kitchen, toilet equals bathroom — that survive home transfer without retraining? What fraction of the concept graph is home-specific versus universal?

Third: could a semantic map trained in one home be uploaded as a product SKU, giving new users a head start on exploration?

The fraction of the concept graph that transfers (hypothesis: 60 to 70 percent) minus the fraction that is home-specific (hypothesis: 30 to 40 percent) determines the commercial value of semantic map sharing. That calculation could not be set up before this research existed. It now can.

---

BRANCH 4 — CROSS-FIELD CONNECTION

Can this architecture run entirely text-free?

Text2nav, presented at RSS 2025, achieved 74 percent navigation success using frozen SigLIP embeddings alone: no text decoding, no tokenization, no language. The architecture Annie uses currently routes perception through text ("LEFT MEDIUM") and then back to motor commands. What if the VLM output never became text?

This question connects the navigation problem to cognitive science and animal navigation. Rat hippocampal place cells encode spatial identity directly as activation patterns, not as verbal descriptions of the place. Bees navigate 5 kilometers with a brain of 1 million neurons. Annie uses 2 billion parameters. The architectural gap is not obviously explained by task complexity.

Three sub-questions emerge.

First: if a 3-neuron readout layer trained on 6 months of Annie's own labeled frames maps ViT embeddings directly to motor commands, does it outperform the text-decoding path?

Second: what is the minimum representational bottleneck for spatial navigation?
This question connects robotics to theoretical neuroscience in a way that was not possible before Annie proved a 2-billion-parameter model works on this task.

Third — and this is the one insiders miss: does the text-language bottleneck create alignment with human intent as a side effect? If Annie goes text-free, does she become harder to explain, debug, and correct?

The explainability cost of bypassing language is a genuine trade. Annie's current pipeline produces human-readable traces: "frame 247: VLM said LEFT MEDIUM." A text-free embedding pipeline produces: "frame 247: cosine similarity 0.73 to goal cluster." The numeric trace is less interpretable. The question of whether to bypass text is not purely about navigation accuracy. It is about the debugging cost of removing the language relay.

---

BRANCH 5 — THE OUTSIDER QUESTION

"Why does the robot need to understand language at all?"

An insider would never ask this. The team chose a Vision-Language Model because vision-language models are state of the art. But an outsider from animal cognition or control theory would immediately see the mismatch: the navigation problem is geometric. Language is a communication layer, not a perception layer. The research proves Annie can navigate. The outsider asks whether language was necessary, or just convenient.

Three sub-questions follow.

First: does the text layer contribute more to failure modes (hallucinations, tokenization noise, semantic drift) than it contributes to navigation accuracy?

Second: could Annie navigate as well using the vision encoder only, at 71 hertz with no text-decode overhead, with a learned linear probe mapping ViT patches to 4-command outputs?

Third: if language is retained only at Tier 1 (strategic planning, Annie's goal interpretation) and removed from Tier 2 (tactical VLM perception), what breaks and what gets faster?
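The linear-probe question is concrete enough to sketch. Below is a minimal illustration, assuming a 4-command output set, 768-wide ViT features, and a 14 by 14 patch grid; all three are stand-ins for whatever Annie's encoder actually emits, and the weights are random placeholders rather than trained values.

```python
import numpy as np

COMMANDS = ["FORWARD", "LEFT", "RIGHT", "STOP"]   # assumed 4-command output set
EMBED_DIM = 768                                   # assumed ViT feature width

class LinearProbe:
    """Frozen-encoder probe: mean-pooled ViT patch features -> command logits.
    Training (e.g. logistic regression on logged frames) is out of scope here;
    the weights are random placeholders."""

    def __init__(self, dim, n_commands, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, size=(dim, n_commands))
        self.b = np.zeros(n_commands)

    def logits(self, patch_features):
        pooled = patch_features.mean(axis=0)      # (dim,) mean over patches
        return pooled @ self.W + self.b           # (n_commands,)

    def command(self, patch_features):
        return COMMANDS[int(np.argmax(self.logits(patch_features)))]

probe = LinearProbe(EMBED_DIM, len(COMMANDS))
frame = np.random.default_rng(1).normal(size=(196, EMBED_DIM))  # 14x14 patches
print(probe.command(frame))
```

The sketch makes one point cheaply: the entire decision layer is a single matrix multiply, so bypassing text costs essentially nothing per frame. The open question is accuracy, not speed.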
---

BRANCH 6 — THE DUAL-PROCESS HORIZON (SESSION 119 HARDWARE AUDIT)

Session 119 ran a targeted hardware-inventory pass alongside a literature sweep on dual-process indoor navigation. Two findings emerged at once, and the pair is load-bearing: the literature supplied the architectural pattern, and the inventory supplied the substrate that makes the pattern free to adopt.

First: an IROS paper, arXiv 2601.21506, validating a System 1 / System 2 dual-process pattern for indoor robot navigation — fast reactive detection at 30-plus hertz combined with slow semantic VLM reasoning at 1 to 5 hertz. Result: a 66 percent latency reduction versus continuous VLM, and a 67.5 percent success rate versus 5.83 percent for VLM-only. This is the architectural pattern Annie needs, already validated in peer-reviewed research.

Second: an idle 26 tera-operations-per-second Hailo-8 AI HAT Plus already sitting on Annie's Pi 5, running zero inferences for navigation. It is capable of YOLOv8 nano at 430 frames per second, with under 10 milliseconds of latency and zero WiFi dependency. This is the System 1 substrate the IROS pattern needs, already paid for and mounted on the robot.

Four new questions became askable in this one session.

First, the tuning question: at what VLM query rate does System 2 gating outperform always-on VLM? IROS validated the pattern for their setup; Annie's specific crossover rate — the frequency at which the Hailo L1 should delegate upward to the VLM L2 — is unmeasured. The answer sets the VRAM and latency budget for the whole dual-process stack.

Second, the layer-ratio question: once dual-process lands, what is the right relative rate for L1 Hailo obstacle detection, L2 VLM goal tracking, L3 VLM multi-query scene, and L4 Titan strategic planning? IROS gives one answer for their benchmark; Annie's home-robot task mix may tilt the optimum elsewhere.

Third, the capability question: can the Hailo-8 run open-vocabulary detectors like NanoOWL-lite, or is it structurally limited to closed-class YOLO-family models?
If open-vocab compiles to Hailo, L1 can absorb "door", "kitchen", and "person" queries locally, fusing System 1 speed with System 2 flexibility. If not, Hailo is a safety layer only, and the VLM remains the sole semantic path.

Fourth, and most durable: the meta-question. What other idle compute is in the household that has not been audited? The Hailo-8 discovery was not a design success; nobody designed Annie to use it, and it came with the Pi 5 AI kit. It was a process success: a targeted audit found a previously invisible resource that was already on the robot. The question "what else is hiding in plain sight?" is the durable output of session 119. Four compute tiers are known today: Panda (active), Titan (active), Beast (idle), and the Orin NX 16 gigabyte (idle). Unaudited: phones, laptops, TV system-on-chips, router NPUs. The next move is a household compute census.

---

THE CONVERGENCE FINDING

Three branches converge on one answer from independent starting points: bypass the text-language layer.

Branch 1 arrives through task-parallelism: what if each frame returned embeddings instead of text? Branch 3 arrives through map transfer: what if SLAM cells stored embeddings instead of text labels? Branch 4 arrives through cross-field comparison to cognitive science: what if place recognition used raw ViT features rather than text descriptions? The text2nav result — 74 percent success with frozen SigLIP alone — is the empirical anchor for all three.

These three independent lines of inquiry converge on one architectural change: remove the text-decoding step from the Tier 2 perception loop while retaining text at Tier 1, where language is actually needed to interpret human goals. The text layer currently adds approximately 4 milliseconds of latency, 30 percent of VRAM overhead, semantic compression loss, and hallucination risk — in exchange for human-readable intermediate outputs. The question of whether that trade is worth making is newly askable because Annie proved the navigation loop works.
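A minimal sketch of what that Tier 2 change would look like, assuming SigLIP-style 512-dimensional embeddings and an untuned 0.7 match threshold (both are illustrative assumptions, not measured values):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class EmbeddingGoalTracker:
    """Text-free Tier 2 sketch: score each frame against a stored goal
    cluster directly in embedding space, instead of decoding tokens like
    "LEFT MEDIUM" and parsing them back into motor commands."""

    def __init__(self, goal_embeddings, match_threshold=0.7):
        self.centroid = np.mean(goal_embeddings, axis=0)
        self.threshold = match_threshold   # assumed cutoff, untuned

    def step(self, frame_embedding):
        sim = cosine(frame_embedding, self.centroid)
        return sim, sim >= self.threshold  # the whole "trace" is one number

rng = np.random.default_rng(2)
goal_cluster = rng.normal(size=(8, 512))   # stand-in goal embeddings
tracker = EmbeddingGoalTracker(goal_cluster)

near = goal_cluster.mean(axis=0) + 0.01 * rng.normal(size=512)
sim, at_goal = tracker.step(near)
print(f"frame: cosine similarity {sim:.2f} to goal cluster, at_goal={at_goal}")
```

The trace this produces is exactly the one Branch 4 warns about: one similarity number per frame instead of a readable sentence.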
Before this research, there was nothing to bypass.

---

SESSION 119 ADDENDUM — THE QUESTION ABOUT QUESTIONS

The convergence finding (bypass text at Tier 2) remains the most actionable single architectural decision. The session 119 horizon widens the frame. Before committing to a text-free Tier 2, two prerequisite questions should be answered.

One: at what VLM query rate does System 2 gating outperform always-on VLM? If the answer is "any rate below 15 hertz," then Annie's current 58 hertz VLM is already over budget, and the dual-process split is the first-order architectural move, ahead of the text bypass.

Two: can the Hailo-8 run open-vocabulary detectors? If yes, L1 can do more than safety; it can handle some of the goal-tracking load that currently sits in Tier 2. That shifts what Tier 2 needs to be, and therefore what its right representation is.

The process lesson is the most important output. The Hailo-8 was invisible until a targeted investigation surfaced it. The explicit question "what else is idle?" is the durable instrument. Use it on Beast, on the Orin NX 16 gigabyte, and on the unaudited tiers. The next invisible resource is waiting for the next targeted audit.
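The crossover question in point one can at least be framed on paper before it is measured. Below is a back-of-envelope cost model using two figures quoted in this document (under 10 milliseconds per Hailo detection, a roughly 17 millisecond VLM inference) plus an assumed 30 hertz control loop; the linear cost model itself is an assumption the real measurement would replace.

```python
HAILO_MS = 10.0          # document: under 10 ms per YOLOv8 nano detection
VLM_MS = 1000.0 / 58.0   # document: 58 Hz VLM loop -> ~17.2 ms per inference
CONTROL_HZ = 30          # assumed System 1 control-loop rate

def mean_compute_ms(vlm_hz):
    """Mean compute per control tick: Hailo runs every tick, and the gated
    VLM runs on only vlm_hz of the CONTROL_HZ ticks. vlm_hz == CONTROL_HZ
    is the always-on baseline."""
    vlm_fraction = min(vlm_hz / CONTROL_HZ, 1.0)
    return HAILO_MS + vlm_fraction * VLM_MS

for hz in (1, 5, 15, 30):
    saved = 1.0 - mean_compute_ms(hz) / mean_compute_ms(CONTROL_HZ)
    print(f"VLM gated to {hz:2d} Hz: {mean_compute_ms(hz):5.1f} ms/tick "
          f"({saved:4.0%} below always-on)")
```

The toy model only shows the shape of the answer: most of the savings arrive within a few hertz of VLM rate. Annie's actual crossover has to be measured, not modeled.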