LENS 05

Evolution Timeline

"How did we get here and where are we going?"

2019–2020

Active Neural SLAM

The foundational hybrid: CNN-predicted occupancy from RGB-D + classical A* planner + learned global policy for "where to explore next." Solved the blind-robot problem — gave robots a persistent spatial model. Bottleneck it removed: global memory (pure reactive systems forgot where they had been). Bottleneck it exposed: the CNN knew geometry but not meaning — it could map a chair as an obstacle but not understand that the chair means "living room."

2022

SayCan / Inner Monologue — Language Enters the Loop

LLMs began mediating between human instruction and robot action. SayCan scored candidate actions by both LLM feasibility and robot affordance. Inner Monologue closed the loop: VLM provides scene feedback → LLM revises plan → robot acts again. Bottleneck removed: instruction parsing — robots could now accept "go to the kitchen" rather than hand-coded waypoints. Bottleneck exposed: LLMs had no spatial grounding. They knew kitchens exist but not where this kitchen is on this map.

2023

VLMaps + AnyLoc — Semantics Fused Into Space

VLMaps (Google, ICRA 2023) solved the grounding gap: dense CLIP/LSeg embeddings projected onto 2D occupancy grid cells during exploration. "Where is the kitchen?" becomes a cosine similarity search on spatially indexed embeddings — no pre-labeling required. AnyLoc (RA-L 2023) solved the inverse: DINOv2 + VLAD for universal place recognition across indoor/outdoor/underwater without retraining. Bottleneck removed: semantic grounding — robots could navigate to named places. Bottleneck exposed: all of this required offline exploration sweeps, dense GPU compute, and a robot that had already seen the environment.
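The VLMaps-style query described above reduces to a nearest-neighbor search: normalize the per-cell embeddings and the text embedding, take dot products, return the argmax cell. A minimal illustrative sketch, not the VLMaps codebase; the array shapes and function name are made up:

```python
import numpy as np

def query_semantic_map(cell_embeddings, text_embedding):
    """Return the grid cell whose stored embedding best matches a text query.

    cell_embeddings: (H, W, D) array of per-cell visual-language embeddings
    text_embedding:  (D,) embedding of e.g. "kitchen" from the same model
    """
    H, W, D = cell_embeddings.shape
    flat = cell_embeddings.reshape(-1, D)
    # Cosine similarity = dot product of L2-normalized vectors
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    text = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    scores = flat @ text
    best = int(np.argmax(scores))
    return divmod(best, W)  # (row, col) of best-matching cell
```

Because the query is just a dot product over already-indexed vectors, no pre-labeling is needed: any phrase the embedding model understands becomes a navigable goal.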

2024–2025

OK-Robot + GR00T N1 — Pragmatic Integration & Dual-Rate Action

OK-Robot (NYU, CoRL 2024) demonstrated 58.5% pick-and-drop success in real homes using only off-the-shelf CLIP + LangSam + AnyGrasp. Their explicit finding: "What really matters is not fancy models but clean integration." GR00T N1 (NVIDIA, 2025) formalized dual-rate architecture: VLM runs at 10 Hz for high-level reasoning, action tokens stream at 120 Hz for smooth motor control. Bottleneck removed: deployment gap — academic systems became reproducible in real homes. Bottleneck exposed: these systems still required multi-GPU inference infrastructure or pre-built robot platforms. Nothing ran on a $35 compute board.

2024–2025

Tesla FSD v12 — End-to-End Neural Planner (Automotive Scale)

Tesla replaced 300,000 lines of C++ with a single neural net. FSD v12's planner is trained on millions of human driving miles — the neural net is the policy. Running at 36 Hz perception, it demonstrated that with sufficient data, the classical planning stack becomes unnecessary. Bottleneck removed: edge-case brittleness of hand-coded rules. Bottleneck exposed: this approach is strictly fleet-scale. One robot, one home, one user — zero training data. The "end-to-end or nothing" framing is a false dichotomy for low-volume robotics.

2025–2026 (Annie)

58 Hz VLM-Primary + SLAM Hybrid — Faster Than Tesla, Purpose-Built for One Home

Annie's Gemma 4 E2B on Panda runs at 54–58 Hz — faster than Tesla FSD's 36 Hz perception loop — on a single Raspberry Pi 5 + Panda edge board. The 4-tier hierarchy: Titan LLM at 1–2 Hz (strategic), Panda VLM at 10–54 Hz (tactical multi-query), Pi lidar at 10 Hz (reactive), Pi IMU at 100 Hz (kinematic). The multi-query pipeline allocates surplus 58 Hz capacity across goal-tracking (29 Hz), scene classification (10 Hz), obstacle description (10 Hz), and place embedding (10 Hz). Fusion rule: VLM proposes, lidar disposes, IMU corrects. Bottleneck removed: single-task VLM waste — 58 Hz on one prompt was underutilizing available perception bandwidth. Bottleneck now exposed: the VLM still speaks in text tokens. "LEFT MEDIUM" is a language-mediated navigation signal. The gap between language output and motor command is a translation step that adds latency, ambiguity, and brittleness. The next evolution will bypass text entirely.
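The allocation arithmetic above can be sketched as a simple slot scheduler: every other frame goes to goal-tracking (half of ~58 Hz is ~29 Hz), and the remaining slots rotate through the three roughly-10 Hz tasks (29/3 ≈ 10 Hz each). This is an illustrative reconstruction, not Annie's actual scheduler; the names are made up:

```python
from itertools import cycle

# Odd/even alternation gives goal-tracking half the ~58 Hz budget; the
# leftover slots cycle through the three ~10 Hz perception tasks.
_secondary = cycle(["scene_classification", "obstacle_description", "place_embedding"])

def next_prompt(slot: int) -> str:
    """Pick the VLM prompt for frame number `slot` (hypothetical API)."""
    if slot % 2 == 0:
        return "goal_tracking"   # every even slot: 58/2 ≈ 29 Hz
    return next(_secondary)      # odd slots: ≈ 29/3 ≈ 10 Hz per task
```

The point of the rotation is that one VLM at a fixed frame rate can service four logical perception streams without any extra inference capacity.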

2026-Q2/Q3 (Annie, next inflection)

Hailo-8 L1 Activation — Dual-Process Architecture Lands On-Robot

The quiet fact the 58 Hz VLM era concealed: Annie's Pi 5 already carries a Hailo-8 AI HAT+ at 26 TOPS that has been idle for navigation this entire time. The next evolution is not a new model — it is activating the NPU we've been ignoring. YOLOv8n at 430 FPS local with <10 ms latency and zero WiFi dependency becomes the L1 safety layer; the Panda VLM stays as L2 semantic reasoning. This is the System 1 / System 2 pattern validated by the IROS 2026 paper (arXiv 2601.21506): fast reactive obstacle detection on-device + slow semantic reasoning off-device yielded 66% latency reduction and 67.5% success vs 5.83% VLM-only. The single-query VLM-over-WiFi era ends here. Bottleneck removed: WiFi-coupled safety — when the network stutters, Annie no longer goes blind. Bottleneck it exposes: the split-brain coordination problem — two perception systems, two update rates, two vocabularies (bounding boxes vs language tokens). The fusion policy becomes the new research surface.
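One plausible shape for that fusion policy, sketched under assumptions (this is not the IROS paper's algorithm or Annie's implementation): the fast local detector holds an unconditional safety veto, and the slow remote guidance is honored only while it is fresh, so a WiFi stutter degrades to cautious motion instead of blindness:

```python
def arbitrate(l1_detections, l2_guidance, l2_age_s, max_l2_age_s=0.5):
    """Illustrative System 1 / System 2 arbitration.

    l1_detections: on-device detector output, e.g. [{"distance_m": 0.2}, ...]
    l2_guidance:   latest semantic command from the remote VLM, or None
    l2_age_s:      seconds since that command arrived
    """
    # L1 (fast, local): hard stop on any close obstacle, no network involved
    if any(d["distance_m"] < 0.3 for d in l1_detections):
        return {"cmd": "STOP", "source": "L1"}
    # L2 (slow, remote): trusted only while fresh — staleness means WiFi stutter
    if l2_guidance is not None and l2_age_s <= max_l2_age_s:
        return {"cmd": l2_guidance, "source": "L2"}
    # Degraded mode: creep forward on L1 clearance alone
    return {"cmd": "FORWARD_SLOW", "source": "L1-degraded"}
```

The split-brain problem the paragraph names lives entirely in functions like this one: the thresholds, the staleness window, and the vocabulary mapping are the new research surface.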

2026–2027 (Predicted)

Semantic Map as First-Class Memory — Place Recognition Closes the Loop

Phase 2c/2d: VLM scene labels attach to SLAM grid cells at each pose. Over dozens of traversals, rooms emerge from accumulated evidence without manual annotation. Phase 2d deploys SigLIP 2 ViT-SO400M (~800 MB VRAM) as a dedicated embedding extractor — no text decoding. Cosine similarity on stored (x, y, heading) embeddings enables "I've been here before" without scan-matching. The map transitions from geometry-only to a hybrid metric-semantic structure: walls + "kitchen" + "hallway junction where Mom usually sits." Bottleneck this will remove: re-learning the home on every session. Bottleneck it will expose: single-camera depth ambiguity — without learned depth, semantic labels on a 2D grid lose the third dimension that distinguishes "table surface" from "floor under table."
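The "I've been here before" check reduces to a thresholded nearest-neighbor search over stored pose-tagged embeddings. A minimal sketch, assuming a flat list of memories and an illustrative similarity threshold (the actual Phase 2d design may differ):

```python
import numpy as np

def recall_place(current_emb, memory, threshold=0.85):
    """Cosine-similarity place recognition, no scan-matching.

    memory: list of (x, y, heading, embedding) tuples from past poses.
    Returns the best-matching stored pose, or None below the threshold.
    """
    cur = current_emb / (np.linalg.norm(current_emb) + 1e-8)
    best, best_score = None, threshold
    for x, y, heading, emb in memory:
        e = emb / (np.linalg.norm(emb) + 1e-8)
        score = float(cur @ e)
        if score > best_score:
            best, best_score = (x, y, heading), score
    return best
```

A linear scan is fine at house scale (hundreds of stored poses); a vector index only becomes necessary if the memory grows by orders of magnitude.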

2027+ (Future Annie Robot — Generation 2)

Orin-NX-Native Chassis — The Robot That Can Finally Host Isaac

The current TurboPi chassis is a Pi-5-bound platform: the Orin NX can only supplement a Pi, not replace it. The next-generation Annie robot will be Orin-NX-native (100 TOPS Ampere, 16 GB LPDDR5). This is not a marginal upgrade — it is a categorical shift in what can run on-body. Isaac ROS 4.2's nvblox (camera-only 3D voxel mapping) and cuVSLAM (GPU-accelerated visual SLAM) become deployable on the robot itself instead of remoted across WiFi. The VLM tier can migrate partially on-body, lidar can be supplemented or replaced by stereo vision, and the WiFi umbilical becomes optional rather than structural. The architecture becomes a dual-generation arc: the current TurboPi + Pi 5 + Panda-over-WiFi continues as the "development rig" (cheap, hackable, where new ideas are prototyped), while the Orin-NX-native robot becomes the "production body" (self-contained, user-owned, privacy-preserving at the edge). Bottleneck removed: the WiFi-tethered robot body. Bottleneck it exposes: dual-platform maintenance — every capability now needs two deployment targets, and the NavCore abstraction layer becomes load-bearing rather than optional.

2027–2028 (Predicted)

Sub-100-Demo VLA Fine-Tuning — The Pipeline Compresses

When 1–3B parameter VLAs (vision-language-action models) become fine-tunable on 50–100 home-collected demonstrations — not millions of fleet miles — the 4-tier hierarchy begins collapsing. The VLM no longer needs to output "LEFT MEDIUM" as a text token; it outputs a motor torque vector directly. The NavCore middleware (Tiers 2–4) becomes a compatibility shim rather than the primary control path. This is the transition where OK-Robot's "clean integration of replaceable components" may yield to "one model, one fine-tune, one home." Bottleneck this will remove: text-mediated motor control. Bottleneck it will expose: interpretability — when the model is end-to-end, there is no "lidar disposal" override. Safety requires a new architecture.

2030+ (Provocative)

Post-Token Navigation — What 2030 Finds Laughable

A 2030 researcher reading this document will find the following primitive: that we made a vision model output the string "LEFT MEDIUM" and then parsed that string with a Python function to produce a motor command. The entire text-token intermediary — prompt engineering, parser fallbacks, 3-strategy extraction, the "UNKNOWN" handling — will read the way GOTO-riddled control flow reads today: technically functional, structurally wrong. Navigation will be a continuous embedding space operation, not a discrete token classification. The VLM's vision encoder output will route directly to a motor policy head, the way the human visual cortex routes to motor cortex without "saying" directions to itself. The SLAM map will be a learned latent space, not an explicit 2D grid. The "58 Hz loop with alternating prompts" will be the punchline in a CVPR keynote about the early days of embodied AI.
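For concreteness, the kind of intermediary this paragraph describes looks roughly like the following. This is an illustrative reconstruction, not Annie's actual parser; the token vocabulary and fallback order are assumptions:

```python
import re

DIRECTIONS = {"LEFT", "RIGHT", "FORWARD", "STOP"}

def parse_nav_token(text):
    """Three increasingly forgiving extraction strategies, falling through
    to "UNKNOWN" when the VLM's prose defeats them all.
    """
    t = text.upper()
    # Strategy 1: exact "DIRECTION SPEED" pair anywhere in the output
    m = re.search(r"\b(LEFT|RIGHT|FORWARD|STOP)\s+(SLOW|MEDIUM|FAST)\b", t)
    if m:
        return m.group(1), m.group(2)
    # Strategy 2: a bare direction word, with a default speed
    for d in DIRECTIONS:
        if re.search(rf"\b{d}\b", t):
            return d, "MEDIUM"
    # Strategy 3: give up gracefully
    return "UNKNOWN", None
```

Every branch here is latency, ambiguity, and brittleness that a direct vision-to-action head would simply not have.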

The repeating pattern across every transition in robot navigation is identical: a new bottleneck becomes the rate-limiting step, a new approach removes it, and in doing so exposes the next bottleneck one layer deeper. The sequence runs: compute → memory → semantics → grounding → integration → language-motor gap → interpretability. Each era solved the bottleneck of the previous era so completely that the solution became invisible infrastructure. Nobody in 2026 thinks of "persistent spatial memory" as a solved problem — it is simply what SLAM does. In 2030, nobody will think of "semantic grounding" as a research question. But right now, the language-motor gap is the live bottleneck: Annie speaks directions to herself in English tokens in order to move a wheel, which is the robotic equivalent of doing arithmetic by writing out the words.

Annie's current architecture sits at a historically interesting inflection point. It is simultaneously ahead of its time in one dimension — 58 Hz VLM on commodity edge hardware, faster than Tesla's automotive perception loop — and at risk of being bypassed in another. The research document describes Waymo's MotionLM (trajectory as language tokens) and then builds a system that does the opposite: it uses language tokens as a proxy for trajectory. This is the contradiction Lens 14 identifies most sharply. The Waymo pattern was adopted at the architectural level (dual-rate, map-as-prior, complementary sensors) but inverted at the output level (language tokens instead of continuous actions). The next evolution will close this inversion.

The multi-query pipeline (Phase 2a) is not just a performance optimization — it is the last evolutionary step before the architecture fundamentally changes. By distributing 58 Hz across four concurrent perception tasks, it maximizes the extractable value from a text-token VLM. It is the most sophisticated thing you can do with the current paradigm before the paradigm shifts. This is consistent with the general pattern: each era's final contribution is an optimization of the existing approach that also makes the limits of that approach unmistakable. VLMaps was the most sophisticated thing you could do with offline CLIP embedding before online VLMs arrived. The multi-query pipeline is the most sophisticated thing you can do with text-token navigation before direct-action VLAs become fine-tunable at home scale.

The next inflection point is not about a new model — it is about activating the NPU we've been ignoring. Annie's Pi 5 has carried a 26 TOPS Hailo-8 AI HAT+ for this entire research window, idle for navigation. In 2026-Q2/Q3, the single-query VLM-over-WiFi era gives way to an on-robot dual-process architecture: YOLOv8n at 430 FPS locally for L1 safety (under 10 ms, WiFi-independent), Gemma 4 E2B at 15–27 Hz on Panda for L2 semantic reasoning. This is the exact IROS 2026 pattern (arXiv 2601.21506) — System 1 / System 2 with a 66% latency reduction. The discovery that reframes the current timeline: Annie was not bottlenecked on model capability; she was bottlenecked on a perception layer we had not yet wired into the stack. And beyond that, the arc extends into hardware: the next-generation Annie robot will be Orin-NX-native (100 TOPS Ampere, 16 GB LPDDR5), capable of hosting Isaac Perceptor's nvblox and cuVSLAM on-body — making WiFi optional rather than structural. This is no longer a single moment but a dual-generation upgrade path: the current TurboPi + Pi 5 + Panda rig continues as the hackable development platform, and the Orin-NX body becomes the self-contained production platform. Lens 02 (architecture bets) and Lens 07 (latency budgets) both reset against this horizon.

The cross-lens convergence with Lens 17 (transfer potential) and Lens 26 (bypass text layer) points to a concrete near-term opportunity: the NavCore middleware — the 4-tier hierarchy that abstracts VLM outputs into motor commands — has significant transfer value precisely because it is the translation layer between language and action. When the translation layer eventually becomes unnecessary, the NavCore pattern will survive as a safety shim: a fallback execution path that catches failures in the end-to-end model and routes through interpretable, auditable logic. The bottleneck of interpretability will be solved the same way every previous bottleneck was solved — by making the new approach compatible with the old infrastructure until the old infrastructure can be safely retired.

Nova: The pattern is brutally consistent. Every era's "breakthrough" removes one bottleneck while making the next one unmistakable. Active Neural SLAM solved memory and immediately exposed the lack of meaning. VLMaps solved meaning and immediately exposed the deployment gap. OK-Robot solved deployment and immediately exposed the text-motor gap. Annie's multi-query pipeline is the apex of the text-token era — it extracts maximum value from the current paradigm while making its fundamental limit (language mediation) impossible to ignore. The 2030 punchline writes itself: we made robots say "LEFT MEDIUM" to themselves.
  • Dual-generation upgrade path: The next era is not one leap but two concurrent tracks. Current robot (TurboPi + Pi 5): activate the idle Hailo-8 NPU as L1 safety (YOLOv8n @ 430 FPS, <10 ms, zero WiFi) — the IROS 2026 dual-process pattern lands on hardware Annie already owns. Future robot (Orin-NX-native): 100 TOPS Ampere on-body, Isaac Perceptor's nvblox and cuVSLAM running at the edge, WiFi becomes optional. The current rig stays as the development platform; the future rig becomes the production body. NavCore middleware becomes the bridge between them.
  • The bottleneck we didn't see: Annie wasn't limited by model capability. She was limited by an entire perception tier we hadn't wired in. 26 TOPS sitting idle for months is the kind of constraint that only becomes visible after you've exhausted the alternatives.
Think: If the text-token intermediary is the current bottleneck, what does it mean that the entire research document is written in text? The research describes, in natural language, a system that navigates by translating vision into natural language commands. The meta-structure of the research mirrors the structural flaw of the system. A 2030 researcher would find not just the implementation primitive; they would find it equally telling that the primary research artifact is a text document about text-token navigation. The medium of the research (text) is also the bottleneck of the system. When navigation becomes a continuous embedding operation, what does the research document look like?