# Research: Phase 2 — VLM-Primary Hybrid Navigation (Waymo/Tesla Inspired)

**Session 86, 2026-04-13** | Research-only — no implementation
**Prerequisite:** Phase 1 SLAM foundation must be deployed first (provides ground truth + evaluation framework)

---

## Executive Summary

Annie's 58 Hz VLM (Gemma 4 E2B on Panda, 18ms/frame) is already faster than Tesla FSD's perception (36 Hz). The question is not speed — it's what to DO with 58 frames per second. This research identifies **5 concrete capabilities** that can be extracted from the VLM beyond the current "LEFT MEDIUM" nav commands, and proposes a **4-tier hierarchical fusion architecture** inspired by Waymo's sensor fusion + Tesla's occupancy prediction + VLMaps' semantic grounding.

**The core insight:** Don't make the VLM do more per frame. Make it do DIFFERENT things on alternating frames. At 58 Hz, splitting across 4 perception tasks still gives each task ~15 Hz — faster than most robot SLAM loops.

---

## Part 1: Waymo & Tesla Architecture Lessons

### What Waymo Does (and what translates to Annie)

Waymo uses a **perceptual foundation model** that fuses lidar + camera + radar into a unified representation. The key patterns:

1. **Map-as-prior:** Waymo pre-builds HD maps of every road. When driving, perception focuses only on dynamic objects — static structure is already known. **Annie equivalent:** Phase 1 SLAM builds the occupancy grid (static structure). Phase 2 VLM focuses on dynamic/semantic understanding (what things ARE, not where walls are).

2. **Complementary sensors, not redundant:** Camera gives semantics, lidar gives geometry, radar gives velocity. Each does something the others cannot. **Annie equivalent:** Lidar gives geometry (walls, furniture legs), VLM gives semantics (what things are), IMU gives heading. Don't try to make one substitute for another.

3. **Prediction-first planning:** Waymo's MotionLM autoregressively generates discrete motion tokens for all agents — strikingly similar to how LLMs generate text. **Annie equivalent:** Not directly applicable (no high-speed agents in a home), but the "prediction as language modeling" concept validates using LLMs for planning.

### What Tesla Does (and what doesn't translate)

Tesla's vision-only architecture:
1. Extracts features from 8 cameras using RegNet + BiFPN
2. Projects 2D features into **Bird's Eye View (BEV)** via transformer attention
3. Predicts per-voxel **3D occupancy** (is this cube of space filled with matter?)
4. Tracks occupancy across time → **4D occupancy flow** (where is each voxel moving?)
5. FSD v12: End-to-end neural planner replaced 300,000 lines of C++ with a single neural net

**What translates:**
- **Dual-rate architecture.** Tesla runs perception at 36 Hz, planning at lower frequency. GR00T N1 (NVIDIA) runs VLM at 10 Hz, actions at 120 Hz. Annie should run VLM perception at 29-58 Hz and motor commands at 1-2 Hz.
- **Temporal occupancy accumulation.** Don't use a single frame — accumulate VLM reports into a persistent grid that decays over time.

**What does NOT translate:**
- Multi-camera surround view (we have one camera)
- Custom silicon (HW4 chip)
- Million-mile training dataset (we have one robot)
- 3D voxel representation (we operate in 2D)
- End-to-end neural planner (requires fleet-scale data)

### Key Takeaway

> **The minimal viable "world model" for an indoor robot at 1 m/s is: a 2D occupancy grid (lidar), a current scene label (VLM), a goal-relative bearing (VLM), and a topological place memory (embeddings).** This is achievable with current hardware.
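That minimal world model fits in a small data structure. A sketch in plain Python — the field names and the grid encoding are illustrative, not Annie's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    """The four-part minimal world model: grid, scene label, bearing, places.

    Field names are illustrative; Annie's actual state layout may differ.
    """
    occupancy: list                 # 2D lidar grid: 0 free, 1 occupied, -1 unknown
    scene_label: str = "unknown"    # latest VLM room label, e.g. "kitchen"
    goal_bearing: float = 0.0       # goal-relative bearing in radians (VLM-derived)
    places: dict = field(default_factory=dict)  # place_id -> stored image embedding

# 10 m x 10 m at 5 cm resolution, all cells initially unknown
world = WorldModel(occupancy=[[-1] * 200 for _ in range(200)])
```

Each of the four fields maps to one sensor pipeline: lidar SLAM fills `occupancy`, the VLM fills `scene_label` and `goal_bearing`, and embedding extraction fills `places`.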

---

## Part 2: What 58 Hz VLM Can Actually Do

### Current: Single-task at 58 Hz
- Frame → "Where is the [goal]?" → "LEFT MEDIUM" (1-2 tokens, 18ms)

### Capability 1: Scene Classification (~15 Hz via alternating frames)
Ask "What room? Reply ONE word: kitchen/hallway/bedroom/bathroom/living/unknown"
- Single-token classification — same structure as current nav prompt
- VLM already demonstrated solid scene understanding (described "indoor scene with light green floor, dark wooden furniture" in session 67)
- Use: Label SLAM map regions, detect room transitions

### Capability 2: Obstacle Description (~15 Hz)
Ask "Nearest obstacle? Reply ONE word: chair/table/wall/door/person/none"
- Noisier than goal-finding (obstacles are diverse, partially occluded)
- Useful as supplementary signal fused with lidar
- Use: Semantic labels for lidar-detected obstacle clusters ("that cluster at 1.5m right is a chair")

### Capability 3: Qualitative Depth / Path Assessment (~15 Hz)
Ask "Path ahead? Reply: CLEAR/NEAR/BLOCKED"
- Covers camera's wider FOV where sonar/lidar may have gaps
- Detects above-lidar-plane obstacles (shelves, hanging objects, table edges)
- Use: Soft occupancy signal for cells lidar cannot see

### Capability 4: Embedding Extraction (potentially ~71 Hz)
Extract vision encoder output WITHOUT text decoding.
- Gemma 4 E2B uses a 150M-parameter ViT (2D RoPE, learned positional embeddings)
- Vision encoder runs in ~14ms (text decoding adds ~4ms)
- Produces 280-token feature representation per image
- **Practical blocker:** llama-server doesn't cleanly expose intermediate embeddings for multimodal inputs. Workaround: deploy a separate SigLIP 2 ViT-SO400M (~800MB VRAM) as a dedicated embedding extractor.

**What embeddings enable (no text decoding needed):**
- **Place recognition:** Cosine similarity against stored embeddings → "I've been here before" (loop closure)
- **Scene change detection:** Track cosine distance between consecutive frames. Jump > threshold → new room entered.
- **Topological map:** Store embeddings keyed by (x, y, heading) from SLAM. Build a graph of visually-distinct places.

Reference: text2nav (RSS 2025 Workshop) achieved 74% success in language-guided navigation using frozen SigLIP embeddings alone.
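The place-recognition and scene-change checks reduce to cosine similarity over stored embeddings. A minimal pure-Python sketch — `PlaceMemory` is a hypothetical name, the thresholds are guesses to be tuned, and the embedding vectors themselves would come from the vision encoder or a SigLIP extractor:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class PlaceMemory:
    """Store embeddings keyed by SLAM pose; report revisits and scene changes."""

    def __init__(self, revisit_thresh=0.92, change_thresh=0.80):
        self.entries = []                   # list of (pose, embedding)
        self.prev = None                    # previous frame's embedding
        self.revisit_thresh = revisit_thresh
        self.change_thresh = change_thresh

    def update(self, pose, emb):
        # Scene change: big jump between consecutive frames -> likely new room
        changed = self.prev is not None and cosine(self.prev, emb) < self.change_thresh
        self.prev = emb
        # Loop-closure candidate: best match against all stored places
        best = max((cosine(e, emb) for _, e in self.entries), default=0.0)
        revisit = best >= self.revisit_thresh
        self.entries.append((pose, emb))
        return revisit, changed
```

A real implementation would prune or quantize `entries` (a linear scan at 10 Hz grows unbounded), which is exactly what the topological place graph in Phase 2d addresses.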

### Capability 5: Multi-Query Pipeline (THE KEY PATTERN)

At 58 Hz, alternate queries across frames:

```
Frame 0: "Where is the [goal]?"        → "LEFT MEDIUM"     (nav decision)
Frame 1: "What room is this?"          → "hallway"          (scene context)
Frame 2: "Where is the [goal]?"        → "CENTER MEDIUM"   (nav decision)
Frame 3: "Nearest obstacle?"           → "chair"            (obstacle awareness)
Frame 4: "Where is the [goal]?"        → "LEFT MEDIUM"     (nav decision)
Frame 5: [embedding extraction]        → feature vector    (place recognition)
```

**Result:** Nav decisions at 29 Hz, scene classification at 10 Hz, obstacle awareness at 10 Hz, place recognition at 10 Hz. Each query gets the full model's attention on its frame.

Implementation: Add `cycle_count % N` dispatch in `NavController._run_loop()`. The `_ask_vlm` method already takes `image_b64` — just parameterize the prompt.
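A sketch of that dispatch. The prompt strings follow Capabilities 1-3 above; `extract_embedding` and the `prompt=` keyword on `_ask_vlm` are assumptions about the controller's interface, not confirmed signatures:

```python
NAV = "Where is the [goal]?"
SCENE = "What room? Reply ONE word: kitchen/hallway/bedroom/bathroom/living/unknown"
OBSTACLE = "Nearest obstacle? Reply ONE word: chair/table/wall/door/person/none"
EMBED = object()  # sentinel: extract an embedding instead of decoding text

# One 6-frame cycle: nav on frames 0/2/4 (29 Hz), auxiliaries on 1/3/5 (~10 Hz each)
SCHEDULE = [NAV, SCENE, NAV, OBSTACLE, NAV, EMBED]

def pick_query(cycle_count):
    """The cycle_count % N dispatch described above."""
    return SCHEDULE[cycle_count % len(SCHEDULE)]

def perceive(controller, cycle_count, image_b64):
    """One frame, one query; called from the controller's run loop."""
    query = pick_query(cycle_count)
    if query is EMBED:
        return controller.extract_embedding(image_b64)    # hypothetical method
    return controller._ask_vlm(image_b64, prompt=query)   # prompt parameterized
```

Changing the mix is a one-line edit to `SCHEDULE` — e.g. dropping `EMBED` until the SigLIP extractor exists restores a clean 5-frame cycle.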

---

## Part 3: VLM + SLAM Fusion Architectures

### Architecture: 4-Tier Hierarchical Fusion (Recommended)

```
┌─────────────────────────────────────────────────────────────┐
│ Tier 1 — STRATEGIC (1-2 Hz, Titan LLM)                     │
│   Annie interprets goals, queries semantic map              │
│   "Go to the kitchen" → path on SLAM map → waypoints       │
│   Replans when VLM reports unexpected scene                 │
├─────────────────────────────────────────────────────────────┤
│ Tier 2 — TACTICAL (10-58 Hz, Panda VLM)                    │
│   Multi-query pipeline:                                      │
│   - Goal tracking: "LEFT MEDIUM" → steering commands        │
│   - Scene classification: "kitchen" → map annotation        │
│   - Obstacle awareness: "chair" → semantic labeling         │
│   - Place recognition: embedding → loop closure assist      │
│   VLM confirms scene matches expectation from Tier 1        │
│   "Path blocked by person" → re-plan request to Tier 1      │
├─────────────────────────────────────────────────────────────┤
│ Tier 3 — REACTIVE (10 Hz, Pi lidar + SLAM)                 │
│   slam_toolbox: occupancy grid + localization               │
│   Local obstacle avoidance via lidar sectors                │
│   Safety daemon gates ALL forward motion                     │
│   ESTOP has absolute priority over all tiers                │
├─────────────────────────────────────────────────────────────┤
│ Tier 4 — KINEMATIC (100 Hz, Pi IMU)                         │
│   Heading correction on every motor command                  │
│   Drift compensation during turns                           │
│   Odometry hints for SLAM prediction                        │
└─────────────────────────────────────────────────────────────┘
```

**Fusion rule: VLM proposes, lidar disposes, IMU corrects.**
- VLM says "go forward" → Lidar says "forward blocked at 250mm" → IMU says "drifted 3° right" → Planner: "turn left 8°, then forward"

This is already what Annie's NavController does (sessions 79-83). Phase 2 makes it explicit and adds Tier 1 (SLAM-informed strategic planning).
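The fusion rule can be sketched as a single arbitration function. This simplified version stops rather than re-planning a turn (re-planning belongs to Tier 1), and all thresholds and names are illustrative:

```python
def arbitrate(vlm_action, lidar_min_mm, imu_drift_deg,
              stop_mm=300.0, drift_gain=1.0):
    """VLM proposes, lidar disposes, IMU corrects.

    vlm_action:    Tier 2 proposal, e.g. "forward" or "turn_left"
    lidar_min_mm:  nearest return in the forward lidar sector (Tier 3)
    imu_drift_deg: heading error accumulated since the last command (Tier 4)
    Returns (action, heading_correction_deg). Thresholds are illustrative.
    """
    # Lidar disposes: the reactive tier gates all forward motion
    if vlm_action == "forward" and lidar_min_mm < stop_mm:
        return ("stop", 0.0)
    # IMU corrects: counter-steer the accumulated drift
    return (vlm_action, -drift_gain * imu_drift_deg)
```

With the worked example's inputs (forward proposal, 250 mm ahead, 3° drift), this version halts and lets the higher tier choose the "turn left, then forward" recovery.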

### VLMaps Pattern: Semantic Labels on Occupancy Grid

The most directly applicable academic system (Google, ICRA 2023):
1. Robot explores, capturing posed RGB frames
2. Each frame processed by CLIP/LSeg → dense per-pixel embeddings
3. Embeddings projected onto 2D map grid cells
4. To navigate to "kitchen table": text encoded by CLIP, cosine similarity localizes target
5. LLM parses complex instructions into executable nav code

**Annie adaptation:**
- slam_toolbox produces 2D occupancy grid (Phase 1)
- VLM scene labels attached to grid cells at current SLAM pose
- Over time: cells accumulate labels → rooms emerge
- Annie queries: "where is the kitchen?" → find cells with highest "kitchen" confidence
- Navigate using SLAM path + VLM confirmation
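The label-accumulation step sketched as a counter per grid cell — the cell indexing, default resolution, and confidence score are illustrative choices, not VLMaps' actual formulation:

```python
from collections import Counter, defaultdict

class SemanticLayer:
    """VLM scene labels accumulated on SLAM grid cells; rooms emerge over time."""

    def __init__(self, resolution_m=0.05):
        self.res = resolution_m
        self.cells = defaultdict(Counter)   # (ix, iy) -> label counts

    def annotate(self, x, y, label):
        """Attach the current VLM room label at the current SLAM pose."""
        if label != "unknown":
            cell = (round(x / self.res), round(y / self.res))
            self.cells[cell][label] += 1

    def where_is(self, label):
        """Cells ranked by confidence in `label`, best first."""
        scored = [(counts[label] / sum(counts.values()), cell)
                  for cell, counts in self.cells.items() if counts[label] > 0]
        return [cell for _, cell in sorted(scored, reverse=True)]
```

VLMaps proper stores dense CLIP embeddings per cell and matches against encoded text; the counter version above is the discrete-label reduction that works with one-word VLM answers.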

### OK-Robot Pragmatism Principle

OK-Robot (NYU, 2024) achieved 58.5% pick-and-drop success in real homes using only off-the-shelf components (CLIP + LangSam + AnyGrasp). Their paper explicitly argues:

> "What really matters is not fancy models but clean integration."

**This validates our approach:** Don't build a custom end-to-end model. Combine CLIP embeddings + SLAM occupancy grid + LLM planner. Each component is independently testable and replaceable.

---

## Part 4: Temporal Consistency at 58 Hz

At 58 Hz, consecutive frames differ by <1.7cm of robot travel (at 1 m/s). This enables:

### Exponential Moving Average on VLM Outputs
Track position and size as numeric values (LEFT=-1, CENTER=0, RIGHT=1; SMALL=1, MEDIUM=2, LARGE=3). Apply EMA with alpha=0.3. Filters single-frame hallucinations. The existing `_consecutive_none` hysteresis counter is a crude version of this.

### Confidence Accumulation
If VLM reports "CENTER MEDIUM" for N consecutive frames, confidence increases geometrically. After 5 consistent frames (~86ms): increase forward speed. After 1 inconsistent frame: drop to cautious mode.

### Scene Change Detection
Track running variance of VLM outputs. Low variance (30+ frames same answer) = stable scene → faster. High variance (answers flip every 2-3 frames) = cluttered/boundary → slower.
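The three mechanisms above can be sketched in one small filter. The numeric encodings follow the text; alpha, the streak threshold, and the variance window are illustrative:

```python
from collections import deque

POS = {"LEFT": -1.0, "CENTER": 0.0, "RIGHT": 1.0}
SIZE = {"SMALL": 1.0, "MEDIUM": 2.0, "LARGE": 3.0}

class TemporalFilter:
    def __init__(self, alpha=0.3, window=30):
        self.alpha = alpha
        self.pos = 0.0                       # EMA of goal position
        self.size = 2.0                      # EMA of goal size
        self.streak = 0                      # consecutive identical answers
        self.recent = deque(maxlen=window)   # recent position codes for variance
        self.last = None

    def update(self, pos_word, size_word):
        # EMA filters single-frame hallucinations
        self.pos = self.alpha * POS[pos_word] + (1 - self.alpha) * self.pos
        self.size = self.alpha * SIZE[size_word] + (1 - self.alpha) * self.size
        # Confidence accumulation: count consecutive consistent frames
        answer = (pos_word, size_word)
        self.streak = self.streak + 1 if answer == self.last else 0
        self.last = answer
        self.recent.append(POS[pos_word])

    def speed_mode(self):
        if self.streak >= 5:        # ~86 ms of agreement at 58 Hz
            return "fast"
        if self.streak == 0:        # last frame disagreed: drop to cautious
            return "cautious"
        return "normal"

    def scene_variance(self):
        """Low variance = stable scene; high = clutter or a room boundary."""
        if len(self.recent) < 2:
            return 0.0
        mean = sum(self.recent) / len(self.recent)
        return sum((v - mean) ** 2 for v in self.recent) / len(self.recent)
```

The `_consecutive_none` counter mentioned above is the degenerate case of `streak` that only tracks the "none" answer.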

---

## Part 5: Visual Place Recognition (AnyLoc)

### AnyLoc (RA-L 2023)
Universal visual place recognition using DINOv2 + VLAD. Works across indoor/outdoor/underwater without retraining.

**Critical 2026 paper:** "Loop Closure using AnyLoc in DPV-SLAM" (arXiv 2601.02723) directly replaces bag-of-words loop detection with AnyLoc features. Adaptive similarity thresholds based on environment.

**Annie integration path:**
1. Phase 1 SLAM uses scan-matching for loop closure (standard slam_toolbox)
2. Phase 2 adds AnyLoc visual features as a CONFIRMATION layer
3. When slam_toolbox detects a scan-matching loop closure, also check AnyLoc similarity
4. If both agree → high confidence. If they disagree → flag for review.

**Hardware:** DINOv2 feature extraction needs GPU. Run on Panda (competing with VLM for VRAM) or Titan (plenty of headroom). A separate SigLIP 2 ViT-SO400M (~800MB VRAM) on Panda would serve both embedding extraction and place recognition.

---

## Part 6: Neural SLAM Systems (State of the Art)

### Active Neural SLAM (Chaplot et al., 2020)
The foundational hybrid: learned mapper (CNN predicts occupancy from RGB-D) + classical planner (A* to waypoints) + learned global policy (which waypoint to explore next).

**Directly maps to Annie:** SLAM provides spatial backbone, VLM replaces learned mapper for semantic understanding. Classical planner handles path execution.

### What's NOT Suitable
- **GS-SLAM / NeRF-SLAM**: 3D Gaussian Splatting SLAM — impressive (386 FPS rendering) but requires desktop GPU and RGB-D. Not for Pi 5.
- **End-to-end VLA models** (RT-2, OpenVLA, pi0): Require millions of demonstrations for training. Annie has one robot.
- **NaVid** (video-based VLM navigation): Validates our monocular VLM approach but has no persistent map. Combining with SLAM gives the best of both.

### What IS Suitable
- **VLMaps**: Semantic labels on occupancy grid. Our target architecture.
- **SayCan/Inner Monologue**: LLM plans, robot executes, VLM provides feedback loop. Annie already does this.
- **AnyLoc**: Visual place recognition for loop closure. Lightweight augmentation of slam_toolbox.
- **PRISM-TopoMap**: Topological mapping with learned place descriptors. Build a graph of places using VLM embeddings — no global metric coordinates needed.

---

## Part 7: Evaluation Framework (What Phase 1 Must Log for Phase 2)

### Data to Log During Phase 1

| Data | Rate | Purpose |
|------|------|---------|
| SLAM pose (x, y, theta) | 10 Hz | Ground truth for VLM pose estimates |
| Lidar scan (raw points) | 10 Hz | Obstacle ground truth for VLM accuracy |
| Camera frames (JPEG) | At VLM inference rate | Training data for VLM evaluation |
| VLM outputs (text) | At VLM inference rate | Scene descriptions, nav commands |
| Odometry sources, logged separately (rf2o, IMU, odom_hint) | 10-100 Hz | Sensor fusion evaluation |
| Loop closure events from slam_toolbox | Event-driven | Place recognition evaluation |
| Room annotations (manual) | One-time | Ground truth for scene classification accuracy |
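One timestamped JSONL record per event keeps these streams joinable by time later. A sketch — the `kind` values mirror the table rows, and all field names are illustrative rather than an agreed schema:

```python
import json
import time

def make_record(kind, **payload):
    """Build one log record; kind is e.g. "slam_pose", "vlm_output", "loop_closure"."""
    return {"t": time.time(), "kind": kind, **payload}

def append_record(path, record):
    """Append as one JSON line; the log stays greppable and stream-joinable."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

rec = make_record("slam_pose", x=1.2, y=0.4, theta=0.8)
append_record("phase1_log.jsonl", rec)
append_record("phase1_log.jsonl",
              make_record("vlm_output", prompt="What room?", answer="hallway"))
```

Replaying the file sorted by `t` reconstructs the full sensor timeline, which is what the Phase 2 evaluation metrics below need.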

### Evaluation Metrics

| Metric | What it measures | Phase 2 decision it informs |
|--------|-----------------|----------------------------|
| **ATE** (Absolute Trajectory Error) | SLAM pose accuracy | Is SLAM reliable enough as ground truth? |
| **VLM obstacle accuracy** | VLM "obstacle left" vs lidar reality | Can VLM replace lidar sectors for obstacle detection? |
| **Scene consistency** | Same location → same VLM label? | Is VLM reliable for room labeling? |
| **Place recognition P/R** | VLM "been here before" vs SLAM revisit | Can embeddings augment loop closure? |
| **Navigation success rate** | Goals reached / goals attempted | Baseline before Phase 2 changes |

---

## Part 8: Phase 2 Implementation Roadmap

### Phase 2a: Multi-Query Pipeline (immediate, zero hardware change)
- Add alternating-query dispatch to `NavController._run_loop()`
- Frames 0,2,4: goal question. Frames 1,3,5: scene/obstacle question.
- Store results in controller state. Feed to Annie for richer context.
- **Doubles perception richness at same 58 Hz throughput.**

### Phase 2b: Temporal Smoothing (short-term)
- Replace `_consecutive_none` counter with EMA filter on VLM position/size
- Add confidence-based speed modulation (high confidence = faster)
- Add scene change detection (variance tracking)

### Phase 2c: Semantic Map Annotation (medium-term, requires Phase 1 SLAM)
- Every N frames: VLM scene label → attach to SLAM grid cells at current pose
- Over time: rooms emerge from accumulated labels
- Annie queries annotated map: "where is the kitchen?"
- Navigate using SLAM path + VLM waypoint confirmation

### Phase 2d: Embedding Extraction + Place Memory (longer-term)
- Deploy SigLIP 2 ViT-SO400M on Panda (~800MB VRAM) as embedding extractor
- Store embeddings keyed by (x, y, heading) from SLAM
- Cosine similarity for "have I been here?" → visual loop closure
- Build topological place graph on top of metric SLAM map

### Phase 2e: AnyLoc Visual Loop Closure (future)
- DINOv2 features for universal place recognition
- Augments slam_toolbox's scan-matching loop closure
- Run on Titan (plenty of VRAM headroom)

---

## Probability of Success by Phase

| Phase | What | P(success) | Prerequisites | Sessions |
|-------|------|:---:|---|:---:|
| 2a | Multi-query pipeline | **90%** | Current VLM works | 1 |
| 2b | Temporal smoothing | **85%** | Phase 2a | 1 |
| 2c | Semantic map annotation | **65%** | Phase 1 SLAM deployed | 2-3 |
| 2d | Embedding extraction | **55%** | SigLIP 2 on Panda + Phase 1 | 2-3 |
| 2e | AnyLoc loop closure | **50%** | Phase 2d + DINOv2 on Titan | 2-3 |

**Phases 2a and 2b can start BEFORE Phase 1 SLAM is deployed** — they only modify the VLM query pipeline, not the SLAM stack.

---

## Key Design Principles (from research)

1. **From Waymo:** Map-as-prior. SLAM map = known static world. VLM = dynamic changes + semantics.
2. **From Tesla:** Dual-rate architecture. Perception at 58 Hz, planning at 1-2 Hz.
3. **From VLMaps:** Language-grounded spatial memory. CLIP embeddings overlaid on SLAM grid.
4. **From OK-Robot:** Pragmatic integration of off-the-shelf components over custom models.
5. **From Bootstrap Perception:** Lidar is ground truth. Learned depth fills gaps only where needed.
6. **From Active Neural SLAM:** Classical SLAM for geometry + neural perception for semantics.
7. **From sensor fusion research:** Late fusion with hierarchical override. Faster sensors override slower when safety-critical.

> **The existing NavController architecture (sessions 79-83) is already correct for Tiers 2-4. Phase 1 SLAM adds Tier 3 localization. Phase 2 enriches Tier 2 perception. Neither requires rewriting the reactive navigation core.**

---

## References

### Autonomous Driving Architectures
- [Waymo 6th-Gen Driver](https://waymo.com/blog/2024/08/meet-the-6th-generation-waymo-driver/)
- [Waymo HD Mapping](https://waymo.com/blog/2020/09/the-waymo-driver-handbook-mapping/)
- [MotionLM — trajectory prediction as language modeling](https://waymo.com/research/motionlm/)
- [Tesla Occupancy Networks](https://www.thinkautonomous.ai/blog/occupancy-networks/)
- [Tesla FSD v12 End-to-End Neural Planner](https://www.thinkautonomous.ai/blog/tesla-end-to-end-deep-learning/)
- [GR00T N1: NVIDIA dual-system VLA (10Hz VLM + 120Hz action)](https://arxiv.org/html/2503.14734v1)

### VLM Navigation Systems
- [VLMaps (ICRA 2023) — semantic labels on occupancy grid](https://vlmaps.github.io/)
- [OK-Robot (CoRL 2024) — pragmatic VLM + SLAM integration](https://ok-robot.github.io/)
- [NaVid — video-based VLM navigation](https://arxiv.org/html/2402.15852v6)
- [SayCan — grounding LLM in robotic affordances](https://say-can.github.io/)
- [Inner Monologue — closed-loop VLM feedback](https://innermonologue.github.io/)
- [NaVILA — VLA with mid-level language commands](https://navila-bot.github.io/)
- [text2nav — frozen SigLIP embeddings for navigation (RSS 2025)](https://github.com/oadamharoon/text2nav)

### Visual Place Recognition
- [AnyLoc — universal place recognition via DINOv2](https://anyloc.github.io/)
- [AnyLoc in DPV-SLAM loop closure (2026)](https://arxiv.org/abs/2601.02723)
- [PRISM-TopoMap — topological mapping with learned descriptors](https://arxiv.org/html/2404.01674)

### Neural SLAM
- [Active Neural SLAM (Chaplot 2020)](https://arxiv.org/abs/2004.05155)
- [ConceptGraphs — open-vocabulary 3D scene graphs](https://concept-graphs.github.io/)
- [HOV-SG — hierarchical scene graphs (RSS 2024)](https://hovsg.github.io/)
- [ConceptFusion — open-set multimodal 3D mapping](https://concept-fusion.github.io/)

### Sensor Fusion & Occupancy
- [Bootstrap Perception — hierarchical lidar+depth fusion](https://arxiv.org/html/2603.28890)
- [SpatialVLM — metric spatial reasoning from VLM (CVPR 2024)](https://spatial-vlm.github.io/)
- [Multi-sensor fusion for autonomous driving (survey)](https://www.mdpi.com/1424-8220/25/19/6033)

### Edge VLM
- [Gemma 4 ViT architecture (150M params, 2D RoPE)](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4)
- [SigLIP 2 — improved vision-language encoders](https://huggingface.co/blog/siglip2)
- [FastVLM — efficient vision encoding (Apple)](https://machinelearning.apple.com/research/fast-vision-language-models)
- [VLMs on edge devices (LearnOpenCV)](https://learnopencv.com/vlm-on-edge-devices/)
