{
  "title": "Research Perspectives: VLM-Primary Hybrid Navigation",
  "source": "docs/RESEARCH-VLM-PRIMARY-HYBRID-NAV.md",
  "generated_at": "2026-04-14T07:30:00",
  "sections": [
    {"id":"lens-01","title":"First Principles X-Ray","category":"decompose","text":"The deepest truth: at 1 m/s with 58 Hz VLM, the world moves only 1.7 cm between frames. Temporal consistency is free. Consecutive answers should agree because the scene barely changed. Any disagreement is either hallucination or genuine scene transition, detectable through variance tracking. The most important convention disguised as physics: VLM must output text. The 150M parameter ViT encoder produces a 280 token feature vector in 14ms. Text decoding adds 4ms. For place recognition, the embedding IS the output.","findings":["Temporal surplus is the foundation","WiFi, not VLM speed, is the binding constraint"]},
    {"id":"lens-02","title":"Abstraction Elevator","category":"decompose","text":"The abstraction leak between 10,000 ft and ground level is WiFi. The clean 4-tier hierarchy assumes instant inter-tier communication, but tiers run on different hardware connected by household WiFi. Another leak: the 30,000 ft pitch promises named goals, but ground level outputs 'LEFT MEDIUM': qualitative directions, not coordinates. Semantic mapping bridges this gap.","findings":["WiFi leaks across all abstractions","LEFT MEDIUM is the glass ceiling until Phase 2c"]},
    {"id":"lens-03","title":"Dependency Telescope","category":"decompose","text":"The most dangerous dependency is llama-server's inability to expose intermediate embeddings. Phase 2d is blocked. The workaround, deploying separate SigLIP 2, costs 800MB VRAM on already constrained Panda. This upstream limitation cascades into hardware budget decisions. Downstream, if semantic maps work, spatial memory feeds voice agent, context engine, and home automation.","findings":["llama-server blocks Phase 2d","Panda GPU is single point of failure"]},
    {"id":"lens-04","title":"Sensitivity Surface","category":"decompose","text":"WiFi latency has a cliff edge. Below 20ms, fine. At 50ms, degraded. Above 100ms, blind for multiple frames. No graceful degradation, just a phase transition. The surprise: VLM query rate barely matters above 15 Hz. Going from 29 to 58 Hz gives the same information. The multi-query pipeline's value is using surplus frames for different questions.","findings":["WiFi is 92% impact cliff edge","VLM rate barely matters above 15 Hz"]},
    {"id":"lens-05","title":"Evolution Timeline","category":"evolve","text":"Every 1 to 2 years, the learned vs classical boundary shifts. Annie sits at the pragmatic fusion point, using off-the-shelf VLMs for perception while keeping classical SLAM and A* for planning. The 2027 question: will 1B Vision Language Action models become trainable on small datasets? If so, the 4-tier hierarchy collapses. But OK-Robot's lesson persists: clean integration may remain more practical for low-volume robotics.","findings":["Annie at pragmatic fusion inflection","Edge VLAs may simplify by 2027"]},
    {"id":"lens-06","title":"Second-Order Effects","category":"evolve","text":"The killer second-order effect: spatial memory meets conversational memory. Mom mentioned needing glasses, and Annie saw glasses on the nightstand. Combined with emotion recognition, this enables proactive care without being asked. The concerning effect: a camera robot mapping rooms creates surveillance even unintentionally. Needs explicit consent architecture.","findings":["Spatial + conversational memory is transformative","Privacy consent architecture needed"]},
    {"id":"lens-07","title":"Landscape Map","category":"position","text":"Annie occupies the sweet spot: zero training with medium-high spatial understanding. The only system that's better without training is VLMaps, which is Annie's Phase 2c target. The empty quadrant, high understanding plus zero training, suggests an opportunity: plug-and-play semantic SLAM using foundation model embeddings.","findings":["Annie in sweet spot: zero training, medium-high understanding","Empty quadrant opportunity for plug-and-play semantic SLAM"]},
    {"id":"lens-08","title":"Analogy Bridge","category":"position","text":"Like a submarine's sonar plus periscope. Sonar gives 360-degree geometry but can't identify. Periscope gives a narrow field of view but can identify. The fusion rule is the same: for safety, the periscope never overrides sonar, just as the VLM never overrides lidar. The analogy predicts something new: submarines use sound signatures to classify contacts through sonar alone. Annie could use lidar scan patterns for obstacle classification without VLM.","findings":["Submarine analogy validates fusion rule","Lidar fingerprinting predicted by analogy"]},
    {"id":"lens-09","title":"Tradeoff Radar","category":"position","text":"VLM plus SLAM trades robustness for semantics. Pure SLAM is more robust with no WiFi, no hallucinations, no GPU required, but understands nothing. The hidden tradeoff is deployment complexity. Setting up the VLM pipeline is far harder than SLAM-only, but this doesn't show on the radar.","findings":["Robustness traded for semantics","Deployment complexity is the hidden cost"]},
    {"id":"lens-10","title":"Failure Pre-mortem","category":"stress","text":"Most likely failure: WiFi degrades under real conditions like monsoon humidity and family streaming. The glass door scenario is the unknown unknown where both VLM and lidar agree on the wrong answer. The VLM sees through the glass, and the lidar beam passes through at an angle. No temporal smoothing fixes systematic errors.","findings":["WiFi reliability under real conditions","Glass doors fool both sensors simultaneously"]},
    {"id":"lens-11","title":"Red Team Brief","category":"stress","text":"The CTO's challenge is hardest: why 2 billion parameters for 2 tokens? The value isn't in those tokens, it's in the 150M parameter vision encoder's scene understanding. Text output is lossy compression. Phase 2d makes this explicit by using encoder output directly. The competitor challenge is valid for obstacle avoidance but misses the point: Annie needs semantic understanding.","findings":["VLM value is in encoder, not text output","Semantic understanding justifies GPU cost"]},
    {"id":"lens-12","title":"Anti-Pattern Gallery","category":"stress","text":"Most seductive mistake: asking VLM for metric distance estimation. 'How far is that chair?' sounds natural, but monocular depth is fundamentally ambiguous. The research wisely keeps VLM outputs qualitative and uses lidar for all geometry. Other anti-patterns: trusting every frame, giving VLM override over lidar, fine-tuning on a specific environment.","findings":["Never use VLM for metric distance","Keep VLM outputs qualitative"]},
    {"id":"lens-13","title":"Constraint Analysis","category":"stress","text":"Two constraints are artificially imposed: single camera and 2D-only navigation. Both relaxable with 15 to 50 dollar hardware. The WiFi constraint is genuinely fragile. The constraint most likely relaxed by technology: VLM model size. In 2 years, 1B params will match today's 2B capability.","findings":["Single camera and 2D are artificial constraints","Model efficiency will relax VRAM constraint"]},
    {"id":"lens-14","title":"The Inversion","category":"generate","text":"Most productive inversion: lidar proposes, VLM confirms. Let SLAM compute the A* path, then ask the VLM 'is this clear?' once per second. This IS Phase 2c, reframed. The radical inversion: no explicit map at all. PRISM-TopoMap uses embeddings as spatial memory. This eliminates SLAM drift but can't do metric path planning.","findings":["Lidar proposes + VLM confirms is Phase 2c","No-map approach trades precision for simplicity"]},
    {"id":"lens-15","title":"Constraint Relaxation","category":"generate","text":"Most revealing: if accuracy dropped by half, you could run on Pi 5 CPU directly. No GPU, no WiFi, no Panda. The entire system complexity exists to push accuracy from barely useful to reliable. That last 40 percent of accuracy costs 10x the hardware. Constraint to be relaxed soonest: VRAM per model, as vision encoders get more efficient.","findings":["Last 40% accuracy costs 10x hardware","VRAM efficiency improving fast"]},
    {"id":"lens-16","title":"Composition Lab","category":"generate","text":"Killer combination: semantic map plus voice agent. 'Last time I was in the bedroom, I saw your glasses on the nightstand.' This bridges Context Engine conversation memory with navigation spatial memory. Neither alone is this powerful. Untried combination: VLM plus audio SLAM. Room acoustics differ between rooms and nobody uses this for indoor robots.","findings":["Spatial + voice memory is killer combination","Audio SLAM is untried opportunity"]},
    {"id":"lens-17","title":"Transfer Matrix","category":"generate","text":"The multi-query VLM pipeline transfers to any edge robot with a camera and VLM. Security patrols, greenhouses, retail all benefit from splitting VLM attention across tasks within the same frame budget. Startup opportunity: package as open-source ROS2 middleware before the space gets crowded.","findings":["Multi-query pattern is universally transferable","Open-source middleware opportunity"]},
    {"id":"lens-18","title":"Decision Tree","category":"apply","text":"Three binary questions determine fit: camera available, semantics needed, and edge VLM speed above 5 Hz. If all three are yes, use the VLM plus SLAM hybrid. Wrong for pure obstacle avoidance where lidar is cheaper. Wrong for cloud robots where latency kills fusion.","findings":["Three binary questions determine fit","Wrong for pure avoidance or cloud robots"]},
    {"id":"lens-19","title":"Scale Microscope","category":"apply","text":"Phase transition at roughly 10 robots: a dedicated GPU per robot breaks down and a shared inference service is needed. At 1000 robots, fleet learning is needed, which is Tesla's approach. Annie's architecture is explicitly artisanal, not industrial, and that's fine for her-os scope.","findings":["Phase transition at 10 robots","Architecture is artisanal by design"]},
    {"id":"lens-20","title":"Day-in-the-Life","category":"apply","text":"The 42-second journey from ask to answer reveals: the value isn't speed, walking is faster. It's not having to get up. The architecture serves a lifestyle use case. Debugging spans 4 tiers on 3 machines, which is the hidden cost of distributed hybrid systems.","findings":["Value is lifestyle, not efficiency","Debugging across 3 machines is hidden cost"]},
    {"id":"lens-21","title":"Stakeholder Kaleidoscope","category":"human","text":"There is a gap between Rajesh, who sees architecture, and Mom, who sees behavior. All the elegance of the 4-tier hierarchy is invisible to the user. Phase 2a has zero user-visible benefit unless paired with a new capability. The visitor's perspective is most underrepresented. A camera robot in someone's home triggers privacy instincts. Needs a privacy mode.","findings":["Architecture invisible to users","Privacy mode needed for visitors"]},
    {"id":"lens-22","title":"Learning Staircase","category":"human","text":"The plateau is SLAM integration. Sessions 86 through 92 were spent on ROS2, Docker, Zenoh transport, MessageFilter bugs. VLM side is straightforward prompting. SLAM requires deep ROS2 expertise. If packaged for others, SLAM deployment must be radically simplified.","findings":["SLAM integration is the adoption plateau","Docker Compose simplification critical"]},
    {"id":"lens-23","title":"Energy Landscape","category":"human","text":"Biggest barrier isn't technical, it's the good-enough incumbent. Lidar-only reactive navigation works for basic avoidance. VLM hybrid's activation energy is high until you need 'go to the kitchen'. Catalytic event: a pre-built Docker Compose that works in one command.","findings":["Good enough incumbent is main barrier","One-command deployment would catalyze adoption"]},
    {"id":"lens-24","title":"Gap Finder","category":"discover","text":"Most significant gap: night and low-light navigation. Indian homes have intermittent lighting. Solutions include IR illumination, lidar-only fallback, ambient light sensor. The voice plus navigation interaction gap is critical for her-os. Answering 'What do you see?' mid-navigation requires VLM query switching.","findings":["Night/low-light is biggest gap","Voice + nav interaction unaddressed"]},
    {"id":"lens-25","title":"Blind Spot Scan","category":"discover","text":"Cultural blind spot: Annie navigates an Indian home with room types the VLM won't recognize. Pooja room is unlikely in Gemma 4's training data. The CV researcher's point is valid: a 50 dollar depth camera eliminates the glass door problem. VLM value is specifically semantic understanding, not geometry.","findings":["Indian room types not in training data","Depth camera solves glass door problem"]},
    {"id":"lens-26","title":"Question Horizon","category":"discover","text":"Most exciting new question: can Annie build conversation maps fusing spatial memory with emotional context? Mom mentioned glasses, Annie saw them on the nightstand, and Mom sounded tired: together these enable proactive care. Most practical: what's the minimum VLM size? If 400M works on Pi 5 CPU, the entire Panda dependency disappears.","findings":["Conversation maps fuse spatial + emotional","Minimum VLM size could eliminate GPU dependency"]}
  ]
}
