LENS 03

Dependency Telescope

"What's upstream and downstream?"

Full Dependency Graph — VLM-Primary Hybrid Navigation

VLM-Primary Hybrid Navigation System
▲ UPSTREAM: Gemma 4 E2B model — Google-controlled, hosted on HuggingFace. No contractual SLA. Model retirement or architecture change breaks inference at 18ms/frame.
▲ llama-server (llama.cpp) — The inference server wrapping Gemma 4. Critical blocker: cannot expose intermediate multimodal embeddings. This single architectural gap blocks Phase 2d entirely.
▲ GGUF quantization format — llama.cpp's native format. Gemma 4 E2B is loaded as GGUF. If Gemma 5 ships in a new format llama.cpp doesn't support, the 54 Hz inference pipeline stalls until a new build is cut.
▲ Panda Jetson VRAM (8 GB) — Hard ceiling. VLM takes ~2.8 GB. SigLIP 2 workaround for embeddings costs +800 MB. Any model upgrade that pushes past 8 GB forces a hardware decision.
▲ UPSTREAM: Household WiFi 2.4/5 GHz — Uncontrolled shared medium. Pi 5 ↔ Panda latency baseline is ~8ms, but household contention can spike to 100ms+ (verified: Lens 04 WiFi cliff edge). At 100ms, the 54 Hz VLM pipeline is throttled to 10 Hz. No redundancy path — this is a single point of failure with no engineering mitigation available short of Ethernet.
▲ Pi 5 → Panda TCP/IP stack — Camera frames travel as base64 JPEG over HTTP POST. Frame size ~30–80 KB, inflated ~33% by base64 on the wire. At 54 Hz this is roughly 17–46 Mbps sustained. Household WiFi rarely sustains this under load.
▲ JPEG compression quality setting — A single config value. Too high: frames too large, WiFi saturates. Too low: VLM hallucinates from compression artifacts. No automated adaptation logic exists yet.
▼ MITIGATION AVAILABLE: Hailo-8 AI HAT+ on Pi 5 (currently idle) — 26 TOPS local NPU already on the robot. Runs YOLOv8n at ~430 FPS with zero WiFi traffic. Activating it as an L1 safety layer converts the WiFi cascade from "degrades → all three Phase 2 downstream phases degrade simultaneously" to "degrades → semantic features degrade, safety stays local." The dependency on WiFi for obstacle-avoidance stops being safety-critical. See Lens 13 for the opportunity framing.
▲ UPSTREAM: Phase 1 SLAM (slam_toolbox) — Prerequisite for Phases 2c, 2d, 2e. Three downstream phases are gated on one upstream deployment. SLAM health degrades silently: MessageFilter queue drops (~13% of scans) are normal, but a Zenoh session crash or IMU dropout stops localization without alerting the nav layer.
▲ rf2o lidar odometry — Provides the primary odometry signal feeding slam_toolbox. A dead RPLIDAR C1 (baud 460800, CCW angles) kills both rf2o and SLAM simultaneously. No backup odometry path.
▲ Pico RP2040 IMU bridge — MPU-6050 via USB serial at 100 Hz. Known failure mode: drops to REPL silently. When IMU goes down, slam_toolbox loses heading — localization degrades, Tier 4 kinematic correction stops. The only detection method is manual health polling.
▲ zenoh_ros2_sdk (rmw_zenoh_cpp) — Built from source (pinned afcd981). Wire-protocol version mismatch between apt (0.2.9) and source (1.7.1) was discovered in session 89. Currently not deployed on Pi 5. The SLAM+Zenoh bridge is implemented but undeployed.
▲ UPSTREAM: SigLIP 2 ViT-SO400M (workaround for Phase 2d) — 800 MB VRAM on Panda. Not yet deployed. Required because llama-server blocks direct embedding access. A dependency created by another dependency's limitation — a second-order upstream.
▲ UPSTREAM: DINOv2 / AnyLoc (Phase 2e) — Requires GPU. Either competes with VLM on Panda (risky) or runs on Titan (plenty of headroom). Phase 2e is the furthest downstream in the SLAM chain, with the most upstream dependencies stacked.
▼ DOWNSTREAM CONSUMERS (what VLM nav enables)
▼ Semantic Map (VLMaps pattern) — VLM scene labels attach to SLAM grid cells at current pose. Rooms emerge over time. Accidentally enables: floor plan extraction, room-change detection, "haven't been in the kitchen for 3 days" memory signal.
▼ Annie Voice Agent — Spatial Queries — If semantic map is built, Annie can answer "where is the charger?" from the map. This is a capability the research doesn't fully scope: the voice agent becomes spatially aware without any additional training. An accidentally powerful downstream.
▼ Context Engine — Spatial Memories — SLAM pose + scene label + timestamp = a structured spatial memory. "Annie was in the kitchen at 14:32" becomes a queryable fact. The Context Engine's entity extraction pipeline wasn't designed for spatial facts — a mismatch that creates integration work.
▼ Place Recognition / Loop Closure (Phase 2d/2e) — Embeddings stored keyed by SLAM pose enable "have I been here before?" — augmenting slam_toolbox's scan-matching loop closure with visual confirmation. When both agree: high-confidence loop closure. When they disagree: a new detection failure mode.
▼ Home Automation (future) — Room occupancy detected by VLM scene classification. "Annie is in the bedroom" becomes an event. Unplanned downstream: triggers lights, thermostat, camera privacy modes. No consent layer exists for this.
▼ Evaluation Framework (Phase 7 logging) — Phase 1 must log SLAM pose + camera frames + VLM outputs at 10 Hz for Phase 2 evaluation. This logging requirement changes Phase 1's storage budget: at the 30–80 KB frame sizes above, 10 Hz JPEG + pose is roughly 1–3 GB/hour of drive time — the ~50–100 MB/hour figure holds only if frames are aggressively downscaled before logging. Disk planning is a hidden downstream cost.
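The sustained-bitrate figure in the Pi 5 → Panda link entry above is easy to re-derive. A quick sketch, using the 30–80 KB frame sizes and 54 Hz rate from the graph; the ~33% base64 inflation is standard encoding overhead, and the helper itself is purely illustrative:

```python
# Sanity-check the Pi 5 -> Panda camera-link bandwidth at 54 Hz.
# Frame sizes (30-80 KB JPEG) and the base64-over-HTTP transport come
# from the dependency graph above; the rest is arithmetic.

BASE64_OVERHEAD = 4 / 3  # base64 encodes every 3 bytes as 4 ASCII chars

def link_mbps(frame_kb: float, hz: float) -> float:
    """Sustained bitrate in Mbit/s for base64 JPEG frames over HTTP."""
    wire_bytes = frame_kb * 1000 * BASE64_OVERHEAD
    return wire_bytes * hz * 8 / 1e6

low = link_mbps(30, 54)   # ~17.3 Mbps
high = link_mbps(80, 54)  # ~46.1 Mbps
```

Even the low end of that range is a meaningful sustained load for a contended 2.4 GHz channel, which is why the WiFi entry above has no mitigation short of Ethernet.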

The dependency telescope reveals a system that is far more fragile at its upstream joints than its engineering confidence suggests. The four-tier hierarchical fusion architecture — Titan at Tier 1, Panda VLM at Tier 2, Pi lidar at Tier 3, IMU at Tier 4 — reads as robust modularity. But each tier is tethered to an upstream it does not control. The most consequential of these is not the obvious WiFi dependency: it is llama-server's inability to expose intermediate multimodal embeddings. This single API gap in an open-source inference server blocks Phase 2d (embedding extraction + place memory) entirely, and forces the deployment of a separate SigLIP 2 model that consumes 800 MB of Panda's already-constrained 8 GB VRAM. A limitation in one upstream layer manufactured a hardware budget problem in another.
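The VRAM arithmetic behind that manufactured budget problem fits in a few lines. The 2.8 GB VLM and 0.8 GB SigLIP 2 figures come from the graph above; the runtime/OS reserve is an assumption added for illustration, not a measured value:

```python
# Rough Panda VRAM budget under the figures cited above. Only the first
# two entries come from the text; "runtime_reserve" is an ASSUMED
# allowance for CUDA context, buffers, and OS share.

VRAM_CEILING_GB = 8.0  # Panda Jetson hard ceiling (from text)

budget_gb = {
    "gemma_vlm": 2.8,          # Gemma 4 E2B via llama-server (from text)
    "siglip2_workaround": 0.8,  # embedding workaround (from text)
    "runtime_reserve": 1.0,     # ASSUMED: CUDA context, buffers, OS
}

used = sum(budget_gb.values())
headroom = VRAM_CEILING_GB - used  # what any model upgrade must fit into
```

Under these assumptions a model upgrade has ~3.4 GB to work with — and reclaiming the SigLIP 2 workaround's 0.8 GB is the single largest lever in the budget.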

The WiFi dependency is the system's hidden single point of failure — not because it is unknown, but because it has no engineering mitigation. Every other dependency has a documented workaround or fallback: if Gemma 4 E2B is retired, swap to a different GGUF model; if slam_toolbox stalls, restart the Docker container; if the IMU drops to REPL, soft-reboot the Pico. But if household WiFi degrades, the Pi-to-Panda camera link drops from 54 Hz to something below 10 Hz, and there is no fallback — the system runs degraded silently. Lens 04 identified this as the WiFi cliff edge at 100ms latency. What the Dependency Telescope adds is the cascade: degraded VLM throughput degrades scene classification, which degrades semantic map annotation quality, which degrades Phase 2c room labeling accuracy. A single uncontrolled RF environment poisons three downstream phases.

The Session 119 hardware audit surfaced a downstream-dependency mitigation hiding in plain sight: the Pi 5's Hailo-8 AI HAT+ is already on-robot and idle. Activating it as a local L1 safety layer (YOLOv8n at 430 FPS, zero WiFi) rewrites the cascade. "WiFi degrades → all three Phase 2 phases degrade" becomes "WiFi degrades → semantic features degrade, safety stays local." The dependency doesn't disappear — it gets demoted from safety-critical to semantic-only, which is exactly where an uncontrolled RF medium belongs.

The Phase 1 SLAM prerequisite chain deserves special attention because it is the upstream that gates the most downstream value. Phases 2c (semantic map annotation), 2d (embedding extraction and place memory), and 2e (AnyLoc visual loop closure) are all marked "requires Phase 1 SLAM deployed." This means three of the five Phase 2 phases — the three that deliver the most architectural novelty — are in a single-file queue behind one deployment. If Phase 1 SLAM suffers a persistent failure (Zenoh session crash, lidar dropout, IMU brownout), the downstream timeline does not slip by one phase, it slips by three simultaneously. The research acknowledges this in its probability table: Phase 2c is 65%, Phase 2d is 55%, Phase 2e is 50%. Those probabilities are not independent — they are conditionally dependent on the same upstream SLAM health.
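The non-independence claim can be made concrete with a toy calculation. The 65/55/50% marginals are from the research's probability table; the P(SLAM healthy) value is a hypothetical assumption chosen for illustration, and the phases are treated as conditionally independent given SLAM health:

```python
# Toy shared-cause model: Phases 2c/2d/2e each succeed only if Phase 1
# SLAM stays healthy. Marginals are from the research's probability
# table; p_slam = 0.7 is a HYPOTHETICAL value for illustration.

p_2c, p_2d, p_2e = 0.65, 0.55, 0.50
p_slam = 0.7  # assumed probability that SLAM stays healthy

# Naive reading: treat the three phase outcomes as independent.
p_all_independent = p_2c * p_2d * p_2e  # ~0.179

# Shared-cause reading: each marginal decomposes as
# p_phase = p_slam * P(phase | SLAM healthy), so the joint is one draw
# of SLAM health followed by three conditional draws:
p_all_shared = p_slam * (p_2c / p_slam) * (p_2d / p_slam) * (p_2e / p_slam)
# = (p_2c * p_2d * p_2e) / p_slam**2, ~0.365
```

Under the shared-cause reading the phases succeed or fail together: the all-three-succeed probability roughly doubles relative to the independent reading, but so does the all-three-fail mass — which is exactly the "slips by three simultaneously" risk described above.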

The downstream surprises are equally instructive. The research frames the semantic map as a navigation primitive — rooms labeled on a grid. But the voice agent downstream consumer converts that primitive into a qualitatively different capability: spatial memory answerable by voice. Annie can tell you where the charger is, when she last visited the kitchen, or whether the living room is currently occupied — without any additional training, purely because scene labels are attached to SLAM poses. The Context Engine similarly receives a capability it was not designed for: spatial facts in its entity index. Neither downstream consumer is mentioned in the research roadmap. The most valuable accidental enablement is the one most likely to create an integration mismatch when it arrives.

Highest-leverage blocker: llama-server's inability to expose multimodal embeddings. Fixing this — either by patching llama-server upstream or switching to a server that supports embedding extraction (e.g., a raw Python inference script) — would unblock Phase 2d without any hardware change and reclaim 800 MB of Panda VRAM. Cost: 1–2 engineering sessions. Value: removes a second-order dependency that created a hardware budget constraint.

Hidden single point of failure: Household WiFi. Unlike every other dependency, WiFi has no programmatic fallback. The system runs degraded silently when it saturates. A watchdog that detects round-trip latency above 80ms and switches the VLM query rate down from 54 Hz to 10 Hz — with an alert to Annie — would convert a silent failure into a managed degradation.
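The proposed watchdog is small enough to sketch. The 80ms threshold and 54 → 10 Hz rates come from the text; the alert hook and the hysteresis-free transition logic are simplifying assumptions — a production version would smooth RTT samples and debounce state changes to avoid flapping:

```python
# Sketch of the latency watchdog proposed above: sample Pi<->Panda
# round-trip latency, throttle the VLM query rate past a threshold, and
# surface an alert instead of degrading silently. The alert hook is a
# placeholder; thresholds are the figures from the text.

from dataclasses import dataclass

@dataclass
class VlmRateWatchdog:
    normal_hz: float = 54.0          # full VLM pipeline rate
    degraded_hz: float = 10.0        # managed-degradation rate
    latency_threshold_ms: float = 80.0
    degraded: bool = False

    def update(self, rtt_ms: float) -> float:
        """Return the VLM query rate to use given the latest RTT sample."""
        if rtt_ms > self.latency_threshold_ms and not self.degraded:
            self.degraded = True
            self.alert(f"WiFi RTT {rtt_ms:.0f} ms: throttling VLM to "
                       f"{self.degraded_hz:.0f} Hz")
        elif rtt_ms <= self.latency_threshold_ms and self.degraded:
            self.degraded = False
            self.alert("WiFi recovered: restoring full VLM rate")
        return self.degraded_hz if self.degraded else self.normal_hz

    def alert(self, msg: str) -> None:
        # Placeholder: the real system would notify Annie's voice agent
        # and the logging layer rather than print.
        print(msg)
```

The point of the sketch is the state transition, not the numbers: a silent 54 → 10 Hz collapse becomes an explicit, announced mode change.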

Most likely to change in 2 years: Gemma 4 E2B model. Google's model release cadence (Gemma 2, Gemma 3, Gemma 4 all within 18 months) makes a Gemma 5 or successor highly probable before Phase 2e is deployed. The architecture is correctly abstracted — _ask_vlm(image_b64, prompt) is model-agnostic — but the GGUF conversion + llama.cpp compatibility step will need re-validation for each new model generation.
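The model-agnostic seam the text describes shows up in the request shape: only the model field changes across generations. A sketch assuming llama-server's OpenAI-compatible chat endpoint — the exact multimodal message layout should be checked against each deployed server build, which is precisely the per-generation re-validation step noted above:

```python
# Sketch of a model-agnostic _ask_vlm payload, mirroring the abstraction
# the text describes. ASSUMES an OpenAI-style chat-completions endpoint
# with data-URI image parts; the model name "gemma-vlm" is illustrative.

import json

def build_vlm_request(image_b64: str, prompt: str,
                      model: str = "gemma-vlm") -> str:
    """Build a chat-completions request body. Swapping model
    generations only changes the `model` field, not the call site."""
    payload = {
        "model": model,  # the only model-specific knob at the call site
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }
    return json.dumps(payload)
```

What the wrapper cannot abstract away is the layer beneath it: the GGUF conversion and llama.cpp build compatibility still have to be re-proven for each new model, regardless of how clean the call site stays.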

Accidental downstream: Voice-queryable spatial memory. When the semantic map is built, the voice agent inherits spatial awareness for free. This capability is unplanned and unscoped — it will arrive before anyone has designed a consent model for "Annie, who was in my bedroom yesterday?"

Downstream dependency demotion (mitigation available): Hailo-8 AI HAT+ on the Pi 5 is on-hand hardware, currently idle, capable of YOLOv8n at 430 FPS with zero WiFi traffic. Activating it as an L1 safety layer converts WiFi from a safety-critical dependency into a semantic-only dependency — the cascade "WiFi degrades → 3 Phase 2 phases degrade" becomes "WiFi degrades → semantic features degrade, safety stays local." This is the highest-leverage dependency restructuring available without new hardware purchase. See Lens 13.

If llama-server gained native multimodal embedding extraction tomorrow — what breaks first at scale?

The storage layer. At 54 Hz, extracting 280-token embedding vectors produces roughly 280 × 4 bytes × 54 frames/second = ~60 KB/s of raw float data while the robot operates. Over a 2-hour exploration session: ~432 MB of embeddings — before any SLAM pose metadata. The topological place graph would need both an in-memory index for cosine similarity queries and a persistent store for session-to-session place memory. Neither exists. The research proposes storing embeddings "keyed by (x, y, heading) from SLAM" without addressing deduplication: if Annie traverses the same hallway 50 times, she accumulates 50 nearly-identical embeddings for the same place. The query cost of a 50,000-embedding cosine search at navigation speed is unaddressed. The dependency telescope reveals that unblocking llama-server immediately creates a data engineering dependency that doesn't yet exist.
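One shape the missing deduplication logic could take is a similarity gate at write time. A minimal sketch, assuming embeddings arrive as plain float lists; the 0.98 cosine threshold is an arbitrary illustrative choice, and a real store would replace the linear scan with an approximate-nearest-neighbor index:

```python
# Minimal write-time dedup gate for place embeddings: keep a new vector
# only if no stored vector is near-identical. Pure-Python cosine for
# clarity; the 0.98 threshold is an ILLUSTRATIVE guess, and the linear
# scan is fine at ~10^3 entries but not at the 50,000 mentioned above.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class PlaceStore:
    def __init__(self, dedup_threshold: float = 0.98):
        self.threshold = dedup_threshold
        # each entry: ((x, y, heading) SLAM key, embedding vector)
        self.entries: list[tuple[tuple[float, float, float], list[float]]] = []

    def add(self, pose_xyh: tuple[float, float, float],
            emb: list[float]) -> bool:
        """Store the embedding unless a near-duplicate already exists.
        Returns True if stored, False if dropped as a repeat visit."""
        for _, kept in self.entries:
            if cosine(kept, emb) >= self.threshold:
                return False  # same hallway, again: drop the duplicate
        self.entries.append((pose_xyh, emb))
        return True
```

Even this naive gate caps the 50-traversal hallway at one stored vector; what it does not solve is the query-latency problem, which needs an index, not a filter.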
