LENS 12: Anti-Pattern Gallery

"What looks right but leads nowhere?" This lens catalogues seven recurring mistakes in VLM-primary hybrid navigation — patterns that feel correct when first encountered and become costly only after the system has been running for a while.

---

ANTI-PATTERN 1: More frames equals better navigation.

The wrong approach is to run the same goal-tracking question — "Where is the goal?" — on every frame at 54 to 58 hertz. This feels like maximum attentiveness: the model is never idle. It shipped as the obvious first implementation in session 79.

The hidden cost: one task monopolises every frame. The robot is blind to room context, obstacle class, and place memory. At 58 hertz, consecutive frames differ by less than 1.7 centimetres of robot travel, so the 58th answer contains almost no information the first answer didn't already contain.

The correct pattern: rotate four different tasks across the same 58-hertz budget. Goal tracking at roughly 29 hertz, scene classification at 10 hertz, obstacle description at 10 hertz, and embedding extraction for place recognition at 10 hertz. Each task gets the model's full attention on its dedicated frame. EMA smoothing with alpha 0.3 filters single-frame hallucinations across the goal-tracking frames. This is a one-line change in NavController's run loop: same throughput, four times richer perception. (A scheduler sketch follows Anti-Pattern 3 below.)

---

ANTI-PATTERN 2: An end-to-end neural planner is more elegant.

Tesla FSD version 12 replaced 300,000 lines of C++ with a single neural network. The papers on RT-2, OpenVLA, and pi-zero report impressive numbers. The natural conclusion for Annie is a custom vision-language-action model trained end to end.

The flaw: Tesla trained on millions of miles of driving. RT-2 required millions of robot demonstrations. Annie has one robot. End-to-end neural planners require fleet-scale data that does not exist at this project's scale.

The correct pattern, validated by OK-Robot at NYU in 2024: pragmatic integration of off-the-shelf components. OK-Robot achieved 58.5 percent pick-and-drop success in real homes using only CLIP, LangSam, and AnyGrasp — entirely off the shelf. Their explicit finding: "What really matters is not fancy models but clean integration." Annie's NavController already follows this principle. The research endorses the existing architecture, not as a stopgap but as the correct long-term approach.

---

ANTI-PATTERN 3: The VLM sees the world — why run lidar separately?

If the VLM can say "wall ahead" and "chair on the left," it's tempting to cut the lidar pipeline entirely. Fewer moving parts: no RPLIDAR driver, no MessageFilter queue-drop grief. The VLM even catches above-lidar-plane hazards: shelves, hanging objects, table edges.

The failure mode is the glass door problem. A monocular camera cannot distinguish a transparent obstacle from open space. Lidar measures geometry physically, from reflected photons; the VLM guesses geometry from learned priors. When the prior is wrong, the robot drives into the obstacle.

The correct pattern, stated in the research as the fusion rule: VLM proposes, lidar disposes, IMU corrects. Tier 3 — the Pi lidar and SLAM stack — holds absolute ESTOP priority over Tier 2's VLM. VLM obstacle descriptions become semantic labels on lidar-detected clusters: lidar says where, the VLM says what. Neither replaces the other. The ESTOP chain is the only line of defence between a one-metre-per-second robot and a broken piece of furniture. (The fusion rule is sketched below.)
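The rotation in Anti-Pattern 1 is easy to pin down in code. A minimal sketch, assuming hypothetical names: run_vlm_task and PerceptionState stand in for Annie's actual API, which is not shown here. The 29/10/10/10 split and the alpha of 0.3 come from the text.

```python
from dataclasses import dataclass
from itertools import cycle

ALPHA = 0.3  # EMA weight from the text; low alpha damps single-frame hallucinations

def run_vlm_task(frame, task: str):
    """Hypothetical stand-in for the per-frame VLM query; the real call is not shown."""
    raise NotImplementedError

@dataclass
class PerceptionState:
    goal_bearing_ema: float | None = None  # smoothed goal direction, ~29 Hz
    scene_label: str = "unknown"           # room context, ~10 Hz
    obstacles: list | None = None          # obstacle descriptions, ~10 Hz
    place_embedding: list | None = None    # place-recognition vector, ~10 Hz

# Non-goal tasks share the odd frames round-robin: roughly 10 Hz each at 58 fps.
_secondary = cycle(["scene", "obstacle", "embedding"])

def on_frame(frame, state: PerceptionState, frame_idx: int) -> None:
    """One VLM query per frame; goal tracking takes every other frame (~29 Hz)."""
    if frame_idx % 2 == 0:
        bearing = run_vlm_task(frame, "goal")
        if bearing is not None:
            prev = state.goal_bearing_ema
            state.goal_bearing_ema = (
                bearing if prev is None
                else ALPHA * bearing + (1 - ALPHA) * prev
            )
        return
    task = next(_secondary)
    if task == "scene":
        state.scene_label = run_vlm_task(frame, task)
    elif task == "obstacle":
        state.obstacles = run_vlm_task(frame, task)
    else:
        state.place_embedding = run_vlm_task(frame, task)
```

The dispatch on frame_idx is the entire change: same 58-hertz throughput, four tasks sharing it.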
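The fusion rule from Anti-Pattern 3, under the same caveat: LidarCluster, the 0.30-metre ESTOP range, and the bearing gate are illustrative stand-ins, not Annie's tuned values. The shape that matters is that the stop decision reads only lidar geometry, and the VLM contributes nothing but labels.

```python
from dataclasses import dataclass

ESTOP_RANGE_M = 0.30    # illustrative reflex threshold, metres
BEARING_GATE_RAD = 0.2  # ~11 degrees; how close a VLM label must sit to a cluster

@dataclass
class LidarCluster:
    bearing_rad: float
    range_m: float
    label: str = "unknown"  # filled by the VLM; never trusted for geometry

def fuse(clusters: list[LidarCluster], vlm_labels: dict[float, str]):
    """VLM proposes, lidar disposes: the ESTOP decision reads only lidar ranges."""
    estop = any(c.range_m < ESTOP_RANGE_M for c in clusters)  # physics, no priors
    for c in clusters:
        if not vlm_labels:
            continue
        # Attach the semantic label whose reported bearing best matches the cluster.
        nearest = min(vlm_labels, key=lambda b: abs(b - c.bearing_rad))
        if abs(nearest - c.bearing_rad) < BEARING_GATE_RAD:
            c.label = vlm_labels[nearest]  # lidar says where, VLM says what
    return estop, clusters
```

A glass door appears here as a cluster labelled "unknown" with a shrinking range_m; the ESTOP fires regardless.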
---

ANTI-PATTERN 4: Switch to the bigger Titan model for better navigation decisions.

Gemma 4 26 billion on Titan is the project's most capable model: 50 tokens per second, 128K context, full reasoning capability. When Gemma 4 E2B on Panda gives shaky navigation — session 92 confirmed the 2-billion-parameter model always says FORWARD into walls — the obvious fix is to route navigation queries to Titan.

The temporal math destroys this reasoning. Titan 26 billion runs at roughly 2 hertz for image-plus-navigation queries, accounting for network latency and generation time. At 2 hertz, the robot travels 50 centimetres between decisions at walking speed. By the time each Titan answer arrives, the scene has already changed: single-frame quality is higher, but temporal consistency is gone. Session 92's explore-dashboard tested this directly — routing navigation to Titan produced visibly worse driving than Panda E2B. The data corrected the intuition.

The correct pattern: a fast small model with EMA smoothing beats a slow big model for reactive steering. GR00T N1 from NVIDIA encodes this architecturally: VLM at 10 hertz, motor outputs at 120 hertz. Tesla runs perception at 36 hertz and planning at a lower frequency. The pattern is universal: high-frequency cheap inference for reactive control, low-frequency expensive inference for strategy. Titan 26 billion belongs at Tier 1 — strategic planning at 1 to 2 hertz — not Tier 2 reactive steering. (The arithmetic is worked in code after Anti-Pattern 6.)

---

ANTI-PATTERN 5: Build the map to navigate.

The traditional robotics curriculum teaches: SLAM produces a 2D occupancy grid, path planners find collision-free routes through it, the robot follows the path. The map is infrastructure for navigation. This is correct, useful, and exactly what every robotics course teaches. The natural next step after Phase 1 SLAM is therefore to wire up Nav2 and navigate waypoints on the grid.

But this view treats the map as a transient navigation aid — rebuilt each session, discarded when the robot stops. It throws away the most valuable thing the robot accumulates over time: persistent spatial memory of where things are and what they mean.

The correct pattern, demonstrated by Google's VLMaps at ICRA 2023: attach VLM scene labels to SLAM grid cells at each robot pose during exploration. Over dozens of sessions, cells accumulate semantic labels. Kitchen confidence grows on the cluster of cells near the stove; hallway confidence grows on the narrow corridor cells. "Where is the kitchen?" becomes a query against accumulated knowledge, not a real-time VLM call on an unknown environment. (A sketch of this accumulation also follows Anti-Pattern 6.) Waymo encodes the same principle: pre-built HD maps store all static structure, and perception focuses only on dynamic changes. Annie's SLAM map is not throw-away scaffolding. It is the beginning of her persistent spatial memory — the substrate on which the semantic knowledge graph lives.

---

ANTI-PATTERN 6: Route safety-critical inference through WiFi when a local NPU exists.

Annie's Pi 5 carries a Hailo-8 AI HAT+ at 26 TOPS that has sat idle for months. Meanwhile, safety-critical obstacle detection is routed over WiFi to Panda's remote GPU. The intuition — centralise the smart compute, keep the edge dumb — is wrong for reflex-path inference. A one-metre-per-second robot covers 30 centimetres in 300 milliseconds of WiFi jitter. That is a broken piece of furniture or a dent in a wall.

The correct pattern: fast-reactive inference lives on whatever compute is physically closest to the actuator. YOLOv8n on the Hailo-8 runs at 430 frames per second with under 10 milliseconds of local inference and zero WiFi dependency. The VLM on Panda stays as the slow semantic layer at 18 milliseconds plus WiFi round-trip for "what room is this?"-shaped questions. For a future Orin-NX-equipped robot, the same rule applies: keep obstacle detection onboard, not in the cloud. Safety latency budgets must not depend on networks the system doesn't control. (The reflex path is sketched below.)
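Anti-Pattern 4's temporal math, worked as code. The rates and the one-metre-per-second speed come from the text; route and its query taxonomy are an illustrative sketch of the tiering, not the project's actual dispatcher.

```python
PANDA_E2B_HZ = 58.0    # fast small model: reactive steering (Tier 2)
TITAN_26B_HZ = 2.0     # big model, with network and generation latency (Tier 1)
ROBOT_SPEED_MPS = 1.0  # walking speed used throughout this lens

def blind_travel_m(decision_hz: float) -> float:
    """Distance the robot covers while waiting for the next model answer."""
    return ROBOT_SPEED_MPS / decision_hz

assert blind_travel_m(TITAN_26B_HZ) == 0.50  # 50 cm of blind travel per Titan decision
assert blind_travel_m(PANDA_E2B_HZ) < 0.018  # under 2 cm per E2B frame

def route(query: str) -> str:
    """Route by the decision rate a task must sustain, not by single-frame quality."""
    return "titan_26b" if query == "strategy" else "panda_e2b"
```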
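The VLMaps-style accumulation from Anti-Pattern 5, as a minimal sketch. Per-cell label counts stand in for whatever confidence model the real system would use, and the SLAM pose-to-cell lookup is assumed rather than shown.

```python
from collections import defaultdict

# semantic_grid[(row, col)]["kitchen"] -> number of times the VLM said "kitchen"
# while the robot's SLAM pose sat in that grid cell.
semantic_grid: dict = defaultdict(lambda: defaultdict(int))

def observe(cell: tuple, vlm_label: str) -> None:
    """Attach one VLM scene label to the grid cell under the current robot pose."""
    semantic_grid[cell][vlm_label] += 1

def where_is(label: str, min_count: int = 3) -> list:
    """Answer 'where is the kitchen?' from accumulated memory, no live VLM call."""
    hits = []
    for cell, counts in semantic_grid.items():
        n = counts.get(label, 0)
        if n >= min_count and n == max(counts.values()):  # label dominates the cell
            hits.append(cell)
    return hits
```

Over dozens of sessions, where_is("kitchen") converges on the cells near the stove without a single ask-time VLM call.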
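The reflex-path rule from Anti-Pattern 6, sketched. The detector call is a placeholder (the real HailoRT pipeline is not shown), but the constraint the sketch encodes is the point: nothing on this path may touch a socket.

```python
def hailo_yolov8n_detect(frame) -> list:
    """Placeholder for the on-HAT YOLOv8n call; the actual Hailo API is not shown."""
    raise NotImplementedError

def reflex_obstacle_stop(frame) -> bool:
    """Tier 3 reflex: runs entirely on the Pi + Hailo-8 in under 10 ms.
    No network call is permitted here, so WiFi jitter can never stretch
    the safety budget into the 30-centimetres-in-300-ms failure."""
    return "obstacle" in hailo_yolov8n_detect(frame)
```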
---

ANTI-PATTERN 7: Use a vision-language model for a known fiducial target.

The charging dock carries an ArUco marker — a deterministic 6-by-6 bit pattern, dictionary 50, id 23. The modern instinct is to ask Gemma 4 E2B "is there a marker in view?" because the VLM is already running. It can read text, count objects, describe scenes. Surely it can spot a square.

But VLM inference costs roughly 18 milliseconds on Panda's GPU plus WiFi round-trip, produces non-deterministic free text, and is prone to hallucination on partial or occluded markers. Meanwhile, cv2.aruco.ArucoDetector plus cv2.solvePnP runs at 78 microseconds per call on the Pi ARM CPU. No GPU, no network, pure OpenCV. That is a 230-times speedup over the VLM round-trip, with deterministic sub-pixel corners and an exact 6-DoF pose transform. (The classical path is sketched at the end of this lens.)

The rule: VLMs are for semantic understanding of unknown targets — "is there a chair?", "is this the kitchen?". Classical CV is for known shapes with deterministic detectors: ArUco markers, AprilTags, QR codes, known logos. Asking a VLM to do fiducial detection is paying for generality that isn't needed and losing determinism that is. Annie's homing implementation already does this right.

---

The anti-patterns in this gallery share a common structure. They are all locally optimal choices that look correct when evaluated at a single decision point but accumulate cost over time. Running one query at maximum frequency is locally fast. Routing to the bigger model is locally more capable. End-to-end neural is locally more elegant. Treating SLAM as navigation infrastructure is locally simpler. Each becomes an anti-pattern only when evaluated across the system's full operational lifetime — across hundreds of navigation sessions, a home that changes, and a robot that should get smarter rather than restart from zero every time it boots.

Two of these anti-patterns were hit in production before this research was written. Ollama's Go wrapper added 110 milliseconds of overhead per call and was retired in session 67 — the clean-integration anti-pattern in practice. IndicF5 wasted 2.8 gigabytes of VRAM on a TTS model that served no active need — the bigger-model anti-pattern applied to speech. Both were discovered by measurement, not intuition. The lesson: always instrument the thing you think is working.

Two further anti-patterns surfaced during session 119's hardware audit, and both share a root cause: a mismatched inference mechanism. Routing safety through WiFi ignored the idle local NPU. Asking the VLM to detect ArUco markers paid for semantic flexibility where a deterministic classical detector would do the job 230 times faster. The IROS dual-process paper, arXiv 26-01-21506, measured the payoff — a 66 percent latency reduction and 67.5 percent navigation success versus 5.83 percent for VLM-only — when reactive perception runs locally and semantic reasoning runs elsewhere.

The meta-rule binding both anti-patterns: match the inference mechanism to the signal's predictability. Classical CV for known-geometry detectors. Fast local NPUs for reactive safety. VLMs for semantic unknowns. LLMs for strategy. Wrong-layer inference is the dominant failure mode across this gallery.
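The 230-times claim from Anti-Pattern 7 is concrete enough to sketch in full, using the standard OpenCV 4.7+ ArucoDetector API. The dictionary and id come from the text; the 10-centimetre marker size and the calibration inputs camera_matrix and dist_coeffs are illustrative placeholders. This is a sketch of the standard OpenCV route, not necessarily Annie's actual homing code.

```python
import cv2
import numpy as np

# Known fiducial: 6x6 dictionary of 50 markers; the dock carries id 23.
DOCK_ID = 23
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

MARKER_SIZE_M = 0.10  # illustrative physical edge length, metres
# Marker corners in the marker frame (z = 0), in ArUco's TL, TR, BR, BL order.
half = MARKER_SIZE_M / 2
OBJ_PTS = np.array(
    [[-half, half, 0], [half, half, 0], [half, -half, 0], [-half, -half, 0]],
    dtype=np.float32,
)

def dock_pose(gray, camera_matrix, dist_coeffs):
    """Deterministic 6-DoF dock pose from the id-23 marker, or None if unseen."""
    corners, ids, _rejected = detector.detectMarkers(gray)
    if ids is None or DOCK_ID not in ids.flatten():
        return None
    img_pts = corners[list(ids.flatten()).index(DOCK_ID)].reshape(4, 2)
    ok, rvec, tvec = cv2.solvePnP(OBJ_PTS, img_pts, camera_matrix, dist_coeffs)
    return (rvec, tvec) if ok else None  # sub-pixel corners, exact transform
```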
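The meta-rule itself compresses into a dispatch table. The layer names are this gallery's own; the query taxonomy is assumed for illustration.

```python
# Match the inference mechanism to the signal's predictability.
INFERENCE_LAYER = {
    "known_fiducial":   "classical_cv",  # ArUco / AprilTag: deterministic, ~78 us
    "reactive_safety":  "local_npu",     # Hailo-8 YOLOv8n: <10 ms, no network
    "semantic_unknown": "remote_vlm",    # Panda E2B: ~18 ms plus WiFi round-trip
    "strategy":         "remote_llm",    # Titan 26B: 1 to 2 Hz, full reasoning
}

def layer_for(query_kind: str) -> str:
    """Wrong-layer inference is the dominant failure mode; route before asking."""
    return INFERENCE_LAYER[query_kind]
```

Each of the seven anti-patterns above is, in these terms, a query filed under the wrong key.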