LENS 12

Anti-Pattern Gallery

"What looks right but leads nowhere?"

ANTI-PATTERN 1

"Run the same query as fast as possible."

Annie's original loop fires the goal-tracking question "Where is the [goal]?" on every frame at 54–58 Hz. It feels maximally attentive — the model is never idle. This is the obvious implementation and it ships in session 79.

The cost: one task monopolises all frames. The robot is blind to room context, obstacle class, and whether it has visited this place before. Single-frame hallucinations (2% of outputs) pass directly to the motor command with no smoothing.

Result: fast but narrow. 58 Hz of the same question is redundant — consecutive frames differ by <1.7 cm of robot travel. The 58th answer adds almost nothing the 1st answer didn't contain.
vs

CORRECT PATTERN 1

"Rotate 4 different tasks across the same 58 Hz budget."

The research's Phase 2a proposal: alternate goal-tracking, scene classification, obstacle description, and path assessment across consecutive frames. Even a naive even split keeps each task at ~14–15 Hz, faster than most robot SLAM loops (10 Hz); the actual proposal weights goal-tracking more heavily.

Nav decisions: 29 Hz. Scene labels: 10 Hz. Obstacle class: 10 Hz. Place embeddings: 10 Hz. The model's full attention lands on each task on its dedicated frame. EMA (alpha=0.3) on the 29 Hz goal-tracking stream smooths single-frame glitches.

Result: same 58 Hz throughput, 4× richer perception. Implemented as cycle_count % N dispatch in NavController._run_loop() — a one-line change.
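The rotation can be sketched as a modulo dispatch. The task names and the exact weighting here are illustrative assumptions; the real version is the cycle_count % N dispatch inside NavController._run_loop():

```python
from collections import Counter

# Hypothetical task names; the actual dispatch lives in NavController._run_loop().
TASKS = ["goal_tracking", "scene_class", "obstacle_desc", "path_assess"]

def task_for_frame(cycle_count: int) -> str:
    # Goal-tracking gets every even frame (~29 Hz at 58 Hz total);
    # the other three tasks rotate through the odd frames (~10 Hz each).
    if cycle_count % 2 == 0:
        return "goal_tracking"
    return TASKS[1 + (cycle_count // 2) % 3]

# Over one second of frames the budget splits roughly 29/10/10/9.
counts = Counter(task_for_frame(c) for c in range(58))
```

Every frame still carries exactly one inference call, so throughput is unchanged; only the question varies.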

ANTI-PATTERN 2

"A custom end-to-end neural planner is more elegant."

Tesla FSD v12 replaced 300,000 lines of C++ with a single neural net. The narrative is compelling: one model, no hand-written rules, everything learned end-to-end. The natural extrapolation for Annie is a custom VLA — a model trained to map images directly to motor commands.

The seduction: research papers report impressive numbers. RT-2, OpenVLA, pi0 all show image → action working. End-to-end "feels" like the right direction of travel.

Reality check: Tesla trained on millions of miles of real driving. RT-2 requires millions of robot demonstrations. Annie has one robot. End-to-end neural planners require fleet-scale data that doesn't exist at this project's scale.
vs

CORRECT PATTERN 2

"Pragmatic integration of off-the-shelf components."

OK-Robot (NYU, CoRL 2024) achieved 58.5% pick-and-drop success in real homes using only CLIP + LangSam + AnyGrasp — entirely off-the-shelf. Their explicit finding: "What really matters is not fancy models but clean integration."

Annie's current architecture already follows this. SLAM handles geometry. VLM handles semantics. LLM handles planning. IMU handles heading. Each component is independently testable and replaceable. The research endorses this as the correct architecture — not as a stopgap until a custom model can be trained.

The existing NavController architecture (sessions 79–83) is already correct for Tiers 2–4. The research says so explicitly. Don't rewrite it chasing an end-to-end ideal.

ANTI-PATTERN 3

"The VLM sees the world — why run lidar separately?"

If Gemma 4 E2B can say "wall ahead" and "chair on the left," it's tempting to treat the VLM as a complete sensor and cut the lidar pipeline. Fewer moving parts. No serial port, no RPLIDAR driver, no MessageFilter queue-drop grief (the session 89 bug cost three full sessions to fix).

The VLM even catches above-lidar-plane hazards: shelves, hanging objects, table edges. In some scenarios it provides more context than 2D lidar. This feels like an upgrade.

The glass door problem: a monocular camera cannot distinguish a transparent obstacle from open space. Lidar measures geometry physically — reflected photons. VLM guesses geometry from learned priors. When the prior is wrong (unmarked glass, mirror, unexpected furniture placement), the robot drives into the obstacle.
vs

CORRECT PATTERN 3

"VLM proposes, lidar disposes — they are complementary, not redundant."

The research's fusion rule states this directly: "VLM proposes, lidar disposes, IMU corrects." The 4-tier architecture enforces it structurally: Tier 3 (Pi lidar + SLAM) has absolute ESTOP priority over Tier 2 (Panda VLM).

Waymo's architecture validates the principle at scale: camera gives semantics, lidar gives geometry, radar gives velocity. Each does something the others cannot. Reducing one to a subordinate of another destroys the complementarity.

Concretely: VLM obstacle descriptions ("chair") become semantic labels on lidar-detected clusters. The lidar says where. The VLM says what. Neither replaces the other.
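A minimal fusion sketch, assuming lidar clusters arrive as robot-frame centroids and VLM detections arrive as label-plus-bearing pairs — both shapes are hypothetical, not Annie's actual message types:

```python
import math

def label_clusters(clusters, vlm_detections, max_bearing_err=0.2):
    """Attach VLM labels to lidar-detected clusters.

    clusters: [(x, y)] centroids in the robot frame (metres).
    vlm_detections: [(label, bearing_rad)] parsed from the VLM's description.
    Returns [(x, y, label)] — lidar says where, the VLM says what.
    """
    labeled = []
    for x, y in clusters:
        bearing = math.atan2(y, x)
        best = min(vlm_detections, key=lambda d: abs(d[1] - bearing), default=None)
        if best is not None and abs(best[1] - bearing) <= max_bearing_err:
            labeled.append((x, y, best[0]))
        else:
            labeled.append((x, y, "unknown"))
    return labeled
```

A cluster with no matching VLM bearing stays "unknown" but is still an obstacle — the geometric detection never depends on the semantic one.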

The ESTOP chain — lidar → sonar → ESTOP — is the only line between a 1 m/s robot and a broken piece of furniture. VLM is an advisor, not a brake.

ANTI-PATTERN 4

"Switch to the 26B Titan model for better nav decisions."

Gemma 4 26B on Titan is the project's most capable model: 50.4 tok/s, 128K context, thinking enabled, handles complex multi-tool orchestration. When the E2B 2B model on Panda gives shaky navigation (session 92: "E2B always says FORWARD into walls"), the obvious fix is to route navigation queries to the bigger model.

This was actually tried in session 92 with the explore-dashboard. Larger model, richer reasoning, better spatial understanding. Seems straightforward.

At 26B on Titan: ~2 Hz inference rate (network latency + generation time). At 2 Hz, the robot travels 50 cm between decisions at 1 m/s. Single-frame quality is higher, but temporal consistency is destroyed — the robot is navigating on stale data by the time each answer arrives.
vs

CORRECT PATTERN 4

"Fast small model + EMA smoothing > slow big model."

The research's temporal consistency analysis is definitive: at 58 Hz, consecutive frames differ by <1.7 cm. EMA with alpha=0.3 across five consistent frames (86 ms) effectively suppresses the 2% single-frame hallucination rate. The architecture produces a smoothed, reliable signal from an individually noisy source.

GR00T N1 (NVIDIA) runs its VLM at 10 Hz and action outputs at 120 Hz — the VLM is the slow strategic layer, not the fast reactive layer. Tesla runs perception at 36 Hz, planning at lower frequency. The pattern is universal: high-frequency cheap inference for reactive control; low-frequency expensive inference for strategy.

The correct use of Titan 26B is Tier 1 strategic planning ("go to the kitchen" → waypoints on SLAM map, 1–2 Hz). Not Tier 2 reactive steering.

Annie's current architecture already has this right: Panda E2B at 54 Hz for steering (Tier 2), Titan 26B at 1–2 Hz for goal interpretation (Tier 1). Session 92's explore-dashboard failure confirmed the anti-pattern experimentally.

ANTI-PATTERN 5

"SLAM is for finding paths. Build the map, then navigate it."

The traditional robotics framing: SLAM produces a metric 2D occupancy grid; A* or Nav2 finds collision-free paths through it; the robot follows the path. The map is infrastructure for the planner. It is correct, useful, and exactly what every robotics course teaches.

The natural next step after Phase 1 SLAM is therefore to wire up Nav2 and send the robot from waypoint to waypoint using the grid. This is what "VLM-primary SLAM" sounds like when heard through the robotics curriculum.

This view treats the map as a transient navigation aid — rebuilt each session, discarded when the robot stops. It throws away the most valuable thing the robot accumulates over time: a persistent spatial memory of where things are and what they mean.
vs

CORRECT PATTERN 5

"Build the map to remember — navigation is a side effect."

The VLMaps insight (Google, ICRA 2023): attach VLM scene labels to SLAM grid cells at each robot pose during exploration. Over dozens of sessions, cells accumulate semantic labels — "kitchen" confidence grows on the cluster of cells near the stove; "hallway" confidence grows on the narrow corridor cells.

The Waymo equivalent: pre-built HD maps store all static structure. Perception focuses only on dynamic changes. Annie's equivalent: the SLAM map stores "where the walls are AND what rooms exist AND where the charging dock was last seen." Navigation queries the accumulated knowledge — it doesn't rebuild from scratch.

This reframes the purpose of Phase 1 SLAM entirely. The occupancy grid is not throw-away scaffolding. It is the beginning of Annie's persistent spatial memory — the substrate on which the semantic knowledge graph lives.

Concretely: Phase 2c (semantic map annotation) grows this memory incrementally. "Where is the kitchen?" becomes a query against accumulated VLM labels, not a real-time VLM call on an unknown environment.
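The accumulation-then-query pattern can be reduced to a dictionary. Cell coordinates, label strings, and the confidence threshold here are all illustrative, not Phase 2c's actual data model:

```python
from collections import defaultdict

class SemanticMap:
    """VLMaps-style sketch: grid cells accumulate VLM labels over sessions."""

    def __init__(self):
        # (grid_x, grid_y) -> {label: accumulated confidence}
        self.cells = defaultdict(lambda: defaultdict(float))

    def annotate(self, cell, label, conf):
        """Called whenever the VLM labels the scene at the robot's current cell."""
        self.cells[cell][label] += conf

    def where_is(self, label, min_conf=1.0):
        """'Where is the kitchen?' answered from memory — no VLM call."""
        return sorted(c for c, labels in self.cells.items()
                      if labels[label] >= min_conf)

smap = SemanticMap()
for _ in range(3):                      # three sessions see the stove corner
    smap.annotate((12, 4), "kitchen", 0.6)
smap.annotate((2, 9), "hallway", 0.7)   # one weak observation, below threshold
```

Repeated observations push a cell past the confidence threshold; a single noisy label never does — the same smoothing-over-time idea as the EMA, applied spatially.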

ANTI-PATTERN 6

"Route safety-critical inference through WiFi when a local NPU exists."

Annie's Pi 5 carries a Hailo-8 AI HAT+ at 26 TOPS that has sat idle for months. Meanwhile, the safety-critical obstacle-detection path runs on Panda's RTX 5070 Ti — which means every frame that could brake the robot has to survive a WiFi round-trip (5–300 ms of jitter) before the stop decision comes back. The "remote GPU is stronger, so route everything to it" instinct feels correct: centralise the smart compute, keep the edge dumb.

The hidden assumption: that the network is a reliable bus. It isn't. When WiFi drops or congests, the robot's reflex evaporates. A 1 m/s robot covers 30 cm in 300 ms of WiFi jitter — that's a broken piece of furniture or a dent in a wall.

The classic layering violation: the local NPU sits idle (26 TOPS, YOLOv8n at 430 FPS, <10 ms, zero WiFi) while the remote GPU is overloaded with work it shouldn't own. The safety reflex is architecturally in the wrong place.
vs

CORRECT PATTERN 6

"Fast-reactive inference lives on whatever compute is physically closest to the actuator."

The dual-process rule (IROS 2026, arXiv 2601.21506): fast reactive layer on local silicon, slow semantic layer anywhere. For Annie, that means YOLOv8n on the Hailo-8 (430 FPS, <10 ms, no network) becomes L1 safety; the VLM on Panda (18 ms + WiFi ≈ 30–40 ms total) stays as L2 semantic grounding. When WiFi drops, Annie still has a reflex. For a future Orin-NX-equipped robot, the same rule says: keep obstacle detection onboard, not in the cloud.

This isn't just "edge computing is faster." It's that safety latency budgets must not depend on networks the system doesn't control. The Hailo-8 was hardware Annie already had. The anti-pattern was architectural, not budgetary — nobody re-asked "where should this layer live?" when the Pi gained an NPU.

Concretely: activate HailoRT on Pi 5, route YOLOv8n inference there, fire ESTOP from the Pi without a WiFi round-trip. VLM stays on Panda for "what room is this?"-shaped questions.
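The L1/L2 split can be sketched as a single decision step. Every callable here is a stand-in: npu_detect would wrap YOLOv8n via HailoRT on the Pi, vlm_query_async a non-blocking request to Panda — none of these names exist in Annie's code:

```python
STOP_DISTANCE_M = 0.5  # hypothetical braking threshold

def reflex_step(frame, min_range_m, npu_detect, estop, vlm_query_async):
    """One tick of the dual-process loop, running entirely on the Pi."""
    # L1: local, deterministic, never waits on the network.
    if npu_detect(frame) and min_range_m < STOP_DISTANCE_M:
        estop()                    # fires without a WiFi round-trip
        return "ESTOP"
    # L2: the semantic question goes out over WiFi, but only as advice.
    vlm_query_async(frame, "what room is this?")
    return "OK"
```

The structural point is that estop() is reachable without touching the network stack: if WiFi drops, L2 silently degrades while L1 keeps braking.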

ANTI-PATTERN 7

"Use a VLM for a known fiducial target."

The charging dock carries an ArUco marker — a black-and-white square with a known 6×6 bit pattern (DICT_6X6_50, id=23). When the goal is "find the dock," the modern instinct is: point Gemma 4 E2B at the camera feed and ask "is there a marker in view?" The VLM is already running. It can read text, count objects, describe scenes. Surely it can spot a square.

This feels like a win — one model, one query, uniform interface. No special-case code for fiducials, no OpenCV dependency, no solvePnP math to maintain.

Cost: VLM inference is ~18 ms on Panda GPU plus WiFi round-trip (total ≈ 30–40 ms), non-deterministic output, prone to hallucination on partial/occluded markers, and requires parsing free-text responses. The VLM is being asked to solve a problem it shouldn't even see.
vs

CORRECT PATTERN 7

"Classical CV for known shapes; VLM for semantic unknowns."

cv2.aruco.ArucoDetector + cv2.solvePnP runs at 78 µs per call on the Pi ARM CPU — no GPU, no network, pure OpenCV. That's a 230× speedup over the VLM round-trip (~18 ms), with deterministic output: either the marker is detected with sub-pixel corners, or it isn't. Pose comes back as an exact SE(3) transform, not a description.

The rule: VLMs are for semantic understanding of unknown targets ("is there a chair?", "is this the kitchen?"). Classical CV is for known shapes with deterministic detectors — ArUco markers, AprilTags, chessboards, QR codes, known logos. Asking a VLM to do ArUco detection is paying for generality that isn't needed and losing determinism that is.

Annie's homing implementation (sessions 81–83) already does this right: aruco_detect.py on Pi ARM CPU, solvePnP for 6-DoF pose, VLM never involved in the "is there a marker?" decision. The anti-pattern is the temptation to retrofit this as a VLM query "for consistency."

The most seductive mistake in VLM-primary navigation is asking the model to confirm its own outputs at high frequency instead of diversifying the question set. Running "Where is the goal?" at 58 Hz feels like maximum attentiveness. It is actually maximum redundancy: consecutive frames differ by 1.7 cm, so the 58th answer contains nearly identical information to the 1st. The valuable alternative — rotate four different perception tasks across the same budget — costs nothing in hardware, requires a one-line code change, and quadruples the semantic richness of each second of robot operation. This anti-pattern is so common in early implementations precisely because it is the natural first version: one question, one answer, repeat.

The "bigger model" anti-pattern is particularly important because it contradicts a deeply held assumption: that capability scales monotonically with model size. For strategic reasoning this is true, and Titan 26B earns its place at Tier 1. But for reactive steering, a 26B model at 2 Hz produces stale commands 50 cm into the future at walking speed — worse than a 2B model at 54 Hz with EMA smoothing. Annie's session 92 explore-dashboard made this concrete: routing navigation to the larger Titan model produced visibly worse driving than the resident Panda E2B. The data corrects the intuition. GR00T N1 (NVIDIA) encodes the same lesson architecturally: VLM at 10 Hz, motor outputs at 120 Hz. The fast path must be fast.

The end-to-end neural planner seduction is the anti-pattern with the longest incubation period. Papers reporting Tesla FSD v12 replacing 300,000 lines of C++ with a single neural net are correct — for an actor with millions of miles of training data. For a single-robot project, the correct architecture is the one OK-Robot validated: clean integration of off-the-shelf components, each independently testable. Annie's NavController already implements this correctly. The anti-pattern is not committing a bad implementation — it's questioning a correct implementation because a research paper made a fancier approach look attainable.

The deepest anti-pattern is treating SLAM as infrastructure rather than memory. The occupancy grid built during Phase 1 is not a means to an end (path planning) that can be discarded and rebuilt each session. It is the spatial substrate on which Annie's persistent knowledge of her home accumulates. VLMaps demonstrated this at Google: semantic labels attached to grid cells during exploration become a queryable knowledge base — "where is the kitchen?" resolves to a cluster of high-confidence cells, not a real-time VLM call on an unknown environment. Framing SLAM as "just navigation infrastructure" forecloses the most valuable long-term capability in the entire architecture.

Two further anti-patterns surfaced during the session-119 hardware audit and are worth naming explicitly because they share a root cause: mismatched inference mechanism. The first is routing safety-critical inference through WiFi when a local NPU exists. Annie's Pi 5 carries an idle Hailo-8 AI HAT+ (26 TOPS) while obstacle-detection latency is held hostage by WiFi to Panda — YOLOv8n at 430 FPS with <10 ms local inference sits untouched. The reflex that should brake the robot has no business being on the other side of a lossy radio. The correct rule is universal across robotics: fast-reactive inference lives on compute physically closest to the actuator; cloud/remote compute is for strategy, not safety. The IROS dual-process paper (arXiv 2601.21506) measured the payoff — 66% latency reduction and 67.5% navigation success versus 5.83% for VLM-only — when reactive perception runs locally and semantic reasoning runs elsewhere.

The second is using a VLM for a known fiducial target. Asking Gemma 4 E2B to spot an ArUco marker costs ~18 ms of GPU time plus WiFi round-trip, produces non-deterministic free-text, and can hallucinate on partial occlusion — when cv2.aruco + cv2.solvePnP solves the same problem in 78 µs on the Pi ARM CPU, a 230× speedup with deterministic sub-pixel output. VLMs earn their cost on semantic unknowns ("what room is this?"). Classical CV wins on known shapes (markers, AprilTags, QR codes). The meta-rule: match the inference mechanism to the signal's predictability.

NOVA: The anti-patterns in this gallery share a common structure: they are all locally optimal choices that look correct when evaluated at a single decision point, but accumulate cost over time. A few distilled rules:
  • Anti-Pattern 6 — Safety on WiFi: a 26 TOPS Hailo-8 NPU sitting idle on the Pi while a remote GPU owns the brake reflex is a classic layering violation. Safety latency budgets must never depend on networks the system doesn't control.
  • Anti-Pattern 7 — VLM for known fiducials: cv2.aruco + cv2.solvePnP at 78 µs on Pi ARM CPU is 230× faster than an 18 ms VLM call, with deterministic sub-pixel output. Don't pay for semantic flexibility when the target is a known shape.
  • Meta-rule — Match the inference mechanism to the signal's predictability: classical CV for known-geometry detectors, fast-local NPUs for reactive safety, VLMs for semantic unknowns, LLMs for strategy. Wrong-layer inference is the dominant failure mode across this gallery.
Each of these becomes an anti-pattern only when evaluated across the system's full operational lifetime — across hundreds of navigation sessions, across a home that changes, across a robot that should be getting smarter rather than restarting from zero every time it boots.
THINK: Two of these anti-patterns were hit in production before the research was written. Ollama's Go wrapper (110 ms overhead, retired in session 67) is Anti-Pattern 2's corporate equivalent: a "clean integration" that looked correct but added a hidden tax on every call. IndicF5 wasting 2.8 GB VRAM (also session 67) is Anti-Pattern 4 applied to TTS: a bigger, more capable model deployed where a smaller one was sufficient, costing resource budget without improving the user experience that mattered. Both were found by measurement, not by intuition. The lesson: always instrument the thing you think is working.