LENS 21

Stakeholder Kaleidoscope

"Who sees what — and whose view are we ignoring?"

FOUR PERSPECTIVES ON THE SAME SYSTEM
MOM — PRIMARY USER (Underrepresented)

"Please don't knock anything over."

What she sees: A small machine that sometimes moves purposefully and sometimes freezes in the hallway for no reason. She does not see tiers, latencies, or frame rates. She sees behavior and its effect on her home.

What she needs:

  • Sub-1-second voice ESTOP: "Ruko!" must stop the robot immediately — not after 5 seconds of pipeline propagation
  • Predictable movement: no sudden direction changes, no speed surges, no approaching her from behind
  • Audible/visible state: she needs to know what Annie is doing right now ("I'm going to the kitchen"), not silence
  • Graceful freezes: if Annie must pause, she should say why ("my eyes are slow, I'll wait a moment"), not simply stop
  • No camera surprises: she should know when Annie is looking at her and why

What the research gives her: One paragraph in the Day-in-Life section. The phrase "Mom's bedroom" appears once. Her needs are never directly stated as system requirements.

What is missing: A Mom-perspective acceptance test. No requirement states "Mom must be able to halt Annie via voice within 1 second." No scenario asks "what does Mom experience when the VLM times out?" The research was written in engineering language for an engineering audience. Mom's requirements are inferred from architecture, never stated as primary.

Trust-curve shift — the Hailo-8 activation: The 7:30 AM WiFi-brownout freezes documented in Lens 20 ("Annie, did you stop?") are the single biggest trust-eroding moments in Mom's day. Activating the idle Hailo-8 AI HAT+ on the Pi 5 (26 TOPS NPU, YOLOv8n at 430 FPS, <10 ms local inference) gives Annie a WiFi-independent safety layer. Post-Hailo, Annie no longer "dies" mid-hallway when the semantic pipeline stalls — she keeps moving safely while the VLM recovers. The cumulative effect on Mom's trust curve is larger than any single user-facing feature: the robot becomes something she can count on during network stress, which is precisely when her anxiety peaks. No new prompt, no new skill, no new voice — just the quiet absence of the two-second freeze.
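
The "keeps moving safely while the VLM recovers" behavior is an arbitration rule, and it can be sketched in a few lines. This is a hedged illustration, not the HailoRT/TAPPAS integration: `choose_command`, the command strings, and the 100 ms budget are all hypothetical stand-ins.

```python
# Illustrative latency budget: beyond this, the semantic (VLM) path is
# treated as stalled and the local reactive layer takes over.
VLM_BUDGET_S = 0.100  # hypothetical value, not from the research

def choose_command(vlm_result, vlm_age_s, local_obstacle_clear):
    """Blend the slow semantic path with the fast local path.

    vlm_result:           last semantic nav command, or None if never received
    vlm_age_s:            seconds since that command arrived
    local_obstacle_clear: output of the always-on local detector (hypothetical)
    """
    if not local_obstacle_clear:
        return "STOP"           # reactive layer has absolute priority
    if vlm_result is None or vlm_age_s > VLM_BUDGET_S:
        return "CREEP_FORWARD"  # degrade gracefully instead of freezing
    return vlm_result           # normal operation: follow the VLM
```

The point of the sketch is the middle branch: the WiFi brownout no longer produces a silent freeze, because a stalled semantic path has a defined, locally-safe fallback.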

RAJESH — ENGINEER / EXPERIMENTER

"Elegant architecture. Let's ship it."

What he sees: A 4-tier hierarchical fusion system with clean separation of concerns, 58 Hz throughput, academic validation from Waymo/Tesla/VLMaps, and a clear 5-phase implementation roadmap. Architecturally satisfying.

What he needs:

  • Observable system: dashboard metrics, per-tier latency, VLM confidence scores, SLAM pose drift
  • Testable components: each tier independently runnable, simulation mode for integration testing
  • Failure visibility: when something breaks, he needs to know where in the 4-tier stack it broke and why
  • Iteration speed: the ability to swap the VLM, tune EMA alpha, change the query cycle — without rebuilding the whole stack

What the research gives him: Everything. The research is written from his perspective. Every architectural decision, every academic citation, every phase roadmap assumes his mental model as the reader.

The tension this creates: Rajesh's experimentalist instinct (Phase 2a this week, 2b next week, 2c after SLAM is stable) is structurally in conflict with Mom's need for consistency. Every experiment that changes Annie's behavior is a new surprise for Mom. A Nav pipeline that is a research platform cannot simultaneously be a trustworthy household companion — unless experimentation is explicitly contained away from Mom's hours of use.

Highest-leverage single change available — Hailo-8 activation: From the engineer's vantage point, the idle Hailo-8 AI HAT+ on the Pi 5 is the "lowest risk × highest value" move that was not visible before this research. Cost: ~1–2 engineering sessions (HailoRT install + TAPPAS GStreamer pipeline). Hardware cost: zero — the NPU is already bolted to the robot, drawing power, doing nothing for navigation. Architecture impact: purely additive — a new L1 reactive safety layer slotted beneath the existing VLM stack, with the IROS dual-process paper (arXiv 2601.21506) supplying 66% latency reduction as academic validation. Rollback: trivial — disable the systemd unit, behavior reverts to today. This is the rare intervention where the engineer's "interesting experiment" box and the user's "make it stop freezing" box are checked by the same move. See Lens 04 for the WiFi cliff-edge finding this addresses.

ANNIE — THE AI AGENT

"I need clear goals and honest sensors."

What she sees: A stream of camera frames, lidar sectors, IMU headings, and natural-language goals. Her job is to reconcile these signals into motor commands. She has no concept of "Mom's comfort" or "Rajesh's experiment" — only the signals she receives and the rules she follows.

What she needs:

  • Consistent environment: furniture rearranged overnight means her SLAM map is wrong; she doesn't know it's wrong
  • Honest sensors: a glass door that reads as CLEAR is not lying — it is a systematic blind spot her architecture cannot self-correct
  • Stable goals: a goal interrupted mid-navigation (WiFi drop, Pico crash) creates an ambiguous recovery state she has no procedure for
  • Latency budget honesty: she is designed for 18ms inference; she needs defined behavior when inference takes 90ms

What the research gives her: A well-specified fast path. 58 Hz perception, 4-tier fusion, EMA smoothing, confidence accumulation. The normal-operation design is thorough.

What is missing: A failure-mode specification. When the VLM times out, what does Annie do? When IMU goes to REPL, what does Annie announce? When two sensors disagree by more than a threshold, what does Annie say aloud? Annie's behavior in degraded states is unspecified — which means it is unpredictable — which means it violates Mom's most basic need: predictability.
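
A failure-mode specification does not need to be elaborate to exist: a lookup that pairs every degraded state with an audible announcement (Mom's need) and a defined action (Annie's need) would already make degraded behavior predictable. A minimal sketch, with state names, phrases, and actions invented for illustration:

```python
# A minimal failure-mode specification. Every degraded state maps to
# (what Annie says aloud, what Annie does). All entries are illustrative,
# not taken from the research document.
FAILURE_POLICY = {
    "VLM_TIMEOUT":     ("My eyes are slow, I'll wait a moment.",  "HOLD_AND_RETRY"),
    "IMU_REPL":        ("My balance sensor restarted, pausing.",  "SAFE_STOP"),
    "SENSOR_DISAGREE": ("My sensors disagree, moving slowly.",    "SLOW_MODE"),
    "WIFI_LOST":       ("I lost my connection, using local eyes.","LOCAL_ONLY"),
}

def handle_failure(state):
    """Return (announcement, action) for a degraded state.
    Unknown states fail safe and audible rather than silent."""
    return FAILURE_POLICY.get(state, ("Something is wrong, stopping.", "SAFE_STOP"))
```

The default branch is the key design choice: a state nobody anticipated still produces an announcement and a safe stop, never an unexplained freeze.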

VISITOR / FAMILY MEMBER

"Is it watching me?"

What they see: A camera-equipped robot moving through a home. They have no context for what it is, who controls it, what it records, or how to stop it. They encounter it without onboarding.

What they need:

  • Immediate legibility: what is this thing, is it recording, who can I ask to turn it off
  • A pause gesture or command that works for strangers: "Stop" or a raised hand should halt Annie even from an unknown voice
  • Honest signaling: if Annie's camera is active, a visible indicator (LED, spoken acknowledgment) should make this unambiguous
  • Privacy opt-out: the ability to be excluded from the semantic map without requiring Rajesh to intervene

What the research gives them: Nothing. The word "visitor" does not appear in the research document. The privacy concern is noted once under Lens 06 (second-order effects), but only as a concern for Mom, not for third parties.

The underappreciated risk: Phase 2c (semantic map annotation) will record who was in which room at what time. A visitor who sits in the living room for two hours is in the semantic map. They did not consent to this. Local-only storage does not eliminate the privacy issue — it only changes who can access the data. The visitor's perspective is the least represented and the most legally exposed.

WHERE STAKEHOLDER NEEDS DIRECTLY CONFLICT
Conflict 1: Experimentation vs. predictability
  • Rajesh wants: deploy Phase 2a this week, tune EMA, try new queries
  • Mom needs: Annie behaves the same way every day; surprises are frightening
  • Resolution path: maintenance window; experiments only during Mom's sleep hours; freeze nav behavior 7am–10pm

Conflict 2: Speed vs. safety margin
  • Rajesh wants: confidence accumulation → faster navigation (more impressive demos)
  • Mom needs: slower is safer; she cannot react fast enough to a speeding robot
  • Resolution path: speed cap in Mom's presence zones; voice-triggered slow mode

Conflict 3: Camera-always-on vs. privacy
  • Rajesh wants: continuous VLM inference at 58 Hz, which requires a constant camera stream
  • Mom needs: to be able to stop the robot from watching (especially in the bedroom)
  • Resolution path: camera-off room tags on the SLAM map; "don't enter bedroom" constraint layer

Conflict 4: Dashboard metrics vs. lived experience
  • Rajesh wants: 94% nav success rate over 24h, i.e. "the system is working"
  • Mom needs: recognition that Annie froze 3 times during the 7–9pm window, i.e. "the system is broken"
  • Resolution path: per-user, per-hour success windows as the primary dashboard metric

Conflict 5: Silent failure vs. audible failure
  • Rajesh wants: clean logs; no noisy announcements cluttering dev output
  • Mom needs: to know when Annie is confused; silence is not neutral, it is alarming
  • Resolution path: production voice layer for all failure states; dev-mode flag to suppress for testing

The Underrepresented Perspective: Mom

The research is excellent engineering. It is thorough on Waymo's MotionLM, precise on EMA filter alpha values, careful about VRAM budgets. What it does not contain, anywhere, is a single sentence written from Mom's perspective. Mom is mentioned as the person who wants tea. She is not consulted as a primary stakeholder whose requirements should shape the architecture.

This is not an oversight — it is a structural consequence of who writes research documents. Research is written by engineers for engineers. The 4-tier fusion hierarchy, the 5-phase roadmap, the probability tables — these are all written in a language Mom does not speak and for a reader she is not. The danger is not that the engineering is wrong. It is that the engineering is optimized for the wrong utility function. The research maximizes VLM throughput and architectural elegance. Mom's utility function is entirely different: does Annie behave consistently? Can I stop it? Does it tell me what it's doing? Will it knock over my tea?

The critical finding from this lens: the voice-to-ESTOP gap is not a safety feature missing from the architecture. It is a Mom requirement that was never written. No section of the research states "Mom must be able to halt Annie via voice within 1 second." The 4-tier architecture has ESTOP in Tier 3 (lidar reactive) with "absolute priority over all tiers" — but this is a sensor-triggered ESTOP (80mm obstacle threshold), not a voice-triggered ESTOP. A voice ESTOP requires a separate always-listening path that bypasses the VLM pipeline entirely. This path does not exist in the architecture. It was never designed because the architect never asked: what does Mom need when she is scared?
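
The missing always-listening path is architecturally small: match stop words on the raw transcript and call the motor halt directly, with no LLM, no Nav controller, no pipeline propagation in between. A hedged sketch, assuming a transcript stream and a direct motor-halt hook (both hypothetical names):

```python
import re

# Hypothetical sketch of the missing voice-ESTOP path. The stop-word list
# and the halt hook are illustrative assumptions, not the actual system.
STOP_WORDS = re.compile(r"\b(ruko|stop|halt)\b", re.IGNORECASE)

def estop_filter(transcript, halt_motors):
    """Run on every raw transcript BEFORE it enters the VLM/LLM pipeline.
    `halt_motors` is a direct motor-controller halt hook (hypothetical).
    Returns True if the utterance was a stop command and the halt fired."""
    if STOP_WORDS.search(transcript):
        halt_motors()   # bypass Titan LLM and Nav controller entirely
        return True
    return False
```

The latency of this path is the speech-recognition latency plus a regex match, which is what makes the sub-1-second requirement plausible; the 5-second figure comes from routing the same words through the full semantic pipeline.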

The conflict between Rajesh and Mom is not a personality conflict — it is a values conflict that is characteristic of every system that serves both builder and user simultaneously. Rajesh's values: learn, iterate, improve, tolerate failures as data. Mom's values: consistency, safety, dignity, trust. These are not reconcilable by better code. They require an explicit protocol: the system's external behavior (what Mom experiences) is frozen during experimentation; changes are deployed only when they don't alter Mom's experience; and any change that does alter her experience requires her informed acceptance first. The research has no such protocol. It has a roadmap. Roadmaps serve Rajesh. Protocols serve Mom.
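
A protocol, unlike a roadmap, can be enforced mechanically. A minimal sketch of the deployment gate, with the window boundaries and function names assumed for illustration, not taken from the research:

```python
from datetime import time

# Mom's usage window: external behavior is frozen during these hours.
# The 7am-10pm window is the figure used elsewhere in this lens.
MOM_WINDOW = (time(7, 0), time(22, 0))

def deployment_allowed(change_alters_behavior, now, mom_accepted=False):
    """Protocol gate: behavior-visible changes deploy only outside Mom's
    usage window, or with her explicit prior acceptance."""
    if not change_alters_behavior:
        return True   # invisible changes (refactors, logging) may ship anytime
    if mom_accepted:
        return True   # informed acceptance overrides the window
    in_window = MOM_WINDOW[0] <= now <= MOM_WINDOW[1]
    return not in_window
```

The gate encodes the values conflict directly: Rajesh's iteration speed is preserved outside the window, and Mom's consistency is non-negotiable inside it.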

What Would Change If We Designed for Mom First

The 4-tier architecture would remain — but its design priorities would invert. Tier 4 (kinematic) is currently the fastest tier and the least specified in terms of what it does under failure. A Mom-first design would specify Tier 4's voice interrupt path before specifying Tier 2's multi-query pipeline. The ESTOP gap (5 seconds to propagate a "Ruko!" through voice recognition → Titan LLM → Nav controller → motor) would be identified as the first engineering problem, not an afterthought.

The evaluation framework (Part 7 of the research) would look completely different. Instead of ATE, VLM obstacle accuracy, and place recognition P/R, it would start with: (1) voice ESTOP latency under load, (2) number of silent freezes per hour during Mom's usage window, (3) number of times Annie announces what she is doing vs. acts silently, (4) Mom's subjective safety rating after a 2-week deployment. These metrics are not in the research. They are not even suggested. A Mom-first design makes them the primary acceptance criteria.
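
Metric (2), silent freezes per hour during Mom's usage window, is computable from logs the system presumably already emits. A sketch assuming a hypothetical event format (`hour`, `type`, `announced` keys are invented for illustration):

```python
from collections import Counter

def silent_freezes_per_hour(events, window=(7, 22)):
    """Count freezes that produced no announcement, bucketed by hour of day,
    restricted to Mom's usage window. `events` is a list of dicts with the
    hypothetical keys 'hour', 'type', and 'announced'."""
    counts = Counter(
        e["hour"] for e in events
        if e["type"] == "FREEZE"
        and not e["announced"]
        and window[0] <= e["hour"] < window[1]
    )
    return dict(counts)
```

An announced freeze and a 2 AM freeze are both excluded by design: the metric measures exactly the moments that erode Mom's trust, not overall system health.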

The Visitor perspective, even more underrepresented, adds a legal dimension that the research ignores: a semantic map that records room occupancy at all times is a data product that requires explicit consent from everyone in the home, not just the family. This is not a technical issue. It is a social contract that must be designed before Phase 2c ships. The consent architecture is the Visitor's primary requirement. It is absent from the research entirely.

The Stakeholder Asymmetry: Same Change, Different Value

The Hailo-8 activation surfaces the kaleidoscope's most important property — the same engineering change carries dramatically different perceived value depending on whose face is pressed against the lens. To Rajesh (engineer), Hailo-8 reads as "interesting optimization, ~1–2 sessions, additive L1 layer, 26 TOPS NPU currently idle, YOLOv8n at 430 FPS, <10 ms local inference, IROS-validated dual-process pattern, zero hardware cost, rollback-safe." It is a technically elegant cleanup of a wasted resource.

To Mom (primary user), the exact same change reads as "the robot stops having the scary freezes in the hallway at 7:30 AM during the WiFi brownout." She does not know what a TOPS is. She does not know what YOLO is. She knows that last Tuesday Annie stopped for two seconds in front of her bedroom door and she had to ask, "Annie, did you stop?", and nobody answered. After Hailo, that moment stops happening.

To the Visitor, Hailo-8 is invisible — the robot still moves through the house, the camera is still on, the consent architecture is still missing. To Annie herself, Hailo-8 is the first honest sensor layer: a fast, local, deterministic obstacle detector whose behavior is independent of the WiFi weather.

The stakeholder kaleidoscope's lesson is that the value of a change is not a scalar. It is a vector indexed by perspective, and the vector components can differ by orders of magnitude. Hailo-8 scores medium-interesting to Rajesh, trust-transforming to Mom, invisible to the Visitor, and grounding to Annie — from a single patch of software. (Cross-ref Lens 04 WiFi cliff, Lens 06 second-order effects, Lens 20 7:30 AM event, Lens 25 leverage ranking.)

NOVA (What this lens uniquely reveals): The research document contains exactly four stakeholders — implicitly. It was written by an engineer (Rajesh), for an engineer (Rajesh), about a system that will be experienced primarily by a non-engineer (Mom). This asymmetry is not a flaw in the research; it is a structural property of who does research. The lens reveals what falls out when you ask: who is this system FOR? The 4-tier architecture is for Rajesh — it serves his goals of experimentation, observability, and architectural elegance. Mom's requirements — sub-1-second voice ESTOP, audible state announcements, predictable behavior during her usage window — are not derivable from the architecture. They require a separate document: a Mom Requirements Spec. That document does not exist. Until it does, every architectural decision implicitly optimizes for the builder and implicitly de-prioritizes the user. The voice-to-ESTOP gap is not a missing feature. It is the proof that the Mom Requirements Spec was never written.
  • Hailo-8 activation is the single change most stakeholders would agree on. Mom gains trust (no more WiFi-brownout freezes in the hallway); Rajesh gains his highest leverage-per-hour move available (~1–2 sessions, zero hardware cost, additive, rollback-safe, IROS-validated); Annie gains her first honest local sensor (YOLOv8n at 430 FPS, <10 ms, independent of WiFi weather). Only the Visitor is unmoved — Hailo does not address the consent-architecture gap. When a single change serves three of four stakeholders and harms none, it is the intervention the kaleidoscope is telling you to ship first.
  • Stakeholder value is a vector, not a scalar. The same change (Hailo activation) ranges from medium-interesting (Rajesh) to trust-transforming (Mom) to invisible (Visitor) to grounding (Annie). Planning documents that report a single "value score" per feature are silently collapsing this vector and making Mom-valued changes look unimpressive next to Rajesh-valued ones.
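
The vector-versus-scalar point can be made concrete in a few lines; the numeric scores below are illustrative ordinal labels, not measurements from the research:

```python
# Stakeholder value as a vector, not a scalar. Scores are invented
# ordinal labels; the point is what averaging destroys.
HAILO8_VALUE = {
    "rajesh":  2,   # medium-interesting optimization
    "mom":     9,   # trust-transforming: the 7:30 AM freeze disappears
    "visitor": 0,   # invisible: consent gap untouched
    "annie":   6,   # grounding: first WiFi-independent sensor layer
}

def scalar_value(vec):
    """What a single 'value score' column does: collapse the vector."""
    return sum(vec.values()) / len(vec)
```

Collapsing this vector yields 4.25, a mediocre-looking score that hides the single most important trust intervention available to the primary user.
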
THINK (Open questions this lens surfaces):
  • What is the minimum voice ESTOP latency that Mom would experience as "responsive"? Is it 500ms? 1 second? 3 seconds? This is empirically measurable and currently unknown — nobody has asked her.
  • Should Annie's behavioral envelope during Mom's usage hours (7am–10pm) be treated as a frozen production release while Rajesh's experiments run in staging? What would a staging/production distinction look like for a home robot?
  • The research estimates Phase 2c (semantic map annotation) at 65% probability of success. What does Mom experience during the 35% failure case? Does she know the room labels are wrong? Does she know there are room labels at all?
  • A visitor in the living room for two hours is in the semantic map. They did not consent. Is "local only" storage sufficient consent protection? What is the minimum viable consent UX for a home robot with persistent visual memory?
  • Cross-lens (Lens 06): the second-order effect where Mom asks "Annie, what's in the kitchen?" arrives before any consent architecture is deployed. Mom will discover and love this feature before Rajesh has designed its privacy controls. Should the semantic map be disabled by default until the consent layer exists?
  • Cross-lens (Lens 10): Mom stopped using Annie in August 2026 and the team didn't notice for two weeks. What would a Mom-perspective dashboard look like? What metrics are only visible from her point of view?
  • If you had to write a 5-line "Mom's Acceptance Test" that must pass before any Phase 2 sub-phase ships, what would those 5 lines be?