# Subsumption Architecture: Why a 1986 Paper Powers Our Home Robot

*Applying Rodney Brooks' layered behavioral architecture to a hobby robot car — and why it's better than a ROS navigation stack for this use case.*

## The Origin: Brooks, 1986

In 1986, Rodney Brooks published "A Robust Layered Control System for a Mobile Robot" at the MIT AI Lab. His key insight was radical: instead of building robots that plan first and act second (sense → model → plan → act), build layers of simple behaviors that run independently and override each other.

```
Traditional approach:
  Sensors → World Model → Planner → Actions
  (slow, fragile, requires complete world knowledge)

Subsumption approach:
  Layer 2: Explore ──────────────────────────▶ Actions
  Layer 1: Wander  ──────────────────────────▶ Actions (overrides 2)
  Layer 0: Avoid   ──────────────────────────▶ Actions (overrides 1 and 2)
  (fast, robust, works with incomplete information)
```

Each layer is a complete behavior. Lower layers are faster and simpler; higher layers are slower and smarter. The critical rule in our variant: **a lower layer can always override a higher layer.** (Brooks' original wiring is usually described the other way around: higher layers *subsume* lower ones by suppressing their signals. The practical effect is shared, though: the level-0 reflex keeps acting no matter what the layers above are doing.) The avoid-obstacle reflex doesn't wait for the explore planner to finish thinking.
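In code, the override rule reduces to a few lines of priority arbitration. A minimal sketch, assuming a simple "lowest layer wins" policy (`Command` and `arbitrate` are illustrative names, not from Brooks' paper or our codebase):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    layer: int   # 0 = avoid reflex, 1 = wander, 2 = explore
    action: str  # e.g. "stop", "forward", "left"

def arbitrate(commands: list[Command]) -> Optional[Command]:
    """Lowest layer wins: a reflex always overrides a planner."""
    if not commands:
        return None
    return min(commands, key=lambda c: c.layer)

# The avoid reflex (layer 0) preempts the explore planner (layer 2):
winner = arbitrate([Command(2, "forward"), Command(0, "stop")])
# winner.action == "stop"
```

The point of the sketch is that arbitration is stateless and trivial; all the intelligence lives inside the layers, not between them.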

## Our Implementation

We use two layers:

### Layer 0: Hailo Safety Reflex

- **Hardware:** Hailo AI HAT+ 26T neural processing unit on Pi 5
- **Model:** YOLOv8s (pre-compiled .hef), runs at 310 FPS
- **Behavior:** Detect obstacles → estimate distance from bounding box → if distance < 30cm, stop all motors immediately
- **Latency:** <100ms from frame capture to motor stop
- **Dependencies:** None. No network, no LLM, no external service. Pure local reflex.

This layer runs as a background thread in the turbopi-server process. It reads frames from the FrameGrabber (shared camera buffer), runs YOLO inference on the Hailo NPU, and triggers an emergency stop via a callback if any detected object is too close.
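The daemon's control loop can be sketched as follows. This is a hypothetical outline, not the actual turbopi-server code: frame grabbing, Hailo inference, and the emergency-stop callback are injected as plain callables (`grab_frame`, `infer_distances_cm`, `estop`) so the loop itself carries no hardware dependencies.

```python
import threading
import time

STOP_DISTANCE_CM = 30.0  # matches the 30 cm threshold above

class SafetyDaemon(threading.Thread):
    """Layer 0 sketch: grab a frame, infer, stop the motors if anything is close."""

    def __init__(self, grab_frame, infer_distances_cm, estop, fps=30):
        super().__init__(daemon=True)
        self._grab = grab_frame              # returns latest camera frame or None
        self._infer = infer_distances_cm     # frame -> list of per-object distances (cm)
        self._estop = estop                  # fire-and-forget; idempotent on the motor side
        self._period = 1.0 / fps
        self._running = threading.Event()
        self._running.set()

    def run(self):
        while self._running.is_set():
            frame = self._grab()
            if frame is not None:
                # One distance estimate per detected object in this frame.
                if any(d < STOP_DISTANCE_CM for d in self._infer(frame)):
                    self._estop()
            time.sleep(self._period)

    def stop(self):
        self._running.clear()
```

Because the daemon is stateless per frame, a false positive clears itself on the next frame, which is exactly the recovery behavior described in the temporal-separation table below.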

### Layer 1: Titan Strategic Navigation

- **Hardware:** DGX Spark (Titan) GPU
- **Model:** Gemma 4 26B multimodal vision-language model, ~50 tokens/second
- **Behavior:** Receive goal from user → capture photo → read obstacle data → combined vision+decision LLM call → execute drive command → repeat
- **Latency:** ~3.5 seconds per cycle (sense + think + act)
- **Dependencies:** Network (Pi ↔ Titan), vLLM API

This layer runs as a tool handler in Annie's voice pipeline on Titan. It orchestrates the navigation loop but trusts Layer 0 to handle safety. If Layer 0 triggers an emergency stop during a drive, Layer 1 receives a 409 HTTP response and aborts gracefully.
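A sketch of that loop, with the VLM call and the HTTP client abstracted behind injected callables. `decide`, `drive`, and `EstopEngaged` are illustrative seams, not the real API; the only behavior taken from the text is the 10-cycle budget and the abort-on-409 rule:

```python
MAX_CYCLES = 10  # ~35 s at ~3.5 s per cycle, per the budget described above

class EstopEngaged(Exception):
    """Raised by the drive client when the Pi answers HTTP 409."""

def navigate(goal, decide, drive):
    """Layer 1 sketch: sense -> decide via VLM -> act, trusting Layer 0 for safety.

    decide(goal) stands in for photo capture + the combined VLM call and
    returns an action string; drive(action) posts to the Pi and raises
    EstopEngaged on a 409 response.
    """
    for _ in range(MAX_CYCLES):
        action = decide(goal)  # "forward" / "left" / "right" / "found"
        if action == "found":
            return f"Reached: {goal}"
        try:
            drive(action)
        except EstopEngaged:
            # Layer 0 tripped the hardware ESTOP mid-drive: abort gracefully.
            return "Aborted: safety stop engaged"
    return f"I couldn't find {goal}"
```

Note that Layer 1 never inspects sensor data for safety. Its only safety obligation is to respect the 409 and stop trying.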

## Why Subsumption, Not ROS

ROS (Robot Operating System) is the standard framework for robot software. Its navigation stack (nav2) includes:
- SLAM (Simultaneous Localization and Mapping)
- Costmaps (occupancy grids)
- Path planners (A*, Dijkstra, DWB)
- Recovery behaviors (rotate, clear costmap)
- Localization (AMCL, particle filters)

**Why we didn't use it:**

| Factor | ROS nav2 | Subsumption |
|--------|----------|-------------|
| **Setup complexity** | Weeks (URDF, transforms, tuning) | Hours (two Python threads) |
| **Map requirement** | Yes (pre-built or SLAM) | No (explore without map) |
| **Sensor requirements** | LIDAR or depth camera | Regular camera + sonar |
| **Compute requirements** | Full ROS stack + nav2 | Python threads + Hailo NPU |
| **Goal precision** | "Go to (x=3.2, y=1.7)" | "Find the kitchen" |
| **Failure mode** | Stuck at costmap edge | Wanders, eventually gives up |
| **Code size** | Thousands of lines + configs | ~350 lines of Python |

The key insight: **our robot doesn't need precision navigation.** "Explore the room" and "find Rajesh" are fuzzy goals. A VLM looking at camera images and deciding "turn left" is sufficient. If the robot can't find the kitchen in 10 cycles (~35 seconds), it says "I couldn't find it" — which is an acceptable outcome for a personal assistant's hobby robot car.

## The ESTOP Propagation Chain

When Layer 0 detects an obstacle:

```
1. Safety daemon thread detects object < 30cm
   │
2. ├─ Sets _hardware_estop (threading.Event)     ← thread-safe
   ├─ Calls loop.call_soon_threadsafe(             ← wakes asyncio event loop
   │      _stop_event.set)
   └─ Calls _stop_motors_sync(_board)              ← acquires _uart_lock
       └─ board.set_motor_duty([[1,0],[2,0],        ← UART to STM32
              [3,0],[4,0]])                         ← all motors zero
   │
3. Meanwhile, if /drive is in progress:
   └─ asyncio.wait_for(_stop_event.wait())         ← wakes up immediately
       └─ finally: _stop_motors_sync()              ← redundant but safe
   │
4. Next /drive call:
   └─ Checks _hardware_estop.is_set()              ← rejects with 409
       └─ User/Annie must send drive(action=stop)   ← clears the ESTOP
```

The redundant motor stop in step 3 is intentional. In safety-critical systems, idempotent stops are better than clever optimizations.
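The chain above can be condensed into a small sketch. The attribute names mirror the diagram (`_hardware_estop`, `_stop_event`), but the class itself is illustrative and the UART write is replaced by an injected, idempotent `stop_motors` callable:

```python
import asyncio
import threading

class EstopChain:
    def __init__(self, loop: asyncio.AbstractEventLoop, stop_motors):
        self._loop = loop                         # asyncio loop serving /drive
        self._stop_motors = stop_motors           # sync + idempotent (holds the UART lock)
        self._hardware_estop = threading.Event()  # step 2: thread-safe latch
        self._stop_event = asyncio.Event()        # wakes any drive in progress

    def trigger(self):
        """Steps 1-2: called from the safety daemon thread."""
        self._hardware_estop.set()
        self._loop.call_soon_threadsafe(self._stop_event.set)
        self._stop_motors()                       # stop now; don't wait for asyncio

    async def drive(self, seconds: float) -> str:
        """Steps 3-4: a drive that aborts the instant the stop event fires."""
        if self._hardware_estop.is_set():
            return "409"                          # step 4: reject until cleared
        try:
            await asyncio.wait_for(self._stop_event.wait(), timeout=seconds)
            return "aborted"
        except asyncio.TimeoutError:
            return "completed"
        finally:
            self._stop_motors()                   # redundant but safe
```

`call_soon_threadsafe` is the one piece that is not optional: setting an `asyncio.Event` directly from the daemon thread would not reliably wake the event loop.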

## The Temporal Separation Principle

The two layers operate on fundamentally different timescales:

| Metric | Layer 0 (Hailo) | Layer 1 (Titan) |
|--------|-----------------|-----------------|
| Cycle time | 33ms (camera-limited 30 FPS) | 3,500ms |
| Decision complexity | Binary (safe/unsafe) | Multi-option (forward/left/right/stop) |
| State | Stateless (each frame independent) | Stateful (goal + history) |
| Failure mode | False positive (unnecessary stop) | False negative (wrong direction) |
| Recovery | Automatic (next frame clears) | Manual (user says "try again") |

This separation is why subsumption works: each layer is optimized for its timescale. Layer 0 doesn't need to understand the scene — it just needs to detect proximity. Layer 1 doesn't need to be real-time — it just needs to make reasonable decisions every few seconds.

## Real-World Observations

1. **False positives are fine, false negatives are not.** Layer 0 stopping the car because a shadow looked like an obstacle wastes ~3.5 seconds (one navigation cycle). Layer 0 failing to stop the car because a real obstacle was missed damages hardware. We bias toward false positives.

2. **The VLM is surprisingly good at navigation.** Gemma 4's multimodal capability is strong enough to accurately describe room layouts, identify doorways, and make sensible movement decisions. It doesn't need a map — it navigates like a human would: look around, spot the goal, move toward it.

3. **Power is the hidden constraint.** Between the motors (2-3A), the camera, and the Hailo NPU, the system pushes the Pi 5's power budget hard. Under-voltage throttling slows the safety daemon, widening the reaction-time window. We monitor `vcgencmd get_throttled` and abort navigation if throttled.

4. **Distance estimation from bounding boxes is unreliable.** It's sensitive to camera angle, object type, and lighting. Sonar provides accurate frontal distance (ultrasonic, not affected by lighting). The combination of "rough vision distance" + "precise sonar distance" is more robust than either alone.
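The throttle check from observation 3 is cheap to automate. A sketch: the bit positions are Raspberry Pi's documented `get_throttled` flags (bit 0 = under-voltage now, bit 2 = currently throttled; bits 16+ record past events), while `should_abort_navigation` and the abort policy are our illustrative choices.

```python
import subprocess

UNDER_VOLTAGE_NOW = 1 << 0   # bit 0: under-voltage detected right now
THROTTLED_NOW = 1 << 2       # bit 2: currently throttled

def throttle_flags(raw=None) -> int:
    """Parse `vcgencmd get_throttled` output, e.g. 'throttled=0x50005'.

    Pass `raw` for testing; otherwise shell out (works on the Pi only).
    """
    if raw is None:
        raw = subprocess.check_output(["vcgencmd", "get_throttled"], text=True)
    return int(raw.strip().split("=")[1], 16)

def should_abort_navigation(raw=None) -> bool:
    """Abort only on *current* conditions; past-event bits (16+) are ignored."""
    return bool(throttle_flags(raw) & (UNDER_VOLTAGE_NOW | THROTTLED_NOW))
```

Ignoring the sticky "has occurred" bits matters: a past brown-out shouldn't permanently ground the robot after the battery is swapped.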
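The "rough vision distance + precise sonar distance" combination from observation 4 can be expressed as a simple fusion rule. A sketch under assumed conventions (the sonar validity range and the min-wins policy are illustrative; min-wins encodes the false-positive bias from observation 1):

```python
def fused_distance_cm(vision_cm, sonar_cm, sonar_max_cm=400.0):
    """Combine a rough vision estimate with a precise frontal sonar reading.

    vision_cm / sonar_cm are floats in cm, or None when that sensor has no
    reading. Returns the safest (smallest) available estimate.
    """
    sonar_valid = sonar_cm is not None and 0 < sonar_cm < sonar_max_cm
    if sonar_valid and vision_cm is not None:
        return min(vision_cm, sonar_cm)  # safest interpretation wins
    if sonar_valid:
        return sonar_cm
    # No sonar: fall back to vision; no reading at all means "nothing near".
    return vision_cm if vision_cm is not None else float("inf")
```

Taking the minimum means an optimistic bounding-box estimate can never override a close sonar echo, and vice versa.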

## References

- Brooks, R. A. (1986). "A Robust Layered Control System for a Mobile Robot." IEEE Journal of Robotics and Automation.
- Brooks, R. A. (1991). "Intelligence Without Representation." Artificial Intelligence.
- Hailo AI HAT+ documentation: [hailo.ai](https://hailo.ai)
- YOLOv8 on Hailo: [hailo-rpi5-examples](https://github.com/hailo-ai/hailo-rpi5-examples)
