# Agent Orchestration: OpenClaw vs AutoResearchClaw vs Annie

**Date:** 2026-03-20
**Supersedes:** `RESEARCH-OPENCLAW-VS-ANNIE-GAPS.md` (concept comparison only)
**Purpose:** Deep architectural comparison of two orchestration paradigms, gap analysis against Annie's implementation, and hybrid adoption strategy with test harness.

---

## 1. Executive Summary

Annie needs **two execution models** simultaneously:

- **Reactive** (OpenClaw pattern) — voice conversations, Omi events, webhook triggers, Telegram callbacks. Our `AgentRunner` + `AgentScheduler` already implements this.
- **Batch pipeline** (AutoResearchClaw pattern) — meditation, email triage, research tasks, booking. These are multi-step workflows with gates, rollback, and cross-run learning. We have none of this infrastructure.

**Verdict: Hybrid — OpenClaw skeleton (already built) + ARC organs (to adopt).**

The 8 patterns to adopt (six from ARC, two from OpenClaw), ranked by impact:

1. Evolution/lessons system (Annie learns from corrections)
2. Typed adapter protocol (unblocks email, calendar, clipboard)
3. Complexity routing (auto-score tasks, route to cheap/expensive models)
4. Sub-orchestrator pattern (email triage, booking as multi-agent pipelines)
5. Gate system (5-tier approval for external actions)
6. Hook mapping (generic webhook routing)
7. Cost tracking (per-agent budget enforcement)
8. Checkpoint/resume (long research tasks survive crashes)

---

## 2. Architecture Overview

```
OPENCLAW (Reactive)                   AUTORESEARCHCLAW (Pipeline)

  Hooks/WebSocket/Cron                Single Topic Input
       |                                      |
  +---------------+                   +---------------+
  | Command Queue |                   | Stage Runner  |
  | (per lane)    |                   | (sequential)  |
  +--+---+---+----+                   +-------+-------+
     |   |   |                                |
  Main Cron Sub                       Stage 1 -> 2 -> ... -> 23
  lane lane lane                      with gates, rollback,
     |   |   |                        checkpoints, PIVOT
     v   v   v                                |
  Agent Sessions                      Sub-Orchestrators
  (persistent,                        (Figure, Benchmark)
   multi-turn)                        each with retry loops
     |                                        |
  Tool calls                          Evolution/Lessons
  (plugins, hooks)                    (cross-run learning)
     |                                        |
  Delivery                            Deliverables Package
  (WS, webhook, channels)             (paper, code, charts)
```

---

## 3. Head-to-Head Comparison

| Dimension | OpenClaw | AutoResearchClaw | What Annie Needs | Winner |
|-----------|----------|-------------------|-------------------|--------|
| **Execution model** | Reactive event loop | Batch pipeline (23 stages) | Both: voice is reactive, meditation is batch | Tie |
| **Concurrency** | Lane-based (`Main/Cron/Subagent`) with configurable `maxConcurrent` | Single-threaded per run | Voice + background agents simultaneously | OpenClaw |
| **Session persistence** | Multi-turn sessions to disk (`sessions.json`). Resume across restarts | Checkpoint files per stage. Crash recovery | Conversation continuity | OpenClaw |
| **Sub-agent spawning** | Full registry: parent-child tracking, depth limits (max 4), announce queue | Orchestrator calls sub-agents inline (sequential) | Parallel sub-agents (researcher, draft writer) | OpenClaw |
| **Error handling** | Model fallback chain (Opus->Sonnet->Haiku), auth rotation, overload backoff | Input/output validation, LLM retry with exponential backoff | Model fallback is critical for cost | OpenClaw |
| **Cost tracking** | Per-run, per-session, per-agent token+dollar. Monthly budget caps | Per-orchestrator token accumulation only | Daily cost control ($0.47/day target) | OpenClaw |
| **Approval gates** | No formal gate system | 3 gate stages. Pipeline pauses for human APPROVE/REJECT with rollback | 5-tier approval from narrative | ARC |
| **Rollback / PIVOT** | No rollback | PIVOT -> Stage 8, REFINE -> Stage 13. Stage versioning preserves history | Research tasks may need course correction | ARC |
| **Cross-run learning** | None | Evolution/lessons with time-decay overlay. -24.8% retry rate | Core narrative promise: "Annie gets better" | ARC |
| **Hook/webhook integration** | Rich: path matching, payload transforms, template rendering, channel routing | None (CLI-triggered only) | Omi webhook, Telegram callbacks, browser events | OpenClaw |
| **Tool ecosystem** | Plugin SDK, tool registry, sandbox policies, result truncation | Typed adapter protocol (6 adapters) + recording stubs | OpenClaw for tools, ARC for external integrations | Tie |
| **Delivery** | WebSocket streaming, webhook POST, channel routing | Filesystem package (paper + code + charts) | Voice, Telegram, screen | OpenClaw |
| **Complexity routing** | Manual model override per agent/session | Auto-score complexity (0.0-1.0), route to cheap/expensive model | Cost optimization | ARC |
| **Cron scheduling** | Full cron service with stagger algorithm, one-shot, per-job budgets | None | Daily/weekly/monthly meditation | OpenClaw |
| **Checkpoint/resume** | Session state persisted but no stage checkpointing | Atomic checkpoint after every stage. True crash recovery | Long research tasks (schedule_task) | ARC |
| **Code execution sandbox** | Via tool calls (sandboxed) | Docker/SSH/Colab with GPU passthrough, network policies | Not needed today | ARC |

**Score: OpenClaw 8, ARC 6, Tie 2**

---

## 4. Annie's Workload Classification

| Workload | Type | Duration | Frequency | Best Fit |
|----------|------|----------|-----------|----------|
| Voice conversation | Reactive, real-time | 1-30 min | 5-10x/day | OpenClaw |
| Omi ambient processing | Reactive, event-driven | 2-5 sec | Continuous | OpenClaw |
| Web search during voice | Reactive, tool call | 1-3 sec | On-demand | OpenClaw |
| Morning briefing | Scheduled, batch | 10-30 sec | 1x/day | Either |
| Nudge delivery | Scheduled + context-aware | 1-2 sec | 1-3x/day | OpenClaw |
| Evening meditation | Batch, multi-source | 1-5 min | 1x/day | ARC |
| Weekly reflection | Batch, deeper analysis | 5-10 min | 1x/week | ARC |
| Email triage | Batch pipeline with gates | 2-5 min | 2-3x/day | ARC |
| Research ("tell me tonight") | Long-running batch | 10-30 min | On-demand | ARC |
| Booking a restaurant | Multi-step with approval | 1-2 min | Rare | ARC |
| Self-improvement | Batch, introspective | 2-5 min | Post-session | ARC |
| Health monitoring | Reactive, polling | 1 sec | Every 60s | OpenClaw |

---

## 5. What Annie Already Has (OpenClaw Pattern)

| OpenClaw Pattern | Annie Implementation | File(s) |
|-----------------|---------------------|---------|
| Lane-based concurrency | `AgentRunner` with `cron/subagent/background` lanes | `agent_context.py` |
| Session persistence | `session_broker.py` (unified daily sessions) | `session_broker.py` |
| Cron scheduling | `AgentScheduler` with `at/every/cron` kinds | `agent_scheduler.py` |
| Sub-agent spawning | `subagent_tools.py` (researcher, memory_dive, draft) | `subagent_tools.py` |
| Agent discovery | YAML-based `AgentDiscovery` with filesystem watcher | `agent_discovery.py` |
| Tool registration | `FunctionSchema` + `register_function()` in Pipecat | `bot.py` |
| Webhook/hook routing | Omi watcher (specific), not generic | `omi_watcher.py` |
| Delivery | Voice (Kokoro TTS), Telegram bot, screen (dashboard) | `bot.py`, `telegram-bot/` |
| Voice-priority gating | Background agents wait for voice to finish | `agent_context.py:598` |
| Model fallback | Beast (Super) -> local (Nano) fallback | `agent_context.py:798` |

---

## 6. What Annie Is Missing (ARC Patterns to Adopt)

### 6.1 Evolution/Lessons System

**Source:** `researchclaw/evolution.py` (300 lines)

**What it does:** After each agent run, extract lessons from failures/corrections. Store in JSONL with time-decay weighting. Inject as prompt overlay into future runs.

**Impact:** -24.8% retry rate, -40% refine cycles in ARC benchmarks. For Annie, this is the "progressive autonomy" narrative promise — she gets better at knowing what Rajesh likes without retraining.

**Lesson categories for Annie:**
- TONE: "User corrected my formality — prefer casual"
- TIMING: "Nudge sent during meeting — check calendar first"
- EXTRACTION: "Missed restaurant name — lower entity threshold for place names"
- SEARCH: "User wanted Reddit consensus, not generic results"
- APPROVAL: "User auto-approved this type — escalate trust tier"

**Data model:**
```python
@dataclass
class LessonEntry:
    category: str          # TONE, TIMING, EXTRACTION, SEARCH, APPROVAL
    severity: str          # info, warning, error
    description: str       # What happened
    source_agent: str      # meditation_daily, voice_session, etc.
    run_id: str
    timestamp: float
    # Time-decay: 30-day half-life, 90-day max age
```

**Storage:** `~/.her-os/annie/evolution/lessons.jsonl`

**Injection:** `build_overlay(agent_name, max_lessons=5)` returns a prompt fragment with the 5 most relevant recent lessons for that agent type.
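A minimal sketch of the decay weighting and overlay assembly described above, assuming the `LessonEntry` fields shown; the function bodies are illustrative, not ARC's actual implementation:

```python
import time
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0  # weight halves every 30 days
MAX_AGE_DAYS = 90.0    # lessons older than this are dropped entirely

@dataclass
class LessonEntry:
    category: str          # TONE, TIMING, EXTRACTION, SEARCH, APPROVAL
    severity: str          # info, warning, error
    description: str
    source_agent: str
    run_id: str
    timestamp: float

def decay_weight(entry: LessonEntry, now: float) -> float:
    """Exponential time-decay: 30-day half-life, hard cutoff at 90 days."""
    age_days = (now - entry.timestamp) / 86400.0
    if age_days > MAX_AGE_DAYS:
        return 0.0
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def build_overlay(lessons: list[LessonEntry], agent_name: str,
                  max_lessons: int = 5) -> str:
    """Return a prompt fragment with the highest-weighted lessons for this agent."""
    now = time.time()
    weighted = [(decay_weight(e, now), e)
                for e in lessons if e.source_agent == agent_name]
    weighted = [(w, e) for w, e in weighted if w > 0.0]
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    lines = [f"- [{e.category}] {e.description}" for _, e in weighted[:max_lessons]]
    if not lines:
        return ""
    return "Lessons from previous runs:\n" + "\n".join(lines)
```

A one-day-old lesson still carries ~0.977 weight, so recency only dominates over weeks, not hours; the hard 90-day cutoff keeps the JSONL scan bounded.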

### 6.2 Typed Adapter Protocol

**Source:** `researchclaw/adapters.py` (108 lines)

**What it does:** Protocol-based interfaces for external integrations. Start with recording stubs, swap to real implementations without touching pipeline code.

**Adapters for Annie:**
```python
class EmailAdapter(Protocol):
    async def list_inbox(self, account: str, limit: int) -> list[dict]: ...
    async def read_email(self, email_id: str) -> dict: ...
    async def draft_reply(self, email_id: str, body: str) -> str: ...
    async def send(self, draft_id: str) -> bool: ...

class CalendarAdapter(Protocol):
    async def list_events(self, date: str) -> list[dict]: ...
    async def create_event(self, event: dict) -> str: ...
    async def check_conflicts(self, start: str, end: str) -> list[dict]: ...

class ClipboardAdapter(Protocol):
    async def save_item(self, content: str, content_type: str) -> str: ...
    async def list_items(self, limit: int) -> list[dict]: ...

class BookingAdapter(Protocol):
    async def search_venue(self, query: str) -> list[dict]: ...
    async def reserve(self, venue_id: str, details: dict) -> dict: ...
    async def confirm(self, reservation_id: str) -> bool: ...
```

**Recording stubs:** Each adapter has a `RecordingXxxAdapter` that logs calls without side effects. Used for testing and development.

### 6.3 Complexity Routing

**Source:** `researchclaw/pipeline/opencode_bridge.py` (300 lines)

**What it does:** Score task complexity (0.0-1.0) using weighted signals, route to appropriate model tier.

**Annie's routing:**
```
Score < 0.3  -> Nemotron Nano (free, local, 48 tok/s)
    Simple entity extraction, nudge delivery, note saving
Score 0.3-0.6 -> Nemotron Super on Beast (free, local, higher quality)
    Meditation, self-improvement, email draft
Score > 0.6  -> Claude API ($0.003/call)
    Complex research, multi-source analysis, entity extraction from ambiguous audio
```

**Signals for scoring:**
- Token count of input context (weight 0.25)
- Number of data sources needed (weight 0.20)
- Historical failure rate for this agent type (weight 0.20)
- User-facing vs background (weight 0.15)
- Tool calls expected (weight 0.10)
- Time sensitivity (weight 0.10)
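The scoring and routing above can be sketched as a weighted sum over normalized signals; the signal names and model labels here are assumptions, and signal extraction (token counting, failure-rate lookup) is assumed to happen upstream:

```python
# Weights mirror the signal list above; each signal arrives normalized to [0, 1].
WEIGHTS = {
    "context_tokens": 0.25,
    "data_sources": 0.20,
    "failure_rate": 0.20,
    "user_facing": 0.15,
    "tool_calls": 0.10,
    "time_sensitivity": 0.10,
}

def complexity_score(signals: dict[str, float]) -> float:
    """Weighted sum of normalized signals; missing signals count as 0."""
    score = sum(WEIGHTS[name] * min(max(signals.get(name, 0.0), 0.0), 1.0)
                for name in WEIGHTS)
    return min(score, 1.0)

def route_model(score: float) -> str:
    """Map a complexity score onto Annie's three model tiers."""
    if score < 0.3:
        return "nemotron-nano"   # free, local, 48 tok/s
    if score <= 0.6:
        return "nemotron-super"  # free, local (Beast), higher quality
    return "claude-api"          # $0.003/call, highest capability
```

Since the weights sum to 1.0, the score stays in [0.0, 1.0] by construction and the thresholds from the routing table apply directly.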

### 6.4 Sub-Orchestrator Pattern

**Source:** `researchclaw/agents/figure_agent/orchestrator.py` (500 lines)

**What it does:** Multi-phase pipeline with specialized sub-agents. Each phase can retry independently.

**Annie's orchestrators:**

**EmailTriageOrchestrator:**
```
Stage 1: ClassifierAgent -> skip / info / action_required / meeting
Stage 2: DraftAgent -> compose reply using relationship context
Stage 3: SafetyCheckAgent -> flag sensitive content, check tone
Stage 4 (GATE): DeliveryAgent -> send via correct account (requires approval)
```

**BookingOrchestrator:**
```
Stage 1: SearchAgent -> find venue via web/Playwright
Stage 2: AvailabilityAgent -> check calendar conflicts
Stage 3 (GATE): ConfirmationAgent -> Tier 3 rephrase-confirm
Stage 4: BookingAgent -> Playwright form fill + submit
```
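Both orchestrators share the same skeleton: a sequential stage runner where each stage retries independently. A hedged sketch of that skeleton, assuming each stage is an async callable that threads a state dict (the `StageFailed` exception and backoff constants are illustrative):

```python
import asyncio
from typing import Any, Awaitable, Callable

Stage = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]

class StageFailed(Exception):
    """Raised by a stage when its output fails validation."""

async def run_stage(stage: Stage, state: dict[str, Any],
                    max_retries: int = 2) -> dict[str, Any]:
    """Run one stage; on StageFailed, retry with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await stage(state)
        except StageFailed:
            if attempt == max_retries:
                raise
            await asyncio.sleep(0.1 * 2 ** attempt)  # 0.1s, 0.2s, ...
    raise StageFailed("unreachable")

async def run_pipeline(stages: list[Stage],
                       state: dict[str, Any]) -> dict[str, Any]:
    """Run stages sequentially; each stage returns the updated state."""
    for stage in stages:
        state = await run_stage(stage, state)
    return state
```

The key property is that a retry re-runs only the failing stage, not the whole pipeline, which is what makes per-phase retry loops cheap.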

### 6.5 Gate System (5-Tier Approval)

**Source:** `researchclaw/pipeline/stages.py` (lines 109-145)

**Annie's 5 tiers (from narrative R-18, A-30):**

| Tier | Action | Example | Approval |
|------|--------|---------|----------|
| 1 | Internal only | Save note, search memory | Silent (no approval) |
| 2 | Inform | Morning briefing, nudge | Delivered, no confirmation needed |
| 3 | Propose + confirm | Book restaurant, send email | Rephrase-confirm ("Dinner at Vicolo, 7:30, 4 people. Go ahead?") |
| 4 | Draft + review | Email draft, message draft | Show draft, wait for edit/approve |
| 5 | Double confirm | Financial action, delete data | Explicit "Are you sure?" + PIN/voice confirm |
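The tier table maps naturally onto a lookup that gates every external action; a minimal sketch, where the `ApprovalMode` names and the action registry are assumptions, not the real gate API:

```python
from enum import Enum

class ApprovalMode(Enum):
    SILENT = 1            # Tier 1: internal only, no approval
    INFORM = 2            # Tier 2: deliver, no confirmation needed
    REPHRASE_CONFIRM = 3  # Tier 3: restate the plan, wait for "go ahead"
    DRAFT_REVIEW = 4      # Tier 4: show draft, wait for edit/approve
    DOUBLE_CONFIRM = 5    # Tier 5: explicit "are you sure?" + PIN/voice

# Illustrative action registry; real entries would come from agent config.
ACTION_TIERS = {
    "save_note": 1,
    "search_memory": 1,
    "morning_briefing": 2,
    "nudge": 2,
    "book_restaurant": 3,
    "send_email": 3,
    "email_draft": 4,
    "financial_action": 5,
    "delete_data": 5,
}

def required_approval(action: str) -> ApprovalMode:
    """Unknown actions fail safe to the strictest tier, not the loosest."""
    return ApprovalMode(ACTION_TIERS.get(action, 5))
```

Defaulting unknown actions to Tier 5 means a new tool can never silently perform an external action before someone classifies it.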

### 6.6 Checkpoint/Resume

**Source:** `researchclaw/pipeline/runner.py` (lines 73-126)

**What it does:** After each stage, write `checkpoint.json` with `last_completed_stage`. On crash, `resume_from_checkpoint()` picks up from the next stage. Atomic write via tmpfile + rename.

**For Annie:** Long research tasks (`schedule_task`) can take 10-30 minutes. If Annie Voice restarts, the research should resume, not restart.
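The tmpfile-plus-rename pattern can be sketched as follows; the checkpoint schema (`last_completed_stage` plus a state dict) follows the description above, while the function names are illustrative:

```python
import json
import os
import tempfile
from pathlib import Path

def write_checkpoint(run_dir: Path, last_completed_stage: int,
                     state: dict) -> None:
    """Atomically replace checkpoint.json via tmpfile + rename.

    A crash mid-write leaves either the old checkpoint or the new one,
    never a torn file, because os.replace is atomic on POSIX filesystems.
    """
    payload = {"last_completed_stage": last_completed_stage, "state": state}
    fd, tmp_path = tempfile.mkstemp(dir=run_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # force to disk before the rename
        os.replace(tmp_path, run_dir / "checkpoint.json")
    except BaseException:
        os.unlink(tmp_path)
        raise

def resume_from_checkpoint(run_dir: Path) -> int:
    """Return the next stage to run (stage 1 if no checkpoint exists)."""
    path = run_dir / "checkpoint.json"
    if not path.exists():
        return 1
    return json.loads(path.read_text())["last_completed_stage"] + 1
```

The `fsync` before the rename matters: without it, a power loss can leave a renamed-but-empty file on some filesystems.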

---

## 7. Hybrid Architecture

```
+-----------------------------------------------------------+
|                     ANNIE RUNTIME                          |
|                                                            |
|  +----------------------------------+                      |
|  |    OpenClaw Layer (Outer)        |                      |
|  |    -------------------------     |   <- EXISTING:       |
|  |    * Lane-based concurrency     |      AgentRunner     |
|  |    * Session persistence        |      AgentScheduler  |
|  |    * Hook/webhook routing       |      session_broker  |
|  |    * Cron scheduling            |                      |
|  |    * Sub-agent registry         |                      |
|  |    * Cost tracking              |                      |
|  |    * Model fallback chain       |                      |
|  |    * Delivery (voice/TG/screen) |                      |
|  +---------+------------------------+                      |
|            |                                               |
|            | submits complex tasks to                      |
|            v                                               |
|  +----------------------------------+                      |
|  |    ARC Layer (Inner)             |                      |
|  |    -------------------------     |   <- TO BUILD:       |
|  |    * Stage contracts (DoD)      |      evolution.py    |
|  |    * Gate/approval system       |      adapters.py     |
|  |    * Checkpoint/resume          |      orchestrator.py |
|  |    * Evolution/lessons          |      routing.py      |
|  |    * Complexity routing         |                      |
|  |    * Typed adapter protocol     |                      |
|  |    * Sub-orchestrator pattern   |                      |
|  +----------------------------------+                      |
+-----------------------------------------------------------+
```

---

## 8. Adoption Priority

| # | Pattern | Source | Impact | Effort | Blocks |
|---|---------|--------|--------|--------|--------|
| 1 | Evolution/lessons | ARC | HIGH: Annie learns from corrections | Medium | None |
| 2 | Typed adapters | ARC | HIGH: unblocks email, calendar, clipboard | Small | None |
| 3 | Complexity routing | ARC | MEDIUM: cost optimization | Small | None |
| 4 | Sub-orchestrator | ARC | HIGH: email triage, booking pipelines | Medium | #2 (adapters) |
| 5 | Gate system (5-tier) | ARC | HIGH: safety for external actions | Medium | #4 (orchestrators) |
| 6 | Hook mapping | OpenClaw | MEDIUM: generic webhook routing | Small | None |
| 7 | Cost tracking | OpenClaw | MEDIUM: per-agent budget enforcement | Small | None |
| 8 | Checkpoint/resume | ARC | LOW: only for long research tasks | Medium | #4 (orchestrators) |

---

## 9. Test Harness Strategy

Both vendor codebases have extensive test patterns we should adopt:

### OpenClaw Testing Patterns
- **Lane concurrency tests:** Verify maxConcurrent limits, drain behavior, generation clearing
- **Hook mapping tests:** Template rendering, path matching, payload transform
- **Session lifecycle tests:** Create, persist, resume, expire
- **Subagent registry tests:** Spawn, track, terminate, announce queue with retry
- **Model fallback tests:** Primary fails -> fallback chain -> all fail
- **Cost tracking tests:** Token accumulation, budget cap enforcement

### ARC Testing Patterns
- **Stage contract tests:** Input validation, output validation, DoD checks
- **Gate approval tests:** BLOCKED -> APPROVE/REJECT, rollback on REJECT
- **PIVOT/REFINE tests:** Decision routing, stage versioning, max pivots limit
- **Evolution tests:** Lesson extraction, time-decay, overlay injection, JSONL roundtrip
- **Checkpoint tests:** Write, read, resume, corruption recovery
- **Adapter tests:** Recording stubs log calls, real implementations swap cleanly
- **Orchestrator tests:** Full pipeline flow, retry on sub-agent failure, partial success handling
- **Complexity scoring tests:** Signal weights, threshold routing, edge cases

### Annie-Specific Test Categories
1. **Orchestration integration:** Submit complex task -> stages execute -> gates fire -> result delivered
2. **Cross-channel:** Voice message -> broker -> text chat sees it -> Telegram notification
3. **Evolution feedback loop:** Correction in voice -> lesson extracted -> next session improved
4. **Approval flow:** Voice command -> Tier 3 gate -> Telegram confirm -> execute
5. **Model routing:** Low complexity -> Nano, medium -> Super, high -> Claude API
6. **Crash recovery:** Kill Annie mid-research -> restart -> checkpoint resumes

---

## 10. Key Source Files

### OpenClaw
| File | What to Study |
|------|--------------|
| `vendor/openclaw/src/process/command-queue.ts` | Lane-based concurrency (245 lines) |
| `vendor/openclaw/src/agents/subagent-registry.ts` | Parent-child tracking (1500 lines) |
| `vendor/openclaw/src/gateway/hooks.ts` | Hook routing (250 lines) |
| `vendor/openclaw/src/cron/isolated-agent/run.ts` | Cron execution (800 lines) |
| `vendor/openclaw/src/agents/agent-command.ts` | Master entry point (1043 lines) |

### AutoResearchClaw
| File | What to Study |
|------|--------------|
| `vendor/AutoResearchClaw/researchclaw/evolution.py` | Lessons system (300 lines) |
| `vendor/AutoResearchClaw/researchclaw/adapters.py` | Typed adapter protocol (108 lines) |
| `vendor/AutoResearchClaw/researchclaw/pipeline/stages.py` | Stage contracts + gates (150 lines) |
| `vendor/AutoResearchClaw/researchclaw/pipeline/runner.py` | Pipeline orchestration (1000 lines) |
| `vendor/AutoResearchClaw/researchclaw/agents/figure_agent/orchestrator.py` | Sub-orchestrator (500 lines) |
| `vendor/AutoResearchClaw/researchclaw/pipeline/opencode_bridge.py` | Complexity routing (300 lines) |

---

## 11. Narrative Coverage After Adoption

| Phase | Before This Work | After 5-Step Implementation | After ARC Adoption |
|-------|-----------------|---------------------------|-------------------|
| Phase 1 (Now) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) |
| Phase 2 (Next) | 6/11 (55%) | 8/11 (73%) | 10/11 (91%) |
| Phase 3 (Vision) | 1/9 (11%) | 3/9 (33%) | 7/9 (78%) |
| **Total** | **12/25 (48%)** | **16/25 (64%)** | **22/25 (88%)** |

The remaining 3 scenes (12%) require external service integrations (WhatsApp API, Gmail OAuth, Google Calendar) that are implementation tasks, not architecture gaps.
