# Gap Analysis: Karpathy's "Dobby" Vision vs Annie's Current State

**Source:** [Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI](https://www.youtube.com/watch?v=kwSVtQ7dziU) — No Priors podcast, ~66 min
**Date:** 2026-03-21
**Purpose:** Map Karpathy's "Dobby the Elf" claw architecture against Annie's infrastructure to identify gaps and prioritize work.

---

## 1. The Core Insight: Tool-Maker, Not Tool-User

The central lesson from Karpathy's interview is NOT "pre-build tools for every possible use case." It's this:

> **The agent should be able to accomplish tasks it was never explicitly programmed to do, by researching, building, testing, and persisting new capabilities on the fly.**

Karpathy didn't pre-program Dobby to control Sonos. He said:

> "I just told it that I think I have Sonos at home. Like can you try to find it? And it goes and it did like IP scan of all the computers on the local area network and it found the Sonos thing... It does some web searches and it finds like okay these are the API endpoints and then it's like do you want to try it?"

**Three prompts. Zero pre-built IoT code.** The agent discovered, researched, implemented, and tested the integration autonomously.

This is the paradigm shift: Annie should be a **self-programming agent** — not a collection of pre-built tools, but an agent that creates the tools she doesn't have yet.

### What Karpathy's Dobby Actually Did

```
User: "Can you find my Sonos?"
  │
  ▼
Agent has NO Sonos tool
  │
  ▼
Research: IP scan the local network (wrote and ran code)
  │
  ▼
Discover: Found Sonos devices, no auth required
  │
  ▼
Research: Web search for Sonos REST API documentation
  │
  ▼
Build: Created API integration code
  │
  ▼
Test: "Do you want to try it?" → plays music in the study
  │
  ▼
Persist: Created dashboard, saved as reusable capability
  │
  ▼
Expand: Repeated for lights, HVAC, shades, pool, security cameras
```

The same pattern applied to everything — lights, HVAC, shades, pool, security. Each one was discovered and integrated by the agent, not pre-coded by Karpathy.

---

## 2. What Karpathy Describes

### The "Dobby the Elf" Home Claw

A persistent agent accessible via a single WhatsApp portal. NOT pre-programmed — self-built from natural language requests:

| What User Said | What Agent Did (Autonomously) |
|---------------|------------------------------|
| "Can you find my Sonos?" | IP scan → device discovery → API research → integration → music playback |
| "Dobby, sleepy time" | Orchestrated all lights off (using self-discovered light APIs) |
| *(camera feed)* | Change detection → Qwen VLM analysis → "FedEx truck pulled up" notification |
| "Can you control the pool?" | Discovered pool controller on network → reverse engineered API → integrated |

### The Broader Agent Philosophy

| Concept | Key Quote |
|---------|-----------|
| **Express your will** | "Code's not even the right verb anymore. I have to express my will to my agents for 16 hours a day." |
| **Skill issue, not capability issue** | "When it doesn't work, it all feels like skill issue. I didn't give good enough instructions in the agents MD file." |
| **Remove yourself as bottleneck** | "The name of the game is: I put in very few tokens once in a while and a huge amount of stuff happens on my behalf." |
| **Persistent loops (Claws)** | "A claw takes persistence to a whole new level. It kind of does stuff on your behalf even if you're not looking." |
| **Auto Research** | "Here's an objective, here's a metric, here's your boundaries — and go." Agent runs autonomously forever. |
| **Single portal** | "The single WhatsApp portal to all of the automation." |
| **Skills as self-created markdown** | "I'm coming up with skills where skill is just a way to instruct the agent how to teach the thing." |
| **Apps should be APIs** | "These apps shouldn't exist. It should just be APIs and agents using them directly." |
| **Personality matters** | "Peter [OpenClaw] really crafted a personality that is compelling." |
| **Self-evolving capability** | Agent discovers what's on the network, builds integrations, saves them, reuses them. |

### OpenClaw Patterns Karpathy Praises

1. **Soul/personality document** — compelling, feels like a teammate
2. **Sophisticated memory** — beyond basic context compaction
3. **WhatsApp as single portal** — one interface to all automation
4. **Persistence** — keeps looping when you're not looking
5. **Self-evolving skills** — agent learns and creates new capabilities

---

## 3. Annie's Current Capability Inventory

### Annie Voice (Pipecat WebRTC)

| Category | Tools | Status |
|----------|-------|--------|
| **Web search** | `web_search` (SearXNG), `fetch_webpage` | ✅ Working |
| **Memory** | `search_memory` (hybrid BM25+vector+graph), `get_entity_details` | ✅ Working |
| **Code execution** | `execute_python` (sandboxed, 10s timeout, matplotlib) | ✅ Working |
| **Browser automation** | `browser_list_tabs`, `browser_read_page`, `browser_navigate`, `browser_snapshot`, `browser_click`, `browser_fill` | ✅ Working |
| **Sub-agents** | `run_subagent(research)`, `run_subagent(memory_dive)`, `run_subagent(draft)` | ✅ Working |
| **Visual output** | `render_table`, `render_chart`, `render_svg` | ✅ Working |
| **Emotion** | `show_emotional_arc` (SER pipeline + SVG) | ✅ Working |
| **Personality** | SOUL.md + RULES.md + QLoRA-trained Nemotron Nano | ✅ Working |
| **Memory system** | Entity extraction, temporal decay, knowledge graph, emotion arcs | ✅ Exceeds OpenClaw |

### Telegram Bot

| Category | Capability | Status |
|----------|-----------|--------|
| **Commands** | `/start`, `/help`, `/briefing`, `/search`, `/promises`, `/status`, `/context`, `/pending` | ✅ Working |
| **Free text** | Routes to Context Engine search (`GET /v1/context?query=...`) | ⚠️ Search-only relay |
| **Proactive** | Morning briefing, evening questions, nudge eval, promise heartbeat, entity validation, health monitoring | ✅ Working |
| **Approvals** | Nudge feedback, entity validation, workspace changes (inline buttons) | ✅ Working |
| **LLM access** | None — never calls Annie's LLM for generative responses | ❌ Missing |
| **Tool access** | None — can't use web search, code exec, browser, sub-agents | ❌ Missing |

### Context Engine APIs (Available but Underutilized by Telegram)

- `GET /v1/context` — Hybrid search
- `GET /v1/daily` — Daily reflection
- `GET /v1/entities` — Entity listing + pending validations
- `GET /v1/emotions/arc` — Emotion arc data + SVG
- `GET /v1/wonder`, `/v1/comic` — Daily content
- `GET /v1/promises/due` — Promise tracking
- `POST /v1/nudge/evaluate` — Nudge generation
- `GET /v1/workspace/pending` — Workspace changes

---

## 4. Gap Analysis

### Gap 1: Telegram Is a Search Relay, Not Annie (CRITICAL)

**Current flow:**
```
User → Telegram → GET /v1/context?query=... → formatted search results → User
```

**Karpathy's pattern:**
```
User → WhatsApp → LLM (full personality + tools + self-programming) → User
```

The Telegram bot never touches Annie's LLM. Messages like "Find my Sonos", "Tell me a joke", or "What's the weather?" get memory search results instead of conversational responses with tool execution. Both voice AND Telegram must be equal frontends to the same self-programming brain.

**Fix:** Route free-text through the same LLM + tools that Annie Voice uses. Either:
- (a) Call the vLLM endpoint directly from Telegram with the same prompt/tools, or
- (b) Call a new Annie Core API endpoint that wraps the LLM interaction

**Complexity:** Medium — the prompt builder, tool definitions, and LLM client already exist in Annie Voice. Need to extract them into a shared module or expose an API.
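Option (b) could look like the sketch below: one chat entrypoint that both channels call. Everything here (`annie_chat`, the stubbed `call_llm`, the toy `TOOL_REGISTRY`) is illustrative scaffolding, not Annie's actual code; the stub stands in for the vLLM endpoint with the same prompt + tool schema Annie Voice uses.

```python
# Toy registry: real Annie would expose web_search, execute_python, browser_*, etc.
TOOL_REGISTRY = {
    "web_search": lambda query: f"results for {query!r}",
}

def call_llm(messages, tools):
    # Stub for the vLLM endpoint: requests a tool call for fresh user text,
    # then answers conversationally once a tool result is in the transcript.
    last = messages[-1]["content"]
    if last.startswith("tool:"):
        return {"text": f"Here's what I found: {last[5:]}", "tool_calls": []}
    return {"text": "", "tool_calls": [{"name": "web_search", "args": {"query": last}}]}

def annie_chat(user_text):
    """One brain both Pipecat and the Telegram bot would call."""
    messages = [{"role": "user", "content": user_text}]
    while True:
        reply = call_llm(messages, tools=TOOL_REGISTRY)
        if not reply["tool_calls"]:
            return reply["text"]          # final conversational answer
        for call in reply["tool_calls"]:  # same tool loop as Annie Voice
            result = TOOL_REGISTRY[call["name"]](**call["args"])
            messages.append({"role": "tool", "content": "tool:" + result})
```

The point of the shape: Telegram free text gets the same tool-calling loop as voice, instead of a bare Context Engine search.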

### Gap 2: No Self-Programming Loop (CRITICAL — The Core Gap)

This is the fundamental capability gap. Annie has the **raw ingredients** to be a self-programming agent but lacks the **meta-loop** that ties them together.

**What Annie already has:**
- `execute_python` — can run arbitrary code (network scans, API calls, data processing)
- `web_search` + `fetch_webpage` — can research any API, protocol, or technique
- `browser_*` tools — can interact with any web UI or local dashboard
- `run_subagent(research)` — multi-step research synthesis

**What's missing — the self-programming meta-loop:**

```
User: "Annie, find my Sonos and play jazz in the study"
  │
  ▼
1. RECOGNIZE: Annie doesn't have a Sonos tool
  │
  ▼
2. RESEARCH: web_search("Sonos local network API discovery")
   fetch_webpage("https://docs.sonos.com/...")
  │
  ▼
3. BUILD: execute_python("""
     import socket, requests
     # scan network for Sonos devices
     # test API endpoints
   """)
  │
  ▼
4. TEST: execute_python("""
     # try playing music via discovered API
     requests.post('http://192.168.1.x:1400/MediaRenderer/...')
   """)
  │
  ▼
5. PERSIST: Save as a reusable skill/tool definition
   → SKILLS/sonos_control.md or tools/sonos_tools.py
   → Available in future sessions without re-discovery
  │
  ▼
6. EXECUTE: Use the newly created capability to fulfill
   the original request ("play jazz in the study")
```

**The 5 missing pieces:**

| # | Piece | What It Does | Annie Has Today |
|---|-------|-------------|-----------------|
| 1 | **Capability gap recognition** | Detect "I don't have a tool for this, but I could build one" | ❌ No — Annie either uses existing tools or says she can't |
| 2 | **Research → Build pipeline** | Chain web_search → fetch_webpage → execute_python into a coherent build process | ⚠️ Partial — sub-agents do multi-step research, but don't create code artifacts |
| 3 | **Code artifact persistence** | Save generated code as reusable tool modules or scripts | ❌ No — execute_python is ephemeral, output is lost after session |
| 4 | **Skill file creation** | Write SKILLS/*.md files that describe new capabilities for future sessions | ❌ No — no skill persistence mechanism |
| 5 | **Tool self-registration** | Dynamically load new tools into Annie's tool registry without restart | ❌ No — tools are statically registered at boot |

**This is the difference between a tool-user and a tool-maker.** Annie is currently a tool-user. Karpathy's Dobby is a tool-maker.
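The missing meta-loop is small enough to sketch. The `research`/`build`/`test`/`persist` callables stand in for `web_search` + `fetch_webpage`, LLM code drafting, `execute_python`, and a future skill-saving step; all names here are hypothetical, and the dependency injection is just to keep the sketch testable.

```python
SKILLS = {}  # stands in for ~/.her-os/annie/SKILLS/

def build_missing_capability(request, research, build, test, persist, max_attempts=3):
    notes = research(request)                  # 2. RESEARCH: docs, APIs, protocols
    code = build(request, notes)               # 3. BUILD: LLM drafts implementation
    for _ in range(max_attempts):
        ok, output = test(code)                # 4. TEST: run it, capture the error
        if ok:
            persist(request, code, notes)      # 5. PERSIST: reusable skill artifact
            return output                      # 6. EXECUTE the original request
        # feed the failure back into the build step and retry
        code = build(request, notes + f"\nprevious error: {output}")
    raise RuntimeError(f"could not build capability for {request!r}")
```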

### Gap 3: No Persistent Autonomous Loops (HIGH)

Annie is **reactive** — she responds when spoken to. Karpathy's claws are **proactive loops** that run autonomously with objectives:

| Karpathy Pattern | Annie Equivalent | Gap |
|-----------------|-----------------|-----|
| Auto Research (objective + metrics → loop forever) | ❌ Nothing | Full gap |
| Camera monitoring (change detect → VLM → notify) | ❌ Nothing | Full gap |
| OpenClaw heartbeat daemon | Telegram scheduler (timed polls) | Partial — polls run on a fixed schedule, not goal-driven |

**What would be needed:**
1. **Loop runner** — A persistent process that executes a goal with defined metrics
2. **Stop conditions** — Success criteria, timeout, budget limits
3. **Progress tracking** — Dashboard visibility into running loops
4. **Notification on milestones** — Push results to Telegram/voice

**Complexity:** High — This is an architectural addition (a new service or daemon), not just a new tool.

**Note:** Autonomous loops are also how self-programming scales. Once Annie can create a skill, a loop could refine it: "try the Sonos integration → test it → if it fails, research the error → fix → retry."
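A minimal loop runner with the three stop conditions above might look like the sketch below. The names and return shape are assumptions, not an existing service; `step` would be one agent turn ("try the integration, report a score").

```python
import time

def run_loop(objective, step, success_threshold, timeout_s=3600.0, max_steps=100):
    """Run `step` toward `objective` until success, timeout, or budget exhaustion."""
    started = time.monotonic()
    best = float("-inf")
    for n in range(max_steps):                      # budget limit
        if time.monotonic() - started > timeout_s:  # timeout
            return {"status": "timeout", "best": best, "steps": n}
        score = step(objective)                     # one autonomous attempt
        best = max(best, score)
        if best >= success_threshold:               # success criterion met
            return {"status": "success", "best": best, "steps": n + 1}
    return {"status": "budget_exhausted", "best": best, "steps": max_steps}
```

The returned dict is what the dashboard and milestone notifications would read from.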

### Gap 4: No Skill Persistence Across Sessions (HIGH)

When Annie figures out how to do something new (e.g., control Sonos), that knowledge is lost when the session ends. There's no mechanism to:

1. **Save a new skill** as a markdown file (like Karpathy's `program.md` or OpenClaw's self-evolving skills)
2. **Save generated code** as a reusable tool module
3. **Load saved skills** into future sessions automatically
4. **Evolve skills** based on usage feedback (worked/didn't work)

This maps directly to:
- OpenClaw's **self-evolving skills** (TODO-OPENCLAW-ADOPTION.md item 10)
- Karpathy's **"skills as markdown"** concept: "a way to instruct the agent how to do the thing"
- Karpathy's **"program.md"** pattern: meta-instructions that describe how the agent should work

**Skill lifecycle:**
```
1. CREATE: Annie discovers how to control Sonos
2. SAVE:   Write ~/.her-os/annie/SKILLS/sonos_control.md
           (description, API endpoints, example commands, code snippets)
3. LOAD:   Next session, prompt_builder includes saved skills
4. USE:    "Play jazz" → Annie reads skill → executes directly (no re-discovery)
5. EVOLVE: If Sonos API changes, Annie updates the skill file
```

---

## 5. What Annie Already Exceeds

Annie is ahead of Karpathy's Dobby in several areas that matter:

| Capability | Annie | Dobby |
|-----------|-------|-------|
| **Memory sophistication** | Hybrid BM25+vector+graph, entity extraction, temporal decay, emotion tracking | Basic OpenClaw memory |
| **Personality training** | QLoRA fine-tuned Nemotron Nano with behavioral constraints | Personality doc only |
| **Emotion awareness** | SER pipeline, emotion arcs, emotional gating for nudges | Not mentioned |
| **Sub-agent system** | Research, memory dive, draft composition | Not mentioned |
| **Code execution** | Sandboxed Python with matplotlib | Used implicitly (agents write and run code) |
| **Browser automation** | Full Chrome DevTools MCP | Not mentioned explicitly |
| **Proactive notifications** | Morning briefings, nudges, promise tracking, entity validation | Camera notifications only |
| **Visual output** | Tables, charts, SVGs via WebRTC data channel | Dashboard only |
| **Speaker identification** | ECAPA-TDNN voice embeddings, enrollment bank | Not mentioned |

Annie's raw tool inventory is actually richer than Dobby's — the gap is in **how those tools are orchestrated** (self-programming loop) and **how results persist** (skill creation).

---

## 6. Priority Roadmap

### Phase 1: Unified Channel Access (Foundation)

**Goal:** Both voice AND Telegram route through the same Annie brain with full tool access.

- Extract prompt builder + tool system into shared Annie Core module
- Telegram free-text → Annie Core LLM (not Context Engine search)
- Support tool results in Telegram formatting (markdown, inline results)
- Maintain `/search` as explicit memory search command
- **Effort:** ~2-3 sessions
- **Impact:** Both channels become equal. Prerequisite for everything else.

### Phase 2: Self-Programming Loop (The Core Capability)

**Goal:** "Annie, find my Sonos and play jazz" works — even though Annie has no Sonos tool.

This is the meta-capability that makes everything else possible:

1. **Capability gap recognition** — Prompt engineering: teach Annie to recognize when she needs to build a tool vs use an existing one. Add to system prompt / RULES.md: "If you don't have a tool for something, research how to do it, write code to test it, and create a reusable skill."

2. **Research → Build → Test pipeline** — Already possible with existing tools:
   - `web_search` → research the API/technique
   - `execute_python` → write and test implementation code
   - `fetch_webpage` → read documentation
   - What's new: chain these into a coherent build workflow via prompt guidance

3. **Skill persistence** — New infrastructure:
   - `save_skill` tool: writes `~/.her-os/annie/SKILLS/<name>.md` with description, code, examples
   - `list_skills` tool: shows available saved skills
   - Prompt builder loads saved skills into context at session start
   - Skills are human-readable markdown (reviewable, editable)

4. **Dynamic tool loading** — Stretch goal:
   - Save Python tool modules to `~/.her-os/annie/tools/`
   - Load at session start alongside built-in tools
   - Annie can create tools that become first-class capabilities

- **Effort:** ~3-4 sessions
- **Impact:** Annie becomes a tool-maker. Any new capability is one conversation away.
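Item 3's `save_skill`/`list_skills` tools could be as small as this sketch. The directory layout follows the SKILL.md pattern described in section 9; the exact frontmatter fields are assumptions.

```python
from pathlib import Path

SKILLS_DIR = Path.home() / ".her-os" / "annie" / "SKILLS"

def save_skill(name, description, body, skills_dir=SKILLS_DIR):
    """Persist a discovered capability as <skills_dir>/<name>/SKILL.md."""
    skill_dir = skills_dir / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    path = skill_dir / "SKILL.md"
    path.write_text(f'---\nname: {name}\ndescription: "{description}"\n---\n\n{body}\n')
    return path

def list_skills(skills_dir=SKILLS_DIR):
    """Names of all saved skills (Level 1 of progressive disclosure)."""
    return sorted(p.parent.name for p in skills_dir.glob("*/SKILL.md"))
```

Because skills are plain markdown on disk, they stay reviewable and hand-editable, which is the whole point of the pattern.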

### Phase 3: Persistent Autonomous Loops

**Goal:** "Annie, monitor X and tell me when Y happens" — runs indefinitely.

- Loop runner daemon (objective + metrics + stop conditions)
- Progress tracking visible in dashboard
- Telegram notifications on milestones
- Combined with self-programming: Annie can create monitoring skills, then loop them
- **Effort:** ~4-5 sessions
- **Impact:** Annie becomes proactive. "Claws" that work when you're not looking.

### Phase 4: Expand to Any Domain

Once phases 1-3 are complete, Annie can tackle **any domain** without pre-programming:

- "Annie, check my calendar" → researches Google Calendar API → builds integration → persists skill
- "Annie, draft a reply to Priya's email" → researches Gmail API → builds read-only access → drafts reply
- "Annie, turn off the lights" → discovers smart home devices on LAN → builds integration → executes
- "Annie, monitor the front door" → researches RTSP streams → builds camera watcher → loops

The specific domains (IoT, email, calendar, cameras) are not separate phases — they're **test cases** for the self-programming loop. Once Phase 2 works, each new domain is just a conversation.

---

## 7. Architecture

### Current (Tool-User)

```
┌──────────────┐     ┌─────────────┐
│ Annie Voice  │     │  Telegram   │
│  (Pipecat)   │     │    Bot      │
│              │     │             │
│ [20 tools]   │     │ [search     │
│ [sub-agents] │     │  relay]     │
└──────┬───────┘     └──────┬──────┘
       │                    │
       ▼                    ▼
  ┌─────────┐         ┌──────────┐
  │  vLLM   │         │ Context  │
  │ Nemotron│         │ Engine   │
  └─────────┘         └──────────┘
```

### Target (Self-Programming Tool-Maker)

```
                    ┌──────────────────┐
                    │    Annie Core    │
                    │                  │
                    │ • Prompt builder │
                    │ • Tool registry  │  ← includes dynamically loaded skills
                    │ • LLM client     │
                    │ • Self-prog loop │  ← research → build → test → persist
                    │ • Skill store    │  ← ~/.her-os/annie/SKILLS/*.md
                    │ • Loop runner    │  ← persistent autonomous tasks
                    └────────┬─────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────┴─────┐ ┌─────┴────┐ ┌──────┴─────┐
        │Annie Voice│ │ Telegram │ │  Future:   │
        │ (Pipecat) │ │   Bot    │ │  Web Chat  │
        └───────────┘ └──────────┘ └────────────┘
```

The Channel Adapter Pattern (OpenClaw TODO item 9) + Self-Evolving Skills (OpenClaw TODO item 10) + Heartbeat Daemon (OpenClaw TODO item 7) — these three OpenClaw patterns, implemented together, produce the Karpathy "Dobby" architecture.

---

## 8. Key Karpathy Quotes (Organized by Relevance)

### On Self-Programming & Skills

> "I'm coming up with skills where basically skill is just a way to instruct the agent how to teach the thing. So maybe I could have a skill for micro GPT of the progression I imagine the agent should take you through."

> "The things that agents can't do is your job now. The things that agents can do, they can probably do better than you or like very soon."

### On Removing Yourself as Bottleneck

> "You need to take yourself outside. You have to arrange things such that they're completely autonomous and the more you know how can you maximize your token throughput and not be in the loop."

> "The name of the game now is to increase your leverage. I put in just very few tokens just once in a while and a huge amount of stuff happens on my behalf."

### On Capability Being There (Skill Issue)

> "Everything like so many things even if they don't work I think to a large extent you feel like it's skill issue. It's not that the capability is not there. It's that you just haven't found a way to string it together."

### On Persistent Claws

> "A claw takes persistence to a whole new level. It's something that keeps looping, it's not something that you are interactively in the middle of. It kind of does stuff on your behalf even if you're not looking."

### On Apps Becoming APIs

> "These apps that are in the app store for using these smart home devices — these shouldn't even exist. Shouldn't it just be APIs and shouldn't agents be just using it directly?"

### On Auto Research

> "Here's an objective, here's a metric, here's your boundaries of what you can and cannot do — and go."

> "I thought [the repo] was fairly well tuned and then I let auto research go for overnight and it came back with tunings that I didn't see."

---

## 9. Vendor Codebase Analysis: Implementation Patterns (v2 — Fresh Deep Dive)

Deep analysis of all three vendor codebases after latest pulls (ARC: +15K lines, OpenClaw: massive update, NemoClaw: +8K lines). This section covers the specific algorithms, data structures, and code patterns to adopt for Annie's self-programming capability.

### Pattern 1: Skill System (from OpenClaw)

**Source:** `vendor/openclaw/src/agents/skills/workspace.ts` — `loadSkillEntries()`, `buildWorkspaceSkillsPrompt()`, `applySkillsPromptLimits()`

OpenClaw loads skills from 6 directories in precedence order (workspace skills override bundled). Each skill is a directory with a `SKILL.md` file:

```
~/.her-os/annie/SKILLS/
  sonos-control/
    SKILL.md          ← YAML frontmatter + markdown body
    scripts/          ← Python scripts referenced by SKILL.md
    references/       ← API docs, examples
```

**SKILL.md format** (from `vendor/openclaw/src/agents/skills/frontmatter.ts` — `resolveOpenClawMetadata()`):
```yaml
---
name: sonos-control
description: "Discover and control Sonos speakers on the local network"
always: false           # only load body when relevant
user-invocable: true    # "Annie, play music" triggers this
---

# Steps
1. Scan network for Sonos devices at port 1400
2. GET /xml/device_description.xml for each device
3. Use UPnP AVTransport to control playback

# Anti-Patterns
- Don't hardcode IPs — devices may get new DHCP leases
```

**Three-level progressive disclosure** (critical for voice context budget):
- **Level 1**: `name` + `description` always in context (~50 tokens per skill)
- **Level 2**: Full SKILL.md body loaded when agent decides to use the skill
- **Level 3**: Scripts and reference files loaded on demand during execution

Budget enforcement: 150 skills max, 30K chars total, 256KB per file (`formatSkillsCompact()` in `workspace.ts`).

**For Annie:** The prompt builder already loads workspace files (`SOUL.md`, `RULES.md`, `TOOLS.md`). Extend it to scan `~/.her-os/annie/SKILLS/*/SKILL.md`, inject Level 1 (name+description) always, and inject Level 2 (full body) when Annie decides to use a skill.
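A hedged sketch of that extension: scan for `SKILL.md` files, read the frontmatter, and emit the always-on Level 1 lines under OpenClaw-style budgets. The naive line-based frontmatter parser is illustrative only.

```python
from pathlib import Path

def skill_summaries(skills_root: Path, max_skills=150, max_chars=30_000) -> str:
    """Collect 'name: description' lines for the always-on prompt section."""
    lines, used = [], 0
    for skill_md in sorted(skills_root.glob("*/SKILL.md"))[:max_skills]:
        meta = {}
        for raw in skill_md.read_text().splitlines()[1:]:  # skip opening '---'
            if raw.strip() == "---":
                break                                      # end of frontmatter
            key, _, value = raw.partition(":")
            meta[key.strip()] = value.strip().strip('"')
        line = f"- {meta.get('name', skill_md.parent.name)}: {meta.get('description', '')}"
        if used + len(line) > max_chars:                   # char budget enforcement
            break
        lines.append(line)
        used += len(line)
    return "\n".join(lines)
```

Level 2 (full body) and Level 3 (scripts/references) would load lazily, keeping the per-skill cost in the voice prompt at roughly one line.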

### Pattern 2: Lessons → Skills Evolution (from ARC's MetaClaw Bridge)

**Source:** `vendor/AutoResearchClaw/researchclaw/metaclaw_bridge/lesson_to_skill.py`

ARC automatically converts pipeline failures into reusable skills:

```
Pipeline failure → LessonEntry (stage, category, severity, description)
    → EvolutionStore (JSONL-backed, append-only)
    → convert_lessons_to_skills() — LLM generates SKILL.md from lessons
    → Write to skills_dir/<name>/SKILL.md
```

The conversion uses an LLM prompt (`_CONVERSION_PROMPT_SYSTEM`) that takes failure descriptions and generates structured skill files with:
- `name`: lowercase-hyphenated, prefixed `arc-`
- `description`: when to use the skill
- `category`: maps from lesson categories via `LESSON_CATEGORY_TO_SKILL_CATEGORY`
- `content`: markdown with numbered steps and anti-pattern section

**Skill effectiveness tracking** (`vendor/AutoResearchClaw/researchclaw/metaclaw_bridge/skill_feedback.py`):
- `SkillFeedbackStore` records which skills were active during each task
- `compute_skill_stats()` returns `{total, successes, success_rate}` per skill
- Low-performing skills can be evolved or deprecated

**For Annie:** When Annie fails a task (e.g., user asks to control Sonos but no tool exists):
1. Extract lesson: "User asked to control Sonos, no capability available"
2. Annie researches (web_search + fetch_webpage), writes code (execute_python), tests it
3. If successful → convert the working approach into a SKILL.md
4. Track effectiveness: did the skill work next time it was used?

### Pattern 3: Self-Evolution Store (from ARC)

**Source:** `vendor/AutoResearchClaw/researchclaw/evolution.py`

`EvolutionStore` is a JSONL-backed persistent store for lessons learned across runs:

```python
class LessonEntry:
    stage_name: str       # which stage failed
    category: str         # system|experiment|writing|analysis|literature|pipeline
    severity: str         # info|warning|error
    description: str      # what went wrong
    timestamp: str        # ISO 8601
    run_id: str           # which run

class EvolutionStore:
    def append_many(lessons)    # persist lessons
    def build_overlay(stage)    # generate per-stage prompt overlay from past lessons
```

The `build_overlay()` method generates prompt text injected into future runs: "In previous runs, stage X failed because Y. To avoid this: Z." This is time-weighted — recent lessons have more influence.

`extract_lessons()` auto-detects issues from pipeline results:
- Failed stages → error lesson
- Blocked stages → pipeline lesson
- Runtime warnings → code_bug lesson
- Metric anomalies (NaN, identical convergence) → metric_anomaly lesson

**For Annie:** Create `~/.her-os/annie/evolution.jsonl` — after each session, extract lessons from failures. Inject relevant lessons into future prompts via `build_overlay()` pattern.
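A minimal JSONL store matching the `EvolutionStore` shape above. The overlay wording is invented and the time weighting covered in Pattern 6 is omitted for brevity.

```python
import json
from pathlib import Path

class EvolutionStore:
    def __init__(self, path: Path):
        self.path = path

    def append_many(self, lessons):
        """Append lesson dicts to the JSONL file (append-only log)."""
        with self.path.open("a") as f:
            for lesson in lessons:
                f.write(json.dumps(lesson) + "\n")

    def build_overlay(self, stage_name: str) -> str:
        """Prompt text summarizing past failures for one stage."""
        if not self.path.exists():
            return ""
        relevant = []
        for line in self.path.read_text().splitlines():
            lesson = json.loads(line)
            if lesson["stage_name"] == stage_name:
                relevant.append(lesson)
        return "\n".join(
            f"In a previous run, {l['stage_name']} failed: {l['description']}. Avoid this."
            for l in relevant
        )
```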

### Pattern 4: Heartbeat Daemon (from OpenClaw)

**Source:** `vendor/openclaw/src/infra/heartbeat-runner.ts` (1183 lines)

The heartbeat is a persistent background loop that checks on things without polluting conversation context:

```
startHeartbeatRunner(config) → returns { stop, updateConfig }
    │
    ├─ setTimeout loop (each tick schedules next)
    ├─ Check areHeartbeatsEnabled(config)
    ├─ Check isWithinActiveHours(config) — don't run at 3am
    ├─ Check queue — if main lane busy, skip
    ├─ Read HEARTBEAT.md — instructions for what to check
    ├─ Run LLM with heartbeat context (isolated session)
    │
    ├─ If HEARTBEAT_OK returned:
    │   └─ pruneHeartbeatTranscript() — remove this run's entries from session history
    │   └─ If remaining content ≤ 300 chars, suppress delivery entirely
    │
    └─ If action needed:
        └─ Deliver to configured Telegram/WhatsApp channel
```

**Critical design decisions:**
- **Isolated sessions** (`isolatedSession: true`): Each heartbeat gets a fresh session key (`session:heartbeat`), preventing heartbeat context from contaminating main conversation
- **Transcript pruning**: No-op heartbeats leave zero trace in session history
- **Deduplication**: If identical non-OK response within 24h, suppress delivery
- **Light context** (`lightContext: true`): Heartbeats only get HEARTBEAT.md + bootstrap files, not full conversation history

**HEARTBEAT.md as control surface** — the user writes instructions, the agent follows them:
```markdown
## Every 30 minutes
- Check if any GitHub PRs need review
- Monitor disk space on Titan (alert if >85%)

## Every morning at 9am
- Summarize yesterday's conversations
- Check if any promises are overdue
```

**For Annie:** The Telegram scheduler already runs timed polls. Replace with a proper heartbeat runner that reads `~/.her-os/annie/HEARTBEAT.md`, runs isolated LLM sessions, and delivers results to Telegram only when action is needed. The transcript pruning pattern prevents heartbeat pollution of Annie's conversation memory.
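One heartbeat tick, following the design decisions above. `run_isolated_llm` and `deliver` are stand-ins for an isolated-session LLM call and a Telegram send; the `HEARTBEAT_OK` sentinel and the 300-char suppression threshold come from the OpenClaw description, while the active-hours window is an illustrative default.

```python
from datetime import datetime

def heartbeat_tick(run_isolated_llm, deliver, instructions, now: datetime,
                   active_hours=(8, 22)):
    if not (active_hours[0] <= now.hour < active_hours[1]):
        return "skipped"                       # don't run at 3am
    reply = run_isolated_llm(instructions)     # fresh session: no context pollution
    body = reply.replace("HEARTBEAT_OK", "").strip()
    if reply.startswith("HEARTBEAT_OK") and len(body) <= 300:
        return "suppressed"                    # no-op tick leaves zero trace
    deliver(body or reply)                     # push to the configured channel
    return "delivered"
```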

### Pattern 5: Code Agent — Exec-Fix Loop with Targeted Repair (from ARC)

**Source:** `vendor/AutoResearchClaw/researchclaw/pipeline/code_agent.py`

ARC's CodeAgent has 5 phases. The most adoptable for Annie is **Phase 2.5 + Phase 3**: hard validation gates + execution-fix loop.

**Hard validation (AST-based, zero LLM cost)** — before running any generated code:
1. `ast.parse()` syntax check
2. Cross-file import consistency (imported names exist in target files)
3. Empty class detection, hardcoded metric values, trivial computations
4. `if __name__ == "__main__":` guard check

**Exec-fix loop with targeted repair** — when code fails at runtime:
1. Parse Python traceback → extract exact file + line number (`_parse_error_location()`)
2. Extract ±30-line context window around the error
3. Send ONLY the affected file + compact summaries of other files to LLM
4. LLM fixes the targeted section
5. Retry up to `exec_fix_max_iterations` (default 3)

This is dramatically more token-efficient than sending all code on every error.

**For Annie:** When Annie writes code to discover APIs or control devices, the exec-fix loop is the pattern: run → if crash, parse traceback → targeted fix → retry. Combined with AST validation gates, most syntax errors are caught before execution.
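The traceback-parsing step of that loop is small enough to sketch directly. The regex assumes standard CPython traceback frames; `parse_error_location` and `context_window` are hypothetical names, not ARC's code.

```python
import re

FRAME_RE = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+)')

def parse_error_location(traceback_text):
    """(file, line) of the innermost frame, which is usually the failure site."""
    frames = FRAME_RE.findall(traceback_text)
    if not frames:
        return None
    file, line = frames[-1]
    return file, int(line)

def context_window(source: str, line: int, radius: int = 30) -> str:
    """The +/-`radius` lines around the error; only this goes back to the LLM."""
    lines = source.splitlines()
    lo, hi = max(0, line - 1 - radius), min(len(lines), line + radius)
    return "\n".join(lines[lo:hi])
```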

### Pattern 6: Two-Level Self-Evolution (from ARC)

**Source:** `vendor/AutoResearchClaw/researchclaw/evolution.py` + `metaclaw_bridge/`

ARC has two evolution levels:

**Level 1 — Intra-session (current session feedback):**
- `extract_lessons()` inspects task results for failures, blocked steps, runtime warnings, metric anomalies
- Each lesson → `LessonEntry(stage_name, category, severity, description, timestamp)`
- Stored in `evolution/lessons.jsonl` (JSONL append-only)
- `build_overlay(stage_name)` generates prompt text injected into future attempts
- **Time-decay weighting**: `weight = exp(-age_days * ln(2) / 30)` — 30-day half-life, 90-day cutoff

**Level 2 — Cross-session (persistent skills):**
- `convert_lessons_to_skills(lessons, llm, skills_dir)` — LLM generates SKILL.md from high-severity lessons
- `SkillFeedbackStore` tracks which skills were active when tasks succeeded/failed
- `compute_skill_stats()` returns `{total, successes, success_rate}` per skill
- **PRM quality gate** (`prm_gate.py`): N parallel LLM-as-judge calls (default 3 votes at temp 0.6) with majority vote before persisting skills

**For Annie:** After each session, extract lessons from failures → store in `~/.her-os/annie/evolution.jsonl` → inject relevant lessons into future prompts. When a lesson's severity is "error", trigger skill creation via LLM. Track effectiveness. Low-performing skills get evolved.
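The decay weighting, written out (30-day half-life, hard cutoff at 90 days, per the formula above):

```python
import math

def lesson_weight(age_days, half_life_days=30.0, cutoff_days=90.0):
    """Time-decay weight for a lesson: halves every 30 days, zero past 90."""
    if age_days > cutoff_days:
        return 0.0
    return math.exp(-age_days * math.log(2) / half_life_days)
```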

### Pattern 7: Session Key as Routing Table (from OpenClaw)

**Source:** `vendor/openclaw/src/agents/tools/cron-tool.ts`, `subagent-spawn.ts`

Session keys encode the entire routing path:
```
agent:<agentId>:telegram:direct:<peerId>
agent:<agentId>:telegram:group:<chatId>:<topicId>
cron:<jobId>
agent:<agentId>:subagent:<runId>
<mainKey>:heartbeat
```

The cron tool's `inferDeliveryFromSessionKey()` automatically extracts channel + peer from the current session key. When Annie creates a cron job from a Telegram chat, the job automatically delivers results back to that Telegram chat — no explicit delivery configuration needed.

**For Annie:** Use session keys like `annie:voice:webrtc:<sessionId>` and `annie:telegram:direct:<chatId>`. Heartbeat, cron, and sub-agent results route back to the originating channel automatically.
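A sketch of the inference step for those proposed keys. The key shapes are this document's proposal, not existing code, so the parser is equally hypothetical.

```python
def infer_delivery(session_key: str):
    """'annie:telegram:direct:12345' -> ('telegram', '12345'); None if not routable."""
    parts = session_key.split(":")
    if len(parts) >= 4 and parts[0] == "annie":
        return parts[1], parts[-1]   # (channel, peer/session id)
    return None                      # e.g. cron:<jobId> needs explicit delivery
```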

### Pattern 8: Subagent Push-Not-Poll (from OpenClaw)

**Source:** `vendor/openclaw/src/agents/subagent-spawn.ts` (line 88)

When Annie spawns a background task (e.g., "research how to control Sonos"):
- The child agent runs in an isolated session
- On completion, result is delivered as a **user-turn message** into the parent session
- Parent is re-invoked to process the completion
- **Critical contract**: parent must NOT poll — just wait for push notifications

For long-running bash processes, the **Auto-Notify** pattern:
```bash
# Appended to spawned process prompt:
"When done: annie system event --text 'Done: <summary>' --mode now"
```
This calls `requestHeartbeatNow()` which wakes the heartbeat runner immediately.

### Pattern 9: Sandboxed Self-Programming (from NemoClaw)

**Source:** `vendor/NemoClaw/nemoclaw-blueprint/policies/openclaw-sandbox.yaml`

Three-layer security: Landlock (filesystem) + seccomp (syscalls) + netns (network).

**9 composable policy presets** (hot-reloadable without restart):

| Preset | What it allows | Methods |
|--------|---------------|---------|
| `pypi.yaml` | pypi.org, files.pythonhosted.org | GET only |
| `npm.yaml` | registry.npmjs.org, registry.yarnpkg.com | GET only |
| `telegram.yaml` | api.telegram.org `/bot*/**` | GET+POST |
| `huggingface.yaml` | huggingface.co, cdn-lfs, api-inference | GET+POST |
| `docker.yaml` | Docker Hub, nvcr.io | GET+POST |
| `slack.yaml` | slack.com, api.slack.com, hooks.slack.com | GET+POST |
| `discord.yaml` | discord.com, gateway, CDN | GET+POST |
| `outlook.yaml` | graph.microsoft.com, login.microsoftonline.com | GET+POST |
| `jira.yaml` | *.atlassian.net (wildcard) | GET+POST |

**Binary-level network enforcement**: `binaries: [{path: /usr/bin/git}]` means only git can reach github.com — not any Python script Annie writes. This prevents self-written code from exfiltrating data.

**Credential isolation via inference proxy**: Agent sees `api_key: "openshell-managed"`. The `inference.local` TLS-terminating proxy intercepts requests and injects the real API key from host credential store. Even a compromised agent cannot exfiltrate keys.

**Gaps for Annie's IoT use case:**
- No IP-range policies (NemoClaw is FQDN-based; LAN scanning needs `192.168.x.x/24` ranges)
- `landlock: compatibility: best_effort` silently falls back on older kernels — Annie should fail closed instead
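A hypothetical `lan.yaml` preset sketching a fix for the first gap. The `cidr` key and the `enforce` value are proposed extensions, not existing NemoClaw syntax; port 1400 is the HTTP control port Sonos devices commonly expose.

```yaml
# Hypothetical lan.yaml preset — CIDR support does not exist in NemoClaw
# today; this sketches the proposed extension for LAN device discovery.
network:
  allow:
    - cidr: 192.168.0.0/24     # local subnet for device discovery
      ports: [80, 443, 1400]   # 1400 = Sonos HTTP control port
      methods: [GET, POST]
landlock:
  compatibility: enforce       # fail closed instead of best_effort
```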

### Pattern 10: Contract/Definition-of-Done (from ARC)

**Source:** `vendor/AutoResearchClaw/researchclaw/pipeline/contracts.py`

Every ARC stage has a `StageContract`:
```python
@dataclass(frozen=True)
class StageContract:
    input_files: list[str]    # required inputs
    output_files: list[str]   # expected outputs
    dod: str                  # definition of done (human-readable)
    error_code: str           # failure identifier
    max_retries: int          # retry budget
```

The executor validates pre-conditions (inputs exist) and post-conditions (outputs non-empty) for every stage. Missing/empty outputs → FAILED status.

**For Annie's skills:** Each SKILL.md should declare what it needs and what it produces:
```yaml
---
name: sonos-control
requires: [network_access, execute_python]
produces: [device_list, playback_control]
dod: "At least one Sonos device discovered and responding to API calls"
---
```
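The executor's pre/post checks reduce to existence and non-emptiness tests. A minimal sketch, with function names that are illustrative rather than ARC's actual API:

```python
# Sketch of the executor's contract checks: pre-conditions (inputs exist)
# and post-conditions (outputs exist and are non-empty). Function names
# are illustrative, not ARC's actual API.
from pathlib import Path

def check_preconditions(input_files: list[str], root: Path) -> list[str]:
    """Return the inputs that are missing; empty list means OK to run."""
    return [f for f in input_files if not (root / f).exists()]

def check_postconditions(output_files: list[str], root: Path) -> list[str]:
    """Return outputs that are missing or empty; any entry => FAILED."""
    return [
        f for f in output_files
        if not (root / f).exists() or (root / f).stat().st_size == 0
    ]
```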

---

## 10. Implementation Blueprint (v2): How to Build Annie's Self-Programming

### Mapping Vendor Patterns to Annie Components

| What to Build | Primary Source | Key Files | Annie Component |
|---|---|---|---|
| **Skill loader** | OpenClaw `loadSkillEntries()` | `vendor/openclaw/src/agents/skills/workspace.ts` | `services/annie-voice/skill_loader.py` |
| **Skill format** | OpenClaw frontmatter | `vendor/openclaw/src/agents/skills/frontmatter.ts` | YAML frontmatter + markdown |
| **Skill creation** | ARC `convert_lessons_to_skills()` | `vendor/AutoResearchClaw/researchclaw/metaclaw_bridge/lesson_to_skill.py` | `services/annie-core/skill_creator.py` |
| **Skill effectiveness** | ARC `SkillFeedbackStore` | `vendor/AutoResearchClaw/researchclaw/metaclaw_bridge/skill_feedback.py` | `~/.her-os/annie/skill_feedback.jsonl` |
| **Evolution store** | ARC `EvolutionStore` | `vendor/AutoResearchClaw/researchclaw/evolution.py` | `~/.her-os/annie/evolution.jsonl` |
| **Prompt overlay** | ARC `build_overlay()` | Same file | Inject lessons into prompt_builder |
| **Exec-fix loop** | ARC `_exec_fix_loop()` | `vendor/AutoResearchClaw/researchclaw/pipeline/code_agent.py` | Extend `execute_python` tool |
| **AST validation** | ARC `_hard_validate()` | Same file | Pre-execution gate in `code_tools.py` |
| **PRM quality gate** | ARC `ResearchPRMGate` | `vendor/AutoResearchClaw/researchclaw/metaclaw_bridge/prm_gate.py` | Validate skills before persisting |
| **Heartbeat daemon** | OpenClaw `heartbeat-runner.ts` | `vendor/openclaw/src/infra/heartbeat-runner.ts` | `services/annie-core/heartbeat.py` |
| **Transcript pruning** | OpenClaw `pruneHeartbeatTranscript()` | Same file | Prevent context pollution |
| **Sentinel watchdog** | ARC `sentinel.sh` | `vendor/AutoResearchClaw/sentinel.sh` | `scripts/annie-sentinel.sh` |
| **Checkpoint resume** | ARC `_write_checkpoint()` | `vendor/AutoResearchClaw/researchclaw/pipeline/runner.py` | Atomic tempfile + rename pattern |
| **Cron scheduling** | OpenClaw `CronService` | `vendor/openclaw/src/cron/service.ts` | `services/annie-core/cron.py` |
| **Cron auto-delivery** | OpenClaw `inferDeliveryFromSessionKey()` | `vendor/openclaw/src/agents/tools/cron-tool.ts` | Route from session key |
| **Channel adapter** | OpenClaw `ChannelPlugin` | `vendor/openclaw/extensions/telegram/src/channel.ts` | Unified Voice+Telegram frontend |
| **Session key routing** | OpenClaw session key encoding | `vendor/openclaw/src/routing/session-key.ts` | `annie:<channel>:<peer>` format |
| **Bootstrap files** | OpenClaw `loadWorkspaceBootstrapFiles()` | `vendor/openclaw/src/agents/workspace.ts` | Extend prompt_builder |
| **Session filtering** | OpenClaw `filterBootstrapFilesForSession()` | `vendor/openclaw/src/agents/bootstrap-files.ts` | Heartbeat/cron get minimal context |
| **Sub-agent push** | OpenClaw push-not-poll contract | `vendor/openclaw/src/agents/subagent-spawn.ts` | Extend sub-agent system |
| **Auto-notify** | OpenClaw `system event --mode now` | `vendor/openclaw/skills/coding-agent/SKILL.md` | HTTP POST to Annie event endpoint |
| **Sandbox security** | NemoClaw deny-by-default | `vendor/NemoClaw/nemoclaw-blueprint/policies/openclaw-sandbox.yaml` | Extend `execute_python` sandbox |
| **Policy presets** | NemoClaw composable YAML | `vendor/NemoClaw/nemoclaw-blueprint/policies/presets/` | Per-skill network policy |
| **Credential isolation** | NemoClaw `inference.local` proxy | `vendor/NemoClaw/bin/lib/credentials.js` | Never expose keys to sandbox |
| **Knowledge base** | ARC `KBEntry` + markdown | `vendor/AutoResearchClaw/researchclaw/knowledge/base.py` | `~/.her-os/annie/kb/` |
| **File-as-memory** | OpenClaw SOUL.md contract | `vendor/openclaw/docs/reference/templates/SOUL.md` | Already have SOUL.md/RULES.md |
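The "atomic tempfile + rename" row from the table is worth one concrete sketch: write the checkpoint to a temp file in the same directory, then `os.replace()` it into place, so a crash never leaves readers with a half-written file. File names here are illustrative.

```python
# Sketch of the atomic checkpoint pattern: write to a temp file in the
# same directory, fsync, then os.replace() so readers only ever see the
# old checkpoint or the complete new one, never a partial write.
import json
import os
import tempfile
from pathlib import Path

def write_checkpoint(path: Path, state: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target: `os.replace()` is only atomic within a single filesystem.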

### Build Sequence (4 Phases)

**Phase 1 — Unified Channel + Skill Loader (Foundation)**
1. Extract Annie Core: shared prompt builder + tool system + LLM client
2. Telegram routes free-text through Annie Core (not Context Engine search)
3. Skill loader: scan `~/.her-os/annie/SKILLS/*/SKILL.md`, inject Level 1 into prompt
4. Bootstrap file filtering: heartbeat/cron sessions get minimal context

**Phase 2 — Self-Programming Loop (Core Capability)**
1. Capability gap recognition: prompt engineering ("If you don't have a tool, research + build + persist")
2. Exec-fix loop: extend `execute_python` with traceback parsing + targeted repair + retry
3. AST validation gates before execution
4. `save_skill` tool: writes `~/.her-os/annie/SKILLS/<name>/SKILL.md`
5. Evolution store: `~/.her-os/annie/evolution.jsonl` with `build_overlay()`
6. PRM quality gate before persisting new skills
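Step 2's exec-fix loop reduces to run, capture traceback, targeted repair, retry. A minimal sketch; `fix_with_llm` is a placeholder for the real repair call:

```python
# Sketch of the exec-fix loop: run generated code, and on failure hand
# only the code plus the traceback to the model for a targeted repair.
# fix_with_llm is a placeholder for the real LLM repair call.
import subprocess
import sys

def exec_fix_loop(code: str, fix_with_llm, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        proc = subprocess.run(
            [sys.executable, "-c", code], capture_output=True, text=True
        )
        if proc.returncode == 0:
            return proc.stdout
        # Targeted repair: the model sees the failing code and traceback.
        code = fix_with_llm(code, proc.stderr)
    raise RuntimeError(f"still failing after {max_retries} attempts")
```

In Annie this would run inside the `execute_python` sandbox rather than a raw subprocess; the loop shape is the same.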

**Phase 3 — Heartbeat + Autonomous Loops**
1. Heartbeat daemon: reads HEARTBEAT.md, runs isolated LLM sessions, delivers to Telegram
2. HEARTBEAT_OK contract + transcript pruning
3. Sentinel watchdog: monitors heartbeat.json, auto-restarts crashed loops
4. Cron service: LLM-callable tool for self-scheduling
5. Sub-agent push-not-poll with auto-notify
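Steps 1 and 2 can be sketched as a wakeable loop: sleep on an event with a timeout, so `requestHeartbeatNow()`-style wakes and the regular interval share one code path, and only non-`HEARTBEAT_OK` results get delivered. Class and method names here are assumptions.

```python
# Sketch of a wakeable heartbeat loop: Event.wait(timeout) serves both
# the regular interval and immediate wake requests. run_session and
# deliver are placeholders; names are assumptions.
import threading

class Heartbeat:
    def __init__(self, interval_s: float, run_session, deliver):
        self.wake = threading.Event()
        self.interval_s = interval_s
        self.run_session = run_session
        self.deliver = deliver
        self.stopped = False

    def request_now(self) -> None:
        self.wake.set()  # wakes the runner immediately

    def run(self) -> None:
        while not self.stopped:
            self.wake.wait(timeout=self.interval_s)
            self.wake.clear()
            if self.stopped:
                break
            result = self.run_session()
            if result != "HEARTBEAT_OK":  # only deliver when there's news
                self.deliver(result)
```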

**Phase 4 — Sandbox Hardening + Domain Expansion**
1. Network policy presets for discovered services
2. Credential isolation proxy
3. Per-skill contract validation (requires/produces/dod)
4. Skill effectiveness tracking + evolution
5. Any domain is now one conversation away

---

## References

- **Transcript:** `../her-player/downloads/kwSVtQ7dziU/subtitles.vtt`
- **OpenClaw research:** `docs/RESEARCH-OPENCLAW.md`
- **OpenClaw TODOs:** `docs/TODO-OPENCLAW-ADOPTION.md`
- **Email agents research:** `docs/RESEARCH-EMAIL-AGENTS.md`
- **Browser agent research:** `docs/RESEARCH-BROWSER-AGENT.md`
- **Annie Voice tools:** `services/annie-voice/tools.py`, `memory_tools.py`, `emotion_tools.py`, `code_tools.py`, `browser_tools.py`, `subagent_tools.py`, `visual_tools.py`
- **Telegram bot:** `services/telegram-bot/bot.py`, `context_client.py`
- **AutoResearchClaw:** `vendor/AutoResearchClaw/` — 23-stage pipeline, evolution store, MetaClaw bridge (lessons → skills), code agent (5-phase), sentinel watchdog, PRM quality gates
- **OpenClaw:** `vendor/openclaw/` — skill system (6 sources, 3-tier budget), heartbeat daemon (15-step decision tree, transcript pruning), channel adapters, cron service with auto-delivery, session key routing, sub-agent push-not-poll, file-as-memory contract
- **NemoClaw:** `vendor/NemoClaw/` — sandboxing (Landlock + seccomp + netns), 9 policy presets (hot-reloadable), binary-level network enforcement, credential isolation via inference proxy, 4 inference profiles
