# Research: Context Engineering for AI Agents — Anthropic's Playbook

**Date:** 2026-03-10
**Status:** Research complete
**Sources:**
- [Effective Context Engineering for AI Agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) (Sep 2025, Anthropic Applied AI)
- [Prompt Caching](https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching) (Anthropic Platform)
- [Context Windows](https://platform.claude.com/docs/en/docs/build-with-claude/context-windows) (Anthropic Platform)
- [Memory Tool](https://platform.claude.com/docs/en/docs/agents-and-tools/tool-use/memory-tool) (Anthropic Platform)
- [Compaction](https://platform.claude.com/docs/en/docs/build-with-claude/compaction) (Anthropic Platform)
- [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) (Anthropic)

**Relevance:** Direct — every concept maps to Annie's voice agent architecture

---

## 1. The Paradigm Shift: Context Engineering vs Prompt Engineering

### Definition

**Context:** "The set of tokens included when sampling from a large-language model."

**Context Engineering:** "The set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts."

### Why It Matters

Prompt engineering = methods for writing and organizing LLM instructions (primarily system prompts).
Context engineering = managing the **entire context state** — system instructions, tool definitions, MCP data, retrieved memories, conversation history, and dynamic state.

Key distinction: prompt engineering is a **discrete writing task**; context engineering is an **iterative curation process** that occurs on every single API call.

### Annie Mapping

Annie already practices context engineering without naming it:
- `context_loader.py` assembles 6 sources in parallel (not "one prompt")
- `context_management.py` manages compaction and tool clearing
- `memory_notes.py` persists facts across sessions
- `transcript_writer.py` creates durable conversation logs
- Tool schemas ship with every API call

**Gap:** Annie doesn't have a unified framework for thinking about all of these as "context engineering." The learning page should establish this vocabulary.

---

## 2. Context Rot and the Attention Budget

### Context Rot

As context grows, model accuracy and recall degrade. Anthropic calls this "context rot."

**Mechanism:** The transformer architecture computes pairwise attention across all tokens: with n tokens in context there are on the order of n² relationships, and each new token must attend to all n tokens before it. The model's "attention budget" (finite working memory) is spread thinner as context grows.

**Result:** "Performance gradient rather than hard cliff — capability maintained but reduced precision for information retrieval and long-range reasoning."

### Attention Budget

LLMs have a **finite attention budget** like human working memory. Every token depletes this budget. Context is a "finite resource with diminishing marginal returns."

Key insight: the 90th percentile of useful context is orders of magnitude more valuable than the 10th percentile.
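
The quadratic scaling behind this can be made concrete with a quick back-of-the-envelope calculation:

```python
def attention_pairs(n: int) -> int:
    """Number of pairwise attention relationships among n tokens (n choose 2)."""
    return n * (n - 1) // 2

# Doubling the context roughly quadruples the relationships the
# finite attention budget must cover.
print(attention_pairs(1_000))  # 499500
print(attention_pairs(2_000))  # 1999000
```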

### Annie Mapping

Annie's voice conversations generate 10,000-15,000 tokens per 30-minute session. Add tool results and responses, and the 128K window (or 32K for Qwen) fills faster than expected.

Current mitigations:
- Session boundaries (each voice call starts fresh) — eliminates cross-session rot
- Compaction at 80K tokens — reduces mid-session rot
- Tool result clearing at 50K tokens — removes stale intermediate data

**Gap:** Annie has no **context awareness** — she doesn't know how full her window is until compaction fires. Claude Code tracks remaining budget via `<budget:token_budget>` markers.

---

## 3. The Anatomy of Effective Context

### System Prompts: The Right Altitude Principle

"Specific enough to guide behavior effectively, yet flexible enough to provide the model with strong heuristics."

**Failure modes:**
1. **Overly prescriptive:** Hardcoding complex, brittle logic (fragile, high maintenance). Example: "Always respond in exactly 3 sentences."
2. **Overly vague:** High-level guidance without concrete signals. Example: "Be helpful."

**Best practices:**
- Organize into sections using XML tags or Markdown headers
- Strive for minimal set of information fully outlining expected behavior
- Start with minimal prompt on best available model
- Iteratively add instructions based on observed failure modes
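
As an illustration of the sectioning advice above, a "right altitude" prompt skeleton might look like this (the section names and wording are invented, not Annie's actual prompt):

```python
# Hypothetical sectioned system prompt: XML tags delimit concerns,
# each section gives heuristics rather than brittle hardcoded rules.
SYSTEM_PROMPT = """
<role>
You are a warm, concise voice assistant. Keep responses conversational.
</role>

<constraints>
Don't use markdown formatting; your output is spoken aloud.
</constraints>

<tools_guidance>
Prefer answering from context; call tools only when information is missing.
</tools_guidance>
""".strip()
```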

### Annie Mapping

Annie's SYSTEM_PROMPT (in `bot.py`) is at "right altitude" already:
- "Keep responses conversational" (not "respond in under 50 words")
- "Be warm but concise" (personality, not word count)
- "Don't use markdown formatting" (practical voice constraint)

### Tool Design Principles

Effective tools must be:
1. **Well-understood** — both human and AI can predict behavior from name + schema
2. **Minimal overlap** — no two tools do the same thing
3. **Self-contained** — all context needed is in the parameters
4. **Robust to error** — clear error messages, not silent failures
5. **Clear intended use** — description tells *when* to use, not just *what*

**Anti-pattern: Bloated Tool Sets** — "If humans can't definitively select a tool, agents can't either."
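
A tool definition following these principles might look like the sketch below, using the Anthropic Messages API tool schema (`name`, `description`, `input_schema`). The description deliberately says *when* to use the tool, and the parameters are self-contained; the exact wording is illustrative, not Annie's real schema:

```python
# Hypothetical tool definition illustrating the five principles:
# predictable name, "when to use" in the description, self-contained parameters.
search_memory_tool = {
    "name": "search_memory",
    "description": (
        "Search past conversation transcripts with a keyword query. "
        "Use when the user references something discussed in an earlier "
        "session that is not in the current context."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords to search for"},
            "max_results": {"type": "integer", "description": "Cap on hits returned"},
        },
        "required": ["query"],
    },
}
```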

### Annie Mapping

Annie has 10 tools. Each passes the 5 principles:
- `search_memory` — BM25 over transcripts (broad, fuzzy)
- `read_notes` — curated personal notes (narrow, structured)
- `web_search` — current info from SearXNG
- `fetch_webpage` — read specific URLs
- `save_note` / `delete_note` — note management
- `think` — internal reasoning (not spoken)
- `render_table` / `render_chart` / `show_emotional_arc` — visual output

No overlap: `search_memory` and `read_notes` query different data stores with different purposes.

### Few-Shot Examples

"For an LLM, examples are the 'pictures' worth a thousand words." Curate diverse, canonical examples. Avoid "laundry list of edge cases."

### Annie Mapping

Annie doesn't currently use few-shot examples in her system prompt. This could be a future enhancement: a few example exchanges showing ideal voice conversation patterns.
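
If this were added, the examples could be seeded as prior conversation turns. The exchanges below are invented placeholders, not drawn from Annie's prompt:

```python
# Hypothetical canonical exchanges demonstrating the desired voice style:
# short, warm, no formatting, names used naturally.
FEW_SHOT_MESSAGES = [
    {"role": "user", "content": "What's on my calendar tomorrow?"},
    {"role": "assistant", "content": "Just two things: dentist at nine, then lunch with Sam at noon."},
    {"role": "user", "content": "Remind me what Sam's kid is called?"},
    {"role": "assistant", "content": "That's Maya. She just started kindergarten, if I remember right."},
]
```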

---

## 4. Context Retrieval Strategies

### Three Approaches

**1. Traditional Pre-Inference (Upfront) Retrieval**
Embedding-based systems surface important context before the API call. Fast but risks stale data.

**2. Just-in-Time Context**
Agents maintain lightweight identifiers (file paths, stored queries, web links) and dynamically load data via tool calls. Better for large/dynamic data.

**3. Progressive Disclosure**
Agents incrementally discover context through exploration. Signals guide decisions:
- File sizes suggest complexity
- Naming conventions hint at purpose
- Timestamps indicate relevance

Agents "assemble understanding layer by layer, maintaining only what's necessary in working memory."

### Hybrid Strategy (Recommended)

Combine upfront retrieval for speed with autonomous exploration for depth.

**Claude Code's approach:**
- CLAUDE.md files loaded upfront (stable, always-relevant context)
- Glob/grep tools enable just-in-time file retrieval
- Bypasses stale indexing and complex syntax tree issues

**Recommendation:** "Do the simplest thing that works."

### Annie Mapping

**Currently:** Annie uses **pure upfront retrieval** — `load_full_briefing()` loads all 6 sources via `asyncio.gather` before the conversation starts. Everything available from the start.

**Why this works (for now):** Voice sessions are short (5-30 min). Annie's data universe is small (one person's life, not a codebase). Upfront loading adds ~1-2s of latency at session start, which is acceptable.

**Growth path — Hybrid retrieval:**
- Load entity names + types upfront (metadata)
- Use `search_memory` tool for on-demand deep retrieval
- MCP Knowledge Graph server already supports this pattern (list entities → get entity details)
- Particularly valuable when the entity count grows beyond hundreds
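
The growth path above might be sketched roughly as follows; the function names, data shapes, and entities are all hypothetical:

```python
# Hypothetical hybrid retrieval: cheap metadata upfront, details on demand.
ENTITIES = {  # stand-in for the Knowledge Graph store
    "Sam": {"type": "person", "notes": "Friend; works at the hospital."},
    "Route 66 trip": {"type": "event", "notes": "Planned for June."},
}

def load_briefing_metadata() -> str:
    """Upfront: only names + types, a few tokens per entity."""
    return ", ".join(f"{name} ({e['type']})" for name, e in ENTITIES.items())

def get_entity_details(name: str) -> str:
    """On demand: full details, fetched via a tool call mid-conversation."""
    entity = ENTITIES.get(name)
    return entity["notes"] if entity else f"No entity named {name!r}."

# Session start injects the cheap index; the model calls the tool for depth.
briefing = load_briefing_metadata()
```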

---

## 5. Long-Horizon Context Management

### Compaction

**Definition:** Summarizing a conversation that is nearing the context limit, then reinitiating with the summary.

**Key insight:** "Distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation."

**Art of compaction:** What to keep vs discard. Maximize recall first (capture all relevant info), then iterate for precision (eliminate superfluous content).

**Server-side API (Beta: `compact_20260112`):**
- Supported on Claude Opus 4.6 and Sonnet 4.6
- Detects when input tokens exceed threshold
- Generates a `compaction` block with summary
- Subsequent requests automatically drop messages before compaction block

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `messages` is the running conversation history
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)
```

**Lowest-touch form:** Tool result clearing — dropping stale tool results from the message history. Recently launched on the Claude Developer Platform.

```python
context_management={
    "edits": [{
        "type": "clear_tool_uses_20250919",
        "trigger": {"type": "input_tokens", "value": 50000},
        "keep": {"type": "tool_uses", "value": 5}
    }]
}
```

### Annie's Current Implementation

```python
# context_management.py
COMPACTION_TRIGGER_TOKENS = 80000  # auto-compaction at 80K
TOOL_CLEAR_TRIGGER_TOKENS = 50000  # clear old tool results at 50K
TOOL_KEEP_COUNT = 5                # keep 5 most recent tool exchanges
```

Both features are already integrated. Annie uses compaction + tool clearing together.
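
Assuming the edit types behave as the docs above describe, both could be combined in a single `context_management` config using Annie's thresholds (a sketch, not Annie's actual code — in particular, the `trigger` key on the compaction edit is an assumption, since the doc's example omits it):

```python
# Sketch: compaction + tool clearing in one context_management config,
# mirroring the constants in context_management.py.
COMPACTION_TRIGGER_TOKENS = 80_000
TOOL_CLEAR_TRIGGER_TOKENS = 50_000
TOOL_KEEP_COUNT = 5

context_management = {
    "edits": [
        {
            "type": "clear_tool_uses_20250919",
            "trigger": {"type": "input_tokens", "value": TOOL_CLEAR_TRIGGER_TOKENS},
            "keep": {"type": "tool_uses", "value": TOOL_KEEP_COUNT},
        },
        {
            "type": "compact_20260112",
            # assumed: compaction accepts a trigger like the clearing edit
            "trigger": {"type": "input_tokens", "value": COMPACTION_TRIGGER_TOKENS},
        },
    ]
}
```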

### Memory + Context Editing Combo

When context editing clears old tool results, the memory tool preserves critical information. The workflow:
1. Context grows toward clearing threshold
2. Claude receives warning notification
3. Agent saves important info to memory before clearing
4. After clearing, agent retrieves stored info from memory on demand

```python
# Combine memory + context editing
context_management={
    "edits": [{
        "type": "clear_tool_uses_20250919",
        "trigger": {"type": "input_tokens", "value": 100000},
        "keep": {"type": "tool_uses", "value": 3},
        "exclude_tools": ["memory"]  # Keep memory ops visible
    }]
}
```

### Annie Mapping

Annie uses `memory_notes.py` (save/read/delete) as her memory layer. When compaction fires, facts saved to notes survive. This is architecturally equivalent to the Memory + Context Editing combo described in the docs.

**Gap:** Annie doesn't exclude memory tool calls from clearing. Could improve by keeping memory ops visible during clearing.

---

## 6. Structured Note-Taking (Agentic Memory)

### Concept

Agents regularly write notes persisted **outside** the context window, pulled back later. "Provides persistent memory with minimal overhead."

**Patterns:**
- Claude Code creating to-do lists
- Custom agents maintaining NOTES.md files
- Track progress across complex tasks
- Maintain critical context and dependencies

### Pokémon Example

Claude playing Pokémon maintains precise tallies across thousands of game steps — "for the last 1,234 steps I've been training my Pokémon in Route 1, Pikachu has gained 8 levels toward the target of 10." Develops maps, remembers achievements, maintains combat strategy notes. After context resets, reads its notes and continues.

### Memory Tool API

Type: `memory_20250818` (client-side tool). Commands:
- `view` — directory listing or file contents with line numbers
- `create` — create new file
- `str_replace` — replace text in file (must be unique match)
- `insert` — insert text at specific line
- `delete` — delete file or directory
- `rename` — rename/move file

**Prompting guidance (auto-included):**
```
IMPORTANT: ALWAYS VIEW YOUR MEMORY DIRECTORY BEFORE DOING ANYTHING ELSE.
MEMORY PROTOCOL:
1. Use the `view` command to check for earlier progress.
2. ... (work on the task) ...
   - As you make progress, record status/progress/thoughts in memory.
ASSUME INTERRUPTION: Your context window might be reset at any moment.
```

### Annie's Equivalent: `memory_notes.py`

```python
# Annie's memory_notes.py API
save_note(category, content)   # Create note in category
read_notes(category)           # Read all notes in category
delete_note(category, fragment) # Delete matching note
# Categories: preferences, people, schedule, facts, topics
```

**Comparison:**

| Feature | Claude Code Memory Tool | Annie memory_notes.py |
|---------|------------------------|----------------------|
| Storage | File-based (/memories/) | JSON file (notes.json) |
| Structure | Free-form files | 5 fixed categories |
| Operations | view/create/str_replace/insert/delete/rename | save/read/delete |
| Update | In-place str_replace | Delete + save (no in-place) |
| Cross-session | Yes (persistent files) | Yes (persistent JSON) |
| Auto-check | Prompting: "always view memory first" | context_loader loads notes at session start |

**Gap:** Annie lacks in-place update (`str_replace` equivalent). Currently must delete + re-save to update a note.
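
Closing this gap could be as small as a `str_replace`-style helper layered on the notes store. The sketch below operates on an in-memory stand-in for `notes.json`; the helper name and semantics (unique-match requirement, mirroring the memory tool) are assumptions:

```python
# Sketch of a str_replace-style update for memory_notes (hypothetical helper).
notes = {"people": ["Sam works at the clinic."]}  # stand-in for notes.json

def update_note(category: str, old: str, new: str) -> bool:
    """Replace `old` with `new` in the single note containing `old`."""
    matches = [i for i, n in enumerate(notes.get(category, [])) if old in n]
    if len(matches) != 1:
        return False  # like str_replace: require a unique match
    i = matches[0]
    notes[category][i] = notes[category][i].replace(old, new)
    return True

update_note("people", "clinic", "hospital")
```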

---

## 7. Sub-Agent Architectures

### Concept

Rather than one agent maintaining state across an entire project, specialized sub-agents handle focused tasks with clean context windows.

**Operation:** Main agent coordinates with high-level plan. Sub-agents perform deep technical work or find information. Each may use tens of thousands of tokens but returns condensed summary (1,000-2,000 tokens).

**Advantage:** "Clear separation of concerns — the detailed search context remains isolated within sub-agents, while the lead agent focuses on synthesizing and analyzing the results."

### Approach Selection by Task Type

| Approach | Best For | Annie Mapping |
|----------|----------|---------------|
| Compaction | Long conversational flow | Voice sessions — compaction at 80K |
| Note-taking | Iterative development with milestones | Memory notes for cross-session facts |
| Multi-agent | Parallel exploration, complex research | **Not yet** — biggest growth vector |

### Annie Mapping

Annie has zero sub-agents today. This is the biggest architectural gap identified in the Claude Code comparison.

**Proposed sub-agent types for Annie:**
1. **Research agent** — multi-step web search, summarize results
2. **Memory deep-dive agent** — search 30 days of history, cluster themes
3. **Draft agent** — compose emails/messages using relationship context
4. **Planning agent** — break down complex requests into steps
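
A minimal delegation loop for the research agent might look like the sketch below. The search function and summarization step are stand-ins — in a real implementation the sub-agent would make its own model and tool calls inside a fresh context window:

```python
# Sketch: a research sub-agent burns its own context, returns a condensed summary.
def run_research_subagent(question: str, search=None) -> str:
    """Hypothetical sub-agent: explores with its own clean context window,
    then hands back only a short summary to the lead agent."""
    search = search or (lambda q: [f"result about {q}"] * 3)  # stub search tool
    raw_results = []
    for sub_query in (question, question + " background"):
        raw_results.extend(search(sub_query))  # tens of thousands of tokens here
    # In a real implementation this would be a summarization model call;
    # the lead agent only ever sees the condensed version.
    return f"Summary of {len(raw_results)} findings on {question!r}."

lead_context_addition = run_research_subagent("solar panel payback period")
```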

---

## 8. Context Awareness

### Concept

Claude Sonnet 4.6, Sonnet 4.5, and Haiku 4.5 feature **context awareness** — models track remaining token budget.

**How it works:**

At conversation start:
```xml
<budget:token_budget>200000</budget:token_budget>
```

After each tool call:
```xml
<system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>
```

**Benefits:**
- Long-running agent sessions requiring sustained focus
- Multi-context-window workflows where state transitions matter
- Complex tasks requiring careful token management

### Annie Mapping

**Gap:** Annie does not implement context awareness. She doesn't know her token budget or current usage. Compaction fires at a fixed threshold (80K) with no model awareness.

**Opportunity:** Adding budget tracking would let Annie:
- Proactively save facts to notes before compaction
- Adjust response verbosity based on remaining budget
- Decide whether to call tools (each call costs ~500-2000 tokens)

---

## 9. 1M Token Context Window

### Availability

Claude Opus 4.6, Sonnet 4.6, Sonnet 4.5, Sonnet 4 support 1M token context (beta, tier 4+ orgs).

```python
response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[...],
    betas=["context-1m-2025-08-07"],
)
```

**Pricing:** Requests exceeding 200K tokens are charged at 2x input and 1.5x output rates.

### Annie Mapping

Voice conversations don't need 1M tokens (sessions are 5-30 min, ~15K tokens). But sub-agents doing deep research could benefit from larger windows to process more search results or conversation history.

**Decision:** Not a priority for Annie's voice path. Relevant for future sub-agents and text chat.

---

## 10. Prompt Caching

### How It Works

Stores and reuses cached prompt prefixes. Two approaches:

**Automatic caching** (recommended for multi-turn):
```python
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Top-level
    system="...",
    messages=[...]
)
```

**Explicit breakpoints** (fine-grained):
```python
system=[{
    "type": "text",
    "text": "System prompt...",
    "cache_control": {"type": "ephemeral"}
}]
```

### Cost Savings

| Model | Base Input | Cache Write | Cache Read | Savings |
|-------|-----------|-------------|-----------|---------|
| Opus 4.6 | $5/MTok | $6.25/MTok | **$0.50/MTok** | **90%** |
| Sonnet 4.6 | $3/MTok | $3.75/MTok | **$0.30/MTok** | **90%** |
| Haiku 4.5 | $1/MTok | $1.25/MTok | **$0.10/MTok** | **90%** |

TTL: Default 5 minutes (free), extended 1 hour (2x write cost).

### Annie Mapping

Annie's system prompt + tool schemas are identical every turn. Multi-turn automatic caching means:
- Turn 1: All content cached (write cost)
- Turn 2+: Previous content read from cache (90% savings)

**Design implication:** Annie's system prompt puts stable content first (personality, rules) with dynamic sections appended (briefing, emotional state). Stable content caches; dynamic content doesn't. This is the correct architecture for prompt caching.
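
With the explicit-breakpoint form shown earlier, that ordering can be expressed directly — the cache breakpoint sits at the end of the stable block, so the dynamic briefing never invalidates the cached prefix (prompt and briefing text are placeholders):

```python
# Sketch: stable content first (cached), dynamic briefing appended after
# the cache breakpoint so it never invalidates the cached prefix.
STABLE_PROMPT = "Personality, rules, tool guidance..."       # identical every turn
dynamic_briefing = "Today's briefing: two calendar events."  # changes per session

system = [
    {
        "type": "text",
        "text": STABLE_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cached prefix ends here
    },
    {
        "type": "text",
        "text": dynamic_briefing,  # after the breakpoint: not cached
    },
]
```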

---

## 11. Multi-Session Patterns

### From "Effective Harnesses for Long-Running Agents"

**Problem:** Long-running agents lose progress across context windows. "Each new session begins with no memory of what came before."

**Solution: Two-part architecture:**

**1. Initializer Agent (first session):**
- `init.sh` script: automates environment setup
- `claude-progress.txt`: logs activities and decisions
- Initial git commit: documents baseline
- Feature list (JSON): comprehensive task specification

**2. Coding Agent (subsequent sessions):**
1. Run `pwd` to confirm directory
2. Read git logs and progress files
3. Consult feature list, select highest-priority incomplete feature
4. Run end-to-end tests before implementing
5. Make incremental progress on single feature
6. Commit with descriptive messages
7. Update progress documentation

**Key principle:** Work on one feature at a time. Only mark complete after e2e verification.
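
Steps 2 and 7 of that loop amount to a small progress-file protocol, which can be sketched as (filename follows the article; placed in a temp directory for this sketch):

```python
# Sketch: the read-then-append progress-file protocol for session handoffs.
import tempfile
from pathlib import Path

PROGRESS = Path(tempfile.gettempdir()) / "claude-progress.txt"

def read_progress() -> str:
    """Session start: recover what earlier sessions did and decided."""
    return PROGRESS.read_text() if PROGRESS.exists() else "(no prior progress)"

def log_progress(entry: str) -> None:
    """At each milestone: append a descriptive entry for the next session."""
    with PROGRESS.open("a") as f:
        f.write(entry.rstrip() + "\n")

log_progress("Implemented feature #3; e2e tests pass.")
```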

### Annie Mapping

Annie's voice sessions naturally follow a simplified version:
- **Session start** = `load_full_briefing()` (reads 6 sources — equivalent to "read progress files")
- **Active session** = voice conversation + tool use
- **Deterministic persistence** = `transcript_writer.py` (every 30s to JSONL) + `save_note` (model decision)
- **Session end** = final flush, context destroyed

**Gaps:**
1. No progress files / feature checklists for Annie's own development
2. No initializer agent pattern for sub-agents
3. No one-task-per-session discipline (voice sessions are naturally bounded, but text chat is not)
4. Text chat (`text_llm.py`) has no session boundary — context grows indefinitely

---

## 12. Guiding Principles (from Anthropic)

1. **Find the smallest possible set of high-signal tokens** that maximize the likelihood of desired outcome
2. **Context is a precious, finite resource** — treat it as having diminishing marginal returns
3. **Smarter models require less prescriptive engineering** — allow greater agent autonomy
4. **Do the simplest thing that works** — don't over-engineer retrieval strategies
5. **Maximize recall first, then iterate for precision** — especially for compaction
6. **If humans can't select a tool, agents can't either** — tool set design matters

---

## 13. Annie Implementation Roadmap

### Already Implemented
- [x] Server-side compaction (`compact_20260112` at 80K tokens)
- [x] Tool result clearing (`clear_tool_uses_20250919` at 50K, keep 5)
- [x] Structured note-taking (`memory_notes.py`)
- [x] Deterministic persistence (`transcript_writer.py`)
- [x] Upfront context assembly (6-source `load_full_briefing()`)
- [x] Prompt caching (implicit via multi-turn API)
- [x] Right-altitude system prompt
- [x] 5-principle tool design (10 tools, no overlap)
- [x] Session boundaries (WebRTC = natural session)
- [x] MCP integration (Knowledge Graph server)

### To Implement
- [ ] **Context awareness** — token budget tracking + model warnings
- [ ] **Sub-agent delegation** — research/draft/memory-dive agents
- [ ] **Hybrid retrieval** — metadata-first + tool-based deep retrieval
- [ ] **Progressive disclosure** — load entity names upfront, details on demand
- [ ] **In-place note updates** — str_replace equivalent for memory_notes
- [ ] **Memory exclusion from clearing** — `exclude_tools: ["save_note", "read_notes"]`
- [ ] **Text chat session boundary** — prevent indefinite context growth
- [ ] **Few-shot examples** — example voice exchanges in system prompt
- [ ] **Multi-session progress files** — for sub-agent handoffs
- [ ] **Initializer agent pattern** — for sub-agent spawning

### Priority Order
1. Context awareness (low effort, high impact — just add budget tracking)
2. Sub-agent delegation (high effort, highest impact — enables research/draft agents)
3. Hybrid retrieval (medium effort, grows with entity count)
4. Text chat session boundary (medium effort, prevents silent context rot)
5. Everything else (incremental improvements)
