# Research: Context Compaction for LLM Conversations

**Date:** 2026-03-13
**Status:** Complete (research). Implementation plan pending.
**Scope:** Context window management strategies for long-running voice conversations on small local LLMs (32K context, ADR-026)
**Applies to:** Annie voice agent — Qwen3.5-9B backend (32K ctx, bumped from 16K SFT), with reference to Claude backend (API-side compaction)

---

## Table of Contents

1. [Problem Statement](#1-problem-statement)
2. [Industry Survey — 7 Frameworks](#2-industry-survey--7-frameworks)
3. [Three-Tier Compaction Strategy](#3-three-tier-compaction-strategy)
4. [Annie's 16K Context Budget](#4-annies-16k-context-budget)
5. [Architecture & Integration Points](#5-architecture--integration-points)
6. [Generic Library Design (Model-Agnostic)](#6-generic-library-design-model-agnostic)
7. [Implementation Recommendations](#7-implementation-recommendations)
8. [Sources](#8-sources)

---

## 1. Problem Statement

### The Core Constraint

Annie's local LLM backend (Qwen3.5-9B) runs with `--ctx-size 32768` (32K tokens, ADR-026). The base architecture supports 262K natively; the Opus-Distilled SFT was trained at 16K but generalizes well to 32K for conversational use. Cost: ~1 GB extra VRAM on the DGX Spark (128 GB shared across multiple models).

### Why Context Fills Up

| Message Type | Typical Size | Frequency | Impact |
|-------------|-------------|-----------|--------|
| User speech | 30–150 tokens | Every turn | Low |
| Annie response | 100–300 tokens | Every turn | Low |
| Tool call (JSON) | 50–100 tokens | ~30% of turns | Low |
| **Tool RESULT** | **500–2000 tokens** | **~30% of turns** | **HIGH** |

**Tool results are the context killer.** A single `web_search` or `fetch_webpage` result consumes as many tokens as 4–10 user messages. Three tool calls consume the same space as the entire system prompt + all 17 tool schemas combined.

### Current State

- **Claude backend:** Already has API-side compaction via `compact-2026-01-12` (Anthropic's built-in summarization). No work needed.
- **9B backend:** No compaction at all. Context grows monotonically until the conversation ends or the model produces degraded output near the context ceiling.

### What Triggers the Problem

A typical 10-turn conversation with 3 tool calls reaches ~11,500 tokens (70% full). Two more tool-heavy turns (12–13) push the context past the ~16K ceiling. Real sessions with heavy web search usage can hit this in under 10 exchanges.

---

## 2. Industry Survey — 7 Frameworks

### Comparison Table

| Framework | Approach | Model-Agnostic? | Reusable for Us? |
|-----------|----------|-----------------|------------------|
| **Claude API** (`compact-2026-01-12`) | Server-side summarization | No (Claude only) | Already using for Claude backend |
| **Google ADK** (`LlmEventSummarizer`) | Sliding window + LLM summary | Yes (any LLM) | **Best pattern to adopt** |
| **LangChain Deep Agents** | 3-tier: offload tool results → offload tool inputs → summarize | Yes | **Best architecture** |
| **JetBrains Junie** | Observation masking (hide old tool outputs) | Yes (no LLM needed) | **Simplest first step** |
| **OpenCode** | Tool pruning + full compaction | Yes | Similar to LangChain |
| **Morph Compact** | Verbatim deletion model (purpose-built) | Separate model needed | Overkill for 16K context |
| **Claude Agent SDK** | Wraps Claude API compaction | No (Claude only) | N/A |

### Detailed Findings

#### Claude API — `compact-2026-01-12`

Anthropic's built-in compaction model. Called server-side during `messages.create()` when the `model` field specifies `compact-2026-01-12`. The server summarizes older conversation turns, maintaining the most recent exchanges verbatim. **Already active** for Annie's Claude backend via `_create_message_stream()`.

**Key insight:** Compaction is opaque — it happens inside the API. We can't inspect, customize, or reuse the prompt. This is fine for Claude but doesn't help the 9B backend.

#### Google ADK — `LlmEventSummarizer`

Google's Agent Development Kit provides `LlmEventSummarizer`, a class that:
- Monitors context window utilization
- Triggers LLM-based summarization when a threshold is exceeded
- Uses a configurable LLM (any model) to generate summaries
- Replaces old events with a summary event

**Pattern to steal:** The `LlmEventSummarizer` is model-agnostic. It doesn't care what LLM does the summarization. We can use the same 9B model that's already loaded — zero additional VRAM cost.

#### LangChain Deep Agents — Three-Tier Context Management

LangChain's "Deep Agents" blog post describes the most sophisticated approach:

1. **Tier 1: Offload tool results** — Replace old tool outputs with `"[Result offloaded]"`. Zero LLM cost.
2. **Tier 2: Offload tool inputs** — Remove the tool call JSON for old calls (the response already captures the essence).
3. **Tier 3: LLM summarization** — Summarize the oldest conversation segment.

**Architecture insight:** The three tiers are ordered by cost. Tier 1 alone often frees enough context for 5-10 more turns. Tier 2 adds marginal savings. Tier 3 is the heavy gun.

#### JetBrains Junie — Observation Masking

JetBrains published SWE-bench results showing that simply **hiding old tool outputs** (without summarization) matched the quality of full LLM summarization. Their "observation masking" technique:

- Replaces tool results older than N turns with a placeholder
- Keeps the tool call JSON (so the model knows what was asked)
- **Zero LLM cost, zero latency**

**Critical finding:** The model doesn't need old tool results — it has already absorbed the relevant information into its subsequent responses. The raw tool outputs are redundant context, which is why masking alone matches summarization.

#### OpenCode

Similar to LangChain's approach: tool pruning first, full compaction second. Uses a configurable LLM for summarization. Nothing novel beyond what LangChain describes.

#### Morph Compact

A purpose-built model for "verbatim compaction" — it deletes low-signal tokens from the conversation without paraphrasing. Claims 50-70% context reduction with zero hallucination risk (since it only deletes, never generates).

**Assessment:** Interesting but overkill for our 16K context. Requires loading a separate model (more VRAM). The three-tier approach with observation masking achieves similar results without a dedicated model.

#### Claude Agent SDK (Python)

The `claude-agent-sdk-python` wraps Claude API compaction. Not usable for non-Claude models.

### Key Industry Insight

**No one has published a standalone open-source compaction library.** Instead, compaction is built into each agent framework (Google ADK, LangChain, Claude Agent SDK). But the patterns are well-documented and straightforward to implement — the core logic is 50-100 lines of code.

---

## 3. Three-Tier Compaction Strategy

Based on the industry survey, our recommended strategy for the 9B backend:

### Tier 1: Tool Result Clearing (Zero LLM Cost)

**Inspiration:** JetBrains Junie + LangChain Deep Agents

Replace old tool results with a placeholder string. Keep the tool call JSON (so the model knows what was asked).

```python
# Before: tool result consuming 1500 tokens
{"role": "tool", "content": "[Full web search results about Netflix trending shows...]"}

# After: tool result consuming ~20 tokens
{"role": "tool", "content": "[Tool result cleared — information incorporated in conversation above]"}
```

**When to trigger:** When context reaches 65% capacity (~10,600 tokens). Clear all tool results except the most recent 2.

**Expected savings:** 2,000–6,000 tokens freed per clearing (each tool result is 500-2000 tokens).
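The trigger condition can be sketched with a rough token estimate. This is a minimal sketch using a chars/4 heuristic — a real implementation would count with the model's tokenizer, and the function names here are illustrative, not existing code:

```python
import json

def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    total_chars = 0
    for msg in messages:
        total_chars += len(msg.get("content") or "")
        # Tool call JSON occupies context too
        if msg.get("tool_calls"):
            total_chars += len(json.dumps(msg["tool_calls"]))
    return total_chars // 4

def should_clear_tool_results(
    messages: list[dict], ctx_size: int = 16384, threshold: float = 0.65
) -> bool:
    """Tier 1 trigger: context at or above 65% of the window."""
    return estimate_tokens(messages) >= int(ctx_size * threshold)
```

The heuristic overestimates for JSON-heavy content and underestimates for CJK text, but it only needs to be accurate enough to fire a few turns before the ceiling.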

### Tier 2: LLM Summarization (Same 9B Model)

**Inspiration:** Google ADK `LlmEventSummarizer`

If Tier 1 is insufficient (context still above threshold after clearing), summarize the oldest conversation turns using the same 9B model.

```python
# Summarization prompt
COMPACT_PROMPT = """Summarize this conversation concisely. Preserve:
- Key facts and decisions
- User preferences and requests
- Emotional context
- Any promises or commitments made
Keep the summary under 400 tokens."""
```

**When to trigger:** When context reaches 80% capacity after Tier 1 has already run (~13,100 tokens).

**Expected savings:** Compresses 3,000-8,000 tokens of conversation into ~400 tokens of summary.

**Latency:** 1-3 seconds on the 9B model. Should be triggered during idle time (user is speaking, STT is processing).

### Tier 3: Post-Disconnect Persistence

When the WebRTC session ends, compact the full conversation and save for next session loading.

```python
# Save compacted summary for next session
summary = await compact_messages(context.messages, llm_client)
save_session_summary(session_id, summary)  # → memory_notes.py or context_loader.py
```

**When to trigger:** On `on_client_disconnected` event.

**Purpose:** Ensures the next conversation starts with relevant context from previous sessions.

---

## 4. Annie's 16K Context Budget

**Note:** The budget below is computed against the conservative 16K SFT window. With the 32K window configured under ADR-026, every absolute trigger point doubles, but the percentages and the tier logic are unchanged.

### Fixed Overhead (Loaded Before Conversation)

| Component | Tokens | % of 16K | Source |
|-----------|--------|----------|--------|
| System prompt | ~750 | 5% | `bot.py` lines 111-161 (personality, guidelines, examples) |
| Tool schemas (17 tools) | ~1,800 | 11% | web_search, fetch_webpage, search_memory, think, show_emotional_arc, save_note, read_notes, delete_note, update_note, get_entity_details, invoke_researcher, invoke_memory_dive, invoke_draft_writer, render_table, render_chart, render_svg |
| Memory briefing | ~750 | 5% | From Context Engine (recent convos, entities, promises, emotional state; MAX 3000 chars) |
| Annie's notes | ~125 | 1% | Curated facts saved by Annie (`memory_notes.py`) |
| **Total fixed** | **~3,425** | **21%** | |

### Conversation Budget

| | Tokens | % of 16K |
|---|--------|----------|
| Available for conversation | ~12,959 | 79% |

### Example 10-Turn Session

| Turn | Content | Tokens | Cumulative |
|------|---------|--------|------------|
| 1 | "Hi Annie" + response | 250 | 250 |
| 2 | "Search Netflix" + tool call + result | 1,900 | 2,150 |
| 3 | "Tell me more" + response | 250 | 2,400 |
| 4 | "Search Iraq news" + tool call + result | 1,800 | 4,200 |
| 5 | "Interesting" + response | 200 | 4,400 |
| 6 | "Search Iran news" + tool call + result | 2,000 | 6,400 |
| 7 | "Compare them" + response | 300 | 6,700 |
| 8 | "How am I feeling?" + emotional arc result | 800 | 7,500 |
| 9 | "Save that I like..." + save_note result | 400 | 7,900 |
| 10 | "What's the time?" + response | 200 | 8,100 |

**Total with fixed overhead:** ~11,525 tokens (70% full)

| Scenario | Tokens | % Full | Status |
|----------|--------|--------|--------|
| After turn 10 | ~11,525 | 70% | OK |
| Turn 11 (another search) | ~13,525 | 83% | Warning |
| Turn 12 (another search) | ~15,525 | 95% | DANGER |
| Turn 13 | >16,384 | >100% | OVERFLOW |

### After Tier 1 Compaction (Tool Result Clearing)

Clear the old tool results from turns 2, 4, and 6, keeping the two most recent (turns 8–9): **~3,000 tokens freed**

| Metric | Before | After |
|--------|--------|-------|
| Total tokens | ~11,525 | ~8,525 |
| % full | 70% | 52% |
| Remaining headroom | ~4,859 | ~7,859 |

### After Tier 2 Compaction (LLM Summarization)

Summarize turns 1-8 into ~400 token summary, keep turns 9-10 intact:

| Metric | Before | After |
|--------|--------|-------|
| Total tokens | ~8,525 | ~4,425 |
| % full | 52% | 27% |
| Remaining headroom | ~7,859 | ~11,959 |

**Conclusion:** Tier 1 alone extends conversations from ~12 turns to ~20+ turns. Combined with Tier 2, a conversation can run indefinitely.

---

## 5. Architecture & Integration Points

### Pipecat Context Architecture

Pipecat's context is a simple Python list — `context.messages` is a `list[dict]`. Compaction = replacing old messages with a summary message. No special framework support needed.

```python
# context.messages is just:
[
    {"role": "system", "content": "..."},
    {"role": "user", "content": "Hi Annie"},
    {"role": "assistant", "content": "Hello! How are you?"},
    {"role": "user", "content": "Search for..."},
    {"role": "assistant", "tool_calls": [...]},
    {"role": "tool", "content": "[search results]"},
    # ... grows monotonically
]
```

### Key Integration Points

1. **`context.messages`** — The conversation list in `bot.py`. Direct read/write access.
2. **`UserStartedSpeakingFrame`** — Pipecat event fired when user starts talking. The LLM is idle during this time — perfect window for background compaction.
3. **`on_client_disconnected`** — WebRTC disconnect event. Triggers Tier 3 (persist).
4. **`_create_message_stream()`** in `text_llm.py` — Where Claude's API-side compaction already happens. The 9B path in `llamacpp_llm.py` has no equivalent.

### Claude Backend (Already Solved)

Claude's compaction happens inside the API call via `compact-2026-01-12`. This is opaque and automatic. No changes needed.

### 9B Backend (Needs Implementation)

The local 9B backend (`llamacpp_llm.py`) has no compaction. Context grows until the session ends. Implementation needed:

1. Token counting after each LLM response
2. Tier 1 trigger at 65% capacity
3. Tier 2 trigger at 80% capacity
4. Tier 3 on disconnect
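The tier selection in steps 2–3 reduces to a small pure function. A sketch under the thresholds above; the function name and integer return convention are illustrative:

```python
def pick_tier(tokens: int, ctx_size: int,
              tier1: float = 0.65, tier2: float = 0.80) -> int:
    """Return 0 (no action), 1 (clear tool results), or 2 (summarize).

    Tier 2 is reached when the context is still high after Tier 1 has run.
    """
    utilization = tokens / ctx_size
    if utilization >= tier2:
        return 2
    if utilization >= tier1:
        return 1
    return 0
```

The caller runs Tier 1, recounts tokens, and calls `pick_tier` again — if it still returns 2, Tier 2 summarization fires.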

### Self-Compaction (9B Summarizes Itself)

Rather than calling Claude or the 27B model, we use the same 9B model that's already loaded in VRAM. Zero additional GPU memory cost. The 9B is adequate for summarization — it doesn't need deep reasoning, just information extraction.

### Idle Window Detection

Compaction should run during natural conversation pauses:
- **During STT processing** — User is speaking, LLM is idle (1-3 seconds available)
- **During silence gaps** — VAD detects no speech (variable, could be seconds to minutes)
- **Pre-call safety net** — If context is at the 95% safety threshold when the LLM needs to generate, block and compact first

Two-tier triggering: idle compaction at 65-80% (proactive, hidden latency) + pre-call compaction at 95% (safety net, blocking). This guarantees context never overflows while usually hiding the latency behind user speech time.
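The two-tier triggering decision can be expressed as one pure function — a sketch; the name and string return values are illustrative:

```python
def compaction_action(tokens: int, ctx_size: int, llm_idle: bool) -> str:
    """Decide how to run compaction: proactively during an idle window,
    or as a blocking safety net just before generation."""
    utilization = tokens / ctx_size
    if utilization >= 0.95:
        return "blocking"      # safety net: compact before the LLM call
    if utilization >= 0.65 and llm_idle:
        return "background"    # hide latency behind user speech / STT time
    return "none"
```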

---

## 6. Generic Library Design (Model-Agnostic)

### Why Generic?

The compaction logic should work across any LLM backend — 9B, 27B, or future model swaps. The core algorithms (tool result clearing, LLM summarization, persistence) are identical; only the thresholds and token budgets change.

### Context Limits by Model

| Model | Native Limit | Currently Configured | Where Set |
|-------|-------------|---------------------|-----------|
| Qwen3.5-9B (distilled) | 262K (base arch), 16K (SFT) | `--ctx-size 32768` (ADR-026) | `start.sh:305` |
| Qwen3.5-9B (base) | 262K | 32K (same setting) | `start.sh:300` (comment) |
| Qwen3.5-27B | **128K** (native) | 2048 (Ollama default) | Not explicitly set |
| Qwen3.5-32B | 128K | 2048 (Ollama default) | Not explicitly set |

**Note:** The 27B currently does single-shot extraction (not multi-turn), so the Ollama 2048 default hasn't been a bottleneck. If the 27B were used for conversations, `num_ctx` would need explicit configuration.

### Configurable Compaction

```python
@dataclass(frozen=True)
class CompactionConfig:
    """Model-agnostic compaction configuration."""
    ctx_size: int              # e.g., 16384, 32768, 131072
    tier1_threshold: float     # 0.65 = start clearing tool results at 65%
    tier2_threshold: float     # 0.80 = start LLM summarization at 80%
    keep_recent_tools: int     # how many recent tool results to preserve
    keep_recent_turns: int     # how many recent turns to keep intact
    summary_max_tokens: int    # max tokens for the summary output

# Presets per model — swap when you swap models
PRESETS = {
    "qwen3.5-9b": CompactionConfig(
        ctx_size=32768, tier1_threshold=0.65, tier2_threshold=0.80,
        keep_recent_tools=3, keep_recent_turns=6, summary_max_tokens=600,
    ),
    "qwen3.5-27b": CompactionConfig(
        ctx_size=131072, tier1_threshold=0.70, tier2_threshold=0.85,
        keep_recent_tools=5, keep_recent_turns=8, summary_max_tokens=800,
    ),
}
```
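Converting a preset's relative thresholds into absolute trigger points is simple arithmetic (values shown for the `qwen3.5-9b` preset above):

```python
# qwen3.5-9b preset: ctx_size=32768, tier1_threshold=0.65, tier2_threshold=0.80
ctx_size = 32768
tier1_tokens = int(ctx_size * 0.65)  # clear tool results from here on
tier2_tokens = int(ctx_size * 0.80)  # start LLM summarization from here on
print(tier1_tokens, tier2_tokens)    # 21299 26214
```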

### Library Location

`services/annie-voice/compaction.py` — pure library, zero framework dependencies. Both `llamacpp_llm.py` and any future backend import the same module. The functions take a `CompactionConfig` and a `list[dict]` of messages — they don't know or care which model produced them.

---

## 7. Implementation Recommendations

### Phase 0: Context Inspector (Build First)

Build a live visualization endpoint so we can SEE the compaction working:

- **New endpoint:** `GET /v1/context/inspect` on Annie
- Reads `context.messages` live
- Categorizes each message: system, memory_briefing, notes, user_speech, assistant_response, tool_call, tool_result
- Returns JSON: `{total_tokens, components: [{name, tokens, pct, content_preview}], threshold, compaction_history}`
- HTML visualization (like `titan-validation.html`) — horizontal stacked bar chart
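The per-message categorization could be sketched as follows. The category names come from the bullet above; the prefix-based detection of briefing/notes messages is an assumption — in practice those would be tagged when injected:

```python
def categorize(msg: dict) -> str:
    """Map a context message to an inspector category."""
    role = msg.get("role")
    if role == "system":
        return "system"
    if role == "tool":
        return "tool_result"
    if role == "assistant":
        return "tool_call" if msg.get("tool_calls") else "assistant_response"
    if role == "user":
        # Assumption: briefing/notes are injected with a recognizable prefix
        content = msg.get("content") or ""
        if content.startswith("[Memory briefing]"):
            return "memory_briefing"
        if content.startswith("[Annie's notes]"):
            return "notes"
        return "user_speech"
    return "unknown"
```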

### Phase 1: Tier 1 — Tool Result Clearing

**Estimated effort:** ~50 lines of Python

```python
def clear_old_tool_results(messages: list[dict], keep_recent: int = 2) -> list[dict]:
    """Replace old tool results with a placeholder. Returns a new list (immutable)."""
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    # Set for O(1) membership; empty when nothing is old enough to clear
    clear_indices = set(tool_indices[:-keep_recent]) if len(tool_indices) > keep_recent else set()

    return [
        {**msg, "content": "[Tool result cleared]"} if i in clear_indices else msg
        for i, msg in enumerate(messages)
    ]
```

### Phase 2: Tier 2 — LLM Summarization

**Estimated effort:** ~80 lines of Python

```python
async def compact_messages(
    messages: list[dict],
    llm_client,
    keep_recent: int = 4,
) -> list[dict]:
    """Summarize old messages, keep recent ones intact. Returns a new list."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing old enough to summarize

    system_msg = messages[0]  # always keep the system prompt
    old_messages = messages[1:-keep_recent]
    recent_messages = messages[-keep_recent:]

    summary = await llm_client.summarize(old_messages, COMPACT_PROMPT)

    return [
        system_msg,
        {"role": "user", "content": f"[Previous conversation summary: {summary}]"},
        *recent_messages,
    ]
```

### Phase 3: Tier 3 — Post-Disconnect Persistence

Save compacted summary on disconnect for next session's `context_loader.py` to pick up.

### Design Principles

1. **Immutable `compact_messages()`** — Returns a new list instead of mutating the input, following the project's coding-style convention.
2. **Self-compaction** — Same 9B model summarizes itself. Zero additional VRAM.
3. **Event-driven triggers** — Hook into Pipecat's frame events, not polling loops.
4. **Graceful degradation** — If compaction fails (LLM error), continue with full context until it naturally overflows. Never crash.
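Principle 4 amounts to a small wrapper — a sketch, where `compact_fn` stands in for either tier's compaction coroutine:

```python
import logging

logger = logging.getLogger("compaction")

async def safe_compact(messages: list[dict], compact_fn) -> list[dict]:
    """Run a compaction step; on any failure, keep the full context."""
    try:
        return await compact_fn(messages)
    except Exception:
        logger.exception("Compaction failed; continuing with full context")
        return messages
```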

---

## 8. Sources

### Primary References

1. **Anthropic — "Effective Context Engineering for AI Agents"**
   https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
   — Claude Code's compaction strategy, `compact-2026-01-12` model, observation masking patterns

2. **Google ADK — Context Compaction Documentation**
   https://google.github.io/adk-docs/context/compaction/
   — `LlmEventSummarizer` class, model-agnostic summarization, sliding window approach

3. **LangChain — "Context Management for Deep Agents"**
   https://blog.langchain.com/context-management-for-deepagents/
   — Three-tier context management (offload results → offload inputs → summarize), autonomous compression

4. **badlogic's Context Compaction Research Gist** (compares Claude Code, Codex CLI, OpenCode, Amp)
   https://gist.github.com/badlogic/cd2ef65b0697c4dbe2d13fbecb0a0a5f
   — Side-by-side comparison of compaction strategies across AI coding tools

5. **Anthropic — Claude Agent SDK (Python)**
   https://github.com/anthropics/claude-agent-sdk-python
   — Reference implementation wrapping Claude API compaction

### Key Findings Summary

- **No standalone open-source compaction library exists** — every framework builds it in
- **Observation masking (JetBrains Junie) matched full LLM summarization** on SWE-bench — tool results ARE the bloat
- **Three-tier approach is the industry consensus** — clear tool results first, then summarize, then persist
- **Self-compaction works** — the same model that generated the conversation can summarize it effectively
- **Morph Compact** is the only purpose-built compaction model, but requires a separate model load (overkill for 16K)

---

*Research conducted 2026-03-13. Implementation plan to follow in a dedicated session.*
