Fixed Overhead

Loaded before every conversation: System Prompt (750 tok) + Tool Schemas × 17 (1,800 tok) + Memory Briefing (750 tok) + Annie's Notes (125 tok) ≈ 3,425 of 32,768 tokens.

| Component | Tokens | % of Context | Notes |
|---|---|---|---|
| System Prompt | ~750 | 2% | Personality, guidelines, conversation examples. Lines 111-161 in bot.py. |
| Tool Schemas | ~1,800 | 5% | 17 tools: web_search, fetch_webpage, search_memory, think, save_note, render_table, etc. |
| Memory Briefing | ~750 | 2% | Recent conversations, key entities, promises, emotional state. Max 3,000 chars from the Context Engine. |
| Annie's Notes | ~125 | <1% | Curated facts saved by Annie during conversations via memory_notes.py. |
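As a sanity check, the fixed overhead sums to roughly a tenth of the configured window. A minimal sketch using the estimates from the table above:

```python
# Back-of-envelope check of the fixed overhead budget.
# Token counts are the estimates from the table above.
CONTEXT_WINDOW = 32_768

FIXED_OVERHEAD = {
    "system_prompt": 750,
    "tool_schemas": 1_800,   # 17 tool schemas combined
    "memory_briefing": 750,
    "annies_notes": 125,
}

total = sum(FIXED_OVERHEAD.values())
print(total)                            # 3425
print(f"{total / CONTEXT_WINDOW:.0%}")  # 10%
```

So every session starts about 10% full before the user says a word.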

Example 10-Turn Session

Context growth over time: fixed overhead (3,425) plus 10 turns of user messages, responses, and tool results reaches ~11,525 tokens, about 35% of the 32,768-token window. Tool results (the bloat) dominate the growth.
| Turn | Content | Tokens | Cumulative | % Full | Status |
|---|---|---|---|---|---|
| 1 | "Hi Annie" + response | 250 | 3,675 | 22% | OK |
| 2 | "Search Netflix" + tool result | 1,900 | 5,575 | 34% | OK |
| 3 | "Tell me more" + response | 250 | 5,825 | 36% | OK |
| 4 | "Search Iraq news" + tool result | 1,800 | 7,625 | 47% | OK |
| 5 | "Interesting" + response | 200 | 7,825 | 48% | OK |
| 6 | "Search Iran news" + tool result | 2,000 | 9,825 | 60% | OK |
| 7 | "Compare them" + response | 300 | 10,125 | 62% | OK |
| 8 | "How am I feeling?" + emotional arc | 800 | 10,925 | 67% | TIER 1 |
| 9 | "Save that I like..." + save_note | 400 | 11,325 | 69% | TIER 1 |
| 10 | "What's the time?" + response | 200 | 11,525 | 70% | TIER 1 |
| 11 | Another web search | ~2,000 | ~13,525 | 83% | TIER 2 |
| 12 | Another web search | ~2,000 | ~15,525 | 95% | DANGER |
| 13 | Another web search | ~2,000 | >16,384 | >100% | OVERFLOW |

Note: "% Full" is measured against a 16,384-token working budget (the model's 16K SFT length), not the configured 32,768-token window, which is why the chart's "35%" and the table's "70%" describe the same 11,525 tokens.
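The Status column can be reproduced with a simple threshold check. A minimal sketch, assuming the 16,384-token working budget and inferring the cutoffs from the table (65% → TIER 1 per the Tier 1 trigger, 80% → TIER 2, and an assumed 90% for DANGER):

```python
# Map cumulative context size to a compaction status.
# BUDGET and the 65%/80% triggers come from this document; the 90%
# DANGER cutoff is inferred from the table, not stated elsewhere.
BUDGET = 16_384

def context_status(cumulative_tokens: int) -> str:
    pct = cumulative_tokens / BUDGET
    if pct > 1.00:
        return "OVERFLOW"
    if pct >= 0.90:
        return "DANGER"
    if pct >= 0.80:
        return "TIER 2"
    if pct >= 0.65:
        return "TIER 1"
    return "OK"

print(context_status(10_125))  # turn 7:  62% -> OK
print(context_status(10_925))  # turn 8:  67% -> TIER 1
print(context_status(13_525))  # turn 11: 83% -> TIER 2
print(context_status(15_525))  # turn 12: 95% -> DANGER
print(context_status(17_525))  # turn 13: >100% -> OVERFLOW
```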

The Context Killer

A single tool result (web_search, fetch_webpage) consumes 500-2,000 tokens — as much as 4-10 user messages. Three tool calls consume the same space as the entire system prompt + all 17 tool schemas combined. Tool results ARE the bloat.

Three-Tier Compaction Strategy

The three tiers reflect industry consensus (see the survey below).

| Tier | Trigger | Cost | Strategy |
|---|---|---|---|
| 1: Clear Tool Results | 65% capacity | Zero LLM cost | Replace old tool results with "[Tool result cleared]", keeping the tool call JSON. Frees 2-6K tokens instantly. Inspired by JetBrains Junie. |
| 2: LLM Summarization | 80% capacity | 1-3s latency | Summarize old conversation turns using the same 9B model. Zero additional VRAM. Runs during idle time (while the user is speaking). |
| 3: Persist on Disconnect | Session end | Runs on WebRTC close | Compact the full conversation and save it for the next session's context_loader.py. Ensures continuity across sessions. |
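Tier 1 can be sketched as a single pass over the message history that blanks old tool results while leaving the assistant's tool-call JSON untouched. This sketch assumes OpenAI-style messages (tool results carry `role == "tool"`); the `keep_last` window and the 4-chars-per-token estimate are illustrative choices, not project settings.

```python
# Tier 1 sketch: replace old tool-result contents with a placeholder,
# keeping the assistant's tool-call JSON intact.
# Assumptions: role == "tool" marks tool results; keep_last is illustrative.
PLACEHOLDER = "[Tool result cleared]"

def clear_old_tool_results(messages: list[dict], keep_last: int = 2) -> int:
    """Blank every tool result except the keep_last most recent.

    Returns tokens freed, estimated at 4 chars per token.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_clear = tool_indices[:-keep_last] if keep_last else tool_indices
    freed_chars = 0
    for i in to_clear:
        freed_chars += len(messages[i]["content"]) - len(PLACEHOLDER)
        messages[i]["content"] = PLACEHOLDER
    return freed_chars // 4  # rough chars -> tokens estimate

history = [
    {"role": "assistant", "content": '{"name": "web_search", "args": {"q": "Netflix"}}'},
    {"role": "tool", "content": "x" * 7600},  # ~1,900-token result
    {"role": "assistant", "content": '{"name": "web_search", "args": {"q": "Iraq news"}}'},
    {"role": "tool", "content": "y" * 7200},  # ~1,800-token result
    {"role": "assistant", "content": '{"name": "web_search", "args": {"q": "Iran news"}}'},
    {"role": "tool", "content": "z" * 8000},  # ~2,000-token result
]

freed = clear_old_tool_results(history, keep_last=2)
print(freed)  # 1894: the oldest result collapses to the placeholder
```

One cleared search result returns almost 2K tokens, which matches the 2-6K instant savings claimed for this tier.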

Industry Survey

Seven frameworks compared:

| Framework | Approach | Verdict |
|---|---|---|
| Claude API (compact-2026-01-12) | Server-side summarization. Opaque, automatic. Already active for the Claude backend. | Already using |
| Google ADK (LlmEventSummarizer) | Sliding window + LLM summary. Model-agnostic. Best pattern for our Tier 2. | Adopt pattern |
| LangChain Deep Agents | 3-tier: offload tool results, offload tool inputs, then summarize. Best architecture. | Adopt architecture |
| JetBrains Junie | Observation masking only. Matched full LLM summarization on SWE-bench. Zero cost. | Adopt for Tier 1 |
| OpenCode | Tool pruning + full compaction. Similar to LangChain. Configurable LLM. | Reference |
| Morph Compact | Verbatim deletion model. 50-70% reduction, zero hallucination. Requires a separate model. | Overkill |
| Claude Agent SDK | Wraps Claude API compaction. Not usable with non-Claude models. | N/A |

Model Context Limits

Context length is configurable per model.

| Model | Native Limit | Configured | Where Set | Notes |
|---|---|---|---|---|
| Qwen3.5-9B (distilled) | 262K | 32,768 | start.sh:305 | SFT at 16K, bumped to 32K (ADR-026) |
| Qwen3.5-9B (base) | 262K | 32,768 | start.sh:300 (rollback) | Full native support |
| Qwen3.5-27B | 128K | 2,048 | Ollama default | Not explicitly set |
| Qwen3.5-32B | 128K | 2,048 | Ollama default | Not explicitly set |
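The 2,048-token rows are simply Ollama's default `num_ctx`. Ollama accepts a per-request override via the `options` field of its HTTP API; a minimal sketch (the model tag and prompt are illustrative, not the project's actual values):

```python
import json

# Per-request context override for Ollama's /api/generate endpoint.
# num_ctx is Ollama's context-length option; the model tag is hypothetical.
payload = {
    "model": "qwen3.5:27b",          # illustrative Ollama tag
    "prompt": "Hi Annie",
    "options": {"num_ctx": 32_768},  # lift the 2,048-token default
}

body = json.dumps(payload)
# POST body to http://localhost:11434/api/generate to use the larger window.
print(payload["options"]["num_ctx"])  # 32768
```

Without an override like this (or a `PARAMETER num_ctx` line in the Modelfile), the 27B and 32B models silently truncate at 2,048 tokens regardless of their 128K native limit.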