Fixed Overhead

Loaded before every conversation: System Prompt (750 tok) + Tool Schemas × 17 (1,800 tok) + Memory Briefing (750 tok) + Annie's Notes (125 tok) ≈ 3,425 of 32,768 tokens.

| Component | Tokens | % of Context | Notes |
|---|---|---|---|
| System Prompt | ~750 | 2% | Personality, guidelines, conversation examples. Lines 111-161 in bot.py. |
| Tool Schemas | ~1,800 | 5% | 17 tools: web_search, fetch_webpage, search_memory, think, save_note, render_table, etc. |
| Memory Briefing | ~750 | 2% | Recent conversations, key entities, promises, emotional state. Max 3,000 chars from the Context Engine. |
| Annie's Notes | ~125 | <1% | Curated facts saved by Annie during conversations via memory_notes.py. |
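As a sanity check, the fixed overhead sums to roughly a tenth of the configured window. A minimal sketch using the estimates from the table above:

```python
# Back-of-envelope check of the fixed overhead budget.
# Token counts are the estimates from the table above.
CONTEXT_WINDOW = 32_768

FIXED_OVERHEAD = {
    "system_prompt": 750,
    "tool_schemas": 1_800,   # 17 tool schemas combined
    "memory_briefing": 750,
    "annies_notes": 125,
}

total = sum(FIXED_OVERHEAD.values())
print(total)                            # 3425
print(f"{total / CONTEXT_WINDOW:.0%}")  # 10%
```

So every session starts about 10% full before the user says a word.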

Example 10-Turn Session

Context growth over time: fixed overhead (3,425) plus 10 turns of user messages, responses, and tool results reaches ~11,525 tokens, about 35% of the 32,768-token window. Tool results (the bloat) dominate the growth.
| Turn | Content | Tokens | Cumulative | % Full | Status |
|---|---|---|---|---|---|
| 1 | "Hi Annie" + response | 250 | 3,675 | 22% | OK |
| 2 | "Search Netflix" + tool result | 1,900 | 5,575 | 34% | OK |
| 3 | "Tell me more" + response | 250 | 5,825 | 36% | OK |
| 4 | "Search Iraq news" + tool result | 1,800 | 7,625 | 47% | OK |
| 5 | "Interesting" + response | 200 | 7,825 | 48% | OK |
| 6 | "Search Iran news" + tool result | 2,000 | 9,825 | 60% | OK |
| 7 | "Compare them" + response | 300 | 10,125 | 62% | OK |
| 8 | "How am I feeling?" + emotional arc | 800 | 10,925 | 67% | TIER 1 |
| 9 | "Save that I like..." + save_note | 400 | 11,325 | 69% | TIER 1 |
| 10 | "What's the time?" + response | 200 | 11,525 | 70% | TIER 1 |
| 11 | Another web search | ~2,000 | ~13,525 | 83% | TIER 2 |
| 12 | Another web search | ~2,000 | ~15,525 | 95% | DANGER |
| 13 | Another web search | ~2,000 | >16,384 | >100% | OVERFLOW |

Note: "% Full" is measured against a 16,384-token working budget (the model's 16K SFT length), not the configured 32,768-token window, which is why the chart's "35%" and the table's "70%" describe the same 11,525 tokens.
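The Status column can be reproduced with a simple threshold check. A minimal sketch, assuming the 16,384-token working budget and inferring the cutoffs from the table (65% → TIER 1 per the Tier 1 trigger, 80% → TIER 2, and an assumed 90% for DANGER):

```python
# Map cumulative context size to a compaction status.
# BUDGET and the 65%/80% triggers come from this document; the 90%
# DANGER cutoff is inferred from the table, not stated elsewhere.
BUDGET = 16_384

def context_status(cumulative_tokens: int) -> str:
    pct = cumulative_tokens / BUDGET
    if pct > 1.00:
        return "OVERFLOW"
    if pct >= 0.90:
        return "DANGER"
    if pct >= 0.80:
        return "TIER 2"
    if pct >= 0.65:
        return "TIER 1"
    return "OK"

print(context_status(10_125))  # turn 7:  62% -> OK
print(context_status(10_925))  # turn 8:  67% -> TIER 1
print(context_status(13_525))  # turn 11: 83% -> TIER 2
print(context_status(15_525))  # turn 12: 95% -> DANGER
print(context_status(17_525))  # turn 13: >100% -> OVERFLOW
```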

The Context Killer

A single tool result (web_search, fetch_webpage) consumes 500-2,000 tokens — as much as 4-10 user messages. Three tool calls consume the same space as the entire system prompt + all 17 tool schemas combined. Tool results ARE the bloat.

Three-Tier Compaction Strategy

The three tiers reflect industry consensus (see the survey below).

| Tier | Trigger | Cost | Strategy |
|---|---|---|---|
| 1: Clear Tool Results | 65% capacity | Zero LLM cost | Replace old tool results with "[Tool result cleared]", keeping the tool call JSON. Frees 2-6K tokens instantly. Inspired by JetBrains Junie. |
| 2: LLM Summarization | 80% capacity | 1-3s latency | Summarize old conversation turns using the same 9B model. Zero additional VRAM. Runs during idle time (while the user is speaking). |
| 3: Persist on Disconnect | Session end | Runs on WebRTC close | Compact the full conversation and save it for the next session's context_loader.py. Ensures continuity across sessions. |
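Tier 1 can be sketched as a single pass over the message history that blanks old tool results while leaving the assistant's tool-call JSON untouched. This sketch assumes OpenAI-style messages (tool results carry `role == "tool"`); the `keep_last` window and the 4-chars-per-token estimate are illustrative choices, not project settings.

```python
# Tier 1 sketch: replace old tool-result contents with a placeholder,
# keeping the assistant's tool-call JSON intact.
# Assumptions: role == "tool" marks tool results; keep_last is illustrative.
PLACEHOLDER = "[Tool result cleared]"

def clear_old_tool_results(messages: list[dict], keep_last: int = 2) -> int:
    """Blank every tool result except the keep_last most recent.

    Returns tokens freed, estimated at 4 chars per token.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_clear = tool_indices[:-keep_last] if keep_last else tool_indices
    freed_chars = 0
    for i in to_clear:
        freed_chars += len(messages[i]["content"]) - len(PLACEHOLDER)
        messages[i]["content"] = PLACEHOLDER
    return freed_chars // 4  # rough chars -> tokens estimate

history = [
    {"role": "assistant", "content": '{"name": "web_search", "args": {"q": "Netflix"}}'},
    {"role": "tool", "content": "x" * 7600},  # ~1,900-token result
    {"role": "assistant", "content": '{"name": "web_search", "args": {"q": "Iraq news"}}'},
    {"role": "tool", "content": "y" * 7200},  # ~1,800-token result
    {"role": "assistant", "content": '{"name": "web_search", "args": {"q": "Iran news"}}'},
    {"role": "tool", "content": "z" * 8000},  # ~2,000-token result
]

freed = clear_old_tool_results(history, keep_last=2)
print(freed)  # 1894: the oldest result collapses to the placeholder
```

One cleared search result returns almost 2K tokens, which matches the 2-6K instant savings claimed for this tier.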

Industry Survey

Seven frameworks compared:

| Framework | Approach | Verdict |
|---|---|---|
| Claude API (compact-2026-01-12) | Server-side summarization. Opaque, automatic. Already active for the Claude backend. | Already using |
| Google ADK (LlmEventSummarizer) | Sliding window + LLM summary. Model-agnostic. Best pattern for our Tier 2. | Adopt pattern |
| LangChain Deep Agents | 3-tier: offload tool results, offload tool inputs, then summarize. Best architecture. | Adopt architecture |
| JetBrains Junie | Observation masking only. Matched full LLM summarization on SWE-bench. Zero cost. | Adopt for Tier 1 |
| OpenCode | Tool pruning + full compaction. Similar to LangChain. Configurable LLM. | Reference |
| Morph Compact | Verbatim deletion model. 50-70% reduction, zero hallucination. Requires a separate model. | Overkill |
| Claude Agent SDK | Wraps Claude API compaction. Not usable with non-Claude models. | N/A |

Model Context Limits

Context length is configurable per model.

| Model | Native Limit | Configured | Where Set | Notes |
|---|---|---|---|---|
| Qwen3.5-9B (distilled) | 262K | 32,768 | start.sh:305 | SFT at 16K, bumped to 32K (ADR-026) |
| Qwen3.5-9B (base) | 262K | 32,768 | start.sh:300 (rollback) | Full native support |
| Qwen3.5-27B | 128K | 2,048 | Ollama default | Not explicitly set |
| Qwen3.5-32B | 128K | 2,048 | Ollama default | Not explicitly set |
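The 2,048-token rows are simply Ollama's default `num_ctx`. Ollama accepts a per-request override via the `options` field of its HTTP API; a minimal sketch (the model tag and prompt are illustrative, not the project's actual values):

```python
import json

# Per-request context override for Ollama's /api/generate endpoint.
# num_ctx is Ollama's context-length option; the model tag is hypothetical.
payload = {
    "model": "qwen3.5:27b",          # illustrative Ollama tag
    "prompt": "Hi Annie",
    "options": {"num_ctx": 32_768},  # lift the 2,048-token default
}

body = json.dumps(payload)
# POST body to http://localhost:11434/api/generate to use the larger window.
print(payload["options"]["num_ctx"])  # 32768
```

Without an override like this (or a `PARAMETER num_ctx` line in the Modelfile), the 27B and 32B models silently truncate at 2,048 tokens regardless of their 128K native limit.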