# Research: WhatsApp Silent Observer Agent for Annie

**Date:** 2026-04-06
**Status:** RESEARCHED — revised after Rajesh's review (3 corrections applied)

---

## Problem

Annie has been added to a WhatsApp family group ("Holliday"). She should silently listen to all messages, build her own understanding, and only respond when Rajesh directly addresses her. WhatsApp can't overwhelm Annie's main context — she needs her own independent agent with its own context window, like how Claude Code works.

## Key Constraint

> "WhatsApp can't overwhelm her. She needs a WhatsApp agent that has its own context. Similar to how you [Claude Code] behave and work." — Rajesh

---

## 1. How the Best Do It

### Claude in Slack
- Mention-only activation: `@Claude` triggers response
- Full channel context awareness but only speaks when addressed
- Thread-aware: responds in the same thread, keeps main channel clean

### ChatGPT in Slack
- **Mention-only model**: Does NOT monitor channels or watch for unanswered requests
- Within a thread, subsequent messages don't need re-mention (context preserved)
- Private DMs work without mention requirement
- Event-driven: `app_mention` trigger, not polling

### Zoom AI Companion (Silent Observer)
- "Patiently and silently listens to a conversation and keeps a written record"
- Does NOT interrupt, ask clarifying questions, or make suggestions in real-time
- Only acts when summoned or when the event ends
- Produces summaries after the fact

### Design Principle
**Silence is not absence — it's intelligent presence.** The best systems are deeply aware of context while respecting group dynamics by only speaking when directly addressed.

---

## 2. The Claude Code Model (What Annie's WhatsApp Agent Should Be)

Claude Code has:
- **Own context window** that fills up over time
- **Automatic compaction** when approaching limits
- **Persistent memory files** (MEMORY.md) for cross-session facts
- **Independent conversations** that can reference memory
- **Tools** that extend capabilities without filling context

Annie's WhatsApp agent should mirror this:
- Own process, own context window (128K on Gemma 4 — generous budget for group history)
- Own compaction (group-chat-specific tiers, but less aggressive with 128K headroom)
- **No separate memory file** — uses Annie's existing Context Engine for persistent facts (JSONL ingestion + entity extraction + hybrid retrieval). CE is already channel-agnostic.
- Bridges to Context Engine via JSONL ingestion (same format as audio pipeline) and HTTP API

---

## 3. Message Ingestion: How to Read WhatsApp Without an SDK

### Available Methods (No Custom App Required)

| Method | Message Text | Sender | Group | Real-Time | Ban Risk |
|--------|:---:|:---:|:---:|:---:|:---:|
| `dumpsys notification` | ⚠️ Possibly redacted (unverified) | ✅ | ✅ | ⚠️ Poll | ✅ None |
| `adb logcat` | ❌ | ⚠️ Partial | ⚠️ | ✅ Stream | ✅ None |
| **UI scraping via u2** | ✅ Full text | ✅ | ✅ | ⚠️ On-demand | ✅ None |
| Tasker + AutoRemote | ✅ | ✅ | ✅ | ✅ Real-time | ✅ None |

### Recommended: Tasker as Primary, UI Scraping as Fallback

**Primary — Tasker + AutoRemote (event-driven, real-time)**
- Tasker app on Pixel watches WhatsApp notifications
- On new message → AutoRemote sends event to Panda (via ADB broadcast or HTTP webhook)
- Event payload: sender, group name, message text, timestamp, media type
- **Why primary**: eliminates UI scraping (most fragile layer), eliminates phone mutex for reading, event-driven (instant vs polling), notification API is stable across WhatsApp updates
- One-time setup: install Tasker, configure notification intercept profile, set up AutoRemote webhook
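The receiver side on Panda can be a tiny HTTP endpoint. A minimal sketch, assuming Tasker is configured to POST a JSON body with `sender`/`group`/`text`/`ts`/`media` fields (the field names, port, and path are assumptions about the Tasker profile, not defaults):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_tasker_event(raw: bytes) -> dict:
    """Normalize a Tasker/AutoRemote notification payload.

    The field names below are assumptions; they depend entirely on how
    the Tasker profile is configured to build the JSON body.
    """
    event = json.loads(raw)
    return {
        "sender": event.get("sender", "unknown"),
        "group": event.get("group", ""),
        "text": event.get("text", ""),
        "ts": event.get("ts"),
        "media": event.get("media", "none"),
    }

class TaskerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = parse_tasker_event(self.rfile.read(length))
        print("whatsapp event:", event)  # hand off to the agent queue here
        self.send_response(204)
        self.end_headers()

# To run on Panda (blocking):
#   HTTPServer(("127.0.0.1", 8765), TaskerHandler).serve_forever()
```

Binding to `127.0.0.1` assumes the event arrives via `adb reverse` or an ADB broadcast relay; a LAN webhook would bind the LAN interface instead.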

**Fallback — Notification Polling + UI Scraping**
- `adb shell dumpsys notification` detects new WhatsApp messages by badge count changes (10s interval)
- When new messages detected, acquire phone mutex, open Holliday chat via u2, scrape messages by resource IDs:
  - `com.whatsapp:id/message_text` — message body
  - `com.whatsapp:id/conversation_contact_name` — sender name
  - `com.whatsapp:id/date` — timestamp
- Only used if Tasker is unavailable or for verifying message history
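The polling half of the fallback can detect "something new arrived" without reading message bodies at all. A sketch, assuming `NotificationRecord ... pkg=com.whatsapp` lines appear in the dumpsys output (the exact record format varies by Android version; see Open Questions):

```python
import re

def count_whatsapp_notifications(dumpsys_output: str) -> int:
    """Count active WhatsApp notification records in the output of
    `adb shell dumpsys notification`.

    The 'NotificationRecord ... pkg=com.whatsapp' line format is an
    assumption to verify on the actual Pixel (see Open Questions #1).
    """
    return len(re.findall(r"NotificationRecord.*pkg=com\.whatsapp", dumpsys_output))

def has_new_messages(prev_count: int, dumpsys_output: str) -> tuple[bool, int]:
    """Compare against the previous poll; a rising count means new messages
    and triggers the mutex-acquire + u2 scrape path."""
    count = count_whatsapp_notifications(dumpsys_output)
    return count > prev_count, count
```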

**Why Tasker over polling+scraping?**
- UI scraping requires opening WhatsApp (battery drain, screen conflicts with phone_loop.py, fragile to WhatsApp UI changes)
- Tasker reads notification content directly — no screen interaction needed for reading
- Phone mutex only needed for SENDING replies, not for reading messages
- Tasker is a well-established app (~$4) — not an exotic dependency

---

## 4. Context Management: The Core Innovation

### Why Group Chats Are Different from Voice/Telegram

| Dimension | Voice/Telegram | WhatsApp Group |
|-----------|---------------|----------------|
| Messages/day | 10-30 | 50-200+ |
| Noise ratio | ~10% | ~60-70% |
| Media | Rare | Constant (photos, videos, voice notes) |
| Participants | 1 (Rajesh) | 10-20+ family members |
| Annie's role | Active participant | Silent observer |
| Response frequency | Every message | Maybe 1-5/day |

### Context Budget: 128K Tokens (Gemma 4 26B)

Gemma 4 supports 128K context (`--max-model-len 131072` in vLLM). This gives the WhatsApp agent generous headroom — no need to aggressively compress everything.

| Slot | Budget | Contents |
|------|--------|----------|
| System prompt | 2,000 | Role, rules, group member list, family context |
| Context Engine briefing | 4,000 | Key entities, active promises, family facts (loaded from CE at session start) |
| Rolling 7-day digests | 5,000 | Daily summaries for last 7 days (~700 tokens each) |
| Recent raw messages | 20,000 | Last ~200-300 messages verbatim (full day+ of group activity) |
| Hourly digests (today) | 3,000 | Detailed hourly breakdowns for today |
| **Response headroom** | **~97,000** | For LLM reasoning, tool use, and response generation |

**Key advantage of 128K**: The agent can keep an entire day of raw messages PLUS a full week of digests, with ~97K tokens of headroom (131,072 minus the 34K of fixed slots). This means:
- No "Lost in the Middle" risk — recent messages are a small fraction of the window
- Rajesh can ask detailed questions about today's conversation without any information loss
- Week-old context is still available via daily digests
- Entity extraction and fact storage flows into Context Engine for permanent memory
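The slot budgets above can be pinned down as constants with a headroom check (a sketch; 131,072 is the `--max-model-len` value quoted earlier):

```python
CONTEXT_WINDOW = 131_072  # matches --max-model-len on the vLLM side

# Fixed slot budgets from the table above (tokens)
BUDGETS = {
    "system_prompt": 2_000,
    "ce_briefing": 4_000,
    "daily_digests": 5_000,
    "raw_messages": 20_000,
    "hourly_digests": 3_000,
}

def response_headroom() -> int:
    """Tokens left for LLM reasoning, tool use, and the response itself
    after all fixed slots are filled."""
    return CONTEXT_WINDOW - sum(BUDGETS.values())
```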

### 3-Tier Compaction (Tuned for 128K)

With 128K, compaction is less aggressive but still needed for multi-day history:

**Tier 0 — Noise Filter (on ingest, zero LLM cost)**
- Discard: greetings ("Good morning 🌞"), emoji-only, stickers, reactions, forwarded joke chains
- Collapse: "5 people reacted to Mom's photo" instead of 5 separate notifications
- Media → metadata only: "Mom sent a photo" (no image processing)
- **Result: 60-70% of messages filtered out immediately**
- **Remaining messages kept verbatim in the 20K raw message buffer**
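Tier 0 can be a pure function with no LLM in the loop. A sketch, assuming the message dict carries `text`/`media`/`forwarded` fields from the ingestion payload (the field names and the greeting list are assumptions):

```python
import re

GREETINGS = {"good morning", "good night", "gm", "gn", "hello", "hi"}
NO_WORDS_RE = re.compile(r"^[\W\s]+$")  # emoji/punctuation only, no word chars

def is_noise(msg: dict) -> bool:
    """Tier 0 filter: drop greetings, emoji-only messages, stickers,
    reactions, and forwarded chains before they reach the context window."""
    if msg.get("media") in {"sticker", "reaction"}:
        return True
    if msg.get("forwarded"):
        return True
    text = msg.get("text", "").strip()
    if not text or NO_WORDS_RE.match(text):
        return True
    # strip emoji/punctuation before the greeting check
    bare = re.sub(r"[^\w\s]", "", text).strip().lower()
    return bare in GREETINGS
```

The reaction-collapsing step ("5 people reacted to Mom's photo") would sit one layer above this, aggregating filtered reactions per target message.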

**Tier 1 — Hourly Digest (every hour)**
- LLM summarizes the hour's messages into ~300 token digest
- Format: "10:00-11:00: Mom shared vacation photos (Goa beach, 4 photos). Dad asked about hotel checkout — Mom confirmed 11am. Sister's flight confirmed 3pm from BLR. Rajesh suggested dinner at 8pm, no objections."
- **Today's hourly digests kept in context (3K budget)**
- **Yesterday's hourly digests available on-demand from local storage**

**Tier 2 — Daily Digest (4 AM IST)**
- Compress all hourly digests into ~700 token daily summary
- Extract: key decisions, events, action items, emotional arc, participants
- **Ingested into Context Engine** as JSONL segments (`device_id="annie-whatsapp"`) — entity extraction, hybrid retrieval, memory tiers all apply automatically
- **Last 7 daily digests kept in context (5K budget)**

**Raw message rotation**: Messages older than 24h rotate out of the 20K raw buffer. Their content lives on in hourly/daily digests and in Context Engine entities.

### The "Silent for 3 Days" Problem

With 128K context, this is largely solved:
- 7 daily digests in context = 5K tokens (always available)
- Today's raw messages = up to 20K tokens (full detail)
- For deeper history, query Context Engine (`/v1/context?query=...&hours_back=720`)
- Context Engine's entity extraction means key facts (decisions, events, people) are permanently searchable
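The deep-history lookup is just a GET against Context Engine. A sketch of building that request (the base URL is a placeholder; the path and parameters come from the endpoint quoted above):

```python
from urllib.parse import urlencode

CE_BASE = "http://localhost:8000"  # placeholder — use the real CE endpoint

def history_query_url(query: str, hours_back: int = 720) -> str:
    """Build the /v1/context request used for history beyond the in-context
    digests (720h = 30 days)."""
    return f"{CE_BASE}/v1/context?" + urlencode(
        {"query": query, "hours_back": hours_back}
    )
```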

### Long-Term Memory: Context Engine (NOT a separate file)

Annie already has a full memory infrastructure:
- **Entity extraction**: person, topic, promise, event, decision, relationship — with confidence thresholds
- **Memory tiers**: L0 (<7d raw) → L1 (7-90d consolidated) → L2 (>90d patterns)
- **Hybrid retrieval**: BM25 + vector + graph with RRF fusion
- **Temporal decay**: 30-day half-life, evergreen entities (person/place) never decay below 0.3
- **Sensitivity filtering**: channel-aware (WhatsApp = "open" sensitivity level)

The WhatsApp agent writes JSONL segments to Context Engine using the same format as the audio pipeline. No `memory.json` needed — Context Engine IS the memory. Group member facts, family events, recurring patterns — all extracted and stored automatically.
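Writing a digest into CE can be sketched as below. `device_id` comes from this document, but the other segment fields only loosely mirror the audio-pipeline format and should be treated as assumptions:

```python
import json
import time
from pathlib import Path

def daily_digest_segment(digest_text: str, day: str) -> dict:
    """Shape a daily digest as a CE-ingestible segment.

    device_id is specified by the design; the remaining field names are
    assumptions standing in for the real audio-pipeline JSONL schema.
    """
    return {
        "device_id": "annie-whatsapp",
        "text": digest_text,
        "day": day,
        "ingested_at": time.time(),
    }

def append_jsonl(segment: dict, path: Path) -> None:
    """One JSON object per line, append-only — the same contract as the
    audio pipeline's JSONL drop that CE already ingests."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(segment, ensure_ascii=False) + "\n")
```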

---

## 5. Trigger Detection: When Should Annie Speak?

### Two-Phase Gate

**Phase 1 — Fast Regex (every message batch, zero cost)**
```python
TRIGGER_RE = re.compile(r'\bannie\b', re.IGNORECASE)
```
- If the regex does not fire, or the sender is NOT Rajesh → log observation, **stay silent**
- If the regex fires AND the sender IS Rajesh → proceed to Phase 2

**Phase 2 — LLM Classification (only when regex fires)**
- Send last 10 messages + trigger message to Gemma 4 on Titan
- Classify: `direct_request` | `mention_only` | `false_positive`
  - "Annie, what's the plan for Saturday?" → `direct_request` → respond
  - "I told Annie about it" → `mention_only` → stay silent
  - "Annie's grandma called" → `false_positive` → stay silent
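The two-phase gate can be sketched with the Phase-2 classifier injected as a callable, so the LLM call can be mocked in tests:

```python
import re

TRIGGER_RE = re.compile(r'\bannie\b', re.IGNORECASE)

def phase1_gate(sender: str, text: str) -> bool:
    """Fast pre-filter: only Rajesh's messages that mention Annie go on
    to the LLM classifier. Everything else is observed silently."""
    return sender == "Rajesh" and bool(TRIGGER_RE.search(text))

def should_respond(sender: str, text: str, classify) -> bool:
    """`classify` is the Phase-2 LLM call (here injected so it can be
    stubbed); it returns 'direct_request', 'mention_only', or
    'false_positive'. Only 'direct_request' produces a reply."""
    if not phase1_gate(sender, text):
        return False
    return classify(text) == "direct_request"
```

Note that "Annie's grandma called" still passes Phase 1 (`\bannie\b` matches before the apostrophe), which is exactly why Phase 2 exists.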

### Cadence Limits
- Max 1 response per 5 minutes
- Max 5 responses per day (configurable)
- Group chat etiquette: keep responses short

### Phase 2 (Future): Other Family Members
Eventually, allow other family members to address Annie too. But start with Rajesh-only to avoid the "annoying bot" problem.

---

## 6. Response Delivery

Via u2 — same mechanism already proven in `setup_whatsapp.py`:
1. Acquire phone mutex
2. Navigate to Holliday chat
3. Type in `com.whatsapp:id/entry`
4. Tap `com.whatsapp:id/send`
5. Release mutex

### Thread-Awareness (Future)
WhatsApp is rolling out threaded replies (2025-2026 beta). When available, Annie should respond in the same thread Rajesh used, keeping the main chat clean.

---

## 7. Privacy Principles

| Principle | Implementation |
|-----------|---------------|
| **Transparency** | Announce Annie to the group before enabling |
| **Rajesh-only** | Only Rajesh can trigger responses (Phase 1) |
| **No profiling** | Don't build profiles of other family members |
| **Bounded retention** | Raw messages auto-deleted after 24h; local digests kept at most 4 weeks |
| **Sensitivity** | Other members' personal/financial/medical details → anonymize in digests |
| **Information partition** | Annie's private chats with Rajesh never leak into group responses |
| **Honest about limits** | "I saw Mom sent a voice note but I can't listen to audio" |

---

## 8. Anti-Patterns to Avoid

### From Industry Research

| Anti-Pattern | What Goes Wrong | Prevention |
|---|---|---|
| **Responding when not addressed** | Annoys everyone, notification fatigue | Two-phase trigger gate |
| **Context rot** | LLMs perform worse with more irrelevant context | Aggressive 3-tier compaction |
| **Lost in the Middle** | Models favor start/end of context, lose mid-context | 128K gives headroom; recent messages always at end of context |
| **Stale data confidence** | AI confidently uses outdated info ("I thought you were still in Mumbai") | Timestamp all facts, hedge uncertain recall |
| **Pretending to understand media** | "That's a great photo!" when you can't see it | Acknowledge receipt, be honest about limits |
| **Information oversharing** | Leaking one person's private info to the group | Strict information partition |
| **Over-responding** | Bot replies 5 times in a thread | Cadence limits (1/5min, 5/day) |
| **Notification fatigue** | Family mutes the group because of Annie | Zero proactive messages |

### Group Chat Specific

- **Don't dump entire chat history into LLM** — selective context injection wins
- **Don't treat group chat like 1:1** — different dynamics, multiple conversations interleaved
- **Don't assume thread context** — WhatsApp's lack of threads means messages are often out of order
- **Don't respond to forwarded content** — forwarded jokes/news aren't addressed to Annie

---

## 9. Bridge to Annie's Main Systems

The WhatsApp agent is independent but can bridge to Annie:

- **Daily digest → Context Engine**: POST compressed daily summary as a searchable segment
- **Voice/Telegram queries**: "What's happening in the family group?" retrieves WhatsApp digest via existing hybrid retrieval
- **Entity extraction**: Key family facts (events, decisions) flow into Context Engine's entity store

This means Rajesh can ask Annie about the family group through ANY channel (voice, Telegram, dashboard) — not just WhatsApp.

---

## 10. Phone Mutex (Shared Resource)

The Pixel screen is shared between phone_loop.py (calls) and whatsapp_agent.py:

```
~/.her-os/annie/phone-active       → phone_loop owns the screen
~/.her-os/annie/whatsapp-scraping   → whatsapp_agent owns the screen
```

- File-based mutex with PID + timestamp
- Stale lock detection (>60s = stale, force release)
- **Phone calls always take priority** — WhatsApp agent yields immediately
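A minimal sketch of the lock protocol (not atomic across processes, which is tolerable for two cooperating daemons on one machine; a hardened version might use `os.open` with `O_CREAT | O_EXCL`):

```python
import os
import time
from pathlib import Path

STALE_S = 60  # a lock older than 60s is considered abandoned

def acquire(lock: Path) -> bool:
    """File-based screen mutex: the file holds 'PID timestamp'.
    Returns False if another process holds a fresh lock."""
    if lock.exists():
        try:
            _pid, ts_s = lock.read_text().split()
            if time.time() - float(ts_s) <= STALE_S:
                return False          # fresh lock held by someone else
        except ValueError:
            pass                      # corrupt lock file → treat as stale
        lock.unlink()                 # stale lock → force release
    lock.write_text(f"{os.getpid()} {time.time()}")
    return True

def release(lock: Path) -> None:
    lock.unlink(missing_ok=True)
```

The "calls take priority" rule lives above this layer: the WhatsApp agent checks for `phone-active` before acquiring `whatsapp-scraping`, and releases immediately if a call starts.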

---

## 11. Architecture Diagram

```
┌──────────────────────────────────────────────────────────┐
│ Panda (Desktop PC, USB to Pixel)                          │
│                                                           │
│  whatsapp_agent.py (standalone daemon)                    │
│  ┌──────────────┐   ┌───────────────────────────────┐    │
│  │Tasker Events  │──▶│Context Manager               │    │
│  │(real-time)    │   │(own 128K window on Gemma 4)   │    │
│  └──────────────┘   │ ├─ Noise filter (Tier 0)      │    │
│         │            │ ├─ Raw message buffer (20K)    │    │
│  ┌──────▼───────┐   │ ├─ Hourly digests (Tier 1)    │    │
│  │Trigger       │   │ └─ Daily digest → CE (Tier 2)  │    │
│  │Detector      │   └───────────────────────────────┘    │
│  │(regex → LLM) │                                        │
│  └──────┬───────┘                                        │
│         ▼                                                │
│  ┌───────────────┐                                       │
│  │Response via u2│  ◄── mutex (send only) ──►            │
│  │(type + send)  │      phone_loop.py                    │
│  └───────────────┘                                       │
└──────────────────────┬───────────────────────────────────┘
                       │
         ┌─────────────┼──────────────┐
         ▼             ▼              ▼
    ┌─────────┐  ┌──────────────┐  ┌──────────────────┐
    │Pixel 9a │  │  Titan       │  │ Context Engine   │
    │WhatsApp │  │  Gemma 4 128K│  │ (JSONL ingest,   │
    │+ Tasker │  └──────────────┘  │  entity extract,  │
    └─────────┘                    │  hybrid retrieval) │
                                   └──────────────────┘
```

---

## 12. File Structure (Proposed)

```
services/whatsapp-agent/
├── agent.py              # Main daemon (Tasker event listener + context manager)
├── tasker_receiver.py    # Tasker/AutoRemote event handler (HTTP webhook or ADB broadcast)
├── scraper.py            # UI scraping via u2 (fallback message extraction)
├── compaction.py         # 3-tier WhatsApp-specific compaction (noise → hourly → daily)
├── trigger.py            # Regex + LLM trigger detection
├── responder.py          # Response generation + u2 delivery
├── context_client.py     # HTTP client for Context Engine (copy pattern from telegram-bot)
├── jsonl_writer.py       # JSONL writer for CE ingestion (copy pattern from audio-pipeline)
├── config.py             # Constants (context budget, cadence limits, CE endpoints)
├── mutex.py              # Phone screen mutex (shared with phone_loop, send-only)
└── tests/

~/.her-os/whatsapp/
├── seen.json             # Message dedup hashes
├── digests/
│   ├── hourly/           # Hourly digests (kept 48h)
│   └── daily/            # Daily summaries (kept 7 days, also in CE)
└── sessions/             # Raw messages (kept 24h, then rotated)

# NO memory.json — long-term memory lives in Context Engine
# JSONL segments written to CE with device_id="annie-whatsapp"
```

---

## 13. Implementation Phases

| Phase | What | Effort |
|-------|------|--------|
| 1. Foundation | Tasker setup on Pixel + event receiver on Panda + send mutex | 2 days |
| 2. Context | Noise filter + 3-tier compaction + JSONL writer for CE | 2 days |
| 3. Trigger & Response | Regex + LLM gate + u2 delivery + cadence | 2 days |
| 4. Integration | CE ingestion + entity extraction + observability + dashboard | 1 day |
| 5. Hardening | Error recovery + watchdog + UI scraping fallback | 1 day |

---

## 14. Open Questions

1. **Notification text redaction**: Need to verify on actual Pixel whether `dumpsys notification` shows full message text or just "1 new message" — this determines whether the UI-scraping fallback is always needed
2. **Multiple groups**: Start with Holliday only, but architecture should support adding more groups later
3. **Voice notes**: Family groups often have voice notes — should Annie eventually transcribe these via Whisper on Panda?
4. **Typing indicator**: Should Annie show "typing..." in the group while generating a response? (Probably not — surprise is better than anticipation for a silent observer)
5. **Read receipts**: Should Annie's message-reading show blue ticks to the group? (May need to disable read receipts in WhatsApp settings)
