# Annie Kernel — Complete Implementation Plan

## Context

The Annie Kernel blueprint (`docs/ANNIE-KERNEL.md`, 4,444 lines, 16 sections) evolves Annie from "LLM with tools" to an "OS-style supervisor with job scheduling." This plan implements the ENTIRE blueprint with zero deferrals — every section, every component, every creature, every test. Each phase has verification gates that must pass before proceeding.

**User requirements:**
- 100% blueprint coverage — nothing deferred or forgotten
- Tests for every phase, verified before moving on
- E2E verification using APIs — no user involvement needed for tests
- Visual tests via Chrome/Playwright for dashboard, creatures, silhouettes, event emitters, time-travel audit
- All test scripts saved as regression tests for future sessions
- Factory reset + synthetic data seeding for testing
- Deploy via git push/pull to Titan
- `./start.sh` and `./stop.sh` updated for new architecture

**Adversarial review adjustments (incorporated):**
- Simplified 3-level priority FIFO (not CFS-style scheduler) — saves 5-7 sessions
- Static fallback map (not Haiku debugger) — no LLM call for error recovery
- Tool-name frequency counting alongside hash-based loop detection
- Function hooks (not Plugin class hierarchy) — simpler, same composability
- Emotional arc as `before_model` callback (not kernel-level scheduler signal)
- Common `_execute_tool_typed()` wrapper for both text and voice paths
- Phased ToolResult migration (wrapper first, then per-tool conversion)

---

## Phase 0: Foundation — Factory Reset, Synthetic Data & Regression Infrastructure

**Goal:** Create the testing foundation before any kernel code. All test scripts saved for reuse.

### 0.1 Factory Reset Script
**NEW FILE:** `scripts/factory_reset.sh` (~120 lines)
- Clear PostgreSQL events/entities tables (Context Engine)
- Clear Redis session cache
- Clear Neo4j graph data
- Clear `~/.her-os/annie/` state files (sessions, checkpoints, tasks)
- Clear audio pipeline JSONL files
- Preserve `.env` files and model weights
- Preserve creature registry and dashboard build
- Reference: existing procedure in `memory/project_factory_reset_and_seed.md`
- **Run via:** `ssh titan "cd ~/workplace/her/her-os && bash scripts/factory_reset.sh"`

### 0.2 Profile Seed Script
**NEW FILE:** `scripts/seed_profile.sh` (~80 lines)
- Inject Rajesh's profile data via JSONL pipeline (from `docs/RAJESH-PROFILE.md`)
- Create seed conversation segments for Context Engine
- Populate entities: Rajesh (person), Annie (assistant), family members, key locations
- Wait for Context Engine to ingest and extract entities
- Verify: `curl http://localhost:8100/v1/entities | jq '.total'` shows >5 entities
- Reference: `memory/project_factory_reset_and_seed.md` for exact JSONL format

### 0.3 Synthetic Data Generator
**NEW FILE:** `scripts/generate_kernel_test_data.py` (~300 lines)
- Generate synthetic creature events for ALL 48 creatures (42 existing + 6 new kernel)
- Generate synthetic audit events (supervisor decisions, error routing, task scheduling)
- Generate synthetic task queue scenarios (priority ordering, aging, preemption)
- Generate synthetic tool loop failures (403 retry, timeout cascade, empty results)
- Output: `tests/fixtures/synthetic_events.json`, `tests/fixtures/synthetic_tasks.json`
- **Purpose:** Feed dashboard visual tests and time-travel audit tests without requiring live services

### 0.4 Regression Test Runner
**NEW FILE:** `scripts/run_regression.sh` (~150 lines)
- Runs ALL test suites in sequence with exit-on-failure
- Services: annie-voice (pytest), context-engine (pytest), telegram-bot (pytest), dashboard (vitest + playwright)
- Reports: coverage per service, total pass/fail, timing
- **Saved for reuse:** This script runs after every future session
- Run: `bash scripts/run_regression.sh` (on laptop for unit tests, on Titan for E2E)

### 0.5 E2E Health Check Script
**NEW FILE:** `scripts/health_check.sh` (~60 lines)
- Pings ALL 12 service health endpoints on Titan (see infrastructure report)
- Reports: service name, status, response time
- Fails if any service is unhealthy
- **Run before and after each phase deployment**

### Verification Gate 0
```bash
# On laptop:
bash scripts/run_regression.sh          # All ~4,900 existing tests pass
# On Titan (after push/pull):
bash scripts/factory_reset.sh           # Clean state
bash scripts/seed_profile.sh            # Rajesh's data loaded
bash scripts/health_check.sh            # All 12 services healthy
python scripts/generate_kernel_test_data.py  # Synthetic data created
```

---

## Phase 1: Loop Detection + Typed Errors (Session 1)

**Goal:** Stop the "5 minutes wasted retrying the same failed approach" problem. Self-healing for the text chat path.

### 1.1 ToolResult Dataclass
**NEW FILE:** `services/annie-voice/tool_result.py` (~60 lines)
- `ToolStatus` enum: SUCCESS, ERROR_TRANSIENT, ERROR_PERMANENT, ERROR_BLOCKED, PARTIAL
- `ToolResult` frozen dataclass: status, data, error_type, alternatives, confidence (default None)
- `to_llm_string()` method for backward-compatible LLM consumption
- Copy pattern: immutable dataclass matching `services/annie-voice/models.py` style

### 1.2 Loop Detector
**NEW FILE:** `services/annie-voice/loop_detector.py` (~150 lines)
- `ToolCallRecord` dataclass: tool_name, args_hash, result_hash, timestamp
- `LoopDetection` dataclass: stuck, level (WARNING/CRITICAL), detector, count, message
- `LoopDetector` class:
  - `check(tool_name, args) -> LoopDetection` — check before tool execution
  - `record(tool_name, args, result) -> None` — record after tool execution
  - 3 detectors: generic repeat (same hash), no-progress (same result hash), **tool-frequency (same tool_name >N times)**
  - Constants: HISTORY_SIZE=20, WARNING_THRESHOLD=3, CRITICAL_THRESHOLD=5
  - Emits `golem/loop_detected` via `emit_event()`
  - Scoped per-request (new instance per `stream_chat` call) — avoids compaction interaction bug
- Integration point: `text_llm.py` line 667 (Claude path), line 925 (OpenAI path)

### 1.3 Error Router
**NEW FILE:** `services/annie-voice/error_router.py` (~100 lines)
- `ErrorRouter` class with static `FALLBACK_CHAINS` dict:
  - `http_403` → [try_alternative_url, use_execute_python, report_failure]
  - `http_404` → [search_for_url, report_failure]
  - `http_429` → [wait_and_retry, try_alternative_source, report_failure]
  - `timeout` → [retry_once, try_simpler_request, report_failure]
  - `empty_result` → [broaden_query, try_alternative_source, report_failure]
  - `parse_error` → [retry_raw, report_failure]
  - `default` → [retry_once, report_failure]
- `get_strategy(error_type, attempt) -> str` — returns next strategy from chain
- Emits `golem/strategy_selected` via `emit_event()`
- Startup validation: warns if registered tools lack fallback chains
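
The static chains and `get_strategy` lookup can be sketched directly from the table above (attempt indexing past the end of a chain clamping to `report_failure` is an assumption about the intended semantics):

```python
FALLBACK_CHAINS = {
    "http_403": ["try_alternative_url", "use_execute_python", "report_failure"],
    "http_404": ["search_for_url", "report_failure"],
    "http_429": ["wait_and_retry", "try_alternative_source", "report_failure"],
    "timeout": ["retry_once", "try_simpler_request", "report_failure"],
    "empty_result": ["broaden_query", "try_alternative_source", "report_failure"],
    "parse_error": ["retry_raw", "report_failure"],
    "default": ["retry_once", "report_failure"],
}


def get_strategy(error_type: str, attempt: int) -> str:
    """Return the strategy for the Nth attempt; unknown errors use the default chain."""
    chain = FALLBACK_CHAINS.get(error_type, FALLBACK_CHAINS["default"])
    # Clamp so chain exhaustion always lands on the terminal report_failure.
    return chain[min(attempt, len(chain) - 1)]
```

No LLM call is involved — the whole router is a dict lookup, per the adversarial review adjustment.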

### 1.4 Supervised Tool Loop
**MODIFY FILE:** `services/annie-voice/text_llm.py` (+120/-30 lines)
- `_execute_tool_typed()` wrapper (line ~483): catches exceptions, returns `ToolResult`
  - HTTP errors → classify by status code into ToolStatus
  - Timeout → ERROR_TRANSIENT
  - Generic exception → ERROR_PERMANENT
  - Success → SUCCESS with data=raw_string
  - **Both Claude and OpenAI paths** call this wrapper
- Modify `_stream_claude()` (lines 602-681): inject loop check + error routing
- Modify `_stream_openai_compat()` (lines 722-932): inject loop check + error routing
- LoopDetector instantiated per-request (avoid compaction interaction)
- When loop detected (CRITICAL): inject strategy hint into tool result message
- When all fallbacks exhausted: graceful failure message to user

### 1.5 Tool ToolResult Migration (Top 3 Tools)
**MODIFY FILE:** `services/annie-voice/tools.py` (+30 lines)
- `search_web()` returns `ToolResult` with error classification
- `fetch_webpage()` returns `ToolResult` with HTTP status → error_type mapping
- **Voice path safety:** Pipecat handlers (`handle_web_search`, `handle_fetch_webpage`) call `.to_llm_string()` on the result, so voice path stays string-based

### 1.6 Golem Creature Registration
**MODIFY FILE:** `services/annie-voice/observability.py` (+3 lines)
- Add `"golem": {"zone": "thinking", "process": "kernel-supervisor"}` to `_CREATURES` dict (line 29-47)

### 1.7 Tests for Phase 1
**NEW FILE:** `services/annie-voice/tests/test_loop_detector.py` (~250 lines)
- Generic repeat: same tool+args called 3x → WARNING, 5x → CRITICAL
- No-progress: same tool+args+result hash → detect stale loop
- Tool-frequency: `search_web` called 6x → CRITICAL regardless of args
- Ping-pong: alternating A↔B detection
- Per-request scoping: new instance per request, no cross-request leakage
- Integration: mock `_execute_tool` → 403 five times, assert loop detector blocks after 3

**NEW FILE:** `services/annie-voice/tests/test_error_router.py` (~150 lines)
- Each error type routes to correct first strategy
- Subsequent attempts advance through chain
- Chain exhaustion returns `report_failure`
- ERROR_BLOCKED → immediate failure (no retries)
- Unknown error type → default chain
- Startup validation warns on missing tools

**NEW FILE:** `services/annie-voice/tests/test_tool_result.py` (~80 lines)
- ToolResult.to_llm_string() formatting
- Frozen dataclass (cannot mutate)
- ToolStatus enum values

**NEW FILE:** `services/annie-voice/tests/test_supervised_loop.py` (~200 lines)
- Integration: mock LLM + mock tools, run supervised loop end-to-end
- YouTube 403 regression: mock fetch_webpage → 403 5x, assert <3 calls same args, total <30s
- Error routing: 429 → wait_and_retry strategy applied
- Graceful failure: all strategies exhausted → user-friendly message
- Existing behavior preserved: successful tool calls unchanged
- Audit events emitted: verify `emit_event` called with `golem/loop_detected` and `golem/strategy_selected`

### Verification Gate 1
```bash
# On laptop:
pytest services/annie-voice/tests/test_loop_detector.py -v
pytest services/annie-voice/tests/test_error_router.py -v
pytest services/annie-voice/tests/test_tool_result.py -v
pytest services/annie-voice/tests/test_supervised_loop.py -v
pytest services/annie-voice/tests/ -x  # ALL existing 1,070+ tests pass, zero regressions

# Deploy to Titan:
git add -A && git commit -m "feat: Phase 1 — loop detection + typed errors + error routing"
git push && ssh titan "cd ~/workplace/her/her-os && git pull"
./stop.sh annie && ./start.sh annie

# E2E on Titan (API-based, no user involvement):
curl -X POST http://titan:8100/v1/chat -d '{"message":"summarize https://youtube.com/watch?v=dQw4w9WgXcQ"}'
# Assert: response arrives <60s (not 5 min), contains summary or graceful failure
bash scripts/health_check.sh  # All services healthy
```

---

## Phase 2: Callback Lifecycle + Function Hooks (Session 2)

**Goal:** Composable hooks that fire on every LLM call and tool execution, replacing scattered validation code.

### 2.1 Callback Hooks System
**NEW FILE:** `services/annie-voice/kernel_hooks.py` (~120 lines)
- `HookRegistry` class:
  - `before_model_hooks: list[Callable]` — inject loop warnings, emotional context
  - `after_model_hooks: list[Callable]` — strip think tags, validate response
  - `before_tool_hooks: list[Callable]` — SSRF blocking, rate limits, audit
  - `after_tool_hooks: list[Callable]` — error classification, audit
- Built-in hooks (registered by default):
  - `think_strip_hook` — strips `<think>` blocks (replaces ThinkBlockFilter for text path)
  - `audit_hook` — emits `golem/tool_dispatch` and `golem/tool_complete` events
  - `security_hook` — SSRF check (existing logic from `tools.py` lines 60-80)
- NO class hierarchy, NO plugin base class — just function hooks in a list
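
A sketch of the hook registry as plain function lists (the `run_*` method names are illustrative; only the four hook lists come from the design above):

```python
import re
from typing import Callable


class HookRegistry:
    """Ordered lists of plain functions -- no plugin base class."""

    def __init__(self) -> None:
        self.before_model: list[Callable] = []
        self.after_model: list[Callable] = []
        self.before_tool: list[Callable] = []
        self.after_tool: list[Callable] = []

    def run_before_tool(self, tool_name: str, args: dict) -> None:
        # A hook may raise to block the call (e.g. the SSRF security hook).
        for hook in self.before_tool:
            hook(tool_name, args)

    def run_after_model(self, text: str) -> str:
        # Hooks compose left to right over the model output.
        for hook in self.after_model:
            text = hook(text)
        return text


def think_strip_hook(text: str) -> str:
    # Drop <think>...</think> blocks before the response reaches the user.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

Removing a hook for a test is just `registry.after_model.remove(think_strip_hook)` — the composability comes from list order, nothing more.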

### 2.2 Temp State Namespace
**MODIFY FILE:** `services/annie-voice/text_llm.py` (+20 lines)
- Add `temp_state: dict = {}` at start of `stream_chat()`
- Pass to `_execute_tool_typed()` — tools can write to it
- Clear at end of each user message turn
- Use case: `search_web` writes URLs to `temp_state["search_urls"]`, `fetch_webpage` reads them

### 2.3 Full ToolResult Migration (All Tools)
**MODIFY FILES:** (each +10-20 lines)
- `services/annie-voice/tools.py` — search_web, fetch_webpage (already done in Phase 1, verify)
- `services/annie-voice/memory_tools.py` — search_memory, save_note, read_notes, delete_note, update_note
- `services/annie-voice/code_tools.py` — execute_python
- `services/annie-voice/browser_agent_tools.py` — browser_navigate, browser_snapshot, browser_click, schedule_coffee_delivery
- `services/annie-voice/subagent_tools.py` — invoke_researcher, invoke_memory_dive, invoke_draft_writer
- `services/annie-voice/visual_tools.py` — render_table, render_chart, render_svg
- Each tool returns `ToolResult` with appropriate `error_type` and `alternatives`

### 2.4 State Change Auditing
**MODIFY FILE:** `services/annie-voice/text_llm.py` (+15 lines)
- After each tool execution: emit `golem/state_change` with tool_name, result status, round number
- After each LLM call: emit `golem/llm_response` with token count, model used

### 2.5 Tests for Phase 2
**NEW FILE:** `services/annie-voice/tests/test_kernel_hooks.py` (~200 lines)
- Hook registration and execution order
- before_model injects warning when loop detected
- after_model strips think tags
- before_tool blocks SSRF attempts
- after_tool classifies errors
- Hook removal (for testing without specific hooks)

**NEW FILE:** `services/annie-voice/tests/test_all_tools_typed.py` (~300 lines)
- Every tool function returns `ToolResult` (not bare string)
- Each error path returns correct `ToolStatus`
- `to_llm_string()` produces human-readable output for every tool
- Voice path: Pipecat handlers still get strings via `.to_llm_string()`

### Verification Gate 2
```bash
# On laptop:
pytest services/annie-voice/tests/test_kernel_hooks.py -v
pytest services/annie-voice/tests/test_all_tools_typed.py -v
pytest services/annie-voice/tests/ -x  # ALL tests pass

# Deploy:
git commit && git push && ssh titan "cd ~/workplace/her/her-os && git pull"
./stop.sh annie && ./start.sh annie

# E2E:
curl -X POST http://titan:8100/v1/chat -d '{"message":"search for weather in bangalore"}'
# Assert: golem/tool_dispatch event visible in SSE stream
bash scripts/health_check.sh
```

---

## Phase 3: Task Queue + Priority Scheduling (Session 3)

**Goal:** Replace the lane-based FIFO with a single multi-level priority queue. User tasks don't block each other.

### 3.1 Task Queue
**NEW FILE:** `services/annie-voice/task_queue.py` (~250 lines)
- `TaskPriority` IntEnum: REALTIME=0, HIGH=1, NORMAL=2, LOW=3, BACKGROUND=4
- `TaskState` str Enum: QUEUED, RUNNING, COMPLETED, FAILED, CANCELLED, TIMEOUT
- `Task` **frozen** dataclass: task_id, name, priority, state, user_message, source, created_at, deadline_soft_s=300, deadline_hard_s=900
- `TaskQueue` class:
  - Uses `asyncio.PriorityQueue` (not threading.Lock + manual heap)
  - `submit(task) -> str` — returns task_id, emits `clockwork/task_enqueue`
  - `pop_next() -> Task` — returns highest priority, emits `clockwork/task_dequeue`
  - `cancel(task_id)` — emits `clockwork/task_cancel`
  - `list_tasks() -> list[Task]` — for status API
  - Queue depth limits: HIGH:3, NORMAL:5, LOW:3, BACKGROUND:5
  - Aging: `effective_priority = base - floor(wait_s / 300)`, floor at 1
  - Coalescing: duplicate tasks (same name + user_message) merged
- `clockwork` creature registration in observability.py

### 3.2 Agent Runtime Integration
**MODIFY FILE:** `services/annie-voice/agent_context.py` (+80/-60 lines)
- `AgentRunner.__init__()`: accept `TaskQueue` parameter
- `AgentRunner._lane_worker()`: pull from `TaskQueue.pop_next()` instead of per-lane `asyncio.Queue`
- Remove `_queues: dict[str, asyncio.Queue]` and `DEFAULT_LANES`
- Keep voice-priority gating: when `_is_voice_active()`, pause non-REALTIME task execution
- Keep `_execute()` logic intact (budget enforcement, Beast health, evolution overlay, cost tracking)
- Keep `execute_direct()` bypass for orchestrator sub-stages

### 3.3 Scheduler Integration
**MODIFY FILE:** `services/annie-voice/agent_scheduler.py` (+20/-10 lines)
- `_fire_job()` creates a `Task` with appropriate priority:
  - Cron jobs → LOW
  - Manual triggers → HIGH
  - Proactive pulse → BACKGROUND
- Submit to `TaskQueue` instead of `AgentRunner` directly

### 3.4 Server Integration
**MODIFY FILE:** `services/annie-voice/server.py` (+25 lines)
- Initialize `TaskQueue` singleton in startup sequence (after AgentRunner, before AgentScheduler)
- Pass `TaskQueue` to `AgentRunner` constructor
- Add `/v1/tasks` GET endpoint → `TaskQueue.list_tasks()`
- Keep `_llm_semaphore` for voice path (unchanged) — voice bypasses the queue
- Keep `background_llm_call()` — uses queue priority internally

### 3.5 Task Persistence
**MODIFY FILE:** `services/annie-voice/task_queue.py` (+50 lines)
- `_persist_queue()` → write `~/.her-os/annie/tasks/queue.json` on every submit/cancel (atomic write via rename)
- `_restore_queue()` → reload on startup
- Completed tasks cleaned after 24h
- Reference pattern: `services/audio-pipeline/jsonl_writer.py` atomic write
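
The atomic-write pattern referenced above boils down to write-temp-then-rename (a generic sketch, not the `jsonl_writer.py` code itself):

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target.

    os.replace is atomic on POSIX for same-filesystem paths, so a reader
    (or a crash mid-write) never observes a half-written queue file.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target — `os.replace` across filesystems is not atomic.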

### 3.6 Tests for Phase 3
**NEW FILE:** `services/annie-voice/tests/test_task_queue.py` (~300 lines)
- Priority ordering: HIGH dequeues before NORMAL before BACKGROUND
- Aging: BACKGROUND task after 15 min → effective_priority=1 (HIGH level)
- Queue depth limits: 6th NORMAL task rejected
- Coalescing: duplicate tasks merged, queue depth unchanged
- Cancel: QUEUED task cancelled, RUNNING unaffected
- Persistence: queue survives restart (mock filesystem)
- Frozen Task: cannot mutate (TypeError on assignment)
- Audit events: clockwork/task_enqueue, clockwork/task_dequeue emitted

**MODIFY FILE:** `services/annie-voice/tests/test_agent_context.py` (+100 lines)
- AgentRunner uses TaskQueue (not lane queues)
- Voice priority gating still works
- execute_direct bypass still works
- Background agents (cron, proactive) still execute

### Verification Gate 3
```bash
# On laptop:
pytest services/annie-voice/tests/test_task_queue.py -v
pytest services/annie-voice/tests/test_agent_context.py -v
pytest services/annie-voice/tests/ -x

# Deploy:
git commit && git push && ssh titan "cd ~/workplace/her/her-os && git pull"
./stop.sh annie && ./start.sh annie

# E2E (API-based):
# Submit task and verify queue
curl http://titan:7860/v1/tasks | jq '.tasks | length'  # Queue visible
# Submit 3 messages rapidly via API, verify priority ordering in logs
bash scripts/health_check.sh
```

---

## Phase 4: Resource Pool — Auto-Routing Beast vs Claude Code (Session 4)

**Goal:** Replace "Claude, ..." prefix hack with automatic backend routing. Shadow mode first, then automatic.

### 4.1 Backend Classifier
**NEW FILE:** `services/annie-voice/resource_pool.py` (~150 lines)
- `Backend` enum: BEAST, CLAUDE_CODE, NANO
- `RoutingDecision` frozen dataclass: backend, rule_matched, alternatives_rejected, confidence
- `classify_backend(user_message, source) -> RoutingDecision`:
  - YouTube URL → CLAUDE_CODE (has yt-dlp)
  - `git `, `commit`, `push`, `pull`, `diff` keywords → CLAUDE_CODE
  - `edit file`, `create file`, filesystem ops → CLAUDE_CODE
  - `voice` source → always BEAST (latency constraint)
  - Default → BEAST (local-first)
- Emits `golem/backend_classified` audit event
- `BackendHealthMonitor`:
  - Pings Beast `/health` every 10s
  - After 3 failures → marks unhealthy, routes to Claude Code
  - On recovery → re-enables Beast
  - Emits `golem/backend_health` events
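
The classifier rules reduce to an ordered keyword/regex check. A sketch (keyword lists and the tuple return are illustrative; the voice check runs first since the latency constraint trumps everything else):

```python
import re

GIT_KEYWORDS = ("git ", "commit", "push", "pull", "diff")
FILE_KEYWORDS = ("edit file", "create file")
YOUTUBE_RE = re.compile(r"(youtube\.com/watch|youtu\.be/)")


def classify_backend(user_message: str, source: str) -> tuple[str, str]:
    """Return (backend, rule_matched); the real version returns a RoutingDecision."""
    msg = user_message.lower()
    if source == "voice":
        return "BEAST", "voice_latency"          # voice always stays local
    if YOUTUBE_RE.search(msg):
        return "CLAUDE_CODE", "youtube_url"      # Claude Code has yt-dlp
    if any(k in msg for k in GIT_KEYWORDS):
        return "CLAUDE_CODE", "git_keyword"
    if any(k in msg for k in FILE_KEYWORDS):
        return "CLAUDE_CODE", "filesystem_op"
    return "BEAST", "default_local_first"
```

Returning the matched rule name makes the `golem/backend_classified` audit event self-explanatory.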

### 4.2 Integration — Shadow Mode
**MODIFY FILE:** `services/telegram-bot/bot.py` (+25/-5 lines)
- In `handle_text()` (line 388-428): call `classify_backend()` alongside `detect_claude_prefix()`
- Log "would have auto-routed to {backend}" when classifier disagrees with prefix
- Prefix still works (override) — shadow mode is logging only

### 4.3 Integration — Automatic Mode
**MODIFY FILE:** `services/telegram-bot/bot.py` (+15 lines)
- Feature flag: `AUTO_ROUTE_ENABLED` in `.env` (default: false)
- When enabled: non-prefixed messages auto-route based on classifier
- "Claude, ..." prefix becomes an explicit override

### 4.4 Djinn Creature Registration
**MODIFY FILE:** `services/annie-voice/observability.py` (+2 lines)
- Add `"djinn": {"zone": "acting", "process": "worker-claude-code"}`

### 4.5 Tests for Phase 4
**NEW FILE:** `services/annie-voice/tests/test_resource_pool.py` (~200 lines)
- YouTube URL → CLAUDE_CODE
- "git status" → CLAUDE_CODE
- "what's the weather" → BEAST
- voice source → always BEAST
- Beast unhealthy → fallback to CLAUDE_CODE
- Beast recovery → re-enable routing
- Prefix override still works
- Shadow mode logging verified
- Audit events emitted correctly

**NEW FILE:** `services/telegram-bot/tests/test_auto_routing.py` (~100 lines)
- Shadow mode: classifier runs, logs decision, prefix still controls
- Auto mode: classifier decision used when no prefix
- Override: "Claude, git status" → CLAUDE_CODE regardless of classifier

### Verification Gate 4
```bash
# On laptop:
pytest services/annie-voice/tests/test_resource_pool.py -v
pytest services/telegram-bot/tests/test_auto_routing.py -v
pytest services/telegram-bot/tests/ -x
pytest services/annie-voice/tests/ -x

# Deploy:
git commit && git push && ssh titan "cd ~/workplace/her/her-os && git pull"
./stop.sh annie telegram && ./start.sh annie telegram

# E2E:
# Send YouTube URL via Telegram API, verify shadow log shows "would have auto-routed to CLAUDE_CODE"
# Send "git status" via Telegram API, verify classification
bash scripts/health_check.sh
```

---

## Phase 5: Sub-Agent Tool Loops + Voice Integration (Session 5)

**Goal:** Sub-agents get their own tool loops with loop detection. Voice path gets lightweight supervision.

### 5.1 Sub-Agent Tool Loops
**MODIFY FILE:** `services/annie-voice/subagent_tools.py` (+100/-30 lines)
- `invoke_researcher()`: gets own LoopDetector + ErrorRouter + tool loop (max 3 rounds)
- `invoke_memory_dive()`: gets own tool loop
- `invoke_draft_writer()`: gets own tool loop
- Timeout escalation: 30s per sub-agent → fail with graceful message
- Confidence scoring: sub-agent result includes confidence field
- Structured schemas: Pydantic model for sub-agent input/output

### 5.2 Voice Pipeline Loop Detection
**MODIFY FILE:** `services/annie-voice/bot.py` (+30 lines)
- Add lightweight `LoopDetector` to voice tool loop (Pipecat `FunctionCallParams` handlers)
- Voice-only features: LoopDetector (check + record), graceful failure message
- NO callbacks, NO error router, NO debugger in voice path (<100ms overhead)
- Emit `golem/voice_loop_detected` for voice-specific tracking

### 5.3 Result Validation
**MODIFY FILE:** `services/annie-voice/text_llm.py` (+40 lines)
- `_validate_tool_result(result, query) -> float` (0-1 relevance score)
- Keyword overlap check between query and result
- CAPTCHA/block page detection (common HTML patterns)
- Low confidence (<0.5) → append warning to result
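
The scoring heuristic can be sketched as block-page detection plus keyword overlap (the pattern list and the length-4 term filter are assumptions; the real `_validate_tool_result` may weight terms differently):

```python
import re

# Common block-page fragments; finding one caps confidence near zero.
BLOCK_PATTERNS = ("captcha", "access denied", "verify you are human", "enable javascript")


def validate_tool_result(result: str, query: str) -> float:
    """Crude 0-1 relevance score for a tool result against the user query."""
    text = result.lower()
    if not text.strip():
        return 0.0
    if any(p in text for p in BLOCK_PATTERNS):
        return 0.1
    # Overlap of non-trivial query terms (longer than 3 chars) with the result.
    query_terms = {w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) > 3}
    if not query_terms:
        return 0.5  # nothing meaningful to compare against
    hits = sum(1 for t in query_terms if t in text)
    return hits / len(query_terms)
```

Anything under 0.5 gets the warning appended per the bullet above.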

### 5.4 Tests for Phase 5
**NEW FILE:** `services/annie-voice/tests/test_subagent_loops.py` (~200 lines)
- Sub-agent tool loop completes normally
- Sub-agent loop detection fires after 3 same-tool calls
- Sub-agent timeout → graceful failure
- Confidence scoring in results

**NEW FILE:** `services/annie-voice/tests/test_voice_loop_detection.py` (~100 lines)
- Voice loop detector triggers on repeated tool calls
- Voice graceful failure message (not silence)
- Voice path overhead <10ms

**NEW FILE:** `services/annie-voice/tests/test_result_validation.py` (~100 lines)
- Weather result for "oil prices" query → low confidence
- CAPTCHA page detection
- Valid result → high confidence
- Empty result → zero confidence

### Verification Gate 5
```bash
# On laptop:
pytest services/annie-voice/tests/test_subagent_loops.py -v
pytest services/annie-voice/tests/test_voice_loop_detection.py -v
pytest services/annie-voice/tests/test_result_validation.py -v
pytest services/annie-voice/tests/ -x

# Deploy + E2E:
git commit && git push && ssh titan "cd ~/workplace/her/her-os && git pull"
./stop.sh annie && ./start.sh annie
# Test sub-agent via API: invoke researcher tool, verify tool loop events
bash scripts/health_check.sh
```

---

## Phase 6: Job Control + Preemption + Notifications (Session 6)

**Goal:** User can check task status, reprioritize, cancel. Background tasks yield to interactive.

### 6.1 Job Control Tools
**NEW FILE:** `services/annie-voice/job_control.py` (~100 lines)
- `task_status()` → list active/queued/recent tasks with status
- `reprioritize_task(task_id, new_priority)` → change priority
- `cancel_task(task_id)` → cancel queued/running task
- Intent detection: "what are you working on" → task_status, "cancel that" → cancel_task

### 6.2 Cooperative Preemption
**MODIFY FILE:** `services/annie-voice/task_queue.py` (+40 lines)
- `_should_preempt(current_task) -> bool` — check between tool rounds
- When preemption needed: let current round finish, then yield
- No checkpoint serialization (per adversarial review) — re-queue from scratch
- Voice starts → all non-REALTIME tasks yield immediately
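
The preemption check is a pure function evaluated between tool rounds. A sketch under the priority numbering from 3.1 (lower number = higher priority; the function name and signature are illustrative):

```python
REALTIME = 0  # from TaskPriority in 3.1


def should_preempt(current_priority: int,
                   queued_priorities: list[int],
                   voice_active: bool) -> bool:
    """Checked only between tool rounds -- the current round always finishes."""
    if voice_active and current_priority != REALTIME:
        return True  # voice session: every non-REALTIME task yields
    # Yield if anything strictly higher-priority is waiting in the queue.
    return any(p < current_priority for p in queued_priorities)
```

Because there is no checkpoint serialization, "yield" just means re-queueing the task from scratch.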

### 6.3 Completion Notifications
**MODIFY FILE:** `services/annie-voice/task_queue.py` (+30 lines)
- `on_complete` callback → notify user of background task completion
- Rate limit: max 1 notification per 30s
- Route to Telegram if text session, voice if voice session

### 6.4 Tests for Phase 6
**NEW FILE:** `services/annie-voice/tests/test_job_control.py` (~150 lines)
- List tasks shows correct status
- Reprioritize changes effective priority
- Cancel stops queued task
- Cancel running task: finishes current round, then stops
- Preemption: HIGH task arrives, NORMAL yields between rounds
- Notification: task completes → callback fired
- Rate limiting: 2 completions in 10s → only 1 notification

### Verification Gate 6
```bash
pytest services/annie-voice/tests/test_job_control.py -v
pytest services/annie-voice/tests/ -x

# Deploy + E2E:
git commit && git push && ssh titan "cd ~/workplace/her/her-os && git pull"
./stop.sh annie && ./start.sh annie
# Submit background task, check /v1/tasks endpoint, cancel it
curl http://titan:7860/v1/tasks
```

---

## Phase 7: 6 New Kernel Creatures + Dashboard (Session 7)

**Goal:** All 6 kernel creatures registered, rendered with SVG silhouettes, emitting events. Job queue panel on dashboard. All 23 dark creatures instrumented.

### 7.1 Register 6 Kernel Creatures
**MODIFY FILE:** `services/context-engine/chronicler.py` (+12 lines)
- Add to CREATURE_REGISTRY:
  - `golem` — kernel supervisor, zone: thinking
  - `clockwork` — task scheduler, zone: thinking
  - `phoenix-ash` — self-healing/debugger, zone: thinking
  - `roc` — YouTube worker, zone: acting
  - `manticore` — research worker, zone: acting
  - `djinn` — Claude Code worker, zone: acting

### 7.2 SVG Silhouettes for 6 Creatures
**MODIFY FILE:** `docs/creature-catalog.html` (+300 lines)
- Add SVG rendering code for golem, clockwork, phoenix-ash, roc, manticore, djinn
- Accent colors consistent with neural reef palette
- Animation states: idle, activating, working, completing, error

### 7.3 Dashboard Creature Registry
**MODIFY FILE:** `services/context-engine/dashboard/src/creatures/registry.ts` (+60 lines)
- Add 6 kernel creature entries with zone, color, description
- Add 5 missing creatures from the 37→42 gap (if any)

### 7.4 Instrument 23 Dark Creatures
**MODIFY FILES:** Multiple (~5 lines each, ~115 lines total)
- Context Engine processes: add `emit_event()` calls for owl, ouroboros, dragon, kraken, starfish, narwhal, spider, phoenix, basilisk, butterfly, firefly, hummingbird, chameleon, hawk, eagle, oracle
- Annie Voice processes: add calls for pythia, mnemosyne, scribe, librarian
- MCP Server: add calls for pegasus
- Each creature gets at least: `start` event at process begin, `complete` event at process end

### 7.5 Job Queue Dashboard Panel
**NEW FILE:** `services/context-engine/dashboard/src/components/JobQueuePanel.tsx` (~150 lines)
- Real-time panel showing: ACTIVE task, QUEUED tasks, SUSPENDED tasks
- Columns: task name, priority, age, worker (creature), backend
- Preemption flash animation
- Data source: `/v1/tasks` endpoint (new in Phase 3)

### 7.6 Tests for Phase 7
**NEW FILE:** `services/context-engine/tests/test_kernel_creatures.py` (~100 lines)
- All 48 creatures (42 + 6) registered in CREATURE_REGISTRY
- All 48 have zone + service + process metadata
- No duplicate IDs

**MODIFY FILE:** `services/annie-voice/tests/test_observability.py` (+50 lines)
- All 48 creatures have at least 1 `emit_event()` call in source code
- Verify golem, clockwork, phoenix-ash, roc, manticore, djinn emit events

**NEW FILE:** `services/context-engine/dashboard/tests/jobQueuePanel.test.ts` (~100 lines)
- Panel renders active task
- Panel shows queue depth
- Priority colors correct
- Preemption flash animation triggers

### 7.7 Visual E2E Tests (Playwright + Chrome)
**NEW FILE:** `services/context-engine/dashboard/e2e/kernel-creatures.spec.ts` (~200 lines)
- **Creature rendering:** Navigate to dashboard, verify all 48 creature silhouettes render
- **Event emission:** Inject synthetic events, verify creatures light up
- **Job queue panel:** Verify panel shows tasks from `/v1/tasks`
- **Time-travel audit:** Scrub timeline, verify audit events show supervisor decisions
- **Screenshot capture:** Save screenshots for visual regression
- **Run with:** `npx playwright test e2e/kernel-creatures.spec.ts`

**NEW FILE:** `services/context-engine/dashboard/e2e/audit-trail.spec.ts` (~150 lines)
- Navigate to time-travel view
- Load synthetic audit events (from Phase 0 generator)
- Scrub to specific timestamp, verify event details shown
- Verify golem decisions visible (loop detected, strategy selected, backend classified)
- Verify clockwork events visible (task enqueued, dequeued, cancelled)
- Screenshot capture

### Verification Gate 7
```bash
# On laptop:
pytest services/context-engine/tests/test_kernel_creatures.py -v
pytest services/annie-voice/tests/test_observability.py -v
cd services/context-engine/dashboard && npx vitest run  # All dashboard unit tests
cd services/context-engine/dashboard && npx playwright test e2e/kernel-creatures.spec.ts
cd services/context-engine/dashboard && npx playwright test e2e/audit-trail.spec.ts

# Deploy + Visual:
git commit && git push && ssh titan "cd ~/workplace/her/her-os && git pull"
./stop.sh dashboard annie ce && ./start.sh ce annie dashboard

# Visual verification on Titan dashboard (via Chrome MCP or Playwright):
# 1. Open dashboard URL
# 2. Verify all 48 creatures visible in aquarium
# 3. Trigger synthetic events → creatures light up
# 4. Verify job queue panel shows
# 5. Scrub timeline → audit events display
bash scripts/health_check.sh
```

---

## Phase 8: Auditability + Privacy-Aware Logging (Session 8)

**Goal:** Every kernel decision is auditable. Time-travel debugging works. Privacy-safe audit logs.

### 8.1 Audit Event Schema Formalization
**MODIFY FILE:** `services/annie-voice/observability.py` (+30 lines)
- Formal `AuditEvent` type hint (not a new class — just structured dict)
- Required fields: component, event_type, task_id, decision, reasoning, alternatives
- `_sanitize_for_audit(data) -> dict`: strip message content (keep role+length), redact tool results beyond 50 chars, never include system prompts
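
A sketch of the sanitizer (key names `messages`, `system_prompt`, and `tool_result` are assumed event-dict conventions; the real function would know the actual schema):

```python
def sanitize_for_audit(event: dict) -> dict:
    """Strip private payloads: keep message role+length, truncate tool results."""
    clean: dict = {}
    for key, value in event.items():
        if key == "messages":
            # Keep only role and content length -- never the content itself.
            clean[key] = [{"role": m.get("role"), "length": len(m.get("content", ""))}
                          for m in value]
        elif key == "system_prompt":
            continue  # system prompts are never logged
        elif key == "tool_result" and isinstance(value, str):
            clean[key] = value[:50] + ("…" if len(value) > 50 else "")
        else:
            clean[key] = value
    return clean
```

The Phase 8 privacy tests then reduce to asserting that no sanitized event contains raw content or a `system_prompt` key.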

### 8.2 Comprehensive Audit Coverage
Verify ALL kernel components emit audit events (every decision path):
- **LoopDetector:** `golem/loop_detected` (already Phase 1)
- **ErrorRouter:** `golem/strategy_selected` (already Phase 1)
- **TaskQueue:** `clockwork/task_enqueue`, `clockwork/task_dequeue`, `clockwork/task_reject`, `clockwork/task_cancel` (already Phase 3)
- **ResourcePool:** `golem/backend_classified`, `golem/backend_health` (already Phase 4)
- **Hooks:** `golem/tool_dispatch`, `golem/tool_complete` (already Phase 2)
- **Sub-agents:** `phoenix-ash/self_heal`, `roc/youtube_task`, `manticore/research_task`, `djinn/code_task`
- **Preemption:** `clockwork/task_preempt`, `clockwork/task_resume`

### 8.3 Audit Log Retention
**MODIFY FILE:** `services/context-engine/config.py` (+5 lines)
- `AUDIT_RETENTION_DAYS_ACTIVE = 30`
- `AUDIT_RETENTION_DAYS_ARCHIVE = 365`
- `AUDIT_LOG_DIR = "~/.her-os/annie/audit/"`

### 8.4 Supervisor Self-Monitoring (Watchdog)
**MODIFY FILE:** `services/annie-voice/text_llm.py` (+25 lines)
- Exception handler wrapping the entire supervised loop
- On ANY uncaught exception: fall back to flat tool loop behavior (graceful degradation)
- Watchdog: if no `emit_event` for `deadline_hard_s`, kill task and log kernel error
- Emit `golem/watchdog_triggered` event
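
A sketch of the degradation wrapper, simplified to a hard overall deadline rather than the activity-based `emit_event` timer the plan specifies. `supervised_loop`, `flat_loop`, and `emit_event` are placeholder names for the real coroutines and audit emitter:

```python
import asyncio
import logging

logger = logging.getLogger("kernel.watchdog")

async def run_with_watchdog(supervised_loop, flat_loop, emit_event, deadline_hard_s: float):
    """Run the supervised loop; on deadline or crash, degrade gracefully."""
    try:
        return await asyncio.wait_for(supervised_loop(), timeout=deadline_hard_s)
    except asyncio.TimeoutError:
        # Watchdog fired: no completion within the hard deadline. Kill + log.
        emit_event("golem/watchdog_triggered", {"deadline_s": deadline_hard_s})
        raise
    except Exception as exc:
        # ANY uncaught kernel exception: fall back to the flat tool loop.
        logger.exception("supervised loop crashed, degrading to flat loop")
        emit_event("golem/watchdog_triggered", {"error": type(exc).__name__})
        return await flat_loop()
```

A production version would reset the timer on each `emit_event` (inactivity watchdog) instead of bounding total runtime; this sketch only shows the degradation shape.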

### 8.5 Tests for Phase 8
**NEW FILE:** `services/annie-voice/tests/test_audit_completeness.py` (~200 lines)
- Mock `emit_event`, run every kernel code path, assert ALL expected events emitted
- Verify NO event contains full message content (privacy check)
- Verify NO event contains system prompt
- Verify tool args are truncated/hashed in audit events
- Verify `_sanitize_for_audit()` redacts correctly
- Supervisor crash → fallback to flat loop (not silence)
- Watchdog fires after deadline → task killed + event emitted

### Verification Gate 8
```bash
pytest services/annie-voice/tests/test_audit_completeness.py -v
pytest services/annie-voice/tests/ -x

# Deploy:
git commit && git push && ssh titan git pull && ./stop.sh annie && ./start.sh annie

# E2E audit trail verification:
# 1. Submit a task that triggers loop detection
# 2. Query Context Engine events API: curl http://titan:8100/v1/events?creature=golem
# 3. Verify golem/loop_detected event present with decision + reasoning
# 4. Verify event data does NOT contain message content
bash scripts/health_check.sh
```

---

## Phase 9: Emotional Arc + Advanced Patterns (Session 9)

**Goal:** Emotional context influences LLM behavior (via before_model hook, NOT scheduler). Advanced patterns from ADK.

### 9.1 Emotional Arc as before_model Hook
**NEW FILE:** `services/annie-voice/emotional_context.py` (~80 lines)
- `EmotionalContext` dataclass: emotion, confidence, source (voice/text), turns_at_emotion
- `emotional_before_model_hook(messages, emotional_ctx) -> messages`:
  - Confidence >0.7 AND same emotion 2+ turns → inject emotional guidance into system prompt
  - frustrated → "Be direct, skip pleasantries"
  - stressed → "Keep responses ultra-short"
  - happy → "Match energy, suggest ideas"
  - tired → "Minimal responses, defer non-urgent"
- Text-only fallback: keyword-based sentiment (frustration/excitement words) — no LLM call
- Register as `before_model` hook in `kernel_hooks.py`

### 9.2 Compaction with Overlap (ADK 9.9)
**MODIFY FILE:** `services/annie-voice/compaction.py` (+20 lines)
- Add `overlap_messages=2` parameter to Tier 2 compaction
- Last 2 messages before the cut point are included in BOTH summary and recent context
- Prevents context discontinuity at compaction boundary
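
The split logic can be sketched as follows; `split_for_compaction` is a hypothetical helper name, not a confirmed function in `compaction.py`:

```python
def split_for_compaction(messages: list, keep_recent: int, overlap_messages: int = 2):
    """Split history into (to_summarize, recent) with a shared seam.

    The last `overlap_messages` before the cut appear in BOTH halves,
    so the summary and the live context overlap at the boundary.
    """
    if len(messages) <= keep_recent:
        return [], list(messages)
    cut = len(messages) - keep_recent
    to_summarize = messages[:cut]
    recent = messages[max(0, cut - overlap_messages):]
    return to_summarize, recent
```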

### 9.3 Response Caching (ADK 9.19)
**MODIFY FILE:** `services/annie-voice/kernel_hooks.py` (+30 lines)
- `cache_before_model_hook`: check if identical request was made in last 5 min
- If cached: return cached response, skip LLM call
- TTL: 300s, max 50 entries
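
A sketch of the cache pair under the stated TTL/size limits; keying on `repr(messages)` and the before/after hook split are assumptions of this sketch:

```python
import hashlib
import time
from collections import OrderedDict

CACHE_TTL_S = 300
CACHE_MAX_ENTRIES = 50
_cache: OrderedDict = OrderedDict()  # key -> (inserted_at, response)

def _request_key(messages: list) -> str:
    return hashlib.sha256(repr(messages).encode()).hexdigest()

def cache_before_model_hook(messages: list):
    """Return a cached response for an identical recent request, else None."""
    hit = _cache.get(_request_key(messages))
    if hit and time.monotonic() - hit[0] < CACHE_TTL_S:
        return hit[1]  # cache hit: skip the LLM call entirely
    return None

def cache_after_model_hook(messages: list, response: str) -> None:
    _cache[_request_key(messages)] = (time.monotonic(), response)
    while len(_cache) > CACHE_MAX_ENTRIES:
        _cache.popitem(last=False)  # evict oldest entry first
```

Hashing the full message list means any change to history misses the cache, which is the safe default for a conversational agent.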

### 9.4 Parallel Tool Execution (ADK 9.27)
**MODIFY FILE:** `services/annie-voice/text_llm.py` (+20 lines)
- When LLM returns multiple tool_calls in one response: run them concurrently with `asyncio.gather()`
- No dependency analysis for now — all parallel unless explicitly sequential
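
The dispatch can be sketched in a few lines; `execute_tool` is a placeholder for the real per-tool coroutine, and `return_exceptions=True` keeps one failing tool from cancelling its siblings:

```python
import asyncio

async def dispatch_tool_calls(tool_calls: list, execute_tool) -> list:
    """Run all tool_calls from one model turn concurrently.

    Results come back in the same order as tool_calls; exceptions are
    returned in-slot rather than raised, so every tool gets a result entry.
    """
    return await asyncio.gather(
        *(execute_tool(call) for call in tool_calls),
        return_exceptions=True,
    )
```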

### 9.5 Tests for Phase 9
**NEW FILE:** `services/annie-voice/tests/test_emotional_context.py` (~120 lines)
- Frustrated detection → prompt adjustment
- Confidence <0.7 → no change
- Text keyword fallback works
- 3-turn sliding window requirement

**NEW FILE:** `services/annie-voice/tests/test_advanced_patterns.py` (~120 lines)
- Compaction overlap preserves context continuity
- Response cache hit returns cached response
- Parallel tool execution: 2 tools complete concurrently (mock asyncio.gather)

### Verification Gate 9
```bash
pytest services/annie-voice/tests/test_emotional_context.py -v
pytest services/annie-voice/tests/test_advanced_patterns.py -v
pytest services/annie-voice/tests/ -x

git commit && git push && ssh titan git pull && ./stop.sh annie && ./start.sh annie
bash scripts/health_check.sh
```

---

## Phase 10: Infrastructure Updates + Full E2E Certification (Session 10)

**Goal:** Update start.sh/stop.sh, run full regression, visual certification of every creature and audit trail.

### 10.1 Update start.sh
**MODIFY FILE:** `start.sh` (+30 lines)
- Add `start_kernel_check()` function: verify task_queue.py loaded, /v1/tasks endpoint responds
- Add kernel health to `show_status()`: task queue depth, active task, backend health
- Update `start_annie()`: initialize TaskQueue persistence directory `~/.her-os/annie/tasks/`
- Add `AUTO_ROUTE_ENABLED` env var pass-through

### 10.2 Update stop.sh
**MODIFY FILE:** `stop.sh` (+10 lines)
- Add task queue drain on graceful shutdown: persist all queued tasks before exit
- Add cleanup of stale task JSON files older than 7 days

### 10.3 Full Regression Test Suite
```bash
# ALL tests, ALL services, on laptop:
bash scripts/run_regression.sh
# Expected: ~5,200+ tests (4,900 existing + ~300 new from kernel)

# Dashboard visual tests:
cd services/context-engine/dashboard
npx vitest run                                    # ~1,900 unit tests
npx playwright test                               # ALL E2E specs including new kernel ones
```

### 10.4 Visual Certification (Chrome MCP or Playwright)
**NEW FILE:** `services/context-engine/dashboard/e2e/full-certification.spec.ts` (~300 lines)
This is the master visual certification test that validates EVERYTHING:

1. **Creature Census:** Navigate to creature catalog, screenshot ALL 48 creatures, verify each renders
2. **Creature Activation:** Inject events for each of 48 creatures, verify each lights up
3. **Silhouette Quality:** Compare silhouette screenshots against baseline (pixel diff)
4. **Event Emitter Verification:** For each of 48 creatures, trigger its process, verify SSE event received
5. **Job Queue Panel:** Submit 3 tasks at different priorities, verify panel shows correct ordering
6. **Time-Travel Audit:** Load 100 synthetic audit events, scrub timeline from start to end, verify:
   - golem decisions visible with reasoning
   - clockwork task lifecycle visible
   - phoenix-ash self-healing events
   - djinn Claude Code dispatch events
7. **Going Back in Time:** Navigate to a specific timestamp, verify dashboard shows exact state at that moment
8. **Screenshot Archive:** Save all screenshots to `tests/visual-regression/` for future comparison
9. **Event Schema Validation:** Verify all events match the AuditEvent schema

### 10.5 Titan E2E Smoke Test
**NEW FILE:** `scripts/titan_e2e_smoke.sh` (~100 lines)
Full end-to-end test on Titan with REAL services:
```bash
# 1. Factory reset
bash scripts/factory_reset.sh

# 2. Seed data
bash scripts/seed_profile.sh

# 3. Health check all services
bash scripts/health_check.sh

# 4. Submit test messages via Annie Voice API
curl -X POST http://titan:7860/v1/chat -H "Content-Type: application/json" \
  -d '{"message":"what is the weather in bangalore"}'
# Assert: response received, golem events emitted

# 5. Submit YouTube URL (triggers loop detection if blocked)
curl -X POST http://titan:7860/v1/chat -H "Content-Type: application/json" \
  -d '{"message":"summarize https://youtube.com/watch?v=dQw4w9WgXcQ"}'
# Assert: response within 60s or graceful failure

# 6. Check task queue
curl http://titan:7860/v1/tasks | jq '.'

# 7. Check audit events
curl http://titan:8100/v1/events?creature=golem | jq '.events | length'
# Assert: > 0 golem events

# 8. Check creature events
curl http://titan:8100/v1/events?creature=clockwork | jq '.events | length'

# 9. Health check again
bash scripts/health_check.sh
```

### 10.6 Blueprint Update
**MODIFY FILE:** `docs/ANNIE-KERNEL.md` (~800 lines removed, ~400 added)
- Cut: CFS-style scheduler detail, ADK CONSIDER patterns, plugin class hierarchy, emotional arc as kernel signal
- Add: vLLM crash recovery, supervisor self-monitoring, privacy-aware auditing, voice/text feature matrix, integration test plan
- Update: Implementation roadmap reflecting actual phases completed
- Mark all sections as IMPLEMENTED with commit references

### Verification Gate 10 (FINAL)
```bash
# On laptop — full regression:
bash scripts/run_regression.sh  # ALL ~5,200 tests pass

# Dashboard visual:
cd services/context-engine/dashboard
npx playwright test e2e/full-certification.spec.ts --headed  # Watch it run

# On Titan — full E2E:
git commit && git push && ssh titan git pull
./stop.sh nuke  # Full reset
./start.sh      # Clean start with new architecture
bash scripts/titan_e2e_smoke.sh  # Full E2E smoke test
bash scripts/health_check.sh     # All 12 services healthy

# Visual verification via Chrome MCP (if available):
# Open dashboard, verify all 48 creatures, job queue panel, audit trail
```

---

## Complete File Inventory

### New Files (18)
| File | Phase | Lines | Purpose |
|------|-------|-------|---------|
| `scripts/factory_reset.sh` | 0 | ~120 | Database + state cleanup |
| `scripts/seed_profile.sh` | 0 | ~80 | Rajesh profile data seeding |
| `scripts/generate_kernel_test_data.py` | 0 | ~300 | Synthetic event/task data |
| `scripts/run_regression.sh` | 0 | ~150 | Full test suite runner |
| `scripts/health_check.sh` | 0 | ~60 | Service health verification |
| `scripts/titan_e2e_smoke.sh` | 10 | ~100 | Titan E2E smoke test |
| `services/annie-voice/tool_result.py` | 1 | ~60 | ToolResult dataclass |
| `services/annie-voice/loop_detector.py` | 1 | ~150 | Loop detection |
| `services/annie-voice/error_router.py` | 1 | ~100 | Error classification + fallback chains |
| `services/annie-voice/kernel_hooks.py` | 2 | ~120 | Callback hook system |
| `services/annie-voice/task_queue.py` | 3 | ~300 | Priority queue + task lifecycle |
| `services/annie-voice/resource_pool.py` | 4 | ~150 | Backend routing + health monitor |
| `services/annie-voice/job_control.py` | 6 | ~100 | Task status/cancel/reprioritize tools |
| `services/annie-voice/emotional_context.py` | 9 | ~80 | Emotional arc as hook |
| `dashboard/src/components/JobQueuePanel.tsx` | 7 | ~150 | Job queue dashboard panel |
| `dashboard/e2e/kernel-creatures.spec.ts` | 7 | ~200 | Creature visual E2E tests |
| `dashboard/e2e/audit-trail.spec.ts` | 7 | ~150 | Audit trail E2E tests |
| `dashboard/e2e/full-certification.spec.ts` | 10 | ~300 | Master visual certification |

### Modified Files (20)
| File | Phases | Purpose |
|------|--------|---------|
| `text_llm.py` | 1,2,5,8,9 | Supervised loop, hooks, validation, watchdog, parallel tools |
| `tools.py` | 1,2 | ToolResult returns |
| `memory_tools.py` | 2 | ToolResult returns |
| `code_tools.py` | 2 | ToolResult returns |
| `browser_agent_tools.py` | 2 | ToolResult returns |
| `subagent_tools.py` | 2,5 | ToolResult + sub-agent tool loops |
| `visual_tools.py` | 2 | ToolResult returns |
| `observability.py` | 1,3,4,8 | Creature registration + audit schema |
| `agent_context.py` | 3 | TaskQueue integration |
| `agent_scheduler.py` | 3 | Task priority submission |
| `server.py` | 3 | TaskQueue singleton + /v1/tasks |
| `bot.py` (telegram) | 4 | Auto-routing shadow + automatic mode |
| `bot.py` (voice) | 5 | Voice loop detection |
| `compaction.py` | 9 | Overlap parameter |
| `start.sh` | 10 | Kernel health + task dir init |
| `stop.sh` | 10 | Task queue drain + cleanup |
| `chronicler.py` | 7 | 6 new creatures in registry |
| `registry.ts` | 7 | Dashboard creature entries |
| `creature-catalog.html` | 7 | SVG silhouettes |
| `ANNIE-KERNEL.md` | 10 | Blueprint update with results |

### New Test Files (21)
| File | Phase | Lines | Tests |
|------|-------|-------|-------|
| `test_loop_detector.py` | 1 | ~250 | Loop detection (6 scenarios) |
| `test_error_router.py` | 1 | ~150 | Error routing (5 scenarios) |
| `test_tool_result.py` | 1 | ~80 | ToolResult dataclass |
| `test_supervised_loop.py` | 1 | ~200 | Integration: supervised loop E2E |
| `test_kernel_hooks.py` | 2 | ~200 | Hook lifecycle |
| `test_all_tools_typed.py` | 2 | ~300 | All tools return ToolResult |
| `test_task_queue.py` | 3 | ~300 | Queue ordering/aging/persistence |
| `test_resource_pool.py` | 4 | ~200 | Backend classification/health |
| `test_auto_routing.py` | 4 | ~100 | Telegram auto-routing |
| `test_subagent_loops.py` | 5 | ~200 | Sub-agent tool loops |
| `test_voice_loop_detection.py` | 5 | ~100 | Voice path loop detection |
| `test_result_validation.py` | 5 | ~100 | Result relevance scoring |
| `test_job_control.py` | 6 | ~150 | Job control tools + preemption |
| `test_audit_completeness.py` | 8 | ~200 | Audit coverage + privacy |
| `test_emotional_context.py` | 9 | ~120 | Emotional arc hook |
| `test_advanced_patterns.py` | 9 | ~120 | Compaction overlap, caching, parallel |
| `test_kernel_creatures.py` | 7 | ~100 | 48 creature registry |
| `kernel-creatures.spec.ts` | 7 | ~200 | Visual E2E: creatures |
| `audit-trail.spec.ts` | 7 | ~150 | Visual E2E: audit timeline |
| `full-certification.spec.ts` | 10 | ~300 | Master visual certification |
| `jobQueuePanel.test.ts` | 7 | ~100 | Job queue panel unit test |

---

## Session Estimates

| Phase | Session | Lines (new+mod+test) | Risk | Gate |
|-------|---------|---------------------|------|------|
| 0: Foundation | 0.5 | ~710 | LOW | Scripts work, data loads |
| 1: Loop Detection | 1 | ~1,100 | MEDIUM | YouTube 403 stops in <30s |
| 2: Hooks + Full ToolResult | 1 | ~700 | MEDIUM | All tools return ToolResult |
| 3: Task Queue | 1 | ~750 | HIGH | Priority ordering works |
| 4: Resource Pool | 0.5 | ~500 | LOW | Shadow mode logs correctly |
| 5: Sub-agents + Voice | 1 | ~600 | HIGH | Voice loop detection works |
| 6: Job Control | 0.5 | ~300 | LOW | /v1/tasks shows tasks |
| 7: Creatures + Dashboard | 1.5 | ~1,200 | MEDIUM | All 48 creatures render |
| 8: Auditability | 0.5 | ~250 | LOW | Privacy-safe audit events |
| 9: Emotional + Advanced | 0.5 | ~350 | LOW | Emotional hook fires |
| 10: Infrastructure + Cert | 1 | ~800 | MEDIUM | Full E2E passes on Titan |
| **TOTAL** | **~9** | **~7,260** | | |

---

## Anti-Pattern Guards

**DO NOT:**
- Use `threading.Lock` — all async code uses `asyncio.Lock` or `asyncio.PriorityQueue`
- Mutate shared state — all dataclasses are frozen, state transitions return new objects
- Create plugin class hierarchies — use simple function hook lists
- Add LLM calls for routing decisions — all routing is programmatic (regex/keywords)
- Skip voice path testing — every text change must be verified against voice path
- Defer testing — every phase has its own test files, all must pass before proceeding
- Use `rsync` to deploy — always `git push` + `ssh titan git pull`
- Add creatures without SVG silhouettes — every creature must render on dashboard
- Store message content in audit events — sanitize before logging
- Break existing tests — zero regressions on every phase
