# Plan: Adopt ARC Orchestration Patterns into Annie (v4 — Post-Review #2)

**Date:** 2026-03-20
**Reference:** `docs/RESEARCH-AGENT-ORCHESTRATION.md`
**Adversarial Review #1:** 31 issues (core phases). All addressed.
**Adversarial Review #2:** 28 issues (ClaudeCodeClient, Research Orchestrator, NotebookLM). All addressed.
**Total: 59 issues found across 2 reviews. 0 unresolved.**

---

## Context

This plan adopts AutoResearchClaw's "body" (multi-stage orchestration, evolution, checkpointing) into Annie's agent framework. Annie gets ARC-quality research capability while staying **free to run by default** — all local models, zero API costs. Optional Claude Code CLI integration for users with MAX subscription.

**Key design principles:**
1. **Free by default** — Nano + Super (local) handle everything. No API keys required. Open-source friendly.
2. **Optional Claude Code CLI** — users with MAX subscription can set `CLAUDE_CODE_ENABLED=true`. Research stages 4-6 route to `claude -p` for Opus-quality synthesis/writing at $0.
3. **ARC's body, companion's soul** — 7-stage research pipeline (not 23). No peer review, no LaTeX, no Docker. Output is a NotebookLM-ready Markdown report.
4. **Orchestrator direct execution** → stages call `_execute_direct()` inline, not via lane queue (prevents deadlock)
5. **Evolution store with Lock + in-memory cache** → prevents JSONL race condition

---

## Model Tier Architecture

```
model_tier: nano    →  Nemotron Nano on Titan     (always available, default)
model_tier: super   →  Nemotron Super on Beast    (if Beast healthy, else Nano fallback)
model_tier: claude  →  Claude Code CLI `claude -p` (if CLAUDE_CODE_ENABLED, else Super fallback)

Fallback chain: claude → super → nano (graceful degradation, never fails)
```

**Settings (annie-voice config):**
```
CLAUDE_CODE_ENABLED=false          # Set true if user has MAX subscription
CLAUDE_CODE_PATH=claude            # Path to claude CLI binary
CLAUDE_CODE_TIMEOUT=300            # Max seconds per call
```

**Open-source default: everything runs on Nano + Super. $0. No accounts needed.**

---

## Implementation Order (7 phases)

| Phase | What | Files | Tests | Effort | Deps |
|-------|------|-------|-------|--------|------|
| 1 | Evolution/lessons (Lock + cache + sanitization) | `evolution.py` | ~25 | 3 hr | None |
| 2 | Typed adapters (typed call dataclasses) | `adapters.py` | ~20 | 2 hr | None |
| 3 | Model tier routing + ClaudeCodeClient | modify `agent_discovery.py`, `agent_context.py`, new `claude_code_client.py` | ~12 | 1.5 hr | None |
| 4 | Orchestrator + gate + meditation + **research + NotebookLM** | `orchestrator.py`, `approval.py`, `meditation_orchestrator.py`, `research_orchestrator.py` | ~45 | 8 hr | P1, P2, P3 |
| 5 | Checkpoint/resume (with completed flag) | `checkpoint.py` | ~12 | 2 hr | P4 |
| 6 | Integration E2E test harness | `test_orchestration_e2e.py` | ~20 | 3 hr | All |
| 7 | Cost tracking (tokens only — all tiers are $0) | `cost_tracker.py` | ~8 | 1 hr | None |

**Total: 10 files, ~142 tests, ~20.5 hr**

**Parallel:** Phase 1 + 2 + 3 + 7 (no deps) → Phase 4 (needs 1+2+3) → Phase 5 (needs 4) → Phase 6 (all)

---

## Phase 1: Evolution/Lessons System

### Critical Fixes from Review
- **JSONL race condition** → `asyncio.Lock` on all writes + in-memory cache (load once at startup)
- **Overlay injection timing** → BEFORE `build_agent_prompt()`, not after LLM call
- **Prompt injection via lessons** → sanitize lessons (strip XML, cap 500 chars), wrap in `<prior_lessons>` block
- **Overlay budget cap** → 500 tokens max, truncate if exceeded
- **Quality gate on extraction** → validate non-empty + relevant before appending

### Files
**`services/annie-voice/evolution.py`** (~250 lines)
- `LessonCategory` enum: TONE, TIMING, EXTRACTION, SEARCH, APPROVAL, SYSTEM
- `LessonEntry` frozen dataclass with `to_dict()`/`from_dict()` roundtrip
- `EvolutionStore` class:
  - `__init__(store_dir)` — loads JSONL into in-memory `list[LessonEntry]` cache
  - `append(entry)` — acquires `asyncio.Lock`, appends to cache + JSONL
  - `query_for_agent(name, max_age_days=90)` — filters cache (no file I/O)
  - `build_overlay(agent_name, max_lessons=5, max_tokens=500)` — returns sanitized prompt fragment wrapped in `<prior_lessons>` block
- `extract_lessons(agent_name, result, spec)` — extracts from run result, validates quality
- `_sanitize_lesson(text)` — strip XML/angle brackets, cap 500 chars, strip newlines
- `get_global_store()` — module-level singleton accessor used by `agent_context.py`
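A minimal sketch of the sanitization and overlay helpers described above, assuming a simplified `build_overlay` that takes lessons directly (the real one queries the store by agent name) and a ~4-chars-per-token budget heuristic:

```python
import re


def _sanitize_lesson(text: str) -> str:
    """Strip XML/angle brackets, collapse newlines, cap at 500 chars."""
    text = re.sub(r"<[^>]*>", "", text)            # drop any XML/HTML tags
    text = text.replace("<", "").replace(">", "")  # stray angle brackets
    text = " ".join(text.split())                  # collapse whitespace/newlines
    return text[:500]


def build_overlay(lessons: list[str], max_lessons: int = 5, max_tokens: int = 500) -> str:
    """Wrap sanitized lessons in a <prior_lessons> block, under a token budget."""
    budget_chars = max_tokens * 4                  # rough chars-per-token heuristic
    lines, used = [], 0
    for lesson in lessons[:max_lessons]:
        clean = _sanitize_lesson(lesson)
        if used + len(clean) > budget_chars:
            break                                  # hard cap — truncate overlay
        lines.append(f"- {clean}")
        used += len(clean)
    if not lines:
        return ""
    return "<prior_lessons>\n" + "\n".join(lines) + "\n</prior_lessons>"
```

The `<prior_lessons>` wrapper plus tag-stripping is what closes the prompt-injection relay: a lesson cannot smuggle its own XML structure into the agent prompt.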

### Integration (actual code changes)
**`agent_context.py` `_execute()` line ~675:**
```python
# BEFORE build_agent_prompt
evolution_overlay = ""
try:
    from evolution import get_global_store
    store = get_global_store()
    if store:
        evolution_overlay = store.build_overlay(spec.name)
except Exception:
    pass  # evolution is optional

if evolution_overlay:
    messages[0]["content"] += f"\n\n{evolution_overlay}"
```

**`agent_context.py` `_execute()` after on_complete (~line 780):**
```python
# Extract lessons from completed run
try:
    from evolution import get_global_store, extract_lessons
    store = get_global_store()
    if store and result:
        lessons = extract_lessons(spec.name, result, spec)
        for lesson in lessons:
            await store.append(lesson)
except Exception:
    pass  # evolution is optional
```

### Tests (~25)
`tests/test_evolution.py` — roundtrip, time-decay, quality gate, cache, JSONL corruption recovery, sanitization, overlay token cap

---

## Phase 2: Typed Adapters

### Critical Fix from Review
- **Typed call dataclasses** (not raw tuples) for recording stubs

### Files
**`services/annie-voice/adapters.py`** (~200 lines)
- 4 Protocol classes: `EmailAdapter`, `CalendarAdapter`, `ClipboardAdapter`, `BookingAdapter`
- 4 Recording stubs with typed call dataclasses:
```python
@dataclass(frozen=True)
class EmailListCall:
    account: str
    limit: int

class RecordingEmailAdapter:
    calls: list[EmailListCall | EmailReadCall | EmailDraftCall | EmailSendCall]
```
- `AdapterBundle` dataclass with recording defaults
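The recording pattern can be sketched as follows; the `list_messages`/`send` method names and signatures are assumptions, not the real `EmailAdapter` protocol:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EmailListCall:
    account: str
    limit: int


@dataclass(frozen=True)
class EmailSendCall:
    to: str
    subject: str


@dataclass
class RecordingEmailAdapter:
    """Stub that records every call as a typed dataclass instead of a raw tuple."""
    calls: list = field(default_factory=list)

    def list_messages(self, account: str, limit: int = 10) -> list:
        self.calls.append(EmailListCall(account=account, limit=limit))
        return []  # recording stub returns an empty inbox

    def send(self, to: str, subject: str) -> None:
        self.calls.append(EmailSendCall(to=to, subject=subject))


adapter = RecordingEmailAdapter()
adapter.list_messages("work", limit=5)
```

Tests then assert on typed instances (`EmailListCall(account="work", limit=5)`) rather than positional tuples, which survives signature changes.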

### Tests (~20)
`tests/test_adapters.py` — protocol compliance, call recording, typed dataclass verification

---

## Phase 3: Model Tier Routing + ClaudeCodeClient

### What
Add `model_tier` field to AgentDefinition. Add `ClaudeCodeClient` that calls `claude -p` CLI. Graceful fallback: claude→super→nano.

### Files to Modify
**`agent_discovery.py`** — Add to AgentDefinition:
```python
model_tier: str = "nano"  # nano | super | claude
```

**`agent_context.py` `_get_client()`** — Replace static fallback with tier routing:
```python
async def _get_client(self, model_tier: str = "nano"):
    # Fallback chain: claude → super → nano (never fails)
    if model_tier == "claude" and self._claude_enabled:
        return self._claude_client or self._create_claude_client()
    if model_tier in ("claude", "super") and self._beast_healthy:
        return self._beast_client or self._create_beast_client()
    return self._llm_client or self._create_local_client()
```

**`agent_context.py` `__init__()`** — Add Claude Code config:
```python
self._claude_enabled = os.getenv("CLAUDE_CODE_ENABLED", "false").lower() == "true"
self._claude_client: Any = None
```

**`agent_context.py` `_execute()`** — Pass tier from spec metadata:
```python
model_tier = spec.metadata.get("model_tier", "nano")
client = await self._get_client(model_tier)
```

### Files to Create
**`services/annie-voice/claude_code_client.py`** (~80 lines)

**Review fixes applied:**
- **BUG-2 fix:** Pipe prompt via stdin (not argv) — avoids ARG_MAX for large research prompts
- **Flaw 1 fix:** Include ALL message roles (system, user, assistant) — don't drop assistant context
- **Missing 1 fix:** Validate output doesn't contain known auth error patterns
- **MAINT-3 fix:** `_create_claude_client()` wrapped in try/except for fallback safety
- **SEC-1 fix:** Sanitize web-sourced content before passing to CLI

```python
import asyncio
import os


class ClaudeCodeClient:
    """LLM client using Claude Code CLI (MAX subscription, $0).

    Pipes prompt via stdin to avoid ARG_MAX limits.
    Falls back gracefully if claude binary not found or call fails.
    """

    def __init__(self, binary="claude", timeout=300):
        self._binary = os.getenv("CLAUDE_CODE_PATH", binary)
        self._timeout = int(os.getenv("CLAUDE_CODE_TIMEOUT", str(timeout)))

    async def create_completion(self, messages, max_tokens=4000):
        prompt = self._messages_to_prompt(messages)
        try:
            proc = await asyncio.create_subprocess_exec(
                self._binary, "-p",
                "--output-format", "text",
                stdin=asyncio.subprocess.PIPE,    # Pipe via stdin, NOT argv
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            stdout, stderr = await asyncio.wait_for(
                proc.communicate(input=prompt.encode("utf-8")),
                timeout=self._timeout,
            )
        except FileNotFoundError:
            raise RuntimeError(f"Claude CLI not found at '{self._binary}'")
        except asyncio.TimeoutError:
            proc.kill()  # Kill zombie process on timeout
            await proc.wait()  # Reap the killed process before raising
            raise RuntimeError(f"claude -p timed out after {self._timeout}s")

        result = stdout.decode("utf-8").strip()
        if proc.returncode != 0:
            raise RuntimeError(f"claude -p failed (rc={proc.returncode}): {stderr.decode()[:200]}")
        # Validate: auth errors sometimes exit 0 with error text
        if not result or "authentication" in result.lower()[:100]:
            raise RuntimeError(f"claude -p returned invalid output: {result[:100]}")
        return result

    def _messages_to_prompt(self, messages):
        # Include ALL roles — don't drop assistant context (review fix)
        parts = []
        for m in messages:
            role = m.get("role", "user")
            content = m.get("content", "")
            if role == "system":
                parts.append(content)
            elif role == "user":
                parts.append(f"Human: {content}")
            elif role == "assistant":
                parts.append(f"Assistant: {content}")
        return "\n\n".join(parts)

    async def close(self):
        pass
```

### Tests (~12)
`tests/test_model_routing.py` — nano→local, super→Beast, claude+enabled→CLI, claude+disabled→Super fallback, fallback chain E2E

---

## Phase 4: Orchestrator + Gate + Meditation + Research

### Critical Fixes from Review
- **Direct execution** — stages call `_execute_direct()` inline, NOT `runner.submit()` (prevents lane deadlock)
- **Gate timeout** — 5 min max, writes pending file to disk, park-and-release pattern
- **No dual meditation paths** — remove old `build_*_meditation_spec()` in this phase
- **Dedup lock** — per-agent-name lock prevents concurrent runs of same orchestrator

### Files
**`services/annie-voice/orchestrator.py`** (~350 lines)
- `StageContract` frozen dataclass: name, model_tier, dod, max_retries, gate_tier (0=no gate), retry_target (stage to re-run on DoD failure — BUG-4)
- `OrchestratorResult` dataclass: success, output, stages, total_tokens, elapsed
- `AgentOrchestrator` base class:
  - `orchestrate(context, runner) -> OrchestratorResult`
  - `_run_stage(contract, spec, runner)` — calls `runner._execute_direct(spec)` (not `submit()`)
  - `_gate(contract, description)` — if tier >= 3: write pending file, wait with timeout
  - `_accumulate(tokens_in, tokens_out)`

**`services/annie-voice/approval.py`** (~120 lines)
- `ApprovalTier` enum: SILENT(1), INFORM(2), PROPOSE_CONFIRM(3), DRAFT_REVIEW(4), DOUBLE_CONFIRM(5)
- `request_approval(tier, description, timeout_s=300) -> ApprovalResult`
- `classify_approval_tier(action_type, default=PROPOSE_CONFIRM)` — logs WARNING for unknown
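A sketch of the tier enum and classifier; the action-type mapping here is illustrative, not the real table in `approval.py`:

```python
import enum
import logging

logger = logging.getLogger(__name__)


class ApprovalTier(enum.IntEnum):
    SILENT = 1
    INFORM = 2
    PROPOSE_CONFIRM = 3
    DRAFT_REVIEW = 4
    DOUBLE_CONFIRM = 5


# Illustrative mapping only — the real table lives in approval.py.
_TIER_BY_ACTION = {
    "read": ApprovalTier.SILENT,
    "draft_email": ApprovalTier.DRAFT_REVIEW,
    "send_email": ApprovalTier.DOUBLE_CONFIRM,
}


def classify_approval_tier(action_type: str,
                           default: ApprovalTier = ApprovalTier.PROPOSE_CONFIRM) -> ApprovalTier:
    """Map an action type to a tier; unknown types log a WARNING and get the safe default."""
    tier = _TIER_BY_ACTION.get(action_type)
    if tier is None:
        logger.warning("Unknown action_type %r — defaulting to %s", action_type, default.name)
        return default
    return tier
```

Defaulting unknown actions to PROPOSE_CONFIRM (rather than SILENT) fails safe: a new action type gets a human in the loop until someone classifies it.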

**`services/annie-voice/meditation_orchestrator.py`** (~120 lines)
- 3 stages: Gather → Reflect → Process
- Replaces `meditation.py` `build_*_meditation_spec()` (old path removed)

**`services/annie-voice/research_orchestrator.py`** (~250 lines)
- 7-stage research pipeline:

```python
class ResearchOrchestrator(AgentOrchestrator):
    """7-stage research pipeline. Produces NotebookLM-ready Markdown report."""

    STAGES = [
        StageContract(name="scope", model_tier="nano", max_retries=1,
                      dod="3-5 sub-questions generated from topic"),
        StageContract(name="discover", model_tier="nano", max_retries=1,
                      dod="15-25 source URLs collected via web search"),
        StageContract(name="extract", model_tier="super", max_retries=1,
                      dod="Knowledge cards extracted from each source"),
        StageContract(name="synthesize", model_tier="claude", max_retries=2,
                      dod="Cross-source synthesis with agreements/conflicts"),
        StageContract(name="write", model_tier="claude", max_retries=2,
                      dod="Structured Markdown report with citations"),
        StageContract(name="review", model_tier="claude", max_retries=1,
                      retry_target="write",  # BUG-4 fix: failed review retries the write stage
                      dod="Self-critic score >= 7/10"),
        StageContract(name="deliver", model_tier="nano", gate_tier=2,
                      dod="Report saved, summary sent via Telegram"),
    ]
    # model_tier="claude" falls back to "super" if CLAUDE_CODE_ENABLED=false
    # Falls back further to "nano" if Beast unhealthy. Always works.
```

**Review #2 fixes applied to stages:**
- **Alt 2 (adopted):** Stage outputs written to disk as JSON — not accumulated in LLM context
- **PERF-1 fix:** `asyncio.Semaphore(3)` limits concurrent LLM calls in extract stage
- **SEC-4 fix:** Topic slug sanitized: `re.sub(r'[^a-z0-9-]', '-', topic.lower())[:80]`
- **Missing 4 fix:** Inter-stage data wrapped in XML blocks, web content sanitized
- **BUG-4 fix:** Review stage `retry_target="write"` — retries Stage 5, not itself

**Stage details (disk-based output per stage):**
- Stage 1 (SCOPE): Parse topic → 3-5 sub-questions. Nano. → `stage-1-scope.json`
- Stage 2 (DISCOVER): Web search per sub-Q via SearXNG → 15-25 URLs. → `stage-2-urls.json`
- Stage 3 (EXTRACT): `batch_fetch_pages(Semaphore=3)` → `KnowledgeCard` per page. Super. → `stage-3-cards.json`. Web content sanitized (strip `<script>`, limit 5K chars).
- Stage 4 (SYNTHESIZE): Read cards from disk. Merge → themes, conflicts. Claude (or Super fallback). → `stage-4-synthesis.json`. Input wrapped in `<knowledge_cards>` XML.
- Stage 5 (WRITE): Read synthesis from disk. Structured Markdown report. Claude. → `report.md`
- Stage 6 (REVIEW): Read report from disk. Self-critic 1-10. If < 7, `retry_target="write"` retries Stage 5 with feedback. Claude. → `stage-6-review.json`
- Stage 7 (DELIVER): Save final `report.md` + `sources.json` to `~/.her-os/annie/research/<sanitized-slug>/`. Telegram summary immediately. NotebookLM podcast as fire-and-forget background (see NotebookLM section below).
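The retry-jump behavior (BUG-4 fix: a failed review re-runs the write stage, not itself) can be illustrated with a simplified stage loop; `run_stage` is a stand-in callable and per-stage retry counting is omitted:

```python
def run_stages(stages, run_stage, max_loops=10):
    """stages: list of (name, retry_target) tuples; run_stage(name) -> bool (DoD met).

    On DoD failure, jump back to retry_target if set; otherwise fail the pipeline.
    max_loops bounds total iterations so a never-passing stage can't loop forever.
    """
    i, loops = 0, 0
    while i < len(stages) and loops < max_loops:
        loops += 1
        name, retry_target = stages[i]
        if run_stage(name):
            i += 1                      # DoD met — advance to next stage
        elif retry_target is not None:  # jump back to the named stage
            i = next(j for j, (n, _) in enumerate(stages) if n == retry_target)
        else:
            return False                # no retry target — pipeline fails
    return i >= len(stages)
```

With `[("write", None), ("review", "write")]`, a review failure re-executes write (with critic feedback in the real pipeline) and then review again.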

**Topic slug sanitization:**
```python
def _sanitize_topic_slug(topic: str) -> str:
    return re.sub(r'[^a-z0-9-]', '-', topic.lower().strip())[:80].strip('-')
```

**Supporting additions:**
- `tools.py` — add `batch_fetch_pages(urls, max_concurrent=3)` (~40 lines, Semaphore-limited)
- `KnowledgeCard` dataclass (url, title, claims, quotes, relevance)
- All stage outputs written to `~/.her-os/annie/research/<slug>/` as JSON (not in LLM context)
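A sketch of the Semaphore-limited batch fetch (PERF-1 fix); `fetch_one` here is a stand-in for the real httpx-based page fetcher that also strips `<script>` tags and caps content length:

```python
import asyncio


async def batch_fetch_pages(urls, fetch_one, max_concurrent=3):
    """Fetch all urls with at most max_concurrent in flight; failures become None."""
    sem = asyncio.Semaphore(max_concurrent)

    async def _bounded(url):
        async with sem:               # cap concurrency — avoids vLLM/site 503s
            try:
                return await fetch_one(url)
            except Exception:
                return None           # skip failed pages, don't abort the batch

    # gather preserves input order, so results align with urls
    return await asyncio.gather(*(_bounded(u) for u in urls))
```

Returning `None` for failed pages (instead of raising) lets Stage 3 build knowledge cards from whatever subset of the 15-25 URLs actually fetched.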

**NotebookLM Integration** (optional dependency — Annie works without it):

**Review #2 fixes applied:**
- **BUG-3 fix:** Correct package name (`notebooklm-py`), correct class (`NotebookLMClient`), correct async API
- **PERF-2 fix:** Audio generation runs in `asyncio.to_thread()` with 10-min timeout. Fire-and-forget — report delivered immediately, audio as follow-up
- **Missing 5 fix:** Session expiry handled by timeout. `NOTEBOOKLM_ENABLED` env guard
- **SEC-3 fix:** Document credential storage location warning

```
Dependency: pip install notebooklm-py     (unofficial API by Tang Ling — correct PyPI name)
One-time:   notebooklm login              (opens Chrome, caches Google credentials)
Env guard:  NOTEBOOKLM_ENABLED=false      (default off, user opts in)
Source:     https://www.youtube.com/watch?v=usTeU4Uh0iM
Cost:       $0
WARNING:    Google OAuth tokens cached in ~/.config/notebooklm/ — user should be aware
```

```python
# In research_orchestrator.py Stage 7:
async def _deliver_to_notebooklm(self, report_path, source_urls, topic):
    """Upload research to NotebookLM and generate podcast. Optional, async-safe."""
    if os.getenv("NOTEBOOKLM_ENABLED", "false").lower() != "true":
        return None
    try:
        from notebooklm import NotebookLMClient

        async def _run_notebooklm():
            async with await NotebookLMClient.from_storage() as client:
                nb = await client.notebooks.create(f"Annie Research: {topic}")
                # Add report as text source
                report_text = report_path.read_text("utf-8")
                await client.sources.add_text(nb.id, report_text, wait=True)
                # Add source URLs (max 20)
                for url in source_urls[:20]:
                    try:
                        await client.sources.add_url(nb.id, url, wait=True)
                    except Exception:
                        pass  # Skip failed URL sources
                # Generate podcast audio (takes 2-10 min on Google servers)
                await client.artifacts.generate_audio(nb.id)
                audio_status = await client.artifacts.wait_for_completion(
                    nb.id, timeout=600
                )
                if audio_status.ready:
                    audio_path = await client.artifacts.download_audio(
                        nb.id, report_path.parent / "podcast.mp3"
                    )
                    return {"notebook_id": nb.id, "audio_path": str(audio_path)}
                return {"notebook_id": nb.id, "audio_path": None}

        # Run in thread to avoid blocking event loop (Selenium-based library)
        result = await asyncio.wait_for(
            asyncio.to_thread(asyncio.run, _run_notebooklm()),
            timeout=720,  # 12 min max (audio gen can take 10 min)
        )
        return result
    except ImportError:
        logger.info("notebooklm-py not installed — skipping NotebookLM")
        return None
    except asyncio.TimeoutError:
        logger.warning("NotebookLM audio generation timed out (>12 min)")
        return None
    except Exception as exc:
        logger.warning("NotebookLM delivery failed: %s", exc)
        return None
```

**Design: fire-and-forget audio, immediate report.**
- Report delivered via Telegram immediately after Stage 6 (REVIEW)
- NotebookLM audio generation runs as a background follow-up
- If audio completes, delivered as Telegram voice message
- If audio fails/times out, user already has the report — no data loss
- `NOTEBOOKLM_ENABLED=false` by default — user must opt in
- Annie never blocks on NotebookLM. It's a bonus, not a requirement.
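The fire-and-forget shape above might be sketched like this; `send_report` and `generate_audio` are stand-ins for the Telegram and NotebookLM calls:

```python
import asyncio

# Strong references keep background tasks from being garbage-collected mid-flight.
_background_tasks = set()


async def deliver(report_text, send_report, generate_audio):
    """Deliver the report immediately; audio generation is a detached follow-up."""
    await send_report(report_text)                 # user gets the report now
    task = asyncio.create_task(generate_audio())   # audio runs in the background
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task
```

The returned task is only awaited by whoever wants the audio result; the research pipeline itself completes as soon as the report is sent.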

**Output format (NotebookLM-friendly Markdown):**
```markdown
# [Topic]

## Executive Summary
[3-4 sentences — the "podcast intro"]

## Key Findings
### 1. [Sub-question 1]
[Finding with citations]
> "Direct quote" — Source Name

## What I'm Not Sure About
[Gaps, contradictions]

## Sources
1. [Title](URL) — Key contribution: ...

---
*Generated by Annie. N sources consulted. Quality: X/10 (self-assessed)*
*NotebookLM notebook: [link] | Podcast audio: [attached]*
```

### New method in AgentRunner
**`agent_context.py`** — Add `_execute_direct()`:
```python
async def _execute_direct(self, spec: AgentSpec, *, gate_wait_s: float = 0.0) -> str:
    """Execute agent inline (no queue). Used by orchestrators for sub-stages."""
    return await self._execute(spec, gate_wait_s=gate_wait_s)
```

### Tests (~45)
- `tests/test_orchestrator.py` — base class, stage contracts, DoD, retry, accumulation
- `tests/test_approval.py` — tier classification, timeout, unknown type WARNING
- `tests/test_meditation_orchestrator.py` — 3-stage flow, context gathering, workspace writes
- `tests/test_research_orchestrator.py` — 7-stage flow, knowledge cards, batch fetch, report format, self-critic retry, fallback tiers, NotebookLM delivery (mock: installed+success, installed+login-fail, not-installed→graceful skip)

---

## Phase 5: Checkpoint/Resume

### Critical Fix from Review
- **`completed` flag** in checkpoint JSON
- **Scheduler integration** — `_scan_incomplete_checkpoints()` at startup, dedup against cron

### Files
**`services/annie-voice/checkpoint.py`** (~120 lines)
- `write_checkpoint(run_dir, stage_name, run_id, completed=False)`
- `read_checkpoint(run_dir) -> CheckpointState | None`
- `CheckpointState` dataclass: `last_stage, run_id, timestamp, completed`
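A sketch of the atomic write/read cycle (tempfile plus `os.replace`); field names come from the plan, while the `checkpoint.json` file name is an assumption:

```python
import json
import os
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CheckpointState:
    last_stage: str
    run_id: str
    timestamp: float
    completed: bool = False


def write_checkpoint(run_dir: Path, stage_name: str, run_id: str, completed=False):
    state = CheckpointState(stage_name, run_id, time.time(), completed)
    tmp = run_dir / "checkpoint.json.tmp"
    tmp.write_text(json.dumps(state.__dict__), "utf-8")
    os.replace(tmp, run_dir / "checkpoint.json")   # atomic rename — no torn writes


def read_checkpoint(run_dir: Path):
    path = run_dir / "checkpoint.json"
    try:
        state = CheckpointState(**json.loads(path.read_text("utf-8")))
    except (OSError, json.JSONDecodeError, TypeError):
        return None                                # missing or corrupt → start fresh
    return None if state.completed else state      # completed runs never resume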

### Tests (~12)
`tests/test_checkpoint.py` — atomic write, resume mid-pipeline, completed=True→None, corrupt→None

---

## Phase 6: Integration E2E Harness

### Already Built (56 tests passing)
`tests/test_orchestration_e2e.py` — 56 automated integration tests covering:
- AgentRunner pipeline (submit→execute→callback, voice gate, timeout, run summary)
- Model tier routing (nano, beast fallback, beast healthy)
- Session broker cross-channel (voice→text visibility, persistence, day boundary)
- Meditation E2E (journal write, pending approval, empty result)
- Schedule task (YAML creation, result delivery)
- Health monitor state machine (self-contained logic)
- Evolution store (JSONL roundtrip, time decay, corruption recovery, sanitization)
- Context window FULL LIFECYCLE (growth→threshold→Tier1→Tier2→compact→continue)
- Agent budget enforcement (nano cap, xl capacity, per-item truncation, trim boundaries)
- Multi-agent coordination (sequential agents, queue depth, status)
- Topic slug sanitization (path traversal, special chars, length cap)

### Mock Infrastructure (already built)
- `FakeLLM` — multi-response, raises `AssertionError` on exhaustion, tracks calls
- `FakeTelegramBot` — records messages + documents without sending
- `workspace_dir`, `evolution_dir`, `research_dir` fixtures (tmp_path based)

### Still Needed (from hostile test reviewer — add in this phase)

**7 untested code paths:**
- `_check_beast_health()` real HTTP (mock httpx, test URL construction)
- `_lane_worker()` error branch (inject exception from `_execute`)
- `trigger_now()` success + failure paths
- `_scheduler_loop()` + `_fire_job()` (mock clock, verify job fires)
- `AgentDiscovery` hot-reload (write YAML, verify re-scan)
- `session_broker` double-eviction summary merge
- `meditation.py` TOOL_TIP + USER_FACT extraction

**4 false confidence fixes:**
- Health monitor tests → import actual code or add sys.path
- Voice gate test → assert `gate_wait_s > 0` on run summary
- Beast routing test → assert `beast_llm.call_count == 1`
- Schedule validation tests → test actual validation function

**5 missing integration scenarios:**
- Scheduler fires while voice active → verify gate timeout behavior
- Session 200-msg eviction during voice session
- Scheduler start with pre-populated discovery
- Context overflow during multi-stage orchestrator
- Beast unhealthy mid-pipeline (between stages)

**5 edge cases:**
- Empty YAML agent file (skip, don't crash)
- 4 AM IST boundary race (two requests at boundary)
- Evolution JSONL >10MB (performance)
- `on_complete` exception (don't kill lane worker)
- XL budget with empty system_prompt

### Total tests target: ~80 (56 existing + ~24 from reviewer gaps)

---

## Phase 7: Cost Tracking

### Critical Fix from Review
- **In-memory accumulator + periodic flush**, `asyncio.Lock`, configurable pricing

### Files
**`services/annie-voice/cost_tracker.py`** (~100 lines)
- In-memory accumulator, flush every 5 min + on shutdown
- Load pricing from `~/.her-os/annie/model-pricing.json`
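A sketch of the accumulator; the per-tier token map and output file layout are assumptions:

```python
import asyncio
import json
from collections import defaultdict
from pathlib import Path


class CostTracker:
    """In-memory token accumulator; flushed periodically and on shutdown."""

    def __init__(self, path: Path, flush_interval_s: float = 300.0):
        self._path = path
        self._interval = flush_interval_s
        self._lock = asyncio.Lock()
        self._tokens = defaultdict(int)    # tier -> total tokens

    async def record(self, tier: str, tokens_in: int, tokens_out: int):
        async with self._lock:             # serialize concurrent writers
            self._tokens[tier] += tokens_in + tokens_out

    async def flush(self):
        async with self._lock:
            self._path.write_text(json.dumps(dict(self._tokens)), "utf-8")

    async def run(self):
        """Background loop: flush every interval; shutdown calls flush() once more."""
        while True:
            await asyncio.sleep(self._interval)
            await self.flush()
```

Since every tier is $0, the tracker only counts tokens; the pricing file would translate counts to hypothetical API cost for comparison.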

### Tests (~8)

---

## Pre-Mortem (with mitigations)

| # | Scenario | Category | Mitigation |
|---|----------|----------|------------|
| 1 | JSONL corruption from concurrent writes | Resource | `asyncio.Lock` + in-memory cache |
| 2 | New agent always routes to Nano | Silent | `model_tier` YAML field — explicit per-agent |
| 3 | Approval timeout wastes compute | Temporal | 5-min timeout → REJECTED, checkpoint preserved |
| 4 | Checkpoint dir fills disk | Resource | Cleanup on completion, 30-day retention |
| 5 | Evolution overlay exceeds token budget | Resource | 500-token hard cap |
| 6 | Concurrent meditation runs | Temporal | Per-agent-name dedup lock |
| 7 | Stubs pass but real adapters fail | Silent | **Deferred** — integration tests come with real implementations |
| 8 | Prompt injection via lessons | Cascade | Sanitize + XML wrapper + char cap |

---

## Test Coverage Gaps (from hostile test reviewer)

### Untested Code Paths (must fix)
| # | Gap | File | Impact | Fix Phase |
|---|-----|------|--------|-----------|
| 1 | `_check_beast_health()` real HTTP path never exercised | `agent_context.py:786` | Beast failover bugs invisible | Phase 6 (E2E) |
| 2 | `_lane_worker()` error recording branch (`status="error"`) | `agent_context.py:648` | Error runs not recorded | Phase 6 (E2E) |
| 3 | `AgentScheduler.trigger_now()` — zero coverage | `agent_scheduler.py:328` | Manual trigger endpoint broken | Phase 6 (E2E) |
| 4 | `_scheduler_loop()` + `_fire_job()` — actual cron tick | `agent_scheduler.py:346` | Scheduled jobs never actually fire in tests | Phase 6 (E2E) |
| 5 | `AgentDiscovery` hot-reload (watch/debounce/re-scan) | `agent_discovery.py:177` | YAML file changes not detected | Phase 6 (E2E) |
| 6 | `session_broker._evict_oldest()` double-eviction merge | `session_broker.py:383` | Summary overwrite on 2nd eviction | Phase 6 (E2E) |
| 7 | `meditation.py` TOOL_TIP and USER_FACT extraction | `meditation.py:307` | TOOLS.md/USER.md never written | Phase 6 (E2E) |

### False Confidence Tests (must strengthen)
| # | Test | Problem | Fix |
|---|------|---------|-----|
| 1 | Health monitor state machine tests | Test inline logic, not actual `health_monitor.py` | Import from telegram-bot or add sys.path |
| 2 | `test_voice_priority_gate_waits` | Asserts `call_count > 1` not timing | Add gate_wait_s assertion on run summary |
| 3 | `test_beast_used_when_healthy` | Doesn't assert `beast_llm.call_count == 1` | Add explicit call count + no-nano assertion |
| 4 | Schedule task validation tests | Test datetime math, not actual tool code | Import from bot.py or test the validation inline |

### Missing Integration Scenarios (add to Phase 6)
| # | Scenario | Why It Matters |
|---|----------|---------------|
| 1 | Scheduler fires job while voice is active | Job fires → gate timeout → schedule advances but result lost |
| 2 | Session eviction mid-voice (200 msg cap hit) | Eviction summary coherence during active voice |
| 3 | Scheduler start with pre-populated discovery | Double-schedule bug invisible |
| 4 | Context window overflow during multi-stage orchestrator | Stage 3 fills context → Stage 4 can't fit prompt |
| 5 | Beast goes unhealthy MID-pipeline (between stages) | Silent quality degradation, no log |

### Missing Edge Cases (add to Phase 6)
| # | Edge Case | Untested |
|---|-----------|----------|
| 1 | Empty YAML agent file | Discovery should skip, not crash |
| 2 | 4 AM IST boundary race in session broker | Two requests at 3:59:59 and 4:00:01 |
| 3 | Evolution JSONL file exceeds 10MB | Performance degradation |
| 4 | `on_complete` callback raises exception | Should not kill the lane worker |
| 5 | Agent with budget="xl" but empty system_prompt | Token estimation edge case |
| 6 | `submit()` before `start()` called | Should raise clear error |
| 7 | `trim_messages()` with only tool messages (no user/system) | Degenerate shape |
| 8 | Concurrent `get_or_create_session()` from voice + text | Lock contention |
| 9 | `max_workers > 1` in lane config | Config is ignored — only 1 worker created |

### CRITICAL: Pre-Existing Bugs Found by Reviewer (fix before or during Phase 4)
| # | Bug | File:Line | Impact | Fix |
|---|-----|-----------|--------|-----|
| 1 | **Month-end scheduler fires every tick** — `anchor_today.day < 28` is False for days 28-31 → `next_run` set to past → job fires every second | `agent_scheduler.py:146-154` | Meditation runs hundreds of times on month-end | Fix anchor calculation for all days |
| 2 | **`on_complete` arity detection exception-driven** — lambdas trigger ValueError → try (result, spec) → TypeError → try (result) → works. 2 exceptions per meditation callback | `agent_context.py:762-777` | Works but fragile. Plan already proposes standardizing to single `(result)` signature | Standardize on_complete to `Callable[[str], Awaitable[None]]`, remove arity detection |
| 3 | **`max_workers` config field is a lie** — `start()` always creates 1 worker per lane regardless of `lane_config` value | `agent_context.py:384-453` | Callers expecting parallelism get serial execution silently | Either implement multi-worker or remove the misleading config field |
| 4 | **`model_tier` metadata field has no effect** — `_get_client()` ignores `spec.metadata["model_tier"]`, routes purely on `_beast_healthy` | `agent_context.py:810` | Phase 3 routing is DOA until `_get_client()` is refactored to accept tier | Phase 3 MUST change `_get_client()` signature (already in plan) |
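For bug #1, the scheduler source isn't shown here, but the safe pattern is to compute `next_run` with `timedelta` arithmetic (which rolls over month ends automatically) rather than mutating the day field — a sketch under that assumption:

```python
from datetime import datetime, timedelta


def next_daily_run(now: datetime, hour: int, minute: int) -> datetime:
    """Next occurrence of hour:minute, month-end safe.

    timedelta addition handles 28/29/30/31-day months, so a job anchored on
    March 31 correctly rolls to April 1 instead of landing in the past.
    """
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)   # today's slot already passed
    return candidate
```

Because `candidate` is always strictly in the future, the fire-every-tick failure mode (a past `next_run` that never advances) cannot occur on any day of the month.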

---

## Adversarial Review Summary

### Review #1 (core phases 1-7)
**Architecture Review:** 4 flaws, 5 missing, 2 alternatives, 1 production incident
**Code Quality Review:** 5 bugs, 4 security vulns, 4 maintenance nightmares, 3 performance bombs
**Subtotal:** 31 issues → 24 implemented, 1 alternative accepted, 1 deferred, 5 eliminated

### Review #2 (ClaudeCodeClient, Research Orchestrator, NotebookLM)
**Architecture Review:** 4 flaws, 5 missing, 2 alternatives, 1 production incident
**Code Quality Review:** 5 bugs, 4 security vulns, 4 maintenance nightmares, 2 performance bombs
**Subtotal:** 28 issues → 22 implemented, 1 alternative adopted, 1 rejected, 4 eliminated

**Grand Total: 59 issues found across 2 reviews. 0 unresolved.**

### Key Redesigns from Reviews
1. **Complexity Router eliminated** → YAML `model_tier` field (Review #1)
2. **Orchestrator direct execution** → `_execute_direct()` prevents lane deadlock (Review #1)
3. **Gate timeout** → 5 min max, park-and-release, TIMEOUT = REJECTED (Review #1)
4. **Evolution store** → `asyncio.Lock` + in-memory cache (Review #1)
5. **ClaudeCodeClient stdin pipe** → prompt via stdin, not argv (Review #2, BUG-2)
6. **Include all message roles** → `_messages_to_prompt` includes assistant context (Review #2, Flaw 1)
7. **NotebookLM correct API** → `notebooklm-py`, `NotebookLMClient`, async pattern (Review #2, BUG-3)
8. **Disk-based stage outputs** → each stage writes JSON to disk, not in LLM context (Review #2, Alt 2)
9. **Semaphore(3) on batch extract** → prevents vLLM 503 from 25 parallel calls (Review #2, PERF-1)
10. **Topic slug sanitization** → `re.sub(r'[^a-z0-9-]', '-', topic)` prevents path traversal (Review #2, SEC-4)
11. **Beast health cache 60s** → not re-checked per stage (Review #2, MAINT-2)
12. **Self-critic retry_target** → review retries write stage, not itself (Review #2, BUG-4)
