# Recipe: Telegram Claude Code Routing

## What This Is

Multi-agent dispatch for the Telegram bot. Messages that start with "Claude" followed by a comma, colon, or whitespace, or that use the `/claude` command, are routed to the Claude Code CLI on Titan. Everything else goes to Annie.

## Architecture

```
User sends Telegram message
  |
  v
handle_text() --> debounce 1.5s --> _process_text_message()
  |                                       |
  |                              [typing indicator starts]
  |                                       |
  |                              detect_claude_prefix()?
  |                              YES --> handle_claude_code_message()
  |                                        |
  |                                        v
  |                              claude -p --output-format text
  |                              --permission-mode bypassPermissions
  |                              --add-dir /home/rajesh/workplace/her/her-os
  |                              --append-system-prompt "answer directly..."
  |                              cwd=/tmp (avoids CLAUDE.md auto-memory)
  |                                        |
  |                                        v
  |                              split_message() --> reply_text()
  |
  |                              NO --> chat_with_annie() (Annie route)
  |                                     |
  |                                     v
  |                                  fallback: search_memory()
  |
  v
/claude command --> cmd_claude() --> handle_claude_code_message()
```

## Key Files

| File | What |
|------|------|
| `services/telegram-bot/typing_indicator.py` | Async context manager, sends ChatAction.TYPING every 4s with backoff |
| `services/telegram-bot/claude_code_handler.py` | Prefix detection, subprocess invocation, message splitting, per-chat locks |
| `services/telegram-bot/bot.py` | Routing integration, startup checks, /claude command, shutdown cleanup |
| `services/telegram-bot/config.py` | Claude Code settings (enabled flag, timeout, working dir, binary path) read from env vars |
| `services/telegram-bot/tests/test_typing_indicator.py` | 13 tests |
| `services/telegram-bot/tests/test_claude_code_handler.py` | 43 tests |
| `services/telegram-bot/tests/test_bot.py` | 10 new routing + startup tests (appended) |

## Titan .env Requirements

The bot's `.env` on Titan (`~/workplace/her/her-os/services/telegram-bot/.env`) needs:

```
CLAUDE_CODE_PATH=/home/rajesh/.local/bin/claude
ANTHROPIC_API_KEY=<oauth-token-from-credentials.json>
```

`CLAUDE_CODE_PATH` is required because the bot is launched via `nohup` with a minimal PATH that does not include `~/.local/bin`.

`ANTHROPIC_API_KEY` is the OAuth access token from `~/.claude/.credentials.json` (field: `claudeAiOauth.accessToken`). It is only needed when running with `--bare`.

## Claude Code CLI Flags (final working configuration)

```bash
claude -p \
  --output-format text \
  --permission-mode bypassPermissions \
  --add-dir /home/rajesh/workplace/her/her-os \
  --append-system-prompt "CRITICAL: answer directly, no memory updates"
# cwd=/tmp (NOT the project root)
```

### Why each flag

| Flag | Why |
|------|-----|
| `-p` | Non-interactive single-shot mode, prompt via stdin |
| `--output-format text` | Clean text for Telegram (not JSON ceremony) |
| `--permission-mode bypassPermissions` | Autonomous tool use without interactive prompts. Global settings.json had `defaultMode: "plan"` which blocked tool execution. |
| `--add-dir <project>` | Grants access to project files while the process runs from `/tmp` |
| `--append-system-prompt` | Override auto-memory behavior from CLAUDE.md rules |
| `cwd=/tmp` | **CRITICAL**: Prevents loading her-os CLAUDE.md as the project context. Without this, auto-memory rules cause Claude to output session summaries instead of answers. |
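
As a rough illustration of how the flags above might be assembled before spawning the subprocess, here is a minimal sketch. The helper name `build_claude_argv` and its parameters are hypothetical and not taken from `claude_code_handler.py`; only the flags themselves come from the configuration above.

```python
from typing import List


# Hypothetical helper: name and signature are illustrative assumptions.
def build_claude_argv(binary: str, project_dir: str, system_prompt: str) -> List[str]:
    """Return the argv list for a single-shot Claude Code run."""
    return [
        binary,
        "-p",                                      # non-interactive, prompt via stdin
        "--output-format", "text",
        "--permission-mode", "bypassPermissions",
        "--add-dir", project_dir,                  # project access without cwd=project
        "--append-system-prompt", system_prompt,
    ]


argv = build_claude_argv(
    "/home/rajesh/.local/bin/claude",
    "/home/rajesh/workplace/her/her-os",
    "CRITICAL: answer directly, no memory updates",
)
# The working directory is not a CLI flag; it is passed separately
# (cwd="/tmp") when the subprocess is created.
```

Passing the arguments as a list rather than a shell string avoids quoting problems with the system prompt.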

## Bugs Found & Fixed (in order of discovery)

### Bug 1: `asyncio.wait_for(lock.acquire(), timeout=0)` always times out (Python 3.12)

**Symptom**: Every Claude request got "I'm already working on something" even with a free lock.

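**Root cause**: With `timeout <= 0`, `asyncio.wait_for(coro, timeout=0)` wraps the coroutine in a task, sees that the task is not done yet (it has not even started), cancels it, and raises `TimeoutError`. This happens even for a lock acquire that would complete instantly.

**Fix**: Check `lock.locked()` first, then call `await lock.acquire()` explicitly. This is safe in single-threaded asyncio because acquiring a free lock does not yield to the event loop, so no other task can run between the check and the acquire.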

**Regression test**: `test_free_lock_allows_execution` verifies free lock produces "Got it" + result.
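
The bug and the fix can be reproduced in isolation. The sketch below is not the handler's code; the function names `acquire_with_zero_timeout` and `try_acquire` are made up for illustration.

```python
import asyncio


async def acquire_with_zero_timeout(lock: asyncio.Lock) -> bool:
    """The broken pattern: times out even when the lock is free."""
    try:
        await asyncio.wait_for(lock.acquire(), timeout=0)
        return True
    except asyncio.TimeoutError:
        return False


async def try_acquire(lock: asyncio.Lock) -> bool:
    """The fixed pattern: non-blocking check, then explicit acquire."""
    if lock.locked():
        return False
    # Acquiring a free lock returns without yielding to the event loop.
    await lock.acquire()
    return True


async def demo():
    lock = asyncio.Lock()
    broken = await acquire_with_zero_timeout(lock)  # free lock, still times out
    still_free = not lock.locked()                  # the cancelled acquire never ran
    fixed = await try_acquire(lock)
    lock.release()
    return broken, still_free, fixed


result = asyncio.run(demo())
```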

### Bug 2: `claude` binary not on Titan PATH

**Symptom**: "Claude CLI not found" error. Startup check disabled the route.

**Root cause**: Bot runs via `nohup .venv/bin/python bot.py` which has minimal PATH (`/usr/local/bin:/usr/bin:...`). Claude is at `~/.local/bin/claude`.

**Fix**: Set `CLAUDE_CODE_PATH=/home/rajesh/.local/bin/claude` in `.env` on Titan.

### Bug 3: 12 plugins cause 30-60s startup per request

**Symptom**: Each `claude -p` invocation takes 30-60s before returning, mostly spent initializing 12 plugin MCP servers.

**Root cause**: Titan's `~/.claude/settings.json` has 12 plugins enabled (Playwright, Firebase, GitHub, Vercel, Pyright LSP, Context7, etc.). Each starts an MCP server at launch.

**Attempted fix**: `--bare` flag skips plugins but also skips OAuth authentication (requires `ANTHROPIC_API_KEY` env var instead).

**Current state**: Not using `--bare`. Accepting the 30-60s startup. All plugins load each time.

**If you need --bare later**: Set `ANTHROPIC_API_KEY` in `.env` from `~/.claude/.credentials.json` > `claudeAiOauth.accessToken`. Then add `--bare --add-dir <project>` to the command.

### Bug 4: 4GB memory limit kills Claude Code

**Symptom**: `claude -p` exits with rc=1 after ~3 minutes. "Unable to connect to API" in stderr.

**Root cause**: `preexec_fn` called `resource.setrlimit(RLIMIT_AS, 4GB)` in the subprocess. `RLIMIT_AS` caps the process's total virtual address space, and Claude Code with 12 plugins (Node.js + MCP servers + HTTPS connections) needs more than 4GB.

**Fix**: Removed `preexec_fn` entirely. No memory limit.

### Bug 5: Stop hooks pollute stdout with checklist text

**Symptom**: Telegram receives a "Checklist:" text instead of the actual answer.

**Root cause**: The `hookify` and `ralph-loop` plugins (installed but NOT enabled) had Stop hooks that appended checklist text to stdout. With `--output-format text`, that text became part of the response. The cached `vercel` plugin also had a Stop hook.

**Fix**: Removed plugin directories on Titan:
```bash
rm -rf ~/.claude/plugins/marketplaces/claude-plugins-official/plugins/hookify
rm -rf ~/.claude/plugins/marketplaces/claude-plugins-official/plugins/ralph-loop
rm -rf ~/.claude/plugins/cache/claude-plugins-official/vercel
```
Also disabled vercel in `~/.claude/settings.json`.

### Bug 6: CLAUDE.md auto-memory replaces answers with session summaries

**Symptom**: Response is "Nothing to implement — session was a single arithmetic question. No MEMORY.md update needed." instead of "4".

**Root cause**: her-os `CLAUDE.md` has extensive auto-memory rules. Claude Code loads CLAUDE.md when `cwd` is the project root. After answering the user's question, it runs 3 more turns of auto-memory processing. `--output-format text` prints only the last assistant turn, which is the session summary.

**Diagnosis**: `--output-format json` showed `num_turns: 4` with `result` containing the wrap-up text, not the "4" answer.

**Fix**: Run subprocess from `cwd=/tmp` (not the project root) so CLAUDE.md doesn't auto-load. Use `--add-dir` to still give access to project files. Add `--append-system-prompt` telling Claude to answer directly without memory updates.
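
The JSON diagnosis step can be sketched as follows. Only the `num_turns` and `result` field names come from the diagnosis above; the helper name `inspect_cli_json` and the sample payload are illustrative assumptions, and the real output contains more fields than shown here.

```python
import json
from typing import Optional, Tuple


def inspect_cli_json(raw: str) -> Tuple[Optional[int], Optional[str]]:
    """Pull the turn count and final result text out of --output-format json output."""
    payload = json.loads(raw)
    return payload.get("num_turns"), payload.get("result")


# Made-up sample shaped like the failing case: 4 turns, wrap-up text as the result.
sample = json.dumps({"num_turns": 4, "result": "session wrap-up text"})
turns, result = inspect_cli_json(sample)

# More than one turn for a trivial question suggests extra processing after the answer.
suspicious = turns is not None and turns > 1
```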

### Bug 7: Regex too strict — required comma/colon after "Claude"

**Symptom**: "Claude what is 2+2" went to Annie because it has no comma. Users naturally type without punctuation.

**Original regex**: `^[Cc]laude[,:]\s*(.*)` — required comma or colon.

**Fix**: Changed to `^[Cc]laude[,:\s]\s*(.*)`, which accepts a comma, a colon, OR any whitespace character.

**Trade-off**: "Claude from Google emailed me" now triggers Claude Code (false positive). Use `/claude` command for unambiguous routing.
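
A minimal sketch of prefix detection with the relaxed regex, including the false positive from the trade-off above. The function name `extract_claude_prompt` is hypothetical; the real `detect_claude_prefix()` may differ in signature and regex flags.

```python
import re
from typing import Optional

# Pattern copied from the fix above; no regex flags are assumed.
CLAUDE_PREFIX = re.compile(r"^[Cc]laude[,:\s]\s*(.*)")


def extract_claude_prompt(text: str) -> Optional[str]:
    """Return the prompt after the 'Claude' prefix, or None if there is no prefix."""
    match = CLAUDE_PREFIX.match(text)
    return match.group(1) if match else None


routed = extract_claude_prompt("Claude what is 2+2")
false_positive = extract_claude_prompt("Claude from Google emailed me")
not_routed = extract_claude_prompt("Hey Claude, what is 2+2")
```

Without `re.DOTALL`, `.` stops at the first newline, so this sketch captures only the first line of a multi-line message. Whether the real handler passes that flag is not stated here.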

### Bug 8: FileNotFoundError caught in wrong try block

**Symptom**: Latent. An uncaught exception would have been raised if the binary disappeared after the startup check.

**Root cause**: The `except FileNotFoundError` handler was attached to the try block around `proc.communicate()`, but the exception is raised by `create_subprocess_exec`, which ran outside that block.

**Fix**: Moved `create_subprocess_exec` into its own try/except block.
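
The corrected structure might look like the sketch below: one try block for spawning, a separate one for communicating. The function name `run_cli`, its return shape, and the error strings are assumptions for illustration, not the handler's actual API.

```python
import asyncio
from typing import List, Optional, Tuple


async def run_cli(
    argv: List[str], prompt: str, timeout: float, cwd: str = "/tmp"
) -> Tuple[Optional[str], Optional[str]]:
    """Run a CLI with the prompt on stdin. Returns (output, error); one is None."""
    # Block 1: spawning. FileNotFoundError is raised here if the binary
    # (or the cwd) does not exist.
    try:
        proc = await asyncio.create_subprocess_exec(
            *argv,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            cwd=cwd,
        )
    except FileNotFoundError:
        return None, "Claude CLI not found"

    # Block 2: communicating, with its own timeout handling.
    try:
        stdout, _stderr = await asyncio.wait_for(
            proc.communicate(prompt.encode()), timeout=timeout
        )
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return None, "Claude Code timed out"

    if proc.returncode != 0:
        # stderr is deliberately not returned; only a generic message is.
        return None, "Claude Code failed"
    return stdout.decode(errors="replace"), None


missing = asyncio.run(run_cli(["definitely-not-a-real-binary-xyz"], "hi", timeout=5))
```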

## Typing Indicator Details

- Sends `ChatAction.TYPING` every 4 seconds (Telegram expires at ~5s)
- Single indicator wraps entire `_process_text_message()` (not per-route)
- Exponential backoff on errors: 4s -> 8s -> 16s -> 32s cap
- `CancelledError` handled explicitly (it's a `BaseException` subclass in Python 3.8+, so `except Exception` does not catch it)
- Graceful no-op when `telegram` package not installed (test environments)
- Factory function: `typing_for(bot, chat_id)` returns `TypingIndicator` context manager
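
The behaviour listed above can be sketched without the `telegram` package by injecting the send call. The names `TypingLoop`, `next_delay`, and `send_typing` are hypothetical; the real `TypingIndicator` in `typing_indicator.py` may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional

BASE_INTERVAL = 4.0   # refresh period; Telegram's typing status expires at ~5s
MAX_INTERVAL = 32.0   # backoff cap


def next_delay(current: float, succeeded: bool,
               base: float = BASE_INTERVAL, cap: float = MAX_INTERVAL) -> float:
    """Reset to the base interval on success, double (up to the cap) on error."""
    return base if succeeded else min(current * 2, cap)


class TypingLoop:
    """Async context manager that keeps calling send_typing() until exit."""

    def __init__(self, send_typing: Callable[[], Awaitable[None]],
                 interval: float = BASE_INTERVAL) -> None:
        self._send_typing = send_typing
        self._interval = interval
        self._task: Optional[asyncio.Task] = None

    async def __aenter__(self) -> "TypingLoop":
        self._task = asyncio.create_task(self._run())
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass  # expected: we cancelled the loop ourselves
        return False

    async def _run(self) -> None:
        delay = self._interval
        while True:
            try:
                await self._send_typing()
                ok = True
            except asyncio.CancelledError:
                raise  # never swallow cancellation
            except Exception:
                ok = False
            delay = next_delay(delay, ok, base=self._interval)
            await asyncio.sleep(delay)
```

In the bot, `send_typing` would presumably wrap the chat-action call for one chat, for example `lambda: bot.send_chat_action(chat_id, ChatAction.TYPING)`.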

## Concurrency

- Per-chat `asyncio.Lock` (dict keyed by chat_id)
- One Claude Code process per chat at a time
- Other chats not blocked (per-chat, not global)
- `lock.locked()` check before `lock.acquire()` — safe in single-threaded asyncio
- Lock always released in `finally` block
- Dependency injection: tests pass their own lock via `lock=` parameter
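
A minimal sketch of the per-chat locking described above, assuming a module-level dict of locks. The names `_chat_locks` and `run_exclusive` and the `"busy"` return value are illustrative; the real handler replies with a message instead.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, DefaultDict, Optional

# One lock per chat_id, created on first use.
_chat_locks: DefaultDict[int, asyncio.Lock] = defaultdict(asyncio.Lock)


async def run_exclusive(
    chat_id: int,
    job: Callable[[], Awaitable[str]],
    lock: Optional[asyncio.Lock] = None,   # tests can inject their own lock
) -> str:
    """Run job() unless this chat already has one in flight."""
    if lock is None:
        lock = _chat_locks[chat_id]
    if lock.locked():
        return "busy"
    await lock.acquire()
    try:
        return await job()
    finally:
        lock.release()   # always released, even if job() raises


async def demo():
    async def slow() -> str:
        await asyncio.sleep(0.01)
        return "done"

    same_chat = await asyncio.gather(run_exclusive(1, slow), run_exclusive(1, slow))
    other_chat = await asyncio.gather(run_exclusive(1, slow), run_exclusive(2, slow))
    return same_chat, other_chat


same_chat, other_chat = asyncio.run(demo())
```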

## Message Splitting

- 4000 char limit (not 4096, leaves room for Telegram overhead)
- Split priority: paragraph (`\n\n`) -> line (`\n`) -> hard cut
- Max 15 chunks with "[Output truncated — N chars remaining]" notice
- 0.3s delay between chunks to avoid Telegram flood control
- Never returns empty list (always at least `["[No output]"]`)
- Sent as plain text (no `parse_mode`) to avoid MarkdownV2 escaping hell
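
The splitting rules above could be implemented roughly as follows. This is a sketch, not the code in `claude_code_handler.py`: the function signature, the choice to send the truncation notice as an extra final message, and the exact notice wording are assumptions.

```python
from typing import List

MAX_CHUNK = 4000    # below Telegram's 4096-char message limit
MAX_CHUNKS = 15


def split_message(text: str, limit: int = MAX_CHUNK,
                  max_chunks: int = MAX_CHUNKS) -> List[str]:
    """Split text into chunks of at most `limit` chars, preferring clean breaks."""
    rest = text.strip()
    if not rest:
        return ["[No output]"]

    chunks: List[str] = []
    while rest and len(chunks) < max_chunks:
        if len(rest) <= limit:
            chunks.append(rest)
            rest = ""
            break
        window = rest[:limit]
        cut = window.rfind("\n\n")      # 1st choice: paragraph break
        if cut <= 0:
            cut = window.rfind("\n")    # 2nd choice: line break
        if cut <= 0:
            cut = limit                 # last resort: hard cut
        chunks.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()

    if rest:
        # Assumption: the notice is sent as one extra message after the cap.
        chunks.append(f"[Output truncated - {len(rest)} chars remaining]")
    return chunks


parts = split_message("a" * 10 + "\n\n" + "b" * 10, limit=15)
```

Each chunk would then be sent as plain text with a short pause (0.3s in the bot) between messages.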

## Security

- `ALLOWED_CHAT_IDS` must be non-empty or Claude route auto-disables at startup
- `is_authorized(chat_id)` check before routing
- stderr never sent to Telegram (generic error message only)
- Working directory validated at startup
- Claude binary existence checked at startup

## Startup Checks (in bot.py `_post_init`)

1. `shutil.which(CLAUDE_CODE_BINARY)` — binary exists?
2. `os.path.isdir(CLAUDE_CODE_WORKING_DIR)` — working dir exists?
3. `ALLOWED_CHAT_IDS` non-empty? (refuse to enable with open access)
4. If any check fails: log CRITICAL + set `CLAUDE_CODE_ENABLED = False`
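
The checks above can be sketched as a single function that collects problems. The helper name `claude_route_problems` is hypothetical; in the bot, the equivalent logic lives in `_post_init` and flips `CLAUDE_CODE_ENABLED` when anything fails.

```python
import logging
import os
import shutil
from typing import Collection, List

logger = logging.getLogger(__name__)


def claude_route_problems(binary: str, working_dir: str,
                          allowed_chat_ids: Collection[int]) -> List[str]:
    """Return a list of reasons the Claude route must stay disabled (empty = OK)."""
    problems: List[str] = []
    if shutil.which(binary) is None:          # works for bare names and absolute paths
        problems.append(f"binary not found: {binary}")
    if not os.path.isdir(working_dir):
        problems.append(f"working dir missing: {working_dir}")
    if not allowed_chat_ids:                  # refuse to enable with open access
        problems.append("ALLOWED_CHAT_IDS is empty")
    return problems


problems = claude_route_problems("definitely-not-a-real-binary-xyz", "/nonexistent/dir", set())
enabled = not problems
for problem in problems:
    logger.critical("Claude route disabled: %s", problem)
```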

## Shutdown Cleanup

- `_active_proc_pid` tracked during execution
- `cleanup_active_process()` called from `bot._post_shutdown()`
- Sends SIGTERM to orphaned Claude Code subprocess
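
A sketch of the SIGTERM step, assuming the tracked PID is passed in explicitly. The helper name `terminate_pid` is made up; the real `cleanup_active_process()` reads `_active_proc_pid` from module state.

```python
import os
import signal
from typing import Optional


def terminate_pid(pid: Optional[int]) -> bool:
    """Send SIGTERM to pid. Returns True if the signal was delivered."""
    if pid is None:
        return False          # nothing was running
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False          # process already exited
    return True


nothing_running = terminate_pid(None)
```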

## Config Environment Variables

| Variable | Default | What |
|----------|---------|------|
| `TELEGRAM_CLAUDE_CODE_ENABLED` | `1` | Enable/disable Claude route |
| `CLAUDE_CODE_TIMEOUT` | `300` | Max execution time (seconds) |
| `CLAUDE_CODE_WORKING_DIR` | `/home/rajesh/workplace/her/her-os` | Project directory for --add-dir |
| `CLAUDE_CODE_PATH` | `claude` | Path to claude binary (set to absolute path in .env on Titan) |
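
Reading these variables with their defaults might look like the sketch below. The function name `load_claude_config`, the dict keys, and the rule that only the value `"1"` enables the route are assumptions; `config.py` may parse them differently.

```python
import os
from typing import Any, Dict, Mapping


def load_claude_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read the Claude Code settings, falling back to the documented defaults."""
    return {
        # Assumption: "1" means enabled, any other value disables the route.
        "enabled": env.get("TELEGRAM_CLAUDE_CODE_ENABLED", "1") == "1",
        "timeout": int(env.get("CLAUDE_CODE_TIMEOUT", "300")),
        "working_dir": env.get("CLAUDE_CODE_WORKING_DIR", "/home/rajesh/workplace/her/her-os"),
        "binary": env.get("CLAUDE_CODE_PATH", "claude"),
    }


defaults = load_claude_config({})
titan = load_claude_config({"CLAUDE_CODE_PATH": "/home/rajesh/.local/bin/claude"})
```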

## Test Coverage

- **428 total tests pass** (local + Titan)
- `test_typing_indicator.py`: 13 tests (enter, refresh, cancel, backoff, CancelledError, factory)
- `test_claude_code_handler.py`: 43 tests (prefix, splitting, subprocess, concurrency, cmd_claude, regression)
- `test_bot.py`: 10 new tests (routing, startup checks, typing integration)
- Key regression test: `test_free_lock_allows_execution` catches the timeout=0 bug

## Commits

```
6473bc4 feat: multi-agent dispatch — typing indicators + Claude Code routing via Telegram
4356054 fix: replace wait_for(lock, timeout=0) with locked() + acquire
ba576eb fix: add --bare flag for fast startup (skip 12 plugin MCP servers)
8fd0bef fix: drop --bare (breaks OAuth auth), relax prefix to accept bare space
c4ef984 fix: remove 4GB memory limit — was killing Claude Code with 12 plugins
67eb3b2 fix: add --permission-mode bypassPermissions for autonomous tool use
5bfedf1 fix: drop --bare, keep bypassPermissions — full MCP support with OAuth auth
37fe3cc fix: revert --bare, remove hookify+ralph-loop plugins that polluted stdout
81d282f fix: use --bare to prevent stop hook from polluting Claude response
bdede65 fix: use --output-format json and parse result field
1c121ee fix: append-system-prompt overrides auto-memory
79ffd69 fix: run from /tmp to avoid CLAUDE.md auto-memory, use --add-dir for project access
```
