# Observability Guide — Creature Registry & Instrumentation

## Architecture

```
chronicler.py (MASTER — 27 creatures, all zones/services)
    ├── dashboard/registry.ts  (MIRROR — must be 1:1 with master)
    ├── annie-voice/observability.py  (SUBSET — service="annie-voice" only)
    └── audio-pipeline/observability.py  (SUBSET — service="audio-pipeline" only)
```

**Chronicler is the single source of truth.** Sub-registries duplicate the creature metadata they need (zone, process) for their service. The dashboard mirrors the full registry with additional UI metadata (accent colors, labels, isLLM flag).

Three event ingestion paths:
1. **Filesystem** — audio-pipeline writes JSONL to `/data/events/` (no network coupling)
2. **Direct** — context-engine calls `record_event()` in-process
3. **HTTP** — annie-voice batch-POSTs to `/v1/events/emit` every 2 seconds
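As a minimal sketch of the filesystem path (path 1), an emitter only needs to append one JSON object per line — no network coupling, and the chronicler can tail the file. Function name and field shape here are hypothetical; the real writer in audio-pipeline may differ:

```python
import json
import os
import time


def write_event(events_dir: str, event: dict) -> str:
    """Append one event as a JSON line to the events directory.

    Hypothetical sketch of the filesystem ingestion path; the real
    audio-pipeline writer and its field names may differ.
    """
    os.makedirs(events_dir, exist_ok=True)
    path = os.path.join(events_dir, "events.jsonl")
    record = {"timestamp": time.time(), **event}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path
```

Because each event is a self-delimiting line, a crashed writer never corrupts earlier events — the reader just skips the last partial line if one exists.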

## The Checklist: Adding a New Creature

When adding a creature, update **all** of these files:

### Registry files (4)
- [ ] `services/context-engine/chronicler.py` — CREATURE_REGISTRY (master)
- [ ] `services/<service>/observability.py` — `_CREATURES` dict (sub-registry for your service)
- [ ] `services/context-engine/dashboard/src/types.ts` — `CreatureId` union type
- [ ] `services/context-engine/dashboard/src/creatures/registry.ts` — ENTRIES array + ACCENTS

### Visual files (3)
- [ ] `services/context-engine/dashboard/src/creatures/silhouettes.ts` — SVG path data
- [ ] `services/context-engine/dashboard/src/creatures/organism.ts` — tendril config (if custom)
- [ ] `docs/creature-catalog.html` — visual catalog entry

### Connection topology (1)
- [ ] `services/context-engine/dashboard/src/connections/topology.ts` — data/control/llm connections

### Instrumentation (1-3)
- [ ] Add `emit_event("creature", "start/complete/error")` calls in the source file that implements the creature's process
- [ ] For LLM creatures: wire via `ObservabilityProcessor(llm_creature="creature")`
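The start/complete/error triple is easy to get wrong by hand (forgetting the error event on exceptions is the usual failure). A context manager keeps the three emissions in one place; this is an illustrative sketch with an in-memory sink, not the real `emit_event` implementation:

```python
from contextlib import contextmanager

# Stand-in sink for illustration; the real emit_event ships events
# to the chronicler via one of the three ingestion paths.
EVENTS = []


def emit_event(creature, event_type, data=None, reasoning=""):
    EVENTS.append({"creature": creature, "event_type": event_type,
                   "data": data or {}, "reasoning": reasoning})


@contextmanager
def instrumented(creature, **data):
    """Emit start on entry, complete on success, error on exception."""
    emit_event(creature, "start", data=data)
    try:
        yield
        emit_event(creature, "complete", data=data)
    except Exception as exc:
        emit_event(creature, "error", data={**data, "error": str(exc)})
        raise
```

Usage: `with instrumented("selkie", url=url): fetch_page(url)` — the error event fires and the exception still propagates.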

### Synthetic events (1)
- [ ] `services/context-engine/dashboard/src/events/synthetic.ts` — demo event for testing

### Tests (6)
- [ ] `dashboard/tests/registrySync.test.ts` — auto-validates (just run it)
- [ ] `dashboard/tests/creatures.test.ts` — update count assertions if needed
- [ ] `dashboard/tests/eventSchema.test.ts` — update if new event patterns
- [ ] `context-engine/tests/test_chronicler.py` — update count assertions
- [ ] `dashboard/tests/silhouettes.test.ts` — auto-validates silhouette presence
- [ ] `<service>/tests/test_observability.py` — verify instrumentation

## LLM Creatures (FUTURE_CREATURES)

Some creatures represent LLM backends that are registered but not yet deployed:

| Creature | Process | Status |
|----------|---------|--------|
| fairy | llm-qwen3-32b | Future (Sprint 1.3+) |
| pegasus | llm-mistral-small | Future |
| werewolf | llm-sarvam-m | Future |
| lion | llm-qwen35-35b | Future |

These are in the registries so the dashboard renders them (greyed out in the LLM pool), but no code path emits events for them yet. They're listed as `FUTURE_CREATURES` in the instrumentation coverage tests and are exempt from the "must have emit_event" check.

When a future creature becomes active:
1. Add `emit_event`/`record_event` calls in the source
2. Remove it from `FUTURE_CREATURES` in test files
3. The instrumentation coverage tests will now enforce it
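A coverage check with a `FUTURE_CREATURES` exemption can be as simple as scanning each creature's source for an emit call. This is a hedged sketch — the real test in `test_chronicler.py` may map creatures to source files differently:

```python
FUTURE_CREATURES = {"fairy", "pegasus", "werewolf", "lion"}


def uninstrumented_creatures(registry: dict, sources: dict) -> list:
    """Return registered creatures with no emit_event/record_event call.

    `sources` maps creature id -> source text (a hypothetical shape;
    the real test might scan files on disk instead).
    """
    missing = []
    for creature in registry:
        if creature in FUTURE_CREATURES:
            continue  # registered but not yet deployed — exempt
        src = sources.get(creature, "")
        if (f'emit_event("{creature}"' not in src
                and f'record_event("{creature}"' not in src):
            missing.append(creature)
    return sorted(missing)
```

Removing a creature from `FUTURE_CREATURES` is the only step needed to make this check start enforcing it.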

## Dashboard Navigator (Time Machine)

The right-side navigator panel shows event bubbles with a fish-eye effect.

**Sort order:** Newest events on top. When new events arrive, they appear at the top of the bubble list, pushing older events down. In live mode, the panel auto-scrolls to the top (where the newest events are).

**Scroll stepping:** Scroll down moves toward older events (deeper into history). Scroll up moves toward newer events. This applies on both the navigator panel and the main canvas.

**Activation lifecycle:**
1. SSE event arrives (must be < 30s old for live activation)
2. Creature activates: radius grows 22→38px, glow 0.12→1.0, tendrils extend, 4 glow halos
3. After 4s (6s for errors), creature deactivates back to idle state
4. Sequential activations staggered by 200ms

**Key files:**
- `dashboard/src/timeMachine/navigator.ts` — bubble creation, fish-eye, scroll stepping
- `dashboard/src/creatures/renderer.ts` — activation timers, flash effect
- `dashboard/src/events/source.ts` — SSE connection, 30s live threshold
- `dashboard/tests/navigatorSortOrder.test.ts` — sort order verification tests

## End-to-End Verification

### 1. Emit an event via curl
```bash
curl -X POST http://localhost:8100/v1/events/emit \
  -H "X-Internal-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '[{
    "creature": "selkie",
    "service": "annie-voice",
    "process": "webpage-fetch",
    "zone": "acting",
    "event_type": "complete",
    "data": {"url": "https://example.com"},
    "reasoning": "test event"
  }]'
```

### 2. Verify SSE stream
```bash
curl -N "http://localhost:8100/v1/events/stream?token=$TOKEN"
# Should see the selkie event arrive in real-time
```

### 3. Verify dashboard
Open `http://<host>:5174/` — the selkie creature should flash its accent color.

### 4. Verify PostgreSQL persistence
```bash
docker exec -it postgres psql -U her -d her_os -c \
  "SELECT creature, event_type, timestamp FROM events ORDER BY timestamp DESC LIMIT 5;"
```

## Common Pitfalls

### The Selkie Bug
**What happened:** 4 new creatures (chimera, selkie, nymph, gargoyle) were added to all registries locally but never deployed to Titan. When Annie was asked for gold prices, the flow routed through selkie (webpage fetch), but the creature never lit up on the dashboard because the Titan deployment still had the old registry.

**Root cause:** No cross-registry test existed. Registries could drift across 3 Python services and 1 TypeScript dashboard with zero warnings.

**Fix:** `registrySync.test.ts` now reads all Python registries from disk and compares against the TypeScript registry. Any drift fails the test suite with diagnostic messages like:
```
creature 'selkie' exists in chronicler.py but not in dashboard registry.ts
```
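The core of such a drift check is a symmetric set difference with a diagnostic message per missing creature. A minimal Python sketch of the idea (the real `registrySync.test.ts` is TypeScript and parses the registries from disk):

```python
def registry_drift(python_ids: set, ts_ids: set) -> list:
    """Compare creature ids from chronicler.py against registry.ts
    and return one diagnostic string per drifted creature."""
    msgs = []
    for c in sorted(python_ids - ts_ids):
        msgs.append(f"creature '{c}' exists in chronicler.py "
                    f"but not in dashboard registry.ts")
    for c in sorted(ts_ids - python_ids):
        msgs.append(f"creature '{c}' exists in registry.ts "
                    f"but not in chronicler.py")
    return msgs  # empty list means the registries are in sync
```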

### Stale Test Counts
Tests with hardcoded counts (e.g., `assert len(REGISTRY) == 23`) break every time a creature is added. The `registrySync.test.ts` tests dynamically compare registries rather than hardcoding counts, so they're resilient to additions.

### Missing Sub-Registry Entries
A creature added to chronicler.py with `service: "annie-voice"` but not added to `annie-voice/observability.py` will never emit events from that service. The cross-registry tests catch this:
```
creature 'nymph' has service='annie-voice' in chronicler.py but missing from annie-voice/observability.py
```
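The sub-registry check differs from the full mirror check: it filters the master by service and verifies the service's own registry covers that subset. A hedged sketch, assuming creature metadata carries a `service` key as described above:

```python
def missing_sub_entries(master: dict, service: str, sub_ids: set) -> list:
    """Return diagnostics for creatures assigned to `service` in the
    master registry but absent from that service's sub-registry."""
    expected = {c for c, meta in master.items()
                if meta["service"] == service}
    return [f"creature '{c}' has service='{service}' in chronicler.py "
            f"but missing from {service}/observability.py"
            for c in sorted(expected - sub_ids)]
```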

### Minotaur Identity Crisis
The minotaur creature maps to "llm-ollama" (generic), not a specific model. Its label says "Ollama LLM" and the actual model name comes from `bot.py` at runtime. Don't hardcode model names in the registry — they change with Ollama model swaps.

### Topology Gaps (Invisible Return Paths)
**What happened:** Griffin (tool router) had outgoing connections to chimera and selkie (tool execution), but no connections BACK to the LLM creatures. Tool results returned to the LLM via Pipecat's internal `result_callback`, but the dashboard showed griffin's connection ending at the tool — the return path was invisible. This cost hours debugging why tool calls appeared disconnected.

**Also:** `handle_fetch_webpage` didn't emit griffin events at all, making the `griffin → selkie` connection a partial lie (selkie events fired, but griffin never started/completed around them).

**Root cause:** Static topology can't represent runtime-conditional behavior. Only one LLM is active per session (centaur for Claude, minotaur for Ollama), but both `griffin → centaur` and `griffin → minotaur` connections exist in the topology. A naive `activateConnectionsFrom('griffin')` would light up BOTH — phantom glow on the inactive LLM.

**Fix:** The `delivered_to` pattern (see below).

### The `delivered_to` Pattern
When a creature routes results to a runtime-selected target, include `delivered_to` in the event data:

```python
emit_event("griffin", "complete",
           data={"tool": "web_search", "delivered_to": llm_creature},
           reasoning="Search completed")
```

The dashboard's `onEvent` handler checks for `delivered_to` and activates only the specific connection:
- **With `delivered_to`:** Only `griffin → <target>` glows
- **Without `delivered_to`:** All outgoing connections from the creature glow (default behavior)

This prevents phantom glow on inactive paths while keeping both connections in the static topology for completeness.
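The selection logic is small; here it is sketched in Python as a hypothetical mirror of the dashboard's TypeScript `onEvent` handler, with the topology modeled as (source, target) pairs:

```python
def connections_to_activate(topology, creature, event_data):
    """Pick which outgoing connections to light up for an event.

    With `delivered_to` in the event data, only that edge glows;
    without it, all outgoing edges glow (default behavior).
    """
    outgoing = [(s, t) for s, t in topology if s == creature]
    target = (event_data or {}).get("delivered_to")
    if target:
        return [(s, t) for s, t in outgoing if t == target]
    return outgoing
```

With `topology = [("griffin", "centaur"), ("griffin", "minotaur")]`, a griffin event carrying `delivered_to: "centaur"` activates only the centaur edge — no phantom glow on the inactive minotaur path.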

**Checklist for adding connections with runtime-conditional targets:**
1. Both endpoints must emit events (source fires start/complete, sink fires when it receives)
2. Add `delivered_to` field to the source's complete event
3. Set the target creature via a module-level variable from `bot.py`
4. Add connections for ALL possible targets in `topology.ts`
5. The dashboard will select which connection to light up based on `delivered_to`
