# Next Session: Phone Thinking/Processing Sounds

## What
When Annie is processing on a phone call (LLM thinking, web search, tool calls), the caller hears dead silence. This feels like a dropped call. Add ambient "thinking sounds" — short audio cues that play during processing so the caller knows Annie is still there and working.

## Why
From session 26: Annie does a web search for "weather in Bangalore" — the gap between the user finishing speaking and Annie responding is ~4 seconds of total silence (STT + LLM + tool call + TTS). On a phone call, silence = "did she hang up?" A simple audio cue bridges this gap.

## Context: Where the silence happens

The phone conversation loop (`phone_loop.py`) has this flow per turn:
```
USER SPEAKS → [silence starts] → STT (500ms) → LLM (500ms-3s) → tool call (1-3s) → TTS (800ms-1.5s) → [silence ends] → ANNIE SPEAKS
```

The silence window is from `collect_utterance()` returning WAV bytes to `play_audio()` starting the response. During this window, the caller hears nothing.

### Key code locations:
- `phone_loop.py:1264` — STT starts (after `collect_utterance` returns)
- `phone_loop.py:1309` — Streaming pipeline starts (LLM + tool calls + TTS)
- `phone_loop.py:~857` — `_run_streaming_pipeline()` — tool calls happen here
- `phone_audio.py:390` — `play_audio()` — first response audio plays

### Existing infrastructure:
- `phone_audio.py` already has `play_audio(wav_path)` for BT HFP playback
- `phone_audio.PHONE_TMP` directory for temporary WAV files
- `gpu_lock` (asyncio.Lock) protects STT/TTS from concurrent GPU access — thinking sound must NOT use GPU
- TTS backends: Chatterbox (English), IndicF5 (Kannada) — both HTTP, both use GPU

## Design Considerations

### What kind of sound?
Options to evaluate:
1. **Pre-recorded ambient sounds** — gentle hum, soft typing, or a brief "hmm" clip. Zero GPU cost. Just `play_audio()` on a static WAV.
2. **Short verbal cue** — Pre-generate a few clips: "Let me think...", "One moment...", "Hmm...", "Checking that...". Play randomly. Pre-generated at startup (no runtime TTS cost).
3. **Periodic soft tone** — A gentle beep or chime every 2 seconds during processing. Like a "hold" indicator.
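Option 2 could be rendered once at startup so there is no runtime TTS (and no GPU) involved. A minimal sketch, where `synthesize` is a placeholder for whatever call the existing Chatterbox HTTP backend exposes, and the file names and cache-on-disk behavior are assumptions:

```python
import os
import random

# Candidate phrases from option 2 above
THINKING_PHRASES = ["Let me think...", "One moment...", "Hmm...", "Checking that..."]

def pregenerate_cues(synthesize, out_dir: str) -> list[str]:
    """Render each cue phrase to a WAV file once; reuse cached files on restart.

    `synthesize(text) -> wav_bytes` stands in for the project's existing
    TTS call. It runs only at startup, never mid-call, so the GPU is free
    during the actual silence window.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, phrase in enumerate(THINKING_PHRASES):
        path = os.path.join(out_dir, f"thinking_cue_{i}.wav")
        if not os.path.exists(path):  # cache across restarts
            with open(path, "wb") as f:
                f.write(synthesize(phrase))
        paths.append(path)
    return paths

def pick_cue(paths: list[str]) -> str:
    """Pick a random cue so consecutive turns don't sound canned."""
    return random.choice(paths)
```

Randomizing the clip per turn is a small touch that keeps the cue from feeling mechanical.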

### When to play?
- **After STT, before first TTS plays** — this is the main silence window
- **During tool calls** — web search, maps lookup, etc. can take 1-3s
- **NOT during TTS generation** — TTS generates sentence-by-sentence; once first sentence is ready, it plays immediately (streaming pipeline)

### Constraints:
- Must NOT use GPU (GPU is busy with STT/LLM/TTS)
- Must be interruptible (if first TTS sentence is ready, stop the thinking sound immediately)
- Must work with BT HFP (16kHz mono PCM via pw-play)
- Must not interfere with barge-in detection
- Short-duration clips (~1-2s) played in a loop until the response is ready

### Architecture sketch:
```python
# In phone_loop.py, after STT returns text (sketch):

cancel_event = asyncio.Event()
thinking_task = asyncio.create_task(_play_thinking_sound(cancel_event))

try:
    # The pipeline sets cancel_event just before the first TTS
    # sentence plays, so the thinking sound stops immediately
    # rather than waiting for the whole response.
    await _run_streaming_pipeline(...)
finally:
    cancel_event.set()       # safety net: always stop the loop
    thinking_task.cancel()
```

The `_play_thinking_sound()` function would loop a pre-generated WAV clip until cancelled. It uses `pw-play` (same as response audio) so it goes through BT HFP.

### Open questions for implementation:
1. Should the sound be a verbal "hmm" (more human) or an ambient tone (less intrusive)?
2. Should different processing stages have different sounds (searching vs thinking)?
3. How to handle the transition from thinking sound to first response sentence smoothly (crossfade? gap? immediate cut?)
4. Should the thinking sound play during STT too, or only during LLM/tools?

## Files to Modify

1. `services/annie-voice/phone_loop.py` — Add `_play_thinking_sound()`, wire into turn loop
2. `services/annie-voice/phone_audio.py` — May need a `play_audio_loop()` variant that repeats until cancelled
3. `services/annie-voice/assets/` (new) — Pre-recorded thinking sound WAV files (16kHz mono)
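If no recorded clip is on hand, a placeholder asset in the required format can be synthesized with the stdlib. A sketch that writes a short, quiet fading sine chime as 16kHz mono 16-bit PCM; frequency, duration, and volume are arbitrary choices to tune by ear:

```python
import math
import struct
import wave

def write_chime(path: str, freq: float = 440.0, secs: float = 0.4,
                rate: int = 16000, volume: float = 0.15) -> None:
    """Write a short fading sine tone as 16 kHz mono 16-bit PCM WAV."""
    n = int(rate * secs)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit
        w.setframerate(rate)   # 16 kHz for BT HFP
        frames = bytearray()
        for i in range(n):
            fade = 1.0 - i / n  # linear fade-out avoids a click at the end
            sample = volume * fade * math.sin(2 * math.pi * freq * i / rate)
            frames += struct.pack("<h", int(sample * 32767))
        w.writeframes(bytes(frames))
```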

## Start Command
```
cat docs/NEXT-SESSION-PHONE-THINKING-SOUNDS.md
```
Then implement. Key decision: choose the sound type first (verbal vs tone), pre-generate or record the clips, then wire into the turn loop.

## Verification
1. Call Annie on phone
2. Ask something that requires a web search ("What's the weather?")
3. Hear thinking sound during the processing gap
4. Annie's response plays smoothly after thinking sound stops
5. Barge-in still works (interrupt during thinking sound → Annie stops and listens)
6. Run existing tests: `cd services/annie-voice && python -m pytest tests/ -q`
