# Next Session: Gemma 4 Thinking-Token Leak + Resume-on-Reject Polish

## What Happened

Session 122 implemented **resume-on-reject** (Layer 2) — when a barge-in is triggered by an intruder voice and rejected by the speaker gate, Annie resumes speaking from where she was interrupted. The mechanism is **working** (E2E verified: 3 resumes fired, audio duration tracking correct, threshold tuned to 0.30).

But E2E testing exposed a **pre-existing Gemma 4 issue** that severely degrades voice UX: the model's internal reasoning leaks as spoken text.

## Issue 1: Gemma 4 Free-Form Reasoning Leak (HIGH PRIORITY)

### What's happening

Gemma 4 (with thinking enabled) outputs chain-of-thought as plain text content:

```
TTS spoke: "This means I cannot give a 'detailed history' in one go."
TTS spoke: "I have to do it bit by bit."
TTS spoke: "Wait, if I'm too brief, it won't satisfy 'in detail.'"
TTS spoke: "But the rules are 'CRITICAL.'"
TTS spoke: "Actually, the instructions are strict: 'MAXIMUM 2 sentences per response.'"
TTS spoke: "Plan:"
TTS spoke: "1. Start with the Abacus (the foundation)."
TTS spoke: "2. Move through mechanical calculators."
```

Annie literally speaks her internal planning aloud before giving the actual answer.

### Why existing filters don't catch it

1. **ThinkBlockFilter** (`think_filter.py`) — catches `<think>...</think>` and `<|channel>thought...<channel|>` tag pairs. The free-form reasoning has NO tags.

2. **ToolCallTextFilter** (`bot.py:ToolCallTextFilter`) — catches `think(thought="...")` JSON format and narration phrases. The free-form reasoning doesn't match these patterns.

3. **vLLM `--reasoning-parser gemma4`** — should route thinking to `reasoning_content`, but Gemma 4 sometimes outputs reasoning as regular content without the expected `<|channel>thought` prefix.
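The failure mode in points 1–3 can be demonstrated with a minimal stand-in for a tag-pair filter (the real ThinkBlockFilter's tag list and streaming logic may differ — this is a sketch to show why untagged reasoning slips through):

```python
import re

def strip_tagged_thinking(text: str) -> str:
    """Simplified stand-in for a tag-pair filter like ThinkBlockFilter:
    removes content between known open/close thinking tags. Tag names
    here are illustrative; the real filter operates on a stream."""
    patterns = [
        r"<think>.*?</think>",
        r"<\|channel>thought.*?<channel\|>",
    ]
    for pat in patterns:
        text = re.sub(pat, "", text, flags=re.DOTALL)
    return text

# Tagged reasoning is removed...
assert strip_tagged_thinking("<think>plan steps</think>Hello!") == "Hello!"
# ...but Gemma 4's free-form reasoning carries no tags, so it passes straight through:
leak = "Wait, if I'm too brief, it won't satisfy 'in detail.'"
assert strip_tagged_thinking(leak) == leak
```

Any fix that stays tag-based will keep missing this output mode; the filter needs either upstream routing (the vLLM parser) or a content-level heuristic.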

### What to investigate

1. **Check vLLM reasoning parser behavior.** Is `--reasoning-parser gemma4` actually set in start.sh? Is it working for most thinking but missing some? Check the vLLM server logs for how it handles Gemma 4's output format.

2. **Check if `<|channel>` fragments are leaking.** The WebRTC transcript showed `<|channel>'ll get right on that` — this is a `<|channel>` token NOT followed by "thought", so the ThinkBlockFilter's open tag `"<|channel>thought"` doesn't match. Quick fix: also match `<|channel>` alone (strip any content between `<|channel>` and `<channel|>`).

3. **Consider a reasoning-pattern filter.** Catch phrases like "Wait, if I'm too brief", "The rules are", "Plan:", "Step 1:", "Actually, I'll try", "I should break this down" — these are reasoning metacognition, not conversation. But be careful: some of these phrases can appear in legitimate speech, so a naive blocklist risks false positives; a confidence-based approach may be needed.

4. **Check if disabling thinking for WebRTC only would help.** Phone has no text display, so thinking tokens never reach the user there (the phone pipeline's `clean_for_tts()` was already filtering them). WebRTC shows text. The memory says thinking was kept enabled because "Annie is doing a good job on the Phone" — but the WebRTC experience is terrible with thinking leaks.
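A minimal sketch of the confidence-based filter from point 3. The patterns and weights below are illustrative guesses (seeded from the leaked phrases in the transcript above), not tuned values:

```python
import re

# Heuristic markers of reasoning metacognition. Each pattern contributes a
# weight; a sentence is suppressed only when the summed score clears a
# threshold, so a single ambiguous phrase alone can still pass through.
_REASONING_PATTERNS = [
    (re.compile(r"^(Wait|Actually|Hmm)\b[,.]?", re.I), 0.5),
    (re.compile(r"^(Plan|Step \d+):", re.I), 0.6),
    (re.compile(r"\bthe (rules|instructions) are\b", re.I), 0.4),
    (re.compile(r"\bI should break this down\b", re.I), 0.6),
]

def reasoning_score(sentence: str) -> float:
    """Sum the weights of all matching patterns, capped at 1.0."""
    score = sum(w for pat, w in _REASONING_PATTERNS if pat.search(sentence))
    return min(score, 1.0)

def should_suppress(sentence: str, threshold: float = 0.5) -> bool:
    return reasoning_score(sentence) >= threshold

assert should_suppress("Plan:")
assert should_suppress("Wait, if I'm too brief, it won't satisfy 'in detail.'")
assert not should_suppress("I'll get right on that.")
```

The threshold plays the same role as the speaker-gate threshold: it trades leaked reasoning against dropped legitimate speech, and would need the same kind of E2E tuning.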

### Files to examine

- `services/annie-voice/think_filter.py` — ThinkBlockFilter (tag-based, lines 25-28)
- `services/annie-voice/bot.py` — ToolCallTextFilter (regex, lines 493-561), pipeline wiring (lines 1357-1365)
- `start.sh` — vLLM launch flags (look for `--reasoning-parser`, `--chat-template-kwargs`)
- `services/annie-voice/tts_text_clean.py` — clean_for_tts() used by phone pipeline

## Issue 2: Text Display Duplicates on Resume (LOW — cosmetic)

When resume fires, `ResponseTracker._handle_resume()` pushes `TextFrame(text=remaining)`. This text flows to the WebRTC transport display, creating duplicate sentences in the transcript (original text + re-injected remaining text). The AUDIO is correct — Annie resumes from the right point. Only the text display is wrong.

**Fix options:**
- Use a custom frame type that TTS recognizes but display ignores
- Set a flag on the TextFrame (if Pipecat supports metadata on frames)
- Push audio directly instead of text (but TTS sentence aggregator needs TextFrames)
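The first option can be sketched as a marker subclass. Pipecat's actual frame classes and processor wiring may differ — `TextFrame` below is a local stand-in, and `ResumeTextFrame` plus the two router predicates are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class TextFrame:
    """Local stand-in for Pipecat's text frame."""
    text: str

@dataclass
class ResumeTextFrame(TextFrame):
    """Marker subclass: identical payload, distinct type for routing."""

def display_should_render(frame: TextFrame) -> bool:
    # Display path: skip re-injected resume text to avoid duplicate sentences.
    return not isinstance(frame, ResumeTextFrame)

def tts_accepts(frame: TextFrame) -> bool:
    # TTS path: any TextFrame (including the subclass) gets synthesized,
    # so the sentence aggregator still sees what it expects.
    return isinstance(frame, TextFrame)

frame = ResumeTextFrame(text="...and then mechanical calculators.")
assert tts_accepts(frame) and not display_should_render(frame)
```

This only works if every processor between `ResponseTracker` and TTS matches on `isinstance(frame, TextFrame)` rather than exact type — worth checking in the pipeline wiring before committing to this option.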

## Issue 3: `<|channel>` Fragment in Text Display

The WebRTC transcript shows `<|channel>'ll get right on that` — a `<|channel>` token at the start of a sentence. The ThinkBlockFilter requires `<|channel>thought` (with "thought" suffix) to enter suppression mode. A bare `<|channel>` passes through.

**Quick fix:** Add `<|channel>` (without "thought") to the tag pairs, or strip `<|channel>` and `<channel|>` tokens unconditionally as a post-processing step.
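The second variant of the quick fix (unconditional post-processing) is small enough to sketch directly. Token spellings are taken from the transcript above; complete pairs lose their content, while an unpaired token like the observed `<|channel>'ll ...` loses only the token itself:

```python
import re

# Drop complete <|channel>...<channel|> pairs including their content,
# then drop any unpaired leftover tokens on their own.
_CHANNEL_PAIR = re.compile(r"<\|channel>.*?<channel\|>", re.DOTALL)
_CHANNEL_LONE = re.compile(r"<\|channel>|<channel\|>")

def strip_channel_tokens(text: str) -> str:
    text = _CHANNEL_PAIR.sub("", text)
    return _CHANNEL_LONE.sub("", text)

assert strip_channel_tokens("<|channel>'ll get right on that") == "'ll get right on that"
assert strip_channel_tokens("<|channel>thought: plan<channel|>Sure!") == "Sure!"
```

Note this is a per-string pass; applying it in a streaming pipeline would need the same partial-tag buffering that ThinkBlockFilter already does, since a token can be split across chunks.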

## Current State

### What's deployed and working
- Resume-on-reject: 3 new files, 5 modified, 100 tests pass
- Speaker gate threshold: 0.30 (lowered from 0.38)
- ToolCallTextFilter: enabled for Gemma 4 (catches `think(thought="...")`)
- Audio duration tracking: estimates playback position from wall-clock time
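The audio duration tracking above can be sketched as follows, assuming the real tracker works roughly this way (class and method names here are illustrative, not the actual implementation):

```python
import time

class PlaybackTracker:
    """Wall-clock playback estimation: record when audio started playing
    and how many seconds have been queued; on interruption, elapsed wall
    time (clamped to queued duration) approximates the playback position
    without needing feedback from the audio device."""

    def __init__(self) -> None:
        self._started_at = None     # monotonic timestamp of first audio
        self._queued_seconds = 0.0  # total audio duration pushed so far

    def on_audio_queued(self, num_samples: int, sample_rate: int = 16000) -> None:
        if self._started_at is None:
            self._started_at = time.monotonic()
        self._queued_seconds += num_samples / sample_rate

    def position_on_interrupt(self) -> float:
        """Estimated seconds actually played, never more than was queued."""
        if self._started_at is None:
            return 0.0
        elapsed = time.monotonic() - self._started_at
        return min(elapsed, self._queued_seconds)
```

The clamp matters: if the intruder barge-in lands after playback finished, wall-clock elapsed time exceeds the queued audio, and without the `min()` the resume offset would point past the end of the utterance.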

### Commits (all on main)
- `f3510f2` — initial implementation (6 phases)
- `8ee7a74` — fix UnboundLocalError
- `848df16` — fix TTS ordering + diagnostic logging
- `449c705` — fix `_start_interruption` method name
- `a1448a2` — fix no-arg signature
- `c8a37c3` — audio duration tracking
- `b0f4475` — threshold 0.30 + ToolCallTextFilter for Gemma 4

### Verification after fixes
Run: `cd services/annie-voice && python -m pytest tests/ -x -q`
Expected: 100+ pass, 0 fail (on the resume-on-reject test files)
