# Next Session: Chatterbox Voice Tuning + Phone STT/Echo Fix

## Status

**Chatterbox dual-TTS is DEPLOYED and WORKING** — Chatterbox server on Panda:8772 (3,058 MB VRAM), AutoRoutingBackend routing English→Chatterbox, Kannada→IndicF5. The Chatterbox server logs confirm 5+ successful TTS requests from phone calls.
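The routing rule above can be sketched as follows (a minimal sketch only; the real `AutoRoutingBackend` lives in the annie-voice service, and the fallback for languages other than English/Kannada is an assumption here):

```python
def route_tts_backend(lang_code: str) -> str:
    """Dispatch a language code to a TTS backend, per the routing above."""
    if lang_code.startswith("en"):
        return "chatterbox"   # English -> Chatterbox on Panda:8772
    if lang_code.startswith("kn"):
        return "indicf5"      # Kannada -> IndicF5
    return "chatterbox"       # assumed default for any other language
```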

**Two critical issues found during live phone testing:**

### Issue 1: STT Garbling (HIGH — Annie can't understand Rajesh)

Whisper `base` (74M params) cannot handle BT HFP 16kHz audio quality. Transcriptions are garbage:
- "that they have now in Bangalore" (should be "how is the weather in Bangalore")
- "ましょう" (Japanese hallucination!)
- "to top side of that tip and go Doyle" (complete nonsense)
- "how is the method back to right now" (should be "how is the weather right now")

**Root cause:** `PhoneSTT.__init__` defaults to `model_name="base"` at `services/annie-voice/phone_audio.py:495`. The Pipecat path uses `large-v3-turbo` which works fine.

**Fix:** Upgrade to `medium` (~1.5 GB) or `large-v3-turbo` (~3 GB). Panda has **11 GB free** after Chatterbox + IndicF5. Either model fits easily.

**Config:** `PhoneSTT(model_name="medium")` or add `PHONE_WHISPER_MODEL` env var to `.env`.
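A minimal sketch of the env-var approach, assuming `PHONE_WHISPER_MODEL` is read once at construction time (`resolve_whisper_model` is a hypothetical helper, not existing code):

```python
import os

def resolve_whisper_model(default: str = "medium") -> str:
    """Phone-path Whisper model, overridable via PHONE_WHISPER_MODEL."""
    return os.environ.get("PHONE_WHISPER_MODEL", default)

# Hypothetical wiring inside phone_audio.py:
# stt = PhoneSTT(model_name=resolve_whisper_model())
```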

### Issue 2: Echo (MEDIUM — user hears echo, STT picks up Annie's voice)

Chatterbox generates 3-6 seconds of audio at the new, slower pace. The BT HFP mic picks this audio up as echo. The current echo guard is `skip_initial_s=1.5` (only 1.5 seconds), so the barge-in detector starts listening while Annie is still speaking.

Post-playback echo drain shows `echo_drain=6000ms` — 6 seconds of echo audio being drained.

**Fix options:**
1. Increase `skip_initial_s` from 1.5 to something proportional to TTS audio duration
2. Pass TTS audio duration to `detect_bargein()` so it can skip that many seconds + buffer
3. Increase post-playback sleep from 0.3s to 0.8-1.0s for BT echo tail
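Options 1 and 2 combine naturally into one helper that derives the echo-guard window from the TTS clip itself (a sketch; the function name and the 0.5 s buffer are assumptions, not existing code):

```python
def echo_skip_seconds(tts_samples: int, sample_rate: int = 16000,
                      buffer_s: float = 0.5, floor_s: float = 1.5) -> float:
    """Seconds of mic audio to ignore once playback starts.

    Scales with the TTS clip's duration so the barge-in detector
    never listens while Annie is still speaking, but never drops
    below the current 1.5 s floor for very short clips.
    """
    duration_s = tts_samples / sample_rate
    return max(floor_s, duration_s + buffer_s)
```

`detect_bargein()` would then take this value in place of the fixed `skip_initial_s=1.5` default.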

**Code locations:**
- `phone_audio.py:344` — `skip_initial_s: float = 1.5`
- `phone_loop.py:882` — `await asyncio.sleep(0.3)` (post-playback echo settle)

## User's Question: Gemma 4 E2B/E4B for STT?

Rajesh asked: "Can we try Gemma 4 E2B or E4B? They understand language clearly and we can run them on Pixel?"

**From existing research (`docs/RESEARCH-GEMMA4-E2B-E4B-AUDIO.md`):**
- E2B (2.3B active, 5.1B total) and E4B (4.5B active, 7.9B total) DO have audio input
- They CAN do ASR and audio question answering
- **BUT:** 30-second max audio, no streaming, no real-time, batch processing only
- E2B audio quality was "garbled" in testing, E4B was "decent but not competitive"
- Running on Pixel 9a (8 GB RAM): E2B MIGHT fit with INT4 quantization (~1.5 GB), E4B unlikely
- **Verdict from research:** Not suitable to replace Whisper for real-time phone conversations

**Better path:** Upgrade Whisper from `base` to `medium` or `large-v3-turbo` on Panda. This is a one-line change.

## Voice Tuning Status

Chatterbox defaults tuned per docs recommendation:
- `cfg_weight=0.3` (pace: slower, more deliberate — docs say 0.3 for fast ref speakers)
- `exaggeration=0.3` (calmer, less rushed)
- `temperature=0.6` (more consistent prosody)
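For reference, the tuned values collected as one request payload (a sketch only; the Chatterbox server's actual field names and endpoint shape are not confirmed here):

```python
# Assumed request fields for the Chatterbox server on Panda:8772;
# names mirror the tuning knobs above, not a verified API schema.
TTS_DEFAULTS = {
    "cfg_weight": 0.3,    # slower, more deliberate pacing
    "exaggeration": 0.3,  # calmer, less rushed delivery
    "temperature": 0.6,   # more consistent prosody
}
```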

**Paralinguistic tags available:** Chatterbox Turbo natively supports `[laugh]`, `[chuckle]`, `[cough]` in text. Could add to LLM system prompt for natural reactions.
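If the tags go into the LLM system prompt, the model may also emit tags Chatterbox does not support; a small whitelist filter before TTS would guard against that (a sketch; the supported-tag set is the one named above, the helper itself is hypothetical):

```python
import re

# Tags Chatterbox Turbo natively supports, per the note above.
SUPPORTED_TAGS = {"laugh", "chuckle", "cough"}

def sanitize_tags(text: str) -> str:
    """Keep supported paralinguistic tags; strip any others the LLM invents."""
    def _keep(match: re.Match) -> str:
        return match.group(0) if match.group(1).lower() in SUPPORTED_TAGS else ""
    cleaned = re.sub(r"\[([a-zA-Z]+)\]", _keep, text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```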

User heard the tuned voice and said "sounds better but speaks fast, needs human cadence." The `cfg_weight=0.3` change is now deployed — need user to test again after the echo fix (echo was making it hard to evaluate voice quality).

## Start Commands

```bash
# Fix 1: Upgrade Whisper model (one-line change)
# In phone_audio.py:495, change "base" to "medium"
# OR add to ~/.her-os/.env on Panda: PHONE_WHISPER_MODEL=medium

# Fix 2: Increase echo guard
# In phone_audio.py:344, change skip_initial_s=1.5 to 3.0
# In phone_loop.py:882, change sleep(0.3) to sleep(0.8)

# Then restart phone:
./stop.sh phone && ./start.sh phone
```

## Files to Modify

1. `services/annie-voice/phone_audio.py:495` — PhoneSTT model_name default OR add env var
2. `services/annie-voice/phone_audio.py:344` — detect_bargein skip_initial_s
3. `services/annie-voice/phone_loop.py:882` — post-playback echo settle sleep
4. `~/.her-os/.env` on Panda — add PHONE_WHISPER_MODEL if using env var approach

## Verification

1. Call Annie → speak clearly → check STT logs show correct transcription
2. No echo audible during or after Annie speaks
3. Barge-in still works (interrupt Annie mid-sentence)
4. Voice quality: Samantha sounds natural, not rushed
5. Latency: STT should be <500ms with medium model
