# Research: Real-Time Audio Bridge — Pixel 9a ↔ Panda

**Date:** 2026-03-31
**Context:** Annie's Pixel 9a (mic + speaker) needs a real-time audio bridge to Panda (RTX 5070 Ti, x86_64, 192.168.68.57) over USB. Panda runs GPU-accelerated STT (IndicConformerASR, 145ms) and TTS (IndicF5, 2.1s). The LLM runs on Titan, reached via SSH.
**Status:** Research complete. Recommended architecture at bottom.

---

## Use Cases

| # | Use Case | Audio Direction | Hard Part |
|---|----------|----------------|-----------|
| 1 | Annie as voice assistant | Pixel mic → Panda STT; Panda TTS → Pixel speaker | Bidirectional streaming, low latency |
| 2 | Mom calls Annie's Jio SIM | Call downlink → Panda STT; Panda TTS → call uplink | Injecting audio INTO active cellular call |

---

## Approach 1: scrcpy Audio Forwarding

### What It Does

scrcpy 2.0+ (released 2023) forwards device audio to the host computer over USB or TCP. PR #5870 (merged 2025-03-29 into the dev branch) added 8 new audio sources, including voice call capture.

### Available Audio Sources (scrcpy 3.x / dev branch)

| Source | Description | API Level |
|--------|-------------|-----------|
| `output` (default) | Whole audio output, disables on-device playback | 30+ |
| `playback` | Audio playback (apps can opt-out) | 30+ |
| `mic` | Standard microphone | 30+ |
| `mic-unprocessed` | Raw microphone (no noise reduction) | 30+ |
| `mic-voice-recognition` | Tuned for voice recognition | 30+ |
| `mic-voice-communication` | Echo cancellation + gain control | 30+ |
| `mic-camcorder` | Optimized for video recording | 30+ |
| `voice-call` | Both uplink + downlink of active call | 30+ |
| `voice-call-uplink` | Call TX only (what device sends) | 30+ |
| `voice-call-downlink` | Call RX only (what device receives) | 30+ |
| `voice-performance` | Mic + device playback (karaoke mode) | 30+ |

### Audio Codecs

| Codec | Format | Notes |
|-------|--------|-------|
| `opus` (default) | Compressed | ~20ms frame size |
| `aac` | Compressed | ~21.3ms frame size |
| `flac` | Lossless compressed | — |
| `raw` | PCM 16-bit LE | Best for STT piping, zero encode/decode overhead |
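
Raw PCM from scrcpy still needs downmixing and resampling before it reaches STT. A minimal sketch, assuming scrcpy's raw stream is 48 kHz interleaved stereo s16le (scrcpy's capture format as far as we know; verify against the header of a recorded file), converting it to the 16 kHz mono that the STT model expects. The box-filter averaging is a crude anti-aliasing stand-in; a proper resampler (for example `scipy.signal.resample_poly(x, 1, 3)`) would give cleaner audio.

```python
import struct

SRC_RATE, SRC_CHANNELS = 48_000, 2   # assumed scrcpy raw capture format (verify on device)
DST_RATE = 16_000                    # STT input rate
FACTOR = SRC_RATE // DST_RATE        # 3

def stereo48k_to_mono16k(pcm: bytes) -> bytes:
    """Downmix interleaved s16le stereo to mono, then decimate 48 kHz -> 16 kHz.

    Each output sample is the mean of FACTOR mono samples (a crude box filter);
    trailing samples that do not fill a whole group are dropped.
    """
    n_frames = len(pcm) // (2 * SRC_CHANNELS)
    samples = struct.unpack(f"<{n_frames * SRC_CHANNELS}h", pcm[: n_frames * 2 * SRC_CHANNELS])
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    out = [sum(mono[i:i + FACTOR]) // FACTOR for i in range(0, len(mono) - FACTOR + 1, FACTOR)]
    return struct.pack(f"<{len(out)}h", *out)
```

When converting a live stream chunk by chunk, keep chunk sizes a multiple of 12 bytes (3 stereo frames) so no averaging group is split; a 20 ms chunk at 48 kHz stereo is 3,840 bytes and becomes 640 bytes at 16 kHz mono.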

### Commands for Testing

```bash
# Capture microphone to file (dictaphone mode)
scrcpy --audio-source=mic --no-video --no-playback --record=mic.opus

# Capture microphone as raw PCM (no encode/decode overhead); scrcpy picks the
# container from the file extension, and raw audio is recorded to .wav
scrcpy --audio-source=mic --audio-codec=raw --no-video --no-playback --record=mic.wav

# Capture voice call downlink (what mom says)
scrcpy --audio-source=voice-call-downlink --audio-codec=raw --no-video --record=call-dl.wav

# Capture voice call uplink + downlink
scrcpy --audio-source=voice-call --audio-codec=raw --no-video --record=call.wav

# Reduce the host playback buffer for lower latency (default 50ms);
# it only affects playback, so it is not combined with --no-playback here
scrcpy --audio-source=mic --audio-buffer=20 --no-video

# Audio-only, no video, no control
scrcpy --audio-source=mic --no-video --no-control
```

### Latency

- Audio blocks produced every **~20ms** (OPUS/AAC frame size)
- Default audio buffer: **50ms** (adjustable via `--audio-buffer=N`); this is the host-side playback buffer and does not apply when only recording with `--no-playback`
- USB transport overhead: **~2-5ms**
- Raw PCM eliminates codec encode/decode overhead
- **Estimated total: 25-55ms** (mic capture → host)

### Feasibility Assessment

| Criterion | Rating |
|-----------|--------|
| **Mic capture (Use Case 1)** | WORKS — `--audio-source=mic` is solid |
| **Call audio capture (Use Case 2)** | PARTIALLY WORKS — `voice-call-downlink` added in PR #5870 but behavior is device-dependent. Pixel 8 reported to only produce packets during active calls (which is what we want). Some devices produce garbage. Must test on Pixel 9a. |
| **Audio injection (TTS → device)** | DOES NOT WORK — scrcpy is one-directional (device → host). No reverse audio path exists. |
| **Audio injection into call** | DOES NOT WORK — scrcpy cannot inject audio into call uplink |
| **Latency** | ~30-55ms (excellent for STT) |
| **Implementation complexity** | 1/5 — just command-line flags |

### Key Limitations

1. **One-directional only**: scrcpy streams audio FROM device TO host. No way to send audio back.
2. **Call audio requires dev branch**: voice-call sources are in PR #5870 (merged to dev, not yet in stable release). Must build from source or use the [scrcpy-with-call-audio fork](https://github.com/yNEX/scrcpy-with-call-audio).
3. **No stdout piping**: scrcpy records to file or plays via SDL. Cannot directly pipe raw PCM to another process's stdin. Would need to record to a named pipe (FIFO) or use `--record=/dev/stdout` workaround (untested).
4. **Device-dependent call audio**: voice-call sources rely on Android's `MediaRecorder.AudioSource.VOICE_CALL`, which requires the `CAPTURE_AUDIO_OUTPUT` permission. scrcpy's server runs as the shell user, which has this permission, but actual capture depends on the OEM audio HAL implementation.

---

## Approach 2: ADB Audio Streaming

### audiosource Project (gdzx/audiosource)

**Architecture:**
1. Small Android app captures mic via `AudioRecord`
2. Streams raw audio over a `LocalSocket` (Unix domain socket)
3. ADB forwards the socket: `adb forward localabstract:audiosource localabstract:audiosource`
4. Linux client reads socket → writes to PulseAudio pipe source

**Commands:**
```bash
# Install and run (from audiosource repo)
./audiosource run

# Manual setup:
adb forward localabstract:audiosource localabstract:audiosource
pactl load-module module-pipe-source source_name=android \
  channels=1 format=s16 rate=44100 file=/tmp/audiosource
socat -u ABSTRACT-CONNECT:audiosource PIPE:/tmp/audiosource
```

**Audio Format:** PCM s16le, 44100 Hz, mono

**Latency Problem — Head-of-Line Blocking:**
The [dzx.fr blog post](https://dzx.fr/blog/low-latency-microphone-audio-android/) documents a critical issue: latency starts low but grows to **~12 seconds** under load due to buffer accumulation across the pipeline:

| Component | Buffer Size | Latency Added |
|-----------|-------------|---------------|
| ALSA hardware | 1,792 bytes | ~20ms |
| AudioRecord | 7,168 bytes | ~82ms |
| LocalSocket send | 2,048 bytes | ~23ms |
| ADB socket | 212,992 bytes | **~2 seconds** |
| Linux pipe (default) | 65,536+ bytes | **~12 seconds** |
| PulseAudio tlength | configurable | ~20ms |
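
The latency column follows directly from the byte rate: 44.1 kHz, mono, 16-bit audio is 88,200 bytes/s, so a buffer of N bytes holds N / 88,200 seconds of audio. A quick check of the table's figures (buffer sizes copied from the table; the helper name is ours):

```python
def buffer_latency_ms(n_bytes: int, rate: int = 44_100, channels: int = 1,
                      sample_bytes: int = 2) -> float:
    """Milliseconds of PCM audio held by a buffer of n_bytes."""
    return 1000 * n_bytes / (rate * channels * sample_bytes)

# Buffer sizes from the table above (dzx.fr measurements)
BUFFERS = {
    "ALSA hardware": 1_792,
    "AudioRecord": 7_168,
    "LocalSocket send": 2_048,
    "ADB socket": 212_992,
    "Linux pipe (default)": 65_536,
}

for name, size in BUFFERS.items():
    print(f"{name:<22}{size:>9,} B {buffer_latency_ms(size):>9.1f} ms")
```

By this arithmetic the pipe's default 64 KiB on its own holds only ~0.74 s of audio, so the ~12 s figure presumably reflects backlog accumulating beyond a single default-sized buffer (the "+" in the table) rather than that buffer's nominal size; the dzx.fr post is the authority on the exact mechanism.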

**Solution — Discard-on-Backpressure:**
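```python
import fcntl
import os
import socket

F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)  # Linux only; named constant since Python 3.10

# adb forward exposes the phone's socket as a host abstract socket (leading NUL byte)
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("\0audiosource")

# Open the FIFO non-blocking (module-pipe-source must already be reading it),
# then shrink its buffer to 4096 bytes (~46 ms at 44.1 kHz mono s16)
pipe_fd = os.open("/tmp/audiosource", os.O_WRONLY | os.O_NONBLOCK)
fcntl.fcntl(pipe_fd, F_SETPIPE_SZ, 4096)

buf = bytearray(1024)
view = memoryview(buf)
while True:
    n = sock.recv_into(buf, 1024, socket.MSG_WAITALL)
    if n == 0:
        break  # device side closed the socket
    try:
        os.write(pipe_fd, view[:n])  # 1024 B <= PIPE_BUF, so the write is all-or-nothing
    except BlockingIOError:
        pass  # Discard if pipe full: prevents latency growth
```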

This keeps latency bounded at ~50-80ms at the cost of occasional audio glitches (dropped frames).

### avream Project (Kacoze/avream)

- Creates virtual "AVream Mic" device on Linux via PulseAudio/PipeWire
- Captures Android mic over ADB, no Android app required
- GTK4 UI + daemon architecture
- Designed for "real meetings and recordings"
- Reverse audio (PC → phone speaker) is **not stable** per README
- No specific latency figures published

### Raw ADB Shell Approaches

```bash
# Direct mic capture via adb shell (requires root or system privileges)
adb shell "cat /dev/snd/pcmC0D0c" > /tmp/mic.pcm  # Usually permission-denied

# Screen recording audio-only (not a real thing — no --audio-only flag)
# adb exec-out screenrecord --audio-only  # DOES NOT EXIST
```

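**Note:** Raw ADB shell audio capture is not supported without root. The AudioRecord API needs a Java context: a regular app provides one, while tools like scrcpy obtain one by running their server through `app_process` as the shell user.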

### Feasibility Assessment

| Criterion | Rating |
|-----------|--------|
| **Mic capture (Use Case 1)** | WORKS — audiosource/avream both work |
| **Call audio capture (Use Case 2)** | DOES NOT WORK — these tools capture mic, not call audio |
| **Audio injection (TTS → device)** | DOES NOT WORK — one-directional |
| **Latency** | ~50-80ms with discard strategy; can grow unbounded without it |
| **Implementation complexity** | 2/5 — needs Android app + Linux client |

---

## Approach 3: Custom Android App (WebSocket/USB)

### Architecture

```
┌─────────────────────────────────────────────┐
│                PIXEL 9a                      │
│                                              │
│  ┌──────────┐    ┌──────────────────────┐   │
│  │AudioRecord│───→│  WebSocket Client    │   │
│  │(mic input)│    │  (OkHttp/Ktor)       │   │
│  └──────────┘    │                      │   │
│                  │  ws://panda:8766     │   │
│  ┌──────────┐    │                      │   │
│  │AudioTrack │←───│  Receives TTS audio  │   │
│  │(speaker)  │    │  from Panda          │   │
│  └──────────┘    └──────────────────────┘   │
└──────────────────────────────────────────────┘
                    ▲ USB or WiFi
                    │ WebSocket
                    ▼
┌──────────────────────────────────────────────┐
│                 PANDA                         │
│                                               │
│  ┌──────────────────────────────────────┐    │
│  │  Python WebSocket Server (FastAPI)    │    │
│  │  port 8766                            │    │
│  │                                       │    │
│  │  receive PCM → STT (IndicConformer)   │    │
│  │  → Titan LLM → TTS (IndicF5)         │    │
│  │  → send PCM back to Pixel             │    │
│  └──────────────────────────────────────┘    │
└──────────────────────────────────────────────┘
```

### Android Side (Kotlin)

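```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.AudioTrack
import android.media.MediaRecorder
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.WebSocket
import okhttp3.WebSocketListener
import okio.ByteString
import okio.ByteString.Companion.toByteString

// AudioRecord config for mic capture (needs the RECORD_AUDIO runtime permission)
val sampleRate = 16000  // STT models expect 16kHz
val bufferSize = 2 * AudioRecord.getMinBufferSize(  // 2x headroom over the minimum avoids overruns
    sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT)
val recorder = AudioRecord(MediaRecorder.AudioSource.MIC, sampleRate,
    AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufferSize)

// AudioTrack config for TTS playback
val player = AudioTrack.Builder()
    .setAudioAttributes(AudioAttributes.Builder()
        .setUsage(AudioAttributes.USAGE_MEDIA)
        .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
        .build())
    .setAudioFormat(AudioFormat.Builder()
        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
        .setSampleRate(24000)  // TTS output rate
        .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
        .build())
    .setTransferMode(AudioTrack.MODE_STREAM)
    .build()

// OkHttp WebSocket: binary frames carry raw PCM in both directions.
// ws://localhost reaches Panda because of `adb reverse tcp:8766 tcp:8766`.
val socket: WebSocket = OkHttpClient().newWebSocket(
    Request.Builder().url("ws://localhost:8766").build(),
    object : WebSocketListener() {
        override fun onMessage(webSocket: WebSocket, bytes: ByteString) {
            val pcm = bytes.toByteArray()
            player.write(pcm, 0, pcm.size)  // blocking write into the TTS stream
        }
    }
)

// Mic loop: run on a background thread, never on the main thread
fun streamMic() {
    val chunk = ByteArray(640)  // 20 ms at 16 kHz, mono, 16-bit
    recorder.startRecording()
    player.play()
    while (!Thread.currentThread().isInterrupted) {
        val n = recorder.read(chunk, 0, chunk.size)
        if (n > 0) socket.send(chunk.toByteString(0, n))
    }
    recorder.stop()
}
```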

### Python Server Side

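```python
import asyncio
import websockets

# Hypothetical placeholders: wire these to IndicConformerASR, the Titan LLM, and IndicF5
async def stt_process(pcm: bytes) -> str: raise NotImplementedError
async def llm_process(text: str) -> str: raise NotImplementedError
async def tts_process(text: str) -> bytes: raise NotImplementedError

async def audio_bridge(websocket):
    """Bidirectional audio bridge; assumes one binary message = one complete utterance."""
    async for message in websocket:
        if isinstance(message, str):
            continue                             # ignore text frames (control messages)
        text = await stt_process(message)        # IndicConformerASR (145ms)
        response = await llm_process(text)       # Nemotron Nano on Titan
        audio = await tts_process(response)      # IndicF5 (2.1s)
        await websocket.send(audio)              # Send TTS PCM (24 kHz) back

async def main():
    async with websockets.serve(audio_bridge, "0.0.0.0", 8766):
        await asyncio.Future()                   # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```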
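
If the app streams short fixed-size chunks (for example 20 ms of PCM per frame) instead of one message per utterance, the server has to reassemble them before running STT. A minimal sketch, assuming a hypothetical protocol in which binary frames carry 16 kHz s16le audio and the app's local VAD sends a text frame `"eou"` (end of utterance); the class, the marker, and the 30 s cap are illustrative choices, not part of any existing library:

```python
from __future__ import annotations


class UtteranceAssembler:
    """Collects PCM chunks until the client signals end of utterance."""

    def __init__(self, max_seconds: float = 30.0, rate: int = 16_000):
        self._chunks: list[bytes] = []
        self._size = 0
        self._max_bytes = int(max_seconds * rate * 2)  # s16 mono

    def feed(self, message: bytes | str) -> bytes | None:
        """Return a complete utterance when one is ready, else None."""
        if isinstance(message, bytes):
            if self._size + len(message) <= self._max_bytes:  # drop audio past the cap
                self._chunks.append(message)
                self._size += len(message)
            return None
        if message == "eou" and self._chunks:
            utterance = b"".join(self._chunks)
            self._chunks.clear()
            self._size = 0
            return utterance
        return None


async def buffered_audio_bridge(websocket, pipeline):
    """`pipeline` is a hypothetical async callable: utterance PCM -> TTS PCM."""
    assembler = UtteranceAssembler()
    async for message in websocket:
        utterance = assembler.feed(message)
        if utterance is not None:
            await websocket.send(await pipeline(utterance))
```

Because `websockets.serve` calls its handler with only the connection, the extra argument can be bound with `functools.partial(buffered_audio_bridge, pipeline=...)`. The size cap keeps a stuck VAD (no `"eou"` ever sent) from growing memory without bound.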

### Connectivity Options

**Option A — ADB Port Forwarding (recommended for USB):**
```bash
# Forward Panda port 8766 to be accessible from Pixel via USB
adb reverse tcp:8766 tcp:8766
# Now Pixel app connects to ws://localhost:8766
```
Advantage: No WiFi needed, uses USB bandwidth (480 Mbps USB 2.0), zero network config.

**Option B — Direct WiFi:**
```
# Pixel connects to ws://192.168.68.57:8766
```
Advantage: No USB cable needed. Disadvantage: WiFi latency (~2-10ms LAN), and Panda needs a fixed IP (static or a DHCP reservation).

### Feasibility Assessment

| Criterion | Rating |
|-----------|--------|
| **Mic capture (Use Case 1)** | WORKS — standard AudioRecord API |
| **Call audio capture (Use Case 2)** | PARTIALLY AT BEST: `InCallService` exposes no call audio (see Approach 4), and AccessibilityService-based workarounds are unverified on Pixel 9a |
| **TTS → speaker (Use Case 1)** | WORKS — AudioTrack MODE_STREAM |
| **TTS → call uplink (Use Case 2)** | DOES NOT WORK — see Approach 4 |
| **Latency** | ~5-15ms (WebSocket over USB) + processing time |
| **Implementation complexity** | 3/5 — Android app + Python server |
| **Bidirectional** | YES — this is the only approach that handles both directions |

---

## Approach 4: Phone Call Audio Injection (The Hard Problem)

### The Fundamental Android Restriction

**Android does not allow third-party apps to inject audio into the cellular call uplink.** This is by design, enforced at the audio HAL level, and has been the case since at least Android 10.

The APIs that exist:
- `AUDIO_SOURCE_VOICE_CALL` — **capture only**, requires `CAPTURE_AUDIO_OUTPUT` (system app permission)
- `AUDIO_SOURCE_VOICE_UPLINK` / `VOICE_DOWNLINK` — **capture only**, same restriction
- `InCallService` — provides call UI, **no audio injection capability**
- `ConnectionService` — for VoIP calls only, not cellular calls
- `CallScreeningService` — call screening, no audio access

### Option 4A: BCP (Basic Call Player) — Rooted Approach

[BCP](https://github.com/chenxiaolong/BCP) by chenxiaolong demonstrates audio injection into calls:

**How it works:**
1. Uses `MODIFY_PHONE_STATE` privileged permission (system app only)
2. Creates an `AudioTrack` targeting the **telephony output device**
3. Plays audio that gets mixed into the call uplink
4. The telephony output device is used by Google Dialer for call screening

**Requirements:**
- **Rooted phone** (Magisk)
- Installed as **system app** (via Magisk module)
- Device must implement the telephony output device (Pixel devices do, for call screening)
- **Archived project** (read-only since Sep 2023)

**Pixel 9a compatibility:** Likely works — newer Pixel devices implement the telephony output device for Google Dialer call screening. Pixel 9a can be rooted via Magisk (guides exist, though Android 15 has some issues with Magisk 27-29).

**Risk assessment:**
- Rooting voids warranty
- SafetyNet/Play Integrity may fail (banking apps)
- Root detection by NPCI banking apps (UPI payments may break)
- BCP is archived, no ongoing maintenance
- Telephony output device is internal API, can break with Android updates

### Option 4B: Speakerphone Acoustic Coupling

**How it works:**
1. Put call on speakerphone
2. Play TTS audio through the Pixel's speaker
3. Mom hears it through the speakerphone mic

**Implementation:**
```bash
# During active call, set speakerphone mode via ADB
adb shell input keyevent KEYCODE_CALL  # Answer call
adb shell cmd telecom set-audio-route SPEAKER  # Speakerphone mode

# Play TTS audio file on device
adb shell am start -a android.intent.action.VIEW \
  -d file:///sdcard/tts_response.wav \
  -t audio/wav
```

**Problems:**
- Echo cancellation (AEC) actively fights this — it's designed to REMOVE speaker audio from the mic signal
- Audio quality degraded by room acoustics
- Mom hears the Pixel's speaker distortion + room reverb
- Double-talk (mom speaking while TTS plays) causes AEC to gate aggressively
- Volume control is difficult — too loud = distortion, too quiet = inaudible
- **Verdict: Unreliable and low quality**

### Option 4C: Audio Dongle / Hardware Loopback

**How it works:**
1. Connect a USB-C audio adapter to the Pixel
2. Route call audio to the adapter's output
3. Connect adapter output to Panda's audio input (for STT)
4. Route TTS from Panda's audio output to adapter's input
5. Adapter input feeds into the call uplink

**Problem:** USB-C is already used for ADB. Would need wireless ADB + USB-C audio adapter, or a USB-C hub. And Android call audio routing to USB accessories is inconsistent.

### Feasibility Assessment

| Approach | Works? | Quality | Complexity | Risk |
|----------|--------|---------|------------|------|
| BCP (rooted) | YES (Pixel only) | Good | 4/5 | High — root, archived, internal API |
| Speakerphone | PARTIALLY | Poor | 1/5 | Low but unreliable |
| Audio dongle | PARTIALLY | Medium | 4/5 | Medium — USB conflict |
| VoIP (see #5) | YES | Good | 3/5 | Low |

---

## Approach 5: VoIP/SIP Alternative

### The Core Insight

**If mom calls a VoIP number instead of the Jio SIM, we have FULL control over audio routing.** The cellular call audio injection problem disappears entirely because VoIP calls use standard audio APIs where we control both uplink and downlink.

### Architecture: FreeSWITCH + Linphone

```
┌───────────┐     GSM call     ┌──────────────┐
│  Mom's     │ ───────────────→ │ GSM Gateway  │
│  phone     │                  │ (GoIP-1)     │
│            │ ←─────────────── │ SIM: Jio     │
└───────────┘                   └──────┬───────┘
                                       │ SIP/RTP
                                       ▼
                                ┌──────────────┐
                                │ FreeSWITCH   │
                                │ (on Panda)   │
                                └──────┬───────┘
                                       │ RTP audio
                                       ▼
                                ┌──────────────┐
                                │ sip-to-ai    │
                                │ bridge       │
                                │ (on Panda)   │
                                └──────┬───────┘
                                       │ PCM
                                       ▼
                                ┌──────────────┐
                                │ STT + LLM    │
                                │ + TTS        │
                                │ (Panda+Titan)│
                                └──────────────┘
```

### Option 5A: Mom Calls Jio SIM → GSM Gateway → VoIP

**Hardware:** GoIP-1 (single-port GSM/VoIP gateway)
- Insert Jio SIM into GoIP-1 box
- Mom calls the Jio number as normal
- GoIP-1 converts cellular call to SIP/RTP
- FreeSWITCH on Panda receives SIP call
- Audio routed to Annie's STT/TTS pipeline

**Pricing:**
- GoIP-1: ~₹5,000-15,000 on AliExpress/IndiaMART
- 4G VoLTE GSM Gateway (Jio compatible): ₹15,000-30,000 (Aria Telecom)
- FreeSWITCH: free, open source

**Pros:** Mom uses same number, no behavior change, clean audio path
**Cons:** Hardware cost, GoIP reliability, Jio SIM locked in gateway (not in phone), VoLTE support varies

### Option 5B: Mom Calls VoIP Number (No GSM)

**Setup:**
1. Get a VoIP number (e.g., via VoIP.ms, sipgate, or an Indian VoIP provider)
2. Run FreeSWITCH on Panda
3. Route incoming calls to Annie's audio pipeline

**Pros:** Simplest audio path, no GSM gateway hardware
**Cons:** Mom must call a different number (or install Linphone app), no cellular fallback

### Option 5C: Linphone on Pixel + SIP Account

**Setup:**
1. Install Linphone on the Pixel 9a
2. Register with a SIP provider (or self-hosted FreeSWITCH)
3. Linphone auto-answers incoming SIP calls
4. Audio captured via Android's standard VoIP audio path

**Linphone auto-answer:** Supported. Configure in settings or via SIP `Call-Info: answer-after=0` header from FreeSWITCH.

**Pros:** Clean VoIP audio, full bidirectional control, no root needed
**Cons:** Mom needs SIP app or must call a VoIP number

### Option 5D: sip-to-ai Bridge (Direct, No Phone)

The [sip-to-ai](https://github.com/aicc2025/sip-to-ai) project is a Python asyncio bridge that connects SIP calls directly to AI voice agents:

- **Pure Python, zero C dependencies**
- G.711 μ-law @ 8kHz ↔ PCM16 conversion
- <10ms bridge-side latency
- Supports pluggable AI backends (we'd plug in our own STT/LLM/TTS)
- Apache 2.0 licensed

This could **replace the Pixel entirely** for the phone call use case — mom calls a SIP number, FreeSWITCH routes to sip-to-ai, which streams to our STT/TTS pipeline.
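
The G.711 μ-law ↔ PCM16 conversion listed above is small enough to sketch directly. This is a from-scratch implementation of standard G.711 μ-law expansion and compression (not sip-to-ai's code), plus a naive linear-interpolation 8 kHz → 16 kHz upsampler, since the STT side expects 16 kHz. Python's `audioop` module, which used to provide `ulaw2lin`, was removed in Python 3.13, so a pure-Python version like this avoids that dependency.

```python
import struct

BIAS = 0x84    # G.711 mu-law bias (132)
CLIP = 32635   # largest magnitude accepted before biasing

def ulaw_decode(code: int) -> int:
    """Expand one 8-bit mu-law code to a 16-bit linear sample."""
    u = ~code & 0xFF                        # codes are transmitted bit-inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if u & 0x80 else magnitude

def ulaw_encode(sample: int) -> int:
    """Compress one 16-bit linear sample to an 8-bit mu-law code."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000              # find the highest set bit among bits 14..7
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw8k_to_pcm16k(payload: bytes) -> bytes:
    """Decode a mu-law RTP payload and upsample 8 kHz -> 16 kHz (s16le out)."""
    s = [ulaw_decode(b) for b in payload]
    out = []
    for i, x in enumerate(s):
        nxt = s[i + 1] if i + 1 < len(s) else x  # repeat the last sample at the edge
        out += (x, (x + nxt) // 2)               # original sample, then the midpoint
    return struct.pack(f"<{len(out)}h", *out)
```

A 20 ms RTP packet at 8 kHz carries 160 μ-law bytes, which becomes 320 samples (640 bytes) at 16 kHz. Linear interpolation is adequate for a prototype; a polyphase resampler would reduce imaging artifacts. Note that both zero codes (0x7F and 0xFF) decode to 0, so encoding 0 always returns 0xFF.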

### Feasibility Assessment

| Criterion | Rating |
|-----------|--------|
| **Audio quality** | Excellent — clean digital path, no acoustic coupling |
| **Latency** | ~10-20ms (SIP bridge) + processing time |
| **Bidirectional** | YES — full control of both directions |
| **Mom's experience** | Depends on option: same number (5A) or new number (5B/5C) |
| **Implementation complexity** | 3/5 (5A with GSM gateway) or 2/5 (5D direct SIP) |
| **Phone call support** | YES — this is the entire point |

---

## Approach 6: Tasker + AutoInput

### Capabilities

- Tasker can detect incoming calls, auto-answer, toggle speakerphone
- AutoInput can interact with UI elements via accessibility
- **Cannot capture call audio** — no access to audio streams
- **Cannot inject audio into calls** — same Android restriction as all apps
- MediaProjection captures screen audio but **explicitly excludes phone call audio** on Android 10+

### What Tasker CAN Do (Useful)

```
# Auto-answer incoming call after 2 rings
Event: Phone Ringing
Action: Wait 4s → Answer Call → Set Audio Route Speaker

# Launch Annie voice app on incoming call
Event: Phone Ringing (from contact "Mom")
Action: Answer Call → Launch App "Annie Voice"
```

### Android 14/15 Issues

AutoInput's accessibility service is **unstable on Android 15** — it gets disabled multiple times per day. Workaround: Tasker profile to re-enable it every hour. This is a reliability concern for an always-on assistant.

### Feasibility Assessment

| Criterion | Rating |
|-----------|--------|
| **Call auto-answer** | WORKS |
| **Speakerphone toggle** | WORKS |
| **Call audio capture** | DOES NOT WORK |
| **Audio injection** | DOES NOT WORK |
| **Reliability (Android 15)** | POOR — accessibility service drops |
| **Implementation complexity** | 2/5 |

---

## Comparative Summary

| Approach | UC1: Voice Asst (mic→STT, TTS→speaker) | UC2: Call Audio (capture + inject) | Latency | Complexity | Recommended? |
|----------|----------------------------------------|-----------------------------------|---------|------------|-------------|
| scrcpy | MIC→HOST: yes, HOST→SPEAKER: **no** | CAPTURE: partially, INJECT: **no** | 30-55ms | 1/5 | For mic capture only |
| ADB streaming | MIC→HOST: yes, HOST→SPEAKER: **no** | **no** | 50-80ms | 2/5 | Not recommended |
| Custom app (WS) | MIC→HOST: yes, HOST→SPEAKER: **yes** | CAPTURE: partially, INJECT: **no** | 5-15ms | 3/5 | **YES — Use Case 1** |
| BCP (rooted) | N/A | INJECT: yes (Pixel only, rooted) | ~20ms | 4/5 | Risky |
| Speakerphone | N/A | INJECT: poor quality | N/A | 1/5 | Last resort |
| VoIP/SIP | Full bidirectional | Full bidirectional | 10-20ms | 3/5 | **YES — Use Case 2** |
| Tasker | Auto-answer only | **no** | N/A | 2/5 | Supplementary |

---

## Recommended Architecture

### Use Case 1: Annie as Voice Assistant (No Phone Call)

**Winner: Custom Android App with WebSocket over ADB USB reverse**

```
PIXEL 9a                                    PANDA
┌────────────────────┐                     ┌──────────────────────────┐
│  Annie Voice App   │                     │  Audio Bridge Server     │
│                    │                     │  (Python, port 8766)     │
│  AudioRecord(mic)  │──── PCM 16kHz ────→│                          │
│  16kHz, mono,      │    WebSocket        │  ┌─────────────────┐    │
│  PCM_16BIT         │    over USB         │  │IndicConformerASR│    │
│                    │   (adb reverse)     │  │  (145ms, GPU)   │    │
│  AudioTrack        │                     │  └────────┬────────┘    │
│  (MODE_STREAM)     │←── PCM 24kHz ──────│           │ text         │
│  24kHz, mono       │    WebSocket        │           ▼              │
│                    │                     │  ┌─────────────────┐    │
│  Wake word         │                     │  │ Nemotron Nano   │    │
│  detection (local) │                     │  │ (Titan via SSH) │    │
│                    │                     │  └────────┬────────┘    │
│  VAD (local)       │                     │           │ response     │
│                    │                     │           ▼              │
└────────────────────┘                     │  ┌─────────────────┐    │
                                           │  │ IndicF5 TTS     │    │
                                           │  │ (2.1s, GPU)     │    │
                                           │  └─────────────────┘    │
                                           └──────────────────────────┘
```

**Why this wins:**
1. **Bidirectional** — only approach that handles both mic→STT and TTS→speaker
2. **Lowest transport latency** — WebSocket over USB reverse (`adb reverse tcp:8766 tcp:8766`) gives ~5-15ms round-trip
3. **Standard APIs** — AudioRecord and AudioTrack are the most battle-tested Android audio APIs
4. **Local VAD** — app does Voice Activity Detection locally, only sends speech segments (saves bandwidth and STT compute)
5. **Wake word** — can run a lightweight wake word detector (e.g., Porcupine, OpenWakeWord) on-device before streaming to Panda
6. **No root** — standard app permissions suffice
7. **ADB reverse** means the app connects to `ws://localhost:8766` — no network config, no WiFi dependency
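
The local VAD gating can be prototyped on Panda before being ported to Kotlin. A minimal sketch using a plain RMS energy threshold with a hangover period; this is a deliberately simple stand-in, not WebRTC VAD or Silero (the P1 candidates), and the threshold and hangover values are hypothetical starting points to tune on real recordings:

```python
import math
import struct

FRAME_BYTES = 640  # 20 ms at 16 kHz, mono, s16le

def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of one s16le frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: 2 * n])
    return math.sqrt(sum(s * s for s in samples) / n)

class EnergyGate:
    """Send frames while speech is present, plus a short hangover afterwards.

    threshold and hangover_frames are hypothetical defaults; tune on real audio.
    """

    def __init__(self, threshold: float = 500.0, hangover_frames: int = 15):
        self.threshold = threshold
        self.hangover_frames = hangover_frames  # 15 x 20 ms = 300 ms of trailing silence
        self._silent_run = hangover_frames      # start in the "not speaking" state

    def process(self, frame: bytes) -> tuple[bool, bool]:
        """Return (send_this_frame, utterance_ended) for one frame."""
        if frame_rms(frame) >= self.threshold:
            self._silent_run = 0
            return True, False
        if self._silent_run < self.hangover_frames:
            self._silent_run += 1
            return True, self._silent_run == self.hangover_frames
        return False, False
```

The hangover keeps short pauses between words inside one utterance; when `utterance_ended` turns true, the app would signal end of speech to the server so STT can run on the collected audio.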

**Implementation plan:**
1. Android app (Kotlin): ~200-300 lines. AudioRecord → WebSocket send. WebSocket receive → AudioTrack.
2. Python server (FastAPI/websockets): ~150 lines. Receive PCM → STT → LLM → TTS → send PCM.
3. ADB reverse setup: single command, can be automated in start.sh.
4. Total: ~3-4 days development (2-3 days for the app plus ~1 day for the server).

### Use Case 2: Mom Calls Annie

**Winner: VoIP/SIP with sip-to-ai bridge (Option 5D), with GSM gateway upgrade path (Option 5A)**

#### Phase 1: Direct SIP (Immediate, No Hardware)

```
Mom installs Linphone
        │
        │ SIP call
        ▼
┌──────────────────┐
│ FreeSWITCH       │
│ (Panda)          │
│ port 5060 SIP    │
└────────┬─────────┘
         │ RTP audio (G.711 μ-law 8kHz)
         ▼
┌──────────────────┐
│ sip-to-ai bridge │
│ (adapted)        │
│ Python asyncio   │
└────────┬─────────┘
         │ PCM 16kHz
         ▼
┌──────────────────┐
│ STT → LLM → TTS │
│ (Panda + Titan)  │
└──────────────────┘
```

**Why this wins for Phase 1:**
- No hardware purchase needed
- Clean digital audio path, no acoustic coupling
- Full bidirectional audio control
- sip-to-ai bridge is Python asyncio, fits our stack perfectly
- FreeSWITCH is free, mature, well-documented
- Mom needs the Linphone app (free); WhatsApp calling does not reach FreeSWITCH and would need separate integration work (see Open Questions)

**Limitation:** Mom must call a SIP number or use Linphone. Not her regular Jio number.

#### Phase 2: GSM Gateway (When Budget Allows)

Add a GoIP-1 GSM gateway (₹5,000-15,000):
- Insert the Jio SIM into GoIP-1
- Mom calls the same Jio number she always calls
- GoIP-1 converts to SIP → FreeSWITCH → Annie pipeline
- **Zero behavior change for Mom**

**Note on Pixel 9a for calls:** With the VoIP approach, the Pixel 9a is NOT needed for the phone call use case. The Pixel is only needed for Use Case 1 (voice assistant with mic/speaker) and for other Annie tasks (ADB automation, shopping apps, UPI). This is actually cleaner — the phone is a "body" for physical world interaction, and voice calls go through a dedicated telephony path.

### Supplementary: Tasker on Pixel

Even with the custom app approach, install Tasker for:
- Auto-launching Annie app on boot
- Detecting phone state changes
- Automating settings (WiFi, Do Not Disturb, brightness)
- ADB wireless debugging re-enable after reboot (via Shizuku integration)

### What NOT to Do

1. **Don't root the Pixel 9a** — breaks SafetyNet, UPI apps (GPay, PhonePe) may stop working, and BCP is archived
2. **Don't use speakerphone acoustic coupling** — unreliable, poor quality, AEC fights it
3. **Don't try to inject audio into cellular calls** — Android's architecture fundamentally prevents this for non-system apps
4. **Don't use sndcpy/usbaudio**: both are deprecated (usbaudio relied on AOA2 audio, which Android deprecated in 8.0) and superseded by scrcpy's built-in audio forwarding
5. **Don't over-engineer the mic capture** — scrcpy `--audio-source=mic` works for quick testing, but the custom app is needed for bidirectional audio

---

## Implementation Priority

| Priority | Task | Effort | Dependency |
|----------|------|--------|------------|
| P0 | Custom Android app (AudioRecord + AudioTrack + WebSocket) | 2-3 days | Pixel 9a purchased |
| P0 | Python WebSocket audio bridge server on Panda | 1 day | STT/TTS already working |
| P1 | Wake word detection on Pixel (OpenWakeWord or Porcupine) | 1 day | P0 done |
| P1 | VAD on Pixel (WebRTC VAD or Silero) | 0.5 day | P0 done |
| P2 | FreeSWITCH + sip-to-ai on Panda | 2 days | — |
| P3 | GoIP-1 GSM gateway | 1 day setup | Hardware purchased |
| P3 | Linphone auto-answer config | 0.5 day | FreeSWITCH running |

**Total for Use Case 1 (voice assistant):** ~3-5 days
**Total for Use Case 2 (phone calls):** ~3-5 additional days

---

## Open Questions

1. **Pixel 9a scrcpy voice-call source** — does `--audio-source=voice-call-downlink` actually work on Pixel 9a? Must test after purchase. If it does, scrcpy can be used as a quick-and-dirty call audio capture while the VoIP solution is being built.

2. **IndicF5 TTS latency** — at RTF=0.808 (2.1s for 2.6s audio), the total pipeline latency is dominated by TTS. Before building the audio bridge, optimize TTS: try half-precision, streaming synthesis, or smaller voice model.

3. **WebSocket vs raw TCP** — for the custom app, WebSocket adds ~2ms framing overhead. Raw TCP socket would be marginally faster but harder to debug. WebSocket is the pragmatic choice.

4. **Audio format negotiation** — STT expects 16kHz mono, TTS outputs 24kHz mono. The bridge server handles resampling, but the Android app needs to know both rates. Use a simple handshake at WebSocket connect time.

5. **Mom's tech comfort** — will she install Linphone? If not, the GSM gateway (Phase 2) becomes P1 priority. Alternative: WhatsApp call integration (needs research — WhatsApp doesn't expose audio APIs to third parties).

6. **Battery life** — AudioRecord running continuously on Pixel 9a will drain battery. The phone should be on bypass charging (dock/cradle with pass-through power). Pixel 9a supports Adaptive Charging which helps.
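
The handshake from Open Question 4 can be as small as one JSON text frame that the server sends right after the WebSocket opens. A minimal sketch; the message type and field names are hypothetical, not an existing protocol:

```python
import json

# Hypothetical handshake schema; field names are illustrative only
SERVER_HELLO = {"type": "hello", "mic_rate": 16_000, "tts_rate": 24_000,
                "channels": 1, "encoding": "pcm_s16le"}

def make_hello() -> str:
    """Server side: text frame to send immediately after accepting a connection."""
    return json.dumps(SERVER_HELLO)

def parse_hello(text: str) -> tuple[int, int]:
    """Client side: validate the server's hello and return (mic_rate, tts_rate)."""
    msg = json.loads(text)
    if msg.get("type") != "hello":
        raise ValueError(f"expected hello, got {msg.get('type')!r}")
    if msg.get("encoding") != "pcm_s16le" or msg.get("channels") != 1:
        raise ValueError("unsupported audio format")
    return int(msg["mic_rate"]), int(msg["tts_rate"])

def chunk_bytes(rate: int, ms: int = 20, channels: int = 1) -> int:
    """Bytes in one chunk of 16-bit PCM (2 bytes per sample)."""
    return rate * ms // 1000 * channels * 2
```

On the Android side, the equivalent of `parse_hello` would run in the text-message callback before `AudioRecord` and `AudioTrack` are created, so neither rate is hard-coded in the app; at 20 ms per chunk that is 640 bytes for 16 kHz mic audio and 960 bytes for 24 kHz TTS audio.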

---

## References

- [scrcpy audio documentation](https://github.com/Genymobile/scrcpy/blob/master/doc/audio.md)
- [scrcpy PR #5870 — voice call audio sources](https://github.com/Genymobile/scrcpy/pull/5870)
- [scrcpy-with-call-audio fork](https://github.com/yNEX/scrcpy-with-call-audio)
- [Low-latency mic capture via ADB (dzx.fr)](https://dzx.fr/blog/low-latency-microphone-audio-android/)
- [audiosource — Android as USB mic](https://github.com/gdzx/audiosource)
- [avream — Android webcam+mic for Linux](https://github.com/Kacoze/avream)
- [BCP — call audio injection (archived)](https://github.com/chenxiaolong/BCP)
- [sip-to-ai — SIP to AI voice bridge](https://github.com/aicc2025/sip-to-ai)
- [Android Telecom framework](https://developer.android.com/develop/connectivity/telecom)
- [Android audio sharing/capture restrictions](https://developer.android.com/media/platform/sharing-audio-input)
- [Android InCallService API](https://developer.android.com/reference/android/telecom/InCallService)
- [WebSocket audio streaming (Canopas)](https://canopas.com/android-send-live-audio-stream-from-client-to-server-using-websocket-and-okhttp-client-ecc9f28118d9)
- [Voice AI Android app with WebSocket (WebRTC.ventures)](https://webrtc.ventures/2026/02/blog-voice-ai-android-app-gemini-prototype/)
- [FreeSWITCH + AI integration (SignalWire)](https://developer.signalwire.com/platform/integrations/freeswitch/add-ai-to-freeswitch/)
- [GSM Gateway India pricing](https://www.gsmgateway.in/latest-gsm-gateway-price-in-India-2025.html)
- [Aria 4G VoLTE GSM Gateway](https://ariatelecom.net/Jio-4G-Volte-Gateway.aspx)
- [Linphone Android auto-answer](https://github.com/BelledonneCommunications/linphone-android/issues/1989)
- [sndcpy — deprecated audio forwarding](https://github.com/rom1v/sndcpy)
