# Research: Pipecat Voice Agent PoC on DGX Spark

**Date:** 2026-02-28
**Status:** Research complete
**Context:** Voice agent PoC for her-os — Dimension 2 (Voice OS) and Dimension 1 (Personal Companion)

---

## Table of Contents

1. [Pipecat Overview & Architecture](#1-pipecat-overview--architecture)
2. [The Pipeline: STT → LLM → TTS](#2-the-pipeline-stt--llm--tts)
3. [Transport Options](#3-transport-options)
4. [Local STT: Whisper Integration](#4-local-stt-whisper-integration)
5. [Claude API Integration](#5-claude-api-integration)
6. [Local TTS: Kokoro Integration](#6-local-tts-kokoro-integration)
7. [Example Code: Minimal Voice Agent](#7-example-code-minimal-voice-agent)
8. [DGX Spark / Blackwell / aarch64 Compatibility](#8-dgx-spark--blackwell--aarch64-compatibility)
9. [Recommended Architecture for her-os](#9-recommended-architecture-for-her-os)
10. [Implementation Plan](#10-implementation-plan)
11. [Anti-Patterns & Risks](#11-anti-patterns--risks)
12. [Forked WebRTC Client](#12-forked-webrtc-client)
13. [Sources](#13-sources)

---

## 1. Pipecat Overview & Architecture

[Pipecat](https://github.com/pipecat-ai/pipecat) is an open-source Python framework by [Daily](https://www.daily.co/) for building real-time voice and multimodal conversational AI agents. It orchestrates audio, video, AI services, and conversation pipelines with ultra-low latency (500-800ms voice-to-voice round trip).

### Core Concepts

| Concept | What It Does |
|---------|-------------|
| **Frame** | Atomic unit of data flowing through the pipeline (audio frame, text frame, image frame, control frame) |
| **Processor** | Transforms frames — each service (STT, LLM, TTS) is a processor |
| **Pipeline** | Ordered chain of processors; frames flow left-to-right (downstream) or right-to-left (upstream for interruptions) |
| **Transport** | Handles I/O — receives audio from users, sends audio back. WebRTC, WebSocket, etc. |
| **PipelineRunner** | Event loop that drives the pipeline |
| **PipelineTask** | Wraps a pipeline with lifecycle management (start, stop, idle timeout) |

### How It Works

```
User speaks → Transport receives audio
  → STT converts speech to text (TranscriptionFrame)
  → Context Aggregator collects user message
  → LLM generates response (TextFrame stream)
  → TTS converts text chunks to audio (AudioRawFrame)
  → Transport sends audio back to user
```

Key design principle: **streaming everywhere**. The LLM streams tokens, and as each sentence completes, TTS begins generating audio immediately. The user hears the first words before the LLM finishes generating the full response.
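The latency win from sentence-level streaming can be illustrated with a plain-Python sketch (no Pipecat APIs; `sentence_chunks` is an illustrative helper, not part of the framework): each completed sentence is flushed to TTS as soon as it appears in the token stream, instead of waiting for the full response.

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush everything up to the last sentence-ending punctuation mark;
        # keep the unfinished remainder in the buffer.
        match = re.search(r"^(.*[.!?])\s*(.*)$", buffer, re.DOTALL)
        if match:
            yield match.group(1).strip()
            buffer = match.group(2)
    if buffer.strip():
        yield buffer.strip()

tokens = ["Hel", "lo there", ". How", " are", " you?", " I'm Annie."]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How are you?', "I'm Annie."]
```

Pipecat's TTS base class performs this kind of aggregation internally, which is why TTS can start speaking while the LLM is still generating.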

### Interruption Handling

Pipecat has built-in interruption support via Voice Activity Detection (VAD). When the user starts speaking while the bot is still talking:
1. VAD detects user speech
2. Pipeline sends upstream cancellation frames
3. TTS stops generating, transport stops playing
4. New user utterance flows through normally

This is handled automatically — no custom code needed.

---

## 2. The Pipeline: STT → LLM → TTS

The canonical Pipecat voice agent pipeline has 7 components:

```python
pipeline = Pipeline([
    transport.input(),              # 1. Receive audio from user
    stt,                            # 2. Speech-to-Text
    context_aggregator.user(),      # 3. Collect user messages into LLM context
    llm,                            # 4. Generate response
    tts,                            # 5. Text-to-Speech
    transport.output(),             # 6. Send audio back to user
    context_aggregator.assistant(), # 7. Collect assistant messages into context
])
```

### Context Management

Pipecat provides `LLMUserContextAggregator` and `LLMAssistantContextAggregator` (created as a pair via `llm.create_context_aggregator(context)`) that automatically manage the conversation context:
- User aggregator collects transcribed text and adds it to the LLM context as user messages
- Assistant aggregator collects the LLM's response and adds it as assistant messages
- The context window is maintained automatically across turns

### Voice Activity Detection (VAD)

VAD is configured on the transport, not as a separate pipeline processor:

```python
transport = SmallWebRTCTransport(
    webrtc_connection=connection,
    params=TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(
            params=VADParams(stop_secs=0.2)
        ),
    ),
)
```

Silero VAD is the default and recommended analyzer. The `stop_secs` parameter controls how long to wait after the user stops speaking before considering the utterance complete.

---

## 3. Transport Options

Pipecat supports multiple transports. Here is a comparison for self-hosted scenarios:

### 3.1 SmallWebRTCTransport (RECOMMENDED for PoC)

**What:** Peer-to-peer WebRTC with no external service dependency. Fully self-hosted.

**How it works:**
- Uses a FastAPI endpoint (`/api/offer`) for WebRTC signaling (SDP exchange)
- After signaling, audio flows directly peer-to-peer
- Includes a [prebuilt web client](https://github.com/pipecat-ai/small-webrtc-prebuilt) for testing
- No Daily account, no LiveKit server, no cloud dependency

**Install:** `pip install "pipecat-ai[webrtc]"`

**Pros:**
- Zero external dependencies — fully self-hosted
- Low latency (peer-to-peer, no relay)
- Built-in prebuilt client for rapid testing
- Handles NAT traversal with configurable STUN/TURN
- All foundational examples now default to this transport

**Cons:**
- WebRTC is complex under the hood (but Pipecat abstracts it)
- For production across different networks, may need STUN/TURN servers
- Currently no mobile SDK (web client only)

**Architecture:**
```
Browser (prebuilt client)
   ↕ WebRTC (P2P audio)
Pipecat server (FastAPI + SmallWebRTCTransport)
   ↕ Local function calls
STT / LLM / TTS services
```

### 3.2 FastAPIWebsocketTransport

**What:** WebSocket transport integrated with FastAPI. Originally designed for telephony (Twilio, Telnyx, Plivo) but usable for custom WebSocket clients.

**Install:** `pip install "pipecat-ai[websocket]"`

**Pros:**
- Simple protocol (WebSocket)
- Works with any WebSocket client
- Good for server-to-server audio connections

**Cons:**
- Primarily designed for telephony serializers (Twilio, Telnyx)
- No built-in web client for testing
- Higher latency than WebRTC for browser-based clients
- Need to build your own client

### 3.3 DailyTransport

**What:** WebRTC via Daily's cloud infrastructure.

**Pros:** Production-grade, handles TURN/STUN, recording, multi-participant
**Cons:** Requires Daily account (free tier available), cloud dependency

### 3.4 LiveKitTransport

**What:** WebRTC via LiveKit (self-hosted or cloud).

**Pros:** Open-source server, can self-host
**Cons:** Need to run LiveKit server separately, more infrastructure

### Transport Recommendation

**For the PoC: SmallWebRTCTransport.** Zero external dependencies, built-in test client, all examples use it. For production, evaluate Daily or self-hosted LiveKit based on scale needs.

---

## 4. Local STT: Whisper Integration

### 4.1 Built-in WhisperSTTService

Pipecat has a **native local Whisper service** — no API keys, no cloud, runs entirely on-device.

**Install:** `pip install "pipecat-ai[whisper]"`

**Under the hood:** Uses `faster-whisper` (CTranslate2 backend) by default.

```python
from pipecat.services.whisper import WhisperSTTService, Model
from pipecat.transcriptions.language import Language

stt = WhisperSTTService(
    model=Model.LARGE,          # TINY, BASE, SMALL, MEDIUM, LARGE, LARGE_V3_TURBO
    device="cuda",              # "cpu", "cuda", "auto"
    compute_type="float16",     # "int8", "float16", "float32", "default"
    language=Language.EN,       # 99+ languages supported
    no_speech_prob=0.4,         # Filter threshold for non-speech
)
```

**Available models:**

| Model | Size | Quality | Speed |
|-------|------|---------|-------|
| `TINY` | 39M | Low | Fastest |
| `BASE` | 74M | Fair | Fast |
| `SMALL` | 244M | Good | Moderate |
| `MEDIUM` | 769M | Very Good | Slower |
| `LARGE` | 1.5B | Best (multilingual) | Slowest |
| `LARGE_V3_TURBO` | ~800M | Excellent | Fast |
| `DISTIL_MEDIUM_EN` | ~400M | Good (English) | Fast |
| `DISTIL_LARGE_V2` | ~700M | Very Good | Moderate |

### 4.2 DGX Spark Compatibility Issue

**NOTE:** `faster-whisper` depends on CTranslate2, whose PyPI aarch64 wheels are CPU-only. In a bare-metal/venv install, `WhisperSTTService` with `device="cuda"` will therefore fail unless CTranslate2 is compiled from source with CUDA; the `mekopa/whisperx-blackwell` Docker container ships exactly such a build.

From our prior research (`RESEARCH-STT-GPU-DGX-SPARK.md`):
- CTranslate2 PyPI aarch64 wheels are CPU-only
- Docker container (mekopa/whisperx-blackwell) has CTranslate2 with CUDA — proven working

### 4.3 Solution: PyTorch Whisper on Titan

Our validated solution from Phase 0 uses `openai-whisper` directly on PyTorch (which works on Blackwell via NGC container or nightly wheels). However, Pipecat's `WhisperSTTService` wraps `faster-whisper`, not `openai-whisper`.

**Three approaches for DGX Spark:**

**Option A: Custom WhisperSTTService wrapping PyTorch Whisper (RECOMMENDED)**

Write a custom `SegmentedSTTService` subclass that uses `openai-whisper` (PyTorch-based) instead of `faster-whisper`. This is the cleanest approach since we already have PyTorch Whisper running on Titan at 62x realtime.

**Option B: Use the existing audio-pipeline as an STT API**

Our already-deployed `audio-pipeline` service on Titan runs PyTorch Whisper with diarization. We could create a thin Pipecat STT service that calls it via HTTP/WebSocket, similar to how Deepgram's service works.

**Option C: Run whisper.cpp with CUDA**

whisper.cpp with ggml supports Blackwell. Build it with CUDA and wrap it as a Pipecat service. Lower latency than PyTorch for single-utterance transcription.
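A hedged sketch of the Option C build, following the upstream whisper.cpp README (`GGML_CUDA` CMake flag; binary and sample paths may differ by release):

```shell
# Build whisper.cpp with CUDA enabled
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
sh ./models/download-ggml-model.sh large-v3-turbo
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
# Smoke test on a 16 kHz mono WAV (recent builds name the binary whisper-cli)
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f samples/jfk.wav
```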

---

## 5. Claude API Integration

### 5.1 AnthropicLLMService (Built-in)

Pipecat has **first-class Anthropic/Claude support**.

**Install:** `pip install "pipecat-ai[anthropic]"`

```python
from pipecat.services.anthropic import AnthropicLLMService

llm = AnthropicLLMService(
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    model="claude-sonnet-4-5-20250929",
)
```

### 5.2 Full Configuration

```python
llm = AnthropicLLMService(
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    model="claude-sonnet-4-5-20250929",
    params=AnthropicLLMService.InputParams(
        max_tokens=4096,
        temperature=0.7,
        # Extended thinking (optional — adds a thinking step before response)
        thinking=AnthropicLLMService.ThinkingConfig(
            type="enabled",
            budget_tokens=10000,  # min 1024
        ),
    ),
    enable_prompt_caching=True,  # Reduces cost for repeated system prompts
)
```

### 5.3 Supported Features

| Feature | Status |
|---------|--------|
| Streaming responses | Yes |
| Function calling / tools | Yes |
| Vision (image input) | Yes |
| Prompt caching | Yes |
| Extended thinking | Yes (thought frames emitted separately) |
| Runtime parameter updates | Yes (via `UpdateSettingsFrame`) |
| Custom clients (Bedrock, Vertex) | Yes (`AsyncAnthropicBedrock`, `AsyncAnthropicVertex`) |

### 5.4 Context Setup

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(
    messages=[
        {
            "role": "system",
            "content": """You are Annie, a warm and thoughtful personal AI companion.
You are having a real-time voice conversation. Keep responses concise and natural.
Avoid using special characters, markdown, or formatting — your output will be spoken aloud.
Express warmth and genuine interest in what the user shares.""",
        }
    ],
)

context_aggregator = llm.create_context_aggregator(context)
```

### 5.5 No ARM64 Issues

The Anthropic service is a pure HTTP client (using `anthropic` Python SDK). No native code, no CUDA, no architecture issues. Works identically on aarch64 and x86_64.

---

## 6. Local TTS: Kokoro Integration

### 6.1 Current State of Kokoro in Pipecat

**No official Kokoro TTS service in Pipecat core.** There are open issues ([#1445](https://github.com/pipecat-ai/pipecat/issues/1445), [#2324](https://github.com/pipecat-ai/pipecat/issues/2324)) requesting it, but no PR has been merged.

Community implementations exist:
- [kwindla/macos-local-voice-agents](https://github.com/kwindla/macos-local-voice-agents) — uses `TTSMLXIsolated` (MLX-specific, Apple Silicon only)
- [Modal low-latency voice bot](https://modal.com/blog/low-latency-voice-bot) — uses Kokoro via a custom Modal service over WebSocket

### 6.2 Best Approach: Kokoro-FastAPI + OpenAI TTS Service (RECOMMENDED)

[Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI) provides a Dockerized Kokoro server with an **OpenAI-compatible `/v1/audio/speech` endpoint**. Pipecat's `OpenAITTSService` supports custom `base_url`, so we can point it at Kokoro-FastAPI.

**The bridge:**

```python
from pipecat.services.openai import OpenAITTSService

tts = OpenAITTSService(
    api_key="not-needed",                    # Kokoro-FastAPI doesn't need a key
    base_url="http://localhost:8880/v1",      # Local Kokoro-FastAPI server
    voice="af_heart",                        # Kokoro voice ID
    model="kokoro",                          # Model identifier
    sample_rate=24000,
)
```

**Kokoro-FastAPI features:**
- Docker images with NVIDIA GPU PyTorch support (`ghcr.io/remsky/kokoro-fastapi-gpu`)
- **ARM64/multi-arch support** with baked-in models
- 35-100x realtime speed on GPU
- OpenAI-compatible API (drop-in replacement)
- 54 voices across 8 languages
- Streaming support

**Install on Titan:**
```bash
cd docker/gpu
docker compose up --build
# Serves on http://localhost:8880
```

### 6.3 Alternative: XTTS (Built-in but Abandoned)

Pipecat has a built-in `XTTSService` that connects to a locally-hosted Coqui XTTS streaming server.

```python
from pipecat.services.xtts import XTTSService

tts = XTTSService(
    voice_id="Ana Florence",
    base_url="http://localhost:8000",
    aiohttp_session=session,
)
```

**Problems:**
- Coqui has discontinued operations — no further updates
- XTTS model is larger (2GB) and slower than Kokoro (82M)
- Unknown Blackwell/aarch64 compatibility

**Verdict:** Use Kokoro-FastAPI, not XTTS.

### 6.4 Alternative: Custom TTSService Subclass

If Kokoro-FastAPI doesn't work or has latency issues, write a custom Pipecat TTS service:

```python
from pipecat.services.tts_service import TTSService
from pipecat.frames.frames import AudioRawFrame

class KokoroLocalTTSService(TTSService):
    def __init__(self, voice="af_heart", device="cuda", **kwargs):
        super().__init__(sample_rate=24000, **kwargs)
        # Load Kokoro model directly
        self._voice = voice
        self._device = device

    async def run_tts(self, text: str, context_id: str):
        """Must yield Frame objects containing synthesized audio."""
        # kokoro_generate() is a placeholder for direct Kokoro inference
        # (e.g. via the `kokoro` Python package); it is not a real import.
        audio = kokoro_generate(text, self._voice, self._device)
        yield AudioRawFrame(
            audio=audio.tobytes(),
            sample_rate=24000,
            num_channels=1,
        )
```

The base `TTSService` handles text aggregation, sentence splitting, filtering, and frame management. Subclasses only need to implement `run_tts()`.

### 6.5 DGX Spark Kokoro Status

From our prior research (`RESEARCH-TTS-GPU-DGX-SPARK.md`):
- Kokoro GPU mode fails with `nvrtc: error: invalid value for --gpu-architecture`
- **Solution proven:** Monkey-patch `TorchSTFT` to use `disable_complex=True` or patch `.abs()` on complex tensors
- CPU mode works at 2.5x realtime
- GPU mode (with patch) should achieve 35-100x realtime

**Kokoro-FastAPI likely needs the same patch for Blackwell.** Either:
1. Use the CPU Docker image (slower but works out of the box)
2. Apply the TorchSTFT patch to the GPU Docker image
3. Build a custom Docker image with the patch

---

## 7. Example Code: Minimal Voice Agent

### 7.1 The Simplest Possible Voice Agent (from Pipecat foundational examples)

```python
"""
Minimal voice agent: SmallWebRTC + Whisper STT + Claude + Kokoro TTS
Based on pipecat foundational examples with local services.
"""
import os
from dotenv import load_dotenv
from loguru import logger

from pipecat.frames.frames import EndFrame, TTSSpeakFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.openai import OpenAITTSService
from pipecat.services.whisper import WhisperSTTService, Model
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport
from pipecat.audio.vad.silero import SileroVADAnalyzer, VADParams

load_dotenv(override=True)

async def run_bot(transport, runner_args):
    # --- STT: Local Whisper ---
    stt = WhisperSTTService(
        model=Model.LARGE_V3_TURBO,
        device="cuda",
        compute_type="float16",
    )

    # --- LLM: Claude ---
    llm = AnthropicLLMService(
        api_key=os.getenv("ANTHROPIC_API_KEY"),
        model="claude-sonnet-4-5-20250929",
    )

    # --- TTS: Kokoro via OpenAI-compatible API ---
    tts = OpenAITTSService(
        api_key="not-needed",
        base_url="http://localhost:8880/v1",
        voice="af_heart",
        model="kokoro",
    )

    # --- Conversation context ---
    context = OpenAILLMContext(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are Annie, a warm personal AI companion. "
                    "Keep responses concise and conversational. "
                    "No markdown, no special characters — you are speaking aloud."
                ),
            }
        ],
    )
    context_aggregator = llm.create_context_aggregator(context)

    # --- Pipeline ---
    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
    ])

    task = PipelineTask(pipeline)

    @transport.event_handler("on_client_connected")
    async def on_connected(transport, client):
        await task.queue_frames([
            TTSSpeakFrame("Hello! I'm Annie. How can I help you today?"),
        ])

    @transport.event_handler("on_client_disconnected")
    async def on_disconnected(transport, client):
        await task.cancel()

    runner = PipelineRunner(handle_sigint=True)
    await runner.run(task)
```

### 7.2 The Runner (FastAPI + SmallWebRTC signaling)

```python
"""
run.py — FastAPI app that serves the prebuilt WebRTC client
and handles SDP signaling for SmallWebRTCTransport.
"""
import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from pipecat.transports.smallwebrtc.connection import SmallWebRTCConnection
from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport
from pipecat.transports.base_transport import TransportParams
from pipecat.audio.vad.silero import SileroVADAnalyzer

from bot import run_bot

connections = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # Cleanup connections on shutdown
    for conn in connections.values():
        await conn.close()

app = FastAPI(lifespan=lifespan)

@app.post("/api/offer")
async def offer(request: Request):
    body = await request.json()
    sdp = body.get("sdp")
    peer_id = body.get("peer_id", "default")

    connection = SmallWebRTCConnection()
    connections[peer_id] = connection

    transport = SmallWebRTCTransport(
        webrtc_connection=connection,
        params=TransportParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # Run bot in background
    asyncio.create_task(run_bot(transport, None))

    # Handle SDP offer and return answer
    answer = await connection.offer(sdp)
    return {"sdp": answer, "type": "answer"}
```

**Note:** The actual runner code in Pipecat examples uses `pipecat.runner.run.main()` which handles this automatically. The above is a simplified illustration.

### 7.3 Real-World Example: macos-local-voice-agents

The most complete local voice agent example is [kwindla/macos-local-voice-agents](https://github.com/kwindla/macos-local-voice-agents), which achieves <800ms voice-to-voice latency using:

```python
pipeline = Pipeline([
    transport.input(),       # SmallWebRTCTransport
    stt,                     # WhisperSTTServiceMLX (MLX Whisper)
    rtvi,                    # RTVI processor (transcript events)
    context_aggregator.user(),
    llm,                     # OpenAILLMService pointing at local LM Studio
    tts,                     # TTSMLXIsolated (Kokoro-82M via mlx-audio)
    transport.output(),
    context_aggregator.assistant(),
])
```

This is macOS/Apple Silicon specific (MLX), but the architecture translates directly to CUDA on DGX Spark.

---

## 8. DGX Spark / Blackwell / aarch64 Compatibility

### 8.1 Pipecat Core: No Issues

Pipecat itself is pure Python. No native code, no CUDA dependency. It runs on any platform with Python 3.10+.

**Confirmed:** Daily's team [trained Smart Turn on DGX Spark](https://www.daily.co/blog/training-smart-turn-on-the-nvidia-dgx-spark/) using PyTorch nightly with CUDA 13.0. They used separate `requirements_aarch64.txt` files for ARM-specific pinned versions.

### 8.2 Service-Level Compatibility

| Service | aarch64+Blackwell Status | Notes |
|---------|-------------------------|-------|
| **SmallWebRTCTransport** | Works | Pure Python + aiortc (no CUDA) |
| **SileroVADAnalyzer** | Works | ONNX Runtime, CPU-only |
| **WhisperSTTService** (faster-whisper) | CPU only (PyPI) / GPU (Docker) | CTranslate2 PyPI aarch64 wheels lack CUDA; Docker-compiled has CUDA |
| **PyTorch Whisper** (custom service) | GPU works | Validated on Titan at 62x RT |
| **AnthropicLLMService** | Works | Pure HTTP client |
| **OpenAITTSService** → Kokoro-FastAPI | Works | HTTP client → Docker container |
| **Kokoro-FastAPI GPU** | Needs patch | Same TorchSTFT nvrtc issue |
| **Kokoro-FastAPI CPU** | Works | Slower (2.5x RT), but functional |

### 8.3 PyTorch on DGX Spark

Required for any local GPU inference (Whisper STT or custom Kokoro TTS):

```bash
# Option 1: NGC Container (RECOMMENDED)
docker run --gpus all -it nvcr.io/nvidia/pytorch:25.11-py3

# Option 2: Nightly wheels
pip install torch==2.11.0.dev20260122 \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu130

# Option 3: Community wheels (cypheritai)
pip install torch-2.11.0a0-cp312-cp312-linux_aarch64.whl
```
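After installing, a quick probe verifies that PyTorch actually sees the Blackwell GPU (the `torch` import is guarded so the script also runs on a bare machine):

```python
import importlib.util
import platform

print("arch:", platform.machine())  # "aarch64" on DGX Spark

if importlib.util.find_spec("torch"):
    import torch
    print("torch:", torch.__version__)
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
else:
    print("torch not installed")
```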

### 8.4 The `torchcodec` Issue

Daily's Smart Turn training hit this: `torchcodec` does not provide aarch64 binaries. This affects audio processing libraries that depend on it. Workaround: use `soundfile` or `librosa` for audio I/O instead.

---

## 9. Recommended Architecture for her-os

### 9.1 PoC Architecture (Phase 1)

```
┌───────────────────────────────────────────────────┐
│                 DGX Spark (Titan)                 │
│                                                   │
│  ┌────────────────┐  ┌──────────────────────────┐ │
│  │ Kokoro-FastAPI │  │ Pipecat Voice Agent      │ │
│  │ (Docker, GPU)  │  │                          │ │
│  │ :8880          │  │ SmallWebRTCTransport in  │ │
│  │                │◄─│   ↓                      │ │
│  │ /v1/audio/     │  │ PyTorch Whisper STT      │ │
│  │   speech       │  │   ↓                      │ │
│  └────────────────┘  │ AnthropicLLMService      │ │
│                      │   ↓                      │ │
│                      │ OpenAITTSService         │ │
│                      │   → localhost:8880       │ │
│                      │   ↓                      │ │
│                      │ SmallWebRTCTransport out │ │
│                      │                          │ │
│                      │ FastAPI :7860            │ │
│                      └──────────────────────────┘ │
│                                                   │
│  ┌─────────────┐                                  │
│  │ cloudflared │──── https://voice.her-os.app     │
│  │ tunnel      │                                  │
│  └─────────────┘                                  │
└───────────────────────────────────────────────────┘

                    ↕ WebRTC (P2P audio)

┌─────────────────────────────────┐
│  Browser / Flutter app          │
│  (small-webrtc-prebuilt client) │
└─────────────────────────────────┘
```

### 9.2 Service Breakdown

| Component | Technology | Runs Where | GPU? |
|-----------|-----------|------------|------|
| Transport | SmallWebRTCTransport | Pipecat process | No |
| VAD | SileroVADAnalyzer | Pipecat process | No (ONNX CPU) |
| STT | Custom PyTorch Whisper service | Pipecat process | Yes (CUDA) |
| LLM | AnthropicLLMService | Cloud API | N/A |
| TTS | Kokoro-FastAPI (Docker) | Separate container | Yes (CUDA) |
| Signaling | FastAPI `/api/offer` | Pipecat process | No |
| Tunnel | cloudflared | Separate process | No |

### 9.3 Why This Architecture

1. **STT as custom service in-process:** Avoids network hop for STT. PyTorch Whisper is already validated on Titan.
2. **TTS as separate Docker container:** Kokoro-FastAPI is pre-packaged, OpenAI-compatible, handles its own GPU memory management. Isolates the TorchSTFT patch issue.
3. **Claude via API:** No local LLM needed for PoC. Claude is the her-os LLM strategy anyway.
4. **SmallWebRTC:** Zero infrastructure. The prebuilt client gives us an instant test UI.
5. **cloudflared tunnel:** Makes the voice agent accessible from anywhere without port forwarding.

### 9.4 Latency Budget

| Stage | Target | Notes |
|-------|--------|-------|
| Network (WebRTC) | 20-50ms | P2P, local network even better |
| VAD + turn detection | 200ms | Silero VAD stop_secs=0.2 |
| STT (Whisper large-v3-turbo) | 50-100ms | PyTorch GPU, short utterances |
| LLM (Claude API) | 200-500ms | Time to first token, streaming |
| TTS (Kokoro GPU) | 50-100ms | Streaming, first chunk |
| **Total voice-to-voice** | **520-950ms** | Target: under 1 second |
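The totals are just the column sums; a tiny check keeps the table honest when individual stage budgets change:

```python
# Stage budgets (min_ms, max_ms) from the table above.
stages = {
    "webrtc_network": (20, 50),
    "vad_turn_detection": (200, 200),
    "stt_whisper_turbo": (50, 100),
    "llm_time_to_first_token": (200, 500),
    "tts_first_chunk": (50, 100),
}
lo = sum(a for a, _ in stages.values())
hi = sum(b for _, b in stages.values())
print(f"voice-to-voice: {lo}-{hi}ms")  # → voice-to-voice: 520-950ms
```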

---

## 10. Implementation Plan

### Phase 1: Bare Minimum PoC (1-2 days)

1. **Install Pipecat on Titan:**
   ```bash
   pip install "pipecat-ai[anthropic,webrtc,silero]"
   ```

2. **Start Kokoro-FastAPI (Docker GPU):**
   ```bash
   docker pull ghcr.io/remsky/kokoro-fastapi-gpu:latest
   docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu
   # If nvrtc error → use CPU image first, patch later
   ```

3. **Write custom PyTorch Whisper STT service:**
   Subclass `SegmentedSTTService`, wrap our existing PyTorch Whisper code.

4. **Write bot.py:**
   Use the pipeline from Section 7.1 with our custom STT and Kokoro-FastAPI TTS.

5. **Write run.py:**
   FastAPI + SmallWebRTC signaling endpoint.

6. **Test:**
   Open the prebuilt WebRTC client in a browser, speak, verify round-trip.

### Phase 2: Polish (1-2 days)

7. **Patch Kokoro-FastAPI for Blackwell GPU** (if CPU mode is too slow):
   Apply the TorchSTFT `disable_complex=True` patch to the Docker image.

8. **Add cloudflared tunnel:**
   ```bash
   cloudflared tunnel --url http://localhost:7860
   ```

9. **Tune VAD parameters:**
   Adjust `stop_secs`, test with real conversations.

10. **Add conversation memory:**
    Integrate with her-os Context Engine (when built) to persist conversation context.

### Phase 3: Flutter Integration (later)

11. **WebRTC from Flutter:**
    Use `flutter_webrtc` package to connect to the Pipecat voice agent from the Nudge app.

12. **Phone call integration:**
    Replace transport with Twilio/Telnyx + FastAPIWebsocketTransport for PSTN calls.

---

## 11. Anti-Patterns & Risks

### 11.1 Risks

| Risk | Impact | Mitigation |
|------|--------|------------|
| Kokoro-FastAPI GPU fails on Blackwell | TTS fallback to CPU (2.5x RT, still usable) | Start with CPU, patch later |
| WhisperSTTService won't work on Titan (bare-metal) | CTranslate2 PyPI lacks aarch64 CUDA | Use custom PyTorch Whisper service or Docker-compiled CTranslate2 |
| Claude API latency spikes | Voice-to-voice > 1.5s | Cache common responses, use streaming aggressively |
| WebRTC NAT traversal fails remotely | Can't connect from outside LAN | Use cloudflared tunnel for HTTP signaling, TURN for media relay |
| Pipecat version churn | APIs change between versions | Pin version, read CHANGELOG before upgrading |

### 11.2 Anti-Patterns

- **Don't `pip install faster-whisper` and expect GPU on Titan.** CTranslate2 PyPI aarch64 = CPU only. Use PyTorch Whisper (bare-metal) or the Docker-compiled CTranslate2 (container).
- **Don't use DailyTransport for self-hosted PoC.** It adds a cloud dependency for no benefit at this stage.
- **Don't run the LLM locally for the PoC.** Claude API is faster and smarter than any model that fits on Titan. Local LLM is a Phase 3+ optimization.
- **Don't skip VAD.** Without VAD, the STT processes silence and hallucinates transcriptions.
- **Don't send long text chunks to TTS.** Pipecat's sentence aggregation handles this, but verify it's enabled.
- **Don't forget `no_speech_prob` filtering.** Whisper hallucinates on silence. Set threshold to 0.4-0.6.

---

## 12. Forked WebRTC Client

**Date:** 2026-03-01
**Status:** Implemented

### 12.1 What & Why

We forked Pipecat's prebuilt WebRTC client ([`small-webrtc-prebuilt/client/`](https://github.com/pipecat-ai/small-webrtc-prebuilt)) into `services/annie-voice/client/` to add per-connection LLM backend selection. The upstream client is generic — it has no mechanism to pass custom parameters like `llm_backend` through the connection request.

### 12.2 Source

- **Repo:** `https://github.com/pipecat-ai/small-webrtc-prebuilt`
- **License:** BSD 2-Clause
- **Forked directory:** `client/` only (not the Python server code)
- **Forked from:** `main` branch, `v2.2.0` (`package.json` version)

### 12.3 What We Changed

| File | Change |
|------|--------|
| `src/index.tsx` | Added `LLMBackend` state + selector UI (Claude/Ollama toggle buttons). Passes `llm_backend` through `startBotParams.requestData`. |
| `src/style.css` | Added `.llm-selector` styles (dark theme, pill toggle). |
| `package.json` | Renamed to `annie-voice-client`, set `private: true`. |
| `index.html` | Changed title to "Annie Voice". |
| `vite.config.js` | Removed `/sessions` proxy (unused). |

### 12.4 How It Works

```
Browser                          Server (server.py)
┌─────────────────────┐         ┌──────────────────────┐
│ LLM Selector        │         │                      │
│ [Claude] [Ollama]   │         │ POST /start          │
│       ↓             │────────>│ body.llm_backend     │
│ ConsoleTemplate     │         │       ↓              │
│  requestData: {     │         │ run_bot(transport,    │
│    llm_backend: ... │  WebRTC │   llm_backend=...)   │
│  }                  │<=======>│                      │
└─────────────────────┘         └──────────────────────┘
```

Each WebRTC connection gets its own pipeline. Switching LLM = disconnect + select new backend + reconnect.

### 12.5 Build Instructions

```bash
cd services/annie-voice/client
npm install
npm run build          # → client/dist/
```

`server.py` serves `client/dist/` at `/client/`. If `dist/` doesn't exist, falls back to the pip-installed prebuilt (without LLM selector).

### 12.6 Maintenance Policy

- We maintain this fork locally. Do NOT submit changes upstream.
- When upgrading Pipecat or `voice-ui-kit`, check if upstream `small-webrtc-prebuilt` changed `index.tsx` and merge manually.
- Our changes are minimal (one component wrapper + CSS) — conflicts should be rare.

---

## 13. Sources

### Pipecat Documentation
- [Pipecat Introduction](https://docs.pipecat.ai/getting-started/introduction)
- [SmallWebRTCTransport](https://docs.pipecat.ai/server/services/transport/small-webrtc)
- [FastAPIWebsocketTransport](https://docs.pipecat.ai/server/services/transport/fastapi-websocket)
- [Anthropic LLM Service](https://docs.pipecat.ai/server/services/llm/anthropic)
- [Whisper STT Service](https://docs.pipecat.ai/server/services/stt/whisper)
- [OpenAI TTS Service](https://docs.pipecat.ai/server/services/tts/openai)
- [XTTS Service](https://docs.pipecat.ai/server/services/tts/xtts)
- [TTSService Base Class](https://reference-server.pipecat.ai/en/stable/api/pipecat.services.tts_service.html)

### Pipecat GitHub
- [pipecat-ai/pipecat](https://github.com/pipecat-ai/pipecat) — Main repo
- [pipecat-ai/pipecat-examples](https://github.com/pipecat-ai/pipecat-examples) — Example apps
- [pipecat-ai/small-webrtc-prebuilt](https://github.com/pipecat-ai/small-webrtc-prebuilt) — Test client
- [Foundational Example: 01-say-one-thing.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/01-say-one-thing.py)
- [Kokoro/Orpheus/CSM Support Issue #1445](https://github.com/pipecat-ai/pipecat/issues/1445)
- [Add Kokoro TTS Issue #2324](https://github.com/pipecat-ai/pipecat/issues/2324)
- [SmallWebRTC examples PR #1534](https://github.com/pipecat-ai/pipecat/pull/1534)

### Local Voice Agent Examples
- [kwindla/macos-local-voice-agents](https://github.com/kwindla/macos-local-voice-agents) — Kokoro + Whisper + local LLM on macOS
- [NVIDIA/voice-agent-examples](https://github.com/NVIDIA/voice-agent-examples) — Pipecat + Riva + NVIDIA NIMs
- [Modal: One-Second Voice-to-Voice Latency](https://modal.com/blog/low-latency-voice-bot) — Kokoro + Parakeet + Qwen

### Kokoro TTS
- [remsky/Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI) — Docker + OpenAI-compatible API
- [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) — Model weights

### DGX Spark / Blackwell
- [Training Smart Turn on DGX Spark (Daily blog)](https://www.daily.co/blog/training-smart-turn-on-the-nvidia-dgx-spark/)
- [DGX Spark aarch64 library compatibility (NVIDIA forums)](https://forums.developer.nvidia.com/t/architecture-and-library-compatibility-on-aarch64/350389)
- [natolambert/dgx-spark-setup](https://github.com/natolambert/dgx-spark-setup) — ML training setup guide

### her-os Prior Research
- `docs/RESEARCH-TTS-GPU-DGX-SPARK.md` — Kokoro GPU fix (TorchSTFT patch)
- `docs/RESEARCH-STT-GPU-DGX-SPARK.md` — PyTorch Whisper GPU validation
- `docs/RESEARCH-VOICE-CALLS.md` — Voice call infrastructure landscape
