# Research: Real-Time Conversational Voice AI (State of the Art, Early 2026)

**Date:** 2026-02-28
**Status:** Complete
**Scope:** Latency budgets, streaming architectures, turn-taking, interruption handling, filler generation, frameworks, voice-to-voice models

---

## Table of Contents

1. [Latency Budgets](#1-latency-budgets)
2. [Streaming Architectures](#2-streaming-architectures)
3. [Turn-Taking and Endpointing](#3-turn-taking-and-endpointing)
4. [Interruption Handling (Barge-In)](#4-interruption-handling-barge-in)
5. [Filler Generation and Latency Masking](#5-filler-generation-and-latency-masking)
6. [Key Frameworks](#6-key-frameworks)
7. [Voice-to-Voice Models](#7-voice-to-voice-models)
8. [Component Latency Reference Tables](#8-component-latency-reference-tables)
9. [Cost Analysis](#9-cost-analysis)
10. [Implications for her-os](#10-implications-for-her-os)

---

## 1. Latency Budgets

### Human Conversational Baseline

Human turn-taking in natural conversation happens in **200-300ms** — speakers begin formulating their response *before* the other person finishes. This is the gold standard that voice AI aspires to but cannot yet match. Key thresholds:

| Delay | Human Perception |
|-------|-----------------|
| < 300ms | Imperceptible — feels instant, like human conversation |
| 300-500ms | Acceptable — still feels natural |
| 500-800ms | Noticeable — feels slightly slow but tolerable |
| 800-1200ms | Awkward — clearly laggy, users start losing patience |
| > 1200ms | Broken — users abandon or talk over the agent |

### Production Targets (2025-2026)

**Twilio's recommended targets (November 2025):**
- Mouth-to-Ear Turn Gap: **1,115ms** (target), 1,400ms (upper limit)
- Platform Turn Gap: **885ms** (target), 1,100ms (upper limit)

**Component budget breakdown:**
| Component | Target | Upper Limit |
|-----------|--------|-------------|
| Speech-to-Text (incl. endpointing) | 350ms | 500ms |
| LLM Time-to-First-Token | 375ms | 750ms |
| Text-to-Speech Time-to-First-Audio | 100ms | 250ms |
| Network/Processing Overhead | ~150ms | ~200ms |
| **Total** | **~975ms** | **~1,700ms** |

**Key insight:** The LLM TTFT dominates the budget. But endpointing (deciding the user is *done* speaking) is actually the most impactful — poor endpointing adds 500ms+ of dead air or causes premature cutoffs. As AssemblyAI argues, developers have "overoptimized on latency to address the underlying problem that is endpointing" — an XY problem.
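The component budget above can be sanity-checked with a quick sum (values copied from the table; the dictionary keys are just illustrative labels):

```python
# Component latency budget from the table above, in milliseconds,
# as (target, upper_limit) pairs. Key names are illustrative.
BUDGET = {
    "stt_with_endpointing": (350, 500),
    "llm_ttft": (375, 750),
    "tts_ttfa": (100, 250),
    "network_overhead": (150, 200),
}

target = sum(t for t, _ in BUDGET.values())
upper = sum(u for _, u in BUDGET.values())
print(f"target total: {target}ms, upper limit: {upper}ms")
# target total: 975ms, upper limit: 1700ms
```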

### What "Smooth" Feels Like

The difference between "smooth" and "robotic" is not just raw latency — it's **perceived latency**:
- Expressive TTS voices feel faster than monotone ones, even at identical latency
- Filler words ("Sure, let me check...") mask processing time
- Backchanneling ("uh-huh") during user speech signals active listening
- Consistent latency feels better than variable latency (a system that's always 600ms feels smoother than one that oscillates between 300ms and 1200ms)

---

## 2. Streaming Architectures

### The Three Architecture Families

#### Architecture A: Cascading Pipeline (STT -> LLM -> TTS)

```
User speaks → [VAD] → [STT] → text → [LLM] → text → [TTS] → Agent speaks
                                 ↑                        ↑
                          wait for full         wait for full
                          transcription          response
```

- **Latency:** 800ms-2000ms (sequential, each stage waits)
- **Pros:** Simple to build, debug, swap components independently, predictable cost ($0.15/min)
- **Cons:** Highest latency, loses tone/emotion in text conversion, limited interruptibility
- **When to use:** Phone-based systems (PSTN 8kHz degrades S2S benefits), cost-sensitive deployments, maximum flexibility needed
- **Examples:** Deepgram + GPT-4.1 + Cartesia TTS; Gladia + Gemini Flash + ElevenLabs

#### Architecture B: Streaming Pipeline (modern standard)

```
User speaks → [VAD] → [Streaming STT] → partial text → [Streaming LLM] → tokens → [Streaming TTS] → audio chunks
                              ↓                              ↓                           ↓
                     emit as available              emit as available             play as available
```

- **Latency:** 500-800ms (overlapped processing)
- **Key innovation:** Data flows continuously. STT transcribes in chunks, LLM starts generating on partial input, TTS begins synthesizing before the LLM finishes.
- **Pros:** Dramatically lower perceived latency, standard production approach
- **Cons:** More complex orchestration, partial results can cause false starts
- **When to use:** Most production voice agents in 2025-2026
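The overlapped flow can be sketched with asyncio generators. All three stage functions below are stand-ins, not real provider APIs; the point is that each stage consumes the previous stage's output incrementally, so the first audio chunk can play before the LLM has finished:

```python
import asyncio

async def stt_stream(audio_chunks):
    # Stand-in streaming STT: emits partial transcripts as audio arrives.
    async for chunk in audio_chunks:
        yield f"partial:{chunk}"

async def llm_stream(text):
    # Stand-in streaming LLM: emits tokens one at a time.
    for token in f"reply to [{text}]".split():
        yield token
        await asyncio.sleep(0)  # yield control, simulating token cadence

async def tts_stream(tokens):
    # Stand-in streaming TTS: synthesizes each token as soon as it arrives.
    async for token in tokens:
        yield f"<audio:{token}>"

async def pipeline(audio_chunks):
    out = []
    transcript = ""
    async for partial in stt_stream(audio_chunks):
        transcript = partial  # production systems also feed partials forward
    async for audio in tts_stream(llm_stream(transcript)):
        out.append(audio)  # in production: play immediately, don't collect
    return out

async def mic():
    for chunk in ["hello", "world"]:
        yield chunk

result = asyncio.run(pipeline(mic()))
print(result[0])  # <audio:reply>
```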

#### Architecture C: Native Voice-to-Voice (Speech-to-Speech)

```
User speaks → [Single Multimodal Model] → Agent speaks
               (audio in, audio out)
```

- **Latency:** 160-300ms (single model, no handoffs)
- **Pros:** Lowest latency, preserves tone/emotion/prosody, natural interruption handling, full-duplex capable
- **Cons:** Expensive ($0.30-1.50+/min), opaque reasoning, limited voice customization, transcription accuracy lags specialized STT
- **When to use:** Premium web-based experiences where naturalness matters most
- **Examples:** GPT-4o Realtime, Gemini Live Native Audio, Moshi

#### Architecture D: Half-Cascade (emerging hybrid)

```
User speaks → [Audio Encoder] → embeddings → [Text LLM with audio understanding] → text → [TTS] → Agent speaks
```

- **Latency:** 190-300ms
- **How it works:** Audio is encoded into the LLM's embedding space directly (no separate STT), the LLM reasons in text, then a separate TTS generates speech.
- **Pros:** Retains tone/prosody from input, uses proven text LLM reasoning, lower cost than full S2S
- **Cons:** Still needs separate TTS, output quality limited by TTS choice
- **Examples:** Ultravox, OpenAI Realtime API (internally half-cascade), Gemini Live 2.5 Flash

### Streaming Implementation Details

**Critical streaming patterns:**

1. **Sentence-boundary chunking:** LLM output is split at sentence boundaries and fed to TTS chunk-by-chunk. The first sentence plays while subsequent sentences synthesize.

2. **Speculative prefetching (Sierra AI):** Precompute likely next steps. For known callers, load customer data before they even speak.

3. **Provider hedging (Sierra AI):** Fan requests across multiple LLM providers simultaneously; use whichever responds first.

4. **Adaptive model routing:** Route simple tasks to small fast models (Haiku, Gemini Flash-Lite), complex reasoning to larger models.

5. **Persistent connections:** Eliminate handshake overhead by maintaining persistent HTTP/WebSocket connections to all services.

6. **Audio format alignment:** Standardize on PCM 16kHz for STT, 24-48kHz for TTS. Hidden codec conversions and resampling can add 50-100ms of invisible latency.
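Pattern 1 (sentence-boundary chunking) reduces to a small buffer that releases complete sentences from a token stream. A minimal sketch, with deliberately simplified boundary rules (real systems also handle abbreviations, numbers, and minimum chunk lengths):

```python
import re

def sentence_chunks(token_stream):
    """Accumulate LLM tokens and yield complete sentences for TTS.

    Splits on ., !, or ? followed by whitespace -- a simplification;
    production chunkers need abbreviation and number handling.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        # Release every complete sentence currently in the buffer.
        while (m := re.search(r"[.!?]\s+", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()

tokens = ["Sure", ", ", "one ", "moment", ". ",
          "Pulling ", "up ", "your ", "order ", "now", "."]
print(list(sentence_chunks(tokens)))
# ['Sure, one moment.', 'Pulling up your order now.']
```

The first sentence can be handed to TTS while later tokens are still streaming in, which is what hides the LLM's tail latency.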

---

## 3. Turn-Taking and Endpointing

This is the **single most impactful factor** in making voice AI feel natural vs robotic. Getting endpointing wrong — either cutting users off or waiting too long — damages the experience more than raw latency.

### Three Approaches

#### Approach 1: Silence Detection (simplest, most common)

Wait for N milliseconds of silence after the user stops speaking.

- **Typical threshold:** 500-700ms
- **Tradeoff:** Lower threshold = faster response but more false triggers during natural pauses. Higher threshold = fewer false triggers but feels sluggish.
- **Problem:** Humans pause for 200-500ms *within* sentences (to think, breathe, find a word). A 500ms threshold will frequently cut off mid-thought.
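Silence detection reduces to a timer that resets on voice activity. A sketch, assuming boolean VAD frames at a fixed frame size (both parameters are illustrative):

```python
def silence_endpoint(vad_frames, threshold_ms=500, frame_ms=20):
    """Return the frame index where the turn ends, or None.

    vad_frames: iterable of booleans, True = speech detected in that frame.
    The turn ends after threshold_ms of continuous silence following speech.
    """
    needed = threshold_ms // frame_ms
    silent = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent = 0  # any speech resets the silence timer
        else:
            silent += 1
            if heard_speech and silent >= needed:
                return i
    return None

# 10 frames (200ms) of speech, then 30 frames (600ms) of silence:
frames = [True] * 10 + [False] * 30
print(silence_endpoint(frames))  # 34 -> fires 500ms after speech stops
```

The mid-sentence-pause problem is visible here: a 400ms thinking pause resets nothing, but a 500ms one triggers the endpoint regardless of whether the thought was complete.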

#### Approach 2: Semantic/Contextual Endpointing

Analyze the *content* of what was said to determine if a thought is complete.

**AssemblyAI Universal-Streaming** (text + audio hybrid):
- `end_of_turn_confidence_threshold`: default 0.7
- `min_end_of_turn_silence_when_confident`: default 160ms (minimum silence when model is confident turn is over)
- `max_turn_silence`: default 2400ms (fallback to silence-based)
- Analyzes syntactic completeness rather than relying on static VAD thresholds
- Emits immutable transcripts (no retroactive word changes that complicate logic)

**LiveKit End-of-Utterance (EOU) model:**
- 135M parameter transformer based on SmolLM v2
- Uses sliding context window of last 4 conversation turns
- Does not replace VAD — dynamically adjusts VAD silence timeout based on predictions
- Inference latency: ~50ms on CPU
- Results: 85% reduction in unintentional interruptions, only 3% false "turn not over" predictions
- Current limitation: English only

#### Approach 3: Prosodic/Audio Analysis (Pipecat approach)

Analyze audio features like pitch contour, speaking rate, and intonation patterns to detect "transition-relevant points" (TRPs) — moments where speakers naturally yield the floor.

- **Pros:** Language-agnostic, captures cues invisible to text analysis
- **Cons:** Performance varies significantly with accents and speech patterns

### Comparison of Turn Detection Approaches

| Aspect | Silence Detection | LiveKit (text) | Pipecat (audio) | AssemblyAI (hybrid) |
|--------|------------------|----------------|-----------------|---------------------|
| Input analyzed | Silence duration | Transcript text | Audio prosody | Text + Audio |
| VAD dependency | 100% | High | None | Low |
| Noise robustness | Poor | Poor | Poor | Good |
| Accent sensitivity | Low | Low | High | Medium |
| Response speed | Fixed | Slow (needs text) | Fast | Adaptive |
| False positive rate | High | Medium | Medium | Low |

### Key Insight: The Endpointing-Latency Tradeoff

Proper turn detection provides "far greater improvements to the user experience than incremental latency optimizations." A system that correctly detects turn completion in 160ms silence (high confidence) will feel dramatically faster than one that always waits 600ms — even if the rest of the pipeline is identical.

**Production recommendation:** Use a hybrid approach. Start with silence detection (500ms), overlay a semantic/transformer model to dynamically shorten the timeout when the model is confident the turn is complete (down to 160ms), and extend it when the user appears mid-thought.
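The hybrid recommendation can be sketched as a dynamic timeout: a confidence score from a semantic end-of-turn model (stubbed here as a plain float) shortens or extends the VAD silence threshold. The 160ms/2400ms bounds mirror the AssemblyAI defaults quoted above; the interpolation scheme and band cutoffs are our own illustration:

```python
def dynamic_silence_timeout(confidence, base_ms=500, min_ms=160, max_ms=2400):
    """Map end-of-turn confidence (0.0-1.0) to a VAD silence timeout in ms.

    High confidence -> respond after as little as 160ms of silence.
    Low confidence (user likely mid-thought) -> wait up to 2400ms.
    """
    if confidence >= 0.7:   # confident the thought is complete
        return min_ms
    if confidence <= 0.3:   # user likely mid-thought
        return max_ms
    # Linearly interpolate across the uncertain middle band.
    span = (confidence - 0.3) / 0.4
    return int(max_ms - span * (max_ms - base_ms))

print(dynamic_silence_timeout(0.9))   # 160
print(dynamic_silence_timeout(0.5))   # 1450
print(dynamic_silence_timeout(0.1))   # 2400
```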

---

## 4. Interruption Handling (Barge-In)

Barge-in — the user interrupting the AI mid-speech — is a critical feature for natural conversation. Humans interrupt constantly: to correct, to redirect, to agree, to disagree.

### Implementation Architecture

```
[Agent speaking] + [User mic open, VAD monitoring]
        |
        v
[VAD detects user speech on echo-cancelled audio]
        |
        v
[Interrupt signal via RTCDataChannel within ~300ms]
        |
        ├── Server: halt LLM generation, clear outgoing buffer
        ├── Client: flush local audio buffer, stop playback
        └── TTS: cancel pending synthesis
        |
        v
[Determine what agent actually said (word-level timestamps)]
        |
        v
[Truncate conversation context to only what was heard]
        |
        v
[Process user's interruption as new input]
```

### Critical Implementation Details

1. **Echo cancellation is mandatory.** Without it, the agent's own speech triggers VAD, creating infinite interrupt loops.

2. **Small TTS chunks (100-200ms).** Keep TTS output in small buffers so you can stop quickly without cutting off mid-word.

3. **Word-level timestamp tracking.** When interrupted, use TTS timestamps to determine which portion of the response the user actually heard. Truncate the assistant's context accordingly — only retain text matching audio that was played.

4. **Semantic interrupt classification.** Distinguish between:
   - **Genuine interruption** ("wait, stop", "no, actually...") — halt and listen
   - **Backchanneling** ("uh-huh", "right", "yeah") — continue speaking
   - **Ambient noise** — ignore
   Advanced systems use a lightweight classifier for this.

5. **Increased VAD sensitivity post-interrupt.** After an interruption, temporarily increase VAD sensitivity to catch follow-up speech quickly.
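The cancel-and-truncate sequence (details 2 and 3 above) can be sketched with asyncio. Playback, word timestamps, and the interrupt timing are all simulated; the shape of the logic is what matters:

```python
import asyncio

def truncate_to_heard(words, word_end_times_ms, played_ms):
    """Keep only the words whose audio finished playing before the interrupt."""
    heard = [w for w, t_end in zip(words, word_end_times_ms)
             if t_end <= played_ms]
    return " ".join(heard)

async def speak(words, ms_per_word=200):
    # Simulated playback; gets cancelled mid-utterance on barge-in.
    for _ in words:
        await asyncio.sleep(ms_per_word / 1000)

async def main():
    words = "your order shipped on Tuesday and should arrive Friday".split()
    end_times = [(i + 1) * 200 for i in range(len(words))]  # word end times
    speaking = asyncio.create_task(speak(words))
    await asyncio.sleep(0.65)   # user barges in at 650ms
    speaking.cancel()           # halt playback / pending TTS / LLM generation
    try:
        await speaking
    except asyncio.CancelledError:
        pass
    # Conversation context keeps only what the user actually heard:
    return truncate_to_heard(words, end_times, played_ms=650)

print(asyncio.run(main()))  # your order shipped
```

Truncating to "your order shipped" matters because the next user turn ("wait, which address?") only makes sense against what was actually spoken, not the full generated response.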

### Latency Targets for Barge-In

| Metric | Target |
|--------|--------|
| Barge-in detection (VAD) | P50 ~200ms |
| Agent audio stop | P50 ~300ms |
| Total interrupt-to-silence | < 500ms |

### Common Failure Mode

After a user interrupts even once, no further audio is heard for the rest of the call. The agent generates responses that are never spoken. Root cause: the abort/cancel signal propagation through the audio pipeline is incomplete. Fix: register an abort listener that calls `reader.cancel()`, causing `read()` to resolve with `{done: true}`.

---

## 5. Filler Generation and Latency Masking

### The Problem

Even with streaming, there's a 300-800ms gap between when a user finishes speaking and when the agent starts responding. Silence during this gap feels unnatural. Humans fill this gap with fillers, intake breaths, and "thinking sounds."

### Filler Strategies

#### Strategy 1: LLM-Generated Fillers

Prompt the LLM to begin responses with natural fillers: "um", "well", "so", "sure, let me think about that..."

- Since fillers are short (1-3 tokens), the LLM generates them almost instantly
- Chunking rules split at the filler word, allowing TTS to play "um" while the real response generates
- **Impact:** Reduces perceived latency by 50-70% depending on response length
- **Risk:** Overuse sounds robotic. Must be varied and contextually appropriate.

#### Strategy 2: Pre-Computed Acknowledgment Phrases

Cache common phrases as pre-synthesized audio:
- "Sure, let me check on that"
- "Great question"
- "One moment"
- "Let me pull that up"

Play these with zero TTS latency while the LLM processes. Sierra AI caches frequent phrases for zero playback latency.
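Strategy 2 reduces to a small cache keyed by intent, with variation to avoid the "same filler every time" anti-pattern. A sketch; the intents and phrases are illustrative, and in production each phrase would map to a pre-synthesized audio buffer rather than text:

```python
import random

# In production: each phrase maps to pre-synthesized audio for zero-TTS
# playback. Intents and phrases here are illustrative.
FILLER_CACHE = {
    "lookup":  ["Sure, let me check on that.", "Let me pull that up.", "One moment."],
    "opinion": ["Great question.", "Hmm, let me think about that."],
}

_last_used: dict[str, str] = {}

def pick_filler(intent: str) -> str:
    """Return a cached filler for the intent, never repeating the last pick."""
    options = FILLER_CACHE.get(intent, FILLER_CACHE["lookup"])
    candidates = [p for p in options if p != _last_used.get(intent)] or options
    choice = random.choice(candidates)
    _last_used[intent] = choice
    return choice

first, second = pick_filler("lookup"), pick_filler("lookup")
print(first, "|", second)  # two different lookup fillers
```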

#### Strategy 3: Context-Aware Progress Indicators

Rather than generic fillers, generate specific interim responses:
- "Let me pull up your order details" (when doing a tool call)
- "I'm looking into that now" (when the query requires reasoning)
- Keyboard typing sounds during tool use (after 3.5s of silence)

#### Strategy 4: Backchannel During User Speech

While the user is speaking, the agent produces listening signals:
- "uh-huh", "oh", "yeah", "got it", "oh no!"
- NVIDIA PersonaPlex specifically trains for this behavior
- Vapi uses a proprietary model to determine the best moment and type of backchannel

**Key distinction:** Backchanneling happens *during* user speech (signals active listening). Fillers happen *after* user speech (masks processing time). Both are essential for natural conversation.

### Anti-Patterns

- Playing "uh-huh" at fixed intervals regardless of content
- Using the same filler every time ("sure, sure, sure...")
- Generating fillers that are longer than the actual processing time (filler finishes after the real response is ready)
- Using fillers to mask genuinely slow systems instead of fixing the latency

---

## 6. Key Frameworks

### Pipecat (by Daily)

**What it is:** Open-source Python framework for building voice and multimodal conversational AI pipelines.

**Architecture:**
- Pipeline-based: series of processors that handle real-time audio, text, and video frames
- Each pipeline starts and ends with a transport node (WebRTC, WebSocket, etc.)
- Processors: AI service integrations (STT, LLM, TTS) + local processors (audio filters, text parsers)

**Key features:**
- Transport-agnostic: works with Daily.co rooms, LiveKit rooms, or custom WebSocket transports
- Modular processor chain: swap any STT/LLM/TTS provider
- Built-in VAD (Silero) with configurable speech start/end times
- Supports NVIDIA NIM containers
- Full-duplex audio support

**Turn detection:** Audio-feature based (prosody, intonation). High control but requires significant tuning. Performance varies with accents.

**Best for:** Developers who want maximum pipeline control and transport flexibility. Good for self-hosted deployments.

**Typical latency:** 500-800ms end-to-end.

**GitHub:** 10k+ stars, very active development.

### LiveKit Agents

**What it is:** Open-source framework for building real-time voice AI agents as participants in LiveKit rooms.

**Architecture:**
- Room-based: agents join LiveKit rooms as full participants alongside users
- VoicePipelineAgent: handles STT -> LLM -> TTS orchestration
- RealtimeAgent: wraps native voice-to-voice models (OpenAI Realtime, Gemini Live)
- Built-in telephony stack (SIP support)

**Key features:**
- End-of-Utterance (EOU) transformer model: 135M params, 85% fewer unintentional interruptions
- Agent Builder: prototype agents in-browser without code
- Built-in telephony: make/receive phone calls via SIP
- Python and Node.js SDKs
- Cloud deployment with managed infrastructure

**Turn detection:** Transformer-based text analysis (EOU model) + Silero VAD. Cleanest out-of-box turn detection.

**Best for:** Rapid prototyping, production deployment at scale, telephony integration. Best developer experience of the frameworks.

**Pricing:** Free tier (100 concurrent, 5,000 min/month); paid from $50/month.

**GitHub:** 5k+ stars.

### Daily

**What it is:** WebRTC infrastructure platform. Daily is the company behind Pipecat.

**Role in the stack:** Provides the real-time transport layer (WebRTC rooms) that Pipecat pipelines connect to. Not a framework itself but the infrastructure that makes Pipecat production-ready.

**Key features:**
- WebRTC rooms with recording, transcription
- Global edge network for low-latency media
- HIPAA-eligible infrastructure

### Vapi

**What it is:** Managed voice AI platform (not open-source).

**Key features:**
- Proprietary fusion model for backchannel timing
- Built-in orchestration, telephony, analytics
- $0.05/minute + component costs

**Best for:** Fast deployment without infrastructure management.

### Comparison Summary

| Feature | Pipecat | LiveKit Agents | Vapi |
|---------|---------|----------------|------|
| Open source | Yes | Yes | No |
| Transport | Daily, LiveKit, custom | LiveKit rooms | Managed |
| Turn detection | Audio-based (prosody) | Transformer (EOU) | Proprietary fusion |
| Telephony | Via Daily/Twilio | Built-in SIP | Built-in |
| Agent Builder | No | Yes (browser) | Yes |
| Language | Python | Python, Node.js | API-based |
| Self-hostable | Yes | Yes | No |
| Pipeline control | Maximum | High | Low |
| Setup complexity | High | Medium | Low |

### Framework Selection Guidance

- **Maximum control, self-hosted:** Pipecat
- **Best balance of DX and power:** LiveKit Agents
- **Fastest to production (managed):** Vapi
- **Phone-first deployment:** LiveKit Agents (native SIP) or Vapi

---

## 7. Voice-to-Voice Models

### The Paradigm Shift

Traditional pipelines (STT -> LLM -> TTS) lose information at each boundary. When speech becomes text, you lose tone, emotion, hesitation, emphasis, speaking rate, accent nuance. The LLM sees words but has no idea if the user said them with a frustrated sigh or cheerful tone. Voice-to-voice models process audio natively, preserving these paralinguistic features.

### Production Models (early 2026)

#### OpenAI gpt-realtime (GA, December 2025)

- **Architecture:** Half-cascade (audio encoder -> text reasoning -> audio decoder)
- **Latency:** ~250-300ms TTFT
- **Key features:**
  - Native audio understanding (captures laughs, tone shifts, mid-sentence language switches)
  - Remote MCP server support for tool use
  - SIP integration for phone calling
  - Image inputs alongside voice
  - BigBench Audio accuracy: 82.8% (vs 65.6% for previous model)
  - MultiChallenge instruction following: 30.5% (vs 20.6% previous)
  - 18.6pp gain in instruction following, 12.9pp gain in tool calling
- **Voices:** Multiple built-in voices; custom voices for enterprise
- **Pricing:** Higher than pipeline (~10x), but unchanged from previous model
- **Limitation:** Transcription accuracy lower than dedicated STT (Deepgram, AssemblyAI)

#### Google Gemini 2.5 Flash Native Audio (GA, December 2025)

- **Architecture:** True native audio (single model, audio-in audio-out)
- **Latency:** ~280ms TTFT
- **Key features:**
  - Native speech-to-speech translation (preserves intonation, pacing, pitch)
  - Live API for continuous streams of audio, video, and text
  - Improved function calling and instruction following
  - Available in Google AI Studio, Vertex AI, Gemini Live, Search Live
- **Differentiator:** Multilingual speech-to-speech translation with prosody preservation
- **Pricing:** Higher than pipeline but lower than OpenAI Realtime

#### Moshi (Kyutai Labs, open-source)

- **Architecture:** True native full-duplex audio (7B parameters)
- **Latency:** 160ms theoretical (80ms Mimi codec + 80ms acoustic delay), ~200ms practical on L4 GPU
- **Key innovation — Inner Monologue:** Predicts time-aligned text tokens as a prefix to audio tokens. This "thinking in text" step dramatically improves linguistic quality while maintaining audio-native processing.
- **Key innovation — Dual Streams:** Explicitly models input and output audio streams jointly. Removes the concept of "speaker turns" — handles overlap and interruptions natively.
- **Architecture details:**
  - Helium backbone: 7B params, trained on 2.1T tokens of English text
  - Mimi codec: Neural audio codec with semantic + acoustic features
  - Depth Transformer: models inter-codebook dependencies per time step
  - Temporal Transformer: models temporal dependencies across time steps
- **Limitation:** English only, reasoning quality below GPT-4o/Gemini
- **Significance:** Proves full-duplex is achievable at 200ms. NVIDIA PersonaPlex builds on this architecture.

#### NVIDIA PersonaPlex (research, 2025)

- **Architecture:** Based on Moshi, 7B parameters, Mimi encoder/decoder
- **Latency:** 170ms average across interaction types
- **Key innovation — Hybrid Prompting:** Combines voice embeddings (vocal characteristics) with text prompts (role/context) for persona control
- **Key innovation — Backchanneling:** Specifically trained to generate contextual backchannels ("uh-huh", "oh no!", "got it") at appropriate moments
- **Performance:**
  - Conversation dynamics: 94.1% success rate (smooth turn-taking)
  - Outperforms Moshi, Gemini Live, Qwen 2.5 Omni on FullDuplexBench
- **Training data:** 7,303 real conversations (1,217 hrs) + 39,322 synthetic assistant dialogues (410 hrs) + 105,410 synthetic customer service conversations (1,840 hrs)
- **Significance:** State-of-the-art in conversational dynamics. Shows that full-duplex + backchanneling + persona control is achievable in a single model.

#### Ultravox (Fixie AI, open-source)

- **Architecture:** Half-cascade (audio encoder -> text LLM -> text output, TTS separate)
- **Latency:** ~150ms TTFT (with Llama 3.1 8B on A100)
- **Key innovation — Multimodal Projector:** Converts audio directly into the LLM's embedding space without separate ASR
- **Model variants:** Llama 3.3, Gemma 3, Qwen 3 backbones
- **Languages:** 42 languages supported
- **Current state:** Audio-in, text-out (audio output planned via unit vocoder)
- **Significance:** Lightest-weight approach to audio-native understanding. Can run on a single A100.

#### Claude Voice Mode (Anthropic, beta May 2025)

- **Architecture:** Not disclosed; likely STT -> Claude -> TTS pipeline
- **Voices:** 5 options (Buttery, Airy, Mellow, Glassy, Rounded)
- **Model:** Claude Sonnet 4 by default
- **Integration:** Google Calendar, Gmail, Google Docs via voice
- **Upcoming:** Offline voice packs (Q1 2026) for on-device processing
- **Significance:** Brings Claude's reasoning quality to voice, but not a native voice-to-voice model

### Model Comparison

| Model | Architecture | Latency | Full Duplex | Open Source | Production Ready |
|-------|-------------|---------|-------------|-------------|-----------------|
| gpt-realtime | Half-cascade | ~280ms | No | No | GA |
| Gemini Native Audio | Native audio | ~280ms | Partial | No | GA |
| Moshi | Native full-duplex | ~200ms | Yes | Yes | Research |
| PersonaPlex | Native full-duplex | ~170ms | Yes | No | Research |
| Ultravox | Half-cascade | ~150ms | No | Yes | Production |
| Claude Voice | Pipeline (likely) | Unknown | No | No | Beta |

### When Voice-to-Voice Changes the Game

**Voice-to-voice wins when:**
- Emotional tone matters (therapy, companionship, customer de-escalation)
- Full-duplex conversation is needed (natural back-and-forth)
- Paralinguistic features carry meaning (sarcasm, hesitation, excitement)
- Latency below 300ms is required

**Pipeline still wins when:**
- Phone/telephony deployment (8kHz degrades S2S advantages)
- Cost sensitivity ($0.15/min vs $0.30-1.50/min)
- Maximum accuracy needed (specialized STT still beats S2S transcription)
- Tool calling/function execution is complex
- Auditability required (text intermediate is inspectable)

---

## 8. Component Latency Reference Tables

### STT Providers (early 2026)

| Provider | Latency | Notes |
|----------|---------|-------|
| Deepgram Nova-3 | ~150ms (US) | 250-350ms global |
| AssemblyAI Universal-2 | 300-600ms | Best semantic endpointing |
| Groq Whisper | < 300ms | GPU-accelerated |
| WhisperX (self-hosted) | 380-520ms | On-premise option |

### LLM Time-to-First-Token

| Model | TTFT | Notes |
|-------|------|-------|
| Groq-served Llama | ~200ms | Fastest inference |
| Gemini Flash | ~300ms | Good balance |
| Claude 3.5 Haiku | ~350ms | |
| GPT-4o-mini | ~400ms | |
| Claude 3.5 Sonnet | ~800ms | Too slow for voice |

### TTS Time-to-First-Audio

| Provider | TTFA | Notes |
|----------|------|-------|
| Cartesia Sonic | 40-95ms | Purpose-built for real-time, consistent under load |
| ElevenLabs Flash v2.5 | 75ms | High quality |
| Deepgram Aura-2 | < 150ms | |
| PlayHT | ~300ms | |

### Self-Hosted GPU Capacity (concurrent streams)

| GPU | ASR Streams | LLM Concurrent | TTS Streams |
|-----|-------------|----------------|-------------|
| L4 | 50 | 20-30 | 100 |
| L40S | 100 | 50-75 | 200 |
| A100 | 100 | 75-100 | 250 |
| H100 | 200+ | 150-200 | 400+ |

---

## 9. Cost Analysis

### Per-Minute Costs by Architecture

| Architecture | Cost/min | Scales with length? |
|-------------|----------|-------------------|
| Cascading pipeline (managed) | $0.10-0.20 | No (linear) |
| Cascading pipeline (self-hosted) | $0.05-0.10 | No (linear) |
| Speech-to-speech (OpenAI/Gemini) | $0.22-0.30 baseline | Yes (exponential) |
| Speech-to-speech (5-min convo) | ~$0.30/min | |
| Speech-to-speech (30-min convo) | ~$1.50/min | Context window growth |

**Critical cost trap for S2S models:** Context accumulation. The model re-charges for all previous audio tokens on each turn. A 30-minute conversation costs ~5x per minute what a 5-minute conversation costs. One developer reported "$10 consumed during weekend integration testing."
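The accumulation trap can be made concrete: if every turn re-bills the full audio history as input tokens, the total input bill sums an arithmetic series and grows roughly quadratically with turn count. A rough model, with made-up round numbers for token rates and prices (not any provider's actual pricing):

```python
def s2s_input_cost(minutes, tokens_per_min=1000, price_per_1k_tokens=0.04,
                   turns_per_min=2):
    """Rough model of context-accumulation input cost for an S2S call.

    Every turn re-sends the full audio history, so total input tokens
    sum over an arithmetic series. All rates are illustrative only.
    """
    turns = minutes * turns_per_min
    tokens_per_turn = tokens_per_min / turns_per_min
    total_tokens = sum(t * tokens_per_turn for t in range(1, turns + 1))
    return total_tokens / 1000 * price_per_1k_tokens

for m in (5, 30):
    cost = s2s_input_cost(m)
    print(f"{m}-min call: ${cost:.2f} total, ${cost / m:.2f}/min")
# 5-min call: $1.10 total, $0.22/min
# 30-min call: $36.60 total, $1.22/min
```

Even with these toy numbers, the 30-minute call costs about 5.5x more *per minute* than the 5-minute call, matching the ~5x figure reported above.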

### Cost Breakdown (Managed Pipeline)

| Component | Cost/min |
|-----------|----------|
| ASR/STT | $0.006 |
| LLM | $0.02-0.10 |
| TTS | $0.02 |
| Orchestration (Vapi/LiveKit) | $0.05 |
| Telephony | $0.01 |
| **Total** | **$0.10-0.20** |

---

## 10. Implications for her-os

### Architecture Recommendation for Dim 2 (Voice OS)

Given her-os's goals (personal companion, local-first, Titan GPU hardware, long conversations):

**Phase 1 — Streaming Pipeline (immediate):**
- Whisper (already validated on Titan at 62x RT) -> Claude API -> Kokoro TTS (already validated)
- Add Pipecat for pipeline orchestration (maximum control, self-hosted)
- Implement streaming at every boundary
- Target: 800ms end-to-end

**Phase 2 — Enhanced Pipeline (near-term):**
- Add LiveKit EOU model for turn detection (replace silence-based endpointing)
- Add filler generation (LLM-prompted + pre-cached phrases)
- Add barge-in support with echo cancellation
- Implement backchanneling during user speech
- Target: 500-600ms end-to-end

**Phase 3 — Hybrid Voice-to-Voice (when ready):**
- Deploy Moshi or Ultravox on Titan for audio-native understanding
- Keep Kokoro TTS for output quality (half-cascade approach)
- Use full-duplex for natural turn-taking
- Target: 300ms end-to-end

### Key Decisions to Make

1. **Transport layer:** WebRTC (Pipecat/Daily) vs WebSocket vs direct audio stream from Omi wearable
2. **Turn detection:** Start with silence-based, evolve to transformer-based (LiveKit EOU or equivalent)
3. **Self-hosted vs API:** For a personal companion with long conversations, self-hosted pipeline avoids the S2S context accumulation cost trap
4. **Full-duplex priority:** PersonaPlex/Moshi show that backchanneling and overlap handling are what make AI feel truly "alive" — this should be a high-priority feature even in the pipeline architecture

### Watch List

- **Moshi improvements:** Currently English-only, limited reasoning. Watch for multilingual + stronger backbone.
- **NVIDIA PersonaPlex:** If open-sourced, could run on Titan directly. Best conversational dynamics model.
- **Ultravox audio output:** Currently text-out only. When audio output ships, becomes a strong self-hosted half-cascade option.
- **Gemini Native Audio API pricing:** If Google makes native audio affordable, it changes the cost calculus.
- **Sarvam models:** For Indian language support (Kannada/Hindi), watch Sarvam's voice models.

---

## Sources

- [AssemblyAI: The Voice AI Stack for Building Agents (2026)](https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents)
- [AssemblyAI: Turn Detection Endpointing](https://www.assemblyai.com/blog/turn-detection-endpointing-voice-agent)
- [Softcery: Real-Time vs Turn-Based Architecture](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
- [Twilio: Core Latency in AI Voice Agents](https://www.twilio.com/en-us/blog/developers/best-practices/guide-core-latency-ai-voice-agents)
- [Introl: Voice AI Infrastructure Guide](https://introl.com/blog/voice-ai-infrastructure-real-time-speech-agents-asr-tts-guide-2025)
- [Retell AI: Latency Face-Off 2025](https://www.retellai.com/resources/ai-voice-agent-latency-face-off-2025)
- [Sierra AI: Engineering Low-Latency Voice Agents](https://sierra.ai/blog/voice-latency)
- [LiveKit Blog: Transformer Turn Detection](https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection/)
- [LiveKit Agents Documentation](https://docs.livekit.io/agents/)
- [Pipecat GitHub](https://github.com/pipecat-ai/pipecat)
- [Pipecat Documentation](https://docs.pipecat.ai/getting-started/introduction)
- [OpenAI: Introducing gpt-realtime](https://openai.com/index/introducing-gpt-realtime/)
- [OpenAI: Audio Model Updates (Dec 2025)](https://developers.openai.com/blog/updates-audio-models/)
- [Google: Gemini 2.5 Native Audio Updates](https://blog.google/products-and-platforms/products/gemini/gemini-audio-model-updates/)
- [NVIDIA PersonaPlex](https://research.nvidia.com/labs/adlr/personaplex/)
- [Moshi Paper (arXiv)](https://arxiv.org/abs/2410.00037)
- [Ultravox GitHub](https://github.com/fixie-ai/ultravox)
- [Anthropic: Claude Voice Mode](https://techcrunch.com/2025/05/27/anthropic-launches-a-voice-mode-for-claude/)
- [Gustavo Garcia: Framework Comparison](https://medium.com/@ggarciabernardo/realtime-ai-agents-frameworks-bb466ccb2a09)
- [Modal: One-Second Voice-to-Voice Latency](https://modal.com/blog/low-latency-voice-bot)
- [F22 Labs: LiveKit vs Pipecat](https://www.f22labs.com/blogs/difference-between-livekit-vs-pipecat-voice-ai-platforms/)
- [Deepgram: Low Latency Voice AI](https://deepgram.com/learn/low-latency-voice-ai)
- [Zoice: Interruption Handling](https://zoice.ai/blog/interruption-handling-in-conversational-ai/)
