# Research: NVIDIA PersonaPlex

**Date:** 2026-03-01
**Status:** Research complete
**Verdict:** Not viable on DGX Spark today. Monitor for quantized models + kernel fixes.

---

## 1. What Is NVIDIA PersonaPlex?

PersonaPlex is NVIDIA's **real-time, full-duplex speech-to-speech conversational model** released January 15, 2026. Unlike traditional cascaded voice pipelines (STT -> LLM -> TTS), PersonaPlex is a **single unified model** that listens and speaks simultaneously.

### Key Properties

| Property | Value |
|----------|-------|
| Parameters | 7 billion (7B) |
| Architecture | Based on Moshi (Kyutai Labs) + Helium LLM backbone |
| Audio codec | Mimi neural codec (encoder + decoder) |
| Sample rate | 24 kHz |
| Frame duration | 80 ms per audio frame |
| Duplex mode | Full duplex (concurrent listen + speak) |
| Voice options | 16 preset voices (NATF0-3, NATM0-3, VARF0-4, VARM0-4) |
| License (code) | MIT |
| License (weights) | NVIDIA Open Model License (commercial OK) |
| Release | January 15, 2026 |

### How It Works

1. **Mimi Speech Encoder** (ConvNet + Transformer) converts incoming user audio into token sequences
2. **Temporal Transformer + Depth Transformer** (the 7B Helium LLM) processes both user and agent audio streams simultaneously in a dual-stream configuration
3. **Mimi Speech Decoder** (Transformer + ConvNet) decodes the generated speech tokens back into audio
4. The model maintains internal state continuously -- no turn-taking required
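The steps above can be sketched as a per-frame streaming loop. Everything in this sketch is illustrative: the function names, fake token values, and state handling are stand-ins for the real components; only the data flow between the three stages and the persistent state are taken from the description above.

```python
# Illustrative per-frame loop for steps 1-4 above. All functions are
# stand-ins for the real components; only the data flow is the point.
def mimi_encode(audio_frame: bytes) -> list[int]:
    # Step 1: Mimi encoder turns an 80 ms audio frame into codec tokens.
    return [b % 2048 for b in audio_frame[:4]]

def transformer_step(user_tokens: list[int], state: list[int]):
    # Step 2: the 7B LLM consumes user tokens while emitting agent tokens;
    # its state carries the whole conversation (step 4: no turn-taking).
    state = state + user_tokens
    agent_tokens = [(t + 1) % 2048 for t in user_tokens]
    return agent_tokens, state

def mimi_decode(agent_tokens: list[int]) -> bytes:
    # Step 3: Mimi decoder turns agent tokens back into audio samples.
    return bytes(len(agent_tokens))

state: list[int] = []
for frame in (b"\x01\x02\x03\x04", b"\x05\x06\x07\x08"):
    agent_tokens, state = transformer_step(mimi_encode(frame), state)
    audio_out = mimi_decode(agent_tokens)  # plays while next frame arrives
```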

### Persona Control

PersonaPlex accepts two conditioning inputs:
- **Voice prompt** (`.pt` file): Audio tokens establishing vocal characteristics, speaking style, prosody
- **Text prompt** (string): Role definition, background, scenario context, behavioral constraints

Both prompts persist throughout the conversation, maintaining consistent persona.
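As a concrete (hypothetical) illustration, the two conditioning inputs might be assembled like this. The file path and key names are assumptions for illustration, not the documented API; only the preset name `NATF0` comes from the voice list above.

```python
# Hypothetical shape of the two conditioning inputs described above.
# The .pt path and key names are assumptions, not the real API.
persona = {
    # Voice prompt: pre-tokenized audio establishing timbre and prosody.
    "voice_prompt": "voices/NATF0.pt",
    # Text prompt: role, scenario, and behavioral constraints.
    "text_prompt": (
        "You are a calm, concise hotel concierge. Stay in character and "
        "keep responses under two sentences."
    ),
}
```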

### Full-Duplex Behavior

Unlike cascaded pipelines, PersonaPlex:
- Handles **interruptions** naturally (barge-in)
- Produces **backchannels** ("uh-huh", "mm-hmm") while user speaks
- Manages **overlapping speech** without explicit VAD
- Achieves **170ms smooth turn-taking latency** and **240ms interruption latency** (per FullDuplexBench)

### Benchmark Results (FullDuplexBench)

| Metric | Score |
|--------|-------|
| Smooth Turn Taking Success Rate | 0.908 |
| User Interruption Success Rate | 0.950 |
| Smooth Turn Taking Latency | 0.170s |
| User Interruption Latency | 0.240s |
| Speaker Similarity (WavLM) | 0.650 |
| Response Quality (GPT-4o judge) | 4.29/5 |

---

## 2. Availability: Open Source, NIM, API

### Open Source (YES)

- **Code:** [github.com/NVIDIA/personaplex](https://github.com/NVIDIA/personaplex) (MIT license)
- **Weights:** [huggingface.co/nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1) (NVIDIA Open Model License, commercial OK, requires HF agreement)
- **Server:** Built-in Python server on port 8998 (WSS), includes React web UI client

### NIM Container (NO)

As of March 2026, PersonaPlex is **not available as a NIM container**. It does not appear on [build.nvidia.com](https://build.nvidia.com). Deployment is via the GitHub repo's own server implementation only.

### Cloud API (NO -- unofficial only)

There is an unofficial third-party site (personaplex.io) but no official NVIDIA API endpoint. The model is self-hosted only.

---

## 3. System Requirements

### Officially Tested Hardware

| Hardware | Status |
|----------|--------|
| NVIDIA A100 80 GB | Tested and documented by NVIDIA |
| NVIDIA H100 | Listed as supported (Hopper) |
| NVIDIA RTX 4090 (24 GB) | Community confirmed working perfectly (~0.1s stable latency) |
| RTX PRO 6000 Blackwell (sm_120) | Works with cu128 PyTorch wheels |

### Memory Requirements

- **VRAM usage:** ~18-20 GB during active conversation on RTX 4090
- **Minimum usable:** 24 GB VRAM (RTX 3090/4090/A10G)
- **Recommended:** 40+ GB VRAM for comfortable headroom
- **CPU offload:** Available via `--cpu-offload` flag (requires `accelerate` package), but degrades real-time performance
- **FP8 quantized version:** NVIDIA working on it, targeting 16 GB VRAM (not released yet)
- **INT4/INT8 quantization:** Community attempts exist but no official support; dequantization to FP16 at load time defeats the purpose
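The VRAM figures above are consistent with simple weight-size arithmetic (weights only; the KV cache and activations add several more GB during an active conversation):

```python
# Weights-only memory for a 7B-parameter model at different precisions.
PARAMS = 7_000_000_000

fp16_gib = PARAMS * 2 / 2**30  # ~13 GiB -> ~18-20 GB observed with cache
fp8_gib = PARAMS * 1 / 2**30   # ~6.5 GiB -> plausible fit in 16 GB VRAM
print(f"FP16 weights: {fp16_gib:.1f} GiB, FP8 weights: {fp8_gib:.1f} GiB")
```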

### Software Requirements

- Python 3.12+
- PyTorch with CUDA support
- Node.js 20+ (for web UI client)
- libopus-dev (Opus audio codec)
- HF_TOKEN environment variable (for model download)
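A minimal preflight script (stdlib only) can sanity-check part of the list above. PyTorch/CUDA and libopus are deliberately omitted because probing them requires the packages to be installed already:

```python
# Preflight check for a subset of the requirements above.
import os
import shutil
import sys

checks = {
    "Python >= 3.12": sys.version_info >= (3, 12),
    "HF_TOKEN set": bool(os.environ.get("HF_TOKEN")),
    "node on PATH (for web UI)": shutil.which("node") is not None,
}

for name, ok in checks.items():
    print(f"{'ok      ' if ok else 'MISSING '} {name}")
```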

---

## 4. DGX Spark (GB10) Compatibility

### Verdict: NOT VIABLE (as of March 2026)

PersonaPlex has a **known, unresolved performance issue** on DGX Spark:

**GitHub Issue:** [NVIDIA/personaplex#3](https://github.com/NVIDIA/personaplex/issues/3) -- "PersonaPlex produces choppy/unusable audio on DGX Spark (GB10)"

### The Problem

| Metric | DGX Spark (GB10) | RTX 4090 (reference) |
|--------|-------------------|----------------------|
| GPU compute per frame | **120-135 ms** | <80 ms |
| Required for real-time | 80 ms | 80 ms |
| Latency behavior | Continuously increasing | Stable ~0.1s |
| GPU utilization | ~91% | ~80% |
| Audio quality | Choppy, unusable | Perfect |
| Real-time capable | **NO** | YES |

The GPU compute time (120-135 ms) exceeds the 80 ms real-time budget per audio frame by 50-69%. The pipeline is **GPU-bound**: audio codec processing is negligible (~0.5 ms), but model inference cannot keep up.
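The "continuously increasing latency" in the table follows directly from these numbers: each frame covers 80 ms of audio but takes 120-135 ms to compute, so the backlog grows by 40-55 ms per frame.

```python
# Why latency climbs on DGX Spark: compute per frame exceeds frame length.
FRAME_MS = 80          # audio covered by one frame
SAMPLE_RATE = 24_000   # Hz

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000  # 1920 samples

def backlog_ms(n_frames: int, compute_ms: float) -> float:
    """Accumulated latency after n frames at a given compute time/frame."""
    return max(0.0, compute_ms - FRAME_MS) * n_frames

print(backlog_ms(100, 75))   # RTX 4090-class: 0.0 -> latency stays flat
print(backlog_ms(100, 120))  # DGX Spark: 4000 ms behind after 8 s of audio
```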

### Why DGX Spark Is Slower Than RTX 4090

Despite its 128 GB of unified memory (vs 24 GB on the RTX 4090), the GB10's GPU compute throughput is insufficient for this workload. DGX Spark's Blackwell GPU (SM_121) has different compute characteristics than the desktop Ada Lovelace RTX 4090: its unified memory shares bandwidth between CPU and GPU, and the chip is tuned for efficiency and large-model capacity rather than raw single-stream inference speed.

### Failed Workarounds

All of these were attempted and none resolved the issue:
- Installing PyTorch 2.10.0+cu130 for Blackwell
- Disabling torch.compile and CUDA graphs
- Disabling Flash Attention (SDPA fallback)
- Opus compilation with intrinsic optimizations
- Various quantization approaches
- CPU offloading (makes latency worse)

### What Could Fix It

1. **FP8 quantization** -- NVIDIA is working on FP8 weights-only quantization. If compute per frame drops below 80ms, DGX Spark becomes viable. Not released yet.
2. **SM_121-optimized CUDA kernels** -- The broader DGX Spark ecosystem is still maturing. Optimized kernels for SM_121 could close the gap.
3. **Smaller model** -- A PersonaPlex-3B or similar distilled variant could fit within the compute budget. Not announced.
4. **TensorRT optimization** -- Converting to TensorRT with Blackwell-specific optimizations. No community success yet.

---

## 5. Pipecat Integration

### Fundamental Architecture Mismatch

PersonaPlex and Pipecat have a **fundamental architectural incompatibility**:

| Aspect | Pipecat | PersonaPlex |
|--------|---------|-------------|
| Architecture | Cascaded pipeline (STT -> LLM -> TTS) | Single unified model (audio-in -> audio-out) |
| Processing | Separate frame processors per stage | One model handles everything |
| Audio flow | Frames flow through discrete services | Bidirectional WebSocket audio streams |
| Duplex mode | Simulated via interruption handling | Native full duplex |
| API pattern | Service-per-component | Single WebSocket endpoint |

Pipecat is designed around the **frame processor pipeline** pattern: audio frames flow through discrete STT, LLM, and TTS services. PersonaPlex collapses all three into a single model with a WebSocket interface. There is no "STT output" or "LLM output" to tap into -- it is audio-in, audio-out.

### Can It Be Done? Yes, But It Defeats the Purpose

There are two possible integration approaches, neither ideal:

#### Approach A: PersonaPlex as a "Black Box" Service

Replace the entire STT -> LLM -> TTS pipeline with a single Pipecat processor that proxies audio to/from PersonaPlex's WebSocket server.

```python
# Conceptual sketch -- NOT production code. Import paths and frame types
# follow Pipecat's public API; the PersonaPlex wire format (raw PCM in,
# Opus out, /ws path) is assumed from the repo's own server and client.
import asyncio

import websockets
from pipecat.frames.frames import AudioRawFrame, EndFrame, Frame, StartFrame
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.ai_services import AIService


class PersonaPlexService(AIService):
    """Wraps the PersonaPlex WebSocket server as a single Pipecat processor."""

    def __init__(self, server_url: str, voice_prompt: str, text_prompt: str):
        super().__init__()
        self.server_url = server_url
        self.voice_prompt = voice_prompt
        self.text_prompt = text_prompt
        self._ws = None
        self._recv_task = None

    async def start(self, frame: StartFrame):
        # Connect to the PersonaPlex WSS server (port 8998 by default).
        self._ws = await websockets.connect(f"wss://{self.server_url}/ws")
        # Full duplex: agent audio arrives continuously, so it must be
        # read in a background task, not request/response style.
        self._recv_task = asyncio.create_task(self._receive_loop())

    async def _receive_loop(self):
        async for message in self._ws:
            # PersonaPlex streams Opus; decode to PCM before pushing
            # downstream (decoder omitted in this sketch).
            pcm = decode_opus(message)  # hypothetical helper
            await self.push_frame(
                AudioRawFrame(audio=pcm, sample_rate=24000, num_channels=1)
            )

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, AudioRawFrame):
            # Forward user audio (PCM) to PersonaPlex as-is.
            await self._ws.send(frame.audio)
        elif isinstance(frame, EndFrame):
            if self._recv_task:
                self._recv_task.cancel()
            await self._ws.close()
```

**Problems with this approach:**
- Pipecat's interrupt handling, VAD, and turn-taking logic conflict with PersonaPlex's built-in handling
- No text transcript available for logging, context injection, or tool use
- Cannot inject external knowledge (RAG, memory, function calling) into the conversation
- Loses all Pipecat middleware benefits (emotion detection, context management, etc.)
- The only value Pipecat adds is transport (WebRTC/WebSocket to client)

#### Approach B: Use PersonaPlex for Voice Only, Keep Pipecat for Orchestration

Use PersonaPlex's Mimi encoder/decoder for voice cloning and audio processing, but keep the cascaded architecture for reasoning:

1. Mimi encoder -> audio tokens (voice features)
2. Pipecat STT -> text
3. Pipecat LLM (Claude/Qwen) -> response text
4. Pipecat TTS (Kokoro) with PersonaPlex voice conditioning -> audio

**Problems:** This requires extracting Mimi from PersonaPlex and integrating it as a standalone component, which is not how the model was designed. The voice conditioning only works within the full PersonaPlex inference loop.

### Recommendation

**Do not integrate PersonaPlex with Pipecat.** They solve the same problem (voice agents) with incompatible architectures. Choose one:

- **Pipecat (current annie-voice approach):** Cascaded STT + LLM + TTS. Modular, debuggable, supports tool use, RAG, memory injection, emotion detection. Higher latency but more capable.
- **PersonaPlex (standalone):** Ultra-low latency full-duplex conversation. Natural backchanneling. But no tool use, no RAG, no external knowledge injection, limited to what the 7B model knows.

For her-os, where **memory retrieval, knowledge graph queries, and contextual awareness** are core requirements, the cascaded Pipecat architecture is the right choice. PersonaPlex cannot query Graphiti, cannot inject memory context, and cannot call tools.

---

## 6. PersonaPlex vs Current annie-voice Stack

| Capability | annie-voice (Pipecat) | PersonaPlex |
|------------|----------------------|-------------|
| Turn-taking latency | ~400-800ms (cascaded) | ~170ms (native) |
| Interruption handling | VAD-based, sometimes awkward | Native, natural |
| Backchanneling | None | Yes ("uh-huh", "mm") |
| Knowledge injection | Yes (Claude/Qwen context window) | No |
| Tool calling | Yes (via LLM) | No |
| Memory/RAG | Yes (planned Graphiti integration) | No |
| Emotion detection | Yes (SER pipeline sidecar) | No (voice only) |
| Voice quality | Kokoro TTS (high quality) | Mimi codec (good but codec-quality) |
| Voice cloning | Limited (Kokoro presets) | 16 presets + custom voice prompts |
| DGX Spark compatible | Yes (running today) | No (120-135ms/frame, needs <80ms) |
| Modular/debuggable | Yes (per-component logs) | No (single model, black box) |

---

## 7. What to Watch

1. **FP8 quantized PersonaPlex** -- NVIDIA is actively working on this. If released, re-test on DGX Spark.
2. **PersonaPlex v2** -- Potential smaller/distilled variants that fit DGX Spark's compute budget.
3. **SM_121 kernel maturity** -- As the DGX Spark ecosystem matures, optimized CUDA kernels may close the performance gap.
4. **Pipecat speech-to-speech support** -- Pipecat may add native S2S model support in the future (Moshi integration was mentioned in community discussions).
5. **PersonaPlex NIM container** -- If NVIDIA ships a NIM with TensorRT optimizations, performance characteristics could change dramatically.

---

## Sources

- [NVIDIA PersonaPlex Research Page](https://research.nvidia.com/labs/adlr/personaplex/)
- [GitHub: NVIDIA/personaplex](https://github.com/NVIDIA/personaplex)
- [HuggingFace: nvidia/personaplex-7b-v1](https://huggingface.co/nvidia/personaplex-7b-v1)
- [PersonaPlex Preprint (arXiv 2602.06053)](https://research.nvidia.com/labs/adlr/files/personaplex/personaplex_preprint.pdf)
- [GitHub Issue #3: Choppy audio on DGX Spark](https://github.com/NVIDIA/personaplex/issues/3)
- [GitHub Issue #2: Blackwell architecture instructions](https://github.com/NVIDIA/personaplex/issues/2)
- [DataCamp Tutorial: Run PersonaPlex Locally](https://www.datacamp.com/tutorial/nvidia-personaplex-tutorial)
- [DeepWiki: NVIDIA/personaplex Architecture](https://deepwiki.com/NVIDIA/personaplex)
- [MarkTechPost: PersonaPlex-7B-v1 Release](https://www.marktechpost.com/2026/01/17/nvidia-releases-personaplex-7b-v1-a-real-time-speech-to-speech-model-designed-for-natural-and-full-duplex-conversations/)
- [Towards AI: PersonaPlex Analysis](https://pub.towardsai.net/nvidia-personaplex-incredible-achievement-but-dumb-as-f-278384ac1bbe)
- [NVIDIA Voice Agent Examples (Pipecat)](https://github.com/NVIDIA/voice-agent-examples)
- [Pipecat Framework](https://github.com/pipecat-ai/pipecat)
- [HuggingFace Discussion: Quantization for 8GB GPU](https://huggingface.co/nvidia/personaplex-7b-v1/discussions/33)
