# Research: Speaker Intelligence & ASR Pipeline

> Living document — updated as new questions arise.

## Current her-os Pipeline

```
Omi wearable audio
  → Whisper (OpenAI) ─── STT: audio → words
  → WhisperX (Oxford) ── forced alignment: word-level timestamps
  → pyannote (CNRS) ──── diarization: who spoke when
  → JSONL segments with speaker labels
  → Context Engine
```

**Two-path architecture** in `services/audio-pipeline/pipeline.py`:

| Path | Latency | What it does | Uses pyannote? |
|------|---------|-------------|----------------|
| **Fast path** | ~2-3s per clip | Whisper STT + ECAPA-TDNN embedding → cosine similarity → "rajesh"/"other" | No (too slow) — uses extracted embedding model only |
| **Sweep path** | Every 60s (background) | Full pyannote diarization on accumulated session audio → N-speaker clustering → retroactive label correction | Yes — full `speaker-diarization-3.1` |

Key design choice: the embedding model used in the fast path is **extracted from inside pyannote** (`self._diarize_model.model._embedding`) — zero extra VRAM.

---

## Component Breakdown

### 1. Whisper (OpenAI)

- **What:** Speech-to-text model. Turns audio into words.
- **Version in her-os:** Custom PyTorch Whisper (`whisper_stt.py`) for Blackwell aarch64 compatibility.
- **Limitations:** No native word-level timestamps (needs forced alignment). Slower than modern alternatives (~60 RTFx).

### 2. WhisperX (Max Bain, Oxford)

- **What:** Orchestration layer that glues Whisper + forced alignment + pyannote together.
- **Role:** Adds word-level timestamps via forced alignment (wav2vec2), then calls pyannote for speaker labels, then merges via `assign_word_speakers()`.
- **Key insight:** WhisperX is NOT pyannote. It's a pipeline coordinator that uses pyannote as one of its components.
- **Repo:** https://github.com/m-bain/whisperX
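
The merge step can be sketched in plain Python: give each word the speaker whose diarization turn overlaps it most. This is a simplified stand-in for what WhisperX's `assign_word_speakers()` does, not its actual implementation:

```python
def assign_word_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most.

    words: [{"word": str, "start": float, "end": float}]
    turns: [{"speaker": str, "start": float, "end": float}]
    Simplified stand-in for WhisperX's merge step.
    """
    labeled = []
    for w in words:
        best, best_overlap = None, 0.0
        for t in turns:
            # Overlap length of [w.start, w.end] with [t.start, t.end]
            overlap = min(w["end"], t["end"]) - max(w["start"], t["start"])
            if overlap > best_overlap:
                best, best_overlap = t["speaker"], overlap
        labeled.append({**w, "speaker": best})
    return labeled

words = [{"word": "hello", "start": 0.1, "end": 0.4},
         {"word": "there", "start": 1.6, "end": 2.0}]
turns = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 1.5},
         {"speaker": "SPEAKER_01", "start": 1.5, "end": 3.0}]
print(assign_word_speakers(words, turns))
```

Words with no overlapping turn keep `speaker=None`; the real function also has to handle that case.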

### 3. pyannote (Hervé Bredin, CNRS/Paris)

- **What:** Speaker diarization — answers "who spoke when?"
- **Name:** a contraction of "Python" + "annotate" (pronounced like "piano" + "note").
- **Version in her-os:** `pyannote/speaker-diarization-3.1` (via WhisperX wrapper).
- **HuggingFace:** https://huggingface.co/pyannote
- **Licensing:** Dual model:
  - **Free/open-source:** `speaker-diarization-3.1`, `community-1` (pyannote.audio 4.0). Requires HF token + license acceptance.
  - **Premium/paid:** `precision-2` on pyannoteAI cloud.
- **Latest:** `community-1` (4.0) reportedly a significant jump over 3.1. Worth benchmarking.
- **Patches required on DGX Spark:** See `services/audio-pipeline/patches.py`:
  - PyTorch 2.6 `weights_only=True` breaks pyannote checkpoint loading.
  - pyannote's semver parser fails on NGC PEP 440 version strings (e.g., `2.6.0a0+ecf3bae`).
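
The second failure mode can be illustrated with a small normalization helper (illustrative only — the actual fix lives in `patches.py`): strip the pre-release/local-build suffix off a PEP 440 string before handing it to a strict semver parser.

```python
import re

def to_semver(version: str) -> str:
    """Reduce a PEP 440 version string (e.g. NGC's '2.6.0a0+ecf3bae')
    to the plain 'major.minor.patch' form a strict semver parser accepts."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unparseable version: {version!r}")
    return ".".join(m.groups())

print(to_semver("2.6.0a0+ecf3bae"))  # → 2.6.0
```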

### 4. ECAPA-TDNN (Speaker Embeddings)

- **What:** Neural network that converts a voice clip into a 256-dim embedding vector.
- **Role in her-os:** Powers the fast path. Rajesh's enrolled voiceprint is compared via cosine similarity.
- **Threshold:** 0.28 (works for both Omi mic and WebRTC; TV/YouTube scores 0.07-0.20, Rajesh 0.30+).
- **Source:** Extracted from pyannote's internal embedding model — no separate model load.
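
The fast-path decision reduces to a single cosine similarity against the enrolled voiceprint. A stdlib-only sketch (the toy 3-dim vectors stand in for the real 256-dim embeddings; the 0.28 threshold is the tuned value above):

```python
import math

THRESHOLD = 0.28  # tuned for Omi mic + WebRTC (see above)

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def label_speaker(clip_embedding, enrolled_embedding):
    """Fast-path decision: 'rajesh' if the clip is close enough
    to the enrolled voiceprint, else 'other'."""
    sim = cosine_similarity(clip_embedding, enrolled_embedding)
    return "rajesh" if sim >= THRESHOLD else "other"

enrolled = [0.9, 0.1, 0.4]                          # toy enrolled voiceprint
print(label_speaker([0.8, 0.2, 0.5], enrolled))     # similar direction → "rajesh"
print(label_speaker([-0.7, 0.9, -0.3], enrolled))   # dissimilar → "other"
```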

---

## Parakeet (NVIDIA + Suno.ai)

NVIDIA's ASR model family — currently #1 on HuggingFace Open ASR leaderboard.

| Variant | Params | WER | Speed (RTFx) | Notes |
|---------|--------|-----|-------------|-------|
| Parakeet-TDT 0.6B v2 | 600M | 6.05% | 3,386 | English, punctuation + timestamps |
| Parakeet-CTC 1.1B | 1.1B | — | >2,000 | Highest accuracy |
| Parakeet-TDT 0.6B v3 | 600M | — | — | Multilingual (newer) |

**Key advantages over Whisper:**
- **50x faster inference** (RTFx 3,386 vs ~60)
- **Native word-level timestamps** via TDT (Token-and-Duration Transducer) — no forced alignment pass needed
- **Lower WER** (6.05% vs Whisper large-v3's ~8-10%)
- **DGX-native** — NVIDIA model, runs great on Blackwell
- **Small footprint** — 600M params, ~1.2 GB VRAM
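
RTFx (inverse real-time factor: seconds of audio transcribed per second of compute) makes the speed claim concrete — wall-clock time is duration / RTFx. For a 60-second clip:

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """RTFx = seconds of audio transcribed per second of compute,
    so wall-clock time is duration / RTFx."""
    return audio_seconds / rtfx

whisper = processing_seconds(60, 60)      # ~1.0 s
parakeet = processing_seconds(60, 3386)   # ~0.018 s
print(f"Whisper: {whisper:.3f}s  Parakeet: {parakeet:.4f}s  "
      f"speedup: {whisper / parakeet:.0f}x")
```

3386 / 60 ≈ 56, which is where the rough "50x" figure comes from.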

**Limitation:** STT only. No diarization. Still needs pyannote (or equivalent) for speaker attribution.

**Potential her-os upgrade path:**
```
Current:   Whisper → WhisperX alignment → pyannote
Upgraded:  Parakeet TDT → pyannote (skip alignment pass, 50x faster STT)
```

**Integration notes:**
- pyannoteAI already hosts Parakeet as an STT option — they see them as complementary.
- Built on NeMo framework. Available via NIM containers or direct HuggingFace download.
- References:
  - HuggingFace: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
  - NVIDIA blog: https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/

---

## Speaker Activity Transcription (SAT) — Emerging Paradigm

### Traditional Pipeline (what we use)

```
Audio → VAD (is someone speaking?)
      → STT (what did they say?)
      → Diarization (who said it?)
```

Three separate models. Errors compound at each stage.
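
The compounding effect is easy to quantify under an independence assumption (a simplification — real stage errors are correlated): if each of three stages is right 95% of the time, the chain is right only about 86% of the time.

```python
def chain_accuracy(*stage_accuracies: float) -> float:
    """Accuracy of a pipeline where every stage must succeed,
    assuming stage errors are independent."""
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc

print(chain_accuracy(0.95, 0.95, 0.95))  # → 0.857375
```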

### End-to-End SAT Approach (emerging)

```
Audio → Single model → Speaker-attributed transcript
```

One model does VAD + STT + diarization simultaneously. No pipeline, no error compounding.

**Key components absorbed into SAT:**
- **SAD (Speaker Activity Detection):** Detects "someone is speaking from 0.5s to 3.2s" — traditionally a separate trainable model.
- **Speaker Embedding Extraction:** Voice characteristic vectors — traditionally ECAPA-TDNN or similar.
- **Clustering/Assignment:** Grouping segments by speaker — traditionally spectral clustering or EEND.

**Notable SAT research (papers only — no open weights available as of March 2026):**

| Model | Date | Result | Open weights? |
|-------|------|--------|--------------|
| **SpeakerLM** | Aug 2025 | Single MLLM does ASR + diarization + speaker recognition. cpCER 16.05 vs 23.20 cascaded SOTA (31% better). Uses SenseVoice-large encoder + Transformer projector. | No — paper only |
| **DNCASR** | Jun 2025 | Joint neural clustering + ASR for long multi-party meetings. End-to-end trainable. | No — paper only |
| **TagSpeech** | Jan 2026 | End-to-end multi-speaker ASR + diarization with speaker tags. | No — paper only |
| **SA-EEND** | Earlier | Self-Attentive End-to-End Neural Diarization — predecessor approach. | Partial (diarization only, no ASR) |

**Cloud APIs that already do single-call speaker-attributed transcription:**

| Service | What it does | Self-hosted? |
|---------|-------------|-------------|
| **pyannoteAI STT Orchestration** | Parakeet/Whisper + pyannote diarization in one API call | No — cloud only |
| **AssemblyAI Universal-2** | Real-time speaker attribution, 100+ languages | No — cloud only |
| **Google Chirp** | 100+ languages, word-level timestamps + speaker labels | No — cloud only |
| **Soniox v4** | 60+ languages, speaker diarization built-in | No — cloud only |

**Status for her-os (self-hosted, local-first):**
- Cloud APIs are production-ready TODAY but violate our privacy architecture.
- No open-source end-to-end SAT model has published weights we can run locally.
- The pipeline approach (Parakeet/Whisper + pyannote) remains the only viable self-hosted option.
- Watch for SpeakerLM weight release — it would be a game-changer if it runs on DGX Spark.

---

## Upgrade Roadmap

```
Phase 1 (now):      Whisper → WhisperX → pyannote 3.1
Phase 2 (near):     Parakeet TDT → pyannote community-1
Phase 3 (future):   End-to-end SAT model (when production-ready)
```

### Phase 2 Investigation TODOs

- [ ] Benchmark Parakeet-TDT 0.6B v2 on Titan (VRAM, latency, WER vs our Whisper)
- [ ] Test pyannote community-1 vs 3.1 (DER improvement on our data)
- [ ] Evaluate if WhisperX can be dropped entirely with Parakeet's native timestamps
- [ ] Check Parakeet NeMo vs HuggingFace deployment path on aarch64 Blackwell

---

## References

- [pyannote on HuggingFace](https://huggingface.co/pyannote)
- [pyannote speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
- [pyannote community-1 blog](https://www.pyannote.ai/blog/community-1)
- [pyannote GitHub](https://github.com/pyannote/pyannote-audio)
- [WhisperX GitHub](https://github.com/m-bain/whisperX)
- [Parakeet-TDT 0.6B v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
- [NVIDIA Parakeet blog](https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/)
- [Best STT benchmarks 2026](https://northflank.com/blog/best-open-source-speech-to-text-stt-model-in-2026-benchmarks)
- [SpeakerLM paper](https://arxiv.org/abs/2508.06372)
- [DNCASR paper (ACL 2025)](https://aclanthology.org/2025.acl-long.899.pdf)
- [TagSpeech paper](https://www.arxiv.org/pdf/2601.06896)
- [AssemblyAI: Combining ASR + Diarization](https://www.assemblyai.com/blog/combining-speech-recognition-and-diarization-in-one-end-to-end-model)
- [AssemblyAI diarization guide](https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work)
- [pyannoteAI STT Orchestration blog](https://www.pyannote.ai/blog/stt-orchestration)
- [Soniox Speech-to-Text](https://soniox.com/speech-to-text)
- [NVIDIA NeMo diarization docs](https://docs.nvidia.com/nemo-framework/user-guide/25.09/nemotoolkit/asr/speaker_diarization/intro.html)
- [Awesome Speaker Diarization (curated list)](https://wq2012.github.io/awesome-diarization/)
