# Speaker Identification Research — Omi Analysis

**Date:** 2026-03-11
**Source:** `vendor/omi-source` (BasedHardware/omi on GitHub)
**Purpose:** Understand how Omi "automatically learns and identifies different speakers"

---

## TL;DR

Omi's "automatic speaker learning" is **one-time registration + static embedding lookup**, not learning in the ML sense: the stored embedding never updates after the first registration.

---

## The 3-Step Pipeline

### Step 1 — Pyannote Diarization (always automatic, every session)
```python
# backend/diarizer/diarization.py
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=os.getenv('HUGGINGFACE_TOKEN')
).to(device)
```
Every conversation starts fresh: `SPEAKER_00`, `SPEAKER_01`, etc. No memory between sessions at this layer.

### Step 2 — Text-Based Name Detection (automatic, first conversation)
```python
# backend/utils/speaker_identification.py
SPEAKER_IDENTIFICATION_PATTERNS = {
    'en': [
        r"\b(I am|I'm|My name is|my name is)\s+([A-Z][a-zA-Z]*)\b",
        r"\b([A-Z][a-zA-Z]*)\s+is my name\b",
    ],
    'hi': [r"(मैं हूँ|मेरा नाम है)\s+([\u0900-\u097F]+)"],
    # 40+ languages
}
```
If a speaker says "I'm Alice", the system auto-suggests that name as the label for `SPEAKER_00`, and the user confirms it in the UI.
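A minimal sketch of how such patterns can be applied to one transcript segment. The helper name `detect_speaker_name` is hypothetical (it is not Omi's API), and only the two English patterns quoted above are included.

```python
import re
from typing import Optional

# English patterns copied from the excerpt above; other languages omitted.
PATTERNS = {
    "en": [
        r"\b(I am|I'm|My name is|my name is)\s+([A-Z][a-zA-Z]*)\b",
        r"\b([A-Z][a-zA-Z]*)\s+is my name\b",
    ],
}


def detect_speaker_name(text: str, lang: str = "en") -> Optional[str]:
    """Return the first self-introduced name found in `text`, or None."""
    for pattern in PATTERNS.get(lang, []):
        match = re.search(pattern, text)
        if match:
            # In both patterns the name is the last capture group.
            return match.group(match.lastindex)
    return None
```

Because the patterns accept any capitalized word, a phrase like "I'm Sorry" also yields a candidate, which is one reason the confirmation step in the UI matters.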

### Step 3 — Voice Embedding Matching (automatic, after first registration)
```python
# backend/utils/stt/speaker_embedding.py
def extract_embedding_from_bytes(audio_data: bytes, filename: str) -> np.ndarray:
    api_url = os.getenv('HOSTED_SPEAKER_EMBEDDING_API_URL')
    # `files` is not shown in this excerpt; presumably the multipart payload built from audio_data and filename
    response = requests.post(f"{api_url}/v2/embedding", files=files, timeout=300)
    return np.array(response.json()['embedding'], dtype=np.float32).reshape(1, -1)

def is_same_speaker(emb1, emb2, threshold=0.45) -> Tuple[bool, float]:
    distance = cdist(emb1, emb2, metric="cosine")[0, 0]
    return distance < threshold, distance  # 0.45 from VoxCeleb EER=2.8%
```
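The comparison itself can be restated in plain Python without SciPy. This is an illustrative sketch, not Omi's code: `cosine_distance` and `is_same_speaker_py` are hypothetical names. Cosine *distance* is `1 - cosine similarity`, so a distance threshold of 0.45 corresponds to a similarity of 0.55.

```python
import math
from typing import Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0.0 for same direction, 1.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def is_same_speaker_py(emb1: Sequence[float], emb2: Sequence[float],
                       threshold: float = 0.45) -> Tuple[bool, float]:
    """Mirror of the excerpt above: match when distance is below threshold."""
    distance = cosine_distance(emb1, emb2)
    return distance < threshold, distance
```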

---

## How the "Learning" Actually Works

1. **First conversation:** User speaks → diarized as SPEAKER_00
   - Text detection triggers (if "My name is Alice")
   - User manually confirms in naming UI
   - System extracts 10s audio sample, verifies quality (≥8s, ≥70% dominant speaker)
   - Sends to hosted API → gets float[] embedding
   - Stores embedding in Firestore per person

2. **Second conversation onwards:**
   - New diarization runs (fresh SPEAKER_00 etc.)
   - For each new speaker segment (≥2s), extract embedding
   - Cosine distance vs all stored person embeddings
   - If `distance < 0.45` → auto-suggest that person
   - User can confirm or override

3. **No updating:** The system always compares against the **first stored embedding**. No continuous learning.

---

## Sample Extraction Quality Gate

```python
# speaker_identification.py: extract_speaker_samples()
# 1. Trim audio to 10s centered on segment
# 2. Verify: ≥8 seconds of audio
# 3. Transcribe with Deepgram — must match expected text
# 4. Verify ≥70% of segment is dominant speaker (no overlap)
# 5. Store to GCS: {uid}/people_profiles/{person_id}/{filename}.wav
# 6. Extract embedding from hosted API
# 7. Store embedding in Firestore: person_data['speaker_embedding'] = [...]
# Max 1 sample per person (configurable)
```
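The duration and dominance checks (steps 2 and 4) can be sketched as below. How Omi computes the 70% figure is not shown in these notes, so this version assumes it is the target speaker's speech time divided by the window length. `passes_quality_gate` is a hypothetical name, and the Deepgram transcript check is left out.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (speaker_label, start_sec, end_sec)

MIN_DURATION_SEC = 8.0     # from the notes above
MIN_DOMINANT_RATIO = 0.70  # from the notes above


def passes_quality_gate(segments: List[Segment], target: str,
                        window_start: float, window_end: float) -> bool:
    """Apply the duration and speaker-dominance rules to one sample window."""
    duration = window_end - window_start
    if duration < MIN_DURATION_SEC:
        return False
    target_time = 0.0
    for speaker, start, end in segments:
        if speaker != target:
            continue
        # Count only the part of the segment that lies inside the window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            target_time += overlap
    return target_time / duration >= MIN_DOMINANT_RATIO
```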

---

## Key Files

| File | Purpose |
|------|---------|
| `backend/diarizer/diarization.py` | Pyannote diarization |
| `backend/utils/speaker_identification.py` | Text detection + sample extraction |
| `backend/utils/speaker_assignment.py` | Maps speaker IDs to person profiles |
| `backend/utils/stt/speaker_embedding.py` | Embedding extraction + cosine matching |
| `backend/routers/transcribe.py` | Main orchestration during real-time transcription |
| `app/lib/pages/conversation_detail/widgets/name_speaker_sheet.dart` | Flutter naming UI |

---

## Critical Assessment

| Feature | Implementation | Reality |
|---------|---------------|---------|
| Diarization | Pyannote community-1 | ✓ Production-ready |
| Text detection | Regex, 40+ languages | ✓ Works for explicit self-intro |
| Voice matching | Cosine distance < 0.45 | ✓ Accurate in clean audio |
| Auto-learning | Fixed embedding, never updates | ✗ Not real learning |
| Sample diversity | Max 1 sample per person | ⚠ Fragile to voice changes |
| Embedding API | External hosted service | ⚠ Privacy + availability risk |
| Audio storage | Google Cloud Storage | ⚠ Cloud dependency, privacy concern |

---

## Implications for her-os

### What to adopt
- **Pyannote diarization** — already in our audio-pipeline ✓
- **Multi-language text detection** — simple regex, high value, easy to add
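- **0.45 cosine-distance threshold** — a useful starting point, but it is specific to Omi's embedding model; re-tune for ECAPA-TDNN (we currently use 0.48 cosine similarity)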
- **Quality gate for samples** — 8s minimum, 70% speaker dominance check

### What to improve over Omi
1. **Multiple samples per person** — improves robustness to voice variation (emotion, illness, aging)
2. **Confidence scoring** — surface distance value, not just binary match/no-match
3. **Progressive autonomy** — low confidence → suggest, high confidence → auto-assign silently
4. **Local embedding model** — use `pyannote/wespeaker` or `speechbrain` locally (we have GPU); avoid external API dependency
5. **Embedding update** — when a high-confidence match (distance < 0.25, i.e. cosine similarity > 0.75) occurs, optionally fold the new embedding into the stored one as a running average
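The embedding update in item 5 could look roughly like this. It is a sketch under stated assumptions: `maybe_update` and `l2_normalize` are hypothetical helpers, the stored vector is treated as if it were the mean of `sample_count` earlier samples, and the result is re-normalized so cosine comparisons stay well-behaved.

```python
import math
from typing import List, Tuple

HIGH_CONFIDENCE_DISTANCE = 0.25  # gate suggested in item 5 above


def l2_normalize(vec: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def maybe_update(stored: List[float], sample_count: int,
                 new: List[float], distance: float) -> Tuple[List[float], int]:
    """Blend `new` into `stored` only when the match is high-confidence."""
    if distance >= HIGH_CONFIDENCE_DISTANCE:
        return stored, sample_count  # too uncertain: keep the profile as-is
    if len(stored) != len(new):
        raise ValueError("embedding dimensions differ")
    # Weight the stored vector as the mean of `sample_count` samples.
    blended = [(s * sample_count + n) / (sample_count + 1)
               for s, n in zip(stored, new)]
    return l2_normalize(blended), sample_count + 1
```

Gating on a strict threshold limits drift: a borderline match never changes the profile, so one misidentification cannot pull the stored embedding toward another speaker.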

### Architecture sketch for her-os
```
Audio (BLE/Omi) → Pyannote diarize → Per-speaker audio segments
                                              ↓
                               Local embedding model (GPU)
                                              ↓
                         Compare vs person_profiles table (PostgreSQL)
                                    ↙         ↘
                          Match (< 0.45)    No match
                              ↓                 ↓
                    Auto-assign + log      Text detection?
                    confidence score          ↓ yes
                                          Suggest + user confirms
                                               ↓
                                    Extract quality sample
                                    Store embedding in DB
```

---

## her-os Current State — What Annie Actually Does

### Architecture Overview

Annie's audio-pipeline uses a **two-path diarization + identification system**:

**Fast path** (real-time, per-segment):
```
Audio chunk → Whisper STT → ECAPA-TDNN embedding → Cosine vs enrollment bank → "rajesh" or "other_X"
```

**Sweep path** (every 60s, background reconciliation):
```
Accumulated session audio → Pyannote full diarization → SPEAKER_00/01/... → Map to "rajesh"/"other_X" → Re-label existing segments
```
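These notes do not record how the sweep maps `SPEAKER_XX` labels back to `rajesh`/`other_X`. One plausible approach, shown here purely as an assumption, is a majority vote on temporal overlap with the fast-path labels. `map_sweep_labels` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Turn = Tuple[str, float, float]  # (label, start_sec, end_sec)


def map_sweep_labels(sweep: List[Turn], fast: List[Turn]) -> Dict[str, str]:
    """Map each sweep label to the fast-path label it overlaps with most."""
    overlap: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for s_label, s_start, s_end in sweep:
        for f_label, f_start, f_end in fast:
            shared = min(s_end, f_end) - max(s_start, f_start)
            if shared > 0:
                overlap[s_label][f_label] += shared
    # Sweep labels with no overlapping fast-path segment are left unmapped.
    return {s: max(votes, key=votes.get) for s, votes in overlap.items()}
```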

### What her-os Does Well (vs Omi)

| Capability | her-os | Omi |
|------------|--------|-----|
| Embedding model | **Local ECAPA-TDNN (GPU)** — zero API dependency | External hosted API |
| Privacy | **100% local** — no audio leaves the machine | Audio stored on Google Cloud |
| Cross-session memory | **YES** — `enrollment_bank.npy` persists across all sessions | YES — Firestore |
| In-session clustering | **YES** — `other_0_M`, `other_1_F` (with gender from F0 pitch) | Limited |
| Speaker feedback | **YES** — `POST /v1/feedback` corrects enrollment bank | YES — UI confirmation |
| Sweep re-diarization | **YES** — full pyannote sweep every 60s for consistency | Session-only |
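| Multilingual threshold | **0.48** cosine similarity, tuned for Kannada/English code-switching | 0.45 cosine distance (≈ 0.55 similarity, English-focused); not directly comparable across embedding models |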

### Key Code Snippets

**Enrollment bank (cross-session persistence):**
```python
# pipeline.py — loaded at startup, persisted to /data/enrollment_bank.npy
def _load_enrollment_bank(self) -> None:
    if ENROLLMENT_BANK_PATH.exists():
        self._enrollment_bank = np.load(ENROLLMENT_BANK_PATH)   # (N, 256) ECAPA-TDNN embeddings
        self._enrollment_meta = json.loads(ENROLLMENT_META_PATH.read_text())
```

**Speaker identification (cosine, local):**
```python
# pipeline.py — COSINE_THRESHOLD = 0.48 (tuned for multilingual)
def _identify_speaker(self, embedding: np.ndarray) -> tuple[str, float]:
    best_sim = max(cosine_similarity(embedding, enrolled) for enrolled in self._enrollment_bank)
    if best_sim > COSINE_THRESHOLD:
        return "rajesh", best_sim
    return "other", best_sim  # Only one enrolled identity!
```

**JSONL output stored per segment** (`speaker` is `"rajesh"`, `"other"`, or a session label such as `"other_0_M"` / `"other_1_F"`):
```json
{
  "speaker": "rajesh",
  "text": "I met Ellen today",
  "start": 0.5,
  "end": 2.3,
  "segment_id": "abc12345"
}
```

**Entity extraction uses speaker labels as context markers:**
```python
# extract.py — speaker label sent to Claude but not used for identity resolution
"[0:05] [rajesh] I met Ellen today"
"[0:10] [other_0_F] That sounds great"
# Claude extracts "Ellen" as entity but doesn't know other_0_F = Ellen
```
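A sketch of how stored JSONL segments could be rendered into the timestamped lines shown above. The function name `format_for_extraction` is hypothetical, and the `[m:ss]` format (start time floored to whole seconds) is inferred from the example lines rather than taken from `extract.py`.

```python
import json
from typing import List


def format_for_extraction(jsonl_text: str) -> List[str]:
    """Turn JSONL segments into "[m:ss] [speaker] text" prompt lines."""
    lines = []
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue  # tolerate blank lines between records
        seg = json.loads(raw)
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes}:{seconds:02d}] [{seg['speaker']}] {seg['text']}")
    return lines
```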

### PostgreSQL Schema (Current)
```sql
CREATE TABLE segments (
    segment_id TEXT PRIMARY KEY,
    speaker TEXT NOT NULL,   -- "rajesh", "other", "other_0_M"
    -- NO similarity score stored
    -- NO speaker name mapping
    -- NO person_id foreign key
);
```

---

## Gap Analysis: Omi vs her-os vs Ideal

### Feature-by-Feature Comparison

| Feature | Omi | her-os Today | Gap |
|---------|-----|-------------|-----|
| **Diarization** | Pyannote community-1 | Pyannote 3.1 (predates community-1) | ⚠ Omi on the newer model |
| **Embedding model** | External hosted API | Local ECAPA-TDNN (GPU) | ✅ her-os ahead |
| **Privacy** | Google Cloud Storage | 100% local | ✅ her-os ahead |
| **Enrolled speakers** | N people (household) | **1 person (rajesh only)** | ❌ Critical gap |
| **Speaker name resolution** | Maps SPEAKER_X → "Alice"/"Bob" | Maps to "rajesh"/"other_X" only | ❌ Critical gap |
| **Text-based name detection** | Yes (40+ languages, regex) | **None** | ❌ Missing |
| **Cross-session guest identity** | Yes — embedding per person | **No — "other" resets each session** | ❌ Critical gap |
| **Confidence score stored** | No (only the match decision is persisted) | No | — Tied |
| **Multiple samples per person** | 1 sample max | **1 enrollment bank (multi-style)** | ✅ her-os better (multi-style) |
| **Multilingual support** | 40+ languages | Tuned for Kannada/English | ✅ her-os better for our use case |
| **Speaker → entity linking** | No | **No** | — Tied (both missing) |
| **Household member registry** | Yes (Firestore per user) | **No DB table** | ❌ Missing |
| **Speaker correction UI** | Flutter naming sheet | API endpoint only, no UI | ⚠ Partial |
| **Gender detection** | No | **Yes** (F0 pitch, 1.5s+ segments) | ✅ her-os ahead |

### The 3 Critical Gaps

#### Gap 1 — Single-Person Enrollment (Most Impactful)
**What happens today:** Any voice that isn't Rajesh becomes `other`, `other_0_M`, or `other_1_F`, and these labels reset every session. Ellen, who visits twice a week, is a stranger every time.

**What Omi does:** Each person is enrolled once (manually confirmed), then auto-recognized forever.

**What we need:**
```sql
CREATE TABLE household_members (
    id UUID PRIMARY KEY,
    name TEXT NOT NULL,           -- "Ellen", "Mom", "Arjun"
    relationship TEXT,            -- "colleague", "family", "friend"
    embedding FLOAT[256],         -- L2-normalized ECAPA-TDNN; PostgreSQL ignores the declared length
    enrollment_styles JSONB,      -- {"normal": [...], "excited": [...]}
    created_at TIMESTAMPTZ
);
```
And update `_identify_speaker()` to check ALL enrolled members, not just "rajesh".
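A sketch of what the multi-member lookup might look like, assuming members are held as a name → list-of-embeddings mapping (for example, one embedding per enrollment style). The function and helper below are illustrative stand-ins, not the current `pipeline.py` code.

```python
import math
from typing import Dict, List, Sequence, Tuple

COSINE_THRESHOLD = 0.48  # similarity threshold currently used in pipeline.py


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as "no match"


def identify_speaker(embedding: Sequence[float],
                     members: Dict[str, List[Sequence[float]]]) -> Tuple[str, float]:
    """Return (name, similarity) for the best member, else ("other", best)."""
    best_name, best_sim = "other", -1.0
    for name, bank in members.items():
        for enrolled in bank:
            sim = cosine_similarity(embedding, enrolled)
            if sim > best_sim:
                best_name, best_sim = name, sim
    if best_sim > COSINE_THRESHOLD:
        return best_name, best_sim
    return "other", best_sim
```

Returning the similarity even for `"other"` keeps the score available for logging and for later storage alongside the segment.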

#### Gap 2 — No Text-Based Speaker Detection
**What happens today:** If someone says "Hi I'm Ellen, good to meet you", we transcribe the words but don't extract "Ellen" as the speaker identity.

**What Omi does:** Regex patterns in 40+ languages extract the name, auto-create a person profile suggestion.

**What we need (simple addition to audio-pipeline):**
```python
SPEAKER_NAME_PATTERNS = {
    'en': [
        r"\b(?:I am|I'm|My name is|call me)\s+([A-Z][a-zA-Z]+)\b",
        r"\b([A-Z][a-zA-Z]+)\s+(?:here|speaking)\b",  # caution: also matches "Come here"
    ],
    'kn': [r"(?:ನಾನು|ನನ್ನ ಹೆಸರು)\s+([\u0C80-\u0CFF]+)"],  # Kannada
    'hi': [r"(?:मैं हूँ|मेरा नाम)\s+([\u0900-\u097F]+)"],
}
```
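Applying these patterns might look like the sketch below. Two choices here are assumptions rather than settled design: every language's patterns are tried on each segment (speakers may code-switch mid-session), and a small reject list filters capitalized words that are not names. `detect_name` and `NOT_NAMES` are hypothetical.

```python
import re
from typing import Optional

SPEAKER_NAME_PATTERNS = {
    "en": [
        r"\b(?:I am|I'm|My name is|call me)\s+([A-Z][a-zA-Z]+)\b",
        r"\b([A-Z][a-zA-Z]+)\s+(?:here|speaking)\b",
    ],
    "kn": [r"(?:ನಾನು|ನನ್ನ ಹೆಸರು)\s+([\u0C80-\u0CFF]+)"],
    "hi": [r"(?:मैं हूँ|मेरा नाम)\s+([\u0900-\u097F]+)"],
}

# Hypothetical reject list: capitalized words the patterns can capture by accident.
NOT_NAMES = {"Come", "Right", "Over", "Sorry", "Just", "Still", "Not"}


def detect_name(text: str) -> Optional[str]:
    """Return the first plausible self-introduced name in `text`, or None."""
    for patterns in SPEAKER_NAME_PATTERNS.values():
        for pattern in patterns:
            for match in re.finditer(pattern, text):
                name = match.group(1)
                if name not in NOT_NAMES:
                    return name
    return None
```

A static reject list will never be complete, so the result is best treated as a suggestion that still needs confirmation.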

#### Gap 3 — Speaker ↔ Entity Linking
**What happens today:** Claude extracts "Ellen Sharma" as a `[person]` entity from text. The segment is labeled `other_0_F`. These two facts are never connected.

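**What we need:** When a guest introduces themselves and entity extraction identifies that same name in the segment, link the guest label to the entity (`other_0_F` → `Ellen Sharma`) and trigger enrollment if the audio sample is good enough. A name that is merely mentioned (as in "[rajesh] I met Ellen today") says nothing about who is speaking, so it must not create a link.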
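One conservative way to sketch this link, assuming it is only safe when a guest introduces themselves (a name merely mentioned by another speaker does not identify anyone). `link_speakers` and the first-name matching rule are illustrative assumptions, not an existing her-os hook.

```python
import re
from typing import Dict, List, Tuple

SELF_INTRO = re.compile(r"\b(?:I am|I'm|My name is|call me)\s+([A-Z][a-zA-Z]+)\b")


def link_speakers(segments: List[Tuple[str, str]],
                  person_entities: List[str]) -> Dict[str, str]:
    """Map guest labels to extracted person entities via self-introductions.

    segments: (speaker_label, text) pairs in session order.
    person_entities: person names produced by entity extraction.
    """
    links: Dict[str, str] = {}
    for label, text in segments:
        if not label.startswith("other"):
            continue  # enrolled speakers already have an identity
        match = SELF_INTRO.search(text)
        if not match:
            continue
        first_name = match.group(1)
        for entity in person_entities:
            parts = entity.split()
            # Link on first-name match, e.g. "Ellen" -> "Ellen Sharma".
            if parts and parts[0] == first_name:
                links.setdefault(label, entity)  # keep the earliest link
                break
    return links
```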

---

## Recommended Implementation Order

| Priority | Feature | Effort | Value |
|----------|---------|--------|-------|
| **P0** | Multi-person enrollment (household_members table + update `_identify_speaker`) | 2 days | Immediate impact |
| **P1** | Text-based name detection (regex, EN + KN + HI) | 0.5 day | Low effort, high ROI |
| **P2** | Speaker ↔ entity linking (post-extraction hook) | 1 day | Closes feedback loop |
| **P3** | Confidence score in DB (`segments.speaker_similarity FLOAT`) | 0.5 day | Enables progressive autonomy |
| **P4** | Speaker correction UI in dashboard | 2 days | UX polish |
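
Once a similarity score is stored (P3), progressive autonomy reduces to a three-way decision. The sketch below is hypothetical: `decide_action` is an illustrative name, `suggest_at` reuses the current 0.48 similarity threshold, and `auto_at=0.75` is an assumed value that would need tuning on real sessions.

```python
from enum import Enum


class Action(Enum):
    AUTO_ASSIGN = "auto_assign"  # label silently
    SUGGEST = "suggest"          # ask the user to confirm
    UNKNOWN = "unknown"          # treat as an unenrolled guest


def decide_action(similarity: float, suggest_at: float = 0.48,
                  auto_at: float = 0.75) -> Action:
    """Choose how much autonomy to take for one speaker match."""
    if suggest_at > auto_at:
        raise ValueError("suggest_at must not exceed auto_at")
    if similarity >= auto_at:
        return Action.AUTO_ASSIGN
    if similarity > suggest_at:
        return Action.SUGGEST
    return Action.UNKNOWN
```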

### Database additions needed
```sql
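-- Extend the existing entities/people table or add the tables below.
-- Note: speaker_profiles overlaps with household_members (Gap 1); keep only one of the two.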
CREATE TABLE speaker_profiles (
    id UUID PRIMARY KEY,
    person_name TEXT NOT NULL,
    embedding FLOAT[] NOT NULL,  -- or use pgvector
    sample_count INT DEFAULT 1,
    created_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ
);

CREATE TABLE speaker_samples (
    id UUID PRIMARY KEY,
    person_id UUID REFERENCES speaker_profiles(id),
    audio_path TEXT,  -- local path, not cloud
    duration_seconds FLOAT,
    quality_score FLOAT,
    created_at TIMESTAMPTZ
);
```
