# Voice-Based Emotion Recognition & Sentiment Analysis from Audio

**Research date:** 2026-02-27
**Context:** Adding voice/audio-based emotion detection to the existing her-os audio pipeline (Whisper + pyannote on DGX Spark, Blackwell GPU, 128GB unified memory)

---

## Table of Contents

1. [State of the Art: SER Models (2024-2026)](#1-state-of-the-art-ser-models)
2. [Model Deep-Dives](#2-model-deep-dives)
3. [Paralinguistic Features: Beyond Emotions](#3-paralinguistic-features-beyond-emotions)
4. [Architecture Considerations for her-os](#4-architecture-considerations-for-her-os)
5. [Multimodal Fusion: Text + Voice](#5-multimodal-fusion-text--voice)
6. [Indian Language Emotion Recognition](#6-indian-language-emotion-recognition)
7. [Privacy & Ethics](#7-privacy--ethics)
8. [Practical Implementation Recommendation](#8-practical-implementation-recommendation)
9. [Latest Model Versions (Feb 2026 Update)](#9-latest-model-versions-feb-2026-update)
10. [Benchmark Results — DGX Spark (2026-02-27)](#10-benchmark-results--dgx-spark-2026-02-27)
11. [Segment Length Analysis for SER Models](#11-segment-length-analysis-for-ser-models-2026-02-27)
12. [Production Deployment — VAD-Gated Dual-Model SER (2026-02-27)](#12-production-deployment--vad-gated-dual-model-ser-2026-02-27)
13. [Compound Emotion Labels — Categorical + Dimensional Fusion (2026-02-27)](#13-compound-emotion-labels--categorical--dimensional-fusion-2026-02-27)

---

## 1. State of the Art: SER Models

### Landscape Overview

Speech Emotion Recognition (SER) has advanced dramatically in 2024-2025, driven by self-supervised learning (SSL) pre-training on massive speech corpora. The dominant paradigm is:

1. **Pre-train** a large speech encoder on unlabeled audio (wav2vec2, HuBERT, WavLM, data2vec, Whisper)
2. **Fine-tune** or train a lightweight head for emotion classification
3. Optionally **distill** to smaller models for deployment

The [EmoBox benchmark](https://arxiv.org/abs/2406.07162) (Interspeech 2024) evaluated 10 pre-trained models across 32 datasets in 14 languages, representing the largest SER benchmark to date.

### EmoBox Benchmark Results (Top Models on IEMOCAP)

| Model | Parameters | IEMOCAP UA% | Rank on 32 datasets | License |
|-------|-----------|-------------|---------------------|---------|
| Whisper large-v3 (encoder) | ~635M | **73.54%** | #1 on 23/32 | MIT |
| WavLM large | ~315M | 69.47% | Strong on 12/32 | MIT |
| HuBERT large | ~315M | 67.42% | Competitive | MIT |
| data2vec 2.0 large | ~315M | Competitive | Mid-range | MIT |
| wav2vec2 base | ~95M | ~65% | Lower | MIT |

**Key finding:** Whisper large-v3 encoder performs significantly better than all SSL models for emotion, ranking top-1 on 23/32 datasets and top-3 on 30/32. This is notable because Whisper is supervised (trained for ASR), not self-supervised.

### Two Paradigms

**Categorical Emotions:** Classify into discrete labels (angry, happy, sad, neutral, fear, disgust, surprise). Most models use 4-9 classes.

**Dimensional Emotions (VAD):** Predict continuous values for:
- **Valence** (negative to positive feeling)
- **Arousal** (calm to excited)
- **Dominance** (submissive to dominant)

The dimensional approach is richer and allows capturing nuanced emotional states. Research consistently shows valence is hardest to predict from audio alone (text helps significantly), while arousal is easiest.

---

## 2. Model Deep-Dives

### 2.1 emotion2vec+ (Recommended Primary Model)

**Origin:** ACL 2024 paper from Alibaba DAMO Academy. Integrated into FunASR/ModelScope.
**Repository:** https://github.com/ddlBoJack/emotion2vec

| Variant | Parameters | Training Data | Use Case |
|---------|-----------|---------------|----------|
| emotion2vec+ seed | ~90M | 201h (academic EmoBox data) | Baseline, clean data |
| emotion2vec+ base | ~90M | 4,788h (pseudo-labeled) | **Best accuracy/size ratio** |
| emotion2vec+ large | ~300M | 42,526h (pseudo-labeled) | Maximum accuracy |

**Architecture:** Based on data2vec (self-supervised framework). Pre-trained via online distillation combining utterance-level and frame-level losses.

**9-Class Emotion Labels:**
0: angry, 1: disgusted, 2: fearful, 3: happy, 4: neutral, 5: other, 6: sad, 7: surprised, 8: unknown

**Audio Requirements:** 16kHz (same as our pipeline)

**Performance:**
- Claims SOTA on IEMOCAP with only linear layers on top
- Consistent improvements across 10+ languages
- Achieves 84.9%/82.9% on EmoDB/RAVDESS benchmarks

**License:** emotion2vec code is MIT. The emotion2vec+ models use the FunASR Model License which permits commercial use with attribution — "You are free to use, copy, modify, and share FunASR Software" with requirement to "attribute the source and author information and retain relevant model names."

**GPU/VRAM Estimate:**
- Base (~90M params): ~360MB in FP32, ~180MB in FP16. Comfortably fits alongside Whisper+pyannote.
- Large (~300M params): ~1.2GB in FP32, ~600MB in FP16. Still fits easily on 128GB unified memory.
- Seed model can run on CPU with 4GB RAM (per ModelScope specs).

**Inference Code (FunASR):**
```python
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large")
res = model.generate("audio.wav", output_dir="./outputs",
                     granularity="utterance", extract_embedding=False)
# Returns: scores for 9 emotion classes
```

**Integration with existing pipeline:** Uses 16kHz audio (same format). Can process the same WAV segments already extracted by pyannote diarization. No format conversion needed.
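To turn the raw `generate()` output into a single label plus per-class scores, a small parsing helper suffices. The result shape below (a list with one dict per utterance, carrying parallel `labels` and `scores` lists) is an assumption based on the emotion2vec model card; verify it against your FunASR version. The `mocked` value stands in for a real model result.

```python
# Sketch: reduce a FunASR emotion2vec result to (top_label, scores_dict).
# Assumed result shape: [{"labels": [...], "scores": [...]}] per utterance.

def parse_emotion(res: list) -> tuple[str, dict]:
    """Return the top emotion label and a label -> score mapping."""
    entry = res[0]  # one entry per input utterance
    scores = dict(zip(entry["labels"], entry["scores"]))
    top = max(scores, key=scores.get)
    return top, scores

# Example with a mocked result (a real one comes from model.generate(...)):
mocked = [{"labels": ["angry", "happy", "neutral"], "scores": [0.1, 0.7, 0.2]}]
top, scores = parse_emotion(mocked)
print(top)  # happy
```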

### 2.2 SpeechBrain emotion-recognition-wav2vec2-IEMOCAP

**Repository:** https://huggingface.co/speechbrain/emotion-recognition-wav2vec2-IEMOCAP

| Aspect | Detail |
|--------|--------|
| Architecture | wav2vec2 (base) + conv/residual blocks + attentive statistical pooling |
| Parameters | ~95M (wav2vec2-base) |
| Accuracy | 78.7% on IEMOCAP test set |
| Emotions | 4 classes (anger, happiness, sadness, neutral) |
| License | **Apache 2.0** |
| Audio | 16kHz mono |

**Code:**
```python
from speechbrain.inference.interfaces import foreign_class
classifier = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier",
    run_opts={"device": "cuda"}
)
out_prob, score, index, text_lab = classifier.classify_file("audio.wav")
```

**Pros:** Clean Apache 2.0 license. Well-documented. SpeechBrain ecosystem.
**Cons:** Only 4 emotion classes. wav2vec2-base performance is lower than newer models.

### 2.3 SpeechBrain Emotion Diarization (WavLM-Large)

**Repository:** https://huggingface.co/speechbrain/emotion-diarization-wavlm-large

This is unique: it performs **emotion diarization** — detecting which emotion appears at which time within an utterance, with start/end timestamps.

| Aspect | Detail |
|--------|--------|
| Architecture | WavLM-Large + frame-wise classifier |
| Parameters | ~315M |
| Emotions | neutral, happy, sad (+ training data from 5 datasets) |
| Performance | 29.7% EDER on ZaionEmotionDataset |
| License | **Apache 2.0** |
| Audio | 16kHz mono |

**Code:**
```python
from speechbrain.inference.diarization import Speech_Emotion_Diarization
classifier = Speech_Emotion_Diarization.from_hparams(
    source="speechbrain/emotion-diarization-wavlm-large",
    run_opts={"device": "cuda"}
)
diary = classifier.diarize_file("example.wav")
# Output: [{'start': 0.0, 'end': 1.94, 'emotion': 'n'},
#          {'start': 1.94, 'end': 4.48, 'emotion': 'h'}]
```

**Limitation:** Trained on audio clips containing only one non-neutral emotion event each, and covers a limited set of emotion classes.

### 2.4 audeering wav2vec2-large-robust (Dimensional VAD)

**Repository:** https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim

| Aspect | Detail |
|--------|--------|
| Architecture | Wav2Vec2-Large-Robust pruned to 12 layers |
| Parameters | ~200M |
| Output | Arousal, Dominance, Valence (continuous 0-1) |
| Training | MSP-Podcast v1.7 |
| License | **CC-BY-NC-SA-4.0** (non-commercial only!) |
| Audio | 16kHz |
| VRAM | ~800MB-1.5GB in FP32 |

**Code:**
```python
import torch
from transformers import Wav2Vec2Processor

# NOTE: EmotionModel is a custom class defined in the model card's README
# (a regression head on top of Wav2Vec2Model); copy it from there first.
model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = EmotionModel.from_pretrained(model_name).to(device)
# Output: [[arousal, dominance, valence]] in range 0..1
```

**CRITICAL:** License is CC-BY-NC-SA-4.0 (non-commercial). For commercial use, a license must be acquired from audEERING. For personal self-hosted use in her-os, this is fine.

### 2.5 Wav2Small (Ultra-Lightweight)

**Paper:** https://arxiv.org/abs/2408.13920 (audEERING, 2024)

| Aspect | Detail |
|--------|--------|
| Parameters | **72,000** (72K!) |
| ONNX size | **120KB** quantized |
| Architecture | VGG7 feature extractor, time→channel reshaping |
| Output | Arousal, Dominance, Valence |
| Training | Knowledge distillation from teacher wav2vec2 model |

**Why interesting:** Could run as a near-zero-cost sidecar alongside Whisper. Negligible VRAM. Real-time on CPU. Teacher model achieves SOTA valence CCC=0.676 on MSP-Podcast.

### 2.6 SenseVoice (FunAudioLLM)

**Repository:** https://github.com/FunAudioLLM/SenseVoice

SenseVoice combines ASR + emotion recognition + audio event detection in a single model.

| Variant | Speed | Emotion | Languages |
|---------|-------|---------|-----------|
| SenseVoice-Small | 70ms for 10s audio (15x faster than Whisper-Large) | 7 emotions | 50+ languages |
| SenseVoice-Large | Slower but highest accuracy | 7 emotions | 50+ languages |

**7 Emotion Labels:** Happy, Sad, Angry, Neutral, Fearful, Disgusted, Surprised

**License:** FunASR Model License (commercial use allowed with attribution)

**Interesting trade-off:** If you used SenseVoice instead of Whisper, you would get STT + emotion in one pass. However, we already have Whisper running well, and SenseVoice's STT quality may differ.

### 2.7 CLAP and Variants (Zero-Shot)

[CLAP](https://github.com/microsoft/CLAP) (Contrastive Language-Audio Pretraining) can do zero-shot audio classification by matching audio to text descriptions.

**2024-2025 emotion-specific variants:**
- **ParaCLAP** (2024): Specialized for paralinguistic and emotion tasks
- **RA-CLAP** (2025): Self-distillation for emotional speaking style retrieval
- **CLAIP-Emo** (2025): LoRA adaptation of CLIP/CLAP for audiovisual emotion

**Verdict:** Interesting for zero-shot emotion taxonomy expansion, but dedicated SER models (emotion2vec+) outperform CLAP-based approaches for standard emotion recognition.

### 2.8 Whisper Encoder for Emotion

Recent research (2024-2025) shows Whisper's encoder features are highly effective for emotion:

- Whisper large-v3 encoder ranked #1 on 23/32 SER datasets in EmoBox
- LoRA fine-tuning of Whisper encoder achieves competitive F1 scores
- A fine-tuned Whisper-Large-v3 for SER exists: [firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3](https://huggingface.co/firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3)

**Approach:** Extract hidden states from Whisper's encoder (which we already run for STT!) and add a lightweight emotion classifier head. This is the most efficient approach since it reuses computation.
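A minimal sketch of that classifier head, in numpy rather than torch for brevity: mean-pool the encoder frames over time, then apply a linear layer and softmax. The hidden size 1280 matches Whisper large-v3's encoder width; the random `W`/`b` here are placeholders standing in for a trained head, not real weights.

```python
# Sketch: lightweight emotion head over (already computed) Whisper encoder
# states. Weights are random placeholders; a real head would be trained.
import numpy as np

def emotion_head(hidden_states: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """hidden_states: (frames, 1280) encoder output for one clip."""
    pooled = hidden_states.mean(axis=0)   # mean-pool over time
    logits = pooled @ W + b               # linear head -> 9 emotion classes
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax probabilities

rng = np.random.default_rng(0)
states = rng.standard_normal((300, 1280))          # ~6s of encoder frames
W = rng.standard_normal((1280, 9)) * 0.01
b = np.zeros(9)
probs = emotion_head(states, W, b)
print(probs.shape)  # (9,)
```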

### 2.9 DistilHuBERT (Lightweight)

| Aspect | Detail |
|--------|--------|
| Size | ~23M parameters (~95MB) |
| Accuracy | 70.64% (F1: 70.36%) |
| Architecture | 75% compressed HuBERT |
| Use case | Edge deployment, real-time |

Excellent for edge deployment, but the accuracy trade-off is unlikely to be worth it given ample GPU memory.

---

## 3. Paralinguistic Features: Beyond Emotions

### 3.1 What Can Be Detected from Voice

| Feature | Detection Quality | Open Source? | Notes |
|---------|------------------|-------------|-------|
| **Stress** | Good (AUC ~0.98 in some studies) | Partial | Pitch, intensity, duration, vocal tract spectrum |
| **Fatigue/Tiredness** | Moderate | Limited | Sonde Health, academic models |
| **Cognitive Load** | Good | Limited | Formant frequencies, MFCCs. Gradient Boosting AUC=0.98 |
| **Depression** | Moderate-Good | Research only | Kintsugi (71.3% sensitivity from 25s speech) |
| **Anxiety** | Moderate | Research only | Co-detected with depression markers |
| **Confidence** | Research stage | No | Prosodic features |
| **Engagement** | Research stage | Partial | Prosody-based prediction |
| **Parkinson's Disease** | Promising | Research only | Voice tremor, reduced loudness |
| **Cognitive Decline** | Early stage | Research only | 2025 study: short voice samples |

### 3.2 Commercial Players

| Company | Focus | Approach | Open Source |
|---------|-------|----------|-------------|
| **Hume AI** | 48 emotion dimensions from prosody | Cloud API ($0.0276/min audio) | No (cloud API only) |
| **Kintsugi** | Depression/anxiety screening | Vocal biomarkers, 25s speech | No |
| **Sonde Health** | Mental fitness scoring | 30s voice analysis | No |
| **Ellipsis Health** | Depression/anxiety | Voice biomarkers | No |
| **audEERING** | Emotion (VAD) + openSMILE | Models + toolkit | Partial (openSMILE is open) |

### 3.3 Hume AI Expression Measurement

Hume's Speech Prosody model detects **48 emotional dimensions** including:
Admiration, Adoration, Aesthetic Appreciation, Amusement, Anger, Anxiety, Awe, Awkwardness, Boredom, Calmness, Concentration, Confusion, Contemplation, Contempt, Contentment, Craving, Desire, Determination, Disappointment, Disgust, Distress, Doubt, Ecstasy, Embarrassment, Empathic Pain, Entrancement, Envy, Excitement, Fear, Guilt, Horror, Interest, Joy, Love, Nostalgia, Pain, Pride, Realization, Relief, Romance, Sadness, Satisfaction, Shame, Surprise, Sympathy, Tiredness, Triumph

**Pricing:** Starts at $0.0276/min for audio. Not self-hostable.
**No open-source alternative** exists with this level of granularity.

### 3.4 openSMILE (Traditional Feature Extraction)

[openSMILE](https://github.com/audeering/opensmile) is the established open-source toolkit for acoustic feature extraction:
- 998 acoustic features in the "emobase" set
- Features: intensity, loudness, 12 MFCCs, pitch (F0), F0 envelope, 8 LSFs, ZCR, delta coefficients
- Written in C++, runs on Linux/Windows/macOS/RPi
- License: audEERING's source-available license (free for research/personal use; commercial use requires a separate license)

**Use case for her-os:** Extract traditional acoustic features as complementary signal to deep learning emotion models. Features like pitch variation, speaking rate, and energy patterns provide interpretable signals.
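Two of those interpretable signals, frame-wise energy and zero-crossing rate, are simple enough to sketch directly in numpy. This is not openSMILE itself (which computes ~1000 features via its emobase config), just an illustration of the kind of low-level acoustic features involved; the 220 Hz test tone is a stand-in for real speech.

```python
# Illustrative numpy versions of two interpretable acoustic features.
import numpy as np

def rms_energy(samples: np.ndarray, frame: int = 400) -> np.ndarray:
    """Frame-wise RMS energy for 16kHz audio (400 samples = 25ms frames)."""
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    return np.sqrt((frames ** 2).mean(axis=1))

def zero_crossing_rate(samples: np.ndarray) -> float:
    """Fraction of sample-to-sample sign changes, a crude voicing cue."""
    signs = np.signbit(samples)
    return float(np.mean(signs[1:] != signs[:-1]))

t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 220 * t)       # 1s test tone at 220 Hz
print(rms_energy(tone).shape)            # (40,)
print(zero_crossing_rate(tone))          # ~0.027 (≈ 2 * 220 / 16000)
```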

---

## 4. Architecture Considerations for her-os

### 4.1 Current Pipeline (from pipeline.py)

```
Omi wearable → WAV (16kHz, 16-bit, mono)
  → process_fast():
      Whisper large-v3 STT → text + language
      ECAPA-TDNN embedding → speaker ID (rajesh/other)
  → sweep() (every 60s):
      pyannote diarization → speaker labels
      Whisper re-transcribe → aligned text
```

**Key insight:** The audio is already 16kHz PCM mono — exactly what all SER models need. No conversion required.

### 4.2 Proposed Integration: Emotion Per Speaker Turn

```
Omi wearable → WAV (16kHz)
  → process_fast():
      Whisper STT → text + language
      ECAPA-TDNN → speaker ID
      emotion2vec+ → 9-class emotion scores    ← NEW
      (optional) VAD model → arousal/valence    ← NEW
  → Combine: {speaker, text, emotion, arousal, valence, start, end}
```

**Can emotion run on the same audio?** YES. All models take 16kHz audio input. The audio samples already extracted in `_wav_to_samples()` can be passed directly to emotion2vec+.

**Can we get emotion per speaker turn?** YES. The diarization segments provide `(start, end, speaker)` tuples. Slice the audio by these boundaries and run emotion detection per segment.
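Slicing per-turn audio from the shared 16kHz buffer is a few lines. In this sketch, `classify` is a stand-in for any SER model call (e.g. emotion2vec+), and the 0.5s minimum-length cutoff is an illustrative threshold, not a value from the pipeline.

```python
# Sketch: per-turn emotion by slicing the 16kHz sample buffer along
# diarization boundaries. `classify` stands in for the real SER model.
import numpy as np

SAMPLE_RATE = 16_000

def emotion_per_turn(samples: np.ndarray, turns, classify):
    """turns: iterable of (start_s, end_s, speaker); returns annotated turns."""
    out = []
    for start, end, speaker in turns:
        seg = samples[int(start * SAMPLE_RATE): int(end * SAMPLE_RATE)]
        if len(seg) < SAMPLE_RATE // 2:  # skip turns under ~0.5s (illustrative)
            continue
        out.append({"speaker": speaker, "start": start, "end": end,
                    "emotion": classify(seg)})
    return out

# Usage with a stub classifier:
audio = np.zeros(SAMPLE_RATE * 10, dtype=np.float32)   # 10s of silence
turns = [(0.0, 4.2, "rajesh"), (4.2, 4.5, "other"), (4.5, 9.8, "other")]
result = emotion_per_turn(audio, turns, classify=lambda seg: "neutral")
print(len(result))  # 2 (the 0.3s turn is skipped)
```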

### 4.3 GPU Memory Budget

Current usage on DGX Spark (128GB unified memory):

| Component | Estimated VRAM |
|-----------|---------------|
| Whisper large-v3 (FP16) | ~3GB |
| pyannote speaker-diarization-3.1 | ~1.5GB |
| ECAPA-TDNN (speaker embedding) | ~0.2GB |
| wav2vec2 alignment model | ~0.4GB |
| **Total current** | **~5.1GB** |

Adding emotion models:

| Addition | VRAM (FP16) | Notes |
|----------|-------------|-------|
| emotion2vec+ base (90M) | ~180MB | Best accuracy/size ratio |
| emotion2vec+ large (300M) | ~600MB | Maximum accuracy |
| audeering VAD model (200M) | ~800MB | Dimensional (non-commercial license) |
| Wav2Small (72K) | ~0.5MB | Near-zero cost |

**Verdict:** With 128GB unified memory, VRAM is not a constraint at all. Even loading emotion2vec+ large + audeering VAD model adds only ~1.4GB. We could load every model listed and still use <10% of available memory.
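The weight-memory figures in these tables follow from a back-of-envelope formula: parameter count times bytes per parameter (2 for FP16, 4 for FP32), ignoring activations and framework overhead.

```python
# Back-of-envelope VRAM estimate for model weights only.
def vram_mb(params: int, bytes_per_param: int = 2) -> float:
    """bytes_per_param: 2 for FP16, 4 for FP32."""
    return params * bytes_per_param / 1e6

print(vram_mb(90_000_000))        # 180.0  -> emotion2vec+ base, FP16
print(vram_mb(300_000_000))       # 600.0  -> emotion2vec+ large, FP16
print(vram_mb(300_000_000, 4))    # 1200.0 -> emotion2vec+ large, FP32
```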

### 4.4 Reusing Whisper Encoder Features

The most efficient approach would be to extract emotion from Whisper's own encoder hidden states, since the encoder is already running for STT:

1. During `process_fast()`, Whisper encodes the audio → hidden states
2. Extract the final encoder hidden states (1280-dim for large-v3)
3. Pass through a lightweight emotion classifier head
4. Get emotion prediction with zero additional encoder computation

This requires training/fine-tuning the classifier head on top of Whisper features, or using a pre-trained one like [firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3](https://huggingface.co/firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3).

### 4.5 Real-Time vs Batch Processing

| Approach | Latency | When |
|----------|---------|------|
| **Real-time (per clip)** | +50-200ms per 4-10s clip | Fast path: emotion2vec+ on each clip |
| **Batch (per sweep)** | Background | Sweep path: re-analyze all segments |

**Recommendation:** Run emotion2vec+ in the fast path (per clip). The model processes a 10-second clip in well under 1 second on GPU, adding negligible latency to the 2-3 second fast path.

### 4.6 Updated DiarizedSegment Data Model

```python
from dataclasses import dataclass, field

@dataclass
class DiarizedSegment:
    speaker: str
    text: str
    start: float
    end: float
    language: str = "en"
    stt_engine: str = "whisper"
    # NEW: emotion fields
    emotion: str = "neutral"           # top emotion label
    emotion_scores: dict = field(default_factory=dict)  # all 9 scores
    arousal: float = 0.5               # 0-1 continuous
    valence: float = 0.5               # 0-1 continuous
    dominance: float = 0.5             # 0-1 continuous
```

---

## 5. Multimodal Fusion: Text + Voice

### 5.1 Why Combine?

- **Valence** is hard to detect from audio alone (is the person speaking loudly because they're angry or excited?) but text reveals semantic valence ("I love this" vs "I hate this")
- **Arousal** is easy from audio (energy, pitch, speaking rate) but ambiguous in text
- **Sarcasm/irony** requires both modalities (cheerful tone + negative words, or vice versa)

### 5.2 Fusion Architectures (2024-2025)

| Approach | Description | Complexity |
|----------|-------------|------------|
| **Late fusion (score averaging)** | Run text sentiment + voice emotion independently, average/weight scores | Low |
| **Feature concatenation** | Concatenate text embeddings + audio embeddings, feed to classifier | Medium |
| **Cross-modal transformer** | Attention between text and audio features (MemoCMT, MTAF) | High |
| **Dynamic Attention Fusion (DAF)** | Adaptive attention weighting per utterance | Medium |
| **Two-layer Dynamic Bayesian (2L-DBMM)** | Bayesian mixture for combining modalities | Medium |

**Recommended for her-os Phase 1:** Start with **late fusion (weighted averaging)**:
1. Text sentiment from Claude API (already planned) → valence/polarity score
2. Voice emotion from emotion2vec+ → categorical + arousal
3. Combine: `final_valence = 0.6 * text_valence + 0.4 * audio_valence`
4. Voice arousal from audio (audio is more reliable for arousal)

Evolve to feature-level fusion later if needed.
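The Phase 1 late-fusion rule above is small enough to write out directly. The 0.6/0.4 weights are the starting point suggested in the steps, not tuned values, and all inputs are assumed normalized to [0, 1].

```python
# Late-fusion sketch: text-weighted valence, audio-led arousal.
def fuse(text_valence: float, audio_valence: float, audio_arousal: float,
         w_text: float = 0.6) -> dict:
    """All inputs in [0, 1]; returns the fused affect estimate."""
    return {
        "valence": w_text * text_valence + (1 - w_text) * audio_valence,
        "arousal": audio_arousal,  # audio is the more reliable arousal cue
    }

# "I love this" said flatly: positive text, neutral-ish audio.
print(fuse(text_valence=0.9, audio_valence=0.5, audio_arousal=0.3))
```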

### 5.3 Key Research References

- [MemoCMT](https://www.nature.com/articles/s41598-025-89202-x) (2025): Cross-modal transformer for speech + text
- [Dynamic Attention Fusion](https://arxiv.org/html/2509.22729v1) (2025): Lightweight framework, adaptive weighting
- [Multimodal Affective Communication](https://www.mdpi.com/2076-3417/14/15/6631) (2024): Machine learning fusion of speech emotion + text sentiment

---

## 6. Indian Language Emotion Recognition

### 6.1 Kannada-Specific Resources

**NITK-KLESC corpus:** Kannada Language Emotional Speech Corpus from NITK (National Institute of Technology Karnataka). Contains 5 emotions: Fear, Sad, Anger, Happy, Neutral. Primarily designed for speaker recognition under emotional variation.

**Additional Kannada dataset:** A Kannada Emotional Speech Dataset available on [Zenodo](https://zenodo.org/records/6345107).

### 6.2 Cross-Lingual SER Performance on Indian Languages

A 2024 study on Indian Cross Corpus SER tested four Indian languages:

| Language | Family | Single Corpus Accuracy | Cross-Lingual Accuracy |
|----------|--------|----------------------|----------------------|
| Hindi | Indo-Aryan | 90.52% | 92.76% |
| Urdu | Indo-Aryan | 84.00% | 95.97% |
| Telugu | Dravidian | 90.60% | 93.85% |
| **Kannada** | **Dravidian** | **89.00%** | **92.95%** |

These results use DCNN with multiple acoustic features (MAF) — not transfer learning from large pre-trained models, which could perform differently.

### 6.3 Cross-Language Generalization

**Good news for her-os:**
- emotion2vec shows consistent improvements across 10+ languages
- Large SSL models (Whisper, WavLM) capture universal prosodic patterns that transfer across languages
- Arousal/energy features are relatively language-independent
- The EmoBox benchmark tested 14 languages with Urdu (closest to Hindi) included

**Challenges:**
- Most SER models are trained predominantly on English/Chinese data
- Cultural differences in emotional expression exist (e.g., vocal displays of politeness vs friendliness)
- Indian-accented English may have different prosodic patterns
- Kannada-specific fine-tuning data is limited

### 6.4 Recommendation for Indian Languages

1. **Start with emotion2vec+ large** — it claims multilingual robustness and is trained on diverse data
2. **Test empirically** with Kannada speech samples to measure accuracy degradation
3. **If needed:** Fine-tune emotion2vec+ base on NITK-KLESC + other Indian emotion corpora
4. **Audio features (arousal, energy)** are more language-independent than semantic emotion labels — prioritize dimensional (VAD) signals for cross-language robustness

---

## 7. Privacy & Ethics

### 7.1 EU AI Act (February 2025)

The EU AI Act explicitly addresses emotion recognition:

**Prohibited (Article 5(1)(f)):** AI emotion recognition systems in:
- Workplaces (except medical/safety)
- Educational settings (except medical/safety)

**High-risk (requires compliance):** Emotion recognition in other contexts.

**Key distinction:** Detecting readily apparent physical states (smiling, tired) is NOT regulated. Inferring emotions (happy, sad, amused) IS regulated.

**Impact on her-os:** Since her-os is a **personal self-monitoring tool** (not workplace/education), it falls outside the prohibited category. However, it would be classified as high-risk in the EU, requiring transparency obligations and compliance procedures.

### 7.2 Self-Monitoring vs Surveillance

| Context | Status | Ethical Concerns |
|---------|--------|-----------------|
| **Self-monitoring** (her-os use case) | Acceptable | User consents, controls data, benefits directly |
| **Monitoring others** without consent | Problematic | Privacy violation, power imbalance |
| **Ambient monitoring** of household | Gray area | Family members, guests may not consent |

### 7.3 Consent Framework for her-os

**Recommended approach:**

1. **Explicit opt-in:** Emotion detection must be a toggle the user enables deliberately
2. **Granular control:** Allow disabling emotion detection while keeping STT active
3. **Data transparency:** Show the user exactly what emotions were detected, when
4. **Retention limits:** Emotion data should have configurable retention (e.g., aggregate after 30 days, delete raw scores after 90 days)
5. **Multi-person handling:** When the speaker is "other" (not enrolled), do NOT store emotion data for that speaker, or store only aggregated anonymous data
6. **No behavioral manipulation:** Emotion data should inform the user, never be used to manipulate them
7. **Local processing only:** All emotion inference runs locally on the DGX Spark, no cloud APIs for emotion data
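Points 4-5 of the framework can be sketched as a single retention pass: aggregate records after 30 days (keep the label, drop raw scores), delete after 90 days, and never store per-utterance emotion for non-enrolled ("other") speakers. The record schema here is hypothetical, purely for illustration.

```python
# Sketch of the retention policy (record schema is hypothetical).
from datetime import datetime, timedelta, timezone

def apply_retention(records: list[dict], now: datetime,
                    aggregate_after_days: int = 30,
                    delete_after_days: int = 90) -> list[dict]:
    kept = []
    for rec in records:
        if rec["speaker"] == "other":
            continue                               # never stored for "other"
        age = now - rec["timestamp"]
        if age > timedelta(days=delete_after_days):
            continue                               # raw record expired
        if age > timedelta(days=aggregate_after_days):
            rec = {**rec, "emotion_scores": None}  # keep label, drop scores
        kept.append(rec)
    return kept

now = datetime(2026, 2, 27, tzinfo=timezone.utc)
records = [
    {"speaker": "rajesh", "timestamp": now - timedelta(days=5),
     "emotion": "happy", "emotion_scores": {"happy": 0.8}},
    {"speaker": "rajesh", "timestamp": now - timedelta(days=45),
     "emotion": "sad", "emotion_scores": {"sad": 0.6}},
    {"speaker": "other", "timestamp": now - timedelta(days=1),
     "emotion": "angry", "emotion_scores": {"angry": 0.7}},
]
print([(r["emotion"], r["emotion_scores"]) for r in apply_retention(records, now)])
# [('happy', {'happy': 0.8}), ('sad', None)]
```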

### 7.4 Ethical Red Lines

- Never infer mental health diagnoses from emotion data alone
- Never share emotion data with third parties
- Never use emotion data for advertising/targeting
- Always allow the user to delete all emotion history
- Be transparent about accuracy limitations (models are wrong 20-40% of the time)
- Label emotion annotations as "estimates" not "facts"

---

## 8. Practical Implementation Recommendation

### 8.1 Phase 1: Minimal Viable Emotion (Recommended First Step)

**Model:** emotion2vec+ base (~90M parameters)
**Why:** Best accuracy/size ratio, MIT-compatible license, FunASR integration, 16kHz input (same as our pipeline), 9 emotion classes, multilingual

**Integration point:** Add to `process_fast()` in pipeline.py:

```python
# In AudioPipeline.__init__():
self._emotion_model = None

# In load_models():
from funasr import AutoModel
self._emotion_model = AutoModel(model="iic/emotion2vec_plus_base")

# In process_fast(), after STT:
if self._emotion_model and len(audio_samples) >= MIN_EMBEDDING_SAMPLES:
    emotion_result = self._emotion_model.generate(
        audio_samples,  # or save to temp file
        granularity="utterance",
        extract_embedding=False,
    )
    # Parse emotion scores from result
```

**VRAM added:** ~180MB (FP16). Negligible on 128GB system.
**Latency added:** Estimated <100ms per 4-10 second clip on Blackwell GPU.

### 8.2 Phase 2: Add Dimensional Emotions

**Add:** audeering VAD model for arousal/valence/dominance
**Why:** Continuous values are richer than categories; arousal is more reliable across languages
**Note:** CC-BY-NC-SA license — fine for personal use, not for commercial distribution
**Alternative:** Wav2Small for near-zero-cost dimensional emotion

### 8.3 Phase 3: Multimodal Fusion

**Combine** text sentiment (from Claude analysis) + voice emotion (from emotion2vec+/VAD model):
- Text → valence (positive/negative feeling)
- Audio → arousal (excited/calm) + categorical emotion
- Fused → richer emotional profile per speaker turn

### 8.4 Phase 4: Whisper Encoder Reuse (Optimization)

**Optimize:** Extract emotion from Whisper's encoder hidden states directly
**Why:** Zero additional encoder computation — the hidden states are already computed for STT
**How:** Train a lightweight classifier head on Whisper features, or use LoRA fine-tuning
**Savings:** Eliminates the need for a separate emotion encoder model

### 8.5 Implementation Checklist

- [ ] Add emotion2vec+ large to Docker image (pip install funasr + download model, bake weights)
- [ ] Add audeering wav2vec2 to Docker image (custom EmotionModel class, bake weights)
- [ ] Add `emotion` fields to `DiarizedSegment` dataclass (primary, intensity, confidence, valence, arousal)
- [ ] Implement confidence-weighted fusion logic in audio pipeline
- [ ] Run both models in parallel in `process_fast()` on each audio clip
- [ ] Return fused emotion in `/v1/transcribe` API response
- [ ] Add emotion toggle endpoint (`/v1/config/emotion` enable/disable)
- [ ] Test with English speech samples (accuracy baseline)
- [ ] Test with Kannada speech samples (cross-language check)
- [ ] Implement privacy controls (no emotion storage for "other" speakers)
- [ ] Add emotion data to session info endpoint
- [ ] (Phase 3) Implement late fusion with text sentiment

### 8.6 Model Comparison Summary

| Model | Params | Emotions | License | VRAM (FP16) | Recommendation |
|-------|--------|----------|---------|-------------|----------------|
| **emotion2vec+ large** | 300M | 9 categorical | FunASR (commercial OK) | ~627MB | **Phase 1 primary** |
| emotion2vec+ base | 90M | 9 categorical | FunASR (commercial OK) | ~359MB | Backup (conservative) |
| **audeering A/V/D** | 200M | 3 dimensional | CC-BY-NC-SA | ~631MB | **Phase 1 dimensional (fused with large)** |
| SpeechBrain wav2vec2 | 95M | 4 categorical | Apache 2.0 | ~380MB | Cleanest license |
| Wav2Small | 72K | 3 dimensional | (check) | ~0.5MB | Ultra-lightweight option |
| SpeechBrain emotion-diar | 315M | 3 + timestamps | Apache 2.0 | ~1.2GB | Temporal emotion tracking |
| SenseVoice-Small | ~Whisper-Small | 7 categorical | FunASR (commercial OK) | Similar to Whisper-Small | If replacing Whisper |
| Whisper encoder + head | 635M (shared) | Varies | MIT | 0 additional | **Phase 4: optimize** |

---

## 9. Latest Model Versions (Feb 2026 Update)

Cross-checked all models for 2025-2026 updates. Key findings:

### 9.1 Models with NO New Releases (Still Current)

| Model | Latest Version | Last Update | Notes |
|-------|---------------|-------------|-------|
| emotion2vec+ (seed/base/large) | v1 (May 2024) | Maintenance only | No v2. ACL 2024 paper remains definitive |
| WavLM (Base+/Large) | v1 (2021) | No updates | Still top-performing backbone; heavily used at Interspeech 2025 |
| Whisper open-source | large-v3-turbo (Oct 2024) | No v4 | OpenAI released GPT-4o-based API models (Mar 2025) — closed, not open-source |
| audeering VAD | v1 | No updates | Still CC-BY-NC-SA. Widely cited in 2025 research |
| EmoBox benchmark | v1 (Interspeech 2024) | No v2 | Remains the standard SER benchmark |
| Hume AI | EVI 4-mini (2025) | API only | Still NO open-source models. Octave 2 TTS launched |

### 9.2 Updated / New Releases

**FunASR 1.2.6** (Mar 2025) — Framework emotion2vec runs on. Latest stable. Also: **Fun-ASR-Nano-2512** (Dec 2025) — end-to-end ASR for 31 languages, but ASR-focused, not SER.

**SpeechBrain 1.0** (mid-2024, 1.0.x maintenance ongoing) — Added **emotion diarization** (`speechbrain/emotion-diarization-wavlm-large`). Unique capability: temporal emotion boundaries ("which emotion appears when?"). Uses WavLM Large + frame-wise classifier on ZED (real-life non-acted emotions).

**Emotion2Vec-S** (2025, third-party) — `ASLP-lab/Emotion2Vec-S`, self-supervised variant (~1.13 GB). Paper: "Steering Language Model to Stable SER via Contextual Perception and Chain of Thought."

### 9.3 Important New Models (2025-2026)

#### SenseVoice-Small (FunAudioLLM) — NOTABLE

| Aspect | Detail |
|--------|--------|
| HuggingFace | `FunAudioLLM/SenseVoiceSmall` |
| Parameters | ~234M (Whisper-Small equivalent) |
| Speed | **70ms for 10s audio** (15x faster than Whisper-Large) |
| Combined output | ASR + Emotion + Language ID + Audio Events (laughter, music, applause) |
| Emotions | Happy, Sad, Angry, Neutral |
| Languages | 5 (zh, en, yue, ja, ko) — SenseVoice-Large supports 50+ |
| License | FunASR Model License (commercial OK with attribution) |
| Architecture | Non-autoregressive encoder-only |

**Trade-off for her-os:** Single model replaces separate STT + SER + LID. But only 4 emotions (vs emotion2vec's 9), and SenseVoice-Small only covers 5 languages (no Kannada). **Verdict: keep Whisper + add emotion2vec separately** — more flexible, Kannada support preserved.

#### GLM-ASR-Nano-2512 (Zhipu AI, Dec 2025) — INTERESTING

| Aspect | Detail |
|--------|--------|
| HuggingFace | `zai-org/GLM-ASR-Nano-2512` |
| Parameters | 1.5B |
| Special | Explicitly trained for **emotional speech** and **multi-person overlapping dialogue** |
| Languages | Chinese + 7 dialects + 26 regional accents |
| License | Open source |

Outperforms Whisper v3 on multiple benchmarks but focused on Chinese. Not directly useful for English/Kannada SER.

#### Emotion-LLaMAv2 (Jan 2026)

| Aspect | Detail |
|--------|--------|
| Paper | arXiv:2601.16449 |
| Architecture | LLaMA2 + Conv Attention pre-fusion |
| Modalities | Audio + Video + Text (multimodal) |
| Benchmark | MMEVerse (12 datasets, 130k training clips) |
| Outperforms | Qwen2.5 Omni, AffectGPT |

Multimodal (requires video input). Too heavy for per-clip pipeline processing. Not recommended for Phase 1.

#### XEUS (Carnegie Mellon WAVLab)

| Aspect | Detail |
|--------|--------|
| HuggingFace | `espnet/xeus` |
| Parameters | 577M |
| Languages | 4000+ |
| Pre-trained on | 1M+ hours |

E-Branchformer architecture. Strong at Interspeech 2025 SER challenge alongside WavLM and Whisper. Requires fine-tuning for SER (no pretrained emotion checkpoint). Interesting for cross-lingual generalization but high effort.

### 9.4 NVIDIA NeMo — No SER Models

NeMo supports speech classification as a training recipe but has **no pretrained emotion models**. Only command recognition (MatchboxNet, MarbleNet) and language ID (AmberNet). Would need to train from scratch.

### 9.5 pyannote 4.0 — No Emotion Features

pyannote remains exclusively speaker diarization + VAD + overlapped speech. No emotion capabilities. But pairs well: use pyannote segments → run emotion2vec per segment.

### 9.6 Updated Recommendation (Feb 2026)

Original recommendation stands — **emotion2vec+ base** for Phase 1. The 2025-2026 landscape hasn't produced a better option for our specific pipeline:

| Criterion | emotion2vec+ base (winner) | SenseVoice-Small (runner-up) |
|-----------|---------------------------|------------------------------|
| Emotion classes | 9 | 4 |
| Kannada support | Language-agnostic embeddings | No (5 langs only) |
| Integration effort | Add alongside Whisper | Would replace Whisper |
| VRAM | ~180MB | ~235MB |
| Commercial license | Yes (with attribution) | Yes (with attribution) |

**No model has obsoleted emotion2vec+ since its May 2024 release.** It remains SOTA on EmoBox cross-corpus benchmarks.

---

## 10. Benchmark Results — DGX Spark (2026-02-27)

Ran `scripts/benchmark_ser.py` on Titan (NVIDIA GB10 Blackwell, 128GB unified memory).
Test audio: `test-her-movie.wav` (71s, 16kHz mono, movie "Her" dialogue), split into 14 clips of ~5s.
6 runs total — Run 1 (2-model), Run 2 (audeering failed), Run 3 (3-model), Run 4 (4-model, added large), Run 5 (speech-segmented 3-model), Run 6 (speech-segmented 4-model, native venv).

### Final Results: 4-Model Comparison (Run 4)

| Metric | emotion2vec+ base | emotion2vec+ large | Whisper-SER (firdhokk) | audeering A/V/D | Winner |
|--------|-------------------|-------------------|------------------------|-----------------|--------|
| Parameters | 90M | 300M | 0.6B | 200M | emotion2vec base (smallest) |
| Load time | **5,191 ms** | 176,890 ms | 15,687 ms | 7,275 ms | emotion2vec base |
| VRAM | **359 MB** | 628 MB | 2,495 MB | 631 MB | emotion2vec base (7x less than Whisper) |
| Latency p50 | 64 ms | 73 ms | 450 ms | **29 ms** | audeering (2.2x faster than base) |
| Latency p95 | 84 ms | 88 ms | 457 ms | 30 ms | audeering |
| Output type | 9 categorical | 9 categorical | 7 categorical | 3 continuous (A/V/D) | — |
| License | FunASR (commercial OK) | FunASR (commercial OK) | Apache 2.0 | **CC-BY-NC-SA** | emotion2vec |

### Per-Clip Predictions (all 4 models)

| Clip | emotion2vec+ base | emotion2vec+ large | Whisper-SER | audeering (A / D / V → mapped) |
|------|------------------|--------------------|-------------|--------------------------------|
| 0 | **sad** | sad | sad | 0.38 / 0.33 / 0.32 → sad/tired |
| 1 | **sad** | sad | sad | 0.46 / 0.43 / 0.45 → neutral |
| 2 | **surprised** | happy | surprised | 0.70 / 0.68 / 0.65 → happy/excited |
| 3 | **neutral** | neutral | fearful | 0.37 / 0.43 / 0.45 → neutral |
| 4 | **happy** | happy | surprised | 0.50 / 0.53 / 0.61 → calm/content |
| 5 | **sad** | neutral | happy | 0.29 / 0.38 / 0.51 → neutral |
| 6 | **neutral** | happy | happy | 0.37 / 0.48 / 0.73 → calm/content |
| 7 | **neutral** | **unknown** | happy | 0.30 / 0.32 / 0.35 → sad/tired |
| 8 | **neutral** | **unknown** | happy | 0.41 / 0.46 / 0.56 → neutral |
| 9 | **neutral** | neutral | happy | 0.39 / 0.49 / 0.56 → neutral |
| 10 | **neutral** | happy | happy | 0.42 / 0.49 / 0.62 → calm/content |
| 11 | **neutral** | neutral | happy | 0.40 / 0.44 / 0.44 → neutral |
| 12 | **neutral** | **unknown** | happy | 0.52 / 0.55 / 0.53 → neutral |
| 13 | **neutral** | happy | happy | 0.49 / 0.56 / 0.74 → calm/content |

### Analysis

**emotion2vec+ large is NOT worth it:** 3.3x more params (300M vs 90M), 1.7x more VRAM (628 vs 359 MB), ~177s load time (downloads/converts every load from ModelScope), and **worse predictions** — 3/14 clips classified as "unknown" (clips 7, 8, 12). The larger model's broader training distribution doesn't map cleanly to the 9 emotion categories for nuanced speech.

**Whisper-SER has a massive "happy" bias:** Classifies 10/14 clips as "happy" with 99%+ confidence. The audio is from the movie "Her" — predominantly contemplative, sad, and neutral dialogue. This strongly suggests overfitting to its small training set (RAVDESS + SAVEE + TESS + URDU = ~4,170 acted emotion samples).

**emotion2vec+ base shows realistic variety:** sad (3), surprised (1), happy (1), neutral (9). The neutral predictions for quiet dialogue clips are plausible for movie conversation. Trained on 4,788 hours of pseudo-labeled diverse audio, which generalizes better to real speech.

**audeering provides continuous emotional texture:** Dimensional A/V/D output aligns with emotion2vec+ on extreme clips (clip 0: both detect sadness; clip 2: both detect high arousal). The continuous values are richer than categorical labels — useful for tracking emotional trends over time. However, the rough categorical mapping (threshold-based happy/sad/neutral) loses nuance that the raw scores preserve.

**audeering is fastest (29ms, 2.2x faster than emotion2vec+ base)** but outputs dimensional scores requiring interpretation, not direct categorical labels. Best for: emotional trending dashboards, mood-over-time graphs, stress monitoring.

### audeering Loading Bug — Lesson Learned

The initial Run 2 produced all zeros from audeering. Root cause and fix documented here for future reference.

**Bug 1 — AutoModel drops custom head:** `AutoModel.from_pretrained("audeering/...", trust_remote_code=True)` loaded the base `Wav2Vec2Model` encoder but **silently dropped the `RegressionHead` classifier**. The regression head weights were logged as "UNEXPECTED" keys. The audeering model doesn't ship a custom `modeling_*.py` — it relies on users defining classes locally. No `auto_map` in config, so `AutoModel` falls back to `Wav2Vec2Model`.

**Bug 2 — transformers v5.2.0 compat:** Even with custom classes, `from_pretrained` crashed with `'EmotionModel' object has no attribute 'all_tied_weights_keys'`. The audeering model card code predates transformers v5.2's `_finalize_model_loading` which expects `all_tied_weights_keys` as a **dict** (not set).

**Fix:** Define custom `EmotionModel(Wav2Vec2PreTrainedModel)` + `RegressionHead` from model card, add `_tied_weights_keys = []` class attribute + `self.all_tied_weights_keys = {}` in `__init__`. Full working code in `scripts/benchmark_ser.py`.

### Decision (Run 4): emotion2vec+ base — ACCEPTED (primary), audeering — INTERESTING (secondary)

Benchmark-driven decision across 4 models (Run 4):
- **emotion2vec+ base: ACCEPTED** — best balance of accuracy (9 categorical emotions), VRAM (359 MB), and generalization. 64ms/clip is well within real-time budget. FunASR license is commercial-OK.
- **emotion2vec+ large: REJECTED** — 3/14 "unknown" predictions, 177s load time, only marginally different latency. The base model is strictly better for our use case.
- **Whisper-SER: REJECTED** — massive "happy" bias (10/14 clips), 7x slower, 7x more VRAM. Overfit to small acted-speech datasets.
- **audeering A/V/D: INTERESTING** — fastest inference (29ms), continuous emotional dimensions provide unique signal for mood tracking. But CC-BY-NC-SA license blocks commercial use. Could complement emotion2vec+ if licensed commercially.

### Run 5: Speech-Segmented Clips (2026-02-27)

**Problem with Runs 1-4:** Fixed 5s window splitting was naive — it included music-only clips (the "Her" movie has background score), cut across speech/silence boundaries, and diluted emotional signal. Many clips in the second half were classified as "neutral" by emotion2vec+ base even when speech had detectable emotion.

**Research finding:** MSP-Podcast corpus (the standard SER training dataset, 409 hours) uses a **2.75-11s segment range**. The consensus optimal SER clip length is **3-5 seconds** — enough phonetic coverage for reliable emotion detection, short enough to avoid mixed emotions within a clip. See Section 11 for full research.

**New segmentation approach:**
1. Used `ffmpeg silencedetect` (-30dB, 0.5s min silence) to identify speech boundaries
2. Merged speech regions separated by < 1.5s gaps (continuous dialogue context)
3. Filtered to >= 3.0s minimum (MSP-Podcast research minimum)
4. Split segments > 7s into ~4s chunks
5. Skipped music intro (0-4.82s) — no speech content

Result: 9 speech-only clips (avg 3.8s) from the 71s source, vs 14 naive 5s clips before.
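Steps 2-4 of the segmentation above can be sketched in a few lines. This is a minimal illustration, not the actual script — `merge_and_filter` and its parameter names are hypothetical, but the thresholds mirror the text (1.5s gap merge, 3.0s minimum, split above 7s into ~4s chunks):

```python
def merge_and_filter(regions, max_gap=1.5, min_dur=3.0, split_above=7.0, target=4.0):
    """Bridge short gaps between speech regions, drop clips too short
    for reliable SER, and split long ones into ~target-second chunks."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))  # bridge gap
        else:
            merged.append((start, end))
    clips = []
    for start, end in merged:
        dur = end - start
        if dur < min_dur:
            continue                            # below the 3.0s research minimum
        if dur <= split_above:
            clips.append((start, end))
        else:
            n = max(1, round(dur / target))     # number of ~target-second chunks
            step = dur / n
            clips.extend((start + i * step, start + (i + 1) * step) for i in range(n))
    return clips
```

Feeding in the `ffmpeg silencedetect` boundaries (step 1) would reproduce the 9-clip output described above.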

#### Run 5 Results: 3-Model Comparison (speech-segmented)

Whisper-SER excluded (REJECTED in Run 4).

| Metric | emotion2vec+ base | emotion2vec+ large | audeering A/V/D | Winner |
|--------|-------------------|-------------------|-----------------|--------|
| Parameters | 90M | 300M | 200M | emotion2vec base |
| Load time | 100,279 ms | 185,084 ms | **8,041 ms** | audeering |
| VRAM | **359 MB** | 627 MB | 631 MB | emotion2vec base |
| Latency p50 | 31 ms | 36 ms | **18 ms** | audeering (1.7x faster) |
| Latency p95 | 35 ms | 39 ms | **21 ms** | audeering |
| Output type | 9 categorical | 9 categorical | 3 continuous (A/V/D) | — |

Note: Load times are high because funasr downloads from ModelScope inside a fresh Docker container. In production (baked into image), emotion2vec+ base loads in ~5s.

#### Per-Clip Predictions (Run 5, speech-segmented)

| Clip | Time range | emotion2vec+ base | emotion2vec+ large | audeering (A / D / V → mapped) |
|------|-----------|-------------------|-------------------|--------------------------------|
| 0 | 4.82-8.43s | **fearful** (0.91) | fearful (1.00) | 0.47 / 0.39 / 0.34 → sad/tired |
| 1 | 8.43-12.03s | **sad** (0.88) | sad (0.56) | 0.59 / 0.52 / 0.44 → neutral |
| 2 | 19.48-22.62s | **surprised** (0.49) | neutral (0.78) | 0.60 / 0.62 / 0.70 → happy/excited |
| 3 | 26.88-31.36s | **happy** (1.00) | happy (1.00) | 0.44 / 0.52 / 0.76 → calm/content |
| 4 | 37.90-41.14s | **neutral** (0.55) | neutral (0.98) | 0.37 / 0.39 / 0.41 → neutral |
| 5 | 51.06-55.34s | **neutral** (1.00) | happy (0.95) | 0.45 / 0.52 / 0.67 → calm/content |
| 6 | 58.77-62.85s | **neutral** (1.00) | neutral (0.48) | 0.49 / 0.55 / 0.53 → neutral |
| 7 | 62.85-66.92s | **neutral** (1.00) | happy (0.91) | 0.55 / 0.60 / 0.80 → happy/excited |
| 8 | 66.92-71.00s | **neutral** (0.77) | happy (1.00) | 0.49 / 0.57 / 0.73 → calm/content |

#### Run 5 Analysis

**Speech segmentation dramatically improved results:**
- emotion2vec+ base now detects 5 distinct emotions (fearful, sad, surprised, happy, neutral) vs 4 before (sad, surprised, happy, neutral — with 9/14 clips neutral). The music-diluted clips in Run 4 were masking real emotions.
- emotion2vec+ large has **zero "unknown" predictions** (0/9 vs 3/14 in Run 4) — the "unknown" results were likely caused by music-only or music-heavy clips. However, it now skews toward "happy" (4/9 clips), echoing Whisper-SER's bias.
- Latencies improved 1.6-2.1x across all models (shorter clips = less processing): base 64→31ms, large 73→36ms, audeering 29→18ms.

**emotion2vec+ base vs large — nuanced picture:**
- Both agree on clips 0 (fearful), 1 (sad), 3 (happy), 4 (neutral) — the clearest emotional signals.
- They disagree on clips 2, 5, 7, 8 where large predicts "happy" and base predicts "neutral" or "surprised".
- Audeering's continuous values suggest clips 5, 7, 8 DO have positive valence (0.67-0.80) — so the large model may be correct that there's positive emotion, but "happy" is too strong a label for calm contentment.
- Base's "neutral" for clips 5-8 is arguably wrong — there's clearly some emotional content. But base's overall variety (5 emotions) is more plausible than large's "happy" bias.

**audeering provides the richest signal:**
- Its dimensional output distinguishes sad/tired (V < 0.4), neutral (V 0.4-0.6), calm/content (V 0.6-0.8 with A ≤ 0.5), and happy/excited (V > 0.6 with A > 0.5) — arousal is the discriminator between the two high-valence labels.
- This 4-level granularity is better than either emotion2vec model's tendency to collapse moderate emotion into "neutral" (base) or "happy" (large).
- But it can't detect specific categorical emotions like "fearful" or "surprised" — it only measures arousal, dominance, valence.
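The mapping used in the result tables can be written as a small function. The thresholds are read off the benchmark output above; `map_avd` is an illustrative name, not anything shipped with the audeering model:

```python
def map_avd(valence, arousal):
    """Map audeering valence/arousal (0-1 scale) to the coarse labels
    used in the benchmark tables. Thresholds inferred from the results."""
    if valence < 0.4:
        return "sad/tired"
    if valence > 0.6:
        # high valence: arousal separates excitement from contentment
        return "happy/excited" if arousal > 0.5 else "calm/content"
    return "neutral"
```

Applied to Run 5's clips, this reproduces all nine mapped labels (e.g. clip 2 with V=0.70, A=0.60 → happy/excited; clip 3 with V=0.76, A=0.44 → calm/content).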

**Complementary use confirmed:** Best pipeline = emotion2vec+ base for categorical labels + audeering for dimensional tracking. The two models provide non-overlapping signal: categories tell you WHAT emotion, dimensions tell you HOW MUCH emotion.

#### Updated Decision (Run 5)

Decisions unchanged, but with more confidence:
- **emotion2vec+ base: ACCEPTED** — speech segmentation revealed it detects 5 distinct emotions. 31ms/clip (improved from 64ms). Bake into container image.
- **emotion2vec+ large: STILL REJECTED** — no more "unknown" but shows a "happy" bias (4/9 clips). 185s load time. Not worth 1.7x VRAM for worse categorical accuracy.
- **audeering A/V/D: UPGRADED from INTERESTING to RECOMMENDED (Phase 2)** — dimensional output is clearly complementary to categorical. Together they provide richer emotion understanding. Fast (18ms), only 631 MB additional VRAM.

### Run 6: 4-Model Re-evaluation with Speech-Segmented Clips (2026-02-27)

**Context:** Whisper-SER was re-evaluated with speech-only clips (previously REJECTED in Run 4 on music-contaminated clips). Hypothesis: the "happy bias" was from music contamination, not model deficiency.

**Environment:** Python venv with torch 2.10.0+cu128 on Titan (RTX 5070 Ti, Blackwell). Running in a native host venv (not Docker) resolves the NGC container compatibility issues hit in earlier runs: the torchaudio ABI mismatch, torchvision circular imports, and transformers version conflicts.

**Segmentation:** Same speech-segmented clips as Run 5 (9 clips, avg 3.8s, speech-only).

#### Run 6 Results: 4-Model Comparison (speech-segmented, native venv)

| Metric | emotion2vec+ base | emotion2vec+ large | Whisper-SER | audeering A/V/D |
|--------|-------------------|-------------------|-------------|-----------------|
| Parameters | 90M | 300M | 0.6B | 200M |
| Load time | 102,377 ms | 194,599 ms | 100,031 ms | **32,643 ms** |
| VRAM | **359 MB** | 627 MB | 2,496 MB | 631 MB |
| Latency p50 | 32 ms | 33 ms | 129 ms | **10 ms** |
| License | FunASR (commercial OK) | FunASR (commercial OK) | Apache 2.0 | CC-BY-NC-SA |

#### Per-Clip Predictions (Run 6, speech-segmented, all 4 models)

| Clip | Time range | emotion2vec+ base | emotion2vec+ large | Whisper-SER | audeering (A/V → mapped) |
|------|-----------|-------------------|-------------------|-------------|--------------------------|
| 0 | 4.82-8.43s | fearful | fearful | sad | 0.47/0.34 → sad/tired |
| 1 | 8.43-12.03s | sad | sad | happy | 0.59/0.44 → neutral |
| 2 | 19.48-22.62s | surprised | neutral | surprised | 0.60/0.70 → happy/excited |
| 3 | 26.88-31.36s | happy | happy | happy | 0.44/0.76 → calm/content |
| 4 | 37.90-41.14s | neutral | neutral | surprised | 0.37/0.41 → neutral |
| 5 | 51.06-55.34s | neutral | happy | happy | 0.45/0.67 → calm/content |
| 6 | 58.77-62.85s | neutral | neutral | happy | 0.49/0.53 → neutral |
| 7 | 62.85-66.92s | neutral | happy | surprised | 0.55/0.80 → happy/excited |
| 8 | 66.92-71.00s | neutral | happy | happy | 0.49/0.73 → calm/content |

#### Run 6 Analysis

**Whisper-SER REHABILITATED:** Happy predictions dropped from 10/14 (71%) in Run 4 to 5/9 (56%) in Run 6. Music contamination was the primary cause of the "happy bias", not model deficiency. Whisper-SER now detects 3 distinct emotions (sad, surprised, happy). Still no "neutral" or "fearful" — limited emotion vocabulary compared to emotion2vec's 9 categories.

**emotion2vec+ large nuance — reconsidered:** Clips 5, 7, 8 show "happy" where base says "neutral" and audeering shows positive valence (0.67, 0.80, 0.73). This suggests large IS detecting real positive affect, not just bias. The "happy bias" label from Run 5 may have been unfair — the large model's sensitivity to mild positive emotion could be a feature, not a bug.

**Latency: audeering dominates** at 10ms p50 (3.2x faster than emotion2vec+ base, 12.9x faster than Whisper-SER). Whisper-SER at 129ms is still fast enough for real-time — and actually a 3.5x improvement over Run 4's 450ms (shorter speech-only clips). All models are faster on speech-segmented clips than on the naive 5s windows.

**Infrastructure fix validated:** Native venv with torch 2.10.0+cu128 resolved NGC container issues (torchaudio ABI mismatch, torchvision circular imports, transformers version conflicts). All 4 models loaded and ran without errors.

**emotion2vec+ base consistency:** Predictions are nearly identical to Run 5 (same clips, same segmentation, different runtime). This confirms the model is deterministic and the results are reproducible across Docker and native venv environments.

#### Updated Decision (Run 6)

- **emotion2vec+ large: ACCEPTED** (primary categorical SER) — positive affect sensitivity confirmed as genuine, not bias. Clips where it says "happy" and base says "neutral" correlate with audeering's positive valence scores (V=0.67-0.80). For an emotional awareness system (Dimension 6), sensitivity > conservatism. 33ms/clip, 627MB VRAM, 300M params. Bake into container.
- **emotion2vec+ base: BACKUP** — conservative fallback. Labels ambiguous clips "neutral" rather than guessing. 32ms/clip, 359MB, 90M params. Use if large proves too sensitive in production.
- **Whisper-SER: REHABILITATED** (not rejected) — improved dramatically with speech-only clips. The Run 4 rejection was based on music-contaminated input. Still 4x slower and 7x more VRAM than emotion2vec+ base. Apache 2.0 license is an advantage for commercial distribution.
- **audeering A/V/D: ACCEPTED (Phase 1, alongside large)** — fastest model (10ms), continuous A/V/D dimensions complement categorical labels. Run both in parallel (43ms combined, ~1.26GB). Confidence-weighted fusion at audio pipeline layer. CC-BY-NC-SA license (personal use OK, blocks commercial redistribution).

### Fusion Strategy: Confidence-Weighted (decided Session 83)

Both models run in parallel on each audio segment. Output is fused into a single enriched structure:

```json
{
  "emotion": {
    "primary": "happy",       // from emotion2vec+ large (highest softmax)
    "intensity": "mild",      // mapped from audeering arousal: <0.3 low, 0.3-0.6 mild, 0.6-0.8 moderate, >0.8 high
    "confidence": 0.85,       // fusion: emotion2vec softmax × agreement factor
    "valence": "positive",    // mapped from audeering valence: <0.35 negative, 0.35-0.65 neutral, >0.65 positive
    "arousal": "low"          // mapped from audeering arousal: <0.4 low, 0.4-0.7 moderate, >0.7 high
  }
}
```

**Agreement factor** adjusts confidence:
- emotion2vec "happy" + audeering valence > 0.6 → agreement, confidence stays high
- emotion2vec "happy" + audeering valence < 0.4 → conflict, confidence reduced (×0.6)
- emotion2vec "neutral" + audeering arousal > 0.6 → possible miss, add `"note": "elevated arousal"`

**Why fusion MUST happen at the audio layer (not deferred to Claude or Context Engine):**

1. **Claude is not always available.** The audio pipeline runs locally on Titan. Claude API calls are expensive, rate-limited, and add latency. Emotional state detection must work without any external API dependency — it's a core local capability.
2. **Real-time emotional state for hypergraph and memories.** The knowledge graph (Graphiti) and memory layer (Mem0) need emotional annotations on every segment as they arrive. Waiting for Claude to analyze emotions would add seconds of latency and create a dependency bottleneck in the ingestion pipeline.
3. **Consumer simplicity.** Every downstream consumer (Context Engine, Claude when available, UI, hypergraph) gets a clean pre-digested signal. No consumer needs to understand categorical vs dimensional emotion models — they just read `primary`, `intensity`, `confidence`.
4. **Offline resilience.** If the internet is down, the audio pipeline still produces emotion-annotated transcripts. This is the local-first privacy principle (ADR) in action.

---

## Optimal SER Segment Length — Summary of Findings

*(Condensed reference — the full analysis with sources is in Section 11 below.)*

**Consensus minimum:** 2.75 seconds, established by the MSP-Podcast corpus (409 hours, the largest naturalistic SER dataset). Human annotators cannot reliably label emotion below 2.75s, so models trained on those labels inherit the same limitation.

**Optimal range:** 3.0-5.0 seconds — best accuracy/duration tradeoff. Matches the training distributions of both emotion2vec (IEMOCAP avg 4.5s) and audeering (MSP-Podcast 2.75-11s).

| Duration | Assessment |
|----------|-----------|
| < 1.0s | UNRELIABLE — skip SER |
| 1.0-2.0s | LOW CONFIDENCE — use with caution |
| 2.0-3.0s | ACCEPTABLE — minimum recommended |
| **3.0-5.0s** | **OPTIMAL — best accuracy/duration tradeoff** |
| 5.0-8.0s | GOOD — diminishing returns |
| 8.0-11.0s | ACCEPTABLE — risk of mixed emotions within segment |
| > 11.0s | SHOULD SPLIT at pauses |

**Short clips (< 2s):** Insufficient phonetic coverage, missing prosody patterns, zero-padding introduces noise. Diarization often produces sub-second interjections ("yeah", "hmm") — skip SER for these or merge with adjacent same-speaker segments.

**Long clips (> 10s):** Emotional state may shift multiple times; single-label assignment loses the arc. The model averages frame-level features, diluting emotional peaks. Split at pauses >= 300ms.

**Music/non-speech:** emotion2vec was explicitly trained on song emotion recognition — it will confidently classify background music with a plausible emotion label. A sad background song produces "sad" even with no speaker. **Pipeline guards needed:** pyannote VAD filters pure non-speech; SNR check for speech-over-music; segments < 1s from diarization are likely noise artifacts.

**Recommended pipeline config:**
```python
SER_MIN_DURATION = 1.0    # skip SER below this
SER_LOW_CONFIDENCE = 2.0  # flag as low confidence
SER_MAX_DURATION = 11.0   # split segments above this
SER_SPLIT_PAUSE = 0.3     # minimum pause for splitting
SER_OPTIMAL_RANGE = (3.0, 8.0)
```
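A gate built on these constants might look like the following (an illustrative sketch mirroring the duration table above, not the pipeline's actual code):

```python
SER_MIN_DURATION = 1.0    # skip SER below this
SER_LOW_CONFIDENCE = 2.0  # flag as low confidence
SER_MAX_DURATION = 11.0   # split segments above this

def ser_gate(duration):
    """Return (action, confidence_flag) for a segment duration in seconds."""
    if duration < SER_MIN_DURATION:
        return "skip", None
    if duration < SER_LOW_CONFIDENCE:
        return "run", "low"
    if duration > SER_MAX_DURATION:
        return "split", None          # re-segment at pauses >= SER_SPLIT_PAUSE
    return "run", "normal"
```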

**Key citations:** MSP-Podcast Corpus (2.75-11s criteria), emotion2vec ACL 2024, Wagner et al. IEEE TAFFC 2023 (audeering fine-tuning), Csuka et al. 2024 (1.5s vs 3s vs 5s empirical comparison), EmoBox Interspeech 2024.

---

## 11. Segment Length Analysis for SER Models (2026-02-27)

**Research question:** What is the optimal and minimum audio segment length for reliable Speech Emotion Recognition, specifically for a pipeline that receives variable-length diarized speech segments?

### 11.1 What the Literature Says About Minimum Utterance Length

**The consensus minimum is approximately 2-3 seconds.** Below this threshold, emotion recognition accuracy degrades significantly.

**MSP-Podcast corpus (the largest naturalistic SER dataset, 409 hours)** established the most rigorous duration criteria in the field:
- **Minimum: 2.75 seconds** — "The lower threshold is justified by the need to have enough context for a rater to reliably infer an emotional label during the perceptual evaluation" ([Busso et al., 2025, arXiv:2509.09791](https://arxiv.org/html/2509.09791v1)). If human annotators cannot reliably label emotion below 2.75s, a model trained on those annotations inherits the same limitation.
- **Maximum: 11 seconds** — "Emotions can vary during a speaking turn, so having a single label may not accurately reflect the emotional content." Segments >11s are re-segmented at pauses ≥0.3s.
- **Additional filter:** Segments with fewer than 5 words are excluded even if ≥2.75s, eliminating thin-content segments.

**Empirical accuracy by segment length** (from [Csuka et al., 2024, PMC10987695](https://pmc.ncbi.nlm.nih.gov/articles/PMC10987695/)):

| Segment Length | DNN Accuracy (Emo-DB) | DNN Accuracy (RAVDESS) | DNN Accuracy (Combined) |
|---------------|----------------------|------------------------|------------------------|
| 1.5 seconds | 64.69% | 53.55% | 54.49% |
| 3.0 seconds | **72.91%** | 60.01% | **62.36%** |
| 5.0 seconds | 69.21% | **61.00%** | 61.79% |

**Key finding:** 3 seconds is the sweet spot. Going from 1.5s to 3s yields a major accuracy jump (+8-13%), but 5s provides diminishing or negative returns vs 3s, likely because the additional context introduces emotional ambiguity or averaged-out signals.

**Speaker recognition analogy:** Research on speaker verification shows accuracy drops sharply below 1 second — from 97% for 2.5-minute pairs to 71.5% for 2-second pairs. The same pattern holds for emotion, which requires even more temporal context than speaker identity.

### 11.2 emotion2vec Training & Evaluation Lengths

**emotion2vec (original, ACL 2024):**
- **Pre-training data:** 262 hours from 5 English datasets: IEMOCAP (7.0h), MELD (12.2h), CMU-MOSEI (91.9h), MEAD (37.3h), MSP-Podcast V1.8 (113.5h)
- **IEMOCAP utterances:** 5,531 utterances over 7 hours → average ~4.5 seconds (range: 0.5s to 25s)
- **MSP-Podcast utterances:** 73,042 utterances over 113.5 hours → average ~5.6 seconds (constrained to 2.75-11s by design)
- **Pre-training approach:** Online distillation with combined utterance-level and frame-level losses. The "chunk embedding" method for utterance-level loss performed best (71.79% WA on IEMOCAP), dividing input into multiple tokens to represent global emotion.
- **Feature extraction rate:** 50 Hz (20ms frames), meaning a 3-second clip produces 150 frames, a 5-second clip produces 250 frames.
- **No explicit min/max constraint** documented for inference. The model accepts arbitrary-length 16kHz mono WAV input.

**emotion2vec+ variants:**
- **emotion2vec+ seed:** Fine-tuned on 201 hours of EmoBox academic data (the same datasets with durations ranging 0.1h to 113.5h).
- **emotion2vec+ base:** 4,788 hours of filtered pseudo-labeled data.
- **emotion2vec+ large:** 42,526 hours of filtered pseudo-labeled data.
- All variants inherit the same duration characteristics from their training data, which is dominated by MSP-Podcast (2.75-11s) and IEMOCAP (~4.5s avg).

**Practical minimum for emotion2vec:** The wav2vec2/data2vec feature extractor has a minimum receptive field of 400 samples (25ms at 16kHz). Technically, the model can process audio as short as 25ms, but the output will be meaningless for emotion. Given the training data distribution peaks at 3-6 seconds, segments below ~1.5s are increasingly out-of-distribution.

### 11.3 audeering wav2vec2 Emotion Model

**Model:** `audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim`
**Paper:** [Wagner et al., "Dawn of the Transformer Era in SER: Closing the Valence Gap"](https://arxiv.org/abs/2203.07378) (IEEE TAFFC, 2023)

- **Fine-tuned on MSP-Podcast v1.7** — which means all training utterances are 2.75-11 seconds (by the MSP-Podcast selection criteria documented above).
- **Additional test corpora:** IEMOCAP (~4.5s avg), CMU-MOSI (~2.6h).
- **Base model:** wav2vec2-large-robust (pruned from 24 to 12 transformer layers), pre-trained on 960h LibriSpeech + noisy speech. The "robust" variant was trained with added noise for robustness.
- **Input:** 16kHz mono audio as float32 numpy array. No documented minimum or maximum duration.
- **Memory scaling:** wav2vec2 transformers use full self-attention, so memory scales quadratically with sequence length. Practical limit is ~80 seconds before GPU memory issues on consumer hardware (not a concern on DGX Spark with 128GB).

**Since audeering was trained exclusively on MSP-Podcast (2.75-11s range), segments outside this range are out-of-distribution.** Particularly, very short segments (<2s) will produce unreliable arousal/valence/dominance scores.

### 11.4 Consensus on Ideal Segment Length

Based on the literature, training data characteristics, and empirical benchmarks:

| Duration Range | Assessment | Evidence |
|---------------|------------|----------|
| **<1.0s** | **UNRELIABLE — discard or flag** | Below minimum receptive field for emotion. Human annotators cannot label reliably. Diarization artifacts common at this length. |
| **1.0-2.0s** | **LOW CONFIDENCE — use with caution** | Some emotion signal present but accuracy drops ~15-20% vs 3s. May work for high-arousal emotions (anger, fear) but poor for nuanced states. |
| **2.0-3.0s** | **ACCEPTABLE — minimum recommended** | MSP-Podcast lower bound is 2.75s. Most SER datasets include utterances starting here. |
| **3.0-5.0s** | **OPTIMAL — best accuracy/duration tradeoff** | 3s segments show peak accuracy in controlled studies. IEMOCAP average is 4.5s. emotion2vec pre-training data peaks in this range. |
| **5.0-8.0s** | **GOOD — diminishing returns** | Accuracy plateau or slight decline. Single-emotion assumption starts to weaken as duration increases. |
| **8.0-11.0s** | **ACCEPTABLE — risk of mixed emotions** | MSP-Podcast upper bound is 11s. Emotions may shift within this window, making a single label less representative. |
| **>11.0s** | **SHOULD SPLIT — re-segment at pauses** | MSP-Podcast re-segments these at pauses ≥0.3s. A single emotion label is unreliable for long stretches. |

**Practical recommendation for her-os pipeline:** Run SER at normal confidence on segments of 2.0-11.0s. Run segments of 1.0-2.0s flagged as low-confidence, skip segments <1.0s, and split segments >11.0s at pause boundaries (Whisper word timestamps can identify these).

### 11.5 Very Short Clips (<2s) vs Long Clips (>10s)

**Short clips (<2s):**
- Insufficient phonetic coverage for prosody-based emotion features
- Missing speaker-specific patterns (speaking rate, pitch contours) that require multi-sentence context
- Zero-padding to match model expectations introduces noise artifacts
- Diarization often produces sub-second segments for interjections ("yeah", "hmm", "okay") — these carry less emotional signal
- **Mitigation:** Merge adjacent same-speaker segments before SER. If still <2s after merging, either skip SER or attach the emotion from the nearest qualifying segment.
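The merge step could be sketched as follows, assuming a simple `(start, end, speaker)` tuple format (this is an assumption for illustration, not pyannote's actual output schema):

```python
def merge_same_speaker(segments, max_gap=1.0, min_dur=2.0):
    """Merge adjacent same-speaker segments so short interjections gain
    context; drop anything still under min_dur (skip SER on those)."""
    merged = []
    for start, end, spk in segments:
        if merged and merged[-1][2] == spk and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end, spk)   # extend previous segment
        else:
            merged.append((start, end, spk))
    return [seg for seg in merged if seg[1] - seg[0] >= min_dur]
```

A 1.0s clip followed 0.2s later by a 1.8s clip from the same speaker becomes one 3.0s segment, landing in the optimal SER range.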

**Long clips (>10s):**
- Emotional state may shift multiple times within the segment
- A single categorical label (e.g., "sad") loses the temporal emotional arc
- The model averages frame-level features into one utterance embedding, diluting peaks
- **Mitigation options:**
  1. **Split at pauses:** Use Whisper word-level timestamps to find silence gaps ≥300ms, split there (MSP-Podcast approach)
  2. **Sliding window:** Run SER on overlapping 5s windows with 2.5s hop, report per-window emotions
  3. **Frame-level analysis:** emotion2vec supports frame-level (50Hz) output — extract per-frame emotion, then detect emotion boundaries (emotion diarization)
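Option 2 can be sketched as a small generator (illustrative; the window and hop values follow the text):

```python
def sliding_windows(start, end, win=5.0, hop=2.5):
    """Yield overlapping (t0, t1) SER windows over a long segment.
    The final window is clipped to the segment end."""
    t = start
    while t < end:
        yield t, min(t + win, end)
        if t + win >= end:
            break
        t += hop
```

A 12s segment yields four windows — (0, 5), (2.5, 7.5), (5, 10), (7.5, 12) — each falling inside the model's training distribution.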

### 11.6 Music, Non-Speech Audio, and SER Models

**The problem:** In a real-world pipeline, diarized segments may contain background music, TV audio, environmental sounds, or segments where the speaker is humming/singing rather than speaking. What happens when SER models encounter these?

**emotion2vec with music/song input:**
- emotion2vec was **explicitly evaluated on song emotion recognition** and "outperforms all known SSL models even without finetuning" on this task ([Ma et al., 2023](https://arxiv.org/abs/2312.15185)). This is actually a concern — the model will happily classify music with an emotion label, which may not reflect the *speaker's* emotional state.
- Music segments from TV, radio, or ambient sources will produce confident emotion predictions that are **semantically correct for the music but meaningless for the speaker.** A sad background song will be classified as "sad" even though no one is speaking.

**wav2vec2-based models (audeering, SpeechBrain) with non-speech:**
- wav2vec2 was pre-trained on speech (LibriSpeech), so its feature representations are optimized for speech characteristics. Non-speech audio produces out-of-distribution features.
- The model will still output predictions (it cannot refuse), but the A/V/D scores will be unreliable.
- No specific studies document false-positive rates for music input on these models.

**Practical pipeline guards for her-os:**

1. **VAD gate (already in pipeline):** pyannote VAD filters non-speech regions. Music-only segments should be caught by VAD and not reach SER. However, speech-over-music (common in podcasts, TV) will pass VAD.
2. **SNR check:** Samples with estimated SNR <20 dB (heavy background music/noise) should be flagged as low-confidence for SER. The audio pipeline already has energy metrics that could serve as a proxy.
3. **Diarization uncertainty:** If the diarization segment comes from a region where pyannote detected overlapping speakers or was uncertain, flag the SER result as low-confidence.
4. **Post-hoc filtering:** If emotion2vec returns high confidence for "other" or "unknown" (categories 5 and 8), this may indicate non-speech input. These can be used as a soft signal.
5. **Duration-based filtering:** Very short segments (<1s) from diarization are more likely to be noise artifacts, interjections, or music bleed. Skip SER on these.
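
Guards 4 and 5 reduce to two small checks. A minimal sketch (function names and the label-to-probability dict shape are assumptions, not the pipeline's actual API):

```python
def pre_ser_guard(duration_s, min_duration=1.0):
    """Guard 5: skip SER entirely on very short diarized segments."""
    return duration_s >= min_duration


def post_ser_guard(e2v_scores, threshold=0.5):
    """Guard 4: flag likely non-speech when the emotion2vec probability
    mass on 'other' + 'unknown' dominates. Soft signal only — the
    prediction is kept, just marked."""
    non_speech = e2v_scores.get("other", 0.0) + e2v_scores.get("unknown", 0.0)
    return "possible_non_speech" if non_speech >= threshold else None
```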

### 11.7 Recommendations for her-os Audio Pipeline

Given that our pipeline receives diarized segments from pyannote (variable length, 16kHz mono), here is the recommended SER preprocessing:

```
Diarized segment arrives
    │
    ├── Duration < 1.0s → SKIP SER, mark as "too_short"
    │
    ├── Duration 1.0-2.0s → RUN SER, flag confidence="low"
    │                        Consider merging with adjacent same-speaker segment
    │
    ├── Duration 2.0-11.0s → RUN SER, confidence="normal"
    │                         This is the sweet spot (matches training data)
    │
    └── Duration > 11.0s → SPLIT at pauses ≥300ms (using Whisper timestamps)
                            Then run SER on each sub-segment
```

**Configuration constants (for `audio-pipeline` service):**

```python
SER_MIN_DURATION = 1.0      # seconds — skip SER below this
SER_LOW_CONFIDENCE = 2.0    # seconds — flag as low confidence below this
SER_MAX_DURATION = 11.0     # seconds — split segments above this
SER_SPLIT_PAUSE = 0.3       # seconds — minimum pause for splitting long segments
SER_OPTIMAL_RANGE = (3.0, 8.0)  # seconds — highest expected accuracy
```
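
A minimal routing sketch implementing the decision tree above, using the constants from this section (the `route_segment` helper itself is illustrative, not part of the service API):

```python
# Constants repeated from the configuration block above.
SER_MIN_DURATION = 1.0
SER_LOW_CONFIDENCE = 2.0
SER_MAX_DURATION = 11.0
SER_SPLIT_PAUSE = 0.3


def route_segment(duration_s):
    """Map a diarized segment's duration to a SER action."""
    if duration_s < SER_MIN_DURATION:
        return {"action": "skip", "reason": "too_short"}
    if duration_s < SER_LOW_CONFIDENCE:
        return {"action": "run", "confidence": "low"}
    if duration_s <= SER_MAX_DURATION:
        return {"action": "run", "confidence": "normal"}
    return {"action": "split", "min_pause": SER_SPLIT_PAUSE}
```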

### 11.8 Sources for This Section

- [MSP-Podcast Corpus paper](https://arxiv.org/html/2509.09791v1) — 2.75-11s duration criteria, re-segmentation methodology
- [emotion2vec ACL 2024 paper](https://aclanthology.org/2024.findings-acl.931/) — Pre-training data (262h, 5 datasets), chunk embedding, frame-level features
- [emotion2vec+ large on HuggingFace](https://huggingface.co/emotion2vec/emotion2vec_plus_large) — 42,526h training data, 9 emotion classes
- [EmoBox Interspeech 2024](https://arxiv.org/html/2406.07162v1) — 32 datasets, 262K utterances, ~294h total, avg ~4s/utterance
- [Wagner et al., "Dawn of the Transformer Era in SER"](https://arxiv.org/abs/2203.07378) — audeering wav2vec2 fine-tuning on MSP-Podcast
- [Csuka et al., 2024, "Implementing ML for continuous emotion prediction from uniformly segmented voice recordings"](https://pmc.ncbi.nlm.nih.gov/articles/PMC10987695/) — 1.5s vs 3s vs 5s accuracy comparison
- [Neurocomputing, "Graph-based emotion recognition with attention pooling for variable-length utterances"](https://www.sciencedirect.com/science/article/abs/pii/S0925231222005380) — Variable-length challenges
- [IEMOCAP database](https://sail.usc.edu/iemocap/) — 12h, 5531 utterances, 0.5-25s range, ~4.5s average
- [Chunk-Level SER framework](https://ieeexplore.ieee.org/document/9442335/) — Dynamic temporal modeling for variable-length segments
- [Temporal Bucketing for SER](https://www.sciencedirect.com/science/article/abs/pii/S0010482525012648) — Time-series bucketing outperforms SOTA on 5 datasets
- [wav2vec2 HuggingFace Forum: File size/length limits](https://discuss.huggingface.co/t/file-size-speech-length-limit-for-wave2vec2/3636) — Practical memory constraints

---

## 12. Production Deployment — VAD-Gated Dual-Model SER (2026-02-27)

**Date:** 2026-02-27
**Status:** Deployed on Titan, running in production alongside audio pipeline.

### 12.1 Architecture: VAD-Gated Sidecar

The SER pipeline runs as a **sidecar process** alongside the audio pipeline Docker container. This separation was forced by the NGC ABI incompatibility (see RESEARCH-DOCKER-DEPLOYMENT.md Section 16) but turned out to be the cleaner architecture anyway.

```
┌────────────────────────────┐       ┌──────────────────────────────┐
│ Audio Pipeline (Docker)    │       │ SER Sidecar (native venv)    │
│                            │       │                              │
│  WAV in → WhisperX STT     │       │  WAV in → silero-vad         │
│         → pyannote diariz  │       │         → speech segments    │
│         → speaker ID       │       │         → emotion2vec+ large │
│         → SER HTTP call ───┼─:9101─┼→        → audeering wav2vec2 │
│         → time-overlap map │       │         → fused emotion JSON │
│         → enriched response│       │                              │
│  Port: 9100                │       │  Port: 9101                  │
└────────────────────────────┘       └──────────────────────────────┘
```

### 12.2 Pipeline Flow

1. **Audio arrives** at `/v1/transcribe` (audio pipeline, port 9100)
2. **WhisperX STT** transcribes → text segments with timestamps
3. **Pyannote diarization** identifies speakers → enrolled user vs. others
4. **Privacy gate:** If primary speaker is enrolled user (voiceprint match):
   - Audio pipeline sends **full clip WAV** to SER sidecar at `localhost:9101/v1/analyze`
5. **SER sidecar** runs:
   - **silero-vad** detects speech segments (merge gaps < 1.5s, filter < 1.0s)
   - **emotion2vec+ large** → categorical emotion (9 classes with softmax probabilities)
   - **audeering wav2vec2** → dimensional A/V/D (arousal, valence, dominance: 0..1)
   - **Fusion** → primary emotion + intensity + confidence + valence + arousal labels
6. **Time-overlap mapping:** Audio pipeline matches SER emotion segments to WhisperX text segments by finding the maximum temporal overlap
7. **Response:** Each text segment includes `emotion: {primary, intensity, confidence, valence, arousal, e2v_scores, avd_scores}` or `null` (non-enrolled speaker)
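
Step 6 (time-overlap mapping) can be sketched as follows. Segments are represented here as `(start, end, payload)` tuples; the names are illustrative, not the pipeline's actual data structures.

```python
def overlap(a, b):
    """Overlap duration in seconds between intervals a=(s, e) and b=(s, e)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def attach_emotions(text_segments, emotion_segments):
    """For each text segment, return the payload of the emotion segment
    with the largest temporal overlap, or None if nothing overlaps."""
    out = []
    for ts in text_segments:
        best, best_ov = None, 0.0
        for es in emotion_segments:
            ov = overlap((ts[0], ts[1]), (es[0], es[1]))
            if ov > best_ov:
                best, best_ov = es[2], ov
        out.append(best)
    return out
```

Matching by maximum overlap (rather than start-time proximity) is what makes the mapping robust to the independent clocks of WhisperX and the VAD — see lesson 4 in §12.8.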

### 12.3 Fusion Strategy

The fusion combines categorical (emotion2vec) and dimensional (audeering) signals:

| Output Field | Source | How |
|-------------|--------|-----|
| `primary` | emotion2vec | Highest softmax probability label |
| `confidence` | emotion2vec × agreement | Softmax probability × 1.15 (agree) / 0.75 (disagree) / 1.0 (neutral) |
| `intensity` | audeering arousal | low (<0.35), mild (0.35-0.5), moderate (0.5-0.65), high (>0.65) |
| `valence` | audeering valence | negative (<0.45), neutral (0.45-0.55), positive (>0.55) |
| `arousal` | audeering arousal | low (<0.4), moderate (0.4-0.6), high (>0.6) |

**Agreement factor:** If emotion2vec says "happy" (positive) and audeering valence > 0.55 (positive), confidence gets a 15% boost. If they disagree (e.g., emotion2vec says "happy" but valence < 0.45), confidence is penalized by 25%.

**Duration penalty:** Segments < 2.0s get a 30% confidence reduction (below MSP-Podcast training minimum of 2.75s).
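
The confidence computation from the table reduces to a few lines. A sketch under one assumption: the valence polarity assigned to each categorical label (e.g. "surprised" as positive) is illustrative, not taken from the deployed code.

```python
# Assumed label polarity for the agreement check (illustrative).
POSITIVE = {"happy", "surprised"}
NEGATIVE = {"sad", "angry", "fearful", "disgusted"}


def fuse_confidence(primary, e2v_prob, valence, duration_s):
    """Softmax probability x agreement factor, then duration penalty."""
    agree = (primary in POSITIVE and valence > 0.55) or \
            (primary in NEGATIVE and valence < 0.45)
    disagree = (primary in POSITIVE and valence < 0.45) or \
               (primary in NEGATIVE and valence > 0.55)
    factor = 1.15 if agree else 0.75 if disagree else 1.0
    conf = e2v_prob * factor
    if duration_s < 2.0:      # below MSP-Podcast training minimum
        conf *= 0.7           # 30% duration penalty
    return min(conf, 1.0)
```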

### 12.4 VAD Parameters

Chosen to match benchmark methodology (Run 5/6):

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Merge gaps | < 1.5s | Consecutive speech with brief pauses treated as one segment |
| Min duration | >= 1.0s | Below 1s: unreliable SER, diarization artifacts |
| Low confidence | < 2.0s | MSP-Podcast min is 2.75s; flag shorter clips |
| Speech threshold | 0.5 | silero-vad default probability threshold |
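
The merge and minimum-duration rules can be sketched as a post-processing pass over silero-style timestamp dicts (assuming the sample-to-second conversion has already happened; the helper name is illustrative):

```python
MERGE_GAP = 1.5      # seconds — merge segments separated by less than this
MIN_DURATION = 1.0   # seconds — drop anything shorter after merging


def clean_vad_segments(segments, merge_gap=MERGE_GAP, min_dur=MIN_DURATION):
    """Merge nearby speech segments, then filter out very short ones."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if merged and seg["start"] - merged[-1]["end"] < merge_gap:
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            merged.append(dict(seg))  # copy so the input isn't mutated
    return [s for s in merged if s["end"] - s["start"] >= min_dur]
```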

### 12.5 Deployment Details

**SER Sidecar:** Runs as a native Python process in `~/ser-venv/`:
- `torch 2.10.0+cu128` (native Blackwell SM_120 support, no JIT compilation)
- Models cached in `~/.cache/modelscope/` and `~/.cache/huggingface/`
- Cold start: ~15s (model loading only)
- Hot inference: ~44ms per segment (1ms VAD + 33ms emotion2vec + 10ms audeering)

**Audio Pipeline:** Docker container `her-os-audio` (NGC base):
- Calls SER sidecar via HTTP `POST http://localhost:9101/v1/analyze`
- 5-second timeout, graceful degradation (returns `emotion: null` if sidecar down)
- No Dockerfile changes — SER is purely an HTTP client addition
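
The degradation path can be sketched with the stdlib alone. The endpoint and timeout follow this section; the raw-WAV request body and helper name are assumptions, not the sidecar's documented contract.

```python
import json
import urllib.error
import urllib.request


def call_ser_sidecar(wav_path, url="http://localhost:9101/v1/analyze",
                     timeout=5.0):
    """POST a clip to the SER sidecar; return parsed JSON, or None on
    any failure so the STT response simply carries emotion: null."""
    try:
        with open(wav_path, "rb") as f:
            req = urllib.request.Request(
                url, data=f.read(),
                headers={"Content-Type": "audio/wav"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except (OSError, urllib.error.URLError, json.JSONDecodeError):
        return None  # sidecar down, timeout, or bad payload
```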

**VRAM (measured on Titan):**

| Component | VRAM | Container |
|-----------|------|-----------|
| WhisperX large-v3 | ~5 GB | Docker (her-os-audio) |
| pyannote diarization 3.1 | ~4 GB | Docker (her-os-audio) |
| emotion2vec+ large (300M params) | 627 MB | Native venv |
| audeering wav2vec2 (200M params) | 631 MB | Native venv |
| silero-vad (1.5M params) | ~1 MB | Native venv |
| **Total** | **~10.3 GB** | Mixed |

128 GB unified memory → ~8% utilization.

### 12.6 Privacy Boundary

Emotion analysis is **only performed for the enrolled user's speech segments.** This is enforced at two levels:

1. **Audio pipeline:** Only calls SER sidecar when `speaker_label not in ("other", "SPEAKER_UNKNOWN")`
2. **Response:** Non-enrolled segments have `emotion: null`

This means: if a family member speaks near the Omi wearable, their speech is transcribed (for context) but their emotions are NOT analyzed. Only the enrolled user's emotional state is tracked.

### 12.7 Flutter App Integration

The Flutter desktop app displays emotion as color-coded chips in the transcript:

| Emotion | Color | Badge |
|---------|-------|-------|
| happy | Amber | "happy" |
| sad | Blue | "sad" |
| angry | Red | "angry" |
| neutral | Grey | "neutral" |
| fearful | Purple | "fearful" |
| surprised | Teal | "surprised" |
| disgusted | Brown | "disgusted" |

Chips only appear when `confidence > 0.3`. Below that threshold, the emotion is too uncertain to display.

### 12.8 Key Lessons from Deployment

1. **NGC ABI is fragile:** Never pip install torch-dependent packages inside NGC containers. Use a sidecar.
2. **Docker Hub IPv6 on Titan:** Docker Hub is unreachable from Titan due to IPv6 routing issues. Use native venv or fix Docker daemon IPv6 config.
3. **`snapshot_download` for model baking:** When baking HuggingFace models into Docker images, use `huggingface_hub.snapshot_download()` instead of inline Python class definitions (Docker shell corrupts multi-line Python).
4. **Time-overlap mapping:** Whisper text timestamps and VAD emotion timestamps are independent. Map them by computing maximum overlap duration, not start-time proximity.
5. **Graceful degradation is essential:** The audio pipeline must function without SER. A 5-second HTTP timeout + connection error handling ensures STT is never blocked by SER issues.

---

## 13. Compound Emotion Labels — Categorical + Dimensional Fusion (2026-02-27)

### 13.1 Motivation

The dual-model SER pipeline (emotion2vec+ categorical + audeering A/V/D dimensional) returns 9 basic emotion categories. But human emotional experience is far more granular: "excited" and "content" are both "happy," yet feel completely different. By fusing the categorical label with dimensional scores, we can derive **compound emotions** — 37+ nuanced labels that better capture what someone is actually feeling.

### 13.2 Theoretical Foundations

Three emotion models inform the compound mapping:

**Plutchik's Wheel (1980)** — 8 primary emotions (joy, trust, fear, surprise, sadness, disgust, anger, anticipation) arranged in a flower pattern. Adjacent emotions blend into dyads (joy + trust = love). Emotions vary in intensity: ecstasy → joy → serenity.

**Russell's Circumplex (1980)** — 2D space with valence (pleasant/unpleasant) and arousal (activated/deactivated) as axes. All emotions are points in this space. "Excited" = high arousal + positive valence. "Depressed" = low arousal + negative valence.

**Mehrabian's PAD Model (1996)** — 3D space adding **Dominance** (feeling in control vs. controlled). This third axis differentiates emotions that Russell's 2D model conflates: anger (high dominance) vs. fear (low dominance) both have negative valence + high arousal, but dominance separates them.

### 13.3 The Compound Mapping

**Input:** `(primary_emotion, arousal_level, dominance_level)` → **Output:** compound label

Arousal and dominance are each quantized to 3 levels using audeering's 0-1 range:
- **Low:** ≤ 0.40
- **Moderate:** > 0.40 and ≤ 0.60
- **High:** > 0.60

#### Happy derivatives (7 compounds)

| Arousal | Dominance | Compound | Description |
|---------|-----------|----------|-------------|
| high | high | elation | triumphant joy, peak positive |
| high | low/neutral | excitement | energized happiness, anticipation |
| moderate | high | pride | satisfied achievement |
| moderate | neutral/low | joy | standard happiness |
| low | high | contentment | peaceful satisfaction |
| low | low | relief | tension release |
| low | neutral | serenity | calm happiness |

#### Angry derivatives (6 compounds)

| Arousal | Dominance | Compound | Description |
|---------|-----------|----------|-------------|
| high | high | rage | explosive, in control |
| high | low/neutral | frustration | blocked, not in control |
| moderate | high | irritation | mild, dismissive anger |
| moderate | low | resentment | simmering, powerless |
| low | high | contempt | cold disdain |
| low | low/neutral | bitterness | lingering hurt |

#### Fearful derivatives (6 compounds)

| Arousal | Dominance | Compound | Description |
|---------|-----------|----------|-------------|
| high | low | panic | acute, no escape |
| high | high/neutral | alarm | alert, fight response |
| moderate | low | anxiety | persistent worry |
| moderate | neutral/high | nervousness | mild apprehension |
| low | low | dread | anticipated doom |
| low | neutral/high | unease | vague discomfort |

#### Sad derivatives (6 compounds)

| Arousal | Dominance | Compound | Description |
|---------|-----------|----------|-------------|
| high | any | grief | acute loss |
| moderate | low | sadness | standard sadness |
| moderate | high | disappointment | unmet expectations |
| low | low | melancholy | deep, lingering |
| low | high | resignation | accepted loss |
| low | neutral | wistfulness | gentle longing |

#### Surprised derivatives (6 compounds)

| Arousal | Dominance | Compound | Description |
|---------|-----------|----------|-------------|
| high | high | amazement | positive awe |
| high | low | shock | overwhelming |
| moderate | high | delight | pleasant surprise |
| moderate | low | disbelief | can't accept |
| moderate | neutral | curiosity | wanting to understand |
| low | any | realization | quiet understanding |

#### Disgusted derivatives (4 compounds)

| Arousal | Dominance | Compound | Description |
|---------|-----------|----------|-------------|
| high | high | revulsion | strong rejection |
| high | low | horror | disgusted + afraid |
| moderate | high | contempt | moral disgust |
| low/moderate | low/neutral | aversion | mild distaste |

#### Neutral derivatives (5 compounds)

| Arousal | Dominance | Compound | Description |
|---------|-----------|----------|-------------|
| low | neutral | neutral | baseline |
| low | low | disengaged | checked out |
| low | high | calm | peaceful, grounded |
| moderate | high | focused | attentive, task-oriented |
| moderate | low | uncertain | wavering, undecided |
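
The mapping reduces to quantization plus a table lookup. A sketch showing only the "happy" table (the others follow the same shape); treating the moderate band as the tables' "neutral" dominance is an assumption of this sketch.

```python
def level(x):
    """Quantize a 0-1 score using the 0.40 / 0.60 thresholds."""
    if x <= 0.40:
        return "low"
    if x <= 0.60:
        return "moderate"
    return "high"


HAPPY = {
    ("high", "high"): "elation",
    ("high", "low"): "excitement",
    ("high", "moderate"): "excitement",
    ("moderate", "high"): "pride",
    ("moderate", "moderate"): "joy",
    ("moderate", "low"): "joy",
    ("low", "high"): "contentment",
    ("low", "low"): "relief",
    ("low", "moderate"): "serenity",
}

COMPOUND = {"happy": HAPPY}  # other emotions: analogous tables


def compound_label(primary, arousal, dominance):
    """(primary, arousal, dominance) -> compound label, falling back to
    the raw categorical label when no table entry applies (best practice 4)."""
    table = COMPOUND.get(primary)
    if table is None:
        return primary
    return table.get((level(arousal), level(dominance)), primary)
```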

### 13.4 PAD Octants

Mehrabian's 8 octants divide the 3D PAD space using the midpoint (0.5) of each dimension as the boundary:

| Valence | Arousal | Dominance | Octant | Example |
|---------|---------|-----------|--------|---------|
| + | high | high | exuberant | triumphant, bold |
| + | high | low | dependent | giddy, flustered |
| + | low | high | relaxed | peaceful, assured |
| + | low | low | docile | submissive calm |
| − | high | high | hostile | angry, aggressive |
| − | high | low | anxious | panicked, helpless |
| − | low | high | disdainful | contemptuous |
| − | low | low | bored | disengaged, flat |
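
The octant lookup is a sketch away: each dimension is thresholded at the 0.5 midpoint and the three signs index one of the eight labels (the function name is illustrative).

```python
OCTANTS = {
    (True, True, True): "exuberant",
    (True, True, False): "dependent",
    (True, False, True): "relaxed",
    (True, False, False): "docile",
    (False, True, True): "hostile",
    (False, True, False): "anxious",
    (False, False, True): "disdainful",
    (False, False, False): "bored",
}


def pad_octant(valence, arousal, dominance, midpoint=0.5):
    """Map audeering A/V/D scores (0..1) to a Mehrabian PAD octant."""
    key = (valence > midpoint, arousal > midpoint, dominance > midpoint)
    return OCTANTS[key]
```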

### 13.5 Threshold Calibration

The thresholds (0.40/0.60 for arousal and dominance) were chosen based on:
1. **Audeering's training data (MSP-Podcast)** — most neutral speech clusters around 0.45-0.55 for all three dimensions
2. **Avoiding excessive sensitivity** — tighter bands (0.45/0.55) cause too-frequent label changes; wider bands (0.35/0.65) collapse most speech to "moderate"
3. **The 0.40/0.60 split** — creates roughly equal thirds across typical conversational audio

These thresholds should be refined based on real-world testing with the her-os user's typical speech patterns.

### 13.6 Best Practices

1. **Compound labels are descriptive, not diagnostic** — never treat "anxiety" as a clinical assessment
2. **Track trajectories, not snapshots** — a single "grief" label means less than a shift from "joy" → "sadness" → "grief" over 30 minutes
3. **Fuzzy boundaries are expected** — the same audio clip might plausibly be "excitement" or "elation." Don't overfit on precision
4. **The primary label is always the fallback** — if the compound mapping produces an implausible result, the raw emotion2vec label is still reliable
5. **Confidence gating still applies** — compound labels below 0.3 confidence are suppressed, same as primary labels

### 13.7 Flutter Emotion Wheel

A Plutchik-style `CustomPainter` wheel in the Flutter app visualizes the compound mapping in real-time:
- **8 sectors** (Joy, Trust, Fear, Surprise, Sadness, Disgust, Anger, Anticipation)
- **3 rings per sector** (inner = intense, middle = moderate, outer = mild)
- **Active segment** pulses with a glow animation when the current compound label matches
- **Center label** shows compound + confidence
- **Debug mode** shows raw AVD scores, PAD octant, and per-model outputs
- Located at `lib/widgets/emotion_wheel.dart` and `lib/screens/emotion_wheel_page.dart`

---

## Sources

### Models & Repositories
- [emotion2vec GitHub](https://github.com/ddlBoJack/emotion2vec) — ACL 2024 official code
- [emotion2vec+ large on HuggingFace](https://huggingface.co/emotion2vec/emotion2vec_plus_large)
- [SpeechBrain emotion-recognition-wav2vec2-IEMOCAP](https://huggingface.co/speechbrain/emotion-recognition-wav2vec2-IEMOCAP)
- [SpeechBrain emotion-diarization-wavlm-large](https://huggingface.co/speechbrain/emotion-diarization-wavlm-large)
- [audeering wav2vec2-large-robust VAD model](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim)
- [Wav2Small paper](https://arxiv.org/abs/2408.13920) — 72K parameter distilled model
- [SenseVoice GitHub](https://github.com/FunAudioLLM/SenseVoice) — Multilingual voice understanding
- [Microsoft CLAP](https://github.com/microsoft/CLAP) — Audio-language model
- [openSMILE](https://github.com/audeering/opensmile) — Traditional acoustic feature extraction
- [FunASR](https://github.com/modelscope/FunASR) — Alibaba speech recognition toolkit
- [Whisper SER fine-tune](https://huggingface.co/firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3)

### Benchmarks & Papers
- [EmoBox: Multilingual Multi-corpus SER Toolkit and Benchmark](https://arxiv.org/abs/2406.07162) — Interspeech 2024, 32 datasets, 14 languages
- [emotion2vec: Self-Supervised Pre-Training for SER](https://aclanthology.org/2024.findings-acl.931/) — ACL 2024 Findings
- [Speech Emotion Recognition Leveraging Whisper Representations](https://arxiv.org/html/2602.06000)
- [LoRA-adapted Whisper for SER](https://arxiv.org/abs/2509.08454) — Mechanistic interpretability
- [ParaCLAP for paralinguistic tasks](https://arxiv.org/html/2406.07203v1)
- [Indian Cross Corpus SER](https://www.iieta.org/journals/ria/paper/10.18280/ria.380318) — Hindi, Urdu, Telugu, Kannada
- [Speech Emotion Diarization](https://arxiv.org/html/2306.12991) — "Which emotion appears when?"
- [Emotion Detection: Lightweight vs Transformer Models](https://arxiv.org/html/2511.00402)

### Multimodal Fusion
- [Comprehensive Review of Multimodal Emotion Recognition](https://pmc.ncbi.nlm.nih.gov/articles/PMC12292624/) — 2025 survey
- [MemoCMT: Cross-Modal Transformer](https://www.nature.com/articles/s41598-025-89202-x)
- [Dynamic Attention Fusion](https://arxiv.org/html/2509.22729v1)
- [Multi-modal emotion recognition with prompt learning](https://www.nature.com/articles/s41598-025-89758-8)

### Paralinguistics & Vocal Biomarkers
- [Vocal Biomarkers: Integrating into Digital Health](https://pmc.ncbi.nlm.nih.gov/articles/PMC12293195/) — 2025 review
- [Kintsugi Voice: AI voice biomarker for depression](https://www.kintsugihealth.com/solutions/kintsugivoice)
- [Sonde Health: Mental fitness from voice](https://www.sondehealth.com/vocal-biomarkers-for-mental-fitness-scoring-and-tracking)
- [Hume AI: Expression Measurement](https://dev.hume.ai/docs/expression-measurement/overview) — 48 emotion dimensions
- [Hume AI Speech Prosody Model](https://www.hume.ai/products/speech-prosody-model)

### Ethics & Regulation
- [EU AI Act: Prohibited Practices (Article 5)](https://artificialintelligenceact.eu/article/5/)
- [Emotion Recognition Systems under EU AI Act](https://www.williamfry.com/knowledge/the-time-to-ai-act-is-now-a-practical-guide-to-emotion-recognition-systems-under-the-ai-act/)
- [Technical Solutions to Emotion AI Privacy Harms](https://dl.acm.org/doi/10.1145/3715275.3732074) — ACM FAccT 2025
- [Ethical Considerations in Emotion Recognition Research](https://www.mdpi.com/2813-9844/7/2/43)
- [Cross-Cultural Bias in Emotion Recognition](https://arxiv.org/pdf/2510.13557)

### Indian Languages
- [NITK-KLESC: Kannada Emotional Speech Corpus](https://ieeexplore.ieee.org/document/10482961/)
- [Kannada Emotional Speech Database on Zenodo](https://zenodo.org/records/6345107)
- [CNN-Transformer for Code-Mixed English-Hindi Emotion](https://link.springer.com/article/10.1007/s44163-025-00400-y) — 2025

### Cross-Language SER
- [Cross-Corpus Language-Independent SER](https://link.springer.com/article/10.1007/s40747-026-02227-1)
- [EmoBox Multilingual Benchmark](https://www.isca-archive.org/interspeech_2024/ma24b_interspeech.pdf)

### Compound Emotion Framework (Section 13)
- [Plutchik's Wheel of Emotions (Wikipedia)](https://en.wikipedia.org/wiki/Robert_Plutchik#Plutchik's_wheel_of_emotions)
- [Russell's Circumplex Model (1980)](https://doi.org/10.1037/h0077714) — valence × arousal 2D space
- [Mehrabian PAD Model (1996)](https://doi.org/10.1111/j.1467-6494.1996.tb00544.x) — Pleasure-Arousal-Dominance 3D
- [PAD Emotional State Model (Wikipedia)](https://en.wikipedia.org/wiki/PAD_emotional_state_model)

### 2025-2026 New Models (Section 9)
- [SenseVoice GitHub](https://github.com/FunAudioLLM/SenseVoice) — Combined ASR + SER + LID
- [SenseVoiceSmall HuggingFace](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
- [Emotion-LLaMAv2 (arXiv)](https://arxiv.org/abs/2601.16449) — Multimodal emotion (Jan 2026)
- [Emotion-LLaMA GitHub](https://github.com/ZebangCheng/Emotion-LLaMA)
- [XEUS HuggingFace](https://huggingface.co/espnet/xeus) — 4000+ language embeddings
- [GLM-ASR-Nano-2512 HuggingFace](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) — Emotional speech ASR
- [ASLP-lab/Emotion2Vec-S HuggingFace](https://huggingface.co/ASLP-lab/Emotion2Vec-S) — Self-supervised variant
- [Fun-ASR-Nano-2512 HuggingFace](https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512)
- [FunASR PyPI](https://pypi.org/project/funasr/) — v1.2.6 (Mar 2025)
- [Interspeech 2025 SER Challenge](https://arxiv.org/html/2506.02088) — WavLM, Whisper, XEUS comparison
- [pyannote speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)
