# Research: Gemma 4 E2B/E4B Audio Capabilities

**Date:** 2026-04-06
**Status:** Complete
**Decision:** See verdict at bottom

---

## 1. Model Overview

Google released Gemma 4 on April 2, 2026 under Apache 2.0. The family has four sizes:

| Model | Params (effective) | Params (total) | Audio | Vision | Context |
|-------|-------------------|----------------|-------|--------|---------|
| **E2B** | 2.3B | 5.1B | YES | YES | 128K |
| **E4B** | 4.5B | 7.9B | YES | YES | 128K |
| 26B A4B (MoE) | 4B (active) | 26B | NO | YES | 256K |
| 31B Dense | 31B | 31B | NO | YES | 256K |

**Critical fact:** Audio input is exclusive to E2B and E4B. The larger models (26B, 31B) have NO audio encoder at all.

---

## 2. Audio Input Capabilities

### What They CAN Do

1. **Automatic Speech Recognition (ASR)** — Transcribe audio to text in the original language
2. **Automatic Speech Translation (AST)** — Transcribe + translate spoken audio into another language
3. **Audio Question Answering** — Answer questions about audio content ("What is the speaker talking about?", "Can you describe this audio in detail?")
4. **Audio-based Reasoning** — Understand context and content of speech beyond pure transcription
5. **Multimodal Function Calling with Audio** — Use audio input to trigger tool calls

### What They CANNOT Do

1. **NO Text-to-Speech (TTS)** — Output is text only. No audio generation whatsoever.
2. **NO Speaker Identification/Diarization** — Cannot identify who is speaking or separate speakers
3. **NO Emotion Detection** — Not trained for paralinguistic features (tone, emotion, stress)
4. **NO Word-Level Timestamps** — Unlike Whisper, cannot produce per-word timing
5. **NO Music/Non-Speech Understanding** — Explicitly NOT trained on music or non-speech audio
6. **NO Real-Time Streaming** — Batch processing only; no streaming audio input
7. **NO Long Audio** — Hard 30-second maximum per audio input

---

## 3. Audio Encoder Architecture

### Conformer Design

- **Type:** USM-style Conformer (same base architecture as Gemma 3n)
- **Parameters:** ~305M (compressed from 681M in Gemma 3n — 55% reduction)
- **Frame Duration:** 40ms (improved from 160ms in Gemma 3n — 4x more responsive)
- **Token Cost:** 25 tokens per second of audio (vs 6.25 tokens/sec in Gemma 3n)
- **Max Audio:** 30 seconds = 750 tokens maximum

### Audio Processing Pipeline

1. **Feature Extraction:** Raw audio → mel-spectrograms (time x frequency)
2. **Chunking:** Mel features grouped into chunks as tokenization points
3. **Downsampling:** Two 2D convolutional layers compress chunks into "soft tokens"
4. **Conformer Processing:** Conformer (Transformer + convolutional module) processes soft tokens
5. **Linear Projection:** Output projected to match Gemma 4's embedding dimensionality
6. **Per-Layer Embeddings (PLE):** Computed before soft tokens merge into sequence
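The frame and token figures above are mutually consistent, and the arithmetic is worth making explicit. A quick sketch (assuming a standard 10 ms mel hop, which the model card does not state) shows how the 4x convolutional downsampling yields the quoted 25 tokens/sec and 750-token cap:

```python
import math

# Assumed 10 ms mel hop (typical for speech frontends; NOT stated in the model card).
MEL_HOP_S = 0.010
CONV_DOWNSAMPLE = 4          # two stride-2 conv layers -> 4x temporal reduction
MAX_AUDIO_S = 30.0           # hard per-clip limit

frames_per_s = 1 / MEL_HOP_S                   # 100 mel frames per second
tokens_per_s = frames_per_s / CONV_DOWNSAMPLE  # 25 soft tokens per second
token_duration_ms = 1000 / tokens_per_s        # 40 ms per soft token

def audio_token_cost(duration_s: float) -> int:
    """Soft-token budget for one clip, capped at the 30-second limit."""
    return math.ceil(min(duration_s, MAX_AUDIO_S) * tokens_per_s)

print(tokens_per_s, token_duration_ms, audio_token_cost(30.0))  # 25.0 40.0 750
```

Anything past 30 seconds is simply truncated from the budget, which is why long recordings must be chunked client-side.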

### Audio Input Requirements

- **Sample Rate:** 16 kHz
- **Frame Size:** 32 ms
- **Bit Depth:** 32-bit float, normalized to [-1, 1]
- **Channels:** Mono only (multi-channel must be downmixed)
- **Formats:** WAV, MP3 (framework-dependent)
- **Best practice:** Place audio content BEFORE text in prompts
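A minimal preprocessing sketch that satisfies these requirements, using only NumPy. Resampling to 16 kHz is left to a dedicated library (e.g. `librosa.resample` or `torchaudio.functional.resample`); the synthetic stereo input here is purely illustrative:

```python
import numpy as np

TARGET_SR = 16_000  # required sample rate

def prepare_audio(samples: np.ndarray) -> np.ndarray:
    """Downmix to mono and normalize to float32 in [-1, 1].

    `samples` is assumed to already be at 16 kHz; resample first if not.
    """
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:                 # (num_samples, channels) -> mono downmix
        x = x.mean(axis=1)
    peak = np.abs(x).max()
    if peak > 1.0:                  # e.g. raw int16 PCM values passed in
        x = x / peak
    return x.astype(np.float32)

# Illustrative input: 1 s of stereo audio in int16 range.
t = np.linspace(0, 1, TARGET_SR, endpoint=False)
stereo = np.stack([32000 * np.sin(2 * np.pi * 440 * t)] * 2, axis=1)
mono = prepare_audio(stereo)
assert mono.ndim == 1 and mono.dtype == np.float32
assert np.abs(mono).max() <= 1.0
```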

---

## 4. Benchmarks

### Official Benchmarks (from model card)

| Benchmark | E4B | E2B | What It Measures |
|-----------|-----|-----|-----------------|
| **FLEURS** | 0.08 (8% WER) | 0.09 (9% WER) | Multilingual ASR word error rate (lower = better) |
| **CoVoST** | 35.54 | 33.47 | Speech translation quality (BLEU, higher = better) |

### Community Real-World Testing (MacBook Pro M4 Pro, Ollama)

**English ASR:**
- E4B: Perfect transcription, every word correct with punctuation (1.0s)
- E2B: Garbled — missing words, no punctuation (2.8s)

**French ASR:**
- E4B: Perfect transcription with all French accents correct (1.6s)
- E2B: Fragmented, missing most of the sentence (4.1s)

**Arabic ASR:**
- E4B: Perfect Arabic transcription, every word correct (6.0s)
- E2B: Garbled — wrong words, disordered (6.0s)

### Performance Metrics (Community Testing)

| Model | Inference Speed | Memory | Audio Quality |
|-------|----------------|--------|---------------|
| E4B | 57 tok/s | 5.6 GB | Excellent |
| E2B | 95 tok/s | 3.6 GB | **Poor** |

**CRITICAL FINDING: E2B audio quality is dramatically worse than E4B across all tested languages.** E2B excels at text/vision tasks but its audio transcription is garbled. E4B is the minimum viable model for audio.

---

## 5. Language Support for Audio

### Pre-training Coverage
- Pre-trained on **140+ languages**
- Out-of-the-box support for **35+ languages**

### Confirmed Indian Languages in Training Data
The 140+ language list includes: **Hindi, Kannada, Tamil, Telugu, Bengali, Gujarati, Malayalam, Marathi, Punjabi, Urdu, Assamese**

### Audio-Specific Language Quality
- **No language-specific ASR benchmarks published** — FLEURS/CoVoST are aggregate scores
- **No code-mixed/code-switching testing** — No data on Hinglish (Hindi-English) or Kannada-English
- **No Indian language audio benchmarks** — Community testing only covered English, French, Arabic
- The USM (Universal Speech Model) lineage is promising: Google's USM was trained on 300+ languages, including many Indian languages, so decent coverage is plausible but unverified

### Verdict on Indian Language Audio
Unknown quality. The USM heritage is promising, but there are zero published benchmarks for Kannada, Hindi, or Tamil ASR quality on these models. Would need to benchmark ourselves.

---

## 6. On-Device Deployment

### Pixel 9a (8 GB RAM)

| Framework | E2B Possible? | Audio? | Memory |
|-----------|--------------|--------|--------|
| **LiteRT-LM** | YES (<1.5 GB) | **NO** — text-only in current release | 1.7 GB (Android) |
| **Android AICore** | YES (via Gemini Nano) | Unknown — not documented | System-managed |
| **AI Edge Gallery** | YES (Play Store) | Unknown | — |
| **Ollama (via Termux)** | Tight but possible | Audio works in Ollama | 3.6 GB (Q4_K_M) |

**CRITICAL: LiteRT-LM (the official Pixel deployment path) does NOT currently support audio input for Gemma 4 E2B.** The HuggingFace model card for `litert-community/gemma-4-E2B-it-litert-lm` explicitly shows text-only, with vision and audio noted as "loaded as needed" but not included. This is likely a future capability.

### NVIDIA Jetson Orin Nano (8 GB)

- Supported: E2B and E4B via llama.cpp and vLLM
- GGUF format: Q8_0 = 5.0 GB for E2B
- TensorRT-LLM optimization available
- Audio support depends on framework (see framework matrix below)

### DGX Spark (128 GB)

- All Gemma 4 models run comfortably
- vLLM recommended for high-throughput serving
- Audio fully supported via vLLM with `vllm[audio]` extras

### Framework Audio Support Matrix

| Framework | E2B Audio | E4B Audio | Status |
|-----------|----------|----------|--------|
| **HuggingFace Transformers** | YES | YES | Full support, reference implementation |
| **vLLM** | YES | YES | Full support, OpenAI-compatible API |
| **Ollama** | YES | YES | Works but E2B quality poor |
| **llama.cpp** | **BROKEN** | **BROKEN** | Issue #21325 — audio encoder detected but eval fails |
| **LiteRT-LM** | **NO** | **NO** | Text-only in current release |
| **MLX** | YES | YES | Full multimodal support |
| **mistral.rs** | YES | YES | Full multimodal support |
| **LM Studio** | **NO** | **NO** | Audio not yet supported |

---

## 7. API / How to Send Audio

### Via HuggingFace Transformers (Reference Implementation)

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-E4B-it"  # Use E4B, not E2B for audio!
model = AutoModelForMultimodalLM.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe the following speech segment in its original language."},
    ]
}]

inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True,
    return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```

### Via vLLM OpenAI-Compatible API

```bash
# Install audio extras
uv pip install "vllm[audio]"

# Start server
vllm serve google/gemma-4-E4B-it \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=4,audio=1
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.wav"}},
            {"type": "text", "text": "Transcribe this audio."}
        ]
    }],
    max_tokens=512
)
```

### Via Ollama

```bash
ollama run gemma4:e4b
# Audio input supported but no documented CLI example for audio files
# Works via OpenAI-compatible endpoint
```

### Recommended Prompt Templates

**ASR:**
```
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
Follow these specific instructions:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits.
```

**AST (Translation):**
```
Transcribe the following speech segment in {SOURCE_LANGUAGE},
then translate it into {TARGET_LANGUAGE}.
First output the transcription, then one newline,
then output '{TARGET_LANGUAGE}: ' followed by the translation.
```
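These templates are easy to parameterize for programmatic use. A small helper (the function and constant names are my own, not from any Gemma SDK):

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {lang} into {lang} text.\n"
    "Follow these specific instructions:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits."
)

AST_TEMPLATE = (
    "Transcribe the following speech segment in {src}, "
    "then translate it into {tgt}.\n"
    "First output the transcription, then one newline, "
    "then output '{tgt}: ' followed by the translation."
)

def asr_prompt(language: str) -> str:
    """Fill the ASR template for one language."""
    return ASR_TEMPLATE.format(lang=language)

def ast_prompt(source: str, target: str) -> str:
    """Fill the speech-translation template for a language pair."""
    return AST_TEMPLATE.format(src=source, tgt=target)

print(asr_prompt("Kannada").splitlines()[0])
# Transcribe the following speech segment in Kannada into Kannada text.
```

The filled prompt goes in the `text` content part, after the audio part, exactly as in the Transformers and vLLM examples above.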

---

## 8. Comparison with Dedicated STT Models

| Feature | Gemma 4 E4B | Whisper Large-v3 | IndicConformerASR | Nemotron Speech |
|---------|-------------|------------------|-------------------|-----------------|
| **Parameters** | 4.5B effective | 1.5B | 600M | ~1B |
| **Primary Task** | Multimodal LLM (ASR is one capability) | ASR specialist | Indian-lang ASR specialist | ASR specialist |
| **WER (FLEURS)** | 8% (aggregate) | ~5-7% (English) | Best for Indian langs | Excellent |
| **Max Audio** | 30 seconds | Unlimited (chunked) | Unlimited | Unlimited |
| **Streaming** | NO | YES (with chunking) | YES | YES |
| **Word Timestamps** | NO | YES | YES | YES |
| **Speaker Diarization** | NO | NO (needs pyannote) | NO | NO |
| **Indian Languages** | In training data, quality unknown | Decent for Hindi, weak for Kannada | 22 Indian langs, BEST quality | Limited Indian |
| **Code-Mixed** | Unknown | Weak | Supported | Unknown |
| **Emotion Detection** | NO | NO | NO | NO |
| **On-Device (phone)** | YES (<4 GB) | Large (3+ GB) | YES (~3 GB) | NO |
| **Additional Capabilities** | Vision, reasoning, tool calling, translation | Transcription only | Transcription only | Transcription only |
| **Memory on Titan** | ~5.6 GB (E4B) | ~3 GB | ~3 GB | ~4 GB |

### Key Architectural Difference
Whisper and IndicConformerASR are **dedicated ASR models** — they do one thing well. Gemma 4 E4B is a **multimodal LLM** that happens to include ASR as one capability among many. The trade-off: Gemma 4 can reason about audio content and combine it with text/vision, but cannot match specialized models on ASR-specific features (timestamps, streaming, long audio).

---

## 9. Limitations Summary

### Hard Limits
1. **30-second max audio** — Cannot process longer clips
2. **Text-only output** — No TTS, no audio generation
3. **1 audio clip per prompt** (vLLM default; may be configurable)
4. **Batch only** — No streaming audio input
5. **Mono 16kHz only** — Must downmix and resample
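The 30-second cap means longer recordings have to be split client-side, with each chunk sent as a separate prompt. A minimal splitter might look like the sketch below; it cuts at fixed boundaries, whereas a real pipeline would cut on silence (e.g. with a VAD) to avoid slicing through words:

```python
import numpy as np

SR = 16_000            # required sample rate
MAX_CHUNK_S = 30       # hard per-prompt audio limit

def split_audio(samples: np.ndarray, sr: int = SR,
                max_chunk_s: int = MAX_CHUNK_S) -> list[np.ndarray]:
    """Split a mono waveform into consecutive chunks of at most 30 s each."""
    step = sr * max_chunk_s
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 95 s of audio -> three full 30 s chunks plus a 5 s remainder.
clip = np.zeros(95 * SR, dtype=np.float32)
chunks = split_audio(clip)
assert len(chunks) == 4
assert all(len(c) <= MAX_CHUNK_S * SR for c in chunks)
```

Note that each chunk is transcribed without context from its neighbors, so words straddling a boundary can be lost or duplicated; this is acceptable for a fallback path but not for the primary STT loop.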

### Quality Concerns
1. **E2B audio quality is poor** — Community testing shows garbled output across languages
2. **E4B is minimum viable** — But E4B has roughly 2x the effective parameters and needs ~55% more memory (5.6 GB vs 3.6 GB)
3. **No word-level timestamps** — Cannot align words to audio frames
4. **No speaker separation** — Cannot handle multi-speaker scenarios
5. **Music/non-speech not trained** — Only speech audio works

### Deployment Concerns
1. **llama.cpp audio broken** (issue #21325) — Major edge deployment blocker
2. **LiteRT-LM has no audio** — The official Pixel path is text-only currently
3. **LM Studio no audio** — Another popular local tool lacking support
4. **Ollama audio works but E2B quality is bad** — Only E4B produces usable results

### Missing for Annie's Use Cases
1. No real-time streaming (needed for voice conversation)
2. No speaker identification (needed to identify Rajesh vs others)
3. No emotion detection (needed for Dimension 6 - Emotional Awareness)
4. No diarization (needed for multi-speaker contexts)
5. No code-mixed support verified (Rajesh speaks English-Kannada mix)
6. 30-second limit too short for continuous conversation

---

## 10. Verdict for her-os / Annie

### Can Gemma 4 E2B/E4B Replace Our Current STT Pipeline?

**NO.** Not for the primary voice conversation loop.

**Reasons:**
1. **30-second hard limit** — Voice conversations are continuous, not 30-second clips
2. **No streaming** — Annie needs real-time STT for responsive conversation
3. **No timestamps** — Needed for alignment and diarization
4. **No speaker ID** — Annie needs to know who is speaking
5. **E2B quality is terrible for audio** — Only E4B works, and it's slower
6. **LiteRT (Pixel path) has no audio** — Cannot run on Annie's phone for STT

### Where Gemma 4 E4B Audio COULD Be Useful

1. **Supplementary audio understanding** — "What was the speaker talking about?" post-hoc analysis
2. **Speech translation** — Translate Kannada audio to English text (if quality is good)
3. **Audio QA on recordings** — Answer questions about saved audio clips
4. **Offline emergency ASR** — When network is down, use as fallback (30-sec chunks)
5. **On-device privacy** — Quick audio understanding without sending to cloud

### Recommended Architecture (No Change)

Keep the current pipeline:
- **Primary STT:** WhisperX (Titan) / IndicConformerASR (Panda for Indian langs)
- **Diarization:** pyannote
- **Emotion:** emotion2vec+
- **Voice Agent STT:** Custom Whisper STT (Blackwell aarch64)

### Future Watch

- **LiteRT-LM audio support** — When this ships, E4B on Pixel becomes interesting for quick on-device transcription
- **llama.cpp fix** (issue #21325) — Would enable GGUF-based edge deployment with audio
- **Community Indian language benchmarks** — If someone publishes Kannada/Hindi WER, reassess
- **Gemma 5 / future edge models** — The 40ms frame duration and 305M encoder show active compression research; future versions may fix the streaming/duration limits

---

## Sources

- [Audio understanding | Gemma | Google AI for Developers](https://ai.google.dev/gemma/docs/capabilities/audio)
- [Gemma 4 model card | Google AI for Developers](https://ai.google.dev/gemma/docs/core/model_card_4)
- [Gemma 4: Byte for byte, the most capable open models (Google Blog)](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/)
- [Welcome Gemma 4: Frontier multimodal intelligence on device (HuggingFace Blog)](https://huggingface.co/blog/gemma4)
- [google/gemma-4-E2B-it (HuggingFace Model Card)](https://huggingface.co/google/gemma-4-E2B-it)
- [google/gemma-4-E4B-it (HuggingFace Model Card)](https://huggingface.co/google/gemma-4-E4B-it)
- [Bringing AI Closer to the Edge with Gemma 4 (NVIDIA Blog)](https://developer.nvidia.com/blog/bringing-ai-closer-to-the-edge-and-on-device-with-gemma-4/)
- [Gemma 4 E2B | Jetson AI Lab](https://www.jetson-ai-lab.com/models/gemma4-e2b/)
- [Gemma 4 Usage Guide (vLLM Recipes)](https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html)
- [Ollama gemma4:e2b](https://ollama.com/library/gemma4:e2b)
- [A Visual Guide to Gemma 4 (Maarten Grootendorst)](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4)
- [llama.cpp Issue #21325 — Gemma 4 audio support missing](https://github.com/ggml-org/llama.cpp/issues/21325)
- [litert-community/gemma-4-E2B-it-litert-lm (HuggingFace)](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm)
- [Android AICore Developer Preview (Android Blog)](https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html)
- [Gemma 4 on Arm (ARM Blog)](https://newsroom.arm.com/blog/gemma-4-on-arm-optimized-on-device-ai)
- [I Tested Every Gemma 4 Model Locally (DEV Community)](https://dev.to/akartit/i-tested-every-gemma-4-model-locally-on-my-macbook-what-actually-works-3g2o)
- [Google Developers Blog — Gemma 4 Agentic Skills](https://developers.googleblog.com/bring-state-of-the-art-agentic-skills-to-the-edge-with-gemma-4/)
- [Gemma 4 Deep Dive — DeepMind](https://deepmind.google/models/gemma/gemma-4/)
- [Realistic path to offline Gemini Live (Google AI Forum)](https://discuss.ai.google.dev/t/realistic-path-to-a-fully-local-offline-gemini-live-using-gemma-4-e2b-e4b-on-device/138013)
