# Research: MiniMax M2.7 — Multilingual LLM for her-os

**Date:** 2026-03-19
**Status:** Research complete
**Decision:** Modular adoption — keep Qwen3.5 for voice, cherry-pick MiniMax Speech 2.6 for TTS, use Qwen-MT for translation

---

## 1. Executive Summary

MiniMax released **M2.7** on March 18, 2026 — a proprietary, self-evolving LLM with strong agentic capabilities. However, **M2.7 is API-only and not open source**. The open-source alternative is **M2.5** (230B MoE, 10B active parameters), which *can* run on DGX Spark but barely fits and is slower than our current Qwen3.5 9B.

**The real opportunity isn't the LLM — it's MiniMax Speech 2.6** (TTS with voice cloning, 40+ languages, <250ms latency) and the broader **modular multilingual architecture** that MiniMax's product line demonstrates.

### Verdict

| Component | Recommendation | Why |
|-----------|---------------|-----|
| **MiniMax M2.7 (LLM)** | Skip | API-only, slower than Qwen3.5 for voice, not translation-focused |
| **MiniMax M2.5 (LLM)** | Skip | 101 GB at 3-bit, 20-26 tok/s — slower and tighter than Qwen3.5 |
| **MiniMax Speech 2.6 (TTS)** | Evaluate | Pipecat integration exists, voice cloning, emotion control, 40+ langs |
| **Translation layer** | Build modularly | Soniox STT (Kannada) + Qwen-MT (92-lang translation) |

---

## 2. MiniMax Model Family

### 2.1 MiniMax M2.7 (Proprietary, March 2026)

| Spec | Value |
|------|-------|
| **Architecture** | Proprietary, self-evolving RL |
| **Parameters** | Undisclosed |
| **Context window** | 204,800 input tokens (~200K); max output 131K |
| **Open source** | No — API-only |
| **License** | Proprietary |
| **API pricing** | $0.30/1M input, $1.20/1M output |
| **Throughput** | ~48 tok/s (via API) |
| **TTFT** | 3.55s |

**Key benchmarks:**
- SWE-Pro: 56.22% (near Claude Opus 3.5)
- Terminal Bench 2: 57.0%
- Skill adherence: 97% on 40+ complex skills (>2000 tokens)

**Novel feature:** "Self-evolution" — the model participates in 30-50% of its own RL training workflow. This is architecturally interesting but not relevant to our use case.

**Kannada support:** Not explicitly mentioned in any documentation.

### 2.2 MiniMax M2.5 (Open Source, February 2026)

| Spec | Value |
|------|-------|
| **Architecture** | MoE + Lightning Attention |
| **Parameters** | 230B total / 10B active per token |
| **Experts** | 8 active per token |
| **Context window** | 200K tokens |
| **Open source** | Yes |
| **License** | Modified MIT |
| **HuggingFace** | `MiniMaxAI/MiniMax-M2.5` |

**Key benchmarks:**
- SWE-Bench Verified: 80.2%
- GPQA Diamond: 85.2%
- AIME 2025: 86.3%
- HumanEval: 89.6%

**Quantization options:**

| Format | Size | Quality | Speed (DGX Spark) |
|--------|------|---------|-------------------|
| BF16 (full) | 457 GB | Best | N/A (doesn't fit) |
| GGUF Q6_K_XL | ~200 GB | Very good | N/A (doesn't fit) |
| GGUF Q3_K_XL (Unsloth Dynamic) | 101 GB | Good | 20-26 tok/s |
| GGUF 2-bit | ~83 GB | Degraded | ~15-20 tok/s |
| AWQ 4-bit (Marlin) | ~130 GB | Good | 741 tok/s (H100 only) |

### 2.3 MiniMax-Text-01 (Open Source, January 2025)

| Spec | Value |
|------|-------|
| **Architecture** | Hybrid Lightning Attention + Softmax + MoE |
| **Parameters** | 456B total / 45.9B active per token |
| **Context** | 1M training / 4M inference |
| **License** | Apache 2.0 |

**Innovation:** First production-grade linear attention mechanism at scale. Lightning Attention has O(d^2 n) complexity vs O(n^2 d) for standard attention — enables the 4M token context window.

**For her-os:** Too large (456B). Overkill for voice. Interesting for long-context research tasks but not our priority.

---

## 3. DGX Spark Fit Analysis

### 3.1 Current VRAM Budget

| Service | VRAM |
|---------|------|
| Qwen3.5 9B (vLLM NVFP4) | ~18 GB |
| Ollama extraction models | ~16 GB |
| WhisperX STT | ~2 GB |
| Kokoro TTS | ~1 GB |
| Other services | ~5 GB |
| **Total used** | **~42 GB** |
| **Available** | **~86 GB** |

### 3.2 MiniMax M2.5 Fit Scenarios

**Scenario A: Replace Qwen3.5**
- Retire Qwen3.5 (-18 GB): 24 GB used
- Add M2.5 Q3_K_XL (+101 GB): **125 GB total**
- Headroom: 3 GB (dangerously tight)
- Throughput: 20-26 tok/s (vs Qwen3.5's 33 tok/s — **roughly 20-40% slower**)

**Scenario B: Replace Ollama**
- Retire Ollama (-16 GB): 26 GB used
- Add M2.5 Q3_K_XL (+101 GB): **127 GB total**
- Headroom: 1 GB (unusable in practice)

**Scenario C: Run alongside everything (2-bit)**
- Add M2.5 2-bit (+83 GB): **125 GB total**
- Headroom: 3 GB
- Quality: Significantly degraded at 2-bit

**Verdict:** M2.5 *barely fits* on DGX Spark. It would require retiring an existing model and delivers slower throughput than Qwen3.5. Not worth the trade.
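The scenario arithmetic above can be captured in a small sketch. The 128 GB budget and all footprints come from this document's tables (service VRAM in §3.1, quant sizes in §2.2):

```python
# VRAM-fit arithmetic for the scenarios above. Budget and footprints (GB)
# are taken from this document's tables, not measured.
DGX_SPARK_GB = 128

services = {
    "qwen3.5-9b": 18, "ollama": 16, "whisperx": 2, "kokoro": 1, "other": 5,
}

def headroom(add_gb: float, retire: list[str] = []) -> float:
    """Remaining memory after retiring services and adding a new model."""
    used = sum(v for k, v in services.items() if k not in retire)
    return DGX_SPARK_GB - used - add_gb

# Scenario A: retire Qwen3.5, add M2.5 Q3_K_XL (101 GB) -> 3 GB headroom
print(headroom(101, retire=["qwen3.5-9b"]))
# Scenario B: retire Ollama instead -> 1 GB headroom
print(headroom(101, retire=["ollama"]))
# Scenario C: keep everything, add the 2-bit quant (83 GB) -> 3 GB headroom
print(headroom(83))
```

Any plausible scenario leaves ≤3 GB of headroom, which is what makes the trade unattractive regardless of which service gets retired.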

### 3.3 Serving Stack

| Engine | DGX Spark Status |
|--------|-----------------|
| **llama.cpp** | Working. Proven by community (see [re-cinq/minimax-m2.5-nvidia-dgx](https://github.com/re-cinq/minimax-m2.5-nvidia-dgx)). Build with `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"`. |
| **vLLM** | Broken. CUDA 13 wheel issues, SM_121 kernel gaps, FP4 Marlin bugs ([#36821](https://github.com/vllm-project/vllm/issues/36821), [#35519](https://github.com/vllm-project/vllm/issues/35519)). |
| **SGLang** | Same aarch64/Blackwell issues as vLLM. |

---

## 4. Multilingual & Translation Analysis

### 4.1 MiniMax's Language Support

MiniMax models are **not translation-focused**. Their strength is coding, agentic tasks, and reasoning. Language support details:

- M2.7: No explicit language list published. General multilingual implied.
- M2.5: 10+ programming languages. Natural language list not specified.
- **Kannada: Not mentioned in any MiniMax documentation.**

### 4.2 Better Translation Alternatives

| Model/Service | Languages | Kannada | Strength | Integration |
|---------------|-----------|---------|----------|-------------|
| **Qwen-MT** | 92 | Yes (Indian langs supported) | Purpose-built translation, $0.5/M tokens | API call |
| **NLLB-200 (Meta)** | 200 | Yes (70% accuracy gain on Indian langs) | Best open-source translation model | Self-hosted, translation-only |
| **SeamlessM4T v2 (Meta)** | 100+ | Likely | Speech-to-speech translation | Self-hosted, adds ~1-2s latency |
| **Soniox** | 50+ | Yes (auto-detect mid-sentence) | Real-time STT with language detection | API |
| **Google Translate API** | 130+ | Yes | Battle-tested, fast | API |

**Key insight:** Translation is a solved problem. Purpose-built models (NLLB-200, Qwen-MT) massively outperform general-purpose LLMs at translation. The right approach is modular: use a dedicated translation service, not a do-everything LLM.
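The modular shape can be sketched as a thin interface the rest of the pipeline depends on, with the actual backend (Qwen-MT API, self-hosted NLLB-200, Google Translate) swappable behind it. All names below are hypothetical; the stub stands in for a real vendor client:

```python
# Sketch of a pluggable translation layer (all names hypothetical).
# The pipeline depends on this interface, not on any one vendor's API,
# so Qwen-MT can be swapped for NLLB-200 without touching callers.
from typing import Protocol

class Translator(Protocol):
    def translate(self, text: str, src: str, tgt: str) -> str: ...

class EchoTranslator:
    """Stand-in backend for tests; a real one would wrap Qwen-MT or NLLB-200."""
    def translate(self, text: str, src: str, tgt: str) -> str:
        return f"[{src}->{tgt}] {text}"

def translate_segment(backend: Translator, segment: dict) -> dict:
    """Attach an English translation to a Kannada transcript segment."""
    return {
        **segment,
        "translated_en": backend.translate(segment["original_kn"], "kn", "en"),
    }

seg = translate_segment(EchoTranslator(), {"original_kn": "ನಮಸ್ಕಾರ", "timestamp": 0.0})
```

Swapping backends then becomes a one-line change at the call site, which also keeps the vendor lock-in risk (§9) contained.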

### 4.3 Kannada-Specific Considerations

- Rajesh speaks English to Annie; Kannada is ambient-only (from memory: `user_annie_language.md`)
- Kannada-English code-switching detection achieves 98% accuracy with transformer models
- Soniox auto-detects language changes mid-sentence — ideal for ambient capture
- Whisper's Kannada performance is poor — a dedicated Kannada STT would be needed for ambient transcription

---

## 5. MiniMax Speech 2.6 — The Real Opportunity

MiniMax's **TTS model** is more interesting for Annie than their LLM.

### 5.1 Specs

| Feature | Value |
|---------|-------|
| **Latency** | <250ms (turbo variant) |
| **Languages** | 40+ |
| **Voice cloning** | 10-second sample |
| **Emotion control** | Built-in (friendly, serious, excited, etc.) |
| **Models** | speech-2.6-hd, speech-2.6-turbo |
| **Pricing** | ~$0.01/1K chars |

### 5.2 Pipecat Integration (Already Exists!)

```python
from pipecat.services.minimax import MiniMaxTTSService

tts = MiniMaxTTSService(
    api_key="your-key",        # MiniMax platform API key
    model="speech-2.6-turbo",  # low-latency variant (<250ms)
    emotion="friendly"         # built-in emotion control
)
```

This is a **drop-in replacement** for Kokoro TTS in the Annie pipeline.

### 5.3 Comparison with Current Kokoro

| Aspect | Kokoro (current) | MiniMax Speech 2.6 |
|--------|-----------------|-------------------|
| **Latency** | ~30ms (GPU, local) | <250ms (API) |
| **Languages** | English, Japanese | 40+ including Indian languages |
| **Voice cloning** | Limited | 10-second sample |
| **Emotion** | Basic | Rich control |
| **Cost** | Free (local GPU) | ~$0.01/1K chars |
| **Privacy** | Local (on-device) | Cloud API |
| **VRAM** | ~1 GB | 0 GB (API) |

**Trade-off:** Kokoro is faster and fully local (privacy). MiniMax Speech 2.6 adds multilingual + voice cloning + emotion but requires cloud API calls.

### 5.4 Potential Use Cases for Annie

1. **Multilingual voice output**: Annie responds in Kannada when ambient conversation is in Kannada
2. **Voice cloning**: Annie speaks in a consistent voice across languages (clone Rajesh's voice)
3. **Emotional TTS**: Annie's voice conveys empathy, excitement, concern based on conversation context
4. **Language learning**: Annie translates and speaks back in target language

---

## 6. What We Can Learn from MiniMax's Approach

### 6.1 Lightning Attention (Architectural Innovation)

MiniMax pioneered **Lightning Attention** — a linear-complexity attention mechanism:
- Standard attention: O(n^2 d) — quadratic in sequence length
- Lightning Attention: O(d^2 n) — linear in sequence length
- Enables 4M token context windows (vs typical 128K-200K)

**Relevance for her-os:** Not directly applicable (we use off-the-shelf models), but validates that long-context is becoming commodity. Future models we adopt may use similar techniques.
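The complexity claim rests on associativity: contracting `phi(K)` with `V` first gives a d×d matrix and an O(d^2 n) pass, while materializing the n×n attention matrix costs O(n^2 d) — yet both orders produce identical output. A minimal numpy illustration (not MiniMax's actual kernel; phi = elu+1 is one common feature map, and this omits the causal masking and normalization a real implementation needs):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8                      # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

def phi(x):
    # Simple positive feature map (elu(x) + 1), common in linear-attention work
    return np.where(x > 0, x + 1.0, np.exp(x))

# Quadratic order: build the n x n attention matrix first -> O(n^2 d)
quadratic = (phi(Q) @ phi(K).T) @ V
# Linear order: contract K with V first (d x d intermediate) -> O(d^2 n)
linear = phi(Q) @ (phi(K).T @ V)

assert np.allclose(quadratic, linear)
```

Since d is fixed and small relative to million-token sequences, the linear ordering is what makes 4M-token contexts tractable.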

### 6.2 MoE at Scale (10B Active / 230B Total)

MiniMax proves that **MoE with small active parameters** can match dense models:
- M2.5: 10B active / 230B total — competitive with 70B dense models
- Only 8 experts active per token — massive efficiency gain

**Relevance for her-os:** When selecting future models, prefer MoE architectures — they give better quality per VRAM GB. Our current Qwen3.5 9B is dense; future Qwen MoE variants could be interesting.

### 6.3 Self-Evolution (M2.7)

M2.7's self-evolution capability (model participates in its own RL training) is novel but:
- Proprietary — can't learn implementation details
- Requires massive compute for training loop
- Not relevant for inference-only deployments like Annie

### 6.4 Modular Audio Pipeline

MiniMax's product line demonstrates the right architecture:
- **Separate text model** (M2.7) for reasoning
- **Separate speech model** (Speech 2.6) for synthesis
- **Separate vision model** (VL-01) for image understanding
- Each model optimized for its domain

**This validates our current approach:** WhisperX (STT) + Qwen3.5 (LLM) + Kokoro (TTS) as separate, specialized components.

---

## 7. Recommended Integration Architecture

### Phase 1: Translation Detection (Ambient Kannada)
```
Ambient audio → Whisper STT (English)
              → Soniox STT (Kannada detection + transcription)
              → Language label per segment stored in Context Engine
```

### Phase 2: Async Translation Layer
```
Kannada transcript → Qwen-MT API (async, non-blocking)
                   → Store: {original_kn: "...", translated_en: "...", timestamp: ...}
                   → Context Engine indexes both for cross-lingual search
```
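Phase 2's non-blocking behavior can be sketched with asyncio. The Qwen-MT call is stubbed (hypothetical name; a real client would be an async HTTP request) since the only design point here is that translation never blocks the voice loop:

```python
import asyncio

async def translate_kn_to_en(text: str) -> str:
    """Stub for the Qwen-MT API call (hypothetical; really an async HTTP request)."""
    await asyncio.sleep(0)        # stands in for the network round-trip
    return f"EN({text})"

async def on_kannada_segment(store: list, text: str, ts: float) -> None:
    """Translate off the voice path and store both languages for indexing."""
    translated = await translate_kn_to_en(text)
    store.append({"original_kn": text, "translated_en": translated, "timestamp": ts})

store: list = []
asyncio.run(on_kannada_segment(store, "ಹಲೋ", 12.5))
```

In the real pipeline this would be scheduled with `asyncio.create_task` from the ambient-capture handler, so the Context Engine fills in asynchronously while the voice loop stays untouched.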

### Phase 3: Multilingual TTS (Optional)
```
Qwen3.5 response (English) → Kokoro TTS (default, local, fast)
                             → MiniMax Speech 2.6 (when Kannada output needed)
                             → Language router selects based on input language
```
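The Phase 3 language router reduces to a local-first fallthrough: Kokoro for everything it supports, MiniMax Speech 2.6 only when non-English output is needed. A minimal sketch (backend names as strings; a real router would return service clients):

```python
def route_tts(text: str, lang: str) -> str:
    """Pick a TTS backend per the Phase 3 design: local-first, cloud only when needed.
    Returns a backend name; a real router would return a configured service client."""
    KOKORO_LANGS = {"en", "ja"}           # Kokoro's supported languages (see 5.3)
    if lang in KOKORO_LANGS:
        return "kokoro"                   # local, ~30ms, private
    return "minimax-speech-2.6"           # cloud API, <250ms, 40+ languages

assert route_tts("Hello", "en") == "kokoro"
assert route_tts("ನಮಸ್ಕಾರ", "kn") == "minimax-speech-2.6"
```

Keeping the default branch local preserves both the latency budget and the privacy posture; the cloud path is only taken when the input language demands it.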

### What This Enables
- Annie understands Kannada conversations (ambient capture)
- Knowledge graph stores multilingual context
- Annie can respond in Kannada when appropriate
- No change to primary voice pipeline latency

---

## 8. Cost Analysis

### API Costs (If Using MiniMax Services)

| Service | Usage Estimate | Monthly Cost |
|---------|---------------|-------------|
| M2.7 API (if used for complex tasks) | ~10M tokens/month | ~$15 |
| Speech 2.6 TTS | ~500K chars/month | ~$5 |
| Qwen-MT (translation) | ~5M tokens/month | ~$2.50 |
| Soniox STT (Kannada) | ~10 hrs/month | ~$10 |
| **Total** | | **~$32.50/month** |
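The totals follow from the per-unit prices quoted earlier in this document. Two line items are assumptions baked into the table rather than published rates: a blended ~$1.50/M-token cost for mixed M2.7 input/output traffic, and ~$1/hr for Soniox:

```python
# Monthly cost check from the per-unit prices quoted in this document.
costs = {
    "m2.7_api":   10 * 1.50,   # 10M tokens at a blended ~$1.50/M (assumption)
    "speech_2.6": 500 * 0.01,  # 500K chars at ~$0.01/1K chars
    "qwen_mt":    5 * 0.50,    # 5M tokens at $0.5/M
    "soniox":     10 * 1.00,   # 10 hrs at an assumed ~$1/hr
}
total = sum(costs.values())    # 32.5
```

At roughly $30/month for the full multilingual stack, cost is not the deciding factor — privacy and Kannada quality (§9) are.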

### Self-Hosted Alternative Costs
- NLLB-200 (translation): Free, ~2 GB VRAM
- Soniox: No self-hosted option (API only)
- MiniMax Speech: No self-hosted option (API only)

---

## 9. Risks & Concerns

1. **Privacy**: MiniMax Speech 2.6 and Qwen-MT are cloud APIs — audio/text leaves the device. Conflicts with her-os "local-first privacy" principle (CLAUDE.md).

2. **Kannada quality**: No MiniMax model explicitly validates Kannada support. Would need testing before committing.

3. **Vendor lock-in**: Voice cloning on MiniMax means the cloned voice profile is locked to their platform. Migrating away loses the voice.

4. **Latency stacking**: Adding translation + multilingual TTS adds latency. Must be async/parallel to avoid degrading voice experience.

5. **DGX Spark ceiling**: M2.5 barely fits. If future MiniMax models grow, they won't be viable for local deployment.

---

## 10. Sources

### MiniMax Official
- [MiniMax M2.7 Model Page](https://www.minimax.io/models/text/m27)
- [MiniMax M2.7 Announcement](https://www.minimax.io/news/minimax-m27-en)
- [MiniMax M2.5 Announcement](https://www.minimax.io/news/minimax-m25)
- [MiniMax Speech 2.6](https://www.minimax.io/news/minimax-speech-26)
- [MiniMax API Docs](https://platform.minimax.io/docs/api-reference/api-overview)
- [MiniMax-Text-01 Paper](https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf)

### HuggingFace
- [MiniMax-M2.5 Model Card](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
- [Unsloth M2.5 GGUF](https://huggingface.co/unsloth/MiniMax-M2.5-GGUF)
- [MiniMax-Text-01](https://huggingface.co/MiniMaxAI/MiniMax-Text-01)

### DGX Spark Deployment
- [re-cinq/minimax-m2.5-nvidia-dgx](https://github.com/re-cinq/minimax-m2.5-nvidia-dgx)
- [wshobson/minimax-dgx-spark](https://github.com/wshobson/minimax-dgx-spark)
- [Unsloth M2.5 Guide](https://unsloth.ai/docs/models/minimax-m25)
- [vLLM M2 Deployment Guide](https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html)

### vLLM/DGX Issues
- [vLLM #36821 — sm_121 kernel gaps](https://github.com/vllm-project/vllm/issues/36821)
- [vLLM #35519 — aarch64 NVFP4 crash](https://github.com/vllm-project/vllm/issues/35519)
- [vLLM #37030 — Marlin FP4 wrong output](https://github.com/vllm-project/vllm/issues/37030)

### Translation & Multilingual
- [Meta NLLB-200](https://ai.meta.com/research/no-language-left-behind/)
- [Meta SeamlessM4T v2](https://ai.meta.com/research/publications/seamlessm4t-massively-multilingual-multimodal-machine-translation/)
- [Qwen-MT (92 languages)](https://qwenlm.github.io/blog/qwen-mt/)
- [Soniox Kannada STT](https://soniox.com/speech-to-text/kannada)

### Analysis & Benchmarks
- [Artificial Analysis — M2.7](https://artificialanalysis.ai/models/minimax-m2-7)
- [VentureBeat — M2.7 Self-Evolution](https://venturebeat.com/technology/new-minimax-m2-7-proprietary-ai-model-is-self-evolving-and-can-perform-30-50)

### Pipecat Integration
- [Pipecat MiniMax TTS Docs](https://docs.pipecat.ai/server/services/tts/minimax)
