# Research: Alternative LLM Backends for her-os (Gemini, Mistral, Sarvam)

**Date:** 2026-02-25 (Gemini), 2026-02-25 (Mistral added)
**Status:** Research complete
**Context:** ADR-002 chose Claude API. This research evaluates whether alternative providers (Gemini, Mistral, Sarvam) could be cheaper/free and good enough for the her-os pipeline, including self-hosted options on DGX Spark.

---

## Table of Contents

### Part A: Google Gemini
1. [Executive Summary](#1-executive-summary)
2. [Gemini Model Lineup (February 2026)](#2-gemini-model-lineup-february-2026)
3. [Pricing Comparison: Gemini vs Claude](#3-pricing-comparison-gemini-vs-claude)
4. [Free Tier Analysis](#4-free-tier-analysis)
5. [API Capabilities](#5-api-capabilities)
6. [Entity Extraction Quality: Gemini vs Claude](#6-entity-extraction-quality-gemini-vs-claude)
7. [Python SDK](#7-python-sdk)
8. [Local/Self-Hosted Options (Gemma)](#8-localself-hosted-options-gemma)
9. [Vertex AI vs AI Studio Pricing](#9-vertex-ai-vs-ai-studio-pricing)
10. [Context Window Sizes](#10-context-window-sizes)
11. [Data Privacy Considerations](#11-data-privacy-considerations)
12. [Cost Modeling for her-os Pipeline](#12-cost-modeling-for-her-os-pipeline)
13. [Recommendation](#13-recommendation)

### Part B: Mistral AI
14. [Mistral Executive Summary](#14-mistral-executive-summary)
15. [Mistral Model Lineup (2025-2026)](#15-mistral-model-lineup-2025-2026)
16. [NVIDIA Optimizations & DGX Spark](#16-nvidia-optimizations--dgx-spark)
17. [NVFP4: Blackwell's Quantization Superpower](#17-nvfp4-blackwells-quantization-superpower)
18. [DGX Spark Fit Assessment](#18-dgx-spark-fit-assessment)
19. [Mistral NIM Containers](#19-mistral-nim-containers)
20. [Voxtral: Mistral's STT Alternative](#20-voxtral-mistrals-stt-alternative)
21. [Mistral vs Gemini vs Claude: Self-Hosted Comparison](#21-mistral-vs-gemini-vs-claude-self-hosted-comparison)
22. [Mistral Recommendation for her-os](#22-mistral-recommendation-for-her-os)

### Part C: Sarvam AI
23. [Sarvam AI Summary](#23-sarvam-ai-summary)

---

## 1. Executive Summary

Google Gemini is a **strong alternative** to Claude for her-os pipeline work, particularly for high-volume, cost-sensitive tasks. Key findings:

- **Gemini 2.5 Flash-Lite** at $0.10/$0.40 per MTok is **10x cheaper** than Claude Haiku 4.5 ($1/$5)
- **Free tier exists** but is too limited for production (5-15 RPM, 100-1,000 RPD)
- **Batch API** offers 50% discount on all paid models
- **Context caching** offers 75-90% discount on cached reads
- **Entity extraction quality** is comparable; Gemini slightly edges Claude on structured data extraction in benchmarks
- **Structured output** (JSON Schema) is native and reliable on all Gemini 2.5+ models
- **Python SDK** (`google-genai`) is GA, stable, async-capable, Pydantic-compatible
- **Gemma 3** (open weights, 1B-27B) can run locally on Titan for zero-cost inference
- **Data privacy**: Paid API does NOT use data for training. Free tier MAY use data for training.

**Bottom line**: A hybrid strategy using Gemini Flash-Lite for high-volume extraction and Claude for complex reasoning would cut total pipeline cost by roughly a third (34% in the Section 12 model); a Gemini-only pipeline saves 60%, and batch processing plus context caching push savings further.

---

## 2. Gemini Model Lineup (February 2026)

### Current Models (from ai.google.dev/gemini-api/docs/pricing)

| Model | Released | Category | Context Window | Max Output |
|-------|----------|----------|---------------|------------|
| Gemini 3.1 Pro Preview | Feb 2026 | Flagship reasoning | 1M tokens | 64K tokens |
| Gemini 3 Pro Preview | Nov 2025 | Reasoning | 1M tokens | 64K tokens |
| Gemini 3 Flash Preview | Dec 2025 | Fast + thinking | 1M tokens | 32K tokens |
| Gemini 2.5 Pro | Stable | Pro-tier reasoning | 1M tokens (2M coming) | 64K tokens |
| Gemini 2.5 Flash | Stable | Workhorse | 1M tokens | 32K tokens |
| Gemini 2.5 Flash-Lite | Feb 2026 (GA) | Cheapest, fastest | 1M tokens | 32K tokens |
| Gemini 2.0 Flash | Stable (EOL Mar 2026) | Legacy fast | 1M tokens | 8K tokens |

**Note:** Gemini 2.0 Flash and 2.0 Flash-Lite are deprecated, retiring March 3, 2026. Migrate to 2.5 Flash-Lite.

---

## 3. Pricing Comparison: Gemini vs Claude

### Full Pricing Table (per 1M tokens, USD)

#### Gemini Models (AI Studio / Developer API)

| Model | Input | Output | Cache Read | Batch Input | Batch Output |
|-------|-------|--------|------------|-------------|--------------|
| **Gemini 3.1 Pro Preview** | $2.00 (<=200K) / $4.00 (>200K) | $12.00 / $18.00 | $0.20 / $0.40 | $1.00 / $2.00 | $6.00 / $9.00 |
| **Gemini 3 Pro Preview** | $2.00 / $4.00 | $12.00 / $18.00 | $0.20 / $0.40 | $1.00 / $2.00 | $6.00 / $9.00 |
| **Gemini 3 Flash Preview** | $0.50 / $1.00 | $3.00 | $0.05 / $0.10 | $0.25 / $0.50 | $1.50 |
| **Gemini 2.5 Pro** | $1.25 / $2.50 | $10.00 / $15.00 | $0.125 / $0.25 | $0.625 / $1.25 | $5.00 / $7.50 |
| **Gemini 2.5 Flash** | $0.30 | $2.50 | $0.03 | $0.15 | $1.25 |
| **Gemini 2.5 Flash-Lite** | **$0.10** | **$0.40** | **$0.01** | **$0.05** | **$0.20** |
| Gemini 2.0 Flash (EOL) | $0.10 | $0.40 | $0.025 | $0.05 | $0.20 |

*Note: Audio input has different pricing (generally higher). Text/image prices shown above.*

#### Claude Models (Anthropic API)

| Model | Input | Output | Cache Hit | Batch Input | Batch Output |
|-------|-------|--------|-----------|-------------|--------------|
| **Claude Opus 4.6** | $5.00 | $25.00 | $0.50 | $2.50 | $12.50 |
| **Claude Opus 4.5** | $5.00 | $25.00 | $0.50 | $2.50 | $12.50 |
| **Claude Sonnet 4.6** | $3.00 | $15.00 | $0.30 | $1.50 | $7.50 |
| **Claude Sonnet 4.5** | $3.00 (<=200K) / $6.00 (>200K) | $15.00 / $22.50 | $0.30 | $1.50 | $7.50 |
| **Claude Haiku 4.5** | $1.00 | $5.00 | $0.10 | $0.50 | $2.50 |
| **Claude Haiku 3.5** | $0.80 | $4.00 | $0.08 | $0.40 | $2.00 |
| Claude Haiku 3 | $0.25 | $1.25 | $0.03 | $0.125 | $0.625 |

### Head-to-Head Price Comparison (same tier)

| Task Tier | Gemini | Claude | Gemini Savings |
|-----------|--------|--------|----------------|
| **Cheapest** | Flash-Lite: $0.10/$0.40 | Haiku 3: $0.25/$1.25 | **60% cheaper input, 68% cheaper output** |
| **Budget workhorse** | 2.5 Flash: $0.30/$2.50 | Haiku 4.5: $1.00/$5.00 | **70% cheaper input, 50% cheaper output** |
| **Mid-tier** | 2.5 Pro: $1.25/$10.00 | Sonnet 4.5: $3.00/$15.00 | **58% cheaper input, 33% cheaper output** |
| **Flagship** | 3.1 Pro: $2.00/$12.00 | Opus 4.6: $5.00/$25.00 | **60% cheaper input, 52% cheaper output** |

**Conclusion: Gemini is 33-70% cheaper across every tier.**
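The savings percentages in the table reduce to one formula; a quick sketch (prices are the per-MTok figures from the tables above):

```python
def savings(gemini_price: float, claude_price: float) -> int:
    """Percent saved by choosing Gemini over Claude at the same tier."""
    return round((1 - gemini_price / claude_price) * 100)

# Cheapest tier: Flash-Lite ($0.10/$0.40) vs Haiku 3 ($0.25/$1.25)
assert savings(0.10, 0.25) == 60    # input
assert savings(0.40, 1.25) == 68    # output

# Flagship tier: 3.1 Pro ($2/$12) vs Opus 4.6 ($5/$25)
assert savings(2.00, 5.00) == 60    # input
assert savings(12.00, 25.00) == 52  # output
```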

---

## 4. Free Tier Analysis

### What You Get (No Credit Card Required)

| Model | RPM | RPD | TPM | Context |
|-------|-----|-----|-----|---------|
| Gemini 2.5 Pro | 5 | 100 | 250,000 | 1M tokens |
| Gemini 2.5 Flash | 10 | 250 | 250,000 | 1M tokens |
| Gemini 2.5 Flash-Lite | 15 | 1,000 | 250,000 | 1M tokens |

**Claude comparison:** Claude has NO free API tier. New users get a small trial credit only.

### Free Tier Limitations

- **RPD is the bottleneck.** 100 RPD for Pro = ~4 requests/hour. 250 RPD for Flash = ~10/hour.
- **December 2025 cutback:** Google reduced free limits by 50-80% in December 2025.
- **Per-project limits** (not per API key). Can't bypass by creating multiple keys.
- **Data may be used for training** on the free tier (see Section 11).
- **No SLA.** No guarantees on availability or latency.

### Is Free Tier Viable for her-os?

**For development/prototyping: YES.** Flash-Lite at 1,000 RPD with 250K TPM is enough for testing entity extraction pipelines during development.

**For production: NO.** A single user generating 30 conversations/day, each requiring 3-5 LLM calls for entity extraction, summarization, and relationship extraction, would need 90-150 RPD minimum. With retries and multi-step processing, the free tier would be exhausted by mid-afternoon. The 5 RPM limit on Pro would also cause queueing issues.

**Verdict: Free tier for dev, paid tier for production.**
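The verdict follows from simple request arithmetic; a sketch (the 3-5 calls per conversation are this document's estimate, and the retry factor is an assumption):

```python
FREE_TIER_RPD = {"flash-lite": 1_000, "flash": 250, "pro": 100}

def requests_per_day(conversations: int, calls_per_conversation: int,
                     retry_factor: float = 1.2) -> int:
    """Estimated daily API requests, padded for retries."""
    return round(conversations * calls_per_conversation * retry_factor)

low = requests_per_day(30, 3, retry_factor=1.0)   # 90 RPD, no retries
high = requests_per_day(30, 5)                    # 180 RPD with retries
assert (low, high) == (90, 180)
assert high < FREE_TIER_RPD["flash-lite"]  # fine for dev on Flash-Lite
assert high > FREE_TIER_RPD["pro"]         # Pro's 100 RPD cannot cover one user
```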

---

## 5. API Capabilities

### Feature Comparison

| Capability | Gemini API | Claude API |
|------------|-----------|------------|
| **Structured Output (JSON)** | Native JSON Schema enforcement. Pydantic/Zod support. `anyOf`, `$ref` supported. Property ordering preserved on 2.5+. | Tool use with JSON Schema. No native `response_format` like Gemini. Structured output via tool_use pattern. |
| **Tool Use / Function Calling** | Full support. Multi-turn tool calls. Parallel function calling. | Full support. Multi-turn. Parallel. |
| **Prompt/Context Caching** | **Implicit** (auto, free on 2.5+) + **Explicit** (manual, guaranteed savings). 75-90% discount on cached reads. | 5-min cache (1.25x write) + 1-hour cache (2x write). Cache hits at 10% of input price. |
| **Batch API** | 50% discount. 24h target turnaround. Higher rate limits. | 50% discount. Async processing. |
| **Streaming** | Yes | Yes |
| **System Instructions** | Yes | Yes (system prompt) |
| **Multimodal Input** | Text, images, video, audio, PDF | Text, images, PDF |
| **Thinking/Reasoning** | Configurable thinking budget (0-24K tokens) on Flash. | Extended thinking on Sonnet/Opus. |
| **Grounding** | Google Search grounding built-in | Web search tool ($10/1K searches) |
| **Code Execution** | Built-in sandbox | Code execution tool (free with web search) |

### Structured Output Quality

Gemini's structured output is **native and enforced at the model level**:
```python
from google import genai
from pydantic import BaseModel

class Entity(BaseModel):
    name: str
    type: str
    confidence: float

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents="Extract entities from: Rajesh told Arun about the product launch",
    config=genai.types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[Entity]
    )
)
```

The JSON output is guaranteed to match the schema (syntactically). Semantic correctness depends on the model and prompt.

### Context Caching (Key Cost Saver)

Gemini's caching is particularly relevant for her-os:

- **Implicit caching** (Gemini 2.5+): Automatic. If your request prefix matches a previous request's prefix, Google automatically applies cache pricing. No code changes needed.
- **Explicit caching**: Create a named cache with system prompt + shared context. Reuse across requests.
- **Discount**: 90% on Gemini 2.5+ models, 75% on Gemini 2.0 models.
- **Cache storage**: $1-4.50 per million tokens per hour (varies by model).

For her-os, the system prompt + entity extraction instructions + known entity list would be cached, reducing per-request input cost by 75-90%.
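A rough per-day model of that saving, assuming a ~1,000-token shared prefix hits the implicit cache on every call after the first (storage fees apply only to explicit caches and are ignored here):

```python
def daily_input_cost(prefix_tokens: int, fresh_tokens: int, calls: int,
                     price_per_mtok: float, cache_discount: float = 0.90) -> float:
    """Input cost in USD/day when a shared prompt prefix is cache-hit.

    The first call pays full price; later calls pay the discounted rate
    on the prefix and full price on the fresh (per-transcript) tokens.
    """
    prefix_m = prefix_tokens / 1e6
    fresh_m = fresh_tokens / 1e6
    first = (prefix_m + fresh_m) * price_per_mtok
    rest = (calls - 1) * (prefix_m * (1 - cache_discount) + fresh_m) * price_per_mtok
    return first + rest

# 30 Flash-Lite extraction calls/day: 1,000-token prefix + 2,000-token transcript
with_cache = daily_input_cost(1_000, 2_000, 30, 0.10)
no_cache = 30 * 3_000 / 1e6 * 0.10
assert with_cache < no_cache  # caching strictly reduces input spend
```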

---

## 6. Entity Extraction Quality: Gemini vs Claude

### Benchmark Data (2025-2026)

| Benchmark | Gemini | Claude | Winner |
|-----------|--------|--------|--------|
| **Unstructured.io Data Extraction** (Jul 2025) | 2.5 Pro: 93% precision, 81% recall | Opus 4: 90% precision, 80% recall | **Gemini** |
| **Invoice Extraction** (2025) | 96% accuracy (text PDF), best on scanned docs | 97% accuracy (text PDF) | **Tie** (Claude text, Gemini scanned) |
| **Structured JSON adherence** | Native schema enforcement | Via tool_use pattern | **Gemini** (native) |
| **Nuanced relationship extraction** | Good | Excellent (superior at synthesizing from documents) | **Claude** |
| **Conversational context understanding** | Good | Excellent (better at document-level reasoning) | **Claude** |

### Assessment for her-os Use Cases

| Pipeline Task | Best Model | Reasoning |
|---------------|-----------|-----------|
| **Entity extraction** (people, topics, places) | Gemini Flash-Lite | High volume, structured output, cost-sensitive. Gemini's schema enforcement reduces parsing errors. |
| **Promise/commitment detection** | Gemini 2.5 Flash or Claude Haiku | Requires some nuance. Flash is adequate. |
| **Emotion extraction** | Claude Haiku 4.5 | Emotional nuance is Claude's strength. |
| **Memory summarization** | Gemini 2.5 Flash | Bulk summarization. Cost-sensitive. Good enough quality. |
| **Relationship extraction** (multi-hop) | Claude Sonnet 4.5 | Complex reasoning. Claude excels here. |
| **Morning Debrief generation** | Claude Sonnet 4.5 | User-facing narrative. Quality matters more than cost. |
| **Conversation analysis** (overall) | Gemini 2.5 Flash | Good balance of quality and cost for classification/tagging. |

**Key insight**: Entity extraction (the highest-volume task) is where Gemini shines. It does not need the nuance of Claude. A named entity is either detected or not. Gemini's native JSON Schema enforcement actually makes it *more reliable* for structured extraction than Claude's tool_use pattern.

---

## 7. Python SDK

### google-genai (GA, Recommended)

```bash
pip install google-genai            # sync
pip install "google-genai[aiohttp]"  # async support (quotes needed in some shells)
```

- **Status:** General Availability (GA) as of May 2025. Stable, production-ready.
- **Replaces:** `google-generativeai` (DEPRECATED)
- **Python:** 3.9+
- **Async:** Full async support via aiohttp
- **Pydantic:** Native Pydantic model support for request/response schemas
- **Repository:** [github.com/googleapis/python-genai](https://github.com/googleapis/python-genai)

### Basic Usage

```python
from google import genai

# API key auth (AI Studio)
client = genai.Client(api_key="YOUR_KEY")

# Vertex AI auth
client = genai.Client(
    vertexai=True,
    project="your-project",
    location="us-central1"
)

# Synchronous
response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents="Extract entities from this text..."
)

# Async (call from inside an async function; every Client exposes .aio)
response = await client.aio.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents="Extract entities..."
)
```

### Pydantic Integration (Structured Output)

```python
from pydantic import BaseModel
from google import genai

class ExtractedEntities(BaseModel):
    people: list[str]
    topics: list[str]
    promises: list[str]
    emotions: list[str]

client = genai.Client(api_key="YOUR_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=transcript_text,
    config=genai.types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=ExtractedEntities
    )
)
entities = ExtractedEntities.model_validate_json(response.text)
```

### Instructor Integration (Works with Both Claude and Gemini)

```python
import instructor

# Gemini via Instructor (same API as Claude)
client = instructor.from_provider("google-genai", async_client=False)
entities = client.create(
    model="gemini-2.5-flash-lite",
    response_model=ExtractedEntities,
    messages=[{"role": "user", "content": transcript_text}]
)
```

### SDK Maturity Assessment

| Aspect | Rating | Notes |
|--------|--------|-------|
| Stability | Excellent | GA since May 2025 |
| Documentation | Good | Official docs + cookbook examples |
| Async support | Excellent | Full async via aiohttp |
| Type hints | Good | Pydantic native |
| Error handling | Good | Structured error types |
| Community | Large | Google ecosystem |
| FastAPI integration | Excellent | Async + Pydantic = natural fit |

---

## 8. Local/Self-Hosted Options (Gemma)

### Gemma 3 Model Family

| Model | Parameters | VRAM (BF16) | VRAM (Q4) | Multimodal | Context |
|-------|-----------|-------------|-----------|------------|---------|
| Gemma 3 270M | 270M | <1 GB | <0.5 GB | Text only | 32K |
| Gemma 3 1B | 1B | 2 GB | 1 GB | Text only | 128K |
| Gemma 3 4B | 4B | 8 GB | 4 GB | Text + Image | 128K |
| Gemma 3 12B | 12B | 24 GB | 9 GB | Text + Image | 128K |
| Gemma 3 27B | 27B | 54 GB | 18 GB | Text + Image | 128K |

### Titan Fit Assessment

Titan has 128 GB unified memory (GB10 Blackwell). All Gemma 3 models fit easily:

| Model | Titan Fit | Use Case |
|-------|-----------|----------|
| **Gemma 3 4B (Q4)** | 4 GB / 128 GB = 3% | Entity extraction, classification |
| **Gemma 3 12B (Q4)** | 9 GB / 128 GB = 7% | Summarization, analysis |
| **Gemma 3 27B (Q4)** | 18 GB / 128 GB = 14% | Complex reasoning, relationship extraction |

### Gemma 3 for Entity Extraction

- **LangExtract**: Open-source Python library designed for structured extraction with Gemma
- **Fine-tuning**: Gemma 3 4B can be fine-tuned for NER with Unsloth (1.6x faster, 60% less VRAM)
- **Structured output**: Gemma 3 supports function calling and controlled generation natively
- **License**: Open weights, responsible commercial use permitted

### Zero-Cost Pipeline with Gemma on Titan

```
Conversation transcript
  --> Gemma 3 4B (entity extraction, local, free)
  --> Gemma 3 12B (summarization, local, free)
  --> Claude API (complex reasoning, relationship extraction, paid)
```

**VRAM budget on Titan:**
- Qwen3-Embedding-8B: 14.1 GB (ADR-016)
- Whisper large-v3: 8.75 GB
- Kokoro TTS: ~1 GB
- Gemma 3 4B (Q4): 4 GB
- Gemma 3 12B (Q4): 9 GB
- **Total: ~37 GB / 128 GB available** (29% utilization)

This leaves 91 GB headroom. Even swapping Gemma 3 12B for Gemma 3 27B (Q4, 18 GB) would bring the total to only ~46 GB (36%).
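The budget arithmetic above can be captured as a small check (the resident-model sizes are the figures listed in this section):

```python
RESIDENT_GB = {
    "qwen3-embedding-8b": 14.1,  # ADR-016
    "whisper-large-v3": 8.75,
    "kokoro-tts": 1.0,
    "gemma-3-4b-q4": 4.0,
    "gemma-3-12b-q4": 9.0,
}

def vram_budget(models: dict[str, float], total_gb: float = 128.0) -> tuple[float, float]:
    """Return (used GB, utilization fraction); fail fast if over budget."""
    used = sum(models.values())
    assert used <= total_gb, "over VRAM budget"
    return used, used / total_gb

used, util = vram_budget(RESIDENT_GB)
assert round(used, 2) == 36.85  # ~37 GB, ~29% of Titan's 128 GB
```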

### Tradeoffs: Local Gemma vs Gemini API

| Factor | Gemma (Local) | Gemini API |
|--------|--------------|------------|
| Cost per query | $0 | $0.10-2.00/MTok |
| Latency | Low (on-device GPU) | Network dependent (~100-500ms) |
| Quality (4B) | Good for extraction | Excellent (larger model) |
| Quality (27B) | Very good | Excellent |
| Structured output | Requires prompting discipline | Native schema enforcement |
| Maintenance | Model updates manual | Auto-updated by Google |
| Privacy | 100% local | Paid: not used for training |
| Availability | Always (Titan uptime) | Google SLA dependent |

---

## 9. Vertex AI vs AI Studio Pricing

### Platform Comparison

| Aspect | Google AI Studio | Vertex AI |
|--------|-----------------|-----------|
| **Target** | Developers, prototyping | Enterprise, production |
| **Free tier** | Yes (rate-limited) | No (but $300 new-user credit) |
| **Token pricing** | See Section 3 tables | Similar or slightly lower per-token |
| **Billing** | API key, Google Cloud billing | Google Cloud billing required |
| **SLA** | None (free tier), basic (paid) | Enterprise SLA |
| **Data residency** | No control | Regional control (GDPR) |
| **Private networking** | No | VPC, private endpoints |
| **Fine-tuning** | Limited | Full SFT support |
| **Compliance** | Basic | SOC2, HIPAA (BAA), ISO 27001 |

### Pricing Differences

For **identical models**, the per-token pricing is **approximately the same** between AI Studio and Vertex AI. The key differences are:

1. **AI Studio** has a free tier; Vertex AI does not (except $300 trial credit)
2. **Vertex AI** may offer slightly lower per-token prices on some models
3. **Vertex AI** charges for additional infrastructure (endpoints, storage, networking)
4. **Vertex AI** enables fine-tuning, which AI Studio does not fully support

### Recommendation for her-os

**Use AI Studio (Developer API)** for Sprint 1 and initial production. Reasons:
- Free tier for development
- Same models, same pricing
- Simpler setup (API key vs Google Cloud project)
- No enterprise features needed for single-user self-hosted app
- ADR-004 (Local-First Privacy) means we control the data ourselves

---

## 10. Context Window Sizes

### Complete Context Window Comparison

| Model | Input Context | Max Output | Notes |
|-------|--------------|------------|-------|
| **Gemini 3.1 Pro** | 1M tokens | 64K tokens | Most capable |
| **Gemini 3 Pro** | 1M tokens | 64K tokens | |
| **Gemini 3 Flash** | 1M tokens | 32K tokens | |
| **Gemini 2.5 Pro** | 1M tokens (2M coming) | 64K tokens | |
| **Gemini 2.5 Flash** | 1M tokens | 32K tokens | |
| **Gemini 2.5 Flash-Lite** | 1M tokens | 32K tokens | |
| **Claude Opus 4.6** | 200K (1M beta) | 32K | 1M requires Tier 4 |
| **Claude Sonnet 4.5** | 200K (1M beta) | 16K | 1M requires Tier 4 |
| **Claude Haiku 4.5** | 200K | 8K | No 1M option |

**Gemini advantage:** 1M context window is **standard on all models** (including free tier and the cheapest Flash-Lite). Claude's 1M is beta, limited to Tier 4 organizations, and incurs 2x pricing above 200K.

For her-os, most individual transcript segments are <10K tokens. The 1M window is useful for loading the full entity context (known people, places, relationships) alongside the transcript for richer extraction.
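As a sanity check on that design, a crude token estimate (the ~4 characters/token ratio is a common English-text heuristic, not a measured value for these transcripts):

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

transcript = "x" * 40_000       # a large ~10K-token segment
entity_context = "y" * 400_000  # ~100K tokens of known people/places/relationships

total = rough_tokens(transcript) + rough_tokens(entity_context)
assert total == 110_000
assert total < 1_000_000  # fits easily inside Gemini's standard 1M window
```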

---

## 11. Data Privacy Considerations

### Critical for her-os (ADR-004: Local-First Privacy)

| Scenario | Data Used for Training? | Acceptable for her-os? |
|----------|------------------------|----------------------|
| **Gemini Free Tier** | **YES** (may be used) | **NO** -- violates ADR-004 |
| **Gemini Paid API** (AI Studio) | **NO** (Google pledges) | Yes, with caveats |
| **Gemini Vertex AI** | **NO** (enterprise terms) | Yes |
| **Claude API** | **NO** (default policy) | Yes |
| **Gemma (local)** | **N/A** (runs on Titan) | **BEST** -- zero data leaves device |

### Key Privacy Facts

1. **Gemini Paid API**: "Gemini doesn't use your prompts or its responses as data to train its models" for paid setups. Prompts are encrypted in-transit.
2. **Gemini Free Tier**: Free/unpaid users "by default may have their input used to improve Google AI models."
3. **Claude API**: "Data is not used for training by default." Stricter default policy.
4. **Neither Gemini nor Claude** are designed for HIPAA, PCI-DSS, or regulated data out-of-the-box.
5. **Gemini Terms of Service** state: "You should not upload sensitive personal information."

### Privacy Strategy for her-os

Given ADR-004 (local-first):
1. **Never use Gemini free tier** in production (training risk)
2. **Prefer Gemma (local)** for maximum privacy on high-sensitivity transcripts
3. **Use paid Gemini API** for non-sensitive bulk processing where local Gemma quality is insufficient
4. **Continue using Claude API** for user-facing outputs (Morning Debrief) where quality trumps cost
5. **PII redaction** before any API call (already planned in ADR-014 email agent architecture)
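Point 5 could be sketched as a minimal pre-call pass (the patterns and placeholders below are illustrative only, not the ADR-014 design; a production redactor would need NER rather than regexes):

```python
import re

# Illustrative patterns only -- real PII detection needs much more than this.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\b\d[\d\s().-]{8,}\d\b"),
}

def redact(text: str) -> str:
    """Replace PII spans with typed placeholders before any API call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

assert redact("Mail arun@example.com or call +91 98765 43210") == \
    "Mail [EMAIL] or call [PHONE]"
```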

---

## 12. Cost Modeling for her-os Pipeline

### Assumptions

- 1 user (Rajesh)
- ~30 conversations/day (Omi captures ~6 hours of ambient audio)
- Average transcript segment: ~2,000 tokens
- Entity extraction prompt + system context: ~1,500 tokens input, ~500 tokens output
- Summarization: ~3,000 tokens input, ~500 tokens output
- Relationship extraction (complex): ~5,000 tokens input, ~1,000 tokens output
- Morning Debrief generation: ~10,000 tokens input, ~2,000 tokens output
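Every option table below follows the same arithmetic; a sketch that reproduces Option A (prices per MTok from Section 3; the table's $0.65/day is the exact $0.645 with per-task rounding):

```python
def daily_cost(tasks: list[tuple[float, float, float, float]]) -> float:
    """Sum over tasks of input_mtok * in_price + output_mtok * out_price.

    Token volumes are already aggregated per day, as in the tables below.
    """
    return sum(i * p_in + o * p_out for i, o, p_in, p_out in tasks)

# Option A (Claude-only): Haiku 4.5 at $1/$5, Sonnet 4.5 at $3/$15
OPTION_A = [
    (0.045, 0.015, 1.0, 5.0),   # entity extraction, 30 calls
    (0.090, 0.015, 1.0, 5.0),   # summarization, 30 calls
    (0.050, 0.010, 3.0, 15.0),  # relationship extraction, 10 calls
    (0.010, 0.002, 3.0, 15.0),  # Morning Debrief, 1 call
]
assert abs(daily_cost(OPTION_A) - 0.645) < 1e-6  # table shows ~$0.65/day
```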

### Option A: Claude-Only (Current ADR-002)

| Task | Model | Calls/Day | Input MTok | Output MTok | Daily Cost |
|------|-------|-----------|------------|-------------|------------|
| Entity extraction | Haiku 4.5 | 30 | 0.045 | 0.015 | $0.12 |
| Summarization | Haiku 4.5 | 30 | 0.090 | 0.015 | $0.17 |
| Relationship extraction | Sonnet 4.5 | 10 | 0.050 | 0.010 | $0.30 |
| Morning Debrief | Sonnet 4.5 | 1 | 0.010 | 0.002 | $0.06 |
| **Daily total** | | **71** | | | **$0.65** |
| **Monthly total** | | **~2,130** | | | **~$19.50** |

### Option B: Gemini-Only

| Task | Model | Calls/Day | Input MTok | Output MTok | Daily Cost |
|------|-------|-----------|------------|-------------|------------|
| Entity extraction | Flash-Lite | 30 | 0.045 | 0.015 | $0.01 |
| Summarization | 2.5 Flash | 30 | 0.090 | 0.015 | $0.06 |
| Relationship extraction | 2.5 Pro | 10 | 0.050 | 0.010 | $0.16 |
| Morning Debrief | 2.5 Pro | 1 | 0.010 | 0.002 | $0.03 |
| **Daily total** | | **71** | | | **$0.26** |
| **Monthly total** | | **~2,130** | | | **~$7.80** |

### Option C: Hybrid (Gemini + Claude) -- RECOMMENDED

| Task | Model | Calls/Day | Input MTok | Output MTok | Daily Cost |
|------|-------|-----------|------------|-------------|------------|
| Entity extraction | Gemini Flash-Lite | 30 | 0.045 | 0.015 | $0.01 |
| Summarization | Gemini 2.5 Flash | 30 | 0.090 | 0.015 | $0.06 |
| Relationship extraction | Claude Sonnet 4.5 | 10 | 0.050 | 0.010 | $0.30 |
| Morning Debrief | Claude Sonnet 4.5 | 1 | 0.010 | 0.002 | $0.06 |
| **Daily total** | | **71** | | | **$0.43** |
| **Monthly total** | | **~2,130** | | | **~$12.90** |

### Option D: Local Gemma + Claude (Maximum Savings)

| Task | Model | Calls/Day | Daily Cost |
|------|-------|-----------|------------|
| Entity extraction | Gemma 3 4B (local) | 30 | $0.00 |
| Summarization | Gemma 3 12B (local) | 30 | $0.00 |
| Relationship extraction | Claude Sonnet 4.5 | 10 | $0.30 |
| Morning Debrief | Claude Sonnet 4.5 | 1 | $0.06 |
| **Daily total** | | **71** | **$0.36** |
| **Monthly total** | | **~2,130** | **~$10.80** |

### Option E: Gemini Batch API (Non-Real-Time Processing)

For entity extraction and summarization that don't need real-time results, batch processing halves the cost:

| Task | Model | Calls/Day | Daily Cost |
|------|-------|-----------|------------|
| Entity extraction (batch) | Gemini Flash-Lite | 30 | $0.005 |
| Summarization (batch) | Gemini 2.5 Flash | 30 | $0.03 |
| Relationship extraction | Claude Sonnet 4.5 | 10 | $0.30 |
| Morning Debrief | Claude Sonnet 4.5 | 1 | $0.06 |
| **Daily total** | | **71** | **$0.40** |
| **Monthly total** | | **~2,130** | **~$11.85** |

### Cost Summary

| Strategy | Monthly Cost | vs Claude-Only |
|----------|-------------|----------------|
| A: Claude-only | $19.50 | baseline |
| B: Gemini-only | $7.80 | **-60%** |
| C: Hybrid (Gemini + Claude) | $12.90 | **-34%** |
| D: Local Gemma + Claude | $10.80 | **-45%** |
| E: Gemini Batch + Claude | $11.85 | **-39%** |

**With context caching** (90% discount on Gemini cached reads), Options B/C/E would be even cheaper. The system prompt + entity schema (~1,000 tokens) would be cached across all 60 daily extraction/summarization calls, saving roughly another $0.01-0.03/day at these volumes.

---

## 13. Recommendation

### For Sprint 1 (ADR-018: `claude -p` Subprocess)

**No change.** Sprint 1 uses `claude -p` subprocess via Claude Max subscription. This is already paid for and simplest to implement. The Gemini research informs the production API transition, not Sprint 1.

### For Production API Transition

**Recommended: Option C (Hybrid) with Option D (Gemma) as Phase 2 enhancement.**

#### Phase 1: Hybrid Gemini + Claude API

```
Transcript --> Gemini 2.5 Flash-Lite (entity extraction, $0.10/$0.40 MTok)
          --> Gemini 2.5 Flash (summarization, $0.30/$2.50 MTok)
          --> Claude Sonnet 4.5 (relationship extraction, $3/$15 MTok)
          --> Claude Sonnet 4.5 (Morning Debrief, $3/$15 MTok)
```

Cost: ~$12.90/month (34% savings vs Claude-only)

#### Phase 2: Add Local Gemma for Zero-Cost Bulk

```
Transcript --> Gemma 3 4B local (entity extraction, $0)
          --> Gemma 3 12B local (summarization, $0)
          --> Claude Sonnet 4.5 (relationship extraction, $3/$15 MTok)
          --> Claude Sonnet 4.5 (Morning Debrief, $3/$15 MTok)
```

Cost: ~$10.80/month (45% savings vs Claude-only)

### Architecture Implication

The LLM adapter layer should be **provider-agnostic from day 1**:

```python
# her_os/llm/base.py
from typing import Protocol

class LLMProvider(Protocol):
    async def extract_entities(self, transcript: str) -> ExtractedEntities: ...
    async def summarize(self, text: str) -> str: ...
    async def extract_relationships(self, context: str) -> list[Relationship]: ...

# her_os/llm/gemini.py
class GeminiProvider(LLMProvider): ...

# her_os/llm/claude.py
class ClaudeProvider(LLMProvider): ...

# her_os/llm/gemma.py
class GemmaLocalProvider(LLMProvider): ...

# her_os/llm/router.py
class LLMRouter:
    """Routes tasks to optimal provider based on task type and config."""
    def __init__(self, config: LLMConfig):
        self.entity_provider = GeminiProvider(model="gemini-2.5-flash-lite")
        self.summary_provider = GeminiProvider(model="gemini-2.5-flash")
        self.reasoning_provider = ClaudeProvider(model="claude-sonnet-4.5")
        self.narrative_provider = ClaudeProvider(model="claude-sonnet-4.5")
```

### What NOT to Do

1. **Do NOT use Gemini free tier in production** -- data privacy risk, rate limits too low
2. **Do NOT replace Claude entirely** -- Claude is superior for nuanced reasoning and user-facing text
3. **Do NOT add Gemma in Sprint 1** -- validate API-based pipeline first, optimize with local models later
4. **Do NOT over-engineer the router** -- start with hard-coded routing, add dynamic routing later
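"Hard-coded routing" in point 4 can be as little as a lookup table; a minimal sketch (the task names and model strings are assumptions matching this document, not a final design):

```python
# Minimal hard-coded routing: pipeline task -> (provider, model).
ROUTING = {
    "entity_extraction": ("gemini", "gemini-2.5-flash-lite"),
    "summarization": ("gemini", "gemini-2.5-flash"),
    "relationship_extraction": ("claude", "claude-sonnet-4.5"),
    "morning_debrief": ("claude", "claude-sonnet-4.5"),
}

def route(task: str) -> tuple[str, str]:
    """Resolve a pipeline task to its provider and model; KeyError on unknown tasks."""
    return ROUTING[task]

assert route("entity_extraction") == ("gemini", "gemini-2.5-flash-lite")
```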

### Dependencies to Add (When Ready)

```
# requirements/base.txt (production API transition)
google-genai[aiohttp]>=1.0,<2.0  # Gemini API SDK with async (aiohttp) support
```

---

## Sources

### Official Documentation
- [Gemini Developer API Pricing](https://ai.google.dev/gemini-api/docs/pricing)
- [Gemini API Rate Limits](https://ai.google.dev/gemini-api/docs/rate-limits)
- [Gemini Structured Outputs](https://ai.google.dev/gemini-api/docs/structured-output)
- [Gemini Context Caching](https://ai.google.dev/gemini-api/docs/caching)
- [Gemini Batch API](https://ai.google.dev/gemini-api/docs/batch-api)
- [Gemini API Libraries](https://ai.google.dev/gemini-api/docs/libraries)
- [Gemini API Models](https://ai.google.dev/gemini-api/docs/models)
- [Gemini API Terms of Service](https://ai.google.dev/gemini-api/terms)
- [Vertex AI Pricing](https://cloud.google.com/vertex-ai/generative-ai/pricing)
- [Claude API Pricing](https://platform.claude.com/docs/en/about-claude/pricing)
- [Gemma 3 Model Overview](https://ai.google.dev/gemma/docs/core)
- [google-genai PyPI](https://pypi.org/project/google-genai/)
- [python-genai GitHub](https://github.com/googleapis/python-genai)

### Benchmark & Comparison Sources
- [Unstructured AI vs Gemini, Claude, OpenAI o3: Data Extraction](https://www.multimodal.dev/post/unstructured-ai-vs-gemini-claude-and-openai-o3)
- [Claude vs Gemini Comparison (DataCamp)](https://www.datacamp.com/blog/claude-vs-gemini)
- [Gemini 3.1 Pro Review (Medium)](https://medium.com/@leucopsis/gemini-3-1-pro-review-1403a8aa1a96)
- [Invoice Extraction: Claude vs GPT vs Gemini](https://www.koncile.ai/en/ressources/claude-gpt-or-gemini-which-is-the-best-llm-for-invoice-extraction)

### Pricing Analysis Sources
- [Gemini API Pricing Calculator (CostGoat)](https://costgoat.com/pricing/gemini-api)
- [Gemini Pricing in 2026 (Finout)](https://www.finout.io/blog/gemini-pricing-in-2026)
- [Google Gemini API Pricing Guide (MetaCTO)](https://www.metacto.com/blogs/the-true-cost-of-google-gemini-a-guide-to-api-pricing-and-integration)

### Privacy Sources
- [How Gemini for Google Cloud Uses Your Data](https://docs.cloud.google.com/gemini/docs/discover/data-governance)
- [Gemini API Terms and Data Privacy (Redact)](https://redact.dev/blog/gemini-api-terms-2025)
- [Google Consumer vs Enterprise Training Policies (i10x)](https://i10x.ai/news/google-gemini-training-data-consumer-vs-enterprise)

### SDK & Integration Sources
- [Google Gen AI SDK Documentation](https://googleapis.github.io/python-genai/)
- [Structured Outputs with genai SDK (Instructor)](https://python.useinstructor.com/integrations/genai/)
- [Google Pydantic AI Integration](https://ai.pydantic.dev/models/google/)
- [Gemini Entity Extraction Cookbook](https://github.com/google-gemini/cookbook/blob/main/examples/json_capabilities/Entity_Extraction_JSON.ipynb)

### Gemma Sources
- [LangExtract + Gemma for Structured Data Extraction (TDS)](https://towardsdatascience.com/using-googles-langextract-and-gemma-for-structured-data-extraction/)
- [Gemma 3 Fine-tuning with Unsloth](https://unsloth.ai/blog/gemma3)
- [Fine-tune Gemma 3 with NeMo-AutoModel (NVIDIA)](https://docs.nvidia.com/nemo/automodel/latest/guides/omni/gemma3-3n.html)
- [Gemini 2.5 Flash-Lite GA Announcement](https://developers.googleblog.com/gemini-25-flash-lite-is-now-stable-and-generally-available/)
- [Gemini 2.5 Thinking Budget (VentureBeat)](https://venturebeat.com/ai/googles-gemini-2-5-flash-introduces-thinking-budgets-that-cut-ai-costs-by-600-when-turned-down)

---

# Part B: Mistral AI

---

## 14. Mistral Executive Summary

Mistral AI (Paris, founded 2023) offers a compelling self-hosted option for her-os on DGX Spark. Key findings:

- **Mistral 3 family (Dec 2025)** is a major release: open-weight Apache 2.0 models from 3B to 14B (dense) + 675B MoE flagship
- **Ministral 14B** and **Mistral Small 3.2 (24B)** are the sweet spots for DGX Spark — fast, capable, open-weight
- **NVFP4 (4-bit floating point)** is Blackwell's native quantization — 2.3x throughput over BF16, <1% accuracy loss. This is DGX Spark's secret weapon
- **NVIDIA NIM containers** exist for Mistral Small 3.2 and Ministral 14B — pre-optimized for deployment
- **Devstral 2 (123B)** fits on DGX Spark at FP4 (~70 GB) for local code agent use
- **Voxtral Mini 4B Realtime (Feb 2026)** is open-weight STT with sub-200ms latency — potential Whisper alternative (but no Kannada)
- **Mistral Large 3 (675B MoE)** does NOT fit on a single DGX Spark — requires multi-GPU datacenter hardware

**Bottom line:** Ministral 14B or Mistral Small 3.2 running locally at NVFP4 on DGX Spark gives us zero-cost, low-latency entity extraction with hardware-accelerated quantization. This is the strongest self-hosted option we've found.

---

## 15. Mistral Model Lineup (2025-2026)

### Complete Release Timeline

| Date | Model | Params | Type | Context | License | Notes |
|------|-------|--------|------|---------|---------|-------|
| **Jan 2025** | Mistral Small 3 | 24B | Dense | 32K | Apache 2.0 | First Small 3 line |
| **Jan 2025** | Codestral 25.01 | 22B | Dense | 256K | Proprietary | 86.6% HumanEval, 80+ languages |
| **Mar 2025** | Mistral Small 3.1 | 24B | Dense | 128K | Apache 2.0 | Added vision, long context |
| **May 2025** | Mistral Medium 3 | Undisclosed | Dense | 128K | Closed (API-only) | ~90% of Sonnet 3.7 at 8x lower cost |
| **May 2025** | Devstral 1 | 24B | Dense | 128K | Apache 2.0 | Code agents, 46.8% SWE-Bench |
| **Jun 2025** | Magistral Small | 24B | Dense | 128K | Apache 2.0 | Reasoning model, 70.7% AIME2024 |
| **Jun 2025** | Magistral Medium | Undisclosed | Dense | 128K | Closed (API-only) | Reasoning, 73.6% AIME2024 |
| **Jun 2025** | Mistral Small 3.2 | 24B | Dense | 128K | Apache 2.0 | Better instruction following, function calling |
| **Aug 2025** | Codestral 25.08 | 22B | Dense | — | Proprietary | Enterprise coding |
| **Aug 2025** | Mistral Medium 3.1 | Undisclosed | Dense | — | Closed (API-only) | Update to Medium 3 |
| **Dec 2025** | **Mistral Large 3** | **675B** | **MoE (41B active)** | **256K** | **Apache 2.0** | Frontier. Trained on 3,000 H200s |
| **Dec 2025** | **Ministral 3 (3B/8B/14B)** | **3-14B** | **Dense** | **128-256K** | **Apache 2.0** | 3 sizes × 3 variants (base/instruct/reasoning) |
| **Dec 2025** | **Devstral 2** | **123B** | **Dense** | **256K** | **Modified MIT** | 72.2% SWE-Bench Verified |
| **Dec 2025** | Devstral Small 2 | 24B | Dense | 256K | Apache 2.0 | 68.0% SWE-Bench (best open-weight at size) |
| **Dec 2025** | Mistral OCR 3 | — | — | — | API-only | 74% win rate over OCR 2 |
| **Feb 2026** | **Voxtral Transcribe 2** | **4B** | **Dense** | — | **Apache 2.0** | Sub-200ms STT, 13 languages, open-weight |

### Key Model Families

#### Ministral 3 (December 2025) — Edge/Small Models
- **9 variants total**: 3 sizes (3B, 8B, 14B) × 3 variants (Base, Instruct, Reasoning)
- All support vision input
- All Apache 2.0 licensed
- 3B runs on devices with as little as 4 GB VRAM at 4-bit quantization
- 14B is the quality sweet spot for on-device inference

#### Mistral Small 3.x (24B) — Workhorse
- The 24B dense model has been released in four iterations (3.0 → 3.1 → 3.2 → Magistral Small)
- Apache 2.0 throughout
- 128K context, vision support (from 3.1 onwards)
- Magistral Small adds reasoning (chain-of-thought) capabilities

#### Mistral Large 3 (675B MoE) — Frontier
- Granular Mixture-of-Experts with thousands of expert subnetworks
- Only 41B parameters active per token (rest dormant)
- But ALL 675B must be loaded into memory — does NOT fit on DGX Spark
- Apache 2.0 (remarkably, for a frontier model)
- NVFP4 checkpoint available on HuggingFace
- Designed for multi-GPU nodes (8×H200, GB200 NVL72)

#### Devstral 2 (123B) — Code Agents
- 72.2% SWE-Bench Verified (top-tier)
- Dense 123B — fits on DGX Spark at FP4 (~70 GB)
- Modified MIT license (more permissive than Apache 2.0 for some uses)
- Released alongside Mistral Vibe CLI (terminal agent)

### Mistral API Pricing (La Plateforme)

| Model | Input ($/MTok) | Output ($/MTok) | Notes |
|-------|---------------|-----------------|-------|
| Ministral 3B | $0.04 | $0.04 | Cheapest |
| Ministral 8B | $0.10 | $0.10 | |
| Mistral Small 3.2 (24B) | $0.10 | $0.30 | Best value workhorse |
| Mistral Medium 3 | $0.40 | $2.00 | Closed-weight |
| Mistral Large 3 (675B) | $2.00 | $6.00 | Frontier MoE |
| Codestral 25.08 | $0.30 | $0.90 | Code-specific |

**Comparison:** Mistral Small 3.2 at $0.10/$0.30 is competitive with Gemini Flash-Lite at $0.10/$0.40. But self-hosted on DGX Spark it costs $0.
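
For a quick sanity check, the table's prices can be turned into a monthly-cost estimate. A minimal sketch (the token volumes are illustrative assumptions, not measured her-os traffic):

```python
# Hedged sketch: monthly API cost at an assumed volume, using the
# La Plateforme prices from the table above.

PRICES = {  # $ per million tokens (input, output)
    "ministral-3b": (0.04, 0.04),
    "ministral-8b": (0.10, 0.10),
    "mistral-small-3.2": (0.10, 0.30),
    "mistral-medium-3": (0.40, 2.00),
    "mistral-large-3": (2.00, 6.00),
}

def monthly_cost(model: str, in_mtok: float, out_mtok: float) -> float:
    """Monthly cost in USD for a given token volume (millions of tokens)."""
    price_in, price_out = PRICES[model]
    return in_mtok * price_in + out_mtok * price_out

# Assumed workload: ~10 MTok in / 2 MTok out per month of extraction traffic
print(f"Small 3.2: ${monthly_cost('mistral-small-3.2', 10, 2):.2f}/mo")
print(f"Large 3:   ${monthly_cost('mistral-large-3', 10, 2):.2f}/mo")
```

Even at the API's prices the workhorse tier is cheap; self-hosted it drops to $0.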

---

## 16. NVIDIA Optimizations & DGX Spark

### NVIDIA + Mistral Partnership

NVIDIA co-engineered Mistral 3 deployment with several optimizations:

1. **Wide Expert Parallelism (Wide-EP):** Optimized MoE GroupGEMM kernels for Mistral Large 3's granular expert architecture
2. **Expert distribution and load balancing** across NVLink domains
3. **Expert scheduling** for full GPU utilization on MoE forward passes
4. **On GB200 NVL72:** Up to 10x higher performance than H200, exceeding 5M tokens/sec/MW

### NVFP4 Checkpoints

Mistral provides official NVFP4 (4-bit floating point) checkpoints:
- `mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4` on HuggingFace
- Quantized offline using the open-source **llm-compressor** library
- Recipe targets only MoE weights; all other components remain at original precision
- Supported backends: SGLang, TensorRT-LLM, vLLM (with Marlin FP4 fallback for A100/H100)

### TensorRT-LLM Integration

- Mistral models are first-class citizens in TensorRT-LLM
- Pre-built engines available for common configurations
- INT8/FP8/FP4 quantization supported via TensorRT model optimizer
- On Blackwell, FP4 is hardware-native (no emulation overhead)

---

## 17. NVFP4: Blackwell's Quantization Superpower

### What is NVFP4?

NVFP4 is a **4-bit floating-point** format native to NVIDIA Blackwell GPUs (5th-gen Tensor Cores). Unlike INT4 (used in GGUF/GPTQ), NVFP4 retains **floating-point semantics** with:
- Shared exponent across groups of values
- Compact mantissa for per-value precision
- Higher-precision FP8 scaling factors
- Finer-grained block scaling for accuracy preservation
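
The headline memory numbers follow directly from this layout. A worked sketch, assuming the published NVFP4 block structure (4-bit E2M1 values in blocks of 16, one FP8 E4M3 scale per block; per-tensor scales ignored as negligible):

```python
# Worked memory math for NVFP4. Assumption: 16-value micro-blocks, each
# carrying one FP8 scale, as described in NVIDIA's NVFP4 announcement.

BLOCK = 16
bits_per_weight = 4 + 8 / BLOCK           # 4-bit value + amortized FP8 scale
print(bits_per_weight)                    # 4.5 bits/weight

print(16 / bits_per_weight)               # vs FP16: ~3.56x reduction
print(8 / bits_per_weight)                # vs FP8:  ~1.78x reduction

# Rough weight footprint for a 24B model at NVFP4 (weights only, no KV cache):
params = 24e9
print(params * bits_per_weight / 8 / 1e9) # ~13.5 GB
```

The ~3.56x and ~1.78x figures line up with the ~3.5x and ~1.8x reductions in the table below; the ~13.5 GB weight footprint is consistent with the ~16 GB fit estimate once runtime overhead is added.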

### Performance Characteristics

| Metric | Value | Notes |
|--------|-------|-------|
| Memory reduction vs FP16 | **~3.5x** | 16-bit → 4-bit + scaling overhead |
| Memory reduction vs FP8 | **~1.8x** | 8-bit → 4-bit + scaling overhead |
| Throughput gain over BF16 | **2.3x** | Native Blackwell Tensor Core acceleration |
| Throughput vs AWQ (4-bit int) | **20% faster** | Per community benchmarks on DGX Spark |
| Accuracy loss (HellaSwag, MMLU, PiQA) | **<1%** | Within ±0.005-0.01 of BF16 baseline |

### Why This Matters for DGX Spark

The DGX Spark (GB10 Blackwell) has:
- **128 GB unified LPDDR** (CPU + GPU coherent memory)
- **1 PFLOP FP4** compute (with sparsity)
- Native NVFP4 hardware support (no software emulation)

This means:
1. Models up to ~3.5x too large for BF16 can run at NVFP4
2. Models that already fit get 2.3x faster inference
3. More VRAM headroom for KV cache = longer context = more concurrent requests
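
Point 3 can be made concrete with a back-of-envelope KV-cache calculation. The model dimensions below are hypothetical (plausible for a ~14B GQA model, not Ministral's published architecture):

```python
# Rough KV-cache sizing sketch. Dimensions are illustrative assumptions:
# 40 layers, 8 KV heads (grouped-query attention), head_dim 128, FP8 KV cache.

def kv_cache_gb(layers, kv_heads, head_dim, ctx_tokens, dtype_bytes=1):
    """K + V tensors per layer per token, for one sequence."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * ctx_tokens / 1e9

print(kv_cache_gb(40, 8, 128, 128_000))     # ~10.5 GB for one 128K-context seq
print(kv_cache_gb(40, 8, 128, 8_000) * 16)  # ~10.5 GB for 16 concurrent 8K chats
```

Either way, roughly 10 GB of KV cache is trivial against the ~96 GB of headroom the FP4 weights leave free.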

### NVFP4 vs GGUF Q4_K_M

| Aspect | NVFP4 | GGUF Q4_K_M |
|--------|-------|-------------|
| Format | Floating-point 4-bit | Integer 4-bit (mixed) |
| Hardware acceleration | **Native on Blackwell** | CPU-optimized (llama.cpp) |
| Throughput on DGX Spark | **2.3x over BF16** | ~1x (CPU-bound on ARM64) |
| Accuracy preservation | <1% loss (FP8 scales) | ~1-2% loss (group quantization) |
| Backend required | TensorRT-LLM, NIM, vLLM, SGLang | llama.cpp, ollama |
| Ecosystem maturity | Growing (2025+) | Mature (2023+) |

**Key insight:** Running Mistral models through ollama/llama.cpp on DGX Spark wastes the Blackwell hardware. You MUST use TensorRT-LLM or NIM to get the NVFP4 acceleration. This is the difference between "it runs" and "it flies."

### DGX Spark Benchmark Data (Mistral Models)

| Model | Precision | Single-User tok/s | 128-Concurrency tok/s |
|-------|-----------|-------------------|----------------------|
| Mistral Small 3.1 24B | BF16 | 5.3 | 158.9 |
| Mistral Small 3.1 24B | FP8 | — | 319.7 (2x BF16) |
| Mistral Small 3.1 24B | FP4 (est.) | ~12 | ~370 (2.3x BF16) |

---

## 18. DGX Spark Fit Assessment

### What Fits on Titan (128 GB Unified Memory)

| Model | Params | BF16 Size | FP8 Size | FP4 Size (est.) | Fits at FP4? | Fits at BF16? |
|-------|--------|-----------|----------|-----------------|-------------|---------------|
| Ministral 3-3B | 3B | ~6 GB | ~3 GB | ~2 GB | Yes | Yes |
| Ministral 3-8B | 8B | ~16 GB | ~8 GB | ~5 GB | Yes | Yes |
| **Ministral 3-14B** | **14B** | **~28 GB** | **~14 GB** | **~8 GB** | **Yes** | **Yes** |
| **Mistral Small 3.2** | **24B** | **~55 GB** | **~28 GB** | **~16 GB** | **Yes** | **Yes** |
| Devstral Small 2 | 24B | ~55 GB | ~28 GB | ~16 GB | Yes | Yes |
| Magistral Small | 24B | ~55 GB | ~28 GB | ~16 GB | Yes | Yes |
| Devstral 2 | 123B | ~246 GB | ~123 GB | ~70 GB | **Yes** | No |
| Pixtral Large | 124B | ~248 GB | ~124 GB | ~71 GB | **Yes** | No |
| **Mistral Large 3** | **675B** | **~1350 GB** | **~675 GB** | **~406 GB** | **NO** | **NO** |

### VRAM Budget with Existing her-os Stack

Current Titan VRAM allocation (from ADR-016, Phase 0 validation):

| Service | VRAM |
|---------|------|
| Qwen3-Embedding-8B | 14.1 GB |
| Whisper large-v3 (PyTorch) | 8.75 GB |
| Kokoro TTS | ~1 GB |
| **Subtotal (existing)** | **~24 GB** |

Available for local LLM: **128 - 24 = 104 GB**

| Mistral Model | FP4 Size | Fits alongside stack? | Remaining VRAM |
|---------------|----------|----------------------|----------------|
| Ministral 3-14B | ~8 GB | **Yes** | 96 GB |
| Mistral Small 3.2 (24B) | ~16 GB | **Yes** | 88 GB |
| Devstral 2 (123B) | ~70 GB | **Yes** | 34 GB |
| Two models (14B + 24B) | ~24 GB | **Yes** | 80 GB |

**Ministral 14B at FP4 uses only 8 GB** — we could run it alongside ALL existing services with 96 GB to spare. Even Devstral 2 (123B) at FP4 fits with 34 GB headroom for KV cache.
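
The budget arithmetic above reduces to a small fit-check sketch (sizes are the FP4 estimates from the fit table; the existing-stack figure is from ADR-016):

```python
# VRAM fit check for Titan (128 GB unified memory). FP4 sizes are the
# estimates from the fit table above, in GB.

TOTAL_GB = 128
EXISTING_GB = 24  # Qwen3-Embedding + Whisper large-v3 + Kokoro
FP4_GB = {"ministral-14b": 8, "mistral-small-3.2": 16, "devstral-2": 70}

def remaining_after(*models: str) -> int:
    """GB left after loading the existing stack plus the given models."""
    used = EXISTING_GB + sum(FP4_GB[m] for m in models)
    assert used <= TOTAL_GB, f"does not fit: {used} GB > {TOTAL_GB} GB"
    return TOTAL_GB - used

print(remaining_after("ministral-14b"))                       # 96 GB
print(remaining_after("devstral-2"))                          # 34 GB
print(remaining_after("ministral-14b", "mistral-small-3.2"))  # 80 GB
```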

---

## 19. Mistral NIM Containers

### Available on NGC Catalog (Pre-Optimized for NVIDIA GPUs)

| NIM Container | Model | Status |
|---------------|-------|--------|
| `nim/mistralai/mistral-small-3.2-24b-instruct-2506` | Mistral Small 3.2 24B | Available |
| `nim/mistralai/ministral-14b-instruct-2512` | Ministral 14B | Available |
| `nim/nv-mistralai/mistral-nemo-12b-instruct` | Mistral Nemo 12B | Available |
| `mistralai/mistral-large-3-675b-instruct-2512` | Mistral Large 3 675B | API only (multi-GPU) |
| `mistralai/mistral-medium-3-instruct` | Mistral Medium 3 | API only |

### NIM Deployment on DGX Spark

NIM containers include TensorRT-LLM with NVFP4 support. Deployment is straightforward:

```bash
# Authenticate to NGC first (username "$oauthtoken", password = NGC API key):
#   docker login nvcr.io
# Then pull and run Ministral 14B NIM on DGX Spark:
docker run -d --gpus all \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/mistralai/ministral-14b-instruct-2512:latest
```

This gives you:
- NVFP4 quantization (hardware-native on Blackwell)
- TensorRT-LLM optimized inference
- OpenAI-compatible REST API at `http://localhost:8000/v1/chat/completions`
- Zero per-query cost
- No data leaves the device

### Integration with her-os Eval Harness

The NIM container exposes an OpenAI-compatible API, so adding it to `run_eval.py` would use the same pattern as the Sarvam provider:

```python
# In run_eval.py — local NIM provider
class NIMLocalProvider:
    def __init__(self):
        from openai import OpenAI
        self.client = OpenAI(
            base_url="http://titan:8000/v1",  # or localhost if running on Titan
            api_key="not-needed",
        )

    def call(self, model_id, system, user):
        response = self.client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            max_tokens=4096,
            temperature=0.2,
        )
        return (
            response.choices[0].message.content or "",
            response.usage.prompt_tokens if response.usage else 0,
            response.usage.completion_tokens if response.usage else 0,
        )
```
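
One practical caveat: smaller local models do not always emit bare JSON even when asked, so entity-extraction scoring benefits from a defensive parse step. A hedged sketch (not part of the current `run_eval.py`; `extract_json` is a hypothetical helper):

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating
    markdown code fences and surrounding prose."""
    # Strip ```json ... ``` fences if present
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Fall back to the outermost braces
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start : end + 1])

print(extract_json('```json\n{"people": ["Asha"]}\n```'))
print(extract_json('Sure! {"people": ["Asha", "Ravi"]}'))
```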

---

## 20. Voxtral: Mistral's STT Alternative

### Voxtral Transcribe 2 (February 4, 2026)

Two variants:

| Variant | Params | Type | Latency | Languages | License | Cost (API) |
|---------|--------|------|---------|-----------|---------|------------|
| Voxtral Mini Transcribe V2 | — | Batch | — | 13 | Closed | $0.003/min |
| **Voxtral Mini 4B Realtime** | **4B** | **Streaming** | **<200ms** | **13** | **Apache 2.0** | $0.006/min |

### Comparison with Whisper large-v3

| Aspect | Voxtral 4B Realtime | Whisper large-v3 |
|--------|---------------------|------------------|
| Size | 4B params | 1.55B params |
| Latency | Sub-200ms (streaming) | Batch (process full audio) |
| Streaming | Yes (configurable 240ms-2.4s) | No (requires chunking hacks) |
| Languages | 13 | 99 |
| **Kannada** | **NOT listed** | **Yes** |
| Diarization | Built-in (V2 batch) | External (pyannote) |
| Open-weight | Yes (Apache 2.0) | Yes (MIT) |
| VRAM | ~8 GB (est. BF16), ~2.5 GB (FP4) | 8.75 GB |

### Assessment for her-os

**Not a Whisper replacement** — Voxtral does not support Kannada, which is critical for her-os (3 of 8 eval transcripts are Kannada-heavy). Whisper large-v3 remains the right choice for STT.

**However**, Voxtral's streaming capability is interesting for voice call scenarios (Dimension 2: Voice OS) where sub-200ms latency matters and the conversation is in English. Worth revisiting when Dimension 2 development begins.

---

## 21. Mistral vs Gemini vs Claude: Self-Hosted Comparison

### For Entity Extraction on DGX Spark

| Factor | Mistral (Local NIM) | Gemma (Local) | Gemini API | Claude API |
|--------|---------------------|---------------|------------|------------|
| **Cost/query** | $0 | $0 | $0.10-0.40/MTok | $1-5/MTok |
| **Latency** | ~50-200ms (GPU) | ~100-300ms (GPU) | ~100-500ms (network) | ~200-2000ms (network) |
| **Model size options** | 3B, 8B, 14B, 24B, 123B | 1B, 4B, 12B, 27B | N/A (API) | N/A (API) |
| **NVFP4 support** | **Yes (NIM/TRT-LLM)** | Partial (NeMo) | N/A | N/A |
| **Structured output** | JSON mode, function calling | JSON + schema | Native schema enforcement | Tool use pattern |
| **Privacy** | 100% local | 100% local | Paid: not trained | Not trained |
| **Availability** | Titan uptime | Titan uptime | Google SLA | Anthropic SLA |
| **Quality (14B)** | Good | Good (12B) | — | — |
| **Quality (24B)** | Very good | Very good (27B) | — | — |
| **Kannada handling** | Unknown (needs eval) | Unknown (needs eval) | Good (tested) | Good (tested) |
| **License** | Apache 2.0 | Open weights | API ToS | API ToS |

### Key Differentiator: NVFP4

The critical difference between running Mistral via NIM and running Gemma via ollama on DGX Spark is **hardware-native FP4 quantization**. NIM leverages TensorRT-LLM, which runs the Blackwell Tensor Cores at FP4 and delivers 2.3x throughput over BF16. Ollama/llama.cpp uses GGUF integer quantization, which does not hit Blackwell's FP4 Tensor Core path.

**This means Ministral 14B via NIM at FP4 should be ~2x faster than Gemma 12B via ollama at Q4**, even though the models are similar in parameter count.

---

## 22. Mistral Recommendation for her-os

### For LLM Eval Harness (Immediate Next Step)

**Evaluate Ministral 14B-Instruct** on the same 8 transcripts × 5 tasks:

1. Deploy NIM container on Titan: `nim/mistralai/ministral-14b-instruct-2512`
2. Add `NIMLocalProvider` to `run_eval.py` (OpenAI-compatible API)
3. Run entity extraction eval, compare against Claude/Gemini API scores
4. If quality is competitive (score ≥7/10), this becomes the default entity extraction model

### For Production Architecture

**Recommended: Two-tier local + API hybrid:**

```
Tier 1 (Local, $0/query):
  - Ministral 14B-Instruct via NIM (entity extraction, classification)
  - Mistral Small 3.2 via NIM (summarization, function calling)
  - Both at NVFP4 — combined ~24 GB VRAM

Tier 2 (API, quality-critical):
  - Claude Sonnet/Opus (relationship extraction, Morning Debrief, nuanced reasoning)
  - Gemini Flash (batch processing, cost-sensitive bulk tasks)
```

### Updated Cost Modeling (Option F: Local Mistral + Claude)

| Task | Model | Calls/Day | Daily Cost |
|------|-------|-----------|------------|
| Entity extraction | Ministral 14B (local NIM) | 30 | $0.00 |
| Summarization | Mistral Small 3.2 (local NIM) | 30 | $0.00 |
| Relationship extraction | Claude Sonnet 4.5 (API) | 10 | $0.30 |
| Morning Debrief | Claude Sonnet 4.5 (API) | 1 | $0.06 |
| **Daily total** | | **71** | **$0.36** |
| **Monthly total** | | **~2,130** | **~$10.80** |

Same monthly cost as Option D (Local Gemma + Claude), but with potentially **better quality** (14B dense vs 4B quantized) and **faster inference** (NVFP4 hardware acceleration vs GGUF CPU-side).
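
The Option F totals reduce to simple arithmetic; a sketch reproducing the table (per-task daily costs taken directly from the rows above):

```python
# Option F cost model: local NIM calls are $0; Claude API tasks carry the
# per-day costs from the table above.

calls_per_day = {"entity": 30, "summary": 30, "relationship": 10, "debrief": 1}
daily_cost = {"entity": 0.0, "summary": 0.0, "relationship": 0.30, "debrief": 0.06}

daily_calls = sum(calls_per_day.values())
daily_total = sum(daily_cost.values())
print(daily_calls, f"${daily_total:.2f}/day")            # 71 calls, $0.36/day
print(daily_calls * 30, f"${daily_total * 30:.2f}/mo")   # 2130 calls, $10.80/mo
```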

### What NOT to Do

1. **Do NOT try to run Mistral Large 3 (675B) on DGX Spark** — it needs 406 GB minimum, Titan has 128 GB
2. **Do NOT use ollama/llama.cpp for Mistral on DGX Spark** — use NIM/TensorRT-LLM to get NVFP4 hardware acceleration
3. **Do NOT skip Kannada eval** — Mistral's multilingual capability is unverified on Kannada; must test before committing
4. **Do NOT replace Claude for user-facing text** — Mistral is for bulk extraction, Claude stays for quality-critical reasoning

---

## 23. Sarvam AI Summary

**Researched:** Session 62 (2026-02-25)

### Available API Models

| Model | Params | Type | Pricing | Structured Output |
|-------|--------|------|---------|-------------------|
| **Sarvam-M** | 24B | Chat | Free | **No** (dealbreaker for entity extraction) |

### Coming Soon (NOT on API Yet)

| Model | Status | Notes |
|-------|--------|-------|
| **Sarvam-30B (Vikram)** | Announced | Not on API |
| **Sarvam-105B (Indus)** | Announced | Not on API |

### Other Sarvam Services (Valuable for her-os)

| Service | Model | Notes |
|---------|-------|-------|
| **STT (Kannada-native)** | Saaras v3 | Native Kannada support — better than Whisper for Kannada? |
| **TTS** | Bulbul v3 | Indic language voices |
| **Translation** | Sarvam-Translate v1 | 10+ Indic languages |
| **Vision/OCR** | Sarvam Vision | Document understanding |

### Assessment

- **Sarvam-M for entity extraction: NOT viable** — no structured output support means we can't guarantee JSON schema compliance
- **Sarvam STT (Saaras v3): Worth evaluating** — native Kannada ASR could outperform Whisper on Kannada-heavy transcripts
- **Re-evaluate when Sarvam-30B/105B become available** on the API — larger models may support structured output

---

## Mistral Sources

### Official Mistral
- [Introducing Mistral 3](https://mistral.ai/news/mistral-3) — Dec 2, 2025 launch announcement
- [Mistral Medium 3](https://mistral.ai/news/mistral-medium-3) — May 7, 2025
- [Devstral 1](https://mistral.ai/news/devstral) — May 21, 2025
- [Devstral 2 and Mistral Vibe CLI](https://mistral.ai/news/devstral-2-vibe-cli) — Dec 9-10, 2025
- [Magistral (Reasoning)](https://mistral.ai/news/magistral) — Jun 10, 2025
- [Codestral 25.01](https://mistral.ai/news/codestral-2501) — Jan 2025
- [Mistral Small 3.1](https://mistral.ai/news/mistral-small-3-1) — Mar 2025
- [Voxtral Transcribe 2](https://mistral.ai/news/voxtral-transcribe-2) — Feb 4, 2026
- [Mistral OCR 3](https://mistral.ai/news/mistral-ocr-3) — Dec 19, 2025
- [Pixtral Large](https://mistral.ai/news/pixtral-large) — Late 2024
- [Mistral Docs — Models](https://docs.mistral.ai/getting-started/models)

### NVIDIA + Mistral
- [NVIDIA Partners With Mistral AI to Accelerate New Family of Open Models](https://blogs.nvidia.com/blog/mistral-frontier-open-models/) — Dec 2025
- [NVIDIA-Accelerated Mistral 3 Open Models](https://developer.nvidia.com/blog/nvidia-accelerated-mistral-3-open-models-deliver-efficiency-accuracy-at-any-scale/)
- [Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
- [NVFP4 Quantization — DGX Spark](https://build.nvidia.com/spark/nvfp4-quantization)
- [How NVIDIA DGX Spark's Performance Enables Intensive AI Tasks](https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/)
- [Mistral Small 3.2 NIM Container (NGC)](https://catalog.ngc.nvidia.com/orgs/nim/teams/mistralai/containers/mistral-small-3.2-24b-instruct-2506)
- [Ministral 14B NIM (NGC)](https://build.nvidia.com/mistralai/ministral-14b-instruct-2512/modelcard)

### NVFP4 & DGX Spark
- [Explaining NVIDIA NVFP4, the DGX Spark's Secret Weapon (Micro Center)](https://www.microcenter.com/site/mc-news/article/nvidia-nvfp4-explained.aspx)
- [The Ultimate Guide to AI Quantization on DGX Spark: NVFP4 vs FP8 vs BF16](https://satgeo.blog/2026/01/22/dgx-spark-quantization-guide-nvfp4-blackwell/)
- [We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! (NVIDIA Forums)](https://forums.developer.nvidia.com/t/we-unlocked-nvfp4-on-the-dgx-spark-20-faster-than-awq/361163)
- [NVIDIA DGX Spark In-Depth Review (LMSYS Org)](https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/)

### HuggingFace
- [Mistral-Large-3-675B-Instruct-2512-NVFP4](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4)
- [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602)

### Analysis
- [NVIDIA and Mistral AI Bring 10x Faster Inference on GB200 NVL72 (MarkTechPost)](https://www.marktechpost.com/2025/12/02/nvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems/)
- [Mistral closes in on Big AI rivals (TechCrunch)](https://techcrunch.com/2025/12/02/mistral-closes-in-on-big-ai-rivals-with-mistral-3-open-weight-frontier-and-small-models/)
- [Mistral AI surfs vibe-coding tailwinds with Devstral 2 (TechCrunch)](https://techcrunch.com/2025/12/09/mistral-ai-surfs-vibe-coding-tailwinds-with-new-coding-models/)
- [Mistral AI Wikipedia](https://en.wikipedia.org/wiki/Mistral_AI)
- [Mistral Large 3 vLLM Recipes](https://docs.vllm.ai/projects/recipes/en/latest/Mistral/Mistral-Large-3.html)
