# Research: Model Selection & Pipeline Architecture for Entity Extraction

**Date:** 2026-03-07
**Status:** Active — recommendations being implemented

---

## Context

**Problem:** her-os extracts entities (people, relationships, topics, promises, emotions) from English conversation transcripts via a single LLM call: `transcript -> Qwen3.5-9B -> entities -> store`. This produces incorrect relationship hierarchies ("Helen's Family Unit") and hallucinated entity types, with no reasoning or validation step.

**Key finding:** The industry consensus is that **extraction quality is a pipeline problem, not a model problem.** A better model helps, but a multi-stage pipeline with pre-processing, resolution, and validation is transformational.

---

## 1. Which Benchmarks Matter for Entity Extraction?

### Tier 1: Critical (choose models based on these)

| Benchmark | What It Measures | Why It Matters |
|-----------|-----------------|----------------|
| **IFEval** | Instruction-following precision | Can it follow "output JSON with these exact keys"? |
| **BFCL** (Berkeley FC) | Function/tool calling accuracy | Structured schema filling = extraction |
| **MMLU-Pro** | World knowledge breadth | "MTR" = restaurant, not transit system |
| **IFBench** | Structured output format adherence | More specific than IFEval for JSON/format |

### Tier 2: Important supporting signals

| Benchmark | What It Measures | Why |
|-----------|-----------------|-----|
| **BBH** | Multi-step reasoning | Relationship inference from context |
| **MT-Bench** | Conversational understanding | Transcripts are messy dialogue |
| **Arena Hard** | General quality | Holistic capability proxy |

### Not relevant
HumanEval, MATH, GSM8K (code/math, not extraction)

### Key insight
**JSON validity is a solved problem.** Grammar-constrained generation (Ollama `format: json`, llama.cpp GBNF) guarantees syntactically valid JSON regardless of model size. The real differentiator is extraction **quality** — correct entity identification, relationship classification, and reasoning.
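Constraining Ollama to valid JSON is a one-field change to the request. A minimal sketch (the `/api/generate` endpoint and `format` field are part of Ollama's documented API; the model tag and prompt wording are illustrative):

```python
import json

def build_extraction_request(transcript: str, model: str = "qwen3.5:27b") -> dict:
    """Build an Ollama /api/generate payload with grammar-constrained JSON output.

    `format: "json"` makes the server constrain decoding so the response is
    always syntactically valid JSON; entity *quality* still depends on the
    model and prompt.
    """
    return {
        "model": model,  # substitute whatever tag is pulled locally
        "prompt": f"Extract people, relationships, and topics as JSON:\n{transcript}",
        "format": "json",  # grammar-constrained decoding
        "stream": False,
    }

# Serialised body, ready to POST to http://localhost:11434/api/generate
payload = build_extraction_request("Ellen told Marcus she'd book the MTR restaurant.")
body = json.dumps(payload)
```

This guarantees parseable output even from a small model; whether the parsed entities are *correct* is what the rest of this document is about.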

---

## 2. Model Comparison (ranked by extraction suitability)

| Rank | Model | Params | IFEval | MMLU-Pro | BFCL | VRAM (Q4) | Speed | License | Verdict |
|------|-------|--------|--------|----------|------|-----------|-------|---------|---------|
| **1** | **Qwen3.5-27B** | 27B dense | **95.0** | **86.1** | -- | ~16GB | 1x | Apache 2.0 | **Best overall for extraction** |
| 2 | Qwen3.5-35B-A3B | 35B/3B MoE | 91.9 | 85.3 | 67.3 | ~24GB | 5x | Apache 2.0 | Already deployed, good for bulk |
| 3 | Nemotron 3 Nano 30B | 30B/3.5B MoE | ~80-85 | 78.3 | 53.8 | ~20-25GB | 3x | NVIDIA Open | Worth A/B testing |
| 4 | Gemma 3 27B | 27B dense | 90.4 | 67.5 | -- | ~16GB | 1x | Gemma ToU | Strong reasoning, weaker knowledge |
| 5 | LN-Super-49B | 49B pruned | 89.2 | -- | 73.7 | ~28GB | 0.7x | NVIDIA+Llama | Highest BFCL but heavy |
| 6 | Phi-4-reasoning+ | 14B dense | ~85 | 70.4 | -- | ~9GB | 2x | MIT | Best reasoning-per-GB |
| 7 | Mistral Small 3.2 | 24B dense | ~88 | -- | -- | ~15GB | 1.2x | Apache 2.0 | Native function calling |
| 8 | Nemotron Nano 9B v2 | 9B hybrid | 90.3 | -- | 66.9 | ~5GB | 4x | NVIDIA Open | **Sleeper: not on Ollama yet** |

### Not recommended
- **Phi-4 (base):** IFEval 63.0 — too low for structured extraction
- **Orca 2:** Obsolete (2023, non-commercial)
- **DeepSeek R1:** Weak structured output

---

## 3. NVIDIA Nemotron Assessment

### Nemotron 3 Nano 30B-A3B (best NVIDIA candidate)

**Pros:** Official DGX Spark playbook, 65-67 t/s (NVFP4), `ollama pull nemotron-3-nano:30b`, 1M context window, 9K JSON schema tasks in RL training.

**Cons:** IFEval ~80-85 vs Qwen3.5-27B's 95.0, BFCL 53.8 vs 67.3, MMLU-Pro 78.3 vs 86.1.

**Bottom line:** Throughput advantage is real but extraction is not throughput-bound. Quality gaps on IFEval and BFCL are the concern. Worth A/B testing.

### Nemotron Nano 9B v2 (sleeper)
IFEval 90.3, BFCL v3 66.9 — nearly matches Qwen3.5-35B-A3B at 9B params. **Not yet on Ollama** (pending llama.cpp b6315+ bump).

---

## 4. Industry Architecture Patterns

### The consensus architecture (2025)

```
Stage 1: ASR Post-Processing (fixes "Helen" -> "Ellen")
  Fuzzy-match transcript mentions against known entities from graph

Stage 2: Fast NER (GLiNER2, CPU, 80ms)
  Extract explicit entity spans with zero hallucination

Stage 3: LLM Reasoning (Qwen3.5-27B, GPU, ~2s)
  Classify relationships, resolve ambiguity, infer implicit entities
  Compare against existing knowledge graph

Stage 4: Validation + Storage
  Confidence gating: >0.9 auto-store, 0.7-0.9 flagged, <0.7 human review
  Conflict detection against existing graph
```
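The Stage 4 gate is simple enough to pin down now. A sketch using the thresholds above (the action names are illustrative, not an existing API):

```python
def gate_entity(confidence: float) -> str:
    """Route an extracted entity by confidence (Stage 4 thresholds).

    > 0.9      -> store automatically
    0.7 - 0.9  -> store but flag for review
    < 0.7      -> hold for human review
    """
    if confidence > 0.9:
        return "auto-store"
    if confidence >= 0.7:
        return "flagged"
    return "human-review"
```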

### Production systems reference

| System | Approach | Key Technique |
|--------|----------|---------------|
| **Percept** | Two-pass extraction + 5-tier resolution | Regex + LLM, exact -> fuzzy -> contextual |
| **Graphiti** | Entropy-gated fuzzy matching | Shannon entropy decides matching, 60%+ fewer LLM calls |
| **Microsoft GraphRAG** | Multi-stage pipeline | Extract -> merge -> summarize -> community detection |
| **Apple RAC-NEC** | ASR entity correction | Tag -> retrieve -> correct. **33-39% WER reduction** |

---

## 5. Implementation Roadmap

### Sprint 9a (immediate)
1. Switch extraction model: `llamacpp/qwen3.5-9b` -> `ollama/qwen3.5:27b`
2. A/B test Nemotron 3 Nano via benchmark infrastructure
3. Add real MTR transcript to benchmark fixtures

### Sprint 9b (next)
4. ASR post-processing: fuzzy-match against known entities
5. GLiNER2 first-pass NER (CPU, 80ms, zero hallucination)
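Item 4 needs little more than the standard library. A sketch using `difflib` (the 0.75 cutoff is an illustrative starting point, not a tuned value):

```python
import difflib

def correct_mention(mention: str, known_entities: list[str], cutoff: float = 0.75) -> str:
    """Snap an ASR transcript mention to the closest known graph entity.

    Matching is case-insensitive; the canonical capitalisation from the
    graph is returned. Below the cutoff, the mention is left untouched.
    """
    canonical = {name.lower(): name for name in known_entities}
    hits = difflib.get_close_matches(mention.lower(), list(canonical), n=1, cutoff=cutoff)
    return canonical[hits[0]] if hits else mention

correct_mention("Helen", ["Ellen", "Marcus"])  # -> "Ellen"
```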

### Sprint 10
6. Embedding-based entity resolution (cosine > 0.85)
7. Leverage Graphiti dedup (already in stack)
8. Self-consistency voting for critical transcripts
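Item 6, sketched without an embedding client — the vectors stand in for whatever `qwen3-embedding:8b` returns, and the 0.85 threshold is the one proposed above:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; real vectors come from qwen3-embedding:8b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def same_entity(vec_a: list[float], vec_b: list[float], threshold: float = 0.85) -> bool:
    """Treat two mentions as the same entity when their embeddings agree."""
    return cosine(vec_a, vec_b) > threshold
```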

### Tools to adopt

| Tool | Purpose | Install | Integration point |
|------|---------|---------|-------------------|
| **GLiNER2** | Zero-shot NER, first-pass | `pip install gliner2` | Before LLM in extract.py |
| **Instructor** | Pydantic LLM output validation | `pip install instructor` | Wrap LLM call in extract.py |

---

## 6. DGX Spark Memory Budget

| Current Stack | VRAM |
|---------------|------|
| Qwen3.5-9B (llama-server) | ~6 GB |
| Qwen3.5:35b-a3b (Ollama) | ~21 GB |
| qwen3-embedding:8b | ~8 GB |
| Whisper large-v3 | ~9 GB |
| Kokoro TTS | ~0.5 GB |
| Docker/OS | ~3 GB |
| **Current total** | **~48 GB** |
| **Free** | **~80 GB** |

Adding Qwen3.5-27B (Q4): +16 GB -> 64 GB total, 64 GB free.
Adding Nemotron 3 Nano for A/B: +20 GB -> 84 GB total, 44 GB free.

---

## Sources

- Percept: https://github.com/GetPercept/percept
- Graphiti/Zep: https://arxiv.org/html/2501.13956v1
- Apple RAC-NEC: https://machinelearning.apple.com/research/retrieval-asr
- Nemotron 3 Nano: https://research.nvidia.com/labs/nemotron/Nemotron-3/
- Qwen3.5-27B: https://huggingface.co/Qwen/Qwen3.5-27B
- GLiNER2: https://github.com/urchade/GLiNER
- BFCL v4: https://gorilla.cs.berkeley.edu/leaderboard.html
- ACL 2023 conversational NER: https://aclanthology.org/2023.acl-long.98/
