# Research: NER Pipeline Evolution — GLiNER2 + DeBERTa + LLM Supervisor

**Date:** 2026-03-09 (updated with web research)
**Context:** Entity extraction currently depends entirely on Qwen3.5-27B (10-30s per batch, heavy GPU). We already have GLiNER2 (zero-shot NER, 205M params) implemented but disabled and CPU-only. This research explores evolving to a specialized NER pipeline where small models handle entity spotting and the LLM becomes a supervisor/verifier.

**Decision needed:** Should we add DeBERTa fine-tuned NER alongside GLiNER2, move both to GPU, and reduce LLM to a supervisor role?

---

## 0. Web Research Findings (Factual Data)

This section contains sourced, factual data from web research conducted 2026-03-09. The rest of the document references these findings.

### 0.1 GLiNER / GLiNER2 — Architecture Deep Dive

**Original GLiNER** (NAACL 2024 paper: [arXiv:2311.08526](https://arxiv.org/abs/2311.08526)):
- Uses a **bidirectional transformer encoder** (DeBERTa-v3 backbone).
- Input: entity type labels separated by a learned `[ENT]` token, concatenated with the sentence.
- The BiLM produces representations for each token. Entity embeddings pass through a **FeedForward Network**; word representations pass through a **span representation layer**.
- A **dot product + sigmoid** computes matching scores between entity representations and span representations in a shared latent space.
- This is fundamentally different from LLMs: parallel span scoring vs sequential token generation.
- Entity labels are natural language strings — the model matches spans to prompts, allowing **post-hoc introduction of new entity types without retraining**.

**Model sizes (GLiNER v1/v2):**

| Variant | Backbone | Parameters | Disk Size |
|---------|----------|------------|-----------|
| Small | DeBERTa-v3-small | ~50M | ~200MB |
| Medium | DeBERTa-v3-base | ~166M | ~600MB |
| Large | DeBERTa-v3-large | ~459M | ~1.2GB |

Latest versions on HuggingFace: `urchade/gliner_small-v2.1`, `urchade/gliner_medium-v2.1`, `urchade/gliner_large-v2.1` (Apache 2.0 license).

**GLiNER2** (EMNLP 2025 demo: [arXiv:2507.18546](https://arxiv.org/html/2507.18546v1)):
- By **Fastino Labs** (successor to Knowledgator's GLiNER).
- Unifies **NER + text classification + structured data extraction + relation extraction** in a single 205M parameter model.
- Processes all labels in a **single forward pass** (unlike DeBERTa token classification which does a separate pass per label — 6.8x slower with 20 labels).
- **Latest pip version:** `gliner2==1.2.4` (released 2026-01-22).
- HuggingFace models: `fastino/gliner2-base-v1` (205M), `fastino/gliner2-large-v1`.
- Also available: `GLiNER XL 1B` (API-only, not downloadable).

**GLiNER2 vs GPT-4o benchmarks** (from the EMNLP paper):
- Overall zero-shot F1: GLiNER2 **0.590** vs GPT-4o **0.599** (near-parity at 1/1000th the size).
- GLiNER2 beats GPT-4o on AI domain (0.547 vs 0.526) and Literature (0.564 vs 0.561).
- **2.6x speedup** over GPT-4o while running on standard CPU hardware.
- Average inference time: **~0.08 seconds** per text (external benchmark).

**GLiNER vs ChatGPT** (original paper):
- All GLiNER variants (small, medium, large) outperform ChatGPT and Vicuna on zero-shot NER.
- Medium GLiNER achieves results comparable to the **13B UniNER** despite being **140x smaller**.

**Performance limits:** GLiNER performance degrades when entity types exceed ~30 labels. Entity order in the prompt affects results due to positional encoding.

**Optimization paths:**
- ONNX quantization: FP32 model (~634MB) reduced to **~188MB** (INT8). However, ONNX was observed to be **50% slower** in some cases due to implementation overhead — not a guaranteed speedup.
- **Rust inference engine** available: [gline-rs](https://github.com/fbilhaut/gline-rs) for native-code inference speed.
- NVIDIA published [GLiNER-PII](https://huggingface.co/nvidia/gliner-PII) for PII detection (based on GLiNER architecture).

### 0.2 DeBERTa v3 — Architecture & NER Details

**DeBERTa v3** ([ICLR 2023: arXiv:2111.09543](https://arxiv.org/pdf/2111.09543)):
- Extends DeBERTa with **ELECTRA-style replaced token detection (RTD)** pre-training and **gradient-disentangled embedding sharing (GDES)**.
- **Disentangled attention:** Each word represented by two vectors (content + position). Attention weights computed using disentangled matrices — better contextual representation than standard BERT.
- Vocabulary: **128K tokens** (much larger than BERT's 30K).
- Pre-trained on **160GB** of data.

**Model sizes and memory** (from HuggingFace model cards):

| Variant | Backbone Params | Embedding Params | Total | FP16 Inference | Training (Adam) |
|---------|----------------|-----------------|-------|----------------|-----------------|
| XSmall | 22M | 48M | ~70M | ~140MB | ~547MB |
| Small | 44M | 98M | ~142M | ~284MB | ~1.1GB |
| Base | 86M | 98M | ~184M | ~351MB | ~1.4GB |
| Large | 304M | 131M | ~435M | ~828MB | ~3.2GB |

Sources: [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base), [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large).

**Note:** Add ~20% to inference memory for activations. Training with Adam uses ~4x the model size.
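
These rules of thumb can be folded into a quick estimator. A sketch (the constants mirror the note and table above; the function name is illustrative):

```python
def estimate_vram_mib(total_params: int) -> dict:
    """Rough VRAM estimates using the rules of thumb above:
    FP16 weights = 2 bytes/param; inference adds ~20% for
    activations; training with Adam uses ~4x the FP16 model size."""
    fp16 = total_params * 2 / 2**20          # weight memory in MiB
    return {
        "fp16_weights": round(fp16),
        "inference": round(fp16 * 1.2),      # +20% activations
        "training_adam": round(fp16 * 4),    # gradients + optimizer states
    }

# DeBERTa-v3-base: 184M total params
base = estimate_vram_mib(184_000_000)
# weights ≈ 351 MiB (the table's FP16 column), inference with
# activations ≈ 421 MiB, training ≈ 1404 MiB (~1.4GB)
```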

**NER performance benchmarks:**

| Model | Base | Dataset | F1 |
|-------|------|---------|-----|
| dslim/bert-base-NER | BERT-base | CoNLL-2003 | ~0.91 |
| geckos/deberta-base-fine-tuned-ner | DeBERTa-base | CoNLL-2003 | **0.96** |
| tner/deberta-v3-large-tweebank-ner | DeBERTa-v3-large | TweeBank | — |

DeBERTa-v3 demonstrates **30-40% higher sample efficiency** than BERT/RoBERTa — reaches high F1 with less training data.

**ModernBERT comparison** (2025 paper: [arXiv:2504.08716](https://arxiv.org/html/2504.08716v1)):
- When controlling for pre-training data, **DeBERTa-v3 beats ModernBERT on NER**: 93.40 F1 vs 92.03 F1.
- DeBERTa-v3 reaches high F1 with only **60-70% of the training data** that ModernBERT needs.
- ModernBERT is faster to train (1300 H100 GPU-hours for 1T tokens) but lower quality on NLU.
- **Verdict:** For NER accuracy, DeBERTa-v3 remains the better choice. ModernBERT wins on throughput.
- Neither architecture surpasses **~94 F1** on standard NER — a performance ceiling for encoder-only models.

### 0.3 Minimum Training Data Requirements

- Fine-tuning overfits with **<10,000 labeled sentences** unless regularization is applied (dropout, weight decay, early stopping).
- Few-shot NER (100-500 examples) is active research but **does not generalize as well as full fine-tuning**.
- DeBERTa's **30-40% higher sample efficiency** means it needs fewer examples than BERT.
- CoNLL-2003 training set: **14,041 sentences** — the standard benchmark.
- Practical minimum for domain-specific fine-tuning: **500-1000 labeled sentences** for usable results; **2000-5000** for strong results.
- Research paper [arXiv:2205.11799](https://arxiv.org/abs/2205.11799) proposes FFF-NER for few-shot fine-tuning by formulating NER as token prediction closer to pre-training objectives.

### 0.4 Self-Improving NER Pipelines (Teacher-Student Distillation)

**The pattern has multiple names:**
- **Knowledge distillation** (general term)
- **Teacher-student learning** (architecture pattern)
- **Targeted distillation** (NER-specific, coined by UniversalNER)
- **Weak supervision** (when LLM labels are treated as noisy labels)
- **Self-training** (when the student's own predictions feed back)

**Key paper: UniversalNER** ([arXiv:2308.03279](https://arxiv.org/abs/2308.03279), USC + Microsoft):
- Used **ChatGPT to generate instruction-tuning data** for NER from unlabeled web text.
- Dataset: **45,889 input-output pairs**, **240,725 entities**, **13,020 distinct entity types**.
- Distilled into LLaMA-based models that **outperform ChatGPT's NER by 7-9 F1 points** on 43 datasets across 9 domains.
- Demonstrates that targeted distillation from LLMs produces superior NER models.

**Clinical NLP study** ([Nature Scientific Reports, 2024](https://www.nature.com/articles/s41598-024-68168-2)):
- LLM generates **weakly-labeled pseudo-data** for a DeBERTa student model.
- DeBERTa then fine-tuned on small amounts of gold-standard data.
- Validated on: temporal relation extraction, PHI de-identification, adverse drug events.
- **This is the exact pattern we want:** LLM-verified entities → BIO tags → train DeBERTa.

**Production distillation approaches** ([Snorkel AI, 2025](https://snorkel.ai/blog/llm-distillation-demystified-a-complete-guide/)):
- **Offline distillation:** Fixed datasets, clear performance targets. Good for nightly batch.
- **On-the-fly distillation:** Runs during training/retraining, triggered by new data or performance drops.
- Teams use **data collection services** alongside distillation to continuously expand training datasets from production inputs.

**Iterative self-improvement patterns:**
- ReST and STaR improve through iterative loops of rationale generation, filtering, and fine-tuning on successful samples.
- SCKD and ISD-QA use iterative self-correction with unlabeled data.
- Lion framework uses adversarial feedback: identifies hard instructions and generates new ones to strengthen the student.

### 0.5 Catastrophic Forgetting — Mitigation Strategies (with citations)

**Strategy 1: Experience Replay / Replay Buffer** (most widely adopted):
- Store exemplars from previous tasks in a fixed-size buffer.
- During each training step, merge buffer samples with current batch.
- **Reservoir sampling** ensures uniform random sample of all past data.
- Reference: [Experience Replay for Continual Learning (NeurIPS 2019)](https://arxiv.org/abs/1811.11682).
- Simplest and most effective per multiple 2024-2025 surveys.

**Strategy 2: Elastic Weight Consolidation (EWC)**:
- Tracks important weights using **Fisher Information Matrix**.
- Adds **quadratic penalty** when those weights change.
- Available for spaCy NER: [spacy-ewc](https://pypi.org/project/spacy-ewc/).
- Reference: [Overcoming catastrophic forgetting (PNAS 2017)](https://www.pnas.org/doi/10.1073/pnas.1611835114).

**Strategy 3: Self-Authored Data Mixing (SA-SFT)**:
- Generate self-dialogues before fine-tuning, mix with task data.
- Reference: [EMNLP 2025](https://aclanthology.org/2025.emnlp-main.1108.pdf).

**Strategy 4: Parameter-Efficient Fine-Tuning (PEFT / LoRA)**:
- Freeze most weights, only train adapter layers.
- Dramatically reduces forgetting since base weights untouched.
- Reference: [Parameter-Efficient Continual Fine-Tuning survey (2025)](https://arxiv.org/html/2504.13822v2).

**Strategy 5: Periodic Full Retrain**:
- Retrain from base checkpoint weekly with all accumulated data.
- Simplest but wastes computation on stable entities.

### 0.6 BIO/IOB2 Tagging Format

```
Token:  I    spoke  with  Ellen  Sharma  about  visiting  Bangalore  tomorrow
Label:  O    O      O     B-PER  I-PER   O      O         B-LOC      O
```

**Tag set for CoNLL-2003:**
```python
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
```

For our domain, extended:
```python
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4, 'B-ORG': 5, 'I-ORG': 6, 'B-EVT': 7, 'I-EVT': 8}
```
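
At inference time the reverse mapping is needed: per-token tags must be decoded back into entity spans. A minimal IOB2 decoder (illustrative sketch, not production code):

```python
def bio_to_spans(tokens, tags):
    """Decode IOB2 tags into (entity_text, entity_type) spans.
    A B- tag opens a span; I- tags of the same type extend it;
    anything else closes the current span."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = "I spoke with Ellen Sharma about visiting Bangalore tomorrow".split()
tags = ["O", "O", "O", "B-PER", "I-PER", "O", "O", "B-LOC", "O"]
bio_to_spans(tokens, tags)  # → [("Ellen Sharma", "PER"), ("Bangalore", "LOC")]
```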

### 0.7 HuggingFace Fine-Tuning Specifics

**Recommended base model:** `microsoft/deberta-v3-base` (86M backbone params, 128K vocab).

**Training code pattern:**
```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=9,  # O + 4 entity types x 2 (B/I)
)

training_args = TrainingArguments(
    output_dir="./deberta-ner-checkpoints",
    learning_rate=2e-5,          # 1e-5 for incremental updates
    per_device_train_batch_size=16,
    num_train_epochs=3,          # 1-2 for incremental
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=False,
)
```

**Incremental fine-tuning:**
- Load previous checkpoint: `from_pretrained("./deberta-ner-v2026-03-08/")`.
- Combine new data with replay buffer (last 30 days).
- Train 1-2 epochs (not 3-5) to avoid overfitting on incremental batch.
- Lower learning rate: `1e-5` (vs `2e-5` for initial).

**Model versioning:**
- Date-stamped checkpoint dirs (`deberta-ner-v2026-03-09/`).
- Keep last 7 checkpoints, auto-delete older.
- `Trainer` saves to `output_dir` automatically.
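
The retention policy is a few lines of `pathlib` (a sketch; the directory naming follows the convention above, and `prune_checkpoints` is a hypothetical helper, not existing code):

```python
from pathlib import Path
import shutil

def prune_checkpoints(root: Path, keep: int = 7) -> list:
    """Keep the newest `keep` date-stamped checkpoint dirs and
    delete the rest. Names like deberta-ner-v2026-03-09 sort
    lexicographically by date, so a plain sort suffices.
    Returns the names of the directories removed."""
    dirs = sorted(p for p in root.glob("deberta-ner-v*") if p.is_dir())
    stale = dirs[:-keep] if len(dirs) > keep else []
    for p in stale:
        shutil.rmtree(p)
    return [p.name for p in stale]
```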

### 0.8 Comprehensive Comparison Table

| Dimension | GLiNER2 (zero-shot) | DeBERTa-v3 (fine-tuned) | LLM (Qwen3.5-27B) |
|-----------|---------------------|------------------------|--------------------|
| **Architecture** | Span-matching encoder (205M) | Token classification encoder (184M base) | Autoregressive decoder (27B) |
| **How NER works** | Match entity labels to text spans | BIO tag each token | Generate entity JSON from prompt |
| **Inference latency** | ~20-40ms GPU, ~80ms CPU | ~15-30ms GPU | 10-30s |
| **Zero-shot F1** | 0.59 (vs GPT-4o 0.60) | N/A (requires training) | ~0.85 (good prompting) |
| **Fine-tuned F1** | N/A | >0.96 on CoNLL-2003 | N/A |
| **Model size (disk)** | ~500MB | ~351MB (FP16) | ~16GB (Q4) |
| **VRAM (inference)** | ~500MB | ~420MB (FP16) | ~16GB |
| **VRAM (training)** | N/A | ~1.4GB (Adam, base) | N/A |
| **Training data needed** | None | 500-1000 sentences (initial) | None |
| **New entity types** | Instant (add label string) | Requires retraining | Instant (update prompt) |
| **Handles >30 types** | Degrades | No issue | No issue |
| **Promises/decisions** | Cannot | Cannot | Excellent |
| **Relationships** | Cannot | Cannot | Excellent |
| **Improves over time** | No | Yes (nightly training) | No (static weights) |
| **Day-1 usefulness** | Immediate | Zero | Immediate |
| **Cost per 1000 batches** | ~$0 (local) | ~$0 (local) | GPU-seconds |
| **Best for** | Cold start, novel entities | Known domain entities | Semantic extraction, verification |

### 0.9 Sources

- [GLiNER NAACL 2024 paper](https://arxiv.org/abs/2311.08526)
- [GLiNER2 EMNLP 2025 demo paper](https://arxiv.org/html/2507.18546v1)
- [GLiNER GitHub](https://github.com/urchade/GLiNER)
- [GLiNER2 GitHub (Fastino)](https://github.com/fastino-ai/GLiNER2)
- [gliner2 PyPI](https://pypi.org/project/gliner2/1.2.3/)
- [DeBERTa-v3 ICLR 2023 paper](https://arxiv.org/pdf/2111.09543)
- [microsoft/deberta-v3-base HuggingFace](https://huggingface.co/microsoft/deberta-v3-base)
- [microsoft/deberta-v3-large HuggingFace](https://huggingface.co/microsoft/deberta-v3-large)
- [geckos/deberta-base-fine-tuned-ner](https://huggingface.co/geckos/deberta-base-fine-tuned-ner)
- [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER)
- [ModernBERT vs DeBERTa-v3 paper](https://arxiv.org/html/2504.08716v1)
- [UniversalNER paper](https://arxiv.org/abs/2308.03279)
- [Clinical NLP weak supervision (Nature Scientific Reports)](https://www.nature.com/articles/s41598-024-68168-2)
- [Snorkel AI distillation guide](https://snorkel.ai/blog/llm-distillation-demystified-a-complete-guide/)
- [Experience Replay for Continual Learning (NeurIPS 2019)](https://arxiv.org/abs/1811.11682)
- [EWC (PNAS 2017)](https://www.pnas.org/doi/10.1073/pnas.1611835114)
- [spacy-ewc PyPI](https://pypi.org/project/spacy-ewc/)
- [Parameter-Efficient Continual Fine-Tuning survey (2025)](https://arxiv.org/html/2504.13822v2)
- [Catastrophic Forgetting in LLMs (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.1108.pdf)
- [GLiNER vs LLM zero-shot NER comparison (UBIAI)](https://ubiai.tools/comparing-gliner-with-llm-zero-shot-labeling-for-named-entity-recognition/)
- [GLiNER as LLM alternative for query parsing (Sease 2025)](https://sease.io/2025/10/gliner-as-an-alternative-to-llms-for-query-parsing-evaluation.html)
- [nvidia/gliner-PII](https://huggingface.co/nvidia/gliner-PII)
- [gline-rs (Rust inference engine)](https://github.com/fbilhaut/gline-rs)
- [FFF-NER few-shot paper](https://arxiv.org/abs/2205.11799)
- [HuggingFace Token Classification docs](https://huggingface.co/docs/transformers/v4.17.0/tasks/token_classification)
- [GLiNER Zilliz blog (architecture deep-dive)](https://zilliz.com/blog/gliner-generalist-model-for-named-entity-recognition-using-bidirectional-transformer)
- [Illustrated GLiNER (Medium)](https://medium.com/@shahrukhx01/illustrated-gliner-e6971e4c8c52)

---

## 1. Current Pipeline (as of Session 261)

```
audio → WhisperX (GPU) → text
  → ASR correction (CPU, <10ms)         ← kraken creature
  → GLiNER2 NER (CPU, ~200ms, OFF)      ← hawk creature
  → Qwen3.5-27B extraction (GPU, 10-30s) ← unicorn creature
  → Validation pipeline                  ← fairy creature (contradiction only)
  → Store: PostgreSQL + Neo4j + Qdrant
```

### Current Problems

| Problem | Impact |
|---------|--------|
| Qwen3.5-27B does ALL extraction | 10-30s per batch, blocks GPU |
| 27B model loaded permanently in Ollama | ~16 GB VRAM reserved |
| GLiNER2 is CPU-only and disabled | Wastes potential, adds 200ms when ON |
| No domain-specific training | LLM re-discovers "Ellen", "Appa" every session |
| Single point of failure | If Ollama is slow, extraction backs up entirely |

### What the LLM Actually Does

The extraction prompt (`extract.py:22-44`) asks the LLM to do two fundamentally different jobs:

**Job 1 — Entity spotting (easy):**
- Find person names: "Ellen", "Rajesh", "Amma"
- Find places: "Bangalore", "HSR Layout"
- Find organizations: "Infosys", "ThoughtWorks"
- Find events: "Holi", "standup meeting"

**Job 2 — Semantic understanding (hard):**
- Detect promises: "I need to call the dentist" → implicit promise with deadline
- Detect decisions: "We decided to use PostgreSQL" → decision entity
- Detect relationships: "Ellen is Rajesh's colleague" → relationship entity
- Detect emotions: "I'm feeling stressed about the deadline" → emotion entity
- Assign confidence scores and sensitivity levels

Job 1 can be done by small NER models in milliseconds. Job 2 genuinely needs LLM reasoning.

---

## 2. The Two NER Models

### 2.1 GLiNER2 (Zero-Shot NER)

**What it is:** A multi-task information extraction system by Fastino Labs (evolved from Knowledgator's GLiNER). It takes any label strings at runtime and finds matching entities in text — no training needed. Also supports text classification and structured extraction.

**Architecture:** Uses **DeBERTa as its backbone** — the `gliner2` library is built on top of DeBERTa's disentangled attention encoder (205M params) with a span extraction head. Entity labels and text are concatenated with a learned `[ENT]` separator. The model computes matching scores between entity label representations and text span representations via dot product + sigmoid in a shared latent space. Processes all labels in a **single forward pass** (unlike standalone DeBERTa token classification which requires one pass per label). See Section 0.1 for full architecture details.

> **Key discovery:** Since GLiNER2 is DeBERTa-based internally, fine-tuning the backbone and loading it via `GLiNER2.from_pretrained()` could give us domain-tuned NER *without changing any calling code* in `ner.py`. This is a potential shortcut for Phase 2.

**Our implementation:** `services/context-engine/ner.py`
- Model: `fastino/gliner2-base-v1` (205M params, ~500MB)
- Labels: `["person", "location", "organization", "event"]`
- Currently: CPU-only, feature-flagged OFF (`GLINER2_ENABLED=0`)
- Confidence discount: `confidence * 0.8` (NER is pre-filter, not final)

**Key metrics:**
| Metric | CPU (current) | GPU (proposed) |
|--------|--------------|----------------|
| Model size | 205M params (~500MB) | Same |
| Inference latency | ~200ms per batch | ~20-40ms per batch |
| VRAM usage | 0 | ~500MB (~0.4% of 128GB) |
| Training needed | None (zero-shot) | None |

**Strengths:**
- Add new entity types instantly (just add label strings)
- No training data needed — works day 1
- Handles multilingual text (English + Kannada mentions)

**Limitations:**
- Lower precision than a fine-tuned model on domain-specific entities
- Cannot detect complex entities (promises, decisions, relationships)
- Confidence calibration is rough — needs threshold tuning

### 2.2 DeBERTa v3 (Fine-Tuned NER)

**What it is:** Microsoft's DeBERTa v3 (Decoding-enhanced BERT with disentangled Attention) is a transformer encoder model. When fine-tuned for NER, it classifies each token as belonging to an entity type using BIO (Begin-Inside-Outside) tags.

**Architecture:** 184M total params (86M backbone + 98M embedding) for DeBERTa-v3-base; 435M for v3-large. Uses disentangled attention — content and position embeddings are processed separately via ELECTRA-style RTD pre-training and gradient-disentangled embedding sharing (GDES). 128K token vocabulary. Fine-tuned by adding a token classification head on top. See Section 0.2 for full details.

**How fine-tuning works:**
1. Prepare data in BIO/IOB2 format:
   ```
   Token:  I    spoke  with  Ellen  about  Bangalore
   Label:  O    O      O     B-PER  O      B-LOC
   ```
2. Fine-tune using HuggingFace `Trainer`:
   - Base model: `microsoft/deberta-v3-base` (or a pre-trained NER model like `dslim/deberta-v3-base-ner`)
   - Training: 3-5 epochs, ~10 minutes on GPU for small datasets
   - Data: BIO-tagged examples generated from LLM-verified entities

**Key metrics:**
| Metric | Value |
|--------|-------|
| Model size | 184M params / ~351MB (FP16 base), 435M / ~828MB (FP16 large) |
| Inference latency | ~15-30ms per batch (GPU) |
| VRAM (inference) | ~420MB (base FP16) |
| VRAM (training, Adam) | ~1.4GB (base) |
| Training per cycle | ~10 min on GPU (small dataset) |
| Data needed (initial) | ~500-1000 labeled sentences |
| Data needed (incremental) | ~50-100 new sentences per nightly cycle |

**Strengths:**
- Very high precision on trained entity types (F1 > 0.96 on CoNLL-2003)
- Gets better over time as more training data accumulates
- Fast inference — comparable to GLiNER2
- Well-understood model — enormous HuggingFace ecosystem

**Limitations:**
- Useless on day 1 (no training data yet)
- Adding new entity types requires retraining
- Risk of catastrophic forgetting during incremental updates (mitigatable — see Section 4.3)
- Cannot detect semantic entities (promises, decisions) — only named entities

### 2.3 Head-to-Head Comparison

| Dimension | GLiNER2 | DeBERTa (fine-tuned) | LLM (Qwen3.5-27B) |
|-----------|---------|---------------------|-------------------|
| Entity spotting | Good (zero-shot) | Excellent (trained) | Excellent |
| Domain entities | Decent | Excellent after training | Excellent |
| Promises/decisions | Cannot | Cannot | Excellent |
| Relationships | Cannot | Cannot | Excellent |
| Emotional context | Cannot | Cannot | Excellent |
| Latency | ~20-40ms (GPU) | ~15-30ms (GPU) | 10-30s |
| VRAM | ~500MB | ~420MB | ~16GB |
| Training needed | None | Yes (nightly) | None |
| Day-1 usefulness | Immediate | Zero | Immediate |
| Improves over time? | No | Yes | No (static weights) |
| Cost per batch | ~0 | ~0 | GPU-seconds |

---

## 3. The Evolved Pipeline

### 3.1 Architecture

```
audio → WhisperX (GPU) → text
  → ASR correction (CPU, <10ms)                      ← kraken
  → GLiNER2 (GPU, ~20ms) → zero-shot entities        ← hawk
  → DeBERTa (GPU, ~20ms) → domain-trained entities    ← NEW creature
  → Merge + Deduplicate + Confidence fusion
  → LLM Supervisor (GPU, 2-5s, ONLY when needed):     ← unicorn (or 9B model)
      a) Verify low-confidence NER entities
      b) Extract semantic entities (promises, decisions, relationships, emotions)
      c) Resolve conflicts between GLiNER2 and DeBERTa
  → Validation pipeline                               ← fairy
  → Store
```
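
The merge + deduplicate + confidence fusion step might look like this (a sketch; the noisy-OR fusion rule is an assumption, not a measured policy):

```python
def merge_entities(gliner2, deberta):
    """Merge entity lists from both NER models, keyed on
    (lowercased text, label). When both models agree, fuse
    confidences upward (noisy-OR); otherwise keep the single hit."""
    merged = {}
    for source, entities in (("gliner2", gliner2), ("deberta", deberta)):
        for e in entities:
            key = (e["text"].lower(), e["label"])
            if key in merged:
                prev = merged[key]
                # noisy-OR: agreement between models raises confidence
                fused = 1 - (1 - prev["confidence"]) * (1 - e["confidence"])
                prev["confidence"] = round(fused, 3)
                prev["sources"].append(source)
            else:
                merged[key] = {**e, "sources": [source]}
    return list(merged.values())

g = [{"text": "Ellen", "label": "person", "confidence": 0.7}]
d = [{"text": "ellen", "label": "person", "confidence": 0.8},
     {"text": "Bangalore", "label": "location", "confidence": 0.9}]
merge_entities(g, d)  # Ellen fused to 0.94; Bangalore kept as-is
```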

### 3.2 What Changes

| Component | Current | Evolved |
|-----------|---------|---------|
| GLiNER2 | CPU, OFF | **GPU, ON by default** |
| DeBERTa | Not implemented | **New: fine-tuned NER model** |
| LLM role | Does everything | **Supervisor: verify + relationships only** |
| LLM model | Qwen3.5-27B (16GB) | **Could be 9B distilled (3GB)** |
| LLM calls | Every batch | **Only uncertain batches (~20-30%)** |
| Entity spotting | LLM (10-30s) | **NER models (~40ms)** |
| New creature | — | **eagle (DeBERTa NER)** |

### 3.3 Decision Routing Logic

Not every batch needs the LLM. The routing logic:

```python
# After NER models run:
ner_entities = gliner2_entities + deberta_entities  # merged + deduped
needs_semantic = has_promise_keywords(text) or has_decision_keywords(text)

# Case 1: all NER entities high-confidence, no semantic signals
# → store directly, skip the LLM entirely
if all(e.confidence > 0.85 for e in ner_entities) and not needs_semantic:
    store(ner_entities)
    return

# Case 2: low-confidence or conflicting entities
# → send to LLM for verification
if any(e.confidence < 0.6 for e in ner_entities) or has_conflicts(ner_entities):
    ner_entities = await llm.verify(ner_entities, text)

# Case 3: text contains semantic signals (promises, decisions, emotions)
# → send to LLM for semantic extraction (with NER hints pre-attached)
semantic = []
if needs_semantic:
    semantic = await llm.extract_semantic(text, ner_hints=ner_entities)

store(ner_entities + semantic)
```

**Promise/decision keyword detection** is a simple regex or keyword set:
- Promise signals: "I need to", "I have to", "remind me", "don't forget", "by tomorrow", "deadline"
- Decision signals: "we decided", "let's go with", "the plan is", "I chose"
- Emotion signals: "I feel", "stressed", "happy", "frustrated", "excited"

This keyword detection is cheap (CPU, <1ms) and acts as a gate: if no semantic signals in the text, the LLM is not called at all.
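
A sketch of the gate (the keyword lists mirror the signals above; compiling the patterns once at module load keeps the check well under 1ms):

```python
import re

# Word-boundary anchored so "I chose" doesn't fire inside longer words.
PROMISE_RE = re.compile(
    r"\b(i need to|i have to|remind me|don't forget|by tomorrow|deadline)\b",
    re.IGNORECASE)
DECISION_RE = re.compile(
    r"\b(we decided|let's go with|the plan is|i chose)\b",
    re.IGNORECASE)

def has_promise_keywords(text: str) -> bool:
    return PROMISE_RE.search(text) is not None

def has_decision_keywords(text: str) -> bool:
    return DECISION_RE.search(text) is not None
```

An emotion gate would follow the same pattern with its own keyword list.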

### 3.4 GPU Budget

On the DGX Spark (128 GB unified memory):

| Model | VRAM | % of 128GB | Source |
|-------|------|------------|--------|
| WhisperX (STT) | ~4.7 GB | 3.7% | Measured |
| GLiNER2 (NER) | ~0.5 GB | 0.4% | HuggingFace model card |
| DeBERTa-v3-base (NER) | ~0.42 GB | 0.3% | HuggingFace (FP16 + 20% overhead) |
| Qwen3.5-9B (LLM supervisor) | ~3 GB | 2.3% | Measured (Q4_K_M) |
| Qwen3-Embedding-8B | ~8 GB | 6.3% | Measured |
| **Total extraction pipeline** | **~16.6 GB** | **13.0%** | |

Compare to current: Qwen3.5-27B alone uses ~16 GB. The evolved pipeline uses roughly the same VRAM but is **10-100x faster** for 70-80% of batches (those that don't need LLM).

### 3.5 Can We Use 9B Instead of 27B?

**Yes, for the supervisor role.** The LLM is no longer extracting from scratch — it's:
1. Verifying pre-extracted entities (binary yes/no + confidence adjustment)
2. Extracting only semantic entities (promises, decisions, relationships)
3. Getting NER hints in the prompt (reduces hallucination)

This is a simpler task than full extraction. The Qwen3.5-9B Opus-Distilled model (already running on llama-server:8003) is well-suited:
- Good at structured reasoning (fine-tuned on chain-of-thought traces)
- Already loaded, no additional VRAM
- 6.2s average latency vs 27B's 10-30s
- Can be freed: if 9B handles supervision, 27B can be unloaded from Ollama

**Net VRAM savings:** ~16 GB (27B unloaded) − ~0.9 GB (GLiNER2 + DeBERTa) = **~15 GB freed** for other models.

---

## 4. The Self-Improving Loop (Teacher-Student Distillation)

This is a well-established pattern in NLP with multiple names (see Section 0.4 for full research):
- **Knowledge distillation** / **teacher-student learning** (general terms)
- **Targeted distillation** (NER-specific, coined by UniversalNER — USC + Microsoft)
- **Weak supervision** (when LLM labels are treated as noisy labels)

Our setup:
- **Teacher:** The LLM (large, slow, accurate) — Qwen3.5-27B or 9B
- **Student:** DeBERTa (small, fast, starts ignorant)
- **Training signal:** The teacher's verified outputs become the student's training data

**Precedent:** A 2024 Nature Scientific Reports study validated exactly this pattern — LLM generates weakly-labeled pseudo-data for a DeBERTa student model, which is then fine-tuned on gold-standard data. UniversalNER (2023) demonstrated that models distilled this way can **outperform ChatGPT's NER by 7-9 F1 points**.

### 4.1 Nightly Training Loop

The 3 AM cron window (`nightly.py`) already runs:
1. Memory tier decay (phoenix creature)
2. Stale validation expiry
3. Wonder + Comic generation

Add a 4th step: **DeBERTa fine-tuning.**

```
3:00 AM — Nightly window starts
  ├── [existing] Memory decay (phoenix)
  ├── [existing] Validation expiry
  ├── [existing] Wonder + Comic generation
  └── [NEW] DeBERTa NER training (eagle)
        1. Query today's LLM-verified entities from PostgreSQL
        2. Retrieve the original transcript segments (from segments table)
        3. Generate BIO-tagged training examples:
           - Tokenize each segment
           - Align verified entity names to token positions
           - Tag: B-PER, I-PER, B-LOC, I-LOC, etc.
        4. Add to training replay buffer (keep last 30 days)
        5. Fine-tune DeBERTa for 1-2 epochs on the full buffer
        6. Save new model checkpoint (versioned: deberta-ner-v{date})
        7. Hot-swap the model: next extraction cycle uses new weights
```
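
Step 3 — aligning verified entity strings to token positions — is the fiddly part. A whitespace-tokenized sketch (real code must also handle DeBERTa's subword tokenization):

```python
def tag_sentence(tokens, entities):
    """Produce IOB2 labels by matching each verified entity's
    token sequence inside the sentence. `entities` is a list of
    (surface_string, TYPE) pairs from the LLM-verified table."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for surface, etype in entities:
        ent_toks = surface.lower().split()
        n = len(ent_toks)
        for i in range(len(tokens) - n + 1):
            # only tag spans not already claimed by another entity
            if lowered[i:i + n] == ent_toks and labels[i] == "O":
                labels[i] = f"B-{etype}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{etype}"
    return labels

tokens = "I spoke with Ellen Sharma about Bangalore".split()
tag_sentence(tokens, [("Ellen Sharma", "PER"), ("Bangalore", "LOC")])
# → ['O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-LOC']
```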

### 4.2 Training Data Economics

| Timeframe | Estimated Data | DeBERTa Quality |
|-----------|---------------|-----------------|
| Week 1 | ~50-100 verified entities, ~200 sentences | Poor — barely better than random |
| Week 2 | ~200-400 entities, ~800 sentences | Mediocre — recognizes top-10 frequent names |
| Month 1 | ~1000-2000 entities, ~4000 sentences | Good — handles 50-60% of common entities |
| Month 3 | ~5000+ entities, ~15000 sentences | Strong — handles 80%+ of entity spotting |

The cold start problem is real but manageable: GLiNER2 covers entity spotting during weeks 1-4 while DeBERTa trains up. The LLM handles everything else. As DeBERTa improves, LLM calls decrease naturally.

### 4.3 Catastrophic Forgetting Mitigation

When fine-tuning incrementally (nightly), DeBERTa can forget old entities while learning new ones. Five strategies from the literature (see Section 0.5 for full citations):

**Strategy 1: Replay Buffer (recommended)**
- Keep a rolling buffer of the last 30 days of training data
- Each nightly fine-tune uses ALL data in the buffer, not just today's
- Uses **reservoir sampling** to ensure uniform random sample of all past data
- Old data naturally ages out after 30 days
- Simple, effective, low storage (~10MB for 30 days of NER data)
- The most widely adopted strategy per 2024-2025 surveys ([NeurIPS 2019](https://arxiv.org/abs/1811.11682))
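
Reservoir sampling keeps the buffer a uniform sample of everything seen so far, even as nightly data streams in. A textbook sketch (not our trainer code):

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer; reservoir sampling guarantees each
    example ever seen has equal probability of being retained."""
    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # replace a random slot with probability capacity/seen
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

buf = ReplayBuffer(capacity=1000, seed=0)
for i in range(10_000):
    buf.add(i)
# buf.items now holds a uniform sample of all 10k examples
```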

**Strategy 2: Elastic Weight Consolidation (EWC)**
- Track important weights using **Fisher Information Matrix**
- Add **quadratic penalty** when those weights change
- Available for spaCy NER: [spacy-ewc](https://pypi.org/project/spacy-ewc/)
- More complex to implement, modest improvement over replay buffer
- Reference: [PNAS 2017](https://www.pnas.org/doi/10.1073/pnas.1611835114)

**Strategy 3: Parameter-Efficient Fine-Tuning (PEFT / LoRA)**
- Freeze most model weights, only train small adapter layers
- Dramatically reduces forgetting since base weights are untouched
- LoRA adds low-rank decomposition to attention/feed-forward layers
- Reference: [PEFT survey 2025](https://arxiv.org/html/2504.13822v2)

**Strategy 4: Self-Authored Data Mixing (SA-SFT)**
- Generate self-dialogues before fine-tuning, mix with task data
- Mitigates forgetting while improving in-domain performance
- Reference: [EMNLP 2025](https://aclanthology.org/2025.emnlp-main.1108.pdf)

**Strategy 5: Periodic Full Retrain**
- Instead of nightly incremental, do a weekly full retrain from base checkpoint
- Use entire historical training data
- Simplest but wastes computation on stable entities

**Recommendation:** Start with **replay buffer** (Strategy 1). Simple, most proven, good enough. Consider adding **LoRA** (Strategy 3) if forgetting becomes measurable — it is complementary to replay buffer.

---

## 5. Implementation Phases

### Phase 1: GLiNER2 on GPU + ON by Default (Quick Win)

**Effort:** ~1 hour. Config change only.

```python
# config.py — change two lines:
GLINER2_ENABLED = os.getenv("GLINER2_ENABLED", "1") != "0"  # default ON
GLINER2_DEVICE = os.getenv("GLINER2_DEVICE", "cuda")  # GPU
```

**What this buys:**
- GLiNER2 runs on every extraction batch (~20ms on GPU vs ~200ms on CPU)
- NER hints always available in LLM prompt → better extraction accuracy
- No new models, no training infrastructure needed
- Hawk creature events in dashboard for every batch

**Risk:** Near zero. GLiNER2 is a pre-filter — its output is hints for the LLM, not final entities.

### Phase 2: DeBERTa Fine-Tuned NER + Nightly Training

**Effort:** ~2-3 days.

New files:
- `services/context-engine/deberta_ner.py` — Model loading, inference, BIO→entity conversion
- `services/context-engine/ner_trainer.py` — Training data generation, fine-tune loop, model versioning
- `services/context-engine/ner_merge.py` — Merge + deduplicate GLiNER2 and DeBERTa results
- Updates to `nightly.py` — Add training step to 3 AM cron
- Updates to `main.py` — Call both NER models, merge results

New creature: **eagle** (DeBERTa NER — precision hunter, trained on domain data)

Training infrastructure:
- Base model: `microsoft/deberta-v3-base` + token classification head
- Training: HuggingFace `Trainer` with `DataCollatorForTokenClassification`
- Model storage: `~/.local/share/her-os-models/deberta-ner/` (versioned checkpoints)
- Replay buffer: `~/.local/share/her-os-models/deberta-ner/training_data/` (30-day rolling JSONL)
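The BIO→entity step in `deberta_ner.py` is mechanical; a lenient decoder sketch (the function name and `(text, type)` tuple output are assumptions — the real module would presumably return richer entity objects with character offsets):

```python
def bio_to_entities(tokens, labels):
    """Convert per-token BIO labels (B-PER, I-PER, O, ...) into
    (entity_text, entity_type) spans. Lenient decoding: a stray I- tag
    with no matching B- is treated as the start of a new entity rather
    than discarded, which recovers from common model labeling slips."""
    entities, current, cur_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == cur_type:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), cur_type))
                current, cur_type = [], None
            if lab.startswith("I-"):  # stray I- tag: start a new entity
                current, cur_type = [tok], lab[2:]
    if current:
        entities.append((" ".join(current), cur_type))
    return entities
```

The merge step in `ner_merge.py` can then deduplicate these spans against GLiNER2's output before anything reaches the LLM.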

### Phase 3: LLM as Supervisor + Routing Logic

**Effort:** ~2-3 days.

Changes:
- New `services/context-engine/extraction_router.py` — Decision routing logic (when to call LLM)
- Keyword detection for promises/decisions/emotions
- Separate LLM prompts: `verify_prompt` (yes/no + confidence) vs `extract_semantic_prompt` (relationships, promises, decisions)
- Switch extraction LLM from 27B to 9B distilled
- Metrics: track LLM call rate (target: <30% of batches)
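The routing decision could be as small as this sketch (the keyword set, threshold, and return labels are all placeholders to be tuned against logged batches — the real `extraction_router.py` would load them from config):

```python
# Hypothetical keyword set for semantic content that NER cannot capture.
SEMANTIC_KEYWORDS = {
    "promise", "promised", "decide", "decided", "decision",
    "feel", "felt", "angry", "happy", "sad",
}

def route_batch(text: str, ner_confidence: float, threshold: float = 0.7) -> str:
    """Decide whether a batch needs the LLM.
    - semantic keywords present -> full LLM extraction
      (promises/decisions/emotions are beyond span-based NER)
    - low NER confidence       -> cheap LLM verification pass
    - otherwise                -> NER output is accepted as-is"""
    words = {w.strip(".,!?").lower() for w in text.split()}
    if words & SEMANTIC_KEYWORDS:
        return "llm_extract"
    if ner_confidence < threshold:
        return "llm_verify"
    return "ner_only"
```

Tracking the fraction of `"ner_only"` outcomes directly gives the <30% LLM call rate metric.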

### Phase 4: Unload 27B + Full Automation

**Effort:** ~1 day.

- Remove `qwen3.5:27b` from Ollama startup (saves 16 GB VRAM)
- Update `start.sh` to not preload 27B
- Monitor extraction quality for 1 week
- If regression: re-enable 27B as fallback

---

## 6. Risks and Mitigations

| Risk | Severity | Mitigation |
|------|----------|------------|
| DeBERTa cold start (weeks 1-4) | Low | GLiNER2 + LLM cover everything initially |
| Catastrophic forgetting | Medium | 30-day replay buffer |
| NER misses complex entities | Low | LLM still handles promises/decisions/relationships |
| GPU contention during training | Low | Training runs at 3 AM (low activity), takes ~10 min |
| Model versioning complexity | Medium | Simple date-stamped checkpoints, keep last 7 |
| Training data quality | Medium | Only use LLM-**verified** entities (not raw NER output) |

---

## 7. Success Metrics

| Metric | Current | Phase 1 Target | Phase 3 Target |
|--------|---------|---------------|----------------|
| Entity extraction latency (p50) | 15s | 12s (same + NER hints) | ~50ms (NER-only batches) |
| LLM calls per batch | 100% | 100% (still needed) | <30% |
| Entity spotting accuracy (F1) | ~0.85 (LLM) | ~0.85 (LLM + NER hints) | ~0.90 (NER + LLM verify) |
| GPU memory for extraction | 16 GB (27B) | 16.5 GB (27B + GLiNER2) | ~4 GB (9B + NER models) |
| Daily NER improvement | None | None | Yes (nightly DeBERTa training) |

---

## 8. Recommendation

**Phase 1 is a no-brainer.** Two config lines, immediate benefit, zero risk. Do it now.

**Phase 2 is the real evolution.** DeBERTa + nightly training is the core architectural shift. It's 2-3 days of work but fundamentally changes the extraction economics. Wait until after the factory reset so training data collection starts from a clean slate.

**Phase 3 depends on Phase 2 success.** Once DeBERTa handles 60%+ of entities reliably (estimated: month 2-3), introduce the routing logic and test 9B as supervisor.

**Phase 4 is the payoff.** Unloading the 27B model frees 16 GB VRAM — room for new models, bigger context windows, or additional services.

---

## 9. Connection to Existing Architecture

This evolution aligns with several existing ADRs:
- **ADR-022** (multi-stage pipeline): Adds two more stages before LLM
- **ADR-019** (dual-model routing): Extends the routing concept from LLM choice to "NER vs LLM"
- **ADR-023** (9B distilled for Annie): Same 9B model could serve as extraction supervisor
- **ADR-007** (temporal decay): DeBERTa training data uses the same salience/decay logic — older entities get less weight in training

The 3 AM nightly window (`nightly.py`) is the natural integration point for DeBERTa training — it already runs decay, validation expiry, wonder generation, and comic generation. Adding NER training makes it a fifth parallel task in the same window.

---

## 10. Integration Shortcut: Fine-Tune GLiNER2's Own Backbone

**Key discovery from code analysis:** GLiNER2 uses DeBERTa as its internal backbone. This means we have two paths for Phase 2:

**Path A (separate model):** Add a standalone DeBERTa NER model alongside GLiNER2. Two models, two inference calls, merge results.

**Path B (fine-tune GLiNER2's backbone):** Fine-tune the DeBERTa backbone inside GLiNER2 on domain data, then load the fine-tuned checkpoint via `GLiNER2.from_pretrained(local_path)`. Zero code changes in `ner.py`. Same API, better domain accuracy.

Path B is cleaner but riskier — fine-tuning GLiNER2's backbone might break its zero-shot capability. Path A is safer — keep GLiNER2 for zero-shot, add DeBERTa for domain. **Recommend Path A first, explore Path B later.**

---

## 11. Open Questions

1. **Should DeBERTa train on ALL entity types or only named entities (person, place, org)?** Promises and decisions might benefit from a separate classifier, not NER.
2. **ONNX export for GLiNER2?** Converting to ONNX could further reduce latency (15ms → 5ms) but adds export complexity. Worth exploring in Phase 2.
3. **Should we keep GLiNER2 long-term?** Once DeBERTa is trained, GLiNER2's zero-shot capability might be redundant for known types. Keep it for novel/rare entity types?
4. **Training on Titan vs CPU?** Fine-tuning at 3 AM could use GPU (faster, 10 min) or CPU (slower, 30-60 min but zero GPU contention). GPU is fine at 3 AM.
5. **What about UniversalNER?** UniversalNER (USC + Microsoft, 2023) used ChatGPT to distill NER into LLaMA models, outperforming ChatGPT by 7-9 F1 on 43 datasets. It covers 13K+ entity types. However, it's LLaMA-based (7B+), far larger than GLiNER2 (205M). Not a practical replacement for our CPU/small-GPU NER stage. Better as a reference for the distillation pattern (see Section 0.4).
6. **Path A vs Path B for DeBERTa integration?** Separate model (safe, more VRAM) vs fine-tune GLiNER2's backbone (elegant, may break zero-shot). See Section 10.
