# Research: Cross-Encoder Reranking for Context Engine

**Date:** 2026-03-11
**Trigger:** Rajesh's RAG class mentioned the `ms-marco-MiniLM` reranker, identified as a high-impact, low-risk upgrade to her-os retrieval.
**Decision:** Implement with `cross-encoder/ms-marco-MiniLM-L-6-v2` (CPU-only, ~91 MB, zero GPU contention).

## Current Pipeline

```
BM25 (PostgreSQL) + Vector (Qdrant) + Graph (Graphiti)
    → Reciprocal Rank Fusion (RRF, k=60)
    → Temporal Decay (half-life=30d, evergreen floor=0.3)
    → MMR Diversity (λ=0.7, Jaccard word overlap)
    → Return top-k
```
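The fusion and decay stages can be sketched with the parameters stated above (RRF k=60, 30-day half-life, 0.3 evergreen floor); function names are illustrative, not the real pipeline code:

```python
import math

def rrf_score(ranks, k=60):
    # Reciprocal Rank Fusion: each source contributes 1/(k + rank);
    # k=60 dampens the gap between top ranks across sources.
    return sum(1.0 / (k + r) for r in ranks)

def temporal_decay(score, age_days, half_life=30.0, evergreen_floor=0.3):
    # Exponential half-life decay, clamped so evergreen memories
    # never fall below 30% of their fused score.
    return score * max(0.5 ** (age_days / half_life), evergreen_floor)
```

For example, a memory ranked 1st by BM25 and 3rd by vector search fuses to `rrf_score([1, 3])` = 1/61 + 1/63 ≈ 0.032, and at exactly one half-life (30 days) its decayed score is halved.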

**Weakness:** After RRF and temporal decay, the relevance signal is a rank-based approximation. MMR diversifies but doesn't refine relevance — it uses simple word overlap, not learned semantic similarity.
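The limitation is easy to demonstrate with a toy Jaccard similarity of the kind MMR uses (a sketch; the real implementation may tokenize differently):

```python
def jaccard(a: str, b: str) -> float:
    # MMR-style word overlap: |A ∩ B| / |A ∪ B| over lowercased word sets
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Zero lexical overlap, despite an obvious semantic link:
print(jaccard("promise about groceries", "picking up milk"))  # → 0.0
```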

## Upgraded Pipeline

```
BM25 + Vector + Graph (retrieve limit×4 per source when reranker ON)
    → RRF → Temporal Decay
    → Cross-Encoder Rerank (score all pairs, sigmoid normalize, keep top limit×3)
    → MMR Diversity (now working on cross-encoder-scored candidates)
    → Return top-k
```

**Gain:** A cross-encoder scores (query, passage) pairs jointly — it sees both texts simultaneously, unlike a bi-encoder, which embeds them separately. This catches semantic matches that vector similarity misses (e.g., "promise about groceries" matching "picking up milk").
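A minimal sketch of the joint-scoring step using the sentence-transformers `CrossEncoder` API, with a plain sigmoid to squash the raw logits (the helper names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    # squashes unbounded logits into [0, 1] for downstream MMR
    return 1.0 / (1.0 + math.exp(-x))

def score_pairs(query, passages, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    # Each (query, passage) pair goes through one joint forward pass,
    # so the model attends across both texts simultaneously.
    from sentence_transformers import CrossEncoder  # lazy: defers the ~91 MB download
    model = CrossEncoder(model_name, max_length=512)
    return [sigmoid(float(s)) for s in model.predict([(query, p) for p in passages])]
```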

## Model: `cross-encoder/ms-marco-MiniLM-L-6-v2`

| Property | Value |
|----------|-------|
| Parameters | 22.7M |
| Download size | ~91 MB |
| Score range | Raw logits (unbounded) → normalized via sigmoid to [0, 1] |
| CPU throughput | ~601 pairs/sec (PyTorch), ~981 pairs/sec (ONNX O3) |
| Time to rerank 50 docs | ~83ms (PyTorch CPU), ~51ms (ONNX) |
| NDCG@10 (TREC DL 2019) | 74.30 |
| MRR@10 (MS Marco Dev) | 39.01 |
| Max sequence length | 512 tokens (query + passage combined) |
| GPU requirement | None — runs on CPU |

## Alternatives Considered

| Library | Model | Size | Quality | Deps |
|---------|-------|------|---------|------|
| `sentence-transformers` CrossEncoder | ms-marco-MiniLM-L-6-v2 | 91 MB | NDCG 74.30 | PyTorch (already installed) |
| `flashrank` | ms-marco-TinyBERT-L-2 | 4 MB | NDCG 69.84 | ONNX Runtime only |
| `flashrank` | ms-marco-MiniLM-L-12 | 34 MB | NDCG 74.04 | ONNX Runtime only |
| `rerankers` (Answer.AI) | any backend | varies | varies | Wrapper, adds indirection |

### Qwen3-Reranker (future consideration)

| Model | Params | Size | MTEB-R | Multilingual | API |
|-------|--------|------|--------|-------------|-----|
| Qwen3-Reranker-0.6B | 600M | ~1.2 GB | 65.80 | 100+ langs | `AutoModelForCausalLM` + chat template |
| Qwen3-Reranker-4B | 4B | ~8 GB | — | 100+ langs | Same |
| Qwen3-Reranker-8B | 8B | ~16 GB | 69.02 | 100+ langs | Same |

**Architecture:** LLM-based (causal language model with yes/no chat template). Very different from BERT-class cross-encoders. Requires GPU for practical speed — running 0.6B on CPU for 50 pairs would take seconds, not milliseconds.

**When to switch:** If multilingual reranking is needed (Kannada, Hindi) or if a GPU-idle scheduling window is available. A sentence-transformers wrapper exists (`tomaarsen/Qwen3-Reranker-0.6B-seq-cls`) for drop-in replacement.

**Not chosen now** because: (1) GPU contention is the #1 constraint, (2) MiniLM runs in 83ms on CPU vs seconds for Qwen3 on CPU, (3) her-os data is currently English-only.

**Choice for now:** `sentence-transformers` CrossEncoder with ms-marco-MiniLM-L-6-v2 — PyTorch/transformers already in requirements.txt, CPU-native, well-maintained.

## Implementation Details

### New file: `services/context-engine/reranker.py`
- Lazy-loaded singleton model (loaded on first query, stays in RAM)
- `cross_encoder_rerank(query, candidates, top_k)` — pure function with model call
- Sigmoid normalization of raw logits → replaces `final_score` for downstream MMR
- Feature-flagged via `RERANKER_ENABLED` env var (default ON)
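A sketch of what `reranker.py` could look like; the candidate dict shape (`text`, `final_score`) is an assumption about the pipeline's internal format, not the real module:

```python
import math
import os

_model = None

def _get_model():
    # Lazy-loaded singleton: load on first query, keep resident in RAM (~91 MB).
    global _model
    if _model is None:
        from sentence_transformers import CrossEncoder
        name = os.environ.get("RERANKER_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2")
        _model = CrossEncoder(name, max_length=512)
    return _model

def cross_encoder_rerank(query, candidates, top_k):
    # candidates: list of dicts with "text" and "final_score" (assumed shape)
    if os.environ.get("RERANKER_ENABLED", "1") == "0":
        return candidates[:top_k]  # feature flag off: pass through unchanged
    logits = _get_model().predict([(query, c["text"]) for c in candidates])
    for c, s in zip(candidates, logits):
        c["final_score"] = 1.0 / (1.0 + math.exp(-float(s)))  # sigmoid to [0, 1]
    return sorted(candidates, key=lambda c: c["final_score"], reverse=True)[:top_k]
```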

### Config additions: `services/context-engine/config.py`
- `RERANKER_ENABLED` — default `1` (ON), set `0` to disable
- `RERANKER_MODEL` — default `cross-encoder/ms-marco-MiniLM-L-6-v2`
- `RERANKER_TOP_K_FACTOR` — default `3` (pass 3×limit candidates to MMR after reranking)
- `RERANKER_RETRIEVAL_MULTIPLIER` — default `4` (retrieve 4×limit per source for better candidate pool)
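The four knobs could land in `config.py` as plain env reads (a sketch; the file's existing conventions may differ):

```python
import os

RERANKER_ENABLED = os.environ.get("RERANKER_ENABLED", "1") == "1"          # set "0" to disable
RERANKER_MODEL = os.environ.get("RERANKER_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2")
RERANKER_TOP_K_FACTOR = int(os.environ.get("RERANKER_TOP_K_FACTOR", "3"))  # MMR sees 3×limit
RERANKER_RETRIEVAL_MULTIPLIER = int(os.environ.get("RERANKER_RETRIEVAL_MULTIPLIER", "4"))
```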

### Integration: `services/context-engine/main.py`
- Insert reranker call between temporal decay and MMR in `retrieve_context()`
- When reranker is enabled, increase per-source retrieval limit by `RERANKER_RETRIEVAL_MULTIPLIER`
- Log reranker latency and score spread
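The wiring could look like the sketch below; all three callables are stand-ins for the real pipeline stages in `main.py`, and the function signature is hypothetical:

```python
import logging
import time

log = logging.getLogger("context_engine")

def retrieve_context(query, limit, retrieve_fn, rerank_fn, mmr_fn,
                     reranker_enabled=True, multiplier=4, top_k_factor=3):
    # retrieve_fn(query, per_source) -> RRF-fused, decay-adjusted candidates;
    # rerank_fn and mmr_fn are the later stages. Names are illustrative.
    per_source = limit * multiplier if reranker_enabled else limit
    candidates = retrieve_fn(query, per_source)
    if reranker_enabled and candidates:
        t0 = time.perf_counter()
        candidates = rerank_fn(query, candidates, limit * top_k_factor)
        scores = [c["final_score"] for c in candidates]
        log.info("rerank %.1f ms, score spread %.3f",
                 (time.perf_counter() - t0) * 1e3, max(scores) - min(scores))
    return mmr_fn(query, candidates, limit)
```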

### Resource impact
- **RAM:** ~91 MB for model weights (loaded once, stays resident)
- **CPU:** ~83ms per query (50 candidates) — negligible vs network latency
- **GPU:** Zero — no VRAM impact, no contention with voice/extraction
- **Disk:** ~91 MB model download (cached in `~/.cache/huggingface/`)

## Anti-Patterns to Avoid

1. **Don't use GPU for the cross-encoder** — it's tiny enough for CPU, and GPU contention is the whole reason we built the voice-active lease system
2. **Don't replace MMR with reranking** — they serve different purposes (relevance vs. diversity). Both are needed.
3. **Don't skip temporal decay** — cross-encoder has no concept of recency. Decay must run before reranking.
4. **Don't normalize cross-encoder scores across queries** — logits are only comparable within a single query's candidate set.

## References

- [sbert.net CrossEncoder API](https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html)
- [ms-marco-MiniLM-L6-v2 Model Card](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2)
- [FlashRank (lightweight alternative)](https://github.com/PrithivirajDamodaran/FlashRank)
- [rerankers (unified API)](https://github.com/AnswerDotAI/rerankers)
