# Hybrid Retrieval Architecture How Annie finds relevant memories when asked a question. --- ## Overview When a user asks Annie something like "What did I tell Priya about vacation?", the Context Engine runs a three-source hybrid retrieval pipeline. It searches keyword indexes, vector embeddings, and the knowledge graph in parallel, fuses results by rank, applies temporal decay, and then diversifies with MMR reranking. The top-K results are injected into Annie's LLM system prompt as a memory briefing. ``` User query | v +----------------------------------------------+ | Context Engine /v1/context | | | | +-----------+ +-----------+ +----------+ | | | BM25 | | Vector | | Graph | | | | PostgreSQL| | Qdrant | | Graphiti | | | | tsvector | | cosine | | Neo4j | | | +-----+-----+ +-----+-----+ +----+-----+ | | | | | | | +-------+------+------+-------+ | | | | | Reciprocal Rank Fusion (k=60) | | | | | Temporal Decay (30-day half-life) | | | | | MMR Reranking (lambda=0.7) | | | | | Top-K results | +----------------------------------------------+ | v Annie Voice (context_loader.py) | v block in LLM system prompt ``` --- ## 1. Current Retrieval Pipeline All three search sources are implemented and run in production on Titan. The `/v1/context` endpoint in `main.py` orchestrates the full pipeline. ### 1.1 BM25 Keyword Search PostgreSQL `segments` table with a GIN-indexed tsvector column auto-generated from segment text (`to_tsvector('english', text)`). Query construction in `retrieve.py:build_tsquery` strips tsquery operators and OR-connects all words for maximum recall: `"vacation with Priya"` becomes `vacation | with | Priya`. Fetches `limit * 4` candidates. BM25 is the baseline that always works. ### 1.2 Vector Similarity Search Qwen3-Embedding-8B (via Ollama `/api/embed`) embeds the query into a 1024-dim vector (truncated from native 4096-dim Matryoshka output). Qdrant cosine search against the `her_os_entities` collection returns up to `limit * 2` candidates. Segments are batch-embedded at ingest time as a non-blocking background task. If Qdrant is down, BM25 still works. ### 1.3 Graph Traversal Search Graphiti 0.28.1 searches the Neo4j temporal knowledge graph for entity nodes and relationship edges. Returns typed facts like `Rajesh -> Priya: discussed vacation plans for March`. Graph results get `timestamp = now` so temporal decay treats them as fresh (correct for consolidated knowledge). ### 1.4 Temporal Decay Weighting After RRF fusion, scores are adjusted for recency: ``` decay_factor = max(floor, 2^(-age_days / half_life)) final_score = rrf_score * decay_factor ``` - `TEMPORAL_DECAY_HALF_LIFE_DAYS = 30` - `EVERGREEN_FLOOR = 0.3` for types `{person, place, relationship}` | Age | Non-Evergreen | Evergreen (floor=0.3) | |-----|--------------|----------------------| | 0 days | 1.000 | 1.000 | | 30 days | 0.500 | 0.500 | | 60 days | 0.250 | 0.300 (floor) | | 120 days | 0.063 | 0.300 (floor) | Evergreen rationale (ADR-007): "Rajesh is married to Priya" must remain findable months later. Non-evergreen topics gracefully fade. ### 1.5 MMR Reranking for Diversity Maximal Marginal Relevance eliminates near-duplicate results: ``` MMR_score = lambda * relevance - (1 - lambda) * max_similarity_to_selected ``` - `MMR_LAMBDA = 0.7` (70% relevance, 30% diversity penalty) - `MMR_CANDIDATE_POOL_SIZE = 20` (cap before O(K^2) reranking) Similarity is Jaccard word-overlap. Word sets pre-computed once for the pool. Greedy selection: pick highest MMR score, repeat until K results selected. --- ## 2. Hybrid Pipeline Details ### 2.1 Vector Search via cuVS (GPU, 0.17ms) Titan validation proved cuVS IVF-Flat at 0.17ms per query on GPU -- 23x faster than Qdrant's HNSW on CPU (3.9ms). The current deployment uses Qdrant. The planned upgrade replaces Qdrant's search backend with cuVS while keeping Qdrant as the metadata store, or moves to a direct cuVS + PostgreSQL payload approach. ### 2.2 Graphiti Graph Traversal (1-2 Hops) Graphiti's search does entity-centric BFS: from matched entity nodes, it traverses 1-2 relationship hops to gather connected facts. For "What did I tell Priya about vacation?", it finds the `Priya` node, walks to connected edges (`told_about`, `planning`), and returns facts like "planning vacation to Goa in March." Edge weights from Graphiti's internal scoring determine result order. ### 2.3 Reciprocal Rank Fusion (RRF) Three ranked lists with incompatible scoring systems (BM25 ts_rank_cd 0-10+, cosine similarity 0-1, graph edge weight 0-1) are fused by rank position: ``` RRF_score(item) = SUM over all lists: 1 / (k + rank_i) ``` Where `k = 60` (from `config.py:RRF_K`) and `rank_i` is 1-indexed. If `seg-A` appears at rank 1 in BM25, rank 2 in vector, and rank 2 in graph: `RRF = 1/61 + 1/62 + 1/62 = 0.0486`. A single-list rank-1 item gets only 0.0164. Items confirmed by multiple retrieval methods score significantly higher. **Why RRF over raw score normalization:** BM25 scores (0-10+) and cosine similarity (0-1) are not comparable. Weighted mixing or min-max normalization is fragile. RRF uses only rank position -- no normalization needed. **Why k = 60:** Smooths rank differences so no single source dominates. Standard default from Cormack et al. (2009). Rank 1 (1/61 = 0.0164) vs rank 10 (1/70 = 0.0143) is a small difference. **Duplicates:** When a segment appears in multiple lists, the richest version (most metadata keys) is kept and RRF scores are summed. --- ## 3. Query Processing Flow **Step 1 -- Pre-session memory load.** At connection, `context_loader.py` calls `GET /v1/context?query=recent+conversations+and+key+facts&limit=10`. Timeout: 3 seconds. If Context Engine is down, Annie starts fresh. **Step 2 -- Parallel fan-out.** Three concurrent `asyncio.create_task()` calls. **Step 3 -- BM25.** `build_tsquery()` converts query to OR-connected tokens. PostgreSQL GIN index scan. Pool: `limit * 4`. **Step 4 -- Vector.** `embed_text()` via Qwen3-Embedding-8B (~75ms). `search_vectors()` cosine search in Qdrant. Pool: `limit * 2`. **Step 5 -- Graph.** `search_graph()` calls Graphiti over Neo4j. Returns entity-relationship edges. Normalized via `normalize_graph_results()`. **Step 6 -- RRF fusion.** `reciprocal_rank_fusion(bm25, vector, graph)` merges lists by rank. Multi-list items get boosted. **Step 7 -- Temporal decay.** `apply_temporal_decay(fused, score_key="rrf_score")` applies age-based decay. Evergreen entities floor at 0.3. **Step 8 -- MMR.** `mmr_rerank(decayed, k=limit)` selects top-K diversified. **Step 9 -- Response.** `RetrievalResult` Pydantic models with `text`, `source_session`, `timestamp`, `relevance_score`, `entities`, `speaker`. --- ## 4. Scoring and Ranking Details ### Score Lifecycle ``` BM25: ts_rank_cd --| Vector: cosine (0-1) |--> RRF score --> * decay_factor --> final_score --> MMR selection Graph: edge weight (0-1)-| ``` 1. **RRF score:** Rank-based, typically 0.005-0.050 range 2. **Final score:** RRF * decay factor (depends on age) 3. **MMR selection:** Greedy pick based on `lambda * final_score - (1-lambda) * max_similarity` ### Candidate Pool Sizes BM25 fetches 4x the requested limit. Vector fetches 2x. Graph fetches 1x. For `limit=5`: up to 35 candidates enter RRF (20 BM25 + 10 vector + 5 graph). --- ## 5. Performance Targets and Benchmarks All latencies measured on NVIDIA DGX Spark (128 GB unified memory, Blackwell GPU): | Stage | Target | Measured on Titan | |-------|--------|-------------------| | Query embedding (Qwen3-Embedding-8B) | <80ms | 75.1ms | | BM25 search (PostgreSQL GIN) | <5ms | 0.137ms | | Vector search (Qdrant cosine) | <5ms | 3.9ms | | Vector search (cuVS IVF-Flat, planned) | <1ms | 0.17ms | | Graph traversal (Neo4j Cypher) | <10ms | 6.9ms | | RRF fusion | <1ms | <0.1ms | | Temporal decay | <1ms | <0.1ms | | MMR reranking | <1ms | <0.5ms | | **Total retrieval** | **<100ms** | **~86ms** | | Context load timeout | 3.0s | Well within budget | Total retrieval is dominated by the embedding step (75ms). BM25, vector, and graph search run in parallel via `asyncio.create_task()`, so wall-clock time is `max(embedding+vector, bm25, graph)`, not the sum. --- ## 6. Context Injection Into Annie's LLM Prompt ### Memory Loading At `on_client_connected` (before the user speaks), Annie calls: ```python memory_text = await load_memory_context( context_engine_url=CONTEXT_ENGINE_URL, token=CONTEXT_ENGINE_TOKEN, query="recent conversations and key facts", limit=10, ) ``` Timeout: 3 seconds (`CONTEXT_LOAD_TIMEOUT`). Graceful degradation: if Context Engine is down, Annie starts without memory. ### Briefing Format Results are formatted as a bulleted list by `format_memory_briefing()`: ``` Here's what you remember from recent conversations: - Rajesh said: "We should plan that Goa trip for March" - Priya said: "Let me check the dates with my parents" ``` Long texts are truncated at 200 characters per result. ### Sanitization (ARCH-8) Before prompt injection: - Control characters stripped (except `\n`, `\t`) - Total length capped at 2000 characters (`MAX_MEMORY_CHARS`) - Wrapped in XML tags with an injection warning: ```xml Here's what you remember from recent conversations: - Rajesh said: "We should plan that Goa trip for March" ``` The XML comment prevents prompt injection via adversarial content stored in prior conversations. ### System Prompt Assembly The memory block is appended after the base system prompt in `bot.py`: ```python context.messages[0]["content"] = SYSTEM_PROMPT + "\n\n" + memory_text ``` Annie's base prompt defines her personality ("warm, thoughtful, genuinely interested"). The memory block adds grounding in actual conversations. --- ## 7. Worked Example: "What did I tell Priya about vacation?" **BM25** returns (OR-connected tokens `What | did | I | tell | Priya | about | vacation`): | Rank | Text | |------|------| | 1 | "We should plan that Goa trip, Priya" | | 2 | "I told Priya we can do March for vacation" | | 3 | "Priya said she needs to check dates" | **Vector** returns (query embedded to 1024-dim, cosine search): | Rank | Text | |------|------| | 1 | "I told Priya we can do March for vacation" | | 2 | "We should plan that Goa trip, Priya" | | 3 | "Looking at flights to Goa for next month" | "Looking at flights" has no keyword overlap with "Priya" or "vacation" but is semantically related -- this is what vector search adds. **Graph** returns edges: `Rajesh -> Priya: planning vacation to Goa in March` **RRF fusion:** "I told Priya..." at rank 2 (BM25) + rank 1 (vector) = `1/62 + 1/61 = 0.0325`. "Looking at flights..." at rank 3 (vector only) = `1/63 = 0.0159`. **Temporal decay:** If "I told Priya..." was 3 days ago: `0.0325 * 0.933 = 0.0303`. If "flights to Goa" was 45 days ago: `0.0159 * 0.354 = 0.0056`. The Rajesh-Priya relationship edge (evergreen) decays no lower than 0.3. **MMR reranking:** Near-duplicate segments are penalized. Diverse results (flights segment, graph edge) are pushed up. **Annie receives:** ```xml Here's what you remember from recent conversations: - Rajesh said: "I told Priya we can do March for vacation" - Rajesh said: "We should plan that Goa trip, Priya" - Rajesh said: "Looking at flights to Goa for next month" - Rajesh -> Priya: planning vacation to Goa in March ``` Annie answers: "You told Priya you'd like to go to Goa in March. You were also looking at flights. Want me to check current prices?" --- ## Graceful Degradation ``` All three working: BM25 + Vector + Graph --> full hybrid Qdrant down: BM25 + Graph --> keyword + relationships Neo4j down: BM25 + Vector --> keyword + semantic Both down: BM25 only --> keyword search (always works) Context Engine down: Annie starts fresh --> no memory, still functional ``` Each search helper catches exceptions and returns `[]`. BM25 is the foundation. Vector and graph are purely additive. --- ## Configuration Reference `services/context-engine/config.py`: | Constant | Value | Purpose | |----------|-------|---------| | `DEFAULT_LOOKBACK_HOURS` | 168 (7 days) | BM25 time window | | `TEMPORAL_DECAY_HALF_LIFE_DAYS` | 30 | Decay half-life | | `MMR_LAMBDA` | 0.7 | Relevance vs diversity balance | | `MMR_CANDIDATE_POOL_SIZE` | 20 | Cap before O(K^2) reranking | | `DEFAULT_RESULT_LIMIT` | 5 | Default top-K | | `RRF_K` | 60 | RRF smoothing constant | | `EMBEDDING_MODEL` | `qwen3-embedding:8b` | Ollama embedding model | | `EMBEDDING_DIM` | 1024 | Matryoshka truncation dimension | | `QDRANT_COLLECTION` | `her_os_entities` | Vector store collection | `services/context-engine/retrieve.py`: | Constant | Value | Purpose | |----------|-------|---------| | `EVERGREEN_TYPES` | `{person, place, relationship}` | Never fully decay | | `EVERGREEN_FLOOR` | 0.3 | Minimum decay factor | `services/annie-voice/context_loader.py`: | Constant | Value | Purpose | |----------|-------|---------| | `CONTEXT_LOAD_TIMEOUT` | 3.0s | Max wait for Context Engine | | `MAX_MEMORY_CHARS` | 2000 | Cap on injected memory text | --- ## File Reference | File | Role | |------|------| | `services/context-engine/retrieve.py` | BM25 query builder, RRF, temporal decay, MMR, normalizers | | `services/context-engine/config.py` | All retrieval constants | | `services/context-engine/models.py` | `RetrievalResult` Pydantic model | | `services/context-engine/main.py` | `/v1/context` endpoint, parallel search orchestration | | `services/context-engine/embed.py` | Ollama embedding wrapper (Qwen3-Embedding-8B) | | `services/context-engine/qdrant_store.py` | Qdrant vector store client | | `services/context-engine/graphiti_client.py` | Graphiti/Neo4j knowledge graph client | | `services/context-engine/schema.sql` | PostgreSQL schema with GIN indexes | | `services/annie-voice/context_loader.py` | Pre-session memory loading and prompt injection | | `services/annie-voice/bot.py` | System prompt assembly with memory block |