# Research: Entity Clustering for Memory Zone Visualization

**Date:** 2026-03-05
**Context:** Dashboard memory zone shows ~37-50 entities in 8 type columns.
When a column has >8 entities, we need to cluster overflow into "+N" group dots.
Entity names are short (1-3 words): "chai", "coffee", "Holi", "joy", "happiness".

**Decision needed:** Which clustering approach for grouping same-type entities?

---

## 1. String/Token Overlap (Levenshtein, Jaccard, Jaro-Winkler)

### How It Works
Compute pairwise distance between entity names using character-level or token-level metrics,
then feed the distance matrix into agglomerative clustering.

### Metrics
| Metric | Best For | Example |
|--------|----------|---------|
| Levenshtein | Typo detection, near-duplicates | "coffee" vs "coffees" = 1 |
| Jaro-Winkler | Name matching (shared prefix bonus) | "Rajesh" vs "Raj" ≈ 0.88 |
| Jaccard (token) | Word-set overlap | "drink preferences" vs "food preferences" = 0.33 |
| Dice coefficient | Similar to Jaccard, less harsh | Same pair scores 0.5 |
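
The token-level rows are easy to verify by hand; a minimal pure-Python sketch (no library needed):

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard: |intersection| / |union| of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def dice(a: str, b: str) -> float:
    """Dice: 2*|intersection| / (|A| + |B|) -- always >= Jaccard."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(sa & sb) / (len(sa) + len(sb))

print(round(jaccard("drink preferences", "food preferences"), 2))  # 0.33
print(round(dice("drink preferences", "food preferences"), 2))     # 0.5
```

Note that "joy" vs "happiness" scores 0.0 on both: no shared tokens, no signal.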

### Libraries
- **[RapidFuzz](https://github.com/rapidfuzz/RapidFuzz)** — C++ core, 10-100x faster than thefuzz.
  `cdist()` computes full N x N matrix in one call. For N=50 items, <1ms.
- **[strsim](https://pypi.org/project/strsim/)** — Pure Python, 12 algorithms in one package.
- **[textdistance](https://pypi.org/project/textdistance/)** — 30+ algorithms, optional C backends.

### Pros
- Zero external dependencies (no GPU, no model, no network)
- Sub-millisecond for N < 100
- Deterministic — same input always produces same clusters
- Already have Jaccard in `retrieve.py` (`_text_similarity()`)

### Cons
- **No semantic awareness** — "joy" and "happiness" score 0.0 (no shared characters)
- **No conceptual grouping** — "chai" and "coffee" have no string overlap
- Only catches near-duplicates and typo variants
- Useless for our primary use case (grouping semantically related entities)

### Verdict: SUPPLEMENT ONLY
Good for dedup (pre-processing step), not for semantic grouping.
Use RapidFuzz `cdist` as a fast pre-filter to merge obvious duplicates before semantic clustering.

---

## 2. Embedding-Based Clustering

### How It Works
1. Embed each entity name with a sentence/text embedding model
2. Compute cosine similarity matrix from embeddings
3. Cluster using agglomerative, k-means, or HDBSCAN

### Our Infrastructure
We already have:
- **Ollama** running on Titan with `qwen3-embedding:8b` (4096-dim, Matryoshka)
- **`embed.py`** with `embed_text()` and `embed_batch()` (batch = single model forward pass)
- **Qdrant** for vector storage + cosine search
- Matryoshka truncation to 1024 dims at 98% quality
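
Matryoshka truncation is just slice-and-renormalize; a minimal sketch (the 4096 → 1024 dims mirror our config; `truncate_matryoshka` is an illustrative helper, not an existing function in `embed.py`):

```python
import numpy as np

def truncate_matryoshka(vec: list[float], dim: int = 1024) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding and
    re-normalize so downstream cosine math remains valid."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    return v / np.linalg.norm(v)

emb = truncate_matryoshka(np.random.rand(4096).tolist())
print(emb.shape)  # (1024,)
```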

### Embedding Short Text Challenge
Short entity names (1-3 words) produce less informative embeddings than full sentences.
Mitigations:
- **Prompt engineering**: embed `"Entity type: topic. Name: chai"` instead of just `"chai"`
- **Contextual enrichment**: append first_seen date, related entities, or snippet from extraction
- Research paper (PeerJ 2024) found that `paraphrase-distilroberta-base-v1` significantly
  outperformed other models for short text clustering; however, any modern embedding model
  with 768+ dims captures semantic nuance better than bag-of-words approaches

### Pipeline for Our Use Case
```python
from embed import embed_batch
from sklearn.cluster import AgglomerativeClustering
import numpy as np

async def cluster_entities(names: list[str], entity_type: str) -> np.ndarray:
    """Cluster entity names of the same type. Returns one cluster label per name."""
    # Enrich short names with type context
    texts = [f"{entity_type}: {name}" for name in names]
    embeddings = await embed_batch(texts)  # single Ollama call

    # L2-normalize rows so cosine distances are well-behaved
    X = np.array(embeddings)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.5,  # tune this
        metric='cosine',         # named `affinity` in scikit-learn < 1.2
        linkage='average',
    )
    return clustering.fit_predict(X)
```

### Latency Analysis
| Step | Latency (N=20 entities) | Notes |
|------|-------------------------|-------|
| embed_batch() | ~75ms | Single Ollama call, GPU, batch native |
| Normalize + cluster | <1ms | NumPy + sklearn, tiny matrix |
| **Total** | ~80ms | Acceptable for 30s poll interval |

For N=50: embed_batch ~150ms (linear in batch size). Still fine.

### Pros
- **Semantic awareness** — "joy" and "happiness" cluster together
- **Conceptual grouping** — "chai" and "coffee" recognized as beverages
- Already have the infrastructure (Ollama, embed.py, GPU)
- One network call (batch embedding), rest is local NumPy
- Embedding quality is high (qwen3-embedding:8b is state-of-the-art)

### Cons
- Requires Ollama to be running (dependency on GPU service)
- Short text embeddings are noisier than long text
- Need to tune distance_threshold per entity type
- Not deterministic (model updates could change clusters)

### Verdict: PRIMARY APPROACH
Best balance of semantic quality, latency, and infrastructure reuse.

---

## 3. LLM-Based Grouping

### How It Works
Send entity names to an LLM with a prompt asking it to group them semantically.

### Prompt Design
```
Given these entities of type "topic":
["chai", "coffee", "drink preferences", "Holi", "color festival", "Diwali", "work stress"]

Group them into semantic clusters. Return JSON:
{"groups": [{"label": "beverages", "members": ["chai", "coffee", "drink preferences"]},
            {"label": "festivals", "members": ["Holi", "color festival", "Diwali"]},
            {"label": "work", "members": ["work stress"]}]}
```
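
Output parsing is the fragile part: a defensive parser can validate members against the input list, drop hallucinated names, and keep any omitted names as singletons. A sketch (`parse_groups` is hypothetical):

```python
import json

def parse_groups(raw: str, valid_names: list[str]) -> dict[str, list[str]]:
    """Parse the LLM grouping JSON defensively: drop hallucinated members,
    re-attach any names the model omitted as singleton groups."""
    data = json.loads(raw)
    valid = set(valid_names)
    groups: dict[str, list[str]] = {}
    seen: set[str] = set()
    for g in data.get("groups", []):
        members = [m for m in g.get("members", []) if m in valid]
        if members:
            groups[g.get("label", "unlabeled")] = members
            seen.update(members)
    for name in valid - seen:  # anything the LLM dropped becomes a singleton
        groups[name] = [name]
    return groups
```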

### Research: LLM-Guided Clustering (ACL 2025)
Paper: "[LLM-Guided Semantic-Aware Clustering for Topic Modeling](https://aclanthology.org/2025.acl-long.902/)"
- Uses LLM to generate candidate topic labels, then refines by merging similar ones
- Outperforms BERTopic on topic coherence metrics
- But designed for thousands of documents, not 8-20 short names

### Research: Human-Interpretable Clustering (Royal Society Open Science 2025)
Paper: "[Human-interpretable clustering of short text using large language models](https://royalsocietypublishing.org/doi/10.1098/rsos.241692)"
- Compares LDA, doc2vec+GMM, and MiniLM+GMM for short text (Twitter bios, 160 chars)
- LLM-based embeddings (MiniLM) + GMM produce most human-interpretable clusters
- Key insight: LLMs bridge the "validation gap" between cluster production and interpretation
- However, they used embeddings (approach #2), not direct LLM prompting

### Latency Analysis
| Model | Latency | Cost |
|-------|---------|------|
| qwen3.5:35b-a3b (local) | 2-5s | $0 |
| Claude API | 1-2s | ~$0.003/call |
| Ollama smaller model | 0.5-1s | $0 |

### Pros
- **Best semantic understanding** — understands "Holi" is a festival, "chai" is Indian tea
- **Human-readable labels** — generates cluster names ("beverages", "festivals")
- **Cultural context** — important for our Indian-context personal AI
- **Flexible** — handles edge cases, ambiguous entities, mixed languages

### Cons
- **Slowest approach** — 2-5s per call (too slow for real-time dashboard refresh)
- **Non-deterministic** — same input may produce different groupings
- **Overkill for <20 items** — the LLM is wasted on such small input
- **Fragile output parsing** — JSON parsing failures, hallucinated entity names
- **GPU contention** — competes with STT, TTS, extraction on shared GPU

### Verdict: OFFLINE PRE-COMPUTATION ONLY
Use for nightly batch clustering (alongside decay job at 3 AM).
Cache results in DB. Dashboard reads cached clusters, not live LLM calls.

---

## 4. Agglomerative / Hierarchical Clustering (scipy/sklearn)

### How It Works
Bottom-up: each entity starts as its own cluster, pairs are merged iteratively
based on a linkage criterion until a distance threshold or target cluster count is reached.

### Linkage Methods
| Method | Strategy | Best For |
|--------|----------|---------|
| Ward | Minimize within-cluster variance | Euclidean distance only, even-sized clusters |
| Complete | Minimize max pairwise distance | Compact clusters, sensitive to outliers |
| Average | Minimize mean pairwise distance | Balanced approach, works with cosine distance |
| Single | Minimize closest-pair distance | Chaining effect, good for elongated clusters |

**Recommendation: `average` linkage with cosine distance.**
Ward requires Euclidean (not cosine). Single creates chain-like clusters.
Average gives the best trade-off for semantic embeddings.

### Implementation
```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# After embedding
dist_matrix = pdist(X, metric='cosine')
Z = linkage(dist_matrix, method='average')

# Cut dendrogram at threshold
labels = fcluster(Z, t=0.5, criterion='distance')

# OR: cut to a target number of clusters (max_clusters is an int you choose)
labels = fcluster(Z, t=max_clusters, criterion='maxclust')
```

### Dendrogram for Visual Debugging
```python
from scipy.cluster.hierarchy import dendrogram
dendrogram(Z, labels=entity_names)  # visual cluster structure
```

### Why Hierarchical > Flat for Our Use Case
- **Adaptive**: cut at different thresholds based on column crowding
  - Column has 12 items, want max 8 visible? Cut to get ~4 clusters + singletons = ~8 display items
  - Column has 30 items? Cut more aggressively
- **Dendrogram**: shows which entities are "closest" — useful for tooltip ("chai + coffee + tea")
- **No k selection**: distance_threshold adapts to data density
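
The adaptive cut described above can be sketched directly on top of the scipy linkage output (`cut_for_display` is a hypothetical helper; assumes `Z` from the earlier snippet):

```python
from scipy.cluster.hierarchy import fcluster

def cut_for_display(Z, n_items: int, max_visible: int = 8):
    """Fit a crowded column into max_visible display dots: if the column
    already fits, every entity is its own group; otherwise cut the
    dendrogram so at most max_visible clusters remain."""
    if n_items <= max_visible:
        return list(range(1, n_items + 1))
    return fcluster(Z, t=max_visible, criterion="maxclust")
```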

### Latency
For N < 100: `pdist()` + `linkage()` + `fcluster()` total < 1ms.
The bottleneck is embedding, not clustering.

### Pros
- Sub-millisecond clustering
- Scipy is a standard dependency (already in most Python stacks)
- Adaptive cut threshold matches our "show top N, cluster rest" requirement
- Deterministic given same embeddings
- Dendrogram provides explainability

### Cons
- Not a standalone solution — needs a distance matrix (from embeddings or string similarity)
- O(N^2) distance matrix, but N < 100 so irrelevant

### Verdict: THE CLUSTERING ALGORITHM
Pair with embedding-based distances (approach #2). This is the clustering method.

---

## 5. Community Detection (Neo4j / Graph-Based)

### How It Works
Entities in our knowledge graph (Neo4j via Graphiti) have relationships.
Community detection algorithms (Louvain, Leiden) find groups of densely connected nodes.

### Algorithms Available in Neo4j GDS
| Algorithm | Speed | Quality | Our Use |
|-----------|-------|---------|---------|
| **Leiden** | Fast | Better than Louvain (guarantees well-connected communities) | Preferred |
| **Louvain** | Very fast | Good, but can produce poorly-connected communities | Legacy |
| **Label Propagation** | Fastest | Lower quality | Too noisy for <50 nodes |
| **WCC** (Weakly Connected Components) | O(N) | Finds disconnected subgraphs | Pre-filter only |

### GraphRAG Approach (Microsoft)
Microsoft's GraphRAG uses **Hierarchical Leiden** — recursive community detection until
communities are small enough. Each community gets an LLM-generated summary.
This is exactly the pattern we'd want: hierarchical communities with labels.

Reference: [GraphRAG Dataflow](https://microsoft.github.io/graphrag/index/default_dataflow/)

### Implementation
```cypher
// Neo4j GDS: create an in-memory graph projection
CALL gds.graph.project('entities', 'Entity', 'RELATED_TO');

// Run Leiden
CALL gds.leiden.stream('entities', {maxLevels: 3})
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS entity, communityId;
```

### Latency
- Graph projection: ~50ms (cached)
- Leiden on 50 nodes: <10ms
- **But**: requires Neo4j GDS plugin (separate from base Neo4j)

### Pros
- **Uses existing relationships** — clusters entities that actually co-occur in conversations
- **Hierarchical** — Leiden produces multi-level communities
- **No embedding needed** — pure graph structure
- **Meaningful clusters** — entities mentioned together cluster together

### Cons
- **Requires Neo4j GDS plugin** — extra installation, licensing (Community edition is free)
- **Relationship-dependent** — new entities with few relationships won't cluster well
- **Graph sparsity** — with only 37-50 entities and possibly few relationships,
  community detection may produce trivial results (one big community or all singletons)
- **Different axis than we need** — we cluster within a type column (e.g., all topics),
  but graph communities cross types (a person + their topics + their places)

### Verdict: FUTURE ENHANCEMENT
Valuable when the graph is denser (200+ entities, rich relationships).
Not practical now with sparse graph. Would need same-type subgraph projection.

---

## 6. Topic Modeling (LDA, BERTopic)

### BERTopic Pipeline
```
Embeddings → UMAP (dimensionality reduction) → HDBSCAN (clustering) → c-TF-IDF (topic naming)
```

### BERTopic for Our Use Case
```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

# For small datasets, use KMeans instead of HDBSCAN
# HDBSCAN needs min_cluster_size which is tricky for N < 20
cluster_model = KMeans(n_clusters=4)
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, probs = topic_model.fit_transform(entity_names)
```

### BERTopic Hierarchical Topics
BERTopic supports [hierarchical topic modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) —
produces a dendrogram of topic merges. Could be useful for adaptive clustering.

### Problems for Our Use Case
1. **Minimum viable dataset**: BERTopic expects 100+ documents. With 8-20 entity names per type column, UMAP dimensionality reduction is meaningless (reducing 1024-dim to 5-dim with only 8 data points produces garbage)
2. **HDBSCAN min_cluster_size**: default is 10. With 8 items, no clusters form.
3. **c-TF-IDF**: designed for document-length text, not 1-3 word names.
4. **Heavy dependency**: BERTopic pulls in sentence-transformers, UMAP, HDBSCAN, sklearn — large install footprint.

### LDA (Latent Dirichlet Allocation)
Even worse for short text: treats each entity name as a "document" with 1-3 words.
No meaningful word co-occurrence patterns to discover.

### Pros
- BERTopic is well-maintained, good documentation
- Hierarchical topics feature is useful conceptually
- Automatic topic naming via c-TF-IDF

### Cons
- **Fundamentally wrong tool for N < 100 short items**
- Heavy dependency chain
- UMAP is stochastic — non-deterministic
- Overkill — we just need to group 8-20 names, not model topics across thousands of documents

### Verdict: NOT RECOMMENDED
Designed for large corpora of long documents. Our use case (small N, short text) violates every assumption BERTopic makes. The embedding + agglomerative approach gives us the semantic benefit without the overhead.

---

## 7. Entity Resolution / Deduplication (Related Problem)

### Why It Matters
Before clustering, we should deduplicate: "Rajesh" and "rajesh" and "Rajesh Kumar" may be the same entity.
Entity resolution is a prerequisite to clustering, not a replacement.

### Approaches

**String-based (fast pre-filter):**
```python
from rapidfuzz import fuzz

def dedup_groups(names: list[str], threshold: int = 85) -> list[int]:
    """Union-find over near-duplicate pairs; returns a group id per name."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, name1 in enumerate(names):
        for j, name2 in enumerate(names[i + 1:], i + 1):
            if fuzz.ratio(name1.lower(), name2.lower()) > threshold:
                parent[find(j)] = find(i)  # near-duplicates: merge groups
    return [find(i) for i in range(len(names))]
```

**Embedding-based:**
Already handled by our approach #2 — if two entities have cosine similarity > 0.95,
they're likely duplicates rather than just semantically similar.

**LLM-based (highest quality):**
GraphRAG (KGGEN, Mo et al. 2025) uses iterative LLM-guided clustering to merge
equivalent entities beyond surface matching. Our Graphiti integration already does
entity resolution during extraction.

### dedupe Library
[dedupe](https://github.com/dedupeio/dedupe) — Python library for accurate fuzzy matching
and entity resolution. Uses active learning. Overkill for 50 entities but good reference.

### Our Current State
- `extract.py` already does entity resolution during extraction (Graphiti handles it)
- `upsert_entity` in `db.py` creates deterministic IDs: `f"{entity_type}:{name.lower()}"`
- Case normalization prevents "Rajesh" vs "rajesh" duplicates
- But "chai" vs "masala chai" vs "Indian tea" are currently separate entities

### Verdict: PRE-PROCESSING STEP
Run RapidFuzz dedup before embedding-based clustering.
Merge obvious duplicates, then cluster the rest semantically.

---

## Recommended Architecture

### Two-Phase Approach

**Phase A: Dedup (string-based, <1ms)**
```
RapidFuzz cdist → merge entities with ratio > 85
```

**Phase B: Semantic Clustering (embedding-based, ~80ms)**
```
embed_batch() → cosine distance → agglomerative (average linkage) → fcluster
```

### Where It Runs

**Option 1: Backend (Context Engine)**
- New endpoint: `GET /v1/entities?clustered=true`
- Clustering runs server-side after list_entities query
- Returns entities with `cluster_id` and `cluster_label` fields
- Dashboard reads clusters, renders top-N + "+M" badges
- **Pro**: GPU access for embeddings, caching, single source of truth
- **Con**: Adds latency to entity endpoint

**Option 2: Nightly Batch (3 AM job)**
- Compute clusters during nightly decay job
- Store `cluster_id` in entities table
- Dashboard reads pre-computed clusters
- **Pro**: Zero latency impact on dashboard polling
- **Con**: Clusters stale until next nightly run

**Option 3: Dashboard (TypeScript)**
- String-only clustering (no embeddings available in browser)
- RapidFuzz equivalent: [fuzzball](https://www.npmjs.com/package/fuzzball) for JS
- **Pro**: No backend changes
- **Con**: No semantic awareness — only catches near-duplicates

### Recommendation: Option 1 with Caching

```
GET /v1/entities → list_entities() → cluster if >8 per type column
                                    → cache clusters (invalidate on entity change)
                                    → return entities with cluster_id
```

Cache invalidation: re-cluster when entity count changes or on manual trigger.
Worst case re-computation: 150ms (50 entities, batch embed + cluster).
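
A minimal sketch of the cache (hypothetical helper; real invalidation would hook into entity writes): key on a fingerprint of the sorted names so any change to the entity set produces a new key.

```python
import hashlib

_cluster_cache: dict[str, object] = {}

def cluster_cached(entity_type: str, names: list[str], compute):
    """Memoize clustering per (type, entity-set); compute() runs only
    when the set of names for this type changes."""
    digest = hashlib.sha256("\x00".join(sorted(names)).encode()).hexdigest()
    key = f"{entity_type}:{digest}"
    if key not in _cluster_cache:
        _cluster_cache[key] = compute(names)
    return _cluster_cache[key]
```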

### Dashboard Display Logic

```
For each type column:
  entities = all entities of this type
  if len(entities) <= MAX_VISIBLE (8):
    render all as individual dots
  else:
    top_n = entities sorted by salience, take top (MAX_VISIBLE - cluster_count)
    clusters = pre-computed clusters from backend
    for each cluster:
      render single dot with "+N" badge
      tooltip shows cluster members
      click expands cluster
```

### Data Model Addition
```sql
ALTER TABLE entities ADD COLUMN cluster_id TEXT;
ALTER TABLE entities ADD COLUMN cluster_label TEXT;
-- cluster_id: "topic:beverages", cluster_label: "Beverages"
-- NULL means entity is displayed individually (not clustered)
```

---

## Comparison Matrix

| Approach | Semantic Quality | Latency (N=50) | Dependencies | Our Infra | Recommendation |
|----------|-----------------|-----------------|--------------|-----------|----------------|
| String overlap | None (dedup only) | <1ms | rapidfuzz | Have Jaccard | Dedup pre-filter |
| **Embedding + Agglomerative** | **High** | **~80ms** | **sklearn, numpy** | **Have Ollama, embed.py** | **PRIMARY** |
| LLM prompting | Highest | 2-5s | LLM API | Have qwen3.5 | Nightly label gen |
| Agglomerative (sklearn) | N/A (needs distances) | <1ms | sklearn | Standard | Clustering algo |
| Community detection (Neo4j) | Graph-based | ~60ms | Neo4j GDS | Have Neo4j | Future (sparse graph) |
| BERTopic / LDA | Moderate | ~500ms | Heavy chain | None | Not recommended |
| Entity resolution | Dedup quality | <1ms | rapidfuzz | Have in extract.py | Pre-processing |

---

## Implementation Priority

### Phase 1 (Immediate — solves column overflow)
1. Add `cluster_id`, `cluster_label` columns to entities table
2. Backend: after `list_entities`, run embedding + agglomerative clustering for crowded columns
3. Cache clusters, invalidate on entity count change
4. Dashboard: render clustered entities as single dots with "+N" badge

### Phase 2 (Enhancement — better labels)
1. Nightly LLM job generates human-readable cluster labels
2. Store labels in `cluster_label` column
3. Dashboard tooltip shows label + member list

### Phase 3 (Future — graph-aware)
1. When entity count > 200 and graph is dense
2. Use Leiden community detection for cross-type clustering
3. Hierarchical communities for drill-down navigation

---

## Key References

### Papers
- [Human-interpretable clustering of short text using LLMs](https://royalsocietypublishing.org/doi/10.1098/rsos.241692) — Royal Society Open Science, Jan 2025. MiniLM+GMM best for short text.
- [LLM-Guided Semantic-Aware Clustering for Topic Modeling](https://aclanthology.org/2025.acl-long.902/) — ACL 2025. LLM generates + refines topic labels.
- [Experimental study on short-text clustering using transformer-based semantic similarity](https://peerj.com/articles/cs-2078/) — PeerJ 2024. paraphrase-distilroberta-base-v1 best for SentenceBERT clustering.
- [Short Text Clustering Algorithms, Application and Challenges: A Survey](https://www.mdpi.com/2076-3417/13/1/342) — Applied Sciences 2023. Comprehensive survey.

### Tools & Libraries
- [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) — Fast string matching (C++ core)
- [scikit-learn AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)
- [scipy.cluster.hierarchy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) — Linkage + dendrogram
- [BERTopic](https://maartengr.github.io/BERTopic/) — Topic modeling (not recommended for our case)
- [Neo4j GDS Leiden](https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/) — Community detection
- [dedupe](https://github.com/dedupeio/dedupe) — Entity resolution library
- [GraphRAG](https://microsoft.github.io/graphrag/) — Microsoft's hierarchical entity clustering + community summaries
- [AmpliGraph](https://docs.ampligraph.org/en/1.1.0/tutorials/ClusteringAndClassificationWithEmbeddings.html) — KG embedding clustering

### Existing Codebase
- `services/context-engine/embed.py` — `embed_text()`, `embed_batch()` (Ollama qwen3-embedding:8b)
- `services/context-engine/retrieve.py` — `_text_similarity()` (Jaccard), MMR reranking
- `services/context-engine/config.py` — `EMBEDDING_MODEL`, `EMBEDDING_DIM` (1024)
- `services/context-engine/db.py` — `list_entities()`, `upsert_entity()`
- `services/context-engine/dashboard/src/entities/store.ts` — Entity polling + reconciliation
- `services/context-engine/dashboard/src/canvas/memoryZone.ts` — 8 type columns, tier bands
