# Research: HydraDB vs her-os Infrastructure

> **Date:** 2026-04-14 | **Status:** Complete | **Decision:** Not recommended for adoption

## TL;DR

HydraDB is a PostgreSQL columnar analytics extension (OLAP) powered by DuckDB. It's designed for **100M+ row analytical queries** on event/clickstream data. her-os is an **OLTP + hybrid retrieval** system processing **<1K rows/day** across 6 specialized storage layers. HydraDB solves a problem her-os doesn't have.

---

## What is HydraDB?

| Attribute | Details |
|-----------|---------|
| **Type** | PostgreSQL extension (uses `tableam` API from PG 12+) |
| **Core engine** | DuckDB vectorized execution via `pg_duckdb` |
| **Storage** | Columnar (column-oriented, versus PostgreSQL's default row-oriented heap) |
| **Founded** | 2021, Y Combinator W22 batch |
| **Funding** | $3.1M seed (Pioneer Fund, YC, Lombardstreet) |
| **License** | AGPL 3.0 (columnar) + Apache 2.0 (pg_duckdb) |
| **GitHub** | `hydradatabase/columnar` — 3,021 stars, 102 forks |
| **Maturity** | 1.0 GA, production-ready, actively maintained |
| **Deployment** | Docker (drop-in PG replacement), cloud hosted, self-hosted |

**Key claims:** 400x faster analytics vs vanilla PostgreSQL, 5x data compression, serverless autoscaling.

---

## HydraDB Feature Deep-Dive

### Strengths
- **Columnar storage:** Stores data by column, not row. Aggregations (`SUM`, `AVG`, `COUNT` with `GROUP BY`) scan only the needed columns — a massive speedup on wide tables.
- **DuckDB engine:** Vectorized query execution. Processes batches of column values in parallel, not row-by-row.
- **Compression:** 5x typical (columnar naturally compresses well — repeated values in columns).
- **Drop-in PostgreSQL:** Standard `psycopg2`/`asyncpg` drivers work. No code changes for reads.
- **Hybrid tables:** Can mix heap (row) tables for OLTP and columnar tables for OLAP in same database.
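The row-vs-column distinction above can be illustrated with a minimal pure-Python sketch (toy data, illustrative only — real columnar engines like DuckDB operate on compressed, typed, vectorized column batches):

```python
# Row-oriented layout: one record per dict, as in a PostgreSQL heap table.
rows = [
    {"user_id": 1, "event": "click", "amount": 10},
    {"user_id": 2, "event": "view",  "amount": 25},
    {"user_id": 1, "event": "click", "amount": 5},
]

# Column-oriented layout: one contiguous list per column, as in columnar storage.
cols = {
    "user_id": [1, 2, 1],
    "event":   ["click", "view", "click"],
    "amount":  [10, 25, 5],
}

# SUM(amount): the row layout touches every field of every record,
# while the columnar layout scans a single contiguous list.
total_row_layout = sum(r["amount"] for r in rows)
total_col_layout = sum(cols["amount"])
assert total_row_layout == total_col_layout == 40

# Repeated values within a column (e.g. "click") are also why columnar
# data compresses well (run-length / dictionary encoding).
```

The same contiguity is what lets DuckDB process a batch of `amount` values in one vectorized pass instead of dereferencing each row.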

### Limitations
- **No logical replication** on columnar tables
- **No foreign keys** on columnar tables
- **Only btree/hash indexes** (no GIN, no BRIN, no bitmap)
- **Small transactions slower** than heap — each columnar write has higher overhead
- **Updates are slow** — columnar storage is optimized for append-only / batch insert
- **Low concurrency** — parallel execution consumes high resources per query
- **No ANN search** (pgvector ivfflat not supported on columnar)
- **JSON operators `#>` and `#>>` not supported** on analytics tables

---

## her-os Current Infrastructure

### Storage Layers (6 active)

| Layer | Technology | Purpose | Volume/Year | Pattern |
|-------|-----------|---------|-------------|---------|
| **Source** | JSONL (filesystem) | Raw transcripts from Omi | 44-88 MB | Append-only, inotify watch |
| **Query Cache** | PostgreSQL 16 | Segments (tsvector+GIN BM25), entities, sessions | 65-130 MB | OLTP: upsert + full-text search |
| **Vector Index** | Qdrant | Segment/entity embeddings (1024-dim) | 365-730 MB | ANN similarity search |
| **Knowledge Graph** | Neo4j + Graphiti | Temporal entity relationships | 10-30 MB | Graph traversal (k-hop) |
| **Observability** | PostgreSQL + ring buffer | Service events, creature tracking | 25-76 MB | High-frequency append + SSE |
| **Memory** | Mem0 (planned) | Long-term personal memory | TBD | Hybrid graph/vector/KV |
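The observability layer's ring buffer can be sketched with a bounded `collections.deque` — a hypothetical simplification of the actual implementation, showing only the drop-oldest-on-overflow behavior:

```python
from collections import deque

# Bounded buffer: once maxlen is reached, appending silently evicts
# the oldest event, which matches the 7-day-retention spirit of the
# observability layer (capacity of 5 here purely for illustration).
events = deque(maxlen=5)
for i in range(8):
    events.append({"seq": i, "service": "ingest"})

assert len(events) == 5          # capacity respected
assert events[0]["seq"] == 3     # events 0-2 were evicted
```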

### PostgreSQL Tables (10)

| Table | Rows/Day | Key Indexes | Access Pattern |
|-------|----------|-------------|----------------|
| `segments` | 240-480 | GIN (tsvector), btree (session, created) | BM25 full-text search |
| `entities` | 5-15 | btree (type, name) | Lookup by type+name |
| `entity_mentions` | 20-60 | btree (entity_id, segment_id) | Join entity-to-segment |
| `entity_validations` | High churn | btree (status, expires) | Think+Validate pipeline |
| `sessions` | 8-15 | PRIMARY (session_id) | Metadata lookup |
| `nudge_log` | 0-3 | btree (entity, time) | Promise tracking |
| `wonder_log` | 1 | UNIQUE (date) | Daily briefing |
| `comic_log` | 1 | UNIQUE (date) | Daily comic |
| `observability_events` | 86K-260K | btree (timestamp, service) | Time-range queries, 7-day retention |
| `ingest_offsets` | ~10 | PRIMARY (file_path) | Resume tracking |
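The BM25-style search on `segments` presumably runs through a tsvector query like the following sketch. Column names (`ts`, `session_id`) are assumptions, not taken from the real `schema.sql`, and PostgreSQL's `ts_rank_cd` is a BM25-like ranking, not true BM25:

```python
def build_segment_search(query: str, limit: int = 20):
    """Build a parameterized full-text query against the segments table.

    Assumes a `ts` tsvector column backed by the GIN index; this is an
    illustrative sketch of the query shape, not the actual retrieve.py code.
    """
    sql = """
        SELECT id, session_id, ts_rank_cd(ts, q) AS rank
        FROM segments, websearch_to_tsquery('english', %s) AS q
        WHERE ts @@ q
        ORDER BY rank DESC
        LIMIT %s
    """
    return sql, (query, limit)

sql, params = build_segment_search("promise to call Rajesh")
assert "websearch_to_tsquery" in sql
assert params == ("promise to call Rajesh", 20)
```

This is exactly the query class HydraDB's columnar tables cannot serve: `@@` needs a GIN index over the tsvector, and columnar tables only support btree/hash.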

### Retrieval Pipeline (latency budget)

| Operation | Target | Actual | Engine |
|-----------|--------|--------|--------|
| BM25 search (tsvector) | <50ms | ~20-30ms | PostgreSQL GIN index |
| Vector search (ANN) | <10ms | 0.81ms | Qdrant HNSW |
| Graph traversal (5-hop) | <50ms | 1.9ms | Neo4j |
| Full hybrid retrieval | <100ms | 71.9ms | RRF fusion + temporal decay + MMR |
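The fusion step in the last row can be sketched as plain reciprocal rank fusion (RRF) — a simplified stand-in for the actual pipeline, which additionally applies temporal decay and MMR re-ranking:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge ranked id lists from BM25, vector,
    and graph retrieval. Each input list is best-first; k=60 is the
    conventional damping constant from the original RRF paper."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical segment ids from the three retrieval engines:
bm25   = ["s3", "s1", "s7"]   # PostgreSQL tsvector ranking
vector = ["s1", "s3", "s9"]   # Qdrant ANN
graph  = ["s1", "s7"]         # Neo4j traversal

fused = rrf_fuse([bm25, vector, graph])
assert fused[0] == "s1"   # s1 ranks highly in all three lists
```

Because RRF consumes only rank positions, the three engines' incomparable raw scores (BM25 weight, cosine distance, hop count) never need normalizing against each other.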

---

## Comparison Matrix

### Star Rating Guide
- 5 = Perfect fit, strong recommendation
- 4 = Good fit, minor gaps
- 3 = Adequate, some trade-offs
- 2 = Poor fit, significant drawbacks
- 1 = Wrong tool for the job

### Dimension Ratings

| Dimension | HydraDB | Current Stack | Winner | Notes |
|-----------|---------|---------------|--------|-------|
| **OLTP (transactional writes)** | 2 | 5 | Current | HydraDB columnar writes are slow for small transactions. her-os upserts 240-480 segments/day — classic OLTP. |
| **Full-text search (BM25)** | 1 | 5 | Current | HydraDB doesn't support GIN indexes. Would lose tsvector BM25 search — the backbone of retrieval. |
| **Analytics (aggregations)** | 5 | 3 | HydraDB | HydraDB shines at `GROUP BY`, `SUM`, time-series rollups. But her-os has no heavy analytics workload. |
| **Vector search (ANN)** | 1 | 5 | Current | HydraDB can't do ivfflat ANN. Qdrant handles this with 0.81ms latency. |
| **Graph queries** | 1 | 5 | Current | HydraDB has no graph capability. Neo4j + Graphiti handles temporal relationships. |
| **Data compression** | 5 | 3 | HydraDB | 5x columnar compression. But total data is <1.3 GB/10yr — compression is irrelevant at this scale. |
| **Concurrency** | 2 | 4 | Current | HydraDB's parallel execution consumes high resources per query. her-os needs concurrent retrieval + extraction + observability. |
| **Python/FastAPI compat** | 4 | 5 | Current | HydraDB uses standard PG drivers. But losing GIN/tsvector means rewriting `retrieve.py`. |
| **Deployment complexity** | 3 | 4 | Current | HydraDB is drop-in Docker, but adds AGPL license concern + new extension to maintain. Current PG 16 Alpine is battle-tested. |
| **Data volume fit** | 1 | 5 | Current | HydraDB targets 100M+ row OLAP. her-os processes ~300 rows/day. Using HydraDB here is like using a combine harvester to mow a lawn. |
| **Observability events** | 4 | 4 | Tie | HydraDB could handle 86K-260K events/day analytically. But current btree + 7-day retention + ring buffer already works fine. |
| **Replication** | 1 | 4 | Current | No logical replication on columnar tables. If multi-device sync is needed later, HydraDB blocks this path. |
| **Foreign keys** | 1 | 5 | Current | No FK support on columnar tables. entity_mentions FK to entities and segments would break. |

### Overall Scores

| | HydraDB for her-os | Current Stack |
|---|---|---|
| **Overall** | 2.0 / 5 | 4.5 / 5 |
| **Fit for current workload** | 1 / 5 | 5 / 5 |
| **Fit for 10-year scale** | 2 / 5 | 4 / 5 |
| **Migration effort** | 1 / 5 (massive rewrite) | N/A |

---

## Scenario Analysis: When Would HydraDB Make Sense?

### Scenario 1: "What topics did Rajesh discuss most in Q1 2027?"
- **Current:** `SELECT entity_name, COUNT(*) FROM entities JOIN entity_mentions ... WHERE created BETWEEN ... GROUP BY entity_name ORDER BY count DESC` — runs in <50ms on <10K rows. PostgreSQL handles this trivially.
- **HydraDB:** Would be faster on 10M+ rows. At her-os scale, no difference.
- **Verdict:** Not needed.
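The Scenario 1 query shape can be demoed end-to-end with `sqlite3` as a stand-in for PostgreSQL (the schema here is a hypothetical simplification of the real `entities`/`entity_mentions` tables):

```python
import sqlite3

# In-memory stand-in for the entities / entity_mentions join.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entity_mentions (entity_id INTEGER, segment_id INTEGER,
                                  created TEXT);
""")
db.executemany("INSERT INTO entities VALUES (?, ?)",
               [(1, "Rajesh"), (2, "her-os"), (3, "HydraDB")])
db.executemany("INSERT INTO entity_mentions VALUES (?, ?, ?)",
               [(1, 10, "2027-01-05"), (1, 11, "2027-02-01"),
                (2, 12, "2027-01-09"), (1, 13, "2027-03-20")])

# "Topics discussed most in Q1 2027" — a join + GROUP BY, trivial at this scale.
top = db.execute("""
    SELECT e.name, COUNT(*) AS mentions
    FROM entities e JOIN entity_mentions m ON m.entity_id = e.id
    WHERE m.created BETWEEN '2027-01-01' AND '2027-03-31'
    GROUP BY e.name ORDER BY mentions DESC
""").fetchall()
assert top[0] == ("Rajesh", 3)
```

At tens of thousands of rows this pattern stays comfortably inside the latency budget on any row store; columnar execution only changes the picture several orders of magnitude later.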

### Scenario 2: "Aggregate all observability events by creature per hour for dashboards"
- **Current:** btree index on (timestamp, service). 86K-260K rows/day with 7-day retention = ~600K-1.8M rows. PostgreSQL scans this in <100ms.
- **HydraDB:** Columnar would compress these well and aggregate faster. But 1.8M rows is still trivial for PostgreSQL.
- **Verdict:** Not needed until >50M rows retained (which requires changing the 7-day policy).

### Scenario 3: "Multi-user household with 5 family members, 5 years of data"
- **Current:** 5x volume = ~3.5 GB/5yr. PostgreSQL handles 3.5 GB easily.
- **HydraDB:** Still no need. With proper indexes, PostgreSQL routinely serves sub-100ms queries on tables up to ~100M rows.
- **Verdict:** Not needed.

### Scenario 4: "Product analytics across 10,000 her-os users (SaaS)"
- **Current stack:** Would struggle. 10K users x 300 rows/day = 3M rows/day ≈ 1.1B rows/year.
- **HydraDB:** Now it shines. Columnar compression + DuckDB vectorized aggregation on 1B rows = exactly the sweet spot.
- **Verdict:** HydraDB makes sense IF her-os becomes a multi-tenant SaaS product with thousands of users. Not relevant for personal/household use.
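The Scenario 4 back-of-envelope checks out:

```python
# Volume arithmetic for the hypothetical multi-tenant SaaS scenario.
users = 10_000
rows_per_user_per_day = 300

rows_per_day = users * rows_per_user_per_day   # 3,000,000
rows_per_year = rows_per_day * 365             # 1,095,000,000

assert rows_per_day == 3_000_000
assert rows_per_year > 1_000_000_000   # ~1.1B rows/year — genuine OLAP territory
```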

---

## Alternatives Worth Considering (if analytics needs grow)

| Alternative | Stars for her-os | Why |
|-------------|-----------------|-----|
| **Vanilla PostgreSQL (current)** | 5 | Already deployed, handles current and 10-year scale trivially |
| **TimescaleDB** | 3 | Best for time-series. Useful if observability events grow past 7-day retention to months/years. Overkill now. |
| **DuckDB (embedded)** | 3 | Could run ad-hoc analytics on exported JSONL/Parquet files without touching PostgreSQL. Zero infrastructure. Good for one-off research queries. |
| **pg_duckdb (extension only)** | 3 | Lighter than full HydraDB. Adds DuckDB engine to existing PG without columnar migration. |
| **Citus** | 2 | Distributed PostgreSQL. Only if multi-node sharding is needed. her-os is single-machine (DGX Spark). |

---

## Recommendation

### Do NOT adopt HydraDB. Current stack is correct.

**Reasons:**
1. **Wrong workload class.** her-os is OLTP + hybrid retrieval (BM25 + vector + graph). HydraDB is OLAP analytics. Adopting it would be replacing a screwdriver with a jackhammer.
2. **Would break core features.** No GIN indexes = no BM25 search. No ANN = no vector similarity. No FK = no entity-mention referential integrity.
3. **Data is tiny.** ~300 rows/day, <1.3 GB over 10 years. PostgreSQL 16 handles this without breaking a sweat.
4. **Migration cost is extreme.** Would require rewriting `retrieve.py` (BM25), `ingest.py` (upserts), `schema.sql` (drop all GIN indexes), and removing FK constraints.
5. **AGPL license risk.** AGPLv3 requires source disclosure of modifications, and its network clause applies even if her-os is only offered as a hosted service, not just distributed commercially.

### When to revisit
- If her-os becomes a **multi-tenant SaaS** with >1000 users (Scenario 4)
- If **observability retention** changes from 7 days to years (>100M rows)
- If a new **analytics dashboard** needs sub-second aggregations over billions of rows

None of these are on the current roadmap.

---

## References

- [HydraDB Official](https://www.hydra.so/)
- [HydraDB Docs](https://docs.hydra.so/overview)
- [GitHub: hydradatabase/columnar](https://github.com/hydradatabase/columnar) (3,021 stars)
- [GitHub: duckdb/pg_duckdb](https://github.com/duckdb/pg_duckdb)
- [Y Combinator: Hydra](https://www.ycombinator.com/companies/hydra)
- [HydraDB Benchmarks](https://github.com/hydradatabase/columnar/blob/main/BENCHMARKS.md)
- [Columnar FAQ](https://columnar.docs.hydra.so/concepts/faqs)
