# Titan (DGX Spark) — Setup Recipes & Gotchas

> **Purpose:** Single source of truth for bringing up the her-os stack on Titan.
> Every gotcha, workaround, and exact command is documented here so we never waste time debugging the same issues again.

## Hardware Profile

| Component | Value |
|-----------|-------|
| Architecture | **aarch64** (ARM64, NOT x86) |
| OS | Ubuntu 24.04.4 LTS |
| GPU | NVIDIA GB10 (Blackwell, SM_120/SM_121) |
| GPU Driver | 580.126.09 |
| CUDA | **13.0** (system) |
| Memory | 128 GB unified (shared CPU/GPU) |
| Python | 3.12.3 (system) |

## Critical Gotcha: NVIDIA CUDA Package Naming

**The #1 source of wasted time.** NVIDIA ships separate pip packages for CUDA 12 and CUDA 13:

| Package | CUDA 12 | CUDA 13 (Titan) |
|---------|---------|-----------------|
| CuPy | `cupy-cuda12x` | `cupy-cuda13x` |
| cuGraph | `cugraph-cu12` | `cugraph-cu13` |
| cuVS | `cuvs-cu12` | `cuvs-cu13` |
| pylibraft | `pylibraft-cu12` | `pylibraft-cu13` |
| libraft | `libraft-cu12` | `libraft-cu13` |
| pylibcugraph | `pylibcugraph-cu12` | `pylibcugraph-cu13` |

**ALL RAPIDS packages require:** `--extra-index-url=https://pypi.nvidia.com`

### The Shared Module Directory Trap

Both `-cu12` and `-cu13` variants install to the **same Python module directory** (e.g., both `cupy-cuda12x` and `cupy-cuda13x` install to `cupy/`).

**NEVER run `pip uninstall <package>-cu12` if the `-cu13` version is installed!** It will delete the shared module files, breaking the cu13 installation. If you accidentally do this, you must `pip install --force-reinstall <package>-cu13` to recreate the files.
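A quick way to detect this trap before it bites: list installed distributions and flag any base package present in both CUDA-12 and CUDA-13 variants. A minimal sketch (the suffix pairs cover the RAPIDS `-cu12`/`-cu13` and CuPy `-cuda12x`/`-cuda13x` naming schemes from the table above):

```python
# Flag base packages installed in BOTH CUDA-12 and CUDA-13 variants.
# A hit means the shared module directory is at risk.
from importlib.metadata import distributions

def find_cuda_variant_conflicts(installed):
    """Given lowercase distribution names, return bases with both variants."""
    installed = set(installed)
    pairs = (("-cu12", "-cu13"), ("-cuda12x", "-cuda13x"))
    conflicts = set()
    for name in installed:
        for old, new in pairs:
            if name.endswith(old) and name[: -len(old)] + new in installed:
                conflicts.add(name[: -len(old)])
    return sorted(conflicts)

if __name__ == "__main__":
    names = {(d.metadata["Name"] or "").lower() for d in distributions()}
    for base in find_cuda_variant_conflicts(names):
        print(f"WARNING: '{base}' has both cu12 and cu13 variants installed")
```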

### The numpy Cascade

RAPIDS `--force-reinstall` pulls in `numpy>=2.4` which **breaks numba** (requires `numpy<2.3`), which **breaks Whisper**.

**RULE:** After ANY RAPIDS package installation, always run:
```bash
pip install 'numpy<2.3'
```
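A lightweight sanity check for the pin, useful at the end of an install script. This is a simplified major.minor comparison, not a full PEP 440 parser; pass it `numpy.__version__`:

```python
# Simplified check of the numpy pin (compares major.minor only).
def satisfies_numpy_pin(version, ceiling=(2, 3)):
    """True if `version` is strictly below `ceiling` on (major, minor)."""
    parts = []
    for token in version.split("."):
        digits = "".join(ch for ch in token if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts[:2]) < ceiling
```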

## DNS Resolution — systemd-resolved Fix

DGX Spark uses `systemd-resolved` with a stub resolver at `127.0.0.53`. On WiFi connections, it periodically loses its upstream DNS configuration, causing all hostname resolution to fail while IP connectivity works fine.

**Symptoms:**
- `nslookup nvcr.io` → `communications error to 127.0.0.53#53: timed out`
- `ping 8.8.8.8` → works fine
- Docker builds fail with: `dial tcp: lookup nvcr.io on 127.0.0.53:53: i/o timeout`
- `resolvectl status` shows `Current Scopes: none` on all interfaces

**Quick fix (immediate, lost on reboot):**
```bash
# Find the active network interface
ip route show default  # → "default via ... dev wlP9s9 ..."

# Set Google DNS on that interface
sudo resolvectl dns wlP9s9 8.8.8.8 8.8.4.4
sudo resolvectl domain wlP9s9 "~."

# Verify
nslookup nvcr.io  # Should resolve immediately
```

**Permanent fix (survives reboot):**
```bash
sudo mkdir -p /etc/systemd/resolved.conf.d
sudo tee /etc/systemd/resolved.conf.d/dns.conf << 'EOF'
[Resolve]
DNS=8.8.8.8 8.8.4.4
FallbackDNS=1.1.1.1 1.0.0.1
EOF
sudo systemctl restart systemd-resolved
```

**Note:** Docker daemon has `{"dns": ["8.8.8.8", "8.8.4.4"]}` in `/etc/docker/daemon.json` which helps running containers, but Docker BuildKit image pulls still go through the host's systemd-resolved.
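To distinguish "DNS broken, IP fine" programmatically, for example from a health-check script, a small sketch: `getaddrinfo` exercises the systemd-resolved stub, while a raw TCP connection to a public IP does not (the `nvcr.io` hostname and the `8.8.8.8:53` probe are illustrative):

```python
import socket

def dns_ok(host, port=443):
    """True if the system resolver can resolve `host`."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False

def ip_connectivity_ok(addr="8.8.8.8", port=53, timeout=2.0):
    """True if a direct TCP connection to a public IP succeeds (no DNS used)."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if ip_connectivity_ok() and not dns_ok("nvcr.io"):
        print("DNS down but IP works: likely the resolved stub, apply the resolvectl fix")
```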

## Environment Setup — Complete Recipe

### Step 1: Create venv
```bash
python3 -m venv /tmp/her-os-tier1
source /tmp/her-os-tier1/bin/activate
pip install --upgrade pip  # 24.0 → 26.x (needed for version parsing)
```

### Step 2: Install PyTorch (nightly, CUDA 12.8 wheels work on Blackwell via SM_90 binary compat)
```bash
pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128
```

### Step 3: Install RAPIDS stack (cu13, from NVIDIA PyPI)
```bash
pip install cupy-cuda13x cugraph-cu13 cuvs-cu13 cudf-cu13 pylibraft-cu13 libraft-cu13 pylibcugraph-cu13 \
  --extra-index-url=https://pypi.nvidia.com
```

### Step 4: Install ML models & tools
```bash
pip install transformers accelerate sentencepiece  # Embedding models
pip install openai-whisper                          # STT (large-v3)
pip install 'kokoro>=0.8' soundfile                 # TTS (quote so '>' isn't a shell redirect)
pip install gliner                                  # NER
```

### Step 5: Install backend dependencies
```bash
pip install fastapi uvicorn asyncpg redis aioredis   # Web + DB
```

### Step 6: Pin numpy (CRITICAL — must be LAST pip install)
```bash
pip install 'numpy<2.3'
```

### Step 7: Verify all imports
```bash
python3 -c "
import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')
import cupy; print(f'CuPy {cupy.__version__}')
import cugraph; print(f'cuGraph {cugraph.__version__}')
from cuvs.neighbors import ivf_flat; print('cuVS OK')
import pylibraft; print(f'pylibraft {pylibraft.__version__}')
import whisper; print('Whisper OK')
import kokoro; print('Kokoro OK')
from gliner import GLiNER; print('GLiNER OK')
"
```

## Audio Pipeline — Fresh Install

### Step 1: Build Docker images + start containers

```bash
cd ~/workplace/her/her-os/services/audio-pipeline
./run.sh                 # builds images + starts audio pipeline + SER sidecar
./run.sh logs            # watch for "Warmup complete — ready for traffic!"
```

### Step 2: Wait for first warmup (one-time ~276s)

The first run after an image rebuild takes ~276s to JIT-compile all CUDA kernels from PTX to SASS. This populates the CUDA JIT cache (629 MB). Subsequent restarts use the cache → **3-8s warmup**.

```bash
# Watch warmup progress:
docker logs -f her-os-audio 2>&1 | grep -i warmup

# After warmup, verify fast restart:
docker restart her-os-audio
docker logs --since 30s her-os-audio 2>&1 | grep -i warmup
# Expected: "Warmup: total 3.1s" (with warm cache)
```

### Step 3: Verify

```bash
# Test transcription
curl -X POST http://localhost:9100/transcribe \
  -F "audio=@test-her-movie.wav"
```

### Hot deploy (code changes only)

Source files are volume-mounted — no rebuild needed for code changes:
```bash
git pull && docker restart her-os-audio
# Restart takes ~3-8s with warm CUDA cache
```

### CUDA JIT Cache — Critical for SM_121

The `run.sh` sets these environment variables to make the CUDA JIT cache work:
```bash
CUDA_CACHE_DISABLE=0       # NGC base image disables cache by default!
CUDA_CACHE_PATH=/cuda-cache # persistent volume mount
CUDA_CACHE_MAXSIZE=2147483648  # 2 GB
CUDA_FORCE_PTX_JIT=0       # USE the cache! Base image sets =1 which defeats it
```

**If warmup is slow (~276s) on restart:** the cache was cleared or invalidated. Common causes:
- Image rebuild (the host-mounted cache survives, but new kernels need one warm run)
- CUDA driver version changed (invalidates all cache entries)
- Cache directory was deleted: `~/.local/share/her-os-audio/cuda-cache/`
- Cache files are root-owned (created inside the container) — clearing them from the host requires `sudo`
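To confirm the cache is actually populated rather than silently recreated empty, a small sketch that sums file sizes under the cache directory named above:

```python
# Verify the CUDA JIT cache is populated (expect ~629 MB after a full warmup;
# near-zero means it was cleared and the next start pays the ~276s JIT cost).
from pathlib import Path

def dir_size_mb(root):
    """Total size in MB of all files under `root`, 0.0 if it doesn't exist."""
    root = Path(root).expanduser()
    if not root.exists():
        return 0.0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file()) / 1e6

if __name__ == "__main__":
    print(f"{dir_size_mb('~/.local/share/her-os-audio/cuda-cache'):.0f} MB")
```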

Full rebuild only needed for Dockerfile/dependency changes:
```bash
./run.sh build && ./run.sh
```

## Docker Services

### Start PostgreSQL + Redis
```bash
docker run -d --name titan-postgres -p 15432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  postgres:16-alpine

docker run -d --name titan-redis -p 16379:6379 \
  redis:7-alpine
```

### Verify
```bash
docker exec titan-redis redis-cli ping       # → PONG
docker exec titan-postgres psql -U postgres -c "SELECT 1"  # → 1
```

## LD_LIBRARY_PATH — Required for RAPIDS Native Libraries

RAPIDS packages install native `.so` files in non-standard locations. You **must** set `LD_LIBRARY_PATH` before importing cuVS, cuGraph, or cuDF:

```bash
SITE=/tmp/her-os-tier1/lib/python3.12/site-packages
export LD_LIBRARY_PATH=$SITE/libcuvs/lib64:$SITE/libcugraph/lib64:$SITE/libcudf/lib64:$SITE/librmm/lib64:$SITE/libkvikio/lib64:$SITE/libucxx/lib64:$SITE/ucxx/lib64:$SITE/rapids_logger/lib64:$SITE/lib64:$SITE/nvidia/libnvcomp/lib64:$SITE/libraft/lib64:$LD_LIBRARY_PATH
```

**Shorthand for running any Python script:**
```bash
source /tmp/her-os-tier1/bin/activate
SITE=/tmp/her-os-tier1/lib/python3.12/site-packages
LD_LIBRARY_PATH=$(find $SITE -maxdepth 2 -name 'lib64' -type d | tr '\n' ':')$LD_LIBRARY_PATH python3 your_script.py
```
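A Python equivalent of the `find` one-liner, handy when a launcher script needs to build the path itself. Note it also globs `nvidia/*/lib64` (depth 3, e.g. `libnvcomp`), which the `-maxdepth 2` find misses even though the explicit export lists `nvidia/libnvcomp/lib64`:

```python
# Build LD_LIBRARY_PATH from every lib64 dir under site-packages.
import os
from pathlib import Path

def rapids_library_path(site_packages, existing=""):
    """Colon-joined lib64 directories, with any existing path appended last."""
    site = Path(site_packages)
    patterns = ("lib64", "*/lib64", "nvidia/*/lib64")
    dirs = sorted(str(p) for pat in patterns for p in site.glob(pat) if p.is_dir())
    return ":".join(dirs + ([existing] if existing else []))

if __name__ == "__main__":
    site = "/tmp/her-os-tier1/lib/python3.12/site-packages"
    os.environ["LD_LIBRARY_PATH"] = rapids_library_path(
        site, os.environ.get("LD_LIBRARY_PATH", ""))
```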

## HuggingFace Token

Saved at `~/.huggingface/token` on both Titan and Beast. The raw token value lives only in that file — don't paste it into docs or shell history.

Set via environment if needed: `HF_TOKEN=$(cat ~/.huggingface/token)`

## Kokoro TTS — Blackwell GPU Patch (MANDATORY)

Kokoro's `TorchSTFT` uses complex tensor operations that trigger NVIDIA's Jiterator JIT compiler, which fails on Blackwell (SM_120/SM_121) because nvrtc doesn't recognize the architecture.

**Root cause:** `torch.abs()`, `torch.angle()`, and `magnitude * torch.exp(phase * 1j)` on complex CUDA tensors use Jiterator → nvrtc → fails.

**Fix:** Decompose to real/imag operations (avoids complex tensors entirely):

```python
import torch
import kokoro.istftnet as istftnet

def _patched_transform(self, input_data):
    fwd = torch.stft(input_data, self.filter_length, self.hop_length, self.win_length,
                     window=self.window.to(input_data.device), return_complex=False)
    real, imag = fwd[..., 0], fwd[..., 1]
    return torch.sqrt(real**2 + imag**2 + 1e-8), torch.atan2(imag, real)

def _patched_inverse(self, magnitude, phase):
    real = magnitude * torch.cos(phase)
    imag = magnitude * torch.sin(phase)
    return torch.istft(torch.complex(real, imag), self.filter_length, self.hop_length,
                       self.win_length, window=self.window.to(magnitude.device)).unsqueeze(-2)

istftnet.TorchSTFT.transform = _patched_transform
istftnet.TorchSTFT.inverse = _patched_inverse
```

**Why `torch.complex()` works but `torch.abs()` doesn't:** `torch.complex()` is a constructor (creates complex from real components, no JIT). `torch.abs()` on complex is an elementwise op that goes through Jiterator.

**Apply BEFORE importing `KPipeline`.**

## Annie-Voice — GPU TTS Setup (Replaces Docker CPU Container)

The `annie-voice` service originally used Kokoro-FastAPI as a Docker container (CPU, ~400ms). Session 110 switched to **direct GPU inference in-process** (~30ms).

### What changed

| Before | After |
|--------|-------|
| `ghcr.io/remsky/kokoro-fastapi-cpu:latest` Docker container | `kokoro>=0.8` pip package, in-process |
| HTTP API (`localhost:8880/v1/audio/speech`) | Direct `kokoro.KPipeline` GPU inference |
| ~400ms latency (CPU) | ~30ms latency (GPU, Blackwell-patched) |
| Separate container, separate process | Runs in annie-voice Python process alongside Whisper STT |

### Setup steps

```bash
cd ~/workplace/her/her-os/services/annie-voice
source .venv/bin/activate  # or whatever your venv is

# 1. Install kokoro (needs torch already installed)
pip install 'kokoro>=0.8'

# 2. Stop the old Docker container (no longer needed)
docker compose stop kokoro-tts
docker compose rm -f kokoro-tts

# 3. Start the voice agent — Kokoro loads in-process with Blackwell patch
python server.py
# Log output should show:
#   "Blackwell patch applied: TorchSTFT.transform + .inverse (SM_121 safe)"
#   "KokoroTTS: loaded on cuda (voice=af_heart, lang=a)"
```

### Key files

| File | Purpose |
|------|---------|
| `blackwell_patch.py` | TorchSTFT monkey-patch (imported by kokoro_tts.py) |
| `kokoro_tts.py` | Direct GPU inference via `kokoro.KPipeline` |
| `bot.py` | `KOKORO_DEVICE` env var (default: `cuda`) |
| `docker-compose.yml` | Kokoro service commented out (SearXNG remains) |

### Benchmark results (2026-03-01)

| Mode | Average Latency | Speedup |
|------|----------------|---------|
| CPU Docker (Kokoro-FastAPI) | **922ms** | — |
| GPU Direct (Blackwell-patched) | **46ms** | **20.2x** |

Text: "Hey Rajesh, good morning! How are you doing today?" (5 runs each)

### VRAM budget

| Model | VRAM |
|-------|------|
| Whisper large-v3-turbo STT | ~3 GB |
| Kokoro-82M TTS | ~0.5 GB |
| Ollama LLM (loaded separately) | varies |
| **Total (STT + TTS)** | **~3.5 GB of 128 GB** |

## Ollama — GPU Access Fix (MANDATORY)

The NVIDIA Container Toolkit is pre-installed on DGX Spark but **NOT registered with Docker** by default. Without this fix, all Ollama models run on CPU (100% CPU, 0% GPU), even though the GPU has free VRAM.

**Symptoms:**
- `docker exec ollama ollama ps` shows `100% CPU` for loaded models
- `docker exec ollama nvidia-smi` → `Failed to initialize NVML: Unknown Error`
- `docker info | grep runtime` shows only `runc`, no `nvidia`
- Models run 5-10x slower than expected

**One-time fix (requires sudo):**
```bash
# Register NVIDIA runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker daemon to pick up the new runtime
sudo systemctl restart docker

# Verify — should now list nvidia runtime
docker info | grep -i runtime
# Expected: Runtimes: io.containerd.runc.v2 nvidia runc

# Restart Ollama container
docker restart ollama

# Verify GPU access inside container
docker exec ollama nvidia-smi       # Should show GPU
docker exec ollama ollama ps        # Should show GPU % instead of CPU %
```

**Root cause:** `/etc/docker/daemon.json` only had DNS config, missing the NVIDIA runtime registration. The `nvidia-ctk runtime configure` command adds the `nvidia` runtime to the daemon config.
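For reference, a sketch of the before/after daemon config that `nvidia-ctk` produces by merging into the existing file. The `path` value shown is the toolkit's usual binary name; verify against your generated `/etc/docker/daemon.json`:

```python
# Before: daemon.json held only the DNS config. After `nvidia-ctk runtime
# configure --runtime=docker`, the nvidia runtime entry is merged in.
import json

before = {"dns": ["8.8.8.8", "8.8.4.4"]}

nvidia_runtime = {
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": [],
        }
    }
}

after = {**before, **nvidia_runtime}  # DNS config is preserved, runtime added
print(json.dumps(after, indent=2))
```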

**Available Ollama models on Titan (as of 2026-03-02):**

| Model | Size | Family | Notes |
|-------|------|--------|-------|
| qwen3:8b | 5.2 GB | Qwen3 | Fast, decent extraction |
| qwen3:32b | 20 GB | Qwen3 | Dense, high quality |
| qwen3.5:27b | 17 GB | Qwen3.5 | Dense, vision-capable |
| qwen3.5:35b-a3b | 23 GB | Qwen3.5 MoE | ~3B active params, should be fast on GPU |
| llama3.1:8b | 4.9 GB | Llama 3.1 | Known tool-calling bugs |
| gemma3:4b | 3.3 GB | Gemma 3 | Smallest |
| mistral-small:24b | 14 GB | Mistral 3 | — |
| nemotron-3-nano | 24 GB | Nemotron MoE | — |
| ministral-3:14b | 9.1 GB | Mistral 3 | — |

## Whisper STT — PyTorch or Docker CTranslate2

CTranslate2 **PyPI aarch64 wheels** are CPU-only. However, the `mekopa/whisperx-blackwell` Docker container has CTranslate2 4.4.0 compiled from source with full CUDA support (float16, bfloat16, int8).

**For bare-metal/venv installs**, use OpenAI's Whisper package (`openai-whisper`) which runs on PyTorch → CUDA natively:

```python
import whisper
model = whisper.load_model("large-v3", device="cuda")
result = model.transcribe(audio_array, language="en", fp16=True)
```

**For Docker deployments**, the audio-pipeline's WhisperX container runs CTranslate2 on GPU.

## Benchmark Results — Bare-Metal (2026-02-24)

> **Note:** These bare-metal results led to ADR-016 v1 (dual-model). The NGC Docker benchmarks below superseded them and led to ADR-016 v2 (8B-primary). Bare-metal results are preserved for reference.

### Context Retrieval Pipeline

| Component | 0.6B | 8B |
|-----------|------|-----|
| Embedding (single sentence) | **69.4ms** | 341.6ms |
| Embedding (batch 16, per-sentence) | **4.7ms** | 50.9ms |
| cuVS vector search (100K, top-10) | **0.17ms** | 0.17ms |
| cuGraph BFS (10K nodes) | **2.3ms** | 2.3ms |
| cuGraph PageRank (10K nodes) | **1.0ms** | 1.0ms |
| **Pipeline total** | **71.9ms ✓** | **344.0ms ✗** |

**Original decision (superseded):** ADR-016 v1 chose 0.6B for real-time based on these numbers. NGC Docker benchmarks showed 8B at 74.7ms → ADR-016 v2 uses 8B for everything.

### Voice Pipeline

| Component | Latency | RTF |
|-----------|---------|-----|
| Whisper large-v3 STT (5s audio) | **483ms** | 0.097x (10x real-time) |
| Whisper large-v3 STT (30s audio) | **481ms** | 0.016x (62x real-time) |
| Kokoro TTS (short text) | **30ms** | 0.016x (62x real-time) |
| Kokoro TTS (medium text) | **40ms** | 0.014x (71x real-time) |
| **Pipeline total (5s input)** | **513ms ✓** | — |

### Infrastructure

| Service | Latency |
|---------|---------|
| Redis GET | **0.034ms** |
| Redis SET | **0.036ms** |
| PostgreSQL SELECT+ORDER+LIMIT | **0.44ms** |
| GLiNER NER (6 labels) | **132.5ms** |

### VRAM Usage

| Model | VRAM |
|-------|------|
| Qwen3-Embedding-0.6B | 1.1 GB |
| Qwen3-Embedding-8B | 14.1 GB |
| Whisper large-v3 | 8.75 GB |
| Kokoro-82M | ~0.5 GB |
| **Total (0.6B + Whisper + Kokoro)** | **~10.4 GB** |
| **Available** | **128 GB** |
| **Headroom** | **~118 GB** |

## Running the Full Benchmark

```bash
# Activate environment
source /tmp/her-os-tier1/bin/activate

# Set library paths
SITE=/tmp/her-os-tier1/lib/python3.12/site-packages
export LD_LIBRARY_PATH=$(find $SITE -maxdepth 2 -name 'lib64' -type d | tr '\n' ':')$LD_LIBRARY_PATH

# Set HuggingFace token
export HF_TOKEN=$(cat ~/.huggingface/token)

# Ensure Docker services are running
docker start titan-postgres titan-redis

# Run benchmark
python3 /tmp/benchmark_v2.py
```

## Troubleshooting Quick Reference

| Symptom | Cause | Fix |
|---------|-------|-----|
| `No module named 'cupy'` | cu12 uninstall deleted shared module dir | `pip install --force-reinstall cupy-cuda13x` then `pip install 'numpy<2.3'` |
| `No module named 'pylibraft'` | Same shared dir issue | `pip install --force-reinstall pylibraft-cu13 --extra-index-url=https://pypi.nvidia.com` then `pip install 'numpy<2.3'` |
| `No module named 'pylibcugraph'` | Same shared dir issue | `pip install --force-reinstall pylibcugraph-cu13 --extra-index-url=https://pypi.nvidia.com` then `pip install 'numpy<2.3'` |
| `libcuvs_c.so: cannot open` | Missing LD_LIBRARY_PATH | Set LD_LIBRARY_PATH (see section above) |
| `libraft.so: cannot open` | Missing LD_LIBRARY_PATH or libraft | Set LD_LIBRARY_PATH including `$SITE/libraft/lib64` |
| `Numba needs NumPy 2.2 or less` | RAPIDS install pulled numpy 2.4 | `pip install 'numpy<2.3'` |
| `nvrtc: error: unrecognized option SM_120` | Blackwell Jiterator issue | Apply Kokoro TorchSTFT patch (see above). Standard PyTorch ops work fine. |
| `torch_dtype is deprecated` | Old Transformers API | Use `dtype=torch.float16` instead of `torch_dtype=torch.float16` |
| Docker build DNS timeout (`lookup nvcr.io ... i/o timeout`) | systemd-resolved lost upstream DNS | `sudo resolvectl dns wlP9s9 8.8.8.8 8.8.4.4 && sudo resolvectl domain wlP9s9 "~."` (see DNS section above) |
| Docker container DNS timeout | IPv6 conflict | `sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1` |
| pip version parsing error | pip 24.0 can't parse nightly versions | `pip install --upgrade pip` |
| `huggingface-cli: not found` (exit 127) | `huggingface_hub>=1.4.0` removed `[cli]` extra | Use Python API: `python -c "from huggingface_hub import snapshot_download; snapshot_download('model')"` |
| Docker build uses stale cache, commands fail | Cached pip layer from failed build (DNS was down) | `docker compose build --no-cache` or `docker builder prune` |
| `Can't find model 'en_core_web_sm'` | spacy model not in container; runtime install goes to unreadable user site-packages | Add `RUN python -m spacy download en_core_web_sm` to Dockerfile builder stage |
| Embedding 4-7x slower than expected | Running bare-metal pip instead of NGC container | Use NGC PyTorch base image — see "NGC Container Performance" section below |
| `soxr_ext.abi3.so: cannot open` / `CUDA not available` after deploy | rsync copied local x86_64 .venv over Titan's aarch64 .venv | See "Annie Voice Venv Recovery" below |

## Package Versions (validated 2026-02-24)

```
torch==2.12.0.dev20260224+cu128
cupy-cuda13x==14.0.1
cugraph-cu13==26.2.0
cuvs-cu13==26.2.0
cudf-cu13==26.2.1
pylibraft-cu13==26.2.0
pylibcugraph-cu13==26.2.0
numpy==2.2.6
openai-whisper (large-v3 model, 1.55B params)
kokoro>=0.8
gliner (urchade/gliner_medium-v2.1)
transformers (Qwen/Qwen3-Embedding-0.6B + 8B)
redis==5.x
asyncpg==0.x
fastapi + uvicorn
```

## Docker Compose — Container Healthcheck Gotchas

**Qdrant:** The `qdrant/qdrant` image is a minimal Debian container with just the Qdrant binary — it has **neither `wget` nor `curl`**. Use bash's built-in `/dev/tcp` for health checks:
```yaml
healthcheck:
  test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/6333'"]
```

**Neo4j:** The `neo4j:5-community` image includes `wget` (Debian-based). Standard wget healthcheck works.

**NGC PyTorch:** Full Ubuntu — has both `wget` and `curl`.

## Docker Build — Gotchas for NGC Base Image

**UID 1000 conflict:** The NGC PyTorch image pre-creates user `ubuntu:1000`. Use `usermod -l heros ubuntu` instead of `useradd -u 1000 heros`.

**PyTorch CPU-only in venv:** If you create an isolated venv (`python -m venv`) inside the NGC container, pip won't see NGC's CUDA-enabled PyTorch and will install CPU-only torch from PyPI as a transitive dependency (via transformers, etc.). **Fix:** Always use `python -m venv --system-site-packages` to inherit NGC's PyTorch.

**Symptom:** `AttributeError: module 'torch._C' has no attribute '_dlpack_exchange_api'` — this means the CPU-only torch was installed over NGC's CUDA torch.

**total_mem vs total_memory:** NGC's PyTorch uses `torch.cuda.get_device_properties(0).total_memory` (not `total_mem`). Standard PyPI torch uses `total_mem`.

**wget --spider sends HEAD:** Docker healthchecks using `wget -q --spider` send HTTP HEAD requests. FastAPI `@app.get()` returns 405 Method Not Allowed for HEAD. Use `curl -sf` instead (sends GET).

**UCX segfault on exit:** NGC images include HPC-X UCX libraries. When Python imports RAPIDS (cupy, cugraph), UCX initializes. On process exit, `ucs_topo_cleanup` segfaults (`SIGSEGV`). The imports work fine — it's only the cleanup that crashes. **Fix:** Use `os._exit(0)` to skip atexit handlers, or set `UCX_LOG_LEVEL=error` to suppress verbose backtraces.
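The `os._exit(0)` workaround can be demonstrated in isolation: it terminates the process before atexit handlers (where UCX's teardown runs) ever execute, whereas a normal `sys.exit` runs them:

```python
# os._exit(0) ends the process immediately: atexit handlers, where RAPIDS'
# UCX teardown segfaults, never run. A normal exit would run them.
import atexit
import os

def register_cleanup():
    # Stand-in for the UCX teardown that crashes in ucs_topo_cleanup.
    atexit.register(lambda: print("atexit handler ran"))

def finish(skip_atexit=True):
    if skip_atexit:
        os._exit(0)         # hard exit: skips atexit and GC finalizers
    raise SystemExit(0)     # normal exit: atexit handlers run
```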

**huggingface_hub v1.4+ removed `[cli]` extra:** As of `huggingface_hub>=1.4.0`, the `[cli]` extra no longer exists. `pip install 'huggingface_hub[cli]'` emits a warning and **does not install the `huggingface-cli` entry point** — the command silently vanishes (exit code 127). Use the Python API instead:
```dockerfile
# WRONG — fails with "huggingface-cli: not found"
RUN pip install 'huggingface_hub[cli]'
RUN huggingface-cli download Qwen/Qwen3-Embedding-0.6B

# CORRECT — use Python API directly
RUN pip install huggingface_hub
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Embedding-0.6B')"
```

**Docker cache poisoning from network failures:** If a `pip install` runs during a DNS outage, Docker caches the layer as successful (pip exits 0 even when it can't download anything). Subsequent builds use the cached empty layer, causing mysterious "command not found" errors for packages that should be installed. **Fix:** `docker compose build --no-cache` to force a full rebuild, or `docker builder prune` to clear the build cache.

**spacy models need build-time install:** Kokoro TTS depends on `spacy` with `en_core_web_sm`. If you only install it at runtime, spacy writes to user site-packages (`Defaulting to user installation because normal site-packages is not writeable`) which isn't on Python's path inside the container. **Fix:** Add `RUN python -m spacy download en_core_web_sm` to the builder stage of the Dockerfile, after installing voice requirements.

## NGC Container Performance — 4-7x Faster Than Bare-Metal pip

**This is the most important finding from Phase 0.** The NGC PyTorch container (`nvcr.io/nvidia/pytorch:25.11-py3`) provides **4-7x faster inference** than bare-metal pip installations for all compute-bound operations. This single factor changed our embedding architecture (ADR-016 v1 → v2).

**Root cause:** NGC images ship with architecture-specific optimizations — cuBLAS/cuDNN tuned for Blackwell SM_121, optimized memory allocators, pre-compiled CUDA kernels, and NCCL tuning. Bare-metal `pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128` installs generic wheels with no Blackwell-specific optimizations.

### Benchmark Comparison (2026-02-25)

| Component | Bare-metal pip | NGC Docker | Speedup |
|-----------|---------------|------------|---------|
| Embed 0.6B (single) | 69.4ms | **9.78ms** | **7.1x** |
| Embed 8B (single) | 341.6ms | **74.7ms** | **4.6x** |
| Whisper STT (5s) | 483ms | **279.7ms** | **1.7x** |
| GLiNER NER | 132.5ms | **54.6ms** | **2.4x** |
| Redis GET | 0.034ms | 0.022ms | 1.5x |
| PostgreSQL | 0.44ms | 0.17ms | 2.6x |
| cuVS search (100K) | 0.17ms | 0.18ms | ~same |
| cuGraph BFS (10K) | 2.3ms | 2.27ms | ~same |

**Key observations:**
- **Transformer inference** (embedding, STT, NER) benefits most — these are matrix-multiply bound, exactly what NGC's tuned cuBLAS accelerates.
- **CUDA graph operations** (cuVS, cuGraph) show no improvement — they already run optimized CUDA kernels regardless of the PyTorch build.
- **Infrastructure** (Redis, PostgreSQL) improvements are from containerized networking, not GPU.

### Architectural Impact — ADR-016 v2

The NGC speedup **made the 8B embedding model real-time capable** (74.7ms < 100ms target), eliminating the dual-model progressive retrieval architecture:

| | ADR-016 v1 (bare-metal) | ADR-016 v2 (NGC Docker) |
|--|------------------------|------------------------|
| Real-time embedding | 0.6B (69ms) | **8B (74.7ms)** |
| Context retrieval total | 71.9ms (with 0.6B) | **77.2ms (with 8B)** |
| Search quality (MTEB) | 61.83 | **69.44 (+12%)** |
| VRAM | 15.2 GB (both models) | **14.1 GB (8B only)** |
| Architecture complexity | Dual-model + confidence routing + nightly re-index | **Single model** |

**Lesson:** Always benchmark inside the target deployment container before making architectural decisions. The runtime environment can change the performance profile enough to invalidate design trade-offs.

## Anti-Patterns — NEVER Do These

1. **NEVER `pip uninstall <package>-cu12`** if the `-cu13` version is installed (shared module dirs)
2. **NEVER install RAPIDS without `--extra-index-url=https://pypi.nvidia.com`**
3. **NEVER forget `pip install 'numpy<2.3'` after RAPIDS installs**
4. **NEVER `pip install ctranslate2` on ARM64 expecting CUDA** (PyPI wheels are CPU-only; the `mekopa/whisperx-blackwell` Docker container has it compiled from source with CUDA)
5. **NEVER run RAPIDS Python without LD_LIBRARY_PATH set**
6. **NEVER use `torch_dtype=`** (deprecated, use `dtype=`)
7. **NEVER import Kokoro's KPipeline before applying TorchSTFT patch on Blackwell**
8. **NEVER use `pip install 'huggingface_hub[cli]'`** — the `[cli]` extra was removed in v1.4.0. Use `pip install huggingface_hub` and call `snapshot_download()` from Python
9. **NEVER benchmark bare-metal and assume NGC Docker will match** — NGC is 4-7x faster for transformer inference. Always benchmark inside the deployment container
10. **NEVER trust Docker layer cache after a network failure** — `pip install` exits 0 even if it downloads nothing. Bust cache with `--no-cache` if builds behave unexpectedly
11. **NEVER run Qwen3.5 with thinking mode ON for production** — consumes 3-16K tokens of reasoning per call, 5-80x slower, 8% failure rate from token exhaustion. Always use `chat_template_kwargs: {"enable_thinking": false}` and `--jinja` flag on llama-server
12. **NEVER start llama-server without `--jinja`** when serving Qwen3.5 — without it, `chat_template_kwargs` is silently ignored and thinking mode stays ON
13. **NEVER use `--ctx-size 8192` with Qwen3.5** — if thinking mode is ON (even accidentally), 8K context = prompt + thinking + response all compete for space, causing empty outputs. Use `--ctx-size 32768` minimum
14. **NEVER use `CMAKE_CUDA_ARCHITECTURES` with CTranslate2** — it uses legacy FindCUDA (cmake_minimum_required 3.7), not cmake's CUDA language. Use `CUDA_TOOLKIT_ROOT_DIR`, `CUDA_ARCH_LIST`, and `CUDA_NVCC_FLAGS` instead
15. **NEVER rsync code to Titan** — rsync copies the local x86_64 `.venv/` over Titan's aarch64 `.venv/`, corrupting pip and all native packages. Always use `git commit → git push → ssh titan "git pull"` for deployment

## Annie Voice Venv Recovery

If the annie-voice venv is corrupted (rsync incident, broken pip, missing native libs):

```bash
cd ~/workplace/her/her-os/services/annie-voice

# 1. Recreate venv from scratch
python3 -m venv --clear .venv

# 2. Install all Python deps from requirements.txt
.venv/bin/pip install -r requirements.txt

# 3. pip installs CPU-only torch by default — force CUDA version
.venv/bin/pip install --force-reinstall torch==2.10.0 --index-url https://download.pytorch.org/whl/cu128

# 4. NeMo is NOT in requirements.txt (it pulls CPU torch). Install WITH deps, then fix conflicts:
.venv/bin/pip install "nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git@main"
# NeMo pulls protobuf 6.x and CPU torch — fix both:
.venv/bin/pip install --force-reinstall torch==2.10.0 --index-url https://download.pytorch.org/whl/cu128
.venv/bin/pip install 'protobuf~=5.29.6'

# 5. Verify everything works
.venv/bin/python -c "
import torch; print('torch:', torch.__version__, 'cuda:', torch.cuda.is_available())
from nemo.collections.asr.models import EncDecRNNTBPEModel; print('NeMo ASR: OK')
import kokoro; print('Kokoro: OK')
import pipecat; print('Pipecat:', pipecat.__version__)
"
# Expected: torch 2.10.0+cu128 cuda True, NeMo ASR OK, Kokoro OK, Pipecat 0.0.105

# 6. Restart Annie (see MEMORY.md for full env vars)
fuser -k 7860/tcp 2>/dev/null; sleep 2
CONTEXT_ENGINE_TOKEN="..." CONTEXT_ENGINE_URL="http://localhost:8100" \
LLAMACPP_BASE_URL="http://localhost:8003/v1" LLM_BACKEND="qwen3.5-9b" \
STT_BACKEND="nemotron" VOICE_AGENT_HOST="0.0.0.0" \
TRANSCRIPT_DIR="$HOME/.local/share/her-os-audio/transcripts" \
MEMORY_NOTES_DIR="$HOME/.local/share/her-os-audio/memories" \
nohup .venv/bin/python server.py > /tmp/annie-voice.log 2>&1 &
```

**Order matters:** Install NeMo BEFORE re-forcing CUDA torch. NeMo's deps pull CPU torch, so the torch reinstall must come AFTER. The protobuf downgrade is needed because pipecat requires ~5.29 while NeMo pulls 6.x.

**Root cause of Session 324 incident:** `rsync -avz services/annie-voice/ titan:~/...` copied local laptop's `.venv/` (x86_64 Ubuntu) over Titan's `.venv/` (aarch64 DGX). Native `.so` files (soxr, torch CUDA bindings) are architecture-specific — x86 binaries crash on aarch64. Cascading failures: broken pip → CPU-only torch → NeMo missing → Nemotron STT fails → Annie can't hear users.
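A sketch for auditing a venv after a suspected rsync incident: read each `.so` file's ELF `e_machine` field and compare against the host architecture. The constants come from the ELF specification; the check assumes little-endian headers, which holds for both x86_64 and aarch64:

```python
# Audit a venv for wrong-architecture native libraries by reading each ELF
# header's e_machine field (2 bytes at offset 18).
import platform
import struct
from pathlib import Path

E_MACHINE = {0x3E: "x86_64", 0xB7: "aarch64"}

def elf_arch(path):
    """'x86_64', 'aarch64', or None (not ELF / unknown machine)."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    (machine,) = struct.unpack_from("<H", header, 18)
    return E_MACHINE.get(machine)

def foreign_libs(venv_dir):
    """Every .so under the venv whose architecture differs from the host's."""
    host = platform.machine()
    return [p for p in Path(venv_dir).rglob("*.so*")
            if p.is_file() and elf_arch(p) not in (None, host)]
```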

## Qwen3.5-35B-A3B via llama.cpp (Recommended VLM Setup)

```bash
export LD_LIBRARY_PATH=/usr/local/cuda-13/compat:$LD_LIBRARY_PATH

~/llama-cpp-latest/build-gpu/bin/llama-server \
  --host 0.0.0.0 --port 8002 \
  -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-BF16.gguf \
  --ctx-size 32768 --n-gpu-layers 999 -fa auto --jinja
```

**Build requires:** `-DGGML_CUDA_FORCE_CUBLAS=ON` (native MMQ kernels crash on Blackwell MXFP4 tensors).
**API calls must include:** `"chat_template_kwargs": {"enable_thinking": false}` for production use.
**Full setup recipe:** See `docs/RESEARCH-QWEN35-VLM.md`
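A sketch of the corresponding API call against the server started above. Field names follow llama.cpp's OpenAI-compatible endpoint; `chat_template_kwargs` only takes effect when the server runs with `--jinja`, and the `model` value is illustrative since llama-server serves whatever model it loaded:

```python
import json
import urllib.request

payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Describe this image."}],
    "chat_template_kwargs": {"enable_thinking": False},  # thinking OFF for prod
    "temperature": 0.2,
}

def build_request(base_url="http://localhost:8002/v1"):
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To send: urllib.request.urlopen(build_request(), timeout=120)
```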

## Context Engine — Deployment

The Context Engine is the "shared brain" — watches JSONL transcript files written by the audio pipeline, ingests to PostgreSQL, extracts entities via local LLM (qwen3:8b), and provides BM25 search + temporal decay + daily reflection APIs.

### Architecture

```
Audio Pipeline (Docker)             Context Engine (Docker)
writes to /data/transcripts/  ──→   watches /data/transcripts/ (inotify)
                                          │
                                   ┌──────▼──────┐
                                   │ Parse JSONL │
                                   │ Build index │
                                   └──────┬──────┘
                                          │
                                   ┌──────▼──────┐
                                   │ PostgreSQL  │
                                   │ (GIN index) │
                                   └──────┬──────┘
                                          │
                                   ┌──────▼─────────┐
                                   │ Entity Extract │
                                   │ (qwen3:8b via  │
                                   │  Ollama)       │
                                   └────────────────┘
```

### Shared transcript volume

Both audio-pipeline and context-engine mount the same host directory:

| Service | Host path | Container path |
|---------|-----------|----------------|
| audio-pipeline | `/home/rajesh/.local/share/her-os-audio` | `/data` |
| context-engine | `/home/rajesh/.local/share/her-os-audio/transcripts` | `/data/transcripts` |

The audio pipeline's `jsonl_writer.py` writes to `/data/transcripts/SESSION_ID.jsonl`. The Context Engine's watcher detects these files via inotify.
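The ingest contract can be sketched as a polling loop (the real watcher uses inotify; the segment fields here are illustrative, not the actual JSONL schema):

```python
# Polling sketch of the ingest loop. Each JSONL line is one segment; the
# offsets dict lets repeated scans pick up only newly appended lines.
import json
from pathlib import Path

def read_new_segments(path, offsets):
    """Parse lines appended to one session file since the last call."""
    segments = []
    with open(path, "r", encoding="utf-8") as f:
        f.seek(offsets.get(path, 0))
        for line in f:
            line = line.strip()
            if line:
                segments.append(json.loads(line))
        offsets[path] = f.tell()
    return segments

def scan(transcript_dir, offsets):
    """One pass over all SESSION_ID.jsonl files in the transcript dir."""
    out = []
    for p in sorted(Path(transcript_dir).glob("*.jsonl")):
        out.extend(read_new_segments(str(p), offsets))
    return out
```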

### Deploy

```bash
cd ~/workplace/her/her-os/services/context-engine

# Create .env (one-time)
cat > .env << 'EOF'
POSTGRES_PASSWORD=<your-password>
CONTEXT_ENGINE_TOKEN=<generate-with: python3 -c "import secrets; print(secrets.token_urlsafe(32))">
EXTRACTION_LLM=ollama/qwen3:8b
DAILY_LLM=ollama/qwen3:8b
OLLAMA_BASE_URL=http://host.docker.internal:11434
TRANSCRIPT_DIR=/home/rajesh/.local/share/her-os-audio/transcripts
EOF

# Start
./run.sh             # or: docker compose up -d --build

# Verify
curl http://localhost:8100/health
curl -H "X-Internal-Token: $CONTEXT_ENGINE_TOKEN" http://localhost:8100/v1/stats
```

### Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /health` | No | Returns `{"status": "ok"}` — no metadata (SEC-1) |
| `GET /v1/stats` | Yes | Segment/session/entity counts |
| `GET /v1/context?query=...` | Yes | BM25 search + temporal decay + MMR reranking |
| `GET /v1/entities` | Yes | List extracted entities |
| `GET /v1/daily` | Yes | Daily reflection summary |
| `POST /v1/ingest/{session_id}` | Yes | Manual ingest trigger |

Auth: `X-Internal-Token` header with the value from `CONTEXT_ENGINE_TOKEN`.

### Gotchas

1. **`host.docker.internal` on Linux** — Docker doesn't add this automatically. Use `extra_hosts: ["host.docker.internal:host-gateway"]` in docker-compose.yml (already configured).
2. **asyncpg type strictness** — `DATE` columns need `datetime.date` objects, not strings. SQLAlchemy + asyncpg won't auto-coerce.
3. **Ollama `"think": false`** — Must disable Qwen3 chain-of-thought in extraction calls. Without this, timeout at 300s and empty results.
4. **JSON array schema** — Use structured JSON schema (not `format: "json"`) to force array output. Plain `format: "json"` causes some models (especially Qwen3 32B) to stop after a single `{...}` object.
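Gotchas 3 and 4 combined in one illustrative Ollama `/api/chat` payload: chain-of-thought disabled, and `format` set to a JSON schema whose top level is an object holding an array, so output can't stop after a single `{...}`. The schema fields are made up, not the Context Engine's actual extraction schema:

```python
entity_schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "label": {"type": "string"},
                },
                "required": ["name", "label"],
            },
        }
    },
    "required": ["entities"],
}

payload = {
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Extract entities: ..."}],
    "think": False,            # gotcha 3: disable Qwen3 chain-of-thought
    "format": entity_schema,   # gotcha 4: schema, not the bare string "json"
    "stream": False,
}
```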

### LLM Extraction Benchmark (GPU, 10 labeled transcripts, JSON array schema)

| Model | F1 | Precision | Recall | Avg Latency | VRAM |
|-------|-----|-----------|--------|-------------|------|
| **qwen3:8b** | **0.523** | **0.508** | **0.565** | **6.3s** | ~5 GB |
| qwen3:32b | 0.501 | 0.523 | 0.565 | 31.4s | ~20 GB |
| qwen3.5:35b-a3b | 0.452 | 0.471 | 0.598 | 8.3s | ~24 GB |
| qwen3.5:27b | 0.435 | 0.358 | 0.573 | 39.7s | ~17 GB |

Winner: **qwen3:8b** — highest F1, fastest, smallest VRAM. ADR-019 winner (Qwen3 32B) tested and confirmed equivalent quality but 5x slower. Bigger models over-extract (lower precision) — extraction is pattern matching, not reasoning.
