# Titan (DGX Spark) — Setup Recipes & Gotchas

> **Purpose:** Single source of truth for bringing up the her-os stack on Titan.
> Every gotcha, workaround, and exact command is documented here so we never waste time debugging the same issues again.

## Hardware Profile

| Component | Value |
|-----------|-------|
| Architecture | **aarch64** (ARM64, NOT x86) |
| OS | Ubuntu 24.04.4 LTS |
| GPU | NVIDIA GB10 (Blackwell, SM_120/SM_121) |
| GPU Driver | 580.126.09 |
| CUDA | **13.0** (system) |
| Memory | 128 GB unified (shared CPU/GPU) |
| Python | 3.12.3 (system) |

## Critical Gotcha: NVIDIA CUDA Package Naming

**The #1 source of wasted time.** NVIDIA ships separate pip packages for CUDA 12 and CUDA 13:

| Package | CUDA 12 | CUDA 13 (Titan) |
|---------|---------|-----------------|
| CuPy | `cupy-cuda12x` | `cupy-cuda13x` |
| cuGraph | `cugraph-cu12` | `cugraph-cu13` |
| cuVS | `cuvs-cu12` | `cuvs-cu13` |
| pylibraft | `pylibraft-cu12` | `pylibraft-cu13` |
| libraft | `libraft-cu12` | `libraft-cu13` |
| pylibcugraph | `pylibcugraph-cu12` | `pylibcugraph-cu13` |

**ALL RAPIDS packages require:** `--extra-index-url=https://pypi.nvidia.com`

### The Shared Module Directory Trap

Both `-cu12` and `-cu13` variants install to the **same Python module directory** (e.g., both `cupy-cuda12x` and `cupy-cuda13x` install to `cupy/`).

**NEVER run `pip uninstall <package>-cu12` if the `-cu13` version is installed!** It will delete the shared module files, breaking the cu13 installation. If you accidentally do this, you must `pip install --force-reinstall <package>-cu13` to recreate the files.
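A quick way to detect this trap before it bites: list installed distributions and flag any base package present in both CUDA-12 and CUDA-13 variants. A minimal sketch (the suffix pairs cover the RAPIDS `-cu12`/`-cu13` and CuPy `-cuda12x`/`-cuda13x` naming schemes from the table above):

```python
# Flag base packages installed in BOTH CUDA-12 and CUDA-13 variants.
# A hit means the shared module directory is at risk.
from importlib.metadata import distributions

def find_cuda_variant_conflicts(installed):
    """Given lowercase distribution names, return bases with both variants."""
    installed = set(installed)
    pairs = (("-cu12", "-cu13"), ("-cuda12x", "-cuda13x"))
    conflicts = set()
    for name in installed:
        for old, new in pairs:
            if name.endswith(old) and name[: -len(old)] + new in installed:
                conflicts.add(name[: -len(old)])
    return sorted(conflicts)

if __name__ == "__main__":
    names = {(d.metadata["Name"] or "").lower() for d in distributions()}
    for base in find_cuda_variant_conflicts(names):
        print(f"WARNING: '{base}' has both cu12 and cu13 variants installed")
```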

### The numpy Cascade

RAPIDS `--force-reinstall` pulls in `numpy>=2.4` which **breaks numba** (requires `numpy<2.3`), which **breaks Whisper**.

**RULE:** After ANY RAPIDS package installation, always run:
```bash
pip install 'numpy<2.3'
```
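A lightweight sanity check for the pin, useful at the end of an install script. This is a simplified major.minor comparison, not a full PEP 440 parser; pass it `numpy.__version__`:

```python
# Simplified check of the numpy pin (compares major.minor only).
def satisfies_numpy_pin(version, ceiling=(2, 3)):
    """True if `version` is strictly below `ceiling` on (major, minor)."""
    parts = []
    for token in version.split("."):
        digits = "".join(ch for ch in token if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts[:2]) < ceiling
```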

## DNS Resolution — systemd-resolved Fix

DGX Spark uses `systemd-resolved` with a stub resolver at `127.0.0.53`. On WiFi connections, it periodically loses its upstream DNS configuration, causing all hostname resolution to fail while IP connectivity works fine.

**Symptoms:**
- `nslookup nvcr.io` → `communications error to 127.0.0.53#53: timed out`
- `ping 8.8.8.8` → works fine
- Docker builds fail with: `dial tcp: lookup nvcr.io on 127.0.0.53:53: i/o timeout`
- `resolvectl status` shows `Current Scopes: none` on all interfaces

**Quick fix (immediate, lost on reboot):**
```bash
# Find the active network interface
ip route show default  # → "default via ... dev wlP9s9 ..."

# Set Google DNS on that interface
sudo resolvectl dns wlP9s9 8.8.8.8 8.8.4.4
sudo resolvectl domain wlP9s9 "~."

# Verify
nslookup nvcr.io  # Should resolve immediately
```

**Permanent fix (survives reboot):**
```bash
sudo mkdir -p /etc/systemd/resolved.conf.d
sudo tee /etc/systemd/resolved.conf.d/dns.conf << 'EOF'
[Resolve]
DNS=8.8.8.8 8.8.4.4
FallbackDNS=1.1.1.1 1.0.0.1
EOF
sudo systemctl restart systemd-resolved
```

**Note:** Docker daemon has `{"dns": ["8.8.8.8", "8.8.4.4"]}` in `/etc/docker/daemon.json` which helps running containers, but Docker BuildKit image pulls still go through the host's systemd-resolved.
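To distinguish "DNS broken, IP fine" programmatically, for example from a health-check script, a small sketch: `getaddrinfo` exercises the systemd-resolved stub, while a raw TCP connection to a public IP does not (the `nvcr.io` hostname and the `8.8.8.8:53` probe are illustrative):

```python
import socket

def dns_ok(host, port=443):
    """True if the system resolver can resolve `host`."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False

def ip_connectivity_ok(addr="8.8.8.8", port=53, timeout=2.0):
    """True if a direct TCP connection to a public IP succeeds (no DNS used)."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if ip_connectivity_ok() and not dns_ok("nvcr.io"):
        print("DNS down but IP works: likely the resolved stub, apply the resolvectl fix")
```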

## Environment Setup — Complete Recipe

### Step 1: Create venv
```bash
python3 -m venv /tmp/her-os-tier1
source /tmp/her-os-tier1/bin/activate
pip install --upgrade pip  # 24.0 → 26.x (needed for version parsing)
```

### Step 2: Install PyTorch (nightly, CUDA 12.8 wheels work on Blackwell via SM_90 binary compat)
```bash
pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128
```

### Step 3: Install RAPIDS stack (cu13, from NVIDIA PyPI)
```bash
pip install cupy-cuda13x cugraph-cu13 cuvs-cu13 cudf-cu13 pylibraft-cu13 libraft-cu13 pylibcugraph-cu13 \
  --extra-index-url=https://pypi.nvidia.com
```

### Step 4: Install ML models & tools
```bash
pip install transformers accelerate sentencepiece  # Embedding models
pip install openai-whisper                          # STT (large-v3)
pip install 'kokoro>=0.8' soundfile                 # TTS (quote so '>' isn't a shell redirect)
pip install gliner                                  # NER
```

### Step 5: Install backend dependencies
```bash
pip install fastapi uvicorn asyncpg redis aioredis   # Web + DB
```

### Step 6: Pin numpy (CRITICAL — must be LAST pip install)
```bash
pip install 'numpy<2.3'
```

### Step 7: Verify all imports
```bash
python3 -c "
import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')
import cupy; print(f'CuPy {cupy.__version__}')
import cugraph; print(f'cuGraph {cugraph.__version__}')
from cuvs.neighbors import ivf_flat; print('cuVS OK')
import pylibraft; print(f'pylibraft {pylibraft.__version__}')
import whisper; print('Whisper OK')
import kokoro; print('Kokoro OK')
from gliner import GLiNER; print('GLiNER OK')
"
```

## Audio Pipeline — Fresh Install

### Step 1: Build Docker images + start containers

```bash
cd ~/workplace/her/her-os/services/audio-pipeline
./run.sh                 # builds images + starts audio pipeline + SER sidecar
./run.sh logs            # watch for "Warmup complete — ready for traffic!"
```

### Step 2: Wait for first warmup (one-time ~276s)

The first run after an image rebuild takes ~276s to JIT-compile all CUDA kernels from PTX to SASS. This populates the CUDA JIT cache (629 MB). Subsequent restarts use the cache → **3-8s warmup**.

```bash
# Watch warmup progress:
docker logs -f her-os-audio 2>&1 | grep -i warmup

# After warmup, verify fast restart:
docker restart her-os-audio
docker logs --since 30s her-os-audio 2>&1 | grep -i warmup
# Expected: "Warmup: total 3.1s" (with warm cache)
```

### Step 3: Verify

```bash
# Test transcription
curl -X POST http://localhost:9100/transcribe \
  -F "audio=@test-her-movie.wav"
```

### Hot deploy (code changes only)

Source files are volume-mounted — no rebuild needed for code changes:
```bash
git pull && docker restart her-os-audio
# Restart takes ~3-8s with warm CUDA cache
```

### CUDA JIT Cache — Critical for SM_121

The `run.sh` sets these environment variables to make the CUDA JIT cache work:
```bash
CUDA_CACHE_DISABLE=0       # NGC base image disables cache by default!
CUDA_CACHE_PATH=/cuda-cache # persistent volume mount
CUDA_CACHE_MAXSIZE=2147483648  # 2 GB
CUDA_FORCE_PTX_JIT=0       # USE the cache! Base image sets =1 which defeats it
```

**If warmup is slow (~276s) on restart:** the cache was cleared or invalidated. Common causes:
- Image rebuild (the host-mounted cache survives, but new kernels need one warm run)
- CUDA driver version changed (invalidates all cache entries)
- Cache directory was deleted: `~/.local/share/her-os-audio/cuda-cache/`
- Cache files are root-owned (created inside the container) — clearing them from the host requires `sudo`
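To confirm the cache is actually populated rather than silently recreated empty, a small sketch that sums file sizes under the cache directory named above:

```python
# Verify the CUDA JIT cache is populated (expect ~629 MB after a full warmup;
# near-zero means it was cleared and the next start pays the ~276s JIT cost).
from pathlib import Path

def dir_size_mb(root):
    """Total size in MB of all files under `root`, 0.0 if it doesn't exist."""
    root = Path(root).expanduser()
    if not root.exists():
        return 0.0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file()) / 1e6

if __name__ == "__main__":
    print(f"{dir_size_mb('~/.local/share/her-os-audio/cuda-cache'):.0f} MB")
```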

Full rebuild only needed for Dockerfile/dependency changes:
```bash
./run.sh build && ./run.sh
```

## Docker Services

### Start PostgreSQL + Redis
```bash
docker run -d --name titan-postgres -p 15432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  postgres:16-alpine

docker run -d --name titan-redis -p 16379:6379 \
  redis:7-alpine
```

### Verify
```bash
docker exec titan-redis redis-cli ping       # → PONG
docker exec titan-postgres psql -U postgres -c "SELECT 1"  # → 1
```

## LD_LIBRARY_PATH — Required for RAPIDS Native Libraries

RAPIDS packages install native `.so` files in non-standard locations. You **must** set `LD_LIBRARY_PATH` before importing cuVS, cuGraph, or cuDF:

```bash
SITE=/tmp/her-os-tier1/lib/python3.12/site-packages
export LD_LIBRARY_PATH=$SITE/libcuvs/lib64:$SITE/libcugraph/lib64:$SITE/libcudf/lib64:$SITE/librmm/lib64:$SITE/libkvikio/lib64:$SITE/libucxx/lib64:$SITE/ucxx/lib64:$SITE/rapids_logger/lib64:$SITE/lib64:$SITE/nvidia/libnvcomp/lib64:$SITE/libraft/lib64:$LD_LIBRARY_PATH
```

**Shorthand for running any Python script:**
```bash
source /tmp/her-os-tier1/bin/activate
SITE=/tmp/her-os-tier1/lib/python3.12/site-packages
LD_LIBRARY_PATH=$(find $SITE -maxdepth 2 -name 'lib64' -type d | tr '\n' ':')$LD_LIBRARY_PATH python3 your_script.py
```
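A Python equivalent of the `find` one-liner, handy when a launcher script needs to build the path itself. Note it also globs `nvidia/*/lib64` (depth 3, e.g. `libnvcomp`), which the `-maxdepth 2` find misses even though the explicit export lists `nvidia/libnvcomp/lib64`:

```python
# Build LD_LIBRARY_PATH from every lib64 dir under site-packages.
import os
from pathlib import Path

def rapids_library_path(site_packages, existing=""):
    """Colon-joined lib64 directories, with any existing path appended last."""
    site = Path(site_packages)
    patterns = ("lib64", "*/lib64", "nvidia/*/lib64")
    dirs = sorted(str(p) for pat in patterns for p in site.glob(pat) if p.is_dir())
    return ":".join(dirs + ([existing] if existing else []))

if __name__ == "__main__":
    site = "/tmp/her-os-tier1/lib/python3.12/site-packages"
    os.environ["LD_LIBRARY_PATH"] = rapids_library_path(
        site, os.environ.get("LD_LIBRARY_PATH", ""))
```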

## HuggingFace Token

Saved at `~/.huggingface/token` on both Titan and Beast. The raw token value lives only in that file — don't paste it into docs or shell history.

Set via environment if needed: `HF_TOKEN=$(cat ~/.huggingface/token)`

## Kokoro TTS — Blackwell GPU Patch (MANDATORY)

Kokoro's `TorchSTFT` uses complex tensor operations that trigger NVIDIA's Jiterator JIT compiler, which fails on Blackwell (SM_120/SM_121) because nvrtc doesn't recognize the architecture.

**Root cause:** `torch.abs()`, `torch.angle()`, and `magnitude * torch.exp(phase * 1j)` on complex CUDA tensors use Jiterator → nvrtc → fails.

**Fix:** Decompose to real/imag operations (avoids complex tensors entirely):

```python
import torch
import kokoro.istftnet as istftnet

def _patched_transform(self, input_data):
    fwd = torch.stft(input_data, self.filter_length, self.hop_length, self.win_length,
                     window=self.window.to(input_data.device), return_complex=False)
    real, imag = fwd[..., 0], fwd[..., 1]
    return torch.sqrt(real**2 + imag**2 + 1e-8), torch.atan2(imag, real)

def _patched_inverse(self, magnitude, phase):
    real = magnitude * torch.cos(phase)
    imag = magnitude * torch.sin(phase)
    return torch.istft(torch.complex(real, imag), self.filter_length, self.hop_length,
                       self.win_length, window=self.window.to(magnitude.device)).unsqueeze(-2)

istftnet.TorchSTFT.transform = _patched_transform
istftnet.TorchSTFT.inverse = _patched_inverse
```

**Why `torch.complex()` works but `torch.abs()` doesn't:** `torch.complex()` is a constructor (creates complex from real components, no JIT). `torch.abs()` on complex is an elementwise op that goes through Jiterator.

**Apply BEFORE importing `KPipeline`.**

## Annie-Voice — GPU TTS Setup (Replaces Docker CPU Container)

The `annie-voice` service originally used Kokoro-FastAPI as a Docker container (CPU, ~400ms). Session 110 switched to **direct GPU inference in-process** (~30ms).

### What changed

| Before | After |
|--------|-------|
| `ghcr.io/remsky/kokoro-fastapi-cpu:latest` Docker container | `kokoro>=0.8` pip package, in-process |
| HTTP API (`localhost:8880/v1/audio/speech`) | Direct `kokoro.KPipeline` GPU inference |
| ~400ms latency (CPU) | ~30ms latency (GPU, Blackwell-patched) |
| Separate container, separate process | Runs in annie-voice Python process alongside Whisper STT |

### Setup steps

```bash
cd ~/workplace/her/her-os/services/annie-voice
source .venv/bin/activate  # or whatever your venv is

# 1. Install kokoro (needs torch already installed)
pip install 'kokoro>=0.8'

# 2. Stop the old Docker container (no longer needed)
docker compose stop kokoro-tts
docker compose rm -f kokoro-tts

# 3. Start the voice agent — Kokoro loads in-process with Blackwell patch
python server.py
# Log output should show:
#   "Blackwell patch applied: TorchSTFT.transform + .inverse (SM_121 safe)"
#   "KokoroTTS: loaded on cuda (voice=af_heart, lang=a)"
```

### Key files

| File | Purpose |
|------|---------|
| `blackwell_patch.py` | TorchSTFT monkey-patch (imported by kokoro_tts.py) |
| `kokoro_tts.py` | Direct GPU inference via `kokoro.KPipeline` |
| `bot.py` | `KOKORO_DEVICE` env var (default: `cuda`) |
| `docker-compose.yml` | Kokoro service commented out (SearXNG remains) |

### Benchmark results (2026-03-01)

| Mode | Average Latency | Speedup |
|------|----------------|---------|
| CPU Docker (Kokoro-FastAPI) | **922ms** | — |
| GPU Direct (Blackwell-patched) | **46ms** | **20.2x** |

Text: "Hey Rajesh, good morning! How are you doing today?" (5 runs each)

### VRAM budget

| Model | VRAM |
|-------|------|
| Whisper large-v3-turbo STT | ~3 GB |
| Kokoro-82M TTS | ~0.5 GB |
| Ollama LLM (loaded separately) | varies |
| **Total (STT + TTS)** | **~3.5 GB of 128 GB** |

## Ollama — GPU Access Fix (MANDATORY)

The NVIDIA Container Toolkit is pre-installed on DGX Spark but **NOT registered with Docker** by default. Without this fix, all Ollama models run on CPU (100% CPU, 0% GPU), even though the GPU has free VRAM.

**Symptoms:**
- `docker exec ollama ollama ps` shows `100% CPU` for loaded models
- `docker exec ollama nvidia-smi` → `Failed to initialize NVML: Unknown Error`
- `docker info | grep runtime` shows only `runc`, no `nvidia`
- Models run 5-10x slower than expected

**One-time fix (requires sudo):**
```bash
# Register NVIDIA runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker daemon to pick up the new runtime
sudo systemctl restart docker

# Verify — should now list nvidia runtime
docker info | grep -i runtime
# Expected: Runtimes: io.containerd.runc.v2 nvidia runc

# Restart Ollama container
docker restart ollama

# Verify GPU access inside container
docker exec ollama nvidia-smi       # Should show GPU
docker exec ollama ollama ps        # Should show GPU % instead of CPU %
```

**Root cause:** `/etc/docker/daemon.json` only had DNS config, missing the NVIDIA runtime registration. The `nvidia-ctk runtime configure` command adds the `nvidia` runtime to the daemon config.
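For reference, a sketch of the before/after daemon config that `nvidia-ctk` produces by merging into the existing file. The `path` value shown is the toolkit's usual binary name; verify against your generated `/etc/docker/daemon.json`:

```python
# Before: daemon.json held only the DNS config. After `nvidia-ctk runtime
# configure --runtime=docker`, the nvidia runtime entry is merged in.
import json

before = {"dns": ["8.8.8.8", "8.8.4.4"]}

nvidia_runtime = {
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": [],
        }
    }
}

after = {**before, **nvidia_runtime}  # DNS config is preserved, runtime added
print(json.dumps(after, indent=2))
```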

**Available Ollama models on Titan (as of 2026-03-02):**

| Model | Size | Family | Notes |
|-------|------|--------|-------|
| qwen3:8b | 5.2 GB | Qwen3 | Fast, decent extraction |
| qwen3:32b | 20 GB | Qwen3 | Dense, high quality |
| qwen3.5:27b | 17 GB | Qwen3.5 | Dense, vision-capable |
| qwen3.5:35b-a3b | 23 GB | Qwen3.5 MoE | ~3B active params, should be fast on GPU |
| llama3.1:8b | 4.9 GB | Llama 3.1 | Known tool-calling bugs |
| gemma3:4b | 3.3 GB | Gemma 3 | Smallest |
| mistral-small:24b | 14 GB | Mistral 3 | — |
| nemotron-3-nano | 24 GB | Nemotron MoE | — |
| ministral-3:14b | 9.1 GB | Mistral 3 | — |

## Whisper STT — PyTorch or Docker CTranslate2

CTranslate2 **PyPI aarch64 wheels** are CPU-only. However, the `mekopa/whisperx-blackwell` Docker container has CTranslate2 4.4.0 compiled from source with full CUDA support (float16, bfloat16, int8).

**For bare-metal/venv installs**, use OpenAI's Whisper package (`openai-whisper`) which runs on PyTorch → CUDA natively:

```python
import whisper
model = whisper.load_model("large-v3", device="cuda")
result = model.transcribe(audio_array, language="en", fp16=True)
```

**For Docker deployments**, the audio-pipeline's WhisperX container runs CTranslate2 on GPU.

## Benchmark Results — Bare-Metal (2026-02-24)

> **Note:** These bare-metal results led to ADR-016 v1 (dual-model). The NGC Docker benchmarks below superseded them and led to ADR-016 v2 (8B-primary). Bare-metal results are preserved for reference.

### Context Retrieval Pipeline

| Component | 0.6B | 8B |
|-----------|------|-----|
| Embedding (single sentence) | **69.4ms** | 341.6ms |
| Embedding (batch 16, per-sentence) | **4.7ms** | 50.9ms |
| cuVS vector search (100K, top-10) | **0.17ms** | 0.17ms |
| cuGraph BFS (10K nodes) | **2.3ms** | 2.3ms |
| cuGraph PageRank (10K nodes) | **1.0ms** | 1.0ms |
| **Pipeline total** | **71.9ms ✓** | **344.0ms ✗** |

**Original decision (superseded):** ADR-016 v1 chose 0.6B for real-time based on these numbers. NGC Docker benchmarks showed 8B at 74.7ms → ADR-016 v2 uses 8B for everything.

### Voice Pipeline

| Component | Latency | RTF |
|-----------|---------|-----|
| Whisper large-v3 STT (5s audio) | **483ms** | 0.097x (10x real-time) |
| Whisper large-v3 STT (30s audio) | **481ms** | 0.016x (62x real-time) |
| Kokoro TTS (short text) | **30ms** | 0.016x (62x real-time) |
| Kokoro TTS (medium text) | **40ms** | 0.014x (71x real-time) |
| **Pipeline total (5s input)** | **513ms ✓** | — |

### Infrastructure

| Service | Latency |
|---------|---------|
| Redis GET | **0.034ms** |
| Redis SET | **0.036ms** |
| PostgreSQL SELECT+ORDER+LIMIT | **0.44ms** |
| GLiNER NER (6 labels) | **132.5ms** |

### VRAM Usage

| Model | VRAM |
|-------|------|
| Qwen3-Embedding-0.6B | 1.1 GB |
| Qwen3-Embedding-8B | 14.1 GB |
| Whisper large-v3 | 8.75 GB |
| Kokoro-82M | ~0.5 GB |
| **Total (0.6B + Whisper + Kokoro)** | **~10.4 GB** |
| **Available** | **128 GB** |
| **Headroom** | **~118 GB** |

## Running the Full Benchmark

```bash
# Activate environment
source /tmp/her-os-tier1/bin/activate

# Set library paths
SITE=/tmp/her-os-tier1/lib/python3.12/site-packages
export LD_LIBRARY_PATH=$(find $SITE -maxdepth 2 -name 'lib64' -type d | tr '\n' ':')$LD_LIBRARY_PATH

# Set HuggingFace token
export HF_TOKEN=$(cat ~/.huggingface/token)

# Ensure Docker services are running
docker start titan-postgres titan-redis

# Run benchmark
python3 /tmp/benchmark_v2.py
```

## Troubleshooting Quick Reference

| Symptom | Cause | Fix |
|---------|-------|-----|
| `No module named 'cupy'` | cu12 uninstall deleted shared module dir | `pip install --force-reinstall cupy-cuda13x` then `pip install 'numpy<2.3'` |
| `No module named 'pylibraft'` | Same shared dir issue | `pip install --force-reinstall pylibraft-cu13 --extra-index-url=https://pypi.nvidia.com` then `pip install 'numpy<2.3'` |
| `No module named 'pylibcugraph'` | Same shared dir issue | `pip install --force-reinstall pylibcugraph-cu13 --extra-index-url=https://pypi.nvidia.com` then `pip install 'numpy<2.3'` |
| `libcuvs_c.so: cannot open` | Missing LD_LIBRARY_PATH | Set LD_LIBRARY_PATH (see section above) |
| `libraft.so: cannot open` | Missing LD_LIBRARY_PATH or libraft | Set LD_LIBRARY_PATH including `$SITE/libraft/lib64` |
| `Numba needs NumPy 2.2 or less` | RAPIDS install pulled numpy 2.4 | `pip install 'numpy<2.3'` |
| `nvrtc: error: unrecognized option SM_120` | Blackwell Jiterator issue | Apply Kokoro TorchSTFT patch (see above). Standard PyTorch ops work fine. |
| `torch_dtype is deprecated` | Old Transformers API | Use `dtype=torch.float16` instead of `torch_dtype=torch.float16` |
| Docker build DNS timeout (`lookup nvcr.io ... i/o timeout`) | systemd-resolved lost upstream DNS | `sudo resolvectl dns wlP9s9 8.8.8.8 8.8.4.4 && sudo resolvectl domain wlP9s9 "~."` (see DNS section above) |
| Docker container DNS timeout | IPv6 conflict | `sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1` |
| pip version parsing error | pip 24.0 can't parse nightly versions | `pip install --upgrade pip` |
| `huggingface-cli: not found` (exit 127) | `huggingface_hub>=1.4.0` removed `[cli]` extra | Use Python API: `python -c "from huggingface_hub import snapshot_download; snapshot_download('model')"` |
| Docker build uses stale cache, commands fail | Cached pip layer from failed build (DNS was down) | `docker compose build --no-cache` or `docker builder prune` |
| `Can't find model 'en_core_web_sm'` | spacy model not in container; runtime install goes to unreadable user site-packages | Add `RUN python -m spacy download en_core_web_sm` to Dockerfile builder stage |
| Embedding 4-7x slower than expected | Running bare-metal pip instead of NGC container | Use NGC PyTorch base image — see "NGC Container Performance" section below |
| `soxr_ext.abi3.so: cannot open` / `CUDA not available` after deploy | rsync copied local x86_64 .venv over Titan's aarch64 .venv | See "Annie Voice Venv Recovery" below |

## Package Versions (validated 2026-02-24)

```
torch==2.12.0.dev20260224+cu128
cupy-cuda13x==14.0.1
cugraph-cu13==26.2.0
cuvs-cu13==26.2.0
cudf-cu13==26.2.1
pylibraft-cu13==26.2.0
pylibcugraph-cu13==26.2.0
numpy==2.2.6
openai-whisper (large-v3 model, 1.55B params)
kokoro>=0.8
gliner (urchade/gliner_medium-v2.1)
transformers (Qwen/Qwen3-Embedding-0.6B + 8B)
redis==5.x
asyncpg==0.x
fastapi + uvicorn
```

## Docker Compose — Container Healthcheck Gotchas

**Qdrant:** The `qdrant/qdrant` image is a minimal Debian container with just the Qdrant binary — it has **neither `wget` nor `curl`**. Use bash's built-in `/dev/tcp` for health checks:
```yaml
healthcheck:
  test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/6333'"]
```

**Neo4j:** The `neo4j:5-community` image includes `wget` (Debian-based). Standard wget healthcheck works.

**NGC PyTorch:** Full Ubuntu — has both `wget` and `curl`.

## Docker Build — Gotchas for NGC Base Image

**UID 1000 conflict:** The NGC PyTorch image pre-creates user `ubuntu:1000`. Use `usermod -l heros ubuntu` instead of `useradd -u 1000 heros`.

**PyTorch CPU-only in venv:** If you create an isolated venv (`python -m venv`) inside the NGC container, pip won't see NGC's CUDA-enabled PyTorch and will install CPU-only torch from PyPI as a transitive dependency (via transformers, etc.). **Fix:** Always use `python -m venv --system-site-packages` to inherit NGC's PyTorch.

**Symptom:** `AttributeError: module 'torch._C' has no attribute '_dlpack_exchange_api'` — this means the CPU-only torch was installed over NGC's CUDA torch.

**total_mem vs total_memory:** NGC's PyTorch uses `torch.cuda.get_device_properties(0).total_memory` (not `total_mem`). Standard PyPI torch uses `total_mem`.

**wget --spider sends HEAD:** Docker healthchecks using `wget -q --spider` send HTTP HEAD requests. FastAPI `@app.get()` returns 405 Method Not Allowed for HEAD. Use `curl -sf` instead (sends GET).

**UCX segfault on exit:** NGC images include HPC-X UCX libraries. When Python imports RAPIDS (cupy, cugraph), UCX initializes. On process exit, `ucs_topo_cleanup` segfaults (`SIGSEGV`). The imports work fine — it's only the cleanup that crashes. **Fix:** Use `os._exit(0)` to skip atexit handlers, or set `UCX_LOG_LEVEL=error` to suppress verbose backtraces.
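The `os._exit(0)` workaround can be demonstrated in isolation: it terminates the process before atexit handlers (where UCX's teardown runs) ever execute, whereas a normal `sys.exit` runs them:

```python
# os._exit(0) ends the process immediately: atexit handlers, where RAPIDS'
# UCX teardown segfaults, never run. A normal exit would run them.
import atexit
import os

def register_cleanup():
    # Stand-in for the UCX teardown that crashes in ucs_topo_cleanup.
    atexit.register(lambda: print("atexit handler ran"))

def finish(skip_atexit=True):
    if skip_atexit:
        os._exit(0)         # hard exit: skips atexit and GC finalizers
    raise SystemExit(0)     # normal exit: atexit handlers run
```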

**huggingface_hub v1.4+ removed `[cli]` extra:** As of `huggingface_hub>=1.4.0`, the `[cli]` extra no longer exists. `pip install 'huggingface_hub[cli]'` emits a warning and **does not install the `huggingface-cli` entry point** — the command silently vanishes (exit code 127). Use the Python API instead:
```dockerfile
# WRONG — fails with "huggingface-cli: not found"
RUN pip install 'huggingface_hub[cli]'
RUN huggingface-cli download Qwen/Qwen3-Embedding-0.6B

# CORRECT — use Python API directly
RUN pip install huggingface_hub
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Embedding-0.6B')"
```

**Docker cache poisoning from network failures:** If a `pip install` runs during a DNS outage, Docker caches the layer as successful (pip exits 0 even when it can't download anything). Subsequent builds use the cached empty layer, causing mysterious "command not found" errors for packages that should be installed. **Fix:** `docker compose build --no-cache` to force a full rebuild, or `docker builder prune` to clear the build cache.

**spacy models need build-time install:** Kokoro TTS depends on `spacy` with `en_core_web_sm`. If you only install it at runtime, spacy writes to user site-packages (`Defaulting to user installation because normal site-packages is not writeable`) which isn't on Python's path inside the container. **Fix:** Add `RUN python -m spacy download en_core_web_sm` to the builder stage of the Dockerfile, after installing voice requirements.

## NGC Container Performance — 4-7x Faster Than Bare-Metal pip

**This is the most important finding from Phase 0.** The NGC PyTorch container (`nvcr.io/nvidia/pytorch:25.11-py3`) provides **4-7x faster inference** than bare-metal pip installations for all compute-bound operations. This single factor changed our embedding architecture (ADR-016 v1 → v2).

**Root cause:** NGC images ship with architecture-specific optimizations — cuBLAS/cuDNN tuned for Blackwell SM_121, optimized memory allocators, pre-compiled CUDA kernels, and NCCL tuning. Bare-metal `pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128` installs generic wheels with no Blackwell-specific optimizations.

### Benchmark Comparison (2026-02-25)

| Component | Bare-metal pip | NGC Docker | Speedup |
|-----------|---------------|------------|---------|
| Embed 0.6B (single) | 69.4ms | **9.78ms** | **7.1x** |
| Embed 8B (single) | 341.6ms | **74.7ms** | **4.6x** |
| Whisper STT (5s) | 483ms | **279.7ms** | **1.7x** |
| GLiNER NER | 132.5ms | **54.6ms** | **2.4x** |
| Redis GET | 0.034ms | 0.022ms | 1.5x |
| PostgreSQL | 0.44ms | 0.17ms | 2.6x |
| cuVS search (100K) | 0.17ms | 0.18ms | ~same |
| cuGraph BFS (10K) | 2.3ms | 2.27ms | ~same |

**Key observations:**
- **Transformer inference** (embedding, STT, NER) benefits most — these are matrix-multiply bound, exactly what NGC's tuned cuBLAS accelerates.
- **CUDA graph operations** (cuVS, cuGraph) show no improvement — they already run optimized CUDA kernels regardless of the PyTorch build.
- **Infrastructure** (Redis, PostgreSQL) improvements are from containerized networking, not GPU.

### Architectural Impact — ADR-016 v2

The NGC speedup **made the 8B embedding model real-time capable** (74.7ms < 100ms target), eliminating the dual-model progressive retrieval architecture:

| | ADR-016 v1 (bare-metal) | ADR-016 v2 (NGC Docker) |
|--|------------------------|------------------------|
| Real-time embedding | 0.6B (69ms) | **8B (74.7ms)** |
| Context retrieval total | 71.9ms (with 0.6B) | **77.2ms (with 8B)** |
| Search quality (MTEB) | 61.83 | **69.44 (+12%)** |
| VRAM | 15.2 GB (both models) | **14.1 GB (8B only)** |
| Architecture complexity | Dual-model + confidence routing + nightly re-index | **Single model** |

**Lesson:** Always benchmark inside the target deployment container before making architectural decisions. The runtime environment can change the performance profile enough to invalidate design trade-offs.

## Anti-Patterns — NEVER Do These

1. **NEVER `pip uninstall <package>-cu12`** if the `-cu13` version is installed (shared module dirs)
2. **NEVER install RAPIDS without `--extra-index-url=https://pypi.nvidia.com`**
3. **NEVER forget `pip install 'numpy<2.3'` after RAPIDS installs**
4. **NEVER `pip install ctranslate2` on ARM64 expecting CUDA** (PyPI wheels are CPU-only; the `mekopa/whisperx-blackwell` Docker container has it compiled from source with CUDA)
5. **NEVER run RAPIDS Python without LD_LIBRARY_PATH set**
6. **NEVER use `torch_dtype=`** (deprecated, use `dtype=`)
7. **NEVER import Kokoro's KPipeline before applying TorchSTFT patch on Blackwell**
8. **NEVER use `pip install 'huggingface_hub[cli]'`** — the `[cli]` extra was removed in v1.4.0. Use `pip install huggingface_hub` and call `snapshot_download()` from Python
9. **NEVER benchmark bare-metal and assume NGC Docker will match** — NGC is 4-7x faster for transformer inference. Always benchmark inside the deployment container
10. **NEVER trust Docker layer cache after a network failure** — `pip install` exits 0 even if it downloads nothing. Bust cache with `--no-cache` if builds behave unexpectedly
11. **NEVER run Qwen3.5 with thinking mode ON for production** — consumes 3-16K tokens of reasoning per call, 5-80x slower, 8% failure rate from token exhaustion. Always use `chat_template_kwargs: {"enable_thinking": false}` and `--jinja` flag on llama-server
12. **NEVER start llama-server without `--jinja`** when serving Qwen3.5 — without it, `chat_template_kwargs` is silently ignored and thinking mode stays ON
13. **NEVER use `--ctx-size 8192` with Qwen3.5** — if thinking mode is ON (even accidentally), 8K context = prompt + thinking + response all compete for space, causing empty outputs. Use `--ctx-size 32768` minimum
14. **NEVER use `CMAKE_CUDA_ARCHITECTURES` with CTranslate2** — it uses legacy FindCUDA (cmake_minimum_required 3.7), not cmake's CUDA language. Use `CUDA_TOOLKIT_ROOT_DIR`, `CUDA_ARCH_LIST`, and `CUDA_NVCC_FLAGS` instead
15. **NEVER rsync code to Titan** — rsync copies the local x86_64 `.venv/` over Titan's aarch64 `.venv/`, corrupting pip and all native packages. Always use `git commit → git push → ssh titan "git pull"` for deployment

## Annie Voice Venv Recovery

If the annie-voice venv is corrupted (rsync incident, broken pip, missing native libs):

```bash
cd ~/workplace/her/her-os/services/annie-voice

# 1. Recreate venv from scratch
python3 -m venv --clear .venv

# 2. Install all Python deps from requirements.txt
.venv/bin/pip install -r requirements.txt

# 3. pip installs CPU-only torch by default — force CUDA version
.venv/bin/pip install --force-reinstall torch==2.10.0 --index-url https://download.pytorch.org/whl/cu128

# 4. NeMo is NOT in requirements.txt (it pulls CPU torch). Install WITH deps, then fix conflicts:
.venv/bin/pip install "nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git@main"
# NeMo pulls protobuf 6.x and CPU torch — fix both:
.venv/bin/pip install --force-reinstall torch==2.10.0 --index-url https://download.pytorch.org/whl/cu128
.venv/bin/pip install 'protobuf~=5.29.6'

# 5. Verify everything works
.venv/bin/python -c "
import torch; print('torch:', torch.__version__, 'cuda:', torch.cuda.is_available())
from nemo.collections.asr.models import EncDecRNNTBPEModel; print('NeMo ASR: OK')
import kokoro; print('Kokoro: OK')
import pipecat; print('Pipecat:', pipecat.__version__)
"
# Expected: torch 2.10.0+cu128 cuda True, NeMo ASR OK, Kokoro OK, Pipecat 0.0.105

# 6. Restart Annie (see MEMORY.md for full env vars)
fuser -k 7860/tcp 2>/dev/null; sleep 2
CONTEXT_ENGINE_TOKEN="..." CONTEXT_ENGINE_URL="http://localhost:8100" \
LLAMACPP_BASE_URL="http://localhost:8003/v1" LLM_BACKEND="qwen3.5-9b" \
STT_BACKEND="nemotron" VOICE_AGENT_HOST="0.0.0.0" \
TRANSCRIPT_DIR="$HOME/.local/share/her-os-audio/transcripts" \
MEMORY_NOTES_DIR="$HOME/.local/share/her-os-audio/memories" \
nohup .venv/bin/python server.py > /tmp/annie-voice.log 2>&1 &
```

**Order matters:** Install NeMo BEFORE re-forcing CUDA torch. NeMo's deps pull CPU torch, so the torch reinstall must come AFTER. The protobuf downgrade is needed because pipecat requires ~5.29 while NeMo pulls 6.x.

**Root cause of Session 324 incident:** `rsync -avz services/annie-voice/ titan:~/...` copied local laptop's `.venv/` (x86_64 Ubuntu) over Titan's `.venv/` (aarch64 DGX). Native `.so` files (soxr, torch CUDA bindings) are architecture-specific — x86 binaries crash on aarch64. Cascading failures: broken pip → CPU-only torch → NeMo missing → Nemotron STT fails → Annie can't hear users.
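A sketch for auditing a venv after a suspected rsync incident: read each `.so` file's ELF `e_machine` field and compare against the host architecture. The constants come from the ELF specification; the check assumes little-endian headers, which holds for both x86_64 and aarch64:

```python
# Audit a venv for wrong-architecture native libraries by reading each ELF
# header's e_machine field (2 bytes at offset 18).
import platform
import struct
from pathlib import Path

E_MACHINE = {0x3E: "x86_64", 0xB7: "aarch64"}

def elf_arch(path):
    """'x86_64', 'aarch64', or None (not ELF / unknown machine)."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    (machine,) = struct.unpack_from("<H", header, 18)
    return E_MACHINE.get(machine)

def foreign_libs(venv_dir):
    """Every .so under the venv whose architecture differs from the host's."""
    host = platform.machine()
    return [p for p in Path(venv_dir).rglob("*.so*")
            if p.is_file() and elf_arch(p) not in (None, host)]
```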

## Qwen3.5-35B-A3B via llama.cpp (Recommended VLM Setup)

```bash
export LD_LIBRARY_PATH=/usr/local/cuda-13/compat:$LD_LIBRARY_PATH

~/llama-cpp-latest/build-gpu/bin/llama-server \
  --host 0.0.0.0 --port 8002 \
  -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-BF16.gguf \
  --ctx-size 32768 --n-gpu-layers 999 -fa auto --jinja
```

**Build requires:** `-DGGML_CUDA_FORCE_CUBLAS=ON` (native MMQ kernels crash on Blackwell MXFP4 tensors).
**API calls must include:** `"chat_template_kwargs": {"enable_thinking": false}` for production use.
**Full setup recipe:** See `docs/RESEARCH-QWEN35-VLM.md`
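A sketch of the corresponding API call against the server started above. Field names follow llama.cpp's OpenAI-compatible endpoint; `chat_template_kwargs` only takes effect when the server runs with `--jinja`, and the `model` value is illustrative since llama-server serves whatever model it loaded:

```python
import json
import urllib.request

payload = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Describe this image."}],
    "chat_template_kwargs": {"enable_thinking": False},  # thinking OFF for prod
    "temperature": 0.2,
}

def build_request(base_url="http://localhost:8002/v1"):
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To send: urllib.request.urlopen(build_request(), timeout=120)
```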

## Context Engine — Deployment

The Context Engine is the "shared brain" — watches JSONL transcript files written by the audio pipeline, ingests to PostgreSQL, extracts entities via local LLM (qwen3:8b), and provides BM25 search + temporal decay + daily reflection APIs.

### Architecture

```
Audio Pipeline (Docker)             Context Engine (Docker)
writes to /data/transcripts/  ──→   watches /data/transcripts/ (inotify)
                                          │
                                   ┌──────▼──────┐
                                   │ Parse JSONL │
                                   │ Build index │
                                   └──────┬──────┘
                                          │
                                   ┌──────▼──────┐
                                   │ PostgreSQL  │
                                   │ (GIN index) │
                                   └──────┬──────┘
                                          │
                                   ┌──────▼─────────┐
                                   │ Entity Extract │
                                   │ (qwen3:8b via  │
                                   │  Ollama)       │
                                   └────────────────┘
```

### Shared transcript volume

Both audio-pipeline and context-engine mount the same host directory:

| Service | Host path | Container path |
|---------|-----------|----------------|
| audio-pipeline | `/home/rajesh/.local/share/her-os-audio` | `/data` |
| context-engine | `/home/rajesh/.local/share/her-os-audio/transcripts` | `/data/transcripts` |

The audio pipeline's `jsonl_writer.py` writes to `/data/transcripts/SESSION_ID.jsonl`. The Context Engine's watcher detects these files via inotify.
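The ingest contract can be sketched as a polling loop (the real watcher uses inotify; the segment fields here are illustrative, not the actual JSONL schema):

```python
# Polling sketch of the ingest loop. Each JSONL line is one segment; the
# offsets dict lets repeated scans pick up only newly appended lines.
import json
from pathlib import Path

def read_new_segments(path, offsets):
    """Parse lines appended to one session file since the last call."""
    segments = []
    with open(path, "r", encoding="utf-8") as f:
        f.seek(offsets.get(path, 0))
        for line in f:
            line = line.strip()
            if line:
                segments.append(json.loads(line))
        offsets[path] = f.tell()
    return segments

def scan(transcript_dir, offsets):
    """One pass over all SESSION_ID.jsonl files in the transcript dir."""
    out = []
    for p in sorted(Path(transcript_dir).glob("*.jsonl")):
        out.extend(read_new_segments(str(p), offsets))
    return out
```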

### Deploy

```bash
cd ~/workplace/her/her-os/services/context-engine

# Create .env (one-time)
cat > .env << 'EOF'
POSTGRES_PASSWORD=<your-password>
CONTEXT_ENGINE_TOKEN=<generate-with: python3 -c "import secrets; print(secrets.token_urlsafe(32))">
EXTRACTION_LLM=ollama/qwen3:8b
DAILY_LLM=ollama/qwen3:8b
OLLAMA_BASE_URL=http://host.docker.internal:11434
TRANSCRIPT_DIR=/home/rajesh/.local/share/her-os-audio/transcripts
EOF

# Start
./run.sh             # or: docker compose up -d --build

# Verify
curl http://localhost:8100/health
curl -H "X-Internal-Token: $CONTEXT_ENGINE_TOKEN" http://localhost:8100/v1/stats
```

### Endpoints

| Endpoint | Auth | Description |
|----------|------|-------------|
| `GET /health` | No | Returns `{"status": "ok"}` — no metadata (SEC-1) |
| `GET /v1/stats` | Yes | Segment/session/entity counts |
| `GET /v1/context?query=...` | Yes | BM25 search + temporal decay + MMR reranking |
| `GET /v1/entities` | Yes | List extracted entities |
| `GET /v1/daily` | Yes | Daily reflection summary |
| `POST /v1/ingest/{session_id}` | Yes | Manual ingest trigger |

Auth: `X-Internal-Token` header with the value from `CONTEXT_ENGINE_TOKEN`.

### Gotchas

1. **`host.docker.internal` on Linux** — Docker doesn't add this automatically. Use `extra_hosts: ["host.docker.internal:host-gateway"]` in docker-compose.yml (already configured).
2. **asyncpg type strictness** — `DATE` columns need `datetime.date` objects, not strings. SQLAlchemy + asyncpg won't auto-coerce.
3. **Ollama `"think": false`** — Must disable Qwen3 chain-of-thought in extraction calls. Without this, timeout at 300s and empty results.
4. **JSON array schema** — Use structured JSON schema (not `format: "json"`) to force array output. Plain `format: "json"` causes some models (especially Qwen3 32B) to stop after a single `{...}` object.
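Gotchas 3 and 4 combined in one illustrative Ollama `/api/chat` payload: chain-of-thought disabled, and `format` set to a JSON schema whose top level is an object holding an array, so output can't stop after a single `{...}`. The schema fields are made up, not the Context Engine's actual extraction schema:

```python
entity_schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "label": {"type": "string"},
                },
                "required": ["name", "label"],
            },
        }
    },
    "required": ["entities"],
}

payload = {
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Extract entities: ..."}],
    "think": False,            # gotcha 3: disable Qwen3 chain-of-thought
    "format": entity_schema,   # gotcha 4: schema, not the bare string "json"
    "stream": False,
}
```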

### LLM Extraction Benchmark (GPU, 10 labeled transcripts, JSON array schema)

| Model | F1 | Precision | Recall | Avg Latency | VRAM |
|-------|-----|-----------|--------|-------------|------|
| **qwen3:8b** | **0.523** | **0.508** | **0.565** | **6.3s** | ~5 GB |
| qwen3:32b | 0.501 | 0.523 | 0.565 | 31.4s | ~20 GB |
| qwen3.5:35b-a3b | 0.452 | 0.471 | 0.598 | 8.3s | ~24 GB |
| qwen3.5:27b | 0.435 | 0.358 | 0.573 | 39.7s | ~17 GB |

Winner: **qwen3:8b** — highest F1, fastest, smallest VRAM. ADR-019 winner (Qwen3 32B) tested and confirmed equivalent quality but 5x slower. Bigger models over-extract (lower precision) — extraction is pattern matching, not reasoning.
