# Research: Docker-Based Deployment for her-os on DGX Spark

**Date:** 2026-02-24
**Context:** Phase 0 — deployment strategy for self-hosted personal AI platform on NVIDIA DGX Spark (aarch64 ARM64, GB10 Blackwell, CUDA 13.0, 128GB unified memory).

---

## Table of Contents

1. [Base Image Selection](#1-base-image-selection)
2. [Docker Compose Architecture](#2-docker-compose-architecture)
3. [How Others Package GPU ML Apps for DGX Spark](#3-how-others-package-gpu-ml-apps-for-dgx-spark)
4. [Model Handling: Bake vs. Download](#4-model-handling-bake-vs-download)
5. [NVIDIA Container Toolkit Setup](#5-nvidia-container-toolkit-setup)
6. [Single `docker compose up` Strategy](#6-single-docker-compose-up-strategy)
7. [Commercial Self-Hosted Product Distribution](#7-commercial-self-hosted-product-distribution)
8. [Image Size Analysis](#8-image-size-analysis)
9. [Multi-Stage Build Strategy](#9-multi-stage-build-strategy)
10. [Health Checks, GPU Verification, and First-Run Setup](#10-health-checks-gpu-verification-and-first-run-setup)
11. [Component Compatibility Matrix](#11-component-compatibility-matrix)
12. [Recommended Dockerfile](#12-recommended-dockerfile)
13. [Recommended docker-compose.yml](#13-recommended-docker-composeyml)
14. [First-Run Bootstrap Script](#14-first-run-bootstrap-script)
15. [Open Questions and Risks](#15-open-questions-and-risks)
16. [NGC ABI Compatibility Gotchas — The Sidecar Pattern](#16-ngc-abi-compatibility-gotchas--the-sidecar-pattern)

---

## 1. Base Image Selection

### Recommendation: `nvcr.io/nvidia/pytorch:25.11-py3`

**Why not a bare CUDA image?** The NVIDIA NGC PyTorch container is the standard for DGX Spark deployments. It includes:

- PyTorch 2.10+ with CUDA 13.0.2 pre-compiled
- cuDNN, cuBLAS, NCCL all pre-built for Blackwell (SM_121)
- Python 3.12 in the expected `/usr/local/lib/python3.12/dist-packages/` layout
- ARM64/aarch64 native support (not emulated)
- Pip constraints file at `/etc/pip/constraint.txt` to avoid version conflicts
- Pre-compiled CUDA kernels that handle the SM_120/SM_121 issue (nvrtc JIT compilation of SM_120 fails on bare CUDA, but NGC containers have pre-compiled everything)

**Alternative options (ranked):**

| Image | Size | Pros | Cons |
|-------|------|------|------|
| `nvcr.io/nvidia/pytorch:25.11-py3` | ~15 GB | Everything pre-built, tested on Blackwell | Large base |
| `nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04` | ~5 GB | Smaller, build PyTorch yourself | Must compile PyTorch (~2hr), SM_121 issues |
| `nvidia/cuda:13.0.1-cudnn-runtime-ubuntu24.04` | ~3 GB | Smallest CUDA image | No dev headers, can't compile extensions |
| `ubuntu:24.04` + pip PyTorch nightly | ~200 MB base | Full control | CUDA 13 wheel availability is intermittent |

**Key learnings from the ecosystem:**

- The natolambert/dgx-spark-setup guide documents that ARM64 + CUDA 13 is a "rare combination and most CI/CD systems don't test this configuration." Packages built against CUDA 12.x link to `libcudart.so.12` but DGX Spark only has `libcudart.so.13`, causing import failures.
- Simon Willison's review confirms "CUDA 13 appears to be very new, and a lot of the existing tutorials and libraries appear to expect CUDA 12."
- The NGC PyTorch container sidesteps these issues entirely — everything is built and tested for this exact hardware.
- Starting with release 25.01, NGC containers are optimized for Blackwell GPU architectures.

**Decision: Use `nvcr.io/nvidia/pytorch:25.11-py3` as the base.** The 15 GB base is large but eliminates the #1 risk (CUDA/PyTorch compatibility on Blackwell ARM64). For production, we can create a slimmed variant once the full dependency set is stable.

---

## 2. Docker Compose Architecture

### Yes, use Docker Compose. It is the standard pattern for DGX Spark.

NVIDIA's official dgx-spark-playbooks repository uses Docker Compose for its reference deployments. The pattern is consistent: one compose file, multiple services, GPU passed only to the services that need it.

### Service Architecture

```
┌─────────────────────────────────────────────────────┐
│                  docker-compose.yml                  │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────────┐   GPU    ┌──────────────────┐    │
│  │  her-os-core │◄────────►│  Qwen3-Embed     │    │
│  │  (FastAPI)   │          │  (15 GB VRAM)    │    │
│  │  Port 8000   │          └──────────────────┘    │
│  └──────┬───────┘                                   │
│         │                                           │
│  ┌──────┴────────────────────────────────────┐      │
│  │              Internal Network              │      │
│  ├──────────┬──────────┬──────────┬──────────┤      │
│  │          │          │          │          │      │
│  ▼          ▼          ▼          ▼          ▼      │
│ ┌────┐  ┌──────┐  ┌──────┐  ┌──────┐  ┌───────┐  │
│ │PG16│  │Redis7│  │Neo4j5│  │Qdrant│  │Kokoro │  │
│ │5432│  │ 6379 │  │ 7474 │  │ 6333 │  │(TTS)  │  │
│ │    │  │      │  │ 7687 │  │ 6334 │  │ 8880  │  │
│ └────┘  └──────┘  └──────┘  └──────┘  └───────┘  │
│  CPU      CPU       CPU       CPU      CPU/GPU     │
│                                                     │
└─────────────────────────────────────────────────────┘
```

### Design decisions:

1. **her-os-core** — Single GPU-enabled container running FastAPI + all ML inference (embedding, NER, STT). This keeps GPU memory management simple (one process owns the GPU).
2. **Infrastructure services** (PostgreSQL, Redis, Neo4j, Qdrant) — CPU-only containers. All persistent state lives in their mounted volumes, so the containers themselves are disposable.
3. **Kokoro TTS** — Can run as a sidecar using the existing `ghcr.io/remsky/kokoro-fastapi-gpu:latest` Docker image (multi-arch, ARM64 supported). Alternatively, embed in her-os-core.
4. **No Kubernetes.** For a single-node DGX Spark, Docker Compose is the right abstraction level. K8s adds complexity with no benefit for a personal AI appliance.
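The "Internal Network" in the diagram maps to an explicit Compose network. A minimal fragment (network name is illustrative, not from the final compose file; every service joining it can reach the others by service name):

```yaml
networks:
  heros-internal:
    driver: bridge

services:
  her-os-core:
    networks: [heros-internal]
  postgres:
    networks: [heros-internal]
  # redis, neo4j, qdrant, kokoro-tts join the same network
```

Compose creates a default network with these semantics anyway; declaring it explicitly matters only if we later want to isolate the databases from any host-published service.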

---

## 3. How Others Package GPU ML Apps for DGX Spark

### NVIDIA dgx-spark-playbooks (official reference)

The official NVIDIA repository uses this exact pattern:
- Base image: `nvcr.io/nvidia/pytorch:25.11-py3` (PyTorch fine-tuning playbook)
- Docker Compose for multi-service orchestration
- Volume mounts for HuggingFace cache: `-v ~/.cache/huggingface:/root/.cache/huggingface`
- GPU access via `deploy.resources.reservations.devices`
- Deployed via `docker stack deploy -c docker-compose.yml` or `docker compose up`

### Ollama

- Single binary + Docker image
- Models downloaded on first use, stored in a named volume (`ollama:/root/.ollama`)
- GPU access via `NVIDIA_VISIBLE_DEVICES=all`
- OpenAI-compatible API endpoint

### Kokoro-FastAPI

- Pre-built multi-arch Docker images (ARM64 + x86)
- Separate CPU and GPU images: `ghcr.io/remsky/kokoro-fastapi-cpu:latest` / `ghcr.io/remsky/kokoro-fastapi-gpu:latest`
- Models baked into the image (82M is small enough)

### n8n Self-Hosted AI Starter Kit

- Docker Compose template combining n8n + Ollama + Qdrant + PostgreSQL
- Single `docker compose up` brings everything up
- `.env` file for configuration

### CodeProject.AI Server

- Self-hosted AI microserver distributed via Docker Hub
- Standalone container, no external dependencies
- No off-device data transfer (local-first)

### vLLM on DGX Spark

- New pre-built vLLM Docker images optimized for DGX Spark
- Uses `nvcr.io/nvidia/vllm:25.11-py3` as the base
- Rebuilt NCCL and PyTorch for Blackwell

### Common pattern across all:
1. NGC base image (for GPU apps) or minimal base (for CPU apps)
2. Docker Compose for orchestration
3. Named volumes for model persistence
4. `.env` for configuration
5. Health check endpoints

---

## 4. Model Handling: Bake vs. Download

### Recommendation: Volume mount with first-run download

For a 15 GB embedding model (Qwen3-Embedding-8B), baking into the Docker image is not viable:

| Approach | Image Size Impact | Update Story | Cold Start |
|----------|-------------------|-------------|------------|
| **Bake into image** | +15 GB per model | Rebuild entire image | Fast (already present) |
| **Download on first run** | 0 GB | Just delete volume | Slow first time (~5-15 min) |
| **Separate model volume** | 0 GB | Replace volume only | Fast after first pull |
| **Model sidecar container** | Separate image | Update sidecar only | Depends |

**Strategy: Named Docker volume + first-run bootstrap**

```yaml
volumes:
  huggingface-cache:
    driver: local

services:
  her-os-core:
    volumes:
      - huggingface-cache:/root/.cache/huggingface
```

**First-run bootstrap** (in the application entrypoint):

```python
# bootstrap.py — runs before FastAPI starts
import os
from pathlib import Path

MODELS = {
    "Qwen/Qwen3-Embedding-8B": "embedding",
    "urchade/gliner_multi_pii-v1": "ner",
    "Systran/faster-whisper-large-v3": "stt",
}

def ensure_models():
    """Download models if not present in the HuggingFace cache."""
    from huggingface_hub import snapshot_download
    cache_dir = Path(os.environ.get("HF_HOME", "/root/.cache/huggingface"))
    for model_id, purpose in MODELS.items():
        # snapshot_download is idempotent — skips if already cached
        print(f"[bootstrap] Ensuring {model_id} ({purpose})...")
        snapshot_download(model_id, cache_dir=str(cache_dir / "hub"))
        print(f"[bootstrap] {model_id} ready.")
```
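To report "already cached" without touching the network at all, the hub cache can be probed directly. A sketch assuming the standard `models--{org}--{name}` cache layout (the helper name is ours, not a huggingface_hub API):

```python
from pathlib import Path

def is_cached(model_id: str, hf_home: str = "/root/.cache/huggingface") -> bool:
    """True if at least one snapshot of model_id exists in the HF hub cache."""
    repo_dir = Path(hf_home) / "hub" / f"models--{model_id.replace('/', '--')}"
    snapshots = repo_dir / "snapshots"
    # A repo dir with a non-empty snapshots/ folder means a prior download completed
    return snapshots.is_dir() and any(snapshots.iterdir())
```

This is useful for progress reporting ("3 of 3 models cached, skipping download") before handing off to `snapshot_download`, which remains the source of truth for completeness.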

**Why this approach:**
- NVIDIA's own playbooks use this pattern: "Standard volume mount patterns include mounting HuggingFace cache and workspace directories to prevent redundant downloads."
- Image stays at ~18-20 GB instead of 35+ GB
- Model updates don't require image rebuilds
- Customers can pre-download models on fast networks before going offline

---

## 5. NVIDIA Container Toolkit Setup

### DGX Spark: Pre-installed and ready

The NVIDIA Container Toolkit is **pre-installed and pre-configured** on DGX Spark systems. No setup needed.

**What it provides:**
- OCI hooks triggered by the `--gpus` flag
- Automatic driver and library injection into containers
- Bridge between Docker and NVIDIA drivers
- GPU device passthrough

**Verification command:**

```bash
# Verify GPU access in Docker
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
```

**For Docker Compose**, GPU access is specified declaratively:

```yaml
services:
  her-os-core:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

**Important notes:**
- The `--gpus` flag triggers OCI hooks that mount GPU devices inside the container
- `count: all` passes all GPUs (DGX Spark has 1 GPU, so `count: 1` and `count: all` are equivalent)
- The `capabilities: [gpu]` field is **required** — Compose returns an error without it
- `count` and `device_ids` are mutually exclusive
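Because `count` and `device_ids` are mutually exclusive, pinning a specific GPU uses the `device_ids` form instead:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0"]   # pin GPU 0; do not combine with count
          capabilities: [gpu]
```

On a single-GPU DGX Spark this is equivalent to `count: all`; it matters on multi-GPU hosts.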

### For customer deployments (non-DGX hardware)

If customers run on other NVIDIA GPU hardware, they need:

```bash
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

---

## 6. Single `docker compose up` Strategy

### Yes, achievable with dependency ordering and health checks

```yaml
services:
  postgres:
    # No depends_on — starts immediately
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U herosuser"]
      interval: 5s
      timeout: 3s
      retries: 10

  redis:
    # No depends_on — starts immediately
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10

  neo4j:
    # No depends_on — starts immediately
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:7474 || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 12
      start_period: 30s  # Neo4j is slow to start

  qdrant:
    # No depends_on — starts immediately
    healthcheck:
      # Qdrant's image ships without wget/curl; probe the TCP port from bash instead
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 10

  her-os-core:
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      neo4j:
        condition: service_healthy
      qdrant:
        condition: service_healthy
```

**Startup sequence:**
1. All infrastructure services start in parallel (~5-30 seconds)
2. Health checks confirm each service is ready
3. `her-os-core` starts only after all dependencies are healthy
4. her-os-core runs its own bootstrap (model download if needed, DB migrations, index creation)
5. FastAPI starts accepting requests

**Total cold start time (estimated):**
- Infrastructure services: ~30 seconds (Neo4j is the bottleneck)
- Model download (first run only): ~5-15 minutes depending on network
- Model loading into GPU: ~30-60 seconds
- Total first run: ~6-16 minutes
- Total subsequent runs: ~1-2 minutes
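The readiness gating above can also be reproduced from outside Compose, e.g. in `health-check.sh` or an integration test. A generic polling helper (a sketch; the probe callable is whatever check the caller supplies) mirrors what `condition: service_healthy` does internally:

```python
import time
from typing import Callable

def wait_until(ready: Callable[[], bool], timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll ready() until it returns True or timeout elapses, like a Compose healthcheck loop."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if ready():
                return True
        except Exception:
            pass  # probe errors count as "not ready yet", same as a failing healthcheck
        time.sleep(interval)
    return False
```

Typical use: `wait_until(lambda: urlopen("http://localhost:8000/health").status == 200, timeout=960)`, with the timeout sized to the ~16 minute worst-case first run.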

---

## 7. Commercial Self-Hosted Product Distribution

### Recommended approach: GHCR (GitHub Container Registry) + Docker Compose

For a product where "customers buy DGX Spark and pull our image," the standard pattern is:

**Tier 1: Public/Private Container Registry**

```
Customer workflow:
1. Buy DGX Spark
2. SSH into DGX Spark
3. docker login ghcr.io -u ... -p ...   (if private)
4. git clone https://github.com/her-os/deploy
5. docker compose pull
6. docker compose up -d
7. Open http://localhost:8000
```

**Registry options:**

| Registry | Pricing | Auth | Best For |
|----------|---------|------|----------|
| GHCR (GitHub) | Free (public), included with GitHub plan | GitHub PAT | Open source / small teams |
| Docker Hub | Free (1 private repo), $5+/mo | Docker login | Maximum compatibility |
| AWS ECR | $0.10/GB/mo | IAM | AWS-heavy customers |
| Self-hosted (Distribution) | Free (you host) | Custom | Air-gapped / compliance |

**Recommendation: GHCR for launch, self-hosted registry for enterprise.**

- GHCR is free for public images, integrates with GitHub Actions for CI/CD
- Supports multi-arch manifests (push ARM64 + x86 in one tag)
- Customers do `docker pull ghcr.io/her-os/her-os-core:latest`
- For enterprise/air-gapped: ship a tarball (`docker save | gzip`)

**Distribution artifacts:**

```
her-os-deploy/
├── docker-compose.yml          # Orchestration
├── .env.example                # Configuration template
├── config/
│   ├── neo4j/neo4j.conf        # Neo4j tuning
│   ├── postgres/init.sql        # DB schema
│   └── qdrant/config.yaml       # Qdrant config
├── scripts/
│   ├── install.sh               # One-liner installer
│   ├── upgrade.sh               # Version upgrade
│   ├── backup.sh                # Data backup
│   └── health-check.sh          # System verification
└── README.md                    # Setup instructions
```

**The `install.sh` one-liner:**

```bash
#!/bin/bash
# her-os installer
set -euo pipefail

echo "=== her-os installer ==="
echo "Checking prerequisites..."

# Check Docker
docker --version || { echo "Docker not installed"; exit 1; }

# Check NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi > /dev/null 2>&1 || {
    echo "NVIDIA Container Toolkit not working. GPU access required."
    exit 1
}

# Check disk space (need ~50 GB for images + models + data)
AVAIL=$(df -BG --output=avail / | tail -1 | tr -d ' G')
[[ $AVAIL -gt 50 ]] || { echo "Need 50+ GB free disk space (have ${AVAIL} GB)"; exit 1; }

# Pull and start
docker compose pull
docker compose up -d

echo "=== her-os is starting ==="
echo "First run will download ML models (~20 GB). This takes 5-15 minutes."
echo "Dashboard: http://localhost:8000"
echo "Logs: docker compose logs -f her-os-core"
```
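The `df -BG` parsing in `install.sh` is GNU-coreutils-specific. A portable equivalent of the same 50 GB gate (hypothetical helper names, Python stdlib only) could back a cross-platform preflight script:

```python
import shutil

def free_gb(path: str = "/") -> int:
    """Free disk space at path, in whole gigabytes (decimal GB)."""
    return shutil.disk_usage(path).free // 10**9

def check_disk(required_gb: int = 50, path: str = "/") -> None:
    """Exit with an error message if less than required_gb is free, else return."""
    avail = free_gb(path)
    if avail < required_gb:
        raise SystemExit(f"Need {required_gb}+ GB free disk space (have {avail} GB)")
```

Note this checks `/`; if Docker's data root lives on another filesystem, pass that path instead.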

---

## 8. Image Size Analysis

### Estimated sizes

| Component | Size | Notes |
|-----------|------|-------|
| NGC PyTorch base (`25.11-py3`) | ~15 GB | Includes PyTorch, CUDA, cuDNN, NCCL |
| her-os Python dependencies | ~2 GB | FastAPI, transformers, gliner, pipecat, etc. |
| her-os application code | ~50 MB | Python source, configs, templates |
| **Total her-os-core image** | **~17-18 GB** | Without models |
| Qwen3-Embedding-8B model | ~15 GB | Downloaded to volume on first run |
| GLiNER model | ~500 MB | Downloaded to volume on first run |
| faster-whisper model | ~3 GB | Downloaded to volume on first run |
| Kokoro-82M model | ~300 MB | Small enough to bake or download |
| **Total model storage** | **~19 GB** | In named volume, persists across restarts |
| PostgreSQL 16 image | ~230 MB | Official ARM64 |
| Redis 7 image | ~40 MB | Official ARM64 |
| Neo4j 5 image | ~600 MB | Official ARM64 |
| Qdrant image | ~100 MB | Official ARM64 |
| **Total all images** | **~18-19 GB** | |
| **Total disk usage** | **~37-38 GB** | Images + models + initial data |

### Is this acceptable?

**Yes.** For context:
- DGX Spark has 3.7 TB disk (3.5 TB free on Titan)
- 38 GB is ~1% of available storage
- NVIDIA's own vLLM image is ~20 GB
- The NGC PyTorch container alone is 15 GB
- LLM-focused appliances routinely use 50-200+ GB for model storage

**Comparison with similar products:**

| Product | Total Disk | Models |
|---------|-----------|--------|
| Ollama (with Llama 3.1 70B) | ~45 GB | Downloaded on first use |
| Open WebUI + Ollama | ~50 GB | Downloaded on first use |
| vLLM on DGX Spark | ~20 GB image | Model loaded from volume |
| **her-os** | **~38 GB** | **Models in volume** |

---

## 9. Multi-Stage Build Strategy

### Recommended: Two-stage build (builder + runtime)

The NGC PyTorch base already contains most of what we need. The multi-stage approach focuses on keeping build artifacts (compilers, headers, source) out of the final image.

```dockerfile
# ============================================================
# Stage 1: Builder — install and compile all dependencies
# ============================================================
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS builder

# System dependencies for C extensions
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Create virtualenv in a known location.
# --system-site-packages keeps the NGC-built PyTorch visible inside the venv;
# a fully isolated venv would hide it, and pip would pull an incompatible wheel.
RUN python -m venv --system-site-packages /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# ============================================================
# Stage 2: Runtime — slim production image
# ============================================================
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS runtime

# Only runtime system libraries (no build-essential)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 \
    wget \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy virtualenv from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy application code
COPY . /app
WORKDIR /app

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
    CMD wget -q --spider http://localhost:8000/health || exit 1

# Expose ports
EXPOSE 8000

# Entrypoint with bootstrap
CMD ["python", "-m", "her_os.main"]
```

### Size savings from multi-stage

| Component | Builder only | Saved |
|-----------|-------------|-------|
| build-essential | ~250 MB | Removed |
| libpq-dev headers | ~80 MB | Removed |
| pip cache | ~500 MB | Removed |
| **Estimated savings** | **~800 MB** | |

**Note:** Since both stages use the same NGC base, the savings are modest (~800 MB). The main benefit is security (no compilers in production) rather than size. True multi-stage savings are dramatic when going from a `devel` base to a `runtime` base, but the NGC PyTorch container is already a "batteries-included" runtime.

### More aggressive optimization (future)

For production releases, consider:

```dockerfile
# Use runtime-only CUDA base (saves ~10 GB but requires pre-built wheels)
FROM nvidia/cuda:13.0.1-cudnn-runtime-ubuntu24.04 AS runtime

# Install pre-built PyTorch wheel
RUN pip install torch==2.12.0+cu128 --index-url https://download.pytorch.org/whl/cu128
```

This could reduce the image from ~18 GB to ~8 GB, but requires all dependencies to have pre-built ARM64 wheels (risky given the CUDA 13 ecosystem maturity). **Defer this optimization to Phase 2.**

---

## 10. Health Checks, GPU Verification, and First-Run Setup

### Health check endpoint

```python
# her_os/health.py
from fastapi import APIRouter
import torch
import time

router = APIRouter()
START_TIME = time.time()

@router.get("/health")
async def health():
    """Comprehensive health check."""
    checks = {}

    # 1. GPU available
    gpu_ok = torch.cuda.is_available()
    props = torch.cuda.get_device_properties(0) if gpu_ok else None
    checks["gpu"] = {
        "available": gpu_ok,
        "device_name": torch.cuda.get_device_name(0) if gpu_ok else None,
        "vram_total_gb": round(props.total_memory / 1e9, 1) if gpu_ok else None,
        "vram_used_gb": round(torch.cuda.memory_allocated(0) / 1e9, 1) if gpu_ok else None,
    }

    # 2. Models loaded
    from her_os.models import embedding_model, ner_model
    checks["models"] = {
        "embedding": embedding_model is not None,
        "ner": ner_model is not None,
    }

    # 3. Database connections
    checks["databases"] = {
        "postgres": await check_postgres(),
        "redis": await check_redis(),
        "neo4j": await check_neo4j(),
        "qdrant": await check_qdrant(),
    }

    all_ok = all([
        checks["gpu"]["available"],
        checks["models"]["embedding"],
        checks["models"]["ner"],
        all(checks["databases"].values()),
    ])

    return {
        "status": "healthy" if all_ok else "degraded",
        "checks": checks,
        "uptime_seconds": time.time() - START_TIME,
    }

from fastapi.responses import JSONResponse

@router.get("/health/gpu")
async def gpu_health():
    """Quick GPU-only health check (for Docker HEALTHCHECK)."""
    if not torch.cuda.is_available():
        # FastAPI ignores Flask-style (body, status) tuples; set the status code explicitly
        return JSONResponse(status_code=503, content={"status": "error", "message": "GPU not available"})
    return {"status": "ok", "gpu": torch.cuda.get_device_name(0)}
```

### First-run setup sequence

The application entrypoint orchestrates startup:

```python
# her_os/main.py
import asyncio
import logging
from her_os.bootstrap import ensure_models, run_migrations, create_indexes

logger = logging.getLogger("her-os")

async def startup():
    """First-run and every-run startup sequence."""

    # Phase 1: Verify GPU (fail fast)
    import torch
    if not torch.cuda.is_available():
        logger.error("CUDA not available. Check NVIDIA Container Toolkit.")
        raise RuntimeError("GPU required but not available")
    logger.info(f"GPU: {torch.cuda.get_device_name(0)}, "
                f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # Phase 2: Download models if needed (first run: ~5-15 min)
    await ensure_models()

    # Phase 3: Load models into GPU
    logger.info("Loading embedding model...")
    from her_os.models import load_embedding_model, load_ner_model
    await load_embedding_model()  # ~30s, 15 GB VRAM
    await load_ner_model()        # ~5s, 500 MB VRAM

    # Phase 4: Run database migrations
    logger.info("Running database migrations...")
    await run_migrations()

    # Phase 5: Create graph indexes (idempotent)
    logger.info("Ensuring graph indexes...")
    await create_indexes()

    logger.info("=== her-os ready ===")

# FastAPI app with lifespan
from fastapi import FastAPI
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app: FastAPI):
    await startup()
    yield
    # Shutdown: cleanup

app = FastAPI(lifespan=lifespan)
```

### GPU verification in Docker Compose

```yaml
services:
  her-os-core:
    healthcheck:
      # Two-phase health check:
      # 1. During start_period: allow model download/loading
      # 2. After start_period: require healthy response
      test: ["CMD-SHELL", "wget -q --spider http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      start_period: 300s  # 5 minutes for model download on first run
      retries: 3
```

---

## 11. Component Compatibility Matrix (aarch64 + CUDA 13)

This table summarizes the actual state of each component on DGX Spark based on Phase 0 validation and this research:

| Component | aarch64 | CUDA 13 | Docker ARM64 | GPU on ARM64 | Status |
|-----------|---------|---------|--------------|--------------|--------|
| PyTorch | Yes (nightly cu128) | Yes (via NGC) | Yes (NGC) | **Yes** | Validated |
| Qwen3-Embedding-8B | Yes (transformers) | Yes (via PyTorch) | Yes (NGC) | **Yes** | Validated (15 GB) |
| GLiNER | Yes (pip) | N/A (CPU) | Yes (NGC base) | CPU only (fine) | Validated (120ms) |
| faster-whisper | Yes (pip) | **No** (PyPI) / **Yes** (Docker) | Yes (mekopa) | **CPU** (PyPI) / **GPU** (Docker) | CTranslate2 PyPI wheels lack CUDA; `mekopa/whisperx-blackwell` ships CTranslate2 compiled from source with CUDA |
| Kokoro-82M | Yes (pip) | **No** (SM_120 issue) | Yes (kokoro-fastapi) | **CPU only** | nvrtc SM_120 not recognized |
| Pipecat | Yes (pip) | N/A (orchestrator) | Yes | N/A | Validated |
| FAISS | Yes (faiss-cpu) | **No** (no ARM64 GPU wheel) | N/A | **CPU only** | Official GPU is x86 only |
| cuGraph/RAPIDS | Yes (conda) | Yes (conda/Docker) | Yes (rapidsai) | **Yes** (conda) | No pip wheel, needs conda |
| PostgreSQL 16 | Yes | N/A | Yes (official) | N/A | Validated |
| Redis 7 | Yes | N/A | Yes (official) | N/A | Validated |
| Neo4j 5 | Yes (since 4.4+) | N/A | Yes (official) | N/A | Validated |
| Qdrant | Yes | N/A | Yes (official) | N/A | Validated |
| asyncpg | Yes (ARM64 wheel) | N/A | Yes | N/A | Validated (0.07ms) |
| Mem0 | Yes (pip) | N/A | Yes | N/A | Validated |
| Graphiti | Yes (pip) | N/A | Yes | N/A | Validated |
| FunASR (emotion2vec+) | Yes (pip) | Yes (via PyTorch) | Yes (python:3.12-slim) | **Yes** | Validated (33ms, sidecar) |
| audeering wav2vec2 | Yes (transformers) | Yes (via PyTorch) | Yes (python:3.12-slim) | **Yes** | Validated (10ms, sidecar) |

**Key insight:** GPU inference on ARM64 is limited to PyTorch-native operations. Libraries that do their own CUDA compilation (CTranslate2, nvrtc JIT) fail on Blackwell SM_120. The NGC PyTorch container handles this by providing pre-compiled kernels. For STT/TTS, CPU mode is viable and validated.

**SER insight:** FunASR and audeering both use pure PyTorch operations (no custom CUDA kernels), so they work on Blackwell via standard torch+cu128. However, they require torchaudio which is ABI-incompatible with the NGC container's custom torch. Solution: sidecar container with `python:3.12-slim` + standard torch 2.10.0+cu128. See Section 16.

---

## 12. Recommended Dockerfile

```dockerfile
# ============================================================
# her-os Dockerfile
# Target: NVIDIA DGX Spark (aarch64, GB10 Blackwell, CUDA 13)
# Base: NVIDIA NGC PyTorch (includes CUDA 13.0.2, Python 3.12)
# ============================================================

# ---- Stage 1: Builder ----
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS builder

# Prevent interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# System build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libpq-dev \
    git \
    && rm -rf /var/lib/apt/lists/*

# Create app virtualenv (keep NGC's prebuilt torch visible via system site-packages,
# or pip would resolve an incompatible PyPI torch wheel for our dependencies)
RUN python -m venv --system-site-packages /opt/her-os-venv
ENV PATH="/opt/her-os-venv/bin:$PATH"
ENV PIP_NO_CACHE_DIR=1

# Install Python dependencies (layer cached unless requirements change)
COPY requirements/base.txt /tmp/requirements-base.txt
COPY requirements/ml.txt /tmp/requirements-ml.txt
COPY requirements/voice.txt /tmp/requirements-voice.txt

RUN pip install --upgrade pip setuptools wheel && \
    pip install -r /tmp/requirements-base.txt && \
    pip install -r /tmp/requirements-ml.txt && \
    pip install -r /tmp/requirements-voice.txt

# ---- Stage 2: Runtime ----
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS runtime

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Runtime-only system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq5 \
    wget \
    curl \
    tini \
    && rm -rf /var/lib/apt/lists/*

# Copy virtualenv from builder
COPY --from=builder /opt/her-os-venv /opt/her-os-venv
ENV PATH="/opt/her-os-venv/bin:$PATH"

# HuggingFace cache directory (mounted as volume)
ENV HF_HOME=/data/models
ENV TRANSFORMERS_CACHE=/data/models
ENV TORCH_HOME=/data/torch-cache

# Application code
COPY src/ /app/src/
COPY config/ /app/config/
COPY alembic/ /app/alembic/
COPY alembic.ini /app/
WORKDIR /app

# Non-root user (security)
RUN useradd -m -u 1000 heros && \
    mkdir -p /data/models /data/torch-cache /data/entity-files && \
    chown -R heros:heros /data /app
USER heros

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=300s --retries=3 \
    CMD wget -q --spider http://localhost:8000/health || exit 1

EXPOSE 8000

# Use tini as init system (handles signals properly)
ENTRYPOINT ["tini", "--"]
CMD ["python", "-m", "uvicorn", "her_os.main:app", \
     "--host", "0.0.0.0", "--port", "8000", \
     "--workers", "1", "--loop", "uvloop"]
```

### Requirements files structure

```
requirements/
├── base.txt      # FastAPI, uvicorn, asyncpg, SQLAlchemy, redis, httpx, etc.
├── ml.txt        # transformers, gliner, mem0ai, graphiti, faiss-cpu, sentence-transformers
└── voice.txt     # faster-whisper, kokoro, pipecat-ai[anthropic,whisper,kokoro]
```
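The NGC base's pip constraints file (noted in Section 1) can be enforced during the builder stage's installs via pip's `PIP_CONSTRAINT` environment variable, so application requirements cannot silently upgrade NGC-pinned packages. A fragment for the builder stage (the constraint path is as documented for NGC images; verify it exists in the chosen tag):

```dockerfile
# Apply NGC's version pins to every pip install in this stage
ENV PIP_CONSTRAINT=/etc/pip/constraint.txt
RUN pip install -r /tmp/requirements-base.txt
```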

---

## 13. Recommended docker-compose.yml

```yaml
# ============================================================
# her-os — Docker Compose for DGX Spark
# Usage: docker compose up -d
# First run downloads ~20 GB of ML models (5-15 min)
# ============================================================

name: her-os

services:
  # ---- Infrastructure (CPU, no GPU needed) ----

  postgres:
    image: postgres:16-alpine
    container_name: her-os-postgres
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-heros}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?Set POSTGRES_PASSWORD in .env}
      POSTGRES_DB: ${POSTGRES_DB:-heros}
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./config/postgres/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    ports:
      - "${POSTGRES_PORT:-15432}:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-heros}"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          memory: 2G

  redis:
    image: redis:7-alpine
    container_name: her-os-redis
    restart: unless-stopped
    command: redis-server --requirepass ${REDIS_PASSWORD:?Set REDIS_PASSWORD in .env} --maxmemory 1gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data
    ports:
      - "${REDIS_PORT:-16379}:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          memory: 1G

  neo4j:
    image: neo4j:5-community
    container_name: her-os-neo4j
    restart: unless-stopped
    environment:
      NEO4J_AUTH: neo4j/${NEO4J_PASSWORD:?Set NEO4J_PASSWORD in .env}
      NEO4J_PLUGINS: '["apoc"]'
      NEO4J_server_memory_heap_initial__size: 1G
      NEO4J_server_memory_heap_max__size: 2G
      NEO4J_server_memory_pagecache_size: 1G
    volumes:
      - neo4j-data:/data
      - neo4j-logs:/logs
    ports:
      - "${NEO4J_HTTP_PORT:-17474}:7474"
      - "${NEO4J_BOLT_PORT:-17687}:7687"
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:7474 || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 12
      start_period: 30s
    deploy:
      resources:
        limits:
          memory: 4G

  qdrant:
    image: qdrant/qdrant:latest
    container_name: her-os-qdrant
    restart: unless-stopped
    volumes:
      - qdrant-data:/qdrant/storage
      - ./config/qdrant/config.yaml:/qdrant/config/production.yaml:ro
    ports:
      - "${QDRANT_HTTP_PORT:-16333}:6333"
      - "${QDRANT_GRPC_PORT:-16334}:6334"
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:6333/healthz || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 10
    deploy:
      resources:
        limits:
          memory: 4G

  # ---- Application (GPU-enabled) ----

  her-os-core:
    build:
      context: .
      dockerfile: Dockerfile
    image: ghcr.io/her-os/her-os-core:${HER_OS_VERSION:-latest}
    container_name: her-os-core
    restart: unless-stopped
    environment:
      # Database connections
      DATABASE_URL: postgresql+asyncpg://${POSTGRES_USER:-heros}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB:-heros}
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379/0
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_USER: neo4j
      NEO4J_PASSWORD: ${NEO4J_PASSWORD}
      QDRANT_HOST: qdrant
      QDRANT_PORT: 6333
      # Model config
      HF_HOME: /data/models
      TRANSFORMERS_CACHE: /data/models
      TORCH_HOME: /data/torch-cache
      EMBEDDING_MODEL: ${EMBEDDING_MODEL:-Qwen/Qwen3-Embedding-8B}
      NER_MODEL: ${NER_MODEL:-urchade/gliner_multi_pii-v1}
      # Claude API (for LLM reasoning)
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:?Set ANTHROPIC_API_KEY in .env}
      # Performance
      CUDA_VISIBLE_DEVICES: "0"
      OMP_NUM_THREADS: "4"
      # Voice (optional)
      KOKORO_ENDPOINT: http://kokoro-tts:8880
    volumes:
      - model-cache:/data/models
      - torch-cache:/data/torch-cache
      - ${HER_OS_DATA_DIR:-~/her-os-data}/entities:/data/entity-files  # Host bind mount (git-friendly, inspectable)
      - ./config:/app/config:ro
    ports:
      - "${HER_OS_PORT:-8000}:8000"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      neo4j:
        condition: service_healthy
      qdrant:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      start_period: 300s  # 5 min for first-run model download
      retries: 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 120G  # Leave headroom for infrastructure services

  # ---- Optional: Kokoro TTS as sidecar ----

  kokoro-tts:
    image: ghcr.io/remsky/kokoro-fastapi-gpu:latest
    container_name: her-os-kokoro
    restart: unless-stopped
    profiles:
      - voice  # Only starts with: docker compose --profile voice up
    ports:
      - "${KOKORO_PORT:-18880}:8880"
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:8880/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 30s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

# ---- Volumes ----

volumes:
  postgres-data:
    driver: local
  redis-data:
    driver: local
  neo4j-data:
    driver: local
  neo4j-logs:
    driver: local
  qdrant-data:
    driver: local
  model-cache:
    driver: local
  torch-cache:
    driver: local
  # NOTE: entity-files uses host bind mount (not a volume)
  # Mapped to ${HER_OS_DATA_DIR:-~/her-os-data}/entities on the host
  # This enables: ls, git init, git diff, tar for backup

# ---- Networks ----

networks:
  default:
    name: her-os-network
    driver: bridge
```

### .env.example

```bash
# ============================================================
# her-os — Environment Configuration
# Copy to .env and fill in values
# ============================================================

# Version
HER_OS_VERSION=latest

# Database
POSTGRES_USER=heros
POSTGRES_PASSWORD=CHANGE_ME_postgres_secret
POSTGRES_DB=heros
POSTGRES_PORT=15432

# Redis
REDIS_PASSWORD=CHANGE_ME_redis_secret
REDIS_PORT=16379

# Neo4j
NEO4J_PASSWORD=CHANGE_ME_neo4j_secret
NEO4J_HTTP_PORT=17474
NEO4J_BOLT_PORT=17687

# Qdrant
QDRANT_HTTP_PORT=16333
QDRANT_GRPC_PORT=16334

# Claude API (required for LLM reasoning)
ANTHROPIC_API_KEY=sk-ant-...

# Application
HER_OS_PORT=8000

# Data directory (host bind mount — entity files live here, git-friendly)
HER_OS_DATA_DIR=~/her-os-data

# Models (override defaults)
# EMBEDDING_MODEL=Qwen/Qwen3-Embedding-8B
# NER_MODEL=urchade/gliner_multi_pii-v1

# Voice (optional)
# KOKORO_PORT=18880
```

---

## 14. First-Run Bootstrap Script

```bash
#!/bin/bash
# scripts/install.sh — her-os first-run installer for DGX Spark
set -euo pipefail

BOLD='\033[1m'
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
NC='\033[0m'

echo -e "${BOLD}=== her-os installer ===${NC}"
echo ""

# ---- Prerequisites ----

echo -e "${BOLD}Checking prerequisites...${NC}"

# Docker
if ! command -v docker &> /dev/null; then
    echo -e "${RED}[FAIL] Docker not installed${NC}"
    exit 1
fi
echo -e "${GREEN}[OK]${NC} Docker $(docker --version | cut -d' ' -f3 | tr -d ',')"

# Docker Compose
if ! docker compose version &> /dev/null; then
    echo -e "${RED}[FAIL] Docker Compose not available${NC}"
    exit 1
fi
echo -e "${GREEN}[OK]${NC} $(docker compose version)"

# NVIDIA Container Toolkit
if docker run --rm --gpus all nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi > /dev/null 2>&1; then
    echo -e "${GREEN}[OK]${NC} NVIDIA Container Toolkit (GPU accessible in Docker)"
else
    echo -e "${RED}[FAIL] Cannot access GPU in Docker.${NC}"
    echo "  Install NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/"
    exit 1
fi

# Disk space
AVAIL_GB=$(df -BG --output=avail . | tail -1 | tr -d ' G')
if [ "$AVAIL_GB" -lt 50 ]; then
    echo -e "${RED}[FAIL] Need 50+ GB free disk (have ${AVAIL_GB} GB)${NC}"
    exit 1
fi
echo -e "${GREEN}[OK]${NC} Disk space: ${AVAIL_GB} GB free"

# GPU info
GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null || echo "unknown")
GPU_MEM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader 2>/dev/null || echo "unknown")
echo -e "${GREEN}[OK]${NC} GPU: ${GPU_NAME} (${GPU_MEM})"

echo ""

# ---- Configuration ----

if [ ! -f .env ]; then
    echo -e "${YELLOW}Creating .env from template...${NC}"
    cp .env.example .env
    echo -e "${YELLOW}[ACTION REQUIRED] Edit .env and set passwords + API key${NC}"
    echo "  nano .env"
    exit 0
fi

echo -e "${GREEN}[OK]${NC} .env file exists"
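
# Suggested guard (not in the original script): refuse to start while
# placeholder secrets from .env.example are still present
if grep -q 'CHANGE_ME' .env 2>/dev/null; then
    echo -e "${RED}[FAIL] .env still contains CHANGE_ME placeholders; edit .env first${NC}"
    exit 1
fi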
echo ""

# ---- Pull & Start ----

echo -e "${BOLD}Pulling Docker images...${NC}"
docker compose pull

echo ""
echo -e "${BOLD}Starting her-os...${NC}"
docker compose up -d

echo ""
echo -e "${BOLD}=== her-os is starting ===${NC}"
echo ""
echo "Infrastructure services (PostgreSQL, Redis, Neo4j, Qdrant) start in ~30s."
echo ""
echo -e "${YELLOW}FIRST RUN: ML models (~20 GB) will download automatically.${NC}"
echo -e "${YELLOW}This takes 5-15 minutes depending on your network speed.${NC}"
echo ""
echo "Monitor progress:"
echo "  docker compose logs -f her-os-core"
echo ""
echo "Check health:"
echo "  curl http://localhost:${HER_OS_PORT:-8000}/health"
echo ""
echo "Dashboard:"
echo "  http://localhost:${HER_OS_PORT:-8000}"
echo ""
```

---

## 15. Open Questions and Risks

### Risks

| # | Risk | Severity | Mitigation |
|---|------|----------|------------|
| R1 | NGC PyTorch image tags change/deprecate | MEDIUM | Pin exact tag (`25.11-py3`), test before upgrading |
| R2 | CUDA 13 wheel availability for new Python packages | HIGH | Use NGC base (pre-built), fallback to CPU mode |
| R3 | CTranslate2 (faster-whisper) GPU on ARM64 | RESOLVED | PyPI wheels are CPU-only, but `mekopa/whisperx-blackwell` Docker has CTranslate2 compiled from source with CUDA |
| R4 | Docker Hub DNS timeouts on Titan | KNOWN | Use `--dns 8.8.8.8` or pre-pull images |
| R5 | 128 GB unified memory contention (GPU + all services) | MEDIUM | Memory limits on infrastructure services, monitor usage |
| R6 | First-run model download fails (network issues) | LOW | Bootstrap is idempotent, retry on next start |
| R7 | NGC container license restrictions for commercial use | LOW | NVIDIA NGC containers are generally Apache 2.0 or permissive; verify per-image |

### Open Questions — RESOLVED (session 56, 2026-02-24)

| # | Question | Decision | Rationale |
|---|----------|----------|-----------|
| Q1 | Kokoro TTS: in her-os-core or sidecar? | **Sidecar** (`--profile voice`) | Cleaner separation of concerns. Extra container is not a concern. |
| Q2 | Support non-DGX Spark hardware? | **DGX Spark only** (for now) | Single-arch ARM64 simplifies CI/CD. Multi-arch deferred. |
| Q3 | RAPIDS/cuGraph via Docker? | **Open** (Phase 2) | RAPIDS has ARM64 Docker images. Add as optional profile later. |
| Q4 | Entity files: Docker volume or host bind mount? | **Host bind mount** (`~/her-os-data/`) | Enables `ls`, `git init`, `git diff`, `tar`. Simple backup. Aligns with human-readable files (ADR-006). |
| Q5 | Backup strategy? | **`scripts/backup.sh`** | Dump PG, copy Neo4j, snapshot Qdrant, tar entity files. Needed before Sprint 1. |
| Q6 | How to handle upgrades? | `docker compose pull && docker compose up -d` | Alembic for DB schema migrations. |
| Q7 | CI/CD for ARM64 images? | **Build on actual DGX Spark** | Push to GHCR from Titan/Beast. No cross-compile complexity. |
| Q8 | Run container as root or non-root? | **Non-root** (UID 1000 `heros`) | **Revised session 57:** Phase 0 benchmarks all ran as non-root user without sudo. `--gpus all` maps `/dev/nvidia*` with correct permissions for any user. Root was a false requirement from NGC defaults, not CUDA needs. |
| Q9 | How to handle version upgrades? | **`scripts/upgrade.sh`** | Backup → pull → up → health check → rollback if fail. Alembic for DB schema. |

### Container User: Non-root (Revised, Session 57)

**Original decision (session 56):** Root, because "CUDA driver interface requires elevated access."

**Revised (session 57):** Non-root user (`heros`, UID 1000). Phase 0 validated 20 of 21 benchmarks as a non-root user (rajesh, UID 1000) without sudo. CUDA, PyTorch, RAPIDS (cuGraph, cuVS), Whisper, Kokoro — all worked without elevated privileges. The Docker `--gpus all` flag maps `/dev/nvidia*` device nodes with appropriate permissions for any container user. Root was the NGC *default*, not a CUDA *requirement*.

**Benefits of non-root:**
- Entity files on bind mount (`~/her-os-data/`) are owned by UID 1000, matching the host user
- No `chown` workarounds needed
- Better security posture (principle of least privilege)
- Standard Docker best practice
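
In Dockerfile terms, the non-root decision is only a few lines. A sketch assuming the NGC base from §1; the `heros` name and `/data` paths are illustrative, chosen to line up with the compose file's mount points:

```dockerfile
# Sketch only: non-root user matching host UID 1000
FROM nvcr.io/nvidia/pytorch:25.11-py3

RUN groupadd --gid 1000 heros \
 && useradd --uid 1000 --gid 1000 --create-home heros \
 && mkdir -p /data/models /data/torch-cache /data/entity-files \
 && chown -R heros:heros /data

USER heros
```

Because UID 1000 inside the container matches UID 1000 on the host, files written to the `~/her-os-data/entities` bind mount need no `chown` workaround.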

---

## Summary & Recommendation

### The Pattern

```
Customer buys DGX Spark
  --> SSH in
  --> git clone her-os-deploy
  --> cp .env.example .env && nano .env
  --> ./scripts/install.sh
  --> docker compose up -d
  --> Wait 5-15 min (first run model download)
  --> Open http://localhost:8000
```

### Key Decisions (ADR-017, approved session 56)

1. **Base image:** `nvcr.io/nvidia/pytorch:25.11-py3` (NGC, Blackwell-optimized, pre-compiled CUDA)
2. **Orchestration:** Docker Compose (not Kubernetes, not bare Docker)
3. **Models:** Volume-mounted, downloaded on first run (not baked into image)
4. **GPU access:** `deploy.resources.reservations.devices` in Compose
5. **Image size:** ~18 GB (image) + ~19 GB (models) = ~37 GB total (1% of DGX Spark disk)
6. **Distribution:** GHCR for public/private images, `.env` for configuration
7. **Health checks:** Multi-layer (GPU, models, databases), 5-minute start period for first run
8. **Voice pipeline:** Kokoro TTS as **sidecar** container (`--profile voice`). Cleaner separation of concerns.
9. **Target hardware:** **DGX Spark only** (ARM64). No multi-arch for now.
10. **Entity files:** **Host bind mount** (`~/her-os-data/entities/`). Git-friendly, inspectable, simple backup.
11. **CI/CD:** **Build on actual DGX Spark**, push to GHCR. No cross-compile.
12. **Backup:** `scripts/backup.sh` needed (PG dump, Neo4j copy, Qdrant snapshot, entity files tar).
13. **Container user:** **Non-root** (`heros`, UID 1000). Revised in session 57: root was an NGC default, not a CUDA requirement (see §15). Matching host UID 1000 also fixes bind mount permissions.
14. **Upgrades:** `scripts/upgrade.sh` — backup → pull → up → health check → rollback on failure.

### What This Gives Us

- **Reproducible:** Same image, same behavior, every time
- **One command:** `docker compose up -d` brings the entire stack online
- **Self-contained:** No external dependencies except the Anthropic API key
- **Upgradeable:** `docker compose pull && docker compose up -d`
- **Git-friendly data:** Entity files on the host filesystem — `cd ~/her-os-data && git log`
- **Customer-ready:** Install script validates prerequisites, guides setup

---

## 16. NGC ABI Compatibility Gotchas — The Sidecar Pattern

**Date:** 2026-02-27
**Context:** Integrating dual-model SER (emotion2vec+ large + audeering wav2vec2) alongside the WhisperX audio pipeline.

### The Problem

The audio pipeline runs on `mekopa/whisperx-blackwell:latest`, which is built on NGC `pytorch:24.12-py3`. This container ships a **custom PyTorch 2.6.0a0** that is ABI-incompatible with standard pip packages:

```
NGC torch:  2.6.0a0+b5b1008  (custom build, Blackwell SM_121 → SM_90 compat)
pip torch:  2.10.0+cu128     (standard release, native SM_120 support)
```

When you `pip install torchaudio` inside the NGC container, it **replaces** the NGC torch with the standard pip version, breaking Blackwell GPU support. This also happens with `funasr`, which depends on torchaudio.

### The 4 Failed Attempts

| Attempt | What We Tried | Why It Failed |
|---------|--------------|---------------|
| 1 | `pip install funasr` inside NGC container | Pulled torchaudio → replaced NGC torch → CUDA broke |
| 2 | `pip install --no-deps funasr` + manual deps | funasr imports torchaudio at module level → ImportError |
| 3 | `pip install torchaudio==2.6.0` (matching NGC) | No such version exists on pip (NGC builds custom) |
| 4 | Constrain torch: `pip install funasr torch==2.6.0a0` | pip can't find NGC's custom 2.6.0a0, installs standard → breaks |

### The Solution: Sidecar Container

Instead of modifying the NGC container, run SER in its own container:

```
Audio Pipeline (NGC)           SER Sidecar (python:3.12-slim)
┌────────────────────┐         ┌────────────────────┐
│ whisperx-blackwell │         │ torch 2.10+cu128   │
│ torch 2.6.0a0      │  HTTP   │ funasr             │
│ pyannote           │ ──────→ │ transformers       │
│ Whisper large-v3   │  :9101  │ emotion2vec+ large │
│ Port: 9100         │         │ audeering wav2vec2 │
└────────────────────┘         └────────────────────┘
```

**Why this works:**
- Each container has its own Python environment with compatible packages
- Communication via HTTP (FastAPI on both sides) — simple, debuggable
- Graceful degradation: if SER sidecar is down, STT pipeline continues without emotion
- Independent scaling: SER model updates don't require rebuilding the STT container
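
The graceful-degradation contract is easy to sketch from the STT side. A hedged example: the `/analyze` route, the JSON payload shape, and the `fetch_emotion` helper name are all illustrative, not the sidecar's actual API:

```python
import json
import urllib.error
import urllib.request
from typing import Optional

def fetch_emotion(audio_path: str, base_url: str = "http://localhost:9101") -> Optional[dict]:
    """Ask the SER sidecar for emotion scores; return None if it is unreachable."""
    req = urllib.request.Request(
        f"{base_url}/analyze",  # illustrative route, not the real API
        data=json.dumps({"path": audio_path}).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return json.load(resp)
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return None  # sidecar down: the transcript simply carries no emotion field

# The STT pipeline continues either way:
segment = {"text": "hello", "emotion": fetch_emotion("/tmp/clip.wav")}
```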

### The Native Venv Discovery

During benchmarking, we discovered that **Titan's host Python** works perfectly for SER:

```bash
python3 -m venv /tmp/ser-venv
source /tmp/ser-venv/bin/activate
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install funasr transformers soundfile
```

`torch 2.10.0+cu128` has **native Blackwell SM_120 support** (no arch spoofing needed). This validated the sidecar approach — same packages, just containerized.

### Docker Hub IPv6 Connectivity Issue on Titan

**Date:** 2026-02-27

When attempting to build Docker images that pull from Docker Hub (hub.docker.com), Titan's Docker daemon fails because it tries IPv6 first and Titan's IPv6 connectivity is broken:

```
ERROR: failed to solve: python:3.12-slim: failed to resolve source metadata
  dial tcp [2600:1f18:2148:bc01:...]:443: connect: no route to host
```

**Attempted workarounds that didn't work:**
1. `DOCKER_BUILDKIT=0` (legacy builder) — same IPv6 failure
2. `docker buildx build --network host` — same failure
3. Disabling IPv6 via kernel sysctl — requires root, plus a Docker daemon config change and restart

**Root cause:** Docker daemon on Titan resolves Docker Hub to IPv6 (AAAA records) but IPv6 routing is not configured. This is a common issue on DGX Spark systems where IPv6 is enabled in the kernel but not configured at the network level.

**Workaround: Native venv deployment.** For services that can't pull their base images, deploy using a native Python venv on the Titan host. The SER sidecar uses this approach in production (see deployment section below).

**Future fix:** Configure Docker daemon to prefer IPv4 by adding `{"ip6tables": false, "fixed-cidr-v6": ""}` to `/etc/docker/daemon.json`, or configure IPv6 networking properly on Titan.
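
Written out, that daemon config would be the following (untested on Titan; if `/etc/docker/daemon.json` already has other keys, merge rather than overwrite, then `sudo systemctl restart docker`):

```json
{
  "ip6tables": false,
  "fixed-cidr-v6": ""
}
```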

### Production Deployment: Native Venv for SER Sidecar

**Date:** 2026-02-27
**Status:** Running on Titan

Given the Docker Hub IPv6 issue, the SER sidecar runs as a **native Python process** managed by a persistent venv on Titan:

```bash
# One-time setup
python3 -m venv ~/ser-venv
~/ser-venv/bin/pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
~/ser-venv/bin/pip install funasr transformers soundfile numpy fastapi 'uvicorn[standard]' python-multipart

# Run (from the her-os repo directory)
cd ~/workplace/her/her-os/services/ser-pipeline
~/ser-venv/bin/uvicorn main:app --host 0.0.0.0 --port 9101
```

**Model downloads happen on first run** (~1.2 GB emotion2vec from ModelScope, ~631 MB audeering from HuggingFace). Models are cached in `~/.cache/modelscope/` and `~/.cache/huggingface/` respectively. Subsequent startups load from cache in ~15s.

**Architecture:**

```
┌─────────────────────────┐        ┌─────────────────────────────┐
│ Audio Pipeline (Docker) │        │ SER Sidecar (native venv)   │
│ NGC pytorch:24.12-py3   │        │ ~/ser-venv (python 3.12)    │
│ torch 2.6.0a0 (custom)  │  HTTP  │ torch 2.10.0+cu128 (pip)    │
│ WhisperX + pyannote     │ ─────→ │ emotion2vec+ large (funasr) │
│ Port: 9100 (Docker)     │  :9101 │ audeering wav2vec2 (HF)     │
│ Enrollment: /data/      │        │ silero-vad (torch.hub)      │
└─────────────────────────┘        └─────────────────────────────┘
```

**Why native venv works well for SER:**
- No Docker Hub dependency at deploy time
- Shares Titan's GPU drivers directly (no container overhead)
- `torch 2.10.0+cu128` has native SM_120 support — no JIT compilation
- Model caches persist across restarts
- Simple to update: `git pull && restart uvicorn`

**Trade-off vs. Docker:** Loses container isolation and reproducibility. Acceptable for a single-user home server (Titan). For multi-user or commercial deployment, fix the IPv6 issue and use the Docker approach.
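
To make the native-venv process survive reboots, a systemd user unit is one option. A sketch assuming the venv and repo paths above; the unit name is ours:

```ini
# ~/.config/systemd/user/ser-sidecar.service (hypothetical unit name)
[Unit]
Description=her-os SER sidecar (native venv)
After=network-online.target

[Service]
WorkingDirectory=%h/workplace/her/her-os/services/ser-pipeline
ExecStart=%h/ser-venv/bin/uvicorn main:app --host 0.0.0.0 --port 9101
Restart=on-failure

[Install]
WantedBy=default.target
```

Enable with `systemctl --user enable --now ser-sidecar`, and run `loginctl enable-linger` so the unit starts without an active SSH session.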

### CUDA JIT Compilation Cache

The NGC audio pipeline container JIT-compiles CUDA kernels (PTX → SASS) for Blackwell SM_121 at startup, because the image ships pre-built kernels only for SM_90 (Hopper) and relies on `TORCH_CUDA_ARCH_LIST` spoofing.

| Scenario | Warmup Time |
|----------|-------------|
| First-ever run (no cache) | ~280s |
| With cold cache (new image) | ~140s |
| With warm cache (same image, restart) | ~10-30s |

**Cache persistence:** The cache is mounted as a Docker volume at `CUDA_CACHE_PATH=/cuda-cache`, mapped to `~/.local/share/her-os-audio/cuda-cache/` on the host. This survives container restarts but may be invalidated when the Docker image changes.
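
In compose terms, the cache persistence described above might look like this (the service name is illustrative; only `CUDA_CACHE_PATH` and the host path are from the text):

```yaml
# Sketch: persist the CUDA JIT cache across container restarts
services:
  her-os-audio:
    environment:
      CUDA_CACHE_PATH: /cuda-cache
    volumes:
      - ~/.local/share/her-os-audio/cuda-cache:/cuda-cache
```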

**The SER sidecar doesn't need JIT compilation** — `torch 2.10.0+cu128` (from pip) includes native SM_120 kernels, so SER cold-starts in ~15s (model loading only, no kernel compilation).

### Dockerfile Gotcha: Inline Python Class Definitions

When baking model weights into Docker images using `RUN python -c "..."`, **inline class definitions break** because Docker's shell layer mangles multi-line Python:

```dockerfile
# BROKEN — Docker's shell interpreter corrupts multi-line class definitions
RUN python -c "\
class MyModel(nn.Module):\n\
    def __init__(self):\n\
        ...\n\
model = MyModel.from_pretrained('...')"
```

**Fix:** Use `huggingface_hub.snapshot_download()` which downloads all model files without needing custom classes:

```dockerfile
# WORKS — simple function call, no class definitions needed
RUN python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download('audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'); \
print('model weights cached')"
```

The custom classes are only needed at inference time (in `ser.py`), not at download time. `snapshot_download` caches config, weights, and processor files so `from_pretrained()` loads from cache at runtime.

### Rules for NGC Container Modifications

1. **NEVER `pip install` packages that depend on torch/torchaudio/torchvision** inside an NGC container. They will replace the custom NGC torch and break GPU support.
2. **Safe to install:** pure Python packages (FastAPI, uvicorn, requests, etc.) that don't link against PyTorch C++ ABI.
3. **When you need torchaudio-dependent packages:** use a sidecar (Docker or native venv) with standard torch from PyPI.
4. **Test ABI compatibility** before adding any ML package: `python -c "import torch; print(torch.__version__)"` — if it changed from the NGC version, you broke it.
5. **Avoid inline Python class definitions in Dockerfiles** — use `snapshot_download()` or simple import-and-construct calls.
6. **Check Docker Hub connectivity on Titan** before relying on Docker builds — IPv6 routing issues may prevent pulling base images.

### Resource Budget

| Component | VRAM | Latency | Deployment |
|-----------|------|---------|-----------|
| WhisperX large-v3 + pyannote | ~9 GB | 2-3s | Docker (her-os-audio) |
| emotion2vec+ large (300M) | 627 MB | 33ms | Native venv (ser-venv) |
| audeering wav2vec2 (200M) | 631 MB | 10ms | Native venv (ser-venv) |
| silero-vad (~1M) | ~1 MB | ~1ms | Native venv (ser-venv) |
| **Total** | **~10.3 GB** | **~2-3s + 44ms** | Mixed |

128 GB unified memory → ~8% utilization. 44ms SER overhead is negligible.
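
The totals are simple arithmetic over the table:

```python
# VRAM figures from the table above, in GB
vram = {"whisperx+pyannote": 9.0, "emotion2vec+ large": 0.627,
        "audeering wav2vec2": 0.631, "silero-vad": 0.001}
total_gb = sum(vram.values())
share_pct = total_gb / 128 * 100   # 128 GB unified memory on DGX Spark
print(f"{total_gb:.1f} GB -> {share_pct:.1f}% of unified memory")
# prints: 10.3 GB -> 8.0% of unified memory
```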

---

## Sources

- [NGC -- DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/ngc.html)
- [NVIDIA Container Runtime for Docker -- DGX Spark](https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html)
- [NVIDIA DGX Spark Porting Guide](https://docs.nvidia.com/dgx/dgx-spark-porting-guide/porting/compilation.html)
- [NVIDIA dgx-spark-playbooks (GitHub)](https://github.com/NVIDIA/dgx-spark-playbooks)
- [dgx-spark-playbooks pytorch-fine-tune docker-compose.yml](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/docker-compose.yml)
- [natolambert/dgx-spark-setup (ML training guide)](https://github.com/natolambert/dgx-spark-setup)
- [Simon Willison: NVIDIA DGX Spark -- great hardware, early days](https://simonwillison.net/2025/Oct/14/nvidia-dgx-spark/)
- [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
- [PyTorch Release 25.11](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-11.html)
- [CUDA ARM64 NGC Images](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-arm64)
- [Docker Compose GPU Support](https://docs.docker.com/compose/how-tos/gpu-support/)
- [Compose Deploy Specification](https://docs.docker.com/reference/compose-file/deploy/)
- [Docker Multi-Stage Builds](https://docs.docker.com/build/building/multi-stage/)
- [Kokoro-FastAPI (Docker wrapper)](https://github.com/remsky/Kokoro-FastAPI)
- [Qdrant Docker Installation](https://qdrant.tech/documentation/guides/installation/)
- [Neo4j Docker ARM64 support](https://github.com/neo4j/docker-neo4j/issues/140)
- [Self-Hosting Mem0 Docker Guide](https://dev.to/mem0/self-hosting-mem0-a-complete-docker-deployment-guide-154i)
- [PyTorch Nightly cu128 aarch64 wheels issue](https://github.com/pytorch/pytorch/issues/157548)
- [FAISS Installation Guide](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md)
- [CTranslate2 Hardware Support](https://opennmt.net/CTranslate2/hardware_support.html)
- [RAPIDS cuGraph ARM64 Docker](https://hub.docker.com/r/rapidsai/rapidsai/)
- [Docker Model Runner on DGX Spark](https://www.docker.com/blog/new-nvidia-dgx-spark-docker-model-runner/)
- [Optimizing PyTorch Docker images](https://mveg.es/posts/optimizing-pytorch-docker-images-cut-size-by-60percent/)
