# Research: ML Model Loading & Caching in Docker Containers

**Date:** 2026-02-25
**Context:** Phase 0 — optimizing the first-boot and cold-start experience for her-os on NVIDIA DGX Spark (ARM64/aarch64, GB10 Blackwell, 128 GB unified memory, always-on).

**Problem:** Current setup downloads ~17 GB of ML models from HuggingFace on first run. Cold start takes 5-15 minutes depending on network speed. This is unacceptable for a self-hosted appliance product. Users expect `docker compose up` to "just work" without a multi-minute wait staring at download progress bars.

---

## Table of Contents

1. [Current State & Model Inventory](#1-current-state--model-inventory)
2. [Approach 1: Bake Models into Docker Image](#2-approach-1-bake-models-into-docker-image)
3. [Approach 2: Volume Mount with Pre-Download Script](#3-approach-2-volume-mount-with-pre-download-script)
4. [Approach 3: Multi-Stage Build with Model Download Stage](#4-approach-3-multi-stage-build-with-model-download-stage)
5. [Approach 4: S3/Object Storage Model Pull](#5-approach-4-s3object-storage-model-pull)
6. [Approach 5: Init Container (Docker Compose)](#6-approach-5-init-container-docker-compose)
7. [Approach 6: Model Server Sidecar](#7-approach-6-model-server-sidecar)
8. [What NVIDIA Recommends for NGC-Based Deployments](#8-what-nvidia-recommends-for-ngc-based-deployments)
9. [Comparison Matrix](#9-comparison-matrix)
10. [First-Boot Experience Design](#10-first-boot-experience-design)
11. [Recommendation for her-os](#11-recommendation-for-her-os)
12. [Implementation Plan](#12-implementation-plan)
13. [Open Questions](#13-open-questions)

---

## 1. Current State & Model Inventory

### Models Used by her-os

| Model | Purpose | Size (disk) | Size (VRAM) | Loading Profile |
|-------|---------|-------------|-------------|-----------------|
| Qwen/Qwen3-Embedding-0.6B | Real-time embedding | ~1.2 GB | 1.1 GB | Always loaded |
| Qwen/Qwen3-Embedding-8B | Batch/offline embedding | ~15.0 GB | 14.1 GB | Loaded on demand (nightly re-index) |
| openai/whisper-large-v3 | Speech-to-text | ~3.1 GB | 8.75 GB | Loaded on demand (when audio arrives) |
| urchade/gliner_medium-v2.1 | Named entity recognition | ~0.5 GB | ~0.5 GB | Always loaded |
| hexgrad/Kokoro-82M | Text-to-speech | ~0.3 GB | ~0.5 GB | Sidecar container (separate) |
| **Total** | | **~20.1 GB** | **~25 GB** | |
| **Real-time subset** | 0.6B + GLiNER | **~1.7 GB** | **~1.6 GB** | Must be instant |
| **Full config** | All models | **~20.1 GB** | **~25 GB** | Can be progressive |

### Current Architecture (as of session 58)

```
docker compose up -d
  └─ her-os-core starts
       └─ Application code boots (FastAPI + uvicorn)
            └─ First request triggers model download from HuggingFace Hub
                 └─ 5-15 minutes of downloading on first run
                      └─ Models stored in Docker named volume (model-cache)
                           └─ Subsequent restarts are fast (models in volume)
```

**Pain points:**
- First run: 5-15 minute delay before the system is usable
- Health check has 5-minute `start_period` to accommodate model download
- Network dependency at startup (fails if HuggingFace is down or slow)
- No progress visibility to the user during download
- No way to pre-warm the cache before starting the service

---

## 2. Approach 1: Bake Models into Docker Image

### How it works

Download models during `docker build` and include them as image layers. The resulting Docker image contains everything needed to run — no network required at startup.

### Who does this

| Company/Product | Details |
|----------------|---------|
| **Replicate (Cog)** | Cog explicitly bakes model weights into the Docker image. Their `cog build` command copies weights into the container. They recommend keeping weights in the same directory as `predict.py`. The `--separate-weights` flag puts weights in a dedicated Docker layer for faster rebuilds when only code changes. |
| **Baseten (Truss)** | Bundles model artifacts into the container image. Optimized for their cloud (not self-hosted). |
| **Modal** | Packages code + weights together, achieves sub-second cold starts through aggressive image caching on their infrastructure. |
| **NVIDIA NIM** | Each NIM is packaged as a complete Docker image with model weights included. NIM containers are per-model or per-model-family. NGC catalog hosts pre-built images. |
| **Many HuggingFace tutorials** | Common pattern in tutorials: `RUN huggingface-cli download model-id` in Dockerfile, then COPY into runtime stage. |

### Implementation pattern

```dockerfile
# ---- Stage: Model Downloader ----
FROM python:3.12-slim AS model-downloader

RUN pip install huggingface_hub

# Download each model at build time
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}

RUN huggingface-cli download Qwen/Qwen3-Embedding-0.6B --local-dir /models/embedding-0.6b
RUN huggingface-cli download urchade/gliner_medium-v2.1 --local-dir /models/gliner
RUN huggingface-cli download openai/whisper-large-v3 --local-dir /models/whisper-large-v3
# Skip 8B — too large for image, load on demand

# ---- Stage: Runtime ----
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS runtime

# Copy models from downloader stage
COPY --from=model-downloader /models /data/models
# ... rest of runtime setup
```
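With models baked into the image, startup can fail fast if a layer is missing or corrupted instead of discovering it on the first request. A minimal sketch (the function and module placement are hypothetical; the paths mirror the `COPY` destination in the Dockerfile above):

```python
# sketch: fail-fast startup check that the baked model directories are
# actually present in the image; paths mirror the COPY destination above
from pathlib import Path

BAKED_MODELS = {
    "embedding-0.6b": "/data/models/embedding-0.6b",
    "gliner": "/data/models/gliner",
}

def verify_baked_models(models: dict[str, str] = BAKED_MODELS) -> list[str]:
    """Return names of models whose directories are missing or empty."""
    missing = []
    for name, path in models.items():
        p = Path(path)
        if not p.is_dir() or not any(p.iterdir()):
            missing.append(name)
    return missing

# at boot: if verify_baked_models() returns anything, log it and exit
# non-zero before serving traffic, so a bad image fails loudly
```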

### Analysis

| Factor | Assessment |
|--------|-----------|
| **Image size** | Balloons from ~18 GB to ~23 GB (real-time models plus Whisper, as in the Dockerfile above) or ~38 GB (all models). The 8B embedding model alone is 15 GB. |
| **Cold start** | **Excellent.** Zero network needed. Models are immediately available when container starts. ~30s to load into GPU memory. |
| **Build time** | **Terrible.** Every build downloads 5-20 GB of models. BuildKit cache can help on rebuilds, but the first build on a new machine takes 10-20 minutes. |
| **Model update** | **Painful.** Must rebuild the entire image to update a model. Every model version bump means re-pushing 20+ GB to the registry. |
| **Registry push/pull** | **Very slow.** A 35+ GB image takes significant time to push to GHCR and pull on the target machine. Docker layer deduplication helps if only the model layer changed, but initial pull is brutal. |
| **ARM64 compat** | Works — model weights are architecture-independent (just tensors). |
| **Offline** | **Perfect.** No network required after initial pull. Ideal for air-gapped environments. |
| **Disk usage** | **Wasteful.** Models stored in image layers AND loaded into memory. If you also have a volume for updated models, you have duplication. |

### Verdict for her-os

**Partial adoption.** Baking the small real-time models (0.6B embedding + GLiNER = ~1.7 GB) into the image is viable and solves the "first request works instantly" problem. But baking the 8B embedding model (15 GB) or Whisper (3.1 GB) makes the image unmanageably large for self-hosted distribution.

---

## 3. Approach 2: Volume Mount with Pre-Download Script

### How it works

Models are stored in a Docker named volume. A separate script (or the application entrypoint) downloads models on first run if they are not present. On subsequent restarts, models are found in the volume and loading is fast.

### Who does this

| Company/Product | Details |
|----------------|---------|
| **Ollama** | Models stored in a named volume (`ollama:/root/.ollama`). Users run `ollama pull llama3` to pre-download. Container starts instantly if models are already pulled. |
| **Open WebUI** | Depends on Ollama's model management. Users must explicitly pull models after first start. |
| **HuggingFace TGI** | Volume mount pattern: `-v $volume:/data`. Models download on first start to the mounted volume. `--shm-size 1g` for shared memory. |
| **vLLM** | Same volume mount pattern: `-v ~/.cache/huggingface:/root/.cache/huggingface`. Downloads model on first start if not cached. Cold starts can take 10+ minutes for larger models. |
| **NVIDIA DGX Spark Playbooks** | Official NVIDIA pattern: "Standard volume mount patterns include mounting HuggingFace cache and workspace directories to prevent redundant downloads." |
| **LocalAI** | Allows specifying models to auto-download at startup via `PRELOAD_MODELS` env var. |

### Implementation pattern (current her-os approach, enhanced)

```yaml
# docker-compose.yml
services:
  model-warmup:
    image: ghcr.io/her-os/her-os-core:latest
    container_name: her-os-model-warmup
    restart: "no"
    command: ["python", "-m", "her_os.bootstrap", "--download-only"]
    environment:
      HF_HOME: /data/models
      HF_TOKEN: ${HF_TOKEN:-}
    volumes:
      - model-cache:/data/models

  her-os-core:
    # ... existing config
    depends_on:
      model-warmup:
        condition: service_completed_successfully
    volumes:
      - model-cache:/data/models
```

```python
# her_os/bootstrap.py
"""Pre-download models to the shared volume."""
import os
import sys
import time
from pathlib import Path

from huggingface_hub import snapshot_download

MODELS = {
    "Qwen/Qwen3-Embedding-0.6B": {"purpose": "real-time embedding", "priority": 1},
    "urchade/gliner_medium-v2.1": {"purpose": "NER", "priority": 1},
    "openai/whisper-large-v3": {"purpose": "STT", "priority": 2},
    "Qwen/Qwen3-Embedding-8B": {"purpose": "batch embedding", "priority": 3},
}

def download_models(priority_cutoff=None):
    cache_dir = Path(os.environ.get("HF_HOME", "/data/models"))
    cache_dir.mkdir(parents=True, exist_ok=True)

    for model_id, info in sorted(MODELS.items(), key=lambda x: x[1]["priority"]):
        if priority_cutoff is not None and info["priority"] > priority_cutoff:
            print(f"[bootstrap] Skipping {model_id} (priority {info['priority']} > {priority_cutoff})")
            continue
        print(f"[bootstrap] Ensuring {model_id} ({info['purpose']})...")
        t0 = time.time()
        # Idempotent: already-cached files are skipped and interrupted
        # downloads resume where they left off.
        snapshot_download(model_id, cache_dir=str(cache_dir))
        print(f"[bootstrap] {model_id} ready ({time.time() - t0:.1f}s)")

if __name__ == "__main__":
    # --download-only is accepted for the warmup entrypoint; this module
    # only downloads (never loads models), so the flag is a no-op here.
    _ = "--download-only" in sys.argv
    cutoff = os.environ.get("DOWNLOAD_PRIORITY")
    download_models(priority_cutoff=int(cutoff) if cutoff else None)
```

### Analysis

| Factor | Assessment |
|--------|-----------|
| **Image size** | **Small.** Image stays at ~18 GB (code + deps only). Models in separate volume. |
| **Cold start** | **Slow on first run** (5-15 min download). Fast on subsequent restarts (~30s model loading). |
| **Build time** | **Fast.** No model download during build. |
| **Model update** | **Easy.** Delete volume, restart. Or run `huggingface-cli download` targeting the volume. |
| **Registry push/pull** | **Fast.** Small image, quick to push/pull. |
| **ARM64 compat** | Works — `snapshot_download` pulls the correct files. |
| **Offline** | **Fails on first run.** Requires network access. But can pre-warm: `docker compose run model-warmup`. |
| **Disk usage** | **Efficient.** One copy of models in the volume, loaded into GPU memory at runtime. |

### Verdict for her-os

**Good foundation but insufficient alone.** The current approach works for subsequent restarts but the first-run experience is unacceptable. Needs to be combined with a pre-download step or partial baking.

---

## 4. Approach 3: Multi-Stage Build with Model Download Stage

### How it works

A dedicated build stage downloads models at `docker build` time, then `COPY --from=` transfers them into the runtime image. Similar to Approach 1, but the multi-stage structure keeps the build clean and allows caching the model download stage independently.

### Who does this

| Company/Product | Details |
|----------------|---------|
| **Replicate Cog** | Uses this exact pattern internally. The `--separate-weights` flag creates a dedicated layer for model weights. |
| **Many production ML teams** | Common pattern documented in Docker's own ML best practices guides. |
| **Collabnix (Docker community)** | Published guide on "Optimize Your AI Containers with Docker Multi-Stage Builds" recommending this pattern for ML models. |

### Implementation pattern

```dockerfile
# ---- Stage 1: Model Download ----
FROM python:3.12-slim AS models

RUN pip install huggingface_hub

# Use BuildKit cache mount so re-builds don't re-download
RUN --mount=type=cache,target=/tmp/hf-cache \
    huggingface-cli download Qwen/Qwen3-Embedding-0.6B \
      --local-dir /models/embedding-0.6b \
      --cache-dir /tmp/hf-cache

RUN --mount=type=cache,target=/tmp/hf-cache \
    huggingface-cli download urchade/gliner_medium-v2.1 \
      --local-dir /models/gliner \
      --cache-dir /tmp/hf-cache

# ---- Stage 2: Builder (existing) ----
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS builder
# ... install Python deps ...

# ---- Stage 3: Runtime ----
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS runtime

# Copy ONLY the real-time models (small ones)
COPY --from=models /models/embedding-0.6b /data/models/embedding-0.6b
COPY --from=models /models/gliner /data/models/gliner

# Large models still downloaded to volume at runtime
ENV HF_HOME=/data/models
# ... rest of setup
```

### Analysis

| Factor | Assessment |
|--------|-----------|
| **Image size** | **Moderate.** +1.7 GB for real-time models. Large models still in volume. |
| **Cold start** | **Good for real-time, slow for large models.** Real-time models available instantly. Whisper/8B download on first use. |
| **Build time** | **Moderate.** BuildKit `--mount=type=cache` prevents re-downloading on rebuilds. First build still downloads ~1.7 GB. |
| **Model update** | **Requires rebuild for baked models.** Volume models update independently. |
| **Registry push/pull** | **Acceptable.** ~20 GB image (vs 18 without models). Docker layer caching means model layer only re-pulled when model version changes. |
| **ARM64 compat** | Works perfectly. |
| **Offline** | **Partial.** Real-time models work offline. Large models need network on first use. |

### Key advantage: BuildKit cache mounts

The `--mount=type=cache` directive persists the HuggingFace download cache between builds on the same machine. This means:
- First build: downloads models (slow)
- Second build (code change only): models already cached (fast)
- But: cache is local to the build machine, not transferable to CI/CD without additional setup

### Verdict for her-os

**Strong candidate for hybrid approach.** Bake real-time models (0.6B + GLiNER) at build time. Download large models (8B, Whisper) to volume at runtime. Best of both worlds.

---

## 5. Approach 4: S3/Object Storage Model Pull

### How it works

Models are stored in an S3-compatible object store (AWS S3, MinIO, GCS, Azure Blob). At startup, the application or an init container pulls models from the object store to a local volume.

### Who does this

| Company/Product | Details |
|----------------|---------|
| **AWS SageMaker** | Models stored in S3. SageMaker copies artifacts from S3 to `/opt/ml/model` inside the container at startup. This is their primary pattern. |
| **NVIDIA Triton** | Supports model repositories on S3, GCS, and Azure Blob natively. Path prefix: `s3://bucket/path/to/repo`, `gs://bucket/path`, `as://account/container/path`. Credential configuration via `TRITON_CLOUD_CREDENTIAL_PATH` env var. |
| **Google Vertex AI** | Uses GCS for model storage. Vertex pulls models to the container at deployment time. |
| **Azure ML** | Uses Azure Blob Storage. Models mounted to the container via Azure ML infrastructure. |
| **Kubernetes + S3** | Common pattern: init container runs `aws s3 sync s3://bucket/models /models/`, then main container reads from `/models/`. |

### Implementation pattern (MinIO on DGX Spark)

```yaml
# docker-compose.yml
services:
  minio:
    image: minio/minio
    command: server /data
    environment:
      # explicit credentials, matching the mc alias below
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"

  model-sync:
    image: minio/mc
    restart: "no"
    entrypoint: >
      sh -c "
        mc alias set local http://minio:9000 minioadmin minioadmin &&
        mc mirror --overwrite local/models /models
      "
    volumes:
      - model-cache:/models
    depends_on:
      - minio

  her-os-core:
    depends_on:
      model-sync:
        condition: service_completed_successfully
    volumes:
      - model-cache:/data/models
```

### Analysis

| Factor | Assessment |
|--------|-----------|
| **Image size** | **Small.** No models in image. |
| **Cold start** | **Depends on storage location.** Local MinIO: fast (network-local). Remote S3: same as HuggingFace download. |
| **Build time** | **Fast.** No models in build. |
| **Model update** | **Easy.** Upload new model to S3, restart. Version control via S3 versioning. |
| **Registry push/pull** | **Fast.** Small image. |
| **ARM64 compat** | Works. MinIO has ARM64 images. |
| **Offline** | **Only if S3 is local.** Remote S3 requires network. |
| **Complexity** | **High.** Adds MinIO service, credential management, sync orchestration. |

### Verdict for her-os

**Overkill for single-node self-hosted.** This pattern shines in multi-node cloud deployments (many containers sharing one model store). For a single DGX Spark appliance, adding MinIO just to serve models adds complexity with no benefit over a Docker volume. However, the pattern is useful to understand for future household multi-device scenarios (see `docs/RESEARCH-DATA-STORAGE.md`).

---

## 6. Approach 5: Init Container (Docker Compose)

### How it works

Docker Compose supports init-container-like behavior using `depends_on` with `condition: service_completed_successfully`. A short-lived service downloads models to a shared volume, then exits. The main application service starts only after the init container completes successfully.

### Who does this

| Company/Product | Details |
|----------------|---------|
| **Kubernetes ecosystem** | Init containers are a first-class Kubernetes concept. ML deployments commonly use them to download models from S3 to local NVMe before the GPU pod starts. Google Cloud GKE documentation recommends this for large model deployments. |
| **HashiCorp Vault (Docker Compose)** | Uses init containers to provision secrets before the main service starts. Same pattern, different use case. |
| **Docker Compose community** | Feature request #6855 on docker/compose led to support via `depends_on` + `service_completed_successfully`. Available since Compose v1.29+. |
| **EKS ML deployments** | Amazon EKS documentation (Jan 2026) recommends init containers for "loading multi-gigabyte model weights for GPU inference." |

### Implementation pattern

```yaml
services:
  # Init container: download models, then exit
  model-warmup:
    image: ghcr.io/her-os/her-os-core:${HER_OS_VERSION:-latest}
    container_name: her-os-model-warmup
    restart: "no"   # CRITICAL: do not restart after completion
    entrypoint: ["python", "-m", "her_os.bootstrap"]
    environment:
      HF_HOME: /data/models
      HF_TOKEN: ${HF_TOKEN:-}
      DOWNLOAD_PRIORITY: ${MODEL_DOWNLOAD_PRIORITY:-2}
    volumes:
      - model-cache:/data/models
    # No GPU needed — just downloading files
    # No healthcheck — uses exit code

  her-os-core:
    # ... existing config
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      neo4j:
        condition: service_healthy
      qdrant:
        condition: service_healthy
      model-warmup:
        condition: service_completed_successfully
```

### Analysis

| Factor | Assessment |
|--------|-----------|
| **Image size** | **Same.** Reuses the main application image. No additional image needed. |
| **Cold start** | **Deterministic.** Models are guaranteed present when the main service starts. No "first request triggers download" surprise. The download time is still 5-15 minutes, but it happens in a dedicated phase with clear progress output. |
| **Build time** | **No impact.** Models not in build. |
| **Model update** | **Easy.** Delete volume, restart. Init container re-downloads. |
| **User experience** | **Much better.** `docker compose logs -f model-warmup` shows clear download progress. Main service does not start until models are ready. No ambiguous "starting..." state. |
| **Failure handling** | **Clean.** If download fails, init container exits with error. Main service never starts. User sees clear error in logs. No partial/broken state. |
| **ARM64 compat** | Works. Same image, no GPU needed for download. |
| **Offline** | **Fails on first run.** Same as volume approach — needs network for initial download. |

### Key advantage: Separation of concerns

The init container pattern separates "model provisioning" from "model serving." This means:
- Model download runs without GPU reservation
- Progress is clearly visible in a dedicated log stream
- Failure is clean and retryable
- The main service can assume models exist (simpler code)
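The "clean and retryable" property can be made explicit in the bootstrap entrypoint with a bounded retry wrapper. A sketch (attempt counts and delays are illustrative defaults, not her-os settings):

```python
# sketch: bounded retry with linear backoff around an idempotent download
# step; attempts/base_delay are illustrative, not her-os configuration
import time

def with_retries(fn, attempts=3, base_delay=5.0):
    """Run fn(); retry on failure with growing delay, then re-raise.

    Safe to wrap around huggingface_hub.snapshot_download because the HF
    cache resumes partial downloads, so a retry never starts from zero.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts:
                # init container exits non-zero; compose never starts the app
                raise
            delay = base_delay * attempt
            print(f"[bootstrap] attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

Because the final failure propagates as a non-zero exit code, `condition: service_completed_successfully` is never satisfied and the main service stays down, exactly the failure mode described above.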

### Verdict for her-os

**Strong recommendation.** This is the right pattern for the volume-based models (Whisper, 8B embedding). Combined with baking real-time models into the image (Approach 3), this gives us: instant startup for the core functionality + deterministic download for the larger models.

---

## 7. Approach 6: Model Server Sidecar

### How it works

A dedicated inference server (Triton, vLLM, TGI) runs as a separate container and handles model loading, GPU memory management, and inference. The application container sends inference requests to the sidecar over HTTP/gRPC.

### Who does this

| Company/Product | Details |
|----------------|---------|
| **NVIDIA Triton Inference Server** | The canonical model server. Supports multiple model formats (ONNX, TensorRT, PyTorch, TensorFlow). Model repository pattern. Health endpoints. Dynamic model loading/unloading. Auto-completion of model configs. |
| **HuggingFace TGI** | Text Generation Inference server. Docker-first deployment. Volume mount for model weights. Exposes OpenAI-compatible API. |
| **vLLM** | High-performance LLM serving. Docker deployment with HF cache volume mount. Models download on first start. OpenAI API compatible. |
| **BentoML** | Model serving framework. Builds "Bentos" (Docker images with models + serving code). |
| **Kokoro FastAPI** | Already used by her-os as a TTS sidecar. Pre-built image: `ghcr.io/remsky/kokoro-fastapi-gpu:latest`. |

### Implementation pattern (Triton for embedding + NER)

```yaml
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:25.11-py3
    container_name: her-os-triton
    command: >
      tritonserver
      --model-repository=/models
      --model-control-mode=explicit
      --load-model=embedding-0.6b
      --load-model=gliner
    volumes:
      - model-repo:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  her-os-core:
    # Application calls Triton for inference
    environment:
      TRITON_ENDPOINT: http://triton:8001
    # No GPU needed for the core app!
```

### Analysis

| Factor | Assessment |
|--------|-----------|
| **Image size** | **Large.** Triton image is ~20 GB. Total stack: 18 GB (core) + 20 GB (Triton) = 38 GB before models. |
| **Cold start** | **Depends on model loading.** Triton itself starts quickly, but model loading into GPU memory takes same time as direct loading. |
| **Build time** | **Fast.** Pre-built Triton image, no custom build. |
| **Model update** | **Excellent.** Update model in repository, Triton reloads dynamically. No restart needed. `--model-control-mode=explicit` enables hot-reload. |
| **GPU management** | **Excellent.** Triton handles multi-model GPU memory, batching, concurrent inference, model scheduling. This is its core competency. |
| **Complexity** | **High.** Adds another 20 GB container, model repository structure, gRPC client code, health monitoring. |
| **ARM64 compat** | **Triton supports ARM64** on NGC. But the model repository format requires model conversion (to TorchScript, ONNX, or TensorRT). HuggingFace models do not work directly. |
| **Overhead** | Network hop for every inference call (HTTP/gRPC). Adds ~1-5ms latency per request. |

### Verdict for her-os

**Not recommended for Phase 1.** Triton is designed for multi-model, multi-GPU, multi-tenant serving at cloud scale. For a single-user appliance with 4 models, the complexity and disk overhead are not justified. The GPU memory management benefit is real but premature — her-os has 128 GB unified memory and a single user. Revisit when we need dynamic model loading, A/B testing, or model versioning (Phase 3+).

The Kokoro TTS sidecar pattern (already in use) is the right level of sidecar for her-os: a pre-built image that handles one concern completely.

---

## 8. What NVIDIA Recommends for NGC-Based Deployments

### Official NVIDIA patterns

1. **NGC Containers with volume mounts** (recommended for DGX Spark):
   - NVIDIA's DGX Spark Playbooks use Docker Compose with volume mounts for model storage
   - "Standard volume mount patterns include mounting HuggingFace cache and workspace directories to prevent redundant downloads."
   - This is the documented pattern in the DGX Spark User Guide

2. **NVIDIA NIM (self-contained images)**:
   - NIMs are "packaged as container images on a per-model or per-model-family basis"
   - Each NIM includes model weights in the image
   - Designed for enterprise deployment with NGC API keys
   - Air-gap support: `download-to-cache` tool pre-downloads to local cache, then container serves from cache
   - Not applicable to her-os (NIM is for serving single models, not for multi-model applications)

3. **NVIDIA NIM Operator (Kubernetes)**:
   - Kubernetes operator for deploying NIM containers
   - Handles model lifecycle, scaling, health checks
   - Not applicable (her-os uses Docker Compose, not Kubernetes)

4. **NVIDIA AI Workbench**:
   - Desktop tool for running NGC containers locally
   - Manages model downloads and container lifecycle
   - Interesting UX reference but not a deployment pattern

### NVIDIA's air-gap deployment pattern

NVIDIA NIM's air-gap deployment is the most relevant reference for self-hosted:

```bash
# Step 1: On a machine with internet, download models to local cache
docker run --rm -v /path/to/cache:/opt/nim/cache \
    nvcr.io/nim/meta/llama3-8b-instruct:latest \
    download-to-cache

# Step 2: Transfer cache to air-gapped machine (USB, rsync, etc.)
rsync -av /path/to/cache air-gapped-machine:/opt/nim/cache

# Step 3: On air-gapped machine, run with local cache
docker run --gpus all -v /opt/nim/cache:/opt/nim/cache \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
```

**Key insight:** Even NVIDIA separates model weights from container images for large models. The image contains the serving code, the cache volume contains the weights.

### NVIDIA's recommendations for DGX Spark specifically

From the DGX Spark User Guide and recent optimization blog posts:
- DGX Spark has 3.7 TB NVMe storage — disk space is not a constraint
- 128 GB unified memory means models don't need to be swapped in/out
- Always-on deployment means cold start is a one-time cost
- NVMe storage provides 7+ GB/s read throughput — loading 20 GB of model files from disk takes ~3 seconds

**Critical insight: The bottleneck is network download, not disk I/O or GPU loading.**

Once models are on the NVMe, loading them into GPU memory takes seconds. The 5-15 minute first-run penalty is entirely network-bound. This reframes the problem: we need to solve the first-download experience, not the model-loading experience.
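Back-of-the-envelope numbers for this claim (the NVMe throughput is the figure quoted above; the 200 Mbps downlink is an illustrative broadband assumption):

```python
# rough load-time arithmetic for ~20 GB of model weights
model_gb = 20

nvme_gbps = 7        # GB/s sequential read (DGX Spark figure quoted above)
network_mbps = 200   # Mbit/s — illustrative home-broadband downlink

disk_seconds = model_gb / nvme_gbps
network_seconds = (model_gb * 8 * 1000) / network_mbps  # GB -> Mbit

print(f"NVMe read:        ~{disk_seconds:.0f} s")       # ~3 s
print(f"200 Mbps network: ~{network_seconds / 60:.0f} min")  # ~13 min
```

Roughly two orders of magnitude apart, which is why the rest of this document optimizes the download path rather than the load path.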

---

## 9. Comparison Matrix

| Factor | Bake into Image | Volume + First-Run | Multi-Stage + Volume | S3/Object Store | Init Container | Model Sidecar |
|--------|:---:|:---:|:---:|:---:|:---:|:---:|
| **Cold start (first run)** | Instant | 5-15 min | Instant (small) + 5-15 min (large) | 5-15 min | 5-15 min (but deterministic) | 5-15 min |
| **Cold start (restart)** | Instant | ~30s | ~30s | ~30s | ~30s | ~30s |
| **Image size** | +5-20 GB | No change | +1.7 GB | No change | No change | +20 GB (Triton) |
| **Build time** | +10-20 min | No change | +2-3 min | No change | No change | No change |
| **Model update** | Rebuild image | Delete volume | Rebuild (small), delete volume (large) | Upload to S3 | Delete volume | Update repo |
| **Offline first-run** | Yes | No | Partial | Local S3 only | No | No |
| **User experience** | Seamless | Confusing delay | Good (partial instant) | Complex setup | Clear progress | Complex |
| **Failure handling** | No failure possible | Partial/broken state | Mixed | S3 dependency | Clean failure | Health checks |
| **Complexity** | Low | Low | Medium | High | Medium | High |
| **Disk efficiency** | Wasteful (duplication) | Good | Good | Good | Good | Wasteful |
| **DGX Spark fit** | Good (small models) | Good | Best | Overkill | Best | Overkill |

### Scores for her-os context (1-5, higher is better)

| Factor (weight) | Bake | Volume | Multi-Stage | S3 | Init Container | Sidecar |
|----------------|:---:|:---:|:---:|:---:|:---:|:---:|
| First-run UX (30%) | 5 | 1 | 4 | 1 | 3 | 2 |
| Restart speed (20%) | 5 | 4 | 4 | 4 | 4 | 4 |
| Image size (15%) | 1 | 5 | 4 | 5 | 5 | 1 |
| Model update ease (15%) | 1 | 5 | 3 | 4 | 5 | 5 |
| Simplicity (10%) | 4 | 4 | 3 | 1 | 3 | 1 |
| Offline support (10%) | 5 | 1 | 3 | 2 | 1 | 1 |
| **Weighted score** | **3.70** | **3.10** | **3.65** | **2.75** | **3.60** | **2.50** |

**Winner: Multi-Stage Build + Init Container (hybrid of Approaches 3 + 5), the two highest deployable scores (3.65 and 3.60).** Full baking edges them out on raw score only because it assumes all models fit in the image — which the Section 2 verdict already rules out for the 15 GB 8B model.

---

## 10. First-Boot Experience Design

### What makes a good first-boot experience?

Studied patterns from successful self-hosted products:

| Product | First-Boot Strategy | Time | User Feedback |
|---------|-------------------|------|---------------|
| **Ollama** | No models included. User must run `ollama pull`. | Manual | Users know what they're getting |
| **Open WebUI** | No models. Settings page to pull models. | Manual | Confusing for non-technical users |
| **Immich** | ML models download in background on first start. | ~2 min | Good — app is usable while models download |
| **Frigate** | Object detection models baked into image. | Instant | Seamless |
| **Home Assistant** | No ML. Integrations download on demand. | Fast | Good progressive pattern |
| **Plex** | Transcoding libraries included. Media indexed in background. | Progressive | Excellent — usable immediately, gets better over time |

### Design principles for her-os first boot

1. **Be usable immediately.** The core functionality (memory search, context retrieval) must work within 60 seconds of `docker compose up`. This requires the real-time embedding model (0.6B) and NER model (GLiNER) to be available instantly.

2. **Download in the background.** Large models (Whisper, 8B embedding) should download after the core is operational. The user should see "Annie is getting ready to listen" rather than a frozen screen.

3. **Show progress.** Never leave the user wondering if things are working. Progress bars, ETA, download speed — all visible via `docker compose logs`.

4. **Degrade gracefully.** If Whisper is not yet downloaded, voice features show "Voice processing is warming up..." instead of failing silently.

5. **Be idempotent.** If the download is interrupted (network drop, power cycle), restart picks up where it left off.

6. **Remember state.** Once models are downloaded, they persist in the Docker volume. Upgrades and restarts should never re-download.
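Principles 1 and 4 imply the application tracks per-model readiness rather than assuming everything is loaded. A minimal sketch (class and method names are hypothetical, not her-os code):

```python
# sketch: per-model readiness registry that API handlers consult before
# using an optional model; names here are hypothetical
from enum import Enum

class ModelState(Enum):
    NOT_DOWNLOADED = "not_downloaded"
    DOWNLOADING = "downloading"
    READY = "ready"

class ModelRegistry:
    """Tracks which models are usable right now."""

    def __init__(self):
        self._states: dict[str, ModelState] = {}

    def set_state(self, model: str, state: ModelState) -> None:
        self._states[model] = state

    def is_ready(self, model: str) -> bool:
        return self._states.get(model) is ModelState.READY

    def status(self, model: str) -> str:
        """User-facing status for a feature backed by `model`."""
        state = self._states.get(model, ModelState.NOT_DOWNLOADED)
        return {
            ModelState.NOT_DOWNLOADED: "not yet available",
            ModelState.DOWNLOADING: "warming up...",
            ModelState.READY: "ready",
        }[state]
```

A voice endpoint would check `registry.is_ready("whisper")` and answer with the warming-up message instead of failing silently, satisfying principle 4.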

### Proposed first-boot timeline

```
T+0s    docker compose up -d
T+5s    Infrastructure services starting (PG, Redis, Neo4j, Qdrant)
T+30s   Infrastructure healthy
T+35s   her-os-core starts (0.6B embedding + GLiNER baked in image)
T+40s   Real-time models loaded into GPU memory
T+45s   Health check passes — Annie is online
T+45s   "Annie is ready. Memory search and context engine operational."
T+50s   Background task: download Whisper large-v3 (~3.1 GB)
T+2-5m  Whisper downloaded and loaded — voice features enabled
T+2-5m  "Annie can now listen to voice. STT operational."
T+5-15m Background task: download 8B embedding (if configured)
T+15m   All models ready — full capability
```

**Key insight:** By baking the 1.7 GB of real-time models into the image, we achieve a 45-second first-boot to usable state, while downloading the remaining ~18 GB progressively in the background.

---

## 11. Recommendation for her-os

### Strategy: Hybrid Multi-Stage Bake + Progressive Background Download

Combine the best aspects of Approaches 3 (multi-stage bake) and 5 (init container), with a twist: instead of blocking on model download, use a background download task after the core is operational.

### Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    Docker Image (build time)                   │
│                                                               │
│  ┌─────────────────────────┐  ┌────────────────────────────┐ │
│  │  Application Code       │  │  Baked Models (~1.7 GB)    │ │
│  │  FastAPI + deps         │  │  - Qwen3-Embedding-0.6B    │ │
│  │  ~18 GB (NGC base)      │  │  - GLiNER medium-v2.1      │ │
│  └─────────────────────────┘  └────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                    Docker Volume (runtime)                     │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │  Downloaded Models (progressive)                       │   │
│  │  Priority 1: (already in image, symlinked)             │   │
│  │  Priority 2: Whisper large-v3     (~3.1 GB) → 2-5 min  │   │
│  │  Priority 3: Qwen3-Embedding-8B   (~15 GB)  → 5-15 min │   │
│  └────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```

### Tier system

| Tier | Models | Strategy | When Available |
|------|--------|----------|---------------|
| **Tier 1 (Critical)** | 0.6B Embedding, GLiNER | Baked into Docker image | Immediately (T+45s) |
| **Tier 2 (Important)** | Whisper large-v3 | Background download after core boot | T+2-5 min |
| **Tier 3 (Enhancement)** | 8B Embedding | Background download, low priority | T+5-15 min |

### Why this works for her-os specifically

1. **Always-on DGX Spark**: Cold start is a one-time event. Once running, the system stays up indefinitely. This means the first-boot penalty is amortized over months/years of uptime.

2. **128 GB unified memory**: All models fit in memory simultaneously. No need for dynamic loading/unloading (which would favor Triton).

3. **3.7 TB NVMe**: Disk space is not a constraint. A 20 GB Docker image is 0.5% of available storage.

4. **Single user**: No multi-tenant model sharing concerns. No need for S3 or centralized model storage.

5. **Self-hosted, not cloud**: Image size affects only the initial deployment, not per-request costs. A 20 GB image is fine if it means instant startup.

6. **NVMe throughput (~7 GB/s)**: Loading 1.7 GB of baked models from disk to GPU takes <1 second. Loading 20 GB takes ~3 seconds. The disk-to-GPU path is not the bottleneck.

### What this means for the Dockerfile

```dockerfile
# ---- Stage 1: Model Download (build time) ----
FROM python:3.12-slim AS models

RUN pip install --no-cache-dir "huggingface_hub[cli]"

# Only download Tier 1 (critical) models at build time
# These are small enough to bake (~1.7 GB total)
RUN huggingface-cli download Qwen/Qwen3-Embedding-0.6B \
      --local-dir /models/Qwen--Qwen3-Embedding-0.6B

RUN huggingface-cli download urchade/gliner_medium-v2.1 \
      --local-dir /models/urchade--gliner_medium-v2.1

# ---- Stage 2: Builder (existing — unchanged) ----
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS builder
# ... install Python deps (existing code) ...

# ---- Stage 3: Runtime ----
FROM nvcr.io/nvidia/pytorch:25.11-py3 AS runtime

# ... existing runtime setup ...

# Bake Tier 1 models into image
COPY --from=models /models /data/baked-models

# Volume mount for Tier 2-3 models (downloaded at runtime)
# HF_HOME points to volume, with fallback to baked models
ENV HF_HOME=/data/models
ENV HER_OS_BAKED_MODELS=/data/baked-models
```
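
The architecture diagram above marks Tier 1 as "already in image, symlinked". A sketch of that startup step, assuming a hypothetical `link_baked_models` helper and the paths from the `ENV` lines (the exact layout her-os uses is still open — see Q3):

```python
# Expose image-baked model dirs inside the runtime model volume (sketch).
# link_baked_models is an assumed helper, not existing her-os code.
from pathlib import Path

def link_baked_models(baked_dir: Path, cache_dir: Path) -> list[Path]:
    """Symlink each baked model directory into the volume-backed cache.

    Symlinks keep the volume authoritative for downloaded models while
    Tier 1 always resolves into the current image, so upgrades never
    re-copy baked weights.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for src in sorted(baked_dir.iterdir()):
        dst = cache_dir / src.name
        if dst.is_symlink() and not dst.exists():
            dst.unlink()            # stale link left by an older image
        if not dst.exists():
            dst.symlink_to(src, target_is_directory=True)
        linked.append(dst)
    return linked
```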

### What this means for the application

```python
# her_os/model_loader.py
"""Progressive model loading with baked + downloaded tiers."""

import os
from pathlib import Path
from enum import IntEnum

class ModelTier(IntEnum):
    CRITICAL = 1   # Baked into image, available immediately
    IMPORTANT = 2  # Downloaded in background after boot
    ENHANCEMENT = 3 # Downloaded on demand or low-priority background

MODEL_REGISTRY = {
    "Qwen/Qwen3-Embedding-0.6B": {
        "tier": ModelTier.CRITICAL,
        "purpose": "real-time embedding",
        "baked_path": "Qwen--Qwen3-Embedding-0.6B",
    },
    "urchade/gliner_medium-v2.1": {
        "tier": ModelTier.CRITICAL,
        "purpose": "named entity recognition",
        "baked_path": "urchade--gliner_medium-v2.1",
    },
    "openai/whisper-large-v3": {
        "tier": ModelTier.IMPORTANT,
        "purpose": "speech-to-text",
    },
    "Qwen/Qwen3-Embedding-8B": {
        "tier": ModelTier.ENHANCEMENT,
        "purpose": "batch embedding (nightly re-index)",
    },
}

def get_model_path(model_id: str) -> Path | None:
    """Find model: check baked location first, then volume cache."""
    info = MODEL_REGISTRY.get(model_id, {})
    baked_dir = Path(os.environ.get("HER_OS_BAKED_MODELS", "/data/baked-models"))
    cache_dir = Path(os.environ.get("HF_HOME", "/data/models"))

    # Check baked models first (instant, no download)
    if "baked_path" in info:
        baked_path = baked_dir / info["baked_path"]
        if baked_path.exists():
            return baked_path

    # Check volume cache (previously downloaded)
    # HuggingFace hub cache lives under $HF_HOME/hub/models--org--name/snapshots/...
    cache_path = cache_dir / "hub" / f"models--{model_id.replace('/', '--')}"
    if cache_path.exists():
        return cache_path

    return None  # Not available yet

def is_model_available(model_id: str) -> bool:
    return get_model_path(model_id) is not None
```
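
Callers would gate user-facing features on `is_model_available`. A sketch of that mapping, with the availability predicate injected so it is testable in isolation; the capability names mirror the health endpoint in the implementation plan, but `capabilities` itself is a hypothetical helper:

```python
# Map model availability onto user-facing capability flags (sketch).
from typing import Callable

def capabilities(is_available: Callable[[str], bool]) -> dict[str, bool]:
    """Capability flags for the health endpoint.

    In her-os, `is_available` would be `is_model_available` from
    model_loader; injecting it keeps this function trivially testable.
    """
    return {
        # Memory search needs both Tier 1 models (embedding + NER).
        "memory_search": is_available("Qwen/Qwen3-Embedding-0.6B")
        and is_available("urchade/gliner_medium-v2.1"),
        "voice_stt": is_available("openai/whisper-large-v3"),
        "batch_embedding": is_available("Qwen/Qwen3-Embedding-8B"),
    }
```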

---

## 12. Implementation Plan

### Phase 1: Immediate improvements (before Sprint 1)

1. **Add model download stage to Dockerfile**
   - New `models` stage that downloads 0.6B + GLiNER at build time
   - `COPY --from=models` into runtime stage
   - Total image size increase: ~1.7 GB (~18 GB → ~20 GB)

2. **Create `her_os/bootstrap.py`**
   - Model registry with tier system
   - `download_models(priority_cutoff=N)` function
   - Progress reporting to stdout
   - Idempotent (skips already-downloaded models)

3. **Add background model download to lifespan**
   - After FastAPI starts and health check passes, launch background task
   - Downloads Tier 2 (Whisper) then Tier 3 (8B) models
   - Updates health endpoint with model availability status

4. **Enhance health endpoint**
   ```json
   {
     "status": "healthy",
     "checks": {
       "gpu": {"available": true, "device": "NVIDIA GB10"},
       "models": {
         "embedding_0.6b": {"status": "ready", "source": "baked"},
         "gliner": {"status": "ready", "source": "baked"},
         "whisper": {"status": "downloading", "progress": "45%"},
         "embedding_8b": {"status": "pending"}
       }
     },
     "capabilities": {
       "memory_search": true,
       "voice_stt": false,
       "batch_embedding": false
     }
   }
   ```
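
The background task in item 3 can be sketched framework-agnostically with asyncio (in FastAPI it would be launched from the lifespan handler via `asyncio.create_task`); `download_model` and the shared `status` dict are stand-ins for the real bootstrap downloader and health-endpoint state:

```python
# Tiered background download after core boot (sketch).
import asyncio

async def background_download(registry: dict, download_model, status: dict):
    """Download Tier 2 models, then Tier 3, updating shared status.

    Tier 1 is skipped — it is baked into the image. Downloads run in a
    worker thread so the event loop (and the API) stays responsive.
    """
    pending = sorted(
        (m for m, info in registry.items() if info["tier"] > 1),
        key=lambda m: registry[m]["tier"],
    )
    for model_id in pending:
        status[model_id] = "downloading"
        await asyncio.to_thread(download_model, model_id)  # blocking I/O off-loop
        status[model_id] = "ready"

# In the FastAPI lifespan this would be launched (not awaited) with:
#   asyncio.create_task(background_download(MODEL_REGISTRY, dl, status))
```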

### Phase 2: Optional init container variant

For users who want deterministic startup (all models ready before service starts):

```yaml
# docker-compose.override.yml (opt-in)
services:
  model-warmup:
    image: ghcr.io/her-os/her-os-core:${HER_OS_VERSION:-latest}
    container_name: her-os-model-warmup
    restart: "no"
    entrypoint: ["python", "-m", "her_os.bootstrap", "--all", "--block"]
    environment:
      HF_HOME: /data/models
      HF_TOKEN: ${HF_TOKEN:-}
    volumes:
      - model-cache:/data/models

  her-os-core:
    depends_on:
      model-warmup:
        condition: service_completed_successfully
```

### Phase 3: Pre-download CLI command

```bash
# Download all models before first start
./scripts/install.sh --pre-download-models

# Or manually:
docker compose run --rm her-os-core python -m her_os.bootstrap --all
```
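
A sketch of the argument parsing `her_os.bootstrap` implies: `--all` and `--block` come from the compose override above, while `--tier` is an assumed extra knob for partial pre-downloads, not an existing flag:

```python
# CLI entry point for her_os.bootstrap (sketch).
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        prog="her_os.bootstrap",
        description="Pre-download her-os models into the model cache volume",
    )
    parser.add_argument("--all", action="store_true",
                        help="download every tier, including Tier 3")
    parser.add_argument("--block", action="store_true",
                        help="run in the foreground and exit when done "
                             "(for init-container / pre-download use)")
    parser.add_argument("--tier", type=int, default=2,
                        help="highest tier to download when --all is not set")
    return parser.parse_args(argv)
```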

---

## 13. Open Questions

| # | Question | Status |
|---|----------|--------|
| Q1 | Should we bake a quantized version of Whisper into the image (smaller, lower quality) for immediate voice capability, then upgrade to full when downloaded? | OPEN |
| Q2 | Should the 8B embedding model download be opt-in (env var `HER_OS_ENABLE_8B=true`) since most users won't notice the quality difference? | OPEN |
| Q3 | What is the exact HuggingFace Hub cache structure we need to replicate when baking models? Do we use `--local-dir` (flat copy) or match the `models--org--name/snapshots/hash/` structure? | OPEN — needs testing |
| Q4 | Should we create a `her-os-models` Docker image that contains all models, for offline/air-gap deployment scenarios? | OPEN — low priority |
| Q5 | BuildKit `--mount=type=cache` for model downloads during build — does this persist across `docker compose build` invocations on DGX Spark? | OPEN — needs testing on Titan |
| Q6 | Should the health endpoint expose model download progress as a Server-Sent Events (SSE) stream for the web UI? | OPEN — Phase 2 |

---

## Sources

### NVIDIA Official
- [NVIDIA Triton Inference Server Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html)
- [Triton Quickstart](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
- [Triton Model Repository](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html)
- [NVIDIA NIM for LLMs — Getting Started](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html)
- [NVIDIA NIM Air Gap Deployment](https://docs.nvidia.com/nim/large-language-models/latest/deploy-air-gap.html)
- [NGC — DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/ngc.html)
- [DGX Spark User Guide (PDF)](https://docs.nvidia.com/dgx/dgx-spark/dgx-spark.pdf)
- [New Software and Model Optimizations Supercharge DGX Spark](https://developer.nvidia.com/blog/new-software-and-model-optimizations-supercharge-nvidia-dgx-spark/)
- [NVIDIA DGX Spark In-Depth Review (LMSYS)](https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/)

### HuggingFace & Model Serving
- [HuggingFace TGI Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour)
- [HuggingFace Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker)
- [HuggingFace Inference Endpoints — Custom Container](https://huggingface.co/docs/inference-endpoints/engines/custom_container)
- [How to Deploy HuggingFace Models in Docker (Frederick Giasson)](https://fgiasson.com/blog/index.php/2023/08/23/how-to-deploy-hugging-face-models-in-a-docker-container/)
- [Deploy HuggingFace Model on GPU Docker (RunPod)](https://www.runpod.io/articles/guides/deploy-hugging-face-docker)
- [LLM Everywhere: Docker for Local and HuggingFace Hosting (Docker Blog)](https://www.docker.com/blog/llm-docker-for-local-and-hugging-face-hosting/)

### vLLM
- [vLLM Docker Deployment](https://docs.vllm.ai/en/stable/deployment/docker/)
- [vLLM Kubernetes: Model Loading & Caching Strategies (DigitalOcean)](https://www.digitalocean.com/community/conceptual-articles/vllm-kubernetes-model-loading-caching-strategies)
- [Deploying vLLM with Docker: Complete Guide (Medium)](https://medium.com/@juanc.olamendy/deploying-vllm-with-docker-the-complete-guide-to-production-ready-llm-inference-39b812c01535)

### AWS SageMaker
- [SageMaker Inference Toolkit (GitHub)](https://github.com/aws/sagemaker-inference-toolkit)
- [Custom Inference Code with SageMaker Hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [Docker Containers for Training and Deploying (SageMaker)](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html)

### Kubernetes ML Patterns
- [Loading Multi-Gigabyte Model Weights for GPU Inference on EKS (Jan 2026)](https://garystafford.medium.com/loading-multi-gigabyte-model-weights-for-gpu-inference-on-amazon-eks-8efa93631bba)
- [AI/ML in Kubernetes Best Practices (Wiz)](https://www.wiz.io/academy/ai-ml-kubernetes-best-practices)
- [Serve a Model with a Single GPU in GKE (Google Cloud)](https://cloud.google.com/kubernetes-engine/docs/tutorials/online-ml-inference)

### Docker Build Optimization
- [Docker Multi-Stage Builds](https://docs.docker.com/build/building/multi-stage/)
- [Optimize AI Containers with Multi-Stage Builds (Collabnix)](https://collabnix.com/optimize-your-ai-containers-with-docker-multi-stage-builds-a-complete-guide/)
- [Docker BuildKit Cache Mounts (How-To)](https://vsupalov.com/buildkit-cache-mount-dockerfile/)
- [Docker Compose Init Containers Pattern](https://oneuptime.com/blog/post/2026-02-08-how-to-use-docker-compose-init-containers-pattern/view)

### Replicate Cog
- [Cog: Containers for Machine Learning (GitHub)](https://github.com/replicate/cog)
- [Cog Getting Started](https://cog.run/getting-started/)
- [Using Your Own Model with Cog](https://cog.run/getting-started-own-model/)

### Self-Hosted AI
- [Ollama Docker Image](https://hub.docker.com/r/ollama/ollama)
- [Self-Hosted AI Battle: Ollama vs LocalAI (2025)](https://dev.to/arkhan/self-hosted-ai-battle-ollama-vs-localai-for-developers-2025-edition-b82)
- [Local LLM Hosting: Complete 2025 Guide](https://medium.com/@rosgluk/local-llm-hosting-complete-2025-guide-ollama-vllm-localai-jan-lm-studio-more-f98136ce7e4a)
- [OpenHands on DGX Spark](https://openhands.dev/blog/host-your-own-coding-agents-with-openhands-using-nvidia-dgx-spark)
