# Resource Registry — GPU Memory Budget

> **Purpose:** Single source of truth for what's running on the GPU and how much room is left.
> Every model change (ADR, config, docker-compose) MUST update this file.
>
> **Hardware reference:** [Titan Validation Dashboard](titan-validation.html) — Phase 0 hardware validation (20/21 pass).
> The VRAM budget on that page (line 2373) is **historical** (ADR-019 era, 67 GiB).
> This file is the **current** budget.

## Hardware

### Titan (single machine — all workloads)

| Property | Value |
|----------|-------|
| Machine | NVIDIA DGX Spark |
| GPU | GB10 Grace Blackwell Superchip (SM_121) |
| Total memory | 128 GB unified (CPU+GPU shared) |
| Architecture | aarch64 (ARM64) |
| CUDA | 12.8 |
| OS | Ubuntu 24.04 |

## Current Memory Budget

Last updated: 2026-04-06 (Session 449 — Drop Nemotron Super, consolidate all workloads to Gemma 4 on Titan)

### Titan — Active Models

| Component | Model | Memory | Creature | Server | Load Behavior |
|-----------|-------|--------|----------|--------|---------------|
| Voice + text chat + extraction + all tasks | Gemma 4 26B-A4B NVFP4 | 15.74 GB (model) + up to 48 GB (KV cache) | minotaur | vLLM Docker :8003 | Always loaded |
| Embeddings | qwen3-embedding:8b | 14 GB | — | Ollama :11434 | On-demand (KEEP_ALIVE=30s) |
| Whisper STT | whisperx large-v3 | 4.8 GB | siren | audio-pipeline | Always loaded |
| Speaker diarization | pyannote 3.1 | 1.9 GB | cerberus | audio-pipeline | Always loaded |
| Speaker embedding | speechbrain ecapa-voxceleb | 0.6 GB | cerberus | audio-pipeline | Always loaded |
| SER: emotion2vec+ large | emotion2vec+ | 0.6 GB | hydra | ser-sidecar | Always loaded |
| SER: audeering wav2vec2 | wav2vec2-large | 0.6 GB | hydra | ser-sidecar | Always loaded |
| Annie Voice STT | Nemotron Speech 0.6B (nvidia/nemotron-speech-streaming-en-0.6b) | 2.49 GB | serpent | annie-voice (NeMo RNNT) | Always loaded |
| TTS | Kokoro v0.19 | 0.5 GB | leviathan | annie-voice | Always loaded |
| GLiNER2 NER | gliner2-base-v1 | 0 GB | hawk | context-engine | CPU only |

### Titan — vLLM Settings

| Setting | Value | Rationale |
|---------|-------|-----------|
| `gpu_memory_utilization` | `0.50` | 64 GB for vLLM (model 15.74 + KV cache ~48). Leaves 64 GB for audio stack + OS. |
| `max_model_len` | `131072` | Gemma 4 native 128K. Needed for WhatsApp agent context (34K typical). |
| `kv_cache_dtype` | `fp8` | Halves KV cache memory, enables more concurrent requests. |
| `enable_thinking` | `False` | MUST be off — Gemma 4 produces empty content with thinking enabled. |
| `max_num_seqs` | `8` | Concurrent sequences for voice + extraction overlap. May auto-reduce with 128K — verify empirically on deploy. |
| `tool_call_parser` | `gemma4` | Required for tool calling (not hermes/pythonic). |
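
The table above might map onto a container launch roughly as follows — a sketch only, not the deployed start.sh: the image tag, model path, container name, and port mapping are assumptions, and flag names should be verified against the installed vLLM version.

```shell
# Hedged sketch — model path, container name, and image tag are placeholders;
# flag names per recent vLLM releases (verify against the deployed version).
docker run -d --name vllm-gemma4 --gpus all -p 8003:8000 \
  vllm/vllm-openai:latest \
  --model /models/gemma-4-26b-a4b-nvfp4 \
  --gpu-memory-utilization 0.50 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 8 \
  --enable-auto-tool-choice --tool-call-parser gemma4
# enable_thinking=False is a per-request chat_template_kwargs setting,
# not a server flag — each client must send it (see table above).
```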

### Steady-State Scenarios

| Scenario | Models loaded | Memory used | Free | Status |
|----------|-------------|-------------|------|--------|
| **Idle** | audio + Gemma 4 vLLM (model only) + SER + Kokoro + STT | ~27 GB | 101 GB | OK |
| **Voice active** | Idle + KV cache partially filled (~5K tokens) | ~35 GB | 93 GB | OK |
| **WhatsApp agent compaction** | Idle + KV cache (~34K tokens, one long request) | ~45 GB | 83 GB | OK |
| **Extraction running** | Idle + KV cache + long context extraction | ~55 GB | 73 GB | OK |
| **Peak (voice + extraction + embeddings)** | All above + qwen3-embedding:8b | ~69 GB | 59 GB | OK |
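
The Idle row can be recomputed directly from the Active Models table (always-loaded entries only; figures copied from that table):

```python
# Recompute the "Idle" scenario from the Active Models table above
# (always-loaded components only; figures copied from that table).
idle_gb = {
    "gemma4-26b weights (vLLM)": 15.74,
    "whisperx large-v3": 4.8,
    "pyannote 3.1": 1.9,
    "speechbrain ecapa": 0.6,
    "emotion2vec+": 0.6,
    "audeering wav2vec2": 0.6,
    "nemotron speech 0.6b": 2.49,
    "kokoro tts": 0.5,
}
total = sum(idle_gb.values())
print(f"Idle footprint: {total:.2f} GB")  # ~27 GB, matching the table
```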

**Empirical (2026-04-06 deploy):** KV cache pool = 41.36 GiB (361,376 tokens total capacity). At `max_model_len=131072` this supports 2.76 full-128K requests concurrently, or ~8 shorter requests (voice ~5K, extraction ~4K). `max_num_seqs=8` held at 128K — no reduction needed.
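
The deploy numbers are self-consistent — a quick arithmetic check (figures copied from the deploy log line above; not a new measurement):

```python
# Sanity-check the empirical KV cache numbers above (figures from the
# 2026-04-06 deploy log; this is arithmetic, not a measurement).
GIB = 1024**3

kv_pool_bytes = 41.36 * GIB        # vLLM-reported KV cache pool
token_capacity = 361_376           # vLLM-reported total KV token slots
max_model_len = 131_072

per_token_kib = kv_pool_bytes / token_capacity / 1024
concurrent_full_ctx = token_capacity / max_model_len

print(f"KV cache per token: {per_token_kib:.1f} KiB")            # ~120 KiB with fp8 KV
print(f"Full-128K requests at once: {concurrent_full_ctx:.2f}")  # ~2.76
```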

### Budget Rules

1. **Titan peak ~69 GB** (Gemma 4 + KV cache + audio + embedding) — well under 128 GB
2. **`gpu_memory_utilization=0.50`** caps vLLM at 64 GB (model + KV cache combined)
3. **OLLAMA_KEEP_ALIVE=30s** — embedding model unloads after 30s idle
4. **Single vLLM container** — no GPU contention between models
5. **DGX Spark unified memory** — CPU+GPU share RAM; audio stack and OS need ~20 GB headroom
6. **128K fallback:** If vLLM fails to start at 131072, step `--max-num-seqs` down (8 → 4 → 2) — do NOT increase gpu_memory_utilization

### Beast (second DGX Spark — always-on, currently idle)

| Property | Value |
|----------|-------|
| Machine | NVIDIA DGX Spark (second unit, twin of Titan) |
| GPU | GB10 Grace Blackwell Superchip (SM_121) |
| Total memory | 128 GB unified (LPDDR5X ~546 GB/s) |
| Architecture | aarch64 (ARM64) |
| Power state | **Always-on** (not powered-off idle) |
| Workload | **Idle** since 2026-04-06 (session 449 — Nemotron Super 120B retired, workloads consolidated to Titan) |
| Performance class | Bandwidth-bound for small-model high-rate inference (~2× slower than Panda for E4B nav per the session 105 bench) — good fit for batch/large-model workloads, not for sub-20ms control loops |

**Benchmark reference (session 105, 2026-04-15):** Gemma 4 E4B Q4_K_M on Beast: nav p50 91.3 ms / 10.95 Hz / 100% schema / 25 s cold-start. **~2× slower than Panda** for same model. Architectural verdict: *"small-model high-rate inference belongs on discrete GPU; DGX Spark class is for large-model workloads."*
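
The bandwidth-bound claim roughly checks out against the quoted p50s (Panda E4B p50 47.9 ms is from the session 104 bench): the latency ratio (~1.9×) lands in the neighborhood of the memory-bandwidth ratio (~1.6×), with the remainder plausibly software/launch overhead.

```python
# Cross-check the "bandwidth-bound" claim using figures quoted in this file
# (session 104/105 benches): for a bandwidth-bound decode workload, the
# latency ratio should roughly track the memory-bandwidth ratio.
beast_p50_ms, panda_p50_ms = 91.3, 47.9  # E4B nav p50: Beast (s105), Panda (s104)
beast_bw, panda_bw = 546.0, 896.0        # GB/s: LPDDR5X vs GDDR7

latency_ratio = beast_p50_ms / panda_p50_ms
bw_ratio = panda_bw / beast_bw

print(f"latency ratio {latency_ratio:.2f}x, bandwidth ratio {bw_ratio:.2f}x")
```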

**Strategic role:** Beast is the highest-leverage near-term hardware because it's always-on, idle, and capable. Candidate workloads:
- **Offline batch processing** (hippocampal-replay pattern): overnight semantic map annotation, embedding generation, conversation summarization
- **Gemma 4 overflow** (if Titan ever saturates): failover target for 26B planning model
- **Ambient observation workload TBD** — technology stack not yet chosen (session-124 reversed the session-119 DeepStream proposal; pending user reconfirmation before a replacement is committed to)

### Orin NX 16GB (reserved for future Orin-NX-native Annie robot)

| Property | Value |
|----------|-------|
| Module | NVIDIA Jetson Orin NX 16GB |
| GPU | Ampere (SM 8.7), 1024 CUDA cores + 32 tensor cores |
| AI performance | 100 TOPS INT8 (2.5× Orin Nano, 3.8× Hailo-8) |
| Memory | 16 GB LPDDR5 unified |
| Power | 10–25 W configurable TDP |
| Architecture | aarch64 |
| Status | **In hand, not yet deployed.** Needs carrier board (Seeed A206, reComputer J401, Auvidea, or custom) |

**Strategic role:** Earmarked for the **future Annie robot** (Pi-based TurboPi gets replaced eventually; next-gen robot runs Orin NX natively). Near-term use: prototyping bench for the future robot's VLM + SLAM stack before the robot chassis arrives.

**Constraint from user (session 119):** Cannot replace Pi on current TurboPi. Can only SUPPORT Pi (add as co-processor over wired Ethernet) if mechanical/power budget allows.

**Key caveats:**
- `llama-server` / llama.cpp needs aarch64 build for Gemma 4 E2B VLM on Orin
- 100 TOPS Ampere on-module, 16 GB unified memory — sized for one VLM plus tracking, not multiple large models simultaneously

### Strategic Architecture (session 119, updated session 124)

Three pieces of idle NVIDIA compute + Pi's idle Hailo-8 create a dual-generation upgrade path:

**Near-term (current TurboPi robot):**

| Tier | Hardware | Role | Status |
|------|----------|------|--------|
| L1 safety reflex | Hailo-8 on Pi 5 | Local YOLO obstacle detection, zero WiFi | **Idle** — activate for WiFi-cliff mitigation |
| L2 nav VLM | Panda (RTX 5070 Ti) | Gemma 4 E2B @ 54 Hz | ACTIVE, keep as-is |
| L3 strategic planner | Titan | Gemma 4 26B @ 1–2 Hz | ACTIVE, keep as-is |
| Batch / overflow | **Beast** (always-on) | Offline processing; Gemma 4 overflow target | **Idle** — ambient-observation workload TBD pending user decision |
| Prototyping bench | Orin NX 16GB | Test future-robot stack on a shelf | **Available** |

**Future robot (Orin-NX-native):**

| Tier | Hardware | Role |
|------|----------|------|
| Onboard brain | Orin NX | VLM + SLAM + tracking, all local, zero WiFi for nav |
| Strategic planner | Titan | Same Gemma 4 26B planning as current gen |
| Batch / overflow | Beast | Same batch role as current gen |

**Key insights from the hardware audit:**

1. **IROS dual-process paper** (arXiv 2601.21506) validates the fast-reactive + slow-semantic pattern: **66% latency reduction** vs always-on VLM, **67.5% success** vs 5.83% VLM-only. Hailo-8 (fast) + Panda VLM (slow) on current robot = exact match.
2. **Isaac Perceptor** (nvblox + cuVSLAM) is NVIDIA's actual robotics stack — worth tracking for future robot but requires stereo camera (current TurboPi has mono).
3. **Hailo-8 on Pi is the first activation priority** — zero WiFi dependency, 430 FPS YOLOv8n local, closes the WiFi-cliff failure mode.

### Panda (RTX 5070 Ti, 16,303 MB)

**Full hardware inventory (session 101, 2026-04-14, verified on-machine via `~/hardware-inventory/collect.sh`):**

Inventory report file on Panda: `/home/rajesh/hardware-inventory/report-20260414_135759.txt` (rajesh-owned, reusable — re-run the collect.sh script anytime to refresh).

| Property | Value | Source |
|----------|-------|--------|
| **Motherboard** | MSI MAG X870 GAMING PLUS WIFI (MS-7E47) rev 2.0, SN `07E4721_P51B614360` | dmidecode baseboard |
| **Chipset** | AMD X870 | MSI model |
| **Socket** | AM5 | dmidecode processor |
| **BIOS** | AMI 2.A53 (2025-09-10), ROM 32 MB, revision 5.35 | dmidecode bios |
| **CPU** | AMD Ryzen 9 9900X3D (Zen 5, Family 26 Model 68 Stepping 0, 12C/24T, 3D V-Cache, max 5550 MHz) | dmidecode processor |
| **CPU ISA flags (critical for ML)** | `avx512_vnni`, `avx_vnni`, `avx512_bf16`, `avx512vl`, `avx512dq`, `avx512f`, `avx512_vbmi2`, `avx512_vp2intersect`, `sha_ni`, `aes`, `vaes` | `lscpu` |
| **RAM total installed** | **64 GB** — 2 × 32 GB DDR5 Corsair `CMK32GX5M1B5200C40` (rated 5200 MT/s, dual-rank) | dmidecode memory |
| **RAM slot layout** | DIMMA2 (32 GB), DIMMB2 (32 GB) populated. **DIMMA1 and DIMMB1 EMPTY — can upgrade to 128 GB.** | dmidecode memory |
| **RAM running speed** | **4800 MT/s (JEDEC — below rated 5200)** — EXPO/XMP profile not enabled in BIOS. Easy perf win available. | dmidecode memory |
| **RAM max capacity** | 128 GB | dmidecode memory |
| **Swap** | 8 GB | `free -h` |
| **GPU** | MSI RTX 5070 Ti 16 GB GDDR7 (GB203-300, Blackwell SM 12.0, subsystem 0x5310, VBIOS 98.03.58.00.92) | `nvidia-smi -q`, `lspci -vvv` |
| **GPU PCIe** | Gen 5 x16 cap (link currently at Gen 1 when idle P8 — ramps to Gen 5 on load) | `nvidia-smi` + `lspci -vv` |
| **GPU power limit** | 300 W (default = max, LOCKED — not adjustable above 300) | `nvidia-smi -q` |
| **Primary NVMe** | WD_BLACK SN8100 2 TB at `02:00.0` — running at **PCIe 5.0 x4 full speed** (32 GT/s) | `lsblk`, `lspci -vv` |
| **Disk usage** | 231 GB / 1.8 TB (14%) | `df -h` |
| **Wi-Fi / Bluetooth** | Qualcomm WCN785x Wi-Fi 7 (802.11be) 320 MHz 2×2 (FastConnect 7800) | `lspci` |
| **Ethernet** | Realtek RTL8126 (5 Gigabit) — **currently DOWN, running on Wi-Fi only** | `lspci`, `/sys/class/net` |
| **SATA controllers** | AMD 600-series + ASMedia ASM1064 (no drives currently attached) | `lspci` |
| **USB controllers** | AMD 43fc + ASMedia ASM2426 + ASM2425 (USB 3.2 + USB4) | `lspci` |
| **IOMMU** | Enabled, GPU in IOMMU group 13 | `lspci -vvv` |
| **Kernel** | Linux 6.17.0-14-generic | `uname -a` |
| **OS** | Ubuntu 24.04.3 LTS (Noble Numbat) | `/etc/os-release` |
| **Chassis** | MSI Desktop (vendor: Micro-Star Int'l; exact case model not in SMBIOS) | dmidecode chassis |
| **PSU wattage** | **NOT exposed via SMBIOS/sysfs — requires physical inspection of PSU sticker.** | Not software-readable |

**Idle temps (verified 2026-04-14):** CPU Tctl 51.9°C (Tccd1 42.2°C, Tccd2 45.5°C), GPU 47°C @ 11.25 W, DIMM 38°C/40°C, NVMe 36°C composite, iGPU 50°C, NIC 47°C — all thermally comfortable.

### PCIe slot inventory (critical for second-GPU question)

Per SMBIOS type-9 slot descriptors:

| Slot | Bus Address | Physical | Electrical | Usage | Notes |
|------|-------------|----------|-----------|-------|-------|
| **PCIE1** | 0000:00:01.1 | **x16** | PCIe 5.0 x16 (CPU-direct) | **In Use — RTX 5070 Ti** | Only full-length slot |
| J3502 (M.2) | 0000:00:01.2 | M.2 Socket 3 | PCIe 5.0 x4 (CPU-direct) | In Use — SN8100 NVMe | M.2, not PCIe slot |
| PCIE3 | 0000:00:02.2 | **x4-sized (Short)** | PCIe x4 | In Use | Cannot fit GPU |
| PCIE4 | 0000:00:02.1 | **x4-sized (Short)** | PCIe x4 | In Use | Cannot fit GPU |

**Consequence: MSI MAG X870 GAMING PLUS WIFI is a single-GPU motherboard.** There is no second x16-sized slot. Adding a second full-size RTX card requires one of:
1. **Motherboard swap** — switch to workstation board (ASUS Pro WS, ASRock Rack, Gigabyte TRX50 Aero, etc.) with 2+ x16-physical slots
2. **PCIe riser cable + open bench** — lift the second card out of the case with a riser; needs external frame/mining rig chassis
3. **M.2-to-PCIe adapter** — use a spare M.2 slot with a GPU riser; bandwidth limited to Gen 5 x4 (≈ Gen 4 x8 equivalent — OK for inference, not training)
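
The Gen 5 x4 ≈ Gen 4 x8 equivalence in option 3 is straight line-rate arithmetic (128b/130b encoding assumed for both generations):

```python
# Back-of-envelope for option 3 above: usable bandwidth per lane
# (PCIe 5.0: 32 GT/s; PCIe 4.0: 16 GT/s; both use 128b/130b line coding).
def lane_gbps(gt_per_s: float) -> float:
    """Usable GB/s per lane after 128b/130b line coding."""
    return gt_per_s * (128 / 130) / 8  # bits -> bytes

gen5_x4 = 4 * lane_gbps(32.0)
gen4_x8 = 8 * lane_gbps(16.0)
gen5_x16 = 16 * lane_gbps(32.0)

print(f"Gen5 x4 = {gen5_x4:.2f} GB/s, Gen4 x8 = {gen4_x8:.2f} GB/s")
print(f"(Full Gen5 x16 slot: {gen5_x16:.1f} GB/s)")
```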

**PCIe link speed observation:** RTX 5070 Ti currently shows "Speed 2.5 GT/s (downgraded), Width x16" — that's PCIe 1.0 at P8 idle. Normal behavior. Under inference load it ramps to 32 GT/s (PCIe 5.0 x16).

**Temperatures (idle, session 101):**
- CPU (Tctl): 51.9°C, Tccd1: 42.8°C, Tccd2: 42.4°C
- GPU (RTX 5070 Ti idle): ~47°C, 12.9 W power draw
- AMD iGPU (Rembrandt in Ryzen APU): 50°C edge — note: the iGPU is present but not the primary display; RTX does the work
- DDR5 DIMM 1 temp sensor: 38.2°C
- Network chip (r8169): 47.5°C

**PCIe topology observation (relevant for adding a second GPU):**
- **01:00.0** — RTX 5070 Ti occupies the CPU-direct PCIe 5.0 x16 slot (via root port `00:01.1`)
- **02:00.0** — NVMe on CPU-direct PCIe 5.0 x4 (via root port `00:01.2`)
- No second x16-sized slot exists on this board (see PCIe slot inventory above); on X870 boards that have one, it routes through the chipset — typically PCIe 4.0 x4 bandwidth in a physically x16 slot
- ASMedia ASM2806 4-Port PCIe x2 Gen3 switches present — used for chipset-facing I/O expansion

**Live measurements from session 67, 2026-04-12** (via
`nvidia-smi --query-compute-apps=pid,process_name,used_memory`). These reflect
actual process VRAM consumption, not registered per-model estimates. The
earlier per-model breakdown (Whisper 6029 + IndicConformer 303 + Kokoro 554)
was inaccurate — those models share a single PyTorch process (`phone_call.py`)
which is smaller than the sum of their independent estimates.

| Component | Process / Model | Memory (live) | Creature | Server | Load Behavior |
|-----------|-----------------|--------------:|----------|--------|---------------|
| Phone call loop (STT + TTS bundle) | `scripts/phone_call.py auto` — Whisper large-v3 + IndicConformerASR + Kokoro in one process (shared CUDA context) | **5,158 MB** | siren + leviathan + — | phone_loop (PyTorch) | Always loaded (started Apr 10) |
| Chatterbox voice-clone TTS | Chatterbox TTS (voice cloning, uvicorn server) | **3,654 MB** | TBD | chatterbox_server (FastAPI :8772) | **Always loaded — CRITICAL for phone call TTS, do NOT stop.** Started Apr 10. **REGISTRY GAP: added in earlier session without updating this doc — filled in session 67.** |
| IndicF5 Kannada voice-clone TTS | IndicF5 400M BF16 | **2,864 MB** (was registered as 1,347 — grew 2.1×) | phoenix | indicf5_server (FastAPI :8771) | **PERMANENTLY RETIRED (session 68)** — Mom speaks English; Chatterbox covers all TTS. Do NOT restart. |
| **Nav VLM for robot car** | Gemma 4 E2B Q4_K_M + mmproj-F16 | **~3,227 MB** | TBD | `panda-llamacpp` docker container (llama.cpp `llama-server`, port 11435, OpenAI-compat endpoint) | **Always loaded (systemd panda-llamacpp.service)** — persistent after session 71. 18.4 ms p50 GPU inference (54 Hz theoretical). **Camera is on Pi 5, NOT Panda** — frames arrive as base64 JPEG via WiFi HTTP POST. Real frame-to-command latency = WiFi upload + 18ms GPU + WiFi response. Fallback: Titan vLLM 26B. |
| **Nav decision sidecar** | panda-nav FastAPI (CPU only, wraps llama-server) | **0 MB** (CPU only, ~150 MB RAM) | — | `panda-nav` (FastAPI :11436, uvicorn) | Always loaded alongside llama-server. Code-based perception→command mapping. Session 74. |

**Peak (phone + Chatterbox + Nav VLM, IndicF5 retired):** **12,039 MB / 16,303 MB (74%)**
**Free:** **4,264 MB (26%)** — little headroom left for additional GPU processes.
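
The live-measurement command above emits CSV that can be tallied directly — a minimal parser sketch (sample rows are illustrative; PIDs and process names hypothetical, not a captured transcript):

```python
# Minimal tally for `nvidia-smi --query-compute-apps=pid,process_name,used_memory`.
# The sample output below is illustrative, not a real capture.
sample = """\
pid, process_name, used_gpu_memory [MiB]
2101, python3 scripts/phone_call.py, 5158 MiB
2190, uvicorn chatterbox_server:app, 3654 MiB
2305, /app/llama-server, 3227 MiB
"""

def parse_compute_apps(text: str) -> dict[str, int]:
    """Map process name -> used MiB from nvidia-smi CSV output."""
    rows = {}
    for line in text.strip().splitlines()[1:]:  # skip header row
        pid, name, mem = (f.strip() for f in line.split(",", 2))
        rows[name] = int(mem.split()[0])
    return rows

usage = parse_compute_apps(sample)
total = sum(usage.values())
print(f"Total live VRAM: {total} MiB")  # 12039 with the sample rows
```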

**Resolved (session 108, 2026-04-15) — phone_call.py footprint is Whisper-only.**
Source-of-truth audit:
- `services/annie-voice/phone_audio.py:509-519` — `PhoneSTT` loads Whisper large-v3-turbo via
  `whisper.load_model("large-v3-turbo", device="cuda")` (OpenAI PyTorch reference impl, not
  faster-whisper — CTranslate2 has no aarch64 CUDA wheel per session-101 finding).
- `services/annie-voice/tts_backends.py:172-178` — `PHONE_TTS_BACKEND='auto'` (active env)
  builds `AutoRoutingBackend(en=ChatterboxBackend(:8772), kn=IndicF5Backend(:8771))`. **Both are
  HTTP clients — neither loads model weights in-process.** Kokoro backend exists at
  `tts_backends.py:36 KokoroBackend` but is only instantiated when `PHONE_TTS_BACKEND='kokoro'`
  (not the auto path).
- `phone_loop.py:1082-1093` — only one `.load()` call at startup (Whisper). No NeMo, no
  IndicConformer, no Kokoro instantiation in the auto-path.

So the consolidated 5,158 MiB ≈ Whisper large-v3-turbo (~3.2 GB FP32 weights) + CUDA runtime
context + decode-time temp tensors (encoder output, log-mel, beam-search KV cache). The earlier
"6029 + 303 + 554 = 6,886 MB" estimate was wrong — it conflated "available as a backend" with
"resident in VRAM". IndicConformer and Kokoro never touch the phone process; Chatterbox is
out-of-process at PID 1625654 and contributes its own 3,740 MiB independently. **No need for
`py-spy dump` — the code answer is unambiguous.**

### Pi 5 (TurboPi Robot Car, 192.168.68.61)

| Property | Value |
|----------|-------|
| Machine | Raspberry Pi 5 |
| RAM | 16 GB LPDDR4X |
| Storage | 234 GB SD card |
| CPU | ARM Cortex-A76 4-core |
| Architecture | aarch64 |
| Accelerator | Hailo-8 AI HAT+ 26 TOPS (PCIe, vision-only) |
| Camera | icspring UVC camera (USB on Pi, WB range 2800–6500K) — **Pi-local, NOT on Panda.** Frames sent to Panda VLM as base64 JPEG over WiFi HTTP POST (~30-80 KB/frame). Every VLM inference requires a WiFi round-trip. |
| Lidar | SLAMTEC RPLIDAR C1 (DTOF, `/dev/ttyUSB1`, CP2102N, 460800 baud) — 360° scan, 12m range, 5000 samp/s, 0.72° res |
| OS | Debian Trixie 13 |

| Component | Model | Memory | Creature | Server | Load Behavior |
|-----------|-------|--------|----------|--------|---------------|
| Photo description | Gemma 4 E2B (gemma4-car) | 8.5 GB RAM | chariot | Ollama :11434 | On-demand |
| Robot car server | — | ~50 MB | chariot | turbopi-server :8080 | Always running (systemd) |

**Power supply:** SupTronics X-UPS1 with 4x 18650 Li-Ion (reduced from 3-board 12-cell stack, session 92). ~44-52 Wh, 5.1V USB output. No voltage sensing — battery_v=0.

**X-UPS1 Board Settings (historical — original 3-board stack, before the session 92 reduction):**
- **AL_ON jumper:** Short on ONE board only (enables always-power-on)
- **12V OFF jumper:** Short on ALL THREE boards (disables 12V output — using 5V only)
- **Power input:** Same voltage adapter on all boards if charging simultaneously
- **Control:** Press button on same board to turn on/off (whichever board you used to power on)

**Peak (Ollama loaded):** ~8.5 GB / 16 GB (53%). Free: ~7.5 GB.

## Change Log

| Date | Session | Change | VRAM Impact |
|------|---------|--------|-------------|
| 2026-04-17 | 123 | **Titan Chatterbox retired — resolved Kokoro CUDA OOM on WebRTC** — After Session 118 deployed Titan Chatterbox on :8773 for WebRTC (never registered here — drift since session 118), session 117's `TTS_BACKEND=kokoro` in `.env` made it redundant. Annie WebRTC has been using direct Kokoro-GPU; Chatterbox on :8773 was 3.2 GB ballast. Unified-memory pressure on Titan (vLLM 60 + Chatterbox 3.2 + audio/SER 5 + OS/Docker/DBs, 98/121 GB used) caused Kokoro `KPipeline(device='cuda')` to OOM on session init — only 3 GB contiguous free. **Removed:** `services/annie-voice/chatterbox_titan_shim.py`, `start_titan_chatterbox`/`stop_titan_chatterbox` functions in start.sh, `titan-chatterbox` dispatch case, `CHATTERBOX_URL=:8773` env in `start_annie` (it was wired to a dead endpoint if anyone flipped `TTS_BACKEND=chatterbox`). **Preserved:** Panda Chatterbox (:8772, `chatterbox_server.py`) untouched — phone TTS unaffected. Commit `8bc5e67`. Also this session (`34b37a8`): added `--reasoning-parser gemma4` to vLLM start.sh so Gemma 4 thinking routes to `reasoning_content` instead of leaking to WebRTC TTS as spoken text. | Titan: **−3.2 GB VRAM** (Chatterbox), **−~6 GB system RAM** (unified-memory view: 98 → 91 GB used). Kokoro's ~200 MB session-init now succeeds. No peak-scenario recalculation needed because Titan Chatterbox was never in the Active Models table (registry gap from session 118). |
| 2026-04-16 | 119 | **Hardware inventory expansion — Beast, Orin NX 16GB, Hailo-8 status documented** — Added three hardware sections to registry. (1) **Beast** (second DGX Spark, GB10, 128 GB, always-on, idle since session 449) — previously undocumented in hardware section despite being listed in change log. (2) **Orin NX 16GB** (user-owned, not yet deployed) — 100 TOPS Ampere, 16 GB LPDDR5. Earmarked for future Orin-NX-native Annie robot; near-term use as prototyping bench. Cannot replace Pi on current TurboPi per user constraint; can only support Pi as co-processor. (3) **Hailo-8 on Pi 5** status clarified — 26 TOPS NPU **idle for navigation**, YOLOv8n @ 430 FPS local if activated (zero WiFi dependency). Added **Strategic Architecture** section capturing 4-tier current + 3-tier future-robot hardware picture. Note: session 119's original entry proposed a DeepStream ambient-observation workload on Beast; that framing was reversed in session 124 pending user reconfirmation, and the related `deepstream-dev` Claude Skill was uninstalled. | No VRAM change (documentation-only). Surfaced ~253 GB of idle compute (Beast 128 GB + Orin NX 16 GB + Pi 5 Hailo-8 26 TOPS) previously untracked in steady-state scenarios. |
| 2026-04-15 | 113 | **Parakeet v2 LIVE on Panda as phone-daemon STT (test-it-out swap, G3 overridden by user)** — Same session as the bench above. User accepted the strict `stay-whisper` verdict's G3 failure as acceptable given v2's 1.0 pp WER win on phone-sim + 3.3× latency win. **New sidecar service**: `scripts/parakeet_stt_server.py` on Panda port **:11438**, aiohttp, running in isolated `~/parakeet-bench-venv`. Loads `nvidia/parakeet-tdt-0.6b-v2` locally. Auth: `X-Internal-Token`. `phone_audio.py:PhoneSTT` branches on `PARAKEET_URL` env; when set, skips Whisper load and POSTs WAV bytes to sidecar. `start.sh` phone env adds `PARAKEET_URL="${PARAKEET_URL_OVERRIDE:-http://localhost:11438}"` for one-line revert. **No systemd unit yet — manually launched via `nohup`; cron-restart wiring is a follow-up.** | Panda: phone_call.py was 5158 MB (Whisper in-process) → now 0 MB (no GPU model). Parakeet sidecar PID holds 5026 MB. Net: **−132 MB** (tiny saving). Chatterbox 3740 MB + E2B 3222 MB unchanged. Panda peak: 5026 + 3740 + 3222 = 11988 MB (vs prior 12120 MB). Port map update: `:11438` is now Parakeet STT sidecar. |
| 2026-04-15 | 113 | **Parakeet v2/v3 vs Whisper STT bench on Panda (no swap)** — 30 LibriSpeech test-clean + 30 μ-law-8k phone-sim clips. Measured: v2 WER 1.62% (ls-clean) / 1.67% (phone-sim); v3_offline WER 2.34% / 2.20%; Whisper large-v3-turbo 2.05% / 2.68%. Parakeet v2 is 3.3× faster (30 ms vs 99 ms phone-sim mean). **Verdict: `stay-whisper` (strict)** — G3 VRAM saving ≥ 1 GB FAILED: Parakeet v2 = 4757 MB, v3 = 4827 MB, Whisper = 4702 MB. Gap vs session-108's 3 GB estimate is NeMo's full-model scaffolding (Lhotse dataloader, decoder cache); a minimal inference wrapper could flip G3 → verdict adopt-v2, tracked as follow-up. **v3 streaming NOT SUPPORTED** by distributed checkpoint (`EncDecRNNTBPEModel.transcribe_simulate_cache_aware_streaming` raises NotImplementedError; `att_context_size=[-1,-1]`). G5/G6 unreachable this session. Bench venv: `~/parakeet-bench-venv` (retained for follow-up). Raw data: `benchmark-results/stt-2026-04-15/`. | Panda: no steady-state change (bench window only). Peak during bench: 4.8 GB (Parakeet) or 4.7 GB (Whisper) alongside Chatterbox 3.7 GB + E2B 3.2 GB = 11.7 GB; phone daemon was DOWN during Phase 2-5 for ~2 h. Disk: +4 GB bench venv + ~4 GB Parakeet weights persistent in `~/.cache/huggingface/hub/models--nvidia--parakeet-tdt-0.6b-v{2,3}`. |
| 2026-04-15 | 110 | **Chatterbox 500M benchmark on Titan DGX Spark (synthesis parity validated, manual-failover target)** — Installed `chatterbox-tts==0.1.7` + `torch/torchaudio==2.11.0+cu128` in a per-session venv (`.venv-chatterbox-bench/`), applied two Blackwell SM_121 patches (`patch_chatterbox_xvector_cpu_fbank` + `patch_chatterbox_s3tokenizer_log_mel` in `services/annie-voice/blackwell_patch.py`), synthesized 10 utterances with `samantha_evolving.wav` voice clone. **Verdict: `titan_chatterbox_synthesis_parity_with_panda`** — resemblyzer cosine mean 0.9199 (min 0.8652, spread 0.0848) vs Panda HTTP baseline. Synthesis p50 2055 ms, RTF 0.55. VRAM peak 3.33 GB (torch API) / 3.66 GB (nvsmi per-process) — 342 MiB unified-memory gap. Gemma post-flight drift 1.005× baseline (clean). Phase 6 HTTP failover dry-run NOT yet run — redundancy state is synthesis-parity-only. Research: `docs/RESEARCH-CHATTERBOX-TITAN-REDUNDANCY.md`. Verdict JSON: `docs/BENCHMARK-CHATTERBOX-TITAN-20260415.json`. | Titan steady state: no change (bench venv only, no service loaded). Bench-window peak: +3.4 GB VRAM during 10-utt synthesis + Gemma paused. Disk: +3.2 GB HF weights persistent in `~/.cache/huggingface/hub/models--ResembleAI--chatterbox`. |
| 2026-04-15 | 105 | **Gemma 4 E4B Q4_K_M benchmark on Beast + Nemotron Super officially retired** — Ran same E4B benchmark as session 104 but on Beast (GB10 Superchip, 128 GB unified, aarch64, SM_121). **Measured: nav p50 91.3 ms / 10.95 Hz / 100% schema / describe p50 826 ms with p95 2177 ms (2.6× tail) / 5.5 GB live VRAM / 25 s cold-start penalty**. Beast is **~2× slower than Panda** for E4B nav — bandwidth-bound (LPDDR5X ~546 GB/s vs GDDR7 ~896 GB/s). Quality identical (100% schema, 0 `<think>` leaks). Architectural verdict: small-model high-rate inference belongs on discrete GPU; DGX Spark class is for large-model workloads. `vllm-nemotron-super` container stopped and removed — had been idle since 2026-04-06 (session 449 claim verified post-hoc). | Beast steady-state: -92 GB (Nemotron Super retired). Benchmark window only: +5.5 GB E4B, fully torn down. Panda, Titan: no impact. |
| 2026-04-14 | 104 | **Gemma 4 E4B Q4_K_M benchmark on Panda (temporary, non-adopting)** — Ran `unsloth/gemma-4-E4B-it-GGUF` Q4_K_M + mmproj-F16 on temporary `panda-llamacpp-e4b` container (port :11437) with Chatterbox + E2B stopped for ~8 min window. **Measured: p50 47.9 ms / 20.9 Hz / 99-100% schema adherence / describe qualitatively richer than E2B / 4.76 GB live VRAM**. No `<think>` leakage (`--jinja` flag verified). Container torn down at session end; E2B remains production. Full results in `docs/RESEARCH-GEMMA4-E4B-QUANTIZATIONS.md`. **Adoption decision deferred** — E4B is viable but not switched in. | No permanent VRAM impact. During benchmark window: Chatterbox -3,730 MB + E2B -4,420 MB = -8,150 MB freed; E4B +4,874 MB consumed; net -3,276 MB headroom. All restored at session end. |
| 2026-04-12 | 74 | **4-command nav VLM on Panda** — Added `panda-nav` FastAPI sidecar (CPU-only, port 11436) that wraps llama-server with code-based perception→command mapping. Replaced VLM-as-planner with VLM-as-classifier (position×size → 4 commands). Added Pi `POST /drive/turn` (closed-loop IMU at 100Hz, 1 rate-limited command). Annie routes goal-seeking → panda-nav, exploration → Titan 26B. `start.sh` updated with `start_panda_nav()`. | No VRAM impact (CPU-only sidecar). Pi: new endpoint, zero GPU. |
| 2026-04-12 | 71 | **Nav VLM switched to dedicated Panda llama-server (production)** — `panda-llamacpp` promoted to always-loaded systemd service. Annie-voice env vars: `NAV_VLLM_URL=http://192.168.68.57:11435`, `NAV_VLLM_FALLBACK_URL=http://localhost:8003`. IndicF5 permanently retired. `enable_thinking` key bug fixed in `robot_tools.py`. | Panda: IndicF5 -2,864 MB retired; llamacpp steady +3,227 MB. Net peak now 12,039 MB (74%). |
| 2026-04-12 | 67 | **Nav VLM deployed on Panda via llama-server** — `panda-llamacpp` docker container serving `unsloth/gemma-4-E2B-it-GGUF` Q4_K_M + mmproj-F16 on port 11435 via `ghcr.io/ggml-org/llama.cpp:server-cuda`. **p50 = 18.4 ms, 54 Hz** (8.5× faster than Ollama, 1.5× over Tesla FSD 36 Hz target). Thinking mode disabled via `chat_template_kwargs: {enable_thinking: false}` in each request. Container is one-off `docker run -d` (no persistence yet). vLLM dead-end documented: `unsloth/gemma-4-E2B-it-unsloth-bnb-4bit` needs 6.25 GB VRAM (only transformer body is 4-bit; vision/audio/embeddings/LM-head stay FP16) — OOM'd alongside phone_call+Chatterbox. **Q7d RESOLVED.** See `docs/RESEARCH-PANDA-VLM-INFERENCE-STACK.md`. | Panda: **+3,227 MB** (llama-server 3.2 GB). Projected steady-state peak with IndicF5 restored: 14.9 GB / 16.3 GB = 91% (tight but fits). |
| 2026-04-12 | 67 | **Panda registry reconciliation** — live `nvidia-smi` revealed drift from session-453-era entries. Added Chatterbox TTS (3,654 MB, port 8772) which was deployed in an earlier session without updating the registry. Corrected IndicF5 from 1,347 MB → 2,864 MB actual. Consolidated Whisper/IndicConformer/Kokoro into single `phone_call.py` process entry (5,158 MB total, less than sum of independent estimates — see Open VRAM question in Panda section). Panda peak corrected 8,233 → 11,676 MB. Temporarily stopped IndicF5 for vLLM benchmark; free rose 4,627 → 6,823 MB. | Panda registered peak: 8,233 → **11,676 MB** (+3,443 MB drift correction, not a new allocation). |
| 2026-04-12 | 67 | Pi 5 TurboPi: WB temperature 3300K → 2800K in `services/turbopi-server/main.py:440` and `services/turbopi-server/pi-files/_headless_runner.py:150`. Hardware minimum of icspring UVC camera (range 2800–6500). Session 67 measured B/G=0.19 at 3300K → 0.63 at 2800K; curtain visibly neutral. Commit `33f4f6b`, deployed via `git pull` on Pi, no restart needed (live v4l2-ctl already at 2800K). | No VRAM impact (camera setting). |
| 2026-04-10 | 42 | Add RPLIDAR C1 lidar to Pi 5. USB `/dev/ttyUSB1`, pyrplidar 0.1.2 (DenseCabin byte-order patched). Both modes verified: Standard (5843 pts/3s) + DenseBoost (6176 pts/3s, 10.24m range). | Pi 5: no VRAM change (USB peripheral). |
| 2026-04-10 | 38 | Add Pi 5 robot car (TurboPi + Gemma 4 E2B + 3x X-UPS1). 4 Annie tools: drive/photo/look/status. Chariot creature. | Pi 5: +8.5 GB RAM (Ollama). Titan: no change (HTTP client only). |
| 2026-04-06 | 459 | Bump max_model_len 65K→128K for WhatsApp agent + all services. Compaction preset tuned (tier1=0.70, tier2=0.85, max_messages=80). | KV cache pool unchanged (48 GB at 0.50 util). Concurrent seqs may reduce. |
| 2026-04-06 | 453 | Add IndicF5 voice cloning on Panda (indicf5_server.py). Add Panda VRAM section to registry. | Panda: +1,347 MB (IndicF5 400M BF16). Titan: no change. |
| 2026-04-06 | 449 | Drop Nemotron Super on Beast. All workloads consolidated to Gemma 4 26B on Titan. gpu_memory_utilization 0.25→0.50, max_model_len 32K→65K. | Beast freed entirely. Titan peak 27→69 GB (KV cache grows with utilization). |
| 2026-04-05 | 434 | Swap Nemotron Nano 30B (18 GB) → Gemma 4 26B NVFP4 (15.74 GB) on Titan (ADR-031) | **-2.26 GB** (peak 41→39 GB) |
| 2026-03-22 | 354 | Beast: max_model_len 32K→131K, gpu_util 0.70→0.75, NVIDIA-recommended inference settings | Beast KV cache 17 GiB, 5.6x concurrency at 131K |
| 2026-03-22 | 354 | Added Beast (Super 120B) to registry as second machine | +97 GB on Beast (separate machine, no Titan impact) |
| 2026-03-18 | 352 | ADR-028: Single Nemotron replaces Qwen 9B + 27B + 32B | **-65 GB** peak (106→41 GB). Removed 3 models, 1 vLLM container |
| 2026-03-15 | 342 | Swap llama-server Q4_K_M → vLLM NVFP4 QAT-v2b (ADR-027) | +0.95 GB (6.6→7.55 GB), adds Docker container |
| 2026-03-12 | 293 | Swap Qwen3-ASR-1.7B → Nemotron Speech 0.6B | -0.91 GB (Empirical: 2.49 GB, GPU-PB boosting, 431ms avg latency) |
| 2026-03-12 | 291 | Added Qwen3-ASR-1.7B (serpent) for Annie Voice STT, replacing Whisper | +3.8 GB (was 0 — Whisper was never registered) |
| 2026-03-09 | 270 | Retired qwen3.5:35b-a3b (lion), consolidated to qwen3.5:27b | **-33 GB** (35B no longer loads) |
| 2026-03-08 | 259 | Added llama-server qwen3.5-9b (minotaur) for Annie voice | +6.6 GB |
| 2026-03-07 | 255 | Added GLiNER2 hawk (CPU only) | +0 GB (CPU) |
| 2026-03-05 | 248 | Switched extraction from 9B to 27B (ADR-022) | +23 GB (was 5 GB → now 40 GB) |
| 2026-02-28 | — | Initial topology (ADR-019): 35B + 32B + 8B + audio | ~67 GB baseline |

## How to Verify

```bash
# On Titan — check actual VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Check which Ollama models are loaded (should only be qwen3-embedding:8b)
curl -s http://localhost:11434/api/ps | python3 -c "
import json, sys
data = json.load(sys.stdin)
total = 0
for m in data.get('models', []):
    vram_gb = m.get('size_vram', 0) / (1024**3)
    total += vram_gb
    print(f\"  {m['name']:30s} {vram_gb:6.1f} GB VRAM\")
print(f\"  {'TOTAL':30s} {total:6.1f} GB\")
"

# Check vLLM Gemma 4 (single container, single port)
curl -s http://localhost:8003/health
curl -s http://localhost:8003/v1/models  # should show gemma-4-26b

# Full health check (context-engine deep)
curl -s -H "Authorization: Bearer $CONTEXT_ENGINE_TOKEN" http://localhost:8100/health/deep | python3 -m json.tool
```

## Adding a New Model — Checklist

When any ADR, config change, or docker-compose edit adds or changes a model:

- [ ] Update the "Active Models" table above
- [ ] Recalculate "Steady-State Scenarios" — does peak still stay under 110 GB?
- [ ] Add a row to the "Change Log"
- [ ] If peak exceeds 110 GB: either retire another model or document the risk
- [ ] Run "How to Verify" commands after deploying to confirm actual matches budget
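
The recalculation step can be scripted — a hedged sketch using worst-case figures from this file (the component names and groupings are this sketch's own, not a maintained manifest):

```python
# Hedged sketch of the "recalculate peak" checklist step: keep the Active Models
# figures in a dict and fail fast when the worst-case sum crosses the 110 GB line.
PEAK_BUDGET_GB = 110  # rule from the checklist above

components_gb = {            # worst-case figures from this file
    "gemma4-26b weights": 15.74,
    "kv cache (max pool)": 48.0,
    "audio stack": 11.5,     # whisper + pyannote + ecapa + 2x SER + nemotron + kokoro
    "qwen3-embedding:8b": 14.0,
}

peak = sum(components_gb.values())
assert peak < PEAK_BUDGET_GB, f"over budget: {peak:.1f} GB"
print(f"peak {peak:.1f} GB / budget {PEAK_BUDGET_GB} GB")
```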
