# Claude Code + Local LLM Cookbook

**Date:** 2026-03-17
**Status:** Research complete, ready to implement
**Source:** YouTube video PWV2SmngkZY ("Qwen3.5 + Claude Code: Run a Free Local AI Coding Agent" by Fahd Mirza)
**Hardware:** DGX Spark GB10 (Blackwell SM 12.1, 128 GB unified)

---

## TL;DR

Claude Code can use local models via a **LiteLLM proxy** that translates between Anthropic Messages API and OpenAI Chat Completions API. Your vLLM already serves an OpenAI-compatible endpoint — you just need the translation layer.

```
Claude Code ──(Anthropic API)──▶ LiteLLM Proxy ──(OpenAI API)──▶ vLLM (port 8003)
```

---

## Why This Doesn't Work Directly

| Component | API Format | Endpoint |
|-----------|-----------|----------|
| Claude Code | Anthropic Messages API | `POST /v1/messages` |
| vLLM (your QAT v4) | OpenAI Chat Completions | `POST /v1/chat/completions` |
| llama-server | OpenAI Chat Completions | `POST /v1/chat/completions` |

Claude Code **only** speaks the Anthropic format; it cannot talk to OpenAI-compatible endpoints directly. The video's llama-server → Claude Code setup only works because a translation layer sits in between (the subtitles don't capture the actual configuration commands).

---

## What the Video Does

1. **Build llama.cpp** from source (`cmake --build`)
2. **Download Qwen3.5-4B GGUF** (Q4_K_M quant) from HuggingFace
3. **Serve with llama-server** — exposes OpenAI-compatible `/v1/chat/completions`
4. **Launch Claude Code** with `--model` alias pointing at the local server via a proxy/config

The video uses a small 4B model as a demo; the presenter acknowledges that "4 billion in Q4 quant is not really your ideal agentic model."

---

## Implementation Plan

### Option A: LiteLLM Proxy (Recommended)

Use your **existing vLLM** on port 8003 — no new model serving needed.

#### Step 1: Install LiteLLM

```bash
pip install 'litellm[proxy]'
```

#### Step 2: Create Config

Create `litellm_config.yaml`:

```yaml
model_list:
  # Map Claude model names → your local vLLM
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/qwen3.5-9b
      api_base: http://localhost:8003/v1
      api_key: dummy
  - model_name: claude-opus-4-20250514
    litellm_params:
      model: openai/qwen3.5-9b
      api_base: http://localhost:8003/v1
      api_key: dummy

general_settings:
  master_key: sk-local-dev
```

#### Step 3: Start LiteLLM Proxy

```bash
litellm --config litellm_config.yaml --port 4000
```

#### Step 4: Configure Claude Code

Two options:

**Option 4a — Environment variables (per-session):**
```bash
ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_AUTH_TOKEN=sk-local-dev \
claude
```

**Option 4b — Settings file (persistent):**

Add to `~/.claude/settings.json`:
```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:4000",
    "ANTHROPIC_AUTH_TOKEN": "sk-local-dev"
  }
}
```

> **WARNING:** Option 4b redirects ALL Claude Code sessions to the local model. Use the env vars from Option 4a for selective use.
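A middle ground is a small shell wrapper: the default `claude` keeps using the normal Anthropic backend, while a separate command targets the proxy. The `claude-local` name is an illustration, not a Claude Code feature:

```shell
# Add to ~/.bashrc or ~/.zshrc. `claude-local` is a made-up wrapper name;
# plain `claude` is untouched and still talks to Anthropic.
claude-local() {
  ANTHROPIC_BASE_URL=http://localhost:4000 \
  ANTHROPIC_AUTH_TOKEN=sk-local-dev \
  claude "$@"
}
```

The env vars apply only to the wrapped invocation, so nothing in `~/.claude/settings.json` needs to change.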

#### Step 5: Verify

```bash
# Test LiteLLM is proxying correctly
curl http://localhost:4000/v1/messages \
  -H "x-api-key: sk-local-dev" \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

### Option B: Dedicated Coding Model via llama-server

Serve a **separate model** (not Annie's QAT v4) for coding tasks.

```bash
# Download a coding-optimized GGUF
huggingface-cli download bartowski/Qwen3.5-14B-Instruct-GGUF \
  Qwen3.5-14B-Instruct-Q4_K_M.gguf \
  --local-dir ~/models/

# Serve on a different port (not 8003 — that's Annie)
./llama-server \
  -m ~/models/Qwen3.5-14B-Instruct-Q4_K_M.gguf \
  --port 8004 \
  -ngl 99 \
  -c 32768 \
  --alias qwen3.5-14b-coding

# Then LiteLLM config points to port 8004
```

**Advantage:** Doesn't compete with Annie's vLLM for GPU.
**Disadvantage:** Additional ~10 GB VRAM for 14B Q4_K_M.
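The LiteLLM config for this option is the same shape as Step 2, with `api_base` swapped to the llama-server port and the model name matching the `--alias` set above:

```yaml
model_list:
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/qwen3.5-14b-coding   # matches the --alias passed to llama-server
      api_base: http://localhost:8004/v1
      api_key: dummy
general_settings:
  master_key: sk-local-dev
```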

---

## Honest Assessment: When To Use This

### Good Use Cases

| Scenario | Why |
|----------|-----|
| Offline development (no internet) | No Anthropic API needed |
| Cost-sensitive batch scripting | Free local inference |
| Privacy-critical code | Code never leaves your machine |
| Quick one-off prompts | No API latency |

### Bad Use Cases

| Scenario | Why |
|----------|-----|
| Complex multi-file refactoring | 9B model lacks reasoning depth |
| Your QAT v4 specifically | Persona-tuned for Annie voice, not coding |
| Replacing Claude Opus for architecture | 32K context vs 200K, inferior tool use |
| Production coding workflow | Claude Sonnet/Opus is dramatically better |

### Model Comparison for Coding

| Factor | Qwen3.5 9B QAT v4 | Qwen3.5 14B Instruct | Claude Sonnet 4.6 |
|--------|--------------------|-----------------------|-------------------|
| Context | 32K | 128K | 200K |
| Coding | Poor (persona-tuned) | Good | Excellent |
| Tool use | Hermes parser | Native | Native, robust |
| Multi-file edits | Weak | Moderate | Strong |
| Cost | Free (local) | Free (local) | ~$3/1M input tokens |
| VRAM | 18 GB (shared w/ Annie) | ~10 GB additional | 0 (API) |

---

## VRAM Budget Impact

Current Titan steady-state (from RESOURCE-REGISTRY.md):

| Model | VRAM |
|-------|------|
| vLLM (QAT v4) | ~18 GB |
| Whisper STT | ~1.5 GB |
| Kokoro TTS | ~1 GB |
| Audio pipeline | ~2 GB |
| Context Engine (Ollama) | ~22 GB |
| **Total** | **~44.5 GB** |

Adding a dedicated coding model:
- **14B Q4_K_M via llama-server:** +10 GB → **54.5 GB** (safe, 73.5 GB headroom)
- **27B Q4_K_M via llama-server:** +18 GB → **62.5 GB** (safe, 65.5 GB headroom)

Either fits comfortably within the 128 GB budget.
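A quick sanity check of the arithmetic above (numbers copied from the tables; adjust if the registry changes):

```shell
# Steady-state VRAM plus a dedicated coding model, against the 128 GB budget.
base=44.5     # current steady-state total from the table above
budget=128
for add in 10 18; do   # 14B and 27B Q4_K_M estimates
  awk -v b="$base" -v a="$add" -v t="$budget" \
    'BEGIN { printf "+%g GB -> total %.1f GB, headroom %.1f GB\n", a, b + a, t - b - a }'
done
# → +10 GB -> total 54.5 GB, headroom 73.5 GB
# → +18 GB -> total 62.5 GB, headroom 65.5 GB
```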

---

## Gotchas & Known Issues

1. **LiteLLM ↔ Anthropic API parity**: LiteLLM translates but doesn't perfectly replicate Anthropic's Messages API. Features like `extended_thinking`, `computer_use`, and streaming `tool_use` blocks may not translate cleanly.

2. **Claude Code tool calling**: Claude Code relies heavily on tool_use (Read, Write, Edit, Bash, Grep, Glob). The local model must support function calling well — Qwen3.5 with Hermes parser works but is less reliable than Claude.

3. **vLLM concurrency**: Your current vLLM has a 503 guard when Annie voice session is active. If Claude Code also routes through the same vLLM, coding requests would be blocked during voice calls. **Use Option B (separate model on different port) to avoid this.**

4. **Anthropic auth**: Claude Code may still try to validate the API key against Anthropic's servers. The `ANTHROPIC_AUTH_TOKEN` must be accepted by LiteLLM but doesn't need to be a real Anthropic key.

5. **Model name mapping**: Claude Code sends model names like `claude-sonnet-4-20250514`. LiteLLM config must map these exact names to your local model.
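   For example, Claude Code may also request a Haiku-class model for lightweight background tasks; the exact name below is an assumption, so watch the LiteLLM logs for unmapped-model errors to see which names actually arrive:

   ```yaml
   model_list:
     # Any model name Claude Code sends must appear here, or LiteLLM rejects the request.
     # claude-3-5-haiku-20241022 is a guess at the background-task model name.
     - model_name: claude-3-5-haiku-20241022
       litellm_params:
         model: openai/qwen3.5-9b
         api_base: http://localhost:8003/v1
         api_key: dummy
   ```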

---

## Quick Start (Copy-Paste)

```bash
# 1. Install
pip install 'litellm[proxy]'

# 2. Config (writes to current directory)
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/qwen3.5-9b
      api_base: http://192.168.68.52:8003/v1
      api_key: dummy
general_settings:
  master_key: sk-local-dev
EOF

# 3. Start proxy (background)
litellm --config litellm_config.yaml --port 4000 &

# 4. Launch Claude Code with local model
ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_AUTH_TOKEN=sk-local-dev \
claude
```

---

## References

- **Video source**: YouTube PWV2SmngkZY — Fahd Mirza, "Qwen3.5 + Claude Code: Run a Free Local AI Coding Agent"
- **LiteLLM docs**: litellm.ai (proxy setup, Anthropic translation)
- **Claude Code model config**: Anthropic docs — model configuration, LLM gateway
- **Current vLLM config**: `start.sh` lines 302-321
- **VRAM budget**: `docs/RESOURCE-REGISTRY.md`
