# Research: Qwen3.5 Models — VLM Game-Changer for her-os

**Date:** 2026-02-26
**Status:** VALIDATED ON TITAN — Two paths validated, full eval complete:
- **Q4_K_M via llama.cpp** (RECOMMENDED): 61 tok/s, ~21 GiB VRAM, text + vision
- **BF16 via vLLM nightly**: 30 tok/s, ~94 GiB VRAM, text + vision
- **Thinking mode OFF required:** Entity A-(8.2) #5/16, Task 8.0 #4/9, 5-113x faster

## Overview

Qwen3.5 is a new model family from Alibaba released Feb 24, 2026. The key innovation: **Mixture-of-Experts (MoE) with multimodal VLM** — text + image + video in one model with extremely efficient inference.

## Model Family

| Model | Total Params | Active Params | Architecture | Multimodal? | License |
|-------|-------------|---------------|--------------|-------------|---------|
| **Qwen3.5-35B-A3B** | 35B | **3B** | MoE (256 experts, 8+1 active) | **Yes** (Image+Video+Text) | Apache 2.0 |
| **Qwen3.5-122B-A10B** | 122B | **10B** | MoE | Yes | Apache 2.0 |
| Qwen3.5-27B | 27B | 27B (dense) | Hybrid DeltaNet/Attention | No (text-only) | Apache 2.0 |
| Qwen3.5-397B-A17B | 397B | 17B | MoE | Yes | Apache 2.0 |

## Qwen3.5-35B-A3B — Detailed Analysis

### Architecture
- 40 layers, 2048 hidden dim
- 10 × (3 × (Gated DeltaNet → MoE) + 1 × (Gated Attention → MoE))
- 256 total experts, 8 routed + 1 shared activated per token
- Gated DeltaNet: O(n) linear attention (3/4 of layers)
- Gated Attention: Full quadratic attention (1/4 of layers, for global context)
- Token embedding: 248,320 (padded)
- Context: 262K native, up to 1M with YaRN scaling

### Capabilities
- **Multimodal VLM**: Image + Video + Text understanding
- **Tool calling**: Native support (agentic workflows)
- **Thinking mode**: Generates `<think>...</think>` reasoning blocks (disable with `enable_thinking: false`)
- **201 languages** supported (multilingual)

### Benchmark Scores (from HuggingFace)
| Task | Score |
|------|-------|
| MMLU-Pro | 85.3 |
| C-Eval | 90.2 |
| IFEval | 91.9 |
| GPQA Diamond | 84.2 |
| SWE-bench Verified | 69.2 |
| CodeForces | 2028 |
| **MMMU** | **81.4** |
| **MMMU-Pro** | **75.1** |
| **MathVision** | **83.9** |
| **MMBenchEN** | **91.5** |
| **OmniDocBench1.5** | **89.3** |
| **VideoMME (w/sub)** | **86.6** |

### DGX Spark Performance — VALIDATED ON TITAN (Feb 26, 2026)

#### Our Titan Benchmarks (BF16, vLLM 0.16.0rc2 nightly)
| Test | Tokens | Time | Speed | Notes |
|------|--------|------|-------|-------|
| Entity extraction (JSON) | 103 | 3.6s | **28.6 tok/s** | Perfect extraction, clean JSON |
| Essay generation | 651 | 22.0s | **29.6 tok/s** | Thinking disabled |
| Image description | 43 | 10.3s | ~4.2 tok/s* | *Includes image download + encode |
| vLLM peak throughput | — | — | **29.8 tok/s** | From engine stats |

**Key findings:**
- **29-30 tok/s** consistent for text generation (matches published 31 tok/s)
- **Vision works** — correctly described a cat image with details (color, pose, setting)
- `chat_template_kwargs: {"enable_thinking": false}` at request body level disables thinking
- Without disabling thinking, model wastes tokens on "Thinking Process:" text

#### Memory Layout (Titan, BF16)
| Component | Size |
|-----------|------|
| Model weights | **65.53 GiB** |
| KV cache (0.80 util) | **28.45 GiB** (372,768 tokens) |
| Max concurrency (32K ctx) | **37.16x** |
| Torch compile cache | ~3 GiB |

Loading time: 93s (weights) + 26s (torch.compile) + 5s (CUDA graphs) = **~2 min** after download.

#### Community Benchmarks (from [adadrag/qwen3.5-dgx-spark](https://github.com/adadrag/qwen3.5-dgx-spark))
| Test | Output Tokens | Time | Speed |
|------|--------------|------|-------|
| Short (128 tok) | 128 | 4.1s | **31.1 tok/s** |
| Medium (1K tok) | 1,024 | 32.2s | **31.8 tok/s** |
| Long (3.8K tok) | 3,831 | 121.0s | **31.6 tok/s** |

#### Multi-User Concurrency (community benchmarks)
| Users | Per-User | Aggregate | Errors |
|-------|----------|-----------|--------|
| 1 | 3.3 tok/s | 3.3 tok/s | 0 |
| 10 | 8.2 tok/s | 82.0 tok/s | 0 |
| 100 | 4.3 tok/s | **423.5 tok/s** | 0 |

**Critical:** GPU memory utilization >0.90 causes OOM after ~1 hour. Stick with 0.80.

### Q4_K_M Performance — VALIDATED ON TITAN (Feb 26, 2026)

#### Our Titan Benchmarks (Q4_K_M, llama.cpp + FORCE_CUBLAS)
| Test | Tokens | Time | Speed | Notes |
|------|--------|------|-------|-------|
| Entity extraction (JSON) | 3844 total | 62.1s | **61.3 tok/s** TG | Perfect JSON, 6 entities, thinking + output |
| Vision (image description) | ~200 | ~3s | **59.3 tok/s** TG | Correct: tabby kitten, blue eyes, striped fur |
| Prompt processing | 42 tok | 0.25s | **155.1 tok/s** PP | After warmup |
| CPU-only fallback | 50 tok | 6.1s | **8.16 tok/s** | For comparison only — not for production |

**Key findings:**
- **61 tok/s** generation — **2x faster than BF16 vLLM** (30 tok/s)
- **155 tok/s** prompt processing (GPU-accelerated)
- **Vision works** — correctly described cat image with details (breed, eye color, fur pattern)
- **Perfect JSON** entity extraction with all 6 entities
- **Thinking mode works naturally** — model reasons deeply, then outputs clean JSON
- **~21 GiB total VRAM** (model 19.9 + KV 0.16 + RS 0.25 + compute 0.49)
- **~100 GiB free** for other models (STT, TTS, embeddings, graph)

#### Memory Layout (Titan, Q4_K_M)
| Component | Size |
|-----------|------|
| Model weights | **19,940 MiB** (19.5 GiB) |
| KV cache | **160 MiB** (8192 context, F16) |
| RS (recurrent state) | **251 MiB** (DeltaNet state, F32) |
| Compute buffer | **493 MiB** |
| CPU mapped | 273 MiB |
| **Total GPU** | **~20.8 GiB** |
| **Free for other models** | **~100 GiB** |

#### GGUF Tensor Composition (Q4_K_M from Unsloth)
| Type | Count | Notes |
|------|-------|-------|
| f32 | 301 | Norms, embeddings |
| q8_0 | 60 | Higher-precision layers |
| q4_K | 165 | Bulk of model |
| q5_K | 60 | Mixed precision layers |
| q6_K | 67 | Mixed precision layers |
| mxfp4 | 80 | **MoE expert weights (native format)** |
| **Total** | **733 tensors** | 21.2 GB file |

### Inference Stack Comparison on DGX Spark

| Stack | Status | Speed | VRAM | Notes |
|-------|--------|-------|------|-------|
| **llama.cpp Q4_K_M** | **WORKS** | **61 tok/s** | **~21 GiB** | FORCE_CUBLAS build required. Text + Vision |
| **vLLM nightly BF16** | **WORKS** | **30 tok/s** | **~94 GiB** | `vllm/vllm-openai:cu130-nightly` (vLLM 0.16.0+) |
| llama.cpp UD-Q4_K_XL | **CRASHES** | N/A | N/A | MXFP4 tensors (275) crash MMQ kernel on Blackwell |
| Ollama v0.17.1-rc2 | **BROKEN** | N/A | N/A | Model loads but inference 500 errors. DeltaNet support incomplete |
| NIM container | **N/A** | N/A | N/A | No Qwen3.5 NIM exists yet |

### llama.cpp Q4 Setup on DGX Spark (RECOMMENDED)

#### Prerequisites
```bash
# CUDA 13 must be available
ls /usr/local/cuda-13/bin/nvcc  # Should exist on DGX Spark

# OpenSSL dev (optional, for HTTPS image URLs in vision)
sudo apt-get install libssl-dev
```

#### Step 1: Build llama.cpp from source with FORCE_CUBLAS

**CRITICAL**: Must use `-DGGML_CUDA_FORCE_CUBLAS=ON` to bypass broken MMQ kernels for MXFP4 on Blackwell.

```bash
git clone https://github.com/ggml-org/llama.cpp.git ~/llama-cpp-latest
cd ~/llama-cpp-latest
mkdir build-gpu && cd build-gpu

export LD_LIBRARY_PATH=/usr/local/cuda-13/compat:$LD_LIBRARY_PATH

cmake .. \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES='121a-real' \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13/bin/nvcc

make -j$(nproc) llama-server
```

Build takes ~3 minutes on Grace CPU (20 cores).

#### Step 2: Download GGUF files

**Use the standard Q4_K_M from Unsloth** (NOT the UD-Q4_K_XL which crashes):

```bash
mkdir -p ~/models

# Main model (21.2 GB, ~12 min at 30 MB/s)
wget -O ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  'https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/resolve/main/Qwen3.5-35B-A3B-Q4_K_M.gguf'

# Vision encoder (861 MB, ~30s)
wget -O ~/models/mmproj-BF16.gguf \
  'https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/resolve/main/mmproj-BF16.gguf'
```

#### Step 3: Launch server

```bash
export LD_LIBRARY_PATH=/usr/local/cuda-13/compat:$LD_LIBRARY_PATH

~/llama-cpp-latest/build-gpu/bin/llama-server \
  --host 0.0.0.0 --port 8002 \
  -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-BF16.gguf \
  --ctx-size 32768 --n-gpu-layers 999 -fa auto --jinja
```

**Critical flags:**
- `--jinja`: Enables chat template processing, required for `chat_template_kwargs` (thinking mode control)
- `--ctx-size 32768`: Total context window (prompt + response). 8192 is too small — thinking mode can consume 3-16K tokens of context alone
- `-fa auto`: Flash attention auto-detect

Server starts in ~10s. First inference request has 1-2s warmup.

#### Step 4: Test

```bash
# Text (thinking DISABLED — RECOMMENDED for production)
curl -s http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 500,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

# Text (with thinking — for comparison only)
curl -s http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 500
  }'

# Vision (base64 — HTTPS URLs need OpenSSL build)
IMG_B64=$(base64 -w0 /path/to/image.jpg)
curl -s http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"qwen3.5\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMG_B64}\"}},
        {\"type\": \"text\", \"text\": \"Describe this image.\"}
      ]
    }],
    \"max_tokens\": 1000
  }"
```
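
For scripted use, the base64 vision request above can be assembled in Python. A minimal sketch, assuming a local JPEG file; the helper name `image_message` is ours:

```python
import base64

def image_message(path: str, prompt: str) -> dict:
    """Build a user message pairing a base64 data-URL image with a text prompt,
    matching the content structure accepted by /v1/chat/completions."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }
```

POST `{"model": "qwen3.5", "messages": [image_message(...)], "max_tokens": 1000}` to the server exactly as in the curl example above.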

### What Failed and Why (Debugging Log)

This section documents every approach we tried, so future sessions can avoid dead ends.

#### Attempt 1: Ollama v0.17.1-rc2 (FAILED)
- `ollama pull qwen3.5:35b-a3b` downloaded Q4_K_M (22 GiB)
- Model loaded fine (42.58s), GPU at 92%
- **All inference returned 500 errors** (generate, chat, streaming)
- Root cause: ollama pre-release has incomplete DeltaNet architecture support
- The GGUF it creates has `rope.dimension_sections` with 3 elements (incompatible with llama.cpp which expects 4)

#### Attempt 2: llama.cpp with Ollama's GGUF (FAILED)
- Symlinked ollama's blob as GGUF
- Error: `rope.dimension_sections has wrong array length; expected 4, got 3`
- Same error with both Docker image (build 8149) and latest source build
- Root cause: ollama converts the model to a non-standard GGUF format

#### Attempt 3: Unsloth UD-Q4_K_XL GGUF (FAILED — CUDA CRASH)
- Downloaded `Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf` (18.3 GiB) from Unsloth
- Loaded correctly (rope.dimension_sections = [11, 11, 10, 0], 4 elements)
- **Crashed on first inference**: `CUDA error: invalid argument` in `ggml_cuda_mul_mat_q`
- Root cause: 275 MXFP4 tensors. MMQ kernel for MXFP4 broken on Blackwell SM_121
- GitHub issues: [#18331](https://github.com/ggml-org/llama.cpp/issues/18331), [#18425](https://github.com/ggml-org/llama.cpp/issues/18425)

#### Attempt 4: Rebuild with sm_89 (Ada PTX fallback) (FAILED)
- Built with `-DCMAKE_CUDA_ARCHITECTURES=89` per GitHub issue #18331
- Same crash: `CUDA error: invalid argument` in `ggml_cuda_mul_mat_q`
- The Ada PTX fallback doesn't help because the issue is in the MMQ kernel itself, not the PTX

#### Attempt 5: Standard Q4_K_M GGUF with sm_121 (FAILED)
- Downloaded `Qwen3.5-35B-A3B-Q4_K_M.gguf` (21.2 GiB) — standard quantization
- Still has 80 MXFP4 tensors (MoE expert weights are natively MXFP4 in Qwen3.5)
- Same crash: `CUDA error: invalid argument` in `ggml_cuda_mul_mat_q`

#### Attempt 6: Rebuild with 121a-real (FAILED)
- Built with `-DCMAKE_CUDA_ARCHITECTURES='121a-real'` per NVIDIA forum
- Same crash

#### Attempt 7: CPU-only (WORKS but too slow)
- `--n-gpu-layers 0` — pure CPU inference
- 8.16 tok/s — confirms GGUF file is valid, problem is CUDA kernel only

#### Attempt 8: FORCE_CUBLAS (SUCCESS!)
- Built with `-DGGML_CUDA_FORCE_CUBLAS=ON`
- This bypasses the custom MMQ quantized matmul kernels entirely
- Uses cuBLAS (NVIDIA's optimized BLAS library) for all matrix multiplications
- **61 tok/s generation, 155 tok/s PP, vision works, perfect quality**
- Note: GitHub issue [#19683](https://github.com/ggml-org/llama.cpp/issues/19683) warns FORCE_CUBLAS causes "FP16 cuBLAS numerical overflow" for the "/" degenerate output bug, but that was on Qwen3.5-397B, NOT 35B-A3B. Our testing shows correct output.

#### Why FORCE_CUBLAS Works
The default llama.cpp build uses custom MMQ (Matrix Multiply Quantized) CUDA kernels optimized for specific quantization formats. These kernels have bugs on Blackwell SM_121 with MXFP4 tensors. `FORCE_CUBLAS` tells llama.cpp to dequantize tensors and route all matrix multiplications through NVIDIA's cuBLAS library instead, which is battle-tested on every architecture including Blackwell. The tradeoff is slightly more memory for dequantization buffers, but generation is actually faster because cuBLAS is highly tuned for GB10's Tensor Cores.

### vLLM Setup on DGX Spark

```bash
# Pull nightly (has Qwen3.5MoeForConditionalGeneration support)
docker pull vllm/vllm-openai:cu130-nightly

# Launch (BF16, no quantization needed — 70 GB fits in 128 GB unified)
docker run -d \
  --name qwen35 \
  --restart unless-stopped \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  Qwen/Qwen3.5-35B-A3B \
  --served-model-name qwen3.5-35b \
  --port 8000 \
  --host 0.0.0.0 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.80 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching
```

First run downloads ~70 GB model weights from HuggingFace.

### API Examples

```bash
# Text chat (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}],
    "max_tokens": 1024
  }'

# Disable thinking for faster responses
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "extra_body": {"chat_template_kwargs": {"enable_thinking": false}}
  }'

# Vision/image analysis
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        {"type": "text", "text": "Describe this image in detail."}
      ]
    }],
    "max_tokens": 1024
  }'
```

### Recommended Sampling Parameters
| Mode | Temperature | top_p | top_k | presence_penalty |
|------|-------------|-------|-------|------------------|
| Thinking - General | 1.0 | 0.95 | 20 | 1.5 |
| Thinking - Coding | 0.6 | 0.95 | 20 | 0.0 |
| Instruct - General | 0.7 | 0.8 | 20 | 1.5 |
| Instruct - Reasoning | 1.0 | 1.0 | 40 | 2.0 |

### Thinking Mode — MUST BE DISABLED for Production (Feb 26, 2026)

**TL;DR:** Disable thinking mode (`enable_thinking: false`) for all her-os tasks. It makes Qwen3.5 5-113x faster, eliminates failures, and improves quality on 5/8 tasks.

#### How Thinking Mode Works
Qwen3.5 has a built-in chain-of-thought "thinking" mode that generates `<think>...</think>` reasoning blocks BEFORE the actual response. This consumes 3-16K tokens of context on each call.
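
When thinking is left on, downstream code must strip the reasoning block before using the response. A minimal sketch of that step (the `<think>...</think>` tag format is as described above; the helper name is ours, and it tolerates blocks truncated by token exhaustion):

```python
import re

# Matches a leading <think>...</think> block. The (?:</think>|\Z) alternative
# also catches runs where the model exhausts max_tokens mid-reasoning and
# never emits the closing tag.
THINK_RE = re.compile(r"^\s*<think>.*?(?:</think>\s*|\Z)", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove the chain-of-thought block, returning only the final answer."""
    return THINK_RE.sub("", text, count=1).strip()
```

Note that a truncated thinking block yields an empty string, which is exactly the "empty responses" failure mode counted in the reliability table below.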

#### How to Disable Thinking

**llama.cpp** (requires `--jinja` flag on server):
```json
{
  "chat_template_kwargs": {"enable_thinking": false},
  "temperature": 0.7,
  "top_p": 0.8
}
```

**vLLM**:
```json
{
  "extra_body": {"chat_template_kwargs": {"enable_thinking": false}}
}
```

**Important:** Qwen3.5 does NOT support `/think` `/nothink` prompt switches — those are Qwen3-only features. The only way to control thinking in Qwen3.5 is via `chat_template_kwargs`.
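
The two request shapes can be wrapped in one helper. A minimal sketch, assuming an OpenAI-compatible endpoint on either backend; the function name `chat_payload` is ours:

```python
def chat_payload(messages, backend="llama.cpp", model="qwen3.5",
                 max_tokens=1024, thinking=False):
    """Build an OpenAI-compatible request body with thinking-mode control.

    llama.cpp (launched with --jinja) reads chat_template_kwargs at the top
    level of the body; vLLM expects it nested under extra_body.
    """
    body = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        # Instruct-mode sampling per the recommended-parameters table above.
        "temperature": 0.7,
        "top_p": 0.8,
    }
    kwargs = {"chat_template_kwargs": {"enable_thinking": thinking}}
    if backend == "vllm":
        body["extra_body"] = kwargs
    else:
        body.update(kwargs)
    return body
```

The returned dict is POSTed as-is to `/v1/chat/completions` on port 8002 (llama.cpp) or 8000 (vLLM).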

#### Thinking ON vs OFF — Comprehensive Evaluation (58 evals, Feb 26, 2026)

| Task | Thinking ON | Thinking OFF | Delta | Notes |
|------|-----------|------------|-------|-------|
| Sensitivity | 7.2 | **8.0** | +0.8 | Thinking overthought simple classifications |
| Briefing | **8.5** | 7.9 | -0.6 | Thinking helps creative tasks slightly |
| Nudge | 5.0 | **8.6** | +3.6 | Thinking caused 50% failure rate (token exhaustion) |
| Email Triage | 7.2 | **8.0** | +0.8 | Same as sensitivity — classification improved |
| Email Draft | **8.6** | 8.0 | -0.6 | Thinking helps creative tasks slightly |
| Chat QA | 8.7 | **8.7** | 0.0 | No difference |
| Contradiction | 6.4 | **7.0** | +0.6 | Structured output improved |
| Entity Extraction | **8.6** | 8.2 | -0.4 | Thinking helps complex extraction slightly |
| **AVERAGE** | **7.5** | **8.0** | **+0.5** | |

#### Speed Comparison

| Task | Thinking ON | Thinking OFF | Speedup |
|------|-----------|------------|---------|
| Entity Extraction | 77s/transcript | **15.5s** | **5x** |
| Sensitivity | 64s/input | **6.1s** | **10x** |
| Email Triage | 95s/input | **1.6s** | **59x** |
| Nudge | 295s/input | **2.6s** | **113x** |
| Chat QA | 72s/input | **4.6s** | **16x** |
| Contradiction | 104s/input | **2.9s** | **36x** |

#### Reliability

| Metric | Thinking ON | Thinking OFF |
|--------|-----------|------------|
| Empty responses | 4/58 (7%) | **0/58 (0%)** |
| Timeouts | 1/58 (2%) | **0/58 (0%)** |
| Total output tokens | 199,685 | **13,943** |

#### Why Thinking Hurts

1. **Token exhaustion:** Thinking consumes 3-16K tokens of context. With `--ctx-size 8192`, prompt (~300 tokens) + thinking (~7K) = ~7.3K, leaving only ~900 tokens for the actual response.
2. **Overthinking simple tasks:** Sensitivity classification, email triage, and contradiction detection are simple structured tasks. The model writes 2-8K tokens of reasoning for a 40-word JSON response.
3. **Cost of context:** Even with `--ctx-size 32768`, thinking still uses 10-20x more tokens per call, with no quality benefit on 5/8 tasks.
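
The token arithmetic in point 1 reduces to a one-line budget check (the figures below are the measured values from this section):

```python
def response_budget(ctx_size: int, prompt_tokens: int, thinking_tokens: int) -> int:
    """Tokens left for the actual response after prompt and thinking."""
    return ctx_size - prompt_tokens - thinking_tokens

# With the old 8K context, a typical prompt plus mid-range thinking
# leaves under 1K tokens for the answer:
assert response_budget(8192, 300, 7000) == 892
# At 32K the same call is safe, but thinking still burns ~7K tokens:
assert response_budget(32768, 300, 7000) == 25468
```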

#### Production Recommendation

```
ALWAYS use: chat_template_kwargs: {"enable_thinking": false}
ALWAYS use: --jinja flag on llama-server
ALWAYS use: temperature: 0.7, top_p: 0.8 (per Qwen instruct-mode docs)
ALWAYS use: --ctx-size 32768 (minimum, for complex prompts)
```

### Why This Matters for her-os
Qwen3.5-35B-A3B **competes with all text models** — not just VLM:
1. **Entity extraction A-(8.2)** — beats Qwen3 32B A-(8.0), close to Haiku A(8.7)
2. **Task benchmark 8.0 (#4/9)** — wins sensitivity, nudge, email triage
3. **Vision** — Image + Video + OCR that no other local model has
4. **Speed** — 15.5s/entity extraction (7x faster than Qwen3 110s)

One model instead of three. Tool calling for MCP portals. Apache 2.0.

**VRAM budget impact (validated):**
- Qwen3 32B (Q4, ollama): 20 GB → leaves 108 GB — but text-only, no vision
- **Qwen3.5-35B-A3B (Q4, llama.cpp)**: 21 GB → leaves **~100 GB** — text + vision + video
- Qwen3.5-35B-A3B (BF16, vLLM): ~94 GB → leaves only ~34 GB — same capabilities, 4.5x more VRAM

## Qwen3.5-122B-A10B — Large MoE

### Specifications
- 122B total, 10B active (MoE)
- More capable than 35B-A3B for complex reasoning ("Long-Horizon" tasks)
- Multimodal: Image + Video + Text
- Apache 2.0

### DGX Spark Fit
| Quantization | Size | Speed | Status |
|-------------|------|-------|--------|
| BF16 | ~245 GB | N/A | **Won't fit** (128 GB) |
| UD-Q4_K_XL | ~70 GB | ~24 tok/s (community) | Fits, needs validation |
| UD-Q3_K_XL | ~55 GB | Unknown | Fits, untested |

- UD-Q4_K_XL stays within ~1 point accuracy of original (Unsloth benchmark)
- GGUF files available: `unsloth/Qwen3.5-122B-A10B-GGUF`

### Known Issues
- **llm-compressor incompatibility** with Transformers v5.0+: `ImportError: cannot import name 'TORCH_INIT_FUNCTIONS'`
- NVFP4 checkpoints may have tensor parallelism mismatches (TP=2 vs TP=1)
- FP8 quantization from Red Hat AI available for 397B variant
- No confirmed vLLM setup yet (community working on it via spark-vllm-docker)

### Ollama Availability
- `ollama pull qwen3.5:122b-a10b` — available with UD-Q4_K_XL
- Download: ~65-70 GB estimated
- **Same ollama bug likely applies** — may need vLLM or llama.cpp instead

### Inference Options for 122B
| Stack | Quantization | Size | Speed (est.) | Feasibility |
|-------|-------------|------|-------------|-------------|
| vLLM nightly | BF16 | ~245 GB | N/A | Won't fit (128 GB) |
| vLLM nightly | AWQ/GPTQ 4-bit | ~70 GB | ~20-25 tok/s | Needs testing |
| llama.cpp | UD-Q4_K_XL GGUF | ~70 GB | ~20-25 tok/s | Most promising |
| Ollama | UD-Q4_K_XL | ~70 GB | ~20-25 tok/s | Blocked by DeltaNet bugs |

Speed estimates based on similar MoE models on DGX Spark:
- `glm4moe 106B.A12B Q4_K` (67.85 GiB): 18.45 tok/s tg, 817 tok/s pp
- `qwen3moe 30B.A3B Q8_0` (30.25 GiB): 47.08 tok/s tg, 2916 tok/s pp
(Source: [hardware-corner.net DGX Spark benchmarks](https://www.hardware-corner.net/first-dgx-spark-llm-benchmarks/))

## Comparison: Qwen3 32B vs Qwen3.5

| Feature | Qwen3 32B (current) | Qwen3.5-35B-A3B | Qwen3.5-122B-A10B |
|---------|---------------------|------------------|---------------------|
| Total params | 32B | 35B | 122B |
| Active params | **32B** (dense) | **3B** (MoE) | **10B** (MoE) |
| Vision | **No** | **Yes** | **Yes** |
| Video | **No** | **Yes** | **Yes** |
| Tool calling | Limited | **Native** | **Native** |
| Thinking mode | No | Yes | Yes |
| Context | 128K | **262K** (1M w/YaRN) | **262K** (1M w/YaRN) |
| DGX Spark speed | ~73 tok/s (ollama Q4) | **61 tok/s** (Q4) / 30 tok/s (BF16) | **24 tok/s** (Q4_K_XL est.) |
| VRAM (quantized) | ~20 GB (Q4) | **~21 GB (Q4)** / 94 GB (BF16) | ~70 GB (Q4) |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |

## NVIDIA NIM VLM Container Status

### Cosmos Reason2 8B — FAILED on DGX Spark
- Container: `nvcr.io/nim/nvidia/cosmos-reason2-8b:latest` (19.5 GB)
- Auto-detected DGX Spark, selected FP8 profile
- **Failed**: `cudaErrorStreamCaptureInvalidated` during CUDA graph compilation
- Root cause: vLLM + torch.compile/CUDA graphs incompatible with Blackwell SM_120
- Same class of issue as other torch.compile failures on Blackwell

### Nemotron Nano 12B VL — FAILED on DGX Spark
- Container: `nvcr.io/nim/nvidia/nemotron-nano-12b-v2-vl:latest` (19.9 GB)
- Auto-detected DGX Spark, selected BF16 profile
- **Failed**: Container hung indefinitely after profile selection — 0% GPU utilization, no inference
- Container alive but no progress after 10+ minutes
- Root cause: Likely same vLLM/Blackwell incompatibility (different failure mode)

### Conclusion
**Both NVIDIA NIM VLM containers fail on DGX Spark.** The underlying vLLM in NIM containers is too old for Blackwell's SM_120 architecture. The vLLM nightly build (`cu130-nightly`) fixes this, but NIM containers use pinned vLLM versions.

## Recommendation for her-os

### RECOMMENDED: Q4_K_M via llama.cpp (FORCE_CUBLAS)
- **61 tok/s** text generation, **155 tok/s** prompt processing
- **~21 GiB VRAM** — leaves **~100 GiB** for other models
- Vision works (image description with base64)
- Thinking mode works naturally (reasoning model)
- Perfect JSON entity extraction
- OpenAI-compatible API on configurable port

### Also Validated: BF16 via vLLM nightly
- **30 tok/s** text generation
- **~94 GiB VRAM** — leaves only **~34 GiB**
- Vision works with HTTPS URLs
- Tool calling enabled (`--enable-auto-tool-choice --tool-call-parser qwen3_coder`)
- Better for: multi-user concurrency (423 tok/s aggregate at 100 users)

### VRAM Budget Decision — RESOLVED

**Q4_K_M is the clear winner for her-os:**

| Budget Item | Q4 llama.cpp | BF16 vLLM |
|------------|-------------|-----------|
| Qwen3.5-35B-A3B | **21 GiB** | 94 GiB |
| NV-Embed-v2 8B (embeddings) | 14 GiB | 14 GiB |
| Whisper large-v3 (STT) | 9 GiB | 9 GiB |
| Kokoro-82M (TTS) | 0.5 GiB | 0.5 GiB |
| cuVS/cuGraph | 2 GiB | 2 GiB |
| pyannote (diarization) | 0.2 GiB | 0.2 GiB |
| **Total** | **~47 GiB** | **~120 GiB** |
| **Free headroom** | **~81 GiB** | **~8 GiB** |

Q4 gives **10x more headroom** than BF16. Room for Qwen3.5-122B-A10B upgrade, additional models, or larger context windows.
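
The budget table reduces to simple arithmetic. A sketch using the component sizes above (GiB; component labels abbreviated, dict name is ours):

```python
# Shared her-os stack, identical under both configs (GiB).
COMMON = {"NV-Embed-v2": 14.0, "Whisper large-v3": 9.0, "Kokoro-82M": 0.5,
          "cuVS/cuGraph": 2.0, "pyannote": 0.2}
TOTAL_VRAM = 128.0  # GB10 unified memory

def headroom(llm_gib: float) -> float:
    """Free memory after the LLM plus the shared her-os stack."""
    return TOTAL_VRAM - llm_gib - sum(COMMON.values())

q4 = headroom(21.0)    # llama.cpp Q4_K_M: ~81 GiB free
bf16 = headroom(94.0)  # vLLM BF16: ~8 GiB free
```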

### Architectural Decision: Qwen3.5-122B-A10B — REJECTED

**Decision:** Do NOT pursue Qwen3.5-122B-A10B on Titan. Use 35B-A3B exclusively.

**Rationale:**
- Q4_K_M of 122B-A10B requires ~70 GiB VRAM
- With current stack (embedding 14 GiB + STT 9 GiB + TTS 0.5 GiB + cuVS/cuGraph 2 GiB + pyannote 0.2 GiB), 122B needs **~96 GiB total** vs 128 GiB available
- Only **~32 GiB headroom** — insufficient for:
  - Redis, PostgreSQL, FalkorDB memory overhead
  - Concurrent model loading during startup
  - Context window scaling (larger prompts = more KV cache)
  - Future components (GLiNER NER, additional models)
- 35B-A3B at Q4 (21 GiB) leaves **~81 GiB free** — comfortable margin
- 122B at ~20-25 tok/s is 3x slower than 35B's 61 tok/s for single-user use
- 35B-A3B already scores well on entity extraction with only 3B active params (pending benchmark confirmation)
- The 122B would be better suited for multi-GPU setups or cloud deployments, neither of which applies to her-os

**Date:** 2026-02-26 (Session 66)

### Future Paths
- **Ollama stable release**: When DeltaNet support lands, ollama may be simpler than llama.cpp
- **llama.cpp MMQ fix**: Track [#18331](https://github.com/ggml-org/llama.cpp/issues/18331) — when native MXFP4 kernel works on Blackwell, FORCE_CUBLAS may no longer be needed

## VLM Testing — OCR, Document Understanding, Kannada (Feb 27, 2026, Session 73)

**TL;DR:** English OCR and document extraction are **excellent** (9/10). Kannada OCR is **not viable** (2/10 pure, 5/10 code-mixed). For her-os: Qwen3.5 VLM covers receipts, screenshots, English documents. Kannada image OCR needs a separate dedicated model.

### Test Setup
- **Server:** llama-server on Titan, port 8002, Q4_K_M + mmproj-BF16.gguf
- **Thinking mode:** DISABLED (`chat_template_kwargs: {"enable_thinking": false}`)
- **Sampling:** temperature 0.7, top_p 0.8
- **Max tokens:** 1500-2000 per test
- **Images:** base64-encoded PNG via OpenAI-compatible `/v1/chat/completions`

### Test Results Summary

| # | Test | Score | Latency | Tokens | Verdict |
|---|------|-------|---------|--------|---------|
| 1 | Scanned Document OCR (English) | **9/10** | 19.0s | 353 | Near-perfect extraction |
| 2 | Structured JSON Extraction | **9/10** | 20.7s | 449 | Valid JSON, all fields correct |
| 3 | Screenshot Understanding | **8/10** | 41.4s | 1254 | Detailed, accurate UI analysis |
| 4 | Kannada Text Recognition | **2/10** | 5.8s | 307 | Total failure — hallucinated content |
| 5 | Mixed Kannada-English | **5/10** | 12.7s | 681 | English perfect, Kannada ~50% |

### Test 1: Scanned Document OCR (English) — 9/10

**Image:** Scanned legal document (Court of Special Deputy Commissioner-3, Bengaluru). 739 KB, typewritten with handwritten date and signature.

**Prompt:** "Extract all text from this scanned document. Preserve the structure and layout as closely as possible."

**Results:**
- ✅ All typed text extracted accurately
- ✅ Structure and layout preserved (header, case number, petitioner, respondents)
- ✅ Handwritten date "7th day of August 2025" correctly read
- ✅ All three respondent names, addresses, ages, and relations correct
- ✅ Case number "RRT(2)(NA)CR : 131/2009-10" perfect
- ⚠️ Minor: "BENAGLURU" in header — but this matches the actual document (the document itself has "BENAGLURU" as a typo in the header while "BENGALURU" appears in the pincode line)
- ✅ Handled S/o. abbreviations, R/at, Smt., Sri. correctly

**Latency:** 19.0s (3068 prompt tokens for image encoding + 353 completion)

### Test 2: Structured JSON Extraction — 9/10

**Same image** as Test 1.

**Prompt:** "Extract structured information as JSON: court_name, case_number, date, presiding_officer, petitioner, respondents array with name, relation, address, age."

**Results:**
```json
{
  "court_name": "Special Deputy Commissioner-3",
  "subdivision": "Bengaluru North Sub-Division, Bengaluru-560009",
  "case_number": "RRT(2)(NA)CR : 131/2009-10",
  "date": "7th August 2025",
  "presiding_officer": {
    "name": "Archana.M.S., I.A.S.",
    "designation": "Special Deputy Commissioner-3, Bengaluru North Sub-Division, Bengaluru"
  },
  "petitioner": {
    "name": "The Government of Karnataka",
    "represented_by": "Tahsildar, Bengaluru East Taluk, Bengaluru"
  },
  "respondents": [
    {"name": "M. Rajanna", "relation": "S/o. late Muddanna", "address": "R/at Kuduregere Village, Jala Hobli, Yelahanka Taluk, Bengaluru", "age": null},
    {"name": "Sri. Ashok Kumar", "relation": "S/o. late T.N. Mudappa", "address": "R/at No.30/12, 6th Main, Vyalikaval, Malleshwaram, Bengaluru-560003", "age": "about 48 years"},
    {"name": "Smt. Tarakeshwari", "relation": "S/o. late T.N. Mudappa", "address": "R/at No.78, B/7, 3rd Main, 6th Cross, Prakash Nagar, Opp. to Raghavendra Temple Sriramapura, Bengaluru-560021", "age": "about 44 years"}
  ]
}
```
- ✅ Valid JSON (parseable, no errors)
- ✅ null for respondent 1's age (correctly absent from document)
- ✅ All fields accurate including nested objects
- ✅ Responded with ONLY JSON (wrapped in markdown code block)
- ⚠️ Minor: wrapped in ` ```json ``` ` markdown — would need stripping in production

**Latency:** 20.7s (449 completion tokens)
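
The markdown wrapping flagged in the results above is easy to strip before parsing. A minimal sketch of that production step; the helper name is ours:

```python
import json
import re

# Optional ```json ... ``` fence around an otherwise-clean JSON response.
FENCE_RE = re.compile(r"^\s*```(?:json)?\s*\n(.*?)\n\s*```\s*$", re.DOTALL)

def parse_model_json(text: str):
    """Parse a JSON response, tolerating an optional markdown code fence."""
    m = FENCE_RE.match(text)
    if m:
        text = m.group(1)
    return json.loads(text)
```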

### Test 3: Screenshot Understanding — 8/10

**Image:** Icon marketplace webpage (showing "Siren" icon from "World Mythology - ink" set). 470 KB.

**Prompt:** "Describe what you see. What website or application is this? Be specific about UI elements."

**Results:**
- ✅ Correctly identified: icon marketplace platform
- ✅ Icon described accurately: "black line-art illustration of a siren"
- ✅ Artist name "Saeful Muslim" and set "World Mythology - ink" both correct
- ✅ All three pricing tiers correctly identified ($3.33/mo, $4.99, $0)
- ✅ UI elements described: customization panel, color/background options, CTA button
- ✅ Bottom icon row contents described (yeti, phoenix, minotaur, etc.)
- ⚠️ Identified as "IconScout.com" — reasonable guess, may not be exact site
- ⚠️ Very verbose (1254 tokens) — would need prompt engineering to constrain
- ⚠️ Used emojis in response despite not being asked

**Latency:** 41.4s (4095 prompt tokens for 470 KB image + 1254 completion)

### Test 4: Kannada Text Recognition — 2/10

**Image:** PIL-generated PNG with pure Kannada text (Noto Sans Kannada, 36pt, black on white). 26 KB.

**Ground truth:**
```
ನಮಸ್ಕಾರ, ನಾನು ರಾಜೇಶ್. ಬೆಂಗಳೂರಿನಲ್ಲಿ ವಾಸಿಸುತ್ತಿದ್ದೇನೆ.
ನನ್ನ ತಾಯಿ ಮಲ್ಲಿಗೆ ಹೂವು ತರಲು ಹೇಳಿದ್ದಾರೆ.
```
(Translation: "Hello, I am Rajesh. I live in Bengaluru. My mother asked me to bring jasmine flowers.")

**Model output (run 2):**
```
ನಮನಾರ, ನಾನು ರಾಜೀನಾ. ಜಿಂಗೆದಿನ್ಲಿ ಲಾಸಿಲುಳ್ಳದ್ದು.
ನನ್ನ ತಂದೆ ಮಲಗಿ ಕುಳಿತು ಕೆರಲು ಕೋಲಾಯ್ತೀ.
```
(Model's translation: "My father sat down and wept over the broken pillar.")

**Error analysis:**
| Ground Truth | Model Read | Correct? |
|-------------|-----------|----------|
| ನಮಸ್ಕಾರ | ನಮನಾರ | ❌ |
| ನಾನು | ನಾನು | ✅ |
| ರಾಜೇಶ್ | ರಾಜೀನಾ | ❌ |
| ಬೆಂಗಳೂರಿನಲ್ಲಿ | ಜಿಂಗೆದಿನ್ಲಿ | ❌ |
| ವಾಸಿಸುತ್ತಿದ್ದೇನೆ | ಲಾಸಿಲುಳ್ಳದ್ದು | ❌ |
| ತಾಯಿ | ತಂದೆ | ❌ (mother→father) |
| ಮಲ್ಲಿಗೆ | ಮಲಗಿ | ❌ |
| ಹೂವು | ಕುಳಿತು | ❌ |
| ತರಲು | ಕೆರಲು | ❌ |
| ಹೇಳಿದ್ದಾರೆ | ಕೋಲಾಯ್ತೀ | ❌ |

**1/10 words correct.** Translation completely hallucinated. The model can recognize that it's Kannada script but cannot decode the actual glyphs. It produces plausible-looking Kannada text that is semantically unrelated to the input.

### Test 5: Mixed Kannada-English (Code-Mixed) — 5/10

**Image:** PIL-generated PNG with 4 lines of code-mixed text (dual-font: Noto Sans for English, Noto Sans Kannada for Kannada). Labels [Kn], [En], [Mix] in English font.

**Ground truth vs model output:**

| Line | Ground Truth | Model Read | English OCR | Kannada OCR |
|------|-------------|-----------|-------------|-------------|
| 1 | ಇವತ್ತು office ಗೆ late ಆಯ್ತು | ಇವತ್ತು office ಗಿ late ಆಯ್ತು | ✅ perfect | ⚠️ ಗೆ→ಗಿ (close) |
| 2 | The meeting got pushed to 3 PM | The meeting got pushed to 3 PM | ✅ perfect | N/A |
| 3 | ಅಮ್ಮ call ಮಾಡಿದ್ರು, ಸಂಜೆ ಮನೆಗೆ ಬಾ ಅಂತ | ಅಮ್ಮಾ call ಮಾಡಿದ್ರು, ಸಂಜೆ ಮನಿಗ್ ಬಾ ಅಂತ | ✅ perfect | ⚠️ ಮನೆಗೆ→ಮನಿಗ್ |
| 4 | Nandu ಜೊತೆ lunch ಮಾಡಿದೆ, ಅವನ startup idea ಚೆನ್ನಾಗಿದೆ | Nandu ಜೀವಿತ್ lunch ಮಾಡಿದ್ರು, ಅವನ startup idea ಜೀವಾನ್ದೀ | ✅ perfect | ❌ ಜೊತೆ→ಜೀವಿತ್, ಚೆನ್ನಾಗಿದೆ→ಜೀವಾನ್ದೀ |

**Key observations:**
- **English OCR: 100% accurate** — every English word read correctly
- **Kannada OCR: ~50% accurate** — common short words (ಇವತ್ತು, ಅಮ್ಮ, ಸಂಜೆ, ಬಾ ಅಂತ, ಅವನ) correct, longer/inflected forms fail
- **Code-mixing correctly detected** on lines 1, 3, 4
- **Language labels [Kn], [En], [Mix] correctly read**
- **Translation accuracy:** Line 3 incorrectly translates "ಮನೆಗೆ" ("to the house") as "office" — semantic error

**Pattern:** English context anchors improve Kannada recognition. Short, high-frequency Kannada words succeed; longer, inflected forms are hallucinated.
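
Scoring English and Kannada OCR separately, as the Test 5 table does, requires telling the two scripts apart. A small sketch using a standard Unicode-range check (the Kannada block is U+0C80–U+0CFF); this is an assumed classification rule, not the eval harness's actual code:

```python
# Classify tokens by script so English and Kannada OCR accuracy can be
# scored separately. A token counts as Kannada if any character falls
# in the Unicode Kannada block (U+0C80-U+0CFF).
KANNADA_BLOCK = range(0x0C80, 0x0D00)

def is_kannada(token: str) -> bool:
    return any(ord(ch) in KANNADA_BLOCK for ch in token)

line1 = "ಇವತ್ತು office ಗೆ late ಆಯ್ತು".split()
kannada_tokens = [t for t in line1 if is_kannada(t)]
english_tokens = [t for t in line1 if not is_kannada(t)]
print(kannada_tokens)  # ['ಇವತ್ತು', 'ಗೆ', 'ಆಯ್ತು']
print(english_tokens)  # ['office', 'late']
```

The any-character rule handles code-mixed tokens like "ಗೆ" attached to English loanwords; digits and punctuation fall through to the English bucket, which matches how the table treats "3 PM".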

### VLM Scoring for ADR-019 Routing Table

| VLM Task | Score | Grade | Production Ready? |
|----------|-------|-------|-------------------|
| English Document OCR | 9.0 | A | **Yes** |
| Structured Data Extraction | 9.0 | A | **Yes** |
| UI/Screenshot Analysis | 8.0 | A- | **Yes** |
| Kannada Pure Text OCR | 2.0 | F | **No** |
| Kannada-English Code-Mixed | 5.0 | C | **No** (unreliable) |
| **Weighted VLM Average** | **6.6** | **B-** | **Partial** |
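
The per-task weights are not stated in the table; with equal weights, the plain mean of the five scores reproduces the 6.6 figure, so that appears to be the weighting used. A quick check:

```python
# Per-task weights aren't stated in the routing table; with equal
# weights, the mean of the five VLM scores reproduces the 6.6 figure.
scores = {
    "English Document OCR": 9.0,
    "Structured Data Extraction": 9.0,
    "UI/Screenshot Analysis": 8.0,
    "Kannada Pure Text OCR": 2.0,
    "Kannada-English Code-Mixed": 5.0,
}
average = sum(scores.values()) / len(scores)
print(f"VLM average: {average:.1f}")  # VLM average: 6.6
```

If Annie's workload skews toward English receipts and screenshots, a usage-weighted average would land well above 6.6, which is the argument for the "Partial" verdict rather than a blanket fail.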

### her-os Implications

**What Qwen3.5 VLM CAN do for Annie:**
1. **Receipt/bill scanning** — Extract amounts, dates, vendors from English receipts (Test 1+2 quality)
2. **Screenshot analysis** — Understand UI context when Rajesh shares screenshots (Test 3 quality)
3. **Document digitization** — Extract structured data from English legal/official documents
4. **Visual context** — Describe photos, identify objects, read English signage

**What Qwen3.5 VLM CANNOT do:**
1. **Kannada document OCR** — Cannot reliably read Kannada script in images
2. **Kannada signage/handwriting** — likely to perform even worse than typed text
3. **Code-mixed Kannada chat screenshots** — English parts OK, Kannada unreliable

**Recommendation (updated Feb 27, session 73 research):** For Kannada image OCR, Annie needs a separate model:
- **Sarvam Vision** (3B VLM, free API through Feb 2026) — 23 Indic languages, est. 88-93% Kannada accuracy. **API-only** — no open weights yet. Best quality option.
- **IndicPhotoOCR** (Bhashini/IITJ, open source) — **94.5% Kannada scene text accuracy**. Self-hosted. Best local-first option for scene text (signs, photos).
- **Surya OCR** (Datalab, open source) — 90+ languages including Kannada. Self-hosted. Untested on Kannada but architecture is sound.
- ~~**Google Cloud Vision API**~~ — does NOT list Kannada as supported. Removed from candidates.
- **BharatGen Param2-17B** (HuggingFace, MoE 2.4B active) — explicit Kannada benchmarks. ~10-12 GB Q4. Worth evaluating as Kannada text specialist (not OCR-specific but could help with post-OCR extraction).

**Strategy:** use the Sarvam Vision API now (free), evaluate IndicPhotoOCR and Surya on Titan as self-hosted alternatives, and watch for Sarvam Vision open weights.

For English VLM tasks, Qwen3.5-35B-A3B is **production-ready at 9/10 quality**.
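
The English/Kannada split above reduces to a routing rule: detect Kannada script in a cheap first-pass transcription and hand off to an Indic-capable backend. A minimal sketch; the backend identifiers are illustrative placeholders, not confirmed ADR-019 entries:

```python
# Sketch of the English/Kannada OCR split described above. Backend
# names are illustrative placeholders, not confirmed ADR-019 entries.
KANNADA_BLOCK = range(0x0C80, 0x0D00)

def contains_kannada(text: str) -> bool:
    return any(ord(ch) in KANNADA_BLOCK for ch in text)

def route_ocr(first_pass_text: str) -> str:
    """Pick an OCR backend from a cheap first-pass transcription."""
    if contains_kannada(first_pass_text):
        # Qwen3.5 hallucinates Kannada glyphs (Test 4: 1/10 words);
        # hand off to an Indic-capable model instead.
        return "sarvam-vision"   # or "indicphotoocr" when self-hosted
    return "qwen3.5-35b-a3b"     # 9/10 on English documents

print(route_ocr("Invoice total: $42.00"))  # qwen3.5-35b-a3b
print(route_ocr("ಇವತ್ತು office ಗೆ late"))   # sarvam-vision
```

One caveat the sketch glosses over: if Qwen3.5 itself does the first pass, it reliably *detects* Kannada script (Test 4 showed it knows the script even when it misreads the glyphs), so detection-then-handoff is plausible even though its transcription is not.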

## Sources
- [HuggingFace: Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- [Unsloth: Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF) — All GGUF quantization variants
- [GitHub: adadrag/qwen3.5-dgx-spark](https://github.com/adadrag/qwen3.5-dgx-spark) — Complete DGX Spark guide with benchmarks
- [NVIDIA Forum: Qwen3.5-35B-A3B on DGX Spark](https://forums.developer.nvidia.com/t/qwen3-5-35b-a3b-on-nvidia-dgx-spark/361724)
- [NVIDIA Forum: Qwen3.5-122B-A10B discussion](https://forums.developer.nvidia.com/t/qwen-qwen3-5-122b-a10b-alibaba-qwen-thought-about-us-d/361639)
- [NVIDIA Forum: Experimental MXFP4 support for Blackwell](https://forums.developer.nvidia.com/t/llama-cpp-experimental-native-mxfp4-support-for-blackwell-pr/355639)
- [Arm: Build llama.cpp GPU on DGX Spark](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu/)
- [llama.cpp #18331: CUDA MUL_MAT crash on Blackwell](https://github.com/ggml-org/llama.cpp/issues/18331) — **CLOSED** (2026-01-29). nvcc O3 optimization bug. FORCE_CUBLAS workaround stays. CUDA 13.1 likely has the underlying fix but unconfirmed on Blackwell sm_121.
- [llama.cpp #18425: MMQ build breaks for SM_121](https://github.com/ggml-org/llama.cpp/issues/18425)
- [llama.cpp #19683: qwen35moe degenerate output with FORCE_CUBLAS](https://github.com/ggml-org/llama.cpp/issues/19683)
- [llama.cpp #19857: Unsloth GGUF rope dimensions fix](https://github.com/ggml-org/llama.cpp/issues/19857)
- [Ollama: qwen3.5:35b-a3b](https://ollama.com/library/qwen3.5:35b-a3b)
- [Unsloth: Qwen3.5 Local Guide](https://unsloth.ai/docs/models/qwen3.5)
- [MarkTechPost: Qwen3.5 Release](https://www.marktechpost.com/2026/02/24/alibaba-qwen-team-releases-qwen-3-5-medium-model-series-a-production-powerhouse-proving-that-smaller-ai-models-are-smarter/)
- [NIM VLM Support Matrix](https://docs.nvidia.com/nim/vision-language-models/latest/support-matrix.html)
- [llama.cpp DGX Spark Discussion #16578](https://github.com/ggml-org/llama.cpp/discussions/16578)
- [Hardware Corner: First DGX Spark LLM Benchmarks](https://www.hardware-corner.net/first-dgx-spark-llm-benchmarks/) — MoE model performance data
