# Research: Qwen3.5-9B True Context Length

**Date:** 2026-03-14 (Session 332)
**Status:** Implemented (Session 334, ADR-026) — bumped to 32K
**Trigger:** Rajesh suspected ctx-size might be 32K not 16K — turned out to be 262K

---

## Summary

The base Qwen3.5-9B supports **262,144 tokens natively** (extensible to ~1M via YaRN). Our docs and compaction config incorrectly claim "32K (base arch)". The Opus-Distilled version was **fine-tuned at 16K**, which is why `start.sh` uses `--ctx-size 16384` — but the underlying architecture supports 262K.

## Findings

### Base Qwen3.5-9B: 262K Native Context

- **`original_max_position_embeddings`: 262,144** (from HuggingFace config)
- Extensible to **1,010,000 tokens** via YaRN rope scaling
- Qwen recommends maintaining **at least 128K context** to preserve thinking capabilities

Source: [Qwen/Qwen3.5-9B HuggingFace](https://huggingface.co/Qwen/Qwen3.5-9B)

### Opus-Distilled Version: Fine-Tuned at 16K

The Jackrong model card explicitly states:

> "Fine-tuned smoothly with a **16,384 token context window** allowing complex multi-step reasoning traces to exist gracefully within memory limits."

SFT doesn't alter the architecture or positional embeddings — it only trains on shorter sequences. Models routinely work beyond their SFT context, especially for conversation (vs. needle-in-haystack retrieval).

Source: [Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF](https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF)

### Architecture: Why Long Context is Cheap

Qwen3.5-9B uses a **hybrid architecture** (32 layers total):

| Layer Type | Count | % of Layers | KV Cache Behavior |
|-----------|-------|-------------|-------------------|
| Gated DeltaNet (linear attention) | 24 | 75% | **Constant memory** — no KV cache growth |
| Gated Attention (standard) | 8 | 25% | Standard KV cache, but only 4 KV heads (GQA) |

Per 4-layer block: 3 DeltaNet + 1 Gated Attention (repeated 8 times across the 32 layers).

- **Gated DeltaNet specs:** 32 linear-attention V heads, 16 QK heads, head dim 128
- **Gated Attention specs:** 16 Q heads, 4 KV heads, head dim 256, RoPE dim 64

This means KV cache grows in only **8 of 32 layers**, making context expansion extremely VRAM-efficient.

### VRAM Cost of Larger Context

| ctx-size | Extra VRAM (over 16K) | Feasible on Titan (128 GB)? |
|----------|----------------------|----------------------------|
| 16,384 (current) | baseline (~6.6 GB model) | Yes (current) |
| 32,768 | ~1 GB | Easily |
| 65,536 | ~2 GB | Easily |
| 131,072 | ~4 GB | Yes |
| 262,144 (native max) | ~8 GB | Yes, but test quality first |

Sources: [Kaitchup KV Cache Breakdown](https://kaitchup.substack.com/p/qwen35-9b-4b-2b-and-08b-gpu-requirements), [Qwen llama.cpp docs](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html)
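The table above can be sanity-checked against the architecture specs. A back-of-envelope sketch (assumes an fp16 KV cache and ignores the DeltaNet layers' small constant-size state; llama.cpp's actual allocation will differ with cache quantization and padding):

```python
# Back-of-envelope KV cache size for Qwen3.5-9B (fp16 cache).
# Only the 8 Gated Attention layers grow with context; the 24
# DeltaNet layers hold constant-size state and are ignored here.
ATTN_LAYERS = 8
KV_HEADS = 4        # GQA: 4 KV heads (vs. 16 Q heads)
HEAD_DIM = 256
BYTES_PER_ELEM = 2  # fp16

def kv_cache_bytes(ctx_size: int) -> int:
    # 2x for keys + values
    return ATTN_LAYERS * 2 * KV_HEADS * HEAD_DIM * ctx_size * BYTES_PER_ELEM

for ctx in (16_384, 32_768, 65_536, 131_072, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

This lands at ~0.5 GiB at 16K and ~8 GiB at the 262K native max, in line with the table's figures.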

### Ollama vs llama.cpp

Ollama does **NOT** support Qwen3.5's DeltaNet architecture (confirmed in session 259, `docs/RESEARCH-QWEN35-9B-EVAL.md`). Only `llama.cpp` (llama-server) supports it. This remains true — our `llama-server` on port 8003 is the correct runtime.

---

## What's Wrong in Current Docs/Code

| Location | Current Claim | Actual |
|----------|--------------|--------|
| `docs/RESEARCH-CONTEXT-COMPACTION.md:313` | "32K (base arch), 16K (training context)" | **262K (base arch)**, 16K (distilled SFT) |
| `compaction.py` preset `qwen3.5-9b-32k` | Implies 32K is the base model max | 262K native; 32K is just a conservative setting |
| `start.sh:297` comment | "ctx-size 16K (training context)" | Correct for distilled, but should note base is 262K |
| `start.sh:300` comment | "Rollback: ctx-size to 32768" | 32K is conservative; base supports 262K |
| `docs/RESEARCH-CONTEXT-COMPACTION.md:314` | "Qwen3.5-9B (base) — 32K" | **262K** |

---

## Action Items

### 1. Bump `--ctx-size` to 32768 (low risk, high reward)

**Where:** `start.sh:305`
```bash
# Current
--ctx-size 16384
# Proposed
--ctx-size 32768
```

**Why:** ~1 GB extra VRAM. Doubles Annie's conversation window before compaction fires. The base architecture fully supports this; even the distilled SFT at 16K doesn't mean the model breaks at 16K+1 — positional embeddings are unchanged.

**Risk:** SFT quality *might* degrade slightly beyond 16K for complex reasoning chains. For conversational voice assistant use, this is unlikely to matter. Worth A/B testing.

### 2. Update CompactionConfig preset

**Where:** `services/annie-voice/compaction.py:46-48`
```python
# Current
"qwen3.5-9b": CompactionConfig(
    ctx_size=16384, ...
)
# Proposed
"qwen3.5-9b": CompactionConfig(
    ctx_size=32768, ...
)
```

Also update tests that hardcode `ctx_size=16384` for the default preset.
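One way to keep tests from drifting again is to have them read the preset instead of a literal. A minimal sketch of the preset-lookup shape (the real `CompactionConfig` in `compaction.py` has additional fields, elided as `...` above; this mirrors only `ctx_size`):

```python
from dataclasses import dataclass

# Minimal sketch; the actual CompactionConfig carries more fields.
@dataclass(frozen=True)
class CompactionConfig:
    ctx_size: int

PRESETS = {
    "qwen3.5-9b": CompactionConfig(ctx_size=32768),
}

def get_config(name: str) -> CompactionConfig:
    return PRESETS[name]
```

Tests then assert against `get_config("qwen3.5-9b").ctx_size` rather than a hardcoded `16384`, so a future preset bump updates them for free.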

### 3. Update server.py hardcoded ctx_size

**Where:** `services/annie-voice/server.py:407`
```python
# Current
ctx_size = 16384
# Proposed — read from compaction config instead of hardcoding
import os

from compaction import get_config

config = get_config(os.getenv("LLM_BACKEND", "qwen3.5-9b"))
ctx_size = config.ctx_size
```

### 4. Fix incorrect docs

- `docs/RESEARCH-CONTEXT-COMPACTION.md:313` — Change "32K (base arch)" to "262K (base arch)"
- `docs/RESEARCH-CONTEXT-COMPACTION.md:314` — Change base model to 262K
- `start.sh:297` — Add note: "base arch supports 262K"
- Update `docs/TITAN-SETUP-RECIPES.md:595` — Minimum ctx-size recommendation should reference 262K capability

### 5. Optional: Test 64K or 128K

Because the DeltaNet architecture keeps KV cache growth minimal (~2-4 GB for 64-128K), larger contexts are worth testing. Annie voice sessions could then run for hours without compaction.
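The "hours without compaction" claim can be roughed out. A quick estimate (the ~200 tokens per turn and 2,048-token reserve are assumptions for a voice exchange, not measured values):

```python
# Rough estimate of how many conversation turns fit before compaction
# fires. TOKENS_PER_TURN is an assumed average for one voice exchange
# (user utterance + assistant reply).
TOKENS_PER_TURN = 200

def turns_before_full(ctx_size: int, reserved: int = 2048) -> int:
    """Turns that fit, holding back `reserved` tokens for the system
    prompt and generation headroom."""
    return (ctx_size - reserved) // TOKENS_PER_TURN

for ctx in (16_384, 32_768, 65_536, 131_072):
    print(f"{ctx:>7} ctx -> ~{turns_before_full(ctx)} turns")
```

Under these assumptions, 64K roughly quadruples the current 16K turn budget (~71 turns to ~317) before compaction is needed.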

**Test plan:**
1. Start llama-server with `--ctx-size 65536`
2. Run a long conversation (50+ turns)
3. Monitor VRAM via `nvidia-smi`
4. Check response quality at various context fill levels
5. Compare with 32K baseline

---

## References

- [Qwen/Qwen3.5-9B — HuggingFace Model Card](https://huggingface.co/Qwen/Qwen3.5-9B)
- [Jackrong Opus-Distilled GGUF](https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF)
- [Kaitchup: Qwen3.5 GPU Requirements & KV Cache Breakdown](https://kaitchup.substack.com/p/qwen35-9b-4b-2b-and-08b-gpu-requirements)
- [Qwen3.5 llama.cpp Guide](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html)
- [Unsloth: Qwen3.5 Local Guide](https://unsloth.ai/docs/models/qwen3.5)
- [Ollama: qwen3.5:9b](https://ollama.com/library/qwen3.5:9b)
