# Qwen3.5-9B vs Llama 3.1 8B — Evaluation Report

**Date:** 2026-03-03
**Purpose:** Data-driven comparison for Annie Voice tool-calling LLM
**Decision:** See Recommendation below

## Model Specs

| Spec | Llama 3.1 8B | Qwen3.5-9B |
|------|-------------|------------|
| Release | July 2024 | March 2, 2026 |
| Parameters | 8B dense | 9B (DeltaNet+MoE) |
| Architecture | LLaMA Transformer | DeltaNet + MoE |
| Context | 128K | 262K |
| License | Llama 3.1 Community | Apache 2.0 |
| GGUF size (Q4_K_M) | ~4.5 GiB | ~5.3 GiB |
| GPQA Diamond | 51.0 | 81.7 |
| MMLU-Pro | 69.4 | 82.5 |
| IFEval (instruction following) | — | 91.5 |
| Inference platform | Ollama | llama.cpp (DeltaNet unsupported by Ollama) |
| Port on Titan | 11434 (Ollama) | 8003 (llama-server) |

## VRAM Budget

Titan has 128 GiB GPU memory. Current usage:
- Whisper large-v3-turbo: ~3 GiB
- Kokoro TTS: ~1 GiB
- Ollama (Llama 3.1 8B / Gemma3 4B): ~5-8 GiB
- Qwen3.5-9B via llama.cpp: ~5.4 GiB (4.9 GiB model + KV cache)

Both models can coexist — total ~18 GiB, well within 128 GiB budget.

## Benchmark Setup

**Eval harness:** `scripts/eval-llm/run_eval.py`
**Temperature:** 0.2 for both models (standardized; an earlier llama.cpp run used 0.7)
**Tasks:** entities, promises, emotions, summary, resolution, sensitivity, briefing, nudge, email_triage, email_draft, chat_qa, contradiction
**Inputs:** 8 transcripts + synthetic task inputs = 83 evals per model
**Scoring:** `score_tasks.py` — deterministic for classification, Claude-as-judge for generative
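The deterministic side of the scoring can be sketched as an exact label match with a tolerant parse; the function below is an illustration of the approach, not the actual `score_tasks.py` code, and the `{"label": ...}` shape is an assumed output format:

```python
# Illustrative sketch of deterministic classification scoring (e.g. for
# email_triage or sensitivity). Models sometimes wrap the label in JSON or
# extra prose, so we try a {"label": ...} parse and fall back to raw text.
import json

def score_classification(prediction: str, expected_label: str) -> float:
    """Return 1.0 on a label match, 0.0 otherwise."""
    text = prediction.strip()
    try:
        label = json.loads(text).get("label", "")
    except (json.JSONDecodeError, AttributeError):
        label = text  # not JSON (or not an object): compare the raw text
    return 1.0 if expected_label.lower() in label.lower() else 0.0
```

Generative tasks (summary, email_draft, etc.) have no single correct string, which is why they go to Claude-as-judge instead.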

## Entity Extraction — Latency per Transcript

8 transcripts, entity extraction task. Llama via Ollama (port 11434), Qwen via llama.cpp (port 8003).

| Transcript | Llama 3.1 8B | Qwen3.5-9B | Winner |
|-----------|-------------|------------|--------|
| cross_reference | 23,672 ms (422 tok) | 22,280 ms (726 tok) | **Qwen** |
| emotional_frustration | 9,826 ms (385 tok) | 23,822 ms (765 tok) | **Llama** (2.4×) |
| health_routine | 13,186 ms (519 tok) | 23,819 ms (762 tok) | **Llama** (1.8×) |
| kannada_friend_gossip | 22,587 ms (354 tok) | 25,794 ms (828 tok) | **Llama** |
| kannada_heavy_family | 26,926 ms (415 tok) | 25,337 ms (815 tok) | **Qwen** |
| kannada_work_call | 15,683 ms (364 tok) | 28,965 ms (948 tok) | **Llama** (1.8×) |
| weekend_family | 12,892 ms (417 tok) | 23,845 ms (774 tok) | **Llama** (1.8×) |
| work_standup | 11,582 ms (456 tok) | 23,609 ms (765 tok) | **Llama** (2.0×) |
| **Average** | **17,044 ms (416 tok)** | **24,684 ms (798 tok)** | **Llama** (1.4×) |

**Key observation:** Qwen generates **1.9× more output tokens** on average (798 vs 416). This suggests richer entity extractions — quality scoring needed to confirm whether the extra tokens add value or are verbose.

## Task Benchmark — Latency (83 evals per model)

| Task | n | Llama 3.1 8B | Qwen3.5-9B | Faster | Delta |
|------|---|-------------|------------|--------|-------|
| entities | 8 | **17,044 ms** (416 tok) | 24,684 ms (798 tok) | Llama | 1.4× |
| promises | 8 | **6,244 ms** (242 tok) | 9,864 ms (296 tok) | Llama | 1.6× |
| emotions | 8 | **13,394 ms** (527 tok) | 20,195 ms (639 tok) | Llama | 1.5× |
| summary | 8 | **6,522 ms** (254 tok) | 12,328 ms (374 tok) | Llama | 1.9× |
| resolution | 1 | **4,371 ms** (168 tok) | 19,651 ms (627 tok) | Llama | 4.5× |
| sensitivity | 8 | **7,304 ms** (288 tok) | 9,463 ms (290 tok) | Llama | 1.3× |
| briefing | 4 | **4,440 ms** (162 tok) | 8,847 ms (215 tok) | Llama | 2.0× |
| nudge | 6 | 7,191 ms (279 tok) | **2,832 ms** (52 tok) | **Qwen** | 2.5× |
| email_triage | 10 | 3,242 ms (124 tok) | **1,834 ms** (39 tok) | **Qwen** | 1.8× |
| email_draft | 4 | **3,091 ms** (118 tok) | 3,573 ms (95 tok) | Llama | 1.2× |
| chat_qa | 8 | **2,320 ms** (83 tok) | 7,455 ms (204 tok) | Llama | 3.2× |
| contradiction | 10 | 5,439 ms (213 tok) | **3,994 ms** (114 tok) | **Qwen** | 1.4× |
| **Overall** | **83** | **7,073 ms avg** | **9,838 ms avg** | **Llama** | **1.4×** |

**Llama wins 9/12 tasks** on latency. Qwen wins on nudge, email_triage, and contradiction — tasks where Qwen produces more concise, targeted responses.

## Task Benchmark — Quality Grades

> Pending: Run `score_tasks.py` with ANTHROPIC_API_KEY for Claude-as-judge quality scoring.
> Quality grades will determine whether Qwen's richer outputs translate to better accuracy.

## Latency & Throughput Comparison

| Metric | Llama 3.1 8B (Ollama) | Qwen3.5-9B (llama.cpp) |
|--------|----------------------|----------------------|
| Avg latency (entity extraction) | 17,044 ms | 24,684 ms |
| Avg latency (summary) | 6,522 ms | 12,328 ms |
| Avg latency (overall) | 7,073 ms | 9,838 ms |
| Avg Kannada entity latency | 21,732 ms | 26,698 ms |
| Generation throughput | **35.5 tok/s** | 30.0 tok/s |
| Avg output tokens (entities) | 416 tok | **798 tok** (1.9×) |
| Avg output tokens (overall) | baseline | **~1.2×** Llama's count |
| Total evals | 83 | 83 |
| Errors | 0 | 0 |

**Llama is 18% faster** in generation throughput (35.5 vs 30.0 tok/s). But Qwen also generates more tokens per response, so Qwen ends up 1.4× slower in wall-clock terms despite the similar model sizes.
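As a sanity check, the 1.4× overall gap is roughly what token counts and throughput predict on their own (decode time scales as tokens / tok/s; the ~1.2× overall token ratio is from the table above):

```python
# Does the measured 1.4x wall-clock gap follow from tokens and throughput?
# Expected slowdown = (Qwen token ratio) x (Llama throughput advantage).
token_ratio = 1.2                 # Qwen emits ~1.2x more output tokens overall
throughput_ratio = 35.5 / 30.0    # Llama's tok/s advantage over Qwen
expected_slowdown = token_ratio * throughput_ratio
print(f"{expected_slowdown:.2f}")  # 1.42, matching the measured ~1.4x
```

So the wall-clock penalty is driven mostly by Qwen's longer outputs, with the throughput gap contributing the rest.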

## Tool-Calling Assessment

### Known Llama 3.1 8B Issues
- **Bogus searches:** Makes unprompted searches ("knock knock jokes", "current date December 2023")
- **Text leaks:** Tool calls leak as streamed text content (Ollama streaming bug)
- **Trigger-happy:** Calls tools when a simple response would suffice
- **Mitigation:** ToolCallTextFilter in bot.py, but adds latency and has false positives

### Qwen3.5-9B Observations

- Tool call structure: **Verified** (structured JSON, `finish_reason: "tool_calls"`)
- Thinking-off: **Working** (`chat_template_kwargs: {"enable_thinking": false}`)
- Bogus search rate: **TBD** (pending live A/B testing)
- Text leak rate: **N/A** (non-streaming for tools, no text leaks possible)
- No ToolCallTextFilter needed — structured tool_calls only, no Ollama streaming bug
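A minimal request payload reflecting these observations might look as follows. The `search` tool schema and model name are made-up placeholders; `chat_template_kwargs: {"enable_thinking": false}` and non-streaming tool calls are the behaviors verified above:

```python
# Illustrative payload for llama-server's OpenAI-compatible
# /v1/chat/completions endpoint (port 8003 on Titan).
import json

payload = {
    "model": "qwen3.5-9b",  # placeholder model name
    "messages": [{"role": "user", "content": "What's on my calendar today?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "search",  # hypothetical tool for illustration
            "description": "Search the user's notes",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
    "chat_template_kwargs": {"enable_thinking": False},
    "stream": False,  # non-streaming for tool calls: nothing can leak as text
}
body = json.dumps(payload)  # POST to http://localhost:8003/v1/chat/completions
```

On success, tool invocations come back as structured `tool_calls` with `finish_reason: "tool_calls"`, never interleaved with streamed text.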

## Architecture Notes

### Why llama.cpp, Not Ollama
Ollama does NOT support Qwen3.5's DeltaNet architecture (confirmed via ollama/ollama issues and web research). The `llama.cpp` project added DeltaNet support. We use `llama-server` with the OpenAI-compatible API.

### Implementation
- `services/annie-voice/llamacpp_llm.py` — `LlamaCppToolsService` (Pipecat LLM service)
- `services/annie-voice/_streaming_adapters.py` — Shared non-streaming → streaming bridge
- Explicit routing in `bot.py` (elif before Ollama fallback)
- No ToolCallTextFilter needed (structured tool_calls only)
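The non-streaming → streaming bridge can be sketched as a generator that re-chunks a completed response; the name and chunking policy here are illustrative, not the actual `_streaming_adapters.py` code:

```python
# Illustrative sketch of a non-streaming -> streaming bridge: take one
# complete chat completion and re-emit it as incremental text frames so
# downstream processors see a stream.
from typing import Iterator

def bridge_to_stream(full_text: str, chunk_size: int = 24) -> Iterator[str]:
    """Yield fixed-size slices of a completed response in order.

    Tool calls never pass through here: they arrive as structured
    tool_calls on the non-streaming response, so no text filter is needed.
    """
    for start in range(0, len(full_text), chunk_size):
        yield full_text[start:start + chunk_size]
```

The design choice is to trade time-to-first-token (the whole response must finish before the first chunk) for a guarantee that tool-call JSON can never appear in the text channel.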

### Live A/B Toggle
- **Frontend:** 5 buttons (Auto, Claude, Gemma3 4B, Llama 3.1 8B, Qwen3.5 9B)
- **Protocol:** `llm_backend` passed via WebRTC session `requestData`
- **URL:** `http://192.168.68.52:7860/`
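The explicit routing on `llm_backend` might be sketched as below; the endpoint labels, backend keys, and fallback behavior are illustrative assumptions, not the actual `bot.py` logic:

```python
# Illustrative server-side routing for the llm_backend field received in
# the WebRTC session requestData. Backend keys mirror the five UI buttons.
def route_backend(request_data: dict) -> str:
    """Map the UI's llm_backend choice to an inference endpoint label."""
    backend = request_data.get("llm_backend", "auto")
    if backend == "qwen35-9b":
        return "llamacpp:8003"   # explicit branch before the Ollama fallback
    if backend == "claude":
        return "anthropic-api"
    if backend in ("llama31-8b", "gemma3-4b"):
        return "ollama:11434"
    return "ollama:11434"        # auto / unknown: assumed existing default
```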

## Recommendation

### Summary

| Dimension | Llama 3.1 8B | Qwen3.5-9B | Verdict |
|-----------|-------------|------------|---------|
| Latency (wall-clock) | **7,073 ms avg** | 9,838 ms avg | Llama 1.4× faster |
| tok/s throughput | **35.5 tok/s** | 30.0 tok/s | Llama 18% faster |
| Output richness | 416 tok avg | **798 tok avg** | Qwen 1.9× more |
| Published benchmarks | GPQA 51.0 | **GPQA 81.7** | Qwen dramatically better |
| Instruction following | — | **IFEval 91.5** | Qwen (key for tool calling) |
| Tool call reliability | Bogus searches, text leaks | Structured only | **Qwen** |
| Text leak risk | High (needs filter) | **None** (non-streaming) | Qwen |
| Context window | 128K | **262K** | Qwen 2× |
| License | Llama Community | **Apache 2.0** | Qwen |
| VRAM | ~4.5 GiB | ~5.4 GiB | Comparable |

### Decision: **Keep both, A/B test live**

1. **Llama 3.1 8B** is faster on raw latency for most tasks, but suffers from unreliable tool calling (bogus searches, text leaks).
2. **Qwen3.5-9B** is 1.4× slower wall-clock but produces richer outputs, has dramatically better published benchmarks (GPQA +30 pts, IFEval 91.5), and eliminates the ToolCallTextFilter hack entirely.
3. **Quality scoring pending** — Claude-as-judge grades will determine whether Qwen's richer outputs are *better* or just *longer*.
4. **Live A/B testing** via the 5-button UI will reveal real-world tool-calling behavior (bogus search rate, user satisfaction).
5. **No VRAM conflict** — both fit easily in Titan's 128 GiB budget (~18 GiB total).

**Hypothesis:** Qwen3.5-9B will win on quality + reliability despite slower latency, making it the better default for Annie Voice tool calling. The 1.4× latency penalty is acceptable given the elimination of text leaks and bogus searches.

---

## Commands to Run Benchmarks

```bash
# On Titan (or SSH from dev machine):
cd scripts/eval-llm

# Llama 3.1 8B — all tasks
.venv/bin/python run_eval.py --models llama31-8b \
  --tasks entities promises emotions summary resolution \
         sensitivity briefing nudge email_triage email_draft chat_qa contradiction \
  --run-id llama31-8b-full

# Qwen3.5-9B — all tasks
LLAMACPP_BASE_URL=http://localhost:8003 .venv/bin/python run_eval.py --models qwen35-9b \
  --tasks entities promises emotions summary resolution \
         sensitivity briefing nudge email_triage email_draft chat_qa contradiction \
  --run-id qwen35-9b-full

# Score
.venv/bin/python score_tasks.py results/eval-llama31-8b-full.json
.venv/bin/python score_tasks.py results/eval-qwen35-9b-full.json
```
