# Local Research Agent: Qwen3.5 as NotebookLM Alternative

**Status:** Evaluation Phase
**Date:** 2026-03-04
**Context:** YouTube video `usTeU4Uh0iM` demonstrates a Claude Code + NotebookLM workflow for research. We want to evaluate whether we can replicate this entirely locally on Titan using Qwen3.5 models.

## What the Video Proposes

A pipeline where Claude Code orchestrates research:
1. **Search YouTube** → yt-dlp scrapes video metadata
2. **Ingest into NotebookLM** → unofficial Python API sends URLs, NotebookLM pulls captions and builds RAG index
3. **Analyze** → NotebookLM answers questions grounded in the ingested content
4. **Generate deliverables** → infographics, podcasts, mind maps, flashcards, slide decks, quizzes

Key value: NotebookLM is free (Google pays for inference), provides pre-built RAG, returns grounded answers with source references, and supports multiple output formats.

## Our Local Alternative Vision

Replace NotebookLM with Qwen3.5 running on Titan. Keep Claude Code as orchestrator (or replace with local tooling). All data stays on-device — full privacy.

## Available Qwen3.5 Models on Titan

| Model | Params | Active | Quant | File |
|-------|--------|--------|-------|------|
| Qwen3.5-9B | 9B | 9B (dense) | Q4_K_M | `Qwen3.5-9B-Q4_K_M.gguf` |
| Qwen3.5-35B-A3B | 35B | 3B (MoE) | Q4_K_M | `Qwen3.5-35B-A3B-Q4_K_M.gguf` |
| Qwen3.5-35B-A3B-UD | 35B | 3B (MoE) | Q4_K_XL | `Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf` |

---

## Evaluation Dimensions

### 1. Context Window & Long-Document Handling

**Question:** Can Qwen3.5 handle the volume of text NotebookLM ingests?

- NotebookLM supports up to 50 sources per notebook. A single YouTube transcript can be 5K-50K tokens.
- 50 transcripts × ~10K tokens avg = **~500K tokens** of source material
- What are the effective context windows for each Qwen3.5 variant?
- Do they support RoPE scaling or YaRN for extended context?
- At what context length does quality degrade (the "lost in the middle" problem)?
- **Alternative:** Do we need a chunking + retrieval pipeline instead of stuffing everything into context?

**What to test:**
- [ ] Max context length for each model on llama-server (current: 32K for 9B)
- [ ] Quality of retrieval at 8K / 16K / 32K / 64K / 128K context
- [ ] "Needle in a haystack" test with transcript content
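The needle test above can start from a simple prompt builder: plant a known fact at a controlled depth in transcript-like filler, then sweep depth and context size against llama-server. A minimal sketch (the question wording and depth sweep are assumptions, not an established harness):

```python
def build_haystack(filler_lines, needle, depth_pct):
    """Insert the needle sentence at depth_pct% into filler transcript lines.

    Run the resulting prompt at several depths (0, 25, 50, 75, 100) and
    several context sizes to map where retrieval quality drops off.
    """
    pos = round(len(filler_lines) * depth_pct / 100)
    lines = filler_lines[:pos] + [needle] + filler_lines[pos:]
    question = "What is the secret fact mentioned in the transcript?"
    return "\n".join(lines) + f"\n\n{question}"
```

Scoring is then just string matching on the model's answer, which keeps the harness model-agnostic.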

### 2. RAG Pipeline Requirements

**Question:** What infrastructure do we need to build a local RAG system?

NotebookLM gives us RAG for free. Locally, we'd need:
- **Chunking strategy** — How to split transcripts (by timestamp? by topic? fixed-size with overlap?)
- **Embedding model** — Which model for vector embeddings? (e.g., `nomic-embed-text`, `bge-m3`, `mxbai-embed-large`)
- **Vector store** — PostgreSQL pgvector (we already have PG), ChromaDB, Qdrant, or FAISS?
- **Retrieval** — Top-K similarity + BM25 hybrid? (Context Engine already does BM25)
- **Reranking** — Do we need a cross-encoder reranker for quality?
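As a starting point for the chunking question, a fixed-size chunker with overlap (sizes here are in characters and are assumptions to tune; a token-based version would substitute the tokenizer's counts):

```python
def chunk_text(text, size=800, overlap=200):
    """Split text into fixed-size chunks with overlap between neighbours.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Timestamp- or topic-based splitting would replace the fixed stride but keep the same interface.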

**What to test:**
- [ ] Embedding model options that run on Titan (GPU vs CPU, latency)
- [ ] pgvector vs dedicated vector DB (we already have PostgreSQL)
- [ ] Retrieval quality: vector-only vs hybrid (vector + BM25)
- [ ] End-to-end latency: ingest 20 transcripts → answer a question
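For the vector-only retrieval baseline, top-K is just cosine similarity over stored chunk embeddings. A pure-Python sketch for the eval harness (in production, pgvector's distance operators or numpy would do this; the `{'text', 'vec'}` chunk shape is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunks, k=3):
    """chunks: list of {'text': ..., 'vec': ...} dicts; returns the best k."""
    return sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]),
                  reverse=True)[:k]
```

The hybrid variant would merge this ranking with Context Engine's BM25 scores (e.g. via reciprocal rank fusion) before the top-K cut.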

### 3. Summarization & Analysis Quality

**Question:** Can Qwen3.5 synthesize across multiple sources as well as NotebookLM (Gemini)?

The core task: given 20 video transcripts, answer "What are the top 5 Claude Code skills?" with grounded analysis and source citations.

- Qwen3.5-9B vs 35B-A3B quality comparison for multi-document summarization
- Can the model attribute claims to specific sources (citation/grounding)?
- How does it handle contradictory information across sources?
- Comparison baseline: what does Claude (via API) produce for the same task?
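One way to make citation testable is to tag each source and instruct the model to cite tags, so attribution can be checked mechanically afterwards. A sketch of the prompt assembly (tag format and instruction wording are assumptions):

```python
def build_grounded_prompt(question, sources):
    """sources: list of (title, transcript_text) tuples.

    Tags each source [S1], [S2], ... so cited claims can be checked
    against the tagged source text after generation.
    """
    blocks = [f"[S{i}] {title}\n{text}"
              for i, (title, text) in enumerate(sources, 1)]
    return ("Answer using ONLY the sources below. After each claim, "
            "cite the supporting source tag, e.g. [S2].\n\n"
            + "\n\n".join(blocks)
            + f"\n\nQuestion: {question}")
```

Attribution accuracy then reduces to: for each cited tag, does the tagged transcript actually contain the claim?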

**What to test:**
- [ ] Side-by-side: same 5 transcripts, same question → compare Qwen3.5-9B, 35B-A3B, Claude
- [ ] Source attribution accuracy (does it correctly cite which video said what?)
- [ ] Coherence when synthesizing across 10+ sources
- [ ] Hallucination rate (claims not in any source)

### 4. Content Ingestion Pipeline

**Question:** How do we get content into the system?

We already have most pieces:
- **YouTube search:** yt-dlp (her-player already uses this)
- **YouTube transcripts:** her-player downloads `subtitles.vtt` automatically
- **Audio transcription:** WhisperX in audio-pipeline (for videos without captions)
- **Web pages:** Could extend with trafilatura or similar for non-YouTube sources
- **PDFs/docs:** PyMuPDF, pdfplumber, etc.

**What to test:**
- [ ] Can we reuse her-player's download pipeline as a library?
- [ ] VTT → clean text conversion quality (strip timestamps, merge fragments)
- [ ] Ingestion speed for 20 videos (download + extract + embed)
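A first pass at the VTT → clean text step could look like this (the regexes cover the common YouTube auto-caption patterns — headers, timestamps, inline `<c>` tags, consecutive duplicate lines — but real files will surface more cases):

```python
import re

_TS = re.compile(r"^\d{2}:\d{2}:\d{2}[.,]\d{3}\s+-->\s+")
_TAG = re.compile(r"<[^>]+>")  # inline cues like <c> and <00:00:01.000>

def vtt_to_text(vtt: str) -> str:
    """Strip WEBVTT headers, timestamps, cue numbers, inline tags, and
    the consecutive-duplicate lines YouTube auto-captions produce."""
    out, prev = [], None
    for raw in vtt.splitlines():
        line = raw.strip()
        if (not line
                or line.startswith(("WEBVTT", "Kind:", "Language:", "NOTE"))
                or _TS.match(line) or line.isdigit()):
            continue
        line = _TAG.sub("", line).strip()
        if line and line != prev:
            out.append(line)
            prev = line
    return " ".join(out)
```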

### 5. Deliverable Generation

**Question:** Which NotebookLM deliverables can we replicate locally?

| Deliverable | NotebookLM | Local Feasibility | Notes |
|-------------|-----------|-------------------|-------|
| Text summary | ✅ | ✅ Easy | Qwen3.5 can do this |
| Mind map | ✅ | ⚠️ Medium | Generate Mermaid/Markmap markdown → render |
| Flashcards | ✅ | ✅ Easy | Structured JSON output from LLM |
| Quiz | ✅ | ✅ Easy | Structured JSON output from LLM |
| Slide deck | ✅ | ⚠️ Medium | Generate markdown → Marp/reveal.js |
| Audio podcast | ✅ | ⚠️ Medium | We have Kokoro TTS on Titan |
| Infographic | ✅ | ❌ Hard | Requires image generation model |
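For the "easy" structured-output rows, a strict parse-and-validate step catches malformed model output early and gives a clean retry trigger. A sketch for flashcards (the two-key schema is an assumption; whatever card shape we settle on goes here):

```python
import json

FLASHCARD_INSTRUCTION = (
    'Return ONLY a JSON array of objects, each with string keys '
    '"front" and "back". No prose, no markdown fences.'
)

def parse_flashcards(raw: str):
    """Reject anything that isn't a clean list of {'front','back'} cards,
    so retries fire on malformed output instead of it silently passing."""
    cards = json.loads(raw)
    if not isinstance(cards, list):
        raise ValueError("expected a JSON array")
    for card in cards:
        if not (isinstance(card, dict) and set(card) == {"front", "back"}
                and all(isinstance(v, str) for v in card.values())):
            raise ValueError(f"malformed card: {card!r}")
    return cards
```

The quiz deliverable is the same pattern with a different schema.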

**What to test:**
- [ ] Qwen3.5 structured output reliability (JSON mode for flashcards, quizzes)
- [ ] Mermaid diagram generation quality
- [ ] Kokoro TTS for podcast-style audio (dialogue generation + synthesis)
- [ ] What image generation options exist for Titan? (SD, FLUX, etc.)

### 6. Performance & Resource Budget

**Question:** Can Titan handle this alongside existing services?

Current Titan GPU load: WhisperX + pyannote + Kokoro TTS + Ollama/llama-server + Annie Voice.

- How much VRAM does each Qwen3.5 model need for inference?
- Can RAG embedding run on CPU while LLM uses GPU?
- What's the total latency budget? (NotebookLM takes ~2 min for 20 sources)
- Can we batch-process during idle time?

**What to test:**
- [ ] VRAM usage: 9B vs 35B-A3B on llama-server
- [ ] Concurrent load: RAG query while Annie Voice is active
- [ ] Embedding throughput on CPU vs GPU
- [ ] Total pipeline time: 20 YouTube videos → analysis complete

### 7. Privacy & Data Sovereignty

**Question:** What's the privacy advantage of local vs NotebookLM?

- NotebookLM sends all content to Google servers
- Local pipeline: all data stays on Titan, never leaves the network
- Critical for her-os (personal conversations, life logging)
- NotebookLM's unofficial API could break at any time (no SLA)

**Evaluation:** This is a clear win for local. No testing needed — it's a design principle.

### 8. Cost Comparison

**Question:** Is NotebookLM actually free in practice, and what does the local option cost?

| Factor | NotebookLM | Local (Titan) |
|--------|-----------|---------------|
| API cost | Free (Google pays) | Free (own hardware) |
| Claude Code tokens | Minimal (orchestration only) | Same, or zero with a local orchestrator |
| Reliability | Unofficial API, can break | Self-maintained |
| Rate limits | Unknown, likely throttled | None |
| Privacy | Data goes to Google | Fully local |
| Setup effort | 5 min (pip + login) | Significant (RAG pipeline) |
| Maintenance | Zero (but API may break) | Ongoing |

### 9. Orchestration Layer

**Question:** What orchestrates the pipeline if not Claude Code?

Options:
- **Claude Code** — keep as orchestrator (but costs API tokens per research session)
- **Local CLI tool** — Python script that chains the steps (zero token cost)
- **Annie Voice** — voice-triggered research ("Annie, research the latest on X")
- **Pipecat agent** — extend Annie with research tool calls
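The local-CLI option is essentially a linear chain. A skeleton with injected stages (the stage names are illustrative placeholders, not existing her-os APIs) keeps the same control flow whether the stages are stubs, local tools, or remote calls:

```python
def run_research(topic, search, ingest, analyze):
    """Chain the three pipeline stages; each stage is injected so the
    same skeleton works with stubs, local tools, or remote calls."""
    videos = search(topic)        # e.g. a yt-dlp search wrapper
    index = ingest(videos)        # e.g. VTT download + chunk + embed
    return analyze(index, topic)  # e.g. grounded Qwen3.5 query
```

The same function could back an Annie Voice tool call by wrapping it in the existing tool framework.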

**What to test:**
- [ ] Can Qwen3.5 handle multi-step tool calling reliably? (we know 9B is decent from benchmarks)
- [ ] Latency of local orchestration vs Claude Code orchestration
- [ ] Integration with Annie Voice's existing tool framework

### 10. Quality Baseline

**Question:** How do we measure "good enough"?

Before building anything, we need a ground truth:
1. Pick 5 YouTube videos on a topic we know well
2. Run through NotebookLM (if we set it up) → get analysis
3. Run through Claude API (stuff transcripts in context) → get analysis
4. Run through Qwen3.5-9B (stuff transcripts in context) → get analysis
5. Run through Qwen3.5-35B-A3B → get analysis
6. Blind-compare the four outputs

**What to test:**
- [ ] Design the eval rubric (accuracy, coverage, citation quality, coherence, usefulness)
- [ ] Run the 5-video baseline comparison
- [ ] Determine minimum acceptable quality threshold

---

## Decision Matrix

Once evaluations complete, score each dimension:

| Dimension | Weight | Score (1-5) | Notes |
|-----------|--------|-------------|-------|
| Context window | High | ? | |
| RAG quality | High | ? | |
| Summarization | High | ? | |
| Ingestion pipeline | Medium | ? | |
| Deliverables | Medium | ? | |
| Performance | High | ? | |
| Privacy | High | 5 (auto-win) | |
| Cost | Low | ? | |
| Orchestration | Medium | ? | |
| Quality baseline | High | ? | |
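Once scores land, the matrix collapses to a weighted average. A sketch (the High=3 / Medium=2 / Low=1 mapping is an assumption; adjust to taste):

```python
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

def decision_score(rows):
    """rows: list of (dimension, weight_label, score_1_to_5) tuples.
    Returns a normalised 0-1 score for the local option."""
    total = sum(WEIGHTS[w] * s for _, w, s in rows)
    best = sum(WEIGHTS[w] * 5 for _, w, _ in rows)
    return total / best
```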

## Recommended Evaluation Order

1. **Context window test** (quick, tells us if naive approach works)
2. **Summarization quality** (quick, tells us if models are good enough)
3. **RAG pipeline prototype** (if context window insufficient)
4. **Deliverable generation** (parallel with RAG work)
5. **End-to-end benchmark** (final validation)

## Open Questions

- Should we evaluate other local models too? (Llama 4, Gemma 3, Mistral?)
- Is there a local NotebookLM-like tool already? (AnythingLLM, LibreChat, PrivateGPT?)
- Can we leverage Context Engine's existing BM25 + PostgreSQL for the RAG layer?
- Should the research agent be a standalone tool or integrated into Annie Voice?
