# Impact Analysis: Dropping Nemotron Super & Freeing Beast

> **Date:** 2026-04-06 (UPDATED with verified official benchmarks)
> **Scenario:** Decommission Beast (DGX Spark #2), stop running Nemotron 3 Super 120B. All workloads move to Gemma 4 26B NVFP4 on Titan.
> **Source data:** Codebase analysis (18 files with Beast dependencies), published benchmarks, `docs/COMPARISON-GEMMA4-VS-SUPER.md`
>
> **CRITICAL CORRECTION (2026-04-06):** The "97% tool calling accuracy" for Super is NOT from any official benchmark. It was from NVIDIA NemoClaw marketing material, not the model card or technical report. The official τ2-bench shows **Gemma 4 BEATS Super** on agentic tool use (Retail: 85.5% vs 62.83%, Average: 68.2% vs 61.15%). This reverses the browser task impact assessment from HIGH to LIKELY IMPROVEMENT. See Section 3.

---

## 1. What Runs on Beast Today

Beast (192.168.68.58) runs exactly 2 processes:

| Process | VRAM | Purpose |
|---------|------|---------|
| Nemotron 3 Super 120B-A12B NVFP4 (vLLM :8003) | ~97 GB | Background agents, text chat, extraction |
| whisper.cpp server | ~0.5 GB | STT fallback (research/experimental) |

### Workloads Currently Routed to Super (13 distinct consumers)

| # | Workload | File | What It Does |
|---|----------|------|-------------|
| 1 | **Entity extraction** | `context-engine/config.py:59` | Extract persons, places, topics from conversations |
| 2 | **Daily reflections** | `context-engine/config.py:61` | End-of-day synthesis summaries |
| 3 | **Knowledge graph (Graphiti)** | `context-engine/config.py:82-83` | Entity relationship extraction for Neo4j |
| 4 | **Contradiction detection** | `context-engine/config.py:89` | Graph integrity checks |
| 5 | **Promise fulfillment** | `context-engine/config.py:136` | Track and verify commitments |
| 6 | **Nudge composition** | `context-engine/config.py:151` | Proactive notification generation |
| 7 | **Daily Wonder** | `context-engine/config.py:160` | Curiosity-driven daily insights |
| 8 | **Daily Comic** | `context-engine/config.py:168` | Humor extraction + comic generation |
| 9 | **Text chat (Telegram)** | `annie-voice/text_llm.py:1645` | Full reasoning with thinking=True, 16K tokens |
| 10 | **Browser task completion** | `annie-voice/browser_tasks.py:165` | 4-round tool loop (orders, approvals) |
| 11 | **Proactive Pulse stage 2** | `annie-voice/proactive_pulse.py:88` | Ambient-aware message composition |
| 12 | **Research stage 3** | `annie-voice/research_orchestrator.py:106` | Web page content extraction |
| 13 | **Meditation (weekly/monthly)** | `annie-voice/meditation_orchestrator.py:60` | Deep self-reflection generation |

---

## 2. Per-Workload Impact Assessment

### Rating Scale
- **NONE** — Gemma 4 matches or exceeds Super on this task
- **LOW** — Minor quality dip, unlikely to be noticed
- **MEDIUM** — Noticeable degradation, workarounds available
- **HIGH** — Significant quality loss, key capability gap
- **CRITICAL** — Feature broken or unusable

### 2.1 Context Engine Workloads (items 1-8)

| Workload | Impact | Reasoning |
|----------|--------|-----------|
| **Entity extraction** | **NONE** | Gemma 4 scored **7/7 persons + 3/3 places** vs Super's 7/7 + 2/3. Gemma 4 is actually *better* on entity extraction. |
| **Daily reflections** | **LOW** | Creative synthesis task. Gemma 4's MMLU-Pro (82.6%) is only 1.13 pts behind Super (83.73%). No thinking mode loss here — CE extraction doesn't use thinking. Max_tokens 4096 is well within Gemma 4's 32K context. |
| **Knowledge graph** | **LOW** | Relationship extraction between entities. Gemma 4's entity extraction is strong. Graphiti sync is currently **disabled** (`GRAPHITI_SYNC_ENABLED=0`) to avoid GPU contention, so this is dormant. |
| **Contradiction detection** | **LOW** | Binary classification (contradicts/doesn't). Doesn't require deep reasoning. Gemma 4's GPQA Diamond (82.3% > Super's 79.2%) suggests it handles factual analysis well. |
| **Promise fulfillment** | **LOW** | Pattern matching: "did X happen after Y was promised?" Structured extraction task, well within Gemma 4's capability. |
| **Nudge composition** | **LOW** | Short message generation (max 280 chars). Gemma 4 is excellent at concise text generation. No thinking needed. |
| **Daily Wonder** | **NONE** | Creative curiosity prompt (max 500 chars). Gemma 4's 88.3% AIME shows strong reasoning for "interesting fact" generation. |
| **Daily Comic** | **LOW** | Humor extraction from conversations. Subjective task — quality difference may be imperceptible. 3 panels, max 45s timeout. |

**Context Engine verdict: LOW overall impact.** Most CE tasks are structured extraction with short outputs (< 4K tokens). Gemma 4 matches or exceeds Super on entity extraction. No CE task uses thinking mode.

### 2.2 Annie Voice Workloads (items 9-13)

| Workload | Impact | Reasoning |
|----------|--------|-----------|
| **Text chat (Telegram)** | **MEDIUM** | Currently gets `enable_thinking=True` + 16K max_tokens on Super. Gemma 4 has thinking disabled in production (`enable_thinking=False` everywhere). Gemma 4 would be **3.15x faster** but lose deep reasoning chains. For most Telegram conversations, speed + vision may matter more than thinking depth. |
| **Browser task completion** | **LOW** *(revised)* | 4-round multi-step tool loop. ~~Originally rated HIGH based on "97% vs 87.5%" tool accuracy.~~ **CORRECTED:** Super's "97%" was unverifiable marketing, not an official benchmark. The official τ2-bench Retail shows **Gemma 4 85.5% vs Super 62.83%** (+22.7 pts). Browser tasks may actually **improve** on Gemma 4. The 3.15x throughput also means faster recovery from any tool-call failure. |
| **Proactive Pulse stage 2** | **LOW** | Message composition after triage. Similar to nudge — short output, no deep reasoning needed. Gemma 4's faster throughput means faster delivery. |
| **Research stage 3** | **NONE** | Web page content extraction. Gemma 4's entity extraction is *better* than Super's in our tests. Vision capability could actually *improve* research by processing screenshots and images. |
| **Meditation (weekly/monthly)** | **MEDIUM** | Deep self-reflection with multi-section structured output. Super's thinking mode produces higher-quality introspective content. Gemma 4 could produce adequate but less nuanced reflections. However: these run while voice is idle, so the 3.15x speed advantage means they complete faster. |

**Annie Voice verdict: MEDIUM overall, with no remaining HIGH items.** The two MEDIUM items (text chat, meditation) both trace back to losing thinking mode; browser tasks drop to LOW after the benchmark correction.

---

## 3. The Three Critical Capability Gaps

### Gap 1: Thinking Mode (MEDIUM impact)

**What's lost:** Super uses `enable_thinking=True` for text chat and browser tasks, generating internal reasoning chains before responding.

**Current state for Gemma 4:** Thinking is disabled everywhere via `enable_thinking=False` + `think_filter.py` defense-in-depth. The model *supports* thinking (Google published "thinking" benchmarks), but:
- No `reasoning_parser` configured for Gemma 4 in vLLM (Super uses `super_v3`)
- Current production config explicitly forbids it
- Think tags leak into output without a parser

**Mitigation:** Gemma 4 achieves 82.3% GPQA Diamond *without* thinking — higher than Super's 79.23%. On reasoning-heavy tasks (AIME 88.3%), it's also strong. The thinking gap may matter less than the benchmark suggests. Could investigate enabling Gemma 4 thinking for background-only tasks if quality impact is noticeable.
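To make the production constraint concrete, here is a sketch of how the flag could be passed to an OpenAI-compatible vLLM endpoint via `chat_template_kwargs`. The payload shape is an assumption based on vLLM's chat-template mechanism, not the repo's actual client code:

```python
# Sketch of keeping thinking disabled on an OpenAI-compatible vLLM endpoint.
# The chat_template_kwargs mechanism is a vLLM convention; whether Gemma 4's
# chat template honors "enable_thinking" is an assumption to verify.
payload = {
    "model": "gemma-4-26b",
    "messages": [{"role": "user", "content": "Summarize today's conversations."}],
    "max_tokens": 4096,
    "chat_template_kwargs": {"enable_thinking": False},  # keep thinking off in prod
}
```

If thinking were later enabled for background-only tasks, a `reasoning_parser` would also need to be configured server-side so think tags don't leak into output.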

### ~~Gap 2: Multi-Step Tool Chains~~ → CORRECTED: Not a Gap

> **CORRECTION:** The original analysis used "Super 97% vs Gemma 4 87.5%" to model compounding failures. Both numbers are problematic:
> - **Super's "97%"** is from NVIDIA NemoClaw marketing material. It does NOT appear in the [HuggingFace model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16), the [NVIDIA blog post](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/), or any standard benchmark.
> - **Gemma 4's "87.5%"** was from our internal 8-call test (7/8) — too small to be statistically meaningful.

**Official τ2-bench comparison (same benchmark, apples-to-apples):**

| Domain | Gemma 4 26B | Super 120B | Delta | Winner |
|--------|-------------|-----------|-------|--------|
| **Retail** | **85.5%** | 62.83% | **+22.67 pts** | **Gemma 4** |
| Airline | N/P | 56.25% | — | — |
| Telecom | N/P | 64.36% | — | — |
| **Average (all domains)** | **68.2%** | 61.15% | **+7.05 pts** | **Gemma 4** |

Source: [Gemma 4 model card](https://huggingface.co/google/gemma-4-26B-A4B-it), [Super model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16). N/P = not published.

**The truth: Gemma 4 is BETTER at agentic tool use than Super on the only standardized benchmark that covers both models.** The Retail domain (+22.67 pts) is particularly relevant since our browser tasks are retail-like (ordering coffee, food, groceries).

**Revised compounding analysis (τ2-bench Retail as proxy):**

| Rounds | Gemma 4 (85.5%) | Super (62.8%) | Gap |
|--------|-----------------|---------------|-----|
| 1 | 85.5% | 62.8% | +22.7 pts |
| 2 | 73.1% | 39.4% | +33.7 pts |
| 3 | 62.5% | 24.8% | +37.7 pts |
| 4 | 53.4% | 15.6% | **+37.8 pts** |

**The compounding FAVORS Gemma 4, not Super.** Over 4 rounds, Gemma 4's higher base accuracy opens up a 38-point advantage.

*Caveat: τ2-bench Retail scores don't directly measure single-call tool accuracy — they measure end-to-end task success across multi-turn conversations with tool use. The compounding math above is illustrative, not literal.*
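The compounding rows above are simple exponentiation of the per-round rate, per the caveat. A minimal sketch of that arithmetic:

```python
# Illustrative compounding of per-round success rates, treating the
# tau2-bench Retail score as an independent per-round probability.
def compound(p: float, rounds: int) -> float:
    """Probability that all `rounds` independent steps succeed."""
    return p ** rounds

for n in range(1, 5):
    gemma = compound(0.855, n) * 100
    super_ = compound(0.628, n) * 100
    print(f"round {n}: Gemma 4 {gemma:.1f}%  Super {super_:.1f}%  gap {gemma - super_:+.1f} pts")
```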

### Gap 3: Context Window (LOW impact in practice)

**What's lost:** Super's 131K configured context vs Gemma 4's 32K configured.

**But:** Gemma 4 *supports* 128K — it's currently configured at 32K to save KV cache VRAM. With Beast gone, Titan has ~112 GB free. Increasing `max_model_len` to 65K or even 128K is feasible:

| max_model_len | Estimated KV cache | Total Titan VRAM | Feasible? |
|---------------|-------------------|-----------------|-----------|
| 32K (current) | ~2 GB | ~18 GB | ✅ Current config |
| 65K | ~4 GB | ~20 GB | ✅ Easy |
| 128K | ~8 GB | ~24 GB | ✅ Still leaves 104 GB free |

**Verdict:** Context window gap is **solvable** by bumping `max_model_len`. The only real gap is Super's model-level 1M context support (Gemma 4's model caps at 128K), but nothing currently uses > 128K.
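The KV cache column above can be sanity-checked with a back-of-envelope estimator. The layer/head numbers below are illustrative placeholders, not Gemma 4's actual architecture; substitute the real values from the model's `config.json` before trusting the output:

```python
# Back-of-envelope KV cache sizing. layers/kv_heads/head_dim are ILLUSTRATIVE
# placeholders, not Gemma 4's real architecture; dtype_bytes=1 assumes FP8 KV cache.
def kv_cache_gb(seq_len: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 1) -> float:
    # 2x for keys and values, one entry per layer per KV head per token
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 2**30

for n in (32_768, 65_536, 131_072):
    print(f"{n // 1024}K context -> ~{kv_cache_gb(n):.1f} GB KV cache")
```

KV cache grows linearly with `max_model_len`, which is why doubling context from 32K to 65K only adds a couple of GB.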

---

## 4. What Gemma 4 Gains That Super Never Had

| Capability | Impact |
|-----------|--------|
| **Native vision** | Photo analysis, document reading, Pixel camera context, screen reading. Entirely new capability for ALL workloads — extraction could process images, research could analyze screenshots, text chat could understand photos sent via Telegram. |
| **3.15x faster throughput** | ALL background tasks complete 3x faster. Daily reflection that took 30s on Super takes 10s on Gemma 4. Meditation, comic, wonder — all faster. |
| **3.2x faster TTFT** | Text chat feels snappier (83.9ms vs 269ms). First word arrives 3x sooner. |
| **Simpler architecture** | One machine, one model, one deployment. No cross-machine SSH, no ConnectX-7 dependency, no Beast health monitoring, no fallback logic complexity. |

---

## 5. Benefits of Freeing Beast

### Direct Benefits

| Benefit | Value |
|---------|-------|
| **Free DGX Spark hardware** | Entire 128 GB machine available for other projects, sale, or lease |
| **Electricity savings** | DGX Spark draws ~200W idle, ~350W peak. ~$300-500/year at Indian residential rates |
| **Reduced operational complexity** | No SSH to Beast in start.sh/stop.sh. No cross-machine health monitoring. No ConnectX-7 networking issues. |
| **Fewer failure modes** | Beast outage, network partition, ConnectX-7 misconfiguration, Docker version skew — all eliminated |
| **Simpler deployments** | `git pull && restart` on one machine instead of two |
| **Cleaner codebase** | Remove Beast routing logic from 18 files. Delete health check code, fallback chains, Beast-specific configs. |

### Beast Reuse Options

| Option | Description |
|--------|------------|
| **Development/testing machine** | Run experimental models, benchmark new releases, without touching production |
| **Panda replacement** | Beast (128 GB) is massive overkill for Panda's role but could consolidate everything |
| **Second project** | Run a separate AI project entirely |
| **Sell/lease** | DGX Spark has strong resale value |
| **Cold standby** | Keep powered off, bring online only for heavy batch processing |

---

## 6. Migration Effort

### Code Changes Required

| File/Area | Change | Effort |
|-----------|--------|--------|
| `context-engine/config.py` | Change 8 `*_LLM_DEFAULT` from `"vllm/nemotron-super"` to `"vllm/gemma-4-26b"` | 10 min |
| `context-engine/llm.py` | Remove Beast routing in `_create_from_spec()`, always use local vLLM | 15 min |
| `annie-voice/text_llm.py` | Remove Beast probe + routing, always use local Gemma 4 | 20 min |
| `annie-voice/agent_context.py` | Remove Beast health check, simplify tier routing | 15 min |
| `annie-voice/resource_pool.py` | Remove BEAST backend enum, route everything to local | 10 min |
| `annie-voice/browser_tasks.py` | Point at local Gemma 4, add retry logic | 30 min |
| `annie-voice/proactive_pulse.py` | Remove Beast stage 2 escalation, use local | 10 min |
| `annie-voice/research_orchestrator.py` | Change stage 3 tier from "super" to local | 5 min |
| `annie-voice/meditation_orchestrator.py` | Change weekly/monthly tier to local | 5 min |
| `annie-voice/compaction.py` | Bump `gemma-4-26b` ctx_size from 32768 to 65536+ | 5 min |
| `start.sh` / `stop.sh` | Remove `start_beast_llm()` and `stop_beast_services()` | 15 min |
| `.env` files | Remove `BEAST_LLM_BASE_URL`, `BEAST_LLM_MODEL` | 5 min |
| `docs/RESOURCE-REGISTRY.md` | Remove Beast section, update budget | 10 min |
| Tests (18+ test files) | Update Beast-related test mocks and assertions | 1-2 hours |

**Total estimated effort: ~4-5 hours** (including test updates).
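The `config.py` change is mechanical. A hypothetical sketch of its shape (the constant names below are illustrative, not the repo's actual identifiers):

```python
# Hypothetical shape of the context-engine/config.py edit. The constant
# names here are illustrative stand-ins for the real 8 *_LLM_DEFAULT entries.
LOCAL_VLLM = "vllm/gemma-4-26b"   # was "vllm/nemotron-super" (Beast)

EXTRACTION_LLM_DEFAULT    = LOCAL_VLLM
REFLECTION_LLM_DEFAULT    = LOCAL_VLLM
GRAPHITI_LLM_DEFAULT      = LOCAL_VLLM
CONTRADICTION_LLM_DEFAULT = LOCAL_VLLM
```

Defining one shared constant makes any future model swap a one-line change instead of eight.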

---

## 7. Risk Matrix

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Browser task failures increase | **LOW** *(revised)* | **LOW** | τ2-bench Retail: Gemma 4 85.5% vs Super 62.8%. May actually improve. |
| Text chat quality drops for complex reasoning | **MEDIUM** | **MEDIUM** | Investigate Gemma 4 thinking mode for non-voice; most chats are simple Q&A |
| Meditation/reflection quality decreases | **MEDIUM** | **LOW** | Subjective quality — may not be noticeable. Speed improvement compensates. |
| Context window too small for some tasks | **LOW** | **LOW** | Bump max_model_len to 65K-128K (Titan has headroom) |
| Gemma 4 tool calling degrades over time | **LOW** | **MEDIUM** | Monitor tool success rates; fall back to Claude API if needed |
| Entity extraction quality drops | **VERY LOW** | **MEDIUM** | Gemma 4 *outperforms* Super on our entity extraction tests |
| Vision enables new failure modes | **LOW** | **LOW** | Vision is additive — no existing workflow depends on it |

---

## 8. Phased Approach (recommended if proceeding)

### Phase 1: Shadow Mode (1 week)
- Route Context Engine tasks to Gemma 4 **in parallel** (keep Super as primary)
- Compare extraction quality, daily reflection quality side-by-side
- Log tool calling success rates on Gemma 4 vs Super
- **Zero risk** — Super still handles everything, Gemma 4 runs in shadow
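Phase 1 can be as simple as issuing the same request to both OpenAI-compatible endpoints and logging both outputs while serving only Super's. URLs, ports, and model names below are illustrative; the real values live in the `.env` files:

```python
import json
import urllib.request

# Illustrative endpoints: Beast's vLLM on :8003 per this doc; the local
# Gemma 4 port is a placeholder assumption.
BACKENDS = {
    "super":  ("http://192.168.68.58:8003/v1/chat/completions", "nemotron-super"),
    "gemma4": ("http://localhost:8002/v1/chat/completions", "gemma-4-26b"),
}

def build_request(url: str, model: str, messages: list) -> urllib.request.Request:
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(url, body, {"Content-Type": "application/json"})

def shadow_query(messages: list) -> dict:
    """Query both backends; in Phase 1 only the 'super' output is served."""
    out = {}
    for name, (url, model) in BACKENDS.items():
        with urllib.request.urlopen(build_request(url, model, messages), timeout=120) as r:
            out[name] = json.load(r)["choices"][0]["message"]["content"]
    return out
```

Writing both outputs to a log per task type gives the side-by-side quality data Phase 2 needs.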

### Phase 2: Context Engine Cutover (1 week)  
- Switch all 8 Context Engine workloads to Gemma 4
- Keep Beast running for text chat + browser tasks only
- Monitor extraction quality, daily summaries, nudges
- **Low risk** — CE tasks are the easiest to migrate (structured extraction, short output)

### Phase 3: Text Chat + Agents (1 week)
- Route Telegram text chat to Gemma 4
- Migrate proactive pulse, research orchestrator, meditation
- **Medium risk** — text chat loses thinking mode, meditation loses depth

### Phase 4: Browser Tasks + Decommission (1 week)
- Migrate browser tasks with added retry logic
- Run 1 week of real coffee/food ordering on Gemma 4
- If success rate is acceptable: power off Beast
- **Highest risk** — browser tasks are the hardest to migrate

---

## 9. Overall Verdict

### Quantified Impact Summary

| Dimension | Super (current) | Gemma 4 (proposed) | Change |
|-----------|----------------|--------------------|---------| 
| Machines required | 2 (Titan + Beast) | 1 (Titan only) | **-1 machine** |
| Total VRAM consumed | 39 GB + 97 GB = 136 GB | ~24-39 GB | **-71% to -82%** |
| Background task speed | 16.0 tok/s | 50.4 tok/s | **+3.15x faster** |
| Entity extraction | 7/7 + 2/3 | 7/7 + 3/3 | **+1 place** (better) |
| τ2-bench Retail (official) | 62.83% | 85.5% | **+22.7 pts (BETTER)** |
| τ2-bench Average (official) | 61.15% | 68.2% | **+7.1 pts (BETTER)** |
| ~~Tool calling "97%"~~ | ~~97.0%~~ | — | ~~Unverifiable marketing claim~~ |
| Thinking mode | Full reasoning chains | Disabled (investigate enabling) | **Lost** |
| Vision | None | Native image + video | **Gained** |
| Context window | 131K configured | 32K → 128K (bumpable) | **Solvable** |
| GPQA Diamond | 79.2% | 82.3% | **+3.1 pts** (better) |
| MMLU-Pro | 83.7% | 82.6% | **-1.1 pts** (negligible) |
| Operational complexity | 2 machines, SSH, health checks | 1 machine, simple | **Much simpler** |

### Bottom Line (REVISED after benchmark verification)

**Strongly feasible. No significant tradeoffs remain.**

ALL 13 workloads can migrate to Gemma 4 with LOW or NO impact:
- Entity extraction **improves** (7/7+3/3 vs 7/7+2/3)
- Agentic tool use **improves** (τ2-bench: +22.7 pts Retail, +7.1 pts average)
- Speed **improves 3.15x** across all background tasks
- Vision is **gained** (entirely new capability)
- Architecture becomes **drastically simpler** (one machine, one model)

**The original blocker (browser task tool accuracy) was based on unverifiable data.** Super's "97% tool calling accuracy" does not appear in any official benchmark. The only standardized agentic benchmark covering both models (τ2-bench) shows Gemma 4 is significantly better, especially on Retail (+22.7 pts), which is the closest domain to our browser ordering tasks.

**Super's remaining advantages are narrow:**
- SWE-Bench Verified: 60.47% (Gemma 4 not published, likely lower)
- 1M context window (Gemma 4 caps at 128K, sufficient for our workloads)
- Thinking mode in production (Gemma 4 could potentially be enabled)

**Recommended path:**

| Path | Description | When |
|------|------------|------|
| **C: Full migration** *(now recommended)* | Everything on Gemma 4. Bump context to 65-128K. Free Beast entirely. | **Default recommendation** -- no significant quality risk, major simplicity gain |
| **B: Migrate with Claude API safety net** | Move everything to Gemma 4. Keep Claude API as fallback for edge cases. | If extra caution desired |
| **A: Keep both (status quo)** | Maintain dual-machine setup. | Only if SWE-Bench-class coding tasks become a regular workload |

**My assessment:** Path C is now the clear winner. The benchmark correction eliminates the only HIGH-impact risk. Free Beast, simplify the architecture, gain vision, gain speed. The ~4-5 hours of migration effort pays for itself immediately in reduced operational complexity.
