# Research: Gemma 4 Tool Calling in vLLM — Root Cause & Fix

> **Date:** 2026-04-05
> **Status:** RESEARCH COMPLETE
> **Conclusion:** Our 0/8 tool-calling results were caused by using the wrong parsers (`hermes`/`pythonic`). vLLM has a dedicated `gemma4` parser (merged Apr 2, PR #38826). Both the 26B MoE and the 31B Dense should work with `--tool-call-parser gemma4`. Three streaming bugs remain open but don't affect non-streaming use.

## Question 1: vLLM Issue #38912 — What Is It?

**Source:** https://github.com/vllm-project/vllm/issues/38912

Issue #38912 is about **NVFP4 weight loading for Gemma 4 MoE models**, NOT about tool calling.

The problem: vLLM's `expert_params_mapping` in `gemma4.py` correctly maps base weight keys but fails for NVFP4 scale keys (`.weight_scale`, `.weight_scale_2`, `.input_scale`). The dot-vs-underscore mapping produces wrong FusedMoE parameter names, causing a `TypeError` crash during weight loading.
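The failure mode can be sketched as a suffix map that only knows base weight keys. This is an illustration with invented names, not the real code in `gemma4.py`; the actual mapping and parameter names differ.

```python
# Hypothetical sketch of a suffix-based expert weight remapper. A map built
# only for base ".weight" suffixes silently misses the NVFP4 scale keys
# (".weight_scale", ".weight_scale_2", ".input_scale") emitted by modelopt.
BASE_MAP = {
    ".gate_proj.weight": ".experts.w13_weight",
    ".up_proj.weight":   ".experts.w13_weight",
    ".down_proj.weight": ".experts.w2_weight",
}

def remap(key):
    """Return the fused-MoE parameter name, or None when no suffix matches."""
    for suffix, fused in BASE_MAP.items():
        if key.endswith(suffix):
            return key[: -len(suffix)] + fused
    return None

# Base weights remap cleanly:
remap("model.layers.0.mlp.gate_proj.weight")
# -> "model.layers.0.mlp.experts.w13_weight"

# Scale keys do not end with ".weight", so they fall through, and downstream
# code that expects a parameter name crashes (the reported TypeError):
remap("model.layers.0.mlp.gate_proj.weight_scale")  # -> None
```

The patched `gemma4.py` linked below extends the mapping so the quantization scale suffixes resolve to valid FusedMoE parameter names as well.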

**Scope:** Only affects **MoE models (26B-A4B)** with NVFP4/FP8 quantization via modelopt. Does NOT affect the 31B Dense model at all — Dense models don't have expert tensors. Does NOT relate to tool calling.

**Fix:** A patched `gemma4.py` is provided at https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4/blob/main/gemma4_patched.py

**Status:** OPEN (not yet merged into vLLM main).

## Question 2: vLLM PR #38909 — Gemma4 Tool Parser Fix?

**Source:** https://github.com/vllm-project/vllm/pull/38909

PR #38909 is **NOT the main tool parser implementation**. It's a bugfix for streaming HTML duplication in the Gemma4 tool parser. It fixes a bug where `<html>` could become `<<htmlhtml` during streamed tool call arguments containing HTML content.

**The actual Gemma4 tool parser was introduced in PR #38826** (merged Apr 2, 2026):
- **PR #38826** — `feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use)`
- Added `vllm/tool_parsers/gemma4_tool_parser.py` and `vllm/tool_parsers/gemma4_utils.py`
- Registered `gemma4` as a tool-call-parser option
- Added `vllm/reasoning/gemma4_reasoning_parser.py`
- Added unit tests in `tests/tool_parsers/test_gemma4_tool_parser.py`

**Related PRs (all post-merge fixes):**

| PR | Status | What it fixes |
|----|--------|---------------|
| [#38847](https://github.com/vllm-project/vllm/pull/38847) | **MERGED** (Apr 2) | `Gemma4ToolParser.__init__()` missing `tools` param — 400 error. Tested with **31B-IT-NVFP4 on DGX Spark**. |
| [#38909](https://github.com/vllm-project/vllm/pull/38909) | OPEN | Streaming HTML duplication after tool calls |
| [#38945](https://github.com/vllm-project/vllm/pull/38945) | OPEN | Invalid JSON diffs during tool usage (string delimiter `<\|"\|>` leaking into streamed output). Tested with **31B-IT-NVFP4 on H100**. |
| [#38992](https://github.com/vllm-project/vllm/pull/38992) | OPEN | Partial delimiter chars not stripped in streamed tool calls |

**Key finding:** PR #38847 was explicitly tested with `nvidia/Gemma-4-31B-IT-NVFP4` on DGX Spark and confirmed working. Issue #38946 also confirms tool calling works with the 31B-IT-NVFP4 on H100 (the bug was about streaming JSON corruption, not tool calling failing entirely).

## Question 3: Does `--tool-call-parser gemma4` Exist in vLLM?

**YES.** Confirmed in three places:

1. **Source code** — `vllm/tool_parsers/__init__.py` registers `"gemma4"` mapping to `Gemma4ToolParser` in `gemma4_tool_parser.py`.

2. **Official recipe** — https://github.com/vllm-project/recipes/blob/main/Google/Gemma4.md documents:
   ```bash
   vllm serve google/gemma-4-31B-it \
     --max-model-len 8192 \
     --enable-auto-tool-choice \
     --tool-call-parser gemma4 \
     --reasoning-parser gemma4
   ```

3. **vLLM docs** — https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/gemma4_tool_parser/ exists as API documentation.

### How the parser works

Gemma 4 does NOT use JSON for tool calls. It uses a custom format:
```
<|tool_call>call:func_name{key:<|"|>value<|"|>,num:42}<tool_call|>
```

The `gemma4` parser handles this format. The `hermes` parser expects `<tool_call>{"name": ...}` JSON. The `pythonic` parser expects `func_name(arg=val)`. **Neither can parse Gemma 4's output**, which is why we got 0/8.
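A toy parser makes the mismatch concrete. This is a minimal illustration built from the example format above, not the real `Gemma4ToolParser`; among other things it naively splits on commas, so string arguments containing `,` are not handled.

```python
import json
import re

# Minimal, illustrative parser for the native Gemma 4 tool-call format:
# extracts one call into an OpenAI-style {"name", "arguments"} dict.
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*)\}<tool_call\|>", re.DOTALL)
STR_RE = re.compile(r'<\|"\|>(.*)<\|"\|>', re.DOTALL)

def parse_tool_call(text):
    m = CALL_RE.search(text)
    if not m:
        return None  # JSON/pythonic inputs land here: nothing matches
    name, body = m.group(1), m.group(2)
    args = {}
    for pair in body.split(","):  # naive split, illustration only
        key, _, raw = pair.partition(":")
        s = STR_RE.fullmatch(raw)
        # <|"|>-delimited values are strings; everything else parses as JSON
        args[key] = s.group(1) if s else json.loads(raw)
    return {"name": name, "arguments": args}

out = parse_tool_call(
    '<|tool_call>call:get_weather{city:<|"|>Berlin<|"|>,days:3}<tool_call|>'
)
# -> {"name": "get_weather", "arguments": {"city": "Berlin", "days": 3}}
```

Run the same input through a parser expecting `<tool_call>{"name": ...}` JSON or `func_name(arg=val)` and nothing matches, which is exactly the 0/8 behavior.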

### Root cause of our 0/8 benchmark results

| Benchmark | Parser used | Result | Why it failed |
|-----------|------------|--------|---------------|
| 31B NVFP4 (Beast) | `hermes` | 0/8 | Hermes expects JSON tool calls, Gemma 4 emits `<\|tool_call>call:name{...}` |
| 26B FP8 (Titan) | `pythonic` | 0/8 | Pythonic expects `func(args)`, Gemma 4 emits `<\|tool_call>call:name{...}` |
| 26B Ollama | Built-in | **8/8** | Ollama has its own Gemma 4 template that correctly parses the native format |

**Fix:** Use `--tool-call-parser gemma4 --enable-auto-tool-choice`. That's it.

## Question 4: 31B Dense NVFP4 — Faster Options?

### Official checkpoint: `nvidia/Gemma-4-31B-IT-NVFP4`

Our benchmark result: **6.9 tok/s, 390ms TTFT, 45.95 GB VRAM** on Beast (DGX Spark).

This is too slow for interactive use (threshold: 40 tok/s). The 31B Dense model activates all 31B parameters every token, unlike the MoE variants that only activate 3.8B.

### Community alternatives

| Checkpoint | Format | VRAM est. | vLLM compatible? | Notes |
|-----------|--------|-----------|-------------------|-------|
| `nvidia/Gemma-4-31B-IT-NVFP4` | NVFP4 (modelopt) | 46 GB | Yes | Official, 6.9 tok/s on Spark |
| `unsloth/gemma-4-31B-it-GGUF` | GGUF (Q4_K_M etc.) | ~18-22 GB | **No** (llama.cpp only) | Unsloth Dynamic 2.0 quant |
| `bartowski/google_gemma-4-31B-it-GGUF` | GGUF (Q8_0 etc.) | ~34 GB | **No** (llama.cpp only) | Standard quant |

**No community NVFP4 checkpoint exists for 31B Dense** (unlike the 26B MoE, where bg-digitalservices created one). That is expected: Dense models have no expert tensors, so 31B weight loading never needed the #38912 unfusion patch, and the official NVIDIA checkpoint loads as-is.

### Why 31B is inherently slow on DGX Spark

The bottleneck is **memory bandwidth**, not compute. DGX Spark has 273 GB/s LPDDR5x bandwidth shared between CPU and GPU.

- 31B Dense at NVFP4: ~18 GB weights. At 273 GB/s, theoretical max is ~15 tok/s for pure decode.
- 26B MoE at NVFP4: ~16 GB weights but only ~3 GB active per token. Much faster decode.
- Nemotron Nano at NVFP4: 18 GB weights but only ~3 GB active per token. Also fast.

**31B Dense will never be fast on DGX Spark.** The 6.9 tok/s is close to theoretical limits for a bandwidth-bound dense model of this size.
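The ceiling above follows from one line of arithmetic: every decoded token must stream all active weights from memory once, so tok/s is bounded by bandwidth divided by active weight bytes.

```python
# Worked version of the bandwidth ceiling: tok/s <= bandwidth / active bytes.
BANDWIDTH_GB_S = 273.0  # DGX Spark LPDDR5x, shared between CPU and GPU

def max_decode_tok_s(active_weights_gb):
    """Theoretical decode ceiling for a bandwidth-bound model."""
    return BANDWIDTH_GB_S / active_weights_gb

dense_31b = max_decode_tok_s(18)  # ~15.2 tok/s ceiling; we measured 6.9
moe_26b = max_decode_tok_s(3)     # ~91 tok/s ceiling on active params alone
```

The measured 6.9 tok/s is within ~2x of the dense ceiling, which is typical once KV-cache reads, activations, and kernel overhead are counted; no software change closes that gap.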

### Verdict

The 31B Dense is best suited for:
- **Offline vision tasks** (where latency doesn't matter)
- **Batch processing** (entity extraction, document analysis)
- NOT for interactive/voice use on DGX Spark

## Question 5: 31B + Super Coexistence on Beast

### Memory math

| Component | VRAM |
|-----------|------|
| Super 120B-A12B NVFP4 | ~80 GB (weights) + ~10 GB (KV cache) = ~90 GB |
| 31B Dense NVFP4 | ~18 GB (weights) + ~28 GB (KV cache default) = ~46 GB |
| OS + runtime | ~5-8 GB |
| **Total** | **~141-144 GB** |

DGX Spark has **128 GB unified memory**. Total exceeds capacity by **13-16 GB**.
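The table reduces to simple sums, using the same figures as above:

```python
# Coexistence arithmetic: combined footprint vs the 128 GB unified ceiling.
super_gb = 80 + 10    # Super 120B-A12B weights + tuned KV cache
dense_gb = 18 + 28    # 31B Dense weights + default KV cache
os_low, os_high = 5, 8

total_low = super_gb + dense_gb + os_low    # 141 GB
total_high = super_gb + dense_gb + os_high  # 144 GB
overshoot = total_low - 128                 # 13 GB over, even in the best case
```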

### Can you oversubscribe?

**No. DGX Spark does NOT support memory oversubscription gracefully.**

Key findings from NVIDIA forums:
- When memory pressure exceeds 128 GB, the system enters a **"zombie" state** — SSH hangs, the machine becomes unresponsive, requiring hard reboot. It does NOT throw a clean CUDA OOM error ([NVIDIA Forum](https://forums.developer.nvidia.com/t/dgx-spark-becomes-unresponsive-zombie-instead-of-throwing-cuda-oom/353752)).
- **Swap should be disabled** (`sudo swapoff -a`) on DGX Spark. Swap causes thrashing that makes the system unusable rather than providing graceful degradation.
- Actual usable memory is **~119-120 GB**, not 128 GB (OS kernel, drivers, and filesystem cache take ~8-9 GB).
- vLLM's default KV cache pre-allocation can consume 89 GB on its own if not tuned ([NVIDIA Forum](https://forums.developer.nvidia.com/t/memory-creep-on-dgx-spark-where-your-128-gb-actually-goes-and-how-to-stop-it/364886)).

### Verdict

**Running 31B + Super simultaneously on Beast is NOT feasible.** Even with aggressive KV cache tuning (`--gpu-memory-utilization 0.2`), the combined weight footprint alone (~98 GB) plus OS overhead leaves insufficient headroom for KV cache and runtime.

Options:
1. **Hot-swap** — Stop Super, start 31B for vision tasks, swap back. Adds ~2 min swap latency.
2. **Run 31B on Titan** — But at 46 GB it leaves only ~74 GB for other Titan services.
3. **Don't run 31B** — Use Super for text + 26B MoE for vision on Titan. The 26B MoE at NVFP4 (15.7 GB) is far more practical.

## Summary of Findings

| Question | Answer |
|----------|--------|
| Issue #38912 | MoE NVFP4 weight loading fix. Only affects 26B MoE. NOT about tool calling. |
| PR #38909 | Streaming HTML duplication bugfix. NOT the main parser. Main parser is PR #38826 (merged). |
| `--tool-call-parser gemma4` | **YES, EXISTS.** Registered in vLLM. Our 0/8 results were from using wrong parsers (`hermes`/`pythonic`). |
| 31B faster options | **No.** 31B Dense is bandwidth-bound at ~6.9 tok/s on Spark. No community optimizations change this. |
| 31B + Super on Beast | **NOT FEASIBLE.** ~141-144 GB needed vs ~120 GB actually usable. System zombies on overcommit. |

## Action Items for Phase 3

1. **Re-benchmark 26B-A4B NVFP4 with `--tool-call-parser gemma4`** — This is the most likely fix for tool calling. The community checkpoint (bg-digitalservices) already claims tool calling works with this flag.

2. **Re-benchmark 31B NVFP4 with `--tool-call-parser gemma4`** — PR #38847 was tested with 31B-IT-NVFP4 on DGX Spark and confirmed tool calls are processed correctly. Our 0/8 was purely a parser mismatch.

3. **Use `vllm/vllm-openai:gemma4-cu130` Docker image** — This contains PR #38826 (parser) + PR #38847 (init fix) already merged. The gemma4.py weight-loading patch (#38912) is still needed for 26B MoE NVFP4 only.

4. **Drop 31B from interactive consideration** — 6.9 tok/s is a hardware limitation, not a software bug. Keep it for offline vision only if needed.

## Sources

- [vLLM Issue #38912 — Gemma 4 MoE NVFP4 expert weight mapping](https://github.com/vllm-project/vllm/issues/38912)
- [vLLM PR #38826 — Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use)](https://github.com/vllm-project/vllm/pull/38826)
- [vLLM PR #38847 — Fix Gemma4ToolParser init (tested with 31B on DGX Spark)](https://github.com/vllm-project/vllm/pull/38847)
- [vLLM PR #38909 — Fix streaming HTML duplication](https://github.com/vllm-project/vllm/pull/38909)
- [vLLM PR #38945 — Fix invalid JSON diffs (tested with 31B on H100)](https://github.com/vllm-project/vllm/pull/38945)
- [vLLM PR #38992 — Fix partial delimiter stripping](https://github.com/vllm-project/vllm/pull/38992)
- [vLLM Issue #38837 — Gemma4ToolParser missing tools param](https://github.com/vllm-project/vllm/issues/38837)
- [vLLM Issue #38946 — Gemma 4 tool usage invalid JSON streaming](https://github.com/vllm-project/vllm/issues/38946)
- [vLLM Gemma 4 Recipe — Tool calling flags](https://github.com/vllm-project/recipes/blob/main/Google/Gemma4.md)
- [vLLM Gemma4 Tool Parser API Docs](https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/gemma4_tool_parser/)
- [bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 — Community NVFP4 checkpoint](https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4)
- [NVIDIA Forum — DGX Spark zombie on OOM](https://forums.developer.nvidia.com/t/dgx-spark-becomes-unresponsive-zombie-instead-of-throwing-cuda-oom/353752)
- [NVIDIA Forum — DGX Spark memory creep](https://forums.developer.nvidia.com/t/memory-creep-on-dgx-spark-where-your-128-gb-actually-goes-and-how-to-stop-it/364886)
- [HuggingFace — Gemma 4 31B vLLM discussion](https://huggingface.co/google/gemma-4-31B-it/discussions/4)
- [DeepWiki — DGX Spark Unified Memory Architecture](https://deepwiki.com/NVIDIA/dgx-spark-playbooks/9.3-unified-memory-architecture)
