# Research: Llama 3.1 Tool Calling via Ollama + Pipecat

**Date:** 2026-03-01 (Session 112)
**Status:** Active issue — workaround applied
**Affects:** annie-voice with Llama 3.1 8B backend + Pipecat + Ollama

## Problem

When using Llama 3.1 8B via Ollama's OpenAI-compatible API (`/v1/chat/completions`) with streaming enabled, tool calls (e.g., `web_search`) are sometimes:
1. Output as **text content** instead of structured `tool_calls` — gets spoken aloud by TTS
2. Called for **every interaction** (greetings, emotions, casual conversation) — trigger-happy
3. **Narrated** before execution ("I'll use the web_search function...") — also spoken aloud

## Root Causes

### Bug A: Ollama streaming drops tool_calls (KNOWN BUG)

Ollama's `/v1/chat/completions` streaming endpoint silently drops the `tool_calls` field. The model decides to call a tool, but the streaming response returns the tool call JSON as `delta.content` text instead of `delta.tool_calls`.

- [ollama#9632](https://github.com/ollama/ollama/issues/9632) — "Not streaming tool calling responses" (closed May 2025 with PR #9938)
- [ollama#5796](https://github.com/ollama/ollama/issues/5796) — "Streaming for tool calls unsupported" (closed July 2024)
- [ollama#12557](https://github.com/ollama/ollama/issues/12557) — "Tool Calling + Streaming Issue" (closed October 2025)
- [ollama#13519](https://github.com/ollama/ollama/issues/13519) — "Tool calls as JSON in content field" (January 2026)

PR #9938 (May 2025) was supposed to fix this, but subsequent issues show the problem persists or has regressed.
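The symptom can be illustrated with simplified stand-ins for OpenAI-style streaming deltas (the dict shapes below are illustrative, not Pipecat's actual chunk objects; a real stream delivers `content` in fragments, so single-delta parsing like this only works on complete chunks):

```python
import json

def classify_delta(delta: dict) -> str:
    """Classify a simplified OpenAI-style streaming delta.

    Returns "tool_call" for a properly structured call, "misrouted" when
    tool-call JSON arrives as plain content (the Ollama streaming bug),
    and "text" for ordinary assistant text.
    """
    if delta.get("tool_calls"):
        return "tool_call"
    content = delta.get("content") or ""
    try:
        obj = json.loads(content)
    except ValueError:
        return "text"
    # Llama 3.1's chat template emits {"name": ..., "parameters": {...}}
    if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
        return "misrouted"
    return "text"

# What the API should stream for a tool call:
good = {"tool_calls": [{"function": {"name": "web_search",
                                     "arguments": '{"query": "weather"}'}}]}
# What Ollama sometimes streams instead (Bug A):
bad = {"content": '{"name": "web_search", "parameters": {"query": "weather"}}'}
# classify_delta(good) -> "tool_call"; classify_delta(bad) -> "misrouted"
```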

### Bug B: Llama 3.1 tool-call bias (KNOWN BEHAVIOR)

Llama 3.1 has a strong bias toward using tools when they're available.

- [ollama#6127](https://github.com/ollama/ollama/issues/6127) — "No matter what I prompt, llama3.1 always replies with a tool call"
- The model's chat template instructs it to output `{"name": function_name, "parameters": {...}}` for tool calls
- Even with restrictive descriptions, an 8B model can't reliably follow "only use tools when explicitly asked"

### Bug C: Pipecat's elif chain

Pipecat's `BaseOpenAILLMService._process_context()` uses mutually exclusive branches:

```python
if chunk.choices[0].delta.tool_calls:
    ...  # accumulate tool call fragments; text NOT emitted
elif chunk.choices[0].delta.content:
    ...  # push as LLMTextFrame, which goes to TTS
```

When Ollama puts tool calls in `content` instead of `tool_calls`, Pipecat treats it as normal text and speaks it.
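A toy reproduction of that branch logic (mock chunk objects built with `SimpleNamespace`, not Pipecat's real classes) shows why the leak happens:

```python
from types import SimpleNamespace as NS

def route_chunk(chunk) -> str:
    """Mirror Pipecat's mutually exclusive branches (simplified)."""
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        return "tool-call accumulator"   # text NOT emitted
    elif delta.content:
        return "LLMTextFrame -> TTS"     # spoken aloud
    return "ignored"

def chunk_with(content=None, tool_calls=None):
    """Build a minimal mock of an OpenAI-style streaming chunk."""
    return NS(choices=[NS(delta=NS(content=content, tool_calls=tool_calls))])

# Ollama's bug puts the tool call in `content`, so it reaches TTS:
leaked = chunk_with(content='{"name": "web_search", "parameters": {...}}')
# route_chunk(leaked) -> "LLMTextFrame -> TTS"
```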

### Bug D: System prompt priming (SELF-INFLICTED)

Adding "CRITICAL tool-calling rules" to the system prompt with repeated mentions of `web_search` and JSON actually **primed** the 8B model to think about tools more, increasing trigger-happiness. Smaller models respond worse to negative instructions ("do NOT use tools for X").

## What Works

- Ollama's **native** `/api/chat` endpoint has supported streaming + tool calling correctly since May 2025
- **Non-streaming** requests to `/v1/chat/completions` return proper `tool_calls` in the structured field
- Structured tool calls DO work intermittently in streaming mode — the bug is inconsistent
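To see where the structured field lives, here is the shape of a non-streaming `/v1/chat/completions` tool-call response (field layout follows the OpenAI chat completions schema; the values are made up):

```python
import json

# Illustrative non-streaming response payload. Note `content` is null and
# the call sits in the structured `tool_calls` field, with `arguments`
# as a JSON-encoded string.
payload = json.loads("""
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {
          "name": "web_search",
          "arguments": "{\\"query\\": \\"latest pipecat release\\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
""")

call = payload["choices"][0]["message"]["tool_calls"][0]["function"]
name = call["name"]                   # "web_search"
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
```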

## Possible Fixes (ranked by reliability)

### Fix 1: Disable streaming for tool-enabled requests (RECOMMENDED)

Subclass `OpenAILLMService` and override it to use `stream=False` when `context.tools` is non-empty; non-streaming responses correctly return structured `tool_calls`.

**Trade-off:** Adds latency — must wait for full response before TTS starts. For 1-3 sentence responses from an 8B model, this is ~1-2s extra wait.

**Implementation:**
```python
class OllamaNoStreamToolsService(OpenAILLMService):
    """Ollama LLM that disables streaming when tools are present."""

    async def _stream_chat_completions_specific_context(self, context, messages):
        if context.tools:
            # Non-streaming: tool_calls field is reliable
            response = await self._client.chat.completions.create(
                model=self.model_name,
                messages=messages,
                tools=context.tools,
                stream=False,
            )
            # Convert to single-chunk format for Pipecat
            yield response  # Needs adapter to match streaming protocol
        else:
            # Streaming: no tools, standard path
            async for chunk in super()._stream_chat_completions_specific_context(context, messages):
                yield chunk
```

**Status:** Not yet implemented (requires understanding Pipecat's internal chunk protocol)

### Fix 2: Intercept tool calls from text content

A `FrameProcessor` that:
1. Detects tool call JSON in `LLMTextFrame` text
2. Parses the JSON and executes the registered tool handler
3. Suppresses the text from TTS
4. Injects the tool result back into context

**Trade-off:** Complex, fragile (JSON parsing from streaming tokens), and requires re-triggering the LLM after tool execution.
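Steps 1-2 could be sketched as a buffer that accumulates streamed text and fires once a complete Llama 3.1-style tool-call object parses. This sketch assumes the call arrives as one top-level JSON object with no surrounding narration, which is exactly why the approach is fragile:

```python
import json

class ToolCallTextBuffer:
    """Accumulate streamed text and extract a complete tool-call JSON object.

    Sketch only: assumes the tool call arrives in Llama 3.1's
    {"name": ..., "parameters": {...}} shape with no surrounding prose.
    """

    def __init__(self):
        self._buf = ""

    def feed(self, token: str):
        """Add a token; return (name, parameters) once the JSON is complete."""
        self._buf += token
        text = self._buf.strip()
        if not text.startswith("{"):
            return None
        try:
            obj = json.loads(text)
        except ValueError:
            return None  # incomplete so far; keep buffering
        if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
            self._buf = ""
            return obj["name"], obj["parameters"]
        return None

buf = ToolCallTextBuffer()
result = None
for tok in ['{"name": "web_se', 'arch", "parameters": ', '{"query": "news"}}']:
    result = buf.feed(tok) or result
# result == ("web_search", {"query": "news"})
```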

**Status:** Not implemented

### Fix 3: Use Ollama's native API (MOST CORRECT)

Write a new Pipecat LLM service that uses `/api/chat` instead of `/v1/chat/completions`. The native API has a proper incremental parser for tool calls.

**Trade-off:** Most work — requires a new LLM service class.

**Status:** Not implemented. Pipecat doesn't ship a native Ollama service.
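A minimal sketch of what such a client would do, assuming Ollama's documented native request/response shape (a newline-delimited JSON stream, with tool calls under `message.tool_calls` and `arguments` already parsed as an object, unlike the OpenAI-compatible endpoint's string form) and the default local URL:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed local default

def stream_native_chat(model, messages, tools):
    """Stream Ollama's native /api/chat; yield (kind, data) events."""
    body = json.dumps({"model": model, "messages": messages,
                       "tools": tools, "stream": True}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # one JSON object per line
            if not raw.strip():
                continue
            msg = json.loads(raw).get("message", {})
            for call in msg.get("tool_calls", []):
                yield "tool_call", call["function"]
            if msg.get("content"):
                yield "text", msg["content"]

# Field layout of one streamed line (values illustrative):
line = {"message": {"role": "assistant",
                    "tool_calls": [{"function": {
                        "name": "web_search",
                        "arguments": {"query": "ollama native api"}}}]},
        "done": False}
fn = line["message"]["tool_calls"][0]["function"]
```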

### Fix 4: Keep it simple — minimal system prompt (CURRENT APPROACH)

- Keep the system prompt simple (don't mention tools)
- Keep `ToolCallTextFilter` as safety net for text leaks
- Keep `_is_bogus_search()` gate for trigger-happiness
- Accept that some tool calls will leak as text intermittently

**Trade-off:** Imperfect but pragmatic. Most tool calls work via structured path; the occasional text leak is filtered.
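For illustration, a gate in the spirit of `_is_bogus_search()` might look like this (a hypothetical sketch, not the project's actual implementation in tools.py):

```python
GREETING_WORDS = {"hi", "hello", "hey", "thanks", "thank you", "bye",
                  "good morning", "good night", "how are you"}

def is_bogus_search(query: str) -> bool:
    """Heuristic gate (sketch only): reject web_search calls the model
    fires on greetings, empty queries, or content-free short queries."""
    q = query.strip().lower().rstrip("!?.")
    if not q:
        return True
    if q in GREETING_WORDS:
        return True
    # Very short queries with no information-seeking words are suspect
    if len(q.split()) <= 2 and not any(
            w in q for w in ("weather", "news", "price", "who", "what",
                             "when", "where", "latest")):
        return True
    return False
```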

## Anti-Patterns Discovered

1. **Don't mention tool names in system prompts for small models** — "NEVER use web_search for greetings" makes the model think about web_search more, not less
2. **Don't add long negative instructions** — "Do NOT output text before a tool call" confuses 8B models; they follow positive instructions better
3. **Don't tighten tool descriptions excessively** — longer descriptions with NEVER/ONLY/EXPLICITLY confuse small models more than simple descriptions
4. **Don't play whack-a-mole with text filters** — Llama invents new narration formats faster than you can filter them. Address the root cause (streaming bug) instead.

## Current State (Session 112)

- `TOOL_CAPABLE_BACKENDS = {"llama3.1:8b"}` — tools enabled for Llama
- System prompt reverted to simple version (no tool-calling rules)
- Tool description reverted to simple version
- `ToolCallTextFilter` active as safety net
- `_is_bogus_search()` gate active in tools.py
- Timing instrumentation in WhisperSTT and KokoroTTS

## Pipeline Latency Budget (measured)

| Stage | Latency | Notes |
|-------|---------|-------|
| WhisperSTT | 200-580ms | GPU, fast |
| Llama 3.1 8B (Ollama) | ~5,300ms first turn, ~800ms subsequent | THE BOTTLENECK |
| KokoroTTS | 44-177ms (866ms first call warmup) | GPU, fast |
| SearXNG web_search | ~2,000ms | HTTP to local Docker |
| **Total (no tools)** | **~6,000ms first, ~1,200ms subsequent** | |
| **Total (with tool call)** | **~8,000ms first, ~3,200ms subsequent** | |

## Next Steps

1. **Try Fix 1** (disable streaming for tool requests) — most reliable, moderate effort
2. **Try Ollama version upgrade** — check if latest Ollama fixes the streaming tool_calls bug
3. **Consider Ollama native API** — if Fix 1 doesn't work, write a native API client
4. **Watch for Pipecat OllamaLLMService updates** — Pipecat may add native Ollama support

## References

- [Pipecat OpenAI base_llm.py](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/openai/base_llm.py) — streaming chunk processing
- [Pipecat Function Calling Guide](https://docs.pipecat.ai/guides/learn/function-calling)
- [Pipecat Function Calling Example](https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/14-function-calling.py)
- [Ollama Streaming Tool Blog](https://ollama.com/blog/streaming-tool)
- [Ollama Tool Support Blog](https://ollama.com/blog/tool-support)
- [Ollama OpenAI Compatibility Docs](https://docs.ollama.com/api/openai-compatibility)
