# Research: TTS Text Processing — Markdown-to-Speech

**Date:** 2026-03-14
**Context:** Annie reads markdown formatting literally ("asterisk asterisk bold asterisk asterisk") and Pipecat's MarkdownTextFilter eats legitimate content.

## Problem

LLMs produce markdown-formatted text (bold, headings, links, code blocks) even when instructed not to. When sent to TTS, the raw formatting characters are read aloud. Post-processing filters intended to strip markdown can corrupt or delete content.

## Pipecat's MarkdownTextFilter (REMOVED — too dangerous)

**Source:** `pipecat/utils/text/markdown_text_filter.py` (v0.0.105)

**Approach:** Convert text → HTML via Python `markdown` library → strip HTML tags with regex.

**Fatal bugs discovered (tested 2026-03-14):**

| Input | Output | Bug |
|-------|--------|-----|
| `>` (alone) | empty | Blockquote eats content |
| `> 50% of traders are bullish.` | `\n50% of traders are bullish.\n` | `>` stripped as blockquote |
| `<Reuters>` | empty | Treated as HTML tag |
| `---` | empty | Horizontal rule |
| `Gold: $5,120 \| Silver: $32` | `Gold: $5,120  Silver: $32` | All pipes removed |
| `# Summary` | `Summary` | Heading stripped (minor) |

**Root cause:** The Markdown parser is designed for *rendering* documents, not *stripping syntax*. It interprets a leading `>` as a blockquote, `<word>` as an HTML tag, and `---` as `<hr>`. The tag-stripping pass then removes those tags, and the original characters disappear with them.
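
The `<Reuters>` loss can be reproduced with the second stage alone. This is a minimal sketch, assuming the HTML step passes `<Reuters>` through unchanged and using a generic tag-stripping regex chosen for illustration (it is not copied from Pipecat's source):

```python
import re

# Hypothetical stand-in for the tag-stripping stage; not Pipecat's actual regex.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(html: str) -> str:
    """Remove anything that looks like an HTML tag."""
    return TAG_RE.sub("", html)

# Assumed output of the markdown-to-HTML step for "According to <Reuters>, gold rose."
rendered = "<p>According to <Reuters>, gold rose.</p>"
print(repr(strip_tags(rendered)))
```

The regex cannot tell the renderer's own `<p>` from a source name written in angle brackets, so both are deleted.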

## ThinkTagFilter (REMOVED — fragile custom code)

**Source:** `services/annie-voice/text_processor.py` (custom)

**Approach:** Character-level state machine that buffers text after `<`, looking for a `<think>` tag.

**Issues:**
- Orphaned `<` at end of stream is silently lost
- Adds latency to every frame for a rare edge case
- Primary defense (`enable_thinking: False` in `llamacpp_llm.py`) already handles this

**Decision:** Removed. The `enable_thinking: False` kwarg sent to llama-server is sufficient.
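
The orphaned-`<` failure mode is easy to see in a toy version of such a state machine. The class below is a hypothetical reconstruction for illustration, not the removed `text_processor.py` code; it models only the hold-back of a possible opening tag and omits suppression of the thought body:

```python
class NaiveThinkBuffer:
    """Hypothetical sketch: hold back text after '<' until '<think>' is ruled out."""

    TAG = "<think>"

    def __init__(self) -> None:
        self._held = ""  # characters that might still turn out to be the tag

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            candidate = self._held + ch
            if self.TAG.startswith(candidate):
                # Still a possible tag prefix: keep holding, or drop a complete tag.
                self._held = "" if candidate == self.TAG else candidate
                continue
            out.append(self._held)  # not a tag after all: release what was held
            self._held = ""
            if ch == "<":
                self._held = ch     # this '<' may start a new tag
            else:
                out.append(ch)
        return "".join(out)

    def flush(self) -> str:
        """Release held text at end of stream; skipping this call loses it."""
        held, self._held = self._held, ""
        return held


buf = NaiveThinkBuffer()
print(repr(buf.feed("price <")))  # the trailing '<' is held back
print(repr(buf.flush()))          # only an explicit flush returns it
```

If the pipeline never calls something like `flush()` when the stream ends, the held `<` never reaches TTS, and every character of every frame still pays for the check.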

## Approach 1: System Prompt (DEPLOYED)

**Strategy:** Tell the LLM explicitly not to use markdown.

```
NEVER use markdown formatting — no **bold**, no *italic*, no # headings,
no `backticks`, no [links](url). Your output is spoken aloud by a TTS
engine that reads raw text. Use plain numbered lists (1. 2. 3.) and
natural speech patterns instead.
```

**Pros:**
- Zero processing overhead
- Works at the source — model generates clean text
- No risk of content loss

**Cons:**
- Small models (9B) sometimes ignore system prompt instructions, especially for longer responses or tool result summaries
- The markdown habit learned in pretraining is strong; explicit examples of forbidden patterns help counter it

**Status:** Deployed in commit `0985880`. Monitoring whether Qwen3.5-9B respects it consistently.
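
Compliance can be monitored without touching the audio path by counting markdown constructs in logged responses. A minimal sketch follows; the detector patterns and the `markdown_violations` name are assumptions made for illustration and would need tuning against real transcripts:

```python
import re

# Hypothetical detector patterns; these only count, they never modify text.
MARKDOWN_SIGNS = {
    "bold": re.compile(r"\*\*\S(?:.*?\S)?\*\*"),
    "heading": re.compile(r"^#{1,6}\s+\S", re.MULTILINE),
    "inline_code": re.compile(r"`[^`\n]+`"),
    "link": re.compile(r"\[[^\]]+\]\([^)]+\)"),
    "bullet": re.compile(r"^\s*[-*+]\s+\S", re.MULTILINE),
}

def markdown_violations(response: str) -> dict[str, int]:
    """Return a count per markdown construct found in one LLM response."""
    counts = {name: len(p.findall(response)) for name, p in MARKDOWN_SIGNS.items()}
    return {name: n for name, n in counts.items() if n}

print(markdown_violations("Gold is at **$5,120** today."))
print(markdown_violations("The price is > $5000 and 2 * 3 is 6."))
```

Logging the returned dict per response gives a violation rate over time, which is the signal needed to decide whether Phase 2 is required.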

## Approach 2: LiveKit's `filter_markdown` (RECOMMENDED if prompt isn't enough)

**Source:** [`livekit/agents/voice/transcription/filters.py`](https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/voice/transcription/filters.py)

**Approach:** Pure regex substitution — no HTML conversion. Streaming-aware with incomplete-pattern buffering.

**Architecture:**

1. **Line patterns** (applied at start of lines only):
   ```python
   LINE_PATTERNS = [
       (re.compile(r"^#{1,6}\s+", re.MULTILINE), ""),      # # headings
       (re.compile(r"^\s*[-+*]\s+", re.MULTILINE), ""),      # - list markers
       (re.compile(r"^\s*>\s+", re.MULTILINE), ""),           # > block quotes
   ]
   ```

2. **Inline patterns** (anywhere in text):
   ```python
   INLINE_PATTERNS = [
       (re.compile(r"!\[([^\]]*)\]\([^)]*\)"), r"\1"),        # ![alt](url) → alt
       (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),          # [text](url) → text
       (re.compile(r"(?<!\S)\*\*([^*]+?)\*\*(?!\S)"), r"\1"), # **bold** → bold
       (re.compile(r"(?<!\S)\*([^*]+?)\*(?!\S)"), r"\1"),      # *italic* → italic
       (re.compile(r"(?<!\w)__([^_]+?)__(?!\w)"), r"\1"),      # __bold__ → bold
       (re.compile(r"(?<!\w)_([^_]+?)_(?!\w)"), r"\1"),        # _italic_ → italic
       (re.compile(r"`{3,4}[\S]*"), ""),                        # ``` code blocks
       (re.compile(r"`([^`]+?)`"), r"\1"),                      # `code` → code
       (re.compile(r"~~(?!\s)([^~]*?)(?<!\s)~~"), ""),          # ~~strikethrough~~ → removed, struck text included
   ]
   ```

3. **Streaming buffer:** Accumulates text until a natural split point (space, punctuation). Checks `has_incomplete_pattern()`: if the buffer might contain half of a `**bold**` marker, it waits for more text before substituting, so a marker split across chunks is still matched as a pair. A standalone `*` or `_` in normal text is left alone because every inline pattern requires a closing marker.

4. **Line handling:** Newlines trigger LINE_PATTERNS on the new line. This means `>` is only stripped at the start of a line, not mid-sentence (unlike Pipecat's approach which converts everything through the Markdown parser).
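
The hold-back idea in steps 3 and 4 can be sketched in a few lines. This is a simplified stand-in, not LiveKit's implementation: the `has_incomplete_pattern` heuristic here (an odd number of `**` or backticks) and the `filter_stream` helper are assumptions, and only two inline patterns are applied:

```python
import re
from typing import Iterable, Iterator

BOLD = re.compile(r"\*\*([^*]+?)\*\*")
CODE = re.compile(r"`([^`]+?)`")
SPLIT_CHARS = " \n.,!?;:"  # assumed "natural split points"

def has_incomplete_pattern(buffer: str) -> bool:
    """Simplified heuristic: an odd marker count means a pair is still open."""
    return buffer.count("**") % 2 == 1 or buffer.count("`") % 2 == 1

def substitute(text: str) -> str:
    return CODE.sub(r"\1", BOLD.sub(r"\1", text))

def filter_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield cleaned text, holding back while a marker pair may be incomplete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if not buffer or has_incomplete_pattern(buffer) or buffer[-1] not in SPLIT_CHARS:
            continue  # wait for more text
        yield substitute(buffer)
        buffer = ""
    if buffer:
        yield substitute(buffer)  # end of stream: release whatever is left

print(list(filter_stream(["Gold is **up", " 2%** today. ", "Use `pip", "` now."])))
```

A marker that is opened and never closed keeps the buffer growing until the stream ends, so text is delayed but not lost; a production version would presumably cap that wait.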

**Why this is safe:**
- `>` mid-sentence ("price > $5000") is NOT stripped, because LINE_PATTERNS require `^\s*>\s+`; a line-leading `> ` is removed, but the text after it is kept
- `<word>` is NOT treated as HTML — no HTML conversion step exists
- `---` is NOT stripped — not in the pattern list (could add if needed)
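- `|` is NOT stripped; no pattern targets pipes or table syntax
- Nothing is *added* or *converted*; matched syntax is only *removed* (the strikethrough pattern is the one case where the enclosed text is removed along with its markers)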
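
These claims can be checked directly against the inputs that broke Pipecat's filter. The sketch below copies the patterns from the listing above and applies them to complete strings in one pass; `strip_once` is a hypothetical batch helper, whereas LiveKit applies the patterns through its streaming buffer:

```python
import re

# Patterns copied from the listing above.
LINE_PATTERNS = [
    (re.compile(r"^#{1,6}\s+", re.MULTILINE), ""),
    (re.compile(r"^\s*[-+*]\s+", re.MULTILINE), ""),
    (re.compile(r"^\s*>\s+", re.MULTILINE), ""),
]
INLINE_PATTERNS = [
    (re.compile(r"!\[([^\]]*)\]\([^)]*\)"), r"\1"),
    (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),
    (re.compile(r"(?<!\S)\*\*([^*]+?)\*\*(?!\S)"), r"\1"),
    (re.compile(r"(?<!\S)\*([^*]+?)\*(?!\S)"), r"\1"),
    (re.compile(r"(?<!\w)__([^_]+?)__(?!\w)"), r"\1"),
    (re.compile(r"(?<!\w)_([^_]+?)_(?!\w)"), r"\1"),
    (re.compile(r"`{3,4}[\S]*"), ""),
    (re.compile(r"`([^`]+?)`"), r"\1"),
    (re.compile(r"~~(?!\s)([^~]*?)(?<!\s)~~"), ""),
]

def strip_once(text: str) -> str:
    """Hypothetical batch helper: apply every pattern to a complete string."""
    for pattern, replacement in LINE_PATTERNS + INLINE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

# Inputs that Pipecat's filter corrupted, per the table earlier in this document.
UNCHANGED = [">", "price > $5000", "<Reuters>", "---", "Gold: $5,120 | Silver: $32"]
for sample in UNCHANGED:
    assert strip_once(sample) == sample

# Line-leading markers are removed, but the text after them survives.
assert strip_once("> 50% of traders are bullish.") == "50% of traders are bullish."
assert strip_once("# Summary") == "Summary"
```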

**Pros:**
- Production-tested in LiveKit (used by thousands of voice agents)
- Streaming-first design with proper buffering
- Minimal content loss risk (the regexes touch only known markdown syntax)
- Handles all common markdown patterns

**Cons:**
- Requires adaptation from LiveKit's `AsyncIterable[str]` to Pipecat's `BaseTextFilter` interface
- Still a processing step in the pipeline (minor latency)

**Adoption plan:** If the prompt-only approach (Approach 1) doesn't fully work, port LiveKit's regex patterns into a Pipecat `BaseTextFilter` subclass.

## Approach 3: Prompt + Lightweight Regex (HYBRID)

Combine system prompt instruction with a minimal regex filter as safety net:

```python
import re

STRIP_PATTERNS = [
    # **bold** → bold (markers must touch the text, so "2 ** 3 ** 4" is left alone)
    (re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*"), r"\1"),
    # *italic* → italic (same guard, so "2 * 3 * 4" is left alone)
    (re.compile(r"(?<![\w*])\*(?=\S)([^*\n]+?)(?<=\S)\*(?![\w*])"), r"\1"),
    (re.compile(r"`([^`]+?)`"), r"\1"),                 # `code` → code
    (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),      # [text](url) → text
    (re.compile(r"^#{1,6}\s+", re.MULTILINE), ""),     # # heading → heading
]

def strip_markdown_for_speech(text: str) -> str:
    """Apply each pattern in order and return the text with markers removed."""
    for pattern, replacement in STRIP_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

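This is simpler than LiveKit's full implementation but covers the most common markers (bold, italic, inline code, links, headings). No streaming buffer is needed if it is applied per sentence, provided a marker pair never spans a sentence boundary.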
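
Per-sentence application could look roughly like the sketch below. The `speak_by_sentence` helper and its naive sentence-boundary regex are assumptions for illustration; the `clean` argument is any string-to-string function, such as `strip_markdown_for_speech` above, and a trivial stand-in is used here to keep the example self-contained:

```python
import re
from typing import Callable, Iterable, Iterator

# Naive boundary: whitespace that follows ., ! or ? (this also splits after "1.").
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def speak_by_sentence(chunks: Iterable[str], clean: Callable[[str], str]) -> Iterator[str]:
    """Accumulate streamed chunks and yield cleaned, complete sentences."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Everything before the last boundary is complete; the tail stays pending.
        *complete, pending = SENTENCE_END.split(pending)
        for sentence in complete:
            yield clean(sentence)
    if pending.strip():
        yield clean(pending)  # end of stream: emit the unfinished tail

def demo_clean(text: str) -> str:
    """Stand-in cleaner that only drops `**`."""
    return text.replace("**", "")

print(list(speak_by_sentence(["Gold is **up", "** today. Silver ", "is flat."], demo_clean)))
```

Because the filter only ever sees whole sentences, a `**` pair split across chunks is reassembled before any pattern runs.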

## Other Approaches Considered

### Open WebUI (JS, simpler non-streaming)
From [PR #7919](https://github.com/open-webui/open-webui/pull/7919) — straightforward regex chain:
- Removes code blocks and tables entirely
- Strips bold/italic/strikethrough to plain text
- Removes headers, list markers, blockquotes, footnotes
- Final pass: removes remaining `- * _ ~` characters
- No streaming support (applied to complete text)

### Vapi's 14-Step Pipeline (most comprehensive)
Goes beyond markdown into **spoken number formatting**:
- Currency: `$42.50` → "forty two dollars and fifty cents"
- Emails: `@` → "at", `.` → "dot"
- Dates to spoken format, phone numbers, percentages
- User-defined custom replacements (exact string or regex)
- Worth stealing the currency/number patterns if Kokoro reads `$5,120` as "dollar sign five comma one two zero"
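
A currency rule of this kind might look like the sketch below. It is not Vapi's implementation; the `speak_dollars` and `number_to_words` helpers are hypothetical, handle US dollar amounts below one million only, and spell numbers without hyphens to match the example above:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999_999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")

# $5,120 or $5000, optionally followed by exactly two decimal digits.
CURRENCY_RE = re.compile(r"\$(\d{1,3}(?:,\d{3})*|\d+)(?:\.(\d{2}))?(?!\d)")

def speak_dollars(text: str) -> str:
    """Replace dollar amounts with their spoken form."""
    def repl(m: re.Match) -> str:
        dollars = int(m.group(1).replace(",", ""))
        cents = int(m.group(2)) if m.group(2) else 0
        if dollars >= 1_000_000:
            return m.group(0)  # out of range for this sketch: leave untouched
        spoken = f"{number_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"
        if cents:
            spoken += f" and {number_to_words(cents)} cent{'' if cents == 1 else 's'}"
        return spoken
    return CURRENCY_RE.sub(repl, text)

print(speak_dollars("Gold: $5,120 | Silver: $32"))
```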

### SSML Conversion
Convert `**bold**` → `<emphasis>bold</emphasis>`, etc. Requires TTS engine SSML support. **Kokoro does NOT support SSML** — it takes plain text only. This approach is viable if we switch to a TTS engine with SSML support (e.g., Azure TTS, ElevenLabs). SSML is orthogonal to markdown stripping — it's for pronunciation control (pauses, emphasis), not syntax removal.
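
For reference, the bold-to-emphasis conversion is a small change to the stripping approach. This is a minimal sketch, assuming a target engine that accepts a `<speak>` root with `<emphasis>` elements; the `markdown_bold_to_ssml` name is hypothetical and only bold is handled:

```python
import re
from xml.sax.saxutils import escape

BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")

def markdown_bold_to_ssml(text: str) -> str:
    """Wrap **bold** spans in <emphasis> and return a <speak> document."""
    escaped = escape(text)  # escape &, < and > first so the result is valid XML
    body = BOLD.sub(r"<emphasis>\1</emphasis>", escaped)
    return f"<speak>{body}</speak>"

print(markdown_bold_to_ssml("Gold is **up** & above > $5000"))
```

Escaping before substitution matters: a literal `>` or `<Reuters>` in the LLM output would otherwise be parsed as markup by the TTS engine.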

### Prompt Engineering Best Practices (from LMNT, LiveKit)
- Explicitly state "your response will be spoken aloud" — biggest single impact
- Use contractions and casual language in the prompt itself (LLM mirrors style)
- Be redundant: repeat "no markdown" in multiple sections
- Provide before/after examples of desired vs undesired output
- Filler words guidance ("um", "well") for naturalness (optional)

## Key Sources

- [LiveKit filter_markdown](https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/voice/transcription/filters.py) — production regex filter
- [Pipecat MarkdownTextFilter](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/utils/text/markdown_text_filter.py) — HTML conversion approach (dangerous)
- [Open WebUI PR #7919](https://github.com/open-webui/open-webui/pull/7919) — JS regex chain
- [LiveKit: Modifying LLM output before TTS](https://docs.livekit.io/recipes/chain-of-thought/)
- [LiveKit blog: Prompting voice agents](https://livekit.com/blog/prompting-voice-agents-to-sound-more-realistic/)
- [LMNT LLM prompting guide](https://docs.lmnt.com/guides/llm-prompt)
- [Vapi voice formatting plan](https://docs.vapi.ai/assistants/voice-formatting-plan)
- [OpenAI community: preventing markdown](https://community.openai.com/t/how-to-prevent-gpt-from-outputting-responses-in-markdown-format/961314)

## Decision

**Phase 1 (NOW):** System prompt instruction only — tell Qwen3.5-9B not to use markdown. Zero overhead, no content loss risk. Monitor compliance.

**Phase 2 (IF NEEDED):** Add lightweight regex safety net (Approach 3) as a Pipecat `BaseTextFilter`. Port the 5 most common patterns from LiveKit.

**Phase 3 (FUTURE):** If moving to a TTS with SSML support, convert markdown to SSML instead of stripping it — `**bold**` becomes `<emphasis>` for natural prosody.
