# Research: TTS Text Processing (Markdown-to-Speech)

**Date:** 2026-03-14

**Context:** Annie reads markdown formatting literally ("asterisk asterisk bold asterisk asterisk"), and Pipecat's MarkdownTextFilter eats legitimate content.

## Problem

LLMs produce markdown-formatted text (bold, headings, links, code blocks) even when instructed not to. When that text is sent to TTS, the raw formatting characters are read aloud. Post-processing filters intended to strip markdown can corrupt or delete content.

## Pipecat's MarkdownTextFilter (REMOVED: too dangerous)

**Source:** `pipecat/utils/text/markdown_text_filter.py` (v0.0.105)

**Approach:** Convert text → HTML via the Python `markdown` library → strip HTML tags with regex.

**Fatal bugs discovered (tested 2026-03-14):**

| Input | Output | Bug |
|-------|--------|-----|
| `>` (alone) | empty | Blockquote eats content |
| `> 50% of traders are bullish.` | `\n50% of traders are bullish.\n` | `>` stripped as blockquote |
| `<…>` (angle-bracketed text) | empty | Treated as HTML tag |
| `---` | empty | Horizontal rule |
| `Gold: $5,120 \| Silver: $32` | `Gold: $5,120 Silver: $32` | All pipes removed |
| `# Summary` | `Summary` | Heading stripped (minor) |

**Root cause:** The Markdown parser is designed for *rendering* documents, not *stripping syntax*. It interprets `>` as a blockquote, angle-bracketed text as HTML, and `---` as `<hr>`. The HTML→strip pipeline then deletes the rendered content.

## ThinkTagFilter (REMOVED: fragile custom code)

**Source:** `services/annie-voice/text_processor.py` (custom)

**Approach:** Character-level state machine that buffers text after `<`, looking for a `<think>` tag.

**Issues:**

- An orphaned `<` at the end of the stream is silently lost
- Adds latency to every frame for a rare edge case
- The primary defense (`enable_thinking: False` in `llamacpp_llm.py`) already handles this

**Decision:** Removed. The `enable_thinking: False` kwarg sent to llama-server is sufficient.

## Approach 1: System Prompt (DEPLOYED)

**Strategy:** Tell the LLM explicitly not to use markdown.

```
NEVER use markdown formatting — no **bold**, no *italic*, no # headings, no `backticks`, no [links](url).
Your output is spoken aloud by a TTS engine that reads raw text.
Use plain numbered lists (1. 2. 3.) and natural speech patterns instead.
```

**Pros:**

- Zero processing overhead
- Works at the source: the model generates clean text
- No risk of content loss

**Cons:**

- Small models (9B) sometimes ignore system prompt instructions, especially for longer responses or tool result summaries
- The pretraining markdown habit is strong, so explicit examples of forbidden patterns help

**Status:** Deployed in commit `0985880`. Monitoring whether Qwen3.5-9B respects it consistently.

## Approach 2: LiveKit's `filter_markdown` (RECOMMENDED if prompt isn't enough)

**Source:** [`livekit/agents/voice/transcription/filters.py`](https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/voice/transcription/filters.py)

**Approach:** Pure regex substitution with no HTML conversion. Streaming-aware, with incomplete-pattern buffering.

**Architecture:**

1. **Line patterns** (applied at start of lines only):

   ```python
   import re

   LINE_PATTERNS = [
       (re.compile(r"^#{1,6}\s+", re.MULTILINE), ""),    # # headings
       (re.compile(r"^\s*[-+*]\s+", re.MULTILINE), ""),  # - list markers
       (re.compile(r"^\s*>\s+", re.MULTILINE), ""),      # > block quotes
   ]
   ```

2. **Inline patterns** (anywhere in text):

   ```python
   INLINE_PATTERNS = [
       (re.compile(r"!\[([^\]]*)\]\([^)]*\)"), r"\1"),  # ![alt](url) → alt
       (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),   # [text](url) → text
       # ... further inline patterns follow in the source file
   ]
   ```

`>` is only stripped at the start of a line, not mid-sentence (unlike Pipecat's approach, which converts everything through the Markdown parser).

**Why this is safe:**

- `>` mid-sentence ("price > $5000") is NOT stripped: LINE_PATTERNS require `^\s*>\s+`
- Angle-bracketed text is NOT treated as HTML: no HTML conversion step exists
- `---` is NOT stripped: it is not in the pattern list (could be added if needed)
- `|` is NOT stripped: tables aren't part of TTS text processing
- Content is never *added* or *converted*, only *removed*

**Pros:**

- Production-tested in LiveKit (used by thousands of voice agents)
- Streaming-first design with proper buffering
- No content loss risk (regex only removes known syntax markers)
- Handles all common markdown patterns

**Cons:**

- Requires adaptation from LiveKit's `AsyncIterable[str]` to Pipecat's `BaseTextFilter` interface
- Still a processing step in the pipeline (minor latency)

**Adoption plan:** If the prompt-only approach (Approach 1) doesn't fully work, port LiveKit's regex patterns into a Pipecat `BaseTextFilter` subclass.

## Approach 3: Prompt + Lightweight Regex (HYBRID)

Combine the system prompt instruction with a minimal regex filter as a safety net:

```python
import re

STRIP_PATTERNS = [
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),  # **bold** → bold
    # Assumption: the entries after the bold pattern are truncated in this copy.
    # The italic pattern and the function name below are a reconstruction.
    (re.compile(r"(?<!\*)\*(?!\s)([^*\n]+?)(?<!\s)\*(?!\*)"), r"\1"),  # *italic* → italic
]


def strip_markdown(text: str) -> str:
    """Apply each pattern in order and return the cleaned text."""
    for pattern, replacement in STRIP_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

This is simpler than LiveKit's full implementation but handles the 80% case. No streaming buffering is needed if it is applied per sentence.

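
The per-sentence idea can be sketched as follows. This is a minimal illustration, not the deployed code: `clean_sentences`, `_PATTERNS`, and the sentence-boundary regex are hypothetical names and choices, and the sketch assumes text arrives as arbitrary string chunks rather than through Pipecat's own sentence aggregation.

```python
import re
from typing import Iterable, Iterator

# Hypothetical minimal pattern set, for illustration only.
_PATTERNS = [
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),  # **bold** -> bold
    (re.compile(r"`([^`]+)`"), r"\1"),      # `code` -> code
]

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def strip_markdown(text: str) -> str:
    """Apply each pattern in order and return the cleaned text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def clean_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed chunks and yield cleaned, complete sentences."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a complete sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield strip_markdown(sentence)
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield strip_markdown(buffer.strip())


chunks = ["Gold is **up", "** today. Price > $5,000", " now. Done"]
print(list(clean_sentences(chunks)))
```

Because stripping only runs on whole sentences, a `**bold**` marker split across two chunks is still matched, and a mid-sentence `>` passes through untouched. A marker that spans a sentence boundary would not be caught by this sketch.
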
## Other Approaches Considered

### Open WebUI (JS, simpler non-streaming)

From [PR #7919](https://github.com/open-webui/open-webui/pull/7919), a straightforward regex chain:

- Removes code blocks and tables entirely
- Strips bold/italic/strikethrough to plain text
- Removes headers, list markers, blockquotes, footnotes
- Final pass: removes remaining `- * _ ~` characters
- No streaming support (applied to complete text)

### Vapi's 14-Step Pipeline (most comprehensive)

Goes beyond markdown into **spoken number formatting**:

- Currency: `$42.50` → "forty two dollars and fifty cents"
- Emails: `@` → "at", `.` → "dot"
- Dates to spoken format, phone numbers, percentages
- User-defined custom replacements (exact string or regex)
- Worth stealing the currency/number patterns if Kokoro reads `$5,120` as "dollar sign five comma one two zero"

### SSML Conversion

Convert `**bold**` → `<emphasis>bold</emphasis>`, etc. Requires TTS engine SSML support. **Kokoro does NOT support SSML**: it takes plain text only. This approach is viable if we switch to a TTS engine with SSML support (e.g., Azure TTS, ElevenLabs). SSML is orthogonal to markdown stripping: it is for pronunciation control (pauses, emphasis), not syntax removal.

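
As a rough illustration of the SSML route, the sketch below maps bold text to the SSML `<emphasis>` element. It is a sketch under assumptions: `markdown_to_ssml` is a hypothetical helper, only bold is handled, and real engines usually expect extra attributes on `<speak>` (version, namespace, voice) that are omitted here.

```python
import re
from xml.sax.saxutils import escape

_BOLD = re.compile(r"\*\*(.+?)\*\*")


def markdown_to_ssml(text: str) -> str:
    """Wrap text in <speak> and turn **bold** into strong emphasis."""
    # Escape &, < and > first so literal characters cannot break the XML.
    escaped = escape(text)
    body = _BOLD.sub(r'<emphasis level="strong">\1</emphasis>', escaped)
    return f"<speak>{body}</speak>"


print(markdown_to_ssml("Gold is **up** & price > $5"))
```

Escaping before substitution matters here: a mid-sentence `>` or `&` would otherwise produce invalid XML, which is the SSML counterpart of the content-loss bugs described for the HTML pipeline.
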
### Prompt Engineering Best Practices (from LMNT, LiveKit)

- Explicitly state "your response will be spoken aloud" (biggest single impact)
- Use contractions and casual language in the prompt itself (the LLM mirrors style)
- Be redundant: repeat "no markdown" in multiple sections
- Provide before/after examples of desired vs undesired output
- Filler words guidance ("um", "well") for naturalness (optional)

## Key Sources

- [LiveKit filter_markdown](https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/voice/transcription/filters.py): production regex filter
- [Pipecat MarkdownTextFilter](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/utils/text/markdown_text_filter.py): HTML conversion approach (dangerous)
- [Open WebUI PR #7919](https://github.com/open-webui/open-webui/pull/7919): JS regex chain
- [LiveKit: Modifying LLM output before TTS](https://docs.livekit.io/recipes/chain-of-thought/)
- [LiveKit blog: Prompting voice agents](https://livekit.com/blog/prompting-voice-agents-to-sound-more-realistic/)
- [LMNT LLM prompting guide](https://docs.lmnt.com/guides/llm-prompt)
- [Vapi voice formatting plan](https://docs.vapi.ai/assistants/voice-formatting-plan)
- [OpenAI community: preventing markdown](https://community.openai.com/t/how-to-prevent-gpt-from-outputting-responses-in-markdown-format/961314)

## Decision

**Phase 1 (NOW):** System prompt instruction only: tell Qwen3.5-9B not to use markdown. Zero overhead, no content loss risk. Monitor compliance.

**Phase 2 (IF NEEDED):** Add a lightweight regex safety net (Approach 3) as a Pipecat `BaseTextFilter`. Port the 5 most common patterns from LiveKit.

**Phase 3 (FUTURE):** If moving to a TTS with SSML support, convert markdown to SSML instead of stripping it: `**bold**` becomes `<emphasis>bold</emphasis>` for natural prosody.
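
Phase 1's compliance monitoring could be approximated with a detector that flags markdown in LLM output without modifying it. This is a hedged sketch: `markdown_violations` and the detector set are hypothetical, and the patterns are deliberately narrow so that plain text such as `price > $5000` or `Gold | Silver` is not flagged.

```python
import re
from typing import List

# Hypothetical detector set; each entry names one markdown habit to watch for.
_DETECTORS = {
    "bold": re.compile(r"\*\*.+?\*\*"),
    "heading": re.compile(r"^#{1,6}\s+", re.MULTILINE),
    "link": re.compile(r"\[[^\]]*\]\([^)]*\)"),
    "inline_code": re.compile(r"`[^`]+`"),
}


def markdown_violations(text: str) -> List[str]:
    """Return the names of markdown patterns found in text, in detector order."""
    return [name for name, pattern in _DETECTORS.items() if pattern.search(text)]


print(markdown_violations("# Summary\n**Gold** is up."))
print(markdown_violations("Gold: $5,120 | Silver: $32, price > $5000"))
```

Logging the returned names per response would give a simple compliance rate for the deployed prompt, which is the signal needed to decide whether Phase 2 is warranted.
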