Add user-configurable TTS pronunciation overrides

A JSON map (TTS_PRONUNCIATIONS_PATH, default tts_pronunciations.json) rewrites mispronounced words (place names, initialisms, dotted abbreviations) to phonetic spellings before synthesis. It is applied after markdown cleanup in both /tts/speech paths.

Matching is whole-word and smartcase: lowercase keys match any casing, uppercase keys match exactly, and the longest key wins. The file is hot-reloaded on mtime change, falling back to the last good map on parse errors. See tts_pronunciations.example.json.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 23:06:18 -04:00
parent 3fa4fa8501
commit 2e0f78aa1b
7 changed files with 319 additions and 3 deletions
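
The matching rules named in the commit message (whole-word, smartcase, longest key wins) can be illustrated with a minimal std-only sketch. This is not the code in `src/ai/pronunciation.rs`; the function names `apply_overrides` and `matches_at` are hypothetical. It assumes that "uppercase key" means a key containing at least one uppercase letter, that a word boundary is any non-alphanumeric character, and that case folding is ASCII-only.

```rust
/// Hypothetical sketch of the override pass; names and details are assumptions.
/// `overrides` holds (key, phonetic replacement) pairs, as loaded from the JSON map.
fn apply_overrides(text: &str, overrides: &[(&str, &str)]) -> String {
    // Longest key first, so "new york" is tried before "new".
    let mut rules: Vec<(&str, &str)> = overrides.to_vec();
    rules.sort_by(|a, b| b.0.len().cmp(&a.0.len()));

    let mut out = String::with_capacity(text.len());
    let mut i = 0; // byte offset, always on a char boundary
    let mut prev: Option<char> = None;

    while i < text.len() {
        let rest = &text[i..];
        // A match may only start where the previous char is not alphanumeric.
        let at_word_start = prev.map_or(true, |c| !c.is_alphanumeric());
        let hit = if at_word_start {
            rules.iter().find(|rule| matches_at(rest, rule.0))
        } else {
            None
        };

        match hit {
            Some(&(key, replacement)) => {
                out.push_str(replacement);
                prev = rest[..key.len()].chars().last();
                i += key.len();
            }
            None => {
                // No rule applies here: copy one char through unchanged.
                let c = rest.chars().next().unwrap();
                out.push(c);
                prev = Some(c);
                i += c.len_utf8();
            }
        }
    }
    out
}

/// True if `key` matches at the start of `rest` and ends on a word boundary.
fn matches_at(rest: &str, key: &str) -> bool {
    if key.is_empty() {
        return false;
    }
    // `get` returns None if the slice would be out of range or split a char.
    let candidate = match rest.get(..key.len()) {
        Some(c) => c,
        None => return false,
    };
    // Smartcase (assumed reading): any uppercase letter in the key forces an exact match.
    let exact = key.chars().any(|c| c.is_uppercase());
    let same = if exact {
        candidate == key
    } else {
        candidate.eq_ignore_ascii_case(key)
    };
    let ends_word = rest[key.len()..]
        .chars()
        .next()
        .map_or(true, |c| !c.is_alphanumeric());
    same && ends_word
}
```

Under these assumptions, a lowercase key such as `worcester` would rewrite `Worcester`, while an uppercase key such as `US` would leave the pronoun `us` untouched. Scanning left to right and trying keys longest-first is one simple way to get "longest key wins"; the real implementation may use a different strategy.
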
@@ -153,7 +153,8 @@ behind the same llama-swap proxy. Only requires `LLAMA_SWAP_URL` (the TTS client
 is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
 - `POST /tts/speech` — body `{ text, voice?, format?, exaggeration?, cfg_weight?,
  temperature? }`; returns `{ audio_base64, format }`. Input is cleaned
-  server-side (markdown + emoji stripped) and the generation knobs are clamped
+  server-side (markdown + emoji stripped, then pronunciation overrides applied —
+  see below) and the generation knobs are clamped
  to Chatterbox's ranges. Synthesis is serialized (one at a time — the upstream
  has no GPU lock of its own); a concurrent request gets a fast `429`.
 - `POST /tts/speech/jobs` — durable variant for long syntheses: same body as
@@ -177,7 +178,14 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
 Created voice names are tagged with the ref-clip cap in effect (e.g.
 `grandma-30s`) so the library shows which reference length produced each clone.

+Words the model mispronounces (place names, initialisms) can be rewritten
+before synthesis via a JSON map — copy `tts_pronunciations.example.json` to
+`tts_pronunciations.json` and edit; changes apply without a restart. Full
+matching rules are documented in `src/ai/pronunciation.rs`.
+
 Env:
+- `TTS_PRONUNCIATIONS_PATH` - pronunciation-override JSON file
+  [default: `tts_pronunciations.json` in the working directory]
 - `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
 - `LLAMA_SWAP_TTS_VOICE` - default voice used when a `/tts/speech` request omits `voice` (optional)
 - `LLAMA_SWAP_TTS_REF_SECONDS` - max voice-clone reference clip length in seconds