Add user-configurable TTS pronunciation overrides

A JSON map (TTS_PRONUNCIATIONS_PATH, default tts_pronunciations.json)
rewrites mispronounced words — place names, initialisms, dotted
abbreviations — to phonetic spellings before synthesis, applied after
markdown cleanup in both /tts/speech paths. Matching is whole-word and
smartcase (lowercase keys match any casing, uppercase keys match exactly),
and the longest key wins. The file is hot-reloaded on mtime change, falling
back to the last good map on parse errors. See tts_pronunciations.example.json.
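
The matching rules above can be illustrated with a minimal sketch. This is
not the code in src/ai/pronunciation.rs; `apply_overrides` and `match_len`
are hypothetical names, and the word-boundary rule (no alphanumeric
character directly before or after a match) is an assumption made for the
example.

```rust
use std::collections::HashMap;

/// Hypothetical helper: if `key` matches at the start of `rest`, return the
/// number of bytes of `rest` it covers. `prev` is the character before `rest`.
fn match_len(rest: &str, prev: Option<char>, key: &str) -> Option<usize> {
    // Assumed boundary rule: a match may not touch an alphanumeric neighbour.
    if key.is_empty() || prev.map_or(false, char::is_alphanumeric) {
        return None;
    }
    // Smartcase: a key containing any uppercase letter must match exactly;
    // an all-lowercase key matches regardless of the text's casing.
    let exact = key.chars().any(char::is_uppercase);
    let mut text_chars = rest.char_indices();
    let mut end = 0;
    for kc in key.chars() {
        let (idx, tc) = text_chars.next()?;
        let same = if exact {
            tc == kc
        } else {
            tc.to_lowercase().eq(kc.to_lowercase())
        };
        if !same {
            return None;
        }
        end = idx + tc.len_utf8();
    }
    if rest[end..].chars().next().map_or(false, char::is_alphanumeric) {
        return None;
    }
    Some(end)
}

/// Hypothetical entry point: rewrite every whole-word match in `text`.
fn apply_overrides(text: &str, map: &HashMap<String, String>) -> String {
    // Try longer keys first so "st. louis" beats "st." at the same position.
    let mut entries: Vec<(&str, &str)> =
        map.iter().map(|(k, v)| (k.as_str(), v.as_str())).collect();
    entries.sort_by(|a, b| b.0.len().cmp(&a.0.len()).then(a.0.cmp(b.0)));

    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < text.len() {
        let rest = &text[i..];
        let prev = text[..i].chars().next_back();
        let hit = entries
            .iter()
            .find_map(|&(k, v)| match_len(rest, prev, k).map(|n| (n, v)));
        match hit {
            Some((n, v)) => {
                out.push_str(v);
                i += n;
            }
            None => {
                // No key matches here: copy one character and move on.
                let c = rest.chars().next().expect("i is inside text");
                out.push(c);
                i += c.len_utf8();
            }
        }
    }
    out
}
```

With a map containing "nginx" and "SQL", this sketch rewrites both "Nginx"
and "NGINX" but leaves lowercase "sql" and the longer word "nginxy" alone.
It scans every key at every position, which is fine for a small map; a
larger one would likely want a precompiled matcher.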

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cameron Cordes
2026-06-11 23:06:18 -04:00
parent 3fa4fa8501
commit 2e0f78aa1b
7 changed files with 319 additions and 3 deletions
+9 -1
@@ -153,7 +153,8 @@ behind the same llama-swap proxy. Only requires `LLAMA_SWAP_URL` (the TTS client
is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
- `POST /tts/speech` — body `{ text, voice?, format?, exaggeration?, cfg_weight?,
temperature? }`; returns `{ audio_base64, format }`. Input is cleaned
-server-side (markdown + emoji stripped) and the generation knobs are clamped
+server-side (markdown + emoji stripped, then pronunciation overrides applied —
+see below) and the generation knobs are clamped
to Chatterbox's ranges. Synthesis is serialized (one at a time — the upstream
has no GPU lock of its own); a concurrent request gets a fast `429`.
- `POST /tts/speech/jobs` — durable variant for long syntheses: same body as
@@ -177,7 +178,14 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
Created voice names are tagged with the ref-clip cap in effect (e.g.
`grandma-30s`) so the library shows which reference length produced each clone.
+Words the model mispronounces (place names, initialisms) can be rewritten
+before synthesis via a JSON map — copy `tts_pronunciations.example.json` to
+`tts_pronunciations.json` and edit; changes apply without a restart. Full
+matching rules are documented in `src/ai/pronunciation.rs`.
Env:
+- `TTS_PRONUNCIATIONS_PATH` - pronunciation-override JSON file
+[default: `tts_pronunciations.json` in the working directory]
- `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
- `LLAMA_SWAP_TTS_VOICE` - default voice used when a `/tts/speech` request omits `voice` (optional)
- `LLAMA_SWAP_TTS_REF_SECONDS` - max voice-clone reference clip length in seconds