Add user-configurable TTS pronunciation overrides
A JSON map (TTS_PRONUNCIATIONS_PATH, default tts_pronunciations.json)
rewrites mispronounced words (place names, initialisms, dotted
abbreviations) to phonetic spellings before synthesis, applied after
markdown cleanup in both /tts/speech paths.

Whole-word smartcase matching (lowercase keys match any casing, uppercase
keys exact), longest key wins, hot-reloaded on mtime change with last-good
fallback on parse errors.

See tts_pronunciations.example.json.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
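The matching rules in the commit message (whole-word, smartcase, longest key wins) can be sketched as below. This is an illustration under stated assumptions, not the code in `src/ai/pronunciation.rs`: the names `is_word_char`, `match_at` and `apply_overrides` are hypothetical, "uppercase key" is read as "key containing at least one uppercase letter", and a word character is assumed to be alphanumeric or `_`.

```rust
use std::collections::HashMap;

/// Assumption: a "word" character is alphanumeric or '_'.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Smartcase prefix match. Returns how many bytes of `rest` the key covers.
/// A key containing an uppercase letter must match exactly; an
/// all-lowercase key matches any casing.
fn match_at(rest: &str, key: &str) -> Option<usize> {
    let exact = key.chars().any(char::is_uppercase);
    let mut consumed = 0;
    let mut text_chars = rest.chars();
    for k in key.chars() {
        let t = text_chars.next()?;
        let same = if exact {
            t == k
        } else {
            t.to_lowercase().eq(k.to_lowercase())
        };
        if !same {
            return None;
        }
        consumed += t.len_utf8();
    }
    Some(consumed)
}

/// Rewrites whole-word occurrences of map keys to their phonetic spelling.
pub fn apply_overrides(text: &str, map: &HashMap<String, String>) -> String {
    // Longest key first, so "la jolla" is tried before "la".
    let mut keys: Vec<&String> = map.keys().collect();
    keys.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a.cmp(b)));

    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    let mut prev: Option<char> = None;
    'outer: while i < text.len() {
        let rest = &text[i..];
        // A match may only start where the previous char is not a word char.
        if prev.map_or(true, |c| !is_word_char(c)) {
            for key in &keys {
                if let Some(n) = match_at(rest, key) {
                    let next = rest[n..].chars().next();
                    // ...and must end where the next char is not a word char.
                    if n > 0 && next.map_or(true, |c| !is_word_char(c)) {
                        out.push_str(&map[*key]);
                        prev = rest[..n].chars().last();
                        i += n;
                        continue 'outer;
                    }
                }
            }
        }
        let c = rest.chars().next().expect("rest is non-empty");
        out.push(c);
        prev = Some(c);
        i += c.len_utf8();
    }
    out
}
```

With a map holding both `la` and `la jolla`, the input `Visit La Jolla` is rewritten through the longer key only, and a key such as `SQL` leaves a lowercase `sql` in the text untouched.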
@@ -153,7 +153,8 @@ behind the same llama-swap proxy. Only requires `LLAMA_SWAP_URL` (the TTS client
 is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
 - `POST /tts/speech` — body `{ text, voice?, format?, exaggeration?, cfg_weight?,
 temperature? }`; returns `{ audio_base64, format }`. Input is cleaned
-server-side (markdown + emoji stripped) and the generation knobs are clamped
+server-side (markdown + emoji stripped, then pronunciation overrides applied —
+see below) and the generation knobs are clamped
 to Chatterbox's ranges. Synthesis is serialized (one at a time — the upstream
 has no GPU lock of its own); a concurrent request gets a fast `429`.
 - `POST /tts/speech/jobs` — durable variant for long syntheses: same body as
@@ -177,7 +178,14 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
 Created voice names are tagged with the ref-clip cap in effect (e.g.
 `grandma-30s`) so the library shows which reference length produced each clone.
 
+Words the model mispronounces (place names, initialisms) can be rewritten
+before synthesis via a JSON map — copy `tts_pronunciations.example.json` to
+`tts_pronunciations.json` and edit; changes apply without a restart. Full
+matching rules are documented in `src/ai/pronunciation.rs`.
+
 Env:
+- `TTS_PRONUNCIATIONS_PATH` - pronunciation-override JSON file
+[default: `tts_pronunciations.json` in the working directory]
 - `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
 - `LLAMA_SWAP_TTS_VOICE` - default voice used when a `/tts/speech` request omits `voice` (optional)
 - `LLAMA_SWAP_TTS_REF_SECONDS` - max voice-clone reference clip length in seconds
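The "changes apply without a restart" behaviour (reload when the file's mtime changes, keep the last good map if the new contents fail to parse) could look roughly like the sketch below. It is not the commit's implementation: `HotReloadMap` and `current` are hypothetical names, the parser is passed in as a parameter so the example stays standard-library only (the real code presumably uses a JSON library), and treating a missing file as "keep whatever is loaded" is an assumption.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;
use std::time::SystemTime;

type Map = HashMap<String, String>;

/// Hypothetical holder for a pronunciation map that follows a file on disk.
pub struct HotReloadMap<P>
where
    P: Fn(&str) -> Result<Map, String>,
{
    path: PathBuf,
    parse: P,
    last_mtime: Option<SystemTime>,
    current: Map,
}

impl<P> HotReloadMap<P>
where
    P: Fn(&str) -> Result<Map, String>,
{
    pub fn new(path: impl Into<PathBuf>, parse: P) -> Self {
        Self {
            path: path.into(),
            parse,
            last_mtime: None,
            current: Map::new(),
        }
    }

    /// Returns the map to use now, re-reading the file first if its
    /// modification time differs from the one seen on the last attempt.
    pub fn current(&mut self) -> &Map {
        let mtime = fs::metadata(&self.path).and_then(|m| m.modified()).ok();
        if mtime.is_some() && mtime != self.last_mtime {
            // Remember the mtime even if parsing fails, so a broken file
            // is not re-parsed on every request.
            self.last_mtime = mtime;
            let loaded = fs::read_to_string(&self.path)
                .map_err(|e| e.to_string())
                .and_then(|s| (self.parse)(&s));
            match loaded {
                Ok(map) => self.current = map,
                // Last-good fallback: keep the previous map untouched.
                Err(err) => eprintln!("pronunciation map not reloaded: {}", err),
            }
        }
        &self.current
    }
}
```

Checking the mtime on each call keeps the sketch free of background threads or file watchers; whether the actual code polls per request or on a timer is not stated in the commit.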