Serialize /tts/speech with a single permit; 429 when busy

The Chatterbox wrapper has no internal lock or cancellation, so concurrent
synth requests contend for the single GPU, and abandoned (timed-out) jobs
keep running and slow every request stacked behind them. Gate synthesis
behind a one-permit semaphore and fast-fail concurrent requests with 429
instead of queueing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cameron Cordes
2026-06-03 14:02:56 -04:00
parent d8dd260c6b
commit cab867da60
2 changed files with 19 additions and 1 deletions
+2 -1
@@ -154,7 +154,8 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
- `POST /tts/speech` — body `{ text, voice?, format?, exaggeration?, cfg_weight?,
temperature? }`; returns `{ audio_base64, format }`. Input is cleaned
server-side (markdown + emoji stripped) and the generation knobs are clamped
-to Chatterbox's ranges.
+to Chatterbox's ranges. Synthesis is serialized (one at a time — the upstream
+has no GPU lock of its own); a concurrent request gets a fast `429`.
- `GET /tts/voices` — list the voice library.
- `POST /tts/voices/upload` — multipart `voice_name` + `voice_file`; clone a
voice from an uploaded clip (≤25 MB).
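
The gating described in the commit message can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code from this commit (the hunk above only covers the README change). `handle_speech`, `_synth_permit`, and the `synthesize` callback are hypothetical names, and the `(status, body)` tuple stands in for a real HTTP response.

```python
import threading
from typing import Callable, Tuple

# Hypothetical module-level permit: one synthesis may run at a time.
_synth_permit = threading.BoundedSemaphore(1)


def handle_speech(text: str, synthesize: Callable[[str], bytes]) -> Tuple[int, bytes]:
    """Run one synthesis and return (status, body).

    Returns 429 immediately if another synthesis already holds the permit.
    """
    # Non-blocking acquire: fail fast instead of queueing behind the running job.
    if not _synth_permit.acquire(blocking=False):
        return 429, b"synthesis busy, retry later"
    try:
        return 200, synthesize(text)
    finally:
        # Release even if synthesize() raises, so one failure cannot wedge the endpoint.
        _synth_permit.release()
```

Rejecting rather than queueing keeps the number of in-flight jobs at one, so a client that times out cannot leave a backlog of abandoned work on the GPU. The caller is expected to retry after a `429`.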