Add TTS voice deletion, async speech jobs, voice-list cache, ref-seconds name tags

- DELETE /tts/voices/{name}: remove a cloned voice via the llama-swap
  passthrough (upstream chatterbox-tts-api exposes DELETE /voices/{name}).
- POST/GET/DELETE /tts/speech/jobs: durable job flow for long syntheses.
  Dispatch returns 202 + job id, the synth queues on the GPU permit instead of
  fast-failing 429, and clients poll for the result (kept ~10 min).
- GET /tts/voices now serves an in-memory cache so listing voices doesn't make
  llama-swap spin up the TTS model (evicting the resident LLM); invalidated on
  create/delete, ?refresh=1 forces an upstream re-query.
- Created voice names are tagged with LLAMA_SWAP_TTS_REF_SECONDS (e.g.
  grandma-30s) so the library shows which ref length produced each clone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:36:15 -04:00
parent c78e751743
commit 03699f7413
5 changed files with 588 additions and 16 deletions
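
Reviewer note: the job flow in the commit message (dispatch, then poll until a terminal status) could be exercised by a client along the lines of the sketch below. It is not code from this commit. `poll_speech_job` and the `fetch_status` callable are hypothetical names; `fetch_status` is assumed to wrap `GET /tts/speech/jobs/{id}` and to return the parsed JSON body, or `None` on a 404. The status values and the `audio_base64` and `error` fields come from the README hunk below; the interval and timeout defaults are illustrative only.

```python
import base64
import time
from typing import Callable, Optional

# Terminal states, taken from the status list documented in the README hunk.
TERMINAL = {"done", "error", "cancelled"}


def poll_speech_job(
    fetch_status: Callable[[str], Optional[dict]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll a speech job until it finishes and return the decoded audio bytes.

    fetch_status is a hypothetical wrapper around GET /tts/speech/jobs/{id}:
    it returns the parsed JSON body, or None when the server answers 404
    (unknown job, or a result that has already been dropped from memory).
    """
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        if job is None:
            raise LookupError(f"job {job_id} not found (unknown or expired)")
        status = job["status"]
        if status == "done":
            return base64.b64decode(job["audio_base64"])
        if status in TERMINAL:
            # "error" or "cancelled": surface the server-side message, if any.
            raise RuntimeError(f"job {job_id} ended as {status}: {job.get('error')}")
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and lets it be tested without a running server. A real caller would pass a function that issues the GET and maps a 404 response to `None`.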
@@ -156,12 +156,26 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
  server-side (markdown + emoji stripped) and the generation knobs are clamped
  to Chatterbox's ranges. Synthesis is serialized (one at a time — the upstream
  has no GPU lock of its own); a concurrent request gets a fast `429`.
-- `GET /tts/voices` — list the voice library.
+- `POST /tts/speech/jobs` — durable variant for long syntheses: same body as
+  `/tts/speech`, returns `202 { job_id, status }` immediately. Jobs queue on the
+  GPU permit instead of fast-failing `429`.
+- `GET /tts/speech/jobs/{id}` — poll a job: `{ job_id, status, format,
+  audio_base64?, error? }` with status `queued|running|done|error|cancelled`.
+  Results are kept in memory ~10 min after completion, then the job 404s.
+- `DELETE /tts/speech/jobs/{id}` — cancel a queued/running job.
+- `GET /tts/voices` — list the voice library. Served from an in-memory cache
+  (so the listing doesn't make llama-swap spin up the TTS model and evict the
+  resident LLM); pass `?refresh=1` to force an upstream re-query. The cache is
+  invalidated by voice create/delete.
 - `POST /tts/voices/upload` — multipart `voice_name` + `voice_file`; clone a
  voice from an uploaded clip (≤25 MB).
 - `POST /tts/voices/from-library` — body `{ voice_name, path, library? }`; clone
  from a library file (audio forwarded as-is; video has its audio extracted via
  ffmpeg).
+- `DELETE /tts/voices/{name}` — remove a cloned voice from the library.
+
+Created voice names are tagged with the ref-clip cap in effect (e.g.
+`grandma-30s`) so the library shows which reference length produced each clone.

 Env:
 - `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]