Add TTS voice deletion, async speech jobs, voice-list cache, ref-seconds name tags

- DELETE /tts/voices/{name}: remove a cloned voice via the llama-swap
  passthrough (upstream chatterbox-tts-api exposes DELETE /voices/{name}).
- POST/GET/DELETE /tts/speech/jobs: durable job flow for long syntheses.
  Dispatch returns 202 + job id, the synth queues on the GPU permit instead
  of fast-failing 429, and clients poll for the result (kept ~10 min).
- GET /tts/voices now serves an in-memory cache so listing voices doesn't
  make llama-swap spin up the TTS model (evicting the resident LLM);
  invalidated on create/delete, ?refresh=1 forces an upstream re-query.
- Created voice names are tagged with LLAMA_SWAP_TTS_REF_SECONDS (e.g.
  grandma-30s) so the library shows which ref length produced each clone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@@ -156,12 +156,26 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
server-side (markdown + emoji stripped) and the generation knobs are clamped
to Chatterbox's ranges. Synthesis is serialized (one at a time — the upstream
has no GPU lock of its own); a concurrent request gets a fast `429`.
- `GET /tts/voices` — list the voice library.
- `POST /tts/speech/jobs` — durable variant for long syntheses: same body as
`/tts/speech`, returns `202 { job_id, status }` immediately. Jobs queue on the
GPU permit instead of fast-failing `429`.
- `GET /tts/speech/jobs/{id}` — poll a job: `{ job_id, status, format,
audio_base64?, error? }` with status `queued|running|done|error|cancelled`.
Results are kept in memory ~10 min after completion, then the job 404s.
- `DELETE /tts/speech/jobs/{id}` — cancel a queued/running job.
- `GET /tts/voices` — list the voice library. Served from an in-memory cache
(so the listing doesn't make llama-swap spin up the TTS model and evict the
resident LLM); pass `?refresh=1` to force an upstream re-query. The cache is
invalidated by voice create/delete.
- `POST /tts/voices/upload` — multipart `voice_name` + `voice_file`; clone a
voice from an uploaded clip (≤25 MB).
- `POST /tts/voices/from-library` — body `{ voice_name, path, library? }`; clone
from a library file (audio forwarded as-is; video has its audio extracted via
ffmpeg).
- `DELETE /tts/voices/{name}` — remove a cloned voice from the library.
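
The job endpoints in the list above imply a dispatch, poll, decode loop on the client. The following is a minimal Python sketch of that loop, not code from this repository. The `send` transport callable, the `synthesize_via_job` name, the poll interval, and the timeout are all hypothetical; only the paths, the `202` dispatch status, the `status` values, and the `audio_base64`/`error` fields come from the README text.

```python
import base64
import time
from typing import Callable, Optional, Tuple

# Hypothetical transport: (method, path, json_body) -> (http_status, parsed_json).
# In a real client this would wrap whatever HTTP library is in use.
Transport = Callable[[str, str, Optional[dict]], Tuple[int, dict]]


def synthesize_via_job(
    send: Transport,
    body: dict,
    poll_interval: float = 1.0,   # assumed value, not from the README
    timeout: float = 300.0,       # assumed value, not from the README
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Dispatch a speech job, poll until it finishes, return the decoded audio."""
    status, created = send("POST", "/tts/speech/jobs", body)
    if status != 202:
        raise RuntimeError(f"dispatch failed with HTTP {status}")
    job_path = f"/tts/speech/jobs/{created['job_id']}"

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, job = send("GET", job_path, None)
        if status == 404:
            # Results are only kept for a limited time; after that the job 404s.
            raise RuntimeError("job expired or unknown")
        if job["status"] == "done":
            return base64.b64decode(job["audio_base64"])
        if job["status"] in ("error", "cancelled"):
            raise RuntimeError(f"job {job['status']}: {job.get('error')}")
        sleep(poll_interval)  # still queued or running

    # Gave up waiting: ask the server to cancel the queued/running job.
    send("DELETE", job_path, None)
    raise TimeoutError("job did not finish in time")
```

Passing the transport and `sleep` in as parameters keeps the sketch independent of any particular HTTP client and lets the loop be exercised without a running server.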
Created voice names are tagged with the ref-clip cap in effect (e.g.
`grandma-30s`) so the library shows which reference length produced each clone.
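
As a rough illustration of the naming rule above, the sketch below appends the ref-clip cap to a voice name. The `tag_voice_name` and `ref_seconds_from_env` helpers and the fallback of 30 seconds are assumptions for illustration; the README only establishes the `LLAMA_SWAP_TTS_REF_SECONDS` variable and the `grandma-30s` shape.

```python
from typing import Mapping


def ref_seconds_from_env(env: Mapping[str, str], default: int = 30) -> int:
    """Read the ref-clip cap; the default of 30 is an assumed placeholder."""
    return int(env.get("LLAMA_SWAP_TTS_REF_SECONDS", default))


def tag_voice_name(name: str, ref_seconds: int) -> str:
    """Suffix a voice name with the ref-clip cap, e.g. grandma -> grandma-30s."""
    return f"{name}-{ref_seconds}s"


# Example with an explicit mapping instead of the real process environment.
tagged = tag_voice_name("grandma", ref_seconds_from_env({"LLAMA_SWAP_TTS_REF_SECONDS": "30"}))
```

How the actual service treats a name that already carries a suffix is not stated in the README, so this sketch does not attempt to handle that case.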
Env:
- `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
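
The `GET /tts/voices` caching behaviour described in the endpoint list (serve from memory, `?refresh=1` forces an upstream re-query, create/delete invalidates) can be sketched as below. This is an assumed shape, not the project's implementation: the `VoiceListCache` class and the `fetch_upstream` callable are hypothetical, and the sketch is not thread-safe.

```python
from typing import Callable, List, Optional


class VoiceListCache:
    """In-memory cache for the voice list, refilled lazily from upstream."""

    def __init__(self, fetch_upstream: Callable[[], List[str]]):
        # fetch_upstream is a hypothetical callable that queries the TTS backend;
        # calling it is the expensive step that may make llama-swap load the model.
        self._fetch = fetch_upstream
        self._voices: Optional[List[str]] = None

    def list(self, refresh: bool = False) -> List[str]:
        """Return cached voices; hit upstream only when empty or refresh is set."""
        if refresh or self._voices is None:
            self._voices = list(self._fetch())
        return list(self._voices)  # copy so callers cannot mutate the cache

    def invalidate(self) -> None:
        """Drop the cached list; call after a voice is created or deleted."""
        self._voices = None
```

In this sketch the first listing after startup or after an invalidation still reaches upstream; whether the real service pre-populates or updates the cache in place is not stated in the README.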