ImageApi

Author	SHA1	Message	Date
Cameron Cordes	1017fe73af	Include start offset in voice-name window tag Clones that don't start at 0:00 are tagged with where the reference window begins (grandma-at1m32s-30s), so voices cloned from different sections of the same source are distinguishable in the voice list. Zero-start names keep the existing -30s form. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 16:21:41 -04:00
Cameron Cordes	1dec34540d	Add start/duration window selection for voice-clone reference clips Both voice creation endpoints (upload + from-library) now accept optional start_seconds/duration_seconds, threaded to ffmpeg as -ss/-t, so the reference window can target clean speech anywhere in a long recording instead of always the first N seconds. Duration is clamped to the LLAMA_SWAP_TTS_REF_SECONDS cap and the voice-name tag reflects the actual window length. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 16:09:03 -04:00
Cameron Cordes	2e0f78aa1b	Add user-configurable TTS pronunciation overrides A JSON map (TTS_PRONUNCIATIONS_PATH, default tts_pronunciations.json) rewrites mispronounced words — place names, initialisms, dotted abbreviations — to phonetic spellings before synthesis, applied after markdown cleanup in both /tts/speech paths. Whole-word smartcase matching (lowercase keys match any casing, uppercase keys exact), longest key wins, hot-reloaded on mtime change with last-good fallback on parse errors. See tts_pronunciations.example.json. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 23:06:18 -04:00
Cameron Cordes	0accc4ef2f	Add GPU lease coordinating LLM and TTS requests through llama-swap llama-swap runs chat/vision/Chatterbox as a mutually-exclusive set on one GPU and HOLDS a request for a non-resident model until the resident model drains, then swaps. That hold burned the holder's reqwest timeout (measured: a queued TTS lost 77s behind one LLM turn; an LLM request behind a synthesis waited the entire remaining synth), so concurrent insight + read-aloud timed out instead of queueing. ai::gpu adds a fair RwLock lease acquired before each request is sent, so cross-model waits happen before the HTTP timeout starts: chat/vision share the read lease, TTS synthesis and voice-library ops (which spin Chatterbox up) take the write lease, and embeddings take none (the embed slot is in llama-swap's always-resident group). Speech jobs now flip queued->running only after acquiring the GPU, letting the client anchor its poll deadline to that transition. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:20:06 -04:00
Cameron Cordes	03699f7413	Add TTS voice deletion, async speech jobs, voice-list cache, ref-seconds name tags - DELETE /tts/voices/{name}: remove a cloned voice via the llama-swap passthrough (upstream chatterbox-tts-api exposes DELETE /voices/{name}). - POST/GET/DELETE /tts/speech/jobs: durable job flow for long syntheses — dispatch returns 202 + job id, the synth queues on the GPU permit instead of fast-failing 429, and clients poll for the result (kept ~10 min). - GET /tts/voices now serves an in-memory cache so listing voices doesn't make llama-swap spin up the TTS model (evicting the resident LLM); invalidated on create/delete, ?refresh=1 forces an upstream re-query. - Created voice names are tagged with LLAMA_SWAP_TTS_REF_SECONDS (e.g. grandma-30s) so the library shows which ref length produced each clone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:36:15 -04:00
Cameron Cordes	412da2ce8e	Collapse blank lines to a single break in TTS text cleaning Chatterbox inserts a long pause — sometimes ~20s of silence — for each blank line it sees, and insight text is markdown full of paragraph breaks. clean_for_tts previously preserved paragraph structure (\n{3,} -> \n\n), so every paragraph boundary still reached the model as a double newline. Now any run of 2+ newlines, including whitespace-only blank lines, collapses to a single newline so the worst pause a break can cause is a normal line-break pause. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 09:12:43 -04:00
Cameron Cordes	cab867da60	Serialize /tts/speech with a single permit; 429 when busy The Chatterbox wrapper has no internal lock or cancellation, so concurrent synth requests contend on the single GPU and abandoned (timed-out) jobs cascade into stacked slowness. Gate synthesis behind a one-permit semaphore and fast-fail concurrent requests with 429 instead of queueing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 14:02:56 -04:00
Cameron Cordes	ccacfe1113	Instrument TTS handlers with OTel spans (codebase standard) Each /tts handler now opens an http.tts.* span via extract_context_from_request + global_tracer().start_with_context, sets Status::Ok / Status::error on every outcome, and records useful attributes (model, format, voice_name, byte counts) — matching the insight handlers. Prometheus request metrics were already covered by the app-wide actix-web-prom middleware. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 23:10:43 -04:00
Cameron Cordes	62d517dcda	Normalize voice-clone reference audio to WAV via ffmpeg Chatterbox validates the reference clip by file extension and rejects formats like .aac/.opus. Always transcode the reference (upload bytes and library files alike) to mono 24 kHz WAV with ffmpeg before forwarding, so any source format is accepted and the from-library audio/video paths are unified. The reference length cap is now configurable via LLAMA_SWAP_TTS_REF_SECONDS (default 30) — Chatterbox is zero-shot, so a clean ~10-20s clip is the sweet spot. Drops the now-unused mime guesser. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 22:50:08 -04:00
Cameron Cordes	51be5df214	Clean insight text for TTS and pass through Chatterbox tuning knobs /tts/speech now normalizes input before synthesis: unwraps markdown links/images to visible text, drops heading/list/blockquote/emphasis markers and URLs, strips emoji (which non-turbo Chatterbox mispronounces or skips), and collapses whitespace. Centralized in clean_for_tts so the app, WebUI, and curl all get clean audio. Bracketed tags are deliberately preserved for a future Turbo (paralinguistic) switch. Adds optional exaggeration / cfg_weight / temperature to the request, clamped to Chatterbox's documented ranges and forwarded on the speech body. Unit tests cover markdown/emoji/URL stripping and tag preservation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 22:15:05 -04:00
Cameron Cordes	69268d03fe	Add TTS endpoints backed by Chatterbox via llama-swap LlamaCppClient gains text_to_speech (OpenAI /audio/speech), list_voices and create_voice (voice library at the swap-root /upstream/<model>/voices passthrough), plus a tts_model slot configured via LLAMA_SWAP_TTS_MODEL (default "chatterbox"). New Claims-gated routes: - POST /tts/speech -> { audio_base64, format } for data: URI playback - GET /tts/voices -> voice library passthrough - POST /tts/voices/upload -> clone a voice from an uploaded clip (multipart) - POST /tts/voices/from-library -> clone from a library file (ffmpeg-extracts audio from video; audio forwarded as-is) Security: voice_name sanitized to [A-Za-z0-9_-] (it becomes an upstream filename), 25 MB upload cap, library refs restricted to real audio/video, path confined via is_valid_full_path. Adds is_audio_file + unit tests for the sanitizer, mime guesser, and swap-root derivation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 22:04:42 -04:00
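Note on voice names (commits 69268d03fe, 03699f7413, 1dec34540d, 1017fe73af): the messages describe a name restricted to [A-Za-z0-9_-] and tagged with the reference window, e.g. grandma-30s or grandma-at1m32s-30s. The following is a minimal sketch of that naming rule, not the repository's code. The function names, the integer-second parameters, the choice to drop (rather than replace) disallowed characters, and the at0m45s form for offsets under a minute are all assumptions made for illustration.

```rust
/// Keep only characters in [A-Za-z0-9_-]. Dropping everything else
/// (instead of replacing it) is an assumption of this sketch.
fn sanitize_voice_name(raw: &str) -> String {
    raw.chars()
        .filter(|c| c.is_ascii_alphanumeric() || *c == '_' || *c == '-')
        .collect()
}

/// Build a tagged voice name from a raw name and a reference window.
/// `cap_seconds` stands in for the LLAMA_SWAP_TTS_REF_SECONDS cap.
fn tagged_voice_name(
    raw: &str,
    start_seconds: u32,
    duration_seconds: u32,
    cap_seconds: u32,
) -> String {
    let base = sanitize_voice_name(raw);
    // The tag reflects the window actually used, so clamp first.
    let window = duration_seconds.min(cap_seconds);
    if start_seconds == 0 {
        // Zero-start names keep the plain "-30s" form.
        format!("{base}-{window}s")
    } else {
        let (m, s) = (start_seconds / 60, start_seconds % 60);
        format!("{base}-at{m}m{s}s-{window}s")
    }
}

fn main() {
    println!("{}", tagged_voice_name("grandma", 0, 30, 30)); // grandma-30s
    println!("{}", tagged_voice_name("grandma", 92, 45, 30)); // grandma-at1m32s-30s
}
```

Clamping before formatting keeps the tag honest when a caller asks for more than the cap: a 45 s request under a 30 s cap is labelled -30s, matching the window that would actually be cut.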

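Note on the GPU lease (commit 0accc4ef2f): the message describes chat/vision requests sharing a read lease while TTS synthesis and voice-library operations take the write lease, acquired before the HTTP request is sent. The sketch below shows only that read/write split, using the standard library's blocking RwLock. The real ai::gpu module is described as a fair lock and is presumably async; std::sync::RwLock makes no fairness promise, and the GpuLease type and its method names are hypothetical.

```rust
use std::sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Hypothetical lease handle; cloning shares the same underlying lock.
#[derive(Clone, Default)]
struct GpuLease(Arc<RwLock<()>>);

impl GpuLease {
    /// Shared lease: many chat/vision requests may hold it at once.
    fn shared(&self) -> RwLockReadGuard<'_, ()> {
        self.0.read().expect("lease poisoned")
    }

    /// Exclusive lease: TTS synthesis waits for all shared holders.
    fn exclusive(&self) -> RwLockWriteGuard<'_, ()> {
        self.0.write().expect("lease poisoned")
    }

    /// Non-blocking variants, used here only to observe the lock state.
    fn try_shared(&self) -> Option<RwLockReadGuard<'_, ()>> {
        self.0.try_read().ok()
    }

    fn try_exclusive(&self) -> Option<RwLockWriteGuard<'_, ()>> {
        self.0.try_write().ok()
    }
}

fn main() {
    let lease = GpuLease::default();
    {
        let _chat = lease.shared();
        // A second reader (e.g. vision) can join; a writer (TTS) cannot.
        assert!(lease.try_shared().is_some());
        assert!(lease.try_exclusive().is_none());
    } // _chat dropped here: the shared lease is released.
    {
        let _tts = lease.exclusive();
        // While synthesis holds the GPU, chat/vision must wait.
        assert!(lease.try_shared().is_none());
    }
    // The HTTP request would be sent only while a guard is held, so any
    // cross-model wait happens before the client timeout starts.
    println!("lease ok");
}
```

Embeddings take no lease in the commit's design, so they would simply bypass this type.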
11 Commits
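Note on blank-line collapsing (commit 412da2ce8e): the message says any run of two or more newlines, including whitespace-only blank lines, now collapses to a single newline. One way to get that effect without a regex is to drop every blank line and rejoin the rest. This is a sketch under that reading, not the body of clean_for_tts; the function name collapse_blank_lines is hypothetical, and unlike a pure substitution it also drops leading and trailing blank lines and the final newline.

```rust
/// Collapse paragraph breaks so that no blank line survives.
/// Lines that are empty or whitespace-only are removed; the remaining
/// lines are joined with exactly one '\n'.
fn collapse_blank_lines(text: &str) -> String {
    text.lines() // also splits on "\r\n"
        .filter(|line| !line.trim().is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let markdown = "First paragraph.\n\n\nSecond paragraph.\n  \t\nThird.";
    // Prints three consecutive lines with no blank lines between them.
    println!("{}", collapse_blank_lines(markdown));
}
```

Single line breaks pass through unchanged, so a break can cost no more than the ordinary line-break pause the commit message mentions.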