Feature/tts voice management #105
- DELETE /tts/voices/{name}: remove a cloned voice via the llama-swap passthrough (upstream chatterbox-tts-api exposes DELETE /voices/{name}).
- POST/GET/DELETE /tts/speech/jobs: durable job flow for long syntheses. Dispatch returns 202 + job id, the synth queues on the GPU permit instead of fast-failing 429, and clients poll for the result (kept ~10 min).
- GET /tts/voices now serves an in-memory cache so listing voices doesn't make llama-swap spin up the TTS model (evicting the resident LLM); invalidated on create/delete, ?refresh=1 forces an upstream re-query.
- Created voice names are tagged with LLAMA_SWAP_TTS_REF_SECONDS (e.g. grandma-30s) so the library shows which ref length produced each clone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Queries embedded via llama-swap were searching corpora embedded via Ollama (measured: spaces diverged). Introduce LocalLlm, the local Ollama + llama-swap pair with LLM_BACKEND dispatch baked in, and route all embedding writers through it; anything embedding via a concrete client reintroduces the bug.

- search_rag: embed the model's query verbatim (no metadata boilerplate), make date optional. No time-decay when omitted, so "when did X happen?" queries rank purely by similarity across all time.
- reembed_embeddings bin: re-embed summaries / calendar / search / knowledge entities via the active backend, with old-new cosine report per table and truncate-and-retry for inputs over the embed server's physical batch size.
- import_calendar, import_search_history: embed through LocalLlm.
- search_messages / get_sms_messages: render sender → recipient so sent messages are attributable to a conversation.
- insight job failures: store the one-line anyhow context chain ({:#}) instead of the Debug dump the client was shown verbatim.
- serialize env_dispatch tests behind a lock (parallel-runner flake).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The GPU lease keeps per-request reqwest budgets from burning behind a cross-model swap, but the job-level INSIGHT_GENERATION_TIMEOUT_SECS wall-clock started at spawn: an insight queued behind a running TTS synthesis parked its first chat call on the lease and timed out ("timeout after 180s") before chatterbox even finished loading.

Acquire-and-drop an LLM read lease before starting the job clock in both insight handlers: the wait for the GPU happens before the timeout begins, mirroring the per-request lease semantics. Dropped immediately, because holding it across the generation would deadlock the chat calls' own lease acquisitions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
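
The acquire-and-drop ordering from the last commit can be illustrated with a minimal synchronous sketch. It is not the code in this branch: `GpuLease`, `hold_exclusive`, `wait_until_available`, and `run_with_budget` are hypothetical names, a `std::sync::RwLock` stands in for the real lease, and the budget is checked after the job finishes rather than by cancelling it (the real handlers presumably use an async timeout).

```rust
use std::sync::{RwLock, RwLockWriteGuard};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the GPU lease: LLM calls take read leases,
/// a cross-model swap (e.g. a running TTS synthesis) takes the write side.
pub struct GpuLease(RwLock<()>);

impl GpuLease {
    pub fn new() -> Self {
        GpuLease(RwLock::new(()))
    }

    /// Exclusive hold, as a model swap would take it (assumption).
    pub fn hold_exclusive(&self) -> RwLockWriteGuard<'_, ()> {
        self.0.write().unwrap()
    }

    /// Block until a read lease can be acquired, then drop it at once.
    /// Keeping the guard across the job could deadlock the job's own
    /// read acquisitions if a writer queues up in between.
    pub fn wait_until_available(&self) {
        let guard = self.0.read().unwrap();
        drop(guard);
    }
}

/// Run `job` with a wall-clock budget that starts only after the GPU
/// is free, so time spent queued behind another model is not charged.
pub fn run_with_budget<T>(
    lease: &GpuLease,
    budget: Duration,
    job: impl FnOnce(&GpuLease) -> T,
) -> Result<T, String> {
    // Wait for the GPU first; the clock has not started yet.
    lease.wait_until_available();

    // Only now does the job-level budget begin.
    let started = Instant::now();
    let out = job(lease);

    // Sketch only: the budget is checked after the fact, not enforced
    // by cancelling the job.
    if started.elapsed() > budget {
        Err(format!("timeout after {}s", budget.as_secs()))
    } else {
        Ok(out)
    }
}
```

With this ordering, a job that waits 200 ms behind an exclusive holder but itself finishes quickly still fits a 100 ms budget, because the wait happens before `Instant::now()` is taken.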