Feature/tts voice management #105

Merged
cameron merged 9 commits from feature/tts-voice-management into master 2026-06-13 02:01:37 +00:00
Owner
No description provided.
cameron added 9 commits 2026-06-13 02:01:17 +00:00
- DELETE /tts/voices/{name}: remove a cloned voice via the llama-swap
  passthrough (upstream chatterbox-tts-api exposes DELETE /voices/{name}).
- POST/GET/DELETE /tts/speech/jobs: durable job flow for long syntheses —
  dispatch returns 202 + job id, the synth queues on the GPU permit instead
  of fast-failing 429, and clients poll for the result (kept ~10 min).
- GET /tts/voices now serves an in-memory cache so listing voices doesn't
  make llama-swap spin up the TTS model (evicting the resident LLM);
  invalidated on create/delete, ?refresh=1 forces an upstream re-query.
- Created voice names are tagged with LLAMA_SWAP_TTS_REF_SECONDS (e.g.
  grandma-30s) so the library shows which ref length produced each clone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
llama-swap runs chat/vision/Chatterbox as a mutually-exclusive set on
one GPU and HOLDS a request for a non-resident model until the resident
model drains, then swaps. That hold burned the holder's reqwest timeout
(measured: a queued TTS lost 77s behind one LLM turn; an LLM request
behind a synthesis waited the entire remaining synth), so concurrent
insight + read-aloud timed out instead of queueing.

ai::gpu adds a fair RwLock lease acquired before each request is sent,
so cross-model waits happen before the HTTP timeout starts: chat/vision
share the read lease, TTS synthesis and voice-library ops (which spin
Chatterbox up) take the write lease, and embeddings take none (the
embed slot is in llama-swap's always-resident group). Speech jobs now
flip queued->running only after acquiring the GPU, letting the client
anchor its poll deadline to that transition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Queries embedded via llama-swap were searching corpora embedded via
Ollama (measured: spaces diverged). Introduce LocalLlm — the local
Ollama + llama-swap pair with LLM_BACKEND dispatch baked in — and route
all embedding writers through it; anything embedding via a concrete
client reintroduces the bug.

- search_rag: embed the model's query verbatim (no metadata boilerplate),
  make date optional — no time-decay when omitted, so "when did X
  happen?" queries rank purely by similarity across all time
- reembed_embeddings bin: re-embed summaries / calendar / search /
  knowledge entities via the active backend, with old-new cosine report
  per table and truncate-and-retry for inputs over the embed server's
  physical batch size
- import_calendar, import_search_history: embed through LocalLlm
- search_messages / get_sms_messages: render sender → recipient so sent
  messages are attributable to a conversation
- insight job failures: store the one-line anyhow context chain ({:#})
  instead of the Debug dump the client was shown verbatim
- serialize env_dispatch tests behind a lock (parallel-runner flake)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The GPU lease keeps per-request reqwest budgets from burning behind a
cross-model swap, but the job-level INSIGHT_GENERATION_TIMEOUT_SECS
wall-clock started at spawn — an insight queued behind a running TTS
synthesis parked its first chat call on the lease and timed out
("timeout after 180s") before chatterbox even finished loading.

Acquire-and-drop an LLM read lease before starting the job clock in
both insight handlers: the wait for the GPU happens before the
timeout begins, mirroring the per-request lease semantics. Dropped
immediately — holding it across the generation would deadlock the
chat calls' own lease acquisitions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Trialing Qwen3-Embedding-0.6B (1024-dim, instruct-prefixed queries)
against nomic required code changes at every hardcoded seam; now it's a
config flip plus a reembed_embeddings run.

- EMBEDDING_DIM env (default 768) replaces every hardcoded dim check:
  daily summary / calendar / search / location DAOs, Ollama batch
  validation, reembed_embeddings
- entities gains the dim guard it never had — a wrong-dim vector
  silently kills dedup/recall (cosine over mismatched lengths is 0),
  so store None and warn instead
- embed_query / embed_document split with EMBED_QUERY_PREFIX /
  EMBED_DOCUMENT_PREFIX (literal \n expanded): retrieval models treat
  the two sides differently — nomic wants search_query:/search_document:,
  Qwen3 wants Instruct:...\nQuery: on queries only. All query-side
  call sites and all corpus writers now declare their side.
- document the contract in CLAUDE.md: change the model or any of these
  vars → re-run reembed_embeddings or search is garbage

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Models wrap the title line despite the prompt — "**Title: A Day in the
Woods**", "## Title: ...", bold around just the label — which made
parse_title_body's bare "Title:" prefix match fall through to the
fallbacks and leak asterisks into the stored title.

strip_title_markdown trims bold/italic markers, heading hashes,
backticks, and quotes from both ends; applied to the label line, the
extracted title, both fallback paths, and generate_photo_title (which
previously stripped only quotes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A JSON map (TTS_PRONUNCIATIONS_PATH, default tts_pronunciations.json)
rewrites mispronounced words — place names, initialisms, dotted
abbreviations — to phonetic spellings before synthesis, applied after
markdown cleanup in both /tts/speech paths. Whole-word smartcase
matching (lowercase keys match any casing, uppercase keys exact),
longest key wins, hot-reloaded on mtime change with last-good fallback
on parse errors. See tts_pronunciations.example.json.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Both voice creation endpoints (upload + from-library) now accept optional
start_seconds/duration_seconds, threaded to ffmpeg as -ss/-t, so the
reference window can target clean speech anywhere in a long recording
instead of always the first N seconds. Duration is clamped to the
LLAMA_SWAP_TTS_REF_SECONDS cap and the voice-name tag reflects the
actual window length.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Clones that don't start at 0:00 are tagged with where the reference
window begins (grandma-at1m32s-30s), so voices cloned from different
sections of the same source are distinguishable in the voice list.
Zero-start names keep the existing -30s form.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cameron merged commit 98274c3301 into master 2026-06-13 02:01:37 +00:00
cameron deleted branch feature/tts-voice-management 2026-06-13 02:01:38 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Apps/ImageApi#105