Add GPU lease coordinating LLM and TTS requests through llama-swap

llama-swap runs chat/vision/Chatterbox as a mutually-exclusive set on
one GPU and HOLDS a request for a non-resident model until the resident
model drains, then swaps. That hold burned the waiting caller's reqwest
timeout (measured: a queued TTS lost 77s behind one LLM turn; an LLM
request behind a synthesis waited out the rest of that synthesis), so
concurrent insight + read-aloud requests timed out instead of queueing.

ai::gpu adds a fair RwLock lease acquired before each request is sent,
so cross-model waits happen before the HTTP timeout starts: chat/vision
share the read lease, TTS synthesis and voice-library ops (which spin
Chatterbox up) take the write lease, and embeddings take none (the
embed slot is in llama-swap's always-resident group). Speech jobs now
flip queued->running only after acquiring the GPU, letting the client
anchor its poll deadline to that transition.
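
A minimal sketch of the ordering described above, not the actual ai::gpu
code. Assumptions: std::sync::RwLock and threads stand in for the fair
async lock (std makes no fairness guarantee, and the real lease is
awaited); llm_lease and timed_request are hypothetical names; tts_lease
mirrors the name in the diff, but its body here is invented.

```rust
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the process-wide GPU lease (assumption: the real one is
// a fair async RwLock, not std's blocking lock).
static GPU: RwLock<()> = RwLock::new(());

/// Shared lease: chat and vision requests can hold it concurrently.
fn llm_lease() -> RwLockReadGuard<'static, ()> {
    GPU.read().unwrap_or_else(|e| e.into_inner())
}

/// Exclusive lease: TTS synthesis and voice-library ops.
fn tts_lease() -> RwLockWriteGuard<'static, ()> {
    GPU.write().unwrap_or_else(|e| e.into_inner())
}

/// Simulates one request and returns the time the HTTP timeout would
/// see. The lease is taken first, so waiting on the other model family
/// happens before the clock starts.
fn timed_request<G>(lease: impl FnOnce() -> G, work: Duration) -> Duration {
    let _gpu = lease(); // cross-model wait happens here
    let sent = Instant::now(); // the timeout clock starts at "send"
    thread::sleep(work); // placeholder for the HTTP round trip
    sent.elapsed()
}

fn main() {
    // An LLM turn holds the shared lease while a TTS request arrives.
    let llm = llm_lease();
    let queued_at = Instant::now();
    let tts = thread::spawn(|| timed_request(tts_lease, Duration::from_millis(100)));
    thread::sleep(Duration::from_millis(300)); // the LLM turn is still running
    drop(llm); // turn finished, the GPU is free
    let on_clock = tts.join().unwrap();
    println!(
        "TTS waited {:?} in total; {:?} counted against its timeout",
        queued_at.elapsed(),
        on_clock
    );
}
```

In this sketch the TTS request spends most of its wall-clock time
blocked on the lease, and only the simulated send counts toward the
timeout, which is the property the lease is meant to provide.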

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cameron Cordes
2026-06-11 18:20:06 -04:00
parent 03699f7413
commit 0accc4ef2f
4 changed files with 125 additions and 1 deletions
+11 -1
@@ -378,6 +378,10 @@ pub async fn tts_speech_handler(
}));
};
// Wait for the LLM side to release the GPU before sending — the synthesis
// timeout starts at send, not here (see ai::gpu).
let _gpu = crate::ai::gpu::tts_lease().await;
match client
.text_to_speech(&text, voice, format, exaggeration, cfg_weight, temperature)
.await
@@ -495,7 +499,13 @@ pub async fn create_speech_job_handler(
return;
}
};
// Cancelled while queued — release the permit without synthesizing.
// Wait for the LLM side to release the GPU too (see ai::gpu) — only
// then does the job count as running. The synthesis timeout starts at
// the HTTP send below, so neither wait burns it, and the client can
// anchor its own deadline to the queued→running transition.
let _gpu = crate::ai::gpu::tts_lease().await;
// Cancelled while queued — release the permits without synthesizing.
let cancelled = with_job(job_id, |job| {
if job.status == TtsJobStatus::Queued {
job.status = TtsJobStatus::Running;