Add GPU lease coordinating LLM and TTS requests through llama-swap
llama-swap runs chat, vision, and Chatterbox as a mutually exclusive set on one GPU. A request for a non-resident model is held until the resident model drains, and only then does llama-swap swap. That hold counted against the caller's reqwest timeout (measured: a queued TTS request lost 77s behind one LLM turn; an LLM request behind a synthesis waited out the entire remaining synthesis), so concurrent insight + read-aloud timed out instead of queueing.

ai::gpu adds a fair RwLock lease that is acquired before each request is sent, so cross-model waits happen before the HTTP timeout starts:

- chat and vision share the read lease
- TTS synthesis and voice-library ops (which spin Chatterbox up) take the write lease
- embeddings take none (the embed slot is in llama-swap's always-resident group)

Speech jobs now flip queued->running only after acquiring the GPU, which lets the client anchor its poll deadline to that transition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
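The lease policy described in the message can be sketched roughly as below. This is a synchronous, standard-library approximation and not the code in ai::gpu: the real lease is awaited (`tts_lease().await`), and `Gpu`, `Workload`, `GpuLease`, `acquire`, and `try_acquire` are hypothetical names chosen for illustration. Note also that `std::sync::RwLock` does not promise a fair queueing policy, whereas the commit relies on a fair lock.

```rust
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError};

/// Kinds of upstream request (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Workload {
    Chat,
    Vision,
    Tts,
    VoiceLibrary,
    Embedding,
}

/// Guard that the caller keeps alive for the duration of one request.
pub enum GpuLease<'a> {
    /// Chat and vision may hold the GPU together.
    Shared(RwLockReadGuard<'a, ()>),
    /// TTS synthesis and voice-library ops need the GPU to themselves.
    Exclusive(RwLockWriteGuard<'a, ()>),
    /// Embeddings use an always-resident slot and take no lease.
    NotNeeded,
}

/// One lock per physical GPU (hypothetical wrapper type).
#[derive(Default)]
pub struct Gpu {
    lock: RwLock<()>,
}

impl Gpu {
    pub fn new() -> Self {
        Self::default()
    }

    /// Block until the lease for `w` is available.
    /// Call this BEFORE starting the timed HTTP request.
    pub fn acquire(&self, w: Workload) -> GpuLease<'_> {
        match w {
            Workload::Chat | Workload::Vision => {
                GpuLease::Shared(self.lock.read().unwrap_or_else(|e| e.into_inner()))
            }
            Workload::Tts | Workload::VoiceLibrary => {
                GpuLease::Exclusive(self.lock.write().unwrap_or_else(|e| e.into_inner()))
            }
            Workload::Embedding => GpuLease::NotNeeded,
        }
    }

    /// Non-blocking variant: `None` means the other side holds the GPU.
    pub fn try_acquire(&self, w: Workload) -> Option<GpuLease<'_>> {
        match w {
            Workload::Chat | Workload::Vision => match self.lock.try_read() {
                Ok(g) => Some(GpuLease::Shared(g)),
                Err(TryLockError::Poisoned(e)) => Some(GpuLease::Shared(e.into_inner())),
                Err(TryLockError::WouldBlock) => None,
            },
            Workload::Tts | Workload::VoiceLibrary => match self.lock.try_write() {
                Ok(g) => Some(GpuLease::Exclusive(g)),
                Err(TryLockError::Poisoned(e)) => Some(GpuLease::Exclusive(e.into_inner())),
                Err(TryLockError::WouldBlock) => None,
            },
            Workload::Embedding => Some(GpuLease::NotNeeded),
        }
    }
}
```

A caller would take the lease first and only then start the timed HTTP request, keeping the guard alive until the response arrives. Time spent waiting for the other model then never counts against the request timeout.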
+11
-1
@@ -378,6 +378,10 @@ pub async fn tts_speech_handler(
         }));
     };
 
+    // Wait for the LLM side to release the GPU before sending — the synthesis
+    // timeout starts at send, not here (see ai::gpu).
+    let _gpu = crate::ai::gpu::tts_lease().await;
+
     match client
         .text_to_speech(&text, voice, format, exaggeration, cfg_weight, temperature)
         .await
@@ -495,7 +499,13 @@ pub async fn create_speech_job_handler(
             return;
         }
     };
-    // Cancelled while queued — release the permit without synthesizing.
+    // Wait for the LLM side to release the GPU too (see ai::gpu) — only
+    // then does the job count as running. The synthesis timeout starts at
+    // the HTTP send below, so neither wait burns it, and the client can
+    // anchor its own deadline to the queued→running transition.
+    let _gpu = crate::ai::gpu::tts_lease().await;
+
+    // Cancelled while queued — release the permits without synthesizing.
     let cancelled = with_job(job_id, |job| {
         if job.status == TtsJobStatus::Queued {
             job.status = TtsJobStatus::Running;