Add GPU lease coordinating LLM and TTS requests through llama-swap

llama-swap runs chat/vision/Chatterbox as a mutually-exclusive set on one GPU and HOLDS a request for a non-resident model until the resident model drains, then swaps. That hold burned the holder's reqwest timeout (measured: a queued TTS lost 77s behind one LLM turn; an LLM request behind a synthesis waited the entire remaining synth), so concurrent insight + read-aloud timed out instead of queueing.

ai::gpu adds a fair RwLock lease acquired before each request is sent, so cross-model waits happen before the HTTP timeout starts: chat/vision share the read lease, TTS synthesis and voice-library ops (which spin Chatterbox up) take the write lease, and embeddings take none (the embed slot is in llama-swap's always-resident group).

Speech jobs now flip queued->running only after acquiring the GPU, letting the client anchor its poll deadline to that transition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:20:06 -04:00
parent 03699f7413
commit 0accc4ef2f
4 changed files with 125 additions and 1 deletions
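
The read/write split described in the commit message can be sketched roughly as below. This is a synchronous stand-in, not the contents of ai::gpu: the real module is async (callers `.await` the lease) and the message calls its lock fair, whereas `std::sync::RwLock` leaves the queueing policy to the platform. The names `GPU`, `llm_lease`, and `lease_demo` are hypothetical; only `tts_lease` appears in the diff.

```rust
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

// Hypothetical stand-in for ai::gpu: one process-wide lock guarding the GPU.
// Embedding requests would simply never touch it.
pub static GPU: RwLock<()> = RwLock::new(());

/// Shared lease (assumed name): chat and vision requests may overlap.
pub fn llm_lease() -> RwLockReadGuard<'static, ()> {
    // The lock guards no data, so a poisoned lock is safe to recover.
    GPU.read().unwrap_or_else(|e| e.into_inner())
}

/// Exclusive lease: TTS synthesis and voice-library operations.
pub fn tts_lease() -> RwLockWriteGuard<'static, ()> {
    GPU.write().unwrap_or_else(|e| e.into_inner())
}

/// Checks the three properties the lease relies on.
pub fn lease_demo() -> [bool; 3] {
    let tts_blocked_by_llm = {
        let _chat = llm_lease();
        let _vision = llm_lease(); // a second reader is admitted
        GPU.try_write().is_err() // a TTS request would have to wait
    };
    let llm_blocked_by_tts = {
        let _synth = tts_lease();
        GPU.try_read().is_err() // an LLM request would have to wait
    };
    // Dropping every guard leaves the GPU free again.
    let free_after_release = GPU.try_write().is_ok();
    [tts_blocked_by_llm, llm_blocked_by_tts, free_after_release]
}
```

Because the guard is taken before the HTTP request is built and sent, any time spent waiting for the other model family falls outside the request timeout, which is the point of the change.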
@@ -378,6 +378,10 @@ pub async fn tts_speech_handler(
        }));
    };

+    // Wait for the LLM side to release the GPU before sending — the synthesis
+    // timeout starts at send, not here (see ai::gpu).
+    let _gpu = crate::ai::gpu::tts_lease().await;
+
    match client
        .text_to_speech(&text, voice, format, exaggeration, cfg_weight, temperature)
        .await
@@ -495,7 +499,13 @@ pub async fn create_speech_job_handler(
                return;
            }
        };
-        // Cancelled while queued — release the permit without synthesizing.
+        // Wait for the LLM side to release the GPU too (see ai::gpu) — only
+        // then does the job count as running. The synthesis timeout starts at
+        // the HTTP send below, so neither wait burns it, and the client can
+        // anchor its own deadline to the queued→running transition.
+        let _gpu = crate::ai::gpu::tts_lease().await;
+
+        // Cancelled while queued — release the permits without synthesizing.
        let cancelled = with_job(job_id, |job| {
            if job.status == TtsJobStatus::Queued {
                job.status = TtsJobStatus::Running;