Normalize voice-clone reference audio to WAV via ffmpeg

Chatterbox validates the reference clip by file extension and rejects formats like .aac/.opus. Always transcode the reference (upload bytes and library files alike) to mono 24 kHz WAV with ffmpeg before forwarding, so any source format is accepted and the from-library audio/video paths are unified.

The reference length cap is now configurable via LLAMA_SWAP_TTS_REF_SECONDS (default 30). Chatterbox is zero-shot, so a clean ~10-20s clip is the sweet spot.

Drops the now-unused mime guesser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 22:50:08 -04:00
parent 35c5ecb427
commit 62d517dcda
3 changed files with 79 additions and 78 deletions
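
Reviewer sketch (not the code from this commit): the normalization described in the message could look roughly like the following. The function names, the fallback to the default on an invalid or non-positive value, and the 16-bit PCM codec choice are assumptions made for illustration. Only the env var name, the default of 30, and the mono 24 kHz WAV target come from the commit message.

```python
import math
import os
import subprocess
from pathlib import Path

DEFAULT_REF_SECONDS = 30.0  # default stated in the commit message


def ref_seconds(env=None) -> float:
    """Read the reference-length cap from LLAMA_SWAP_TTS_REF_SECONDS.

    Assumption: invalid, non-finite, or non-positive values fall back
    to the default instead of raising.
    """
    env = os.environ if env is None else env
    raw = env.get("LLAMA_SWAP_TTS_REF_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_REF_SECONDS
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_REF_SECONDS
    return value


def ffmpeg_args(src, dst, seconds: float) -> list:
    """Build an ffmpeg command that writes a mono 24 kHz WAV capped at `seconds`."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-y",                 # overwrite the output file if it exists
        "-i", str(src),       # any container/codec ffmpeg can decode
        "-vn",                # drop video streams (library video files)
        "-ac", "1",           # mono
        "-ar", "24000",       # 24 kHz sample rate
        "-t", f"{seconds:g}", # cap the output duration
        "-c:a", "pcm_s16le",  # assumption: 16-bit PCM samples
        "-f", "wav",
        str(dst),
    ]


def normalize_reference(src, dst) -> Path:
    """Transcode `src` to a normalized WAV at `dst`; requires ffmpeg on PATH."""
    subprocess.run(ffmpeg_args(src, dst, ref_seconds()), check=True)
    return Path(dst)
```

Building the argument list in a separate function keeps the command testable without ffmpeg installed. Upload bytes would first need to be written to a temporary file (or piped to ffmpeg's stdin) before calling `normalize_reference`.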
@@ -165,6 +165,10 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
 Env:
 - `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
 - `LLAMA_SWAP_TTS_VOICE` - default voice used when a `/tts/speech` request omits `voice` (optional)
+- `LLAMA_SWAP_TTS_REF_SECONDS` - max voice-clone reference clip length in seconds
+  [default: `30`]. Reference audio is ffmpeg-normalized to mono 24 kHz WAV (so any
+  source format works). Chatterbox is zero-shot, so a clean ~10–20s sample is the
+  sweet spot; longer clips rarely help.

 #### Fallback Behavior
 - Primary server is tried first with 5-second connection timeout