Normalize voice-clone reference audio to WAV via ffmpeg

Chatterbox validates the reference clip by file extension and rejects formats like .aac/.opus. Always transcode the reference (upload bytes and library files alike) to mono 24 kHz WAV with ffmpeg before forwarding, so any source format is accepted and the from-library audio/video paths are unified.

The reference length cap is now configurable via LLAMA_SWAP_TTS_REF_SECONDS (default 30). Chatterbox is zero-shot, so a clean ~10-20s clip is the sweet spot.

Drops the now-unused mime guesser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -165,6 +165,10 @@ is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
Env:
- `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
- `LLAMA_SWAP_TTS_VOICE` - default voice used when a `/tts/speech` request omits `voice` (optional)
- `LLAMA_SWAP_TTS_REF_SECONDS` - max voice-clone reference clip length in seconds
  [default: `30`]. Reference audio is ffmpeg-normalized to mono 24 kHz WAV (so any
  source format works); Chatterbox is zero-shot, so a clean ~10–20s sample is the
  sweet spot, and a longer clip rarely helps.
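
The normalization described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `ref_seconds`, `build_ffmpeg_args`, and `normalize_reference` are hypothetical, the 16-bit PCM codec is an assumption (the text only specifies mono 24 kHz WAV), and falling back to the default on an invalid env value is likewise assumed.

```python
import os
import subprocess


def ref_seconds(default: int = 30) -> int:
    """Read the clip-length cap from LLAMA_SWAP_TTS_REF_SECONDS.

    Falling back to the default on a missing, non-numeric, or non-positive
    value is an assumption made for this sketch.
    """
    raw = os.environ.get("LLAMA_SWAP_TTS_REF_SECONDS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def build_ffmpeg_args(src: str, dst: str, max_seconds: int) -> list[str]:
    """Build an ffmpeg command that transcodes src to mono 24 kHz WAV."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-i", src,                # any container/codec ffmpeg can decode
        "-vn",                    # ignore video streams (video references)
        "-ac", "1",               # downmix to mono
        "-ar", "24000",           # resample to 24 kHz
        "-t", str(max_seconds),   # keep at most max_seconds of audio
        "-c:a", "pcm_s16le",      # 16-bit PCM (assumed sample format)
        "-f", "wav",
        dst,
    ]


def normalize_reference(src: str, dst: str) -> None:
    """Run ffmpeg; raises CalledProcessError if transcoding fails."""
    subprocess.run(
        build_ffmpeg_args(src, dst, ref_seconds()),
        check=True,
        capture_output=True,
    )
```

Calling `normalize_reference("voice.opus", "voice.wav")` would then produce a file with a `.wav` extension regardless of the source format, which is what an extension-based check needs.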
#### Fallback Behavior
- Primary server is tried first with a 5-second connection timeout