Normalize voice-clone reference audio to WAV via ffmpeg
Chatterbox validates the reference clip by file extension and rejects formats like .aac/.opus. Always transcode the reference (upload bytes and library files alike) to mono 24 kHz WAV with ffmpeg before forwarding, so any source format is accepted and the from-library audio/video paths are unified.

The reference length cap is now configurable via LLAMA_SWAP_TTS_REF_SECONDS (default 30). Chatterbox is zero-shot, so a clean ~10-20s clip is the sweet spot.

Drops the now-unused mime guesser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
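
A minimal sketch of the transcode step described in the message, not the code from this commit. The function names `build_ffmpeg_cmd` and `normalize_reference` are hypothetical, and 16-bit PCM output is an assumption (the message only specifies mono 24 kHz WAV). The ffmpeg flags themselves are standard ones.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: str, dst: str, max_seconds: float = 30.0) -> list[str]:
    """Build an ffmpeg command that writes a mono 24 kHz WAV capped at max_seconds.

    Hypothetical helper; the real commit may assemble its arguments differently.
    """
    return [
        "ffmpeg",
        "-nostdin",             # never wait on stdin when run from a server
        "-y",                   # overwrite dst if it already exists
        "-i", src,              # any container/codec ffmpeg can decode
        "-vn",                  # ignore video streams (library video files)
        "-t", str(max_seconds), # cap the output duration
        "-ac", "1",             # downmix to mono
        "-ar", "24000",         # resample to 24 kHz
        "-c:a", "pcm_s16le",    # assumption: 16-bit PCM samples
        "-f", "wav",
        dst,
    ]


def normalize_reference(src: str, dst: str, max_seconds: float = 30.0) -> Path:
    """Run ffmpeg and return the path of the normalized WAV file."""
    subprocess.run(
        build_ffmpeg_cmd(src, dst, max_seconds),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # keep ffmpeg's log output off the console
    )
    return Path(dst)
```

Because the output always ends in `.wav`, an extension-based check on the receiving side sees the same format regardless of what was uploaded or picked from the library.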
@@ -87,6 +87,7 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
# + voice cloning in the mobile app).
# LLAMA_SWAP_TTS_MODEL=chatterbox # TTS model id in config.yaml
# LLAMA_SWAP_TTS_VOICE=m # default voice when a request omits one
# LLAMA_SWAP_TTS_REF_SECONDS=30 # max voice-clone reference clip length (s)
# ── AI Insights — sibling services (optional) ───────────────────────────
# Apollo (places, face inference, CLIP encoders). Single-Apollo deploys
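
For the LLAMA_SWAP_TTS_REF_SECONDS setting in the hunk above, a reader might resolve the cap roughly as follows. This is a sketch under assumptions: the helper name `ref_seconds_cap` is hypothetical, and falling back to the default on missing, non-numeric, or non-positive values is a guess at sensible behavior, not something the commit states.

```python
import math
import os

DEFAULT_REF_SECONDS = 30.0  # default documented in the env example


def ref_seconds_cap(env=None) -> float:
    """Return the reference-clip cap in seconds, falling back to the default.

    Hypothetical helper; the fallback rules are assumptions.
    """
    env = os.environ if env is None else env
    raw = env.get("LLAMA_SWAP_TTS_REF_SECONDS", "").strip()
    try:
        value = float(raw)
    except ValueError:
        # Unset, empty, or not a number
        return DEFAULT_REF_SECONDS
    if not math.isfinite(value) or value <= 0:
        # Reject nan, inf, zero, and negative caps
        return DEFAULT_REF_SECONDS
    return value
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in isolation.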