Normalize voice-clone reference audio to WAV via ffmpeg

Chatterbox validates the reference clip by file extension and rejects formats such as .aac and .opus. Always transcode the reference (upload bytes and library files alike) to mono 24 kHz WAV with ffmpeg before forwarding, so that any source format is accepted and the from-library audio and video paths are unified.

The reference length cap is now configurable via LLAMA_SWAP_TTS_REF_SECONDS (default 30). Chatterbox is zero-shot, so a clean ~10-20s clip is the sweet spot.

Drops the now-unused MIME guesser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
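For readers who want to see what the normalization step might look like, here is a minimal Python sketch. It is not the code from this commit: the function names (`ref_seconds`, `build_ffmpeg_cmd`, `normalize_reference`), the 16-bit PCM codec choice, and the fallback behaviour for invalid values are all assumptions made for illustration. Only the target format (mono, 24 kHz, WAV), the use of ffmpeg, and the LLAMA_SWAP_TTS_REF_SECONDS variable with its default of 30 come from the message above.

```python
import math
import os
import subprocess

DEFAULT_REF_SECONDS = 30.0  # default stated in the commit message


def ref_seconds(env=os.environ, default=DEFAULT_REF_SECONDS):
    """Read the reference length cap from the environment.

    Assumption: missing, non-numeric, non-finite, or non-positive values
    fall back to the default. The real handling may differ.
    """
    raw = env.get("LLAMA_SWAP_TTS_REF_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return default
    if not math.isfinite(value) or value <= 0:
        return default
    return value


def build_ffmpeg_cmd(src, dst, max_seconds):
    """Build an ffmpeg argument list that writes mono 24 kHz WAV."""
    return [
        "ffmpeg", "-y", "-nostdin", "-loglevel", "error",
        "-i", src,
        "-vn",                    # ignore any video stream (video library files)
        "-ac", "1",               # downmix to mono
        "-ar", "24000",           # resample to 24 kHz
        "-t", str(max_seconds),   # cap the output duration
        "-c:a", "pcm_s16le",      # assumption: 16-bit PCM samples
        "-f", "wav",
        dst,
    ]


def normalize_reference(src, dst, env=os.environ):
    """Transcode src to a WAV reference clip at dst; raises if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, ref_seconds(env)), check=True)
    return dst
```

Calling `normalize_reference("voice.opus", "voice.wav")` would then produce a file whose `.wav` extension passes an extension-based check regardless of the source format, provided ffmpeg is installed and can decode the input.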
2026-06-02 22:50:08 -04:00
parent 35c5ecb427
commit 62d517dcda
3 changed files with 79 additions and 78 deletions
@@ -87,6 +87,7 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
 # + voice cloning in the mobile app).
 # LLAMA_SWAP_TTS_MODEL=chatterbox        # TTS model id in config.yaml
 # LLAMA_SWAP_TTS_VOICE=m                 # default voice when a request omits one
+# LLAMA_SWAP_TTS_REF_SECONDS=30          # max voice-clone reference clip length (s)

 # ── AI Insights — sibling services (optional) ───────────────────────────
 # Apollo (places, face inference, CLIP encoders). Single-Apollo deploys