Normalize voice-clone reference audio to WAV via ffmpeg

Chatterbox validates the reference clip by file extension and rejects formats
like .aac/.opus. Always transcode the reference (upload bytes and library
files alike) to mono 24 kHz WAV with ffmpeg before forwarding, so any source
format is accepted and the from-library audio/video paths are unified.
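
A minimal sketch of the transcode step described above, not the code from this commit. The helper names (`build_ffmpeg_args`, `normalize_reference`) are hypothetical. The 16-bit PCM codec and applying the length cap inside ffmpeg via `-t` are assumptions, since the message only specifies mono 24 kHz WAV.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_args(src: Path, dst: Path, max_seconds: int = 30) -> list[str]:
    """Build an ffmpeg argv that transcodes src to mono 24 kHz WAV."""
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-i", str(src),          # any input ffmpeg can decode (.aac, .opus, video, ...)
        "-vn",                   # ignore video streams, keep audio only
        "-ac", "1",              # downmix to mono
        "-ar", "24000",          # resample to 24 kHz
        "-t", str(max_seconds),  # cap output duration (assumed to happen here)
        "-c:a", "pcm_s16le",     # 16-bit PCM (assumption; the commit only says WAV)
        "-f", "wav",
        str(dst),
    ]


def normalize_reference(src: Path, dst: Path, max_seconds: int = 30) -> Path:
    """Run ffmpeg and return dst; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(
        build_ffmpeg_args(src, dst, max_seconds),
        check=True,
        capture_output=True,
    )
    return dst
```

Because the output always ends in .wav, an extension-based check downstream sees the same format regardless of what was uploaded or picked from the library.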

The reference length cap is now configurable via LLAMA_SWAP_TTS_REF_SECONDS
(default 30). Chatterbox is zero-shot, so a clean ~10-20s clip is the sweet
spot. Also drop the now-unused mime guesser.
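
For illustration, one way the cap could be read from the environment. This is a sketch under assumptions, not the commit's implementation: `ref_seconds` is a hypothetical helper, and falling back to the default on a missing, non-numeric, or non-positive value is an assumed policy. Only the variable name and the default of 30 come from the message.

```python
import os

DEFAULT_REF_SECONDS = 30  # default stated in the commit message


def ref_seconds(env=None) -> int:
    """Return the reference clip cap in seconds from LLAMA_SWAP_TTS_REF_SECONDS."""
    env = os.environ if env is None else env
    raw = env.get("LLAMA_SWAP_TTS_REF_SECONDS", "").strip()
    try:
        value = int(raw)
    except ValueError:
        # Unset or non-numeric: use the default (assumed behavior).
        return DEFAULT_REF_SECONDS
    # Reject zero and negative caps (assumed behavior).
    return value if value > 0 else DEFAULT_REF_SECONDS
```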

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cameron Cordes
2026-06-02 22:50:08 -04:00
parent 35c5ecb427
commit 62d517dcda
3 changed files with 79 additions and 78 deletions
@@ -87,6 +87,7 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
# + voice cloning in the mobile app).
# LLAMA_SWAP_TTS_MODEL=chatterbox # TTS model id in config.yaml
# LLAMA_SWAP_TTS_VOICE=m # default voice when a request omits one
# LLAMA_SWAP_TTS_REF_SECONDS=30 # max voice-clone reference clip length (s)
# ── AI Insights — sibling services (optional) ───────────────────────────
# Apollo (places, face inference, CLIP encoders). Single-Apollo deploys