From 35c5ecb427f7118440a0095e8730b2aa82d0e8e8 Mon Sep 17 00:00:00 2001
From: Cameron Cordes
Date: Tue, 2 Jun 2026 22:34:34 -0400
Subject: [PATCH] Document TTS endpoints and env in README + .env.example

Adds the /tts/speech and /tts/voices* endpoints plus LLAMA_SWAP_TTS_MODEL /
LLAMA_SWAP_TTS_VOICE (TTS only needs LLAMA_SWAP_URL, not LLM_BACKEND=llamacpp).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .env.example |  8 ++++++++
 README.md    | 19 +++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/.env.example b/.env.example
index f7a1004..835bef5 100644
--- a/.env.example
+++ b/.env.example
@@ -80,6 +80,14 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
 # LLAMA_SWAP_ALLOWED_MODELS=chat,vision,embed
 # LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180
 
+# ── Text-to-speech (optional, requires LLAMA_SWAP_URL) ───────────────────
+# TTS routes through the same llama-swap proxy (a Chatterbox model id), so it
+# only needs LLAMA_SWAP_URL — it does NOT require LLM_BACKEND=llamacpp.
+# Powers POST /tts/speech and the /tts/voices* endpoints (read-aloud insights
+# + voice cloning in the mobile app).
+# LLAMA_SWAP_TTS_MODEL=chatterbox # TTS model id in config.yaml
+# LLAMA_SWAP_TTS_VOICE=m # default voice when a request omits one
+
 # ── AI Insights — sibling services (optional) ───────────────────────────
 # Apollo (places, face inference, CLIP encoders). Single-Apollo deploys
 # typically set only APOLLO_API_BASE_URL and let the face + CLIP
diff --git a/README.md b/README.md
index b6d764b..12c220f 100644
--- a/README.md
+++ b/README.md
@@ -147,6 +147,25 @@ so you can rewrite the saved summary from within chat.
 - `AGENTIC_CHAT_MAX_ITERATIONS` - Cap on tool-calling iterations per chat turn [default: `6`]
   - Per-request `max_iterations` (when sent by the client) is clamped to this cap
 
+#### Text-to-Speech (Optional)
+Reads insights aloud and manages cloned voices via a Chatterbox model served
+behind the same llama-swap proxy. Only requires `LLAMA_SWAP_URL` (the TTS client
+is built whenever that's set — independent of `LLM_BACKEND`). Endpoints:
+- `POST /tts/speech` — body `{ text, voice?, format?, exaggeration?, cfg_weight?,
+  temperature? }`; returns `{ audio_base64, format }`. Input is cleaned
+  server-side (markdown + emoji stripped) and the generation knobs are clamped
+  to Chatterbox's ranges.
+- `GET /tts/voices` — list the voice library.
+- `POST /tts/voices/upload` — multipart `voice_name` + `voice_file`; clone a
+  voice from an uploaded clip (≤25 MB).
+- `POST /tts/voices/from-library` — body `{ voice_name, path, library? }`; clone
+  from a library file (audio forwarded as-is; video has its audio extracted via
+  ffmpeg).
+
+Env:
+- `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
+- `LLAMA_SWAP_TTS_VOICE` - default voice used when a `/tts/speech` request omits `voice` (optional)
+
 #### Fallback Behavior
 - Primary server is tried first with 5-second connection timeout
 - On failure, automatically falls back to secondary server (if configured)