# Image API
This is an Actix-web server for serving images and videos from a filesystem.
Upon first run it will generate thumbnails for all images and videos at `BASE_PATH`.

## Features
- Automatic thumbnail generation for images and videos
- EXIF data extraction and storage for photos
- File watching with NFS support (polling-based)
- Video streaming with HLS
- Tag-based organization
- Memories API for browsing photos by date
- **Video Wall** - Auto-generated short preview clips for videos, served via a grid view
- **AI-Powered Photo Insights** - Generate contextual insights from photos using LLMs
- **RAG-based Context Retrieval** - Semantic search over daily conversation summaries
- **Automatic Daily Summaries** - LLM-generated summaries of daily conversations with embeddings

## External Dependencies

### ffmpeg (required)
`ffmpeg` must be on `PATH`. It is used for:
- **HLS video streaming** - transcoding/segmenting source videos into `.m3u8` + `.ts` playlists
- **Video thumbnails** - extracting a frame at the 3-second mark
- **Video preview clips** - short looping previews for the Video Wall
- **HEIC / HEIF thumbnails** - decoding Apple's HEIC format (your ffmpeg build must include
  `libheif`; most modern builds do)

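As an illustration of the video-thumbnail step, here is a minimal sketch of how a single frame at the 3-second mark could be requested from ffmpeg. The function names and the exact flag set are assumptions for this example, not the server's actual invocation; `-ss`, `-i`, `-frames:v`, `-f` and `-y` are standard ffmpeg options.

```rust
use std::path::Path;
use std::process::Command;

/// Offset (in seconds) of the frame used as the video thumbnail.
const THUMBNAIL_OFFSET_SECS: u32 = 3;

/// Builds the ffmpeg arguments for grabbing one frame as JPEG bytes.
/// Hypothetical helper: the real server may pass different flags.
fn thumbnail_args(video: &Path, thumb: &Path) -> Vec<String> {
    vec![
        "-y".to_string(), // overwrite an existing output file
        "-ss".to_string(),
        THUMBNAIL_OFFSET_SECS.to_string(), // seek before decoding
        "-i".to_string(),
        video.to_string_lossy().into_owned(),
        "-frames:v".to_string(),
        "1".to_string(), // emit exactly one video frame
        "-f".to_string(),
        "mjpeg".to_string(), // force JPEG output regardless of file extension
        thumb.to_string_lossy().into_owned(),
    ]
}

/// Wraps the arguments in a `Command` that can be spawned by the caller.
fn thumbnail_command(video: &Path, thumb: &Path) -> Command {
    let mut cmd = Command::new("ffmpeg");
    cmd.args(thumbnail_args(video, thumb));
    cmd
}
```

Calling `thumbnail_command(src, dst).status()` would run ffmpeg; the output format is forced here because the destination path may not end in `.jpg`.
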
Builds used in development: the `gyan.dev` full build on Windows and distro `ffmpeg`
packages on Linux. Both work fine. If HEIC thumbnails silently fail, check
`ffmpeg -formats | grep heif` to confirm HEIF support.

### RAW photo thumbnails
RAW formats (ARW, NEF, CR2, CR3, DNG, RAF, ORF, RW2, PEF, SRW, TIFF) are thumbnailed
by reading an embedded JPEG preview out of the TIFF container; no external RAW
decoder (libraw / dcraw) is involved. The pipeline tries two layers in order and
keeps the largest valid JPEG:

1. **Fast path (no extra dependency)** - `kamadak-exif` reads
   `JPEGInterchangeFormat` from IFD0 / IFD1 directly. Covers older bodies and
   most DNGs.
2. **`exiftool` fallback (recommended for RAW-heavy libraries)** - shells out
   to extract `PreviewImage` / `JpgFromRaw` / `OtherImage`, which reaches
   MakerNote and SubIFD-hosted previews kamadak-exif can't see (e.g. Nikon's
   `PreviewIFD`, where modern Nikon bodies stash the full-res review JPEG).
   If `exiftool` isn't on `PATH` this layer is skipped silently and only the
   fast-path result is used.

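The "keep the largest valid JPEG" rule can be sketched as follows. This is an illustration only: the function names are hypothetical, and the validity check (start-of-image and end-of-image markers) is an assumption about what "valid" means, not the project's actual test.

```rust
/// Returns true if `bytes` looks like a complete JPEG: SOI marker (FF D8)
/// at the start and EOI marker (FF D9) at the end.
/// Assumption: the real pipeline may validate previews differently.
fn looks_like_jpeg(bytes: &[u8]) -> bool {
    bytes.len() >= 4 && bytes.starts_with(&[0xFF, 0xD8]) && bytes.ends_with(&[0xFF, 0xD9])
}

/// Picks the largest candidate preview that passes the JPEG check, if any.
/// `candidates` would hold the fast-path result plus any exiftool results.
fn pick_largest_preview(candidates: Vec<Vec<u8>>) -> Option<Vec<u8>> {
    candidates
        .into_iter()
        .filter(|c| looks_like_jpeg(c))
        .max_by_key(|c| c.len())
}
```

Returning `None` corresponds to the case where neither layer produced a usable preview and another decoder has to be tried.
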
Install `exiftool` via your package manager:
- macOS: `brew install exiftool`
- Linux (Debian/Ubuntu): `apt install libimage-exiftool-perl`
- Windows: `winget install OliverBetz.ExifTool` or `choco install exiftool`

Files where neither layer produces a valid preview fall back to ffmpeg. Anything
that still can't be decoded is marked with a `<thumb>.unsupported` sentinel in
the thumbnail directory so we don't retry it every scan. Delete those sentinels
(and any cached black thumbnails) to force retries after a tooling upgrade.

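A small sketch of the sentinel convention, assuming the sentinel name is the thumbnail path with `.unsupported` appended after the existing extension. The helper names are hypothetical.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Appends `.unsupported` to the full thumbnail path (extension included),
/// e.g. `thumbs/a/foo.arw` becomes `thumbs/a/foo.arw.unsupported`.
fn sentinel_path(thumb: &Path) -> PathBuf {
    let mut name: OsString = thumb.as_os_str().to_os_string();
    name.push(".unsupported");
    PathBuf::from(name)
}

/// A scan can skip a file when its sentinel already exists on disk.
fn should_skip(thumb: &Path) -> bool {
    sentinel_path(thumb).exists()
}
```

Deleting the sentinel file makes `should_skip` return `false` again, which is what forces a retry on the next scan.
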
## Environment
The API requires a handful of environment variables to run.
Define them in a `.env` file located next to the binary or in a directory above it.

- `DATABASE_URL` is a path or URL to a database (currently only SQLite is tested)
- `BASE_PATH` is the root from which you want to serve images and videos
- `THUMBNAILS` is a path where generated thumbnails should be stored. Thumbnails
  mirror the source tree under `BASE_PATH` and keep the source's original
  extension (e.g. `foo.arw` or `bar.mp4`), though the file contents are always
  JPEG bytes (browsers content-sniff). Files that can't be thumbnailed by the
  `image` crate, ffmpeg, or an embedded RAW preview get a zero-byte
  `<thumb_path>.unsupported` sentinel in this directory so subsequent scans
  skip them. Delete the `*.unsupported` files to force retries (for example
  after upgrading ffmpeg or adding libheif)
- `VIDEO_PATH` is a path where HLS playlists and video parts should be stored
- `GIFS_DIRECTORY` is a path where generated video GIF thumbnails should be stored
- `BIND_URL` is the address and port to bind to (typically your own IP address)
- `SECRET_KEY` is the *hopefully* random string used to sign tokens
- `RUST_LOG` is one of `off, error, warn, info, debug, trace`, from least to most noisy [default: `error`]
- `EXCLUDED_DIRS` is a comma-separated list of directories to exclude from the Memories API
- `PREVIEW_CLIPS_DIRECTORY` (optional) is a path where generated video preview clips should be stored [default: `preview_clips`]
- `WATCH_QUICK_INTERVAL_SECONDS` (optional) is the interval in seconds for quick file scans [default: 60]
- `WATCH_FULL_INTERVAL_SECONDS` (optional) is the interval in seconds for full file scans [default: 3600]

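To make the `THUMBNAILS` layout concrete, the sketch below maps a source file to its mirrored thumbnail path. The function name is hypothetical and the code only illustrates the "mirror the tree, keep the extension" rule described above; it does not generate any image data.

```rust
use std::path::{Path, PathBuf};

/// Maps a source file under `base` (the `BASE_PATH` root) to its thumbnail
/// location under `thumbs` (the `THUMBNAILS` root), keeping the relative
/// path and the original extension.
/// Returns `None` when `source` is not inside `base`.
fn thumbnail_path(base: &Path, thumbs: &Path, source: &Path) -> Option<PathBuf> {
    let relative = source.strip_prefix(base).ok()?;
    Some(thumbs.join(relative))
}
```

For example, with `BASE_PATH=/photos` and `THUMBNAILS=/thumbs`, the source `/photos/2024/foo.arw` maps to `/thumbs/2024/foo.arw`, even though that file would contain JPEG bytes.
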
### AI Insights Configuration (Optional)

The following environment variables configure AI-powered photo insights and daily conversation summaries:

#### Ollama Configuration
- `OLLAMA_PRIMARY_URL` - Primary Ollama server URL [default: `http://localhost:11434`]
  - Example: `http://desktop:11434` (your main/powerful server)
- `OLLAMA_FALLBACK_URL` - Fallback Ollama server URL (optional)
  - Example: `http://server:11434` (always-on backup server)
- `OLLAMA_PRIMARY_MODEL` - Model to use on primary server [default: `nemotron-3-nano:30b`]
  - Example: `nemotron-3-nano:30b`, `llama3.2:3b`, etc.
- `OLLAMA_FALLBACK_MODEL` - Model to use on fallback server (optional)
  - If not set, uses `OLLAMA_PRIMARY_MODEL` on fallback server

**Legacy Variables** (still supported):
- `OLLAMA_URL` - Used if `OLLAMA_PRIMARY_URL` is not set
- `OLLAMA_MODEL` - Used if `OLLAMA_PRIMARY_MODEL` is not set

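The precedence between the primary, legacy, and default values can be sketched like this. The struct and function names are illustrative assumptions; only the variable names and defaults come from the lists above. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
/// Resolved Ollama settings. Struct and field names are hypothetical.
#[derive(Debug, PartialEq)]
struct OllamaConfig {
    primary_url: String,
    primary_model: String,
    fallback_url: Option<String>,
    fallback_model: Option<String>,
}

/// Resolves settings from a lookup function, e.g. `|k| std::env::var(k).ok()`.
fn resolve_ollama(get: impl Fn(&str) -> Option<String>) -> OllamaConfig {
    // New variable first, then the legacy one, then the documented default.
    let primary_url = get("OLLAMA_PRIMARY_URL")
        .or_else(|| get("OLLAMA_URL"))
        .unwrap_or_else(|| "http://localhost:11434".to_string());
    let primary_model = get("OLLAMA_PRIMARY_MODEL")
        .or_else(|| get("OLLAMA_MODEL"))
        .unwrap_or_else(|| "nemotron-3-nano:30b".to_string());
    let fallback_url = get("OLLAMA_FALLBACK_URL");
    // The fallback model only matters when a fallback server is configured;
    // it defaults to the primary model.
    let fallback_model = fallback_url
        .as_ref()
        .map(|_| get("OLLAMA_FALLBACK_MODEL").unwrap_or_else(|| primary_model.clone()));
    OllamaConfig { primary_url, primary_model, fallback_url, fallback_model }
}
```

In a running server the call would look like `resolve_ollama(|k| std::env::var(k).ok())`.
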
#### OpenRouter Configuration (Hybrid Backend)
The hybrid agentic backend keeps embeddings + vision local (Ollama) while routing
chat + tool-calling to OpenRouter. Enabled per-request when the client sends
`backend=hybrid`.

- `OPENROUTER_API_KEY` - OpenRouter API key. Required to enable the hybrid backend.
- `OPENROUTER_DEFAULT_MODEL` - Model id used when the client doesn't specify one
  [default: `anthropic/claude-sonnet-4`]
  - Example: `openai/gpt-4o-mini`, `google/gemini-2.5-flash`
- `OPENROUTER_ALLOWED_MODELS` - Comma-separated curated allowlist exposed to
  clients via `GET /insights/openrouter/models`. The mobile picker shows only
  these. Empty/unset = no picker, server default is used.
  - Example: `openai/gpt-4o-mini,anthropic/claude-haiku-4-5,google/gemini-2.5-flash`
- `OPENROUTER_BASE_URL` - Override base URL [default: `https://openrouter.ai/api/v1`]
- `OPENROUTER_EMBEDDING_MODEL` - Embedding model for OpenRouter
  [default: `openai/text-embedding-3-small`]. Only used if/when embeddings are
  routed through OpenRouter (currently embeddings stay local).
- `OPENROUTER_HTTP_REFERER` - Optional `HTTP-Referer` for OpenRouter attribution
- `OPENROUTER_APP_TITLE` - Optional `X-Title` for OpenRouter attribution

Capability checks are skipped for the curated allowlist, so a bad model id only
surfaces as a 4xx from the chat call. Pick tool-capable models.

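A minimal sketch of how the allowlist and default model could be interpreted. Both function names are hypothetical, and the whitespace trimming is an assumption; the README only states that the value is comma-separated and that an empty or unset value means no picker.

```rust
/// Splits a comma-separated allowlist, trimming whitespace and dropping
/// empty entries. An unset or empty value yields an empty list, which
/// corresponds to "no picker, server default is used".
fn parse_allowed_models(raw: Option<&str>) -> Vec<String> {
    raw.unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|id| !id.is_empty())
        .map(String::from)
        .collect()
}

/// Chooses the model for a request: the client's pick if given,
/// otherwise the configured default.
fn choose_model<'a>(requested: Option<&'a str>, default_model: &'a str) -> &'a str {
    requested.unwrap_or(default_model)
}
```

Note that this sketch does not validate the requested id against the allowlist; whether the server does so is not stated here.
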
#### SMS API Configuration
- `SMS_API_URL` - URL to SMS message API [default: `http://localhost:8000`]
  - Used to fetch conversation data for context in insights
- `SMS_API_TOKEN` - Authentication token for SMS API (optional)

#### Agentic Insight Generation
- `AGENTIC_MAX_ITERATIONS` - Maximum tool-call iterations per agentic insight request [default: `10`]
  - Controls how many times the model can invoke tools before being forced to produce a final answer
  - Increase for more thorough context gathering; decrease to limit response time

#### Insight Chat Continuation
After an agentic insight is generated, the conversation can be continued. Endpoints:
- `POST /insights/chat` - single-turn reply (non-streaming)
- `POST /insights/chat/stream` - SSE variant with live `text` deltas and
  `tool_call` / `tool_result` events. The mobile client uses this.
- `GET /insights/chat/history?path=...&library=...` - rendered transcript;
  each assistant message carries a `tools: [{name, arguments, result}]` array
- `POST /insights/chat/rewind` - truncate transcript at a rendered index
  (drops that message + any preceding tool scaffolding + later turns). Used
  for "try again from here" flows. The initial user message is protected.

Amend mode (`amend: true` in the chat request body) regenerates the insight's
title and inserts a new row instead of appending to the existing transcript,
so you can rewrite the saved summary from within chat.

- `AGENTIC_CHAT_MAX_ITERATIONS` - Cap on tool-calling iterations per chat turn [default: `6`]
  - Per-request `max_iterations` (when sent by the client) is clamped to this cap

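The clamping of a per-request `max_iterations` to the configured cap can be sketched as below. The function name is hypothetical, and the lower bound of one iteration is an assumption made for this example.

```rust
/// Documented default for `AGENTIC_CHAT_MAX_ITERATIONS`.
const DEFAULT_CHAT_CAP: u32 = 6;

/// Returns the number of tool-calling iterations allowed for one chat turn.
/// A missing client value falls back to the cap; larger values are clamped.
/// Assumption: at least one iteration is always allowed.
fn effective_iterations(requested: Option<u32>, cap: u32) -> u32 {
    let cap = cap.max(1);
    requested.unwrap_or(cap).clamp(1, cap)
}
```

With the default cap, a client asking for 20 iterations gets 6, while a client asking for 3 gets 3.
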
#### Text-to-Speech (Optional)
Reads insights aloud and manages cloned voices via a Chatterbox model served
behind the same llama-swap proxy. Only requires `LLAMA_SWAP_URL` (the TTS client
is built whenever that's set, independent of `LLM_BACKEND`). Endpoints:
- `POST /tts/speech` - body `{ text, voice?, format?, exaggeration?, cfg_weight?, temperature? }`;
  returns `{ audio_base64, format }`. Input is cleaned
  server-side (markdown + emoji stripped) and the generation knobs are clamped
  to Chatterbox's ranges.
- `GET /tts/voices` - list the voice library.
- `POST /tts/voices/upload` - multipart `voice_name` + `voice_file`; clone a
  voice from an uploaded clip (≤25 MB).
- `POST /tts/voices/from-library` - body `{ voice_name, path, library? }`; clone
  from a library file (audio or video; the reference audio is extracted and
  normalized via ffmpeg).

Env:
- `LLAMA_SWAP_TTS_MODEL` - TTS model id in llama-swap's `config.yaml` [default: `chatterbox`]
- `LLAMA_SWAP_TTS_VOICE` - default voice used when a `/tts/speech` request omits `voice` (optional)
- `LLAMA_SWAP_TTS_REF_SECONDS` - max voice-clone reference clip length in seconds
  [default: `30`]. Reference audio is ffmpeg-normalized to mono 24 kHz WAV (so any
  source format works); Chatterbox is zero-shot, so a clean ~10–20s sample is the
  sweet spot; more rarely helps.

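The reference-clip normalization can be illustrated with the sketch below, which builds an ffmpeg argument list for mono 24 kHz WAV output capped at a maximum duration. The helper names and the handling of invalid or zero values are assumptions; `-vn`, `-ac`, `-ar`, `-t` and `-f` are standard ffmpeg options, but the server's actual command line may differ.

```rust
use std::path::Path;

/// Documented default for `LLAMA_SWAP_TTS_REF_SECONDS`.
const DEFAULT_REF_SECONDS: u32 = 30;

/// Parses the cap from the raw env value, falling back to the default for
/// missing, non-numeric, or zero values (the zero handling is an assumption).
fn ref_seconds(raw: Option<&str>) -> u32 {
    raw.and_then(|v| v.trim().parse::<u32>().ok())
        .filter(|&secs| secs > 0)
        .unwrap_or(DEFAULT_REF_SECONDS)
}

/// ffmpeg arguments that drop any video stream and write mono 24 kHz WAV,
/// truncated to `max_secs` seconds.
fn normalize_args(input: &Path, output: &Path, max_secs: u32) -> Vec<String> {
    let mut args: Vec<String> = vec!["-y".to_string(), "-i".to_string()];
    args.push(input.to_string_lossy().into_owned());
    for flag in ["-vn", "-ac", "1", "-ar", "24000", "-t"] {
        args.push(flag.to_string());
    }
    args.push(max_secs.to_string());
    args.push("-f".to_string());
    args.push("wav".to_string());
    args.push(output.to_string_lossy().into_owned());
    args
}
```

Because the same arguments work for audio and video inputs, a single code path can handle uploads and library files alike.
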
#### Fallback Behavior
- Primary server is tried first with a 5-second connection timeout
- On failure, automatically falls back to secondary server (if configured)
- Total request timeout is 120 seconds to accommodate LLM inference
- Logs indicate which server/model was used and any failover attempts

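The primary-then-fallback flow can be sketched without any HTTP dependency by passing the actual request in as a closure. Everything here is illustrative: the function and constant names are hypothetical, and a real client would apply the two timeouts when building its HTTP client.

```rust
use std::time::Duration;

/// Timeouts from the list above; a real HTTP client would be configured with these.
const CONNECT_TIMEOUT: Duration = Duration::from_secs(5);
const REQUEST_TIMEOUT: Duration = Duration::from_secs(120);

/// Runs `call` against the primary server and, on error, against the
/// fallback if one is configured. On success, returns the URL of the
/// server that answered together with the value.
fn with_fallback<T, E>(
    primary: &str,
    fallback: Option<&str>,
    mut call: impl FnMut(&str) -> Result<T, E>,
) -> Result<(String, T), E> {
    match call(primary) {
        Ok(value) => Ok((primary.to_string(), value)),
        Err(primary_err) => match fallback {
            // A failover attempt; a real server would log it here.
            Some(url) => call(url).map(|value| (url.to_string(), value)),
            None => Err(primary_err),
        },
    }
}
```

Returning the server URL alongside the value makes it easy to log which server handled the request.
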
#### Daily Summary Generation
Daily conversation summaries are generated automatically on server startup. Configure in `src/main.rs`:
- Date range for summary generation
- Contacts to process
- Model version used for embeddings: `nomic-embed-text:v1.5`

### Apollo + Face Recognition (Optional)

Apollo (sibling project) hosts both the Places API and the local insightface
inference service. Both integrations are optional and degrade gracefully when
unset.

- `APOLLO_API_BASE_URL` - Base URL of the sibling Apollo backend.
  - When set, photo-insight enrichment folds the user's personal place name
    (Home, Work, Cabin, ...) into the location string, and the agentic loop
    gains a `get_personal_place_at` tool. Unset = legacy Nominatim-only path.
- `APOLLO_FACE_API_BASE_URL` - Base URL for the face-detection service.
  - Falls back to `APOLLO_API_BASE_URL` when unset (typical single-Apollo
    deploy). Both unset = face feature disabled (file-watch hook and
    manual-face endpoints short-circuit silently).
- `FACE_AUTOBIND_MIN_COS` (Phase 3) - Cosine-similarity floor for auto-binding a
  detected face to an existing same-named person via people-tag bootstrap
  [default: `0.4`].
- `FACE_DETECT_CONCURRENCY` (Phase 3) - Per-scan-tick concurrent detect
  calls fired by the file watcher [default: `8`]. Apollo serializes them
  via its single-worker GPU pool.
- `FACE_DETECT_TIMEOUT_SEC` - reqwest client timeout per detect call
  [default: `60`]. CPU inference on a backlog can take many seconds.
- `FACE_BACKLOG_MAX_PER_TICK` - Cap on the per-tick backlog drain (photos
  with a `content_hash` but no `face_detections` row) [default: `64`]. Runs
  every watcher tick regardless of quick-vs-full scan, so the unscanned
  set drains independently of the file walk.
- `FACE_HASH_BACKFILL_MAX_PER_TICK` - Cap on the per-tick `content_hash`
  backfill, which retroactively hashes photos registered before the hash
  field was populated [default: `2000`]. Errors don't burn the cap;
  only successful hashes count.

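To show what a cosine-similarity floor such as `FACE_AUTOBIND_MIN_COS` means in practice, here is a small sketch. The function names are hypothetical, and how the server actually compares embeddings is not described here, so treat this purely as an illustration of the threshold.

```rust
/// Documented default for `FACE_AUTOBIND_MIN_COS`.
const DEFAULT_MIN_COS: f32 = 0.4;

/// Cosine similarity of two embeddings. Returns `None` for mismatched
/// lengths, empty inputs, or zero-norm vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// A face is auto-bound only when its similarity to the person's
/// reference embedding reaches the floor.
fn should_autobind(face: &[f32], person: &[f32], min_cos: f32) -> bool {
    cosine_similarity(face, person).map_or(false, |sim| sim >= min_cos)
}
```

Raising the floor makes auto-binding stricter; lowering it binds more faces at the cost of more false matches.
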
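The rule that errors do not count against `FACE_HASH_BACKFILL_MAX_PER_TICK` can be sketched as a loop that stops after a given number of successes. The function name and signature are hypothetical, and the hashing itself is passed in as a closure.

```rust
/// Processes pending items until `cap` successes are reached.
/// Failures are collected but do not count toward the cap.
fn backfill<T, H, E>(
    pending: impl IntoIterator<Item = T>,
    cap: usize,
    mut hash_one: impl FnMut(&T) -> Result<H, E>,
) -> (Vec<(T, H)>, Vec<(T, E)>) {
    let mut done = Vec::new();
    let mut failed = Vec::new();
    for item in pending {
        // Only successful hashes are counted against the per-tick cap.
        if done.len() >= cap {
            break;
        }
        match hash_one(&item) {
            Ok(hash) => done.push((item, hash)),
            Err(err) => failed.push((item, err)),
        }
    }
    (done, failed)
}
```

Items left unprocessed when the cap is hit would simply be picked up on a later tick.
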