diff --git a/.env.example b/.env.example
index 81fab4e..f7a1004 100644
--- a/.env.example
+++ b/.env.example
@@ -66,15 +66,17 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
 # ── AI Insights — llama.cpp / llama-swap (optional) ─────────────────────
 # Set LLAMA_SWAP_URL plus LLM_BACKEND=llamacpp to swap the local stack
 # off Ollama. Talks OpenAI-compatible /v1 to a llama-swap proxy fronting
-# per-slot llama-server instances (chat / vision / embed). The chat slot
-# is treated as text-only — images are pre-described via the vision slot
-# and inlined into the prompt.
+# per-slot llama-server instances. Chat models receive images directly
+# via content-parts (vision-capable models assumed); a separate vision
+# slot is used only by the describe_photo tool and describe-image utility.
 # LLAMA_SWAP_URL=http://localhost:9292/v1
 # LLAMA_SWAP_PRIMARY_MODEL=chat
+# Optional dedicated vision slot for describe_image. Defaults to
+# PRIMARY_MODEL so describe_photo works without extra config.
 # LLAMA_SWAP_VISION_MODEL=vision
 # LLAMA_SWAP_EMBEDDING_MODEL=embed
 # Comma-separated allowlist surfaced by /insights/models when
-# LLM_BACKEND=llamacpp.
+# LLM_BACKEND=llamacpp. All report has_vision=true.
 # LLAMA_SWAP_ALLOWED_MODELS=chat,vision,embed
 # LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180
 
diff --git a/CLAUDE.md b/CLAUDE.md
index d06b29f..7f1da76 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -640,17 +640,16 @@ OPENROUTER_APP_TITLE=ImageApi # Optional attribution header
 LLM_BACKEND=ollama
 
 # llama.cpp / llama-swap (used when LLM_BACKEND=llamacpp). OpenAI-compatible
-# proxy hosting one or more llama-server processes (chat / vision / embed slots).
+# proxy hosting one or more llama-server processes. Chat models receive
+# images directly via content-parts (all models assumed vision-capable).
 LLAMA_SWAP_URL=http://localhost:9292/v1 # Required when LLM_BACKEND=llamacpp
 LLAMA_SWAP_PRIMARY_MODEL=chat # Chat slot id (matches config.yaml)
-LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here.
- # The only slot reported as has_vision=true in
- # /insights/models — chat slots are treated as
- # text-only (images pre-described and inlined).
+LLAMA_SWAP_VISION_MODEL= # Dedicated vision slot for describe_image / describe_photo
+ # tool. Defaults to PRIMARY_MODEL when unset.
 LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id
 LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist surfaced by GET /insights/models
- # when LLM_BACKEND=llamacpp. Empty = picker shows
- # only the configured primary model.
+ # when LLM_BACKEND=llamacpp. All report has_vision=true.
+ # Empty = picker shows only the configured primary model.
 LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180 # Per-request timeout; bump for slow CPU offload
 
 # Insight Chat Continuation
@@ -675,21 +674,35 @@ This allows runtime verification of model availability before generating insight
 **Local backend switch (`LLM_BACKEND`):**
 
 One env var decides which "local" stack the server runs against — `ollama`
-(default) or `llamacpp`. It's global on purpose: chat, vision describe, and
+(default) or `llamacpp`. It's global on purpose: chat, vision, and
 embeddings all route through the same backend, so the embedding-vector column
 in SQLite stays in one vector space. Don't flip mid-deploy without
 re-embedding the affected rows — similarity search will collapse.
 
-- `LLM_BACKEND=ollama`: chat and embeddings use Ollama; vision describe
-  uses Ollama's multimodal model.
-- `LLM_BACKEND=llamacpp`: chat hits llama-swap's `chat` slot (which is
-  treated as text-only — images are pre-described via the `vision` slot
-  and inlined), embeddings hit the `embed` slot, vision describe hits the
-  `vision` slot. Requires `LLAMA_SWAP_URL`.
+- `LLM_BACKEND=ollama`: chat, vision, and embeddings use Ollama. Vision
+  capability is probed per-model via `/api/show`.
+- `LLM_BACKEND=llamacpp`: chat models receive images directly via OpenAI
+  content-parts (all models assumed vision-capable). Embeddings hit the
+  `embed` slot. A dedicated `LLAMA_SWAP_VISION_MODEL` slot (defaults to
+  the chat model) handles `describe_image` for the `describe_photo` tool.
+  Requires `LLAMA_SWAP_URL`.
 
 The per-request `backend=hybrid` override is orthogonal: it always sends
-chat to OpenRouter, but the describe pass still routes through whichever
-`LLM_BACKEND` is configured.
+chat to OpenRouter (text-only, images are pre-described and inlined), but
+the describe + embed passes still route through whichever `LLM_BACKEND`
+is configured.
+
+**Backend dispatch (`ResolvedBackend`):**
+
+`InsightGenerator::resolve_backend(kind, overrides)` is the single entry
+point that builds clients for a request. Returns a `ResolvedBackend` with
+two roles: `.chat()` (the agentic/chat client) and `.local()` (local-only
+utility calls: rerank, describe_image, embeddings). `BackendKind` is an
+enum (`Local` | `Hybrid`) replacing the stringly-typed `"local"` /
+`"hybrid"` labels. `SamplingOverrides` groups model/ctx/temp/top_p/top_k/
+min_p per-request overrides. All downstream code (`execute_tool`,
+`run_streaming_agentic_loop`, etc.) takes `&ResolvedBackend` rather than
+individual client references.
 
 `GET /insights/models` returns the local-backend models with capabilities
 in the same envelope shape regardless of `LLM_BACKEND`: Ollama servers