docs: update env + CLAUDE.md for direct-vision llamacpp + ResolvedBackend

llamacpp models now receive images directly instead of describe-then-inline. LLAMA_SWAP_VISION_MODEL defaults to the primary model. Document the ResolvedBackend dispatch pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 15:03:12 -04:00
parent a8a661f70a
commit fb388c29d7
2 changed files with 35 additions and 20 deletions
--- a/.env.example
+++ b/.env.example
@@ -66,15 +66,17 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
 # ── AI Insights — llama.cpp / llama-swap (optional) ─────────────────────
 # Set LLAMA_SWAP_URL plus LLM_BACKEND=llamacpp to swap the local stack
 # off Ollama. Talks OpenAI-compatible /v1 to a llama-swap proxy fronting
-# per-slot llama-server instances (chat / vision / embed). The chat slot
-# is treated as text-only — images are pre-described via the vision slot
-# and inlined into the prompt.
+# per-slot llama-server instances. Chat models receive images directly
+# via content-parts (vision-capable models assumed); a separate vision
+# slot is used only by the describe_photo tool and describe-image utility.
 # LLAMA_SWAP_URL=http://localhost:9292/v1
 # LLAMA_SWAP_PRIMARY_MODEL=chat
+# Optional dedicated vision slot for describe_image. Defaults to
+# LLAMA_SWAP_PRIMARY_MODEL so describe_photo works without extra config.
 # LLAMA_SWAP_VISION_MODEL=vision
 # LLAMA_SWAP_EMBEDDING_MODEL=embed
 # Comma-separated allowlist surfaced by /insights/models when
-# LLM_BACKEND=llamacpp.
+# LLM_BACKEND=llamacpp. All listed models report has_vision=true.
 # LLAMA_SWAP_ALLOWED_MODELS=chat,vision,embed
 # LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180

--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -640,17 +640,16 @@ OPENROUTER_APP_TITLE=ImageApi                  # Optional attribution header
 LLM_BACKEND=ollama

 # llama.cpp / llama-swap (used when LLM_BACKEND=llamacpp). OpenAI-compatible
-# proxy hosting one or more llama-server processes (chat / vision / embed slots).
+# proxy hosting one or more llama-server processes. Chat models receive
+# images directly via content-parts (all models assumed vision-capable).
 LLAMA_SWAP_URL=http://localhost:9292/v1         # Required when LLM_BACKEND=llamacpp
 LLAMA_SWAP_PRIMARY_MODEL=chat                   # Chat slot id (matches config.yaml)
-LLAMA_SWAP_VISION_MODEL=vision                  # Vision slot id; describe_image routes here.
-                                                # The only slot reported as has_vision=true in
-                                                # /insights/models — chat slots are treated as
-                                                # text-only (images pre-described and inlined).
+LLAMA_SWAP_VISION_MODEL=                        # Dedicated vision slot for describe_image / describe_photo
+                                                # tool. Defaults to PRIMARY_MODEL when unset.
 LLAMA_SWAP_EMBEDDING_MODEL=embed                # Embedding slot id
 LLAMA_SWAP_ALLOWED_MODELS=chat,coder            # Curated allowlist surfaced by GET /insights/models
-                                                # when LLM_BACKEND=llamacpp. Empty = picker shows
-                                                # only the configured primary model.
+                                                # when LLM_BACKEND=llamacpp. All listed models report has_vision=true.
+                                                # Empty = picker shows only the configured primary model.
 LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180          # Per-request timeout; bump for slow CPU offload

 # Insight Chat Continuation
@@ -675,21 +674,35 @@ This allows runtime verification of model availability before generating insight
 **Local backend switch (`LLM_BACKEND`):**

 One env var decides which "local" stack the server runs against — `ollama`
-(default) or `llamacpp`. It's global on purpose: chat, vision describe, and
+(default) or `llamacpp`. It's global on purpose: chat, vision, and
 embeddings all route through the same backend, so the embedding-vector
 column in SQLite stays in one vector space. Don't flip mid-deploy without
 re-embedding the affected rows — similarity search will collapse.

- `LLM_BACKEND=ollama`: chat and embeddings use Ollama; vision describe
-  uses Ollama's multimodal model.
- `LLM_BACKEND=llamacpp`: chat hits llama-swap's `chat` slot (which is
-  treated as text-only — images are pre-described via the `vision` slot
-  and inlined), embeddings hit the `embed` slot, vision describe hits the
-  `vision` slot. Requires `LLAMA_SWAP_URL`.
+- `LLM_BACKEND=ollama`: chat, vision, and embeddings use Ollama. Vision
+  capability is probed per-model via `/api/show`.
+- `LLM_BACKEND=llamacpp`: chat models receive images directly via OpenAI
+  content-parts (all models assumed vision-capable). Embeddings hit the
+  `embed` slot. A dedicated `LLAMA_SWAP_VISION_MODEL` slot (defaults to
+  the chat model) handles `describe_image` for the `describe_photo` tool.
+  Requires `LLAMA_SWAP_URL`.

 The per-request `backend=hybrid` override is orthogonal: it always sends
-chat to OpenRouter, but the describe pass still routes through whichever
-`LLM_BACKEND` is configured.
+chat to OpenRouter (text-only; images are pre-described and inlined), but
+the describe + embed passes still route through whichever `LLM_BACKEND`
+is configured.
+
+**Backend dispatch (`ResolvedBackend`):**
+
+`InsightGenerator::resolve_backend(kind, overrides)` is the single entry
+point that builds clients for a request. It returns a `ResolvedBackend` with
+two roles: `.chat()` (the agentic/chat client) and `.local()` (local-only
+utility calls: rerank, describe_image, embeddings). `BackendKind` is an
+enum (`Local` | `Hybrid`) replacing the stringly-typed `"local"` /
+`"hybrid"` labels. `SamplingOverrides` groups model/ctx/temp/top_p/top_k/
+min_p per-request overrides. All downstream code (`execute_tool`,
+`run_streaming_agentic_loop`, etc.) takes `&ResolvedBackend` rather than
+individual client references.

 `GET /insights/models` returns the local-backend models with capabilities
 in the same envelope shape regardless of `LLM_BACKEND`: Ollama servers