ai: collapse llamacpp into LLM_BACKEND env switch

Reverts the per-request backend="llamacpp" value. Chat/vision/embedding
backend is now a deploy-time decision (LLM_BACKEND=ollama|llamacpp),
applied globally across chat, vision describe, and embeddings, so
embedding vectors stay in one space across the index.

- Per-request backend whitelist back to "local"|"hybrid". A request
  arriving with backend="llamacpp" is rejected.
- LLM_BACKEND=llamacpp swaps the entire local stack to llama-swap: chat
  hits the chat slot, describe hits the vision slot, embeddings hit the
  embed slot. Hybrid mode still routes chat to OpenRouter but uses
  LLM_BACKEND for the describe pass.
- Drops env vars HYBRID_VISION_BACKEND, LLAMA_SWAP_VISION_MODELS,
  EMBEDDING_BACKEND (the last never shipped). Drops the
  LlamaCppClient.vision_models allowlist; capability inference now
  reports has_vision only for the configured vision_model slot.
- Drops the /insights/llamacpp/models handler. /insights/models is the
  single endpoint; it returns Ollama servers under LLM_BACKEND=ollama and
  llama-swap slots (from LLAMA_SWAP_ALLOWED_MODELS) under
  LLM_BACKEND=llamacpp. Same envelope shape either way.
- New ai::embed_one helper routes embeddings through llama-swap when
  LLM_BACKEND=llamacpp (else Ollama). Wires it into the four
  insight_generator embedding sites.
- Cross-replay matrix simplifies to pre-llamacpp shape (local↔local,
  hybrid↔hybrid, hybrid→local allowed; local→hybrid rejected).
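
The routing rules in the message above can be sketched in Rust roughly as follows. This is an illustration only, not the code in this commit: `LocalStack`, `RequestBackend`, `ChatTarget`, and every function name here are hypothetical, and treating a missing `backend` field as `local` is an assumption.

```rust
/// Deploy-time local stack, selected by the LLM_BACKEND env var.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LocalStack {
    Ollama,
    LlamaCpp,
}

/// Per-request backend; after this change only "local" and "hybrid" are valid.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestBackend {
    Local,
    Hybrid,
}

/// Where the chat call is sent for a given request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChatTarget {
    OpenRouter,
    Local(LocalStack),
}

/// Parse the raw LLM_BACKEND value; unset falls back to the documented default (ollama).
fn local_stack_from_env(raw: Option<&str>) -> Result<LocalStack, String> {
    match raw {
        None | Some("ollama") => Ok(LocalStack::Ollama),
        Some("llamacpp") => Ok(LocalStack::LlamaCpp),
        Some(other) => Err(format!("unsupported LLM_BACKEND: {other}")),
    }
}

/// Validate the per-request backend. "llamacpp" is no longer accepted here.
/// Assumption: an omitted backend means "local".
fn parse_request_backend(value: Option<&str>) -> Result<RequestBackend, String> {
    match value {
        None | Some("local") => Ok(RequestBackend::Local),
        Some("hybrid") => Ok(RequestBackend::Hybrid),
        Some(other) => Err(format!("unsupported backend: {other}")),
    }
}

/// Hybrid sends chat to OpenRouter; local uses whichever stack is deployed.
fn chat_target(request: RequestBackend, stack: LocalStack) -> ChatTarget {
    match request {
        RequestBackend::Hybrid => ChatTarget::OpenRouter,
        RequestBackend::Local => ChatTarget::Local(stack),
    }
}

/// The describe pass follows LLM_BACKEND in both local and hybrid mode.
fn describe_stack(_request: RequestBackend, stack: LocalStack) -> LocalStack {
    stack
}

/// Cross-replay matrix: only local -> hybrid is rejected.
fn replay_allowed(stored: RequestBackend, requested: RequestBackend) -> bool {
    !matches!(
        (stored, requested),
        (RequestBackend::Local, RequestBackend::Hybrid)
    )
}
```

In a server, the raw value would typically come from something like `std::env::var("LLM_BACKEND").ok().as_deref()`, read once at startup so the choice cannot change per request.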
2026-05-21 11:36:58 -04:00
parent d14df63f19
commit be51421b38
9 changed files with 338 additions and 301 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -473,9 +473,8 @@ GET /memories?path=...&recursive=true
 POST /insights/generate              (non-agentic single-shot)
 POST /insights/generate/agentic      (tool-calling loop; body: { file_path, backend?, model?, ... })
 GET  /insights?path=...&library=...
-GET  /insights/models                (local Ollama models + capabilities)
+GET  /insights/models                (local-backend models + capabilities; Ollama OR llama-swap based on LLM_BACKEND)
 GET  /insights/openrouter/models     (curated OpenRouter allowlist)
-GET  /insights/llamacpp/models       (curated llama-swap slot allowlist)
 POST /insights/rate                  (thumbs up/down for training data)

 // Insight Chat Continuation
@@ -632,22 +631,27 @@ OPENROUTER_EMBEDDING_MODEL=openai/text-embedding-3-small  # Optional, embeddings
 OPENROUTER_HTTP_REFERER=https://your-site.example    # Optional attribution header
 OPENROUTER_APP_TITLE=ImageApi                  # Optional attribution header

-# llama.cpp / llama-swap (Llamacpp Backend) - sibling to Ollama; OpenAI-compatible
+# Local LLM backend switch. `ollama` (default) keeps the OLLAMA_* settings
+# above; `llamacpp` swaps the entire local stack (chat + vision describe +
+# embeddings) over to llama-swap. The switch is global and applies to
+# `backend=local` requests and to `backend=hybrid`'s describe pass (hybrid
+# chat still goes to OpenRouter). Don't flip mid-deploy without
+# re-embedding — mixed vector spaces break similarity search.
+LLM_BACKEND=ollama
+
+# llama.cpp / llama-swap (used when LLM_BACKEND=llamacpp). OpenAI-compatible
 # proxy hosting one or more llama-server processes (chat / vision / embed slots).
-LLAMA_SWAP_URL=http://localhost:9292/v1         # Required to enable llamacpp backend
+LLAMA_SWAP_URL=http://localhost:9292/v1         # Required when LLM_BACKEND=llamacpp
 LLAMA_SWAP_PRIMARY_MODEL=chat                   # Chat slot id (matches config.yaml)
-LLAMA_SWAP_VISION_MODEL=vision                  # Vision slot id; describe_image routes here
-LLAMA_SWAP_EMBEDDING_MODEL=embed                # Embedding slot id (when local embeddings via llamacpp)
-LLAMA_SWAP_VISION_MODELS=qwen-vl,llava          # Comma-separated slot ids known to have vision.
-                                                # Drives `has_vision` in /insights/llamacpp/models.
-                                                # `LLAMA_SWAP_VISION_MODEL` is auto-included.
-LLAMA_SWAP_ALLOWED_MODELS=chat,coder            # Curated allowlist exposed to clients via
-                                                # GET /insights/llamacpp/models. Empty = no picker.
+LLAMA_SWAP_VISION_MODEL=vision                  # Vision slot id; describe_image routes here.
+                                                # The only slot reported as has_vision=true in
+                                                # /insights/models — chat slots are treated as
+                                                # text-only (images pre-described and inlined).
+LLAMA_SWAP_EMBEDDING_MODEL=embed                # Embedding slot id
+LLAMA_SWAP_ALLOWED_MODELS=chat,coder            # Curated allowlist surfaced by GET /insights/models
+                                                # when LLM_BACKEND=llamacpp. Empty = picker shows
+                                                # only the configured primary model.
 LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180          # Per-request timeout; bump for slow CPU offload
-HYBRID_VISION_BACKEND=llamacpp                  # Optional override for hybrid mode's describe_image:
-                                                # `ollama` (default) or `llamacpp`. When `llamacpp`,
-                                                # hybrid still routes chat to OpenRouter but uses
-                                                # llama-swap's vision slot to describe images.

 # Insight Chat Continuation
 AGENTIC_CHAT_MAX_ITERATIONS=6                  # Cap on tool-calling iterations per chat turn (default 6)
@@ -668,13 +672,36 @@ The `OllamaClient` provides methods to query available models:

 This allows runtime verification of model availability before generating insights.

+**Local backend switch (`LLM_BACKEND`):**
+
+One env var decides which "local" stack the server runs against — `ollama`
+(default) or `llamacpp`. It's global on purpose: chat, vision describe, and
+embeddings all route through the same backend, so the embedding-vector
+column in SQLite stays in one vector space. Don't flip mid-deploy without
+re-embedding the affected rows — similarity search will collapse.
+
+- `LLM_BACKEND=ollama`: chat and embeddings use Ollama; vision describe
+  uses Ollama's multimodal model.
+- `LLM_BACKEND=llamacpp`: chat hits llama-swap's `chat` slot (which is
+  treated as text-only — images are pre-described via the `vision` slot
+  and inlined), embeddings hit the `embed` slot, vision describe hits the
+  `vision` slot. Requires `LLAMA_SWAP_URL`.
+
+The per-request `backend=hybrid` override is orthogonal: it always sends
+chat to OpenRouter, but the describe pass still routes through whichever
+`LLM_BACKEND` is configured.
+
+`GET /insights/models` returns the local-backend models with capabilities
+in the same envelope shape regardless of `LLM_BACKEND`: Ollama servers
+when `ollama`, llama-swap slots (from `LLAMA_SWAP_ALLOWED_MODELS`) when
+`llamacpp`. No `/insights/llamacpp/models` — the picker reads a single
+endpoint.
+
 **Hybrid Backend (OpenRouter):**
 - Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`.
 - Vision describe happens before the agentic loop; the description is inlined
-  into the chat prompt and the agentic loop runs on OpenRouter. By default
-  vision uses local Ollama, but `HYBRID_VISION_BACKEND=llamacpp` flips it to
-  llama-swap's vision slot (useful when you want chat on a frontier model and
-  vision on a local-but-not-Ollama path).
+  into the chat prompt and the agentic loop runs on OpenRouter. Vision
+  routes through whichever `LLM_BACKEND` is configured.
 - `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that
  call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`.
 - No live capability precheck — the operator-curated allowlist is trusted.
@@ -682,29 +709,14 @@ This allows runtime verification of model availability before generating insight
 - `GET /insights/openrouter/models` returns `{ models, default_model, configured }`
  for client picker UIs.

-**Llamacpp Backend (llama-swap):**
-- Per-request opt-in via `backend=llamacpp` on `POST /insights/generate/agentic`.
-- Sibling to Ollama: a local OpenAI-compatible proxy (mostlygeek/llama-swap)
-  fronting one or more `llama-server` processes. The chat slot is text-only
-  by default; vision and embeddings have their own slots (`LLAMA_SWAP_VISION_MODEL`,
-  `LLAMA_SWAP_EMBEDDING_MODEL`) that llama-swap routes to by model id. The
-  bundled `docker-compose.yml` + `llama-swap/config.yaml` in the opencode root
-  is the reference deploy.
-- Operates in the same describe-then-inline shape as hybrid: the chat model
-  never sees raw images. Vision describe routes through llama-swap's vision
-  slot (`describe_image` on `LlamaCppClient`).
-- `request.model` (if provided) overrides `LLAMA_SWAP_PRIMARY_MODEL` for that
-  call (must match a slot id in llama-swap's `config.yaml`). The mobile picker
-  reads from `LLAMA_SWAP_ALLOWED_MODELS`.
-- No live capability precheck — slot ids are trusted. Tool calling is assumed
-  for every slot (llama-swap entries typically launch with `--jinja`).
-- `GET /insights/llamacpp/models` returns `{ models, default_model, configured }`.
-- Cross-replay matrix (chat continuation): `local ↔ llamacpp` allowed (the
-  LlamaCppClient passes images through to the chat slot — you're responsible
-  for a vision-capable slot if the stored transcript carries images);
-  `hybrid ↔ llamacpp` allowed (both transcripts are text-only); `local →
-  hybrid` and `llamacpp → hybrid` rejected (mid-conversation description
-  source change isn't supported).
+**Cross-replay matrix (chat continuation):**
+- `local → local` allowed (whether served by Ollama or llama-swap; that's
+  a deploy-time decision, not a request-time one).
+- `hybrid → hybrid` allowed.
+- `hybrid → local` allowed (the inlined description replays as text).
+- `local → hybrid` rejected — the stored transcript has raw images in the
+  first user message and OpenRouter providers don't accept that shape
+  consistently. Regenerate the insight in hybrid mode instead.

 **Insight Chat Continuation:**