ai: add llamacpp backend (llama-swap) as third LLM client

Wires a new LlamaCppClient (OpenAI-compatible /v1 wire format) alongside OllamaClient and OpenRouterClient.

- Per-slot routing for chat/vision/embed via env (LLAMA_SWAP_URL + *_MODEL vars); capability inference uses an env allowlist since /v1/models doesn't report modality.
- InsightGenerator + InsightChatService gain three-way dispatch on chat_backend = "local" | "hybrid" | "llamacpp".
- Hybrid and llamacpp share the describe-then-inline path (text-only chat after a separate vision describe).
- HYBRID_VISION_BACKEND=llamacpp lets hybrid route its describe pass through llama-swap's vision slot while chat still goes to OpenRouter.
- Cross-replay matrix added (validate_cross_replay): local<->llamacpp and hybrid<->llamacpp allowed; local->hybrid and llamacpp->hybrid rejected.
- New /insights/llamacpp/models handler mirrors the OpenRouter shape.
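
The env-allowlist capability inference mentioned above could look roughly like the sketch below. It is an illustration only, not the code in this commit: the function names (`parse_slot_list`, `vision_allowlist`, `has_vision`) are hypothetical, and the only behavior taken from the change is that `LLAMA_SWAP_VISION_MODELS` is a comma-separated list of slot ids and that `LLAMA_SWAP_VISION_MODEL` is auto-included.

```rust
use std::collections::HashSet;

/// Split a comma-separated env value into trimmed, non-empty slot ids.
/// (Hypothetical helper; not taken from the commit.)
fn parse_slot_list(raw: &str) -> HashSet<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Build the vision allowlist from the value of LLAMA_SWAP_VISION_MODELS,
/// then add the value of LLAMA_SWAP_VISION_MODEL so the configured vision
/// slot is always treated as vision-capable.
fn vision_allowlist(vision_models: Option<&str>, vision_model: Option<&str>) -> HashSet<String> {
    let mut slots = vision_models.map(parse_slot_list).unwrap_or_default();
    if let Some(slot) = vision_model.map(str::trim).filter(|s| !s.is_empty()) {
        slots.insert(slot.to_owned());
    }
    slots
}

/// A slot counts as vision-capable only if the operator listed it;
/// nothing is inferred from the /v1/models response.
fn has_vision(allowlist: &HashSet<String>, slot_id: &str) -> bool {
    allowlist.contains(slot_id)
}
```

In a real service the two arguments would presumably come from `std::env::var("LLAMA_SWAP_VISION_MODELS").ok()` and `std::env::var("LLAMA_SWAP_VISION_MODEL").ok()`, passed in via `as_deref()`; keeping the env reads outside these functions makes the lookup easy to unit-test.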
2026-05-20 17:52:33 -04:00
parent d04b86e32c
commit f0927f5355
9 changed files with 1468 additions and 102 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -475,6 +475,7 @@ POST /insights/generate/agentic      (tool-calling loop; body: { file_path, back
 GET  /insights?path=...&library=...
 GET  /insights/models                (local Ollama models + capabilities)
 GET  /insights/openrouter/models     (curated OpenRouter allowlist)
+GET  /insights/llamacpp/models       (curated llama-swap slot allowlist)
 POST /insights/rate                  (thumbs up/down for training data)

 // Insight Chat Continuation
@@ -631,6 +632,23 @@ OPENROUTER_EMBEDDING_MODEL=openai/text-embedding-3-small  # Optional, embeddings
 OPENROUTER_HTTP_REFERER=https://your-site.example    # Optional attribution header
 OPENROUTER_APP_TITLE=ImageApi                  # Optional attribution header

+# llama.cpp / llama-swap (Llamacpp Backend) - sibling to Ollama; OpenAI-compatible
+# proxy hosting one or more llama-server processes (chat / vision / embed slots).
+LLAMA_SWAP_URL=http://localhost:9292/v1         # Required to enable llamacpp backend
+LLAMA_SWAP_PRIMARY_MODEL=chat                   # Chat slot id (matches config.yaml)
+LLAMA_SWAP_VISION_MODEL=vision                  # Vision slot id; describe_image routes here
+LLAMA_SWAP_EMBEDDING_MODEL=embed                # Embedding slot id (when local embeddings via llamacpp)
+LLAMA_SWAP_VISION_MODELS=qwen-vl,llava          # Comma-separated slot ids known to have vision.
+                                                # Drives `has_vision` in /insights/llamacpp/models.
+                                                # `LLAMA_SWAP_VISION_MODEL` is auto-included.
+LLAMA_SWAP_ALLOWED_MODELS=chat,coder            # Curated allowlist exposed to clients via
+                                                # GET /insights/llamacpp/models. Empty = no picker.
+LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180          # Per-request timeout; bump for slow CPU offload
+HYBRID_VISION_BACKEND=llamacpp                  # Optional override for hybrid mode's describe_image:
+                                                # `ollama` (default) or `llamacpp`. When `llamacpp`,
+                                                # hybrid still routes chat to OpenRouter but uses
+                                                # llama-swap's vision slot to describe images.
+
 # Insight Chat Continuation
 AGENTIC_CHAT_MAX_ITERATIONS=6                  # Cap on tool-calling iterations per chat turn (default 6)
 ```
@@ -652,8 +670,11 @@ This allows runtime verification of model availability before generating insight

 **Hybrid Backend (OpenRouter):**
 - Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`.
- Local Ollama still describes the image (vision); the description is inlined
-  into the chat prompt and the agentic loop runs on OpenRouter.
+- Vision describe happens before the agentic loop; the description is inlined
+  into the chat prompt and the agentic loop runs on OpenRouter. By default
+  vision uses local Ollama, but `HYBRID_VISION_BACKEND=llamacpp` flips it to
+  llama-swap's vision slot (useful when you want chat on a frontier model and
+  vision on a local-but-not-Ollama path).
 - `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that
  call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`.
 - No live capability precheck — the operator-curated allowlist is trusted.
@@ -661,6 +682,30 @@ This allows runtime verification of model availability before generating insight
 - `GET /insights/openrouter/models` returns `{ models, default_model, configured }`
  for client picker UIs.

+**Llamacpp Backend (llama-swap):**
+- Per-request opt-in via `backend=llamacpp` on `POST /insights/generate/agentic`.
+- Sibling to Ollama: a local OpenAI-compatible proxy (mostlygeek/llama-swap)
+  fronting one or more `llama-server` processes. The chat slot is text-only
+  by default; vision and embeddings have their own slots (`LLAMA_SWAP_VISION_MODEL`,
+  `LLAMA_SWAP_EMBEDDING_MODEL`) that llama-swap routes to by model id. The
+  bundled `docker-compose.yml` + `llama-swap/config.yaml` in the opencode root
+  is the reference deploy.
+- Operates in the same describe-then-inline shape as hybrid: the chat model
+  never sees raw images. Vision describe routes through llama-swap's vision
+  slot (`describe_image` on `LlamaCppClient`).
+- `request.model` (if provided) overrides `LLAMA_SWAP_PRIMARY_MODEL` for that
+  call (must match a slot id in llama-swap's `config.yaml`). The mobile picker
+  reads from `LLAMA_SWAP_ALLOWED_MODELS`.
+- No live capability precheck — slot ids are trusted. Tool calling is assumed
+  for every slot (llama-swap entries typically launch with `--jinja`).
+- `GET /insights/llamacpp/models` returns `{ models, default_model, configured }`.
+- Cross-replay matrix (chat continuation): `local ↔ llamacpp` allowed (the
+  LlamaCppClient passes images through to the chat slot — you're responsible
+  for a vision-capable slot if the stored transcript carries images);
+  `hybrid ↔ llamacpp` allowed (both transcripts are text-only); `local →
+  hybrid` and `llamacpp → hybrid` rejected (mid-conversation description
+  source change isn't supported).
+
 **Insight Chat Continuation:**

 After an agentic insight is generated, the full `Vec<ChatMessage>` transcript is