ai: collapse llamacpp into LLM_BACKEND env switch

Reverts the per-request backend="llamacpp" value. The local backend is
now a deploy-time decision (LLM_BACKEND=ollama|llamacpp) applied globally
across chat, vision describe, and embeddings, so embedding vectors stay
in one space across the index.

- Per-request backend whitelist back to "local"|"hybrid". A request
  arriving with backend="llamacpp" is rejected.
- LLM_BACKEND=llamacpp swaps the entire local stack to llama-swap:
  chat hits the chat slot, describe hits the vision slot, embeddings
  hit the embed slot. Hybrid mode still routes chat to OpenRouter
  but uses LLM_BACKEND for the describe pass.
- Drops env vars HYBRID_VISION_BACKEND, LLAMA_SWAP_VISION_MODELS,
  EMBEDDING_BACKEND (the last never shipped). Drops the
  LlamaCppClient.vision_models allowlist — capability inference now
  reports has_vision only for the configured vision_model slot.
- Drops the /insights/llamacpp/models handler. /insights/models is
  the single endpoint; returns Ollama servers under LLM_BACKEND=ollama
  and llama-swap slots (from LLAMA_SWAP_ALLOWED_MODELS) under
  LLM_BACKEND=llamacpp. Same envelope shape either way.
- New ai::embed_one helper routes embeddings through llama-swap when
  LLM_BACKEND=llamacpp (else Ollama). Wires it into the four
  insight_generator embedding sites.
- Cross-replay matrix simplifies to pre-llamacpp shape (local↔local,
  hybrid↔hybrid, hybrid→local allowed; local→hybrid rejected).
Cameron Cordes
2026-05-21 11:36:58 -04:00
parent d14df63f19
commit be51421b38
9 changed files with 338 additions and 301 deletions

@@ -473,9 +473,8 @@ GET /memories?path=...&recursive=true
POST /insights/generate (non-agentic single-shot)
POST /insights/generate/agentic (tool-calling loop; body: { file_path, backend?, model?, ... })
GET /insights?path=...&library=...
GET /insights/models (local Ollama models + capabilities)
GET /insights/models (local-backend models + capabilities; Ollama OR llama-swap based on LLM_BACKEND)
GET /insights/openrouter/models (curated OpenRouter allowlist)
GET /insights/llamacpp/models (curated llama-swap slot allowlist)
POST /insights/rate (thumbs up/down for training data)
// Insight Chat Continuation
@@ -632,22 +631,27 @@ OPENROUTER_EMBEDDING_MODEL=openai/text-embedding-3-small # Optional, embeddings
OPENROUTER_HTTP_REFERER=https://your-site.example # Optional attribution header
OPENROUTER_APP_TITLE=ImageApi # Optional attribution header
# llama.cpp / llama-swap (Llamacpp Backend) - sibling to Ollama; OpenAI-compatible
# Local LLM backend switch. `ollama` (default) keeps the OLLAMA_* settings
# above; `llamacpp` swaps the entire local stack (chat + vision describe +
# embeddings) over to llama-swap. The switch is global and applies to
# `backend=local` requests and to `backend=hybrid`'s describe pass (hybrid
# chat still goes to OpenRouter). Don't flip mid-deploy without
# re-embedding — mixed vector spaces break similarity search.
LLM_BACKEND=ollama
# llama.cpp / llama-swap (used when LLM_BACKEND=llamacpp). OpenAI-compatible
# proxy hosting one or more llama-server processes (chat / vision / embed slots).
LLAMA_SWAP_URL=http://localhost:9292/v1 # Required to enable llamacpp backend
LLAMA_SWAP_URL=http://localhost:9292/v1 # Required when LLM_BACKEND=llamacpp
LLAMA_SWAP_PRIMARY_MODEL=chat # Chat slot id (matches config.yaml)
LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here
LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id (when local embeddings via llamacpp)
LLAMA_SWAP_VISION_MODELS=qwen-vl,llava # Comma-separated slot ids known to have vision.
# Drives `has_vision` in /insights/llamacpp/models.
# `LLAMA_SWAP_VISION_MODEL` is auto-included.
LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist exposed to clients via
# GET /insights/llamacpp/models. Empty = no picker.
LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here.
# The only slot reported as has_vision=true in
# /insights/models — chat slots are treated as
# text-only (images pre-described and inlined).
LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id
LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist surfaced by GET /insights/models
# when LLM_BACKEND=llamacpp. Empty = picker shows
# only the configured primary model.
LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180 # Per-request timeout; bump for slow CPU offload
HYBRID_VISION_BACKEND=llamacpp # Optional override for hybrid mode's describe_image:
# `ollama` (default) or `llamacpp`. When `llamacpp`,
# hybrid still routes chat to OpenRouter but uses
# llama-swap's vision slot to describe images.
# Insight Chat Continuation
AGENTIC_CHAT_MAX_ITERATIONS=6 # Cap on tool-calling iterations per chat turn (default 6)
@@ -668,13 +672,36 @@ The `OllamaClient` provides methods to query available models:
This allows runtime verification of model availability before generating insights.
**Local backend switch (`LLM_BACKEND`):**
One env var decides which "local" stack the server runs against — `ollama`
(default) or `llamacpp`. It's global on purpose: chat, vision describe, and
embeddings all route through the same backend, so the embedding-vector
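column in SQLite stays in one vector space. Don't flip mid-deploy without
re-embedding the affected rows: vectors produced by different embedding
models are not comparable, so similarity search over a mixed column
returns meaningless rankings.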
- `LLM_BACKEND=ollama`: chat and embeddings use Ollama; vision describe
uses Ollama's multimodal model.
- `LLM_BACKEND=llamacpp`: chat hits llama-swap's `chat` slot (which is
treated as text-only — images are pre-described via the `vision` slot
and inlined), embeddings hit the `embed` slot, vision describe hits the
`vision` slot. Requires `LLAMA_SWAP_URL`.
The per-request `backend=hybrid` override is orthogonal: it always sends
chat to OpenRouter, but the describe pass still routes through whichever
`LLM_BACKEND` is configured.
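
The routing rules above can be sketched as a small pure function. This is an illustrative sketch only, not the project's actual code: the type and function names (`LlmBackend`, `RequestBackend`, `Task`, `Target`, `route`) are hypothetical, the empty-string fallback to `ollama` is an assumption based on the documented default, and embeddings are assumed to follow `LLM_BACKEND` regardless of the per-request backend.

```rust
use std::str::FromStr;

/// Deploy-time switch, read once from the LLM_BACKEND env var (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LlmBackend {
    Ollama,
    LlamaCpp,
}

impl FromStr for LlmBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            // Assumption: an unset or empty value falls back to the documented default.
            "" | "ollama" => Ok(LlmBackend::Ollama),
            "llamacpp" => Ok(LlmBackend::LlamaCpp),
            other => Err(format!("unsupported LLM_BACKEND: {other}")),
        }
    }
}

/// Per-request `backend` value; the whitelist is "local" | "hybrid".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestBackend {
    Local,
    Hybrid,
}

impl FromStr for RequestBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "local" => Ok(RequestBackend::Local),
            "hybrid" => Ok(RequestBackend::Hybrid),
            // "llamacpp" lands here: it is not a per-request value.
            other => Err(format!("unsupported backend: {other}")),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Task {
    Chat,
    Describe,
    Embed,
}

/// Where a call ends up (hypothetical names for the slots and services).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Target {
    Ollama,
    LlamaSwapChat,
    LlamaSwapVision,
    LlamaSwapEmbed,
    OpenRouter,
}

fn route(deploy: LlmBackend, request: RequestBackend, task: Task) -> Target {
    match (request, task, deploy) {
        // Hybrid only changes where chat goes.
        (RequestBackend::Hybrid, Task::Chat, _) => Target::OpenRouter,
        // Everything else follows the deploy-time switch.
        (_, _, LlmBackend::Ollama) => Target::Ollama,
        (_, Task::Chat, LlmBackend::LlamaCpp) => Target::LlamaSwapChat,
        (_, Task::Describe, LlmBackend::LlamaCpp) => Target::LlamaSwapVision,
        (_, Task::Embed, LlmBackend::LlamaCpp) => Target::LlamaSwapEmbed,
    }
}
```

Under these assumptions, a `backend=hybrid` request on an `LLM_BACKEND=llamacpp` deploy sends chat to OpenRouter and the describe pass to the llama-swap vision slot, while `"llamacpp".parse::<RequestBackend>()` returns an error.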
`GET /insights/models` returns the local-backend models with capabilities
in the same envelope shape regardless of `LLM_BACKEND`: Ollama servers
when `ollama`, llama-swap slots (from `LLAMA_SWAP_ALLOWED_MODELS`) when
`llamacpp`. The separate `/insights/llamacpp/models` endpoint is gone;
the picker reads this single endpoint.
**Hybrid Backend (OpenRouter):**
- Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`.
- Vision describe happens before the agentic loop; the description is inlined
into the chat prompt and the agentic loop runs on OpenRouter. By default
vision uses local Ollama, but `HYBRID_VISION_BACKEND=llamacpp` flips it to
llama-swap's vision slot (useful when you want chat on a frontier model and
vision on a local-but-not-Ollama path).
into the chat prompt and the agentic loop runs on OpenRouter. Vision
routes through whichever `LLM_BACKEND` is configured.
- `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that
call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`.
- No live capability precheck — the operator-curated allowlist is trusted.
@@ -682,29 +709,14 @@ This allows runtime verification of model availability before generating insight
- `GET /insights/openrouter/models` returns `{ models, default_model, configured }`
for client picker UIs.
**Llamacpp Backend (llama-swap):**
- Per-request opt-in via `backend=llamacpp` on `POST /insights/generate/agentic`.
- Sibling to Ollama: a local OpenAI-compatible proxy (mostlygeek/llama-swap)
fronting one or more `llama-server` processes. The chat slot is text-only
by default; vision and embeddings have their own slots (`LLAMA_SWAP_VISION_MODEL`,
`LLAMA_SWAP_EMBEDDING_MODEL`) that llama-swap routes to by model id. The
bundled `docker-compose.yml` + `llama-swap/config.yaml` in the opencode root
is the reference deploy.
- Operates in the same describe-then-inline shape as hybrid: the chat model
never sees raw images. Vision describe routes through llama-swap's vision
slot (`describe_image` on `LlamaCppClient`).
- `request.model` (if provided) overrides `LLAMA_SWAP_PRIMARY_MODEL` for that
call (must match a slot id in llama-swap's `config.yaml`). The mobile picker
reads from `LLAMA_SWAP_ALLOWED_MODELS`.
- No live capability precheck — slot ids are trusted. Tool calling is assumed
for every slot (llama-swap entries typically launch with `--jinja`).
- `GET /insights/llamacpp/models` returns `{ models, default_model, configured }`.
- Cross-replay matrix (chat continuation): `local ↔ llamacpp` allowed (the
LlamaCppClient passes images through to the chat slot — you're responsible
for a vision-capable slot if the stored transcript carries images);
`hybrid ↔ llamacpp` allowed (both transcripts are text-only); `local →
hybrid` and `llamacpp → hybrid` rejected (mid-conversation description
source change isn't supported).
**Cross-replay matrix (chat continuation):**
- `local → local` allowed (whether served by Ollama or llama-swap; that's
a deploy-time decision, not a request-time one).
- `hybrid → hybrid` allowed.
- `hybrid → local` allowed (the inlined description replays as text).
- `local → hybrid` rejected — the stored transcript has raw images in the
first user message and OpenRouter providers don't accept that shape
consistently. Regenerate the insight in hybrid mode instead.
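
The matrix above reduces to a single rejected pair. A minimal sketch of that check follows; the names `ChatBackend` and `check_replay` are hypothetical and the error text is illustrative, not the server's actual message.

```rust
/// Backend recorded on a stored transcript or requested for a continuation
/// (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChatBackend {
    Local,
    Hybrid,
}

/// Returns Ok(()) when a transcript generated under `stored` may be
/// continued under `requested`.
fn check_replay(stored: ChatBackend, requested: ChatBackend) -> Result<(), String> {
    match (stored, requested) {
        // The stored local transcript carries raw images in the first user message.
        (ChatBackend::Local, ChatBackend::Hybrid) => Err(
            "local -> hybrid replay is not supported; regenerate the insight in hybrid mode"
                .to_string(),
        ),
        // local -> local, hybrid -> hybrid, and hybrid -> local are all allowed.
        _ => Ok(()),
    }
}
```

Whether `local` is served by Ollama or llama-swap does not appear in the check, which matches the point that it is a deploy-time decision.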
**Insight Chat Continuation:**