ai: collapse llamacpp into LLM_BACKEND env switch

Reverts the per-request backend="llamacpp" value. Chat/vision/embedding
backend is now a deploy-time decision (LLM_BACKEND=ollama|llamacpp), applied
globally across chat, vision describe, and embeddings, so embedding vectors
stay in one space across the index.

- Per-request backend whitelist back to "local"|"hybrid". A request arriving
  with backend="llamacpp" is rejected.
- LLM_BACKEND=llamacpp swaps the entire local stack to llama-swap: chat hits
  the chat slot, describe hits the vision slot, embeddings hit the embed slot.
  Hybrid mode still routes chat to OpenRouter but uses LLM_BACKEND for the
  describe pass.
- Drops env vars HYBRID_VISION_BACKEND, LLAMA_SWAP_VISION_MODELS,
  EMBEDDING_BACKEND (the last never shipped). Drops the
  LlamaCppClient.vision_models allowlist; capability inference now reports
  has_vision only for the configured vision_model slot.
- Drops the /insights/llamacpp/models handler. /insights/models is the single
  endpoint; it returns Ollama servers under LLM_BACKEND=ollama and llama-swap
  slots (from LLAMA_SWAP_ALLOWED_MODELS) under LLM_BACKEND=llamacpp. Same
  envelope shape either way.
- New ai::embed_one helper routes embeddings through llama-swap when
  LLM_BACKEND=llamacpp (else Ollama). Wires it into the four
  insight_generator embedding sites.
- Cross-replay matrix simplifies to pre-llamacpp shape (local↔local,
  hybrid↔hybrid, hybrid→local allowed; local→hybrid rejected).
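
The narrowed per-request whitelist could be sketched roughly as below. This is a minimal illustration, not the handler code from this commit: `RequestBackend` and `parse_request_backend` are hypothetical names, and treating a missing `backend` field as "local" is an assumption.

```rust
use std::str::FromStr;

/// Per-request backend values accepted after this change (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestBackend {
    Local,
    Hybrid,
}

impl FromStr for RequestBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "local" => Ok(Self::Local),
            "hybrid" => Ok(Self::Hybrid),
            // "llamacpp" is no longer a request-time value, so it falls through here.
            other => Err(format!(
                "unsupported backend {other:?}; expected \"local\" or \"hybrid\""
            )),
        }
    }
}

/// Assumption: an absent `backend` field defaults to the local stack.
pub fn parse_request_backend(raw: Option<&str>) -> Result<RequestBackend, String> {
    match raw {
        None => Ok(RequestBackend::Local),
        Some(s) => s.parse(),
    }
}
```

With this shape, `parse_request_backend(Some("llamacpp"))` returns an `Err`, which a handler could map to a 4xx response.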
96
CLAUDE.md
@@ -473,9 +473,8 @@ GET /memories?path=...&recursive=true
 POST /insights/generate (non-agentic single-shot)
 POST /insights/generate/agentic (tool-calling loop; body: { file_path, backend?, model?, ... })
 GET /insights?path=...&library=...
-GET /insights/models (local Ollama models + capabilities)
+GET /insights/models (local-backend models + capabilities; Ollama OR llama-swap based on LLM_BACKEND)
 GET /insights/openrouter/models (curated OpenRouter allowlist)
-GET /insights/llamacpp/models (curated llama-swap slot allowlist)
 POST /insights/rate (thumbs up/down for training data)

 // Insight Chat Continuation
@@ -632,22 +631,27 @@ OPENROUTER_EMBEDDING_MODEL=openai/text-embedding-3-small # Optional, embeddings
 OPENROUTER_HTTP_REFERER=https://your-site.example # Optional attribution header
 OPENROUTER_APP_TITLE=ImageApi # Optional attribution header

-# llama.cpp / llama-swap (Llamacpp Backend) - sibling to Ollama; OpenAI-compatible
+# Local LLM backend switch. `ollama` (default) keeps the OLLAMA_* settings
+# above; `llamacpp` swaps the entire local stack (chat + vision describe +
+# embeddings) over to llama-swap. The switch is global and applies to
+# `backend=local` requests and to `backend=hybrid`'s describe pass (hybrid
+# chat still goes to OpenRouter). Don't flip mid-deploy without
+# re-embedding — mixed vector spaces break similarity search.
+LLM_BACKEND=ollama
+
+# llama.cpp / llama-swap (used when LLM_BACKEND=llamacpp). OpenAI-compatible
 # proxy hosting one or more llama-server processes (chat / vision / embed slots).
-LLAMA_SWAP_URL=http://localhost:9292/v1 # Required to enable llamacpp backend
+LLAMA_SWAP_URL=http://localhost:9292/v1 # Required when LLM_BACKEND=llamacpp
 LLAMA_SWAP_PRIMARY_MODEL=chat # Chat slot id (matches config.yaml)
-LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here
-LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id (when local embeddings via llamacpp)
-LLAMA_SWAP_VISION_MODELS=qwen-vl,llava # Comma-separated slot ids known to have vision.
-# Drives `has_vision` in /insights/llamacpp/models.
-# `LLAMA_SWAP_VISION_MODEL` is auto-included.
-LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist exposed to clients via
-# GET /insights/llamacpp/models. Empty = no picker.
+LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here.
+# The only slot reported as has_vision=true in
+# /insights/models — chat slots are treated as
+# text-only (images pre-described and inlined).
+LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id
+LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist surfaced by GET /insights/models
+# when LLM_BACKEND=llamacpp. Empty = picker shows
+# only the configured primary model.
 LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180 # Per-request timeout; bump for slow CPU offload
-HYBRID_VISION_BACKEND=llamacpp # Optional override for hybrid mode's describe_image:
-# `ollama` (default) or `llamacpp`. When `llamacpp`,
-# hybrid still routes chat to OpenRouter but uses
-# llama-swap's vision slot to describe images.

 # Insight Chat Continuation
 AGENTIC_CHAT_MAX_ITERATIONS=6 # Cap on tool-calling iterations per chat turn (default 6)
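
As a rough illustration of the env block in the hunk above, the switch could be resolved once at startup along these lines. This is a sketch under stated assumptions rather than the code in this commit: `LlmBackend` and `resolve_backend` are hypothetical names, and failing fast when `LLAMA_SWAP_URL` is missing, as well as rejecting unknown values, are assumed behaviours. The lookup is passed in as a closure so the function stays testable without touching the process environment.

```rust
/// Deploy-time local backend, mirroring LLM_BACKEND=ollama|llamacpp (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LlmBackend {
    Ollama,
    /// Carries the llama-swap base URL taken from LLAMA_SWAP_URL.
    LlamaCpp { base_url: String },
}

/// Resolve the backend from an env-style lookup (hypothetical helper).
pub fn resolve_backend<F>(lookup: F) -> Result<LlmBackend, String>
where
    F: Fn(&str) -> Option<String>,
{
    // `ollama` is the documented default when LLM_BACKEND is unset.
    let raw = lookup("LLM_BACKEND").unwrap_or_else(|| "ollama".to_string());
    match raw.trim() {
        "ollama" => Ok(LlmBackend::Ollama),
        "llamacpp" => {
            // Assumption: a missing or blank URL is a startup error.
            let base_url = lookup("LLAMA_SWAP_URL")
                .filter(|u| !u.trim().is_empty())
                .ok_or_else(|| "LLM_BACKEND=llamacpp requires LLAMA_SWAP_URL".to_string())?;
            Ok(LlmBackend::LlamaCpp { base_url })
        }
        other => Err(format!("unknown LLM_BACKEND value: {other:?}")),
    }
}
```

In a real binary the call would look like `resolve_backend(|k| std::env::var(k).ok())`.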
@@ -668,13 +672,36 @@ The `OllamaClient` provides methods to query available models:

 This allows runtime verification of model availability before generating insights.

+**Local backend switch (`LLM_BACKEND`):**
+
+One env var decides which "local" stack the server runs against — `ollama`
+(default) or `llamacpp`. It's global on purpose: chat, vision describe, and
+embeddings all route through the same backend, so the embedding-vector
+column in SQLite stays in one vector space. Don't flip mid-deploy without
+re-embedding the affected rows — similarity search will collapse.
+
+- `LLM_BACKEND=ollama`: chat and embeddings use Ollama; vision describe
+uses Ollama's multimodal model.
+- `LLM_BACKEND=llamacpp`: chat hits llama-swap's `chat` slot (which is
+treated as text-only — images are pre-described via the `vision` slot
+and inlined), embeddings hit the `embed` slot, vision describe hits the
+`vision` slot. Requires `LLAMA_SWAP_URL`.
+
+The per-request `backend=hybrid` override is orthogonal: it always sends
+chat to OpenRouter, but the describe pass still routes through whichever
+`LLM_BACKEND` is configured.
+
+`GET /insights/models` returns the local-backend models with capabilities
+in the same envelope shape regardless of `LLM_BACKEND`: Ollama servers
+when `ollama`, llama-swap slots (from `LLAMA_SWAP_ALLOWED_MODELS`) when
+`llamacpp`. No `/insights/llamacpp/models` — the picker reads a single
+endpoint.
+
 **Hybrid Backend (OpenRouter):**
 - Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`.
 - Vision describe happens before the agentic loop; the description is inlined
-into the chat prompt and the agentic loop runs on OpenRouter. By default
-vision uses local Ollama, but `HYBRID_VISION_BACKEND=llamacpp` flips it to
-llama-swap's vision slot (useful when you want chat on a frontier model and
-vision on a local-but-not-Ollama path).
+into the chat prompt and the agentic loop runs on OpenRouter. Vision
+routes through whichever `LLM_BACKEND` is configured.
 - `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that
 call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`.
 - No live capability precheck — the operator-curated allowlist is trusted.
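
The routing rules added in the hunk above can be read as a small decision table. The sketch below is only an illustration of that table: every type and function name here (`LocalBackend`, `RequestBackend`, `Target`, `Route`, `route`) is hypothetical, and it assumes embeddings depend solely on `LLM_BACKEND`, as the commit message describes for `ai::embed_one`.

```rust
/// Deploy-time switch (LLM_BACKEND).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LocalBackend {
    Ollama,
    LlamaCpp,
}

/// Per-request `backend` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestBackend {
    Local,
    Hybrid,
}

/// Where a given leg of the pipeline is sent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    Ollama,
    LlamaSwapChat,
    LlamaSwapVision,
    LlamaSwapEmbed,
    OpenRouter,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route {
    pub chat: Target,
    pub describe: Target,
    pub embed: Target,
}

pub fn route(local: LocalBackend, request: RequestBackend) -> Route {
    // Describe and embeddings follow the deploy-time switch only.
    let (local_chat, describe, embed) = match local {
        LocalBackend::Ollama => (Target::Ollama, Target::Ollama, Target::Ollama),
        LocalBackend::LlamaCpp => (
            Target::LlamaSwapChat,
            Target::LlamaSwapVision,
            Target::LlamaSwapEmbed,
        ),
    };
    // Only the chat leg changes under backend=hybrid.
    let chat = match request {
        RequestBackend::Local => local_chat,
        RequestBackend::Hybrid => Target::OpenRouter,
    };
    Route { chat, describe, embed }
}
```

Keeping `describe` and `embed` out of the per-request match is what makes the switch global: no request value can move embeddings into a different vector space.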
@@ -682,29 +709,14 @@ This allows runtime verification of model availability before generating insight
 - `GET /insights/openrouter/models` returns `{ models, default_model, configured }`
 for client picker UIs.

-**Llamacpp Backend (llama-swap):**
-- Per-request opt-in via `backend=llamacpp` on `POST /insights/generate/agentic`.
-- Sibling to Ollama: a local OpenAI-compatible proxy (mostlygeek/llama-swap)
-fronting one or more `llama-server` processes. The chat slot is text-only
-by default; vision and embeddings have their own slots (`LLAMA_SWAP_VISION_MODEL`,
-`LLAMA_SWAP_EMBEDDING_MODEL`) that llama-swap routes to by model id. The
-bundled `docker-compose.yml` + `llama-swap/config.yaml` in the opencode root
-is the reference deploy.
-- Operates in the same describe-then-inline shape as hybrid: the chat model
-never sees raw images. Vision describe routes through llama-swap's vision
-slot (`describe_image` on `LlamaCppClient`).
-- `request.model` (if provided) overrides `LLAMA_SWAP_PRIMARY_MODEL` for that
-call (must match a slot id in llama-swap's `config.yaml`). The mobile picker
-reads from `LLAMA_SWAP_ALLOWED_MODELS`.
-- No live capability precheck — slot ids are trusted. Tool calling is assumed
-for every slot (llama-swap entries typically launch with `--jinja`).
-- `GET /insights/llamacpp/models` returns `{ models, default_model, configured }`.
-- Cross-replay matrix (chat continuation): `local ↔ llamacpp` allowed (the
-LlamaCppClient passes images through to the chat slot — you're responsible
-for a vision-capable slot if the stored transcript carries images);
-`hybrid ↔ llamacpp` allowed (both transcripts are text-only); `local →
-hybrid` and `llamacpp → hybrid` rejected (mid-conversation description
-source change isn't supported).
+**Cross-replay matrix (chat continuation):**
+- `local → local` allowed (whether served by Ollama or llama-swap; that's
+a deploy-time decision, not a request-time one).
+- `hybrid → hybrid` allowed.
+- `hybrid → local` allowed (the inlined description replays as text).
+- `local → hybrid` rejected — the stored transcript has raw images in the
+first user message and OpenRouter providers don't accept that shape
+consistently. Regenerate the insight in hybrid mode instead.

 **Insight Chat Continuation:**

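
The simplified cross-replay matrix from the last hunk reduces to a single rejected pair. A minimal sketch of that check follows; `Backend` and `check_replay` are hypothetical names used for illustration, not identifiers from the codebase.

```rust
/// Per-request backend recorded on a stored transcript or asked for on continuation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Local,
    Hybrid,
}

/// `stored` produced the transcript; `requested` is the backend for the next turn.
pub fn check_replay(stored: Backend, requested: Backend) -> Result<(), &'static str> {
    match (stored, requested) {
        // The stored first user message carries raw images, which the
        // OpenRouter leg is not expected to accept consistently.
        (Backend::Local, Backend::Hybrid) => {
            Err("local -> hybrid replay is not supported; regenerate the insight in hybrid mode")
        }
        // local -> local, hybrid -> hybrid, hybrid -> local
        _ => Ok(()),
    }
}
```

Because Ollama versus llama-swap is no longer visible at request time, the matrix needs only two values instead of three.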