ai: add llamacpp backend (llama-swap) as third LLM client

Wires a new LlamaCppClient (OpenAI-compatible /v1 wire format) alongside
OllamaClient and OpenRouterClient. Per-slot routing for chat/vision/embed
via env (LLAMA_SWAP_URL + *_MODEL vars); capability inference uses an
env allowlist since /v1/models doesn't report modality.

InsightGenerator + InsightChatService gain three-way dispatch on
chat_backend = "local" | "hybrid" | "llamacpp". Hybrid and llamacpp
share the describe-then-inline path (text-only chat after a separate
vision describe). HYBRID_VISION_BACKEND=llamacpp lets hybrid route its
describe pass through llama-swap's vision slot while chat still goes
to OpenRouter.

Cross-replay matrix added (validate_cross_replay): local<->llamacpp
and hybrid<->llamacpp allowed; local->hybrid and llamacpp->hybrid
rejected. New /insights/llamacpp/models handler mirrors the OpenRouter
shape.
Cameron Cordes
2026-05-20 17:52:33 -04:00
parent d04b86e32c
commit f0927f5355
9 changed files with 1468 additions and 102 deletions


@@ -475,6 +475,7 @@ POST /insights/generate/agentic (tool-calling loop; body: { file_path, back
GET /insights?path=...&library=...
GET /insights/models (local Ollama models + capabilities)
GET /insights/openrouter/models (curated OpenRouter allowlist)
GET /insights/llamacpp/models (curated llama-swap slot allowlist)
POST /insights/rate (thumbs up/down for training data)
// Insight Chat Continuation
@@ -631,6 +632,23 @@ OPENROUTER_EMBEDDING_MODEL=openai/text-embedding-3-small # Optional, embeddings
OPENROUTER_HTTP_REFERER=https://your-site.example # Optional attribution header
OPENROUTER_APP_TITLE=ImageApi # Optional attribution header
# llama.cpp / llama-swap (Llamacpp Backend) - sibling to Ollama; OpenAI-compatible
# proxy hosting one or more llama-server processes (chat / vision / embed slots).
LLAMA_SWAP_URL=http://localhost:9292/v1 # Required to enable llamacpp backend
LLAMA_SWAP_PRIMARY_MODEL=chat # Chat slot id (matches config.yaml)
LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here
LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id (used when embeddings run locally via llamacpp)
LLAMA_SWAP_VISION_MODELS=qwen-vl,llava # Comma-separated slot ids known to have vision.
# Drives `has_vision` in /insights/llamacpp/models.
# `LLAMA_SWAP_VISION_MODEL` is auto-included.
LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist exposed to clients via
# GET /insights/llamacpp/models. Empty = no picker.
LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180 # Per-request timeout; bump for slow CPU offload
HYBRID_VISION_BACKEND=llamacpp # Optional override for hybrid mode's describe_image:
# `ollama` (default) or `llamacpp`. When `llamacpp`,
# hybrid still routes chat to OpenRouter but uses
# llama-swap's vision slot to describe images.
# Insight Chat Continuation
AGENTIC_CHAT_MAX_ITERATIONS=6 # Cap on tool-calling iterations per chat turn (default 6)
```
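The `has_vision` rule described in the comments for `LLAMA_SWAP_VISION_MODELS` can be sketched roughly as below. This is a minimal illustration, not the project's code: `vision_slots`, `picker_entries`, and `ModelEntry` are hypothetical names, the raw env values are passed in as arguments rather than read from the environment, and trimming whitespace around ids is an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: the set of slot ids treated as vision-capable.
/// `vision_models` is the raw LLAMA_SWAP_VISION_MODELS value and
/// `vision_model` the raw LLAMA_SWAP_VISION_MODEL value, which is
/// auto-included when set.
fn vision_slots(vision_models: Option<&str>, vision_model: Option<&str>) -> BTreeSet<String> {
    let mut slots: BTreeSet<String> = vision_models
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect();
    if let Some(m) = vision_model.map(str::trim).filter(|m| !m.is_empty()) {
        slots.insert(m.to_string());
    }
    slots
}

/// Hypothetical picker entry; the real response may carry more fields.
#[derive(Debug, PartialEq)]
struct ModelEntry {
    id: String,
    has_vision: bool,
}

/// Builds picker entries from the raw LLAMA_SWAP_ALLOWED_MODELS value.
/// An unset or empty allowlist yields no entries ("Empty = no picker").
fn picker_entries(allowed_models: Option<&str>, vision: &BTreeSet<String>) -> Vec<ModelEntry> {
    allowed_models
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|id| ModelEntry {
            id: id.to_string(),
            has_vision: vision.contains(id),
        })
        .collect()
}
```

In a real handler the arguments would come from something like `std::env::var("LLAMA_SWAP_VISION_MODELS").ok()`; keeping the parsing pure makes the allowlist logic easy to test without touching process state.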
@@ -652,8 +670,11 @@ This allows runtime verification of model availability before generating insight
**Hybrid Backend (OpenRouter):**
- Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`.
- Local Ollama still describes the image (vision); the description is inlined
into the chat prompt and the agentic loop runs on OpenRouter.
- Vision describe happens before the agentic loop; the description is inlined
into the chat prompt and the agentic loop runs on OpenRouter. By default
vision uses local Ollama, but `HYBRID_VISION_BACKEND=llamacpp` flips it to
llama-swap's vision slot (useful when you want chat on a frontier model and
vision on a local-but-not-Ollama path).
- `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that
call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`.
- No live capability precheck — the operator-curated allowlist is trusted.
@@ -661,6 +682,30 @@ This allows runtime verification of model availability before generating insight
- `GET /insights/openrouter/models` returns `{ models, default_model, configured }`
for client picker UIs.
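
The routing that this change describes for the three `backend` values, including the `HYBRID_VISION_BACKEND` override, might look like the sketch below. All names here (`ChatBackend`, `Provider`, `Route`, `parse_backend`, `route`) are hypothetical and do not mirror the actual `InsightGenerator` code. Two points are assumptions: the local backend is modelled as sending images straight to Ollama with no describe pass, and an unrecognised override value falls back to `ollama`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChatBackend {
    Local,
    Hybrid,
    Llamacpp,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Provider {
    Ollama,
    OpenRouter,
    LlamaSwap,
}

/// Where chat and vision requests go for one insight call.
#[derive(Debug, PartialEq)]
struct Route {
    chat: Provider,
    vision: Provider,
    /// True when the image is described first and only text reaches the chat model.
    describe_then_inline: bool,
}

/// Parses the per-request `backend` value; unknown strings are rejected.
fn parse_backend(s: &str) -> Option<ChatBackend> {
    match s {
        "local" => Some(ChatBackend::Local),
        "hybrid" => Some(ChatBackend::Hybrid),
        "llamacpp" => Some(ChatBackend::Llamacpp),
        _ => None,
    }
}

/// `hybrid_vision_backend` is the raw HYBRID_VISION_BACKEND value, if set.
/// It only affects hybrid mode.
fn route(backend: ChatBackend, hybrid_vision_backend: Option<&str>) -> Route {
    match backend {
        // Assumption: local mode has no separate describe pass.
        ChatBackend::Local => Route {
            chat: Provider::Ollama,
            vision: Provider::Ollama,
            describe_then_inline: false,
        },
        ChatBackend::Hybrid => Route {
            chat: Provider::OpenRouter,
            vision: match hybrid_vision_backend.map(str::trim) {
                Some("llamacpp") => Provider::LlamaSwap,
                // Default, and assumed fallback for unrecognised values.
                _ => Provider::Ollama,
            },
            describe_then_inline: true,
        },
        ChatBackend::Llamacpp => Route {
            chat: Provider::LlamaSwap,
            vision: Provider::LlamaSwap,
            describe_then_inline: true,
        },
    }
}
```

For example, `route(ChatBackend::Hybrid, Some("llamacpp"))` keeps chat on OpenRouter while the describe pass goes to llama-swap's vision slot.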
**Llamacpp Backend (llama-swap):**
- Per-request opt-in via `backend=llamacpp` on `POST /insights/generate/agentic`.
- Sibling to Ollama: a local OpenAI-compatible proxy (mostlygeek/llama-swap)
fronting one or more `llama-server` processes. The chat slot is text-only
by default; vision and embeddings have their own slots (`LLAMA_SWAP_VISION_MODEL`,
`LLAMA_SWAP_EMBEDDING_MODEL`) that llama-swap routes to by model id. The
bundled `docker-compose.yml` + `llama-swap/config.yaml` in the opencode root
is the reference deploy.
- Operates in the same describe-then-inline shape as hybrid: the chat model
never sees raw images. Vision describe routes through llama-swap's vision
slot (`describe_image` on `LlamaCppClient`).
- `request.model` (if provided) overrides `LLAMA_SWAP_PRIMARY_MODEL` for that
call (must match a slot id in llama-swap's `config.yaml`). The mobile picker
reads from `LLAMA_SWAP_ALLOWED_MODELS`.
- No live capability precheck — slot ids are trusted. Tool calling is assumed
for every slot (llama-swap entries typically launch with `--jinja`).
- `GET /insights/llamacpp/models` returns `{ models, default_model, configured }`.
- Cross-replay matrix (chat continuation): `local ↔ llamacpp` allowed (the
LlamaCppClient passes images through to the chat slot — you're responsible
for a vision-capable slot if the stored transcript carries images);
`hybrid ↔ llamacpp` allowed (both transcripts are text-only); `local →
hybrid` and `llamacpp → hybrid` rejected (mid-conversation description
source change isn't supported).
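
One way to read the cross-replay matrix above as code is sketched below. This is not the project's `validate_cross_replay`: `Backend` and `check_cross_replay` are hypothetical names, and the sketch makes three labelled assumptions. Replaying on the same backend is allowed. Where `hybrid ↔ llamacpp` overlaps with the explicit `llamacpp → hybrid` rejection, the rejection wins. `hybrid → local` is not mentioned in the text and is rejected conservatively.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Local,
    Hybrid,
    Llamacpp,
}

/// Checks whether a transcript stored under `stored` may be continued
/// on `requested`. Returns a human-readable reason on rejection.
fn check_cross_replay(stored: Backend, requested: Backend) -> Result<(), String> {
    use Backend::*;
    let allowed = match (stored, requested) {
        // Assumption: continuing on the same backend is always fine.
        (a, b) if a == b => true,
        // Stated as allowed: local <-> llamacpp.
        (Local, Llamacpp) | (Llamacpp, Local) => true,
        // Stated as allowed: both transcripts are text-only.
        (Hybrid, Llamacpp) => true,
        // Stated as rejected: local -> hybrid, llamacpp -> hybrid.
        // Assumption: hybrid -> local (not mentioned) is rejected too.
        _ => false,
    };
    if allowed {
        Ok(())
    } else {
        Err(format!(
            "cannot continue a {:?} transcript on {:?}",
            stored, requested
        ))
    }
}
```

Encoding the matrix as a single `match` keeps every allowed pair visible in one place, so adding a fourth backend only needs new arms rather than changes scattered through the chat service.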
**Insight Chat Continuation:**
After an agentic insight is generated, the full `Vec<ChatMessage>` transcript is