ai: add llamacpp backend (llama-swap) as third LLM client

Wires a new LlamaCppClient (OpenAI-compatible /v1 wire format) alongside OllamaClient and OpenRouterClient. Per-slot routing for chat/vision/embed via env (LLAMA_SWAP_URL + *_MODEL vars); capability inference uses an env allowlist since /v1/models doesn't report modality.

InsightGenerator + InsightChatService gain three-way dispatch on chat_backend = "local" | "hybrid" | "llamacpp". Hybrid and llamacpp share the describe-then-inline path (text-only chat after a separate vision describe). HYBRID_VISION_BACKEND=llamacpp lets hybrid route its describe pass through llama-swap's vision slot while chat still goes to OpenRouter.

Cross-replay matrix added (validate_cross_replay): local<->llamacpp and hybrid<->llamacpp allowed; local->hybrid and llamacpp->hybrid rejected.

New /insights/llamacpp/models handler mirrors the OpenRouter shape.
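
The three-way dispatch described in the commit message can be pictured roughly as follows. This is a minimal sketch, not the commit's code: `ChatBackend`, `ChatTarget`, `Plan`, and `plan_for` are hypothetical names, and treating `local` as the one mode that skips describe-then-inline is an assumption inferred from the message, which only states that hybrid and llamacpp share that path.

```rust
use std::str::FromStr;

/// The three values the commit message lists for `chat_backend`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChatBackend {
    Local,
    Hybrid,
    Llamacpp,
}

impl FromStr for ChatBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "local" => Ok(Self::Local),
            "hybrid" => Ok(Self::Hybrid),
            "llamacpp" => Ok(Self::Llamacpp),
            other => Err(format!("unknown chat_backend: {other}")),
        }
    }
}

/// Hypothetical label for where the chat turn is sent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChatTarget {
    Ollama,
    OpenRouter,
    LlamaSwap,
}

/// Hypothetical routing decision for one request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Plan {
    pub chat: ChatTarget,
    /// True when a separate vision describe runs first and the chat is text-only.
    pub describe_then_inline: bool,
}

pub fn plan_for(backend: ChatBackend) -> Plan {
    match backend {
        // Assumption: local mode does not use the describe-then-inline path.
        ChatBackend::Local => Plan { chat: ChatTarget::Ollama, describe_then_inline: false },
        ChatBackend::Hybrid => Plan { chat: ChatTarget::OpenRouter, describe_then_inline: true },
        ChatBackend::Llamacpp => Plan { chat: ChatTarget::LlamaSwap, describe_then_inline: true },
    }
}
```

Parsing into an enum first means an unknown backend string fails once at the edge, and the `match` in `plan_for` stays exhaustive if a fourth backend is ever added.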
49
CLAUDE.md
@@ -475,6 +475,7 @@ POST /insights/generate/agentic (tool-calling loop; body: { file_path, back
 GET /insights?path=...&library=...
 GET /insights/models (local Ollama models + capabilities)
 GET /insights/openrouter/models (curated OpenRouter allowlist)
+GET /insights/llamacpp/models (curated llama-swap slot allowlist)
 POST /insights/rate (thumbs up/down for training data)

 // Insight Chat Continuation
@@ -631,6 +632,23 @@ OPENROUTER_EMBEDDING_MODEL=openai/text-embedding-3-small # Optional, embeddings
 OPENROUTER_HTTP_REFERER=https://your-site.example # Optional attribution header
 OPENROUTER_APP_TITLE=ImageApi # Optional attribution header

+# llama.cpp / llama-swap (Llamacpp Backend) - sibling to Ollama; OpenAI-compatible
+# proxy hosting one or more llama-server processes (chat / vision / embed slots).
+LLAMA_SWAP_URL=http://localhost:9292/v1 # Required to enable llamacpp backend
+LLAMA_SWAP_PRIMARY_MODEL=chat # Chat slot id (matches config.yaml)
+LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here
+LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id (when local embeddings via llamacpp)
+LLAMA_SWAP_VISION_MODELS=qwen-vl,llava # Comma-separated slot ids known to have vision.
+# Drives `has_vision` in /insights/llamacpp/models.
+# `LLAMA_SWAP_VISION_MODEL` is auto-included.
+LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist exposed to clients via
+# GET /insights/llamacpp/models. Empty = no picker.
+LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180 # Per-request timeout; bump for slow CPU offload
+HYBRID_VISION_BACKEND=llamacpp # Optional override for hybrid mode's describe_image:
+# `ollama` (default) or `llamacpp`. When `llamacpp`,
+# hybrid still routes chat to OpenRouter but uses
+# llama-swap's vision slot to describe images.
+
 # Insight Chat Continuation
 AGENTIC_CHAT_MAX_ITERATIONS=6 # Cap on tool-calling iterations per chat turn (default 6)
 ```
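
The vision allowlist in the env block above (`LLAMA_SWAP_VISION_MODELS`, with `LLAMA_SWAP_VISION_MODEL` auto-included) suggests a small set-building step behind `has_vision`. The following is a sketch under that reading only; `parse_csv`, `vision_slots`, and `has_vision` as free functions are hypothetical names, and trimming whitespace around entries is an assumption.

```rust
use std::collections::BTreeSet;

/// Split a comma-separated env value into trimmed, non-empty slot ids.
fn parse_csv(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Build the set of slot ids treated as vision-capable.
/// `vision_models` stands in for LLAMA_SWAP_VISION_MODELS and
/// `vision_model` for LLAMA_SWAP_VISION_MODEL (auto-included).
pub fn vision_slots(vision_models: Option<&str>, vision_model: Option<&str>) -> BTreeSet<String> {
    let mut slots: BTreeSet<String> = vision_models
        .map(parse_csv)
        .unwrap_or_default()
        .into_iter()
        .collect();
    if let Some(m) = vision_model.map(str::trim).filter(|m| !m.is_empty()) {
        slots.insert(m.to_string());
    }
    slots
}

/// A slot reports vision only if it is in the allowlist; nothing is probed live.
pub fn has_vision(slot: &str, slots: &BTreeSet<String>) -> bool {
    slots.contains(slot)
}
```

In a real service the two arguments would likely come from `std::env::var("LLAMA_SWAP_VISION_MODELS").ok()` and its sibling; taking them as parameters keeps the logic testable without touching process environment.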
@@ -652,8 +670,11 @@ This allows runtime verification of model availability before generating insight

 **Hybrid Backend (OpenRouter):**
 - Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`.
-- Local Ollama still describes the image (vision); the description is inlined
-into the chat prompt and the agentic loop runs on OpenRouter.
+- Vision describe happens before the agentic loop; the description is inlined
+into the chat prompt and the agentic loop runs on OpenRouter. By default
+vision uses local Ollama, but `HYBRID_VISION_BACKEND=llamacpp` flips it to
+llama-swap's vision slot (useful when you want chat on a frontier model and
+vision on a local-but-not-Ollama path).
 - `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that
 call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`.
 - No live capability precheck — the operator-curated allowlist is trusted.
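
The two per-request decisions in the hunk above (which backend describes the image, and which chat model is used) can be sketched as two small pure functions. This is illustrative only: `VisionBackend`, `hybrid_vision_backend`, and `resolve_chat_model` are hypothetical names, and both the case-insensitive match and the fallback to Ollama on unrecognized values are assumptions the source does not confirm.

```rust
/// Where hybrid mode sends its describe pass.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VisionBackend {
    Ollama,
    Llamacpp,
}

/// Interpret the HYBRID_VISION_BACKEND value; `ollama` is the documented default.
pub fn hybrid_vision_backend(raw: Option<&str>) -> VisionBackend {
    match raw.map(|v| v.trim().to_ascii_lowercase()).as_deref() {
        Some("llamacpp") => VisionBackend::Llamacpp,
        // Assumption: unset, empty, or unknown values fall back to the default.
        _ => VisionBackend::Ollama,
    }
}

/// `request.model`, when provided, overrides the configured default model.
pub fn resolve_chat_model<'a>(request_model: Option<&'a str>, default_model: &'a str) -> &'a str {
    request_model
        .filter(|m| !m.trim().is_empty())
        .unwrap_or(default_model)
}
```

The same override shape would apply to the llamacpp backend with `LLAMA_SWAP_PRIMARY_MODEL` as the default.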
@@ -661,6 +682,30 @@ This allows runtime verification of model availability before generating insight
 - `GET /insights/openrouter/models` returns `{ models, default_model, configured }`
 for client picker UIs.

+**Llamacpp Backend (llama-swap):**
+- Per-request opt-in via `backend=llamacpp` on `POST /insights/generate/agentic`.
+- Sibling to Ollama: a local OpenAI-compatible proxy (mostlygeek/llama-swap)
+fronting one or more `llama-server` processes. The chat slot is text-only
+by default; vision and embeddings have their own slots (`LLAMA_SWAP_VISION_MODEL`,
+`LLAMA_SWAP_EMBEDDING_MODEL`) that llama-swap routes to by model id. The
+bundled `docker-compose.yml` + `llama-swap/config.yaml` in the opencode root
+is the reference deploy.
+- Operates in the same describe-then-inline shape as hybrid: the chat model
+never sees raw images. Vision describe routes through llama-swap's vision
+slot (`describe_image` on `LlamaCppClient`).
+- `request.model` (if provided) overrides `LLAMA_SWAP_PRIMARY_MODEL` for that
+call (must match a slot id in llama-swap's `config.yaml`). The mobile picker
+reads from `LLAMA_SWAP_ALLOWED_MODELS`.
+- No live capability precheck — slot ids are trusted. Tool calling is assumed
+for every slot (llama-swap entries typically launch with `--jinja`).
+- `GET /insights/llamacpp/models` returns `{ models, default_model, configured }`.
+- Cross-replay matrix (chat continuation): `local ↔ llamacpp` allowed (the
+LlamaCppClient passes images through to the chat slot — you're responsible
+for a vision-capable slot if the stored transcript carries images);
+`hybrid ↔ llamacpp` allowed (both transcripts are text-only); `local →
+hybrid` and `llamacpp → hybrid` rejected (mid-conversation description
+source change isn't supported).
+
 **Insight Chat Continuation:**

 After an agentic insight is generated, the full `Vec<ChatMessage>` transcript is
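
The cross-replay matrix in the Llamacpp Backend bullets of the diff above reads naturally as a lookup on (stored backend, requested backend). The sketch below is not the commit's `validate_cross_replay`; `Backend`, `Replay`, and `cross_replay` are hypothetical names. Two points are assumptions: the text lists `hybrid ↔ llamacpp` as allowed and `llamacpp → hybrid` as rejected, and this sketch lets the explicit rejection win; same-backend continuation is treated as allowed. The `hybrid → local` pair is not mentioned in the source, so it is reported as unspecified rather than guessed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Local,
    Hybrid,
    Llamacpp,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Replay {
    Allowed,
    Rejected,
    /// The source text does not say what happens for this pair.
    Unspecified,
}

/// Decide whether a transcript stored under `stored` may continue on `requested`.
pub fn cross_replay(stored: Backend, requested: Backend) -> Replay {
    use Backend::*;
    match (stored, requested) {
        // Assumption: continuing on the same backend is always fine.
        (Local, Local) | (Hybrid, Hybrid) | (Llamacpp, Llamacpp) => Replay::Allowed,
        // local <-> llamacpp: images pass through to the chat slot.
        (Local, Llamacpp) | (Llamacpp, Local) => Replay::Allowed,
        // hybrid -> llamacpp: both transcripts are text-only.
        (Hybrid, Llamacpp) => Replay::Allowed,
        // Switching the description source mid-conversation is rejected.
        (Local, Hybrid) | (Llamacpp, Hybrid) => Replay::Rejected,
        // Not covered by the documented matrix.
        (Hybrid, Local) => Replay::Unspecified,
    }
}
```

Listing all nine pairs without a wildcard arm means the compiler flags the match if a fourth backend is introduced, which fits a rule table that is expected to grow.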