feature/llamacpp-backend #101

Merged
cameron merged 11 commits from feature/llamacpp-backend into master 2026-05-26 18:58:48 +00:00
2 changed files with 35 additions and 20 deletions
Showing only changes of commit fb388c29d7

@@ -66,15 +66,17 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
 # ── AI Insights — llama.cpp / llama-swap (optional) ─────────────────────
 # Set LLAMA_SWAP_URL plus LLM_BACKEND=llamacpp to swap the local stack
 # off Ollama. Talks OpenAI-compatible /v1 to a llama-swap proxy fronting
-# per-slot llama-server instances (chat / vision / embed). The chat slot
-# is treated as text-only — images are pre-described via the vision slot
-# and inlined into the prompt.
+# per-slot llama-server instances. Chat models receive images directly
+# via content-parts (vision-capable models assumed); a separate vision
+# slot is used only by the describe_photo tool and describe-image utility.
 # LLAMA_SWAP_URL=http://localhost:9292/v1
 # LLAMA_SWAP_PRIMARY_MODEL=chat
+# Optional dedicated vision slot for describe_image. Defaults to
+# PRIMARY_MODEL so describe_photo works without extra config.
 # LLAMA_SWAP_VISION_MODEL=vision
 # LLAMA_SWAP_EMBEDDING_MODEL=embed
 # Comma-separated allowlist surfaced by /insights/models when
-# LLM_BACKEND=llamacpp.
+# LLM_BACKEND=llamacpp. All report has_vision=true.
 # LLAMA_SWAP_ALLOWED_MODELS=chat,vision,embed
 # LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180
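
Reviewer sketch (not part of the diff): the fallback rules described in these comments could be resolved roughly as below. This is a minimal sketch under stated assumptions, not the PR's actual loader. `LlamaSwapConfig` and `llama_swap_config_from` are hypothetical names, and the defaults `chat`, `embed`, and `180` are simply the example values from the env file. The vision fallback to the primary model and the "empty allowlist shows only the primary model" rule are taken from the comments in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical settings struct; fields mirror the LLAMA_SWAP_* variables.
#[derive(Debug, Clone, PartialEq)]
pub struct LlamaSwapConfig {
    pub url: String,
    pub primary_model: String,
    pub vision_model: String,
    pub embedding_model: String,
    pub allowed_models: Vec<String>,
    pub request_timeout_seconds: u64,
}

/// Builds the config from env-style key/value pairs.
/// Returns None when LLAMA_SWAP_URL is missing or blank, since the docs
/// mark it as required for LLM_BACKEND=llamacpp.
pub fn llama_swap_config_from(vars: &HashMap<String, String>) -> Option<LlamaSwapConfig> {
    // Treat unset and blank values the same way (assumption).
    let get = |key: &str| -> Option<String> {
        vars.get(key)
            .map(|v| v.trim())
            .filter(|v| !v.is_empty())
            .map(str::to_owned)
    };

    let url = get("LLAMA_SWAP_URL")?;
    // Assumed default: the example slot id from the env file.
    let primary_model = get("LLAMA_SWAP_PRIMARY_MODEL").unwrap_or_else(|| "chat".to_owned());
    // Documented behavior: the vision slot falls back to the primary model.
    let vision_model = get("LLAMA_SWAP_VISION_MODEL").unwrap_or_else(|| primary_model.clone());
    // Assumed default: the example slot id from the env file.
    let embedding_model = get("LLAMA_SWAP_EMBEDDING_MODEL").unwrap_or_else(|| "embed".to_owned());

    // Comma-separated allowlist; blank entries are dropped.
    let mut allowed_models: Vec<String> = get("LLAMA_SWAP_ALLOWED_MODELS")
        .map(|raw| {
            raw.split(',')
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .map(str::to_owned)
                .collect()
        })
        .unwrap_or_default();
    // Documented behavior: an empty allowlist exposes only the primary model.
    if allowed_models.is_empty() {
        allowed_models.push(primary_model.clone());
    }

    // Assumed default: 180 seconds, the example value from the env file.
    let request_timeout_seconds = get("LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS")
        .and_then(|v| v.parse::<u64>().ok())
        .unwrap_or(180);

    Some(LlamaSwapConfig {
        url,
        primary_model,
        vision_model,
        embedding_model,
        allowed_models,
        request_timeout_seconds,
    })
}
```

Passing a map instead of reading `std::env` directly keeps the fallback logic testable; a caller could build the map with `std::env::vars().collect()`.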

@@ -640,17 +640,16 @@ OPENROUTER_APP_TITLE=ImageApi # Optional attribution header
 LLM_BACKEND=ollama
 # llama.cpp / llama-swap (used when LLM_BACKEND=llamacpp). OpenAI-compatible
-# proxy hosting one or more llama-server processes (chat / vision / embed slots).
+# proxy hosting one or more llama-server processes. Chat models receive
+# images directly via content-parts (all models assumed vision-capable).
 LLAMA_SWAP_URL=http://localhost:9292/v1 # Required when LLM_BACKEND=llamacpp
 LLAMA_SWAP_PRIMARY_MODEL=chat # Chat slot id (matches config.yaml)
-LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here.
-# The only slot reported as has_vision=true in
-# /insights/models — chat slots are treated as
-# text-only (images pre-described and inlined).
+LLAMA_SWAP_VISION_MODEL= # Dedicated vision slot for describe_image / describe_photo
+# tool. Defaults to PRIMARY_MODEL when unset.
 LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id
 LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist surfaced by GET /insights/models
-# when LLM_BACKEND=llamacpp. Empty = picker shows
-# only the configured primary model.
+# when LLM_BACKEND=llamacpp. All report has_vision=true.
+# Empty = picker shows only the configured primary model.
 LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180 # Per-request timeout; bump for slow CPU offload
 # Insight Chat Continuation
@@ -675,21 +674,35 @@ This allows runtime verification of model availability before generating insight
 **Local backend switch (`LLM_BACKEND`):**
 One env var decides which "local" stack the server runs against — `ollama`
-(default) or `llamacpp`. It's global on purpose: chat, vision describe, and
+(default) or `llamacpp`. It's global on purpose: chat, vision, and
 embeddings all route through the same backend, so the embedding-vector
 column in SQLite stays in one vector space. Don't flip mid-deploy without
 re-embedding the affected rows — similarity search will collapse.
-- `LLM_BACKEND=ollama`: chat and embeddings use Ollama; vision describe
-uses Ollama's multimodal model.
-- `LLM_BACKEND=llamacpp`: chat hits llama-swap's `chat` slot (which is
-treated as text-only — images are pre-described via the `vision` slot
-and inlined), embeddings hit the `embed` slot, vision describe hits the
-`vision` slot. Requires `LLAMA_SWAP_URL`.
+- `LLM_BACKEND=ollama`: chat, vision, and embeddings use Ollama. Vision
+capability is probed per-model via `/api/show`.
+- `LLM_BACKEND=llamacpp`: chat models receive images directly via OpenAI
+content-parts (all models assumed vision-capable). Embeddings hit the
+`embed` slot. A dedicated `LLAMA_SWAP_VISION_MODEL` slot (defaults to
+the chat model) handles `describe_image` for the `describe_photo` tool.
+Requires `LLAMA_SWAP_URL`.
 The per-request `backend=hybrid` override is orthogonal: it always sends
-chat to OpenRouter, but the describe pass still routes through whichever
-`LLM_BACKEND` is configured.
+chat to OpenRouter (text-only, images are pre-described and inlined), but
+the describe + embed passes still route through whichever `LLM_BACKEND`
+is configured.
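
Reviewer sketch (not part of the diff): the routing rules in the new text can be summarized as a small decision function. This is an illustration under stated assumptions, not the server's code. `LocalBackend`, `Provider`, `ImageHandling`, `Route`, `parse_llm_backend`, and `route` are all hypothetical names, and treating a blank `LLM_BACKEND` like an unset one is an assumption.

```rust
/// Which "local" stack LLM_BACKEND selects (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LocalBackend {
    Ollama,
    LlamaCpp,
}

/// Where a given pass is sent (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Ollama,
    LlamaSwap,
    OpenRouter,
}

/// How images reach the chat model, per the documentation in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageHandling {
    /// Ollama: vision capability is probed per model via /api/show.
    ProbedPerModel,
    /// llama.cpp: images go directly to the chat model as content-parts.
    SentDirectly,
    /// Hybrid: OpenRouter chat is text-only, so images are pre-described and inlined.
    PreDescribed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route {
    pub chat: Provider,
    pub describe: Provider,
    pub embed: Provider,
    pub images: ImageHandling,
}

/// Parses LLM_BACKEND. Unset (or blank, by assumption) means the documented default, ollama.
pub fn parse_llm_backend(raw: Option<&str>) -> Result<LocalBackend, String> {
    match raw.map(str::trim) {
        None | Some("") | Some("ollama") => Ok(LocalBackend::Ollama),
        Some("llamacpp") => Ok(LocalBackend::LlamaCpp),
        Some(other) => Err(format!("unsupported LLM_BACKEND: {other}")),
    }
}

/// Combines the global backend with the per-request hybrid override.
pub fn route(local: LocalBackend, hybrid: bool) -> Route {
    let local_provider = match local {
        LocalBackend::Ollama => Provider::Ollama,
        LocalBackend::LlamaCpp => Provider::LlamaSwap,
    };
    let images = if hybrid {
        ImageHandling::PreDescribed
    } else {
        match local {
            LocalBackend::Ollama => ImageHandling::ProbedPerModel,
            LocalBackend::LlamaCpp => ImageHandling::SentDirectly,
        }
    };
    Route {
        // Hybrid only redirects chat; describe and embed stay on the local stack.
        chat: if hybrid { Provider::OpenRouter } else { local_provider },
        describe: local_provider,
        embed: local_provider,
        images,
    }
}
```

Keeping `embed` pinned to the local provider in every branch reflects the stated reason the switch is global: all stored vectors must come from one embedding model.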
+**Backend dispatch (`ResolvedBackend`):**
+`InsightGenerator::resolve_backend(kind, overrides)` is the single entry
+point that builds clients for a request. Returns a `ResolvedBackend` with
+two roles: `.chat()` (the agentic/chat client) and `.local()` (local-only
+utility calls: rerank, describe_image, embeddings). `BackendKind` is an
+enum (`Local` | `Hybrid`) replacing the stringly-typed `"local"` /
+`"hybrid"` labels. `SamplingOverrides` groups model/ctx/temp/top_p/top_k/
+min_p per-request overrides. All downstream code (`execute_tool`,
+`run_streaming_agentic_loop`, etc.) takes `&ResolvedBackend` rather than
+individual client references.
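
Reviewer sketch (not part of the diff): the shape described above might look roughly like the following. Only the names `ResolvedBackend`, `BackendKind`, `SamplingOverrides`, `resolve_backend`, `execute_tool`, `.chat()`, and `.local()` come from the PR text. The field types, the `Client` stand-in, the provider labels, and the function bodies are assumptions for illustration. `resolve_backend` is written here as a free function, whereas the PR describes it as a method on `InsightGenerator`.

```rust
/// Replaces the stringly-typed "local" / "hybrid" labels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackendKind {
    Local,
    Hybrid,
}

/// Per-request sampling overrides; the field types are assumptions.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SamplingOverrides {
    pub model: Option<String>,
    pub ctx: Option<u32>,
    pub temp: Option<f32>,
    pub top_p: Option<f32>,
    pub top_k: Option<u32>,
    pub min_p: Option<f32>,
}

/// Hypothetical stand-in for a real HTTP client.
#[derive(Debug, Clone, PartialEq)]
pub struct Client {
    pub provider: &'static str,
    /// None means "use the provider's configured default model".
    pub model: Option<String>,
}

/// Bundles the two client roles so callers take one reference.
pub struct ResolvedBackend {
    chat: Client,
    local: Client,
}

impl ResolvedBackend {
    /// Client for the agentic/chat loop.
    pub fn chat(&self) -> &Client {
        &self.chat
    }

    /// Client for local-only utilities (rerank, describe_image, embeddings).
    pub fn local(&self) -> &Client {
        &self.local
    }
}

/// Single entry point that builds both clients for a request.
pub fn resolve_backend(kind: BackendKind, overrides: &SamplingOverrides) -> ResolvedBackend {
    let local = Client { provider: "local", model: None };
    let chat_provider = match kind {
        BackendKind::Local => "local",
        BackendKind::Hybrid => "openrouter",
    };
    // Assumption: the model override applies to the chat role only.
    let chat = Client { provider: chat_provider, model: overrides.model.clone() };
    ResolvedBackend { chat, local }
}

/// Downstream code receives &ResolvedBackend instead of separate clients.
pub fn execute_tool(backend: &ResolvedBackend, tool: &str) -> String {
    match tool {
        // describe_photo is a local-only utility call.
        "describe_photo" => format!("describe_image via {}", backend.local().provider),
        _ => format!("chat via {}", backend.chat().provider),
    }
}
```

With this shape, adding a third role later would change `ResolvedBackend` and `resolve_backend` only, not every function signature that currently accepts a backend.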
 `GET /insights/models` returns the local-backend models with capabilities
 in the same envelope shape regardless of `LLM_BACKEND`: Ollama servers