ai: collapse llamacpp into LLM_BACKEND env switch
Reverts the per-request backend="llamacpp" value. Chat/vision/embedding backend is now a deploy-time decision (LLM_BACKEND=ollama|llamacpp), applied globally across chat, vision describe, and embeddings, so embedding vectors stay in one space across the index.

- Per-request backend whitelist is back to "local"|"hybrid". A request arriving with backend="llamacpp" is rejected.
- LLM_BACKEND=llamacpp swaps the entire local stack to llama-swap: chat hits the chat slot, describe hits the vision slot, embeddings hit the embed slot. Hybrid mode still routes chat to OpenRouter but uses LLM_BACKEND for the describe pass.
- Drops env vars HYBRID_VISION_BACKEND, LLAMA_SWAP_VISION_MODELS, EMBEDDING_BACKEND (the last never shipped). Drops the LlamaCppClient.vision_models allowlist; capability inference now reports has_vision only for the configured vision_model slot.
- Drops the /insights/llamacpp/models handler. /insights/models is the single endpoint; it returns Ollama servers under LLM_BACKEND=ollama and llama-swap slots (from LLAMA_SWAP_ALLOWED_MODELS) under LLM_BACKEND=llamacpp. Same envelope shape either way.
- New ai::embed_one helper routes embeddings through llama-swap when LLM_BACKEND=llamacpp (else Ollama). Wires it into the four insight_generator embedding sites.
- Cross-replay matrix simplifies to the pre-llamacpp shape (local↔local, hybrid↔hybrid, hybrid→local allowed; local→hybrid rejected).
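The routing and validation rules above can be sketched roughly as follows. This is an illustration only, not the code in this commit: the type and function names (`LlmBackend`, `RequestBackend`, `Task`, `Target`, `route`, `replay_allowed`) are hypothetical, the slot names are the example values from .env.example, and treating an empty LLM_BACKEND as "ollama" is an assumption.

```rust
use std::str::FromStr;

/// Deploy-time local stack, selected by the LLM_BACKEND env var.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LlmBackend {
    Ollama,
    LlamaCpp,
}

impl LlmBackend {
    /// Parse the raw env value. Unset falls back to Ollama (the documented
    /// default); treating an empty string the same way is an assumption.
    pub fn from_env_value(raw: Option<&str>) -> Result<Self, String> {
        match raw.map(str::trim) {
            None | Some("") | Some("ollama") => Ok(Self::Ollama),
            Some("llamacpp") => Ok(Self::LlamaCpp),
            Some(other) => Err(format!("unsupported LLM_BACKEND: {other}")),
        }
    }
}

/// Per-request backend. "llamacpp" is no longer a valid value here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestBackend {
    Local,
    Hybrid,
}

impl FromStr for RequestBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "local" => Ok(Self::Local),
            "hybrid" => Ok(Self::Hybrid),
            other => Err(format!("unsupported backend: {other}")),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Task {
    Chat,
    Describe,
    Embed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    OpenRouter,
    Ollama,
    /// Slot name as configured in llama-swap (example values only).
    LlamaSwapSlot(&'static str),
}

/// Hybrid chat goes to OpenRouter; everything else follows LLM_BACKEND.
pub fn route(request: RequestBackend, stack: LlmBackend, task: Task) -> Target {
    if request == RequestBackend::Hybrid && task == Task::Chat {
        return Target::OpenRouter;
    }
    match stack {
        LlmBackend::Ollama => Target::Ollama,
        LlmBackend::LlamaCpp => Target::LlamaSwapSlot(match task {
            Task::Chat => "chat",
            Task::Describe => "vision",
            Task::Embed => "embed",
        }),
    }
}

/// Cross-replay rule: only local -> hybrid is rejected.
pub fn replay_allowed(from: RequestBackend, to: RequestBackend) -> bool {
    !(from == RequestBackend::Local && to == RequestBackend::Hybrid)
}
```

Because embeddings always follow `LlmBackend` regardless of the per-request value, every vector written to the index comes from the same model, which is the property the message calls out.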
32
.env.example
@@ -53,26 +53,30 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
 # OPENROUTER_HTTP_REFERER=https://your-site.example
 # OPENROUTER_APP_TITLE=ImageApi
 
+# ── AI Insights — local backend switch ──────────────────────────────────
+# Picks which local LLM stack the server uses for chat, vision describe,
+# and embeddings. `ollama` (default) uses the OLLAMA_* settings above;
+# `llamacpp` uses the LLAMA_SWAP_* settings below. The switch is global
+# and applies to both `backend=local` and `backend=hybrid` (hybrid keeps
+# chat on OpenRouter but still uses this stack for the describe pass).
+# Don't flip mid-deploy without re-embedding existing index rows —
+# mixed vector spaces break similarity search.
+# LLM_BACKEND=ollama
+
 # ── AI Insights — llama.cpp / llama-swap (optional) ─────────────────────
-# Set LLAMA_SWAP_URL to enable the `llamacpp` chat_backend. Talks
-# OpenAI-compatible /v1 to a llama-swap proxy that fronts per-slot
-# llama-server instances (chat / vision / embed). Like hybrid, the
-# agentic loop describes images via the vision slot then inlines the
-# text into the chat slot — so the chat slot itself can be text-only.
+# Set LLAMA_SWAP_URL plus LLM_BACKEND=llamacpp to swap the local stack
+# off Ollama. Talks OpenAI-compatible /v1 to a llama-swap proxy fronting
+# per-slot llama-server instances (chat / vision / embed). The chat slot
+# is treated as text-only — images are pre-described via the vision slot
+# and inlined into the prompt.
 # LLAMA_SWAP_URL=http://localhost:9292/v1
 # LLAMA_SWAP_PRIMARY_MODEL=chat
 # LLAMA_SWAP_VISION_MODEL=vision
 # LLAMA_SWAP_EMBEDDING_MODEL=embed
-# Comma-separated allowlist of model ids the /v1/models endpoint should
-# advertise as vision-capable (llama-swap doesn't report modality).
-# LLAMA_SWAP_VISION_MODELS=vision
-# Comma-separated allowlist surfaced by /insights/llamacpp/models.
+# Comma-separated allowlist surfaced by /insights/models when
+# LLM_BACKEND=llamacpp.
 # LLAMA_SWAP_ALLOWED_MODELS=chat,vision,embed
-# LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=120
-# Routes hybrid mode's vision-describe pass through llama-swap's vision
-# slot instead of Ollama (chat still goes to OpenRouter). Values:
-# `ollama` (default) | `llamacpp`.
-# HYBRID_VISION_BACKEND=ollama
+# LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180
 
 # ── AI Insights — sibling services (optional) ───────────────────────────
 # Apollo (places, face inference, CLIP encoders). Single-Apollo deploys
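For orientation, here is a minimal sketch of how the LLAMA_SWAP_* settings in this file might be loaded. It is not taken from the repository: `LlamaSwapConfig` and `from_lookup` are hypothetical names, and using the commented-out example values (including a 180-second timeout) as fallbacks is an assumption, not confirmed server behavior. The lookup is passed in as a closure so the parsing can be exercised without touching the process environment.

```rust
use std::time::Duration;

/// Hypothetical holder for the LLAMA_SWAP_* settings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LlamaSwapConfig {
    pub url: String,
    pub primary_model: String,
    pub vision_model: String,
    pub embedding_model: String,
    pub allowed_models: Vec<String>,
    pub request_timeout: Duration,
}

impl LlamaSwapConfig {
    /// Returns Ok(None) when LLAMA_SWAP_URL is unset, i.e. llama-swap is
    /// not configured. Fallback values mirror .env.example (assumption).
    pub fn from_lookup<F>(get: F) -> Result<Option<Self>, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let Some(url) = get("LLAMA_SWAP_URL") else {
            return Ok(None);
        };
        let or_default =
            |key: &str, default: &str| get(key).unwrap_or_else(|| default.to_string());

        let timeout_secs = match get("LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS") {
            Some(raw) => raw
                .trim()
                .parse::<u64>()
                .map_err(|e| format!("bad LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS: {e}"))?,
            None => 180,
        };

        // Split the comma-separated allowlist, dropping blanks and padding.
        let allowed_models = or_default("LLAMA_SWAP_ALLOWED_MODELS", "chat,vision,embed")
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect();

        Ok(Some(Self {
            url,
            primary_model: or_default("LLAMA_SWAP_PRIMARY_MODEL", "chat"),
            vision_model: or_default("LLAMA_SWAP_VISION_MODEL", "vision"),
            embedding_model: or_default("LLAMA_SWAP_EMBEDDING_MODEL", "embed"),
            allowed_models,
            request_timeout: Duration::from_secs(timeout_secs),
        }))
    }
}
```

In a real binary the lookup would typically be `|key| std::env::var(key).ok()`.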