ai: collapse llamacpp into LLM_BACKEND env switch

Reverts the per-request backend="llamacpp" value. Chat/vision/embedding
backend is now a deploy-time decision (LLM_BACKEND=ollama|llamacpp),
applied globally across chat, vision describe, and embeddings — so
embedding vectors stay in one space across the index.
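
A minimal sketch of the deploy-time switch described above, assuming the value is parsed once at startup. The names `LlmBackend` and `resolve_backend` are hypothetical and not taken from this commit; the default of `ollama` follows the `.env` comment, while trimming and case-folding the value are illustrative assumptions.

```rust
use std::str::FromStr;

/// Which local stack serves chat, vision describe, and embeddings.
/// (Hypothetical type; the commit does not show the real definition.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum LlmBackend {
    #[default]
    Ollama,
    LlamaCpp,
}

impl FromStr for LlmBackend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumption: surrounding whitespace and letter case are ignored.
        match s.trim().to_ascii_lowercase().as_str() {
            "ollama" => Ok(Self::Ollama),
            "llamacpp" => Ok(Self::LlamaCpp),
            other => Err(format!("unsupported LLM_BACKEND value: {other:?}")),
        }
    }
}

/// Resolve the backend from an optional raw env value; unset means Ollama.
pub fn resolve_backend(raw: Option<&str>) -> Result<LlmBackend, String> {
    match raw {
        None => Ok(LlmBackend::default()),
        Some(value) => value.parse(),
    }
}
```

At startup this could be called as `resolve_backend(std::env::var("LLM_BACKEND").ok().as_deref())`. Resolving the value once and sharing it across chat, describe, and embedding call sites is what keeps every vector in the same space.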

- Per-request backend whitelist back to "local"|"hybrid". A request
  arriving with backend="llamacpp" is rejected.
- LLM_BACKEND=llamacpp swaps the entire local stack to llama-swap:
  chat hits the chat slot, describe hits the vision slot, embeddings
  hit the embed slot. Hybrid mode still routes chat to OpenRouter
  but uses LLM_BACKEND for the describe pass.
- Drops env vars HYBRID_VISION_BACKEND, LLAMA_SWAP_VISION_MODELS,
  EMBEDDING_BACKEND (the last never shipped). Drops the
  LlamaCppClient.vision_models allowlist — capability inference now
  reports has_vision only for the configured vision_model slot.
- Drops the /insights/llamacpp/models handler. /insights/models is
  the single endpoint; returns Ollama servers under LLM_BACKEND=ollama
  and llama-swap slots (from LLAMA_SWAP_ALLOWED_MODELS) under
  LLM_BACKEND=llamacpp. Same envelope shape either way.
- New ai::embed_one helper routes embeddings through llama-swap when
  LLM_BACKEND=llamacpp (else Ollama). Wires it into the four
  insight_generator embedding sites.
- Cross-replay matrix simplifies to pre-llamacpp shape (local↔local,
  hybrid↔hybrid, hybrid→local allowed; local→hybrid rejected).
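
The per-request whitelist and the cross-replay matrix from the list above can be sketched as follows. `RequestBackend`, `parse_request_backend`, and `replay_allowed` are hypothetical names, and the arrows are read as "original backend -> replay backend", which is an assumption about the matrix's direction.

```rust
/// Per-request backend values accepted after this change.
/// (Hypothetical type; the commit does not show the real definition.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestBackend {
    Local,
    Hybrid,
}

/// Parse the request's `backend` field. Anything outside the whitelist,
/// including "llamacpp", is rejected.
pub fn parse_request_backend(raw: &str) -> Result<RequestBackend, String> {
    match raw {
        "local" => Ok(RequestBackend::Local),
        "hybrid" => Ok(RequestBackend::Hybrid),
        other => Err(format!("unsupported backend: {other:?}")),
    }
}

/// Cross-replay rule: same-backend replays and hybrid -> local pass;
/// local -> hybrid is the only rejected pair.
pub fn replay_allowed(from: RequestBackend, to: RequestBackend) -> bool {
    !matches!((from, to), (RequestBackend::Local, RequestBackend::Hybrid))
}
```

With only two request values, the matrix reduces to a single rejected pair, so a `matches!` on that pair is enough.
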
Cameron Cordes
2026-05-21 11:36:58 -04:00
parent d14df63f19
commit be51421b38
9 changed files with 338 additions and 301 deletions

@@ -53,26 +53,30 @@ AGENTIC_CHAT_MAX_ITERATIONS=6
 # OPENROUTER_HTTP_REFERER=https://your-site.example
 # OPENROUTER_APP_TITLE=ImageApi
+# ── AI Insights — local backend switch ──────────────────────────────────
+# Picks which local LLM stack the server uses for chat, vision describe,
+# and embeddings. `ollama` (default) uses the OLLAMA_* settings above;
+# `llamacpp` uses the LLAMA_SWAP_* settings below. The switch is global
+# and applies to both `backend=local` and `backend=hybrid` (hybrid keeps
+# chat on OpenRouter but still uses this stack for the describe pass).
+# Don't flip mid-deploy without re-embedding existing index rows —
+# mixed vector spaces break similarity search.
+# LLM_BACKEND=ollama
 # ── AI Insights — llama.cpp / llama-swap (optional) ─────────────────────
-# Set LLAMA_SWAP_URL to enable the `llamacpp` chat_backend. Talks
-# OpenAI-compatible /v1 to a llama-swap proxy that fronts per-slot
-# llama-server instances (chat / vision / embed). Like hybrid, the
-# agentic loop describes images via the vision slot then inlines the
-# text into the chat slot — so the chat slot itself can be text-only.
+# Set LLAMA_SWAP_URL plus LLM_BACKEND=llamacpp to swap the local stack
+# off Ollama. Talks OpenAI-compatible /v1 to a llama-swap proxy fronting
+# per-slot llama-server instances (chat / vision / embed). The chat slot
+# is treated as text-only — images are pre-described via the vision slot
+# and inlined into the prompt.
 # LLAMA_SWAP_URL=http://localhost:9292/v1
 # LLAMA_SWAP_PRIMARY_MODEL=chat
 # LLAMA_SWAP_VISION_MODEL=vision
 # LLAMA_SWAP_EMBEDDING_MODEL=embed
-# Comma-separated allowlist of model ids the /v1/models endpoint should
-# advertise as vision-capable (llama-swap doesn't report modality).
-# LLAMA_SWAP_VISION_MODELS=vision
-# Comma-separated allowlist surfaced by /insights/llamacpp/models.
+# Comma-separated allowlist surfaced by /insights/models when
+# LLM_BACKEND=llamacpp.
 # LLAMA_SWAP_ALLOWED_MODELS=chat,vision,embed
-# LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=120
-# Routes hybrid mode's vision-describe pass through llama-swap's vision
-# slot instead of Ollama (chat still goes to OpenRouter). Values:
-# `ollama` (default) | `llamacpp`.
-# HYBRID_VISION_BACKEND=ollama
+# LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180
 # ── AI Insights — sibling services (optional) ───────────────────────────
 # Apollo (places, face inference, CLIP encoders). Single-Apollo deploys