feature/llamacpp-backend #101

cameron · 2026-05-26T18:58:34Z

cameron commented

2026-05-26 18:58:34 +00:00

No description provided.

cameron added 11 commits 2026-05-26 18:58:35 +00:00

ai: add llamacpp backend (llama-swap) as third LLM client f0927f5355

Wires a new LlamaCppClient (OpenAI-compatible /v1 wire format) alongside
OllamaClient and OpenRouterClient. Per-slot routing for chat/vision/embed
via env (LLAMA_SWAP_URL + *_MODEL vars); capability inference uses an
env allowlist since /v1/models doesn't report modality.

InsightGenerator + InsightChatService gain three-way dispatch on
chat_backend = "local" | "hybrid" | "llamacpp". Hybrid and llamacpp
share the describe-then-inline path (text-only chat after a separate
vision describe). HYBRID_VISION_BACKEND=llamacpp lets hybrid route its
describe pass through llama-swap's vision slot while chat still goes
to OpenRouter.

Cross-replay matrix added (validate_cross_replay): local<->llamacpp
and hybrid<->llamacpp allowed; local->hybrid and llamacpp->hybrid
rejected. New /insights/llamacpp/models handler mirrors the OpenRouter
shape.

env.example: document LLAMA_SWAP_* + HYBRID_VISION_BACKEND vars d14df63f19

Mirrors the section added to CLAUDE.md so deploys can opt into the
llamacpp backend from the template alone.

ai: collapse llamacpp into LLM_BACKEND env switch be51421b38

Reverts the per-request backend="llamacpp" value. Chat/vision/embedding
backend is now a deploy-time decision (LLM_BACKEND=ollama|llamacpp),
applied globally across chat, vision describe, and embeddings — so
embedding vectors stay in one space across the index.

- Per-request backend whitelist back to "local"|"hybrid". A request
  arriving with backend="llamacpp" is rejected.
- LLM_BACKEND=llamacpp swaps the entire local stack to llama-swap:
  chat hits the chat slot, describe hits the vision slot, embeddings
  hit the embed slot. Hybrid mode still routes chat to OpenRouter
  but uses LLM_BACKEND for the describe pass.
- Drops env vars HYBRID_VISION_BACKEND, LLAMA_SWAP_VISION_MODELS,
  EMBEDDING_BACKEND (the last never shipped). Drops the
  LlamaCppClient.vision_models allowlist — capability inference now
  reports has_vision only for the configured vision_model slot.
- Drops the /insights/llamacpp/models handler. /insights/models is
  the single endpoint; returns Ollama servers under LLM_BACKEND=ollama
  and llama-swap slots (from LLAMA_SWAP_ALLOWED_MODELS) under
  LLM_BACKEND=llamacpp. Same envelope shape either way.
- New ai::embed_one helper routes embeddings through llama-swap when
  LLM_BACKEND=llamacpp (else Ollama). Wires it into the four
  insight_generator embedding sites.
- Cross-replay matrix simplifies to pre-llamacpp shape (local↔local,
  hybrid↔hybrid, hybrid→local allowed; local→hybrid rejected).

ai: send images directly to llamacpp chat models + add ResolvedBackend 0631820fbf

llamacpp models now receive images via OpenAI content-parts instead of
the describe-then-inline strategy (hybrid mode unchanged). Fixes
assistant messages with tool_calls emitting content: null instead of ""
to satisfy strict Jinja template role-alternation checks. Adds debug
logging of message role sequences on llamacpp requests.

Introduces BackendKind enum, SamplingOverrides, and ResolvedBackend in
a new backend.rs module. InsightGenerator::resolve_backend centralises
client construction + vision capability detection — next step wires the
existing inline dispatch through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ai: extract ResolvedBackend, remove ~480 lines of duplicated dispatch a8a661f70a

Replace 5 copies of the ~80-line backend resolution pattern with a
single InsightGenerator::resolve_backend() builder that returns a
ResolvedBackend (chat + local clients, BackendKind enum, images_inline
flag). Tool dispatch now takes &ResolvedBackend instead of
&OllamaClient + model + backend strings.

Remove duplicated ollama/openrouter/llamacpp fields from
InsightChatService — InsightGenerator owns them and resolve_backend
uses them. Delete build_chat_clients (replaced by resolve_backend).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: update env + CLAUDE.md for direct-vision llamacpp + ResolvedBackend fb388c29d7

llamacpp models now receive images directly instead of
describe-then-inline. LLAMA_SWAP_VISION_MODEL defaults to the
primary model. Document the ResolvedBackend dispatch pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ai: mirror chat model on local client to prevent mid-turn model swap 208344ad98

When the user selects a model from the picker, the local client's
primary_model and vision_model now match the chat model. Prevents
llama-swap exclusive mode from swapping models when describe_photo
or rerank fires during an agentic turn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test: add llamacpp model-slot consistency and content-null tests 9dba659d1e

Cover the properties that prevent mid-turn model swaps in llama-swap
exclusive mode: vision_model defaults to primary, cloned local client
mirrors the user-selected model, embeddings stay on their own slot.
Also test the content:null serialization for tool-calling messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

image: add xlarge (4096px) on-demand preview tier b9175e2718

New `PhotoSize::XLarge` variant sits between `Large` (2048px) and
`Full` (original). On-demand generated and disk-cached at
`_xlarge/<hash>.jpg`, same waterfall as `Large` (embedded RAW preview
→ ffmpeg → image crate). Sources below 4096px serve at native size.

Reduces decoded bitmap memory from ~192MB (48MP full) to ~64MB for
the mobile viewer's zoom tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add contact name filter to SMS search tool + misc improvements 0a627f4880

- sms search tool: accept contact name, trim/validate, skip when
  contact_id is set, pass to API client
- sms_client: new contact field in SmsSearchParams, URL-encode on wire
- Tool description clarifies contact_id takes precedence when both given
- Add parse_title_body helper for LLM response parsing
- llamacpp backend improvements

fix: prevent hybrid mode from leaking OpenRouter model to local llamacpp client b03ee60342

When backend=hybrid with LLM_BACKEND=llamacpp, the user-selected model
(an OpenRouter id like "google/gemini-3-flash-preview") was being applied
to the local LlamaCppClient's primary_model and vision_model. This caused
describe_image to send the OpenRouter model name to llama-swap, which
returned 400 because it has no such slot.

Guard the local-client model override with !is_hybrid so it only applies
in local-only mode (where the user is selecting a different local model).
Bump to v1.2.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cameron merged commit 5a75d1a28c into master

2026-05-26 18:58:48 +00:00

cameron deleted branch feature/llamacpp-backend

2026-05-26 18:58:48 +00:00

cameron referenced this issue from a commit

2026-05-26 18:58:49 +00:00

Merge pull request 'feature/llamacpp-backend' (#101) from feature/llamacpp-backend into master

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Apps/ImageApi#101