feature/llamacpp-backend #101

Merged
cameron merged 11 commits from feature/llamacpp-backend into master 2026-05-26 18:58:48 +00:00
9 changed files with 1468 additions and 102 deletions
Showing only changes of commit f0927f5355

@@ -475,6 +475,7 @@ POST /insights/generate/agentic (tool-calling loop; body: { file_path, back
GET /insights?path=...&library=... GET /insights?path=...&library=...
GET /insights/models (local Ollama models + capabilities) GET /insights/models (local Ollama models + capabilities)
GET /insights/openrouter/models (curated OpenRouter allowlist) GET /insights/openrouter/models (curated OpenRouter allowlist)
GET /insights/llamacpp/models (curated llama-swap slot allowlist)
POST /insights/rate (thumbs up/down for training data) POST /insights/rate (thumbs up/down for training data)
// Insight Chat Continuation // Insight Chat Continuation
@@ -631,6 +632,23 @@ OPENROUTER_EMBEDDING_MODEL=openai/text-embedding-3-small # Optional, embeddings
OPENROUTER_HTTP_REFERER=https://your-site.example # Optional attribution header OPENROUTER_HTTP_REFERER=https://your-site.example # Optional attribution header
OPENROUTER_APP_TITLE=ImageApi # Optional attribution header OPENROUTER_APP_TITLE=ImageApi # Optional attribution header
# llama.cpp / llama-swap (Llamacpp Backend) - sibling to Ollama; OpenAI-compatible
# proxy hosting one or more llama-server processes (chat / vision / embed slots).
LLAMA_SWAP_URL=http://localhost:9292/v1 # Required to enable llamacpp backend
LLAMA_SWAP_PRIMARY_MODEL=chat # Chat slot id (matches config.yaml)
LLAMA_SWAP_VISION_MODEL=vision # Vision slot id; describe_image routes here
LLAMA_SWAP_EMBEDDING_MODEL=embed # Embedding slot id (when local embeddings via llamacpp)
LLAMA_SWAP_VISION_MODELS=qwen-vl,llava # Comma-separated slot ids known to have vision.
# Drives `has_vision` in /insights/llamacpp/models.
# `LLAMA_SWAP_VISION_MODEL` is auto-included.
LLAMA_SWAP_ALLOWED_MODELS=chat,coder # Curated allowlist exposed to clients via
# GET /insights/llamacpp/models. Empty = no picker.
LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS=180 # Per-request timeout; bump for slow CPU offload
HYBRID_VISION_BACKEND=llamacpp # Optional override for hybrid mode's describe_image:
# `ollama` (default) or `llamacpp`. When `llamacpp`,
# hybrid still routes chat to OpenRouter but uses
# llama-swap's vision slot to describe images.
# Insight Chat Continuation # Insight Chat Continuation
AGENTIC_CHAT_MAX_ITERATIONS=6 # Cap on tool-calling iterations per chat turn (default 6) AGENTIC_CHAT_MAX_ITERATIONS=6 # Cap on tool-calling iterations per chat turn (default 6)
``` ```
@@ -652,8 +670,11 @@ This allows runtime verification of model availability before generating insight
**Hybrid Backend (OpenRouter):** **Hybrid Backend (OpenRouter):**
- Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`. - Per-request opt-in via `backend=hybrid` on `POST /insights/generate/agentic`.
- Local Ollama still describes the image (vision); the description is inlined - Vision describe happens before the agentic loop; the description is inlined
into the chat prompt and the agentic loop runs on OpenRouter. into the chat prompt and the agentic loop runs on OpenRouter. By default
vision uses local Ollama, but `HYBRID_VISION_BACKEND=llamacpp` flips it to
llama-swap's vision slot (useful when you want chat on a frontier model and
vision on a local-but-not-Ollama path).
- `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that - `request.model` (if provided) overrides `OPENROUTER_DEFAULT_MODEL` for that
call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`. call. The mobile picker reads from `OPENROUTER_ALLOWED_MODELS`.
- No live capability precheck — the operator-curated allowlist is trusted. - No live capability precheck — the operator-curated allowlist is trusted.
@@ -661,6 +682,30 @@ This allows runtime verification of model availability before generating insight
- `GET /insights/openrouter/models` returns `{ models, default_model, configured }` - `GET /insights/openrouter/models` returns `{ models, default_model, configured }`
for client picker UIs. for client picker UIs.
**Llamacpp Backend (llama-swap):**
- Per-request opt-in via `backend=llamacpp` on `POST /insights/generate/agentic`.
- Sibling to Ollama: a local OpenAI-compatible proxy (mostlygeek/llama-swap)
fronting one or more `llama-server` processes. The chat slot is text-only
by default; vision and embeddings have their own slots (`LLAMA_SWAP_VISION_MODEL`,
`LLAMA_SWAP_EMBEDDING_MODEL`) that llama-swap routes to by model id. The
bundled `docker-compose.yml` + `llama-swap/config.yaml` in the opencode root
is the reference deploy.
- Operates in the same describe-then-inline shape as hybrid: the chat model
never sees raw images. Vision describe routes through llama-swap's vision
slot (`describe_image` on `LlamaCppClient`).
- `request.model` (if provided) overrides `LLAMA_SWAP_PRIMARY_MODEL` for that
call (must match a slot id in llama-swap's `config.yaml`). The mobile picker
reads from `LLAMA_SWAP_ALLOWED_MODELS`.
- No live capability precheck — slot ids are trusted. Tool calling is assumed
for every slot (llama-swap entries typically launch with `--jinja`).
- `GET /insights/llamacpp/models` returns `{ models, default_model, configured }`.
- Cross-replay matrix (chat continuation): `local ↔ llamacpp` allowed (the
LlamaCppClient passes images through to the chat slot — you're responsible
for a vision-capable slot if the stored transcript carries images);
`hybrid → llamacpp` allowed (both transcripts are text-only); `local →
hybrid` and `llamacpp → hybrid` rejected (mid-conversation description
source change isn't supported).
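
The slot-list settings described above (`LLAMA_SWAP_ALLOWED_MODELS`, `LLAMA_SWAP_VISION_MODELS`, with `LLAMA_SWAP_VISION_MODEL` auto-included) can be pictured with a small sketch. The helper names `parse_slot_list`, `vision_slots`, and `has_vision` are hypothetical; the parsing code that actually ships in this PR is not part of the visible diff, so treat this as an illustration of the documented behaviour rather than the implementation.

```rust
use std::collections::BTreeSet;

/// Split a comma-separated env value into trimmed, non-empty slot ids.
/// Hypothetical helper: the real parsing lives outside this diff.
fn parse_slot_list(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect()
}

/// Slot ids treated as vision-capable: everything listed in
/// `LLAMA_SWAP_VISION_MODELS`, plus the single `LLAMA_SWAP_VISION_MODEL`
/// slot, which the docs say is auto-included.
fn vision_slots(vision_models_raw: &str, vision_model: Option<&str>) -> BTreeSet<String> {
    let mut set: BTreeSet<String> = parse_slot_list(vision_models_raw).into_iter().collect();
    if let Some(v) = vision_model.map(str::trim).filter(|v| !v.is_empty()) {
        set.insert(v.to_string());
    }
    set
}

/// The `has_vision` flag for one slot is then a plain membership test.
fn has_vision(slot: &str, vision: &BTreeSet<String>) -> bool {
    vision.contains(slot)
}
```

With `LLAMA_SWAP_VISION_MODELS=qwen-vl,llava` and `LLAMA_SWAP_VISION_MODEL=vision`, this sketch would report `has_vision` for `qwen-vl`, `llava`, and `vision`, and not for `chat`.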
**Insight Chat Continuation:** **Insight Chat Continuation:**
After an agentic insight is generated, the full `Vec<ChatMessage>` transcript is After an agentic insight is generated, the full `Vec<ChatMessage>` transcript is

@@ -549,6 +549,36 @@ pub async fn get_openrouter_models_handler(
HttpResponse::Ok().json(response) HttpResponse::Ok().json(response)
} }
#[derive(serde::Serialize)]
pub struct LlamaCppModelsResponse {
pub models: Vec<String>,
pub default_model: Option<String>,
pub configured: bool,
}
/// GET /insights/llamacpp/models - Curated llama-swap model ids exposed
/// to clients for the llamacpp backend. Returned verbatim from
/// `LLAMA_SWAP_ALLOWED_MODELS`; no live call to llama-swap. Use
/// `LLAMA_SWAP_URL` plus `LLAMA_SWAP_PRIMARY_MODEL` on the server side to
/// pick the actual chat slot.
#[get("/insights/llamacpp/models")]
pub async fn get_llamacpp_models_handler(
_claims: Claims,
app_state: web::Data<crate::state::AppState>,
) -> impl Responder {
let configured = app_state.llamacpp.is_some();
let default_model = app_state
.llamacpp
.as_ref()
.map(|c| c.primary_model.clone());
let response = LlamaCppModelsResponse {
models: app_state.llamacpp_allowed_models.clone(),
default_model,
configured,
};
HttpResponse::Ok().json(response)
}
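
As a rough illustration of how a client might consume the `{ models, default_model, configured }` payload returned by the handler above, here is a sketch. The `Picker` enum and `picker_for` function are assumptions made up for this example (no client code appears in this diff); only the three response fields come from the PR.

```rust
/// Mirror of the response shape `{ models, default_model, configured }`.
#[derive(Debug, Clone, PartialEq)]
struct LlamaCppModels {
    models: Vec<String>,
    default_model: Option<String>,
    configured: bool,
}

/// Hypothetical client-side view of what to render for the llamacpp backend.
#[derive(Debug, PartialEq)]
enum Picker {
    /// `configured == false`: LLAMA_SWAP_URL is unset on the server.
    BackendUnavailable,
    /// Empty allowlist: no picker, the server-side default slot is used.
    ServerDefault(String),
    /// Non-empty allowlist: offer the curated ids, preselecting the default
    /// when it appears in the list.
    Choose { options: Vec<String>, preselected: Option<usize> },
}

fn picker_for(resp: &LlamaCppModels) -> Picker {
    if !resp.configured {
        return Picker::BackendUnavailable;
    }
    if resp.models.is_empty() {
        return match &resp.default_model {
            Some(m) => Picker::ServerDefault(m.clone()),
            // Defensive: not expected when `configured` is true.
            None => Picker::BackendUnavailable,
        };
    }
    let preselected = resp
        .default_model
        .as_ref()
        .and_then(|d| resp.models.iter().position(|m| m == d));
    Picker::Choose { options: resp.models.clone(), preselected }
}
```

Because the endpoint makes no live call to llama-swap, a `Choose` result only says the ids are on the operator's allowlist, not that the slots are currently loadable.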
/// POST /insights/rate - Rate an insight (thumbs up/down for training data) /// POST /insights/rate - Rate an insight (thumbs up/down for training data)
#[post("/insights/rate")] #[post("/insights/rate")]
pub async fn rate_insight_handler( pub async fn rate_insight_handler(

@@ -9,6 +9,7 @@ use tokio::sync::Mutex as TokioMutex;
use crate::ai::insight_generator::InsightGenerator; use crate::ai::insight_generator::InsightGenerator;
use crate::ai::llm_client::{ChatMessage, LlmClient, LlmStreamEvent, Tool}; use crate::ai::llm_client::{ChatMessage, LlmClient, LlmStreamEvent, Tool};
use crate::ai::ollama::OllamaClient; use crate::ai::ollama::OllamaClient;
use crate::ai::llamacpp::LlamaCppClient;
use crate::ai::openrouter::OpenRouterClient; use crate::ai::openrouter::OpenRouterClient;
use crate::database::InsightDao; use crate::database::InsightDao;
use crate::database::models::InsertPhotoInsight; use crate::database::models::InsertPhotoInsight;
@@ -93,6 +94,7 @@ pub struct InsightChatService {
generator: Arc<InsightGenerator>, generator: Arc<InsightGenerator>,
ollama: OllamaClient, ollama: OllamaClient,
openrouter: Option<Arc<OpenRouterClient>>, openrouter: Option<Arc<OpenRouterClient>>,
llamacpp: Option<Arc<LlamaCppClient>>,
insight_dao: Arc<Mutex<Box<dyn InsightDao>>>, insight_dao: Arc<Mutex<Box<dyn InsightDao>>>,
chat_locks: ChatLockMap, chat_locks: ChatLockMap,
} }
@@ -102,6 +104,7 @@ impl InsightChatService {
generator: Arc<InsightGenerator>, generator: Arc<InsightGenerator>,
ollama: OllamaClient, ollama: OllamaClient,
openrouter: Option<Arc<OpenRouterClient>>, openrouter: Option<Arc<OpenRouterClient>>,
llamacpp: Option<Arc<LlamaCppClient>>,
insight_dao: Arc<Mutex<Box<dyn InsightDao>>>, insight_dao: Arc<Mutex<Box<dyn InsightDao>>>,
chat_locks: ChatLockMap, chat_locks: ChatLockMap,
) -> Self { ) -> Self {
@@ -109,6 +112,7 @@ impl InsightChatService {
generator, generator,
ollama, ollama,
openrouter, openrouter,
llamacpp,
insight_dao, insight_dao,
chat_locks, chat_locks,
} }
@@ -303,23 +307,15 @@ impl InsightChatService {
.map(|s| s.trim().to_lowercase()) .map(|s| s.trim().to_lowercase())
.filter(|s| !s.is_empty()) .filter(|s| !s.is_empty())
.unwrap_or_else(|| stored_backend.clone()); .unwrap_or_else(|| stored_backend.clone());
-if !matches!(effective_backend.as_str(), "local" | "hybrid") {
-    bail!(
-        "unknown backend '{}'; expected 'local' or 'hybrid'",
-        effective_backend
-    );
-}
-if stored_backend == "local" && effective_backend == "hybrid" {
-    bail!(
-        "switching from local to hybrid mid-chat isn't supported yet; \
-         regenerate the insight in hybrid mode if you want OpenRouter chat"
-    );
-}
+validate_cross_replay(&stored_backend, &effective_backend)?;
 let is_hybrid = effective_backend == "hybrid";
+let is_llamacpp = effective_backend == "llamacpp";
+let describes_then_inlines = is_hybrid || is_llamacpp;
span.set_attribute(KeyValue::new("backend", effective_backend.clone())); span.set_attribute(KeyValue::new("backend", effective_backend.clone()));
// 4. Build the chat backend client. Ollama in local mode, a freshly // 4. Build the chat backend client. Ollama in local mode, a freshly
// cloned OpenRouter client in hybrid mode (clone so per-request // cloned OpenRouter client in hybrid mode, a freshly cloned
// LlamaCppClient in llamacpp mode (clone so per-request
// sampling/model overrides don't leak into shared state). // sampling/model overrides don't leak into shared state).
let max_iterations = req let max_iterations = req
.max_iterations .max_iterations
@@ -336,6 +332,7 @@ impl InsightChatService {
let mut ollama_client = self.ollama.clone(); let mut ollama_client = self.ollama.clone();
let mut openrouter_client: Option<OpenRouterClient> = None; let mut openrouter_client: Option<OpenRouterClient> = None;
let mut llamacpp_client: Option<LlamaCppClient> = None;
if is_hybrid { if is_hybrid {
let arc = self.openrouter.as_ref().ok_or_else(|| { let arc = self.openrouter.as_ref().ok_or_else(|| {
@@ -356,6 +353,25 @@ impl InsightChatService {
c.set_num_ctx(Some(ctx)); c.set_num_ctx(Some(ctx));
} }
openrouter_client = Some(c); openrouter_client = Some(c);
} else if is_llamacpp {
let arc = self.llamacpp.as_ref().ok_or_else(|| {
anyhow!("llamacpp backend unavailable: LLAMA_SWAP_URL not configured")
})?;
let mut c: LlamaCppClient = (**arc).clone();
if let Some(ref m) = custom_model {
c.primary_model = m.clone();
}
if req.temperature.is_some()
|| req.top_p.is_some()
|| req.top_k.is_some()
|| req.min_p.is_some()
{
c.set_sampling_params(req.temperature, req.top_p, req.top_k, req.min_p);
}
if let Some(ctx) = req.num_ctx {
c.set_num_ctx(Some(ctx));
}
llamacpp_client = Some(c);
} else { } else {
// Local-mode model swap. Build a new client when the chat model // Local-mode model swap. Build a new client when the chat model
// differs from the configured one (mirrors the agentic pattern). // differs from the configured one (mirrors the agentic pattern).
@@ -381,7 +397,9 @@ impl InsightChatService {
} }
} }
let chat_backend: &dyn LlmClient = if let Some(ref c) = openrouter_client { let chat_backend: &dyn LlmClient = if let Some(ref c) = llamacpp_client {
c
} else if let Some(ref c) = openrouter_client {
c c
} else { } else {
&ollama_client &ollama_client
@@ -389,18 +407,19 @@ impl InsightChatService {
let model_used = chat_backend.primary_model().to_string(); let model_used = chat_backend.primary_model().to_string();
span.set_attribute(KeyValue::new("model", model_used.clone())); span.set_attribute(KeyValue::new("model", model_used.clone()));
// 5. Decide vision + tool set. In hybrid we always omit // 5. Decide vision + tool set. In describe-then-inline modes
// `describe_photo` (matches the original generation flow). In // (hybrid, llamacpp) we always omit `describe_photo` (matches the
// local we trust the stored history's first-user shape: if it // original generation flow). In local we trust the stored
// carries `images`, the original model was vision-capable, and // history's first-user shape: if it carries `images`, the
// we keep `describe_photo` available. // original model was vision-capable, and we keep `describe_photo`
// available.
let local_first_user_has_image = messages let local_first_user_has_image = messages
.iter() .iter()
.find(|m| m.role == "user") .find(|m| m.role == "user")
.and_then(|m| m.images.as_ref()) .and_then(|m| m.images.as_ref())
.map(|imgs| !imgs.is_empty()) .map(|imgs| !imgs.is_empty())
.unwrap_or(false); .unwrap_or(false);
let offer_describe_tool = !is_hybrid && local_first_user_has_image; let offer_describe_tool = !describes_then_inlines && local_first_user_has_image;
// current_gate_opts(has_vision) sets gate_opts.has_vision = has_vision // current_gate_opts(has_vision) sets gate_opts.has_vision = has_vision
// and probes the per-table presence flags. Pass `offer_describe_tool` // and probes the per-table presence flags. Pass `offer_describe_tool`
// directly — the `!is_hybrid && local_first_user_has_image` decision // directly — the `!is_hybrid && local_first_user_has_image` decision
@@ -799,19 +818,10 @@ impl InsightChatService {
.map(|s| s.trim().to_lowercase()) .map(|s| s.trim().to_lowercase())
.filter(|s| !s.is_empty()) .filter(|s| !s.is_empty())
.unwrap_or_else(|| stored_backend.clone()); .unwrap_or_else(|| stored_backend.clone());
-if !matches!(effective_backend.as_str(), "local" | "hybrid") {
-    bail!(
-        "unknown backend '{}'; expected 'local' or 'hybrid'",
-        effective_backend
-    );
-}
-if stored_backend == "local" && effective_backend == "hybrid" {
-    bail!(
-        "switching from local to hybrid mid-chat isn't supported yet; \
-         regenerate the insight in hybrid mode if you want OpenRouter chat"
-    );
-}
+validate_cross_replay(&stored_backend, &effective_backend)?;
 let is_hybrid = effective_backend == "hybrid";
+let is_llamacpp = effective_backend == "llamacpp";
+let describes_then_inlines = is_hybrid || is_llamacpp;
let max_iterations = req let max_iterations = req
.max_iterations .max_iterations
@@ -826,20 +836,21 @@ impl InsightChatService {
.filter(|m| !m.is_empty()); .filter(|m| !m.is_empty());
let (chat_backend_holder, ollama_client) = let (chat_backend_holder, ollama_client) =
self.build_chat_clients(is_hybrid, custom_model.as_deref(), &req)?; self.build_chat_clients(&effective_backend, custom_model.as_deref(), &req)?;
let chat_backend: &dyn LlmClient = chat_backend_holder.as_ref(); let chat_backend: &dyn LlmClient = chat_backend_holder.as_ref();
let model_used = chat_backend.primary_model().to_string(); let model_used = chat_backend.primary_model().to_string();
// Tool set — local mode + first user turn carries an image → // Tool set — local mode + first user turn carries an image →
// offer describe_photo. Hybrid: visual description was inlined // offer describe_photo. Describe-then-inline modes (hybrid /
// when the insight was bootstrapped, no describe tool needed. // llamacpp): visual description was inlined when the insight was
// bootstrapped, no describe tool needed.
let local_first_user_has_image = messages let local_first_user_has_image = messages
.iter() .iter()
.find(|m| m.role == "user") .find(|m| m.role == "user")
.and_then(|m| m.images.as_ref()) .and_then(|m| m.images.as_ref())
.map(|imgs| !imgs.is_empty()) .map(|imgs| !imgs.is_empty())
.unwrap_or(false); .unwrap_or(false);
let offer_describe_tool = !is_hybrid && local_first_user_has_image; let offer_describe_tool = !describes_then_inlines && local_first_user_has_image;
let gate_opts = self.generator.current_gate_opts_for_persona( let gate_opts = self.generator.current_gate_opts_for_persona(
offer_describe_tool, offer_describe_tool,
Some((req.user_id, &active_persona)), Some((req.user_id, &active_persona)),
@@ -976,6 +987,8 @@ impl InsightChatService {
.unwrap_or_else(|| "default".to_string()); .unwrap_or_else(|| "default".to_string());
let effective_backend = resolve_bootstrap_backend(req.backend.as_deref())?; let effective_backend = resolve_bootstrap_backend(req.backend.as_deref())?;
let is_hybrid = effective_backend == "hybrid"; let is_hybrid = effective_backend == "hybrid";
let is_llamacpp = effective_backend == "llamacpp";
let describes_then_inlines = is_hybrid || is_llamacpp;
let max_iterations = req let max_iterations = req
.max_iterations .max_iterations
@@ -984,7 +997,7 @@ impl InsightChatService {
let custom_model = req.model.clone().filter(|m| !m.is_empty()); let custom_model = req.model.clone().filter(|m| !m.is_empty());
let (chat_backend_holder, ollama_client) = let (chat_backend_holder, ollama_client) =
self.build_chat_clients(is_hybrid, custom_model.as_deref(), &req)?; self.build_chat_clients(&effective_backend, custom_model.as_deref(), &req)?;
let chat_backend: &dyn LlmClient = chat_backend_holder.as_ref(); let chat_backend: &dyn LlmClient = chat_backend_holder.as_ref();
let model_used = chat_backend.primary_model().to_string(); let model_used = chat_backend.primary_model().to_string();
@@ -1007,21 +1020,48 @@ impl InsightChatService {
_ => None, _ => None,
}); });
-// Hybrid backend: pre-describe the image via local Ollama vision
-// so OpenRouter chat models (which can't see images directly) get
-// the visual description as text. Mirrors the same pre-describe
-// pass that `generate_agentic_insight_for_photo` does for hybrid.
-let visual_block = if is_hybrid {
+// Describe-then-inline backends (hybrid, llamacpp): pre-describe the
+// image so a text-only chat model gets the visual description inline.
+// Vision source: llamacpp's vision slot in llamacpp mode; in hybrid
+// mode Ollama by default, llamacpp via `HYBRID_VISION_BACKEND=llamacpp`.
+let visual_block = if describes_then_inlines {
     match image_base64.as_deref() {
-        Some(b64) => match self.ollama.describe_image(b64).await {
-            Ok(desc) => {
-                format!("Visual description (from local vision model):\n{}\n", desc)
-            }
-            Err(e) => {
-                log::warn!("hybrid bootstrap: local describe_image failed: {}", e);
-                String::new()
-            }
-        },
+        Some(b64) => {
+            let use_llamacpp_vision = if is_llamacpp {
+                true
+            } else {
+                matches!(
+                    std::env::var("HYBRID_VISION_BACKEND")
+                        .ok()
+                        .as_deref()
+                        .map(|s| s.trim().to_lowercase())
+                        .as_deref(),
+                    Some("llamacpp")
+                )
+            };
+            let described = if use_llamacpp_vision {
+                match self.llamacpp.as_ref() {
+                    Some(c) => c.describe_image(b64).await,
+                    None => {
+                        log::warn!(
+                            "bootstrap: requested llamacpp vision but LLAMA_SWAP_URL unset; falling back to Ollama"
+                        );
+                        self.ollama.describe_image(b64).await
+                    }
+                }
+            } else {
+                self.ollama.describe_image(b64).await
+            };
+            match described {
+                Ok(desc) => {
+                    format!("Visual description (from local vision model):\n{}\n", desc)
+                }
+                Err(e) => {
+                    log::warn!("{} bootstrap: describe_image failed: {}", effective_backend, e);
+                    String::new()
+                }
+            }
+        }
         None => String::new(),
     }
 } else {
@@ -1031,7 +1071,7 @@ impl InsightChatService {
// Tool gates. Local + image present → expose describe_photo so // Tool gates. Local + image present → expose describe_photo so
// the chat model can re-look at the photo on demand. Hybrid: // the chat model can re-look at the photo on demand. Hybrid:
// already inlined, no tool needed. // already inlined, no tool needed.
let offer_describe_tool = !is_hybrid && image_base64.is_some(); let offer_describe_tool = !describes_then_inlines && image_base64.is_some();
let gate_opts = self.generator.current_gate_opts_for_persona( let gate_opts = self.generator.current_gate_opts_for_persona(
offer_describe_tool, offer_describe_tool,
Some((req.user_id, &active_persona)), Some((req.user_id, &active_persona)),
@@ -1057,7 +1097,7 @@ impl InsightChatService {
); );
let system_msg = ChatMessage::system(system_content); let system_msg = ChatMessage::system(system_content);
let mut user_msg = ChatMessage::user(req.user_message.clone()); let mut user_msg = ChatMessage::user(req.user_message.clone());
if !is_hybrid && let Some(ref img) = image_base64 { if !describes_then_inlines && let Some(ref img) = image_base64 {
user_msg.images = Some(vec![img.clone()]); user_msg.images = Some(vec![img.clone()]);
} }
let mut messages = vec![system_msg, user_msg]; let mut messages = vec![system_msg, user_msg];
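
The vision routing in the bootstrap hunks above depends on three inputs: the effective backend, the `HYBRID_VISION_BACKEND` value, and whether a llama-swap client is configured. A pure-function restatement may make the rules easier to check. `VisionSource` and `pick_vision_source` are hypothetical names introduced for this sketch; the PR itself inlines this logic and reads the env var directly.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VisionSource {
    /// Local mode: the image rides on the user message, nothing is pre-described.
    Skip,
    Ollama,
    LlamaCpp,
}

/// Restates the selection rules as a pure function.
/// `hybrid_vision_backend` stands in for the raw `HYBRID_VISION_BACKEND` value.
fn pick_vision_source(
    backend: &str,
    hybrid_vision_backend: Option<&str>,
    llamacpp_configured: bool,
) -> VisionSource {
    let wants_llamacpp = match backend {
        "llamacpp" => true,
        "hybrid" => {
            hybrid_vision_backend
                .map(|s| s.trim().to_lowercase())
                .as_deref()
                == Some("llamacpp")
        }
        _ => return VisionSource::Skip,
    };
    if wants_llamacpp && llamacpp_configured {
        VisionSource::LlamaCpp
    } else {
        // Covers both the hybrid default and the "llamacpp requested but
        // LLAMA_SWAP_URL unset" fallback that the diff logs a warning for.
        VisionSource::Ollama
    }
}
```

Pulling the decision into a function like this would also let the fallback path be unit-tested without touching process environment variables, though that refactor is only a suggestion.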
@@ -1130,19 +1170,22 @@ impl InsightChatService {
Ok(()) Ok(())
} }
/// Set up chat clients (Ollama + optional OpenRouter) shared by /// Set up chat clients (Ollama + optional OpenRouter / LlamaCpp) shared
/// bootstrap and continuation. Returns the chat-side backend client /// by bootstrap and continuation. Returns the chat-side backend client
/// (boxed because hybrid and local return different concrete types) /// (boxed because each backend has a different concrete type) and the
/// and the Ollama client used for describe-image / local tool calls. /// Ollama client used for describe-image / local tool calls.
///
/// `effective_backend` must be one of `"local"`, `"hybrid"`, `"llamacpp"`
/// (validated upstream).
fn build_chat_clients( fn build_chat_clients(
&self, &self,
is_hybrid: bool, effective_backend: &str,
custom_model: Option<&str>, custom_model: Option<&str>,
req: &ChatTurnRequest, req: &ChatTurnRequest,
) -> Result<(Box<dyn LlmClient>, OllamaClient)> { ) -> Result<(Box<dyn LlmClient>, OllamaClient)> {
let mut ollama_client = self.ollama.clone(); let mut ollama_client = self.ollama.clone();
if is_hybrid { if effective_backend == "hybrid" {
let arc = self.openrouter.as_ref().ok_or_else(|| { let arc = self.openrouter.as_ref().ok_or_else(|| {
anyhow!("hybrid backend unavailable: OPENROUTER_API_KEY not configured") anyhow!("hybrid backend unavailable: OPENROUTER_API_KEY not configured")
})?; })?;
@@ -1163,6 +1206,27 @@ impl InsightChatService {
return Ok((Box::new(c), ollama_client)); return Ok((Box::new(c), ollama_client));
} }
if effective_backend == "llamacpp" {
let arc = self.llamacpp.as_ref().ok_or_else(|| {
anyhow!("llamacpp backend unavailable: LLAMA_SWAP_URL not configured")
})?;
let mut c: LlamaCppClient = (**arc).clone();
if let Some(m) = custom_model {
c.primary_model = m.to_string();
}
if req.temperature.is_some()
|| req.top_p.is_some()
|| req.top_k.is_some()
|| req.min_p.is_some()
{
c.set_sampling_params(req.temperature, req.top_p, req.top_k, req.min_p);
}
if let Some(ctx) = req.num_ctx {
c.set_num_ctx(Some(ctx));
}
return Ok((Box::new(c), ollama_client));
}
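
The clone-then-override pattern used in the llamacpp branch above (dereference the shared `Arc`, clone, then mutate the copy) is what keeps per-request model and sampling overrides out of shared state. A minimal sketch of that idea follows; `SketchClient` and `SketchOverrides` are stand-ins invented for the example and do not reflect the real `LlamaCppClient` fields beyond `primary_model`.

```rust
use std::sync::Arc;

/// Stand-in for a shared backend client (not the real `LlamaCppClient`).
#[derive(Clone, Debug, PartialEq)]
struct SketchClient {
    primary_model: String,
    temperature: Option<f32>,
    num_ctx: Option<i32>,
}

/// Stand-in for the per-request fields of a chat turn request.
#[derive(Default)]
struct SketchOverrides<'a> {
    model: Option<&'a str>,
    temperature: Option<f32>,
    num_ctx: Option<i32>,
}

/// Build a request-scoped client: clone the shared one, then apply overrides
/// to the copy so the `Arc`-held original is never mutated.
fn per_request_client(shared: &Arc<SketchClient>, o: &SketchOverrides<'_>) -> SketchClient {
    let mut c: SketchClient = (**shared).clone();
    if let Some(m) = o.model {
        c.primary_model = m.to_string();
    }
    if o.temperature.is_some() {
        c.temperature = o.temperature;
    }
    if let Some(ctx) = o.num_ctx {
        c.num_ctx = Some(ctx);
    }
    c
}
```

The cost is one clone per chat turn, which is presumably cheap next to an LLM round trip; the benefit is that concurrent requests cannot observe each other's overrides.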
if let Some(m) = custom_model if let Some(m) = custom_model
&& m != self.ollama.primary_model && m != self.ollama.primary_model
{ {
@@ -1459,6 +1523,49 @@ fn resolve_date_taken_for_context(
.map(|dt| dt.format("%Y-%m-%d").to_string()) .map(|dt| dt.format("%Y-%m-%d").to_string())
} }
/// Validate a stored→effective backend transition for a chat continuation.
/// Continuation runs against a transcript that was generated with a specific
/// backend; some transitions break the conversation shape:
///
/// - `local → hybrid` — the stored transcript has images embedded in the
/// first user message; the openrouter chat client surfaces them through
/// the wire, but vision-only models routed via the hybrid path may not
/// accept that shape consistently across providers. Reject to keep the
/// `regenerate-in-hybrid-mode` workflow as the supported answer.
/// - `llamacpp → hybrid` — the stored transcript already has an inlined
/// visual description produced by llama-swap's vision slot. Switching
/// to hybrid mid-conversation would mix description sources across
/// subsequent turns (any new image in the chat continuation would be
/// described by ollama-vision while the original was described by
/// llama-vision). Reject for consistency.
///
/// All other transitions are allowed. `local ↔ llamacpp` works because
/// LlamaCppClient passes image content-parts through to the chat slot —
/// the user is responsible for picking a vision-capable chat model in
/// that case. `hybrid → llamacpp` works because both transcripts are
/// text-only (visual description inlined at bootstrap).
fn validate_cross_replay(stored: &str, effective: &str) -> Result<()> {
if !matches!(effective, "local" | "hybrid" | "llamacpp") {
bail!(
"unknown backend '{}'; expected 'local', 'hybrid', or 'llamacpp'",
effective
);
}
if stored == "local" && effective == "hybrid" {
bail!(
"switching from local to hybrid mid-chat isn't supported yet; \
regenerate the insight in hybrid mode if you want OpenRouter chat"
);
}
if stored == "llamacpp" && effective == "hybrid" {
bail!(
"switching from llamacpp to hybrid mid-chat isn't supported yet; \
regenerate the insight in hybrid mode if you want OpenRouter chat"
);
}
Ok(())
}
/// Pick the backend label for bootstrap. Bootstrap has no stored insight /// Pick the backend label for bootstrap. Bootstrap has no stored insight
/// to defer to (that's continuation's behaviour), so the default is /// to defer to (that's continuation's behaviour), so the default is
/// `"local"`. Returns an error if the supplied label is non-empty but /// `"local"`. Returns an error if the supplied label is non-empty but
@@ -1469,8 +1576,11 @@ fn resolve_bootstrap_backend(supplied: Option<&str>) -> Result<String> {
.map(|s| s.trim().to_lowercase()) .map(|s| s.trim().to_lowercase())
.filter(|s| !s.is_empty()) .filter(|s| !s.is_empty())
.unwrap_or_else(|| "local".to_string()); .unwrap_or_else(|| "local".to_string());
-if !matches!(lower.as_str(), "local" | "hybrid") {
-    bail!("unknown backend '{}'; expected 'local' or 'hybrid'", lower);
+if !matches!(lower.as_str(), "local" | "hybrid" | "llamacpp") {
+    bail!(
+        "unknown backend '{}'; expected 'local', 'hybrid', or 'llamacpp'",
+        lower
+    );
 }
Ok(lower) Ok(lower)
} }
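
The backend label is matched as a string in several places in this PR (`validate_cross_replay`, `resolve_bootstrap_backend`, and the generator). One possible way to keep the accepted set and the replay matrix in a single spot is a small enum, sketched below. This is not what the PR does; `Backend` and `replay_allowed` are hypothetical names, and defaulting an empty label to `local` is left to the caller.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Local,
    Hybrid,
    LlamaCpp,
}

impl FromStr for Backend {
    type Err = String;

    /// Case-insensitive, whitespace-tolerant parse of a backend label.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_lowercase().as_str() {
            "local" => Ok(Backend::Local),
            "hybrid" => Ok(Backend::Hybrid),
            "llamacpp" => Ok(Backend::LlamaCpp),
            other => Err(format!(
                "unknown backend '{}'; expected 'local', 'hybrid', or 'llamacpp'",
                other
            )),
        }
    }
}

/// Stored -> effective transitions for chat continuation. Only moves *into*
/// hybrid from another backend are rejected, matching `validate_cross_replay`.
fn replay_allowed(stored: Backend, effective: Backend) -> bool {
    !matches!(
        (stored, effective),
        (Backend::Local, Backend::Hybrid) | (Backend::LlamaCpp, Backend::Hybrid)
    )
}
```

Of the nine stored/effective pairs, seven are allowed under these rules, including `hybrid → llamacpp` and both directions of `local ↔ llamacpp`.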
@@ -2074,6 +2184,10 @@ mod tests {
fn bootstrap_backend_accepts_local_and_hybrid_case_insensitively() { fn bootstrap_backend_accepts_local_and_hybrid_case_insensitively() {
assert_eq!(resolve_bootstrap_backend(Some("LOCAL")).unwrap(), "local"); assert_eq!(resolve_bootstrap_backend(Some("LOCAL")).unwrap(), "local");
assert_eq!(resolve_bootstrap_backend(Some("Hybrid")).unwrap(), "hybrid"); assert_eq!(resolve_bootstrap_backend(Some("Hybrid")).unwrap(), "hybrid");
assert_eq!(
resolve_bootstrap_backend(Some("Llamacpp")).unwrap(),
"llamacpp"
);
assert_eq!( assert_eq!(
resolve_bootstrap_backend(Some(" local ")).unwrap(), resolve_bootstrap_backend(Some(" local ")).unwrap(),
"local" "local"
@@ -2088,6 +2202,38 @@ mod tests {
assert!(msg.contains("openrouter")); assert!(msg.contains("openrouter"));
} }
#[test]
fn cross_replay_rejects_local_to_hybrid() {
let err = validate_cross_replay("local", "hybrid").unwrap_err();
assert!(format!("{}", err).contains("local to hybrid"));
}
#[test]
fn cross_replay_rejects_llamacpp_to_hybrid() {
let err = validate_cross_replay("llamacpp", "hybrid").unwrap_err();
assert!(format!("{}", err).contains("llamacpp to hybrid"));
}
#[test]
fn cross_replay_allows_local_llamacpp_and_hybrid_llamacpp_transitions() {
// Local ↔ llamacpp: user is responsible for picking a vision-capable
// chat slot when the transcript has images.
assert!(validate_cross_replay("local", "llamacpp").is_ok());
assert!(validate_cross_replay("llamacpp", "local").is_ok());
// Hybrid → llamacpp: both transcripts are text-only.
assert!(validate_cross_replay("hybrid", "llamacpp").is_ok());
// Same-backend replays are always fine.
assert!(validate_cross_replay("local", "local").is_ok());
assert!(validate_cross_replay("hybrid", "hybrid").is_ok());
assert!(validate_cross_replay("llamacpp", "llamacpp").is_ok());
}
#[test]
fn cross_replay_rejects_unknown_effective() {
let err = validate_cross_replay("local", "openrouter").unwrap_err();
assert!(format!("{}", err).contains("unknown backend"));
}
#[test] #[test]
fn bootstrap_system_message_includes_path_and_persona() { fn bootstrap_system_message_includes_path_and_persona() {
let out = build_bootstrap_system_message("you are helpful", "pics/IMG.jpg", None, None, ""); let out = build_bootstrap_system_message("you are helpful", "pics/IMG.jpg", None, None, "");

@@ -12,6 +12,7 @@ use std::sync::{Arc, Mutex};
use crate::ai::apollo_client::{ApolloClient, ApolloPlace}; use crate::ai::apollo_client::{ApolloClient, ApolloPlace};
use crate::ai::llm_client::LlmClient; use crate::ai::llm_client::LlmClient;
use crate::ai::ollama::{ChatMessage, OllamaClient, Tool}; use crate::ai::ollama::{ChatMessage, OllamaClient, Tool};
use crate::ai::llamacpp::LlamaCppClient;
use crate::ai::openrouter::OpenRouterClient; use crate::ai::openrouter::OpenRouterClient;
use crate::ai::sms_client::{SmsApiClient, SmsSearchHit, SmsSearchParams}; use crate::ai::sms_client::{SmsApiClient, SmsSearchHit, SmsSearchParams};
use crate::ai::user_display_name; use crate::ai::user_display_name;
@@ -68,6 +69,9 @@ pub struct InsightGenerator {
/// Optional OpenRouter client, used when `backend=hybrid` is requested. /// Optional OpenRouter client, used when `backend=hybrid` is requested.
/// `None` when `OPENROUTER_API_KEY` is not configured. /// `None` when `OPENROUTER_API_KEY` is not configured.
openrouter: Option<Arc<OpenRouterClient>>, openrouter: Option<Arc<OpenRouterClient>>,
/// Optional llama-swap client, used when `backend=llamacpp` is requested.
/// `None` when `LLAMA_SWAP_URL` is not configured.
llamacpp: Option<Arc<LlamaCppClient>>,
sms_client: SmsApiClient, sms_client: SmsApiClient,
/// Optional integration with Apollo's user-defined Places. When the /// Optional integration with Apollo's user-defined Places. When the
/// integration is disabled (`APOLLO_API_BASE_URL` unset), every /// integration is disabled (`APOLLO_API_BASE_URL` unset), every
@@ -120,6 +124,7 @@ impl InsightGenerator {
pub fn new( pub fn new(
ollama: OllamaClient, ollama: OllamaClient,
openrouter: Option<Arc<OpenRouterClient>>, openrouter: Option<Arc<OpenRouterClient>>,
llamacpp: Option<Arc<LlamaCppClient>>,
sms_client: SmsApiClient, sms_client: SmsApiClient,
apollo_client: ApolloClient, apollo_client: ApolloClient,
insight_dao: Arc<Mutex<Box<dyn InsightDao>>>, insight_dao: Arc<Mutex<Box<dyn InsightDao>>>,
@@ -137,6 +142,7 @@ impl InsightGenerator {
Self { Self {
ollama, ollama,
openrouter, openrouter,
llamacpp,
sms_client, sms_client,
apollo_client, apollo_client,
insight_dao, insight_dao,
@@ -3574,23 +3580,31 @@ Return ONLY the summary, nothing else."#,
  .map(|s| s.trim().to_lowercase())
  .filter(|s| !s.is_empty())
  .unwrap_or_else(|| "local".to_string());
- if !matches!(backend_label.as_str(), "local" | "hybrid") {
+ if !matches!(backend_label.as_str(), "local" | "hybrid" | "llamacpp") {
  return Err(anyhow::anyhow!(
- "unknown backend '{}'; expected 'local' or 'hybrid'",
+ "unknown backend '{}'; expected 'local', 'hybrid', or 'llamacpp'",
  backend_label
  ));
  }
  span.set_attribute(KeyValue::new("backend", backend_label.clone()));
  let is_hybrid = backend_label == "hybrid";
+ let is_llamacpp = backend_label == "llamacpp";
+ // In hybrid + llamacpp modes the chat model never sees the image
+ // directly; we describe-then-inline locally before the agentic loop
+ // starts. Tracked as a single flag so vision/tool-gate logic doesn't
+ // have to branch twice.
+ let describes_then_inlines = is_hybrid || is_llamacpp;
  // 1b. Always build an Ollama client. In local mode it owns the chat
- // loop; in hybrid mode it still handles describe_image + any
- // tool-local calls (e.g. if a future tool needs embeddings).
- // Sampling overrides only apply in local mode — in hybrid the
- // user's params belong to the OpenRouter chat client.
- let apply_sampling_to_ollama = !is_hybrid;
+ // loop; in hybrid/llamacpp mode it still handles tool-local calls
+ // (e.g. future embedding-backed tools). The chat backend is
+ // selected separately below.
+ // Sampling overrides only apply in local mode — in
+ // hybrid/llamacpp the user's params belong to the alternate chat
+ // client.
+ let apply_sampling_to_ollama = !describes_then_inlines;
  let mut ollama_client = if let Some(ref model) = custom_model
- && !is_hybrid
+ && !describes_then_inlines
  {
  log::info!("Using custom model for agentic: {}", model);
  span.set_attribute(KeyValue::new("custom_model", model.clone()));
@@ -3601,7 +3615,7 @@ Return ONLY the summary, nothing else."#,
  Some(model.clone()),
  )
  } else {
- if !is_hybrid {
+ if !describes_then_inlines {
  span.set_attribute(KeyValue::new("model", self.ollama.primary_model.clone()));
  }
  self.ollama.clone()
@@ -3674,6 +3688,44 @@ Return ONLY the summary, nothing else."#,
  None
  };
+ // 1d. In llamacpp mode, clone the configured LlamaCpp client and
+ // apply per-request overrides. Same shape as the openrouter
+ // branch above; describe_image will route through the vision
+ // slot configured on the client.
+ let llamacpp_client: Option<LlamaCppClient> = if is_llamacpp {
+ let arc = self.llamacpp.as_ref().ok_or_else(|| {
+ anyhow::anyhow!("llamacpp backend unavailable: LLAMA_SWAP_URL not configured")
+ })?;
+ let mut c: LlamaCppClient = (**arc).clone();
+ if let Some(ref m) = custom_model {
+ c.primary_model = m.clone();
+ span.set_attribute(KeyValue::new("custom_model", m.clone()));
+ }
+ span.set_attribute(KeyValue::new("llamacpp_model", c.primary_model.clone()));
+ if temperature.is_some() || top_p.is_some() || top_k.is_some() || min_p.is_some() {
+ if let Some(t) = temperature {
+ span.set_attribute(KeyValue::new("temperature", t as f64));
+ }
+ if let Some(p) = top_p {
+ span.set_attribute(KeyValue::new("top_p", p as f64));
+ }
+ if let Some(k) = top_k {
+ span.set_attribute(KeyValue::new("top_k", k as i64));
+ }
+ if let Some(m) = min_p {
+ span.set_attribute(KeyValue::new("min_p", m as f64));
+ }
+ c.set_sampling_params(temperature, top_p, top_k, min_p);
+ }
+ if let Some(ctx) = num_ctx {
+ span.set_attribute(KeyValue::new("num_ctx", ctx as i64));
+ c.set_num_ctx(Some(ctx));
+ }
+ Some(c)
+ } else {
+ None
+ };
  let insight_cx = current_cx.with_span(span);
  // 2. Verify chat model supports tool calling.
@@ -3681,10 +3733,11 @@ Return ONLY the summary, nothing else."#,
  // - hybrid: trust the operator's curated allowlist
  // (OPENROUTER_ALLOWED_MODELS) — no live precheck. A bad model id
  // surfaces as a chat-call error on the next step.
- let has_vision = if is_hybrid {
- // In hybrid mode the chat model never sees images directly — we
- // describe-then-inject, so `has_vision` drives only whether we
- // bother loading the image to describe it, which we always do.
+ let has_vision = if describes_then_inlines {
+ // In hybrid + llamacpp modes the chat model never sees images
+ // directly — we describe-then-inject, so `has_vision` drives only
+ // whether we bother loading the image to describe it, which we
+ // always do.
  true
  } else {
  if let Some(ref model_name) = custom_model {
@@ -3864,24 +3917,61 @@ Return ONLY the summary, nothing else."#,
  None
  };
- let hybrid_visual_description: Option<String> = if is_hybrid {
+ // describe-then-inline path. In hybrid mode the vision backend
+ // defaults to Ollama but can be flipped to llamacpp via
+ // `HYBRID_VISION_BACKEND=llamacpp` (so chat goes to OpenRouter while
+ // vision/audio routes through llama-swap). In llamacpp mode we always
+ // use the llamacpp client's configured vision slot.
+ let inlined_visual_description: Option<String> = if describes_then_inlines {
  match image_base64.as_deref() {
- Some(b64) => match self.ollama.describe_image(b64).await {
- Ok(desc) => {
- log::info!(
- "Hybrid: local vision describe succeeded ({} chars)",
- desc.len()
- );
- Some(desc)
+ Some(b64) => {
+ let use_llamacpp_vision = if is_llamacpp {
+ true
+ } else {
+ // is_hybrid branch — consult env switch
+ matches!(
+ std::env::var("HYBRID_VISION_BACKEND")
+ .ok()
+ .as_deref()
+ .map(|s| s.trim().to_lowercase())
+ .as_deref(),
+ Some("llamacpp")
+ )
+ };
+ let described = if use_llamacpp_vision {
+ match self.llamacpp.as_ref() {
+ Some(c) => c.describe_image(b64).await,
+ None => {
+ log::warn!(
+ "describe-then-inline: requested llamacpp vision but LLAMA_SWAP_URL is unset, falling back to Ollama"
+ );
+ self.ollama.describe_image(b64).await
+ }
+ }
+ } else {
+ self.ollama.describe_image(b64).await
+ };
+ match described {
+ Ok(desc) => {
+ log::info!(
+ "{}: vision describe succeeded ({} chars)",
+ backend_label,
+ desc.len()
+ );
+ Some(desc)
+ }
+ Err(e) => {
+ log::warn!(
+ "{}: vision describe failed, continuing without: {}",
+ backend_label,
+ e
+ );
+ None
+ }
  }
- Err(e) => {
- log::warn!(
- "Hybrid: local vision describe failed, continuing without: {}",
- e
- );
- None
- }
- },
+ }
  None => None,
  }
  } else {
@@ -3934,7 +4024,7 @@ Return ONLY the summary, nothing else."#,
  .map(|c| format!("Contact/Person: {}", c))
  .unwrap_or_else(|| "Contact/Person: unknown".to_string());
- let visual_block = hybrid_visual_description
+ let visual_block = inlined_visual_description
  .as_deref()
  .map(|d| format!("Visual description (from local vision model):\n{}\n\n", d))
  .unwrap_or_default();
@@ -3954,25 +4044,28 @@ Return ONLY the summary, nothing else."#,
  );
  // 10. Define tools. Gate flags computed from current data presence;
- // hybrid mode omits describe_photo since the chat model receives
- // the visual description inline (so we pass `false` for has_vision
- // in hybrid mode regardless of the model's actual capability).
- let gate_opts = self.current_gate_opts(has_vision && !is_hybrid);
+ // describe-then-inline modes (hybrid, llamacpp) omit describe_photo
+ // since the chat model receives the visual description inline (so
+ // we pass `false` for has_vision in those modes regardless of the
+ // model's actual capability).
+ let gate_opts = self.current_gate_opts(has_vision && !describes_then_inlines);
  let tools = Self::build_tool_definitions(gate_opts);
- // 11. Build initial messages. In hybrid mode images are never
- // attached to the wire message — the description is part of
- // `user_content`.
+ // 11. Build initial messages. In describe-then-inline modes images
+ // are never attached to the wire message — the description is part
+ // of `user_content`.
  let system_msg = ChatMessage::system(system_content);
  let mut user_msg = ChatMessage::user(user_content);
- if !is_hybrid && let Some(ref img) = image_base64 {
+ if !describes_then_inlines && let Some(ref img) = image_base64 {
  user_msg.images = Some(vec![img.clone()]);
  }
  let mut messages = vec![system_msg, user_msg];
  // 12. Agentic loop — dispatch through the selected backend.
- let chat_backend: &dyn LlmClient = if let Some(ref or_c) = openrouter_client {
+ let chat_backend: &dyn LlmClient = if let Some(ref lc_c) = llamacpp_client {
+ lc_c
+ } else if let Some(ref or_c) = openrouter_client {
  or_c
  } else {
  &ollama_client

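The backend-label handling in the hunks above can be summarized in a small standalone sketch. This is not the project's code: the `Backend` enum and its two methods are hypothetical names introduced only for illustration. The normalization rules (trim, lowercase, empty or missing means `local`), the error text, and the hybrid/llamacpp grouping are the parts taken from the diff.

```rust
/// Hypothetical enum standing in for the string label used in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Local,
    Hybrid,
    LlamaCpp,
}

impl Backend {
    /// Normalize the raw label the way the hunk does: trim, lowercase,
    /// fall back to "local" when missing or empty, reject anything else.
    fn parse(raw: Option<&str>) -> Result<Backend, String> {
        let label = raw
            .map(|s| s.trim().to_lowercase())
            .filter(|s| !s.is_empty())
            .unwrap_or_else(|| "local".to_string());
        match label.as_str() {
            "local" => Ok(Backend::Local),
            "hybrid" => Ok(Backend::Hybrid),
            "llamacpp" => Ok(Backend::LlamaCpp),
            other => Err(format!(
                "unknown backend '{}'; expected 'local', 'hybrid', or 'llamacpp'",
                other
            )),
        }
    }

    /// True when the chat model receives a text description of the image
    /// instead of the image itself (the `describes_then_inlines` flag).
    fn describes_then_inlines(self) -> bool {
        matches!(self, Backend::Hybrid | Backend::LlamaCpp)
    }
}
```

Parsing into an enum once, rather than comparing strings at each branch, is one possible way to keep the vision and tool-gate checks in sync; the diff achieves the same effect with a single boolean.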
978
src/ai/llamacpp.rs Normal file

@@ -0,0 +1,978 @@
// LlamaCppClient — talks to a llama-swap proxy that fronts one or more
// llama-server processes. llama-swap exposes an OpenAI-compatible HTTP
// surface (`/v1/chat/completions`, `/v1/embeddings`, `/v1/models`), so the
// wire translation mirrors `OpenRouterClient` almost exactly.
//
// Differences from OpenRouter:
// - No bearer auth or attribution headers; llama-swap is LAN-only.
// - Three model slots (`primary_model` = chat, `vision_model`, `embedding_model`)
// each map to a model id in the llama-swap config. `describe_image` and
// `generate_embeddings` issue requests with the appropriate slot id in the
// `model` field, which is how llama-swap selects which backend process to
// run.
// - `/v1/models` returns only the configured slot ids — capabilities aren't
// reported by the API, so `vision_models` is a config-time allowlist (env
// `LLAMA_SWAP_VISION_MODELS`) used to set `has_vision` on responses.
// `has_tool_calling` is assumed true for every slot, since llama-swap entries
// default to launching llama-server with `--jinja`.
//
// First consumer lands alongside the three-way backend dispatch in
// insight_generator / insight_chat.
#![allow(dead_code)]
use anyhow::{Context, Result, anyhow, bail};
use async_trait::async_trait;
use reqwest::Client;
use serde::Deserialize;
use serde_json::{Value, json};
use std::time::Duration;
use crate::ai::llm_client::{
ChatMessage, LlmClient, LlmStreamEvent, ModelCapabilities, Tool, ToolCall, ToolCallFunction,
};
use futures::stream::{BoxStream, StreamExt};
const DEFAULT_BASE_URL: &str = "http://localhost:9292/v1";
const DEFAULT_PRIMARY_MODEL: &str = "chat";
const DEFAULT_VISION_MODEL: &str = "vision";
const DEFAULT_EMBEDDING_MODEL: &str = "embed";
const DEFAULT_REQUEST_TIMEOUT_SECS: u64 = 180;
/// OpenAI-compatible client targeting a llama-swap proxy in front of one or
/// more llama-server processes. See the module doc-comment for the slot model.
#[derive(Clone)]
pub struct LlamaCppClient {
client: Client,
pub base_url: String,
/// Chat model slot id (e.g. `"chat"`). Used for `generate` /
/// `chat_with_tools` / `chat_with_tools_stream`.
pub primary_model: String,
/// Embedding model slot id (e.g. `"embed"`). Used for
/// `generate_embeddings`.
pub embedding_model: String,
/// Vision model slot id (e.g. `"vision"`). Used for `describe_image` and
/// included in `vision_models` automatically so capability lookups for
/// the default vision slot report `has_vision = true` even when the env
/// allowlist is empty.
pub vision_model: String,
/// Operator-curated set of slot ids known to be multimodal. Drives the
/// `has_vision` field in `list_models` / `model_capabilities`, since
/// llama-swap's `/v1/models` doesn't report modality. Empty allowlist
/// still marks `vision_model` as vision-capable.
pub vision_models: Vec<String>,
num_ctx: Option<i32>,
temperature: Option<f32>,
top_p: Option<f32>,
top_k: Option<i32>,
min_p: Option<f32>,
}
impl LlamaCppClient {
pub fn new(base_url: Option<String>, primary_model: Option<String>) -> Self {
let timeout_secs = std::env::var("LLAMA_SWAP_REQUEST_TIMEOUT_SECONDS")
.ok()
.and_then(|v| v.parse::<u64>().ok())
.unwrap_or(DEFAULT_REQUEST_TIMEOUT_SECS);
Self {
client: Client::builder()
.connect_timeout(Duration::from_secs(10))
.timeout(Duration::from_secs(timeout_secs))
.build()
.unwrap_or_else(|_| Client::new()),
base_url: base_url.unwrap_or_else(|| DEFAULT_BASE_URL.to_string()),
primary_model: primary_model.unwrap_or_else(|| DEFAULT_PRIMARY_MODEL.to_string()),
embedding_model: DEFAULT_EMBEDDING_MODEL.to_string(),
vision_model: DEFAULT_VISION_MODEL.to_string(),
vision_models: Vec::new(),
num_ctx: None,
temperature: None,
top_p: None,
top_k: None,
min_p: None,
}
}
pub fn set_embedding_model(&mut self, model: String) {
self.embedding_model = model;
}
pub fn set_vision_model(&mut self, model: String) {
self.vision_model = model;
}
pub fn set_vision_models(&mut self, models: Vec<String>) {
self.vision_models = models;
}
pub fn set_num_ctx(&mut self, num_ctx: Option<i32>) {
self.num_ctx = num_ctx;
}
pub fn set_sampling_params(
&mut self,
temperature: Option<f32>,
top_p: Option<f32>,
top_k: Option<i32>,
min_p: Option<f32>,
) {
self.temperature = temperature;
self.top_p = top_p;
self.top_k = top_k;
self.min_p = min_p;
}
/// Translate canonical messages to the OpenAI-compatible wire shape.
/// Behaviorally identical to `OpenRouterClient::messages_to_openai` —
/// stringify tool-call arguments, rewrite images into content-parts, attach
/// `tool_call_id` to `role=tool` messages based on the preceding assistant
/// turn's tool calls.
fn messages_to_openai(messages: &[ChatMessage]) -> Vec<Value> {
let mut out = Vec::with_capacity(messages.len());
let mut last_tool_call_ids: Vec<String> = Vec::new();
let mut next_tool_result_idx: usize = 0;
for msg in messages {
let mut obj = serde_json::Map::new();
obj.insert("role".into(), Value::String(msg.role.clone()));
match &msg.images {
Some(images) if !images.is_empty() => {
let mut parts: Vec<Value> = Vec::new();
if !msg.content.is_empty() {
parts.push(json!({"type": "text", "text": msg.content}));
}
for img in images {
let url = image_to_data_url(img);
parts.push(json!({
"type": "image_url",
"image_url": { "url": url }
}));
}
obj.insert("content".into(), Value::Array(parts));
}
_ => {
obj.insert("content".into(), Value::String(msg.content.clone()));
}
}
if let Some(tcs) = &msg.tool_calls
&& msg.role == "assistant"
{
let converted: Vec<Value> = tcs
.iter()
.enumerate()
.map(|(i, call)| {
let id = call.id.clone().unwrap_or_else(|| format!("call_{}", i));
let args_str = serde_json::to_string(&call.function.arguments)
.unwrap_or_else(|_| "{}".to_string());
json!({
"id": id,
"type": "function",
"function": {
"name": call.function.name,
"arguments": args_str,
}
})
})
.collect();
last_tool_call_ids = converted
.iter()
.filter_map(|v| v.get("id").and_then(|x| x.as_str()).map(String::from))
.collect();
next_tool_result_idx = 0;
obj.insert("tool_calls".into(), Value::Array(converted));
}
if msg.role == "tool" {
let id = last_tool_call_ids
.get(next_tool_result_idx)
.cloned()
.unwrap_or_else(|| "call_0".to_string());
obj.insert("tool_call_id".into(), Value::String(id));
next_tool_result_idx += 1;
}
out.push(Value::Object(obj));
}
out
}
/// Parse an OpenAI-compatible assistant message back into canonical shape.
/// llama.cpp emits `reasoning_content` on thinking models; we drop it for
/// parity with OpenRouter (which also strips upstream reasoning fields).
fn openai_message_to_chat(msg: &Value) -> Result<ChatMessage> {
let obj = msg
.as_object()
.ok_or_else(|| anyhow!("response message is not an object"))?;
let role = obj
.get("role")
.and_then(|v| v.as_str())
.unwrap_or("assistant")
.to_string();
let content = obj
.get("content")
.and_then(|v| v.as_str())
.unwrap_or("")
.to_string();
let tool_calls = if let Some(tcs) = obj.get("tool_calls").and_then(|v| v.as_array()) {
let mut parsed = Vec::with_capacity(tcs.len());
for tc in tcs {
let id = tc.get("id").and_then(|v| v.as_str()).map(String::from);
let function = tc
.get("function")
.ok_or_else(|| anyhow!("tool_call missing function field"))?;
let name = function
.get("name")
.and_then(|v| v.as_str())
.unwrap_or_default()
.to_string();
let args_value = match function.get("arguments") {
Some(Value::String(s)) => {
serde_json::from_str::<Value>(s).unwrap_or_else(|_| json!({}))
}
Some(v @ Value::Object(_)) => v.clone(),
_ => json!({}),
};
parsed.push(ToolCall {
id,
function: ToolCallFunction {
name,
arguments: args_value,
},
});
}
Some(parsed)
} else {
None
};
Ok(ChatMessage {
role,
content,
tool_calls,
images: None,
})
}
fn build_options(&self) -> Vec<(&'static str, Value)> {
let mut v = Vec::new();
if let Some(t) = self.temperature {
v.push(("temperature", json!(t)));
}
if let Some(p) = self.top_p {
v.push(("top_p", json!(p)));
}
if let Some(k) = self.top_k {
v.push(("top_k", json!(k)));
}
if let Some(m) = self.min_p {
v.push(("min_p", json!(m)));
}
// num_ctx isn't an OpenAI param; llama-server bakes ctx in at launch
// via -c, so we silently drop the override here. The config.yaml
// entry is the source of truth for context size.
let _ = self.num_ctx;
v
}
/// Issue a chat request with an explicit model id override. Used by
/// `describe_image` to route through the vision slot without mutating
/// `self.primary_model`.
async fn chat_completion_with_model(
&self,
model: &str,
messages: Vec<ChatMessage>,
tools: Vec<Tool>,
) -> Result<(ChatMessage, Option<i32>, Option<i32>)> {
let url = format!("{}/chat/completions", self.base_url);
let mut body = serde_json::Map::new();
body.insert("model".into(), Value::String(model.to_string()));
body.insert(
"messages".into(),
Value::Array(Self::messages_to_openai(&messages)),
);
body.insert("stream".into(), Value::Bool(false));
if !tools.is_empty() {
body.insert(
"tools".into(),
serde_json::to_value(&tools).context("serializing tools")?,
);
}
for (k, v) in self.build_options() {
body.insert(k.into(), v);
}
let resp = self
.client
.post(&url)
.json(&Value::Object(body))
.send()
.await
.with_context(|| format!("POST {} failed", url))?;
if !resp.status().is_success() {
let status = resp.status();
let body = resp.text().await.unwrap_or_default();
bail!("llama-swap chat request failed: {} — {}", status, body);
}
let parsed: Value = resp.json().await.context("parsing chat response")?;
let choice = parsed
.get("choices")
.and_then(|v| v.as_array())
.and_then(|a| a.first())
.ok_or_else(|| {
anyhow!(
"response missing choices[0]: {}",
extract_error_detail(&parsed)
)
})?;
let msg = choice.get("message").ok_or_else(|| {
anyhow!(
"choices[0] missing message: {}",
extract_error_detail(&parsed)
)
})?;
let chat_msg = Self::openai_message_to_chat(msg)?;
let usage = parsed.get("usage");
let prompt_tokens = usage
.and_then(|u| u.get("prompt_tokens"))
.and_then(|v| v.as_i64())
.map(|n| n as i32);
let completion_tokens = usage
.and_then(|u| u.get("completion_tokens"))
.and_then(|v| v.as_i64())
.map(|n| n as i32);
Ok((chat_msg, prompt_tokens, completion_tokens))
}
}
#[async_trait]
impl LlmClient for LlamaCppClient {
async fn generate(
&self,
prompt: &str,
system: Option<&str>,
images: Option<Vec<String>>,
) -> Result<String> {
let mut messages: Vec<ChatMessage> = Vec::new();
if let Some(sys) = system {
messages.push(ChatMessage::system(sys));
}
let mut user = ChatMessage::user(prompt);
user.images = images;
messages.push(user);
let (reply, _, _) = self.chat_with_tools(messages, Vec::new()).await?;
Ok(reply.content)
}
async fn chat_with_tools(
&self,
messages: Vec<ChatMessage>,
tools: Vec<Tool>,
) -> Result<(ChatMessage, Option<i32>, Option<i32>)> {
log::info!(
"llama-swap chat_with_tools: model={} messages={} tools={}",
self.primary_model,
messages.len(),
tools.len()
);
self.chat_completion_with_model(&self.primary_model.clone(), messages, tools)
.await
}
async fn chat_with_tools_stream(
&self,
messages: Vec<ChatMessage>,
tools: Vec<Tool>,
) -> Result<BoxStream<'static, Result<LlmStreamEvent>>> {
let url = format!("{}/chat/completions", self.base_url);
let mut body = serde_json::Map::new();
body.insert(
"model".into(),
Value::String(self.primary_model.clone()),
);
body.insert(
"messages".into(),
Value::Array(Self::messages_to_openai(&messages)),
);
body.insert("stream".into(), Value::Bool(true));
body.insert(
"stream_options".into(),
serde_json::json!({ "include_usage": true }),
);
if !tools.is_empty() {
body.insert(
"tools".into(),
serde_json::to_value(&tools).context("serializing tools")?,
);
}
for (k, v) in self.build_options() {
body.insert(k.into(), v);
}
let resp = self
.client
.post(&url)
.json(&Value::Object(body))
.send()
.await
.with_context(|| format!("POST {} failed", url))?;
if !resp.status().is_success() {
let status = resp.status();
let body = resp.text().await.unwrap_or_default();
bail!("llama-swap stream request failed: {} — {}", status, body);
}
let byte_stream = resp.bytes_stream();
let stream = async_stream::stream! {
let mut byte_stream = byte_stream;
let mut buf: Vec<u8> = Vec::new();
let mut accumulated_content = String::new();
let mut tool_state: std::collections::BTreeMap<
usize,
(Option<String>, Option<String>, String),
> = std::collections::BTreeMap::new();
let mut role = "assistant".to_string();
let mut prompt_tokens: Option<i32> = None;
let mut completion_tokens: Option<i32> = None;
let mut done_seen = false;
while let Some(chunk) = byte_stream.next().await {
let chunk = match chunk {
Ok(b) => b,
Err(e) => {
yield Err(anyhow!("stream read failed: {}", e));
return;
}
};
buf.extend_from_slice(&chunk);
while let Some(sep) = find_double_newline(&buf) {
let frame = buf.drain(..sep + 2).collect::<Vec<_>>();
let frame_str = match std::str::from_utf8(&frame) {
Ok(s) => s,
Err(_) => continue,
};
for line in frame_str.lines() {
let line = line.trim_end_matches('\r');
let payload = match line.strip_prefix("data: ") {
Some(p) => p,
None => continue,
};
if payload == "[DONE]" {
done_seen = true;
break;
}
let v: Value = match serde_json::from_str(payload) {
Ok(v) => v,
Err(e) => {
log::warn!(
"malformed llama-swap SSE frame: {} ({})",
payload,
e
);
continue;
}
};
if let Some(usage) = v.get("usage") {
prompt_tokens = usage
.get("prompt_tokens")
.and_then(|n| n.as_i64())
.map(|n| n as i32);
completion_tokens = usage
.get("completion_tokens")
.and_then(|n| n.as_i64())
.map(|n| n as i32);
}
let Some(choices) = v.get("choices").and_then(|c| c.as_array())
else {
continue;
};
let Some(choice) = choices.first() else { continue };
let delta = match choice.get("delta") {
Some(d) => d,
None => continue,
};
if let Some(r) = delta.get("role").and_then(|v| v.as_str()) {
role = r.to_string();
}
if let Some(content) =
delta.get("content").and_then(|v| v.as_str())
&& !content.is_empty()
{
accumulated_content.push_str(content);
yield Ok(LlmStreamEvent::TextDelta(content.to_string()));
}
if let Some(tcs) = delta.get("tool_calls").and_then(|v| v.as_array()) {
for tc_delta in tcs {
let idx = tc_delta
.get("index")
.and_then(|n| n.as_u64())
.unwrap_or(0) as usize;
let entry = tool_state
.entry(idx)
.or_insert((None, None, String::new()));
if let Some(id) =
tc_delta.get("id").and_then(|v| v.as_str())
{
entry.0 = Some(id.to_string());
}
if let Some(func) = tc_delta.get("function") {
if let Some(name) =
func.get("name").and_then(|v| v.as_str())
{
entry.1 = Some(name.to_string());
}
if let Some(args) =
func.get("arguments").and_then(|v| v.as_str())
{
entry.2.push_str(args);
}
}
}
}
}
if done_seen {
break;
}
}
if done_seen {
break;
}
}
let tool_calls: Option<Vec<ToolCall>> = if tool_state.is_empty() {
None
} else {
let mut v = Vec::with_capacity(tool_state.len());
for (_idx, (id, name, args)) in tool_state {
let arguments: Value = if args.trim().is_empty() {
Value::Object(Default::default())
} else {
serde_json::from_str(&args).unwrap_or_else(|_| {
Value::Object(Default::default())
})
};
v.push(ToolCall {
id,
function: ToolCallFunction {
name: name.unwrap_or_default(),
arguments,
},
});
}
Some(v)
};
let message = ChatMessage {
role,
content: accumulated_content,
tool_calls,
images: None,
};
yield Ok(LlmStreamEvent::Done {
message,
prompt_eval_count: prompt_tokens,
eval_count: completion_tokens,
});
};
Ok(Box::pin(stream))
}
async fn generate_embeddings(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>> {
let url = format!("{}/embeddings", self.base_url);
let body = json!({
"model": self.embedding_model,
"input": texts,
});
let resp = self
.client
.post(&url)
.json(&body)
.send()
.await
.with_context(|| format!("POST {} failed", url))?;
if !resp.status().is_success() {
let status = resp.status();
let body = resp.text().await.unwrap_or_default();
bail!("llama-swap embedding request failed: {} — {}", status, body);
}
#[derive(Deserialize)]
struct EmbedResponse {
data: Vec<EmbedItem>,
}
#[derive(Deserialize)]
struct EmbedItem {
embedding: Vec<f32>,
}
let parsed: EmbedResponse = resp.json().await.context("parsing embed response")?;
Ok(parsed.data.into_iter().map(|i| i.embedding).collect())
}
async fn describe_image(&self, image_base64: &str) -> Result<String> {
let prompt = "Briefly describe what you see in this image in 1-2 sentences. \
Focus on the people, location, and activity.";
let system = "You are a scene description assistant. Be concise and factual.";
let messages = vec![
ChatMessage::system(system),
ChatMessage {
role: "user".to_string(),
content: prompt.to_string(),
tool_calls: None,
images: Some(vec![image_base64.to_string()]),
},
];
let (reply, _, _) = self
.chat_completion_with_model(&self.vision_model.clone(), messages, Vec::new())
.await?;
Ok(reply.content)
}
async fn list_models(&self) -> Result<Vec<ModelCapabilities>> {
let url = format!("{}/models", self.base_url);
let resp = self
.client
.get(&url)
.send()
.await
.with_context(|| format!("GET {} failed", url))?;
if !resp.status().is_success() {
let status = resp.status();
let body = resp.text().await.unwrap_or_default();
bail!("llama-swap list_models failed: {} — {}", status, body);
}
let parsed: Value = resp.json().await.context("parsing models response")?;
let data = parsed
.get("data")
.and_then(|v| v.as_array())
.ok_or_else(|| anyhow!("models response missing data[]"))?;
let caps: Vec<ModelCapabilities> = data
.iter()
.map(|m| self.parse_model_capabilities(m))
.collect();
Ok(caps)
}
async fn model_capabilities(&self, model: &str) -> Result<ModelCapabilities> {
let all = self.list_models().await?;
all.into_iter()
.find(|m| m.name == model)
.ok_or_else(|| anyhow!("model '{}' not found on llama-swap", model))
}
fn primary_model(&self) -> &str {
&self.primary_model
}
}
impl LlamaCppClient {
fn parse_model_capabilities(&self, m: &Value) -> ModelCapabilities {
let name = m
.get("id")
.and_then(|v| v.as_str())
.unwrap_or_default()
.to_string();
let has_vision = name == self.vision_model || self.vision_models.iter().any(|v| v == &name);
// Tool calling is the default for llama-swap entries we configure
// (--jinja flag); no negative-list mechanism yet, so report true.
ModelCapabilities {
name,
has_vision,
has_tool_calling: true,
}
}
}
/// Extract a diagnostic fragment from a llama-swap / llama-server response
/// that doesn't match the expected `{choices: [...]}` shape. llama-server
/// returns errors as `{"error": {"message": "...", "code": N, "type": "..."}}`;
/// llama-swap itself sometimes wraps subprocess failures with its own
/// `{"error": "..."}` flat shape. Surface either when present, otherwise fall
/// back to a truncated raw-JSON view.
fn extract_error_detail(parsed: &Value) -> String {
if let Some(err) = parsed.get("error") {
match err {
Value::Object(_) => {
let message = err
.get("message")
.and_then(|v| v.as_str())
.unwrap_or("(no message)");
let code = err
.get("code")
.map(|v| match v {
Value::String(s) => s.clone(),
other => other.to_string(),
})
.unwrap_or_else(|| "?".to_string());
let short_message: String = message.chars().take(240).collect();
return format!("error code={} message=\"{}\"", code, short_message);
}
Value::String(s) => {
let short: String = s.chars().take(240).collect();
return format!("error=\"{}\"", short);
}
_ => {}
}
}
let raw = parsed.to_string();
raw.chars().take(300).collect()
}
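/// Locate the blank line that terminates an SSE frame. Callers drain
/// `..sep + 2`, so the returned index must sit two bytes before the end of
/// the terminator for both `\n\n` and `\r\n\r\n`.
fn find_double_newline(buf: &[u8]) -> Option<usize> {
for i in 0..buf.len().saturating_sub(1) {
if buf[i] == b'\n' && buf[i + 1] == b'\n' {
return Some(i);
}
if i + 3 < buf.len()
&& buf[i] == b'\r'
&& buf[i + 1] == b'\n'
&& buf[i + 2] == b'\r'
&& buf[i + 3] == b'\n'
{
// Point at the second CRLF so `sep + 2` consumes all four bytes;
// `i + 1` would leave a stray `\n` at the head of the buffer.
return Some(i + 2);
}
}
None
}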
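/// Wrap a bare base64 payload in a `data:` URL (labelled as JPEG); strings
/// that already start with `data:` pass through unchanged.
fn image_to_data_url(img: &str) -> String {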
if img.starts_with("data:") {
img.to_string()
} else {
format!("data:image/jpeg;base64,{}", img)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn tool_call_arguments_stringified_on_send() {
let msg = ChatMessage {
role: "assistant".into(),
content: String::new(),
tool_calls: Some(vec![ToolCall {
id: Some("call_abc".into()),
function: ToolCallFunction {
name: "search_sms".into(),
arguments: json!({"query": "hello", "limit": 5}),
},
}]),
images: None,
};
let wire = LlamaCppClient::messages_to_openai(&[msg]);
let tcs = wire[0]
.get("tool_calls")
.and_then(|v| v.as_array())
.expect("tool_calls present");
let args = tcs[0]
.get("function")
.and_then(|f| f.get("arguments"))
.and_then(|a| a.as_str())
.expect("arguments stringified");
let parsed: Value = serde_json::from_str(args).unwrap();
assert_eq!(parsed["query"], "hello");
assert_eq!(parsed["limit"], 5);
}
#[test]
fn tool_call_arguments_parsed_on_receive() {
let response_msg = json!({
"role": "assistant",
"content": "",
"tool_calls": [{
"id": "call_xyz",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"city\":\"Boston\",\"units\":\"celsius\"}"
}
}]
});
let parsed = LlamaCppClient::openai_message_to_chat(&response_msg).unwrap();
let tcs = parsed.tool_calls.unwrap();
assert_eq!(tcs.len(), 1);
assert_eq!(tcs[0].function.name, "get_weather");
assert_eq!(tcs[0].function.arguments["city"], "Boston");
assert_eq!(tcs[0].function.arguments["units"], "celsius");
assert_eq!(tcs[0].id.as_deref(), Some("call_xyz"));
}
#[test]
fn tool_call_arguments_accept_native_json_on_receive() {
// Some llama.cpp builds emit arguments as a JSON object directly when
// jinja's tool-output strict-string rule isn't applied — accept both.
let response_msg = json!({
"role": "assistant",
"content": "",
"tool_calls": [{
"id": "call_1",
"type": "function",
"function": {
"name": "foo",
"arguments": {"nested": {"k": 1}}
}
}]
});
let parsed = LlamaCppClient::openai_message_to_chat(&response_msg).unwrap();
let tc = &parsed.tool_calls.unwrap()[0];
assert_eq!(tc.function.arguments["nested"]["k"], 1);
}
#[test]
fn images_become_content_parts() {
let mut msg = ChatMessage::user("What is in this photo?");
msg.images = Some(vec!["BASE64DATA".into()]);
let wire = LlamaCppClient::messages_to_openai(&[msg]);
let content = wire[0].get("content").and_then(|v| v.as_array()).unwrap();
assert_eq!(content.len(), 2);
assert_eq!(content[0]["type"], "text");
assert_eq!(content[0]["text"], "What is in this photo?");
assert_eq!(content[1]["type"], "image_url");
assert_eq!(
content[1]["image_url"]["url"],
"data:image/jpeg;base64,BASE64DATA"
);
}
#[test]
fn data_url_images_pass_through_unchanged() {
let mut msg = ChatMessage::user("");
msg.images = Some(vec!["data:image/png;base64,ABCDEF".into()]);
let wire = LlamaCppClient::messages_to_openai(&[msg]);
let content = wire[0].get("content").and_then(|v| v.as_array()).unwrap();
assert_eq!(content.len(), 1);
assert_eq!(
content[0]["image_url"]["url"],
"data:image/png;base64,ABCDEF"
);
}
#[test]
fn text_only_message_stays_string() {
let msg = ChatMessage::user("hello");
let wire = LlamaCppClient::messages_to_openai(&[msg]);
assert_eq!(wire[0]["content"], "hello");
assert!(wire[0]["content"].as_str().is_some());
}
#[test]
fn tool_result_inherits_tool_call_id_from_prior_assistant() {
let assistant = ChatMessage {
role: "assistant".into(),
content: String::new(),
tool_calls: Some(vec![ToolCall {
id: Some("call_42".into()),
function: ToolCallFunction {
name: "lookup".into(),
arguments: json!({}),
},
}]),
images: None,
};
let tool_result = ChatMessage::tool_result("found it");
let wire = LlamaCppClient::messages_to_openai(&[assistant, tool_result]);
assert_eq!(wire[1]["role"], "tool");
assert_eq!(wire[1]["tool_call_id"], "call_42");
}
#[test]
fn multiple_tool_results_map_to_sequential_call_ids() {
let assistant = ChatMessage {
role: "assistant".into(),
content: String::new(),
tool_calls: Some(vec![
ToolCall {
id: Some("call_A".into()),
function: ToolCallFunction {
name: "a".into(),
arguments: json!({}),
},
},
ToolCall {
id: Some("call_B".into()),
function: ToolCallFunction {
name: "b".into(),
arguments: json!({}),
},
},
]),
images: None,
};
let r1 = ChatMessage::tool_result("a result");
let r2 = ChatMessage::tool_result("b result");
let wire = LlamaCppClient::messages_to_openai(&[assistant, r1, r2]);
assert_eq!(wire[1]["tool_call_id"], "call_A");
assert_eq!(wire[2]["tool_call_id"], "call_B");
}
#[test]
fn missing_tool_call_id_gets_synthetic_fallback() {
let assistant = ChatMessage {
role: "assistant".into(),
content: String::new(),
tool_calls: Some(vec![ToolCall {
id: None,
function: ToolCallFunction {
name: "noid".into(),
arguments: json!({}),
},
}]),
images: None,
};
let wire = LlamaCppClient::messages_to_openai(&[assistant]);
let tcs = wire[0]
.get("tool_calls")
.and_then(|v| v.as_array())
.unwrap();
assert_eq!(tcs[0]["id"], "call_0");
}
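Taken together, the three tool-call tests above suggest that tool results pick up the ids of the preceding assistant turn in order, and that a missing id is replaced by a synthetic one. The sketch below models that with a queue. `Msg` and `assign_tool_call_ids` are simplified, hypothetical stand-ins for the PR's `ChatMessage` and `messages_to_openai`. The `call_{index}` format is inferred from the single `call_0` assertion, and whether a tool result inherits a synthetic id is an assumption the tests do not cover.

```rust
use std::collections::VecDeque;

/// Simplified, hypothetical stand-in for the PR's `ChatMessage`.
enum Msg {
    /// Assistant turn carrying the ids of its tool calls (`None` = id missing).
    Assistant(Vec<Option<String>>),
    /// Tool result; it carries no id of its own.
    ToolResult,
    /// Any other message (user, system, plain assistant text).
    Other,
}

/// For each message, return the `tool_call_id` it would be sent with
/// (`None` for anything that is not a tool result).
fn assign_tool_call_ids(messages: &[Msg]) -> Vec<Option<String>> {
    let mut pending: VecDeque<String> = VecDeque::new();
    let mut out = Vec::with_capacity(messages.len());
    for msg in messages {
        match msg {
            Msg::Assistant(ids) => {
                // Queue this turn's ids, filling gaps with an index-based
                // fallback (assumed format: "call_<index>").
                pending = ids
                    .iter()
                    .enumerate()
                    .map(|(i, id)| id.clone().unwrap_or_else(|| format!("call_{}", i)))
                    .collect();
                out.push(None);
            }
            // Each tool result consumes the next queued id, in order.
            Msg::ToolResult => out.push(pending.pop_front()),
            Msg::Other => out.push(None),
        }
    }
    out
}
```

A queue keeps the pairing positional, which is all the tests require; a tool result with no queued id simply gets `None` in this sketch.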
#[test]
fn capability_inference_uses_vision_model_and_allowlist() {
let mut c = LlamaCppClient::new(None, Some("chat".into()));
c.set_vision_model("vision".into());
c.set_vision_models(vec!["qwen-vl".into()]);
let m_chat = json!({ "id": "chat" });
let m_vision = json!({ "id": "vision" });
let m_qwen = json!({ "id": "qwen-vl" });
let m_other = json!({ "id": "embed" });
let chat = c.parse_model_capabilities(&m_chat);
let vision = c.parse_model_capabilities(&m_vision);
let qwen = c.parse_model_capabilities(&m_qwen);
let other = c.parse_model_capabilities(&m_other);
assert!(!chat.has_vision);
assert!(chat.has_tool_calling);
assert!(vision.has_vision);
assert!(qwen.has_vision);
assert!(!other.has_vision);
}
}
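The capability test above implies a simple rule for `has_vision`: a slot is vision-capable if it is the configured vision model or appears in the vision allowlist. A minimal sketch of that check is below. `VisionConfig` is a hypothetical stand-in for the relevant fields of `LlamaCppClient`, whose real layout is not shown in this part of the diff.

```rust
/// Hypothetical stand-in for the vision-related settings on the client.
struct VisionConfig {
    /// Slot id set via `set_vision_model` (e.g. "vision").
    vision_model: String,
    /// Extra slot ids set via `set_vision_models` (e.g. ["qwen-vl"]).
    vision_models: Vec<String>,
}

impl VisionConfig {
    /// True when `id` is the dedicated vision slot or is on the allowlist.
    fn has_vision(&self, id: &str) -> bool {
        id == self.vision_model || self.vision_models.iter().any(|m| m == id)
    }
}
```

The test also asserts `has_tool_calling` for the chat slot; how that flag is derived is not visible here, so the sketch leaves it out.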

View File

@@ -5,6 +5,7 @@ pub mod face_client;
pub mod handlers;
pub mod insight_chat;
pub mod insight_generator;
pub mod llamacpp;
pub mod llm_client;
pub mod ollama;
pub mod openrouter;
@@ -20,7 +21,8 @@ pub use handlers::{
chat_history_handler, chat_rewind_handler, chat_stream_handler, chat_turn_handler,
delete_insight_handler, export_training_data_handler, generate_agentic_insight_handler,
generate_insight_handler, get_all_insights_handler, get_available_models_handler,
-get_insight_handler, get_openrouter_models_handler, rate_insight_handler,
+get_insight_handler, get_llamacpp_models_handler, get_openrouter_models_handler,
+rate_insight_handler,
};
pub use insight_generator::InsightGenerator;
#[allow(unused_imports)]

View File

@@ -195,6 +195,7 @@ async fn main() -> anyhow::Result<()> {
let generator = InsightGenerator::new(
ollama,
None,
None,
sms_client,
apollo_client,
insight_dao.clone(),

View File

@@ -313,6 +313,7 @@ fn main() -> std::io::Result<()> {
.service(ai::get_all_insights_handler)
.service(ai::get_available_models_handler)
.service(ai::get_openrouter_models_handler)
.service(ai::get_llamacpp_models_handler)
.service(ai::chat_turn_handler)
.service(ai::chat_stream_handler)
.service(ai::chat_history_handler)

View File

@@ -2,6 +2,7 @@ use crate::ai::apollo_client::ApolloClient;
use crate::ai::clip_client::ClipClient;
use crate::ai::face_client::FaceClient;
use crate::ai::insight_chat::{ChatLockMap, InsightChatService};
use crate::ai::llamacpp::LlamaCppClient;
use crate::ai::openrouter::OpenRouterClient;
use crate::ai::{InsightGenerator, OllamaClient, SmsApiClient};
use crate::database::{
@@ -62,6 +63,16 @@ pub struct AppState {
/// Curated list of OpenRouter model ids exposed to clients. Sourced from
/// `OPENROUTER_ALLOWED_MODELS` (comma-separated). Empty when unset.
pub openrouter_allowed_models: Vec<String>,
/// `None` when `LLAMA_SWAP_URL` is not configured. Consulted only when a
/// request explicitly opts into `backend=llamacpp`. Same shape as the
/// `openrouter` slot — present here so handlers can route to it without
/// threading through the generator.
#[allow(dead_code)]
pub llamacpp: Option<Arc<LlamaCppClient>>,
/// Curated list of llama-swap model ids exposed to clients. Sourced from
/// `LLAMA_SWAP_ALLOWED_MODELS` (comma-separated). Empty when unset; the
/// server then falls back to `LLAMA_SWAP_PRIMARY_MODEL`.
pub llamacpp_allowed_models: Vec<String>,
pub sms_client: SmsApiClient,
pub insight_generator: InsightGenerator,
/// Chat continuation service. Hold an Arc so handlers can clone cheaply.
@@ -105,6 +116,8 @@ impl AppState {
ollama: OllamaClient,
openrouter: Option<Arc<OpenRouterClient>>,
openrouter_allowed_models: Vec<String>,
llamacpp: Option<Arc<LlamaCppClient>>,
llamacpp_allowed_models: Vec<String>,
sms_client: SmsApiClient,
insight_generator: InsightGenerator,
insight_chat: Arc<InsightChatService>,
@@ -145,6 +158,8 @@ impl AppState {
ollama,
openrouter,
openrouter_allowed_models,
llamacpp,
llamacpp_allowed_models,
sms_client,
insight_generator,
insight_chat,
@@ -186,6 +201,9 @@ impl Default for AppState {
let openrouter = build_openrouter_from_env();
let openrouter_allowed_models = parse_openrouter_allowed_models();
let llamacpp = build_llamacpp_from_env();
let llamacpp_allowed_models = parse_llamacpp_allowed_models();
let sms_api_url =
env::var("SMS_API_URL").unwrap_or_else(|_| "http://localhost:8000".to_string());
let sms_api_token = env::var("SMS_API_TOKEN").ok();
@@ -250,6 +268,7 @@ impl Default for AppState {
let insight_generator = InsightGenerator::new(
ollama.clone(),
openrouter.clone(),
llamacpp.clone(),
sms_client.clone(),
apollo_client.clone(),
insight_dao.clone(),
@@ -273,6 +292,7 @@ impl Default for AppState {
Arc::new(insight_generator.clone()),
ollama.clone(),
openrouter.clone(),
llamacpp.clone(),
insight_dao.clone(),
chat_locks,
));
@@ -294,6 +314,8 @@ impl Default for AppState {
ollama,
openrouter,
openrouter_allowed_models,
llamacpp,
llamacpp_allowed_models,
sms_client,
insight_generator,
insight_chat,
@@ -335,6 +357,50 @@ fn parse_openrouter_allowed_models() -> Vec<String> {
.collect()
}
/// Build a `LlamaCppClient` from environment variables. Returns `None` when
/// `LLAMA_SWAP_URL` is unset (the llamacpp backend is then unavailable and
/// requests for it return a clear error). The slot ids default to the
/// names the bundled `llama-swap/config.yaml` uses — `chat` / `vision` /
/// `embed` — so a minimal deploy only needs to set `LLAMA_SWAP_URL`.
fn build_llamacpp_from_env() -> Option<Arc<LlamaCppClient>> {
let base_url = env::var("LLAMA_SWAP_URL").ok()?;
let primary_model = env::var("LLAMA_SWAP_PRIMARY_MODEL").ok();
let mut client = LlamaCppClient::new(Some(base_url), primary_model);
if let Ok(model) = env::var("LLAMA_SWAP_EMBEDDING_MODEL") {
client.set_embedding_model(model);
}
if let Ok(model) = env::var("LLAMA_SWAP_VISION_MODEL") {
client.set_vision_model(model);
}
client.set_vision_models(parse_llamacpp_vision_models());
Some(Arc::new(client))
}
/// Parse `LLAMA_SWAP_ALLOWED_MODELS` (comma-separated) into a vec. Used to
/// drive `/insights/llamacpp/models`; empty when unset.
fn parse_llamacpp_allowed_models() -> Vec<String> {
env::var("LLAMA_SWAP_ALLOWED_MODELS")
.unwrap_or_default()
.split(',')
.map(|s| s.trim().to_string())
.filter(|s| !s.is_empty())
.collect()
}
/// Parse `LLAMA_SWAP_VISION_MODELS` (comma-separated) — slot ids that report
/// `has_vision = true` in capability lookups. The configured `vision_model`
/// (default `vision`) is always considered vision-capable regardless of this
/// list, so a deploy that only uses the default vision slot can leave it
/// unset.
fn parse_llamacpp_vision_models() -> Vec<String> {
env::var("LLAMA_SWAP_VISION_MODELS")
.unwrap_or_default()
.split(',')
.map(|s| s.trim().to_string())
.filter(|s| !s.is_empty())
.collect()
}
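`parse_llamacpp_allowed_models` and `parse_llamacpp_vision_models` above share the same split, trim, and filter chain. The sketch below factors that chain into a pure function so the behaviour on messy input is easy to see. `parse_csv_list` is a hypothetical name introduced for illustration; the PR itself keeps the two functions separate and reads the environment directly.

```rust
/// Hypothetical helper: split a comma-separated value into trimmed,
/// non-empty entries, mirroring the chain used by the two parsers above.
fn parse_csv_list(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(|s| s.trim().to_string())
        // Drop entries left empty by stray commas or an unset variable.
        .filter(|s| !s.is_empty())
        .collect()
}
```

Because an unset variable is read as an empty string via `unwrap_or_default()`, the filter step is what turns "unset" into an empty vec rather than a vec holding one empty id.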
#[cfg(test)]
impl AppState {
/// Creates an AppState instance for testing with temporary directories
@@ -397,6 +463,7 @@ impl AppState {
let insight_generator = InsightGenerator::new(
ollama.clone(),
None,
None,
sms_client.clone(),
apollo_client.clone(),
insight_dao.clone(),
@@ -418,6 +485,7 @@ impl AppState {
Arc::new(insight_generator.clone()),
ollama.clone(),
None,
None,
insight_dao.clone(),
chat_locks,
));
@@ -445,6 +513,8 @@ impl AppState {
ollama,
None,
Vec::new(),
None,
Vec::new(),
sms_client,
insight_generator,
insight_chat,