The history-truncation budget assumed an 8192-token context whenever a
chat request omitted num_ctx, while the llama-swap chat slots serve
20k-131k. Replayed transcripts past ~6k tokens were silently truncated
on every turn, which lost conversation history and defeated llama.cpp
KV-cache prefix reuse (forcing a full SWA re-prefill per turn).
The default is now 32768 (real conversations top out around 16k). It can
be overridden per deploy via AGENTIC_CHAT_DEFAULT_NUM_CTX and is floored
at headroom + 1024. An explicit per-request num_ctx still wins.
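A minimal sketch of the resolution order described above, not the actual implementation. The function name `effective_num_ctx`, its signature, and the `headroom` parameter are hypothetical, and the sketch assumes the floor applies only to the default / override path, not to an explicit per-request value.

```rust
/// Context size assumed when a request omits `num_ctx` (value from the text above).
const DEFAULT_NUM_CTX: u32 = 32_768;

/// Hypothetical helper: pick the context size used for the truncation budget.
///
/// * `requested`    - explicit per-request num_ctx, if the client sent one
/// * `env_override` - raw value of AGENTIC_CHAT_DEFAULT_NUM_CTX, if set
/// * `headroom`     - tokens reserved outside the history budget (assumed parameter)
fn effective_num_ctx(requested: Option<u32>, env_override: Option<&str>, headroom: u32) -> u32 {
    // An explicit per-request value always wins.
    if let Some(n) = requested {
        return n;
    }
    // Otherwise use the deploy override when it parses, else the built-in default.
    let configured = env_override
        .and_then(|raw| raw.trim().parse::<u32>().ok())
        .unwrap_or(DEFAULT_NUM_CTX);
    // Never go below headroom + 1024, so some history budget always remains.
    configured.max(headroom.saturating_add(1024))
}
```

In a real deployment the override would presumably be read with `std::env::var("AGENTIC_CHAT_DEFAULT_NUM_CTX").ok()` and passed in as `env_override.as_deref()`.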
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Keep `cargo clippy --tests` clean alongside the agentic-loop changes:
alias backfill's five-element setup() tuple as SetupFixture
(type_complexity) and build the single-library health map via
std::slice::from_ref instead of cloning (unnecessary clone-to-slice).
No behavior change.
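The two clippy fixes can be illustrated with a small sketch. The `Library` type, the element types of the tuple, and the `health_map` / `single_library_health` helpers are invented for illustration; only the `SetupFixture` alias name, `setup()`, and `std::slice::from_ref` come from the change itself.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a library record.
#[derive(Debug, Clone)]
struct Library {
    name: String,
}

/// Naming the five-element tuple keeps the long type out of function
/// signatures. The element types here are assumptions.
type SetupFixture = (Library, Vec<String>, HashMap<String, bool>, u64, Option<String>);

fn setup() -> SetupFixture {
    (
        Library { name: "fixture".to_string() },
        vec!["a.jpg".to_string()],
        HashMap::new(),
        0,
        None,
    )
}

/// Hypothetical consumer that expects a slice of libraries.
fn health_map(libraries: &[Library]) -> HashMap<String, bool> {
    libraries.iter().map(|lib| (lib.name.clone(), true)).collect()
}

fn single_library_health(library: &Library) -> HashMap<String, bool> {
    // `std::slice::from_ref` views `&Library` as a one-element `&[Library]`,
    // so there is no need to write `&[library.clone()]`.
    health_map(std::slice::from_ref(library))
}
```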
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A request carrying persona_id but no system_prompt used to fall back to
the neutral default voice. Both agentic generation
(generate_agentic_insight_handler) and chat bootstrap now resolve the
persona's stored prompt from the persona store, with precedence:
explicit non-blank client system_prompt > persona store lookup >
existing default. The "default" persona id follows the same rule: its
stored prompt is used if the store has a row, the neutral default
otherwise. Resolution happens at the handler / bootstrap entry where the
DAO is reachable; internals are unchanged.
resolve_bootstrap_system_prompt now takes the resolved persona prompt as
a second argument, and tests cover the precedence order.
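The precedence above can be sketched as a pure function. This is an illustration only: `resolve_system_prompt`, `non_blank`, and the placeholder default text are hypothetical, and the real resolve_bootstrap_system_prompt signature is not shown in this message.

```rust
/// Placeholder for the neutral default voice; the real text is not shown here.
const DEFAULT_SYSTEM_PROMPT: &str = "neutral default prompt (placeholder)";

/// Treat empty or whitespace-only prompts as absent.
fn non_blank(prompt: Option<&str>) -> Option<&str> {
    prompt.filter(|p| !p.trim().is_empty())
}

/// Hypothetical resolver:
/// explicit non-blank client prompt > persona store prompt > default.
fn resolve_system_prompt<'a>(
    client_prompt: Option<&'a str>,
    persona_prompt: Option<&'a str>,
) -> &'a str {
    non_blank(client_prompt)
        .or_else(|| non_blank(persona_prompt))
        .unwrap_or(DEFAULT_SYSTEM_PROMPT)
}
```

Keeping the resolver free of DAO access matches the note that the store lookup happens at the handler / bootstrap entry; the looked-up prompt is simply passed in.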
Also in insight_chat:
- Sync chat_turn no longer persists the synthetic "Please write your
final answer now without calling any more tools." user message pushed
on iteration exhaustion. The synthetic_idx pattern from both streaming
variants is extracted into push/remove_synthetic_final_prompt (the
remove is a defensive no-op on index drift) and applied to all three
loops; a round-trip test is included.
- Strip leaked <think> blocks from the final content persisted as the
reply in chat_turn and both streaming AgenticLoopOutcomes (mid-stream
TextDeltas are untouched; the raw transcript keeps the block).
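A rough sketch of the push/remove pattern for the synthetic final prompt. The `Message` type and both function signatures are assumptions made for illustration; only the prompt text and the "no-op on index drift" behavior come from the description above.

```rust
/// Hypothetical chat message; the real type is not shown in this message.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: String,
    content: String,
}

const FINAL_PROMPT: &str =
    "Please write your final answer now without calling any more tools.";

/// Push the synthetic user message and return its index so it can be
/// removed again before the transcript is persisted.
fn push_synthetic_final_prompt(messages: &mut Vec<Message>) -> usize {
    messages.push(Message {
        role: "user".to_string(),
        content: FINAL_PROMPT.to_string(),
    });
    messages.len() - 1
}

/// Remove the synthetic message only if `idx` still points at it.
/// If the index drifted (out of range, or another message sits there),
/// this does nothing rather than deleting real history.
fn remove_synthetic_final_prompt(messages: &mut Vec<Message>, idx: usize) {
    let still_synthetic = messages
        .get(idx)
        .is_some_and(|m| m.role == "user" && m.content == FINAL_PROMPT);
    if still_synthetic {
        messages.remove(idx);
    }
}
```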
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Agentic-loop fixes in the generator:
- New recall_facts_for_entity tool (always-on, like recall_entities):
fetches facts for one entity by id so the model can follow up on
entities surfaced by recall_entities that aren't photo-linked
(recall_facts_for_photo only covers linked entities). Mirrors that
tool's persona scoping (PersonaFilter::Single) and the persona's
reviewed_only_facts filter exactly, and renders in the same
"Entity: ... / - predicate object" style. Wired through execute_tool
and the trajectory summarizer.
- Generation now resolves gates in a persona-aware way: it calls
current_gate_opts_for_persona(images_inline, Some((user_id,
persona_id))) instead of the None-defaulting wrapper, so a persona's
allow_agent_corrections opens propose_correction during generation the
same way chat turns already did. The now-unused current_gate_opts
wrapper is removed.
- Strip leaked <think> blocks from the final assistant content before
parse_title_body / store_insight (raw training transcript keeps them).
- Honest truncation labels: get_sms_messages and get_location_history
said "Found N ..." while listing only the first K; found_header now
emits "Found N ... (showing first K):" when truncated, and the
summarizer still parses the count.
- Clamp days_radius in get_calendar_events and get_location_history to
1..=30, matching get_sms_messages.
- persona_system_prompt helper (persona store lookup, blank-prompt ->
None) for server-side persona resolution; callers land in the next
commit.
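The truncation label and the radius clamp from the list above can be sketched as follows. The signatures are assumptions, as are the untruncated "Found N ...:" form and the `parse_found_count` helper, which stands in for the summarizer's count parsing.

```rust
/// Hypothetical header builder: says how many rows exist and, when the
/// listing is cut short, how many are actually shown.
fn found_header(total: usize, shown: usize, noun: &str) -> String {
    if shown < total {
        format!("Found {total} {noun} (showing first {shown}):")
    } else {
        format!("Found {total} {noun}:")
    }
}

/// Stand-in for the summarizer side: the count is still the first token
/// after "Found ", with or without the "(showing first K)" suffix.
fn parse_found_count(header: &str) -> Option<usize> {
    header
        .strip_prefix("Found ")?
        .split_whitespace()
        .next()?
        .parse()
        .ok()
}

/// Clamp a requested day radius to the inclusive range 1..=30.
fn clamp_days_radius(days: i64) -> i64 {
    days.clamp(1, 30)
}
```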
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Ollama >=0.8 can stream tool_calls incrementally across NDJSON chunks.
chat_with_tools_stream assigned `tool_calls = Some(tcs)` on every chunk,
so only the last chunk's calls survived assembly and earlier calls were
silently dropped. Each chunk's calls are now appended to the accumulator
instead.
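The difference between overwriting and appending can be shown in a few lines. The `ToolCall` type and the helper's signature are assumptions for illustration; the actual append_streamed_tool_calls may look different.

```rust
/// Hypothetical tool call as decoded from one NDJSON chunk.
#[derive(Debug, Clone, PartialEq)]
struct ToolCall {
    name: String,
    arguments: String,
}

/// Append the calls from one chunk to the running accumulator.
///
/// The buggy version was equivalent to `*acc = Some(chunk)`, which
/// discards whatever earlier chunks had delivered.
fn append_streamed_tool_calls(acc: &mut Option<Vec<ToolCall>>, chunk: Vec<ToolCall>) {
    if chunk.is_empty() {
        // Leave the accumulator untouched so "no tool calls" stays `None`.
        return;
    }
    acc.get_or_insert_with(Vec::new).extend(chunk);
}
```

A single chunk carrying a whole batch goes through the same path, so both delivery styles end up with the same accumulated list.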
- ollama: append_streamed_tool_calls helper + tests covering two calls
arriving in separate chunks and the single-chunk batch case.
- llamacpp: the SSE delta assembly was already correct (per-index
BTreeMap, same-index argument fragments concatenate, distinct indexes
accumulate); extracted it into apply_tool_call_deltas /
finalize_tool_calls and added tests pinning that behavior.
- llm_client: new shared strip_think_blocks (moved from ollama's private
extract_final_answer, which now delegates) so the tool-calling final
content paths can reuse it; unit tests for tagged/plain/unclosed/empty
cases.
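For reference, a think-block stripper covering the tagged / plain / unclosed / empty cases might look like the sketch below. This is not the shared strip_think_blocks itself: the signature, the trimming of the result, and the choice to drop everything after an unclosed `<think>` are assumptions.

```rust
/// Sketch: remove `<think>...</think>` spans from model output.
/// Assumption: an unclosed `<think>` swallows the rest of the text.
fn strip_think_blocks(input: &str) -> String {
    const OPEN: &str = "<think>";
    const CLOSE: &str = "</think>";

    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(start) = rest.find(OPEN) {
        // Keep everything before the opening tag.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            // Closed block: continue scanning after the closing tag.
            Some(end) => rest = &after_open[end + CLOSE.len()..],
            // Unclosed block: discard the remainder.
            None => {
                rest = "";
                break;
            }
        }
    }

    out.push_str(rest);
    out.trim().to_string()
}
```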
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Expose GET /insights/history?path=... returning every generated version
of a photo's insight (current plus superseded), newest-first, backing the
mobile per-file insight history view.
- New get_insight_history_handler; reuses the existing get_insight_history
DAO method (removed its dead_code allow).
- impl From<PhotoInsight> for PhotoInsightResponse, collapsing the mapping
that was duplicated across the single-get and all-insights handlers.
- rate_insight_by_id DAO method + optional insight_id on RateInsightRequest
so previously generated versions can be approved/rejected (the path-based
rate only touches the current row).
- DAO tests for history ordering/scoping and id-targeted rating.
- cargo fmt normalized a multi-line assert in insight_chat.rs tests.
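A sketch of how a single From impl can back both the single-get and list-style handlers. All field names below are invented for illustration, and the in-memory sort is a stand-in for whatever ordering the get_insight_history DAO method already provides.

```rust
/// Hypothetical DAO row; the real PhotoInsight fields are not shown here.
#[derive(Debug, Clone)]
struct PhotoInsight {
    id: i32,
    file_path: String,
    title: String,
    summary: String,
    generated_at: i64,
    is_current: bool,
}

/// Hypothetical response shape returned by the handlers.
#[derive(Debug, Clone, PartialEq)]
struct PhotoInsightResponse {
    id: i32,
    file_path: String,
    title: String,
    summary: String,
    generated_at: i64,
    is_current: bool,
}

/// One mapping, shared by every handler that returns insights.
impl From<PhotoInsight> for PhotoInsightResponse {
    fn from(row: PhotoInsight) -> Self {
        Self {
            id: row.id,
            file_path: row.file_path,
            title: row.title,
            summary: row.summary,
            generated_at: row.generated_at,
            is_current: row.is_current,
        }
    }
}

/// Map a photo's insight versions to responses, newest first.
fn history_response(mut rows: Vec<PhotoInsight>) -> Vec<PhotoInsightResponse> {
    rows.sort_by(|a, b| b.generated_at.cmp(&a.generated_at));
    rows.into_iter().map(PhotoInsightResponse::from).collect()
}
```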
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>