Raise chat truncation default num_ctx to 32k, env-overridable

The history-truncation budget assumed an 8192-token context whenever a
chat request omitted num_ctx, while the llama-swap chat slots serve
20k-131k. Replayed transcripts past ~6k tokens were silently gutted
every turn — losing conversation history and destroying llama.cpp
KV-cache prefix reuse (full SWA re-prefill per turn).

Default is now 32768 (real conversations top out around 16k), with
AGENTIC_CHAT_DEFAULT_NUM_CTX to override per deploy, floored at
headroom + 1024. Explicit per-request num_ctx still wins.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
Cameron Cordes
2026-06-09 19:14:02 -04:00
parent 13f3635db2
commit 31904fef80
2 changed files with 33 additions and 8 deletions
+10 -2
View File
@@ -671,6 +671,11 @@ LLAMA_SWAP_TTS_REQUEST_TIMEOUT_SECONDS=600 # Per-request synth timeout (long
# Insight Chat Continuation
AGENTIC_CHAT_MAX_ITERATIONS=6 # Cap on tool-calling iterations per chat turn (default 6)
AGENTIC_CHAT_DEFAULT_NUM_CTX=32768 # Assumed context window for the history-truncation budget
# when a chat request omits num_ctx (default 32768). Size to
# the smallest context among the chat models actually served;
# too small silently guts replayed history every turn (and
# destroys llama.cpp KV-cache prefix reuse).
```
**AI Insights Fallback Behavior:**
@@ -794,14 +799,17 @@ Per-`(library_id, file_path)` async mutex (`AppState.insight_chat.chat_locks`)
serialises concurrent turns on the same insight so the JSON blob doesn't race.
Context management is a soft bound: if the serialized history exceeds
`num_ctx - 2048` tokens (cheap 4-byte/token heuristic), the oldest
assistant-tool_call + tool_result pairs are dropped until under budget. The
`num_ctx - 2048` tokens (cheap 4-byte/token heuristic; `num_ctx` defaults
to `AGENTIC_CHAT_DEFAULT_NUM_CTX`, 32768, when the request omits it), the
oldest assistant-tool_call + tool_result pairs are dropped until under budget. The
initial user message (with any images) and system prompt are always preserved.
The `truncated` event / flag is surfaced to the client when a drop occurred.
Configurable env:
- `AGENTIC_CHAT_MAX_ITERATIONS` — cap on tool-calling iterations per turn
(default 6). Per-request `max_iterations` is clamped to this cap.
- `AGENTIC_CHAT_DEFAULT_NUM_CTX` — assumed context window for the truncation
budget when the request omits `num_ctx` (default 32768).
**Apollo Places integration (optional):**