Apps/ImageApi

feature/knowledge-curation #91

Merged

cameron merged 19 commits from feature/knowledge-curation into master

2026-05-12 15:40:57 +00:00

Author	SHA1	Message	Date
Cameron Cordes	e67e00ef8a	knowledge: predicate-quality nudge + bulk-reject endpoint Two coupled changes to fight the speech-act-predicate problem (facts like (Cameron, expressed, "I'm tempted to...")): 1. System prompt grows an explicit predicate-quality rule. The agent is told to use relationship-shaped verbs (lives_in, works_at, attended, is_friend_of, interested_in), and is given an explicit DON'T list (expressed, said, mentioned, stated, quoted, noted, discussed, thought, wondered). Plus a concrete Bad / Good example contrasting the noise pattern with the structured paraphrase the agent should be writing. Stops the bleed for new insights. 2. Cleanup tools for the legacy noise that's already in the table: - get_predicate_stats(persona, limit) returns [(predicate, count)] sorted desc — feeds the curation UI's PREDICATES tab. - bulk_reject_facts_by_predicate(persona, predicate, audit) flips every ACTIVE fact under that predicate to 'rejected' in one transaction, stamping last_modified_* so the action is attributable + reversible per-fact through the entity detail panel. REVIEWED facts under the same predicate are left alone — the curator may have hand-approved an exception ("interested_in" might be largely noise but a reviewed entry is intentional). New HTTP endpoints: GET /knowledge/predicate-stats?limit= POST /knowledge/predicates/{predicate}/bulk-reject Persona-scoped via the existing X-Persona-Id header. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 21:50:26 -04:00
Cameron Cordes	fb078b4906	knowledge: normalize legacy entity_type values One-shot migration that re-applies the synonym map from `normalize_entity_type` over every existing row, so legacy entries written before that helper landed in upsert_entity stop needing client-side workarounds. person ← person \| people \| human \| individual \| contact place ← place \| location \| venue \| site \| area \| landmark event ← event \| occasion \| activity \| celebration thing ← thing \| object \| item \| product Unknown types ("friend", "family", etc.) get a lowercase+trim sweep so at minimum case variants collapse — the curator can merge or rename them via the curation UI from there. `UPDATE OR IGNORE` skips rows that would violate UNIQUE(name, entity_type) after the rewrite (e.g. an existing ("Sarah", "person") + ("Sarah", "Person") pair). The duplicate survives unchanged so it can be merged through the normal curation flow rather than silently disappearing. Idempotent: every UPDATE is conditional on `entity_type != canonical`, so re-running the migration is a no-op. The down migration is intentionally inert — we don't have per-row history of the original strings and the rewritten values stay semantically correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 21:42:51 -04:00
Cameron Cordes	d123cde333	knowledge: entity-graph endpoint for force-directed view New GET /knowledge/graph?type=&limit= returns the data the curation UI's graph tab needs: - nodes = entities with at least one in-scope fact (rejected / superseded excluded). Carries fact_count for visual sizing. Top-N by count desc; default cap 200 (clamped 1..1000). - edges = relational facts (object_entity_id set) grouped by (subject, object, predicate) so 3 "is_friend_of" facts between the same pair collapse into one edge with count=3. Two raw SQL queries: an INNER JOIN onto a persona-scoped fact- count subquery for nodes (skips 0-fact entities entirely so the sim doesn't waste time on disconnected islands), then a follow- up GROUP BY over the persona-scoped fact set restricted to the node id set via IN clauses (ids are i32 so inlining is safe). Pairs with the Apollo-side GraphPanel that runs d3-force over the returned payload and renders SVG with click-to-open. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 21:26:02 -04:00
Cameron Cordes	6dca0c027d	fmt: cargo fmt sweep No logic changes - line reflow, brace placement, and method-chain splits across handlers / personas / state / faces / knowledge / insights_dao / knowledge_dao / populate_knowledge. Picked up incidentally while running fmt for the sms-search work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 19:21:00 -04:00
Cameron Cordes	7329cc5ce7	insights: push sms search filters server-side, render snippets, expand fts5 docs - Refactor search_messages_with_contact -> search_messages(query, &SmsSearchParams) exposing date_from / date_to / offset / is_mms / has_media; drop the over-fetch + client-side date post-filter that could silently drop in-window hits past position 100. - Surface SMS-API's <mark>-wrapped snippet for MMS messages that only matched via message_parts_fts (attachment text / filename) - pre-snippet, those rendered as a blank body preview to the LLM. - Expose is_mms / has_media on the search_messages tool schema; expand the FTS5 syntax docs with worked examples for phrase / prefix / boolean / NEAR / grouping so the model picks the right operator. - Unit tests for format_search_hits (body fallback, snippet preferred, MMS attachment-only regression, empty-snippet fallback) and strip_mark_tags. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 19:20:19 -04:00
Cameron Cordes	6620fa48d7	knowledge: consolidation proposals endpoint Finds near-duplicate entities the upsert-time cosine guard didn't catch — typically legacy data from before that guard landed, or pairs whose embeddings sit between 0.85 (default proposal floor) and 0.92 (auto-collapse threshold). Pure read-side feature; the actual merging still goes through the existing /knowledge/entities/merge action. New DAO method `find_consolidation_proposals(threshold, max_groups)`: - Loads every non-rejected entity with an embedding. - Partitions by entity_type so a person can't cluster with a place. - Pairwise cosine, edges above threshold feed a union-find for transitive grouping (Sara → Sarah → Sarah J. all land in one cluster). - Tracks min/max cosine per component so the UI can show "how tight" each cluster is before clicking in. - Returns groups of >= 2 sorted by size desc then max cosine desc; trimmed to `max_groups`. New endpoint `GET /knowledge/consolidation-proposals?threshold= &limit=` accepts the threshold (clamped 0.5–0.99 to prevent the "every entity in one mega-cluster" case) and returns groups with per-entity persona fact-count breakdowns baked in — saves the UI a separate query per cluster member. ConsolidationGroup is exported through database/mod.rs so the handler can use it without depending on knowledge_dao internals. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:43:11 -04:00
Cameron Cordes	89d0a6527c	knowledge: per-entity persona breakdown for list + detail Entities are global; facts are persona-scoped. Under the active persona an entity can read as "0 facts" while having plenty under other personas the user owns — the curation UI had no way to surface that gap. Adds a batched DAO method `get_persona_breakdowns_for_entities` that returns {entity_id → [(persona_id, count)]} in one query (group by subject + persona, user-scoped, status != rejected), and wires it into both /knowledge/entities list rows and GET /knowledge/entities/{id}. EntitySummary grows an optional `persona_breakdown` field (skipped on serialization when None — keeps PATCH responses unchanged). EntityDetailResponse carries the breakdown as a non-optional Vec since the detail endpoint always populates it. One extra query per list page (50 entities → 50 subject ids batched in one IN clause); single-entity GET adds one round trip. Indexed by (subject_entity_id, persona_id) implicitly via the existing user-persona indexes on entity_facts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:29:20 -04:00
Cameron Cordes	f200466508	knowledge: forbid markdown in synthesized merge descriptions System prompt now explicitly enumerates the markdown forms the model shouldn't emit (bold, italics, headings, bullets, lists, code fences) on top of the existing "no preamble, no quotes" constraints. Some local models default to markdown-shaped output for descriptions and the curation UI is plain-text, which would render the asterisks and hashes literally. The output cleaning step picks up a parallel sweep: strip code fences, leading bullets / headings, wrapping quotes, and naive inline emphasis markers (** and __). Rare enough that the plain-replace is fine; not trying to parse markdown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 16:49:02 -04:00
Cameron Cordes	afac02cade	knowledge: synthesize-merge endpoint for LLM-curated descriptions New POST /knowledge/entities/synthesize-merge { source_id, target_id } that calls the local Ollama with both entities' names + descriptions and returns a synthesized merged-description draft. Read-only on the database — the curation UI uses the response as the editable seed in the merge picker; the actual merge still requires a follow-up PATCH-target-description + POST /merge. The handler drops the KnowledgeDao lock before the LLM call so other knowledge reads aren't blocked while generation runs (typically seconds). Failure mode is 503 with an explicit hint that the UI should fall back to skip-synthesis — keeps the merge action working when the model is offline. Output is lightly cleaned (leading "Merged description:" / surrounding quotes stripped) since small models reach for those patterns even with explicit "no preamble" guidance. Heavier parsing isn't worth it — the curator edits anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 16:37:26 -04:00
Cameron Cordes	fd4dd89bbb	knowledge: agent self-correction with audit + per-persona gate + revert Bundles three coupled changes so agent-side mutations stay auditable and reversible: 1. Audit columns on entity_facts — `last_modified_by_model` / `last_modified_by_backend` / `last_modified_at`. Stamped on every mutation path (update_fact, supersede_fact, manual PATCH, manual supersede, the new revert). NULL on rows never touched since creation. Partial index on `last_modified_at WHERE NOT NULL` keeps the "show me recent edits" feed fast without bloating from legacy rows. 2. Per-persona gate `personas.allow_agent_corrections` (BOOLEAN, default 0). Defense in depth at two layers: - build_tool_definitions: when off, `update_fact` and `supersede_fact` aren't in the catalog at all, so even a hallucinated tool call by the model fails fast. - tool_update_fact / tool_supersede_fact: re-checks the persona flag at call time and returns an explicit "corrections disabled" error if it's somehow off (e.g. flag flipped mid- loop). ToolGateOpts grows the flag; current_gate_opts splits into `current_gate_opts` (no persona context, defaults closed) + `current_gate_opts_for_persona` for chat callers that have a persona id. Both call sites in insight_chat are updated. 3. Revert action — new DAO method `revert_supersession` + `POST /knowledge/facts/{id}/restore`. Flips status back to 'active', clears `superseded_by`, clears `valid_until` (we don't track whether it was hand-set vs auto-stamped, so the safe reset is to drop it — user can re-bound after). Stamps `last_modified_` so the revert itself is attributable. Manual paths (PATCH / supersede via HTTP, plus restore) stamp the audit columns with `("manual", "manual")`. Agent paths stamp the loop-time chat model and backend (mirroring the existing created_by_ convention). FactDetail in the HTTP response now carries the audit triple alongside the existing provenance. Apollo wires the new field set in the matching commit. PersonaView / UpdatePersonaRequest grow `allowAgentCorrections`; the PersonaPatch + InsertPersona + bulk_import paths thread it. 317 lib tests pass, including unchanged update_fact / supersede DAO tests (now passing audit=None — None means "no provenance context to attribute", legacy semantics). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 20:56:56 -04:00
Cameron Cordes	86c331571d	knowledge: per-persona reviewed-only mode + agent reads include reviewed Two coupled changes to the agent's recall surface: 1. Default scope expanded. recall_facts_for_photo and recall_entities used to filter to status='active' only — which silently dropped 'reviewed' (human-verified) facts. Now they surface active + reviewed by default. Reviewed is strictly more trusted than active and shouldn't have been hidden. Rejected and superseded stay filtered. 2. New persona toggle `reviewed_only_facts` (BOOLEAN, default false, migration 2026-05-10-000400). When set, the agent's recall on that persona returns ONLY facts with status='reviewed' — strict mode for tasks where hallucinated agent claims are particularly costly. Wired: - schema.rs / Persona / InsertPersona / PersonaPatch grow the field. - PersonaView returns it as `reviewedOnlyFacts` (camelCase wire). - PUT /personas/{id} accepts it (mobile editor surfaces it). - InsightGenerator now carries a PersonaDao reference so recall_facts_for_photo can read the active persona's flag at start; one extra read per recall, cheap. Composes with include_all_memories: that operates on the persona scope axis (single vs hive), reviewed_only_facts on the status axis. They're orthogonal. Legacy persona rows pick up the default false on migration; no behavior change unless explicitly toggled. The 4 existing persona construction sites (one production, two tests, one InsertPersona in knowledge_dao tests) all default the field. populate_knowledge bin + state.rs constructors also wire the new persona_dao arg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 20:21:39 -04:00
Cameron Cordes	f53338923d	knowledge: stamp model + backend on facts for audit Adds two nullable TEXT columns to entity_facts — `created_by_model` (LLM identifier) and `created_by_backend` ("local" / "hybrid" / "manual" / NULL) — so the curator can audit which configurations produce good fact-keeping and which produce noise. photo_insights already carries model_version + backend, and entity_facts.source_insight_id links to it, but: - source_insight_id is set post-loop, so chat-continuation and regenerated-insight facts lose the link. - JOINing per read is more friction than embedding provenance on the row itself. - Manual facts (POST /knowledge/facts) have no insight at all and need their own "manual" provenance marker. Threading: execute_tool grows `model` + `backend` params, passed from the three call sites (agentic insight loop, chat single-turn, chat stream) using the loop-time `chat_backend.primary_model()` + `effective_backend` already in scope. tool_store_fact stamps the new fact accordingly; manual create_fact stamps backend="manual". Legacy rows leave both NULL — pre-tracking data can't be back- filled reliably from training_messages without burning compute. Indexes are partial (WHERE NOT NULL) so legacy rows don't bloat them, and "show me all facts from model X" stays fast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 20:05:14 -04:00
Cameron Cordes	85f3716379	knowledge: fact supersession + photo-date valid_from Two Phase-2 followups in one commit since they're coupled at the write path: * Agent populates valid_from from the source photo's date_taken when calling store_fact. Loose semantics — date_taken is evidence at that date, not strictly when the fact started being true — but gives the curator a calendar anchor and pairs with supersession to close intervals cleanly. valid_until stays NULL (a single photo can't tell us when something stopped). Honours the existing upsert_fact dedup (corroborated facts keep their first-recorded valid_from). * Supersession: new column entity_facts.superseded_by INTEGER (migration 2026-05-10-000200), new status value 'superseded', new DAO method supersede_fact, new HTTP endpoint POST /knowledge/facts/{id}/supersede. Marking an old fact as replaced by a new one atomically: flips status to 'superseded', sets superseded_by, and stamps valid_until from the new fact's valid_from (when not already set). delete_fact clears dangling supersession pointers in the same transaction so the column never points at a missing row — no FK because SQLite can't ALTER ADD with REFERENCES, but the DAO maintains the invariant. Pairs with conflict detection from the previous slice: once the old fact's valid_until is closed, its interval no longer overlaps the new fact's, so they stop flagging — the supersede action resolves the conflict. Two tests pin the contract: supersede stamps valid_until from new.valid_from while respecting an existing valid_until, and deleting the supersedeR clears the dangling pointer while leaving the old fact's 'superseded' status in place for history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 19:47:06 -04:00
Cameron Cordes	01f5ad7527	knowledge: valid-time on facts + interval-aware conflict detection Adds bitemporal support to entity_facts. Existing `created_at` is transaction time (when we recorded the fact); the new `valid_from` / `valid_until` BIGINT columns are valid time (when the fact is/was true in the real world). NULL on either side = unbounded on that side, both NULL = "always-true / unknown" — matches the default state of every legacy row, no backfill needed. The split matters for time-bounded predicates like is_in_relationship_with / lives_in / works_at: recording the fact once doesn't mean the relationship is still ongoing. Same predicate across different windows ("lives_in NYC 2018-2020", "lives_in SF 2020-present") is no longer a conflict — the interval-aware check in get_entity only flags pairs whose windows overlap. Facts with no valid-time data still flag against everything (worst case for legacy rows — user adds dates to suppress). API surface: - POST /knowledge/facts accepts optional valid_from / valid_until. - PATCH /knowledge/facts/{id} accepts both with tri-state semantics: field omitted = leave alone, JSON null = clear to NULL, number = set. Implemented via a small serde helper around Option<Option>. - GET /knowledge/entities/{id} surfaces both fields per fact and uses them in conflict detection. Agent path (insight_generator) writes NULL/NULL for now — deriving valid_from from the source photo's date_taken is slated for a follow-up agent tool alongside Phase 2's supersession. Test pins set + clear semantics via update_fact: setting both bounds, leaving them alone on a subsequent patch, then clearing valid_until back to NULL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 19:25:55 -04:00
Cameron Cordes	bcd5312953	knowledge: detect same-predicate object conflicts at read time GET /knowledge/entities/{id} now flags facts as `in_conflict` when another active fact shares the same predicate but disagrees on the object (entity id or text value). Pure read-time computation in the handler — group facts by predicate, distinct-object count > 1 flags all members. No schema change; same shape as `is_current` on photo insights. The flag is intentionally a signal, not a hard constraint. Some predicates are legitimately multi-valued (friend_of, tagged_in, appears_in) — the curator UI surfaces the amber accent and lets the user reject the stale fact, accept both, or supersede one later once the supersession column lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 19:14:58 -04:00
Cameron Cordes	0b8478a5e4	knowledge: list sort + persona-scoped fact_count per entity Two related additions to /knowledge/entities: - New EntitySort enum (UpdatedDesc default, NameAsc, FactCountDesc) surfaced via `?sort=updated\|name\|count`. NameAsc clusters near- duplicate names so dupes stand out at a glance; FactCountDesc surfaces heavily-used entities and demotes 0-fact noise to the bottom. - New `list_entities_with_fact_counts` DAO method that returns each entity alongside a persona-scoped count of its non-rejected facts (subject side). Persona scope follows X-Persona-Id via the existing resolve_persona_filter chain — Single filters on (user_id, persona_id), All unions across the user's personas. Implemented as one raw SQL query with a LEFT JOIN to a fact-count subquery and ORDER BY tied to the chosen sort, so count-sort needs no second round trip. The agent's existing list_entities call site is unchanged — it doesn't need persona-scoped counts and the trait method stays cheap. EntitySummary grows an Option<i64> fact_count (skip_serializing_if none) so PATCH responses stay shaped as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 16:04:13 -04:00
Cameron Cordes	0e2b18224f	knowledge: pre-delete relational facts so entity delete succeeds DELETE /knowledge/entities/{id} was 500ing on any entity that was the object of a relational fact. entity_facts.object_entity_id has ON DELETE SET NULL, but the table also has CHECK (object_entity_id IS NOT NULL OR object_value IS NOT NULL) — purely relational facts (subject + predicate + object_entity_id, no object_value, like "Alice is_friend_of Bob") would have both NULL after SET NULL fired, the CHECK would abort, and the whole DELETE would fail with a CHECK violation. The user just saw QueryError because the DAO swallowed the diesel error string. Wrap delete_entity in a transaction that first deletes facts where the entity is the object AND object_value is null, then deletes the entity. Surviving siblings (typed facts about the entity as subject) are CASCADE'd by the FK as before. Also start surfacing the actual diesel error in a warn log before collapsing to DbErrorKind so future similar issues don't masquerade as the opaque QueryError. A schema-level fix (changing object FK to ON DELETE CASCADE via a table-rebuild migration) is the cleaner long-term resolution and is slated for Phase 2; the DAO-side pre-delete is sufficient and less invasive in the meantime. Test pins the contract: a relational fact pointing at the deleted entity is removed, an unrelated typed fact about an unrelated entity survives, and the entity itself is deleted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 15:44:38 -04:00
Cameron Cordes	f7ce3d2b22	knowledge: include library_id in photo_links response The PhotoLinkDetail in /knowledge/entities/{id} was dropping the library_id field, leaving consumers no way to construct a content-routed thumbnail URL. Apollo's curation screen was falling through to library=0 (the FastAPI default) and getting 400s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 15:19:37 -04:00
Cameron Cordes	d7aee4f228	knowledge: cosine dedup, fact create endpoint, recall nudge Phase 1 of the knowledge curation work. Three small server-side changes to support an Apollo-side curation surface and reduce the agent's near- duplicate output rate going forward: - upsert_entity grows an embedding-cosine fallback after the exact name match misses. New entities whose embedding sits above ENTITY_DEDUP_COSINE_THRESHOLD (default 0.92) against any same-type active entity collapse onto the existing row. Eliminates the Sarah / Sara / Sarah J. trio the FTS5 prefix check was missing. - POST /knowledge/facts symmetric with the existing PATCH/DELETE so the curation UI can create facts directly. Persona-scoped via X-Persona-Id; validates subject (and optional object) entity existence; reuses KnowledgeDao::upsert_fact so corroboration semantics match the agent path. - One sentence in build_system_content telling the agent to call recall_entities before store_entity when a name resembles something already known. Cheap; complements the DAO-layer guard. Includes upsert_entity_collapses_near_duplicate_by_embedding test covering both the collapse-on-near-match path and the don't-collapse-on- unrelated-embedding path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 15:16:05 -04:00