feature/knowledge-curation #91
The PhotoLinkDetail in /knowledge/entities/{id} was dropping the library_id field, leaving consumers no way to construct a content-routed thumbnail URL. Apollo's curation screen was falling through to library=0 (the FastAPI default) and getting 400s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DELETE /knowledge/entities/{id} was 500ing on any entity that was the object of a relational fact. entity_facts.object_entity_id has ON DELETE SET NULL, but the table also has CHECK (object_entity_id IS NOT NULL OR object_value IS NOT NULL) — purely relational facts (subject + predicate + object_entity_id, no object_value, like "Alice is_friend_of Bob") would have both columns NULL after SET NULL fired, the CHECK would abort, and the whole DELETE would fail with a CHECK violation. The user just saw an opaque QueryError because the DAO swallowed the diesel error string.

Fix: wrap delete_entity in a transaction that first deletes facts where the entity is the object AND object_value is null, then deletes the entity. Facts with the entity as subject (typed facts) are CASCADE'd by the FK as before. Also start surfacing the actual diesel error in a warn log before collapsing to DbErrorKind, so future similar issues don't masquerade as the opaque QueryError.

A schema-level fix (changing the object FK to ON DELETE CASCADE via a table-rebuild migration) is the cleaner long-term resolution and is slated for Phase 2; the DAO-side pre-delete is sufficient and less invasive in the meantime.

Test pins the contract: a relational fact pointing at the deleted entity is removed, an unrelated typed fact about an unrelated entity survives, and the entity itself is deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GET /knowledge/entities/{id} now flags facts as `in_conflict` when another active fact shares the same predicate but disagrees on the object (entity id or text value).
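A minimal in-memory sketch of the delete ordering from the transaction fix above — plain Rust over a `Vec` standing in for the diesel transaction; the `Fact` struct and its field names are illustrative, not the real schema types:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    id: i32,
    subject_entity_id: i32,
    object_entity_id: Option<i32>,
    object_value: Option<String>,
}

/// Mirrors the transactional delete: first remove purely relational facts
/// that point AT the entity (the rows that would trip the CHECK once
/// SET NULL fires), then remove facts the entity is the subject of
/// (what the FK CASCADE does), after which the entity row itself goes.
fn delete_entity_facts(facts: &mut Vec<Fact>, entity_id: i32) {
    // Step 1: pre-delete — object is this entity and there is no object_value.
    facts.retain(|f| !(f.object_entity_id == Some(entity_id) && f.object_value.is_none()));
    // Step 2: CASCADE — facts whose subject is the deleted entity.
    facts.retain(|f| f.subject_entity_id != entity_id);
}
```

The ordering is the whole fix: step 1 must run before the entity row is deleted, or SET NULL fires first and the CHECK aborts the statement.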
Pure read-time computation in the handler — group facts by predicate; a distinct-object count > 1 flags all members. No schema change; same shape as `is_current` on photo insights. The flag is intentionally a *signal*, not a hard constraint. Some predicates are legitimately multi-valued (friend_of, tagged_in, appears_in) — the curator UI surfaces the amber accent and lets the user reject the stale fact, accept both, or supersede one later once the supersession column lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds bitemporal support to entity_facts. The existing `created_at` is transaction time (when we recorded the fact); the new `valid_from` / `valid_until` BIGINT columns are valid time (when the fact is/was true in the real world). NULL on either side = unbounded on that side; both NULL = "always true / unknown" — which matches the default state of every legacy row, so no backfill is needed.

The split matters for time-bounded predicates like is_in_relationship_with / lives_in / works_at: recording the fact once doesn't mean the relationship is still ongoing. The same predicate across different windows ("lives_in NYC 2018-2020", "lives_in SF 2020-present") is no longer a conflict — the interval-aware check in get_entity only flags pairs whose windows overlap. Facts with no valid-time data still flag against everything (the worst case for legacy rows — the user adds dates to suppress).

API surface:
- POST /knowledge/facts accepts optional valid_from / valid_until.
- PATCH /knowledge/facts/{id} accepts both with tri-state semantics: field omitted = leave alone, JSON null = clear to NULL, number = set. Implemented via a small serde helper around Option<Option<i64>>.
- GET /knowledge/entities/{id} surfaces both fields per fact and uses them in conflict detection.

Agent path (insight_generator) writes NULL/NULL for now — deriving valid_from from the source photo's date_taken is slated for a follow-up agent tool alongside Phase 2's supersession.
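The interval-aware conflict rule — same predicate, different object, overlapping validity windows, with NULL meaning unbounded — can be sketched in pure Rust (the `Fact` shape here is illustrative; the real handler works over DAO rows):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Fact {
    id: i32,
    predicate: String,
    object: String,           // entity id or text value, normalized to one key
    valid_from: Option<i64>,  // NULL = unbounded past
    valid_until: Option<i64>, // NULL = unbounded future
}

/// Two windows overlap unless one is known to end at or before the other
/// starts. A NULL bound is unbounded, so a fact with no valid-time data
/// overlaps everything — the worst case for legacy rows.
fn windows_overlap(a: &Fact, b: &Fact) -> bool {
    let disjoint = |from: Option<i64>, until: Option<i64>| match (from, until) {
        (Some(f), Some(u)) => f >= u, // starts at/after the other ends
        _ => false,                   // an unbounded side can't rule out overlap
    };
    !disjoint(a.valid_from, b.valid_until) && !disjoint(b.valid_from, a.valid_until)
}

/// Ids of facts flagged in_conflict: another fact shares the predicate,
/// disagrees on the object, and its validity window overlaps.
fn conflicting_ids(facts: &[Fact]) -> Vec<i32> {
    let mut by_predicate: HashMap<&str, Vec<&Fact>> = HashMap::new();
    for f in facts {
        by_predicate.entry(f.predicate.as_str()).or_default().push(f);
    }
    let mut flagged = Vec::new();
    for group in by_predicate.values() {
        for (i, a) in group.iter().enumerate() {
            let in_conflict = group
                .iter()
                .enumerate()
                .any(|(j, b)| i != j && a.object != b.object && windows_overlap(a, b));
            if in_conflict {
                flagged.push(a.id);
            }
        }
    }
    flagged.sort();
    flagged
}
```

Note the half-open treatment: "lives_in NYC until 2020" and "lives_in SF from 2020" share a boundary but do not overlap, so they don't flag.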
Test pins set + clear semantics via update_fact: setting both bounds, leaving them alone on a subsequent patch, then clearing valid_until back to NULL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two Phase-2 follow-ups in one commit, since they're coupled at the write path:

* Agent populates valid_from from the source photo's date_taken when calling store_fact. Loose semantics — date_taken is *evidence at that date*, not strictly when the fact started being true — but it gives the curator a calendar anchor and pairs with supersession to close intervals cleanly. valid_until stays NULL (a single photo can't tell us when something stopped). Honours the existing upsert_fact dedup (corroborated facts keep their first-recorded valid_from).

* Supersession: new column entity_facts.superseded_by INTEGER (migration 2026-05-10-000200), new status value 'superseded', new DAO method supersede_fact, new HTTP endpoint POST /knowledge/facts/{id}/supersede. Marking an old fact as replaced by a new one is atomic: it flips status to 'superseded', sets superseded_by, and stamps valid_until from the new fact's valid_from (when not already set). delete_fact clears dangling supersession pointers in the same transaction so the column never points at a missing row — no FK because SQLite can't ALTER ADD with REFERENCES, but the DAO maintains the invariant.

Pairs with conflict detection from the previous slice: once the old fact's valid_until is closed, its interval no longer overlaps the new fact's, so they stop flagging — the supersede action resolves the conflict.

Two tests pin the contract: supersede stamps valid_until from new.valid_from while respecting an existing valid_until, and deleting the superseder clears the dangling pointer while leaving the old fact's 'superseded' status in place for history.
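The supersede stamping rule — valid_until taken from the new fact's valid_from only when the old fact doesn't already carry a bound — as a small sketch (illustrative struct, not the real DAO types):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    status: String,
    superseded_by: Option<i32>,
    valid_from: Option<i64>,
    valid_until: Option<i64>,
}

/// Marks `old` as superseded by the fact with id `new_id`. The valid_until
/// stamp respects an existing (possibly hand-set) bound.
fn supersede(old: &mut Fact, new_id: i32, new_valid_from: Option<i64>) {
    old.status = "superseded".to_string();
    old.superseded_by = Some(new_id);
    if old.valid_until.is_none() {
        old.valid_until = new_valid_from;
    }
}
```

This is exactly why supersession resolves conflicts: once valid_until is stamped, the old fact's window closes where the new one opens and the intervals stop overlapping.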
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two nullable TEXT columns to entity_facts — `created_by_model` (LLM identifier) and `created_by_backend` ("local" / "hybrid" / "manual" / NULL) — so the curator can audit which configurations produce good fact-keeping and which produce noise.

photo_insights already carries model_version + backend, and entity_facts.source_insight_id links to it, but:
- source_insight_id is set post-loop, so chat-continuation and regenerated-insight facts lose the link.
- JOINing per read is more friction than embedding provenance on the row itself.
- Manual facts (POST /knowledge/facts) have no insight at all and need their own "manual" provenance marker.

Threading: execute_tool grows `model` + `backend` params, passed from the three call sites (agentic insight loop, chat single-turn, chat stream) using the loop-time `chat_backend.primary_model()` + `effective_backend` already in scope. tool_store_fact stamps the new fact accordingly; manual create_fact stamps backend="manual". Legacy rows leave both NULL — pre-tracking data can't be backfilled reliably from training_messages without burning compute. Indexes are partial (WHERE NOT NULL) so legacy rows don't bloat them, and "show me all facts from model X" stays fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two coupled changes to the agent's recall surface:

1. Default scope expanded. recall_facts_for_photo and recall_entities used to filter to status='active' only — which silently dropped 'reviewed' (human-verified) facts. Now they surface active + reviewed by default. Reviewed is strictly more trusted than active and shouldn't have been hidden. Rejected and superseded stay filtered.

2. New persona toggle `reviewed_only_facts` (BOOLEAN, default false, migration 2026-05-10-000400).
When set, the agent's recall on that persona returns ONLY facts with status='reviewed' — strict mode for tasks where hallucinated agent claims are particularly costly.

Wired:
- schema.rs / Persona / InsertPersona / PersonaPatch grow the field.
- PersonaView returns it as `reviewedOnlyFacts` (camelCase wire).
- PUT /personas/{id} accepts it (the mobile editor surfaces it).
- InsightGenerator now carries a PersonaDao reference so recall_facts_for_photo can read the active persona's flag at start; one extra read per recall, cheap.

Composes with include_all_memories: that operates on the persona *scope* axis (single vs hive), reviewed_only_facts on the *status* axis. They're orthogonal. Legacy persona rows pick up the default false on migration; no behavior change unless explicitly toggled. The 4 existing persona construction sites (one production, two tests, one InsertPersona in knowledge_dao tests) all default the field. The populate_knowledge bin + state.rs constructors also wire the new persona_dao arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bundles three coupled changes so agent-side mutations stay auditable and reversible:

1. Audit columns on entity_facts — `last_modified_by_model` / `last_modified_by_backend` / `last_modified_at`. Stamped on every mutation path (update_fact, supersede_fact, manual PATCH, manual supersede, the new revert). NULL on rows never touched since creation. A partial index on `last_modified_at WHERE NOT NULL` keeps the "show me recent edits" feed fast without bloating from legacy rows.

2. Per-persona gate `personas.allow_agent_corrections` (BOOLEAN, default 0). Defense in depth at two layers:
   - build_tool_definitions: when off, `update_fact` and `supersede_fact` aren't in the catalog at all, so even a hallucinated tool call by the model fails fast.
   - tool_update_fact / tool_supersede_fact: re-check the persona flag at call time and return an explicit "corrections disabled" error if it's somehow off (e.g. the flag flipped mid-loop).
   ToolGateOpts grows the flag; current_gate_opts splits into `current_gate_opts` (no persona context, defaults closed) + `current_gate_opts_for_persona` for chat callers that have a persona id. Both call sites in insight_chat are updated.

3. Revert action — new DAO method `revert_supersession` + `POST /knowledge/facts/{id}/restore`. Flips status back to 'active', clears `superseded_by`, clears `valid_until` (we don't track whether it was hand-set vs auto-stamped, so the safe reset is to drop it — the user can re-bound after). Stamps `last_modified_*` so the revert itself is attributable.

Manual paths (PATCH / supersede via HTTP, plus restore) stamp the audit columns with `("manual", "manual")`. Agent paths stamp the loop-time chat model and backend (mirroring the existing created_by_* convention). FactDetail in the HTTP response now carries the audit triple alongside the existing provenance. Apollo wires the new field set in the matching commit. PersonaView / UpdatePersonaRequest grow `allowAgentCorrections`; the PersonaPatch + InsertPersona + bulk_import paths thread it. 317 lib tests pass, including unchanged update_fact / supersede DAO tests (now passing audit=None — None means "no provenance context to attribute", legacy semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New POST /knowledge/entities/synthesize-merge { source_id, target_id } that calls the local Ollama with both entities' names + descriptions and returns a synthesized merged-description draft. Read-only on the database — the curation UI uses the response as the editable seed in the merge picker; the actual merge still requires a follow-up PATCH-target-description + POST /merge. The handler drops the KnowledgeDao lock before the LLM call so other knowledge reads aren't blocked while generation runs (typically seconds).
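The lock-drop pattern in the synthesize-merge handler can be illustrated with a plain std::sync::Mutex — a sketch only: the real handler is async and calls Ollama, and `run_llm` here is a hypothetical stand-in for that slow call:

```rust
use std::sync::Mutex;

#[derive(Clone)]
struct Entity {
    name: String,
    description: String,
}

// Stand-in for the generation call, which can take seconds.
fn run_llm(a: &Entity, b: &Entity) -> String {
    format!("{} ({}; {})", a.name, a.description, b.description)
}

/// Clones the prompt inputs under the lock, then releases it BEFORE the
/// slow generation call so concurrent knowledge reads aren't blocked.
fn synthesize_merge(dao: &Mutex<Vec<Entity>>, source: usize, target: usize) -> String {
    let (a, b) = {
        let guard = dao.lock().unwrap();
        (guard[source].clone(), guard[target].clone())
    }; // guard dropped here — the DAO is free for the duration of run_llm
    run_llm(&a, &b)
}
```

The inner block scope is what bounds the guard's lifetime; without it, the lock would be held across the whole generation.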
Failure mode is 503 with an explicit hint that the UI should fall back to skip-synthesis — keeps the merge action working when the model is offline. Output is lightly cleaned (leading "Merged description:" / surrounding quotes stripped), since small models reach for those patterns even with explicit "no preamble" guidance. Heavier parsing isn't worth it — the curator edits anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Entities are global; facts are persona-scoped. Under the active persona an entity can read as "0 facts" while having plenty under other personas the user owns — the curation UI had no way to surface that gap.

Adds a batched DAO method `get_persona_breakdowns_for_entities` that returns {entity_id → [(persona_id, count)]} in one query (group by subject + persona, user-scoped, status != rejected), and wires it into both the /knowledge/entities list rows and GET /knowledge/entities/{id}. EntitySummary grows an optional `persona_breakdown` field (skipped on serialization when None — keeps PATCH responses unchanged). EntityDetailResponse carries the breakdown as a non-optional Vec since the detail endpoint always populates it.

One extra query per list page (50 entities → 50 subject ids batched in one IN clause); a single-entity GET adds one round trip. Indexed implicitly by (subject_entity_id, persona_id) via the existing user-persona indexes on entity_facts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Finds near-duplicate entities the upsert-time cosine guard didn't catch — typically legacy data from before that guard landed, or pairs whose embeddings sit between 0.85 (the default proposal floor) and 0.92 (the auto-collapse threshold). Pure read-side feature; the actual merging still goes through the existing /knowledge/entities/merge action.

New DAO method `find_consolidation_proposals(threshold, max_groups)`:
- Loads every non-rejected entity with an embedding.
- Partitions by entity_type so a person can't cluster with a place.
- Pairwise cosine; edges above the threshold feed a union-find for transitive grouping (Sara → Sarah → Sarah J. all land in one cluster).
- Tracks min/max cosine per component so the UI can show how tight each cluster is before clicking in.
- Returns groups of >= 2 sorted by size desc then max cosine desc; trimmed to `max_groups`.

New endpoint `GET /knowledge/consolidation-proposals?threshold=&limit=` accepts the threshold (clamped 0.5–0.99 to prevent the "every entity in one mega-cluster" case) and returns groups with per-entity persona fact-count breakdowns baked in — saving the UI a separate query per cluster member. ConsolidationGroup is exported through database/mod.rs so the handler can use it without depending on knowledge_dao internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New GET /knowledge/graph?type=&limit= returns the data the curation UI's graph tab needs:
- nodes = entities with at least one in-scope fact (rejected / superseded excluded). Each carries fact_count for visual sizing. Top-N by count desc; default cap 200 (clamped 1..1000).
- edges = relational facts (object_entity_id set) grouped by (subject, object, predicate), so 3 "is_friend_of" facts between the same pair collapse into one edge with count=3.

Two raw SQL queries: an INNER JOIN onto a persona-scoped fact-count subquery for nodes (skips 0-fact entities entirely so the sim doesn't waste time on disconnected islands), then a follow-up GROUP BY over the persona-scoped fact set restricted to the node id set via IN clauses (ids are i32 so inlining is safe). Pairs with the Apollo-side GraphPanel that runs d3-force over the returned payload and renders SVG with click-to-open.
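The pairwise-cosine + union-find core of the consolidation proposals can be sketched as follows (indices stand in for entity ids; the real method also partitions by entity_type and tracks min/max cosine per component):

```rust
use std::collections::HashMap;

/// Union-find with path halving, enough for transitive grouping.
fn find(parent: &mut Vec<usize>, mut x: usize) -> usize {
    while parent[x] != x {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    x
}

fn union(parent: &mut Vec<usize>, a: usize, b: usize) {
    let (ra, rb) = (find(parent, a), find(parent, b));
    if ra != rb {
        parent[ra] = rb;
    }
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Pairwise cosine over embeddings; edges above `threshold` merge
/// components. Returns clusters of size >= 2, largest first — so
/// Sara → Sarah → Sarah J. land in one group even if the outer pair
/// is below threshold.
fn propose_groups(embeddings: &[Vec<f32>], threshold: f32) -> Vec<Vec<usize>> {
    let n = embeddings.len();
    let mut parent: Vec<usize> = (0..n).collect();
    for i in 0..n {
        for j in (i + 1)..n {
            if cosine(&embeddings[i], &embeddings[j]) >= threshold {
                union(&mut parent, i, j);
            }
        }
    }
    let mut groups: HashMap<usize, Vec<usize>> = HashMap::new();
    for i in 0..n {
        let root = find(&mut parent, i);
        groups.entry(root).or_default().push(i);
    }
    let mut out: Vec<Vec<usize>> = groups.into_values().filter(|g| g.len() >= 2).collect();
    out.sort_by_key(|g| std::cmp::Reverse(g.len()));
    out
}
```

The pairwise pass is O(n²) in entity count, which is why the endpoint stays a read-side, on-demand feature rather than running at upsert time.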
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One-shot migration that re-applies the synonym map from `normalize_entity_type` over every existing row, so legacy entries written before that helper landed in upsert_entity stop needing client-side workarounds.

person ← person | people | human | individual | contact
place ← place | location | venue | site | area | landmark
event ← event | occasion | activity | celebration
thing ← thing | object | item | product

Unknown types ("friend", "family", etc.) get a lowercase+trim sweep so at minimum case variants collapse — the curator can merge or rename them via the curation UI from there.

`UPDATE OR IGNORE` skips rows that would violate UNIQUE(name, entity_type) after the rewrite (e.g. an existing ("Sarah", "person") + ("Sarah", "Person") pair). The duplicate survives unchanged so it can be merged through the normal curation flow rather than silently disappearing. Idempotent: every UPDATE is conditional on `entity_type != canonical`, so re-running the migration is a no-op. The down migration is intentionally inert — we don't have per-row history of the original strings, and the rewritten values stay semantically correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two coupled changes to fight the speech-act-predicate problem (facts like (Cameron, expressed, "I'm tempted to...")):

1. The system prompt grows an explicit predicate-quality rule. The agent is told to use relationship-shaped verbs (lives_in, works_at, attended, is_friend_of, interested_in) and is given an explicit DON'T list (expressed, said, mentioned, stated, quoted, noted, discussed, thought, wondered), plus a concrete Bad / Good example contrasting the noise pattern with the structured paraphrase the agent should be writing. Stops the bleed for new insights.

2. Cleanup tools for the legacy noise that's already in the table:
   - get_predicate_stats(persona, limit) returns [(predicate, count)] sorted desc — feeds the curation UI's PREDICATES tab.
   - bulk_reject_facts_by_predicate(persona, predicate, audit) flips every ACTIVE fact under that predicate to 'rejected' in one transaction, stamping last_modified_* so the action is attributable + reversible per-fact through the entity detail panel. REVIEWED facts under the same predicate are left alone — the curator may have hand-approved an exception ("interested_in" might be largely noise, but a reviewed entry is intentional).

New HTTP endpoints:
GET /knowledge/predicate-stats?limit=
POST /knowledge/predicates/{predicate}/bulk-reject
Persona-scoped via the existing X-Persona-Id header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
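The shape get_predicate_stats returns — counts per predicate, sorted descending, trimmed to a limit — sketched in plain Rust (illustrative signature; the real method is a persona-scoped SQL GROUP BY):

```rust
use std::collections::HashMap;

/// Counts facts per predicate and returns (predicate, count) pairs sorted
/// by count descending, trimmed to `limit` — the feed for the PREDICATES
/// tab, where noisy speech-act predicates float to the top.
fn predicate_stats(predicates: &[&str], limit: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for p in predicates {
        *counts.entry(p).or_insert(0) += 1;
    }
    let mut stats: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(p, c)| (p.to_string(), c))
        .collect();
    // Count desc, then predicate asc for a stable, readable tie-break.
    stats.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    stats.truncate(limit);
    stats
}
```

A curator scanning this list can spot a predicate like "expressed" dominating the counts and hit bulk-reject on it in one action.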