feature/knowledge-curation #91
The PhotoLinkDetail in /knowledge/entities/{id} was dropping the library_id field, leaving consumers no way to construct a content-routed thumbnail URL. Apollo's curation screen was falling through to library=0 (the FastAPI default) and getting 400s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DELETE /knowledge/entities/{id} was 500ing on any entity that was the object of a relational fact. entity_facts.object_entity_id has ON DELETE SET NULL, but the table also has CHECK (object_entity_id IS NOT NULL OR object_value IS NOT NULL) — purely relational facts (subject + predicate + object_entity_id, no object_value, like "Alice is_friend_of Bob") would have both columns NULL after SET NULL fired, the CHECK would abort, and the whole DELETE would fail with a CHECK violation. The user just saw an opaque QueryError because the DAO swallowed the diesel error string.

Fix: wrap delete_entity in a transaction that first deletes facts where the entity is the object AND object_value is null, then deletes the entity. Facts with the entity as subject (typed facts) are CASCADE'd by the FK as before. Also start surfacing the actual diesel error in a warn log before collapsing to DbErrorKind, so future similar issues don't masquerade as the opaque QueryError.

A schema-level fix (changing the object FK to ON DELETE CASCADE via a table-rebuild migration) is the cleaner long-term resolution and is slated for Phase 2; the DAO-side pre-delete is sufficient and less invasive in the meantime.

Test pins the contract: a relational fact pointing at the deleted entity is removed, an unrelated typed fact about an unrelated entity survives, and the entity itself is deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GET /knowledge/entities/{id} now flags facts as `in_conflict` when another active fact shares the same predicate but disagrees on the object (entity id or text value).
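A minimal in-memory sketch of the delete ordering from the transaction fix above — plain Rust over a `Vec` standing in for the diesel transaction; the `Fact` struct and its field names are illustrative, not the real schema types:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    id: i32,
    subject_entity_id: i32,
    object_entity_id: Option<i32>,
    object_value: Option<String>,
}

/// Mirrors the transactional delete: first remove purely relational facts
/// that point AT the entity (the rows that would trip the CHECK once
/// SET NULL fires), then remove facts the entity is the subject of
/// (what the FK CASCADE does), after which the entity row itself goes.
fn delete_entity_facts(facts: &mut Vec<Fact>, entity_id: i32) {
    // Step 1: pre-delete — object is this entity and there is no object_value.
    facts.retain(|f| !(f.object_entity_id == Some(entity_id) && f.object_value.is_none()));
    // Step 2: CASCADE — facts whose subject is the deleted entity.
    facts.retain(|f| f.subject_entity_id != entity_id);
}
```

The ordering is the whole fix: step 1 must run before the entity row is deleted, or SET NULL fires first and the CHECK aborts the statement.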
Pure read-time computation in the handler — group facts by predicate; a distinct-object count > 1 flags all members. No schema change; same shape as `is_current` on photo insights. The flag is intentionally a *signal*, not a hard constraint. Some predicates are legitimately multi-valued (friend_of, tagged_in, appears_in) — the curator UI surfaces the amber accent and lets the user reject the stale fact, accept both, or supersede one later once the supersession column lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds bitemporal support to entity_facts. The existing `created_at` is transaction time (when we recorded the fact); the new `valid_from` / `valid_until` BIGINT columns are valid time (when the fact is/was true in the real world). NULL on either side = unbounded on that side; both NULL = "always true / unknown" — which matches the default state of every legacy row, so no backfill is needed.

The split matters for time-bounded predicates like is_in_relationship_with / lives_in / works_at: recording the fact once doesn't mean the relationship is still ongoing. The same predicate across different windows ("lives_in NYC 2018-2020", "lives_in SF 2020-present") is no longer a conflict — the interval-aware check in get_entity only flags pairs whose windows overlap. Facts with no valid-time data still flag against everything (the worst case for legacy rows — the user adds dates to suppress).

API surface:
- POST /knowledge/facts accepts optional valid_from / valid_until.
- PATCH /knowledge/facts/{id} accepts both with tri-state semantics: field omitted = leave alone, JSON null = clear to NULL, number = set. Implemented via a small serde helper around Option<Option<i64>>.
- GET /knowledge/entities/{id} surfaces both fields per fact and uses them in conflict detection.

Agent path (insight_generator) writes NULL/NULL for now — deriving valid_from from the source photo's date_taken is slated for a follow-up agent tool alongside Phase 2's supersession.
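The interval-aware conflict rule — same predicate, different object, overlapping validity windows, with NULL meaning unbounded — can be sketched in pure Rust (the `Fact` shape here is illustrative; the real handler works over DAO rows):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Fact {
    id: i32,
    predicate: String,
    object: String,           // entity id or text value, normalized to one key
    valid_from: Option<i64>,  // NULL = unbounded past
    valid_until: Option<i64>, // NULL = unbounded future
}

/// Two windows overlap unless one is known to end at or before the other
/// starts. A NULL bound is unbounded, so a fact with no valid-time data
/// overlaps everything — the worst case for legacy rows.
fn windows_overlap(a: &Fact, b: &Fact) -> bool {
    let disjoint = |from: Option<i64>, until: Option<i64>| match (from, until) {
        (Some(f), Some(u)) => f >= u, // starts at/after the other ends
        _ => false,                   // an unbounded side can't rule out overlap
    };
    !disjoint(a.valid_from, b.valid_until) && !disjoint(b.valid_from, a.valid_until)
}

/// Ids of facts flagged in_conflict: another fact shares the predicate,
/// disagrees on the object, and its validity window overlaps.
fn conflicting_ids(facts: &[Fact]) -> Vec<i32> {
    let mut by_predicate: HashMap<&str, Vec<&Fact>> = HashMap::new();
    for f in facts {
        by_predicate.entry(f.predicate.as_str()).or_default().push(f);
    }
    let mut flagged = Vec::new();
    for group in by_predicate.values() {
        for (i, a) in group.iter().enumerate() {
            let in_conflict = group
                .iter()
                .enumerate()
                .any(|(j, b)| i != j && a.object != b.object && windows_overlap(a, b));
            if in_conflict {
                flagged.push(a.id);
            }
        }
    }
    flagged.sort();
    flagged
}
```

Note the half-open treatment: "lives_in NYC until 2020" and "lives_in SF from 2020" share a boundary but do not overlap, so they don't flag.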
Test pins set + clear semantics via update_fact: setting both bounds, leaving them alone on a subsequent patch, then clearing valid_until back to NULL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two Phase-2 follow-ups in one commit, since they're coupled at the write path:

* Agent populates valid_from from the source photo's date_taken when calling store_fact. Loose semantics — date_taken is *evidence at that date*, not strictly when the fact started being true — but it gives the curator a calendar anchor and pairs with supersession to close intervals cleanly. valid_until stays NULL (a single photo can't tell us when something stopped). Honours the existing upsert_fact dedup (corroborated facts keep their first-recorded valid_from).

* Supersession: new column entity_facts.superseded_by INTEGER (migration 2026-05-10-000200), new status value 'superseded', new DAO method supersede_fact, new HTTP endpoint POST /knowledge/facts/{id}/supersede. Marking an old fact as replaced by a new one is atomic: it flips status to 'superseded', sets superseded_by, and stamps valid_until from the new fact's valid_from (when not already set). delete_fact clears dangling supersession pointers in the same transaction so the column never points at a missing row — no FK because SQLite can't ALTER ADD with REFERENCES, but the DAO maintains the invariant.

Pairs with conflict detection from the previous slice: once the old fact's valid_until is closed, its interval no longer overlaps the new fact's, so they stop flagging — the supersede action resolves the conflict.

Two tests pin the contract: supersede stamps valid_until from new.valid_from while respecting an existing valid_until, and deleting the superseder clears the dangling pointer while leaving the old fact's 'superseded' status in place for history.
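The supersede stamping rule — valid_until taken from the new fact's valid_from only when the old fact doesn't already carry a bound — as a small sketch (illustrative struct, not the real DAO types):

```rust
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    status: String,
    superseded_by: Option<i32>,
    valid_from: Option<i64>,
    valid_until: Option<i64>,
}

/// Marks `old` as superseded by the fact with id `new_id`. The valid_until
/// stamp respects an existing (possibly hand-set) bound.
fn supersede(old: &mut Fact, new_id: i32, new_valid_from: Option<i64>) {
    old.status = "superseded".to_string();
    old.superseded_by = Some(new_id);
    if old.valid_until.is_none() {
        old.valid_until = new_valid_from;
    }
}
```

This is exactly why supersession resolves conflicts: once valid_until is stamped, the old fact's window closes where the new one opens and the intervals stop overlapping.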
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two nullable TEXT columns to entity_facts — `created_by_model` (LLM identifier) and `created_by_backend` ("local" / "hybrid" / "manual" / NULL) — so the curator can audit which configurations produce good fact-keeping and which produce noise.

photo_insights already carries model_version + backend, and entity_facts.source_insight_id links to it, but:
- source_insight_id is set post-loop, so chat-continuation and regenerated-insight facts lose the link.
- JOINing per read is more friction than embedding provenance on the row itself.
- Manual facts (POST /knowledge/facts) have no insight at all and need their own "manual" provenance marker.

Threading: execute_tool grows `model` + `backend` params, passed from the three call sites (agentic insight loop, chat single-turn, chat stream) using the loop-time `chat_backend.primary_model()` + `effective_backend` already in scope. tool_store_fact stamps the new fact accordingly; manual create_fact stamps backend="manual". Legacy rows leave both NULL — pre-tracking data can't be backfilled reliably from training_messages without burning compute. Indexes are partial (WHERE NOT NULL) so legacy rows don't bloat them, and "show me all facts from model X" stays fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two coupled changes to the agent's recall surface:

1. Default scope expanded. recall_facts_for_photo and recall_entities used to filter to status='active' only — which silently dropped 'reviewed' (human-verified) facts. Now they surface active + reviewed by default. Reviewed is strictly more trusted than active and shouldn't have been hidden. Rejected and superseded stay filtered.

2. New persona toggle `reviewed_only_facts` (BOOLEAN, default false, migration 2026-05-10-000400).
When set, the agent's recall on that persona returns ONLY facts with status='reviewed' — strict mode for tasks where hallucinated agent claims are particularly costly.

Wired:
- schema.rs / Persona / InsertPersona / PersonaPatch grow the field.
- PersonaView returns it as `reviewedOnlyFacts` (camelCase wire).
- PUT /personas/{id} accepts it (the mobile editor surfaces it).
- InsightGenerator now carries a PersonaDao reference so recall_facts_for_photo can read the active persona's flag at start; one extra read per recall, cheap.

Composes with include_all_memories: that operates on the persona *scope* axis (single vs hive), reviewed_only_facts on the *status* axis. They're orthogonal. Legacy persona rows pick up the default false on migration; no behavior change unless explicitly toggled. The 4 existing persona construction sites (one production, two tests, one InsertPersona in knowledge_dao tests) all default the field. The populate_knowledge bin + state.rs constructors also wire the new persona_dao arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bundles three coupled changes so agent-side mutations stay auditable and reversible:

1. Audit columns on entity_facts — `last_modified_by_model` / `last_modified_by_backend` / `last_modified_at`. Stamped on every mutation path (update_fact, supersede_fact, manual PATCH, manual supersede, the new revert). NULL on rows never touched since creation. A partial index on `last_modified_at WHERE NOT NULL` keeps the "show me recent edits" feed fast without bloating from legacy rows.

2. Per-persona gate `personas.allow_agent_corrections` (BOOLEAN, default 0). Defense in depth at two layers:
   - build_tool_definitions: when off, `update_fact` and `supersede_fact` aren't in the catalog at all, so even a hallucinated tool call by the model fails fast.
   - tool_update_fact / tool_supersede_fact: re-check the persona flag at call time and return an explicit "corrections disabled" error if it's somehow off (e.g. the flag flipped mid-loop).
   ToolGateOpts grows the flag; current_gate_opts splits into `current_gate_opts` (no persona context, defaults closed) + `current_gate_opts_for_persona` for chat callers that have a persona id. Both call sites in insight_chat are updated.

3. Revert action — new DAO method `revert_supersession` + `POST /knowledge/facts/{id}/restore`. Flips status back to 'active', clears `superseded_by`, clears `valid_until` (we don't track whether it was hand-set vs auto-stamped, so the safe reset is to drop it — the user can re-bound after). Stamps `last_modified_*` so the revert itself is attributable.

Manual paths (PATCH / supersede via HTTP, plus restore) stamp the audit columns with `("manual", "manual")`. Agent paths stamp the loop-time chat model and backend (mirroring the existing created_by_* convention). FactDetail in the HTTP response now carries the audit triple alongside the existing provenance. Apollo wires the new field set in the matching commit. PersonaView / UpdatePersonaRequest grow `allowAgentCorrections`; the PersonaPatch + InsertPersona + bulk_import paths thread it. 317 lib tests pass, including unchanged update_fact / supersede DAO tests (now passing audit=None — None means "no provenance context to attribute", legacy semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New POST /knowledge/entities/synthesize-merge { source_id, target_id } that calls the local Ollama with both entities' names + descriptions and returns a synthesized merged-description draft. Read-only on the database — the curation UI uses the response as the editable seed in the merge picker; the actual merge still requires a follow-up PATCH-target-description + POST /merge. The handler drops the KnowledgeDao lock before the LLM call so other knowledge reads aren't blocked while generation runs (typically seconds).
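The lock-drop pattern in the synthesize-merge handler can be illustrated with a plain std::sync::Mutex — a sketch only: the real handler is async and calls Ollama, and `run_llm` here is a hypothetical stand-in for that slow call:

```rust
use std::sync::Mutex;

#[derive(Clone)]
struct Entity {
    name: String,
    description: String,
}

// Stand-in for the generation call, which can take seconds.
fn run_llm(a: &Entity, b: &Entity) -> String {
    format!("{} ({}; {})", a.name, a.description, b.description)
}

/// Clones the prompt inputs under the lock, then releases it BEFORE the
/// slow generation call so concurrent knowledge reads aren't blocked.
fn synthesize_merge(dao: &Mutex<Vec<Entity>>, source: usize, target: usize) -> String {
    let (a, b) = {
        let guard = dao.lock().unwrap();
        (guard[source].clone(), guard[target].clone())
    }; // guard dropped here — the DAO is free for the duration of run_llm
    run_llm(&a, &b)
}
```

The inner block scope is what bounds the guard's lifetime; without it, the lock would be held across the whole generation.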
Failure mode is 503 with an explicit hint that the UI should fall back to skip-synthesis — keeps the merge action working when the model is offline. Output is lightly cleaned (leading "Merged description:" / surrounding quotes stripped), since small models reach for those patterns even with explicit "no preamble" guidance. Heavier parsing isn't worth it — the curator edits anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Entities are global; facts are persona-scoped. Under the active persona an entity can read as "0 facts" while having plenty under other personas the user owns — the curation UI had no way to surface that gap.

Adds a batched DAO method `get_persona_breakdowns_for_entities` that returns {entity_id → [(persona_id, count)]} in one query (group by subject + persona, user-scoped, status != rejected), and wires it into both the /knowledge/entities list rows and GET /knowledge/entities/{id}. EntitySummary grows an optional `persona_breakdown` field (skipped on serialization when None — keeps PATCH responses unchanged). EntityDetailResponse carries the breakdown as a non-optional Vec since the detail endpoint always populates it.

One extra query per list page (50 entities → 50 subject ids batched in one IN clause); a single-entity GET adds one round trip. Indexed implicitly by (subject_entity_id, persona_id) via the existing user-persona indexes on entity_facts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Finds near-duplicate entities the upsert-time cosine guard didn't catch — typically legacy data from before that guard landed, or pairs whose embeddings sit between 0.85 (the default proposal floor) and 0.92 (the auto-collapse threshold). Pure read-side feature; the actual merging still goes through the existing /knowledge/entities/merge action.

New DAO method `find_consolidation_proposals(threshold, max_groups)`:
- Loads every non-rejected entity with an embedding.
- Partitions by entity_type so a person can't cluster with a place.
- Pairwise cosine; edges above the threshold feed a union-find for transitive grouping (Sara → Sarah → Sarah J. all land in one cluster).
- Tracks min/max cosine per component so the UI can show how tight each cluster is before clicking in.
- Returns groups of >= 2 sorted by size desc then max cosine desc; trimmed to `max_groups`.

New endpoint `GET /knowledge/consolidation-proposals?threshold=&limit=` accepts the threshold (clamped 0.5–0.99 to prevent the "every entity in one mega-cluster" case) and returns groups with per-entity persona fact-count breakdowns baked in — saving the UI a separate query per cluster member. ConsolidationGroup is exported through database/mod.rs so the handler can use it without depending on knowledge_dao internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New GET /knowledge/graph?type=&limit= returns the data the curation UI's graph tab needs:
- nodes = entities with at least one in-scope fact (rejected / superseded excluded). Each carries fact_count for visual sizing. Top-N by count desc; default cap 200 (clamped 1..1000).
- edges = relational facts (object_entity_id set) grouped by (subject, object, predicate), so 3 "is_friend_of" facts between the same pair collapse into one edge with count=3.

Two raw SQL queries: an INNER JOIN onto a persona-scoped fact-count subquery for nodes (skips 0-fact entities entirely so the sim doesn't waste time on disconnected islands), then a follow-up GROUP BY over the persona-scoped fact set restricted to the node id set via IN clauses (ids are i32 so inlining is safe). Pairs with the Apollo-side GraphPanel that runs d3-force over the returned payload and renders SVG with click-to-open.
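The pairwise-cosine + union-find core of the consolidation proposals can be sketched as follows (indices stand in for entity ids; the real method also partitions by entity_type and tracks min/max cosine per component):

```rust
use std::collections::HashMap;

/// Union-find with path halving, enough for transitive grouping.
fn find(parent: &mut Vec<usize>, mut x: usize) -> usize {
    while parent[x] != x {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    x
}

fn union(parent: &mut Vec<usize>, a: usize, b: usize) {
    let (ra, rb) = (find(parent, a), find(parent, b));
    if ra != rb {
        parent[ra] = rb;
    }
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Pairwise cosine over embeddings; edges above `threshold` merge
/// components. Returns clusters of size >= 2, largest first — so
/// Sara → Sarah → Sarah J. land in one group even if the outer pair
/// is below threshold.
fn propose_groups(embeddings: &[Vec<f32>], threshold: f32) -> Vec<Vec<usize>> {
    let n = embeddings.len();
    let mut parent: Vec<usize> = (0..n).collect();
    for i in 0..n {
        for j in (i + 1)..n {
            if cosine(&embeddings[i], &embeddings[j]) >= threshold {
                union(&mut parent, i, j);
            }
        }
    }
    let mut groups: HashMap<usize, Vec<usize>> = HashMap::new();
    for i in 0..n {
        let root = find(&mut parent, i);
        groups.entry(root).or_default().push(i);
    }
    let mut out: Vec<Vec<usize>> = groups.into_values().filter(|g| g.len() >= 2).collect();
    out.sort_by_key(|g| std::cmp::Reverse(g.len()));
    out
}
```

The pairwise pass is O(n²) in entity count, which is why the endpoint stays a read-side, on-demand feature rather than running at upsert time.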
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One-shot migration that re-applies the synonym map from `normalize_entity_type` over every existing row, so legacy entries written before that helper landed in upsert_entity stop needing client-side workarounds.

person ← person | people | human | individual | contact
place ← place | location | venue | site | area | landmark
event ← event | occasion | activity | celebration
thing ← thing | object | item | product

Unknown types ("friend", "family", etc.) get a lowercase+trim sweep so at minimum case variants collapse — the curator can merge or rename them via the curation UI from there.

`UPDATE OR IGNORE` skips rows that would violate UNIQUE(name, entity_type) after the rewrite (e.g. an existing ("Sarah", "person") + ("Sarah", "Person") pair). The duplicate survives unchanged so it can be merged through the normal curation flow rather than silently disappearing. Idempotent: every UPDATE is conditional on `entity_type != canonical`, so re-running the migration is a no-op. The down migration is intentionally inert — we don't have per-row history of the original strings, and the rewritten values stay semantically correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two coupled changes to fight the speech-act-predicate problem (facts like (Cameron, expressed, "I'm tempted to...")):

1. The system prompt grows an explicit predicate-quality rule. The agent is told to use relationship-shaped verbs (lives_in, works_at, attended, is_friend_of, interested_in) and is given an explicit DON'T list (expressed, said, mentioned, stated, quoted, noted, discussed, thought, wondered), plus a concrete Bad / Good example contrasting the noise pattern with the structured paraphrase the agent should be writing. Stops the bleed for new insights.

2. Cleanup tools for the legacy noise that's already in the table:
   - get_predicate_stats(persona, limit) returns [(predicate, count)] sorted desc — feeds the curation UI's PREDICATES tab.
   - bulk_reject_facts_by_predicate(persona, predicate, audit) flips every ACTIVE fact under that predicate to 'rejected' in one transaction, stamping last_modified_* so the action is attributable + reversible per-fact through the entity detail panel. REVIEWED facts under the same predicate are left alone — the curator may have hand-approved an exception ("interested_in" might be largely noise, but a reviewed entry is intentional).

New HTTP endpoints:
GET /knowledge/predicate-stats?limit=
POST /knowledge/predicates/{predicate}/bulk-reject
Persona-scoped via the existing X-Persona-Id header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
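The shape get_predicate_stats returns — counts per predicate, sorted descending, trimmed to a limit — sketched in plain Rust (illustrative signature; the real method is a persona-scoped SQL GROUP BY):

```rust
use std::collections::HashMap;

/// Counts facts per predicate and returns (predicate, count) pairs sorted
/// by count descending, trimmed to `limit` — the feed for the PREDICATES
/// tab, where noisy speech-act predicates float to the top.
fn predicate_stats(predicates: &[&str], limit: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for p in predicates {
        *counts.entry(p).or_insert(0) += 1;
    }
    let mut stats: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(p, c)| (p.to_string(), c))
        .collect();
    // Count desc, then predicate asc for a stable, readable tie-break.
    stats.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    stats.truncate(limit);
    stats
}
```

A curator scanning this list can spot a predicate like "expressed" dominating the counts and hit bulk-reject on it in one action.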