multi-library: hash-keyed tagged_photo + photo_insights with reconciliation

Branch B of the multi-library data-model rollout. tagged_photo and photo_insights now follow the bytes (content_hash), not the path, matching the policy pinned in CLAUDE.md "Multi-library data model". Branch A's availability probe and EXIF scoping land first; this branch builds on top. Migration (2026-05-01-000000_hash_keyed_derived_data) Adds nullable content_hash columns to tagged_photo and photo_insights, with partial indexes on the non-null subset to keep the index small during the transitional window. The migration backfills from image_exif: * tagged_photo joins on rel_path alone (no library_id available); * photo_insights joins on (library_id, rel_path), unambiguous. Rows whose image_exif hash isn't known yet stay null and the runtime reconciliation pass populates them as the hash backlog drains. Insert-time population TagDao::tag_file looks up image_exif.content_hash by rel_path before inserting; the hash is written into the new column. InsightDao::store_insight does the same scoped to (library_id, rel_path). Caller-supplied hash on InsertPhotoInsight wins; otherwise the DAO does the lookup. Both paths fall back to None if the hash isn't known yet — reconciliation backfills. Reconciliation (database/reconcile.rs) Three idempotent passes the watcher runs once per tick after the per-library backfill loop: 1. tagged_photo NULL hashes → populate from image_exif by rel_path. 2. photo_insights NULL hashes → populate by (library_id, rel_path). 3. photo_insights scalar merge — when multiple is_current rows share a content_hash, keep the earliest generated_at as current; demote the rest. Demoted rows keep their data so /insights/history is unaffected; only the "current" pointer narrows to one per hash. No filesystem dependency, so reconcile doesn't need the availability gate; runs every tick. Logs once when something changed, debug otherwise. Tags are set-valued under the policy (union on read, already DISTINCT in queries), so there is no analogous tag-collapse pass — duplicate (tag_id, content_hash) rows across libraries are harmless. Read paths are unchanged in this branch — lookup_tags_batch's existing rel_path-via-hash-sibling expansion still produces the correct merge. A follow-up can simplify reads to use the new column directly for performance. Tests: 217 pass (212 pre-existing + 5 new in reconcile covering NULL-fill, hash-not-yet-known no-op, library scoping on insights, earliest-wins collapse, idempotency). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:52:16 +00:00
parent cce8f0c1b7
commit 48cac8c285
11 changed files with 529 additions and 2 deletions
@@ -0,0 +1,64 @@
+-- Phase B of the multi-library data-model rollout: add a nullable
+-- `content_hash` column to derived/user-intent tables that should follow
+-- the bytes rather than the path. Reads will prefer hash-key joins and
+-- fall back to rel_path while the column is null. A separate
+-- reconciliation pass collapses duplicates as the column populates.
+--
+-- See CLAUDE.md → "Multi-library data model" for the policy. The
+-- reference implementation is `face_detections`, which has been
+-- hash-keyed since it was introduced.
+--
+-- Tables in this migration:
+--   * tagged_photo   — user-intent (tags follow the bytes)
+--   * photo_insights — intrinsic to bytes (LLM-generated description)
+--
+-- favorites is the natural third candidate but its DAO is barely used in
+-- v1 and the row count is tiny; deferring lets this migration stay
+-- focused on the high-volume tables that drive cross-library overhead.
+
+-- ---------------------------------------------------------------------------
+-- tagged_photo
+-- ---------------------------------------------------------------------------
+ALTER TABLE tagged_photo ADD COLUMN content_hash TEXT;
+
+-- Backfill: for each tagged_photo row, find the content_hash for its
+-- rel_path. tagged_photo doesn't carry a library_id, so a rel_path that
+-- exists under multiple libraries with different content is genuinely
+-- ambiguous — we take the first matching image_exif row. The
+-- reconciliation pass at runtime cleans up any rows that resolve
+-- differently once a hash is known per library.
+UPDATE tagged_photo
+SET content_hash = (
+    SELECT content_hash FROM image_exif
+    WHERE image_exif.rel_path = tagged_photo.rel_path
+      AND image_exif.content_hash IS NOT NULL
+    LIMIT 1
+)
+WHERE content_hash IS NULL;
+
+-- Hash-key index. Partial (only non-null rows) to keep the index small
+-- during the transitional window where most rows are still null.
+CREATE INDEX idx_tagged_photo_content_hash
+    ON tagged_photo (content_hash)
+    WHERE content_hash IS NOT NULL;
+
+-- ---------------------------------------------------------------------------
+-- photo_insights
+-- ---------------------------------------------------------------------------
+ALTER TABLE photo_insights ADD COLUMN content_hash TEXT;
+
+-- Backfill keyed on (library_id, rel_path) — photo_insights already
+-- carries library_id, so the resolution is unambiguous.
+UPDATE photo_insights
+SET content_hash = (
+    SELECT content_hash FROM image_exif
+    WHERE image_exif.library_id = photo_insights.library_id
+      AND image_exif.rel_path = photo_insights.rel_path
+      AND image_exif.content_hash IS NOT NULL
+    LIMIT 1
+)
+WHERE content_hash IS NULL;
+
+CREATE INDEX idx_photo_insights_content_hash
+    ON photo_insights (content_hash)
+    WHERE content_hash IS NOT NULL;