Compare commits

34 Commits

Author SHA1 Message Date
82dd21b205 Merge pull request 'feature/duplicate-detection' (#73) from feature/duplicate-detection into master
Reviewed-on: #73
2026-05-03 22:34:49 +00:00
Cameron Cordes
57b7bad086 duplicates: library-aware visibility — only hide a demoted row when its survivor is reachable
Soft-marked rows used to disappear from /photos globally, including
from a library-scoped view that didn't contain the survivor at all.
A user browsing lib A who'd promoted a file from lib B as the
survivor would silently lose visibility on their own copy in lib A,
even though lib B's file isn't reachable from lib A's view.

Library-scoped queries now keep a demoted row visible when its
survivor lives in a library outside the current scope. Implemented
as a NOT EXISTS subquery against the same image_exif table aliased
as `survivor`. The unscoped (all-libraries) view is unchanged — every
survivor is reachable, so demoted rows stay hidden as before.
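
For illustration, a toy in-memory model of the scoped-visibility rule (hypothetical Rust types; the real check is the NOT EXISTS subquery described above):

```rust
struct Row {
    library_id: i32,
    content_hash: &'static str,
    duplicate_of_hash: Option<&'static str>, // soft-mark: Some(survivor hash)
}

/// In a library-scoped view, a demoted row is hidden only when its
/// survivor is reachable from the same scope.
fn visible_in_library(row: &Row, scope: i32, all: &[Row]) -> bool {
    if row.library_id != scope {
        return false;
    }
    match row.duplicate_of_hash {
        None => true, // not demoted
        Some(survivor) => !all
            .iter()
            .any(|r| r.library_id == scope && r.content_hash == survivor),
    }
}

fn main() {
    let rows = [
        Row { library_id: 1, content_hash: "aaa", duplicate_of_hash: Some("bbb") },
        Row { library_id: 2, content_hash: "bbb", duplicate_of_hash: None },
    ];
    // Survivor "bbb" lives only in lib 2, so lib 1's demoted copy stays visible.
    assert!(visible_in_library(&rows[0], 1, &rows));
}
```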

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:24:07 -04:00
Cameron Cordes
98057c98a1 duplicates: tighten perceptual cluster — entropy band, asymmetric dHash, medoid prune
Three changes against "still too loose at lowest sensitivity":

- Popcount entropy band tightened from [8, 56] to [16, 48]. The wider
  band let too much low-frequency content through (skies, scans,
  faded film) where pHash collapses to near-uniform values that
  collide Hamming-trivially with hundreds of unrelated images.
- dHash check now uses an asymmetric stricter threshold
  (dhash_threshold = max(2, threshold/2)). pHash is the candidate-
  discovery signal; dHash is validation. Splitting the budget means
  a real near-dup survives both while incidental pHash collisions
  on uniform content get vetoed. Missing dHash on either side now
  rejects the edge (was: trust pHash alone).
- Single-link union-find can chain weakly-similar images via
  transitive edges. Added a medoid-validation pass: per cluster,
  pick the member with smallest summed distance to others, then
  drop any whose distance to it exceeds threshold. Two new tests
  pin both invariants.
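
For illustration, a sketch of two of the checks above, under assumed types (u64 hashes); names are hypothetical, not the crate's actual API:

```rust
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Validation budget split: dHash gets a stricter threshold than pHash.
fn dhash_threshold(threshold: u32) -> u32 {
    std::cmp::max(2, threshold / 2)
}

/// Per cluster: pick the member with the smallest summed distance to the
/// others (the medoid), then drop any member farther than `threshold`
/// from it. Assumes a non-empty cluster.
fn medoid_prune(cluster: &[u64], threshold: u32) -> Vec<u64> {
    let medoid = *cluster
        .iter()
        .min_by_key(|&&h| cluster.iter().map(|&o| hamming(h, o)).sum::<u32>())
        .unwrap();
    cluster
        .iter()
        .copied()
        .filter(|&h| hamming(h, medoid) <= threshold)
        .collect()
}

fn main() {
    assert_eq!(dhash_threshold(8), 4);
    assert_eq!(dhash_threshold(3), 2); // floor of 2
    // 0x00 and 0x03 are near; 0xFF only chains in transitively. The medoid
    // is 0x03, and 0xFF sits 6 bits from it, beyond threshold 4, so it's dropped.
    assert_eq!(medoid_prune(&[0x00, 0x03, 0xFF], 4), vec![0x00, 0x03]);
}
```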

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:19:48 -04:00
Cameron Cordes
7ca888e95d duplicates: filter low-entropy hashes + dHash double-check, fix backfill loop
The perceptual cluster was producing one giant first group that
contained hundreds of unrelated images. Two causes:
- Solid-colour images (skies, black frames, monochrome scans) all
  hash to near-zero pHashes that Hamming-distance-zero to each other.
- Single-link clustering on pHash alone is too permissive — a chain
  of weakly-similar images all collapses into one cluster.

Fixed by skipping hashes outside the popcount [8, 56] band (uniform
content) and requiring dHash agreement within threshold before
unioning a candidate edge from the BK-tree. Two new tests pin both
invariants.
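
A minimal sketch of the popcount entropy filter described above (the [8, 56] band is from this commit; names are illustrative):

```rust
const BAND_LO: u32 = 8;
const BAND_HI: u32 = 56;

/// Uniform content (skies, black frames, monochrome scans) hashes to
/// popcounts near 0 or 64; such hashes are excluded from clustering.
fn passes_entropy_band(phash: u64) -> bool {
    let bits = phash.count_ones();
    (BAND_LO..=BAND_HI).contains(&bits)
}

fn main() {
    assert!(!passes_entropy_band(0)); // all-zero: solid colour (or sentinel)
    assert!(!passes_entropy_band(u64::MAX));
    assert!(passes_entropy_band(0x0F0F_0F0F_0F0F_0F0F)); // 32 bits set
}
```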

Separate backfill fix: decode-failed rows kept phash_64=NULL
and got re-pulled by every batch, infinite-looping on a queue of
unbreakable formats. Persist a 0/0 sentinel on decode failure so
the row leaves the candidate set; the all-zero hash is excluded
from clustering by the same entropy filter so it doesn't pollute
results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:08:05 -04:00
Cameron Cordes
7584cd8792 duplicates: perceptual hash + soft-mark resolution + upload 409
Adds pHash + dHash columns alongside the existing blake3 content_hash so
near-duplicates (re-encoded, resized, format-converted copies) become
queryable. /duplicates/{exact,perceptual} return groups; /duplicates/
{resolve,unresolve} flip a duplicate_of_hash soft-mark on losing rows
and union perceptual-only tag sets onto the survivor. The default
/photos listing filters duplicate_of_hash IS NULL so demoted siblings
stop cluttering the grid; include_duplicates=true opts back in for
Apollo's review modal. Upload now hashes bytes pre-write and returns
409 with the canonical sibling when a file's bytes already exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:36:01 -04:00
4340b164eb Merge pull request 'perf/faces-embeddings-no-clone' (#72) from perf/faces-embeddings-no-clone into master
Reviewed-on: #72
2026-05-01 23:09:22 +00:00
Cameron Cordes
fb4df4b195 style: cargo fmt sweep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:01:00 -04:00
Cameron Cordes
1d9b9a0bc4 faces: avoid 40 MB row clone in /faces/embeddings
list_embeddings cloned the full FaceDetectionRow inside the filter_map
just to pair it with the base64-encoded embedding. The 2 KB BLOB was
already on the row — at 20k unassigned faces that's 40 MB of pointless
heap traffic per Apollo cluster-suggest run. Move the bytes out via
Option::take() so the row drops the BLOB instead of duplicating it.
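
The move-instead-of-clone pattern, sketched with a simplified row type (FaceDetectionRow's real fields differ):

```rust
struct DetectionRow {
    id: i64,
    embedding: Option<Vec<u8>>, // the ~2 KB BLOB
}

fn take_embedding(row: &mut DetectionRow) -> Option<(i64, Vec<u8>)> {
    // Option::take moves the BLOB out, leaving None behind; the row is
    // dropped afterwards anyway, so nothing gets cloned.
    row.embedding.take().map(|bytes| (row.id, bytes))
}

fn main() {
    let mut row = DetectionRow { id: 7, embedding: Some(vec![0u8; 4]) };
    let (id, bytes) = take_embedding(&mut row).unwrap();
    assert_eq!((id, bytes.len()), (7, 4));
    assert!(row.embedding.is_none()); // moved, not cloned
}
```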

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:00:55 -04:00
7998a0c9b0 Merge pull request 'feature/per-library-excluded-dirs' (#71) from feature/per-library-excluded-dirs into master
Reviewed-on: #71
2026-05-01 20:11:10 +00:00
Cameron Cordes
58f010f302 docs(claude): pin excluded_dirs entry-form syntax
The two entry shapes for libraries.excluded_dirs / EXCLUDED_DIRS
are not symmetric:
  - /sub/path → multi-segment, library-root-anchored, recursive
  - name     → single component anywhere in the tree

Without this pinned, a reasonable read of the column doc would be
"any path-like string works" — but a multi-segment string without a
leading slash silently never matches (the no-slash form scans path
components for exact string equality, and components are
slash-free).

No code change; just documentation.
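
For illustration only, the two entry shapes behave roughly like this sketch (hypothetical function; the real matcher lives in the indexer):

```rust
fn is_excluded(rel_path: &str, entry: &str) -> bool {
    if let Some(anchored) = entry.strip_prefix('/') {
        // /sub/path: multi-segment, library-root-anchored, recursive.
        rel_path == anchored || rel_path.starts_with(&format!("{anchored}/"))
    } else {
        // name: single component anywhere in the tree. Components are
        // slash-free, so a multi-segment no-slash entry can never match.
        rel_path.split('/').any(|c| c == entry)
    }
}

fn main() {
    assert!(is_excluded("sub/path/img.jpg", "/sub/path"));
    assert!(is_excluded("a/name/c.jpg", "name"));
    // The silent failure mode the doc pin warns about:
    assert!(!is_excluded("sub/path/img.jpg", "sub/path"));
}
```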

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:05:58 +00:00
Cameron Cordes
814066551e multi-library: per-library excluded_dirs
Adds a nullable comma-separated TEXT column to the libraries table.
Effective excludes for a walk = (env-var globals) ∪
(library.excluded_dirs). Empty / NULL = no library-specific
extras; the global env var still applies.

Migration (2026-05-01-110000_libraries_excluded_dirs)

  ALTER TABLE libraries ADD COLUMN excluded_dirs TEXT. NULL on every
  existing row — no behavior change on upgrade.

Library struct + helpers (libraries.rs)

  - Library gains excluded_dirs: Vec<String>, parsed from the column
    by parse_excluded_dirs_column (drops empties / whitespace,
    matches the env-var parser).
  - Library::effective_excluded_dirs(globals) returns the union.
  - From<LibraryRow> hydrates the field on AppState construction so
    /libraries surfaces it.
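
  A sketch of the parser and union under assumed signatures (the real
  helpers live in libraries.rs):

```rust
fn parse_excluded_dirs_column(raw: Option<&str>) -> Vec<String> {
    raw.unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty()) // drop empties / whitespace
        .map(String::from)
        .collect()
}

/// Effective excludes = env-var globals ∪ per-library extras.
fn effective_excluded_dirs(globals: &[String], per_library: &[String]) -> Vec<String> {
    let mut out: Vec<String> = globals.to_vec();
    for d in per_library {
        if !out.contains(d) {
            out.push(d.clone());
        }
    }
    out
}

fn main() {
    let lib = parse_excluded_dirs_column(Some(" .thumbs, ,/staging "));
    assert_eq!(lib, vec![".thumbs", "/staging"]);
    let eff = effective_excluded_dirs(&["tmp".to_string()], &lib);
    assert_eq!(eff.len(), 3);
}
```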

Watcher / walkers / memories

  Every per-library walker now consults the effective set:
    - process_new_files (file-watch ingest, RAW/EXIF/face)
    - process_face_backlog (filter_excluded inherits)
    - create_thumbnails (startup + new-file branch)
    - update_media_counts (Prometheus gauge)
    - cleanup_orphaned_playlists (per-library source-existence check)
    - memories endpoint (PathExcluder)

  Effective set is computed once per library per watcher tick and
  threaded through; called functions retain their flat &[String]
  signature (no per-library awareness needed inside the walker
  primitives).

Use case: mount a parent directory while a sibling library covers
a child subtree, and exclude the child subtree from the parent so
the libraries don't double-walk / double-write image_exif. With
hash-keyed derived data (Branches B/C), that duplicated walk/write
is the only cost this prevents — face / tag / insight sharing was
already correct via content_hash.

Tests: 228 pass (226 from previous + 2 new in libraries::tests:
parse_excluded_dirs_column edge cases,
effective_excluded_dirs_unions_global_and_per_library).

CLAUDE.md gains a "Per-library excludes" subsection of the
multi-library data model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:54:17 +00:00
4f17af688e Merge pull request 'multi-library: operator kill switch via libraries.enabled' (#70) from feature/library-enabled-flag into master
Reviewed-on: #70
2026-05-01 19:15:20 +00:00
Cameron Cordes
3598bb2cfe multi-library: operator kill switch via libraries.enabled
A small follow-up to Branches A/B/C. Adds a NOT NULL, default-1
boolean column to the `libraries` table that controls whether the
watcher considers the library at all. Useful for staging a new
mount before committing to ingest, and as a maintenance kill
switch when a library needs to be quiet without being unmounted.

Migration (2026-05-01-100000_libraries_enabled_flag)

  ALTER TABLE libraries ADD COLUMN enabled BOOLEAN NOT NULL DEFAULT 1.
  Existing rows stay enabled — no behavior change on upgrade.

Watcher gate (main.rs)

  At the top of the per-library loop, if !lib.enabled { continue; }
  — runs BEFORE the availability probe. Disabled libraries don't
  enter the health map, don't get probed, don't get ingest, don't
  get any maintenance pass. The initial sweep before the loop's
  first sleep also skips disabled libraries.

Orphan-GC consensus (library_maintenance.rs)

  all_libraries_online filters disabled libraries out of the
  consensus check — they're treated as out-of-scope, not as
  blockers. Otherwise flipping enabled=false would permanently
  halt orphan GC for the rest of the system, which is the opposite
  of the intended kill-switch semantics.
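
A toy version of that consensus check, with simplified stand-in types:

```rust
enum Health {
    Online,
    Stale,
}

struct Lib {
    enabled: bool,
    health: Health,
}

/// Disabled libraries are out-of-scope, not blockers: only enabled
/// libraries must be Online for the destructive passes to proceed.
fn all_libraries_online(libs: &[Lib]) -> bool {
    libs.iter()
        .filter(|l| l.enabled)
        .all(|l| matches!(l.health, Health::Online))
}

fn main() {
    let libs = [
        Lib { enabled: true, health: Health::Online },
        Lib { enabled: false, health: Health::Stale }, // doesn't block GC
    ];
    assert!(all_libraries_online(&libs));
}
```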

Cross-library duplicates: safe by construction. Hash-keyed derived
data (face_detections, tagged_photo with hash, photo_insights with
hash) is anchored by ANY image_exif row carrying the hash. Disabling
a library does NOT delete its image_exif rows, so a hash referenced
by a disabled library's row stays anchored — derived data survives.
collect_orphan_hashes deliberately doesn't filter image_exif by
library.enabled for exactly this reason.

No HTTP endpoint. Library mutation is rare-enough infra work that a
SQL toggle is fine, and a public mutation endpoint without a role /
permission story would be poorly-prioritized exposure for a
single-user tool. Documented in CLAUDE.md.

Tests: 226 pass (225 from Branch C + 1 new
all_libraries_online_treats_disabled_as_out_of_scope, which proves
that even an explicit Stale entry on a disabled library doesn't
block the consensus).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:10:24 +00:00
23448cf5e6 Merge pull request 'feature/library-handoff-and-gc' (#69) from feature/library-handoff-and-gc into master
Reviewed-on: #69
2026-05-01 18:27:40 +00:00
Cameron Cordes
d809ddee44 library_maintenance: clarify orphan-gc log wording
"marked 2 new" parses as "2 new files" on first read — but the
unit is content_hashes, and the action is observing them as
orphaned (becoming-deleted, not appearing). Reword:

  "{} new orphan hash(es) marked, {} revived"

instead of "marked {} new, revived {}". Also pluralize the deleted
counts ("row(s)") and append the pending-set size to the success
log so a tick that both deletes and re-marks doesn't lose the
trailing-state context.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:01:01 +00:00
Cameron Cordes
fa98d147be library_maintenance: log orphan-gc decisions in stale-library path too
run_orphan_gc returned early on the !all_online branch before the
final debug/info log line, so the GC was effectively invisible
whenever any library was Stale — exactly the dry-run scenario where
operators most want to confirm the safety gate is firing. Add the
same conditional log inside the early-return branch (plus a
"deferred — at least one library Stale" hint in the info-level
variant when there's something newly marked).

No behavior change beyond observability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:14:09 +00:00
Cameron Cordes
5f247be1f1 docs(claude): note in-place edit gap as future Branch D
The maintenance pipeline added in Branch C assumes (library_id,
rel_path) bytes are stable for as long as the file lives at that
path. In-place edits (crop, re-export to same name) bypass
process_new_files's already-indexed check, so the row's
content_hash stays pinned to the original bytes — tags / faces /
insights remain attached to that hash silently.

Document the gap and the proposed shape of the fix:
  - Stale-content detection pass: compare last_modified / size_bytes
    to fs::metadata, re-hash on mismatch, update image_exif.
  - "Content branched" semantics on hash change: faces re-run, tags
    migrate forward (user intent survives a crop), insights migrate
    + flag for re-generation, favorites follow path.
  - Apollo derived.db cache invalidation belongs in the same design
    cycle, not after.

Captured here so the design intent is clear before someone hits the
case in real life. No code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:53:08 +00:00
Cameron Cordes
263e27e108 multi-library: handoff + orphan GC with two-tick consensus
Branch C of the multi-library data-model rollout. Implements the
operational maintenance pipeline pinned in CLAUDE.md → "Multi-library
data model" / "Library availability and safety". Branches A and B
land first; this branch builds on top.

New module: src/library_maintenance.rs

Three idempotent passes the watcher runs every tick after the
per-library ingest loop:

1. Missing-file scan (per online library)

   For each Online library, load a paginated page of image_exif rows
   (IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE, default 500), stat() each one,
   and delete rows whose source file is NotFound. Permission/IO
   errors are skipped, never deleted. Capped at
   IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK (default 200) per library
   per tick — so a pathological mount that returns NotFound for
   everything can't wipe the table in one cycle. Cursor advances
   across ticks, wraps on partial-page returns, and naturally cycles
   through the entire library over many minutes. Skipped wholesale
   for Stale libraries via the existing probe gate.

2. Back-ref refresh (DB-only)

   For face_detections / tagged_photo / photo_insights: any
   hash-keyed row whose (library_id, rel_path) no longer matches an
   image_exif row, but whose content_hash does, is repointed at a
   surviving image_exif location. Pure SQL with EXISTS guards so
   rows whose hash is fully orphaned are left alone (the orphan GC
   handles those). Idempotent; no availability gate needed.

   This is what makes a recent → archive move invisible to readers:
   when pass 1 retires the lib-A row, pass 2 pivots tags / faces /
   insights to lib-B's surviving path before any client notices.

3. Orphan GC (destructive)

   Hash-keyed derived rows whose content_hash has no image_exif
   referent are GC-eligible. Two-tick consensus: a hash must be
   observed orphaned on two consecutive ticks AND every library must
   be Online for both. A single Stale tick within the window cancels
   all pending deletes (they remain marked but won't be promoted) —
   they're re-evaluated next tick. The pending set lives in
   OrphanGcState (in-memory); a watcher restart resets it, which can
   only delay a delete, never cause one. Hashes that re-appear in
   image_exif between ticks are "revived" from the pending set
   (handles transient share unmount / remount).
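
   One possible toy reading of the two-tick consensus, with hypothetical
   names (the real OrphanGcState tracks more than this, and the exact
   stale-tick semantics are described in prose above):

```rust
use std::collections::HashSet;

#[derive(Default)]
struct OrphanGcState {
    pending: HashSet<String>, // hashes observed orphaned last time
}

impl OrphanGcState {
    /// Returns the hashes promoted for deletion this tick.
    fn tick(&mut self, orphaned_now: &HashSet<String>, all_online: bool) -> Vec<String> {
        // Hashes that re-appeared in image_exif are revived from pending.
        self.pending.retain(|h| orphaned_now.contains(h));
        if !all_online {
            // A Stale tick promotes nothing; pending stays marked and is
            // re-evaluated on a later tick.
            return Vec::new();
        }
        // Second consecutive orphaned observation with everyone online.
        let promoted: Vec<String> = self.pending.iter().cloned().collect();
        self.pending = orphaned_now
            .iter()
            .filter(|h| !promoted.contains(h))
            .cloned()
            .collect();
        promoted
    }
}

fn main() {
    let mut gc = OrphanGcState::default();
    let orphans: HashSet<String> = ["h1".to_string()].into_iter().collect();
    assert!(gc.tick(&orphans, true).is_empty()); // tick 1: marked only
    assert!(gc.tick(&orphans, false).is_empty()); // Stale tick: held
    assert_eq!(gc.tick(&orphans, true), vec!["h1".to_string()]); // promoted
}
```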

Two new ExifDao methods:
  - list_rel_paths_for_library_page(library_id, limit, offset) for
    the paginated missing-file scan.
  - (count_for_library landed in Branch A.)

Watcher wiring (main.rs)

Per-library: missing-file scan inside the existing per-library
loop, after process_new_files, gated by the same probe check that
already protects ingest. After the loop: reconcile (Branch B),
back-ref refresh, then run_orphan_gc. The maintenance connection is
opened once per tick (image_api::database::connect), used by all
three DB-only passes, and dropped at end of tick.

CLAUDE.md gains a "Maintenance pipeline" subsection that describes
the three passes and their interaction with the existing
availability-and-safety policy.

Tests: 225 pass (217 from Branch B + 8 new in library_maintenance
covering back-ref refresh including the fully-orphaned no-op case,
two-tick GC consensus, Stale-tick consensus reset, image_exif
re-appearance revival, multi-table delete, and the
all_libraries_online helper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:27:53 +00:00
a0283a6362 Merge pull request 'multi-library: hash-keyed tagged_photo + photo_insights with reconciliation' (#68) from feature/hash-keyed-derived-data into master
Reviewed-on: #68
2026-05-01 16:16:38 +00:00
Cameron Cordes
48cac8c285 multi-library: hash-keyed tagged_photo + photo_insights with reconciliation
Branch B of the multi-library data-model rollout. tagged_photo and
photo_insights now follow the bytes (content_hash), not the path,
matching the policy pinned in CLAUDE.md "Multi-library data model".
Branch A's availability probe and EXIF scoping land first; this
branch builds on top.

Migration (2026-05-01-000000_hash_keyed_derived_data)

  Adds nullable content_hash columns to tagged_photo and photo_insights,
  with partial indexes on the non-null subset to keep the index small
  during the transitional window. The migration backfills from
  image_exif:
    * tagged_photo joins on rel_path alone (no library_id available);
    * photo_insights joins on (library_id, rel_path), unambiguous.
  Rows whose image_exif hash isn't known yet stay null and the runtime
  reconciliation pass populates them as the hash backlog drains.

Insert-time population

  TagDao::tag_file looks up image_exif.content_hash by rel_path before
  inserting; the hash is written into the new column.
  InsightDao::store_insight does the same scoped to (library_id,
  rel_path). Caller-supplied hash on InsertPhotoInsight wins; otherwise
  the DAO does the lookup. Both paths fall back to None if the hash
  isn't known yet — reconciliation backfills.

Reconciliation (database/reconcile.rs)

  Three idempotent passes the watcher runs once per tick after the
  per-library backfill loop:
    1. tagged_photo NULL hashes → populate from image_exif by rel_path.
    2. photo_insights NULL hashes → populate by (library_id, rel_path).
    3. photo_insights scalar merge — when multiple is_current rows
       share a content_hash, keep the earliest generated_at as
       current; demote the rest. Demoted rows keep their data so
       /insights/history is unaffected; only the "current" pointer
       narrows to one per hash.
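
  Pass 3's earliest-wins collapse, sketched in memory with hypothetical
  types (the real pass is pure SQL):

```rust
use std::collections::HashMap;

struct Insight {
    content_hash: &'static str,
    generated_at: i64,
    is_current: bool,
}

/// Among current rows sharing a hash, keep the earliest generated_at as
/// current; demote the rest (their data is kept for /insights/history).
/// Idempotent: re-running changes nothing.
fn collapse_current(rows: &mut [Insight]) {
    let mut earliest: HashMap<&str, i64> = HashMap::new();
    for r in rows.iter().filter(|r| r.is_current) {
        let e = earliest.entry(r.content_hash).or_insert(r.generated_at);
        if r.generated_at < *e {
            *e = r.generated_at;
        }
    }
    for r in rows.iter_mut() {
        if r.is_current {
            r.is_current = r.generated_at == earliest[r.content_hash];
        }
    }
}

fn main() {
    let mut rows = vec![
        Insight { content_hash: "h", generated_at: 20, is_current: true },
        Insight { content_hash: "h", generated_at: 10, is_current: true },
    ];
    collapse_current(&mut rows);
    assert!(!rows[0].is_current); // re-generated row demoted
    assert!(rows[1].is_current); // earliest wins
}
```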

  No filesystem dependency, so reconcile doesn't need the availability
  gate; runs every tick. Logs once when something changed, debug
  otherwise.

  Tags are set-valued under the policy (union on read, already
  DISTINCT in queries), so there is no analogous tag-collapse pass —
  duplicate (tag_id, content_hash) rows across libraries are
  harmless.

Read paths are unchanged in this branch — lookup_tags_batch's
existing rel_path-via-hash-sibling expansion still produces the
correct merge. A follow-up can simplify reads to use the new column
directly for performance.

Tests: 217 pass (212 pre-existing + 5 new in reconcile covering
NULL-fill, hash-not-yet-known no-op, library scoping on insights,
earliest-wins collapse, idempotency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:52:16 +00:00
cce8f0c1b7 Merge pull request 'feature/multi-library-data-model' (#67) from feature/multi-library-data-model into master
Reviewed-on: #67
2026-05-01 14:40:16 +00:00
Cameron Cordes
48ed7be5d9 libraries: initial availability sweep before watcher's first sleep
new_health_map seeds every library as Online, and the watcher's tick
loop sleeps WATCH_QUICK_INTERVAL_SECONDS (default 60s) before its
first probe — meaning /libraries reported the optimistic default for
up to a minute after boot, even when a share was clearly unmounted.

Run the same refresh_health pass once at the top of the watcher
thread before entering the sleep loop. /libraries is then truthful
within milliseconds of the watcher thread starting (effectively from
the first HTTP request, since the watcher spawns well before the
server binds).

The per-tick gate inside the loop is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:33:45 +00:00
Cameron Cordes
eea1bf3181 multi-library: availability probe + scoped EXIF queries + collision fixes
Branch A of the multi-library data-model rollout. Three threads of
correctness/safety work that ship together because the new mount
needs all three before it can land:

1. Library availability probe (libraries.rs, state.rs, main.rs)

   New LibraryHealth (Online | Stale { reason, since }) and a shared
   LibraryHealthMap on AppState. Probe checks root_path exists +
   is_dir + readable + non-empty (relative to a "had_data" signal so
   fresh mounts aren't downgraded). The watcher tick begins with a
   refresh_health() per library; stale libraries skip ingest, the
   hash backfill, and face-detection backlog drains for that tick.
   The orphaned-playlist cleanup also gates on every library being
   online — a missing source on a stale library is indistinguishable
   from a transient unmount, and the cleanup is destructive.

   /libraries now returns each library with its current health
   state. Logs only on Online↔Stale transitions so a long outage
   doesn't spam.

   New ExifDao::count_for_library is the "had_data" signal.

2. EXIF queries scoped by library_id (database/mod.rs, files.rs,
   main.rs, tags.rs)

   query_by_exif gains an Option<i32> library filter; /photos and
   /photos/exif now pass it. Without this, an EXIF-filtered request
   scoped to ?library=N returned cross-library results because the
   handler resolved the library but didn't push it through to SQL.

   get_exif_batch gains the same option. The watcher's per-library
   ingest, face-candidate build, and content-hash backfill all
   scope to their library; the union-mode /photos date-sort path
   and the library-agnostic tag fan-out (lookup_tags_batch, by
   design) keep using None.

3. Derivative-path collision fixes (content_hash.rs, main.rs)

   New content_hash::library_scoped_legacy_path helper:
   <derivative_dir>/<library_id>/<rel_path>. Thumbnail generation
   (startup walk + watcher needs-thumb check) and serving now use
   it; serving falls back to the bare-legacy mirrored path so
   pre-multi-library deployments keep working without
   regeneration. Without this, lib2 with the same rel_path as lib1
   would have its thumbnail request short-circuit to lib1's image.

   Orphaned-playlist cleanup walks every library when checking for
   the source video (was: BASE_PATH only). Without this, mounting
   a 2nd library and waiting 24h would delete every playlist whose
   source lived only in the 2nd library.

   The HLS playlist write path collision (filename-only basename,
   not rel_path) is left as a known issue with a TODO at the call
   site — the actor-pipeline rewrite belongs in Branch B/C.
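
   The path-scoping helper's documented layout, sketched under an assumed
   signature:

```rust
use std::path::{Path, PathBuf};

/// <derivative_dir>/<library_id>/<rel_path> keeps lib1 and lib2
/// derivatives apart even when rel_path collides across libraries.
fn library_scoped_legacy_path(derivative_dir: &Path, library_id: i32, rel_path: &str) -> PathBuf {
    derivative_dir.join(library_id.to_string()).join(rel_path)
}

fn main() {
    let p1 = library_scoped_legacy_path(Path::new("/thumbs"), 1, "2024/IMG.jpg");
    let p2 = library_scoped_legacy_path(Path::new("/thumbs"), 2, "2024/IMG.jpg");
    assert_eq!(p1, PathBuf::from("/thumbs/1/2024/IMG.jpg"));
    assert_ne!(p1, p2); // same rel_path, different library: no collision
}
```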

Tests: 212 pass (cargo test --lib). New tests cover the probe
states (online / missing root / non-dir / empty-with-prior-data),
refresh_health transitions, query_by_exif scoping, get_exif_batch
keying on (library_id, rel_path), library_scoped_legacy_path, and
count_for_library.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:12:49 +00:00
Cameron Cordes
2f91891459 docs(claude): pin multi-library data model + availability/safety policy
Adds a "Multi-library data model" section that classifies each table as
intrinsic-to-bytes (hash-keyed), user-intent-about-a-photo (hash-keyed),
or library-administrative ((library_id, rel_path)). Spells out merge
semantics on read (union for set-valued, earliest-wins for scalar),
write attribution (binds to bytes, not to current library), the
transitional-state rules for hash-less rows, library handoff behavior
on archive moves, and orphan GC.

Adds a "Library availability and safety" subsection: every watcher
tick begins with a presence probe; destructive paths (move-handoff
re-keying, orphan GC) require both/all libraries online and
confirmed-clean for two consecutive ticks. A NAS reboot, USB pull, or
VPN drop must never trigger destruction — the worst case is that
derived-data work pauses until the share returns.

The face_detections table is referenced as the existing reference
implementation of the policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:11:42 +00:00
3d162105f7 Merge pull request 'feature/edit-tag' (#66) from feature/edit-tag into master
Reviewed-on: #66
2026-05-01 01:03:40 +00:00
Cameron
98601973f7 faces: log at the three 503 paths in update_face_handler
PATCH /image/faces/{id} can return 503 from three places (face client
disabled, transient embed error, mid-flight disable) and none of them
were logging — operator sees the status code but nothing in the Rust
log explaining why. Add warn! lines at each so future bbox-edit
failures aren't silent. Response body is unchanged so existing clients
keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:57:51 -04:00
Cameron
862917b0d1 gitignore: SQLite WAL runtime + local docs/specs dirs
*.db-shm / *.db-wal show up in the working tree whenever the server
runs (the WAL/journal pragmas in connect()), and /docs and /specs
hold per-feature design notes that stay local per the project's
"spec docs not in git" convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:31:19 -04:00
Cameron
44d677528e tags: add edit + delete endpoints, enable FK enforcement
PUT /image/tags/{id} renames a tag globally; DELETE /image/tags/{id}
removes a tag and every photo's reference. Rename returns 200/404/409
(case-insensitive name conflict) / 400 (empty name); delete returns
204/404. New migration adds a UNIQUE COLLATE NOCASE index on
tags.name with a pre-flight pass that collapses existing case-
insensitive duplicates onto the lowest id.

The connection setup now sets PRAGMA foreign_keys = ON. The schema
already declares ON DELETE CASCADE / SET NULL on several tables —
those clauses were documentation-only because SQLite has FK
enforcement off per-connection by default. Audited every
diesel::delete site; each touches either no inbound FKs or has a
matching policy. delete_tag relies on the tagged_photo cascade
instead of doing manual cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 20:26:35 -04:00
89b743ba54 Merge pull request 'faces: count distinct content_hash in stats total_photos' (#65) from face-stats-dedup-hash into master
Reviewed-on: #65
2026-04-30 22:43:58 +00:00
Cameron Cordes
323097c650 faces: count distinct content_hash in stats total_photos
face_detections is keyed on content_hash (one row per unique bytes,
shared across libraries / duplicate paths) but total_photos was
COUNT(*) over image_exif rows. A file present at multiple rel_paths or
across libraries inflated the denominator without inflating the
numerator, leaving a permanent gap (e.g. 1101/1103 with nothing
actually pending detection).

Switch total_photos to COUNT(DISTINCT content_hash) so numerator and
denominator live in the same domain. Exclude rows with NULL
content_hash from the count — they're held in the hash-backfill
backlog, not the detection backlog, and counting them pins the bar
below 100% for the duration of that pass.
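
An in-memory analogue of the denominator fix — COUNT(DISTINCT content_hash) over non-NULL hashes rather than COUNT(*) over rows:

```rust
use std::collections::HashSet;

/// NULL hashes sit in the hash-backfill backlog, not the detection
/// backlog, so they are excluded from the denominator.
fn total_photos(rows: &[Option<&str>]) -> usize {
    rows.iter()
        .flatten() // drops the None (NULL content_hash) rows
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    // Same bytes at two rel_paths, plus one row awaiting hash backfill:
    let rows = [Some("aaa"), Some("aaa"), Some("bbb"), None];
    assert_eq!(total_photos(&rows), 2); // COUNT(*) would have said 4
}
```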

CLAUDE.md: document the stats domain rule next to the rest of the
face-detection notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:41:20 +00:00
d0833177c7 Merge pull request 'feature/face-stats-exclude-videos' (#64) from feature/face-stats-exclude-videos into master
Reviewed-on: #64
2026-04-30 21:17:19 +00:00
Cameron Cordes
67abd8d8ff style: cargo fmt
Pre-existing whitespace drift in test bodies, normalized by rustfmt.
No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:16:34 +00:00
Cameron Cordes
0840d55c70 faces: exclude videos from backlog drain and SCANNED denominator
list_unscanned_candidates pulled every hashed image_exif row, including
videos. filter_excluded then dropped them client-side without writing a
marker, so the same set re-appeared every watcher tick — emitting the
"backlog drain — running detection on N candidate(s)" log forever and
producing no progress.

face_stats.total_photos counted the same video rows in the denominator,
so the SCANNED percentage was structurally capped below 100%.

Add an image-extension SQL predicate (case-insensitive, sourced from
file_types::IMAGE_EXTENSIONS) and apply it to both queries. Videos
never enter the candidate set, total_photos counts only what can
actually be scanned, and 100% becomes reachable.
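
A sketch of the case-insensitive extension predicate, with a stand-in extension list (the real one is built into SQL from file_types::IMAGE_EXTENSIONS):

```rust
// Stand-in subset for illustration; not the project's actual list.
const IMAGE_EXTENSIONS: &[&str] = &["jpg", "jpeg", "png", "nef"];

fn is_image(rel_path: &str) -> bool {
    rel_path
        .rsplit_once('.')
        .map(|(_, ext)| IMAGE_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}

fn main() {
    assert!(is_image("2024/IMG_0001.NEF")); // case-insensitive
    assert!(!is_image("2024/clip.mp4")); // videos never enter the candidate set
    assert!(!is_image("noext"));
}
```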

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:16:30 +00:00
dbb046dfa8 Merge pull request 'indexer: prune EXCLUDED_DIRS at WalkDir time, extract enumerate_indexable_files' (#63) from feature/exclude-dirs-at-index-time into master
Reviewed-on: #63
2026-04-30 20:24:18 +00:00
36 changed files with 5341 additions and 112 deletions

4
.gitignore vendored

@@ -2,8 +2,12 @@
 database/target
 *.db
 *.db.bak
+*.db-shm
+*.db-wal
 .env
 /tmp
+/docs
+/specs
 # Default ignored files
 .idea/shelf/

240
CLAUDE.md

@@ -104,6 +104,242 @@ All database access goes through trait-based DAOs (e.g., `ExifDao`, `SqliteExifD
- `query_by_exif()`: Complex filtering by camera, GPS bounds, date ranges
- Batch operations minimize DB hits during file watching
### Multi-library data model
ImageApi supports more than one library (a library = a `(name, root_path)`
row in the `libraries` table that maps to a mounted directory tree). The
same bytes may exist under more than one library — typical case is an
"active" library plus an "archive" library that ingests files as they age
out — and the data model is designed so that derived data follows the
**bytes**, not the path, while user-managed data does the same.
**The principle.** A photo's identity is its `content_hash` (blake3, see
`src/content_hash.rs`). Anything we compute from or attach to a photo is
keyed on that hash so it survives:
- the same file appearing in a second library (backup / archive / mirror),
- the file moving between libraries (recent → archive handoff),
- the file moving within a library (re-organized rel_path),
- intra-library duplicates (same bytes at two paths).
**Table classification.** Three categories drive the keying decision:
| Category | Key | Rationale | Tables |
|---|---|---|---|
| Intrinsic to bytes | `content_hash` | Rerunning is wasted work (or LLM cost) | `face_detections` ✓, `image_exif` (target), `photo_insights` (target), `video_preview_clips` (target) |
| User intent about a photo | `content_hash` | "Tag this photo" means the bytes, not a path | `tagged_photo` (target), `favorites` (target) |
| Library administrative | `(library_id, rel_path)` | Tied to a specific filesystem location | `libraries`, `entity_photo_links`, the `rel_path` back-ref columns on hash-keyed tables |
✓ = already implemented this way. *(target)* = today still keyed on
`(library_id, rel_path)` and slated for migration. The migration adds a
nullable `content_hash` column, populates it from `image_exif` where
known, and read paths fall back to rel_path while the hash is null.
**Carrying a `rel_path` even when hash-keyed.** Hash-keyed tables retain
`(library_id, rel_path)` columns as a denormalized **back-reference**, not
as the key. This lets a single query answer "what is at this path right
now" without joining through `image_exif`, and supports the path-only
endpoints that predate the hash. `face_detections` is the reference
implementation: hash is the truth, path is a hint.
**Merge semantics on read.** When the same hash has rows under more than
one library:
- Set-valued data (tags, favorites, faces, entity links) → **union**.
- Scalar data (current insight, EXIF row, video preview clip) → earliest
`generated_at` / `created_time` wins. The historical lib1 row beats a
re-generated lib2 row, so the user's curated insight isn't shadowed by
a re-run on archive ingest.
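A minimal sketch of these merge rules in Rust (row shapes and names are illustrative, not the actual DAO types):

```rust
use std::collections::BTreeSet;

// Illustrative row shape, not the actual DAO struct.
#[derive(Clone, Debug)]
pub struct InsightRow {
    pub library_id: i32,
    pub generated_at: i64, // unix seconds
    pub summary: String,
}

/// Merge rows that share one content_hash across libraries:
/// set-valued data unions; scalar data resolves to the earliest
/// `generated_at`, so a curated insight is never shadowed by a re-run.
pub fn merge_for_hash(
    tags_by_lib: &[(i32, Vec<String>)],
    insights: &[InsightRow],
) -> (BTreeSet<String>, Option<InsightRow>) {
    let tags = tags_by_lib
        .iter()
        .flat_map(|(_, t)| t.iter().cloned())
        .collect();
    let insight = insights.iter().min_by_key(|r| r.generated_at).cloned();
    (tags, insight)
}
```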
**Write attribution.** A new tag/favorite/insight created while viewing
under lib2 binds to the bytes, not to lib2 — so it shows up under lib1
too. This is by design, but it's the most surprising rule on first
encounter; clients should not assume tags are library-scoped.
**Hash-less rows (transitional state).** During and immediately after a
new mount, `image_exif.content_hash` is being populated by
`backfill_unhashed_backlog` (capped per tick). Rules during this window:
- Writes: if the hash is known, write hash-keyed. If not, write
`(library_id, rel_path)`-keyed and let the reconciliation job collapse
duplicates once the hash lands.
- Reads: prefer hash key, fall back to `(library_id, rel_path)`.
- Reconciliation: a one-shot pass after every backfill tick collapses
rows that now share a hash, applying the merge semantics above.
Idempotent — safe to re-run.
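The read preference can be sketched as a key-selection helper (names here are hypothetical):

```rust
/// Illustrative key selection for the transitional window: prefer the
/// content hash when it's known, otherwise fall back to the path key.
#[derive(Debug, PartialEq)]
pub enum RowKey {
    Hash(String),
    Path { library_id: i32, rel_path: String },
}

pub fn key_for(content_hash: Option<&str>, library_id: i32, rel_path: &str) -> RowKey {
    match content_hash {
        Some(h) => RowKey::Hash(h.to_string()),
        None => RowKey::Path {
            library_id,
            rel_path: rel_path.to_string(),
        },
    }
}
```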
**Library handoff (recent → archive).** When a file moves between
libraries (e.g. operator moves `~/photos/2024/IMG.nef` to the archive
mount), the file watcher sees the disappearance under lib1 and the
appearance under lib2. Hash-keyed rows don't need migration; the
`(library_id, rel_path)` back-ref columns are updated to point to the new
location. Library administrative rows (`entity_photo_links`,
`(library_id, rel_path)` rows in `image_exif` for hash-less items) are
re-keyed by the move detector, which matches a disappearance to an
appearance by `content_hash` within a configurable window.
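A sketch of the window match, assuming simplified event types (the real detector works off watcher events and DB rows):

```rust
use std::collections::HashMap;

// Hypothetical event shapes for the move detector; field names are
// assumptions for illustration.
pub struct Disappearance { pub content_hash: String, pub at: i64 }
pub struct Appearance {
    pub content_hash: String,
    pub library_id: i32,
    pub rel_path: String,
    pub at: i64,
}

/// Pair each disappearance with an appearance sharing its content_hash
/// within `window_secs`. Unmatched disappearances are
/// "unavailable-or-deleted, defer judgment": no re-key, no GC.
pub fn match_moves(
    gone: &[Disappearance],
    seen: &[Appearance],
    window_secs: i64,
) -> Vec<(String, i32, String)> {
    let by_hash: HashMap<&str, &Appearance> =
        seen.iter().map(|a| (a.content_hash.as_str(), a)).collect();
    gone.iter()
        .filter_map(|d| {
            by_hash.get(d.content_hash.as_str()).and_then(|a| {
                if (a.at - d.at).abs() <= window_secs {
                    Some((d.content_hash.clone(), a.library_id, a.rel_path.clone()))
                } else {
                    None
                }
            })
        })
        .collect()
}
```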
**Orphans (source deleted while a copy survives).** When the only
`image_exif` row for a hash is deleted (file removed from disk), the
hash-keyed derived rows survive **as long as another `image_exif` row
references the same hash**. If the last reference is gone, derived rows
are eligible for GC (deferred — the GC job runs on a slow schedule so
that a brief unmount or rename doesn't wipe history).
**Stats and counts.** When reporting "how many photos do you have," count
`DISTINCT content_hash` over `image_exif`, not row count. Faces stats
already does this (`FaceDao::stats` in `src/faces.rs`); other counters
should follow suit. Numerator and denominator must live in the same
domain — see the face-stats commentary below for the cautionary tale.
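The counting rule, sketched in Rust (the real counters run the equivalent SQL):

```rust
use std::collections::HashSet;

/// Count photos as distinct content hashes, not image_exif rows; the
/// same JPEG at two rel_paths (or in two libraries) is one photo.
/// SQL equivalent: SELECT COUNT(DISTINCT content_hash) FROM image_exif.
pub fn distinct_photo_count<'a>(hashes: impl IntoIterator<Item = &'a str>) -> usize {
    hashes.into_iter().collect::<HashSet<_>>().len()
}
```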
**Per-library scoping when the user asks for it.** A request scoped to
`?library=N` filters the `image_exif` view to that library, and the
hash-keyed derived data is joined through that view. The user sees only
photos that have a copy under lib N, but the derived data attached to
those photos is the merged hash-keyed view. This is the answer to "show
me archive photos with their original tags."
**Operator kill switch (`libraries.enabled`).** Setting `enabled=0` on a
library is a hard pause: the watcher skips it entirely — before the
probe, before ingest, before any maintenance pass — and the orphan-GC
all-online consensus check filters disabled libraries out (they don't
keep the GC window closed). Reads / serving are unaffected; nothing
prevents `/image?path=...` from resolving against a disabled library's
root if the file is on disk. The existing `image_exif` rows for a
disabled library are **not deleted** — they continue to anchor
hash-keyed derived data, so cross-library duplicates survive the
disable. Toggle via SQL; there is intentionally no HTTP endpoint for
library mutation (single-user tool, no role / permission story).
Typical workflows: stage a new mount with `enabled=0` then flip to `1`;
quiet a flaky NAS during maintenance without disturbing the rest of
the system.
**Per-library excludes (`libraries.excluded_dirs`).** A
comma-separated column, same shape as the global `EXCLUDED_DIRS` env
var, that's applied **in union** with the env-var globals when a
walker scans this library. Use case: mount a parent directory as a
new library while a sibling library covers a child subtree, and
exclude that child subtree from the parent so the two libraries
don't double-walk and double-write `image_exif`. Two entry forms
(parsed by `memories::PathExcluder`):
- `/sub/path` — leading slash flags it as a path under the library
root. Joins to root + matches by `path.starts_with(...)`. Works
at any depth (`/photos`, `/media/2024/raw`).
- `name` — no leading slash flags it as a component name to skip
anywhere in the tree (`@eaDir`, `.thumbnails`). Single segment
only — `media/photos/a` without a leading slash never matches
anything.
Hash-keyed derived data (faces, tags, insights) is unaffected either
way — those follow the bytes — but `image_exif` row count, walker CPU,
and thumbnail disk usage all drop to 1× instead of 2× for the overlap.
Affects: file-watch ingest (`process_new_files`), thumbnail
generation, media-count gauges, the orphaned-playlist cleanup walk,
and the `/memories` endpoint. The face-detection backlog drain
inherits via `face_watch::filter_excluded`. NULL = no extras (only
the global env var applies).
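The two entry forms can be sketched as follows; this mirrors the documented semantics, not `memories::PathExcluder`'s actual API:

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Sketch of the two exclude-entry forms: `/sub/path` prefix-matches
/// under the library root; bare `name` skips that component anywhere;
/// a multi-segment entry without a leading slash matches nothing.
pub fn is_excluded(root: &Path, abs: &Path, entries: &[&str]) -> bool {
    entries.iter().any(|e| {
        if let Some(rel) = e.strip_prefix('/') {
            // Path form: join to the library root, prefix-match.
            abs.starts_with(root.join(rel))
        } else if e.contains('/') {
            // Multi-segment without a leading slash never matches.
            false
        } else {
            // Name form: skip this component anywhere in the tree.
            abs.components().any(|c| c.as_os_str() == OsStr::new(*e))
        }
    })
}
```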
**Library availability and safety.** Libraries can be on network shares
or removable media; the file watcher must not interpret a temporary
unavailability as a mass-deletion event. Every tick begins with a
**presence probe** per library: the library is considered online iff
its `root_path` exists, is readable, and a top-level scan returns at
least one expected entry (or matches a recent file-count high-water
mark within a tolerance). The probe result gates which actions are safe
to run on that library this tick:
| Action | Requires online? |
|---|---|
| Quick / full scan ingest of new files | yes |
| EXIF / face / insight backlog drains | yes — but the work runs against any online library |
| Move-handoff detection (lib1 disappearance ↔ lib2 appearance match) | **both** libraries online |
| `(library_id, rel_path)` re-keying on detected move | **both** libraries online |
| Orphan GC of hash-keyed derived data | all libraries that have *ever* held the hash must be online and confirmed-clean for two consecutive ticks |
| Reads / serving | always allowed; falls back to whichever library is online |
A library that fails the probe enters a "stale" state: writes scoped to
it are paused, its rows are flagged stale (not deleted) in
`/libraries` status, and the watcher logs at `warn` once per
state-transition (not per tick). A library that recovers re-enters the
online set automatically; no operator action required for transient
outages. The intent is that pulling a USB drive, rebooting a NAS, or
losing a VPN never triggers a destructive code path — the worst case is
that derived-data work pauses until the share returns.
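A minimal sketch of the probe itself, omitting the file-count high-water-mark tolerance:

```rust
use std::fs;
use std::path::Path;

/// Per-tick presence probe sketch: a library counts as online only if
/// its root exists, is readable, and a top-level scan returns at least
/// one entry. Anything else gates destructive passes off this tick.
pub fn probe_online(root_path: &Path) -> bool {
    match fs::read_dir(root_path) {
        Ok(mut entries) => entries.next().is_some(),
        // NotFound and permission errors both mean "not safe to treat
        // observations from this library as authoritative".
        Err(_) => false,
    }
}
```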
The same rule constrains the move-handoff matcher: a disappearance
under lib1 only counts as a "move" if there is a matching appearance
under another **online** library within the window. A bare
disappearance with no matching appearance is treated as
"unavailable-or-deleted, defer judgment" — it does not re-key any rows
and does not enqueue GC.
**Maintenance pipeline (`src/library_maintenance.rs`).** The watcher
runs three maintenance passes per tick that together implement the
move/handoff and orphan rules:
1. **Missing-file scan** — per online library, paginated. A page of
`image_exif` rows is loaded (`IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE`,
default 500), each row's `(root_path/rel_path)` is `stat()`-ed,
and confirmed-not-found rows are deleted from `image_exif`
(capped at `IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK`, default 200).
Permission/IO errors are skipped, never deleted — only `NotFound`
triggers a deletion. The cursor wraps every time a partial page
comes back, so the whole library is swept across consecutive ticks.
Skipped wholesale for Stale libraries via the per-library probe
gate at the top of the loop iteration.
2. **Back-ref refresh** — DB-only. For `face_detections`,
`tagged_photo`, and `photo_insights`: any hash-keyed row whose
`(library_id, rel_path)` no longer matches an `image_exif` row
*but whose `content_hash` does* is repointed at the surviving
`image_exif` location. Idempotent SQL; no health gate needed.
This is what makes the recent → archive handoff invisible to
read paths: when the missing-file scan retires the lib-A row,
tags/faces/insights pivot to lib-B's path before any user
notices.
3. **Orphan GC** — destructive. Hash-keyed derived rows whose
`content_hash` no longer has any `image_exif` row are eligible.
Two-tick consensus: a hash must be observed orphaned on two
consecutive ticks AND every library must be online for both. A
single Stale tick within the window cancels all pending deletes.
The pending set is held in memory (`OrphanGcState`) — restart
resets it, which only delays a delete, never causes one. Tags,
faces, and insights for orphaned hashes are deleted in one batch
per tick.
A backup library that briefly disappears, then returns within two
ticks, never loses any derived data. A move from lib-A to lib-B
without disappearance flips through pass 1 (lib-A row retired) and
pass 2 (back-refs follow), with pass 3 noting nothing because the
hash is still present in `image_exif` (lib-B's row).
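The consensus rule can be sketched as a small state machine; the struct name and layout here are illustrative, not the actual `OrphanGcState`:

```rust
use std::collections::HashSet;

/// Two-tick consensus: a hash is released for deletion only when it
/// was observed orphaned on two consecutive ticks AND every library
/// was online for both.
#[derive(Default)]
pub struct GcConsensus {
    pending: HashSet<String>,
}

impl GcConsensus {
    /// Returns the hashes whose derived rows may be deleted this tick.
    pub fn tick(&mut self, orphaned_now: &HashSet<String>, all_online: bool) -> Vec<String> {
        if !all_online {
            // A single stale tick cancels all pending deletes.
            self.pending.clear();
            return Vec::new();
        }
        let ready: Vec<String> =
            self.pending.intersection(orphaned_now).cloned().collect();
        // Deleted hashes drop out naturally next tick: once their
        // derived rows are gone they are no longer observed orphaned.
        self.pending = orphaned_now.clone();
        ready
    }
}
```

Restarting the process resets `pending`, which only ever delays a delete, never causes one.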
**Known gap: in-place content changes (future Branch D).** The
maintenance pipeline assumes a `(library_id, rel_path)`'s bytes are
stable for as long as the file exists at that path. If a user edits
a file in place (crop, re-export) without renaming, the watcher's
quick scan walks the file (mtime is recent) but `process_new_files`
short-circuits because `(library_id, rel_path)` already has an
`image_exif` row — no re-hash, no re-EXIF, no face redetection. The
row's `content_hash` keeps pointing at the original bytes. Tags /
faces / insights stay attached to the original hash and continue to
display because the rel_path back-ref still resolves; new faces
introduced by the edit are never detected.
The right place to fix this is a **stale-content detection pass**
that compares `image_exif.last_modified` / `size_bytes` to
`fs::metadata` for rows the quick scan would otherwise skip. On
mismatch, recompute the hash, update `image_exif`, and apply the
"content branched" semantics:
- **Faces** re-run (faces are fully derived from bytes).
- **Tags** migrate to the new hash (user intent — "this photo is
vacation" survives a crop). Insights migrate forward as a
starting point and are flagged for re-generation.
- **Favorites** (when migrated to hash-keyed) follow the path /
user intent.
The interesting case is the operator who keeps an unedited copy in
the archive library and edits the local copy: post-detection, the
archive copy stays on the original hash, the local copy branches to
the new hash, and the two histories cleanly split. Apollo's
`derived.db` cache will need an invalidation hook for the changed
hash — design it alongside Branch D.
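The proposed check could start as a metadata comparison like this (field names are assumptions, not the actual schema):

```rust
use std::fs;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Sketch of the proposed Branch D check: flag a row whose on-disk
/// metadata no longer matches what image_exif recorded, as a cheap
/// trigger for re-hashing. Recorded values are hypothetical columns.
pub fn content_possibly_changed(
    abs_path: &Path,
    recorded_size: u64,
    recorded_mtime_secs: u64,
) -> std::io::Result<bool> {
    let meta = fs::metadata(abs_path)?;
    let mtime = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    Ok(meta.len() != recorded_size || mtime != recorded_mtime_secs)
}
```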
### File Processing Pipeline
**Thumbnail Generation:**
@@ -219,7 +455,7 @@ ImageApi owns the face data; Apollo (sibling repo) hosts the insightface inferen
- `persons(id, name UNIQUE COLLATE NOCASE, cover_face_id, entity_id, created_from_tag, notes, ...)` — operator-managed, name is the user-visible identity.
- `face_detections(id, library_id, content_hash, rel_path, bbox_*, embedding BLOB, confidence, source, person_id, status, model_version, ...)` — keyed on `content_hash` so a photo duplicated across libraries is detected once. Marker rows for `status IN ('no_faces','failed')` carry NULL bbox/embedding (CHECK constraint enforces this).
**Why content_hash and not (library_id, rel_path):** ties face data to the bytes, not the path. A backup mount that copies files from the primary library naturally inherits the existing detections without re-running inference. This is the reference implementation of the multi-library data model — see "Multi-library data model" above.
**File-watch hook** (`src/main.rs::process_new_files`): for each photo with a populated `content_hash`, check `FaceDao::already_scanned(hash)`; if not, send bytes (or embedded JPEG preview for RAW via `exif::extract_embedded_jpeg_preview`) to Apollo's `/api/internal/faces/detect`. K=`FACE_DETECT_CONCURRENCY` (default 8) parallel calls per scan tick; Apollo serializes them via its single-worker GPU pool. `face_watch.rs` is the Tokio orchestration layer.
@@ -233,6 +469,8 @@ ImageApi owns the face data; Apollo (sibling repo) hosts the insightface inferen
**Rerun preserves manual rows** (`POST /image/faces/{id}/rerun`): only `source='auto'` rows are deleted before re-running detection. `already_scanned` returns true on ANY row, so a photo whose only faces are manually drawn never auto-redetects.
**Stats domain — content_hash, not file rows** (`FaceDao::stats` in `src/faces.rs`): `total_photos` counts `DISTINCT content_hash` over `image_exif` (filtered to image extensions, `content_hash IS NOT NULL`), and so do `scanned` / `with_faces` / `no_faces` / `failed` over `face_detections`. Numerator and denominator must live in the same domain — `face_detections` is keyed on content_hash, so the same JPEG present at two rel_paths or in two libraries scans once. Counting `image_exif` rows in the denominator inflated total by one per duplicate file and produced a permanent gap (e.g. 1101/1103 with nothing actually pending). Hash-less rows are excluded from total_photos while they sit in the `backfill_unhashed_backlog` queue; otherwise the bar pins below 100% for the duration of that backfill even though those rows aren't pending detection yet — they're pending hashing.
Module map:
- `src/faces.rs``FaceDao` trait + `SqliteFaceDao` impl, route handlers for `/faces/*`, `/image/faces/*`, `/persons/*`. Mirror of `tags.rs` layout.
- `src/face_watch.rs` — Tokio orchestration for the file-watch detect pass; `filter_excluded` (PathExcluder + image-extension filter), `read_image_bytes_for_detect` (RAW preview fallback).

Cargo.lock generated

@@ -600,6 +600,16 @@ version = "2.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6099cdc01846bc367c4e7dd630dc5966dccf36b652fae7a74e17b640411a91b2"
[[package]]
name = "bk-tree"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8283fb8e64b873918f8bc527efa6aff34956296e48ea750a9c909cd47c01546"
dependencies = [
"fnv",
"triple_accel",
]
[[package]]
name = "blake3"
version = "1.8.4"
@@ -1928,6 +1938,7 @@ dependencies = [
"async-trait",
"base64",
"bcrypt",
"bk-tree",
"blake3",
"bytes",
"chrono",
@@ -1939,6 +1950,7 @@ dependencies = [
"futures",
"ical",
"image",
"image_hasher",
"indicatif",
"infer",
"jsonwebtoken",
@@ -1978,6 +1990,19 @@ dependencies = [
"quick-error",
]
[[package]]
name = "image_hasher"
version = "3.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dd266c66b0a0e2d4c6db8e710663fc163a2d33595ce997b6fbda407c8759d344"
dependencies = [
"base64",
"image",
"rustdct",
"serde",
"transpose",
]
[[package]]
name = "imgref"
version = "1.11.0"
@@ -2438,6 +2463,15 @@ dependencies = [
"num-traits",
]
[[package]]
name = "num-complex"
version = "0.4.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495"
dependencies = [
"num-traits",
]
[[package]]
name = "num-conv"
version = "0.1.0"
@@ -2907,6 +2941,15 @@ version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "925383efa346730478fb4838dbe9137d2a47675ad789c546d150a6e1dd4ab31c"
[[package]]
name = "primal-check"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dc0d895b311e3af9902528fbb8f928688abbd95872819320517cc24ca6b2bd08"
dependencies = [
"num-integer",
]
[[package]]
name = "proc-macro2"
version = "1.0.101"
@@ -3286,6 +3329,29 @@ dependencies = [
"semver",
]
[[package]]
name = "rustdct"
version = "0.7.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b61555105d6a9bf98797c063c362a1d24ed8ab0431655e38f1cf51e52089551"
dependencies = [
"rustfft",
]
[[package]]
name = "rustfft"
version = "6.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "21db5f9893e91f41798c88680037dba611ca6674703c1a18601b01a72c8adb89"
dependencies = [
"num-complex",
"num-integer",
"num-traits",
"primal-check",
"strength_reduce",
"transpose",
]
[[package]]
name = "rustix"
version = "1.0.8"
@@ -3624,6 +3690,12 @@ version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8f112729512f8e442d81f95a8a7ddf2b7c6b8a1a6f509a95864142b30cab2d3"
[[package]]
name = "strength_reduce"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fe895eb47f22e2ddd4dabc02bce419d2e643c8e3b585c78158b349195bc24d82"
[[package]]
name = "strfmt"
version = "0.2.5"
@@ -4122,6 +4194,22 @@ dependencies = [
"once_cell",
]
[[package]]
name = "transpose"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1ad61aed86bc3faea4300c7aee358b4c6d0c8d6ccc36524c96e4c92ccf26e77e"
dependencies = [
"num-integer",
"strength_reduce",
]
[[package]]
name = "triple_accel"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "622b09ce2fe2df4618636fb92176d205662f59803f39e70d1c333393082de96c"
[[package]]
name = "try-lock"
version = "0.2.5"


@@ -59,5 +59,7 @@ ical = "0.11"
scraper = "0.20"
base64 = "0.22"
blake3 = "1.5"
image_hasher = "3.0"
bk-tree = "0.5"
async-trait = "0.1"
indicatif = "0.17"


@@ -0,0 +1 @@
DROP INDEX IF EXISTS idx_tags_name_nocase;


@@ -0,0 +1,28 @@
-- Tags previously enforced uniqueness only in application code (the
-- add_tag handler looks up by name before inserting). The schema itself
-- accepted duplicates,
-- so a divergent code path could land two tags with the same name. Now
-- that we expose a rename endpoint we want a hard guarantee: case-
-- insensitive UNIQUE on tags.name.
-- Pre-flight: collapse exact-name duplicates (case-insensitive) onto the
-- lowest-id row before adding the constraint, otherwise the index
-- creation fails on any DB that ever produced dupes. On a clean DB this
-- is a no-op.
UPDATE tagged_photo
SET tag_id = (
SELECT MIN(t2.id) FROM tags t2
WHERE LOWER(t2.name) = LOWER((SELECT name FROM tags WHERE id = tagged_photo.tag_id))
)
WHERE tag_id IN (
SELECT t.id FROM tags t
WHERE t.id <> (
SELECT MIN(t2.id) FROM tags t2 WHERE LOWER(t2.name) = LOWER(t.name)
)
);
DELETE FROM tags
WHERE id <> (
SELECT MIN(t2.id) FROM tags t2 WHERE LOWER(t2.name) = LOWER(tags.name)
);
CREATE UNIQUE INDEX idx_tags_name_nocase ON tags (name COLLATE NOCASE);


@@ -0,0 +1,5 @@
DROP INDEX IF EXISTS idx_photo_insights_content_hash;
ALTER TABLE photo_insights DROP COLUMN content_hash;
DROP INDEX IF EXISTS idx_tagged_photo_content_hash;
ALTER TABLE tagged_photo DROP COLUMN content_hash;


@@ -0,0 +1,64 @@
-- Phase B of the multi-library data-model rollout: add a nullable
-- `content_hash` column to derived/user-intent tables that should follow
-- the bytes rather than the path. Reads will prefer hash-key joins and
-- fall back to rel_path while the column is null. A separate
-- reconciliation pass collapses duplicates as the column populates.
--
-- See CLAUDE.md → "Multi-library data model" for the policy. The
-- reference implementation is `face_detections`, which has been
-- hash-keyed since it was introduced.
--
-- Tables in this migration:
-- * tagged_photo — user-intent (tags follow the bytes)
-- * photo_insights — intrinsic to bytes (LLM-generated description)
--
-- favorites is the natural third candidate but its DAO is barely used in
-- v1 and the row count is tiny; deferring lets this migration stay
-- focused on the high-volume tables that drive cross-library overhead.
-- ---------------------------------------------------------------------------
-- tagged_photo
-- ---------------------------------------------------------------------------
ALTER TABLE tagged_photo ADD COLUMN content_hash TEXT;
-- Backfill: for each tagged_photo row, find the content_hash for its
-- rel_path. tagged_photo doesn't carry a library_id, so a rel_path that
-- exists under multiple libraries with different content is genuinely
-- ambiguous — we take the first matching image_exif row. The
-- reconciliation pass at runtime cleans up any rows that resolve
-- differently once a hash is known per library.
UPDATE tagged_photo
SET content_hash = (
SELECT content_hash FROM image_exif
WHERE image_exif.rel_path = tagged_photo.rel_path
AND image_exif.content_hash IS NOT NULL
LIMIT 1
)
WHERE content_hash IS NULL;
-- Hash-key index. Partial (only non-null rows) to keep the index small
-- during the transitional window where most rows are still null.
CREATE INDEX idx_tagged_photo_content_hash
ON tagged_photo (content_hash)
WHERE content_hash IS NOT NULL;
-- ---------------------------------------------------------------------------
-- photo_insights
-- ---------------------------------------------------------------------------
ALTER TABLE photo_insights ADD COLUMN content_hash TEXT;
-- Backfill keyed on (library_id, rel_path) — photo_insights already
-- carries library_id, so the resolution is unambiguous.
UPDATE photo_insights
SET content_hash = (
SELECT content_hash FROM image_exif
WHERE image_exif.library_id = photo_insights.library_id
AND image_exif.rel_path = photo_insights.rel_path
AND image_exif.content_hash IS NOT NULL
LIMIT 1
)
WHERE content_hash IS NULL;
CREATE INDEX idx_photo_insights_content_hash
ON photo_insights (content_hash)
WHERE content_hash IS NOT NULL;


@@ -0,0 +1,2 @@
-- Requires SQLite 3.35+ for ALTER TABLE DROP COLUMN.
ALTER TABLE libraries DROP COLUMN enabled;


@@ -0,0 +1,14 @@
-- Operator-controlled kill switch for a library. When `enabled = 0` the
-- watcher tick skips that library entirely — before the availability
-- probe, before ingest, before any maintenance pass — and the orphan-GC
-- all-online check treats it as out-of-scope rather than as a blocker.
--
-- The intended workflow is staging a new mount: insert with enabled=0,
-- verify the row appears in /libraries with enabled=false, then UPDATE
-- to 1 to start ingest. Same toggle works as a maintenance kill switch
-- after the fact ("don't keep probing this NAS while I'm rebooting it").
--
-- Default 1 so every existing library stays running on upgrade — no
-- behavior change without an explicit flip.
ALTER TABLE libraries ADD COLUMN enabled BOOLEAN NOT NULL DEFAULT 1;


@@ -0,0 +1,2 @@
-- Requires SQLite 3.35+ for ALTER TABLE DROP COLUMN.
ALTER TABLE libraries DROP COLUMN excluded_dirs;


@@ -0,0 +1,14 @@
-- Per-library excluded directories.
--
-- The global EXCLUDED_DIRS env var is the right knob for excludes that
-- every library shares (Synology @eaDir, .thumbnails, etc.). It's a
-- poor fit for "exclude this subtree from THIS library only", whose
-- natural use case is mounting a parent directory while another
-- library already covers a child subtree underneath.
--
-- This column is parsed comma-separated, same shape as the env var,
-- and the watcher / memories / thumbnail walks each apply the union
-- (env_globals ∪ library.excluded_dirs) when scanning the library.
-- NULL = no extra excludes; the global env var still applies.
ALTER TABLE libraries ADD COLUMN excluded_dirs TEXT;


@@ -0,0 +1,8 @@
DROP INDEX IF EXISTS idx_image_exif_duplicate_of_hash;
DROP INDEX IF EXISTS idx_image_exif_dhash;
DROP INDEX IF EXISTS idx_image_exif_phash;
ALTER TABLE image_exif DROP COLUMN duplicate_decided_at;
ALTER TABLE image_exif DROP COLUMN duplicate_of_hash;
ALTER TABLE image_exif DROP COLUMN dhash_64;
ALTER TABLE image_exif DROP COLUMN phash_64;


@@ -0,0 +1,41 @@
-- Adds perceptual-hash signals + soft-mark resolution state to image_exif so
-- the duplicates surface in Apollo can group near-duplicates (re-encoded,
-- resized, format-converted copies) and let the user demote losers without
-- touching the file on disk. Image-only for v1: phash_64/dhash_64 are NULL
-- on videos and on images that fail to decode. See Apollo CLAUDE.md →
-- Duplicate detection / Caching layer for the policy.
--
-- Soft-mark columns are media-type-agnostic — when video perceptual hashing
-- arrives, it lives in a separate hash-keyed companion table and reuses the
-- same duplicate_of_hash / duplicate_decided_at machinery.
-- pHash (DCT, 64-bit) packed as i64 for fast XOR + popcount Hamming.
ALTER TABLE image_exif ADD COLUMN phash_64 BIGINT;
-- dHash (gradient, 64-bit). Cheap, robust to compression/resize. Stored
-- alongside pHash so the query layer can fall back if either is null.
ALTER TABLE image_exif ADD COLUMN dhash_64 BIGINT;
-- When non-null, this row is a soft-marked duplicate of the row whose
-- content_hash matches. The duplicate file stays on disk; the default
-- /photos listing filters it out. /photos?include_duplicates=true opts
-- back in (the Apollo duplicates modal uses this).
ALTER TABLE image_exif ADD COLUMN duplicate_of_hash TEXT;
-- Unix seconds of the resolve. Distinguishes "never reviewed" from
-- "reviewed and resolved" for the Apollo include_resolved toggle.
ALTER TABLE image_exif ADD COLUMN duplicate_decided_at BIGINT;
-- Partial indexes — the columns are NULL for the vast majority of rows
-- during the transitional window and forever for videos / decode failures.
CREATE INDEX idx_image_exif_phash
ON image_exif (phash_64)
WHERE phash_64 IS NOT NULL;
CREATE INDEX idx_image_exif_dhash
ON image_exif (dhash_64)
WHERE dhash_64 IS NOT NULL;
CREATE INDEX idx_image_exif_duplicate_of_hash
ON image_exif (duplicate_of_hash)
WHERE duplicate_of_hash IS NOT NULL;


@@ -383,7 +383,10 @@ mod tests {
// body cap and rejected normal-size photos before they reached
// the backend.
assert!(is_transient(&classify_error_response(408, "")));
assert!(is_transient(&classify_error_response(
413,
"<html>nginx</html>"
)));
assert!(is_transient(&classify_error_response(429, "{}")));
}


@@ -521,6 +521,7 @@ impl InsightChatService {
training_messages: Some(json),
backend: effective_backend.clone(),
fewshot_source_ids: None,
content_hash: None,
};
let cx = opentelemetry::Context::new();
let mut dao = self.insight_dao.lock().expect("Unable to lock InsightDao");
@@ -983,6 +984,7 @@ impl InsightChatService {
training_messages: Some(json),
backend: effective_backend.clone(),
fewshot_source_ids: None,
content_hash: None,
};
let cx = opentelemetry::Context::new();
let mut dao = self.insight_dao.lock().expect("Unable to lock InsightDao");


@@ -1255,7 +1255,9 @@ impl InsightGenerator {
.span()
.set_attribute(KeyValue::new("summary_length", summary.len() as i64));
// 11. Store in database. content_hash is None here — store_insight
// looks it up from image_exif before persisting; reconciliation
// backfills if the hash isn't known yet.
let insight = InsertPhotoInsight {
library_id: crate::libraries::PRIMARY_LIBRARY_ID,
file_path: file_path.to_string(),
@@ -1267,6 +1269,7 @@ impl InsightGenerator {
training_messages: None,
backend: "local".to_string(),
fewshot_source_ids: None,
content_hash: None,
};
let mut dao = self.insight_dao.lock().expect("Unable to lock InsightDao");
@@ -3530,6 +3533,7 @@ Return ONLY the summary, nothing else."#,
training_messages,
backend: backend_label.clone(),
fewshot_source_ids: fewshot_source_ids_json,
content_hash: None,
};
let stored = {


@@ -0,0 +1,243 @@
//! Backfill `image_exif.phash_64` + `dhash_64` for image rows that
//! were ingested before perceptual hashing was wired into the watcher.
//!
//! The watcher computes perceptual hashes for new images as they're
//! ingested, so this binary is a one-shot for the historical backlog.
//! Idempotent — only rows with a non-null content_hash and a null
//! phash are processed, so re-runs are safe and pick up where they
//! left off (e.g. after a crash or interrupt).
//!
//! Image-only by design: `get_rows_missing_perceptual_hash` filters by
//! file extension at the DB layer so videos and other non-decodable
//! media are skipped without round-tripping `image_hasher`. Files that
//! can't be opened (missing on disk, permission errors) are quietly
//! left as null and counted as "missing"; on next run, if the file is
//! restored, the row will surface again.
use std::path::Path;
use std::sync::{Arc, Mutex};
use std::time::Instant;
use clap::Parser;
use log::{error, warn};
use rayon::prelude::*;
use image_api::bin_progress;
use image_api::database::{ExifDao, SqliteExifDao, connect};
use image_api::libraries::{self, Library};
use image_api::perceptual_hash;
#[derive(Parser, Debug)]
#[command(name = "backfill_perceptual_hash")]
#[command(about = "Compute pHash + dHash for image_exif rows missing one")]
struct Args {
/// Max rows to hash per batch. The process loops until no rows remain.
#[arg(long, default_value_t = 256)]
batch_size: i64,
/// Rayon parallelism override. 0 uses the default thread pool size.
#[arg(long, default_value_t = 0)]
parallelism: usize,
/// Dry-run: log what would be hashed without writing to the DB.
#[arg(long)]
dry_run: bool,
}
fn main() -> anyhow::Result<()> {
env_logger::init();
dotenv::dotenv().ok();
let args = Args::parse();
if args.parallelism > 0 {
rayon::ThreadPoolBuilder::new()
.num_threads(args.parallelism)
.build_global()
.expect("Unable to configure rayon thread pool");
}
let base_path = dotenv::var("BASE_PATH").ok();
let mut seed_conn = connect();
if let Some(base) = base_path.as_deref() {
libraries::seed_or_patch_from_env(&mut seed_conn, base);
}
let libs = libraries::load_all(&mut seed_conn);
drop(seed_conn);
if libs.is_empty() {
anyhow::bail!("No libraries configured; cannot backfill perceptual hashes");
}
let libs_by_id: std::collections::HashMap<i32, Library> =
libs.into_iter().map(|lib| (lib.id, lib)).collect();
println!(
"Configured libraries: {}",
libs_by_id
.values()
.map(|l| format!("{} -> {}", l.name, l.root_path))
.collect::<Vec<_>>()
.join(", ")
);
let dao: Arc<Mutex<Box<dyn ExifDao>>> = Arc::new(Mutex::new(Box::new(SqliteExifDao::new())));
let ctx = opentelemetry::Context::new();
let mut total_hashed = 0u64;
let mut total_missing = 0u64;
let mut total_decode_failures = 0u64;
let mut total_errors = 0u64;
let start = Instant::now();
let pb = bin_progress::spinner("perceptual-hashing");
loop {
let rows = {
let mut guard = dao.lock().expect("Unable to lock ExifDao");
guard
.get_rows_missing_perceptual_hash(&ctx, args.batch_size)
.map_err(|e| anyhow::anyhow!("DB error: {:?}", e))?
};
if rows.is_empty() {
break;
}
let batch_size = rows.len();
pb.set_message(format!(
"batch of {} (hashed={} decode_fail={} missing={} errors={})",
batch_size, total_hashed, total_decode_failures, total_missing, total_errors
));
// Compute perceptual hashes in parallel: the work is CPU-bound,
// and rayon's default thread pool matches the host's logical-core
// count, which is the right ceiling for image_hasher's DCT pass.
let results: Vec<(i32, String, FilePerceptualResult)> = rows
.into_par_iter()
.map(|(library_id, rel_path)| {
let abs = libs_by_id
.get(&library_id)
.map(|lib| Path::new(&lib.root_path).join(&rel_path));
match abs {
Some(abs_path) if abs_path.exists() => {
match perceptual_hash::compute(&abs_path) {
Some(id) => (library_id, rel_path, FilePerceptualResult::Ok(id)),
None => (library_id, rel_path, FilePerceptualResult::DecodeFailed),
}
}
Some(_) => (library_id, rel_path, FilePerceptualResult::MissingOnDisk),
None => {
warn!("Row refers to unknown library_id {}", library_id);
(library_id, rel_path, FilePerceptualResult::MissingOnDisk)
}
}
})
.collect();
// Persist sequentially — SQLite writes serialize anyway.
if !args.dry_run {
let mut guard = dao.lock().expect("Unable to lock ExifDao");
for (library_id, rel_path, result) in &results {
match result {
FilePerceptualResult::Ok(id) => {
match guard.backfill_perceptual_hash(
&ctx,
*library_id,
rel_path,
Some(id.phash_64),
Some(id.dhash_64),
) {
Ok(_) => {
total_hashed += 1;
pb.inc(1);
}
Err(e) => {
pb.println(format!("persist error for {}: {:?}", rel_path, e));
total_errors += 1;
}
}
}
FilePerceptualResult::DecodeFailed => {
// Persist phash_64=0/dhash_64=0 as a "tried,
// unhashable" sentinel so this row leaves the
// `phash_64 IS NULL` candidate set and the
// backfill doesn't infinite-loop on a queue of
undecodable formats (HEIC, RAW, CMYK JPEGs,
// truncated bytes). The all-zero hash is
// explicitly excluded from clustering by
// is_informative_hash in duplicates.rs, so it
// won't pollute group output — it just becomes
// invisible to the duplicate finder.
log::debug!(
"perceptual decode failed for {} (lib {}); marking unhashable",
rel_path,
library_id
);
match guard.backfill_perceptual_hash(
&ctx,
*library_id,
rel_path,
Some(0),
Some(0),
) {
Ok(_) => {
total_decode_failures += 1;
}
Err(e) => {
pb.println(format!(
"persist error (decode-fail sentinel) for {}: {:?}",
rel_path, e
));
total_errors += 1;
}
}
}
FilePerceptualResult::MissingOnDisk => {
total_missing += 1;
}
}
}
} else {
for (_, rel_path, result) in &results {
match result {
FilePerceptualResult::Ok(id) => {
pb.println(format!(
"[dry-run] {} -> phash={:016x} dhash={:016x}",
rel_path, id.phash_64, id.dhash_64
));
total_hashed += 1;
pb.inc(1);
}
FilePerceptualResult::DecodeFailed => {
total_decode_failures += 1;
}
FilePerceptualResult::MissingOnDisk => {
total_missing += 1;
}
}
}
pb.println(format!(
"[dry-run] processed one batch of {}. Stopping — a real run would continue \
until no NULL phash_64 image rows remain.",
results.len()
));
break;
}
}
pb.finish_and_clear();
println!(
"Done. hashed={}, decode_failed={}, skipped (missing on disk)={}, errors={}, elapsed={:.1}s",
total_hashed,
total_decode_failures,
total_missing,
total_errors,
start.elapsed().as_secs_f64()
);
if total_errors > 0 {
error!("Backfill completed with {} persist errors", total_errors);
}
Ok(())
}
enum FilePerceptualResult {
Ok(perceptual_hash::PerceptualIdentity),
DecodeFailed,
MissingOnDisk,
}
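The asymmetric pHash/dHash validation described in this branch (candidate discovery via pHash, stricter dHash veto, reject when dHash is missing) can be sketched in a few lines. This is an illustrative restatement, not the project's actual clustering code; the function names and the packed-`i64` representation mirror the `phash_64`/`dhash_64` columns above, and `dhash_threshold = max(2, threshold/2)` comes from the commit message.

```rust
/// Hamming distance between two 64-bit hashes packed as i64.
fn hamming(a: i64, b: i64) -> u32 {
    (a ^ b).count_ones()
}

/// pHash proposes a candidate edge; dHash validates it at a stricter,
/// asymmetric threshold. A missing dHash on either side rejects the
/// edge rather than trusting pHash alone.
fn is_near_duplicate(
    phash_a: i64,
    phash_b: i64,
    dhash_a: Option<i64>,
    dhash_b: Option<i64>,
    threshold: u32,
) -> bool {
    if hamming(phash_a, phash_b) > threshold {
        return false;
    }
    let dhash_threshold = std::cmp::max(2, threshold / 2);
    match (dhash_a, dhash_b) {
        (Some(da), Some(db)) => hamming(da, db) <= dhash_threshold,
        _ => false, // missing dHash: veto the edge
    }
}

fn main() {
    // A 1-bit pHash flip with identical dHashes passes both gates.
    assert!(is_near_duplicate(0, 1, Some(0), Some(0), 8));
    // The same pHash edge is rejected when one side lacks a dHash.
    assert!(!is_near_duplicate(0, 1, None, Some(0), 8));
}
```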

View File

@@ -53,12 +53,34 @@ pub fn thumbnail_path(thumbs_dir: &Path, hash: &str) -> PathBuf {
/// Hash-keyed HLS output directory: `<video_dir>/<hash[..2]>/<hash>/`.
/// The playlist lives at `playlist.m3u8` inside this directory and its
/// segments are co-located so HLS relative references Just Work.
///
/// Allow-dead until Branch B/C rewires the HLS pipeline to use it; the
/// helper lives here today so Branch A's path layout decisions stay
/// adjacent to thumbnail/legacy ones.
#[allow(dead_code)]
pub fn hls_dir(video_dir: &Path, hash: &str) -> PathBuf {
let shard = shard_prefix(hash);
video_dir.join(shard).join(hash)
}
/// Library-scoped legacy mirrored path:
/// `<derivative_dir>/<library_id>/<rel_path>`. Used as the fallback when
/// `content_hash` isn't available — the library prefix prevents the
/// "lib1 wrote `vacation/IMG.jpg` first, lib2 sees thumb_path.exists()
/// and serves the wrong image" failure mode.
///
/// Existing single-library deployments may already have thumbnails at the
/// bare-legacy `<derivative_dir>/<rel_path>` shape; serving code is
/// expected to check both this scoped path and the bare-legacy path so
/// nothing 404s during the transition.
pub fn library_scoped_legacy_path(
derivative_dir: &Path,
library_id: i32,
rel_path: impl AsRef<Path>,
) -> PathBuf {
derivative_dir.join(library_id.to_string()).join(rel_path)
}
fn shard_prefix(hash: &str) -> &str {
let end = hash
.char_indices()
@@ -105,4 +127,17 @@ mod tests {
let d = hls_dir(video, "1234deadbeef");
assert_eq!(d, PathBuf::from("/tmp/video/12/1234deadbeef"));
}
#[test]
fn library_scoped_legacy_path_prefixes_with_library_id() {
let thumbs = Path::new("/tmp/thumbs");
let p = library_scoped_legacy_path(thumbs, 7, "vacation/IMG.jpg");
assert_eq!(p, PathBuf::from("/tmp/thumbs/7/vacation/IMG.jpg"));
// Same rel_path, different library — different output. This is
// the whole point: lib 1 and lib 2 don't clobber each other.
let p1 = library_scoped_legacy_path(thumbs, 1, "vacation/IMG.jpg");
let p2 = library_scoped_legacy_path(thumbs, 2, "vacation/IMG.jpg");
assert_ne!(p1, p2);
}
}
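The doc comment above says serving code is expected to check both the library-scoped path and the bare-legacy path during the transition. A minimal sketch of that lookup order, assuming the serving side probes candidates in preference order (`candidate_thumb_paths` is an illustrative name, not a function in this file; `library_scoped_legacy_path` is reproduced from it):

```rust
use std::path::{Path, PathBuf};

/// Reproduced from the path-layout module above.
fn library_scoped_legacy_path(dir: &Path, library_id: i32, rel: &str) -> PathBuf {
    dir.join(library_id.to_string()).join(rel)
}

/// Candidate locations in preference order: scoped first, then the
/// bare-legacy shape single-library deployments may still have.
fn candidate_thumb_paths(dir: &Path, library_id: i32, rel: &str) -> [PathBuf; 2] {
    [
        library_scoped_legacy_path(dir, library_id, rel),
        dir.join(rel),
    ]
}

fn main() {
    let c = candidate_thumb_paths(Path::new("/tmp/thumbs"), 7, "vacation/IMG.jpg");
    assert_eq!(c[0], PathBuf::from("/tmp/thumbs/7/vacation/IMG.jpg"));
    assert_eq!(c[1], PathBuf::from("/tmp/thumbs/vacation/IMG.jpg"));
}
```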

View File

@@ -165,6 +165,15 @@ pub struct FilesRequest {
/// Optional library filter. Accepts a library id (e.g. "1") or name
/// (e.g. "main"). When omitted, results span all libraries.
pub library: Option<String>,
/// When true, include rows soft-marked as duplicates of another file
/// (i.e. `image_exif.duplicate_of_hash IS NOT NULL`). Defaults to
/// false — the standard /photos listing hides demoted siblings, so
/// the grid quietly shrinks after a resolve. The Apollo duplicates
/// modal passes `true` so it can show both survivors and demoted
/// members inside a group.
#[serde(default)]
pub include_duplicates: Option<bool>,
}
#[derive(Copy, Clone, Deserialize, PartialEq, Debug)]

View File

@@ -111,13 +111,30 @@ impl InsightDao for SqliteInsightDao {
fn store_insight(
&mut self,
context: &opentelemetry::Context,
insight: InsertPhotoInsight,
mut insight: InsertPhotoInsight,
) -> Result<PhotoInsight, DbError> {
trace_db_call(context, "insert", "store_insight", |_span| {
use schema::photo_insights::dsl::*;
let mut connection = self.connection.lock().expect("Unable to get InsightDao");
// Eagerly populate content_hash so this insight follows the
// bytes (CLAUDE.md "Multi-library data model"). Caller-
// supplied hash wins; otherwise look it up from image_exif
// for the (library_id, rel_path) tuple. None is acceptable —
// reconciliation backfills it once the hash lands.
if insight.content_hash.is_none() {
use schema::image_exif as ie;
insight.content_hash = ie::table
.filter(ie::library_id.eq(insight.library_id))
.filter(ie::rel_path.eq(&insight.file_path))
.filter(ie::content_hash.is_not_null())
.select(ie::content_hash)
.first::<Option<String>>(connection.deref_mut())
.ok()
.flatten();
}
// Mark all existing insights for this file as no longer current
diesel::update(
photo_insights

View File

@@ -9,6 +9,25 @@ use crate::database::models::{
};
use crate::otel::trace_db_call;
/// Wire shape for a single member of a duplicate group, returned by
/// `list_duplicates_*` and `lookup_duplicate_row`. Carries everything
/// the Apollo modal needs to render a member tile and its meta line —
/// thumbnails are derived from `(library_id, rel_path)` upstream.
#[derive(Debug, Clone, serde::Serialize)]
pub struct DuplicateRow {
pub library_id: i32,
pub rel_path: String,
pub content_hash: String,
pub size_bytes: Option<i64>,
pub date_taken: Option<i64>,
pub width: Option<i32>,
pub height: Option<i32>,
pub phash_64: Option<i64>,
pub dhash_64: Option<i64>,
pub duplicate_of_hash: Option<String>,
pub duplicate_decided_at: Option<i64>,
}
pub mod calendar_dao;
pub mod daily_summary_dao;
pub mod insights_dao;
@@ -16,6 +35,7 @@ pub mod knowledge_dao;
pub mod location_dao;
pub mod models;
pub mod preview_dao;
pub mod reconcile;
pub mod schema;
pub mod search_dao;
@@ -136,10 +156,19 @@ pub fn connect() -> SqliteConnection {
// rollback-journal durability; we accept the narrow last-fsync
// window for the 210× write throughput).
use diesel::connection::SimpleConnection;
// foreign_keys = ON is per-connection in SQLite (off by default), so
// it has to be set here alongside the other pragmas. Without it
// every `REFERENCES … ON DELETE CASCADE / SET NULL` clause in the
// schema is documentation-only — orphan rows would survive the
// referenced row's deletion. With it, the cascade fires
// automatically and code that previously did manual two-step
// cleanup (delete child rows, then parent) becomes redundant but
// still correct.
conn.batch_execute(
"PRAGMA journal_mode = WAL; \
PRAGMA busy_timeout = 5000; \
PRAGMA synchronous = NORMAL;",
PRAGMA synchronous = NORMAL; \
PRAGMA foreign_keys = ON;",
)
.expect("set sqlite pragmas");
conn
@@ -286,17 +315,29 @@ pub trait ExifDao: Sync + Send {
library_id: Option<i32>,
) -> Result<Vec<(String, i64)>, DbError>;
/// Batch load EXIF data for multiple file paths (single query)
/// Batch load EXIF data for multiple file paths (single query). When
/// `library_id = Some(id)` the lookup is keyed on `(library_id,
/// rel_path)`; cross-library duplicates with the same rel_path are
/// excluded. `None` keeps the legacy rel-path-only behavior — used by
/// the union-mode `/photos` listing, which already disambiguates by
/// `(file_path, library_id)` in the caller.
fn get_exif_batch(
&mut self,
context: &opentelemetry::Context,
library_id: Option<i32>,
file_paths: &[String],
) -> Result<Vec<ImageExif>, DbError>;
/// Query files by EXIF criteria with optional filters
/// Query files by EXIF criteria with optional filters. `library_id =
/// Some(id)` restricts to that library; `None` spans every library
/// (used by the unscoped `/photos` form). The composite
/// `(library_id, date_taken)` index added in the multi_library
/// migration depends on `library_id` being part of the WHERE clause —
/// callers that have a library context must pass it.
fn query_by_exif(
&mut self,
context: &opentelemetry::Context,
library_id: Option<i32>,
camera_make: Option<&str>,
camera_model: Option<&str>,
lens_model: Option<&str>,
@@ -355,6 +396,104 @@ pub trait ExifDao: Sync + Send {
size_bytes: i64,
) -> Result<(), DbError>;
/// Return image rows that have a `content_hash` but no `phash_64`,
/// oldest first. Used by the `backfill_perceptual_hash` binary.
/// Filters by image extension at the DB layer to avoid ever asking
/// `image_hasher` to decode a video. Returns `(library_id, rel_path)`.
fn get_rows_missing_perceptual_hash(
&mut self,
context: &opentelemetry::Context,
limit: i64,
) -> Result<Vec<(i32, String)>, DbError>;
/// Persist computed perceptual hashes (pHash + dHash) for an
/// existing image_exif row. Either column may be left NULL by
/// passing `None`, but in practice the binary computes both or
/// neither — `image_hasher` either decodes the image and produces
/// both signals, or fails entirely.
fn backfill_perceptual_hash(
&mut self,
context: &opentelemetry::Context,
library_id: i32,
rel_path: &str,
phash_64: Option<i64>,
dhash_64: Option<i64>,
) -> Result<(), DbError>;
/// Group exact-hash duplicates: rows whose `content_hash` appears
/// more than once across the (optionally library-scoped) corpus.
/// Returns one [`DuplicateRow`] per member; callers group by
/// `content_hash`. When `include_resolved=false`, rows already
/// soft-marked (`duplicate_of_hash IS NOT NULL`) are excluded so
/// the modal doesn't re-surface decisions the user already made.
fn list_duplicates_exact(
&mut self,
context: &opentelemetry::Context,
library_id: Option<i32>,
include_resolved: bool,
) -> Result<Vec<DuplicateRow>, DbError>;
/// Return all rows with a non-null `phash_64` (optionally library-
/// scoped), used by the perceptual-cluster routine in
/// [`crate::main`] to single-link cluster via Hamming distance.
/// Each returned row is a *distinct content_hash* — exact duplicates
/// are collapsed inside the DAO (client-side dedup after the query)
/// so the in-memory clusterer doesn't rediscover them.
fn list_perceptual_candidates(
&mut self,
context: &opentelemetry::Context,
library_id: Option<i32>,
include_resolved: bool,
) -> Result<Vec<DuplicateRow>, DbError>;
/// Look up a single row's metadata by `(library_id, rel_path)`. Used
/// by the resolve endpoint to map the request payload to the
/// underlying `content_hash` before writing the soft-mark. Returns
/// `Ok(None)` if the file doesn't exist in `image_exif`.
fn lookup_duplicate_row(
&mut self,
context: &opentelemetry::Context,
library_id: i32,
rel_path: &str,
) -> Result<Option<DuplicateRow>, DbError>;
/// Soft-mark a file as a duplicate of `survivor_hash`. Sets
/// `duplicate_of_hash` and `duplicate_decided_at` on the row(s)
/// matching `(library_id, rel_path)`. The file stays on disk; the
/// default `/photos` listing hides it because of the
/// `duplicate_of_hash IS NULL` filter.
fn set_duplicate_of(
&mut self,
context: &opentelemetry::Context,
library_id: i32,
rel_path: &str,
survivor_hash: &str,
decided_at: i64,
) -> Result<(), DbError>;
/// Reverse a soft-mark: clears `duplicate_of_hash` and
/// `duplicate_decided_at`. Used by the modal's UNRESOLVE chip.
fn clear_duplicate_of(
&mut self,
context: &opentelemetry::Context,
library_id: i32,
rel_path: &str,
) -> Result<(), DbError>;
/// Union the tags from `demoted_hash` onto `survivor_hash`. Used at
/// resolve time for *perceptual* duplicates (different content_hashes,
/// independent tag sets) so the user doesn't lose their tagging work
/// when promoting a survivor. Idempotent: a tag already on the survivor
/// is left alone. Exact duplicates (same content_hash) don't need this
/// because their tag rows are already shared.
fn union_perceptual_tags(
&mut self,
context: &opentelemetry::Context,
survivor_hash: &str,
demoted_hash: &str,
survivor_rel_path: &str,
) -> Result<(), DbError>;
/// Return the first EXIF row with the given content hash (any library).
/// Used by thumbnail/HLS generation to detect pre-existing derivatives
/// from another library before regenerating.
@@ -418,11 +557,17 @@ pub trait ExifDao: Sync + Send {
/// `library_ids` is empty, rows from every library are returned. Used by
/// `/photos` recursive listing to skip the filesystem walk — the watcher
/// keeps image_exif in parity with disk via the reconciliation pass.
///
/// `include_duplicates=false` filters out rows soft-marked with
/// `duplicate_of_hash IS NOT NULL` so the default photo listing hides
/// demoted siblings; the Apollo duplicates modal passes `true` to
/// see both survivors and demoted members inside a group.
fn list_rel_paths_for_libraries(
&mut self,
context: &opentelemetry::Context,
library_ids: &[i32],
path_prefix: Option<&str>,
include_duplicates: bool,
) -> Result<Vec<(i32, String)>, DbError>;
/// Delete a single image_exif row scoped to `(library_id, rel_path)`.
@@ -434,6 +579,28 @@ pub trait ExifDao: Sync + Send {
library_id: i32,
rel_path: &str,
) -> Result<(), DbError>;
/// Number of image_exif rows for a library. Used by the availability
/// probe to decide whether an empty mount is "fresh" (zero rows: fine)
/// or "the share went offline" (non-zero rows: stale). Zero on query
/// error so a transient DB hiccup doesn't itself cause a Stale flip.
fn count_for_library(
&mut self,
context: &opentelemetry::Context,
library_id: i32,
) -> Result<i64, DbError>;
/// Paginated rel_path listing for a single library, ordered by id
/// ascending. Used by the missing-file detector to scan a library
/// in capped chunks across consecutive watcher ticks rather than
/// stat()ing every row every minute. Returns `(id, rel_path)`.
fn list_rel_paths_for_library_page(
&mut self,
context: &opentelemetry::Context,
library_id: i32,
limit: i64,
offset: i64,
) -> Result<Vec<(i32, String)>, DbError>;
}
pub struct SqliteExifDao {
@@ -613,6 +780,7 @@ impl ExifDao for SqliteExifDao {
fn get_exif_batch(
&mut self,
context: &opentelemetry::Context,
library_id_filter: Option<i32>,
file_paths: &[String],
) -> Result<Vec<ImageExif>, DbError> {
trace_db_call(context, "query", "get_exif_batch", |_span| {
@@ -623,8 +791,11 @@ impl ExifDao for SqliteExifDao {
}
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
image_exif
let mut query = image_exif.into_boxed();
if let Some(lib_id) = library_id_filter {
query = query.filter(library_id.eq(lib_id));
}
query
.filter(rel_path.eq_any(file_paths))
.load::<ImageExif>(connection.deref_mut())
.map_err(|_| anyhow::anyhow!("Query error"))
@@ -635,6 +806,7 @@ impl ExifDao for SqliteExifDao {
fn query_by_exif(
&mut self,
context: &opentelemetry::Context,
library_id_filter: Option<i32>,
camera_make_filter: Option<&str>,
camera_model_filter: Option<&str>,
lens_model_filter: Option<&str>,
@@ -648,6 +820,12 @@ impl ExifDao for SqliteExifDao {
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
let mut query = image_exif.into_boxed();
// Library scope (most-selective filter — apply first so the
// `(library_id, ...)` indexes are eligible).
if let Some(lib_id) = library_id_filter {
query = query.filter(library_id.eq(lib_id));
}
// Camera filters (case-insensitive partial match)
if let Some(make) = camera_make_filter {
query = query.filter(camera_make.like(format!("%{}%", make)));
@@ -1022,6 +1200,7 @@ impl ExifDao for SqliteExifDao {
context: &opentelemetry::Context,
library_ids: &[i32],
path_prefix: Option<&str>,
include_duplicates: bool,
) -> Result<Vec<(i32, String)>, DbError> {
trace_db_call(context, "query", "list_rel_paths_for_libraries", |_span| {
use schema::image_exif::dsl::*;
@@ -1042,6 +1221,41 @@ impl ExifDao for SqliteExifDao {
query = query.filter(rel_path.like(pattern).escape('\\'));
}
if !include_duplicates {
if library_ids.is_empty() {
// Unscoped (all-libraries) view — every survivor is
// reachable somewhere, so a soft-marked row is
// genuinely a duplicate from the user's perspective.
// Hide it.
query = query.filter(duplicate_of_hash.is_null());
} else {
// Scoped to specific libraries: only hide a
// soft-marked row when the survivor is reachable
// *in this view*. If the survivor lives in a
// library the user can't see right now, the
// demoted file is the only copy of those bytes
// they have access to — keep it visible.
//
// Implemented as a correlated NOT EXISTS subquery
// over an aliased image_exif. Library ids are i32
// so format!-inlining the integer list is safe.
use diesel::sql_types::Bool;
let lib_list = library_ids
.iter()
.map(i32::to_string)
.collect::<Vec<_>>()
.join(",");
let raw = format!(
"(image_exif.duplicate_of_hash IS NULL OR NOT EXISTS \
(SELECT 1 FROM image_exif AS survivor \
WHERE survivor.content_hash = image_exif.duplicate_of_hash \
AND survivor.library_id IN ({})))",
lib_list
);
query = query.filter(diesel::dsl::sql::<Bool>(&raw));
}
}
query
.load::<(i32, String)>(connection.deref_mut())
.map_err(|_| anyhow::anyhow!("Query error"))
@@ -1069,6 +1283,465 @@ impl ExifDao for SqliteExifDao {
})
.map_err(|_| DbError::new(DbErrorKind::QueryError))
}
fn count_for_library(
&mut self,
context: &opentelemetry::Context,
library_id_val: i32,
) -> Result<i64, DbError> {
trace_db_call(context, "query", "count_for_library", |_span| {
use schema::image_exif::dsl::*;
image_exif
.filter(library_id.eq(library_id_val))
.count()
.get_result::<i64>(self.connection.lock().unwrap().deref_mut())
.map_err(|_| anyhow::anyhow!("Count error"))
})
.map_err(|_| DbError::new(DbErrorKind::QueryError))
}
fn list_rel_paths_for_library_page(
&mut self,
context: &opentelemetry::Context,
library_id_val: i32,
limit: i64,
offset: i64,
) -> Result<Vec<(i32, String)>, DbError> {
trace_db_call(
context,
"query",
"list_rel_paths_for_library_page",
|_span| {
use schema::image_exif::dsl::*;
image_exif
.filter(library_id.eq(library_id_val))
.order(id.asc())
.select((id, rel_path))
.limit(limit)
.offset(offset)
.load::<(i32, String)>(self.connection.lock().unwrap().deref_mut())
.map_err(|_| anyhow::anyhow!("Query error"))
},
)
.map_err(|_| DbError::new(DbErrorKind::QueryError))
}
fn get_rows_missing_perceptual_hash(
&mut self,
context: &opentelemetry::Context,
limit: i64,
) -> Result<Vec<(i32, String)>, DbError> {
trace_db_call(
context,
"query",
"get_rows_missing_perceptual_hash",
|_span| {
use schema::image_exif::dsl::*;
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
// Image-only filter via extension. Videos and decode-failures
// would always come back NULL otherwise and the binary would
// grind through them on every run. The list mirrors the file
// formats `image` 0.25 / `image_hasher` 3.x can decode.
image_exif
.filter(content_hash.is_not_null())
.filter(phash_64.is_null())
.filter(
rel_path
.like("%.jpg")
.or(rel_path.like("%.jpeg"))
.or(rel_path.like("%.JPG"))
.or(rel_path.like("%.JPEG"))
.or(rel_path.like("%.png"))
.or(rel_path.like("%.PNG"))
.or(rel_path.like("%.webp"))
.or(rel_path.like("%.WEBP"))
.or(rel_path.like("%.tif"))
.or(rel_path.like("%.tiff"))
.or(rel_path.like("%.TIF"))
.or(rel_path.like("%.TIFF"))
.or(rel_path.like("%.avif"))
.or(rel_path.like("%.AVIF")),
)
.select((library_id, rel_path))
.order(id.asc())
.limit(limit)
.load::<(i32, String)>(connection.deref_mut())
.map_err(|_| anyhow::anyhow!("Query error"))
},
)
.map_err(|_| DbError::new(DbErrorKind::QueryError))
}
fn backfill_perceptual_hash(
&mut self,
context: &opentelemetry::Context,
library_id_val: i32,
rel_path_val: &str,
phash_val: Option<i64>,
dhash_val: Option<i64>,
) -> Result<(), DbError> {
trace_db_call(context, "update", "backfill_perceptual_hash", |_span| {
use schema::image_exif::dsl::*;
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
diesel::update(
image_exif
.filter(library_id.eq(library_id_val))
.filter(rel_path.eq(rel_path_val)),
)
.set((phash_64.eq(phash_val), dhash_64.eq(dhash_val)))
.execute(connection.deref_mut())
.map(|_| ())
.map_err(|_| anyhow::anyhow!("Update error"))
})
.map_err(|_| DbError::new(DbErrorKind::UpdateError))
}
fn list_duplicates_exact(
&mut self,
context: &opentelemetry::Context,
library_id_filter: Option<i32>,
include_resolved: bool,
) -> Result<Vec<DuplicateRow>, DbError> {
trace_db_call(context, "query", "list_duplicates_exact", |_span| {
// Sub-select the content_hashes that appear more than once
// (optionally library-scoped), then load the full member rows
// for those hashes ordered by hash + library + path so the
// caller can stream-group without buffering the full dataset.
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
// Step 1: hashes with count > 1.
let dup_hashes: Vec<String> = {
use schema::image_exif::dsl::*;
let mut q = image_exif
.filter(content_hash.is_not_null())
.group_by(content_hash)
.select(content_hash.assume_not_null())
.having(diesel::dsl::count_star().gt(1))
.into_boxed();
if let Some(lib) = library_id_filter {
q = q.filter(library_id.eq(lib));
}
q.load::<String>(connection.deref_mut())
.map_err(|_| anyhow::anyhow!("Query error"))?
};
if dup_hashes.is_empty() {
return Ok(Vec::new());
}
// Step 2: every member row for those hashes.
use schema::image_exif::dsl::*;
let mut q = image_exif
.filter(content_hash.eq_any(&dup_hashes))
.select((
library_id,
rel_path,
content_hash.assume_not_null(),
size_bytes,
date_taken,
width,
height,
phash_64,
dhash_64,
duplicate_of_hash,
duplicate_decided_at,
))
.order((content_hash.asc(), library_id.asc(), rel_path.asc()))
.into_boxed();
if let Some(lib) = library_id_filter {
q = q.filter(library_id.eq(lib));
}
if !include_resolved {
q = q.filter(duplicate_of_hash.is_null());
}
let rows: Vec<(
i32,
String,
String,
Option<i64>,
Option<i64>,
Option<i32>,
Option<i32>,
Option<i64>,
Option<i64>,
Option<String>,
Option<i64>,
)> = q
.load(connection.deref_mut())
.map_err(|_| anyhow::anyhow!("Query error"))?;
Ok(rows
.into_iter()
.map(|r| DuplicateRow {
library_id: r.0,
rel_path: r.1,
content_hash: r.2,
size_bytes: r.3,
date_taken: r.4,
width: r.5,
height: r.6,
phash_64: r.7,
dhash_64: r.8,
duplicate_of_hash: r.9,
duplicate_decided_at: r.10,
})
.collect())
})
.map_err(|_| DbError::new(DbErrorKind::QueryError))
}
fn list_perceptual_candidates(
&mut self,
context: &opentelemetry::Context,
library_id_filter: Option<i32>,
include_resolved: bool,
) -> Result<Vec<DuplicateRow>, DbError> {
trace_db_call(context, "query", "list_perceptual_candidates", |_span| {
use schema::image_exif::dsl::*;
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
// For perceptual candidates we want one canonical row per
// distinct content_hash — exact dups are clustered by the
// exact-dup query and would only pollute the perceptual
// graph with zero-distance edges. Diesel doesn't have a
// clean `DISTINCT ON`, so we load every row and dedup
// client-side keyed on content_hash. The result set is small
// (only rows with a phash) and the cost is negligible vs
// the BK-tree clustering that follows.
let mut q = image_exif
.filter(content_hash.is_not_null())
.filter(phash_64.is_not_null())
.select((
library_id,
rel_path,
content_hash.assume_not_null(),
size_bytes,
date_taken,
width,
height,
phash_64,
dhash_64,
duplicate_of_hash,
duplicate_decided_at,
))
.order((content_hash.asc(), library_id.asc(), rel_path.asc()))
.into_boxed();
if let Some(lib) = library_id_filter {
q = q.filter(library_id.eq(lib));
}
if !include_resolved {
q = q.filter(duplicate_of_hash.is_null());
}
let rows: Vec<(
i32,
String,
String,
Option<i64>,
Option<i64>,
Option<i32>,
Option<i32>,
Option<i64>,
Option<i64>,
Option<String>,
Option<i64>,
)> = q
.load(connection.deref_mut())
.map_err(|_| anyhow::anyhow!("Query error"))?;
// Dedup keyed on content_hash, keeping the first occurrence
// (deterministic by the SQL ORDER BY: lowest library_id,
// then lexicographically smallest rel_path).
let mut seen = std::collections::HashSet::new();
let mut out = Vec::with_capacity(rows.len());
for r in rows {
if seen.insert(r.2.clone()) {
out.push(DuplicateRow {
library_id: r.0,
rel_path: r.1,
content_hash: r.2,
size_bytes: r.3,
date_taken: r.4,
width: r.5,
height: r.6,
phash_64: r.7,
dhash_64: r.8,
duplicate_of_hash: r.9,
duplicate_decided_at: r.10,
});
}
}
Ok(out)
})
.map_err(|_| DbError::new(DbErrorKind::QueryError))
}
fn lookup_duplicate_row(
&mut self,
context: &opentelemetry::Context,
library_id_val: i32,
rel_path_val: &str,
) -> Result<Option<DuplicateRow>, DbError> {
trace_db_call(context, "query", "lookup_duplicate_row", |_span| {
use schema::image_exif::dsl::*;
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
image_exif
.filter(library_id.eq(library_id_val))
.filter(rel_path.eq(rel_path_val))
.filter(content_hash.is_not_null())
.select((
library_id,
rel_path,
content_hash.assume_not_null(),
size_bytes,
date_taken,
width,
height,
phash_64,
dhash_64,
duplicate_of_hash,
duplicate_decided_at,
))
.first::<(
i32,
String,
String,
Option<i64>,
Option<i64>,
Option<i32>,
Option<i32>,
Option<i64>,
Option<i64>,
Option<String>,
Option<i64>,
)>(connection.deref_mut())
.optional()
.map(|opt| {
opt.map(|r| DuplicateRow {
library_id: r.0,
rel_path: r.1,
content_hash: r.2,
size_bytes: r.3,
date_taken: r.4,
width: r.5,
height: r.6,
phash_64: r.7,
dhash_64: r.8,
duplicate_of_hash: r.9,
duplicate_decided_at: r.10,
})
})
.map_err(|_| anyhow::anyhow!("Query error"))
})
.map_err(|_| DbError::new(DbErrorKind::QueryError))
}
fn set_duplicate_of(
&mut self,
context: &opentelemetry::Context,
library_id_val: i32,
rel_path_val: &str,
survivor_hash: &str,
decided_at: i64,
) -> Result<(), DbError> {
trace_db_call(context, "update", "set_duplicate_of", |_span| {
use schema::image_exif::dsl::*;
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
diesel::update(
image_exif
.filter(library_id.eq(library_id_val))
.filter(rel_path.eq(rel_path_val)),
)
.set((
duplicate_of_hash.eq(survivor_hash),
duplicate_decided_at.eq(decided_at),
))
.execute(connection.deref_mut())
.map(|_| ())
.map_err(|_| anyhow::anyhow!("Update error"))
})
.map_err(|_| DbError::new(DbErrorKind::UpdateError))
}
fn clear_duplicate_of(
&mut self,
context: &opentelemetry::Context,
library_id_val: i32,
rel_path_val: &str,
) -> Result<(), DbError> {
trace_db_call(context, "update", "clear_duplicate_of", |_span| {
use schema::image_exif::dsl::*;
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
diesel::update(
image_exif
.filter(library_id.eq(library_id_val))
.filter(rel_path.eq(rel_path_val)),
)
.set((
duplicate_of_hash.eq::<Option<String>>(None),
duplicate_decided_at.eq::<Option<i64>>(None),
))
.execute(connection.deref_mut())
.map(|_| ())
.map_err(|_| anyhow::anyhow!("Update error"))
})
.map_err(|_| DbError::new(DbErrorKind::UpdateError))
}
fn union_perceptual_tags(
&mut self,
context: &opentelemetry::Context,
survivor_hash: &str,
demoted_hash: &str,
survivor_rel_path: &str,
) -> Result<(), DbError> {
trace_db_call(context, "update", "union_perceptual_tags", |_span| {
// INSERT OR IGNORE handles two relevant uniqueness paths:
// - tagged_photo (rel_path, tag_id) is the historical key,
// so existing tag rows under the survivor's path collide
// and stay put.
// - The (rel_path, tag_id) collision is the one that
// matters for idempotence; (content_hash, tag_id) at the
// bytes level isn't enforced by SQLite but the read path
// dedups on it, so an extra row would be cosmetic.
// Tags whose rel_path differs are inserted, picking up the
// survivor's content_hash so they live under the right bytes.
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
diesel::sql_query(
"INSERT OR IGNORE INTO tagged_photo (rel_path, tag_id, created_time, content_hash) \
SELECT ?, tag_id, strftime('%s','now'), ? \
FROM tagged_photo \
WHERE content_hash = ? \
AND tag_id NOT IN ( \
SELECT tag_id FROM tagged_photo WHERE content_hash = ? \
)",
)
.bind::<diesel::sql_types::Text, _>(survivor_rel_path)
.bind::<diesel::sql_types::Text, _>(survivor_hash)
.bind::<diesel::sql_types::Text, _>(demoted_hash)
.bind::<diesel::sql_types::Text, _>(survivor_hash)
.execute(connection.deref_mut())
.map(|_| ())
.map_err(|_| anyhow::anyhow!("Tag union error"))
})
.map_err(|_| DbError::new(DbErrorKind::UpdateError))
}
}
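The tag-union semantics `union_perceptual_tags` implements in SQL (copy every tag from the demoted bytes onto the survivor, leave already-present tags alone, safe to re-run) can be restated in pure Rust over an in-memory map. This is an illustrative model only — the real path is the single `INSERT OR IGNORE … SELECT` above — and the `HashMap<content_hash, tag ids>` shape is an assumption for the sketch:

```rust
use std::collections::{HashMap, HashSet};

/// Union the demoted file's tags onto the survivor. Set union makes
/// this idempotent: a tag already on the survivor is left alone and
/// a second call is a no-op.
fn union_tags(
    tags: &mut HashMap<String, HashSet<i32>>,
    survivor_hash: &str,
    demoted_hash: &str,
) {
    let demoted: HashSet<i32> = tags.get(demoted_hash).cloned().unwrap_or_default();
    tags.entry(survivor_hash.to_string()).or_default().extend(demoted);
}

fn main() {
    let mut tags = HashMap::new();
    tags.insert("demoted".to_string(), HashSet::from([1, 2]));
    tags.insert("survivor".to_string(), HashSet::from([2, 3]));
    union_tags(&mut tags, "survivor", "demoted");
    assert_eq!(tags["survivor"], HashSet::from([1, 2, 3]));
    union_tags(&mut tags, "survivor", "demoted"); // idempotent re-run
    assert_eq!(tags["survivor"], HashSet::from([1, 2, 3]));
}
```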
#[cfg(test)]
@@ -1105,6 +1778,8 @@ mod exif_dao_tests {
last_modified: 0,
content_hash: None,
size_bytes: None,
phash_64: None,
dhash_64: None,
},
)
.expect("insert exif row");
@@ -1118,6 +1793,8 @@ mod exif_dao_tests {
name: "archive",
root_path: "/tmp/archive",
created_at: 0,
enabled: true,
excluded_dirs: None,
})
.execute(&mut conn)
.expect("seed second library");
@@ -1158,4 +1835,61 @@ mod exif_dao_tests {
let lib1 = dao.get_all_with_date_taken(&ctx(), Some(1)).unwrap();
assert_eq!(lib1, vec![("main/a.jpg".to_string(), 100)]);
}
#[test]
fn query_by_exif_scopes_by_library_id() {
let mut dao = setup_two_libraries();
insert_row(&mut dao, 1, "main/a.jpg", Some(100));
insert_row(&mut dao, 2, "archive/a.jpg", Some(200));
// Union: both rows.
let all = dao
.query_by_exif(&ctx(), None, None, None, None, None, None, None)
.unwrap();
assert_eq!(all.len(), 2);
// Scoped to lib 2: only archive row.
let lib2 = dao
.query_by_exif(&ctx(), Some(2), None, None, None, None, None, None)
.unwrap();
assert_eq!(lib2.len(), 1);
assert_eq!(lib2[0].file_path, "archive/a.jpg");
assert_eq!(lib2[0].library_id, 2);
}
#[test]
fn get_exif_batch_scopes_by_library_id() {
let mut dao = setup_two_libraries();
// Same rel_path, different libraries — the cross-library duplicate
// case the audit flagged.
insert_row(&mut dao, 1, "shared/photo.jpg", Some(100));
insert_row(&mut dao, 2, "shared/photo.jpg", Some(200));
// None spans both libraries (legacy union behavior).
let union = dao
.get_exif_batch(&ctx(), None, &["shared/photo.jpg".to_string()])
.unwrap();
assert_eq!(union.len(), 2);
// Some(2) returns only the archive row.
let scoped = dao
.get_exif_batch(&ctx(), Some(2), &["shared/photo.jpg".to_string()])
.unwrap();
assert_eq!(scoped.len(), 1);
assert_eq!(scoped[0].library_id, 2);
assert_eq!(scoped[0].date_taken, Some(200));
}
#[test]
fn count_for_library_returns_per_library_count() {
let mut dao = setup_two_libraries();
insert_row(&mut dao, 1, "main/a.jpg", None);
insert_row(&mut dao, 1, "main/b.jpg", None);
insert_row(&mut dao, 2, "archive/a.jpg", None);
assert_eq!(dao.count_for_library(&ctx(), 1).unwrap(), 2);
assert_eq!(dao.count_for_library(&ctx(), 2).unwrap(), 1);
// Unknown library: zero, no error.
assert_eq!(dao.count_for_library(&ctx(), 999).unwrap(), 0);
}
}
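The library-aware visibility rule from `list_rel_paths_for_libraries` (hide a demoted row only when its survivor is reachable in the current scope; an empty scope means all libraries, where every survivor is reachable) is easiest to see restated outside SQL. A minimal in-memory model — illustrative only, the real filter is the correlated `NOT EXISTS` subquery above, and the `Row` struct here is a stand-in for the relevant `image_exif` columns:

```rust
use std::collections::HashSet;

struct Row {
    library_id: i32,
    content_hash: String,
    duplicate_of_hash: Option<String>,
}

/// A demoted row stays visible unless a row carrying its survivor's
/// content_hash exists inside the current library scope. Empty scope
/// = unscoped (all-libraries) view: demoted rows are always hidden.
fn is_visible(row: &Row, all_rows: &[Row], scope: &[i32]) -> bool {
    let survivor_hash = match &row.duplicate_of_hash {
        None => return true,
        Some(h) => h,
    };
    if scope.is_empty() {
        return false;
    }
    let scope: HashSet<i32> = scope.iter().copied().collect();
    !all_rows
        .iter()
        .any(|r| scope.contains(&r.library_id) && r.content_hash == *survivor_hash)
}

fn main() {
    let rows = vec![
        Row { library_id: 2, content_hash: "S".into(), duplicate_of_hash: None },
        Row { library_id: 1, content_hash: "D".into(), duplicate_of_hash: Some("S".into()) },
    ];
    // Scoped to lib 1: survivor lives in lib 2, keep the demoted copy visible.
    assert!(is_visible(&rows[1], &rows, &[1]));
    // Scope covering lib 2: survivor reachable, hide the demoted row.
    assert!(!is_visible(&rows[1], &rows, &[1, 2]));
    // Unscoped view: hidden, as before this branch.
    assert!(!is_visible(&rows[1], &rows, &[]));
}
```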

View File

@@ -59,6 +59,10 @@ pub struct InsertImageExif {
pub last_modified: i64,
pub content_hash: Option<String>,
pub size_bytes: Option<i64>,
/// 64-bit pHash (DCT) packed as i64. NULL for videos and decode failures.
pub phash_64: Option<i64>,
/// 64-bit dHash (gradient). NULL for videos and decode failures.
pub dhash_64: Option<i64>,
}
// Field order matches the post-migration column order in `image_exif`.
@@ -86,6 +90,14 @@ pub struct ImageExif {
pub last_modified: i64,
pub content_hash: Option<String>,
pub size_bytes: Option<i64>,
pub phash_64: Option<i64>,
pub dhash_64: Option<i64>,
/// When non-null, this row is a soft-marked duplicate of the file
/// whose `content_hash` matches this value. The default `/photos`
/// listing filters such rows out.
pub duplicate_of_hash: Option<String>,
/// Unix seconds at which the resolve was committed.
pub duplicate_decided_at: Option<i64>,
}
#[derive(Insertable)]
@@ -108,6 +120,13 @@ pub struct InsertPhotoInsight {
/// generation). Used downstream to filter out contaminated rows when
/// assembling an unbiased training / evaluation set.
pub fewshot_source_ids: Option<String>,
/// Bytes-keyed identity. When present, this insight is considered
/// to belong to the content rather than the path — see CLAUDE.md
/// "Multi-library data model". The DAO populates this from
/// `image_exif.content_hash` at insert time when known; rows
/// inserted before the hash is available stay null and the
/// reconciliation pass backfills them.
pub content_hash: Option<String>,
}
#[derive(Serialize, Queryable, Clone, Debug)]
@@ -126,6 +145,7 @@ pub struct PhotoInsight {
/// `"local"` (Ollama with images) | `"hybrid"` (local vision + OpenRouter chat).
pub backend: String,
pub fewshot_source_ids: Option<String>,
pub content_hash: Option<String>,
}
// --- Libraries ---
@@ -136,6 +156,20 @@ pub struct LibraryRow {
pub name: String,
pub root_path: String,
pub created_at: i64,
/// Operator kill switch. `false` = the watcher skips this library
/// entirely (no probe, no ingest, no maintenance) and orphan-GC
/// treats it as out-of-scope for the all-online consensus rule.
/// Toggle via SQL today — there is intentionally no HTTP endpoint
/// for library mutation (see CLAUDE.md "Multi-library data model").
pub enabled: bool,
/// Per-library excluded paths/patterns, stored comma-separated
/// (same shape as the global `EXCLUDED_DIRS` env var). NULL = no
/// extra excludes for this library; the global env var still
/// applies. The runtime `Library` struct parses this into a
/// `Vec<String>` and the walker applies the union of (global,
/// library) excludes when scanning. Use case: mount a parent
/// directory while another library covers a child subtree.
pub excluded_dirs: Option<String>,
}
#[derive(Insertable)]
@@ -144,6 +178,8 @@ pub struct InsertLibrary<'a> {
pub name: &'a str,
pub root_path: &'a str,
pub created_at: i64,
pub enabled: bool,
pub excluded_dirs: Option<&'a str>,
}
// --- Knowledge memory models ---

src/database/reconcile.rs (new file)

@@ -0,0 +1,382 @@
//! Reconciliation pass for hash-keyed derived data.
//!
//! As `backfill_unhashed_backlog` populates `image_exif.content_hash`
//! for legacy rows, we want the matching `tagged_photo` and
//! `photo_insights` rows — which were inserted before the hash was
//! known — to inherit the hash too. Otherwise reads keep falling back
//! to the rel_path path even when a hash is now available.
//!
//! Two passes:
//! 1. **Hash backfill** — for every `tagged_photo` / `photo_insights`
//! row with NULL `content_hash`, look up the matching
//! `image_exif.content_hash` and write it. SQL-only; idempotent;
//! a no-op once everything is hashed.
//! 2. **Insight scalar merge** — when multiple `photo_insights` rows
//! share a `content_hash` with `is_current = true`, only the
//! earliest `generated_at` keeps `is_current = true` (per the
//! "earliest wins" rule in CLAUDE.md → "Multi-library data
//! model"). Others are demoted, not deleted, so they remain
//! visible in history endpoints.
//!
//! Tags are set-valued under the policy (union on read), so there's no
//! analogous "collapse" pass — duplicate `(tag_id, content_hash)` rows
//! across libraries are harmless and correctly de-duped at read time
//! by the existing `DISTINCT` queries.
//!
//! The pass operates on the database alone — no filesystem access —
//! so it doesn't need the library availability gate.
// The lib doesn't call into this module directly — the watcher (in the
// bin) does. Dead-code analysis at the lib level can't see that, so
// suppress at the module level. Tests still exercise every function.
#![allow(dead_code)]
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sqlite::SqliteConnection;
use log::{debug, info, warn};
/// Outcome of a reconciliation tick. Tracked so the watcher can log
/// progress when something changed and stay quiet when nothing did.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ReconcileStats {
pub tagged_photo_hashes_filled: usize,
pub photo_insights_hashes_filled: usize,
pub photo_insights_demoted: usize,
}
impl ReconcileStats {
pub fn changed(&self) -> bool {
self.tagged_photo_hashes_filled > 0
|| self.photo_insights_hashes_filled > 0
|| self.photo_insights_demoted > 0
}
}
/// Run the reconciliation pass. Idempotent — safe to call on every
/// watcher tick. Errors are logged but never propagated; reconciliation
/// is best-effort and a transient DB hiccup must not stall the watcher.
pub fn run(conn: &mut SqliteConnection) -> ReconcileStats {
let mut stats = ReconcileStats::default();
stats.tagged_photo_hashes_filled = match backfill_tagged_photo_hashes(conn) {
Ok(n) => n,
Err(e) => {
warn!("reconcile: tagged_photo hash backfill failed: {:?}", e);
0
}
};
stats.photo_insights_hashes_filled = match backfill_photo_insights_hashes(conn) {
Ok(n) => n,
Err(e) => {
warn!("reconcile: photo_insights hash backfill failed: {:?}", e);
0
}
};
stats.photo_insights_demoted = match collapse_insight_currents(conn) {
Ok(n) => n,
Err(e) => {
warn!("reconcile: photo_insights scalar merge failed: {:?}", e);
0
}
};
if stats.changed() {
info!(
"reconcile: filled {} tagged_photo hash(es), {} photo_insights hash(es); demoted {} non-current insight row(s)",
stats.tagged_photo_hashes_filled,
stats.photo_insights_hashes_filled,
stats.photo_insights_demoted,
);
} else {
debug!("reconcile: no changes this tick");
}
stats
}
/// Populate `tagged_photo.content_hash` for any row that still has
/// NULL by joining on `rel_path` against `image_exif`. tagged_photo
/// doesn't carry `library_id`, so a path that exists under multiple
/// libraries with different content is genuinely ambiguous; we pick
/// any non-null hash for that path. Same trade-off as the migration
/// backfill — see `migrations/2026-05-01-000000_hash_keyed_derived_data`.
fn backfill_tagged_photo_hashes(conn: &mut SqliteConnection) -> QueryResult<usize> {
sql_query(
"UPDATE tagged_photo \
SET content_hash = ( \
SELECT content_hash FROM image_exif \
WHERE image_exif.rel_path = tagged_photo.rel_path \
AND image_exif.content_hash IS NOT NULL \
LIMIT 1 \
) \
WHERE content_hash IS NULL \
AND EXISTS ( \
SELECT 1 FROM image_exif \
WHERE image_exif.rel_path = tagged_photo.rel_path \
AND image_exif.content_hash IS NOT NULL \
)",
)
.execute(conn)
}
/// Populate `photo_insights.content_hash` from `image_exif`, keyed on
/// `(library_id, rel_path)`. Unambiguous because photo_insights carries
/// library_id.
fn backfill_photo_insights_hashes(conn: &mut SqliteConnection) -> QueryResult<usize> {
sql_query(
"UPDATE photo_insights \
SET content_hash = ( \
SELECT content_hash FROM image_exif \
WHERE image_exif.library_id = photo_insights.library_id \
AND image_exif.rel_path = photo_insights.rel_path \
AND image_exif.content_hash IS NOT NULL \
LIMIT 1 \
) \
WHERE content_hash IS NULL \
AND EXISTS ( \
SELECT 1 FROM image_exif \
WHERE image_exif.library_id = photo_insights.library_id \
AND image_exif.rel_path = photo_insights.rel_path \
AND image_exif.content_hash IS NOT NULL \
)",
)
.execute(conn)
}
/// Scalar-merge step: when multiple rows share a `content_hash` and
/// claim `is_current = true`, demote all but the earliest by
/// `generated_at` (ties broken by lowest id, deterministic).
///
/// Demoted rows keep their data — only `is_current` flips. Clients that
/// hit `/insights/history` still see the full sequence; only the
/// "current" pointer is unique per hash.
fn collapse_insight_currents(conn: &mut SqliteConnection) -> QueryResult<usize> {
sql_query(
"UPDATE photo_insights \
SET is_current = 0 \
WHERE is_current = 1 \
AND content_hash IS NOT NULL \
AND id NOT IN ( \
SELECT MIN(p2.id) FROM photo_insights p2 \
WHERE p2.is_current = 1 \
AND p2.content_hash = photo_insights.content_hash \
AND p2.generated_at = ( \
SELECT MIN(p3.generated_at) FROM photo_insights p3 \
WHERE p3.is_current = 1 \
AND p3.content_hash = p2.content_hash \
) \
)",
)
.execute(conn)
}
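The UPDATE above is easier to sanity-check against an in-memory model. A hedged sketch of the same "earliest wins" rule over plain structs — the `Insight` struct and `collapse_currents` helper here are illustrative only, not part of the codebase, which does this in a single SQL statement:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-in for a photo_insights row.
#[derive(Debug, Clone)]
struct Insight {
    id: i32,
    content_hash: Option<String>,
    generated_at: i64,
    is_current: bool,
}

/// Demote every `is_current` row except the earliest per non-null
/// content_hash (earliest `generated_at`, ties broken by lowest id).
/// Returns the number of rows demoted. Idempotent, like the SQL pass.
fn collapse_currents(rows: &mut [Insight]) -> usize {
    // hash -> index of the earliest current row seen so far
    let mut winner: HashMap<String, usize> = HashMap::new();
    for (i, r) in rows.iter().enumerate() {
        let Some(h) = r.content_hash.clone() else { continue };
        if !r.is_current {
            continue;
        }
        match winner.get(&h).copied() {
            Some(w) if (r.generated_at, r.id) < (rows[w].generated_at, rows[w].id) => {
                winner.insert(h, i);
            }
            None => {
                winner.insert(h, i);
            }
            _ => {}
        }
    }
    let keep: HashSet<usize> = winner.into_values().collect();
    let mut demoted = 0;
    for (i, r) in rows.iter_mut().enumerate() {
        if r.is_current && r.content_hash.is_some() && !keep.contains(&i) {
            r.is_current = false;
            demoted += 1;
        }
    }
    demoted
}
```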
#[cfg(test)]
mod tests {
use super::*;
use crate::database::test::in_memory_db_connection;
fn ensure_library(conn: &mut SqliteConnection, library_id: i32) {
// Migration seeds library id=1; tests that reference id>1 must
// create those rows themselves, otherwise FK enforcement (added
// in the tags-edit migration) rejects image_exif inserts.
diesel::sql_query(
"INSERT OR IGNORE INTO libraries (id, name, root_path, created_at) \
VALUES (?, 'test-' || ?, '/tmp/test-' || ?, 0)",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Integer, _>(library_id)
.execute(conn)
.unwrap();
}
fn insert_image_exif(
conn: &mut SqliteConnection,
library_id: i32,
rel_path: &str,
content_hash: Option<&str>,
) {
use crate::database::schema::image_exif;
ensure_library(conn, library_id);
diesel::sql_query(
"INSERT INTO image_exif (library_id, rel_path, created_time, last_modified, content_hash) \
VALUES (?, ?, 0, 0, ?)",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Text, _>(rel_path)
.bind::<diesel::sql_types::Nullable<diesel::sql_types::Text>, _>(content_hash)
.execute(conn)
.unwrap();
// Keep clippy happy that the import is used.
let _ = image_exif::table;
}
fn insert_tagged_photo(conn: &mut SqliteConnection, rel_path: &str, tag_id: i32) {
diesel::sql_query(
"INSERT INTO tagged_photo (rel_path, tag_id, created_time) VALUES (?, ?, 0)",
)
.bind::<diesel::sql_types::Text, _>(rel_path)
.bind::<diesel::sql_types::Integer, _>(tag_id)
.execute(conn)
.unwrap();
}
fn insert_tag(conn: &mut SqliteConnection, id: i32, name: &str) {
diesel::sql_query("INSERT INTO tags (id, name, created_time) VALUES (?, ?, 0)")
.bind::<diesel::sql_types::Integer, _>(id)
.bind::<diesel::sql_types::Text, _>(name)
.execute(conn)
.unwrap();
}
fn insert_insight(
conn: &mut SqliteConnection,
library_id: i32,
rel_path: &str,
generated_at: i64,
is_current: bool,
) -> i32 {
ensure_library(conn, library_id);
diesel::sql_query(
"INSERT INTO photo_insights (library_id, rel_path, title, summary, generated_at, model_version, is_current, backend) \
VALUES (?, ?, 't', 's', ?, 'v', ?, 'local')",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Text, _>(rel_path)
.bind::<diesel::sql_types::BigInt, _>(generated_at)
.bind::<diesel::sql_types::Bool, _>(is_current)
.execute(conn)
.unwrap();
diesel::sql_query("SELECT last_insert_rowid() AS id")
.get_result::<TestId>(conn)
.map(|r| r.id)
.unwrap()
}
#[derive(QueryableByName)]
struct TestId {
#[diesel(sql_type = diesel::sql_types::Integer)]
id: i32,
}
#[derive(QueryableByName, Debug)]
struct HashOnly {
#[diesel(sql_type = diesel::sql_types::Nullable<diesel::sql_types::Text>)]
content_hash: Option<String>,
}
#[derive(QueryableByName, Debug)]
struct CurrentRow {
#[diesel(sql_type = diesel::sql_types::Integer)]
id: i32,
#[diesel(sql_type = diesel::sql_types::Bool)]
is_current: bool,
}
#[test]
fn backfill_fills_tagged_photo_hash_when_image_exif_has_one() {
let mut conn = in_memory_db_connection();
insert_tag(&mut conn, 1, "vacation");
insert_tagged_photo(&mut conn, "trip/IMG.jpg", 1);
// No image_exif row yet — backfill no-op.
let stats = run(&mut conn);
assert_eq!(stats.tagged_photo_hashes_filled, 0);
// image_exif row appears with a hash; next reconcile fills it.
insert_image_exif(&mut conn, 1, "trip/IMG.jpg", Some("hashabc"));
let stats = run(&mut conn);
assert_eq!(stats.tagged_photo_hashes_filled, 1);
let row = diesel::sql_query(
"SELECT content_hash FROM tagged_photo WHERE rel_path = 'trip/IMG.jpg'",
)
.get_result::<HashOnly>(&mut conn)
.unwrap();
assert_eq!(row.content_hash.as_deref(), Some("hashabc"));
// Idempotent: a second run is a no-op.
let stats = run(&mut conn);
assert_eq!(stats.tagged_photo_hashes_filled, 0);
}
#[test]
fn backfill_skips_tagged_photo_when_image_exif_has_no_hash() {
let mut conn = in_memory_db_connection();
insert_tag(&mut conn, 1, "vacation");
insert_tagged_photo(&mut conn, "trip/IMG.jpg", 1);
// image_exif exists but its hash is null.
insert_image_exif(&mut conn, 1, "trip/IMG.jpg", None);
let stats = run(&mut conn);
assert_eq!(stats.tagged_photo_hashes_filled, 0);
}
#[test]
fn backfill_fills_photo_insights_hash_scoped_by_library() {
let mut conn = in_memory_db_connection();
// Row in library 1 only — must not be filled by a hash from
// library 2's same-rel_path entry.
insert_image_exif(&mut conn, 1, "shared.jpg", Some("hash-lib1"));
let id1 = insert_insight(&mut conn, 1, "shared.jpg", 100, true);
let stats = run(&mut conn);
assert_eq!(stats.photo_insights_hashes_filled, 1);
let row = diesel::sql_query("SELECT content_hash FROM photo_insights WHERE id = ?")
.bind::<diesel::sql_types::Integer, _>(id1)
.get_result::<HashOnly>(&mut conn)
.unwrap();
assert_eq!(row.content_hash.as_deref(), Some("hash-lib1"));
}
#[test]
fn collapse_keeps_earliest_is_current_per_hash() {
let mut conn = in_memory_db_connection();
// Two libraries, same content_hash via image_exif. Insights
// were generated independently in each library, both currently
// is_current = true. The earlier one wins.
insert_image_exif(&mut conn, 1, "a.jpg", Some("h1"));
insert_image_exif(&mut conn, 2, "a.jpg", Some("h1"));
let earlier = insert_insight(&mut conn, 1, "a.jpg", 100, true);
let later = insert_insight(&mut conn, 2, "a.jpg", 200, true);
// First pass fills the content_hash; second collapses.
let stats = run(&mut conn);
assert_eq!(stats.photo_insights_hashes_filled, 2);
assert_eq!(stats.photo_insights_demoted, 1);
let rows = diesel::sql_query("SELECT id, is_current FROM photo_insights ORDER BY id")
.get_results::<CurrentRow>(&mut conn)
.unwrap();
let earlier_row = rows.iter().find(|r| r.id == earlier).unwrap();
let later_row = rows.iter().find(|r| r.id == later).unwrap();
assert!(
earlier_row.is_current,
"earlier insight should remain current"
);
assert!(!later_row.is_current, "later insight should be demoted");
// Idempotent.
let stats = run(&mut conn);
assert_eq!(stats.photo_insights_demoted, 0);
}
#[test]
fn collapse_does_not_demote_a_solo_current_row() {
let mut conn = in_memory_db_connection();
insert_image_exif(&mut conn, 1, "a.jpg", Some("h1"));
let solo = insert_insight(&mut conn, 1, "a.jpg", 100, true);
let stats = run(&mut conn);
assert_eq!(stats.photo_insights_demoted, 0);
let row = diesel::sql_query("SELECT id, is_current FROM photo_insights WHERE id = ?")
.bind::<diesel::sql_types::Integer, _>(solo)
.get_result::<CurrentRow>(&mut conn)
.unwrap();
assert!(row.is_current);
}
}


@@ -121,6 +121,10 @@ diesel::table! {
last_modified -> BigInt,
content_hash -> Nullable<Text>,
size_bytes -> Nullable<BigInt>,
phash_64 -> Nullable<BigInt>,
dhash_64 -> Nullable<BigInt>,
duplicate_of_hash -> Nullable<Text>,
duplicate_decided_at -> Nullable<BigInt>,
}
}
@@ -130,6 +134,8 @@ diesel::table! {
name -> Text,
root_path -> Text,
created_at -> BigInt,
enabled -> Bool,
excluded_dirs -> Nullable<Text>,
}
}
@@ -178,6 +184,7 @@ diesel::table! {
approved -> Nullable<Bool>,
backend -> Text,
fewshot_source_ids -> Nullable<Text>,
content_hash -> Nullable<Text>,
}
}
@@ -199,6 +206,7 @@ diesel::table! {
rel_path -> Text,
tag_id -> Integer,
created_time -> BigInt,
content_hash -> Nullable<Text>,
}
}

src/duplicates.rs (new file)

@@ -0,0 +1,893 @@
//! Duplicate detection surface — exact (blake3) and perceptual
//! (pHash + Hamming) groups, plus the soft-mark resolve flow that
//! Apollo's DUPLICATES modal drives.
//!
//! All routes require auth (Claims). Endpoints:
//!
//! - `GET /duplicates/exact?library=&include_resolved=` — count>1 byte-identical groups.
//! - `GET /duplicates/perceptual?library=&threshold=&include_resolved=` — Hamming-clustered groups.
//! - `POST /duplicates/resolve` — soft-mark demoted siblings.
//! - `POST /duplicates/unresolve` — clear a prior soft-mark.
//!
//! Perceptual clustering caches the BK-tree result for 5 minutes so
//! repeated opens of the modal don't re-cluster the whole library.
//! Cache invalidation is best-effort: resolve/unresolve clear the
//! cache, but new files arriving via the watcher don't (the next
//! 5-minute window picks them up). For a single-user personal tool
//! that's the right trade-off.
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};
use actix_web::{App, HttpRequest, HttpResponse, Responder, dev::ServiceFactory, web};
use bk_tree::{BKTree, Metric};
use lazy_static::lazy_static;
use opentelemetry::trace::{TraceContextExt, Tracer};
use serde::{Deserialize, Serialize};
use crate::data::Claims;
use crate::database::{DuplicateRow, ExifDao};
use crate::libraries;
use crate::otel::{extract_context_from_request, global_tracer};
use crate::state::AppState;
// ── Cache ────────────────────────────────────────────────────────────────
const PERCEPTUAL_CACHE_TTL: Duration = Duration::from_secs(300);
#[derive(Clone)]
struct PerceptualCacheEntry {
/// Cache key: (library_id, threshold, include_resolved). `library_id`
/// is `None` for "all libraries". Cluster output is the same shape we
/// return on the wire so we can serve cached requests with zero work.
library_id: Option<i32>,
threshold: u32,
include_resolved: bool,
computed_at: Instant,
groups: Vec<DuplicateGroup>,
}
lazy_static! {
static ref PERCEPTUAL_CACHE: Mutex<Option<PerceptualCacheEntry>> = Mutex::new(None);
}
/// Drop the perceptual-cluster cache. Called from `resolve`/`unresolve`
/// so the next modal open reflects the soft-mark change immediately.
fn invalidate_perceptual_cache() {
if let Ok(mut guard) = PERCEPTUAL_CACHE.lock() {
*guard = None;
}
}
// ── Wire shapes ──────────────────────────────────────────────────────────
#[derive(Serialize, Debug, Clone)]
pub struct DuplicateMember {
pub library_id: i32,
pub rel_path: String,
pub content_hash: String,
pub size_bytes: Option<i64>,
pub date_taken: Option<i64>,
pub width: Option<i32>,
pub height: Option<i32>,
pub duplicate_of_hash: Option<String>,
pub duplicate_decided_at: Option<i64>,
}
impl From<DuplicateRow> for DuplicateMember {
fn from(r: DuplicateRow) -> Self {
Self {
library_id: r.library_id,
rel_path: r.rel_path,
content_hash: r.content_hash,
size_bytes: r.size_bytes,
date_taken: r.date_taken,
width: r.width,
height: r.height,
duplicate_of_hash: r.duplicate_of_hash,
duplicate_decided_at: r.duplicate_decided_at,
}
}
}
#[derive(Serialize, Debug, Clone)]
#[serde(rename_all = "lowercase")]
pub enum DuplicateKind {
Exact,
Perceptual,
}
#[derive(Serialize, Debug, Clone)]
pub struct DuplicateGroup {
pub kind: DuplicateKind,
/// Representative content_hash. For exact groups, the shared hash
/// (every member has the same one). For perceptual groups, an
/// arbitrary cluster member's hash, used only as a stable id for
/// the UI to key off.
pub representative_hash: String,
pub members: Vec<DuplicateMember>,
}
#[derive(Deserialize, Debug)]
pub struct ListDuplicatesQuery {
pub library: Option<String>,
#[serde(default)]
pub include_resolved: Option<bool>,
/// Perceptual only — Hamming-distance threshold. Ignored on the
/// exact endpoint. Defaults to 8 (~12% similarity tolerance, the
/// sweet spot for resized/recompressed copies).
#[serde(default)]
pub threshold: Option<u32>,
}
#[derive(Deserialize, Debug)]
pub struct DuplicateMemberRef {
pub library_id: i32,
pub rel_path: String,
}
#[derive(Deserialize, Debug)]
pub struct ResolveDuplicatesReq {
pub survivor: DuplicateMemberRef,
pub demoted: Vec<DuplicateMemberRef>,
}
#[derive(Serialize, Debug)]
pub struct ResolveResponse {
pub resolved_count: usize,
}
#[derive(Deserialize, Debug)]
pub struct UnresolveDuplicateReq {
pub library_id: i32,
pub rel_path: String,
}
// ── Handlers ─────────────────────────────────────────────────────────────
async fn list_exact_handler(
_: Claims,
request: HttpRequest,
app_state: web::Data<AppState>,
query: web::Query<ListDuplicatesQuery>,
exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
let context = extract_context_from_request(&request);
let span = global_tracer().start_with_context("duplicates.list_exact", &context);
let span_context = opentelemetry::Context::current_with_span(span);
let library_id = libraries::resolve_library_param(&app_state, query.library.as_deref())
.ok()
.flatten()
.map(|l| l.id);
let include_resolved = query.include_resolved.unwrap_or(false);
let rows = {
let mut dao = exif_dao.lock().expect("exif dao lock");
match dao.list_duplicates_exact(&span_context, library_id, include_resolved) {
Ok(rows) => rows,
Err(e) => {
return HttpResponse::InternalServerError().body(format!("{:?}", e));
}
}
};
let groups = group_exact(rows);
HttpResponse::Ok().json(GroupsResponse { groups })
}
async fn list_perceptual_handler(
_: Claims,
request: HttpRequest,
app_state: web::Data<AppState>,
query: web::Query<ListDuplicatesQuery>,
exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
let context = extract_context_from_request(&request);
let span = global_tracer().start_with_context("duplicates.list_perceptual", &context);
let span_context = opentelemetry::Context::current_with_span(span);
let library_id = libraries::resolve_library_param(&app_state, query.library.as_deref())
.ok()
.flatten()
.map(|l| l.id);
let threshold = query.threshold.unwrap_or(8).clamp(0, 32);
let include_resolved = query.include_resolved.unwrap_or(false);
// Cache hit?
if let Ok(guard) = PERCEPTUAL_CACHE.lock()
&& let Some(entry) = guard.as_ref()
&& entry.library_id == library_id
&& entry.threshold == threshold
&& entry.include_resolved == include_resolved
&& entry.computed_at.elapsed() < PERCEPTUAL_CACHE_TTL
{
return HttpResponse::Ok().json(GroupsResponse {
groups: entry.groups.clone(),
});
}
let rows = {
let mut dao = exif_dao.lock().expect("exif dao lock");
match dao.list_perceptual_candidates(&span_context, library_id, include_resolved) {
Ok(rows) => rows,
Err(e) => {
return HttpResponse::InternalServerError().body(format!("{:?}", e));
}
}
};
let groups = cluster_perceptual(rows, threshold);
if let Ok(mut guard) = PERCEPTUAL_CACHE.lock() {
*guard = Some(PerceptualCacheEntry {
library_id,
threshold,
include_resolved,
computed_at: Instant::now(),
groups: groups.clone(),
});
}
HttpResponse::Ok().json(GroupsResponse { groups })
}
async fn resolve_handler(
_: Claims,
request: HttpRequest,
body: web::Json<ResolveDuplicatesReq>,
exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
let context = extract_context_from_request(&request);
let span = global_tracer().start_with_context("duplicates.resolve", &context);
let span_context = opentelemetry::Context::current_with_span(span);
if body.demoted.is_empty() {
return HttpResponse::BadRequest().body("demoted list is empty");
}
let mut dao = exif_dao.lock().expect("exif dao lock");
// Resolve survivor → its content_hash, plus the canonical rel_path
// we'll use as the destination for any tag-union INSERTs.
let survivor = match dao.lookup_duplicate_row(
&span_context,
body.survivor.library_id,
&body.survivor.rel_path,
) {
Ok(Some(row)) => row,
Ok(None) => return HttpResponse::NotFound().body("survivor not found"),
Err(e) => return HttpResponse::InternalServerError().body(format!("{:?}", e)),
};
// Survivor must not itself be soft-marked — otherwise the modal is
// pointing at a row we've already demoted, which would create a chain.
if survivor.duplicate_of_hash.is_some() {
return HttpResponse::Conflict().body("survivor is itself soft-marked as a duplicate");
}
let now = chrono::Utc::now().timestamp();
let mut resolved_count = 0usize;
for member_ref in &body.demoted {
let demoted = match dao.lookup_duplicate_row(
&span_context,
member_ref.library_id,
&member_ref.rel_path,
) {
Ok(Some(row)) => row,
Ok(None) => {
log::warn!(
"duplicates.resolve: skipping unknown demoted ({}, {})",
member_ref.library_id,
member_ref.rel_path
);
continue;
}
Err(e) => {
return HttpResponse::InternalServerError().body(format!("{:?}", e));
}
};
// Survivor and demoted must not be the same row (would set
// duplicate_of_hash to its own hash — recursive nonsense).
if demoted.library_id == survivor.library_id && demoted.rel_path == survivor.rel_path {
continue;
}
// For perceptual dups (different content_hash), union the
// demoted's tag set onto the survivor before flipping the
// soft-mark. For exact dups (same content_hash), tags are
// already shared at the bytes layer — the union is a no-op.
if demoted.content_hash != survivor.content_hash
&& let Err(e) = dao.union_perceptual_tags(
&span_context,
&survivor.content_hash,
&demoted.content_hash,
&survivor.rel_path,
)
{
log::warn!(
"duplicates.resolve: tag union failed for {}: {:?}",
demoted.rel_path,
e
);
// Continue with the soft-mark anyway — losing tag
// continuity is recoverable (unresolve restores the
// demoted row's grid presence, and the original tags
// never moved off the demoted hash).
}
if let Err(e) = dao.set_duplicate_of(
&span_context,
demoted.library_id,
&demoted.rel_path,
&survivor.content_hash,
now,
) {
return HttpResponse::InternalServerError().body(format!("{:?}", e));
}
resolved_count += 1;
}
drop(dao);
invalidate_perceptual_cache();
HttpResponse::Ok().json(ResolveResponse { resolved_count })
}
async fn unresolve_handler(
_: Claims,
request: HttpRequest,
body: web::Json<UnresolveDuplicateReq>,
exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
let context = extract_context_from_request(&request);
let span = global_tracer().start_with_context("duplicates.unresolve", &context);
let span_context = opentelemetry::Context::current_with_span(span);
let mut dao = exif_dao.lock().expect("exif dao lock");
if let Err(e) = dao.clear_duplicate_of(&span_context, body.library_id, &body.rel_path) {
return HttpResponse::InternalServerError().body(format!("{:?}", e));
}
drop(dao);
invalidate_perceptual_cache();
HttpResponse::Ok().finish()
}
// ── Grouping / clustering ────────────────────────────────────────────────
#[derive(Serialize, Debug)]
struct GroupsResponse {
groups: Vec<DuplicateGroup>,
}
fn group_exact(rows: Vec<DuplicateRow>) -> Vec<DuplicateGroup> {
let mut by_hash: HashMap<String, Vec<DuplicateRow>> = HashMap::new();
for row in rows {
by_hash
.entry(row.content_hash.clone())
.or_default()
.push(row);
}
let mut groups: Vec<DuplicateGroup> = by_hash
.into_iter()
.filter(|(_, members)| members.len() > 1)
.map(|(hash, members)| DuplicateGroup {
kind: DuplicateKind::Exact,
representative_hash: hash,
members: members.into_iter().map(DuplicateMember::from).collect(),
})
.collect();
// Largest groups first (most reward per click), then deterministic.
groups.sort_by(|a, b| {
b.members
.len()
.cmp(&a.members.len())
.then_with(|| a.representative_hash.cmp(&b.representative_hash))
});
groups
}
/// Bits set in a "useful" perceptual hash. Real photographic content
/// produces roughly 50/50 bit distributions; anything outside the
/// [16, 48] band is low-entropy structure (uniform skies, black
/// frames, monochrome scans, faded film) where pHash collapses to
/// near-uniform values that land trivially close in Hamming distance
/// to hundreds of unrelated images. The [8, 56] band that shipped
/// first was too permissive: even at threshold=4 the false-positive
/// cluster persisted.
const MIN_INFORMATIVE_POPCOUNT: u32 = 16;
const MAX_INFORMATIVE_POPCOUNT: u32 = 64 - MIN_INFORMATIVE_POPCOUNT;
#[inline]
fn is_informative_hash(h: i64) -> bool {
let pop = (h as u64).count_ones();
(MIN_INFORMATIVE_POPCOUNT..=MAX_INFORMATIVE_POPCOUNT).contains(&pop)
}
/// dHash gets a stricter threshold than pHash. pHash is the
/// candidate-discovery signal (BK-tree neighbourhood lookup); dHash
/// is the validation signal that has to actively agree before we
/// union. Splitting the budget asymmetrically means a real near-dup
/// (which scores well on both) survives while an incidental pHash
/// collision (uniform-content false positive) gets vetoed.
///
/// Floor of 2 so threshold=4 still allows a 1-bit jitter in dHash —
/// genuine resampling can flip a low-frequency gradient bit even
/// when the visual content is identical.
#[inline]
fn dhash_threshold(phash_threshold: u32) -> u32 {
(phash_threshold / 2).max(2)
}
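The entropy band and the asymmetric dHash threshold combine into a single pair-validation rule in the clustering loop. Restated as a standalone predicate for illustration — the constants and helper names here mirror the ones above, but the real code inlines this logic rather than calling an `edge_ok` function:

```rust
const MIN_POP: u32 = 16;
const MAX_POP: u32 = 64 - MIN_POP;

// Popcount must sit in the informative [16, 48] band.
fn informative(h: u64) -> bool {
    (MIN_POP..=MAX_POP).contains(&h.count_ones())
}

// dHash budget is roughly half the pHash budget, floored at 2.
fn dhash_threshold(phash_threshold: u32) -> u32 {
    (phash_threshold / 2).max(2)
}

/// True when (a, b) qualify as a validated near-dup edge:
/// informative pHashes within `threshold`, plus informative dHashes
/// within the stricter asymmetric threshold. A missing dHash on
/// either side rejects the pair.
fn edge_ok(
    phash_a: u64,
    phash_b: u64,
    dhash_a: Option<u64>,
    dhash_b: Option<u64>,
    threshold: u32,
) -> bool {
    if !informative(phash_a) || !informative(phash_b) {
        return false;
    }
    if (phash_a ^ phash_b).count_ones() > threshold {
        return false;
    }
    match (dhash_a, dhash_b) {
        (Some(a), Some(b)) => {
            informative(a)
                && informative(b)
                && (a ^ b).count_ones() <= dhash_threshold(threshold)
        }
        // Can't validate without both dHashes: reject.
        _ => false,
    }
}
```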
/// Single-link cluster the input rows by Hamming distance over their
/// pHash, with `threshold` as the maximum distance for an edge. Rows
/// without a pHash, or with a degenerate (low-entropy) pHash, are
/// excluded — they'd chain together unrelated images.
///
/// Two-signal validation: the BK-tree gives candidate pairs cheaply,
/// then we additionally require dHash agreement before unioning. pHash
/// alone is too permissive; pairing it with dHash collapses the false-
/// positive cluster significantly (a real near-dup stays close on both
/// the DCT and gradient signatures, while a spurious pHash collision
/// on uniform content doesn't survive the dHash check).
///
/// Implementation: BK-tree neighbourhood lookup per row, union-find
/// over the validated edges. O(N log N) instead of the O(N²) naive
/// pairwise scan; on a 1.26M-row library that's the difference between
/// "responds in 1.5 s" and "responds in 25 minutes".
fn cluster_perceptual(rows: Vec<DuplicateRow>, threshold: u32) -> Vec<DuplicateGroup> {
let candidates: Vec<DuplicateRow> = rows
.into_iter()
.filter(|r| r.phash_64.is_some_and(is_informative_hash))
.collect();
if candidates.len() < 2 {
return Vec::new();
}
// Build BK-tree keyed on (phash_u64, index-in-candidates).
let mut tree: BKTree<HashKey, HammingMetric> = BKTree::new(HammingMetric);
for (idx, row) in candidates.iter().enumerate() {
if let Some(p) = row.phash_64 {
tree.add(HashKey {
phash: p as u64,
idx,
});
}
}
// Union-find over edges within `threshold`. For a candidate pair
// surfaced by the pHash BK-tree, require dHash within a *stricter*
// threshold (`dhash_threshold(threshold)`) before unioning. pHash
// agreement on low-entropy structure can be incidental; pHash
// agreement AND dHash within roughly half that distance is a
// strong near-dup signal. dHash on either side missing → reject
// (was: trust pHash alone). Missing dHash means we can't validate
// the candidate, and the false-positive cost outweighs the rare
// case of a partial backfill.
let dhash_max = dhash_threshold(threshold);
let mut uf = UnionFind::new(candidates.len());
for (idx, row) in candidates.iter().enumerate() {
let Some(p) = row.phash_64 else { continue };
let key = HashKey {
phash: p as u64,
idx,
};
for (_, neighbour) in tree.find(&key, threshold) {
if neighbour.idx == idx {
continue;
}
let other = &candidates[neighbour.idx];
let dhash_ok = match (row.dhash_64, other.dhash_64) {
(Some(a), Some(b)) => {
(a as u64 ^ b as u64).count_ones() <= dhash_max
&& is_informative_hash(a)
&& is_informative_hash(b)
}
_ => false,
};
if dhash_ok {
uf.union(idx, neighbour.idx);
}
}
}
// Bucket by root.
let mut by_root: HashMap<usize, Vec<DuplicateRow>> = HashMap::new();
for (idx, row) in candidates.into_iter().enumerate() {
let root = uf.find(idx);
by_root.entry(root).or_default().push(row);
}
// Medoid-validate each cluster to break single-link chains.
// Single-link unions any pair within threshold; that means a chain
// A↔B↔C can collapse into one cluster even when A and C aren't
// similar. The medoid pass picks the cluster's most-central member
// and drops any other whose distance to it exceeds threshold —
// chains lose their tail, dense real-near-dup clusters keep all
// members. Discard clusters that drop below 2 after refinement.
let mut groups: Vec<DuplicateGroup> = by_root
.into_values()
.filter_map(|cluster| refine_cluster(cluster, threshold, dhash_max))
.map(|cluster| {
let representative_hash = cluster[0].content_hash.clone();
DuplicateGroup {
kind: DuplicateKind::Perceptual,
representative_hash,
members: cluster.into_iter().map(DuplicateMember::from).collect(),
}
})
.collect();
groups.sort_by(|a, b| {
b.members
.len()
.cmp(&a.members.len())
.then_with(|| a.representative_hash.cmp(&b.representative_hash))
});
groups
}
/// Tighten a single-link cluster to its medoid neighbourhood. Returns
/// `None` when fewer than 2 members survive — caller drops the cluster.
fn refine_cluster(
cluster: Vec<DuplicateRow>,
phash_max: u32,
dhash_max: u32,
) -> Option<Vec<DuplicateRow>> {
if cluster.len() < 2 {
return None;
}
if cluster.len() == 2 {
// No chain can exist with only two members; the union-find
// already validated both signals when it joined this pair.
return Some(cluster);
}
// Pick the medoid: the member whose summed pHash+dHash distance
// to the rest of the cluster is smallest. Deterministic via the
// first-best-wins tie break (on equal scores, the earliest member
// in the cluster's input order keeps the medoid slot).
let phashes: Vec<u64> = cluster
.iter()
.map(|r| r.phash_64.unwrap_or(0) as u64)
.collect();
let dhashes: Vec<u64> = cluster
.iter()
.map(|r| r.dhash_64.unwrap_or(0) as u64)
.collect();
let mut best_idx = 0usize;
let mut best_score = u32::MAX;
for i in 0..cluster.len() {
let mut score: u32 = 0;
for j in 0..cluster.len() {
if i == j {
continue;
}
score = score.saturating_add((phashes[i] ^ phashes[j]).count_ones());
score = score.saturating_add((dhashes[i] ^ dhashes[j]).count_ones());
}
if score < best_score {
best_score = score;
best_idx = i;
}
}
let medoid_phash = phashes[best_idx];
let medoid_dhash = dhashes[best_idx];
let kept: Vec<DuplicateRow> = cluster
.into_iter()
.enumerate()
.filter(|(i, _)| {
*i == best_idx
|| ((phashes[*i] ^ medoid_phash).count_ones() <= phash_max
&& (dhashes[*i] ^ medoid_dhash).count_ones() <= dhash_max)
})
.map(|(_, r)| r)
.collect();
if kept.len() < 2 { None } else { Some(kept) }
}
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct HashKey {
phash: u64,
idx: usize,
}
struct HammingMetric;
impl Metric<HashKey> for HammingMetric {
fn distance(&self, a: &HashKey, b: &HashKey) -> u32 {
(a.phash ^ b.phash).count_ones()
}
fn threshold_distance(&self, a: &HashKey, b: &HashKey, _: u32) -> Option<u32> {
Some(self.distance(a, b))
}
}
struct UnionFind {
parent: Vec<usize>,
rank: Vec<u8>,
}
impl UnionFind {
fn new(n: usize) -> Self {
Self {
parent: (0..n).collect(),
rank: vec![0; n],
}
}
fn find(&mut self, x: usize) -> usize {
if self.parent[x] != x {
let root = self.find(self.parent[x]);
self.parent[x] = root;
}
self.parent[x]
}
fn union(&mut self, a: usize, b: usize) {
let ra = self.find(a);
let rb = self.find(b);
if ra == rb {
return;
}
if self.rank[ra] < self.rank[rb] {
self.parent[ra] = rb;
} else if self.rank[ra] > self.rank[rb] {
self.parent[rb] = ra;
} else {
self.parent[rb] = ra;
self.rank[ra] += 1;
}
}
}
// ── Routing ──────────────────────────────────────────────────────────────
pub fn add_duplicate_services<T>(app: App<T>) -> App<T>
where
T: ServiceFactory<
actix_web::dev::ServiceRequest,
Config = (),
Error = actix_web::Error,
InitError = (),
>,
{
app.service(web::resource("/duplicates/exact").route(web::get().to(list_exact_handler)))
.service(
web::resource("/duplicates/perceptual").route(web::get().to(list_perceptual_handler)),
)
.service(web::resource("/duplicates/resolve").route(web::post().to(resolve_handler)))
.service(web::resource("/duplicates/unresolve").route(web::post().to(unresolve_handler)))
}
// ── Tests ────────────────────────────────────────────────────────────────
#[cfg(test)]
mod tests {
use super::*;
fn row(library_id: i32, rel: &str, hash: &str, phash: Option<i64>) -> DuplicateRow {
DuplicateRow {
library_id,
rel_path: rel.into(),
content_hash: hash.into(),
size_bytes: Some(1000),
date_taken: None,
width: None,
height: None,
phash_64: phash,
dhash_64: None,
duplicate_of_hash: None,
duplicate_decided_at: None,
}
}
#[test]
fn group_exact_collapses_by_hash() {
let rows = vec![
row(1, "a.jpg", "h1", None),
row(1, "b.jpg", "h1", None),
row(2, "c.jpg", "h1", None),
row(1, "lonely.jpg", "h2", None),
];
let groups = group_exact(rows);
assert_eq!(groups.len(), 1);
assert_eq!(groups[0].representative_hash, "h1");
assert_eq!(groups[0].members.len(), 3);
}
/// All hashes used below have popcount in the "informative"
/// 16..=48 band so they survive the entropy filter that keeps
/// solid-colour images out of the cluster graph.
const INFORMATIVE_BASE: i64 = 0x55AA_55AA_55AA_55AA; // popcount = 32
const INFORMATIVE_NEAR: i64 = 0x55AA_55AA_55AA_55AB; // 1-bit away from BASE
const INFORMATIVE_FAR: i64 = 0x6996_6996_6996_6996; // 32 bits away from BASE
fn row_with_dhash(
library_id: i32,
rel: &str,
hash: &str,
phash: Option<i64>,
dhash: Option<i64>,
) -> DuplicateRow {
DuplicateRow {
library_id,
rel_path: rel.into(),
content_hash: hash.into(),
size_bytes: Some(1000),
date_taken: None,
width: None,
height: None,
phash_64: phash,
dhash_64: dhash,
duplicate_of_hash: None,
duplicate_decided_at: None,
}
}
#[test]
fn cluster_perceptual_unites_close_hashes() {
// Two rows near each other on both pHash and dHash; one far
// on pHash. Threshold 4 should merge the close pair.
let rows = vec![
row_with_dhash(
1,
"a.jpg",
"h1",
Some(INFORMATIVE_BASE),
Some(INFORMATIVE_BASE),
),
row_with_dhash(
1,
"b.jpg",
"h2",
Some(INFORMATIVE_NEAR),
Some(INFORMATIVE_NEAR),
),
row_with_dhash(
1,
"c.jpg",
"h3",
Some(INFORMATIVE_FAR),
Some(INFORMATIVE_FAR),
),
];
let groups = cluster_perceptual(rows, 4);
assert_eq!(groups.len(), 1);
assert_eq!(groups[0].members.len(), 2);
let paths: Vec<&str> = groups[0]
.members
.iter()
.map(|m| m.rel_path.as_str())
.collect();
assert!(paths.contains(&"a.jpg"));
assert!(paths.contains(&"b.jpg"));
}
#[test]
fn cluster_perceptual_threshold_zero_drops_distinct() {
let rows = vec![
row_with_dhash(
1,
"a.jpg",
"h1",
Some(INFORMATIVE_BASE),
Some(INFORMATIVE_BASE),
),
row_with_dhash(
1,
"b.jpg",
"h2",
Some(INFORMATIVE_NEAR),
Some(INFORMATIVE_NEAR),
),
];
let groups = cluster_perceptual(rows, 0);
assert!(groups.is_empty());
}
#[test]
fn cluster_perceptual_skips_singletons() {
let rows = vec![row(1, "alone.jpg", "h1", Some(INFORMATIVE_BASE))];
assert!(cluster_perceptual(rows, 8).is_empty());
}
#[test]
fn cluster_perceptual_filters_low_entropy_hashes() {
// Both 0 (popcount 0) and i64::MAX (popcount 63) fall outside
// the informative band. A pair of these would trivially match
// (Hamming distance to each other small or zero) without the
// entropy filter — that's exactly the regression that was
// producing a giant first cluster of solid-colour images.
let rows = vec![
row(1, "blank-a.jpg", "h1", Some(0)),
row(1, "blank-b.jpg", "h2", Some(0)),
row(1, "white-a.jpg", "h3", Some(i64::MAX)),
row(1, "white-b.jpg", "h4", Some(i64::MAX)),
];
assert!(cluster_perceptual(rows, 8).is_empty());
}
#[test]
fn cluster_perceptual_requires_dhash_agreement() {
// pHash within threshold but dHash far apart — the candidate
// edge from the BK-tree must be rejected. Without the dHash
// double-check this would form a 2-member cluster.
let rows = vec![
row_with_dhash(
1,
"a.jpg",
"h1",
Some(INFORMATIVE_BASE),
Some(INFORMATIVE_BASE),
),
row_with_dhash(
1,
"b.jpg",
"h2",
Some(INFORMATIVE_NEAR),
Some(INFORMATIVE_FAR),
),
];
assert!(cluster_perceptual(rows, 4).is_empty());
}
#[test]
fn cluster_perceptual_breaks_long_chain_at_medoid() {
// 4-member chain at threshold=2 with pairwise distances chosen
// so single-link unions all four but the endpoints sit past
// the medoid's neighbourhood. Each hop flips a fresh pair of
// bits (0x03, then 0x0C, then 0x30; no shared bits), so
// consecutive hops compose into wider distant-pair distances:
// A↔B = 2, B↔C = 2, C↔D = 2,
// A↔C = 4, B↔D = 4, A↔D = 6.
// Medoid (B or C) keeps Δ ≤ 2 of itself; the far endpoint
// gets chopped, leaving exactly 3 members.
const A: i64 = 0x55AA_55AA_55AA_55AA;
const B: i64 = 0x55AA_55AA_55AA_55A9; // ^0x03 last byte
const C: i64 = 0x55AA_55AA_55AA_55A5; // ^0x0C from B
const D: i64 = 0x55AA_55AA_55AA_5595; // ^0x30 from C
let rows = vec![
row_with_dhash(1, "a.jpg", "h1", Some(A), Some(A)),
row_with_dhash(1, "b.jpg", "h2", Some(B), Some(B)),
row_with_dhash(1, "c.jpg", "h3", Some(C), Some(C)),
row_with_dhash(1, "d.jpg", "h4", Some(D), Some(D)),
];
let groups = cluster_perceptual(rows, 2);
assert_eq!(groups.len(), 1);
assert_eq!(
groups[0].members.len(),
3,
"medoid pass should chop one chain endpoint past Δ=2"
);
}
/// Sanity-check the BK-tree's metric, which is what the duplicates
/// path actually clusters on.
#[test]
fn hamming_metric_is_symmetric() {
let m = HammingMetric;
let a = HashKey {
phash: 0b1010,
idx: 0,
};
let b = HashKey {
phash: 0b0101,
idx: 1,
};
let d1 = m.distance(&a, &b);
let d2 = m.distance(&b, &a);
assert_eq!(d1, d2);
assert_eq!(d1, 4);
}
}
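
The entropy band and medoid prune above can be exercised in isolation. The sketch below is illustrative only: `is_informative_hash` and `medoid_prune` are stand-ins with assumed signatures, the [16, 48] popcount band mirrors the commit message, and plain `u64`s replace the crate's `DuplicateRow`.

```rust
// Standalone sketch — names and the [16, 48] band are assumptions
// mirroring the commit message, not the crate's actual API.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

// Entropy band: reject near-uniform hashes (solid colours, blank
// scans) whose popcount sits outside [16, 48].
fn is_informative_hash(h: u64) -> bool {
    (16..=48).contains(&h.count_ones())
}

// Medoid prune: pick the member with the smallest summed distance
// to the rest, then drop anyone farther than `max_dist` from it.
fn medoid_prune(hashes: &[u64], max_dist: u32) -> Vec<u64> {
    let medoid = *hashes
        .iter()
        .min_by_key(|&&h| hashes.iter().map(|&o| hamming(h, o)).sum::<u32>())
        .expect("cluster is non-empty");
    hashes
        .iter()
        .copied()
        .filter(|&h| hamming(h, medoid) <= max_dist)
        .collect()
}

fn main() {
    assert!(!is_informative_hash(0)); // blank frame, popcount 0
    assert!(is_informative_hash(0x55AA_55AA_55AA_55AA)); // popcount 32
    // 2-bit-per-step chain: the endpoints sit 6 apart, so the medoid
    // pass at max_dist = 2 drops one endpoint.
    let chain = [0x55AAu64, 0x55A9, 0x55A5, 0x5595];
    assert_eq!(medoid_prune(&chain, 2).len(), 3);
}
```

The medoid lands on an interior chain member (B or C), so exactly one endpoint falls outside its neighbourhood — the same shape the `cluster_perceptual_breaks_long_chain_at_medoid` test asserts.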

View File

@@ -20,9 +20,10 @@
use crate::Claims;
use crate::ai::face_client::{DetectMeta, FaceClient, FaceDetectError};
use crate::exif;
use crate::database::schema::{face_detections, image_exif, persons};
use crate::error::IntoHttpError;
use crate::exif;
use crate::file_types;
use crate::libraries::{self, Library};
use crate::otel::{extract_context_from_request, global_tracer, trace_db_call};
use crate::state::AppState;
@@ -99,9 +100,30 @@ pub struct FaceDetectionRow {
pub created_at: i64,
}
/// SQL fragment restricting an `image_exif.rel_path` (or `face_detections.rel_path`)
/// column to image extensions. Videos register in `image_exif` with a
/// populated `content_hash` but can never produce a `face_detections` row
/// — applying this filter at query time keeps videos out of the per-tick
/// backlog drain (which would otherwise loop forever — `filter_excluded`
/// drops them client-side without writing a marker) and out of the SCANNED
/// stat denominator (so 100% is reachable).
fn image_path_predicate(col: &str) -> String {
let clauses: Vec<String> = file_types::IMAGE_EXTENSIONS
.iter()
.map(|ext| format!("lower({col}) LIKE '%.{ext}'"))
.collect();
format!("({})", clauses.join(" OR "))
}
/// Row shape for `list_unscanned_candidates`'s raw SQL. Diesel's
/// `sql_query` requires a `QueryableByName` row type with explicit
/// column SQL types; using a tuple isn't supported.
#[derive(diesel::QueryableByName, Debug)]
struct CountRow {
#[diesel(sql_type = diesel::sql_types::BigInt)]
count: i64,
}
#[derive(diesel::QueryableByName, Debug)]
struct UnscannedRow {
#[diesel(sql_type = diesel::sql_types::Text)]
@@ -601,26 +623,32 @@ impl FaceDao for SqliteFaceDao {
// fire multiple detect calls for the same hash if it lives
// under several rel_paths in the same library. The
// anti-join (NOT EXISTS) drains hashes that have no row in
// face_detections at all.
let rows: Vec<(String, String)> = diesel::sql_query(
// face_detections at all. The image-extension predicate
// keeps videos out of the candidate set; without it they'd
// be filtered client-side and re-pulled every tick forever
// because no marker row is written for excluded paths.
let ext_predicate = image_path_predicate("rel_path");
let sql = format!(
"SELECT rel_path, content_hash \
FROM image_exif e \
WHERE library_id = ? \
AND content_hash IS NOT NULL \
AND {ext_predicate} \
AND NOT EXISTS ( \
SELECT 1 FROM face_detections f \
WHERE f.content_hash = e.content_hash \
) \
GROUP BY content_hash \
LIMIT ?",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::BigInt, _>(limit)
.load::<UnscannedRow>(conn.deref_mut())
.with_context(|| "list_unscanned_candidates")?
.into_iter()
.map(|r| (r.rel_path, r.content_hash))
.collect();
LIMIT ?"
);
let rows: Vec<(String, String)> = diesel::sql_query(sql)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::BigInt, _>(limit)
.load::<UnscannedRow>(conn.deref_mut())
.with_context(|| "list_unscanned_candidates")?
.into_iter()
.map(|r| (r.rel_path, r.content_hash))
.collect();
Ok(rows)
})
}
@@ -856,14 +884,18 @@ impl FaceDao for SqliteFaceDao {
// Pair with the base64-encoded embedding string so the handler
// doesn't need to know the wire format. Skip rows with NULL
// embedding (shouldn't happen on detected rows, but defensive).
// `embedding.take()` moves the bytes out of the row so we can
// hand the (now-empty-embedding) row plus the encoded string
// back to the caller without cloning the whole row — at 20k
// rows × 2 KB that clone was 40 MB of pointless heap traffic
// per cluster-suggest run.
use base64::Engine;
Ok(rows
.into_iter()
.filter_map(|r| {
r.embedding.as_ref().map(|bytes| {
let b64 = base64::engine::general_purpose::STANDARD.encode(bytes);
(r.clone(), b64)
})
.filter_map(|mut r| {
let bytes = r.embedding.take()?;
let b64 = base64::engine::general_purpose::STANDARD.encode(&bytes);
Some((r, b64))
})
.collect())
})
@@ -1013,14 +1045,42 @@ impl FaceDao for SqliteFaceDao {
.first(conn.deref_mut())
.with_context(|| "stats: failed")?
};
// Image-extension filter mirrors `list_unscanned_candidates` so
// SCANNED can actually reach 100%: videos sit in `image_exif` but
// never get a `face_detections` row, so counting them here
// permanently caps the percentage below 100%.
//
// Count DISTINCT content_hash (not rows) so the numerator
// (`scanned`, also distinct-content_hash) and denominator live
// in the same domain. Without this, a file present at multiple
// rel_paths or across libraries inflates total_photos by one
// per duplicate row while face_detections — keyed on
// content_hash — counts the bytes once, leaving a permanent
// gap (e.g. 1101/1103 with nothing actually pending). Rows
// with NULL content_hash are excluded; they're held in the
// hash-backfill backlog and counting them would pin the bar
// below 100% for the duration of that backfill.
let total_photos: i64 = {
let mut q = image_exif::table.into_boxed();
if let Some(lib) = library_id {
q = q.filter(image_exif::library_id.eq(lib));
}
q.select(diesel::dsl::count_star())
.first(conn.deref_mut())
.with_context(|| "stats: total_photos")?
let ext_predicate = image_path_predicate("rel_path");
let row: CountRow = if let Some(lib) = library_id {
let sql = format!(
"SELECT COUNT(DISTINCT content_hash) AS count FROM image_exif \
WHERE library_id = ? AND content_hash IS NOT NULL AND {ext_predicate}"
);
diesel::sql_query(sql)
.bind::<diesel::sql_types::Integer, _>(lib)
.get_result(conn.deref_mut())
.with_context(|| "stats: total_photos")?
} else {
let sql = format!(
"SELECT COUNT(DISTINCT content_hash) AS count FROM image_exif \
WHERE content_hash IS NOT NULL AND {ext_predicate}"
);
diesel::sql_query(sql)
.get_result(conn.deref_mut())
.with_context(|| "stats: total_photos")?
};
row.count
};
let persons_count: i64 = persons::table
.select(diesel::dsl::count_star())
@@ -2255,6 +2315,12 @@ async fn update_face_handler<D: FaceDao>(
let mut new_embedding: Option<Vec<u8>> = None;
if let Some((bx, by, bw, bh)) = bbox_patch {
if !face_client.is_enabled() {
warn!(
"PATCH /image/faces/{}: 503 — face client not enabled \
(APOLLO_FACE_API_BASE_URL / APOLLO_API_BASE_URL both unset). \
Bbox edit requires Apollo to re-embed.",
id
);
return HttpResponse::ServiceUnavailable()
.body("face client disabled — bbox edit requires Apollo");
}
@@ -2284,8 +2350,7 @@ async fn update_face_handler<D: FaceDao>(
"PATCH /image/faces/{}: crop failed for {:?}: {:?}",
id, abs_path, e
);
return HttpResponse::BadRequest()
.body(format!("cannot crop new bbox: {}", e));
return HttpResponse::BadRequest().body(format!("cannot crop new bbox: {}", e));
}
};
let meta = DetectMeta {
@@ -2332,11 +2397,20 @@ async fn update_face_handler<D: FaceDao>(
);
}
Err(FaceDetectError::Transient(e)) => {
warn!(
"PATCH /image/faces/{}: 503 — Apollo face client transient \
error during re-embed: {}",
id, e
);
return HttpResponse::ServiceUnavailable().body(format!("{}", e));
}
Err(FaceDetectError::Disabled) => {
return HttpResponse::ServiceUnavailable()
.body("face client disabled mid-flight");
warn!(
"PATCH /image/faces/{}: 503 — face client became disabled \
mid-flight",
id
);
return HttpResponse::ServiceUnavailable().body("face client disabled mid-flight");
}
}
}
@@ -3145,6 +3219,39 @@ mod tests {
assert_eq!(stats.with_faces, 0);
}
#[test]
fn stats_total_photos_excludes_videos() {
// SCANNED counts content_hashes in face_detections; total_photos
// must apply the same image-extension filter as the watcher
// backlog query so the percentage can reach 100%. Without this,
// videos sit in image_exif but never produce a face_detections
// row (Apollo decodes images only) and the bar caps below 100%.
let mut dao = fresh_dao();
diesel::sql_query(
"INSERT OR IGNORE INTO libraries (id, name, root_path, created_at) \
VALUES (1, 'main', '/tmp', 0)",
)
.execute(dao.connection.lock().unwrap().deref_mut())
.expect("seed libraries");
diesel::sql_query(
"INSERT INTO image_exif \
(library_id, rel_path, content_hash, created_time, last_modified) VALUES \
(1, 'a.jpg', 'h-a', 0, 0), \
(1, 'b.JPEG', 'h-b', 0, 0), \
(1, 'movie.mp4', 'h-mp4', 0, 0), \
(1, 'clip.MOV', 'h-mov', 0, 0)",
)
.execute(dao.connection.lock().unwrap().deref_mut())
.expect("seed image_exif");
let stats = dao.stats(&ctx(), Some(1)).expect("stats");
assert_eq!(
stats.total_photos, 2,
"videos should not count toward total"
);
}
#[test]
fn merge_persons_repoints_faces() {
let mut dao = fresh_dao();
@@ -3325,8 +3432,7 @@ mod tests {
)
.unwrap();
let row = seed_library_and_face(&mut dao, Some(p.id));
let joined =
hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate assigned");
let joined = hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate assigned");
assert_eq!(joined.person_id, Some(p.id));
assert_eq!(joined.person_name.as_deref(), Some("Alice"));
// Bbox + confidence + source must round-trip — these are what
@@ -3345,8 +3451,7 @@ mod tests {
// previously-assigned row's serialization.
let mut dao = fresh_dao();
let row = seed_library_and_face(&mut dao, None);
let joined =
hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate unassigned");
let joined = hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate unassigned");
assert!(joined.person_id.is_none());
assert!(joined.person_name.is_none());
}
@@ -3367,7 +3472,12 @@ mod tests {
.execute(dao.connection.lock().unwrap().deref_mut())
.expect("seed libraries");
// Seed image_exif: mix of hashed/unhashed/scanned/cross-library.
// Seed image_exif: mix of hashed/unhashed/scanned/cross-library,
// plus a video and a mixed-case image extension. Videos register
// in image_exif but can never produce a face_detections row, so
// the SQL must filter them out — otherwise the per-tick backlog
// drain re-pulls them every tick (no marker is ever written, so
// they loop forever) and the SCANNED stat is permanently capped.
diesel::sql_query(
"INSERT INTO image_exif \
(library_id, rel_path, content_hash, created_time, last_modified) VALUES \
@@ -3375,6 +3485,9 @@ mod tests {
(1, 'b.jpg', 'h-b', 0, 0), \
(1, 'c.jpg', NULL, 0, 0), \
(1, 'd.jpg', 'h-d', 0, 0), \
(1, 'movie.mp4', 'h-mp4', 0, 0), \
(1, 'clip.MOV', 'h-mov', 0, 0), \
(1, 'photo.JPG', 'h-jpg-upper', 0, 0), \
(2, 'e.jpg', 'h-e', 0, 0)",
)
.execute(dao.connection.lock().unwrap().deref_mut())
@@ -3388,16 +3501,26 @@ mod tests {
.list_unscanned_candidates(&ctx(), 1, 10)
.expect("list unscanned");
let hashes: std::collections::HashSet<_> =
cands.iter().map(|(_, h)| h.clone()).collect();
let hashes: std::collections::HashSet<_> = cands.iter().map(|(_, h)| h.clone()).collect();
// Should contain a and d (hashed, unscanned, library 1).
// Should contain a, d, and the upper-case .JPG (image-extension
// match is case-insensitive).
assert!(hashes.contains("h-a"), "missing h-a: {:?}", hashes);
assert!(hashes.contains("h-d"), "missing h-d: {:?}", hashes);
// Should NOT contain b (scanned), c (no hash), e (other library).
assert!(
hashes.contains("h-jpg-upper"),
"missing h-jpg-upper: {:?}",
hashes
);
// Should NOT contain b (scanned), c (no hash), e (other library),
// or videos (mp4/mov are not image extensions).
assert!(!hashes.contains("h-b"), "expected h-b filtered (scanned)");
assert!(!hashes.contains("h-e"), "expected h-e filtered (other library)");
assert_eq!(cands.len(), 2, "unexpected candidates: {:?}", cands);
assert!(
!hashes.contains("h-e"),
"expected h-e filtered (other library)"
);
assert!(!hashes.contains("h-mp4"), "expected h-mp4 filtered (video)");
assert!(!hashes.contains("h-mov"), "expected h-mov filtered (video)");
assert_eq!(cands.len(), 3, "unexpected candidates: {:?}", cands);
}
}
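
The image-extension predicate the faces queries lean on is a string-built SQL fragment; a minimal standalone sketch of that builder, with a stand-in extension list (the real crate reads `file_types::IMAGE_EXTENSIONS`, whose exact contents are assumed here):

```rust
// Stand-in for file_types::IMAGE_EXTENSIONS — illustrative only.
const IMAGE_EXTENSIONS: &[&str] = &["jpg", "jpeg", "png", "heic"];

// One case-insensitive LIKE clause per extension, OR-ed together and
// parenthesised so the fragment composes with surrounding AND terms.
fn image_path_predicate(col: &str) -> String {
    let clauses: Vec<String> = IMAGE_EXTENSIONS
        .iter()
        .map(|ext| format!("lower({col}) LIKE '%.{ext}'"))
        .collect();
    format!("({})", clauses.join(" OR "))
}

fn main() {
    let p = image_path_predicate("rel_path");
    // `.JPG` matches because the column is lowercased, not the pattern.
    assert!(p.starts_with("(lower(rel_path) LIKE '%.jpg'"));
    assert!(p.contains(" OR "));
    assert!(p.ends_with(')'));
}
```

Lowercasing the column rather than the pattern is what makes the mixed-case `photo.JPG` test case pass without enumerating case variants per extension.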

View File

@@ -110,11 +110,18 @@ fn in_memory_date_sort(
let total_count = files.len() as i64;
let file_paths: Vec<String> = files.iter().map(|f| f.file_name.clone()).collect();
// Batch fetch EXIF data (keyed by rel_path; in union mode a rel_path may
// correspond to rows in multiple libraries — pick the date from the one
// matching the requesting row's library_id when possible).
// Batch fetch EXIF data. When every file in this batch belongs to the
// same library, scope the SQL filter to that library so cross-library
// duplicates with the same rel_path don't get fetched and discarded.
// In genuine union mode (mixed libraries) keep the rel-path-only
// lookup; the caller's `(file_path, library_id)` map below picks the
// right row.
let scope_library = match file_libraries.first() {
Some(&first) if file_libraries.iter().all(|&id| id == first) => Some(first),
_ => None,
};
let exif_rows = exif_dao
.get_exif_batch(span_context, &file_paths)
.get_exif_batch(span_context, scope_library, &file_paths)
.unwrap_or_default();
let exif_map: std::collections::HashMap<(String, i32), i64> = exif_rows
.into_iter()
@@ -309,11 +316,15 @@ pub async fn list_photos<TagD: TagDao, FS: FileSystemAccess>(
None
};
// Query EXIF database
// Query EXIF database. When the request named a library, the EXIF
// filter must be scoped to it — otherwise camera/date/GPS hits
// from other libraries would pollute the result set even though
// downstream filesystem walks would never visit those files.
let mut exif_dao_guard = exif_dao.lock().expect("Unable to get ExifDao");
let exif_results = exif_dao_guard
.query_by_exif(
&span_context,
library.map(|l| l.id),
req.camera_make.as_deref(),
req.camera_model.as_deref(),
req.lens_model.as_deref(),
@@ -572,9 +583,10 @@ pub async fn list_photos<TagD: TagDao, FS: FileSystemAccess>(
} else {
Some(trimmed)
};
let include_duplicates = req.include_duplicates.unwrap_or(false);
let rows = {
let mut dao = exif_dao.lock().expect("Unable to get ExifDao");
dao.list_rel_paths_for_libraries(&span_context, &lib_ids, prefix)
dao.list_rel_paths_for_libraries(&span_context, &lib_ids, prefix, include_duplicates)
.unwrap_or_else(|e| {
warn!("list_rel_paths_for_libraries failed: {:?}", e);
Vec::new()
@@ -1242,15 +1254,19 @@ pub async fn list_exif_summary(
.collect();
let mut exif_dao_guard = exif_dao.lock().expect("Unable to get ExifDao");
match exif_dao_guard.query_by_exif(&cx, None, None, None, None, req.date_from, req.date_to) {
match exif_dao_guard.query_by_exif(
&cx,
library_filter,
None,
None,
None,
None,
req.date_from,
req.date_to,
) {
Ok(rows) => {
let photos: Vec<ExifSummary> = rows
.into_iter()
// Library filter post-query: keeps the DAO trait (and its
// mocks) unchanged. For typical 2-3 library setups the in-
// memory pass over a date-bounded result set is negligible;
// can be pushed into SQL later if it ever isn't.
.filter(|r| library_filter.is_none_or(|id| r.library_id == id))
.map(|r| ExifSummary {
library_name: library_names.get(&r.library_id).cloned(),
file_path: r.file_path,
@@ -1488,6 +1504,10 @@ mod tests {
last_modified: data.last_modified,
content_hash: data.content_hash.clone(),
size_bytes: data.size_bytes,
phash_64: data.phash_64,
dhash_64: data.dhash_64,
duplicate_of_hash: None,
duplicate_decided_at: None,
})
}
@@ -1527,6 +1547,10 @@ mod tests {
last_modified: data.last_modified,
content_hash: data.content_hash.clone(),
size_bytes: data.size_bytes,
phash_64: data.phash_64,
dhash_64: data.dhash_64,
duplicate_of_hash: None,
duplicate_decided_at: None,
})
}
@@ -1549,6 +1573,7 @@ mod tests {
fn get_exif_batch(
&mut self,
_context: &opentelemetry::Context,
_library_id: Option<i32>,
_: &[String],
) -> Result<Vec<crate::database::models::ImageExif>, DbError> {
Ok(Vec::new())
@@ -1557,6 +1582,7 @@ mod tests {
fn query_by_exif(
&mut self,
_context: &opentelemetry::Context,
_library_id: Option<i32>,
_: Option<&str>,
_: Option<&str>,
_: Option<&str>,
@@ -1672,6 +1698,7 @@ mod tests {
_context: &opentelemetry::Context,
_library_ids: &[i32],
_path_prefix: Option<&str>,
_include_duplicates: bool,
) -> Result<Vec<(i32, String)>, DbError> {
Ok(vec![])
}
@@ -1684,6 +1711,100 @@ mod tests {
) -> Result<(), DbError> {
Ok(())
}
fn count_for_library(
&mut self,
_context: &opentelemetry::Context,
_library_id: i32,
) -> Result<i64, DbError> {
Ok(0)
}
fn list_rel_paths_for_library_page(
&mut self,
_context: &opentelemetry::Context,
_library_id: i32,
_limit: i64,
_offset: i64,
) -> Result<Vec<(i32, String)>, DbError> {
Ok(Vec::new())
}
fn get_rows_missing_perceptual_hash(
&mut self,
_context: &opentelemetry::Context,
_limit: i64,
) -> Result<Vec<(i32, String)>, DbError> {
Ok(Vec::new())
}
fn backfill_perceptual_hash(
&mut self,
_context: &opentelemetry::Context,
_library_id: i32,
_rel_path: &str,
_phash_64: Option<i64>,
_dhash_64: Option<i64>,
) -> Result<(), DbError> {
Ok(())
}
fn list_duplicates_exact(
&mut self,
_context: &opentelemetry::Context,
_library_id: Option<i32>,
_include_resolved: bool,
) -> Result<Vec<crate::database::DuplicateRow>, DbError> {
Ok(Vec::new())
}
fn list_perceptual_candidates(
&mut self,
_context: &opentelemetry::Context,
_library_id: Option<i32>,
_include_resolved: bool,
) -> Result<Vec<crate::database::DuplicateRow>, DbError> {
Ok(Vec::new())
}
fn lookup_duplicate_row(
&mut self,
_context: &opentelemetry::Context,
_library_id: i32,
_rel_path: &str,
) -> Result<Option<crate::database::DuplicateRow>, DbError> {
Ok(None)
}
fn set_duplicate_of(
&mut self,
_context: &opentelemetry::Context,
_library_id: i32,
_rel_path: &str,
_survivor_hash: &str,
_decided_at: i64,
) -> Result<(), DbError> {
Ok(())
}
fn clear_duplicate_of(
&mut self,
_context: &opentelemetry::Context,
_library_id: i32,
_rel_path: &str,
) -> Result<(), DbError> {
Ok(())
}
fn union_perceptual_tags(
&mut self,
_context: &opentelemetry::Context,
_survivor_hash: &str,
_demoted_hash: &str,
_survivor_rel_path: &str,
) -> Result<(), DbError> {
Ok(())
}
}
mod api {

View File

@@ -10,6 +10,7 @@ pub mod cleanup;
pub mod content_hash;
pub mod data;
pub mod database;
pub mod duplicates;
pub mod error;
pub mod exif;
pub mod face_watch;
@@ -19,9 +20,11 @@ pub mod file_types;
pub mod files;
pub mod geo;
pub mod libraries;
pub mod library_maintenance;
pub mod memories;
pub mod otel;
pub mod parsers;
pub mod perceptual_hash;
pub mod service;
pub mod state;
pub mod tags;
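
The libraries diff that follows parses a comma-separated `excluded_dirs` column and unions it with the global env-var excludes; the two helpers can be exercised standalone (mirroring the diff's logic, trimmed of the `Library` struct):

```rust
// Mirrors parse_excluded_dirs_column from the diff: NULL → empty,
// entries trimmed, empties dropped.
fn parse_excluded_dirs_column(raw: Option<&str>) -> Vec<String> {
    match raw {
        None => Vec::new(),
        Some(s) => s
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect(),
    }
}

// Union of global and per-library excludes; order is irrelevant to
// the walker and repeats are tolerated, so plain concatenation works.
fn effective_excluded_dirs(per_library: &[String], globals: &[String]) -> Vec<String> {
    let mut combined = globals.to_vec();
    combined.extend_from_slice(per_library);
    combined
}

fn main() {
    let per_lib = parse_excluded_dirs_column(Some(" .thumbs, ,cache "));
    assert_eq!(per_lib, vec![".thumbs".to_string(), "cache".to_string()]);
    assert!(parse_excluded_dirs_column(None).is_empty());
    let all = effective_excluded_dirs(&per_lib, &["@eaDir".to_string()]);
    assert_eq!(all.len(), 3);
}
```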

View File

@@ -3,7 +3,9 @@ use chrono::Utc;
use diesel::prelude::*;
use diesel::sqlite::SqliteConnection;
use log::{info, warn};
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock};
use crate::data::Claims;
use crate::database::models::{InsertLibrary, LibraryRow};
@@ -26,6 +28,19 @@ pub struct Library {
pub id: i32,
pub name: String,
pub root_path: String,
/// Operator kill switch (mirrors `libraries.enabled`). When `false`
/// the watcher skips this library entirely — before the probe,
/// before ingest, before maintenance. Reads / serving still work
/// (a request whose path resolves to a disabled library's root
/// will succeed if the file is on disk; nothing prevents that
/// today and there's no obvious reason to). Toggle via SQL.
pub enabled: bool,
/// Per-library excluded paths/patterns, parsed from the
/// comma-separated DB column. The walker applies these
/// **in union** with the global `EXCLUDED_DIRS` env var; either
/// list matching a path is enough to exclude. Empty = no
/// library-specific excludes (only the global env var applies).
pub excluded_dirs: Vec<String>,
}
impl Library {
@@ -47,6 +62,36 @@ impl Library {
.ok()
.map(|p| p.to_string_lossy().replace('\\', "/"))
}
/// Effective excluded directories for a walk of this library:
/// the union of the global env-var excludes (passed in by the
/// caller as `globals`) and this library's per-row excludes.
/// Order doesn't matter; `PathExcluder` accepts repeats.
pub fn effective_excluded_dirs(&self, globals: &[String]) -> Vec<String> {
if self.excluded_dirs.is_empty() {
return globals.to_vec();
}
let mut combined: Vec<String> =
Vec::with_capacity(globals.len() + self.excluded_dirs.len());
combined.extend_from_slice(globals);
combined.extend(self.excluded_dirs.iter().cloned());
combined
}
}
/// Parse a comma-separated excluded_dirs column into a Vec, dropping
/// empty entries (mirrors `AppState::parse_excluded_dirs` for the env
/// var). NULL → empty Vec.
pub fn parse_excluded_dirs_column(raw: Option<&str>) -> Vec<String> {
match raw {
None => Vec::new(),
Some(s) => s
.split(',')
.map(str::trim)
.filter(|s| !s.is_empty())
.map(String::from)
.collect(),
}
}
impl From<LibraryRow> for Library {
@@ -55,6 +100,8 @@ impl From<LibraryRow> for Library {
id: row.id,
name: row.name,
root_path: row.root_path,
enabled: row.enabled,
excluded_dirs: parse_excluded_dirs_column(row.excluded_dirs.as_deref()),
}
}
}
@@ -109,6 +156,8 @@ pub fn seed_or_patch_from_env(conn: &mut SqliteConnection, base_path: &str) {
name: "main",
root_path: base_path,
created_at: now,
enabled: true,
excluded_dirs: None,
})
.execute(conn);
match result {
@@ -146,16 +195,165 @@ pub fn resolve_library_param<'a>(
.ok_or_else(|| format!("unknown library name: {}", raw))
}
/// Health of a library at a point in time. Probed at the top of each
/// file-watcher tick. The `Stale` state is the "be conservative" signal:
/// destructive paths (ingest writes, future move-handoff and orphan GC in
/// branches B/C) skip a stale library, but reads/serving stay unaffected.
///
/// See `CLAUDE.md` → "Library availability and safety" for the policy.
#[derive(Clone, Debug, serde::Serialize, PartialEq, Eq)]
#[serde(tag = "state", rename_all = "snake_case")]
pub enum LibraryHealth {
Online,
Stale {
reason: String,
/// Unix timestamp (seconds) of the most recent transition into
/// Stale. Held for telemetry / `/libraries` surfacing only —
/// gating logic doesn't read it.
since: i64,
},
}
impl LibraryHealth {
pub fn is_online(&self) -> bool {
matches!(self, LibraryHealth::Online)
}
}
/// Shared snapshot of every configured library's health, keyed by
/// `library_id`. The watcher writes; HTTP handlers read. RwLock because
/// reads vastly outnumber writes (one tick vs. every status request).
pub type LibraryHealthMap = Arc<RwLock<HashMap<i32, LibraryHealth>>>;
/// Construct an initial health map. Libraries start `Online`; the first
/// probe will downgrade any that fail. Starting `Stale` would block ingest
/// for the watcher's first tick on a healthy mount, which is the wrong
/// default for a server that's just been restarted.
pub fn new_health_map(libs: &[Library]) -> LibraryHealthMap {
let mut m = HashMap::with_capacity(libs.len());
for lib in libs {
m.insert(lib.id, LibraryHealth::Online);
}
Arc::new(RwLock::new(m))
}
/// Probe a library's mount point. Cheap: stat + open dir + peek one entry.
///
/// `had_data` is the caller's prior knowledge that this library has been
/// non-empty before — typically `image_exif` row count > 0. When true, an
/// empty directory is suspicious (it's how an unmounted NFS share looks);
/// when false, it's accepted as a fresh mount that simply hasn't been
/// indexed yet.
///
/// Note: stat / read_dir on a hard-mounted, unreachable NFS share can
/// block. The watcher accepts that risk for now — the worst case is that
/// the tick stalls until the mount returns, which is no more destructive
/// than the pre-probe behavior. A future enhancement can wrap this in a
/// thread + timeout if it becomes an operational issue.
pub fn probe_online(lib: &Library, had_data: bool) -> LibraryHealth {
let now = Utc::now().timestamp();
let path = Path::new(&lib.root_path);
let metadata = match std::fs::metadata(path) {
Ok(m) => m,
Err(e) => {
return LibraryHealth::Stale {
reason: format!("root_path stat failed: {}", e),
since: now,
};
}
};
if !metadata.is_dir() {
return LibraryHealth::Stale {
reason: format!("root_path is not a directory: {}", lib.root_path),
since: now,
};
}
let mut entries = match std::fs::read_dir(path) {
Ok(it) => it,
Err(e) => {
return LibraryHealth::Stale {
reason: format!("read_dir failed: {}", e),
since: now,
};
}
};
// Empty directory only counts as Stale when we have prior evidence
// this library used to have content. A genuinely fresh mount is
// legitimately empty, and degrading it would block first-time ingest.
if had_data && entries.next().is_none() {
return LibraryHealth::Stale {
reason: "library is empty but image_exif has rows for it".to_string(),
since: now,
};
}
LibraryHealth::Online
}
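The probe's doc comment above notes that a hard-mounted, unreachable NFS share can block `stat` indefinitely, and floats a thread + timeout as a future fix. A minimal std-only sketch of that pattern — `stat_with_timeout` is a hypothetical helper, not part of this change:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run the potentially-blocking stat() on a worker thread. A timeout
// maps to None, which a caller could treat as Stale instead of letting
// the watcher tick stall until the mount answers.
fn stat_with_timeout(
    path: std::path::PathBuf,
    timeout: Duration,
) -> Option<std::io::Result<std::fs::Metadata>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already timed out and dropped, this send
        // fails silently; the detached thread just exits whenever the
        // mount finally responds.
        let _ = tx.send(std::fs::metadata(&path));
    });
    rx.recv_timeout(timeout).ok()
}
```

The detached thread leaks until the blocked syscall returns, which is the usual cost of this pattern — acceptable for a once-per-tick probe.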
/// Probe `lib`, update `map`, and return the new state. Logs only on a
/// state transition (Online↔Stale) so a long outage doesn't spam at every
/// tick — operators get one warn on the way down and one info on the way
/// up.
pub fn refresh_health(map: &LibraryHealthMap, lib: &Library, had_data: bool) -> LibraryHealth {
let new_state = probe_online(lib, had_data);
let mut guard = map.write().unwrap_or_else(|e| e.into_inner());
let prev = guard.get(&lib.id).cloned();
let transitioned = matches!(
(&prev, &new_state),
(None, LibraryHealth::Stale { .. })
| (Some(LibraryHealth::Online), LibraryHealth::Stale { .. })
| (Some(LibraryHealth::Stale { .. }), LibraryHealth::Online)
);
if transitioned {
match &new_state {
LibraryHealth::Online => info!(
"Library '{}' (id={}) recovered: {} is online",
lib.name, lib.id, lib.root_path
),
LibraryHealth::Stale { reason, .. } => warn!(
"Library '{}' (id={}) is STALE — pausing writes. Reason: {}. Path: {}",
lib.name, lib.id, reason, lib.root_path
),
}
}
guard.insert(lib.id, new_state.clone());
new_state
}
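The `matches!` in `refresh_health` encodes a small transition table. Distilled over plain booleans (a hypothetical helper for illustration, not in the codebase), where `prev` is `None` on the very first probe:

```rust
// Mirrors the transition check in refresh_health: a missing entry that
// probes Stale still counts as a transition so the first outage gets a
// warn, while the seeded None -> Online case stays silent.
fn transitioned(prev: Option<bool>, new_online: bool) -> bool {
    match (prev, new_online) {
        (None, false) => true,       // first probe already Stale: warn
        (Some(true), false) => true, // Online -> Stale: warn
        (Some(false), true) => true, // Stale -> Online: info
        _ => false,                  // steady state, incl. None -> Online
    }
}
```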
/// Snapshot of one library + its current health, for `/libraries`.
#[derive(serde::Serialize)]
pub struct LibraryStatus {
#[serde(flatten)]
pub library: Library,
pub health: LibraryHealth,
}
#[derive(serde::Serialize)]
pub struct LibrariesResponse {
pub libraries: Vec<LibraryStatus>,
}
#[get("/libraries")]
pub async fn list_libraries(_claims: Claims, app_state: Data<AppState>) -> impl Responder {
let health_guard = app_state
.library_health
.read()
.unwrap_or_else(|e| e.into_inner());
let libraries = app_state
.libraries
.iter()
.map(|lib| LibraryStatus {
library: lib.clone(),
health: health_guard
.get(&lib.id)
.cloned()
.unwrap_or(LibraryHealth::Online),
})
.collect();
HttpResponse::Ok().json(LibrariesResponse { libraries })
}
#[cfg(test)]
@@ -192,6 +390,8 @@ mod tests {
id: 1,
name: "main".into(),
root_path: "/tmp/media".into(),
enabled: true,
excluded_dirs: Vec::new(),
};
let rel = lib.strip_root(Path::new("/tmp/media/2024/photo.jpg"));
assert_eq!(rel.as_deref(), Some("2024/photo.jpg"));
@@ -205,6 +405,8 @@ mod tests {
id: 1,
name: "main".into(),
root_path: "/tmp/media".into(),
enabled: true,
excluded_dirs: Vec::new(),
};
let abs = lib.resolve("2024/photo.jpg");
assert_eq!(abs, PathBuf::from("/tmp/media/2024/photo.jpg"));
@@ -222,11 +424,15 @@ mod tests {
id: 1,
name: "main".into(),
root_path: "/tmp/main".into(),
enabled: true,
excluded_dirs: Vec::new(),
},
Library {
id: 7,
name: "archive".into(),
root_path: "/tmp/archive".into(),
enabled: true,
excluded_dirs: Vec::new(),
},
]
}
@@ -279,4 +485,138 @@ mod tests {
let err = resolve_library_param(&state, Some("missing")).unwrap_err();
assert!(err.contains("unknown library name"));
}
#[test]
fn parse_excluded_dirs_column_handles_null_and_whitespace() {
assert_eq!(parse_excluded_dirs_column(None), Vec::<String>::new());
assert_eq!(parse_excluded_dirs_column(Some("")), Vec::<String>::new());
assert_eq!(
parse_excluded_dirs_column(Some(" /a , /b/sub , @eaDir ,, ")),
vec!["/a".to_string(), "/b/sub".to_string(), "@eaDir".to_string()]
);
}
#[test]
fn effective_excluded_dirs_unions_global_and_per_library() {
let lib_no_extras = Library {
id: 1,
name: "main".into(),
root_path: "/x".into(),
enabled: true,
excluded_dirs: Vec::new(),
};
let globals = vec!["@eaDir".to_string(), ".thumbnails".to_string()];
// Empty per-library excludes → exactly the globals.
assert_eq!(lib_no_extras.effective_excluded_dirs(&globals), globals);
let lib_with_extras = Library {
id: 2,
name: "archive".into(),
root_path: "/y".into(),
enabled: true,
excluded_dirs: vec!["/photos".to_string()],
};
let combined = lib_with_extras.effective_excluded_dirs(&globals);
assert!(combined.contains(&"@eaDir".to_string()));
assert!(combined.contains(&".thumbnails".to_string()));
assert!(combined.contains(&"/photos".to_string()));
assert_eq!(combined.len(), 3);
}
fn probe_lib(id: i32, root: String) -> Library {
Library {
id,
name: "main".into(),
root_path: root,
enabled: true,
excluded_dirs: Vec::new(),
}
}
#[test]
fn probe_online_for_existing_non_empty_dir() {
let tmp = tempfile::tempdir().unwrap();
std::fs::write(tmp.path().join("photo.jpg"), b"hello").unwrap();
let lib = probe_lib(1, tmp.path().to_string_lossy().into());
// had_data doesn't matter when the dir has entries.
assert!(probe_online(&lib, true).is_online());
assert!(probe_online(&lib, false).is_online());
}
#[test]
fn probe_stale_when_root_missing() {
let lib = probe_lib(1, "/nonexistent/definitely/not/here".into());
assert!(matches!(
probe_online(&lib, false),
LibraryHealth::Stale { .. }
));
}
#[test]
fn probe_stale_when_root_is_a_file() {
let tmp = tempfile::tempdir().unwrap();
let file = tmp.path().join("not-a-dir");
std::fs::write(&file, b"x").unwrap();
let lib = probe_lib(1, file.to_string_lossy().into());
assert!(matches!(
probe_online(&lib, false),
LibraryHealth::Stale { .. }
));
}
#[test]
fn probe_empty_dir_is_online_when_no_prior_data() {
// Fresh mount: empty directory, no rows in image_exif. Accept it.
let tmp = tempfile::tempdir().unwrap();
let lib = probe_lib(1, tmp.path().to_string_lossy().into());
assert!(probe_online(&lib, false).is_online());
}
#[test]
fn probe_empty_dir_is_stale_when_prior_data_existed() {
// The "share went offline" signal: directory exists but is empty,
// and we know the library used to have content. Treat as Stale.
let tmp = tempfile::tempdir().unwrap();
let lib = probe_lib(1, tmp.path().to_string_lossy().into());
match probe_online(&lib, true) {
LibraryHealth::Stale { reason, .. } => {
assert!(reason.contains("empty"), "unexpected reason: {}", reason)
}
other => panic!("expected Stale, got {:?}", other),
}
}
#[test]
fn refresh_health_logs_only_on_transition() {
// Smoke test: refresh_health updates the map and reports correctly.
// (We can't easily assert on logs without a custom logger; the
// important thing is that the state churns properly.)
let tmp = tempfile::tempdir().unwrap();
let lib = Library {
id: 42,
name: "test".into(),
root_path: tmp.path().to_string_lossy().into(),
enabled: true,
excluded_dirs: Vec::new(),
};
let map = new_health_map(&[lib.clone()]);
// First probe: empty dir, no prior data — Online.
let s1 = refresh_health(&map, &lib, false);
assert!(s1.is_online());
// Probe again with had_data=true on the same empty dir — Stale.
let s2 = refresh_health(&map, &lib, true);
assert!(matches!(s2, LibraryHealth::Stale { .. }));
assert_eq!(
map.read().unwrap().get(&lib.id).cloned(),
Some(s2.clone()),
"map should reflect the latest probe"
);
// Recovery: drop a file and probe again.
std::fs::write(tmp.path().join("photo.jpg"), b"x").unwrap();
let s3 = refresh_health(&map, &lib, true);
assert!(s3.is_online());
}
}

src/library_maintenance.rs Normal file

@@ -0,0 +1,828 @@
//! Filesystem-backed maintenance of `image_exif`, the back-ref columns
//! on hash-keyed tables, and orphan derived data.
//!
//! These passes are the operational implementation of the library
//! handoff and orphan rules from CLAUDE.md → "Multi-library data
//! model" / "Library availability and safety":
//!
//! 1. **Missing-file detection** — when a file disappears from disk
//! but its `image_exif` row remains, the row is removed. Naturally
//! implements the move case: when a user moves a file from lib-A
//! to lib-B, the watcher's normal ingest creates the lib-B row;
//! this pass eventually retires the lib-A row.
//!
//! 2. **Back-ref refresh** — hash-keyed rows (`face_detections` and,
//! after Branch B, `tagged_photo` / `photo_insights`) carry a
//! denormalized `(library_id, rel_path)` back-ref. After a move,
//! that back-ref may point at a deleted row. The refresh pass
//! finds rows whose `(library_id, rel_path)` no longer matches
//! any `image_exif` row but whose `content_hash` does, and updates
//! the back-ref to one of the surviving paths. Idempotent.
//!
//! 3. **Orphan GC** — when a `content_hash` no longer has any
//! `image_exif` row referencing it, hash-keyed derived rows for
//! that hash become eligible for deletion. To survive transient
//! unmounts, the pass uses a **two-tick consensus rule**: a hash
//! must be observed orphaned for two consecutive ticks AND every
//! library must be online for both observations. The "marked but
//! not yet deleted" state is held in memory; restarting the
//! watcher resets it (which is fine — deletion is merely deferred
//! until two fresh consecutive observations accumulate).
//!
//! Pass 1 is filesystem-dependent and gated on the per-library
//! availability probe. Passes 2 and 3 are database-only but pass 3
//! additionally requires every library to be online for the
//! consensus window.
use std::collections::HashSet;
use std::path::Path;
use std::sync::{Arc, Mutex};
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sqlite::SqliteConnection;
use log::{debug, info, warn};
use crate::database::ExifDao;
use crate::libraries::{Library, LibraryHealthMap};
/// Cap on missing-file deletions per library per tick. Prevents a
/// pathological mount that returns "not found" for everything (e.g.
/// case-sensitivity flip on a network share that the probe didn't
/// catch) from wiping the entire image_exif table in one tick. Tunable
/// via `IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK`.
pub const DEFAULT_MISSING_DELETE_CAP: usize = 200;
/// Page size for the missing-file scan. We stat() every row in this
/// batch but only delete those that are confirmed-not-found (subject
/// to the delete cap above). Tunable via
/// `IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE`.
pub const DEFAULT_SCAN_PAGE_SIZE: i64 = 500;
/// Scan a page of `image_exif` rows for `library`, stat() each one,
/// and delete rows whose source file is gone. Returns
/// `(deleted, next_offset)`. `next_offset` wraps to 0 when the page
/// returned fewer rows than the page size, so the watcher cycles
/// through the whole library across ticks.
///
/// Caller must already have confirmed the library is online — running
/// against a Stale library would interpret every row as missing.
pub fn detect_missing_files_for_library(
context: &opentelemetry::Context,
library: &Library,
exif_dao: &Arc<Mutex<Box<dyn ExifDao>>>,
offset: i64,
page_size: i64,
delete_cap: usize,
) -> (usize, i64) {
let rows = {
let mut dao = exif_dao.lock().expect("exif_dao poisoned");
match dao.list_rel_paths_for_library_page(context, library.id, page_size, offset) {
Ok(r) => r,
Err(e) => {
warn!(
"missing-file scan: list page failed for library '{}' (offset={}): {:?}",
library.name, offset, e
);
return (0, offset);
}
}
};
let n_returned = rows.len();
// Wrap offset when we hit the end of the table — next tick starts
// a fresh sweep. Doing it here rather than on the next call keeps
// the offset accounting visible in one place.
let next_offset = if (n_returned as i64) < page_size {
0
} else {
offset + page_size
};
if rows.is_empty() {
return (0, next_offset);
}
let root = Path::new(&library.root_path);
let mut to_delete: Vec<String> = Vec::new();
for (_id, rel_path) in &rows {
if to_delete.len() >= delete_cap {
break;
}
let abs = root.join(rel_path);
match std::fs::metadata(&abs) {
Ok(_) => {
// File still exists — nothing to do.
}
Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
to_delete.push(rel_path.clone());
}
Err(e) => {
// Permission denied / IO error / etc. — skip this row,
// leave it for the next sweep. We never want a transient
// FS hiccup to mass-delete metadata.
debug!(
"missing-file scan: stat() error for {:?}, skipping: {:?}",
abs, e
);
}
}
}
if to_delete.is_empty() {
return (0, next_offset);
}
let mut deleted = 0;
{
let mut dao = exif_dao.lock().expect("exif_dao poisoned");
for rel_path in &to_delete {
match dao.delete_exif_by_library(context, library.id, rel_path) {
Ok(()) => deleted += 1,
Err(e) => warn!(
"missing-file scan: delete failed for ({}, {}): {:?}",
library.id, rel_path, e
),
}
}
}
if deleted > 0 {
info!(
"missing-file scan: removed {} stale image_exif row(s) from library '{}'",
deleted, library.name
);
}
(deleted, next_offset)
}
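The wrap-to-zero offset accounting above is easy to isolate. A standalone sketch (`next_scan_offset` is a hypothetical helper mirroring the logic in `detect_missing_files_for_library`):

```rust
// A short page means we hit the end of the table, so the next tick
// restarts the sweep at offset 0; otherwise advance by one page.
fn next_scan_offset(offset: i64, page_size: i64, n_returned: usize) -> i64 {
    if (n_returned as i64) < page_size {
        0
    } else {
        offset + page_size
    }
}
```

Note that a full final page costs one extra empty query before wrapping, which is the simple trade the in-place accounting makes.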
/// Refresh the `(library_id, rel_path)` back-refs on hash-keyed
/// tables. A back-ref is stale when:
/// - its `content_hash` is non-null,
/// - that hash is referenced by at least one `image_exif` row, but
/// - the row's own `(library_id, rel_path)` does not appear in
/// `image_exif`.
///
/// In that case, point the back-ref at any surviving image_exif row
/// for the same hash. `face_detections` and `photo_insights` carry
/// full `(library_id, rel_path)` back-refs; `tagged_photo` historically
/// carries only `rel_path` — we still keep it in sync here for
/// consistency, picking any surviving rel_path.
///
/// All-SQL, idempotent. Returns the number of rows updated.
pub fn refresh_back_refs(conn: &mut SqliteConnection) -> usize {
let mut total = 0usize;
// face_detections — back-ref is (library_id, rel_path). Repoint to
// any surviving image_exif row carrying the same content_hash.
let updated = sql_query(
"UPDATE face_detections \
SET library_id = ( \
SELECT ie.library_id FROM image_exif ie \
WHERE ie.content_hash = face_detections.content_hash \
ORDER BY ie.id LIMIT 1 \
), \
rel_path = ( \
SELECT ie.rel_path FROM image_exif ie \
WHERE ie.content_hash = face_detections.content_hash \
ORDER BY ie.id LIMIT 1 \
) \
WHERE EXISTS ( \
SELECT 1 FROM image_exif ie \
WHERE ie.content_hash = face_detections.content_hash \
) \
AND NOT EXISTS ( \
SELECT 1 FROM image_exif ie \
WHERE ie.library_id = face_detections.library_id \
AND ie.rel_path = face_detections.rel_path \
)",
)
.execute(conn)
.unwrap_or_else(|e| {
warn!("back-ref refresh: face_detections update failed: {:?}", e);
0
});
total += updated;
// tagged_photo — only rel_path. Update to any surviving rel_path
// for the same content_hash so the path-only DAO read still finds
// tags after a move.
let updated = sql_query(
"UPDATE tagged_photo \
SET rel_path = ( \
SELECT ie.rel_path FROM image_exif ie \
WHERE ie.content_hash = tagged_photo.content_hash \
ORDER BY ie.id LIMIT 1 \
) \
WHERE content_hash IS NOT NULL \
AND EXISTS ( \
SELECT 1 FROM image_exif ie \
WHERE ie.content_hash = tagged_photo.content_hash \
) \
AND NOT EXISTS ( \
SELECT 1 FROM image_exif ie \
WHERE ie.rel_path = tagged_photo.rel_path \
)",
)
.execute(conn)
.unwrap_or_else(|e| {
warn!("back-ref refresh: tagged_photo update failed: {:?}", e);
0
});
total += updated;
// photo_insights — has both library_id and rel_path. Update both
// when the (library_id, rel_path) tuple no longer matches any
// image_exif row but the hash does.
let updated = sql_query(
"UPDATE photo_insights \
SET library_id = ( \
SELECT ie.library_id FROM image_exif ie \
WHERE ie.content_hash = photo_insights.content_hash \
ORDER BY ie.id LIMIT 1 \
), \
rel_path = ( \
SELECT ie.rel_path FROM image_exif ie \
WHERE ie.content_hash = photo_insights.content_hash \
ORDER BY ie.id LIMIT 1 \
) \
WHERE content_hash IS NOT NULL \
AND EXISTS ( \
SELECT 1 FROM image_exif ie \
WHERE ie.content_hash = photo_insights.content_hash \
) \
AND NOT EXISTS ( \
SELECT 1 FROM image_exif ie \
WHERE ie.library_id = photo_insights.library_id \
AND ie.rel_path = photo_insights.rel_path \
)",
)
.execute(conn)
.unwrap_or_else(|e| {
warn!("back-ref refresh: photo_insights update failed: {:?}", e);
0
});
total += updated;
if total > 0 {
info!("back-ref refresh: updated {} hash-keyed row(s)", total);
}
total
}
/// One tick's outcome of the orphan-GC pass.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct GcStats {
/// Hashes newly observed orphaned this tick (added to the
/// pending set).
pub newly_marked: usize,
/// Hashes that were marked last tick AND are still orphaned this
/// tick AND every library is online — these are deleted.
pub deleted_face_detections: usize,
pub deleted_tagged_photo: usize,
pub deleted_photo_insights: usize,
/// Hashes dropped from the pending set because they re-appeared
/// in image_exif (e.g. user remounted a backup that was briefly
/// missing).
pub revived: usize,
}
impl GcStats {
pub fn changed(&self) -> bool {
self.newly_marked > 0
|| self.deleted_face_detections > 0
|| self.deleted_tagged_photo > 0
|| self.deleted_photo_insights > 0
|| self.revived > 0
}
pub fn total_deleted(&self) -> usize {
self.deleted_face_detections + self.deleted_tagged_photo + self.deleted_photo_insights
}
}
/// Two-tick orphan-GC state. The watcher constructs one of these once
/// at startup and passes it back into `run_orphan_gc` every tick.
#[derive(Debug, Default)]
pub struct OrphanGcState {
/// Hashes observed orphaned on the previous tick. A hash gets
/// promoted to "delete" when it survives a second consecutive
/// observation with all libraries online.
pending: HashSet<String>,
/// Whether every library was online on the previous tick. Combined
/// with the all-online check on the current tick, this gives the
/// "two consecutive ticks of full availability" guard described in
/// CLAUDE.md → "Library availability and safety".
prev_tick_all_online: bool,
}
/// Run one tick of the orphan GC. The function is responsible for the
/// full lifecycle: probing for orphans, updating `state.pending`,
/// performing deletes when consensus is reached, and returning stats
/// for the watcher to log.
///
/// Safety guard: `all_online` MUST reflect every configured library
/// being Online right now. Even if true, deletes only happen when the
/// previous tick was also all-online. A single Stale tick within the
/// window cancels any pending deletes (they stay marked but won't be
/// promoted) — they're then re-evaluated next tick.
pub fn run_orphan_gc(
conn: &mut SqliteConnection,
state: &mut OrphanGcState,
all_online: bool,
) -> GcStats {
let mut stats = GcStats::default();
// Find every distinct content_hash referenced by hash-keyed
// derived data that is NOT currently referenced by image_exif.
// These are this tick's orphan candidates. The query is cheap:
// three index scans plus a HashSet sized to the derived tables'
// row count, which is small.
let orphans: HashSet<String> = match collect_orphan_hashes(conn) {
Ok(set) => set,
Err(e) => {
warn!("orphan-gc: candidate query failed: {:?}", e);
return stats;
}
};
// Drop entries from pending that are no longer orphaned
// ("revived"). Common case: a network share that briefly went
// stale comes back, image_exif gets re-populated by ingest, and
// the hash is no longer orphaned.
let revived = state
.pending
.difference(&orphans)
.cloned()
.collect::<Vec<_>>();
if !revived.is_empty() {
for h in &revived {
state.pending.remove(h);
}
stats.revived = revived.len();
}
if !all_online {
// Any Stale library cancels both the consensus window AND
// any pending deletes. We *do* still note newly observed
// orphans below — that's harmless bookkeeping. But we never
// delete this tick.
for h in &orphans {
if state.pending.insert(h.clone()) {
stats.newly_marked += 1;
}
}
state.prev_tick_all_online = false;
if stats.changed() {
info!(
"orphan-gc: {} new orphan hash(es) marked, {} revived (deferred — at least one library Stale; pending: {})",
stats.newly_marked,
stats.revived,
state.pending.len()
);
} else {
debug!(
"orphan-gc: stale library, no changes (pending: {})",
state.pending.len()
);
}
return stats;
}
// All-online + previous-tick-also-all-online: hashes that are
// both pending AND still orphaned this tick are confirmed and
// get deleted. Hashes orphaned this tick but not pending get
// freshly marked.
let consensus_window_open = state.prev_tick_all_online;
let to_delete: Vec<String> = if consensus_window_open {
orphans
.iter()
.filter(|h| state.pending.contains(*h))
.cloned()
.collect()
} else {
Vec::new()
};
for h in &orphans {
if !state.pending.contains(h) {
state.pending.insert(h.clone());
stats.newly_marked += 1;
}
}
if !to_delete.is_empty() {
match delete_hash_keyed_rows(conn, &to_delete) {
Ok((faces, tags, insights)) => {
stats.deleted_face_detections = faces;
stats.deleted_tagged_photo = tags;
stats.deleted_photo_insights = insights;
// Drop deleted hashes from pending so we don't try to
// re-delete them next tick (they'll have already been
// removed from the orphan set).
for h in &to_delete {
state.pending.remove(h);
}
}
Err(e) => warn!("orphan-gc: delete batch failed: {:?}", e),
}
}
state.prev_tick_all_online = true;
if stats.changed() {
info!(
"orphan-gc: {} new orphan hash(es) marked, {} revived; deleted {} face_detections / {} tagged_photo / {} photo_insights row(s) (pending: {})",
stats.newly_marked,
stats.revived,
stats.deleted_face_detections,
stats.deleted_tagged_photo,
stats.deleted_photo_insights,
state.pending.len(),
);
} else {
debug!(
"orphan-gc: no changes this tick (pending: {})",
state.pending.len()
);
}
stats
}
/// Helper for the watcher: are *all enabled* libraries currently Online?
///
/// Disabled libraries are out-of-scope for the orphan-GC consensus
/// rule — they don't get probed, don't have a health entry, and a
/// system with one disabled library should still be able to GC
/// orphans for the remaining online libraries. Treating disabled as
/// "blocking" would mean flipping a library to `enabled=false` would
/// permanently halt GC, which is the opposite of the intended kill-
/// switch semantics ("turn this library off and let the rest of the
/// system run normally").
pub fn all_libraries_online(libs: &[Library], health: &LibraryHealthMap) -> bool {
let guard = health.read().unwrap_or_else(|e| e.into_inner());
libs.iter()
.filter(|lib| lib.enabled)
.all(|lib| guard.get(&lib.id).map(|h| h.is_online()).unwrap_or(false))
}
#[derive(QueryableByName, Debug)]
struct HashRow {
#[diesel(sql_type = diesel::sql_types::Text)]
content_hash: String,
}
fn collect_orphan_hashes(conn: &mut SqliteConnection) -> QueryResult<HashSet<String>> {
// Union of every distinct content_hash carried by hash-keyed
// derived tables, minus those still referenced by image_exif.
let rows = sql_query(
"SELECT DISTINCT content_hash FROM ( \
SELECT content_hash FROM face_detections WHERE content_hash IS NOT NULL \
UNION ALL \
SELECT content_hash FROM tagged_photo WHERE content_hash IS NOT NULL \
UNION ALL \
SELECT content_hash FROM photo_insights WHERE content_hash IS NOT NULL \
) AS derived \
WHERE content_hash NOT IN ( \
SELECT content_hash FROM image_exif WHERE content_hash IS NOT NULL \
)",
)
.get_results::<HashRow>(conn)?;
Ok(rows.into_iter().map(|r| r.content_hash).collect())
}
/// Delete every hash-keyed row whose `content_hash` is in `hashes`.
/// Returns `(faces, tagged_photo, photo_insights)`.
fn delete_hash_keyed_rows(
conn: &mut SqliteConnection,
hashes: &[String],
) -> QueryResult<(usize, usize, usize)> {
if hashes.is_empty() {
return Ok((0, 0, 0));
}
use crate::database::schema::{face_detections, photo_insights, tagged_photo};
let faces =
diesel::delete(face_detections::table.filter(face_detections::content_hash.eq_any(hashes)))
.execute(conn)?;
let tags =
diesel::delete(tagged_photo::table.filter(tagged_photo::content_hash.eq_any(hashes)))
.execute(conn)?;
let insights =
diesel::delete(photo_insights::table.filter(photo_insights::content_hash.eq_any(hashes)))
.execute(conn)?;
Ok((faces, tags, insights))
}
#[cfg(test)]
mod tests {
use super::*;
use crate::database::test::in_memory_db_connection;
fn ensure_library(conn: &mut SqliteConnection, library_id: i32) {
diesel::sql_query(
"INSERT OR IGNORE INTO libraries (id, name, root_path, created_at) \
VALUES (?, 'test-' || ?, '/tmp/test-' || ?, 0)",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Integer, _>(library_id)
.execute(conn)
.unwrap();
}
fn insert_image_exif(
conn: &mut SqliteConnection,
library_id: i32,
rel_path: &str,
content_hash: Option<&str>,
) {
ensure_library(conn, library_id);
diesel::sql_query(
"INSERT INTO image_exif (library_id, rel_path, created_time, last_modified, content_hash) \
VALUES (?, ?, 0, 0, ?)",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Text, _>(rel_path)
.bind::<diesel::sql_types::Nullable<diesel::sql_types::Text>, _>(content_hash)
.execute(conn)
.unwrap();
}
fn insert_face(conn: &mut SqliteConnection, library_id: i32, rel_path: &str, hash: &str) {
ensure_library(conn, library_id);
diesel::sql_query(
"INSERT INTO face_detections (library_id, content_hash, rel_path, source, status, model_version, created_at) \
VALUES (?, ?, ?, 'auto', 'no_faces', 'v', 0)",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Text, _>(hash)
.bind::<diesel::sql_types::Text, _>(rel_path)
.execute(conn)
.unwrap();
}
fn insert_tag_with_hash(conn: &mut SqliteConnection, rel_path: &str, hash: &str) {
diesel::sql_query("INSERT OR IGNORE INTO tags (id, name, created_time) VALUES (1, 't', 0)")
.execute(conn)
.unwrap();
diesel::sql_query(
"INSERT INTO tagged_photo (rel_path, tag_id, created_time, content_hash) VALUES (?, 1, 0, ?)",
)
.bind::<diesel::sql_types::Text, _>(rel_path)
.bind::<diesel::sql_types::Text, _>(hash)
.execute(conn)
.unwrap();
}
fn insert_insight_with_hash(
conn: &mut SqliteConnection,
library_id: i32,
rel_path: &str,
hash: &str,
) {
ensure_library(conn, library_id);
diesel::sql_query(
"INSERT INTO photo_insights (library_id, rel_path, title, summary, generated_at, model_version, is_current, backend, content_hash) \
VALUES (?, ?, 't', 's', 0, 'v', 1, 'local', ?)",
)
.bind::<diesel::sql_types::Integer, _>(library_id)
.bind::<diesel::sql_types::Text, _>(rel_path)
.bind::<diesel::sql_types::Text, _>(hash)
.execute(conn)
.unwrap();
}
#[derive(QueryableByName, Debug)]
struct CountRow {
#[diesel(sql_type = diesel::sql_types::BigInt)]
n: i64,
}
fn count(conn: &mut SqliteConnection, sql: &str) -> i64 {
diesel::sql_query(sql)
.get_result::<CountRow>(conn)
.unwrap()
.n
}
#[test]
fn refresh_back_refs_repoints_face_detection_after_move() {
let mut conn = in_memory_db_connection();
// Original location lib 1, rel "old.jpg". image_exif row gone
// (file moved); only the new lib 2 row remains.
insert_image_exif(&mut conn, 2, "new.jpg", Some("h1"));
insert_face(&mut conn, 1, "old.jpg", "h1");
let updated = refresh_back_refs(&mut conn);
assert_eq!(updated, 1);
let row = diesel::sql_query("SELECT library_id AS n FROM face_detections")
.get_result::<CountRow>(&mut conn)
.unwrap();
assert_eq!(row.n, 2, "library_id should now point at lib 2");
}
#[test]
fn refresh_back_refs_no_change_when_back_ref_still_valid() {
let mut conn = in_memory_db_connection();
insert_image_exif(&mut conn, 1, "a.jpg", Some("h1"));
insert_face(&mut conn, 1, "a.jpg", "h1");
let updated = refresh_back_refs(&mut conn);
assert_eq!(updated, 0);
}
#[test]
fn refresh_back_refs_no_change_when_hash_fully_orphaned() {
// Hash exists on face_detections but no surviving image_exif
// row for it → the refresh is a no-op (orphan GC handles
// these). Important: the SET subquery would return NULL and
// we'd null out the back-ref otherwise; the EXISTS guard
// protects against that.
let mut conn = in_memory_db_connection();
insert_face(&mut conn, 1, "gone.jpg", "h1");
let updated = refresh_back_refs(&mut conn);
assert_eq!(updated, 0);
}
#[test]
fn orphan_gc_requires_two_consecutive_all_online_ticks() {
let mut conn = in_memory_db_connection();
// Hash present in face_detections but NOT image_exif → orphan.
insert_face(&mut conn, 1, "x.jpg", "h-orphan");
let mut state = OrphanGcState::default();
// Tick 1: prev_tick_all_online is false (default), so even
// with current tick all-online we mark only.
let stats = run_orphan_gc(&mut conn, &mut state, true);
assert_eq!(stats.newly_marked, 1);
assert_eq!(stats.total_deleted(), 0);
assert_eq!(state.pending.len(), 1);
// Tick 2: prev_tick_all_online is now true, current tick still
// all-online → consensus reached, hash gets deleted.
let stats = run_orphan_gc(&mut conn, &mut state, true);
assert_eq!(stats.deleted_face_detections, 1);
assert!(state.pending.is_empty());
// Tick 3: nothing left.
let stats = run_orphan_gc(&mut conn, &mut state, true);
assert_eq!(stats.total_deleted(), 0);
assert_eq!(stats.newly_marked, 0);
}
#[test]
fn orphan_gc_resets_consensus_on_stale_library() {
let mut conn = in_memory_db_connection();
insert_face(&mut conn, 1, "x.jpg", "h-orphan");
let mut state = OrphanGcState::default();
// Tick 1: all-online, mark.
run_orphan_gc(&mut conn, &mut state, true);
// Tick 2: stale library — consensus window resets, no delete.
let stats = run_orphan_gc(&mut conn, &mut state, false);
assert_eq!(stats.total_deleted(), 0);
assert!(!state.prev_tick_all_online);
// Tick 3: all-online again — but we need ANOTHER tick to set
// prev_tick_all_online before deletes can fire. So tick 3
// marks (no-op on existing pending), tick 4 deletes.
let stats = run_orphan_gc(&mut conn, &mut state, true);
assert_eq!(stats.total_deleted(), 0);
let stats = run_orphan_gc(&mut conn, &mut state, true);
assert_eq!(stats.deleted_face_detections, 1);
}
#[test]
fn orphan_gc_revives_when_image_exif_reappears() {
let mut conn = in_memory_db_connection();
insert_face(&mut conn, 1, "x.jpg", "h-orphan");
let mut state = OrphanGcState::default();
// Tick 1: mark.
run_orphan_gc(&mut conn, &mut state, true);
assert!(state.pending.contains("h-orphan"));
// Between ticks, the image_exif row reappears (e.g. backup
// share was briefly stale). Hash is no longer orphaned.
insert_image_exif(&mut conn, 2, "x.jpg", Some("h-orphan"));
let stats = run_orphan_gc(&mut conn, &mut state, true);
assert_eq!(stats.revived, 1);
assert_eq!(stats.total_deleted(), 0);
assert!(state.pending.is_empty());
}
#[test]
fn orphan_gc_deletes_across_all_three_tables() {
let mut conn = in_memory_db_connection();
// Same orphan hash appears in all three derived tables.
insert_face(&mut conn, 1, "a.jpg", "h-orphan");
insert_tag_with_hash(&mut conn, "a.jpg", "h-orphan");
insert_insight_with_hash(&mut conn, 1, "a.jpg", "h-orphan");
let mut state = OrphanGcState::default();
run_orphan_gc(&mut conn, &mut state, true);
let stats = run_orphan_gc(&mut conn, &mut state, true);
assert_eq!(stats.deleted_face_detections, 1);
assert_eq!(stats.deleted_tagged_photo, 1);
assert_eq!(stats.deleted_photo_insights, 1);
assert_eq!(
count(&mut conn, "SELECT COUNT(*) AS n FROM face_detections"),
0
);
assert_eq!(
count(&mut conn, "SELECT COUNT(*) AS n FROM tagged_photo"),
0
);
assert_eq!(
count(&mut conn, "SELECT COUNT(*) AS n FROM photo_insights"),
0
);
}
#[test]
fn all_libraries_online_helper() {
use crate::libraries::{LibraryHealth, new_health_map};
let libs = vec![
Library {
id: 1,
name: "a".into(),
root_path: "/x".into(),
enabled: true,
excluded_dirs: Vec::new(),
},
Library {
id: 2,
name: "b".into(),
root_path: "/y".into(),
enabled: true,
excluded_dirs: Vec::new(),
},
];
let health = new_health_map(&libs);
assert!(all_libraries_online(&libs, &health));
// Flip lib 2 to stale.
{
let mut g = health.write().unwrap();
g.insert(
2,
LibraryHealth::Stale {
reason: "test".into(),
since: 0,
},
);
}
assert!(!all_libraries_online(&libs, &health));
}
#[test]
fn all_libraries_online_treats_disabled_as_out_of_scope() {
use crate::libraries::{LibraryHealth, new_health_map};
// lib 1 enabled+online, lib 2 disabled (would be treated as
// Online in the health map's optimistic seed but the map
// entry is irrelevant — disabled libs are filtered out
// before the health lookup).
let libs = vec![
Library {
id: 1,
name: "a".into(),
root_path: "/x".into(),
enabled: true,
excluded_dirs: Vec::new(),
},
Library {
id: 2,
name: "b".into(),
root_path: "/y".into(),
enabled: false,
excluded_dirs: Vec::new(),
},
];
let health = new_health_map(&libs);
// Sanity: forcibly mark lib 2 stale to prove disabled wins
// over even an explicit Stale entry — the filter skips it
// before the health check happens.
{
let mut g = health.write().unwrap();
g.insert(
2,
LibraryHealth::Stale {
reason: "intentionally stale".into(),
since: 0,
},
);
}
assert!(
all_libraries_online(&libs, &health),
"disabled library should not block consensus"
);
}
}

View File

@@ -64,6 +64,7 @@ mod auth;
mod content_hash;
mod data;
mod database;
mod duplicates;
mod error;
mod exif;
mod face_watch;
@@ -72,6 +73,8 @@ mod file_types;
mod files;
mod geo;
mod libraries;
mod library_maintenance;
mod perceptual_hash;
mod state;
mod tags;
mod utils;
@@ -150,7 +153,12 @@ async fn get_image(
let relative_path_str = relative_path.to_string_lossy().replace('\\', "/");
let thumbs = &app_state.thumbnail_path;
let legacy_thumb_path = Path::new(&thumbs).join(relative_path);
let bare_legacy_thumb_path = Path::new(&thumbs).join(relative_path);
let scoped_legacy_thumb_path = content_hash::library_scoped_legacy_path(
Path::new(&thumbs),
library.id,
relative_path,
);
// Gif thumbnails are a separate lookup (video GIF previews).
// Dual-lookup for gif is out of scope; preserve existing flow.
@@ -168,8 +176,16 @@ async fn get_image(
}
}
// Resolve the hash-keyed thumbnail (if the row already has a
// content_hash) and fall back to the legacy mirrored path.
// Lookup chain (most-specific first, falling back as we miss):
// 1. hash-keyed (`<thumbs>/<hash[..2]>/<hash>.jpg`) — content
// identity, shared across libraries;
// 2. library-scoped legacy (`<thumbs>/<lib_id>/<rel_path>`) —
// written by current generation when hash isn't known;
// 3. bare legacy (`<thumbs>/<rel_path>`) — pre-multi-library
// thumbs from the days before library prefixing existed.
// Stage (3) goes away once a one-time migration lifts every
// bare-legacy file under a library prefix; until then it
// prevents needless 404s for already-warmed deployments.
let hash_thumb_path: Option<PathBuf> = {
let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
match dao.get_exif(&context, &relative_path_str) {
@@ -184,7 +200,14 @@ async fn get_image(
.as_ref()
.filter(|p| p.exists())
.cloned()
.unwrap_or_else(|| legacy_thumb_path.clone());
.or_else(|| {
if scoped_legacy_thumb_path.exists() {
Some(scoped_legacy_thumb_path.clone())
} else {
None
}
})
.unwrap_or_else(|| bare_legacy_thumb_path.clone());
// Handle circular thumbnail request
if req.shape == Some(ThumbnailShape::Circle) {
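The three-stage lookup chain described in the comment above can be sketched as a pure fallback over candidate paths. A minimal sketch — the names here are illustrative, not the handler's actual bindings:

```rust
use std::path::PathBuf;

// Resolve a thumbnail path: hash-keyed if present on disk, else the
// library-scoped legacy path if present, else the bare legacy path
// (used unconditionally as the final fallback, existing or not).
fn resolve_thumb(
    hash_keyed: Option<PathBuf>,
    scoped_legacy: PathBuf,
    bare_legacy: PathBuf,
) -> PathBuf {
    hash_keyed
        .filter(|p| p.exists())
        .or_else(|| {
            if scoped_legacy.exists() {
                Some(scoped_legacy.clone())
            } else {
                None
            }
        })
        .unwrap_or(bare_legacy)
}
```

With no candidate on disk, the chain bottoms out at the bare-legacy path, matching the `unwrap_or_else` in the diff.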
@@ -509,6 +532,11 @@ async fn set_image_gps(
.ok()
.map(|c| c.content_hash),
size_bytes: content_hash::compute(&full_path).ok().map(|c| c.size_bytes),
// The GPS-update path doesn't recompute perceptual hashes for an
// existing row — update_exif ignores these columns. Compute them
// best-effort anyway so a newly-inserted row lands with a usable
// signal; failure just leaves prior values in place.
phash_64: perceptual_hash::compute(&full_path).map(|h| h.phash_64),
dhash_64: perceptual_hash::compute(&full_path).map(|h| h.dhash_64),
};
let updated = {
@@ -631,6 +659,37 @@ async fn upload_image(
&full_path.to_str().unwrap().to_string(),
true,
) {
// Pre-write content-hash check: if these exact bytes already
// exist anywhere in any library (and aren't themselves
// soft-marked as duplicates), don't write the file. Return
// 409 with the canonical sibling so the mobile app can show
// a friendly "already in your library" toast.
let upload_hash = blake3::Hasher::new()
.update(&file_content)
.finalize()
.to_hex()
.to_string();
{
let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
if let Ok(Some(existing)) = dao.find_by_content_hash(&span_context, &upload_hash)
&& existing.duplicate_of_hash.is_none()
{
let library_name = libraries::load_all(&mut crate::database::connect())
.into_iter()
.find(|l| l.id == existing.library_id)
.map(|l| l.name);
span.set_status(Status::Ok);
return HttpResponse::Conflict().json(serde_json::json!({
"duplicate_of": {
"library_id": existing.library_id,
"rel_path": existing.file_path,
},
"content_hash": upload_hash,
"library_name": library_name,
}));
}
}
let context =
opentelemetry::Context::new().with_remote_span_context(span.span_context().clone());
tracer
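The pre-write dedup decision above reduces to "hash the bytes, look up a surviving row, conflict if found". A dependency-free sketch — `DefaultHasher` stands in for blake3 purely to avoid the external crate, and the index shape is assumed (the real check goes through `find_by_content_hash` on the DAO):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn content_key(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

enum UploadDecision {
    Write,
    Conflict { existing_rel_path: String },
}

// Index value: (rel_path, is_demoted). Only a surviving row — one not
// itself soft-marked as a duplicate — blocks the upload with a conflict.
fn check_upload(index: &HashMap<u64, (String, bool)>, bytes: &[u8]) -> UploadDecision {
    match index.get(&content_key(bytes)) {
        Some((rel_path, is_demoted)) if !*is_demoted => UploadDecision::Conflict {
            existing_rel_path: rel_path.clone(),
        },
        _ => UploadDecision::Write,
    }
}
```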
@@ -689,6 +748,7 @@ async fn upload_image(
(None, None)
}
};
let perceptual = perceptual_hash::compute(&uploaded_path);
let insert_exif = InsertImageExif {
library_id: target_library.id,
file_path: relative_path.clone(),
@@ -710,6 +770,8 @@ async fn upload_image(
last_modified: timestamp,
content_hash,
size_bytes,
phash_64: perceptual.map(|h| h.phash_64),
dhash_64: perceptual.map(|h| h.dhash_64),
};
if let Ok(mut dao) = exif_dao.lock() {
@@ -761,6 +823,15 @@ async fn generate_video(
if let Some(name) = filename.file_name() {
let filename = name.to_str().expect("Filename should convert to string");
// KNOWN ISSUE (multi-library): playlist filename is the basename
// alone, so two source files with the same basename — whether in
// different libraries or different subdirs of one library —
// overwrite each other's playlists while ffmpeg runs. The
// hash-keyed `content_hash::hls_dir` is the long-term answer
// (see CLAUDE.md "Multi-library data model"); rewiring the
// actor pipeline to use it is out of scope for this branch.
// The orphan-cleanup job above already walks every library so
// it doesn't false-delete archive playlists.
let playlist = format!("{}/{}.m3u8", app_state.video_path, filename);
let library = libraries::resolve_library_param(&app_state, body.library.as_deref())
@@ -1305,19 +1376,41 @@ fn create_thumbnails(libs: &[libraries::Library], excluded_dirs: &[String]) {
lib.name, lib.root_path
);
let images = PathBuf::from(&lib.root_path);
// Effective excludes = global env-var excludes plus the library
// row's excluded_dirs. Lets a parent-library mount skip the
// subtree already covered by a child library.
let effective_excludes = lib.effective_excluded_dirs(excluded_dirs);
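`effective_excluded_dirs` itself isn't shown in this diff; per the comment it is presumably the union of the global excludes and the library row's own. A sketch under that assumption (order-preserving, deduped):

```rust
// Hypothetical shape of the helper called above: global env-var
// excludes first, then any per-library excludes not already present.
fn effective_excluded_dirs(global: &[String], per_library: &[String]) -> Vec<String> {
    let mut out: Vec<String> = global.to_vec();
    for d in per_library {
        if !out.contains(d) {
            out.push(d.clone());
        }
    }
    out
}
```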
// Prune EXCLUDED_DIRS so we don't generate thumbnails-of-thumbnails
// for Synology @eaDir trees. file_scan handles filter_entry pruning.
image_api::file_scan::walk_library_files(&images, excluded_dirs)
image_api::file_scan::walk_library_files(&images, &effective_excludes)
.into_par_iter()
.for_each(|entry| {
let src = entry.path();
let Ok(relative_path) = src.strip_prefix(&images) else {
return;
};
let thumb_path = Path::new(thumbnail_directory).join(relative_path);
// Library-scoped legacy path: prevents two libraries with
// the same rel_path from clobbering each other's thumbs.
// Hash-keyed promotion happens lazily on first hash-aware
// request — keeping this loop ExifDao-free preserves the
// current "cargo build && go" startup story.
let thumb_path = content_hash::library_scoped_legacy_path(
thumbnail_directory,
lib.id,
relative_path,
);
let bare_legacy = thumbnail_directory.join(relative_path);
if thumb_path.exists() || unsupported_thumbnail_sentinel(&thumb_path).exists() {
// Backwards-compat check: if a single-library install has a
// bare-legacy thumb here already, accept it as present.
// Same for the sentinel. Means we don't redo work after
// upgrade and we don't leave stale duplicates around.
if thumb_path.exists()
|| bare_legacy.exists()
|| unsupported_thumbnail_sentinel(&thumb_path).exists()
|| unsupported_thumbnail_sentinel(&bare_legacy).exists()
{
return;
}
@@ -1365,7 +1458,8 @@ fn create_thumbnails(libs: &[libraries::Library], excluded_dirs: &[String]) {
debug!("Finished making thumbnails");
for lib in libs {
update_media_counts(Path::new(&lib.root_path), excluded_dirs);
let effective_excludes = lib.effective_excluded_dirs(excluded_dirs);
update_media_counts(Path::new(&lib.root_path), &effective_excludes);
}
}
@@ -1462,10 +1556,18 @@ fn main() -> std::io::Result<()> {
preview_gen_for_watcher,
app_state.face_client.clone(),
app_state.excluded_dirs.clone(),
app_state.library_health.clone(),
);
// Start orphaned playlist cleanup job
cleanup_orphaned_playlists(app_state.excluded_dirs.clone());
// Start orphaned playlist cleanup job. Multi-library aware: walks
// every configured library when looking for the source video, and
// skips the whole cycle while any library is stale (a missing
// source is indistinguishable from a transiently-unmounted share).
cleanup_orphaned_playlists(
app_state.libraries.clone(),
app_state.excluded_dirs.clone(),
app_state.library_health.clone(),
);
// Spawn background job to generate daily conversation summaries
{
@@ -1600,6 +1702,7 @@ fn main() -> std::io::Result<()> {
.add_feature(add_tag_services::<_, SqliteTagDao>)
.add_feature(knowledge::add_knowledge_services::<_, SqliteKnowledgeDao>)
.add_feature(faces::add_face_services::<_, faces::SqliteFaceDao>)
.add_feature(duplicates::add_duplicate_services)
.app_data(app_data.clone())
.app_data::<Data<RealFileSystem>>(Data::new(RealFileSystem::new(
app_data.base_path.clone(),
@@ -1657,10 +1760,13 @@ fn run_migrations(
}
/// Clean up orphaned HLS playlists and segments whose source videos no longer exist
fn cleanup_orphaned_playlists(excluded_dirs: Vec<String>) {
fn cleanup_orphaned_playlists(
libs: Vec<libraries::Library>,
excluded_dirs: Vec<String>,
library_health: libraries::LibraryHealthMap,
) {
std::thread::spawn(move || {
let video_path = dotenv::var("VIDEO_PATH").expect("VIDEO_PATH must be set");
let base_path = dotenv::var("BASE_PATH").expect("BASE_PATH must be set");
// Get cleanup interval from environment (default: 24 hours)
let cleanup_interval_secs = dotenv::var("PLAYLIST_CLEANUP_INTERVAL_SECONDS")
@@ -1671,10 +1777,39 @@ fn cleanup_orphaned_playlists(excluded_dirs: Vec<String>) {
info!("Starting orphaned playlist cleanup job");
info!(" Cleanup interval: {} seconds", cleanup_interval_secs);
info!(" Playlist directory: {}", video_path);
for lib in &libs {
info!(
" Checking sources under '{}' at {}",
lib.name, lib.root_path
);
}
loop {
std::thread::sleep(Duration::from_secs(cleanup_interval_secs));
// Safety gate: skip the cleanup cycle if any library is
// stale. A missing source video on a stale library is
// indistinguishable from a transient unmount, and the
// cleanup is destructive — we'd rather leak a few playlist
// files for a tick than delete one whose source is briefly
// unreachable. The cycle re-runs on the next interval.
{
let guard = library_health.read().unwrap_or_else(|e| e.into_inner());
let stale: Vec<String> = libs
.iter()
.filter(|lib| guard.get(&lib.id).map(|h| !h.is_online()).unwrap_or(false))
.map(|lib| lib.name.clone())
.collect();
if !stale.is_empty() {
warn!(
"Skipping orphaned-playlist cleanup: {} library(ies) stale: [{}]",
stale.len(),
stale.join(", ")
);
continue;
}
}
info!("Running orphaned playlist cleanup");
let start = std::time::Instant::now();
let mut deleted_count = 0;
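The safety gate above reduces to a filter over the health snapshot: collect the names of not-online libraries and skip the cycle if any exist. A minimal sketch with assumed shapes (the real code reads a `LibraryHealthMap` guard rather than a closure):

```rust
// Returns the names of libraries that are not online this tick; a
// non-empty result means the destructive cleanup cycle is skipped.
fn stale_libraries(libs: &[(i32, String)], online: impl Fn(i32) -> bool) -> Vec<String> {
    libs.iter()
        .filter(|(id, _)| !online(*id))
        .map(|(_, name)| name.clone())
        .collect()
}
```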
@@ -1703,20 +1838,26 @@ fn cleanup_orphaned_playlists(excluded_dirs: Vec<String>) {
if let Some(filename) = playlist_path.file_stem() {
let video_filename = filename.to_string_lossy();
// Search for this video file in BASE_PATH, respecting
// EXCLUDED_DIRS so we don't false-resurrect playlists for
// videos that only exist inside an excluded subtree.
// Search for this video file across every configured
// library, respecting EXCLUDED_DIRS so we don't
// false-resurrect playlists for videos that only
// exist inside an excluded subtree. As soon as one
// library has a matching source, we're done — the
// playlist isn't orphaned.
let mut video_exists = false;
for entry in image_api::file_scan::walk_library_files(
Path::new(&base_path),
&excluded_dirs,
) {
if let Some(entry_stem) = entry.path().file_stem()
&& entry_stem == filename
&& is_video_file(entry.path())
{
video_exists = true;
break;
'libs: for lib in &libs {
let effective = lib.effective_excluded_dirs(&excluded_dirs);
for entry in image_api::file_scan::walk_library_files(
Path::new(&lib.root_path),
&effective,
) {
if let Some(entry_stem) = entry.path().file_stem()
&& entry_stem == filename
&& is_video_file(entry.path())
{
video_exists = true;
break 'libs;
}
}
}
@@ -1792,6 +1933,7 @@ fn watch_files(
preview_generator: Addr<video::actors::PreviewClipGenerator>,
face_client: crate::ai::face_client::FaceClient,
excluded_dirs: Vec<String>,
library_health: libraries::LibraryHealthMap,
) {
std::thread::spawn(move || {
// Get polling intervals from environment variables
@@ -1850,6 +1992,52 @@ fn watch_files(
let mut last_full_scan = SystemTime::now();
let mut scan_count = 0u64;
// Per-library cursor for the missing-file scan. Each tick reads
// a page from `offset`, stat()s the rows, deletes confirmed-
// missing ones, and advances or wraps the cursor. State held
// in-memory so a watcher restart resumes from 0 — fine, the
// sweep is idempotent.
let mut missing_file_offsets: std::collections::HashMap<i32, i64> =
std::collections::HashMap::new();
let missing_scan_page_size: i64 = dotenv::var("IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE")
.ok()
.and_then(|s| s.parse().ok())
.filter(|n: &i64| *n > 0)
.unwrap_or(library_maintenance::DEFAULT_SCAN_PAGE_SIZE);
let missing_delete_cap: usize = dotenv::var("IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK")
.ok()
.and_then(|s| s.parse().ok())
.filter(|n: &usize| *n > 0)
.unwrap_or(library_maintenance::DEFAULT_MISSING_DELETE_CAP);
// Two-tick orphan-GC consensus state. Carried across ticks via
// `OrphanGcState`; see library_maintenance::run_orphan_gc.
let mut orphan_gc_state = library_maintenance::OrphanGcState::default();
// Initial availability sweep before the loop's first sleep so
// /libraries reports the truth from the very first request,
// rather than the optimistic Online default that
// new_health_map seeds. Without this, an unmounted share would
// appear online for up to WATCH_QUICK_INTERVAL_SECONDS (default
// 60s) after boot. Same probe logic as the per-tick gate
// below; no ingest runs here, just the health update + log.
// Disabled libraries skip the probe entirely — they should
// never enter the health map (treated as out-of-scope).
for lib in &libs {
if !lib.enabled {
continue;
}
let context = opentelemetry::Context::new();
let had_data = exif_dao
.lock()
.expect("exif_dao poisoned")
.count_for_library(&context, lib.id)
.map(|n| n > 0)
.unwrap_or(false);
libraries::refresh_health(&library_health, lib, had_data);
}
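The probe semantics described in the comments — reachable, a directory, readable, and non-empty when the DB already has rows for the library — can be sketched as below. This is a hypothetical stand-in for `libraries::refresh_health`'s check, not its actual implementation:

```rust
use std::path::Path;

// `read_dir` failing covers missing, not-a-directory, and unreadable
// in one shot. A library the DB knows has rows must also be non-empty
// on disk; an empty-but-had-data mount point usually means the share
// isn't actually mounted.
fn probe_online(root: &Path, had_data: bool) -> bool {
    let Ok(mut entries) = std::fs::read_dir(root) else {
        return false;
    };
    !had_data || entries.next().is_some()
}
```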
loop {
std::thread::sleep(Duration::from_secs(quick_interval_secs));
@@ -1861,6 +2049,44 @@ fn watch_files(
let is_full_scan = since_last_full.as_secs() >= full_interval_secs;
for lib in &libs {
// Operator kill switch: a disabled library is invisible
// to the watcher entirely. No probe, no ingest, no
// maintenance, no health entry. Distinct from Stale —
// Stale is "we wanted to but couldn't"; Disabled is
// "we don't want to". Toggle via SQL.
if !lib.enabled {
debug!(
"watcher: skipping library '{}' (id={}) — enabled=false",
lib.name, lib.id
);
continue;
}
// Availability probe: every tick checks that the
// library's mount is reachable, is a directory, is
// readable, and (if image_exif has rows for it) is
// non-empty. A Stale library skips ingest, backlog
// drains, and metric refresh — reads/serving in HTTP
// handlers continue to work. Branches B/C extend the
// probe gate to cover handoff and orphan GC. See
// CLAUDE.md "Library availability and safety".
let had_data = {
let context = opentelemetry::Context::new();
let mut guard = exif_dao.lock().expect("exif_dao poisoned");
guard
.count_for_library(&context, lib.id)
.map(|n| n > 0)
.unwrap_or(false)
};
let health = libraries::refresh_health(&library_health, lib, had_data);
if !health.is_online() {
// Skip every write path for this library this tick.
// Don't refresh the media-count gauge either — a
// probe-failed library would otherwise flap to 0
// image / 0 video and pollute Prometheus.
continue;
}
// Drain the unhashed-hash backlog AND the face-detection
// backlog every tick, regardless of quick/full. Quick
// scans only walk recently-modified files, so the
@@ -1868,6 +2094,11 @@ fn watch_files(
// — without these standalone passes, backfill +
// detection only progressed during full scans
// (default once an hour).
// Effective excludes for this library: global env-var excludes
// plus the library row's excluded_dirs. Compute once per tick —
// used by every walker below for this library.
let effective_excludes = lib.effective_excluded_dirs(&excluded_dirs);
if face_client.is_enabled() {
let context = opentelemetry::Context::new();
backfill_unhashed_backlog(&context, lib, &exif_dao);
@@ -1877,7 +2108,7 @@ fn watch_files(
&face_client,
&face_dao,
&watcher_tag_dao,
&excluded_dirs,
&effective_excludes,
);
}
@@ -1893,7 +2124,7 @@ fn watch_files(
Arc::clone(&face_dao),
Arc::clone(&watcher_tag_dao),
face_client.clone(),
&excluded_dirs,
&effective_excludes,
None,
playlist_manager.clone(),
preview_generator.clone(),
@@ -1914,7 +2145,7 @@ fn watch_files(
Arc::clone(&face_dao),
Arc::clone(&watcher_tag_dao),
face_client.clone(),
&excluded_dirs,
&effective_excludes,
Some(check_since),
playlist_manager.clone(),
preview_generator.clone(),
@@ -1922,7 +2153,66 @@ fn watch_files(
}
// Update media counts per library (metric aggregates across all)
update_media_counts(Path::new(&lib.root_path), &excluded_dirs);
update_media_counts(Path::new(&lib.root_path), &effective_excludes);
// Missing-file detection: prune image_exif rows whose
// source file is no longer on disk. Per-library, so we
// pass library-online-this-tick implicitly (we only
// reach here if the probe gate at the top of the
// iteration passed). Capped + paginated so a huge
// library doesn't stall the watcher; rows we don't
// visit this tick get visited next tick. See
// library_maintenance::detect_missing_files_for_library.
{
let context = opentelemetry::Context::new();
let offset = missing_file_offsets.get(&lib.id).copied().unwrap_or(0);
let (deleted, next_offset) =
library_maintenance::detect_missing_files_for_library(
&context,
lib,
&exif_dao,
offset,
missing_scan_page_size,
missing_delete_cap,
);
missing_file_offsets.insert(lib.id, next_offset);
if deleted > 0 {
debug!(
"missing-file scan: library '{}' next_offset={}",
lib.name, next_offset
);
}
}
}
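The cursor/page/cap mechanics of the missing-file scan can be sketched independently of the DAO. `sweep_page` is a hypothetical helper; the real `detect_missing_files_for_library` also stat()s files and issues the deletes:

```rust
// One tick of the paged sweep: examine rows [offset, offset+page),
// report up to `cap` confirmed-missing rows, and advance the cursor —
// wrapping to 0 at the end so the sweep is resumable and idempotent.
fn sweep_page(
    rows: &[&str],
    offset: usize,
    page: usize,
    cap: usize,
    exists: impl Fn(&str) -> bool,
) -> (Vec<String>, usize) {
    let end = (offset + page).min(rows.len());
    let deleted: Vec<String> = rows[offset..end]
        .iter()
        .filter(|p| !exists(p))
        .take(cap)
        .map(|p| p.to_string())
        .collect();
    let next = if end >= rows.len() { 0 } else { end };
    (deleted, next)
}
```

Rows not visited this tick are picked up on the next, exactly as the comment describes.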
// Reconciliation: cross-library, so it runs once per tick
// outside the per-library loop. Idempotent — fast no-op when
// there's nothing to do. Operates on the database alone, no
// filesystem dependency, so it doesn't need a health gate.
// See database::reconcile and CLAUDE.md "Multi-library data
// model" for the rules.
{
let mut conn = image_api::database::connect();
let _ = image_api::database::reconcile::run(&mut conn);
// Back-ref refresh: hash-keyed rows whose
// (library_id, rel_path) tuple no longer matches any
// image_exif row but whose hash still does. After a
// recent→archive move, the missing-file scan removes
// the old image_exif row; this pass repoints face /
// tag / insight back-refs at the surviving location.
// DB-only, no health gate needed — uses what's in
// image_exif as truth.
let _ = library_maintenance::refresh_back_refs(&mut conn);
// Orphan GC: the destructive end of the maintenance
// pipeline. Two-tick consensus + every-library-online
// requirement is enforced inside run_orphan_gc; we
// pass the current all-online flag and the function
// tracks the previous tick's flag in OrphanGcState.
let all_online = library_maintenance::all_libraries_online(&libs, &library_health);
let _ =
library_maintenance::run_orphan_gc(&mut conn, &mut orphan_gc_state, all_online);
}
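The two-tick consensus can be sketched in isolation, assuming the shape the comments and tests describe (the real `run_orphan_gc` also deletes across three derived tables and reports per-table stats):

```rust
use std::collections::HashSet;

#[derive(Default)]
struct GcState {
    pending: HashSet<String>,
    prev_all_online: bool,
}

// One GC tick: delete only hashes that were pending last tick AND are
// still orphaned now, and only when both consecutive ticks saw every
// library online. A hash that reappears in image_exif between ticks
// simply drops out of `pending` — nothing is deleted ("revived").
fn gc_tick(
    state: &mut GcState,
    orphans_now: &HashSet<String>,
    all_online: bool,
) -> Vec<String> {
    let deleted: Vec<String> = if all_online && state.prev_all_online {
        state.pending.intersection(orphans_now).cloned().collect()
    } else {
        Vec::new()
    };
    // Re-mark for next tick; a stale tick resets consensus entirely.
    let mut next = if all_online {
        orphans_now.clone()
    } else {
        HashSet::new()
    };
    for h in &deleted {
        next.remove(h);
    }
    state.pending = next;
    state.prev_all_online = all_online;
    deleted
}
```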
if is_full_scan {
@@ -1992,7 +2282,9 @@ fn process_new_files(
let existing_exif_paths: HashMap<String, bool> = {
let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
match dao.get_exif_batch(&context, &file_paths) {
// The walk is per-library, so scope the lookup to this library;
// otherwise a same-named file in another library would make this
// one look already-indexed.
match dao.get_exif_batch(&context, Some(library.id), &file_paths) {
Ok(exif_records) => exif_records
.into_iter()
.map(|record| (record.file_path, true))
@@ -2012,9 +2304,19 @@ fn process_new_files(
// derivative dedup and DB-indexed sort/filter work for every file,
// not just photos with parseable EXIF.
for (file_path, relative_path) in &files {
let thumb_path = thumbnail_directory.join(relative_path);
let needs_thumbnail =
!thumb_path.exists() && !unsupported_thumbnail_sentinel(&thumb_path).exists();
// Check both the library-scoped legacy path (current shape) and
// the bare-legacy path (pre-multi-library shape). Either one
// existing means a thumbnail is already on disk for this file.
let scoped_thumb_path = content_hash::library_scoped_legacy_path(
thumbnail_directory,
library.id,
relative_path,
);
let bare_legacy_thumb_path = thumbnail_directory.join(relative_path);
let needs_thumbnail = !scoped_thumb_path.exists()
&& !bare_legacy_thumb_path.exists()
&& !unsupported_thumbnail_sentinel(&scoped_thumb_path).exists()
&& !unsupported_thumbnail_sentinel(&bare_legacy_thumb_path).exists();
let needs_row = !existing_exif_paths.contains_key(relative_path);
if needs_thumbnail || needs_row {
@@ -2049,6 +2351,12 @@ fn process_new_files(
}
};
// Perceptual hashes (pHash + dHash). Best-effort — None for
// videos and decode failures. Drives near-duplicate detection
// in the Apollo duplicates surface; failure here is non-fatal
// and never blocks indexing.
let perceptual = perceptual_hash::compute(&file_path);
// EXIF is best-effort enrichment. When extraction fails (or the
// file type doesn't support EXIF) we still store a row with all
// EXIF fields NULL; the file remains visible to sort-by-date
@@ -2100,6 +2408,8 @@ fn process_new_files(
last_modified: timestamp,
content_hash,
size_bytes,
phash_64: perceptual.map(|h| h.phash_64),
dhash_64: perceptual.map(|h| h.dhash_64),
};
let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
@@ -2131,7 +2441,7 @@ fn process_new_files(
// ensures small/medium deploys self-heal without operator
// action.
backfill_missing_content_hashes(&context, &files, library, &exif_dao);
let candidates = build_face_candidates(&context, &files, &exif_dao, &face_dao);
let candidates = build_face_candidates(&context, library, &files, &exif_dao, &face_dao);
debug!(
"face_watch: scan tick — {} image file(s) walked, {} candidate(s) (library '{}', modified_since={})",
files.iter().filter(|(p, _)| !is_video_file(p)).count(),
@@ -2449,7 +2759,7 @@ fn backfill_missing_content_hashes(
let exif_records = {
let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
dao.get_exif_batch(context, &image_paths)
dao.get_exif_batch(context, Some(library.id), &image_paths)
.unwrap_or_default()
};
// Cheap lookup back from rel_path → absolute file_path so
@@ -2541,6 +2851,7 @@ fn backfill_missing_content_hashes(
/// covers both new uploads and the initial backlog scan.
fn build_face_candidates(
context: &opentelemetry::Context,
library: &libraries::Library,
files: &[(PathBuf, String)],
exif_dao: &Arc<Mutex<Box<dyn ExifDao>>>,
face_dao: &Arc<Mutex<Box<dyn faces::FaceDao>>>,
@@ -2558,7 +2869,7 @@ fn build_face_candidates(
let exif_records = {
let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
dao.get_exif_batch(context, &image_paths)
dao.get_exif_batch(context, Some(library.id), &image_paths)
.unwrap_or_default()
};
// rel_path → content_hash (only rows with a hash; without one we have

View File

@@ -569,7 +569,8 @@ pub async fn list_memories(
for lib in &libraries_to_scan {
let base = Path::new(&lib.root_path);
let path_excluder = PathExcluder::new(base, &app_state.excluded_dirs);
let effective = lib.effective_excluded_dirs(&app_state.excluded_dirs);
let path_excluder = PathExcluder::new(base, &effective);
let exif_memories = collect_exif_memories(
&exif_dao,

src/perceptual_hash.rs Normal file
View File

@@ -0,0 +1,159 @@
//! Perceptual image hashing for near-duplicate detection.
//!
//! Two 64-bit signals per image, packed into i64 for storage and fast
//! Hamming distance via XOR + popcount:
//!
//! - **pHash (DCT)** — robust to lossy recompression, format conversion,
//! moderate brightness/contrast shifts. The primary signal.
//! - **dHash (gradient)** — much cheaper to compute, robust to scaling
//! and small crops. Acts as a fallback / corroboration when pHash is
//! ambiguous (very flat images can collide).
//!
//! Image-only by design. Videos, decode failures, and any image we
//! can't open all return `None` — perceptual hash failure is non-fatal
//! and must not block the indexer; the file is still hashed by blake3
//! and exact-match dedup keeps working.
use std::path::Path;
use image_hasher::{HashAlg, HasherConfig};
/// 64-bit perceptual fingerprint pair.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PerceptualIdentity {
pub phash_64: i64,
pub dhash_64: i64,
}
/// Compute pHash + dHash for an image at `path`. Returns `None` on
/// decode failure (unsupported format, corrupt bytes, video, etc.) —
/// callers should treat that as "no perceptual signal available" and
/// proceed with exact-match dedup only.
pub fn compute(path: &Path) -> Option<PerceptualIdentity> {
let img = image::open(path).ok()?;
// 8x8 = 64 bits, the standard size for pHash/dHash. Larger sizes
// give more discriminative power but no longer fit in i64 and the
// marginal robustness isn't worth the storage / index cost for a
// personal-scale library.
let phash = HasherConfig::new()
.hash_alg(HashAlg::Mean)
.hash_size(8, 8)
.preproc_dct()
.to_hasher()
.hash_image(&img);
let dhash = HasherConfig::new()
.hash_alg(HashAlg::Gradient)
.hash_size(8, 8)
.to_hasher()
.hash_image(&img);
Some(PerceptualIdentity {
phash_64: bytes_to_i64(phash.as_bytes())?,
dhash_64: bytes_to_i64(dhash.as_bytes())?,
})
}
/// Hamming distance between two 64-bit perceptual hashes. The primary
/// query primitive: two images are "near-duplicates" when this is below
/// a threshold (default 8 for pHash, ~12% similarity tolerance). The
/// duplicates module clusters via a BK-tree which uses its own copy of
/// this calculation; this helper is kept for ad-hoc tools and tests.
#[allow(dead_code)]
#[inline]
pub fn hamming_distance(a: i64, b: i64) -> u32 {
(a ^ b).count_ones()
}
fn bytes_to_i64(bytes: &[u8]) -> Option<i64> {
if bytes.len() < 8 {
return None;
}
let mut buf = [0u8; 8];
buf.copy_from_slice(&bytes[..8]);
Some(i64::from_be_bytes(buf))
}
#[cfg(test)]
mod tests {
use super::*;
use image::{ImageBuffer, Rgb};
fn write_test_image(path: &Path, seed: u32) {
// Deterministic-but-distinct image content: simple gradient with
// a per-seed offset. Gives pHash/dHash a real signal to work
// with (a uniform image collapses to all-zero hashes).
let img: ImageBuffer<Rgb<u8>, Vec<u8>> = ImageBuffer::from_fn(64, 64, |x, y| {
let r = ((x + seed) & 0xFF) as u8;
let g = ((y + seed * 2) & 0xFF) as u8;
let b = ((x ^ y ^ seed) & 0xFF) as u8;
Rgb([r, g, b])
});
img.save(path).unwrap();
}
#[test]
fn identical_bytes_yield_identical_hashes() {
let dir = tempfile::tempdir().unwrap();
let a = dir.path().join("a.png");
let b = dir.path().join("b.png");
write_test_image(&a, 42);
write_test_image(&b, 42);
let ha = compute(&a).expect("hash a");
let hb = compute(&b).expect("hash b");
assert_eq!(ha, hb);
assert_eq!(hamming_distance(ha.phash_64, hb.phash_64), 0);
}
#[test]
fn distinct_images_have_distinct_hashes() {
let dir = tempfile::tempdir().unwrap();
let a = dir.path().join("a.png");
let b = dir.path().join("b.png");
write_test_image(&a, 42);
write_test_image(&b, 123);
let ha = compute(&a).expect("hash a");
let hb = compute(&b).expect("hash b");
assert_ne!(ha.phash_64, hb.phash_64);
}
#[test]
fn resized_copy_is_near_duplicate_under_threshold() {
// The whole point of perceptual hashing: a resized copy of the
// same source image should land within a small Hamming distance
// of the original. We check the dHash specifically because it's
// the more resize-robust of the two; pHash is also tight but
// gradient-based dHash gives the most reliable signal here.
let dir = tempfile::tempdir().unwrap();
let a = dir.path().join("a.png");
write_test_image(&a, 7);
let img = image::open(&a).unwrap();
let small = img.resize_exact(32, 32, image::imageops::FilterType::Lanczos3);
let b = dir.path().join("b.png");
small.save(&b).unwrap();
let ha = compute(&a).expect("hash a");
let hb = compute(&b).expect("hash b");
let d_dhash = hamming_distance(ha.dhash_64, hb.dhash_64);
assert!(
d_dhash <= 8,
"expected dhash Hamming distance <= 8 for resized copy, got {}",
d_dhash
);
}
#[test]
fn unsupported_path_returns_none() {
let dir = tempfile::tempdir().unwrap();
let p = dir.path().join("notanimage.txt");
std::fs::write(&p, b"hello").unwrap();
assert!(compute(&p).is_none());
}
#[test]
fn missing_file_returns_none() {
let p = Path::new("/nonexistent/path/that/does/not/exist.png");
assert!(compute(p).is_none());
}
}
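The module comment notes that the duplicates module clusters via a BK-tree over these 64-bit hashes. A minimal BK-tree with the Hamming metric, as a sketch of that structure (the project's actual implementation may differ):

```rust
use std::collections::HashMap;

fn hamming(a: i64, b: i64) -> u32 {
    (a ^ b).count_ones()
}

struct BkNode {
    hash: i64,
    children: HashMap<u32, BkNode>,
}

impl BkNode {
    fn new(hash: i64) -> Self {
        BkNode { hash, children: HashMap::new() }
    }
}

struct BkTree {
    root: Option<BkNode>,
}

impl BkTree {
    fn new() -> Self {
        BkTree { root: None }
    }

    fn insert(&mut self, hash: i64) {
        if self.root.is_none() {
            self.root = Some(BkNode::new(hash));
            return;
        }
        let mut node = self.root.as_mut().expect("root set above");
        loop {
            let d = hamming(node.hash, hash);
            if d == 0 {
                return; // identical hash already present
            }
            // Descend along the edge labeled with distance d, creating
            // the child there if the slot is empty (standard BK insert).
            node = node.children.entry(d).or_insert_with(|| BkNode::new(hash));
        }
    }

    fn within(&self, query: i64, radius: u32) -> Vec<i64> {
        let mut out = Vec::new();
        let mut stack: Vec<&BkNode> = self.root.iter().collect();
        while let Some(node) = stack.pop() {
            let d = hamming(node.hash, query);
            if d <= radius {
                out.push(node.hash);
            }
            // Triangle-inequality prune: only children whose edge
            // distance is within `radius` of d can contain matches.
            for (edge, child) in &node.children {
                if edge.abs_diff(d) <= radius {
                    stack.push(child);
                }
            }
        }
        out
    }
}
```

A radius query visits only the subtrees the triangle inequality allows, which is what makes candidate discovery cheap relative to a linear scan.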

View File

@@ -10,7 +10,7 @@ use crate::database::{
connect,
};
use crate::database::{PreviewDao, SqlitePreviewDao};
use crate::libraries::{self, Library};
use crate::libraries::{self, Library, LibraryHealthMap};
use crate::tags::{SqliteTagDao, TagDao};
use crate::video::actors::{
PlaylistGenerator, PreviewClipGenerator, StreamActor, VideoPlaylistManager,
@@ -26,6 +26,11 @@ pub struct AppState {
/// All configured media libraries. Ordered by `id` ascending; the first
/// entry is the primary library.
pub libraries: Vec<Library>,
/// Per-library availability snapshot. Updated by the file watcher at
/// the top of each tick via `libraries::refresh_health`. HTTP handlers
/// read it (e.g. `/libraries` surfacing). See "Library availability
/// and safety" in CLAUDE.md.
pub library_health: LibraryHealthMap,
/// Legacy shim equal to `libraries[0].root_path`. Phase 2 transitional —
/// new code should go through `primary_library()`.
pub base_path: String,
@@ -105,11 +110,13 @@ impl AppState {
preview_dao,
);
let library_health = libraries::new_health_map(&libraries_vec);
Self {
stream_manager,
playlist_manager: Arc::new(video_playlist_manager.start()),
preview_clip_generator: Arc::new(preview_clip_generator.start()),
libraries: libraries_vec,
library_health,
base_path,
thumbnail_path,
video_path,
@@ -348,6 +355,8 @@ impl AppState {
id: crate::libraries::PRIMARY_LIBRARY_ID,
name: "main".to_string(),
root_path: base_path_str.clone(),
enabled: true,
excluded_dirs: Vec::new(),
};
let insight_generator = InsightGenerator::new(
ollama.clone(),
@@ -384,6 +393,8 @@ impl AppState {
id: crate::libraries::PRIMARY_LIBRARY_ID,
name: "main".to_string(),
root_path: base_path_str.clone(),
enabled: true,
excluded_dirs: Vec::new(),
}];
AppState::new(
Arc::new(StreamActor {}.start()),

View File

@@ -33,6 +33,11 @@ where
.service(web::resource("image/tags/all").route(web::get().to(get_all_tags::<TagD>)))
.service(web::resource("image/tags/batch").route(web::post().to(update_tags::<TagD>)))
.service(web::resource("image/tags/lookup").route(web::post().to(lookup_tags_batch::<TagD>)))
.service(
web::resource("image/tags/{id}")
.route(web::put().to(update_tag::<TagD>))
.route(web::delete().to(delete_tag::<TagD>)),
)
}
async fn add_tag<D: TagDao>(
@@ -53,7 +58,14 @@ async fn add_tag<D: TagDao>(
tag_dao
.get_all_tags(&span_context, None)
.and_then(|tags| {
if let Some((_, tag)) = tags.iter().find(|t| t.1.name == tag_name) {
// Case-insensitive match. With the unique-NOCASE index on
// tags.name now in place, a case-sensitive find here would
// miss a casing-only collision and let the subsequent
// create_tag INSERT crash on the constraint.
if let Some((_, tag)) = tags
.iter()
.find(|t| t.1.name.eq_ignore_ascii_case(&tag_name))
{
Ok(tag.clone())
} else {
info!(
@@ -71,6 +83,74 @@ async fn add_tag<D: TagDao>(
.into_http_internal_err()
}
async fn update_tag<D: TagDao>(
_: Claims,
http_request: HttpRequest,
path: web::Path<i32>,
body: web::Json<UpdateTagRequest>,
tag_dao: web::Data<Mutex<D>>,
) -> impl Responder {
let tracer = global_tracer();
let context = extract_context_from_request(&http_request);
let span = tracer.start_with_context("update_tag", &context);
let span_context = opentelemetry::Context::current_with_span(span);
let id = path.into_inner();
let trimmed = body.name.trim();
if trimmed.is_empty() {
return HttpResponse::BadRequest()
.json(serde_json::json!({ "error": "Tag name must not be empty" }));
}
let mut tag_dao = tag_dao.lock().expect("Unable to get TagDao");
match tag_dao.update_tag_name(&span_context, id, trimmed) {
Ok(UpdateTagOutcome::Renamed(tag)) => {
span_context.span().set_status(Status::Ok);
info!("Renamed tag {} -> '{}'", id, trimmed);
HttpResponse::Ok().json(tag)
}
Ok(UpdateTagOutcome::NotFound) => {
HttpResponse::NotFound().json(serde_json::json!({ "error": "Tag not found" }))
}
Ok(UpdateTagOutcome::Conflict { existing }) => HttpResponse::Conflict().json(
serde_json::json!({ "error": "Tag name already exists", "existing_tag": existing }),
),
Err(e) => {
log::error!("update_tag failed: {:?}", e);
HttpResponse::InternalServerError()
.json(serde_json::json!({ "error": "Update failed" }))
}
}
}
async fn delete_tag<D: TagDao>(
_: Claims,
http_request: HttpRequest,
path: web::Path<i32>,
tag_dao: web::Data<Mutex<D>>,
) -> impl Responder {
let tracer = global_tracer();
let context = extract_context_from_request(&http_request);
let span = tracer.start_with_context("delete_tag", &context);
let span_context = opentelemetry::Context::current_with_span(span);
let id = path.into_inner();
let mut tag_dao = tag_dao.lock().expect("Unable to get TagDao");
match tag_dao.delete_tag(&span_context, id) {
Ok(true) => {
span_context.span().set_status(Status::Ok);
info!("Deleted tag {}", id);
HttpResponse::NoContent().finish()
}
Ok(false) => HttpResponse::NotFound().json(serde_json::json!({ "error": "Tag not found" })),
Err(e) => {
log::error!("delete_tag failed: {:?}", e);
HttpResponse::InternalServerError()
.json(serde_json::json!({ "error": "Delete failed" }))
}
}
}
async fn get_tags<D: TagDao>(
_: Claims,
http_request: HttpRequest,
@@ -284,9 +364,15 @@ async fn lookup_tags_batch<D: TagDao>(
// Stage 1: query → content_hash mapping. Files without a hash yet
// (just-indexed, hash compute failed, etc.) skip the sibling
// expansion and only get tags from their own rel_path.
// Library-agnostic by design: this endpoint takes raw rel_paths from
// the client (typically Apollo) with no library context. Scan all
// libraries and let the hash-keyed sibling expansion below do the
// disambiguation. Same-rel_path/different-content collisions across
// libraries surface as multiple hashes for one path — fine, we union
// every sibling tag set.
let exif_records = {
let mut dao = exif_dao.lock().expect("Unable to get ExifDao");
match dao.get_exif_batch(&span_context, None, &query_paths) {
Ok(rows) => rows,
Err(e) => {
return HttpResponse::InternalServerError()
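The union semantics described above can be sketched as a pure function. This is illustrative only (`union_sibling_tags` and the map shapes are hypothetical, standing in for the DAO queries):

```rust
use std::collections::{BTreeSet, HashMap};

/// Sketch of the hash-keyed sibling expansion: a rel_path may resolve
/// to several content hashes when the same path holds different bytes
/// in different libraries; the endpoint simply unions every sibling
/// tag set. A path with no hash yet contributes nothing here and
/// falls back to tags on its own rel_path.
pub fn union_sibling_tags(
    path_to_hashes: &HashMap<String, Vec<String>>,
    hash_to_tags: &HashMap<String, BTreeSet<String>>,
    query_path: &str,
) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    for hash in path_to_hashes.get(query_path).into_iter().flatten() {
        if let Some(tags) = hash_to_tags.get(hash) {
            out.extend(tags.iter().cloned());
        }
    }
    out
}
```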
@@ -421,6 +507,11 @@ pub struct InsertTaggedPhoto {
#[diesel(column_name = rel_path)]
pub photo_name: String,
pub created_time: i64,
/// Hash-keyed identity. The DAO populates this from
/// `image_exif.content_hash` at insert time when known; the
/// reconciliation pass backfills rows inserted before the hash
/// landed. See CLAUDE.md "Multi-library data model".
pub content_hash: Option<String>,
}
#[derive(Queryable, Clone, Debug)]
@@ -434,6 +525,8 @@ pub struct TaggedPhoto {
pub tag_id: i32,
#[allow(dead_code)] // Part of API contract
pub created_time: i64,
#[allow(dead_code)]
pub content_hash: Option<String>,
}
#[derive(Debug, Deserialize)]
@@ -442,6 +535,22 @@ pub struct AddTagsRequest {
pub tag_ids: Vec<i32>,
}
#[derive(Debug, Deserialize)]
pub struct UpdateTagRequest {
pub name: String,
}
/// Result of an attempted tag rename. Returning a typed outcome (rather
/// than `anyhow::Result<Tag>`) lets the handler map each case to a
/// distinct HTTP status without sniffing error strings, and keeps the
/// 409 path a normal control-flow result instead of a DB constraint
/// violation surfacing as a generic 500.
pub enum UpdateTagOutcome {
Renamed(Tag),
NotFound,
Conflict { existing: Tag },
}
pub trait TagDao: Send + Sync {
fn get_all_tags(
&mut self,
@@ -511,6 +620,26 @@ pub trait TagDao: Send + Sync {
context: &opentelemetry::Context,
file_paths: &[String],
) -> anyhow::Result<std::collections::HashMap<String, i64>>;
/// Rename a tag in place. The tag id stays stable so existing
/// `tagged_photo` rows automatically reflect the new name without
/// a join-table rewrite. Conflict is resolved against the rest of
/// the table case-insensitively (mirroring the
/// `idx_tags_name_nocase` UNIQUE index) — a rename that changes
/// only the case of the tag's own current name is allowed.
fn update_tag_name(
&mut self,
context: &opentelemetry::Context,
id: i32,
new_name: &str,
) -> anyhow::Result<UpdateTagOutcome>;
/// Globally remove a tag and every `tagged_photo` row that
/// references it. Returns `true` if a tag was deleted, `false` if
/// no row matched the id. The schema's FK is `ON DELETE CASCADE`,
/// and the connection sets `PRAGMA foreign_keys = ON` (SQLite only
/// honors the cascade with that pragma enabled), so a single
/// DELETE on `tags` atomically removes the referencing
/// `tagged_photo` rows.
fn delete_tag(&mut self, context: &opentelemetry::Context, id: i32) -> anyhow::Result<bool>;
}
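The rename-conflict rule documented on `update_tag_name` is a pure predicate over the other rows. A sketch under simplified types (`rename_conflict` and the `(id, name)` pair shape are illustrative, not the real DAO):

```rust
/// Collide case-insensitively against every *other* row, so a rename
/// that only changes the case of the tag's own name is allowed.
/// Returns the conflicting tag's id, if any. `tags` is (id, name)
/// pairs standing in for the real table rows.
pub fn rename_conflict(tags: &[(i32, String)], target_id: i32, new_name: &str) -> Option<i32> {
    tags.iter()
        .find(|(id, name)| *id != target_id && name.eq_ignore_ascii_case(new_name))
        .map(|(id, _)| *id)
}
```

Excluding the target's own row is what makes the case-only self-rename ("vacation" → "Vacation") succeed while any cross-row casing collision still yields a 409.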
pub struct SqliteTagDao {
@@ -704,6 +833,83 @@ impl TagDao for SqliteTagDao {
})
}
fn update_tag_name(
&mut self,
context: &opentelemetry::Context,
id: i32,
new_name: &str,
) -> anyhow::Result<UpdateTagOutcome> {
let mut conn = self
.connection
.lock()
.expect("Unable to lock SqliteTagDao connection");
trace_db_call(context, "update", "update_tag_name", |span| {
span.set_attributes(vec![
KeyValue::new("tag_id", id as i64),
KeyValue::new("new_name", new_name.to_string()),
]);
let target = tags::table
.filter(tags::id.eq(id))
.select((tags::id, tags::name, tags::created_time))
.get_result::<Tag>(conn.deref_mut())
.optional()
.with_context(|| format!("Unable to look up tag id {}", id))?;
let target = match target {
Some(t) => t,
None => return Ok(UpdateTagOutcome::NotFound),
};
// Case-insensitive collision check on every other row.
// Belt-and-suspenders: idx_tags_name_nocase enforces this at
// the index level, but checking up front gives the handler
// a clean 409 with the existing tag's id instead of a
// generic constraint-violation 500. Tags table is small;
// loading peers and comparing in Rust avoids a fragile
// dsl::sql composition for case-insensitive equality.
let conflict = tags::table
.filter(tags::id.ne(id))
.select((tags::id, tags::name, tags::created_time))
.get_results::<Tag>(conn.deref_mut())
.with_context(|| "Unable to query for tag-name conflict")?
.into_iter()
.find(|t| t.name.eq_ignore_ascii_case(new_name));
if let Some(existing) = conflict {
return Ok(UpdateTagOutcome::Conflict { existing });
}
diesel::update(tags::table.filter(tags::id.eq(id)))
.set(tags::name.eq(new_name))
.execute(conn.deref_mut())
.with_context(|| format!("Unable to rename tag {}", id))?;
Ok(UpdateTagOutcome::Renamed(Tag {
id: target.id,
name: new_name.to_string(),
created_time: target.created_time,
}))
})
}
fn delete_tag(&mut self, context: &opentelemetry::Context, id: i32) -> anyhow::Result<bool> {
let mut conn = self
.connection
.lock()
.expect("Unable to lock SqliteTagDao connection");
trace_db_call(context, "delete", "delete_tag", |span| {
span.set_attribute(KeyValue::new("tag_id", id as i64));
// tagged_photo.tag_id is `ON DELETE CASCADE` and the
// connection now sets `PRAGMA foreign_keys = ON`, so a
// single DELETE on tags removes its tagged_photo rows
// atomically.
let removed = diesel::delete(tags::table.filter(tags::id.eq(id)))
.execute(conn.deref_mut())
.with_context(|| format!("Unable to delete tag {}", id))?;
Ok(removed > 0)
})
}
fn remove_tag(
&mut self,
context: &opentelemetry::Context,
@@ -759,11 +965,31 @@ impl TagDao for SqliteTagDao {
KeyValue::new("tag_id", tag_id.to_string()),
]);
// Eagerly populate content_hash so this tag follows the bytes,
// not the path (see CLAUDE.md "Multi-library data model").
// None is fine — the reconciliation pass will backfill once
// image_exif has a hash for this file. We deliberately don't
// require library_id here: the tag handler is library-
// agnostic by design, and any matching image_exif row's hash
// is acceptable. If the path resolves to different bytes in
// different libraries, the per-library reconciliation pass
// refines it later.
let content_hash: Option<String> = {
use crate::database::schema::image_exif as ie;
ie::table
.filter(ie::rel_path.eq(path))
.filter(ie::content_hash.is_not_null())
.select(ie::content_hash)
.first::<Option<String>>(conn.deref_mut())
.ok()
.flatten()
};
diesel::insert_into(tagged_photo::table)
.values(InsertTaggedPhoto {
tag_id,
photo_name: path.to_string(),
created_time: Utc::now().timestamp(),
content_hash,
})
.execute(conn.deref_mut())
.with_context(|| format!("Unable to tag file {:?} in sqlite", path))
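The eager-hash-then-backfill behavior above boils down to an optional lookup. A minimal sketch (the `hash_for_path` name and map shape are hypothetical stand-ins for the `image_exif` query):

```rust
use std::collections::HashMap;

/// Take any matching image_exif row's non-null hash if one exists;
/// otherwise store `None` and let the reconciliation pass backfill
/// once the hash lands. The map stands in for the `image_exif`
/// table (rel_path -> content_hash).
pub fn hash_for_path(
    exif: &HashMap<String, Option<String>>,
    rel_path: &str,
) -> Option<String> {
    exif.get(rel_path).and_then(|h| h.clone())
}
```

Both the "no row yet" and "row exists but hash not computed" cases collapse to `None`, which is exactly the state the reconciliation pass later repairs.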
@@ -1168,6 +1394,7 @@ mod tests {
tag_id: tag.id,
created_time: Utc::now().timestamp(),
photo_name: path.to_string(),
content_hash: None,
};
if self.tagged_photos.borrow().contains_key(path) {
@@ -1238,6 +1465,54 @@ mod tests {
}
Ok(counts)
}
fn update_tag_name(
&mut self,
_context: &opentelemetry::Context,
id: i32,
new_name: &str,
) -> anyhow::Result<UpdateTagOutcome> {
// The conflict check excludes the target's own row, so a
// case-only rename of the tag's current name can't collide
// with itself.
let conflict = self
.tags
.borrow()
.iter()
.find(|t| t.id != id && t.name.eq_ignore_ascii_case(new_name))
.cloned();
if let Some(existing) = conflict {
return Ok(UpdateTagOutcome::Conflict { existing });
}
let mut tags = self.tags.borrow_mut();
match tags.iter_mut().find(|t| t.id == id) {
Some(t) => {
t.name = new_name.to_string();
Ok(UpdateTagOutcome::Renamed(t.clone()))
}
None => Ok(UpdateTagOutcome::NotFound),
}
}
fn delete_tag(
&mut self,
_context: &opentelemetry::Context,
id: i32,
) -> anyhow::Result<bool> {
let target_name = {
let tags = self.tags.borrow();
tags.iter().find(|t| t.id == id).map(|t| t.name.clone())
};
let Some(name) = target_name else {
return Ok(false);
};
// Mirror the cascade: drop any tagged_photo references, then
// remove the tag itself.
for (_path, tags) in self.tagged_photos.borrow_mut().iter_mut() {
tags.retain(|t| t.id != id && t.name != name);
}
self.tags.borrow_mut().retain(|t| t.id != id);
Ok(true)
}
}
#[actix_rt::test]
@@ -1253,20 +1528,29 @@ mod tests {
// Seed: two paths tagged, one path untagged.
dao.tagged_photos.borrow_mut().insert(
"a.jpg".into(),
vec![Tag {
id: 1,
name: "alpha".into(),
created_time: 0,
}],
);
dao.tagged_photos.borrow_mut().insert(
"b.jpg".into(),
vec![
Tag {
id: 2,
name: "beta".into(),
created_time: 0,
},
Tag {
id: 3,
name: "gamma".into(),
created_time: 0,
},
],
);
let grouped = dao
.get_tags_grouped_by_paths(&ctx, &["a.jpg".into(), "b.jpg".into(), "c.jpg".into()])
.unwrap();
assert_eq!(grouped.get("a.jpg").map(|v| v.len()), Some(1));
assert_eq!(grouped.get("b.jpg").map(|v| v.len()), Some(2));
@@ -1381,6 +1665,177 @@ mod tests {
None
);
}
async fn rename_tag(
dao: &Data<Mutex<TestTagDao>>,
id: i32,
new_name: &str,
) -> actix_web::http::StatusCode {
use actix_web::Responder;
let req = TestRequest::default().to_http_request();
let body = web::Json(UpdateTagRequest {
name: new_name.to_string(),
});
let claims = Claims::valid_user(String::from("1"));
let resp = update_tag(claims, req.clone(), web::Path::from(id), body, dao.clone()).await;
resp.respond_to(&req).status()
}
#[actix_rt::test]
async fn update_tag_renames_successfully() {
let mut dao = TestTagDao::new();
let tag = dao
.create_tag(&opentelemetry::Context::current(), "old")
.unwrap();
let dao = Data::new(Mutex::new(dao));
assert_eq!(
rename_tag(&dao, tag.id, "new").await,
actix_web::http::StatusCode::OK
);
let mut locked = dao.lock().unwrap();
let all = locked
.get_all_tags(&opentelemetry::Context::current(), None)
.unwrap();
assert_eq!(all.len(), 1);
assert_eq!(all[0].1.name, "new");
}
#[actix_rt::test]
async fn update_tag_not_found_returns_404() {
let dao = Data::new(Mutex::new(TestTagDao::new()));
assert_eq!(
rename_tag(&dao, 99999, "nope").await,
actix_web::http::StatusCode::NOT_FOUND
);
}
#[actix_rt::test]
async fn update_tag_empty_name_returns_400() {
let mut dao = TestTagDao::new();
let tag = dao
.create_tag(&opentelemetry::Context::current(), "keep")
.unwrap();
let dao = Data::new(Mutex::new(dao));
assert_eq!(
rename_tag(&dao, tag.id, " ").await,
actix_web::http::StatusCode::BAD_REQUEST
);
let mut locked = dao.lock().unwrap();
let all = locked
.get_all_tags(&opentelemetry::Context::current(), None)
.unwrap();
assert_eq!(all[0].1.name, "keep", "name must not change on 400");
}
#[actix_rt::test]
async fn update_tag_conflict_returns_409() {
let mut dao = TestTagDao::new();
let _a = dao
.create_tag(&opentelemetry::Context::current(), "a")
.unwrap();
let b = dao
.create_tag(&opentelemetry::Context::current(), "b")
.unwrap();
let dao = Data::new(Mutex::new(dao));
// Case-insensitive collision: renaming b -> "A" must conflict with a.
assert_eq!(
rename_tag(&dao, b.id, "A").await,
actix_web::http::StatusCode::CONFLICT
);
let mut locked = dao.lock().unwrap();
let all = locked
.get_all_tags(&opentelemetry::Context::current(), None)
.unwrap();
let b_after = all.iter().find(|(_, t)| t.id == b.id).unwrap();
assert_eq!(b_after.1.name, "b", "no DB change on 409");
}
async fn delete_via_handler(
dao: &Data<Mutex<TestTagDao>>,
id: i32,
) -> actix_web::http::StatusCode {
use actix_web::Responder;
let req = TestRequest::default().to_http_request();
let claims = Claims::valid_user(String::from("1"));
let resp = delete_tag(claims, req.clone(), web::Path::from(id), dao.clone()).await;
resp.respond_to(&req).status()
}
#[actix_rt::test]
async fn delete_tag_removes_tag_and_cascades_tagged_photos() {
let mut dao = TestTagDao::new();
let tag = dao
.create_tag(&opentelemetry::Context::current(), "doomed")
.unwrap();
dao.tag_file(&opentelemetry::Context::current(), "a.jpg", tag.id)
.unwrap();
dao.tag_file(&opentelemetry::Context::current(), "b.jpg", tag.id)
.unwrap();
let dao = Data::new(Mutex::new(dao));
assert_eq!(
delete_via_handler(&dao, tag.id).await,
actix_web::http::StatusCode::NO_CONTENT
);
let mut locked = dao.lock().unwrap();
assert!(
locked
.get_all_tags(&opentelemetry::Context::current(), None)
.unwrap()
.is_empty()
);
assert!(
locked
.get_tags_for_path(&opentelemetry::Context::current(), "a.jpg")
.unwrap()
.is_empty(),
"tagged_photo references must be cleaned up by the cascade"
);
assert!(
locked
.get_tags_for_path(&opentelemetry::Context::current(), "b.jpg")
.unwrap()
.is_empty()
);
}
#[actix_rt::test]
async fn delete_tag_unknown_id_returns_404() {
let dao = Data::new(Mutex::new(TestTagDao::new()));
assert_eq!(
delete_via_handler(&dao, 99999).await,
actix_web::http::StatusCode::NOT_FOUND
);
}
#[actix_rt::test]
async fn update_tag_case_only_change_succeeds() {
let mut dao = TestTagDao::new();
let tag = dao
.create_tag(&opentelemetry::Context::current(), "vacation")
.unwrap();
let dao = Data::new(Mutex::new(dao));
// The conflict check excludes the target's own row, so changing
// only the case of the tag's current name must succeed.
assert_eq!(
rename_tag(&dao, tag.id, "Vacation").await,
actix_web::http::StatusCode::OK
);
let mut locked = dao.lock().unwrap();
let all = locked
.get_all_tags(&opentelemetry::Context::current(), None)
.unwrap();
assert_eq!(all[0].1.name, "Vacation");
}
}
#[derive(QueryableByName, Debug, Clone)]
pub struct FileWithTagCount {