Compare commits
34 Commits
f50655fb21...master
Commit SHA1s:
82dd21b205, 57b7bad086, 98057c98a1, 7ca888e95d, 7584cd8792, 4340b164eb,
fb4df4b195, 1d9b9a0bc4, 7998a0c9b0, 58f010f302, 814066551e, 4f17af688e,
3598bb2cfe, 23448cf5e6, d809ddee44, fa98d147be, 5f247be1f1, 263e27e108,
a0283a6362, 48cac8c285, cce8f0c1b7, 48ed7be5d9, eea1bf3181, 2f91891459,
3d162105f7, 98601973f7, 862917b0d1, 44d677528e, 89b743ba54, 323097c650,
d0833177c7, 67abd8d8ff, 0840d55c70, dbb046dfa8

4  .gitignore (vendored)
@@ -2,8 +2,12 @@
database/target
*.db
*.db.bak
*.db-shm
*.db-wal
.env
/tmp
/docs
/specs

# Default ignored files
.idea/shelf/

240  CLAUDE.md
@@ -104,6 +104,242 @@ All database access goes through trait-based DAOs (e.g., `ExifDao`, `SqliteExifD
- `query_by_exif()`: Complex filtering by camera, GPS bounds, date ranges
- Batch operations minimize DB hits during file watching

### Multi-library data model

ImageApi supports more than one library (a library = a `(name, root_path)` row in the `libraries` table that maps to a mounted directory tree). The same bytes may exist under more than one library — the typical case is an "active" library plus an "archive" library that ingests files as they age out — and the data model is designed so that both derived data and user-managed data follow the **bytes**, not the path.

**The principle.** A photo's identity is its `content_hash` (blake3, see `src/content_hash.rs`). Anything we compute from or attach to a photo is keyed on that hash so it survives:
- the same file appearing in a second library (backup / archive / mirror),
- the file moving between libraries (recent → archive handoff),
- the file moving within a library (re-organized rel_path),
- intra-library duplicates (same bytes at two paths).

**Table classification.** Three categories drive the keying decision:

| Category | Key | Rationale | Tables |
|---|---|---|---|
| Intrinsic to bytes | `content_hash` | Rerunning is wasted work (or LLM cost) | `face_detections` ✓, `image_exif` (target), `photo_insights` (target), `video_preview_clips` (target) |
| User intent about a photo | `content_hash` | "Tag this photo" means the bytes, not a path | `tagged_photo` (target), `favorites` (target) |
| Library administrative | `(library_id, rel_path)` | Tied to a specific filesystem location | `libraries`, `entity_photo_links`, the `rel_path` back-ref columns on hash-keyed tables |

✓ = already implemented this way. *(target)* = today still keyed on `(library_id, rel_path)` and slated for migration. The migration adds a nullable `content_hash` column, populates it from `image_exif` where known, and read paths fall back to `rel_path` while the hash is null.

**Carrying a `rel_path` even when hash-keyed.** Hash-keyed tables retain `(library_id, rel_path)` columns as a denormalized **back-reference**, not as the key. This lets a single query answer "what is at this path right now" without joining through `image_exif`, and supports the path-only endpoints that predate the hash. `face_detections` is the reference implementation: hash is the truth, path is a hint.

**Merge semantics on read.** When the same hash has rows under more than one library:
- Set-valued data (tags, favorites, faces, entity links) → **union**.
- Scalar data (current insight, EXIF row, video preview clip) → earliest `generated_at` / `created_time` wins. The historical lib1 row beats a re-generated lib2 row, so the user's curated insight isn't shadowed by a re-run on archive ingest.
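
These two merge rules can be sketched as follows — a minimal sketch with hypothetical in-memory row types (the real read paths are SQL queries in the DAOs):

```rust
use std::collections::BTreeSet;

// Hypothetical row shape for illustration — the real columns live in SQLite.
#[derive(Clone)]
struct Insight {
    generated_at: i64, // unix seconds
    text: String,
}

/// Set-valued data (tags, favorites, faces, entity links): union across
/// every library that holds the hash.
fn merge_tags(per_library: &[BTreeSet<String>]) -> BTreeSet<String> {
    per_library.iter().flatten().cloned().collect()
}

/// Scalar data: earliest `generated_at` wins, so the curated lib1 insight
/// beats a re-generated lib2 insight from archive ingest.
fn merge_insights(rows: &[Insight]) -> Option<Insight> {
    rows.iter().min_by_key(|i| i.generated_at).cloned()
}
```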

**Write attribution.** A new tag/favorite/insight created while viewing under lib2 binds to the bytes, not to lib2 — so it shows up under lib1 too. This is by design, but it's the most surprising rule on first encounter; clients should not assume tags are library-scoped.

**Hash-less rows (transitional state).** During and immediately after a new mount, `image_exif.content_hash` is being populated by `backfill_unhashed_backlog` (capped per tick). Rules during this window:
- Writes: if the hash is known, write hash-keyed. If not, write `(library_id, rel_path)`-keyed and let the reconciliation job collapse duplicates once the hash lands.
- Reads: prefer the hash key, fall back to `(library_id, rel_path)`.
- Reconciliation: a one-shot pass after every backfill tick collapses rows that now share a hash, applying the merge semantics above. Idempotent — safe to re-run.
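
The transitional read rule can be sketched like this (illustrative shapes only — the real tables carry more columns, and the real filter is a SQL join):

```rust
#[derive(Clone, PartialEq)]
struct PathKey {
    library_id: i64,
    rel_path: String,
}

struct TagRow {
    content_hash: Option<String>, // null until the backfill lands
    key: PathKey,                 // (library_id, rel_path) fallback key
    tag: String,
}

/// Rows with a known hash match on the hash; hash-less rows fall back to
/// the (library_id, rel_path) key until reconciliation collapses them.
fn tags_for(rows: &[TagRow], hash: &str, key: &PathKey) -> Vec<String> {
    rows.iter()
        .filter(|r| match r.content_hash.as_deref() {
            Some(h) => h == hash,
            None => &r.key == key,
        })
        .map(|r| r.tag.clone())
        .collect()
}
```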

**Library handoff (recent → archive).** When a file moves between libraries (e.g. the operator moves `~/photos/2024/IMG.nef` to the archive mount), the file watcher sees the disappearance under lib1 and the appearance under lib2. Hash-keyed rows don't need migration; the `(library_id, rel_path)` back-ref columns are updated to point to the new location. Library-administrative rows (`entity_photo_links`, `(library_id, rel_path)` rows in `image_exif` for hash-less items) are re-keyed by the move detector, which matches a disappearance to an appearance by `content_hash` within a configurable window.

**Orphans (source deleted while a copy survives).** When an `image_exif` row for a hash is deleted (file removed from disk), the hash-keyed derived rows survive **as long as another `image_exif` row references the same hash**. Once the last reference is gone, derived rows become eligible for GC (deferred — the GC job runs on a slow schedule so that a brief unmount or rename doesn't wipe history).

**Stats and counts.** When reporting "how many photos do you have," count `DISTINCT content_hash` over `image_exif`, not rows. Faces stats already does this (`FaceDao::stats` in `src/faces.rs`); other counters should follow suit. Numerator and denominator must live in the same domain — see the face-stats commentary below for the cautionary tale.
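
The distinct-hash counting rule, sketched with a hypothetical row type (the real counter is a SQL `COUNT(DISTINCT content_hash)`):

```rust
use std::collections::HashSet;

struct ExifRow {
    content_hash: Option<String>,
}

/// DISTINCT content_hash, not row count. A file duplicated at two paths
/// (two rows, one hash) counts once; hash-less rows still in the backfill
/// queue are excluded — they're pending hashing, not pending detection.
fn total_photos(rows: &[ExifRow]) -> usize {
    rows.iter()
        .filter_map(|r| r.content_hash.as_deref())
        .collect::<HashSet<_>>()
        .len()
}
```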

**Per-library scoping when the user asks for it.** A request scoped to `?library=N` filters the `image_exif` view to that library, and the hash-keyed derived data is joined through that view. The user sees only photos that have a copy under lib N, but the derived data attached to those photos is the merged hash-keyed view. This is the answer to "show me archive photos with their original tags."

**Operator kill switch (`libraries.enabled`).** Setting `enabled=0` on a library is a hard pause: the watcher skips it entirely — before the probe, before ingest, before any maintenance pass — and the orphan-GC all-online consensus check filters disabled libraries out (they don't keep the GC window closed). Reads / serving are unaffected; nothing prevents `/image?path=...` from resolving against a disabled library's root if the file is on disk. The existing `image_exif` rows for a disabled library are **not deleted** — they continue to anchor hash-keyed derived data, so cross-library duplicates survive the disable. Toggle via SQL; there is intentionally no HTTP endpoint for library mutation (single-user tool, no role / permission story). Typical workflows: stage a new mount with `enabled=0`, then flip to `1`; quiet a flaky NAS during maintenance without disturbing the rest of the system.

**Per-library excludes (`libraries.excluded_dirs`).** A comma-separated column, same shape as the global `EXCLUDED_DIRS` env var, applied **in union** with the env-var globals when a walker scans this library. Use case: mount a parent directory as a new library while a sibling library covers a child subtree, and exclude that child subtree from the parent so the two libraries don't double-walk and double-write `image_exif`. Two entry forms (parsed by `memories::PathExcluder`):
- `/sub/path` — the leading slash flags it as a path under the library root. Joins to root + matches by `path.starts_with(...)`. Works at any depth (`/photos`, `/media/2024/raw`).
- `name` — no leading slash flags it as a component name to skip anywhere in the tree (`@eaDir`, `.thumbnails`). Single segment only — `media/photos/a` without a leading slash never matches anything.

Hash-keyed derived data (faces, tags, insights) is unaffected either way — those follow the bytes — but `image_exif` row count, walker CPU, and thumbnail disk usage all drop to 1× instead of 2× for the overlap. Affects: file-watch ingest (`process_new_files`), thumbnail generation, media-count gauges, the orphaned-playlist cleanup walk, and the `/memories` endpoint. The face-detection backlog drain inherits via `face_watch::filter_excluded`. NULL = no extras (only the global env var applies).
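
The two entry forms can be sketched like this — a simplified stand-in for `memories::PathExcluder` (function name and shape are illustrative, not the real API):

```rust
use std::path::Path;

/// Simplified sketch of the two exclude-entry forms.
fn is_excluded(entry: &str, root: &Path, candidate: &Path) -> bool {
    if let Some(sub) = entry.strip_prefix('/') {
        // Path form: joins to the library root, prefix match at any depth.
        candidate.starts_with(root.join(sub))
    } else if entry.contains('/') {
        // Multi-segment without a leading slash never matches anything.
        false
    } else {
        // Name form: skip any component with this name, anywhere in the tree.
        candidate.iter().any(|c| c == std::ffi::OsStr::new(entry))
    }
}
```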

**Library availability and safety.** Libraries can be on network shares or removable media; the file watcher must not interpret a temporary unavailability as a mass-deletion event. Every tick begins with a **presence probe** per library: the library is considered online iff its `root_path` exists, is readable, and a top-level scan returns at least one expected entry (or matches a recent file-count high-water mark within a tolerance). The probe result gates which actions are safe to run on that library this tick:

| Action | Requires online? |
|---|---|
| Quick / full scan ingest of new files | yes |
| EXIF / face / insight backlog drains | yes — but the work runs against any online library |
| Move-handoff detection (lib1 disappearance ↔ lib2 appearance match) | **both** libraries online |
| `(library_id, rel_path)` re-keying on detected move | **both** libraries online |
| Orphan GC of hash-keyed derived data | all libraries that have *ever* held the hash must be online and confirmed-clean for two consecutive ticks |
| Reads / serving | always allowed; falls back to whichever library is online |

A library that fails the probe enters a "stale" state: writes scoped to it are paused, its rows are flagged stale (not deleted) in `/libraries` status, and the watcher logs at `warn` once per state transition (not per tick). A library that recovers re-enters the online set automatically; no operator action is required for transient outages. The intent is that pulling a USB drive, rebooting a NAS, or losing a VPN never triggers a destructive code path — the worst case is that derived-data work pauses until the share returns.

The same rule constrains the move-handoff matcher: a disappearance under lib1 only counts as a "move" if there is a matching appearance under another **online** library within the window. A bare disappearance with no matching appearance is treated as "unavailable-or-deleted, defer judgment" — it does not re-key any rows and does not enqueue GC.
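
A minimal sketch of the presence probe's core check (illustrative; the real probe also consults the file-count high-water mark described above):

```rust
use std::fs;
use std::path::Path;

/// Online iff the root exists, is readable, and a top-level scan yields
/// at least one entry. Any error (missing mount, permission) → offline,
/// never a deletion signal.
fn library_online(root: &Path) -> bool {
    match fs::read_dir(root) {
        Ok(mut entries) => entries.next().is_some(),
        Err(_) => false,
    }
}
```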

**Maintenance pipeline (`src/library_maintenance.rs`).** The watcher runs three maintenance passes per tick that together implement the move/handoff and orphan rules:

1. **Missing-file scan** — per online library, paginated. A page of `image_exif` rows is loaded (`IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE`, default 500), each row's `root_path/rel_path` is `stat()`-ed, and confirmed-not-found rows are deleted from `image_exif` (capped at `IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK`, default 200). Permission/IO errors are skipped, never deleted — only `NotFound` triggers a deletion. The cursor wraps every time a partial page comes back, so the whole library is swept across consecutive ticks. Skipped wholesale for stale libraries via the per-library probe gate at the top of the loop iteration.

2. **Back-ref refresh** — DB-only. For `face_detections`, `tagged_photo`, and `photo_insights`: any hash-keyed row whose `(library_id, rel_path)` no longer matches an `image_exif` row *but whose `content_hash` does* is repointed at the surviving `image_exif` location. Idempotent SQL; no health gate needed. This is what makes the recent → archive handoff invisible to read paths: when the missing-file scan retires the lib-A row, tags/faces/insights pivot to lib-B's path before any user notices.

3. **Orphan GC** — destructive. Hash-keyed derived rows whose `content_hash` no longer has any `image_exif` row are eligible. Two-tick consensus: a hash must be observed orphaned on two consecutive ticks AND every library must be online for both. A single stale tick within the window cancels all pending deletes. The pending set is held in memory (`OrphanGcState`) — a restart resets it, which only delays a delete, never causes one. Tags, faces, and insights for orphaned hashes are deleted in one batch per tick.

A backup library that briefly disappears, then returns within two ticks, never loses any derived data. A move from lib-A to lib-B flows through pass 1 (lib-A row retired) and pass 2 (back-refs follow), with pass 3 touching nothing because the hash is still present in `image_exif` (lib-B's row).
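
The two-tick consensus of pass 3 can be sketched as a small state machine (mirrors the `OrphanGcState` idea; field and method names here are illustrative):

```rust
use std::collections::HashSet;

/// In-memory pending set for orphan GC. A hash is deletable only after
/// being seen orphaned on two consecutive all-online ticks; a single
/// stale tick cancels everything pending. Restart resets the state,
/// which only delays a delete — it never causes one.
#[derive(Default)]
struct OrphanGcState {
    pending: HashSet<String>,
}

impl OrphanGcState {
    /// Returns the hashes whose derived rows may be deleted this tick.
    fn tick(&mut self, all_online: bool, orphaned_now: &HashSet<String>) -> Vec<String> {
        if !all_online {
            self.pending.clear(); // stale tick: cancel pending deletes
            return Vec::new();
        }
        // Deletable: pending from last tick AND still orphaned now.
        let ready: Vec<String> = self.pending.intersection(orphaned_now).cloned().collect();
        self.pending = orphaned_now.clone(); // arm for next tick
        ready
    }
}
```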

**Known gap: in-place content changes (future Branch D).** The maintenance pipeline assumes a `(library_id, rel_path)`'s bytes are stable for as long as the file exists at that path. If a user edits a file in place (crop, re-export) without renaming, the watcher's quick scan walks the file (mtime is recent) but `process_new_files` short-circuits because `(library_id, rel_path)` already has an `image_exif` row — no re-hash, no re-EXIF, no face redetection. The row's `content_hash` keeps pointing at the original bytes. Tags / faces / insights stay attached to the original hash and continue to display because the rel_path back-ref still resolves; new faces introduced by the edit are never detected.

The right place to fix this is a **stale-content detection pass** that compares `image_exif.last_modified` / `size_bytes` against `fs::metadata` for rows the quick scan would otherwise skip. On mismatch, recompute the hash, update `image_exif`, and apply the "content branched" semantics:
- **Faces** re-run (faces are fully derived from bytes).
- **Tags** migrate to the new hash (user intent — "this photo is vacation" survives a crop). Insights migrate forward as a starting point and are flagged for re-generation.
- **Favorites** (once migrated to hash-keyed) follow the user intent and migrate as well.

The interesting case is the operator who keeps an unedited copy in the archive library and edits the local copy: post-detection, the archive copy stays on the original hash, the local copy branches to the new hash, and the two histories cleanly split. Apollo's `derived.db` cache will need an invalidation hook for the changed hash — design it alongside Branch D.

### File Processing Pipeline

**Thumbnail Generation:**

@@ -219,7 +455,7 @@ ImageApi owns the face data; Apollo (sibling repo) hosts the insightface inferen
- `persons(id, name UNIQUE COLLATE NOCASE, cover_face_id, entity_id, created_from_tag, notes, ...)` — operator-managed, name is the user-visible identity.
- `face_detections(id, library_id, content_hash, rel_path, bbox_*, embedding BLOB, confidence, source, person_id, status, model_version, ...)` — keyed on `content_hash` so a photo duplicated across libraries is detected once. Marker rows for `status IN ('no_faces','failed')` carry NULL bbox/embedding (CHECK constraint enforces this).

-**Why content_hash and not (library_id, rel_path):** ties face data to the bytes, not the path. A backup mount that copies files from the primary library naturally inherits the existing detections without re-running inference.
+**Why content_hash and not (library_id, rel_path):** ties face data to the bytes, not the path. A backup mount that copies files from the primary library naturally inherits the existing detections without re-running inference. This is the reference implementation of the multi-library data model — see "Multi-library data model" above.

**File-watch hook** (`src/main.rs::process_new_files`): for each photo with a populated `content_hash`, check `FaceDao::already_scanned(hash)`; if not, send bytes (or embedded JPEG preview for RAW via `exif::extract_embedded_jpeg_preview`) to Apollo's `/api/internal/faces/detect`. K=`FACE_DETECT_CONCURRENCY` (default 8) parallel calls per scan tick; Apollo serializes them via its single-worker GPU pool. `face_watch.rs` is the Tokio orchestration layer.

@@ -233,6 +469,8 @@ ImageApi owns the face data; Apollo (sibling repo) hosts the insightface inferen

**Rerun preserves manual rows** (`POST /image/faces/{id}/rerun`): only `source='auto'` rows are deleted before re-running detection. `already_scanned` returns true on ANY row, so a photo whose only faces are manually drawn never auto-redetects.

**Stats domain — content_hash, not file rows** (`FaceDao::stats` in `src/faces.rs`): `total_photos` counts `DISTINCT content_hash` over `image_exif` (filtered to image extensions, `content_hash IS NOT NULL`), and so do `scanned` / `with_faces` / `no_faces` / `failed` over `face_detections`. Numerator and denominator must live in the same domain — `face_detections` is keyed on content_hash, so the same JPEG present at two rel_paths or in two libraries scans once. Counting `image_exif` rows in the denominator inflated the total by one per duplicate file and produced a permanent gap (e.g. 1101/1103 with nothing actually pending). Hash-less rows are excluded from total_photos while they sit in the `backfill_unhashed_backlog` queue; otherwise the bar pins below 100% for the duration of that backfill even though those rows aren't pending detection yet — they're pending hashing.

Module map:
- `src/faces.rs` — `FaceDao` trait + `SqliteFaceDao` impl, route handlers for `/faces/*`, `/image/faces/*`, `/persons/*`. Mirror of `tags.rs` layout.
- `src/face_watch.rs` — Tokio orchestration for the file-watch detect pass; `filter_excluded` (PathExcluder + image-extension filter), `read_image_bytes_for_detect` (RAW preview fallback).

88  Cargo.lock (generated)
@@ -600,6 +600,16 @@ version = "2.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6099cdc01846bc367c4e7dd630dc5966dccf36b652fae7a74e17b640411a91b2"

[[package]]
name = "bk-tree"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8283fb8e64b873918f8bc527efa6aff34956296e48ea750a9c909cd47c01546"
dependencies = [
 "fnv",
 "triple_accel",
]

[[package]]
name = "blake3"
version = "1.8.4"

@@ -1928,6 +1938,7 @@ dependencies = [
 "async-trait",
 "base64",
 "bcrypt",
 "bk-tree",
 "blake3",
 "bytes",
 "chrono",

@@ -1939,6 +1950,7 @@ dependencies = [
 "futures",
 "ical",
 "image",
 "image_hasher",
 "indicatif",
 "infer",
 "jsonwebtoken",

@@ -1978,6 +1990,19 @@ dependencies = [
 "quick-error",
]

[[package]]
name = "image_hasher"
version = "3.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dd266c66b0a0e2d4c6db8e710663fc163a2d33595ce997b6fbda407c8759d344"
dependencies = [
 "base64",
 "image",
 "rustdct",
 "serde",
 "transpose",
]

[[package]]
name = "imgref"
version = "1.11.0"

@@ -2438,6 +2463,15 @@ dependencies = [
 "num-traits",
]

[[package]]
name = "num-complex"
version = "0.4.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495"
dependencies = [
 "num-traits",
]

[[package]]
name = "num-conv"
version = "0.1.0"

@@ -2907,6 +2941,15 @@ version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "925383efa346730478fb4838dbe9137d2a47675ad789c546d150a6e1dd4ab31c"

[[package]]
name = "primal-check"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dc0d895b311e3af9902528fbb8f928688abbd95872819320517cc24ca6b2bd08"
dependencies = [
 "num-integer",
]

[[package]]
name = "proc-macro2"
version = "1.0.101"

@@ -3286,6 +3329,29 @@ dependencies = [
 "semver",
]

[[package]]
name = "rustdct"
version = "0.7.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b61555105d6a9bf98797c063c362a1d24ed8ab0431655e38f1cf51e52089551"
dependencies = [
 "rustfft",
]

[[package]]
name = "rustfft"
version = "6.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "21db5f9893e91f41798c88680037dba611ca6674703c1a18601b01a72c8adb89"
dependencies = [
 "num-complex",
 "num-integer",
 "num-traits",
 "primal-check",
 "strength_reduce",
 "transpose",
]

[[package]]
name = "rustix"
version = "1.0.8"

@@ -3624,6 +3690,12 @@ version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8f112729512f8e442d81f95a8a7ddf2b7c6b8a1a6f509a95864142b30cab2d3"

[[package]]
name = "strength_reduce"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fe895eb47f22e2ddd4dabc02bce419d2e643c8e3b585c78158b349195bc24d82"

[[package]]
name = "strfmt"
version = "0.2.5"

@@ -4122,6 +4194,22 @@ dependencies = [
 "once_cell",
]

[[package]]
name = "transpose"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1ad61aed86bc3faea4300c7aee358b4c6d0c8d6ccc36524c96e4c92ccf26e77e"
dependencies = [
 "num-integer",
 "strength_reduce",
]

[[package]]
name = "triple_accel"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "622b09ce2fe2df4618636fb92176d205662f59803f39e70d1c333393082de96c"

[[package]]
name = "try-lock"
version = "0.2.5"

Cargo.toml
@@ -59,5 +59,7 @@ ical = "0.11"
scraper = "0.20"
base64 = "0.22"
blake3 = "1.5"
image_hasher = "3.0"
bk-tree = "0.5"
async-trait = "0.1"
indicatif = "0.17"

1  migrations/2026-04-30-000000_unique_tag_name/down.sql (new file)
@@ -0,0 +1 @@
DROP INDEX IF EXISTS idx_tags_name_nocase;

28  migrations/2026-04-30-000000_unique_tag_name/up.sql (new file)
@@ -0,0 +1,28 @@
-- Tags only enforced uniqueness in application code (the add_tag handler
-- looks up by name before inserting). The schema itself accepted dupes,
-- so a divergent code path could land two tags with the same name. Now
-- that we expose a rename endpoint we want a hard guarantee: case-
-- insensitive UNIQUE on tags.name.

-- Pre-flight: collapse exact-name duplicates (case-insensitive) onto the
-- lowest-id row before adding the constraint, otherwise the index
-- creation fails on any DB that ever produced dupes. On a clean DB this
-- is a no-op.
UPDATE tagged_photo
SET tag_id = (
    SELECT MIN(t2.id) FROM tags t2
    WHERE LOWER(t2.name) = LOWER((SELECT name FROM tags WHERE id = tagged_photo.tag_id))
)
WHERE tag_id IN (
    SELECT t.id FROM tags t
    WHERE t.id <> (
        SELECT MIN(t2.id) FROM tags t2 WHERE LOWER(t2.name) = LOWER(t.name)
    )
);

DELETE FROM tags
WHERE id <> (
    SELECT MIN(t2.id) FROM tags t2 WHERE LOWER(t2.name) = LOWER(tags.name)
);

CREATE UNIQUE INDEX idx_tags_name_nocase ON tags (name COLLATE NOCASE);

migrations/2026-05-01-000000_hash_keyed_derived_data/down.sql (new file)
@@ -0,0 +1,5 @@
DROP INDEX IF EXISTS idx_photo_insights_content_hash;
ALTER TABLE photo_insights DROP COLUMN content_hash;

DROP INDEX IF EXISTS idx_tagged_photo_content_hash;
ALTER TABLE tagged_photo DROP COLUMN content_hash;

64  migrations/2026-05-01-000000_hash_keyed_derived_data/up.sql (new file)
@@ -0,0 +1,64 @@
-- Phase B of the multi-library data-model rollout: add a nullable
-- `content_hash` column to derived/user-intent tables that should follow
-- the bytes rather than the path. Reads will prefer hash-key joins and
-- fall back to rel_path while the column is null. A separate
-- reconciliation pass collapses duplicates as the column populates.
--
-- See CLAUDE.md → "Multi-library data model" for the policy. The
-- reference implementation is `face_detections`, which has been
-- hash-keyed since it was introduced.
--
-- Tables in this migration:
--   * tagged_photo   — user-intent (tags follow the bytes)
--   * photo_insights — intrinsic to bytes (LLM-generated description)
--
-- favorites is the natural third candidate but its DAO is barely used in
-- v1 and the row count is tiny; deferring lets this migration stay
-- focused on the high-volume tables that drive cross-library overhead.

-- ---------------------------------------------------------------------------
-- tagged_photo
-- ---------------------------------------------------------------------------
ALTER TABLE tagged_photo ADD COLUMN content_hash TEXT;

-- Backfill: for each tagged_photo row, find the content_hash for its
-- rel_path. tagged_photo doesn't carry a library_id, so a rel_path that
-- exists under multiple libraries with different content is genuinely
-- ambiguous — we take the first matching image_exif row. The
-- reconciliation pass at runtime cleans up any rows that resolve
-- differently once a hash is known per library.
UPDATE tagged_photo
SET content_hash = (
    SELECT content_hash FROM image_exif
    WHERE image_exif.rel_path = tagged_photo.rel_path
      AND image_exif.content_hash IS NOT NULL
    LIMIT 1
)
WHERE content_hash IS NULL;

-- Hash-key index. Partial (only non-null rows) to keep the index small
-- during the transitional window where most rows are still null.
CREATE INDEX idx_tagged_photo_content_hash
    ON tagged_photo (content_hash)
    WHERE content_hash IS NOT NULL;

-- ---------------------------------------------------------------------------
-- photo_insights
-- ---------------------------------------------------------------------------
ALTER TABLE photo_insights ADD COLUMN content_hash TEXT;

-- Backfill keyed on (library_id, rel_path) — photo_insights already
-- carries library_id, so the resolution is unambiguous.
UPDATE photo_insights
SET content_hash = (
    SELECT content_hash FROM image_exif
    WHERE image_exif.library_id = photo_insights.library_id
      AND image_exif.rel_path = photo_insights.rel_path
      AND image_exif.content_hash IS NOT NULL
    LIMIT 1
)
WHERE content_hash IS NULL;

CREATE INDEX idx_photo_insights_content_hash
    ON photo_insights (content_hash)
    WHERE content_hash IS NOT NULL;

migrations/2026-05-01-100000_libraries_enabled_flag/down.sql (new file)
@@ -0,0 +1,2 @@
-- Requires SQLite 3.35+ for ALTER TABLE DROP COLUMN.
ALTER TABLE libraries DROP COLUMN enabled;

14  migrations/2026-05-01-100000_libraries_enabled_flag/up.sql (new file)
@@ -0,0 +1,14 @@
-- Operator-controlled kill switch for a library. When `enabled = 0` the
-- watcher tick skips that library entirely — before the availability
-- probe, before ingest, before any maintenance pass — and the orphan-GC
-- all-online check treats it as out-of-scope rather than as a blocker.
--
-- The intended workflow is staging a new mount: insert with enabled=0,
-- verify the row appears in /libraries with enabled=false, then UPDATE
-- to 1 to start ingest. Same toggle works as a maintenance kill switch
-- after the fact ("don't keep probing this NAS while I'm rebooting it").
--
-- Default 1 so every existing library stays running on upgrade — no
-- behavior change without an explicit flip.

ALTER TABLE libraries ADD COLUMN enabled BOOLEAN NOT NULL DEFAULT 1;

migrations/2026-05-01-110000_libraries_excluded_dirs/down.sql (new file)
@@ -0,0 +1,2 @@
-- Requires SQLite 3.35+ for ALTER TABLE DROP COLUMN.
ALTER TABLE libraries DROP COLUMN excluded_dirs;

14  migrations/2026-05-01-110000_libraries_excluded_dirs/up.sql (new file)
@@ -0,0 +1,14 @@
-- Per-library excluded directories.
--
-- The global EXCLUDED_DIRS env var is the right knob for excludes that
-- every library shares (Synology @eaDir, .thumbnails, etc.). It's a
-- poor fit for "exclude this subtree from THIS library only", the
-- natural use case for which is mounting a parent directory while
-- another library already covers a child subtree underneath.
--
-- This column is parsed comma-separated, same shape as the env var,
-- and the watcher / memories / thumbnail walks each apply
-- (env_globals ∪ library.excluded_dirs) when scanning the library.
-- NULL = no extra excludes; the global env var still applies.

ALTER TABLE libraries ADD COLUMN excluded_dirs TEXT;

migrations/2026-05-03-000000_add_perceptual_hash/down.sql (new file)
@@ -0,0 +1,8 @@
DROP INDEX IF EXISTS idx_image_exif_duplicate_of_hash;
DROP INDEX IF EXISTS idx_image_exif_dhash;
DROP INDEX IF EXISTS idx_image_exif_phash;

ALTER TABLE image_exif DROP COLUMN duplicate_decided_at;
ALTER TABLE image_exif DROP COLUMN duplicate_of_hash;
ALTER TABLE image_exif DROP COLUMN dhash_64;
ALTER TABLE image_exif DROP COLUMN phash_64;
41 migrations/2026-05-03-000000_add_perceptual_hash/up.sql Normal file
@@ -0,0 +1,41 @@
-- Adds perceptual-hash signals + soft-mark resolution state to image_exif so
-- the duplicates surface in Apollo can group near-duplicates (re-encoded,
-- resized, format-converted copies) and let the user demote losers without
-- touching the file on disk. Image-only for v1: phash_64/dhash_64 are NULL
-- on videos and on images that fail to decode. See Apollo CLAUDE.md →
-- Duplicate detection / Caching layer for the policy.
--
-- Soft-mark columns are media-type-agnostic — when video perceptual hashing
-- arrives, it lives in a separate hash-keyed companion table and reuses the
-- same duplicate_of_hash / duplicate_decided_at machinery.

-- pHash (DCT, 64-bit) packed as i64 for fast XOR + popcount Hamming.
ALTER TABLE image_exif ADD COLUMN phash_64 BIGINT;

-- dHash (gradient, 64-bit). Cheap, robust to compression/resize. Stored
-- alongside pHash so the query layer can fall back if either is null.
ALTER TABLE image_exif ADD COLUMN dhash_64 BIGINT;

-- When non-null, this row is a soft-marked duplicate of the row whose
-- content_hash matches. The duplicate file stays on disk; the default
-- /photos listing filters it out. /photos?include_duplicates=true opts
-- back in (the Apollo duplicates modal uses this).
ALTER TABLE image_exif ADD COLUMN duplicate_of_hash TEXT;

-- Unix seconds of the resolve. Distinguishes "never reviewed" from
-- "reviewed and resolved" for the Apollo include_resolved toggle.
ALTER TABLE image_exif ADD COLUMN duplicate_decided_at BIGINT;

-- Partial indexes — the columns are NULL for the vast majority of rows
-- during the transitional window and forever for videos / decode failures.
CREATE INDEX idx_image_exif_phash
    ON image_exif (phash_64)
    WHERE phash_64 IS NOT NULL;

CREATE INDEX idx_image_exif_dhash
    ON image_exif (dhash_64)
    WHERE dhash_64 IS NOT NULL;

CREATE INDEX idx_image_exif_duplicate_of_hash
    ON image_exif (duplicate_of_hash)
    WHERE duplicate_of_hash IS NOT NULL;
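The "fast XOR + popcount Hamming" comparison the pHash comment refers to reduces to two operations per pair of packed hashes. A standalone illustration (not the repo's actual query layer):

```rust
/// Hamming distance between two 64-bit perceptual hashes packed as i64
/// (the storage shape of phash_64 / dhash_64): XOR yields a word whose
/// set bits are exactly the positions where the hashes differ, and
/// count_ones pops that word.
fn hamming_64(a: i64, b: i64) -> u32 {
    ((a ^ b) as u64).count_ones()
}

fn main() {
    assert_eq!(hamming_64(0, 0), 0);
    // 0b1011 and 0b1010 differ in exactly one bit.
    assert_eq!(hamming_64(0b1011, 0b1010), 1);
    // 0 and -1 (all ones in two's complement) differ in all 64 bits.
    assert_eq!(hamming_64(0, -1), 64);
}
```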
@@ -383,7 +383,10 @@ mod tests {
        // body cap and rejected normal-size photos before they reached
        // the backend.
        assert!(is_transient(&classify_error_response(408, "")));
        assert!(is_transient(&classify_error_response(413, "<html>nginx</html>")));
        assert!(is_transient(&classify_error_response(
            413,
            "<html>nginx</html>"
        )));
        assert!(is_transient(&classify_error_response(429, "{}")));
    }

@@ -521,6 +521,7 @@ impl InsightChatService {
            training_messages: Some(json),
            backend: effective_backend.clone(),
            fewshot_source_ids: None,
            content_hash: None,
        };
        let cx = opentelemetry::Context::new();
        let mut dao = self.insight_dao.lock().expect("Unable to lock InsightDao");
@@ -983,6 +984,7 @@ impl InsightChatService {
            training_messages: Some(json),
            backend: effective_backend.clone(),
            fewshot_source_ids: None,
            content_hash: None,
        };
        let cx = opentelemetry::Context::new();
        let mut dao = self.insight_dao.lock().expect("Unable to lock InsightDao");

@@ -1255,7 +1255,9 @@ impl InsightGenerator {
            .span()
            .set_attribute(KeyValue::new("summary_length", summary.len() as i64));

        // 11. Store in database
        // 11. Store in database. content_hash is None here — store_insight
        // looks it up from image_exif before persisting; reconciliation
        // backfills if the hash isn't known yet.
        let insight = InsertPhotoInsight {
            library_id: crate::libraries::PRIMARY_LIBRARY_ID,
            file_path: file_path.to_string(),
@@ -1267,6 +1269,7 @@ impl InsightGenerator {
            training_messages: None,
            backend: "local".to_string(),
            fewshot_source_ids: None,
            content_hash: None,
        };

        let mut dao = self.insight_dao.lock().expect("Unable to lock InsightDao");
@@ -3530,6 +3533,7 @@ Return ONLY the summary, nothing else."#,
            training_messages,
            backend: backend_label.clone(),
            fewshot_source_ids: fewshot_source_ids_json,
            content_hash: None,
        };

        let stored = {
243 src/bin/backfill_perceptual_hash.rs Normal file
@@ -0,0 +1,243 @@
//! Backfill `image_exif.phash_64` + `dhash_64` for image rows that
//! were ingested before perceptual hashing was wired into the watcher.
//!
//! The watcher computes perceptual hashes for new images as they're
//! ingested, so this binary is a one-shot for the historical backlog.
//! Idempotent — only rows with a non-null content_hash and a null
//! phash are processed, so re-runs are safe and pick up where they
//! left off (e.g. after a crash or interrupt).
//!
//! Image-only by design: `get_rows_missing_perceptual_hash` filters by
//! file extension at the DB layer so videos and other non-decodable
//! media are skipped without round-tripping `image_hasher`. Files that
//! can't be opened (missing on disk, permission errors) are quietly
//! left as null and counted as "missing"; on the next run, if the file
//! is restored, the row will surface again.

use std::path::Path;
use std::sync::{Arc, Mutex};
use std::time::Instant;

use clap::Parser;
use log::{error, warn};
use rayon::prelude::*;

use image_api::bin_progress;
use image_api::database::{ExifDao, SqliteExifDao, connect};
use image_api::libraries::{self, Library};
use image_api::perceptual_hash;

#[derive(Parser, Debug)]
#[command(name = "backfill_perceptual_hash")]
#[command(about = "Compute pHash + dHash for image_exif rows missing one")]
struct Args {
    /// Max rows to hash per batch. The process loops until no rows remain.
    #[arg(long, default_value_t = 256)]
    batch_size: i64,

    /// Rayon parallelism override. 0 uses the default thread pool size.
    #[arg(long, default_value_t = 0)]
    parallelism: usize,

    /// Dry-run: log what would be hashed without writing to the DB.
    #[arg(long)]
    dry_run: bool,
}

fn main() -> anyhow::Result<()> {
    env_logger::init();
    dotenv::dotenv().ok();

    let args = Args::parse();
    if args.parallelism > 0 {
        rayon::ThreadPoolBuilder::new()
            .num_threads(args.parallelism)
            .build_global()
            .expect("Unable to configure rayon thread pool");
    }

    let base_path = dotenv::var("BASE_PATH").ok();
    let mut seed_conn = connect();
    if let Some(base) = base_path.as_deref() {
        libraries::seed_or_patch_from_env(&mut seed_conn, base);
    }
    let libs = libraries::load_all(&mut seed_conn);
    drop(seed_conn);
    if libs.is_empty() {
        anyhow::bail!("No libraries configured; cannot backfill perceptual hashes");
    }
    let libs_by_id: std::collections::HashMap<i32, Library> =
        libs.into_iter().map(|lib| (lib.id, lib)).collect();
    println!(
        "Configured libraries: {}",
        libs_by_id
            .values()
            .map(|l| format!("{} -> {}", l.name, l.root_path))
            .collect::<Vec<_>>()
            .join(", ")
    );

    let dao: Arc<Mutex<Box<dyn ExifDao>>> = Arc::new(Mutex::new(Box::new(SqliteExifDao::new())));
    let ctx = opentelemetry::Context::new();

    let mut total_hashed = 0u64;
    let mut total_missing = 0u64;
    let mut total_decode_failures = 0u64;
    let mut total_errors = 0u64;
    let start = Instant::now();

    let pb = bin_progress::spinner("perceptual-hashing");

    loop {
        let rows = {
            let mut guard = dao.lock().expect("Unable to lock ExifDao");
            guard
                .get_rows_missing_perceptual_hash(&ctx, args.batch_size)
                .map_err(|e| anyhow::anyhow!("DB error: {:?}", e))?
        };
        if rows.is_empty() {
            break;
        }
        let batch_size = rows.len();
        pb.set_message(format!(
            "batch of {} (hashed={} decode_fail={} missing={} errors={})",
            batch_size, total_hashed, total_decode_failures, total_missing, total_errors
        ));

        // Compute perceptual hashes in parallel — pure CPU-bound work
        // with no shared state. rayon's default thread pool matches the
        // host's logical-core count, which is the right ceiling for
        // image_hasher's DCT pass.
        let results: Vec<(i32, String, FilePerceptualResult)> = rows
            .into_par_iter()
            .map(|(library_id, rel_path)| {
                let abs = libs_by_id
                    .get(&library_id)
                    .map(|lib| Path::new(&lib.root_path).join(&rel_path));
                match abs {
                    Some(abs_path) if abs_path.exists() => {
                        match perceptual_hash::compute(&abs_path) {
                            Some(id) => (library_id, rel_path, FilePerceptualResult::Ok(id)),
                            None => (library_id, rel_path, FilePerceptualResult::DecodeFailed),
                        }
                    }
                    Some(_) => (library_id, rel_path, FilePerceptualResult::MissingOnDisk),
                    None => {
                        warn!("Row refers to unknown library_id {}", library_id);
                        (library_id, rel_path, FilePerceptualResult::MissingOnDisk)
                    }
                }
            })
            .collect();

        // Persist sequentially — SQLite writes serialize anyway.
        if !args.dry_run {
            let mut guard = dao.lock().expect("Unable to lock ExifDao");
            for (library_id, rel_path, result) in &results {
                match result {
                    FilePerceptualResult::Ok(id) => {
                        match guard.backfill_perceptual_hash(
                            &ctx,
                            *library_id,
                            rel_path,
                            Some(id.phash_64),
                            Some(id.dhash_64),
                        ) {
                            Ok(_) => {
                                total_hashed += 1;
                                pb.inc(1);
                            }
                            Err(e) => {
                                pb.println(format!("persist error for {}: {:?}", rel_path, e));
                                total_errors += 1;
                            }
                        }
                    }
                    FilePerceptualResult::DecodeFailed => {
                        // Persist phash_64=0/dhash_64=0 as a "tried,
                        // unhashable" sentinel so this row leaves the
                        // `phash_64 IS NULL` candidate set and the
                        // backfill doesn't infinite-loop on a queue of
                        // undecodable formats (HEIC, RAW, CMYK JPEGs,
                        // truncated bytes). The all-zero hash is
                        // explicitly excluded from clustering by
                        // is_informative_hash in duplicates.rs, so it
                        // won't pollute group output — it just becomes
                        // invisible to the duplicate finder.
                        log::debug!(
                            "perceptual decode failed for {} (lib {}); marking unhashable",
                            rel_path,
                            library_id
                        );
                        match guard.backfill_perceptual_hash(
                            &ctx,
                            *library_id,
                            rel_path,
                            Some(0),
                            Some(0),
                        ) {
                            Ok(_) => {
                                total_decode_failures += 1;
                            }
                            Err(e) => {
                                pb.println(format!(
                                    "persist error (decode-fail sentinel) for {}: {:?}",
                                    rel_path, e
                                ));
                                total_errors += 1;
                            }
                        }
                    }
                    FilePerceptualResult::MissingOnDisk => {
                        total_missing += 1;
                    }
                }
            }
        } else {
            for (_, rel_path, result) in &results {
                match result {
                    FilePerceptualResult::Ok(id) => {
                        pb.println(format!(
                            "[dry-run] {} -> phash={:016x} dhash={:016x}",
                            rel_path, id.phash_64, id.dhash_64
                        ));
                        total_hashed += 1;
                        pb.inc(1);
                    }
                    FilePerceptualResult::DecodeFailed => {
                        total_decode_failures += 1;
                    }
                    FilePerceptualResult::MissingOnDisk => {
                        total_missing += 1;
                    }
                }
            }
            pb.println(format!(
                "[dry-run] processed one batch of {}. Stopping — a real run would continue \
                 until no NULL phash_64 image rows remain.",
                results.len()
            ));
            break;
        }
    }

    pb.finish_and_clear();
    println!(
        "Done. hashed={}, decode_failed={}, skipped (missing on disk)={}, errors={}, elapsed={:.1}s",
        total_hashed,
        total_decode_failures,
        total_missing,
        total_errors,
        start.elapsed().as_secs_f64()
    );
    if total_errors > 0 {
        error!("Backfill completed with {} persist errors", total_errors);
    }
    Ok(())
}

enum FilePerceptualResult {
    Ok(perceptual_hash::PerceptualIdentity),
    DecodeFailed,
    MissingOnDisk,
}
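The sentinel comment above distinguishes three states for a row's pHash: never attempted (NULL), tried and unhashable (all-zero), and a real signal. The real `is_informative_hash` lives in duplicates.rs and is not shown in this diff; a sketch of the assumed shape:

```rust
/// Sketch of the guard the comment attributes to duplicates.rs: a NULL
/// hash was never attempted, an all-zero hash is the decode-failure
/// sentinel, and neither should enter the Hamming clusterer.
fn is_informative_hash(phash_64: Option<i64>) -> bool {
    matches!(phash_64, Some(h) if h != 0)
}

fn main() {
    assert!(!is_informative_hash(None));        // never hashed
    assert!(!is_informative_hash(Some(0)));     // tried, unhashable
    assert!(is_informative_hash(Some(0x1234))); // real signal
}
```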
@@ -53,12 +53,34 @@ pub fn thumbnail_path(thumbs_dir: &Path, hash: &str) -> PathBuf {
/// Hash-keyed HLS output directory: `<video_dir>/<hash[..2]>/<hash>/`.
/// The playlist lives at `playlist.m3u8` inside this directory and its
/// segments are co-located so HLS relative references Just Work.
///
/// Allow-dead until Branch B/C rewires the HLS pipeline to use it; the
/// helper lives here today so Branch A's path layout decisions stay
/// adjacent to thumbnail/legacy ones.
#[allow(dead_code)]
pub fn hls_dir(video_dir: &Path, hash: &str) -> PathBuf {
    let shard = shard_prefix(hash);
    video_dir.join(shard).join(hash)
}

/// Library-scoped legacy mirrored path:
/// `<derivative_dir>/<library_id>/<rel_path>`. Used as the fallback when
/// `content_hash` isn't available — the library prefix prevents the
/// "lib1 wrote `vacation/IMG.jpg` first, lib2 sees thumb_path.exists()
/// and serves the wrong image" failure mode.
///
/// Existing single-library deployments may already have thumbnails at the
/// bare-legacy `<derivative_dir>/<rel_path>` shape; serving code is
/// expected to check both this scoped path and the bare-legacy path so
/// nothing 404s during the transition.
pub fn library_scoped_legacy_path(
    derivative_dir: &Path,
    library_id: i32,
    rel_path: impl AsRef<Path>,
) -> PathBuf {
    derivative_dir.join(library_id.to_string()).join(rel_path)
}

fn shard_prefix(hash: &str) -> &str {
    let end = hash
        .char_indices()
@@ -105,4 +127,17 @@ mod tests {
        let d = hls_dir(video, "1234deadbeef");
        assert_eq!(d, PathBuf::from("/tmp/video/12/1234deadbeef"));
    }

    #[test]
    fn library_scoped_legacy_path_prefixes_with_library_id() {
        let thumbs = Path::new("/tmp/thumbs");
        let p = library_scoped_legacy_path(thumbs, 7, "vacation/IMG.jpg");
        assert_eq!(p, PathBuf::from("/tmp/thumbs/7/vacation/IMG.jpg"));

        // Same rel_path, different library — different output. This is
        // the whole point: lib 1 and lib 2 don't clobber each other.
        let p1 = library_scoped_legacy_path(thumbs, 1, "vacation/IMG.jpg");
        let p2 = library_scoped_legacy_path(thumbs, 2, "vacation/IMG.jpg");
        assert_ne!(p1, p2);
    }
}
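The body of `shard_prefix` is cut off by the hunk boundary above. A sketch consistent with the visible opening lines and the `1234deadbeef` → `12` test, offered as an assumption rather than the repo's exact code (two-character shard, char-boundary-safe so a short hash can't panic):

```rust
/// Assumed shape: first two characters of the hash as the shard
/// directory name. char_indices keeps the slice on a char boundary,
/// and a hash shorter than two chars just shards to itself.
fn shard_prefix(hash: &str) -> &str {
    let end = hash
        .char_indices()
        .nth(2)
        .map(|(i, _)| i)
        .unwrap_or(hash.len());
    &hash[..end]
}

fn main() {
    assert_eq!(shard_prefix("1234deadbeef"), "12");
    assert_eq!(shard_prefix("a"), "a");
    assert_eq!(shard_prefix(""), "");
}
```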
@@ -165,6 +165,15 @@ pub struct FilesRequest {
    /// Optional library filter. Accepts a library id (e.g. "1") or name
    /// (e.g. "main"). When omitted, results span all libraries.
    pub library: Option<String>,

    /// When true, include rows soft-marked as duplicates of another file
    /// (i.e. `image_exif.duplicate_of_hash IS NOT NULL`). Default false —
    /// the standard /photos listing hides demoted siblings so the grid
    /// silently shrinks after a resolve. The Apollo duplicates modal
    /// passes `true` so it can show both survivors and demoted members
    /// inside a group.
    #[serde(default)]
    pub include_duplicates: Option<bool>,
}

#[derive(Copy, Clone, Deserialize, PartialEq, Debug)]
@@ -111,13 +111,30 @@ impl InsightDao for SqliteInsightDao {
    fn store_insight(
        &mut self,
        context: &opentelemetry::Context,
        insight: InsertPhotoInsight,
        mut insight: InsertPhotoInsight,
    ) -> Result<PhotoInsight, DbError> {
        trace_db_call(context, "insert", "store_insight", |_span| {
            use schema::photo_insights::dsl::*;

            let mut connection = self.connection.lock().expect("Unable to get InsightDao");

            // Eagerly populate content_hash so this insight follows the
            // bytes (CLAUDE.md "Multi-library data model"). Caller-
            // supplied hash wins; otherwise look it up from image_exif
            // for the (library_id, rel_path) tuple. None is acceptable —
            // reconciliation backfills it once the hash lands.
            if insight.content_hash.is_none() {
                use schema::image_exif as ie;
                insight.content_hash = ie::table
                    .filter(ie::library_id.eq(insight.library_id))
                    .filter(ie::rel_path.eq(&insight.file_path))
                    .filter(ie::content_hash.is_not_null())
                    .select(ie::content_hash)
                    .first::<Option<String>>(connection.deref_mut())
                    .ok()
                    .flatten();
            }

            // Mark all existing insights for this file as no longer current
            diesel::update(
                photo_insights
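The precedence rule in `store_insight` (caller-supplied hash wins, DB lookup second, None tolerated until reconciliation) is just an Option chain. A minimal restatement with hypothetical names, not the repo's code:

```rust
/// Hypothetical distillation of store_insight's hash resolution:
/// prefer the caller-supplied hash, fall back to a lazy lookup,
/// and tolerate None (reconciliation backfills later).
fn resolve_content_hash(
    caller_supplied: Option<String>,
    lookup: impl FnOnce() -> Option<String>,
) -> Option<String> {
    caller_supplied.or_else(lookup)
}

fn main() {
    // Caller-supplied wins; the lookup closure is never consulted.
    assert_eq!(
        resolve_content_hash(Some("h1".to_string()), || Some("h2".to_string())),
        Some("h1".to_string())
    );
    // No caller hash: fall back to the lookup.
    assert_eq!(
        resolve_content_hash(None, || Some("h2".to_string())),
        Some("h2".to_string())
    );
    // Hash not known anywhere yet: None is acceptable.
    assert_eq!(resolve_content_hash(None, || None), None);
}
```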
@@ -9,6 +9,25 @@ use crate::database::models::{
};
use crate::otel::trace_db_call;

/// Wire shape for a single member of a duplicate group, returned by
/// `list_duplicates_*` and `lookup_duplicate_row`. Carries everything
/// the Apollo modal needs to render a member tile and its meta line —
/// thumbnails are derived from `(library_id, rel_path)` upstream.
#[derive(Debug, Clone, serde::Serialize)]
pub struct DuplicateRow {
    pub library_id: i32,
    pub rel_path: String,
    pub content_hash: String,
    pub size_bytes: Option<i64>,
    pub date_taken: Option<i64>,
    pub width: Option<i32>,
    pub height: Option<i32>,
    pub phash_64: Option<i64>,
    pub dhash_64: Option<i64>,
    pub duplicate_of_hash: Option<String>,
    pub duplicate_decided_at: Option<i64>,
}

pub mod calendar_dao;
pub mod daily_summary_dao;
pub mod insights_dao;
@@ -16,6 +35,7 @@ pub mod knowledge_dao;
pub mod location_dao;
pub mod models;
pub mod preview_dao;
pub mod reconcile;
pub mod schema;
pub mod search_dao;
@@ -136,10 +156,19 @@ pub fn connect() -> SqliteConnection {
    // rollback-journal durability; we accept the narrow last-fsync
    // window for the 2–10× write throughput).
    use diesel::connection::SimpleConnection;
    // foreign_keys = ON is per-connection in SQLite (off by default), so
    // it has to be set here alongside the other pragmas. Without it
    // every `REFERENCES … ON DELETE CASCADE / SET NULL` clause in the
    // schema is documentation-only — orphan rows would survive the
    // referenced row's deletion. With it, the cascade fires
    // automatically and code that previously did manual two-step
    // cleanup (delete child rows, then parent) becomes redundant but
    // still correct.
    conn.batch_execute(
        "PRAGMA journal_mode = WAL; \
         PRAGMA busy_timeout = 5000; \
         PRAGMA synchronous = NORMAL;",
         PRAGMA synchronous = NORMAL; \
         PRAGMA foreign_keys = ON;",
    )
    .expect("set sqlite pragmas");
    conn
@@ -286,17 +315,29 @@ pub trait ExifDao: Sync + Send {
        library_id: Option<i32>,
    ) -> Result<Vec<(String, i64)>, DbError>;

    /// Batch load EXIF data for multiple file paths (single query)
    /// Batch load EXIF data for multiple file paths (single query). When
    /// `library_id = Some(id)` the lookup is keyed on `(library_id,
    /// rel_path)`; cross-library duplicates with the same rel_path are
    /// excluded. `None` keeps the legacy rel-path-only behavior — used by
    /// the union-mode `/photos` listing, which already disambiguates by
    /// `(file_path, library_id)` in the caller.
    fn get_exif_batch(
        &mut self,
        context: &opentelemetry::Context,
        library_id: Option<i32>,
        file_paths: &[String],
    ) -> Result<Vec<ImageExif>, DbError>;

    /// Query files by EXIF criteria with optional filters
    /// Query files by EXIF criteria with optional filters. `library_id =
    /// Some(id)` restricts to that library; `None` spans every library
    /// (used by the unscoped `/photos` form). The composite
    /// `(library_id, date_taken)` index added in the multi_library
    /// migration depends on `library_id` being part of the WHERE clause —
    /// callers that have a library context must pass it.
    fn query_by_exif(
        &mut self,
        context: &opentelemetry::Context,
        library_id: Option<i32>,
        camera_make: Option<&str>,
        camera_model: Option<&str>,
        lens_model: Option<&str>,
@@ -355,6 +396,104 @@ pub trait ExifDao: Sync + Send {
        size_bytes: i64,
    ) -> Result<(), DbError>;

    /// Return image rows that have a `content_hash` but no `phash_64`,
    /// oldest first. Used by the `backfill_perceptual_hash` binary.
    /// Filters by image extension at the DB layer to avoid ever asking
    /// `image_hasher` to decode a video. Returns `(library_id, rel_path)`.
    fn get_rows_missing_perceptual_hash(
        &mut self,
        context: &opentelemetry::Context,
        limit: i64,
    ) -> Result<Vec<(i32, String)>, DbError>;

    /// Persist computed perceptual hashes (pHash + dHash) for an
    /// existing image_exif row. Either column may be left NULL by
    /// passing `None`, but in practice the binary computes both or
    /// neither — `image_hasher` either decodes the image and produces
    /// both signals, or fails entirely.
    fn backfill_perceptual_hash(
        &mut self,
        context: &opentelemetry::Context,
        library_id: i32,
        rel_path: &str,
        phash_64: Option<i64>,
        dhash_64: Option<i64>,
    ) -> Result<(), DbError>;

    /// Group exact-hash duplicates: rows whose `content_hash` appears
    /// more than once across the (optionally library-scoped) corpus.
    /// Returns one [`DuplicateRow`] per member; callers group by
    /// `content_hash`. When `include_resolved=false`, rows already
    /// soft-marked (`duplicate_of_hash IS NOT NULL`) are excluded so
    /// the modal doesn't re-surface decisions the user already made.
    fn list_duplicates_exact(
        &mut self,
        context: &opentelemetry::Context,
        library_id: Option<i32>,
        include_resolved: bool,
    ) -> Result<Vec<DuplicateRow>, DbError>;

    /// Return all rows with a non-null `phash_64` (optionally library-
    /// scoped), used by the perceptual-cluster routine in
    /// [`crate::main`] to single-link cluster via Hamming distance.
    /// Each returned row is a *distinct content_hash* — exact duplicates
    /// are collapsed at the DB layer so the in-memory clusterer doesn't
    /// rediscover them.
    fn list_perceptual_candidates(
        &mut self,
        context: &opentelemetry::Context,
        library_id: Option<i32>,
        include_resolved: bool,
    ) -> Result<Vec<DuplicateRow>, DbError>;

    /// Look up a single row's metadata by `(library_id, rel_path)`. Used
    /// by the resolve endpoint to map the request payload to the
    /// underlying `content_hash` before writing the soft-mark. Returns
    /// `Ok(None)` if the file doesn't exist in `image_exif`.
    fn lookup_duplicate_row(
        &mut self,
        context: &opentelemetry::Context,
        library_id: i32,
        rel_path: &str,
    ) -> Result<Option<DuplicateRow>, DbError>;

    /// Soft-mark a file as a duplicate of `survivor_hash`. Sets
    /// `duplicate_of_hash` and `duplicate_decided_at` on the row(s)
    /// matching `(library_id, rel_path)`. The file stays on disk; the
    /// default `/photos` listing hides it because of the
    /// `duplicate_of_hash IS NULL` filter.
    fn set_duplicate_of(
        &mut self,
        context: &opentelemetry::Context,
        library_id: i32,
        rel_path: &str,
        survivor_hash: &str,
        decided_at: i64,
    ) -> Result<(), DbError>;

    /// Reverse a soft-mark: clears `duplicate_of_hash` and
    /// `duplicate_decided_at`. Used by the modal's UNRESOLVE chip.
    fn clear_duplicate_of(
        &mut self,
        context: &opentelemetry::Context,
        library_id: i32,
        rel_path: &str,
    ) -> Result<(), DbError>;

    /// Union the tags from `demoted_hash` onto `survivor_hash`. Used at
    /// resolve time for *perceptual* duplicates (different content_hashes,
    /// independent tag sets) so the user doesn't lose their tagging work
    /// when promoting a survivor. Idempotent: a tag already on the survivor
    /// is left alone. Exact duplicates (same content_hash) don't need this
    /// because their tag rows are already shared.
    fn union_perceptual_tags(
        &mut self,
        context: &opentelemetry::Context,
        survivor_hash: &str,
        demoted_hash: &str,
        survivor_rel_path: &str,
    ) -> Result<(), DbError>;

    /// Return the first EXIF row with the given content hash (any library).
    /// Used by thumbnail/HLS generation to detect pre-existing derivatives
    /// from another library before regenerating.
@@ -418,11 +557,17 @@ pub trait ExifDao: Sync + Send {
    /// `library_ids` is empty, rows from every library are returned. Used by
    /// `/photos` recursive listing to skip the filesystem walk — the watcher
    /// keeps image_exif in parity with disk via the reconciliation pass.
    ///
    /// `include_duplicates=false` filters out rows soft-marked with
    /// `duplicate_of_hash IS NOT NULL` so the default photo listing hides
    /// demoted siblings; the Apollo duplicates modal passes `true` to
    /// see both survivors and demoted members inside a group.
    fn list_rel_paths_for_libraries(
        &mut self,
        context: &opentelemetry::Context,
        library_ids: &[i32],
        path_prefix: Option<&str>,
        include_duplicates: bool,
    ) -> Result<Vec<(i32, String)>, DbError>;

    /// Delete a single image_exif row scoped to `(library_id, rel_path)`.
@@ -434,6 +579,28 @@ pub trait ExifDao: Sync + Send {
        library_id: i32,
        rel_path: &str,
    ) -> Result<(), DbError>;

    /// Number of image_exif rows for a library. Used by the availability
    /// probe to decide whether an empty mount is "fresh" (zero rows: fine)
    /// or "the share went offline" (non-zero rows: stale). Zero on query
    /// error so a transient DB hiccup doesn't itself cause a Stale flip.
    fn count_for_library(
        &mut self,
        context: &opentelemetry::Context,
        library_id: i32,
    ) -> Result<i64, DbError>;

    /// Paginated rel_path listing for a single library, ordered by id
    /// ascending. Used by the missing-file detector to scan a library
    /// in capped chunks across consecutive watcher ticks rather than
    /// stat()ing every row every minute. Returns `(id, rel_path)`.
    fn list_rel_paths_for_library_page(
        &mut self,
        context: &opentelemetry::Context,
        library_id: i32,
        limit: i64,
        offset: i64,
    ) -> Result<Vec<(i32, String)>, DbError>;
}
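The "single-link cluster via Hamming distance" routine that `list_perceptual_candidates` feeds can be sketched with a tiny union-find. This is illustrative only; the threshold and names are assumptions, not the repo's code:

```rust
/// XOR + popcount Hamming distance on i64-packed 64-bit hashes.
fn hamming(a: i64, b: i64) -> u32 {
    ((a ^ b) as u64).count_ones()
}

/// Minimal union-find with path compression.
fn find(parent: &mut Vec<usize>, i: usize) -> usize {
    let p = parent[i];
    if p == i {
        i
    } else {
        let root = find(parent, p);
        parent[i] = root; // compress
        root
    }
}

/// Single-link clustering: any pair within `threshold` differing bits
/// joins the same group. O(n^2) pairwise scan, fine at candidate scale.
/// Returns one group id (root index) per input hash.
fn cluster(hashes: &[i64], threshold: u32) -> Vec<usize> {
    let mut parent: Vec<usize> = (0..hashes.len()).collect();
    for i in 0..hashes.len() {
        for j in (i + 1)..hashes.len() {
            if hamming(hashes[i], hashes[j]) <= threshold {
                let ri = find(&mut parent, i);
                let rj = find(&mut parent, j);
                parent[ri] = rj;
            }
        }
    }
    (0..hashes.len()).map(|i| find(&mut parent, i)).collect()
}

fn main() {
    // 0 and 1 differ by one bit; 0x7fff_ffff is far from both.
    let groups = cluster(&[0, 1, 0x7fff_ffff], 4);
    assert_eq!(groups[0], groups[1]);
    assert_ne!(groups[0], groups[2]);
}
```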

pub struct SqliteExifDao {
@@ -613,6 +780,7 @@ impl ExifDao for SqliteExifDao {
    fn get_exif_batch(
        &mut self,
        context: &opentelemetry::Context,
        library_id_filter: Option<i32>,
        file_paths: &[String],
    ) -> Result<Vec<ImageExif>, DbError> {
        trace_db_call(context, "query", "get_exif_batch", |_span| {
@@ -623,8 +791,11 @@ impl ExifDao for SqliteExifDao {
            }

            let mut connection = self.connection.lock().expect("Unable to get ExifDao");

            image_exif
            let mut query = image_exif.into_boxed();
            if let Some(lib_id) = library_id_filter {
                query = query.filter(library_id.eq(lib_id));
            }
            query
                .filter(rel_path.eq_any(file_paths))
                .load::<ImageExif>(connection.deref_mut())
                .map_err(|_| anyhow::anyhow!("Query error"))
@@ -635,6 +806,7 @@ impl ExifDao for SqliteExifDao {
    fn query_by_exif(
        &mut self,
        context: &opentelemetry::Context,
        library_id_filter: Option<i32>,
        camera_make_filter: Option<&str>,
        camera_model_filter: Option<&str>,
        lens_model_filter: Option<&str>,
@@ -648,6 +820,12 @@ impl ExifDao for SqliteExifDao {
            let mut connection = self.connection.lock().expect("Unable to get ExifDao");
            let mut query = image_exif.into_boxed();

            // Library scope (most-selective filter — apply first so the
            // `(library_id, ...)` indexes are eligible).
            if let Some(lib_id) = library_id_filter {
                query = query.filter(library_id.eq(lib_id));
            }

            // Camera filters (case-insensitive partial match)
            if let Some(make) = camera_make_filter {
                query = query.filter(camera_make.like(format!("%{}%", make)));
@@ -1022,6 +1200,7 @@ impl ExifDao for SqliteExifDao {
        context: &opentelemetry::Context,
        library_ids: &[i32],
        path_prefix: Option<&str>,
        include_duplicates: bool,
    ) -> Result<Vec<(i32, String)>, DbError> {
        trace_db_call(context, "query", "list_rel_paths_for_libraries", |_span| {
            use schema::image_exif::dsl::*;
@@ -1042,6 +1221,41 @@ impl ExifDao for SqliteExifDao {
                query = query.filter(rel_path.like(pattern).escape('\\'));
            }

            if !include_duplicates {
                if library_ids.is_empty() {
                    // Unscoped (all-libraries) view — every survivor is
                    // reachable somewhere, so a soft-marked row is
                    // genuinely a duplicate from the user's perspective.
                    // Hide it.
                    query = query.filter(duplicate_of_hash.is_null());
                } else {
                    // Scoped to specific libraries: only hide a
                    // soft-marked row when the survivor is reachable
                    // *in this view*. If the survivor lives in a
                    // library the user can't see right now, the
                    // demoted file is the only copy of those bytes
                    // they have access to — keep it visible.
                    //
                    // Implemented as a correlated NOT EXISTS subquery
                    // over an aliased image_exif. Library ids are i32
                    // so format!-inlining the integer list is safe.
                    use diesel::sql_types::Bool;
                    let lib_list = library_ids
                        .iter()
                        .map(i32::to_string)
                        .collect::<Vec<_>>()
                        .join(",");
                    let raw = format!(
                        "(image_exif.duplicate_of_hash IS NULL OR NOT EXISTS \
                         (SELECT 1 FROM image_exif AS survivor \
                          WHERE survivor.content_hash = image_exif.duplicate_of_hash \
                          AND survivor.library_id IN ({})))",
                        lib_list
                    );
                    query = query.filter(diesel::dsl::sql::<Bool>(&raw));
                }
            }

            query
                .load::<(i32, String)>(connection.deref_mut())
                .map_err(|_| anyhow::anyhow!("Query error"))
@@ -1069,6 +1283,465 @@ impl ExifDao for SqliteExifDao {
|
||||
})
|
||||
.map_err(|_| DbError::new(DbErrorKind::QueryError))
|
||||
}
|
||||
|
||||
fn count_for_library(
|
||||
&mut self,
|
||||
context: &opentelemetry::Context,
|
||||
library_id_val: i32,
|
||||
) -> Result<i64, DbError> {
|
||||
trace_db_call(context, "query", "count_for_library", |_span| {
|
||||
use schema::image_exif::dsl::*;
|
||||
|
||||
image_exif
|
||||
.filter(library_id.eq(library_id_val))
|
||||
.count()
|
||||
.get_result::<i64>(self.connection.lock().unwrap().deref_mut())
|
||||
.map_err(|_| anyhow::anyhow!("Count error"))
|
||||
})
|
||||
.map_err(|_| DbError::new(DbErrorKind::QueryError))
|
||||
}
|
||||
|
||||
fn list_rel_paths_for_library_page(
|
||||
&mut self,
|
||||
context: &opentelemetry::Context,
|
||||
library_id_val: i32,
|
||||
limit: i64,
|
||||
offset: i64,
|
||||
) -> Result<Vec<(i32, String)>, DbError> {
|
||||
trace_db_call(
|
||||
context,
|
||||
"query",
|
||||
"list_rel_paths_for_library_page",
|
||||
|_span| {
|
||||
use schema::image_exif::dsl::*;
|
||||
|
||||
image_exif
|
||||
.filter(library_id.eq(library_id_val))
|
||||
.order(id.asc())
|
||||
.select((id, rel_path))
|
||||
.limit(limit)
|
||||
.offset(offset)
|
||||
.load::<(i32, String)>(self.connection.lock().unwrap().deref_mut())
|
||||
.map_err(|_| anyhow::anyhow!("Query error"))
|
||||
},
|
||||
)
|
||||
.map_err(|_| DbError::new(DbErrorKind::QueryError))
|
||||
}
|
||||
|
||||
fn get_rows_missing_perceptual_hash(
|
||||
&mut self,
|
||||
context: &opentelemetry::Context,
|
||||
limit: i64,
|
||||
) -> Result<Vec<(i32, String)>, DbError> {
|
||||
trace_db_call(
|
||||
context,
|
||||
"query",
|
||||
"get_rows_missing_perceptual_hash",
|
||||
|_span| {
|
||||
use schema::image_exif::dsl::*;
|
||||
|
||||
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
|
||||
|
||||
// Image-only filter via extension. Videos and decode-failures
|
||||
// would always come back NULL otherwise and the binary would
|
||||
// grind through them on every run. The list mirrors the file
|
||||
// formats `image` 0.25 / `image_hasher` 3.x can decode.
|
||||
image_exif
|
||||
.filter(content_hash.is_not_null())
|
||||
.filter(phash_64.is_null())
|
||||
.filter(
|
||||
rel_path
|
||||
.like("%.jpg")
|
||||
.or(rel_path.like("%.jpeg"))
|
||||
.or(rel_path.like("%.JPG"))
|
||||
.or(rel_path.like("%.JPEG"))
|
||||
.or(rel_path.like("%.png"))
|
||||
.or(rel_path.like("%.PNG"))
|
||||
.or(rel_path.like("%.webp"))
|
||||
.or(rel_path.like("%.WEBP"))
|
||||
.or(rel_path.like("%.tif"))
|
||||
.or(rel_path.like("%.tiff"))
|
||||
.or(rel_path.like("%.TIF"))
|
||||
.or(rel_path.like("%.TIFF"))
|
||||
.or(rel_path.like("%.avif"))
|
||||
.or(rel_path.like("%.AVIF")),
|
||||
)
|
||||
.select((library_id, rel_path))
|
||||
.order(id.asc())
|
||||
.limit(limit)
|
||||
.load::<(i32, String)>(connection.deref_mut())
|
||||
.map_err(|_| anyhow::anyhow!("Query error"))
|
||||
},
|
||||
)
|
||||
.map_err(|_| DbError::new(DbErrorKind::QueryError))
|
||||
}
|
||||
|
||||
fn backfill_perceptual_hash(
|
||||
&mut self,
|
||||
context: &opentelemetry::Context,
|
||||
library_id_val: i32,
|
||||
rel_path_val: &str,
|
||||
phash_val: Option<i64>,
|
||||
dhash_val: Option<i64>,
|
||||
) -> Result<(), DbError> {
|
||||
trace_db_call(context, "update", "backfill_perceptual_hash", |_span| {
|
||||
use schema::image_exif::dsl::*;
|
||||
|
||||
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
|
||||
|
||||
diesel::update(
|
||||
image_exif
|
||||
.filter(library_id.eq(library_id_val))
|
||||
.filter(rel_path.eq(rel_path_val)),
|
||||
)
|
||||
.set((phash_64.eq(phash_val), dhash_64.eq(dhash_val)))
|
||||
.execute(connection.deref_mut())
|
||||
.map(|_| ())
|
||||
.map_err(|_| anyhow::anyhow!("Update error"))
|
||||
})
|
||||
.map_err(|_| DbError::new(DbErrorKind::UpdateError))
|
||||
}
|
||||
|
||||
fn list_duplicates_exact(
|
||||
&mut self,
|
||||
context: &opentelemetry::Context,
|
||||
library_id_filter: Option<i32>,
|
||||
include_resolved: bool,
|
||||
) -> Result<Vec<DuplicateRow>, DbError> {
|
||||
trace_db_call(context, "query", "list_duplicates_exact", |_span| {
|
||||
// Sub-select the content_hashes that appear more than once
|
||||
// (optionally library-scoped), then load the full member rows
|
||||
// for those hashes ordered by hash + library + path so the
|
||||
// caller can stream-group without buffering the full dataset.
|
||||
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
|
||||
|
||||
// Step 1: hashes with count > 1.
|
||||
let dup_hashes: Vec<String> = {
|
||||
use schema::image_exif::dsl::*;
|
||||
let mut q = image_exif
|
||||
.filter(content_hash.is_not_null())
|
||||
.group_by(content_hash)
|
||||
.select(content_hash.assume_not_null())
|
||||
.having(diesel::dsl::count_star().gt(1))
|
||||
.into_boxed();
|
||||
if let Some(lib) = library_id_filter {
|
||||
q = q.filter(library_id.eq(lib));
|
||||
}
|
||||
q.load::<String>(connection.deref_mut())
|
||||
.map_err(|_| anyhow::anyhow!("Query error"))?
|
||||
};
|
||||
|
||||
if dup_hashes.is_empty() {
|
||||
return Ok(Vec::new());
|
||||
}
|
||||
|
||||
// Step 2: every member row for those hashes.
|
||||
use schema::image_exif::dsl::*;
|
||||
let mut q = image_exif
|
||||
.filter(content_hash.eq_any(&dup_hashes))
|
||||
.select((
|
||||
library_id,
|
||||
rel_path,
|
||||
content_hash.assume_not_null(),
|
||||
size_bytes,
|
||||
date_taken,
|
||||
width,
|
||||
height,
|
||||
phash_64,
|
||||
dhash_64,
|
||||
duplicate_of_hash,
|
||||
duplicate_decided_at,
|
||||
))
|
||||
.order((content_hash.asc(), library_id.asc(), rel_path.asc()))
|
||||
.into_boxed();
|
||||
if let Some(lib) = library_id_filter {
|
||||
q = q.filter(library_id.eq(lib));
|
||||
}
|
||||
if !include_resolved {
|
||||
q = q.filter(duplicate_of_hash.is_null());
|
||||
}
|
||||
|
||||
let rows: Vec<(
|
||||
i32,
|
||||
String,
|
||||
String,
|
||||
Option<i64>,
|
||||
Option<i64>,
|
||||
Option<i32>,
|
||||
Option<i32>,
|
||||
Option<i64>,
|
||||
Option<i64>,
|
||||
Option<String>,
|
||||
Option<i64>,
|
||||
)> = q
|
||||
.load(connection.deref_mut())
|
||||
.map_err(|_| anyhow::anyhow!("Query error"))?;
|
||||
|
||||
Ok(rows
|
||||
.into_iter()
|
||||
.map(|r| DuplicateRow {
|
||||
library_id: r.0,
|
||||
rel_path: r.1,
|
||||
content_hash: r.2,
|
||||
size_bytes: r.3,
|
||||
date_taken: r.4,
|
||||
width: r.5,
|
||||
height: r.6,
|
||||
phash_64: r.7,
|
||||
dhash_64: r.8,
|
||||
duplicate_of_hash: r.9,
|
||||
duplicate_decided_at: r.10,
|
||||
})
|
||||
.collect())
|
||||
})
|
||||
.map_err(|_| DbError::new(DbErrorKind::QueryError))
|
||||
}
|
||||
|
||||
fn list_perceptual_candidates(
|
||||
&mut self,
|
||||
context: &opentelemetry::Context,
|
||||
library_id_filter: Option<i32>,
|
||||
include_resolved: bool,
|
||||
) -> Result<Vec<DuplicateRow>, DbError> {
|
||||
trace_db_call(context, "query", "list_perceptual_candidates", |_span| {
|
||||
use schema::image_exif::dsl::*;
|
||||
|
||||
let mut connection = self.connection.lock().expect("Unable to get ExifDao");
|
||||
|
||||
// For perceptual candidates we want one canonical row per
|
||||
// distinct content_hash — exact dups are clustered by the
|
||||
// exact-dup query and would only pollute the perceptual
|
||||
// graph with zero-distance edges. Diesel doesn't have a
|
||||
// clean `DISTINCT ON`, so we load every row and dedup
|
||||
// client-side keyed on content_hash. The result set is small
|
||||
// (only rows with a phash) and the cost is negligible vs
|
||||
// the BK-tree clustering that follows.
|
||||
let mut q = image_exif
|
||||
.filter(content_hash.is_not_null())
|
||||
.filter(phash_64.is_not_null())
|
||||
.select((
|
||||
library_id,
|
||||
rel_path,
|
||||
content_hash.assume_not_null(),
|
||||
size_bytes,
|
||||
date_taken,
|
||||
width,
|
||||
height,
|
||||
phash_64,
|
||||
dhash_64,
|
||||
duplicate_of_hash,
|
||||
duplicate_decided_at,
|
||||
))
|
||||
.order((content_hash.asc(), library_id.asc(), rel_path.asc()))
|
||||
.into_boxed();
|
||||
|
||||
if let Some(lib) = library_id_filter {
|
||||
q = q.filter(library_id.eq(lib));
|
||||
}
|
||||
if !include_resolved {
|
||||
q = q.filter(duplicate_of_hash.is_null());
|
||||
}
|
||||
|
||||
let rows: Vec<(
|
||||
i32,
|
||||
String,
|
||||
String,
|
||||
Option<i64>,
|
||||
Option<i64>,
|
||||
Option<i32>,
|
||||
Option<i32>,
|
||||
Option<i64>,
|
||||
Option<i64>,
|
||||
Option<String>,
|
||||
Option<i64>,
|
||||
)> = q
|
||||
.load(connection.deref_mut())
|
||||
.map_err(|_| anyhow::anyhow!("Query error"))?;
|
||||
|
||||
// Dedup keyed on content_hash, keeping the first occurrence
|
||||
// (deterministic by the SQL ORDER BY: lowest library_id,
|
||||
// then lexicographically smallest rel_path).
|
||||
let mut seen = std::collections::HashSet::new();
|
||||
let mut out = Vec::with_capacity(rows.len());
|
||||
for r in rows {
|
||||
if seen.insert(r.2.clone()) {
|
||||
out.push(DuplicateRow {
|
||||
library_id: r.0,
|
||||
rel_path: r.1,
|
||||
content_hash: r.2,
|
||||
size_bytes: r.3,
|
||||
date_taken: r.4,
|
||||
width: r.5,
|
||||
height: r.6,
|
||||
phash_64: r.7,
|
||||
dhash_64: r.8,
|
||||
duplicate_of_hash: r.9,
|
||||
duplicate_decided_at: r.10,
|
||||
});
|
||||
}
|
||||
}
|
||||
Ok(out)
|
||||
})
|
||||
.map_err(|_| DbError::new(DbErrorKind::QueryError))
|
||||
}
|
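The client-side `DISTINCT ON` workaround above reduces to a stable first-occurrence dedup over rows already sorted by the SQL `ORDER BY`. A minimal std-only sketch (hypothetical helper names, not the DAO code):

```rust
use std::collections::HashSet;

// Keep the first row per key from an already-sorted slice — a
// deterministic stand-in for SQL DISTINCT ON.
fn dedup_first_by_key(rows: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for (hash, path) in rows {
        // insert() returns false for keys we've already kept.
        if seen.insert(hash.to_string()) {
            out.push((hash.to_string(), path.to_string()));
        }
    }
    out
}

fn main() {
    // Rows pre-sorted by (hash, path): the canonical row wins.
    let rows = [("h1", "a.jpg"), ("h1", "b.jpg"), ("h2", "c.jpg")];
    let canon = dedup_first_by_key(&rows);
    assert_eq!(canon.len(), 2);
    assert_eq!(canon[0], ("h1".to_string(), "a.jpg".to_string()));
}
```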

    fn lookup_duplicate_row(
        &mut self,
        context: &opentelemetry::Context,
        library_id_val: i32,
        rel_path_val: &str,
    ) -> Result<Option<DuplicateRow>, DbError> {
        trace_db_call(context, "query", "lookup_duplicate_row", |_span| {
            use schema::image_exif::dsl::*;

            let mut connection = self.connection.lock().expect("Unable to get ExifDao");

            image_exif
                .filter(library_id.eq(library_id_val))
                .filter(rel_path.eq(rel_path_val))
                .filter(content_hash.is_not_null())
                .select((
                    library_id,
                    rel_path,
                    content_hash.assume_not_null(),
                    size_bytes,
                    date_taken,
                    width,
                    height,
                    phash_64,
                    dhash_64,
                    duplicate_of_hash,
                    duplicate_decided_at,
                ))
                .first::<(
                    i32,
                    String,
                    String,
                    Option<i64>,
                    Option<i64>,
                    Option<i32>,
                    Option<i32>,
                    Option<i64>,
                    Option<i64>,
                    Option<String>,
                    Option<i64>,
                )>(connection.deref_mut())
                .optional()
                .map(|opt| {
                    opt.map(|r| DuplicateRow {
                        library_id: r.0,
                        rel_path: r.1,
                        content_hash: r.2,
                        size_bytes: r.3,
                        date_taken: r.4,
                        width: r.5,
                        height: r.6,
                        phash_64: r.7,
                        dhash_64: r.8,
                        duplicate_of_hash: r.9,
                        duplicate_decided_at: r.10,
                    })
                })
                .map_err(|_| anyhow::anyhow!("Query error"))
        })
        .map_err(|_| DbError::new(DbErrorKind::QueryError))
    }

    fn set_duplicate_of(
        &mut self,
        context: &opentelemetry::Context,
        library_id_val: i32,
        rel_path_val: &str,
        survivor_hash: &str,
        decided_at: i64,
    ) -> Result<(), DbError> {
        trace_db_call(context, "update", "set_duplicate_of", |_span| {
            use schema::image_exif::dsl::*;

            let mut connection = self.connection.lock().expect("Unable to get ExifDao");

            diesel::update(
                image_exif
                    .filter(library_id.eq(library_id_val))
                    .filter(rel_path.eq(rel_path_val)),
            )
            .set((
                duplicate_of_hash.eq(survivor_hash),
                duplicate_decided_at.eq(decided_at),
            ))
            .execute(connection.deref_mut())
            .map(|_| ())
            .map_err(|_| anyhow::anyhow!("Update error"))
        })
        .map_err(|_| DbError::new(DbErrorKind::UpdateError))
    }

    fn clear_duplicate_of(
        &mut self,
        context: &opentelemetry::Context,
        library_id_val: i32,
        rel_path_val: &str,
    ) -> Result<(), DbError> {
        trace_db_call(context, "update", "clear_duplicate_of", |_span| {
            use schema::image_exif::dsl::*;

            let mut connection = self.connection.lock().expect("Unable to get ExifDao");

            diesel::update(
                image_exif
                    .filter(library_id.eq(library_id_val))
                    .filter(rel_path.eq(rel_path_val)),
            )
            .set((
                duplicate_of_hash.eq::<Option<String>>(None),
                duplicate_decided_at.eq::<Option<i64>>(None),
            ))
            .execute(connection.deref_mut())
            .map(|_| ())
            .map_err(|_| anyhow::anyhow!("Update error"))
        })
        .map_err(|_| DbError::new(DbErrorKind::UpdateError))
    }

    fn union_perceptual_tags(
        &mut self,
        context: &opentelemetry::Context,
        survivor_hash: &str,
        demoted_hash: &str,
        survivor_rel_path: &str,
    ) -> Result<(), DbError> {
        trace_db_call(context, "update", "union_perceptual_tags", |_span| {
            // INSERT OR IGNORE handles two relevant uniqueness paths:
            // - tagged_photo (rel_path, tag_id) is the historical key,
            //   so existing tag rows under the survivor's path collide
            //   and stay put.
            // - The (rel_path, tag_id) collision is the one that
            //   matters for idempotence; (content_hash, tag_id) at the
            //   bytes level isn't enforced by SQLite but the read path
            //   dedups on it, so an extra row would be cosmetic.
            // Tags whose rel_path differs are inserted, picking up the
            // survivor's content_hash so they live under the right bytes.
            let mut connection = self.connection.lock().expect("Unable to get ExifDao");

            diesel::sql_query(
                "INSERT OR IGNORE INTO tagged_photo (rel_path, tag_id, created_time, content_hash) \
                 SELECT ?, tag_id, strftime('%s','now'), ? \
                 FROM tagged_photo \
                 WHERE content_hash = ? \
                 AND tag_id NOT IN ( \
                     SELECT tag_id FROM tagged_photo WHERE content_hash = ? \
                 )",
            )
            .bind::<diesel::sql_types::Text, _>(survivor_rel_path)
            .bind::<diesel::sql_types::Text, _>(survivor_hash)
            .bind::<diesel::sql_types::Text, _>(demoted_hash)
            .bind::<diesel::sql_types::Text, _>(survivor_hash)
            .execute(connection.deref_mut())
            .map(|_| ())
            .map_err(|_| anyhow::anyhow!("Tag union error"))
        })
        .map_err(|_| DbError::new(DbErrorKind::UpdateError))
    }
}

#[cfg(test)]
@@ -1105,6 +1778,8 @@ mod exif_dao_tests {
                last_modified: 0,
                content_hash: None,
                size_bytes: None,
                phash_64: None,
                dhash_64: None,
            },
        )
        .expect("insert exif row");
@@ -1118,6 +1793,8 @@ mod exif_dao_tests {
            name: "archive",
            root_path: "/tmp/archive",
            created_at: 0,
            enabled: true,
            excluded_dirs: None,
        })
        .execute(&mut conn)
        .expect("seed second library");
@@ -1158,4 +1835,61 @@ mod exif_dao_tests {
        let lib1 = dao.get_all_with_date_taken(&ctx(), Some(1)).unwrap();
        assert_eq!(lib1, vec![("main/a.jpg".to_string(), 100)]);
    }

    #[test]
    fn query_by_exif_scopes_by_library_id() {
        let mut dao = setup_two_libraries();
        insert_row(&mut dao, 1, "main/a.jpg", Some(100));
        insert_row(&mut dao, 2, "archive/a.jpg", Some(200));

        // Union: both rows.
        let all = dao
            .query_by_exif(&ctx(), None, None, None, None, None, None, None)
            .unwrap();
        assert_eq!(all.len(), 2);

        // Scoped to lib 2: only archive row.
        let lib2 = dao
            .query_by_exif(&ctx(), Some(2), None, None, None, None, None, None)
            .unwrap();
        assert_eq!(lib2.len(), 1);
        assert_eq!(lib2[0].file_path, "archive/a.jpg");
        assert_eq!(lib2[0].library_id, 2);
    }

    #[test]
    fn get_exif_batch_scopes_by_library_id() {
        let mut dao = setup_two_libraries();
        // Same rel_path, different libraries — the cross-library duplicate
        // case the audit flagged.
        insert_row(&mut dao, 1, "shared/photo.jpg", Some(100));
        insert_row(&mut dao, 2, "shared/photo.jpg", Some(200));

        // None spans both libraries (legacy union behavior).
        let union = dao
            .get_exif_batch(&ctx(), None, &["shared/photo.jpg".to_string()])
            .unwrap();
        assert_eq!(union.len(), 2);

        // Some(2) returns only the archive row.
        let scoped = dao
            .get_exif_batch(&ctx(), Some(2), &["shared/photo.jpg".to_string()])
            .unwrap();
        assert_eq!(scoped.len(), 1);
        assert_eq!(scoped[0].library_id, 2);
        assert_eq!(scoped[0].date_taken, Some(200));
    }

    #[test]
    fn count_for_library_returns_per_library_count() {
        let mut dao = setup_two_libraries();
        insert_row(&mut dao, 1, "main/a.jpg", None);
        insert_row(&mut dao, 1, "main/b.jpg", None);
        insert_row(&mut dao, 2, "archive/a.jpg", None);

        assert_eq!(dao.count_for_library(&ctx(), 1).unwrap(), 2);
        assert_eq!(dao.count_for_library(&ctx(), 2).unwrap(), 1);
        // Unknown library: zero, no error.
        assert_eq!(dao.count_for_library(&ctx(), 999).unwrap(), 0);
    }
}

@@ -59,6 +59,10 @@ pub struct InsertImageExif {
    pub last_modified: i64,
    pub content_hash: Option<String>,
    pub size_bytes: Option<i64>,
    /// 64-bit pHash (DCT) packed as i64. NULL for videos and decode failures.
    pub phash_64: Option<i64>,
    /// 64-bit dHash (gradient). NULL for videos and decode failures.
    pub dhash_64: Option<i64>,
}

// Field order matches the post-migration column order in `image_exif`.
@@ -86,6 +90,14 @@ pub struct ImageExif {
    pub last_modified: i64,
    pub content_hash: Option<String>,
    pub size_bytes: Option<i64>,
    pub phash_64: Option<i64>,
    pub dhash_64: Option<i64>,
    /// When non-null, this row is a soft-marked duplicate of the file
    /// whose `content_hash` matches this value. The default `/photos`
    /// listing filters such rows out.
    pub duplicate_of_hash: Option<String>,
    /// Unix seconds at which the resolve was committed.
    pub duplicate_decided_at: Option<i64>,
}

#[derive(Insertable)]
@@ -108,6 +120,13 @@ pub struct InsertPhotoInsight {
    /// generation). Used downstream to filter out contaminated rows when
    /// assembling an unbiased training / evaluation set.
    pub fewshot_source_ids: Option<String>,
    /// Bytes-keyed identity. When present, this insight is considered
    /// to belong to the content rather than the path — see CLAUDE.md
    /// "Multi-library data model". The DAO populates this from
    /// `image_exif.content_hash` at insert time when known; rows
    /// inserted before the hash is available stay null and the
    /// reconciliation pass backfills them.
    pub content_hash: Option<String>,
}

#[derive(Serialize, Queryable, Clone, Debug)]
@@ -126,6 +145,7 @@ pub struct PhotoInsight {
    /// `"local"` (Ollama with images) | `"hybrid"` (local vision + OpenRouter chat).
    pub backend: String,
    pub fewshot_source_ids: Option<String>,
    pub content_hash: Option<String>,
}

// --- Libraries ---
@@ -136,6 +156,20 @@ pub struct LibraryRow {
    pub name: String,
    pub root_path: String,
    pub created_at: i64,
    /// Operator kill switch. `false` = the watcher skips this library
    /// entirely (no probe, no ingest, no maintenance) and orphan-GC
    /// treats it as out-of-scope for the all-online consensus rule.
    /// Toggle via SQL today — there is intentionally no HTTP endpoint
    /// for library mutation (see CLAUDE.md "Multi-library data model").
    pub enabled: bool,
    /// Per-library excluded paths/patterns, stored comma-separated
    /// (same shape as the global `EXCLUDED_DIRS` env var). NULL = no
    /// extra excludes for this library; the global env var still
    /// applies. The runtime `Library` struct parses this into a
    /// `Vec<String>` and the walker applies the union of (global,
    /// library) excludes when scanning. Use case: mount a parent
    /// directory while another library covers a child subtree.
    pub excluded_dirs: Option<String>,
}
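The excluded-dirs union described on `excluded_dirs` above can be sketched with a small std-only helper. `effective_excludes` is a hypothetical name for illustration; the real parsing lives in the runtime `Library` struct.

```rust
// Sketch of the (global, library) exclude union: both sources are
// comma-separated; entries are trimmed and de-duplicated in order.
fn effective_excludes(global: &str, library: Option<&str>) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for src in [Some(global), library].into_iter().flatten() {
        for part in src.split(',') {
            let p = part.trim();
            if !p.is_empty() && !out.iter().any(|e| e == p) {
                out.push(p.to_string());
            }
        }
    }
    out
}

fn main() {
    // Library-level excludes extend the global set; duplicates collapse.
    let v = effective_excludes(".thumbs,.trash", Some("archive/old,.thumbs"));
    assert_eq!(v, vec![".thumbs", ".trash", "archive/old"]);
    // NULL excluded_dirs: only the global env var applies.
    assert_eq!(effective_excludes(".thumbs", None), vec![".thumbs"]);
}
```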

#[derive(Insertable)]
@@ -144,6 +178,8 @@ pub struct InsertLibrary<'a> {
    pub name: &'a str,
    pub root_path: &'a str,
    pub created_at: i64,
    pub enabled: bool,
    pub excluded_dirs: Option<&'a str>,
}

// --- Knowledge memory models ---
382	src/database/reconcile.rs	Normal file
@@ -0,0 +1,382 @@
//! Reconciliation pass for hash-keyed derived data.
//!
//! As `backfill_unhashed_backlog` populates `image_exif.content_hash`
//! for legacy rows, we want the matching `tagged_photo` and
//! `photo_insights` rows — which were inserted before the hash was
//! known — to inherit the hash too. Otherwise reads keep falling back
//! to the rel_path path even when a hash is now available.
//!
//! Two passes:
//! 1. **Hash backfill** — for every `tagged_photo` / `photo_insights`
//!    row with NULL `content_hash`, look up the matching
//!    `image_exif.content_hash` and write it. SQL-only; idempotent;
//!    a no-op once everything is hashed.
//! 2. **Insight scalar merge** — when multiple `photo_insights` rows
//!    share a `content_hash` with `is_current = true`, only the
//!    earliest `generated_at` keeps `is_current = true` (per the
//!    "earliest wins" rule in CLAUDE.md → "Multi-library data
//!    model"). Others are demoted, not deleted, so they remain
//!    visible in history endpoints.
//!
//! Tags are set-valued under the policy (union on read), so there's no
//! analogous "collapse" pass — duplicate `(tag_id, content_hash)` rows
//! across libraries are harmless and correctly de-duped at read time
//! by the existing `DISTINCT` queries.
//!
//! The pass operates on the database alone — no filesystem access —
//! so it doesn't need the library availability gate.

// The lib doesn't call into this module directly — the watcher (in the
// bin) does. Dead-code analysis at the lib level can't see that, so
// suppress at the module level. Tests still exercise every function.
#![allow(dead_code)]

use diesel::prelude::*;
use diesel::sql_query;
use diesel::sqlite::SqliteConnection;
use log::{debug, info, warn};

/// Outcome of a reconciliation tick. Tracked so the watcher can log
/// progress when something changed and stay quiet when nothing did.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ReconcileStats {
    pub tagged_photo_hashes_filled: usize,
    pub photo_insights_hashes_filled: usize,
    pub photo_insights_demoted: usize,
}

impl ReconcileStats {
    pub fn changed(&self) -> bool {
        self.tagged_photo_hashes_filled > 0
            || self.photo_insights_hashes_filled > 0
            || self.photo_insights_demoted > 0
    }
}

/// Run the reconciliation pass. Idempotent — safe to call on every
/// watcher tick. Errors are logged but never propagated; reconciliation
/// is best-effort and a transient DB hiccup must not stall the watcher.
pub fn run(conn: &mut SqliteConnection) -> ReconcileStats {
    let mut stats = ReconcileStats::default();

    stats.tagged_photo_hashes_filled = match backfill_tagged_photo_hashes(conn) {
        Ok(n) => n,
        Err(e) => {
            warn!("reconcile: tagged_photo hash backfill failed: {:?}", e);
            0
        }
    };

    stats.photo_insights_hashes_filled = match backfill_photo_insights_hashes(conn) {
        Ok(n) => n,
        Err(e) => {
            warn!("reconcile: photo_insights hash backfill failed: {:?}", e);
            0
        }
    };

    stats.photo_insights_demoted = match collapse_insight_currents(conn) {
        Ok(n) => n,
        Err(e) => {
            warn!("reconcile: photo_insights scalar merge failed: {:?}", e);
            0
        }
    };

    if stats.changed() {
        info!(
            "reconcile: filled {} tagged_photo hash(es), {} photo_insights hash(es); demoted {} non-current insight row(s)",
            stats.tagged_photo_hashes_filled,
            stats.photo_insights_hashes_filled,
            stats.photo_insights_demoted,
        );
    } else {
        debug!("reconcile: no changes this tick");
    }

    stats
}

/// Populate `tagged_photo.content_hash` for any row that still has
/// NULL by joining on `rel_path` against `image_exif`. tagged_photo
/// doesn't carry `library_id`, so a path that exists under multiple
/// libraries with different content is genuinely ambiguous; we pick
/// any non-null hash for that path. Same trade-off as the migration
/// backfill — see `migrations/2026-05-01-000000_hash_keyed_derived_data`.
fn backfill_tagged_photo_hashes(conn: &mut SqliteConnection) -> QueryResult<usize> {
    sql_query(
        "UPDATE tagged_photo \
         SET content_hash = ( \
             SELECT content_hash FROM image_exif \
             WHERE image_exif.rel_path = tagged_photo.rel_path \
             AND image_exif.content_hash IS NOT NULL \
             LIMIT 1 \
         ) \
         WHERE content_hash IS NULL \
         AND EXISTS ( \
             SELECT 1 FROM image_exif \
             WHERE image_exif.rel_path = tagged_photo.rel_path \
             AND image_exif.content_hash IS NOT NULL \
         )",
    )
    .execute(conn)
}

/// Populate `photo_insights.content_hash` from `image_exif`, keyed on
/// `(library_id, rel_path)`. Unambiguous because photo_insights carries
/// library_id.
fn backfill_photo_insights_hashes(conn: &mut SqliteConnection) -> QueryResult<usize> {
    sql_query(
        "UPDATE photo_insights \
         SET content_hash = ( \
             SELECT content_hash FROM image_exif \
             WHERE image_exif.library_id = photo_insights.library_id \
             AND image_exif.rel_path = photo_insights.rel_path \
             AND image_exif.content_hash IS NOT NULL \
             LIMIT 1 \
         ) \
         WHERE content_hash IS NULL \
         AND EXISTS ( \
             SELECT 1 FROM image_exif \
             WHERE image_exif.library_id = photo_insights.library_id \
             AND image_exif.rel_path = photo_insights.rel_path \
             AND image_exif.content_hash IS NOT NULL \
         )",
    )
    .execute(conn)
}

/// Scalar-merge step: when multiple rows share a `content_hash` and
/// claim `is_current = true`, demote all but the earliest by
/// `generated_at` (ties broken by lowest id, deterministic).
///
/// Demoted rows keep their data — only `is_current` flips. Clients that
/// hit `/insights/history` still see the full sequence; only the
/// "current" pointer is unique per hash.
fn collapse_insight_currents(conn: &mut SqliteConnection) -> QueryResult<usize> {
    sql_query(
        "UPDATE photo_insights \
         SET is_current = 0 \
         WHERE is_current = 1 \
         AND content_hash IS NOT NULL \
         AND id NOT IN ( \
             SELECT MIN(p2.id) FROM photo_insights p2 \
             WHERE p2.is_current = 1 \
             AND p2.content_hash = photo_insights.content_hash \
             AND p2.generated_at = ( \
                 SELECT MIN(p3.generated_at) FROM photo_insights p3 \
                 WHERE p3.is_current = 1 \
                 AND p3.content_hash = p2.content_hash \
             ) \
         )",
    )
    .execute(conn)
}
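The earliest-wins demotion performed by that SQL can be mirrored in a std-only sketch, which makes the tie-break order explicit. `Insight` and `collapse_currents` here are hypothetical illustration names, not the module's API.

```rust
use std::collections::HashMap;

// Minimal stand-in for a photo_insights row.
struct Insight {
    id: i32,
    content_hash: String,
    generated_at: i64,
    is_current: bool,
}

// Among current rows sharing a hash, keep is_current only on the row
// with the smallest (generated_at, id); return how many were demoted.
fn collapse_currents(rows: &mut [Insight]) -> usize {
    // Find the winning (generated_at, id) per hash among current rows.
    let mut winner: HashMap<String, (i64, i32)> = HashMap::new();
    for r in rows.iter().filter(|r| r.is_current) {
        let key = (r.generated_at, r.id);
        winner
            .entry(r.content_hash.clone())
            .and_modify(|w| {
                if key < *w {
                    *w = key;
                }
            })
            .or_insert(key);
    }
    // Demote everyone else; count the flips.
    let mut demoted = 0;
    for r in rows.iter_mut().filter(|r| r.is_current) {
        if winner[&r.content_hash] != (r.generated_at, r.id) {
            r.is_current = false;
            demoted += 1;
        }
    }
    demoted
}

fn main() {
    let mut rows = vec![
        Insight { id: 1, content_hash: "h".into(), generated_at: 50, is_current: true },
        Insight { id: 2, content_hash: "h".into(), generated_at: 10, is_current: true },
        Insight { id: 3, content_hash: "h".into(), generated_at: 10, is_current: true },
    ];
    let demoted = collapse_currents(&mut rows);
    // Earliest generated_at wins; the tie is broken by lowest id, so
    // row 2 stays current and rows 1 and 3 are demoted.
    assert_eq!(demoted, 2);
    assert!(rows.iter().filter(|r| r.is_current).all(|r| r.id == 2));
}
```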

#[cfg(test)]
mod tests {
    use super::*;
    use crate::database::test::in_memory_db_connection;

    fn ensure_library(conn: &mut SqliteConnection, library_id: i32) {
        // Migration seeds library id=1; tests that reference id>1 must
        // create those rows themselves, otherwise FK enforcement (added
        // in the tags-edit migration) rejects image_exif inserts.
        diesel::sql_query(
            "INSERT OR IGNORE INTO libraries (id, name, root_path, created_at) \
             VALUES (?, 'test-' || ?, '/tmp/test-' || ?, 0)",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .execute(conn)
        .unwrap();
    }

    fn insert_image_exif(
        conn: &mut SqliteConnection,
        library_id: i32,
        rel_path: &str,
        content_hash: Option<&str>,
    ) {
        use crate::database::schema::image_exif;
        ensure_library(conn, library_id);
        diesel::sql_query(
            "INSERT INTO image_exif (library_id, rel_path, created_time, last_modified, content_hash) \
             VALUES (?, ?, 0, 0, ?)",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Text, _>(rel_path)
        .bind::<diesel::sql_types::Nullable<diesel::sql_types::Text>, _>(content_hash)
        .execute(conn)
        .unwrap();
        // Keep clippy happy that the import is used.
        let _ = image_exif::table;
    }

    fn insert_tagged_photo(conn: &mut SqliteConnection, rel_path: &str, tag_id: i32) {
        diesel::sql_query(
            "INSERT INTO tagged_photo (rel_path, tag_id, created_time) VALUES (?, ?, 0)",
        )
        .bind::<diesel::sql_types::Text, _>(rel_path)
        .bind::<diesel::sql_types::Integer, _>(tag_id)
        .execute(conn)
        .unwrap();
    }

    fn insert_tag(conn: &mut SqliteConnection, id: i32, name: &str) {
        diesel::sql_query("INSERT INTO tags (id, name, created_time) VALUES (?, ?, 0)")
            .bind::<diesel::sql_types::Integer, _>(id)
            .bind::<diesel::sql_types::Text, _>(name)
            .execute(conn)
            .unwrap();
    }

    fn insert_insight(
        conn: &mut SqliteConnection,
        library_id: i32,
        rel_path: &str,
        generated_at: i64,
        is_current: bool,
    ) -> i32 {
        ensure_library(conn, library_id);
        diesel::sql_query(
            "INSERT INTO photo_insights (library_id, rel_path, title, summary, generated_at, model_version, is_current, backend) \
             VALUES (?, ?, 't', 's', ?, 'v', ?, 'local')",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Text, _>(rel_path)
        .bind::<diesel::sql_types::BigInt, _>(generated_at)
        .bind::<diesel::sql_types::Bool, _>(is_current)
        .execute(conn)
        .unwrap();
        diesel::sql_query("SELECT last_insert_rowid() AS id")
            .get_result::<TestId>(conn)
            .map(|r| r.id)
            .unwrap()
    }

    #[derive(QueryableByName)]
    struct TestId {
        #[diesel(sql_type = diesel::sql_types::Integer)]
        id: i32,
    }

    #[derive(QueryableByName, Debug)]
    struct HashOnly {
        #[diesel(sql_type = diesel::sql_types::Nullable<diesel::sql_types::Text>)]
        content_hash: Option<String>,
    }

    #[derive(QueryableByName, Debug)]
    struct CurrentRow {
        #[diesel(sql_type = diesel::sql_types::Integer)]
        id: i32,
        #[diesel(sql_type = diesel::sql_types::Bool)]
        is_current: bool,
    }

    #[test]
    fn backfill_fills_tagged_photo_hash_when_image_exif_has_one() {
        let mut conn = in_memory_db_connection();
        insert_tag(&mut conn, 1, "vacation");
        insert_tagged_photo(&mut conn, "trip/IMG.jpg", 1);
        // No image_exif row yet — backfill no-op.
        let stats = run(&mut conn);
        assert_eq!(stats.tagged_photo_hashes_filled, 0);

        // image_exif row appears with a hash; next reconcile fills it.
        insert_image_exif(&mut conn, 1, "trip/IMG.jpg", Some("hashabc"));
        let stats = run(&mut conn);
        assert_eq!(stats.tagged_photo_hashes_filled, 1);

        let row = diesel::sql_query(
            "SELECT content_hash FROM tagged_photo WHERE rel_path = 'trip/IMG.jpg'",
        )
        .get_result::<HashOnly>(&mut conn)
        .unwrap();
        assert_eq!(row.content_hash.as_deref(), Some("hashabc"));

        // Idempotent: a second run is a no-op.
        let stats = run(&mut conn);
        assert_eq!(stats.tagged_photo_hashes_filled, 0);
    }

    #[test]
    fn backfill_skips_tagged_photo_when_image_exif_has_no_hash() {
        let mut conn = in_memory_db_connection();
        insert_tag(&mut conn, 1, "vacation");
        insert_tagged_photo(&mut conn, "trip/IMG.jpg", 1);
        // image_exif exists but its hash is null.
        insert_image_exif(&mut conn, 1, "trip/IMG.jpg", None);

        let stats = run(&mut conn);
        assert_eq!(stats.tagged_photo_hashes_filled, 0);
    }

    #[test]
    fn backfill_fills_photo_insights_hash_scoped_by_library() {
        let mut conn = in_memory_db_connection();
        // Row in library 1 only — must not be filled by a hash from
        // library 2's same-rel_path entry.
        insert_image_exif(&mut conn, 1, "shared.jpg", Some("hash-lib1"));
        let id1 = insert_insight(&mut conn, 1, "shared.jpg", 100, true);

        let stats = run(&mut conn);
        assert_eq!(stats.photo_insights_hashes_filled, 1);

        let row = diesel::sql_query("SELECT content_hash FROM photo_insights WHERE id = ?")
            .bind::<diesel::sql_types::Integer, _>(id1)
|
||||
.get_result::<HashOnly>(&mut conn)
|
||||
.unwrap();
|
||||
assert_eq!(row.content_hash.as_deref(), Some("hash-lib1"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn collapse_keeps_earliest_is_current_per_hash() {
|
||||
let mut conn = in_memory_db_connection();
|
||||
// Two libraries, same content_hash via image_exif. Insights
|
||||
// were generated independently in each library, both currently
|
||||
// is_current = true. The earlier one wins.
|
||||
insert_image_exif(&mut conn, 1, "a.jpg", Some("h1"));
|
||||
insert_image_exif(&mut conn, 2, "a.jpg", Some("h1"));
|
||||
let earlier = insert_insight(&mut conn, 1, "a.jpg", 100, true);
|
||||
let later = insert_insight(&mut conn, 2, "a.jpg", 200, true);
|
||||
|
||||
// First pass fills the content_hash; second collapses.
|
||||
let stats = run(&mut conn);
|
||||
assert_eq!(stats.photo_insights_hashes_filled, 2);
|
||||
assert_eq!(stats.photo_insights_demoted, 1);
|
||||
|
||||
let rows = diesel::sql_query("SELECT id, is_current FROM photo_insights ORDER BY id")
|
||||
.get_results::<CurrentRow>(&mut conn)
|
||||
.unwrap();
|
||||
let earlier_row = rows.iter().find(|r| r.id == earlier).unwrap();
|
||||
let later_row = rows.iter().find(|r| r.id == later).unwrap();
|
||||
assert!(
|
||||
earlier_row.is_current,
|
||||
"earlier insight should remain current"
|
||||
);
|
||||
assert!(!later_row.is_current, "later insight should be demoted");
|
||||
|
||||
// Idempotent.
|
||||
let stats = run(&mut conn);
|
||||
assert_eq!(stats.photo_insights_demoted, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn collapse_does_not_demote_a_solo_current_row() {
|
||||
let mut conn = in_memory_db_connection();
|
||||
insert_image_exif(&mut conn, 1, "a.jpg", Some("h1"));
|
||||
let solo = insert_insight(&mut conn, 1, "a.jpg", 100, true);
|
||||
|
||||
let stats = run(&mut conn);
|
||||
assert_eq!(stats.photo_insights_demoted, 0);
|
||||
|
||||
let row = diesel::sql_query("SELECT id, is_current FROM photo_insights WHERE id = ?")
|
||||
.bind::<diesel::sql_types::Integer, _>(solo)
|
||||
.get_result::<CurrentRow>(&mut conn)
|
||||
.unwrap();
|
||||
assert!(row.is_current);
|
||||
}
|
||||
}
|
||||
@@ -121,6 +121,10 @@ diesel::table! {
        last_modified -> BigInt,
        content_hash -> Nullable<Text>,
        size_bytes -> Nullable<BigInt>,
        phash_64 -> Nullable<BigInt>,
        dhash_64 -> Nullable<BigInt>,
        duplicate_of_hash -> Nullable<Text>,
        duplicate_decided_at -> Nullable<BigInt>,
    }
}

@@ -130,6 +134,8 @@ diesel::table! {
        name -> Text,
        root_path -> Text,
        created_at -> BigInt,
        enabled -> Bool,
        excluded_dirs -> Nullable<Text>,
    }
}

@@ -178,6 +184,7 @@ diesel::table! {
        approved -> Nullable<Bool>,
        backend -> Text,
        fewshot_source_ids -> Nullable<Text>,
        content_hash -> Nullable<Text>,
    }
}

@@ -199,6 +206,7 @@ diesel::table! {
        rel_path -> Text,
        tag_id -> Integer,
        created_time -> BigInt,
        content_hash -> Nullable<Text>,
    }
}

893 src/duplicates.rs Normal file
@@ -0,0 +1,893 @@
//! Duplicate detection surface — exact (blake3) and perceptual
//! (pHash + Hamming) groups, plus the soft-mark resolve flow that
//! Apollo's DUPLICATES modal drives.
//!
//! All routes require auth (Claims). Endpoints:
//!
//! - `GET /duplicates/exact?library=&include_resolved=` — count>1 byte-identical groups.
//! - `GET /duplicates/perceptual?library=&threshold=&include_resolved=` — Hamming-clustered groups.
//! - `POST /duplicates/resolve` — soft-mark demoted siblings.
//! - `POST /duplicates/unresolve` — clear a prior soft-mark.
//!
//! Perceptual clustering caches the BK-tree result for 5 minutes so
//! repeated opens of the modal don't re-cluster the whole library.
//! Cache invalidation is best-effort: resolve/unresolve clear the
//! cache, but new files arriving via the watcher don't (the next
//! 5-minute window picks them up). For a single-user personal tool
//! that's the right trade-off.

use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

use actix_web::{App, HttpRequest, HttpResponse, Responder, dev::ServiceFactory, web};
use bk_tree::{BKTree, Metric};
use lazy_static::lazy_static;
use opentelemetry::trace::{TraceContextExt, Tracer};
use serde::{Deserialize, Serialize};

use crate::data::Claims;
use crate::database::{DuplicateRow, ExifDao};
use crate::libraries;
use crate::otel::{extract_context_from_request, global_tracer};
use crate::state::AppState;

// ── Cache ────────────────────────────────────────────────────────────────

const PERCEPTUAL_CACHE_TTL: Duration = Duration::from_secs(300);

#[derive(Clone)]
struct PerceptualCacheEntry {
    /// Cache key: (library_id, threshold, include_resolved). `library_id`
    /// is `None` for "all libraries". Cluster output is the same shape we
    /// return on the wire so we can serve cached requests with zero work.
    library_id: Option<i32>,
    threshold: u32,
    include_resolved: bool,
    computed_at: Instant,
    groups: Vec<DuplicateGroup>,
}

lazy_static! {
    static ref PERCEPTUAL_CACHE: Mutex<Option<PerceptualCacheEntry>> = Mutex::new(None);
}

/// Drop the perceptual-cluster cache. Called from `resolve`/`unresolve`
/// so the next modal open reflects the soft-mark change immediately.
fn invalidate_perceptual_cache() {
    if let Ok(mut guard) = PERCEPTUAL_CACHE.lock() {
        *guard = None;
    }
}

// ── Wire shapes ──────────────────────────────────────────────────────────

#[derive(Serialize, Debug, Clone)]
pub struct DuplicateMember {
    pub library_id: i32,
    pub rel_path: String,
    pub content_hash: String,
    pub size_bytes: Option<i64>,
    pub date_taken: Option<i64>,
    pub width: Option<i32>,
    pub height: Option<i32>,
    pub duplicate_of_hash: Option<String>,
    pub duplicate_decided_at: Option<i64>,
}

impl From<DuplicateRow> for DuplicateMember {
    fn from(r: DuplicateRow) -> Self {
        Self {
            library_id: r.library_id,
            rel_path: r.rel_path,
            content_hash: r.content_hash,
            size_bytes: r.size_bytes,
            date_taken: r.date_taken,
            width: r.width,
            height: r.height,
            duplicate_of_hash: r.duplicate_of_hash,
            duplicate_decided_at: r.duplicate_decided_at,
        }
    }
}

#[derive(Serialize, Debug, Clone)]
#[serde(rename_all = "lowercase")]
pub enum DuplicateKind {
    Exact,
    Perceptual,
}

#[derive(Serialize, Debug, Clone)]
pub struct DuplicateGroup {
    pub kind: DuplicateKind,
    /// Representative content_hash. For exact groups, the shared hash
    /// (every member has the same one). For perceptual groups, an
    /// arbitrary cluster member's hash, used only as a stable id for
    /// the UI to key off.
    pub representative_hash: String,
    pub members: Vec<DuplicateMember>,
}

#[derive(Deserialize, Debug)]
pub struct ListDuplicatesQuery {
    pub library: Option<String>,
    #[serde(default)]
    pub include_resolved: Option<bool>,
    /// Perceptual only — Hamming-distance threshold. Ignored on the
    /// exact endpoint. Defaults to 8 (~12% similarity tolerance, the
    /// sweet spot for resized/recompressed copies).
    #[serde(default)]
    pub threshold: Option<u32>,
}

#[derive(Deserialize, Debug)]
pub struct DuplicateMemberRef {
    pub library_id: i32,
    pub rel_path: String,
}

#[derive(Deserialize, Debug)]
pub struct ResolveDuplicatesReq {
    pub survivor: DuplicateMemberRef,
    pub demoted: Vec<DuplicateMemberRef>,
}

#[derive(Serialize, Debug)]
pub struct ResolveResponse {
    pub resolved_count: usize,
}

#[derive(Deserialize, Debug)]
pub struct UnresolveDuplicateReq {
    pub library_id: i32,
    pub rel_path: String,
}

// ── Handlers ─────────────────────────────────────────────────────────────

async fn list_exact_handler(
    _: Claims,
    request: HttpRequest,
    app_state: web::Data<AppState>,
    query: web::Query<ListDuplicatesQuery>,
    exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
    let context = extract_context_from_request(&request);
    let span = global_tracer().start_with_context("duplicates.list_exact", &context);
    let span_context = opentelemetry::Context::current_with_span(span);

    let library_id = libraries::resolve_library_param(&app_state, query.library.as_deref())
        .ok()
        .flatten()
        .map(|l| l.id);
    let include_resolved = query.include_resolved.unwrap_or(false);

    let rows = {
        let mut dao = exif_dao.lock().expect("exif dao lock");
        match dao.list_duplicates_exact(&span_context, library_id, include_resolved) {
            Ok(rows) => rows,
            Err(e) => {
                return HttpResponse::InternalServerError().body(format!("{:?}", e));
            }
        }
    };

    let groups = group_exact(rows);
    HttpResponse::Ok().json(GroupsResponse { groups })
}

async fn list_perceptual_handler(
    _: Claims,
    request: HttpRequest,
    app_state: web::Data<AppState>,
    query: web::Query<ListDuplicatesQuery>,
    exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
    let context = extract_context_from_request(&request);
    let span = global_tracer().start_with_context("duplicates.list_perceptual", &context);
    let span_context = opentelemetry::Context::current_with_span(span);

    let library_id = libraries::resolve_library_param(&app_state, query.library.as_deref())
        .ok()
        .flatten()
        .map(|l| l.id);
    let threshold = query.threshold.unwrap_or(8).clamp(0, 32);
    let include_resolved = query.include_resolved.unwrap_or(false);

    // Cache hit?
    if let Ok(guard) = PERCEPTUAL_CACHE.lock()
        && let Some(entry) = guard.as_ref()
        && entry.library_id == library_id
        && entry.threshold == threshold
        && entry.include_resolved == include_resolved
        && entry.computed_at.elapsed() < PERCEPTUAL_CACHE_TTL
    {
        return HttpResponse::Ok().json(GroupsResponse {
            groups: entry.groups.clone(),
        });
    }

    let rows = {
        let mut dao = exif_dao.lock().expect("exif dao lock");
        match dao.list_perceptual_candidates(&span_context, library_id, include_resolved) {
            Ok(rows) => rows,
            Err(e) => {
                return HttpResponse::InternalServerError().body(format!("{:?}", e));
            }
        }
    };

    let groups = cluster_perceptual(rows, threshold);

    if let Ok(mut guard) = PERCEPTUAL_CACHE.lock() {
        *guard = Some(PerceptualCacheEntry {
            library_id,
            threshold,
            include_resolved,
            computed_at: Instant::now(),
            groups: groups.clone(),
        });
    }

    HttpResponse::Ok().json(GroupsResponse { groups })
}

async fn resolve_handler(
    _: Claims,
    request: HttpRequest,
    body: web::Json<ResolveDuplicatesReq>,
    exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
    let context = extract_context_from_request(&request);
    let span = global_tracer().start_with_context("duplicates.resolve", &context);
    let span_context = opentelemetry::Context::current_with_span(span);

    if body.demoted.is_empty() {
        return HttpResponse::BadRequest().body("demoted list is empty");
    }

    let mut dao = exif_dao.lock().expect("exif dao lock");

    // Resolve the survivor → its content_hash, plus the canonical rel_path
    // we'll use as the destination for any tag-union INSERTs.
    let survivor = match dao.lookup_duplicate_row(
        &span_context,
        body.survivor.library_id,
        &body.survivor.rel_path,
    ) {
        Ok(Some(row)) => row,
        Ok(None) => return HttpResponse::NotFound().body("survivor not found"),
        Err(e) => return HttpResponse::InternalServerError().body(format!("{:?}", e)),
    };

    // The survivor must not itself be soft-marked — otherwise the modal is
    // pointing at a row we've already demoted, which would create a chain.
    if survivor.duplicate_of_hash.is_some() {
        return HttpResponse::Conflict().body("survivor is itself soft-marked as a duplicate");
    }

    let now = chrono::Utc::now().timestamp();
    let mut resolved_count = 0usize;

    for member_ref in &body.demoted {
        let demoted = match dao.lookup_duplicate_row(
            &span_context,
            member_ref.library_id,
            &member_ref.rel_path,
        ) {
            Ok(Some(row)) => row,
            Ok(None) => {
                log::warn!(
                    "duplicates.resolve: skipping unknown demoted ({}, {})",
                    member_ref.library_id,
                    member_ref.rel_path
                );
                continue;
            }
            Err(e) => {
                return HttpResponse::InternalServerError().body(format!("{:?}", e));
            }
        };

        // Survivor and demoted must not be the same row (that would set
        // duplicate_of_hash to its own hash — a self-referential mark).
        if demoted.library_id == survivor.library_id && demoted.rel_path == survivor.rel_path {
            continue;
        }

        // For perceptual dups (different content_hash), union the
        // demoted's tag set onto the survivor before flipping the
        // soft-mark. For exact dups (same content_hash), tags are
        // already shared at the bytes layer — the union is a no-op.
        if demoted.content_hash != survivor.content_hash
            && let Err(e) = dao.union_perceptual_tags(
                &span_context,
                &survivor.content_hash,
                &demoted.content_hash,
                &survivor.rel_path,
            )
        {
            log::warn!(
                "duplicates.resolve: tag union failed for {}: {:?}",
                demoted.rel_path,
                e
            );
            // Continue with the soft-mark anyway — losing tag
            // continuity is recoverable (unresolve restores the
            // demoted row's grid presence, and the original tags
            // never moved off the demoted hash).
        }

        if let Err(e) = dao.set_duplicate_of(
            &span_context,
            demoted.library_id,
            &demoted.rel_path,
            &survivor.content_hash,
            now,
        ) {
            return HttpResponse::InternalServerError().body(format!("{:?}", e));
        }

        resolved_count += 1;
    }

    drop(dao);
    invalidate_perceptual_cache();

    HttpResponse::Ok().json(ResolveResponse { resolved_count })
}

async fn unresolve_handler(
    _: Claims,
    request: HttpRequest,
    body: web::Json<UnresolveDuplicateReq>,
    exif_dao: web::Data<Mutex<Box<dyn ExifDao>>>,
) -> impl Responder {
    let context = extract_context_from_request(&request);
    let span = global_tracer().start_with_context("duplicates.unresolve", &context);
    let span_context = opentelemetry::Context::current_with_span(span);

    let mut dao = exif_dao.lock().expect("exif dao lock");
    if let Err(e) = dao.clear_duplicate_of(&span_context, body.library_id, &body.rel_path) {
        return HttpResponse::InternalServerError().body(format!("{:?}", e));
    }

    drop(dao);
    invalidate_perceptual_cache();

    HttpResponse::Ok().finish()
}

// ── Grouping / clustering ────────────────────────────────────────────────

#[derive(Serialize, Debug)]
struct GroupsResponse {
    groups: Vec<DuplicateGroup>,
}

fn group_exact(rows: Vec<DuplicateRow>) -> Vec<DuplicateGroup> {
    let mut by_hash: HashMap<String, Vec<DuplicateRow>> = HashMap::new();
    for row in rows {
        by_hash
            .entry(row.content_hash.clone())
            .or_default()
            .push(row);
    }
    let mut groups: Vec<DuplicateGroup> = by_hash
        .into_iter()
        .filter(|(_, members)| members.len() > 1)
        .map(|(hash, members)| DuplicateGroup {
            kind: DuplicateKind::Exact,
            representative_hash: hash,
            members: members.into_iter().map(DuplicateMember::from).collect(),
        })
        .collect();
    // Largest groups first (most reward per click), then deterministic.
    groups.sort_by(|a, b| {
        b.members
            .len()
            .cmp(&a.members.len())
            .then_with(|| a.representative_hash.cmp(&b.representative_hash))
    });
    groups
}

/// Bits set in a "useful" perceptual hash. Real photographic content
/// produces ~50/50 bit distributions; anything outside the [16, 48]
/// band is low-entropy structure (uniform skies, black frames,
/// monochrome scans, faded film) where pHash collapses to near-uniform
/// values that sit Hamming-close to hundreds of unrelated images. The
/// 8/56 band that shipped first was too permissive — even at
/// threshold=4 the false-positive cluster persisted.
const MIN_INFORMATIVE_POPCOUNT: u32 = 16;
const MAX_INFORMATIVE_POPCOUNT: u32 = 64 - MIN_INFORMATIVE_POPCOUNT;

#[inline]
fn is_informative_hash(h: i64) -> bool {
    let pop = (h as u64).count_ones();
    (MIN_INFORMATIVE_POPCOUNT..=MAX_INFORMATIVE_POPCOUNT).contains(&pop)
}

/// dHash gets a stricter threshold than pHash. pHash is the
/// candidate-discovery signal (BK-tree neighbourhood lookup); dHash
/// is the validation signal that has to actively agree before we
/// union. Splitting the budget asymmetrically means a real near-dup
/// (which scores well on both) survives while an incidental pHash
/// collision (a uniform-content false positive) gets vetoed.
///
/// Floor of 2 so threshold=4 still allows a 1-bit jitter in dHash —
/// genuine resampling can flip a low-frequency gradient bit even
/// when the visual content is identical.
#[inline]
fn dhash_threshold(phash_threshold: u32) -> u32 {
    (phash_threshold / 2).max(2)
}

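The two gates above (the informative-popcount band and the halved dHash budget) are small enough to check in isolation. A minimal standalone sketch, with the constants copied from this file:

```rust
// Standalone copy of the two clustering gates, for illustration only.
const MIN_INFORMATIVE_POPCOUNT: u32 = 16;
const MAX_INFORMATIVE_POPCOUNT: u32 = 64 - MIN_INFORMATIVE_POPCOUNT;

// A pHash is "informative" when its popcount sits in the [16, 48] band.
fn is_informative_hash(h: i64) -> bool {
    let pop = (h as u64).count_ones();
    (MIN_INFORMATIVE_POPCOUNT..=MAX_INFORMATIVE_POPCOUNT).contains(&pop)
}

// dHash gets half the pHash budget, floored at 2 bits of jitter.
fn dhash_threshold(phash_threshold: u32) -> u32 {
    (phash_threshold / 2).max(2)
}

fn main() {
    // Default pHash budget of 8 leaves a dHash budget of 4.
    assert_eq!(dhash_threshold(8), 4);
    // Floor: even a zero pHash budget tolerates 2 bits of dHash jitter.
    assert_eq!(dhash_threshold(0), 2);
    // Degenerate all-zero / all-one hashes fall outside the band...
    assert!(!is_informative_hash(0)); // popcount 0
    assert!(!is_informative_hash(-1)); // popcount 64
    // ...while a balanced bit pattern survives the filter.
    assert!(is_informative_hash(0x55AA_55AA_55AA_55AAu64 as i64)); // popcount 32
    println!("ok");
}
```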
/// Single-link cluster the input rows by Hamming distance over their
/// pHash, with `threshold` as the maximum distance for an edge. Rows
/// without a pHash, or with a degenerate (low-entropy) pHash, are
/// excluded — they'd chain together unrelated images.
///
/// Two-signal validation: the BK-tree gives candidate pairs cheaply,
/// then we additionally require dHash agreement before unioning. pHash
/// alone is too permissive; pairing it with dHash collapses the false-
/// positive cluster significantly (the DCT and gradient signatures of
/// a real near-dup both stay close, but a spurious pHash collision on
/// uniform images doesn't survive the dHash check).
///
/// Implementation: BK-tree neighbourhood lookup per row, union-find
/// over the validated edges. O(N log N) instead of the O(N²) naive
/// pairwise scan; on a 1.26M-row library that's the difference between
/// "responds in 1.5 s" and "responds in 25 minutes".
fn cluster_perceptual(rows: Vec<DuplicateRow>, threshold: u32) -> Vec<DuplicateGroup> {
    let candidates: Vec<DuplicateRow> = rows
        .into_iter()
        .filter(|r| r.phash_64.is_some_and(is_informative_hash))
        .collect();
    if candidates.len() < 2 {
        return Vec::new();
    }

    // Build a BK-tree keyed on (phash_u64, index-in-candidates).
    let mut tree: BKTree<HashKey, HammingMetric> = BKTree::new(HammingMetric);
    for (idx, row) in candidates.iter().enumerate() {
        if let Some(p) = row.phash_64 {
            tree.add(HashKey {
                phash: p as u64,
                idx,
            });
        }
    }

    // Union-find over edges within `threshold`. For a candidate pair
    // surfaced by the pHash BK-tree, require dHash within a *stricter*
    // threshold (`dhash_threshold(threshold)`) before unioning. pHash
    // agreement on low-entropy structure can be incidental; pHash
    // agreement AND dHash within roughly half that distance is a
    // strong near-dup signal. dHash missing on either side → reject
    // (was: trust pHash alone). Missing dHash means we can't validate
    // the candidate, and the false-positive cost outweighs the rare
    // case of a partial backfill.
    let dhash_max = dhash_threshold(threshold);
    let mut uf = UnionFind::new(candidates.len());
    for (idx, row) in candidates.iter().enumerate() {
        let Some(p) = row.phash_64 else { continue };
        let key = HashKey {
            phash: p as u64,
            idx,
        };
        for (_, neighbour) in tree.find(&key, threshold) {
            if neighbour.idx == idx {
                continue;
            }
            let other = &candidates[neighbour.idx];
            let dhash_ok = match (row.dhash_64, other.dhash_64) {
                (Some(a), Some(b)) => {
                    (a as u64 ^ b as u64).count_ones() <= dhash_max
                        && is_informative_hash(a)
                        && is_informative_hash(b)
                }
                _ => false,
            };
            if dhash_ok {
                uf.union(idx, neighbour.idx);
            }
        }
    }

    // Bucket by union-find root.
    let mut by_root: HashMap<usize, Vec<DuplicateRow>> = HashMap::new();
    for (idx, row) in candidates.into_iter().enumerate() {
        let root = uf.find(idx);
        by_root.entry(root).or_default().push(row);
    }

    // Medoid-validate each cluster to break single-link chains.
    // Single-link unions any pair within threshold; that means a chain
    // A↔B↔C can collapse into one cluster even when A and C aren't
    // similar. The medoid pass picks the cluster's most-central member
    // and drops any other whose distance to it exceeds the threshold —
    // chains lose their tail, dense real-near-dup clusters keep all
    // members. Discard clusters that drop below 2 after refinement.
    let mut groups: Vec<DuplicateGroup> = by_root
        .into_values()
        .filter_map(|cluster| refine_cluster(cluster, threshold, dhash_max))
        .map(|cluster| {
            let representative_hash = cluster[0].content_hash.clone();
            DuplicateGroup {
                kind: DuplicateKind::Perceptual,
                representative_hash,
                members: cluster.into_iter().map(DuplicateMember::from).collect(),
            }
        })
        .collect();
    groups.sort_by(|a, b| {
        b.members
            .len()
            .cmp(&a.members.len())
            .then_with(|| a.representative_hash.cmp(&b.representative_hash))
    });
    groups
}

/// Tighten a single-link cluster to its medoid neighbourhood. Returns
/// `None` when fewer than 2 members survive — the caller drops the cluster.
fn refine_cluster(
    cluster: Vec<DuplicateRow>,
    phash_max: u32,
    dhash_max: u32,
) -> Option<Vec<DuplicateRow>> {
    if cluster.len() < 2 {
        return None;
    }
    if cluster.len() == 2 {
        // No chain can exist with only two members; the union-find
        // already guaranteed both signals validated when joining.
        return Some(cluster);
    }

    // Pick the medoid: the member whose summed pHash+dHash distance to
    // the rest of the cluster is smallest. Deterministic via the
    // first-best-wins tie break over the stable input ordering.
    let phashes: Vec<u64> = cluster
        .iter()
        .map(|r| r.phash_64.unwrap_or(0) as u64)
        .collect();
    let dhashes: Vec<u64> = cluster
        .iter()
        .map(|r| r.dhash_64.unwrap_or(0) as u64)
        .collect();

    let mut best_idx = 0usize;
    let mut best_score = u32::MAX;
    for i in 0..cluster.len() {
        let mut score: u32 = 0;
        for j in 0..cluster.len() {
            if i == j {
                continue;
            }
            score = score.saturating_add((phashes[i] ^ phashes[j]).count_ones());
            score = score.saturating_add((dhashes[i] ^ dhashes[j]).count_ones());
        }
        if score < best_score {
            best_score = score;
            best_idx = i;
        }
    }

    let medoid_phash = phashes[best_idx];
    let medoid_dhash = dhashes[best_idx];

    let kept: Vec<DuplicateRow> = cluster
        .into_iter()
        .enumerate()
        .filter(|(i, _)| {
            *i == best_idx
                || ((phashes[*i] ^ medoid_phash).count_ones() <= phash_max
                    && (dhashes[*i] ^ medoid_dhash).count_ones() <= dhash_max)
        })
        .map(|(_, r)| r)
        .collect();

    if kept.len() < 2 { None } else { Some(kept) }
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct HashKey {
    phash: u64,
    idx: usize,
}

struct HammingMetric;

impl Metric<HashKey> for HammingMetric {
    fn distance(&self, a: &HashKey, b: &HashKey) -> u32 {
        (a.phash ^ b.phash).count_ones()
    }

    fn threshold_distance(&self, a: &HashKey, b: &HashKey, _: u32) -> Option<u32> {
        Some(self.distance(a, b))
    }
}

struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<u8>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self {
            parent: (0..n).collect(),
            rank: vec![0; n],
        }
    }

    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }

    fn union(&mut self, a: usize, b: usize) {
        let ra = self.find(a);
        let rb = self.find(b);
        if ra == rb {
            return;
        }
        if self.rank[ra] < self.rank[rb] {
            self.parent[ra] = rb;
        } else if self.rank[ra] > self.rank[rb] {
            self.parent[rb] = ra;
        } else {
            self.parent[rb] = ra;
            self.rank[ra] += 1;
        }
    }
}

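The single-link chaining that the medoid pass above exists to break is easy to see directly with this structure. A standalone sketch (a trimmed copy of the same union-by-rank, path-compressing union-find, for illustration only):

```rust
// Trimmed standalone union-find: union by rank, path-compressing find.
struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<u8>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect(), rank: vec![0; n] }
    }

    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return;
        }
        if self.rank[ra] < self.rank[rb] {
            self.parent[ra] = rb;
        } else {
            if self.rank[ra] == self.rank[rb] {
                self.rank[ra] += 1;
            }
            self.parent[rb] = ra;
        }
    }
}

fn main() {
    // Edges 0↔1 and 1↔2 chain all three into one component even though
    // 0↔2 was never validated directly. That is exactly the transitive
    // merge the medoid refinement has to re-check afterwards.
    let mut uf = UnionFind::new(4);
    uf.union(0, 1);
    uf.union(1, 2);
    assert_eq!(uf.find(0), uf.find(2));
    assert_ne!(uf.find(0), uf.find(3)); // untouched row stays separate
    println!("ok");
}
```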
// ── Routing ──────────────────────────────────────────────────────────────

pub fn add_duplicate_services<T>(app: App<T>) -> App<T>
where
    T: ServiceFactory<
        actix_web::dev::ServiceRequest,
        Config = (),
        Error = actix_web::Error,
        InitError = (),
    >,
{
    app.service(web::resource("/duplicates/exact").route(web::get().to(list_exact_handler)))
        .service(
            web::resource("/duplicates/perceptual").route(web::get().to(list_perceptual_handler)),
        )
        .service(web::resource("/duplicates/resolve").route(web::post().to(resolve_handler)))
        .service(web::resource("/duplicates/unresolve").route(web::post().to(unresolve_handler)))
}

// ── Tests ────────────────────────────────────────────────────────────────
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
fn row(library_id: i32, rel: &str, hash: &str, phash: Option<i64>) -> DuplicateRow {
|
||||
DuplicateRow {
|
||||
library_id,
|
||||
rel_path: rel.into(),
|
||||
content_hash: hash.into(),
|
||||
size_bytes: Some(1000),
|
||||
date_taken: None,
|
||||
width: None,
|
||||
height: None,
|
||||
phash_64: phash,
|
||||
dhash_64: None,
|
||||
duplicate_of_hash: None,
|
||||
duplicate_decided_at: None,
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn group_exact_collapses_by_hash() {
|
||||
let rows = vec![
|
||||
row(1, "a.jpg", "h1", None),
|
||||
row(1, "b.jpg", "h1", None),
|
||||
row(2, "c.jpg", "h1", None),
|
||||
row(1, "lonely.jpg", "h2", None),
|
||||
];
|
||||
let groups = group_exact(rows);
|
||||
assert_eq!(groups.len(), 1);
|
||||
assert_eq!(groups[0].representative_hash, "h1");
|
||||
assert_eq!(groups[0].members.len(), 3);
|
||||
}
|
||||
|
||||
/// All hashes used below have popcount in the "informative"
|
||||
/// 8..=56 band so they survive the entropy filter that keeps
|
||||
/// solid-colour images out of the cluster graph.
|
||||
const INFORMATIVE_BASE: i64 = 0x55AA_55AA_55AA_55AA; // popcount = 32
|
||||
const INFORMATIVE_NEAR: i64 = 0x55AA_55AA_55AA_55AB; // 1-bit away from BASE
|
||||
const INFORMATIVE_FAR: i64 = 0x6996_6996_6996_6996; // 32-bits away from BASE
|
||||
|
||||
fn row_with_dhash(
|
||||
library_id: i32,
|
||||
rel: &str,
|
||||
hash: &str,
|
||||
phash: Option<i64>,
|
||||
dhash: Option<i64>,
|
||||
) -> DuplicateRow {
|
||||
DuplicateRow {
|
||||
library_id,
|
||||
rel_path: rel.into(),
|
||||
content_hash: hash.into(),
|
||||
size_bytes: Some(1000),
|
||||
date_taken: None,
|
||||
width: None,
|
||||
height: None,
|
||||
phash_64: phash,
|
||||
dhash_64: dhash,
|
||||
duplicate_of_hash: None,
|
||||
duplicate_decided_at: None,
|
||||
}
|
||||
}
|
||||
|
||||
    #[test]
    fn cluster_perceptual_unites_close_hashes() {
        // Two rows near each other on both pHash and dHash; one far
        // on pHash. Threshold 4 should merge the close pair.
        let rows = vec![
            row_with_dhash(
                1,
                "a.jpg",
                "h1",
                Some(INFORMATIVE_BASE),
                Some(INFORMATIVE_BASE),
            ),
            row_with_dhash(
                1,
                "b.jpg",
                "h2",
                Some(INFORMATIVE_NEAR),
                Some(INFORMATIVE_NEAR),
            ),
            row_with_dhash(
                1,
                "c.jpg",
                "h3",
                Some(INFORMATIVE_FAR),
                Some(INFORMATIVE_FAR),
            ),
        ];
        let groups = cluster_perceptual(rows, 4);
        assert_eq!(groups.len(), 1);
        assert_eq!(groups[0].members.len(), 2);
        let paths: Vec<&str> = groups[0]
            .members
            .iter()
            .map(|m| m.rel_path.as_str())
            .collect();
        assert!(paths.contains(&"a.jpg"));
        assert!(paths.contains(&"b.jpg"));
    }

    #[test]
    fn cluster_perceptual_threshold_zero_drops_distinct() {
        let rows = vec![
            row_with_dhash(
                1,
                "a.jpg",
                "h1",
                Some(INFORMATIVE_BASE),
                Some(INFORMATIVE_BASE),
            ),
            row_with_dhash(
                1,
                "b.jpg",
                "h2",
                Some(INFORMATIVE_NEAR),
                Some(INFORMATIVE_NEAR),
            ),
        ];
        let groups = cluster_perceptual(rows, 0);
        assert!(groups.is_empty());
    }

    #[test]
    fn cluster_perceptual_skips_singletons() {
        let rows = vec![row(1, "alone.jpg", "h1", Some(INFORMATIVE_BASE))];
        assert!(cluster_perceptual(rows, 8).is_empty());
    }

    #[test]
    fn cluster_perceptual_filters_low_entropy_hashes() {
        // Both 0 (popcount 0) and i64::MAX (popcount 63) fall outside
        // the informative band. A pair of these would trivially match
        // (Hamming distance to each other small or zero) without the
        // entropy filter — that's exactly the regression that was
        // producing a giant first cluster of solid-colour images.
        let rows = vec![
            row(1, "blank-a.jpg", "h1", Some(0)),
            row(1, "blank-b.jpg", "h2", Some(0)),
            row(1, "white-a.jpg", "h3", Some(i64::MAX)),
            row(1, "white-b.jpg", "h4", Some(i64::MAX)),
        ];
        assert!(cluster_perceptual(rows, 8).is_empty());
    }

    #[test]
    fn cluster_perceptual_requires_dhash_agreement() {
        // pHash within threshold but dHash far apart — the candidate
        // edge from the BK-tree must be rejected. Without the dHash
        // double-check this would form a 2-member cluster.
        let rows = vec![
            row_with_dhash(
                1,
                "a.jpg",
                "h1",
                Some(INFORMATIVE_BASE),
                Some(INFORMATIVE_BASE),
            ),
            row_with_dhash(
                1,
                "b.jpg",
                "h2",
                Some(INFORMATIVE_NEAR),
                Some(INFORMATIVE_FAR),
            ),
        ];
        assert!(cluster_perceptual(rows, 4).is_empty());
    }

    #[test]
    fn cluster_perceptual_breaks_long_chain_at_medoid() {
        // 4-link chain at threshold=2 with pairwise distances chosen
        // so single-link unions all four but the endpoints sit past
        // the medoid's neighbourhood. Bit positions hop by exactly 2
        // bits per step, in non-overlapping nibbles, so consecutive
        // hops compose into wider distant-pair distances:
        //   A↔B = 2, B↔C = 2, C↔D = 2,
        //   A↔C = 4, B↔D = 4, A↔D = 6.
        // Medoid (B or C) keeps Δ ≤ 2 of itself; the far endpoint
        // gets chopped, leaving exactly 3 members.
        const A: i64 = 0x55AA_55AA_55AA_55AA;
        const B: i64 = 0x55AA_55AA_55AA_55A9; // ^0x03 last byte
        const C: i64 = 0x55AA_55AA_55AA_55A5; // ^0x0C from B
        const D: i64 = 0x55AA_55AA_55AA_5595; // ^0x30 from C
        let rows = vec![
            row_with_dhash(1, "a.jpg", "h1", Some(A), Some(A)),
            row_with_dhash(1, "b.jpg", "h2", Some(B), Some(B)),
            row_with_dhash(1, "c.jpg", "h3", Some(C), Some(C)),
            row_with_dhash(1, "d.jpg", "h4", Some(D), Some(D)),
        ];
        let groups = cluster_perceptual(rows, 2);
        assert_eq!(groups.len(), 1);
        assert_eq!(
            groups[0].members.len(),
            3,
            "medoid pass should chop one chain endpoint past Δ=2"
        );
    }
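The pairwise distances claimed for the chain can be checked with nothing but XOR and popcount (this `hamming` helper and the `CHAIN_*` constants are illustrative, not ImageApi code):

```rust
// Hamming distance between two 64-bit hashes is the popcount of
// their XOR — the metric the BK-tree clusters on.
fn hamming(a: i64, b: i64) -> u32 {
    (a ^ b).count_ones()
}

// Same bit patterns as the chain test above: each step flips two
// bits in a fresh nibble, so hops compose additively.
const CHAIN_A: i64 = 0x55AA_55AA_55AA_55AA;
const CHAIN_B: i64 = 0x55AA_55AA_55AA_55A9; // A ^ 0x03
const CHAIN_C: i64 = 0x55AA_55AA_55AA_55A5; // B ^ 0x0C
const CHAIN_D: i64 = 0x55AA_55AA_55AA_5595; // C ^ 0x30
```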

    /// Sanity-check the BK-tree's metric, which is what the duplicates
    /// path actually clusters on.
    #[test]
    fn hamming_metric_is_symmetric() {
        let m = HammingMetric;
        let a = HashKey {
            phash: 0b1010,
            idx: 0,
        };
        let b = HashKey {
            phash: 0b0101,
            idx: 1,
        };
        let d1 = m.distance(&a, &b);
        let d2 = m.distance(&b, &a);
        assert_eq!(d1, d2);
        assert_eq!(d1, 4);
    }
}

203 src/faces.rs
@@ -20,9 +20,10 @@

use crate::Claims;
use crate::ai::face_client::{DetectMeta, FaceClient, FaceDetectError};
use crate::exif;
use crate::database::schema::{face_detections, image_exif, persons};
use crate::error::IntoHttpError;
use crate::exif;
use crate::file_types;
use crate::libraries::{self, Library};
use crate::otel::{extract_context_from_request, global_tracer, trace_db_call};
use crate::state::AppState;
@@ -99,9 +100,30 @@ pub struct FaceDetectionRow {
    pub created_at: i64,
}

/// SQL fragment restricting an `image_exif.rel_path` (or `face_detections.rel_path`)
/// column to image extensions. Videos register in `image_exif` with a
/// populated `content_hash` but can never produce a `face_detections` row
/// — applying this filter at query time keeps videos out of the per-tick
/// backlog drain (which would otherwise loop forever — `filter_excluded`
/// drops them client-side without writing a marker) and out of the SCANNED
/// stat denominator (so 100% is reachable).
fn image_path_predicate(col: &str) -> String {
    let clauses: Vec<String> = file_types::IMAGE_EXTENSIONS
        .iter()
        .map(|ext| format!("lower({col}) LIKE '%.{ext}'"))
        .collect();
    format!("({})", clauses.join(" OR "))
}
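For a concrete sense of the fragment this builds, here is the same construction parameterised on a stand-in two-entry extension list (the real list is `file_types::IMAGE_EXTENSIONS`, and the real function takes only the column name):

```rust
// Same shape as image_path_predicate above, but with the extension
// slice passed in so the generated SQL is easy to inspect.
fn path_predicate(col: &str, exts: &[&str]) -> String {
    let clauses: Vec<String> = exts
        .iter()
        .map(|ext| format!("lower({col}) LIKE '%.{ext}'"))
        .collect();
    format!("({})", clauses.join(" OR "))
}
```

With `["jpg", "png"]` this yields `(lower(rel_path) LIKE '%.jpg' OR lower(rel_path) LIKE '%.png')` — lower-casing the column is what makes mixed-case extensions like `.JPG` match.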

/// Row shape for `list_unscanned_candidates`'s raw SQL. Diesel's
/// `sql_query` requires a `QueryableByName` row type with explicit
/// column SQL types; using a tuple isn't supported.
#[derive(diesel::QueryableByName, Debug)]
struct CountRow {
    #[diesel(sql_type = diesel::sql_types::BigInt)]
    count: i64,
}

#[derive(diesel::QueryableByName, Debug)]
struct UnscannedRow {
    #[diesel(sql_type = diesel::sql_types::Text)]
@@ -601,26 +623,32 @@ impl FaceDao for SqliteFaceDao {
        // fire multiple detect calls for the same hash if it lives
        // under several rel_paths in the same library. The
        // anti-join (NOT EXISTS) drains hashes that have no row in
        // face_detections at all.
        let rows: Vec<(String, String)> = diesel::sql_query(
        // face_detections at all. The image-extension predicate
        // keeps videos out of the candidate set; without it they'd
        // be filtered client-side and re-pulled every tick forever
        // because no marker row is written for excluded paths.
        let ext_predicate = image_path_predicate("rel_path");
        let sql = format!(
            "SELECT rel_path, content_hash \
             FROM image_exif e \
             WHERE library_id = ? \
             AND content_hash IS NOT NULL \
             AND {ext_predicate} \
             AND NOT EXISTS ( \
                 SELECT 1 FROM face_detections f \
                 WHERE f.content_hash = e.content_hash \
             ) \
             GROUP BY content_hash \
             LIMIT ?",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::BigInt, _>(limit)
        .load::<UnscannedRow>(conn.deref_mut())
        .with_context(|| "list_unscanned_candidates")?
        .into_iter()
        .map(|r| (r.rel_path, r.content_hash))
        .collect();
             LIMIT ?"
        );
        let rows: Vec<(String, String)> = diesel::sql_query(sql)
            .bind::<diesel::sql_types::Integer, _>(library_id)
            .bind::<diesel::sql_types::BigInt, _>(limit)
            .load::<UnscannedRow>(conn.deref_mut())
            .with_context(|| "list_unscanned_candidates")?
            .into_iter()
            .map(|r| (r.rel_path, r.content_hash))
            .collect();
        Ok(rows)
    })
}
@@ -856,14 +884,18 @@ impl FaceDao for SqliteFaceDao {
        // Pair with the base64-encoded embedding string so the handler
        // doesn't need to know the wire format. Skip rows with NULL
        // embedding (shouldn't happen on detected rows, but defensive).
        // `embedding.take()` moves the bytes out of the row so we can
        // hand the (now-empty-embedding) row plus the encoded string
        // back to the caller without cloning the whole row — at 20k
        // rows × 2 KB that clone was 40 MB of pointless heap traffic
        // per cluster-suggest run.
        use base64::Engine;
        Ok(rows
            .into_iter()
            .filter_map(|r| {
                r.embedding.as_ref().map(|bytes| {
                    let b64 = base64::engine::general_purpose::STANDARD.encode(bytes);
                    (r.clone(), b64)
                })
            .filter_map(|mut r| {
                let bytes = r.embedding.take()?;
                let b64 = base64::engine::general_purpose::STANDARD.encode(&bytes);
                Some((r, b64))
            })
            .collect())
    })
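The `take()` trick in that hunk can be shown in isolation. The toy `Row` type below and the byte-length stand-in for the base64 encode are illustrative only, not ImageApi's actual model:

```rust
// Toy row: an id plus a potentially large payload.
struct Row {
    id: i32,
    embedding: Option<Vec<u8>>,
}

// Option::take moves the payload out of each row instead of cloning
// row + payload together; rows without an embedding are skipped,
// mirroring the defensive filter_map in the DAO.
fn pair_with_len(rows: Vec<Row>) -> Vec<(Row, usize)> {
    rows.into_iter()
        .filter_map(|mut r| {
            let bytes = r.embedding.take()?; // row now holds None
            Some((r, bytes.len()))           // stand-in for the encode
        })
        .collect()
}
```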
@@ -1013,14 +1045,42 @@ impl FaceDao for SqliteFaceDao {
            .first(conn.deref_mut())
            .with_context(|| "stats: failed")?
        };
        // Image-extension filter mirrors `list_unscanned_candidates` so
        // SCANNED can actually reach 100%: videos sit in `image_exif` but
        // never get a `face_detections` row, so counting them here
        // permanently caps the percentage below 100%.
        //
        // Count DISTINCT content_hash (not rows) so the numerator
        // (`scanned`, also distinct-content_hash) and denominator live
        // in the same domain. Without this, a file present at multiple
        // rel_paths or across libraries inflates total_photos by one
        // per duplicate row while face_detections — keyed on
        // content_hash — counts the bytes once, leaving a permanent
        // gap (e.g. 1101/1103 with nothing actually pending). Rows
        // with NULL content_hash are excluded; they're held in the
        // hash-backfill backlog and counting them would pin the bar
        // below 100% for the duration of that backfill.
        let total_photos: i64 = {
            let mut q = image_exif::table.into_boxed();
            if let Some(lib) = library_id {
                q = q.filter(image_exif::library_id.eq(lib));
            }
            q.select(diesel::dsl::count_star())
                .first(conn.deref_mut())
                .with_context(|| "stats: total_photos")?
            let ext_predicate = image_path_predicate("rel_path");
            let row: CountRow = if let Some(lib) = library_id {
                let sql = format!(
                    "SELECT COUNT(DISTINCT content_hash) AS count FROM image_exif \
                     WHERE library_id = ? AND content_hash IS NOT NULL AND {ext_predicate}"
                );
                diesel::sql_query(sql)
                    .bind::<diesel::sql_types::Integer, _>(lib)
                    .get_result(conn.deref_mut())
                    .with_context(|| "stats: total_photos")?
            } else {
                let sql = format!(
                    "SELECT COUNT(DISTINCT content_hash) AS count FROM image_exif \
                     WHERE content_hash IS NOT NULL AND {ext_predicate}"
                );
                diesel::sql_query(sql)
                    .get_result(conn.deref_mut())
                    .with_context(|| "stats: total_photos")?
            };
            row.count
        };
        let persons_count: i64 = persons::table
            .select(diesel::dsl::count_star())
@@ -2255,6 +2315,12 @@ async fn update_face_handler<D: FaceDao>(
    let mut new_embedding: Option<Vec<u8>> = None;
    if let Some((bx, by, bw, bh)) = bbox_patch {
        if !face_client.is_enabled() {
            warn!(
                "PATCH /image/faces/{}: 503 — face client not enabled \
                 (APOLLO_FACE_API_BASE_URL / APOLLO_API_BASE_URL both unset). \
                 Bbox edit requires Apollo to re-embed.",
                id
            );
            return HttpResponse::ServiceUnavailable()
                .body("face client disabled — bbox edit requires Apollo");
        }
@@ -2284,8 +2350,7 @@ async fn update_face_handler<D: FaceDao>(
                "PATCH /image/faces/{}: crop failed for {:?}: {:?}",
                id, abs_path, e
            );
            return HttpResponse::BadRequest()
                .body(format!("cannot crop new bbox: {}", e));
            return HttpResponse::BadRequest().body(format!("cannot crop new bbox: {}", e));
        }
    };
    let meta = DetectMeta {
@@ -2332,11 +2397,20 @@ async fn update_face_handler<D: FaceDao>(
            );
        }
        Err(FaceDetectError::Transient(e)) => {
            warn!(
                "PATCH /image/faces/{}: 503 — Apollo face client transient \
                 error during re-embed: {}",
                id, e
            );
            return HttpResponse::ServiceUnavailable().body(format!("{}", e));
        }
        Err(FaceDetectError::Disabled) => {
            return HttpResponse::ServiceUnavailable()
                .body("face client disabled mid-flight");
            warn!(
                "PATCH /image/faces/{}: 503 — face client became disabled \
                 mid-flight",
                id
            );
            return HttpResponse::ServiceUnavailable().body("face client disabled mid-flight");
        }
    }
}
@@ -3145,6 +3219,39 @@ mod tests {
        assert_eq!(stats.with_faces, 0);
    }

    #[test]
    fn stats_total_photos_excludes_videos() {
        // SCANNED counts content_hashes in face_detections; total_photos
        // must apply the same image-extension filter as the watcher
        // backlog query so the percentage can reach 100%. Without this,
        // videos sit in image_exif but never produce a face_detections
        // row (Apollo decodes images only) and the bar caps below 100%.
        let mut dao = fresh_dao();
        diesel::sql_query(
            "INSERT OR IGNORE INTO libraries (id, name, root_path, created_at) \
             VALUES (1, 'main', '/tmp', 0)",
        )
        .execute(dao.connection.lock().unwrap().deref_mut())
        .expect("seed libraries");

        diesel::sql_query(
            "INSERT INTO image_exif \
             (library_id, rel_path, content_hash, created_time, last_modified) VALUES \
             (1, 'a.jpg', 'h-a', 0, 0), \
             (1, 'b.JPEG', 'h-b', 0, 0), \
             (1, 'movie.mp4', 'h-mp4', 0, 0), \
             (1, 'clip.MOV', 'h-mov', 0, 0)",
        )
        .execute(dao.connection.lock().unwrap().deref_mut())
        .expect("seed image_exif");

        let stats = dao.stats(&ctx(), Some(1)).expect("stats");
        assert_eq!(
            stats.total_photos, 2,
            "videos should not count toward total"
        );
    }

    #[test]
    fn merge_persons_repoints_faces() {
        let mut dao = fresh_dao();
@@ -3325,8 +3432,7 @@ mod tests {
        )
        .unwrap();
        let row = seed_library_and_face(&mut dao, Some(p.id));
        let joined =
            hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate assigned");
        let joined = hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate assigned");
        assert_eq!(joined.person_id, Some(p.id));
        assert_eq!(joined.person_name.as_deref(), Some("Alice"));
        // Bbox + confidence + source must round-trip — these are what
@@ -3345,8 +3451,7 @@ mod tests {
        // previously-assigned row's serialization.
        let mut dao = fresh_dao();
        let row = seed_library_and_face(&mut dao, None);
        let joined =
            hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate unassigned");
        let joined = hydrate_face_with_person(&mut dao, &ctx(), row).expect("hydrate unassigned");
        assert!(joined.person_id.is_none());
        assert!(joined.person_name.is_none());
    }
@@ -3367,7 +3472,12 @@ mod tests {
        .execute(dao.connection.lock().unwrap().deref_mut())
        .expect("seed libraries");

        // Seed image_exif: mix of hashed/unhashed/scanned/cross-library.
        // Seed image_exif: mix of hashed/unhashed/scanned/cross-library,
        // plus a video and a mixed-case image extension. Videos register
        // in image_exif but can never produce a face_detections row, so
        // the SQL must filter them out — otherwise the per-tick backlog
        // drain re-pulls them every tick (no marker is ever written, so
        // they loop forever) and the SCANNED stat is permanently capped.
        diesel::sql_query(
            "INSERT INTO image_exif \
             (library_id, rel_path, content_hash, created_time, last_modified) VALUES \
@@ -3375,6 +3485,9 @@ mod tests {
             (1, 'b.jpg', 'h-b', 0, 0), \
             (1, 'c.jpg', NULL, 0, 0), \
             (1, 'd.jpg', 'h-d', 0, 0), \
             (1, 'movie.mp4', 'h-mp4', 0, 0), \
             (1, 'clip.MOV', 'h-mov', 0, 0), \
             (1, 'photo.JPG', 'h-jpg-upper', 0, 0), \
             (2, 'e.jpg', 'h-e', 0, 0)",
        )
        .execute(dao.connection.lock().unwrap().deref_mut())
@@ -3388,16 +3501,26 @@ mod tests {
            .list_unscanned_candidates(&ctx(), 1, 10)
            .expect("list unscanned");

        let hashes: std::collections::HashSet<_> =
            cands.iter().map(|(_, h)| h.clone()).collect();
        let hashes: std::collections::HashSet<_> = cands.iter().map(|(_, h)| h.clone()).collect();

        // Should contain a and d (hashed, unscanned, library 1).
        // Should contain a, d, and the upper-case .JPG (image-extension
        // match is case-insensitive).
        assert!(hashes.contains("h-a"), "missing h-a: {:?}", hashes);
        assert!(hashes.contains("h-d"), "missing h-d: {:?}", hashes);
        // Should NOT contain b (scanned), c (no hash), e (other library).
        assert!(
            hashes.contains("h-jpg-upper"),
            "missing h-jpg-upper: {:?}",
            hashes
        );
        // Should NOT contain b (scanned), c (no hash), e (other library),
        // or videos (mp4/mov are not image extensions).
        assert!(!hashes.contains("h-b"), "expected h-b filtered (scanned)");
        assert!(!hashes.contains("h-e"), "expected h-e filtered (other library)");
        assert_eq!(cands.len(), 2, "unexpected candidates: {:?}", cands);
        assert!(
            !hashes.contains("h-e"),
            "expected h-e filtered (other library)"
        );
        assert!(!hashes.contains("h-mp4"), "expected h-mp4 filtered (video)");
        assert!(!hashes.contains("h-mov"), "expected h-mov filtered (video)");
        assert_eq!(cands.len(), 3, "unexpected candidates: {:?}", cands);
    }

}

145 src/files.rs
@@ -110,11 +110,18 @@ fn in_memory_date_sort(
    let total_count = files.len() as i64;
    let file_paths: Vec<String> = files.iter().map(|f| f.file_name.clone()).collect();

    // Batch fetch EXIF data (keyed by rel_path; in union mode a rel_path may
    // correspond to rows in multiple libraries — pick the date from the one
    // matching the requesting row's library_id when possible).
    // Batch fetch EXIF data. When every file in this batch belongs to the
    // same library, scope the SQL filter to that library so cross-library
    // duplicates with the same rel_path don't get fetched and discarded.
    // In genuine union mode (mixed libraries) keep the rel-path-only
    // lookup; the caller's `(file_path, library_id)` map below picks the
    // right row.
    let scope_library = match file_libraries.first() {
        Some(&first) if file_libraries.iter().all(|&id| id == first) => Some(first),
        _ => None,
    };
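The scoping rule reduces to "Some only when every id agrees". As a free function over plain ids (illustrative, not the crate's API):

```rust
// Some(library) only when every row in the batch names the same
// library; mixed batches (union mode) and empty batches give None.
fn scope_library(ids: &[i32]) -> Option<i32> {
    match ids.first() {
        Some(&first) if ids.iter().all(|&id| id == first) => Some(first),
        _ => None,
    }
}
```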
    let exif_rows = exif_dao
        .get_exif_batch(span_context, &file_paths)
        .get_exif_batch(span_context, scope_library, &file_paths)
        .unwrap_or_default();
    let exif_map: std::collections::HashMap<(String, i32), i64> = exif_rows
        .into_iter()
@@ -309,11 +316,15 @@ pub async fn list_photos<TagD: TagDao, FS: FileSystemAccess>(
        None
    };

    // Query EXIF database
    // Query EXIF database. When the request named a library, the EXIF
    // filter must be scoped to it — otherwise camera/date/GPS hits
    // from other libraries would pollute the result set even though
    // downstream filesystem walks would never visit those files.
    let mut exif_dao_guard = exif_dao.lock().expect("Unable to get ExifDao");
    let exif_results = exif_dao_guard
        .query_by_exif(
            &span_context,
            library.map(|l| l.id),
            req.camera_make.as_deref(),
            req.camera_model.as_deref(),
            req.lens_model.as_deref(),
@@ -572,9 +583,10 @@ pub async fn list_photos<TagD: TagDao, FS: FileSystemAccess>(
    } else {
        Some(trimmed)
    };
    let include_duplicates = req.include_duplicates.unwrap_or(false);
    let rows = {
        let mut dao = exif_dao.lock().expect("Unable to get ExifDao");
        dao.list_rel_paths_for_libraries(&span_context, &lib_ids, prefix)
        dao.list_rel_paths_for_libraries(&span_context, &lib_ids, prefix, include_duplicates)
            .unwrap_or_else(|e| {
                warn!("list_rel_paths_for_libraries failed: {:?}", e);
                Vec::new()
@@ -1242,15 +1254,19 @@ pub async fn list_exif_summary(
        .collect();

    let mut exif_dao_guard = exif_dao.lock().expect("Unable to get ExifDao");
    match exif_dao_guard.query_by_exif(&cx, None, None, None, None, req.date_from, req.date_to) {
    match exif_dao_guard.query_by_exif(
        &cx,
        library_filter,
        None,
        None,
        None,
        None,
        req.date_from,
        req.date_to,
    ) {
        Ok(rows) => {
            let photos: Vec<ExifSummary> = rows
                .into_iter()
                // Library filter post-query: keeps the DAO trait (and its
                // mocks) unchanged. For typical 2–3 library setups the in-
                // memory pass over a date-bounded result set is negligible;
                // can be pushed into SQL later if it ever isn't.
                .filter(|r| library_filter.is_none_or(|id| r.library_id == id))
                .map(|r| ExifSummary {
                    library_name: library_names.get(&r.library_id).cloned(),
                    file_path: r.file_path,
@@ -1488,6 +1504,10 @@ mod tests {
            last_modified: data.last_modified,
            content_hash: data.content_hash.clone(),
            size_bytes: data.size_bytes,
            phash_64: data.phash_64,
            dhash_64: data.dhash_64,
            duplicate_of_hash: None,
            duplicate_decided_at: None,
        })
    }

@@ -1527,6 +1547,10 @@ mod tests {
            last_modified: data.last_modified,
            content_hash: data.content_hash.clone(),
            size_bytes: data.size_bytes,
            phash_64: data.phash_64,
            dhash_64: data.dhash_64,
            duplicate_of_hash: None,
            duplicate_decided_at: None,
        })
    }

@@ -1549,6 +1573,7 @@ mod tests {
        fn get_exif_batch(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: Option<i32>,
            _: &[String],
        ) -> Result<Vec<crate::database::models::ImageExif>, DbError> {
            Ok(Vec::new())
@@ -1557,6 +1582,7 @@ mod tests {
        fn query_by_exif(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: Option<i32>,
            _: Option<&str>,
            _: Option<&str>,
            _: Option<&str>,
@@ -1672,6 +1698,7 @@ mod tests {
            _context: &opentelemetry::Context,
            _library_ids: &[i32],
            _path_prefix: Option<&str>,
            _include_duplicates: bool,
        ) -> Result<Vec<(i32, String)>, DbError> {
            Ok(vec![])
        }
@@ -1684,6 +1711,100 @@ mod tests {
        ) -> Result<(), DbError> {
            Ok(())
        }

        fn count_for_library(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: i32,
        ) -> Result<i64, DbError> {
            Ok(0)
        }

        fn list_rel_paths_for_library_page(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: i32,
            _limit: i64,
            _offset: i64,
        ) -> Result<Vec<(i32, String)>, DbError> {
            Ok(Vec::new())
        }

        fn get_rows_missing_perceptual_hash(
            &mut self,
            _context: &opentelemetry::Context,
            _limit: i64,
        ) -> Result<Vec<(i32, String)>, DbError> {
            Ok(Vec::new())
        }

        fn backfill_perceptual_hash(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: i32,
            _rel_path: &str,
            _phash_64: Option<i64>,
            _dhash_64: Option<i64>,
        ) -> Result<(), DbError> {
            Ok(())
        }

        fn list_duplicates_exact(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: Option<i32>,
            _include_resolved: bool,
        ) -> Result<Vec<crate::database::DuplicateRow>, DbError> {
            Ok(Vec::new())
        }

        fn list_perceptual_candidates(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: Option<i32>,
            _include_resolved: bool,
        ) -> Result<Vec<crate::database::DuplicateRow>, DbError> {
            Ok(Vec::new())
        }

        fn lookup_duplicate_row(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: i32,
            _rel_path: &str,
        ) -> Result<Option<crate::database::DuplicateRow>, DbError> {
            Ok(None)
        }

        fn set_duplicate_of(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: i32,
            _rel_path: &str,
            _survivor_hash: &str,
            _decided_at: i64,
        ) -> Result<(), DbError> {
            Ok(())
        }

        fn clear_duplicate_of(
            &mut self,
            _context: &opentelemetry::Context,
            _library_id: i32,
            _rel_path: &str,
        ) -> Result<(), DbError> {
            Ok(())
        }

        fn union_perceptual_tags(
            &mut self,
            _context: &opentelemetry::Context,
            _survivor_hash: &str,
            _demoted_hash: &str,
            _survivor_rel_path: &str,
        ) -> Result<(), DbError> {
            Ok(())
        }
    }

    mod api {

@@ -10,6 +10,7 @@ pub mod cleanup;
pub mod content_hash;
pub mod data;
pub mod database;
pub mod duplicates;
pub mod error;
pub mod exif;
pub mod face_watch;
@@ -19,9 +20,11 @@ pub mod file_types;
pub mod files;
pub mod geo;
pub mod libraries;
pub mod library_maintenance;
pub mod memories;
pub mod otel;
pub mod parsers;
pub mod perceptual_hash;
pub mod service;
pub mod state;
pub mod tags;

348 src/libraries.rs
@@ -3,7 +3,9 @@ use chrono::Utc;
use diesel::prelude::*;
use diesel::sqlite::SqliteConnection;
use log::{info, warn};
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock};

use crate::data::Claims;
use crate::database::models::{InsertLibrary, LibraryRow};
@@ -26,6 +28,19 @@ pub struct Library {
    pub id: i32,
    pub name: String,
    pub root_path: String,
    /// Operator kill switch (mirrors `libraries.enabled`). When `false`
    /// the watcher skips this library entirely — before the probe,
    /// before ingest, before maintenance. Reads / serving still work
    /// (a request whose path resolves to a disabled library's root
    /// will succeed if the file is on disk; nothing prevents that
    /// today and there's no obvious reason to). Toggle via SQL.
    pub enabled: bool,
    /// Per-library excluded paths/patterns, parsed from the
    /// comma-separated DB column. The walker applies these
    /// **in union** with the global `EXCLUDED_DIRS` env var; either
    /// list matching a path is enough to exclude. Empty = no
    /// library-specific excludes (only the global env var applies).
    pub excluded_dirs: Vec<String>,
}

impl Library {
@@ -47,6 +62,36 @@ impl Library {
            .ok()
            .map(|p| p.to_string_lossy().replace('\\', "/"))
    }

    /// Effective excluded directories for a walk of this library:
    /// the union of the global env-var excludes (passed in by the
    /// caller as `globals`) and this library's per-row excludes.
    /// Order doesn't matter; `PathExcluder` accepts repeats.
    pub fn effective_excluded_dirs(&self, globals: &[String]) -> Vec<String> {
        if self.excluded_dirs.is_empty() {
            return globals.to_vec();
        }
        let mut combined: Vec<String> =
            Vec::with_capacity(globals.len() + self.excluded_dirs.len());
        combined.extend_from_slice(globals);
        combined.extend(self.excluded_dirs.iter().cloned());
        combined
    }
}

/// Parse a comma-separated excluded_dirs column into a Vec, dropping
/// empty entries (mirrors `AppState::parse_excluded_dirs` for the env
/// var). NULL → empty Vec.
pub fn parse_excluded_dirs_column(raw: Option<&str>) -> Vec<String> {
    match raw {
        None => Vec::new(),
        Some(s) => s
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect(),
    }
}
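Behaviour on messy input, demonstrated against a local copy of the same parser (the trim + filter pass is what makes trailing commas and padding harmless):

```rust
// Local copy of the column parser above, for demonstration only:
// split on commas, trim whitespace, drop empties, NULL -> empty Vec.
fn parse_excluded_dirs_column(raw: Option<&str>) -> Vec<String> {
    match raw {
        None => Vec::new(),
        Some(s) => s
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect(),
    }
}
```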

impl From<LibraryRow> for Library {
@@ -55,6 +100,8 @@ impl From<LibraryRow> for Library {
            id: row.id,
            name: row.name,
            root_path: row.root_path,
            enabled: row.enabled,
            excluded_dirs: parse_excluded_dirs_column(row.excluded_dirs.as_deref()),
        }
    }
}
@@ -109,6 +156,8 @@ pub fn seed_or_patch_from_env(conn: &mut SqliteConnection, base_path: &str) {
        name: "main",
        root_path: base_path,
        created_at: now,
        enabled: true,
        excluded_dirs: None,
    })
    .execute(conn);
    match result {
@@ -146,16 +195,165 @@ pub fn resolve_library_param<'a>(
        .ok_or_else(|| format!("unknown library name: {}", raw))
}

/// Health of a library at a point in time. Probed at the top of each
/// file-watcher tick. The `Stale` state is the "be conservative" signal:
/// destructive paths (ingest writes, future move-handoff and orphan GC in
/// branches B/C) skip a stale library, but reads/serving stay unaffected.
///
/// See `CLAUDE.md` → "Library availability and safety" for the policy.
#[derive(Clone, Debug, serde::Serialize, PartialEq, Eq)]
#[serde(tag = "state", rename_all = "snake_case")]
pub enum LibraryHealth {
    Online,
    Stale {
        reason: String,
        /// Unix timestamp (seconds) of the most recent transition into
        /// Stale. Held for telemetry / `/libraries` surfacing only —
        /// gating logic doesn't read it.
        since: i64,
    },
}

impl LibraryHealth {
    pub fn is_online(&self) -> bool {
        matches!(self, LibraryHealth::Online)
    }
}

/// Shared snapshot of every configured library's health, keyed by
/// `library_id`. The watcher writes; HTTP handlers read. RwLock because
/// reads vastly outnumber writes (one tick vs. every status request).
pub type LibraryHealthMap = Arc<RwLock<HashMap<i32, LibraryHealth>>>;

/// Construct an initial health map. Libraries start `Online`; the first
/// probe will downgrade any that fail. Starting `Stale` would block ingest
/// for the watcher's first tick on a healthy mount, which is the wrong
/// default for a server that's just been restarted.
pub fn new_health_map(libs: &[Library]) -> LibraryHealthMap {
    let mut m = HashMap::with_capacity(libs.len());
    for lib in libs {
        m.insert(lib.id, LibraryHealth::Online);
    }
    Arc::new(RwLock::new(m))
}

/// Probe a library's mount point. Cheap: stat + open dir + peek one entry.
|
||||
///
|
||||
/// `had_data` is the caller's prior knowledge that this library has been
|
||||
/// non-empty before — typically `image_exif` row count > 0. When true, an
|
||||
/// empty directory is suspicious (it's how an unmounted NFS share looks);
|
||||
/// when false, it's accepted as a fresh mount that simply hasn't been
|
||||
/// indexed yet.
|
||||
///
|
||||
/// Note: stat / read_dir on a hard-mounted, unreachable NFS share can
|
||||
/// block. The watcher accepts that risk for now — the worst case is that
|
||||
/// the tick stalls until the mount returns, which is no more destructive
|
||||
/// than the pre-probe behavior. A future enhancement can wrap this in a
|
||||
/// thread + timeout if it becomes an operational issue.
|
||||
pub fn probe_online(lib: &Library, had_data: bool) -> LibraryHealth {
|
||||
let now = Utc::now().timestamp();
|
||||
let path = Path::new(&lib.root_path);
|
||||
|
||||
let metadata = match std::fs::metadata(path) {
|
||||
Ok(m) => m,
|
||||
Err(e) => {
|
||||
return LibraryHealth::Stale {
|
||||
reason: format!("root_path stat failed: {}", e),
|
||||
since: now,
|
||||
};
|
||||
}
|
||||
};
|
||||
if !metadata.is_dir() {
|
||||
return LibraryHealth::Stale {
|
||||
reason: format!("root_path is not a directory: {}", lib.root_path),
|
||||
since: now,
|
||||
};
|
||||
}
|
||||
|
||||
let mut entries = match std::fs::read_dir(path) {
|
||||
Ok(it) => it,
|
||||
Err(e) => {
|
||||
return LibraryHealth::Stale {
|
||||
reason: format!("read_dir failed: {}", e),
|
||||
since: now,
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
// Empty directory only counts as Stale when we have prior evidence
|
||||
// this library used to have content. A genuinely fresh mount is
|
||||
// legitimately empty, and degrading it would block first-time ingest.
|
||||
if had_data && entries.next().is_none() {
|
||||
return LibraryHealth::Stale {
|
||||
reason: "library is empty but image_exif has rows for it".to_string(),
|
||||
since: now,
|
||||
};
|
||||
}
|
||||
|
||||
LibraryHealth::Online
|
||||
}
|
||||
|
||||
/// Probe `lib`, update `map`, and return the new state. Logs only on a
/// state transition (Online↔Stale) so a long outage doesn't spam at every
/// tick — operators get one warn on the way down and one info on the way
/// up.
pub fn refresh_health(map: &LibraryHealthMap, lib: &Library, had_data: bool) -> LibraryHealth {
    let new_state = probe_online(lib, had_data);
    let mut guard = map.write().unwrap_or_else(|e| e.into_inner());
    let prev = guard.get(&lib.id).cloned();
    let transitioned = matches!(
        (&prev, &new_state),
        (None, LibraryHealth::Stale { .. })
            | (Some(LibraryHealth::Online), LibraryHealth::Stale { .. })
            | (Some(LibraryHealth::Stale { .. }), LibraryHealth::Online)
    );
    if transitioned {
        match &new_state {
            LibraryHealth::Online => info!(
                "Library '{}' (id={}) recovered: {} is online",
                lib.name, lib.id, lib.root_path
            ),
            LibraryHealth::Stale { reason, .. } => warn!(
                "Library '{}' (id={}) is STALE — pausing writes. Reason: {}. Path: {}",
                lib.name, lib.id, reason, lib.root_path
            ),
        }
    }
    guard.insert(lib.id, new_state.clone());
    new_state
}

/// Snapshot of one library + its current health, for `/libraries`.
#[derive(serde::Serialize)]
pub struct LibraryStatus {
    #[serde(flatten)]
    pub library: Library,
    pub health: LibraryHealth,
}

#[derive(serde::Serialize)]
pub struct LibrariesResponse {
    pub libraries: Vec<LibraryStatus>,
}

#[get("/libraries")]
pub async fn list_libraries(_claims: Claims, app_state: Data<AppState>) -> impl Responder {
    let health_guard = app_state
        .library_health
        .read()
        .unwrap_or_else(|e| e.into_inner());
    let libraries = app_state
        .libraries
        .iter()
        .map(|lib| LibraryStatus {
            library: lib.clone(),
            health: health_guard
                .get(&lib.id)
                .cloned()
                .unwrap_or(LibraryHealth::Online),
        })
        .collect();
    HttpResponse::Ok().json(LibrariesResponse { libraries })
}

#[cfg(test)]
@@ -192,6 +390,8 @@ mod tests {
            id: 1,
            name: "main".into(),
            root_path: "/tmp/media".into(),
            enabled: true,
            excluded_dirs: Vec::new(),
        };
        let rel = lib.strip_root(Path::new("/tmp/media/2024/photo.jpg"));
        assert_eq!(rel.as_deref(), Some("2024/photo.jpg"));
@@ -205,6 +405,8 @@ mod tests {
            id: 1,
            name: "main".into(),
            root_path: "/tmp/media".into(),
            enabled: true,
            excluded_dirs: Vec::new(),
        };
        let abs = lib.resolve("2024/photo.jpg");
        assert_eq!(abs, PathBuf::from("/tmp/media/2024/photo.jpg"));
@@ -222,11 +424,15 @@ mod tests {
                id: 1,
                name: "main".into(),
                root_path: "/tmp/main".into(),
                enabled: true,
                excluded_dirs: Vec::new(),
            },
            Library {
                id: 7,
                name: "archive".into(),
                root_path: "/tmp/archive".into(),
                enabled: true,
                excluded_dirs: Vec::new(),
            },
        ]
    }
@@ -279,4 +485,138 @@ mod tests {
        let err = resolve_library_param(&state, Some("missing")).unwrap_err();
        assert!(err.contains("unknown library name"));
    }

    #[test]
    fn parse_excluded_dirs_column_handles_null_and_whitespace() {
        assert_eq!(parse_excluded_dirs_column(None), Vec::<String>::new());
        assert_eq!(parse_excluded_dirs_column(Some("")), Vec::<String>::new());
        assert_eq!(
            parse_excluded_dirs_column(Some(" /a , /b/sub , @eaDir ,, ")),
            vec!["/a".to_string(), "/b/sub".to_string(), "@eaDir".to_string()]
        );
    }

    #[test]
    fn effective_excluded_dirs_unions_global_and_per_library() {
        let lib_no_extras = Library {
            id: 1,
            name: "main".into(),
            root_path: "/x".into(),
            enabled: true,
            excluded_dirs: Vec::new(),
        };
        let globals = vec!["@eaDir".to_string(), ".thumbnails".to_string()];
        // Empty per-library excludes → exactly the globals.
        assert_eq!(lib_no_extras.effective_excluded_dirs(&globals), globals);

        let lib_with_extras = Library {
            id: 2,
            name: "archive".into(),
            root_path: "/y".into(),
            enabled: true,
            excluded_dirs: vec!["/photos".to_string()],
        };
        let combined = lib_with_extras.effective_excluded_dirs(&globals);
        assert!(combined.contains(&"@eaDir".to_string()));
        assert!(combined.contains(&".thumbnails".to_string()));
        assert!(combined.contains(&"/photos".to_string()));
        assert_eq!(combined.len(), 3);
    }

    fn probe_lib(id: i32, root: String) -> Library {
        Library {
            id,
            name: "main".into(),
            root_path: root,
            enabled: true,
            excluded_dirs: Vec::new(),
        }
    }

    #[test]
    fn probe_online_for_existing_non_empty_dir() {
        let tmp = tempfile::tempdir().unwrap();
        std::fs::write(tmp.path().join("photo.jpg"), b"hello").unwrap();
        let lib = probe_lib(1, tmp.path().to_string_lossy().into());
        // had_data doesn't matter when the dir has entries.
        assert!(probe_online(&lib, true).is_online());
        assert!(probe_online(&lib, false).is_online());
    }

    #[test]
    fn probe_stale_when_root_missing() {
        let lib = probe_lib(1, "/nonexistent/definitely/not/here".into());
        assert!(matches!(
            probe_online(&lib, false),
            LibraryHealth::Stale { .. }
        ));
    }

    #[test]
    fn probe_stale_when_root_is_a_file() {
        let tmp = tempfile::tempdir().unwrap();
        let file = tmp.path().join("not-a-dir");
        std::fs::write(&file, b"x").unwrap();
        let lib = probe_lib(1, file.to_string_lossy().into());
        assert!(matches!(
            probe_online(&lib, false),
            LibraryHealth::Stale { .. }
        ));
    }

    #[test]
    fn probe_empty_dir_is_online_when_no_prior_data() {
        // Fresh mount: empty directory, no rows in image_exif. Accept it.
        let tmp = tempfile::tempdir().unwrap();
        let lib = probe_lib(1, tmp.path().to_string_lossy().into());
        assert!(probe_online(&lib, false).is_online());
    }

    #[test]
    fn probe_empty_dir_is_stale_when_prior_data_existed() {
        // The "share went offline" signal: directory exists but is empty,
        // and we know the library used to have content. Treat as Stale.
        let tmp = tempfile::tempdir().unwrap();
        let lib = probe_lib(1, tmp.path().to_string_lossy().into());
        match probe_online(&lib, true) {
            LibraryHealth::Stale { reason, .. } => {
                assert!(reason.contains("empty"), "unexpected reason: {}", reason)
            }
            other => panic!("expected Stale, got {:?}", other),
        }
    }

    #[test]
    fn refresh_health_logs_only_on_transition() {
        // Smoke test: refresh_health updates the map and reports correctly.
        // (We can't easily assert on logs without a custom logger; the
        // important thing is that the state churns properly.)
        let tmp = tempfile::tempdir().unwrap();
        let lib = Library {
            id: 42,
            name: "test".into(),
            root_path: tmp.path().to_string_lossy().into(),
            enabled: true,
            excluded_dirs: Vec::new(),
        };
        let map = new_health_map(&[lib.clone()]);

        // First probe: empty dir, no prior data — Online.
        let s1 = refresh_health(&map, &lib, false);
        assert!(s1.is_online());

        // Probe again with had_data=true on the same empty dir — Stale.
        let s2 = refresh_health(&map, &lib, true);
        assert!(matches!(s2, LibraryHealth::Stale { .. }));
        assert_eq!(
            map.read().unwrap().get(&lib.id).cloned(),
            Some(s2.clone()),
            "map should reflect the latest probe"
        );

        // Recovery: drop a file and probe again.
        std::fs::write(tmp.path().join("photo.jpg"), b"x").unwrap();
        let s3 = refresh_health(&map, &lib, true);
        assert!(s3.is_online());
    }
}

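The gating policy in the `LibraryHealth` docs above (destructive paths pause on `Stale`, reads and serving are never gated) can be sketched in isolation. This is a hypothetical toy model for illustration, not the crate's actual types:

```rust
// Toy model (assumed names, not part of the crate): Stale gates writes
// but never reads.
#[derive(Clone, Debug, PartialEq)]
enum Health {
    Online,
    Stale { reason: String },
}

// Ingest / GC / move-handoff: only when Online.
fn may_write(h: &Health) -> bool {
    matches!(h, Health::Online)
}

// Serving existing metadata and bytes: always allowed.
fn may_read(_h: &Health) -> bool {
    true
}

fn main() {
    let stale = Health::Stale { reason: "root_path stat failed".into() };
    assert!(!may_write(&stale));
    assert!(may_read(&stale));
    assert!(may_write(&Health::Online));
    println!("ok");
}
```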
828 src/library_maintenance.rs Normal file
@@ -0,0 +1,828 @@
//! Filesystem-backed maintenance of `image_exif`, the back-ref columns
//! on hash-keyed tables, and orphan derived data.
//!
//! These passes are the operational implementation of the library
//! handoff and orphan rules from CLAUDE.md → "Multi-library data
//! model" / "Library availability and safety":
//!
//! 1. **Missing-file detection** — when a file disappears from disk
//!    but its `image_exif` row remains, the row is removed. Naturally
//!    implements the move case: when a user moves a file from lib-A
//!    to lib-B, the watcher's normal ingest creates the lib-B row;
//!    this pass eventually retires the lib-A row.
//!
//! 2. **Back-ref refresh** — hash-keyed rows (`face_detections` and,
//!    after Branch B, `tagged_photo` / `photo_insights`) carry a
//!    denormalized `(library_id, rel_path)` back-ref. After a move,
//!    that back-ref may point at a deleted row. The refresh pass
//!    finds rows whose `(library_id, rel_path)` no longer matches
//!    any `image_exif` row but whose `content_hash` does, and updates
//!    the back-ref to one of the surviving paths. Idempotent.
//!
//! 3. **Orphan GC** — when a `content_hash` no longer has any
//!    `image_exif` row referencing it, hash-keyed derived rows for
//!    that hash become eligible for deletion. To survive transient
//!    unmounts, the pass uses a **two-tick consensus rule**: a hash
//!    must be observed orphaned for two consecutive ticks AND every
//!    library must be online for both observations. The "marked but
//!    not yet deleted" state is held in memory; restarting the
//!    watcher resets it (which is fine — deletion is simply deferred
//!    until two fresh consecutive all-online observations accumulate).
//!
//! Pass 1 is filesystem-dependent and gated on the per-library
//! availability probe. Passes 2 and 3 are database-only, but pass 3
//! additionally requires every library to be online for the
//! consensus window.

use std::collections::HashSet;
use std::path::Path;
use std::sync::{Arc, Mutex};

use diesel::prelude::*;
use diesel::sql_query;
use diesel::sqlite::SqliteConnection;
use log::{debug, info, warn};

use crate::database::ExifDao;
use crate::libraries::{Library, LibraryHealthMap};

/// Cap on missing-file deletions per library per tick. Prevents a
/// pathological mount that returns "not found" for everything (e.g.
/// case-sensitivity flip on a network share that the probe didn't
/// catch) from wiping the entire image_exif table in one tick. Tunable
/// via `IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK`.
pub const DEFAULT_MISSING_DELETE_CAP: usize = 200;

/// Page size for the missing-file scan. We stat() every row in this
/// batch but only delete those that are confirmed-not-found (subject
/// to the delete cap above). Tunable via
/// `IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE`.
pub const DEFAULT_SCAN_PAGE_SIZE: i64 = 500;

/// Scan a page of `image_exif` rows for `library`, stat() each one,
/// and delete rows whose source file is gone. Returns
/// `(deleted, next_offset)`. `next_offset` wraps to 0 when the page
/// returned fewer rows than the page size, so the watcher cycles
/// through the whole library across ticks.
///
/// Caller must already have confirmed the library is online — running
/// against a Stale library would interpret every row as missing.
pub fn detect_missing_files_for_library(
    context: &opentelemetry::Context,
    library: &Library,
    exif_dao: &Arc<Mutex<Box<dyn ExifDao>>>,
    offset: i64,
    page_size: i64,
    delete_cap: usize,
) -> (usize, i64) {
    let rows = {
        let mut dao = exif_dao.lock().expect("exif_dao poisoned");
        match dao.list_rel_paths_for_library_page(context, library.id, page_size, offset) {
            Ok(r) => r,
            Err(e) => {
                warn!(
                    "missing-file scan: list page failed for library '{}' (offset={}): {:?}",
                    library.name, offset, e
                );
                return (0, offset);
            }
        }
    };
    let n_returned = rows.len();
    // Wrap offset when we hit the end of the table — next tick starts
    // a fresh sweep. Doing it here rather than on the next call keeps
    // the offset accounting visible in one place.
    let next_offset = if (n_returned as i64) < page_size {
        0
    } else {
        offset + page_size
    };

    if rows.is_empty() {
        return (0, next_offset);
    }

    let root = Path::new(&library.root_path);
    let mut to_delete: Vec<String> = Vec::new();
    for (_id, rel_path) in &rows {
        if to_delete.len() >= delete_cap {
            break;
        }
        let abs = root.join(rel_path);
        match std::fs::metadata(&abs) {
            Ok(_) => {
                // File still exists — nothing to do.
            }
            Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
                to_delete.push(rel_path.clone());
            }
            Err(e) => {
                // Permission denied / IO error / etc. — skip this row,
                // leave it for the next sweep. We never want a transient
                // FS hiccup to mass-delete metadata.
                debug!(
                    "missing-file scan: stat() error for {:?}, skipping: {:?}",
                    abs, e
                );
            }
        }
    }

    if to_delete.is_empty() {
        return (0, next_offset);
    }

    let mut deleted = 0;
    {
        let mut dao = exif_dao.lock().expect("exif_dao poisoned");
        for rel_path in &to_delete {
            match dao.delete_exif_by_library(context, library.id, rel_path) {
                Ok(()) => deleted += 1,
                Err(e) => warn!(
                    "missing-file scan: delete failed for ({}, {}): {:?}",
                    library.id, rel_path, e
                ),
            }
        }
    }

    if deleted > 0 {
        info!(
            "missing-file scan: removed {} stale image_exif row(s) from library '{}'",
            deleted, library.name
        );
    }

    (deleted, next_offset)
}

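The wrap-to-zero offset accounting above can be isolated into a pure function for illustration. `next_offset` here is a hypothetical helper, not one the crate defines:

```rust
// Hypothetical helper mirroring the scan's offset logic: a short page
// means we hit the end of the table, so the next sweep restarts at 0;
// a full page means there may be more rows, so advance by one page.
fn next_offset(rows_returned: i64, page_size: i64, offset: i64) -> i64 {
    if rows_returned < page_size { 0 } else { offset + page_size }
}

fn main() {
    assert_eq!(next_offset(500, 500, 0), 500);    // full page: advance
    assert_eq!(next_offset(500, 500, 500), 1000); // keep advancing
    assert_eq!(next_offset(123, 500, 1000), 0);   // short page: wrap
    assert_eq!(next_offset(0, 500, 1500), 0);     // empty page: wrap
    println!("ok");
}
```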
/// Refresh the `(library_id, rel_path)` back-refs on hash-keyed
/// tables. A back-ref is stale when:
/// - its `content_hash` is non-null,
/// - that hash is referenced by at least one `image_exif` row, but
/// - the row's own `(library_id, rel_path)` does not appear in
///   `image_exif`.
///
/// In that case, point the back-ref at any surviving image_exif row
/// for the same hash. `face_detections` is the canonical case (it
/// carries `library_id` + `rel_path` columns); `tagged_photo`
/// historically carries only `rel_path`, and `photo_insights` carries
/// both — we keep them in sync here too, picking any surviving
/// rel_path.
///
/// All-SQL, idempotent. Returns the number of rows updated.
pub fn refresh_back_refs(conn: &mut SqliteConnection) -> usize {
    let mut total = 0usize;

    // face_detections — back-ref is (library_id, rel_path). Repoint to
    // any surviving image_exif row carrying the same content_hash.
    let updated = sql_query(
        "UPDATE face_detections \
         SET library_id = ( \
             SELECT ie.library_id FROM image_exif ie \
             WHERE ie.content_hash = face_detections.content_hash \
             ORDER BY ie.id LIMIT 1 \
         ), \
         rel_path = ( \
             SELECT ie.rel_path FROM image_exif ie \
             WHERE ie.content_hash = face_detections.content_hash \
             ORDER BY ie.id LIMIT 1 \
         ) \
         WHERE EXISTS ( \
             SELECT 1 FROM image_exif ie \
             WHERE ie.content_hash = face_detections.content_hash \
         ) \
         AND NOT EXISTS ( \
             SELECT 1 FROM image_exif ie \
             WHERE ie.library_id = face_detections.library_id \
             AND ie.rel_path = face_detections.rel_path \
         )",
    )
    .execute(conn)
    .unwrap_or_else(|e| {
        warn!("back-ref refresh: face_detections update failed: {:?}", e);
        0
    });
    total += updated;

    // tagged_photo — only rel_path. Update to any surviving rel_path
    // for the same content_hash so the path-only DAO read still finds
    // tags after a move.
    let updated = sql_query(
        "UPDATE tagged_photo \
         SET rel_path = ( \
             SELECT ie.rel_path FROM image_exif ie \
             WHERE ie.content_hash = tagged_photo.content_hash \
             ORDER BY ie.id LIMIT 1 \
         ) \
         WHERE content_hash IS NOT NULL \
         AND EXISTS ( \
             SELECT 1 FROM image_exif ie \
             WHERE ie.content_hash = tagged_photo.content_hash \
         ) \
         AND NOT EXISTS ( \
             SELECT 1 FROM image_exif ie \
             WHERE ie.rel_path = tagged_photo.rel_path \
         )",
    )
    .execute(conn)
    .unwrap_or_else(|e| {
        warn!("back-ref refresh: tagged_photo update failed: {:?}", e);
        0
    });
    total += updated;

    // photo_insights — has both library_id and rel_path. Update both
    // when the (library_id, rel_path) tuple no longer matches any
    // image_exif row but the hash does.
    let updated = sql_query(
        "UPDATE photo_insights \
         SET library_id = ( \
             SELECT ie.library_id FROM image_exif ie \
             WHERE ie.content_hash = photo_insights.content_hash \
             ORDER BY ie.id LIMIT 1 \
         ), \
         rel_path = ( \
             SELECT ie.rel_path FROM image_exif ie \
             WHERE ie.content_hash = photo_insights.content_hash \
             ORDER BY ie.id LIMIT 1 \
         ) \
         WHERE content_hash IS NOT NULL \
         AND EXISTS ( \
             SELECT 1 FROM image_exif ie \
             WHERE ie.content_hash = photo_insights.content_hash \
         ) \
         AND NOT EXISTS ( \
             SELECT 1 FROM image_exif ie \
             WHERE ie.library_id = photo_insights.library_id \
             AND ie.rel_path = photo_insights.rel_path \
         )",
    )
    .execute(conn)
    .unwrap_or_else(|e| {
        warn!("back-ref refresh: photo_insights update failed: {:?}", e);
        0
    });
    total += updated;

    if total > 0 {
        info!("back-ref refresh: updated {} hash-keyed row(s)", total);
    }
    total
}

/// One tick's outcome of the orphan-GC pass.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct GcStats {
    /// Hashes newly observed orphaned this tick (added to the
    /// pending set).
    pub newly_marked: usize,
    /// Hashes that were marked last tick AND are still orphaned this
    /// tick AND every library is online — these are deleted.
    pub deleted_face_detections: usize,
    pub deleted_tagged_photo: usize,
    pub deleted_photo_insights: usize,
    /// Hashes dropped from the pending set because they re-appeared
    /// in image_exif (e.g. user remounted a backup that was briefly
    /// missing).
    pub revived: usize,
}

impl GcStats {
    pub fn changed(&self) -> bool {
        self.newly_marked > 0
            || self.deleted_face_detections > 0
            || self.deleted_tagged_photo > 0
            || self.deleted_photo_insights > 0
            || self.revived > 0
    }

    pub fn total_deleted(&self) -> usize {
        self.deleted_face_detections + self.deleted_tagged_photo + self.deleted_photo_insights
    }
}

/// Two-tick orphan-GC state. The watcher constructs one of these once
/// at startup and passes it back into `run_orphan_gc` every tick.
#[derive(Debug, Default)]
pub struct OrphanGcState {
    /// Hashes observed orphaned on the previous tick. A hash gets
    /// promoted to "delete" when it survives a second consecutive
    /// observation with all libraries online.
    pending: HashSet<String>,
    /// Whether every library was online on the previous tick. Combined
    /// with the all-online check on the current tick, this gives the
    /// "two consecutive ticks of full availability" guard described in
    /// CLAUDE.md → "Library availability and safety".
    prev_tick_all_online: bool,
}

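The two-tick consensus rule that `OrphanGcState` encodes can be modeled without the database at all. This is an illustrative toy (assumed names, no diesel) showing that a hash is only confirmed after two consecutive all-online observations, and that a stale tick defers without losing the mark:

```rust
use std::collections::HashSet;

// Toy two-tick consensus: confirm a hash for deletion only when it is
// observed orphaned on two consecutive ticks and both ticks were
// all-online. Names are hypothetical; the crate's real logic lives in
// run_orphan_gc.
#[derive(Default)]
struct Consensus {
    pending: HashSet<String>,
    prev_all_online: bool,
}

impl Consensus {
    // Returns the hashes confirmed for deletion on this tick.
    fn tick(&mut self, orphans: &HashSet<String>, all_online: bool) -> Vec<String> {
        // Revive anything no longer orphaned.
        self.pending.retain(|h| orphans.contains(h));
        // Consensus requires this tick AND the previous tick all-online.
        let confirmed: Vec<String> = if all_online && self.prev_all_online {
            orphans.iter().filter(|h| self.pending.contains(*h)).cloned().collect()
        } else {
            Vec::new()
        };
        for h in orphans {
            self.pending.insert(h.clone());
        }
        for h in &confirmed {
            self.pending.remove(h);
        }
        self.prev_all_online = all_online;
        confirmed
    }
}

fn main() {
    let mut c = Consensus::default();
    let orphans: HashSet<String> = ["abc".to_string()].into_iter().collect();
    assert!(c.tick(&orphans, true).is_empty());  // tick 1: marked only
    assert!(c.tick(&orphans, false).is_empty()); // stale tick: deferred
    assert!(c.tick(&orphans, true).is_empty());  // window reopens
    assert_eq!(c.tick(&orphans, true), vec!["abc".to_string()]); // confirmed
    println!("ok");
}
```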
/// Run one tick of the orphan GC. The function is responsible for the
/// full lifecycle: probing for orphans, updating `state.pending`,
/// performing deletes when consensus is reached, and returning stats
/// for the watcher to log.
///
/// Safety guard: `all_online` MUST reflect every configured library
/// being Online right now. Even if true, deletes only happen when the
/// previous tick was also all-online. A single Stale tick within the
/// window cancels any pending deletes (they stay marked but won't be
/// promoted) — they're then re-evaluated next tick.
pub fn run_orphan_gc(
    conn: &mut SqliteConnection,
    state: &mut OrphanGcState,
    all_online: bool,
) -> GcStats {
    let mut stats = GcStats::default();

    // Find every distinct content_hash referenced by hash-keyed
    // derived data that is NOT currently referenced by image_exif.
    // These are this tick's orphan candidates. Cheap query — three
    // index lookups plus a HashSet sized to the derived tables' row
    // count, which is small.
    let orphans: HashSet<String> = match collect_orphan_hashes(conn) {
        Ok(set) => set,
        Err(e) => {
            warn!("orphan-gc: candidate query failed: {:?}", e);
            return stats;
        }
    };

    // Drop entries from pending that are no longer orphaned
    // ("revived"). Common case: a network share that briefly went
    // stale comes back, image_exif gets re-populated by ingest, and
    // the hash is no longer orphaned.
    let revived = state
        .pending
        .difference(&orphans)
        .cloned()
        .collect::<Vec<_>>();
    if !revived.is_empty() {
        for h in &revived {
            state.pending.remove(h);
        }
        stats.revived = revived.len();
    }

    if !all_online {
        // Any Stale library cancels both the consensus window AND
        // any pending deletes. We *do* still note newly observed
        // orphans below — that's harmless bookkeeping. But we never
        // delete this tick.
        for h in &orphans {
            if state.pending.insert(h.clone()) {
                stats.newly_marked += 1;
            }
        }
        state.prev_tick_all_online = false;
        if stats.changed() {
            info!(
                "orphan-gc: {} new orphan hash(es) marked, {} revived (deferred — at least one library Stale; pending: {})",
                stats.newly_marked,
                stats.revived,
                state.pending.len()
            );
        } else {
            debug!(
                "orphan-gc: stale library, no changes (pending: {})",
                state.pending.len()
            );
        }
        return stats;
    }

    // All-online + previous-tick-also-all-online: hashes that are
    // both pending AND still orphaned this tick are confirmed and
    // get deleted. Hashes orphaned this tick but not pending get
    // freshly marked.
    let consensus_window_open = state.prev_tick_all_online;

    let to_delete: Vec<String> = if consensus_window_open {
        orphans
            .iter()
            .filter(|h| state.pending.contains(*h))
            .cloned()
            .collect()
    } else {
        Vec::new()
    };

    for h in &orphans {
        if !state.pending.contains(h) {
            state.pending.insert(h.clone());
            stats.newly_marked += 1;
        }
    }

    if !to_delete.is_empty() {
        match delete_hash_keyed_rows(conn, &to_delete) {
            Ok((faces, tags, insights)) => {
                stats.deleted_face_detections = faces;
                stats.deleted_tagged_photo = tags;
                stats.deleted_photo_insights = insights;
                // Drop deleted hashes from pending so we don't try to
                // re-delete them next tick (they'll have already been
                // removed from the orphan set).
                for h in &to_delete {
                    state.pending.remove(h);
                }
            }
            Err(e) => warn!("orphan-gc: delete batch failed: {:?}", e),
        }
    }

    state.prev_tick_all_online = true;

    if stats.changed() {
        info!(
            "orphan-gc: {} new orphan hash(es) marked, {} revived; deleted {} face_detections / {} tagged_photo / {} photo_insights row(s) (pending: {})",
            stats.newly_marked,
            stats.revived,
            stats.deleted_face_detections,
            stats.deleted_tagged_photo,
            stats.deleted_photo_insights,
            state.pending.len(),
        );
    } else {
        debug!(
            "orphan-gc: no changes this tick (pending: {})",
            state.pending.len()
        );
    }

    stats
}

/// Helper for the watcher: are *all enabled* libraries currently Online?
///
/// Disabled libraries are out-of-scope for the orphan-GC consensus
/// rule — they don't get probed, don't have a health entry, and a
/// system with one disabled library should still be able to GC
/// orphans for the remaining online libraries. Treating disabled as
/// "blocking" would mean flipping a library to `enabled=false` would
/// permanently halt GC, which is the opposite of the intended
/// kill-switch semantics ("turn this library off and let the rest of
/// the system run normally").
pub fn all_libraries_online(libs: &[Library], health: &LibraryHealthMap) -> bool {
    let guard = health.read().unwrap_or_else(|e| e.into_inner());
    libs.iter()
        .filter(|lib| lib.enabled)
        .all(|lib| guard.get(&lib.id).map(|h| h.is_online()).unwrap_or(false))
}

#[derive(QueryableByName, Debug)]
struct HashRow {
    #[diesel(sql_type = diesel::sql_types::Text)]
    content_hash: String,
}

fn collect_orphan_hashes(conn: &mut SqliteConnection) -> QueryResult<HashSet<String>> {
    // Union of every distinct content_hash carried by hash-keyed
    // derived tables, minus those still referenced by image_exif.
    let rows = sql_query(
        "SELECT DISTINCT content_hash FROM ( \
             SELECT content_hash FROM face_detections WHERE content_hash IS NOT NULL \
             UNION ALL \
             SELECT content_hash FROM tagged_photo WHERE content_hash IS NOT NULL \
             UNION ALL \
             SELECT content_hash FROM photo_insights WHERE content_hash IS NOT NULL \
         ) AS derived \
         WHERE content_hash NOT IN ( \
             SELECT content_hash FROM image_exif WHERE content_hash IS NOT NULL \
         )",
    )
    .get_results::<HashRow>(conn)?;

    Ok(rows.into_iter().map(|r| r.content_hash).collect())
}

/// Delete every hash-keyed row whose `content_hash` is in `hashes`.
/// Returns `(faces, tagged_photo, photo_insights)`.
fn delete_hash_keyed_rows(
    conn: &mut SqliteConnection,
    hashes: &[String],
) -> QueryResult<(usize, usize, usize)> {
    if hashes.is_empty() {
        return Ok((0, 0, 0));
    }

    use crate::database::schema::{face_detections, photo_insights, tagged_photo};

    let faces =
        diesel::delete(face_detections::table.filter(face_detections::content_hash.eq_any(hashes)))
            .execute(conn)?;
    let tags =
        diesel::delete(tagged_photo::table.filter(tagged_photo::content_hash.eq_any(hashes)))
            .execute(conn)?;
    let insights =
        diesel::delete(photo_insights::table.filter(photo_insights::content_hash.eq_any(hashes)))
            .execute(conn)?;

    Ok((faces, tags, insights))
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::database::test::in_memory_db_connection;

    fn ensure_library(conn: &mut SqliteConnection, library_id: i32) {
        diesel::sql_query(
            "INSERT OR IGNORE INTO libraries (id, name, root_path, created_at) \
             VALUES (?, 'test-' || ?, '/tmp/test-' || ?, 0)",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .execute(conn)
        .unwrap();
    }

    fn insert_image_exif(
        conn: &mut SqliteConnection,
        library_id: i32,
        rel_path: &str,
        content_hash: Option<&str>,
    ) {
        ensure_library(conn, library_id);
        diesel::sql_query(
            "INSERT INTO image_exif (library_id, rel_path, created_time, last_modified, content_hash) \
             VALUES (?, ?, 0, 0, ?)",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Text, _>(rel_path)
        .bind::<diesel::sql_types::Nullable<diesel::sql_types::Text>, _>(content_hash)
        .execute(conn)
        .unwrap();
    }

    fn insert_face(conn: &mut SqliteConnection, library_id: i32, rel_path: &str, hash: &str) {
        ensure_library(conn, library_id);
        diesel::sql_query(
            "INSERT INTO face_detections (library_id, content_hash, rel_path, source, status, model_version, created_at) \
             VALUES (?, ?, ?, 'auto', 'no_faces', 'v', 0)",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Text, _>(hash)
        .bind::<diesel::sql_types::Text, _>(rel_path)
        .execute(conn)
        .unwrap();
    }

    fn insert_tag_with_hash(conn: &mut SqliteConnection, rel_path: &str, hash: &str) {
        diesel::sql_query("INSERT OR IGNORE INTO tags (id, name, created_time) VALUES (1, 't', 0)")
            .execute(conn)
            .unwrap();
        diesel::sql_query(
            "INSERT INTO tagged_photo (rel_path, tag_id, created_time, content_hash) VALUES (?, 1, 0, ?)",
        )
        .bind::<diesel::sql_types::Text, _>(rel_path)
        .bind::<diesel::sql_types::Text, _>(hash)
        .execute(conn)
        .unwrap();
    }

    fn insert_insight_with_hash(
        conn: &mut SqliteConnection,
        library_id: i32,
        rel_path: &str,
        hash: &str,
    ) {
        ensure_library(conn, library_id);
        diesel::sql_query(
            "INSERT INTO photo_insights (library_id, rel_path, title, summary, generated_at, model_version, is_current, backend, content_hash) \
             VALUES (?, ?, 't', 's', 0, 'v', 1, 'local', ?)",
        )
        .bind::<diesel::sql_types::Integer, _>(library_id)
        .bind::<diesel::sql_types::Text, _>(rel_path)
        .bind::<diesel::sql_types::Text, _>(hash)
        .execute(conn)
        .unwrap();
    }

    #[derive(QueryableByName, Debug)]
    struct CountRow {
        #[diesel(sql_type = diesel::sql_types::BigInt)]
        n: i64,
    }
    fn count(conn: &mut SqliteConnection, sql: &str) -> i64 {
        diesel::sql_query(sql)
            .get_result::<CountRow>(conn)
            .unwrap()
            .n
    }

    #[test]
    fn refresh_back_refs_repoints_face_detection_after_move() {
        let mut conn = in_memory_db_connection();
        // Original location lib 1, rel "old.jpg". image_exif row gone
        // (file moved); only the new lib 2 row remains.
|
||||
insert_image_exif(&mut conn, 2, "new.jpg", Some("h1"));
|
||||
insert_face(&mut conn, 1, "old.jpg", "h1");
|
||||
|
||||
let updated = refresh_back_refs(&mut conn);
|
||||
assert_eq!(updated, 1);
|
||||
|
||||
let row = diesel::sql_query("SELECT library_id AS n FROM face_detections")
|
||||
.get_result::<CountRow>(&mut conn)
|
||||
.unwrap();
|
||||
assert_eq!(row.n, 2, "library_id should now point at lib 2");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn refresh_back_refs_no_change_when_back_ref_still_valid() {
|
||||
let mut conn = in_memory_db_connection();
|
||||
insert_image_exif(&mut conn, 1, "a.jpg", Some("h1"));
|
||||
insert_face(&mut conn, 1, "a.jpg", "h1");
|
||||
|
||||
let updated = refresh_back_refs(&mut conn);
|
||||
assert_eq!(updated, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn refresh_back_refs_no_change_when_hash_fully_orphaned() {
|
||||
// Hash exists on face_detections but no surviving image_exif
|
||||
// row for it → the refresh is a no-op (orphan GC handles
|
||||
// these). Important: the SET subquery would return NULL and
|
||||
// we'd null out the back-ref otherwise; the EXISTS guard
|
||||
// protects against that.
|
||||
let mut conn = in_memory_db_connection();
|
||||
insert_face(&mut conn, 1, "gone.jpg", "h1");
|
||||
|
||||
let updated = refresh_back_refs(&mut conn);
|
||||
assert_eq!(updated, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn orphan_gc_requires_two_consecutive_all_online_ticks() {
|
||||
let mut conn = in_memory_db_connection();
|
||||
// Hash present in face_detections but NOT image_exif → orphan.
|
||||
insert_face(&mut conn, 1, "x.jpg", "h-orphan");
|
||||
let mut state = OrphanGcState::default();
|
||||
|
||||
// Tick 1: prev_tick_all_online is false (default), so even
|
||||
// with current tick all-online we mark only.
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert_eq!(stats.newly_marked, 1);
|
||||
assert_eq!(stats.total_deleted(), 0);
|
||||
assert_eq!(state.pending.len(), 1);
|
||||
|
||||
// Tick 2: prev_tick_all_online is now true, current tick still
|
||||
// all-online → consensus reached, hash gets deleted.
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert_eq!(stats.deleted_face_detections, 1);
|
||||
assert!(state.pending.is_empty());
|
||||
|
||||
// Tick 3: nothing left.
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert_eq!(stats.total_deleted(), 0);
|
||||
assert_eq!(stats.newly_marked, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn orphan_gc_resets_consensus_on_stale_library() {
|
||||
let mut conn = in_memory_db_connection();
|
||||
insert_face(&mut conn, 1, "x.jpg", "h-orphan");
|
||||
let mut state = OrphanGcState::default();
|
||||
|
||||
// Tick 1: all-online, mark.
|
||||
run_orphan_gc(&mut conn, &mut state, true);
|
||||
// Tick 2: stale library — consensus window resets, no delete.
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, false);
|
||||
assert_eq!(stats.total_deleted(), 0);
|
||||
assert!(!state.prev_tick_all_online);
|
||||
// Tick 3: all-online again — but we need ANOTHER tick to set
|
||||
// prev_tick_all_online before deletes can fire. So tick 3
|
||||
// marks (no-op on existing pending), tick 4 deletes.
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert_eq!(stats.total_deleted(), 0);
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert_eq!(stats.deleted_face_detections, 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn orphan_gc_revives_when_image_exif_reappears() {
|
||||
let mut conn = in_memory_db_connection();
|
||||
insert_face(&mut conn, 1, "x.jpg", "h-orphan");
|
||||
let mut state = OrphanGcState::default();
|
||||
|
||||
// Tick 1: mark.
|
||||
run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert!(state.pending.contains("h-orphan"));
|
||||
|
||||
// Between ticks, the image_exif row reappears (e.g. backup
|
||||
// share was briefly stale). Hash is no longer orphaned.
|
||||
insert_image_exif(&mut conn, 2, "x.jpg", Some("h-orphan"));
|
||||
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert_eq!(stats.revived, 1);
|
||||
assert_eq!(stats.total_deleted(), 0);
|
||||
assert!(state.pending.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn orphan_gc_deletes_across_all_three_tables() {
|
||||
let mut conn = in_memory_db_connection();
|
||||
// Same orphan hash appears in all three derived tables.
|
||||
insert_face(&mut conn, 1, "a.jpg", "h-orphan");
|
||||
insert_tag_with_hash(&mut conn, "a.jpg", "h-orphan");
|
||||
insert_insight_with_hash(&mut conn, 1, "a.jpg", "h-orphan");
|
||||
|
||||
let mut state = OrphanGcState::default();
|
||||
run_orphan_gc(&mut conn, &mut state, true);
|
||||
let stats = run_orphan_gc(&mut conn, &mut state, true);
|
||||
assert_eq!(stats.deleted_face_detections, 1);
|
||||
assert_eq!(stats.deleted_tagged_photo, 1);
|
||||
assert_eq!(stats.deleted_photo_insights, 1);
|
||||
|
||||
assert_eq!(
|
||||
count(&mut conn, "SELECT COUNT(*) AS n FROM face_detections"),
|
||||
0
|
||||
);
|
||||
assert_eq!(
|
||||
count(&mut conn, "SELECT COUNT(*) AS n FROM tagged_photo"),
|
||||
0
|
||||
);
|
||||
assert_eq!(
|
||||
count(&mut conn, "SELECT COUNT(*) AS n FROM photo_insights"),
|
||||
0
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn all_libraries_online_helper() {
|
||||
use crate::libraries::{LibraryHealth, new_health_map};
|
||||
let libs = vec![
|
||||
Library {
|
||||
id: 1,
|
||||
name: "a".into(),
|
||||
root_path: "/x".into(),
|
||||
enabled: true,
|
||||
excluded_dirs: Vec::new(),
|
||||
},
|
||||
Library {
|
||||
id: 2,
|
||||
name: "b".into(),
|
||||
root_path: "/y".into(),
|
||||
enabled: true,
|
||||
excluded_dirs: Vec::new(),
|
||||
},
|
||||
];
|
||||
let health = new_health_map(&libs);
|
||||
assert!(all_libraries_online(&libs, &health));
|
||||
|
||||
// Flip lib 2 to stale.
|
||||
{
|
||||
let mut g = health.write().unwrap();
|
||||
g.insert(
|
||||
2,
|
||||
LibraryHealth::Stale {
|
||||
reason: "test".into(),
|
||||
since: 0,
|
||||
},
|
||||
);
|
||||
}
|
||||
assert!(!all_libraries_online(&libs, &health));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn all_libraries_online_treats_disabled_as_out_of_scope() {
|
||||
use crate::libraries::{LibraryHealth, new_health_map};
|
||||
// lib 1 enabled+online, lib 2 disabled (would be treated as
|
||||
// Online in the health map's optimistic seed but the map
|
||||
// entry is irrelevant — disabled libs are filtered out
|
||||
// before the health lookup).
|
||||
let libs = vec![
|
||||
Library {
|
||||
id: 1,
|
||||
name: "a".into(),
|
||||
root_path: "/x".into(),
|
||||
enabled: true,
|
||||
excluded_dirs: Vec::new(),
|
||||
},
|
||||
Library {
|
||||
id: 2,
|
||||
name: "b".into(),
|
||||
root_path: "/y".into(),
|
||||
enabled: false,
|
||||
excluded_dirs: Vec::new(),
|
||||
},
|
||||
];
|
||||
let health = new_health_map(&libs);
|
||||
// Sanity: forcibly mark lib 2 stale to prove disabled wins
|
||||
// over even an explicit Stale entry — the filter skips it
|
||||
// before the health check happens.
|
||||
{
|
||||
let mut g = health.write().unwrap();
|
||||
g.insert(
|
||||
2,
|
||||
LibraryHealth::Stale {
|
||||
reason: "intentionally stale".into(),
|
||||
since: 0,
|
||||
},
|
||||
);
|
||||
}
|
||||
assert!(
|
||||
all_libraries_online(&libs, &health),
|
||||
"disabled library should not block consensus"
|
||||
);
|
||||
}
|
||||
}
|
||||
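The two-tick consensus the tests above exercise can be reduced to a small standalone state machine. This is an illustrative sketch only: `GcState` and `gc_tick` are hypothetical names, the real logic lives in `library_maintenance::run_orphan_gc` with DB-backed marking, deletion stats, and per-table deletes.

```rust
use std::collections::HashSet;

#[derive(Default)]
struct GcState {
    prev_tick_all_online: bool,
    pending: HashSet<String>,
}

// One maintenance tick. `orphans` is the set of hashes currently lacking
// any image_exif row; returns the hashes that are safe to delete.
fn gc_tick(state: &mut GcState, all_online: bool, orphans: &[&str]) -> Vec<String> {
    // Revive: anything pending that is no longer orphaned drops out.
    state.pending.retain(|h| orphans.contains(&h.as_str()));

    let mut deleted = Vec::new();
    // Delete only after two consecutive all-online ticks (consensus).
    if all_online && state.prev_tick_all_online {
        deleted.extend(state.pending.drain());
    }
    // Mark surviving orphans for the next tick's consensus check.
    if all_online {
        for h in orphans {
            if !deleted.iter().any(|d| d == h) {
                state.pending.insert((*h).to_string());
            }
        }
    }
    state.prev_tick_all_online = all_online;
    deleted
}
```

A stale tick anywhere in the window resets `prev_tick_all_online`, so a delete always requires two clean ticks in a row, mirroring `orphan_gc_resets_consensus_on_stale_library`.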
383
src/main.rs
@@ -64,6 +64,7 @@ mod auth;
 mod content_hash;
 mod data;
 mod database;
+mod duplicates;
 mod error;
 mod exif;
 mod face_watch;
@@ -72,6 +73,8 @@ mod file_types;
 mod files;
 mod geo;
 mod libraries;
+mod library_maintenance;
+mod perceptual_hash;
 mod state;
 mod tags;
 mod utils;
@@ -150,7 +153,12 @@ async fn get_image(
     let relative_path_str = relative_path.to_string_lossy().replace('\\', "/");
 
     let thumbs = &app_state.thumbnail_path;
-    let legacy_thumb_path = Path::new(&thumbs).join(relative_path);
+    let bare_legacy_thumb_path = Path::new(&thumbs).join(relative_path);
+    let scoped_legacy_thumb_path = content_hash::library_scoped_legacy_path(
+        Path::new(&thumbs),
+        library.id,
+        relative_path,
+    );
 
     // Gif thumbnails are a separate lookup (video GIF previews).
     // Dual-lookup for gif is out of scope; preserve existing flow.
@@ -168,8 +176,16 @@ async fn get_image(
         }
     }
 
-    // Resolve the hash-keyed thumbnail (if the row already has a
-    // content_hash) and fall back to the legacy mirrored path.
+    // Lookup chain (most-specific first, falling back as we miss):
+    //   1. hash-keyed (`<thumbs>/<hash[..2]>/<hash>.jpg`) — content
+    //      identity, shared across libraries;
+    //   2. library-scoped legacy (`<thumbs>/<lib_id>/<rel_path>`) —
+    //      written by current generation when hash isn't known;
+    //   3. bare legacy (`<thumbs>/<rel_path>`) — pre-multi-library
+    //      thumbs from the days before library prefixing existed.
+    // Stage (3) goes away once a one-time migration lifts every
+    // bare-legacy file under a library prefix; until then it
+    // prevents needless 404s for already-warmed deployments.
     let hash_thumb_path: Option<PathBuf> = {
         let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
         match dao.get_exif(&context, &relative_path_str) {
@@ -184,7 +200,14 @@ async fn get_image(
         .as_ref()
         .filter(|p| p.exists())
         .cloned()
-        .unwrap_or_else(|| legacy_thumb_path.clone());
+        .or_else(|| {
+            if scoped_legacy_thumb_path.exists() {
+                Some(scoped_legacy_thumb_path.clone())
+            } else {
+                None
+            }
+        })
+        .unwrap_or_else(|| bare_legacy_thumb_path.clone());
 
     // Handle circular thumbnail request
     if req.shape == Some(ThumbnailShape::Circle) {
@@ -509,6 +532,11 @@ async fn set_image_gps(
             .ok()
             .map(|c| c.content_hash),
         size_bytes: content_hash::compute(&full_path).ok().map(|c| c.size_bytes),
+        // GPS-update path doesn't touch perceptual hashes either; columns
+        // ignored by update_exif. Compute best-effort so a new file lands
+        // with a usable signal; failure just leaves prior values in place.
+        phash_64: perceptual_hash::compute(&full_path).map(|h| h.phash_64),
+        dhash_64: perceptual_hash::compute(&full_path).map(|h| h.dhash_64),
     };
 
     let updated = {
@@ -631,6 +659,37 @@ async fn upload_image(
         &full_path.to_str().unwrap().to_string(),
         true,
     ) {
+        // Pre-write content-hash check: if these exact bytes already
+        // exist anywhere in any library (and aren't themselves
+        // soft-marked as duplicates), don't write the file. Return
+        // 409 with the canonical sibling so the mobile app can show
+        // a friendly "already in your library" toast.
+        let upload_hash = blake3::Hasher::new()
+            .update(&file_content)
+            .finalize()
+            .to_hex()
+            .to_string();
+        {
+            let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
+            if let Ok(Some(existing)) = dao.find_by_content_hash(&span_context, &upload_hash)
+                && existing.duplicate_of_hash.is_none()
+            {
+                let library_name = libraries::load_all(&mut crate::database::connect())
+                    .into_iter()
+                    .find(|l| l.id == existing.library_id)
+                    .map(|l| l.name);
+                span.set_status(Status::Ok);
+                return HttpResponse::Conflict().json(serde_json::json!({
+                    "duplicate_of": {
+                        "library_id": existing.library_id,
+                        "rel_path": existing.file_path,
+                    },
+                    "content_hash": upload_hash,
+                    "library_name": library_name,
+                }));
+            }
+        }
+
         let context =
             opentelemetry::Context::new().with_remote_span_context(span.span_context().clone());
         tracer
@@ -689,6 +748,7 @@ async fn upload_image(
             (None, None)
         }
     };
+    let perceptual = perceptual_hash::compute(&uploaded_path);
     let insert_exif = InsertImageExif {
         library_id: target_library.id,
         file_path: relative_path.clone(),
@@ -710,6 +770,8 @@ async fn upload_image(
         last_modified: timestamp,
         content_hash,
         size_bytes,
+        phash_64: perceptual.map(|h| h.phash_64),
+        dhash_64: perceptual.map(|h| h.dhash_64),
     };
 
     if let Ok(mut dao) = exif_dao.lock() {
@@ -761,6 +823,15 @@ async fn generate_video(
 
     if let Some(name) = filename.file_name() {
         let filename = name.to_str().expect("Filename should convert to string");
+        // KNOWN ISSUE (multi-library): playlist filename is the basename
+        // alone, so two source files with the same basename — whether in
+        // different libraries or different subdirs of one library —
+        // overwrite each other's playlists while ffmpeg runs. The
+        // hash-keyed `content_hash::hls_dir` is the long-term answer
+        // (see CLAUDE.md "Multi-library data model"); rewiring the
+        // actor pipeline to use it is out of scope for this branch.
+        // The orphan-cleanup job above already walks every library so
+        // it doesn't false-delete archive playlists.
         let playlist = format!("{}/{}.m3u8", app_state.video_path, filename);
 
         let library = libraries::resolve_library_param(&app_state, body.library.as_deref())
@@ -1305,19 +1376,41 @@ fn create_thumbnails(libs: &[libraries::Library], excluded_dirs: &[String]) {
             lib.name, lib.root_path
         );
         let images = PathBuf::from(&lib.root_path);
+        // Effective excludes = global env-var excludes ∪ library row's
+        // excluded_dirs. Lets a parent-library mount skip the subtree
+        // already covered by a child library.
+        let effective_excludes = lib.effective_excluded_dirs(excluded_dirs);
 
         // Prune EXCLUDED_DIRS so we don't generate thumbnails-of-thumbnails
         // for Synology @eaDir trees. file_scan handles filter_entry pruning.
-        image_api::file_scan::walk_library_files(&images, excluded_dirs)
+        image_api::file_scan::walk_library_files(&images, &effective_excludes)
             .into_par_iter()
             .for_each(|entry| {
                 let src = entry.path();
                 let Ok(relative_path) = src.strip_prefix(&images) else {
                     return;
                 };
-                let thumb_path = Path::new(thumbnail_directory).join(relative_path);
+                // Library-scoped legacy path: prevents two libraries with
+                // the same rel_path from clobbering each other's thumbs.
+                // Hash-keyed promotion happens lazily on first hash-aware
+                // request — keeping this loop ExifDao-free preserves the
+                // current "cargo build && go" startup story.
+                let thumb_path = content_hash::library_scoped_legacy_path(
+                    thumbnail_directory,
+                    lib.id,
+                    relative_path,
+                );
+                let bare_legacy = thumbnail_directory.join(relative_path);
 
-                if thumb_path.exists() || unsupported_thumbnail_sentinel(&thumb_path).exists() {
+                // Backwards-compat check: if a single-library install has a
+                // bare-legacy thumb here already, accept it as present.
+                // Same for the sentinel. Means we don't redo work after
+                // upgrade and we don't leave stale duplicates around.
+                if thumb_path.exists()
+                    || bare_legacy.exists()
+                    || unsupported_thumbnail_sentinel(&thumb_path).exists()
+                    || unsupported_thumbnail_sentinel(&bare_legacy).exists()
+                {
                     return;
                 }
 
@@ -1365,7 +1458,8 @@ fn create_thumbnails(libs: &[libraries::Library], excluded_dirs: &[String]) {
     debug!("Finished making thumbnails");
 
     for lib in libs {
-        update_media_counts(Path::new(&lib.root_path), excluded_dirs);
+        let effective_excludes = lib.effective_excluded_dirs(excluded_dirs);
+        update_media_counts(Path::new(&lib.root_path), &effective_excludes);
     }
 }
 
@@ -1462,10 +1556,18 @@ fn main() -> std::io::Result<()> {
         preview_gen_for_watcher,
         app_state.face_client.clone(),
         app_state.excluded_dirs.clone(),
+        app_state.library_health.clone(),
     );
 
-    // Start orphaned playlist cleanup job
-    cleanup_orphaned_playlists(app_state.excluded_dirs.clone());
+    // Start orphaned playlist cleanup job. Multi-library aware: walks
+    // every configured library when looking for the source video, and
+    // skips the whole cycle while any library is stale (a missing
+    // source is indistinguishable from a transiently-unmounted share).
+    cleanup_orphaned_playlists(
+        app_state.libraries.clone(),
+        app_state.excluded_dirs.clone(),
+        app_state.library_health.clone(),
+    );
 
     // Spawn background job to generate daily conversation summaries
     {
@@ -1600,6 +1702,7 @@ fn main() -> std::io::Result<()> {
         .add_feature(add_tag_services::<_, SqliteTagDao>)
         .add_feature(knowledge::add_knowledge_services::<_, SqliteKnowledgeDao>)
         .add_feature(faces::add_face_services::<_, faces::SqliteFaceDao>)
+        .add_feature(duplicates::add_duplicate_services)
         .app_data(app_data.clone())
         .app_data::<Data<RealFileSystem>>(Data::new(RealFileSystem::new(
             app_data.base_path.clone(),
@@ -1657,10 +1760,13 @@ fn run_migrations(
 }
 
 /// Clean up orphaned HLS playlists and segments whose source videos no longer exist
-fn cleanup_orphaned_playlists(excluded_dirs: Vec<String>) {
+fn cleanup_orphaned_playlists(
+    libs: Vec<libraries::Library>,
+    excluded_dirs: Vec<String>,
+    library_health: libraries::LibraryHealthMap,
+) {
     std::thread::spawn(move || {
         let video_path = dotenv::var("VIDEO_PATH").expect("VIDEO_PATH must be set");
-        let base_path = dotenv::var("BASE_PATH").expect("BASE_PATH must be set");
 
         // Get cleanup interval from environment (default: 24 hours)
         let cleanup_interval_secs = dotenv::var("PLAYLIST_CLEANUP_INTERVAL_SECONDS")
@@ -1671,10 +1777,39 @@ fn cleanup_orphaned_playlists(
         info!("Starting orphaned playlist cleanup job");
         info!("  Cleanup interval: {} seconds", cleanup_interval_secs);
         info!("  Playlist directory: {}", video_path);
+        for lib in &libs {
+            info!(
+                "  Checking sources under '{}' at {}",
+                lib.name, lib.root_path
+            );
+        }
 
         loop {
            std::thread::sleep(Duration::from_secs(cleanup_interval_secs));
 
+            // Safety gate: skip the cleanup cycle if any library is
+            // stale. A missing source video on a stale library is
+            // indistinguishable from a transient unmount, and the
+            // cleanup is destructive — we'd rather leak a few playlist
+            // files for a tick than delete one whose source is briefly
+            // unreachable. The cycle re-runs on the next interval.
+            {
+                let guard = library_health.read().unwrap_or_else(|e| e.into_inner());
+                let stale: Vec<String> = libs
+                    .iter()
+                    .filter(|lib| guard.get(&lib.id).map(|h| !h.is_online()).unwrap_or(false))
+                    .map(|lib| lib.name.clone())
+                    .collect();
+                if !stale.is_empty() {
+                    warn!(
+                        "Skipping orphaned-playlist cleanup: {} library(ies) stale: [{}]",
+                        stale.len(),
+                        stale.join(", ")
+                    );
+                    continue;
+                }
+            }
+
             info!("Running orphaned playlist cleanup");
             let start = std::time::Instant::now();
             let mut deleted_count = 0;
@@ -1703,20 +1838,26 @@ fn cleanup_orphaned_playlists(
             if let Some(filename) = playlist_path.file_stem() {
                 let video_filename = filename.to_string_lossy();
 
-                // Search for this video file in BASE_PATH, respecting
-                // EXCLUDED_DIRS so we don't false-resurrect playlists for
-                // videos that only exist inside an excluded subtree.
+                // Search for this video file across every configured
+                // library, respecting EXCLUDED_DIRS so we don't
+                // false-resurrect playlists for videos that only
+                // exist inside an excluded subtree. As soon as one
+                // library has a matching source, we're done — the
+                // playlist isn't orphaned.
                 let mut video_exists = false;
-                for entry in image_api::file_scan::walk_library_files(
-                    Path::new(&base_path),
-                    &excluded_dirs,
-                ) {
-                    if let Some(entry_stem) = entry.path().file_stem()
-                        && entry_stem == filename
-                        && is_video_file(entry.path())
-                    {
-                        video_exists = true;
-                        break;
+                'libs: for lib in &libs {
+                    let effective = lib.effective_excluded_dirs(&excluded_dirs);
+                    for entry in image_api::file_scan::walk_library_files(
+                        Path::new(&lib.root_path),
+                        &effective,
+                    ) {
+                        if let Some(entry_stem) = entry.path().file_stem()
+                            && entry_stem == filename
+                            && is_video_file(entry.path())
+                        {
+                            video_exists = true;
+                            break 'libs;
+                        }
                     }
                 }
 
@@ -1792,6 +1933,7 @@ fn watch_files(
     preview_generator: Addr<video::actors::PreviewClipGenerator>,
     face_client: crate::ai::face_client::FaceClient,
     excluded_dirs: Vec<String>,
+    library_health: libraries::LibraryHealthMap,
 ) {
     std::thread::spawn(move || {
         // Get polling intervals from environment variables
@@ -1850,6 +1992,52 @@ fn watch_files(
         let mut last_full_scan = SystemTime::now();
         let mut scan_count = 0u64;
 
+        // Per-library cursor for the missing-file scan. Each tick reads
+        // a page from `offset`, stat()s the rows, deletes confirmed-
+        // missing ones, and advances or wraps the cursor. State held
+        // in-memory so a watcher restart resumes from 0 — fine, the
+        // sweep is idempotent.
+        let mut missing_file_offsets: std::collections::HashMap<i32, i64> =
+            std::collections::HashMap::new();
+
+        let missing_scan_page_size: i64 = dotenv::var("IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE")
+            .ok()
+            .and_then(|s| s.parse().ok())
+            .filter(|n: &i64| *n > 0)
+            .unwrap_or(library_maintenance::DEFAULT_SCAN_PAGE_SIZE);
+        let missing_delete_cap: usize = dotenv::var("IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK")
+            .ok()
+            .and_then(|s| s.parse().ok())
+            .filter(|n: &usize| *n > 0)
+            .unwrap_or(library_maintenance::DEFAULT_MISSING_DELETE_CAP);
+
+        // Two-tick orphan-GC consensus state. Carried across ticks via
+        // `OrphanGcState`; see library_maintenance::run_orphan_gc.
+        let mut orphan_gc_state = library_maintenance::OrphanGcState::default();
+
+        // Initial availability sweep before the loop's first sleep so
+        // /libraries reports the truth from the very first request,
+        // rather than the optimistic Online default that
+        // new_health_map seeds. Without this, an unmounted share would
+        // appear online for up to WATCH_QUICK_INTERVAL_SECONDS (default
+        // 60s) after boot. Same probe logic as the per-tick gate
+        // below; no ingest runs here, just the health update + log.
+        // Disabled libraries skip the probe entirely — they should
+        // never enter the health map (treated as out-of-scope).
+        for lib in &libs {
+            if !lib.enabled {
+                continue;
+            }
+            let context = opentelemetry::Context::new();
+            let had_data = exif_dao
+                .lock()
+                .expect("exif_dao poisoned")
+                .count_for_library(&context, lib.id)
+                .map(|n| n > 0)
+                .unwrap_or(false);
+            libraries::refresh_health(&library_health, lib, had_data);
+        }
+
         loop {
             std::thread::sleep(Duration::from_secs(quick_interval_secs));
 
@@ -1861,6 +2049,44 @@ fn watch_files(
             let is_full_scan = since_last_full.as_secs() >= full_interval_secs;
 
             for lib in &libs {
+                // Operator kill switch: a disabled library is invisible
+                // to the watcher entirely. No probe, no ingest, no
+                // maintenance, no health entry. Distinct from Stale —
+                // Stale is "we wanted to but couldn't"; Disabled is
+                // "we don't want to". Toggle via SQL.
+                if !lib.enabled {
+                    debug!(
+                        "watcher: skipping library '{}' (id={}) — enabled=false",
+                        lib.name, lib.id
+                    );
+                    continue;
+                }
+
+                // Availability probe: every tick checks that the
+                // library's mount is reachable, is a directory, is
+                // readable, and (if image_exif has rows for it) is
+                // non-empty. A Stale library skips ingest, backlog
+                // drains, and metric refresh — reads/serving in HTTP
+                // handlers continue to work. Branches B/C extend the
+                // probe gate to cover handoff and orphan GC. See
+                // CLAUDE.md "Library availability and safety".
+                let had_data = {
+                    let context = opentelemetry::Context::new();
+                    let mut guard = exif_dao.lock().expect("exif_dao poisoned");
+                    guard
+                        .count_for_library(&context, lib.id)
+                        .map(|n| n > 0)
+                        .unwrap_or(false)
+                };
+                let health = libraries::refresh_health(&library_health, lib, had_data);
+                if !health.is_online() {
+                    // Skip every write path for this library this tick.
+                    // Don't refresh the media-count gauge either — a
+                    // probe-failed library would otherwise flap to 0
+                    // image / 0 video and pollute Prometheus.
+                    continue;
+                }
+
                 // Drain the unhashed-hash backlog AND the face-detection
                 // backlog every tick, regardless of quick/full. Quick
                 // scans only walk recently-modified files, so the
@@ -1868,6 +2094,11 @@ fn watch_files(
                 // — without these standalone passes, backfill +
                 // detection only progressed during full scans
                 // (default once an hour).
+                // Effective excludes for this library: global env-var
+                // ∪ row's excluded_dirs. Compute once per tick — used
+                // by every walker below for this library.
+                let effective_excludes = lib.effective_excluded_dirs(&excluded_dirs);
+
                 if face_client.is_enabled() {
                     let context = opentelemetry::Context::new();
                     backfill_unhashed_backlog(&context, lib, &exif_dao);
@@ -1877,7 +2108,7 @@ fn watch_files(
                         &face_client,
                         &face_dao,
                         &watcher_tag_dao,
-                        &excluded_dirs,
+                        &effective_excludes,
                     );
                 }
 
@@ -1893,7 +2124,7 @@ fn watch_files(
                     Arc::clone(&face_dao),
                     Arc::clone(&watcher_tag_dao),
                     face_client.clone(),
-                    &excluded_dirs,
+                    &effective_excludes,
                     None,
                     playlist_manager.clone(),
                     preview_generator.clone(),
@@ -1914,7 +2145,7 @@ fn watch_files(
                     Arc::clone(&face_dao),
                     Arc::clone(&watcher_tag_dao),
                     face_client.clone(),
-                    &excluded_dirs,
+                    &effective_excludes,
                     Some(check_since),
                     playlist_manager.clone(),
                     preview_generator.clone(),
@@ -1922,7 +2153,66 @@ fn watch_files(
|
||||
}
|
||||
|
||||
// Update media counts per library (metric aggregates across all)
|
||||
update_media_counts(Path::new(&lib.root_path), &excluded_dirs);
|
||||
update_media_counts(Path::new(&lib.root_path), &effective_excludes);
|
||||
|
||||
// Missing-file detection: prune image_exif rows whose
|
||||
// source file is no longer on disk. Per-library, so we
|
||||
// pass library-online-this-tick implicitly (we only
|
||||
// reach here if the probe gate at the top of the
|
||||
// iteration passed). Capped + paginated so a huge
|
||||
// library doesn't stall the watcher; rows we don't
|
||||
// visit this tick get visited next tick. See
|
||||
// library_maintenance::detect_missing_files_for_library.
|
||||
{
|
||||
let context = opentelemetry::Context::new();
|
||||
let offset = missing_file_offsets.get(&lib.id).copied().unwrap_or(0);
|
||||
let (deleted, next_offset) =
|
||||
library_maintenance::detect_missing_files_for_library(
|
||||
&context,
|
||||
lib,
|
||||
&exif_dao,
|
||||
offset,
|
||||
missing_scan_page_size,
|
||||
missing_delete_cap,
|
||||
);
|
||||
missing_file_offsets.insert(lib.id, next_offset);
|
||||
if deleted > 0 {
|
||||
debug!(
|
||||
"missing-file scan: library '{}' next_offset={}",
|
||||
lib.name, next_offset
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
        // Reconciliation: cross-library, so it runs once per tick
        // outside the per-library loop. Idempotent — fast no-op when
        // there's nothing to do. Operates on the database alone, no
        // filesystem dependency, so it doesn't need a health gate.
        // See database::reconcile and CLAUDE.md "Multi-library data
        // model" for the rules.
        {
            let mut conn = image_api::database::connect();
            let _ = image_api::database::reconcile::run(&mut conn);

            // Back-ref refresh: hash-keyed rows whose
            // (library_id, rel_path) tuple no longer matches any
            // image_exif row but whose hash still does. After a
            // recent→archive move, the missing-file scan removes
            // the old image_exif row; this pass repoints face /
            // tag / insight back-refs at the surviving location.
            // DB-only, no health gate needed — uses what's in
            // image_exif as truth.
            let _ = library_maintenance::refresh_back_refs(&mut conn);

            // Orphan GC: the destructive end of the maintenance
            // pipeline. Two-tick consensus + every-library-online
            // requirement is enforced inside run_orphan_gc; we
            // pass the current all-online flag and the function
            // tracks the previous tick's flag in OrphanGcState.
            let all_online = library_maintenance::all_libraries_online(&libs, &library_health);
            let _ =
                library_maintenance::run_orphan_gc(&mut conn, &mut orphan_gc_state, all_online);
        }
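The two-tick consensus gate described in the comments above can be sketched as follows. This is a hypothetical reconstruction: the real `OrphanGcState` lives in `library_maintenance` and may carry more state.

```rust
// Hypothetical sketch of the two-tick gate; the real OrphanGcState in
// library_maintenance may differ.
#[derive(Default)]
pub struct OrphanGcState {
    prev_tick_all_online: bool,
}

impl OrphanGcState {
    /// GC may run only when every library is online on this tick AND
    /// was online on the previous tick. A single offline tick resets
    /// the window, so a flapping mount never reaches the destructive
    /// path.
    pub fn may_collect(&mut self, all_online_now: bool) -> bool {
        let consensus = all_online_now && self.prev_tick_all_online;
        self.prev_tick_all_online = all_online_now;
        consensus
    }
}
```

The point of requiring two consecutive online ticks is that a library that just remounted gets one full scan tick to repopulate `image_exif` before anything hash-keyed can be considered orphaned.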
    if is_full_scan {

@@ -1992,7 +2282,9 @@ fn process_new_files(

    let existing_exif_paths: HashMap<String, bool> = {
        let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
-       match dao.get_exif_batch(&context, &file_paths) {
+       // Walk is per-library, so scope the lookup so a same-named file
+       // in another library doesn't make this one look already-indexed.
+       match dao.get_exif_batch(&context, Some(library.id), &file_paths) {
            Ok(exif_records) => exif_records
                .into_iter()
                .map(|record| (record.file_path, true))

@@ -2012,9 +2304,19 @@ fn process_new_files(
    // derivative dedup and DB-indexed sort/filter work for every file,
    // not just photos with parseable EXIF.
    for (file_path, relative_path) in &files {
-       let thumb_path = thumbnail_directory.join(relative_path);
-       let needs_thumbnail =
-           !thumb_path.exists() && !unsupported_thumbnail_sentinel(&thumb_path).exists();
+       // Check both the library-scoped legacy path (current shape) and
+       // the bare-legacy path (pre-multi-library shape). Either one
+       // existing means a thumbnail is already on disk for this file.
+       let scoped_thumb_path = content_hash::library_scoped_legacy_path(
+           thumbnail_directory,
+           library.id,
+           relative_path,
+       );
+       let bare_legacy_thumb_path = thumbnail_directory.join(relative_path);
+       let needs_thumbnail = !scoped_thumb_path.exists()
+           && !bare_legacy_thumb_path.exists()
+           && !unsupported_thumbnail_sentinel(&scoped_thumb_path).exists()
+           && !unsupported_thumbnail_sentinel(&bare_legacy_thumb_path).exists();
        let needs_row = !existing_exif_paths.contains_key(relative_path);

        if needs_thumbnail || needs_row {

@@ -2049,6 +2351,12 @@ fn process_new_files(

        }
    };
        // Perceptual hashes (pHash + dHash). Best-effort — None for
        // videos and decode failures. Drives near-duplicate detection
        // in the Apollo duplicates surface; failure here is non-fatal
        // and never blocks indexing.
        let perceptual = perceptual_hash::compute(&file_path);

        // EXIF is best-effort enrichment. When extraction fails (or the
        // file type doesn't support EXIF) we still store a row with all
        // EXIF fields NULL; the file remains visible to sort-by-date

@@ -2100,6 +2408,8 @@ fn process_new_files(

            last_modified: timestamp,
            content_hash,
            size_bytes,
+           phash_64: perceptual.map(|h| h.phash_64),
+           dhash_64: perceptual.map(|h| h.dhash_64),
        };

        let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");

@@ -2131,7 +2441,7 @@ fn process_new_files(
    // ensures small/medium deploys self-heal without operator
    // action.
    backfill_missing_content_hashes(&context, &files, library, &exif_dao);
-   let candidates = build_face_candidates(&context, &files, &exif_dao, &face_dao);
+   let candidates = build_face_candidates(&context, library, &files, &exif_dao, &face_dao);
    debug!(
        "face_watch: scan tick — {} image file(s) walked, {} candidate(s) (library '{}', modified_since={})",
        files.iter().filter(|(p, _)| !is_video_file(p)).count(),

@@ -2449,7 +2759,7 @@ fn backfill_missing_content_hashes(

    let exif_records = {
        let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
-       dao.get_exif_batch(context, &image_paths)
+       dao.get_exif_batch(context, Some(library.id), &image_paths)
            .unwrap_or_default()
    };
    // Cheap lookup back from rel_path → absolute file_path so

@@ -2541,6 +2851,7 @@ fn backfill_missing_content_hashes(
/// covers both new uploads and the initial backlog scan.
fn build_face_candidates(
    context: &opentelemetry::Context,
+   library: &libraries::Library,
    files: &[(PathBuf, String)],
    exif_dao: &Arc<Mutex<Box<dyn ExifDao>>>,
    face_dao: &Arc<Mutex<Box<dyn faces::FaceDao>>>,

@@ -2558,7 +2869,7 @@ fn build_face_candidates(

    let exif_records = {
        let mut dao = exif_dao.lock().expect("Unable to lock ExifDao");
-       dao.get_exif_batch(context, &image_paths)
+       dao.get_exif_batch(context, Some(library.id), &image_paths)
            .unwrap_or_default()
    };
    // rel_path → content_hash (only rows with a hash; without one we have
@@ -569,7 +569,8 @@ pub async fn list_memories(

    for lib in &libraries_to_scan {
        let base = Path::new(&lib.root_path);
-       let path_excluder = PathExcluder::new(base, &app_state.excluded_dirs);
+       let effective = lib.effective_excluded_dirs(&app_state.excluded_dirs);
+       let path_excluder = PathExcluder::new(base, &effective);

        let exif_memories = collect_exif_memories(
            &exif_dao,
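The change above swaps the global exclusion list for a per-library effective list. How `effective_excluded_dirs` merges the two is not shown in this hunk; a plausible sketch (the helper below is hypothetical, the real method may differ):

```rust
// Hypothetical sketch of merging a library's own exclusions with the
// global list; the real Library::effective_excluded_dirs may differ.
pub fn effective_excluded_dirs(library_excluded: &[String], global: &[String]) -> Vec<String> {
    let mut merged: Vec<String> = global.to_vec();
    for dir in library_excluded {
        // De-duplicate so a directory listed in both places is only
        // checked once by the PathExcluder.
        if !merged.iter().any(|g| g == dir) {
            merged.push(dir.clone());
        }
    }
    merged
}
```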
159 src/perceptual_hash.rs Normal file
@@ -0,0 +1,159 @@
//! Perceptual image hashing for near-duplicate detection.
//!
//! Two 64-bit signals per image, packed into i64 for storage and fast
//! Hamming distance via XOR + popcount:
//!
//! - **pHash (DCT)** — robust to lossy recompression, format conversion,
//!   moderate brightness/contrast shifts. The primary signal.
//! - **dHash (gradient)** — much cheaper to compute, robust to scaling
//!   and small crops. Acts as a fallback / corroboration when pHash is
//!   ambiguous (very flat images can collide).
//!
//! Image-only by design. Videos, decode failures, and any image we
//! can't open all return `None` — perceptual hash failure is non-fatal
//! and must not block the indexer; the file is still hashed by blake3
//! and exact-match dedup keeps working.

use std::path::Path;

use image_hasher::{HashAlg, HasherConfig};

/// 64-bit perceptual fingerprint pair.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PerceptualIdentity {
    pub phash_64: i64,
    pub dhash_64: i64,
}

/// Compute pHash + dHash for an image at `path`. Returns `None` on
/// decode failure (unsupported format, corrupt bytes, video, etc.) —
/// callers should treat that as "no perceptual signal available" and
/// proceed with exact-match dedup only.
pub fn compute(path: &Path) -> Option<PerceptualIdentity> {
    let img = image::open(path).ok()?;

    // 8x8 = 64 bits, the standard size for pHash/dHash. Larger sizes
    // give more discriminative power but no longer fit in i64 and the
    // marginal robustness isn't worth the storage / index cost for a
    // personal-scale library.
    let phash = HasherConfig::new()
        .hash_alg(HashAlg::Mean)
        .hash_size(8, 8)
        .preproc_dct()
        .to_hasher()
        .hash_image(&img);

    let dhash = HasherConfig::new()
        .hash_alg(HashAlg::Gradient)
        .hash_size(8, 8)
        .to_hasher()
        .hash_image(&img);

    Some(PerceptualIdentity {
        phash_64: bytes_to_i64(phash.as_bytes())?,
        dhash_64: bytes_to_i64(dhash.as_bytes())?,
    })
}

/// Hamming distance between two 64-bit perceptual hashes. The primary
/// query primitive: two images are "near-duplicates" when this is below
/// a threshold (default 8 for pHash, ~12% similarity tolerance). The
/// duplicates module clusters via a BK-tree which uses its own copy of
/// this calculation; this helper is kept for ad-hoc tools and tests.
#[allow(dead_code)]
#[inline]
pub fn hamming_distance(a: i64, b: i64) -> u32 {
    (a ^ b).count_ones()
}

fn bytes_to_i64(bytes: &[u8]) -> Option<i64> {
    if bytes.len() < 8 {
        return None;
    }
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    Some(i64::from_be_bytes(buf))
}
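To make the threshold arithmetic concrete: the default pHash cutoff of 8 differing bits out of 64 is a 12.5% tolerance. A standalone restatement of the XOR + popcount distance (the `is_near_duplicate` helper is illustrative, not part of the module):

```rust
// Standalone restatement of the distance above; is_near_duplicate is a
// hypothetical helper showing the default-threshold check.
pub fn hamming_distance(a: i64, b: i64) -> u32 {
    (a ^ b).count_ones()
}

pub fn is_near_duplicate(a: i64, b: i64, threshold: u32) -> bool {
    // 8 out of 64 bits = 12.5% of the fingerprint allowed to differ.
    hamming_distance(a, b) <= threshold
}
```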
#[cfg(test)]
mod tests {
    use super::*;
    use image::{ImageBuffer, Rgb};

    fn write_test_image(path: &Path, seed: u32) {
        // Deterministic-but-distinct image content: simple gradient with
        // a per-seed offset. Gives pHash/dHash a real signal to work
        // with (a uniform image collapses to all-zero hashes).
        let img: ImageBuffer<Rgb<u8>, Vec<u8>> = ImageBuffer::from_fn(64, 64, |x, y| {
            let r = ((x + seed) & 0xFF) as u8;
            let g = ((y + seed * 2) & 0xFF) as u8;
            let b = ((x ^ y ^ seed) & 0xFF) as u8;
            Rgb([r, g, b])
        });
        img.save(path).unwrap();
    }

    #[test]
    fn identical_bytes_yield_identical_hashes() {
        let dir = tempfile::tempdir().unwrap();
        let a = dir.path().join("a.png");
        let b = dir.path().join("b.png");
        write_test_image(&a, 42);
        write_test_image(&b, 42);
        let ha = compute(&a).expect("hash a");
        let hb = compute(&b).expect("hash b");
        assert_eq!(ha, hb);
        assert_eq!(hamming_distance(ha.phash_64, hb.phash_64), 0);
    }

    #[test]
    fn distinct_images_have_distinct_hashes() {
        let dir = tempfile::tempdir().unwrap();
        let a = dir.path().join("a.png");
        let b = dir.path().join("b.png");
        write_test_image(&a, 42);
        write_test_image(&b, 123);
        let ha = compute(&a).expect("hash a");
        let hb = compute(&b).expect("hash b");
        assert_ne!(ha.phash_64, hb.phash_64);
    }

    #[test]
    fn resized_copy_is_near_duplicate_under_threshold() {
        // The whole point of perceptual hashing: a resized copy of the
        // same source image should land within a small Hamming distance
        // of the original. We check the dHash specifically because it's
        // the more resize-robust of the two; pHash is also tight but
        // gradient-based dHash gives the most reliable signal here.
        let dir = tempfile::tempdir().unwrap();
        let a = dir.path().join("a.png");
        write_test_image(&a, 7);
        let img = image::open(&a).unwrap();
        let small = img.resize_exact(32, 32, image::imageops::FilterType::Lanczos3);
        let b = dir.path().join("b.png");
        small.save(&b).unwrap();

        let ha = compute(&a).expect("hash a");
        let hb = compute(&b).expect("hash b");
        let d_dhash = hamming_distance(ha.dhash_64, hb.dhash_64);
        assert!(
            d_dhash <= 8,
            "expected dhash Hamming distance <= 8 for resized copy, got {}",
            d_dhash
        );
    }

    #[test]
    fn unsupported_path_returns_none() {
        let dir = tempfile::tempdir().unwrap();
        let p = dir.path().join("notanimage.txt");
        std::fs::write(&p, b"hello").unwrap();
        assert!(compute(&p).is_none());
    }

    #[test]
    fn missing_file_returns_none() {
        let p = Path::new("/nonexistent/path/that/does/not/exist.png");
        assert!(compute(p).is_none());
    }
}
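The module doc mentions that the duplicates module clusters via a BK-tree; that module is not part of this diff. A minimal sketch of the idea over 64-bit hashes, using the same XOR + popcount metric (the `BkNode` type and its methods are hypothetical, the real implementation may differ):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Hypothetical BK-tree node; the real duplicates module is not shown
// in this diff and may be structured differently.
pub struct BkNode {
    hash: i64,
    children: HashMap<u32, BkNode>,
}

impl BkNode {
    pub fn new(hash: i64) -> Self {
        Self { hash, children: HashMap::new() }
    }

    pub fn insert(&mut self, hash: i64) {
        let d = (self.hash ^ hash).count_ones();
        if d == 0 {
            return; // exact perceptual duplicate of this node
        }
        match self.children.entry(d) {
            Entry::Occupied(e) => e.into_mut().insert(hash),
            Entry::Vacant(e) => {
                e.insert(BkNode::new(hash));
            }
        }
    }

    /// Collect every stored hash within `radius` of `query`. The
    /// triangle inequality lets us skip any subtree whose edge
    /// distance falls outside [d - radius, d + radius].
    pub fn query(&self, query: i64, radius: u32, out: &mut Vec<i64>) {
        let d = (self.hash ^ query).count_ones();
        if d <= radius {
            out.push(self.hash);
        }
        for (&edge, child) in &self.children {
            if edge + radius >= d && edge <= d + radius {
                child.query(query, radius, out);
            }
        }
    }
}
```

Keying children by exact distance is what makes the structure work for a discrete metric like Hamming distance: a near-duplicate query at radius 8 visits only the handful of subtrees whose edge label is within 8 of the query's distance to the node.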
13 src/state.rs
@@ -10,7 +10,7 @@ use crate::database::{
    connect,
};
use crate::database::{PreviewDao, SqlitePreviewDao};
-use crate::libraries::{self, Library};
+use crate::libraries::{self, Library, LibraryHealthMap};
use crate::tags::{SqliteTagDao, TagDao};
use crate::video::actors::{
    PlaylistGenerator, PreviewClipGenerator, StreamActor, VideoPlaylistManager,

@@ -26,6 +26,11 @@ pub struct AppState {

    /// All configured media libraries. Ordered by `id` ascending; the first
    /// entry is the primary library.
    pub libraries: Vec<Library>,
+   /// Per-library availability snapshot. Updated by the file watcher at
+   /// the top of each tick via `libraries::refresh_health`. HTTP handlers
+   /// read it (e.g. `/libraries` surfacing). See "Library availability
+   /// and safety" in CLAUDE.md.
+   pub library_health: LibraryHealthMap,
    /// Legacy shim equal to `libraries[0].root_path`. Phase 2 transitional —
    /// new code should go through `primary_library()`.
    pub base_path: String,

@@ -105,11 +110,13 @@ impl AppState {

            preview_dao,
        );

+       let library_health = libraries::new_health_map(&libraries_vec);
        Self {
            stream_manager,
            playlist_manager: Arc::new(video_playlist_manager.start()),
            preview_clip_generator: Arc::new(preview_clip_generator.start()),
            libraries: libraries_vec,
+           library_health,
            base_path,
            thumbnail_path,
            video_path,

@@ -348,6 +355,8 @@ impl AppState {

            id: crate::libraries::PRIMARY_LIBRARY_ID,
            name: "main".to_string(),
            root_path: base_path_str.clone(),
+           enabled: true,
+           excluded_dirs: Vec::new(),
        };
        let insight_generator = InsightGenerator::new(
            ollama.clone(),

@@ -384,6 +393,8 @@ impl AppState {

            id: crate::libraries::PRIMARY_LIBRARY_ID,
            name: "main".to_string(),
            root_path: base_path_str.clone(),
+           enabled: true,
+           excluded_dirs: Vec::new(),
        }];
        AppState::new(
            Arc::new(StreamActor {}.start()),
473 src/tags.rs
@@ -33,6 +33,11 @@ where
        .service(web::resource("image/tags/all").route(web::get().to(get_all_tags::<TagD>)))
        .service(web::resource("image/tags/batch").route(web::post().to(update_tags::<TagD>)))
        .service(web::resource("image/tags/lookup").route(web::post().to(lookup_tags_batch::<TagD>)))
+       .service(
+           web::resource("image/tags/{id}")
+               .route(web::put().to(update_tag::<TagD>))
+               .route(web::delete().to(delete_tag::<TagD>)),
+       )
}

async fn add_tag<D: TagDao>(

@@ -53,7 +58,14 @@ async fn add_tag<D: TagDao>(

    tag_dao
        .get_all_tags(&span_context, None)
        .and_then(|tags| {
-           if let Some((_, tag)) = tags.iter().find(|t| t.1.name == tag_name) {
+           // Case-insensitive match. With the unique-NOCASE index on
+           // tags.name now in place, a case-sensitive find here would
+           // miss a casing-only collision and let the subsequent
+           // create_tag INSERT crash on the constraint.
+           if let Some((_, tag)) = tags
+               .iter()
+               .find(|t| t.1.name.eq_ignore_ascii_case(&tag_name))
+           {
                Ok(tag.clone())
            } else {
                info!(

@@ -71,6 +83,74 @@ async fn add_tag<D: TagDao>(

    .into_http_internal_err()
}
async fn update_tag<D: TagDao>(
    _: Claims,
    http_request: HttpRequest,
    path: web::Path<i32>,
    body: web::Json<UpdateTagRequest>,
    tag_dao: web::Data<Mutex<D>>,
) -> impl Responder {
    let tracer = global_tracer();
    let context = extract_context_from_request(&http_request);
    let span = tracer.start_with_context("update_tag", &context);
    let span_context = opentelemetry::Context::current_with_span(span);

    let id = path.into_inner();
    let trimmed = body.name.trim();
    if trimmed.is_empty() {
        return HttpResponse::BadRequest()
            .json(serde_json::json!({ "error": "Tag name must not be empty" }));
    }

    let mut tag_dao = tag_dao.lock().expect("Unable to get TagDao");
    match tag_dao.update_tag_name(&span_context, id, trimmed) {
        Ok(UpdateTagOutcome::Renamed(tag)) => {
            span_context.span().set_status(Status::Ok);
            info!("Renamed tag {} -> '{}'", id, trimmed);
            HttpResponse::Ok().json(tag)
        }
        Ok(UpdateTagOutcome::NotFound) => {
            HttpResponse::NotFound().json(serde_json::json!({ "error": "Tag not found" }))
        }
        Ok(UpdateTagOutcome::Conflict { existing }) => HttpResponse::Conflict().json(
            serde_json::json!({ "error": "Tag name already exists", "existing_tag": existing }),
        ),
        Err(e) => {
            log::error!("update_tag failed: {:?}", e);
            HttpResponse::InternalServerError()
                .json(serde_json::json!({ "error": "Update failed" }))
        }
    }
}

async fn delete_tag<D: TagDao>(
    _: Claims,
    http_request: HttpRequest,
    path: web::Path<i32>,
    tag_dao: web::Data<Mutex<D>>,
) -> impl Responder {
    let tracer = global_tracer();
    let context = extract_context_from_request(&http_request);
    let span = tracer.start_with_context("delete_tag", &context);
    let span_context = opentelemetry::Context::current_with_span(span);

    let id = path.into_inner();
    let mut tag_dao = tag_dao.lock().expect("Unable to get TagDao");
    match tag_dao.delete_tag(&span_context, id) {
        Ok(true) => {
            span_context.span().set_status(Status::Ok);
            info!("Deleted tag {}", id);
            HttpResponse::NoContent().finish()
        }
        Ok(false) => HttpResponse::NotFound().json(serde_json::json!({ "error": "Tag not found" })),
        Err(e) => {
            log::error!("delete_tag failed: {:?}", e);
            HttpResponse::InternalServerError()
                .json(serde_json::json!({ "error": "Delete failed" }))
        }
    }
}
async fn get_tags<D: TagDao>(
    _: Claims,
    http_request: HttpRequest,

@@ -284,9 +364,15 @@ async fn lookup_tags_batch<D: TagDao>(

    // Stage 1: query → content_hash mapping. Files without a hash yet
    // (just-indexed, hash compute failed, etc.) skip the sibling
    // expansion and only get tags from their own rel_path.
+   // Library-agnostic by design: this endpoint takes raw rel_paths from
+   // the client (typically Apollo) with no library context. Span all
+   // libraries and let the hash-keyed sibling expansion below do the
+   // disambiguation. Same-rel_path/different-content collisions across
+   // libraries surface as multiple hashes for one path — fine, we union
+   // every sibling tag set.
    let exif_records = {
        let mut dao = exif_dao.lock().expect("Unable to get ExifDao");
-       match dao.get_exif_batch(&span_context, &query_paths) {
+       match dao.get_exif_batch(&span_context, None, &query_paths) {
            Ok(rows) => rows,
            Err(e) => {
                return HttpResponse::InternalServerError()

@@ -421,6 +507,11 @@ pub struct InsertTaggedPhoto {
    #[diesel(column_name = rel_path)]
    pub photo_name: String,
    pub created_time: i64,
    /// Hash-keyed identity. The DAO populates this from
    /// `image_exif.content_hash` at insert time when known; the
    /// reconciliation pass backfills rows inserted before the hash
    /// landed. See CLAUDE.md "Multi-library data model".
    pub content_hash: Option<String>,
}

#[derive(Queryable, Clone, Debug)]

@@ -434,6 +525,8 @@ pub struct TaggedPhoto {

    pub tag_id: i32,
    #[allow(dead_code)] // Part of API contract
    pub created_time: i64,
    #[allow(dead_code)]
    pub content_hash: Option<String>,
}

#[derive(Debug, Deserialize)]

@@ -442,6 +535,22 @@ pub struct AddTagsRequest {

    pub tag_ids: Vec<i32>,
}

#[derive(Debug, Deserialize)]
pub struct UpdateTagRequest {
    pub name: String,
}

/// Result of an attempted tag rename. Returning a typed outcome (rather
/// than `anyhow::Result<Tag>`) lets the handler map each case to a
/// distinct HTTP status without sniffing error strings, and keeps the
/// 409 path a normal control-flow result instead of a DB constraint
/// violation surfacing as a generic 500.
pub enum UpdateTagOutcome {
    Renamed(Tag),
    NotFound,
    Conflict { existing: Tag },
}

pub trait TagDao: Send + Sync {
    fn get_all_tags(
        &mut self,

@@ -511,6 +620,26 @@ pub trait TagDao: Send + Sync {
        context: &opentelemetry::Context,
        file_paths: &[String],
    ) -> anyhow::Result<std::collections::HashMap<String, i64>>;
    /// Rename a tag in place. The tag id stays stable so existing
    /// `tagged_photo` rows automatically reflect the new name without
    /// a join-table rewrite. Conflict is resolved against the rest of
    /// the table case-insensitively (mirroring the
    /// `idx_tags_name_nocase` UNIQUE index) — a rename that changes
    /// only the case of the tag's own current name is allowed.
    fn update_tag_name(
        &mut self,
        context: &opentelemetry::Context,
        id: i32,
        new_name: &str,
    ) -> anyhow::Result<UpdateTagOutcome>;
    /// Globally remove a tag and every `tagged_photo` row that
    /// references it. Returns `true` if a tag was deleted, `false` if
    /// no row matched the id. The schema's FK is `ON DELETE CASCADE`,
    /// which SQLite only honors with `PRAGMA foreign_keys = ON`; the
    /// connection sets that pragma, so a single DELETE on `tags`
    /// cascades to `tagged_photo` atomically.
    fn delete_tag(&mut self, context: &opentelemetry::Context, id: i32) -> anyhow::Result<bool>;
}

pub struct SqliteTagDao {

@@ -704,6 +833,83 @@ impl TagDao for SqliteTagDao {
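The rename-conflict rule in the trait doc can be stated compactly: the collision check is case-insensitive against every *other* tag, so a case-only rename of the tag itself passes. A hypothetical standalone helper illustrating just that predicate (not part of the real DAO):

```rust
// Hypothetical restatement of the rename-conflict rule; the real DAO
// checks against DB rows, not an in-memory slice.
pub fn rename_conflicts(id: i32, new_name: &str, existing: &[(i32, String)]) -> bool {
    existing
        .iter()
        // Skip the tag's own row so "Vacation" -> "VACATION" is legal.
        .any(|(other_id, name)| *other_id != id && name.eq_ignore_ascii_case(new_name))
}
```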
        })
    }

    fn update_tag_name(
        &mut self,
        context: &opentelemetry::Context,
        id: i32,
        new_name: &str,
    ) -> anyhow::Result<UpdateTagOutcome> {
        let mut conn = self
            .connection
            .lock()
            .expect("Unable to lock SqliteTagDao connection");
        trace_db_call(context, "update", "update_tag_name", |span| {
            span.set_attributes(vec![
                KeyValue::new("tag_id", id as i64),
                KeyValue::new("new_name", new_name.to_string()),
            ]);

            let target = tags::table
                .filter(tags::id.eq(id))
                .select((tags::id, tags::name, tags::created_time))
                .get_result::<Tag>(conn.deref_mut())
                .optional()
                .with_context(|| format!("Unable to look up tag id {}", id))?;
            let target = match target {
                Some(t) => t,
                None => return Ok(UpdateTagOutcome::NotFound),
            };

            // Case-insensitive collision check on every other row.
            // Belt-and-suspenders: idx_tags_name_nocase enforces this at
            // the index level, but checking up front gives the handler
            // a clean 409 with the existing tag's id instead of a
            // generic constraint-violation 500. Tags table is small;
            // loading peers and comparing in Rust avoids a fragile
            // dsl::sql composition for case-insensitive equality.
            let conflict = tags::table
                .filter(tags::id.ne(id))
                .select((tags::id, tags::name, tags::created_time))
                .get_results::<Tag>(conn.deref_mut())
                .with_context(|| "Unable to query for tag-name conflict")?
                .into_iter()
                .find(|t| t.name.eq_ignore_ascii_case(new_name));
            if let Some(existing) = conflict {
                return Ok(UpdateTagOutcome::Conflict { existing });
            }

            diesel::update(tags::table.filter(tags::id.eq(id)))
                .set(tags::name.eq(new_name))
                .execute(conn.deref_mut())
                .with_context(|| format!("Unable to rename tag {}", id))?;

            Ok(UpdateTagOutcome::Renamed(Tag {
                id: target.id,
                name: new_name.to_string(),
                created_time: target.created_time,
            }))
        })
    }

    fn delete_tag(&mut self, context: &opentelemetry::Context, id: i32) -> anyhow::Result<bool> {
        let mut conn = self
            .connection
            .lock()
            .expect("Unable to lock SqliteTagDao connection");
        trace_db_call(context, "delete", "delete_tag", |span| {
            span.set_attribute(KeyValue::new("tag_id", id as i64));

            // tagged_photo.tag_id is `ON DELETE CASCADE` and the
            // connection now sets `PRAGMA foreign_keys = ON`, so a
            // single DELETE on tags removes its tagged_photo rows
            // atomically.
            let removed = diesel::delete(tags::table.filter(tags::id.eq(id)))
                .execute(conn.deref_mut())
                .with_context(|| format!("Unable to delete tag {}", id))?;
            Ok(removed > 0)
        })
    }
    fn remove_tag(
        &mut self,
        context: &opentelemetry::Context,

@@ -759,11 +965,31 @@ impl TagDao for SqliteTagDao {

                KeyValue::new("tag_id", tag_id.to_string()),
            ]);

            // Eagerly populate content_hash so this tag follows the bytes,
            // not the path (see CLAUDE.md "Multi-library data model").
            // None is fine — the reconciliation pass will backfill once
            // image_exif has a hash for this file. We deliberately don't
            // require library_id here: the tag handler is library-
            // agnostic by design, and any matching image_exif row's hash
            // is acceptable. If the path resolves to different bytes in
            // different libraries, reconciliation per-library refines.
            let content_hash: Option<String> = {
                use crate::database::schema::image_exif as ie;
                ie::table
                    .filter(ie::rel_path.eq(path))
                    .filter(ie::content_hash.is_not_null())
                    .select(ie::content_hash)
                    .first::<Option<String>>(conn.deref_mut())
                    .ok()
                    .flatten()
            };

            diesel::insert_into(tagged_photo::table)
                .values(InsertTaggedPhoto {
                    tag_id,
                    photo_name: path.to_string(),
                    created_time: Utc::now().timestamp(),
                    content_hash,
                })
                .execute(conn.deref_mut())
                .with_context(|| format!("Unable to tag file {:?} in sqlite", path))

@@ -1168,6 +1394,7 @@ mod tests {
            tag_id: tag.id,
            created_time: Utc::now().timestamp(),
            photo_name: path.to_string(),
            content_hash: None,
        };

        if self.tagged_photos.borrow().contains_key(path) {

@@ -1238,6 +1465,54 @@ mod tests {

        }
        Ok(counts)
    }
    fn update_tag_name(
        &mut self,
        _context: &opentelemetry::Context,
        id: i32,
        new_name: &str,
    ) -> anyhow::Result<UpdateTagOutcome> {
        // Conflict pass first so the target tag's own old name
        // doesn't collide with itself.
        let conflict = self
            .tags
            .borrow()
            .iter()
            .find(|t| t.id != id && t.name.eq_ignore_ascii_case(new_name))
            .cloned();
        if let Some(existing) = conflict {
            return Ok(UpdateTagOutcome::Conflict { existing });
        }
        let mut tags = self.tags.borrow_mut();
        match tags.iter_mut().find(|t| t.id == id) {
            Some(t) => {
                t.name = new_name.to_string();
                Ok(UpdateTagOutcome::Renamed(t.clone()))
            }
            None => Ok(UpdateTagOutcome::NotFound),
        }
    }

    fn delete_tag(
        &mut self,
        _context: &opentelemetry::Context,
        id: i32,
    ) -> anyhow::Result<bool> {
        let target_name = {
            let tags = self.tags.borrow();
            tags.iter().find(|t| t.id == id).map(|t| t.name.clone())
        };
        let Some(name) = target_name else {
            return Ok(false);
        };
        // Mirror the cascade: drop any tagged_photo references, then
        // remove the tag itself.
        for (_path, tags) in self.tagged_photos.borrow_mut().iter_mut() {
            tags.retain(|t| t.id != id && t.name != name);
        }
        self.tags.borrow_mut().retain(|t| t.id != id);
        Ok(true)
    }
}

#[actix_rt::test]
@@ -1253,20 +1528,29 @@ mod tests {

        // Seed: two paths tagged, one path untagged.
        dao.tagged_photos.borrow_mut().insert(
            "a.jpg".into(),
-           vec![Tag { id: 1, name: "alpha".into(), created_time: 0 }],
+           vec![Tag {
+               id: 1,
+               name: "alpha".into(),
+               created_time: 0,
+           }],
        );
        dao.tagged_photos.borrow_mut().insert(
            "b.jpg".into(),
            vec![
-               Tag { id: 2, name: "beta".into(), created_time: 0 },
-               Tag { id: 3, name: "gamma".into(), created_time: 0 },
+               Tag {
+                   id: 2,
+                   name: "beta".into(),
+                   created_time: 0,
+               },
+               Tag {
+                   id: 3,
+                   name: "gamma".into(),
+                   created_time: 0,
+               },
            ],
        );
        let grouped = dao
-           .get_tags_grouped_by_paths(
-               &ctx,
-               &["a.jpg".into(), "b.jpg".into(), "c.jpg".into()],
-           )
+           .get_tags_grouped_by_paths(&ctx, &["a.jpg".into(), "b.jpg".into(), "c.jpg".into()])
            .unwrap();
        assert_eq!(grouped.get("a.jpg").map(|v| v.len()), Some(1));
        assert_eq!(grouped.get("b.jpg").map(|v| v.len()), Some(2));

@@ -1381,6 +1665,177 @@ mod tests {
            None
        );
    }

    async fn rename_tag(
        dao: &Data<Mutex<TestTagDao>>,
        id: i32,
        new_name: &str,
    ) -> actix_web::http::StatusCode {
        use actix_web::Responder;
        let req = TestRequest::default().to_http_request();
        let body = web::Json(UpdateTagRequest {
            name: new_name.to_string(),
        });
        let claims = Claims::valid_user(String::from("1"));
        let resp = update_tag(claims, req.clone(), web::Path::from(id), body, dao.clone()).await;
        resp.respond_to(&req).status()
    }

    #[actix_rt::test]
    async fn update_tag_renames_successfully() {
        let mut dao = TestTagDao::new();
        let tag = dao
            .create_tag(&opentelemetry::Context::current(), "old")
            .unwrap();
        let dao = Data::new(Mutex::new(dao));

        assert_eq!(
            rename_tag(&dao, tag.id, "new").await,
            actix_web::http::StatusCode::OK
        );

        let mut locked = dao.lock().unwrap();
        let all = locked
            .get_all_tags(&opentelemetry::Context::current(), None)
            .unwrap();
        assert_eq!(all.len(), 1);
        assert_eq!(all[0].1.name, "new");
    }

    #[actix_rt::test]
    async fn update_tag_not_found_returns_404() {
        let dao = Data::new(Mutex::new(TestTagDao::new()));
        assert_eq!(
            rename_tag(&dao, 99999, "nope").await,
            actix_web::http::StatusCode::NOT_FOUND
        );
    }

    #[actix_rt::test]
    async fn update_tag_empty_name_returns_400() {
        let mut dao = TestTagDao::new();
        let tag = dao
            .create_tag(&opentelemetry::Context::current(), "keep")
            .unwrap();
        let dao = Data::new(Mutex::new(dao));

        assert_eq!(
            rename_tag(&dao, tag.id, " ").await,
            actix_web::http::StatusCode::BAD_REQUEST
        );

        let mut locked = dao.lock().unwrap();
        let all = locked
            .get_all_tags(&opentelemetry::Context::current(), None)
            .unwrap();
        assert_eq!(all[0].1.name, "keep", "name must not change on 400");
    }

    #[actix_rt::test]
    async fn update_tag_conflict_returns_409() {
        let mut dao = TestTagDao::new();
        let _a = dao
            .create_tag(&opentelemetry::Context::current(), "a")
            .unwrap();
        let b = dao
            .create_tag(&opentelemetry::Context::current(), "b")
            .unwrap();
        let dao = Data::new(Mutex::new(dao));

        // Case-insensitive collision: renaming b -> "A" must conflict with a.
        assert_eq!(
            rename_tag(&dao, b.id, "A").await,
            actix_web::http::StatusCode::CONFLICT
        );

        let mut locked = dao.lock().unwrap();
        let all = locked
            .get_all_tags(&opentelemetry::Context::current(), None)
            .unwrap();
        let b_after = all.iter().find(|(_, t)| t.id == b.id).unwrap();
        assert_eq!(b_after.1.name, "b", "no DB change on 409");
    }

    async fn delete_via_handler(
        dao: &Data<Mutex<TestTagDao>>,
        id: i32,
    ) -> actix_web::http::StatusCode {
        use actix_web::Responder;
        let req = TestRequest::default().to_http_request();
        let claims = Claims::valid_user(String::from("1"));
        let resp = delete_tag(claims, req.clone(), web::Path::from(id), dao.clone()).await;
        resp.respond_to(&req).status()
    }

    #[actix_rt::test]
    async fn delete_tag_removes_tag_and_cascades_tagged_photos() {
        let mut dao = TestTagDao::new();
        let tag = dao
            .create_tag(&opentelemetry::Context::current(), "doomed")
            .unwrap();
        dao.tag_file(&opentelemetry::Context::current(), "a.jpg", tag.id)
            .unwrap();
        dao.tag_file(&opentelemetry::Context::current(), "b.jpg", tag.id)
            .unwrap();
        let dao = Data::new(Mutex::new(dao));

        assert_eq!(
            delete_via_handler(&dao, tag.id).await,
            actix_web::http::StatusCode::NO_CONTENT
        );

        let mut locked = dao.lock().unwrap();
        assert!(
            locked
                .get_all_tags(&opentelemetry::Context::current(), None)
                .unwrap()
                .is_empty()
        );
        assert!(
            locked
                .get_tags_for_path(&opentelemetry::Context::current(), "a.jpg")
                .unwrap()
                .is_empty(),
            "tagged_photo references must be cleaned up by the cascade"
        );
        assert!(
            locked
                .get_tags_for_path(&opentelemetry::Context::current(), "b.jpg")
                .unwrap()
                .is_empty()
        );
    }

    #[actix_rt::test]
    async fn delete_tag_unknown_id_returns_404() {
        let dao = Data::new(Mutex::new(TestTagDao::new()));
        assert_eq!(
            delete_via_handler(&dao, 99999).await,
            actix_web::http::StatusCode::NOT_FOUND
        );
    }

    #[actix_rt::test]
    async fn update_tag_case_only_change_succeeds() {
        let mut dao = TestTagDao::new();
        let tag = dao
            .create_tag(&opentelemetry::Context::current(), "vacation")
.unwrap();
|
||||
let dao = Data::new(Mutex::new(dao));
|
||||
|
||||
// The conflict check excludes the target's own row, so changing
|
||||
// only the case of the tag's current name must succeed.
|
||||
assert_eq!(
|
||||
rename_tag(&dao, tag.id, "Vacation").await,
|
||||
actix_web::http::StatusCode::OK
|
||||
);
|
||||
|
||||
let mut locked = dao.lock().unwrap();
|
||||
let all = locked
|
||||
.get_all_tags(&opentelemetry::Context::current(), None)
|
||||
.unwrap();
|
||||
assert_eq!(all[0].1.name, "Vacation");
|
||||
}
|
||||
}
|
||||
#[derive(QueryableByName, Debug, Clone)]
|
||||
pub struct FileWithTagCount {