feature/multi-library-data-model #67
127
CLAUDE.md
@@ -104,6 +104,131 @@ All database access goes through trait-based DAOs (e.g., `ExifDao`, `SqliteExifD
- `query_by_exif()`: Complex filtering by camera, GPS bounds, date ranges
- Batch operations minimize DB hits during file watching

### Multi-library data model

ImageApi supports more than one library (a library = a `(name, root_path)`
row in the `libraries` table that maps to a mounted directory tree). The
same bytes may exist under more than one library; the typical case is an
"active" library plus an "archive" library that ingests files as they age
out. The data model is designed so that both derived data and user-managed
data follow the **bytes**, not the path.

**The principle.** A photo's identity is its `content_hash` (blake3, see
`src/content_hash.rs`). Anything we compute from or attach to a photo is
keyed on that hash so it survives:
- the same file appearing in a second library (backup / archive / mirror),
- the file moving between libraries (recent → archive handoff),
- the file moving within a library (re-organized `rel_path`),
- intra-library duplicates (same bytes at two paths).

**Table classification.** Three categories drive the keying decision:

| Category | Key | Rationale | Tables |
|---|---|---|---|
| Intrinsic to bytes | `content_hash` | Rerunning is wasted work (or LLM cost) | `face_detections` ✓, `image_exif` (target), `photo_insights` (target), `video_preview_clips` (target) |
| User intent about a photo | `content_hash` | "Tag this photo" means the bytes, not a path | `tagged_photo` (target), `favorites` (target) |
| Library administrative | `(library_id, rel_path)` | Tied to a specific filesystem location | `libraries`, `entity_photo_links`, the `rel_path` back-ref columns on hash-keyed tables |

✓ = already implemented this way. *(target)* = today still keyed on
`(library_id, rel_path)` and slated for migration. The migration adds a
nullable `content_hash` column and populates it from `image_exif` where
known; read paths fall back to `(library_id, rel_path)` while the hash is
null.

**Carrying a `rel_path` even when hash-keyed.** Hash-keyed tables retain
`(library_id, rel_path)` columns as a denormalized **back-reference**, not
as the key. This lets a single query answer "what is at this path right
now" without joining through `image_exif`, and supports the path-only
endpoints that predate the hash. `face_detections` is the reference
implementation: hash is the truth, path is a hint.

**Merge semantics on read.** When the same hash has rows under more than
one library:
- Set-valued data (tags, favorites, faces, entity links) → **union**.
- Scalar data (current insight, EXIF row, video preview clip) → earliest
  `generated_at` / `created_time` wins. The historical lib1 row beats a
  re-generated lib2 row, so the user's curated insight isn't shadowed by
  a re-run on archive ingest.
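
The two merge rules above can be sketched in a few lines of Rust. This is an illustration only: `InsightRow`, `merge_tags`, and `merge_insight` are hypothetical names, and representing `generated_at` as Unix seconds is an assumption, not something the schema confirms.

```rust
use std::collections::BTreeSet;

/// Hypothetical row shape: one scalar insight per (library, hash).
#[derive(Debug, Clone, PartialEq)]
pub struct InsightRow {
    pub library_id: i64,
    pub content_hash: String,
    pub generated_at: i64, // assumed to be Unix seconds
    pub text: String,
}

/// Set-valued data: union of the tags recorded under every library
/// that holds the same hash. Duplicates collapse automatically.
pub fn merge_tags(per_library: &[Vec<String>]) -> BTreeSet<String> {
    per_library.iter().flatten().cloned().collect()
}

/// Scalar data: the row with the earliest `generated_at` wins.
/// On a tie, `min_by_key` returns the first such row in the slice.
pub fn merge_insight(rows: &[InsightRow]) -> Option<&InsightRow> {
    rows.iter().min_by_key(|r| r.generated_at)
}
```

With one row generated under lib1 at time 100 and a re-run under lib2 at time 200, `merge_insight` returns the lib1 row, which is the behavior the scalar rule asks for.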

**Write attribution.** A new tag/favorite/insight created while viewing
under lib2 binds to the bytes, not to lib2, so it shows up under lib1
too. This is by design, but it's the most surprising rule on first
encounter; clients should not assume tags are library-scoped.

**Hash-less rows (transitional state).** During and immediately after a
new mount, `image_exif.content_hash` is being populated by
`backfill_unhashed_backlog` (capped per tick). Rules during this window:
- Writes: if the hash is known, write hash-keyed. If not, write
  `(library_id, rel_path)`-keyed and let the reconciliation job collapse
  duplicates once the hash lands.
- Reads: prefer hash key, fall back to `(library_id, rel_path)`.
- Reconciliation: a single pass after each backfill tick collapses
  rows that now share a hash, applying the merge semantics above.
  The pass is idempotent, so it is safe to re-run.
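
As a rough illustration of the three rules above, the sketch below models a set-valued table (tags) with two in-memory maps. `TagStore` and its methods are hypothetical stand-ins, not the project's DAO API; the real tables live in SQLite.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a tag table in the transitional state.
#[derive(Default)]
pub struct TagStore {
    by_hash: HashMap<String, Vec<String>>,
    by_path: HashMap<(i64, String), Vec<String>>,
}

impl TagStore {
    /// Write: hash-keyed when the hash is known, path-keyed otherwise.
    pub fn add_tag(&mut self, hash: Option<&str>, library_id: i64, rel_path: &str, tag: &str) {
        match hash {
            Some(h) => self.by_hash.entry(h.to_string()).or_default().push(tag.to_string()),
            None => self
                .by_path
                .entry((library_id, rel_path.to_string()))
                .or_default()
                .push(tag.to_string()),
        }
    }

    /// Read: prefer the hash key, fall back to (library_id, rel_path).
    pub fn tags(&self, hash: Option<&str>, library_id: i64, rel_path: &str) -> Vec<String> {
        if let Some(rows) = hash.and_then(|h| self.by_hash.get(h)) {
            return rows.clone();
        }
        self.by_path
            .get(&(library_id, rel_path.to_string()))
            .cloned()
            .unwrap_or_default()
    }

    /// Reconcile: once a hash lands for a path, fold the path-keyed rows
    /// into the hash-keyed set (union). A second call finds nothing to fold.
    pub fn reconcile(&mut self, library_id: i64, rel_path: &str, hash: &str) {
        if let Some(rows) = self.by_path.remove(&(library_id, rel_path.to_string())) {
            let target = self.by_hash.entry(hash.to_string()).or_default();
            for tag in rows {
                if !target.contains(&tag) {
                    target.push(tag);
                }
            }
        }
    }
}
```

Because `reconcile` removes the path-keyed rows as it folds them in, running it again for the same path is a no-op, which is what makes the pass safe to repeat.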

**Library handoff (recent → archive).** When a file moves between
libraries (e.g. operator moves `~/photos/2024/IMG.nef` to the archive
mount), the file watcher sees the disappearance under lib1 and the
appearance under lib2. Hash-keyed rows don't need migration; the
`(library_id, rel_path)` back-ref columns are updated to point to the new
location. Library administrative rows (`entity_photo_links`,
`(library_id, rel_path)` rows in `image_exif` for hash-less items) are
re-keyed by the move detector, which matches a disappearance to an
appearance by `content_hash` within a configurable window.

**Orphans (source deleted while a copy survives).** When an
`image_exif` row for a hash is deleted (file removed from disk), the
hash-keyed derived rows survive **as long as another `image_exif` row
references the same hash**. If the last reference is gone, derived rows
are eligible for GC. GC is deferred: the job runs on a slow schedule so
that a brief unmount or rename doesn't wipe history.
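
The eligibility check can be written as a single predicate. The sketch below combines the rule above with the orphan-GC row of the availability table later in this section (every library that has ever held the hash is online and confirmed clean for two consecutive ticks). `HolderState` and `gc_eligible` are hypothetical names, and how the clean-tick counter is maintained is left out.

```rust
/// Hypothetical probe history for one library that has held the hash.
pub struct HolderState {
    pub online: bool,
    pub clean_ticks: u32, // consecutive ticks confirmed clean
}

/// Derived rows for a hash may be collected only when no `image_exif`
/// row references it any more AND every library that ever held it has
/// been online and confirmed clean for at least two consecutive ticks.
pub fn gc_eligible(live_exif_refs: usize, holders: &[HolderState]) -> bool {
    live_exif_refs == 0 && holders.iter().all(|h| h.online && h.clean_ticks >= 2)
}
```

A single offline holder is enough to block collection, so an unmounted archive keeps its history.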

**Stats and counts.** When reporting "how many photos do you have," count
`DISTINCT content_hash` over `image_exif`, not row count. Faces stats
already does this (`FaceDao::stats` in `src/faces.rs`); other counters
should follow suit. Numerator and denominator must live in the same
domain; see the face-stats commentary below for the cautionary tale.
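
To make the "same domain" point concrete, here is a small sketch that counts distinct hashes rather than rows and reports coverage with both numbers in the hash domain. `ExifRow`, `distinct_photo_count`, and `scan_coverage` are hypothetical names, not the `FaceDao::stats` implementation. Rows without a hash are skipped, which mirrors how SQL `COUNT(DISTINCT col)` ignores NULLs.

```rust
use std::collections::HashSet;

/// Hypothetical slice of an `image_exif` row.
pub struct ExifRow {
    pub library_id: i64,
    pub rel_path: String,
    pub content_hash: Option<String>,
}

/// Number of distinct photos: distinct non-null hashes, not row count.
pub fn distinct_photo_count(rows: &[ExifRow]) -> usize {
    rows.iter()
        .filter_map(|r| r.content_hash.as_deref())
        .collect::<HashSet<&str>>()
        .len()
}

/// Coverage as (scanned, total), both counted in distinct hashes so the
/// numerator can never exceed the denominator.
pub fn scan_coverage(rows: &[ExifRow], scanned: &HashSet<String>) -> (usize, usize) {
    let known: HashSet<&str> = rows
        .iter()
        .filter_map(|r| r.content_hash.as_deref())
        .collect();
    let done = known.iter().filter(|h| scanned.contains(**h)).count();
    (done, known.len())
}
```

For a photo present in two libraries, a row count would report 2 while `distinct_photo_count` reports 1; mixing the two in one ratio is the mismatch the paragraph warns about.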

**Per-library scoping when the user asks for it.** A request scoped to
`?library=N` filters the `image_exif` view to that library, and the
hash-keyed derived data is joined through that view. The user sees only
photos that have a copy under lib N, but the derived data attached to
those photos is the merged hash-keyed view. This is the answer to "show
me archive photos with their original tags."

**Library availability and safety.** Libraries can be on network shares
or removable media; the file watcher must not interpret a temporary
unavailability as a mass-deletion event. Every tick begins with a
**presence probe** per library: the library is considered online if and
only if its `root_path` exists, is readable, and a top-level scan returns
at least one expected entry (or matches a recent file-count high-water
mark within a tolerance). The probe result gates which actions are safe
to run on that library this tick:

| Action | Requires online? |
|---|---|
| Quick / full scan ingest of new files | yes |
| EXIF / face / insight backlog drains | yes, but the work runs against any online library |
| Move-handoff detection (lib1 disappearance ↔ lib2 appearance match) | **both** libraries online |
| `(library_id, rel_path)` re-keying on detected move | **both** libraries online |
| Orphan GC of hash-keyed derived data | all libraries that have *ever* held the hash must be online and confirmed-clean for two consecutive ticks |
| Reads / serving | always allowed; falls back to whichever library is online |
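
A minimal version of the probe and the single-library rows of the table might look like the sketch below. It is a simplification under stated assumptions: `library_online`, `Action`, and `allowed` are hypothetical names, "at least one expected entry" is reduced to "at least one readable entry", and the high-water-mark tolerance check is omitted.

```rust
use std::fs;
use std::path::Path;

/// Simplified presence probe: the root must exist, be readable, and a
/// top-level scan must yield at least one readable entry.
pub fn library_online(root: &Path) -> bool {
    match fs::read_dir(root) {
        Ok(mut entries) => entries.any(|entry| entry.is_ok()),
        Err(_) => false, // missing, unreadable, or not a directory
    }
}

/// Hypothetical subset of the actions in the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Ingest,
    BacklogDrain,
    Serve,
}

/// Gate an action on the probe result for the library it targets.
pub fn allowed(action: Action, online: bool) -> bool {
    match action {
        Action::Serve => true, // reads are always allowed
        Action::Ingest | Action::BacklogDrain => online,
    }
}
```

An empty mount point fails this probe, which is the case that matters for an unmounted share whose directory still exists.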

A library that fails the probe enters a "stale" state: writes scoped to
it are paused, its rows are flagged stale (not deleted) in
`/libraries` status, and the watcher logs at `warn` once per
state-transition (not per tick). A library that recovers re-enters the
online set automatically; no operator action is required for transient
outages. The intent is that pulling a USB drive, rebooting a NAS, or
losing a VPN never triggers a destructive code path; the worst case is
that derived-data work pauses until the share returns.

The same rule constrains the move-handoff matcher: a disappearance
under lib1 only counts as a "move" if there is a matching appearance
under another **online** library within the window. A bare
disappearance with no matching appearance is treated as
"unavailable-or-deleted, defer judgment": it does not re-key any rows
and does not enqueue GC.
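
The matcher's decision can be sketched as a pure function. The names (`FsEvent`, `Verdict`, `classify_disappearance`) are hypothetical, and treating the window as an absolute time difference in seconds is an assumption; the source only says the window is configurable.

```rust
use std::collections::HashSet;

/// Hypothetical watcher event: a hash seen appearing or disappearing.
#[derive(Debug, Clone, PartialEq)]
pub struct FsEvent {
    pub library_id: i64,
    pub content_hash: String,
    pub at: i64, // assumed Unix seconds
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// Matched an appearance elsewhere: safe to re-key rows.
    Move { to_library: i64 },
    /// Unavailable-or-deleted: no re-keying, no GC enqueue.
    Deferred,
}

pub fn classify_disappearance(
    gone: &FsEvent,
    appearances: &[FsEvent],
    online: &HashSet<i64>,
    window_secs: i64,
) -> Verdict {
    // Both libraries must be online; start with the source side.
    if !online.contains(&gone.library_id) {
        return Verdict::Deferred;
    }
    appearances
        .iter()
        .find(|a| {
            a.library_id != gone.library_id
                && online.contains(&a.library_id)
                && a.content_hash == gone.content_hash
                && (a.at - gone.at).abs() <= window_secs
        })
        .map(|a| Verdict::Move { to_library: a.library_id })
        .unwrap_or(Verdict::Deferred)
}
```

Returning `Deferred` rather than a "deleted" verdict keeps the destructive decision out of the matcher entirely.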

### File Processing Pipeline

**Thumbnail Generation:**
@@ -219,7 +344,7 @@ ImageApi owns the face data; Apollo (sibling repo) hosts the insightface inferen
- `persons(id, name UNIQUE COLLATE NOCASE, cover_face_id, entity_id, created_from_tag, notes, ...)`: operator-managed, name is the user-visible identity.
- `face_detections(id, library_id, content_hash, rel_path, bbox_*, embedding BLOB, confidence, source, person_id, status, model_version, ...)`: keyed on `content_hash` so a photo duplicated across libraries is detected once. Marker rows for `status IN ('no_faces','failed')` carry NULL bbox/embedding (CHECK constraint enforces this).

**Why content_hash and not (library_id, rel_path):** ties face data to the bytes, not the path. A backup mount that copies files from the primary library naturally inherits the existing detections without re-running inference. This is the reference implementation of the multi-library data model; see "Multi-library data model" above.

**File-watch hook** (`src/main.rs::process_new_files`): for each photo with a populated `content_hash`, check `FaceDao::already_scanned(hash)`; if not, send bytes (or embedded JPEG preview for RAW via `exif::extract_embedded_jpeg_preview`) to Apollo's `/api/internal/faces/detect`. K=`FACE_DETECT_CONCURRENCY` (default 8) parallel calls per scan tick; Apollo serializes them via its single-worker GPU pool. `face_watch.rs` is the Tokio orchestration layer.