Wires the persistence layer for CLIP semantic search. The watcher's
per-tick drain encodes any image_exif row with a known content_hash
but no clip_embedding via Apollo (cap CLIP_BACKLOG_MAX_PER_TICK,
default 32). On a query, /photos/search encodes the text via Apollo
and reranks every stored embedding in-memory.
ExifDao additions:
- list_clip_unencoded_candidates — partial-index scan for drain
- backfill_clip_embedding — touches only the two new columns
- list_clip_index — dedup'd (hash, embedding) pull for search
clip_watch::run_clip_encoding_pass is the parallel fan-out — tokio
runtime per pass with CLIP_ENCODE_CONCURRENCY (default 4). No marker
rows for permanent failures yet; per-tick cap bounds the retry cost.
/photos/search params: q, limit, threshold (default 0.20), library,
model_version. Response is intentionally minimal (path + score) so
the frontend joins against existing photo-metadata routes lazily.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure mechanical cleanup of accumulated drift in files outside the
HLS-content-hash branch's main change set. No behavior change.
- `cargo fmt` on every previously-misformatted file
(`ai/insight_generator.rs`, `database/knowledge_dao.rs`,
`faces.rs`, `knowledge.rs`, `libraries.rs`).
- `cargo clippy --fix`:
- `needless_borrow`: `&library` → `library` in `handlers/image.rs`
(two sites in the photo-listing path).
- Manual clippy pass for warnings clippy emits but can't auto-apply:
- `field_reassign_with_default` in `database/reconcile.rs::run` —
consolidated into a struct-literal initializer.
- `needless_range_loop` in `database/knowledge_dao.rs::union_perceptual_tags`
— inner `for b in (a+1)..indices.len() { let ib = indices[b]; ... }`
becomes `for &ib in &indices[a + 1..] { ... }`.
- Doc-list indentation: continuation lines under nested bullets in
`database/mod.rs::get_memories_in_window` and
`database/knowledge_dao.rs::build_entity_graph` realigned to the
list-item content column.
Deliberately not touched (each deserves its own focused commit, with
testing, rather than getting bundled into a sweep):
- 4× `deprecated count_distinct` in `faces.rs` — diesel API migration
to `AggregateExpressionMethods::aggregate_distinct` may shift result
types; needs verification against the existing stats queries.
- `await_holding_lock` in `knowledge.rs:807` — `std::sync::Mutex` held
across `ollama.generate(...).await`. Genuine concurrency bug; fix
requires understanding the surrounding flow before just dropping
the guard.
- 2× `type_complexity` in `database/mod.rs` — cosmetic, would need a
`type` alias and corresponding callers updated.
- Dead `total_deleted` on `library_maintenance::GcStats` and
`file_scan::enumerate_indexable_files` — both are public surface
retained for future use; deletion is a separate decision.
All 707 tests still pass. Release build clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hash-keyed pipeline transcodes lazily, so a freshly mounted (or
freshly upgraded) library is "mostly pending" for the first hour
while the watcher works through the backlog. The operator wants a
live read on remaining work so they can tune `HLS_CONCURRENCY` and
know when to stop waiting.
Adds:
- `src/hls_stats.rs` — pure compute path (`stats_from_rows`) and an
Arc<Mutex<dyn ExifDao>> wrapper (`compute_and_publish`). Per
library: `total`, `with_playlist`, `pending`, `unsupported`,
`hashless_videos`. Dedup is by content_hash so duplicate-bytes-at-
N-paths counts once (same domain rule as `faces::stats`).
`hashless_videos` is a separate counter so the operator can see
the "hash backfill, then transcode" pipeline depth instead of
having NULL-hash rows just hide.
- Prometheus gauges labeled by library name:
`imageserver_hls_videos_total`, `..._with_playlist`, `..._pending`,
`..._unsupported`. Updated by the watcher at the end of every full-
scan tick *and* on every `/hls/stats` hit, so whichever surface the
operator is watching stays fresh. Registered in `main` alongside
the existing image/video gauges.
- `GET /hls/stats` — Claims-protected JSON snapshot of the same data
plus a top-level cross-library aggregate. Runs on a blocking pool
so it doesn't pin the actix worker; per-call cost is one
`list_paths_and_hashes_for_library` SQL query per library plus a
`stat()` per distinct video hash. Bounded — never invoked from
middleware, only from the explicit endpoint and the full-scan
tick. The watcher's end-of-tick `info!` summary line mirrors the
endpoint output for operators tailing the log.
- New `ExifDao::list_paths_and_hashes_for_library` method:
`SELECT rel_path, content_hash FROM image_exif WHERE library_id =
?`. Single round-trip; callers filter to video extensions
client-side because the schema doesn't carry media-type. Mock
impl in `files.rs` returns an empty vec.
Tests in `hls_stats::tests` exercise stats_from_rows directly (videos-
only filter, hash dedup, playlist vs sentinel decision, NULL-hash
hashless counting) plus a publish_gauges round-trip that reads the
gauge value back. Full suite (347 lib + 360 bin = 707) passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup walk previously looked for `$VIDEO_PATH/<basename>.m3u8`
and matched each file's stem against a recursive walk of every
library. With the hash-keyed layout now in place, every playlist's
file_stem is the literal string "playlist" — the old logic would
treat every hash-keyed playlist as orphaned on its next run and wipe
them all in one tick (default cleanup interval is 24h, so this is a
24-hour bomb on top of the prior commit).
New approach: orphan-ness is decided in the database, not on the
filesystem. The cleanup loop:
- Snapshots every distinct non-NULL `image_exif.content_hash` into a
HashSet (new `ExifDao::list_distinct_content_hashes` method —
`SELECT DISTINCT content_hash WHERE content_hash IS NOT NULL`).
- Walks `$VIDEO_PATH` two levels deep: top-level entries are filtered
to 2-char lowercase hex shard dirs, each shard's children to 64-char
hex hash dirs. Anything else (legacy `.m3u8` at root from the
pre-content-hash era, operator-stashed dirs, partial writes) is left
alone.
- Hash dirs whose hash isn't in the alive set are `remove_dir_all`'d.
Shard dirs that emptied as a result are reaped on the same pass via
`remove_dir` (no-op if non-empty).
- The library-stale safety gate is preserved: a stale library skips
the cycle even though the orphan decision is DB-only, because the
upstream missing-file scan that retires `image_exif` rows itself
pauses for stale libraries. Belt-and-suspenders — keeping a hash
dir for one extra 24h cycle is cheaper than wiping one whose source
was briefly unreachable. The gate now also filters disabled
libraries out of the stale set (they're intentionally absent from
the health map).
- The legacy `excluded_dirs` parameter is preserved on the function
signature but unused (the walk no longer crosses library trees);
flagged with a leading underscore. Callers in `main.rs` stay
unchanged.
`MockExifDao` in `files.rs` grows the new method (returns empty);
unit tests for the new `is_hash_shard` / `is_full_hash` validators
guard against an operator's stashed directory under VIDEO_PATH ever
matching the orphan-rm path. Both pass.
A follow-up commit handles the one-shot startup migration that
retires the legacy basename-keyed `.m3u8` / `.ts` files at
`$VIDEO_PATH` root.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The drain queried `date_taken IS NULL OR date_taken_source = 'fs_time'`
ORDER BY id ASC LIMIT 500 every watcher tick. The resolver is
deterministic on file bytes + filename + fs metadata, so any row that
landed on fs_time once landed there again on every retry — the drain
spun on the same lowest-id rows in perpetuity, never advancing to
rows 501+ while still logging more_remain=true.
Side effect: 500 auto-commit UPDATEs per tick sustained the SQLite
write lock long enough that other writers on separate DAO connections
hit the 5s busy_timeout. Manifested as intermittent 500s on
PATCH /image/faces/{id} that succeeded on retry.
Narrow the partial index and query predicate to `date_taken IS NULL`.
If exiftool installs or a new filename regex lands, an operator can
re-resolve fs_time rows out-of-band rather than re-introducing the
steady-state churn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main.rs drops from 3542 → ~2930 lines by moving:
- src/backfill.rs (new): backfill_unhashed_backlog,
backfill_missing_date_taken, backfill_missing_content_hashes,
build_face_candidates, process_face_backlog. Now unit-tested for
the first time — 5 tests covering cap behavior, library-id
filtering, missing-on-disk skip, and the video/unhashed/scanned
filters on face-candidate selection.
- src/thumbnails.rs (new): unsupported_thumbnail_sentinel,
generate_image_thumbnail, create_thumbnails, update_media_counts,
is_image, is_video, plus the IMAGE_GAUGE / VIDEO_GAUGE Prometheus
metrics. Replaces the no-op stubs that used to live in lib.rs.
4 new unit tests for the sentinel path math and the
walker-counts-images-vs-videos smoke path.
Supporting:
- SqliteExifDao::from_shared (test-only) so an SqliteExifDao and
SqliteFaceDao can share one in-memory connection — required to
test build_face_candidates against the real join.
- files.rs / video/{mod,actors}.rs import from crate::thumbnails::*
instead of the now-removed stubs in lib.rs.
cargo test --bin image-api: 325 passing (was 314).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New GET /knowledge/graph?type=&limit= returns the data the
curation UI's graph tab needs:
- nodes = entities with at least one in-scope fact (rejected /
superseded excluded). Carries fact_count for visual sizing.
Top-N by count desc; default cap 200 (clamped 1..1000).
- edges = relational facts (object_entity_id set) grouped by
(subject, object, predicate) so 3 "is_friend_of" facts
between the same pair collapse into one edge with count=3.
Two raw SQL queries: an INNER JOIN onto a persona-scoped fact-
count subquery for nodes (skips 0-fact entities entirely so the
sim doesn't waste time on disconnected islands), then a follow-
up GROUP BY over the persona-scoped fact set restricted to the
node id set via IN clauses (ids are i32 so inlining is safe).
Pairs with the Apollo-side GraphPanel that runs d3-force over
the returned payload and renders SVG with click-to-open.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Finds near-duplicate entities the upsert-time cosine guard didn't
catch — typically legacy data from before that guard landed, or
pairs whose embeddings sit between 0.85 (default proposal floor)
and 0.92 (auto-collapse threshold). Pure read-side feature; the
actual merging still goes through the existing
/knowledge/entities/merge action.
New DAO method `find_consolidation_proposals(threshold,
max_groups)`:
- Loads every non-rejected entity with an embedding.
- Partitions by entity_type so a person can't cluster with a
place.
- Pairwise cosine, edges above threshold feed a union-find for
transitive grouping (Sara → Sarah → Sarah J. all land in one
cluster).
- Tracks min/max cosine per component so the UI can show "how
tight" each cluster is before clicking in.
- Returns groups of >= 2 sorted by size desc then max cosine
desc; trimmed to `max_groups`.
New endpoint `GET /knowledge/consolidation-proposals?threshold=
&limit=` accepts the threshold (clamped 0.5–0.99 to prevent the
"every entity in one mega-cluster" case) and returns groups with
per-entity persona fact-count breakdowns baked in — saves the UI
a separate query per cluster member.
ConsolidationGroup is exported through database/mod.rs so the
handler can use it without depending on knowledge_dao internals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related additions to /knowledge/entities:
- New EntitySort enum (UpdatedDesc default, NameAsc, FactCountDesc)
surfaced via `?sort=updated|name|count`. NameAsc clusters near-
duplicate names so dupes stand out at a glance; FactCountDesc
surfaces heavily-used entities and demotes 0-fact noise to the
bottom.
- New `list_entities_with_fact_counts` DAO method that returns each
entity alongside a persona-scoped count of its non-rejected facts
(subject side). Persona scope follows X-Persona-Id via the
existing resolve_persona_filter chain — Single filters on
(user_id, persona_id), All unions across the user's personas.
Implemented as one raw SQL query with a LEFT JOIN to a fact-count
subquery and ORDER BY tied to the chosen sort, so count-sort needs
no second round trip.
The agent's existing list_entities call site is unchanged — it
doesn't need persona-scoped counts and the trait method stays cheap.
EntitySummary grows an Option<i64> fact_count (skip_serializing_if
none) so PATCH responses stay shaped as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The date-override path used to look up `image_exif` strictly by
`(library_id, rel_path)` with only the forward-slash form, while
`/image/metadata`'s `get_exif` falls back across libraries and tries
both slash forms. A photo whose row sat under a different library_id
than its filesystem-resolved one — or whose rel_path was stored with
backslashes — rendered fine in the modal but 404'd on save.
`set_manual_date_taken` / `clear_manual_date_taken` now share a
`locate_image_exif_row` helper that mirrors `get_exif`'s union
semantics (scoped lookup first, library-agnostic fallback by rel_path
in both slash forms), then update by primary key so the write hits
exactly the row read. Inner anyhow errors are logged with
`(library_id, rel_path)` so the next failure mode is debuggable.
Handler-side: `resolve_library_param` errors no longer silently fall
back to the primary library (which would have masked the original bug
with a different "row not found"); a malformed library param now
returns 400. New `DbErrorKind::NotFound` lets the handler distinguish
genuine misses (404) from real DB failures (500).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move personas off the mobile client into ImageApi as first-class
records, and scope entity_facts by persona so each one builds its own
voice over a shared entity graph. The new include_all_memories flag
lets a persona opt back into the full hive-mind pool for human
browsing of /knowledge/*; agentic generation always stays in-voice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DAO swallowed every diesel::update failure as a flat
`anyhow!("Update error")`, then trace_db_call further reduced it to
`DbError { kind: UpdateError }`. Operators saw "update failed for lib
2 Snapchat/foo.mp4: DbError { kind: UpdateError }" with no clue why
(constraint violation? type mismatch? row vanished mid-flight? DB
locked?).
Two changes:
- Preserve the diesel error in the anyhow chain along with the input
params (lib, rel_path, date_taken, source) so the cause is visible.
- Log the chain at warn-level inside the DAO before the trace wrapper
collapses it to DbErrorKind::UpdateError, so the warning at the
call site finally has something diagnosable next to it.
- Treat zero-row updates as a debug-level "row likely retired by the
missing-file scan" rather than a hard failure — that case is benign
and shouldn't poison the drain's error tally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `POST /image/exif/date` and `POST /image/exif/date/clear` so an
operator can correct a row whose canonical-date waterfall landed on the
wrong value (camera clock reset, fs_time fallback for a copied-from-
backup file, etc). New `original_date_taken` / `original_date_taken_source`
columns snapshot the prior value on first override so revert is lossless.
The waterfall source set is now `'exif' | 'exiftool' | 'filename' | 'fs_time' | 'manual'`.
The existing `idx_image_exif_date_backfill` partial index already filters
to `date_taken IS NULL OR date_taken_source = 'fs_time'`, so manual rows
are naturally excluded from the per-tick drain — no index change needed.
`ExifMetadata` now exposes `date_taken_source` + originals so a UI can
render "manually set; was X via filename".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the EXIF-loop + WalkDir-fallback pipeline that powered
`/memories` with a single per-library SQL query
(`get_memories_in_window`) that uses `strftime('%m-%d' | '%W' | '%m',
date_taken, 'unixepoch', tz_offset)` for calendar matching in the
client's timezone, plus a `years_back` lower bound and a
no-future-dates upper bound. Returns only the matching rows; the
handler applies per-library `PathExcluder` post-query and sorts.
Drops:
- `collect_exif_memories` — replaced by the single SQL query.
- `collect_filesystem_memories` — the canonical-date pipeline now
populates `date_taken` for every row at ingest, so the WalkDir
fallback that scanned 14k+ files each request is no longer needed.
- `get_memory_date_with_priority` and friends — request-time waterfall
superseded by `date_resolver` running at ingest. The associated
three priority-tests are dropped; their replacement lives in
`date_resolver::tests`.
On a ~14k-file library this drops `/memories` from 10–15 s
(dominated by `fs::metadata` per row) to single-digit ms.
Bumps `DEFAULT_YEARS_BACK` from 15 → 20 to surface deeper archives
on matching anniversaries.
Note vs. ISO weeks: the original Rust used `chrono::iso_week().week()`
for week-span matching. SQLite's `%W` is Monday-anchored but uses week
0 for days before the first Monday, so it can disagree with ISO at
year boundaries by ±1. Acceptable for nostalgia browsing.
Adds 3 new DAO tests covering month-span filter, library scoping, and
the unknown-span-token guard. Also adds a CLAUDE.md section describing
the canonical-date pipeline end-to-end and the new
`DATE_BACKFILL_MAX_PER_TICK` env var.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two ExifDao methods (`get_rows_needing_date_backfill` /
`backfill_date_taken`) and a `backfill_missing_date_taken` watcher pass
that runs on every tick alongside `backfill_unhashed_backlog`.
The drain queries the partial index for rows where `date_taken IS NULL`
or `date_taken_source = 'fs_time'`, batches up to
`DATE_BACKFILL_MAX_PER_TICK` paths (default 500), and feeds them through
`date_resolver::resolve_dates_batch` — a single exiftool subprocess
covers the whole tick. Rows that newly resolve to `exiftool` /
`filename` / `fs_time` get persisted via `backfill_date_taken` (touches
only `date_taken` + `date_taken_source` so EXIF / hash / perceptual
columns survive).
`filename`-sourced rows are intentionally not re-resolved — the regex
is authoritative when it matches and re-running exiftool wouldn't
change the answer. Files that have disappeared from disk are skipped
so a ghost row doesn't loop through the drain forever; the
missing-file scan in `library_maintenance` retires those separately.
Comes with two DAO unit tests (eligibility filter + column-isolation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New nullable TEXT column tracks which step of the canonical-date
waterfall (kamadak-exif → exiftool → filename → fs_time) populated
`date_taken`. Lets a later per-tick drain re-resolve weak sources
(`fs_time`) once stronger ones become available, and gives the UI/debug
surface a way to answer "why does this photo show up under this date?".
Adds the column at all `InsertImageExif` construction sites with `None`
placeholders (the resolver wiring lands in a follow-up commit), and
extends the `update_exif` SET tuple so the column survives the GPS-write
re-read path. Partial index `idx_image_exif_date_backfill` is created
for the upcoming drain query.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bucket exact-dup rows by (library_id, dirname) pair on each side, then
filter by coverage = shared / min(folder_a_total, folder_b_total) and
an absolute floor on shared count. Surfaces "this folder is mostly
contained in that folder" matches that the per-file EXACT view buries
under one row each — e.g. an old phone-backup tree shadowing the
organized library, or a topic-grouped folder duplicating a date-grouped
one within the same library.
New endpoint: GET /duplicates/folder-pairs?library=&include_resolved=
&min_coverage=&min_shared=. Cached 5 min keyed on (library, include_resolved);
the user-tunable thresholds filter the cached unfiltered pair list so
slider drags don't re-bucket. Shares the resolve / unresolve flow with
the existing tabs — the frontend fans out N parallel /resolve calls,
one per shared content_hash.
Folder names carry no signal (BMW lives under Night Photos, not BMW_backup),
so bucketing is purely on (library_id, dirname) co-occurrence in
exact-dup groups. Within-folder dups (same hash twice in the same
folder) are skipped — those belong to the EXACT tab.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soft-marked rows used to disappear from /photos globally, including
from a library-scoped view that didn't contain the survivor at all.
A user browsing lib A who'd promoted a file from lib B as the
survivor would silently lose visibility on their own copy in lib A,
even though lib B's file isn't reachable from lib A's view.
Library-scoped queries now keep a demoted row visible when its
survivor lives in a library outside the current scope. Implemented
as a NOT EXISTS subquery against the same image_exif table aliased
as `survivor`. The unscoped (all-libraries) view is unchanged — every
survivor is reachable, so demoted rows stay hidden as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds pHash + dHash columns alongside the existing blake3 content_hash so
near-duplicates (re-encoded, resized, format-converted copies) become
queryable. /duplicates/{exact,perceptual} return groups; /duplicates/
{resolve,unresolve} flip a duplicate_of_hash soft-mark on losing rows
and union perceptual-only tag sets onto the survivor. The default
/photos listing filters duplicate_of_hash IS NULL so demoted siblings
stop cluttering the grid; include_duplicates=true opts back in for
Apollo's review modal. Upload now hashes bytes pre-write and returns
409 with the canonical sibling when a file's bytes already exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a nullable comma-separated TEXT column to the libraries table.
Effective excludes for a walk = (env-var globals) ∪
(library.excluded_dirs). Empty / NULL = no library-specific
extras; the global env var still applies.
Migration (2026-05-01-110000_libraries_excluded_dirs)
ALTER TABLE libraries ADD COLUMN excluded_dirs TEXT. NULL on every
existing row — no behavior change on upgrade.
Library struct + helpers (libraries.rs)
- Library gains excluded_dirs: Vec<String>, parsed from the column
by parse_excluded_dirs_column (drops empties / whitespace,
matches the env-var parser).
- Library::effective_excluded_dirs(globals) returns the union.
- From<LibraryRow> hydrates the field on AppState construction so
/libraries surfaces it.
Watcher / walkers / memories
Every per-library walker now consults the effective set:
- process_new_files (file-watch ingest, RAW/EXIF/face)
- process_face_backlog (filter_excluded inherits)
- create_thumbnails (startup + new-file branch)
- update_media_counts (Prometheus gauge)
- cleanup_orphaned_playlists (per-library source-existence check)
- memories endpoint (PathExcluder)
Effective set is computed once per per-library iteration in the
watcher tick and threaded through; called functions retain their
flat &[String] signature (no per-library awareness needed inside
the walker primitives).
Use case: mount a parent directory while a sibling library covers
a child subtree, and exclude the child subtree from the parent so
the libraries don't double-walk / double-write image_exif. With
hash-keyed derived data (Branches B/C), the duplication-avoidance
is the only cost prevented — face / tag / insight sharing was
already correct via content_hash.
Tests: 228 pass (226 from previous + 2 new in libraries::tests:
parse_excluded_dirs_column edge cases,
effective_excluded_dirs_unions_global_and_per_library).
CLAUDE.md gains a "Per-library excludes" subsection of the
multi-library data model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A small follow-up to Branches A/B/C. Adds a nullable-default-1
boolean column to the `libraries` table that controls whether the
watcher considers the library at all. Useful for staging a new
mount before committing to ingest, and as a maintenance kill
switch when a library needs to be quiet without being unmounted.
Migration (2026-05-01-100000_libraries_enabled_flag)
ALTER TABLE libraries ADD COLUMN enabled BOOLEAN NOT NULL DEFAULT 1.
Existing rows stay enabled — no behavior change on upgrade.
Watcher gate (main.rs)
At the top of the per-library loop, if !lib.enabled { continue; }
— runs BEFORE the availability probe. Disabled libraries don't
enter the health map, don't get probed, don't get ingest, don't
get any maintenance pass. The initial sweep before the loop's
first sleep also skips disabled libraries.
Orphan-GC consensus (library_maintenance.rs)
all_libraries_online filters disabled libraries out of the
consensus check — they're treated as out-of-scope, not as
blockers. Otherwise flipping enabled=false would permanently
halt orphan GC for the rest of the system, which is the opposite
of the intended kill-switch semantics.
Cross-library duplicates: safe by construction. Hash-keyed derived
data (face_detections, tagged_photo with hash, photo_insights with
hash) is anchored by ANY image_exif row carrying the hash. Disabling
a library does NOT delete its image_exif rows, so a hash referenced
by a disabled library's row stays anchored — derived data survives.
collect_orphan_hashes deliberately doesn't filter image_exif by
library.enabled for exactly this reason.
No HTTP endpoint. Library mutation is rare-enough infra work that a
SQL toggle is fine, and a public mutation endpoint without a role /
permission story would be poorly-prioritized exposure for a
single-user tool. Documented in CLAUDE.md.
Tests: 226 pass (225 from Branch C + 1 new
all_libraries_online_treats_disabled_as_out_of_scope, which proves
that even an explicit Stale entry on a disabled library doesn't
block the consensus).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch C of the multi-library data-model rollout. Implements the
operational maintenance pipeline pinned in CLAUDE.md → "Multi-library
data model" / "Library availability and safety". Branches A and B
land first; this branch builds on top.
New module: src/library_maintenance.rs
Three idempotent passes the watcher runs every tick after the
per-library ingest loop:
1. Missing-file scan (per online library)
For each Online library, load a paginated page of image_exif rows
(IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE, default 500), stat() each one,
and delete rows whose source file is NotFound. Permission/IO
errors are skipped, never deleted. Capped at
IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK (default 200) per library
per tick — so a pathological mount that returns NotFound for
everything can't wipe the table in one cycle. Cursor advances
across ticks, wraps on partial-page returns, and naturally cycles
through the entire library over many minutes. Skipped wholesale
for Stale libraries via the existing probe gate.
2. Back-ref refresh (DB-only)
For face_detections / tagged_photo / photo_insights: any
hash-keyed row whose (library_id, rel_path) no longer matches an
image_exif row, but whose content_hash does, is repointed at a
surviving image_exif location. Pure SQL with EXISTS guards so
rows whose hash is fully orphaned are left alone (the orphan GC
handles those). Idempotent; no availability gate needed.
This is what makes a recent → archive move invisible to readers:
when pass 1 retires the lib-A row, pass 2 pivots tags / faces /
insights to lib-B's surviving path before any client notices.
3. Orphan GC (destructive)
Hash-keyed derived rows whose content_hash has no image_exif
referent are GC-eligible. Two-tick consensus: a hash must be
observed orphaned on two consecutive ticks AND every library must
be Online for both. A single Stale tick within the window cancels
all pending deletes (they remain marked but won't be promoted) —
they're re-evaluated next tick. The pending set lives in
OrphanGcState (in-memory); a watcher restart resets it, which can
only delay a delete, never cause one. Hashes that re-appear in
image_exif between ticks are "revived" from the pending set
(handles transient share unmount / remount).
Two new ExifDao methods:
- list_rel_paths_for_library_page(library_id, limit, offset) for
the paginated missing-file scan.
- (count_for_library landed in Branch A.)
Watcher wiring (main.rs)
Per-library: missing-file scan inside the existing per-library
loop, after process_new_files, gated by the same probe check that
already protects ingest. After the loop: reconcile (Branch B),
back-ref refresh, then run_orphan_gc. The maintenance connection is
opened once per tick (image_api::database::connect), used by all
three DB-only passes, and dropped at end of tick.
CLAUDE.md gains a "Maintenance pipeline" subsection that describes
the three passes and their interaction with the existing
availability-and-safety policy.
Tests: 225 pass (217 from Branch B + 8 new in library_maintenance
covering back-ref refresh including the fully-orphaned no-op case,
two-tick GC consensus, Stale-tick consensus reset, image_exif
re-appearance revival, multi-table delete, and the
all_libraries_online helper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch B of the multi-library data-model rollout. tagged_photo and
photo_insights now follow the bytes (content_hash), not the path,
matching the policy pinned in CLAUDE.md "Multi-library data model".
Branch A's availability probe and EXIF scoping land first; this
branch builds on top.
Migration (2026-05-01-000000_hash_keyed_derived_data)
Adds nullable content_hash columns to tagged_photo and photo_insights,
with partial indexes on the non-null subset to keep the index small
during the transitional window. The migration backfills from
image_exif:
* tagged_photo joins on rel_path alone (no library_id available);
* photo_insights joins on (library_id, rel_path), unambiguous.
Rows whose image_exif hash isn't known yet stay null and the runtime
reconciliation pass populates them as the hash backlog drains.
Insert-time population
TagDao::tag_file looks up image_exif.content_hash by rel_path before
inserting; the hash is written into the new column.
InsightDao::store_insight does the same scoped to (library_id,
rel_path). Caller-supplied hash on InsertPhotoInsight wins; otherwise
the DAO does the lookup. Both paths fall back to None if the hash
isn't known yet — reconciliation backfills.
Reconciliation (database/reconcile.rs)
Three idempotent passes the watcher runs once per tick after the
per-library backfill loop:
1. tagged_photo NULL hashes → populate from image_exif by rel_path.
2. photo_insights NULL hashes → populate by (library_id, rel_path).
3. photo_insights scalar merge — when multiple is_current rows
share a content_hash, keep the earliest generated_at as
current; demote the rest. Demoted rows keep their data so
/insights/history is unaffected; only the "current" pointer
narrows to one per hash.
No filesystem dependency, so reconcile doesn't need the availability
gate; runs every tick. Logs once when something changed, debug
otherwise.
Tags are set-valued under the policy (union on read, already
DISTINCT in queries), so there is no analogous tag-collapse pass —
duplicate (tag_id, content_hash) rows across libraries are
harmless.
Read paths are unchanged in this branch — lookup_tags_batch's
existing rel_path-via-hash-sibling expansion still produces the
correct merge. A follow-up can simplify reads to use the new column
directly for performance.
Tests: 217 pass (212 pre-existing + 5 new in reconcile covering
NULL-fill, hash-not-yet-known no-op, library scoping on insights,
earliest-wins collapse, idempotency).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch A of the multi-library data-model rollout. Three threads of
correctness/safety work that ship together because the new mount
needs all three before it can land:
1. Library availability probe (libraries.rs, state.rs, main.rs)
New LibraryHealth (Online | Stale { reason, since }) and a shared
LibraryHealthMap on AppState. Probe checks root_path exists +
is_dir + readable + non-empty (relative to a "had_data" signal so
fresh mounts aren't downgraded). The watcher tick begins with a
refresh_health() per library; stale libraries skip ingest, the
hash backfill, and face-detection backlog drains for that tick.
The orphaned-playlist cleanup also gates on every library being
online — a missing source on a stale library is indistinguishable
from a transient unmount, and the cleanup is destructive.
/libraries now returns each library with its current health
state. Logs only on Online↔Stale transitions so a long outage
doesn't spam.
New ExifDao::count_for_library is the "had_data" signal.
2. EXIF queries scoped by library_id (database/mod.rs, files.rs,
main.rs, tags.rs)
query_by_exif gains an Option<i32> library filter; /photos and
/photos/exif now pass it. Without this, an EXIF-filtered request
scoped to ?library=N returned cross-library results because the
handler resolved the library but didn't push it through to SQL.
get_exif_batch gains the same option. The watcher's per-library
ingest, face-candidate build, and content-hash backfill all
scope to their library; the union-mode /photos date-sort path
and the library-agnostic tag fan-out (lookup_tags_batch, by
design) keep using None.
3. Derivative-path collision fixes (content_hash.rs, main.rs)
New content_hash::library_scoped_legacy_path helper:
<derivative_dir>/<library_id>/<rel_path>. Thumbnail generation
(startup walk + watcher needs-thumb check) and serving now use
it; serving falls back to the bare-legacy mirrored path so
pre-multi-library deployments keep working without
regeneration. Without this, lib2 with the same rel_path as lib1
would have its thumbnail request short-circuit to lib1's image.
Orphaned-playlist cleanup walks every library when checking for
the source video (was: BASE_PATH only). Without this, mounting
a 2nd library and waiting 24h would delete every playlist whose
source lived only in the 2nd library.
The HLS playlist write path collision (filename-only basename,
not rel_path) is left as a known issue with a TODO at the call
site — the actor-pipeline rewrite belongs in Branch B/C.
Tests: 212 pass (cargo test --lib). New tests cover the probe
states (online / missing root / non-dir / empty-with-prior-data),
refresh_health transitions, query_by_exif scoping, get_exif_batch
keying on (library_id, rel_path), library_scoped_legacy_path, and
count_for_library.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PUT /image/tags/{id} renames a tag globally; DELETE /image/tags/{id}
removes a tag and every photo's reference. Rename returns 200/404/409
(case-insensitive name conflict) / 400 (empty name); delete returns
204/404. New migration adds a UNIQUE COLLATE NOCASE index on
tags.name with a pre-flight pass that collapses existing case-
insensitive duplicates onto the lowest id.
The connection setup now sets PRAGMA foreign_keys = ON. The schema
already declares ON DELETE CASCADE / SET NULL on several tables —
those clauses were documentation-only because SQLite has FK
enforcement off per-connection by default. Audited every
diesel::delete site; each touches either no inbound FKs or has a
matching policy. delete_tag relies on the tagged_photo cascade
instead of doing manual cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DB connection helper now sets `journal_mode=WAL`, `busy_timeout=5000`,
and `synchronous=NORMAL` on every connection. 13+ DAOs each open their
own connection through this helper and share one SQLite file — without
WAL, a writer's exclusive lock blocks readers and `load_persons` racing
the face-watch write storm errored instantly with "database is locked".
GPU face inference made this visible by speeding detect ~10× and
flooding the writer side. WAL persists in the file once set so the
debug binaries that bypass connect() inherit it automatically.
Also widen face_client.rs's classifier: 408 / 413 / 429 are now Transient
instead of Permanent. These are operator-fixable proxy/infra errors;
marking them Permanent poisons every affected photo with status='failed'
and requires manual SQL to recover. Specifically, Apollo's nginx
defaulted to a 1 MB body cap and silently rejected normal-size photos
before they reached the backend — the deferred-and-retry contract is
the right behavior for that class of fault.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first cut matched by rel_path only — fine for single-library
deploys but wrong for multi-library setups where the same content
lives under different rel_paths (e.g. a backup mount holding copies
of the primary library). A tag applied under library A would silently
not appear in the library-B grid badge even though the carousel's
per-path /image/tags would resolve it correctly via siblings.
The batch handler now does the expansion server-side in three queries
regardless of input size:
1. image_exif batch lookup → query path → content_hash
2. image_exif JOIN by content_hash → all sibling rel_paths sharing
each hash (paths are deduped across libraries)
3. tagged_photo + tags JOIN over the union of (query + sibling)
rel_paths
Tags are then aggregated back to query paths via a sibling→originals
reverse map, deduped by tag id. Files without a content_hash (just
indexed, hash compute pending, etc.) skip step 2 and only get tags
from their own rel_path — same fallback the per-path handler uses.
Adds ExifDao::get_rel_paths_for_hashes (batch counterpart of
get_rel_paths_by_hash) chunked at 500 to stay under SQLite's
SQLITE_LIMIT_VARIABLE_NUMBER. Five queries for a 4k-photo grid is
still ~800x cheaper than per-path HTTP fan-out.
Recursive listings now query image_exif instead of walking disk, taking
union-mode /photos from ~17s to sub-second on a 10k-file library. The
watcher's full scan prunes stale image_exif rows so the DB stays in
parity with the filesystem when files are deleted externally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 9 unit tests around the library plumbing:
- resolve_library_param branches (absent, empty/whitespace, numeric id,
name, unknown id, unknown name)
- Library::resolve symmetry with strip_root
- ExifDao::get_all_with_date_taken in union and scoped modes
Introduces SqliteExifDao::from_connection test constructor mirroring the
existing preview_dao pattern so DAO tests can drive an in-memory SQLite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When `library` is omitted, both endpoints now walk every configured
library root, interleave the results, and tag each row with its source
library via the parallel `photo_libraries` / per-row `library_id`
arrays. Previously the handlers fell back to the primary library,
silently hiding the rest.
Threads a parallel `file_libraries: Vec<i32>` through the sort/paginate
helpers so library attribution survives sorting and pagination.
Directory names are de-duplicated across libraries.
`get_all_with_date_taken` grows an optional library filter so memories
can scope its EXIF query per-library during the union walk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Silence forward-looking dead_code on unused DAO modules, annotate
individual placeholder items, rewrite tautological assert!(true/false)
in token tests as panic! arms, and pick up fmt drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tags and insights now follow content across libraries via content_hash
lookups on the read path, so the same file indexed at different rel_paths
in multiple libraries shares its annotations. Recursive tag search scopes
hits to the selected library by checking each tagged rel_path against
the library's disk (with a content-hash sibling fallback so tags attached
under one library's rel_path still match a content-equivalent file in
another). The /image and /image/metadata handlers fall back across
libraries when the file isn't under the resolved one, so union-mode
search results (which carry no library attribution in the response)
still serve correctly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds blake3 content hashing as the basis for derivative dedup
(thumbnails, HLS) across libraries. Computed inline by the watcher on
ingest and by a new `backfill_hashes` binary for historical rows.
Key changes:
- `content_hash` and `size_bytes` are now populated on new image_exif
rows; a new ExifDao surface (`get_rows_missing_hash`,
`backfill_content_hash`, `find_by_content_hash`) supports backfill and
future hash-keyed lookups.
- The watcher now registers every image/video in image_exif, not just
files with parseable EXIF. EXIF becomes optional enrichment; videos
and other non-EXIF files still get a hashed row. This also makes
DB-indexed sort/filter cover the full library.
- `/image` thumbnail serve dual-looks up hash-keyed path first, then
falls back to the legacy mirrored layout.
- Upload flow accepts `?library=` query param + hashes uploaded files.
- Store_exif logs the underlying Diesel error on insert failure so
constraint violations surface instead of hiding behind a generic
InsertError.
- New migration normalizes rel_path separators to forward slash across
all tables, deduplicating any rows that collide after normalization.
Fixes spurious UNIQUE violations from mixed backslash/forward-slash
paths on Windows ingest.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a `libraries` registry table and threads library_id through
per-instance metadata tables (image_exif, photo_insights,
entity_photo_links, video_preview_clips). File-path columns renamed to
rel_path to make the relative-to-root semantics explicit. Adds
content_hash + size_bytes on image_exif to support future hash-keyed
thumbnail/HLS dedup. Tags and favorites stay library-agnostic so they
share across libraries by rel_path.
Behavior is unchanged: a single primary library (id=1) is seeded from
BASE_PATH on first boot; all handlers and DAOs route through it as a
transitional shim until the API gains a library query param.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Date sorting previously used a DB-level query that acted as an inner join,
silently dropping files with no image_exif row. Replace it with the existing
in-memory sort which already falls back to filename-extracted and filesystem
dates, so all files appear in sorted results.
Also removes the now-unused get_files_sorted_by_date trait method and its
SqliteExifDao implementation and test mock.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements persistent cross-photo knowledge memory so the agentic
insight loop can learn and recall facts about people, places, and
events across the photo collection.
Changes:
- photo_insights: drop UNIQUE(file_path) + INSERT OR REPLACE, replace
with append-only rows + is_current flag for insight history retention
- New tables: entities, entity_facts, entity_photo_links with FK
constraints and confidence scoring
- KnowledgeDao trait + SqliteKnowledgeDao with upsert, merge, and
corroboration (confidence +0.1 on duplicate fact detection)
- Four new agent tools: recall_entities, recall_facts_for_photo,
store_entity, store_fact (with object_entity_id FK support)
- Cameron entity auto-seeded with stable ID injected into system prompt
- Pre-run photo link clearing + post-loop source_insight_id backfill
- Audit REST API: GET/PATCH/DELETE /knowledge/entities/{id},
POST /knowledge/entities/merge, GET/PATCH/DELETE /knowledge/facts/{id},
GET /knowledge/recent
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Backend (Rust/Actix-web):
- Add video_preview_clips table and PreviewDao for tracking preview generation
- Add ffmpeg preview clip generator: 10 equally-spaced 1s segments at 480p with CUDA NVENC auto-detection
- Add PreviewClipGenerator actor with semaphore-limited concurrent processing
- Add GET /video/preview and POST /video/preview/status endpoints
- Extend file watcher to detect and queue previews for new videos
- Use relative paths consistently for DB storage (matching EXIF convention)
Frontend (React Native/Expo):
- Add VideoWall grid view with 2-3 column layout of looping preview clips
- Add VideoWallItem component with ActiveVideoPlayer sub-component for lifecycle management
- Add useVideoWall hook for batch status polling with 5s refresh
- Add navigation button in grid header (visible when videos exist)
- Use TextureView surface type to fix Android z-ordering issues
- Optimize memory: players only mount while visible via FlatList windowSize
- Configure ExoPlayer buffer options and caching for short clips
- Tap to toggle audio focus, long press to open in full viewer
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement database-level sorting with composite indexes for efficient date and tag queries. Add pagination metadata support and optimize tag count queries using batch processing.
- Updated InsightGenerator struct with calendar, location, and search DAOs
- Implemented hybrid context gathering methods:
* gather_calendar_context(): ±7 days with semantic ranking
* gather_location_context(): ±30 min with GPS proximity check
* gather_search_context(): ±30 days semantic search
- Added haversine_distance() utility for GPS calculations
- Updated generate_insight_for_photo_with_model() to use multi-source context
- Combined all context sources (SMS + Calendar + Location + Search) with equal weight
- Initialized new DAOs in AppState (both default and test implementations)
- All contexts are optional (graceful degradation if data missing)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements Phase 1 & 2 of Google Takeout RAG integration:
- Database migrations for calendar_events, location_history, search_history
- DAO implementations with hybrid time + semantic search
- Parsers for .ics, JSON, and HTML Google Takeout formats
- Import utilities with batch insert optimization
Features:
- CalendarEventDao: Hybrid time-range + semantic search for events
- LocationHistoryDao: GPS proximity with Haversine distance calculation
- SearchHistoryDao: Semantic-first search (queries are embedding-rich)
- Batch inserts for performance (1M+ records in minutes vs hours)
- OpenTelemetry tracing for all database operations
Import utilities:
- import_calendar: Parse .ics with optional embedding generation
- import_location_history: High-volume GPS data with batch inserts
- import_search_history: Always generates embeddings for semantic search
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>