New module that consolidates the four-step ingest waterfall:
kamadak-exif (already run in-process; the caller passes its prior result) →
exiftool fallback → filename regex → earliest_fs_time. Each step is
tagged with a `DateSource` so the caller can persist provenance.
The exiftool fallback is what makes videos and MakerNote-hosted dates
land at all — kamadak-exif can't read QuickTime/MP4 or Nikon-style
sub-IFDs. Single-file mode shells out per call; batch mode pipes paths
on stdin via `-@ -`, funneling the whole set through one subprocess so
the upcoming per-tick drain doesn't pay startup cost per row. The
`exiftool` PATH check is cached in a `OnceLock` so the drain
short-circuits cheaply on deploys without exiftool installed.
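The cached PATH check can be sketched with a `OnceLock` like this (a minimal stdlib sketch; the function name and the `-ver` probe are illustrative, not the module's actual API):

```rust
use std::process::Command;
use std::sync::OnceLock;

/// Probe for an `exiftool` binary on PATH exactly once per process.
/// Every later call returns the cached answer, so the per-tick drain
/// short-circuits cheaply when exiftool isn't installed.
fn exiftool_available() -> bool {
    static AVAILABLE: OnceLock<bool> = OnceLock::new();
    *AVAILABLE.get_or_init(|| {
        Command::new("exiftool")
            .arg("-ver")
            .output()
            .map(|out| out.status.success())
            .unwrap_or(false)
    })
}
```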
`SubSecDateTimeOriginal` and `ContentCreateDate` are pulled alongside
the standard tags to capture iPhone's sub-second precision and Apple's
preferred capture-time tag respectively. `FileModifyDate` is
deliberately *not* in the tag list — it's a filesystem-derived value
the resolver already covers via the `fs_time` step, and pulling it
through exiftool would mask "no real EXIF date" with a misleading
`source = exiftool` row.
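The waterfall's priority order can be sketched as a simple first-hit resolver (names and the timestamp type are illustrative; the real module carries richer per-step data):

```rust
/// Provenance tag persisted alongside the resolved capture time.
#[derive(Debug, PartialEq)]
enum DateSource {
    KamadakExif,
    Exiftool,
    FilenameRegex,
    FsTime,
}

/// First-hit resolution over the four-step waterfall: each optional
/// source is consulted in priority order; fs_time is the guaranteed
/// final fallback, so a date (with provenance) is always returned.
fn resolve_date(
    kamadak: Option<i64>,
    exiftool: Option<i64>,
    filename: Option<i64>,
    fs_time: i64,
) -> (i64, DateSource) {
    if let Some(t) = kamadak {
        return (t, DateSource::KamadakExif);
    }
    if let Some(t) = exiftool {
        return (t, DateSource::Exiftool);
    }
    if let Some(t) = filename {
        return (t, DateSource::FilenameRegex);
    }
    (fs_time, DateSource::FsTime)
}
```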
Module is registered in both `lib.rs` and `main.rs` (sibling-module
pattern the rest of the bin uses); no callers wired in yet — that
lands in the next commit. Comes with 9 unit tests covering JSON
parsing edge cases, source-priority short-circuiting, and the
fs_time-when-no-exif path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Perceptual clustering was producing one giant first group that
contained hundreds of unrelated images. Two causes:
- Solid-colour images (skies, black frames, monochrome scans) all
hash to near-zero pHashes that Hamming-distance-zero to each other.
- Single-link clustering on pHash alone is too permissive — a chain
of weakly-similar images all collapses into one cluster.
Fixed by skipping hashes outside the popcount [8, 56] band (uniform
content) and requiring dHash agreement within threshold before
unioning a candidate edge from the BK-tree. Two new tests pin both
invariants.
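The two guards can be sketched as pure bit-twiddling (a minimal sketch; function names and the exact threshold plumbing are illustrative):

```rust
/// Reject near-uniform content: a 64-bit pHash whose popcount falls
/// outside [8, 56] is dominated by one DCT sign and Hamming-matches
/// everything else in that band (skies, black frames, monochrome scans).
fn is_clusterable(phash: u64) -> bool {
    (8..=56).contains(&phash.count_ones())
}

/// Accept a BK-tree candidate edge only when both hashes carry enough
/// entropy AND dHash independently agrees within the same threshold.
fn accept_edge(p_a: u64, p_b: u64, d_a: u64, d_b: u64, max_dist: u32) -> bool {
    is_clusterable(p_a)
        && is_clusterable(p_b)
        && (p_a ^ p_b).count_ones() <= max_dist
        && (d_a ^ d_b).count_ones() <= max_dist
}
```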
Separate fix in the backfill bin: decode-failed rows kept phash_64=NULL
and were re-pulled by every batch, looping forever on a queue of
undecodable formats. Persist a 0/0 sentinel on decode failure so
the row leaves the candidate set; the all-zero hash is excluded
from clustering by the same entropy filter so it doesn't pollute
results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds pHash + dHash columns alongside the existing blake3 content_hash so
near-duplicates (re-encoded, resized, format-converted copies) become
queryable. /duplicates/{exact,perceptual} return groups; /duplicates/
{resolve,unresolve} flip a duplicate_of_hash soft-mark on losing rows
and union perceptual-only tag sets onto the survivor. The default
/photos listing filters duplicate_of_hash IS NULL so demoted siblings
stop cluttering the grid; include_duplicates=true opts back in for
Apollo's review modal. Upload now hashes bytes pre-write and returns
409 with the canonical sibling when a file's bytes already exist.
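The upload gate amounts to hash-before-write plus an index probe. A minimal sketch, with an in-memory map standing in for the DB and `DefaultHasher` standing in for blake3 so the example runs on stdlib alone:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Pre-write duplicate gate (sketch). Returning Err(canonical_path)
/// models the 409 response carrying the existing sibling; the real
/// service digests file bytes with blake3, not DefaultHasher.
fn upload(
    index: &mut HashMap<u64, String>, // content_hash -> canonical rel_path
    bytes: &[u8],
    rel_path: &str,
) -> Result<u64, String> {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    let digest = h.finish();
    if let Some(existing) = index.get(&digest) {
        // Bytes already exist: reject before any write happens.
        return Err(existing.clone());
    }
    index.insert(digest, rel_path.to_string());
    Ok(digest)
}
```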
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch C of the multi-library data-model rollout. Implements the
operational maintenance pipeline pinned in CLAUDE.md → "Multi-library
data model" / "Library availability and safety". Branches A and B
land first; this branch builds on top.
New module: src/library_maintenance.rs
Three idempotent passes the watcher runs every tick after the
per-library ingest loop:
1. Missing-file scan (per online library)
For each Online library, load a paginated page of image_exif rows
(IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE, default 500), stat() each one,
and delete rows whose source file is NotFound. Permission/IO
errors are skipped, never deleted. Capped at
IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK (default 200) per library
per tick — so a pathological mount that returns NotFound for
everything can't wipe the table in one cycle. Cursor advances
across ticks, wraps on partial-page returns, and naturally cycles
through the entire library over many minutes. Skipped wholesale
for Stale libraries via the existing probe gate.
2. Back-ref refresh (DB-only)
For face_detections / tagged_photo / photo_insights: any
hash-keyed row whose (library_id, rel_path) no longer matches an
image_exif row, but whose content_hash does, is repointed at a
surviving image_exif location. Pure SQL with EXISTS guards so
rows whose hash is fully orphaned are left alone (the orphan GC
handles those). Idempotent; no availability gate needed.
This is what makes a recent → archive move invisible to readers:
when pass 1 retires the lib-A row, pass 2 pivots tags / faces /
insights to lib-B's surviving path before any client notices.
3. Orphan GC (destructive)
Hash-keyed derived rows whose content_hash has no image_exif
referent are GC-eligible. Two-tick consensus: a hash must be
observed orphaned on two consecutive ticks AND every library must
be Online for both. A single Stale tick within the window cancels
all pending deletes (they remain marked but won't be promoted) —
they're re-evaluated next tick. The pending set lives in
OrphanGcState (in-memory); a watcher restart resets it, which can
only delay a delete, never cause one. Hashes that re-appear in
image_exif between ticks are "revived" from the pending set
(handles transient share unmount / remount).
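The consensus rule above can be sketched as a small state machine (an illustrative shape only; the real `OrphanGcState` tracks more per-hash detail):

```rust
use std::collections::HashSet;

/// Two-tick consensus for the orphan GC. A hash is promoted to delete
/// only when it was observed orphaned on two consecutive ticks and
/// every library was Online for both.
#[derive(Default)]
struct OrphanGcState {
    pending: HashSet<String>, // hashes observed orphaned last tick
    last_tick_online: bool,   // was every library Online last tick?
}

impl OrphanGcState {
    /// Feed one tick's observed orphans; returns hashes safe to delete.
    fn tick(&mut self, orphans: &HashSet<String>, all_online: bool) -> Vec<String> {
        if !all_online {
            // Stale tick: nothing is promoted; pending survives, but the
            // next promotion needs a fresh all-Online pair of ticks.
            self.last_tick_online = false;
            return Vec::new();
        }
        // Revival: hashes that re-appeared in image_exif leave the set.
        self.pending.retain(|h| orphans.contains(h));
        let promote = if self.last_tick_online {
            self.pending.iter().cloned().collect()
        } else {
            Vec::new()
        };
        self.pending = orphans.clone();
        self.last_tick_online = true;
        promote
    }
}
```

A restart zeroes the struct, which can only delay a promotion, never cause one.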
Two new ExifDao methods:
- list_rel_paths_for_library_page(library_id, limit, offset) for
the paginated missing-file scan.
- (count_for_library landed in Branch A.)
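Pass 1's page consumer can be sketched as follows (a stdlib sketch; the function name and signature are illustrative — only confirmed NotFound rows are ever eligible, and the per-tick cap bounds the damage a pathological mount can do):

```rust
use std::io::ErrorKind;
use std::path::Path;

/// One tick's worth of the missing-file scan over a page of rel_paths.
/// Returns rel_paths safe to delete: confirmed NotFound only, capped at
/// `delete_cap`. Permission and other IO errors are skipped, never deleted.
fn scan_page_for_missing<'a>(
    root: &Path,
    page: &'a [String],
    delete_cap: usize,
) -> Vec<&'a String> {
    let mut to_delete = Vec::new();
    for rel in page {
        if to_delete.len() >= delete_cap {
            break; // cap reached: the cursor resumes here next tick
        }
        match std::fs::metadata(root.join(rel)) {
            Ok(_) => {} // file present: row stays
            Err(e) if e.kind() == ErrorKind::NotFound => to_delete.push(rel),
            Err(_) => {} // EPERM / EIO etc.: ambiguous, never delete
        }
    }
    to_delete
}
```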
Watcher wiring (main.rs)
Per-library: missing-file scan inside the existing per-library
loop, after process_new_files, gated by the same probe check that
already protects ingest. After the loop: reconcile (Branch B),
back-ref refresh, then run_orphan_gc. The maintenance connection is
opened once per tick (image_api::database::connect), used by all
three DB-only passes, and dropped at end of tick.
CLAUDE.md gains a "Maintenance pipeline" subsection that describes
the three passes and their interaction with the existing
availability-and-safety policy.
Tests: 225 pass (217 from Branch B + 8 new in library_maintenance
covering back-ref refresh including the fully-orphaned no-op case,
two-tick GC consensus, Stale-tick consensus reset, image_exif
re-appearance revival, multi-table delete, and the
all_libraries_online helper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit follow-up to 5bf4956. The same `@eaDir` pruning that protects
the indexer also needs to protect the other walks under library roots:
- `create_thumbnails` walks every file in every library to generate
thumbnails. Without EXCLUDED_DIRS, it would generate thumbnails of
Synology's `SYNOFILE_THUMB_*.jpg` thumbnails (thumbnails of thumbnails).
- `update_media_counts` walks for the prometheus IMAGE / VIDEO gauges.
Without EXCLUDED_DIRS, the gauges over-count by however many phantom
`@eaDir` images live alongside the real photos.
- `cleanup_orphaned_playlists` walks BASE_PATH searching for source
videos by filename. EXCLUDED_DIRS isn't a behavior change for typical
Synology mounts (no .mp4 in @eaDir), but it's a correctness win for
any operator-defined exclude that happens to contain video.
Refactor: add `walk_library_files(base, excluded_dirs) -> Vec<DirEntry>`
to file_scan.rs as the shared primitive. `enumerate_indexable_files`
now layers media-type + mtime filters on top of it. One new test
covers the lower-level helper (returns all extensions, prunes excluded
subtrees).
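The pruning behavior of the shared walker can be sketched with plain stdlib recursion (the real helper wraps WalkDir's `filter_entry`; this sketch just shows the subtree-prune semantics):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Stdlib sketch of `walk_library_files`: recursive walk that prunes
/// excluded directory names wholesale (e.g. "@eaDir"), so excluded
/// subtrees are never descended into rather than filtered leaf-by-leaf.
fn walk_library_files(base: &Path, excluded_dirs: &[&str]) -> Vec<PathBuf> {
    let mut out = Vec::new();
    let Ok(entries) = fs::read_dir(base) else { return out };
    for entry in entries.flatten() {
        let path = entry.path();
        let name = entry.file_name();
        if path.is_dir() {
            // Name match prunes the whole subtree in one decision.
            if excluded_dirs.iter().any(|d| name.to_string_lossy() == *d) {
                continue;
            }
            out.extend(walk_library_files(&path, excluded_dirs));
        } else {
            out.push(path);
        }
    }
    out
}
```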
`generate_video_gifs` (currently `#[allow(dead_code)]`, not reachable
from main) gets the `update_media_counts` signature update and reads
EXCLUDED_DIRS from env so a future revival isn't broken — but its
WalkDir walk stays raw because the dual lib/bin compile makes the
file_scan module path non-trivial there. Tagged with a comment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology drops `@eaDir/.../SYNOFILE_THUMB_*.jpg` files alongside every
photo. The face-detect pipeline already filters those out via
`face_watch::filter_excluded`, but the filter runs *after* the indexer
has already inserted rows into `image_exif`. Result: phantom rows whose
content_hash never matches a `face_detections` row, so the anti-join in
`list_unscanned_candidates` returns them every tick. They're filtered
out at runtime, no marker is written, and the cycle repeats forever —
log spam, wrong stats denominator, and on a real Synology library the
phantom rows balloon into the hundreds of thousands.
Move the exclusion to the WalkDir pass, where filter_entry can prune
whole subtrees instead of walking and discarding leaves. Extract the
pre-existing 30-line walker chain in main.rs::process_new_files into
`file_scan::enumerate_indexable_files` so it's testable in isolation.
Six tests cover the bug (eadir prune), nested patterns, absolute-under-base
syntax, non-media filtering, modified_since semantics, and forward-slash
rel_path normalization.
Out of scope (other WalkDir callers in main.rs that don't yet apply
EXCLUDED_DIRS — thumbnail gen at 1309, media scan at 1377, video
playlist scan at 1685, and two nested walks at 1709 / 1743): separate
audit PR.
Operator note: existing phantom rows still need a one-shot cleanup —
DELETE FROM face_detections WHERE content_hash IN (
SELECT content_hash FROM image_exif WHERE rel_path LIKE '%/@eaDir/%'
);
DELETE FROM image_exif WHERE rel_path LIKE '%/@eaDir/%' OR rel_path LIKE '@eaDir/%';
Run before attaching a fresh Synology-sourced library.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire face detection into ImageApi's existing scan loop so new uploads
pick up faces automatically and the initial backlog grinds through on
full-scan ticks. No new job system; Phase 2's already_scanned check
makes the work implicitly idempotent (one face_detections row per
content_hash, including no_faces / failed marker rows).
face_watch.rs (new):
- run_face_detection_pass(library, excluded_dirs, face_client,
face_dao, candidates) — sync entry point. Builds a per-pass tokio
runtime and fans out detect calls bounded by FACE_DETECT_CONCURRENCY
(default 8). The watcher thread itself stays sync.
- filter_excluded — applies the same PathExcluder /memories uses, so
@eaDir / .thumbnails / EXCLUDED_DIRS-listed paths skip detection
before we burn a detect call (and Apollo's GPU memory) on junk.
- read_image_bytes_for_detect — RAW/HEIC route through
extract_embedded_jpeg_preview because opencv-python-headless can't
decode either; everything else gets a plain std::fs::read so EXIF
orientation reaches Apollo's exif_transpose intact.
- process_one — translates Apollo's response into the Phase 2 marker
contract: faces[] empty → no_faces; FaceDetectError::Permanent →
failed (don't retry); Transient → no marker (next scan retries);
success with N faces → N detected rows with the embeddings unpacked.
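The marker contract that `process_one` implements can be sketched as a pure mapping (enum and function names here are illustrative, not the module's actual types):

```rust
/// One Apollo detect outcome, reduced to what the marker decision needs.
#[derive(Debug, PartialEq)]
enum DetectResult {
    Faces(usize),   // N faces with embeddings
    Empty,          // faces[] empty
    PermanentError, // e.g. undecodable input
    TransientError, // e.g. Apollo unreachable
}

/// What gets written to face_detections for that outcome.
#[derive(Debug, PartialEq)]
enum MarkerAction {
    InsertDetected(usize), // N detected rows, embeddings unpacked
    MarkNoFaces,           // marker row: never re-scan
    MarkFailed,            // marker row: don't retry
    NoMarker,              // nothing written: next scan retries
}

fn marker_for(result: DetectResult) -> MarkerAction {
    match result {
        DetectResult::Faces(n) => MarkerAction::InsertDetected(n),
        DetectResult::Empty => MarkerAction::MarkNoFaces,
        DetectResult::PermanentError => MarkerAction::MarkFailed,
        DetectResult::TransientError => MarkerAction::NoMarker,
    }
}
```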
main.rs (process_new_files + watch_files):
- watch_files now also takes face_client + excluded_dirs; the watcher
thread builds a SqliteFaceDao the same way it builds ExifDao /
PreviewDao.
- After the EXIF write loop, build_face_candidates queries image_exif
for the just-walked image paths' content_hashes (covers new uploads
and pre-existing backlog), filters out anything already_scanned, and
hands the rest to face_watch::run_face_detection_pass.
- Bypassed wholesale when face_client.is_enabled() is false — keeps
the watcher usable on legacy deploys where Apollo isn't configured.
Tests: 5 face_watch unit tests cover the parts that don't need a real
Apollo:
- filter_excluded drops dir-component patterns (@eaDir) without
matching substring file names (eaDir-not-a-thing.jpg is kept).
- filter_excluded drops absolute-under-base subtrees (/private).
- empty EXCLUDED_DIRS short-circuits cleanly.
- read_image_bytes_for_detect passes JPEG bytes through verbatim
(orientation must reach Apollo unmodified).
- read_image_bytes_for_detect falls through to plain read when a
RAW-extension file has no embedded preview, so Apollo gets a chance
to 422 and we mark failed rather than retrying forever.
cargo test --lib: 170 passed / 0 failed; fmt and clippy clean for new code.
End-to-end (drop a photo → face_detections row appears) needs Apollo
running and is deferred to deploy-time verification.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Land the persistence model and HTTP surface for local face recognition.
Inference still lives in Apollo (Phase 1); this side adds the data home
plus every endpoint Apollo's UI and FileViewer-React will consume.
Schema (new migration 2026-04-29-000000_add_faces):
- persons: visual identities. Optional entity_id bridges to the
existing knowledge-graph entities table; auto-bridging is left to
the management UI (we don't muddy LLM provenance from face rows).
UNIQUE(name COLLATE NOCASE) so 'alice' / 'Alice' fold to one row.
- face_detections: keyed on content_hash (cross-library dedup), with
status='detected' carrying bbox + 512-d embedding BLOB, and
'no_faces' / 'failed' marker rows that tell Phase 3's file watcher
not to re-scan. Marker invariant enforced via CHECK; partial UNIQUE
on content_hash WHERE status='no_faces' guards against double-marks.
Schema regenerated with `diesel print-schema` against a clean migration
run; joinables added for face_detections → libraries / persons and
persons → entities.
face_client.rs (sibling of apollo_client.rs):
- reqwest multipart, 60 s timeout (CPU inference on a backlog can be
slow; bounded threadpool on Apollo serializes calls anyway).
- FaceDetectError::{Permanent, Transient, Disabled} — Phase 3 keys
its marker-row decision on this. 422 → mark failed, 5xx → defer.
- APOLLO_FACE_API_BASE_URL falls back to APOLLO_API_BASE_URL when
unset; both unset = is_enabled() false, callers no-op.
faces.rs (DAO + handlers):
- SqliteFaceDao implements the full FaceDao trait; person face counts
go through sql_query because diesel's BoxedSelectStatement +
group_by trips trait-resolver recursion.
- merge_persons re-points face rows in a transaction, copies notes
when target's are empty, deletes src.
- manual POST /image/faces resolves content_hash through image_exif,
crops the user-drawn bbox with 10% padding (detector wants context
around ears/jaw), POSTs the crop to face_client.embed for a real
ArcFace vector, then inserts source='manual'.
- Cluster-suggest (Phase 6) gets its data from
GET /faces/embeddings — base64-encoded paged BLOBs so Apollo's
DBSCAN can stream them without ImageApi pre-aggregating.
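The 10% crop padding used by the manual-create path can be sketched as a clamped bbox expansion (a hypothetical helper; the real handler works on the decoded image, but the arithmetic is the same):

```rust
/// Expand a user-drawn bbox by 10% of its width/height on each side,
/// clamped to the image bounds, so the detector gets context around
/// ears and jaw. Returns (x, y, w, h).
fn pad_bbox(x: u32, y: u32, w: u32, h: u32, img_w: u32, img_h: u32) -> (u32, u32, u32, u32) {
    let pad_x = w / 10;
    let pad_y = h / 10;
    let x0 = x.saturating_sub(pad_x); // clamp at the left/top edges
    let y0 = y.saturating_sub(pad_y);
    let x1 = (x + w + pad_x).min(img_w); // clamp at the right/bottom edges
    let y1 = (y + h + pad_y).min(img_h);
    (x0, y0, x1 - x0, y1 - y0)
}
```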
Endpoints registered alongside add_*_services in main.rs:
GET /faces/stats?library=
GET /faces/embeddings?library=&unassigned=&limit=&offset=
GET /image/faces?path=&library=
POST /image/faces (manual create via embed)
PATCH /image/faces/{id}
DELETE /image/faces/{id}
GET /persons?library=
POST /persons
GET /persons/{id}
PATCH /persons/{id}
DELETE /persons/{id}?cascade=set_null|delete (set_null default)
POST /persons/{id}/merge
GET /persons/{id}/faces?library=
The file-watch hook (Phase 3) and the rerun-on-one-photo handler
(Phase 6) live behind the FaceDao methods marked dead_code today —
they're called only when those phases land. Same shape for the trait
methods that aren't reached by Phase 2 routes.
Tests: 3 DAO unit tests cover person CRUD + case-insensitive uniqueness,
marker-row idempotency (mark_status is a no-op when any row exists),
and merge re-pointing faces.
Cargo.toml: reqwest gains the `multipart` feature.
cargo build / cargo test --lib / cargo fmt / cargo clippy --all-targets
all clean for the new code; the two pre-existing test_path_excluder
failures and the pre-existing sort_by clippy warnings are unrelated and
present on master.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
populate_knowledge now loads real libraries from the DB instead of
fabricating a single library_id=1 row from BASE_PATH. Adds --library
<id|name> to restrict the walk and validates --path against the selected
library roots. The full library set is still passed to InsightGenerator so
resolve_full_path can probe every root when an insight resolves to a
different library than the one being walked.
Adds indicatif progress bars across the long-running utility binaries via
a shared src/bin_progress.rs helper (determinate bar + open-ended spinner
with consistent styling). Per-batch info! noise is replaced by the bar's
throughput/ETA; warnings and errors route through pb.println so they
scroll above the bar instead of fighting with it.
- populate_knowledge: spinner during scan, determinate bar over all libs
- backfill_hashes: spinner with running hashed/missing/errors counts
- import_calendar: determinate bar; embedding/store failures inline
- import_location_*: determinate bar advancing by chunk size
- import_search_*: determinate bar; pb cloned into the spawn task
- cleanup_files P1: determinate bar over DB paths
- cleanup_files P2: determinate bar; pb.suspend() around the y/n/a/s prompt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Silence forward-looking dead_code on unused DAO modules, annotate
individual placeholder items, rewrite tautological assert!(true/false)
in token tests as panic! arms, and pick up fmt drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds blake3 content hashing as the basis for derivative dedup
(thumbnails, HLS) across libraries. Computed inline by the watcher on
ingest and by a new `backfill_hashes` binary for historical rows.
Key changes:
- `content_hash` and `size_bytes` are now populated on new image_exif
rows; a new ExifDao surface (`get_rows_missing_hash`,
`backfill_content_hash`, `find_by_content_hash`) supports backfill and
future hash-keyed lookups.
- The watcher now registers every image/video in image_exif, not just
files with parseable EXIF. EXIF becomes optional enrichment; videos
and other non-EXIF files still get a hashed row. This also makes
DB-indexed sort/filter cover the full library.
- `/image` thumbnail serving looks up the hash-keyed path first, then
falls back to the legacy mirrored layout.
- Upload flow accepts `?library=` query param + hashes uploaded files.
- `store_exif` logs the underlying Diesel error on insert failure so
constraint violations surface instead of hiding behind a generic
InsertError.
- New migration normalizes rel_path separators to forward slash across
all tables, deduplicating any rows that collide after normalization.
Fixes spurious UNIQUE violations from mixed backslash/forward-slash
paths on Windows ingest.
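The normalization the migration applies is the same transform the ingest path needs going forward; a minimal sketch (function name illustrative):

```rust
/// Normalize a rel_path to forward slashes so Windows- and Unix-ingested
/// rows key identically — the in-code mirror of the migration's UPDATE.
fn normalize_rel_path(rel: &str) -> String {
    rel.replace('\\', "/")
}
```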
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`watch_files` and `create_thumbnails` now iterate every configured
library, tagging rows with the correct `library_id`. `process_new_files`
takes a `&Library` so InsertImageExif no longer hardcodes the primary
library. Upload accepts an optional `library` query param to pick a
target library; omitted still defaults to primary for backwards
compatibility.
Hash-keyed thumbnail/HLS storage with dual-lookup fallback is deferred
to Phase 5, where it's bundled with the content hash backfill that
actually makes the hash-keyed paths meaningful. Until hashes are
populated, the legacy mirrored layout is a no-op to change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a `libraries` registry table and threads library_id through
per-instance metadata tables (image_exif, photo_insights,
entity_photo_links, video_preview_clips). File-path columns renamed to
rel_path to make the relative-to-root semantics explicit. Adds
content_hash + size_bytes on image_exif to support future hash-keyed
thumbnail/HLS dedup. Tags and favorites stay library-agnostic so they
share across libraries by rel_path.
Behavior is unchanged: a single primary library (id=1) is seeded from
BASE_PATH on first boot; all handlers and DAOs route through it as a
transitional shim until the API gains a library query param.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements Phase 1 & 2 of Google Takeout RAG integration:
- Database migrations for calendar_events, location_history, search_history
- DAO implementations with hybrid time + semantic search
- Parsers for .ics, JSON, and HTML Google Takeout formats
- Import utilities with batch insert optimization
Features:
- CalendarEventDao: Hybrid time-range + semantic search for events
- LocationHistoryDao: GPS proximity with Haversine distance calculation
- SearchHistoryDao: Semantic-first search (queries are embedding-rich)
- Batch inserts for performance (1M+ records in minutes vs hours)
- OpenTelemetry tracing for all database operations
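The GPS proximity check rests on the standard Haversine great-circle distance; a self-contained sketch of the formula the DAO applies:

```rust
/// Great-circle distance in meters between two lat/lon points
/// (Haversine formula over a mean Earth radius of 6371 km).
fn haversine_m(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    const R: f64 = 6_371_000.0; // mean Earth radius, meters
    let (phi1, phi2) = (lat1.to_radians(), lat2.to_radians());
    let dphi = (lat2 - lat1).to_radians();
    let dlambda = (lon2 - lon1).to_radians();
    let a = (dphi / 2.0).sin().powi(2)
        + phi1.cos() * phi2.cos() * (dlambda / 2.0).sin().powi(2);
    2.0 * R * a.sqrt().atan2((1.0 - a).sqrt())
}
```

One degree of longitude at the equator comes out near 111.2 km, a handy sanity check for proximity thresholds.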
Import utilities:
- import_calendar: Parse .ics with optional embedding generation
- import_location_history: High-volume GPS data with batch inserts
- import_search_history: Always generates embeddings for semantic search
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>