ImageApi

Author	SHA1	Message	Date
Cameron Cordes	6f0c15d0c5	insight-chat: code-review polish on get_faces_in_photo - Drop redundant `use anyhow::Context` inside has_any_faces (already imported at the module level). - Drop dead `.unwrap_or("?")` on bound faces — the vec is filtered to is_some() so the fallback can never fire. - Reorder the face_dao constructor param + initializer to match the struct declaration (between tag_dao and knowledge_dao). Update both state.rs call sites and populate_knowledge.rs to match. - Hold face_dao lock once across the library-resolver loop instead of reacquiring per iteration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 17:48:22 -04:00
Cameron Cordes	b64a5bec28	insight-chat: add get_faces_in_photo agentic tool The LLM had no path to see face_detections data — get_file_tags returns user-applied tags, but a face that's been detected and bound to a person via the embedding-cluster auto-bind path doesn't always have a matching tag. The new tool joins face_detections with persons by content_hash and returns bound names + bboxes, plus unidentified faces (so smaller models can count people in the photo without inferring from a visual description). Gated on face_detections being non-empty via the same has_any_* pattern as daily_summaries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 17:43:16 -04:00
Cameron Cordes	eef41d4172	thumbnails: align video ffmpeg args with the image path so non-yuvj420p sources work The bare 'ffmpeg -ss 3 -i in -vframes 1 -f image2 out' command failed on sources whose decoded pix_fmt isn't yuvj420p (e.g. older Samsung phone videos in yuv420p). With no -vf filter chain, the decoded frame goes straight to the mjpeg encoder, which rejects it with 'Non full-range YUV is non-standard' and exits non-zero. generate_image_thumbnail_ffmpeg already handles the same class of source for HEIC/RAW by adding -vf scale=200:-1 -c:v mjpeg — the filter chain lets ffmpeg auto-insert the pix_fmt converter the encoder needs. Adopt the same args here. Side benefit: video thumbnails are now 200px wide on disk, matching image thumbnails (previously full-resolution). Pre-existing .unsupported sentinels for videos that hit this failure will need to be deleted manually to retry — they're under $THUMBNAILS/<lib_id>/.../*.unsupported. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 17:20:05 -04:00
Cameron Cordes	b42acbb3f3	fmt: cargo fmt sweep across drifted files No behavior change — purely whitespace/line-break cleanup that had accumulated since the last format run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:42:41 -04:00
Cameron Cordes	2a273a3ed9	thumbnails: stop video failures from re-logging every watcher tick generate_video_thumbnail used .output().expect(...), which only catches spawn failure — non-zero ffmpeg exits were silently discarded. With no thumbnail and no .unsupported sentinel left behind, the watcher re-detected the file as missing every quick-scan tick and re-logged "New file detected (missing thumbnail)" forever. Mirror the image branch: return io::Result, check status.success(), and write the sentinel from create_thumbnails on failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:41:24 -04:00
Cameron Cordes	1cdc0f6eb9	insight-chat: drop the dead SmsApiClient::search_messages wrapper The post-PR-4 delegation kept it as a convenience for callers that don't filter by contact, but nothing actually uses it. Delete to clear the dead_code warning. search_messages_with_contact remains as the single entry point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:10:31 -04:00
Cameron Cordes	e539c083c9	insight-chat: code-review polish on the tool-gating PR - search_messages now delegates to search_messages_with_contact(.., None) so the two methods share a single HTTP path. Drops the dead-code warning and the ~30-line duplication. - DailySummaryDao gains has_any_summaries (LIMIT 1 existence probe) used by current_gate_opts; the SELECT COUNT(*) get_total_summary_count added in the prior commit is removed (it had no other caller). - current_gate_opts doc comment corrected to describe what the probes actually do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:07:57 -04:00
Cameron Cordes	f50d32667b	insight-chat: ToolGateOpts + per-tool description rewrites Tools whose backing tables are empty (calendar, location_history, daily_summaries) drop out of the catalog so the LLM doesn't waste iteration budget calling them only to receive "no results found". Vision and apollo gates already existed; this generalizes the pattern. search_messages gains start_ts/end_ts/contact_id filters (date filter is a client-side post-filter; SMS-API only accepts contact_id natively on the search endpoint). Descriptions follow a consistent convention: one sentence (what + when), param semantics, examples for tools with non-obvious param choices. No more all-caps headers, no more identity-prescriptive language inside descriptions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:56:58 -04:00
Cameron Cordes	b02da0d0cc	insight-chat: code-review polish on the days_radius fix - Bind effective_radius once in fetch_messages_for_contact so the log output and window math share a single source of truth for the clamp. - Clamp tool-supplied days_radius to [1, 30] at the tool boundary so a runaway LLM value can't produce a thousand-day window. - Split the negative-input test into a real negative-input case alongside the zero-input case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:47:46 -04:00
Cameron Cordes	659e7bd973	insight-chat: get_sms_messages tool now honors days_radius The agentic tool definition advertised a days_radius parameter but sms_client::fetch_messages_for_contact was hardcoded to ±4 days, silently ignoring whatever value the LLM chose. Plumb the parameter through; default 4 retained at the tool level for back-compat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:42:42 -04:00
Cameron Cordes	428f24b0f8	insight-chat: code-review polish on the chat system_prompt override - Trim the override input once via Option::map(str::trim).filter(...). - Use matches!() in restore_system_prompt_override's Prepended arm so it reads consistently with the Replaced arm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:40:04 -04:00
Cameron Cordes	faa289882f	insight-chat: per-turn system_prompt override on chat continuation Append mode: applied ephemerally — original system message restored before persistence so re-opens see the baked persona. Amend mode: override stays in place and becomes the new insight row's system message. Pattern mirrors annotate_system_with_budget. Adds system_prompt field on both ChatTurnHttpRequest and ChatTurnRequest; plumbs through chat_turn and chat_turn_stream identically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:34:08 -04:00
Cameron Cordes	177187f6a2	insight-chat: code-review polish on the system-prompt split - Use Option::map instead of manual match-on-Option (drops clippy::manual_map). - Drop redundant `max_iterations = max_iterations` from the format! call. - Use captured identifiers consistently in the user_content format!. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:27:59 -04:00
Cameron Cordes	8ae4099d46	insight-chat: split generation system prompt into identity + procedural blocks The framework no longer asserts "you are a personal photo memory assistant" alongside a user-supplied custom_system_prompt — the persona is the authoritative identity. The procedural block (tool-use guidance, iteration budget) stays identity-free. The user message also stops asking for "a detailed insight with a title and summary" since the title is regenerated post-hoc anyway and the wording was constraining voice for no data-model benefit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:20:45 -04:00
Cameron Cordes	43f8f83d80	memories: deny Snapchat-prefixed filenames from timestamp parsing Snapchat assigns sequential IDs that happen to overlap real epoch values, so the 10-16 digit timestamp regex matched and produced 2002-era dates for files actually saved in 2016/2021. The digits themselves are indistinguishable from a unix timestamp, so we dispatch on the source-app prefix instead. Case-insensitive, extensible for future apps that exhibit the same pattern. Reported cases: Snapchat-1021849065.mp4 → 2002-05-19 (actual 2021) Snapchat-1751031586660373917.jpg → 2002-09-09 (actual 2016) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 12:17:40 -04:00
Cameron Cordes	feaae9b6d3	memories: reject implausible filename-derived timestamps Filenames like `000227580005.jpg` (film-scan ID) and `IMG_21323906751390.jpeg` were matched by the 10-16 digit timestamp regex and resolved to 1970 / 2037, then written into `image_exif.date_taken` with `source = 'filename'`. EXIF-less photos showed up under those bogus dates everywhere date_taken is read. Two new guards in `extract_date_from_filename`: - leading zero → reject (real epoch values don't have one at any sane resolution). - resolved year outside [1995, now+1y] → reject. Both let the date_resolver waterfall fall through to fs_time, which is a much better proxy for content age than a fake epoch date. Regression tests cover the two reported filenames. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 12:02:07 -04:00
Cameron Cordes	7e1c4ab318	backfill_date_taken: surface the actual diesel error in warnings The DAO swallowed every diesel::update failure as a flat `anyhow!("Update error")`, then trace_db_call further reduced it to `DbError { kind: UpdateError }`. Operators saw "update failed for lib 2 Snapchat/foo.mp4: DbError { kind: UpdateError }" with no clue why (constraint violation? type mismatch? row vanished mid-flight? DB locked?). Three changes: - Preserve the diesel error in the anyhow chain along with the input params (lib, rel_path, date_taken, source) so the cause is visible. - Log the chain at warn-level inside the DAO before the trace wrapper collapses it to DbErrorKind::UpdateError, so the warning at the call site finally has something diagnosable next to it. - Treat zero-row updates as a debug-level "row likely retired by the missing-file scan" rather than a hard failure — that case is benign and shouldn't poison the drain's error tally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 11:07:17 -04:00
Cameron Cordes	65af7d999e	memories: parse filename dates as UTC, not server local `extract_date_from_filename` was calling `Local::from_local_datetime` on the parsed YYYY-MM-DD-HH-MM-SS components, then `.timestamp()` was shifting the result by the SERVER's TZ offset to produce real UTC seconds. That made filename-sourced timestamps disagree with EXIF-sourced timestamps by hours: kamadak-exif's `DateTimeOriginal` is a naive string parsed AS-IF-UTC (the project's load-bearing "naive local reinterpreted as UTC" convention), and Apollo's photo matcher re-anchors that naive value through the BROWSER's TZ when matching to the track. Anything stamped in server-local instead got double-shifted on its way through the matcher and through any `formatNaive*` display path on the client. Visible symptom in the Apollo DETAILS modal: a photo's CURRENT date read correctly (1:25 AM via exif) while FROM FILENAME read 4 hours ahead (5:25 AM in EDT) for the same `IMG_20160710_012515.jpg`. Switch to `Utc::from_utc_datetime` so `.timestamp()` returns the wall-clock-as-UTC unix seconds — same convention as the EXIF path. The /memories endpoint, the canonical-date waterfall (which feeds `image_exif.date_taken` for filename-only files), and Apollo's DETAILS modal `filename_date` field all now line up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 20:43:18 -04:00
Cameron Cordes	16d6586b7d	exif: GET /image/exif/full — exiftool dump for the DETAILS modal The curated `image_exif` columns are a small slice of what exiftool can read (camera/lens/GPS/capture/dates). Apollo's DETAILS modal wants to surface everything — white balance, metering, MakerNotes, IPTC, ICC profile, Composite tags, the lot — for an operator inspecting a photo's provenance. `read_full_exif_via_exiftool(path)` shells out to `exiftool -j -G -n`: JSON output, group-prefixed keys (`EXIF:Make`, `MakerNotes:LensInfo`), numeric values (callers can reformat). Spawned via web::block to keep it off the actix worker — RAW with rich MakerNotes can take a few seconds. The endpoint is on-demand only; the indexer / file watcher does NOT call it. Falls back to 503 with a clear message when exiftool isn't on PATH so Apollo can render an "install exiftool" hint. Multi-library union resolution mirrors set_image_gps / get_file_metadata. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 19:42:41 -04:00
Cameron Cordes	832b50d587	image_exif: manual date_taken override (set/clear endpoints) Add `POST /image/exif/date` and `POST /image/exif/date/clear` so an operator can correct a row whose canonical-date waterfall landed on the wrong value (camera clock reset, fs_time fallback for a copied-from-backup file, etc). New `original_date_taken` / `original_date_taken_source` columns snapshot the prior value on first override so revert is lossless. The waterfall source set is now `'exif' | 'exiftool' | 'filename' | 'fs_time' | 'manual'`. The existing `idx_image_exif_date_backfill` partial index already filters to `date_taken IS NULL OR date_taken_source = 'fs_time'`, so manual rows are naturally excluded from the per-tick drain — no index change needed. `ExifMetadata` now exposes `date_taken_source` + originals so a UI can render "manually set; was X via filename". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 19:26:43 -04:00
Cameron Cordes	ecd49fd053	otel: revert HTTP transport, keep gRPC The HTTP/protobuf exporter never sent any traffic in prod (tcpdump on port 4318 showed nothing) despite the receiver path being correct and the bridge wiring being intact (logs reached journalctl via the stdout exporter). Likely the BatchLogProcessor + reqwest-client combo isn't getting the right runtime context, but debugging that on a live deployment isn't worth holding up the rest of the speedups. Restoring grpc-tonic transport so prod observability comes back. The remaining build-time wins on this branch (mold linker, system sqlite3, profile.dev tweaks, lockfile-only dep refresh) deliver most of the original savings without touching telemetry. Operator: revert OTLP_OTLS_ENDPOINT in prod from port 4318 back to 4317. HTTP transport remains a viable follow-up — needs to be debugged against a local SigNoz instance with internal SDK error visibility enabled, on its own branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 18:33:37 -04:00
Cameron Cordes	f73db58771	build: speed up debug compile loop - Drop libsqlite3-sys 'bundled' on Linux/macOS so the SQLite C source isn't recompiled every clean build; Windows keeps 'bundled' via a cfg(windows) target override. - Switch opentelemetry-otlp from grpc-tonic to http-proto + reqwest-client. Removes the tonic + h2 + hyper-h2 stack from the build graph; reqwest was already a dependency. Updates otel.rs to call .with_http(). - Add [profile.dev] debug = "line-tables-only" to shrink linker work while keeping panics/backtraces useful. - Add .cargo/config.toml selecting mold via gcc on x86_64-linux-gnu. Requires `apt install mold`. Other platforms use the default linker. - cargo update: lockfile-only refresh of all minor/patch bumps within existing version constraints. Cold debug build: ~1m 37s; touch-one-file rebuild: ~5s on Linux. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 17:36:42 -04:00
Cameron Cordes	7f12890f4b	memories: single-SQL rewrite + 20-year lookback Replaces the EXIF-loop + WalkDir-fallback pipeline that powered `/memories` with a single per-library SQL query (`get_memories_in_window`) that uses `strftime('%m-%d' | '%W' | '%m', date_taken, 'unixepoch', tz_offset)` for calendar matching in the client's timezone, plus a `years_back` lower bound and a no-future-dates upper bound. Returns only the matching rows; the handler applies per-library `PathExcluder` post-query and sorts. Drops: - `collect_exif_memories` — replaced by the single SQL query. - `collect_filesystem_memories` — the canonical-date pipeline now populates `date_taken` for every row at ingest, so the WalkDir fallback that scanned 14k+ files each request is no longer needed. - `get_memory_date_with_priority` and friends — request-time waterfall superseded by `date_resolver` running at ingest. The associated three priority-tests are dropped; their replacement lives in `date_resolver::tests`. On a ~14k-file library this drops `/memories` from 10–15 s (dominated by `fs::metadata` per row) to single-digit ms. Bumps `DEFAULT_YEARS_BACK` from 15 → 20 to surface deeper archives on matching anniversaries. Note vs. ISO weeks: the original Rust used `chrono::iso_week().week()` for week-span matching. SQLite's `%W` is Monday-anchored but uses week 0 for days before the first Monday, so it can disagree with ISO at year boundaries by ±1. Acceptable for nostalgia browsing. Adds 3 new DAO tests covering month-span filter, library scoping, and the unknown-span-token guard. Also adds a CLAUDE.md section describing the canonical-date pipeline end-to-end and the new `DATE_BACKFILL_MAX_PER_TICK` env var. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:04:09 -04:00
Cameron Cordes	54e0635a98	date_backfill: per-tick drain for unresolved date_taken rows Adds two ExifDao methods (`get_rows_needing_date_backfill` / `backfill_date_taken`) and a `backfill_missing_date_taken` watcher pass that runs on every tick alongside `backfill_unhashed_backlog`. The drain queries the partial index for rows where `date_taken IS NULL` or `date_taken_source = 'fs_time'`, batches up to `DATE_BACKFILL_MAX_PER_TICK` paths (default 500), and feeds them through `date_resolver::resolve_dates_batch` — a single exiftool subprocess covers the whole tick. Rows that newly resolve to `exiftool` / `filename` / `fs_time` get persisted via `backfill_date_taken` (touches only `date_taken` + `date_taken_source` so EXIF / hash / perceptual columns survive). `filename`-sourced rows are intentionally not re-resolved — the regex is authoritative when it matches and re-running exiftool wouldn't change the answer. Files that have disappeared from disk are skipped so a ghost row doesn't loop through the drain forever; the missing-file scan in `library_maintenance` retires those separately. Comes with two DAO unit tests (eligibility filter + column-isolation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:03:03 -04:00
Cameron Cordes	2d14291733	ingest: stamp canonical date_taken on every InsertImageExif Wires `date_resolver::resolve_date_taken` into the three call sites that build `InsertImageExif`: - `process_new_files` (file watcher) — every newly-registered file gets the resolver's verdict so videos and EXIF-stripped images land with a real date instead of NULL. - Upload handler — same waterfall on the post-multipart-write path. - GPS-write handler — re-runs the waterfall after exiftool writes GPS and re-reads the EXIF, in case a previously fs_time-sourced row now has a real EXIF date to upgrade to. This is a behavior change vs. the pre-rewrite `/memories` request-time priority: EXIF now beats filename when both are present. A photo named `Screenshot_2014-06-01.png` whose EXIF `DateTime` is 2021 now appears under 2021. The reverse case (no EXIF, parseable filename) is unchanged and continues to surface the filename date with `date_taken_source = 'filename'`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:00:14 -04:00
Cameron Cordes	79e258eccd	date_resolver: canonical date_taken waterfall with exiftool fallback New module that consolidates the four-step ingest waterfall: kamadak-exif (already in process via the caller's prior result) → exiftool fallback → filename regex → earliest_fs_time. Each step is tagged with a `DateSource` so the caller can persist provenance. The exiftool fallback is what makes videos and MakerNote-hosted dates land at all — kamadak-exif can't read QuickTime/MP4 or Nikon-style sub-IFDs. Single-file mode shells out per call; batch mode pipes paths on stdin via `-@ -` and fans the result through one subprocess so the upcoming per-tick drain doesn't pay startup cost per row. The `exiftool` PATH check is cached in a `OnceLock` to keep the drain short-circuited on deploys without exiftool installed. `SubSecDateTimeOriginal` and `ContentCreateDate` are pulled alongside the standard tags to capture iPhone's sub-second precision and Apple's preferred capture-time tag respectively. `FileModifyDate` is deliberately not in the tag list — it's a filesystem-derived value the resolver already covers via the `fs_time` step, and pulling it through exiftool would mask "no real EXIF date" with a misleading `source = exiftool` row. Module is registered in both `lib.rs` and `main.rs` (sibling-module pattern the rest of the bin uses); no callers wired in yet — that lands in the next commit. Comes with 9 unit tests covering JSON parsing edge cases, source-priority short-circuiting, and the fs_time-when-no-exif path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 15:59:02 -04:00
Cameron Cordes	84326501a9	image_exif: add date_taken_source column New nullable TEXT column tracks which step of the canonical-date waterfall (kamadak-exif → exiftool → filename → fs_time) populated `date_taken`. Lets a later per-tick drain re-resolve weak sources (`fs_time`) once stronger ones become available, and gives the UI/debug surface a way to answer "why does this photo show up under this date?". Adds the column at all `InsertImageExif` construction sites with `None` placeholders (the resolver wiring lands in a follow-up commit), and extends the `update_exif` SET tuple so the column survives the GPS-write re-read path. Partial index `idx_image_exif_date_backfill` is created for the upcoming drain query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 15:57:49 -04:00
Cameron Cordes	67cf0c7f73	duplicates: folder-pair view of exact dups Bucket exact-dup rows by (library_id, dirname) pair on each side, then filter by coverage = shared / min(folder_a_total, folder_b_total) and an absolute floor on shared count. Surfaces "this folder is mostly contained in that folder" matches that the per-file EXACT view buries under one row each — e.g. an old phone-backup tree shadowing the organized library, or a topic-grouped folder duplicating a date-grouped one within the same library. New endpoint: GET /duplicates/folder-pairs?library=&include_resolved=&min_coverage=&min_shared=. Cached 5 min keyed on (library, include_resolved); the user-tunable thresholds filter the cached unfiltered pair list so slider drags don't re-bucket. Shares the resolve / unresolve flow with the existing tabs — the frontend fans out N parallel /resolve calls, one per shared content_hash. Folder names carry no signal (BMW lives under Night Photos, not BMW_backup), so bucketing is purely on (library_id, dirname) co-occurrence in exact-dup groups. Within-folder dups (same hash twice in the same folder) are skipped — those belong to the EXACT tab. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 12:43:29 -04:00
Cameron Cordes	1ddbca3413	exif: preserve filesystem mtime on GPS write Pass -P to exiftool so write_gps doesn't bump the file's modification time. For phone photos with no embedded EXIF datetime, the filesystem mtime is often the only timestamp we have — losing it on every GPS backfill would be data loss. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:09:21 -04:00
Cameron Cordes	57b7bad086	duplicates: library-aware visibility — only hide a demoted row when its survivor is reachable Soft-marked rows used to disappear from /photos globally, including from a library-scoped view that didn't contain the survivor at all. A user browsing lib A who'd promoted a file from lib B as the survivor would silently lose visibility on their own copy in lib A, even though lib B's file isn't reachable from lib A's view. Library-scoped queries now keep a demoted row visible when its survivor lives in a library outside the current scope. Implemented as a NOT EXISTS subquery against the same image_exif table aliased as `survivor`. The unscoped (all-libraries) view is unchanged — every survivor is reachable, so demoted rows stay hidden as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:24:07 -04:00
Cameron Cordes	98057c98a1	duplicates: tighten perceptual cluster — entropy band, asymmetric dHash, medoid prune Three changes against "still too loose at lowest sensitivity": - Popcount entropy band tightened from [8, 56] to [16, 48]. The wider band let too much low-frequency content through (skies, scans, faded film) where pHash collapses to near-uniform values that Hamming-trivially match across hundreds of unrelated images. - dHash check now uses an asymmetric stricter threshold (dhash_threshold = max(2, threshold/2)). pHash is the candidate-discovery signal; dHash is validation. Splitting the budget means a real near-dup survives both while incidental pHash collisions on uniform content get vetoed. Missing dHash on either side now rejects the edge (was: trust pHash alone). - Single-link union-find can chain weakly-similar images via transitive edges. Added a medoid-validation pass: per cluster, pick the member with smallest summed distance to others, then drop any whose distance to it exceeds threshold. Two new tests pin both invariants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:19:48 -04:00
Cameron Cordes	7ca888e95d	duplicates: filter low-entropy hashes + dHash double-check, fix backfill loop The perceptual cluster was producing one giant first group that contained hundreds of unrelated images. Two causes: - Solid-colour images (skies, black frames, monochrome scans) all hash to near-zero pHashes that Hamming-distance-zero to each other. - Single-link clustering on pHash alone is too permissive — a chain of weakly-similar images all collapses into one cluster. Fixed by skipping hashes outside the popcount [8, 56] band (uniform content) and requiring dHash agreement within threshold before unioning a candidate edge from the BK-tree. Two new tests pin both invariants. Separate fix in the backfill bin: decode-failed rows kept phash_64=NULL and got re-pulled by every batch, infinite-looping on a queue of unbreakable formats. Persist a 0/0 sentinel on decode failure so the row leaves the candidate set; the all-zero hash is excluded from clustering by the same entropy filter so it doesn't pollute results. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:08:05 -04:00
Cameron Cordes	7584cd8792	duplicates: perceptual hash + soft-mark resolution + upload 409 Adds pHash + dHash columns alongside the existing blake3 content_hash so near-duplicates (re-encoded, resized, format-converted copies) become queryable. /duplicates/{exact,perceptual} return groups; /duplicates/{resolve,unresolve} flip a duplicate_of_hash soft-mark on losing rows and union perceptual-only tag sets onto the survivor. The default /photos listing filters duplicate_of_hash IS NULL so demoted siblings stop cluttering the grid; include_duplicates=true opts back in for Apollo's review modal. Upload now hashes bytes pre-write and returns 409 with the canonical sibling when a file's bytes already exist. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:36:01 -04:00
Cameron Cordes	fb4df4b195	style: cargo fmt sweep Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:01:00 -04:00
Cameron Cordes	1d9b9a0bc4	faces: avoid 40 MB row clone in /faces/embeddings list_embeddings cloned the full FaceDetectionRow inside the filter_map just to pair it with the base64-encoded embedding. The 2 KB BLOB was already on the row — at 20k unassigned faces that's 40 MB of pointless heap traffic per Apollo cluster-suggest run. Move the bytes out via Option::take() so the row drops the BLOB instead of duplicating it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:00:55 -04:00
Cameron Cordes	814066551e	multi-library: per-library excluded_dirs Adds a nullable comma-separated TEXT column to the libraries table. Effective excludes for a walk = (env-var globals) ∪ (library.excluded_dirs). Empty / NULL = no library-specific extras; the global env var still applies. Migration (2026-05-01-110000_libraries_excluded_dirs) ALTER TABLE libraries ADD COLUMN excluded_dirs TEXT. NULL on every existing row — no behavior change on upgrade. Library struct + helpers (libraries.rs) - Library gains excluded_dirs: Vec<String>, parsed from the column by parse_excluded_dirs_column (drops empties / whitespace, matches the env-var parser). - Library::effective_excluded_dirs(globals) returns the union. - From<LibraryRow> hydrates the field on AppState construction so /libraries surfaces it. Watcher / walkers / memories Every per-library walker now consults the effective set: - process_new_files (file-watch ingest, RAW/EXIF/face) - process_face_backlog (filter_excluded inherits) - create_thumbnails (startup + new-file branch) - update_media_counts (Prometheus gauge) - cleanup_orphaned_playlists (per-library source-existence check) - memories endpoint (PathExcluder) Effective set is computed once per per-library iteration in the watcher tick and threaded through; called functions retain their flat &[String] signature (no per-library awareness needed inside the walker primitives). Use case: mount a parent directory while a sibling library covers a child subtree, and exclude the child subtree from the parent so the libraries don't double-walk / double-write image_exif. With hash-keyed derived data (Branches B/C), the duplication-avoidance is the only cost prevented — face / tag / insight sharing was already correct via content_hash. Tests: 228 pass (226 from previous + 2 new in libraries::tests: parse_excluded_dirs_column edge cases, effective_excluded_dirs_unions_global_and_per_library). CLAUDE.md gains a "Per-library excludes" subsection of the multi-library data model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:54:17 +00:00
Cameron Cordes	3598bb2cfe	multi-library: operator kill switch via libraries.enabled A small follow-up to Branches A/B/C. Adds a nullable-default-1 boolean column to the `libraries` table that controls whether the watcher considers the library at all. Useful for staging a new mount before committing to ingest, and as a maintenance kill switch when a library needs to be quiet without being unmounted. Migration (2026-05-01-100000_libraries_enabled_flag) ALTER TABLE libraries ADD COLUMN enabled BOOLEAN NOT NULL DEFAULT 1. Existing rows stay enabled — no behavior change on upgrade. Watcher gate (main.rs) At the top of the per-library loop, if !lib.enabled { continue; } — runs BEFORE the availability probe. Disabled libraries don't enter the health map, don't get probed, don't get ingest, don't get any maintenance pass. The initial sweep before the loop's first sleep also skips disabled libraries. Orphan-GC consensus (library_maintenance.rs) all_libraries_online filters disabled libraries out of the consensus check — they're treated as out-of-scope, not as blockers. Otherwise flipping enabled=false would permanently halt orphan GC for the rest of the system, which is the opposite of the intended kill-switch semantics. Cross-library duplicates: safe by construction. Hash-keyed derived data (face_detections, tagged_photo with hash, photo_insights with hash) is anchored by ANY image_exif row carrying the hash. Disabling a library does NOT delete its image_exif rows, so a hash referenced by a disabled library's row stays anchored — derived data survives. collect_orphan_hashes deliberately doesn't filter image_exif by library.enabled for exactly this reason. No HTTP endpoint. Library mutation is rare-enough infra work that a SQL toggle is fine, and a public mutation endpoint without a role / permission story would be poorly-prioritized exposure for a single-user tool. Documented in CLAUDE.md. 
Tests: 226 pass (225 from Branch C + 1 new all_libraries_online_treats_disabled_as_out_of_scope, which proves that even an explicit Stale entry on a disabled library doesn't block the consensus). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 19:10:24 +00:00
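The consensus filter described in 3598bb2cfe can be illustrated with a minimal sketch. The `Health` and `LibraryRow` types and the function signature below are hypothetical stand-ins, not the repository's actual definitions, and treating a missing health entry as "not online" is an assumption made here for safety.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the repository's LibraryHealth.
#[derive(Debug, Clone, PartialEq)]
pub enum Health {
    Online,
    Stale { reason: String },
}

/// Hypothetical row shape: only the fields this sketch needs.
pub struct LibraryRow {
    pub id: i32,
    pub enabled: bool,
}

/// Returns true when every *enabled* library is Online.
/// Disabled libraries are out of scope, so even an explicit Stale
/// entry for one of them does not block the consensus.
pub fn all_libraries_online(libs: &[LibraryRow], health: &HashMap<i32, Health>) -> bool {
    libs.iter()
        .filter(|lib| lib.enabled)
        // Assumption: an enabled library with no health entry counts as not online.
        .all(|lib| matches!(health.get(&lib.id), Some(Health::Online)))
}
```

With this shape, flipping `enabled` to false on a Stale library lets orphan GC proceed for the remaining libraries instead of halting it.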
Cameron Cordes	d809ddee44	library_maintenance: clarify orphan-gc log wording "marked 2 new" parses as "2 new files" on first read — but the unit is content_hashes, and the action is observing them as orphaned (becoming-deleted, not appearing). Reword: "{} new orphan hash(es) marked, {} revived" instead of "marked {} new, revived {}". Also pluralize the deleted counts ("row(s)") and append the pending-set size to the success log so a tick that both deletes and re-marks doesn't lose the trailing-state context. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:01:01 +00:00
Cameron Cordes	fa98d147be	library_maintenance: log orphan-gc decisions in stale-library path too run_orphan_gc returned early on the !all_online branch before the final debug/info log line, so the GC was effectively invisible whenever any library was Stale — exactly the dry-run scenario where operators most want to confirm the safety gate is firing. Add the same conditional log inside the early-return branch (plus a "deferred — at least one library Stale" hint in the info-level variant when there's something newly marked). No behavior change beyond observability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:14:09 +00:00
Cameron Cordes	263e27e108	multi-library: handoff + orphan GC with two-tick consensus Branch C of the multi-library data-model rollout. Implements the operational maintenance pipeline pinned in CLAUDE.md → "Multi-library data model" / "Library availability and safety". Branches A and B land first; this branch builds on top. New module: src/library_maintenance.rs Three idempotent passes the watcher runs every tick after the per-library ingest loop: 1. Missing-file scan (per online library) For each Online library, load a paginated page of image_exif rows (IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE, default 500), stat() each one, and delete rows whose source file is NotFound. Permission/IO errors are skipped, never deleted. Capped at IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK (default 200) per library per tick — so a pathological mount that returns NotFound for everything can't wipe the table in one cycle. Cursor advances across ticks, wraps on partial-page returns, and naturally cycles through the entire library over many minutes. Skipped wholesale for Stale libraries via the existing probe gate. 2. Back-ref refresh (DB-only) For face_detections / tagged_photo / photo_insights: any hash-keyed row whose (library_id, rel_path) no longer matches an image_exif row, but whose content_hash does, is repointed at a surviving image_exif location. Pure SQL with EXISTS guards so rows whose hash is fully orphaned are left alone (the orphan GC handles those). Idempotent; no availability gate needed. This is what makes a recent → archive move invisible to readers: when pass 1 retires the lib-A row, pass 2 pivots tags / faces / insights to lib-B's surviving path before any client notices. 3. Orphan GC (destructive) Hash-keyed derived rows whose content_hash has no image_exif referent are GC-eligible. Two-tick consensus: a hash must be observed orphaned on two consecutive ticks AND every library must be Online for both. 
A single Stale tick within the window cancels all pending deletes (they remain marked but won't be promoted) — they're re-evaluated next tick. The pending set lives in OrphanGcState (in-memory); a watcher restart resets it, which can only delay a delete, never cause one. Hashes that re-appear in image_exif between ticks are "revived" from the pending set (handles transient share unmount / remount). Two new ExifDao methods: - list_rel_paths_for_library_page(library_id, limit, offset) for the paginated missing-file scan. - (count_for_library landed in Branch A.) Watcher wiring (main.rs) Per-library: missing-file scan inside the existing per-library loop, after process_new_files, gated by the same probe check that already protects ingest. After the loop: reconcile (Branch B), back-ref refresh, then run_orphan_gc. The maintenance connection is opened once per tick (image_api::database::connect), used by all three DB-only passes, and dropped at end of tick. CLAUDE.md gains a "Maintenance pipeline" subsection that describes the three passes and their interaction with the existing availability-and-safety policy. Tests: 225 pass (217 from Branch B + 8 new in library_maintenance covering back-ref refresh including the fully-orphaned no-op case, two-tick GC consensus, Stale-tick consensus reset, image_exif re-appearance revival, multi-table delete, and the all_libraries_online helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:27:53 +00:00
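The two-tick consensus in 263e27e108 might look roughly like the sketch below. `PendingOrphans` and `TickOutcome` are hypothetical names standing in for the in-memory OrphanGcState. The sketch takes one reading of "every library must be Online for both": a hash is promoted to deletion only when it was already pending and both the previous and current ticks were fully online.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory pending set; resets on restart, which can
/// only delay a delete, never cause one.
#[derive(Default)]
pub struct PendingOrphans {
    pending: HashSet<String>,
    prev_all_online: bool,
}

#[derive(Debug, Default, PartialEq)]
pub struct TickOutcome {
    pub deleted: Vec<String>,
    pub newly_marked: usize,
    pub revived: usize,
}

impl PendingOrphans {
    /// `orphaned_now` is the set of content hashes with no image_exif referent this tick.
    pub fn tick(&mut self, orphaned_now: &HashSet<String>, all_online: bool) -> TickOutcome {
        // Revive: hashes that re-appeared in image_exif leave the pending set.
        let before = self.pending.len();
        self.pending.retain(|h| orphaned_now.contains(h));
        let revived = before - self.pending.len();

        // Promote: only when this tick and the previous one were both fully online.
        let mut deleted: Vec<String> = Vec::new();
        if all_online && self.prev_all_online {
            deleted = self.pending.drain().collect();
            deleted.sort();
        }

        // Mark: newly observed orphans wait at least one more tick.
        let mut newly_marked = 0;
        for h in orphaned_now {
            if !deleted.contains(h) && self.pending.insert(h.clone()) {
                newly_marked += 1;
            }
        }

        self.prev_all_online = all_online;
        TickOutcome { deleted, newly_marked, revived }
    }
}
```

A Stale tick leaves marks in place but clears the "previous tick was online" flag, so the next online tick re-observes rather than deletes.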
Cameron Cordes	48cac8c285	multi-library: hash-keyed tagged_photo + photo_insights with reconciliation Branch B of the multi-library data-model rollout. tagged_photo and photo_insights now follow the bytes (content_hash), not the path, matching the policy pinned in CLAUDE.md "Multi-library data model". Branch A's availability probe and EXIF scoping land first; this branch builds on top. Migration (2026-05-01-000000_hash_keyed_derived_data) Adds nullable content_hash columns to tagged_photo and photo_insights, with partial indexes on the non-null subset to keep the index small during the transitional window. The migration backfills from image_exif: * tagged_photo joins on rel_path alone (no library_id available); * photo_insights joins on (library_id, rel_path), unambiguous. Rows whose image_exif hash isn't known yet stay null and the runtime reconciliation pass populates them as the hash backlog drains. Insert-time population TagDao::tag_file looks up image_exif.content_hash by rel_path before inserting; the hash is written into the new column. InsightDao::store_insight does the same scoped to (library_id, rel_path). Caller-supplied hash on InsertPhotoInsight wins; otherwise the DAO does the lookup. Both paths fall back to None if the hash isn't known yet — reconciliation backfills. Reconciliation (database/reconcile.rs) Three idempotent passes the watcher runs once per tick after the per-library backfill loop: 1. tagged_photo NULL hashes → populate from image_exif by rel_path. 2. photo_insights NULL hashes → populate by (library_id, rel_path). 3. photo_insights scalar merge — when multiple is_current rows share a content_hash, keep the earliest generated_at as current; demote the rest. Demoted rows keep their data so /insights/history is unaffected; only the "current" pointer narrows to one per hash. No filesystem dependency, so reconcile doesn't need the availability gate; runs every tick. Logs once when something changed, debug otherwise. 
Tags are set-valued under the policy (union on read, already DISTINCT in queries), so there is no analogous tag-collapse pass — duplicate (tag_id, content_hash) rows across libraries are harmless. Read paths are unchanged in this branch — lookup_tags_batch's existing rel_path-via-hash-sibling expansion still produces the correct merge. A follow-up can simplify reads to use the new column directly for performance. Tests: 217 pass (212 pre-existing + 5 new in reconcile covering NULL-fill, hash-not-yet-known no-op, library scoping on insights, earliest-wins collapse, idempotency). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:52:16 +00:00
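The earliest-wins scalar merge from 48cac8c285 runs as SQL in the repository; the in-memory sketch below only illustrates the rule. The `Insight` struct is a hypothetical reduction of a photo_insights row, and breaking ties on `id` is an assumption the commit message does not state.

```rust
use std::collections::HashMap;

/// Hypothetical reduction of a photo_insights row.
pub struct Insight {
    pub id: i64,
    pub content_hash: Option<String>,
    pub generated_at: i64,
    pub is_current: bool,
}

/// Keeps one current row per content_hash (earliest generated_at) and
/// demotes the rest. Rows with a NULL hash are left alone. Returns the
/// number of rows demoted; a second call on the same data returns 0.
pub fn collapse_current(rows: &mut [Insight]) -> usize {
    // hash -> (generated_at, id) of the winning row; id breaks ties (assumption).
    let mut winner: HashMap<String, (i64, i64)> = HashMap::new();
    for r in rows.iter().filter(|r| r.is_current) {
        if let Some(h) = &r.content_hash {
            let cand = (r.generated_at, r.id);
            winner
                .entry(h.clone())
                .and_modify(|w| {
                    if cand < *w {
                        *w = cand;
                    }
                })
                .or_insert(cand);
        }
    }

    let mut demoted = 0;
    for r in rows.iter_mut().filter(|r| r.is_current) {
        if let Some(h) = &r.content_hash {
            if winner[h].1 != r.id {
                // Demoted rows keep their data; only the current flag narrows.
                r.is_current = false;
                demoted += 1;
            }
        }
    }
    demoted
}
```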
Cameron Cordes	48ed7be5d9	libraries: initial availability sweep before watcher's first sleep new_health_map seeds every library as Online, and the watcher's tick loop sleeps WATCH_QUICK_INTERVAL_SECONDS (default 60s) before its first probe — meaning /libraries reported the optimistic default for up to a minute after boot, even when a share was clearly unmounted. Run the same refresh_health pass once at the top of the watcher thread before entering the sleep loop. /libraries is then truthful within milliseconds of the watcher thread starting (effectively from the first HTTP request, since the watcher spawns well before the server binds). The per-tick gate inside the loop is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:33:45 +00:00
Cameron Cordes	eea1bf3181	multi-library: availability probe + scoped EXIF queries + collision fixes Branch A of the multi-library data-model rollout. Three threads of correctness/safety work that ship together because the new mount needs all three before it can land: 1. Library availability probe (libraries.rs, state.rs, main.rs) New LibraryHealth (Online | Stale { reason, since }) and a shared LibraryHealthMap on AppState. Probe checks root_path exists + is_dir + readable + non-empty (relative to a "had_data" signal so fresh mounts aren't downgraded). The watcher tick begins with a refresh_health() per library; stale libraries skip ingest, the hash backfill, and face-detection backlog drains for that tick. The orphaned-playlist cleanup also gates on every library being online: a missing source on a stale library is indistinguishable from a transient unmount, and the cleanup is destructive. /libraries now returns each library with its current health state. Logs only on Online↔Stale transitions so a long outage doesn't spam. New ExifDao::count_for_library is the "had_data" signal. 2. EXIF queries scoped by library_id (database/mod.rs, files.rs, main.rs, tags.rs) query_by_exif gains an Option<i32> library filter; /photos and /photos/exif now pass it. Without this, an EXIF-filtered request scoped to ?library=N returned cross-library results because the handler resolved the library but didn't push it through to SQL. get_exif_batch gains the same option. The watcher's per-library ingest, face-candidate build, and content-hash backfill all scope to their library; the union-mode /photos date-sort path and the library-agnostic tag fan-out (lookup_tags_batch, by design) keep using None. 3. Derivative-path collision fixes (content_hash.rs, main.rs) New content_hash::library_scoped_legacy_path helper: <derivative_dir>/<library_id>/<rel_path>.
Thumbnail generation (startup walk + watcher needs-thumb check) and serving now use it; serving falls back to the bare-legacy mirrored path so pre-multi-library deployments keep working without regeneration. Without this, lib2 with the same rel_path as lib1 would have its thumbnail request short-circuit to lib1's image. Orphaned-playlist cleanup walks every library when checking for the source video (was: BASE_PATH only). Without this, mounting a 2nd library and waiting 24h would delete every playlist whose source lived only in the 2nd library. The HLS playlist write path collision (filename-only basename, not rel_path) is left as a known issue with a TODO at the call site — the actor-pipeline rewrite belongs in Branch B/C. Tests: 212 pass (cargo test --lib). New tests cover the probe states (online / missing root / non-dir / empty-with-prior-data), refresh_health transitions, query_by_exif scoping, get_exif_batch keying on (library_id, rel_path), library_scoped_legacy_path, and count_for_library. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:12:49 +00:00
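The `<derivative_dir>/<library_id>/<rel_path>` layout from eea1bf3181 can be sketched as follows. The signature is an assumption (the commit names only the helper and the layout), and stripping a leading slash from `rel_path` is a defensive choice made for this sketch, since `Path::join` replaces the base when given an absolute path.

```rust
use std::path::{Path, PathBuf};

/// Sketch of a library-scoped derivative path. Signature is hypothetical;
/// only the resulting layout comes from the commit message.
pub fn library_scoped_legacy_path(derivative_dir: &Path, library_id: i32, rel_path: &str) -> PathBuf {
    derivative_dir
        .join(library_id.to_string())
        // Assumption: rel_path should always be relative; strip any leading slash.
        .join(rel_path.trim_start_matches('/'))
}
```

Two libraries holding the same `rel_path` now resolve to distinct thumbnail files, which is the collision the commit fixes.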
Cameron	98601973f7	faces: log at the three 503 paths in update_face_handler PATCH /image/faces/{id} can return 503 from three places (face client disabled, transient embed error, mid-flight disable) and none of them were logging — operator sees the status code but nothing in the Rust log explaining why. Add warn! lines at each so future bbox-edit failures aren't silent. Response body is unchanged so existing clients keep working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:57:51 -04:00
Cameron	44d677528e	tags: add edit + delete endpoints, enable FK enforcement PUT /image/tags/{id} renames a tag globally; DELETE /image/tags/{id} removes a tag and every photo's reference. Rename returns 200/404/409 (case-insensitive name conflict) / 400 (empty name); delete returns 204/404. New migration adds a UNIQUE COLLATE NOCASE index on tags.name with a pre-flight pass that collapses existing case- insensitive duplicates onto the lowest id. The connection setup now sets PRAGMA foreign_keys = ON. The schema already declares ON DELETE CASCADE / SET NULL on several tables — those clauses were documentation-only because SQLite has FK enforcement off per-connection by default. Audited every diesel::delete site; each touches either no inbound FKs or has a matching policy. delete_tag relies on the tagged_photo cascade instead of doing manual cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:26:35 -04:00
Cameron Cordes	323097c650	faces: count distinct content_hash in stats total_photos face_detections is keyed on content_hash (one row per unique bytes, shared across libraries / duplicate paths) but total_photos was COUNT(*) over image_exif rows. A file present at multiple rel_paths or across libraries inflated the denominator without inflating the numerator, leaving a permanent gap (e.g. 1101/1103 with nothing actually pending detection). Switch total_photos to COUNT(DISTINCT content_hash) so numerator and denominator live in the same domain. Exclude rows with NULL content_hash from the count — they're held in the hash-backfill backlog, not the detection backlog, and counting them pins the bar below 100% for the duration of that pass. CLAUDE.md: document the stats domain rule next to the rest of the face-detection notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:41:20 +00:00
Cameron Cordes	67abd8d8ff	style: cargo fmt Pre-existing whitespace drift in test bodies, normalized by rustfmt. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:16:34 +00:00
Cameron Cordes	0840d55c70	faces: exclude videos from backlog drain and SCANNED denominator list_unscanned_candidates pulled every hashed image_exif row, including videos. filter_excluded then dropped them client-side without writing a marker, so the same set re-appeared every watcher tick — emitting the "backlog drain — running detection on N candidate(s)" log forever and producing no progress. face_stats.total_photos counted the same video rows in the denominator, so the SCANNED percentage was structurally capped below 100%. Add an image-extension SQL predicate (case-insensitive, sourced from file_types::IMAGE_EXTENSIONS) and apply it to both queries. Videos never enter the candidate set, total_photos counts only what can actually be scanned, and 100% becomes reachable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:16:30 +00:00
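A case-insensitive image-extension predicate like the one 0840d55c70 describes could be built along these lines. The function name, the sample extension list, and the `LOWER(...) LIKE` form are all assumptions; the repository may build the predicate differently (for example through Diesel). The sketch also assumes the extensions are trusted constants, so it does no SQL escaping.

```rust
/// Hypothetical stand-in for file_types::IMAGE_EXTENSIONS.
pub const IMAGE_EXTENSIONS: &[&str] = &["jpg", "jpeg", "png", "heic"];

/// Builds a SQL fragment matching rows whose `column` ends in one of the
/// given extensions, ignoring case. Extensions must be trusted constants.
pub fn image_extension_predicate(column: &str, extensions: &[&str]) -> String {
    if extensions.is_empty() {
        // No extensions means nothing can match; "0" is false in SQLite.
        return "0".to_string();
    }
    let clauses: Vec<String> = extensions
        .iter()
        .map(|ext| format!("LOWER({}) LIKE '%.{}'", column, ext.to_ascii_lowercase()))
        .collect();
    format!("({})", clauses.join(" OR "))
}
```

Applying the same fragment to both the candidate query and the stats denominator keeps the two counts in the same domain, which is what makes 100% reachable.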
Cameron Cordes	f50655fb21	indexer: apply EXCLUDED_DIRS to remaining WalkDir callers Audit follow-up to `5bf4956`. The same `@eaDir` pruning that protects the indexer also needs to protect the other walks under library roots: - `create_thumbnails` walks every file in every library to generate thumbnails. Without EXCLUDED_DIRS, it would generate thumbnails of Synology's `SYNOFILE_THUMB_*.jpg` thumbnails (thumbnails of thumbnails). - `update_media_counts` walks for the prometheus IMAGE / VIDEO gauges. Without EXCLUDED_DIRS, the gauges over-count by however many phantom `@eaDir` images live alongside the real photos. - `cleanup_orphaned_playlists` walks BASE_PATH searching for source videos by filename. EXCLUDED_DIRS isn't a behavior change for typical Synology mounts (no .mp4 in @eaDir), but it's a correctness win for any operator-defined exclude that happens to contain video. Refactor: add `walk_library_files(base, excluded_dirs) -> Vec<DirEntry>` to file_scan.rs as the shared primitive. `enumerate_indexable_files` now layers media-type + mtime filters on top of it. One new test covers the lower-level helper (returns all extensions, prunes excluded subtrees). `generate_video_gifs` (currently `#[allow(dead_code)]`, not reachable from main) gets the `update_media_counts` signature update and reads EXCLUDED_DIRS from env so a future revival isn't broken — but its WalkDir walk stays raw because the dual lib/bin compile makes the file_scan module path non-trivial there. Tagged with a comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:21:17 +00:00
Cameron Cordes	5bf49568f1	indexer: prune EXCLUDED_DIRS at WalkDir time, extract enumerate_indexable_files Synology drops `@eaDir/.../SYNOFILE_THUMB_*.jpg` files alongside every photo. The face-detect pipeline already filters those out via `face_watch::filter_excluded`, but the filter runs *after* the indexer has already inserted rows into `image_exif`. Result: phantom rows whose content_hash never matches a `face_detections` row, so the anti-join in `list_unscanned_candidates` returns them every tick. They're filtered out at runtime, no marker is written, and the cycle repeats forever: log spam, wrong stats denominator, and on a real Synology library the phantom rows balloon into the hundreds of thousands. Move the exclusion to the WalkDir pass, where filter_entry can prune whole subtrees instead of walking and discarding leaves. Extract the pre-existing 30-line walker chain in main.rs::process_new_files into `file_scan::enumerate_indexable_files` so it's testable in isolation. Six tests cover the bug (eadir prune), nested patterns, absolute-under-base syntax, non-media filtering, modified_since semantics, and forward-slash rel_path normalization. Out of scope (other WalkDir callers in main.rs that don't yet apply EXCLUDED_DIRS, namely thumbnail gen at 1309, media scan at 1377, video playlist scan at 1685, and two nested walks at 1709 / 1743): separate audit PR. Operator note: existing phantom rows still need a one-shot cleanup: DELETE FROM face_detections WHERE content_hash IN ( SELECT content_hash FROM image_exif WHERE rel_path LIKE '%/@eaDir/%' ); DELETE FROM image_exif WHERE rel_path LIKE '%/@eaDir/%' OR rel_path LIKE '@eaDir/%'; Run before attaching a fresh Synology-sourced library. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:29:37 +00:00


370 Commits