449ce1fda186a8d3c093cd3b23d1e80950f63223
189 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2818936739 |
fix: audit fixes for async insight jobs + persist generation params
- Fix query param mismatch: rename GenerationStatusQuery.file_path to
path so the client's app-resume buildQuery({ path: ... }) resolves
correctly instead of always getting 400
- Remove dead _lib_id bindings from both generate handlers
- Return 202 Accepted instead of 200 from generate endpoints
- Restore OpenTelemetry span instrumentation on generate handlers
- Remove stale UNIQUE constraint from initial migration (incompatible
with plain-INSERT DAO)
- Add tests for status guard: complete_job/fail_job are no-ops when
job is already cancelled, and cancel_job by id
- Persist generation params (num_ctx, temperature, top_p, top_k, min_p,
system_prompt, persona_id) on the photo_insights table for auditing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
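The status guard described in this commit (complete and fail become no-ops once a job is cancelled) can be sketched in memory. This is an illustration only: the `Job` struct, the `JobStatus` variant names, and the boolean return values are assumptions, and the real DAO works against the SQLite `insight_generation_jobs` table rather than a struct.

```rust
/// Hypothetical in-memory stand-in for one job row (not the real Diesel model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Running,
    Completed,
    Failed,
    Cancelled,
}

#[derive(Debug)]
pub struct Job {
    pub id: i32,
    pub status: JobStatus,
}

impl Job {
    pub fn new(id: i32) -> Self {
        Job { id, status: JobStatus::Running }
    }

    /// Applies `next` only while the job is still running.
    /// Returns true when the transition happened, false when it was a no-op.
    fn transition(&mut self, next: JobStatus) -> bool {
        if self.status != JobStatus::Running {
            return false;
        }
        self.status = next;
        true
    }

    pub fn complete_job(&mut self) -> bool {
        self.transition(JobStatus::Completed)
    }

    pub fn fail_job(&mut self) -> bool {
        self.transition(JobStatus::Failed)
    }

    pub fn cancel_job(&mut self) -> bool {
        self.transition(JobStatus::Cancelled)
    }
}
```

In a SQL-backed version the same guard would typically live in the `WHERE` clause of the `UPDATE`, so a late completion from a spawned task cannot overwrite a cancellation.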
|
|
b87eb4e690 |
feat: async insight generation with SQLite job tracking
- Add insight_generation_jobs table migration and DAO
- Implement job lifecycle: create_or_get_active, complete, fail, cancel
- Refactor POST /insights/generate and /agentic to async spawn with timeout
- Add GET /insights/generation/status endpoint with job_id and file_path lookup
- Use String for enum fields in Diesel models to avoid private Bound type
- Add from_str() helpers on InsightJobStatus and InsightGenerationType
- Fix update_training_messages to return Result<usize, DbError>
- 7/7 DAO unit tests passing |
||
|
|
32195ed89e |
clip-search: backlog drain + /photos/search endpoint
Wires the persistence layer for CLIP semantic search. The watcher's
per-tick drain encodes any image_exif row with a known content_hash but
no clip_embedding via Apollo (cap CLIP_BACKLOG_MAX_PER_TICK, default 32).
On a query, /photos/search encodes the text via Apollo and reranks every
stored embedding in-memory.
ExifDao additions:
- list_clip_unencoded_candidates: partial-index scan for drain
- backfill_clip_embedding: touches only the two new columns
- list_clip_index: dedup'd (hash, embedding) pull for search
clip_watch::run_clip_encoding_pass is the parallel fan-out: a tokio
runtime per pass with CLIP_ENCODE_CONCURRENCY (default 4). No marker rows
for permanent failures yet; per-tick cap bounds the retry cost.
/photos/search params: q, limit, threshold (default 0.20), library,
model_version. Response is intentionally minimal (path + score) so the
frontend joins against existing photo-metadata routes lazily.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
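The in-memory rerank behind `/photos/search` might look roughly like the sketch below. The commit does not name the similarity measure, so cosine similarity is an assumption here, as are the function names, the inclusive threshold comparison, and the use of a plain string key per embedding.

```rust
/// Cosine similarity; returns 0.0 for mismatched lengths or zero-norm vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Scores every stored embedding against the query, keeps scores at or
/// above `threshold`, and returns the best `limit` hits, highest first.
pub fn rerank(
    query: &[f32],
    index: &[(String, Vec<f32>)],
    threshold: f32,
    limit: usize,
) -> Vec<(String, f32)> {
    let mut hits: Vec<(String, f32)> = index
        .iter()
        .map(|(key, embedding)| (key.clone(), cosine(query, embedding)))
        .filter(|(_, score)| *score >= threshold)
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(limit);
    hits
}
```

A full scan like this is linear in the number of stored embeddings per query, which matches the commit's description of reranking every stored embedding.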
|
|
0168a4b574 |
hls: remove legacy /video/stream + /video/{path} routes
The hash-keyed `/video/hls/{hash}/{file}` route fully covers HLS
playback now and both clients (Apollo, FileViewer-React) have
shipped updates that use it directly. Keeping the basename-keyed
fallback only encouraged stale URLs to keep flowing — every legacy
file was deleted by the startup migration, so the routes were
guaranteed 404 machines.
Dropped:
- `stream_video` handler (`GET /video/stream?path=…`) — the original
basename-keyed playlist serve.
- `get_video_part` handler (`GET /video/{path}`) — bare-filename
segment serve. The new layout's segments live in
`<shard>/<hash>/segment_NNN.ts` and reach the client via
`stream_hls_file`.
- `legacy_path` field on `GenerateVideoResponse` (serialised as
`playlist`). The field always pointed at a file the migration had
deleted; current clients ignore it entirely.
- Their service registrations in `main.rs`.
- The body-side `filename` extraction in `generate_video` (existed
only to construct `legacy_path`) and the now-unused `global`
opentelemetry import in `handlers/video.rs`.
All 707 tests still pass. Same hand-rolled validators (`is_valid_hash`
/ `is_allowed_hls_filename`) keep the new route's defense-in-depth
intact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7cd1ea3cf8 |
hls: per-library readiness gauges + GET /hls/stats endpoint
The hash-keyed pipeline transcodes lazily, so a freshly mounted (or
freshly upgraded) library is "mostly pending" for the first hour while
the watcher works through the backlog. The operator wants a live read on
remaining work so they can tune `HLS_CONCURRENCY` and know when to stop
waiting.
Adds:
- `src/hls_stats.rs`: pure compute path (`stats_from_rows`) and an
Arc<Mutex<dyn ExifDao>> wrapper (`compute_and_publish`). Per library:
`total`, `with_playlist`, `pending`, `unsupported`, `hashless_videos`.
Dedup is by content_hash so duplicate-bytes-at-N-paths counts once
(same domain rule as `faces::stats`). `hashless_videos` is a separate
counter so the operator can see the "hash backfill, then transcode"
pipeline depth instead of having NULL-hash rows just hide.
- Prometheus gauges labeled by library name:
`imageserver_hls_videos_total`, `..._with_playlist`, `..._pending`,
`..._unsupported`. Updated by the watcher at the end of every full-scan
tick *and* on every `/hls/stats` hit, so whichever surface the operator
is watching stays fresh. Registered in `main` alongside the existing
image/video gauges.
- `GET /hls/stats`: Claims-protected JSON snapshot of the same data plus
a top-level cross-library aggregate. Runs on a blocking pool so it
doesn't pin the actix worker; per-call cost is one
`list_paths_and_hashes_for_library` SQL query per library plus a
`stat()` per distinct video hash. Bounded: never invoked from
middleware, only from the explicit endpoint and the full-scan tick. The
watcher's end-of-tick `info!` summary line mirrors the endpoint output
for operators tailing the log.
- New `ExifDao::list_paths_and_hashes_for_library` method:
`SELECT rel_path, content_hash FROM image_exif WHERE library_id = ?`.
Single round-trip; callers filter to video extensions client-side
because the schema doesn't carry media-type. Mock impl in `files.rs`
returns an empty vec.
Tests in `hls_stats::tests` exercise stats_from_rows directly
(videos-only filter, hash dedup, playlist vs sentinel decision, NULL-hash
hashless counting) plus a publish_gauges round-trip that reads the gauge
value back. Full suite (347 lib + 360 bin = 707) passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
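A minimal sketch of the pure compute path, under stated assumptions: the signature below is hypothetical (the real `stats_from_rows` is not shown in the log), the on-disk `stat()` is replaced by a `state_of` closure, and `total` is assumed to count distinct hashes only, with NULL-hash videos tracked separately.

```rust
use std::collections::HashSet;

/// Assumed three-way outcome of checking one hash directory on disk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HashState {
    Playlist,
    Unsupported,
    Pending,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct HlsStats {
    pub total: usize,
    pub with_playlist: usize,
    pub pending: usize,
    pub unsupported: usize,
    pub hashless_videos: usize,
}

/// Counts videos per state, deduplicating by content hash.
pub fn stats_from_rows<'a>(
    rows: impl IntoIterator<Item = (&'a str, Option<&'a str>)>,
    is_video: impl Fn(&str) -> bool,
    state_of: impl Fn(&str) -> HashState,
) -> HlsStats {
    let mut stats = HlsStats::default();
    let mut seen: HashSet<&str> = HashSet::new();
    for (rel_path, hash) in rows {
        if !is_video(rel_path) {
            continue; // the schema carries no media type, so filter by path
        }
        let Some(hash) = hash else {
            stats.hashless_videos += 1; // still waiting on hash backfill
            continue;
        };
        if !seen.insert(hash) {
            continue; // same bytes at another path: count once
        }
        stats.total += 1;
        match state_of(hash) {
            HashState::Playlist => stats.with_playlist += 1,
            HashState::Unsupported => stats.unsupported += 1,
            HashState::Pending => stats.pending += 1,
        }
    }
    stats
}
```

Keeping the filesystem check behind a closure is what makes a function like this testable without touching disk.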
|
|
7c153596fe |
hls: hash-keyed HTTP routes for /video/generate and serving
`POST /video/generate` is reshaped to return a JSON object instead of
a bare string. New fields:
- `playlist_url`: stable hash-keyed URL of the form
`/video/hls/<hash>/playlist.m3u8`. Use this with hls.js / native
players — relative segment refs inside the playlist resolve to
`/video/hls/<hash>/segment_NNN.ts` because the URL is path-based.
- `content_hash`: the blake3 hex digest that identifies the bytes.
Stable across libraries, archive ingests, renames; clients can
cache the URL by hash.
- `ready`: true iff the playlist file is already on disk. False means
a transcode was just queued; the client should retry the URL after
a short delay (or rely on hls.js's built-in retry).
- `playlist` (legacy): basename-keyed path string, echoed under the
old field name so clients that destructure `response.playlist` keep
working during the rollout. The startup migration deletes the
underlying file, so this URL will 404; clients should migrate to
`playlist_url`. Field is slated for removal once Apollo / File
Viewer ship the update.
The handler:
- resolves the source path across libraries (same logic as before),
- looks up `image_exif.content_hash` for that (library_id, rel_path),
- falls back to inline `content_hash::compute` when the row is mid-
backfill — pure read, no library mutation,
- sends a single-element `QueueVideosMessage` to `VideoPlaylistManager`
if the playlist isn't already on disk and there's no
`playlist.unsupported` sentinel,
- returns the URL immediately. The actor pipeline owns transcoding.
New route `GET /video/hls/{hash}/{file}`:
- strict validation: hash must be 64 ascii-hex chars; file must be
`playlist.m3u8` or `segment_NNN.ts` (digits only). Anything else
returns 400 so we never have to rely on path canonicalisation
alone to defend against traversal,
- belt-and-suspenders canonicalize() guard verifies the resolved
file lives under `$VIDEO_PATH`,
- serves with the standard `NamedFile::into_response` machinery.
Cleanup in `actors.rs`:
- `ProcessMessage` + its `StreamActor` handler had no senders after
the rewire — removed. `StreamActor` itself stays (still handles
`RefreshThumbnailsMessage` from `files.rs`).
- `create_playlist`, `playlist_file_for`,
`playlist_unsupported_sentinel` are gone — the legacy on-demand
transcode helper and the migration-only path helpers had no
remaining users (the migration uses its own classify() function).
- Imports tightened: dropped `Child`, `ExitStatus`, `trace`.
Tests cover both new validators (`is_valid_hash`,
`is_allowed_hls_filename`) including the strings that motivated the
defence-in-depth (traversal attempts, internal `.tmp`/`.unsupported`
artifacts, malformed segment names).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
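The two validators named in this commit could be written along the following lines. The function names come from the commit message, but the bodies are a sketch: the real implementations are not shown, and accepting upper-case hex digits is an assumption.

```rust
/// True when `s` is exactly 64 ASCII hex characters (a blake3 hex digest).
pub fn is_valid_hash(s: &str) -> bool {
    s.len() == 64 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// True for `playlist.m3u8` or `segment_<digits>.ts`; everything else,
/// including `.tmp` / `.unsupported` artifacts, is rejected.
pub fn is_allowed_hls_filename(name: &str) -> bool {
    if name == "playlist.m3u8" {
        return true;
    }
    match name
        .strip_prefix("segment_")
        .and_then(|rest| rest.strip_suffix(".ts"))
    {
        Some(digits) => !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}
```

Because both checks are allow-lists over the whole string, a value containing `/`, `..`, or a NUL byte never reaches the filesystem layer, which is why the canonicalize() guard is only a second line of defence.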
|
|
78fabc2b32 |
hls: retire legacy basename-keyed HLS files on startup
Adds `video::legacy_migration::retire_legacy_hls_output`, called once
from `main` right after the diesel migrations run and before the actor
pipeline starts. Walks `$VIDEO_PATH` at depth 1, deletes every `.m3u8` /
`.m3u8.tmp` / `.m3u8.unsupported` / `.ts` file at root, and logs a single
info line with per-class counts. Skips directories (the new layout's
`<shard>/<hash>/` lives there) and unknown extensions, so an operator's
stashed README or `.tmp` from a different tool is safe.
Why this needs its own one-shot pass rather than letting the rewritten
`cleanup_orphaned_playlists` handle it: the cleanup walk deliberately
only looks at `<shard>/<hash>/` dirs (so it can't accidentally `rm`
operator-stashed content), so without this migration the legacy files
would sit at root forever, never served, never refreshed.
Operator complaint count from the previous IMG_NNNN.MOV collision: ~10
duplicate-basename hits on one library alone; total .m3u8 count was 699
vs a much larger video count, i.e. the loser of every collision was a
permanent orphan. This pass collects all of them, then the running
watcher writes hash-keyed playlists going forward.
Idempotent: a second boot finds nothing and reports zero deletions, so
the call site can stay in `main` across releases until the module is
removed in a later cleanup commit.
Tests cover the happy path (legacy artifacts gone, hash dir untouched,
unrelated files left alone), idempotency, and the missing-directory case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
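The per-class matching that the migration applies to root-level files can be sketched as a pure function. The enum and the signature are hypothetical; only the four suffixes come from the commit message, and skipping directories is assumed to be the caller's job.

```rust
/// Classes of legacy basename-keyed HLS output (names are illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum LegacyClass {
    Playlist,
    PlaylistTmp,
    PlaylistUnsupported,
    Segment,
}

/// Classifies a root-level file name; `None` means "leave it alone".
/// Longer suffixes are checked first so `.m3u8.tmp` is not mistaken for
/// an unknown `.tmp` file.
pub fn classify(file_name: &str) -> Option<LegacyClass> {
    if file_name.ends_with(".m3u8.tmp") {
        Some(LegacyClass::PlaylistTmp)
    } else if file_name.ends_with(".m3u8.unsupported") {
        Some(LegacyClass::PlaylistUnsupported)
    } else if file_name.ends_with(".m3u8") {
        Some(LegacyClass::Playlist)
    } else if file_name.ends_with(".ts") {
        Some(LegacyClass::Segment)
    } else {
        None
    }
}
```

Returning `None` for anything unrecognised is what keeps an operator's README or a foreign `.tmp` file safe from deletion.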
|
|
d1667099c3 |
hls: rewire queue + generator to write hash-keyed playlists
Switches the watcher → VideoPlaylistManager → PlaylistGenerator path
from the basename-keyed layout
(`$VIDEO_PATH/{basename}.m3u8`) to the hash-keyed layout
(`$VIDEO_PATH/{hash[..2]}/{hash}/playlist.m3u8`) introduced in the
prior commit. Source videos that share a basename across libraries
(or across subdirs of one library) no longer overwrite each other's
playlists. The legacy HTTP endpoints in `/video/generate` /
`/video/stream` still use the basename layout — those move in a
follow-up commit alongside the stable streaming URL.
actors.rs:
- `QueueVideosMessage.video_paths: Vec<PathBuf>` →
`videos: Vec<VideoToQueue>`. The queue handler dedups against the
hash-keyed playlist + sentinel and forwards `GeneratePlaylistMessage`
carrying the hash.
- `GeneratePlaylistMessage` now carries `content_hash: String`; the
legacy `playlist_path: String` field is gone.
- `PlaylistGenerator` takes a `video_dir: PathBuf` at construction,
computes the hash dir + playlist + sentinel + segment template via
`hls_paths`, `mkdir -p`s the shard/hash dir before ffmpeg runs, and
cleans up partial output on failure by walking the hash dir.
- `ScanDirectoryMessage` and its handler are retired entirely; their
startup-walk role is taken over by the watcher's first tick (see
`watcher.rs` below). Dropping it avoids threading an `ExifDao` into
`VideoPlaylistManager` just so the actor can resolve hashes.
- Legacy `playlist_file_for` / `playlist_unsupported_sentinel` are
retained behind `#[allow(dead_code)]` for the upcoming migration
pass that retires pre-content-hash output.
watcher.rs:
- `process_new_files` keeps `content_hash` in the EXIF-batch result
(formerly threw it away). Videos with `image_exif.content_hash =
NULL` — mid-backfill rows — are skipped this tick rather than
falling back to a basename-colliding playlist; they get picked up
after `backfill_unhashed_backlog` populates the hash on a
subsequent tick. Skipped count is logged at debug.
- The video staleness check now uses `hls_paths::playlist_for_hash`
instead of `$VIDEO_PATH/{basename}.m3u8`.
- `last_full_scan` initialises to `UNIX_EPOCH` so the watcher's first
tick is treated as a full scan. That covers the catch-up gap left
by removing `ScanDirectoryMessage` — every library's existing media
is checked once at watcher boot (≈60s after startup) instead of
waiting up to `WATCH_FULL_INTERVAL_SECONDS` (1h default).
main.rs: removes the `ScanDirectoryMessage` import and the per-library
`do_send` loop, with a comment pointing at the watcher's first-tick
behavior.
state.rs: `PlaylistGenerator::new` now takes the video dir.
Tests: existing `video::hls_paths` (4) and `watcher::tests` (4) pass.
The basename-keyed `/video/generate` endpoint still compiles and
serves; behavior change there is deferred to the follow-up commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
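The hash-keyed layout `$VIDEO_PATH/{hash[..2]}/{hash}/playlist.m3u8` reduces to a couple of path joins. The sketch below is an assumption about shape only: the commit names `hls_paths::playlist_for_hash` but does not show its signature, and `hash_dir` plus the `Option` return are illustrative.

```rust
use std::path::{Path, PathBuf};

/// `<video_dir>/<first two hash chars>/<hash>`; `None` if the hash is too
/// short to shard (or the first two bytes are not a char boundary).
pub fn hash_dir(video_dir: &Path, hash: &str) -> Option<PathBuf> {
    let shard = hash.get(..2)?;
    Some(video_dir.join(shard).join(hash))
}

/// Playlist path inside the hash directory.
pub fn playlist_for_hash(video_dir: &Path, hash: &str) -> Option<PathBuf> {
    hash_dir(video_dir, hash).map(|dir| dir.join("playlist.m3u8"))
}
```

Two source files with the same basename but different bytes resolve to different directories here, which is the collision the basename-keyed layout could not avoid.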
|
|
b3124437ec |
libraries: PATCH /libraries/{id} with live-apply
Adds an HTTP mutation surface for `libraries.enabled` and
`libraries.excluded_dirs`, replacing the SQL-only workflow noted in
CLAUDE.md. Apollo's Settings panel calls this from the LIBRARIES section
so the operator no longer has to ssh + sqlite3 to flip a library off or
edit its excludes.
Live-apply (no restart) via a new
`live_libraries: Arc<RwLock<Vec<Library>>>` field on AppState. The
existing immutable `libraries` Vec stays for hot-path handlers that only
need stable id → root_path lookups, avoiding a 19-call-site refactor.
The watcher and cleanup_orphaned_playlists now take the lock instead of
a Vec snapshot and re-read at the top of each tick, so `enabled` /
`excluded_dirs` changes are picked up within one
WATCH_QUICK_INTERVAL_SECONDS. The GET /libraries handler also reads
through the live view.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
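The live-apply pattern (a writer mutating a shared `Arc<RwLock<Vec<Library>>>`, readers re-snapshotting at the top of each tick) can be sketched as follows. The `Library` fields shown and both helper functions are hypothetical; only the lock type comes from the commit message.

```rust
use std::sync::{Arc, RwLock};

/// Trimmed-down stand-in for the real Library struct.
#[derive(Debug, Clone, PartialEq)]
pub struct Library {
    pub id: i32,
    pub enabled: bool,
    pub excluded_dirs: Vec<String>,
}

pub type LiveLibraries = Arc<RwLock<Vec<Library>>>;

/// What a PATCH handler might do after persisting the change.
/// Returns false when no library has that id.
/// (`unwrap` panics if the lock was poisoned by a panicking writer.)
pub fn set_enabled(live: &LiveLibraries, id: i32, enabled: bool) -> bool {
    let mut libs = live.write().unwrap();
    match libs.iter_mut().find(|lib| lib.id == id) {
        Some(lib) => {
            lib.enabled = enabled;
            true
        }
        None => false,
    }
}

/// What a watcher tick might do first: clone the enabled libraries and
/// release the read lock before doing any slow filesystem work.
pub fn enabled_snapshot(live: &LiveLibraries) -> Vec<Library> {
    live.read().unwrap().iter().filter(|lib| lib.enabled).cloned().collect()
}
```

Cloning under a short read lock keeps the PATCH handler from blocking behind a long scan, at the cost of a change taking effect only on the next tick.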
|
|
9f8a69fc6d |
Split main.rs: extract watcher loop into src/watcher.rs
main.rs drops from 1200 → 346 lines (90% smaller than the pre-branch
3542). What's left is the startup wiring it was always meant to be:
.env, migrations, AppState construction, route registration, server bind.
The four background-loop functions move into src/watcher.rs:
- watch_files (310 lines): quick/full scan tick, per-library probe,
backfill drain dispatch, missing-file scan, back-ref refresh, orphan GC.
- process_new_files (351 lines): file walk → EXIF write → face-candidate
build → HLS / preview-clip queueing → reconciliation. The "biggest
untested chunk" from the earlier audit.
- cleanup_orphaned_playlists (167 lines): separate slower-tick thread.
- playlist_needs_generation: small mtime-comparison helper.
Plus 4 unit tests for playlist_needs_generation (covers missing playlist,
newer playlist, newer video, video-missing-metadata fallback).
main.rs's imports correspondingly shrink: Addr, HashSet, WalkDir, Utc,
InsertImageExif, and the bulk of video::actors all leave with the watcher.
CLAUDE.md updated to reflect the new module layout (layered architecture
box + module map for the face-detection section).
cargo test --bin image-api: 329 passing (no regression).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bdb69c7d37 |
Split main.rs: extract HTTP handlers into src/handlers/
main.rs drops from 2935 → 1200 lines, freed for startup wiring + the
watcher. The 16 route handlers move into three domain-grouped files
under src/handlers/:
- handlers/favorites.rs (128 lines): favorites, put_add_favorite,
delete_favorite.
- handlers/video.rs (665 lines): generate_video, stream_video,
get_video_part, get_video_preview, get_preview_status. The 5
pre-existing get_preview_status integration tests move with the handler
(still pass against TestPreviewDao + AppState::test_state).
- handlers/image.rs (1003 lines): get_image (with the
hash/library-scoped/bare-legacy thumb lookup), upload_image,
get_file_metadata, set_image_gps, get_full_exif, set_image_date,
clear_image_date. Helpers (create_circular_thumbnail,
build_metadata_response_for_date_mutation) and request structs
(SetGpsRequest, SetDateRequest, ClearDateRequest, UploadQuery) travel
with them.
main.rs's import block shrinks from ~50 lines to ~22 as everything
HTTP-specific (NamedFile, mp::Multipart, BytesMut, Span, KeyValue,
StreamExt, …) moves with the handlers. The is_video_file wrapper also
goes; remaining callers in watch_files / cleanup use
file_types::is_video_file directly.
cargo test --bin image-api: 325 passing (no regression).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bec9857426 |
Split main.rs: extract backfill drains and thumbnails into modules
main.rs drops from 3542 → ~2930 lines by moving:
- src/backfill.rs (new): backfill_unhashed_backlog,
backfill_missing_date_taken, backfill_missing_content_hashes,
build_face_candidates, process_face_backlog. Now unit-tested for
the first time — 5 tests covering cap behavior, library-id
filtering, missing-on-disk skip, and the video/unhashed/scanned
filters on face-candidate selection.
- src/thumbnails.rs (new): unsupported_thumbnail_sentinel,
generate_image_thumbnail, create_thumbnails, update_media_counts,
is_image, is_video, plus the IMAGE_GAUGE / VIDEO_GAUGE Prometheus
metrics. Replaces the no-op stubs that used to live in lib.rs.
4 new unit tests for the sentinel path math and the
walker-counts-images-vs-videos smoke path.
Supporting:
- SqliteExifDao::from_shared (test-only) so an SqliteExifDao and
SqliteFaceDao can share one in-memory connection — required to
test build_face_candidates against the real join.
- files.rs / video/{mod,actors}.rs import from crate::thumbnails::*
instead of the now-removed stubs in lib.rs.
cargo test --bin image-api: 325 passing (was 314).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
25233904aa |
Merge pull request 'personas: elevate to server with per-persona fact scoping' (#88) from feature/persona-knowledge-segmentation into master
Reviewed-on: #88 |
||
|
|
108bbeb029 |
date-override: union semantics across libraries + slash forms
The date-override path used to look up `image_exif` strictly by `(library_id, rel_path)` with only the forward-slash form, while `/image/metadata`'s `get_exif` falls back across libraries and tries both slash forms. A photo whose row sat under a different library_id than its filesystem-resolved one — or whose rel_path was stored with backslashes — rendered fine in the modal but 404'd on save. `set_manual_date_taken` / `clear_manual_date_taken` now share a `locate_image_exif_row` helper that mirrors `get_exif`'s union semantics (scoped lookup first, library-agnostic fallback by rel_path in both slash forms), then update by primary key so the write hits exactly the row read. Inner anyhow errors are logged with `(library_id, rel_path)` so the next failure mode is debuggable. Handler-side: `resolve_library_param` errors no longer silently fall back to the primary library (which would have masked the original bug with a different "row not found"); a malformed library param now returns 400. New `DbErrorKind::NotFound` lets the handler distinguish genuine misses (404) from real DB failures (500). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3e2f36a748 |
personas: elevate to server with per-persona fact scoping
Move personas off the mobile client into ImageApi as first-class records, and scope entity_facts by persona so each one builds its own voice over a shared entity graph. The new include_all_memories flag lets a persona opt back into the full hive-mind pool for human browsing of /knowledge/*; agentic generation always stays in-voice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b42acbb3f3 |
fmt: cargo fmt sweep across drifted files
No behavior change — purely whitespace/line-break cleanup that had accumulated since the last format run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2a273a3ed9 |
thumbnails: stop video failures from re-logging every watcher tick
generate_video_thumbnail used .output().expect(...), which only catches spawn failure — non-zero ffmpeg exits were silently discarded. With no thumbnail and no .unsupported sentinel left behind, the watcher re-detected the file as missing every quick-scan tick and re-logged "New file detected (missing thumbnail)" forever. Mirror the image branch: return io::Result, check status.success(), and write the sentinel from create_thumbnails on failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
16d6586b7d |
exif: GET /image/exif/full — exiftool dump for the DETAILS modal
The curated `image_exif` columns are a small slice of what exiftool can read (camera/lens/GPS/capture/dates). Apollo's DETAILS modal wants to surface everything — white balance, metering, MakerNotes, IPTC, ICC profile, Composite tags, the lot — for an operator inspecting a photo's provenance. `read_full_exif_via_exiftool(path)` shells out to `exiftool -j -G -n`: JSON output, group-prefixed keys (`EXIF:Make`, `MakerNotes:LensInfo`), numeric values (callers can reformat). Spawned via web::block to keep it off the actix worker — RAW with rich MakerNotes can take a few seconds. The endpoint is on-demand only; the indexer / file watcher does NOT call it. Falls back to 503 with a clear message when exiftool isn't on PATH so Apollo can render an "install exiftool" hint. Multi-library union resolution mirrors set_image_gps / get_file_metadata. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
832b50d587 |
image_exif: manual date_taken override (set/clear endpoints)
Add `POST /image/exif/date` and `POST /image/exif/date/clear` so an
operator can correct a row whose canonical-date waterfall landed on the
wrong value (camera clock reset, fs_time fallback for a
copied-from-backup file, etc). New `original_date_taken` /
`original_date_taken_source` columns snapshot the prior value on first
override so revert is lossless.
The waterfall source set is now
`'exif' | 'exiftool' | 'filename' | 'fs_time' | 'manual'`. The existing
`idx_image_exif_date_backfill` partial index already filters to
`date_taken IS NULL OR date_taken_source = 'fs_time'`, so manual rows are
naturally excluded from the per-tick drain; no index change needed.
`ExifMetadata` now exposes `date_taken_source` + originals so a UI can
render "manually set; was X via filename".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
54e0635a98 |
date_backfill: per-tick drain for unresolved date_taken rows
Adds two ExifDao methods (`get_rows_needing_date_backfill` / `backfill_date_taken`) and a `backfill_missing_date_taken` watcher pass that runs on every tick alongside `backfill_unhashed_backlog`. The drain queries the partial index for rows where `date_taken IS NULL` or `date_taken_source = 'fs_time'`, batches up to `DATE_BACKFILL_MAX_PER_TICK` paths (default 500), and feeds them through `date_resolver::resolve_dates_batch` — a single exiftool subprocess covers the whole tick. Rows that newly resolve to `exiftool` / `filename` / `fs_time` get persisted via `backfill_date_taken` (touches only `date_taken` + `date_taken_source` so EXIF / hash / perceptual columns survive). `filename`-sourced rows are intentionally not re-resolved — the regex is authoritative when it matches and re-running exiftool wouldn't change the answer. Files that have disappeared from disk are skipped so a ghost row doesn't loop through the drain forever; the missing-file scan in `library_maintenance` retires those separately. Comes with two DAO unit tests (eligibility filter + column-isolation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2d14291733 |
ingest: stamp canonical date_taken on every InsertImageExif
Wires `date_resolver::resolve_date_taken` into the three call sites that
build `InsertImageExif`:
- `process_new_files` (file watcher): every newly-registered file gets
the resolver's verdict so videos and EXIF-stripped images land with a
real date instead of NULL.
- Upload handler: same waterfall on the post-multipart-write path.
- GPS-write handler: re-runs the waterfall after exiftool writes GPS and
re-reads the EXIF, in case a previously fs_time-sourced row now has a
real EXIF date to upgrade to.
This is a behavior change vs. the pre-rewrite `/memories` request-time
priority: EXIF now beats filename when both are present. A photo named
`Screenshot_2014-06-01.png` whose EXIF `DateTime` is 2021 now appears
under 2021. The reverse case (no EXIF, parseable filename) is unchanged
and continues to surface the filename date with
`date_taken_source = 'filename'`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
79e258eccd |
date_resolver: canonical date_taken waterfall with exiftool fallback
New module that consolidates the four-step ingest waterfall: kamadak-exif (already in process via the caller's prior result) → exiftool fallback → filename regex → earliest_fs_time. Each step is tagged with a `DateSource` so the caller can persist provenance. The exiftool fallback is what makes videos and MakerNote-hosted dates land at all — kamadak-exif can't read QuickTime/MP4 or Nikon-style sub-IFDs. Single-file mode shells out per call; batch mode pipes paths on stdin via `-@ -` and fans the result through one subprocess so the upcoming per-tick drain doesn't pay startup cost per row. The `exiftool` PATH check is cached in a `OnceLock` to keep the drain short-circuited on deploys without exiftool installed. `SubSecDateTimeOriginal` and `ContentCreateDate` are pulled alongside the standard tags to capture iPhone's sub-second precision and Apple's preferred capture-time tag respectively. `FileModifyDate` is deliberately *not* in the tag list — it's a filesystem-derived value the resolver already covers via the `fs_time` step, and pulling it through exiftool would mask "no real EXIF date" with a misleading `source = exiftool` row. Module is registered in both `lib.rs` and `main.rs` (sibling-module pattern the rest of the bin uses); no callers wired in yet — that lands in the next commit. Comes with 9 unit tests covering JSON parsing edge cases, source-priority short-circuiting, and the fs_time-when-no-exif path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
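The four-step waterfall with provenance tagging can be sketched with lazy fallbacks, so that a later step only runs when every earlier one came back empty. Everything below is illustrative: `resolve_waterfall`, the closure-based signature, and the use of a Unix timestamp as `i64` are assumptions, not the module's actual API.

```rust
/// Which waterfall step produced the date (variant names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DateSource {
    Exif,
    Exiftool,
    Filename,
    FsTime,
}

/// Tries each source in priority order and stops at the first hit.
/// `exif` is passed by value because the caller already has that result;
/// the other steps are closures so they are only invoked when needed.
pub fn resolve_waterfall(
    exif: Option<i64>,
    exiftool: impl FnOnce() -> Option<i64>,
    filename: impl FnOnce() -> Option<i64>,
    fs_time: impl FnOnce() -> Option<i64>,
) -> Option<(i64, DateSource)> {
    exif.map(|t| (t, DateSource::Exif))
        .or_else(|| exiftool().map(|t| (t, DateSource::Exiftool)))
        .or_else(|| filename().map(|t| (t, DateSource::Filename)))
        .or_else(|| fs_time().map(|t| (t, DateSource::FsTime)))
}
```

Making the expensive step a closure is what lets an in-process EXIF hit skip the exiftool subprocess entirely.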
|
|
84326501a9 |
image_exif: add date_taken_source column
New nullable TEXT column tracks which step of the canonical-date waterfall (kamadak-exif → exiftool → filename → fs_time) populated `date_taken`. Lets a later per-tick drain re-resolve weak sources (`fs_time`) once stronger ones become available, and gives the UI/debug surface a way to answer "why does this photo show up under this date?". Adds the column at all `InsertImageExif` construction sites with `None` placeholders (the resolver wiring lands in a follow-up commit), and extends the `update_exif` SET tuple so the column survives the GPS-write re-read path. Partial index `idx_image_exif_date_backfill` is created for the upcoming drain query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7ca888e95d |
duplicates: filter low-entropy hashes + dHash double-check, fix backfill loop
The perceptual cluster was producing one giant first group that
contained hundreds of unrelated images. Two causes:
- Solid-colour images (skies, black frames, monochrome scans) all hash
to near-zero pHashes that are at Hamming distance zero from each other.
- Single-link clustering on pHash alone is too permissive: a chain of
weakly-similar images all collapses into one cluster.
Fixed by skipping hashes outside the popcount [8, 56] band (uniform
content) and requiring dHash agreement within threshold before unioning
a candidate edge from the BK-tree. Two new tests pin both invariants.
Separate fix in the backfill bin: decode-failed rows kept phash_64=NULL
and got re-pulled by every batch, infinite-looping on a queue of
undecodable formats. Persist a 0/0 sentinel on decode failure so the row
leaves the candidate set; the all-zero hash is excluded from clustering
by the same entropy filter so it doesn't pollute results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
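The two guards can be expressed as small pure functions over 64-bit hashes. This is a sketch under assumptions: the function names are hypothetical, the popcount band is applied to the pHash only, and the same threshold is reused for the dHash check, none of which the commit message spells out.

```rust
/// True when the hash has between 8 and 56 set bits inclusive, i.e. it is
/// not dominated by uniform content (and is not the all-zero sentinel).
pub fn has_usable_entropy(hash: u64) -> bool {
    (8..=56).contains(&hash.count_ones())
}

/// Number of differing bits between two 64-bit hashes.
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Decides whether a candidate edge may be unioned. Each image is a
/// `(phash, dhash)` pair; both hashes must agree within `threshold`.
pub fn should_union(a: (u64, u64), b: (u64, u64), threshold: u32) -> bool {
    has_usable_entropy(a.0)
        && has_usable_entropy(b.0)
        && hamming(a.0, b.0) <= threshold
        && hamming(a.1, b.1) <= threshold
}
```

Requiring a second, independently computed hash to agree breaks the chains that single-link clustering on pHash alone would otherwise follow.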
|
|
7584cd8792 |
duplicates: perceptual hash + soft-mark resolution + upload 409
Adds pHash + dHash columns alongside the existing blake3 content_hash so
near-duplicates (re-encoded, resized, format-converted copies) become
queryable. /duplicates/{exact,perceptual} return groups; /duplicates/
{resolve,unresolve} flip a duplicate_of_hash soft-mark on losing rows
and union perceptual-only tag sets onto the survivor. The default
/photos listing filters duplicate_of_hash IS NULL so demoted siblings
stop cluttering the grid; include_duplicates=true opts back in for
Apollo's review modal. Upload now hashes bytes pre-write and returns
409 with the canonical sibling when a file's bytes already exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fb4df4b195 |
style: cargo fmt sweep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
814066551e |
multi-library: per-library excluded_dirs
Adds a nullable comma-separated TEXT column to the libraries table.
Effective excludes for a walk = (env-var globals) ∪
(library.excluded_dirs). Empty / NULL = no library-specific
extras; the global env var still applies.
Migration (2026-05-01-110000_libraries_excluded_dirs)
ALTER TABLE libraries ADD COLUMN excluded_dirs TEXT. NULL on every
existing row — no behavior change on upgrade.
Library struct + helpers (libraries.rs)
- Library gains excluded_dirs: Vec<String>, parsed from the column
by parse_excluded_dirs_column (drops empties / whitespace,
matches the env-var parser).
- Library::effective_excluded_dirs(globals) returns the union.
- From<LibraryRow> hydrates the field on AppState construction so
/libraries surfaces it.
Watcher / walkers / memories
Every per-library walker now consults the effective set:
- process_new_files (file-watch ingest, RAW/EXIF/face)
- process_face_backlog (filter_excluded inherits)
- create_thumbnails (startup + new-file branch)
- update_media_counts (Prometheus gauge)
- cleanup_orphaned_playlists (per-library source-existence check)
- memories endpoint (PathExcluder)
Effective set is computed once per per-library iteration in the
watcher tick and threaded through; called functions retain their
flat &[String] signature (no per-library awareness needed inside
the walker primitives).
Use case: mount a parent directory while a sibling library covers
a child subtree, and exclude the child subtree from the parent so
the libraries don't double-walk / double-write image_exif. With
hash-keyed derived data (Branches B/C), the duplication-avoidance
is the only cost prevented — face / tag / insight sharing was
already correct via content_hash.
Tests: 228 pass (226 from previous + 2 new in libraries::tests:
parse_excluded_dirs_column edge cases,
effective_excluded_dirs_unions_global_and_per_library).
CLAUDE.md gains a "Per-library excludes" subsection of the
multi-library data model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3598bb2cfe |
multi-library: operator kill switch via libraries.enabled
A small follow-up to Branches A/B/C. Adds a NOT NULL boolean column
(default 1) to the `libraries` table that controls whether the
watcher considers the library at all. Useful for staging a new
mount before committing to ingest, and as a maintenance kill
switch when a library needs to be quiet without being unmounted.
Migration (2026-05-01-100000_libraries_enabled_flag)
ALTER TABLE libraries ADD COLUMN enabled BOOLEAN NOT NULL DEFAULT 1.
Existing rows stay enabled — no behavior change on upgrade.
Watcher gate (main.rs)
At the top of the per-library loop, if !lib.enabled { continue; }
— runs BEFORE the availability probe. Disabled libraries don't
enter the health map, don't get probed, don't get ingest, don't
get any maintenance pass. The initial sweep before the loop's
first sleep also skips disabled libraries.
Orphan-GC consensus (library_maintenance.rs)
all_libraries_online filters disabled libraries out of the
consensus check — they're treated as out-of-scope, not as
blockers. Otherwise flipping enabled=false would permanently
halt orphan GC for the rest of the system, which is the opposite
of the intended kill-switch semantics.
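The consensus rule can be sketched roughly as follows. The struct and enum here are reduced stand-ins (the real `LibraryHealth::Stale` also carries `since`), and treating an enabled library with no health entry as "not online" is a conservative assumption of this sketch.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum LibraryHealth {
    Online,
    Stale { reason: String },
}

/// Hypothetical, reduced stand-in for the real Library struct.
struct Library {
    id: i32,
    enabled: bool,
}

/// True when every *enabled* library is Online. Disabled libraries are
/// skipped entirely, so even an explicit Stale entry on a disabled
/// library does not block the consensus.
fn all_libraries_online(libs: &[Library], health: &HashMap<i32, LibraryHealth>) -> bool {
    libs.iter()
        .filter(|lib| lib.enabled)
        .all(|lib| matches!(health.get(&lib.id), Some(LibraryHealth::Online)))
}
```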
Cross-library duplicates: safe by construction. Hash-keyed derived
data (face_detections, tagged_photo with hash, photo_insights with
hash) is anchored by ANY image_exif row carrying the hash. Disabling
a library does NOT delete its image_exif rows, so a hash referenced
by a disabled library's row stays anchored — derived data survives.
collect_orphan_hashes deliberately doesn't filter image_exif by
library.enabled for exactly this reason.
No HTTP endpoint. Library mutation is rare-enough infra work that a
SQL toggle is fine, and a public mutation endpoint without a role /
permission story would be poorly-prioritized exposure for a
single-user tool. Documented in CLAUDE.md.
Tests: 226 pass (225 from Branch C + 1 new
all_libraries_online_treats_disabled_as_out_of_scope, which proves
that even an explicit Stale entry on a disabled library doesn't
block the consensus).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
263e27e108 |
multi-library: handoff + orphan GC with two-tick consensus
Branch C of the multi-library data-model rollout. Implements the
operational maintenance pipeline pinned in CLAUDE.md → "Multi-library
data model" / "Library availability and safety". Branches A and B
land first; this branch builds on top.
New module: src/library_maintenance.rs
Three idempotent passes the watcher runs every tick after the
per-library ingest loop:
1. Missing-file scan (per online library)
For each Online library, load a paginated page of image_exif rows
(IMAGE_EXIF_MISSING_SCAN_PAGE_SIZE, default 500), stat() each one,
and delete rows whose source file is NotFound. Permission/IO
errors are skipped, never deleted. Capped at
IMAGE_EXIF_MISSING_DELETE_CAP_PER_TICK (default 200) per library
per tick — so a pathological mount that returns NotFound for
everything can't wipe the table in one cycle. Cursor advances
across ticks, wraps on partial-page returns, and naturally cycles
through the entire library over many minutes. Skipped wholesale
for Stale libraries via the existing probe gate.
2. Back-ref refresh (DB-only)
For face_detections / tagged_photo / photo_insights: any
hash-keyed row whose (library_id, rel_path) no longer matches an
image_exif row, but whose content_hash does, is repointed at a
surviving image_exif location. Pure SQL with EXISTS guards so
rows whose hash is fully orphaned are left alone (the orphan GC
handles those). Idempotent; no availability gate needed.
This is what makes a recent → archive move invisible to readers:
when pass 1 retires the lib-A row, pass 2 pivots tags / faces /
insights to lib-B's surviving path before any client notices.
3. Orphan GC (destructive)
Hash-keyed derived rows whose content_hash has no image_exif
referent are GC-eligible. Two-tick consensus: a hash must be
observed orphaned on two consecutive ticks AND every library must
be Online for both. A single Stale tick within the window cancels
all pending deletes (they remain marked but won't be promoted) —
they're re-evaluated next tick. The pending set lives in
OrphanGcState (in-memory); a watcher restart resets it, which can
only delay a delete, never cause one. Hashes that re-appear in
image_exif between ticks are "revived" from the pending set
(handles transient share unmount / remount).
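The two-tick rule in pass 3 can be illustrated with a small in-memory state machine. This is a sketch under stated assumptions: the method name `tick` and its signature are hypothetical, and a Stale tick here simply clears the pending set, which is one conservative reading of the rule (the real module may keep entries marked and only withhold promotion).

```rust
use std::collections::HashSet;

/// In-memory pending set. Losing it on restart can only delay a delete.
#[derive(Default)]
struct OrphanGcState {
    pending: HashSet<String>,
}

impl OrphanGcState {
    /// Feed one tick's observation and return the hashes that are now
    /// safe to delete (orphaned on this tick and on the previous one,
    /// with every library Online on both).
    fn tick(&mut self, orphaned_now: &HashSet<String>, all_online: bool) -> Vec<String> {
        if !all_online {
            // A Stale tick breaks the consensus window: nothing is promoted.
            self.pending.clear();
            return Vec::new();
        }
        // Hashes already pending are confirmed; the rest start their window.
        let (mut confirmed, fresh): (Vec<String>, Vec<String>) = orphaned_now
            .iter()
            .cloned()
            .partition(|h| self.pending.contains(h));
        confirmed.sort();
        // Hashes that re-appeared in image_exif are absent from
        // `orphaned_now`, so they drop out of the pending set here.
        self.pending = fresh.into_iter().collect();
        confirmed
    }
}
```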
Two new ExifDao methods:
- list_rel_paths_for_library_page(library_id, limit, offset) for
the paginated missing-file scan.
- (count_for_library landed in Branch A.)
Watcher wiring (main.rs)
Per-library: missing-file scan inside the existing per-library
loop, after process_new_files, gated by the same probe check that
already protects ingest. After the loop: reconcile (Branch B),
back-ref refresh, then run_orphan_gc. The maintenance connection is
opened once per tick (image_api::database::connect), used by all
three DB-only passes, and dropped at end of tick.
CLAUDE.md gains a "Maintenance pipeline" subsection that describes
the three passes and their interaction with the existing
availability-and-safety policy.
Tests: 225 pass (217 from Branch B + 8 new in library_maintenance
covering back-ref refresh including the fully-orphaned no-op case,
two-tick GC consensus, Stale-tick consensus reset, image_exif
re-appearance revival, multi-table delete, and the
all_libraries_online helper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
48cac8c285 |
multi-library: hash-keyed tagged_photo + photo_insights with reconciliation
Branch B of the multi-library data-model rollout. tagged_photo and
photo_insights now follow the bytes (content_hash), not the path,
matching the policy pinned in CLAUDE.md "Multi-library data model".
Branch A's availability probe and EXIF scoping land first; this
branch builds on top.
Migration (2026-05-01-000000_hash_keyed_derived_data)
Adds nullable content_hash columns to tagged_photo and photo_insights,
with partial indexes on the non-null subset to keep the index small
during the transitional window. The migration backfills from
image_exif:
* tagged_photo joins on rel_path alone (no library_id available);
* photo_insights joins on (library_id, rel_path), unambiguous.
Rows whose image_exif hash isn't known yet stay null and the runtime
reconciliation pass populates them as the hash backlog drains.
Insert-time population
TagDao::tag_file looks up image_exif.content_hash by rel_path before
inserting; the hash is written into the new column.
InsightDao::store_insight does the same scoped to (library_id,
rel_path). Caller-supplied hash on InsertPhotoInsight wins; otherwise
the DAO does the lookup. Both paths fall back to None if the hash
isn't known yet — reconciliation backfills.
Reconciliation (database/reconcile.rs)
Three idempotent passes the watcher runs once per tick after the
per-library backfill loop:
1. tagged_photo NULL hashes → populate from image_exif by rel_path.
2. photo_insights NULL hashes → populate by (library_id, rel_path).
3. photo_insights scalar merge — when multiple is_current rows
share a content_hash, keep the earliest generated_at as
current; demote the rest. Demoted rows keep their data so
/insights/history is unaffected; only the "current" pointer
narrows to one per hash.
No filesystem dependency, so reconcile doesn't need the availability
gate; runs every tick. Logs once when something changed, debug
otherwise.
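Pass 3 (the scalar merge) can be sketched in memory like this. The real pass is SQL in database/reconcile.rs; the `Insight` struct, the function name, and the tie-break on `id` when two rows share a `generated_at` are assumptions made for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Insight {
    id: i32,
    content_hash: Option<String>,
    generated_at: i64,
    is_current: bool,
}

/// For each content_hash with several current rows, keep the earliest
/// generated_at as current and demote the rest. Returns how many rows
/// were demoted; a second call on the same data returns 0.
fn collapse_current_insights(rows: &mut [Insight]) -> usize {
    // Earliest (generated_at, id) per hash among current rows.
    let mut winner: HashMap<String, (i64, i32)> = HashMap::new();
    for r in rows.iter().filter(|r| r.is_current) {
        if let Some(h) = &r.content_hash {
            let key = (r.generated_at, r.id);
            winner
                .entry(h.clone())
                .and_modify(|w| if key < *w { *w = key })
                .or_insert(key);
        }
    }
    let mut demoted = 0;
    for r in rows.iter_mut().filter(|r| r.is_current) {
        if let Some(h) = &r.content_hash {
            if winner[h] != (r.generated_at, r.id) {
                // Demoted rows keep their data; only the flag changes.
                r.is_current = false;
                demoted += 1;
            }
        }
    }
    demoted
}
```

Rows with a NULL hash are left untouched, matching the "hash-not-yet-known no-op" case listed in the tests.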
Tags are set-valued under the policy (union on read, already
DISTINCT in queries), so there is no analogous tag-collapse pass —
duplicate (tag_id, content_hash) rows across libraries are
harmless.
Read paths are unchanged in this branch — lookup_tags_batch's
existing rel_path-via-hash-sibling expansion still produces the
correct merge. A follow-up can simplify reads to use the new column
directly for performance.
Tests: 217 pass (212 pre-existing + 5 new in reconcile covering
NULL-fill, hash-not-yet-known no-op, library scoping on insights,
earliest-wins collapse, idempotency).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
48ed7be5d9 |
libraries: initial availability sweep before watcher's first sleep
new_health_map seeds every library as Online, and the watcher's tick
loop sleeps WATCH_QUICK_INTERVAL_SECONDS (default 60s) before its first
probe, so /libraries reported the optimistic default for up to a minute
after boot, even when a share was clearly unmounted.
Run the same refresh_health pass once at the top of the watcher thread
before entering the sleep loop. /libraries is then truthful within
milliseconds of the watcher thread starting (effectively from the first
HTTP request, since the watcher spawns well before the server binds).
The per-tick gate inside the loop is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
eea1bf3181 |
multi-library: availability probe + scoped EXIF queries + collision fixes
Branch A of the multi-library data-model rollout. Three threads of
correctness/safety work that ship together because the new mount
needs all three before it can land:
1. Library availability probe (libraries.rs, state.rs, main.rs)
New LibraryHealth (Online | Stale { reason, since }) and a shared
LibraryHealthMap on AppState. Probe checks root_path exists +
is_dir + readable + non-empty (relative to a "had_data" signal so
fresh mounts aren't downgraded). The watcher tick begins with a
refresh_health() per library; stale libraries skip ingest, the
hash backfill, and face-detection backlog drains for that tick.
The orphaned-playlist cleanup also gates on every library being
online — a missing source on a stale library is indistinguishable
from a transient unmount, and the cleanup is destructive.
/libraries now returns each library with its current health
state. Logs only on Online↔Stale transitions so a long outage
doesn't spam.
New ExifDao::count_for_library is the "had_data" signal.
2. EXIF queries scoped by library_id (database/mod.rs, files.rs,
main.rs, tags.rs)
query_by_exif gains an Option<i32> library filter; /photos and
/photos/exif now pass it. Without this, an EXIF-filtered request
scoped to ?library=N returned cross-library results because the
handler resolved the library but didn't push it through to SQL.
get_exif_batch gains the same option. The watcher's per-library
ingest, face-candidate build, and content-hash backfill all
scope to their library; the union-mode /photos date-sort path
and the library-agnostic tag fan-out (lookup_tags_batch, by
design) keep using None.
3. Derivative-path collision fixes (content_hash.rs, main.rs)
New content_hash::library_scoped_legacy_path helper:
<derivative_dir>/<library_id>/<rel_path>. Thumbnail generation
(startup walk + watcher needs-thumb check) and serving now use
it; serving falls back to the bare-legacy mirrored path so
pre-multi-library deployments keep working without
regeneration. Without this, lib2 with the same rel_path as lib1
would have its thumbnail request short-circuit to lib1's image.
Orphaned-playlist cleanup walks every library when checking for
the source video (was: BASE_PATH only). Without this, mounting
a 2nd library and waiting 24h would delete every playlist whose
source lived only in the 2nd library.
The HLS playlist write path collision (filename-only basename,
not rel_path) is left as a known issue with a TODO at the call
site — the actor-pipeline rewrite belongs in Branch B/C.
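The path layout from item 3 can be sketched as below. The function body and the `thumbnail_candidates` helper are illustrative assumptions; only the `<derivative_dir>/<library_id>/<rel_path>` layout and the bare-legacy fallback come from the commit message.

```rust
use std::path::{Path, PathBuf};

/// Library-scoped derivative path: <derivative_dir>/<library_id>/<rel_path>.
/// Assumes `rel_path` is relative; an absolute path would replace the base.
fn library_scoped_legacy_path(derivative_dir: &Path, library_id: i32, rel_path: &str) -> PathBuf {
    derivative_dir.join(library_id.to_string()).join(rel_path)
}

/// Hypothetical helper: the paths a serving handler might try, in order.
/// The scoped path comes first; the bare mirrored path is the fallback
/// for thumbnails generated before multi-library support.
fn thumbnail_candidates(derivative_dir: &Path, library_id: i32, rel_path: &str) -> [PathBuf; 2] {
    [
        library_scoped_legacy_path(derivative_dir, library_id, rel_path),
        derivative_dir.join(rel_path),
    ]
}
```

With this layout, two libraries that share a rel_path resolve to different files under the derivative directory.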
Tests: 212 pass (cargo test --lib). New tests cover the probe
states (online / missing root / non-dir / empty-with-prior-data),
refresh_health transitions, query_by_exif scoping, get_exif_batch
keying on (library_id, rel_path), library_scoped_legacy_path, and
count_for_library.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f50655fb21 |
indexer: apply EXCLUDED_DIRS to remaining WalkDir callers
Audit follow-up to
|
||
|
|
5bf49568f1 |
indexer: prune EXCLUDED_DIRS at WalkDir time, extract enumerate_indexable_files
Synology drops `@eaDir/.../SYNOFILE_THUMB_*.jpg` files alongside every
photo. The face-detect pipeline already filters those out via
`face_watch::filter_excluded`, but the filter runs *after* the indexer
has already inserted rows into `image_exif`. Result: phantom rows whose
content_hash never matches a `face_detections` row, so the anti-join in
`list_unscanned_candidates` returns them every tick. They're filtered
out at runtime, no marker is written, and the cycle repeats forever —
log spam, wrong stats denominator, and on a real Synology library the
phantom rows balloon into the hundreds of thousands.
Move the exclusion to the WalkDir pass, where filter_entry can prune
whole subtrees instead of walking and discarding leaves. Extract the
pre-existing 30-line walker chain in main.rs::process_new_files into
`file_scan::enumerate_indexable_files` so it's testable in isolation.
Six tests cover the bug (eadir prune), nested patterns, absolute-under-base
syntax, non-media filtering, modified_since semantics, and forward-slash
rel_path normalization.
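The exclusion predicate behind the prune might look roughly like this std-only sketch. The function name `is_excluded` and the exact matching rules are assumptions: a bare pattern matches a whole directory component, and a pattern with a leading `/` matches a subtree relative to the library base.

```rust
use std::ffi::OsStr;
use std::path::{Component, Path};

/// Returns true when `rel_path` (relative to the library base) falls
/// under one of the exclusion patterns.
fn is_excluded(rel_path: &Path, patterns: &[String]) -> bool {
    patterns.iter().any(|pat| {
        if let Some(subtree) = pat.strip_prefix('/') {
            // Absolute-under-base: component-wise prefix match.
            rel_path.starts_with(subtree)
        } else {
            // Bare name: must equal a directory component exactly, so
            // "eaDir-not-a-thing.jpg" is not caught by "@eaDir".
            rel_path.parent().map_or(false, |dir| {
                dir.components()
                    .any(|c| matches!(c, Component::Normal(n) if n == OsStr::new(pat)))
            })
        }
    })
}
```

A predicate of this shape could be handed to the walkdir crate's `filter_entry`, which skips descending into any directory the closure rejects; that wiring is not shown here.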
Out of scope (other WalkDir callers in main.rs that don't yet apply
EXCLUDED_DIRS — thumbnail gen at 1309, media scan at 1377, video
playlist scan at 1685, and two nested walks at 1709 / 1743): separate
audit PR.
Operator note: existing phantom rows still need a one-shot cleanup —
DELETE FROM face_detections WHERE content_hash IN (
SELECT content_hash FROM image_exif WHERE rel_path LIKE '%/@eaDir/%'
);
DELETE FROM image_exif WHERE rel_path LIKE '%/@eaDir/%' OR rel_path LIKE '@eaDir/%';
Run before attaching a fresh Synology-sourced library.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1971eeccd6 |
faces: drain backfill + detection backlog every tick, not just full scans
Symptom: ImageApi restart, then ~60 minutes of silence — no
face_watch lines at all. Cause: backfill + face-detection candidate
build were both gated inside process_new_files, which during quick
scans (every 60s) only walks files modified in the last interval.
The pre-existing unhashed / unscanned backlog never entered the
candidate set, so it only drained on the full-scan path (default
once per hour). Surfaced as "scan stuck at 1101/13118" — most of
those rows were waiting on the next full scan.
Two new per-tick passes that work directly off the DB:
(1) backfill_unhashed_backlog uses ExifDao::get_rows_missing_hash to
pull unhashed rows in id order, capped (FACE_HASH_BACKFILL_MAX_PER_TICK
default 2000), and writes content_hash for each. No filesystem
walk — the walk was the gating filter that hid the backlog.
(2) process_face_backlog uses a new FaceDao::list_unscanned_candidates
(LEFT-anti-join on content_hash via raw SQL, GROUP BY hash so
duplicates fire one detect call) to pull a capped batch of
hashed-but-unscanned rows (FACE_BACKLOG_MAX_PER_TICK default 64)
and runs the existing face_watch detection pipeline on them.
Both run only when face_client.is_enabled(). The cap on (2) is small
because each candidate is a real Apollo round-trip — 64/tick at 60s
quick interval ≈ 64 detections/min, which paces an 8-core CPU
inference comfortably while keeping a steady flow visible in logs.
process_new_files's own backfill stays in place for the same-tick
flow (a brand-new upload gets hashed AND face-scanned in the tick
where it's discovered) but is now belt-and-suspenders.
Test backstop pinning the new DAO method's filter contract: only
hashed, unscanned, in-library rows are returned; scanned rows,
unhashed rows, and other-library rows are filtered out.
|
||
|
|
16abacf4c5 |
faces: backfill no longer stalls on chronic-error files at the front
The content-hash backfill capped at 500/tick AND counted errors
against that cap. So a pocket of files that errored every time
(vanished mid-scan, permission denied, unreadable) at the head of the
exif_records iteration order burned the entire budget every tick and
the rest of the backlog never advanced — surfacing as a face-scan
stuck at e.g. 44% with no progress. Without a content_hash, those
photos never become face-detection candidates, so it looks like
detection is broken when really it's the prerequisite hash that
isn't filling.
Two fixes:
- Cap on successes only. Errors still get counted and logged but
don't burn the per-tick budget; the loop keeps moving past them
to the working files behind. Errors are bounded by the unhashed
backlog size (each record walked at most once per tick), so this
can't run away.
- Always log the unhashed backlog count when non-zero. Previously
"stuck at 44%" looked silent from the outside; now every tick
surfaces "backfilled N/M; K still need backfill" so an operator
can tell backfill is making progress (or isn't).
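The "cap on successes only" loop can be sketched generically. The names `backfill_tick` and `BackfillStats` are hypothetical, and the hashing step is abstracted as a closure so the example stays self-contained.

```rust
/// Summary of one backfill pass (hypothetical shape for illustration).
#[derive(Debug, PartialEq)]
struct BackfillStats {
    hashed: usize,
    errors: usize,
    remaining: usize,
}

/// Walk the backlog once, stopping after `max_per_tick` successes.
/// Errors are counted but do not consume the budget, so a pocket of
/// failing files at the front cannot starve the files behind it.
fn backfill_tick<T, E>(
    backlog: &[T],
    max_per_tick: usize,
    mut hash_one: impl FnMut(&T) -> Result<(), E>,
) -> BackfillStats {
    let mut hashed = 0;
    let mut errors = 0;
    for item in backlog {
        if hashed >= max_per_tick {
            break;
        }
        match hash_one(item) {
            Ok(()) => hashed += 1,
            Err(_) => errors += 1,
        }
    }
    BackfillStats { hashed, errors, remaining: backlog.len() - hashed }
}
```

Each record is visited at most once per call, so the error count is bounded by the backlog size.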
Also bumps the default cap from 500 to 2000. Hashing is cheap (blake3
+ one DB UPDATE), and 500 was conservative for a personal-scale
library where 10k+ unhashed files is a normal first-run state.
|
||
|
|
a24fac5511 |
faces: backfill missing content_hash from the file watcher
Photos indexed before content-hashing landed (or where the hash compute
failed silently on insert) end up in image_exif with NULL content_hash.
build_face_candidates keys on content_hash, so those rows would never
become face candidates without backfill. Symptom: face detection logs
nothing despite photos being in the library and the watcher running.
The dedicated `backfill_hashes` binary already handles this; this
commit lets the watcher self-heal during full scans so the deploy 'just
works' for face recognition without operator action. Idempotent:
subsequent scans see populated hashes and no-op.
Bounded per tick by FACE_HASH_BACKFILL_MAX_PER_TICK (default 500) so a
watcher tick on a 50k-photo legacy library doesn't blake3 every file in
one shot. For very large backlogs the dedicated binary is still faster
(no DAO mutex contention with the watcher loop).
Only runs when face_client.is_enabled(), so legacy deploys without
APOLLO_FACE_API_BASE_URL keep the same behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
23f4941471 |
faces: surface enabled/disabled state + per-tick candidate count
Manual deploy debugging: 'Saved thumbnail' logs were visible (boot-time
thumbnail backfill) but no face_watch logs were appearing, with no
obvious way to tell whether the integration was disabled, hadn't reached
a full scan yet, or had simply seen no new files.
Two log lines:
- watch_files startup: 'Face detection: ENABLED' / 'DISABLED (set
APOLLO_FACE_API_BASE_URL or APOLLO_API_BASE_URL to enable)' so
you can tell at a glance whether the env wired through.
- process_new_files (debug-level): 'face_watch: scan tick — N image
file(s) walked, M candidate(s) (library 'main', modified_since=...)'
so an empty-candidate scan is distinguishable from a misconfigured
or skipped one without bumping log level for the rest of the
watcher.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1859399759 |
faces: phase 4 — people-tag bootstrap + auto-bind on detection
Wires the existing string people-tags into the new persons table and
auto-binds new detections to a same-named person when the photo carries
exactly one matching tag. ImageApi has no notion of which tags are
people-tags today (purely a user mental model), so this is operator-
confirmed: the suggester surfaces candidates with a heuristic flag, the
operator confirms, then bootstrap creates persons rows. Auto-bind
follows on every detection thereafter.
New endpoints:
GET /tags/people-bootstrap-candidates
Per case-insensitive name group: display name (most-frequent
capitalization), normalized lowercase, summed usage_count,
looks_like_person heuristic flag, already_exists check against
the persons table. Sorted persons-likely-first then by count.
POST /persons/bootstrap
Body: {names: [string]}. Idempotent — pre-fetches the existing-
name set so a duplicate request reports per-row "already exists"
instead of 409-ing each insert. Created rows get
created_from_tag=true; failed rows surface in `skipped` with a
reason.
looks_like_person heuristic — conservative on purpose because the
operator confirms in the UI:
- 1–2 whitespace-separated words
- Each word starts uppercase, no digits anywhere
- Single-word names not on a small denylist (cat, christmas, beach,
sunset, untagged, ...). Two-word names skip the denylist so
"Sarah Smith" is never false-rejected.
FaceDao additions:
- find_persons_by_names_ci — bulk lowercase-name → person_id lookup
via sql_query (Diesel's BoxedSelectStatement + LOWER() doesn't
play well with the type system).
- person_reference_embedding — L2-normalized mean of a person's
detected embeddings, *filtered by model_version* so a future
buffalo_xl row can never contaminate an in-flight buffalo_l auto-
bind decision. Returns None when the person has no faces yet.
- assign_face_to_person — sets face_detections.person_id and, only
when persons.cover_face_id is NULL, claims this face as cover. The
UI's hand-picked cover survives later auto-binds.
- decode_embedding_bytes / cosine_similarity helpers — pub(crate)
so face_watch can decode the wire bytes once and feed them through
the cosine threshold.
Auto-bind in face_watch::process_one:
After every successful detect, for each newly-stored auto face we
pull the photo's tags, look up which (if any) map to existing
persons, and:
- skip when zero or multiple distinct persons are matched
(multi-match is genuinely ambiguous; cluster suggester handles it)
- on first face for a person: bind unconditionally so bootstrap can
ever produce a usable reference
- thereafter: bind iff cosine(new_emb, person_ref) >=
FACE_AUTOBIND_MIN_COS (default 0.4, env-tunable to 0..=1)
The reference embedding comes from person_reference_embedding under
the same model_version as the candidate, so a model upgrade never
silently re-anchors a person's centroid.
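The auto-bind decision can be sketched with plain slices. These bodies are illustrative, not the code in faces.rs: returning 0.0 for mismatched, empty, or zero-norm inputs is an assumption, `reference_embedding` stands in for person_reference_embedding without the model_version filter, and `should_autobind` is a hypothetical name.

```rust
/// Cosine similarity; returns 0.0 for empty, mismatched, or zero-norm
/// inputs (an assumption of this sketch).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na * nb)
}

/// L2-normalized mean of a person's embeddings; None when there are no
/// faces yet. Dividing by the count is skipped because normalization
/// removes any constant scale.
fn reference_embedding(faces: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = faces.first()?;
    let mut sum = vec![0.0f32; first.len()];
    for face in faces {
        if face.len() != sum.len() {
            return None;
        }
        for (s, x) in sum.iter_mut().zip(face) {
            *s += x;
        }
    }
    let norm = sum.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return None;
    }
    Some(sum.into_iter().map(|x| x / norm).collect())
}

/// First face for a person binds unconditionally; afterwards the new
/// embedding must clear the cosine threshold (FACE_AUTOBIND_MIN_COS).
fn should_autobind(new_emb: &[f32], person_ref: Option<&[f32]>, min_cos: f32) -> bool {
    match person_ref {
        None => true,
        Some(reference) => cosine_similarity(new_emb, reference) >= min_cos,
    }
}
```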
Plumbing: watch_files now constructs its own SqliteTagDao alongside the
other watcher DAOs and threads it through process_new_files →
run_face_detection_pass → process_one. The handler-side TagDao
registration in main.rs already covers bootstrap_candidates_handler;
no extra app_data wiring needed.
Tests: 8 new (faces.rs):
- looks_like_person accepts/rejects/two-word-skips-denylist (3)
- cosine_similarity on identical / orthogonal / opposite / mismatch /
zero / empty inputs
- decode_embedding_bytes round-trip + size validation
- find_persons_by_names_ci groups case + handles empty input
- person_reference_embedding filters by model_version (buffalo_l ref
must not include buffalo_xl rows)
- assign_face_to_person sets cover when unset, doesn't overwrite
cargo test --lib: 179 / 0; fmt + clippy clean for new code.
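For reference, the looks_like_person heuristic described in this commit can be approximated as follows. This is a sketch: the denylist holds only the five words the message names (the real list is longer), and the digit and uppercase checks use `char::is_numeric` and `char::is_uppercase`, which may differ from the real implementation.

```rust
/// Only the entries named in the commit message; the real list is longer.
const DENYLIST: &[&str] = &["cat", "christmas", "beach", "sunset", "untagged"];

/// Conservative "does this tag look like a person's name" check.
fn looks_like_person(name: &str) -> bool {
    let words: Vec<&str> = name.split_whitespace().collect();
    // 1-2 whitespace-separated words.
    if words.is_empty() || words.len() > 2 {
        return false;
    }
    // No digits anywhere.
    if name.chars().any(char::is_numeric) {
        return false;
    }
    // Each word starts with an uppercase letter.
    if !words
        .iter()
        .all(|w| w.chars().next().map_or(false, char::is_uppercase))
    {
        return false;
    }
    // The denylist applies to single-word names only.
    if words.len() == 1 && DENYLIST.contains(&words[0].to_lowercase().as_str()) {
        return false;
    }
    true
}
```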
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
4dee7b6f73 |
faces: phase 3 — file-watch hook drives auto detection
Wire face detection into ImageApi's existing scan loop so new uploads
pick up faces automatically and the initial backlog grinds through on
full-scan ticks. No new job system; Phase 2's already_scanned check
makes the work implicitly idempotent (one face_detections row per
content_hash, including no_faces / failed marker rows).
face_watch.rs (new):
- run_face_detection_pass(library, excluded_dirs, face_client,
face_dao, candidates) — sync entry point. Builds a per-pass tokio
runtime and fans out detect calls bounded by FACE_DETECT_CONCURRENCY
(default 8). The watcher thread itself stays sync.
- filter_excluded — applies the same PathExcluder /memories uses, so
@eaDir / .thumbnails / EXCLUDED_DIRS-listed paths skip detection
before we burn a detect call (and Apollo's GPU memory) on junk.
- read_image_bytes_for_detect — RAW/HEIC route through
extract_embedded_jpeg_preview because opencv-python-headless can't
decode either; everything else gets a plain std::fs::read so EXIF
orientation reaches Apollo's exif_transpose intact.
- process_one — translates Apollo's response into the Phase 2 marker
contract: faces[] empty → no_faces; FaceDetectError::Permanent →
failed (don't retry); Transient → no marker (next scan retries);
success with N faces → N detected rows with the embeddings unpacked.
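The marker contract in process_one boils down to a small mapping, sketched here. `MarkerPlan` and `plan_marker` are hypothetical names, the detect result is reduced to a face count, and mapping `Disabled` to "no marker" is an assumption (the commit only says callers no-op when the client is disabled).

```rust
#[derive(Debug, PartialEq)]
enum FaceDetectError {
    Permanent(String),
    Transient(String),
    Disabled,
}

/// What to write into face_detections for one content_hash.
#[derive(Debug, PartialEq)]
enum MarkerPlan {
    /// N rows with status 'detected'.
    Detected { faces: usize },
    /// One 'no_faces' marker row.
    NoFaces,
    /// One 'failed' marker row; the file is not retried.
    Failed,
    /// Write nothing, so the next scan retries.
    NoMarker,
}

fn plan_marker(result: Result<usize, FaceDetectError>) -> MarkerPlan {
    match result {
        Ok(0) => MarkerPlan::NoFaces,
        Ok(n) => MarkerPlan::Detected { faces: n },
        Err(FaceDetectError::Permanent(_)) => MarkerPlan::Failed,
        Err(FaceDetectError::Transient(_)) | Err(FaceDetectError::Disabled) => MarkerPlan::NoMarker,
    }
}
```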
main.rs (process_new_files + watch_files):
- watch_files now also takes face_client + excluded_dirs; the watcher
thread builds a SqliteFaceDao the same way it builds ExifDao /
PreviewDao.
- After the EXIF write loop, build_face_candidates queries image_exif
for the just-walked image paths' content_hashes (covers new uploads
and pre-existing backlog), filters out anything already_scanned, and
hands the rest to face_watch::run_face_detection_pass.
- Bypassed wholesale when face_client.is_enabled() is false — keeps
the watcher usable on legacy deploys where Apollo isn't configured.
Tests: 5 face_watch unit tests cover the parts that don't need a real
Apollo:
- filter_excluded drops dir-component patterns (@eaDir) without
matching substring file names (eaDir-not-a-thing.jpg keeps).
- filter_excluded drops absolute-under-base subtrees (/private).
- empty EXCLUDED_DIRS short-circuits cleanly.
- read_image_bytes_for_detect passes JPEG bytes through verbatim
(orientation must reach Apollo unmodified).
- read_image_bytes_for_detect falls through to plain read when a
RAW-extension file has no embedded preview, so Apollo gets a chance
to 422 and we mark failed rather than infinitely-retrying.
cargo test --lib: 170 / 0; fmt and clippy clean for new code.
End-to-end (drop a photo → face_detections row appears) needs Apollo
running and is deferred to deploy-time verification.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
860169032b |
faces: phase 2 — schema + manual face/person CRUD
Land the persistence model and HTTP surface for local face recognition.
Inference still lives in Apollo (Phase 1); this side adds the data home
plus every endpoint Apollo's UI and FileViewer-React will consume.
Schema (new migration 2026-04-29-000000_add_faces):
- persons: visual identities. Optional entity_id bridges to the
existing knowledge-graph entities table; auto-bridging is left to
the management UI (we don't muddy LLM provenance from face rows).
UNIQUE(name COLLATE NOCASE) so 'alice' / 'Alice' fold to one row.
- face_detections: keyed on content_hash (cross-library dedup), with
status='detected' carrying bbox + 512-d embedding BLOB, and
'no_faces' / 'failed' marker rows that tell Phase 3's file watcher
not to re-scan. Marker invariant enforced via CHECK; partial UNIQUE
on content_hash WHERE status='no_faces' guards against double-marks.
Schema regenerated with `diesel print-schema` against a clean migration
run; joinables added for face_detections → libraries / persons and
persons → entities.
face_client.rs (sibling of apollo_client.rs):
- reqwest multipart, 60 s timeout (CPU inference on a backlog can be
slow; bounded threadpool on Apollo serializes calls anyway).
- FaceDetectError::{Permanent, Transient, Disabled} — Phase 3 keys
its marker-row decision on this. 422 → mark failed, 5xx → defer.
- APOLLO_FACE_API_BASE_URL falls back to APOLLO_API_BASE_URL when
unset; both unset = is_enabled() false, callers no-op.
faces.rs (DAO + handlers):
- SqliteFaceDao implements the full FaceDao trait; person face counts
go through sql_query because diesel's BoxedSelectStatement +
group_by trips trait-resolver recursion.
- merge_persons re-points face rows in a transaction, copies notes
when target's are empty, deletes src.
- manual POST /image/faces resolves content_hash through image_exif,
crops the user-drawn bbox with 10% padding (detector wants context
around ears/jaw), POSTs the crop to face_client.embed for a real
ArcFace vector, then inserts source='manual'.
- Cluster-suggest (Phase 6) gets its data from
GET /faces/embeddings — base64-encoded paged BLOBs so Apollo's
DBSCAN can stream them without ImageApi pre-aggregating.
Endpoints registered alongside add_*_services in main.rs:
GET /faces/stats?library=
GET /faces/embeddings?library=&unassigned=&limit=&offset=
GET /image/faces?path=&library=
POST /image/faces (manual create via embed)
PATCH /image/faces/{id}
DELETE /image/faces/{id}
GET /persons?library=
POST /persons
GET /persons/{id}
PATCH /persons/{id}
DELETE /persons/{id}?cascade=set_null|delete (set_null default)
POST /persons/{id}/merge
GET /persons/{id}/faces?library=
The file-watch hook (Phase 3) and the rerun-on-one-photo handler
(Phase 6) live behind the FaceDao methods marked dead_code today —
they're called only when those phases land. Same shape for the trait
methods that aren't reached by Phase 2 routes.
Tests: 3 DAO unit tests cover person CRUD + case-insensitive uniqueness,
marker-row idempotency (mark_status is a no-op when any row exists),
and merge re-pointing faces.
Cargo.toml: reqwest gains the `multipart` feature.
cargo build / cargo test --lib / cargo fmt / cargo clippy --all-targets
all clean for the new code; the two pre-existing test_path_excluder
failures and the pre-existing sort_by clippy warnings are unrelated and
present on master.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
57fb0bcd3c |
EXIF GPS write: POST /image/exif/gps via exiftool
New endpoint accepts {path, library, latitude, longitude} and shells
out to exiftool to write GPSLatitude/GPSLongitude (with N/S, E/W refs)
into the file's EXIF in place. After the write, the handler
re-extracts EXIF and updates the image_exif row so the DB stays in
sync — the response carries the updated metadata block in one
round-trip. Falls through to store_exif if the row is missing.
`exif::write_gps` is the small helper. `-overwrite_original` so no
.orig sidecar is left behind. Validates lat/lon range + supports_exif
before spawning exiftool. Format support matches the existing read
path (JPEG / TIFF / RAW / HEIF / PNG / WebP) — videos still need a
different writer and aren't covered.
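A sketch of the validation and argument-building step, assuming signed decimal degrees as input. The function name `gps_exiftool_args` is hypothetical and the exact arguments `exif::write_gps` passes are not shown in this commit; the tag names and `-overwrite_original` are standard exiftool options.

```rust
/// Validate the coordinates and build an exiftool argument list.
/// Returns Err with a short reason when a value is out of range.
fn gps_exiftool_args(lat: f64, lon: f64, file: &str) -> Result<Vec<String>, String> {
    if !lat.is_finite() || !(-90.0..=90.0).contains(&lat) {
        return Err(format!("latitude out of range: {lat}"));
    }
    if !lon.is_finite() || !(-180.0..=180.0).contains(&lon) {
        return Err(format!("longitude out of range: {lon}"));
    }
    // EXIF stores magnitude and hemisphere separately.
    let lat_ref = if lat >= 0.0 { "N" } else { "S" };
    let lon_ref = if lon >= 0.0 { "E" } else { "W" };
    Ok(vec![
        format!("-GPSLatitude={}", lat.abs()),
        format!("-GPSLatitudeRef={lat_ref}"),
        format!("-GPSLongitude={}", lon.abs()),
        format!("-GPSLongitudeRef={lon_ref}"),
        // Rewrite in place instead of leaving a backup copy.
        "-overwrite_original".to_string(),
        file.to_string(),
    ])
}
```

The resulting vector could be passed to `std::process::Command::new("exiftool").args(...)`; spawning the process is left out so the sketch has no external dependency.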
Apollo's "+ PIN" carousel button (separate commit on the Apollo side)
calls this through /api/photos/exif/gps. Drive-by: cargo fmt one-line
collapse on apollo_client.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
00b3c80141 |
RAW: try IFD0 + IFD1 for embedded preview, serve at full size
The thumbnail pipeline's embedded-JPEG extractor only checked IFD1
(THUMBNAIL), which on many Nikon NEFs is missing or zero-length even
when IFD0 (PRIMARY) carries a perfectly good 1-2 MP reduced-resolution
preview the camera writes for in-body review. The previous behavior
produced black thumbs on disk: the buggy IFD1 pointer resolved to a
short byte sequence that happened to satisfy the SOI sanity check,
image::load_from_memory accepted it, and the resize path quietly wrote
a black JPEG.
Now both IFDs are checked and the larger valid JPEG wins.
Format-agnostic: applies to every TIFF-based RAW (NEF / ARW / CR2 / DNG
/ RAF / ORF / RW2 / PEF / SRW / TIFF). is_tiff_raw is now pub so
main.rs can gate its full-size handler on it.
Also extends the /image handler so size=full requests for RAW formats
serve the embedded preview as image/jpeg instead of NamedFile-streaming
the original RAW bytes; browsers can't decode a .nef container, so
<img src=...> would otherwise land as a broken image. Falls through to
NamedFile if no preview is present, preserving the historical behavior
for callers that genuinely want the original bytes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
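The "larger valid JPEG wins" selection from this commit could be sketched as below. What the real extractor counts as "valid" is not stated; this sketch assumes an SOI marker plus a hypothetical minimum length, and both `looks_like_jpeg` and `pick_preview` are illustrative names.

```rust
/// Hypothetical floor that rejects tiny byte runs which merely start
/// with an SOI marker.
const MIN_PREVIEW_BYTES: usize = 512;

/// Cheap plausibility check: JPEG streams start with FF D8 FF.
fn looks_like_jpeg(bytes: &[u8]) -> bool {
    bytes.len() >= MIN_PREVIEW_BYTES && bytes.starts_with(&[0xFF, 0xD8, 0xFF])
}

/// Choose between the IFD0 and IFD1 candidates: drop implausible ones,
/// then prefer the larger.
fn pick_preview<'a>(ifd0: Option<&'a [u8]>, ifd1: Option<&'a [u8]>) -> Option<&'a [u8]> {
    ifd0.into_iter()
        .chain(ifd1)
        .filter(|bytes| looks_like_jpeg(bytes))
        .max_by_key(|bytes| bytes.len())
}
```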
||
|
|
7621282419 |
Thumb orientation + library filter on /photos/exif
Two follow-ups on the same feature branch:
1. Bake EXIF orientation into generated thumbnails. The `image` crate
doesn't apply Orientation on load, and `save_with_format(..Jpeg)`
drops EXIF, so portrait phone shots ended up sideways in any client
that displays the cached thumb directly (no EXIF tag for the browser
to compensate from). New `exif::read_orientation` reads the tag
cheaply (no full EXIF parse) and `exif::apply_orientation` does the
rotate/flip via image's existing `rotate90/180/270` + `fliph/flipv`.
Applied in both branches of `generate_image_thumbnail` (RAW
embedded-JPEG path and the regular `image::open` path). Existing
thumbnails in the cache are still wrong-orientation; wipe the thumb
dir or run a one-off backfill once this lands.
2. Optional `library` query param on `/photos/exif`. Accepts numeric id
or name (same shape as `/image?library=...`), resolved via the
existing `resolve_library_param` helper so a bad value 400s before we
touch the DAO. Filter is applied post-query in the handler rather
than pushed into `query_by_exif` to keep the DAO trait (and its test
mocks) unchanged. Cheap enough at typical library counts; can be
moved into SQL later if it ever isn't.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
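The orientation fix in item 1 rests on the standard mapping from the EXIF Orientation tag (values 1 to 8) to rotate/flip steps. The sketch below expresses that mapping as a list of operations rather than calling the `image` crate, so it stays self-contained; `Op` and `orientation_ops` are hypothetical names, and rotations are clockwise.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Rotate90,
    Rotate180,
    Rotate270,
    FlipH,
    FlipV,
}

/// Operations to apply, in order, so the pixels display upright.
/// Tag 1 (and any unknown value) means "leave as is".
fn orientation_ops(tag: u16) -> &'static [Op] {
    match tag {
        2 => &[Op::FlipH],
        3 => &[Op::Rotate180],
        4 => &[Op::FlipV],
        5 => &[Op::Rotate90, Op::FlipH], // transpose
        6 => &[Op::Rotate90],
        7 => &[Op::Rotate270, Op::FlipH], // transverse
        8 => &[Op::Rotate270],
        _ => &[],
    }
}
```

Tag 6 is the common "portrait phone shot" case: the stored pixels need a 90 degree clockwise rotation before the thumbnail is saved without EXIF.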
||
|
|
c6f82ebaba |
Batch EXIF endpoint: GET /photos/exif
Adds a single round-trip projection of `image_exif` for every photo whose
`date_taken` falls in `[date_from, date_to]`. Wraps the existing
`ExifDao::query_by_exif` DAO method which already handles the SQL filter
in one query against the covering index — the only missing piece was
HTTP plumbing.
Designed for window-scoped consumers like Apollo's photo-to-track
matcher, which currently does N+1 (one `/photos` listing + one
`/image/metadata` per photo). Because `/image/metadata` serializes on
`Data<Mutex<dyn ExifDao>>`, that pattern can take 10s+ for windows with
hundreds of photos. The new endpoint takes one mutex acquisition for
the whole batch.
Response shape:
    { photos: [
        { file_path, library_id, library_name,
          camera_model, width, height,
          gps_latitude, gps_longitude, date_taken } ],
      total: N }
Two notes on scope:
- Photos with NULL `date_taken` are excluded by `query_by_exif`'s
semantics. Filename-extracted dates are not synthesized here; rare
callers that need that fallback can still hit `/image/metadata`.
- GPS columns are stored as f32 in image_exif to keep row size small;
the JSON shape widens to f64 so clients don't have to know about the
on-disk precision.
Library names are pre-mapped from `app_state.libraries` once and
stamped on each row, avoiding an O(rows × libraries) linear scan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
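The library-name pre-mapping and the f32-to-f64 widening mentioned in this commit might look roughly like the sketch below. All types here (`Library`, `ExifRow`, `PhotoExif`) and the function `project_rows` are hypothetical stand-ins with a reduced field set; the real handler's structs are not shown in the commit message.

```rust
use std::collections::HashMap;

// Hypothetical, trimmed-down stand-ins for the real types.
struct Library {
    id: i32,
    name: String,
}

struct ExifRow {
    file_path: String,
    library_id: i32,
    gps_latitude: Option<f32>,
    gps_longitude: Option<f32>,
}

struct PhotoExif {
    file_path: String,
    library_id: i32,
    library_name: Option<String>,
    gps_latitude: Option<f64>,
    gps_longitude: Option<f64>,
}

/// Builds the id -> name map once, then stamps each row with a single
/// hash lookup instead of scanning the library list per row.
fn project_rows(rows: Vec<ExifRow>, libraries: &[Library]) -> Vec<PhotoExif> {
    let names: HashMap<i32, &str> = libraries
        .iter()
        .map(|lib| (lib.id, lib.name.as_str()))
        .collect();

    rows.into_iter()
        .map(|row| PhotoExif {
            library_name: names.get(&row.library_id).map(|name| name.to_string()),
            // Widen on-disk f32 coordinates to f64 for the JSON shape.
            gps_latitude: row.gps_latitude.map(f64::from),
            gps_longitude: row.gps_longitude.map(f64::from),
            file_path: row.file_path,
            library_id: row.library_id,
        })
        .collect()
}
```

Note that widening does not add precision: the f64 value is exactly the stored f32 value, so clients simply see a uniform numeric type.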
||
|
|
13b9d54861 |
fix(scan): quiet startup scans & thumbnail RAW/HEIC
Three recurring issues on every full scan:
1. Video playlist scans re-enqueued every file only to reject it as AlreadyExists. Pre-filter in ScanDirectoryMessage and QueueVideosMessage so we skip videos whose .m3u8 already exists, and demote the leaked AlreadyExists log to debug.
2. image crate was built with only jpeg/png features, so webp/tiff/avif files logged "format not supported" every scan. Enable those features.
3. RAW (ARW/NEF/CR2/...) and HEIC thumbnails weren't generated, so the scan kept retrying them. Try the file's embedded JPEG preview via kamadak-exif first (fast, pure-Rust, works on Sony ARW where ffmpeg's TIFF decoder fails). Fall back to ffmpeg for HEIC/HEIF and RAWs with no preview. Anything still undecodable gets a <thumb>.unsupported sentinel so future scans skip it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
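The sentinel idea from item 3 can be illustrated with a small std-only sketch. It assumes the sentinel name is formed by appending `.unsupported` to the full thumbnail path, which is one reading of `<thumb>.unsupported`; `sentinel_path` and `needs_thumbnail` are hypothetical helper names.

```rust
use std::path::{Path, PathBuf};

/// Appends ".unsupported" to the thumbnail path (assumed naming scheme).
fn sentinel_path(thumb: &Path) -> PathBuf {
    let mut name = thumb.as_os_str().to_os_string();
    name.push(".unsupported");
    PathBuf::from(name)
}

/// A scan only needs to attempt generation when neither the thumbnail
/// nor its "unsupported" sentinel is already on disk.
fn needs_thumbnail(thumb: &Path) -> bool {
    !thumb.exists() && !sentinel_path(thumb).exists()
}
```

Writing an empty sentinel file after a failed decode then makes every later scan skip the file with two cheap existence checks.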
||
|
|
079cd4c5b9 |
feat(ai): streaming chat endpoint with live tool events
Add LlmClient::chat_with_tools_stream and SSE endpoint POST /insights/chat/stream that emits text deltas, tool_call / tool_result pairs, truncated notice, and a terminal done frame as the agentic loop runs.
- Ollama: parses NDJSON from /api/chat stream, accumulates content deltas, emits Done with tool_calls from the final chunk.
- OpenRouter: parses OpenAI-compatible SSE, reassembles tool_call argument deltas by index, asks for stream_options.include_usage.
- InsightChatService spawns the loop on a tokio task, feeds events through an mpsc channel, persists training_messages at the end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
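Reassembling tool_call argument deltas by index, as mentioned for the OpenRouter path, could be sketched like this. `ToolCallDelta`, `ToolCall`, and `reassemble` are hypothetical and already assume the SSE frames have been parsed; in OpenAI-compatible streams the function name usually arrives in the first fragment for an index and the argument JSON arrives in pieces.

```rust
use std::collections::BTreeMap;

/// One already-parsed streamed fragment (hypothetical shape).
struct ToolCallDelta {
    index: usize,
    name: Option<String>,
    arguments: String,
}

#[derive(Debug, Default, PartialEq)]
struct ToolCall {
    name: String,
    arguments: String,
}

/// Groups fragments by their `index` and concatenates the argument pieces
/// in arrival order. The result is ordered by index.
fn reassemble(deltas: impl IntoIterator<Item = ToolCallDelta>) -> Vec<ToolCall> {
    let mut by_index: BTreeMap<usize, ToolCall> = BTreeMap::new();
    for delta in deltas {
        let call = by_index.entry(delta.index).or_default();
        if let Some(name) = delta.name {
            call.name = name;
        }
        call.arguments.push_str(&delta.arguments);
    }
    by_index.into_values().collect()
}
```

The argument string is only valid JSON once the stream has finished, so parsing it is best deferred until after reassembly.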
||
|
|
65ab10e9a8 |
feat(ai): chat rewind + ollama metrics logging
Rewind: POST /insights/chat/rewind truncates training_messages at a given rendered index, dropping the target message plus any preceding tool-call scaffolding. The initial user prompt is protected.
Metrics: log prompt_eval_count/duration and eval_count/duration from every Ollama chat response, rendered as tokens + ms + tok/s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
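The tokens + ms + tok/s rendering can be sketched as follows, relying on the fact that Ollama reports durations in nanoseconds. The function names and the exact log format are hypothetical; only the arithmetic is the point.

```rust
/// Tokens per second from a token count and a duration in nanoseconds.
/// Returns None when the duration is zero to avoid dividing by zero.
fn tokens_per_second(count: u64, duration_ns: u64) -> Option<f64> {
    if duration_ns == 0 {
        return None;
    }
    Some(count as f64 * 1e9 / duration_ns as f64)
}

/// Renders one metric pair as "label: N tokens in M ms (R tok/s)".
/// The format string is illustrative, not the project's actual log line.
fn format_metrics(label: &str, count: u64, duration_ns: u64) -> String {
    let ms = duration_ns as f64 / 1e6;
    match tokens_per_second(count, duration_ns) {
        Some(rate) => format!("{}: {} tokens in {:.0} ms ({:.1} tok/s)", label, count, ms, rate),
        None => format!("{}: {} tokens in {:.0} ms (n/a tok/s)", label, count, ms),
    }
}
```

The same helper would serve both pairs: `prompt_eval_count` with `prompt_eval_duration`, and `eval_count` with `eval_duration`.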
||
|
|
0b9528f61e |
feat(ai): chat continuation for photo insights (server v1)
Adds POST /insights/chat and GET /insights/chat/history. Replays the stored agentic conversation through the same backend the insight was generated with (or a per-turn override), runs a short tool-calling loop, and persists the extended history in append or amend mode.
Backend switching: same-backend or hybrid->local replay verbatim; local->hybrid is rejected in v1 (would require on-the-fly vision description rewrite).
Per-(library, file) async mutex serialises concurrent turns. Soft context budget drops oldest tool_call+result pairs when the serialized history exceeds num_ctx - 2048 tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
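The soft context budget could be approximated by the sketch below. It is an illustration under loud assumptions: the `Msg` enum, the 4-characters-per-token estimate, and `trim_to_budget` are all hypothetical, and the real service measures the serialized history rather than this crude estimate.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Msg {
    User(String),
    Assistant(String),
    ToolCall(String),
    ToolResult(String),
}

/// Crude token estimate (assumption): roughly four characters per token.
fn estimate_tokens(msg: &Msg) -> usize {
    let text = match msg {
        Msg::User(t) | Msg::Assistant(t) | Msg::ToolCall(t) | Msg::ToolResult(t) => t,
    };
    text.len() / 4 + 1
}

/// Drops the oldest adjacent tool_call + tool_result pair until the
/// estimated size fits within `num_ctx - 2048`, or no pairs remain.
fn trim_to_budget(history: &mut Vec<Msg>, num_ctx: usize) {
    let budget = num_ctx.saturating_sub(2048);
    while history.iter().map(estimate_tokens).sum::<usize>() > budget {
        let oldest_pair = history
            .windows(2)
            .position(|w| matches!(w, [Msg::ToolCall(_), Msg::ToolResult(_)]));
        match oldest_pair {
            Some(i) => {
                history.drain(i..i + 2);
            }
            None => break, // nothing droppable left; the budget is only soft
        }
    }
}
```

Because only tool scaffolding is dropped, user and assistant turns survive even when the budget cannot be met, which matches the "soft" wording in the commit.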
||
|
|
e2eefbd156 |
feat(ai): curated OpenRouter model picker for hybrid backend
Add OPENROUTER_ALLOWED_MODELS env var and GET /insights/openrouter/models endpoint returning the curated list verbatim. Drop the live capability precheck in hybrid mode; trust the operator's allowlist, and let bad ids surface as a chat-call error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|