memories: single-SQL rewrite + 20-year lookback

Replaces the EXIF-loop + WalkDir-fallback pipeline that powered
`/memories` with a single per-library SQL query
(`get_memories_in_window`) that uses `strftime('%m-%d' | '%W' | '%m',
date_taken, 'unixepoch', tz_offset)` for calendar matching in the
client's timezone, plus a `years_back` lower bound and a
no-future-dates upper bound. Returns only the matching rows; the
handler applies per-library `PathExcluder` post-query and sorts.

Drops:
- `collect_exif_memories` — replaced by the single SQL query.
- `collect_filesystem_memories` — the canonical-date pipeline now
  populates `date_taken` for every row at ingest, so the WalkDir
  fallback that scanned 14k+ files each request is no longer needed.
- `get_memory_date_with_priority` and friends — request-time waterfall
  superseded by `date_resolver` running at ingest. The associated
  three priority-tests are dropped; their replacement lives in
  `date_resolver::tests`.

On a ~14k-file library this drops `/memories` from 10–15 s
(dominated by `fs::metadata` per row) to single-digit ms.

Bumps `DEFAULT_YEARS_BACK` from 15 → 20 to surface deeper archives
on matching anniversaries.

Note vs. ISO weeks: the original Rust used `chrono::iso_week().week()`
for week-span matching. SQLite's `%W` is Monday-anchored but uses week
0 for days before the first Monday, so it can disagree with ISO at
year boundaries by ±1. Acceptable for nostalgia browsing.

Adds 3 new DAO tests covering month-span filter, library scoping, and
the unknown-span-token guard. Also adds a CLAUDE.md section describing
the canonical-date pipeline end-to-end and the new
`DATE_BACKFILL_MAX_PER_TICK` env var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Cameron Cordes
2026-05-06 16:04:09 -04:00
parent 54e0635a98
commit 7f12890f4b
4 changed files with 403 additions and 492 deletions

View File

@@ -364,6 +364,53 @@ Runs in background thread with two-tier strategy:
- Batch queries EXIF DB to detect new files
- Configurable via `WATCH_QUICK_INTERVAL_SECONDS` and `WATCH_FULL_INTERVAL_SECONDS`
**Canonical date_taken pipeline (`src/date_resolver.rs`).** Every row's
`image_exif.date_taken` is populated at ingest by a four-step waterfall;
which step won is recorded in `image_exif.date_taken_source` so the
per-tick drain can re-resolve weak entries when better tools become
available, and so the UI/debug surface can answer "why did this photo
land on this date?". Order:
1. **`exif`** — kamadak-exif `DateTime` / `DateTimeOriginal`. Fast,
in-process, image-only.
2. **`exiftool`** — shell-out fallback for tags kamadak can't reach:
QuickTime/MP4 (`MediaCreateDate`, `TrackCreateDate`, `CreateDate`),
Apple's `ContentCreateDate`, MakerNote sub-IFDs. Required for
videos to land a real date. Single-file at ingest; the per-tick
drain feeds the whole batch through one `exiftool -@ -` subprocess.
Degrades silently when `exiftool` isn't on PATH (resolver caches the
"available" check via `OnceLock`).
3. **`filename`** — `extract_date_from_filename` in `memories.rs`
matches screenshot, chat-export, and timestamp-named patterns.
4. **`fs_time`** — `earliest_fs_time(metadata)` (earlier of created /
modified). Last resort.
Notable behavior change vs. the pre-2026-05 request-time logic:
**EXIF beats filename when both are present.** A photo named
`Screenshot_2014-06-01.png` whose EXIF `DateTime` is 2021 now appears
under 2021, not 2014 — on the theory that EXIF is more reliable than
import-named filenames. The reverse case (no EXIF, filename has a
date) is unchanged.
The `backfill_missing_date_taken` drain (`src/main.rs`) runs every
watcher tick alongside `backfill_unhashed_backlog`. It loads up to
`DATE_BACKFILL_MAX_PER_TICK` rows (default 500) where
`date_taken IS NULL OR date_taken_source = 'fs_time'` (backed by the
`idx_image_exif_date_backfill` partial index), runs the waterfall
batch via `resolve_dates_batch`, and writes results via the
`backfill_date_taken` DAO method (touches only `date_taken` +
`date_taken_source` so EXIF / hash / perceptual columns are
preserved). `filename`-sourced rows are intentionally not re-resolved
— the regex is authoritative when it matches, and re-running exiftool
won't change the answer.
`/memories` is a single SQL query against this column
(`get_memories_in_window` in `src/database/mod.rs`), using
`strftime('%m-%d' | '%W' | '%m', date_taken, 'unixepoch', tz)` for
calendar matching with the client's timezone offset. The pre-rewrite
version stat'd every row and walked the entire library tree — at
~14k photos this took 1015 s; the rewrite is single-digit ms.
**EXIF Extraction:**
- Uses `kamadak-exif` crate
- Supports: JPEG, TIFF, RAW (NEF, CR2, CR3), HEIF/HEIC, PNG, WebP
@@ -534,6 +581,7 @@ Optional:
```bash
WATCH_QUICK_INTERVAL_SECONDS=60 # Quick scan interval
WATCH_FULL_INTERVAL_SECONDS=3600 # Full scan interval
DATE_BACKFILL_MAX_PER_TICK=500 # Cap on canonical-date drain per watcher tick
OTLP_OTLS_ENDPOINT=http://... # OpenTelemetry collector (release builds)
# AI Insights Configuration