Commit Graph

116 Commits

Author SHA1 Message Date
Cameron Cordes 7e21213181 Reels: bound disk/ledger growth (pre-gen prune + on-demand cache sweep)
Nothing reaped reels before, so the on-disk cache and ledger grew
unbounded — each night's daily reel is a new ~4MB file + ledger row that's
stale within ~26h.

- Pre-gen self-prune: after recording a reel, prune_superseded keeps the
  newest PREGEN_KEEP_PER_SCOPE (2) rows per (span, library) and unlinks the
  superseded reels' mp4+sidecar. Caps the ledger/disk at ~spans×libraries×2.
- On-disk sweeper (spawn_reel_cache_sweeper): every 24h, removes reel mp4s
  with no ledger row and no live job older than REEL_CACHE_MAX_AGE_DAYS (7) —
  bounding the on-demand cache, which has no ledger row and otherwise grows
  forever — plus crashed-render cruft (.mp4.tmp/.concat.txt/orphan sidecars).
  Runs regardless of REEL_PREGEN_ENABLED; disable with REEL_CACHE_SWEEP_ENABLED=0.
- New DAO methods prune_superseded + all_cache_keys (with tests); env knobs
  documented in .env.example.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 23:27:32 -04:00
Cameron Cordes ca007a618d Reels pre-gen: record true media count + real upsert for user_ai_prefs
- pregen_one recorded media_count as planned.len() (beat count); record
  the actual media item total (media.len(), photos + clips) in both the
  cache-hit and freshly-rendered ledger paths. Drops the redundant
  photo_count binding.
- Replace upsert_prefs's insert-then-catch-error-then-update dance with a
  single atomic INSERT ... ON CONFLICT(id) DO UPDATE. Explicit id=1 makes
  the conflict target deterministic; explicit column .set((...)) keeps
  None -> NULL overwrite semantics so the row mirrors the latest request
  exactly, and genuine insert errors surface instead of being swallowed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:19:41 -04:00
Cameron Cordes f707353807 feat: nightly agentic pre-generation of memory reels
Implement end-to-end nightly pre-generation of memory reels with agentic
scripting that grounds narration in calendar, location, messages, and RAG.

Sections A-E from the plan:

A. Extract produce_reel pipeline core from run_reel_job with
   ScripterMode::Fast/Agentic and progress callbacks.

B. Agentic scripter: factor run_readonly_tool_loop from the insight
   generator, build read-only tool gate, prompt builder with GPS, and
   generate_script_agentic with fallback to fast path.

C. Precomputed reels ledger (SQLite table + DAO), GET /reels/precomputed
   handler with validity gate, GET /reels/by-key/{key}/video streaming,
   and normalize_library_key helper.

D. Nightly scheduler: spawn_pregen_scheduler with configurable hour,
   run_pregen_batch (day/week/month spans), pregen_one with dedup and
   disk-check, secs_until_next_run_hour time math.

E. user_ai_prefs passive mirror table + DAO for param capture in
   create_reel_handler and replay in the scheduler.

Also fixes resolve_library_param signature to take &[Library] and adds
resolve_library_param_state wrapper for AppState callers.

New files: migrations/2026-06-13-000000_add_precomputed_reels/,
  migrations/2026-06-13-000010_add_user_ai_prefs/,
  src/database/precomputed_reel_dao.rs,
  src/database/user_ai_prefs_dao.rs
2026-06-13 14:29:34 -04:00
Cameron Cordes efd05db523 Make the embedding model swappable via env for A/B testing
Trialing Qwen3-Embedding-0.6B (1024-dim, instruct-prefixed queries)
against nomic required code changes at every hardcoded seam; now it's a
config flip plus a reembed_embeddings run.

- EMBEDDING_DIM env (default 768) replaces every hardcoded dim check:
  daily summary / calendar / search / location DAOs, Ollama batch
  validation, reembed_embeddings
- entities gains the dim guard it never had — a wrong-dim vector
  silently kills dedup/recall (cosine over mismatched lengths is 0),
  so store None and warn instead
- embed_query / embed_document split with EMBED_QUERY_PREFIX /
  EMBED_DOCUMENT_PREFIX (literal \n expanded): retrieval models treat
  the two sides differently — nomic wants search_query:/search_document:,
  Qwen3 wants Instruct:...\nQuery: on queries only. All query-side
  call sites and all corpus writers now declare their side.
- document the contract in CLAUDE.md: change the model or any of these
  vars → re-run reembed_embeddings or search is garbage

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 21:40:40 -04:00
Cameron Cordes 8e4f91561b Add per-file insight history endpoint and rate-by-id
Expose GET /insights/history?path=... returning every generated version
of a photo's insight (current plus superseded), newest-first, backing the
mobile per-file insight history view.

- New get_insight_history_handler; reuses the existing get_insight_history
  DAO method (removed its dead_code allow).
- impl From<PhotoInsight> for PhotoInsightResponse, collapsing the mapping
  that was duplicated across the single-get and all-insights handlers.
- rate_insight_by_id DAO method + optional insight_id on RateInsightRequest
  so previously generated versions can be approved/rejected (the path-based
  rate only touches the current row).
- DAO tests for history ordering/scoping and id-targeted rating.
- cargo fmt normalized a multi-line assert in insight_chat.rs tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 18:28:22 -04:00
Cameron Cordes cdd981fe64 fix: inline DB error source into DbError struct
The previous fix logged the underlying error in a separate log line,
but the error that propagated up still showed just "DbError { kind:
InsertError }" at the call site. Now the source message is captured
on the struct itself, so Debug/Display output at any call site shows
the actual Diesel error inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 22:30:19 -04:00
Cameron Cordes dad0220587 fix: stop swallowing DB errors across the entire DAO layer
Every map_err(|_| DbError::new(...)) and map_err(|_| anyhow!("..."))
in the database layer was discarding the actual Diesel/SQLite error,
making failures impossible to diagnose from logs.

- Add DbError::log() that logs the source error before converting
- Replace all ~130 swallowed outer map_err closures with DbError::log
- Replace all ~47 swallowed inner anyhow closures to include the
  source error in the message

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:56:48 -04:00
Cameron Cordes 39ad83f55b fix: surface actual Diesel error in store_insight instead of generic InsertError
The previous map_err closures discarded the Diesel error, making
failures like missing columns impossible to diagnose from logs.
Now the underlying error is logged before converting to DbError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:53:54 -04:00
Cameron Cordes 9654d256f4 fix: persist token counts and fix agentic insight_id mapping
- Add prompt_eval_count and eval_count columns to photo_insights so
  token usage from llama-swap/Ollama is stored and returned by the API
- Fix agentic generator return: was (prompt_eval_count, eval_count),
  handler destructured first element as insight_id — now returns
  (insight_id, prompt_eval_count, eval_count)
- Wire prompt_eval_count/eval_count from DB into PhotoInsightResponse
  instead of hardcoded None

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:47:57 -04:00
Cameron Cordes 449ce1fda1 chore: resolve all clippy warnings and formatting
- Replace impl ToString with impl Display for InsightJobStatus and
  InsightGenerationType
- Rename from_str → parse to avoid confusion with std::str::FromStr
- Collapse nested if statements (handlers, insight_chat, insight_generator,
  image handlers)
- Use is_multiple_of() instead of manual modulo checks
- Suppress deprecated diesel::dsl::count_distinct (no drop-in replacement
  available in current Diesel version)
- Scope MutexGuard in synthesize_merge to drop before await
- Allow dead_code on generate_no_think, enumerate_indexable_files,
  total_deleted (intended for future use)
- Allow type_complexity on Diesel query result tuples

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:13:48 -04:00
Cameron Cordes 2818936739 fix: audit fixes for async insight jobs + persist generation params
- Fix query param mismatch: rename GenerationStatusQuery.file_path to
  path so the client's app-resume buildQuery({ path: ... }) resolves
  correctly instead of always getting 400
- Remove dead _lib_id bindings from both generate handlers
- Return 202 Accepted instead of 200 from generate endpoints
- Restore OpenTelemetry span instrumentation on generate handlers
- Remove stale UNIQUE constraint from initial migration (incompatible
  with plain-INSERT DAO)
- Add tests for status guard: complete_job/fail_job are no-ops when
  job is already cancelled, and cancel_job by id
- Persist generation params (num_ctx, temperature, top_p, top_k, min_p,
  system_prompt, persona_id) on the photo_insights table for auditing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:02:15 -04:00
Cameron Cordes b87eb4e690 feat: async insight generation with SQLite job tracking
- Add insight_generation_jobs table migration and DAO
- Implement job lifecycle: create_or_get_active, complete, fail, cancel
- Refactor POST /insights/generate and /agentic to async spawn with timeout
- Add GET /insights/generation/status endpoint with job_id and file_path lookup
- Use String for enum fields in Diesel models to avoid private Bound type
- Add from_str() helpers on InsightJobStatus and InsightGenerationType
- Fix update_training_messages to return Result<usize, DbError>
- 7/7 DAO unit tests passing
2026-05-27 10:02:18 -04:00
Cameron Cordes 32195ed89e clip-search: backlog drain + /photos/search endpoint
Wires the persistence layer for CLIP semantic search. The watcher's
per-tick drain encodes any image_exif row with a known content_hash
but no clip_embedding via Apollo (cap CLIP_BACKLOG_MAX_PER_TICK,
default 32). On a query, /photos/search encodes the text via Apollo
and reranks every stored embedding in-memory.

ExifDao additions:
- list_clip_unencoded_candidates — partial-index scan for drain
- backfill_clip_embedding — touches only the two new columns
- list_clip_index — dedup'd (hash, embedding) pull for search

clip_watch::run_clip_encoding_pass is the parallel fan-out — tokio
runtime per pass with CLIP_ENCODE_CONCURRENCY (default 4). No marker
rows for permanent failures yet; per-tick cap bounds the retry cost.

/photos/search params: q, limit, threshold (default 0.20), library,
model_version. Response is intentionally minimal (path + score) so
the frontend joins against existing photo-metadata routes lazily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:10:52 -04:00
Cameron Cordes 8d9e76cf15 clip-search: migration + client + probe binary
Probe-phase scaffolding for CLIP semantic search. Adds the column
that will hold per-photo embeddings, the HTTP client to Apollo's
inference service, and a throwaway probe binary so we can eyeball
search-result quality on the live library before building the
persistence layer (backlog drain, /photos/search endpoint, UI).

- migrations/2026-05-14-000000_add_clip_embedding/ — adds
  image_exif.clip_embedding (BLOB) and clip_model_version (TEXT),
  plus a partial index on (clip_embedding IS NULL AND content_hash
  IS NOT NULL) for the future backfill drain.
- src/database/models.rs — extends ImageExif struct to match.
- src/ai/clip_client.rs — encode_image / encode_text / health,
  same Permanent/Transient/Disabled taxonomy as face_client.
- src/bin/probe_clip_search.rs — --query <q> --library N --limit M
  --top K. Encodes a sample and prints top-K cosine similarities.
  No DB writes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:10:52 -04:00
Cameron Cordes 8503ef7884 chore: cargo fmt + clippy --fix sweep across the crate
Pure mechanical cleanup of accumulated drift in files outside the
HLS-content-hash branch's main change set. No behavior change.

- `cargo fmt` on every previously-misformatted file
  (`ai/insight_generator.rs`, `database/knowledge_dao.rs`,
  `faces.rs`, `knowledge.rs`, `libraries.rs`).
- `cargo clippy --fix`:
  - `needless_borrow`: `&library` → `library` in `handlers/image.rs`
    (two sites in the photo-listing path).
- Manual clippy pass for warnings clippy emits but can't auto-apply:
  - `field_reassign_with_default` in `database/reconcile.rs::run` —
    consolidated into a struct-literal initializer.
  - `needless_range_loop` in `database/knowledge_dao.rs::union_perceptual_tags`
    — inner `for b in (a+1)..indices.len() { let ib = indices[b]; ... }`
    becomes `for &ib in &indices[a + 1..] { ... }`.
  - Doc-list indentation: continuation lines under nested bullets in
    `database/mod.rs::get_memories_in_window` and
    `database/knowledge_dao.rs::build_entity_graph` realigned to the
    list-item content column.

Deliberately not touched (each deserves its own focused commit, with
testing, rather than getting bundled into a sweep):
- 4× `deprecated count_distinct` in `faces.rs` — diesel API migration
  to `AggregateExpressionMethods::aggregate_distinct` may shift result
  types; needs verification against the existing stats queries.
- `await_holding_lock` in `knowledge.rs:807` — `std::sync::Mutex` held
  across `ollama.generate(...).await`. Genuine concurrency bug; fix
  requires understanding the surrounding flow before just dropping
  the guard.
- 2× `type_complexity` in `database/mod.rs` — cosmetic, would need a
  `type` alias and corresponding callers updated.
- Dead `total_deleted` on `library_maintenance::GcStats` and
  `file_scan::enumerate_indexable_files` — both are public surface
  retained for future use; deletion is a separate decision.

All 707 tests still pass. Release build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:25:05 -04:00
Cameron Cordes 7cd1ea3cf8 hls: per-library readiness gauges + GET /hls/stats endpoint
The hash-keyed pipeline transcodes lazily, so a freshly mounted (or
freshly upgraded) library is "mostly pending" for the first hour
while the watcher works through the backlog. The operator wants a
live read on remaining work so they can tune `HLS_CONCURRENCY` and
know when to stop waiting.

Adds:

- `src/hls_stats.rs` — pure compute path (`stats_from_rows`) and an
  Arc<Mutex<dyn ExifDao>> wrapper (`compute_and_publish`). Per
  library: `total`, `with_playlist`, `pending`, `unsupported`,
  `hashless_videos`. Dedup is by content_hash so duplicate-bytes-at-
  N-paths counts once (same domain rule as `faces::stats`).
  `hashless_videos` is a separate counter so the operator can see
  the "hash backfill, then transcode" pipeline depth instead of
  having NULL-hash rows just hide.

- Prometheus gauges labeled by library name:
  `imageserver_hls_videos_total`, `..._with_playlist`, `..._pending`,
  `..._unsupported`. Updated by the watcher at the end of every full-
  scan tick *and* on every `/hls/stats` hit, so whichever surface the
  operator is watching stays fresh. Registered in `main` alongside
  the existing image/video gauges.

- `GET /hls/stats` — Claims-protected JSON snapshot of the same data
  plus a top-level cross-library aggregate. Runs on a blocking pool
  so it doesn't pin the actix worker; per-call cost is one
  `list_paths_and_hashes_for_library` SQL query per library plus a
  `stat()` per distinct video hash. Bounded — never invoked from
  middleware, only from the explicit endpoint and the full-scan
  tick. The watcher's end-of-tick `info!` summary line mirrors the
  endpoint output for operators tailing the log.

- New `ExifDao::list_paths_and_hashes_for_library` method:
  `SELECT rel_path, content_hash FROM image_exif WHERE library_id =
  ?`. Single round-trip; callers filter to video extensions
  client-side because the schema doesn't carry media-type. Mock
  impl in `files.rs` returns an empty vec.

Tests in `hls_stats::tests` exercise stats_from_rows directly (videos-
only filter, hash dedup, playlist vs sentinel decision, NULL-hash
hashless counting) plus a publish_gauges round-trip that reads the
gauge value back. Full suite (347 lib + 360 bin = 707) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 15:58:46 -04:00
Cameron Cordes b8e17e05b7 hls: rewrite orphan cleanup for hash-keyed layout
The cleanup walk previously looked for `$VIDEO_PATH/<basename>.m3u8`
and matched each file's stem against a recursive walk of every
library. With the hash-keyed layout now in place, every playlist's
file_stem is the literal string "playlist" — the old logic would
treat every hash-keyed playlist as orphaned on its next run and wipe
them all in one tick (default cleanup interval is 24h, so this is a
24-hour bomb on top of the prior commit).

New approach: orphan-ness is decided in the database, not on the
filesystem. The cleanup loop:

- Snapshots every distinct non-NULL `image_exif.content_hash` into a
  HashSet (new `ExifDao::list_distinct_content_hashes` method —
  `SELECT DISTINCT content_hash WHERE content_hash IS NOT NULL`).
- Walks `$VIDEO_PATH` two levels deep: top-level entries are filtered
  to 2-char lowercase hex shard dirs, each shard's children to 64-char
  hex hash dirs. Anything else (legacy `.m3u8` at root from the
  pre-content-hash era, operator-stashed dirs, partial writes) is left
  alone.
- Hash dirs whose hash isn't in the alive set are `remove_dir_all`'d.
  Shard dirs that emptied as a result are reaped on the same pass via
  `remove_dir` (no-op if non-empty).
- The library-stale safety gate is preserved: a stale library skips
  the cycle even though the orphan decision is DB-only, because the
  upstream missing-file scan that retires `image_exif` rows itself
  pauses for stale libraries. Belt-and-suspenders — keeping a hash
  dir for one extra 24h cycle is cheaper than wiping one whose source
  was briefly unreachable. The gate now also filters disabled
  libraries out of the stale set (they're intentionally absent from
  the health map).
- The legacy `excluded_dirs` parameter is preserved on the function
  signature but unused (the walk no longer crosses library trees);
  flagged with a leading underscore. Callers in `main.rs` stay
  unchanged.

`MockExifDao` in `files.rs` grows the new method (returns empty);
unit tests for the new `is_hash_shard` / `is_full_hash` validators
guard against an operator's stashed directory under VIDEO_PATH ever
matching the orphan-rm path. Both pass.

A follow-up commit handles the one-shot startup migration that
retires the legacy basename-keyed `.m3u8` / `.ts` files at
`$VIDEO_PATH` root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 15:41:04 -04:00
Cameron Cordes 2d56047497 Drop fs_time from date-backfill eligibility
The drain queried `date_taken IS NULL OR date_taken_source = 'fs_time'`
ORDER BY id ASC LIMIT 500 every watcher tick. The resolver is
deterministic on file bytes + filename + fs metadata, so any row that
landed on fs_time once landed there again on every retry — the drain
spun on the same lowest-id rows in perpetuity, never advancing to
rows 501+ while still logging more_remain=true.

Side effect: 500 auto-commit UPDATEs per tick sustained the SQLite
write lock long enough that other writers on separate DAO connections
hit the 5s busy_timeout. Manifested as intermittent 500s on
PATCH /image/faces/{id} that succeeded on retry.

Narrow the partial index and query predicate to `date_taken IS NULL`.
If exiftool installs or a new filename regex lands, an operator can
re-resolve fs_time rows out-of-band rather than re-introducing the
steady-state churn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:37:36 -04:00
Cameron Cordes bec9857426 Split main.rs: extract backfill drains and thumbnails into modules
main.rs drops from 3542 → ~2930 lines by moving:

- src/backfill.rs (new): backfill_unhashed_backlog,
  backfill_missing_date_taken, backfill_missing_content_hashes,
  build_face_candidates, process_face_backlog. Now unit-tested for
  the first time — 5 tests covering cap behavior, library-id
  filtering, missing-on-disk skip, and the video/unhashed/scanned
  filters on face-candidate selection.

- src/thumbnails.rs (new): unsupported_thumbnail_sentinel,
  generate_image_thumbnail, create_thumbnails, update_media_counts,
  is_image, is_video, plus the IMAGE_GAUGE / VIDEO_GAUGE Prometheus
  metrics. Replaces the no-op stubs that used to live in lib.rs.
  4 new unit tests for the sentinel path math and the
  walker-counts-images-vs-videos smoke path.

Supporting:
- SqliteExifDao::from_shared (test-only) so an SqliteExifDao and
  SqliteFaceDao can share one in-memory connection — required to
  test build_face_candidates against the real join.
- files.rs / video/{mod,actors}.rs import from crate::thumbnails::*
  instead of the now-removed stubs in lib.rs.

cargo test --bin image-api: 325 passing (was 314).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:22:02 -04:00
Cameron Cordes e67e00ef8a knowledge: predicate-quality nudge + bulk-reject endpoint
Two coupled changes to fight the speech-act-predicate problem
(facts like (Cameron, expressed, "I'm tempted to...")):

1. System prompt grows an explicit predicate-quality rule. The
   agent is told to use relationship-shaped verbs (lives_in,
   works_at, attended, is_friend_of, interested_in), and is
   given an explicit DON'T list (expressed, said, mentioned,
   stated, quoted, noted, discussed, thought, wondered). Plus a
   concrete Bad / Good example contrasting the noise pattern
   with the structured paraphrase the agent should be writing.
   Stops the bleed for new insights.

2. Cleanup tools for the legacy noise that's already in the
   table:
   - get_predicate_stats(persona, limit) returns
     [(predicate, count)] sorted desc — feeds the curation UI's
     PREDICATES tab.
   - bulk_reject_facts_by_predicate(persona, predicate, audit)
     flips every ACTIVE fact under that predicate to 'rejected'
     in one transaction, stamping last_modified_* so the action
     is attributable + reversible per-fact through the entity
     detail panel. REVIEWED facts under the same predicate are
     left alone — the curator may have hand-approved an
     exception ("interested_in" might be largely noise but a
     reviewed entry is intentional).

New HTTP endpoints:
   GET  /knowledge/predicate-stats?limit=
   POST /knowledge/predicates/{predicate}/bulk-reject

Persona-scoped via the existing X-Persona-Id header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:50:26 -04:00
Cameron Cordes d123cde333 knowledge: entity-graph endpoint for force-directed view
New GET /knowledge/graph?type=&limit= returns the data the
curation UI's graph tab needs:
  - nodes = entities with at least one in-scope fact (rejected /
    superseded excluded). Carries fact_count for visual sizing.
    Top-N by count desc; default cap 200 (clamped 1..1000).
  - edges = relational facts (object_entity_id set) grouped by
    (subject, object, predicate) so 3 "is_friend_of" facts
    between the same pair collapse into one edge with count=3.

Two raw SQL queries: an INNER JOIN onto a persona-scoped fact-
count subquery for nodes (skips 0-fact entities entirely so the
sim doesn't waste time on disconnected islands), then a follow-
up GROUP BY over the persona-scoped fact set restricted to the
node id set via IN clauses (ids are i32 so inlining is safe).

Pairs with the Apollo-side GraphPanel that runs d3-force over
the returned payload and renders SVG with click-to-open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:26:02 -04:00
Cameron Cordes 6dca0c027d fmt: cargo fmt sweep
No logic changes - line reflow, brace placement, and method-chain splits
across handlers / personas / state / faces / knowledge / insights_dao /
knowledge_dao / populate_knowledge. Picked up incidentally while running
fmt for the sms-search work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 19:21:00 -04:00
Cameron Cordes 6620fa48d7 knowledge: consolidation proposals endpoint
Finds near-duplicate entities the upsert-time cosine guard didn't
catch — typically legacy data from before that guard landed, or
pairs whose embeddings sit between 0.85 (default proposal floor)
and 0.92 (auto-collapse threshold). Pure read-side feature; the
actual merging still goes through the existing
/knowledge/entities/merge action.

New DAO method `find_consolidation_proposals(threshold,
max_groups)`:
  - Loads every non-rejected entity with an embedding.
  - Partitions by entity_type so a person can't cluster with a
    place.
  - Pairwise cosine, edges above threshold feed a union-find for
    transitive grouping (Sara → Sarah → Sarah J. all land in one
    cluster).
  - Tracks min/max cosine per component so the UI can show "how
    tight" each cluster is before clicking in.
  - Returns groups of >= 2 sorted by size desc then max cosine
    desc; trimmed to `max_groups`.

New endpoint `GET /knowledge/consolidation-proposals?threshold=
&limit=` accepts the threshold (clamped 0.5–0.99 to prevent the
"every entity in one mega-cluster" case) and returns groups with
per-entity persona fact-count breakdowns baked in — saves the UI
a separate query per cluster member.

ConsolidationGroup is exported through database/mod.rs so the
handler can use it without depending on knowledge_dao internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:43:11 -04:00
Cameron Cordes 89d0a6527c knowledge: per-entity persona breakdown for list + detail
Entities are global; facts are persona-scoped. Under the active
persona an entity can read as "0 facts" while having plenty under
other personas the user owns — the curation UI had no way to
surface that gap. Adds a batched DAO method
`get_persona_breakdowns_for_entities` that returns
{entity_id → [(persona_id, count)]} in one query (group by
subject + persona, user-scoped, status != rejected), and wires it
into both /knowledge/entities list rows and
GET /knowledge/entities/{id}.

EntitySummary grows an optional `persona_breakdown` field
(skipped on serialization when None — keeps PATCH responses
unchanged). EntityDetailResponse carries the breakdown as a
non-optional Vec since the detail endpoint always populates it.

One extra query per list page (50 entities → 50 subject ids
batched in one IN clause); single-entity GET adds one round trip.
Indexed by (subject_entity_id, persona_id) implicitly via the
existing user-persona indexes on entity_facts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:29:20 -04:00
Cameron Cordes fd4dd89bbb knowledge: agent self-correction with audit + per-persona gate + revert
Bundles three coupled changes so agent-side mutations stay
auditable and reversible:

1. Audit columns on entity_facts —
   `last_modified_by_model` / `last_modified_by_backend` /
   `last_modified_at`. Stamped on every mutation path
   (update_fact, supersede_fact, manual PATCH, manual supersede,
   the new revert). NULL on rows never touched since creation.
   Partial index on `last_modified_at WHERE NOT NULL` keeps the
   "show me recent edits" feed fast without bloating from legacy
   rows.

2. Per-persona gate `personas.allow_agent_corrections` (BOOLEAN,
   default 0). Defense in depth at two layers:
   - build_tool_definitions: when off, `update_fact` and
     `supersede_fact` aren't in the catalog at all, so even a
     hallucinated tool call by the model fails fast.
   - tool_update_fact / tool_supersede_fact: re-checks the persona
     flag at call time and returns an explicit "corrections
     disabled" error if it's somehow off (e.g. flag flipped mid-
     loop).
   ToolGateOpts grows the flag; current_gate_opts splits into
   `current_gate_opts` (no persona context, defaults closed) +
   `current_gate_opts_for_persona` for chat callers that have a
   persona id. Both call sites in insight_chat are updated.

3. Revert action — new DAO method `revert_supersession` +
   `POST /knowledge/facts/{id}/restore`. Flips status back to
   'active', clears `superseded_by`, clears `valid_until` (we
   don't track whether it was hand-set vs auto-stamped, so the
   safe reset is to drop it — user can re-bound after). Stamps
   `last_modified_*` so the revert itself is attributable.

Manual paths (PATCH / supersede via HTTP, plus restore) stamp the
audit columns with `("manual", "manual")`. Agent paths stamp the
loop-time chat model and backend (mirroring the existing
created_by_* convention).

FactDetail in the HTTP response now carries the audit triple
alongside the existing provenance. Apollo wires the new field set
in the matching commit.

PersonaView / UpdatePersonaRequest grow `allowAgentCorrections`;
the PersonaPatch + InsertPersona + bulk_import paths thread it.

317 lib tests pass, including unchanged update_fact / supersede
DAO tests (now passing audit=None — None means "no provenance
context to attribute", legacy semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:56:56 -04:00
Cameron Cordes 86c331571d knowledge: per-persona reviewed-only mode + agent reads include reviewed
Two coupled changes to the agent's recall surface:

1. Default scope expanded. recall_facts_for_photo and recall_entities
   used to filter to status='active' only — which silently dropped
   'reviewed' (human-verified) facts. Now they surface active +
   reviewed by default. Reviewed is strictly more trusted than
   active and shouldn't have been hidden. Rejected and superseded
   stay filtered.

2. New persona toggle `reviewed_only_facts` (BOOLEAN, default false,
   migration 2026-05-10-000400). When set, the agent's recall on
   that persona returns ONLY facts with status='reviewed' — strict
   mode for tasks where hallucinated agent claims are particularly
   costly. Wired:
   - schema.rs / Persona / InsertPersona / PersonaPatch grow the
     field.
   - PersonaView returns it as `reviewedOnlyFacts` (camelCase wire).
   - PUT /personas/{id} accepts it (mobile editor surfaces it).
   - InsightGenerator now carries a PersonaDao reference so
     recall_facts_for_photo can read the active persona's flag at
     start; one extra read per recall, cheap.

Composes with include_all_memories: that operates on the persona
*scope* axis (single vs hive), reviewed_only_facts on the *status*
axis. They're orthogonal.

Legacy persona rows pick up the default false on migration; no
behavior change unless explicitly toggled. The 4 existing persona
construction sites (one production, two tests, one InsertPersona in
knowledge_dao tests) all default the field. populate_knowledge bin
+ state.rs constructors also wire the new persona_dao arg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:21:39 -04:00
Cameron Cordes f53338923d knowledge: stamp model + backend on facts for audit
Adds two nullable TEXT columns to entity_facts —
`created_by_model` (LLM identifier) and `created_by_backend`
("local" / "hybrid" / "manual" / NULL) — so the curator can audit
which configurations produce good fact-keeping and which produce
noise.

photo_insights already carries model_version + backend, and
entity_facts.source_insight_id links to it, but:
  - source_insight_id is set post-loop, so chat-continuation and
    regenerated-insight facts lose the link.
  - JOINing per read is more friction than embedding provenance on
    the row itself.
  - Manual facts (POST /knowledge/facts) have no insight at all and
    need their own "manual" provenance marker.

Threading: execute_tool grows `model` + `backend` params, passed
from the three call sites (agentic insight loop, chat single-turn,
chat stream) using the loop-time `chat_backend.primary_model()` +
`effective_backend` already in scope. tool_store_fact stamps the
new fact accordingly; manual create_fact stamps backend="manual".
Legacy rows leave both NULL — pre-tracking data can't be back-
filled reliably from training_messages without burning compute.

Indexes are partial (WHERE NOT NULL) so legacy rows don't bloat
them, and "show me all facts from model X" stays fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:05:14 -04:00
Cameron Cordes 85f3716379 knowledge: fact supersession + photo-date valid_from
Two Phase-2 followups in one commit since they're coupled at the
write path:

* Agent populates valid_from from the source photo's date_taken
  when calling store_fact. Loose semantics — date_taken is *evidence
  at that date*, not strictly when the fact started being true — but
  gives the curator a calendar anchor and pairs with supersession to
  close intervals cleanly. valid_until stays NULL (a single photo
  can't tell us when something stopped). Honours the existing
  upsert_fact dedup (corroborated facts keep their first-recorded
  valid_from).

* Supersession: new column entity_facts.superseded_by INTEGER
  (migration 2026-05-10-000200), new status value 'superseded',
  new DAO method supersede_fact, new HTTP endpoint
  POST /knowledge/facts/{id}/supersede.

  Marking an old fact as replaced by a new one atomically: flips
  status to 'superseded', sets superseded_by, and stamps
  valid_until from the new fact's valid_from (when not already
  set). delete_fact clears dangling supersession pointers in the
  same transaction so the column never points at a missing row —
  no FK because SQLite can't ALTER ADD with REFERENCES, but the
  DAO maintains the invariant.

Pairs with conflict detection from the previous slice: once the
old fact's valid_until is closed, its interval no longer overlaps
the new fact's, so they stop flagging — the supersede action
resolves the conflict.

Two tests pin the contract: supersede stamps valid_until from
new.valid_from while respecting an existing valid_until, and
deleting the supersedeR clears the dangling pointer while leaving
the old fact's 'superseded' status in place for history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:47:06 -04:00
Cameron Cordes 01f5ad7527 knowledge: valid-time on facts + interval-aware conflict detection
Adds bitemporal support to entity_facts. Existing `created_at` is
transaction time (when we recorded the fact); the new
`valid_from` / `valid_until` BIGINT columns are valid time (when the
fact is/was true in the real world). NULL on either side = unbounded
on that side, both NULL = "always-true / unknown" — matches the
default state of every legacy row, no backfill needed.

The split matters for time-bounded predicates like
is_in_relationship_with / lives_in / works_at: recording the fact
once doesn't mean the relationship is still ongoing. Same predicate
across different windows ("lives_in NYC 2018-2020", "lives_in SF
2020-present") is no longer a conflict — the interval-aware check
in get_entity only flags pairs whose windows overlap. Facts with no
valid-time data still flag against everything (worst case for legacy
rows — user adds dates to suppress).

API surface:
- POST /knowledge/facts accepts optional valid_from / valid_until.
- PATCH /knowledge/facts/{id} accepts both with tri-state semantics:
  field omitted = leave alone, JSON null = clear to NULL, number =
  set. Implemented via a small serde helper around Option<Option>.
- GET /knowledge/entities/{id} surfaces both fields per fact and
  uses them in conflict detection.

Agent path (insight_generator) writes NULL/NULL for now — deriving
valid_from from the source photo's date_taken is slated for a
follow-up agent tool alongside Phase 2's supersession.

Test pins set + clear semantics via update_fact: setting both
bounds, leaving them alone on a subsequent patch, then clearing
valid_until back to NULL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 19:25:55 -04:00
Cameron Cordes 0b8478a5e4 knowledge: list sort + persona-scoped fact_count per entity
Two related additions to /knowledge/entities:

- New EntitySort enum (UpdatedDesc default, NameAsc, FactCountDesc)
  surfaced via `?sort=updated|name|count`. NameAsc clusters near-
  duplicate names so dupes stand out at a glance; FactCountDesc
  surfaces heavily-used entities and demotes 0-fact noise to the
  bottom.

- New `list_entities_with_fact_counts` DAO method that returns each
  entity alongside a persona-scoped count of its non-rejected facts
  (subject side). Persona scope follows X-Persona-Id via the
  existing resolve_persona_filter chain — Single filters on
  (user_id, persona_id), All unions across the user's personas.
  Implemented as one raw SQL query with a LEFT JOIN to a fact-count
  subquery and ORDER BY tied to the chosen sort, so count-sort needs
  no second round trip.

The agent's existing list_entities call site is unchanged — it
doesn't need persona-scoped counts and the trait method stays cheap.
EntitySummary grows an Option<i64> fact_count (skip_serializing_if
none) so PATCH responses stay shaped as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 16:04:13 -04:00
Cameron Cordes 0e2b18224f knowledge: pre-delete relational facts so entity delete succeeds
DELETE /knowledge/entities/{id} was 500ing on any entity that was the
object of a relational fact. entity_facts.object_entity_id has
ON DELETE SET NULL, but the table also has
CHECK (object_entity_id IS NOT NULL OR object_value IS NOT NULL) —
purely relational facts (subject + predicate + object_entity_id, no
object_value, like "Alice is_friend_of Bob") would have both NULL
after SET NULL fired, the CHECK would abort, and the whole DELETE
would fail with a CHECK violation. The user just saw QueryError
because the DAO swallowed the diesel error string.

Wrap delete_entity in a transaction that first deletes facts where
the entity is the object AND object_value is null, then deletes the
entity. Surviving siblings (typed facts about the entity as subject)
are CASCADE'd by the FK as before. Also start surfacing the actual
diesel error in a warn log before collapsing to DbErrorKind so future
similar issues don't masquerade as the opaque QueryError.

A schema-level fix (changing object FK to ON DELETE CASCADE via a
table-rebuild migration) is the cleaner long-term resolution and is
slated for Phase 2; the DAO-side pre-delete is sufficient and less
invasive in the meantime.

Test pins the contract: a relational fact pointing at the deleted
entity is removed, an unrelated typed fact about an unrelated entity
survives, and the entity itself is deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 15:44:38 -04:00
Cameron Cordes d7aee4f228 knowledge: cosine dedup, fact create endpoint, recall nudge
Phase 1 of the knowledge curation work. Three small server-side changes
to support an Apollo-side curation surface and reduce the agent's near-
duplicate output rate going forward:

- upsert_entity grows an embedding-cosine fallback after the exact name
  match misses. New entities whose embedding sits above
  ENTITY_DEDUP_COSINE_THRESHOLD (default 0.92) against any same-type
  active entity collapse onto the existing row. Eliminates the Sarah /
  Sara / Sarah J. trio the FTS5 prefix check was missing.
- POST /knowledge/facts symmetric with the existing PATCH/DELETE so the
  curation UI can create facts directly. Persona-scoped via X-Persona-Id;
  validates subject (and optional object) entity existence; reuses
  KnowledgeDao::upsert_fact so corroboration semantics match the agent
  path.
- One sentence in build_system_content telling the agent to call
  recall_entities before store_entity when a name resembles something
  already known. Cheap; complements the DAO-layer guard.

Includes upsert_entity_collapses_near_duplicate_by_embedding test
covering both the collapse-on-near-match path and the don't-collapse-on-
unrelated-embedding path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 15:16:05 -04:00
Cameron Cordes 08a5f46be1 chat: scope insight lookup by library_id to fix regen-shadow bug
When a photo exists in more than one library and the user
regenerates its insight from library A's chat, the regenerate
streams cleanly, store_insight flips library A's old row to
is_current=false, and inserts a new is_current=true row tagged
(library A, rel_path). On the next history fetch the user sees
their old transcript — the regenerate appears to vanish.

The cause: get_insight(file_path) filters on rel_path + is_current
only, so library B's untouched is_current=true row for the same
rel_path satisfies the query and gets returned by SQLite's .first()
ahead of A's new row. Because get_insight is also what
chat_turn_stream uses to decide bootstrap vs. continuation, the
next chat turn after the shadow hit also routes against the
wrong insight, so update_training_messages corrupts library B's
transcript with library A's chat.

Fix: add get_current_insight_for_library(library_id, file_path)
filtered on (library_id, rel_path, is_current=true) and route the
chat surface (load_history, chat_turn{,_stream}, rewind_history)
through it. load_history falls back to the cross-library
get_insight when the scoped lookup misses — preserves the
"scalar data merges across libraries" intent for the case where
the active library has no insight but another does. The path-only
get_insight stays for callers that don't have library context
(populate_knowledge, the photo-grid metadata fetch).

chat_history_handler stops dropping the parsed library on the
floor and threads it through. Single-library deploys see no
behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 14:03:41 -04:00
Cameron Cordes fbd769e475 personas: composite FK + built-in update guard
Two persona-infrastructure correctness fixes that go together because
the second one (FK with CASCADE) requires the first (preventing the
persona row from being mutated out from under its facts).

1. update_persona handler refuses name/systemPrompt edits to built-ins
   (409). includeAllMemories stays editable — that's a per-user
   preference, not the persona's identity. Mirrors the existing
   delete_persona guard. The DAO is intentionally permissive so the
   guard sits at the HTTP layer; persona_dao test pins that contract.

2. Migration 2026-05-10 adds user_id to entity_facts and a composite
   FK (user_id, persona_id) -> personas(user_id, persona_id) ON DELETE
   CASCADE. This closes two issues at once:

   - Persona orphans: deleting a custom persona used to leave its
     facts dangling forever, readable only via PersonaFilter::All.
     CASCADE now wipes them with the persona row.

   - Multi-user fact leakage: PersonaFilter::Single("default") used
     to surface every user's default-scoped facts. PersonaFilter is
     now { user_id, persona_id } and all read paths
     (get_facts_for_entity, list_facts, get_recent_activity) filter
     on user_id first. upsert_fact's dedup key extends to user_id so
     identical claims under shared persona names from different
     users no longer corroborate-bump each other's confidence.

   - user_id threads from Claims.sub.parse::<i32>().unwrap_or(1) at
     the chat / insight handlers through ChatTurnRequest, the
     streaming agentic loop, execute_tool, and into the leaf tools
     (tool_store_fact, tool_recall_facts_for_photo). The ".unwrap_or(1)"
     accommodates Apollo's service token whose sub is non-numeric on
     legacy mints.

   - Backfill picks the smallest user_id matching each legacy fact's
     persona_id so the FK holds for already-stored rows.

Five new knowledge_dao tests with FK-on connection: persona scoping
isolation, All-variant union per-user, dedup not crossing users,
CASCADE delete, FK rejection of unknown personas. Plus
dao_update_does_not_block_built_ins documenting where the
HTTP-layer guard lives.

Apollo coordinates separately — the matching changes there add the
/api/personas proxy and start sending persona_id on photo-chat turns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 13:30:35 -04:00
cameron 25233904aa Merge pull request 'personas: elevate to server with per-persona fact scoping' (#88) from feature/persona-knowledge-segmentation into master
Reviewed-on: #88
2026-05-10 03:44:26 +00:00
Cameron Cordes 9871c685b4 date-override: cargo fmt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 21:23:11 -04:00
Cameron Cordes 108bbeb029 date-override: union semantics across libraries + slash forms
The date-override path used to look up `image_exif` strictly by
`(library_id, rel_path)` with only the forward-slash form, while
`/image/metadata`'s `get_exif` falls back across libraries and tries
both slash forms. A photo whose row sat under a different library_id
than its filesystem-resolved one — or whose rel_path was stored with
backslashes — rendered fine in the modal but 404'd on save.

`set_manual_date_taken` / `clear_manual_date_taken` now share a
`locate_image_exif_row` helper that mirrors `get_exif`'s union
semantics (scoped lookup first, library-agnostic fallback by rel_path
in both slash forms), then update by primary key so the write hits
exactly the row read. Inner anyhow errors are logged with
`(library_id, rel_path)` so the next failure mode is debuggable.

Handler-side: `resolve_library_param` errors no longer silently fall
back to the primary library (which would have masked the original bug
with a different "row not found"); a malformed library param now
returns 400. New `DbErrorKind::NotFound` lets the handler distinguish
genuine misses (404) from real DB failures (500).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 21:21:25 -04:00
Cameron Cordes 3e2f36a748 personas: elevate to server with per-persona fact scoping
Move personas off the mobile client into ImageApi as first-class
records, and scope entity_facts by persona so each one builds its own
voice over a shared entity graph. The new include_all_memories flag
lets a persona opt back into the full hive-mind pool for human
browsing of /knowledge/*; agentic generation always stays in-voice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:59:20 -04:00
Cameron Cordes b42acbb3f3 fmt: cargo fmt sweep across drifted files
No behavior change — purely whitespace/line-break cleanup that had
accumulated since the last format run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:42:41 -04:00
Cameron Cordes e539c083c9 insight-chat: code-review polish on the tool-gating PR
- search_messages now delegates to search_messages_with_contact(.., None)
  so the two methods share a single HTTP path. Drops the dead-code
  warning and the ~30-line duplication.
- DailySummaryDao gains has_any_summaries (LIMIT 1 existence probe)
  used by current_gate_opts; the SELECT COUNT(*) get_total_summary_count
  added in the prior commit is removed (it had no other caller).
- current_gate_opts doc comment corrected to describe what the probes
  actually do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:07:57 -04:00
Cameron Cordes f50d32667b insight-chat: ToolGateOpts + per-tool description rewrites
Tools whose backing tables are empty (calendar, location_history,
daily_summaries) drop out of the catalog so the LLM doesn't waste
iteration budget calling them only to receive "no results found".
Vision and apollo gates already existed; this generalizes the pattern.

search_messages gains start_ts/end_ts/contact_id filters (date filter
is a client-side post-filter; SMS-API only accepts contact_id natively
on the search endpoint).

Descriptions follow a consistent convention: one sentence (what +
when), param semantics, examples for tools with non-obvious param
choices. No more all-caps headers, no more identity-prescriptive
language inside descriptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:56:58 -04:00
Cameron Cordes 7e1c4ab318 backfill_date_taken: surface the actual diesel error in warnings
The DAO swallowed every diesel::update failure as a flat
`anyhow!("Update error")`, then trace_db_call further reduced it to
`DbError { kind: UpdateError }`. Operators saw "update failed for lib
2 Snapchat/foo.mp4: DbError { kind: UpdateError }" with no clue why
(constraint violation? type mismatch? row vanished mid-flight? DB
locked?).

Two changes:
- Preserve the diesel error in the anyhow chain along with the input
  params (lib, rel_path, date_taken, source) so the cause is visible.
- Log the chain at warn-level inside the DAO before the trace wrapper
  collapses it to DbErrorKind::UpdateError, so the warning at the
  call site finally has something diagnosable next to it.
- Treat zero-row updates as a debug-level "row likely retired by the
  missing-file scan" rather than a hard failure — that case is benign
  and shouldn't poison the drain's error tally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:07:17 -04:00
Cameron Cordes 832b50d587 image_exif: manual date_taken override (set/clear endpoints)
Add `POST /image/exif/date` and `POST /image/exif/date/clear` so an
operator can correct a row whose canonical-date waterfall landed on the
wrong value (camera clock reset, fs_time fallback for a copied-from-
backup file, etc). New `original_date_taken` / `original_date_taken_source`
columns snapshot the prior value on first override so revert is lossless.

The waterfall source set is now `'exif' | 'exiftool' | 'filename' | 'fs_time' | 'manual'`.
The existing `idx_image_exif_date_backfill` partial index already filters
to `date_taken IS NULL OR date_taken_source = 'fs_time'`, so manual rows
are naturally excluded from the per-tick drain — no index change needed.

`ExifMetadata` now exposes `date_taken_source` + originals so a UI can
render "manually set; was X via filename".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:26:43 -04:00
Cameron Cordes 7f12890f4b memories: single-SQL rewrite + 20-year lookback
Replaces the EXIF-loop + WalkDir-fallback pipeline that powered
`/memories` with a single per-library SQL query
(`get_memories_in_window`) that uses `strftime('%m-%d' | '%W' | '%m',
date_taken, 'unixepoch', tz_offset)` for calendar matching in the
client's timezone, plus a `years_back` lower bound and a
no-future-dates upper bound. Returns only the matching rows; the
handler applies per-library `PathExcluder` post-query and sorts.

Drops:
- `collect_exif_memories` — replaced by the single SQL query.
- `collect_filesystem_memories` — the canonical-date pipeline now
  populates `date_taken` for every row at ingest, so the WalkDir
  fallback that scanned 14k+ files each request is no longer needed.
- `get_memory_date_with_priority` and friends — request-time waterfall
  superseded by `date_resolver` running at ingest. The associated
  three priority-tests are dropped; their replacement lives in
  `date_resolver::tests`.

On a ~14k-file library this drops `/memories` from 10–15 s
(dominated by `fs::metadata` per row) to single-digit ms.

Bumps `DEFAULT_YEARS_BACK` from 15 → 20 to surface deeper archives
on matching anniversaries.

Note vs. ISO weeks: the original Rust used `chrono::iso_week().week()`
for week-span matching. SQLite's `%W` is Monday-anchored but uses week
0 for days before the first Monday, so it can disagree with ISO at
year boundaries by ±1. Acceptable for nostalgia browsing.

Adds 3 new DAO tests covering month-span filter, library scoping, and
the unknown-span-token guard. Also adds a CLAUDE.md section describing
the canonical-date pipeline end-to-end and the new
`DATE_BACKFILL_MAX_PER_TICK` env var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 16:04:09 -04:00
Cameron Cordes 54e0635a98 date_backfill: per-tick drain for unresolved date_taken rows
Adds two ExifDao methods (`get_rows_needing_date_backfill` /
`backfill_date_taken`) and a `backfill_missing_date_taken` watcher pass
that runs on every tick alongside `backfill_unhashed_backlog`.

The drain queries the partial index for rows where `date_taken IS NULL`
or `date_taken_source = 'fs_time'`, batches up to
`DATE_BACKFILL_MAX_PER_TICK` paths (default 500), and feeds them through
`date_resolver::resolve_dates_batch` — a single exiftool subprocess
covers the whole tick. Rows that newly resolve to `exiftool` /
`filename` / `fs_time` get persisted via `backfill_date_taken` (touches
only `date_taken` + `date_taken_source` so EXIF / hash / perceptual
columns survive).

`filename`-sourced rows are intentionally not re-resolved — the regex
is authoritative when it matches and re-running exiftool wouldn't
change the answer. Files that have disappeared from disk are skipped
so a ghost row doesn't loop through the drain forever; the
missing-file scan in `library_maintenance` retires those separately.

Comes with two DAO unit tests (eligibility filter + column-isolation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 16:03:03 -04:00
Cameron Cordes 84326501a9 image_exif: add date_taken_source column
New nullable TEXT column tracks which step of the canonical-date
waterfall (kamadak-exif → exiftool → filename → fs_time) populated
`date_taken`. Lets a later per-tick drain re-resolve weak sources
(`fs_time`) once stronger ones become available, and gives the UI/debug
surface a way to answer "why does this photo show up under this date?".

Adds the column at all `InsertImageExif` construction sites with `None`
placeholders (the resolver wiring lands in a follow-up commit), and
extends the `update_exif` SET tuple so the column survives the GPS-write
re-read path. Partial index `idx_image_exif_date_backfill` is created
for the upcoming drain query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 15:57:49 -04:00
Cameron Cordes 67cf0c7f73 duplicates: folder-pair view of exact dups
Bucket exact-dup rows by (library_id, dirname) pair on each side, then
filter by coverage = shared / min(folder_a_total, folder_b_total) and
an absolute floor on shared count. Surfaces "this folder is mostly
contained in that folder" matches that the per-file EXACT view buries
under one row each — e.g. an old phone-backup tree shadowing the
organized library, or a topic-grouped folder duplicating a date-grouped
one within the same library.

New endpoint: GET /duplicates/folder-pairs?library=&include_resolved=
&min_coverage=&min_shared=. Cached 5 min keyed on (library, include_resolved);
the user-tunable thresholds filter the cached unfiltered pair list so
slider drags don't re-bucket. Shares the resolve / unresolve flow with
the existing tabs — the frontend fans out N parallel /resolve calls,
one per shared content_hash.

Folder names carry no signal (BMW lives under Night Photos, not BMW_backup),
so bucketing is purely on (library_id, dirname) co-occurrence in
exact-dup groups. Within-folder dups (same hash twice in the same
folder) are skipped — those belong to the EXACT tab.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:43:29 -04:00
Cameron Cordes 57b7bad086 duplicates: library-aware visibility — only hide a demoted row when its survivor is reachable
Soft-marked rows used to disappear from /photos globally, including
from a library-scoped view that didn't contain the survivor at all.
A user browsing lib A who'd promoted a file from lib B as the
survivor would silently lose visibility on their own copy in lib A,
even though lib B's file isn't reachable from lib A's view.

Library-scoped queries now keep a demoted row visible when its
survivor lives in a library outside the current scope. Implemented
as a NOT EXISTS subquery against the same image_exif table aliased
as `survivor`. The unscoped (all-libraries) view is unchanged — every
survivor is reachable, so demoted rows stay hidden as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:24:07 -04:00
Cameron Cordes 7584cd8792 duplicates: perceptual hash + soft-mark resolution + upload 409
Adds pHash + dHash columns alongside the existing blake3 content_hash so
near-duplicates (re-encoded, resized, format-converted copies) become
queryable. /duplicates/{exact,perceptual} return groups; /duplicates/
{resolve,unresolve} flip a duplicate_of_hash soft-mark on losing rows
and union perceptual-only tag sets onto the survivor. The default
/photos listing filters duplicate_of_hash IS NULL so demoted siblings
stop cluttering the grid; include_duplicates=true opts back in for
Apollo's review modal. Upload now hashes bytes pre-write and returns
409 with the canonical sibling when a file's bytes already exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:36:01 -04:00
Cameron Cordes fb4df4b195 style: cargo fmt sweep
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:01:00 -04:00