indexer: prune EXCLUDED_DIRS at WalkDir time, extract enumerate_indexable_files

Synology drops `@eaDir/.../SYNOFILE_THUMB_*.jpg` files alongside every
photo. The face-detect pipeline already filters those out via
`face_watch::filter_excluded`, but the filter runs *after* the indexer
has already inserted rows into `image_exif`. Result: phantom rows whose
content_hash never matches a `face_detections` row, so the anti-join in
`list_unscanned_candidates` returns them every tick. They're filtered
out at runtime, no marker is written, and the cycle repeats forever —
log spam, wrong stats denominator, and on a real Synology library the
phantom rows balloon into the hundreds of thousands.
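
The loop described above can be modelled in a few lines. This is only an
illustrative sketch, not the project's code: `ExifRow`, `unscanned`, and `tick`
are hypothetical names, the tables are stood in for by a slice and a `HashSet`,
and the exclusion check is reduced to a substring match on `@eaDir/`.

```rust
use std::collections::HashSet;

/// One indexed row, modelled as (rel_path, content_hash). Hypothetical shape.
type ExifRow = (&'static str, &'static str);

/// Anti-join: rows whose content_hash has no entry in `face_detections`.
fn unscanned(image_exif: &[ExifRow], face_detections: &HashSet<&'static str>) -> Vec<ExifRow> {
    image_exif
        .iter()
        .copied()
        .filter(|(_, hash)| !face_detections.contains(hash))
        .collect()
}

/// One scan tick under the old behaviour. Returns how many candidates the
/// anti-join produced. Excluded candidates are dropped at runtime without a
/// marker being written, so they are still unmatched on the next tick.
fn tick(image_exif: &[ExifRow], face_detections: &mut HashSet<&'static str>) -> usize {
    let candidates = unscanned(image_exif, face_detections);
    let returned = candidates.len();
    for (rel_path, hash) in candidates {
        if rel_path.contains("@eaDir/") {
            continue; // filtered out, nothing recorded
        }
        face_detections.insert(hash); // real photo: marker written
    }
    returned
}
```

With one real photo and one `@eaDir` thumbnail indexed, the first tick sees two
candidates and every later tick keeps seeing the thumbnail row.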

Move the exclusion to the WalkDir pass, where filter_entry can prune
whole subtrees instead of walking and discarding leaves. Extract the
pre-existing 30-line walker chain in main.rs::process_new_files into
`file_scan::enumerate_indexable_files` so it's testable in isolation.
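
For readers unfamiliar with pruning at walk time, here is a minimal std-only
sketch of the idea. It is not the code in `file_scan`: it swaps walkdir's
`filter_entry` for a hand-written recursive `std::fs::read_dir` walk, the names
`is_excluded_dir` and `walk_pruned` are hypothetical, and exclusion is
simplified to an exact match on the directory name (the real EXCLUDED_DIRS
patterns are richer).

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Simplified exclusion check (assumption): true when the final component of
/// `path` exactly matches one of the excluded directory names.
fn is_excluded_dir(path: &Path, excluded: &[&str]) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map(|name| excluded.contains(&name))
        .unwrap_or(false)
}

/// Recursively collect regular files under `dir`. Excluded directories are
/// skipped before descending, so nothing beneath them is ever visited.
/// Symlinks are ignored because `DirEntry::file_type` does not follow them.
fn walk_pruned(dir: &Path, excluded: &[&str], out: &mut Vec<PathBuf>) {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(_) => return, // unreadable directory: skip it
    };
    for entry in entries.filter_map(|e| e.ok()) {
        let path = entry.path();
        let file_type = match entry.file_type() {
            Ok(file_type) => file_type,
            Err(_) => continue,
        };
        if file_type.is_dir() {
            if is_excluded_dir(&path, excluded) {
                continue; // prune the whole subtree
            }
            walk_pruned(&path, excluded, out);
        } else if file_type.is_file() {
            out.push(path);
        }
    }
}
```

Calling `walk_pruned(root, &["@eaDir"], &mut out)` never opens an `@eaDir`
directory, which is the difference from walking everything and discarding
leaves afterwards.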

Six tests cover the bug (eadir prune), nested patterns, absolute-under-base
syntax, non-media filtering, modified_since semantics, and forward-slash
rel_path normalization.
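
Two of the behaviours under test are small enough to restate as standalone
functions. The sketch below mirrors the logic of the closures removed from
`process_new_files`; the function names `normalized_rel_path` and
`passes_modified_since` are hypothetical and are not claimed to exist in
`file_scan`.

```rust
use std::path::Path;
use std::time::SystemTime;

/// Forward-slash relative path of `file` under `base`. Returns None when
/// `file` is not under `base` or the relative part is not valid UTF-8.
fn normalized_rel_path(base: &Path, file: &Path) -> Option<String> {
    let rel = file.strip_prefix(base).ok()?;
    Some(rel.to_str()?.replace('\\', "/"))
}

/// modified_since semantics as in the removed closure: keep the file when
/// there is no cutoff, when its mtime could not be read, or when the mtime
/// is at or after the cutoff.
fn passes_modified_since(modified: Option<SystemTime>, since: Option<SystemTime>) -> bool {
    match (since, modified) {
        (None, _) => true,
        (Some(_), None) => true, // unknown mtime: include to be safe
        (Some(since), Some(modified)) => modified >= since,
    }
}
```

Note that the cutoff comparison is inclusive, and a file whose metadata cannot
be read is kept rather than dropped.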

Out of scope: the other WalkDir callers in main.rs that don't yet apply
EXCLUDED_DIRS (thumbnail gen at 1309, media scan at 1377, video
playlist scan at 1685, and two nested walks at 1709 / 1743). Those go
in a separate audit PR.

Operator note: existing phantom rows still need a one-shot cleanup —
  DELETE FROM face_detections WHERE content_hash IN (
    SELECT content_hash FROM image_exif
    WHERE rel_path LIKE '%/@eaDir/%' OR rel_path LIKE '@eaDir/%'
  );
  DELETE FROM image_exif WHERE rel_path LIKE '%/@eaDir/%' OR rel_path LIKE '@eaDir/%';
Run before attaching a fresh Synology-sourced library.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cameron Cordes
2026-04-30 19:29:37 +00:00
parent f358e83050
commit 5bf49568f1
3 changed files with 206 additions and 31 deletions

@@ -1974,37 +1974,11 @@ fn process_new_files(
     let thumbnail_directory = Path::new(&thumbs);
     let base_path = Path::new(&library.root_path);
-    // Collect all image and video files, optionally filtered by modification time
-    let files: Vec<(PathBuf, String)> = WalkDir::new(base_path)
-        .into_iter()
-        .filter_map(|entry| entry.ok())
-        .filter(|entry| entry.file_type().is_file())
-        .filter(|entry| {
-            // Filter by modification time if specified
-            if let Some(since) = modified_since {
-                if let Ok(metadata) = entry.metadata()
-                    && let Ok(modified) = metadata.modified()
-                {
-                    return modified >= since;
-                }
-                // If we can't get metadata, include the file to be safe
-                return true;
-            }
-            true
-        })
-        .filter(|entry| is_image(entry) || is_video(entry))
-        .filter_map(|entry| {
-            let file_path = entry.path().to_path_buf();
-            // Canonical rel_path is forward-slash regardless of OS so DB
-            // comparisons against the batch EXIF lookup line up.
-            let relative_path = file_path
-                .strip_prefix(base_path)
-                .ok()?
-                .to_str()?
-                .replace('\\', "/");
-            Some((file_path, relative_path))
-        })
-        .collect();
+    // Walk, prune EXCLUDED_DIRS subtrees, and apply image/video + modified_since
+    // filters. See `file_scan` for why exclusion has to happen at WalkDir
+    // time (filter_entry) rather than at face-detect time.
+    let files: Vec<(PathBuf, String)> =
+        image_api::file_scan::enumerate_indexable_files(base_path, excluded_dirs, modified_since);
     if files.is_empty() {
         debug!("No files to process");