duplicates: filter low-entropy hashes + dHash double-check, fix backfill loop

The perceptual cluster was producing one giant first group that
contained hundreds of unrelated images. Two causes:
- Solid-colour images (skies, black frames, monochrome scans) all
  hash to near-zero pHashes, so they sit at Hamming distance zero
  from each other.
- Single-link clustering on pHash alone is too permissive — a chain
  of weakly-similar images all collapses into one cluster.

Fixed by skipping hashes outside the popcount [8, 56] band (uniform
content) and requiring dHash agreement within threshold before
unioning a candidate edge from the BK-tree. Two new tests pin both
invariants.
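A minimal sketch of the two invariants, with hypothetical helper names (`passes_entropy_filter`, `should_union`) and placeholder thresholds — the real code lives in the duplicates module:

```rust
/// Reject hashes whose popcount falls outside [8, 56]: near-uniform
/// images (solid skies, black frames) yield nearly all-zero or
/// all-one pHashes that collide at Hamming distance zero.
fn passes_entropy_filter(phash: u64) -> bool {
    (8..=56).contains(&phash.count_ones())
}

/// Hamming distance between two 64-bit hashes.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Only union a BK-tree candidate edge when BOTH hashes pass the
/// entropy filter AND the independent dHash agrees within its own
/// threshold; a single weak metric can no longer chain unrelated
/// images into one cluster.
fn should_union(
    phash_a: u64, phash_b: u64,
    dhash_a: u64, dhash_b: u64,
    phash_max: u32, dhash_max: u32,
) -> bool {
    passes_entropy_filter(phash_a)
        && passes_entropy_filter(phash_b)
        && hamming(phash_a, phash_b) <= phash_max
        && hamming(dhash_a, dhash_b) <= dhash_max
}

fn main() {
    // Solid-frame hash is excluded outright.
    assert!(!passes_entropy_filter(0));
    // A mixed hash passes.
    assert!(passes_entropy_filter(0x00FF_00FF_00FF_00FF));
    // pHash-close but dHash-far: the edge is rejected.
    assert!(!should_union(
        0x0F0F_0F0F_0F0F_0F0F, 0x0F0F_0F0F_0F0F_0F0E,
        0, u64::MAX, 4, 10,
    ));
    println!("ok");
}
```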

Separate fix for the backfill loop: decode-failed rows kept
phash_64=NULL and got re-pulled by every batch, infinite-looping on
a queue of undecodable formats. Persist a 0/0 sentinel on decode
failure so
the row leaves the candidate set; the all-zero hash is excluded
from clustering by the same entropy filter so it doesn't pollute
results.
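A sketch of the sentinel write, with hypothetical names (`phash_for_backfill`, `compute_phash`) and the dHash half of the 0/0 pair omitted for brevity:

```rust
/// Sentinel persisted when an image cannot be decoded; its popcount
/// of 0 falls below the [8, 56] entropy band, so clustering skips it.
const DECODE_FAILED_SENTINEL: u64 = 0;

/// Placeholder for the real DCT-based pHash computation.
fn compute_phash(_pixels: &[u8]) -> u64 {
    0x00FF_00FF_00FF_00FF
}

/// Value to persist into phash_64 for a backfill row. Previously a
/// decode failure left the column NULL, so the row was re-pulled by
/// every batch; the sentinel removes it from the candidate set.
fn phash_for_backfill(decode: Result<Vec<u8>, String>) -> u64 {
    match decode {
        Ok(pixels) => compute_phash(&pixels),
        Err(_) => DECODE_FAILED_SENTINEL,
    }
}

/// The same entropy filter used by the duplicate clusterer.
fn clusterable(phash: u64) -> bool {
    (8..=56).contains(&phash.count_ones())
}

fn main() {
    let h = phash_for_backfill(Err("undecodable format".into()));
    assert_eq!(h, DECODE_FAILED_SENTINEL);
    // The sentinel never forms cluster edges.
    assert!(!clusterable(h));
    println!("ok");
}
```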

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cameron Cordes
2026-05-03 18:08:05 -04:00
parent 7584cd8792
commit 7ca888e95d
4 changed files with 251 additions and 75 deletions


@@ -8,10 +8,9 @@ pub mod auth;
 pub mod bin_progress;
 pub mod cleanup;
 pub mod content_hash;
-pub mod perceptual_hash;
-pub mod duplicates;
 pub mod data;
 pub mod database;
+pub mod duplicates;
 pub mod error;
 pub mod exif;
 pub mod face_watch;
@@ -25,6 +24,7 @@ pub mod library_maintenance;
 pub mod memories;
 pub mod otel;
 pub mod parsers;
+pub mod perceptual_hash;
 pub mod service;
 pub mod state;
 pub mod tags;