duplicates: filter low-entropy hashes + dHash double-check, fix backfill loop
The perceptual cluster was producing one giant first group that contained hundreds of unrelated images. Two causes:

- Solid-colour images (skies, black frames, monochrome scans) all hash to near-zero pHashes that sit at Hamming distance zero from each other.
- Single-link clustering on pHash alone is too permissive: a chain of weakly-similar images collapses into one cluster.

Fixed by skipping hashes whose popcount falls outside the [8, 56] band (uniform content) and requiring dHash agreement within threshold before unioning a candidate edge from the BK-tree. Two new tests pin both invariants.

Separate fix in the backfill bin: decode-failed rows kept phash_64=NULL and were re-pulled by every batch, infinite-looping on a queue of unbreakable formats. Persist a 0/0 sentinel on decode failure so the row leaves the candidate set; the all-zero hash is excluded from clustering by the same entropy filter, so it doesn't pollute results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
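The two clustering guards described in the message can be sketched roughly as follows. This is a minimal illustration, not the code in duplicates.rs: only the name `is_informative_hash` and the [8, 56] popcount band come from this commit, while `hamming`, `should_union`, the `(pHash, dHash)` tuple layout, and both threshold parameters are assumptions made for the example. It also assumes the entropy filter is applied to the pHash only.

```rust
/// Entropy filter: a 64-bit hash counts as informative when its popcount lies
/// in [8, 56]. Near-uniform images produce hashes that are almost all zeros or
/// almost all ones, which fall outside the band.
fn is_informative_hash(hash: u64) -> bool {
    (8..=56).contains(&hash.count_ones())
}

/// Hamming distance between two 64-bit hashes.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Hypothetical edge check: each argument is a (pHash, dHash) pair. An edge is
/// unioned only if both pHashes pass the entropy filter, the pHashes are close,
/// and the dHashes agree within their own threshold.
fn should_union(a: (u64, u64), b: (u64, u64), phash_max: u32, dhash_max: u32) -> bool {
    is_informative_hash(a.0)
        && is_informative_hash(b.0)
        && hamming(a.0, b.0) <= phash_max
        && hamming(a.1, b.1) <= dhash_max
}
```

Under these assumptions, two rows carrying the 0/0 sentinel never form an edge even though their Hamming distance is zero, because the entropy filter rejects them first.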
@@ -108,11 +108,7 @@ fn main() -> anyhow::Result<()> {
         // releases the GIL-equivalent. rayon's default thread pool
         // matches the host's logical-core count which is the right
         // ceiling for image_hasher's DCT pass.
-        let results: Vec<(
-            i32,
-            String,
-            FilePerceptualResult,
-        )> = rows
+        let results: Vec<(i32, String, FilePerceptualResult)> = rows
             .into_par_iter()
             .map(|(library_id, rel_path)| {
                 let abs = libs_by_id
@@ -158,13 +154,39 @@ fn main() -> anyhow::Result<()> {
                     }
                 }
                 FilePerceptualResult::DecodeFailed => {
-                    // Mark as "we tried" so the next run doesn't keep
-                    // hammering this file. We persist NULL/NULL —
-                    // unfortunately that leaves it eligible for the
-                    // next run. The honest fix is a separate "perceptual
-                    // hash attempted" timestamp; for now we accept the
-                    // re-attempt cost since decode-failure rate is low.
-                    total_decode_failures += 1;
+                    // Persist phash_64=0/dhash_64=0 as a "tried,
+                    // unhashable" sentinel so this row leaves the
+                    // `phash_64 IS NULL` candidate set and the
+                    // backfill doesn't infinite-loop on a queue of
+                    // unbreakable formats (HEIC, RAW, CMYK JPEGs,
+                    // truncated bytes). The all-zero hash is
+                    // explicitly excluded from clustering by
+                    // is_informative_hash in duplicates.rs, so it
+                    // won't pollute group output — it just becomes
+                    // invisible to the duplicate finder.
+                    log::debug!(
+                        "perceptual decode failed for {} (lib {}); marking unhashable",
+                        rel_path,
+                        library_id
+                    );
+                    match guard.backfill_perceptual_hash(
+                        &ctx,
+                        *library_id,
+                        rel_path,
+                        Some(0),
+                        Some(0),
+                    ) {
+                        Ok(_) => {
+                            total_decode_failures += 1;
+                        }
+                        Err(e) => {
+                            pb.println(format!(
+                                "persist error (decode-fail sentinel) for {}: {:?}",
+                                rel_path, e
+                            ));
+                            total_errors += 1;
+                        }
+                    }
                 }
                 FilePerceptualResult::MissingOnDisk => {
                     total_missing += 1;
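To illustrate why the sentinel ends the backfill loop, here is a small in-memory simulation. It is a sketch under stated assumptions, not the project's backfill code: `Row`, `run_backfill`, the `decodable` flag, the placeholder hash values, and the `persist_sentinel` switch are all hypothetical, and the candidate query is modelled simply as "rows whose `phash_64` is still `None`".

```rust
/// Hypothetical stand-in for a database row; the hash fields mirror the
/// phash_64 / dhash_64 columns named in the diff.
struct Row {
    phash_64: Option<u64>,
    dhash_64: Option<u64>,
    decodable: bool,
}

/// Runs batches until no candidates remain or `max_batches` is reached, and
/// returns the number of batches executed. With `persist_sentinel == false`
/// this mimics the old behaviour, where decode failures stayed NULL.
fn run_backfill(
    rows: &mut [Row],
    batch_size: usize,
    max_batches: usize,
    persist_sentinel: bool,
) -> usize {
    let mut batches = 0;
    while batches < max_batches {
        // Candidate set: rows that still have no perceptual hash.
        let candidates: Vec<usize> = rows
            .iter()
            .enumerate()
            .filter(|(_, r)| r.phash_64.is_none())
            .map(|(i, _)| i)
            .take(batch_size)
            .collect();
        if candidates.is_empty() {
            break;
        }
        for i in candidates {
            if rows[i].decodable {
                // Placeholder values standing in for real hashes.
                rows[i].phash_64 = Some(0x0F0F_0F0F_0F0F_0F0F);
                rows[i].dhash_64 = Some(0x00FF_00FF_00FF_00FF);
            } else if persist_sentinel {
                // "Tried, unhashable" sentinel: the row leaves the candidate set.
                rows[i].phash_64 = Some(0);
                rows[i].dhash_64 = Some(0);
            }
        }
        batches += 1;
    }
    batches
}
```

In this model, a single undecodable row keeps the loop running up to the `max_batches` cap when the sentinel is not persisted, while the sentinel variant stops as soon as every row has been attempted once.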