duplicates: filter low-entropy hashes + dHash double-check, fix backfill loop
The perceptual cluster was producing one giant first group that contained hundreds of unrelated images. Two causes:

- Solid-colour images (skies, black frames, monochrome scans) all hash to near-zero pHashes that sit at Hamming distance zero from each other.
- Single-link clustering on pHash alone is too permissive: a chain of weakly similar images collapses into one cluster.

Fixed by skipping hashes outside the popcount [8, 56] band (uniform content) and requiring dHash agreement within threshold before unioning a candidate edge from the BK-tree. Two new tests pin both invariants.

Separate fix in the backfill bin: decode-failed rows kept phash_64=NULL and got re-pulled by every batch, looping forever on a queue of formats that cannot be decoded. Persist a 0/0 sentinel on decode failure so the row leaves the candidate set; the all-zero hash is excluded from clustering by the same entropy filter, so it doesn't pollute results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
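For readers without the full diff, the two clustering gates and the decode-failure sentinel described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `Hashes` struct, the function names, the threshold parameters, and the brute-force pair scan (standing in for the BK-tree candidate lookup) are all assumptions. Only the popcount band [8, 56] and the all-zero sentinel come from the message.

```rust
/// Hypothetical per-image record: one 64-bit pHash and one 64-bit dHash.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Hashes {
    phash: u64,
    dhash: u64,
}

/// Popcount band from the commit message; hashes outside it are treated
/// as uniform (low-entropy) content and never clustered.
const MIN_POPCOUNT: u32 = 8;
const MAX_POPCOUNT: u32 = 56;

fn is_clusterable(phash: u64) -> bool {
    (MIN_POPCOUNT..=MAX_POPCOUNT).contains(&phash.count_ones())
}

fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// An edge is accepted only if both pHashes pass the entropy filter and
/// both hash families agree. The thresholds are assumed parameters.
fn should_union(a: Hashes, b: Hashes, phash_max: u32, dhash_max: u32) -> bool {
    is_clusterable(a.phash)
        && is_clusterable(b.phash)
        && hamming(a.phash, b.phash) <= phash_max
        && hamming(a.dhash, b.dhash) <= dhash_max
}

/// Assumed shape of the backfill fix: a failed decode yields the 0/0
/// sentinel instead of leaving the row without a hash.
fn hashes_or_sentinel(decoded: Option<Hashes>) -> Hashes {
    decoded.unwrap_or(Hashes { phash: 0, dhash: 0 })
}

/// Union-find root lookup with path halving.
fn find(parent: &mut [usize], mut i: usize) -> usize {
    while parent[i] != i {
        let grandparent = parent[parent[i]];
        parent[i] = grandparent;
        i = grandparent;
    }
    i
}

/// Returns one cluster label (root index) per input item.
/// The O(n^2) scan is a stand-in for BK-tree candidate generation.
fn cluster(items: &[Hashes], phash_max: u32, dhash_max: u32) -> Vec<usize> {
    let mut parent: Vec<usize> = (0..items.len()).collect();
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            if should_union(items[i], items[j], phash_max, dhash_max) {
                let ri = find(&mut parent, i);
                let rj = find(&mut parent, j);
                if ri != rj {
                    parent[rj] = ri;
                }
            }
        }
    }
    (0..items.len()).map(|i| find(&mut parent, i)).collect()
}
```

Under these assumptions, two sentinel rows never join a cluster even though their Hamming distance is zero, because `is_clusterable(0)` is false.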
@@ -8,10 +8,9 @@ pub mod auth;
 pub mod bin_progress;
 pub mod cleanup;
 pub mod content_hash;
-pub mod perceptual_hash;
-pub mod duplicates;
 pub mod data;
 pub mod database;
+pub mod duplicates;
 pub mod error;
 pub mod exif;
 pub mod face_watch;
@@ -25,6 +24,7 @@ pub mod library_maintenance;
 pub mod memories;
 pub mod otel;
 pub mod parsers;
+pub mod perceptual_hash;
 pub mod service;
 pub mod state;
 pub mod tags;