file_types: filter macOS AppleDouble + .DS_Store from media predicates #99

Merged
cameron merged 1 commit from feature/filter-fs-metadata into master 2026-05-18 17:12:42 +00:00
Owner

Symptom: Apollo's logs showed bursts of 422 decode_failed from
ImageApi's CLIP backfill — e.g. ._DSC_2182-S.jpg. macOS writes
._<name> AppleDouble sidecars when copying to non-HFS volumes
(SMB, FAT, exFAT), and they carry the original file's extension
even though their bytes are extended-attribute metadata, not the
image. ImageApi's walker matched them via the extension predicate,
sent them through the ingest pipeline, and accumulated failed rows
in face_detections + clip_embedding while pinning Apollo's eviction
timer with the 422 burst.

Fix: predicate-level guard in is_image_file / is_video_file (and
by inheritance is_media_file). Every walker that already gates on
these (face_watch, backfill, clip_watch, watcher, files,
probe_clip_search) inherits the skip without per-callsite edits.
Narrow scope on purpose — ._* prefix + the exact .DS_Store
basename — rather than blanket dotfile filtering, because a user
could plausibly name a cover image .cover.jpg.
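A minimal sketch of what such a predicate-level guard could look like, not the actual ImageApi code. The helper names `is_fs_metadata` and `has_extension` and the extension lists are assumptions made for illustration; only the `._` prefix rule, the exact `.DS_Store` match, and the three predicate names come from the description above.

```rust
use std::path::Path;

/// Hypothetical helper: true for macOS filesystem metadata that should never
/// count as media, i.e. AppleDouble sidecars (`._<name>`) and the exact
/// `.DS_Store` basename. Other dotfiles such as `.cover.jpg` are not matched.
fn is_fs_metadata(path: &Path) -> bool {
    match path.file_name().and_then(|name| name.to_str()) {
        Some(name) => name.starts_with("._") || name == ".DS_Store",
        // No basename, or a non-UTF-8 one: not treated as metadata here.
        None => false,
    }
}

/// Hypothetical helper: case-insensitive extension check.
fn has_extension(path: &Path, allowed: &[&str]) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| allowed.iter().any(|a| a.eq_ignore_ascii_case(ext)))
        .unwrap_or(false)
}

/// The extension lists below are placeholders, not ImageApi's real lists.
fn is_image_file(path: &Path) -> bool {
    !is_fs_metadata(path) && has_extension(path, &["jpg", "jpeg", "png"])
}

fn is_video_file(path: &Path) -> bool {
    !is_fs_metadata(path) && has_extension(path, &["mp4", "mov"])
}

/// Inherits the guard because it only delegates to the two predicates above.
fn is_media_file(path: &Path) -> bool {
    is_image_file(path) || is_video_file(path)
}
```

Because the guard sits inside the predicates, any walker that already filters with `is_media_file` would skip `._DSC_2182-S.jpg` while still accepting `.cover.jpg`.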

Existing rows are not cleaned by this change. To purge what
already accumulated (one-shot, run from your DB shell after
deploying):

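-- A bare _ in LIKE matches any single character, so it is escaped here
-- to match only the literal ._ AppleDouble prefix (not .cover.jpg).
DELETE FROM image_exif
WHERE file_path LIKE '%/.!_%' ESCAPE '!' OR file_path LIKE '%/.DS_Store';
DELETE FROM face_detections
WHERE rel_path LIKE '%/.!_%' ESCAPE '!' OR rel_path LIKE '%/.DS_Store';
DELETE FROM tagged_photo
WHERE file_path LIKE '%/.!_%' ESCAPE '!' OR file_path LIKE '%/.DS_Store';
DELETE FROM favorites
WHERE path LIKE '%/.!_%' ESCAPE '!' OR path LIKE '%/.DS_Store';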

The maintenance pipeline's missing-file scan would NOT catch these
on its own — the files exist on disk (they're real macOS metadata,
just not images), so stat() returns Ok and the row sticks.
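To illustrate why an existence-based scan leaves these rows in place, here is a rough sketch of such a check. The function name `missing_paths` and its shape are hypothetical and are not taken from the maintenance pipeline; the only point is that a successful `std::fs::metadata` call keeps a path out of the "missing" set regardless of what the file contains.

```rust
use std::fs;
use std::path::PathBuf;

/// Hypothetical missing-file scan: returns the indexed paths whose rows
/// would be dropped because the file can no longer be stat'ed.
/// An AppleDouble sidecar that still exists on disk stats successfully,
/// so it never appears in the result.
fn missing_paths(indexed: &[PathBuf]) -> Vec<PathBuf> {
    indexed
        .iter()
        .filter(|path| fs::metadata(path).is_err())
        .cloned()
        .collect()
}
```

A sidecar such as `._DSC_2182-S.jpg` passes this check as long as it exists, which is why the one-shot purge is needed.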

cameron added 1 commit 2026-05-18 17:05:23 +00:00
Symptom: Apollo's logs showed bursts of 422 decode_failed from
ImageApi's CLIP backfill — e.g. `._DSC_2182-S.jpg`. macOS writes
`._<name>` AppleDouble sidecars when copying to non-HFS volumes
(SMB, FAT, exFAT), and they carry the original file's extension
even though their bytes are extended-attribute metadata, not the
image. ImageApi's walker matched them via the extension predicate,
sent them through the ingest pipeline, and accumulated failed rows
in face_detections + clip_embedding while pinning Apollo's eviction
timer with the 422 burst.

Fix: predicate-level guard in is_image_file / is_video_file (and
by inheritance is_media_file). Every walker that already gates on
these (face_watch, backfill, clip_watch, watcher, files,
probe_clip_search) inherits the skip without per-callsite edits.
Narrow scope on purpose — `._*` prefix + the exact `.DS_Store`
basename — rather than blanket dotfile filtering, because a user
could plausibly name a cover image `.cover.jpg`.

Existing rows are not cleaned by this change. To purge what
already accumulated (one-shot, run from your DB shell after
deploying):

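  -- A bare _ in LIKE matches any single character, so it is escaped here
  -- to match only the literal ._ AppleDouble prefix (not .cover.jpg).
  DELETE FROM image_exif
   WHERE file_path LIKE '%/.!_%' ESCAPE '!' OR file_path LIKE '%/.DS_Store';
  DELETE FROM face_detections
   WHERE rel_path LIKE '%/.!_%' ESCAPE '!' OR rel_path LIKE '%/.DS_Store';
  DELETE FROM tagged_photo
   WHERE file_path LIKE '%/.!_%' ESCAPE '!' OR file_path LIKE '%/.DS_Store';
  DELETE FROM favorites
   WHERE path LIKE '%/.!_%' ESCAPE '!' OR path LIKE '%/.DS_Store';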

The maintenance pipeline's missing-file scan would NOT catch these
on its own — the files exist on disk (they're real macOS metadata,
just not images), so stat() returns Ok and the row sticks.
cameron merged commit c3c6cd03db into master 2026-05-18 17:12:42 +00:00
cameron deleted branch feature/filter-fs-metadata 2026-05-18 17:12:43 +00:00
Reference: Apps/ImageApi#99