checking for duplicate pdfs (with and without annotations)

Here’s a workflow for checking pdfs for duplicates (with and without consideration of annotations). I posted this on the Skim forum list and am re-posting here with some additions relevant to EF.

=====

Convert each pdf with embedded notes to pdf and then to pdf bundles. (For files in EF libraries, you can use these AppleScripts. For files outside of EF, use the [“skimalot” shell script](http://sourceforge.net/mailarchive/message.php?msg_id=28528620).)

Identify pdfs identical both in content and annotation. I found it efficient to do this in several steps:

  • Identify pdf bundles with matching hashes. (You can use the EF Remove Duplicates AppleScript for this step. I used File Buddy for this and the following “find duplicates” steps; other tools are listed below.) This “strict” duplicate test will find some but not all duplicates. (See Christiaan’s note below*) Select bundles for deletion based on location, modification date, or other metadata. (When deleting duplicates found in EF libraries, do so within the app.)

  • Identify Skim note “.skim” files with matching hashes. This will find some additional duplicates. This step (and the next) might improperly identify as duplicates pdfs with different content but identical annotations (e.g., a “Draft” text note), so check filenames or open the files. Select containing pdfd bundles for deletion. (This will also find any matching .skim files outside of bundles.)

  • Identify duplicate Skim note “.txt” files with matching hashes. This will identify Skim note sets identical except for note position. As above, select containing pdfd bundles for deletion.
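For anyone who would rather script the hash-matching steps above than use a duplicate finder, here is a minimal Python sketch. It is not the EF AppleScript or File Buddy's method; the function names are made up for illustration, and it assumes the `.skim` and `.txt` note files sit somewhere inside `.pdfd` bundle folders under a single root folder.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(root, pattern):
    """Group files under root matching pattern (e.g. "*.skim") by content hash.

    Returns only the groups with more than one member, each sorted by path.
    """
    groups = defaultdict(list)
    for p in Path(root).rglob(pattern):
        if p.is_file():
            groups[file_digest(p)].append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def containing_bundle(path, suffix=".pdfd"):
    """Return the nearest enclosing bundle folder, or None if the file is loose."""
    for parent in Path(path).parents:
        if parent.suffix == suffix:
            return parent
    return None
```

Calling `group_by_digest(root, "*.skim")` and then `group_by_digest(root, "*.txt")` mirrors the second and third steps. The sketch only reports candidates; as noted above, identical notes do not guarantee identical pdfs, so check filenames or open the files before deleting anything.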

Identify pdfs with identical content (irrespective of annotation):

  • Identify pdfs outside of bundles with matching hashes. As above, this “strict” test will miss some duplicates. Select for deletion as above.

  • Identify pdfs outside of bundles with matching size. This might yield some false positives, so check filenames. It might also miss some pdfs that match in appearance. Select for deletion as above.

  • Identify pdfs inside and outside bundles with matching size. This permits 1) deletion of some un-annotated pdfs for which there are annotated duplicates, and 2) identification of pdfds with different Skim notes. (This step will also find pdfs in non-pdfd bundles such as DEVONThink, OmniGraffle, Scrivener, and rtfd.)
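The size-matching steps above can be sketched in Python as well. This is only an illustration under assumptions: the function names are hypothetical, bundles are taken to be folders ending in `.pdfd`, and the byte-for-byte check is an extra confirmation step I added, not something the duplicate finders necessarily do in the same way.

```python
import filecmp
from collections import defaultdict
from pathlib import Path


def pdfs_by_size(root):
    """Map file size to the pdfs of that size, keeping sizes seen more than once."""
    by_size = defaultdict(list)
    for p in Path(root).rglob("*.pdf"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)
    return {size: sorted(ps) for size, ps in by_size.items() if len(ps) > 1}


def in_bundle(path, suffix=".pdfd"):
    """True if any enclosing folder looks like a bundle (name ends in suffix)."""
    return any(parent.suffix == suffix for parent in Path(path).parents)


def loose_vs_bundled(root):
    """Yield (loose_pdfs, bundled_pdfs) for each size shared inside and outside bundles."""
    for paths in pdfs_by_size(root).values():
        bundled = [p for p in paths if in_bundle(p)]
        loose = [p for p in paths if not in_bundle(p)]
        if bundled and loose:
            yield loose, bundled


def same_bytes(a, b):
    """Confirm a size match by comparing file contents byte for byte."""
    return filecmp.cmp(a, b, shallow=False)
```

Each pair from `loose_vs_bundled(root)` is a candidate for deleting the un-annotated copy. A size match where `same_bytes` is false is one of the false positives mentioned above and deserves a look at the filenames.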

Those are the basic steps. (I’ve used the above scheme to pare a collection of some 7000 pdfs down to 5000.) Beyond that, one can:

  • Identify pdfs that match in appearance using comparepdf, which does a pairwise comparison. The code is open source and should be extensible to an n-way comparison.

  • Identify pdfs with nearly duplicative text using a shell script based on pdftotext parsing and word count.

  • Compare pdfs visually side-by-side using diffpdf or various other tools.
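To give an idea of what the pdftotext-and-word-count approach might look like, here is a rough Python sketch. It is not the shell script referred to above. It assumes the `pdftotext` command-line tool is installed and on the PATH, and the similarity measure (shared word occurrences divided by the larger word total) is just one simple choice among many.

```python
import subprocess
from collections import Counter


def pdf_text(path):
    """Extract text by running `pdftotext <file> -`, which writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def word_counts(text):
    """Count whitespace-separated words, ignoring case."""
    return Counter(text.lower().split())


def similarity(text_a, text_b):
    """Share of word occurrences common to both texts, from 0.0 to 1.0."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    total = max(sum(counts_a.values()), sum(counts_b.values()))
    if total == 0:
        return 1.0  # two empty texts are treated as identical
    shared = counts_a & counts_b  # per-word minimum of the two counts
    return sum(shared.values()) / total
```

For two files, `similarity(pdf_text(a), pdf_text(b))` close to 1.0 suggests nearly duplicative text. Where to draw the line (say, 0.9) is a judgment call that depends on the collection, and scanned pdfs without a text layer will come out as empty.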

Mac duplicate finders other than File Buddy include Find Duplicate Files, DupeGuru, and GrupaDupa. In File Buddy, use the “Cleaning: Find Duplicate Folders” (for pdfd bundles) or “Find Duplicate Files” menu items with, as a minimum, “Data Fork” size and contents checked. Such tools ignore differences in Spotlight comments, labels, and openmeta tags.

* per Christiaan Hofman [on the Skim discussion list] re pdf hashes: “PDF data is far from uniquely determined by it’s content of information. So there is no reason why the data of the same PDF saved at different times will produce the exact same data (and when using different programs/libraries it will be even less unique). This is very different for plain text and RTF.”

humanengr

Thanks for the great info.

I'm new to the Duplicates AppleScript, and I followed the tips above. Things go well when I'm trying to make an annotation on pdf files.