Ask HN: I need to deduplicate terabytes of redundant photo libraries.

pwg · on May 10, 2014

For exact duplicates, you can use something like sha256sum (http://linux.die.net/man/1/sha256sum) to acquire a hash of each file, and then use sort (http://linux.die.net/man/1/sort), cut (http://linux.die.net/man/1/cut) and uniq (http://linux.die.net/man/1/uniq) to get a list of duplicated hashes.

Once you have a list of duplicated hashes, you can use split (http://linux.die.net/man/1/split), paste (http://linux.die.net/man/1/paste), and egrep (http://unixhelp.ed.ac.uk/CGI/man-cgi?egrep) to reacquire a list of file-names containing duplicate content.
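
For anyone who would rather not chain the shell tools by hand, here is a minimal Python sketch of the same idea: hash every file, then keep only the hashes that occur more than once. It stands in for the sha256sum/sort/uniq pipeline and is not that pipeline itself. The function names `sha256_of` and `find_duplicate_groups` are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large photos never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root):
    """Return lists of paths under root that share the same SHA-256 digest."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link and its target are not reported as duplicates.
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[sha256_of(path)].append(path)
    # Only hashes seen more than once point at duplicated content.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicate_groups("/path/to/photos")` (a placeholder path) returns one list per set of identical files. Nothing is deleted at this stage.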

Then, if you trust the collision resistance of SHA-256, you can just delete all but one of the files in each group. If you are slightly paranoid, you can use cmp (http://linux.die.net/man/1/cmp) to compare the files byte-for-byte and remove only those that are exact duplicates.
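
The paranoid check can be sketched in Python as well, using the standard library's `filecmp.cmp` with `shallow=False` in place of cmp. It assumes you already have a list of paths you suspect are identical. The helper names and the `dry_run` flag are hypothetical; the first path in the list is arbitrarily treated as the copy to keep.

```python
import filecmp
import os


def confirmed_duplicates(group):
    """Return the paths in group that are byte-for-byte identical to group[0]."""
    keeper = group[0]
    # shallow=False forces a comparison of file contents, not just size and mtime.
    return [p for p in group[1:] if filecmp.cmp(keeper, p, shallow=False)]


def remove_duplicates(group, dry_run=True):
    """Delete confirmed duplicates of group[0]; with dry_run=True only report them."""
    doomed = confirmed_duplicates(group)
    if not dry_run:
        for path in doomed:
            os.remove(path)
    return doomed
```

Leaving `dry_run=True` for a first pass lets you read the list of files that would go before anything is actually removed.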

This would eliminate the exact duplicates, which from the sound of things might be a good portion of your duplicates. It won't help with near-duplicates (e.g., a cropped version of a larger image).

Chevalier · on May 11, 2014

WOW. Thanks! I'm coming from a non-technical background, but I'll give it a shot.

lazylizard · on May 10, 2014

Depending on the version of Windows you have, it may be built in: https://en.wikipedia.org/wiki/Single-instance_storage

Chevalier · on May 11, 2014

I had no idea. I'm running Windows 8.1... I'll see if I can find it. Thanks!