Hi all,
As a result of multiple laptops/backup external drives for both me and my wife over the years, I have nearly two terabytes of redundant photo libraries... emerging from about 300-500GB of actual, individual photos.
The libraries have been painstakingly pulled from iPhoto packages and recovered from crashed hard drives. Unfortunately, many of them have different filenames (from recovery software) or different sizes (from thumbnail duplicates). I've finally built a Windows desktop with the room and power to host/sort these pictures, but I don't know where to begin.
Does anyone have a good place to start with these photos? Can a program like Visipics even begin to make a dent? When I tried with an earlier MacBook Pro and an external drive, the "Duplicate Annihilator" app couldn't handle the load... and I wound up stuck with even more duplicates.
I'm just relieved that these problems are, by and large, a legacy of the cloudless past. Once deduplicated, I'll just stick my photos on Dropbox or GDrive and never worry again.
Once you have a list of duplicated hashes, you can use split (http://linux.die.net/man/1/split), paste (http://linux.die.net/man/1/paste), and egrep (http://unixhelp.ed.ac.uk/CGI/man-cgi?egrep) to recover the list of filenames whose content is duplicated.
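If the shell pipeline is awkward on a Windows box, the same grouping can be sketched in Python. This is only a rough sketch, not a tested tool: the function names `sha256_of` and `find_duplicate_groups` are made up for illustration, and it assumes every file under the folder is readable.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root):
    """Map each digest seen more than once under `root` to the sorted paths sharing it."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[sha256_of(path)].append(path)
    # Keep only digests shared by two or more files.
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_duplicate_groups` on the top-level photo folder gives one entry per set of identical files. Hashing a couple of terabytes will take a while, since every byte has to be read once.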
Then, if you trust the hash collision resistance of sha256, you can just delete all but one of those files. If you are slightly paranoid, you can use cmp (http://linux.die.net/man/1/cmp) to compare the files byte-for-byte and remove those that are exact duplicates.
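For the paranoid route, here is a minimal Python sketch of the byte-for-byte check, using the standard library's `filecmp.cmp` in place of the `cmp` command. The helper name `redundant_copies` is hypothetical, and it only reports which files look safe to remove; it deliberately deletes nothing.

```python
import filecmp


def redundant_copies(paths):
    """Given paths that share a hash, return those byte-identical to an earlier kept file."""
    kept = []       # first file seen for each distinct content
    redundant = []  # files whose bytes match something already in `kept`
    for path in paths:
        # shallow=False forces a comparison of file contents, not just os.stat() metadata.
        if any(filecmp.cmp(path, original, shallow=False) for original in kept):
            redundant.append(path)
        else:
            kept.append(path)
    return redundant
```

Pass it the list of filenames for one duplicated hash. Reviewing the returned list before deleting anything is probably wise, at least for the first few batches.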
This would eliminate the exact duplicates, which from the sound of things might be a good portion of your duplicates. It won't help with near-duplicates (e.g., a cropped or resized version of a larger image).