Dupseek is an interactive command-line Perl program that finds and removes duplicate files.
A few strategies are possible for finding duplicate files in a big set, such as a heavily populated directory.
One of the most widely used consists of grouping files by size (files of different sizes cannot be identical) and then computing a short digital fingerprint (such as an MD5 checksum) for each file.
Files with different fingerprints are different, and files with the same fingerprint are very probably identical. To be sure, the remaining candidates can be compared byte by byte.
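The size-then-fingerprint strategy can be sketched as follows. This is an illustrative Python sketch of the general approach described above, not Dupseek's own code; function and parameter names are invented for the example.

```python
import hashlib
import os
from collections import defaultdict

def find_candidate_duplicates(paths, chunk=1 << 16):
    """Group files by size, then by MD5 digest.

    Illustrative sketch: files with a unique size are skipped outright,
    and only same-sized files are hashed and compared.
    """
    # First pass: group by size, since different sizes cannot match.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for group in by_size.values():
        if len(group) < 2:
            continue  # a unique size cannot have a duplicate
        # Second pass: fingerprint each same-sized file with MD5.
        by_digest = defaultdict(list)
        for p in group:
            h = hashlib.md5()
            with open(p, "rb") as f:
                while data := f.read(chunk):
                    h.update(data)
            by_digest[h.hexdigest()].append(p)
        duplicates.extend(g for g in by_digest.values() if len(g) > 1)
    return duplicates
```

Note that matching digests only make duplicates very probable; a final byte-by-byte comparison would remove the residual doubt.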
Here are some key features of "Dupseek":
· It starts by grouping files by size.
· It then reads small chunks from the files of each same-size group and compares them, splitting the group into smaller groups according to these comparisons.
· It continues with bigger and bigger chunks (up to a hard-coded size limit).
· It stops reading a file as soon as it ends up in a single-element group or has been read completely (which happens only for files that are very probably duplicates).
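The steps above can be sketched as an incremental refinement loop. This is a hypothetical Python illustration, not Dupseek's implementation: it reads chunks sequentially rather than sparsely, and the names and chunk sizes are assumptions made for the example.

```python
import os
from collections import defaultdict

def refine_groups(paths, start_chunk=4096, max_chunk=1 << 20):
    """Split same-sized files into groups by comparing growing chunks.

    Illustrative sketch of the incremental scheme described above:
    groups are split as soon as chunk contents diverge, and reading
    stops for single-element groups or at end of file.
    """
    # Start with one group per file size.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    final = []  # groups of very probable duplicates (read to EOF)
    for group in by_size.values():
        if len(group) < 2:
            continue  # nothing to compare
        handles = {p: open(p, "rb") for p in group}
        try:
            live, chunk = [group], start_chunk
            while live:
                next_live = []
                for g in live:
                    # Split the group by the content of the next chunk.
                    by_data = defaultdict(list)
                    for p in g:
                        by_data[handles[p].read(chunk)].append(p)
                    for data, sub in by_data.items():
                        if len(sub) < 2:
                            continue  # unique content: stop reading it
                        if data == b"":
                            final.append(sub)  # read completely: duplicates
                        else:
                            next_live.append(sub)
                live = next_live
                chunk = min(chunk * 2, max_chunk)  # grow up to the cap
        finally:
            for f in handles.values():
                f.close()
    return final
```

Because same-sized files reach end of file on the same read, a group that survives to an empty read has been compared over its whole length.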
This algorithm is much more efficient than competing tools when dealing with large files of the same size: when the files differ, reading usually stops after only a few reads.
Dupseek (and destroy) can be interrupted at any moment. The user is then presented with partial results and can either intervene manually or continue the reading and computation on a group-by-group basis. Since later reads sample each file sparsely, files that remain in the same group after many iterations are almost certainly identical, unless the differences are very small.
The script uses the following standard Perl modules:
· File::Find for directory recursion;
· IO::File for object-oriented file handles;
· Getopt::Std for option parsing.