Perfect Match 0.4.0

A small and fast commandline utility for finding duplicate files

GPL v3 
Tomasz Muras
Perfect Match (or pmatch for short) is a small and fast commandline utility for finding duplicate files.

Some time ago I was looking for a utility that would find (and possibly remove) duplicate files. I found a few, but none was sophisticated enough for what I wanted... hence the idea of "Perfect Match"! My main requirements were:

- quick comparison - i.e. compare files by size first, then by hash
- apply some logic when choosing which duplicate should be removed
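
The "size first, then hash" idea can be sketched in a few lines of Ruby. This is only an illustration of the technique, not pmatch's actual code: the helper name duplicate_groups is made up here, and the sketch uses Ruby's built-in Digest::MD5 where pmatch apparently calls an external md5sum utility (see the --md5-path option).

```ruby
require 'digest/md5'

# Hypothetical helper (not part of pmatch): returns an array of groups,
# each group being an array of paths whose contents hash to the same MD5.
def duplicate_groups(paths)
  # Pass 1 (cheap): bucket by size. A file with a unique size cannot
  # have a duplicate, so it is never hashed.
  same_size = paths.group_by { |path| File.size(path) }
                   .values
                   .select { |group| group.size > 1 }

  # Pass 2 (expensive): hash only the remaining candidates.
  same_size.flat_map do |group|
    group.group_by { |path| Digest::MD5.file(path).hexdigest }
         .values
         .select { |same| same.size > 1 }
  end
end
```

Calling it on a list of regular files, for example duplicate_groups(Dir.glob('**/*').select { |p| File.file?(p) }), would return one array per set of identical files.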


At the moment there is no installer available for pmatch. To install it manually:

- Log in to your system as root
- Install log4r: gem install log4r
- Download pmatch
- Copy it to a directory listed in your $PATH variable: cp pmatch /usr/local/bin/
- Fix permissions: chmod 755 /usr/local/bin/pmatch
- Switch back to your normal user - you should now be able to run pmatch from anywhere

Simple usage

The simplest case: pmatch . generates commands that remove duplicate files from the current directory (and all subdirectories):

% pmatch .
rm ./path/to/file1
rm ./yet/another/duplicate/file

pmatch itself will not delete any files, but (by default) it generates a script that removes the duplicates. The script affects all but one file in each set of duplicates, so in theory you should not lose any data. If you trust pmatch, you can pipe its output to bash for immediate execution: pmatch . | bash

Custom script

If you want to do something other than delete files, you may find the -c option useful. Say we would like to copy duplicate files to the /tmp directory:

% pmatch -c 'cp #{d} /tmp' .
cp ./path/to/file1 /tmp
cp ./yet/another/duplicate/file /tmp

Don't use this particular command on a real system - it does not handle duplicate file names, so files that share a name would overwrite one another in /tmp.

After the -c switch, provide a string that will be generated for every (but one!) duplicate file. Do not quote the filename; it will be written out with all the weird characters escaped. Pay attention to #{d} - this fragment will be replaced with the name of the current file. d is short for duplicate - it gives you access to the file currently being processed as a duplicate. For each set of one or more duplicates there is exactly one file marked as original - you can access its filename with #{o}. The rules for deciding which file is the original are described below.

You can use #{o} to generate commands that need both filenames - for example to replace duplicate files with symlinks:

pmatch tmp -c"rm #{d} && ln -s #{o.fullpath} #{d}"

Instead of #{o} I have used #{o.fullpath}, which returns the full path of the file rather than a relative one. By default #{o} (and likewise #{d}) returns the path relative to the directory you passed to pmatch. A relative link target could produce broken symlinks - the full path fixes that problem.
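
To make the behaviour of the -c template easier to picture, here is a rough Ruby sketch of how such a template could be expanded. Everything in it is an assumption made for illustration: the MatchedFile class and the render function are invented names, and pmatch's real implementation may differ. Only the observable behaviour (escaped filenames, relative #{d} and #{o}, absolute #{o.fullpath}) comes from the description above.

```ruby
require 'shellwords'

# Hypothetical stand-in for the objects a template sees as d and o.
class MatchedFile
  def initialize(path)
    @path = path
  end

  # Path as it was found relative to the scanned directory, shell-escaped.
  # Interpolating the object (#{d}) calls this method.
  def to_s
    Shellwords.escape(@path)
  end

  # Absolute path, shell-escaped.
  def fullpath
    Shellwords.escape(File.expand_path(@path))
  end
end

# Expands a -c style template by evaluating it as a double-quoted Ruby
# string. The parameters d and o are visible to the template through the
# binding. A template that itself contains a double quote would need
# extra care, which this sketch does not attempt.
def render(template, d, o)
  eval(%("#{template}"), binding)
end
```

With d = MatchedFile.new('./my file.txt'), the call render('cp #{d} /tmp', d, o) produces a command with the space escaped, which is why the template itself should not quote the filename. Because the template is evaluated as Ruby code, it should only ever come from someone you trust.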

Which file to mark as 'original'?

Identical files are grouped together. One file in each group is then marked as 'original'; the rest are duplicates. Perfect Match lets you influence which file becomes the original using directory priorities and a set of 'secondary choices'.

You can provide more than one path to pmatch. This causes more directories to be scanned, and it also affects the way pmatch chooses which duplicate should not be marked for deletion. The order of the directories dictates their priority. If you run pmatch dirA dirB and the same file exists in both dirA and dirB, the one from dirA will be marked as original.

Let's say you want to clean up your collection of OGGs. You have thousands of them stored in the ~/music directory - and you suspect it's full of duplicates. Part of your collection is nicely sorted in ~/music/sorted and the rest is dropped into ~/music/rest. To remove duplicate files, but only from ~/music/rest, you can simply run:

pmatch music/sorted music/rest

That leads to two problems:

- what if all the duplicate files are in the lower-priority directory (music/rest)?
- what if there is more than one copy in music/sorted?

You can either ignore the problem if you don't really care - in which case a random file will be marked as 'original' - or you can fine-tune the script using secondary options. Here are your choices:

short - pmatch will prefer files with a shorter filename. So given aaa.txt and aaaaa.txt in the same folder, aaa.txt will be marked as original (and aaaaa.txt possibly deleted by your generated script)

long - like short, but pmatch will prefer files with a longer filename

deep - prefer files that are deeper in the filesystem hierarchy. E.g. for dir1/dir2/dir3/file1.txt and dir4/file2.txt, pmatch will mark file1.txt as original

shallow - prefer files that are not 'deeply' located

dirfull - prefer files that are located in a directory with many other files

dirempty - prefer files that have the fewest 'siblings'

random - this one is automatically added after all other given secondary choices, to make sure exactly one file is marked as original
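
One way to picture how directory priority and secondary choices combine is as a sort key: directory position first, then each secondary choice in the order given. The Ruby sketch below is an assumption about the logic, not pmatch's actual code. The names SECONDARY, dir_priority and split_group are invented, "filename" is taken to mean the base name, and only four of the choices are modelled (dirfull, dirempty and random are left out).

```ruby
# Hypothetical scores: the file with the lowest score is preferred
# as the 'original'.
SECONDARY = {
  'short'   => ->(path) { File.basename(path).length },
  'long'    => ->(path) { -File.basename(path).length },
  'deep'    => ->(path) { -path.split('/').length },
  'shallow' => ->(path) { path.split('/').length }
}.freeze

# Position of the first scanned directory that contains path
# (dirs.size when none does). Earlier directories win.
def dir_priority(path, dirs)
  dirs.index { |dir| path.start_with?(dir.chomp('/') + '/') } || dirs.size
end

# Splits one group of identical files into [original, duplicates].
# Keys are arrays, so a later entry only matters when all earlier ones tie.
def split_group(group, dirs, choices = [])
  original = group.min_by do |path|
    [dir_priority(path, dirs)] + choices.map { |name| SECONDARY.fetch(name).call(path) }
  end
  [original, group - [original]]
end
```

On a full tie this sketch simply keeps the first file in the group, whereas pmatch is described as falling back to a random pick.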

Advanced usage

You can put any valid Ruby code between #{ and }. For example, the following command will copy all duplicate files and upper-case their names along the way:

% pmatch -c 'cp #{d} /tmp/#{File.basename(d.to_s).upcase}' .
cp ./path/to/file1 /tmp/FILE1
cp ./yet/another/duplicate/file.png /tmp/FILE.PNG
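
The template above relies on ordinary Ruby string interpolation, which can be tried without pmatch. In this small sketch d is just a plain String standing in for whatever object pmatch passes to the template, and the function name upcased_copy_command is made up for the example.

```ruby
# d is a plain String here; in pmatch it would be the current duplicate.
def upcased_copy_command(d)
  # Any Ruby expression is allowed inside #{...}: take the base name
  # of the path and upper-case it.
  "cp #{d} /tmp/#{File.basename(d.to_s).upcase}"
end
```

For the path ./yet/another/duplicate/file.png this builds the same cp line as in the output above.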

Finally, to see all options, run pmatch --help:

% pmatch --help

Usage: pmatch [options] dir1 dir2 dir3 ...

Specific options:

-v, --verbose Run verbosely
-q, --quiet Run quietly
-e, --exclude pattern Exclude files matched by regular expressions
-s, --secondary-choice x,y,z Which files should I prefer? Possible values: short, long, deep, shallow, dirfull, dirempty, random
-c, --command COMMAND Command to display for every (but one) non-unique file
-f, --outfile FILE File to save generated statements. Will overwrite existing file!
--md5-path PATH Path to md5sum utility

Last updated on May 31st, 2009

