Find duplicate copies of files
October 8, 2005

fdupes is a command-line program for finding duplicate files within specified directories.

I have quite a few mp3s and ebooks, and I suspected that at least a few of them were copies. You know how it is: as your collection grows by leaps and bounds, thanks to friends, it becomes difficult to check each file individually to see whether it is already on your computer. So I started looking for a script that checks for duplicate files in an intelligent fashion. I didn't find such a script, but I did find fdupes.

fdupes compares files by calculating their MD5 hashes; since files with identical content produce identical hashes, the program can identify duplicates reliably. I ran it recursively on the directory containing my files (which makes it check for duplicates across the different subdirectories within the specified directory) and saved the output to a file:
$ fdupes -r ./stuff > dupes.txt
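fdupes prints each set of identical files as a group, with the groups separated by blank lines, so dupes.txt ends up looking something like this (the paths here are just illustrative):

./stuff/music/album/track01.mp3
./stuff/incoming/track01.mp3

./stuff/books/sed_and_awk.pdf
./stuff/books/copies/sed_and_awk.pdf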

Then, deleting the duplicates was as easy as checking dupes.txt and deleting the offending directories. fdupes can also prompt you to delete the duplicates as you go along, but I had way too many files and wanted to do the deleting at my own pace. The deletion mode is useful if you are only checking for duplicates in a directory with a few files in it.
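For smaller jobs, that interactive mode is invoked with the -d (--delete) flag, which makes fdupes ask, for each set of duplicates found, which copy to preserve before removing the rest:

$ fdupes -rd ./stuff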

Some of my ebooks are PDF versions of books, like some O'Reilly titles. Some are comics, which I read using Comical. Depending on the format of the ebook you are dealing with, you should be able to find a Linux reader with a quick Google search.

It is rather handy that the fslint site has an RPM and a .deb as well as a tarball. Trying it out on openSUSE with the pre-built RPM, it requires the pygtk and pyglade RPMs, which are actually provided by the python-gtk RPM on SUSE. It's a shame the RPM's dependencies were not declared by file.
I might (depending on the success or failure of ignoring the warnings about conflicts for this package under YaST) build a new RPM using CheckInstall and submit that as feedback to the author (or pop it on RPMBone).

The GUI itself loaded from the RPM with no problems, despite the warnings. After some serious disk thrashing, problem solved.

I had spent some time during my weeks without internet (different story) trying to figure out scripts to do this, and found it a harder problem than it seemed. All my scripts seemed to recurse massively once they got past the basic file-length comparison and into the actual content checking; comparing that many files grew out of control rather quickly. So my hat is off to the chaps behind fslint.

Hi Albert. Yes, command-line tools, or more generally the command shell language, have the flexibility required for dealing with files. The FSlint GUI, for example, is just a simple pygtk wrapper around the output of shell scripts.

One can invoke the shell scripts directly by adding the fslint scripts directory to the path like:
export PATH="$PATH:/usr/share/fslint/fslint"

Then you can do `findup --help` etc.
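For example, assuming the scripts are on your PATH as above, a duplicate scan of a directory should be as simple as (the path here is just an example):

$ findup ~/stuff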

Note: a more robust, accurate, and fast version of the example you gave above is:
printf "There are %'d files in this directory\n" `find | wc -l`

Andrew:
>Why would you post a link to a Windows program on a Linux blog?
Because you can run the Windows program under WINE.
Because some folks use Linux and Windows simultaneously.
Because if it's open source, someone could port it to Linux one of these days.
Because some folks have NFS filesystems that can be mounted on any OS, and one of those systems might be running Windows.
Because a Windows user googling 'find duplicate copies of files' might find this page, saving them a couple of minutes of searching for a solution.
If I could live forever and keep thinking about this problem, I could come up with infinitely many answers to your question.

This is gonna take a while… 15 minutes and still at zero percent. At least it's at [317/605437], so I know it's moving :P Thanks for the tip, just what I was looking for. I could just apt-get it from Debian sid, by the way.

There are issues with this. As previously mentioned, an MD5 hash has a chance of collision, which means you might end up deleting files that are actually unique. Secondly, generating the hash requires reading every single byte of every single file, which is time-consuming. If a very large file has a unique file size, you already know it is unique without reading it. The best way to do this is to generate a table of files with their sizes, sort the table by size, throw out the files that have a unique size, and then compare only the files that share a size.
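As a rough sketch of that size-first approach using standard GNU tools (and assuming filenames without embedded tabs or newlines), one could hash only the files whose sizes collide:

# List every file with its size, keep only the paths whose size occurs
# more than once, then hash just those candidates and group identical hashes.
find . -type f -printf '%s\t%p\n' | sort -n > sizes.txt
awk -F'\t' 'NR==FNR { seen[$1]++; next } seen[$1] > 1 { print $2 }' sizes.txt sizes.txt |
  xargs -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate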

I wrote a script to remove duplicates which has some nice features: a simulation-only mode, reference-only folders, a trash mode that moves duplicates to the trash, size limits, and support for a custom rm command. You can see the details and download it here…

I figured there were other tools to do this, but I wanted to write my own with the features I wanted, and it has worked well for me. It also does a full compare, not just checksums (which, as one person pointed out, can result in false matches). I based the interface on that of the rm command, and it uses only standard Linux commands.
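(For what it's worth, a full compare of two candidate files can be done with the standard cmp tool, which exits with status 0 only when the files are byte-for-byte identical; the filenames below are just placeholders:)

cmp -s track01.mp3 copy_of_track01.mp3 && echo identical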

fslint also looks good, but sometimes a command line approach is helpful.

I have a quick piece of advice for anyone looking to clean their computer of duplicate files: do not delete any system file that is flagged as a duplicate. I used a duplicate file finder to do this and my system crashed. Instead, limit such software to deleting user-created files and downloads. You are not going to save much space by deleting system files anyway, so they are best left alone.

Something to be aware of (since this site came up high in a Google search): fdupes apparently *does not compare filenames*, only sizes and hashes. For pruning down a music collection that's probably not a big deal, but if you're automating something like the creation of patches by eliminating common files between two folders, this can get you into trouble if you have a bunch of duplicate-content files with different names (like headers or art or whatnot).
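A quick hypothetical session illustrates the point: two files with different names but identical content are still reported as a duplicate pair.

$ echo "same bytes" > header.txt
$ cp header.txt renamed_copy.txt
$ fdupes .
./header.txt
./renamed_copy.txt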
