Removing duplicate files: what I want

There are a few Unix programs that will find duplicate files; none is great enough to merit any link love.

(A similar tool I use quite a bit is hardlink, which doesn't tell you about duplicate files; it just hard links them together. Which can be good.)

What I've really wanted, though, is a tool to which I could say "See that pile #1 of directories over there? And that pile #2 over there? Tell me what files are in pile #1 that aren't in pile #2." Or vice versa. Or maybe "the files that are in both piles".

What I don't want to be told is merely that there are four copies of a file somewhere across piles #1 and #2, because all four might be in pile #1.
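A rough sketch of the "in pile #1 but not in pile #2" question, comparing files by content rather than by name. The function name only_in_first is made up for illustration; it takes one directory per pile, and it assumes GNU sha1sum and filenames without newlines or backslashes.

```shell
# only_in_first FIRST_DIR SECOND_DIR   (hypothetical helper name)
# Print the paths of files under FIRST_DIR whose content (by SHA-1)
# does not occur anywhere under SECOND_DIR.
only_in_first() {
    seen=$(mktemp) || return 1
    # Collect the distinct hashes present in the second pile.
    find "$2" -type f -exec sha1sum {} + | cut -c1-40 | sort -u > "$seen"
    # Hash the first pile; keep lines whose hash was not seen in the second.
    # sha1sum prints a 40-character hash plus two separator characters,
    # so the path starts at column 43.
    find "$1" -type f -exec sha1sum {} + |
        awk -v seenfile="$seen" '
            BEGIN { while ((getline h < seenfile) > 0) seen[h] = 1 }
            !($1 in seen) { print substr($0, 43) }'
    rm -f "$seen"
}
```

Called as only_in_first pile1 pile2. Swapping the arguments gives the "vice versa" answer, and dropping the "!" in the awk condition should give the files present in both piles.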

For the special case of "files arranged identically in various directories" (e.g. if those directories are backup copies), here's something that comes close.

Over pile #1 of directories, run...

find <dirs> -type f -print0 | xargs -0 sha1sum > my-file-list

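The output, my-file-list, might need massaging: sha1sum -c looks files up by the paths recorded in the list, so those paths have to resolve from wherever you run the check in pile #2.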
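One way to avoid most of the massaging is to record relative paths in the first place, by running find from inside the directory. A minimal sketch, assuming GNU sha1sum; make_list is a made-up name, and the list file should live outside the directory being hashed.

```shell
# make_list DIR OUTFILE   (hypothetical helper name)
# Write "hash  ./relative/path" lines for every regular file under DIR.
# The redirection happens outside the subshell, so OUTFILE is resolved
# relative to the caller's working directory, not DIR.
make_list() {
    ( cd "$1" && find . -type f -exec sha1sum {} + ) > "$2"
}
```

For example, make_list /backups/monday /tmp/my-file-list (the paths here are only placeholders) produces a list that can be checked from the top of any directory laid out the same way.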

Now go over to pile #2 of directories and run...

sha1sum -c my-file-list

The goal is: like-named identical files should be flagged up as 'OK', and the others as something else.
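To look at just the "something else" cases, the OK lines can be filtered out. A sketch, assuming GNU sha1sum, whose -c output reports mismatches as "FAILED" and unreadable or missing files as "FAILED open or read"; not_ok is a made-up name, and LC_ALL=C is there so the messages aren't translated.

```shell
# not_ok LIST DIR   (hypothetical helper name)
# Check the hashes in LIST against the files under DIR and print
# only the entries that did not verify.
not_ok() {
    # The list is fed on stdin so it may be given relative to the
    # caller's directory even though the check runs inside DIR.
    ( cd "$2" && LC_ALL=C sha1sum -c - 2>/dev/null ) < "$1" |
        grep -v ': OK$'
}
```

Run as not_ok my-file-list <dir>. Note that grep exits non-zero when everything verified and there is nothing to print.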

Let's say you want to delete those like-named identical files. It would then be something like (WARNING: UNTESTED)...

sha1sum -c my-file-list | grep -E ': OK$' \
| sed -e 's/: OK$//' | xargs -r /bin/rm

(I always grow such scripts incrementally, and by hand. You'll have to anyway if you've got filenames-with-spaces, or other fun.)
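For the filenames-with-spaces case, one option is to read the verified names line by line instead of handing them to xargs. This is only a sketch: remove_verified and its --delete switch are invented for illustration, it assumes GNU sha1sum, and names containing newlines or backslashes will still trip it up. Without --delete it only prints what it would remove.

```shell
# remove_verified LIST DIR [--delete]   (hypothetical helper name)
# For each entry in LIST that verifies as OK under DIR, print the path,
# or remove the file if --delete is given.
remove_verified() {
    ( cd "$2" && LC_ALL=C sha1sum -c - 2>/dev/null ) < "$1" |
    sed -n 's/: OK$//p' |
    while IFS= read -r name; do
        # read -r with an empty IFS keeps spaces and backslashes intact.
        if [ "$3" = "--delete" ]; then
            rm -- "$2/$name"
        else
            printf 'would remove: %s\n' "$2/$name"
        fi
    done
}
```

Running it once without --delete and eyeballing the output fits the grow-it-by-hand approach; add --delete only once the list looks right.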

About this Entry

This page contains a single entry by Will Partain published on October 26, 2009 1:23 AM.
