There are a few Unix programs that will find duplicate files; none is great enough to merit any link love.
(A similar tool I use quite a bit is hardlink, which doesn't
tell you about duplicate files; it just hard links them together.
Which can be good.)
What I've really wanted, though, is a tool to which I could say "See that pile #1 of directories over there? And that pile #2 over there? Tell me what files are in pile #1 that aren't in pile #2." Or vice versa. Or maybe "the files that are in both piles".
What I don't want is to be told that there are four copies of a file somewhere across piles #1 and #2, because all four might be in pile #1.
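For the general case, here is a rough sketch of the sort of thing I mean, comparing piles by content hash and ignoring names entirely. The function names (list_hashes, only_in_first, in_both) are made up for illustration, and it assumes GNU sha1sum's usual "hash, two characters, path" output and filenames without newlines or backslashes.

```shell
# list_hashes OUTFILE DIR...
# Write one "hash  path" line per regular file found under the given DIRs.
list_hashes() {
    out=$1; shift
    find "$@" -type f -exec sha1sum {} + | sort > "$out"
}

# only_in_first LIST1 LIST2
# Print the paths from LIST1 whose hash appears nowhere in LIST2.
# The path starts at column 43: 40 hex digits plus a 2-character separator.
only_in_first() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)       { print substr($0, 43) }' "$2" "$1"
}

# in_both LIST1 LIST2
# Print the paths from LIST1 whose hash also appears in LIST2.
in_both() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         ($1 in seen)        { print substr($0, 43) }' "$2" "$1"
}
```

Usage would be along the lines of: list_hashes /tmp/pile1 <dirs of pile #1>, then list_hashes /tmp/pile2 <dirs of pile #2>, then only_in_first /tmp/pile1 /tmp/pile2. Four copies of a file inside pile #1 show up as four paths in the output only if pile #2 has no copy at all.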
For the special case of "files arranged identically in various directories" (e.g. if those directories are backup copies), here's something that comes close.
Over pile #1 of directories, run...
find <dirs> -type f -print0 | xargs -0 sha1sum > my-file-list
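The output, my-file-list, might need massaging: sha1sum -c looks up each path exactly as it is written in the list, relative to the directory you run it from, so any leading directory names that only exist in pile #1 have to be edited out.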
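One way to avoid most of the massaging is to build the list from inside the pile, so every path is recorded relative to its root. A minimal sketch, with make_list as a made-up name; keep the list file outside the directory being hashed, or it will end up listing itself.

```shell
# make_list ROOT OUTFILE
# Hash every regular file under ROOT, recording paths relative to ROOT
# (./sub/file rather than /some/where/ROOT/sub/file).
make_list() {
    # The redirection is outside the subshell, so a relative OUTFILE
    # is resolved from the caller's directory, not from ROOT.
    ( cd "$1" && find . -type f -exec sha1sum {} + ) > "$2"
}
```

Then, from the root of the matching directory in pile #2, sha1sum -c /path/to/my-file-list should find the like-named files without further editing.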
Now go over to pile #2 of directories and run...
sha1sum -c my-file-list
The goal is: like-named identical files should be flagged up as 'OK', and the others as 'FAILED' (same name, different contents) or 'FAILED open or read' (no file of that name in pile #2).
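To get a feel for how the two piles compare before doing anything drastic, the three kinds of result can be tallied. This is a sketch assuming GNU sha1sum; check_report is a made-up name, and LC_ALL=C is there because the status words are translated in some locales.

```shell
# check_report LIST
# Run from the root of pile #2. Prints a one-line tally of sha1sum -c results.
check_report() {
    # stderr carries sha1sum's own warnings about unreadable files; drop it.
    LC_ALL=C sha1sum -c "$1" 2>/dev/null | awk '
        /: OK$/                  { ok++;      next }
        /: FAILED open or read$/ { missing++; next }
        /: FAILED$/              { differ++;  next }
        END { printf "identical=%d differ=%d missing=%d\n", ok, differ, missing }'
}
```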
Let's say you want to delete those like-named identical files. It would then be something like (WARNING: UNTESTED)...
sha1sum -c my-file-list | grep ': OK$' \
| sed -e 's/: OK$//' | xargs -r -d '\n' /bin/rm
(I always grow such scripts incrementally, and by hand. You'll have to anyway if you've got filenames-with-spaces, or other fun.)
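In that incremental spirit, one cautious way to grow the delete step is to split it in two: first print what would go, and only then feed that same list to rm. A sketch assuming GNU sha1sum and GNU xargs (for -r and -d); matching_files is a made-up name, and filenames containing newlines or backslashes are still not handled.

```shell
# matching_files LIST
# Run from the root of pile #2. Prints, one per line, the files that
# sha1sum -c reports as OK, i.e. like-named and identical to pile #1.
matching_files() {
    LC_ALL=C sha1sum -c "$1" 2>/dev/null | sed -n 's/: OK$//p'
}

# Dry run, look before you leap:
#   matching_files my-file-list
# For real, one filename per line so that spaces survive:
#   matching_files my-file-list | xargs -r -d '\n' rm --
```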
