October 2009 Archives

Removing duplicate files: what I want

There are a few Unix programs that will find duplicate files; none is great enough to merit any link love.

(A similar tool I use quite a bit is hardlink, which doesn't tell you about duplicate files; it just hard links them together. Which can be good.)

What I've really wanted, though, is a tool to which I could say "See that pile #1 of directories over there? And that pile #2 over there? Tell me what files are in pile #1 that aren't in pile #2." Or vice versa. Or maybe "the files that are in both piles".

What I don't want to know is that there are four copies of a file in piles #1 and #2 -- because they might all four be in pile #1.
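Lacking such a tool, here is a minimal sketch of the first question, comparing by content rather than by name. It assumes GNU sha1sum, one top-level directory per pile, and no newlines or backslashes in filenames (sha1sum escapes those lines). only_in_first is a made-up name, not an existing program.

```shell
#!/bin/bash
# only_in_first PILE1 PILE2
# Hypothetical helper: print files under PILE1 whose contents (by SHA-1)
# match no file under PILE2. Names and locations are ignored.
only_in_first() {
    local hashes
    hashes=$(mktemp) || return 1

    # One line per distinct content hash found anywhere in pile #2.
    find "$2" -type f -exec sha1sum {} + | cut -c1-40 | sort -u > "$hashes"

    # sha1sum prints "<40 hex digits>  <path>", so the path starts at column 43.
    find "$1" -type f -exec sha1sum {} + |
        awk -v list="$hashes" '
            BEGIN { while ((getline h < list) > 0) seen[h] = 1 }
            !($1 in seen) { print substr($0, 43) }
        ' | sort

    rm -f "$hashes"
}
```

Swapping the arguments answers the "vice versa" question, and dropping the ! in the awk script lists the pile #1 files that are in both piles. Extra copies inside pile #1 never get compared with each other.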

For the special case of "files arranged identically in various directories" (e.g. if those directories are backup copies), here's something that comes close.

Over pile #1 of directories, run...

find <dirs> -type f -print0 | xargs -0 sha1sum > my-file-list

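The output, my-file-list, might need massaging: sha1sum -c looks each recorded path up relative to the directory it is run from, so the paths have to make sense from pile #2 as well.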
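One form of massaging, sketched here as a guess at what is typically needed rather than a recipe: record paths relative to the top of the pile, so the list still means something elsewhere. make_list is a hypothetical helper name.

```shell
#!/bin/bash
# make_list DIR LIST
# Hypothetical helper: hash every file under DIR, recording each path
# relative to DIR (./sub/file) rather than as typed on the command line.
# LIST is opened before the cd, so it may be relative to where you started;
# keep it outside DIR or it will be hashed too.
make_list() {
    ( cd "$1" && find . -type f -exec sha1sum {} + ) > "$2"
}
```

The resulting list can then be checked from the top of the matching directory in the other pile.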

Now go over to pile #2 of directories and run...

sha1sum -c my-file-list

The goal is: like-named identical files should be flagged up as 'OK', and the others as something else.

Let's say you want to delete those like-named identical files. It would then be something like (WARNING: UNTESTED)...

sha1sum -c my-file-list | grep ': OK$' \
| sed -e 's/: OK$//'    | xargs -r /bin/rm --

(I always grow such scripts incrementally, and by hand. You'll have to anyway if you've got filenames-with-spaces, or other fun.)
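For filenames-with-spaces, one way such a script might grow is to read the list itself a line at a time rather than parse the output of sha1sum -c. A sketch only: remove_verified is a made-up name, it defaults to a dry run, and it still assumes no newlines or backslashes in filenames.

```shell
#!/bin/bash
# remove_verified LIST [delete]
# Hypothetical helper: for each "<hash>  <path>" line in LIST, look at <path>
# relative to the current directory and, if its SHA-1 matches, report it.
# Only removes anything when the second argument is the word "delete".
remove_verified() {
    local list="$1" mode="${2:-dry-run}" line hash path
    while IFS= read -r line; do
        hash=${line%% *}      # everything before the first space
        path=${line#*  }      # everything after the two-space separator
        [ -f "$path" ] || continue
        if [ "$(sha1sum < "$path" | cut -c1-40)" = "$hash" ]; then
            if [ "$mode" = delete ]; then
                rm -- "$path"
            else
                printf 'would remove: %s\n' "$path"
            fi
        fi
    done < "$list"
}
```

Run it once without the second argument, read what it would remove, and only then run it again with delete.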

Grid Engine: not so much

We decided a little while ago to break into the big bad world of Sun's Grid Engine (under Fedora Linux).

We decided to start with a magnificent "grid" of... cough ONE NODE. yum install the packages, ./install_qmaster, ./install_execd and -- bingo! -- major "grid" action!

Would that it were so simple. Any question in plain English -- "How do I get it to run four jobs at once?" or "How do I change the load threshold at which new jobs start to 1.1?" -- will be met by an avalanche of Grid Engine Speak ("... shows values for those attributes that are defined as per queue instance slot limits or as fixed resource attributes").

Our big mistake, however, was to expand our "grid" to... cough TWO NODES.

Much the same drill -- yum install, ./install_execd -- and it seemed to work, except when it didn't. The most baffling message -- on the original node, which had worked perfectly by itself -- is:

failed on host foo.verilab.com general before job because:
10/16/2009 19:05:35 [0:31072]: setuid(547) failed

Don't even know where to look. Finally decided to ask on the "gridengine-users" mailing list.

In Ye Olden Dayes, you'd just send email to the list, and maybe be held in the moderator queue the first time.

Now, I've registered for the-deities-know-what with Sun, and signed up as a project "observer", all so I can ask my question. Which was still held for "Pending Approval", as was my follow-up.

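So, for a week's effort, I've been asked "Are you running it as root?" (Er... I thought the point of setuid was for lower-privileged users to do higher-privileged things. That's the setuid bit on an executable, though; the message is about the setuid() call, which can only switch to an arbitrary user ID when the caller is privileged, normally root. So it's a fair question, and obviously I haven't a clue.)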

Don't even know where to look.

Potential Grid Engine users: you may find make -j 4 is easier.

Avoid the Puppet SVN pre-commit script

There are common-ancestry variants of a Subversion pre-commit script for Puppet floating about, to check for syntax errors in your .pp files before committal.

The shape of this script is roughly...

svnlook ... | while read -r file ; do ...check... ; done

The svnlook-ish bit produces filenames to look at, and the "check" inside the while runs puppet --parseonly, which grumbles and gives a non-zero exit code if there's a syntax error.

The script-as-a-whole needs to give a non-zero exit status if there is a problem, i.e. "don't commit".

The script is broken because it's hard to get information out of the while loop so that a script-as-a-whole decision can be made.

Why? Because in bash (and most other sh-style shells) the parts of a pipeline are run as sub-shells, meaning the while loop is all in a sub-shell. Any variable set in the loop vanishes when the sub-shell ends, and an exit in the loop only leaves the sub-shell: the script carries on regardless, unless it goes on to check the pipeline's own exit status.

Moral: while in a pipeline -- probably a bad idea.
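To make the problem concrete, here is a small sketch of both shapes. check_one is a made-up stand-in for the real check (it "fails" any name containing bad), and printf stands in for the svnlook bit; neither is taken from the actual script. The behaviour shown is that of bash and dash; ksh and zsh run the last part of a pipeline in the current shell.

```shell
#!/bin/bash
# Hypothetical stand-in for a real syntax check: fails names containing "bad".
check_one() {
    case $1 in
        *bad*) return 1 ;;
        *)     return 0 ;;
    esac
}

# Broken shape: the loop is the tail of a pipeline, so status=1 is set in a
# sub-shell and is gone by the time the function returns.
check_piped() {
    status=0
    printf '%s\n' "$@" | while IFS= read -r file; do
        check_one "$file" || status=1
    done
    return "$status"
}

# Working shape: the loop reads from a here-document instead, so it runs in
# the current shell and the assignment survives.
check_redirected() {
    status=0
    while IFS= read -r file; do
        check_one "$file" || status=1
    done <<EOF
$(printf '%s\n' "$@")
EOF
    return "$status"
}
```

check_piped good.pp bad.pp returns 0 however many files fail; check_redirected good.pp bad.pp returns 1. In bash, done < <(svnlook ...) does the same job as the here-document.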

(Just ask if you want my pre-commit script. Mine checks .erb template files, too.)

About this Archive

This page is an archive of entries from October 2009 listed from newest to oldest.
