May 2009 Archives

Oh no! Another compression program

The long-standing way to compress a file on Unix-y systems has been gzip (or the -z option to tar, or similar), with gunzip to reverse the process.

The compression landscape has shifted a little, though.

  • New programs have come along.

  • You are likely to be compressing very-much-larger files.

  • You likely have very-much-larger disks, and frankly don't care so much.

Let's review the "new programs". The first one that turned heads was Julian Seward's bzip2. It provided (and provides) useful extra compression, admittedly at a cost in time. You know you've made it in compression-land when you garner a GNU tar flag -- -j for bzip2.

The newest kid on the compression block is lzma. (The "Lempel-Ziv-Markov chain algorithm" is also what's behind the relatively popular 7-Zip compression tool; in fact, the compression may be identical, with only the interface being different.) We know LZMA is important because, yes, it has a GNU tar option... too bad that option has been jumping around in recent versions of tar.

Why? Well, a modern variant (fork?) of LZMA is set to sweep the nation: xz. Sooo... In GNU tar version 1.20, the option is --lzma (possibly with -J as the short form). In version 1.22, the option is --xz, and -J switches to being the short form for that.
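If you would rather not track which tar version wants which flag, the same three compressors are reachable from a script. Here is a minimal sketch using Python's standard tarfile module, where the mode suffixes ("w:gz", "w:bz2", "w:xz") play roughly the role of -z, -j and -J. The helper names, the member name hello.txt and the payload are all made up for illustration.

```python
import io
import tarfile

# tarfile mode suffixes roughly mirror GNU tar's compression flags:
#   "w:gz" ~ -z (gzip), "w:bz2" ~ -j (bzip2), "w:xz" ~ -J / --xz (xz)
MODES = {"gzip": "w:gz", "bzip2": "w:bz2", "xz": "w:xz"}

def make_archive(mode: str, name: str, payload: bytes) -> bytes:
    """Build a compressed tar archive in memory holding one member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def read_member(archive: bytes, name: str) -> bytes:
    """Extract one member; "r:*" lets tarfile detect the compression."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        return tar.extractfile(name).read()

# Hypothetical payload, purely for demonstration.
payload = b"hello, compression\n" * 1000
sizes = {}
for label, mode in MODES.items():
    archive = make_archive(mode, "hello.txt", payload)
    assert read_member(archive, "hello.txt") == payload  # lossless round trip
    sizes[label] = len(archive)
    print(f"{label:6s} {len(archive)} bytes")
```

Reading with "r:*" means the consumer never needs to know which compressor the producer picked, which sidesteps the flag shuffle entirely.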

The best thing I've read in a good while about compression is John Goerzen's "How To Think About Compression" and "... Part 2". His main points are:

  • The most space you are going to gain, vs gzip, is about 15%. However, you are likely to spend several times more CPU time to get there.

  • lzma/xz kills bzip2. lzma compresses just as well (space-wise) and in far, far less time.

For IT work, gzip remains an excellent choice. Why? Because, yes, you do want to compress that stonking-big backup, but, no, you don't want to triple the CPU-time cost.

For I-care-about-space-savings work, lzma/xz is now the place to start.

If you really care about space savings, you're strongly advised to run some tests of your own. The type of data, amount of memory available, etc., etc. can all really matter.
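Running such a test need not be much work. The following is a rough sketch, not a rigorous benchmark: it times the gzip, bz2 and lzma modules from Python's standard library on one in-memory buffer, at each module's default settings. The sample input is a placeholder (and a very flattering one, being pure repetition); swap in a chunk of your own data before drawing any conclusions.

```python
import bz2
import gzip
import lzma
import time

def benchmark(data: bytes) -> dict:
    """Return {name: (compressed_size, seconds)} for each stdlib codec."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        assert decompress(packed) == data  # sanity check: lossless round trip
        results[name] = (len(packed), elapsed)
    return results

# Placeholder input: repetitive text, which flatters every compressor.
sample = b"The quick brown fox jumps over the lazy dog.\n" * 20000
results = benchmark(sample)
for name, (size, seconds) in results.items():
    ratio = size / len(sample)
    print(f"{name:6s} {size:8d} bytes  ({ratio:.2%} of original)  {seconds:.3f}s")
```

A single timed run like this is noisy, and the library defaults may not match the command-line tools' defaults, so treat the numbers as a first impression only.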

Incidentally, this compression lark is a good fit for those multiple (useless) CPU cores that vendors are selling us. Each of these tools has a "parallel" version: pbzip2 for bzip2, pigz for gzip, and something like a -mmt=on option for lzma/xz (check the fast-shifting documentation). For further idle amusement, see Jeff Atwood's blog item.
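To see why compression parallelises so easily, here is a toy sketch of the general idea: cut the input into chunks, compress the chunks concurrently, and glue the results together. It relies on the fact that a concatenation of gzip members is itself a valid gzip stream. This illustrates the principle only and makes no claim about how pigz or pbzip2 work internally; the function name, chunk size and worker count are arbitrary choices of mine.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def parallel_gzip(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress fixed-size chunks concurrently and concatenate the results.

    Each chunk becomes its own gzip member; a sequence of members is
    itself a valid gzip stream, so ordinary gunzip can read the output.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        return gzip.compress(b"")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so members come back in sequence.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)

data = bytes(range(256)) * 40000  # about 10 MB of placeholder input
packed = parallel_gzip(data)
assert gzip.decompress(packed) == data  # multi-member stream decodes as one
print(len(data), "->", len(packed), "bytes")
```

Threads are used here on the assumption that CPython's zlib bindings release the interpreter lock while compressing; if that does not hold on your build, a process pool is the obvious substitute. Chunking also costs a little compression, since each chunk starts with an empty dictionary.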

Finally, if you like compression, why not just do the whole filesystem (and relive the days of DoubleSpace on MS-DOS, or DriveSpace as it had become by Windows 95)?

SquashFS is very widely used in the embedded world. Only problem is, it's just for read-only filesystems. For ext2/ext3 read-write filesystems, there are add-on compression patches (urk). If you want to try it all in user space, there's something based on the super-flexible FUSE: compFUSEd. None excites me; I'd rather buy a bigger disk.

[An earlier version of this note appeared in Verilab's internal newsletter.]
