The long-standing way to compress a file on Unix-y systems has been
with gzip (or the -z option to tar, or similar), with gunzip to
reverse the process.
The compression landscape has shifted a little, though:
New programs have come along.
You are likely to be compressing very-much-larger files.
You likely have very-much-larger disks, and frankly don't care so much.
Let's review the "new programs". The first one that turned heads was
Julian Seward's bzip2. It provided (and provides) useful extra
compression, admittedly at a cost in time. You know you've made it in
compression-land when you garner a GNU tar flag -- -j for bzip2.
The newest kid on the compression block is lzma. (The
"Lempel-Ziv-Markov chain-Algorithm" is also what's behind the
relatively-popular 7zip compression tool; in fact, the compression
may be identical, with only the interface being different.) We know
LZMA is important because, yes, it has a GNU tar option... too bad it's
been jumping around in recent versions of tar.
Why? Well, a modern variant (fork?) of LZMA looks set to sweep the nation: xz.
Sooo... In GNU tar version 1.20, the
option is --lzma (possibly with -J as the short form). In version
1.22, the option is --xz, and -J switches to being the short form
for that.
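If the flag-shuffling gets tiresome, one way to sidestep it is to build the archive from a script. Here is a minimal sketch using Python's standard tarfile module, where the mode strings "w:gz", "w:bz2" and "w:xz" play the part of tar's -z, -j and -J. The helper names make_archive and read_archive, and the sample file name, are mine, purely for illustration.

```python
import io
import tarfile


def make_archive(files, mode):
    """Build a tar archive in memory. `files` maps member name -> bytes.

    `mode` is a tarfile write mode: "w:gz", "w:bz2" or "w:xz".
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_archive(blob):
    """Read an archive back; "r:*" lets tarfile detect the compression."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


if __name__ == "__main__":
    # Hypothetical sample content; substitute your own files.
    files = {"notes.txt": b"compress me, please\n" * 1000}
    for mode in ("w:gz", "w:bz2", "w:xz"):
        blob = make_archive(files, mode)
        print(mode, len(blob), "bytes")
```

Reading with "r:*" means the unpacking side never has to know which compressor was used, which is rather nicer than remembering which letter means what in which tar release.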
The best thing I've read in a good while about compression is John Goerzen's "How To Think About Compression" and "... Part 2". His main points are:
The most space you are going to gain, vs gzip, is about 15%. However, you are likely to spend several times more CPU time to get there.
lzma/xz kills bzip2. Lzma compresses just as well (space-wise) and in far, far less time.
For IT work, gzip remains an excellent choice. Why? Because, yes,
you do want to compress that stonking-big backup, but, no, you don't
want to triple the CPU-time cost.
For I-care-about-space-savings work, lzma/xz is now the place
to start.
If you really care about space savings, you're strongly advised to run some tests of your own. The type of data, amount of memory available, etc., etc. can all really matter.
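By way of a starting point for such tests, here is a rough sketch that runs the same bytes through Python's standard gzip, bz2 and lzma modules (lzma writes the xz format by default) and reports size and wall-clock time. The function name compare and the sample data are made up for illustration, the library defaults are not necessarily the same levels the command-line tools use, and a serious test would use your real data and repeat the timings.

```python
import bz2
import gzip
import lzma
import time

# Standard-library one-shot compressors, each at its default settings.
COMPRESSORS = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "xz": lzma.compress,
}


def compare(data):
    """Return {name: (compressed_size_in_bytes, seconds_taken)}."""
    results = {}
    for name, compress in COMPRESSORS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    # Hypothetical, very repetitive sample; read in your own file instead.
    data = b"The quick brown fox jumps over the lazy dog.\n" * 50000
    for name, (size, secs) in compare(data).items():
        ratio = 100.0 * size / len(data)
        print(f"{name:6s} {size:10d} bytes ({ratio:5.2f}%)  {secs:.3f}s")
```

Repetitive text like the sample flatters every compressor, so treat the numbers from it as a smoke test only.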
Incidentally, this compression lark is a good fit for those multiple
(useless) CPU cores that vendors are selling us. Each of these
tools has a "parallel" version: pbzip2 for bzip2, pigz for
gzip, and something like a -mmt=on option for lzma/xz (check
the fast-shifting documentation). For further idle amusement, see
Jeff Atwood's blog item.
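To see why this parallelises so readily, here is a small sketch of the chunk-and-concatenate idea, which is, as far as I understand it, roughly how pbzip2 goes about things: cut the input into blocks, compress each block independently, and glue the resulting streams together. It uses Python's bz2 module with a thread pool; the function name parallel_bz2 and the chunk size and worker count are my own assumptions, and I am relying on CPython releasing its global lock inside the compressor for the threads to genuinely run side by side.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2(data, chunk_size=1 << 20, workers=4):
    """Compress `data` as a series of independent bzip2 streams.

    Each chunk becomes its own complete stream; the concatenation is
    still something bz2.decompress (Python 3.3+) will unpack in one go.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid (empty) stream
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the chunks stay in sequence.
        return b"".join(pool.map(bz2.compress, chunks))


if __name__ == "__main__":
    # Hypothetical sample data, about 8 MB.
    data = b"spare cores, this one is for you\n" * 250000
    packed = parallel_bz2(data)
    assert bz2.decompress(packed) == data
    print(len(data), "->", len(packed), "bytes")
```

The price of the trick is a little compression ratio, since each chunk starts from scratch with no knowledge of its neighbours.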
Finally, if you like compression, why not just do the whole filesystem (and relive the days of DriveSpace on Windows 95)?
SquashFS is very widely used in the embedded world. The only problem is, it's just for read-only filesystems. For ext2/ext3 read-write filesystems, there are add-on compression patches (urk). If you want to try it all in user space, there's something based on the super-flexible FUSE: compFUSEd. None of them excites me; I'd rather buy a bigger disk.
[An earlier version of this note appeared in Verilab's internal newsletter.]
