I am a long-time fan of the SPEC CPU benchmark suite(s).
The current version (SPEC CPU2006) was released in August 2006,
replacing the 2000 suite, which was
the successor of the 1995, 1992, and 1989 versions. SPEC is
currently collecting programs for the next version of the benchmark.
I will give the history and significance of this benchmark suite,
which concentrates on raw computing speed. It was "benchmarks done
right" after four decades of benchmarks done wrong, and so it remains
instructive.
History
A computer system never runs quicker than its slowest piece. For a
long time (into the 1970s?), the CPU was the straggler, so knowing a
machine's MIPS/MFLOPS (millions of {instructions, floating-point
operations} per second) rating was most of what you needed to know
about a machine's speed. (Scientists cared about MFLOPS, everyone
else about MIPS.)
The simplest (CPU-intensive) benchmark is a small program for a
well-defined problem that can be easily run across a range of
computers. Common examples have been: computing Fibonacci numbers;
solving the N-Queens problem; solving the Towers of Hanoi; matrix
multiplication. Such benchmarks tell you a little and are a lot of
fun -- and there are still many web pages discussing them. (Nowadays,
they turn up more in "my programming language is better than yours"
comparisons.)
The other thing about such benchmarks is that they are easy to game.
Want your Scheme compiler to look good? -- keep an eye out
for functions named fib and insert optimal machine code when you see
them.
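To make the trick concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: honest_compile and gamed_compile are hypothetical stand-ins for a compiler, and fast_fib plays the part of the hand-written "optimal machine code". No real compiler is being described.

```python
import time


def fib(n):
    """Naive doubly-recursive Fibonacci, the classic micro-benchmark."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def fast_fib(n):
    """Hand-tuned iterative replacement that returns the same values."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def honest_compile(func):
    """Hypothetical 'compiler' that leaves the program alone."""
    return func


def gamed_compile(func):
    """Hypothetical 'compiler' that special-cases anything named fib."""
    return fast_fib if func.__name__ == "fib" else func


def time_call(func, arg):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(arg)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for compiler in (honest_compile, gamed_compile):
        result, secs = time_call(compiler(fib), 22)
        print(f"{compiler.__name__}: fib(22) = {result} in {secs:.6f}s")
```

Both "compilers" give the same answer, so the benchmark cannot tell them apart; only the timing differs, and that timing says nothing about how the gamed compiler treats any other program.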
Knowing that a particular machine is whizzy on matrix multiply is not
confidence-inspiring if you want to buy something for a general mix of
applications. Moreover, it has been a very long time since all of a
computer's instructions executed in the same number of cycles.
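A small worked example shows why a single MIPS rating stops meaning much once instruction costs differ. The clock rate, cycle counts, and instruction mixes below are all invented for illustration; the only assumption about the machine is that MIPS is the clock rate divided by the average cycles per instruction.

```python
CLOCK_HZ = 100_000_000  # hypothetical 100 MHz machine

# Hypothetical cost, in cycles, of each instruction class.
CYCLES = {"alu": 1, "load": 3, "fdiv": 20}


def mips(mix):
    """Compute MIPS for an instruction mix.

    mix maps instruction class -> fraction of executed instructions
    (the fractions are expected to sum to 1).
    """
    cycles_per_instruction = sum(frac * CYCLES[kind] for kind, frac in mix.items())
    return CLOCK_HZ / cycles_per_instruction / 1e6


# Two made-up programs running on the same machine.
integer_mix = {"alu": 0.75, "load": 0.25, "fdiv": 0.0}
float_mix = {"alu": 0.5, "load": 0.25, "fdiv": 0.25}

if __name__ == "__main__":
    print(f"integer-heavy program: {mips(integer_mix):.1f} MIPS")
    print(f"divide-heavy program:  {mips(float_mix):.1f} MIPS")
```

On these made-up numbers, the same hardware rates about four times higher on the first program than on the second, so "the" MIPS figure depends on which program you happened to measure.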
The first major benchmark to address such concerns was the Whetstone
benchmark (1972), a synthetic benchmark intended to reflect the
statistical behavior of scientific programs (written in Algol 60, as
it happens). It also tried to be less "game-able".
Whetstone was mostly floating-point, so it was inevitable that an
integer synthetic benchmark followed -- yes, called Dhrystone
(Reinhold P. Weicker, 1984). It had the same goals: being
representative of integer-intensive programs, and not being
trivialized by compilers. A nice piece of work; but it still meant
you were judging complex machines by a single figure of merit (always
tempting, always unwise).
As the 1980s moved along, two extra factors started to kick in.
First, memory system performance really started to matter (and to
become the bottleneck). Second, nearly all code was written in a
high-level language (C, for example) and went through a compiler.
The SPEC idea
The idea of the SPEC benchmarks was fabulously simple: collect the
source for real programs, and then set standard rules for compiling,
running, and reporting the performance results for those programs.
This approach acknowledged: that no synthetic benchmark could take the
place of real programs; that the whole system needed testing (not just
the CPU); and that the compiler was as much part of the system as
anything else.
The work on SPEC began in September, 1988, and the first version came
out in October, 1989. There were ten programs: gcc, espresso, li,
eqntott [mostly integer]; spice, doduc, nasa7, matrix, fpppp, tomcatv
[mostly floating-point]. And, yes, many of those were open-source
programs floating around at the time (e.g. GCC = GNU C Compiler) -- a
program couldn't (and still can't) be used as a benchmark unless the
source is publicly available.
The 2006 integer benchmarks are 400.perlbench, 401.bzip2, 403.gcc,
429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref,
471.omnetpp, 473.astar, and 483.xalancbmk. (The prefixed numbers are,
essentially, version tags.) The floating-point benchmarks are:
410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs,
436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex,
453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf,
482.sphinx3.
The SPEC CPU benchmarks have changed several times. Why? Some of the
main reasons:
Computers speed up so much that a program completes "too quickly";
Cache sizes increase so that a whole program fits in cache and the
memory system (as a whole) becomes irrelevant;
Compiler tricks can turn a benchmark program into a joke (e.g. by
finding a legitimate way to compute the answer at compile time);
New source languages (e.g. C++) and new domains (XML grokking)
become interesting.
Computer system vendors took (and take) SPEC benchmarks very
seriously. For each program in the suite, they figured out
super-complex sets of compiler flags that worked well. (As long as
they were reported, this was within the rules.) Because of this, a
separate measure -- the SPECbase results -- was introduced: results
from using one set of compiler flags across the whole suite.
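The difference between the two measures can be sketched with a few lines of Python. One piece of background that the text above does not spell out: SPEC's overall figure is a geometric mean of per-program ratios (reference machine time divided by measured time). The three programs are real suite members, but every timing here is invented, and suite_score is a hypothetical helper, not part of any SPEC tool.

```python
from math import prod

# Made-up run times in seconds; NOT real SPEC data.
reference = {"gcc": 1000.0, "mcf": 1000.0, "bzip2": 1000.0}

# "Base": one set of compiler flags used for the whole suite.
base_times = {"gcc": 100.0, "mcf": 250.0, "bzip2": 400.0}

# "Peak": flags tuned separately for each program.
peak_times = {"gcc": 100.0, "mcf": 200.0, "bzip2": 250.0}


def geometric_mean(values):
    """Geometric mean of a non-empty collection of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def suite_score(ref, measured):
    """Geometric mean of reference-time / measured-time ratios."""
    return geometric_mean(ref[name] / measured[name] for name in ref)


if __name__ == "__main__":
    print(f"base score: {suite_score(reference, base_times):.2f}")
    print(f"peak score: {suite_score(reference, peak_times):.2f}")
```

With per-program tuning, the peak figure can only match or beat the base figure, which is why reporting both gives a more honest picture than the tuned number alone.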
The SPEC (CPU) benchmarks have affected your life -- some decent
fraction of the performance of the fast machines you use is due to
much laboring in the SPEC trenches. (It would be even better if the
world hadn't fallen for Intel's GHz-are-good wheeze.)
My life as a benchmarker
Back when I was a Haskell guy, that world was particularly blighted by
terrible benchmarking. For some reason, functional programmers were
obsessed with the Fibonacci function: you could get a paper published
-- if not make an entire career -- by having a nicely-behaved
functional version of fib. The sub-field was held back by such
madness.
One of the things we did at Glasgow was put together the (you guessed
it...) nofib benchmark suite. The biggest obstacle at the time was
that only a few real, non-trivial Haskell programs existed, never
mind ones that could be put into a public suite. nofib was unashamedly based
on the SPEC way of doing things. And, as with SPEC, I would say that
Haskell users are seeing decent
performance in part because of mining in the nofib pit.
Conclusion
The key insight of the SPEC CPU benchmarks was to use real programs.
It is amazing that it took until 1989 for this idea to be taken up.
Hope springs eternal, however. People keep cooking up synthetic
benchmarks, hoping for a short cut to quality performance
comparisons. I am not confident. The SPEC way is likely to be
a lesson we will have the pleasure of learning -- again and again.
[An earlier version of this note appeared in Verilab's internal newsletter.]