January 2009 Archives

Incremental working as a Unix fundamental


Consider: you visit a Web page which links to a whole bunch of PDF files. You decide you would like copies of all of them for later reading. Finding and clicking on all the links would be tedious and error-prone. What to do? -- here's my solution.

  1. Save the HTML source of the page I'm looking at. In Firefox, one way is View->Page Source, then save, perhaps as src.html.

  2. Make a list of all the PDF-file URLs in src.html:

    • If there are comparatively few (say, seven?), just edit the text file down to the list.

    • If you use Emacs, a keyboard macro (executed repeatedly) can boil the file down to PDF URLs fairly painlessly. The commands in the macro might be:

          # Note where we are:
          set-mark
          # Look for an interesting (NB: non-relative) URL
          isearch-forward-regexp http://.*\.pdf
          # Move back to the beginning of it:
          isearch-backward-regexp http:
          # "region": where we started to where we are now.
          kill-region
          # Put ourselves at the end of the URL:
          isearch-forward-regexp http://.*\.pdf
      

      Just repeat that a whole bunch (C-u 999 M-x call-last-kbd-macro), and you'll have your URLs.

    • A tiny Perl script is a different (and natural) solution:

          while (<>) {
            next unless m,(http://.*\.pdf)\b,;
            print "$1\n";
          }
      

      Call that grab-urls and run:

          ./grab-urls src.html > pdfs-list
      

      Hmm... that version isn't catching everything... Look at the HTML source and make a few adjustments, giving...

          while (<>) {
             next unless m,href=\"((pdfs|review)/.*\.pdf)\b,;
             print "$1\n";
          }
      

      (Looks at results) Ah, that's better.

  3. Assuming the list of PDF URLs is in the file pdfs-list, look it over. Maybe edit it a bit. If it's obviously bogus, go back and fix something, or take a different tack.

  4. Go get all the PDFs!

    • Run from the command line:

          for i in `cat ~/pdfs-list` ; do wget $i ; done
      

      If you don't want to alarm the website owners by many quick downloads, you can slow it down:

          for i in `cat ~/pdfs-list`; do wget $i; sleep 60; done
      

      (Yes, wget's --wait and --random-wait options will work as well.)

    • There are a whole bunch of things you could do instead of backticks-in-a-for-loop. Here's one I actually tried:

          ./grab-urls src.html \
           | xargs -I XXX wget http://example.com/XXX
      

      ['xargs' is the Unix command that maps a command over a set of data (on stdin). The '-I XXX' option tells it to replace XXX in the command template given, running the command once per input line, so no '-n 1' is needed (GNU xargs treats '-n' and '-I' as mutually exclusive).]

      Another is to edit pdfs-list directly into a script, make it executable (chmod +x) and then run it; for example:

          #!/bin/sh -x
          #
          set -e
          set -u
          #
          for i in \
          pdfs/Introduction.pdf \
          pdfs/AmplifyLearning.pdf \
          review/chapter1.pdf \
          review/chapter2.pdf \
          review/chapter3.pdf \
          ; do
            wget http://www.poppendieck.com/$i
          done
      

      Except for the SPACE BACKSLASH stuck on the .pdf lines, the "script" is little more than our list of files.
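
A rough shell-only sketch of steps 2 and 4, for comparison with the Perl version. The function names extract_pdf_links and make_urls are made up for illustration, http://example.com stands in for the real site, and grep -o is a GNU/BSD extension rather than POSIX. It assumes every link sits in a double-quoted, lowercase href attribute, which will not hold for every page.

```shell
# extract_pdf_links FILE
# Print the value of every double-quoted href attribute ending in .pdf,
# one per line. (Assumption: links look like href="...pdf".)
extract_pdf_links() {
    grep -o 'href="[^"]*\.pdf"' "$1" | sed -e 's/^href="//' -e 's/"$//'
}

# make_urls BASE
# Read relative paths on stdin and print absolute URLs under BASE.
make_urls() {
    base=${1%/}            # drop one trailing slash, if present
    while IFS= read -r path; do
        printf '%s/%s\n' "$base" "$path"
    done
}

# Typical use (hypothetical site; look over pdfs-list before fetching):
#   extract_pdf_links src.html > pdfs-list
#   make_urls http://example.com < pdfs-list | wget --wait=60 -i -
```

Keeping the extraction and the URL-building as separate steps preserves the point of the exercise: the intermediate pdfs-list can still be inspected and edited by hand before anything is downloaded.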

OK, what's the point? Some observations:

  • I (and most Unix people?) grope my way to a "solution" something like this all the time.

  • Most of the work invariably seems to be finding the "what's to be operated on" candidates -- often files, but could also be processes, usernames, variable names, etc.

  • I tend to use find a lot, often with a grep or two thrown in for winnowing... For example, suppose I want to process "all the SystemVerilog code new since Christmas"... I might do...

    find . -type f -name \*.sv -mtime -30 | grep -v OLD
    

    ... perhaps putting it in an ...

    ls -ltr `...find cmd here...`
    

    ... and then doing a hand-edit to pick out the "obviously correct" list of files.

  • My first shot at candidate-finding is almost always wrong.

  • Once the list of candidates is assembled, processing it is often simple, as in the case given (use a for-loop to apply a simple command [wget] to each entry).

  • If I'm worried about processing errors, I will often do some kind of "dry run" first. Many commands have a "just pretending" option.

    Lacking that, changing a command from /bin/rm to echo /bin/rm gives a poor man's dry run...

  • I usually run such shell scripts with set -x (show execution as you go), set -e (stop if some command fails), and set -u (stop if you meet an undefined variable).

  • I eschew whitespace in filenames precisely because it complicates the sort of processes outlined above.
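
The "poor man's dry run" and the process-a-list loop can be sketched together as below. This is only an illustration: the names maybe, process_list, DRY_RUN and the file name candidates are invented here, not taken from any standard tool.

```shell
# maybe CMD [ARGS...]
# With DRY_RUN set to a non-empty value, print the command instead of
# running it; otherwise run it unchanged.
maybe() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# process_list LISTFILE CMD [ARGS...]
# Apply CMD to each non-empty line of the (hand-edited) candidate list.
process_list() {
    list=$1
    shift
    while IFS= read -r item; do
        [ -n "$item" ] || continue
        maybe "$@" "$item"
    done < "$list"
}

# A real script would probably start with: set -eux
# Typical use:
#   DRY_RUN=1; process_list candidates /bin/rm    # look first
#   DRY_RUN=;  process_list candidates /bin/rm    # then do it
```

Reading the list line by line, rather than with backticks in a for-loop, also means an entry containing a space is passed to the command as a single argument.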

The "workflow" sketched and discussed above reflects several aspects of the "Unix Way": it's text-oriented; modest utilities (find, grep, ...) do one job each; pipes stitch processes together; small scripts are easy to write and to run.

But I would argue that the incremental way in which the solution was reached is also an essential part of the "Unix Way". I haven't seen anyone explicitly say so, however.

[An earlier version of this note appeared in Verilab's internal newsletter.]

Argh! More Xargs!


I am a devoted user of the xargs command, perhaps because it's the same as the map function from my previous Haskell life. So: a couple of xargs-related tricks...

xargs maps a command across a (potentially large) set of inputs. Those inputs often comprise a list of filenames, and that list is often produced with the find command. So, for example, to make a list of all your C files:

find . -name '*.c' -print

If you want them sorted...

find . -name '*.c' -print | sort

If you want the same, but excluding the 'ethernet' ones...

find . -name '*.c' -print | grep -v /ethernet/ | sort

[An equivalent which I never use because I always have to look it up:

find . \( -type d -name ethernet -prune \) -o \( -name '*.c' -print \)

] And, finally, you might run one of the simple forms...

find . -name '*.c' -print | sort > list-of-files

... and then edit the list-of-files by hand to be exactly what you want.
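
For anyone who would rather not look up the -prune form each time, it can be wrapped in a small function. This is a sketch; the name list_c_excluding is made up. The left-hand group matches directories with the given name and prunes them (find does not descend into them); the right-hand group, reached via -o only when the left did not match, prints the C files.

```shell
# list_c_excluding TOP DIRNAME
# List *.c files under TOP, sorted, without descending into any
# directory called DIRNAME.
list_c_excluding() {
    find "$1" \( -type d -name "$2" -prune \) -o \( -name '*.c' -print \) | sort
}

# Typical use:
#   list_c_excluding . ethernet > list-of-files
```

Unlike the grep -v version, this never walks the excluded directory at all, which may matter if that subtree is large.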

OK, that just gets us a list of files on which a Unix command might operate -- now to map a command over them. If you're really fed up with C, you might do...

find . -name '*.c' -print | xargs -r /bin/rm

... and your C days will be over :-) That is exactly the same as

/bin/rm `find . -name '*.c' -print`

except that the latter may fail with a "Command line too long" grumble if find produces a truly-monumental list of files. (xargs was invented specifically to solve this problem.)

Only it's better: that '-r' option makes xargs do nothing if there is no input from the preceding find.
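
Two harmless experiments make xargs's behaviour visible without deleting anything. The function names are invented for this sketch, and -r is assumed to be the GNU xargs option (long form --no-run-if-empty); other implementations may treat it differently.

```shell
# Show how xargs groups stdin items into command invocations:
# with -n 2, echo is run once for every two items.
show_batches() {
    printf '%s\n' one two three four five | xargs -n 2 echo
}

# With empty input, -r means the command is never run,
# so this prints nothing at all.
show_empty() {
    printf '' | xargs -r echo "ran with no input"
}
```

show_batches prints three lines ("one two", "three four", "five"), one per echo invocation. Without -n, xargs packs as many arguments into each invocation as the system's command-line limit allows, which is how it sidesteps the "too long" failure.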

The New Thing

A normal find/xargs thing will also Do The Wrong Thing if the gunk that comes out of find includes spaces (or other shell metacharacters). This is an increasingly-common event. The standard solution is:

find . -name '*.c' -print0 | xargs -r -0 /bin/rm

With this, find separates its "hits" with NUL characters ('\0'), and xargs is told to expect that (-0: that's a zero).

(In the very newest GNU findutils, you can have any separator character you like. NULs are pretty safe, though: they cannot appear in a Unix pathname.)

But what if you want to make a list of files (possibly with spaces-in-names, etc.), edit it, and then use xargs to process the list? My previous solution was: Wailing and Gnashing of Teeth.

My new solution is an XEmacs one. Get the list-of-files into a buffer, then do 'query-replace', replacing every newline with a NUL. Save the file, and run...

xargs -r -0 /bin/rm < edited-list-of-files

How do I replace one funny character with another, you ask? Well, the Emacs command for inserting any-old character is Ctrl-Q (quoted-insert), which is documented as:

Read next input character and insert it.
This is useful for inserting control characters.
You may also type up to 3 octal digits, to insert a character with that code.

Once you know newline is Ctrl-J and NUL is Ctrl-Space, then the full incantation is just:

M-x query-replace Ctrl-Q Ctrl-J Return Ctrl-Q Ctrl-Space Return

There! -- exactly the file you wanted. Feed to a hungry xargs -r -0.

OK, OK, maybe that's not everyone's idea of a good time. A little Perl one-liner in the middle of the pipeline can do the same job:

cat edited-list-of-files | perl -p -l0 -e '' | xargs -r -0 /bin/rm

(Useless use of cat for clarity.)
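
The same newline-to-NUL conversion can also be done with tr, if neither an editor macro nor Perl appeals. A minimal sketch (the name lines_to_nul is made up); it assumes one filename per line, so a name that itself contains a newline cannot be represented in the list.

```shell
# lines_to_nul
# Convert a newline-separated list on stdin into a NUL-separated one.
lines_to_nul() {
    tr '\n' '\000'
}

# Typical use:
#   lines_to_nul < edited-list-of-files | xargs -r -0 /bin/rm
```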

Another Nice Thing

Here's a related little detail worth recording. What if, instead of deleting all your C files, you want to count how many lines of code you have? This will work:

find . -name \*.c -print0 | xargs -0r wc -l | awk '/total/{a+=$1}END{print a}'

(I can't keep that awkery in my head, I must confess.)

And without counting blank lines and comments? -- an exercise left for the reader.
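
A variation that avoids the awk step is to concatenate the files and count once. This is a sketch with made-up function names. It also sidesteps two quirks of summing "total" lines: wc prints no total line when it is handed a single file, and a file whose name contains "total" would be matched too. The second function handles the blank-lines half of the exercise only; stripping comments is still left to the reader.

```shell
# count_c_lines [TOP]
# Total number of lines in all *.c files under TOP (default: .).
count_c_lines() {
    find "${1:-.}" -type f -name '*.c' -print0 | xargs -0 -r cat | wc -l
}

# count_nonblank_c_lines [TOP]
# Same, but skipping lines that are empty or whitespace-only.
count_nonblank_c_lines() {
    find "${1:-.}" -type f -name '*.c' -print0 | xargs -0 -r cat \
        | grep -c -v '^[[:space:]]*$'
}
```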

[An earlier version of this note appeared in Verilab's internal newsletter.]

Coding Style and Typography


No topic exercises Verilab's (or any other) engineers more than coding style.

Having coding guidelines or standards within an organization is usually justified along the lines of "Well, if we all code the same way, then our code will be mutually comprehensible and therefore maintainable." There is some truth to this.

Coding style often degenerates to personal taste. There are at least a couple of phenomena behind this. The first is that (and, yes, I am guessing here, but with some confidence...) individuals' cognitive skills (strengths and weaknesses) are different—our brains are not all wired the same. One person has a brain that is abbreviation-friendly; another just sees line noise. One person reacts favorably to nice alignment; another doesn't see the point. And so on. I would further surmise that a lot of these cognitive preferences are hardwired by adulthood and not readily shifted. The other person isn't a tasteless idiot—his synapses are strung together differently.

A second relevant phenomenon is that reading code is much harder (an order of magnitude?) than writing code. The skill of taking 40,000 lines of unfamiliar code and becoming able to work with it (without rewriting) is much rarer than the skill of writing 40,000 lines of code from scratch. Reading code is hard. Readable code is very hard.

The art and science (and I'm not sure how much there is of each) of making documents readable is old and venerable. The disciplines of information and graphics design and of book design and typography (Merriam-Webster: "the style, arrangement, or appearance of typeset matter") are precisely about readability. A lot of thought has gone into getting a child to read through Peter Rabbit, or helping an adult through War and Peace. We could do worse than to bring some of this thinking into our coding guidelines.

Edward Tufte is known for his work on information and graphic design. His first book, The Visual Display of Quantitative Information, gave us the term "chartjunk", by which he means ink (on graphs, in his case) that carries no information.

Tufte's ideas that "ink" (dark liquid smooshed onto paper, or coloured pixels on a screen) should carry information (in a Shannon information-theoretic sense) and that "more ink" should mean more (or more important?) information apply directly to coding guidelines. Little variable names should imply minimal information; big boxy comment blocks only make sense if important or dense information is to be conveyed. Ink that carries no information should be eliminated.

For the rest of this note, I offer snippets about typography from the Web Style Guide that might have some bearing on coding. They introduce the subject as follows:

TYPOGRAPHY is the balance and interplay of letterforms on the page, a verbal and visual equation that helps the reader understand the form and absorb the substance of the page content. Typography plays a dual role as both verbal and visual communication. As readers scan a page they are subconsciously aware of both functions: first they survey the overall graphic patterns of the page, then they parse the language, or read. Good typography establishes a visual hierarchy for rendering prose on the page by providing visual punctuation and graphic accents that help readers understand relations between prose and pictures, headlines and subordinate blocks of text. (citation)

For good readability (or 'legibility', as they would have it), there is a balancing act between visual contrasts and repeating patterns:

Good typography depends on the visual contrast between one font and another and between text blocks, headlines, and the surrounding white space. Nothing attracts the eye and brain of the reader like strong contrast and distinctive patterns, and you can achieve those attributes only by carefully designing them into your pages. ...

When your content is primarily text, typography is the tool you use to "paint" patterns of organization on the page. The first thing the reader sees is not the title or other details on the page but the overall pattern and contrast of the page. The regular, repeating patterns established through carefully organized pages of text and graphics help the reader to establish the location and organization of your information and increase legibility. Patchy, heterogeneous typography and text headers make it hard for the user to see repeating patterns and almost impossible to predict where information is likely to be located in unfamiliar documents. (citation)

On the specific matter of 'case':

Whether you choose uppercase or lowercase letters has a strong effect on the legibility of your text. Indeed, words set in all uppercase letters should generally be avoided—except perhaps for short headings—because they are difficult to scan.

We read primarily by recognizing the overall shape of words, not by parsing each letter and then assembling a recognizable word. (citation)

Emphasis in written material can be done with many techniques: bold or italic fonts, underlining, color, capitalization, or spacing and/or indentation. Their main point is that, for emphasis, less is more: "You will soon discover that only a small variation is required to establish visual contrast." (citation)

They make a related point about "graphic embellishments":

Horizontal rules, graphic bullets, icons, and other visual markers have their occasional uses, but apply each sparingly (if at all) to avoid a patchy and confusing layout. ... The tools of graphic emphasis are powerful and should be used only in small doses for maximum effect. Overuse of graphic emphasis leads to a "clown's pants" effect in which everything is garish and nothing is emphasized. (citation)

Unsurprisingly, they speak up for consistency (which programmers are pretty good at):

Repetition is not boring; it gives your site a consistent graphic identity that creates and then reinforces a distinct sense of "place". (citation)

Line length, it turns out, is as much physiology as style:

Text on the computer screen is hard to read not only because of the low resolution of computer screens but also because the layout of most Web pages violates a fundamental rule of book and magazine typography: the lines of text on most Web pages are far too long for easy reading. Magazine and book columns are narrow for physiological reasons: at normal reading distances the eye's span of acute focus is only about three inches wide, so designers try to keep dense passages of text in columns not much wider than that comfortable eye span. Wider lines of text require readers to move their heads slightly or strain their eye muscles to track over the long lines of text. Readability suffers because on the long trip back to the left margin the reader may lose track of the next line. (citation)

The following quote about 'page length' carries over fairly directly to 'subroutine length' in the coding world:

Researchers have noted the disorientation that results from scrolling on computer screens. The reader's loss of context is particularly troublesome when such basic navigational elements as document titles, site identifiers, and links to other site pages disappear off-screen while scrolling. ... Long Web pages require the user to remember too much information that scrolls off the screen; users easily lose their sense of context when the navigational buttons or major links are not visible... (citation)

There are some maxims of good typography that are hard to follow in programming; consider:

In page layout the top of the page is always the most dominant location, but on Web pages the upper page is especially important, because the top four inches of the page are all that is visible on the typical display screen. Use this space efficiently and effectively.

But in a typical subroutine in a typical programming language, we waste that prime real estate on variable declarations—the most boring part of the code (well, maybe).

The above typographical considerations are food for thought when working on coding guidelines and the like. In fact, I suspect there is much more that can be learned from typography as it might apply to programming.

An implication of these musings is that most presentation of code (whether on screen or paper) is shockingly poor in typographic terms. It would be interesting to know what typographers think of Donald Knuth's efforts to present his programs (e.g. TeX and Metafont)—he is the best known programmer who has given typography his best shot.

In summary, coding style need not be just one person's taste against another's. Other disciplines—including typography, graphic and information design—give us useful tools that we can bring to the party.

[An earlier version of this note appeared in Verilab's internal newsletter.]

I have mumbled from time to time about using the tool Unison for two-way synchronization between Linux boxen in different Verilab locations.

Unison also works between Windows and Linux (and Macs, for that matter), though not everybody cares for it (for the GUI in particular).

I use Unison to back up files from a Windows laptop to a Linux machine. I wouldn't claim it is pretty, and it can be tedious to set up. But it should be basically brainless to run after that.

  1. You need a matching pair of Unison binaries -- the version numbers must be identical.

    Most modern Linux distributions come with Unison (e.g. Fedora: yum install unison); take that. Once Unison is installed, run unison -version.

    For Windows, go to the download area and pick a matching version, perhaps from "Windows and OS X binaries". "Text-only" is fine for our purposes -- the main thing is for the version to match.

  2. Make sure you can ssh into your Linux box (from somewhere).

  3. On the Windows side:

    • You need an SSH client; I recommend Cygwin.

      Install Cygwin using the standard Unison instructions. (This is a painful way to get an SSH binary, but it's less excruciating than the alternatives.)

      The Unison documentation describes the install process in some detail.

    • To install Unison, I created a folder C:\Program Files\Unison, and deposited the executables from my downloaded Zip file in there. I also made a copy of the text version and called it unison.exe.

    • You now need to add the Cygwin and Unison directories to your PATH environment variable -- that is, C:\cygwin\bin and C:\Program Files\Unison. Again, this is described in the Unison documentation ("Installing SSH" section).

    • Get a Windows command prompt and see if you can SSH to your Linux box (e.g. ssh user@linux-box) and if Unison is around (unison -help). If these behave, you've got the tools and they're nicely in your PATH.

  4. Make a Unison "profile" on the Windows side. Here's mine, parked in C:\Documents and Settings\UserName\.unison\laptop.prf. It should be reasonably clear, and there's always the documentation...

    root = C:\Documents and Settings\UserName
    root = ssh://user@linux-box//home/username/documents-and-settings/
    
    
    times = true
    fastcheck = yes
    
    
    # (commented out; more on this later)
    # backup = Name *
    # maxbackups = 10
    
    
    ignore = Name *.dat
    ignore = Name *.exe
    ignore = Name *.iso
    ignore = Name *.lock
    ignore = Name *.LOG
    ignore = Name *.log
    ignore = Name *.pdf
    ignore = Name *.tmp
    ignore = Name .unison
    ignore = Name Backup
    ignore = Name Cache
    ignore = Name Cache.Trash
    ignore = Name Cookies
    ignore = Name Google Desktop Search
    ignore = Name History.IE5
    ignore = Name My Briefcase
    ignore = Name Recent
    ignore = Name SendTo
    ignore = Name Temp
    ignore = Name Temporary Internet Files
    ignore = Name backup
    ignore = Name unison
    ignore = Name udlog.txt
    ignore = Name ~$*
    
  5. Go! From your Windows shell, do: unison laptop -auto

    Unison will correctly observe that you haven't done this before, so the first run is basically a copy.

    The first time or two, I bumped into permission problems (on the Windows side) which gave an "error in digesting file". I added 'ignore' entries to my profile and had another go.

    Of course, you can also add 'ignore' entries for things you just don't want synchronized.

  6. I cannot tell a lie: after I got my .prf file the way I wanted, I deleted everything off the Linux side (i.e. the copy) -- including the .unison archive files! -- and redid the job so that the copy was "clean":

    linux% chmod -R u+rwX /home/username/documents-and-settings
    linux% rm -rf /home/username/documents-and-settings
    linux% rm -f ~/.unison/ar* # don't forget this archive!!!!!
    
    
    windows> del .unison\ar<whatever> # the matching archive
    
  7. Now, whenever you want to synchronize, pop open a Windows shell and do another: unison laptop -auto

  8. Even with -auto, Unison will occasionally ask about conflicts. It might look something like...

    local          slimy
    changed  <-?-> perms   Mail/archive/active
    

    That means "the file has changed 'local'ly, and the file permissions have changed on 'slimy' (whatever that is)" -- which do you want to keep?

    Type '?' for all of your options. 99% of the time in a case like this, you will type '>' (keep the left choice). If you're really doing two-way synching, you'll occasionally want to keep the right choice (with '<').
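
Since step 1 hinges on the two Unison versions being identical, a small guard can check that before each sync. This is a sketch, meant to be run from a Cygwin shell rather than the plain Windows command prompt; the function name same_version is made up, and user@linux-box is the same placeholder used above. It strips carriage returns because a native Windows binary may end its output lines with CR-LF.

```shell
# same_version LOCAL_OUTPUT REMOTE_OUTPUT
# Succeed only if both strings are non-empty and identical once any
# carriage returns have been removed.
same_version() {
    a=$(printf '%s' "$1" | tr -d '\r')
    b=$(printf '%s' "$2" | tr -d '\r')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Typical use:
#   if same_version "$(unison -version)" "$(ssh user@linux-box unison -version)"
#   then
#       unison laptop -auto
#   else
#       echo "Unison versions differ (or one side is missing)" >&2
#   fi
```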

The eagle-eyed among you will have spotted the commented-out 'backup' and 'maxbackups' lines in the "profile" above. This will apparently squirrel away the old copies that get replaced. At least on Windows, it has problems when applied retroactively (e.g. if I now uncomment it); but this might just be a bug. The backups thing may chew through lots of disk space, too (if you care).

If you need to upgrade Unison versions (or have it forced on you), do a sync right before the upgrade; then the upgrade; then another sync (which will rebuild the archives).

Here's a second use for Unison that quickly followed backups. I keep a pile of Windows binaries in /d/for-windows -- Firefox, AV stuff, etc. Unsurprisingly, it doesn't do me much good until I have a copy on a Windows box; moreover, I'm often on a Windows box when I collect a new version of one of these programs... sounds like a perfect Unison job. Here's my 'for-windows.prf' profile:

root = C:\for-windows
root = ssh://user@linux-box//d/for-windows/

times = true
fastcheck = yes

Just run unison for-windows -auto and whatever changes I've made on both sides will propagate.

[An earlier version of this note appeared in Verilab's internal newsletter.]

Introduction


I provide IT support for Verilab's roving band of hardware verification consultants.

Occasionally, I assemble IT-related information or opinion that I pass around internally. This blog is usually based on that material (often revised).

I hope you find something useful here. (But if something horrible happens to your systems because you followed my "advice", it's on your head and not my fault!)

About this Archive

This page is an archive of entries from January 2009 listed from newest to oldest.
