Incremental working as a Unix fundamental


Consider: you visit a Web page that links to a whole bunch of PDF files, and you decide you would like copies of all of them for later reading. Finding and clicking on all the links would be tedious and error-prone. What to do? Here's my solution.

  1. Save the HTML source of the page you're looking at. In Firefox, one way is View->Page Source, then save, perhaps as src.html.

  2. Make a list of all the PDF-file URLs in src.html:

    • If there are comparatively few (say, seven?), just edit the text file down to the list.

    • If you use Emacs, a keyboard macro (executed repeatedly) can boil the file down to PDF URLs fairly painlessly. The commands in the macro might be:

          # Note where we are:
          set-mark
          # Look for an interesting (NB: non-relative) URL
          isearch-forward-regexp http://.*\.pdf
          # Move back to the beginning of it:
          isearch-backward-regexp http:
          # "region": where we started to where we are now.
          kill-region
          # Put ourselves at the end of the URL:
          isearch-forward-regexp http://.*\.pdf
      

      Just repeat that a whole bunch (C-u 999 M-x call-last-kbd-macro), and you'll have your URLs.

    • A tiny Perl script is a different (and natural) solution:

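          #!/usr/bin/perl
          # Print the http://...pdf match (at most one) from each input line.
          while (<>) {
            next unless m,(http://.*\.pdf)\b,;
            print "$1\n";
          }
      

      Call that grab-urls, make it executable (chmod +x grab-urls), and run: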

          ./grab-urls src.html > pdfs-list
      

      Hmm... that version isn't catching everything: the page uses relative links (pdfs/..., review/...), which the http:// pattern cannot match. Look at the HTML source and make a few adjustments, giving...

          while (<>) {
             next unless m,href=\"((pdfs|review)/.*\.pdf)\b,;
             print "$1\n";
          }
      

      (Looks at results) Ah, that's better.

  3. Assuming the list of PDF URLs is in the file pdfs-list, look it over. Maybe edit it a bit. If it's obviously bogus, go back and fix something, or take a different tack.

  4. Go get all the PDFs!

    • Run from the command line (this assumes each line of pdfs-list is a full URL):

          for i in `cat ~/pdfs-list` ; do wget "$i" ; done
      

      If you don't want to alarm the website owners by many quick downloads, you can slow it down:

          for i in `cat ~/pdfs-list`; do wget "$i"; sleep 60; done
      

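      (wget's own --wait and --random-wait options can do this too, but they only pause between downloads made by a single wget invocation, for example one given the whole list with -i, so they have no effect in a loop that runs wget once per URL.)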

    • There are a whole bunch of things you could do instead of backticks-in-a-for-loop. Here's one I actually tried:

          ./grab-urls src.html \
           | xargs -I XXX wget http://example.com/XXX
      

      ['xargs' is the Unix command that maps a command over a set of data (on stdin). '-I XXX' tells it to run the command once per input line, replacing XXX in the command template with that line; a separate '-n 1' is unnecessary, and GNU xargs treats the two options as mutually exclusive.]

      Another is to edit pdfs-list directly into a script, make it executable (chmod +x) and then run it; for example:

          #!/bin/sh -x
          #
          set -e
          set -u
          #
          for i in \
          pdfs/Introduction.pdf \
          pdfs/AmplifyLearning.pdf \
          review/chapter1.pdf \
          review/chapter2.pdf \
          review/chapter3.pdf \
          ; do
            wget http://www.poppendieck.com/$i
          done
      

      Except for the SPACE BACKSLASH stuck on the .pdf lines, the "script" is little more than our list of files.
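
For comparison, the extract-and-fetch steps above can also be sketched with grep and sed alone. Treat this as an illustration under assumptions, not as the method I actually used: the src.html here is a made-up sample, http://example.com is a placeholder base URL, and grep -o is a GNU/BSD extension rather than POSIX. The loop only echoes the wget commands, so nothing is downloaded until you remove the echo.

```shell
#!/bin/sh
set -e
set -u

# Hypothetical stand-in for the saved page source.
workdir=$(mktemp -d)
cd "$workdir"
cat > src.html <<'EOF'
<p><a href="pdfs/Introduction.pdf">Intro</a> <a href="review/chapter1.pdf">Ch. 1</a></p>
<p><a href="about.html">About</a> <a href="pdfs/AmplifyLearning.pdf">Amplify</a></p>
EOF

base=http://example.com   # placeholder; substitute the real site

# grep -o prints each match on its own line, so several links on one
# input line are all caught; sed then strips the href="..." wrapper
# and prepends the base URL.
grep -o 'href="[^"]*\.pdf"' src.html \
  | sed -e 's/^href="//' -e 's/"$//' -e "s,^,$base/," \
  > pdfs-list

# Dry run: show what would be fetched, one command per line.
while IFS= read -r url; do
  echo wget "$url"
done < pdfs-list
```

As with the Perl version, the thing to do next is look at pdfs-list and adjust the pattern if the list is obviously bogus.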

OK, what's the point? Some observations:

  • I (and most Unix people?) grope my way to a "solution" something like this all the time.

  • Most of the work invariably seems to be finding the "what's to be operated on" candidates -- often files, but could also be processes, usernames, variable names, etc.

  • I tend to use find a lot, often with a grep or two thrown in for winnowing... For example, suppose I want to process "all the SystemVerilog code new since Christmas"... I might do...

    find . -type f -name \*.sv -mtime -30 | grep -v OLD
    

    ... (-mtime -30 means "modified within the last 30 days", a rough stand-in for "since Christmas"), perhaps putting it in an ...

    ls -ltr `...find cmd here...`
    

    ... and then doing a hand-edit to pick out the "obviously correct" list of files.

  • My first shot at candidate-finding is almost always wrong.

  • Once the list of candidates is assembled, processing it is often simple, as in the case given (use a for-loop to apply a simple command [wget] to each entry).

  • If I'm worried about processing errors, I will often do some kind of "dry run" first. Many commands have a "just pretending" option (make -n and rsync -n, for instance).

    Lacking that, changing a command from /bin/rm to echo /bin/rm gives a poor man's dry run...

  • I usually run such shell scripts with set -x (show execution as you go), set -e (stop if some command fails), and set -u (stop if you meet an undefined variable).

  • I eschew whitespace in filenames precisely because it complicates the sort of processes outlined above.
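
Several of these habits fit in one small script. What follows is a sketch under assumptions: the file names, the scratch directory, and the DRY_RUN variable with its run helper are all invented for illustration, and rm stands in for whatever the real processing command would be.

```shell
#!/bin/sh
set -e   # stop if some command fails
set -u   # stop on an undefined variable

# run CMD ARGS...: print the command when DRY_RUN=1, otherwise execute it.
# (A hypothetical helper; it is the "echo /bin/rm" trick with a switch.)
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$@"
  else
    "$@"
  fi
}

# A scratch directory holding a few made-up candidates.
workdir=$(mktemp -d)
cd "$workdir"
touch a.sv b.sv OLD_c.sv notes.txt

# Candidate-finding: find, with a grep thrown in for winnowing.
find . -type f -name '*.sv' | grep -v OLD | sort > candidates

# First pass: just pretend, and look at what would happen.
DRY_RUN=1
dry_output=$(while IFS= read -r f; do run rm "$f"; done < candidates)
echo "$dry_output"

# Second pass, once the dry run looks right: do it for real.
DRY_RUN=0
while IFS= read -r f; do run rm "$f"; done < candidates
```

Reading the list line by line and quoting "$f" also happens to tolerate spaces in file names, which a for-loop over backticks does not; I would still rather not have them.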

The "workflow" sketched and discussed above reflects several aspects of the "Unix Way": it's text-oriented; modest utilities (find, grep, ...) do one job each; pipes stitch processes together; small scripts are easy to write and to run.

But I would argue that the incremental way in which the solution was reached is also an essential part of the "Unix Way". I haven't seen anyone explicitly say so, however.

[An earlier version of this note appeared in Verilab's internal newsletter.]

This page contains a single entry by Will Partain published on January 28, 2009 1:23 AM.
