Consider: you visit a Web page which links to a whole bunch of PDF files. You decide you would like copies of all of them for later reading. Finding and clicking on all the links would be tedious and error-prone. What to do? -- here's my solution.
Save the HTML source of the page I'm looking at. In Firefox, one way is View->Page Source, then save, perhaps as src.html.
Make a list of all the PDF-file URLs in src.html:
If there are comparatively few (say, seven?), just edit the text file down to the list.
If you use Emacs, a keyboard macro (executed repeatedly) can boil the file down to PDF URLs fairly painlessly. The commands in the macro might be:

    # Note where we are:
    set-mark
    # Look for an interesting (NB: non-relative) URL
    isearch-forward-regexp http://.*\.pdf
    # Move back to the beginning of it:
    isearch-backward-regexp http:
    # "region": where we started to where we are now.
    kill-region
    # Put ourselves at the end of the URL:
    isearch-forward-regexp http://.*\.pdf

Just repeat that a whole bunch (C-u 999 M-x call-last-kbd-macro), and you'll have your URLs.
A tiny Perl script is a different (and natural) solution:

    #!/usr/bin/perl
    while (<>) {
        next unless m,(http://.*\.pdf)\b,;
        print "$1\n";
    }

Call that grab-urls, make it executable, and run:

    ./grab-urls src.html > pdfs-list

Hmm... that version isn't catching everything... Look at the HTML source and make a few adjustments, giving...

    #!/usr/bin/perl
    while (<>) {
        next unless m,href=\"((pdfs|review)/.*\.pdf)\b,;
        print "$1\n";
    }

(Looks at results) Ah, that's better.
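For comparison, here is a minimal sketch of the same extraction in plain shell, assuming a grep that supports -o (GNU and BSD grep both do). The function name extract_pdf_links and the sample HTML are made up for illustration. One difference from the Perl loop above: the Perl version reports at most one match per input line, and its greedy .* can swallow several links at once, whereas this version reports each href separately.

```shell
#!/bin/sh
# extract_pdf_links: print the value of every href="...pdf" attribute found
# on stdin, one per line. Hypothetical helper, not part of the original note.
extract_pdf_links() {
    # grep -o prints each match on its own line; [^"]* cannot run past the
    # closing quote, so two links on one line of HTML stay separate.
    grep -o 'href="[^"]*\.pdf"' |
        sed -e 's/^href="//' -e 's/"$//'
}

# Made-up sample input standing in for src.html.
sample='<a href="pdfs/Introduction.pdf">Intro</a> <a href="review/chapter1.pdf">Ch 1</a>
<a href="index.html">Home</a>'

links=$(printf '%s\n' "$sample" | extract_pdf_links)
printf '%s\n' "$links"
```

On a real page this would be run as extract_pdf_links < src.html > pdfs-list, and the result still deserves the same looking-over as any other first attempt.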
Assuming the list of PDF URLs is in the file pdfs-list, look it over. Maybe edit it a bit. If it's obviously bogus, go back and fix something, or take a different tack.
Go get all the PDFs!
Run from the command line:

    for i in `cat ~/pdfs-list` ; do wget "$i" ; done

If you don't want to alarm the website owners by many quick downloads, you can slow it down:

    for i in `cat ~/pdfs-list`; do wget "$i"; sleep 60; done

(Yes, wget's --wait and --random-wait options will work as well.)
There are a whole bunch of things you could do instead of backticks-in-a-for-loop. Here's one I actually tried:

    ./grab-urls src.html \
      | xargs -n 1 -I XXX wget http://example.com/XXX

['xargs' is the Unix command that maps a command over a set of data (on stdin). I've added '-n 1' to process one line of input at a time; and '-I XXX' to tell it to replace XXX in the command template given.]
Another is to edit pdfs-list directly into a script, make it executable (chmod +x) and then run it; for example:

    #!/bin/sh -x
    #
    set -e
    set -u
    #
    for i in \
        pdfs/Introduction.pdf \
        pdfs/AmplifyLearning.pdf \
        review/chapter1.pdf \
        review/chapter2.pdf \
        review/chapter3.pdf \
        ; do
        wget "http://www.poppendieck.com/$i"
    done

Except for the SPACE BACKSLASH stuck on the .pdf lines, the "script" is little more than our list of files.
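Yet another variation, sketched here under assumptions rather than taken from the original session: turn the relative paths into absolute URLs first, then hand the whole list to a single wget run, which is the case where --wait and --random-wait pause between downloads. The helper name absolutize and the file name urls-list are hypothetical, and http://example.com stands in for the real site.

```shell
#!/bin/sh
# absolutize BASE: prefix each non-empty line of stdin with BASE/.
# Hypothetical helper; BASE is assumed to contain no '|', '&' or backslash,
# since it is pasted into a sed replacement.
absolutize() {
    sed -e '/^$/d' -e "s|^|$1/|"
}

# Stand-in for pdfs-list (relative paths, one per line).
urls=$(printf 'pdfs/Introduction.pdf\nreview/chapter1.pdf\n' |
    absolutize http://example.com)
printf '%s\n' "$urls"

# With a real list, the result could be saved and fetched in one wget run:
#   absolutize http://example.com < pdfs-list > urls-list
#   wget --wait=60 --random-wait -i urls-list
```

Keeping the URL-building step separate from the download means the list can be inspected before anything touches the network.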
OK, what's the point? Some observations:
I (and most Unix people?) grope my way to a "solution" something like this all the time.
Most of the work invariably seems to be finding the "what's to be operated on" candidates -- often files, but could also be processes, usernames, variable names, etc.
I tend to use find a lot, often with a grep or two thrown in for winnowing... For example, suppose I want to process "all the SystemVerilog code new since Christmas"... I might do...

    find . -type f -name \*.sv -mtime -30 | grep -v OLD

... perhaps putting it in an ...

    ls -ltr `...find cmd here...`

... and then doing a hand-edit to pick out the "obviously correct" list of files.
My first shot at candidate-finding is almost always wrong.
Once the list of candidates is assembled, processing it is often simple, as in the case given (use a for-loop to apply a simple command [wget] to each entry).
If I'm worried about processing errors, I will often do some kind of "dry run" first. Many commands have a "just pretending" option. Lacking that, changing a command from /bin/rm to echo /bin/rm gives a poor man's dry run...
I usually run such shell scripts with set -x (show execution as you go), set -e (stop if some command fails), and set -u (stop if you meet an undefined variable).
I eschew whitespace in filenames precisely because it complicates the sort of processes outlined above.
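The observations above (find the candidates, dry-run the processing, run with set -e and set -u) can be put together in one small sketch. Everything named here is an assumption for illustration: the run wrapper, the DRY_RUN variable, and the throwaway files. It also uses a reference file with find -newer as one way to express "since Christmas" exactly, where -mtime -30 means "in the last 30 days". mktemp -d is assumed to be available, as it is on most current systems.

```shell
#!/bin/sh
set -e   # stop if a command fails
set -u   # stop on an undefined variable

# run CMD...: print the command when DRY_RUN is non-empty, else execute it.
# Hypothetical wrapper; DRY_RUN is a name chosen for this sketch.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Build a throwaway tree: one old file, one new file, one new file under OLD/.
work=$(mktemp -d)
mkdir "$work/OLD"
touch -t 202001010000 "$work/old.sv"
touch "$work/new.sv" "$work/OLD/stale.sv"
# Reference file whose timestamp marks the cutoff (here, Christmas 2023).
touch -t 202312250000 "$work/cutoff"

# Candidates: .sv files modified after the cutoff, minus anything under OLD.
candidates=$(cd "$work" && find . -type f -name '*.sv' -newer cutoff | grep -v OLD)

# Dry run first: print what would be removed, touch nothing.
# The unquoted $candidates relies on filenames having no whitespace.
DRY_RUN=1
for f in $candidates; do
    run rm "$work/$f"
done

# The file survived the dry run; now tidy up the sketch's own scratch area.
[ -f "$work/new.sv" ] && echo "new.sv is still there"
rm -rf "$work"
```

Clearing DRY_RUN and re-running the loop would perform the real operation on exactly the list that was just reviewed.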
The "workflow" sketched and discussed above reflects several aspects
of the "Unix Way": it's text-oriented; modest utilities (find, grep,
...) do one job each; pipes stitch processes together; small scripts
are easy to write and to run.
But I would argue that the incremental way in which the solution was reached is also an essential part of the "Unix Way". I haven't seen anyone explicitly say so, however.
[An earlier version of this note appeared in Verilab's internal newsletter.]
