3.2 GETURL: Retrieving Web Pages

GETURL is a versatile building block for shell scripts that need to retrieve
files from the Internet. It takes a web address as a command-line parameter and
tries to retrieve the contents found at that address. The contents are printed
to standard output, while the HTTP header is printed to /dev/stderr.
A surrounding shell script
could analyze the contents and extract the text or the links. An ASCII
browser could be written around GETURL. But more interestingly, web robots are
straightforward to write on top of GETURL. On the Internet, you can find
several programs of the same name that do the same job. They are usually
much more complex internally and at least 10 times longer.
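
For example, assuming the script has been saved in a file named
geturl.awk (the file name is hypothetical), the page body and the
header can be captured in separate files like this:

     gawk -f geturl.awk http://www.example.com/ > body.html 2> header.txt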

First, GETURL checks whether it was called with exactly one web address.
Then it checks whether the user chose to use a special proxy server, whose
name is handed over in a variable. By default, it is assumed that the local
machine serves as proxy. GETURL uses the GET method by default
to access the web page. By handing over the name of a different method
(such as HEAD), it is possible to choose a different behavior. With
the HEAD method, the user does not receive the body of the page
content, but does receive the HTTP header:
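What follows is a sketch of the program along the lines just described.
The networking primitives (the /inet special file name and the |&
coprocess operator) are genuine gawk features; the variable names Proxy,
ProxyPort, and Method and the usage message are assumptions consistent
with the description above:

     BEGIN {
       if (ARGC != 2) {
         print "GETURL - retrieve Web page via HTTP 1.0"
         print "Usage: gawk [-v Proxy=...] [-v Method=...] -f geturl.awk URL"
         exit
       }
       URL = ARGV[1]; ARGV[1] = ""
       if (Proxy     == "")  Proxy     = "127.0.0.1"  # local machine as proxy
       if (ProxyPort ==  0)  ProxyPort = 80
       if (Method    == "")  Method    = "GET"        # GET unless overridden
       HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
       ORS = RS = "\r\n\r\n"       # records are separated by blank lines
       print Method " " URL " HTTP/1.0" |& HttpService
       HttpService |& getline Header      # first record: the HTTP header
       print Header > "/dev/stderr"
       while ((HttpService |& getline) > 0)   # remaining records: the body
         printf "%s", $0
       close(HttpService)
     }

Setting ORS together with RS means the print statement automatically
terminates the request with the blank line that HTTP expects, and
overriding Method on the command line (gawk -v Method=HEAD ...) selects
the HEAD behavior described above.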

This program can be changed as needed, but be careful with the last lines.
Make sure transmission of binary data is not corrupted by additional line
breaks. Even as it is now, the byte sequence "\r\n\r\n" would
disappear if it were contained in binary data, because it serves as the
record separator and is therefore stripped from each record that is read.
Don't get caught in a trap when trying a quick fix on this one.
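
As a minimal sketch of one possible remedy (not necessarily the right
fix for every use): gawk stores the text that actually matched RS in the
built-in variable RT, so printing each record followed by RT reproduces
the body byte for byte, including any embedded "\r\n\r\n":

     # Body loop that preserves the record separator.
     # RT holds the text that matched RS for the current record
     # (and is empty at end of input if no separator followed),
     # so nothing is lost if "\r\n\r\n" occurs inside the body.
     while ((HttpService |& getline) > 0)
       printf "%s%s", $0, RT

Whether this is fully binary-safe also depends on how the gawk version
at hand copes with NUL bytes in strings, so truly binary transfers may
still call for a different tool.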