Category: water quality

I keep having to Google wget incantations, so I'm going to write some common ones down here. The spell of the moment is below, and it can be used with my previous post about processing EPA WQX/STORET domain values into useful tables:

-nc (--no-clobber) means don't re-download files that are already present locally (handy if you want to resume later, or check for new files some other time).
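A minimal sketch of a re-runnable fetch; the URL is illustrative:

```shell
# -nc skips any file that already exists locally, so running this
# again only fetches files that weren't downloaded the first time.
wget -nc https://example.com/data/results.csv
```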

If the links point to different hosts or subdomains, you can enable host-spanning with the -H option. For example, if bar.html contains links to files on host src.foobar.com, wget won't fetch them unless you specify -H.

It’s also a good idea in that case to limit spanning to a domain using -D foobar.com.
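Putting those two options together, a hedged sketch (hosts and URL are illustrative):

```shell
# -r follows links recursively; -H allows hopping to other hosts,
# and -D foobar.com keeps the crawl inside foobar.com subdomains,
# so links to src.foobar.com are followed but offsite links are not.
wget -r -H -D foobar.com https://www.foobar.com/bar.html
```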

The -w option (e.g. -w 30) waits the given number of seconds between retrievals. Not used here.

You can also use --limit-rate=20k to limit the download speed to 20 KB per second.
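Both politeness options in one sketch, against a hypothetical site:

```shell
# Pause 30 seconds between files and cap bandwidth at 20 KB/s,
# so a recursive grab doesn't hammer the server.
wget -r -w 30 --limit-rate=20k https://example.com/files/
```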

-nd (--no-directories) means don't recreate the directory hierarchy found on the server; just stick everything into a single directory.
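For example (URL illustrative):

```shell
# Without -nd, wget would mirror the server's folder tree locally;
# with it, every file lands flat in the current directory.
wget -r -nd https://example.com/downloads/
```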

-c continues an incomplete download instead of starting over from scratch.
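A sketch of resuming an interrupted transfer (filename is hypothetical):

```shell
# If archive.tar.gz is partially downloaded, -c picks up where it
# left off rather than re-fetching the whole file.
wget -c https://example.com/big/archive.tar.gz
```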

--user-agent: Some websites refuse downloads when they detect that the user agent is not a browser. You can mask the user agent with the --user-agent option so wget presents itself as a browser.
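A sketch with a made-up but plausible browser string (both the string and the URL are illustrative):

```shell
# Send a browser-like User-Agent header instead of wget's default.
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0" \
    https://example.com/page.html
```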

If the connection is flaky and the file is large, there's a good chance a download will fail partway through. By default wget tries up to 20 times to complete a download. If needed, you can raise the retry count with the --tries option.
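For instance (retry count and URL are arbitrary):

```shell
# Keep retrying up to 75 times instead of the default 20,
# useful on an unreliable connection.
wget --tries=75 https://example.com/big/archive.tar.gz
```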

Many sites now employ some means of blocking robots like wget from accessing their files, most of the time via .htaccess rules. The permanent workaround is to have wget mimic a normal browser, as with --user-agent above; adding the -d (debug) option prints the full request and response so you can see what the server is objecting to. For example: $ wget -O/dev/null -d http://www.askapache.com

If you run the command at the top, you'll get a directory of files as below