How to Download Files With Wget

Wget is a great tool for automating downloads of entire websites, individual files, or anything else where you need to mimic a traditional web browser. This article covers many of the things you can use wget for.

If wget isn’t installed, you can use either apt or yum to install it:

Installing Wget on Debian, Ubuntu

$ sudo apt install wget

Installing Wget on RHEL, CentOS

$ sudo yum install wget

Installing Wget on Windows

There is a Windows binary for wget, but we’ve found that Cygwin works much better and provides other useful tools as well.

Basic Download with Wget

For the most part you should be able to just download a file, but if it’s https you might have certificate problems. In that case use the --no-check-certificate flag. Keep in mind that this skips certificate validation, so only use it with hosts you trust.

$ wget --no-check-certificate https://wordpress.org/latest.tar.gz

Download File into Different Name and Location

Maybe you want to save a file under a different name (-O) or in a different location (-P)? By default, wget downloads the file to the current working directory and keeps the original file name.
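
Here is a minimal sketch of both flags. The function names and any paths you pass in are hypothetical, chosen only for illustration. Note that -O takes the full output path, while -P only changes the directory and keeps the original file name.

```shell
# Hypothetical helpers; the names are illustrative, not part of wget.

# Save URL under a name of your choosing (-O takes the full output path).
save_as() {
    url=$1
    outfile=$2
    wget -O "$outfile" "$url"
}

# Keep the original file name, but place it in another directory (-P).
save_into() {
    url=$1
    dir=$2
    wget -P "$dir" "$url"
}
```

For example, save_as https://wordpress.org/latest.tar.gz /tmp/wordpress.tar.gz would store the archive as /tmp/wordpress.tar.gz, and save_into https://wordpress.org/latest.tar.gz /tmp/downloads would store it as /tmp/downloads/latest.tar.gz.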

Download Entire Website

Though you might need to fiddle with cookies, host spanning, recursion depth, domains, and the other more advanced flags, you should start with a basic download of an entire website, using the “mirror” and “local browsing” flags:

$ wget -m -k -p https://awesomesite.com

Tip: You might also need to gunzip the files if they are compressed.

Rate Limit Wget Downloads

It is rude to blindly torch a server’s resources. It is polite (and won’t set off as many alarms) to request resources at a more respectable rate. Many site administrators block wget because, by default, people do not behave nicely. Here is how to be more polite when using wget:

$ wget --limit-rate=20k --wait=60 --random-wait --mirror site.com

Use Passwords with wget

This only works with HTTP authentication such as basic auth, not with form-based logins, but here are the flags for supplying a user and password:

$ wget --http-user=USER --http-password=PASS URL

Use wget to Check for Broken Links

If you are scanning a site, it’s polite to wait 1 second between grabs. The following will spider a site and look for broken links, dumping the information to a wget.log file.

$ wget --spider -o wget.log -e robots=off --wait 1 -r -p https://rubysash.com/

Download MP3 files from Directory

It may be useful to limit your downloads to a specific directory and its subdirectories. The --no-parent flag will help with this. Here is an example to download mp3 files from a directory:
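
A minimal sketch of such a download, assuming the files are linked from a plain directory listing. The function name is hypothetical; -r recurses, --no-parent keeps wget from climbing above the starting directory, and -A mp3 keeps only files ending in mp3.

```shell
# Hypothetical helper: recursively fetch only mp3 files below the given URL.
fetch_mp3s() {
    url=$1
    # -r           recurse into subdirectories
    # --no-parent  never ascend above the starting directory
    # -A mp3       accept (keep) only files with the mp3 suffix
    wget -r --no-parent -A mp3 "$url"
}
```

You would call it with the directory URL, for example fetch_mp3s https://example.com/music/ (a placeholder address).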

Scan list of sites for New PDFs

Sometimes, there are particular files you are interested in and ONLY those files. Wouldn’t it be nice to monitor multiple websites for these files all at once and keep a local copy for easy browsing at your leisure? You can surely do this, though it might not provide the site owner with the ad revenue or metrics that they desire:
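
One way this could look, as a sketch rather than a definitive recipe: keep the sites in a text file, one URL per line, and let wget recurse over each of them while keeping only PDFs. The function name, the recursion depth of 2, and the file and directory names are all assumptions made for illustration.

```shell
# Hypothetical helper: scan every URL listed in a file and keep only PDFs.
fetch_new_pdfs() {
    list=$1   # text file with one site URL per line (assumed layout)
    dest=$2   # local directory that collects the copies
    # -r     recurse through each site
    # -l 2   assumed depth limit, adjust to taste
    # -N     timestamping: skip files that are not newer than the local copy
    # -A pdf keep only files with the pdf suffix
    # -P     directory prefix for everything that is saved
    # -i     read the start URLs from the list file
    wget -r -l 2 -N -A pdf -P "$dest" -i "$list"
}
```

Running something like fetch_new_pdfs sites.txt pdfs from cron would then refresh the local copies on a schedule.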

Using wget with login cookies

You can have wget fetch the cookies itself, or you can log in with a browser and use that cookie file after you manually create it. I was able to use this to get past a WordPress password-protected area of a membership site.
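
Here is a rough sketch of the first approach, where wget logs in and stores the cookies itself. The function names are hypothetical, and the form field names in the POST data depend entirely on the site’s login form, so treat them as placeholders.

```shell
# Hypothetical helpers; names and arguments are for illustration only.

# Post the login form and store the resulting cookies in a jar file.
login_and_save_cookies() {
    login_url=$1
    post_data=$2   # e.g. 'user=USER&password=PASS' (field names are site specific)
    jar=$3
    # --keep-session-cookies also saves cookies that have no expiry date
    # --delete-after discards the login response page once it is fetched
    wget --save-cookies "$jar" --keep-session-cookies \
         --post-data "$post_data" --delete-after "$login_url"
}

# Reuse the stored cookies for a page behind the login.
fetch_with_cookies() {
    jar=$1
    url=$2
    wget --load-cookies "$jar" "$url"
}
```

If you log in with a browser instead, export its cookies to a cookies.txt file and use only the second function.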

Populate Cache Using Wget

WordPress has plugins that cache. There are also Squid proxies and a plethora of other caching mechanisms. If you want to preload your caches (whatever they are), you can do it with wget:

$ wget -o /dev/null -r --delete-after http://dynamicsite.com

Use wget Through Proxy

We use a SOCKS proxy quite a bit, tunneled over ssh to a remote server, to bypass firewalls (ssh user@remote -D 7070). After the proxy is set up, we point Firefox’s SOCKS proxy config at 127.0.0.1:7070. You could use wget through a proxy like this:

$ export http_proxy="http://127.0.0.1:7070"
$ wget [normal wget usage, but now it's going through the proxy]

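If you use a different proxy, just export it accordingly and wget will pick it up from the environment. Note that wget expects an HTTP proxy in http_proxy; as far as we know, stock wget builds do not speak SOCKS natively, so a SOCKS port like the ssh one above may need an HTTP-to-SOCKS bridge in front of it.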

Download with wget Using Timestamps

This isn’t so much a feature of wget as it is of the shell, but working hand in hand they let you take a dynamic site and grab periodic data from it, saving it as sequential snapshots.
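
A minimal sketch of the idea: let the shell’s date command build a timestamped file name and hand it to wget with -O. The function name and the snapshot- file name prefix are hypothetical.

```shell
# Hypothetical helper: save one timestamped snapshot of a page.
snapshot() {
    url=$1
    # Timestamp such as 20240131-235959, built by the shell rather than wget
    stamp=$(date +%Y%m%d-%H%M%S)
    wget -O "snapshot-$stamp.html" "$url"
}
```

Call it from cron or a loop at whatever interval you need, and each run leaves a new file that sorts in chronological order.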