This will instruct wget to download the file doc.pdf from localhost and save it as doc.pdf. It is not as easy when the weber service is

requesting authentication or

the URL of the PDF file ends in the same file name

Authentication

The documentation of wget states that you can provide the username and password for BASIC authentication. What about a web site that asks for SAML 2.0? You can pass HTTP headers to wget via parameter –header. This feature makes it easy: log on to the server via a browser and then copy the headers. These headers contain the session information of you user and can be used by wget to connect as an authenticated user.

How to get the HTTP headers

Log on to the web site

Open developer tools

Select a web resource

Copy the HTTP headers. For cURL, its just selecting Copy all as cURL. This gives the complete cURL command. For just the headers, select Copy Request Headers.

Each line is one –header parameter for wget. It is not feasible to add all these headers to each wget request individually. For maintenance and better readability these values should be read from a file. Problem: wget does not allow to read the header parameter from a file. There is no option for something like –header <file_with_headers>. What there is the . wgetrc file. This is the configuration file wget reads when called, and in this file it`s possible to define HTTP header values. For each HTTP header, create a new “header = <value>” entry in the file.

With this configured in the file, wget will send always these HTTP headers with each request. If the session cookies copied from the browser are valid the requests are authenticated and wget is able to download the file.

File name

Sometimes the file you want to download has a generic URL. Each file ends in the same file name at the server. For instance, http://localhost/category/doc.pdf, or /uid/E.pdf. In such cases, wget will download the file and save it as doc.pdf or E.pdf. This is not a problem when you download just one file, but when you download more files, like 20, wget numerate the files: E.pdf.1, E.pdf.2, E.pdf.3, …

This makes it hard to work with the files. A solution can be to check if the web server is supporting content-disposition. If so, the server should send the real file name of the archive in the HTTP response. The real file name can be seen in the Conent-Disposition header as filename.