Learning How To Use Wget and Python Cautiously

Wget is a program designed to “retrieve” files from websites and download them onto your computer. As I have learned in class from Dr. Graham and from the tutorial, it is a very powerful tool: it can do something as simple as downloading a single paper or several links, and with the proper commands it can even mirror an entire website. If you are not careful, as we discussed in class, you can even get blacklisted from a website! This is something that I did not know was possible.

What I found most difficult, though, was using Wget alongside Python. The second part of the tutorial, for instance, taught the user multiple techniques. The first was downloading multiple files at once by entering the URLs sequentially into a script in a text editor (I chose TextWrangler). The example was an 80-page diary from Library and Archives Canada, and the script looked like this:
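The script itself did not survive into this post, but it was essentially a short Python loop that builds one URL per diary page. Here is a minimal sketch of the idea; the base URL and numbering are made-up placeholders, not the tutorial's actual Library and Archives Canada addresses:

```python
# Sketch of the tutorial's approach: build one URL per diary page.
# The base URL and numbering below are placeholders, not the real
# Library and Archives Canada addresses.
urls = []
for page in range(1, 81):  # an 80-page diary
    urls.append("http://example.ca/diary/e0000%02d.jpg" % page)

# Each URL can then be fetched one by one, or the list written to a
# file and handed to wget with: wget -i urls.txt
print(len(urls))  # 80
```

The point is simply that Python generates the sequential URLs so you do not have to type eighty addresses by hand.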

As you can see, the Python portion of the tutorial was much more complex, as it involved more detailed work.

The other thing this tutorial taught us was how Python can help Wget retrieve files from websites by getting around barriers, such as sites that will not allow you to download more than one thing at a time. These commands are very useful but, again, complex, and even more so for someone like me who has little to no background in coding.

Much of this tutorial was very interesting to someone like me who, again, is new to the more complex digital world of coding. As you will see in my notes below, I had to follow the instructions to a T in order to completely understand.

Here they are:

– I checked to see if I had wget installed and I did not so I installed it
– had trouble originally downloading Xcode but managed to
– downloading the package manager was complicated at times due to the number of steps but managed to complete it
– made a new directory in order to download a specific set of files by typing in mkdir wget-activehistory into the terminal as was instructed in the tutorial
– this has allowed me to download certain things from activehistory.ca
– wget operates by typing wget into the command-line terminal, followed by the option you would like it to carry out and then the URL, like so: wget [option] [url]
– my first attempt turned out favourably: with -r (recursive retrieval) as the option, the program retrieved the other files linked from the URL
– this turned out to be really complicated, though, as it also followed the links on those pages and brought back multiple links off of those links
– to stop this, the tutorial gave another option, --no-parent, which tells wget not to follow links above the parent directory, in other words not to follow links beyond your current URL
– you can also tell it to follow links up to 5 levels deep from the starting page if you prefer
– another option was a bit of an odd concept, but it successfully did what it was supposed to, as it didn’t show any error messages when I entered -w 10, which makes wget wait ten seconds between requests so it does not overload the server
– I also tried a different amount of time, as the tutorial suggested was possible
– wget can also mirror a website; you have to be sure to enter a trailing slash after the URL on the command line
– the first time I did this it didn’t work, as I made the mistake of forgetting exactly that
– I did successfully use the -m option, though, and was able to navigate through the mirrored website I had created with wget
– using wget alongside Python, though, was a bit more complicated, as it required entering more complex commands while also working in another program, in this case TextWrangler
– for example, when trying to retrieve an 80-page war diary I had to be VERY careful to enter the proper URLs
– for example, I missed a single ‘ before a URL and the script did not work
– the tutorial then goes into how sometimes automated downloading requires working around coding barriers
– for example, it says that for URL patterns that may be too complex for someone with a limited background in coding (such as myself) to design a Python script for, wget has a built-in option called Accept (-A)
– this lets you specify which kinds of files you want wget to keep from a URL
– it uses the Thomas Jefferson papers as an example, getting you to enter a series of URLs into TextWrangler and then write an even more complicated command into the command line
– being able to work around barriers like this is a really cool concept, and the way the tutorial spoke of accessibility more generally made me rethink how I think of accessibility
– the tutorial then instructs you on what to do if the directory you are in is closed and only allows you to view one image or file at a time
– this was complex, as I had to be careful when entering the URLs into the TextWrangler file, but I avoided my earlier mistake (and others like it) and was able to retrieve the different .jpeg files
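The wget commands from my notes above can be collected in one place. This is only a sketch: the URL is a placeholder, and the helper function is mine rather than the tutorial's; it just assembles the wget [option] [url] pattern as an argument list.

```python
# Assembling the wget invocations described in the notes above as
# argument lists (the form Python's subprocess module expects).
# The URL is a placeholder, not a real target site.

def wget_command(url, *options):
    """Build an argument list following the pattern: wget [option] [url]."""
    return ["wget", *options, url]

url = "http://example.ca/dir/"  # trailing slash matters when mirroring

recursive = wget_command(url, "-r")                      # follow links on the page
no_parent = wget_command(url, "-r", "--no-parent")       # don't climb above the start URL
five_deep = wget_command(url, "-r", "-l", "5")           # follow links at most 5 levels deep
polite    = wget_command(url, "-r", "--no-parent", "-w", "10")  # wait 10 s between requests
mirror    = wget_command(url, "-m")                      # mirror the whole site

# To actually run one (requires wget installed):
# import subprocess; subprocess.run(polite)
print(" ".join(polite))  # wget -r --no-parent -w 10 http://example.ca/dir/
```

Passing the command as a list rather than one big string avoids the kind of quoting mistake I made in TextWrangler.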
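The Accept option (-A) from the notes can be sketched the same way; the address below is made up, not the actual Thomas Jefferson papers URL:

```python
# Sketch of wget's Accept option (-A): crawl recursively but keep only
# files whose names end with the listed suffixes.
# The URL is a placeholder, not the real Thomas Jefferson papers address.
accept_cmd = ["wget", "-r", "--no-parent", "-A", ".jpg,.jpeg",
              "http://example.gov/jefferson/"]
print(" ".join(accept_cmd))
```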
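For the closed-directory case in my last two notes, where the site only lets you view one image at a time, the workaround is the same URL-generation idea: if the image addresses are numbered predictably, generate the whole sequence yourself and hand the list to wget. The URL pattern here is a made-up placeholder:

```python
# Generate a predictable sequence of image URLs and save them to a file
# that wget can read with its -i option. The pattern is a placeholder.
with open("urls.txt", "w") as f:
    for n in range(1, 41):  # assume 40 scanned pages
        f.write("http://example.ca/scans/page-%03d.jpeg\n" % n)

# Then, on the command line:
#   wget -i urls.txt -w 2     (the -w pause keeps the requests polite)
```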