Automating Web Content Downloading

The Approach

When tackling automation challenges, I prefer to start with the easiest approach possible and then iterate on it until the solution works.

In the last post on web content discovery, we had wfuzz create a CSV file containing the directories and files that we want to download. With that in place, the easiest method to download these files is as follows:

1. Get an alert that new content was discovered.

2. Split the CSV file to obtain the field with the directory or file name.

3. Use wget with that directory or file name to recursively retrieve the data we need.

4. Add the downloaded data to a git repository and commit it.

Overall, you should be alerted whenever new files are found, and they should all be downloaded for you.
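The CSV-splitting step can be sketched with cut. The example line and the field number below are assumptions, not real wfuzz output; check your own CSV to see which column holds the discovered path.

```shell
# Example line standing in for one row of the wfuzz CSV (an assumption,
# not real wfuzz output) -- the discovered path is in the last field here.
line="000001,200,10,20,300,admin"

# Split on commas and take field 6 to get the directory or file name.
file=$(echo "$line" | cut -d',' -f6)
echo "$file"   # prints: admin
```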

Automate it

With our steps in mind, let's set out to actually get this working. The process is kicked off by our previously created web discovery script: we will simply modify that script to call a new script that does all the work with wget.
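The wget invocation that new script runs might look like the following sketch; domain is a placeholder for the target site, and $file is the path cut from the CSV:

```
# $file: directory or file name cut from the wfuzz CSV (placeholder)
wget --mirror --no-parent "https://domain/$file"
```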

Here $file is the field that we cut from our CSV file, and domain is the website we are trying to profile. This command downloads everything wget can find links to under the $file path into /path/to/domain/repo. That way, when everything is done, we can use version control to see the changes.

Note that there is no way to output to a specific directory when using wget's --mirror option. So, in our crontab entry later, we will have to cd to the right spot to keep our wget downloads structured how we want.
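An example crontab entry along those lines; the hourly schedule and both paths are placeholders to adapt to your own layout:

```
# run hourly: cd into the repo first so wget's mirror lands there
0 * * * * cd /path/to/domain/repo && /path/to/download.sh
```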

Step 3

In the /path/to/domain/repo we will want to add all the files in the directory to the git staging area to get a diff of the changes. This will look the same as our previous script:

git add .
git commit -m "download"

Wrap it up

Now we just need to take all of the steps we have here and put them into a script that runs whenever the content discovery script does.
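Put together, the script might look like the sketch below. Every path, the CSV field number, and the domain are assumptions you would adapt to your own setup.

```
#!/bin/sh
# download.sh -- sketch of the whole flow; all paths, the CSV layout,
# and the domain are placeholders, not the author's exact script.
csv="/path/to/discovery/results.csv"
repo="/path/to/domain/repo"

cd "$repo" || exit 1

# Skip the CSV header row, pull the path field (column 6 is an
# assumption -- check your wfuzz output), and mirror each path.
tail -n +2 "$csv" | cut -d',' -f6 | while read -r file; do
    wget --mirror --no-parent "https://domain/$file"
done

# Stage and commit everything so git shows what changed between runs.
git add .
git commit -m "download"
```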

Note that this script can download A LOT of content. wget's --mirror option enables infinite recursion depth, so it will follow links as deep as they go. The --no-parent option should help prevent oversized downloads, but it will not prevent all of them.

I recommend reading through this and the previous post to fully understand how these scripts interact with each other, and how your file structure should be set up to function properly.

Thoughts

Originally I thought this solution was hacky and was interested in using a different tool for the job. Specifically, I looked at HTTrack, but it would always overwrite the website folder when run with --update. After a couple of hours I gave up and used the simple wget command instead. There is definitely room for improvement here.