So now that we’re all Perl RegEx masters, we’re going to take the HTML from this forum. We will then search this HTML content using Perl Regular Expressions, and generate output similar to this.

The first challenge we will encounter will be using the HTML as input.

Since you’re still new to Perl, we’re not going to write a Perl script to download the HTML automatically. Instead, we’re going to download it to a file, and then parse that file using Perl’s open function, which we saw in last week’s tutorial.

Linux Command Line: Downloading A Web-Site

To download the HTML from a web site from the command line, we’re going to use Linux’s wget command. So, once again, let’s consult the man pages to see what we can learn about wget.

man wget

After reading the man pages on wget you’ll figure out that to download the HTML contents of a web page (www.web-site.ex) we type the following:

wget http://www.web-site.ex

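So to download the HTML contents of our forum, we type in the following. Note the single quotes around the URL: without them the shell treats the & as "run this in the background" and cuts the URL off at that point.

wget 'http://www.phpbb.com/community/viewtopic.php?f=46&t=1417375'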

And the contents get saved to a file named viewtopic.php?f=46&t=1417375. To rename this file to something more manageable, use the mv command; in Linux you don’t rename a file, you move it to a new file name. (weird, eh?) Quote the old name so the shell leaves the ? and & alone.

mv 'viewtopic.php?f=46&t=1417375' tutorial-2.html
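If you'd like to try the rename without touching the network first, here is a small sketch that does it in a throwaway directory. It assumes mktemp is available on your system, and the file contents are just a placeholder, not the real forum HTML.

```shell
# Work in a scratch directory so nothing real gets overwritten.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Stand-in for the file wget saved (placeholder content only).
printf '<html><body>placeholder</body></html>\n' > 'viewtopic.php?f=46&t=1417375'

# Single quotes stop the shell from treating ? as a wildcard
# and & as "run the command in the background".
mv 'viewtopic.php?f=46&t=1417375' tutorial-2.html

# List the directory to confirm only the new name remains.
ls
```

wget can also write straight to a name of your choosing with its -O option, for example wget -O tutorial-2.html followed by the quoted URL, which skips the mv step entirely.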

Reading the HTML File

So last week we saw how to perform basic file IO in Perl, right? So we’ll write the basics of what we know; how to read all lines out of the file.