He's making a list and checking it twice,
Gonna find out who's naughty and nice,
Santa Claus is coming to town.

Did this song creep you out as a kid? It did me! Now that you're all grown up (well, some of us anyway!) you may think that it doesn't apply to you anymore. Do you really want to take that chance though? I didn't think so! This article helps you evaluate
your level of naughtiness/niceness based on your blog (or other personal web site).

Read the article for some background information, then explore the code yourself. This application uses the
BackgroundWorker and WebClient classes, and works with regular expressions through the
Regex object. In order to open the project, you will need Visual Studio Express Editions (either C# or Visual Basic) or higher. The Express editions are a free download from Microsoft.

Background

An ancient Chinese proverb states that "The entries in your blog reveal your inner self." Even though I just made that up, it's no less true. Are you grumpy, happy, hopeful, crazy? Chances are it's obvious from your blog posts. This application will
take a URL, scan it for a list of "naughty" and "nice" words, and come up with a score of niceness. But beware: pages linked from the given URL will also be scanned and taken into account.


Downloading the URL

So the first step is to download the given URL. This can be accomplished by opening a socket, issuing an HTTP GET request, and reading back the resulting stream of bytes one at a time. On the other hand, it can also be performed in two lines of code using
the System.Net.WebClient object. For me, the choice was simple. The
WebClient has a number of methods to download a file synchronously or asynchronously, either as a string, a byte array, or directly to a file. Once the object is instantiated, a single line of code does the actual work. It's so easy, you'll
be adding URL download functionality into every application before you know it!

NOTE: You must specify a complete URL, that is, one beginning with http://. If you don't, you won't get an error; the download just won't appear to work right.

Visual Basic

Dim wc As WebClient = New WebClient
body = wc.DownloadString(url)

Visual C#

WebClient wc = new WebClient();
body = wc.DownloadString(url);

Searching for Patterns

Once you have the contents of the URL in a string, you can do anything with it that you could with any other string. Perform
IndexOf searches, save it to a database, or apply regular expressions. The last is the use of interest for this application. Regular expression support is found in the
System.Text.RegularExpressions namespace with the
Regex object. Using the Regex object is pretty easy, but coming up with the expression itself can be a challenge.

If you haven't used regular expressions before, you may want to take a few minutes to read about them. A good place to start is the MSDN reference page,
.NET Framework Regular Expressions. Unfortunately, regular expressions aren't very intuitive at first (or ever, for some people!). Creating your own expression can be difficult, but you can often find pre-built ones online. Note that regular expressions
are used for string verification, formatting, searching, and replacing. This application uses three different expressions. One is used to search for "nice" keywords, another for "naughty" keywords. The third expression locates
hyperlinks based on the href attribute of the a element found in HTML.

Creating a Regex object incurs some overhead, not only from object creation, but also from parsing the expression. To minimize this, all three
Regex objects are created when the application starts up.
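To make that concrete, here's one way the three objects might be set up in C#. This is a sketch: the keyword lists here are placeholders I made up for illustration, not the lists shipped with the sample.

```csharp
// All three expressions are parsed once, up front, so the parsing cost is
// paid only at startup. The keyword lists are placeholders; the real
// application defines its own.
static readonly Regex niceRegex =
    new Regex("happy|hope|love|peace", RegexOptions.IgnoreCase);
static readonly Regex naughtyRegex =
    new Regex("grumpy|hate|angry|mean", RegexOptions.IgnoreCase);
static readonly Regex linkRegex =
    new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))", RegexOptions.IgnoreCase);
```

Each of these can then be reused for every page scanned, with no further parsing overhead.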

The basic flow is: search for nice words, search for naughty words, search for hyperlinks, then repeat the nice/naughty search for each linked page. Linked pages are not scanned for additional links, to avoid overload. As it is, some sites already take
a full minute to process! The process of downloading a page and then performing the searches is contained in the
DetermineScore method.

To actually perform the search, invoke the Matches method. This returns a
MatchCollection object that you can iterate over. There's no need to enumerate the actual words found, so the
Count property is sufficient:
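In C#, that call might look like this (a sketch with an illustrative keyword list and a hard-coded page body; the sample's own code pulls the body from WebClient as shown earlier):

```csharp
// Count the "nice" words on the downloaded page. The keyword list is a
// placeholder; the real application uses its own lists.
Regex niceRegex = new Regex("happy|joy|nice|hope", RegexOptions.IgnoreCase);
string body = "<p>We hope you have a happy, happy holiday!</p>";

int niceScore = niceRegex.Matches(body).Count;   // 3 matches: hope, happy, happy
```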

Another single line of code performing lots of work! If you needed the information, you could then use the returned
MatchCollection to determine exactly which words were matched, along with much other information about each match. The expression for nice and naughty words is simple: the vertical pipe (bar) acts like a Boolean OR operator. In other
words, the regular expression parser scans the entire input string (the downloaded HTML) and adds a
Match object each time one of the words is found.

Looking a Level Deeper

After analyzing the given URL, it's time to take a look at linked pages. After all, the types of sites you link to also say something about how naughty or nice you probably are! That third regular expression is much more complex than the first two. Showing
it again, we have:

href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))

This is complicated a bit by the way that strings must be escaped in Visual C# and Visual Basic. In both languages, you can't just have quotes within a quoted string (you'll notice that this expression is slightly different in the VB and C# code samples above).
In C#, you precede the double quote with a backslash (\"). If you actually need a backslash, you need to double it as well (\\). In Visual Basic, you double the double quote (""). What does this mess mean? Suffice it to say that it looks for the
href attribute, plus the quotes and brackets after it. Unfortunately, the resulting match is always more than we actually want. For example, a link to my site would be returned as:

href="http://ariankulp.com/rss.aspx"

It also returns relative links, CSS/RSS links, and others that aren't all that interesting. It's easy to filter out the relative links by checking whether each match begins with
href="http. The CSS/RSS links, if they use an absolute host, are more complicated, so I don't filter them out. For each valid match found, I call the
DetermineScore method to count up naughty and nice words. Note that the scores on linked pages are cut in half. More weight is given to your own site!
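Putting those pieces together, the link-scanning loop might be sketched like this in C#. The variable names and the sample HTML are illustrative, and the call into DetermineScore is shown as a comment since that method was described above:

```csharp
// The href expression, escaped C#-style, applied to a bit of sample HTML.
Regex linkRegex = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))");
string body = "<a href=\"http://example.com/\">a site</a>" +
              "<a href=\"/about.aspx\">a relative link</a>";

List<string> urls = new List<string>();
foreach (Match m in linkRegex.Matches(body))
{
    // m.Value includes the attribute name and quotes, e.g. href="http://example.com/"
    if (!m.Value.StartsWith("href=\"http"))
        continue;                        // skip relative links
    string url = m.Groups[1].Value;      // capture group 1 holds just the URL
    urls.Add(url);
    // DetermineScore(url) would run here, with its result cut in half.
}
```

After the loop, urls holds only the absolute link (http://example.com/); the relative one was filtered out.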

Putting it All Together

Performing all of the URL downloads and pattern matching takes some time. A site with many links can really slow things down. When you click the button to start things off, any work that you perform occurs on the user interface thread, and all of a sudden
the application is unresponsive. The solution is to run the analysis on a separate, background thread. The
BackgroundWorker object makes this easy. When you invoke the
RunWorkerAsync method, the DoWork event fires on a different thread. The event handler then makes the necessary calls to download and analyze the given URL and the discovered URLs.

As each link is analyzed, progress is reported by raising the ProgressChanged event. This is responsible for adding the links, along with their naughty/nice counts, to the
ListView control.

Clicking Cancel calls the CancelAsync method. The worker thread periodically checks the
CancellationPending flag so it can exit early if necessary. While the initial link is being downloaded, a
ProgressBar control goes into marquee mode (think Knight Rider!). It changes to a standard progress bar as discovered links are analyzed. Finally, the
RunWorkerCompleted event fires when everything is done (that is, when the
DoWork event handler completes). This updates the progress percentage, enables the
Cancel button, and hides the progress bar.
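The whole arrangement can be sketched in C# like this. The handler bodies are placeholders (a fixed five-step loop with console output standing in for the real download-and-score work and the ListView updates), but the wiring of the events and the cancellation check is the pattern the application uses:

```csharp
// Skeleton of the BackgroundWorker wiring. Handler bodies are placeholders;
// the real handlers download and score pages as described above.
BackgroundWorker worker = new BackgroundWorker();
worker.WorkerReportsProgress = true;
worker.WorkerSupportsCancellation = true;

worker.DoWork += (s, e) =>
{
    // Runs on a background thread, keeping the UI responsive.
    for (int i = 1; i <= 5; i++)
    {
        if (worker.CancellationPending) { e.Cancel = true; return; }
        worker.ReportProgress(i * 20);   // one report per analyzed link
    }
};
worker.ProgressChanged += (s, e) =>
{
    // Raised back on the UI thread in a WinForms app; safe to touch controls.
    Console.WriteLine("Progress: {0}%", e.ProgressPercentage);
};
worker.RunWorkerCompleted += (s, e) =>
{
    Console.WriteLine(e.Cancelled ? "Cancelled" : "Done");
};

worker.RunWorkerAsync();                 // fires DoWork asynchronously
```

Note that DoWork never touches the UI directly; it only reports progress, and the ProgressChanged handler does the control updates.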

Next Steps

The application isn't all that useful, but it's fun! It could easily be extended to perform different actions on discovered pages, or simply to search for different keywords. Some fairly easy enhancements would be:

Add a checkbox to prevent linked pages from being counted. This would speed things up.

Add an options page to edit naughty/nice keywords and tweak other settings.

Restrict the number of links to follow. If the given URL has links to 200 other pages, it may never finish!

Better filtering of links. As I mentioned, resources like RSS or CSS don't need to be downloaded. Links on the same site may not be needed either. Checking for the same base URL would be a pretty easy addition.

Create multiple worker threads to share the load of analyzing linked pages. This is a good scenario for parallelizing.

Conclusion

I hope that you had fun with this application. It was fun coming up with a naughty-or-nice formula! I struggled a bit with the best balance. It's not perfect, but it works pretty well. If you have a better approach, by all means tweak it. Best of all,
have fun!

Arian Kulp is an independent software developer and writer working in the Midwest. He has been coding since the fifth grade on various platforms, and also enjoys photography, nature, and spending time with his family. Arian can be reached through his web
site at http://www.ariankulp.com.