Introduction

Googlebot finds and indexes the web through hyperlinks: it reaches new pages by following links on pages it already knows. My searchbot (Xearchbot) finds sites and stores their URL, title, keywords metatag, and description metatag in a database. In the future, it will also store the body of the document converted to plain text. I don't calculate PageRank, because it is very time-consuming.

Before Googlebot downloads a page, it downloads the file robots.txt. This file tells the bot where it may go and where it must not go. This is an example of the content of this file:
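User-agent: *
Disallow: /admin/
Disallow: /temp/

The User-agent line says which bots the rules apply to ("*" means all of them), and each Disallow line lists a path the bot must not visit. The paths above are only illustrative; every site defines its own rules.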

There is a second way to block Googlebot: the Robots metatag. Its name attribute is "robots" and its content attribute holds values separated by a comma. The values are "index" or "noindex" (the document may be indexed or not) and "follow" or "nofollow" (follow hyperlinks or not). For indexing the document and following hyperlinks, the metatag looks like this:

<meta name="Robots" content="index, follow" />

Blocking a single link from being followed is supported too - just add rel="nofollow" to that link. Malware robots and antivirus robots ignore robots.txt, metatags, and rel="nofollow". Our bot will be a well-behaved searchbot and must respect all of these blockers.

There is an HTTP header named User-Agent. In this header, the client application (e.g., Internet Explorer or Googlebot) identifies itself. For example, the User-Agent for IE6 typically looks like this (the exact string varies with the Windows version and installed extensions):
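User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)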

Yes, the name of Internet Explorer for HTTP is Mozilla... This is the Googlebot 2.1 User-Agent header:

User-Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)

The address in brackets, after the plus character, points to information about the bot. We put similar data in this header for our bot.
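For example, assuming a hypothetical info-page URL for Xearchbot, the header can be set on the request like this (a minimal sketch, not the final code):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// Identify our bot; the address after the plus sign is a placeholder for a page describing it.
request.UserAgent = "Xearchbot/1.0 (+http://www.example.com/bot.html)";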

In order to speed up the searchbot, we can support GZIP encoding. We add the Accept-Encoding header with the value "gzip". Some websites support GZIP encoding, and if we accept gzip, they send us the compressed document. If the content is compressed, the server adds a Content-Encoding header to the response with the value "gzip". We can decompress the document using System.IO.Compression.GZipStream.
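A minimal sketch of this, assuming a url variable as in the rest of the article:

// Ask for gzip and decompress the response only if the server actually used it.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Headers.Add("Accept-Encoding", "gzip");
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    Stream stream = response.GetResponseStream();
    if (response.ContentEncoding.ToLower().Contains("gzip"))
        stream = new GZipStream(stream, CompressionMode.Decompress);
    using (StreamReader reader = new StreamReader(stream))
    {
        string html = reader.ReadToEnd();
    }
}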

For parsing robots.txt, I use string functions (IndexOf, Substring...), and for parsing HTML I use regular expressions. In this article, we will use HttpWebRequest and HttpWebResponse for downloading files. At first, I thought about using WebClient, because it's easier, but with that class we can't set a download timeout.

For this article, SQL Server (the Express edition is fine) and basic knowledge of SQL and ADO.NET (DataSets, etc.) are required.

With the fundamentals covered, let's write a searchbot!

Database

First we have to create a new Windows Forms project. Now add a Local Database named SearchbotData and create its DataSet named SearchbotDataSet. Add a table named Results to the database:

Column Name          Data Type   Length   Allow Nulls   Primary Key   Identity
id_result            int         4        No            Yes           Yes
url_result           nvarchar    500      No            No            No
title_result         nvarchar    100      Yes           No            No
keywords_result      nvarchar    500      Yes           No            No
description_result   nvarchar    1000     Yes           No            No

In this table, we will store results. Add this table to SearchbotDataSet.

Preparation

First, we must add the following using statements:

using System.Net;
using System.Collections.ObjectModel;
using System.IO;
using System.IO.Compression;
using System.Text.RegularExpressions;

The code for indexing pages will run in a loop. At the start, the waiting queue must contain at least one page with hyperlinks. When a page is parsed, it is removed from the waiting queue, but the hyperlinks found on it are added - and that keeps the loop going. The Scan function can be run on another thread, e.g., with a BackgroundWorker.
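A rough sketch of the loop (the waiting list and the IndexPage helper are illustrative names, not the article's exact code):

// "waiting" holds the URLs still to be indexed; IndexPage is a placeholder for the code
// that downloads, parses and stores one page and returns the hyperlinks it found.
private List<string> waiting = new List<string>();

private void Scan()
{
    while (waiting.Count > 0)
    {
        string url = waiting[0];
        waiting.RemoveAt(0);                      // the parsed page leaves the queue

        foreach (string link in IndexPage(url))   // ...and the links found on it enter it
        {
            if (!waiting.Contains(link))
                waiting.Add(link);
        }
    }
}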

Parsing robots.txt

Before we start indexing any website, we must check its robots.txt file. Let's write a class for parsing this file.

The url parameter is an address without "/robots.txt". The constructor will download the file and parse it. So download the file: first create the request. Everything must be inside a try statement, because HTTP errors are thrown as exceptions.
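A sketch of the download step under those assumptions (variable names and the bot's info URL are illustrative):

try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url + "/robots.txt");
    request.UserAgent = "Xearchbot/1.0 (+http://www.example.com/bot.html)";
    request.Timeout = 10000;
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        string content = reader.ReadToEnd();
        // ... parse the User-agent / Disallow lines with IndexOf and Substring here ...
    }
}
catch (WebException)
{
    // no robots.txt (e.g. 404) or the site is unreachable - treat everything as allowed
}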

The bu variable is the baseUrl for RFX. Now every port is specified in the path, including the default port (80).

The address may already be indexed, and then we should not index it a second time. So let's add a new query to ResultsTableAdapter (it should have been created when you added the Results table to the dataset). The query will be of the SELECT type which returns a single value. Its code:

SELECT COUNT(url_result) FROM Results WHERE url_result = @url

Name it CountOfUrls. It returns the number of rows with the specified URL. Using it, we can check whether a URL is already in the database.

To the main form, add a resultsTableAdapter. If you want to display results in a DataGrid and refresh it while scanning, then use two table adapters - the first for displaying, the second for the Scan function. So we have to check if the URL is indexed:

if (resultsTableAdapter2.CountOfUrls(url) == 0)
{
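    // the URL is not in the database yet - download and parse the page below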
}

Now we must check robots.txt. We declare a Dictionary for the parsed files. If we already have the parsed robots.txt of the site, we take it from the Dictionary; otherwise we must parse robots.txt and add it to the Dictionary.
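Something like this (RobotsParser stands for whatever you named the parsing class; baseUrl is the site address without "/robots.txt"):

// declared once on the form:
Dictionary<string, RobotsParser> robotsFiles = new Dictionary<string, RobotsParser>();

// inside the Scan loop:
RobotsParser robots;
if (!robotsFiles.TryGetValue(baseUrl, out robots))
{
    robots = new RobotsParser(baseUrl);   // download and parse only once per site
    robotsFiles.Add(baseUrl, robots);
}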

Information about the document type is in the Content-Type header. The beginning of this header's value is the MIME type of the document. Typically, after the MIME type there is other information, separated by a semicolon. So parse the Content-Type header:
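For example (a short sketch using the response object):

// "text/html; charset=utf-8" -> "text/html"
string mime = response.ContentType.Split(';')[0].Trim().ToLower();
if (mime == "text/html")
{
    // only HTML documents are parsed further
}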

Parsing Metatags

It's time to parse the document. For parsing HTML elements, I chose the way of regular expressions. In .NET, we can use them via the System.Text.RegularExpressions.Regex class. Regular expressions are powerful tools for matching strings. I will not explain the syntax here. For finding and parsing metatags, I designed the following regex:

<meta(?:\s+([a-zA-Z_\-]+)\s*\=\s*([a-zA-Z_\-]+|\"[^\"]*\"))*\s*\/?>

Regular expressions can capture parts of the matched strings. This regex captures the names and values of the attributes.
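In .NET, a repeated group keeps every repetition in its Captures collection, so the name/value pairs can be read like this (a sketch; html is the downloaded document):

string metaPattern = @"<meta(?:\s+([a-zA-Z_\-]+)\s*\=\s*([a-zA-Z_\-]+|\""[^\""]*\""))*\s*\/?>";
foreach (Match match in Regex.Matches(html, metaPattern, RegexOptions.IgnoreCase))
{
    // group 1 collects the attribute names, group 2 the attribute values
    for (int i = 0; i < match.Groups[1].Captures.Count; i++)
    {
        string name = match.Groups[1].Captures[i].Value.ToLower();
        string value = match.Groups[2].Captures[i].Value.Trim('"');
        // e.g. name == "name" / value == "keywords", then name == "content" / value == the keyword list
    }
}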

We must also check the rel attribute of each link. If its value is nofollow, we must not add it to the waiting queue. If href is empty, we don't add the URL to the waiting queue either. The href can be a relative path, so we have to join it with abs.
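A sketch of that last step (abs, href and rel are assumed to hold the page's absolute URL and the link's attributes):

if (!string.IsNullOrEmpty(href) && rel.ToLower() != "nofollow")
{
    // new Uri(base, relative) joins a relative path with the page address
    // and leaves href unchanged if it is already absolute
    string link = new Uri(new Uri(abs), href).AbsoluteUri;
    // add "link" to the waiting queue here
}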

Conclusion

On the web, there are billions of pages, and their number keeps growing. In order to block robots, we can use the robots.txt file, the robots metatag, and the rel="nofollow" attribute. Malware robots will ignore these blockers. In order to speed up downloading, we can use GZIP encoding. Regular expressions are powerful tools for parsing strings.