Pre-emptive Multithreading Web Spider

The Win32 API supports applications that are pre-emptively multithreaded.
This is a very useful and powerful feature of Win32 for writing MFC Internet
spiders. The SPIDER project is an example of how to use pre-emptive
multithreading to gather information on the Web with a spider/robot using the
MFC WinInet classes.

This project produces a spidering program that checks Web sites for broken
URL links. Link verification is done only on href links. It displays a
continuously updated list of URLs in a CListView that reports the status of
each href link. The project could be used as a template for gathering and
indexing information to be stored in a database file for queries.

Search engines gather information on the Web using programs called Robots.
Robots (also called Web Crawlers, Spiders, Worms, Web Wanderers, and Scooters)
automatically gather and index information from around the Web, and then put
that information into databases. (Note that a Robot will index a page, and then
follow the links on that page as a source for new URLs to index.) Users can
then construct queries to search these databases to find the information they
want.

By using pre-emptive multithreading, you can index a Web page of URL links and
start a new thread to follow each new URL link to a new source of URLs to
index.

The project uses an MDI CDocument with a custom MDI child frame that displays
a CEditView when downloading Web pages and a CListView when checking URL
links. The project also uses the CObArray, CInternetSession, CHttpConnection,
CHttpFile, and CWinThread MFC classes. The CWinThread class is used to produce
multiple threads instead of the asynchronous mode in CInternetSession, which
is really a holdover from the 16-bit Windows Winsock platform.

The SPIDER project uses simple worker threads to check URL links or download a
Web page. The CSpiderThread class is derived from CWinThread, so each
CSpiderThread object can use the CWinThread MESSAGE_MAP() functions.
By declaring a DECLARE_MESSAGE_MAP() in the CSpiderThread class, the
user interface remains responsive to user input. This means you can check the
URL links on one Web server and at the same time download and open a Web page
from another Web server. The only time the user interface becomes
unresponsive to user input is when the thread count exceeds
MAXIMUM_WAIT_OBJECTS, which is defined as 64.
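
A minimal sketch of what such a class declaration might look like (everything
beyond the class name, base class, and message map is an assumption):

class CSpiderThread : public CWinThread
{
public:
    // The worker procedure and its parameter block are supplied
    // at construction (see below).
    CSpiderThread(AFX_THREADPROC pfnThreadProc, LPVOID pParam);
protected:
    // The message map keeps the thread responsive to messages.
    DECLARE_MESSAGE_MAP()
};

BEGIN_MESSAGE_MAP(CSpiderThread, CWinThread)
    // ON_THREAD_MESSAGE handlers would go here
END_MESSAGE_MAP()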

In the constructor for each new CSpiderThread object, we supply the ThreadProc
function and the thread parameters to be passed to the ThreadProc function.
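
A sketch of that constructor, which simply forwards both arguments to the
CWinThread constructor; the usage below it uses illustrative names
(ThreadFunc, pThreadParams) that are not from the SPIDER source:

CSpiderThread::CSpiderThread(AFX_THREADPROC pfnThreadProc, LPVOID pParam)
    : CWinThread(pfnThreadProc, pParam)
{
}

UINT ThreadFunc(LPVOID pParam);   // the worker's ThreadProc

// pThreadParams: an application-defined parameter block
CSpiderThread* pThread = new CSpiderThread(ThreadFunc, pThreadParams);
pThread->CreateThread();          // start the worker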

Each new thread creates a new CMyInternetSession object (derived from
CInternetSession) with EnableStatusCallback set to TRUE, so we can check
the status of all InternetSession callbacks. The dwContext ID for callbacks is
set to the thread ID.

BOOL CInetThread::InitServer()
{
    try
    {
        m_pSession = new CMyInternetSession(AgentName, m_nThreadID);
        int ntimeOut = 30;  // Very important: can cause a server
                            // time-out if set too low, or hang the
                            // thread if set too high.

        // The time-out value in milliseconds to use for Internet
        // connection requests. If a connection request takes longer
        // than this time-out, the request is canceled. The default
        // time-out is infinite.
        m_pSession->SetOption(INTERNET_OPTION_CONNECT_TIMEOUT,
                              1000 * ntimeOut);

        // The delay value in milliseconds to wait between
        // connection retries.
        m_pSession->SetOption(INTERNET_OPTION_CONNECT_BACKOFF, 1000);

        // The retry count to use for Internet connection requests.
        // If a connection attempt still fails after the specified
        // number of tries, the request is canceled. The default is
        // five.
        m_pSession->SetOption(INTERNET_OPTION_CONNECT_RETRIES, 1);

        m_pSession->EnableStatusCallback(TRUE);
    }
    catch (CInternetException* pEx)
    {
        // Catch errors from WinInet.
        //pEx->ReportError();
        m_pSession = NULL;
        pEx->Delete();
        return FALSE;
    }
    return TRUE;
}
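
Because dwContext is set to the thread ID, the OnStatusCallback override in
CMyInternetSession can attribute each status notification to the thread that
issued the request. A sketch of that override (the body is illustrative; note
that newer MFC versions declare the first parameter as DWORD_PTR):

class CMyInternetSession : public CInternetSession
{
public:
    CMyInternetSession(LPCTSTR pszAgent, DWORD dwContext)
        : CInternetSession(pszAgent, dwContext) { }
    virtual void OnStatusCallback(DWORD dwContext, DWORD dwInternetStatus,
        LPVOID lpvStatusInformation, DWORD dwStatusInformationLength);
};

void CMyInternetSession::OnStatusCallback(DWORD dwContext,
    DWORD dwInternetStatus, LPVOID /*lpvInfo*/, DWORD /*dwInfoLen*/)
{
    // dwContext carries the thread ID passed to the constructor, so
    // status changes (resolving, connecting, sending, receiving) can
    // be reported against the correct worker thread.
    TRACE(_T("Thread %lu: status %lu\n"), dwContext, dwInternetStatus);
}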

The key to using the MFC WinInet classes in a single-threaded or multithreaded
program is to surround all MFC WinInet class functions with try/catch blocks.
The Internet is unstable at times, and the Web page you are requesting may no
longer exist, either of which is guaranteed to throw a CInternetException.
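
For example, a single page request might be guarded like this (strURL and the
transfer flag are illustrative):

CHttpFile* pFile = NULL;
try
{
    // Any WinInet call can throw, for example when the host is
    // unreachable or the page no longer exists.
    pFile = (CHttpFile*)m_pSession->OpenURL(strURL,
        m_nThreadID, INTERNET_FLAG_TRANSFER_ASCII);
}
catch (CInternetException* pEx)
{
    pEx->Delete();    // mark the link as bad instead of crashing
    pFile = NULL;
}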

The maximum thread count is initially set to 64,
but you can configure it to any number between 1 and 100.
Setting it too high will result in failed connections,
which means you will have to recheck the URL links.
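
The 64 ceiling comes from the Win32 MAXIMUM_WAIT_OBJECTS limit: a single
WaitForMultipleObjects call cannot wait on more than 64 handles. A sketch of
the constraint (how SPIDER actually stores its thread pointers is an
assumption):

// Wait for a batch of worker threads to finish.
void WaitForSpiderThreads(CWinThread* apThreads[], int nCount)
{
    // MAXIMUM_WAIT_OBJECTS is defined as 64 in winnt.h.
    ASSERT(nCount <= MAXIMUM_WAIT_OBJECTS);

    HANDLE hThreads[MAXIMUM_WAIT_OBJECTS];
    for (int i = 0; i < nCount; i++)
        hThreads[i] = apThreads[i]->m_hThread;

    ::WaitForMultipleObjects(nCount, hThreads,
        TRUE,        // wait for all threads
        INFINITE);
}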

A rapid-fire succession of HTTP requests to a /cgi-bin/ directory could bring
a server to its knees. The SPIDER program sends out about 4 HTTP requests a
second, or 4 * 60 = 240 a minute, which can also bring a server to its knees.
Be careful about which server you are checking. Each server keeps a log
recording the IP address of the agent that requested each Web file, so you
might get some nasty email from an angry Web server administrator.
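
One simple way to keep the request rate polite is to pause between requests
(the delay value here is illustrative, not taken from the SPIDER source):

// Roughly 4 requests per second: sleep about 250 ms before each
// HTTP request issued by a worker thread.
const DWORD REQUEST_DELAY_MS = 250;
::Sleep(REQUEST_DELAY_MS);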

You can prevent a directory from being indexed by disallowing it in the
site's robots.txt file. This mechanism is usually used to protect /cgi-bin/
directories, because CGI scripts take more server resources to retrieve.
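
For example, a robots.txt entry that blocks a /cgi-bin/ directory looks like
this:

User-agent: *
Disallow: /cgi-bin/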

When the SPIDER program checks URL links, its goal is to avoid requesting too
many documents too quickly. The SPIDER program adheres somewhat to the
Standard for Robot Exclusion. This standard is a joint agreement between robot
developers that allows WWW sites to limit which URLs a robot requests. By
using the standard to limit access, the robot will not retrieve any documents
that Web servers wish to disallow.

Before checking the root URL, the program checks to see if there is a
robots.txt file in the main directory. If the SPIDER program finds a
robots.txt file, the program aborts the search. The program also checks the
META tags in each Web page. If it finds a META NAME="ROBOTS"
CONTENT="NOINDEX,NOFOLLOW" tag, it will not index the URLs on that page.
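
A sketch of both checks (the function names and the simple string search are
assumptions, not the SPIDER source):

// Returns TRUE if the site has a robots.txt file at its root.
BOOL SiteHasRobotsTxt(CInternetSession* pSession,
                      LPCTSTR pszServer, DWORD dwContext)
{
    BOOL bFound = FALSE;
    try
    {
        CString strURL;
        strURL.Format(_T("http://%s/robots.txt"), pszServer);
        CHttpFile* pFile = (CHttpFile*)pSession->OpenURL(strURL, dwContext);
        if (pFile != NULL)
        {
            DWORD dwStatus = 0;
            pFile->QueryInfoStatusCode(dwStatus);
            bFound = (dwStatus == HTTP_STATUS_OK);  // 200 means it exists
            pFile->Close();
            delete pFile;
        }
    }
    catch (CInternetException* pEx)
    {
        pEx->Delete();    // request failed: assume no robots.txt
    }
    return bFound;
}

// Returns FALSE if the page opts out of indexing via its META tag.
BOOL PageAllowsIndexing(const CString& strPage)
{
    CString str(strPage);
    str.MakeUpper();
    return str.Find(_T("\"NOINDEX,NOFOLLOW\"")) == -1;
}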

Problems:
The thread count cannot always be kept below 64.
The CListView is limited to 32,767 URL links.
Not all URLs are parsed correctly; the program will occasionally crash when
using CString functions on complex URLs.