Efficient Parallel Crawling of Web Content

Abstract: This project aims to collect the pages of the Web and keep pace with the rapid growth of Web content by implementing an
efficient Web crawler. A crawler is a program that retrieves and stores pages
from the Web, commonly for a Web search engine or a Web cache. A crawler often
has to download hundreds of millions of pages in a short period of time and has
to continuously monitor and refresh the downloaded pages in order to provide a
fresh view of the
Web. Because the Web is gigantic and continuously being updated, a
single-process crawler simply cannot achieve the required download
rate. Thus, many existing search engines already use multiple parallel
processes to crawl the Web. Because of their
cost-effectiveness, PC clusters are in widespread use today, and they form
a practical platform for parallel crawling. There has
been little scientific research conducted on parallelization of the crawling and
indexing process. This project involves the development of
new and efficient parallel algorithms for the Web crawling and indexing problem.
These algorithms will be implemented in order to gather the fast-growing Web content efficiently
and accurately and to provide the required refresh
frequency. In addition to providing scientific information about the structure
of the World Wide Web, this project will enable the gathering and indexing of
Turkish Web pages. Such a collection is very valuable, as it can
enable efficient and accurate searching within Turkish
Web pages. Furthermore, analysis and data mining of the gathered data can reveal
important sociological and statistical information about
Turkey and the Turkish nation.
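
As a rough illustration of the kind of parallel fetch loop such a crawler builds on, the sketch below uses a shared URL frontier and a pool of fetcher threads. It is a minimal sketch in Python using only the standard library; the worker count, page limit, seed URL, and the LinkExtractor helper are illustrative assumptions and do not represent the parallel algorithms to be developed in this project.

    # Minimal sketch of a parallel crawler: one shared URL frontier, several
    # fetcher threads, and a "seen" set guarded by a lock. Values such as
    # num_workers and max_pages are illustrative assumptions.
    import threading
    import queue
    from urllib.request import urlopen, Request
    from urllib.parse import urljoin, urlparse
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, num_workers=4, max_pages=100):
        frontier = queue.Queue()          # shared URL frontier
        seen = set(seeds)                 # URLs already scheduled
        seen_lock = threading.Lock()
        pages = {}                        # url -> raw HTML (the "store" step)

        for url in seeds:
            frontier.put(url)

        def worker():
            while len(pages) < max_pages:
                try:
                    url = frontier.get(timeout=2)
                except queue.Empty:
                    return                # frontier drained, stop this worker
                try:
                    req = Request(url, headers={"User-Agent": "toy-crawler"})
                    html = urlopen(req, timeout=5).read().decode("utf-8", "replace")
                    pages[url] = html
                    parser = LinkExtractor()
                    parser.feed(html)
                    for link in parser.links:
                        absolute = urljoin(url, link)
                        if urlparse(absolute).scheme in ("http", "https"):
                            with seen_lock:
                                if absolute not in seen:
                                    seen.add(absolute)
                                    frontier.put(absolute)
                except Exception:
                    pass                  # skip unreachable or malformed pages
                finally:
                    frontier.task_done()

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return pages

    if __name__ == "__main__":
        # Example seed; any reachable page would do.
        result = crawl(["https://example.com/"], num_workers=4, max_pages=20)
        print(f"fetched {len(result)} pages")

A real cluster-scale crawler differs from this sketch in the ways the project targets: the frontier would be partitioned across nodes (for example by host) to avoid duplicate downloads, politeness rules and robots.txt would be respected, pages would be persisted and indexed rather than held in memory, and downloaded pages would be revisited to maintain the required refresh frequency.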