Practical Distributed Web Crawler

Sunday, June 8, 2008

The main problem this project faces is the very high amount of resources required for successful web crawling. Most web crawlers in use today rely on server farms to meet their needs, which puts the area out of reach for ordinary developers. My goal is to reduce the resources needed for web crawling by using a distributed system.

The distributed system will handle both the web crawling and the data processing, with a single database server to store the data. The project will also provide a search facility based on page details and image tags, to give a better image search.
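As a rough sketch of the image-search side, a crawler node could pull the `src` and `alt` text out of image tags and ship those pairs to the central database for indexing. This is only an illustration using Python's standard `html.parser`; the class and field names are hypothetical, not the project's actual code:

```python
from html.parser import HTMLParser

class ImageTagExtractor(HTMLParser):
    """Collects (src, alt) pairs from <img> tags so the central
    database can index images by their descriptive text."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.images.append((a.get("src", ""), a.get("alt", "")))

page = '<html><body><img src="news.jpg" alt="breaking news photo"></body></html>'
parser = ImageTagExtractor()
parser.feed(page)
print(parser.images)  # [('news.jpg', 'breaking news photo')]
```

A real node would of course fetch the page over HTTP first and then post the extracted pairs back to the database server.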

Well, that's a really good idea, but I'm not sure how many people would be willing to donate resources for a simple search. Different browsers might be a problem too, because each browser interprets pages in its own way.

But there will be a system to count the amount of data each participant donates through the crawler application.

A number of points will be added to the participant's account accordingly, which they can use to track a site's changes. This will be very useful for web admins and forum lovers. It is still in a test state; if I don't manage to finish it for the FYP on time, I will do it someday. I think it will be a new idea for a web crawler.

But since you are interested, there will be a function to keep track of web pages.

This can be used on web sites that do not have RSS feeds, and also to customise the information you want to read. With this option you will not have to visit the web site again and again.
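The simplest way to track a page without an RSS feed is to fingerprint its content on each scan and flag a change when the fingerprint differs. A minimal sketch, assuming the crawler already has the page HTML in hand (the function name here is illustrative):

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Hash the page body; a changed hash means the page changed
    since the last scan, without storing the whole page."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

old = page_fingerprint("<html>old news</html>")
new = page_fingerprint("<html>breaking news</html>")
print(old != new)  # True: the page changed between scans
```

Storing only the hash per monitored URL keeps the database server's load small, which matters when a single server backs the whole distributed system.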

For example, take Defence.lk. You can use this system (once it's done :P) to scan the site at a given frequency for keywords that you provide. How cool would it be to get an e-mail or SMS (only selected providers) whenever defence.lk announces the word BOMB on their site :)

So I think now you're wondering how to find the bandwidth needed for such a massive operation. It's simple: since this is a distributed system, you will have to GIVE before you can TAKE.

Points will be allocated for the bandwidth you donate to the system's crawling and web monitoring. You can then spend them to scan web sites for a given time and at a given frequency, filtering for keywords, with alerts sent via e-mail or SMS.
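The give-before-you-take rule amounts to a small ledger: donations credit points, monitoring jobs debit them, and a job is refused when the balance is too low. A sketch under assumed exchange rates (the class, rates, and method names are all made up for illustration):

```python
class PointsLedger:
    """Hypothetical points account: bandwidth donated to crawling
    earns points, which are spent on monitoring scans."""
    POINTS_PER_MB = 10   # assumed rate: 10 points per MB donated
    COST_PER_SCAN = 5    # assumed cost of one monitoring scan

    def __init__(self):
        self.balances = {}

    def donate(self, user, megabytes):
        # Credit the donor for bandwidth contributed to the crawl.
        self.balances[user] = self.balances.get(user, 0) + megabytes * self.POINTS_PER_MB

    def schedule_scan(self, user):
        # Debit a scan; refuse if the user hasn't given enough yet.
        if self.balances.get(user, 0) < self.COST_PER_SCAN:
            raise ValueError("not enough points: give before you take")
        self.balances[user] -= self.COST_PER_SCAN

ledger = PointsLedger()
ledger.donate("alice", 2)        # 2 MB donated -> 20 points
ledger.schedule_scan("alice")    # spends 5 points
print(ledger.balances["alice"])  # 15
```

The actual rates would need tuning so that donated crawl bandwidth roughly covers the cost of the monitoring the system hands back out.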

I'm a Final Year Undergraduate of APIIT Sri Lanka (B.Sc. in Computing). 221BoT is my final year project. This project is supervised by Mr. Ashan Fonseka and assessed by Dr. Damith Mudugamuwa, of APIIT Sri Lanka.