In 2004 I started a search engine project, but I ended up shutting it down because I did not have enough content to search through. It did crawl a page, but it would not index the links or revisit them at a later date. Recently I started working on a web crawler that goes through a page, copies the links, stores them in the database, and then works through them later, once it has had enough time. It also gathers other pertinent information such as the title, p tags, etc. I started this project a month or two ago but never really looked into streamlining it.
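As a rough illustration of the page-parsing step described above, here is a minimal sketch using only Python's standard library. It is not the crawler's actual code: the class name `PageExtractor`, the helper `extract`, and the choice of Python are all assumptions made for the example. It pulls out the title, the text of each p tag, and every link resolved to an absolute URL, which is the information that would then be written to the database.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects the title, paragraph text, and absolute link URLs of one page.

    Hypothetical helper for illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.paragraphs = []
        self.links = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")  # start a new paragraph buffer
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def extract(html, base_url):
    """Return (title, paragraphs, links) for one HTML document."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.paragraphs, parser.links
```

The links returned by `extract` are what would go into the crawl queue, while the title and paragraphs are the content to be indexed.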
At the time I had one request going out every 100 milliseconds. This was all right at first, but the project was growing far too slowly for that rate to remain realistic. There was also a problem with sending a batch of requests through Chrome: they would queue up and execute only one at a time.
Because of that restriction in Chrome, I went ahead and got a little JavaScript-y to force out multiple connections at the same time, which in turn taxes the server much more heavily. After applying the fixes to the front end, I ran into the problem of the same indexes in the URL crawl queue being hit by several requests at the same time. Looking back on the design, I should be handling this information completely through the browser, but I have also gone a little too far with the current version. The fix for the overlapping indexes was simply to make the queue position a little more random, plus or minus a few thousand. This way it is likely that every queued item will eventually be hit, and the process is not slowed down by searching through the record of items.
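The randomized queue idea could look something like the sketch below. Everything specific in it is an assumption for illustration: the SQLite table `queue(id, url, crawled)`, the function name `next_url`, and the `JITTER` window of 3000 standing in for "a few thousand". It also simplifies "plus or minus" to a forward-only offset from the oldest pending row, so two simultaneous requests are unlikely to land on the same index.

```python
import random
import sqlite3

JITTER = 3000  # hypothetical stand-in for "a few thousand"


def next_url(conn, jitter=JITTER, rng=random):
    """Claim one pending row near the head of the queue, at a random offset.

    Assumes a table queue(id INTEGER PRIMARY KEY, url TEXT, crawled INTEGER).
    Returns (id, url), or None when nothing is pending.
    """
    head = conn.execute("SELECT MIN(id) FROM queue WHERE crawled = 0").fetchone()[0]
    if head is None:
        return None  # the queue is empty

    # Jump a random distance past the oldest pending row.
    start = head + rng.randint(0, jitter)
    picked = conn.execute(
        "SELECT id, url FROM queue WHERE crawled = 0 AND id >= ? ORDER BY id LIMIT 1",
        (start,),
    ).fetchone()
    if picked is None:
        # The offset ran past the end of the queue, so take the oldest row.
        picked = conn.execute(
            "SELECT id, url FROM queue WHERE crawled = 0 ORDER BY id LIMIT 1"
        ).fetchone()

    # Mark the row as taken so it is not handed out again.
    conn.execute("UPDATE queue SET crawled = 1 WHERE id = ?", (picked[0],))
    conn.commit()
    return picked
```

Because the fallback always returns to the oldest pending row, every queued item is eventually picked, and each lookup is an indexed range scan on `id` rather than a search through the whole record of items.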
This project had just over 860,000 URLs in the queue last night. As of this morning I have more than 2,160,000 URLs, and I have stored about 4.6 GB of website information in the past 2 hours. This information needs to be properly split up and indexed to slow the growth in storage, since only 67,000+ websites have been parsed and stored so far. I am thinking about breaking the paragraphs down into words that are joined together by id.
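One way to picture "words that are joined together by id" is a small word table plus a table of occurrences that refers to each word by its id. The sketch below is only an assumption about how that could be laid out, not the project's schema: the table names `words` and `postings`, the functions `index_paragraph` and `pages_containing`, and the simple lowercase tokenizer are all made up for the example.

```python
import re
import sqlite3

# Hypothetical schema: each distinct word is stored once and referenced by id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS words (
    id INTEGER PRIMARY KEY,
    word TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS postings (
    word_id INTEGER NOT NULL REFERENCES words(id),
    page_id INTEGER NOT NULL,
    position INTEGER NOT NULL
);
"""


def index_paragraph(conn, page_id, text):
    """Split a paragraph into words and record each occurrence by word id."""
    for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        # Insert the word only if it has not been seen before.
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT INTO postings (word_id, page_id, position) VALUES (?, ?, ?)",
            (word_id, page_id, position),
        )
    conn.commit()


def pages_containing(conn, word):
    """Return the ids of pages on which the given word occurs."""
    rows = conn.execute(
        "SELECT DISTINCT p.page_id FROM postings p "
        "JOIN words w ON w.id = p.word_id "
        "WHERE w.word = ? ORDER BY p.page_id",
        (word.lower(),),
    )
    return [row[0] for row in rows]
```

After `conn.executescript(SCHEMA)`, each paragraph costs one small row per word occurrence instead of a full copy of the text, and a search becomes a join on the word id.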