Unanswered: Web Crawler database design

A small question about database design, concerning a table that will hold several million records
containing URL information.

Let's say I have a table with 1000+ root websites.
The crawler fetches links starting from each root website and builds a huge Url_links table.
From time to time I have to get the top 1000 URLs from this table that are uncrawled, grouped by website_ID and ordered by insert date.

Once this table starts to grow (4M records and more), the IO load gets very heavy and the process slows down dramatically.

Are there any tweaks to the table design that could improve this?
We already have indexes on the website ID and the date, but it is still very slow…
I was thinking of creating a buffer table to separate the uncrawled URLs from the crawled ones,
but maybe you have more creative thoughts.
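
To make the idea concrete, here is a rough sketch of one alternative to a separate buffer table: a partial index that covers only the uncrawled rows. Everything in it is an assumption for illustration. The thread does not name the database engine, so SQLite (through Python's sqlite3 module) stands in for it, and the column names url_id, insert_date and crawled are hypothetical. "Grouped by website_ID and ordered by insert date" is read here as ORDER BY website_id, insert_date, which may not be exactly what the real query does.

```python
import sqlite3

# Hypothetical schema: only the table name Url_links and the idea of a
# website ID and an insert date come from the post; the rest is assumed.
SCHEMA = """
CREATE TABLE url_links (
    url_id      INTEGER PRIMARY KEY,
    website_id  INTEGER NOT NULL,
    url         TEXT    NOT NULL,
    insert_date INTEGER NOT NULL,          -- e.g. a Unix timestamp
    crawled     INTEGER NOT NULL DEFAULT 0 -- 0 = uncrawled, 1 = crawled
);
-- Partial index: contains only uncrawled rows, already in fetch order.
CREATE INDEX idx_uncrawled
    ON url_links (website_id, insert_date)
    WHERE crawled = 0;
"""

FETCH_SQL = """
SELECT url_id, website_id, url, insert_date
FROM url_links
WHERE crawled = 0
ORDER BY website_id, insert_date
LIMIT ?
"""


def build_demo_db(n_sites=5, urls_per_site=40):
    """Create an in-memory database with a small amount of fake data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = []
    for site in range(n_sites):
        for i in range(urls_per_site):
            crawled = 0 if i % 4 == 0 else 1  # every 4th URL is uncrawled
            rows.append((site, f"http://site{site}.example/page{i}", i, crawled))
    conn.executemany(
        "INSERT INTO url_links (website_id, url, insert_date, crawled) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def next_uncrawled(conn, limit=1000):
    """Return up to `limit` uncrawled URLs, by website and insert date."""
    return conn.execute(FETCH_SQL, (limit,)).fetchall()


def query_plan(conn, limit=1000):
    """Return SQLite's plan text for the fetch query, for inspection."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + FETCH_SQL, (limit,))
    return " ".join(row[3] for row in plan)


if __name__ == "__main__":
    conn = build_demo_db()
    print(len(next_uncrawled(conn)), "uncrawled URLs fetched")
    print(query_plan(conn))
```

The point of the sketch is that the index stays small because crawled rows drop out of it as soon as their flag is set, which is roughly what the buffer table would achieve by hand. Printing query_plan shows whether the engine actually picks the index up. Other engines may offer something comparable (filtered or partial indexes), but the syntax and behaviour would need checking against the actual database in use.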

Why not show us the slow SQL, along with the tables involved and their indexes? That way we're much more likely to solve your problem.