Topic: large site, generator keeps dying

I have posted this basic topic before for another large site, and the only way I solved it was to cut out a large portion of the site from the sitemap. This meant that far fewer pages were submitted to Google.

I have assigned 3 GB of memory; the server has a total of 6 GB. The generator/index.php page says it is using about 495 MB. The crawl page, when it loads, says there are about 385k URLs indexed with another 895k URLs to go.

It has been running for several days and has died at least 20 times.

Is there a way to get this program to index very large sites? If not, I am considering making a manual sitemap for the majority of the URLs and excluding them from the automated process.

I have set the "Progress state storage type" to var_export. What is the memory trade-off between this setting and serialize? Which uses less memory? The var_export option is easier for humans to read.

Usually the serialize option consumes less memory, but in some PHP environments it has caused memory *leaks*, so we've added the "var_export" option as well.
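For what it's worth, here is a minimal PHP sketch of the two storage styles. This is not the generator's actual code, and the file names are illustrative:

Code:
<?php
$state = ['indexed' => 385000, 'pending' => ['/page1', '/page2']];

// serialize(): compact machine-oriented string, restored with unserialize()
file_put_contents('state.ser', serialize($state));
$restored = unserialize(file_get_contents('state.ser'));

// var_export(): valid PHP source a human can read, restored by including it
file_put_contents('state.inc.php', '<?php return ' . var_export($state, true) . ';');
$restored = include 'state.inc.php';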


I have set the "Save the script state" option to 3600 seconds, or 1 hour. What is the memory issue associated with this setting?

Ideally, saving the state should not cause leaks, but as mentioned above it sometimes happens.
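To make the behavior concrete, here is a hypothetical sketch of the periodic-save pattern the option implies; the real generator's internals differ, and every name here is illustrative:

Code:
<?php
$interval = 3600;                          // "Save the script state" value, in seconds
$lastSave = time();
$state    = ['done' => [], 'todo' => ['/a', '/b', '/c']];

while ($state['todo']) {
    $url = array_shift($state['todo']);
    $state['done'][] = $url;               // stand-in for the actual crawl step
    if (time() - $lastSave >= $interval) {
        file_put_contents('state.inc.php',
            '<?php return ' . var_export($state, true) . ';');
        gc_collect_cycles();               // reclaim any cycles the dump leaves behind
        $lastSave = time();
    }
}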


I have set the "Maximum memory usage" to 3000 MB, but it is only using 300 MB. The machine has 6 GB of RAM. Is there a way to give it more memory, or is this an httpd/PHP limitation controlled elsewhere?

Probably it just doesn't need more memory at this point.
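If you want to verify what PHP actually allows and uses, the standard built-ins report it; the effective ceiling is whatever php.ini (or your web server's PHP configuration) permits:

Code:
<?php
echo ini_get('memory_limit'), "\n";                            // e.g. "3000M"
echo round(memory_get_usage(true) / 1048576), " MB in use\n";
echo round(memory_get_peak_usage(true) / 1048576), " MB peak\n";

// Raising the limit at runtime works only if the environment permits it:
ini_set('memory_limit', '3000M');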

Also, make sure to check your site structure for further optimizations using the "Exclude URLs" and "Do not parse" options. The first excludes specific (non-important) URLs from the sitemap entirely, while the second includes pages in the sitemap but doesn't *crawl* them. In your case, adding "detail.php" to the "Do not parse" option might make sense.
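To show the difference between the two options, here is illustrative logic only, not the generator's code; the patterns are examples:

Code:
<?php
$exclude    = ['logout.php', 'print='];    // "Exclude URLs": drop entirely
$doNotParse = ['detail.php'];              // "Do not parse": list, but never fetch

function matches(array $patterns, string $url): bool {
    foreach ($patterns as $p) {
        if (strpos($url, $p) !== false) {
            return true;
        }
    }
    return false;
}

$sitemap = [];
$url = '/detail.php?id=42';

if (matches($exclude, $url)) {
    // skipped: neither crawled nor listed
} elseif (matches($doNotParse, $url)) {
    $sitemap[] = $url;                     // listed, but its HTML is never fetched
} else {
    $sitemap[] = $url;                     // listed, and its links would be queued too
}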

Here is the layout of this particular site. It is typical of a lot of large directory sites; I own and maintain several.

Top-tier pages: Geographical list (by state). Lists all entries for that state, 100 per page with a "Next" link. The "Next" levels go as deep as needed; in this case California has approximately 220,000 entries, or 2,200 "Next" links. Each entry on these pages links to an individual page.

Specialty pages: List by specialty and by state. Each page lists a certain specialty within a specific state, and these also have "Next" links. It is important to have these pages indexed. Each entry on these pages links to an individual page. The entries here are also included in the geographical pages, just ordered differently.

Individual pages: One entry without any further links.

Here is a possible suggestion:

1. Run a preliminary crawl, completely excluding the individual pages from being crawled or added to the sitemap, but including all the general and specialty pages;

2. THEN reset the configuration parameters to read the links that were spidered but not crawled in the first pass into a second, de-duplicated crawl list (see the sketch after this list);

3. ADD the individual pages from the list to the sitemap as one crawl.
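If the "spidered but not crawled" URLs from step 1 end up in a plain text file (one URL per line; both file names below are hypothetical), the de-duplication in step 2 is only a few lines of PHP:

Code:
<?php
$seen   = file('spidered_not_crawled.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$unique = array_values(array_unique($seen));    // de-duplicated second crawl list
file_put_contents('second_pass_list.txt', implode("\n", $unique) . "\n");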

You can use the "Maximum depth level" option in the sitemap generator configuration to limit the depth of crawling, so that individual pages are added without being crawled (you can also define the individual page URL pattern in the "Do not parse" option for the same purpose).
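To illustrate why the depth limit achieves that, here is a toy depth-limited crawl (again, not the generator's code): pages at the maximum depth are still added to the sitemap, but their links are never followed.

Code:
<?php
function linksOn(string $url): array {
    return [];                             // placeholder; a real crawler fetches and parses HTML
}

$maxDepth = 2;                             // "Maximum depth level"
$queue    = [['/', 0]];
$visited  = [];
$sitemap  = [];

while ($queue) {
    [$url, $depth] = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;
    $sitemap[] = $url;
    if ($depth >= $maxDepth) {
        continue;                          // listed, but not parsed further
    }
    foreach (linksOn($url) as $next) {
        $queue[] = [$next, $depth + 1];
    }
}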