
I have a site that gets a good amount of traffic, and its access log now grows by around 10GB per month. Some time ago I installed sitemap_gen.py by Google to create a sitemap for search engines (Google again). While the logs were small it was not a problem, but once they became large I started to get MemoryError when attempting to create a sitemap from the large access log file. Well… We live in interesting times and here we go again: Google! And guess what?! There is no solution on the web for it (at least after hours of googling I have not found one). Google’s own product has a major fault which they have not rectified since 05-12-2005, and Google still recommends sitemap_gen.py on their own site https://www.google.com/webmasters/tools/docs/en/sitemap-generator.html (this was correct at the time of writing on 14-May-2009 - Google has since removed the page. Perhaps they’re ashamed to endorse their own product).

I looked around for alternatives and there are some commercial sitemap generators available, but I am an open source supporter and want to have a truly free and open system. Unfortunately I could not find any free open source sitemap generators either.

So, as Gary Oldman said in Fifth Element: if you want something done properly - do it yourself. I ran sitemap_gen.py under strace and the problem immediately came to light: the statement file.readlines() causes the WHOLE access log to be read into memory. Oops. I can understand that Google developers have a virtually limitless amount of memory at their disposal (and that’s effectively what the word Google means, by the way), but we are just pathetic mortals who have only a few gigabytes of memory, and an access log may quite often exceed that. So I used my own brains and one minute of time to replace this statement with one that makes Python read the file line by line.
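To illustrate the difference, here is a minimal sketch of line-by-line reading. It is not the actual sitemap_gen.py code: the function name count_requests and the throwaway sample file are made up for the example. The point is only that iterating over the file object itself pulls in one line at a time, while readlines() builds a list of every line before the loop even starts.

```python
import os
import tempfile


def count_requests(path):
    """Count non-empty log lines while holding only one line in memory.

    Hypothetical helper for illustration; not part of sitemap_gen.py.
    """
    total = 0
    # Iterating over the file object reads it lazily, one line at a time,
    # unlike log.readlines(), which loads every line into a list first.
    with open(path, "r", errors="replace") as log:
        for line in log:
            if line.strip():
                total += 1
    return total


# Small self-check with a throwaway file standing in for a real access log.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("GET /index.html\n\nGET /about.html\n")
    sample_path = tmp.name

sample_count = count_requests(sample_path)
os.remove(sample_path)
print(sample_count)
```

With this pattern, memory use should depend on the longest line rather than on the size of the whole log, so a multi-gigabyte file is no longer a problem in itself.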

Obviously Google developers have not stress tested their own script. And despite a large number of people having this problem, nothing has been done about it for years. But hopefully the one minute I spent (and another few minutes writing this blog entry) will save a lot of headaches for everyone who runs their own site and wants to have sitemaps.