May 27, 2011

I'd like to quickly crawl every URL in my XML sitemap, because doing so triggers caching of each page and gives visitors a better experience. An XML sitemap is usually named sitemap.xml and lists the URLs that crawlers should visit. It looks something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.mensfashionforless.com/</loc>
<priority>1.000</priority>
</url>
...
<url>
<loc>http://www.mensfashionforless.com/black-jacket.html</loc>
<priority>0.5000</priority>
</url>
</urlset>
<loc> is the tag that holds a URL. These are the URLs I'd like to spider.

Here is the Unix shell script!
To achieve this I first extract all the URLs, then issue an HTTP request to each one in turn. Keep in mind I don't need to see the content at all; I just need the server to receive each request and do what it's supposed to do. A good use case is a server that caches webpages on demand: I use this crawler to make my server cache every webpage listed in sitemap.xml, so that later visitors to my website get each page more quickly.

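#!/bin/sh
# spider.sh: use awk to get URLs from an XML sitemap
# and use wget to spider every one of them
ff()
{
    # read -r keeps any backslashes in a URL intact
    while read -r line1; do
        # quote the URL so the shell does not word-split or glob it (e.g. on ? or *)
        wget --spider "$line1"
    done
}
# print the text between <loc> and </loc> on each line that has a <loc> tag
awk '{if(match($0,"<loc>")) {sub(/<\/loc>.*$/,"",$0); sub(/^.*<loc>/,"",$0); print $0}}' sitemap.xml | ff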

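The above script uses Bourne-style function syntax, so it should run in the Bourne shell, Korn shell, and bash. It is not valid C shell syntax, since csh has no functions, so under csh run it as 'sh spider.sh'. If it doesn't work for you, let me know!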
In the script above I use 'awk' to extract the URLs and 'wget' to request each one without downloading its contents (that is what the --spider option does). Save it as 'spider.sh', run 'chmod 700 spider.sh' to make it executable, then run './spider.sh' from the directory containing sitemap.xml to spider it!
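
One caveat: the awk one-liner picks up at most one URL per line, so it relies on every <loc> sitting on its own line as in the sample. Some sitemaps are written out as one long line. Here is a rough sketch of an extractor for that case, assuming URLs never contain a literal '<' (they shouldn't in valid XML). The function name 'extract_locs' is made up for this example and is not part of the script above.

```shell
#!/bin/sh
# extract_locs (hypothetical helper): print every <loc> value found in
# the sitemap file given as $1, one URL per line, even when several
# <loc> tags share a single line.
extract_locs()
{
    awk '{
        line = $0
        # keep consuming <loc>...</loc> pairs until none remain on this line
        while (match(line, /<loc>[^<]*<\/loc>/)) {
            # skip the 5 characters of <loc> and drop the 6 of </loc>
            print substr(line, RSTART + 5, RLENGTH - 11)
            # continue searching after the pair just printed
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$1"
}
```

To use it, replace the awk command in spider.sh with 'extract_locs sitemap.xml' and pipe that into 'ff' as before.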