The problem with this approach is that it doesn't scale. The more content you have, the longer it takes to generate the sitemap. And because the sitemap is generated every time the URL is accessed, precious server resources are wasted.

Another suggestion was to append an entry to the sitemap every time new content is added to the database. I did not like this approach because it would be difficult to do source control on the sitemap. Also, accidentally deleting the sitemap would mean that data is gone forever.

Batch Job Approach

Eventually, I ended-up doing something similar to the first suggestion. However, instead of generating the sitemap every time the URL is accessed, I ended-up generating the sitemap from a batch job.

With this approach, I get to schedule how often the sitemap is generated. And because generation happens outside of an HTTP request, I can afford a longer time for it to complete.

Having previous experience with the framework, Spring Batch was my obvious choice. It provides a framework for building batch jobs in Java. Spring Batch works with the idea of "chunk processing" wherein huge sets of data are divided and processed as chunks.

I then searched for a Java library for writing sitemaps and came-up with SitemapGen4j. It provides an easy to use API and is released under Apache License 2.0.

Requirements

My requirements are simple: I have a couple of static web pages which can be hard-coded to the sitemap. I also have pages for each place submitted to the web site; each place is stored as a single row in the database and is identified by a unique ID. There are also pages for each registered user; similar to the places, each user is stored as a single row and is identified by a unique ID.

A job in Spring Batch is composed of 1 or more "steps". A step encapsulates the processing needed to be executed against a set of data.

I identified 4 steps for my job:

Add static pages to the sitemap

Add place pages to the sitemap

Add profile pages to the sitemap

Write the sitemap XML to a file

Step 1

Because it does not involve processing a set of data, my first step can be implemented directly as a simple Tasklet:

The starting point of a Tasklet is the execute() method. Here, I add the URLs of the known static pages of CheckTheCrowd.com.

Step 2

The second step requires places data to be read from the database then subsequently written to the sitemap.

This is a common requirement, and Spring Batch provides built-in Interfaces to help perform these types of processing:

ItemReader - Reads a chunk of data from a source; each data is considered an item. In my case, an item represents a place.

ItemProcessor - Transforms the data before writing. This is optional and is not used in this example.

ItemWriter - Writes a chunk of data to a destination. In my case, I add each place to the sitemap.

The Spring Batch API includes a class called JdbcCursorItemReader, an implementation of ItemReader which continously reads rows from a JDBC ResultSet. It requires a RowMapper which is responsible for mapping database rows to batch items.

For this step, I declare a JdbcCursorItemReader in my Spring configuration and set my implementation of RowMapper:

Places in CheckTheCrowd.com are accessible from URLs having this pattern: checkthecrowd.com/place/{placeId}?searchId={searchId}. My ItemWriter simply iterates through the chunk of PlaceItems, builds the URL, then adds the URL to the sitemap.

Step 3

The third step is exactly the same as the previous, but this time processing is done on user profiles.

Lines 26-30 declare an abstract step which serves as the common parent for steps 2 and 3. It sets a property called commit-interval which defines how many items comprises a chunk. In this case, a chunk is comprised of 100 items.

There is a lot more to Spring Batch, kindly refer to the official reference guide.

7 comments
:

Hi there :) Great post, however, as soon as I run the job again, for the second time, I get an exception from sitemapgen4j: java.lang.RuntimeException: Sitemap already printed; you must create a new generator to make more sitemaps. This is because sitemapgen4j has a property internally called finished to track if the file has been written. How did you solve this issue?