Mostly Grails tips from Code Panda

Writing batch import scripts with Grails, GSQL and GPars

At work, we use a combination of GSQL and GORM to allows us to migrate data from legacy databases to our Grails projects. However, when dealing with large sets of data, this can be quite time consuming. Our largest databases can take 2 to 5 hours to process.

In this post, I will describe how to integrate GPars with Grails and GSQL to make these scripts run much faster.

Notes:

Description: This script looks through the photo table and imports photos into our database. The real script also does a great deal of image resizing and manipulation during the import, and creates secondary domain objects, but you get the idea.

Grails shell friendly: I wrap this code in a closure so it can be invoked both via grails console and grails shell. In the code, you see

def photoImport = { ...
}.call()

Wrapping this script in a closure allows me to call this on our production servers via grails shell like this:

grails prod shell
groovy > load scripts/importPhotos.groovy

It also allows me to edit and debug the script using the Grails console.

I found that with Gant scripts, you have to do a lot of extra work to get your application setting loaded properly, which are handled automatically by Grails shell and console.

You don’t necesarily need to use SQL – for some of our content, we import data in Excel files. For this, I use Goran Ehrsson’s ExcelBuilder script, except using the following import annotations to handle xlsx. ( Also changing all HSSF references to XSSF )

Configurable data sources: For our situation, the production data source and the development data source have different locations, we handle this by keeping this information within Config.groovy and keeping different sets for each environment.

Make it go faster: enabling GPars

GPars is a project that enables parallel processing in Groovy aimed at leveraging multiple processors and cores. Using GPars, in theory, we can improve the speed of our scripts with very little modification.

Adding GPars

There is a Grails plugin for the GPars project, however, it hasn’t been updated to the latest version of GPars, which has a few changes in syntax.

The easiest way to add GPars to your project is to actually just include the jar files. I do so by adding them into my grails-app/conf/BuildConfig.groovy file like so:

The code is pretty self explanatory, instead of looping and waiting for each iteration, it says ‘run each cycle of this loop as a parallel collection’.

Making GORM behave with GPars – getting hibernate sessions

If you try to run the code as it, you’ll run into a hibernate exception right away when it first tries to access the Grails database:

org.hibernate.HibernateException: No Hibernate Session bound to thread, and configuration does not allow creation of non-transactional one here

Some digging reveals that this is caused because the new thread created by GPars lacks the proper hibernate session. To fix it, we can wrap our GORM calls in a withTransaction closure, which will get an existing session or create a new one to run our GORM call.

In practice, we convert GORM getters like this:

user = User.findByUsername( photo.username )

into this:

User.withTransaction{
user = User.findByUsername( photo.username )
}

Making GORM behave with GPars – enabling pessimistic locking

Once I started running my scripts, it quickly became apparent that the threads were getting into each others’ ways when modifying data. The data in our legacy database would cluster similar data together ( i.e. photos would be imported in batch ), we would run into frequent StaleObjectExceptions and OptimisticLockExceptions from our good friend Hibernate.

One way to prevent this is to use pessimistic database locks to ensure that a SELECT UPDATE is called on our code. This is accomplished via the DomainObject.lock method, which acts like a get but prevents other rows from accessing this data.

Speed Improvements

So why go through all this trouble? Does the parallel processing provided by GPars provide any speed improvements?

I ran three versions of a modified user import script to see if there were any notable differences. This script imported the information of about 1,000 users to our system. Here are the results:

No GPars

GPars Optimistic

GPars Pessimistic

Run 1

174.142

53.731

56.031

Run 2

191.816

58.770

60.779

Run 3

196.652

57.356

68.459

Average

187.536

56.619

62.756

All times are in seconds.

As you can see, there were major differences between the average speed of a GPars-enabled import vs. one that is not GPars enabled. It took about 1/3 of the time to run the GPars scripts on my local developer machine ( some multi-core Dell laptop running Windows ).

There were not that many differences between the optimistic and pessimistic locking versions of the scripts. Given that the pessimistic locking version required the least amount of change while still allowing parallel processing, we decided to stick with that.

Given how easy it is to convert a non-GPars script to one that is GPars enabled, I definitively recommend this approach to anyone faced with a similar situation. In our investigation, I took at look at Spring Batch but bulked since it would require major rewrites to our scripts without promises of better performance.

I just did a fairly large migration with Groovy’s SQL handling for a state government commission. It included several databases/schemas and well over 150 tables moving from Oracle to MySQL. I didn’t want to write a script for each and every table– and I didn’t. Although I didn’t use parallel processes in the JVM, I did break up large schemas and even very large table ranges into separate scripts- all of which could be run and tested independently. Those scripts relied on the same core data extraction techniques and could have been put in a single script.

I wished GORM was easily isolated from Grails, as it would have made a few things more elegant; I had to rely on old-school JDBC to preserve Oracle’s datetime and blob data.

That said, how would this approach fair when needing to move large, incongruent, messy, denormalized real-world data structures? I think there is a way to merge these two ideas into something really cool and table-independent in design.

It still under development and test. Services and Quartz jobs can handle these Hibernate session initialization issues pretty well, but i’m not sure if it will be a better/faster/powerful solution than GPars.

You do realize that the code
def album = Album.findByUser(user)
if (!album ) {
album = new Album( title : “${user.username}’s album”, user: user )
}
may be executed simultaneously by two (or more) threads for the same user, in which case several albums will be created, don’t you?