24 Mar 2009

A tutorial on bulkloading data onto App Engine Datastore

A recent application migration to App Engine involved uploading a lot of old data from a SQL Server database into the datastore. The Google documentation only gives a few simple examples, and there are two bulkloading tools available, which causes more confusion than help. I tried the bulkloader tool first and installed SQLite for it, but failed to make it work. Later on, I tried the bulkload_client tool and succeeded with it.

It is straightforward to follow the examples for simple data uploading. You can dump your database tables into CSV files, write an uploader class in Python that maps the fields of a CSV row to datastore.Entity properties, and it should work. However, it gets more troublesome if your data contain Unicode strings, or if your model requires relationships (references, in Datastore terms).

1. A simple example:

Suppose we have a database table MyUser containing an ID as an integer, a name as a string, an age as an integer, and a mobile phone number as the db.PhoneNumber type. And we have the entity class built like this:
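The original code listing is missing here; below is a minimal sketch of what such a model class would look like, assuming the property names (uid, name, age, mobile) used elsewhere in this tutorial:

```python
from google.appengine.ext import db

class MyUser(db.Model):
    uid = db.IntegerProperty()         # the original integer ID from SQL Server
    name = db.StringProperty()
    age = db.IntegerProperty()
    mobile = db.PhoneNumberProperty()  # maps to datastore_types.PhoneNumber
```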

The CSV file contains a number of rows, each representing one user and containing four fields separated by commas. For example:

1,David Johnson,31,07897601212
2,Mary Mann,24,0786321234

To upload this data, we need to write a Loader class which, like a web request handler, stays on the server and can be activated by a request from the client. It can be as simple as a couple of lines of code.
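The loader listing itself did not survive in this copy of the post; a sketch of what myloader.py might contain, based on the description that follows:

```python
from google.appengine.ext import bulkload
from google.appengine.api import datastore_types

class MyUserLoader(bulkload.Loader):
    def __init__(self):
        # Map each CSV column, in order, to a property of the MyUser kind.
        # The second element of each tuple is a function that converts the
        # raw CSV string into a value the property accepts.
        bulkload.Loader.__init__(self, 'MyUser', [
            ('uid', int),
            ('name', str),
            ('age', int),
            ('mobile', datastore_types.PhoneNumber),
        ])

if __name__ == '__main__':
    bulkload.main(MyUserLoader())
```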

Our class MyUserLoader derives from bulkload.Loader, and its __init__ method (constructor) is responsible for specifying what data to upload into which entity kind, in this case 'MyUser'. We call bulkload.main at the main program entry and pass an instance of our loader class as its parameter.

The most interesting part is the list that maps a data row in the CSV file. Each element in the list is a tuple of two items: the first is the name of the property that will be set on the entity, and the second tells the data type; it is actually a function that transforms the string from the CSV into a value accepted by that property. In this example, 'name' is a string, so we use the Python built-in str: str(x) will be applied to the corresponding comma-separated field of a CSV line. The 'age' tuple specifies that the property accepts an integer, converted with int(x). The 'mobile' tuple is of type PhoneNumber, and in a loader class we must use datastore_types.PhoneNumber for the db.PhoneNumber type.

Now we can save this class as the file myloader.py and upload it to the production server if we are not running on a local machine.

Next, we need to add this to app.yaml file:

- url: /loadusers
  script: myloader.py
  login: admin

Note that if you run it on the local development machine, comment out the last line, login: admin, so that you don't need credentials to run the bulkloader. If on GAE, you can browse to /loadusers, copy the cookie string, and insert it into the bulkload_client command-line arguments.
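An invocation might look like the following (flag names as in the SDK's bulkload_client.py of that era; the filename, URL, and cookie value are placeholders, so check your SDK's --help output):

```shell
python bulkload_client.py \
    --filename=users.csv \
    --kind=MyUser \
    --url=http://yourapp.appspot.com/loadusers \
    --cookie='ACSID=...'
```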

2. Date and time values

It's a bit tricky with datetime values. They can be written to the CSV file in different formats. For example, we have an ISO-format datetime string like this: 2009-01-30 12:24:01, and we must specify the format to make it parse correctly. We can use the Python datetime function strptime. Let's modify the myloader.py file to add the following:

from datetime import datetime

[('updateTime', lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))]

We used an inline lambda function to do the conversion from string to datetime object; alternatively, we can write a named function like this:
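The original listing for the named version is missing; a sketch (the function name convert_datetime is my own choice):

```python
from datetime import datetime

def convert_datetime(x):
    # Parse an ISO-style 'YYYY-MM-DD HH:MM:SS' string into a datetime object.
    return datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
```

It is then used in the mapping list as [('updateTime', convert_datetime)].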

3. Empty values

One problem that I met is that some fields are empty, but the property does not allow empty values, for example, the PhoneNumber property.

When we have an empty string value for the mobile field, we'll get an exception. Although this property does not accept empty strings as input, it does accept a None value. So, to solve this problem, we can define a function that returns None when given an empty string.

[('mobile', lambda x: x or None)]
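Equivalently, as a named function (the name none_if_empty is illustrative):

```python
def none_if_empty(x):
    # An empty CSV field becomes None, which an optional property accepts;
    # a non-empty string passes through unchanged.
    return x or None
```

It is then used in the mapping list as [('mobile', none_if_empty)].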

4. Using a different encoding

We have Asian-language names encoded as UTF-8 in the CSV file, and when we upload them, an exception is raised with the default encoding. So we must specify the encoding explicitly.

We need to do this for both StringProperty and TextProperty properties.
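The original snippet is missing here; a minimal sketch, assuming the 'name' property from the earlier example and the Python 2 runtime App Engine used at the time:

```python
# Decode the UTF-8 bytes from the CSV explicitly instead of relying on
# the default (ASCII) codec.
[('name', lambda x: unicode(x, 'utf-8'))]
```

For a TextProperty, datastore_types.Text takes an encoding argument, e.g. lambda x: datastore_types.Text(x, encoding='utf-8').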

5. References

How should we write a type function for a reference? Suppose a second kind, MyPost, holds a reference to MyUser. First of all, we must make sure that the MyUser entities are already loaded before we upload the MyPost entities, and we can query the MyUser entities by user id:

u = MyUser.all().filter('uid =', uid).get()
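Putting that query into a type function for the MyPost loader might look like this (MyPost and its 'author' property are assumptions for illustration; the post never shows this listing):

```python
def lookup_user(x):
    # Resolve the uid column of a MyPost row to the key of the
    # already-loaded MyUser entity, for use as a reference value.
    u = MyUser.all().filter('uid =', int(x)).get()
    return u.key()

# In the MyPost loader's mapping list:
# [('author', lookup_user), ...]
```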

This query relies on an index on uid, and we can make the lookup even faster by giving the entities a key name. So let's modify the MyUser loader and load the data again. (We must delete the old data before uploading again; otherwise, the entities will be duplicated.)

Unfortunately, we do not have a straightforward way of specifying a key_name in the loader class. However, we can override a method called HandleEntity in the loader class. An entity parameter is passed to this method, containing the fields populated for one row; this is the place where we can post-process the fields before the entity is submitted to the datastore, and here we can add the key_name. To do that, we construct a new datastore.Entity instance with the desired key_name and copy the current entity's properties into it.
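A sketch of such an override, assuming the MyUser loader from earlier and deriving the key name from uid (the 'user_%d' naming scheme is my own choice):

```python
from google.appengine.api import datastore

class MyUserLoader(bulkload.Loader):
    # ... __init__ as before ...

    def HandleEntity(self, entity):
        # Build a replacement entity of the same kind with an explicit
        # key_name, copy the populated properties over, and return it
        # so it is stored instead of the original.
        new_entity = datastore.Entity('MyUser', name='user_%d' % entity['uid'])
        new_entity.update(entity)
        return new_entity
```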

There seems to be no way of accessing other field values inside a type function. For example, a primary key spanning multiple fields cannot be handled this way; the data should be preprocessed so that the key is already combined into a single string in the CSV.