Sunday, December 14, 2008

Data Migration with HAppS-Data

HAppS applications, like any application with persistent data storage, face the problem of migrating existing data when the format of the persistent data changes. This tutorial explores the binary serialization and migration facilities provided by HAppS-Data. If you think I am doing it all wrong, please let me know. Writing this tutorial is the extent of my experience using the HAppS-Data migration facilities.

Requirements

This tutorial only uses the HAppS-Data (and dependencies) portion of HAppS. It has been tested with HAppS-Data 0.9.3. The first three lines of the module look like this:

Serialization

The most obvious way to serialize data in Haskell is to use the familiar Read and Show classes. Simply use show to turn a value into a String, and read to turn a String back into a value. This method has three serious flaws however:

The Version superclass is used during data migration. The serialize and deserialize functions are the counterparts to show and read. deriveSerialize is a Template Haskell function which provides functionality similar to deriving (Read, Show).

The Version class

The Version class is very straightforward. It consists of a single function which returns the Mode (aka, the version) of a datatype.

>
> class Version a where
>     mode :: Mode a
>     mode = Versioned 0 Nothing
>
> data Mode a = Primitive -- ^ Data layout won't change. Used for types like Int and Char.
>             | Versioned (VersionId a) (Maybe (Previous a))
>
> newtype VersionId a = VersionId {unVersion :: Int} deriving (Num, Read, Show, Eq)

There are two categories of datatypes:

primitives whose layout will never change, and, hence, will never need to be migrated

everything else

The Versioned constructor takes two arguments. The first argument is a version number which you increment when you make a change to the data-type. The second argument is an indicator of the previous version of the data-type. The exact details are covered in the next section.

Putting it all together

The deriveAll Template Haskell function is similar to the normal Haskell deriving clause, except it also has the ability to derive Default instances. Additionally, it always derives Typeable and Data instances even though they are not explicitly listed.

To make the types serializable we first need to create Version instances.

We want to indicate that Beep and Foo are non-primitive types, so we use the Versioned constructor. Next we specify a version number for the type. It could be anything, but 0 is the most sensible choice. Since there are no previous versions of these types, we mark the previous type as Nothing.

For all non-primitive types the initial version of Versioned 0 Nothing is sensible. So the Version class provides it as a default value for mode:

> class Version a where
>     mode :: Mode a
>     mode = Versioned 0 Nothing

Hence, we could shorten our Version instances from above to:

> instance Version Beep
> instance Version Foo

Next we derive Serialize instances for our types:

> $(deriveSerialize ''Beep)
> $(deriveSerialize ''Foo)

Now we can use serialize to serialize values. Let's look at the output of serialize Beep

We see that Beep serializes to 9 bytes. The first 8 bytes represent the VersionId. VersionId is basically an Int, and the serialization code always treats Ints as 64-bit values to avoid cross-platform issues. The final byte indicates which constructor of Beep was used. In this case the zeroth constructor was used.
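To make the 9-byte layout concrete, here is a small stand-alone sketch that builds the same shape by hand. It does not use HAppS-Data at all: encodeInt64, constructorTag and encodeBeep are names I made up for illustration, and the big-endian byte order is an assumption of the sketch rather than something stated above.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- Toy stand-in for the tutorial's Beep type.
data Beep = Beep deriving (Show, Eq)

-- Encode an Int as exactly 8 bytes, most significant byte first.
-- (Big-endian order is an assumption of this sketch.)
encodeInt64 :: Int -> [Word8]
encodeInt64 n =
  [ fromIntegral ((n `shiftR` (8 * i)) .&. 0xff) | i <- [7, 6 .. 0] ]

-- Index of the constructor that built the value.
constructorTag :: Beep -> Word8
constructorTag Beep = 0

-- Version number first (8 bytes), constructor index last (1 byte).
encodeBeep :: Int -> Beep -> [Word8]
encodeBeep version b = encodeInt64 version ++ [constructorTag b]
```

Loaded into GHCi, length (encodeBeep 0 Beep) gives 9, matching the layout described above: eight bytes of version followed by one byte of constructor index.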

At first it may seem like we don't have enough information here to deserialize the data. After all, there are no type names, constructor names, etc. But deserializing these bytes is no different from doing read "1" :: Int. Because we know at compile time the type of the value we want to read, we do not need to record that information in the stored data. We just do:

The first 8 bytes are the length of the String, and the remaining bytes are the UTF-8 encoded characters of the String.
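The same layout can be sketched by hand in plain Haskell. Again, this is not the HAppS-Data implementation: utf8Char, encodeLength and encodeString are hypothetical helpers, the byte order is assumed to be big-endian, and I am assuming the length prefix counts encoded bytes rather than characters (for ASCII text the two are the same).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- UTF-8 encode a single character (1 to 4 bytes).
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18)), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    byte = fromIntegral :: Int -> Word8
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = byte (0x80 .|. (x .&. 0x3F))

-- 8-byte length prefix, most significant byte first (assumed order).
encodeLength :: Int -> [Word8]
encodeLength n =
  [ fromIntegral ((n `shiftR` (8 * i)) .&. 0xff) | i <- [7, 6 .. 0] ]

-- Length prefix (assumed to be the byte count) followed by the UTF-8 payload.
encodeString :: String -> [Word8]
encodeString s = encodeLength (length payload) ++ payload
  where
    payload = concatMap utf8Char s
```

In this sketch a five-letter ASCII string such as "hello" comes out as 13 bytes: 8 for the length and 1 per character.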

So, if your application is best served by using Strings instead of ByteStrings, you do not have to take any extra steps to ensure that the serialized data is compactly represented.

Simple Migration

Let's say we want to add another constructor to the Beep type. As a first pass, we will actually create a whole new type named Beep', which is similar to the old type, but has an additional constructor BeepBeep.

Because we are extending a previous type, our Version instance will look a bit different:

> instance Version Beep' where
>     mode = extension 1 (Proxy :: Proxy Beep)

This indicates that we are extending the old type Beep. The new version number must be higher than the old version, but does not have to be strictly sequential.

Because we specified that this type is a newer version of an older type, we also need to tell HAppS how to migrate the old data to the new type. To do this, we simply create an instance of the Migrate class.

> class Migrate a b where
>     migrate :: a -> b

The Migrate class is quite simple. It contains a single function, migrate, which migrates something of type a to type b. In our current example, all we need is:

> instance Migrate Beep Beep' where
>     migrate Beep = Beep'

We can demonstrate migration by serializing a value of type Beep and deserializing it as type Beep'. The migration happens automatically in the deserialize function.

*Main> fst $ deserialize (serialize Beep) :: Beep'
Beep'
*Main>

When deserialize tries to deserialize the data produced by serialize Beep, it will first check the version number. When it sees that the version number in the stored data is lower than the version number of the current type it will instead try to decode it as the type you specified as the previous version. If the version associated with the previous type is still higher than the value in the serialized data, the migration code will recurse until it finds a matching version number. Once it finds a matching version number, it will call the corresponding deserialization "instance" to decode the old data. Then as the recursion unwinds, it will apply the migrate function to migrate the data to newer and newer formats until it reaches the newest format.
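The recursion described above can be modelled in a few lines of plain Haskell. This is only a toy model of the idea, not how HAppS-Data is implemented: the Stored record, the BeepV0/BeepV1/BeepV5 types and the decode/migrate functions are all invented for this sketch. The jump from version 1 to version 5 is deliberate, to show that version numbers only need to increase.

```haskell
-- Three generations of a toy type. Versions are 0, 1 and 5.
data BeepV0 = BeepV0 deriving (Show, Eq)
data BeepV1 = BeepV1 | BeepBeepV1 deriving (Show, Eq)
data BeepV5 = BeepV5 | BeepBeepV5 | Silence deriving (Show, Eq)

-- Stored form: the version number followed by a constructor index.
data Stored = Stored Int Int

-- One migration step per pair of adjacent generations.
migrate01 :: BeepV0 -> BeepV1
migrate01 BeepV0 = BeepV1

migrate15 :: BeepV1 -> BeepV5
migrate15 BeepV1     = BeepV5
migrate15 BeepBeepV1 = BeepBeepV5

-- Oldest generation: there is no previous version to fall back to.
decodeV0 :: Stored -> Maybe BeepV0
decodeV0 (Stored 0 0) = Just BeepV0
decodeV0 _            = Nothing

-- If the stored version is older than ours, decode it as the previous
-- type and migrate the result forward; otherwise decode it directly.
decodeV1 :: Stored -> Maybe BeepV1
decodeV1 s@(Stored v tag)
  | v < 1     = migrate01 <$> decodeV0 s
  | v == 1    = case tag of
      0 -> Just BeepV1
      1 -> Just BeepBeepV1
      _ -> Nothing
  | otherwise = Nothing

decodeV5 :: Stored -> Maybe BeepV5
decodeV5 s@(Stored v tag)
  | v < 5     = migrate15 <$> decodeV1 s
  | v == 5    = case tag of
      0 -> Just BeepV5
      1 -> Just BeepBeepV5
      2 -> Just Silence
      _ -> Nothing
  | otherwise = Nothing
```

Calling decodeV5 (Stored 0 0) walks down to decodeV0, then applies migrate01 and migrate15 on the way back up, which is the "recursion unwinds" step from the paragraph above.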

Managing History

A big issue in the above example is that when we added the new constructor we also changed the name of the type and its existing constructors. That is not very convenient in a real application where you have a multitude of references to the old names.

Fortunately, we do not have to change the name of the type to add a new constructor. As we saw in the beginning, the name of the type and the names of the constructors are not actually stored in the serialized data. So, instead we can change the name of the old type from Beep to OldBeep and update its constructor as well.

That means we can serialize an OldBeep value and then deserialize it as a Beep value, like this:

*Main> fst $ deserialize (serialize OldBeep) :: Beep
Beep
*Main>

Note that this is not the same as migration. Here we are just exploiting the fact that, because the type name and constructor names are not encoded in the serialized data, we can change those names and still be able to deserialize the data.
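A tiny stand-alone model shows why renaming is harmless. It is not HAppS-Data code: encodeTag and decodeTag are hypothetical helpers that store only the constructor index, so two types with the same shape but different names produce and accept exactly the same byte.

```haskell
import Data.Word (Word8)

-- Two types with the same shape but different names.
data Beep    = Beep    deriving (Show, Eq, Enum, Bounded)
data OldBeep = OldBeep deriving (Show, Eq, Enum, Bounded)

-- Store only the constructor index; no type or constructor names.
encodeTag :: Enum a => a -> Word8
encodeTag = fromIntegral . fromEnum

-- Look the index up among the constructors of whatever type is requested.
decodeTag :: (Bounded a, Enum a) => Word8 -> Maybe a
decodeTag w = lookup w [ (encodeTag x, x) | x <- [minBound .. maxBound] ]
```

In GHCi, decodeTag (encodeTag Beep) :: Maybe OldBeep evaluates to Just OldBeep. No migration function is involved; the requested type alone decides how the byte is interpreted.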

Using separate files to manage type history

Keeping all the revisions of your type in one file, and changing the name of the type and its constructors every revision is tedious and hard to manage. Instead, we can use a system where we rename the files that contain our types. To start, we will put the types we want to serialize in a separate file (or files), such as Types.lhs.

Note that Foo in Types.lhs and Foo in Types_000.lhs are different types, namely Types.Foo and Types_000.Foo. So this works for the same reason that renaming Beep to OldBeep works.

Serializing Datatypes from 3rd Party Libraries

It is also possible to serialize datatypes from 3rd party libraries, provided those types have Data and Typeable instances. There is a caveat, however: if the 3rd party library changes the type, you will no longer be able to read your data. This is not a fatal flaw. You can simply copy the old type definition into a local module, and then migrate the old data to the new format.

Suggested Policy

Put the types you will serialize in one or more files which only contain types

Deploy your web 2.718 killer app

Before you do any more development, copy the current type files to sequential versions and create new current type files which re-export all the types. You can skip this step if the current type file only contains re-exports. i.e., if no type changes were made to that type file during the previous iteration.

My next post will likely be about using migration in the context of a full HAppS application. I plan to explain how HAppS-State actually works, and how to deal with the issue that HAppS-State can cause when you rename modules.