At work, I've been thinking about the
problem of versioning data quite a bit. It's a nasty problem, but I think
I've gotten it down to a nice, simple paradox:

"The easier you make version migration, the harder version migration will
be."

This is a very weird and particularly intense microcosm of "worse is
better". In this case, the difficulty involved in using a tool actually
makes the task that the tool has to perform easier. In other words,
subjective quality improves as objective quality degrades. Or something like
that. Here's a case in point:

In Python, you can easily serialize any object regardless of its layout, as
long as it doesn't contain something completely nonsensical like an open
file or a graphical button. It takes no extra code.

In Twisted, and by extension in Quotient, any object that is even
theoretically persistent in this manner can be upgraded by inheriting a
single superclass (twisted.persisted.styles.Versioned) and writing a single
method (upgradeToVersion1... or upgradeToVersion7. write as many as you like
and they'll be run in order) which shuffles things around inside the object
until they're consistent with the current world-view. This is about as easy
as upgrading from one version to another can get - the upgrade function is
almost always completely self-explanatory.
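The example that originally sat here is easy enough to sketch. Note that the little Versioned class below is only a stand-in that mimics the upgrade loop of twisted.persisted.styles.Versioned (the real one hooks into unpickling), and Player and its attributes are invented for illustration:

```python
class Versioned:
    """Stand-in for twisted.persisted.styles.Versioned: after loading,
    run each upgradeToVersionN in order until the class's current
    persistenceVersion is reached."""
    persistenceVersion = 0

    def doUpgrade(self, loaded_version):
        for n in range(loaded_version + 1, self.persistenceVersion + 1):
            getattr(self, "upgradeToVersion%d" % n)()


class Player(Versioned):
    persistenceVersion = 2

    def upgradeToVersion1(self):
        # 'hitpoints' was renamed to 'health'.
        self.health = self.__dict__.pop("hitpoints")

    def upgradeToVersion2(self):
        # health used to be a bare int; it is now a (current, max) pair.
        self.health = (self.health, 100)


# Simulate loading a version-0 object and upgrading it.
p = Player.__new__(Player)
p.__dict__["hitpoints"] = 40
p.doUpgrade(0)
```

The nice property is visible at a glance: each upgradeToVersionN is a tiny, self-explanatory shuffle of the object's own attributes.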

This is the MIT style of persistence design. (Oddly enough, written while
I'm living at MIT.) It is complete, it takes every case into account (there
is even code for upgrade ordering dependencies, if you care about such
things) and it values simplicity for the user (the "business object" author)
rather than the implementor (the maintainer of the framework).

Now, for an example of the New Jersey approach, I will refer to some code
that I actually wrote in New Jersey. (This is code that I have done my best
to erase from the internet's collective memory. If any of you offer up the
ridicule that it so richly deserves, so help me I will erase you in the
same fashion.)

Unfortunately I haven't had much experience with this style of persistence,
although I am aware that many popular systems use it, including software
that costs thousands of dollars per copy and does Very Important Things
Indeed for Fortune 500 companies.

The style I am speaking of is explicit persistence; in other words, you
have to write a new method for every new object you want to persist, even
if it's something dirt simple like two ints and a string.
Then, whenever you want to change anything about an object, you have to
modify some code that saved and loaded that object. In the code in question
- this was the original Twisted Reality codebase - there were 2 methods of
note for saving and loading objects. One was public String content() in
Thing. The other was [package] boolean handleIt(String) in ThingFactory.
These methods were 110 and 283 lines long, respectively. If you wanted to
add a new attribute to anything, you had to add a hand-written chunk of
string-splicing code to each of those methods. If you actually wanted to change
something about the structure of an object, you had to create magic, temporary
declarations in the persistence and then interpret them later. Certain kinds
of changes weren't even possible.
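The snippet that went here is gone, as promised. In spirit, though, each addition looked something like this - a hypothetical Python rendering of the Java string-mashing style, with every name invented (content and handle_it only echo the shape of Thing.content() and ThingFactory.handleIt):

```python
def content(thing):
    """Explicit save: every persisted attribute gets its own line of
    hand-written formatting code."""
    out = []
    out.append("name %s" % thing["name"])
    out.append("weight %d" % thing["weight"])
    # Adding a new attribute means adding a line here...
    out.append("color %s" % thing["color"])
    return "\n".join(out)


def handle_it(thing, line):
    """Explicit load: ...and a matching clause here. Forget either half
    and the data silently fails to round-trip."""
    key, _, value = line.partition(" ")
    if key == "name":
        thing["name"] = value
    elif key == "weight":
        thing["weight"] = int(value)
    elif key == "color":
        thing["color"] = value
    else:
        return False  # unrecognized line
    return True


# Round trip:
ball = {"name": "ball", "weight": 2, "color": "red"}
loaded = {}
for line in content(ball).splitlines():
    handle_it(loaded, line)
```

Multiply those two functions by every attribute of every object in the system and you get to 110 and 283 lines fairly quickly.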

Now, when a designer who was thus far ignorant of the subtle mysteries of
persistence comes across these views of the world, the choice here is
obvious. The former is so clearly superior that the latter seems like a
cruel joke. It breaks encapsulation! It adds a huge cost to change! It binds
unrelated aspects of the code together inextricably! It creates arbitrary,
artificial, and unnecessary limitations!

Not all these problems are endemic to explicit persistence, but I wanted to
present an obvious straw man here so that it would be really surprising when
I couldn't knock it down.

Don't get me wrong - the former strategy is certainly nicer to work
with when one is writing programs. Given the explicit task of upgrading a
few simple objects (of the style present in the code from which the New
Jersey example is taken) to a new and better representation, the Right Thing
wins hands down. This scales up, too - there were no complicated objects in
the NJ code specifically because the persistence code pretty much fell over
whenever you tried to do anything complicated, so writing and upgrading
complicated objects MIT-style is certainly easier than impossible.

But there is a curious phenomenon that emerges when you look at the larger
issues pertaining to codebases using these two approaches. In about 2 years
of maintaining TwistedReality/Java, when there were bugs in the persistence,
they were obvious and immediate - you could dump an object and identify the
problem with its representation in a few moments. More importantly, pretty
much every version was backwards-compatible to old persistent data. You
never had to "blow away" your map files, because they would just load
without the spiffy features available in the new map format. And finally, no
contributor to the project ever checked in code which broke these
constraints - every data-layout change included appropriate persistence
changes.

In only about 6 months of maintaining a Right Thing codebase (Quotient) this
is certainly not the case. Close to shipping now, we are still wrestling
with a system which requires the database be destroyed on every other run.
Nobody wants to write persistence code, and hey, the system works
if you don't, so why bother? We don't have a policy in place that mandates
that everyone must write an upgrader every time they change anything, and
again, nobody wants to write persistence code, so since they don't have to
they won't. This includes me, so I understand the impulse quite well.

This isn't entirely a fair comparison, of course. TR/J included a lot less
persistence-sensitive data than Q does. It had a far simpler charter. It
didn't use a database layout, just pure object persistence. However, from
experience with the Twisted 'tap' format, those issues are peripheral -
Twisted devs generally don't like taking the time to write persistence
upgrade functions either. There are periodically upgrade snafus. What really
matters, of course, is that nobody trusts taps to stay stable worth a damn,
even though we try really hard to make sure they will be.

Also, towards the end of its life (although there is some question as to
whether it is really dead), TR/J began inheriting some
characteristics of the Right Thing model (in particular, dynamic properties
of arbitrary type), which in turn began creating the same syndrome of
problems. In that case, it manifested itself as certain features breaking on
particular objects from version to version and requiring operator
interventions to fix the data rather than whole-system upgrade explosions,
but nevertheless, we couldn't quite shoehorn all the features we needed into
a static, single-object model of persistence.

Python has tempted us, we have taken the bite from the Pickle, and we can't
ever go back again. A persistence strategy as clearly brain-dead as the one
featured in TR/J just isn't going to cut it with the feature-set that we
need to support in Quotient. However, we desperately need to encourage the
developmental behaviors which that system encouraged, especially keeping a
running system going with the same data for an indefinite period of
time.

What did the Jersey style do right, then?

Forced Consideration of Impact

Every time that a programmer made a change to an object that might affect
that object's persistence, they had to make a change to the
persistence code as well, or they effectively couldn't use their new
feature. The data just wouldn't load. This meant that, when faced with a
potentially complicated new data structure, they would always ask
themselves, "Do I really need to add that?" This might seem like an
artificial burden, but in reality it more closely reflected the
real cost of change while keeping the actual requirements
satisfied, rather than making the cost of change seem artificially low
while constantly violating the requirements in the name of expediency.

Immediate Feedback and Testability

The persistence format was also the introspection format, because it was
so simple. Whenever a developer made a change to the persistence code,
they could immediately see that change in a very direct way,
making it easy to see if they made a mistake. If that code had had any
tests (NOT A WORD, I SAID!) then writing them would have been
relatively easy too. With an implicit persistence mechanism, the only way
to write such a test is to keep an exact, unreadable copy of an old
object's data (and of course, all the context that object kept
around).
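To make that contrast concrete: with a readable format, the frozen fixture in a test is a string a human can inspect and edit by hand. A made-up sketch (none of these names are from the TR codebase):

```python
# A frozen fixture in the old, human-readable format. Because it is
# plain text, a reviewer can spot a mistake in it at a glance - try
# doing that with a pickled byte blob.
OLD_DATA = "thing ball\nweight 2\ncolor red\n"


def load(text):
    """Parse the readable format into a dict of attributes."""
    lines = text.strip().splitlines()
    _, name = lines[0].split(" ", 1)
    attrs = dict(line.split(" ", 1) for line in lines[1:])
    return {"name": name, **attrs}


def test_old_data_still_loads():
    thing = load(OLD_DATA)
    assert thing["name"] == "ball"
    assert thing["color"] == "red"


test_old_data_still_loads()
```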

Programmer-Valuable Data Associated with the System

This is more specific to TR - as we were working on the code, we were also
developing a companion dataset, stored in TR's own format, which was
equally important to the project as the code itself. It was absolutely
100% imperative to every developer to keep that data working in every
minor iteration of the code, because otherwise we couldn't test. I think
that ultimately every data-storing project needs something like this to
make the developers truly care about versioning.

Separation of Run-Time and Saved State

All that grotty string-mashing code in TR actually served a purpose - it
stripped implementation-dependent run-time data out of the saved file.
This meant that we were free to change the implementation of lots of
structures (for example, switching a list of strings to a dictionary of
string:int, or vice versa) without updating the data files or the
persistence code, as long as the data could be represented in the same way.
In an automatic system, these implementation details are indistinguishable
from the core abstract state of an object.
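In Python terms, you can buy the same separation by hand with __getstate__ and __setstate__ - a minimal, invented sketch, not code from TR or Quotient:

```python
import pickle


class Inventory:
    """Runtime representation: a dict mapping item name to count."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def __getstate__(self):
        # Save an abstract, implementation-neutral form: a sorted list
        # of (name, count) pairs. Whether the runtime side uses a dict,
        # a list, or something cleverer can change without touching this.
        return {"items": sorted(self.counts.items())}

    def __setstate__(self, state):
        self.counts = dict(state["items"])


inv = Inventory({"sword": 1, "coin": 30})
restored = pickle.loads(pickle.dumps(inv))
```

The saved form stays stable while the in-memory representation is free to evolve, which is exactly the property the string-mashing code bought by accident.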

Oddly enough, although this separation was brought about through forced
duplication of effort (manual specification of the persistence format), it
reduced the
amount of upgrade code necessary. Because the persistence format was very
abstract, you never had to write an upgrader to go from one implementation
of behavior for the same persistent data to another. While changing
persistence can be frequent, changing implementation is almost by
definition more frequent.

I think that's a relatively complete summary of the advantages of manual
persistence work, although I'd love to hear comments upon it.

How can we replicate these advantages?

I think that an important first step is to find some simple, lightweight way
to completely express the necessary information for the persistence of an
object. Even if we still use Pickle to store this data, an explicit
specification of what it should look like would be a good mental exercise
for us. It would also provide a means to test upgrading and to represent the
format of old versions without having to copy their entire implementation.
In short, the "schema" that Twisted World provided and New PB is about to
provide again. The outstanding work in my sandbox on indexed and referenced
properties in Atop is an important first step here.
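As a mental model only - this is a made-up sketch, not the actual Twisted World, New PB, or Atop schema syntax - such a specification might be as small as a table of attribute names and types per version:

```python
# Hypothetical declarative schemas: every persisted attribute is named
# and typed up front, so an old version can be described without
# keeping its whole implementation around.
SCHEMAS = {
    1: {"hitpoints": int, "name": str},
    2: {"health": tuple, "name": str},
}


def conforms(state, version):
    """Check a state dict against the declared schema for a version."""
    schema = SCHEMAS[version]
    return (set(state) == set(schema)
            and all(isinstance(state[k], t) for k, t in schema.items()))


# An upgrader can now be tested against the old layout directly,
# instead of against an unreadable blob of a dead class's pickle:
old_state = {"hitpoints": 40, "name": "bob"}
```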

We also need some critical data to be stored in the database that can't be
exported, re-loaded, or otherwise passed through some external crutch
versioning mechanism. We need to care about our core data's dependability as
we move forward.

We also need to decide what kinds of data we really care about. There are
certain aspects of the Quotient application which are developing so fast
that it's impossible to effectively represent them persistence-wise, and it
would really be a waste of time. Such objects should probably never be
persisted in the first place - just provide an 'experimental' flag or
somesuch, indicating that the object should never touch an on-disk database.
When this becomes burdensome, the programmer can un-set it and manually
start performing updates.
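One way such a flag might be spelled in Python (a hypothetical sketch; the real thing would presumably live in Atop and be smarter about it):

```python
import pickle


class Experimental:
    """Mix-in marking a class as too volatile to persist: any attempt
    to pickle an instance fails loudly, instead of quietly baking an
    unstable layout into the on-disk database."""
    experimental = True

    def __reduce_ex__(self, protocol):
        if self.experimental:
            raise TypeError(
                "%s is flagged experimental and must not be persisted"
                % type(self).__name__)
        return super().__reduce_ex__(protocol)


class HalfBakedFeature(Experimental):
    pass


try:
    pickle.dumps(HalfBakedFeature())
    persisted = True
except TypeError:
    persisted = False
```

Un-setting the flag (experimental = False on the class) is the moment the programmer signs up for writing upgraders.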

There's more to say, but this has already been quite a ramble! I hope that
you've enjoyed reading it. Please keep in mind that I would like feedback
and more ideas about how to perform the transitions I've suggested over a
relatively large existing system. (Quickly, of course. And cheaply. ^_^)