Transaction

The Transaction class has five methods for supporting filesystem actions, and two methods for commit/rollback. The five filesystem action methods are:

openOutputStream

openInputStream

openRandomAccess

delete

rename

The openOutputStream method returns a FileOutputStream. Although

FileOutputStream out = t.openOutputStream(f);

appears equivalent to

FileOutputStream out = new FileOutputStream(f);

the openOutputStream method works behind the scenes to ensure atomicity. Once you have the FileOutputStream, you use it normally. You can close it when you're done, or you can let the transaction close it automatically when it completes.

The openInputStream method opens a file for reading, returning a FileInputStream. Since reading a file does not change the filesystem state (except possibly to alter the last-accessed time of the file, which we will not attempt to make atomic), it has no effect on atomicity; openInputStream is provided only for convenience and symmetry. Thus the call

FileInputStream in = t.openRead(f);

is equivalent to

FileInputStream in = new FileInputStream(f);

except that it will close the file when the transaction ends.

The openRandomAccess method is similar to the other open methods, except that it opens and returns a RandomAccessFile.

The other two action methods are similar.t.delete(f); is like f.delete();, except that it is atomic. It also throws exceptions, either FileNotFoundException or DeleteException (my own class, part of this package), instead of returning false. This, in my opinion, remedies a flaw in the design of File.delete.

Similarly, the rename method is the atomic version of File.renameTo, and throws a FileNotFoundException or a RenameException instead of returning false. The rename method will never overwrite an existing file, as some implementations of File.renameTo will; you must first delete the destination file.

Two methods handle transaction commit and rollback. It may seem obvious that one method does commit and another does rollback, and that is indeed the design of typical transaction systems (including JDBC), but there is another design that fits better with Java exception handling.

Every transaction has a boolean variable that determines whether it should commit or roll back when it is finished. The boolean is false by default. You set the boolean by calling a transaction's shouldCommit method, and you complete the transaction by calling its end method.

An example will show how this design makes it easy to code the most common case, in which the transaction will be rolled back if any exceptions are thrown while it is in progress:

Since shouldCommit is initially false, and isn't set to true until after everything has completed successfully, exceptions during the transaction will roll it back. If you want to continue your transaction in the face of an exception, you can write a try-catch within the try-finally.

If instead one had to call commit or rollback methods explicitly, then the
code for the common case would be a bit more cumbersome:

Here there is also the danger that the programmer will forget to roll back, or forget to re-throw the exception, thereby stifling it.

A third possible design, in which rollback is called automatically if there is an error, is no good for two reasons:

in some cases it does make sense to proceed with the transaction after an error;

there is no way for the transaction mechanism to trap exceptions that occur while the transaction is in progress but do not originate from within a call to a Transaction method.

Writing a try-finally around your transaction takes care of the basic commit-rollback logic, but somewhere you will have to deal with exceptions. Transaction methods can throw three different exceptions, reflecting three different kinds of errors:

An IOException is thrown when something goes wrong with the filesystem operation itself. For example, if you attempt to delete a nonexistent file, you will get a FileNotFoundException, which is a subclass of IOException. Handle these exceptions as you normally would, keeping in mind that you're running under a transaction.

A TransactionException is thrown when something goes wrong with the transaction mechanism itself, such as an error making a backup copy of a file. As soon as the system detects such a problem, it rolls back the transaction. Nested inside the TransactionException is the IOException that actually caused the problem. When you catch a TransactionException, you know that although your transaction failed, the system is in a consistent state.

An InconsistentStateException is thrown when the transaction mechanism cannot guarantee atomicity. For instance, if a filesystem error occurs during rollback itself, some but not all of the transaction's effects may have been undone, leaving the system in an inconsistent state. Such exceptions typically require human intervention.

Making Actions Atomic

Let us now go behind the scenes to see how atomic file transactions are implemented. We begin with some assumptions, namely that the filesystem makes certain basic actions atomic. Most important of these are writing a single block to the disk, deleting a file, and renaming a file, but we also assume that opening and closing a file are atomic. Our job is to provide atomicity for file writes, and for sequences of actions.

Let's begin with the problem of writing out a new version of a file. The standard way to do this, of course, is to simply open the file and write to it. But an error or crash during writing will leave you with a corrupted file, and no way to recover the original.

One way to prevent the problem is to make a backup copy of the file before writing. If the write fails, the backup can be restored. If the program or machine crashes during the write, then restoring the backup can still happen, but it will have to wait until the program is restarted. In detail, the algorithm looks like this:

Atomic File Write: Execution

E1. Rename or copy the original file to a backup file. If there is an error doing this, then delete the backup file, if any, and report failure.

E2. Write the file. If there is an error doing so, then restore the backup and report failure.

Although our desired action has happened, there is one more crucial step:

Atomic File Write: Commit

C1. Delete the backup and report success.

If there is a crash during execution or commit, the program can attempt to
recover the next time it starts:

Atomic File Write: Recovery

R1. If there is no backup file, assume the operation succeeded. (This is why deleting the backup is crucial. It is the signal that the entire operation was successful.)

R2. Otherwise, restore the backup.

R3. Delete the backup.

This brings the program back to the state it was in before the file write was attempted. The program can then try it again, if that is desired. In general, the job of the recovery process is to bring the system back to the state before the crash.

Observe that this is a somewhat subtle algorithm, even for something as simple as writing a single file. Much of the subtlety is in error-handling. The above description doesn't even cover all the possibilities: what if there is an error restoring the backup? Or deleting it?

Moreover, the algorithm is not even correct if step E1 uses copying instead of renaming to create the backup file. Consider what will happen if there is a crash while copying. Upon recovery, that partial backup will overwrite the original file in step R2, and the file will be corrupted. For this reason, it's better to rename the original file in E1 rather than copy it (remember that we assume the filesystem will rename atomically).

Part of the difficulty in thinking about atomicity is that there are five
sources of error:

Errors in the performance of the actions themselves--for instance, writing the file.

Errors in the activities needed to ensure atomicity--for instance, making or restoring the backup.

Crashes during the original sequence of actions. These are harder to deal with, because the recovery process must do detective work, reasoning about which actions succeeded based on the state of the filesystem at the time of the crash.

Errors during the recovery process.

Last but not least, crashes during the recovery process. These are most insidious of all, because recovery might have been partially successful. Thus the subsequent recovery process must take this possibility into account when deducing exactly what work got done, and how to undo it.

Undo!

Reasoning about the correctness of an atomic protocol thus involves considering every possible crash point, both in the original sequence of actions and during recovery, and making sure that the recovery process can always eventually get back to the initial state. This isn't easy to do. If you're not convinced, let's consider another example.

Say a program wishes to atomically create a file F, then delete a file G, then later write a new file, also named G. Backing up a deletion, like backing up a write, involves renaming the file. If a new file is being created, however, no backup file is necessary -- we just have to remember to delete the file if there is an error. The process looks like this, then:

Create and write to F. If this fails, delete F and report failure.

Rename G to a unique backup name, say G.bak. If this fails, then G did not get renamed (rename is atomic). Delete F and report failure.

Create and write to G. If this fails, then delete G, rename G.bak to G, delete F, and report failure.

Delete G.bak and report success.

That's complicated, but it seems right: the successful result is what we want, and a failure anywhere along the way will undo all actions up to that point. But remember that we have to design a recovery algorithm as well, in case of a crash. Let's say that when the program is restarted, it sees files F and G, but not G.bak. What situations can give rise to that state?

The crash occurred after F was created but before G was deleted (i.e. between steps 1 and 2 above); that is, file G is the old file. The state is inconsistent and should be rolled back to the beginning, by deleting F.

The entire process ran to completion successfully. File G is the new file. No recovery should be done.

There is no way for the recovery process to tell which state holds. This is a serious problem.

By now, you should have seen enough to convince you that the problem of
atomic file transactions is a difficult one. Crashes may occur at any point in the file writing and deleting process, and an atomic system needs to be able to revert to the previous state of the system if anything goes wrong at any point.

In part two, I will continue to discuss how to use the package; implement atomicity using an algorithm, and conclude with a tour of the design of my package, with an emphasis on how you can extend it with new filesystem actions. Finally, I offer an informal justification of my package's accuracy.

Jonathan Amsterdam
is Senior Consulting Engineer at DataSynapse,
Inc. Prior to that, he was president and founder of Astrel, Inc., a
Java training and consulting firm.