Hi.
We have an app that relies on BerkeleyDB for state recovery. It will be deployed on the Amazon EC2 cloud in which the FS is (not yet) persistent.
First of all, has anyone tried to do this before?
Is it possible to find a way for the checkpoint flushes to flush to S3 instead of the local filesystem?

We (that is, the BDB group as a whole) tested Berkeley DB (C) on AWS last fall. I looked at the archives for that project, and the issues seemed to be around the setup for AWS -- once that was done, the tests ran successfully. We didn't get around to doing that with BDB JE, but from what they saw, I'd think it would also work ok.

Is it possible to find a way for the checkpoint flushes to flush to S3 instead of the local filesystem?

I'll display my ignorance of AWS by saying that when the BDB (C) testing was done, the storage was on S3, and my colleague there remarked:

As far as I can tell, S3 is the normal storage
method for EC2 and I am not sure if I even have the option
of separating the two.

Maybe other folks can chime in? And if you carry on, it would be interesting to hear about how it works for you.

Hi Linda.
Thanks for your reply. I'm not an AWS expert myself, but as I understand it, the EC2 instances are virtual machines, and once they are stopped, or crash, the "local" filesystem is lost. S3 is a storage solution based on RESTful web services, so it is probably quite slow. In order to use S3, you have to plug in to a specific API.
Since BDB uses log files on the FS, those files are volatile unless the FS is pushed to S3, or unless it is possible to configure BDB to push its checkpoint flushes to S3. At least this is my (limited) understanding of the situation, hence my previous question.
Fabien

There was a project at Amazon to provide a persistent file system, but there was no news about it in the latest newsletter.
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=21082&start=0&tstart=0

PersistentFS (http://www.persistentfs.com/) implemented a solution. I wonder if BerkeleyDB should be used with PersistentFS.

We asked around within Oracle and outside, and it does indeed seem that the persistent local storage (Elastic Block Storage) announced in April is the best fit. Matt referred to it in the message above, and the announcement is also here: http://www.allthingsdistributed.com/2008/04/persistent_storage_for_amazon.html.

One person pointed out this paragraph from the announcement:

"The consistency of data written to this device is similar to that of other local and network-attached devices; it is under control of the developer when and how to force flush data to disk if you want to bypass the traditional lazy-writer functionality in the operating systems file-cache. Because of the session oriented model for access to the volume you do not need to worry about eventual consistency issues."

On the face of it, it sounds like it will do the trick -- that it will make the storage look like a normal file system, and also provide the kind of file system semantics that JE relies on to guarantee transactional durability. I'm sure it would need testing though, and we'd want to check if the caveats we place on using NFS for JE storage apply here too or not. (Those NFS caveats and their motivations are described in the JE FAQ at http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html. They are based on whether fsync and fwrite can have hidden failures and whether file locking is supported.)

In addition to PersistentFS, which Matt mentioned, there's FuseOverAmazon (http://code.google.com/p/s3fs/wiki/FuseOverAmazon), which purports to make S3 look like a local file system. I didn't find anyone who had actually used either of them, so the same questions apply: do fsync and write provide the same guarantees that you get on a local file system, and does file locking work?

PersistentFS has some notes on MySQL which say that fsync has the semantics we need (and presumably then write does too). See http://www.persistentfs.com/documentation/AppNotes/MySQL.

We'd be glad to work with anyone who is trying the new Amazon persistent local storage or the other S3 based solutions.

Thank you for offering to help. Do you have some scripts/test programs that we could run, once we have installed PersistentFS and FuseOverAmazon, to check whether BDB behaves correctly on these systems?

EC2 instance storage is not quite as transient as many fear. While Amazon makes no guarantees about instance persistence, and wants you to engineer your apps as if you could lose an instance at any moment, crashes and reboots typically do not cause the loss of the local volume's data -- it's there when the instance reboots. So perhaps the same sorts of interval backup strategies that work elsewhere are good enough.

Also, I would think you could just keep uploading log-file segments to S3 as they close -- and thus never lose more than 10MB of the latest DB changes. (And that threshold could be decreased by uploading partials or decreasing the configured log-file size.)
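To make the idea concrete, here is a minimal sketch of the "upload segments as they close" loop. It assumes JE's usual log-file naming (fixed-width hex names ending in .jdb) and treats the highest-numbered file as the one still being written. The Uploader interface is a hypothetical hook, not part of JE or of any S3 library; you would plug your own S3 client in there. It also ignores the log cleaner deleting files and the race around a file flip, so it is a starting point only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class ClosedLogSegments {

    /** Hypothetical hook: wrap whatever S3 client you actually use. */
    public interface Uploader {
        void upload(Path file) throws IOException;
    }

    /**
     * Returns every *.jdb file in envHome except the highest-numbered one,
     * which is assumed to be the file JE is still appending to.
     */
    public static List<Path> closedSegments(Path envHome) throws IOException {
        List<Path> logs = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(envHome, "*.jdb")) {
            for (Path p : ds) {
                logs.add(p);
            }
        }
        // Names are fixed-width hex, so sorting by name is sorting by file number.
        logs.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));
        if (!logs.isEmpty()) {
            logs.remove(logs.size() - 1); // drop the active (newest) file
        }
        return logs;
    }

    /**
     * Uploads closed segments whose names are not yet in alreadySent,
     * records them there, and returns how many were uploaded.
     */
    public static int uploadClosed(Path envHome, Set<String> alreadySent, Uploader up)
            throws IOException {
        int count = 0;
        for (Path p : closedSegments(envHome)) {
            String name = p.getFileName().toString();
            if (!alreadySent.contains(name)) {
                up.upload(p);
                alreadySent.add(name);
                count++;
            }
        }
        return count;
    }
}
```

Calling uploadClosed periodically from a background thread would bound the loss to roughly one log file. The log file size itself is governed by the je.log.fileMax parameter, so lowering it shrinks that window. JE also ships a DbBackup helper in com.sleepycat.je.util that is meant for picking a consistent set of files to copy; it is probably the safer route than listing the directory by hand.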

Also, I would think you could just keep uploading
log-file segments to S3 as they close -- and thus
never lose more than 10MB of the latest DB changes.
(And that threshold could be decreased by uploading
partials or decreasing the configured log-file size.)

Earlier, we answered the question about whether S3 could be used as the storage for an Amazon EC2/BDB application. Then Matt asked about verification scripts or tools, which made us discuss it internally a bit more. The complete answer should really also ask whether S3 is appropriate for JE storage.

Even if S3 with a file system layer over it can provide the functional semantics JE needs, it may not really be the right storage from the performance point of view. Gordon's suggestion of using S3 as a backup destination seems like it would get the persistence you want, but still have local storage speed. I'll make the caveat that we have no direct experience in our group with deploying on AWS.

In terms of how to verify the filesystem-on-S3 options, there are three key functional areas that have to be checked (file locking, fsync, and write), but only the least important is easy to check programmatically. File locking can be checked by mimicking the logic in com.sleepycat.je.log.FileManager:lockEnvironment(). We don't have a test or script that does it, but it's pretty clear what that method is trying to do. I don't know how we'd know that fsync and write did what we expect, except for getting information about how those are implemented in the filesystem solutions. Perhaps the most important and first thing to look at, though, is to see what kind of I/O load your JE application generates when running in a non-AWS setup, and then project whether your AWS storage choice can support that load.
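For the file-locking part, a rough probe along these lines could be a starting point. This is a sketch in the spirit of what lockEnvironment() does, not a copy of the JE code: it takes an exclusive java.nio lock on one byte of a lock file and reports whether it got it. The class name LockProbe and the file name probe.lck are made up for illustration. Within one JVM a conflicting lock shows up as OverlappingFileLockException; across processes tryLock returns null. The meaningful test is therefore to run it from two processes (or two instances) against the same mounted directory.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LockProbe implements AutoCloseable {
    private final RandomAccessFile file;
    private final FileLock lock; // null when the lock could not be taken

    public LockProbe(Path dir) throws IOException {
        // probe.lck is a made-up name; JE uses its own lock file.
        file = new RandomAccessFile(dir.resolve("probe.lck").toFile(), "rw");
        FileLock l;
        try {
            // Exclusive lock on the first byte; returns null if another
            // process already holds an overlapping lock.
            l = file.getChannel().tryLock(0, 1, false);
        } catch (OverlappingFileLockException e) {
            // Another channel in this same JVM already holds the lock.
            l = null;
        } catch (IOException e) {
            file.close(); // e.g. the file system does not support locking
            throw e;
        }
        lock = l;
    }

    /** True if this probe currently holds the exclusive lock. */
    public boolean held() {
        return lock != null && lock.isValid();
    }

    @Override
    public void close() throws IOException {
        if (lock != null) {
            lock.release();
        }
        file.close();
    }

    public static void main(String[] args) throws IOException {
        try (LockProbe probe = new LockProbe(Paths.get(args[0]))) {
            System.out.println(probe.held() ? "lock acquired" : "lock NOT acquired");
            System.out.println("press Enter to release");
            System.in.read(); // keep holding while a second process probes
        }
    }
}
```

Start it once against the S3-backed directory and leave it waiting, then start it a second time from another shell or machine. If the second run also prints "lock acquired", the file system is not honoring locks and JE's protection against two writers on one environment cannot be relied on there.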

S3 is a storage solution based on RESTful web
services, so it is probably quite slow.

Exactly. If that file system is accessed over the Internet, and then via HTTP, I fail to see how it would be even remotely (no pun intended) appropriate for a database. (Except for database backup, of course.)

But I may also fail to see some miraculous progress in this so-called cloud computing area that makes it theoretically feasible to store your data over the network.

But if so, what about Sun's NFS and Microsoft's SMB? Can you store your data via NFS or SMB? I used to think this was impossible due to restrictions of these network filesystems (which certainly outperform Amazon's HTTP "filesystem" by several orders of magnitude).

Thanks, Charles. Amazing it can be made to work, with
a couple of restrictions. Still, if you have local
storage available, why use network storage?

Michael Ludwig

Many of the contemporary web-server machines do not have a lot of local disk space and actually can't be expanded to have much. Consider the Sun T1000. It comes with 80GB of SATA storage and no real room for expansion. Sun expects this to be a middle tier web server with the back end work being done by some heavier iron. For a JE instance running under the app server on such a machine, it might make sense to access the storage via NFS/SMB.

Many of the contemporary web-server machines do not
have a lot of local disk space and actually can't be
expanded to have much. [...] For a JE instance running
under the app server on such a machine, it might make
sense to access the storage via NFS/SMB.

Charles Lamb

I see. Thanks for pointing this out. Must confess I do not know much about machinery. If they get their network almost as fast and reliable as a local disk, why not?