Musings on Computing and Storage

Posts Tagged file system

Mark Callaghan started a discussion on mysql replication lag today on the MySql@Facebook page. This happens to be one of my favorite topics – thanks to some related work i did at Netapp. I was happy to know that there’s already a bunch of stuff happening in this area for MySQL – and thought it would be good to document some of my experiences and how they correlate to the MySQL efforts.

Somewhere back in 2003/2004 – i started looking at file system replay at Netapp (primarily because it was the coolest/most important problem i could find that didn’t have half a dozen people looking at it already). Netapp maintains an operation log in NVRAM (that is mirrored over to a cluster partner in HA configurations). A filer restart or failover requires a log replay before the file system can be bought online. The log may have transactions dependent on each other (a file delete depends on a prior file create for example) – so the simplest way to replay the log is serial. Because replay happens when the file system cache is cold – most log replay operations typically block waiting for data/metadata to be paged in from disk. No wonder that basic serial log replay is dog slow and this was one of the aspects of failover that we struggled to put a bound on. To make things worse – NVRAM sizes (and hence accumulated operation log on failure) were becoming bigger and bigger (whereas disk random iops were not getting any better) – so this was a problem that was becoming worse with time.

When I started working on this problem – there were some possible ideas floating around – but nothing that we knew would work for sure. Some of the obvious approaches were not good enough in the worst case (parallelize log replay per file system) or were too complicated to fathom (analyze the log and break into independent streams that could be replayed in parallel). The latter particularly because there was no precise list of transactions and their dependent resources – the log format was one hairy ball (a common characteristic of closed source systems). At the beginning – I started with an approach where the failed filer’s memory could be addressed by filer taking over (ie. instead of going to disk – it could use the partner filer’s memory for filling in required blocks). But this was problematic since buffers could be dirty – and those weren’t usable for replay. It was, of course, also problematic since it would have never worked in the simple restart case. At some point playing around with this approach – I started having to maintain a list of buffers required for each log replay (don’t precisely remember why) – and soon enough I had this Eureka moment where the realization dawned that this was all was ever needed to speed up log replay.

The technique is documented in one of the few worthy patents i have ever filed (USPTO link) – but long story short – it works like this:

The filer keeps track of file system blocks required to perform each modifying transaction (in a separate log called the WAFL Catalog)

This log is also written to NVRAM (so it’s available at the time of log replay) and mirrored to partner filers

At replay time – the first thing that happens is that the filer issues a huge boatload of read requests using the Catalog. It doesn’t wait for these requests to complete and doesn’t really care whether the requests succeed or fail – the hope is that most of them succeed and warm up the cache.

Rest of replay happens as before (serially) – but most of the operations being replayed find the required blocks in memory already (thanks to the IO requests issued in the previous step)

This implementation has been shipping on filers since at least 2005. It resulted in pretty ginormous speedups in log replay (4-5x was common). I was especially fond of this work because of the way it materialized (basically hacking around) and the way it worked (a pure optimization that was non-intrusive to the basic architecture of the system) and because of the generality of the principles involved (it was apparent that such form of cataloging and prefetching could be used in any log replay environment). It was a simple and elegant method – and I have been dying ever since to find other places to apply this to!

As a side note – disks are more efficient at random io operations if they have more of them to do at once. Disk head movement can be minimized to serve a chain of random IO requests. (Disk drivers also sort pending IO requests). During measurements with this implementation – i was able to drive disks to around 200 iops (where the common wisdom is no more than 100 iops per disk) during the prefetch phase of replay (thanks to the long list of IOs generated from the catalog)

Roll forward to 2006 at Yahoo and 2008 at Facebook. Both these organizations perennially complained about replication lag in MySQL. Nothing surprising – operation log replay in a database has the same problem as operation log replay in a filer. So i have been dying to find the time (and frankly motivation) to dive into another enormous codebase and try and implement the same techniques. In bantering with colleagues earlier this year at Facebook – it became apparent that the technique described above was much easier to implement for a database. All transactions in a database are declarative and have a nice grammer (unlike the hair ball nvram log format in Netapp). So one could generate a read query (select * where x=’foo’;) for each DML (update y where x=’foo’;) that would essentially pre-fill blocks required by the DML statement. And that should be it.

So I was pleasantly surprised to learn today that this technique is already implemented in MySQL (and that it has a name too!). It’s called the oracle algorithm and is available as part of the maatkit toolset. Wonderful. I can’t wait to see how it performs on Facebook’s workloads.

MySql also has more comprehensive work on parallelizing replay going on. Harrison Fisk posted a couple of links on design docs and code for the same. It’s not surprising that these efforts take the route i had avoided (detecting conflicts in logs and replaying independent operations in parallel) – the declarative nature of the mysql log makes such efforts much more tractable i would imagine.

Kudos to the MySQL community (and Mark for his work at Facebook) – seems like there’s a lot of good stuff happening.

I often find myself asking the question on what would be the most obvious/big things happening now – if we were looking back five years forward from now. After reading the review on the Intel X25 – there’s no doubt that the emergence of flash technology would be one of those big things.

As a computing professional trained for years to think about hard drives – i found the unique architectural constraints of the Flash chip architecture (as presented in the Birell paper for example) refreshing and thought provoking. For starters – while the naiive assumption of most people is that Flash gives very high random read and write performance – it turns out that from a write perspective they are really like disks – only worse. Not only is one much better off writing sequentially for performance reasons, writing randomly also causes reduced life (because blocks containing randomly over-written pages will need to be erased at some point – and flash chips only support limited number of erasures). The other interesting aspect that Gray’s paper reports is that sequential read bandwidth does depend on contiguity – with maximum bandwidth being obtained at read sizes of around 128KB. The new generation of flash drives (including the X25) also seem to be close in implementation to the Birell paper – implemented more like log structured file systems rather than traditional block devices.

All of which implies that these drives solve some of the old problems (random read performance) but create new ones instead. The problems are entirely predictable and well exemplified by this long term test of the X25. Log Structured file systems cause internal fragmentation – small random overwrites causing a single file’s blocks to be spread randomly – causing terrible sequential read performance (and as Gray’s paper shows – one needs contiguity for sequential read performance even for flash drives). The other obvious aspect is that the efficacy of the lazy garbage cleaning approach depends a lot on free space. The more the free space, the more the overwrites that can be combined into a single erasure and the less the number of extra writes per writes (so called Write Amplification Factor). Conversely, handing over an entire flash disk to an OLTP database seems like a recipe for trouble – write amplification will increase greatly over time (if things work at all). It also seems that there are ATA/SCSI commands (UNMAP) in the works so that applications can inform the disks about free space – however this seems like another can of worms. How does a user level application like mysql/innodb invoke this command? (and how can it do so without a corresponding file system api in case it is using a regular file?)

All of which make me believe that at some point the most prominent database engines are going to sit up and write their own space management over flash drives. For example – instead of a global log structured allocation policy – a database maintaining tables clustered by primary key (for example) is much better served by allocation policies where a range of primary keys are kept close together (this would have much better characteristics when a table is being scanned (either in full or in part).

One of the relatively late lessons I have received in operating a Hadoop cluster has been the (almost overwhelming) importance of compression in storage, computation and network transmission.

One of the architectural questions is whether compression belongs to the file-system (and similarly the networking sub-system) or whether it is something that the application layer (map-reduce and higher layers) are responsible for. While native compression is supported by many file systems (for example ZFS) – the arguments in favor of the former are less well made (or at least less commonly made). On the other hand, columnar, dictionary and other forms of compression at the application level (that exploit data schema and distribution) are common parlance and there’s a whole industry of sorts around these.

After some thought though – i have become more and more convinced that functionality such as compression should be moved as much down the stack as possible. The arguments for this (at least in the context of HDFS and Hadoop) are fairly obvious:

FileSystems can manage data through it’s life cycle applying different forms of compression. One could, for example, keep new data uncompressed, recent data compressed via LZO and then finally migrate to Bzip.

Where multiple replicas of data are maintained – some of the replicas can be maintained as compressed transparently – providing redundancy while saving space. Applications can run against uncompressed data whenever possible

The former may be especially appropriate when data is maintained for disaster recovery. Data can be compressed before transmission to remote site and can be kept in compressed format there

The filesystem may also be able to push compression to external hardware (co-processors)

Fairly similar arguments can be made to move wire compression to the networking sub-system. It can dynamically recognize whether a compute cluster is currently CPU or network bound and come up with appropriate compression level for data transmission. It is impossible for application level compression to achieve these things.

This leaves the tricky question of how to get the same levels of compression that an application may be able to achieve (given knowledge of data schema etc.). The challenge here points to a missing abstraction in the traditional split of file and database systems (and other applications). The traditional ‘byte-stream’ abstraction that filesystems (and networking systems) present means that they don’t have any knowledge of data inside the byte stream. However – if a data stream can be tagged with it’s schema – and optionally even a pointer to the right codec for this stream – then the file-system can easily perform optimal compression (while preserving the benefits above).

Traditionally – these kinds of proposals would also have encountered the standard opposition to running user-land (application) code in kernel space. But with user level file systems like HDFS gaining more and more tractions – and with user level processing becoming more common in operating systems as well (witness FUSE) – that argument is fairly moot. One of the opportunities with systems like Hadoop, i think, is that we can fix these historic anomalies. The open nature of this software and the fact that files are used as a record-stream (and not as a database image) – lends itself to the kind of schemes suggested above.