BookKeeper for write-ahead logging

Description

BookKeeper, a contrib of the ZooKeeper project, is a fault-tolerant, high-throughput write-ahead logging service. This issue provides an implementation of write-ahead logging for HBase using BookKeeper. Apart from the expected throughput improvements, BookKeeper also has stronger durability guarantees than the implementation currently used by HBase.

Flavio Junqueira
added a comment - 12/Mar/10 16:27 I'm attaching a preliminary patch. I haven't tested extensively, but it works in my small setup with one computer running hbase+dfs, three bookies, and one zookeeper server.


Flavio Junqueira
added a comment - 12/Mar/10 16:53 To compile hbase with this patch, it is necessary to add the bookkeeper jar to the lib folder. I've attached the jar to this jira for convenience, but you can also get it and compile from zookeeper/trunk.
This patch requires a few new configuration properties on hbase-site.xml:
<property>
  <name>hbase.wal.bk.zkserver</name>
  <value>xxx.yyy.zzz:10281</value>
</property>
<property>
  <name>hbase.regionserver.hlog.reader.impl</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogReader</value>
</property>
<property>
  <name>hbase.regionserver.hlog.writer.impl</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogWriter</value>
</property>
The zookeeper server (or servers) used with bookkeeper can be the same as the one used with hbase, but the value of the property has to be set regardless of which server you're using. If this is not ok, we may consider changing it in the next iteration, assuming there will be one.
It is also important to create the following znodes in the zookeeper instance pointed to by hbase.wal.bk.zkserver:
/ledgers
/ledgers/available/
/ledgers/available/<bookie-addr>
where "/ledgers/available/<bookie-addr>" is a node representing a bookie and the format of <bookie-addr> should be "xxx.yyy.zzz:port". Consequently, there must be one such znode for each available bookie. We have a jira open to make bookie registration automatic, and it should be available soon.
I have been creating these znodes manually for my tests, but I acknowledge that a script to bootstrap the process would be quite handy.
Finally, I would be happy to get some feedback and suggestions for improvement.
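Since these znodes currently have to be created by hand, a small helper that enumerates them could stand in for the bootstrap script mentioned above. This is only a sketch: the function name is hypothetical, and the actual creation would go through any ZooKeeper client (e.g. zkCli.sh against hbase.wal.bk.zkserver).

```python
def bookkeeper_bootstrap_znodes(bookies):
    """Return the znodes to create, parents first, for a list of
    bookie addresses, each formatted as "host:port"."""
    znodes = ["/ledgers", "/ledgers/available"]
    znodes += ["/ledgers/available/%s" % b for b in bookies]
    return znodes

# Example: three bookies, as in the setup described in the comment.
paths = bookkeeper_bootstrap_znodes(
    ["10.0.0.1:3181", "10.0.0.2:3181", "10.0.0.3:3181"])
# Each path would then be created in order with, e.g.:
#   zkCli.sh -server <hbase.wal.bk.zkserver> create <path> ''
```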


stack
added a comment - 12/Mar/10 18:06 Thank you for this work, Flavio. Before going to the patch: when BookKeeper first showed up, a few of us had the notion of writing the HBase WAL to a BK ensemble, but we never did any work on it because we figured it just wouldn't be able to keep up.
For example, posit a smallish hbase cluster of 3-5 nodes, each taking on writes of 10 bytes-1MB of edits at rates of 100-5k writes a second. Would BK be able to keep up? What if it was a 10-20 node cluster? What do you think?


Flavio Junqueira
added a comment - 12/Mar/10 19:18 Those are good questions, Stack. BookKeeper scales throughput with the number of servers, so adding more bookies should increase your current throughput if your traffic is not saturating the network already. Consequently, if you don't have a constraint on the number of bookies you'd like to use, your limitation would be the amount of bandwidth you have available.
Just to give you some numbers, we have so far been able to saturate the network when writing around 1k bytes per entry, and the number of writes/s for 1k writes is of the order of tens of thousands for 3-5 bookies. Now, if I pick the largest numbers in your small example to consider the worst case (5 nodes, 1MB writes, 5k writes/s), then we would need a 40Gbit/s network, so I'm not sure you can do it with any distributed system unless you write locally in all nodes, in which case you can't guarantee the data will be available upon a node crash. Let me know if I'm misinterpreting your comment and you have something else in mind.
I also have to mention that we added a feature to enable thousands of concurrent ledgers with minimal performance penalty on the writes, so I don't see any trouble in increasing the number of concurrent nodes as long as the BookKeeper cluster is provisioned accordingly. Of course, it would be great to measure it with hbase, though.
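The 40 Gbit/s figure above can be checked with a quick back-of-the-envelope calculation. This is only a sketch; the replication parameter is an addition for illustration (the quoted figure is before replication).

```python
def required_gbits_per_sec(writes_per_sec, write_bytes, replication=1):
    """Back-of-the-envelope network demand for a write-ahead log stream,
    in Gbit/s: writes/s * bytes/write * 8 bits, times the number of
    replicas each write is sent to."""
    return writes_per_sec * write_bytes * 8 * replication / 1e9

# Worst case from the thread: 5k writes/s of 1 MB each.
demand = required_gbits_per_sec(5_000, 1_000_000)   # 40.0 Gbit/s
# With 2 replicas per write, the demand doubles to 80 Gbit/s.
```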


Benjamin Reed
added a comment - 12/Mar/10 19:18 one thing that would be great for BookKeeper is to try to characterize your load and then run some saturation tests to see how the performance looks. previously we varied message sizes from 128 bytes to 8K. should we try something like 100 bytes, 1000 bytes, 10K, 100K, 1M?
we should also vary the number of clients writing. what is a good number? 10-1000 or 10-10000? is the WAL per server or per region?
do you have WAL throughput numbers for the throughput you are getting from HDFS?
just to give you an idea, we are in the 10s of thousands of messages/sec for 1K messages. (the performance increases as you add more servers, aka bookies.) for larger messages it really is the network bandwidth that becomes the bottleneck.
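The sweep Ben suggests could be organized as a simple parameter grid. A sketch: the sizes are from his list and the client counts from his question; the function name is hypothetical.

```python
from itertools import product

ENTRY_SIZES = [100, 1_000, 10_000, 100_000, 1_000_000]   # bytes per entry
CLIENT_COUNTS = [10, 100, 1_000]                          # concurrent writers

def saturation_runs(sizes=ENTRY_SIZES, clients=CLIENT_COUNTS):
    """Return every (entry_size, num_clients) combination to benchmark.
    Each pair would be one saturation run against the bookie ensemble."""
    return list(product(sizes, clients))
```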


stack
added a comment - 12/Mar/10 19:36 Thanks for the interesting comments.
WAL is per regionserver, not per region.
On WAL throughput, since the 'append' to the WAL is what takes the bulk of the time making an hbase update, depending on hardware and how often we flush, we'll see variously 1-10k writes a second. We will see hdfs holding up writes on occasion when struggling, with latency going way up.
I'm up for more exploration if you fellas are. Let me take a look at this patch of yours Flavio.
Thanks for happening by lads.


ryan rawson
added a comment - 12/Mar/10 19:38 A few more questions...
What's the persistence story? How many nodes is the log data stored on?
On performance, how about 100-200k ops/sec with data sizes of about 150 bytes or so? This would be generated in aggregate across 20 nodes.


stack
added a comment - 12/Mar/10 19:54 One other thing, can you speak to the following?
HBase gets dissed on occasion because we are 'complicated', made up of different systems – hdfs, zk, etc. – so adding a new system (bk to do wals) would make this situation even worse?
In our integration of zk, we did our best to hide it from user. We manage start up and shutdown of ensemble for those who don't want to be bothered with managing their own zk ensemble. Would such a thing be possible with a bk cluster?

Andrew Purtell
added a comment - 12/Mar/10 20:10 On performance, how about 100-200k ops/sec with data sizes about 150 bytes or so? This would be aggregately generated on 20 nodes.
Extend that to 100. What does this look like? Can BK cope?

Todd Lipcon
added a comment - 12/Mar/10 20:14 Hey Flavio,
Would you mind pointing us to (or uploading) a design doc / architectural overview for BK? I think that would be very helpful for us to understand the scaling characteristics better.
-Todd

Mahadev konar
added a comment - 12/Mar/10 22:33 todd,
here is some documentation that might help you understand the architecture of BookKeeper:
http://hadoop.apache.org/zookeeper/docs/r3.2.2/
this documentation has been improved in 3.3 (not branched yet), so you can checkout
http://svn.apache.org/repos/asf/hadoop/zookeeper/trunk/docs/bookkeeperOverview.html
and read that as well.
hope this helps.


Flavio Junqueira
added a comment - 12/Mar/10 22:36 Thanks for all comments and questions so far. Let me try to answer all in a batch:
Stack: We are certainly interested in exploring, and learning your requirements would certainly help to guide our exploration. On making BK transparent to the user installing BK, I don't have any reason to think that it would be difficult to achieve. The script I use to start a bookie is very simple and we'll soon have the feature I was mentioning above of a bookie registering automatically with ZooKeeper. In any case, I made sure that it is possible to use BK optionally, so hdfs is still used by default with the current patch.
Ryan, Andrew: The number of nodes going from 20 to 100 is not a problem; we have tested BK with a few thousand concurrent ledgers. One important aspect to pay attention to is the length of writes. In our experience, BK is more efficient when we batch small requests. If it is possible to batch small requests, then I don't see an issue with obtaining the throughput figures you're asking for. In fact, with a previous version of BK, we were able to get 70k writes/s for writes of 128 bytes.
Todd: There is some documentation available on the ZooKeeper site, and we have added more for the next release (3.3.0). You can get the new documentation from trunk, but I'll also add a pdf here for your convenience.
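The batching idea can be sketched as follows: accumulate small WAL edits until a batch reaches roughly 1 KB before handing it to the log service. This is a minimal illustration of the principle, not BK's internal batching logic.

```python
def batch_edits(edits, target_bytes=1024):
    """Group small edits (byte strings) into batches of at most
    target_bytes each; an edit larger than target_bytes becomes its
    own batch. Returns the concatenated batch payloads."""
    batches, current, size = [], [], 0
    for edit in edits:
        if current and size + len(edit) > target_bytes:
            batches.append(b"".join(current))
            current, size = [], 0
        current.append(edit)
        size += len(edit)
    if current:
        batches.append(b"".join(current))
    return batches

# Twenty 100-byte edits collapse into two ~1 KB appends instead of twenty.
batches = batch_edits([b"x" * 100] * 20)
```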


Andrew Purtell
added a comment - 12/Mar/10 22:51 @Flavio: Thanks for your kind reply.
In fact, with a previous version of BK, we were able to get 70k writes/s for writes of 128 bytes
Dedicated hardware, I presume. How many BK nodes would be needed alongside HBase+Hadoop nodes to achieve 1M ops/sec over 100 concurrent ledgers? HBase WAL has group commit but that can only batch so much... depends on the order of edits coming from userland. So what amount of batching? (Average transaction size to BK for example.)


Benjamin Reed
added a comment - 12/Mar/10 23:43 to get performance you need a dedicated drive, not necessarily dedicated hardware. bk takes care of the batching for you. there is opportunistic batching that happens in the network channel and to the disk. (the latter being the most important.)
i don't think ryan's persistence question has been answered. when bk responds successfully to a write request, it means that the data has been synced to the requisite number of devices. when testing we usually use 3 or 5 bookies with a replication factor of 2 or 3. (bookkeeper does striping.)
we have an asynchronous write interface to allow high throughput by pipelining requests, we invoke completions as each write completes successfully.
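The asynchronous interface Ben describes can be modeled in miniature: callers enqueue entries without blocking, and a completion callback fires for each entry once it is durable. This is a toy model with hypothetical names; durability is simulated by an explicit flush() rather than a real sync.

```python
class PipelinedLogWriter:
    """Toy model of an asynchronous, pipelined append interface."""

    def __init__(self):
        self._pending = []   # (entry_id, data, callback)
        self._next_id = 0

    def async_add_entry(self, data, callback):
        """Enqueue an entry without blocking; returns its id."""
        entry_id = self._next_id
        self._next_id += 1
        self._pending.append((entry_id, data, callback))
        return entry_id

    def flush(self):
        """Simulate the writes becoming durable: invoke each completion
        in submission order with the entry id and its length."""
        for entry_id, data, callback in self._pending:
            callback(entry_id, len(data))
        self._pending.clear()

# Usage: submissions return immediately; completions arrive later, in order.
done = []
w = PipelinedLogWriter()
w.async_add_entry(b"edit-1", lambda i, n: done.append(i))
w.async_add_entry(b"edit-2", lambda i, n: done.append(i))
w.flush()
```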


Flavio Junqueira
added a comment - 13/Mar/10 13:49 Ryan: I don't have much to add to what Ben said in his comment. I just wanted to mention that in the current patch, I have added a configuration property to set the number of replicas for each write:
hbase.wal.bk.quorumsize
the default is 2.
Andrew: As we reduce the number of bytes in each write, the overhead per byte increases, so batching writes and appending writes of the order of 1kbytes would give us a more efficient use of the BK client. Achieving 1M ops/s over 100 nodes (or larger values if you will) depends on the length of writes, the replication factor, and the amount of bandwidth (both I/O and network) you have available. In our observations, it is not a problem for a client to produce more than 10k appends/s of 1k-4kbytes, so in your example, it is just a matter of provisioning your system appropriately with respect to disk drives and network.
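For reference, the new property would presumably sit alongside the others in hbase-site.xml, with the stated default of 2:

```xml
<property>
  <name>hbase.wal.bk.quorumsize</name>
  <value>2</value>
</property>
```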


Todd Lipcon
added a comment - 17/Mar/10 18:58 Thanks for attaching the design doc, Flavio. Here are a few questions generally about BK (would be great to address these in the docs too, not just this JIRA):
Logs are stored replicated on several bookies. If a bookie goes down or loses a disk, the replication count obviously decreases. Will BK re-replicate the "under-replicated" parts of the log files? Or do we assume that logs are short-lived enough that we'll never lose all the replicas during the span of time for which the log is necessary?
Similar to the above, is there any notion of rack awareness? eg does it ensure that each part of the log is replicated at least on two racks so that a full rack failure doesn't lose the logs?
Basically on a broad level what I'm asking is this: HDFS takes a lot of care to ensure availability of data "forever" across a variety of failure scenarios. From the design doc it wasn't clear that BK provides the same levels of safety. Can you please clarify what features (if any) are missing from BK that are present in HDFS with regard to reliability, etc?


Flavio Junqueira
added a comment - 18/Mar/10 15:46 Hi Todd, Great points:
We plan to have a mechanism to re-replicate the ledger fragments of a bookie; some of our applications require it. There is a wiki page I had created with some initial thoughts, but I need to update it: http://wiki.apache.org/hadoop/BookieRecoveryPage. I also thought there was a jira open about it already, but I couldn't find it, so I created a new one: ZOOKEEPER-712.
We currently don't have a notion of rack awareness, but it is a good idea and not difficult to add. We just need to have a bookie add that information to its corresponding znode and use it during the selection of bookies upon creating a ledger. I have also opened a jira to add it: ZOOKEEPER-711.
On the documentation, please feel free to open a jira and mention all points that you believe could be improved.


Flavio Junqueira
added a comment - 17/May/10 09:45 Just a couple of quick points about the developments of this jira:
We have introduced deletion and garbage collection of ledgers to BookKeeper (ZOOKEEPER-464);
Our next goal as far as features for bookkeeper goes is ZOOKEEPER-712 (see my previous post);
I have started benchmarking hbase+bookkeeper with the Yahoo! benchmark, but I stumbled upon a problem. When I wrote this preliminary patch, I didn't have a nice way of creating a unique znode name and keeping it for fetching the last log of a region server, given the current interface of WAL. My understanding of the code is that the hdfs file where the log is stored is passed to the reader/writer. I'm considering the current hdfs path as the znode path (or at least a part of it), but I would accept suggestions if anyone is willing to give me one.

stack
added a comment - 19/May/10 05:50 bq. I have started benchmarking hbase+bookkeeper with the Yahoo! benchmark, but I stumbled upon a problem. When I wrote this preliminary patch, I didn't have a nice way of creating a unique znode name and keeping it for fetching the last log of a region server, given the current interface of WAL. My understanding of the code is that the hdfs file where the log is stored is passed to the reader/writer. I'm considering the current hdfs path as the znode path (or at least a part of it), but I would accept suggestions if anyone is willing to give me one.
The HDFS path should map to a znode path? That'd work. The WAL HDFS path has the regionserver name in it IIRC. You probably don't need the full thing. Just from the regionserver name on down (should be shallow).
Let us know if you want us to change stuff in here around WAL. You might want to wait a day or so till hbase-2437 goes in (refactoring of hlog). Stuff should be easier to grok after that patch has had its way.
Is HBASE-2527 of any use to you? It just went into trunk.
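Stack's suggestion (map the WAL's HDFS path to a znode path, keeping only the portion from the regionserver name down) could be sketched as below. The chroot, the ".logs" directory layout, and the example server name are assumptions for illustration, not HBase's actual constants.

```python
def wal_znode_for(hdfs_path, chroot="/hbase-wal"):
    """Map an HDFS WAL path to a znode path, dropping everything above
    the regionserver directory (assumed to sit under a ".logs" dir)."""
    parts = hdfs_path.strip("/").split("/")
    if ".logs" in parts:
        # keep regionserver name and log name only
        parts = parts[parts.index(".logs") + 1:]
    return chroot + "/" + "/".join(parts)

# Example with a hypothetical WAL path:
znode = wal_znode_for("/hbase/.logs/rs1.example.com,60020,1276000000000/hlog.1276000000123")
```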


Benjamin Reed
added a comment - 13/Aug/10 22:31 we looked into the problem of figuring out the path to use for the WAL and found the following: it appears that the assumption that the WAL is stored in HDFS is embedded in HBase. when looking up a WAL, for example, the FileSystem object is used to check existence. Deletion of logs also happens outside of the WAL interfaces. to be truly pluggable a WAL interface should be used to enumerate and delete logs. have you guys thought about doing this?
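The pluggable interface Ben asks about might look roughly like this: log enumeration, existence checks, and deletion go through a WAL abstraction rather than a FileSystem object. The names are illustrative, not HBase's actual API; a trivial in-memory implementation stands in for the HDFS- or BK-backed ones.

```python
from abc import ABC, abstractmethod

class WALStore(ABC):
    """Minimal pluggable WAL interface: no FileSystem object leaks out."""
    @abstractmethod
    def list_logs(self, regionserver): ...
    @abstractmethod
    def exists(self, log_name): ...
    @abstractmethod
    def delete_log(self, log_name): ...

class InMemoryWALStore(WALStore):
    """Toy backend; HDFS and BookKeeper backends would implement the
    same three methods against files and ledgers respectively."""
    def __init__(self):
        self._logs = {}   # log name -> owning regionserver

    def add_log(self, regionserver, log_name):
        self._logs[log_name] = regionserver

    def list_logs(self, regionserver):
        return sorted(n for n, rs in self._logs.items() if rs == regionserver)

    def exists(self, log_name):
        return log_name in self._logs

    def delete_log(self, log_name):
        self._logs.pop(log_name, None)
```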



Flavio Junqueira
added a comment - 22/Jun/12 08:11 Apologies for not being very active here. Yesterday we had a chat with Stack about moving forward with this jira. Here are a few key points I got from the discussion:

To use the reader and writer interfaces, we need to implement a BookKeeper filesystem, because the init method expects such an object.

The RegionServer maintains a list of existing logs, and we need to map them to ledgers. The easiest way is probably to maintain this map in ZooKeeper and manage it through the FileSystem implementation.

Currently there is a single FileSystem object in the RegionServer, and we would need at least two, one of them used for BookKeeper.

Stack suggested that it would be a great idea to have logs per region with BookKeeper. BookKeeper is designed exactly to serve use cases with multiple concurrent logs being written to. According to Stack, this feature has the potential to reduce recovery time significantly. Enabling it might require some more changes to the WAL interface, though, since we would need to separate the edits according to region. (Related to HBASE-4529?)

Let us know if anyone has a comment; otherwise we will push it forward.
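The log-name-to-ledger map mentioned in the second point could look roughly like the sketch below. The ZooKeeper-backed store is stubbed with an in-memory map so the shape can be exercised; every name here is illustrative, not an actual HBase or BookKeeper API:

```java
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hedged sketch of a log-name -> ledger-id registry. A real implementation
// would persist each entry in ZooKeeper (e.g. one znode per log name whose
// data is the ledger id); a TreeMap stands in for that store here.
class LedgerRegistry {
    private final TreeMap<String, Long> byLogName = new TreeMap<>();

    void register(String logName, long ledgerId) { byLogName.put(logName, ledgerId); }
    Optional<Long> ledgerFor(String logName) { return Optional.ofNullable(byLogName.get(logName)); }
    void unregister(String logName) { byLogName.remove(logName); }
    Set<String> logNames() { return byLogName.keySet(); }
}
```

The RegionServer's notion of "existing logs" would then be the key set of this registry rather than a directory listing.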

Ted Yu
added a comment - 22/Jun/12 09:23 Nice summary, Flavio.
"to have logs per region with BookKeeper"
HBASE-4529 has not been active.
HBASE-5699 is the one receiving attention recently.
Does this JIRA depend on HBASE-5699?

Flavio Junqueira
added a comment - 22/Jun/12 09:34 Hi Ted, I don't fully understand what HBASE-5699 is proposing. It mixes parallel writes and HDFS. From my perspective, we are able to perform efficient parallel writes out of the box, and we can benefit from an interface that exposes enough information for us to separate the edits into different logs (e.g., one per region). If the outcome of HBASE-5699 is a set of changes to the interface that allow this to happen, then we could benefit from it. For a first cut of this issue, I don't think we need parallel writes, though, so I would say it is not dependent, but I'd be happy to hear a different opinion.

Jonathan Hsieh
added a comment - 24/Jun/12 16:50 @flavio
Li's been inactive for a while, so I'd suggest taking HBASE-5937 and making the WAL an interface. This would allow for experimentation here and with other alternate-WAL-implementation JIRAs.

HBASE-6116 is about potentially improving the latency when writing to a single HDFS file. Normally writes to HDFS get pipelined to two other replicas on two different data nodes, and this incurs some overhead. The intuition is that for small writes the overhead may cause more latency than the data transfer itself, and the hope is that by having the client send in parallel/broadcast we'd reduce the overhead latency.

HBASE-5699 is about improving throughput and latency by writing to multiple HDFS files. This attacks the scenario where a data node is blocked (gc/swapping/etc.), causing all other writes to be blocked. Currently a region server serves many regions, but all of a region server's writes go to a single WAL file, with writes to different regions intermingled.

Let's say these writes normally take 10ms, but in a gc case one ends up taking 1000ms. In one scenario, with a WAL per region, writes to another region could go to a different HDFS file and not get blocked.

In another, we'd have two WAL files for the region server, and we could detect that one write was blocking (say, after 20ms) and then retry against another set of data nodes. Ideally we'd improve our worst case from 1000ms to 30-40ms.
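The detect-and-retry idea in the last paragraph can be sketched as follows. This is a toy illustration of the timing argument, not HBase or HDFS code; the helper name and the simulated delays are invented:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: try the primary WAL, and if the write hasn't acked
// within a deadline, retry against a second WAL backed by a different
// pipeline. Worst case becomes roughly deadline + retry latency.
public class DualWalWrite {
    static String writeWithFallback(ExecutorService pool,
                                    Callable<String> primary,
                                    Callable<String> fallback,
                                    long timeoutMs) throws Exception {
        Future<String> f = pool.submit(primary);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException slow) {
            f.cancel(true);                       // give up on the stalled pipeline
            return pool.submit(fallback).get();   // retry on the other WAL
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Primary simulates a stalled pipeline (1000ms); deadline is 20ms.
        String acked = writeWithFallback(pool,
            () -> { Thread.sleep(1000); return "primary"; },
            () -> "fallback",
            20);
        System.out.println(acked);  // the fallback WAL acks instead
        pool.shutdownNow();
    }
}
```

The sketch leaves out the hard part, of course: deduplicating the edit if the slow write eventually succeeds on the first pipeline.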

Flavio Junqueira
added a comment - 25/Jun/12 11:53 Hi Jonathan, a BK backend for logging can potentially deal with the issues you're raising. The main issue is that it doesn't expose a file system interface, and HBase currently seems to rely on one. If we have an interface that more naturally matches the BK API, then we will possibly be able to deal with the issues you're raising without much effort.
I was actually trying to implement a BookKeeperFS class that extends FileSystem, but it doesn't feel natural. I was wondering if we should try to work on the interface first instead of hacking it in by implementing an HLog.Reader and an HLog.Writer.

Jesse Yates
added a comment - 25/Jun/12 17:52 Personally, I'd prefer to see a logical higher-level WAL interface and then have an FsWAL subclass that might have some more specifics to make an easier match to an underlying FS implementation (Hadoop or otherwise). Maybe a bit more pain upfront, but it will likely pay off in the end.

Jonathan Hsieh
added a comment - 25/Jun/12 18:50 +1 to what Jesse said. The HLog class is relatively well encapsulated. It already has its own Reader and Writer interfaces (I'd probably ignore the FS argument or somehow hide BK-specific stuff in a constructor) and a Metric helper class. Excluding its constructors, HLog only has a few HDFS-specific public methods (hsync, hflush, sync) and some public but static methods.
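The shape described here, reader/writer interfaces with backend-specific setup hidden in a constructor rather than an init(FileSystem, Path) call, might look like the sketch below. The names are illustrative, not the real HLog.Reader/HLog.Writer signatures:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Reader/Writer split: backend setup (opening an HDFS file or
// a BK ledger) happens in the implementing class's constructor, so these
// interfaces carry no FileSystem or Path.
interface LogWriter {
    void append(byte[] edit);
    void sync();          // durability point (hflush/hsync in HDFS terms)
    void close();
}

interface LogReader {
    byte[] next();        // null when the log is exhausted
    void close();
}

// In-memory pair showing the shape only.
class InMemoryLog {
    private final List<byte[]> entries = new ArrayList<>();

    LogWriter writer() {
        return new LogWriter() {
            public void append(byte[] edit) { entries.add(edit); }
            public void sync() { /* no-op in memory */ }
            public void close() { }
        };
    }

    LogReader reader() {
        return new LogReader() {
            private int pos = 0;
            public byte[] next() { return pos < entries.size() ? entries.get(pos++) : null; }
            public void close() { }
        };
    }
}
```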

Flavio Junqueira
added a comment - 25/Jun/12 21:38 If I understand Jesse's proposal, the idea is to have a WAL interface and essentially bring all HDFS-dependent WAL code under FsWAL? We would then implement an equivalent BkWAL for a BookKeeper backend?
I didn't understand the point about the HLog class being relatively well encapsulated, Jonathan. Do you mean to say that it would be relatively easy to implement FsWAL by using HLog?

Jonathan Hsieh
added a comment - 25/Jun/12 21:57 @Flavio - on the encapsulation comment: basically, if you look at the interface there is very little that is HDFS-specific about it. The append, open, close, and roll operations all take HBase constructs and could have any implementation behind them. The HDFS specifics have been contained in its constructor and the init functions of the reader/writer classes.

Flavio Junqueira
added a comment - 25/Jun/12 22:14 @Jonathan My issue has been exactly with the initialization of the log reader and writer. They take a FileSystem and a Path as input parameters. As I mentioned above, I've been looking at implementing a BK FileSystem, and it looks a bit hacky.
If I understand your suggestion, you're saying that we could reuse most of HLog and focus on generalizing its constructors and the init methods. Is that right?

Jonathan Hsieh
added a comment - 25/Jun/12 23:15 @Flavio
I'm suggesting you use the Factory pattern to create the HLog, with a setting in the config file that would have it instantiate a new BKHLog (which doesn't even use FS or Path) and return its own instances of BKReader/BKWriter.
I only see a handful of non-test places where HLogs are constructed so this part shouldn't be too hard to make factory pluggable:
https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1267
https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L3719
https://github.com/apache/hbase/blob/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/HMerge.java#L161
(I think there are a few more).
My guess is that it will take a little bit of work but won't be too onerous to do.
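The factory idea could be sketched as follows. The configuration key and the use of java.util.Properties are stand-ins for HBase's Configuration; nothing here is the actual HBase API:

```java
import java.util.Properties;

// Hedged sketch: the concrete log class is named in configuration and
// instantiated reflectively, so a BKHLog could be swapped in without
// touching HRegionServer/HRegion/HMerge call sites. The key is hypothetical.
public class WalFactory {
    static final String IMPL_KEY = "hbase.regionserver.wal.impl";  // hypothetical key

    static Object createWal(Properties conf, String defaultImpl) throws Exception {
        String cls = conf.getProperty(IMPL_KEY, defaultImpl);
        // Reflection keeps the call sites unaware of the backend; a real
        // factory would check the result against a WAL interface.
        return Class.forName(cls).getDeclaredConstructor().newInstance();
    }
}
```

A real implementation would pass constructor arguments (configuration, log name) through the factory rather than relying on a no-arg constructor.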

Flavio Junqueira
added a comment - 11/Jul/12 23:20 Hi Ted, fs is part of the issue I was discussing before. We don't have a FileSystem implementation for BookKeeper, so we can't use the FileSystem instance passed in.
About the reader and the writer, I was configuring them in the hbase-default configuration file:
<property>
  <name>hbase.regionserver.hlog.reader.impl</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogReader</value>
  <description>The HLog file reader implementation.</description>
</property>
<property>
  <name>hbase.regionserver.hlog.writer.impl</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.BookKeeperLogWriter</value>
  <description>The HLog file writer implementation.</description>
</property>
I assumed previously that HLog was instantiated elsewhere.

Ted Yu
added a comment - 12/Jul/12 02:29 The ctor of HLog takes a FileSystem parameter. Since the FileSystem isn't important to BookKeeper, my feeling is that the approach in the previous patch makes sense.
You can remodel that patch by introducing a new hbase module.
Thanks

stack
added a comment - 12/Jul/12 15:22 Flavio:
You have what you need? Jesse and Jon I think did a good job above filling in bits I waved hands over when we talked (I like Jesse's suggestion undoing even our dependency of FS).
When we talked, we also figured that hbase would need access to bk or the bkfs or the bk interface at replay of edits time, on region open. In particular, we'd need to have a factory like the WAL log factory here abouts in the code: http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#2721 This is called when the region is opened by the regionserver. It does all its init and as part of the init, it looks to see if there are edits to replay. If edits-to-replay are out in bk, we need to be able to get at them at this very point.
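The replay-at-region-open step described here might be abstracted roughly like this sketch. All names are hypothetical; a BookKeeper-backed source would read edits out of ledgers rather than recovered-edits files in HDFS:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: on open, the region asks a pluggable source for
// recovered edits and applies them before serving reads or writes.
interface RecoveredEditsSource {
    Iterator<String> edits(String regionName);
}

class RegionOpener {
    // Applies every recovered edit in order; returns what was applied so
    // the demo can inspect it. A real region would write these into its
    // memstore during initialization.
    static List<String> replay(RecoveredEditsSource source, String region) {
        List<String> applied = new ArrayList<>();
        Iterator<String> it = source.edits(region);
        while (it.hasNext()) applied.add(it.next());
        return applied;
    }
}
```

The factory that produces the RecoveredEditsSource would be the same pluggable point as the WAL factory, so region open works identically for HDFS- and BK-backed logs.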

Flavio Junqueira
added a comment - 03/Aug/12 16:08 Based on your feedback and our own observations from inspecting the code, here is a rough idea of what we would like to do.

In the first step, we make HLog an interface exposing the public methods of the current HLog class, and make the current HLog class an implementation of that interface. We also create an HLog factory to allow us to instantiate different HLog implementations. Eventually we will have this factory create BKHLog when we tell it to do so via configuration. So far there is no new functionality.

In the second step, we implement BKHLog and decide what to do with the splitter. It is still not entirely clear how to adapt the splitter to BK. Perhaps we don't need a splitter at all with BookKeeper?

Let me know if there are any comments about these steps. Otherwise, I'll create two subtasks and start working on the first.