I've done a test with ZooKeeper 3.4.2 to compare the performance of synchronous vs. asynchronous vs. multi when creating znodes (variations around calling 10000 times zk.create("/dummyTest", "dummy".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);). The code is at the end of the mail.

I've tested different environments:
- 1 linux server with the client and 1 zookeeper node on the same machine
- 1 linux server for the client, 1 for 1 zookeeper node
- 6 linux servers, 1 for the client, 5 for 5 zookeeper nodes

Servers are middle range, with 4*2 cores, jdk 1.6. ZK was on its own HD.

But the results are comparable:

Using the sync API, it takes 200 seconds for 10K creations, so around 0.02 second per call.
Using the async API, it takes 2 seconds for 10K (including waiting for the last callback message).
Using the "multi" available since 3.4, it takes less than 1 second, again for 10K.

I'm surprised by the time taken by the sync operation; I was not expecting it to be that slow. The gap between async & sync is quite huge.

Is this something expected? ZooKeeper is used in critical functions in Hadoop/HBase. I was looking at the possible benefits of using "multi", but the gain seems low compared to async (well, ~3 times faster :-). There are many small data creations/deletions with the sync API in the existing HBase algorithms, and it would not be simple to replace them all with asynchronous calls...
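Since the original code at the end of the mail isn't reproduced here, the following is a minimal sketch of what such a sync/async/multi comparison could look like. The connect string, session timeout, batch size of 1000, and the per-create index suffix on the path (added so repeated PERSISTENT creates don't fail with NodeExists) are assumptions of the sketch, not taken from the actual test code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CreateBench {
    static final int N = 10000;
    static final byte[] DATA = "dummy".getBytes();

    public static void main(String[] args) throws Exception {
        // Connect string and session timeout are illustrative.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) { }
        });

        // 1) Synchronous: one full round trip (and one log commit) per call.
        long start = System.currentTimeMillis();
        for (int i = 0; i < N; i++) {
            zk.create("/dummySync" + i, DATA,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        System.out.println("sync:  " + (System.currentTimeMillis() - start) + " ms");

        // 2) Asynchronous: pipeline all requests, wait for the last callback.
        final CountDownLatch done = new CountDownLatch(N);
        AsyncCallback.StringCallback cb = new AsyncCallback.StringCallback() {
            public void processResult(int rc, String path, Object ctx, String name) {
                done.countDown();
            }
        };
        start = System.currentTimeMillis();
        for (int i = 0; i < N; i++) {
            zk.create("/dummyAsync" + i, DATA,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT, cb, null);
        }
        done.await();
        System.out.println("async: " + (System.currentTimeMillis() - start) + " ms");

        // 3) Multi (3.4+): batch the creates into a few transactions.
        start = System.currentTimeMillis();
        List<Op> batch = new ArrayList<Op>();
        for (int i = 0; i < N; i++) {
            batch.add(Op.create("/dummyMulti" + i, DATA,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
            if (batch.size() == 1000) {   // assumed batch size, to keep each request small
                zk.multi(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            zk.multi(batch);
        }
        System.out.println("multi: " + (System.currentTimeMillis() - start) + " ms");

        zk.close();
    }
}

With CreateMode.PERSISTENT the same path can only be created once, hence the index suffix; the original test presumably varied the call in a similar way.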

Sync calls have to make a complete roundtrip before the next call from that client will happen. It's not surprising at all that it would take quite a bit longer to do a sync call than an async call. It could be that the bottleneck in this case is your client, not your server. If the sync calls are happening amongst clients on many different servers, it probably doesn't matter.

It's used when assigning the regions (kind of dataset) to the region server (jvm process on a physical server). There is one zookeeper node per region. On a server failure, there are typically a few hundred regions to reassign, with multiple statuses written in . On paper, if we need 0.02s per node, that takes us to about a minute to recover, just for zookeeper.

That's theory. I haven't done a precise measurement yet. Anyway, if ZooKeeper can be faster, it's always very interesting :-)

Cheers,
N.

It's only a minute if you process each region serially. Process 100 or 1000 in parallel and it will go a lot faster.

20 milliseconds to synchronously commit to a 5.4k RPM disk is about right. This is assuming the configuration is correct. On ext3 you need to mount with barrier=1 (ext4 and xfs enable write barriers by default). If someone is getting significantly faster numbers they are probably writing to a volatile or battery-backed cache.
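For example, a hypothetical /etc/fstab entry with barriers explicitly enabled on ext3 could look like this (the device and mount point below are made up for illustration):

# illustrative only: device and mount point are assumptions
/dev/sdb1   /var/lib/zookeeper   ext3   defaults,barrier=1   0   2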

Performance is relative. The number of operations the DB can do is roughly constant, although multi may be able to batch operations more efficiently by amortizing all the coordination overhead.

In the synchronous case the DB is starved for work 99% of the time, so it is not surprising that it is slow. You are benchmarking round-trip time in that case, and that is dominated by the time it takes to synchronously commit something to disk.

In the asynchronous case there is plenty of work and you can fully utilize all the available throughput because each fsync makes multiple operations durable. However, the work is still presented piecemeal, so there is per-operation overhead.

Caveat: I am on 3.3.3 so I haven't read how multi operations are implemented, but the numbers you are getting bear this out. In the multi case you get the benefit of keeping the DB fully utilized plus amortizing the coordination overhead across multiple operations, so you get a boost in throughput beyond just async.
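As a rough sanity check using only the numbers above: at ~20 ms per synchronous commit, a single client issuing one create per round trip gets about 50 creates/s, i.e. roughly 200 s for 10K, which matches the sync measurement. With async, many outstanding requests ride each fsync (on the order of a hundred per fsync, given 2 s for 10K), so the same disk sustains thousands of creates/s.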

Ariel

On Tue, Feb 14, 2012 at 3:37 PM, N Keywal <[EMAIL PROTECTED]> wrote:

> Thanks for the replies.
>
> On Tue, Feb 14, 2012 at 8:00 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> These results are about what is expected although they might be a little more extreme.
>>
>> I doubt very much that hbase is mutating zk nodes fast enough for this to matter much.
>>
>> Sent from my iPhone

Some of our previous measurements gave us around 5ms; check some of the presentations we uploaded to the wiki. Those use 7.2k RPM disks, not just volatile storage or a battery-backed cache. We do have the write cache on for the numbers I'm referring to. There are also numbers there for when the write cache is off.

-Flavio



Just as an arithmetic check, hundreds x 20 ms = seconds, not minutes. Even 1000 x 0.02 s = 20 s, which isn't all that long. Faster is nice, but this doesn't reach "minutes". And as Flavio points out, if the recovery is threaded, it will be faster.

With a loaded server, each group of transactions will take about one rotation. But the time from when they arrived to the time that they are committed will be roughly 0 ... 8 ms for a 7200 RPM drive, because the transactions will be arriving at different times.

There will be overheads which make this untrue, but the basic idea that you don't necessarily have to wait for a full rotation if you arrive partway through a rotation is correct.
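Concretely: 7200 RPM is 120 rotations per second, about 8.3 ms per rotation; if commits arrive at random points in the rotation, the average wait is around half a rotation, roughly 4 ms, which lines up with the ~5 ms figure mentioned earlier in the thread.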

Hi Ariel, that wiki is stale. Check it here:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeperPresentations

In particular check the HIC talk, slide 57. We were using 1k byte writes for those tests.

-Flavio

On Feb 15, 2012, at 12:18 AM, Ariel Weisberg wrote:

> Hi,
>
> I tried to look at the presentations on the wiki, but the links aren't working? I was using http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations and the error at the top of the page is "You are not allowed to do AttachFile on this page. Login and try again."
>
> I used (http://pastebin.com/uu7igM3J) and the results for 4k writes were http://pastebin.com/N26CJtQE. 8.5 milliseconds, which is a bit slower than 5. Is it possible to beat the rotation speed?
>
> You can increase the write size quite a bit to 240k and it only goes up to 10 milliseconds. http://pastebin.com/MSTwaHYN
>
> My recollection was being in the 12-14 range, but I may be thinking of when I was pushing throughput.
>
> Ariel
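(The pastebin scripts referenced above aren't reproduced here. As a rough Java stand-in for the kind of test being discussed, the sketch below times 4 KB writes each followed by a force to disk; the file path, write size, and iteration count are illustrative assumptions.)

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class SyncWriteLatency {
    public static void main(String[] args) throws Exception {
        // Illustrative path; point it at the disk you want to measure.
        File f = new File("/tmp/sync-write-test.dat");
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        FileChannel ch = raf.getChannel();
        ByteBuffer buf = ByteBuffer.allocate(4096);   // 4k writes, as in the test above
        int iterations = 1000;                        // assumed iteration count

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            buf.rewind();
            ch.write(buf, (long) i * 4096);           // positional append-style write
            ch.force(true);                           // flush data and metadata to disk
        }
        double avgMs = (System.nanoTime() - start) / 1000000.0 / iterations;
        System.out.println("avg synchronous write latency: " + avgMs + " ms");

        ch.close();
        raf.close();
        f.delete();
    }
}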


Net means the overhead of the replication protocol only, not writing to disk.
Net+disk means the overhead of the replication protocol with writes to disk enabled.
Net+disk (no write cache) is the same as the previous one, but with the write cache of the disk turned off.

-Flavio

On Feb 18, 2012, at 4:17 PM, Ariel Weisberg wrote:

> Hi,
>
> In that diagram, what is the difference between net, net + disk, and net + disk (no write cache)?
>
> Thanks,
> Ariel