ZooKeeper: use native threads to avoid GC stalls (JNI integration)

Details

Type: Improvement

Status: Resolved

Priority: Major

Resolution: Later

Affects Version/s: 0.20.0

Fix Version/s: None

Component/s: None

Labels: None

Description

From Joey Echeverria up on hbase-users@:

We've used ZooKeeper in a write-heavy project we've been working on and experienced issues similar to what you described. After several days of debugging, we discovered that our issue was garbage collection. There was no way to guarantee we wouldn't have long pauses, especially since our environment was the worst case for garbage collection: millions of tiny, short-lived objects. I suspect HBase sees similar workloads frequently, if not constantly. With anything shorter than a 30-second session timeout, we got session expiration events extremely frequently. We needed to use 60 seconds for any real confidence that an ephemeral node disappearing meant something was unavailable.

We really wanted quick recovery, so we ended up writing a lightweight wrapper around the C API and used swig to auto-generate a JNI interface. It's not perfect, but since we switched to this method we've never seen a session expiration event, and ephemeral nodes only disappear when there are network issues or a machine/process goes down.

I don't know if it's worth doing the same kind of thing for HBase as it adds some "unnecessary" native code, but it's a solution that I found works.

Joey Echeverria added a comment - 15/Apr/09 22:14

I'm uploading a tarball with the basic wrapper we used to get around GC pauses. The C code maintains its own session with zk which is independent of any Java sessions. If you want to do anything other than create ephemeral nodes, more functions need to be wrapped, or a combination of C and Java is needed.

I could try to build something more comprehensive, potentially even a full implementation of the Java API that completely wraps the C, but I wanted to show you guys what I had at this stage. Let me know if you have any questions.
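The wrapper Joey describes can be pictured as a thin Java class that declares native entry points and degrades gracefully when the shared library is missing. The sketch below is illustrative only — the library name "zkephemeral" and every method name are made up, not taken from the attached tarball:

```java
// Hypothetical sketch of a thin JNI wrapper like the one described above.
// Library name and method names are invented for illustration.
public class NativeEphemeral {
    private static final boolean AVAILABLE;

    static {
        boolean loaded;
        try {
            // Expects libzkephemeral.so (or platform equivalent) on java.library.path
            System.loadLibrary("zkephemeral");
            loaded = true;
        } catch (UnsatisfiedLinkError e) {
            // No native bits for this platform; callers fall back to pure Java
            loaded = false;
        }
        AVAILABLE = loaded;
    }

    /** True when the native library loaded and the fast path can be used. */
    public static boolean isAvailable() { return AVAILABLE; }

    // Implemented in C on top of the ZooKeeper C client. The C code keeps its
    // own session, independent of any Java ZooKeeper handle, so a JVM GC pause
    // cannot starve its heartbeats.
    public static native long connect(String quorum, int sessionTimeoutMs);
    public static native void createEphemeral(long handle, String path, byte[] data);
    public static native void close(long handle);
}
```

Callers would check `isAvailable()` before touching the native methods, which mirrors how Hadoop behaves when its native compression libraries are absent.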

Nitay Joffe added a comment - 03/Jul/09 02:00

I've started looking into this. I have a minimal JNI binding to ZooKeeper based on Joey's work. It doesn't use swig, as I think that adds an unnecessary dependency.

The question, as Joey mentions, is where do we want to put the gap between JNI ZooKeeper and Java ZooKeeper? On one hand, we can have the JNI binding handle only ephemeral nodes, to reduce session-expired events caused by Java GC starving heartbeats. On the other hand, we can try making the JNI binding the only ZooKeeper handle, with no Java ZooKeeper handle at all, but that might be ugly with the watcher events going back and forth between C <=> Java.

What do you guys think?

Andrew Purtell added a comment - 03/Jul/09 02:10

The trouble here is ephemeral nodes expiring due to dropped heartbeats. I think this issue should be about solving that problem only. The rest almost does not matter – java land is blocked anyway; watcher events will queue up. Also, is calling up into Java land from JNI while the VM is in a GC cycle safe? It must be. Then I presume that if you tried to create an object in the C thread, the create would block on an OS-level mutex until it is safe to create objects again. Would that not defeat the purpose of the C thread in the first place?

Nitay Joffe added a comment - 03/Jul/09 02:32

Good points Andrew. I'll look into this further (still learning JNI), but I suspect your intuitions are correct. I'll stick to just handling ephemeral nodes in the JNI.

We also need to decide how we want to bundle the JNI library, as we will now have platform-specific things in our code base. I'll look into how Hadoop does this with things like libhdfs. If anyone has pointers, please let me know.

Joey Echeverria added a comment - 03/Jul/09 09:45

We specifically avoided having any callbacks cross the C/Java boundary. This was simple in our use case, where the only thing we needed to monitor after creating an ephemeral node was whether or not the ZK session had expired. We also had a very simple recovery mechanism: we immediately kill the process that got disconnected, and the shell script that launched us relaunches it. This proved far easier than trying to re-establish a connection to ZK in the running process.

Andrew Purtell added a comment - 04/Jul/09 05:13

@stack: That is what Hadoop does if native compression bits do not exist for the platform or were not installed. If Nitay follows that example, this stuff should work the same way.

Todd Lipcon added a comment - 29/Jul/10 18:20

phunt, jgray, and I talked about this on IRC this morning for a little while. We sketched out a design that looks something like this:

1) We tune up the ZK session timeout for region servers to be higher than the longest expected GC pause (e.g. 5 minutes).
2) We add a second ZK session on the same machine – either a second JVM running next to the first, or a JNI thread. Either way, it's its own session with its own ephemeral node, e.g. /rs-watchdogs/<regionserver name>. This second session has a tuned-down session timeout (e.g. 5 seconds).
3) In the HMaster, we watch /rs-watchdogs/*, and if we notice one of the ephemeral nodes disappear, we forcibly expire the matching regionserver ZK session. We will need some ZK support here to add the ability to reliably expire someone else's session.

This has the following effects:
A) If there's a long garbage collection pause in the JVM, the "fast" ZK session stays up, and so long as the GC pause is under the "long" timeout, nothing will expire. This is good.
B) If there's a network or machine outage, the "fast" ZK session goes down, in which case we detect the outage quickly. This is also good.
C) By adding the forcible expiration of the RS ZK session when the "fast" session expires, we keep the same fencing guarantees we have now.

The other nice thing about this design is that it doesn't change the current RS or master at all – the master still watches the normal RS znodes; we just have a second system doing a fast-path expiration on them when a machine goes down. We could also choose to implement this second system based on other kinds of machine health checks, etc.
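The failure matrix in (A)–(C) reduces to simple arithmetic: a session expires when its heartbeats have been silent for longer than its timeout, and the two sessions differ only in which events silence them. A toy model, using the example timeout values from the design above (nothing here touches a real ZooKeeper):

```java
// Toy model of the dual-session design: the "long" Java session goes silent
// during GC pauses *and* outages; the "fast" watchdog session (second JVM or
// JNI thread) only goes silent during real machine/network outages.
public class DualSessionModel {
    static final long LONG_TIMEOUT_MS = 5 * 60 * 1000; // region server session, step 1
    static final long FAST_TIMEOUT_MS = 5 * 1000;      // watchdog session, step 2

    /** A session expires once heartbeats have been silent longer than its timeout. */
    static boolean expired(long silenceMs, long timeoutMs) {
        return silenceMs > timeoutMs;
    }

    public static void main(String[] args) {
        // (A) 30s GC pause: only the Java session stops heartbeating,
        // and 30s < 5min, so nothing expires.
        long gcPauseMs = 30_000;
        System.out.println("GC pause expires long session: " + expired(gcPauseMs, LONG_TIMEOUT_MS));

        // (B) machine/network outage: both sessions go silent; the fast one
        // trips within seconds instead of minutes.
        long outageMs = 10_000;
        System.out.println("outage trips fast session: " + expired(outageMs, FAST_TIMEOUT_MS));
        System.out.println("outage trips long session: " + expired(outageMs, LONG_TIMEOUT_MS));

        // (C) is the master's job: on the fast expiry it forcibly expires the
        // matching long session, preserving the existing fencing guarantee.
    }
}
```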

Joey Echeverria added a comment - 16/May/11 21:43

+1 on Todd et al.'s design. I think it would be useful to use the JNI thread for (2) for the following reasons:

1) If the region server process goes down hard but the machine is still up, you won't see that failure for the long timeout (5 minutes) if the other session is in a separate JVM.
2) It's probably easier to manage a single JVM during startup and shutdown of a regionserver.

If nobody minds, I'd like to take a stab at generating a patch for this.

Joey Echeverria added a comment - 21/May/11 01:26

I've got a partial patch ready. The build relies on native-maven-plugin to build the native code. This plugin pulls native dependencies as Maven artifacts. To make this work, I packaged up the ZooKeeper header files and the static library compiled for x86-64 Linux.

In order to test the patch, you need to install the artifacts into your local Maven repository. I've included a simple install.sh to do this for you. We'll need to upload these artifacts somewhere, along with other supported OSes/architectures, in the future.

I did attempt to make both the build and the runtime code work if you're not on a supported platform, but I haven't extensively tested that.

At this point, the patch just adds support for interacting with ZooKeeper via the native code. The interaction is very limited; currently only creating ephemeral nodes is supported. One thing I did add is a callback for the native code to notify Java when its session expires.

Right now, I'm generating my own session expiration event to send to the Java ZooKeeper connection. I think this will allow the region server to shut down if the native session expires. It should look just like an expiration of the Java session.

Things that are not yet implemented:

1) The region server hasn't been modified to use the native code at all.
2) I haven't modified the packaging part of the build. I'm not sure how we'll want the build to generate versions of the native library for multiple platforms.

Let me know if you think this is on the right track or if anything needs a big rethink.

stack added a comment - 23/May/11 21:39

We'll need to upload these artifacts somewhere, along with other supported OSes/architectures in the future.

This should be fine. We've been doing this for various libs up to this point. Can add this, np.

Do you think this patch will be generally useful, Joey? If so, maybe once it's up and working in hbase, it can be contrib'd back to zk?

I haven't modified the packaging part of the build. I'm not sure how we'll want the build to generate versions of the native library for multiple platforms.

Tell me more about this? Are you thinking we need to build the native libs in-line with a build each time?

Do you think this feature can be optionally enabled? If we fail to load the required native lib, do we default to old-school session handling? Or is it always on, but we only use the new style if we find the native libs?

How does this timeout relate to the zk session timeout?

+ public static int DEFAULT_HBASE_ZOOKEEPER_NATIVE_SESSION_TIMEOUT = 5000;

That's cool that you have unit tests in place for your new methods already.

Patch so far looks great to me.
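The constant stack quotes implies the native session's timeout would be a setting of its own, tuned independently of zookeeper.session.timeout. A hypothetical hbase-site.xml fragment — the property key here is guessed from the constant's name, so check the actual patch for the real key:

```xml
<!-- Hypothetical key, inferred from DEFAULT_HBASE_ZOOKEEPER_NATIVE_SESSION_TIMEOUT;
     the patch may spell it differently. Value is in milliseconds, like
     zookeeper.session.timeout. -->
<property>
  <name>hbase.zookeeper.native.session.timeout</name>
  <value>5000</value>
</property>
```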

Joey Echeverria added a comment - 24/May/11 23:17

Do you think this patch will be generally useful, Joey? If so, maybe once it's up and working in hbase, it can be contrib'd back to zk?

Which part of the patch? I think it might be useful to get zk to publish native artifacts, but I'm not sure a full Java API that uses the native libraries makes sense in general. It might make sense as a kind of general-purpose watchdog facility, probably in zk contrib.

Tell me more about this? Are you thinking we need to build the native libs in-line with a build each time?

Right now that's how my patch works. We could also build the native libraries as a separate Maven module that HBase could depend on; that way the assembly could include all the versions we've built.

Do you think this feature can be optionally enabled? If we fail to load the required native lib, do we default to old-school session handling?

It's sort-of optional as is. I put my pom changes, which build the native library, into a profile that is activated based on the OS and CPU (currently only activated for x86-64 Linux). We could turn off the automatic activation entirely and require that you turn on the profile using mvn -Pnative.

At runtime, it's completely optional. If you can't load the native library, it falls back to using pure-Java methods. That means you wouldn't get the fast-path recovery the native code is supposed to enable, but it prints a warning to that effect in the logs.

How does this timeout relate to the zk session timeout?

This configures the session timeout for the native ZK session. Since you'll want to tune the two separately, I put it in as a separate config. On the previous project I worked on, we found that 5-10 seconds was reasonable: you notice failures quickly, but you're unlikely to be triggered by temporary network glitches.

One thing I'm undecided about is how the region server should deal with getting a session expiration notice for the native session. In my last project, we had the native code call exit() to fail as fast as possible; we didn't want our master process assigning a shard to a new server while the old one still thought it had control of it. In the current patch, I'm sending an expired-session event to the Java ZK handle. It would probably be better to expire the Java session entirely. This can be done pretty easily by connecting to ZK with the same session id and password and then closing the session. That way, anyone watching ephemeral nodes created by the Java session would get equally fast failure notifications.

I'm busy this week, so I probably won't get back to this until next weekend.
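Joey's profile arrangement can be sketched in standard Maven terms. The profile id and activation values below are illustrative (the actual pom may differ), but this is the usual way a profile auto-activates only on x86-64 Linux:

```xml
<!-- Illustrative sketch, not the actual patch: a "native" profile that only
     activates on x86-64 Linux, so other platforms build a pure-Java HBase. -->
<profile>
  <id>native</id>
  <activation>
    <os>
      <name>linux</name>
      <arch>amd64</arch>
    </os>
  </activation>
  <build>
    <plugins>
      <!-- native-maven-plugin would compile the JNI wrapper here -->
    </plugins>
  </build>
</profile>
```

Dropping the `<activation>` block and building with `mvn -Pnative` would give the opt-in behavior discussed above.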

stack added a comment - 25/May/11 06:00

We could also build the native libraries as a separate Maven module that HBase could depend on; that way the assembly could include all the versions we've built.

We can work this out later, np.

We could turn off the automatic activation entirely and require that you turn on the profile using mvn -Pnative.

We could, but I think the way you have it is the way to go, at least for a first version (we can add an 'off' button later).

...Since you'll want to tune the two separately, I put it in as a separate config...

Ok. Makes sense.

This can be done pretty easily by connecting to ZK with the same session id and password and then closing the session. That way, anyone watching ephemeral nodes created by the Java session would get equally fast failure notifications.

Yes. This seems right. There are issues around RS death, making it 'clean'-er than it is, but that's out of scope for this issue (at the same time, a call to 'exit' the RS, to make it die a sudden death, might be a nice-to-have).

Good stuff Joey.

Joey Echeverria added a comment - 17/Jun/11 21:26

Also, the patch was re-based to the latest trunk. It won't build without putting the native ZK library it depends on in a Maven repo. Is there one I can upload it to that would be used by the Jenkins builds?

stack added a comment - 18/Jun/11 23:14

If the native lib is not available in a mvn repo, we could put it up in a personal repo for the moment (I could put it in mine up on apache?). Will the zk project be making it available in mvn, do you know, Joey?

Joey Echeverria added a comment - 20/Jun/11 14:35

I was mistaken, I hadn't filed a JIRA, but now I have. See ZOOKEEPER-1098. We should probably upload it to a personal repo while we're waiting on movement on that ticket.