A long-time pain point for HubSpot Engineering has been intermittent core dumps in Hadoop jobs. The issue wasn't destructive; all jobs are idempotent, and missing processes are restarted and rerun. Occasionally, however, the resulting queue processing delay would be enough to trigger a PagerDuty alert in the middle of the night. Having tried several times to resolve the issue, engineering was unhappily resigned to living with it -- until our recently-formed SRE team decided to take a shot at it.

Tl;dr

As expected, java is pretty flexible about mixing minor versions between compile time and runtime, even with JNI involved.

The -Xcheck:jni command line option can cause Hadoop seg faults if native libs are used.

Don't use obscure 3rd party libs in production.

Workaround

The known workaround was to rename the hadoop directory containing the *.so dynamic libraries so that they weren't found at jvm startup. Those native *.so libraries are optimized compression implementations, used in favor of their java equivalents.

mv /usr/lib/hadoop/lib/native /usr/lib/hadoop/lib/native.disabled

That’s good, bad and awful. Good because there is a workaround. Bad because using the workaround adds minutes to jobs and is not viable long term; engineers end up applying the workaround for an hour or so, then undoing it once they think the job queue has been unclogged. Awful because it's generally considered fruitless to debug 3rd party native C jni code. On the plus side, the hadoop code is open source and available.

Information Gathering

Ok, so the workaround is insufficient and we need an actual fix. What are the facts on the ground?

The traces all pointed to the linux dynamic loader -- which is stock CentOS 6 and has no issues of this type reported against it. The currently executing java thread varied, implying there were multiple ways to reach the failure. The java stack traces all ended after loading a native library: somewhere inside the native code something blew up, and we'd have no opportunity to debug it on the java side. With no corresponding core dumps, the trail ended there.

Native libraries

With the utility lsof we can examine the open files of a running process; looking at the open *.so files for one of our hadoop processes is instructive.
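(For reference, something along these lines produces the list; the pid is illustrative.)

lsof -p 12345 | grep '\.so'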

The native libs converge to four major groupings: linux system libs, jvm libs, hadoop libs and some libs in /tmp. The system (/lib64/) libs and jvm libs are expected and seem normal. The hadoop libs are certainly expected. What does stand out is the set of libs in /tmp. It's unexpected for a library to be stored there; however, snappy compression is in use, and an ansi lib certainly doesn't raise any red flags. The files exist in the filesystem and have appropriate permissions.

Running objdump on the libs confirmed all requirements/dependencies were present and accounted for.
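(A check along the lines of the following lists a library's declared dependencies; ldd confirms they resolve. The path here is one of the /tmp libs above.)

objdump -p /tmp/libjansi-64-1.9.so | grep NEEDED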

-Xcheck:jni, and bugs found by -Xcheck:jni

In addition to enabling core dumps, we hoped the command line option -Xcheck:jni would provide some insight.
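(Both knobs are simple to turn. How the flag gets wired into the hadoop daemons varies by install; the HADOOP_OPTS line below is illustrative.)

ulimit -c unlimited
export HADOOP_OPTS="$HADOOP_OPTS -Xcheck:jni"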

Unfortunately, changing the system in two ways at once caused a minor amount of confusion, since the -Xcheck:jni feature exposed a different bug in Hadoop. It took a while to sift through the data and determine there were two types of core dumps: one similar to what the devs had been seeing all along, and a new one associated with the -Xcheck:jni feature.

The new trace looked like a real issue. The fieldID ZlibCompressor_clazz is pulled from the Class object, while the call to get the field is made against the object instance 'this'. Does that call always return the class object? Also, all synchronization in the java class is done on the instance; synchronizing in the C code on the Class object broadens the lock. All in all, this seemed like a potential problem.

We spent some time altering this code to properly pass the Class object through to the C code. After hacking in what appeared to be a functional replacement, the original core dumps persisted.

While this may be an issue, it's not the one we're after.

Version(s) Compatibility

Back when core dumps were being enabled, I had poked around and determined that CDH recommends java version jdk1.6.0_31, while we were on a mixture of jdk1.6.0_27 and jdk1.6.0_39. Our java code was all compiled with jdk1.6.0_27; the CDH code was all compiled with their certified version, jdk1.6.0_31.

We spent some time compiling our code on jdk1.6.0_31 and running it on the _31 jvm. Still no luck; the core dumps kept happening.

We had also lagged a bit on updates to our linux distro. Presumably an OS upgrade would involve an upgrade to libc, which is certainly in the picture and implicated by the non-Xcheck:jni core dumps.

We installed a newer distro, recompiled where needed, and still encountered issues. This did have the positive side effect of proving that our relatively lax adherence to specific versions wasn't hurting us.

Libc-based core dump analysis

The non-Xcheck:jni core dumps were a bit more varied and didn't follow as precise a pattern. While they all died in the linux linker, the originating calls from java came from all over the place. There really wasn't a pattern as to what java/hadoop code could be triggering some kind of buffer overflow or other hard-to-find C error. The last thing we could do was track down as closely as possible what exactly happened via gdb and hope something came of it.

(Un?)fortunately my previous experience with gdb never made it further than ‘bt<enter>’, so there was a good deal of thrashing around to get even rudimentary information like function parameters, register values and local variables.
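(For anyone in the same boat, the handful of gdb commands that surfaced that rudimentary information is short; the binary path and core file name here are illustrative.)

gdb /usr/java/jdk1.6.0_31/bin/java core.12345
(gdb) bt              # stack trace
(gdb) frame 16        # select a frame
(gdb) info args       # function parameters
(gdb) info locals     # local variables
(gdb) info registers  # register values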

The gdb stack traces all shared a common pattern, and as is evident to your average linux hacker, there were no debug symbols.

In a perfect world, the cause of the failure would be the top line of the stack trace. However, raise(), abort() and os::abort() sound like symptoms, not causes. Scanning through the entire trace, there are two signals raised. On the assumption that signals are for special occasions only, it's reasonable to assume whatever raised a signal is where a failure happened. So which of the two signals should we target? The first could only have happened because of our error; the second could be our error or a complication of it. Starting with the first signal should get us closest to a clean view of the error. So what's going on in frame 15 that raises a signal?

strcmp() takes two char arrays as arguments and returns an integer. Vegas odds would be on strcmp() only raising a signal if one of the passed-in arguments was bad. As confirmation, the line of assembly shows the failure was on the instruction cmpb (compare byte). To find the arguments and where they came from, we head to frame 16.

Showing the frame in gdb with 'f 16' prints a mess of information, including the method name, its arguments, the file it's from and the last executed line number in that frame. The other piece of data shown, '2016 if (strcmp(name, soname) != 0)' in this example, is the particular line of code that triggered the call into the next frame on the stack. Let's have a look at that method.

The code is a little on the dense side, but the method is called _dl_map_object and the comment before the loop is 'Look for this name among those already loaded'. 'This name' is probably referring to the name argument passed in, "/usr/lib/hadoop/lib/native/libhadoop.so.1.0.0". The things already loaded are probably the other shared objects already mapped into the process. So far so good. The strcmp line compares the name argument with soname, and soname appears to be generated from the objects being looped over. So let's see what the value of soname is.

soname's value is <value optimized out>, and according to a question on stackoverflow, <value optimized out> means the value resides in a register. Printing out the registers didn't reveal much, but then again we're not really looking for the value of soname, which we already know caused strcmp() to blow up. We want to see the structures and data that soname was created from. Looking back at the C code, soname was generated from l and l_info. Gdb is pretty smart and will do a great job of printing out the structure of l for us.
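(Dereferencing the pointer is all it takes:)

(gdb) p *l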

And now we've got a view of the struct, with field names and values/addresses. At first glance it's all a bit incomprehensible, except for the char array l_name = "/tmp/libjansi-64-1.9.so" sticking out. So while looking for a native hadoop lib, the loader loops through all the native stuff already loaded and seg faults on a non-hadoop native library? What is that ansi lib for anyway? Google finds the library at jansi.fusesource.org, and it isn't some internal jvm ansi lib as originally assumed. According to the website, jansi is for "Eliminating boring console output". Why would a java library for eliminating boring console output have a native dynamic shared object? More practically, how could a java jar have a packaged shared object? A quick look at the other core dumps verifies they all died in the _dl_map_object method while examining the libjansi.so object.

A bit more research turns up the pattern for including native code in a .jar file. Our codebase had two such libraries, snappy compression and jansi. During static initialization, a java class reaches into the jar file with the class loader and extracts an appropriate dynamic library, dropping it onto the file system, typically in the tmp directory. After extraction, the java code runs System.load() or System.loadLibrary(), making the native code available. One integral part of the pattern that jansi appears to have missed is to only extract the binary if it's not already present on the filesystem. Looking at /tmp, we note the time difference between the two *.so files: the libjansi timestamp being newer than the libsnappy one is a clue that the jansi code overwrites its binary each time it's loaded.
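A minimal sketch of the pattern, including the guard jansi appears to skip. This is our own illustrative reconstruction, not jansi's or snappy's actual code; NativeLoader and the buffer size are made up.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class NativeLoader {
    public static void load(String libName) throws IOException {
        File target = new File(System.getProperty("java.io.tmpdir"), libName);
        // The guard jansi appears to be missing: never overwrite a binary
        // that may already be mapped into a running jvm.
        if (!target.exists()) {
            InputStream in = NativeLoader.class.getResourceAsStream("/" + libName);
            if (in == null) {
                throw new IOException("no bundled library: " + libName);
            }
            OutputStream out = new FileOutputStream(target);
            try {
                // Copy the library out of the jar onto the filesystem.
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();
                out.close();
            }
        }
        // Make the extracted native code available to the jvm.
        System.load(target.getAbsolutePath());
    }
}

Even the exists() check is racy across processes; extracting to a uniquely-named temp file and renaming it into place would be safer still.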

So at this point the working theory is that the linux dynamic loader gets fatally tripped up when a native library loaded by the jvm is overwritten with a binary duplicate of itself. The failure appears to be triggered by the hadoop library attempting to load/access its own native routines. Almost certainly there is a race condition involved: the longer a server goes unchanged, the more likely it is to be stable, and vice versa.

While my education in shared objects and linking on linux is more than a decade out of date, I was pretty firm in my belief that once an object was loaded you couldn't get rid of it -- kind of like how you can delete a file from the file system, but the file will stick around and be visible with lsof as long as there is an open filehandle. In this case, however, verification was straightforward: on a system that hadn't cored in several hours, overwrite the file with itself and wait. Since there's also a race condition involved, we'd probably have to run the test several times without causing a core dump to be confident the theory was incorrect. On a server running without incident for 8hrs, the libjansi binary was copied over itself; 15m later there was a core dump.
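(The test itself was nothing fancy; something along these lines, with the round-trip copy preserving the exact bytes:)

cp /tmp/libjansi-64-1.9.so /tmp/libjansi.copy
cp /tmp/libjansi.copy /tmp/libjansi-64-1.9.so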

Boom, we have our culprit. It took an hour to remove jansi and redeploy, and the core dumps were gone.

Jansi's Purpose

Jansi was originally added to provide colored console output in a developer's IDE. Adoption never made it past the initial use case of identifying the origin source for configuration parameters. Since it was only useful in development and never used after container startup, no one complained when it was removed.

And for a bit of HubSpot color, there was a spirited debate as to the merits of the original PR, with the cons argument now getting the last laugh.

Conclusion

Don't include obscure third party libraries in production. Since QA and development will often mirror production, unnecessary libs are right out there as well.

All the time spent tweaking different compile-time and runtime java minor versions had no noticeable effect. Overall that's a good thing: a positive experience with java's Compile Once, Run Anywhere claim, one that held true even with the complication of third party native C libs.

And finally, we learned that native jni C libs are pretty scary when compared to, or combined with, the java world. The addition of the jvm command line arg -Xcheck:jni showed that several-year-old native C code was not written correctly according to the jni spec. Yet that code has been battle tested in the field and works well enough.
