Friday, June 7, 2013

The Pain of Broken Subprocess Management on JDK

I prefer to write happy posts...I really do. But tonight I'm completely defeated by the JDK's implementation of subprocess launching, and I need to tell the world why.

JRuby has always strived to mimic MRI's behavior as much as possible, which in many cases has meant we need to route around the JDK to get at true POSIX APIs and behaviors.

For example, JRuby has provided the ability to manipulate symbolic links since well before Java 7 provided that capability, using a native POSIX subsystem built atop jnr-ffi, our Java-to-C FFI layer (courtesy of Wayne Meissner). Everyone in the Java world knew for years the lack of symlink support was a gross omission, but most folks just sucked it up and went about their business. We could not afford to do that.

We've repeated this process for many other Ruby features: UNIX sockets, libc-like IO, selectable stdin, filesystem attributes...on and on. And we've been able to provide the best POSIX runtime on the JVM bar none. Nobody has gone as far or done as much as JRuby has.

Another area where we've had to route around the JDK is in subprocess launching and management. The JDK provides java.lang.ProcessBuilder, an API for assembling the appropriate pieces of a subprocess launch, producing a java.lang.Process object. Process in turn provides methods to wait for the subprocess, get access to its streams, and destroy it forcibly. It works great, on the surface.
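
For readers who haven't used this API, here is a minimal sketch of the surface-level experience. The class name `LaunchDemo` and the `run` helper are made up for illustration; only `ProcessBuilder` and `Process` come from the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class LaunchDemo {
    /** Runs a command, collects its merged stdout/stderr, and appends the exit code. */
    static String run(String... command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);       // merge the child's stderr into its stdout
        Process process = builder.start();       // the subprocess is launched here

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {   // blocks until the child writes or exits
                output.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor();        // blocks until the child has exited
        return output + "exit=" + exitCode;
    }
}
```

Everything here is blocking and stream-based: there is no PID and no channel to select on.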

Unfortunately, the cake is a lie.

Under the covers, the JDK implements Process through a complicated series of tricks. We want to be able to interactively control the child process, monitor it for writes, govern its lifecycle exactly. The JDK attempts to provide a consistent experience across all platforms. Unfortunately, those two worlds are not currently compatible, and the resulting experience is consistently awful.

We'll start at the bottom to see where things go wrong.

POSIX, POSIX, Everywhere

At the core of ProcessBuilder, inside the native code behind UNIXProcess, we do find somewhat standard POSIX calls to fork and exec, wrapped up in a native downcall forkAndExec:

The C code behind this is a bit involved, so I'll summarize what it does.

Sets up pipes for in, out, err, and fail to communicate with the eventual child process.

Copies the parent's descriptors from the pipes into the "fds" array.

Launches the child through a fairly standard fork+exec sequence.

Waits for the child to write a byte to the fail pipe indicating success or failure.

Scrubs the unused sides of the pipes in parent and child.

Returns the child process ID.

This is all pretty standard for subprocess launching, and if it proceeded to put those file descriptors into direct, selectable channels we'd have no issues. Unfortunately, things immediately go awry once we return to the Java code.

Interactive?

The call to forkAndExec occurs inside the UNIXProcess constructor, as the very first thing it does. At that point, it has in hand the three standard file descriptors and the subprocess pid, and it knows that the subprocess has at least been successfully forked. The next step is to wrap the file descriptors in appropriate InputStream and OutputStream objects, and this is where we find the first flaw.

This is the code to set up an OutputStream for the input channel of the child process, so we can write to it. Now we know the operating system is going to funnel those written bytes directly to the subprocess's input stream, and ideally if we're launching a subprocess we intend to control it...perhaps by sending it interactive commands. Why, then, do we wrap the file descriptor with a BufferedOutputStream?

This is where JRuby's hacks begin. In our process subsystem, we have the following piece of code, which attempts to unwrap buffering from any stream it is given.

The FieldAccess.getProtectedFieldValue call there does what you think it does...it attempts to read the "out" field from within FilterOutputStream, which in this case will be the FileOutputStream from above. Unwrapping the stream in this way allows us to do two things:

We can do unbuffered writes to (or reads from, in the case of the child's out and err streams) the child process.

We can get access to the more direct FileChannel for the stream, to do direct ByteBuffer reads and writes or low-level stream copying.
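
The unwrapping idea can be sketched roughly as follows. This is not JRuby's actual code: `StreamUnwrapper` is a hypothetical name, and the sketch assumes the wrapper chain consists of `FilterOutputStream` subclasses. On recent JDKs (16 and later) the reflective access is refused unless `java.io` is opened with `--add-opens`, so the sketch falls back to returning the stream it could not unwrap.

```java
import java.io.FilterOutputStream;
import java.io.OutputStream;
import java.lang.reflect.Field;

final class StreamUnwrapper {
    /**
     * Peels FilterOutputStream wrappers (such as BufferedOutputStream) off a stream
     * by reading the protected "out" field reflectively. If reflection is refused,
     * returns the stream as far as it could be unwrapped.
     */
    static OutputStream unwrap(OutputStream stream) {
        OutputStream current = stream;
        while (current instanceof FilterOutputStream) {
            try {
                Field out = FilterOutputStream.class.getDeclaredField("out");
                out.setAccessible(true);           // may throw on JDK 16+ without --add-opens
                current = (OutputStream) out.get(current);
            } catch (ReflectiveOperationException | RuntimeException e) {
                return current;                    // give up and keep what we have
            }
        }
        return current;
    }
}
```

If the result is a `FileOutputStream`, its `getChannel()` method yields the `FileChannel` mentioned above.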

So we're in good shape, right? It's a bit of hackery, but we've got our unbuffered Channel and can interact directly with the subprocess. Is this good enough?

I wish it were.

Selectable?

The second problem we run into is that users very often would like to select against the output streams of the child process, to perform nonblocking IO operations until the child has actually written some data. It gets reported as a JRuby bug over and over again because there's simply no way for us to implement it. Why? Because FileChannel is not selectable.

FileChannel implements methods for random-access reads and writes (positioning) and blocking IO interruption (which NIO implements by closing the stream...that's a rant for another day), but it does not implement any of the logic necessary for doing nonblocking IO using an NIO Selector. This comes up in at least one other place: the JVM's own standard IO streams are also not selectable, which means you can't select for user input at the console. Consistent experience indeed...it seems that all interaction with the user or with processes must be treated as file IO, with no selection capabilities.
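
The type hierarchy makes the limitation easy to check. The sketch below (with a made-up class name, `SelectableCheck`) simply asks whether a channel type extends `java.nio.channels.SelectableChannel`, which is the requirement for registering with a `Selector`.

```java
import java.nio.channels.FileChannel;
import java.nio.channels.Pipe;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SocketChannel;

final class SelectableCheck {
    /** True if channels of the given type can be registered with a Selector. */
    static boolean isSelectable(Class<?> channelType) {
        return SelectableChannel.class.isAssignableFrom(channelType);
    }

    public static void main(String[] args) {
        System.out.println(isSelectable(FileChannel.class));         // prints false
        System.out.println(isSelectable(SocketChannel.class));       // prints true
        System.out.println(isSelectable(Pipe.SourceChannel.class));  // prints true
    }
}
```

NIO's own in-JVM `Pipe` channels are selectable; the channel you can dig out of a subprocess stream is not.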

(It is interesting to note that the JVM's standard IO streams are *also* wrapped in buffers, which we dutifully unwrap to provide a truly interactive console.)

Why are inter-process file descriptors, which would support selector operations just wonderfully, wrapped in an unselectable channel? I have no idea, and it's impossible for us to hack around.

Let's not dwell on this item, since there's more to cover.

Fear the Reaper

You may recall I also wanted to have direct control over the lifecycle of the subprocess, to be able to wait for it or kill it at my own discretion. And on the surface, Process appears to provide these capabilities via the waitFor() and destroy() methods. Again it's all smoke and mirrors.

Further down in the UNIXProcess constructor, you'll find this curious piece of code:

For each subprocess started through this API, the JVM will spin up a "process reaper" thread. This thread is designed to monitor the subprocess for liveness and notify the parent UNIXProcess object when that process has died, so it can pass on that information to the user via the waitFor() and exitValue() API calls.
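
The reaper pattern described here looks roughly like the sketch below. It is an illustration of the pattern, not the JDK source: `ReapedChild` and `ExitWaiter` are hypothetical names, and `ExitWaiter` stands in for the native call that blocks until the child exits.

```java
final class ReapedChild {
    /** Hypothetical stand-in for the native call that blocks until the child exits. */
    interface ExitWaiter {
        int waitForExit();
    }

    private boolean hasExited;
    private int exitCode;

    ReapedChild(final ExitWaiter waiter) {
        Thread reaper = new Thread("process reaper") {
            @Override
            public void run() {
                int status = waiter.waitForExit();   // blocks in the reaper, not the caller
                synchronized (ReapedChild.this) {
                    hasExited = true;
                    exitCode = status;
                    ReapedChild.this.notifyAll();    // wake anyone blocked in waitFor()
                }
            }
        };
        reaper.setDaemon(true);
        reaper.start();
    }

    /** Blocks until the reaper thread has recorded the exit status. */
    synchronized int waitFor() throws InterruptedException {
        while (!hasExited) {
            wait();
        }
        return exitCode;
    }

    /** Returns the recorded status, or throws if the child is still running. */
    synchronized int exitValue() {
        if (!hasExited) {
            throw new IllegalThreadStateException("process has not exited");
        }
        return exitCode;
    }
}
```

Note that the only party who ever waits on the real child is the reaper thread; callers of `waitFor()` wait on the Java object instead.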

The interesting bit here is the waitForProcessExit(pid) call, which is another native downcall into C land:

There's nothing too peculiar here; this is how you'd wait for the child process to exit if you were writing plain old C code. But there's a sinister detail you can't see just by looking at this code: waitpid can succeed only once for a given child. Once one caller has collected the child's exit status, no other caller in the parent process can.

Part of the Ruby Process API is the ability to get a subprocess PID and wait for it. The concept of a process ID has been around for a long time, and Rubyists (even amateur Rubyists who've never written a line of C code) don't seem to have any problem calling Process.waitpid when they want to wait for a child to exit. JRuby is an implementation of Ruby, and we would ideally like to be able to run all Ruby code that exists, so we also must implement Process.waitpid in some reasonable way. Our choice was to literally call the C function waitpid(2) via our FFI layer.

Here's the subtle language from the wait(2) manpage (which includes waitpid):

RETURN VALUES
If wait() returns due to a stopped or terminated child
process, the process ID of the child is returned to the
calling process. Otherwise, a value of -1 is returned
and errno is set to indicate the error.
If wait3(), wait4(), or waitpid() returns due to a
stopped or terminated child process, the process ID of
the child is returned to the calling process. If there
are no children not previously awaited, -1 is returned
with errno set to [ECHILD]. Otherwise, if WNOHANG is
specified and there are no stopped or exited children,
0 is returned. If an error is detected or a caught
signal aborts the call, a value of -1 is returned and
errno is set to indicate the error.

There's a lot of negatives and passives and conditions there, so I'll spell it out for you more directly: If you call waitpid for a given child PID and someone else in your process has already done so...bad things happen.

We effectively have to race the JDK to the waitpid call. If we get there first, the reaper thread bails out immediately and does no further work. If we don't get there first, it becomes impossible for a Ruby user to waitpid for that child process.
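
To make the race concrete, here is a toy in-memory model of the "first waiter wins" rule. It makes no system calls and simplifies the real interface (real waitpid returns the PID and hands back the status separately); `ToyChildTable` and its methods are invented for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ToyChildTable {
    /** Stand-in for waitpid returning -1 with errno set to ECHILD. */
    static final int ECHILD = -1;

    // Exited children whose status nobody has collected yet.
    private final Map<Integer, Integer> unreaped = new HashMap<Integer, Integer>();

    /** Records that a child exited with the given status. */
    synchronized void childExited(int pid, int status) {
        unreaped.put(pid, status);
    }

    /** The first caller collects the status and removes the entry; later callers get ECHILD. */
    synchronized int waitpid(int pid) {
        Integer status = unreaped.remove(pid);
        return status == null ? ECHILD : status;
    }
}
```

Whichever of the two callers (the JDK's reaper or JRuby's FFI call) reaches `waitpid` second gets the error, no matter how legitimate its interest in the child.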

Now you may be saying "why don't you just wait on the Process object and let the JDK do its job, old man?" The problem here is that Ruby's Process API behaves like a POSIX process API: you get a PID back, and you wait on that PID. We can't mimic that API without returning a PID and implementing Process.waitpid appropriately.

(Interesting note: we also use reflection tricks to get the real PID out of the java.lang.Process object, since it is not normally exposed.)

Could we have some internal lookup table mapping PIDs to Process objects, and make our wait logic just call Process.waitFor? In order to do so, we'd need to manage a weak-valued map from integers to Process objects...which is certainly doable, but it breaks if someone uses a native library or FFI call to launch a process themselves. Oh, but if it's not in our table we could do waitpid. And so the onion grows more layers, all because we can't simply launch a process, get a PID, and wait on it.
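
A rough sketch of what that layered lookup might look like follows. Nothing here is existing JRuby code: `PidTable` is a hypothetical class, and `NativeWaiter` is a stand-in for a waitpid(2) call made through FFI.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PidTable {
    /** Hypothetical stand-in for a native waitpid(2) call made through FFI. */
    interface NativeWaiter {
        int waitpid(long pid);
    }

    // Weak values, so the table does not keep finished Process objects alive.
    private final Map<Long, WeakReference<Process>> processes =
            new ConcurrentHashMap<Long, WeakReference<Process>>();
    private final NativeWaiter nativeWaiter;

    PidTable(NativeWaiter nativeWaiter) {
        this.nativeWaiter = nativeWaiter;
    }

    /** Remembers which Process object belongs to a PID launched through the JDK. */
    void register(long pid, Process process) {
        processes.put(pid, new WeakReference<Process>(process));
    }

    /** Waits through the JDK when it owns the child, and natively otherwise. */
    int waitFor(long pid) throws InterruptedException {
        WeakReference<Process> ref = processes.remove(pid);
        Process process = (ref == null) ? null : ref.get();
        if (process != null) {
            return process.waitFor();        // JDK-launched child: let the JDK report the exit
        }
        return nativeWaiter.waitpid(pid);    // unknown PID: assume it was launched natively
    }
}
```

Even this small sketch has a hole: if the weak reference has been cleared, the native fallback runs against a child the reaper may already have waited for.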

It doesn't end here, though.

Keep Boiling That Ocean

At this point we've managed to at least get interactive streams to the child process, and even if they're not selectable that's a big improvement over the standard API. We've managed to dig out a process ID and sometimes we can successfully wait for it with a normal waitpid function call. So out of our three goals (interactivity, selectability, lifecycle control) we're maybe close to halfway there.

Then the JDK engineers go and pull the rug out from under us.

The logic for UNIXProcess has changed over time. Here are the notable differences in the current JDK 7 codebase:

An Executor is now used to avoid spinning up a new thread for each child process. I'd +1 this, if the reaping logic weren't already causing me headaches.

The streams are now instances of UNIXProcess.ProcessPipeOutputStream and ProcessPipeInputStream. Don't get excited...they're still just buffered wrappers around File streams.

The logic run when the child process exits has changed...with catastrophic consequences.

Here's the new stream setup and reaper logic:

Now instead of simply notifying the UNIXProcess that the child has died, there's a call to processExited().

So when the child process exits, any data waiting to be read from its output stream is drained into a buffer. All of it. In memory.
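
The draining behavior amounts to something like the sketch below. This is my own approximation of the idea, not the JDK's code; `DrainDemo` and `drain` are illustrative names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class DrainDemo {
    /** Reads everything currently available from the stream into one byte array. */
    static byte[] drain(InputStream in) throws IOException {
        byte[] buffer = new byte[0];
        int length = 0;
        int available;
        while ((available = in.available()) > 0) {
            // Grow the array to fit what is waiting. The new size is an int, so this
            // cannot go past Integer.MAX_VALUE bytes; beyond that the sum overflows
            // and the copy fails.
            byte[] grown = Arrays.copyOf(buffer, length + available);
            int read = in.read(grown, length, available);
            if (read < 0) {
                break;                       // end of stream
            }
            buffer = grown;
            length += read;
        }
        return Arrays.copyOf(buffer, length);   // trim to the bytes actually read
    }
}
```

There is no cap and no streaming: whatever the child left in the pipe ends up on the Java heap.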

Did you launch a process that writes a gigabyte of data to its output stream and then terminates? Well, friend, I sure hope you have a gigabyte of memory, because the JDK is going to read that sucker in and there's nothing you can do about it. And let's hope there's not more than 2GB of data, since this code basically just grows a byte[], which in Java can only grow to 2GB. If there's more than 2GB of data on that stream, this logic errors out and the data is lost forever.

Oh, and by the way...if you happened to be devilishly clever and managed to dig down to the real FileChannel attached to the child process, all the data from that stream has suddenly disappeared, and the channel itself is closed, even if you never got a chance to read from it. Thanks for the help, JDK.

The JDK has managed to both break our clever workarounds (for its previously broken logic) and break itself even more badly. It's almost like they want to make subprocess launching so dreadfully bad you just don't use it anymore.

Never Surrender

Of course I could cry into my beer over this, but these sorts of problems and challenges are exactly why I'm involved in JRuby and OpenJDK. Obviously this API has gone off the deep end and can't be saved, so what's a hacker to do? In our case, we make our own API.

At this point, that's our only option. The ProcessBuilder and Process APIs are so terribly broken that we can't rely on them anymore. Thankfully, JRuby ships with a solid, fast FFI layer called the Java Native Runtime (JNR) that should make it possible for us to write our own process API entirely in Java. We will of course do that in the open, and we are hoping you will help us.

What's the moral of the story? I don't really know. Perhaps it's that lowest-common-denominator APIs usually trend toward uselessness. Perhaps it's that ignoring POSIX is an expressway to failure. Perhaps it's that I don't know when to quit. In any case, you can count on the JRuby team to continue bringing you the only true POSIX experience on the JVM, and you can count on me to keep pushing OpenJDK to follow our lead.

When you get a replacement written, it would be fantastic if that were available as a relatively dependency-free Java library. The lack of ability to do select-style I/O on process output is a problem for anyone trying to do efficient and scalable I/O in something that may run a native process and, say, pipe the output directly to a socket without extra internal buffering.

Have you seen a process write 1GB of data and terminate cause a problem? I think the actual OS pipe would only buffer 512 bytes or so. Of course, I think if you have a grandchild process that inherited the child's streams, you could be in trouble.

Out of curiosity, did you file any bug reports and/or discuss this on the OpenJDK mailing lists? JDK8 is about to get released. You should probably push for this early on during the JDK9 development process.