recording a programmer's thought

Posts Tagged ‘C++’

On a recent project, I have been working with JNI to wrap up a C++ library into Java. Like most usages of JNI, it is not for performance, but for compatibility.

JNI itself is a beast. It is quite a challenge to get both the performance and the correctness aspects of its APIs right. While its programming style is close to C, exceptions need to be checked frequently. Its API naming schemes are full of misleading traps that often lead to memory leaks.

It is so complicated that if you want to pass data through JNI, you want to stick with primitive types. That’s because passing complex data structures into JNI is rather painful.

Unfortunately, my project requires passing complex data structures from Java into C++. To solve this problem, I turned to Protobuf for help.

POJO over JNI

To get a taste of basic JNI, here’s an example. Say I want to pass the following data structure from Java into C++ through JNI.

To follow the best performance practice, JNI requires caching the class and method IDs. And to access each field, we need a separate JNI call to invoke the individual accessor.

At this point, it should be clear that passing complex data through JNI is non-trivial.

Protobuf over JNI

Alternatively, we can use Protobuf as the JNI messaging medium between Java and C++. This way, the communication channel through JNI is strictly through byte arrays. In this approach, the JNI complexity is reduced to a simple byte array access. Therefore, the code verbosity and the potential for programming errors are drastically reduced.

Here is the same example as above, but with Protobuf over JNI. First, we redefine CustomPojo into a Protobuf message.

Performance

Conceptually, protobuf over JNI should be more expensive. After all, the protobuf message is first encoded in Java, deep copied in JNI, and then decoded in C++. In practice, however, the protobuf-JNI approach is 7% faster than the POJO-JNI approach.

Pojo-JNI vs. Protobuf-JNI over 10 million JNI calls.

Thoughts

Protobuf is a good medium for passing complex data structures through JNI. Compared to the handcrafted reach-back JNI code, protobuf over JNI has far lower code complexity while having equal or better performance.

This approach is great for passing low-volume traffic of complex data over JNI. For high-volume traffic, it is best to avoid complex data altogether and stay within primitive types.

This only covers synchronous JNI calls. Asynchronous JNI callbacks are a topic (nightmare) for another day.

Recently, I have been playing around with reader-writer (RW) locks. I have never encountered RW locks in practice, but I have read that they can be inefficient, and often result in more harm than good.

Recall that a traditional mutex ensures that only one thread may enter a critical region. But if the critical region is written infrequently, it is possible to exploit the available concurrency by allowing multiple readers with RW locks.

So when exactly should an RW lock be used in place of a traditional mutex? To answer this question, I wrote a benchmark program to understand the scalability of RW locks.
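As a minimal sketch of the reader/writer split, assuming C++17's std::shared_mutex as a stand-in (not the Boost type benchmarked in this post) and a hypothetical SharedValue class invented for illustration:

```cpp
#include <mutex>
#include <shared_mutex>

// Hypothetical example class: a value that is read often and written rarely.
class SharedValue {
public:
    // Readers take a shared lock, so several threads may read at once.
    int read() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return value_;
    }

    // Writers take an exclusive lock, which blocks readers and other writers.
    void write(int v) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        value_ = v;
    }

private:
    mutable std::shared_mutex mutex_;
    int value_ = 0;
};
```

With a plain mutex, read() would serialize every caller. With the shared lock, only write() forces exclusivity.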

Boost shared_mutex Benchmark

Since C++ is my primary programming language at work, I started by picking on shared_mutex of the Boost threading library.

In my benchmark, I focus primarily on two variables – the writer frequency, and the hold time of the mutex.

For implementation, there are 4 worker threads (for my quad-core CPU) working with a critical region that approximates e. At each iteration, a thread has a certain probability of becoming a writer. My goal is to see the performance change as the writing frequency increases.

And to control the hold time of the mutex, each thread performs a certain number of iterations, called E, inside the critical region. As E becomes larger, the hold time of the mutex increases.

At E = 1, even when there is zero contention, the overhead completely wipes out any performance gain of the concurrent readers.
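The setup described above might look roughly like the sketch below. It is not the original benchmark: it assumes std::shared_mutex instead of Boost's shared_mutex, and the names (SeriesState, runBenchmark, the seeds) and the exact work done per iteration are made up for illustration. Writers extend the series e = 1/0! + 1/1! + 1/2! + ... by E terms; readers read the current sum E times.

```cpp
#include <chrono>
#include <mutex>
#include <random>
#include <shared_mutex>
#include <thread>
#include <vector>

// Shared state guarded by the lock: a running approximation of e.
struct SeriesState {
    double sum = 1.0;   // partial sum, starting with the k = 0 term
    double term = 1.0;  // last term added, 1/k!
    long long k = 0;
};

// Runs the workers and returns the elapsed time in milliseconds.
// writerProbability: chance that an iteration takes the exclusive lock.
// holdIterations: the "E" knob, i.e. work done while holding the lock.
double runBenchmark(int threadCount, int iterationsPerThread,
                    double writerProbability, int holdIterations,
                    double* finalSum = nullptr) {
    std::shared_mutex mutex;
    SeriesState state;

    auto worker = [&](unsigned seed) {
        std::mt19937 rng(seed);
        std::bernoulli_distribution isWriter(writerProbability);
        volatile double sink = 0.0;  // keeps the reader loop from being optimized away
        for (int i = 0; i < iterationsPerThread; ++i) {
            if (isWriter(rng)) {
                std::unique_lock<std::shared_mutex> lock(mutex);
                for (int e = 0; e < holdIterations; ++e) {
                    state.k += 1;
                    state.term /= static_cast<double>(state.k);
                    state.sum += state.term;
                }
            } else {
                std::shared_lock<std::shared_mutex> lock(mutex);
                for (int e = 0; e < holdIterations; ++e) {
                    sink = sink + state.sum;
                }
            }
        }
    };

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threadCount; ++t) {
        pool.emplace_back(worker, 1000u + t);  // arbitrary per-thread seed
    }
    for (std::thread& th : pool) {
        th.join();
    }
    const auto stop = std::chrono::steady_clock::now();

    if (finalSum) {
        *finalSum = state.sum;
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling runBenchmark(4, iterations, p, E) for a range of p and E values, and repeating the run with a plain mutex in place of the shared lock, would give the kind of comparison plotted in this post.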

E=1 shows the overhead of shared_mutex

At E = 50, the longer hold time pays off slightly under low contention. However, the performance degrades rapidly as contention increases.

E=50, longer hold times allow shared_mutex to scale slightly better.

As you can see, the results are very disappointing. Boost shared_mutex only offers a performance gain under extremely low contention with a large hold time. A large hold time is unrealistic in practice because most programmers are taught to minimize their critical regions.

I was curious to see if Windows Slim Reader/Writer (SRW) locks perform any better, so I added SRW to my benchmark.

SRW outperforms Boost mutex and shared_mutex even under the shortest hold time.

With longer mutex hold times, SRW degrades similarly to shared_mutex, but with lower overhead.

Although SRW offers similar scalability compared to Boost shared_mutex, it has lower overhead and outperforms Boost shared_mutex in almost all cases.

Final Thoughts

After looking into the implementation of Boost shared_mutex, I realized that its lock-free algorithm is complex and tracks many states. This implementation has so much overhead that it is impractical.

SRW has far lower overhead, and can be useful under low contention. Unfortunately, it is only available on Vista and later.

Neither mutex type offers a real performance advantage when contention goes beyond 2%. I speculate that Amdahl’s Law is playing a part here. The chart looks very much like the inverse of the speedup graph I plotted last year.

This solution is very clever. It implicitly converts the shared_ptr into “a pointer to member variable”. Based on the NULLness of the shared_ptr, it will either return 0 or a pointer to member variable of type T*.

With this implementation, shared_ptr manages to support the six ways of checking for NULL, avoids the dangerous comparisons, and has no integer promotion side effects.
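The idiom described above can be sketched roughly as follows. This is a simplified, hypothetical SimplePtr written only to show the conversion. It is not Boost's actual shared_ptr code, and it has no reference counting.

```cpp
// Hypothetical minimal smart pointer, only to illustrate the idiom.
template <typename T>
class SimplePtr {
public:
    // A pointer-to-member type: usable in a boolean test, but not
    // convertible to int and not usable in arithmetic.
    typedef T* SimplePtr::*unspecified_bool_type;

    explicit SimplePtr(T* p = 0) : px_(p) {}
    ~SimplePtr() { delete px_; }

    // Null inner pointer: returns a null member pointer (tests false).
    // Otherwise: returns the address of the px_ member (tests true).
    operator unspecified_bool_type() const {
        return px_ == 0 ? 0 : &SimplePtr::px_;
    }

    T* get() const { return px_; }

private:
    // Non-copyable, declared but not defined (pre-C++11 style).
    SimplePtr(const SimplePtr&);
    SimplePtr& operator=(const SimplePtr&);

    T* px_;
};
```

With this in place, `if (p)` and `if (!p)` compile, while something like `int i = p;` is rejected because a pointer to member does not convert to an integer.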

Is the boost solution perfect? Of course not. The code is confusing, and you can still do some crazy stuff.

shared_ptr<SomeClass> sp(new SomeClass);
// Grab the shared_ptr's "pointer to its member variable"
shared_ptr<SomeClass>::unspecified_bool_type ubt = sp;
// Extract the shared_ptr's inner pointer member in the most obscure way
SomeClass *innerPointer = sp.*ubt;

Final Thoughts

For such an innocent comparison, the depth of the solution is astonishing. It is amazing to see how far C++ library writers are willing to go to work around the nastiness of the language.

After figuring this out, I later learned that this technique is called the Safe Bool Idiom. (As usual, Google is useless if you don’t know what you are looking for.)

While stressing a TCP server application, I found a nasty bug with the IOCP server library.

After handling 100,000 connections or so, the TCP server stops accepting connections. The output from TCPView shows that clients are still trying to connect to the server, but the connections are never established.

I was able to verify that all existing connections are unaffected. Therefore, the IO completion port is still functional. So I concluded that it is not a non-paged pool issue, and has something to do with the handling of the accept completion status.

The Cause

The bug is simple, but it took half a day to reproduce. Here’s the code snippet that causes the problem.

See that innocent little “return” statement when setsockopt() fails? I foolishly concluded that “This shouldn’t happen”. And naturally, since it should never happen, I never thought about properly handling the error case.

Apparently in the real world, some connections come and go so quickly that immediately after a connection is accepted, it has already been disconnected. setsockopt() then fails with error 10057 (WSAENOTCONN), and the return statement causes the “accept chain” to break.

The fix is to remove the “return” statement and move on with life.
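To make the "accept chain" idea concrete, here is a platform-neutral sketch. It does not use Winsock or the actual library code: AcceptLoop, updateAcceptContext, and postAccept are hypothetical stand-ins, with updateAcceptContext playing the role of the setsockopt() call. Counting the failed connection as dropped is also an assumption of this sketch. The point is only that the next accept is posted whether or not that call succeeds.

```cpp
#include <functional>

// Hypothetical stand-in for the server's accept handling; none of these
// names come from the actual library.
struct AcceptLoop {
    std::function<bool()> updateAcceptContext;  // stands in for setsockopt()
    int pendingAccepts = 1;  // accepts currently posted
    int started = 0;         // connections handed on to the server
    int dropped = 0;         // connections that were already gone

    // Stands in for posting the next asynchronous accept.
    void postAccept() { ++pendingAccepts; }

    void onAcceptCompleted() {
        --pendingAccepts;  // the completed accept has been consumed
        if (updateAcceptContext()) {
            ++started;
        } else {
            // The client has already disconnected. Note the failure,
            // but do NOT return early here.
            ++dropped;
        }
        postAccept();  // always re-arm, on success or failure
    }
};
```

If the failure branch returned before postAccept(), pendingAccepts would eventually reach zero and the server would silently stop accepting, which matches the symptom described above.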

Others

Along with this fix, I also removed an unnecessary event per Len Holgate’s suggestion. However, I have not yet removed the mutex in ConnectionManager. This requires a slight redesign, and a bit more thought.

I can see myself maintaining this library for a while, so I created a Projects page to host the different versions.