Netty 4 Reduces GC Overhead by 5x at Twitter

The Netty Project released the first version of Netty 4 in July. Its significant performance improvements come primarily from reducing garbage collection overhead. Integrating Netty 4 at Twitter has cut GC overhead roughly fivefold, though at some cost.

Trustin Lee, the founder of the Netty Project and a software engineer at Twitter, has been writing network application frameworks since 2003. The first public release of Netty was in June of 2004. The project's homepage describes Netty as "an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers and clients."

Cloudhopper sends billions of SMS messages every month to hundreds of mobile carriers all around the world using Netty.

Netty includes an implementation of the reactor pattern and is also at the core of the Play Framework. Play, Grails and many other web frameworks are embracing a WAR-less web app pattern that allows tighter integration with the underlying HTTP server. Using a server like Netty under the covers makes asynchronous programming much easier. Asynchronous programming and non-blocking I/O are at the core of The Reactive Manifesto. InfoQ wrote about this emerging pattern in Reactive Programming as an Emerging Trend.

Netty 3 used Java objects to represent I/O events. Lee remarks that:

"This was simple, but could generate a lot of garbage especially at our scale."

In the new Netty 4 release, changes were made so that instead of short-lived event objects, methods on long-lived channel objects are used to handle I/O events. There is also a specialized buffer allocator that uses pools.
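The difference between the two dispatch styles can be illustrated with a minimal, stdlib-only sketch. The names here (`ReadEvent`, `InboundHandler`, `onRead`) are hypothetical and only loosely modeled on the idea described above; this is not Netty's actual API.

```java
import java.util.function.Consumer;

/** Hypothetical sketch contrasting two dispatch styles; not Netty's API. */
final class EventDispatchSketch {

    /** Object-per-event style: a short-lived wrapper is created for every read. */
    static final class ReadEvent {
        final byte[] payload;
        ReadEvent(byte[] payload) { this.payload = payload; }
    }

    /** Dispatches each message wrapped in a new event; returns the wrappers created. */
    static int dispatchWithEventObjects(byte[][] messages, Consumer<ReadEvent> handler) {
        int eventObjects = 0;
        for (byte[] message : messages) {
            handler.accept(new ReadEvent(message)); // becomes garbage right after the call
            eventObjects++;
        }
        return eventObjects;
    }

    /** Method-per-event style: one long-lived handler with a dedicated method. */
    interface InboundHandler {
        void onRead(byte[] payload);
    }

    /** Dispatches each message by calling the handler directly; returns the wrappers created. */
    static int dispatchWithMethods(byte[][] messages, InboundHandler handler) {
        for (byte[] message : messages) {
            handler.onRead(message); // no per-event wrapper object
        }
        return 0;
    }
}
```

In the first style the number of wrapper objects grows with the number of events; in the second it stays at zero regardless of traffic, which is the property the article attributes to Netty 4.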

...Netty 3 creates a new heap buffer whenever a new message is received or a user sends a message to a remote peer. This means a 'new byte[capacity]' for each new buffer. These buffers caused GC pressure and consumed memory bandwidth: allocating a new byte array consumes memory bandwidth to fill the array with zeros for safety. However, the zero-filled byte array is very likely to be filled with the actual data, consuming the same amount of memory bandwidth. We could have reduced the consumption of memory bandwidth to 50% if the Java Virtual Machine (JVM) provided a way to create a new byte array which is not necessarily filled with zeros, but there's no such way at this moment.

With Netty 4, the code defines a finer-grained API that handles each event type through a dedicated method instead of creating event objects. It also has a new buffer pool implementation, a pure Java variant of jemalloc (the allocator also used at Facebook). Because pooled buffers are reused, memory bandwidth is no longer wasted filling fresh buffers with zeros. However, because the pool doesn't rely on GC, you have to be careful about leaks: if a handler forgets to release a buffer, memory usage can grow without bound.
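A rough sketch of why pooling trades GC pressure for leak risk, assuming a much simpler design than Netty's jemalloc-style allocator: a single free list of fixed-size arrays plus an explicit reference count. `Pool`, `Buf`, `acquire`, `retain` and `release` are hypothetical names chosen for illustration, the sketch is not thread-safe, and it still allocates a small wrapper per acquisition.

```java
import java.util.ArrayDeque;

/** Hypothetical, stdlib-only sketch; these class names are not Netty's API. */
final class PooledBufferSketch {

    /** Recycles fixed-size byte arrays. Not thread-safe, for brevity. */
    static final class Pool {
        private final ArrayDeque<byte[]> free = new ArrayDeque<>();
        private final int capacity;
        private int allocations; // times we had to fall back to 'new byte[]'

        Pool(int capacity) { this.capacity = capacity; }

        Buf acquire() {
            byte[] array = free.poll();
            if (array == null) {
                array = new byte[capacity]; // zero-filled by the JVM, but only this once
                allocations++;
            }
            return new Buf(this, array); // a recycled array keeps its old contents
        }

        int allocations() { return allocations; }
        int idle() { return free.size(); }
    }

    /** A buffer whose lifetime is governed by an explicit reference count. */
    static final class Buf {
        private final Pool pool;
        private byte[] array;
        private int refCnt = 1; // the acquirer owns the first reference

        private Buf(Pool pool, byte[] array) { this.pool = pool; this.array = array; }

        byte[] array() {
            if (refCnt == 0) throw new IllegalStateException("buffer already released");
            return array;
        }

        Buf retain() {
            if (refCnt == 0) throw new IllegalStateException("buffer already released");
            refCnt++;
            return this;
        }

        /** Returns true when the last reference is dropped and the array goes back to the pool. */
        boolean release() {
            if (refCnt == 0) throw new IllegalStateException("buffer already released");
            if (--refCnt > 0) return false;
            pool.free.push(array);
            array = null;
            return true;
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool(256);
        Buf first = pool.acquire();
        first.release();              // array returns to the pool
        Buf second = pool.acquire();  // reuses the same array, no new allocation
        Buf leaked = pool.acquire();  // never released: its array is lost to the pool
        second.release();
        System.out.println("allocations=" + pool.allocations() + " idle=" + pool.idle());
    }
}
```

Every buffer that is acquired but never released forces the pool to allocate a fresh array later, so a handler that forgets `release()` makes the pool's footprint grow with traffic instead of staying flat.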

These changes are not backwards compatible with Netty 3, but in Lee's benchmark Netty 4 produced about five times less garbage and paused for GC about five times less often.

Lee writes:

We compared two echo protocol servers built on top of Netty 3 and 4 respectively. (Echo is simple enough such that any garbage created is Netty's fault, not the protocol). I let them serve the same distributed echo protocol clients with 16,384 concurrent connections sending 256-byte random payload repetitively, nearly saturating gigabit ethernet.

According to our test result, Netty 4 had:

Five times less frequent GC pauses: 45.5 vs. 9.2 times/min

Five times less garbage production: 207.11 vs. 41.81 MiB/s
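As a quick check on the "five times" figures, the ratios can be worked out directly from the numbers quoted above:

```latex
\[
\frac{45.5\ \text{pauses/min}}{9.2\ \text{pauses/min}} \approx 4.95,
\qquad
\frac{207.11\ \text{MiB/s}}{41.81\ \text{MiB/s}} \approx 4.95
\]
```

Both ratios come out just under 5x.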

Lee mentions there are some barriers to adoption of Netty 4 at Twitter, namely buffer leaks and a complex core. The project hopes to add more features, including HTTP/2, asynchronous DNS resolution and HTTP and SOCKS proxy support for the client side.

Yahoo Engineering has a similar article on how Netty has helped them double the speed of their Storm clusters. In Making Storm fly with Netty, Bobby Evans writes:

At Yahoo we eat our own dog food but before making Netty the default messaging layer for our Storm clusters I needed some numbers to see how it compared to zeromq, which is the current default. To do this I needed a benchmark that could make Storm’s messaging layer cry uncle, so I wrote one. It is a simple speed-of-light test that sees how quickly Storm can push messages between the different bolts and spouts. It allows us to launch multiple topologies of varying complexity that send fixed sized messages.

Evans shows that with a small test (with no resource contention), Netty is much faster (40-100%) than zeromq. With a larger test, he runs into performance issues, but reducing the number of threads solves the problem.

Netty's default setting is not that great for lots of small messages even when it is the only one on the node. But when we restrict it to a single thread we are able to get between 111% and 85% more messages per second than zeromq and after that the network saturates again.

