This heterogeneous mix of tasks and data means that our application can be particularly sensitive to concurrency bottlenecks in both our code and third-party libraries. While Evernote activity isn’t particularly “bursty” compared to some web services, the daily variation across our 95 shards means that even infrequent chokepoints will hit some shards from time to time.

When our monitoring systems detect that a particular shard is underperforming, we try to capture as much information as possible about the current state of the server without introducing more problems. One low-tech tool is “sudo killall -3 java”, which dumps the current stack trace for every Java thread to standard output. We can then inspect the state of each thread for signs of problems. Here’s a fun example of the sort of bottleneck we find by inspecting enough stack dumps:
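The `kill -3` (SIGQUIT) dump goes to the JVM's stdout, but the same information is available in-process through the standard `ThreadMXBean` API. A minimal sketch (the class name is ours, not part of any tooling we describe above):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StackDumper {
    // Builds a text dump of every live thread's stack, similar in
    // spirit to what "kill -3" (SIGQUIT) makes the JVM print.
    public static String dumpAllStacks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllStacks());
    }
}
```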

On a regular basis, we’d find a number of threads in a choking server all waiting to convert a byte[] to a String using a named encoding, or vice versa. We’d find blocked threads originating in code from Tomcat, MySQL Connector/J, GWT, SAX, Thrift, and the JRE itself, all stuck in the same charset-lookup frame deep in the JRE.
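The pattern those threads shared is easy to reproduce: many threads converting byte[] to String by encoding *name*, which funnels every conversion through the same lookup. A minimal sketch (thread and iteration counts are arbitrary; the class is our own illustration, not code from any of the libraries named above):

```java
import java.io.UnsupportedEncodingException;

public class CharsetContention {
    // One worker's loop: every iteration converts by encoding name,
    // which goes through the JRE's synchronized charset-name cache.
    static String convertRepeatedly(byte[] payload, int iterations)
            throws UnsupportedEncodingException {
        String s = null;
        for (int i = 0; i < iterations; i++) {
            s = new String(payload, "UTF-8"); // the contended path
        }
        return s;
    }

    public static void main(String[] args) throws Exception {
        final byte[] payload = "some request body".getBytes("UTF-8");
        Thread[] workers = new Thread[8]; // arbitrary thread count
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        convertRepeatedly(payload, 100000);
                    } catch (UnsupportedEncodingException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("all workers finished");
    }
}
```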

After reading the JRE code, we found that the concurrency bottleneck is caused by a simple synchronization block in the [ironically-named?] FastCharsetProvider.charsetForName method, which looks up a cached Charset for a String name (like “UTF-8”). The use of Java’s ‘synchronized’ keyword to protect this in-memory cache prevents two threads from corrupting the cache data structures, but it also means that only one thread can look in the cache at a time.
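Simplified, the pattern is the classic synchronized-cache idiom. This is a sketch of the shape of the problem, not the actual JDK source:

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a synchronized cached-lookup: thread-safe and
// correct, but the single lock serializes every lookup, even cache hits.
public class SynchronizedCharsetCache {
    private final Map<String, Charset> cache = new HashMap<String, Charset>();

    public synchronized Charset charsetForName(String name) {
        Charset cs = cache.get(name);
        if (cs == null) {
            cs = Charset.forName(name); // stand-in for the real lookup
            cache.put(name, cs);
        }
        return cs; // only one thread at a time ever reaches this point
    }
}
```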

Our fix: replace all relevant byte[] <-> String transformations across our own codebase. (Including such unpleasantness as removing all use of the JRE classes URLEncoder/URLDecoder, which hard-code their own unpatchable String encodings.)
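One shape such a replacement can take: look the Charset up once and pass the object instead of the name, using the Charset-typed overloads added in Java 6. A hedged sketch (the Utf8 helper class is our own illustration, not code from our actual codebase):

```java
import java.nio.charset.Charset;

public class Utf8 {
    // Resolve the charset exactly once; conversions then pass the
    // Charset object, bypassing the synchronized by-name lookup.
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    public static String decode(byte[] bytes) {
        return new String(bytes, UTF_8); // String(byte[], Charset), Java 6+
    }

    public static byte[] encode(String s) {
        return s.getBytes(UTF_8);        // String.getBytes(Charset), Java 6+
    }
}
```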

Short version: Large-scale concurrency is kind of hard. Java’s ConcurrentHashMap is super awesome for in-memory caching.
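The ConcurrentHashMap approach can be sketched like this: reads take no lock, so once a charset is cached, any number of threads can look it up in parallel. Again, this is an illustrative class of our own, not the JDK's code:

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Same cache as before, but reads never block: ConcurrentHashMap.get
// is lock-free, so lookups of an already-cached charset run in parallel.
public class ConcurrentCharsetCache {
    private final ConcurrentMap<String, Charset> cache =
            new ConcurrentHashMap<String, Charset>();

    public Charset charsetForName(String name) {
        Charset cs = cache.get(name);
        if (cs == null) {
            cs = Charset.forName(name); // stand-in for the real lookup
            Charset prev = cache.putIfAbsent(name, cs);
            if (prev != null) {
                cs = prev; // another thread won the race; use its value
            }
        }
        return cs;
    }
}
```

The putIfAbsent dance keeps the cache consistent without any locking on the read path: two threads may both compute the value on a miss, but every caller ends up with the same cached instance.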

We are experiencing the same server issues as you guys with FastCharsetProvider. For us, it is mainly from Cassandra’s client library Hector/Thrift, and the MySQL Connector. I am surprised I don’t see more uproar about this.

I think that patching these libraries might be a bit of a red herring though, in the sense that possibly the problem is way too much contention. The solution being more servers, and fewer threads that take longer to process requests. It is so easy to let ExecutorServices get out of control and assume that more threads == better.

For us I have a feeling that removing the charset contention is going to point to massive spikes in the number of threads we are using.