I was recently building new server infrastructure for a new product. The server has an internal architecture that's novel to me: it seeks never to voluntarily block threads or cause context switches - but at the same time it is highly multi-threaded and creates new threads whenever it needs to.

The server's nature isn't totally important, but you can think of it along the lines of a twitter server. It gets messages from one place, collates them, and sends them to (potentially many) other places. I suppose it could be used directly as a "twitter server" but of course, they already have those, and my startup's focus is rather orthogonal to that.

I usually write servers in Java on linux so keep that in mind for the ideas raised here. At Google, I worked on servers in C++ and Java. And especially after that experience, I'll stay with Java. To hopefully avoid language wars, the #1 reason I write in Java is that I'm way better at Java than I am at C++ or Ruby or Python or whatever. In other words, it was a practical (and definitely reasonable) choice in this case.

Also, the benchmarking here isn't contrived - I'll save the details, but I truly am asking the server to do precisely what it would do in the wild except at a higher rate.

Now, given the nature of the server, I needed 2 things to help benchmark. A "producer" (a system that produces messages) and a "consumer" (a system that consumes them). The producer produces a sequenced message set, and the consumer verifies it receives every message intact *and* in-order.
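As a sketch of what the consumer's verification involves (illustrative only - the real producer/consumer are part of benchmark code not shown here):

```java
// Illustrative sketch of the consumer-side verification described above
// (not the actual benchmark code): every message carries a sequence number,
// and the consumer confirms each one arrives intact *and* in order.
public class SequenceChecker {
    private long expected = 0;

    /** Returns true iff this is the next message in sequence with the right payload. */
    public boolean accept(long sequence, byte[] payload, byte[] expectedPayload) {
        if (sequence != expected) {
            return false;  // dropped, duplicated, or reordered message
        }
        if (!java.util.Arrays.equals(payload, expectedPayload)) {
            return false;  // corrupted message
        }
        expected++;
        return true;
    }
}
```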

I chose message sizes somewhat smaller than a tweet (msg size = 100 bytes) to induce a reasonable amount of activity in the server (the server's purpose is not necessarily large messages, and large messages tend to just saturate bandwidth without causing contention or server CPU activity). My custom protocols add a few bytes and the TCP/IP headers add 40 bytes - so overall I'm guessing I'm averaging about 200 bytes per message counting everything.

Producer -> Server -> Consumer

The server system is pretty flexible and in effect, location transient. That is, a server process might live on one machine today and find itself on another machine tomorrow. Or a few new server processes may join the mesh. Needless to say, the idea is to create a scalable, flexible system.

One ramification of that is that it's possible, and even very likely, that some producers, servers, and consumers could be on the exact same machine. Socket communication over localhost changes a lot of assumptions. Firstly, the kernel can short-circuit much of the TCP path (Solaris calls this "TCP fusion"), reducing overhead significantly. Secondly, there is no effective bandwidth limitation - there isn't a real network involved; it's a virtual (and fast) one.

Running everything on the same machine, I would expect this benchmark to be CPU-bound, given that I/O is now virtual and effectively a CPU operation. Like I said, loopback is a real scenario I need, but I think I initially benchmarked on a single machine mostly as a matter of convenience.

So, with all 3 processes running on the same machine, I ran the test. The CPU was an Intel Core i7 920. It has 4 hyperthreaded cores that act like 8. (Note: I've tested all of this on a Core 2 Quad (non-i7) CPU and on Core Duos, and the results are effectively the same.)

Here's the image of htop during the test with all 3 processes running. (if you run linux and use top, upgrading to htop is highly recommended).

Don't get hung up on reading the numbers or text. The graph in the upper left gives you a sense of how busy the CPUs are. In this picture, they're all at least a little busy. That's no surprise given we're running many tens of threads.

Notice that none of the cores are "maxed". This benchmark showed the server receiving and re-sending about 120,000 messages per second (that's 120k from producer to server and the same 120k from server to consumer - so 240k "message transmissions" but only 120k messages - analogous to queries-per-second for a webserver).

Why aren't the active cores maxed?

It occurred to me that I was running 3 CPU-bound processes on the same machine and that the processes might be stepping on each other's toes. It's possible that if the server is running on core 4 one second and the producer is running there the next, the level 1 cache of that core could be ruined for the server the next time it ambles over there.

The simple solution was to pin each process to its own cores. In linux, you do this with the taskset command.

taskset takes a bitmask value to assign CPUs - so I ran (in separate xterms):
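The exact commands didn't survive in this post, but with the server on six cores and the producer and consumer on one core each, they would have looked something like this (the producer and consumer class names are illustrative; the server invocation matches the single-core one shown later):

```shell
# Reconstruction, not the original commands: taskset takes a hex bitmask
# of allowed CPUs. 0x3F = binary 00111111 = cores 1-6 for the server;
# 0x40 = core 7 for the producer; 0x80 = core 8 for the consumer.
taskset 0x3F java manyfeedServer
taskset 0x40 java producer
taskset 0x80 java consumer
```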

and off they went. The server is highly multithreaded so I gave it 6 cores. The producer and consumer each got one.

The result? 270,000 messages/second! Wow. If my cache assumption was right (and I'm not claiming to know if it was) - it REALLY worked. One way or another though - something worked - we got better than a 2x speedup.

So you might be thinking the moral of this story is:

1) If you're in linux, install htop.

2) If you're sharing a computer amongst CPU-bound processes, isolating the processes might (and very well "might not") be beneficial.

And ok, those are fine ideas. But what bugged me was that htop still showed no CPUs maxed. Again, what was slowing my application down before it ran out of CPU power?

I then tried limiting the server to 2 (hyperthreaded) cores. (I also tried keeping the producer and consumer on the same hyperthreaded "core" and given that I had cores to spare, also tried separating them, but the result was the same).

You can see that cores 1 & 2 are plenty busy. Cores 3,4,5,6 are idle as expected. Core 7 (the producer) is pretty busy and so is Core 8 (the consumer).

530k msgs/second is nothing to sneeze at but.. um.. again, no core is maxed. Why not? What's the bottleneck?

Obviously.. the last test is to throw the whole server on ONE cpu. Despite the fact that I very purposely and meticulously coded this server to be highly multi-threaded, fewer CPUs seem to make it happier. I am a conservative thread-safety-inducer. That is, I'm only really dangerous with firearms and synchronized blocks. But I'm by no means afraid to use the latter when needed.

taskset 0x01 java manyfeedServer

And finally, we hit 100% utilization on CPU 1 at 790,000 messages per second.

Here's the TL;DR of this blog post:

Some multi-threaded java applications apparently run faster on 1 core than on multiple cores.

Note the very non-committal phrasing in an attempt to make this a rather defensible statement. A graph showing the ManyFeed server's performance per number of cores:

So.. if you are a CPU or JVM guru and want to tell me your thoughts on what's going on, I'd love to hear it.

A long time ago, I benchmarked a bunch of different CPU types doing contended memory stores. To my surprise, a little single-core Dothan processor did amazingly well in a few tests and I had no idea why. The discussion above is really the same idea. (I've since retested that benchmark on the Core i7 - and it has very different characteristics - probably a testament to the i7's new memory architecture.)

Note that if you have a pre-I7 Intel multi-core cpu, you can reproduce my "1 core runs faster" results here using the code in that blog post (note: this code *is* a micro-benchmark, built to create a situation that causes extreme and unrealistic threading competition - for more details read that blog post). This gives a clearer picture of what's going on. My theory is that with 2 cores and 2 threads - each gets a core and every store operation competes for the memory bus in order to do its operation. Threads spend lots of time "competing" and less time doing real work. On one core (and 2 threads) - only one thread runs at a time - so every time it tries to store in memory, it can.
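The shape of that kind of micro-benchmark is roughly the following (my sketch in the spirit of the linked post, not its actual code): N threads hammer a single shared volatile field, so on multiple cores every store has to win ownership of the cache line, while on one core the stores proceed unimpeded.

```java
// Sketch of a contended-volatile-store micro-benchmark (illustrative, not the
// original code from the linked post). With 2+ threads on 2+ cores, every
// store competes for the cache line / memory bus; pinned to 1 core, only one
// thread runs at a time and each store just succeeds.
public class VolatileStoreBench {
    static volatile long shared;  // the contended field

    public static void main(String[] args) throws InterruptedException {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        final long storesPerThread = 10_000_000L;

        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (long j = 0; j < storesPerThread; j++) {
                    shared = j;  // volatile store: must be made visible to all cores
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d thread(s): %.0f stores/sec%n",
                threads, threads * storesPerThread / secs);
    }
}
```

Running it as `taskset 0x01 java VolatileStoreBench 2` versus an unrestricted `java VolatileStoreBench 2` compares one core against many.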

The Core i7 is different (and maybe AMD CPUs too) - it doesn't suffer from this memory contention problem at all. (Although - given that my server still runs (way, way) better on 1 CPU even on the i7 - maybe the contention is elsewhere in my server.)

Here's that benchmark running for Static Volatile Store on a Core Duo CPU with 2 cores and again with 1 core. No performance loss on just one core, otherwise same exact run.

(numbers along the bottom are the number of threads in that run - it's not until we have 2 threads that contention creeps in and hurts us).

Oh.. and how does my little nifty manyfeed server do on a real network? Testing on 1 CPU on my local 1Gbps LAN I get about 300,000 messages per second.

With some quick (and hopefully not silly) math, this looks about right.

1 Gbps is roughly 125 MB/s; at ~200 bytes per message (counting headers, per my earlier estimate) that's about 625,000 messages per second of wire capacity - half for the producer's sends (to the server) and half for the server's sends (to the consumer) = 312,500 each.

On a real LAN, the server saturates bandwidth (i.e. becomes IO-bound) and that's no surprise (i.e. until I have access to 10Gbps networks, I don't need to speed up my server any more).

Interestingly I tested this idea of "1-core'ing-it" using apache tomcat and apache bench and saw no improvement. I also saw far fewer qps (loading a single, tiny, web-page over and over) and 100% core utilization even on multi-core. It'd be my guess that tomcat isn't contention-bound but truly cpu-bound.

In other words, don't follow this path without testing it yourself. The good news is that it's extremely easy to try and you don't even have to change a line of code. Just try "1-core'ing" it.

55 comments:

Aren't pairs of "logical" CPUs actually mapped to a single core? So running producer & consumer on 7 & 8 means they're running on the same physical core, obviously competing for resources, and you'll never get 100% utilization on both at once. Same with running the multithreaded server on cores 1 & 2. Try using only odd-numbered or only even-numbered cores. E.g. server on 1 & 3, producer on 5, consumer on 7.

There's not a lot of info there to really know what's going on. Lock contention is a likely suspect, perhaps driven by GC. If you're really CPU-bound, you can peg 8 cores in a blink if you're building on stuff that has sane concurrency/parallelism semantics (e.g. use Clojure agents, or pmap, or similar).

If you're doing manual locking, you're just a little crazy these days given the better alternatives, IMO. Beyond that, peeking into what the GC is doing is worth a shot.

Multithreading has the overhead of switching between threads, which is quite expensive, so of course it is going to be slower. However, I bet if you leave it running on a bunch of cores it will scale much better.

Sort of echoing sep332's and mikeh's sentiments here. From my (incomplete) understanding, "hyperthreading" basically allows you to run overlapping bits of code on the same core. This can be a win if one bit of code is waiting on IO, or if you can use different units on the same core concurrently (arithmetic op + FLOP). If the pieces of code are doing similar or the same things, though, you're basically asking for contention.

Also, is it possible the latencies are asymmetric? Perhaps just run the simple benchmark pairwise on virtual cores 1, 2, and 3? (3! == 6 runs, total)

Actually Erlang is not very widely used by telcos. Ericsson uses other languages and tools.

One thing you should know if you program in Java is that unless you use the Solaris OS, Java threads are NOT native, i.e. they run on the same processor, not on different processors. That's one of the many shortcomings of the JVM.

I have seen behaviour like this if your code basically locks all threads in hotspots. If you're starting a lot of threads and they basically just end up blocked/waiting you'll end up impeding total performance pretty quickly. Increasing performance with multithreading can be real hard.

You really *must* use a profiler to determine where you're blocking your threads. It's almost never where you think it is ;)

The problem is clearly **NOT** memory contention. The OS isn't aware of time a thread spends waiting on physical memory, and so that time would end up counted as active CPU time, not idle time. (*virtual* memory if a hard page fault occurs is another story)

Best guess is that you have a lock contention issue. Even if your code isn't using locks explicitly, it's likely that the APIs you are using have locks under the covers, including the socket APIs for your server.

I'm thinking you may be facing overhead of data being transferred between the L1 caches of cores used by different threads. If the server was written so that the thread that received a message handled sending it out you might avoid that overhead.

@sep332 - I mentioned above that I tested this on a non-hyperthreaded CoreQuad and the results were the same. But you're right, HT just clouded the issue - when I finally wrote the post it just happened to be the machine I was working on.

@chas - I have no need to optimize this, but you're right, peeking into the GC could yield some interesting info.

@ERLANGERs - As pointed out, nothing against Erlang. I have a project to finish and didn't want to be impeded by learning a new language.

@JanRychter - interesting. Did not know that.

@Eric - this is very insightful. I will be correcting the post (after a day of comment accumulation) to reflect your comment. Thanks.

@PeterH - as I said, I avoid unnecessary context switches. In the general case, I already do exactly what you suggest. In cases where there are high numbers of consumers, the primary thread will ask for help (but that did not occur in my testing above)

@Parallax - maybe I'll test that someday. Of course I looked at rabbitmq but I felt my needs were somewhat outside what RMQ accomplishes (then again, I may find myself wrong and wish I had used that :)

Your issue is almost certainly not memory contention per se. A stalled load/store to memory surfaces itself as excessive CPU utilization. Conceptually, it is an assembly instruction that takes many hundreds of cycles to complete and individual instructions are not preempt-able. Thus, you only see <100% CPU usage if the process/thread is not actually runnable and instead the core is spinning in the OS's idle loop.

Thus, your threads must be blocking at a point where they explicitly yield to the OS, either when doing I/O or when waiting on a lock/synchronized block. Depending on the number of threads, it's possible the contention lies in the OS's I/O subsystem, but this seems unlikely. More likely is that when your app is running on many threads, there is contention for locks and threads are being put to sleep waiting for those locks. Then, possibly by fault of the OS, there is a lag time between when a lock is released by a thread and when those waiting for it are returned to the OS run queue.

In short, I think you need to dig into what locks you're taking and conditions you're signaling in the course of processing each message. That your CPUs are spending time idling indicates lock contention over memory bandwidth contention.

Not the first person to suggest it, but this seems highly suggestive of context switching to me. It's a cost I think most people discount too easily. And it's not surprising to me that the fewer the cores, the higher the transaction rate, particularly if you have tens of threads. The way I see it, the more threads you have wanting to do work, the greater the likelihood that the next time your thread gets scheduled, it will end up on a 'random' core. If the number of cores is low, though, the odds improve that it ends up on the same core it was running on before. Unless the rest of the threads are just hanging out twiddling their thumbs, they'll want some CPU time, and Java is not (yet) very good at keeping threads and data colocated. This is the Big Challenge for highly parallel systems, and it's why you see groups such as the one at Rice University working on the Habanero Project, where they are building core affinity into their language and VM. Particularly with Java, I think it becomes a challenge because even if you have only as many application threads as cores, you still have those other VM threads beyond your control, which can knock one or more of your threads from their perch in the name of fairness. Java 7 is adding some NUMA logic to the GC to help optimize memory placement of objects, but I'm not sure what improvements there might be in thread placement/scheduling.

"Both prefetch settings do decrease the miss rate for the L2/L3 cache when they are enabled but they consume bandwidth on the front-side bus which can reach capacity under heavy load. By disabling both prefetch settings, multi-core setups achieve generally higher performance and scalability."

(2) The comment that memory contention should still cause CPU utilization *should* be true. I can imagine ways in which it might break, but I don't think those brokenesses are in current kernels.

(3) I'm a CPU guy, so I would suggest using EMON - on Linux, accessible through tools such as lkml.org/lkml/2009/6/6/149, oprofile, perfctr and perfmon2.

I would want to compare number of cache misses (from all important cache levels: L1, L2 (MLC), L3 (LLC)), bus utilizations, etc.

E.g. if the "bus" (QPI on Nehalem) is above 80% utilized, then multiple processors probably won't help, and will probably hurt.

Paradoxically, if you have done a good job of parallelizing your code, making the threads NOT contend, then you may have shot yourself in the foot data-cache-wise - since it is possible that the footprint of 1 CPU's worth of threads fits in the cache, while the footprint of the larger number of concurrently running threads in a multiple-processor system does not. Your attempts to let each thread run as long as possible would exacerbate this.

Regarding hyperthreading - HT was specifically designed to maximize processor utilization when memory/cache cannot keep up with computation.

Whenever a thread stalls due to reading from memory (writes may be easier - due to write back queues), the processor automatically switches to another thread running on the 2nd virtual processor. So overall utilization of the single physical processor will be higher.

You're obviously hitting some shared resource, which is triggering cache-line bouncing, and you're doing so little work outside of moving packets around, that the inter-CPU traffic which cache-line-sharing triggers is dominating the runtime.

Are you sure there are no shared resources? Maybe you're creating threads, in which case you're hitting java's thread create lock, or maybe you're synchronising on something?

I've found the new classes in java.util.concurrent to be extremely useful in writing good parallel code.

The neat thing about messaging programs is that they are horizontally scalable. You can scale producers and consumers blindly.

What the benchmark is showing seems to be protocol overhead. If you want a really fast 1 producer to 1 consumer - just hand off without messaging. Otherwise - you now have evidence you have a protocol wall. Your next step is likely either to change the protocol or go into protocol analysis (which is a path to java api and jvm and truss/strace/dtrace hell).

You know how to validate this point - simply check the bare messaging without a payload.

I ran into a similar scenario a few months ago when working on a messaging server. Cross-thread contention is amazingly difficult to reduce. I just posted Scaling Up With HawtDispatch, which outlines some of the techniques that HawtDispatch uses to help avoid the contention.

Posted this on the other benchmark blog post by mistake, should've been here!

a) There is a lot that can be going on behind the scenes in the JVM here to start with. For example, similar to your other benchmark, the JVM will know that it is running on a single-core CPU and can skip emitting lock prefixes on some instructions. It can even turn memory barriers into no-ops, etc. See: http://www.infoq.com/articles/memory_barriers_jvm_concurrency

And according to Dave Dice (JVM guy at Sun): "On MP IA32 and AMD64 systems we have to use locked atomic instructions (e.g., lock:cmpxchg) to acquire locks. On uniprocessors we can do without the lock: prefix. lock:ed instructions have considerable latency on some Intel processors (>500 cycles)."

So running on most x86 one-core machines (also when just allowing the process to use one core?) it will be a lot quicker to do synchronization.

b) Hyperthread cores are paired with a real core and usually provide faster context switching etc but that's all. No additional ALUs etc.

c) The CPU itself, such as: c1) An i7 shares the rather large and quick L3 cache among cores. Running on one core only, you are effectively running with ONE very big L3 cache instead of it being split up between cores.

c2) Running on a few cores or just one core, the i7 cpu overclocks the used cores.

c3) The i7 runs synchronisation primitives a lot faster - LOCK CMPXCHG instructions in 60% of the time compared to the previous generation, for example.

What happens if you run multiple independent copies of the server as separate Java processes on different CPUs? If your architecture supports it, this would make it much easier to scale and avoid all locks (as these are then handled on application layer and not on hardware/CPU layer).

At rates on the order of 100s of MB per second, CPU architecture is so far from being the problem it's not even funny. I'm amazed at some of the comments being made here. Discuss, read, and profile the software stack before you start making random wild guesses about hardware.

We can push >100MB/s using reputedly slow languages like ruby, spending over 60% of their time in the GC scanning heap temporaries and the stack over and over and over, being so inefficient it's laughable. The CPU can do that. Java should be far, far more efficient, and indeed in tests it's easily provable to be so.

The PEAK rate described here is in this range and is so much slower than memory bandwidths (at least one order of magnitude, if I remember my DDR2-based FSB speeds correctly).

The cache is irrelevant here.

On my slow-ass-by-comparison macbook here, let's do a quick test running off a volatile volume:

It also just occurred to me that some "bright spark" will chime in that I'm benching a single IO locked process in volatile memory with the prior examples I gave in comments, so to be sure I'm invoking at least two processes and the kernel, here's another 'raw-ish' test:

raggi@mbk: ~/volatile % time sh -c "cat bigfile | cat - > bigfile2"

real 0m2.846s
user 0m0.117s
sys 0m2.185s

Still well over 300MB/s and most of the CPU time is being spent in `diskimages-helper` (OSX SL). ciao.

The way Java encourages you to do thread interaction is truly awful, and you're probably suffering from following what the language is suggesting you do. You'd probably be better off removing all uses of the word 'synchronized' from your code and having threads interact with each other strictly through queues, with as few objects which multiple threads are stepping on as possible.
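A minimal sketch of the queue-based handoff this comment describes, using java.util.concurrent (the class and message names are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Threads hand messages off through a queue instead of sharing mutable state
// under synchronized blocks; the queue is the only shared object.
public class QueueHandoff {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    queue.put("msg-" + i);  // blocks only if the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    System.out.println(queue.take());  // blocks only if empty
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```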

If you have program which should be single threaded, but you make it multiple threaded, it shouldn't be surprising that if you force it to use less CPUs the performance will improve.

On my system, the "synchronized Static store" test scaled particularly badly. However, even the single threaded case (53K, the best for this test) performed far worse than the single threaded Static store. (1721K)

i.e. it should never have been made multi-threaded, even if it scaled linearly, you would have needed 35 threads to match the performance of a single thread of the simpler non-synchronized test.

My approach is to minimise the amount of interlock between threads; this minimises the dependency on how the hardware performs. IMHO this is not practical in all situations, but this approach can be used more often than most people think.

BTW: The code would be simplified if Goer timed how long the test runs. If the test is run long enough, the difference should be small, but would save three lines of boilerplate code from every test. You can also make the test run for a fixed time rather than having a hand chosen looping factor.

In each test you set junk1 to be the counter, and in some tests you set junk2 to be the last value, but in others you appear to have a copy-and-paste error using a variable from a different test.

I've seen many a Java app where I constrain it to one or two CPUs and see an improvement in actual work done. It takes a bit of tuning of JVM threads at times, and sometimes there are tradeoffs between using more CPUs with fewer JVMs versus constraining every JVM to a single CPU, just to reduce the care-and-feeding and memory overhead. It also depends on the exact version of Java you are using. This used to be FAR worse with Sun Java 1.3, but 1.4 and higher make it much less of a factor.

My theory on why less is more is lock contention just like several comments have already mentioned. When waiting on resources having lots of threads that can at the same time compete for the same resources makes for fights. When forced to run on a single CPU only one JVM thread ever runs at any time since all threads have to run in the same core and hence there is never a fight for resources.

With that being said, watch out for singleton areas and avoid mutexes as much as possible. Try turning off logging and see how much logging overhead is a factor. Try running with all the CPUs enabled but reduce the number of JVM threads to be equal to the number of cores plus some fudge factor to account for housekeeping (2 times the core count is usually about the max). Also, there are different lock managers that can be used that may not be as fair as the standard one but don't have the overhead.

Do your threads have to talk to each other much? Fundamentally, any time two threads (or processes, or programs, or computers, or networks) have to communicate with one another, at least one of them can end up sitting around twiddling its thumbs waiting to hear from another one.

Even if the communication is taking place over the loopback interface, it still has to take place, and therefore things can be I/O bound. Unless your threads can consistently keep doing stuff while they wait to hear from one another, you're going to have the potential for I/O bottlenecks to prevent full CPU utilization.

This is interesting and possibly worth being aware of, but you really only need to do anything about it if you're going to be deploying the code in a way that introduces higher latency (e.g., if some of the threads could end up on a different computer from one another, especially if they could end up with slower-than-LAN links to one another).

From my understanding (and what I've empirically observed) is that synchronized in Java causes an explicit cache flush at the end of the synchronized block. Depending on the processor this most likely flushes caches on multiple cores. Which would then cause them to re-fetch data from memory, slowing down their processing speed. Not 100% sure on why the single core is so much faster, but it probably has to do with either JVM optimizations when running on a single core or more efficient OS scheduling of fetching data for threads (since all threads are bound to a single core). As a related point, binding specific Java threads to cores can drastically improve those threads performance.

" synchronized in Java causes an explicit cache flush at the end of the synchronized block. "

Not true - at least not on Intel and AMD x86es, which I worked on.

I don't think it's true on Sun/Oracle SPARCs either. But I have worked on machine proposals, never built, where it might be true. And it might be true on a non-cache coherent MP system, which are uncommon.

What synchronized does is surround the synchronized block with a lock acquisition and release. Depending on the machine, this can be quite expensive. On modern Intel machines, however, it is almost free unless there is lock contention.

Oh, yes: finally, given hardware TM, which is just starting, the moral equivalent of the cache flush might happen. If they use the design I did 7 years ago for Intel, doing too many transactions back to back may incur a delay while the cache is scanned to flip bits. Not actually a flush, but still possibly a scan. However, this is a "too many back to back" thing. If fairly well spaced, or if Intel has provided enough versions, no problem. Or if Intel did a flash clear.