Wanna boost app speed? Think of the server, and tune 'er to NUMA

And maybe chuck in some flash-based DIMMs, too

HPC on Wall Street The HPC on Wall Street conference was hosted in the Big Apple on Monday, and while there was a lot – and we mean a lot – of talk about big data, one presentation stood out as being potentially much more useful in the long run than all of the big data bloviations.

The talk was given by one of the founders of a financial services software maker, who walked the audience through his company's efforts to boost performance through coding apps to be aware of the underlying NUMA architecture of servers.

The techies from 60East Technologies, the maker of the AMPS (Advanced Message Processing System) publish/subscribe messaging system that was created to be the engine behind financial services applications, are also beta-testing flash-based DIMM memory sticks from Diablo Technologies to "crank up the AMPS", as founder and CEO Jeffrey Birnbaum put it.

Depending on the parts of the application that are running, the AMPS application has anywhere from 50 to 100 threads executing at the same time. Some of those threads are running the SQL-like messaging database at the heart of the system, which is called State of the World, while others run the publishing and subscription code that pulls in or pushes out data. The subscription is done via SQL-like queries.

On a current two-socket server using "Sandy Bridge-EP" Xeon E5-2600 processors, you can get 16 cores and 32 threads in a box and that is about it. The software allows multiple publishing applications to feed in messages, which are then routed to multiple subscribers.

Financial apps: Latency, latency, latency

Much of what financial applications do is take data from multiple sources, aggregate it, parse it, and then stream it out to multiple subscribers. And with such streaming applications – and especially the trading and other applications that depend upon those streams – latency is everything. And so even in a two-socket server, the difference in latency between local memory access in one socket and remote memory access in the other socket can make a big difference to the performance of the overall application.

Birnbaum says that in a typical two-socket Xeon E5 server, local memory access is on the order of 100 nanoseconds. But if you have to jump over to the main memory associated with the second socket in the system, any accesses through the QuickPath Interconnect that links the two sockets using non-uniform memory access (NUMA) clustering can take anywhere from 150 to 300 nanoseconds, with 300 nanoseconds not being unusual for outliers.
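To get a feel for what those numbers mean, here is a back-of-the-envelope sketch in Python. It is not anything 60East showed; it simply blends the 100 nanosecond local figure with the 300 nanosecond remote figure quoted above, on the assumption that some fraction of memory accesses cross the socket boundary. The function name and the fractions are made up for illustration.

```python
def blended_latency_ns(remote_fraction, local_ns=100.0, remote_ns=300.0):
    """Average memory latency when a fraction of accesses cross sockets.

    The defaults are the rough figures quoted in the talk; real machines
    will differ, so treat them as placeholders.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Illustrative fractions only: none, a quarter, and half of accesses remote.
for frac in (0.0, 0.25, 0.5):
    print(f"{frac:.0%} remote: {blended_latency_ns(frac):.0f} ns")
```

Under those assumptions, an application that goes to the other socket for half of its memory accesses sees its average memory latency double, from 100 to 200 nanoseconds.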

"You are taking a severe performance penalty," explained Birnbaum, who was preaching that application developers needed to become aware of NUMA tuning and start doing it in their applications as 60East has done.

How much of a performance hit are we talking about? Well, based on his own AMPS application, Birnbaum is right. It is a pretty big hit, and this awful picture from his presentation shows it, however fuzzily:

The AMPS app can handle a lot more 1KB messages after NUMA tuning

See that? No? Squint a bit...

This presentation will eventually be available on the 60East site, and deepest apologies, it went by so fast we only got this terrible shot of it. But the important thing is that you can see the curves. The lines at the bottom of both charts, which are relatively flat, are the average latencies across all messaging transactions, and if you looked at only this data, you would think everything is hunky dory. But if you look at the average latency of the slowest five per cent of message transmissions in the AMPS application, you will see that as the messaging rate gets pushed, the outliers start to creep up very, very fast.

"If you have one message that is really bad, that is not good for most environments," said Birnbaum. He maintains that this is why you have to look at more than average latencies if you are analyzing code on NUMA machines.
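The gap between the two kinds of curve is easy to reproduce with a toy example. The Python sketch below computes the overall average and the average of the slowest five per cent for a made-up set of latency samples. The numbers are invented and have nothing to do with the AMPS measurements, and the helper names are ours.

```python
def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def slowest_tail_mean(latencies, tail_percent=5):
    """Mean of the slowest tail_percent of samples (always at least one)."""
    if not latencies:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies, reverse=True)
    # Integer ceiling division so 5 per cent of 100 samples is exactly 5.
    count = max(1, (len(ordered) * tail_percent + 99) // 100)
    return mean(ordered[:count])


# Invented data: 95 fast messages and 5 slow ones, in microseconds.
samples = [10.0] * 95 + [500.0] * 5
print(mean(samples))               # 34.5
print(slowest_tail_mean(samples))  # 500.0
```

The overall average barely hints at trouble, while the tail average shows that one message in twenty is taking fifty times longer than the rest.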

In the case of AMPS 3.3, which was not tweaked to pin threads and memory together on NUMA machines to stop them socket hopping for data, the latencies start to spike once you push it up to about 50,000 messages per second. With AMPS 3.5, the latest release of 60East's software, the company's programmers used various tools to analyze the memory accesses as AMPS was running and then learned how to group threads and memory together to cut down on socket hopping.

With the NUMA tuning, AMPS 3.5 was able to push close to 1 million messages per second and still cut the outliers' latencies by more than half. And the real bottleneck at that point was the PCI-Express bus and the 10Gb/sec Ethernet network interface card. With a 40Gb/sec Ethernet card, Birnbaum thinks AMPS 3.5 could probably hit 2 million messages per second.
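The network ceiling is easy to sanity-check. The Python sketch below assumes the messages are the 1KB size shown on the chart, takes 1KB to mean 1,024 bytes, and ignores all protocol and framing overhead, so the real ceilings will be lower than the ones it prints.

```python
MESSAGE_BYTES = 1024  # assumption: 1KB messages, payload only, no headers


def payload_gbits_per_sec(messages_per_sec, message_bytes=MESSAGE_BYTES):
    """Payload bandwidth in Gb/sec for a given message rate."""
    return messages_per_sec * message_bytes * 8 / 1e9


def max_messages_per_sec(link_gbits, message_bytes=MESSAGE_BYTES):
    """Upper bound on message rate if the link carried nothing but payload."""
    return link_gbits * 1e9 / (message_bytes * 8)


print(payload_gbits_per_sec(1_000_000))  # 8.192
print(max_messages_per_sec(10))          # 1220703.125
print(max_messages_per_sec(40))          # 4882812.5
```

On those assumptions, a million 1KB messages per second is already about 8.2Gb/sec of payload, which is uncomfortably close to what a 10Gb/sec card can carry once overhead is added.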

"This tells you that you want to take your time and program for NUMA," said Birnbaum. And he warns against doing too much reference counting in C and C++ (which is a common way to share data among threads by passing around pointers) can wreak havoc on performance. "You also try to put low-priority stuff in an application on the second socket in the system where you can take the latency hit," he says.

Programmer? I hardly NUMA...

60East employed a number of tools to tune up AMPS 3.5 for two-socket NUMA Xeon E5 servers. The first is libnuma, a library used to set memory access policies in the Linux kernel.

The company also made use of Pin from Intel, which is used to check for memory references in the code.

And an intrepid programmer at Intel, frustrated by the lack of visibility into NUMA applications, has created an open-source tool called NumaTop to analyze processes and threads and their memory accesses on NUMA systems. There are a bunch of others, too.
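For those who want to experiment without writing C, here is a minimal Python sketch of the basic idea of keeping work on one socket. It is not how 60East did it, and it does not use libnuma. It reads the CPU list that Linux exposes in sysfs for a NUMA node and restricts the caller to those CPUs with the standard library's os.sched_setaffinity. The helper names are ours. Note that this only controls where the code runs; it relies on the Linux default of allocating memory on the node where the allocating thread is running, and explicit memory policies still need libnuma or a tool such as numactl.

```python
import os


def parse_cpulist(text):
    """Turn a Linux cpulist string such as '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_on_node(node):
    """Read the CPUs that Linux reports for one NUMA node via sysfs."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        return parse_cpulist(handle.read())


def pin_to_node(node):
    """Restrict the caller to the CPUs of one NUMA node (Linux only)."""
    cpus = cpus_on_node(node)
    os.sched_setaffinity(0, cpus)  # pid 0 means the caller
    return cpus


if __name__ == "__main__":
    try:
        print("pinned to CPUs:", sorted(pin_to_node(0)))
    except (OSError, AttributeError, ValueError) as exc:
        # Not Linux, no NUMA information in sysfs, or pinning not permitted.
        print("could not pin:", exc)
```

Following Birnbaum's advice, latency-sensitive work would be pinned to the node that holds its data, with low-priority work pushed to the other node.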

But the important thing about NUMA tuning is to do it. "The only way to do this well is that you have to play," said Birnbaum, and that may sound a little bit odd coming from Wall Street. "You have to read, you have to learn, and you have to experiment. But the results will be dramatic."

The other thing that 60East's programmers have been doing is making use of flash-based sticks that plug into main memory slots in the server to help boost the performance of AMPS even further. The messaging platform was designed so that transaction logs can be turned on or off.

You want to turn the logs on because that helps speed up the resynchronization of subscribers if they get knocked offline. But disk drives are too slow and memory is too skinny. As it turns out, Diablo's MCS (Memory Channel Storage) flash sticks come in 200GB and 400GB capacities, and 60East was able to get eight of the 200GB units to plug into a Xeon E5 alongside the dozen sticks of DDR3 that gave the system 128GB of main memory. This MCS memory has drivers that make it look like another tier in the storage hierarchy of the server.

With the MCS flash DIMMs in the two-socket server, the AMPS software was able to push 4.16 million messages per second, compared to 1.18 million messages per second without them.

That is still not enough to make Birnbaum happy, though. "Most of our performance comes from memory and cores," he says. "Ivy Bridge Xeons are welcome to us, and Haswell would be even better."

But what Birnbaum really wants is an integrated network interface on a Xeon chip, something he says he told Intel was necessary back in 2001. The new "Avoton" Atom C2000 chips have integrated Ethernet network interface controllers, and maybe, just maybe, the future Haswell Xeons due next year will too. As far as El Reg knows, there have been no rumors of integrated Ethernet controllers on the impending Xeon E5 v2 chips based on Ivy Bridge. ®