Financial apps feel the need for speed. It can come via parallelization, and via infrastructure: fast messaging and non-blocking distributed memory management. This blog post gives an overview, with examples, of technologies that can squeeze performance out of your trading apps and clock cycles out of your modeling apps.

Low Latency via Infrastructure

ZeroMQ

· ZeroMQ is a messaging library – ‘messaging middleware’, ‘TCP on steroids’, ‘a new layer on the networking stack’. It is not a complete messaging system, but a simple messaging library to be used programmatically. It gives the flexibility and performance of a low-level socket interface plus the ease of implementation of a high-level one, and it is designed for simplicity.

· Performance - ZeroMQ is orders of magnitude faster than most AMQP messaging systems because it avoids their overhead. It leverages efficient transports such as reliable multicast and makes use of intelligent message batching, minimizing not only protocol overhead but also system calls. You can choose the message encoding format, such as BSON or Protocol Buffers.

o ZeroMQ sockets can connect to multiple endpoints and automatically load-balance messages over them. ZeroMQ is brokerless and thus has no single point of failure.

Redis

Redis (http://redis.io/) is a high-performance NoSQL solution that provides network-accessible shared memory – a key-value / data-structure store, as well as a non-blocking event bus (pub/sub). The performance is principally down to Redis keeping the entire dataset in memory and only periodically syncing to disk. Redis supports 5 data structures: strings, lists, hashes, sets, and sorted sets.

BookSleeve (http://code.google.com/p/booksleeve/) is a .NET client for Redis (available as a NuGet package) that provides pipelined, asynchronous, multiplexed and thread-safe access to Redis:

using (var conn = new RedisConnection("localhost"))
{
    conn.Open();
    conn.Set(0, "foo", "bar");             // database 0, key "foo"
    var value = conn.GetString(0, "foo");  // returns Task<string>
    string s = await value;                // "bar"
}

LMAX Disruptor

Disruptor (http://code.google.com/p/disruptor-net/) is a ring-buffer architecture for efficiently sending messages between threads without relying on shared queues (which, given the CPU's memory-access architecture, cause contention). It leverages and improves upon features of SEDA (which was serial) and the Actor model, and can handle 6 million TPS on a single thread. A Business Logic Processor runs in-memory using event sourcing and is surrounded by Disruptors – concurrency components that implement a network of queues operating without locks. This allows consumers to wait on the results of other consumers without an intermediate queue.

LMAX Disruptor functions as a structured, ordered set of memory barriers – multiple producer write barriers and consumer read barriers. There is no concept of entry deletion, just append.

Readers can read concurrently and independently, and can optionally have dependencies on one another. LMAX Disruptor uses a large pre-allocated ring of entries: entry objects are pre-allocated adjacently and never garbage collected.

Pre-allocating entries means adjacent entries are (very likely) located in adjacent memory cells, and because readers read entries sequentially, this is important for utilizing CPU caches. The design goes to great lengths to avoid locks, CAS operations, and even memory barriers (e.g. using a non-volatile sequence variable when there is only one writer). Different annotating readers write to different fields / cache lines, to avoid write contention.

· The .NET consumer interface IBatchHandler<T> specifies an OnAvailable method and an OnEndOfBatch method (reminiscent of Rx). The consumer runs on a separate thread, receiving entries as they become available.

Low Latency via Parallelization

The following parallelization technologies were investigated to determine latency capabilities given an indicator and a data set:

· C++ AMP

· C++ PPL / ConCrt

· TPL

The indicator under investigation was standard deviation with a sliding window – a common measure of volatility in stochastic calculus, used in Itô’s Lemma in quantitative finance.

The data was made up of 50 million iterations - 21 days of x ticks (seconds).

TPL (.NET parallel CPU)

The Task Parallel Library (TPL) is a .NET Framework 4 API that simplifies parallelism and concurrency in applications. The TPL scales the degree of concurrency dynamically to most efficiently use all the processors that are available. In addition, the TPL handles the partitioning of the work, the scheduling of threads on the ThreadPool, cancellation support, state management, and other low-level details. By using TPL, you can maximize the performance of your code while focusing on the work that your program is designed to accomplish. The TPL is the preferred way to write multithreaded and parallel code in .NET.

ConCrt (C++ parallel CPU)

The Concurrency Runtime is a programming framework for C++ that abstracts the details of high-performance parallelism. It uses a cooperative task scheduler that implements a work-stealing algorithm to efficiently distribute work among computing resources. The Concurrency Runtime also provides synchronization primitives that use cooperative blocking to synchronize access to resources: by blocking cooperatively, the runtime can use the remaining quantum to perform another task while the first task waits for the resource. This mechanism promotes maximum usage of computing resources.

C++ AMP (C++ parallel GPGPU)

C++ Accelerated Massive Parallelism (C++ AMP) accelerates execution of C++ code by executing it on data-parallel hardware of the graphics processing unit (GPU) found on a DirectX 11 graphics card. The C++ AMP programming model includes multidimensional arrays, indexing, memory transfer, tiling, and a mathematical function library and allows you to control how data is moved from the CPU to the GPU and back, so that you can improve performance. General-purpose computing on graphics processing units (GPGPU) allows you to perform computation in applications traditionally handled by the CPU, but across many more cores.

The AMP moving-average code performed best. (I won’t show the TPL code, because it was trivial, and I won’t show the ConCrt code, as I actually leveraged that API to preload the array in the AMP code sample.)
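As an illustrative sketch (not the original listing), a C++ AMP kernel for this kind of sliding-window computation might look like the following. The function name, window handling, and absence of tiling are my assumptions, and it requires MSVC with a DirectX 11 capable device:

```cpp
// Each GPU thread computes the standard deviation of its own window.
// results must be sized to ticks.size() - window + 1 by the caller.
#include <amp.h>
#include <vector>
using namespace concurrency;

void ampSlidingStdDev(const std::vector<float>& ticks, int window,
                      std::vector<float>& results) {
    const int n = (int)results.size();
    array_view<const float, 1> in((int)ticks.size(), ticks);  // copied to the GPU on demand
    array_view<float, 1> out(n, results);
    out.discard_data();  // no need to copy stale results to the GPU
    parallel_for_each(out.extent, [=](index<1> idx) restrict(amp) {
        float sum = 0.0f, sumSq = 0.0f;
        for (int j = 0; j < window; ++j) {
            float v = in[idx[0] + j];
            sum += v;
            sumSq += v * v;
        }
        float mean = sum / window;
        out[idx] = fast_math::sqrt(sumSq / window - mean * mean);
    });
    out.synchronize();  // copy results back to the CPU
}
```

The `array_view`s manage the CPU↔GPU transfers, and `restrict(amp)` marks the lambda as compilable for the accelerator; a tiled variant with `tile_static` staging of the shared window reads would likely be the next optimization step.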


Conclusions

As can be seen from the benchmarks, C++ AMP outperformed C++ ConCrt, and C++ ConCrt outperformed C# TPL. However, C# TPL provided adequate performance for the given indicator and data set, and using it avoids the development-productivity overhead associated with C++. With the right architecture, the C++ / C# processing engine can be swapped out to suit your needs.