Deterministic Latency and Scalability via FPGA Matrix Architecture

In this interview for the HFT Review, Mike O’Hara talks to Yves Charles, CEO of NovaSparks, an engineering company dedicated to FPGA solutions for Capital Markets. The company uses an FPGA matrix architecture to deliver market data feed handlers and was recently the first to announce a pure FPGA-based Order Book Capability for Cash Equities in the US and Europe.

HFT Review: Yves, welcome to the HFT Review. Your company NovaSparks specialises in pure FPGA-based market data solutions, so I’d like to start by asking you what sort of demand is driving the adoption of FPGAs?

Yves Charles: In the last two or three years, FPGAs have mainly been adopted in the financial industry to reduce latency. Secondary drivers are lower power consumption, reduced footprint and so on, but up until now, the primary driver has been latency. Beyond the need to reduce latency, we are also seeing a very strong interest in keeping latency deterministic, which is achievable with a full FPGA solution.

YC: In a CPU there are many areas that could generate queueing behaviour because the system processes could stop to do various other things. But the market data doesn’t stop!

Software-based solutions have tried to catch up. Both existing vendors and firms with home-grown solutions have tried to improve their software in order to compete against FPGAs. But even if they improve their latency to reasonable numbers, they can never be fully deterministic. With a CPU-based solution you will always have unpredictable events like scheduling, I/O, memory management, operating system interrupts and so on.

Because of the CPU resource sharing, software solutions cannot reach determinism. Neither can the hybrid solutions that offer a mix of FPGA and software. Those are more deterministic, but they’re still unpredictable. The only way to be 100% predictable is to do the whole job in FPGA because then you’re not involving the CPU in the data process.
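The queueing effect described above can be illustrated with a small sketch. It is a toy model and not anything from NovaSparks: a single FIFO processor receives messages at a fixed rate, and the hypothetical `stalls` parameter injects one pause to stand in for an interrupt or scheduling event. All timings are invented for illustration.

```python
def simulate_latencies(n_messages, arrival_gap_us, service_us, stalls=None):
    """Return per-message latency (microseconds) for a single FIFO processor.

    `stalls` maps a message index to an extra pause (microseconds) suffered
    just before that message is handled. It is a stand-in for an interrupt
    or scheduling event; the values used below are illustrative only.
    """
    stalls = stalls or {}
    latencies = []
    free_at = 0.0  # time at which the processor can start the next message
    for i in range(n_messages):
        arrival = i * arrival_gap_us          # the feed never stops arriving
        start = max(arrival, free_at) + stalls.get(i, 0.0)
        finish = start + service_us
        latencies.append(finish - arrival)    # queueing delay + service time
        free_at = finish
    return latencies


# No stalls: every message sees the same latency.
steady = simulate_latencies(10, arrival_gap_us=1.0, service_us=0.5)

# One 5 us stall before message 3: that message and every later one is late,
# because the backlog drains only 0.5 us per message.
stalled = simulate_latencies(10, arrival_gap_us=1.0, service_us=0.5,
                             stalls={3: 5.0})

print(steady)   # all 0.5
print(stalled)  # 0.5 for the first three, then 5.5 decaying to 2.5
```

In this model a single pause does not delay just one message: everything queued behind it is delayed too, until the processor catches up with the feed.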

HFTR: Why does predictability matter so much? And what is the real impact of non-deterministic latency?

YC: I’ll give you an example. Say your latency is around ten microseconds 90% of the time. But during the other 10% of the time, when market data traffic is high or during bursts, your latency is much higher. In fact, you don’t even know the number. It could be twenty microseconds, it could be a hundred, it could even be more.

In the algo trading business, this situation is very bad because it’s precisely during this 10% of the time that you need to be sure that your latency is good. Algos are generally tuned for a certain latency value, so if you don’t know the latency 10% of the time, you don’t really know how your algo is performing.
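To make the 90%/10% example concrete, here is a minimal sketch that summarises a latency sample by percentiles. The sample is invented to match the figures in the example (90 messages at ten microseconds, 10 burst messages between 20 and 200 microseconds), and `percentile` is a simple nearest-rank helper written for this illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample in microseconds: 90% at 10 us, 10% during bursts.
latencies_us = [10] * 90 + [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]

mean_us = sum(latencies_us) / len(latencies_us)
summary = {
    "mean": mean_us,
    "p50": percentile(latencies_us, 50),
    "p90": percentile(latencies_us, 90),
    "p95": percentile(latencies_us, 95),
    "p99": percentile(latencies_us, 99),
    "max": max(latencies_us),
}
print(summary)
```

With this sample the median and 90th percentile are both 10 microseconds, while the 99th percentile is 180 and the maximum is 200. A single headline figure says nothing about the bursts, which is exactly when the number matters.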

If you wanted to buy a switch and were offered one that switches in 200 nanoseconds when traffic is low but takes a few microseconds when traffic is high, you certainly wouldn’t buy that switch. It’s the same thing for the feed handler. You want to be sure that the latency is the same whatever the situation.

HFTR: It’s clear that such deterministic latency is important to proprietary HFT firms and hedge funds running short-term trading strategies. But what about sell-side firms, service providers and exchanges? How do they benefit?

YC: When you’re talking about determinism and predictability of latency, everybody’s interested. Not just HFT firms, but also the very large financial institutions who want to be sure that the market data they get is not 200 microseconds late. For market data providers, deterministic latency allows them to commit to service level agreements.

Exchanges are also moving towards this because they want to reduce what we call the “dark area” between the gateway and the matching engine. They have all sorts of requirements that are not easy to implement in FPGA and there is additional complexity because of security, recovery and so on, but they are certainly looking into it.

HFTR: Can you describe some of the architectures that exist to leverage FPGAs?

YC: Yes, there are two kinds of architectures. The first is where you use a CPU as a backplane and you plug in some FPGA boards, using them as a first layer of decoding and parsing the feed, but you keep the majority of work in the CPU.

[Figure: Architecture 1]

This architecture gives you an improvement in terms of latency, but you still have a lot of work to do in the CPU. You still have to deal with the throughput problem and with the lack of scalability, especially when you are talking about multiple markets. For anyone who is trading on multiple markets and who needs to have a consolidated view, this architecture does not scale, because you cannot have many FPGA boards in the server; you can only install one or two at most. That means you have one FPGA board trying to feed all the markets, with the CPU trying to catch up.

HFTR: Can’t you get around the scalability issue by having multiple servers?

YC: Yes, but the problem when you use multiple servers is how you transmit the data from one server to another, because that’s where you add latency. And the problem with running multiple markets on one FPGA board in a single server is that you can consume the data locally, but you cannot distribute it to other consumers at low latency, because you have to go through an interface and a network card again, with all the communication problems that entails.

This architecture number one with one CPU-based server and one or two FPGA boards to do some work is an improvement compared to pure software solutions, but it doesn’t give you determinism and it doesn’t scale.

HFTR: What is the alternative approach?

YC: Architecture number two is to have a pure FPGA solution architected around the FPGA and not around the CPU. That’s our approach. It’s basically a solution where the CPU is on the periphery to do remote firmware downloads and system admin work. But at the heart is a matrix of FPGAs performing all the core functions. You have an API so you can consume data but the processing is done in the FPGA.

[Figure: Architecture 2]

The matrix can be extended by interconnecting multiple FPGAs. This allows you to scale in terms of number of markets without any bottleneck. You can also scale on the data distribution side, sending the data to more consumers by increasing the fan-out.

It’s not one FPGA board there on its own, but multiple interconnected FPGAs. So you can scale for both market data coming in and connections going out.

HFTR: I understand that this matrix architecture was fundamental to the order book builder functionality you announced recently. What can you tell us about that?

YC: Yes, we recently announced the first solution where an entire order book can be built in FPGA without any CPU work. This does two key things: One is to manage all the orders within the FPGA, so the deleting, adding and modifying of orders, all these functions are managed by the FPGA. The second is to aggregate all these orders and to build what we call the order book, so not only top of the book but also 5 levels, 10 levels, 20 levels, etc. This is completely unique. It gives us a very competitive latency (below 1 microsecond) that nobody else can reach. But more importantly, this latency is deterministic.
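The two functions described here, managing individual orders and aggregating them into price levels, can be sketched in software. This is only an illustration of the logic, not NovaSparks' FPGA design, which the interview does not disclose. The class name, method names, side codes ("B"/"S") and integer tick prices are all assumptions made for the example.

```python
from collections import defaultdict


class OrderBook:
    """Toy order-by-order book: tracks orders and aggregated price levels."""

    def __init__(self):
        self.orders = {}  # order_id -> (side, price, qty)
        # Aggregated quantity per price, kept separately for each side.
        self.levels = {"B": defaultdict(int), "S": defaultdict(int)}

    def add(self, order_id, side, price, qty):
        self.orders[order_id] = (side, price, qty)
        self.levels[side][price] += qty

    def delete(self, order_id):
        side, price, qty = self.orders.pop(order_id)
        level = self.levels[side]
        level[price] -= qty
        if level[price] == 0:
            del level[price]  # drop empty levels so depth() skips them

    def modify(self, order_id, new_price, new_qty):
        # Modelled as delete + add; real feeds may define modify differently.
        side = self.orders[order_id][0]
        self.delete(order_id)
        self.add(order_id, side, new_price, new_qty)

    def depth(self, side, n):
        """Best n levels as (price, total_qty): highest bids, lowest asks."""
        prices = sorted(self.levels[side], reverse=(side == "B"))
        return [(p, self.levels[side][p]) for p in prices[:n]]


book = OrderBook()
book.add(1, "B", 1000, 100)
book.add(2, "B", 1000, 50)
book.add(3, "B", 999, 200)
book.add(4, "S", 1001, 75)
book.add(5, "S", 1002, 30)
book.modify(2, 999, 50)           # order 2 moves down one tick
bids_before = book.depth("B", 5)  # two bid levels remain
book.delete(1)                    # last order at 1000 is removed
bids_after = book.depth("B", 5)
asks = book.depth("S", 5)
print(bids_before, bids_after, asks)
```

Top of book is `depth(side, 1)`; the 5, 10 or 20 levels mentioned above correspond to larger values of `n`.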

HFTR: I’m very interested in the challenges you faced in building this solution and bringing it to market. Can you talk me through why it was such a difficult thing to do and why you think nobody else has done it yet?

YC: Well, without giving away any secret sauce, I can tell you it’s based on some very complex algorithms and mechanisms within the FPGAs. The challenge is to be able to do the entire process without having bottlenecks.

HFTR: Any more specifics?

YC: The obvious response is that FPGA is not an easy technology to work with. And in the last few years, compilers have not improved much in ways that solve this problem. Compilers are very widely used in other applications where latency is not an issue, but when you are dealing with FPGA and the driving point is latency, compilers reach their limits.

We decided to build our own compilers rather than use anything off-the-shelf. In our development team, we probably do around 50% of the development by hand, programming the FPGA in the classic way, and 50% through compilers we have written ourselves.

The availability of developers is also an important factor, although within the last 3-4 years, we’ve seen more people come on to the market with experience in FPGA.

As for matching engines on the exchanges, I’m not sure about today or tomorrow but I’m sure that many exchanges are thinking about that longer term.

HFTR: So what is next, after the Order Book?

YC: We believe that the accomplishment of putting the Order Book into FPGAs will signal to the market that FPGA appliances are maturing fast. We see it as a proof point of how FPGAs can be used for more complex and advanced functions than just simple parsing, decoding or networking functions. It paves the way to prove to the market that FPGAs are viable platforms for a lot of different types of calculation functions in the trading cycle.

The other important thing is not just the FPGA itself, but the architecture around the FPGA. Some people seem to think that once they have these four letters, FPGA, then it’s a done deal, everybody is the same. But no, everybody is not the same. There are multiple FPGA solutions and we think that the only one that can really scale from an enterprise point of view, from the market data provider point of view, is an appliance that is fully architected around an FPGA matrix.