Thursday, May 30, 2013

Benchmarking Socket.IO vs. Lightstreamer with Node.js

Because of the increasing popularity of server-side JavaScript, some weeks ago we released a Node module that enables you to write a Lightstreamer adapter that runs inside Node.js. If you are not familiar with Lightstreamer, it is a server that implements bi-directional real-time messaging for mobile, browser-based, and desktop apps. A Lightstreamer adapter is a custom piece of software you develop to implement your server-side logic (publish messages, receive and process messages, etc.).

Before developing the Node adapter for Lightstreamer, we were wondering if it made any sense... I mean, if you need a real-time server and you want to write your server-side logic in JavaScript, why should you use Lightstreamer with its Node adapter when you can directly use Socket.IO? Well, the answer is twofold: features and performance. In this article, I will focus on performance only. I decided to put the two solutions to the test with a benchmark. Read the full story here and if you want to re-run the tests yourself, you can find all of the needed code and instructions on GitHub. [Full disclosure: I work as a developer on Lightstreamer.]

Background

My goal was to compare Socket.IO and Lightstreamer running on a multi-core machine, with the clients connecting remotely. I wanted to measure the latencies and the CPU utilization under different client loads.

Architectures

Let’s have a look at the two different architectures first (Socket.IO+Node.js and Lightstreamer+Node.js) before delving into the actual benchmarking architecture.

Socket.IO + Node.js

Socket.IO is a Node.js module, so it runs in-process with Node. On the client side, it provides a library that clients use to connect to the server.

Lightstreamer + Node.js

Lightstreamer-adapter is a Node.js module as well. But instead of directly accepting connections from clients, it talks to Lightstreamer Server, which is the actual real-time server and runs out of process with respect to Node.

Benchmark

I decided to focus on a broadcasting benchmark, where a message is generated on the server side and broadcast to all the connected clients. Playing with the message rate and the number of clients would allow me to stress the two solutions enough.

The Message Generator

I wanted the message generator to be pretty simple, so I wrote a setInterval, which on each call reads a timestamp from the system (new Date().getTime()) and passes it to the server. Initially, I put this timestamp generator directly inside the Socket.IO server process, but that gave me a couple of problems.

I ran my first tests on a single-core machine. With my setInterval running in the same Node process as Socket.IO, the interval fired more and more slowly as the load increased, so the throughput of the two servers was very different for the same number of clients. In the Lightstreamer case, in fact, the generator runs within the Node adapter, which is a very lightweight process: all the connections and data dispatching are handled by the Lightstreamer server itself (a separate process), so the Node process is not stressed and has no reason to slow down the interval firing.

Also, since I wanted to run the clients on a different machine and could not quickly synchronize the clocks of the two machines to the millisecond, I needed to generate the timestamps on the same machine where the clients are.

NTP was not an option because it requires some time before a “perfect” synchronization is in place and I was doing my tests on short-lived Amazon EC2 machines. Furthermore, with NTP the clocks of the machines fluctuate slightly: I didn’t want that to get in the way when comparing the latency values.

The obvious solution was to make the generator a remote process for Socket.IO too: a special client that sends the generated timestamp to the server, which in turn broadcasts it to the other clients. Running this generator on the same machine where the clients were hosted solved the synchronization issue too.
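A minimal sketch of such a generator follows. It assumes a hypothetical `publish` callback standing in for whatever delivers the value to the server (an emit from the special Socket.IO client, or an update pushed through the Lightstreamer adapter). The function names are illustrative and are not taken from the benchmark code.

```javascript
// Build the function that setInterval will call on every tick.
// `publish` is a hypothetical callback; `now` is injectable to ease testing.
function makeTick(publish, now = () => new Date().getTime()) {
  return function tick() {
    publish(now()); // read the system clock and hand the timestamp over
  };
}

// Start the generator; returns the timer handle so the caller can stop it.
function startGenerator(publish, intervalMs = 100) {
  return setInterval(makeTick(publish), intervalMs);
}

// Usage: one timestamp every 100 ms, that is, 10 messages per second.
// const timer = startGenerator((ts) => console.log(ts));
// clearInterval(timer); // stop the generator
```

Keeping the tick function separate from the timer makes it easy to check the generator logic without waiting for real intervals to fire.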

The Clients

From the beginning, I thought the client code needed to run on a multi-core machine. To keep things simple, I wanted to calculate the latencies directly on the client process. So, using JavaScript clients running on Node was not an optimal solution. I opted for simplified Java clients (i.e., neither client would need to handle the full Lightstreamer/Socket.IO protocols, just what was needed for the tests).

Both Lightstreamer and Socket.IO support different transports. However, for this test, I wanted to be as cutting edge as possible and I wanted WebSocket connections only. I could not use the official Lightstreamer Java client lib as it currently does not have WebSockets support (which is currently available in the Lightstreamer JavaScript client lib only). Luckily enough, I had some Netty-based code around that was sketched for one of our customers and I was able to reuse it.

For Socket.IO, I started by adapting the Java client from https://github.com/drewww/Socket.IO-benchmarking. Unfortunately, that client used one thread per client, which caused excessive and unneeded load on the client machine. Instead of trying to improve it, I thought it was better to adapt the Java client I had for Lightstreamer to be used against Socket.IO, and so I did.

Latency Calculation

The client process is coded to automatically increase the number of active clients after a configured amount of time. During this time, the client collects the latency of each message and at the end of the time, calculates the number of messages per second it received and the latency mean, median, and standard deviation.
Note that when increasing the number of clients, there is a period of time during which the clients are connecting to the server: all of the messages received during this time are discarded. This has a double effect:

We always calculate the stats based on messages received while the full batch of clients was already connected.

We do not count the extra work the server has to do when new sockets are opened.
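The per-window statistics described above can be sketched as follows. This is an illustration, not the code of the Java test client: the function and parameter names are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the article does not say which variant was used.

```javascript
// Record the latency of one message: receive time minus the timestamp it carries.
// Messages are discarded while new clients are still connecting
// (`rampingUp` is a hypothetical flag maintained by the caller).
function recordLatency(latencies, sentAt, receivedAt, rampingUp) {
  if (!rampingUp) latencies.push(receivedAt - sentAt);
}

// Summarize the latencies (in ms) collected during one measurement window
// lasting `windowSeconds` seconds.
function summarize(latencies, windowSeconds) {
  const n = latencies.length;
  if (n === 0) return null; // nothing was received in this window

  const mean = latencies.reduce((sum, v) => sum + v, 0) / n;

  // Median: middle value of the sorted sample (average of the two middle
  // values when the sample size is even).
  const sorted = [...latencies].sort((a, b) => a - b);
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  // Population standard deviation (assumption: divide by n, not n - 1).
  const variance = latencies.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;

  return {
    messagesPerSecond: n / windowSeconds,
    mean,
    median,
    stdDev: Math.sqrt(variance),
  };
}
```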

Single-Core: 1K Clients

I started my benchmark campaign using an M1.medium instance on the Amazon EC2 cloud to host the server process and an M1.large machine to host the clients (we don’t want the clients to be the bottleneck, so it is safer to choose a more powerful machine to host them).

Unfortunately, when running Socket.IO for the first test, I had a bad surprise. The Node process, which was fine (around 65% of the available CPU) with 1000 clients, went completely mad as soon as I started adding a few more clients: the CPU skyrocketed to 100%. Once there, the CPU would not fall, even when all the clients were shut down. Moreover, this behavior was seen even without any updates flowing between the server and the clients. A v8.log taken during the crisis shows that the CPU time was mostly spent inside /lib64/libc-2.12.so (61.5%) and /lib64/libpthread-2.12.so (30.0%). I still have the v8 log, if anyone is interested: https://docs.google.com/a/lightstreamer.com/document/d/1qMlxw7K2F4garJxFqKzq8tqGq7t5FyaGEXAOKFjbEA0

When migrating to a multi-core environment, I encountered this limit again: on the 4-core machine I used for the final test, I was unable to obtain proper results once more than 4K clients were connected (see results later).

Multi-Core Architecture

Socket.IO

To run Socket.IO on a multi-core machine, I had to run one Node instance per available core. This posed a series of nested questions:

Was it better to manually run different Node instances bound to different ports and then either

use a balancer to share the load, or

manually connect the clients evenly to the various instances;

or was it better to use the Node Cluster module, which is currently declared experimental?

I thought using a balancer was not fair, as it might have increased the latencies, and I would also have had to find a balancer supporting WebSockets; manually selecting the server wasn’t elegant in my opinion and was far from a real-world case.

So, I decided to go with the Node Cluster module. I see that most, if not all, of the suggested architectures aimed at supporting multi-process Socket.IO involve the use of a Redis server. I initially tried to keep it out of the loop, but the number of “warn - client not handshaken client should reconnect” messages I continuously received made me desist.
So, I finally set up Redis as suggested by the Socket.IO GitHub wiki. Unfortunately, my problems were not over, as I was still getting some “warn - client not handshaken client should reconnect” messages (roughly 1 in 100 clients got that message), so I had to extend the client code to reconnect disconnected clients.
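The one-process-per-core part of this setup can be sketched with Node's built-in cluster module. This is a minimal illustration written for a recent Node version, not the benchmark's server code: `startServer` is a hypothetical function that would create the Socket.IO server in each worker, and the Redis store configuration is deliberately left out.

```javascript
const cluster = require("cluster");
const os = require("os");

// Fork one worker per core from the primary process and run `startServer`
// (hypothetical: it would create the Socket.IO server) in each worker.
function runPerCore(startServer, workerCount = os.cpus().length) {
  // `isPrimary` exists on recent Node versions, `isMaster` on older ones.
  const isPrimary =
    cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;

  if (isPrimary) {
    for (let i = 0; i < workerCount; i++) {
      cluster.fork(); // each worker re-runs this script as a non-primary process
    }
    return "primary";
  }

  startServer(); // workers can share one listening port through the cluster module
  return "worker";
}

// Usage (not executed here):
// runPerCore(() => { /* create the HTTP server and attach Socket.IO to it */ });
```

Because each worker is a separate process with its own memory, shared state such as handshake data has to live outside the workers, which is the role Redis plays in the setup described above.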

This is the resulting architecture:

Lightstreamer

In the Lightstreamer case, a standard Proxy Adapter (see the Lightstreamer ARI protocol) runs in process with the Lightstreamer server and is connected to the lightstreamer-adapter module running on the other machine. The data, generated by custom code running together with the lightstreamer-adapter module inside the Node server, flows from the generator to the Lightstreamer server and from the Lightstreamer server back to the clients.

So, in both cases, the application logic (the message generator) runs inside a Node server, which is hosted on the same machine where the clients are hosted, while the actual streaming server runs on a dedicated machine. This allows an apples-to-apples comparison because in both cases, the server process receives events from outside and broadcasts them to the clients.

Results

Here you can find the test results. I will only show the results up to 4000 clients because of the problem I had with Socket.IO (beyond that point, the values are not comparable between the two servers).

Test Configuration

The test described here is configured this way: The server receives one message every 100 ms (that is, 10 messages per second) from the generator and broadcasts each message to every client. The test starts with 100 clients; 100 new clients are connected after one minute and so on. This is the same configuration shown by default on the GitHub project (well, excluding the server address). Yes, of course, I could have configured the test in thousands of other different ways but I didn’t want to make a full study on the two server behaviors. I just wanted a quick comparison of the two solutions running on low-powered machines. The versions of the various components are the following:

Lightstreamer server: 5.1.1 build 1627

Lightstreamer Node adapter: 1.0.0

Java: OpenJDK 1.7.0_09

Socket.IO: 0.9

Node.js: 0.10.4

Redis: 2.6.12

Netty (for the clients): 3.5.8

Amazon Linux: 2012.09 (kernel 3.2.36-1.46.amzn1.x86_64)

CPU-Bandwidth Tracking

For the benchmark to be complete, we also wanted to keep track of the current bandwidth and CPU of the server machine. To do that, I used Sysstat (sar). To be more precise, the following command:

sar -u -n DEV 15

(redirected to an output file), which prints the statistics every 15 seconds. As each test lasts for 1 minute, there are 4 sar lines per test. “Unfortunately”, between one test and the next, some seconds are spent while the new clients connect, so the sar lines are not completely synchronized with the client lifecycle (which would be hard in any case, but it is worth pointing out). For this reason, the values plotted for each test come from the last sar line whose time precedes the test end time as tracked by the client report.
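The selection rule can be sketched as below. The shape of the parsed sar rows and the function name are assumptions made for illustration; parsing the sar output itself is not shown.

```javascript
// `samples` is a hypothetical array of parsed sar rows, sorted by time:
//   [{ time: <epoch ms>, cpu: <percent>, kbitPerSec: <number> }, ...]
// Returns the last row taken strictly before `testEnd`, or null if none qualifies.
function pickSample(samples, testEnd) {
  let chosen = null;
  for (const sample of samples) {
    if (sample.time < testEnd) {
      chosen = sample;
    } else {
      break; // rows are sorted, so later rows cannot qualify
    }
  }
  return chosen;
}

// Usage: with one row every 15 seconds and a test ending at t = 62 s,
// the row taken at t = 60 s is the one that gets plotted.
const rows = [15000, 30000, 45000, 60000, 75000].map((time) => ({ time }));
console.log(pickSample(rows, 62000).time); // 60000
```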

Machine

The test was run on two EC2 machines hosted in the same Amazon availability zone. I used an M1.xlarge machine to host the server process (and Redis in the Socket.IO case) and an M3.xlarge machine to host the generator and the load clients.

This means that in the Socket.IO case, 4 Node instances were used (one per core).

Plot

The latency on Socket.IO clearly grows faster. Part of the latency is probably introduced by Redis dispatching, which, even though it happens over the loopback interface, still goes through TCP.
The first spike in the Socket.IO line is at 3300 clients. I suspect at that time one of the Node processes reached the 1K client limit.

We can take a closer look at the latency growth by limiting the results to 3000 clients.

The graph tracking the total CPU usage (as the average usage of the 4 cores) shows a fairly regular increase on both servers. However, at 3100 clients, the Socket.IO graph shows a small jump. The increase in latencies seen on the previous graph is probably related to this jump.

Socket.IO bandwidth grows faster because of the verbose form of its updates; if we compare the messages sent by the two servers, we can see that the Socket.IO message is double the size of the Lightstreamer one (48 vs. 24 bytes). It is worth noting that it is possible to manually reduce the size of the Socket.IO message by using a shorter name for the message; by reducing the “timestamp” string to a single character, we can produce a message 8 characters shorter. Not handy when programming but certainly feasible (maybe shortening the name at build time).
Also, keep in mind that this is a very particular case in which a single field is sent, so your mileage may vary.
Socket.IO keepalives were disabled: the traffic is made up entirely of the updates.

In any case, most of the difference here is eaten by the TCP/IP and WebSocket overheads; from a general point of view, we have to add to each packet 20 to 40 bytes for the IP header (20 bytes for IPv4 without options, 40 bytes for IPv6), 20 to 24 for the TCP protocol, and 6 for the WebSocket protocol (we can be sure about this because the messages are shorter than 125 bytes). So, each packet size ranges between 70 and 94 bytes per message in the Lightstreamer case and between 94 and 118 in the Socket.IO case.
As an example, at 1K clients, we have a total of 10K messages per second from the server to the clients, thus:

for Lightstreamer:

with 70 bytes per packet we obtain 5468 Kbit/s

with 94 bytes per packet we obtain 7343 Kbit/s

for Socket.IO:

with 94 bytes per packet we obtain 7343 Kbit/s

with 118 bytes per packet we obtain 9218 Kbit/s
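The figures in the list above can be reproduced with a few lines of arithmetic. The sketch below takes the per-packet sizes as given and assumes, as the quoted numbers imply, that 1 Kbit is 1024 bits and that the results are truncated to whole numbers. The function name is illustrative.

```javascript
// Convert a per-packet size and a message rate into Kbit/s.
// Dividing by 1024 matches the figures quoted in the list above.
function kbitPerSecond(bytesPerPacket, messagesPerSecond) {
  return (bytesPerPacket * 8 * messagesPerSecond) / 1024;
}

const clients = 1000;
const messagesPerClient = 10; // one message every 100 ms
const rate = clients * messagesPerClient; // 10,000 messages per second overall

for (const bytes of [70, 94, 118]) {
  const kbit = Math.floor(kbitPerSecond(bytes, rate));
  console.log(`${bytes} bytes per packet: ${kbit} Kbit/s`);
}
// 70 bytes per packet: 5468 Kbit/s
// 94 bytes per packet: 7343 Kbit/s
// 118 bytes per packet: 9218 Kbit/s
```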

Looking at the above graph, we can see that the measured values track the larger figures. Because the biggest contribution to that difference is the IPv4 versus IPv6 header size, we can infer that the test was running over IPv6.
Note that although we would expect constant and regular growth in this graph, we can observe some hiccups here and there: each point is based on the average bandwidth over a 15-second slot of a 1-minute test, and the distribution of the updates over the full test may not be so regular (remember, CPU and bandwidth are based on 15-second slots whereas the latency mean is based on the full minute).

Single-Core Experiment

The above graph shows how the two servers behaved on the single-core machines. I was still running experiments when this test was performed so I don’t have the CPU and bandwidth data. As a result of the issue previously described, the Socket.IO server was not able to grow over 1K clients.
Keeping in mind that this single core is probably different from a core on the multi-core machine, it is worth noting that with 900 clients per core, Socket.IO has a median latency of 75 ms on the 4-core machine, whereas it has a median latency of 28 ms on the single-core machine. Most of the extra latency is probably caused by Redis. With Lightstreamer, the median latency is 29 ms on the 4-core machine and 22 ms on the single-core machine.

Raw Data

Conclusion

In this article, I focused on a far-from-comprehensive performance comparison between Socket.IO and Lightstreamer. I set up a benchmark on the Amazon EC2 infrastructure, trying to be as neutral as possible in designing an apples-to-apples comparison. I found that Lightstreamer seems to scale better than Socket.IO in terms of CPU usage, bandwidth consumption, and data latency.

But many more differences arise if we look at features. As you can see on this slide, Lightstreamer is made up of three logical layers. Aside from the web transport layer and the message routing layer, there is a data optimization (and security) layer, which includes several mechanisms for making the data delivery smarter. To name just a few, there are: bandwidth and frequency allocation, dynamic throttling of the data flow, conflation, resampling, delta delivery, batching, etc. In other words, Lightstreamer does not simply route opaque messages from one end to the other like a pipe, but is able to alter the data flow on the fly (maintaining data coherency) to avoid redundancy and further reduce bandwidth and latency.

I look forward to receiving any comments and feedback on this benchmark!