MigratoryData is the industry’s most scalable real-time messaging solution, typically used in large deployments with millions of users. Among its many features, MigratoryData provides a number of monitoring options along the HTTP and JMX standards. Also, any of its API libraries can be used to subscribe to special monitoring subjects and receive real-time statistics.

These MigratoryData statistics, made available along HTTP, JMX, and PUSH monitoring, are also logged on disk (at configurable time intervals) besides other log types such as access logs and message logs. While these logs are typically preserved for audit, they can also be used for other purposes such as monitoring or even debugging. For example, message logs can be used by MigratoryData Replayer – a tool able to replay a feed of messages, and publish recorded messages at slower, faster or original speeds by preserving the timestamp proportions.

It becomes obvious, then, how statistics logs, access logs, and message logs produced by such a high number of users can result in a huge amount of data. Hence, using a big data platform is natural.

In this blog post, we show how to use the popular open-source big data platform Elastic Stack for searching, analyzing, and visualizing data produced by MigratoryData clusters. More precisely, we will use:

Kibana for exploring, searching and filtering MigratoryData logs and for building dashboards to visualize the data

This blog post is based on Elastic Stack version 5.3.0 and MigratoryData version 5.0.21. All configuration files, dashboards, diagrams, and screenshots can be found on github.

Setup

For the purposes of this post, our setup consists of a MigratoryData cluster of three nodes. Each node runs one instance of MigratoryData Server and one instance of Filebeat. Filebeat is an agent which collects access logs, message logs, and statistics logs produced by the MigratoryData server. These logs are collected as soon as they are produced by the MigratoryData server. Collected logs are then sent to Elasticsearch over the network. Finally, users connect from web browsers to Kibana to make queries which are automatically forwarded to Elasticsearch.

Elasticsearch Installation and Configuration

You can start by installing one instance of Elasticsearch and one instance of Kibana on the same machine. Advanced Elastic Stack settings are available, including high availability clustering, but these are beyond the scope of this post.

The installation of Elasticsearch on Linux requires one kernel tuning. You can apply it temporarily by running the following command:

sudo sysctl -w vm.max_map_count=262144

Alternatively, you can apply this kernel tuning permanently: edit the system configuration file /etc/sysctl.conf, add the following line at the end of the file, and finally restart the Linux system:

vm.max_map_count = 262144

Let us suppose that the IP address of the machine running Elasticsearch and Kibana is 192.168.1.1. The installation of Elasticsearch is straightforward. Download the installation package in zip or tar format and uncompress it. Edit the default configuration file elasticsearch.yml located under the folder config and configure the parameter network.host as follows:

network.host: 192.168.1.1

Finally, run the startup script elasticsearch located under the folder bin. Elasticsearch will use the IP address 192.168.1.1 configured above and the default port 9200 to accept connections and communicate with both the Filebeat agents and Kibana.

Kibana Installation and Configuration

Kibana installation is straightforward. Simply download the installation package in zip or tar format and uncompress it. Edit the default configuration file kibana.yml located under the folder config and configure the parameter elasticsearch.url as follows:

elasticsearch.url: "http://192.168.1.1:9200"

Finally, run the startup script kibana located under the folder bin.

Kibana will use the above configuration to connect to Elasticsearch and will use the default port 5601 to accept connections from users.

Filebeat Installation and Configuration

One instance of the Filebeat agent should be installed on each machine of the MigratoryData cluster. The Filebeat agent collects the logs produced by the MigratoryData server into the folder defined by its parameter LogFolder, and sends them to Elasticsearch.

In order to install Filebeat, download the package in zip or tar format and uncompress it. Filebeat comes with a number of predefined modules available under the folder module. For example, there is a module for collecting logs of apache2 or nginx. A module basically defines the rules for transforming a particular logging format into a field-based format understood by Elasticsearch. The module based architecture of Filebeat allows us to create new modules.

We created a new module for Filebeat named migratorydata which defines the rules for parsing the access logs, message logs, and statistics logs of the MigratoryData server.

In order to install the new module, copy the folder migratorydata available in the github repository under the folder elastic-stack/filebeat/module into the folder module of your Filebeat installation.

Pipeline for Statistics

The pipeline file of the stats section defines the rule for parsing the statistics logged every 60 seconds (configurable with the parameter Stats.LogInterval) into Elasticsearch documents with the following fields:

Running Filebeat

At this stage, the migratorydata module has been installed into the folder module as detailed above. To run Filebeat, a configuration file migratorydata.yml should be created into the root of the Filebeat installation as follows:

This file is also available under the github repository. This file specifies the address of Elasticsearch as well as the name of the Elasticsearch index used to group all the documents corresponding to the MigratoryData logs received from the Filebeat agents.

Finally, to run the Filebeat agent for the MigratoryData server of the cluster named, for example, server1 use the following command:

Please note that the -E name=server1 part of the above command is optional. If not provided, Filebeat will use the hostname of the machine running that instance of the MigratoryData server. Assigning a name explicitly is particularly useful for testing when all servers of the MigratoryData cluster run on the same machine. Indeed, in order to be able to monitor each instance of the MigratoryData server of the cluster individually, each Filebeat agent will add a built-in field beat.name, containing the value provided by the name attribute above, to each log that it sends to Elasticsearch.

Using Kibana

In the example above, Kibana has been configured to listen for users on the IP address 192.168.1.1 and port 5601. Therefore, in order to access Kibana, use a modern web browser such as Google Chrome and open the following location:

http://192.168.1.1:5601

Creating a MigratoryData Index

The first time you access Kibana, it will ask you to create an index. An index is used to group together all related Elasticsearch documents. In this case, we will group the documents corresponding to the MigratoryData logs collected by the Filebeat agents from the MigratoryData cluster under an index pattern migratorydata-log-*. To create the index, Kibana will propose a form such as the one below:

Once the index has been created, you can see it by navigating to Management -> Index Patterns.

Exploring

To explore all data received from the MigratoryData cluster, navigate to Discover, type * in the query box, and select the migratorydata index pattern migratorydata-log-*. In the top menu on the right, you have a time picker where you can select, for example, Last 15 minutes. You can also set Auto-refresh to a configurable amount of time, say 10 seconds, to see the data received continuously during the moving window of the last 15 minutes. You can also use the time picker to explore any desired time interval, including a given number of seconds, minutes, hours, days, or years ago. Finally, you can also explore the data between two absolute dates.

The histogram in the screenshot above shows the number of MigratoryData logs indexed every 30 seconds during the selected time period. The table below the histogram lists Elasticsearch documents, structured by fields, corresponding to the MigratoryData logs which occurred during the selected time period. The left hand side displays the field names, which correspond to the fields defined by parsing rules of the Filebeat agents as explained above, together with a number of built-in fields introduced by Filebeat (above we already discussed the built-in field beat.name).

Searching / Filtering

You can refine your exploration of the data in the previous section by entering a search criterion in the query box. You can perform a free text search or search data by field. For example, to get all the logs of users which connect from IP address 172.16.230.4, you can perform a search such as migratorydata.access.client_ip:172.16.230.4. Below is a more complex search using boolean operators. Note that the boolean operators AND, OR, and NOT are case sensitive. For more details, please refer to Kibana documentation.
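The same kind of boolean search can also be expressed as an Elasticsearch `query_string` request body. The following is only a minimal sketch in Python: the field names `migratorydata.access.client_ip` and `beat.name` come from this post, while the helper name `build_query` and the choice of excluding `server1` are illustrative assumptions. The resulting body would still have to be sent to the `_search` endpoint of your index.

```python
import json


def build_query(client_ip, excluded_server=None):
    """Build a query_string search body (hypothetical helper, not part of Kibana)."""
    # Field search on the client IP parsed from the access logs.
    query = "migratorydata.access.client_ip:{}".format(client_ip)
    if excluded_server is not None:
        # Boolean operators are written in upper case in the query syntax.
        query += " AND NOT beat.name:{}".format(excluded_server)
    return {"query": {"query_string": {"query": query}}}


body = build_query("172.16.230.4", excluded_server="server1")
print(json.dumps(body, indent=2))
```

The query string inside the body is the same text you would type into the Kibana query box.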

A search filter can be saved and used for a later search. In this example the search filter has been saved under the name search-connections-by-ip-during-timeinterval as highlighted in the screenshot above. The saved search filter can be also used to build visualizations for the filtered data (as shown in the next section).

It is worth mentioning that the time picker of a search form can be used, for example, to define an Auto refresh time such that new search results can be displayed as long as new data are added to Elasticsearch which match the search criterion.

Visualizing

As an example, we visualized the outgoing bytes per second during the last 15 minutes. We filtered stats logs using the search criterion _type:log-stats, defined the Y axis based on the field migratorydata.stats.out_bytes, grouped values in buckets of 30-second time periods according to the X axis defined by @timestamp, and finally displayed each bucket’s average of the outgoing bytes per second for each MigratoryData server of the cluster. We also split the lines by the field beat.name discussed above, which helps to distinguish the data coming from each cluster member.
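To make the bucketing concrete, here is a small Python sketch of the computation the chart performs: values are grouped by server name and by 30-second bucket, then averaged. It is an illustration of the idea only, not how Kibana or Elasticsearch implement it; the function name `average_per_bucket` and the sample values are made up.

```python
from collections import defaultdict


def average_per_bucket(docs, bucket_seconds=30):
    """Average out_bytes per (server, bucket start).

    docs is an iterable of (epoch_seconds, server_name, out_bytes) tuples,
    standing in for @timestamp, beat.name and migratorydata.stats.out_bytes.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (server, bucket) -> [total, count]
    for ts, server, out_bytes in docs:
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        entry = sums[(server, bucket_start)]
        entry[0] += out_bytes
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up sample values, for illustration only.
docs = [
    (0, "server1", 100.0),
    (10, "server1", 300.0),
    (35, "server1", 500.0),
    (5, "server2", 50.0),
]
averages = average_per_bucket(docs)
for (server, bucket_start), avg in sorted(averages.items()):
    print(server, bucket_start, avg)
```

Each `(server, bucket_start)` pair corresponds to one point on one line of the chart.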

In the screenshot above we defined Auto refresh at 10 seconds and therefore the chart is updating continuously every 10 seconds.

Visualizations can be saved and are typically used in building dashboards as we will be showing in the next section.

Dashboard Monitoring

Multiple visualizations can be grouped together into a dashboard, and can be easily rearranged. The screenshot below provides one such example, with the name of each visualization visible on the dashboard.

As for all other Kibana elements, an Auto refresh time can be defined, such that monitoring is performed continuously. The time picker can be used to monitor various time periods including historical time intervals.

This dashboard template is also available in the github repository. In order to load it, navigate to Management -> Saved Objects -> Import.

Further Monitoring

Using the few concepts presented above you can build various filtering, visualizations, and dashboards according to your needs. Moreover, you can simply build new modules to index additional information into Elasticsearch. Last but not least, Elasticsearch has integrations with various applications and therefore you can use it to monitor other environments alongside MigratoryData clusters in a single place.

System Monitoring: Metricbeat

Besides the Filebeat agent, Elastic Stack provides Metricbeat, an off-the-shelf agent which collects various system information. This allows you to monitor each machine of the MigratoryData cluster in (near) real-time and also see historical metrics. Metricbeat comes with predefined dashboards which you can load as explained above. Metricbeat should be installed alongside Filebeat on each machine of the MigratoryData cluster. Here is an example of a dashboard produced with Metricbeat.

Garbage Collections (GC) Logs

In the future we will enhance the Filebeat module migratorydata to also parse the GC logs produced by the Java Virtual Machine (JVM) running the MigratoryData server.

New Suggestions

What are your thoughts on the above? We welcome all suggestions and would love to enhance monitoring with any additional information and dashboards, at your request. Just let us know!

Finally, for more information on MigratoryData Server, and for a comprehensive list of product features please visit migratorydata.com.

Using the RESTful HTTP request-response approach can become very inefficient for websites requiring real-time communication. We propose a new approach and exemplify it with a well-known feature that requires real-time communication, and which is included by most websites: search box autocomplete.

Google, which is one of the most demanding web search environments, seems to handle about 40,000 searches per second according to an estimation made by Internet Live Stats. Supposing that for each search, a number of 6 autocomplete requests are made, we show that MigratoryData can handle this load using a single 1U server.

More precisely, we show that a single MigratoryData server running on a 1U machine can handle 240,000 autocomplete requests per second from 1 million concurrent users with a mean round-trip latency of 11.82 milliseconds.

In our previous world-record-setting C10M benchmark, we showed that MigratoryData Server can achieve extreme high-scalability by delivering real-time messaging to 10 million concurrent users on a single 1U machine. Here, we show that MigratoryData Server can achieve simultaneously both extreme high-scalability and consistent low latency.

Written in Java, MigratoryData Server runs on a Java Virtual Machine (JVM). In the previous C10M benchmark, some JVM tuning adjustments were necessary such as using CMS garbage collector, enabling huge pages, and using compressed pointers.

In this post, we show that by simply replacing the JVM with Zing JVM out-of-the-box (without any tuning), and preserving the same C10M benchmark scenario and setup, we can reduce the average latency from 61 milliseconds to under 15 milliseconds. Moreover, and more importantly, the latency spikes can be significantly reduced from 585 milliseconds to 25 milliseconds for the 99th percentile latency and from 1700 milliseconds to 126 milliseconds for the maximum latency. Therefore, every single message can be delivered, even in the worst case, with almost no delay.

And so, the relatively high latency spikes we saw in the previous C10M benchmark were due to JVM’s Garbage Collection (GC). In the new benchmark, not only did Zing JVM not introduce high latency spikes, but based on analyzing the logs, it appears that GC effects no longer dominate latency behavior. The dramatically improved 126 ms max latency is not caused by GC but by other conditions of the benchmark setup. In any case, this max latency was so small for a web architecture that we did not spend time determining at exactly which level it occurred.

In summary, this new C10M benchmark demonstrates that MigratoryData Server running on a single 1U machine can handle 10 million concurrent clients each receiving a 512-byte message per minute (at a total bandwidth of 0.8 Gbps) with a consistent end-to-end latency of under 15 milliseconds.

Benchmark Setup

We used the same benchmark setup and config from the previous C10M benchmark (see sections “Benchmark Setup” and “Configuration Tuning”). The only changes were to replace the JVM with Zing JVM and not use huge pages.

MigratoryData Server is essentially a publish/subscribe system for web messaging. Subscribers are clients which connect to the MigratoryData server – using persistent WebSocket or HTTP connections – and subscribe to subjects (also known as topics). Publishers are clients which connect to the MigratoryData server and publish messages. A message consists mainly of a subject and some data. Upon receiving a message from a publisher, the MigratoryData server delivers that message to all clients which subscribed to the subject of that message.

To briefly describe the benchmark setup (full details, including the specifications of the machines, can be found in our previous C10M benchmark), nine machines were utilized as follows:

One machine was utilized to run one instance of the MigratoryData server.

In order to connect 10 million clients, four machines were utilized to run four instances of Benchsub, each opening 2.5 million HTTP persistent connections to the MigratoryData server. Also, Benchsub subscribed each client to a distinct subject.

In order to send one message per minute to each client, four machines were utilized to run eight instances of Benchpub (two instances per machine) for publishing 168,000 messages/second to the MigratoryData server, each instance publishing 21,000 messages/second.

Finally, a fifth Benchsub instance was used to connect 100 additional clients, representing samples of the population of 10 million concurrent clients. This Benchsub instance was used to compute supplemental latency statistics – in addition to the latency statistics computed by the other four Benchsub instances.

Latency is the end-to-end time needed for a message to propagate from the publisher to the subscriber, via the MigratoryData server. Thus, the latency of a message is the difference between the time at which the message is sent by Benchpub to the MigratoryData server and the time at which the message is received by Benchsub from the MigratoryData server.

MigratoryData Server provides advanced monitoring via JMX and other protocols. We used the jconsole tool (part of the Java Development Kit) for JMX monitoring. In the results presented below we show screenshots obtained during JMX monitoring.

Connections and Messages

As mentioned in the Benchmark Setup section above, the 10 million concurrent connections were opened by four instances of Benchsub that simulated 2.5 million concurrent users each. In addition, a fifth instance of Benchsub opened another 100 concurrent connections. Indeed, the indicator ConnectedSessions of the JMX screenshot below shows around 10,000,100. The same number of concurrent socket connections is confirmed by the tools netstat and slabtop (see the screenshot in the Network Utilization subsection below).

As mentioned in the Benchmark Setup section above, eight Benchpub instances published 168,000 messages/second in order to produce one message per minute for each of the 10 million clients. The JMX indicator OutPublishMessagesPerSecond shows that the outgoing message throughput is indeed around 168,000 messages/second.
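The publish rate follows directly from the numbers above. As a quick worked check in Python (all figures are taken from this post; the variable names are ours):

```python
import math

clients = 10_000_000        # concurrent subscribers, one distinct subject each
interval_seconds = 60       # target: one message per client per minute
benchpub_instances = 8
rate_per_instance = 21_000  # messages/second published by each Benchpub instance

# Minimum aggregate rate needed so that every client gets one message per minute.
required_rate = clients / interval_seconds
min_rate_per_instance = math.ceil(required_rate / benchpub_instances)

# Rate actually configured in the benchmark.
configured_rate = benchpub_instances * rate_per_instance
messages_per_client_per_minute = configured_rate * interval_seconds / clients

print(round(required_rate))            # about 166,667 messages/second needed
print(min_rate_per_instance)           # 20,834 messages/second per instance
print(configured_rate)                 # 168,000 messages/second configured
print(messages_per_client_per_minute)  # slightly more than one per minute
```

The configured 21,000 messages/second per instance is a round figure just above the minimum, so each client receives marginally more than one message per minute.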

Memory and CPU Utilization

In the screenshot below, you can see the memory and CPU usage during the benchmark test. The CPU usage is similar to that of the previous C10M benchmark except for the CPU spikes. The spikes of up to 100% produced by the Full GCs in the previous C10M benchmark are now reduced to under 55%.

Network Utilization

As you can see in the screenshot below, sending a 512-byte message to each of 10 million clients every minute produced a total outgoing bandwidth from server to subscribers of 103 MB/s or 0.8 Gbps. This total bandwidth includes the 512-byte payload of each of the 168,000 messages sent every second to the subscribers, the overhead added by the TCP and MigratoryData protocol for each message, as well as certain traffic produced by the ssh sessions, JMX monitoring, and the TCP acknowledgements sent by the MigratoryData server to publishers for the received messages. Therefore, the overhead introduced by the TCP and MigratoryData protocol was under 131 bytes per message.

This screenshot also shows the top and slabtop information, and the number of sockets as reported by the netstat tool.

Latency

As defined in the Benchmark Setup section above, latency is the time needed for a message to propagate from the publisher to the subscriber, via the MigratoryData server. When Benchpub creates a message it includes the creation time as part of it. In this way, Benchsub can compute the latency as the difference between the creation and reception times of the messages. Because the machines were synchronized with ntp, which did not run long enough for perfect time synchronization, we can observe time differences of 1-2 milliseconds between publisher machines and subscriber machines (see the negative values of “Latency Min” in the screenshot).

In addition to computing the latency for all messages received, Benchsub also calculates the average, standard deviation, and maximum. These latency statistics are computed incrementally for each new message received. In this way, the statistics are obtained for all messages received, and not just for a sample size.
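The post does not show how Benchsub updates these statistics internally. One common way to maintain a mean, standard deviation, and maximum incrementally, without storing the individual values, is Welford's online algorithm, sketched below in Python. The class name `RunningLatencyStats` and the sample values are illustrative assumptions.

```python
import math


class RunningLatencyStats:
    """Incremental mean / standard deviation / maximum (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.maximum = -math.inf

    def add(self, latency_ms):
        # Update all statistics in O(1) time and memory per message.
        self.count += 1
        delta = latency_ms - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (latency_ms - self.mean)
        self.maximum = max(self.maximum, latency_ms)

    @property
    def stddev(self):
        # Population standard deviation, since every message is included.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


stats = RunningLatencyStats()
for value in [12.0, 14.0, 13.0, 15.0, 11.0]:  # made-up latencies in ms
    stats.add(value)
print(stats.count, stats.mean, stats.stddev, stats.maximum)
```

Because only a handful of numbers are kept per instance, this approach scales to billions of messages, which is what makes whole-population statistics practical here.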

In the screenshot below, the “Total messages” information shows that each of the four Benchsub instances received around 450 million messages during more than 3 hours of the benchmark test. Hence, the following latency statistics are accurate, being computed on the entire population of about 1.8 billion messages (up to the ntp time difference of 1-2 milliseconds):

Latency Mean is 13.24 milliseconds

Latency Standard Deviation is 4.7 milliseconds

Latency Maximum is 126 milliseconds

More Latency Statistics

In the previous section we explained that the latency statistics – mean, standard deviation, and maximum – were computed for all messages received by all 10 million clients. However, other stats that would be interesting to look at for a real-time service, including the median, the 95th percentile and the 99th percentile, cannot be computed incrementally. We need all latency values in order to be able to compute such extra statistics. Recording 1.8 billion latency values is not practical during a performance test, so we used sampling to estimate these additional statistics.

As mentioned in the Benchmark Setup section, we used a fifth Benchsub instance to collect samples for 100 concurrent users from the entire population of 10 million. Each of the 100 users subscribed to a randomly selected subject from the 10 million subjects available.

We recorded all latency values for each of the 100 users for more than 3 hours during the benchmark test. Since each user received an average of one message per minute, we computed and recorded approximately 200 latency values for each user. Subsequently, we computed the median, average, 95th percentile, and 99th percentile for each of the 100 users (detailed results are available here).

Finally, we calculated a 99% confidence interval and we can estimate that, should we repeat the test, there is a 99% probability that the average value would be as follows:

Median Latency: 13.80 ms ± 0.14 ms

Mean Latency: 13.33 ms ± 0.12 ms

95th Percentile Latency: 21.27 ms ± 0.20 ms

99th Percentile Latency: 24.44 ms ± 0.47 ms
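The exact procedure used to derive these intervals is not spelled out in this post. A plausible sketch, assuming a nearest-rank percentile per user and a normal-approximation 99% confidence interval (z = 2.576) across the 100 per-user values, looks like this in Python. The function names and the synthetic data are assumptions for illustration and do not reproduce the benchmark results.

```python
import math
import random
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a list of values (pct is an integer in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def confidence_interval_99(per_user_values):
    """Return (mean, half_width) using the normal approximation."""
    n = len(per_user_values)
    mean = statistics.mean(per_user_values)
    half_width = 2.576 * statistics.stdev(per_user_values) / math.sqrt(n)
    return mean, half_width


# Synthetic data: 100 sampled users with about 200 latency values each.
random.seed(1)
users = [[random.gauss(13.0, 4.0) for _ in range(200)] for _ in range(100)]

medians = [statistics.median(u) for u in users]
p99s = [percentile(u, 99) for u in users]

for label, values in (("median", medians), ("99th percentile", p99s)):
    mean, half_width = confidence_interval_99(values)
    print("{}: {:.2f} ms ± {:.2f} ms".format(label, mean, half_width))
```

Percentiles need the full list of values for each user, which is why they are computed on the 100-user sample rather than on all 1.8 billion latencies.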

Conclusion

We showed here that MigratoryData Server can achieve simultaneously both extreme high-scalability and consistent low latency – delivering real-time data to 10 million concurrent users with under 15 ms average latency, under 25 ms 99th percentile latency, and 126 ms maximum latency for the worst case (computed from a total of 1.8 billion latency values!) – provided that a properly tuned JVM for short garbage collection pauses is used, such as Zing JVM.

The C10M problem relates to the classic C10K Internet scalability problem, which originally occurred in the context of Web servers. It consists of successfully supporting 10,000 concurrent connections on a single machine, and while C10K is currently solved by certain Web servers, C10M remains a challenging problem not only for Web servers, but for any Internet server in general.

MigratoryData Server is a real-time Web server using the WebSocket protocol, as well as the HTTP protocol, to communicate with its clients. Unlike traditional Web servers, MigratoryData Server does not use the request-response interaction model (employing short-living connections). Rather it uses the publish/subscribe model along persistent connections. With clients permanently connected, MigratoryData Server makes data delivery to its clients possible in a timely manner, with low latency.

In addition to its role as real-time Web server, MigratoryData Server implements features traditionally provided by Enterprise Messaging Systems such as publish/subscribe interaction, active/active clustering, guaranteed message delivery, entitlement, as well as API libraries for the most popular environments.

While achieving high scalability is a difficult task for a Web server in general, achieving high scalability for a real-time Web server, without sacrificing enterprise messaging features, is even harder.

In this article, we demonstrate that MigratoryData Server is able to handle 10 million concurrent connections on a single commodity machine. Moreover, it is able to push almost 1 Gbps live data to these 10 million users (each user receiving a 512-byte message per minute) with an average latency of under 100 milliseconds.

MigratoryData’s Publish/Subscribe Interaction

Subscribing clients connect to the MigratoryData server using persistent WebSocket or HTTP connections and subscribe to one or more subjects (also known as topics) by using MigratoryData’s protocol. Publishing clients communicate with the MigratoryData server in the same way as subscribing clients, but they publish messages. A message contains mainly a subject and some data. When a message is received by the MigratoryData server, it distributes that message to all clients that subscribed to the subject of that message.
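The interaction model can be illustrated with a tiny in-memory broker in Python. This is a conceptual sketch only: it does not use MigratoryData's protocol or API, and the class name `Broker`, the callbacks, and the subject names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy publish/subscribe broker keeping subscribers per subject."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # subject -> list of callbacks

    def subscribe(self, subject, callback):
        self._subscribers[subject].append(callback)

    def publish(self, subject, data):
        # Deliver the message to every client subscribed to its subject.
        callbacks = self._subscribers.get(subject, [])
        for callback in callbacks:
            callback(subject, data)
        return len(callbacks)


received = []
broker = Broker()
broker.subscribe("/stocks/ACME", lambda s, d: received.append(("client-1", s, d)))
broker.subscribe("/stocks/ACME", lambda s, d: received.append(("client-2", s, d)))
broker.subscribe("/stocks/OTHER", lambda s, d: received.append(("client-3", s, d)))

delivered = broker.publish("/stocks/ACME", b"hello")
print(delivered, received)
```

Only the two clients subscribed to the published subject receive the message; the third client, subscribed to a different subject, receives nothing.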

Benchmark Setup

We used several publishing clients to push messages into a MigratoryData server, which pushed the messages out to several subscribing clients through HTTP persistent connections.

The publishing tool used in the benchmark test – MigratoryData Benchpub – is capable of publishing messages of a configurable size at a configurable frequency. The subscribing tool – MigratoryData Benchsub – is capable of opening a configurable number of concurrent connections, subscribing to a configurable number of subjects, and computing the latency of the messages received for the subscribed subjects. Both Benchpub and Benchsub use MigratoryData’s protocol, so they simulate real applications.

The following diagram shows the architecture of the benchmark test.

Nine machines were utilized in the benchmark test, as follows:

One Dell R610 machine was utilized to run one instance of the MigratoryData server. The specifications of this 1U machine are as follows:

Four Dell R610 machines were utilized to run four instances of Benchsub. Each Benchsub instance simulated 2.5 million concurrent users, totaling 10 million concurrent users. Each user subscribed to a distinct subject. Therefore, the total number of concurrent subjects was also 10 million.

Four Dell SC1435 machines were utilized to run eight instances of Benchpub (four pairs of instances). Each of the eight Benchpub instances published at a frequency of 21,000 messages per second, with each message containing a sequence of 512 random bytes. The total message throughput was therefore 168,000 messages per second. In this way, each user received one message per minute.

Finally, an additional Benchsub instance was used to simulate 100 concurrent users, representing samples of the population of 10 million concurrent users. This Benchsub instance was used to compute supplemental latency statistics – in addition to the latency statistics computed by the other four Benchsub instances. These other four Benchsub instances used the latencies of all messages received by all 10 million users to compute their statistics, instead of sampling the data.

Latency, depicted in the diagram above, is measured from the time Benchpub creates a message until Benchsub receives the message from the MigratoryData server.

Results

MigratoryData Server provides advanced monitoring via JMX and other protocols. We used the jconsole tool (included in the Java Development Kit) to monitor the MigratoryData server via JMX. In the results presented below we show screenshots obtained during JMX monitoring.

Connections

As depicted in the Benchmark Setup section above, the 10 million concurrent connections were opened by four instances of Benchsub that simulated 2.5 million concurrent users each. Each of the 10 million users subscribed to a distinct subject, hence there were 10 million concurrent subscribed subjects as well. In addition, a fifth instance of Benchsub opened another 100 concurrent connections.

As can be seen from the JMX screenshot below, MigratoryData Server handled 10,000,108 concurrent connections (see the JMX indicator ConnectedSessions). The same number of concurrent socket connections is confirmed by the tools netstat and slabtop (see the screenshot in the Network Utilization subsection below).

Messages

As described in the Benchmark Setup section, eight Benchpub instances ran on four machines and sent messages to the MigratoryData server.

In order for each of the 10 million users to receive one message per minute, each of the eight Benchpub instances published 21,000 messages per second. The payload of each message consisted of a sequence of 512 random bytes. Therefore, the eight Benchpub instances sent 168,000 messages per second to the MigratoryData server which were then pushed out to the subscribing clients at the same message frequency.

The screenshot of the Connections subsection above shows that the outgoing messages throughput is around 168,000 messages per second (see the JMX indicator OutPublishMessagesPerSecond).

CPU Utilization

In the screenshot below, it can be seen that the CPU usage of the machine which hosted the MigratoryData server was under 50%, with spikes from time to time when a major JVM Garbage Collection occurred. In fact, from our observations, the CPU percent strictly utilized by the MigratoryData server is under 40%. The variations you can see in the screenshot from under 40% to about 50% occur when minor JVM Garbage Collections happen.

Memory Utilization

In the screenshot below you can see that the memory usage is predictable and the pattern does not change after 3 hours of test running. More importantly, during the 3 hours of test running there were both minor and major JVM Garbage Collections. Thus, the test simulates a real life situation when both types of Garbage Collections might occur.

Network Utilization

As you can see in the screenshot below, the outgoing traffic for pushing 168,000 messages per second to 10 million concurrent clients was 103 Megabytes per second, representing 0.8 Gbps.

The payload of each message is 512 bytes and the throughput is 168,000 messages per second, totaling 82 Megabytes per second. The difference of 21 MB/sec, up to the actual bandwidth utilization of 103 MB/sec, was introduced by the overhead added by the MigratoryData protocol as well as by the TCP/IP protocol, resulting in an extra 131 bytes per message.

In fact, the overhead introduced by the MigratoryData protocol and the TCP/IP protocol is even less than 131 bytes per message. When we calculated the bandwidth – using the accurate traffic reported by the kernel into /proc/net/dev – we included all outgoing traffic of the network interface. This traffic is almost entirely produced by messages being pushed to clients. However it also includes some additional traffic produced by several ssh sessions, the JMX monitoring console, as well as the acknowledgements sent to publishers for receiving the messages.
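The overhead figure can be reproduced with a few lines of Python. One assumption is needed: the quoted numbers (82 MB/s, 131 bytes, 0.8 Gbps) only reconcile if "MB" is read as 2^20 bytes and "Gbps" as 2^30 bits per second, so the sketch uses binary units throughout.

```python
MIB = 1024 * 1024  # assumption: "MB" in the text means 2**20 bytes

payload_bytes = 512
messages_per_second = 168_000
measured_mib_per_second = 103  # outgoing traffic reported in the text

# Bandwidth consumed by payload alone.
payload_mib_per_second = payload_bytes * messages_per_second / MIB

# Average bytes on the wire per message, and the part that is not payload.
bytes_per_message = measured_mib_per_second * MIB / messages_per_second
overhead_per_message = bytes_per_message - payload_bytes

# Total bandwidth in binary gigabits per second.
gbit_per_second = measured_mib_per_second * 8 / 1024

print(round(payload_mib_per_second, 2))  # about 82 MB/s of payload
print(round(overhead_per_message, 2))    # about 131 bytes of overhead
print(round(gbit_per_second, 2))         # about 0.8 Gbps
```

Since the measured traffic also includes ssh, JMX, and acknowledgement packets, the 131 bytes computed this way is an upper bound on the per-message protocol overhead.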

Latency

As defined in the Benchmark Setup section, latency is the time needed for a message to propagate from the publisher to the subscriber, via the MigratoryData server. When Benchpub creates a message it includes the creation time as part of it. In this way, Benchsub can compute the latency as the difference between the creation and reception times of the messages.
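The timestamp technique can be sketched as follows in Python. The wire layout used here (an 8-byte millisecond timestamp prepended to the payload) and the function names are assumptions made for illustration; the post does not describe how Benchpub actually encodes the creation time.

```python
import struct
import time

# Assumed layout: creation time in milliseconds as an 8-byte signed integer.
HEADER = struct.Struct("!q")


def make_message(payload, now_ms=None):
    """Publisher side: stamp the message with its creation time."""
    created = int(time.time() * 1000) if now_ms is None else now_ms
    return HEADER.pack(created) + payload


def latency_ms(message, received_ms=None):
    """Subscriber side: reception time minus embedded creation time."""
    (created,) = HEADER.unpack_from(message)
    received = int(time.time() * 1000) if received_ms is None else received_ms
    return received - created


message = make_message(b"x" * 512, now_ms=1_000_000)
normal = latency_ms(message, received_ms=1_000_013)
skewed = latency_ms(message, received_ms=999_999)
print(normal)  # subscriber clock in sync: positive latency
print(skewed)  # subscriber clock slightly behind: negative latency
```

Because the two timestamps come from different machines, the result is only as accurate as the clock synchronization between them, which is why slightly negative minimum latencies can appear.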

In addition to computing the latency for all messages received, Benchsub also calculates the average, standard deviation, and maximum. These latency statistics are computed incrementally for each new message received. In this way, statistics are obtained for all messages received, and not just for a sample size.
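As an illustration of how such statistics can be maintained without storing any latency value, here is a minimal sketch based on Welford's online algorithm. It is not Benchsub's actual code, which is not shown in this post; the class and method names are hypothetical.

```python
import math


class RunningLatencyStats:
    """Incremental count, mean, standard deviation, minimum and maximum."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, latency_ms):
        """Fold one latency value into the statistics in O(1) time and memory."""
        self.count += 1
        delta = latency_ms - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (latency_ms - self.mean)
        self.minimum = min(self.minimum, latency_ms)
        self.maximum = max(self.maximum, latency_ms)

    @property
    def std_dev(self):
        """Population standard deviation of all values seen so far."""
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


if __name__ == "__main__":
    stats = RunningLatencyStats()
    # Hypothetical latencies in milliseconds, for illustration only.
    for value in (2, 4, 4, 4, 5, 5, 7, 9):
        stats.add(value)
    print(stats.count, stats.mean, stats.std_dev, stats.maximum)
```

Welford's update is used here rather than keeping a plain sum of squares because it stays numerically stable over very long runs, which matters when more than a billion values are folded in.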

In the screenshot below, the “Total messages” information shows that each of the four Benchsub instances received around 400 million messages during the 3 hours of the benchmark test. Therefore, the following latency statistics are very accurate, being computed on the entire population of more than 1.5 billion messages:

Latency Mean is 61 milliseconds

Latency Standard Deviation is 140 milliseconds

Latency Maximum is 1.7 seconds

Note – Time was synchronized with ntp which did not run long enough for perfect time synchronization, which is the reason for the observed negative minimum latencies. Because minimum latency is normally 0, the negative and positive minimum latencies represent the difference introduced by the imperfect time synchronization among machines.

More Latency Statistics

In the previous section we explained that the latency statistics – mean, standard deviation, and maximum – were computed for all messages received by all 10 million clients. However, other stats that would be interesting to look at for a real-time service, including the median, 95th percentile and 99th percentile, cannot be computed incrementally. We need all latencies in order to be able to compute such extra statistics. Recording 1.5 billion latencies is not practical during a performance test, so we used sampling to estimate these additional statistics.

As outlined in the Benchmark Setup section, we used a fifth Benchsub instance to collect samples for 100 concurrent users from the entire population of 10 million. Each of the 100 users subscribed to a randomly selected subject from the 10 million available.

We recorded all latencies for each of the 100 users for 166 minutes during the benchmark test. Since each user received an average of one message per minute, we computed and recorded approximately 166 latencies for each user. Subsequently, we computed the median, average, 95th percentile, and 99th percentile for each of the 100 users (results are available as a CSV file here, which also includes the random subject each user subscribed to, as well as the precise number of messages received by each user).

Finally, we calculated a 99% confidence interval and we can estimate that, should we repeat the test, there is a 99% probability that the average value – for all users – would be as follows:

Median Latency: 18.71 ms ± 1.29 ms

Mean Latency: 58.52 ms ± 2.83 ms

95th Percentile Latency: 374.90 ms ± 21.51 ms

99th Percentile Latency: 585.06 ms ± 17.16 ms

Note – We can see that the mean – 61 milliseconds – computed as detailed in the previous subsection for all users (and for more than 1.5 billion latencies) belongs to the calculated confidence interval for mean: [55.69 ms, 61.35 ms] (i.e. 58.52 – 2.83 ms and 58.52 + 2.83 ms).
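For readers who want to reproduce this kind of estimate from per-user values, the sketch below computes a confidence interval for the mean using a normal approximation. This is an assumption on our part: the post does not state which method was used, and the sample values in the example are hypothetical, not taken from the CSV file.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(values, confidence=0.99):
    """Return (mean, half_width) for the mean of `values`.

    Uses the normal approximation: mean +/- z * s / sqrt(n).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return mean(values), half_width


def within(value, centre, half_width):
    """True if `value` lies inside [centre - half_width, centre + half_width]."""
    return centre - half_width <= value <= centre + half_width


if __name__ == "__main__":
    # Hypothetical per-user mean latencies in milliseconds.
    per_user_means = [52.0, 61.5, 58.3, 66.1, 49.8, 57.2, 63.4, 55.9]
    centre, half = confidence_interval(per_user_means)
    print(f"{centre:.2f} ms +/- {half:.2f} ms")
    # The check made in the note above: is 61 ms inside 58.52 +/- 2.83 ms?
    print(within(61, 58.52, 2.83))
```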

From our observations, in the absence of minor and major JVM Garbage Collections, all of the latency statistics above should be around the median values at 18 milliseconds. For example, the maximum latency above was introduced by a major JVM Garbage Collection. However, as major Garbage Collections happen rarely, in real life, such a high latency will occur only a few times per day.

Note – We have customers with large deployments (millions of end users) where the Java Virtual Machine is configured such that no major JVM Garbage Collection occurs. However, in these cases MigratoryData server is restarted on a daily basis.

Configuration Tuning

The benchmark test used standard configurations of Linux Kernel, Java Virtual Machine, and MigratoryData Server with only a few changes, which I am going to detail below.

Linux Kernel

MigratoryData Server ran on a machine with CentOS / RHEL 7.1 out of the box. In order to demonstrate that MigratoryData Server is able to solve the C10M problem on commodity hardware and operating systems typically found in data centers, we did not recompile the kernel, but used the default 3.10.0-229 kernel.

The only system configurations we made are as follows:

Increased the number of socket descriptors, in order to allow the system to handle 10 million sockets:

used the sysctl configuration fs.file-max=12000500

echo 20000500 > /proc/sys/fs/nr_open

ulimit -n 20000000

Increased the maximum number of memory pages for TCP using the sysctl configuration:

net.ipv4.tcp_mem=10000000 10000000 10000000

Adjusted the buffers of TCP connections with sysctl for better memory usage, as follows:

net.ipv4.tcp_rmem=1024 4096 16384

net.ipv4.tcp_wmem=1024 4096 16384

net.core.rmem_max=16384

net.core.wmem_max=16384

Statically balanced the hardware interrupts of the network adapter across the logical CPUs using smp_affinity. The Intel X520-DA2 network adapter has 24 tx/rx queues, each having a hardware interrupt (in the /proc/interrupts there are 24 entries for the p1p1 network interface). Coincidentally, the server also has 24 logical processors corresponding to its two six-core CPUs. We used smp_affinity to statically map each interrupt of the 24 tx/rx queues of the network adapter to each of the 24 logical processors.

Better use of Translation-Lookaside Buffer (TLB) caches by the processor. These caches contain virtual-to-physical address translations and have a small number of entries with the most-recently used pages. Using huge pages of 2 MB instead of the normal 4 KB pages, a TLB entry can handle much more memory, thus making the CPU caching more efficient. Because we allocated 54 GB to the Java Virtual Machine which ran the MigratoryData server, we reserved 60 GB huge pages (30720 huge pages x 2 MB / huge page) using the sysctl configuration:

vm.nr_hugepages=30720
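The static interrupt balancing described in the list above can be sketched as follows. The file /proc/irq/<n>/smp_affinity accepts a hexadecimal CPU bitmask, so pinning an interrupt to one logical CPU means writing a mask with a single bit set. The IRQ numbers used here (64 to 87) are hypothetical placeholders; the real ones have to be read from /proc/interrupts for the p1p1 interface.

```python
def affinity_mask(cpu_index):
    """Hex bitmask selecting exactly one logical CPU.

    Valid as written for fewer than 32 CPUs; larger systems use
    comma-separated 32-bit groups, which this sketch does not handle.
    """
    return format(1 << cpu_index, "x")


def plan_affinity(irq_numbers, cpu_count):
    """Map each IRQ to one CPU, wrapping around if there are more IRQs than CPUs."""
    return {irq: affinity_mask(i % cpu_count) for i, irq in enumerate(irq_numbers)}


def render_commands(plan):
    """Shell commands that would apply the plan (to be run as root)."""
    return [f"echo {mask} > /proc/irq/{irq}/smp_affinity" for irq, mask in plan.items()]


if __name__ == "__main__":
    hypothetical_irqs = range(64, 88)  # 24 tx/rx queue interrupts (assumed numbers)
    for command in render_commands(plan_affinity(hypothetical_irqs, cpu_count=24)):
        print(command)
```

With 24 queues and 24 logical processors the mapping is one to one, so no two queues share a processor.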

Java Virtual Machine

We used Oracle Java 1.8 update 45. Here are the main Java Virtual Machine (JVM) parameters we used (all JVM parameters can be seen in the screenshot below):

Allocated 54 GB for the JVM

Used Concurrent Mark Sweep (CMS) Garbage Collector

Enabled huge pages as discussed in the Linux Kernel subsection above by using:

-XX:+UseLargePages

Used compressed pointers, extended beyond the usual 32 GB limit, in order to reduce the memory footprint (by about 10 GB) by using:

-XX:ObjectAlignmentInBytes=16 -XX:+UseCompressedOops

MigratoryData Server

We used MigratoryData Server 5.0.14. To its default configuration, we made the following changes:

Enabled the JMX monitoring through the port 3000 (without authentication, and via an unencrypted connection):

Monitor = JMX

MonitorJMX.Authentication = false

MonitorJMX.Listen = 192.168.3.115:3000

Configured parallelism. In order to scale better on multiprocessor servers, incoming users are separated into workgroups based on their IP address. Workgroups run in parallel, using almost independent threads. Thus, we used the following parameters related to parallelism:

Workgroups = 10

IoThreads = 20

Distributed users across workgroups. Because all 10 million users came from only four IP addresses, originating from the four Benchsub instances, we used a parameter called BenchmarkMode in order to distribute users across workgroups as would happen in real life when they would all come from different IP addresses. To achieve this, we used the configuration:

BenchmarkMode = true

Reduced the default initial size of the buffers. When handling a message, a buffer of 8192 bytes is created. If the message is larger than 8192 bytes, the buffer automatically expands in order to hold the entire message. On the other hand, if the message is smaller than 8192 bytes, some memory space remains unused. Because we know the payload of our messages is 512 bytes, we reduced the default initial size of the buffers from 8192 to 768 using the following parameters (note that the name of the parameters is quite misleading; it is not a hard-coded limit but a default initial size):

BufferLimit.Send = 768

BufferLimit.Receive = 768

Reduced memory footprint and allowed better performance by using a native C implementation with JNI for socket handling:

X.Native.Io = true

Conclusion

In a talk cited by HighScalability.com, Robert Graham discussed the C10M problem. He explained why the kernel could be more of a problem than a solution for achieving high scalability and suggested a number of principles for building scalable systems.

Looking at slabtop in the screenshot of the Network Utilization subsection above, we observe that the Linux kernel used around 32 GB to keep the 10 million concurrent socket connections open. For usual systems, 3.2 KB per socket connection could seem quite reasonable. However, with the explosion of Internet devices (mobiles and Internet of Things), we see systems requiring millions of concurrent connections more and more frequently. We therefore echo Robert’s concern about the kernel and think that Linux might, for example, provide better memory usage for handling socket connections.

Moreover, many of the principles discussed by Robert can be found in our approach presented above: do as much as possible outside the kernel, use an efficient thread model to scale across all processors, use huge pages to optimize CPU caches, and distribute interrupts across all processors.

That said, in this post we demonstrated that solving the C10M problem is feasible with MigratoryData Server using a commodity server and an off-the-shelf Linux distribution. Also, given the millions of end users that our customers have, who receive real-time data daily with MigratoryData Server running on Linux, the time for easily building highly scalable real-time Internet services is now, with existing ingredients: Linux operating system, MigratoryData’s real-time Web server, and MigratoryData’s API with libraries for almost any Internet technology (Web, Mobile, Desktop, Server, Internet of Things).

To learn more about MigratoryData Server and how it can help your business achieve effective high scalability, please visit migratorydata.com.

Massive scalability is the biggest challenge we undertake at MigratoryData, a provider of an enterprise publish-subscribe messaging system for building very scalable real-time web and mobile applications. We recently published a blog post demonstrating 12 million concurrent connections with MigratoryData WebSocket Server running on a single 1U server. I am going to share some lessons learned while pushing the boundaries of scalability with MigratoryData WebSocket Server.

Objective

The goal of our benchmark is to demonstrate the vertical scalability of the MigratoryData server using commodity hardware usually found in data centers. The server used for the benchmark was a Dell PowerEdge R610 (see the specs in the initial blog post) running CentOS 6.4 out of the box — with no Linux kernel recompilation.

Opening 12 Million Sockets

Let’s discuss the server side and the client side separately.

Server-Side Considerations

Server Port Numbers

A common misunderstanding is that a server cannot accept more than 65,536 (2^16) TCP sockets because TCP ports are 16-bit integer numbers.

First, the number of ports is limited to 65,536, but this limitation applies only to a single IP address. Even supposing that the number of ports prevented us from having more than 65,536 clients, adding more IP addresses to the server machine (either by adding new network cards, or simply by using IP aliasing for the existing network card) would solve the problem (even if opening 12 million client connections would then need 184 network cards or IP aliases on the server machine).

In fact, the misunderstanding arises because the server does not distinguish among sockets by pairing its listening IP address with a different ephemeral port for each new socket. Instead, it uses the same listening IP address and the same listening port for all sockets, and distinguishes among them by the IP address and the ephemeral port of each client. Therefore, MigratoryData Server uses a single port to accept any number of clients, and optionally a few more ports for JMX monitoring, HTTP monitoring, etc.
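This behaviour is easy to observe with plain TCP sockets. The sketch below is not MigratoryData code; it opens a listening socket on the loopback interface, connects a few clients to it, and returns the local and remote endpoints of every accepted connection. The function name is hypothetical.

```python
import socket


def demo_shared_listening_port(client_count=3):
    """Accept several clients on one port; return (listen_addr, [(local, peer), ...])."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free listening port
    server.listen(client_count)
    listen_addr = server.getsockname()

    clients, accepted, endpoints = [], [], []
    try:
        for _ in range(client_count):
            client = socket.create_connection(listen_addr)
            connection, peer = server.accept()
            clients.append(client)
            accepted.append(connection)
            # Local side of the accepted socket versus the client's address and port.
            endpoints.append((connection.getsockname(), peer))
    finally:
        for sock in clients + accepted + [server]:
            sock.close()
    return listen_addr, endpoints


if __name__ == "__main__":
    listen_addr, endpoints = demo_shared_listening_port()
    print("listening on", listen_addr)
    for local, peer in endpoints:
        print("server side", local, "<- client", peer)
```

Every accepted socket reports the same local address and port as the listening socket; only the client side of the pair differs.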

Server Socket Descriptors

While the MigratoryData server uses a single port to accept any number of clients, it uses a different socket descriptor for each client. So, to open 12 million sockets, the process of the MigratoryData server should be able to use 12 million socket descriptors. Increasing the maximum number of socket descriptors per process is possible using the command ulimit. Consequently, we increased this limit to about 20 million socket descriptors as follows:

ulimit -n 20000500

Because one cannot increase the maximum number of socket descriptors per process to a value larger than the current kernel maximum (fs.nr_open), and because the kernel maximum defaults to 1048576 (1024^2), prior to running the ulimit command we increased the kernel maximum accordingly as follows:

echo 20000500 > /proc/sys/fs/nr_open
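Before starting a process that needs millions of descriptors, it can be useful to verify that both limits are in effect. The sketch below reads the per-process limit with Python's standard resource module and the kernel maximum from /proc/sys/fs/nr_open. The function names and the reserve of 500 descriptors for files other than client sockets are assumptions made for illustration.

```python
import resource


def descriptor_limits():
    """Return (soft, hard, nr_open); nr_open is None if /proc is unavailable."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        with open("/proc/sys/fs/nr_open") as handle:
            nr_open = int(handle.read())
    except OSError:
        nr_open = None
    return soft, hard, nr_open


def can_open(sockets, soft_limit, reserve=500):
    """True if `soft_limit` leaves room for `sockets` plus a small reserve.

    `reserve` (hypothetical value) covers log files, listening sockets, etc.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= sockets + reserve


if __name__ == "__main__":
    soft, hard, nr_open = descriptor_limits()
    print(f"soft={soft} hard={hard} fs.nr_open={nr_open}")
    print("enough for 12 million sockets:", can_open(12_000_000, soft))
```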

Client-Side Considerations

We developed a tool named MigratoryData Client Benchmark able to open a configurable number of connections to the MigratoryData server. MigratoryData Client Benchmark is also able to subscribe to a configurable number of subjects for each client connection (where each subject is randomly selected from a configurable set of subjects) and compute various statistics for the received messages.

We used ten servers Dell PowerEdge SC1435 having 16 GB RAM and 2 dual-core AMD Opteron CPU @2.0 GHz to run ten instances of MigratoryData Client Benchmark. Hence, from each client machine we opened 1.2 million sockets.

Client Port Numbers

Each socket connection uses a new ephemeral port number on the client machine. Therefore, we extended the range of the ephemeral ports to run from 500 up to the highest valid port number, 65,535, keeping only the ports below 500 reserved for the operating system:

sysctl -w net.ipv4.ip_local_port_range="500 65535"

Using this extension of the ephemeral port range, one can open up to 65,036 sockets from a client machine for each IP address of the client machine.

In order to be able to open 1.2 million sockets from a client machine, we had to create 19 IP aliases for the network interface of the client machine. In this way, about 65,000 sockets are opened to the MigratoryData server from each of the 19 IP addresses.

Note — For each socket, MigratoryData Client Benchmark uses precisely one of the 19 IP addresses and a successive port starting with the port 500 and ending with the port 65,535, thus avoiding the random allocation of the ephemeral port numbers.
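The number of IP aliases follows directly from the size of the port range. The sketch below computes it for the inclusive range 500 to 65535 and also shows one way to enumerate source endpoints sequentially, as the note above describes. The alias addresses in the example are hypothetical, and the benchmark tool's real allocation code is not shown in this post.

```python
import math

PORT_LOW, PORT_HIGH = 500, 65535  # net.ipv4.ip_local_port_range used above


def ports_per_address(low=PORT_LOW, high=PORT_HIGH):
    """Number of source ports available per IP address (range is inclusive)."""
    return high - low + 1


def aliases_needed(connections, low=PORT_LOW, high=PORT_HIGH):
    """Minimum number of source IP addresses for `connections` to one server port."""
    return math.ceil(connections / ports_per_address(low, high))


def source_endpoints(addresses, low=PORT_LOW, high=PORT_HIGH):
    """Yield (address, port) pairs in order: all ports of the first address, then the next."""
    for address in addresses:
        for port in range(low, high + 1):
            yield address, port


if __name__ == "__main__":
    print("ports per address:", ports_per_address())
    print("aliases for 1.2 million sockets:", aliases_needed(1_200_000))
    # Hypothetical alias addresses, for illustration only.
    aliases = [f"10.0.0.{n}" for n in range(1, 20)]
    endpoints = source_endpoints(aliases)
    print(next(endpoints), next(endpoints))
```

A client would bind each socket to the next pair before connecting, which avoids the kernel's random choice of ephemeral port.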

Client Socket Descriptors

Each socket connection uses a socket descriptor on the client machine. Thus, to open 1.2 million sockets from a client machine, MigratoryData Client Benchmark should be able to use 1.2 million socket descriptors. As discussed in the section “Server Socket Descriptors” above, on each client machine we had to increase the maximum number of socket descriptors per process as follows:

echo 3000000 > /proc/sys/fs/nr_open
ulimit -n 2000000

Other Client Tuning Tips

To avoid delays between successive benchmark test rounds, we configured the kernel to quickly recycle the sockets in state TIME_WAIT as follows:

echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

Another problem that was difficult to debug occurred when the kernel maximum number of memory pages allocated to TCP was reached. When this happened, MigratoryData Client Benchmark remained up, but the client machine did not accept any new TCP connections. We figured out that we had to increase the maximum number of memory pages allocated to TCP from 767586 (about 3 GB) to 2303190 (about 8.7 GB) as follows:

sysctl -w net.ipv4.tcp_mem="383865 511820 2303190"
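The three values of net.ipv4.tcp_mem are expressed in memory pages, not bytes, which is easy to overlook. The small sketch below converts them, assuming the usual 4096-byte page size; the function names are hypothetical.

```python
PAGE_SIZE = 4096  # bytes per memory page (assumed; the usual x86-64 default)


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a number of memory pages to GiB."""
    return pages * page_size / 2**30


def parse_tcp_mem(value):
    """Split a tcp_mem setting into its (low, pressure, high) thresholds, in GiB."""
    low, pressure, high = (int(field) for field in value.split())
    return pages_to_gib(low), pages_to_gib(pressure), pages_to_gib(high)


if __name__ == "__main__":
    print(f"old maximum: {pages_to_gib(767586):.2f} GiB")
    low, pressure, high = parse_tcp_mem("383865 511820 2303190")
    print(f"new thresholds: low={low:.2f} pressure={pressure:.2f} high={high:.2f} GiB")
```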

Linux Kernel Tuning

A number of kernel tuning suggestions, mainly in terms of ports and socket descriptors, were already provided above. Now, we focus on kernel tuning suggestions for memory and performance optimization.

Kernel Memory Tuning

We used the Linux kernel version 3.9.4-1.el6.elrepo.x86_64 which consumed about 36 GB of kernel memory for the 12 million open sockets as shown by slabtop in one of the screenshots below.

Note: We started the benchmark tests using the default kernel 2.6.x of CentOS 6.4. In version 2.6.x and other older kernel versions (until kernel version 3.7?), the kernel uses, besides the 36 GB of memory for 12 million open sockets, another memory page (i.e. 4096 bytes) for each open socket. Even worse, it looks like there is a bug in the Linux kernel because this memory page used per open socket is not reported in /proc/meminfo. This is usually not a big deal, but with lots of sockets one can observe a lot of memory vanishing in an inexplicable way. Therefore, a recent Linux kernel such as 3.9 reduces the kernel memory usage by about 46 GB for 12 million sockets.

We used the following tuning related to the socket buffer sizes for our benchmark scenario:

Balancing Hardware Interrupts

The server machine used a 10 Gbps network card Intel X520-DA2 having 24 tx/rx queues. We assigned each tx/rx queue to a different CPU core using smp_affinity. First, we identified the interrupts of our network interface named p1p1 as follows:

Benchmark Configuration

Please refer to the sections “Hardware and Setup” and “The Benchmark Scenario” in the initial blog post for precise details about the benchmark scenario and setup used by the involved components: MigratoryData Server, MigratoryData Client Benchmark, and MigratoryData Publisher Benchmark. Here we detail their configurations.

The parameter BenchmarkMode is set to true to handle the clients in a realistic way. In order to take advantage of multiprocessor servers, the incoming users are separated into parallel internal groups. Each client is assigned to a group based on its IP address. Because in our benchmark a lot of clients come from the same IP address, we use the parameter BenchmarkMode to distribute them to different internal groups of the MigratoryData server, just as it would happen in production.

JVM Tuning

Tuning the JVM and especially the Garbage Collector related parameters is outside the scope of this blog post. We recommend the book Java Performance on JVM tuning. In a screenshot below we provide all the JVM parameters used in the benchmark.

The JVM parameter UseCompressedOops compresses the 64-bit pointers and offers non-negligible memory optimization. The recent versions of JVM enable this parameter by default for the JVM heap sizes smaller than about 30 GB. But UseCompressedOops can be used only for a JVM heap size of maximum 32 GB. The JVM parameter ObjectAlignmentInBytes extends the compression to the JVM heap sizes larger than 32 GB. Using the parameters ObjectAlignmentInBytes and UseCompressedOops allowed us to benefit from pointer compression for the entire memory allocated to the MigratoryData server without the need to run two instances of the MigratoryData server each one using less than 32 GB. Therefore, we included the following JVM parameters:

-XX:ObjectAlignmentInBytes=16 -XX:+UseCompressedOops
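The 32 GB figure and its extension follow from simple arithmetic: a compressed pointer is a 32-bit value counting aligned object slots, so the addressable heap is 2^32 times the object alignment. The sketch below works this out. The function names are hypothetical, and in practice the usable threshold is slightly below the theoretical value.

```python
def max_compressed_heap_gib(alignment_bytes):
    """Theoretical largest heap addressable by 32-bit compressed pointers, in GiB."""
    return 2**32 * alignment_bytes // 2**30


def min_alignment_for_heap(heap_gib):
    """Smallest power-of-two object alignment (8 or more) that covers `heap_gib`."""
    alignment = 8  # default object alignment on 64-bit HotSpot
    while max_compressed_heap_gib(alignment) < heap_gib:
        alignment *= 2
    return alignment


if __name__ == "__main__":
    for alignment in (8, 16):
        print(f"alignment {alignment} bytes -> up to {max_compressed_heap_gib(alignment)} GiB")
    print("alignment needed for a 54 GiB heap:", min_alignment_for_heap(54))
```

The trade-off is that a larger alignment pads every object to a multiple of 16 bytes, so part of the memory saved by the smaller pointers is spent on padding.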

MigratoryData Client Benchmark

We used 10 instances of MigratoryData Client Benchmark running on 10 client machines Dell SC1435 as described above. Each instance of MigratoryData Client Benchmark opened 1.2 million connections to the MigratoryData server. Along each connection MigratoryData Client Benchmark subscribed to a different subject. Hence, we had 12 million concurrent clients subscribing to 12 million concurrent subjects.

For example, the 1st of the 10 instances of MigratoryData Client Benchmark used the following configuration:

The 2nd instance of the MigratoryData Client Benchmark has been configured to open another 1.2 million concurrent connections and subscribed to the subjects from /p/s1200001/- to /p/s2400000/-

The 3rd instance of the MigratoryData Client Benchmark has been configured to open another 1.2 million concurrent connections and subscribed to the subjects from /p/s2400001/- to /p/s3600000/-

…

The 10th instance of the MigratoryData Client Benchmark has been configured to open another 1.2 million concurrent connections and subscribed to the subjects from /p/s10800001/- to /p/s12000000/-

Each instance of MigratoryData Client Benchmark displays every 5 seconds various information such as: the number of seconds since the start time, the minimum, maximum, mean, standard deviation for all messages received, the total number of messages received, and the frequency of the messages received during the last 5 seconds, as in the example below:

MigratoryData Publisher Benchmark

In our benchmark scenario, there are 200,000 messages per second published to the MigratoryData server. The payload of each message is a 512-byte string of random characters. We have developed a tool named MigratoryData Publisher Benchmark which is able to publish messages of a configurable size at a configurable frequency, for subjects selected randomly from a configurable set of subjects.

We ran 8 instances of MigratoryData Publisher Benchmark on 4 machines Dell SC1435 having the same specs as the client machines, two instances per machine. To achieve 200,000 messages per second, each instance published 25,000 messages per second.

For example, the 1st of the 8 instances of MigratoryData Publisher Benchmark used the following configuration:

Memory = 7700 # Allocate 7.5 GB to the JVM
ServerAddresses = 192.168.3.115:8800 # The address of the MigratoryData server
PublisherType = p # The first prefix of the subjects
Subjects = /s{1..1500000}/- # Publish messages for /p/s1/-, /p/s2/-, ..., /p/s1500000/-
Frequency = 25000 # Number of messages per second
MessageSize = 512 # The number of bytes per message

In the same way:

The 2nd instance of the MigratoryData Publisher Benchmark has been configured to publish 25,000 messages per second where the subject of each message is randomly selected from the subjects between /p/s1500001/- and /p/s3000000/-

The 3rd instance of the MigratoryData Publisher Benchmark has been configured to publish 25,000 messages per second where the subject of each message is randomly selected from the subjects between /p/s3000001/- and /p/s4500000/-

…

The 8th instance of the MigratoryData Publisher Benchmark has been configured to publish 25,000 messages per second where the subject of each message is randomly selected from the subjects between /p/s10500001/- and /p/s12000000/-
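Both the client instances and the publisher instances split the 12 million subjects into equal, contiguous, non-overlapping ranges. The sketch below computes such a split for any number of instances; the helper names are hypothetical and are not part of the benchmark tools.

```python
def subject_ranges(total_subjects, instances):
    """Split subjects 1..total_subjects into equal contiguous (first, last) ranges."""
    if total_subjects % instances:
        raise ValueError("total_subjects must be divisible by instances")
    size = total_subjects // instances
    return [(i * size + 1, (i + 1) * size) for i in range(instances)]


def subject_name(index, prefix="p"):
    """Subject naming scheme used in this benchmark, e.g. /p/s1/-."""
    return f"/{prefix}/s{index}/-"


if __name__ == "__main__":
    for label, instances in (("client", 10), ("publisher", 8)):
        for number, (first, last) in enumerate(subject_ranges(12_000_000, instances), 1):
            print(f"{label} {number}: {subject_name(first)} to {subject_name(last)}")
```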

Latency

Latency is defined here as the time needed for a message to propagate from the publisher to the client, via the MigratoryData server. In other words, the latency of a message is the difference between the time at which the message is sent by MigratoryData Publisher Benchmark to the MigratoryData server and the time at which the message is received by MigratoryData Client Benchmark from the MigratoryData server. All the client machines and publisher machines have the clocks synchronized using ntp (a few milliseconds difference might exist between the clocks, so you can observe a minimum latency of -1 millisecond due to such small clock differences).

We calculate the following latency statistics: maximum, mean, and standard deviation.

In order to calculate the latency statistics, one can use all latency values or use a reasonable sample size. The sampling approach provides statistically accurate results but it should be typically used when it is not possible to obtain all the values (e.g. election polling). Another disadvantage is that sampling cannot be used to compute the absolute maximum latency.

Because for certain systems having a predictable latency, under a certain maximum, is more valuable than having low latency values on average, we chose to calculate the latency statistics for all values, thereby also obtaining the true maximum latency. While it might appear that computing the latency statistics for all values would add a certain performance overhead, in fact one need not store the latency values, because the statistics can be calculated on-the-fly for all messages as follows:

Note: We have ten instances of MigratoryData Client Benchmark and each instance computes the latency statistics only for the messages it processes. However, the latency values of each instance of MigratoryData Client Benchmark represent an entire fraction of the population (10%), not only a typical sample. Also, as shown in the screenshot below, all statistics are practically identical for all ten instances of MigratoryData Client Benchmark.

Bandwidth

The Linux tools available for bandwidth calculation usually attempt to compute the bandwidth by capturing TCP packets. This approach does not scale for high data throughput and the bandwidth values produced are usually not accurate.

An accurate method for bandwidth calculation is to use the Receive Bytes / Packets and Transmit Bytes / Packets written by the Linux kernel to /proc/net/dev for the network interface used by the MigratoryData server. We compute the bandwidth by reading /proc/net/dev every 10 seconds. You can see the bandwidth transmitted by the MigratoryData server in one of the screenshots below.
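A minimal sketch of this method is shown below. It parses two snapshots of /proc/net/dev and derives the transmit rate from the difference in the interface's transmitted-bytes counter. The sample snapshots contain made-up counter values for illustration, and the function names are hypothetical.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from the content of /proc/net/dev."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def transmit_rate(before, after, interface, interval_seconds):
    """Average transmitted bytes per second between two snapshots."""
    tx_before = parse_net_dev(before)[interface][1]
    tx_after = parse_net_dev(after)[interface][1]
    return (tx_after - tx_before) / interval_seconds


HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast|"
    "bytes    packets errs drop fifo colls carrier compressed\n"
)
# Made-up counters for the p1p1 interface, taken 10 seconds apart.
SAMPLE_BEFORE = HEADER + "  p1p1: 5000 50 0 0 0 0 0 0 2000000000 9000 0 0 0 0 0 0\n"
SAMPLE_AFTER = HEADER + "  p1p1: 6000 60 0 0 0 0 0 0 3030000000 9900 0 0 0 0 0 0\n"

if __name__ == "__main__":
    rate = transmit_rate(SAMPLE_BEFORE, SAMPLE_AFTER, "p1p1", 10)
    print(f"{rate / 1e6:.1f} million bytes per second")
```

On a live system the two snapshots would come from reading /proc/net/dev, sleeping for 10 seconds, and reading it again.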

Results

These are the screenshots with the results obtained which have been detailed above.

Conclusion

In this post we presented the obstacles faced when scaling to 12 million concurrent connections in terms of kernel tuning, memory tuning, tools, and benchmarking methodology. More insights into how MigratoryData WebSocket Server itself is architected to achieve this massive vertical scalability might be the subject of a new post.

We have recently completed a new performance benchmark which demonstrates that MigratoryData WebSocket Server is able to handle 12 million concurrent users from a single Dell PowerEdge R610 server while pushing a substantial amount of live data (1.015 gigabits per second). This benchmark scenario shows that MigratoryData WebSocket Server is ideal for infrastructures delivering real-time data to a huge number of users, especially for mobile push notification infrastructures that are typically demanded by telecom customers with tens of millions of users.

Benchmark Results

In this benchmark scenario, MigratoryData scales up to 12 million concurrent users from a single Dell PowerEdge R610 server while pushing up to 1.015 Gbps live data (each user receives a 512-byte message every minute). The CPU utilization diagram below shows that MigratoryData WebSocket Server scales linearly with the hardware.

According to the chart above, MigratoryData uses only 57% CPU to handle 12 million users. The remaining 43% CPU could be used to scale even more. However, we are limited by the RAM available on this machine (we use CentOS 6.4 with a standard Linux kernel, and the Linux kernel memory footprint alone for 12 million sockets is about 36 GB).

Detailed Results of MigratoryData WebSocket Server Running on a Single Dell R610 Server

In the table below, it is important to note that we obtained the results using the default configuration of MigratoryData WebSocket Server, a fresh installation of CentOS 6.4 (without any kernel source code modification or other special tuning), and the standard network configuration (employing the default MTU 1500, etc).

| Number of concurrent client connections | 3,000,000 | 6,000,000 | 9,000,000 | 12,000,000 |
| --- | --- | --- | --- | --- |
| Number of messages per minute to each client | 1 | 1 | 1 | 1 |
| Total Messages Throughput (messages per second) | 50,000 | 100,000 | 150,000 | 200,000 |
| Average Latency (milliseconds) | 5 | 35 | 92 | 268 |
| Standard Deviation for Latency (milliseconds) | 36 | 123 | 263 | 424 |
| Maximum Latency (milliseconds) | 640 | 951 | 1292 | 2024 |
| Network Utilization | 0.254 Gbps | 0.507 Gbps | 0.767 Gbps | 1.015 Gbps |
| CPU Utilization (average) | 14% | 24% | 39% | 57% |
| RAM Memory Allocated to the Java JVM | 54 GB | 54 GB | 54 GB | 54 GB |

Latency is defined here as the time needed for a message to propagate from the publisher to the client, via the MigratoryData server. In other words, the latency of a message is the difference between the time at which the message is sent by the benchmark publisher to the MigratoryData server and the time at which the message is received by the benchmark client from the MigratoryData server.

Hardware & Setup

MigratoryData Websocket Server version 4.0.7 ran on a single Dell PowerEdge R610 server as follows:

Four servers Dell SC1435 were used to run up to four instances of the benchmark publisher. For example, to publish 100,000 messages per second, we used four instances of the benchmark publisher on the four servers Dell SC1435, each publisher sending 25,000 messages per second.

Ten servers Dell SC1435 were used to run up to ten instances of the benchmark client. For example, to open 12,000,000 concurrent connections, we used ten instances of the benchmark client on the ten servers Dell SC1435, each client opening 1,200,000 concurrent connections.

The server Dell PowerEdge R610 (used to run a single instance of MigratoryData Server) and the 14 servers Dell PowerEdge SC1435 (used to run benchmark clients and benchmark publishers) were connected via a gigabit switch Dell PowerConnect 6224 enhanced with a 2-port 10 Gbps module.

The Benchmark Scenario

Each client subscribes to a single different subject; for example, to achieve 12 million concurrent users, we used 12 million subjects.

Each client receives a message every minute; for example, to push a message per minute to 12 million concurrent users, the publishers sent 200,000 messages per second in total (the subject of each message was chosen randomly from the total of 12 million subjects).

The payload of each message is a 512-byte string (consisting of 512 random alphanumeric characters).
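The throughput figures in the results table follow from this scenario. The short sketch below derives the required publish rate from the number of clients and the per-client message frequency, and the bandwidth taken by the payloads alone; the function names are hypothetical.

```python
def publish_rate(clients, messages_per_client_per_minute=1):
    """Messages per second the publishers must send in total."""
    return clients * messages_per_client_per_minute / 60


def payload_gbps(messages_per_second, payload_bytes=512):
    """Bandwidth of the payloads alone, in gigabits per second (no protocol overhead)."""
    return messages_per_second * payload_bytes * 8 / 1e9


if __name__ == "__main__":
    for clients in (3_000_000, 6_000_000, 9_000_000, 12_000_000):
        rate = publish_rate(clients)
        print(f"{clients:>10} clients: {rate:>9.0f} msg/s, payload {payload_gbps(rate):.3f} Gbps")
```

The measured network utilization is higher than the payload bandwidth because it also includes the protocol and TCP/IP overhead of each message.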

Methodology

We performed 4 benchmark tests corresponding to the 4 results summarized above, in order to simulate 3,000,000 / 6,000,000 / 9,000,000 / 12,000,000 concurrent users from a single instance of MigratoryData WebSocket Server.

The clock of the Dell R610 server (used to run MigratoryData Server) and the clocks of the 14 Dell SC1435 servers (used to run benchmark clients and benchmark publishers) were synchronized via ntpd. The latency was measured for all messages, not only for a sample. We measured the mean latency, maximum latency, and standard deviation of the latency over 10 minutes, and the results are reported above. We also ran the most demanding scenario, with 12 million concurrent connections, for 6 hours and observed that MigratoryData WebSocket Server remained perfectly stable.

Linear Horizontal Scalability

MigratoryData WebSocket Server and its APIs offer the possibility to build a high-availability cluster.

Each instance of MigratoryData WebSocket Server in the cluster runs independently from the other cluster members. It exchanges only negligible coordination information or, depending on the clustering type you configure, does not exchange any information at all with the other cluster members. Therefore, MigratoryData WebSocket Server offers linear horizontal scalability.

One can deploy a high-availability cluster of MigratoryData servers to achieve any number of concurrent users. For example, using the linear horizontal scalability of MigratoryData WebSocket Server and the 12 million vertical scalability demonstrated here, one could achieve say 60 million connections using a cluster with 5 instances of MigratoryData WebSocket Server running on 5 Dell PowerEdge R610 servers.

Note: Even if MigratoryData WebSocket Server comes with linear horizontal scalability, in a production deployment, one also needs to consider the situation when a cluster member might go down. If this were to occur, the users of the server which goes down will automatically be reconnected by the MigratoryData API to the other cluster members. Thus, the other cluster members would support the load introduced by the member which fails.

The implication is that, for the example above, a production deployment should have at least 7-8 servers to achieve 60 million concurrent users so that, if a failure were to occur, each server has enough reserve to accept part of the users of the cluster member that failed.
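The reserve argument above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the users of a failed member are spread evenly over the surviving members, which the text does not guarantee, and the function name is ours.

```python
def load_per_server(total_users: int, servers: int, failed: int = 0) -> float:
    """Average users per surviving server, assuming an even spread (an assumption)."""
    surviving = servers - failed
    if surviving <= 0:
        raise ValueError("at least one server must survive")
    return total_users / surviving


TOTAL_USERS = 60_000_000          # target from the example above
CAPACITY_PER_SERVER = 12_000_000  # connections per server shown in this benchmark

for servers in (5, 6, 7, 8):
    after_failure = load_per_server(TOTAL_USERS, servers, failed=1)
    fits = after_failure <= CAPACITY_PER_SERVER
    print(f"{servers} servers: {after_failure:,.0f} users per server "
          f"after one failure, within capacity: {fits}")
```

With 7 servers, each surviving member would carry 10 million users after one failure, which stays below the 12 million connections demonstrated here.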

Conclusion

In 2010, we achieved 1 million concurrent connections on a single 1U server. While handling 1 million concurrent connections on a small server remains a challenge for the WebSocket server industry, we show here that MigratoryData’s WebSocket Server scales an order of magnitude higher and achieves 12 million concurrent connections on a single 1U server.

This benchmark shows that MigratoryData achieves 8X higher scalability than the record obtained by the competition in the same benchmark category, reaffirming that it is the most scalable WebSocket server. The result also demonstrates that, using MigratoryData WebSocket Server, it is feasible and affordable to build real-time web applications that deliver high volumes of real-time information to a high number of concurrent users.

Benchmark Results

In this benchmark scenario, MigratoryData scales up to 192,000 concurrent users (delivering 8.8 Gbps throughput) from a single Dell R610 1U server and achieves 8X higher scalability than the record obtained by the competition (who used a more recent Dell 1U server with similar specifications). Moreover, MigratoryData achieves lower bandwidth utilization and lower latency, as shown in the diagram and table below.

Latency is defined here as the time needed for a message to propagate from the publisher to the client, via the MigratoryData server. Thus, the latency of a message is the difference between the time at which the message is sent by the publisher to the MigratoryData server and the time at which the message is received by the client from the MigratoryData server as detailed in the following diagram:

Detailed Results of MigratoryData WebSocket Server Running on a Single Instance of a Dell R610 1U Server

In the table below, it is important to note that we obtained the results using the default configuration of MigratoryData WebSocket Server, a fresh installation of CentOS Linux 6.4 (without any kernel recompilation or other special tuning), and the standard network configuration (default MTU of 1500, default kernel buffer sizes, etc.).

| Number of client connections | 24,000 | 48,000 | 72,000 | 96,000 | 120,000 | 144,000 | 168,000 | 192,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of messages per second to each client | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| Total Messages Throughput | 240,000 | 480,000 | 720,000 | 960,000 | 1,200,000 | 1,440,000 | 1,680,000 | 1,920,000 |
| Average Latency (milliseconds) | 2.35 | 3.09 | 5.76 | 39.95 | 83.23 | 139.46 | 225.87 | 597.27 |
| Standard Deviation for Latency (milliseconds) | 3.74 | 3.79 | 6.73 | 20.80 | 39.36 | 65.12 | 106.00 | 269.74 |
| Maximum Latency (milliseconds) | 49 | 54 | 88 | 168 | 291 | 391 | 760 | 1732 |
| Network Utilization | 1.21 Gbps | 2.39 Gbps | 3.59 Gbps | 4.65 Gbps | 5.75 Gbps | 6.79 Gbps | 7.87 Gbps | 8.88 Gbps |
| CPU Utilization | 25% | 49% | 72% | 82% | 88% | 90% | 92% | 96% |
| RAM Memory Allocated to the Java JVM | 2.5 GB | 26 GB | 26 GB | 26 GB | 26 GB | 26 GB | 30 GB | 48 GB |

Note: As RAM is inexpensive, we did not tune the memory configuration and used a reasonable value for each benchmark test.
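Two rows of the table above can be cross-checked against the others. The following sketch recomputes the total message throughput from the number of connections and derives a rough number of bytes on the wire per delivered message. It assumes that the network utilization figure is the server's outbound traffic and that 1 Gbps means 10^9 bits per second; neither is stated in the post, and the function names are ours.

```python
# Rows copied from the results table above.
connections = [24_000, 48_000, 72_000, 96_000, 120_000, 144_000, 168_000, 192_000]
messages_per_client = 10  # messages per second delivered to each client
network_gbps = [1.21, 2.39, 3.59, 4.65, 5.75, 6.79, 7.87, 8.88]


def total_throughput(clients: int, per_client_rate: int) -> int:
    """Messages per second delivered by the server to all clients."""
    return clients * per_client_rate


def bytes_per_message(gbps: float, msgs_per_second: int) -> float:
    """Approximate wire bytes per message, assuming 1 Gbps = 10**9 bits/s."""
    return gbps * 1e9 / 8 / msgs_per_second


for clients, gbps in zip(connections, network_gbps):
    rate = total_throughput(clients, messages_per_client)
    print(f"{clients:>7,} clients: {rate:>9,} msg/s, "
          f"~{bytes_per_message(gbps, rate):.0f} bytes per message")
```

Under those assumptions, the per-message figure sits somewhat above the 512-byte payload, which is what one would expect once protocol overhead is included.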

Hardware & Setup

MigratoryData WebSocket Server version 4.0.3 ran on a single Dell PowerEdge R610 server as follows:

The Benchmark Publisher and the Benchmark Client instances ran on 14 identical Dell PowerEdge SC1435 servers. The Dell R610 server (running MigratoryData WebSocket Server) and the 14 Dell SC1435 servers (running the Benchmark Clients and the Benchmark Publisher) were connected via two gigabit switches: a Dell PowerConnect 5424 and a Dell PowerConnect 6224 (enhanced with a 2-port 10 Gbps module), as detailed in the diagram below:

The total number of concurrent client connections for each benchmark test is achieved using 13 of the 14 Dell SC1435 servers. One instance of the Benchmark Client runs on each of these 13 servers. Thus, one simulates 1/13 of the total concurrent client connections from each of these 13 servers.
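None of the connection totals used in these tests is an exact multiple of 13, so the split cannot be perfectly equal. The post does not say how the remainder was handled; the sketch below shows one plausible way to do it, giving one extra connection to the first few instances. The function name is hypothetical.

```python
def split_connections(total: int, instances: int = 13) -> list[int]:
    """Spread `total` connections over `instances` as evenly as possible."""
    base, extra = divmod(total, instances)
    # The first `extra` instances take one additional connection each.
    return [base + 1 if i < extra else base for i in range(instances)]


shares = split_connections(192_000)
print(shares, sum(shares))
```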

The Benchmark Scenario

The subject of each message is randomly selected from the 100 subjects; thus, each subject is updated 10 times a second.

Each client subscribes to a single subject randomly selected from the 100 subjects; thus, each client receives 10 messages per second.

The payload of each message is a 512-byte string (consisting of 512 random alphanumeric characters).
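To make the scenario concrete, here is a minimal sketch of how one benchmark message could be generated. It is not the code of the Benchmark Publisher: the subject naming scheme and the function name are assumptions made for illustration. Note also that 100 subjects updated 10 times a second each implies roughly 1,000 published messages per second in total, a figure the scenario does not state directly.

```python
import random
import string

SUBJECT_COUNT = 100   # number of subjects in the scenario
PAYLOAD_SIZE = 512    # payload length in characters (one byte each in ASCII)
ALPHANUMERIC = string.ascii_letters + string.digits


def make_message(rng: random.Random) -> tuple[str, str]:
    """Return a (subject, payload) pair for one benchmark message."""
    # Hypothetical subject naming; the real subject names are not given in the post.
    subject = f"/benchmark/subject-{rng.randrange(SUBJECT_COUNT)}"
    payload = "".join(rng.choices(ALPHANUMERIC, k=PAYLOAD_SIZE))
    return subject, payload


rng = random.Random(42)  # seeded only to make the sketch reproducible
subject, payload = make_message(rng)
print(subject, len(payload))
```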

Methodology

We performed 8 benchmark tests corresponding to the 8 results summarized above, in order to simulate 24,000 / 48,000 / 72,000 / 96,000 / 120,000 / 144,000 / 168,000 / 192,000 concurrent users from a single instance of MigratoryData WebSocket Server, using 13 instances of the Benchmark Client.

For the duration of each test, we ran a 14th instance of the Benchmark Client on the same machine that ran the instance of the Benchmark Publisher. The 14th instance of the Benchmark Client was used to measure latency results. It simulated an additional 30 users on top of the total number of simulated users, ran for 600 seconds, and computed the average, standard deviation, and maximum statistics of the latency of the received messages.

Note: Because the 14th instance of the Benchmark Client ran on the same machine as the instance of the Benchmark Publisher, there was no need for clock synchronization. Thus, the latency results are perfectly accurate as far as time synchronization is concerned.

Moreover, the sample size for each test is 180,000 messages (600 seconds x 10 messages per second x 30 concurrent client connections). Thus, it is large enough for the latency results to be statistically accurate.
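The latency statistics and the sample size described above can be sketched as follows. This is an illustration rather than the Benchmark Client's actual code: the function names are ours, timestamps are assumed to be in milliseconds, and we use the population standard deviation because the post does not say which variant was computed.

```python
import statistics


def latency_ms(sent_at_ms: float, received_at_ms: float) -> float:
    """Latency of one message: receive time minus publish time."""
    return received_at_ms - sent_at_ms


def latency_stats(latencies_ms: list[float]) -> dict[str, float]:
    """Average, standard deviation, and maximum of a list of latencies."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "stdev": statistics.pstdev(latencies_ms),  # population variant (assumption)
        "max": max(latencies_ms),
    }


def sample_size(duration_s: int, msgs_per_second: int, connections: int) -> int:
    """Number of latency samples collected during one test."""
    return duration_s * msgs_per_second * connections


print(sample_size(600, 10, 30))

# Made-up (sent, received) timestamps, only to show the call.
timestamps = [(0.0, 2.0), (100.0, 103.0), (200.0, 204.0)]
print(latency_stats([latency_ms(s, r) for s, r in timestamps]))
```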

Linear Horizontal Scalability

Not only does MigratoryData WebSocket Server offer horizontal scalability via its built-in clustering feature, it also offers linear horizontal scalability because each instance of MigratoryData WebSocket Server in the cluster runs independently from the other cluster members. It exchanges only negligible coordination information or, depending on the clustering type you configure, does not exchange any information at all with the other cluster members.

Therefore, if one wants to deliver real-time information to 1 million concurrent users in this benchmark scenario, then one can deploy 6 instances of MigratoryData WebSocket Server on 6 Dell R610 servers to deliver data to 1,152,000 concurrent users (i.e. 6 servers x 192,000 maximum concurrent connections, as demonstrated by this benchmark).
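The server count in this example follows from a simple ceiling division, sketched below. The function name is ours, and the calculation assumes each server is run at the maximum load demonstrated in this benchmark, with no reserve for failures.

```python
import math


def servers_needed(target_users: int, capacity_per_server: int) -> int:
    """Smallest number of servers whose combined capacity covers the target."""
    return math.ceil(target_users / capacity_per_server)


CAPACITY = 192_000  # maximum concurrent connections per server in this benchmark
count = servers_needed(1_000_000, CAPACITY)
print(count, count * CAPACITY)
```

For 1 million users this gives 6 servers with a combined capacity of 1,152,000 connections, matching the figures above.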

Note: Even though MigratoryData WebSocket Server comes with linear horizontal scalability, a production deployment also needs to account for the case where a cluster member goes down. If this were to occur, the users of the failed server would automatically be reconnected by the MigratoryData API to the other cluster members. Thus, the other cluster members would have to absorb the load of the member that failed.

The implication is that, for the example above, a production deployment should have at least 7-8 servers to achieve 1 million concurrent users so that, if a failure were to occur, each server has enough reserve to accept part of the users of the cluster member that failed.

Conclusion

This benchmark result reaffirms MigratoryData’s leadership in WebSocket server scalability.

Using MigratoryData’s high vertical scalability and linear horizontal scalability, one can build cost-effective real-time applications scalable to meet any growth in number of users and data volumes.