DataTorrent 1.0 Handles >1B Real-time Events/sec

DataTorrent is a real-time streaming and analytics platform that can process over 1B real-time events/sec.

Compared to Twitter’s average of ~6,000 tweets/sec, the recently released DataTorrent 1.0, with its claimed rate of over 1 billion real-time events/sec, seems to far exceed current needs. In tests on a 37-node cluster, each node with 256GB of RAM and 12 hyper-threaded CPU cores, DataTorrent claims to have achieved linear scalability up to 1.6B events/sec, at which point the CPUs reached saturation. Phu Hoang, DataTorrent co-founder and CEO, told InfoQ that their solution is “orders of magnitude” more performant than Apache Spark on the same hardware.

DataTorrent is a real-time, fault-tolerant data streaming and analysis platform built on Hadoop 2.x. It runs as a native Hadoop application and can coexist with applications performing other tasks, such as batch processing. The platform’s architecture is depicted in the following diagram:

StrAM (Streaming Application Master) is a native YARN Application Master responsible for managing the logical DAG (Directed Acyclic Graph) executed on top of a Hadoop cluster, handling resource allocation, partitioning, scalability, scheduling, web services, run-time changes, statistics, SLA enforcement, security, and more.

User applications sit at the upper level of the diagram as connected operators and/or application templates. Examples of operators are InputReceiver, which simulates receiving input data; Average, which calculates the average for a key in a specified dimension; RedisAverageOutput, which writes the calculated average to a Redis data store; and SmtpAvgOperator, which sends alert emails. These operators are part of the Malhar library, which includes over 400 such operators and is open-sourced on GitHub. Users can write additional operators as needed.
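To make the operator pattern concrete, here is a minimal, self-contained Java sketch of such a pipeline: a simulated input source feeding a per-key averaging stage. This is a hypothetical illustration of the pattern only; the class names echo the Malhar operators mentioned above, but the real Malhar APIs (ports, windows, DAG wiring) differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical, simplified illustration of the operator pattern:
// an input operator emits (key, value) events and an averaging
// operator maintains per-key running averages downstream.
public class OperatorPipelineSketch {

    // Simulates receiving input data (akin to an InputReceiver operator).
    static class InputReceiver {
        private final Random rnd = new Random(42);
        private final String[] keys = {"cpu", "mem", "disk"};

        String[] nextEvent() {
            String key = keys[rnd.nextInt(keys.length)];
            int value = rnd.nextInt(100);
            return new String[]{key, Integer.toString(value)};
        }
    }

    // Maintains a running average per key (akin to an Average operator).
    static class Average {
        private final Map<String, long[]> state = new HashMap<>(); // key -> {sum, count}

        void process(String key, int value) {
            long[] s = state.computeIfAbsent(key, k -> new long[2]);
            s[0] += value;
            s[1] += 1;
        }

        double averageFor(String key) {
            long[] s = state.get(key);
            return s == null ? 0.0 : (double) s[0] / s[1];
        }
    }

    public static void main(String[] args) {
        InputReceiver input = new InputReceiver();
        Average avg = new Average();
        for (int i = 0; i < 10_000; i++) {
            String[] e = input.nextEvent();
            avg.process(e[0], Integer.parseInt(e[1]));
        }
        // A downstream output operator (e.g. RedisAverageOutput) would
        // write this result to a store or alerting channel.
        System.out.printf("cpu average: %.1f%n", avg.averageFor("cpu"));
    }
}
```

In the real platform each stage would run as a separate partitionable operator connected by streams, rather than method calls in one JVM.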

We asked Hoang what makes DataTorrent faster than Spark:

PH: There are two important architectural differences that stem from DataTorrent’s focus on enabling enterprises to take action in real time via streaming and Spark’s desire to generalize the Spark engine to process streams. The two key areas of focus are performance and stateful fault tolerance.

1. Performance -- DataTorrent RTS was designed and built from the ground up as a native Hadoop 2.0 offering focusing on performance with high availability, and as a result does its processing event-by-event with sub-second latency. DataTorrent RTS schedules its application onto the Hadoop containers on start, and that mapping is fixed until the application needs to change, so it does not introduce any scheduling overhead. Spark, on the other hand, was built pre-Hadoop 2.0 and utilizes the Spark engine to effectively run many “map reduce” jobs as small batches or “mini-batches”. This design decision requires that Spark (via the application master) schedule each mini-batch onto the cluster, which represents tremendous overhead and slows down the system.

2. Stateful fault tolerance -- DataTorrent RTS was designed to enable complex, stateful computation at high performance, with fault tolerance. This is a key enterprise requirement, where the ability to recover from outages without any data loss or state loss is mandatory. The DataTorrent RTS design center here was to use Java for programming and to remove the “burden” of fault-tolerant design from the enterprise developer/ISV (namely, it’s handled by DataTorrent RTS for the developer). Spark does provide fault tolerance, but only for stateless processing. The Spark design center was to use the Scala language, which is a functional language, and the operators that process streams are stateless. If an enterprise wanted to add stateful processing to Spark, they would need to write that code as part of their application, which is difficult to do and would impact performance.
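The stateful fault tolerance Hoang describes is typically implemented by periodically checkpointing operator state, so that after a failure the operator resumes from the last snapshot. The following is a hypothetical, heavily simplified Java sketch of that idea, not DataTorrent’s actual recovery mechanism (which also replays the events processed since the checkpoint):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of checkpoint-based stateful recovery: the
// platform periodically snapshots operator state so that a restarted
// operator resumes from the last checkpoint instead of from scratch.
public class CheckpointSketch {

    static class CountingOperator {
        Map<String, Long> counts = new HashMap<>();

        void process(String key) {
            counts.merge(key, 1L, Long::sum);
        }

        Map<String, Long> checkpoint() {
            return new HashMap<>(counts); // snapshot of current state
        }

        void restore(Map<String, Long> snapshot) {
            counts = new HashMap<>(snapshot);
        }
    }

    public static void main(String[] args) {
        CountingOperator op = new CountingOperator();
        op.process("a");
        op.process("a");
        Map<String, Long> snapshot = op.checkpoint(); // taken every N windows in practice

        op.process("b");                  // work done after the checkpoint...
        op = new CountingOperator();      // ...is lost when the operator fails
        op.restore(snapshot);             // recovery restores checkpointed state

        System.out.println(op.counts);    // prints {a=2}
    }
}
```

A production system would persist the snapshot durably (e.g. to HDFS) and replay post-checkpoint events to reach exactthe pre-failure state; this sketch only shows why the snapshot makes the state survivable at all.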

According to Hoang, DataTorrent is certified for “all major Hadoop distributions for both on-premise and cloud-based deployment (such as Cloudera, Hortonworks, and MapR and on the cloud such as Amazon AWS and Google Cloud), enabling Enterprises the flexibility to change both Hadoop vendors and deployment options frictionlessly.”

Although DataTorrent is a commercial product, it comes with a free tier that includes all capabilities for small to medium applications.

William — Thank you for reading the article and for asking your questions. Amol Kekre, the CTO and co-founder of DataTorrent, provides a technical description, which we have posted below. Please note that at the core we are discussing events processed, which includes ingestion, in-memory data processing, and emitting to databases, not just network traffic. Also, Hadoop is by definition a distributed, clustered environment, so it’s not a single JVM. The full details on the benchmarking can be found at www.datatorrent.com/scaling-up-event-ingestion/ and a white paper is located here: www.datatorrent.com/how-it-works/under-the-hood...

The 1 billion events/sec benchmark consisted of ingested events as well as internal events generated as part of the communication between compute operations of the application. Ingested events formed a much smaller portion of the total. The application is a distributed application that ran on a Hadoop cluster. This required handling the results of a large number of internal events, all of which added up to 1 billion events/sec. The ability to handle any explosion of events needed to communicate internal results is a critical part of a platform for complex applications. Real-time big data ETL is an example of such an application. With regard to computing bytes compared to NIC speed, this application was not run on a single node; it was run on a cluster of servers. Our platform is able to very efficiently leverage the topology information, as it has constructs built in that enable very optimal use of the resources of each server within the cluster. Please email us or visit us at Hadoop Summit this week for more information.

As I stated above this is all meaningless when "internal events" dominate this number. What exactly is an internal event? How is this different than simply making a method call...to dispatch an event within the same thread context? I could very easily create a for-loop that generated events at such a rate and all in a single JVM without the burden of Hadoop. This was my point and something that should have been picked up by the InfoQ editorial team. A little bit of simple maths would quickly reveal this number to be nonsensical...from a customer or reader point of view.

I specified in the article that one of the operators is InputReceiver, which simulates events. These are entities with at least 2 date/time stamps and over 15 randomly generated integer fields simulating machine-info events. Such events are processed through the platform the same way external events are. I suspect they used such internal events because it's hard to get 1B events/sec from external sources these days. Maybe in the future. All the operators and the demo test are included in the open-source Malhar library mentioned in the article.
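The described event shape is easy to sketch. The following Java snippet is a hypothetical reconstruction based only on the description above (two timestamps plus 15 random integer fields); the actual field names and generation logic in the Malhar demo may differ.

```java
import java.util.Random;

// Hypothetical sketch of a simulated machine-info event as described:
// two date/time stamps plus 15 randomly generated integer fields.
// Field names and the throughput loop are illustrative assumptions.
public class SimulatedEvent {
    final long createdAt;             // first timestamp
    final long emittedAt;             // second timestamp
    final int[] fields = new int[15]; // randomly generated integer fields

    SimulatedEvent(Random rnd) {
        createdAt = System.currentTimeMillis();
        for (int i = 0; i < fields.length; i++) {
            fields[i] = rnd.nextInt();
        }
        emittedAt = System.currentTimeMillis();
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        int n = 1_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            new SimulatedEvent(rnd); // generate, then discard
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("generated %d events in %.3f s (%.0f events/sec)%n",
                n, secs, n / secs);
    }
}
```

Note that a loop like this only measures raw generation speed in one JVM; the benchmark under discussion counts events flowing through a distributed operator DAG, which is a different workload.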

We report on various products that we consider are useful to our readers. We certainly cannot verify all claims. I specified that DataTorrent "claims" those numbers. Many companies make claims. If a reader has doubts about a claim, let him make an informed investigation, possibly repeating a simulation in similar conditions. Let him settle any issues with the claim maker, then come forward with the results. Countering claims with suppositions is not taking us anywhere.

simple maths...that seems to work for most people in the scientific community when claims seem "out of whack" with reality or the perception of reality around what constitutes an event...keep digging a bigger hole if you wish rather than explicitly calling out that 1 BILLION a second is just a meaningless measure of nothing...there is no countering...I questioned what exactly was being reported and lo and behold I was right

So in the interest of full disclosure…I’m an enterprise rep at SQLstream. That said, I have to agree with William that benchmarks that demonstrate performance capabilities with no relative context to any real-world scenario are often misleading. In the past, the concept of benchmarking would involve a real customer, and most of the time an independent 3rd party to implement and evaluate the results. I’m seeing more and more “benchmarks” done by internal teams, without a customer or real-world business value. That’s not a benchmark, it’s a very optimistic high-water mark.

Again, in the interest of disclosure…we have run similar tests to evaluate sheer processing capability, or “our high-water mark”, but we would never think about publishing them….let alone sending our CTO into the wild to blog about them without some substance to back up the numbers. Since we’re on topic, and if you care to see a real-world, independent benchmark complete with externally generated events and analytics, check out our white paper at www.sqlstream.com/wp-content/uploads/2014/02/SQ...

For the record, if anyone was wondering….our high-water-mark test took internally generated events of around 16 bytes and performed a very simple analytic to determine peak performance. Again, this would still have no real business value, but on the same 37-node cluster we saw 1.184 billion events/sec……apples and oranges compared to the aforementioned unicorn-and-sunshine example. In summary, I think using marketecture in combination with unjustified or unsubstantiated performance metrics to enhance market share or brand recognition has the tendency to backfire…..which seems to be the case, thanks to William’s attention to detail and willingness to call BS when he sees it.