Big Data Frameworks: A Comparative Study

In this experimental study, we compare the most popular Big Data frameworks. We divide the study into two parts: (1) batch mode processing and (2) stream mode processing. The study covers the performance, scalability, and resource usage of frameworks such as SPARK, HADOOP, FLINK, SAMZA and STORM. We chose the WordCount example as the use case for evaluating the frameworks in batch mode, and an Extract, Transform, Load (ETL) process for evaluating them in stream mode.

Experimental protocol:

We consider two scenarios according to the data processing mode. For each scenario, we measure the performance of the presented frameworks.

Batch Mode: In the Batch Mode scenario, we evaluate HADOOP, SPARK and FLINK while running the WordCount example on a large set of tweets. The tweets were collected by Apache Flume and stored in HDFS. We chose Apache Flume because it integrates easily with the HADOOP ecosystem (especially HDFS). Moreover, Apache Flume collects data in a distributed way and offers high data availability and fault tolerance. We collected 10 billion tweets and used them to form large tweet files whose size on disk varies from 250 MB to 100 GB.
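For reference, the WordCount computation used as the batch workload can be sketched in plain Python as a map step followed by a reduce step. This is a single-process illustration only, not the code benchmarked in this study; the function names (`map_phase`, `reduce_phase`, `word_count`) and the regex-based tokenizer are assumptions made for the example.

```python
import re
from collections import Counter
from itertools import chain

# Assumed tokenizer: a word is any run of letters, digits or underscores.
TOKEN = re.compile(r"\w+")


def map_phase(line):
    """Emit a (word, 1) pair for every word found in one line of text."""
    return [(word.lower(), 1) for word in TOKEN.findall(line)]


def reduce_phase(pairs):
    """Sum the emitted counts per word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def word_count(lines):
    """Run the map step on every line, then reduce all emitted pairs."""
    return reduce_phase(chain.from_iterable(map_phase(line) for line in lines))


if __name__ == "__main__":
    sample = ["big data is big", "Data frameworks process big data"]
    for word, n in word_count(sample).most_common():
        print(word, n)
```

In a distributed framework, the map step runs in parallel over file splits and the pairs are grouped by word before the reduce step; the sketch collapses both into one process to show only the logic.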

Stream Mode: In the Stream Mode scenario, we evaluate the real-time data processing capabilities of STORM, FLINK and SPARK. The Stream Mode scenario is divided into three main steps. The first step is devoted to data storage. For this step, we collected 1 billion tweets from Twitter using Apache Flume and stored them in HDFS. These data are then sent to KAFKA, a messaging server that guarantees fault tolerance during streaming as well as message persistence.