Understanding Live Chat Conversation Data: Hadoop or Spark?

In today’s world, the amount of unstructured data being collected is enormous. This data is of little use unless it is properly processed, analyzed and evaluated. Putting it to good use is what many of the largest companies, such as Google, Facebook, Amazon and Netflix, are aiming for. Big data is a term for datasets that are so large and complex that traditional database systems such as MS SQL and MySQL struggle to handle them. It is not the amount of data that is important, but what organizations do with it. Data can be turned into useful information, which can then be analyzed to draw insights that lead to better management practices and strategic business decisions. For example, large sets of live chat conversation data from big e-commerce companies can be analyzed with Hadoop or Spark to produce actionable insights that drive live chat sales growth.

Now that we understand what “Big Data” actually is, we need an effective way to structure and process this unstructured live chat conversation data and draw meaning from it. MapReduce was introduced in 2004 as a model for simplified data processing on large clusters. Apache Hadoop, which grew out of the MapReduce idea, followed in 2006 and revolutionized the world of Big Data. Apache Hadoop is not just used for processing big data; it also provides a platform for storing it, the Hadoop Distributed File System (HDFS). It offers a scalable, flexible and reliable distributed computing framework for a cluster of systems, and it uses a master-slave architecture to analyze large datasets with the MapReduce paradigm. The major components of Hadoop are HDFS, Hadoop MapReduce and YARN (Yet Another Resource Negotiator). The computing power of a single machine is not sufficient for processing huge datasets, so Hadoop distributes the load across a cluster of nodes. Hadoop’s MapReduce is a programming model that allows you to process large volumes of data stored in Hadoop. In the map phase, a block of live chat conversation data is read and processed to produce key-value pairs as intermediate output. This output is grouped by key and fed to the reducer, which aggregates those intermediate key-value pairs into a smaller set of tuples that forms the final output.
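To make the map and reduce phases more concrete, here is a minimal sketch in plain Python rather than on a real Hadoop cluster. It counts how often each word appears across a few chat messages. The sample records and the function names (map_phase, shuffle, reduce_phase, run_job) are made up for illustration and are not part of Hadoop's API.

```python
from collections import defaultdict

# Hypothetical sample records: (chat_id, speaker, message text).
CHAT_LINES = [
    ("chat-1", "agent", "Hello how can I help"),
    ("chat-1", "customer", "I need help with my order"),
    ("chat-2", "agent", "Hello thanks for waiting"),
]


def map_phase(record):
    """Emit one (word, 1) pair per word in a single chat message."""
    _chat_id, _speaker, text = record
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single (key, total) tuple."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    for word, total in sorted(run_job(CHAT_LINES).items()):
        print(word, total)
```

On a real cluster the map calls would run in parallel on different nodes, each over its own block of data, and the grouping step would move data across the network. The sketch only shows the shape of the computation.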

Apache Spark is considered a powerful complement to Hadoop; it is a more accessible and capable tool for tackling various big data problems. Its architecture is based on two main abstractions: Resilient Distributed Datasets (RDDs) and Directed Acyclic Graphs (DAGs). An RDD is a collection of data items that can be split into partitions and stored in memory on the worker nodes of a Spark cluster. The DAG abstraction lets Spark plan a whole chain of operations at once, instead of running it as a rigid sequence of separate MapReduce stages that each write their results to disk.
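The idea of recording a plan first and running it later can be sketched in a few lines of plain Python. This is a toy model, not Spark: the LazyDataset class and its methods are hypothetical names chosen for illustration, and the "DAG" here is just a linear chain of steps held on a single machine.

```python
class LazyDataset:
    """Toy stand-in for an RDD: source data plus a chain of pending steps."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # the recorded plan, nothing has run yet

    def map(self, fn):
        """Record a map step and return a new dataset; no data is touched."""
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        """Record a filter step and return a new dataset; no data is touched."""
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def plan(self):
        """Return the names of the recorded steps, in execution order."""
        return [kind for kind, _fn in self._steps]

    def collect(self):
        """Run the whole chain in one pass and return the results as a list."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


if __name__ == "__main__":
    # Hypothetical chat messages; the empty string models a blank message.
    messages = ["Hi there", "", "Where is my order", "Thanks"]
    word_counts = LazyDataset(messages).filter(bool).map(str.split).map(len)
    print(word_counts.plan())
    print(word_counts.collect())
```

Because the steps are only recorded until collect is called, the intermediate results never have to be written out between steps, which is the property the paragraph above attributes to Spark's DAG model.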

As Rajiv Bhat, Senior Vice President of Data Sciences and Marketplace at InMobi, rightly said, “Spark is beautiful. With Hadoop, it would take six-seven months to develop a machine learning model. Now, we can do about four models a day”. The questions here are: Is Apache Spark more efficient than Apache Hadoop? Why are companies moving towards Apache Spark? Does it have any advantages over traditional Hadoop? Firstly, Spark and Hadoop cannot be compared directly. Apache Spark is a cluster computing framework, whereas Hadoop is a framework made up of several components, such as HDFS and MapReduce. MapReduce is Hadoop’s cluster computing framework. So the comparison is between Apache Spark and Hadoop MapReduce, not between Spark and Hadoop.

Hadoop is a parallel data processing framework that has traditionally been used to run MapReduce jobs. These are long-running batch jobs that take minutes or hours to complete, which is often the case with live chat conversation data. Spark was designed to run on top of Hadoop as an alternative to the traditional MapReduce model, and it can be used for real-time stream processing and fast interactive queries that finish within seconds. Hadoop therefore supports both traditional MapReduce and Apache Spark. In other words, Hadoop is a general-purpose framework that supports multiple processing models, and Spark is an alternative to Hadoop MapReduce rather than a replacement for Hadoop.

Spark can run on top of HDFS along with other Hadoop components.

The major difference between Apache Spark and Hadoop MapReduce is that Spark keeps data in memory wherever it can, whereas MapReduce writes data to disk between steps. As a result, Spark is generally faster, but it requires machines with plenty of RAM. The two also differ in how they achieve fault tolerance: MapReduce relies on data replicated on disk, whereas Spark relies on RDDs, which record how each partition was derived so that a lost partition can be recomputed. Besides these, Spark's higher-level APIs usually make it easier to program than raw Hadoop MapReduce. Programmers have the liberty to perform streaming, batch processing and machine learning, all on the same cluster. Spark has been reported to run some workloads up to 100 times faster than MapReduce when the data fits in memory, because much of the disk access overhead between steps is avoided.
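The fault-tolerance difference can be illustrated with a small plain-Python sketch. Instead of keeping extra copies of the computed data, the toy class below remembers how each partition was derived and recomputes only the partition that was lost. CachedDataset, apply_lineage and the sample data are hypothetical and only mimic the idea; this is not how Spark is implemented internally.

```python
from functools import reduce


def apply_lineage(partition, lineage):
    """Replay every recorded transformation, in order, over one partition."""
    return reduce(lambda data, step: step(data), lineage, partition)


class CachedDataset:
    """Toy model: in-memory results plus the recipe needed to rebuild them."""

    def __init__(self, source_partitions, lineage):
        self.source = source_partitions  # assumed to live in durable storage
        self.lineage = lineage           # ordered list of transformations
        self.cache = {
            index: apply_lineage(part, lineage)
            for index, part in enumerate(source_partitions)
        }
        self.recomputed = []             # which partitions had to be rebuilt

    def lose_partition(self, index):
        """Simulate a worker failure that wipes one in-memory partition."""
        self.cache.pop(index, None)

    def get(self, index):
        """Return a partition, rebuilding it from lineage if it was lost."""
        if index not in self.cache:
            self.cache[index] = apply_lineage(self.source[index], self.lineage)
            self.recomputed.append(index)
        return self.cache[index]


def lowercase(messages):
    return [message.lower() for message in messages]


def mentions_refund(messages):
    return [message for message in messages if "refund" in message]


if __name__ == "__main__":
    # Hypothetical chat messages split across two partitions.
    source = [["Refund please", "Hello"], ["Where is my REFUND", "Bye"]]
    dataset = CachedDataset(source, [lowercase, mentions_refund])
    dataset.lose_partition(1)
    print(dataset.get(1), dataset.recomputed)
```

Only the lost partition is rebuilt; the surviving partition is served straight from memory. The trade-off is that recovery costs computation time rather than the extra storage that replication needs.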

About Dr. Michael Housman

Michael has spent his entire career applying state-of-the-art statistical methodologies and econometric techniques to large datasets in order to drive organizational decision-making and help companies operate more effectively.
Prior to founding RapportBoost.AI, he was the Chief Analytics Officer at Evolv (acquired by Cornerstone OnDemand for $42M in 2015) where he helped architect a machine learning platform capable of mining databases consisting of hundreds of millions of employee records. He was named a 2014 game changer by Workforce magazine for his work.
Michael is currently an equity advisor for a half-dozen technology companies based out of the San Francisco Bay Area: hiQ Labs, Bakround, Interviewed, Performiture, Tenacity, Homebase, and States Title. He was on Tony’s advisory board at Boopsie from 2012 onward.
Michael is a noted public speaker who has published his work in a variety of peer-reviewed journals. His research has been profiled by The New York Times, Wall Street Journal, The Economist, and The Atlantic.
Dr. Housman received his A.M. and Ph.D. in Applied Economics and Managerial Science from The Wharton School of the University of Pennsylvania and his A.B. from Harvard University.