Big Data Analytics Beyond Hadoop

Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives (FT Press Analytics), by Vijay Srinivas Agneeswaran. When most technical professionals think of big data analytics today, they think of Hadoop. But there are many cutting-edge applications that Hadoop is not well suited for.

1. Introduction

Google's seminal paper on MapReduce [1] was the trigger that led to a lot of developments in the big data space.

What is big data analytics? Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured, and unstructured data, from different sources, and in different sizes from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. Big data has one or more of the following characteristics: high volume, high velocity, or high variety. Artificial intelligence (AI), mobile, social, and the Internet of Things (IoT) are driving data complexity through new forms and sources of data. Analysis of big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Businesses can use advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing to gain new insights from previously untapped data sources, independently or together with existing enterprise data.

This paper outlines that Hadoop is suited to simpler iterative algorithms, where the algorithm can be expressed as a single execution of an MR model or as a sequential execution of a constant number of MR models.

Hadoop is not well suited to cases where the algorithm can only be expressed such that each iteration is a single MR model, or where each iteration comprises multiple MR models; conjugate gradient descent, for instance, falls into the latter category.
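To make the iteration structure concrete, here is a minimal, self-contained sketch of the conjugate gradient method in Python. This is our own illustrative code, not from Hadoop, Mahout, or the paper. Each iteration consumes vectors produced by the previous one, which is why a Hadoop realization needs one or more fresh MR jobs per iteration, with state written to and reread from HDFS.

```python
# Illustrative sketch: conjugate gradient for solving A x = b,
# where A is symmetric positive definite. All names are ours.

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]              # residual r = b - A x (x = 0 initially)
    p = r[:]              # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        # The next direction needs this iteration's residual: a chain of
        # dependent steps that MR cannot fuse into a single job.
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)  # converges to [1/11, 7/11]
```

The data dependence between iterations, not the arithmetic itself, is what makes this pattern awkward to express efficiently over MR.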

While a skeptic may argue that Hadoop YARN mitigates this to an extent by supporting non-MR workloads, YARN does not add any new constructs to realize these use cases, especially the iterative machine learning algorithms. YARN can be seen as providing efficient cluster management utilities, including scheduling, quite similar to what Mesos does.

Beyond Hadoop

What, then, are the alternatives for realizing some of the complex iterative algorithms that may not be amenable to efficient implementation over Hadoop?

Table 1 outlines some of the work in this space. The third-generation paradigms are those that go beyond Hadoop.

Again, efforts by SAS to go beyond Hadoop would fall into this category. The Berkeley researchers have proposed the Berkeley Data Analytics (BDA) stack as a collection of technologies that help in running data analytics tasks across a cluster of nodes.

The lowest-level component of the BDA stack is Mesos, the cluster manager, which handles task allocation and resource management for the cluster.

The second component is the Tachyon file system, built on top of Mesos. Tachyon provides a distributed file system abstraction and interfaces for file operations across the cluster. Spark, the computation paradigm, is realized over Tachyon and Mesos in one specific embodiment, though it can be realized without Tachyon, and even without Mesos, for clustering.

Shark, which is realized over Spark, provides an SQL abstraction over a cluster, similar to the abstraction Hive provides over Hadoop. The other important paradigm that has looked beyond Hadoop MapReduce is graph processing, exemplified by the Pregel effort from Google. Pregel follows the Bulk Synchronous Parallel (BSP) paradigm, in which user-defined compute functions are spawned on the nodes of the graph, with edges used for communication. This provides a deterministic computation framework.

Apache Giraph is an open source implementation of Pregel. GraphX is another system, with a specific focus on graph construction and transformations. While Pregel provides a graph-parallel abstraction that is easy to reason with and ensures deterministic computation, it leaves it to the user to architect the movement of data.

The Dempsy system from Nokia is also comparable to Storm. However, Akka is promising due to its ability to deliver stream composition, failure recovery, and two-way communication. Figure 1 gives a comparison of the logistic regression algorithm over Spark versus Mahout over Hadoop. One part of the figure shows the comparison as-is from the AMPLab study, while the other shows our own experience with an end-to-end comparison of runtimes. Figure 1: Comparison of end-to-end runtimes for logistic regression on the Impetus cluster.

There are several similar real-life use cases for Spark, as well as for Storm and GraphLab, some of the third-generation paradigms we have focused on. The same queries that took seconds on Cassandra natively, and still took seconds even on a warmed cache, ran in less than 1 second using Spark!

Quantifind is another start-up that uses Spark in production. They use Spark to help video companies predict the success of new releases; they have been able to move from running ML in hours over Hadoop to running it in seconds by using Spark.

Conviva, a start-up, uses Spark to run repeated queries on video data and found it nearly 30 times faster than Hive. Yahoo! is also using Spark to build algorithmic learning for advertisement targeting, in a new paradigm they call continuous computing. There are also several use cases for Storm listed on the Storm page. Umbrella Security Labs provides a case for using GraphLab to develop and test ML models quickly over complete data sets.

They have implemented a PageRank algorithm over a large graph and ran it using GraphLab in about 2 minutes.

References
[1] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51(1), 2008.
[2] Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." USENIX NSDI, 2012.
[3] Srirama, Satish Narayana, et al. "Adapting Scientific Computing Problems to Clouds Using MapReduce." Future Generation Computer Systems 28(1), January 2012.
[4] Low, Yucheng, et al. "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud." PVLDB 5(8), 2012.

In a Storm topology, spouts are the sources of streams, while bolts run the computations on the streams; ML algorithms on the streams typically run in the bolts. A topology, an application-specific wiring together of spouts and bolts, gets executed on a cluster of nodes.

The Kafka cluster buffers the events in its queue. This is necessary because the Storm cluster is heavily loaded by the processing, due to the ML involved. The details of this architecture, as well as the steps needed to run ML algorithms in a Storm cluster, are covered in subsequent chapters of the book.
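The pipeline described above can be caricatured in a few lines of Python. This is a hypothetical sketch using an in-process queue and generators; it does not use the real Kafka or Storm APIs, and the "score" is merely a stand-in for the ML computation that would run in a bolt.

```python
# Toy spout/bolt pipeline: a Kafka-like queue buffers events, a spout
# drains the queue into a stream, and a bolt processes each event.

from collections import deque

event_queue = deque(f"event-{i}" for i in range(5))  # the "Kafka" buffer

def spout(queue):
    # Emits tuples into the topology as long as the queue has events.
    while queue:
        yield queue.popleft()

def scoring_bolt(stream):
    # Stand-in for an ML bolt: tag each event with a hypothetical "score".
    for event in stream:
        yield (event, len(event))

# Wiring the spout to the bolt is the (trivial, single-process) analogue
# of declaring a topology over a cluster.
topology = scoring_bolt(spout(event_queue))
results = list(topology)
print(len(results))  # 5
```

The real systems add partitioned parallelism, acking, and fault tolerance on top of exactly this dataflow shape.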

Storm is also compared to the other contenders in real-time computing, including S4 from Yahoo and Akka from Typesafe.


The other important tool that has looked beyond Hadoop MR comes from Google: the Pregel framework for realizing graph computations (Malewicz et al.). Computations in Pregel comprise a series of iterations, known as supersteps. Each vertex in the graph is associated with a user-defined compute function; at each superstep, Pregel ensures that the user-defined compute function is invoked in parallel on each vertex. The vertices can send messages through the edges and exchange values with other vertices.

There is also a global barrier, which moves the computation forward only after all compute functions have terminated. Readers familiar with BSP can see why Pregel is a perfect example of BSP: a set of entities computing user-defined functions in parallel, with global synchronization, and able to exchange messages. Apache Hama (Seo et al.) is an open source BSP framework that works over the Hadoop ecosystem; it might be that its developers do not want to be seen as being different from the Hadoop community.
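The superstep structure can be illustrated with a toy PageRank in plain Python. This is our own sketch, not the Pregel or Hama API: each superstep invokes a compute step for every vertex, messages travel along out-edges, and a barrier separates supersteps. For simplicity it assumes every vertex receives messages after the first superstep (no sources or dangling vertices).

```python
# Toy Pregel-style BSP computation: PageRank via per-vertex compute
# functions and message passing, with a barrier between supersteps.

def pregel_pagerank(graph, num_supersteps=20, damping=0.85):
    """graph: dict mapping each vertex to its list of out-neighbours."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    inbox = {v: [] for v in graph}

    for _ in range(num_supersteps):
        outbox = {v: [] for v in graph}
        # "Compute" phase: every vertex processes its messages and
        # sends its rank share along its out-edges.
        for v in graph:
            if inbox[v]:  # simplification: update only if messages arrived
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            share = rank[v] / len(graph[v]) if graph[v] else 0.0
            for neighbour in graph[v]:
                outbox[neighbour].append(share)
        inbox = outbox  # global barrier: all messages delivered together

    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pregel_pagerank(graph)  # ranks sum to ~1
```

In a real BSP system, the inner loop over vertices runs in parallel across the cluster, and the barrier is a distributed synchronization point rather than a variable assignment.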

But the important thing is that BSP is an inherently well-suited paradigm for iterative computations, and Hama has a parallel implementation of conjugate gradient descent (CGD), which, as I said, is not easy to realize over Hadoop. GraphLab (Gonzalez et al.) is another effort in this space.

GraphLab provides useful abstractions for processing graphs across a cluster of nodes deterministically.

PowerGraph, the subsequent version of GraphLab, makes it efficient to process natural graphs, or power-law graphs: graphs that have a large number of poorly connected vertices and a small number of highly connected vertices.

Performance evaluations on the Twitter graph for page-ranking and triangle counting problems have verified the efficiency of GraphLab compared to other approaches.

The focus of this book is mainly on Giraph, GraphLab, and related efforts. Table 1 summarizes these paradigms. It can be inferred that although the traditional tools worked on only a single node, might not scale horizontally, and might also have single points of failure, recent reengineering efforts have moved them across generations.

The other point to be noted is that most of the graph processing paradigms are not fault-tolerant, whereas Spark and HaLoop are among the third-generation tools that do provide fault tolerance (FT). This discussion has also brought out the three dimensions along which thinking beyond Hadoop is necessary:
1. Real-time analytics: Storm and Spark Streaming are the choices.
2. Analytics involving iterative ML: Spark is the technology of choice.
3. Specialized data structures and their processing requirements: GraphLab is an important paradigm for processing large graphs.


Perry, Tekla S. "The Making of Facebook Graph Search." IEEE Spectrum, 2013.
Seo, Sangwon, Edward J. Yoon, et al. "HAMA: An Efficient Matrix Computation with the MapReduce Framework." IEEE CloudCom, 2010.
Venkataraman, Shivaram, et al. "Using R for Iterative and Incremental Processing." USENIX HotCloud, 2012.
Xin, Reynold S., Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica. "GraphX: A Resilient Distributed Graph System on Spark." GRADES, 2013.
Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. "Spark: Cluster Computing with Working Sets." USENIX HotCloud, 2010.

This chapter introduces the BDAS from AMPLab, whose name derives from algorithms, machines, and people, the three dimensions of its research, by first unfolding its motivation.

The BDAS can help in answering business questions such as these: How do you segment users and find out which user segments are interested in certain advertisement campaigns? How do you find the right metrics for user engagement in a web application such as Yahoo!'s?

How can a video content provider dynamically select an optimal Content Delivery Network (CDN) for each user, based on a set of constraints such as the network load and the buffering ratio of each CDN?

Motivation for BDAS

The seven giants categorization has given us a framework to reason about the limitations of Hadoop. I have explained that Hadoop is well suited for giant 1 (simple statistics), as well as for the simpler problems among the other giants.

In many of these classes of computations, one would have to create fresh MR jobs for every iteration. Each new iteration would need its data to be initialized, or reread, from HDFS into memory.
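The cost of rereading data on every iteration can be seen in miniature below. The sketch is entirely ours: load_dataset stands in for an HDFS read and deserialization, and gradient_step stands in for one iteration of an ML algorithm. The MR-style loop reloads the data every time, while the Spark-style loop loads it once and keeps it cached; both compute the same answer.

```python
# Minimal sketch of per-iteration I/O in MR-style vs cached iteration.

def load_dataset():
    # Stand-in for reading and deserializing data from HDFS.
    return [(i / 100.0, 2.0 * i / 100.0) for i in range(100)]

def gradient_step(data, w):
    # One full pass over the data, e.g. a least-squares gradient step.
    g = sum(x * (w * x - y) for x, y in data) / len(data)
    return w - 0.1 * g

def mr_style(iterations):
    w = 0.0
    for _ in range(iterations):
        data = load_dataset()   # reread from "HDFS" every iteration
        w = gradient_step(data, w)
    return w

def cached_style(iterations):
    data = load_dataset()       # loaded once, kept in memory
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(data, w)
    return w

assert mr_style(5) == cached_style(5)  # same result; only the I/O differs
```

In a real deployment the reload is disk and network I/O plus job-launch overhead per iteration, which is exactly the cost that in-memory caching eliminates.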

The data flow for iterative computing is illustrated in Figure 2. Interactive querying is another such scenario. By its very nature, Hadoop is a batch-oriented system, implying that for every query it initiates a fresh set of MR jobs to process the query, irrespective of query history or pattern.

The last kind of scenario is real-time computations. Hadoop is not well suited for these. Combining the capability to handle batch, interactive, and real-time computing into a single sophisticated framework that can also facilitate programming at a higher level of abstraction compared to existing systems resulted in the BDAS framework.

Spark is the fulcrum of the BDAS framework; its APIs facilitate programming at a much higher level of abstraction compared to traditional approaches.

Spark: Motivation

One of the main motivations for proposing Spark was to allow distributed programming of Scala collections, or sequences, in a seamless fashion. Scala is a statically typed language that fuses object-oriented programming with functional programming. This implies that every value in Scala is an object and every operation a method call, similar to object-oriented languages such as Smalltalk or Java.

In addition, functions are first-class values, in the true spirit of functional programming languages such as ML. Common sequences defined in the Scala library include arrays, lists, streams, and iterators. All sequences in Scala inherit from the scala.Seq class, which defines a common set of interfaces for abstracting common operations. Map and filter are commonly used functions on Scala sequences; they apply map and filter operations uniformly to the elements of the sequence. Spark provides a distributed shared object space that enables the previously enumerated Scala sequence operations over a distributed system (Zaharia et al.).
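The idea can be sketched with a toy class in Python. This is our own construction, not Spark's API (Spark's actual abstraction is the RDD, with a far richer, lazy, fault-tolerant implementation): sequence operations such as map and filter are exposed over a collection that is split into partitions, each of which could live on a different node.

```python
# Toy "distributed" sequence: sequence operations over partitioned data.

class ToyDistributedSeq:
    def __init__(self, data, num_partitions=4):
        k = max(1, len(data) // num_partitions)
        self.partitions = [data[i:i + k] for i in range(0, len(data), k)]

    def map(self, f):
        out = ToyDistributedSeq([])
        # Each partition could be mapped on a different node in parallel.
        out.partitions = [[f(x) for x in p] for p in self.partitions]
        return out

    def filter(self, pred):
        out = ToyDistributedSeq([])
        out.partitions = [[x for x in p if pred(x)] for p in self.partitions]
        return out

    def collect(self):
        # Gather all partitions back into one local list.
        return [x for p in self.partitions for x in p]

seq = ToyDistributedSeq(list(range(10)))
result = seq.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

Because map and filter are partition-local, no data moves between partitions; this is the property that makes such operations embarrassingly parallel across a cluster.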

Shark: Motivation

The other dimension of large-scale analytics is interactive queries. These types of queries occur often in a big data environment, especially where operations are semi-automated and involve end users who need to sift through large data sets quickly. There are two broad approaches to solving interactive queries on massive data sets: parallel databases and Hadoop MR. Parallel databases distribute the data (relational tables) across a set of shared-nothing nodes and split queries across the nodes for efficient processing, using an optimizer that translates Structured Query Language (SQL) commands into a query plan.

In the case of complex queries involving joins, there might be a phase of data transfer similar to the shuffle phase of MR; subsequently, the join operations are performed in parallel and the results are rolled up to produce the final answer, similar to the reduce phase of MR. Gamma (DeWitt et al.) was one of the early parallel database systems. The comparison between MR and parallel database systems can be made along three axes. Schema: MR might not require a predefined schema, whereas parallel databases use a schema to separate data definition from use.

Efficiency: Efficiency can be viewed as comprising two parts: indexing and execution strategy. With respect to indexing, parallel databases have sophisticated B-tree-based indexing for locating data quickly, whereas MR offers no direct support for indexing. With respect to execution strategy, MR creates intermediate files and transfers them from the mappers to the reducers explicitly with a pull approach, resulting in performance bottlenecks at scale.

In contrast, parallel databases do not persist the intermediate files to disk and use a push model to transfer data. Consequently, parallel databases might be more efficient in query execution strategy compared to MR. But MR has sophisticated mechanisms for handling failures during the computation, a capability that is a direct consequence of the intermediate file creation.

Since parallel databases do not persist intermediate results to disk, the amount of work to be redone after a failure can be significant, resulting in larger performance penalties under failure conditions. In essence, the MR approach provides FT at a fine-grained level but does not have efficient query execution strategies. Hadoop MR is not well suited for interactive queries.


The reasoning behind this assertion is that Hive or HBase, which might typically be used to service such queries in a Hadoop ecosystem, do not have sophisticated caching layers that can cache the results of important queries; instead, they might start fresh MR jobs for each query, resulting in significant latencies. This has been documented, among others, by Pavlo and colleagues (Pavlo et al.). The parallel database systems are good for optimized queries on a cluster of shared-nothing nodes, but they provide only coarse-grained FT; this implies that, for example, an entire SQL query might have to be rerun on the cluster in case of failures.

This coarse-grained recovery limitation holds even for some of the new low-latency engines proposed for querying large data sets, such as Cloudera Impala and Google Dremel, or its open source equivalent, Apache Drill.