What Do Real-Life Apache Hadoop Workloads Look Like?

Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.

Table 1. Summary of Hadoop workloads analyzed

1: Bytes moved is the sum of input, shuffle, and output sizes of all jobs in the workload.

The following are some of the questions we asked about the Hadoop workloads:

What are the data access patterns?

Typical data set sizes?

Uniform or skewed data access?

Frequency of re-accessing data?

How does the load vary over time?

Regular cycles or unpredictable?

Frequency and size of load bursts?

What are the compute patterns?

Ratio between compute and movement of data?

Common job types?

How frequently do people use higher level query languages such as Hive and Pig as opposed to Java MapReduce?

How does this workload compare with other deployments?

Typical Data Set Sizes

Data manipulation and management are key functions of Hadoop, so it is important to understand the typical data set sizes on Hadoop clusters.

The graphs below show the cummulative distribution of per-job input, shuffle, and output data across the workloads. A dot (x, y) on the graph shows that data set size less than x makes up y fraction of the jobs in the workload. For example, the arrow highlights that ~60% of jobs in the CC-b workload have input sizes of less than several KBs.

Figure 1. Per-job input, shuffle, and output data sizes

The graphs show that most jobs have data inputs sizes in the megabytes to terabytes and the distribution of data sizes varies greatly between workloads. It’s worth clarifying that because each workload contains a large number of jobs, the cluster capacity still needs to be quite large, even though a single job often accesses smaller data sets.

These observations suggest the limits of simple performance measurement tools such as terasort or TestDFSIO: Data sets at TB and larger scales represent only a narrow sliver of real life behavior. All of the input, shuffle, and output stages need to be tested together subject to a large range of three-stage data patterns. Generic tools give generic numbers that are not guaranteed to translate to a particular workload.

Data Access Skew

It is not surprising that over the lifetime of data, at any given time some subsets are more valuable than the rest and get accessed more often. The nature and cause of such skew offer insights on how we can improve the Hadoop ecosystem.

The graph below shows the HDFS file read frequencies against the file rank by decreasing read frequency. In other words, we count the number of jobs that read each HDFS file, then make the graph such that a data point (x, y) indicates that the xth most frequently read file is read y times. For example, the arrow indicates that the most frequently accessed file in the CC-b workload is read close to 1000 times over the duration of the trace.

Figure 2. Data access skew by HDFS file paths

The graph axes are both logarithmic. The combined lines from all five workloads form an approximate straight band. Straight lines on these axes represent a Zipf distribution, which indicates high data skew. In particular, for all workloads, the top 1000 most frequently read files account for all files that are read multiple times.

One observation is particularly interesting. The slope of the lines are approximately the same, indicating similar data skew across several industry sectors. In order words, there are some similar business processes or data analysis needs across different industry sectors that result in the same uneven data popularity.

We speculate that several factors could cause this cross-industries similarity: First, timely data is valued across industry sectors, and therefore more recent data should be more frequently accessed than archived data. Second, regardless of industry, human analysts sometimes find it necessary to restrict analysis to a small data set, which could result in a highly distilled data set being frequently accessed. Furthermore, skew could come from constant improvements in products and services, which make it less critical to analyze data about deprecated products and services.

We can analyze the job name strings within each workload to measure the relative slot time attributed to some of these frameworks. The graph below is a stacked bar graph, showing the relative fractions of the first word of job names. For example, in the CC-a workload, 0.48 fraction of the slot time come from jobs whose names begin with the word “insert”. We’ve color coded the words based on the frameworks to which the job belongs. The “others” sector comprises the remainder of the job names, whose contribute fraction of slot times too low to be labeled individually on the graph. This analysis would not include components that do not interact with the Job Tracker, e.g. Flume.

Note that Oozie is a workflow management framework, which in turn may launch jobs in various other frameworks (Hive, Pig, Java MapReduce). We’re currently working on identifying the individual jobs within the Oozie workflows as, say, Hive or Pig or Java MapReduce.

Figure 3. Task-time (slot-hours) in workloads grouped by frameworks

We see that Hadoop extension frameworks make up a considerable fraction of slot times, and sometimes dominate the workload (CC-a, CC-e). Also, two frameworks are enough to account for a large fraction of slot time in each workload.

The prevalence of frameworks like Hive, Pig, and Oozie suggests that people find it attractive to interact with data using query-like languages (Hive/Pig) and workflow management systems (Oozie). As these frameworks mature, it is important to explore what would be even more convenient ways to interact with the data. For example, as knowledge about use cases increase, we should try to devise automated or at least standard analyses for common business cases.

Further, the fact that only two frameworks dominate each workload suggests that there are non-trivial human and business process costs that make each enterprise acquire and learn only a small number of frameworks. This means that out of the many MapReduce extension frameworks existing today, the market could eventually consolidate around a small number of proven, well-supported solutions.

Implications for Customers and Cloudera

Our study highlights both a wide range of data volumes processed by different customers – from less than 3TB/day to 600TB/day – and striking similarities in the distribution of data volume processed per job. The much higher access frequency to a limited subset of data, particularly if that data is the most recently acquired, indicates there may be benefits to tiered storage. The smaller subset could be maintained in more expensive storage with lower latency and/or higher replica counts without significantly affecting total cost if HDFS placement was made sensitive to access patterns.

A significant portion of the processing is executed through high-level languages like Pig Latin and Hive SQL, as opposed to through programs written in Java MapReduce, but the workload is dominated by Java MapReduce programs for at least one of the five companies. Companies may choose to use only one high-level language or to adopt more than one for different purposes. Further study may illuminate the use cases most conducive to adoption of each approach.

As a software provider to a wide range of industries, company sizes, and use cases, Cloudera’s software distribution must accommodate a full spectrum of languages and processing frameworks, and be flexible enough to handle many small jobs efficiently and also very large jobs running over terabytes of input data.

Yanpei is a member of the performance team at Cloudera. A more detailed version of the study appeared as a VLDB 2012 paper, which was co-authored with Sara Alspaugh and Randy Katz at UC Berkeley.