Hadoop Clusters and Capacity Planning

Welcome to 2016!

As Hadoop races into prime-time computing, issues such as capacity planning, assessment and adoption of new tools, backup and recovery, and disaster recovery/continuity planning are becoming serious questions, with serious penalties if ignored.

In this article, I hope to present an approach to basic capacity planning, which may serve some of you as a starting point for building a reasonable, somewhat standardized approach that suits your environment.

In lieu of a full standards-based approach, which I have not yet found from ISO or IEEE, we are forced to use our experience, skills, and gut (tuned by experience, of course) to make reasonable plans to care for and feed our clusters and, most importantly, to keep the critical processes running on them stable and fast. Some of us are really lucky and work in large organizations where, once we have an idea of how to do some capacity planning, we must create a spreadsheet or presentation to explain to senior managers what they need to purchase and when they need to purchase it, based on the planning model we just authored!

Here is how I came up with this approach:

First, we want to build a prototype of the process, using a subset of the expected data or load, so that we can gather a sample of execution times for all jobs that are associated with a project.

Define the inputs and facts!

What is the storage capacity of your cluster?

What is the default replication factor?

How many data nodes do you have in your cluster?

List the data locations and feeds that are part of your project.
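These facts combine into a couple of useful derived numbers. Here is a minimal sketch; the raw capacity, replication factor, and node count below are made-up illustration values, so substitute your own:

```python
# Hypothetical cluster facts -- replace with the values for your cluster.
raw_capacity_tb = 480.0   # total raw HDFS storage across all data nodes
replication_factor = 3    # HDFS default replication
data_nodes = 40

# Usable capacity is raw capacity divided by the replication factor,
# since every block is stored replication_factor times.
effective_capacity_tb = raw_capacity_tb / replication_factor
per_node_raw_tb = raw_capacity_tb / data_nodes

print(f"Effective capacity: {effective_capacity_tb:.1f} TB")
print(f"Raw storage per data node: {per_node_raw_tb:.1f} TB")
```

Writing the facts down this way also makes the later spreadsheet work easier, because every projection traces back to a named input.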

Run a known sample size, perhaps 5-10% of the expected full load, for a period of time. I like to spread my sample times over a period of 5-10 days to try to include natural variances in loads from other processes running on the cluster.

For example, if you know you will be using daily data sets for 1000 days of history in the full product, run your sample on 50 days of data to achieve a 5% sample. This should give you some decent execution time data, but not consume so many resources that you would be afraid to run it more than once if you need to change something in your data or process.
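The sample-size arithmetic above is trivial, but worth scripting so it scales with your own numbers; the 1000-day full load and 5% fraction are just the figures from the example:

```python
full_load_days = 1000    # daily data sets in the full product
sample_fraction = 0.05   # a 5% sample, per the guidance above

# Number of days of data to run in the prototype.
sample_days = int(full_load_days * sample_fraction)
print(f"Run the sample on {sample_days} days of data")
```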

I capture 5 days of job execution details from the History Manager and load them into a spreadsheet.

Calculate how long each job ran using the start/finish date/time values.

Then, normalize all the time measurements into seconds if the jobs are small enough. If the jobs are longer-running, you might want to use minutes. Just pick the best unit of measure to use across the board in your comparisons.

Use the total minutes or total seconds in a day to calculate the percentage of daily processing time that the job currently requires to finish.
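These two steps are easy in a spreadsheet, but here is the same duration and percentage-of-day calculation sketched in Python. The job names, timestamps, and record layout are invented for illustration, not anything exported by Hadoop:

```python
from datetime import datetime

# Hypothetical job records as (name, start, finish) -- the field layout
# is my own; adapt it to however you export from your job history.
jobs = [
    ("daily_ingest", "2016-01-04 02:00:00", "2016-01-04 02:23:30"),
    ("daily_rollup", "2016-01-04 03:00:00", "2016-01-04 03:41:15"),
]

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86_400

durations_s = {}
for name, start, finish in jobs:
    # Runtime from the start/finish date/time values, normalized to seconds.
    delta = datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)
    durations_s[name] = delta.total_seconds()
    pct_of_day = 100.0 * durations_s[name] / SECONDS_PER_DAY
    print(f"{name}: {durations_s[name]:.0f} s ({pct_of_day:.2f}% of the day)")
```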

Note the number of mappers and reducers your job required using the sample size.

Extrapolate the load using percentages as you scale up from the sample size to the full load. I usually use values of 25%, 50%, 75%, and 100% to extrapolate expected processing loads.

Add a bit of logarithmic increase as the load grows, because there will be some additional processing overhead when you go from 50 data sets to 1000 data sets, for example.
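Here is one way the extrapolation could be sketched. The overhead model (a 5% penalty per natural-log unit of scale-up) is purely my own assumption for illustration; tune the coefficient against what you actually observe:

```python
import math

sample_runtime_s = 1410.0   # measured runtime at the 5% sample
sample_fraction = 0.05      # 50 of 1000 data sets

def projected_runtime(fraction):
    """Project runtime at `fraction` of full load from the sample runtime.

    Linear scale-up plus an assumed logarithmic overhead term; the 1.0
    base keeps the overhead factor equal to 1 at the sample size itself.
    """
    scale = fraction / sample_fraction
    linear = sample_runtime_s * scale
    overhead = 1.0 + 0.05 * math.log(scale)
    return linear * overhead

for f in (0.25, 0.50, 0.75, 1.00):
    print(f"{int(f * 100)}% load: ~{projected_runtime(f):.0f} s")
```

At the sample size the projection reproduces the measured runtime exactly, and above it the logarithmic term keeps each projection slightly more pessimistic than a straight linear scale-up.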

Now, let’s take a look at a fictitious but realistic example, to illustrate how this might work out with real numbers! To make it even easier to see what is going on, I plopped the data and calculations into a spreadsheet.

For the flame throwers, know-it-alls, and trolls (just teasing a bit):

Now that you have a concrete example to look at on the Web, some folks will likely come up with all kinds of ideas about what it is lacking, the ridiculous assumptions made, et cetera. I understand, for example, that my model does not explicitly consider how many mappers and reducers are used or available during my sample run. This is OK, because the assumption here is that this is a multi-tenant cluster, and its workload will be about the same as the average during my sample period. With this considered, there is a likelihood that, with a smallish sample size, this approach may be a bit pessimistic in predicting the resources the job will consume on the cluster when it runs at full load.

To be very direct about it, nobody can reliably identify every variable that determines the ultimate performance metrics of a job running on a busy, multi-tenant cluster at a specific day and time. A job running every day in production on a multi-tenant cluster will vary considerably in how long it takes to complete and in the percentage of resources consumed at that moment, due to fluctuations in the processes running and even node failures. Hadoop, I believe, qualifies as a moderately complex system, and we must try to manage the performance risks (the potential downside). In so doing, we make strong upside performance more likely over time.

Please let me know if this worksheet and example were helpful! I am interested in thoughtful suggestions for improvements.