Stress Testing our AWS Hadoop Deployment – Some more results

To continue from our previous post, we have been working to deploy a stable, robust Hadoop cluster on AWS. We had experienced numerous issues with our earlier deployments; however, as we outlined in our previous post, a few parameter changes gave us a surprisingly stable cluster. The obvious next step was to stress test it. To do so, we ran our tests on a small 10+1 node Hadoop cluster (10 DataNode + TaskTracker nodes and 1 NameNode + JobTracker node). We ran the following tests:

The next part of our test was designed to exercise the overall infrastructure under load. This cluster consisted of 15 compute nodes (TT+DN) and 1 JT+NN node. Furthermore, some of the jobs we ran included an extremely large number of mappers (10K-15K and upwards) and a large number of reducers. Overall, we feel that such a test replicates real-world load and gives us a fair idea of how our compute nodes perform. Note that we set the parameter mapred.reduce.slowstart.completed.maps to 0.95. This ensures that reducers start only after most of the mappers have completed, so the reducers don't time out due to inactivity. Our jobs were executed over a one-day period, and the graphs below illustrate the cluster utilization.
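As a concrete sketch, the slow-start parameter can be set per job at submission time rather than cluster-wide. The jar, class, and path names below are placeholders, not from our actual jobs:

```shell
# Hypothetical job submission (Hadoop 1.x-era property name).
# Reducers are scheduled only after 95% of map tasks have completed,
# so they don't sit idle and risk timing out while maps run.
hadoop jar example-job.jar com.example.ExampleJob \
  -D mapred.reduce.slowstart.completed.maps=0.95 \
  /input/path /output/path
```

The same value can instead be made the cluster default by adding the property to mapred-site.xml on the JobTracker node.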