In our last blog, we discussed generating summary data using Spark. The summary works great for understanding the range of the data quantitatively. But sometimes we want to understand how the data is distributed across different ranges of values. Also, rather than just knowing the numbers, it helps a lot if we are able to visualize them. This way of exploring data is known as understanding the shape of the data.

In this second blog of the series, we will discuss how to understand the shape of the data using a histogram. You can find all other blogs in the series here.

So the values signify that 24 countries have a life expectancy between 47.794 and 54.914. Most countries fall between 76 and 83.
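For reference, bucket boundaries and counts of this kind can be obtained from the `histogram` method that Spark exposes on an `RDD[Double]`. The following is only a minimal notebook-style sketch: the DataFrame name `lifeExpectancyDF`, the column name `LifeExp`, the bucket count of 5 and the toy values are assumptions made for illustration, not the dataset or the exact code used in this post.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in Zeppelin an existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("histogram-sketch").getOrCreate()
import spark.implicits._

// Toy values standing in for the real dataset (made up for this sketch).
val lifeExpectancyDF = Seq(48.1, 52.3, 61.0, 70.4, 77.9, 79.2, 81.5, 82.7).toDF("LifeExp")

// Drop down to the RDD API: an RDD[Double] gains histogram() via DoubleRDDFunctions.
val (startValues, counts) = lifeExpectancyDF
  .select("LifeExp")
  .rdd
  .map(row => row.getDouble(0))
  .histogram(5)

// startValues holds 6 evenly spaced boundaries, counts holds the 5 bucket sizes.
startValues.zip(counts).foreach { case (start, count) =>
  println(f"bucket starting at $start%.3f: $count")
}
```

Each count belongs to the bucket that starts at the boundary with the same index and ends at the next boundary, which is how a pair such as "47.794 to 54.914" and "24" should be read.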

If you don’t like using the RDD API, we can add a histogram function directly on the DataFrame using implicits. Refer to the code on GitHub for more details.
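One way such an extension might look is sketched below. This is not the code from the GitHub repository: the object name `HistogramImplicits`, the method signature and the toy data are all hypothetical, and the method simply delegates to the RDD `histogram` call underneath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical extension; the names used here are illustrative only.
object HistogramImplicits {
  implicit class HistogramOps(df: DataFrame) {
    // Returns (bucket boundaries, counts per bucket) for a numeric column.
    def histogram(column: String, bucketCount: Int): (Array[Double], Array[Long]) =
      df.select(col(column).cast("double"))
        .na.drop() // skip nulls so getDouble does not fail
        .rdd
        .map(_.getDouble(0))
        .histogram(bucketCount)
  }
}

import HistogramImplicits._

val spark = SparkSession.builder().master("local[*]").appName("df-histogram-sketch").getOrCreate()
import spark.implicits._

// Toy values, made up for this sketch.
val df = Seq(48.1, 52.3, 61.0, 70.4, 77.9, 79.2, 81.5, 82.7).toDF("LifeExp")

// With the implicit in scope, histogram reads like a DataFrame method.
val (boundaries, bucketCounts) = df.histogram("LifeExp", 5)
```

The RDD conversion still happens, but it is hidden behind the implicit class, so calling code stays in the DataFrame API.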

Visualizing the histogram

Once we have calculated the values for the histogram, we want to visualize them. As we discussed earlier, we will be using a Zeppelin notebook for this.

In Zeppelin, we need a DataFrame in order to generate a graph easily. But in our case, we got the data as arrays. So the code below converts those arrays to a DataFrame which can be consumed by Zeppelin.
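A minimal notebook-style sketch of that conversion follows. The case class name `HistRow`, its field names, the view name `histogramTable` and the placeholder arrays are assumptions for illustration; in the notebook the arrays would come from the histogram call.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in Zeppelin an existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("histogram-to-df").getOrCreate()

// One row per bucket: where the bucket starts and how many values fall in it.
case class HistRow(startPoint: Double, count: Long)

// Placeholder arrays shaped like histogram output (6 boundaries, 5 counts).
val startValues = Array(0.0, 10.0, 20.0, 30.0, 40.0, 50.0)
val counts = Array(3L, 1L, 4L, 1L, 5L)

// zip pairs each boundary with its count and drops the unmatched final boundary.
val rows = startValues.zip(counts).map { case (start, count) => HistRow(start, count) }

// Column names and types are derived from the case class fields.
val histDF = spark.createDataFrame(rows.toSeq)

// Registering a temp view makes the data reachable from SQL.
histDF.createOrReplaceTempView("histogramTable")
```

Because there is one more boundary than there are counts, the zip keeps only the start of each bucket, which is enough to label the bars of a chart.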

In the above code, we first combine both arrays using the zip method. It gives us an array of tuples. Then we convert that array into a DataFrame using the case class.

Once we have the DataFrame ready, we can run a SQL command and generate nice graphs as below.
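As a rough illustration of the SQL step, the sketch below registers a tiny placeholder view and queries it. The view name `histogramTable`, the column names and the values are assumptions made for this example rather than output from the post's dataset.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in Zeppelin an existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("histogram-sql").getOrCreate()
import spark.implicits._

// Placeholder histogram rows (bucket start, count); the values are made up.
Seq((0.0, 3L), (10.0, 1L), (20.0, 4L))
  .toDF("startPoint", "count")
  .createOrReplaceTempView("histogramTable")

// Backticks keep the column name count from being read as the aggregate function.
val buckets = spark.sql("SELECT startPoint, `count` FROM histogramTable ORDER BY startPoint")
buckets.show()
```

In Zeppelin, the same SELECT statement can be run in a `%sql` paragraph instead, where the result table can be switched to a bar chart from the paragraph's chart toolbar.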

You can download the complete Zeppelin notebook from GitHub and import it into your own Zeppelin to test it yourself. Please make sure you are using the Zeppelin 0.6.2 stable release.

Conclusion

Combining the computing power of Spark with the visualization capabilities of Zeppelin allows us to explore data the way R or Python does, but for big data. This combination of tools makes statistical data exploration on big data much easier and more powerful.