Practical Spark – Calculate AWS Spot price

How much does your current EC2 fleet cost? How much would it cost if EC2 Spot Instances were used?
Use Apache Spark, Scala and Jupyter Notebook to find out!

Someone might say that the best way to calculate such costs is to use AWS Cost Explorer. And I agree: it is super cool, it can do a lot, and maybe the use case described here could be achieved with it as well (I haven't actually tried). However, it wouldn't be as much fun without a bit of Scala and Apache Spark code! I used Jupyter Notebook for code development. Thanks Robert for advertising it over and over again!

The Use Case

On Demand EC2 instances are used to host Java applications (a.k.a. [micro]services). Different types of EC2 instances are used by different types of applications. I would like to find out how much would be spent if those instances were Spot Instances instead.

Step 1: Set up Jupyter Notebook

Make sure to set the SPARK_HOME environment variable before running the notebook.

Run jupyter notebook and create a new notebook.
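What follows is a minimal sketch of the session setup, assuming a Scala kernel such as Apache Toree is installed in Jupyter (with Toree a spark session is usually predefined, so this is mostly a sanity check):

```scala
import org.apache.spark.sql.SparkSession

// Create (or reuse) a local SparkSession for the notebook.
val spark = SparkSession.builder()
  .appName("spot-price-calculator")
  .master("local[*]")
  .getOrCreate()

// Enables the $"column" syntax used in the snippets below.
import spark.implicits._
```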

Step 2: Count existing EC2 Instances

In order to complete that step, I need to know which instances I have and how many of each. The AWS CLI describe-instances command is helpful here. Note that only London (eu-west-2) prices will be downloaded. Since Spark will be used, let's save the result to a JSON file:
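A sketch of how this could look, assuming the CLI output is saved to a file called instances.json (the file name and the flattening below are my assumptions, not the original notebook):

```scala
import org.apache.spark.sql.functions._

// The file was produced with the AWS CLI, e.g.:
//   aws ec2 describe-instances --region eu-west-2 > instances.json
// describe-instances nests instances inside reservations, so the
// document has to be read as multi-line JSON and exploded twice.
val instances = spark.read
  .option("multiLine", true)
  .json("instances.json")
  .select(explode($"Reservations").as("r"))
  .select(explode($"r.Instances").as("i"))
  .select(
    $"i.InstanceType".as("instanceType"),
    $"i.State.Name".as("state"))
  .filter($"state" === "running")

// How many instances of each type are running?
instances.groupBy("instanceType").count().show()
```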

Select the On Demand price for Linux VMs. There are different types of Linux instances; however, I'm only interested in those without pre-installed SQL Server. There are various ways to do it, one of which is to select only those rows where operation equals RunInstances:
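A hedged sketch of that filter, assuming the relevant product attributes from the price list were already flattened into a view called prices (the column names below are illustrative, not the exact offer-file schema):

```scala
// Keep plain Linux On Demand products only; in the EC2 price list,
// Linux without pre-installed software uses operation = 'RunInstances'.
val onDemandLinux = spark.sql("""
  SELECT instanceType, pricePerHour
  FROM prices
  WHERE operatingSystem = 'Linux'
    AND operation = 'RunInstances'
""")

onDemandLinux.show()
```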

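The last step joins everything together and scales the hourly cost up to a month. A minimal sketch, assuming a view called fleet that already combines the instance counts, the hourly price and a per-environment utilisation factor (all names here are my assumptions):

```scala
// One row per (application, environment, instanceType) with an
// instance count, an hourly price (On Demand or Spot, depending on
// the comparison) and a utilisation factor between 0 and 1.
val monthlyCosts = spark.sql("""
  SELECT application,
         environment,
         instanceType,
         ROUND(SUM(instanceCount * pricePerHour * utilisation * 730), 2) AS monthlyCost
  FROM fleet
  GROUP BY application, environment, instanceType
  ORDER BY monthlyCost DESC
""")

monthlyCosts.show()
```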
That provides the total cost per application, environment and instance type, taking the utilisation of each environment into account. 730 is just the average number of hours in a month, calculated as 365 * 24 / 12.

Summary

What you have seen is that Spark, and Spark SQL in particular, can be used to quickly join and analyse data from various sources, and that Jupyter Notebook can speed up both code development and results analysis. The use case was calculating AWS EC2 Spot costs for an existing fleet of instances. Once I publish the described notebook on GitHub, I will update this post. Stay tuned!

I literally love what Spark can do and will use it more! Just need to find a good excuse … 😉