Wednesday, November 30, 2016

About a month ago I posted an article proposing Kotlin as another programming language for data science. It is a pragmatic, readable language created by JetBrains, the creator of Intellij IDEA and PyCharm. It has received growing popularity on Android and focuses on industrial use rather than experimental functionality. Just like Java and Scala, Kotlin compiles to bytecode and runs on the Java Virtual Machine. It also works with Java libraries out-of-the-box with no hiccups, and in this article I’m going to show how to use it with Apache Spark.

Officially, you can use Apache Spark with Scala, Java, Python, and R. If you are happy using any of these languages with Spark, you likely will not need Kotlin. But if you tried to learn Scala or Java and found it was not for you, you might want to give Kotlin a look. It is a legitimate fifth option that works out-of-the-box with Spark.

I recommend using Intellij IDEA as it natively includes Kotlin support. It is an excellent IDE that you can also use with Java and Scala. I also recommend using Gradle for your build automation.

Kotlin is replacing Groovy as the official scripting language for Gradle builds. You can read more about it in the article Kotlin Meets Gradle.

Setting Up

Gradle - Build automation system, download Binary Only distribtion and unzip it to a location of your choice

You will need to configure Intellij IDEA to use your Gradle location. Launch Intellij IDEA and set this up in Settings -> Build, Execution, and Deployment -> Gradle. If you have trouble there should be plenty of walkthroughs online.

Let’s create our Kotlin project. Using your operating system, create a folder with the following structure:

kotlin_spark_project
|
└────src
|
└────main
|
└────kotlin

Your project folder needs to have a folder structure inside of it containing /src/main/kotlin/. This is important so Gradle will recognize this as a Kotlin project.
Next, create a text file named build.gradle and use a text editor to put in the following contents. This is the script that will configure your project as a Kotlin project. You can read more about Kotlin Gradle configurations here.

Finally, launch Intellij IDEA and click Import Project and navigate to the location of your Kotlin project folder you just created. In the wizard, check Import project from external model with the Gradle option. Click Next, then select Use Local Gradle Distribution with the Gradle copy you downloaded. Then click Finish.

Your workspace should now be set up with a Kotlin project as shown below. If you do not see the project explorer on the left press ALT + 1. Then double-click on the project folder and navigate down to the kotlin folder.

Right click the kotlin folder and select New -> Kotlin File/Class.

Name the file “SparkApp” and press OK. You will now see a SparkApp.kt file added to your kotlin folder. An editor will open on the right.

Using Spark with Kotlin

Let’s put our Spark usage in the SparkApp.kt file. Spark was written with Scala. While Kotlin does not work directly with Scala, it does have 100% interoperability with Java. Thankfully, Spark has a Java API by providing a JavaSparkContext. We can leverage this to use Spark out-of-the-box with Kotlin.
Create a main() function below which will be the entry point for our Kotlin application. Be sure to import the needed Spark dependencies as well. In your main() function, configure your SparkConf and create a new JavaSparkContext off of it.

The JavaSparkContext provides a Java API to create Spark streams. Thankfully, we can use the excellent Kotlin lambda syntax which the Kotlin compiler will translate into the needed Java functional types.
Let’s turn a List of Strings containing alphanumeric text values separated by / characters. Let’s break these alphanumeric values up, filter only for numbers, and then find their sum.

If you click the Kotlin logo right next to your main() function in the gutter, you can run this Spark application.

A console should pop up below and start logging Spark’s events. I did not turn off logging so it will be a bit noisy. But ultimately you should see the value of sumOfNumbers printed.

Conclusion

I will show a few more examples in the coming weeks on how to use Kotlin with Spark (you can also check out my GitHub project). Kotlin is a pragmatic, readable language that I believe has potential for adoption in Spark. It just needs more documentation for this purpose. But If you want to learn more about Kotlin, you can read the Kotlin Reference as well as check out a few books that are out there. I heard great things about the O’Reilly video series on Kotlin which I understand is helpful for folks who do not have knowledge on Java, Scala, or other JVM languages.

If you learn Kotlin you can likely translate existing books and documentation on Spark into Kotlin usage. I’ll do my best to share my discoveries and any nuances I may encounter. For now, I do recommend giving it a look if you are not satisfied with your current languages.