Migrating to Spark 2.0 - Part 3 : DataFrame to Dataset

Spark 2.0 brings a significant changes to abstractions and API’s of spark platform. With performance boost, this version has made some of non backward compatible changes to the framework. To keep up to date with the latest updates, one need to migrate their spark 1.x code base to 2.x. In last few weeks, I was involved in migrating one of fairly large code base and found it quite involving process. In this series of posts, I will be documenting my experience of migration so it may help all the ones out there who are planning to do the same.

This is the third post in this series.In this post we discuss about migrating dataframe based API’s to dataset based once. You can access all the posts here.

DataFrame Abstraction in Spark 1.x

In 1.x series of spark, dataframe was a structured abstraction over native RDD abstraction. Dataframe
was built to support the structured transformation of data, using dataframe dsl or spark sql query language.
Where as rdd abstraction was there to provide the functional API’s for manipulating the data.

So whenever we wanted to use functional API’s on dataframe, we would be converting dataframe into a
RDD and then manipulated as RDD abstraction. This kind of conversion made it’s easy to move between dataframe
and rdd abstractions.

DataFrame to RDD Example

The below code shows, how to use functional map API on dataframe abstraction

DataFrame abstraction in 2.x

From spark 2.x, the dataframe abstraction has changed significantly. In 2.x, dataframe is alias of Dataset[Row].Dataset is combination of both dataframe and RDD like API’s. So not only dataset supports structured querying using dsl and sql, it also supports the functional API’s which are supported in RDD.

So whenever we call map in 2.x, we no more get a RDD. Instead we get dataset. This change in the conversion will break your code if you are using the RDD based code showed earlier.

Porting DataFrame to RDD code in 2.x

To port your existing code,you need to add extra, .rdd before calling map method as shown in the code.

RDD to Dataset Abstraction

The above code makes it easy to port but doesn’t provide good performance. Also converting rdd back to dataframe is a tedious work. So rather than using rdd functional API’s, you can use dataset functional API’s. Only constraint is that the return type of map should have a encoder. By default all primitive types and case classes are supported.