Transforming Python Lists into Spark Dataframes

Data represented as a dataframe is generally much easier to transform, filter, or write to a target source. In Spark, data loaded or queried from a source is automatically represented as a dataframe.

Here’s an example of loading, querying, and writing data using PySpark and SQL:

The example above works conveniently if you can load your data as a dataframe using PySpark’s built-in functions. But sometimes your processed data ends up as a list of Python dictionaries, say, when it didn’t come from spark.read or spark.sql. How can you load that data as a Spark DataFrame in order to take advantage of its capabilities?
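For concreteness, suppose the processed data looks like this (the field names and values are purely illustrative):

```python
# A hypothetical list of processed records, one dict per row
records = [
    {"name": "Alice", "age": 34, "score": 88.5},
    {"name": "Bob", "age": 29, "score": 91.0},
]
```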

Converting such a list into a Spark DataFrame is as simple as knowing how the datatype of each key-value pair in its dictionaries maps to one of PySpark’s DataType subclasses. You can find the latest list of available PySpark data types here.

You can actually skip the type matching above and let Spark infer the datatypes contained in the dictionaries, but I don’t encourage this: automatically inferred data types can lead to hard-to-debug side effects when you process the data downstream. It’s better to be explicit right from the start, so you can handle the data with confidence, knowing that each field has exactly the type you specified.