
Getting Spark Data from AWS S3 using Boto and PySpark

We’ve had quite a bit of trouble getting Spark to run efficiently when the data to be processed comes from an AWS S3 bucket. Aside from pulling all the data to the Spark driver prior to the first map step (which defeats the purpose of map-reduce!), we experienced terrible performance. In one scenario, Spark spun up 2360 tasks to read the records from a single 1.1k log file. In another, the Spark logs showed that reading every line of every file took a series of repetitive operations: validate the file, open it, seek to the next line, read the line, close the file, repeat. Processing 450 small log files took 42 minutes. Arrgh.

We also ran into memory problems. When processing the full set of logs, we would see out-of-memory heap errors or complaints about exceeding Spark’s data frame size. Very frustrating.

We got around these problems with a three-step approach:

1. Go directly to S3 from the driver to get a list of the S3 keys for the files you care about.

2. Parallelize the list of keys.

3. Code the first map step to pull the data from the files.

This procedure minimizes the amount of data that gets pulled into the driver from S3: just the keys, not the file contents. Then, when the map step executes in parallel on multiple Spark workers, each worker pulls the S3 file data only for the keys it holds.
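As a rough sketch of the driver-side half of that procedure in PySpark (the bucket name and prefix are placeholders, and the listing uses the Boto3 library described below):

import boto3
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder bucket and prefix -- substitute your own log location.
BUCKET = "my-log-bucket"
PREFIX = "logs/"

# Step 1: list only the keys on the driver; no file contents are transferred yet.
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])

# Step 2: parallelize the key list so each worker gets a slice of it.
keys_rdd = sc.parallelize(keys)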

S3 access from Python was done using the Boto3 library:

pip install boto3
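Continuing the sketch above, the first map step might look something like this; keys_rdd and the placeholder bucket name carry over from that sketch, and each worker creates its own Boto3 client so nothing unpicklable has to ship over from the driver:

import json
import boto3

def fetch_json_records(key, bucket="my-log-bucket"):  # bucket is a placeholder
    """Runs on a worker: pull one S3 object and yield its parsed JSON lines."""
    # Create the client inside the function so it is built on the worker, not pickled.
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    for line in body.decode("utf-8").splitlines():
        if line.strip():
            yield json.loads(line)

# Step 3: each worker fetches the data only for the keys in its partition.
records = keys_rdd.flatMap(fetch_json_records)
print(records.count())  # an action like count() triggers the parallel fetch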

Here’s a snippet of the Python code, similar to the Scala code above. It processes log files composed of lines of JSON text: