Securing Spark Shuffle

When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.

When shuffle happens, the shuffle writer will first compress the plaintext if compression is enabled. Spark will use the randomly generated Initial Vector (IV) and keys obtained from the credentials to encrypt the plaintext by using CryptoOutputStream from Crypto.

CryptoOutputStream will encrypt the shuffle data and write it to the disk as it arrives. The first 16 bytes of the encrypted output file are preserved to store the initial vector.

For the read path, the first 16 bytes are used to initialize the IV, which is provided to CryptoInputStreamalong with the user’s credentials. The decrypted data is then provided to Spark’s shuffle mechanism for further processing.

Once you have things optimized, the performance hit is surprisingly small.

Related Posts

Jakub Szymaszek announces secure enclaves support with Always Encrypted in SQL Server 2019: The only operation SQL Server 2016 and 2017 support on encrypted database columns is equality comparison, providing you use deterministic encryption. For anything else, your apps need to download the data to perform the computations outside of the database. Similarly, if you […]

Anmol Sarna summarizes Apache Spark 2.4 and pushes his meme game at the same time: The next major enhancement was the addition of a lot of new built-in functions, including higher-order functions, to deal with complex data types easier.Spark 2.4 introduced 24 new built-in functions, such as array_union, array_max/min, etc., and 5 higher-order functions, such as transform, filter, etc.The entire […]