A Closer Look at RDDs

Apache Spark has gotten a lot of attention for its fast processing of large amounts of data. But how does it get up to speed? The biggest reason that Spark is so fast is its use of the Resilient Distributed Dataset, or RDD.

Apache Spark has gotten a lot of attention for its fast processing of large amounts of data. But how does it get up to speed? The biggest reason that Spark is so fast is its use of the Resilient Distributed Dataset, or RDD.

These RDDs are the heart of Spark—the engine that allows it to process so much data so quickly and reliably.

What Are RDDs?

RDDs are a data structure that’s distinctive of Apache Spark. The “resilient” part comes from the fact that Spark ensures a degree of safety when dealing with data. Operations on RDDs are split into two categories: transformations and actions. (This will be explained later in the article.) Operations on data are made out of chained transformations, which makes it easy to recover from machine loss.

Since MapR made its name as a company supporting Hadoop, it’s not surprising that distributed filesystems are a major focus for Spark. RDDs can be distributed across many nodes, being controlled by a “master” node. If a “slave” machine goes down, the data can be reconstructed from other nodes, like a giant RAID system. Hadoop comes in by providing the reliable, distributed HDFS system to store the data.

RDDs also hold as much data as possible in memory. With powerful nodes, most clusters shouldn’t have any problem.

You might think that having all of this data in memory might be as slow as molasses, but Spark borrows a technique from functional programming languages like Haskell. RDD operations are lazily evaluated. That means that the operations aren’t actually executed until they’re needed. This lets you specify some complicated operations without suffering the performance hits.

You could work out an algorithm in the interactive Spark shell, do something else, then ask for some output. If you find that it takes a long time, you can put it into a script to be run in batch mode.

This is in contrast to eager evaluation, where everything’s evaluated immediately.

Transformations vs. Actions

Transformations make changes to data, such as sorting, counting, or filtering items that meet certain criteria. These transformations simply return a new RDD from the old one nondestructively. In case of a failure, RDDs can be rebuilt from other transformations, as operations on data are made of chained transformations. Transformations include filtering, mapping, and reducing, as well as operations derived from set theory, such as unions and intersections.

Actions are operations that produce some kind of output. Actions include counting elements in an RDD, flattening them into arrays, and retrieving the first N items from an RDD. You can also save RDDs in a text file for exporting.

A lot of the resiliency in RDDs comes from lineages. As mentioned earlier, transformations are nondestructive, preserving the earlier RDD. Spark relies on lineages, which are records of all the previous versions of RDDs. Since all transformations are made out of other transformations, if a node fails, the master can reconstruct the data out of the lineage on other nodes.

Of course, that assumes that nothing bad has happened to driver application and whether it is running in client or cluster mode. In any case, the lineage feature is very powerful and allows a great degree of safety, but you should focus on good practices such as keeping up-to-date backups and having an off-site backup plan for data that really matters. If you can afford it, you might want to build clusters in another area for continuity. These days, it’s also easy to use cloud providers to have clusters in different places without having to wire up servers yourself.

Conclusion

The heart of a good program is, as any really talented programmer will tell you, the data structures, rather than the algorithm. RDDs provide a good data structure for Big Data because of their reliability combined with their simplicity. With Spark and RDDs, the only limit will be your imagination, especially when backed by the support of MapR’s Spark distribution.