What is a Spark RDD?

Spark introduces the concept of an RDD (Resilient Distributed Dataset): an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.

Here:

Resilient – capable of rebuilding data on failure

Distributed – data is distributed among the various nodes of the cluster

Dataset – a collection of partitioned data with values

Spark Operations

RDDs support two types of operations:

Transformations

Actions

Transformations

Spark RDD transformations are functions that take an RDD as input and produce one or more RDDs as output. They do not change the input RDD (RDDs are immutable, so they cannot be changed), but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey().

An important point to note here is that applying a transformation to an RDD does not perform the operation immediately. Instead, Spark creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD and the transformation function, and it keeps building this graph through RDD references until you apply an action to the last RDD in the chain. That is why transformations in Spark are lazy.
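This deferred-execution behaviour can be observed without a cluster. The sketch below is an analogy, not Spark itself: it uses a plain Scala collection view, which defers work in the same way an RDD transformation chain does, and a side-effect counter to show that nothing runs until the pipeline is forced.

```scala
// In Spark, transformations on an RDD only build a DAG; nothing runs
// until an action is called. A Scala collection view defers the same
// way, so we can observe the laziness locally.
var evaluated = 0

val pipeline = (1 to 10).view
  .map { n => evaluated += 1; n * 2 } // like rdd.map(_ * 2): recorded, not run
  .filter(_ > 10)                     // like rdd.filter(_ > 10): still not run

val before = evaluated                // 0: no element has been computed yet

val result = pipeline.toList          // like an action (collect): forces the pipeline
```

After `toList` (the stand-in for an action), every element has been computed exactly once and `result` is `List(12, 14, 16, 18, 20)`.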

Narrow vs Wide Transformations

There are two types of transformations:

Narrow transformation — in a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD; only a limited subset of partitions is needed to calculate the result. Narrow transformations are the result of map() and filter().

Wide transformation — in a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of groupByKey() and reduceByKey().
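The distinction can be sketched locally by modelling partitions as plain Scala vectors (an analogy, not real Spark partitioning): a map touches each partition independently, while grouping by a key needs elements from every partition, which is what forces a shuffle.

```scala
// Two "partitions" modelled as plain Scala vectors.
val partitions = Seq(Vector(1, 2, 3), Vector(4, 5, 6))

// Narrow (like map): each output partition depends only on its own
// parent partition, so no data moves between partitions.
val mapped = partitions.map(part => part.map(_ * 10))

// Wide (like groupByKey): grouping by a key (here, even vs odd) needs
// elements from all partitions, so they must first be combined — the
// local analogue of a shuffle.
val shuffled = partitions.flatten.groupBy(_ % 2 == 0)
```

Here `mapped` keeps the two-partition structure intact, while `shuffled` had to collapse the partitions before it could group anything.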

Actions

Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed. When an action is triggered, no new RDD is formed, as it is with a transformation. Actions are thus the Spark RDD operations that return non-RDD values; their results are stored in the driver or in an external storage system. An action sets the lazily built RDD pipeline in motion.

An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is the JVM process that coordinates the workers and the execution of tasks.
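The defining property of actions is that they return plain values rather than new RDDs. The hedged sketch below uses Scala collection counterparts of a few common actions to show the shape of what comes back to the driver.

```scala
val data = Seq(1, 2, 3, 4, 5)

// Each of these is the collection analogue of a Spark action: the
// result is an ordinary value at the "driver", not another RDD.
val total    = data.sum      // like rdd.reduce(_ + _): a single number
val count    = data.size     // like rdd.count(): a single count
val gathered = data.toList   // like rdd.collect(): the data itself
```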

Spark RDD Operations

Apache Spark’s core abstraction, and its fundamental data structure, is the Resilient Distributed Dataset (RDD). Spark RDDs are immutable, distributed collections of objects. Each dataset in an RDD is divided into logical partitions, and each partition may be computed on a different node of the cluster. RDDs can also contain user-defined classes.

RDD Operations

In addition, a Spark RDD is a read-only, partitioned collection of records: a fault-tolerant collection of elements that we can operate on in parallel. We can create RDDs in three ways: from data in stable storage, from other RDDs, or by parallelizing an existing collection in the driver program. Through RDDs we can achieve faster and more efficient MapReduce-style operations.

Spark PairRDD Operations

Spark paired RDDs are simply RDDs whose elements are key-value pairs. A key-value pair (KVP) consists of two linked data items: the key is the identifier, and the value is the data corresponding to that key. Spark operations work on RDDs containing any type of object, but key-value pair RDDs gain a few special operations, such as distributed “shuffle” operations and grouping or aggregating elements by a key. In Scala, these operations are automatically available on paired RDDs containing Tuple2 objects; the key-value operations live in the PairRDDFunctions class, which wraps an RDD of tuples.
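The two workhorse pair operations can be simulated on plain Scala tuples. This is a hedged analogy of the semantics, not the Spark implementation: reduceByKey behaves like grouping by key and then reducing each group's values.

```scala
// A paired dataset as ordinary Scala (key, value) tuples.
val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

// Analogue of pairRdd.reduceByKey(_ + _): one summed value per key.
val reduced = pairs
  .groupBy { case (k, _) => k }
  .map { case (k, kvs) => k -> kvs.map(_._2).sum }

// Analogue of pairRdd.groupByKey(): all values collected per key.
val grouped = pairs
  .groupBy { case (k, _) => k }
  .map { case (k, kvs) => k -> kvs.map(_._2) }
```

In real Spark, reduceByKey is usually preferred over groupByKey because it combines values within each partition before the shuffle, so less data crosses the network.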

Pair RDD Operations

For a working example of each operation mentioned above, please look at the Databricks notebook here, which covers each of them in detail.


Divyansh Jain is a Software Consultant with 1 year of experience. He has a deep understanding of Big Data technologies, Hadoop, Spark and Tableau, as well as of web development. He is an amazing team player with self-learning skills and a self-motivated professional. He has also worked as a freelance web developer. He loves to play with and explore real-time problems and Big Data. In his leisure time, he prefers LAN gaming and watching movies.

Knoldus is the world’s largest pure-play Scala and Spark company. We modernize enterprises through cutting-edge digital engineering by leveraging Scala, Functional Java and the Spark ecosystem. Our mission is to provide reactive and streaming fast data solutions that are message-driven, elastic, resilient, and responsive.
