Working with RDDs

Spark operates on resilient distributed datasets (RDDs).
When you need to extract data out of Couchbase, the Couchbase Spark connector creates RDDs for you.
You can create and persist RDDs by using key-value operations, views, or N1QL.

Creating RDDs

To create RDDs through the implicit methods that the Couchbase Spark connector provides, add the following import to your code:

import com.couchbase.spark._

This import adds Couchbase-specific methods to the SparkContext.
The name of each method starts with couchbase, for example couchbaseGet.

If you want to create an RDD for specific documents stored in a bucket, you can pass their IDs directly to couchbaseGet.
The connector fetches the documents from the server and turns them into an RDD.
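A minimal sketch of fetching documents by ID. It assumes the connector generation that exposes `couchbaseGet` together with the Java SDK document classes, a `SparkContext` already configured for the travel-sample bucket, and that the two IDs shown exist in that bucket. The object and method names are illustrative only.

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.spark._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object KeyValueExample {
  // `sc` is assumed to be configured for the travel-sample bucket.
  // The document IDs are assumptions about the sample data set.
  def fetchAirlines(sc: SparkContext): RDD[JsonDocument] =
    // The type parameter tells the connector which document type to decode into.
    sc.couchbaseGet[JsonDocument](Seq("airline_10123", "airline_10748"))
}
```

Calling `KeyValueExample.fetchAirlines(sc).collect().foreach(println)` on the driver would print whichever of the documents were found.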

It is critical that you specify the target document type.
Otherwise, the client does not know how to convert the stored content.
The main reason is that Couchbase has first-class JSON support, but is also able to store any data.
You can even store serialized objects or protobuf-encoded documents, but then you lose the secondary indexing capabilities.

If you are unsure what to pick, stick with the JsonDocument.
If you need raw access to the JSON data, you can also use the RawJsonDocument.
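As a sketch of the difference, the same lookup with `RawJsonDocument` yields the JSON as an undecoded string. This assumes the same connector and SDK generation as above; the document ID and the object and method names are hypothetical.

```scala
import com.couchbase.client.java.document.RawJsonDocument
import com.couchbase.spark._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object RawJsonExample {
  // `sc` is assumed to be configured for the travel-sample bucket.
  def fetchRawJson(sc: SparkContext): RDD[String] =
    sc.couchbaseGet[RawJsonDocument](Seq("airline_10123")) // assumed ID
      .map(_.content()) // content() is the raw JSON string, not a parsed object
}
```

This can be useful when the JSON is handed to a different parser, since the connector skips the conversion into a `JsonObject`.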

You can perform a (spatial) view query to extract rows and turn them into an RDD.
Given the following view against the travel-sample bucket:
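As a sketch, assume a design document named `airlines` that contains a view `by_name`, whose map function emits the name of every airline document as the key. Both names are assumptions made for illustration, as are the object and method names. The view can then be queried and the resulting row IDs used to load the full documents:

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.view.ViewQuery
import com.couchbase.spark._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ViewExample {
  // `sc` is assumed to be configured for the travel-sample bucket.
  // "airlines" / "by_name" are a hypothetical design document and view.
  def firstAirlinesByName(sc: SparkContext): RDD[JsonDocument] =
    sc.couchbaseView(ViewQuery.from("airlines", "by_name").limit(10))
      .map(_.id)                    // keep only the document ID of each view row
      .couchbaseGet[JsonDocument]() // load the full document for every ID
}
```

Dropping the last two calls leaves an RDD of view rows, which is enough when only the emitted keys and values are needed.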

Here you can see that couchbaseGet is not just available on the context, but also on every RDD[String].
Neat, right?
The exact same approach is also available for spatial views (just use the couchbaseSpatialView() method instead).

Finally, if you are using Couchbase Server 4.0 or later, you can use N1QL to perform efficient queries against your JSON data in a Couchbase bucket.
The following query is very similar to the one performed before through Views, just to show how the different approaches work.

You need at least a primary index on the travel-sample bucket for the following examples to work.
If you haven’t created one already, run a CREATE PRIMARY INDEX ON `travel-sample` query.
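A minimal sketch of such a query, assuming the connector's `couchbaseQuery` method and the Java SDK's `N1qlQuery` class. The statement, the object name, and the method name are illustrative.

```scala
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object N1qlExample {
  // `sc` is assumed to be configured for the travel-sample bucket,
  // and a primary index is assumed to exist on it.
  def firstAirlineNames(sc: SparkContext): RDD[String] = {
    val statement =
      "SELECT name FROM `travel-sample` WHERE type = 'airline' ORDER BY name ASC LIMIT 10"
    sc.couchbaseQuery(N1qlQuery.simple(statement))
      .map(_.value.getString("name")) // each row wraps a JsonObject with the selected fields
  }
}
```

Running `N1qlExample.firstAirlineNames(sc).collect().foreach(println)` would print the selected names on the driver.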

While this gives you the most flexibility with querying, we recommend using the higher-level Spark SQL components through the DataFrame API.
See the Spark SQL section for more information.

Persisting RDDs

Creating an RDD is only half the story.
After you’ve done your aggregation, filtering, and machine learning, you normally want to persist the results somewhere.
Couchbase provides the saveToCouchbase() method on every RDD[Document].

The following example extracts all airline names through N1QL, aggregates them by country, and stores each list as a separate document.
Since RDDs are source agnostic, you can use the same approach to, for example, load data out of HDFS and then store the results back in Couchbase.
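The aggregation described above can be sketched as follows. It assumes the same connector and SDK generation as the earlier examples. The `airlines::` ID prefix, the `airlines` field name, and the object and method names are hypothetical choices.

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.{JsonArray, JsonObject}
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._
import org.apache.spark.SparkContext

object PersistExample {
  // `sc` is assumed to be configured for the travel-sample bucket.
  def saveAirlinesByCountry(sc: SparkContext): Unit = {
    val statement =
      "SELECT name, country FROM `travel-sample` WHERE type = 'airline'"
    sc.couchbaseQuery(N1qlQuery.simple(statement))
      // Turn every row into a (country, airline name) pair.
      .map(row => (row.value.getString("country"), row.value.getString("name")))
      .groupByKey()
      // Build one document per country that lists its airline names.
      .map { case (country, names) =>
        val list = JsonArray.empty()
        names.foreach(name => list.add(name))
        JsonDocument.create("airlines::" + country, JsonObject.create().put("airlines", list))
      }
      .saveToCouchbase()
  }
}
```

Only the first step is Couchbase-specific. Any RDD that can be mapped to documents, such as one read from HDFS, could be stored through the same final call.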