Now, under src/main/scala/, create a Quickstart.scala class with the following skeleton:

object Quickstart {
  def main(args: Array[String]): Unit = {
  }
}

If you are not familiar with Scala, this is more or less the equivalent of Java's public static void main(String[] args) method. Because empty methods make the compiler sad, fill it with proper code now.

When it comes to Spark, you always need to set up a configuration and initialize the SparkContext object. The following snippet does that and also instructs the Couchbase Spark connector to open a bucket in the background.

By default, the Spark Connector connects to the default bucket. However, this example uses actual data from the travel-sample bucket that ships with Couchbase Server, so the code connects to that bucket. The Couchbase Spark connector also supports opening more buckets in parallel.
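A minimal setup might look like the following sketch. The application name and the local master URL are placeholders you'd adapt to your environment; the bucket is opened by setting a connector property named after the bucket (an empty value meaning no bucket password, as was common in that server generation):

```scala
import com.couchbase.spark._
import org.apache.spark.{SparkConf, SparkContext}

// Configure Spark and tell the Couchbase connector which bucket to open.
val conf = new SparkConf()
  .setAppName("quickstart")          // placeholder application name
  .setMaster("local[*]")             // run locally on all cores
  .set("com.couchbase.bucket.travel-sample", "") // open travel-sample, no password

val sc = new SparkContext(conf)
```

With the SparkContext in hand, everything else in this tutorial builds on it.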

Creating and saving RDDs

Now that the SparkContext is created, you can perform operations against Couchbase. A common scenario is to create resilient distributed datasets (RDDs) out of documents stored in Couchbase. The easiest way is to pass in document IDs as strings.

Before getting further into the actual code, make sure the following import is in place; otherwise the implicit methods won't be available.

import com.couchbase.spark._

Use the couchbaseGet method on the SparkContext to fetch documents from Couchbase and create an RDD.

Make sure to specify which document type you want (in this case, a JsonDocument). If you forget to specify the document type, an exception is thrown because the connector does not know what format to use for the results. Then call collect() to aggregate the results and print them out to the command line.
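Putting those pieces together, a sketch of such a fetch might look like this. The document IDs are illustrative picks from the travel-sample bucket, and sc is the SparkContext created earlier:

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.spark._

// Create an RDD from two documents by ID, typed as JsonDocument,
// then pull the results back to the driver and print them.
sc
  .couchbaseGet[JsonDocument](Seq("airline_10123", "airline_10748"))
  .collect()
  .foreach(println)
```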

Spark outputs lots of information, so you'll find the printed documents somewhere in the logs.

But since just loading data is only half the fun, the connector also provides a convenient way to save documents. The following code loads documents as before, but then modifies their contents and IDs before saving them back. You can imagine taking any kind of data source, mapping it to documents, and storing them back in Couchbase.

We use the saveToCouchbase() method available on the RDD to store a modified version of the original JsonDocument. Go find your modified document in the Couchbase Server UI! Look for "my_airline_10123", which will have just the name of the airline as its content.
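A sketch of that transform-and-save step, assuming the same SparkContext as before (the "name" field is part of the airline documents in travel-sample):

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.spark._

// Load a document, derive a new one with a prefixed ID and a reduced
// body containing only the airline name, and write it back.
sc
  .couchbaseGet[JsonDocument](Seq("airline_10123"))
  .map(old => JsonDocument.create(
    "my_" + old.id(),
    JsonObject.create().put("name", old.content().getString("name"))
  ))
  .saveToCouchbase()
```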

Congratulations! You've successfully performed your first ETL job (extract-transform-load) using Couchbase and Spark. Next up is a whirlwind tour of N1QL and Spark DataFrames.

Working with DataFrames

DataFrames were introduced in Spark 1.3 and have matured even further in Spark 1.4. The nature of the queries fits very well with what Couchbase N1QL provides.

Note: To try this, you need Couchbase Server version 4.0 or later.

Note: You need at least a primary index created on the travel-sample bucket for the following examples to work. If you haven't done so already, run a
CREATE PRIMARY INDEX ON `travel-sample` query.

In addition to creating a SparkContext, you'll need an SQLContext:

import org.apache.spark.sql.SQLContext

val sql = new SQLContext(sc)

Also, don't forget the Couchbase imports again for all the automatic method goodness:

import com.couchbase.spark.sql._

Because a DataFrame is like an RDD but with a schema, and Couchbase is a schemaless database at its heart, you need a way to either define or infer a schema. The connector has built-in schema inference, but if you have a large or diverse data set, you need to give it some clues on filtering.

Suppose you want a DataFrame for all airlines and you know that the JSON content has a type field with the value airline. You can pass this information to the connector for automatic schema inference:
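A sketch of that inference step, assuming the sql context created above and using a Spark SQL EqualTo filter to restrict inference to documents whose type field is "airline":

```scala
import com.couchbase.spark.sql._
import org.apache.spark.sql.sources.EqualTo

// Infer a schema only from documents where type == "airline"
// and expose them as a DataFrame.
val airlines = sql.read.couchbase(schemaFilter = EqualTo("type", "airline"))

// Inspect what the connector inferred.
airlines.printSchema()
```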