Spark SQL Integration

Spark SQL integration depends on N1QL, which is available in Couchbase Server 4.0 and later.
To use Spark SQL queries, you need to create and persist DataFrames via the Spark SQL DataFrame API.

All examples presented on this page require at least a primary index on the travel-sample data set.
If you haven't done so already, you can create a primary index by executing this N1QL statement: CREATE PRIMARY INDEX ON `travel-sample`.

DataFrame creation

Before you can create a DataFrame with Couchbase, you need to create a SQLContext.
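A minimal setup looks like the following sketch. It assumes a 1.x version of the Couchbase Spark Connector, a local Spark master, and the travel-sample bucket; adjust the configuration for your deployment:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.couchbase.spark.sql._ // enables the couchbase() read/write methods

// Open the travel-sample bucket (empty password assumed here).
val conf = new SparkConf()
  .setAppName("n1qlExample")
  .setMaster("local[*]")
  .set("com.couchbase.bucket.travel-sample", "")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// The simplest possible read: full automatic schema inference
// over the whole bucket.
val all = sqlContext.read.couchbase()
```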

While reading a DataFrame with no explicit schema or filter is the easiest approach, it has a few shortcomings.
The connector will try to perform automatic schema inference based on the full data set, which is very unlikely to produce the right schema (especially if you have a large or diverse data set).

There are two options to solve this shortcoming: you can either provide a manual schema or narrow down the automatic schema inference by providing explicit predicates.
The latter approach has the added benefit that the provided predicate is applied to every subsequent query, which improves performance.
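A manual schema can be supplied through the regular Spark SQL reader API, which skips inference entirely. A sketch, assuming a sqlContext as created earlier; the field names are illustrative, and the exact plumbing may vary slightly between connector versions:

```scala
import org.apache.spark.sql.types._
import com.couchbase.spark.sql._

// With an explicit schema, no documents need to be sampled for inference.
val schema = StructType(
  StructField("META_ID", StringType) ::
  StructField("name", StringType) ::
  StructField("type", StringType) :: Nil
)

val airlines = sqlContext.read.schema(schema).couchbase()
```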

If you want to get automatic schema inference on all airlines, you can specify it like this:
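A sketch of the predicate-based approach, assuming the connector's schemaFilter parameter and a sqlContext as created earlier:

```scala
import org.apache.spark.sql.sources.EqualTo
import com.couchbase.spark.sql._

// Infer the schema only from documents whose "type" field is "airline".
// The same predicate is then applied to every query on this DataFrame.
val airlines = sqlContext.read.couchbase(schemaFilter = EqualTo("type", "airline"))

airlines.printSchema()
```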

You can also provide all kinds of options directly, either to Spark or for advanced functionality in the N1QL integration.
Currently, the following options are allowed:

idField: The name of the document ID field, defaults to "META_ID".

bucket: The name of the bucket to use, which is required if more than one bucket is opened.
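Both options can be set through the standard DataFrameReader API before the Couchbase-specific read. A sketch; how options are threaded through to the connector may differ slightly between versions:

```scala
import com.couchbase.spark.sql._

// "bucket" selects which opened bucket to read from; "idField" renames
// the column that carries the document ID (default: META_ID).
val fromBucket = sqlContext.read
  .option("bucket", "travel-sample")
  .option("idField", "META_ID")
  .couchbase()
```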

DataFrame persistence

It is also possible to persist DataFrames into Couchbase.
The important requirement is that a META_ID field (or a different one, if configured via idField) exists, which is mapped to the unique document ID.
All the other fields in the DataFrame will be converted into JSON and stored as the document content.
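A persistence sketch, assuming an airlines DataFrame as read earlier and a META_ID column present in it; the column selection is illustrative:

```scala
import org.apache.spark.sql.SaveMode
import com.couchbase.spark.sql._

// The META_ID column becomes the document ID; every other column is
// converted to JSON and stored as the document body.
val subset = airlines.select("META_ID", "name", "type")

// Overwrite replaces any existing documents with the same ID.
subset.write.mode(SaveMode.Overwrite).couchbase()
```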