Using the MapR-DB OJAI connector for Spark enables you to build real-time and batch pipelines between your data and MapR-DB JSON. Before getting started, it is important that you understand Spark terminology and workflow, system requirements and support, and OJAI connector and API features.

The MapR-DB OJAI Connector for Apache Spark supports loading data as an Apache Spark RDD. Starting in the MEP 4.0 release, the connector adds support for Apache Spark DataFrames and Datasets, which generally perform better than RDDs because Spark can optimize their query plans. Whether you load your MapR-DB data as a DataFrame or a Dataset depends on the APIs you prefer to use. It is also possible to convert an RDD to a DataFrame.
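As a sketch of the two loading paths (assuming the connector's `loadFromMapRDB` entry points and a hypothetical `/tables/users` table; running this requires a MapR cluster with the connector on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import com.mapr.db.spark._          // RDD-level API
import com.mapr.db.spark.sql._      // DataFrame/Dataset API (MEP 4.0 and later)

object LoadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-example").getOrCreate()

    // Load a MapR-DB JSON table as an RDD of OJAI documents.
    val rdd = spark.sparkContext.loadFromMapRDB("/tables/users")

    // Load the same table as a DataFrame; the schema is inferred
    // from the JSON documents.
    val df = spark.loadFromMapRDB("/tables/users")

    // An RDD of OJAI documents can also be converted to a DataFrame.
    val converted = rdd.toDF()
  }
}
```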

Before you start developing applications on MapR’s Converged Data Platform, consider how you will get the data onto the platform, the format it will be stored in, the type of processing or modeling that is required, and how the data will be accessed.

A MapR Ecosystem Pack (MEP) provides a set of ecosystem components that work together on one or more MapR cluster versions. Only one version of each ecosystem component is available in each MEP. For example, only one version of Hive and one version of Spark are supported in a MEP.

After you have a basic understanding of Apache Spark and have it installed and running on your MapR cluster, you can use it to load datasets, apply schemas, and query data from the Spark interactive shell.

In any distributed computing system, partitioning data is crucial to achieve the best performance. Apache Spark provides a mechanism to register a custom partitioner for partitioning the pipeline. The MapR-DB OJAI Connector for Apache Spark includes a custom partitioner you can use to optimally partition data in an RDD.
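A minimal sketch of using the custom partitioner, assuming the connector's `MapRDBSpark.newPartitioner` helper, a hypothetical `/tables/users` table, and an OJAI document field accessor of the form ``doc.`_id`[String]``:

```scala
import org.apache.spark.sql.SparkSession
import com.mapr.db.spark._
import com.mapr.db.spark.MapRDBSpark

object PartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("partition-example").getOrCreate().sparkContext

    // Build a pair RDD keyed by the document _id.
    val pairs = sc.loadFromMapRDB("/tables/users")
                  .keyBy(doc => doc.`_id`[String])

    // Create a partitioner that mirrors the splits of the target table,
    // so each Spark partition corresponds to one table region.
    val partitioner = MapRDBSpark.newPartitioner[String]("/tables/users")

    // Repartition and sort within partitions before further processing
    // or saving, to keep each partition's keys local to one region.
    val partitioned = pairs.repartitionAndSortWithinPartitions(partitioner)
  }
}
```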

Projection and filter pushdown improve query performance. When you apply the select and filter methods on DataFrames and Datasets, the MapR-DB OJAI Connector for Apache Spark pushes these elements to MapR-DB where possible.
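For example, in the following sketch (assuming a hypothetical `/tables/users` table with `first_name`, `last_name`, and `age` fields), the projection of two columns and the `age` filter can be pushed down so that full documents never leave MapR-DB:

```scala
import org.apache.spark.sql.SparkSession
import com.mapr.db.spark.sql._

object PushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()
    import spark.implicits._

    val df = spark.loadFromMapRDB("/tables/users")

    // Filter first, then project: both operations are candidates for
    // pushdown to the MapR-DB data source.
    val result = df.filter($"age" > 30)
                   .select("first_name", "last_name")

    // explain(true) shows in the physical plan whether the projection
    // and filter reached the data source.
    result.explain(true)
  }
}
```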

The MapR-DB OJAI Connector for Apache Spark provides an API to save an Apache Spark RDD to a MapR-DB JSON table. Starting in the MEP 4.0 release, the connector introduces support for saving Apache Spark DataFrames and DStreams to MapR-DB JSON tables.
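A hedged sketch of the save API, assuming the connector's `saveToMapRDB` method with a `createTable` flag and hypothetical table paths; documents are keyed by their `_id` field:

```scala
import org.apache.spark.sql.SparkSession
import com.mapr.db.spark._
import com.mapr.db.spark.sql._

object SaveExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("save-example").getOrCreate()

    // Save a DataFrame to a MapR-DB JSON table, creating the table
    // if it does not already exist.
    val df = spark.loadFromMapRDB("/tables/users")
    df.saveToMapRDB("/tables/users_copy", createTable = true)

    // An RDD of OJAI documents can be saved the same way.
    val rdd = spark.sparkContext.loadFromMapRDB("/tables/users")
    rdd.saveToMapRDB("/tables/users_copy2", createTable = true)
  }
}
```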

In the context of the MapR-DB OJAI Connector for Apache Spark, serialization refers to the methods that read and write objects into bytes. This section describes how to configure your application to use a more efficient serializer.
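For instance, Spark's default Java serialization can be replaced with the faster, more compact Kryo serializer through configuration (a config fragment; the property names are standard Spark settings):

```scala
import org.apache.spark.SparkConf

// Switch from the default Java serializer to Kryo, which produces
// smaller, faster byte streams for shuffled and cached data.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```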

MapR provides JDBC and ODBC drivers so you can write SQL queries that access the Apache Spark data processing engine. This section provides instructions on how to download the drivers, and install and configure them.
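Once a driver is installed, querying through the Spark Thrift Server looks like any other JDBC access. A sketch, assuming a Hive-compatible JDBC driver on the classpath and a hypothetical host, port (10000 is the Thrift Server default), credentials, and `users` table:

```scala
import java.sql.DriverManager

object JdbcExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection string; adjust host, port, and database.
    val url = "jdbc:hive2://sparkthrift.example.com:10000/default"
    val conn = DriverManager.getConnection(url, "user", "password")
    try {
      val stmt = conn.createStatement()
      val rs = stmt.executeQuery(
        "SELECT first_name, last_name FROM users LIMIT 10")
      while (rs.next()) {
        println(s"${rs.getString(1)} ${rs.getString(2)}")
      }
      rs.close()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```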