Before you start developing applications on MapR’s Converged Data Platform, consider how you will get the data onto the
platform, the format it will be stored in, the type of processing or modeling that is required, and how the data will
be accessed.

A MapR Ecosystem Pack (MEP) provides a set of ecosystem components that work together on one or more MapR cluster versions. Only one version of
each ecosystem component is available in each MEP. For example, only one version of Hive and one version of Spark are supported in a MEP.

After you have a basic understanding of Apache Spark and have it installed and running on your MapR cluster, you can use it
to load datasets, apply schemas, and query data from the Spark interactive shell.
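For example, a minimal session in the Spark interactive shell might look like the following sketch. The dataset path, table name, and column names are illustrative only; substitute your own data.

    // Launch the shell, for example: /opt/mapr/spark/spark-<version>/bin/spark-shell
    // Load a JSON dataset; Spark infers the schema automatically.
    val people = sqlContext.read.json("/tmp/people.json")   // illustrative path
    people.printSchema()

    // Register the DataFrame as a temporary table and query it with SQL.
    people.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()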

Using the MapR Database OJAI connector for Spark enables you to build real-time and batch pipelines between your data and MapR Database JSON. Before getting started, it is important that you understand Spark terminology and workflow, system requirements
and support, and OJAI connector and API features.

MapR provides JDBC and ODBC drivers so you can write SQL queries that access the Apache Spark data processing engine.
This section provides instructions on how to download, install, and configure the drivers.

Starting with MapR Ecosystem Pack (MEP) 6.0.0, MapR Object Store with S3-Compatible API (MapR Object Store) is included in MEP repositories. To fully benefit from the MapR Object Store, it is important to understand what the MapR Object Store
is and how it works, how to authenticate with it, and how to perform bucket operations.

MapR supports public APIs for MapR Filesystem, MapR Database, and MapR Event Store For Apache Kafka. These APIs are available for application development purposes.

Spark SQL and DataFrames

The MapR Database Binary Connector for Apache Spark leverages the DataSource API (SPARK-3247) introduced in Spark 1.2.0. The connector bridges the gap between the simple
HBase key-value store and complex relational SQL queries, and enables users to perform complex data
analytical work on top of MapR Database binary tables using Spark. An HBase DataFrame is a standard
Spark DataFrame, and is able to interact with any other data sources, such as Hive, ORC,
Parquet, JSON, and others. The MapR Database Binary Connector for Apache Spark applies critical
techniques such as partition pruning, column pruning, predicate pushdown, and data
locality.

To use the MapR Database Binary Connector for Apache Spark, you need to define a catalog for the
schema mapping between MapR Database binary tables and Spark tables, prepare the data and populate
the MapR Database binary table, and then load the HBase DataFrame. After that, you can run
language-integrated queries and SQL queries against the records in the MapR Database binary table. The following examples
illustrate the basic procedure.

Define Catalog Example

The catalog defines a mapping between MapR Database binary tables and Spark
tables. There are two critical parts of this catalog. One is the row key definition. The
other is the mapping between the table columns in Spark and the column family and column
qualifier in the MapR Database binary table. The following example defines a schema for a MapR Database
binary table named my_table, with row key key and a
number of columns (col1 - col8). Note that the row key also has to be
defined in detail as a column (col0), which has a specific column family
(rowkey).
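
A minimal sketch of such a catalog, written as a JSON string in Scala, is shown below. Only the overall structure (table name, rowkey entry, and per-column cf/col/type mapping) is prescribed by the connector; the column-family names (cf1 - cf8) and the data types chosen here are illustrative assumptions.

    def catalog = s"""{
        |"table":{"namespace":"default", "name":"my_table"},
        |"rowkey":"key",
        |"columns":{
          |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
          |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
          |"col2":{"cf":"cf2", "col":"col2", "type":"double"},
          |"col3":{"cf":"cf3", "col":"col3", "type":"float"},
          |"col4":{"cf":"cf4", "col":"col4", "type":"int"},
          |"col5":{"cf":"cf5", "col":"col5", "type":"bigint"},
          |"col6":{"cf":"cf6", "col":"col6", "type":"smallint"},
          |"col7":{"cf":"cf7", "col":"col7", "type":"string"},
          |"col8":{"cf":"cf8", "col":"col8", "type":"tinyint"}
        |}
      |}""".stripMargin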

Save the DataFrame Example

The data prepared by the user is a local Scala collection that has 256 HBaseRecord objects. The
sc.parallelize(data) function distributes the data to form an RDD.
toDF returns a DataFrame. The write function returns a
DataFrameWriter used to write the DataFrame to external storage systems (in this case, MapR Database).
Given a DataFrame with a specified schema catalog, the save function
creates a MapR Database binary table with five (5) regions and saves the DataFrame to it.
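
A sketch of this step follows. It assumes the catalog defined above, an illustrative HBaseRecord case class matching its columns, the HBaseTableCatalog class from the org.apache.hadoop.hbase.spark.datasources package, and the org.apache.hadoop.hbase.spark data source name; the record contents, package path, and data source string may differ in your connector version.

    import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog
    import sqlContext.implicits._

    case class HBaseRecord(
        col0: String, col1: Boolean, col2: Double, col3: Float,
        col4: Int, col5: Long, col6: Short, col7: String, col8: Byte)

    object HBaseRecord {
      def apply(i: Int): HBaseRecord = {
        val s = s"""row${"%03d".format(i)}"""
        HBaseRecord(s, i % 2 == 0, i.toDouble, i.toFloat, i,
                    i.toLong, i.toShort, s"String$i extra", i.toByte)
      }
    }

    // 256 records in a local Scala collection.
    val data = (0 to 255).map(HBaseRecord(_))

    // Distribute the data as an RDD, convert it to a DataFrame, and save it
    // to a new MapR Database binary table created with 5 regions.
    sc.parallelize(data).toDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
                   HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.hadoop.hbase.spark")
      .save()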

Load the DataFrame Example

In the withCatalog function, sqlContext is a variable of
SQLContext, which is the entry point for working with structured data
(rows and columns) in Spark. read returns a DataFrameReader that can be
used to read data in as a DataFrame. The option function adds input options
for the underlying data source to the DataFrameReader. The format function
specifies the input data source format for the DataFrameReader. The load()
function loads the input as a DataFrame. The DataFrame df returned by the
withCatalog function can be used to access the MapR Database binary table, as
shown in the Language Integrated Query and SQL Query examples.
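
A corresponding sketch, again assuming the org.apache.hadoop.hbase.spark data source name and the HBaseTableCatalog class from the connector's datasources package:

    import org.apache.spark.sql.DataFrame
    import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog

    def withCatalog(cat: String): DataFrame = {
      sqlContext
        .read
        .options(Map(HBaseTableCatalog.tableCatalog -> cat))
        .format("org.apache.hadoop.hbase.spark")
        .load()
    }

    // df can now be used for language-integrated queries and SQL queries.
    val df = withCatalog(catalog)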

Language Integrated Query Example

A DataFrame can perform various operations, such as join, sort,
select, filter, orderBy, and so on. In
the following example, df.filter filters rows using the given SQL
expression, and select selects a set of columns: col0,
col1, and col4.
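
A sketch of such a query against the df loaded above; the row-key values used in the filter are illustrative:

    import sqlContext.implicits._   // enables the $"colName" column syntax

    val s = df.filter(($"col0" <= "row050" && $"col0" > "row040") ||
                      $"col0" === "row005" ||
                      $"col0" === "row020")
              .select("col0", "col1", "col4")
    s.show()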

SQL Query Example

registerTempTable registers the df DataFrame as a temporary
table using the table name table1. The lifetime of this temporary table is
tied to the SQLContext that was used to create df.
The sqlContext.sql function allows the user to execute SQL
queries.
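
For example (the columns selected and the WHERE clause value are illustrative):

    // Register df under the name table1; the temporary table lives only as
    // long as the SQLContext that created df.
    df.registerTempTable("table1")

    val result = sqlContext.sql("SELECT col0, col1, col4 FROM table1 WHERE col0 = 'row050'")
    result.show()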

Query with Different Timestamps

In HBaseSparkConf, you can set four parameters related to timestamps:

TIMESTAMP

MIN_TIMESTAMP

MAX_TIMESTAMP

MAX_VERSIONS

With MIN_TIMESTAMP and MAX_TIMESTAMP, you can query
records with different timestamps or within a time range. In practice, use a concrete value
instead of tsSpecified and oldMs in the following
examples. The first example shows how to load the df DataFrame with different
timestamps. tsSpecified is specified by the user.
HBaseTableCatalog defines the schema mapping between the HBase table and the Spark relation.
writeCatalog defines the catalog for the schema
mapping.
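
A sketch of both cases follows. It assumes the option keys exposed by HBaseSparkConf carry the names listed above (exact constant names, package paths, and the data source string can vary between connector versions), and that tsSpecified, oldMs, and nowMs are placeholders for concrete millisecond values supplied by the user.

    import org.apache.hadoop.hbase.spark.datasources.{HBaseSparkConf, HBaseTableCatalog}

    // Load df for a single point in time.
    val df = sqlContext.read
      .options(Map(
        HBaseTableCatalog.tableCatalog -> writeCatalog,
        HBaseSparkConf.TIMESTAMP -> tsSpecified.toString))
      .format("org.apache.hadoop.hbase.spark")
      .load()

    // Load df for a time range from oldMs (inclusive) to nowMs.
    val dfRange = sqlContext.read
      .options(Map(
        HBaseTableCatalog.tableCatalog -> writeCatalog,
        HBaseSparkConf.MIN_TIMESTAMP -> oldMs.toString,
        HBaseSparkConf.MAX_TIMESTAMP -> nowMs.toString))
      .format("org.apache.hadoop.hbase.spark")
      .load()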