The MapR Data Science Refinery includes a preconfigured Apache Zeppelin notebook, packaged as a Docker container. Apache Zeppelin is an open source web-based data science notebook. You can use it with MapR components to conduct data discovery, ETL, machine learning, and data visualization.

Out of the box, the interpreters in Apache Zeppelin on MapR are preconfigured to run against different backend engines. You may need to perform manual steps to configure the Livy, Spark, and JDBC interpreters. No additional steps are needed to configure and run the Pig and Shell interpreters.

Before you start developing applications on MapR’s Converged Data Platform, consider how you will get the data onto the platform, the format it will be stored in, the type of processing or modeling that is required, and how the data will be accessed.

To run the Apache Zeppelin container, you must access the Zeppelin Docker image from MapR’s public repository, run the Docker image, and access the deployed container from your web browser. From your browser, you can create Zeppelin notebooks.
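The steps above can be sketched as shell commands. This is a minimal sketch, not a definitive procedure: the image name, tag, published port, and environment variables below are assumptions based on typical MapR Data Science Refinery deployments, so verify the exact values against MapR's repository and documentation.

```
# Pull the Zeppelin image from MapR's public Docker repository.
# Image name and tag are illustrative -- check for the current ones.
docker pull maprtech/data-science-refinery:latest

# Run the container, publishing Zeppelin's web UI port to the host.
# The environment variables identify your cluster and the container
# user; the values shown are placeholders -- adjust for your environment.
docker run -it \
  -p 9995:9995 \
  -e MAPR_CLUSTER=my.cluster.com \
  -e MAPR_CLDB_HOSTS=cldb-node1 \
  -e MAPR_CONTAINER_USER=mapruser \
  maprtech/data-science-refinery:latest
```

Once the container is up, point your browser at the published port on the Docker host to reach the Zeppelin UI and create notebooks.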

The Livy interpreter provides support for Spark Python, SparkR, Basic Spark, and Spark SQL jobs. To use the Livy interpreter for these variations of Spark, you must take certain actions, including configuring Zeppelin and installing software on your MapR cluster.

The Spark interpreter is available starting in the 1.1 release of the MapR Data Science Refinery. It provides support for Spark Python, SparkR, Basic Spark, and Spark SQL jobs. To use the Spark interpreter for these variations of Spark, you must take certain actions, including configuring Zeppelin and installing software on your MapR cluster.
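Once configured, both interpreter groups are invoked with paragraph prefixes in a notebook. The prefixes below are the standard Zeppelin names for the Livy and Spark interpreter groups; the data path and table name are hypothetical placeholders.

```
%livy.pyspark
df = spark.read.json("/tmp/people.json")
df.show()

%spark.sql
SELECT name, age FROM people WHERE age > 21
```

The same Spark Python, SparkR, Scala, and SQL variations are reachable through either group (for example, %livy.sparkr versus %spark.r); which one you use depends on whether jobs should run through a Livy server or a Spark driver in the container.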

You can install custom Python packages either by manually installing them on each node in your MapR cluster or by using Conda. Conda lets you perform the installation from your Zeppelin host node without directly accessing your MapR cluster. The topics in this section provide instructions for each method, covering both Python 2 and Python 3.
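As a sketch of the Conda route, you can build an environment on the Zeppelin host, archive it, and let your Spark jobs distribute it to cluster nodes. The environment name, Python version, package list, and paths below are assumptions for illustration; follow the topics in this section for the supported procedure.

```
# Create an isolated Python 3 environment with the packages you need
# (names and versions are illustrative)
conda create -y -n zeppelin-py3 python=3.6 numpy pandas

# Archive the environment so it can be shipped to cluster nodes
cd ~/.conda/envs
zip -r zeppelin-py3.zip zeppelin-py3
```

The archive can then be distributed with a Spark job (for example, via an archives setting, with the Python interpreter path pointing into the unpacked archive on each node).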

Apache Zeppelin includes the Helium framework. Helium allows you to register visualization packages with Zeppelin. Using visualization packages, you can view your data through area charts, bar charts, scatter charts, and other displays. To use a visualization package, you must download it, register it with Zeppelin, and enable it through Helium. Like Zeppelin interpreters, Helium is automatically installed in your Zeppelin container.
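Registration is driven by a small JSON descriptor for each visualization package. As an illustration only (the package name, artifact, and field values here are hypothetical; Zeppelin's Helium documentation defines the exact schema), a local descriptor looks roughly like this:

```
{
  "type": "VISUALIZATION",
  "name": "heatmap-chart",
  "description": "Heatmap visualization for Zeppelin",
  "artifact": "heatmap-chart@0.0.3",
  "license": "Apache-2.0"
}
```

Place the descriptor where your Zeppelin installation looks for Helium packages, then enable the package from the Helium menu in the Zeppelin UI.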

This section contains examples of how to use Apache Zeppelin interpreters to access the different backend engines. This includes running Apache Pig scripts, Apache Drill queries, Apache Hive queries, and Apache Spark jobs, as well as accessing MapR-DB and MapR-ES.
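For instance, a single notebook can mix engines by paragraph prefix. The interpreter aliases below are assumptions about how the interpreters are bound in your installation (the JDBC prefixes in particular may differ), and the Pig script is a toy example; Drill's cp.`employee.json` is the sample table bundled with Drill.

```
%pig
a = LOAD '/tmp/input.txt' AS (line:chararray);
DUMP a;

%jdbc(drill)
SELECT * FROM cp.`employee.json` LIMIT 5
```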

MapR supports public APIs for MapR-FS, MapR-DB, and MapR-ES. These APIs are available for application development purposes.

Configuring the JDBC Interpreter for Apache Drill and Apache Hive

Apache Zeppelin on MapR includes custom JDBC interpreters for Apache Drill and Apache Hive. Fields in each interpreter are prepopulated, but you need to customize them for your environment.

In particular, you must modify the JDBC URL, as described in the following sections.

Drill JDBC

You must specify the Apache Drill JDBC URL in the default.url property:

The following is an example of a URL when MapR-SASL is enabled:

jdbc:drill:drillbit=drillbitnode:31010;auth=maprsasl

If MapR-SASL is not enabled, the URL is the following:

jdbc:drill:drillbit=node1:31010

For non-secure clusters, default.user is prepopulated with the user running the container (MAPR_CONTAINER_USER). You can modify this property and default.password, as needed. Zeppelin submits your Drill queries using this user name and password (if specified).

For secure clusters, Zeppelin always submits Drill queries using the user name and password from your MapR ticket. You do not need to modify the default.user and default.password properties.

Hive JDBC

You must specify the Apache Hive JDBC URL in the default.url property:
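As an illustration (the host, port, and database are placeholders, and the auth parameter is an assumption for a MapR-SASL-secured cluster), a HiveServer2 URL typically has this shape:

```
jdbc:hive2://hivenode:10000/default;auth=maprsasl
```

On a non-secure cluster, the auth parameter is omitted, for example jdbc:hive2://hivenode:10000/default.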