Apache Zeppelin on HDP 2.4

In November 2015, we introduced Apache Zeppelin as a technical preview on HDP 2.3. Since then, we have made significant progress on integrating Zeppelin into HDP while working in the Apache community to add new features to Zeppelin.

These features are now available in this Apache Zeppelin technical preview – the second Zeppelin technical preview. This technical preview works with HDP 2.4 and comes with the following major features:

In addition, this tech preview includes improvements made in the community such as auto-save, the ability to quickly add new paragraphs, and stability related fixes.

Overview

This tech preview of Apache Zeppelin provides:

Instructions for setting up Zeppelin on HDP 2.4 with Spark 1.6

Ambari-managed Install

Manual Install of Zeppelin

Configuration for running Zeppelin against Spark on YARN and Hive

Configuration for Zeppelin to authenticate users against LDAP

Sample Notebooks to explore

Note: While both Ambari-managed and manual installation instructions are provided, you only need to follow one of the two sets of instructions to set up Zeppelin in your cluster.

Prerequisites

This technical preview requires the following software:

HDP 2.4

Spark 1.6 or 1.5

HDP Cluster Requirement

This technical preview can be installed on any HDP 2.4 cluster, whether it is a multi-node cluster or a single-node HDP Sandbox. The following instructions assume that Spark (version 1.6) is already installed on the HDP cluster.

Note the following cluster requirements:

The Zeppelin server should be installed on a cluster node that has the Spark client installed on it.

Ensure the node running Ambari server has the git package installed.

Ensure that the node running the Zeppelin server has the wget package installed.

Installing Zeppelin on an Ambari-Managed Cluster

To install Zeppelin using Ambari, complete the following steps.

Download the Zeppelin Ambari Stack Definition. On the node running Ambari server, run the following:
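A sketch of this step, assuming the stack definition lives in the hortonworks-gallery GitHub repository and the standard Ambari stacks path (both assumptions; verify against the technical preview download page):

```shell
# Clone the Zeppelin service definition into Ambari's HDP 2.4 stack directory
# (repository URL and target path are assumptions).
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git \
    /var/lib/ambari-server/resources/stacks/HDP/2.4/services/ZEPPELIN

# Restart Ambari server so it picks up the new service definition.
sudo service ambari-server restart
```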

After Ambari restarts and service indicators turn green, add the Zeppelin Service:
At the bottom left of the Ambari dashboard, choose Actions -> Add Service:
On the Add Service screen, select the Zeppelin service.
Step through the rest of the installation process, accepting all default values.
On the Review screen, make a note of the node selected to run the Zeppelin service; call this ZEPPELIN_HOST.
Click Deploy to complete the installation process.


(Optional) Installing Zeppelin Manually

The Zeppelin Technical Preview is available as an HDP package compiled against Spark 1.6.

To install the Zeppelin Technical Preview manually (instead of using Ambari), complete the following steps as user root.

In the zeppelin-env.sh file, export the following three values. Note: you will use PORT to access the Zeppelin Web UI. <HDP-version> corresponds to the version of HDP where you are installing Zeppelin; for example, 2.4.0.0-169.
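A sketch of the three exports (the variable names follow a typical Zeppelin zeppelin-env.sh and are assumptions; substitute your own port and HDP version):

```shell
# PORT used to reach the Zeppelin Web UI.
export ZEPPELIN_PORT=9995
# Point Zeppelin at the cluster's Hadoop configuration (assumed path).
export HADOOP_CONF_DIR=/etc/hadoop/conf
# <HDP-version> for this example install.
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.4.0.0-169"
```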

In the hive-site.xml file in the zeppelin/conf directory, remove the trailing “s” from the values of hive.metastore.client.connect.retry.delay and hive.metastore.client.socket.timeout. (This avoids a number format exception when Zeppelin parses the values.)
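For example, a value such as 5s must become 5. A quick way to strip the suffix with sed, shown here against a throwaway snippet rather than the real file (in your install, point it at the hive-site.xml under zeppelin/conf; the demo values are hypothetical):

```shell
# Demo input standing in for the two timeout properties in hive-site.xml.
printf '<value>5s</value>\n<value>1800s</value>\n' > /tmp/hive-site-snippet.xml

# Strip the trailing "s" from numeric property values; .bak keeps a backup.
sed -i.bak 's/>\([0-9][0-9]*\)s</>\1</' /tmp/hive-site-snippet.xml

cat /tmp/hive-site-snippet.xml
# -> <value>5</value>
#    <value>1800</value>
```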

Create a root user in HDFS:

su hdfs
hdfs dfs -mkdir /user/root
hdfs dfs -chown root /user/root

To launch Zeppelin, run the following commands:

cd /usr/hdp/current/zeppelin-server/lib
bin/zeppelin-daemon.sh start

The Zeppelin server will start, and it will launch the Notebook Web UI.

To access the Zeppelin UI, enter the following address into your browser, where ZEPPELIN_HOST is the node where Zeppelin is installed: http://ZEPPELIN_HOST:9995

Note: If you specified a port other than 9995 in zeppelin-env.sh, use the port that you specified.

Configuring Zeppelin Spark and Hive Interpreters

Before you run a notebook to access Spark and Hive, you need to create and configure interpreters for the two components.

To create the Spark interpreter, go to the Zeppelin Web UI. Switch to the “Interpreter” tab and create a new interpreter:

Click on the +Create button to the right.

Name the interpreter spark-yarn-client.

Select spark as the interpreter type.

The next section of this page contains a form-based list of spark interpreter settings for editing. The remainder of the page contains lists of properties for all supported interpreters.

In the first list of properties, specify the following values (if they are not already set). To add a property, enter the name and value into the form at the end of the list, and click +.

When finished, click Save. Note: Make sure that you save all property settings. Without spark.driver.extraJavaOptions and spark.yarn.am.extraJavaOptions, the Spark job will fail with a message related to bad substitution.
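As a sketch of the settings this note refers to, the two extraJavaOptions properties typically carry the HDP version string, and the interpreter runs in yarn-client mode (the master value and the 2.4.0.0-169 version string are assumptions; match them to your cluster):

```
master                          yarn-client
spark.driver.extraJavaOptions   -Dhdp.version=2.4.0.0-169
spark.yarn.am.extraJavaOptions  -Dhdp.version=2.4.0.0-169
```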

To configure the Hive interpreter:

From the “Interpreter” tab, find the hive interpreter.

Check that the following property references your Hive server node. If not, edit the property value.

hive.hiveserver2.url jdbc:hive2://<hive_server_host>:10000

Note: the default interpreter setting uses the default Hive Server port of 10000. If you use a different Hive Server port, change this to match the setting in your environment.

If you changed the property setting, click Save to save the new setting and restart the interpreter.

Creating a Notebook

To create a notebook:

Under the “Notebook” tab, choose +Create new note.

You will see the following window. Type a name for the new note (or accept the default):

You will see the note that you just created, with one blank cell in the note. Click on the settings icon at the upper right. (Hovering over the icon will display the words “interpreter-binding.”)

Drag the spark-yarn-client interpreter to the top of the list, and save it:

Type sc.version into a paragraph in the note, and click the “Play” button (blue triangle):
SparkContext, SQLContext, and ZeppelinContext are created automatically, and are exposed as the variables ‘sc’, ‘sqlContext’, and ‘z’, respectively, in the Scala and Python environments. Note: The first run will take some time, because it launches a new Spark job to run against YARN. Subsequent paragraphs will run much faster.

When finished, the status indicator on the right will say “FINISHED”. The output should list the version of Spark in your cluster:

Importing External Libraries

As you explore Zeppelin, you will probably want to use one or more external libraries. For example, to run Magellan you need to include the Magellan library and its dependencies in your environment.

There are three ways to include an external dependency in a Zeppelin notebook:

Using the %dep Interpreter

(Note: this will only work for libraries that are published to Maven.)
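A minimal sketch of a %dep paragraph (the artifact coordinates are placeholders); it must run before the Spark interpreter is first used in the notebook:

```
%dep
z.reset()
z.load("group:artifact:version")
```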

Adding and Referencing SPARK_SUBMIT_OPTIONS

This approach is also useful when you have a jar on the node where Zeppelin is running:

Add SPARK_SUBMIT_OPTIONS env variable to the ZEPPELIN_HOME/conf/zeppelin-env.sh file; for example:

export SPARK_SUBMIT_OPTIONS="--packages group:artifact:version"
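For a jar already present on the Zeppelin node, the same variable can carry spark-submit's --jars flag instead (the path is a placeholder):

```shell
export SPARK_SUBMIT_OPTIONS="--jars /path/to/library.jar"
```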

Stopping the Zeppelin Server

To stop the Zeppelin server, issue the following commands:

cd /usr/hdp/current/zeppelin-server/lib
bin/zeppelin-daemon.sh stop

LDAP Authentication Configuration

This version of the technical preview allows Zeppelin to authenticate users and provides separation of notebooks.

Note: By default, Zeppelin receives requests over HTTP, not HTTPS. With LDAP authentication enabled, Zeppelin will send usernames and passwords over HTTP. For better security, enable SSL so that Zeppelin listens over HTTPS; you can use the SSL properties specified in this document. Also note that at this time Zeppelin does not send the user identity downstream; we are working to address this before Zeppelin goes GA.

To enable authentication, edit the [urls] section of the /usr/hdp/current/zeppelin-server/conf/shiro.ini file.
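As a sketch (the user entry is a placeholder, and a real LDAP setup would instead configure an ldapRealm in the [main] section), a shiro.ini that requires authentication for all URLs looks like:

```
[users]
admin = admin_password

[urls]
/api/version = anon
/** = authc
```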

For more information on Shiro, refer to http://shiro.apache.org/authentication-features.html

Sample Notebooks

Zeppelin includes a few sample notebooks, including a Zeppelin tutorial. There are also quite a few notebooks available at the Hortonworks Zeppelin Gallery, including sentiment analysis, geospatial mapping, and IoT demos.

Known Issues

If you need help, or have feedback or questions about the tech preview, please first check Hortonworks Community Connection (HCC) for existing questions and answers. Please use the tags tech-preview and zeppelin.