How-to: Install Apache Zeppelin on CDH

Our thanks to Karthik Vadla and Abhi Basu, Big Data Solutions engineers at Intel, for permission to re-publish the following (which was originally available here).

Data science is not a new discipline. However, with the growth of big data and the adoption of big data technologies, the demand for better-quality data has grown exponentially. Today data science is applied to every facet of life—product validation through fault prediction, genome sequence analysis, personalized medicine through population studies and Patient 360 view, credit card fraud detection, improvement in customer experience through sentiment analysis and purchase patterns, weather forecasting, detecting cyber or terrorist attacks, aircraft maintenance utilizing predictive analytics to repair critical parts before they fail, and many more. Every day, data scientists are detecting patterns in data and providing actionable insights to influence organizational changes.

The data scientist’s work broadly involves acquisition, cleanup, and analysis of data. Being a cross-functional discipline, this work involves communication, collaboration, and interaction with other individuals, internal and possibly external to your organization. This is one reason why the “notebook” features in data analysis tools are gaining popularity: they ease organizing, sharing, and interactively working with long workflows. IPython Notebook is a great example but is limited to the Python language. Apache Zeppelin (incubating at the time of this writing) is a new web-based notebook that enables data-driven, interactive data analytics and visualization, with the added bonus of supporting multiple languages, including Python, Scala, Spark SQL, Hive, Shell, and Markdown. Zeppelin also provides Apache Spark integration by default, making use of Spark’s fast in-memory, distributed data processing engine to accomplish data science at lightning speed.

This post demonstrates how easy it is to install Apache Zeppelin notebook on CDH (for dev/test only, not supported). We assume familiarity with Linux (especially CentOS) commands, installation, and configuration.

System Setup and Configuration

Components

Listed below are the specs of our test Hadoop cluster.

Installed hardware

Installed software

These installation commands are specific to CentOS. If you are not logged in as ‘root’, you must prefix all the commands with sudo.

Update CentOS packages (yum update).

Install the latest version of Java, preferably version 1.7 or later (yum install java-1.8.0-openjdk-devel).

Install Git (yum install git).

Install Node.js and npm (yum install nodejs npm).

Install Bower via npm (npm install -g bower).

Install Apache Maven; refer to these steps for installation.

Important Note: When you are working in a corporate environment, you need to set the proxies for Git, npm, and Bower individually, along with Maven.
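The proxy setup above can be sketched as follows; proxy.example.com:3128 is a placeholder, so substitute your actual corporate proxy host and port:

```shell
# Placeholder proxy address -- replace with your corporate proxy.
PROXY=http://proxy.example.com:3128

# Git
git config --global http.proxy "$PROXY"
git config --global https.proxy "$PROXY"

# npm
npm config set proxy "$PROXY"
npm config set https-proxy "$PROXY"

# Bower reads its proxy settings from ~/.bowerrc
cat > ~/.bowerrc <<EOF
{ "proxy": "$PROXY", "https-proxy": "$PROXY" }
EOF

# Maven: add a <proxy> entry under <proxies> in ~/.m2/settings.xml
```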

-Ppyspark: Installs all configurations required to run the pyspark interpreter in Zeppelin

-Phadoop-2.6: Installs Hadoop version support for Zeppelin
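For reference, a typical build invocation using these profiles against CDH (this matches the reader-reported command in the comment at the end of this post; adjust the versions to your cluster) might look like:

```shell
# Versions here target CDH 5.4.1; adjust to match your cluster.
mvn clean package -Pvendor-repo -Pspark-1.3 -Ppyspark -Phadoop-2.6 \
    -Dhadoop.version=2.6.0-cdh5.4.1 -Dmaven.test.skip=true
```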

Once the build is successful, continue with the configuration.

General Configuration of Zeppelin

To access the Hive metastore, copy hive-site.xml from HIVE_HOME/conf into the ZEPPELIN_HOME/conf folder (where HIVE_HOME and ZEPPELIN_HOME refer to the install locations of the respective software).

In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-env.sh.template and rename the copy to zeppelin-env.sh.

In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-site.xml.template and rename the copy to zeppelin-site.xml.
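The two template-copy steps above can be sketched as below; a throwaway directory stands in for ZEPPELIN_HOME here, so in practice point it at your real Zeppelin install location:

```shell
# Throwaway directory for illustration only; in practice,
# ZEPPELIN_HOME is your actual Zeppelin install location.
ZEPPELIN_HOME="$(mktemp -d)"
mkdir -p "$ZEPPELIN_HOME/conf"
touch "$ZEPPELIN_HOME/conf/zeppelin-env.sh.template"
touch "$ZEPPELIN_HOME/conf/zeppelin-site.xml.template"

# Duplicate the templates under their active names.
cp "$ZEPPELIN_HOME/conf/zeppelin-env.sh.template" \
   "$ZEPPELIN_HOME/conf/zeppelin-env.sh"
cp "$ZEPPELIN_HOME/conf/zeppelin-site.xml.template" \
   "$ZEPPELIN_HOME/conf/zeppelin-site.xml"

ls "$ZEPPELIN_HOME/conf"
```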

YARN Configuration of Zeppelin

If you have built the binaries with YARN support, set the master property for the Spark interpreter (i.e., master=yarn-client) via the Zeppelin UI (Interpreter tab).

In the ZEPPELIN_HOME/conf directory, open the zeppelin-env.sh file, uncomment the export HADOOP_CONF_DIR line, and set it to the configuration directory containing the yarn-site.xml file (e.g., export HADOOP_CONF_DIR=/etc/hadoop/conf).

Start Zeppelin: ./bin/zeppelin-daemon.sh start (Note: Sometimes you may not be able to run the above command. In that case, make all scripts in the bin folder executable with the following command:

chmod -R 777 bin.)

After this, try the previous command again to start Zeppelin.

And now you can access your notebook at http://localhost:8080 or http://host.ip.address:8080.

Stop Zeppelin: ./bin/zeppelin-daemon.sh stop

Testing

Start the Zeppelin application: ./bin/zeppelin-daemon.sh start and access http://localhost:8080 (or the IP address of the node it is installed on).

If you already have data in the Apache Hive metastore that is accessible via hive commands locally, let’s test Zeppelin commands. Use the %hive interpreter to access the Hive metastore and list all available databases. In this example, we already have some public genome databases available in our Hive metastore. If you do not have any data in your Hive metastore, you may want to load some data before starting this test, or skip to Step 4. Now, type these commands in the notebook:

%hive
show databases

The code snippet is echoed back and the code execution output is displayed:

To display tables in a specific database, such as “wellderly”, type these commands in the notebook:

%hive
show tables in wellderly

Again, the code snippet is echoed back and the code execution output is displayed:

Download the test dataset (education.csv) and place it in your HDFS location. Using the Scala interpreter, register a table from the .csv file in HDFS with the code snippet. Note: The Scala (Spark) interpreter is the default, so unlike Hive (%hive), no interpreter prefix needs to be specified in Zeppelin.
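The original snippet accompanied a screenshot; a minimal sketch of what it might look like in the notebook’s default Spark/Scala interpreter follows. The column names and the HDFS path are assumptions, so adjust them to match the actual education.csv:

```scala
// Runs inside Zeppelin's default Spark/Scala interpreter, which
// provides `sc` and `sqlContext`. The case-class fields below are
// hypothetical -- change them to match the columns in education.csv.
import sqlContext.implicits._

case class Education(state: String, year: Int, enrollment: Double)

// Assumed HDFS location; use wherever you placed the file.
val rows = sc.textFile("hdfs:///tmp/education.csv")

val education = rows.map(_.split(","))
  .map(f => Education(f(0).trim, f(1).trim.toInt, f(2).trim.toDouble))
  .toDF()

// Register the DataFrame so it is queryable from SQL paragraphs.
education.registerTempTable("education")
```

Once the table is registered, you can query it from a %sql paragraph (e.g., select * from education) and use Zeppelin’s built-in charts on the result.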

You have now installed and configured Zeppelin correctly and tested the installation successfully. Documentation for Zeppelin is available here.

Sharing a Notebook

If you want to share these notebook results with another user, you can simply send the URL of your notebook to that user. (That user must have access to the server node and cluster on which you created your notebook.) That user can not only view all your queries, but also run them to view the results.

If you want to share only the results without any queries (report-mode), please follow these steps:

Go to the top-right corner of the Zeppelin window, where you will see a dropdown list next to the settings icon.

Change it from default to report. In this mode, only the results can be viewed, without the queries.

Copy the URL and share with others (who have access to the server node and cluster).

As the above image shows, three modes are available to share your notebooks:

Default – In this mode, the notebook can be edited by anyone who has access to the notebook (edit queries and re-run to display different results).

Simple – This mode is similar to default; the only difference is that the options are hidden and become visible only when you hover your mouse over a cell. This mode gives a cleaner view of the results when shared.

Report – When this mode is enabled, only the final results are visible (read-only); the notebook cannot be edited.

Conclusion

Clearly, Apache Zeppelin is still in the incubator stage, but it shows promise as a cross-functional notebook not tied to a particular platform, tool, or programming language. Our intent here was to demonstrate how you can install Apache Zeppelin on your own system and start experimenting with its many capabilities. In the future, we want to use Zeppelin for exploratory data analysis and also write more interpreters for it to improve the visualization capability, e.g., incorporating Google Charts and similar tools.

I was having a problem while using the mvn options below; the build was not picking up the Cloudera repo, and the HBase version was not compatible with CDH 5.4.1:
mvn clean package -Pspark-1.3 -Ppyspark -Dhadoop.version=2.6.0-cdh5.4.1 -Phadoop-2.6 -DskipTests

Everything started to work after I made 3 changes:
1) Updated the HBase and Hadoop versions in hbase/pom.xml to 1.0.0-cdh5.4.1 and 2.6.0-cdh5.4.1, respectively
2) Changed -DskipTests to -Dmaven.test.skip=true, as suggested by Alex Ott
3) Included the vendor-repo profile in my mvn call to use the Cloudera repo:
mvn clean package -Pvendor-repo -Pspark-1.3 -Ppyspark -Dhadoop.version=2.6.0-cdh5.4.1 -Phadoop-2.6 -Dmaven.test.skip=true