Bitnami Hadoop for Microsoft Azure

Description

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

First steps with the Bitnami Hadoop Stack

Welcome to your new Bitnami application running on Microsoft Azure! Here are a few questions (and answers!) you might need when first starting with your application.

What credentials do I need?

You need two sets of credentials:

The application credentials, consisting of a username and password. These credentials allow you to log in to your new Bitnami application.

The server credentials, consisting of an SSH username and key/password. These credentials allow you to log in to your Microsoft Azure server using an SSH client and execute commands on the server using the command line.

What is the administrator username set for me to log in to the application for the first time?

Hadoop ports

Each daemon in Hadoop listens to a different port. The most relevant ones are:

ResourceManager:

Scheduler: 8030.

Resource Tracker: 8031.

Service: 8032.

Web UI: 8088.

NodeManager:

Localizer: 8040.

Web UI: 8042.

Timeline Server:

Service: 10200.

Web UI: 8188.

History Server:

Service: 10020.

Admin: 10033.

Web UI: 19888.

NameNode:

Service: 8020.

Web UI: 9870.

Secondary NameNode:

Web UI: 9868.

DataNode:

Data Transfer: 9866.

Service: 9867.

Web UI: 9864.

Hive:

Derby DB: 1527.

HCat/Metastore: 9083.

Hiveserver2 Thrift: 10000.

Hiveserver2 Web UI: 10002.

WebHCat: 50111.

All ports are closed by default. In order to access any of them, you have two options:

(Recommended) Create an SSH tunnel for accessing the port (refer to the FAQ for more information about SSH tunnels).

Open the port for remote access (refer to the FAQ for more information about opening ports).
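As an illustration of the recommended option, the following command creates an SSH tunnel that forwards the ResourceManager Web UI port (8088) to your local machine. KEYFILE and SERVER-IP are placeholders for your private SSH key and server IP address; bitnami is the default SSH username on Bitnami cloud images:

```shell
# Keep the tunnel open (-N) and forward local port 8088 to the same port on the server
ssh -N -L 8088:127.0.0.1:8088 -i KEYFILE bitnami@SERVER-IP
```

While the tunnel is active, browse to http://127.0.0.1:8088 to reach the Web UI.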

How to start or stop the services?

Each Bitnami stack includes a control script that lets you easily stop, start and restart services. The script is located at /opt/bitnami/ctlscript.sh. Call it without any service name arguments to start all services:

$ sudo /opt/bitnami/ctlscript.sh start

Or use it to restart a single service, such as Apache only, by passing the service name as an argument:

$ sudo /opt/bitnami/ctlscript.sh restart apache

Use this script to stop all services:

$ sudo /opt/bitnami/ctlscript.sh stop

Restart all services by running the script with the restart argument:

$ sudo /opt/bitnami/ctlscript.sh restart

Obtain a list of available services and operations by running the script without any arguments:

$ sudo /opt/bitnami/ctlscript.sh

How to access the administration panel?

Access the administration panel by browsing to http://SERVER-IP/cluster/.

How to create a full backup of Hadoop?

Backup

The Bitnami Hadoop Stack is self-contained and the simplest option for performing a backup is to copy or compress the Bitnami stack installation directory. To do so in a safe manner, you will need to stop all servers, so this method may not be appropriate if you have people accessing the application continuously.

Follow these steps:

Change to the directory in which you wish to save your backup:

$ cd /your/directory

Stop all servers:

$ sudo /opt/bitnami/ctlscript.sh stop

Create a compressed file with the stack contents:

$ sudo tar -pczvf application-backup.tar.gz /opt/bitnami

Restart all servers:

$ sudo /opt/bitnami/ctlscript.sh start

You should now download or transfer the application-backup.tar.gz file to a safe location.

Restore

Follow these steps:

Change to the directory containing your backup:

$ cd /your/directory

Stop all servers:

$ sudo /opt/bitnami/ctlscript.sh stop

Move the current stack to a different location:

$ sudo mv /opt/bitnami /tmp/bitnami-backup

Uncompress the backup file to the original directory:

$ sudo tar -pxzvf application-backup.tar.gz -C /

Start all servers:

$ sudo /opt/bitnami/ctlscript.sh start

If you want to create only a database backup, refer to these instructions for MySQL and PostgreSQL.

How to enable HTTPS support with SSL certificates?

NOTE: The steps below assume that you are using a custom domain name and that you have already configured the custom domain name to point to your cloud server.

Bitnami images come with SSL support already pre-configured and with a dummy certificate in place. Although this dummy certificate is fine for testing and development purposes, you will usually want to use a valid SSL certificate for production use. You can either generate this on your own (explained here) or you can purchase one from a commercial certificate authority.

Once you obtain the certificate and certificate key files, you will need to update your server to use them. Follow these steps to activate SSL support:

Use the table below to identify the correct locations for your certificate and configuration files.

NOTE: If you use different names for your certificate and key files, you should reconfigure the SSLCertificateFile and SSLCertificateKeyFile directives in the corresponding Apache configuration file to reflect the correct file names.

If your certificate authority has also provided you with a PEM-encoded Certificate Authority (CA) bundle, you must copy it to the correct location shown in the table below. Then, modify the Apache configuration file to include the following line below the SSLCertificateKeyFile directive. Choose the correct directive based on your Apache version:

Apache configuration file: /opt/bitnami/apache2/conf/bitnami/bitnami.conf

Directive to include (Apache v2.4.8+): SSLCACertificateFile "/opt/bitnami/apache2/conf/server-ca.crt"

Directive to include (Apache < v2.4.8): SSLCertificateChainFile "/opt/bitnami/apache2/conf/server-ca.crt"

NOTE: If you use a different name for your CA certificate bundle, you should reconfigure the SSLCertificateChainFile or SSLCACertificateFile directives in the corresponding Apache configuration file to reflect the correct file name.

Once you have copied all the server certificate files, make them readable only by the root user with the following commands:
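For example, assuming the certificate and key are stored as server.crt and server.key in the Apache configuration directory (adjust the file names if yours differ):

```shell
# Restrict ownership and permissions so only root can read the certificate files
sudo chown root:root /opt/bitnami/apache2/conf/server.crt /opt/bitnami/apache2/conf/server.key
sudo chmod 600 /opt/bitnami/apache2/conf/server.crt /opt/bitnami/apache2/conf/server.key
```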

How to create an SSL certificate?

OpenSSL is required to create an SSL certificate. A certificate request can then be sent to a certificate authority (CA) to be signed into a certificate. If you have your own certificate authority, you may sign it yourself, or you can use a self-signed certificate (for example, because you just want a test certificate or because you are setting up your own CA).

Note that if you use this encrypted key in the Apache configuration file, it will be necessary to enter the password manually every time Apache starts. Regenerate the key without password protection from this file as follows:
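A typical way to do this with the openssl command-line tool, assuming the encrypted key is stored at /opt/bitnami/apache2/conf/server.key (openssl will prompt for the key's passphrase):

```shell
# Write a copy of the key without the passphrase, then replace the original
sudo openssl rsa -in /opt/bitnami/apache2/conf/server.key -out /opt/bitnami/apache2/conf/server.key.nopass
sudo mv /opt/bitnami/apache2/conf/server.key.nopass /opt/bitnami/apache2/conf/server.key
```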

How to force HTTPS redirection?
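One common approach is to add mod_rewrite rules to the default HTTP virtual host. A sketch of what the relevant lines might look like in /opt/bitnami/apache2/conf/bitnami/bitnami.conf (the exact file and virtual host layout may vary between stack versions):

```apacheconf
# Inside the VirtualHost block listening on port 80
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^/(.*) https://%{SERVER_NAME}/$1 [R,L]
```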

After modifying the Apache configuration files, restart Apache to apply the changes.

How to debug Apache errors?

Once Apache starts, it will create two log files at /opt/bitnami/apache2/logs/access_log and /opt/bitnami/apache2/logs/error_log respectively.

The access_log file is used to track client requests. When a client requests a document from the server, Apache records several parameters associated with the request in this file, such as: the IP address of the client, the document requested, the HTTP status code, and the current time.

The error_log file is used to record important events. This file includes error messages, startup messages, and any other significant events in the life cycle of the server. This is the first place to look when you run into a problem when using Apache.
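Before restarting Apache after a configuration change, you can check the configuration files for syntax errors (path assuming the stock Bitnami layout):

```shell
sudo /opt/bitnami/apache2/bin/apachectl configtest
```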

If no error is found, you will see a message similar to:

Syntax OK

How to create a Virtual Network peering?

To connect two instances internally, you can enable a Virtual Network (VNet) peering from the Azure Portal. Depending on whether the instances were launched in the same or in different resource groups, there are two methods for establishing an internal connection: sharing a virtual network or enabling a virtual network peering.

How to upload files to the server with SFTP?

Although you can use any SFTP/SCP client to transfer files to your server, the link below explains how to configure FileZilla (Windows, Linux and Mac OS X), WinSCP (Windows) and Cyberduck (Mac OS X). It is required to use your server's private SSH key to configure the SFTP client properly. Choose your preferred application and follow the steps in the link below to connect to the server through SFTP.

How to connect to Hadoop from a different machine?

For security reasons, ports used by Hadoop cannot be accessed over a public IP address. To connect to Hadoop from a different machine, you must open the port of the service you want to access remotely. Refer to the FAQ for more information on this.

Check the Hadoop ports section to see the complete list of the most relevant ports in Hadoop.

IMPORTANT: Making this application's network ports public is a significant security risk. You are strongly advised to only allow access to those ports from trusted networks. If, for development purposes, you need to access from outside of a trusted network, please do not allow access to those ports via a public IP address. Instead, use a secure channel such as a VPN or an SSH tunnel. Follow these instructions to remotely connect safely and reliably.

How to create a Hadoop cluster with several servers?

It is possible to create a Hadoop cluster with several instances of the Bitnami Hadoop Stack, as long as the Hadoop daemons are properly configured.

Typical Hadoop clusters are divided into the following node roles:

Master nodes: NameNode and ResourceManager servers, usually running one of these services per node.

Worker nodes: Acting as both DataNode and NodeManager on the same node.

Service nodes: Services such as the Application Timeline server, Web App Proxy server and MapReduce Job History server, running on the same node.

Client: The node from which Hadoop jobs will be submitted, which will have Hadoop Hive installed.

Once you have decided on an architecture for your cluster, the Hadoop services running on each node must be able to communicate with each other. Each service operates on a different port. Therefore, when creating the cluster, ensure that you open the service ports on each node.

IMPORTANT: Hadoop will require you to use hostnames/IP addresses that are configured via network configuration to your server. This typically means that you won't be able to use a public IP address, but a private IP address instead.

Stop all the services in the nodes by running the following command in each node:

$ sudo /opt/bitnami/ctlscript.sh stop

NameNode: Save the IP address of the node that will act as the NameNode. In this example, we will suppose that the IP address of the chosen NameNode is 192.168.1.2.

Change the fs.defaultFS property in the /opt/bitnami/hadoop/etc/hadoop/core-site.xml file, and set its value to the full HDFS URI of the node which will act as the NameNode:
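A sketch of the resulting property, using the example IP address above and the default NameNode service port (8020):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.1.2:8020</value>
</property>
```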

Secondary NameNode: Change the dfs.namenode.secondary.http-address property in the /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml file. Set the value to the appropriate IP address for the Secondary NameNode. For example, if the IP address of the chosen Secondary NameNode is the same as the one for the NameNode, the configuration file will contain the following:
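A sketch of the property, using the NameNode's IP address and the default Secondary NameNode Web UI port (9868):

```xml
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>192.168.1.2:9868</value>
</property>
```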

ResourceManager: Add the property yarn.resourcemanager.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the ResourceManager. For example, if the IP address of the chosen ResourceManager is 192.168.1.3, the configuration file will contain the following:
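A sketch of the property, using the example IP address above:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>192.168.1.3</value>
</property>
```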

DataNode: Edit the dfs.datanode.address, dfs.datanode.http.address and dfs.datanode.ipc.address properties in the /opt/bitnami/hadoop/etc/hadoop/hdfs-site.xml file. Set the values to the IP address and port of the node which will act as the DataNode. For example, if the IP address of the chosen DataNode server is 192.168.1.4 and it listens on the default ports, the configuration file will contain the following:
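A sketch of the properties, using the example IP address above and the default DataNode ports (9866 for data transfer, 9864 for the Web UI, 9867 for the service):

```xml
<property>
  <name>dfs.datanode.address</name>
  <value>192.168.1.4:9866</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>192.168.1.4:9864</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>192.168.1.4:9867</value>
</property>
```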

NodeManager: Add the property yarn.nodemanager.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the NodeManager. For example, if the IP address of the chosen NodeManager is 192.168.1.2, the configuration file will contain the following:
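A sketch of the property, using the example IP address above:

```xml
<property>
  <name>yarn.nodemanager.hostname</name>
  <value>192.168.1.2</value>
</property>
```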

Timeline server: Add the property yarn.timeline-service.hostname to the /opt/bitnami/hadoop/etc/hadoop/yarn-site.xml file. Set the value to the IP address of the node which will act as the Timeline server. For example, if the IP address of the chosen Timeline server is 192.168.1.5 and it listens to the default port, the configuration file will contain the following:
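A sketch of the property, using the example IP address above:

```xml
<property>
  <name>yarn.timeline-service.hostname</name>
  <value>192.168.1.5</value>
</property>
```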

JobHistory server: Edit the mapreduce.jobhistory.address, mapreduce.jobhistory.admin.address and mapreduce.jobhistory.webapp.address properties in the /opt/bitnami/hadoop/etc/hadoop/mapred-site.xml file. Set the values to the IP address of the node which will act as the JobHistory server, and the corresponding ports for each service. For example, if the IP address of the chosen JobHistory server is 192.168.1.5 and the services listen on the default ports, the configuration file will contain the following:
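A sketch of the properties, using the example IP address above and the default History Server ports (10020 for the service, 10033 for admin, 19888 for the Web UI):

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>192.168.1.5:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.admin.address</name>
  <value>192.168.1.5:10033</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>192.168.1.5:19888</value>
</property>
```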

Copy these configuration files to every node in the cluster. The configuration must be the same in all of them.

Configure start/stop scripts: Once you have decided on the architecture and applied the configuration files, you must disable unnecessary services in each of the nodes. To do so:

Navigate to /opt/bitnami/hadoop/scripts on each server and determine if any of the startup scripts are not needed. If that is the case, rename them to something different. For instance, in order to disable all Yarn services, run the following command:

$ sudo mv ctl-yarn.sh ctl-yarn.sh.disabled

If you want to selectively disable some of the daemons for a specific service, edit the appropriate start/stop script, look for the HADOOP_SERVICE_DAEMONS line and remove the daemons you want to disable from the list. For instance, to selectively disable Yarn's NodeManager, you would remove it from the list in the ctl-yarn.sh script.
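As an illustration, if the ctl-yarn.sh script contained a line such as the following (the exact daemon names are an assumption and may differ in your stack version):

```shell
# Before (hypothetical contents of ctl-yarn.sh):
HADOOP_SERVICE_DAEMONS="resourcemanager nodemanager"
# After disabling the NodeManager:
HADOOP_SERVICE_DAEMONS="resourcemanager"
```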

How to connect to Hive?

The Bitnami Hadoop Stack includes Hive, Pig and Spark, and starts HiveServer2, Metastore and WebHCat by default.

How to connect to HiveServer2?

HiveServer2 is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It listens to port 10000 by default.

In order to connect to HiveServer2, you have two options:

(Recommended): Connect to the HiveServer2 Thrift server (running on port 10000) through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).

Open the HiveServer2 Thrift server's port 10000 for remote access (refer to the FAQ for more information about opening ports).

Once you have connected to the server through an SSH tunnel or opened the port to allow remote access, you can use the Beeline command-line utility. To connect to HiveServer2 using Beeline, run the following:

Connecting to HiveServer2 through an SSH tunnel:

$ beeline -u jdbc:hive2://localhost:10000 -n hadoop

Connecting to HiveServer2 by opening port 10000 (SERVER-IP is a placeholder; replace it with the right value):

$ beeline -u jdbc:hive2://SERVER-IP:10000

After a few seconds, you will be able to access the prompt:

0: jdbc:hive2://localhost:10000>

How to access the HiveServer2 Web UI?

HiveServer2 has a Web UI which provides different features, such as logging, metrics and configuration information. It listens on port 10002. In order to access it, you have two options:

(Recommended) Access the HiveServer2 Web UI (running on port 10002) through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).

Open the HiveServer2 port 10002 for remote access (refer to the FAQ for more information about opening ports).

How to access WebHCat?

HCatalog is a table and storage management layer for Hadoop. HCatalog is built on top of Metastore, another component of Hadoop. WebHCat is the REST API for HCatalog, and listens to port 50111 by default.

In order to access HCatalog, you have two options:

(Recommended): Access the WebHCat server (running on port 50111) through an SSH tunnel (refer to the FAQ for more information about SSH tunnels).

Open the WebHCat port 50111 for remote access (refer to the FAQ for more information about opening ports).

How to connect to Pig?

The Bitnami Hadoop Stack includes Pig, a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

To use Pig, simply run:

$ pig

After a few moments, you will see the grunt prompt:

grunt>

How to run Pig tutorial scripts?

In order to run the Pig tutorial scripts, you will first need to upload a file to HDFS:
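For example, to copy the Pig tutorial's sample data file into HDFS with the hdfs command-line client (the local path to the tutorial data is an assumption and may differ in your installation):

```shell
# Create a home directory for the hadoop user in HDFS, if it does not exist yet
hdfs dfs -mkdir -p /user/hadoop
# Upload the sample data file to HDFS
hdfs dfs -put /opt/bitnami/hadoop/pig/tutorial/data/excite.log.bz2 /user/hadoop/
```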