Using Impala through a Proxy for High Availability

For most clusters that have multiple users and production availability requirements, you might set up a proxy server to relay requests to and from Impala.

Currently, the Impala statestore mechanism does not include such proxying and load-balancing features. Set up a software package of your choice to perform these functions.

Note:

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not
result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and
Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Overview of Proxy Usage and Load Balancing for Impala

Using a load-balancing proxy server for Impala has the following advantages:

Applications connect to a single well-known host and port, rather than keeping track of the hosts where the impalad daemon is running.

If any host running the impalad daemon becomes unavailable, application connection requests still succeed because you always connect to the proxy
server rather than a specific host running the impalad daemon.

The coordinator node for each Impala query potentially requires more memory and CPU cycles than the other nodes that process the query. The proxy server can issue queries using
round-robin scheduling, so that each connection uses a different coordinator node. This load-balancing technique lets the Impala nodes share this additional work, rather than concentrating it on a
single machine.

The following setup steps are a general outline that applies to any load-balancing proxy software.

Download the load-balancing proxy software. It should only need to be installed and configured on a single host. Pick a host other than the DataNodes where impalad is running, because the intention is to protect against the possibility of these DataNodes becoming unavailable.

Configure the software (typically by editing a configuration file). In particular:

Set up a port that the load balancer will listen on to relay Impala requests back and forth.

Consider enabling "sticky sessions". Cloudera recommends enabling this setting so that stateless client applications such as impalad and Hue are not disconnected from long-running queries. Evaluate whether this setting is appropriate for your combination of workload and client applications.

Specify the host and port settings for each Impala node. These are the hosts that the load balancer will choose from when relaying each Impala query. See Ports Used by Impala for when to use port 21000, 21050, or another value depending on what type of connections you are load balancing.

Note:

In particular, if you are using Hue or JDBC-based applications, you typically set up load balancing for both ports 21000 and 21050, because these client applications
connect through port 21050 while the impala-shell command connects through port 21000.

Run the load-balancing proxy server, pointing it at the configuration file that you set up.

For any scripts, jobs, or configuration settings for applications that formerly connected to a specific DataNode to run Impala SQL statements, change the connection information (such
as the -i option in impala-shell) to point to the load balancer instead.

Note: The following sections use the HAProxy software as a representative example of a load balancer that you can use with Impala. For information
specifically about using Impala with the F5 BIG-IP load balancer, see Impala HA with F5 BIG-IP.
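
As a rough illustration of the "host and port for each Impala node" step, the sketch below prints one HAProxy server line per Impala host for a chosen port. The function name backend_lines, the impala_N labels, and the example host names are hypothetical; the check keyword (HAProxy health checking) is an optional addition, not something the steps above require.

```shell
#!/bin/sh
# Print one HAProxy "server" line per Impala host for a given port.
# Usage: backend_lines PORT HOST...
# The impala_N labels are arbitrary symbolic names chosen for this sketch.
backend_lines() {
  port="$1"
  shift
  n=1
  for host in "$@"; do
    printf '    server impala_%s %s:%s check\n' "$n" "$host" "$port"
    n=$((n + 1))
  done
}
```

For example, backend_lines 21000 host-a.example.com host-b.example.com produces lines for impala-shell traffic; calling it again with 21050 would produce the lines for a separate listener serving Hue or JDBC clients.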

Special Proxy Considerations for Clusters Using Kerberos

In a cluster using Kerberos, applications check host credentials to verify that the host they are connecting to is the same one that is actually processing the request, to prevent
man-in-the-middle attacks. To establish that the load-balancing proxy server is legitimate, perform these extra Kerberos setup steps:

Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should already have an entry impala/proxy_host@realm in its keytab. If not, go back over the initial Kerberos configuration steps for the keytab on each host running the
impalad daemon.

Copy the keytab file from the proxy host to all other hosts in the cluster that run the impalad daemon. (For optimal performance, impalad should be running on all DataNodes in the cluster.) Put the keytab file in a secure location on each of these other hosts.

On systems not managed by Cloudera Manager, add an entry impala/actual_hostname@realm to the keytab on each host running the impalad daemon.

For each impalad node, merge the existing keytab with the proxy’s keytab using ktutil, producing a new keytab file.

Restart Impala to make the changes take effect. Follow the appropriate steps depending on whether you use Cloudera Manager or not:

On a cluster managed by Cloudera Manager, restart the Impala service.

On a cluster not managed by Cloudera Manager, restart the impalad daemons on all hosts in the cluster, as well as the statestored and catalogd daemons.
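
The keytab merge step above can be sketched as follows. This is a minimal sketch, assuming MIT Kerberos ktutil and three illustrative file names (proxy.keytab, impala.keytab, proxy_impala.keytab) that do not come from this document; adjust them to the actual keytab locations on each host. The helper merge_keytab_commands is hypothetical and only prints the ktutil commands, which are then piped to ktutil.

```shell
#!/bin/sh
# Emit the ktutil commands that merge two keytabs into a new one.
# Arguments: proxy keytab, existing host keytab, output keytab.
# All file names used here are assumptions made for illustration.
merge_keytab_commands() {
  printf 'read_kt %s\n' "$1"
  printf 'read_kt %s\n' "$2"
  printf 'write_kt %s\n' "$3"
  printf 'quit\n'
}

# Run the merge only where ktutil and both input keytabs are present.
if command -v ktutil >/dev/null 2>&1 && [ -f proxy.keytab ] && [ -f impala.keytab ]; then
  merge_keytab_commands proxy.keytab impala.keytab proxy_impala.keytab | ktutil
fi
```

Afterward, listing the merged file with klist -k proxy_impala.keytab is one way to confirm that both the proxy principal and the host principal are present.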

Example of Configuring HAProxy Load Balancer for Impala

If you are not already using a load-balancing proxy, you can experiment with HAProxy, a free, open source load balancer.
This example shows how you might install and configure that load balancer on a Red Hat Enterprise Linux system.

Install the load balancer: yum install haproxy

Set up the configuration file: /etc/haproxy/haproxy.cfg. See the following section for a sample configuration file.

Run the load balancer (on a single host, preferably one not running impalad):

/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg

In impala-shell, JDBC applications, or ODBC applications, connect to the listener port of the proxy host, rather than port 21000 or 21050 on a host
actually running impalad. The sample configuration file sets haproxy to listen on port 25003, so you would send all requests to haproxy_host:25003.
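
For impala-shell, the last step might look like the sketch below. It assumes the placeholder host name haproxy_host and the listener port 25003 mentioned above; the helper names impala_endpoint and run_via_proxy are hypothetical, while -i and -q are standard impala-shell options.

```shell
#!/bin/sh
# Proxy endpoint: placeholder host name plus the listener port from the
# sample configuration. Override either variable for a real environment.
PROXY_HOST="${PROXY_HOST:-haproxy_host}"
PROXY_PORT="${PROXY_PORT:-25003}"

# Build the host:port value passed to impala-shell's -i option.
impala_endpoint() {
  printf '%s:%s' "$PROXY_HOST" "$PROXY_PORT"
}

# Run one statement through the proxy rather than a specific impalad host.
run_via_proxy() {
  impala-shell -i "$(impala_endpoint)" -q "$1"
}
```

A call such as run_via_proxy 'SELECT 1' then goes to whichever impalad the load balancer selects, so scripts no longer need to name an individual Impala host.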

Note: If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy, be cautious about
reusing the connections. If the load balancer has set up connection timeout values, either check the connection frequently so that it never sits idle longer than the load balancer timeout value, or
check the connection validity before using it and create a new one if the connection has been closed.
