Abstract

Big data analytics is network intensive because it runs on a cluster of nodes. Due to the high volumes of data exchanged between nodes in the IBM® BigInsights® cluster, network isolation is vital for the following reasons:

To prevent sniffing attacks

To reduce the network congestion so that the corporate network is not affected by the big data cluster traffic

This IBM® Redbooks® Analytics Support web doc serves as a guide for system implementers who are creating a secure zone for an IBM BigInsights cluster by providing an example of current industry practices. This document applies to IBM BigInsights Version 4.2 and later.

Contents

IBM BigInsights cluster network deployment architecture

Figure 1 shows one industry practice for the network topology of an IBM® BigInsights® cluster.

Figure 1. High-level IBM BigInsights network architecture

Users can access the management nodes from the corporate network only after they authenticate with the corporate LDAP. Inbound traffic is encrypted and controlled by a firewall: only the ports for cluster administration (Ambari), the reverse proxy (Knox), JDBC access (Big SQL and Hive), and SSH are open to users. Outbound traffic is not restricted. After users log in to a management node, they can connect through Secure Shell (SSH) to the data nodes, which are attached to the private network. All inbound traffic from the corporate network to the data nodes is blocked, because the data nodes are connected only to the private network (the cluster data network).
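The inbound policy described above can be sketched as a short iptables rule set. This is only an illustration and rests on several assumptions that do not come from this document: that filtering is done with iptables on the management node itself rather than on a dedicated firewall appliance, that eth0 is the public interface, and that the services listen on commonly used default ports (22 for SSH, 8080 for Ambari, 8443 for Knox, 10000 for HiveServer2, 32051 for Big SQL). Verify every port against your installation. The DRY_RUN switch and the function names are hypothetical helpers.

```shell
#!/bin/sh
# Sketch only: interface name and port numbers are ASSUMED defaults,
# not values taken from this document. Verify them before use.
PUBLIC_IF="${PUBLIC_IF:-eth0}"      # interface on the corporate (public) network
# SSH 22, Ambari 8080, Knox 8443, HiveServer2 10000, Big SQL 32051 (assumed)
ALLOWED_PORTS="${ALLOWED_PORTS:-22 8080 8443 10000 32051}"
DRY_RUN="${DRY_RUN:-1}"             # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restrict_inbound() {
    # Allow replies to connections the node opened (outbound is unrestricted).
    run iptables -A INPUT -i "$PUBLIC_IF" -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # Open only the user-facing service ports.
    for port in $ALLOWED_PORTS; do
        run iptables -A INPUT -i "$PUBLIC_IF" -p tcp --dport "$port" -j ACCEPT
    done
    # Drop everything else that arrives from the corporate network.
    run iptables -A INPUT -i "$PUBLIC_IF" -j DROP
}

restrict_inbound
```

With the default DRY_RUN=1 the script only prints the rules, so they can be reviewed first. Setting DRY_RUN=0 and running it as root would apply them.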

Management nodes in the cluster have two network interfaces:

One of the interfaces is connected to the corporate network (also called a public network).

The other interface is connected to the private network (sometimes referred to as a data network).

All data traffic between the management nodes and the data nodes is exchanged through the private network only. Cluster traffic therefore has dedicated bandwidth, which improves performance, and it never crosses the corporate network, which improves security.

Setting up the data nodes to access external data sources with Apache Sqoop and similar tools

One of the use cases for big data analytics is to offload the organization's relational database management system (RDBMS) data to Apache Hadoop for archival or running analytics at large scale. Tools such as Apache Sqoop and IBM Fluid Query are used to import data from external sources. These tools launch MapReduce jobs, which read the data from external sources in parallel.
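As an illustration of such a parallel import, the following sketch builds a Sqoop import command. The JDBC URL, user name, table, and target directory are hypothetical placeholders, and the DRY_RUN switch is a helper added for this example. The `--num-mappers` option sets how many map tasks read from the source in parallel. Because those tasks run on the data nodes, each of them needs a network path to the external database.

```shell
#!/bin/sh
# Sketch only: every connection value below is a HYPOTHETICAL placeholder.
JDBC_URL="${JDBC_URL:-jdbc:db2://db.example.com:50000/SALES}"
DB_USER="${DB_USER:-etl_user}"
TABLE="${TABLE:-ORDERS}"
TARGET_DIR="${TARGET_DIR:-/user/etl_user/orders}"
MAPPERS="${MAPPERS:-4}"      # number of parallel map tasks reading the source
DRY_RUN="${DRY_RUN:-1}"      # 1 = print the command only, 0 = execute it

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

import_table() {
    # -P prompts for the database password instead of putting it on the command line.
    run sqoop import \
        --connect "$JDBC_URL" \
        --username "$DB_USER" \
        -P \
        --table "$TABLE" \
        --target-dir "$TARGET_DIR" \
        --num-mappers "$MAPPERS"
}

import_table
```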

In this scenario, the data nodes have no direct route to the external sources, so port forwarding must be enabled in the firewall. The data nodes can then reach the external sources because their traffic is forwarded through the management nodes.

Run the following commands as root on the management nodes to enable port forwarding between data nodes and management nodes. Data nodes can then initiate communication to servers outside of the private network and receive data, but external servers cannot access the internal network.

These commands assume that eth0 is the public interface and eth1 is the private interface. Traffic that arrives on eth1 is routed out of the private network as though it originated from the management node. Adjust the interface names (eth0 and eth1) to match your environment. After the data is completely imported, port forwarding can be turned off.
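The forwarding behavior described above is commonly implemented with kernel IP forwarding plus iptables NAT masquerading. The following is a minimal sketch of that general technique, not necessarily the exact commands intended here. It assumes iptables is the firewall on the management node, eth0 is the public interface, and eth1 is the private interface. The DRY_RUN switch and the function names are hypothetical helpers.

```shell
#!/bin/sh
# Sketch only: a common iptables masquerading recipe under the assumptions
# stated above. Interface names must be adjusted to the environment.
PUBLIC_IF="${PUBLIC_IF:-eth0}"     # interface on the corporate (public) network
PRIVATE_IF="${PRIVATE_IF:-eth1}"   # interface on the private (data) network
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_forwarding() {
    # Let the kernel route packets between the two interfaces.
    run sysctl -w net.ipv4.ip_forward=1
    # Rewrite the source address so traffic appears to come from the management node.
    run iptables -t nat -A POSTROUTING -o "$PUBLIC_IF" -j MASQUERADE
    # Data nodes may initiate connections to outside servers.
    run iptables -A FORWARD -i "$PRIVATE_IF" -o "$PUBLIC_IF" -j ACCEPT
    # Only replies to those connections are allowed back in.
    run iptables -A FORWARD -i "$PUBLIC_IF" -o "$PRIVATE_IF" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
}

disable_forwarding() {
    # Remove the rules added above, then stop routing between interfaces.
    run iptables -D FORWARD -i "$PUBLIC_IF" -o "$PRIVATE_IF" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
    run iptables -D FORWARD -i "$PRIVATE_IF" -o "$PUBLIC_IF" -j ACCEPT
    run iptables -t nat -D POSTROUTING -o "$PUBLIC_IF" -j MASQUERADE
    run sysctl -w net.ipv4.ip_forward=0
}

enable_forwarding
```

With the default DRY_RUN=1 the script only prints the commands. Run as root with DRY_RUN=0, it would apply them, and calling disable_forwarding after the import would remove the rules again. Because the return path accepts only RELATED and ESTABLISHED connections, external servers cannot open new connections into the private network.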

Special Notices

This material has not been submitted to any formal IBM test and is published AS IS. It has not been the subject of rigorous review. IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a client responsibility and depends upon the client's ability to evaluate and integrate them into the client's operational environment.