Extend Azure HDInsight using an Azure Virtual Network

Learn how to use HDInsight with an Azure Virtual Network. Using an Azure Virtual Network enables the following scenarios:

Connecting to HDInsight directly from an on-premises network.

Connecting HDInsight to data stores in an Azure Virtual Network.

Directly accessing Hadoop services that are not available publicly over the internet. For example, Kafka APIs or the HBase Java API.

Warning

The information in this document requires an understanding of TCP/IP networking. If you are not familiar with TCP/IP networking, you should partner with someone who is before making modifications to production networks.

Do you want to connect the virtual network containing HDInsight to another virtual network or your on-premises network?

To easily work with resources across networks, you may need to create a custom DNS server and configure DNS forwarding. For more information, see the connecting multiple networks section.

Do you want to restrict/redirect inbound or outbound traffic to HDInsight?

HDInsight must have unrestricted communication with specific IP addresses in the Azure data center. There are also several ports that must be allowed through firewalls for client communication. For more information, see the controlling network traffic section.

Add HDInsight to an existing virtual network

Use the steps in this section to add a new HDInsight cluster to an existing Azure Virtual Network.

Note

You cannot add an existing HDInsight cluster into a virtual network.

Are you using a classic or Resource Manager deployment model for the virtual network?

Once joined, HDInsight installed in the Resource Manager network can interact with resources in the classic network.

Do you use forced tunneling? Forced tunneling is a subnet setting that forces outbound Internet traffic to a device for inspection and logging. HDInsight does not support forced tunneling. Either remove forced tunneling before installing HDInsight into a subnet, or create a new subnet for HDInsight.

Do you use network security groups, user-defined routes, or Virtual Network Appliances to restrict traffic into or out of the virtual network?

As a managed service, HDInsight requires unrestricted access to several IP addresses in the Azure data center. To allow communication with these IP addresses, update any existing network security groups or user-defined routes.

HDInsight hosts multiple services, which use a variety of ports. Do not block traffic to these ports. For a list of ports to allow through virtual appliance firewalls, see the Security section.

To find your existing security configuration, use the following Azure PowerShell or Azure CLI commands:

Network security groups

$resourceGroupName = Read-Host -Prompt "Enter the resource group that contains the virtual network used with HDInsight"
Get-AzureRmNetworkSecurityGroup -ResourceGroupName $resourceGroupName

read -p "Enter the name of the resource group that contains the virtual network: " RESOURCEGROUP
az network nsg list --resource-group "$RESOURCEGROUP"

Network security group rules are applied in order based on rule priority. The first rule that matches the traffic pattern is applied, and no others are applied for that traffic. Order rules from most permissive to least permissive. For more information, see the Filter network traffic with network security groups document.
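The first-match behavior described above can be illustrated with a small shell sketch. It is not an Azure tool: the first_match function, the rule format (priority, access, source prefix), and the sample rules are all hypothetical, and the sketch assumes that rules with lower priority numbers are evaluated first and that a source prefix is either an exact IP address or *.

```shell
# first_match SOURCE_IP
# Reads hypothetical rules from stdin, one per line: "PRIORITY ACCESS SOURCE_PREFIX".
# Rules are evaluated in ascending priority order; the first match wins.
first_match() {
  fm_src="$1"
  sort -n | while read -r fm_prio fm_access fm_prefix; do
    if [ "$fm_prefix" = "*" ] || [ "$fm_prefix" = "$fm_src" ]; then
      echo "$fm_access"
      break   # no later rule is applied to this traffic
    fi
  done
}

# Sample rules (illustrative only): allow one management address, deny the rest.
rules='400 Deny *
300 Allow 168.61.49.99'

printf '%s\n' "$rules" | first_match 168.61.49.99   # matched by the priority 300 rule
printf '%s\n' "$rules" | first_match 203.0.113.7    # falls through to the priority 400 rule
```

In this sketch, the allow rule takes effect only because its priority number is lower than that of the catch-all deny rule. If the numbers were swapped, the deny rule would match first.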

User-defined routes

$resourceGroupName = Read-Host -Prompt "Enter the resource group that contains the virtual network used with HDInsight"
Get-AzureRmRouteTable -ResourceGroupName $resourceGroupName

read -p "Enter the name of the resource group that contains the virtual network: " RESOURCEGROUP
az network route-table list --resource-group "$RESOURCEGROUP"

Adding HDInsight to a virtual network is an optional configuration step. Be sure to select the virtual network when configuring the cluster.

Connecting multiple networks

The biggest challenge with a multi-network configuration is name resolution between the networks.

Azure provides name resolution for Azure services that are installed in a virtual network. This built-in name resolution allows HDInsight to connect to the following resources by using a fully qualified domain name (FQDN):

Any resource that is available on the internet. For example, microsoft.com, google.com.

Any resource that is in the same Azure Virtual Network, by using the internal DNS name of the resource. For example, when using the default name resolution, the following are example internal DNS names assigned to HDInsight worker nodes:

wn0-hdinsi.0owcbllr5hze3hxdja3mqlrhhe.ex.internal.cloudapp.net

wn2-hdinsi.0owcbllr5hze3hxdja3mqlrhhe.ex.internal.cloudapp.net

Both these nodes can communicate directly with each other, and other nodes in HDInsight, by using internal DNS names.

The default name resolution does not allow HDInsight to resolve the names of resources in networks that are joined to the virtual network. For example, it is common to join your on-premises network to the virtual network. With only the default name resolution, HDInsight cannot access resources in the on-premises network by name. The opposite is also true: resources in your on-premises network cannot access resources in the virtual network by name.

Warning

You must create the custom DNS server and configure the virtual network to use it before creating the HDInsight cluster.

To enable name resolution between the virtual network and resources in joined networks, you must perform the following actions:

Create a custom DNS server in the Azure Virtual Network where you plan to install HDInsight.

Configure the virtual network to use the custom DNS server.

Find the Azure assigned DNS suffix for your virtual network. This value is similar to 0owcbllr5hze3hxdja3mqlrhhe.ex.internal.cloudapp.net. For information on finding the DNS suffix, see the Example: Custom DNS section.

Configure forwarding between the DNS servers. The configuration depends on the type of remote network.

If the remote network is an on-premises network, configure DNS as follows:

Custom DNS (in the virtual network):

Forward requests for the DNS suffix of the virtual network to the Azure recursive resolver (168.63.129.16). Azure handles requests for resources in the virtual network.

Forward all other requests to the on-premises DNS server. The on-premises DNS handles all other name resolution requests, even requests for internet resources such as microsoft.com.

On-premises DNS: Forward requests for the virtual network DNS suffix to the custom DNS server. The custom DNS server then forwards to the Azure recursive resolver.

This configuration routes requests for fully qualified domain names that contain the DNS suffix of the virtual network to the custom DNS server. All other requests (even for public internet addresses) are handled by the on-premises DNS server.

If the remote network is another Azure Virtual Network, configure DNS as follows:

Custom DNS (in each virtual network):

Forward requests for the DNS suffix of the other virtual network to the custom DNS server in that network. The DNS server in each virtual network is responsible for resolving resources within its own network.

Forward all other requests to the Azure recursive resolver. The recursive resolver is responsible for resolving local and internet resources.

The DNS server for each network forwards requests to the other, based on DNS suffix. Other requests are resolved using the Azure recursive resolver.
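The forwarding decision for the on-premises case described above can be sketched in shell. This is only an illustration of the routing logic, not DNS server configuration: the function names and the on-premises DNS address 192.168.0.1 are hypothetical, and the sketch assumes that the DNS suffix is everything after the first label of a node's internal name, as in the example names shown earlier.

```shell
# Azure recursive resolver address, as given in this article.
AZURE_RESOLVER="168.63.129.16"

# dns_suffix_of FQDN
# Drops the first label (the host name) and prints the rest.
dns_suffix_of() {
  printf '%s\n' "${1#*.}"
}

# choose_forwarder NAME VNET_SUFFIX ONPREM_DNS
# Prints the server that the custom DNS server would forward NAME to.
choose_forwarder() {
  case "$1" in
    *".$2"|"$2") printf '%s\n' "$AZURE_RESOLVER" ;;  # name inside the virtual network
    *)           printf '%s\n' "$3" ;;               # everything else goes on-premises
  esac
}

suffix=$(dns_suffix_of wn0-hdinsi.0owcbllr5hze3hxdja3mqlrhhe.ex.internal.cloudapp.net)
choose_forwarder "wn2-hdinsi.$suffix" "$suffix" 192.168.0.1   # Azure recursive resolver
choose_forwarder microsoft.com "$suffix" 192.168.0.1          # on-premises DNS
```

For the second case (two virtual networks), the same idea applies with the targets changed: names under the other network's suffix go to that network's custom DNS server, and everything else goes to the Azure recursive resolver.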

Directly connect to Hadoop services

Most documentation on HDInsight assumes that you have access to the cluster over the internet, for example, that you can connect to the cluster at https://CLUSTERNAME.azurehdinsight.net. This address uses the public gateway, which is not available if you have used NSGs or UDRs to restrict access from the internet.

To connect to Ambari and other web pages through the virtual network, use the following steps:

To discover the internal fully qualified domain names (FQDN) of the HDInsight cluster nodes, use one of the following methods:

Network virtual appliances replicate the functionality of devices such as firewalls and routers. For more information, see the Network Appliances document.

As a managed service, HDInsight requires unrestricted access to Azure health and management services in the Azure cloud. When using NSGs and UDRs, you must ensure that these services can still communicate with HDInsight.

HDInsight exposes services on several ports. When using a virtual appliance firewall, you must allow traffic on the ports used for these services. For more information, see the Required ports section.

HDInsight with network security groups and user-defined routes

If you plan on using network security groups or user-defined routes to control network traffic, perform the following actions before installing HDInsight:

Forced tunneling

Forced tunneling is a user-defined routing configuration where all traffic from a subnet is forced to a specific network or location, such as your on-premises network. HDInsight does not support forced tunneling.

Required IP addresses

Important

The Azure health and management services must be able to communicate with HDInsight. If you use network security groups or user-defined routes, allow traffic from the IP addresses for these services to reach HDInsight.

If you do not use network security groups or user-defined routes to control traffic, you can ignore this section.

If you use network security groups or user-defined routes, you must allow traffic from the Azure health and management services to reach HDInsight. Use the following steps to find the IP addresses that must be allowed:

You must always allow traffic from the following IP addresses:

IP address       Allowed port   Direction
168.61.49.99     443            Inbound
23.99.5.239      443            Inbound
168.61.48.131    443            Inbound
138.91.141.162   443            Inbound

If your HDInsight cluster is in one of the following regions, then you must allow traffic from the IP addresses listed for the region:

Important

If the Azure region you are using is not listed, then only use the four IP addresses from step 1.

If you use a custom DNS server with your virtual network, you must also allow access from 168.63.129.16. This address is Azure's recursive resolver. For more information, see the Name resolution for VMs and Role instances document.

Example: network security groups with HDInsight

The examples in this section demonstrate how to create network security group rules that allow HDInsight to communicate with the Azure management services. Before using the examples, adjust the IP addresses to match the ones for the Azure region you are using. You can find this information in the HDInsight with network security groups and user-defined routes section.

Azure Resource Manager template

The following Resource Manager template creates a virtual network that restricts inbound traffic, but allows traffic from the IP addresses required by HDInsight. This template also creates an HDInsight cluster in the virtual network.

Azure CLI

Use the following steps to create a virtual network that restricts inbound traffic, but allows traffic from the IP addresses required by HDInsight.

Use the following command to create a new network security group named hdisecure. Replace RESOURCEGROUPNAME with the resource group that contains the Azure Virtual Network. Replace LOCATION with the location (region) that the group was created in.

az network nsg create -g RESOURCEGROUPNAME -n hdisecure -l LOCATION

Once the group has been created, you receive information on the new group.

Use the following to add rules to the new network security group that allow inbound communication on port 443 from the Azure HDInsight health and management service. Replace RESOURCEGROUPNAME with the name of the resource group that contains the Azure Virtual Network.

Use double-quotes around id in the command if you don't get the expected results.

Use the following command to apply the network security group to a subnet. Replace the GUID and RESOURCEGROUPNAME values with the ones returned from the previous step. Replace VNETNAME and SUBNETNAME with the virtual network name and subnet name that you want to create.

Once this command completes, you can install HDInsight into the Virtual Network.
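As a rough sketch of the rule-creation step above, the following shell function prints one az network nsg rule create command for each of the four IP addresses that must always be allowed. It only prints the commands (a dry run), so you can review them before running them. The rule names (hdirule1 through hdirule4), the priorities (300 through 303), and the choice of Tcp as the protocol are assumptions made for illustration. Add the region-specific addresses to the list if your region requires them.

```shell
# The four addresses listed under "Required IP addresses".
HDI_IPS="168.61.49.99 23.99.5.239 168.61.48.131 138.91.141.162"

# print_hdi_rules RESOURCEGROUPNAME
# Prints (does not run) one inbound allow rule on port 443 per address.
print_hdi_rules() {
  rg="$1"
  priority=300   # hypothetical starting priority
  n=1
  for ip in $HDI_IPS; do
    echo az network nsg rule create \
      --resource-group "$rg" --nsg-name hdisecure --name "hdirule$n" \
      --priority "$priority" --direction Inbound --access Allow --protocol Tcp \
      --source-address-prefixes "$ip" --source-port-ranges "'*'" \
      --destination-address-prefixes VirtualNetwork --destination-port-ranges 443
    priority=$((priority + 1))
    n=$((n + 1))
  done
}

print_hdi_rules RESOURCEGROUPNAME
```

To apply the rules, replace RESOURCEGROUPNAME and run each printed command.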

Important

These steps only open access to the HDInsight health and management service on the Azure cloud. Any other access to the HDInsight cluster from outside the Virtual Network is blocked. To enable access from outside the virtual network, you must add additional Network Security Group rules.

The following example demonstrates how to enable SSH access from the Internet:
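A minimal sketch of such a rule, assuming the hdisecure network security group created earlier: the function below prints (but does not run) an az network nsg rule create command that allows inbound traffic on port 22. The rule name ssh, the priority 306, and the use of the Internet service tag as the source are assumptions. In practice, you may prefer to restrict the source to a known address range.

```shell
# print_ssh_rule RESOURCEGROUPNAME
# Prints (does not run) a rule that allows inbound SSH (port 22) from the internet.
print_ssh_rule() {
  echo az network nsg rule create \
    --resource-group "$1" --nsg-name hdisecure --name ssh \
    --priority 306 --direction Inbound --access Allow --protocol Tcp \
    --source-address-prefixes Internet --source-port-ranges "'*'" \
    --destination-address-prefixes VirtualNetwork --destination-port-ranges 22
}

print_ssh_rule RESOURCEGROUPNAME
```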

Replace the 0owcbllr5hze3hxdja3mqlrhhe.ex.internal.cloudapp.net value with the DNS suffix of the other virtual network. This entry routes requests for the DNS suffix of the remote network to the custom DNS in that network.

On the custom DNS servers in both virtual networks, use the following text as the contents of the /etc/bind/named.conf.options file: