This week on a project I came up against the need to set up some AlwaysOn Availability Groups (AG's) for SQL Server 2014 in an Azure IaaS environment, using v2 (or Resource Manager/ARM) based VMs as opposed to the v1 (or Service Manager/ASM) based ones. I came across varying bits of documentation to do with this, but it either didn't have the ARM PowerShell cmdlets or it was a mix and match of scripts and UI actions which didn't help me scripting the deployment, so after working through this, there are a few things that I learned.

The challenges

For the most part, the process of actually setting up the availability groups within the VMs was more or less the same as doing it on-premises so I wont go in to that. The issue I came across was that after I have provisioned my AGs and created listeners for them I found that I couldn't open SQL connections to the servers through the listener, but I could if I opened connections directly to each server. Name resolution was working fine and I could see that the DNS was in place to resolve to the listeners IP address, but the host was unreachable. After a bit of digging it turns out that to use AGs in Azure IaaS you need to put an internal load balancer in front of them, which will create an IP address on the virtual network which will allow for the Azure virtual network to know which VM to route traffic to correctly (more or less, I'm a little sketchy on the inner workings of the networking in Azure - but this is documented as the solution to AAGs in Azure).

Creating the internal load balancer

For the purposes of keeping this post brief and focussed on the ILB setup, lets assume that you already have your two (or more) SQL servers provisioned and you have configured the availability groups the same as you would on-prem. Below is the script that I used to create the load balancer.

Create a front end IP address for the load balancer. This should be the same IP address that you use for your listener (this makes the assumption that you are working on a single subnet, I'll get to multi-subnet in a bit)

Creates a "pool" to put the servers in that the load balancer will refer to. This should be every server in the availability group that shares the same subnet as the above IP address

Creates a probe to check what VMs are currently responding, so that the network can direct traffic appropriately

Defines a load balancer rule which specifies the ports to handle traffic on

Takes all of the above objects and then creates the load balancer (nothing actually gets created in your tenant up to this point)

For each of the specified VMs, it gets their first network adaptor and puts it in the backend pool that was created earlier

Restart each VM so that the new config can apply

There is one thing that I want to highlight from the above script - the "EnableFloatingIP" switch on the New-AzureRmLoadBalancerRuleConfig cmdlet. Without this turned on for your rule you'll find that all traffic hitting your availability group will just time out. By enabling floating IPs and letting Azure move the IP address around based on it detecting that servers are down you'll see that traffic begins to actually flow correctly.

Next we need to make a couple of changes to the Windows Failover Cluster so that it plays nice with the load balancer that we just created. Here is the script that you should run on the active node in your cluster:

Just swap in the name of the IP address resource in your cluster here and you're good to go (look it up using Get-ClusterResource with no parameters to see a full list of the resources in your cluster). The thing that threw me a little here was the subnet mask - the subnet for my lab in this case was 255.255.255.224, so I originally tested with that in there, but I wasn't able to fail over as all the traffic was still being routed to the old primary. This is because the ILB created earlier was not able to poll correctly. Once this subnet mask was switched over I was able to see successful failover take place with the ILB routing traffic to the new primary within 15 seconds of me triggering the manual failover in SQL. So once you've got this sorted out you are ready to go. Failing over the AG works just the same as it always did, and your traffic should flow to the new primary from there.

Multi-subnet considerations

Now everything I have done so far is all well and good for running your AG on the one subnet, but what happens when there are nodes in your cluster that are in a separate subnet? The whole plan goes a little bit south because the internal load balancers in Azure are basically tied to a single subnet by the front end IP and in a failover cluster for SQL that uses multiple subnets your cluster will usually have at least two IP address resources, one for each subnet, and the appropriate address resource gets brought online as the AG fail over the cluster to a node on the other subnet. This means that for this scenario, you need to create an extra internal load balancer for each subnet (in the lab I had set up I was using 4 SQL servers, 2 in subnet A and 2 in subnet B). Each ILB I created had a front end address that was for the subnet I created it in, and referred to only the VMs in that subnet. With the additional ILBs in place, I was able to start failing SQL over to the second subnet and see things come together the way I needed.

So there you have it - hopefully that will help you get going with AGs in Azure with Resource Manager cmdlets!