Introducing high-availability and multi-subnet scenarios with SQL Server on Linux

In my first blog post about SQL Server on Linux, I introduced the new high-availability feature, which so far concerns only SQL Server failover cluster instances. During this discovery time I had the support of Mihaela Blendea (@MihaelaBlendea) at Microsoft to clarify some aspects of this new kind of architecture, and I would first like to thank her: it is always a great pleasure to get this kind of availability from the Microsoft team. After completing the installation of my SQL Server FCI environment on Linux, I was interested in reproducing it in a more complex scenario, a multi-subnet failover cluster, as I sometimes see at customer sites. The installation process will surely change over time, and this post is of course not intended as official documentation. It is only an exercise that is part of my Linux immersion experience.

So I decided to evolve my current architecture (two cluster nodes with Pacemaker on the same subnet) by introducing a third node on a different subnet. Here is a picture of the architecture I wanted to install.

So basically, referring to my previous architecture, the tasks to perform were as follows:

Make the initial heartbeat configuration redundant. Even if redundant network paths are nowadays mostly handled by modern infrastructures and virtualization layers, I still believe it is a best practice to make the heartbeat redundant at the cluster level in order to avoid unexpected behaviors such as split-brain scenarios (a real risk with only two nodes, as in this case). I will have the opportunity to talk about quorum matters in a future post.

Introduce a third node on a different subnet into the existing architecture and then add it to the cluster. You may follow the Microsoft documentation to perform this task. The main challenge here was to add the third node in a multi-subnet scenario and to ensure that the communication paths between cluster nodes work well on both networks (public and private).

Find a way to make the existing SQL Server FCI resource multi-subnet compliant. By that I mean getting the same kind of behavior we have with a WSFC on Windows when the resource fails over to a node on a different subnet. On Windows, we would configure an OR-based resource dependency that includes a second virtual IP address.

Check that applications are still able to connect after a multi-subnet failover event.

You may notice that I didn’t introduce redundancy at the storage layer. Indeed, the NFS server becomes a single point of failure (SPOF), but I didn’t want to make the architecture more complex for the moment. In a more realistic scenario at a customer site, this aspect would probably be covered by a storage vendor solution.

So let’s begin with the heartbeat configuration. In my existing infrastructure, only one ring was configured and it ran on top of the eth0 interfaces of both nodes (respectively 192.168.5.17 for the linux01 node and 192.168.5.18 for the linux02 node).

At this point I bound each new IP address to a corresponding hostname. We may store the new configuration either in the /etc/hosts file or in the DNS server(s).
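For reference, the /etc/hosts entries might look like the following sketch. Only the public (eth0) addresses of linux01 and linux02 appear earlier in this post; the heartbeat addresses and the third-node addresses below are placeholders for illustration.

```text
# /etc/hosts (on each node) -- heartbeat and linux03 addresses are examples only
192.168.5.17    linux01
192.168.5.18    linux02
192.168.50.19   linux03          # assumed address on the second public subnet
192.168.20.17   linux01-hb       # assumed eth1 heartbeat addresses
192.168.20.18   linux02-hb
192.168.60.19   linux03-hb
```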

Then I updated corosync.conf on both nodes by adding the new ring configuration as follows. Keep in mind that configuration changes are not synchronized automatically across nodes as they are with Windows failover clusters. To enable the redundant ring protocol, I set the rrp_mode parameter to active for both network interfaces (eth0 and eth1) and added a new ring address (ring1_addr) for each node.
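As a sketch, the relevant parts of /etc/corosync/corosync.conf could look like this; the ring1 addresses are the hypothetical heartbeat addresses, not values taken from this post:

```text
totem {
    version: 2
    rrp_mode: active                 # run the redundant ring protocol on both rings
    ...
}
nodelist {
    node {
        ring0_addr: 192.168.5.17     # linux01, eth0 (public)
        ring1_addr: 192.168.20.17    # linux01, eth1 (private heartbeat) - assumed
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.5.18     # linux02, eth0 (public)
        ring1_addr: 192.168.20.18    # linux02, eth1 - assumed
        nodeid: 2
    }
}
```

Once corosync has been restarted on every node, `corosync-cfgtool -s` shows the status of both rings.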

Obviously, the communication paths only worked after correctly configuring the routes between nodes on different subnets. The default gateways were already configured for the eth0 interfaces, but we have to add static routes for the eth1 interfaces as shown below:
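As an illustration (the private subnets and gateway addresses are assumptions, since only the public addresses are given in this post), the static routes could be added like this on linux01 and linux02:

```bash
# Reach the remote private subnet through the local eth1 gateway (addresses assumed)
sudo ip route add 192.168.60.0/24 via 192.168.20.1 dev eth1

# Make the route persistent across reboots (RHEL/CentOS network-scripts style)
echo "192.168.60.0/24 via 192.168.20.1 dev eth1" | sudo tee /etc/sysconfig/network-scripts/route-eth1
```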

Well, after checking that everything was OK from the cluster side, the next challenge was to find a way to make the SQL Server FCI resource multi-subnet compliant. As stated by Microsoft, the SQL Server FCI is not coupled to the Pacemaker add-on as tightly as it is to a Windows failover cluster. Based on my Windows failover experience, I wondered whether I had to take the same approach with the Pacemaker cluster on Linux, so I tried to find a way to add a second VIP and include it in an OR-based dependency, but I found nothing on this topic. However, Pacemaker offers concepts, including location / colocation constraints and scores, that control how resources behave during failover events. My intention is not to go through the Pacemaker documentation in detail, but by playing with these three concepts I was able to address the need. Again, please feel free to comment if you have a better method to meet this requirement.

Let’s first add a second virtual IP address for the 192.168.50.0 subnet (virtualipdr) and then add a new colocation constraint between it and the SQL Server resource (sqllinuxfci).
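With pcs, this step could be sketched as follows. The resource names follow the ones used in this post; 192.168.50.20 is an assumed free address on the second subnet:

```bash
# Create the second virtual IP on the 192.168.50.0 subnet (address assumed)
sudo pcs resource create virtualipdr ocf:heartbeat:IPaddr2 ip=192.168.50.20 cidr_netmask=24

# Force virtualipdr to run on the same node as the SQL Server FCI resource
sudo pcs constraint colocation add virtualipdr with sqllinuxfci INFINITY
```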

Now, to avoid starting the virtualip or virtualipdr resources on the wrong subnet, let’s configure an “opt-out” scenario: the cluster stays symmetric, so resources are allowed to run everywhere, and location constraints prevent a resource from running on a specific node.
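A minimal sketch of this “opt-out” configuration with pcs, assuming linux03 is the name of the node on the second subnet:

```bash
# Opt-out: resources may run anywhere by default...
sudo pcs property set symmetric-cluster=true

# ...except on nodes where their subnet does not exist
sudo pcs constraint location virtualip avoids linux03
sudo pcs constraint location virtualipdr avoids linux01 linux02
```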

In turn, virtualipdr was brought online and virtualip was taken offline, because we are now located on the 192.168.50.0 subnet. Here we go!

OK, at this point our SQL Server failover cluster instance seems to behave as expected, but how do we deal with client connections in this case? Referring to previous Windows failover cluster experiences, I can think of two scenarios involving the DNS servers.

We are able to use the SqlClient / JDBC / ODBC support for high availability with the MultiSubnetFailover parameter in the connection string. In this case, good news: we may simply register both addresses for the corresponding DNS record and the magic operates by itself (similar to the RegisterAllProvidersIP property with availability groups). The client automatically reaches the first available address and everything should be fine.
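For example, with sqlcmd (the DNS name sqlfci and the credentials below are placeholders), the -M switch enables this multi-subnet failover behavior; an application would instead set MultiSubnetFailover=True in its connection string:

```bash
# -M tells the client to try all resolved addresses instead of only the first one
sqlcmd -S sqlfci -U sa -P '<password>' -M -Q "SELECT @@SERVERNAME"

# Equivalent ADO.NET-style connection string:
# Server=sqlfci;Database=mydb;User Id=sa;Password=<password>;MultiSubnetFailover=True
```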

We cannot modify or use the MultiSubnetFailover parameter, in which case we may set the TTL value manually for the corresponding DNS record (similar to the HostRecordTTL parameter with availability groups). We will experience timeout issues on the first connection attempt after a failover, but the second one should work.
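In a BIND-style zone file this would mean registering the active address with a short TTL, for instance (the record name and address are hypothetical):

```text
; 30-second TTL so clients re-resolve quickly after a failover
sqlfci    30    IN    A    192.168.5.20
```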

Other scenarios? Please feel free to comment.

In my lab environment, SqlClient-based connections seem to work well in the aforementioned cases. I will perform further tests in the near future and update this blog accordingly with the results.

I’m looking forward to seeing other improvements and features in the next SQL Server CTPs.