The amazing adventures of Doug Hughes

High Availability (HA) should be the goal for any web application environment that is business- or mission-critical. The big question, though, is what constitutes acceptable HA? In this article we discuss the various levels of a total environment and what can and should be done at each level to ensure adequate redundancy at an affordable cost, relative to the overall budget and the business-critical nature of the web applications. This article assumes the use of a hosting company and that the equipment is co-located. Co-location typically means you own the servers and software and rent rack space from a hosting company, which typically supplies the firewalls, routers, switches, etc.

Let’s consider a typical dedicated or co-located web application infrastructure. Here is a diagram showing an entry-level infrastructure, which we will expand on as we go through this article.

1/ Basic Infrastructure (Little or no Redundancy)

I have encountered many infrastructures that look like this. It is no doubt a starting point when moving from a shared hosting environment. The weaknesses are fairly obvious: with only one of each key component, if one fails we have no service. There are, however, some things that can be done to introduce a degree of coverage and redundancy in this set-up.

Firstly, find out from the hosting provider what their Service Level Agreement (SLA) is for repairing or replacing their equipment should it fail (firewalls, switches, and servers; note that you will own the servers if this is a co-located arrangement). Make sure the SLA includes provisions for maintaining back-ups of all data, configuration files, etc. Once you know what the SLA says, you will know how long you could be without service in the case of a failure.

Secondly, consider which components in each server could be protected against failure. Multiple hard drives, for instance, can be configured so that if a single drive fails there is no loss of service. To do this we create RAID arrays. RAID is an acronym for Redundant Array of Inexpensive Disks. There are two main types of RAID array: mirrored arrays, where data is duplicated across drives (RAID 1 with a pair of drives, or RAID 10, which combines mirroring with striping across four or more drives), and parity arrays of three or more disks (RAID levels 5 and 6); this last type is typically used where there is a lot of data to be stored. Multiple network cards are also a good idea, as the network card is a critical component that is easy to add to a server. Power is also critical, so redundant power supplies per server are good as well. The items that are difficult to make redundant are things such as RAM, the motherboard, etc.
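To see why a parity array such as RAID 5 survives a single drive failure, here is a minimal Python sketch of XOR parity (illustrative only; real RAID lives in the controller or operating system, not application code):

```python
# Illustrative sketch of RAID 5's XOR parity, not a real storage implementation.
# XORing the surviving drives' blocks together reconstructs the lost block.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Three "drives" of data plus one parity "drive" (toy four-byte blocks)
drive_a = b"AAAA"
drive_b = b"BBBB"
drive_c = b"CCCC"
parity = xor_blocks([drive_a, drive_b, drive_c])

# Simulate losing drive_b: rebuild it from the remaining drives plus parity
rebuilt_b = xor_blocks([drive_a, drive_c, parity])
print(rebuilt_b == drive_b)  # True: the failed drive's data is recovered
```

The same property is why RAID 5 needs at least three disks: any one block in a stripe can be recomputed from all the others.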

2/ Basic Redundancy

Adding a second web/CF server is an important step; in fact, adding a second anything takes us into a different realm, adding a level of true redundancy. The important point to bear in mind here is to make sure there is replication between the two servers so that web site content is kept synchronized, allowing site visitors to be handled by either server seamlessly. We have also added a clustering device to handle all incoming traffic and to control which web/CF server site visitors go to. We shall cover clustering and load balancing in greater detail in a future article.
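As a rough picture of what the clustering device does, here is a toy Python sketch of round-robin distribution across the two web/CF servers, skipping a server marked as failed (server names are made up for illustration; real clustering devices do this in dedicated hardware or software):

```python
from itertools import cycle

# Hypothetical pool: the two web/CF servers from the diagram
servers = ["web-cf-1", "web-cf-2"]
healthy = {"web-cf-1": True, "web-cf-2": True}

rotation = cycle(servers)

def pick_server():
    """Return the next healthy server, round-robin; None if all are down."""
    for _ in range(len(servers)):
        candidate = next(rotation)
        if healthy[candidate]:
            return candidate
    return None

# With both servers up, visitors alternate between the two
print([pick_server() for _ in range(4)])  # ['web-cf-1', 'web-cf-2', 'web-cf-1', 'web-cf-2']

# If one server fails its health check, all traffic goes to the survivor
healthy["web-cf-1"] = False
print([pick_server() for _ in range(2)])  # ['web-cf-2', 'web-cf-2']
```

This only works seamlessly if the content replication described above keeps both servers serving identical sites.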

The comments regarding SLAs, RAID arrays, back-ups etc still apply.

3/ Ideal Basic Redundancy

In this configuration we have full redundancy at all levels of the infrastructure (please note, we are not explicitly showing network switches here, but they would be redundant also). The second devices (firewall, router, clustering device, database server) are in hot-standby mode, denoted by a red outline. This means they are in a fully operational state, fully replicated, and ready to take over from the primary device should it fail. Two other possible states are cold standby, meaning some configuration is still necessary before the device can take over, and fully operational, meaning the device takes traffic in parallel with its redundant partner.
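The hot-standby arrangement can be sketched as a simple decision based on health checks (a toy Python model with hypothetical device names, not vendor failover code; real pairs use heartbeat protocols between the devices):

```python
# Toy model of hot-standby failover: the standby is fully configured and
# replicated, so it can take over the moment the primary fails a health check.

class Device:
    def __init__(self, name, is_up=True):
        self.name = name
        self.is_up = is_up

def active_device(primary, standby):
    """Return the device that should carry traffic right now."""
    if primary.is_up:
        return primary
    if standby.is_up:   # hot standby: no extra configuration step needed
        return standby
    return None         # total outage of the pair

primary = Device("firewall-1")
standby = Device("firewall-2")

print(active_device(primary, standby).name)  # firewall-1

primary.is_up = False  # primary fails its health check
print(active_device(primary, standby).name)  # firewall-2 takes over
```

A cold standby would need an extra configuration step between the health-check failure and taking traffic, which is exactly the downtime hot standby avoids.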

The comments regarding SLAs, RAID arrays, back-ups etc still apply.

4/ Enterprise Level Redundancy

In this case, not only is each piece of equipment redundant, but so is the entire grouping of equipment. We are therefore covered not only in the event of a single piece of equipment failing but also in the event of a catastrophic failure of some kind.

If we locate these installations in different geographical locations, we have coverage in the event of major natural disasters. The critical items in this last kind of installation are keeping data and settings synchronized, and handling domain name resolution when the two installations are in different geographical locations. There are specialist companies who can help with those sorts of issues.
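Handling name resolution across two sites usually comes down to answering DNS queries with the address of a healthy site, using a short TTL so resolvers drop a failed site quickly. A hedged Python sketch of that decision (site names use made-up labels and documentation IP ranges; in practice this is done by the managed DNS failover services mentioned above):

```python
# Toy sketch of DNS-based geographic failover: hand out the primary site's
# address while it is healthy, otherwise fail over to the secondary site.
# Names and addresses are hypothetical (192.0.2.x / 198.51.100.x are
# reserved documentation ranges).

sites = [
    {"name": "datacenter-east", "ip": "192.0.2.10", "healthy": True},
    {"name": "datacenter-west", "ip": "198.51.100.10", "healthy": True},
]

def answer_query():
    """Return the IP to answer with, preferring the first healthy site.
    A short TTL (e.g. 60 seconds) lets resolvers pick up a change quickly."""
    for site in sites:
        if site["healthy"]:
            return site["ip"]
    return None  # both sites down: nothing sensible to answer

print(answer_query())          # 192.0.2.10 (east is healthy)
sites[0]["healthy"] = False    # east suffers a disaster
print(answer_query())          # 198.51.100.10 (fail over to west)
```

The hard part in real deployments is the health check itself and the data synchronization that makes the second site safe to send traffic to.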

The comments regarding SLAs, RAID arrays, back-ups etc still apply.

In this article we have attempted to lay out some scenarios for ensuring redundancy and HA. There are other approaches we could have recommended, but we based the article on real-world situations we have worked with on behalf of clients. In future articles we will cover parts of this in greater depth, focusing on clustering, database issues, etc.


Comments on: "Implementing High Availability" (12)

One additional point to consider is that many hosting providers (and enterprises) use some form of Network Attached Storage (NAS) in conjunction with (or in lieu of) server-based RAID arrays. This adds an additional level of complexity to the equation and can serve as an additional point of failure if a NAS unit goes down, which is why you should ask your provider whether they are using NAS and, if so, whether it is configured for HA.

Very good point, Rob. I have encountered that a few times. A follow-on question, if they do use a NAS, would be: how are they connecting to it? 100Mb Ethernet (not optimal), Gigabit Ethernet (better), or Fibre Channel (best)?

If you are doing anything e-commerce related, you need to evaluate the PCI DSS for how security will impact your network. General gist: add another couple of firewalls and segmented LANs, plus monitoring boxes for IDS/IPS, centralized logging, and storage. Lots of fun!

It seems to me the biggest challenges in HA are the seamless replication of the data (especially between geographically disparate locations, where bandwidth is an issue and the connection is not guaranteed to be up all the time) and handling DNS failover across multiple geographic locations. Any chance of some postings looking at how you address those in more detail?!

@Peter – in the case of DNS, I’d recommend outsourcing it to someone else for maximum distribution. I currently run a private nameserver inside our network and have configured everydns.net (free) and domainmonger.com (free if you register domains through them) to secondary my DNS server. Any changes I make are automatically sync’d to their networks. I have a total of 6 DNS servers from two providers in 5 locations serving DNS queries for my domains.