The Long-Distance LAN

Linking data centers for high availability is tricky. We have the plan you need.

Application failure can be pricey, particularly when it's a business-critical system. One uptime strategy is to create a data center interconnect (DCI) link, so that if a failure occurs in one data center, the application can continue to run in the second.

There are two approaches to making an application highly available via a DCI. First, you can set the application to be active in one data center and on standby in the second. In case of a problem at the first site, the application can switch over to the second data center and remain active. Hypervisor technologies like VMware's vMotion, which lets virtual machines move from one physical server to another, can assist in this process.
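The active/standby approach can be sketched as a small piece of failover logic. This is a minimal illustration, not how any particular hypervisor or cluster manager does it: the site names `dc1` and `dc2`, the `ActiveStandbyPair` class, and the three-failure threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActiveStandbyPair:
    """Tracks which of two sites currently serves the application."""
    active: str = "dc1"          # hypothetical site names
    standby: str = "dc2"
    failure_threshold: int = 3   # assumed: consecutive failed checks before failover
    _failures: int = 0

    def record_health_check(self, healthy: bool) -> str:
        """Record one health-check result for the active site; return the active site."""
        if healthy:
            self._failures = 0
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Promote the standby site and demote the failed one.
                self.active, self.standby = self.standby, self.active
                self._failures = 0
        return self.active


pair = ActiveStandbyPair()
for result in (True, False, False, False):
    site = pair.record_health_check(result)
print(f"application now active in: {site}")
```

Requiring several consecutive failures before switching is one way to avoid failing over on a single dropped health check; the right threshold depends on the application.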

The second option is to synchronize the application so that it runs simultaneously in both data centers. Technologies such as clustering, sharing, and storage replication can help you synchronize. However, many clustering and replication technologies depend on sharing a single Ethernet network, and expect to unicast, multicast, or broadcast Ethernet data to all elements--servers, databases, and storage--in the cluster.

The problem here is that while Ethernet works well for a few hundred meters over copper in data centers, or even a few kilometers over fiber, beyond that you run into technical hurdles, including latency and bandwidth challenges, that make building a DCI difficult. Carriers have introduced services, such as virtual private LAN service (VPLS), that are supposed to help IT solve some of these problems, but most have serious implementation limits and are often ill-suited to supporting highly available applications.

Still, there are ways around these challenges and some innovative alternatives for building a DCI. Your best options--which, as is often the case, are also the most expensive--are techniques such as multichassis link aggregation using dark fiber and dense wavelength division multiplexing (DWDM) services.

Latency Problems

Latency is a significant problem with few good solutions. There are three primary causes of latency, but the most significant and intractable is distance. The farther a signal must travel, the longer it takes to propagate through the provider's network. The most common baseline for acceptable latency between data centers is based on VM migration specs, such as those for vMotion for VMware vSphere servers. VMware states that there must be less than 5 milliseconds of round-trip latency between source and target servers. The practical upshot is that data centers cannot be more than 75 kilometers apart if you expect reliable operation of VM migration; 50 km is even better.
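To get a feel for how distance translates into delay, the sketch below estimates propagation time over fiber. It rests on assumptions not stated in the article: a refractive index of about 1.47 for the fiber, a perfectly direct fiber route, and a reading of the 5 ms figure as a round-trip budget. Real paths are longer than the straight-line distance and add equipment delay on top.

```python
# Rough propagation-delay estimate for a fiber link (illustrative assumptions only).
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBER_REFRACTIVE_INDEX = 1.47          # assumed typical value for single-mode fiber


def one_way_delay_ms(fiber_km: float) -> float:
    """Propagation delay in milliseconds over fiber_km kilometers of fiber."""
    speed_in_fiber_km_per_s = SPEED_OF_LIGHT_KM_PER_S / FIBER_REFRACTIVE_INDEX
    return fiber_km / speed_in_fiber_km_per_s * 1000.0


def round_trip_delay_ms(fiber_km: float) -> float:
    """Propagation delay for a signal to travel out and back."""
    return 2 * one_way_delay_ms(fiber_km)


for km in (50, 75, 500):
    print(f"{km:>4} km: {round_trip_delay_ms(km):.2f} ms round trip (propagation only)")
```

Under these assumptions, propagation alone stays well below 5 ms at 50 to 75 km. One plausible reading is that the distance guideline leaves headroom for indirect fiber routes and for delay added by network equipment, though the article does not spell out that breakdown.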

Latency also affects storage replication, especially synchronous replication, where the data-block write must be duplicated between sites within 5 to 10 milliseconds, depending on your recovery point and recovery time objectives for the application in question.
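The effect on synchronous replication can be approximated with a simple model. The sketch below assumes the local and remote writes are issued in parallel and that the application's write completes only when both have been acknowledged; actual storage arrays may behave differently. The function names and sample timings are hypothetical.

```python
def sync_write_latency_ms(local_write_ms: float,
                          link_rtt_ms: float,
                          remote_write_ms: float) -> float:
    """Latency the application sees for one synchronously replicated write.

    Assumes local and remote writes start together and the write is
    acknowledged only after both have completed.
    """
    # The remote copy costs one round trip on the link plus the remote disk write.
    remote_path_ms = link_rtt_ms + remote_write_ms
    return max(local_write_ms, remote_path_ms)


def within_budget(latency_ms: float, budget_ms: float) -> bool:
    """True if the write latency fits the replication budget."""
    return latency_ms <= budget_ms


latency = sync_write_latency_ms(local_write_ms=1.0, link_rtt_ms=2.0, remote_write_ms=1.5)
print(f"write latency: {latency:.1f} ms, within 5 ms budget: {within_budget(latency, 5.0)}")
```

In this model, every millisecond of link round-trip time is added directly to each replicated write, which is why synchronous replication is so sensitive to distance.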

Another, less obvious, cause of latency is the fact that carrier networks use tunneling protocols, such as MPLS, ATM, and even SONET. A particular problem with MPLS networks is that the carrier can't guarantee--and may not even know--the path any given packet will take between two points in its network. Carrier networks may hop through several nodes within a city, adding milliseconds of processing latency while the Ethernet frame is forwarded.
