Keep IT up

By Drew Robb

Oct 31, 2007

What would it take for your data center to go offline? For one organization, all it took was some routine cleaning.

In 2000, the Cincinnati State Technical and Community College was installing new servers and storage as part of an Active Directory migration. Just before the equipment went live, however, the janitor plugged in a vacuum cleaner, which was enough to blow out the transformer serving the entire wing.

It turned out that the science department had also installed some equipment on the wing, and, unbeknownst to the systems administrators, the building was maxing out its electricity capacity, putting its information technology operations at risk.

Although few data centers are built to withstand a direct nuclear attack, services should continue despite floods, fires, hurricanes and blackouts. But outages still hit, and surveys show that most of the time they are because of acts of incompetence rather than acts of God.

And individual mistakes are not the only cause.

"The problems in the data center are mostly caused not by technical issues but by institutional and financial issues," said Jonathan Koomey, a scientist at Lawrence Berkeley National Laboratory and consulting professor at Stanford University who works on data center power issues. "Most budgets for IT equipment are separate from the budget for the facilities, the infrastructure and the utility bill."

The key to keeping the data center running is redundancy, both in terms of equipment and strategy. Here are a few options for boosting uptime.

Keep current

The first step is to have a good idea of what power and cooling systems are in place. Overall energy availability is a problem: the information technology analyst firm Gartner said half of all data centers will start running short of power in the next year.

This is not just a problem with overall power consumption. Most outages hit particular pieces of equipment in the data center, not the entire center. Just as one tracks changes in the servers, power and cooling resources need to be audited regularly.

"A data center is a very evolving and fluid environment, so there is a lot of change management that data center managers and [chief information officers] need to be aware of," said Elaine Wilde, senior vice president of the public-sector unit at Lee Technologies, which assesses, designs, builds, maintains, monitors and relocates data centers for dozens of federal customers.

"It is very important that you do assessments of your physical infrastructure, much like you do assessments of your storage and processing capacity or applications that are mission-critical to running the business of the agency," she said. Wilde recommends doing an annual assessment. "It's like getting a health checkup."

Even if the data center was designed with plenty of power and cooling five years ago, it probably wasn't designed to handle racks of today's servers, which have more processors and are squeezed into smaller form factors. As you add new racks of blades, you may have to recalculate power requirements.
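To make that recalculation concrete, here is a minimal back-of-the-envelope sketch. The server counts and per-server wattages are illustrative assumptions, not measurements from any vendor or from the sites in this story:

```python
# Rough sketch: estimate the power draw of a rack before and after a
# blade upgrade. All wattages, counts and the overhead factor below
# are illustrative assumptions.

def rack_power_kw(servers_per_rack, watts_per_server, overhead_factor=1.2):
    """Estimated rack draw in kW. overhead_factor covers fans, in-rack
    switches and power-supply inefficiency (an assumed figure)."""
    return servers_per_rack * watts_per_server * overhead_factor / 1000.0

# An older rack: 20 1U servers at a nominal 300 W each.
legacy = rack_power_kw(20, 300)
# A denser blade rack: 64 blades at a nominal 350 W each.
blades = rack_power_kw(64, 350)
print(f"legacy rack ~{legacy:.1f} kW, blade rack ~{blades:.1f} kW")
```

Even with hypothetical numbers, the point survives: the same floor tile can go from roughly 7 kW to well over 25 kW, which is why the power budget has to be revisited with every refresh.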

Managing expectations

Any data center manager wants 100 percent reliability. Building a top-notch facility costs money, however. The Uptime Institute, an industry group that offers data center best practices, estimates that the most reliable facility, what it calls a Tier IV facility, requires $22,000 of power and cooling infrastructure for every kilowatt that gets used for processing. And those power needs keep increasing.
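The arithmetic behind that estimate is simple but sobering. A quick sketch, using the article's $22,000-per-kilowatt figure and an assumed 500 kW IT load for illustration:

```python
# Sketch of the Uptime Institute figure quoted above: $22,000 of power
# and cooling infrastructure per kW of IT load for a Tier IV facility.
# The 500 kW example load is an assumption for illustration.
TIER_IV_COST_PER_KW = 22_000  # dollars per kW, from the article

def infrastructure_cost(it_load_kw, cost_per_kw=TIER_IV_COST_PER_KW):
    """Estimated power/cooling build-out cost for a given IT load."""
    return it_load_kw * cost_per_kw

print(f"${infrastructure_cost(500):,}")  # $11,000,000
```

A modest 500 kW facility built to Tier IV thus carries an eight-figure infrastructure bill before a single server is purchased.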

Rather than build absolute power redundancy into a single data center and achieve 99.999 percent electrical reliability at considerable cost, it might be better to have a Tier II or Tier III primary facility with a backup data center it can fail over to. You can also target critical parts of the electrical infrastructure that are cost-effective to address and recognize that you may experience some downtime.
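It helps to translate those availability percentages into minutes. The sketch below does that conversion; 99.999 percent is the "five nines" figure quoted above, and 99.982 percent is a commonly cited Tier III target used here only for comparison:

```python
# Convert availability percentages into expected downtime per year.
# 99.982% is a commonly cited Tier III availability target; treat both
# figures as illustrations, not guarantees.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct):
    """Expected unavailable minutes per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

print(f"99.999%: {downtime_minutes(99.999):.1f} min/yr")  # 5.3
print(f"99.982%: {downtime_minutes(99.982):.1f} min/yr")  # 94.6
```

The gap between roughly five minutes and roughly ninety minutes a year is what the extra tiers of redundancy buy, and whether that is worth the cost depends on what a failover site can absorb.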

This is the approach used by the Defense Department's Aeronautical Systems Center's Major Shared Resource Center (ASC MSRC) at Wright-Patterson Air Force Base, Ohio, which houses several supercomputers. The center has an uninterruptible power supply (UPS) battery system that gives it 20 to 30 minutes to shut down the servers or ride out a short power outage. Technical Director Jeff Graham said some of the other MSRCs have larger battery and diesel generator systems that keep their systems running.
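A ride-through window like that 20 to 30 minutes falls out of a simple ratio of battery energy to load. The battery capacity, load and derating factor below are illustrative assumptions, not figures from the ASC MSRC:

```python
# Rough UPS ride-through estimate: how many minutes a battery bank can
# carry a load. All figures below are illustrative assumptions.

def ride_through_minutes(battery_kwh, load_kw, usable_fraction=0.8):
    """usable_fraction discounts depth-of-discharge limits and inverter
    losses (an assumed figure, not a vendor spec)."""
    return battery_kwh * usable_fraction / load_kw * 60

# e.g., 200 kWh of batteries behind a 400 kW compute load:
print(f"~{ride_through_minutes(200, 400):.0f} minutes")  # ~24
```

The same arithmetic run in reverse tells you how much battery you must buy for a target shutdown window, which is where the cost comparison with diesel generators begins.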

"It is becoming very expensive to do that because these systems are going almost exponential in terms of the cooling and power they require," Graham said. "We have taken this other innovative approach to try to bring in the same kind of availability and reliability without all the diesel generator activity."

In March, the ASC MSRC will be installing a dual-feed, high-speed switch so that if one of the large substations goes out, it will switch to the other substation. That won't solve the problem if both substations go down, but it doesn't have the high costs associated with buying and maintaining a generator for those rare occasions when it is needed.

Keeping watch

Power and cooling have traditionally been the province of facility managers, but data center managers are gaining a greater ability to monitor and manage these areas. The use of Power over Ethernet and intelligent building systems is the catalyst for that trend. These replace discrete, proprietary monitoring and control systems with ones that communicate using the standard network and protocols.

ASC MSRC, for example, uses the open-source Nagios network-monitoring program for its air handlers, chillers and power distribution units (PDUs), in addition to environmental sensors placed throughout the floor.
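A Nagios object definition for that kind of infrastructure monitoring might look like the following sketch. The host name, SNMP OID and thresholds are placeholders for illustration, not values from the ASC MSRC configuration:

```cfg
# Hypothetical Nagios service definition polling a PDU's load via the
# standard check_snmp plugin. Host, OID and thresholds are placeholders.
define service{
    use                  generic-service
    host_name            pdu-row1-a
    service_description  PDU Output Load
    check_command        check_snmp!-C public -o 1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1 -w 80 -c 90
    check_interval       5
    }
```

The appeal of this approach is exactly what the article describes: the PDU is watched by the same scheduler, escalation rules and dashboards as the servers, instead of by a separate proprietary facilities console.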

There are also commercial products designed for data center infrastructure. Aperture Technologies' Vista provides visualization and real-time monitoring of the data center's physical infrastructure. American Power Conversion's Change Manager and Capacity Manager monitor items such as UPSes and PDUs and store the data in a centralized database.

In some cases, it works best to outsource the monitoring. According to Forrester Research, there is growing interest in sending remote infrastructure monitoring and management services to India, but it can also be done locally.

"What it breaks down to is ensuring the appropriate service-level agreements that you need for all your services from your vendors are in place and they are correctly and appropriately maintained," said Jerry Alexandratos, the Education Department's acting director of IT services.

No matter how good the redundancy or monitoring tools, uptime still comes down to people. When Gartner surveyed the causes of downtime a few years ago, only 20 percent were because of natural disasters or equipment failure. The rest were caused by people.

Procedures matter

"The operating procedures and practices you use to run your environment have a much larger effect than technology on overall availability," said John Curran, senior vice president of ServerVault, which provides hosted services for government agencies and businesses.

ServerVault runs a Tier III data center with complete redundancy of power, cooling and standby equipment. The facility has had 100 percent network and facilities uptime since its inception in 2001. Curran said that although facilities and network issues can cause major outages, they are not the typical cause. It is more likely to be things such as configuration.

"Improving availability is more than just adding more power or another switch to the network," he said. "For the majority of customers, it means getting a grip on what is in their environment and what has changed."