Troubleshoot your data center from the easy chair

Remote management is the key to maintaining your sanity when the bits hit the fan

InfoWorld|Nov 21, 2011

We've all been there: A relaxing evening suddenly turns into chaos with a single phone call, email, or text. There's something amiss at a remote site, and nobody's there to deal with it. Rather than indulging in a martini and a movie, you're hustling to your car to drive an hour to restart a server or perform a simple troubleshooting step that brings the site back online.

If only there were a way to reduce the likelihood of such an event. Oh, wait! There is.

I can't count the number of times I've seen servers, routers, switches, UPSes, and whatnot with remote monitoring and control facilities that either aren't configured or aren't even plugged in. This means more than just the network interface -- the serial interface should be remotely accessible as well, through a console server.

When shopping for IT gear, it's easy to drop those line items from the order because, technically, they're not critical to the device's purpose. In many cases they add to the cost of the device, but those expenses pale in comparison to the cost of the problems that can crop up if they're absent. Routers and switches have built-in serial consoles, so the only associated cost in allowing those devices to be managed out of band is the aforementioned console server. For other hardware -- such as servers, UPSes, AC units, and PDUs -- remote management interfaces are generally an add-on and provide their own CLI or Web-based interface.

Take a basic rack-mount UPS. You can buy one without the management card, but most vendors either include the management interface or offer it as an add-on card. For a few hundred dollars, you add a full monitoring package for the UPS with SNMP traps, email alerting, environmental monitoring, and so forth. Some models even have outlet groups that can be remotely switched on and off, providing the ability to remotely power-cycle gear that might be hung up.

At a site that's manned 24/7, this isn't as critical because an admin will always be there to provide hands-on troubleshooting. But at a remote site, those features can mean that you simply pull up a Web interface and power-cycle a hung router rather than drive for an hour to unplug a cable. One event like that can pay for the difference and then some.

Active monitoring has preventive value as well. You can use either built-in monitoring and notification tools or centralized monitoring packages -- such as Nagios, Zenoss, or Cacti -- to watch any device continuously and catch potential problems before they become real ones. This is especially valuable in connection with environmental sensors, predictive disk-failure alerts, and network infrastructure gear.
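To make that concrete, here is a minimal sketch of a threshold check written in the style of a Nagios plugin. The status codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) follow the standard plugin convention; everything else, including the function name, the default thresholds, and the assumption that something else has already fetched the reading from the sensor, is hypothetical and only for illustration.

```python
# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(reading_c, warn_c=27.0, crit_c=32.0):
    """Map a temperature reading (Celsius) to a (status, message) pair.

    The default thresholds are placeholders, not recommendations.
    A reading of None means the sensor could not be polled.
    """
    if reading_c is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading from sensor"
    if reading_c >= crit_c:
        status = CRITICAL
    elif reading_c >= warn_c:
        status = WARNING
    else:
        status = OK
    return status, f"TEMP {LABELS[status]} - {reading_c:.1f} C"


# A real plugin would print the message and pass the status to sys.exit();
# here we only show the mapping.
status, message = check_temperature(28.5)
print(status, message)
```

The point of the convention is that the monitoring server only needs the exit status to decide whether to alert, so the same small pattern works for a UPS, a disk, or a switch port.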

If there's a cooling problem, it will generally grow over a period of time, rather than instantly shoot up to dangerous levels. Constantly monitoring temperature and humidity at several places in your data center can sound an early warning that means the difference between life and death for your servers when a chiller goes out. If your core switching is logging to a syslog server that can parse the logs, it can raise the alarm when a switching module starts to go bad, rather than leaving you to find out when a bunch of servers or critical telecom gear suddenly disappears from the network.
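Because a cooling failure shows up as a trend rather than a spike, the rate of rise can be as useful as the absolute reading. The sketch below fits a least-squares line through recent samples and estimates how long until a limit is reached. The function names, the sample format of (minutes, degrees Celsius) pairs, and the 32-degree limit are all assumptions for illustration, and the readings are made up.

```python
def rate_of_rise(samples):
    """Least-squares slope of (minutes, temp_c) samples, in degrees C per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def minutes_until(samples, limit_c):
    """Estimate minutes until limit_c is reached if the current trend holds.

    Returns None when the temperature is flat or falling.
    """
    slope = rate_of_rise(samples)
    if slope <= 0:
        return None
    _, latest = samples[-1]
    return max(0.0, (limit_c - latest) / slope)


# Made-up readings: one degree of rise every five minutes.
readings = [(0, 22.0), (5, 23.0), (10, 24.0), (15, 25.0)]
print(rate_of_rise(readings), minutes_until(readings, limit_c=32.0))
```

With these invented readings the room is warming by about 0.2 degrees per minute, which leaves roughly 35 minutes before it hits the limit: enough time to act remotely, provided the alert fires on the trend and not only on the threshold.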

You can never have too many eyes and ears in a data center. If you think that some form of monitoring is overkill, it's probably just enough. Motion-detecting cameras, environmental sensors, fluid presence sensors, and even remotely accessible microphones and speakers can be invaluable tools when the chips are down.

It boils down to one simple rule: If it goes into the room, you should be able to control it from outside the room. That applies to everything from a cable modem to a core switch. Even devices that have no console or management facilities can at least be plugged into switched PDUs so that they can be power-cycled remotely. For example, primary or backup cable and DSL circuits generally come with cheaper routers that occasionally need to be kicked over.
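If you wanted to go a step further and automate that kick, one possible shape is a small watchdog that probes the device and power-cycles its outlet after several consecutive failures. This is only a sketch under stated assumptions: the class name and the failure count are invented, and `power_cycle` is a hypothetical callable standing in for whatever control method your particular PDU exposes, which this example does not implement.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class PowerCycleWatchdog:
    """Power-cycle an outlet after several consecutive failed probes.

    `probe` and `power_cycle` are injected callables. `power_cycle` is a
    hypothetical stand-in for the PDU's own control interface.
    """

    def __init__(self, probe, power_cycle, max_failures=3):
        self.probe = probe
        self.power_cycle = power_cycle
        self.max_failures = max_failures
        self.failures = 0

    def tick(self):
        """Run one probe; return True if a power cycle was triggered."""
        if self.probe():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.power_cycle()
            self.failures = 0  # start counting again while the device reboots
            return True
        return False
```

Requiring several failures in a row keeps a single dropped probe from rebooting a healthy router. In practice `probe` might wrap `is_reachable` with the router's address and management port, and `tick` would be called on a timer from a machine that reaches the PDU over a path that does not depend on the router being watched.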

Chances are you have remote access tools in your gear already, but they're not configured or plugged in. Allow me to politely suggest you take a few minutes to fix that right now.