Monitoring Temperature in the Server Room

An introduction to monitoring the temperature in your server room, including some of the issues you will face and how you can solve them.

Requirements drive the solution

The density and location of temperature sensors in your server room will be dictated by precisely what you wish to achieve.

At one extreme you may be concerned that a recent upgrade to your server room has caused one or more hotspots. A hotspot is a location where the exhaust from one or more servers is being fed into the air inlet of another server. Plainly, the exhaust air from a server can be hotter than the maximum inlet temperature recommended by the server manufacturer.

At the other extreme is monitoring the temperature in a single location in order to detect air conditioner failure. Detecting air conditioner failure is a very common requirement given how important air cooling is within server room operations.

Precisely where your requirements fall between those two extremes depends upon your goals. A full environment mapping of your server room is helpful to identify hotspots and other design issues. A full environment mapping exercise is likely to be very expensive for a reasonably sized data centre, so it is likely to be done infrequently, usually only after a major server room re-organisation.

If you want to detect air conditioner failure, monitor the air conditioner...

One of the main drivers for monitoring the temperature in the server room is as a proxy for monitoring the health of your air conditioners. Air conditioners are crucial to the operation of any but the smallest server room. The consequences of air conditioner failure, or simply forgetting to switch them back on after server room maintenance, can be very serious and costly. Any servers not immediately affected by the heat emergency are likely to suffer a reduced lifetime with increased component failure.

Fortunately, you don't need to use rising server room temperatures as a proxy for your air conditioning system failing. There will always be a delay (unfortunately, not a very long delay) between an air conditioner failure and the heat rising in the server room, so it is better to monitor the air conditioner itself. Most air conditioning systems installed in server room environments have dry contact outputs indicating the operational state of the system.

A dry contact is a simple volt-free switch: a circuit that is either closed, so that a current is able to pass, or open. A dry contact can therefore indicate one of two states, either a normal state or an alarm state. If a device needs to indicate a number of different states it will supply a number of dry contact outputs. Your air conditioner documentation should tell you precisely which dry contact output indicates the air conditioner failure state.

Plainly, if you have more than one air conditioner, you need to monitor all of them. Many data centre or server room installations have primary, backup and tertiary air conditioning units. Monitoring each air conditioning unit means that you can gradually escalate the alarms. Failure of the primary air conditioner is worrying, but if the backup is working OK then it isn't a major disaster. If the backup air conditioner also fails then plainly things are much more worrying.
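The gradual escalation described above can be sketched in a few lines of Python. This is only an illustration: the severity names, the unit names and the idea that each dry contact has already been read into a simple True/False value are all assumptions, not features of any particular monitoring product.

```python
from enum import Enum


class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    EMERGENCY = 3


def escalation_level(failed_units: dict[str, bool]) -> Severity:
    """Map the number of failed air conditioners to an alarm severity.

    failed_units maps a (hypothetical) unit name to True when that
    unit's dry contact reports the failure (alarm) state.
    """
    total = len(failed_units)
    failed = sum(1 for in_alarm in failed_units.values() if in_alarm)
    if failed == 0:
        return Severity.OK
    if failed == total:
        return Severity.EMERGENCY  # no cooling left at all
    if failed == 1:
        return Severity.WARNING    # one unit down, the others still cooling
    return Severity.CRITICAL       # several units down, little margin left


if __name__ == "__main__":
    contacts = {"primary": True, "backup": False, "tertiary": False}
    print(escalation_level(contacts).name)
```

With three units, one failure raises a warning, two failures are critical and three are an emergency. A room with a single unit goes straight to the emergency level when that unit fails.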

What if I can't monitor the air conditioner directly?

The best proxy for an air conditioner failing is the temperature at the cold air inlet from the air conditioner into the server room cold aisle. Any failure by the air conditioner will be most noticeable here because the difference in temperature between the hot aisle and the cold aisle is greatest at the cold air inlet. Consequently, you can use a lower temperature threshold for your alarm than you might otherwise be able to elsewhere. Whilst you won't buy much time, because server room environments heat up very quickly after a total air conditioner failure, at least you will buy the maximum time possible.
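A threshold alarm on the cold air inlet could look something like the following Python sketch. The 24 °C threshold and the rule of three readings in a row are purely illustrative assumptions; pick values that suit your own room and sensor.

```python
def inlet_alarm(readings_c, threshold_c=24.0, consecutive=3):
    """Return True when the most recent `consecutive` readings all
    exceed threshold_c.

    readings_c is a list of cold air inlet temperatures in degrees
    Celsius, oldest first. Requiring several readings in a row avoids
    raising an alarm on a single noisy sample.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    recent = readings_c[-consecutive:]
    return len(recent) == consecutive and all(t > threshold_c for t in recent)


if __name__ == "__main__":
    print(inlet_alarm([18.0, 18.5, 25.0, 26.0, 27.0]))
```

Because the inlet is normally the coldest point in the room, the threshold here can sit well below anything you could sensibly use in the hot aisle.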

Sensor density

The main problem with monitoring temperature in a server room is knowing precisely where to place the sensors. Even the smallest server room without racks and cold/hot aisles won't have a single ambient temperature. The air at the air conditioner's cold air inlet is likely to be much colder than the air exhausted from the servers.

In larger server rooms, with a hot/cold aisle layout, the concept of a single location representing the temperature of the server room is nonsense. Consequently you will need to monitor the temperature in a number of locations. But precisely where should you be monitoring the temperature?

One way to inexpensively increase your sensor density is to use the temperature sensors built into many servers. Not all sensors are easily readable outside the servers but many servers do provide the environmental sensor readings via SNMP. You may well be able to monitor your server temperatures via your existing network monitor. Using the server's own internal sensors would be the perfect way to detect hotspots in your server room.
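Once the servers' own readings are being collected, spotting a likely hotspot is a matter of comparing each server against its neighbours. The Python sketch below assumes the inlet temperatures have already been gathered (for example by your network monitor over SNMP) into a simple name-to-temperature mapping; the server names and the 5 °C margin are hypothetical.

```python
from statistics import median


def find_hotspots(inlet_temps_c, margin_c=5.0):
    """Return the names of servers whose inlet temperature is more than
    margin_c above the median inlet temperature of all servers.

    inlet_temps_c maps a server name to its inlet temperature in
    degrees Celsius.
    """
    if not inlet_temps_c:
        return []
    baseline = median(inlet_temps_c.values())
    return sorted(
        name for name, temp in inlet_temps_c.items()
        if temp - baseline > margin_c
    )


if __name__ == "__main__":
    readings = {
        "rack1-a": 21.0,
        "rack1-b": 22.0,
        "rack2-a": 21.5,
        "rack2-b": 31.0,
    }
    print(find_hotspots(readings))
```

Comparing against the median rather than a fixed limit means the check still works if the whole room runs a little warmer or cooler than usual.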

Thermal shutdown

The best fallback in a server room thermal emergency is thermal shutdown by the servers. Many modern servers have the ability to monitor the temperature of their motherboard(s) and to shut themselves off when the temperature reaches a level where the hardware is at risk of damage.

Entering a data centre and finding that your servers have switched themselves off would be disconcerting, but it is a lot better than finding that lasting damage has been caused to the servers by a thermal emergency. Repairing the air conditioning system and switching the servers back on is far simpler than having to replace a lot of servers and hoping your backup system is completely up to date.
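Where a server has no built-in thermal shutdown, the same behaviour can be approximated in software. The following Python sketch shows the general idea only: read_temp_c and shutdown are placeholders that you would have to supply yourself (for example, a function that reads an external thermometer and a function that starts an orderly shutdown), and the 40 °C limit is an assumed value, not a recommendation.

```python
import time


def thermal_watchdog(read_temp_c, shutdown, limit_c=40.0, consecutive=3,
                     poll_seconds=10.0, sleep=time.sleep):
    """Poll read_temp_c() and call shutdown() once, as soon as
    `consecutive` readings in a row are at or above limit_c.

    Returns the reading that triggered the shutdown. The sleep
    argument exists only so the wait can be replaced when testing.
    """
    over = 0
    while True:
        temp = read_temp_c()
        # Count readings at or above the limit; any cooler reading resets the count.
        over = over + 1 if temp >= limit_c else 0
        if over >= consecutive:
            shutdown()
            return temp
        sleep(poll_seconds)


if __name__ == "__main__":
    samples = iter([30.0, 41.0, 39.0, 41.0, 42.0, 43.0])
    thermal_watchdog(
        read_temp_c=lambda: next(samples),
        shutdown=lambda: print("shutting down"),
        sleep=lambda seconds: None,  # no real waiting in this demonstration
    )
```

Requiring several hot readings in a row stops a single bad sample from switching a server off unnecessarily.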

Recommendations

Collect the requirements at the start and define exactly what you expect your monitoring system to achieve.

If you have servers that do not support thermal shutdown, add it using a USB thermometer with software capable of thermal shutdown.

If your servers have thermal sensors, use them. Read the information from your network monitoring application, log the readings and keep them for trending purposes. The logs can be very helpful after a server room reconfiguration to see if you've created any hotspots.

If you find a group of one or more servers shutting themselves down, you may have found a hotspot. Such a location may benefit from logging the temperature for a while to see if the shutdowns occur at the same time as a heat spike.
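Checking whether shutdowns coincide with heat spikes is easy to automate once both are logged. The Python sketch below assumes the temperature log is a list of (timestamp, temperature) pairs and that shutdown times are known; the 35 °C spike level and the ten minute window are illustrative assumptions.

```python
from datetime import datetime, timedelta


def shutdowns_after_spike(temp_log, shutdown_times, spike_c=35.0,
                          window=timedelta(minutes=10)):
    """Return the shutdown times that were preceded, within `window`,
    by at least one logged reading at or above spike_c.

    temp_log is a list of (datetime, temperature in Celsius) pairs.
    """
    matched = []
    for down in shutdown_times:
        if any(down - window <= ts <= down and temp >= spike_c
               for ts, temp in temp_log):
            matched.append(down)
    return matched


if __name__ == "__main__":
    log = [
        (datetime(2024, 1, 1, 12, 0), 22.0),
        (datetime(2024, 1, 1, 12, 5), 37.5),
        (datetime(2024, 1, 1, 15, 0), 23.0),
    ]
    downs = [datetime(2024, 1, 1, 12, 8), datetime(2024, 1, 1, 15, 2)]
    for when in shutdowns_after_spike(log, downs):
        print("shutdown after heat spike:", when)
```

A shutdown that does not match any spike is worth investigating for other causes, such as a power or hardware fault.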