One Cool Machine

Design News held a special Tech Talk session with the SGI engineering team. If you missed it, no problem: you can still read the questions your colleagues asked the SGI team on computer design and thermal management, along with SGI's answers.

What do you get when you package up to 256 Intel® Itanium®2 processors into a single computer node? The answer is the architecture for one of the fastest supercomputers in the world—and one heck of a thermal-management challenge.

That's what engineers at Silicon Graphics (SGI) discovered as they put together the design for the new SGI® Altix®3700 Bx2 supercomputer, released to the market at the end of 2004. Their solutions for beating the heating: planarized design for unobstructed airflow, and new designs in fans and heat sinks. Along the way to finding those solutions, engineers also developed technology for helping customers dissipate heat from their computer rooms.

SGI's benchmarks indicate that the new Altix 3700 Bx2 supercomputer is 200 times faster than the competition.

SGI's goal in the design of the Bx2 was to radically upgrade its existing 3700, which had been on the market for a year. The 3700 was fast, but its processor density was not up to the competition's, and supporting only 32 processors per rack put the machine at a disadvantage. The new Bx2 would accommodate 64 processors per rack in a shared-memory design and incorporate SGI's new NUMAlink 4 Router ASIC, the switchboard that directs the flow of information among the processors. The ASIC would effectively double the computer's performance, and the shared-memory design would give every processor direct access to all the data in the system's memory. SGI says that clusters, the other common supercomputer design, produce I/O or networking bottlenecks that can slow processing or lose data.

"Our aim was to enable engineers to consider simulating problems of a size and complexity they never thought possible," says engineering team leader Steve Dean. Initial trials proved they had the right concept. In tests with the new design, engineers at Boeing found they could simulate an entire airframe at one time instead of just a wing. NASA, using software from ANSYS, solved a 117-million-degree-of-freedom simulation problem. Those examples were just the beginning, engineers believed. SGI's own benchmarks indicated the Bx2 was as much as 200 times faster than the competition.

Watts Happening

But, like an airplane flying at Mach speed, the Bx2 generated a tremendous amount of heat. Up to 1,000 watts of power had to be dissipated inside each 28-inch-by-17.5-inch-by-7-inch "brick," the SGI term for a modular physical packaging element containing all the electronics. There are eight computer bricks per rack in the system. To grasp the heat from 1,000W of power, imagine each brick producing the output of ten 100W light bulbs. Without cooling, component temperatures would reach hundreds of degrees Celsius in a matter of minutes, turning the machine into an expensive baking oven.
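For readers who want to check the arithmetic, the rack-level heat budget works out like this (a quick sketch using only the figures quoted above):

```python
# Back-of-envelope heat budget for the Altix 3700 Bx2, using the
# article's figures: 1,000W per brick, eight bricks per rack.
WATTS_PER_BRICK = 1_000
BRICKS_PER_RACK = 8
BULB_WATTS = 100  # the 100W light-bulb comparison used above

rack_watts = WATTS_PER_BRICK * BRICKS_PER_RACK
bulbs_equiv = rack_watts // BULB_WATTS

print(f"Heat per rack: {rack_watts} W (~{bulbs_equiv} x 100W bulbs)")
# → Heat per rack: 8000 W (~80 x 100W bulbs)
```

A full rack, in other words, throws off as much heat as 80 light bulbs packed into a telephone-booth-sized cabinet.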

There are many ways to get rid of heat from electronics assemblies. The most common are liquid cooling, air cooling, and radiation. Radiation is reserved mostly for space applications. Liquid cooling, while efficient, is a plumbing nightmare and very expensive. So, SGI engineers opted for air cooling, the least expensive yet highly effective cooling solution for hardware at these power levels.

A computer "brick" in the Altix 3700 Bx2 is a potted, modular physical packaging element containing all the electronics. The typical 28 x 17.5 x 7 inch brick dissipates 1,000W of power, equal to the heat from ten 100W light bulbs in the same space. Without cooling, component temperatures could climb to hundreds of degrees Celsius in a few minutes. Altix systems can have up to eight bricks per rack.

Keep the Air Moving

Their first step was to design a system that would provide the least possible interruption to the airflow that would cool the machine. Restrictions would cause heat to build up and back up. The solution: a planarized packaging concept. All components on a brick, including processors, memory, and routers, would be placed edge-on. With the hot air seeing only the edge of the components, the air would have relatively unrestricted flow. SGI had produced planarized designs before, so this step was easy.

With virtually no obstacles to block airflow, the next step was to move the air within the bricks. SGI works with several fan vendors, and chose the German fan manufacturer ebm Papst for the Altix 3700 Bx2 project. Using SGI's Pro/ENGINEER models of the supercomputer design, ebm provided three 127-mm high-performance tube-axial fans for each brick. With these fans, air is drawn over the fan blades and the air discharge is parallel with the motor shaft. The fans perform at a higher airflow and lower acoustic level than competing fans of the same size.
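The fan sizing can be sanity-checked with the steady-flow energy balance Q = m·cp·ΔT. The sketch below assumes a 15°C air temperature rise through the brick; that rise is an assumption for illustration, not a figure from SGI:

```python
# Airflow sizing sketch from the energy balance Q = m_dot * cp * dT.
# The 15 C air temperature rise through the brick is an assumed value.
RHO_AIR = 1.2       # kg/m^3, air density near sea level
CP_AIR = 1005.0     # J/(kg*K), specific heat of air
M3S_TO_CFM = 2118.88  # 1 m^3/s in cubic feet per minute

def required_airflow_cfm(heat_w, air_rise_c):
    """Volumetric airflow needed to carry heat_w away at the given rise."""
    mass_flow = heat_w / (CP_AIR * air_rise_c)   # kg/s of air
    return mass_flow / RHO_AIR * M3S_TO_CFM

brick_cfm = required_airflow_cfm(1000.0, 15.0)
print(f"Per brick: {brick_cfm:.0f} CFM, or {brick_cfm / 3:.0f} CFM per fan")
```

Under that assumed rise, each of the three 127-mm fans must move on the order of 40 CFM, a reasonable duty for a high-performance tube-axial fan of that size.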

With the fan design complete, the next step was heat sink design. That was the job of SGI engineer Rick Salmonson. He built computational fluid dynamics (CFD) simulation models in Flotherm based on Pro/ENGINEER CAD models of his heatsink design concept, then ran many iterations to get the optimized height and pitch of the fins.

It wasn't as easy as it sounds. One basic formula for the design of heat sinks, the finned surfaces attached to processors that carry heat away from the devices, is Q = h · A · (Tb − Ta) · n, where Q is the heat transfer rate, h is the heat transfer coefficient, A is the surface area of the heat sink, Tb is the base temperature of the heat sink, Ta is the surrounding fluid temperature, and n is the fin efficiency. The formula helps engineers predict heat sink performance.

During his Flotherm iterations, Salmonson had to make a basic decision: one heat sink design or two. The Intel processor dissipates 130W of power; the ASIC dissipates 30W. SGI could have used the same heat sink design for both and saved money, which, of course, is always an attractive option. But it would have been an inefficient solution: a heat sink sized for the ASIC wouldn't be adequate for the processor, and one sized for the processor would be overkill for the ASIC. Finding a happy medium might have taken too long and carried risk. Instead, Salmonson designed a different heat sink for each.

For the processor, he specified solid copper for the base and fins. That base is 91 mm × 71 mm × 6 mm, carrying 23 fins, each 0.4 mm thick and 49 mm tall, on a 2.8-mm pitch. The ASIC heat sink is solid aluminum, 73 mm × 58 mm × 6.5 mm thick, with 20 fins, each 1.0 mm thick and 41 mm tall, on a 2.9-mm pitch.
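To see how the formula separates the two designs, here is a rough application of it to the published fin counts and heights. The heat transfer coefficient, the base and air temperatures, the efficiency, and the fin-run lengths are all placeholder assumptions; only the fin geometry comes from SGI:

```python
# A hedged reading of the heat-sink formula Q = h * A * (Tb - Ta) * n,
# applied to the two fin geometries Salmonson settled on. Only fin
# counts and heights come from the article; h, temperatures, efficiency,
# and fin-run lengths are assumptions for illustration.

def fin_area_m2(n_fins, height_mm, length_mm):
    """Total exposed fin area (both faces of each fin), in m^2."""
    return n_fins * 2 * (height_mm / 1000) * (length_mm / 1000)

def heat_rate_w(h, area_m2, t_base_c, t_air_c, efficiency):
    """Q = h * A * (Tb - Ta) * n, the formula quoted above."""
    return h * area_m2 * (t_base_c - t_air_c) * efficiency

cpu_area = fin_area_m2(23, 49, 71)   # copper sink; 71 mm fin run assumed
asic_area = fin_area_m2(20, 41, 58)  # aluminum sink; 58 mm fin run assumed

# h = 50 W/(m^2*K), 85 C base, 25 C air, 0.7 efficiency: all placeholders.
print(f"CPU sink area {cpu_area:.3f} m^2, "
      f"Q = {heat_rate_w(50, cpu_area, 85, 25, 0.7):.0f} W")
print(f"ASIC sink area {asic_area:.3f} m^2")
```

The point of the exercise is the area ratio: the copper processor sink exposes roughly 70 percent more fin surface than the aluminum ASIC sink, before copper's higher conductivity is even counted.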

The heat travels to the rear of the electronics rack. The temperature drop through the processor heat sink is 40°C; for the ASIC, the temperature drop is 25°C.
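Combining those drops with the device powers given earlier (130W for the processor, 30W for the ASIC) yields each sink's effective thermal resistance, and a quick check of why two designs were needed:

```python
# Effective thermal resistance theta = dT / Q for each heat sink,
# using the temperature drops and device powers quoted in the article.
sinks = {
    "processor (copper)": (40.0, 130.0),  # (dT in C, power in W)
    "ASIC (aluminum)":    (25.0, 30.0),
}
for name, (dt, power) in sinks.items():
    print(f"{name}: theta = {dt / power:.2f} C/W")
# The processor sink must be roughly 2.7x less resistive than the
# ASIC sink, which is why one shared design would not have worked.
```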

Engineers used the Flotherm computational fluid dynamics (CFD) simulation program to simulate airflow over the computer brick. This view shows the airflow coming out of the fans. Engineers wanted to see whether airflow was evenly divided over the top and bottom of the PCB, whether any spots were starved of airflow, and whether there were any areas where airflow might be recirculating. Their goal was unrestricted flow.

In this Flotherm view, engineers studied the surface temperatures of components such as the processor chips, the ASIC chips, the router chips, the memory DIMMs, and the power-conversion components. The heat sinks are hidden from view since they were not a target of study. The view shows a cutting plane roughly through the center of the brick. On the cutting plane, notice the vectors of airflow: that's what engineers expected to see in a real brick. They could move the cutting plane to any location; by moving it around on the computer, they could verify that they had adequate and uniform airflow across the brick and proper cooling. The blocks in blue at the center are air blockers, present to ensure that air passes through the heat sinks and not around them.

A Cool Room

Planarization techniques, fan selection, and heat sink design solved the problem of keeping the computer components cool. But SGI engineers knew that once they had moved the heat from the computer, its customers would have to find a way to get it out of computer rooms. "Instinctively, we felt that customers with large configurations would have a particularly serious problem," says engineering team leader Dean. Their instincts proved right. When NASA ordered several Bx2 systems, SGI's team did calculations that showed the systems would exceed the cooling capabilities of the NASA facility.

Enter SGI engineers Tim McCann and Dave Collins. They designed a water-chilled door for the computer cabinets that would take the heat out of the equipment and put it into the building's water-cooling system. In effect, the heat would travel to the rear of the electronics rack and through the water coil in the rack rear door.

Their solution hinged on design of a heat-exchanger coil much like coils used in the HVAC industry for large-capacity air conditioners. Working with ThermoDyne, Inc., a coil-design consultant, and coil manufacturer Outokumpu Heatcraft USA, McCann and Collins generated several coil designs over a six-week period. Their goal: to absorb 90 percent of the rejected heat from the computer system.
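The 90 percent target translates directly into a water-flow requirement for the coil. A rough sizing sketch, assuming the full eight-brick rack from earlier and a 10°F water temperature rise across the coil (the rise is an assumption, not a published figure):

```python
# Rough coil sizing behind the 90 percent goal: water flow needed to
# absorb 90% of an 8 kW rack (8 bricks x 1,000W). The 10 F water
# temperature rise across the coil is an assumed value.
CP_WATER = 4186.0    # J/(kg*K), specific heat of water
KGS_TO_GPM = 15.85   # 1 kg/s of water is about 15.85 US gal/min

rack_w = 8 * 1000
coil_w = 0.90 * rack_w                # heat the coil must absorb
water_rise_c = 10.0 * 5.0 / 9.0       # a 10 F rise expressed in C
flow_kgs = coil_w / (CP_WATER * water_rise_c)
print(f"Coil duty {coil_w:.0f} W -> about {flow_kgs * KGS_TO_GPM:.1f} GPM")
```

Under those assumptions, each rack door needs only about 5 gallons per minute of building chilled water, a modest demand for a facility loop.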

But having the right coil design wasn't the end of the story. "Blowing hot air over the cold coil creates the potential for condensation," Collins says. That condensation could upset the humidity balance required in the room to prevent static-discharge problems. The solution? They sized the coil capacity to allow a cooling-water temperature of up to 60F, combined with a tightened environmental operating window of 40 to 55 percent relative humidity. Together, those limits keep the coil above the room air's dew point, so condensation can't form. Then, just to be sure, they incorporated a drain feature to handle condensate discharge in case environmental conditions drift out of spec.
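The 60F-water, 55-percent-RH combination can be checked against the dew point directly. Here is a sketch using the Magnus approximation, with an assumed 25°C (77F) room temperature, a figure not given in the article:

```python
import math

# Why 60 F water plus a 40-55% RH window prevents condensation: the
# coil surface stays above the room air's dew point. Dew point from
# the Magnus approximation; the 25 C room temperature is assumed.
A, B = 17.27, 237.7  # Magnus coefficients (temperatures in C)

def dew_point_c(temp_c, rh_percent):
    g = math.log(rh_percent / 100.0) + A * temp_c / (B + temp_c)
    return B * g / (A - g)

water_c = (60.0 - 32.0) * 5.0 / 9.0   # 60 F coil water, in C
dp = dew_point_c(25.0, 55.0)          # worst case: top of the RH window
print(f"Dew point {dp:.1f} C vs coil water {water_c:.1f} C")
# Water warmer than the dew point -> no condensation on the coil.
```

Note how little margin there is: the worst-case dew point sits only a fraction of a degree below the water temperature, which is exactly why the RH window had to be tightened and a backup drain added.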

SGI engineers used liquid cooling on the door to transfer heat from the supercomputer.

The water-cooled door is helping NASA save money on its electrical bill. But when NASA facility engineers wanted to also use the system to augment the air conditioner for the building, McCann and Collins said no. That would have required lowering the cooling water temperature, increasing the coil's capacity to reject heat but also increasing the risk of condensation.

While engineers who have access to the new Bx2 for their simulations will benefit from the ability to solve larger problems than ever in a shorter time, SGI engineers themselves have benefited from the project experience. Says engineering team leader Dean, "We realized on our own that we have to pay attention to the total solution, not just our part of the design." Dean says it would have been easy but unacceptable to dump the room-cooling problem on NASA or other customers, even though that's really a facility issue. "Many customers don't understand that higher-density systems are beyond the room-cooling capabilities of their buildings. We decided to tackle that problem, too, so the end system would be reliable."

