Clusters Fill the Compute Gap

By Peter Varhol

In the not-too-distant past, engineers who could not run analyses or simulations of their designs on desktop computers had little choice but to use an engineering mainframe or supercomputer, queuing up the job and hoping to get the results back within a few days. Further, many smaller engineering groups couldn’t make the leap in investment necessary to acquire such “big iron,” and either rented time on larger systems or simply did without a lot of analysis.

[Image: Whether in a rack like this one or built with desktop workstations tied together via high-speed interconnects, compute clusters are an attractive option.]

Today, there is an intermediate level of computation that is both more affordable and more flexible than mainframes and supercomputers: the compute cluster, a group of connected computers working closely together that, in many respects, behaves as a single computer for certain kinds of computations.

These connected computers suit only certain kinds of problems: those that can be broken apart into independent execution paths, so that code can execute on multiple processors and cores at once. Analysis and simulation problems typically fit this pattern, because the same code runs against different data to produce intermediate results independently.
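To make "same code, different data" concrete, here is a minimal Python sketch. The analysis function is a hypothetical stand-in: the textbook cantilever tip-deflection formula with made-up beam properties, where a real job would be a full simulation run. Python's `concurrent.futures.ProcessPoolExecutor` spreads the independent calls across local cores; distributing them across cluster nodes would need a job scheduler or distributed framework, which this sketch does not show.

```python
from concurrent.futures import ProcessPoolExecutor


def tip_deflection(load_n: float, length_m: float = 2.0,
                   e_pa: float = 200e9, i_m4: float = 8.0e-6) -> float:
    """Hypothetical stand-in for one expensive analysis run.

    Cantilever tip deflection, delta = F * L**3 / (3 * E * I), with invented
    beam properties. What matters is that each call depends only on its own
    inputs, so calls can run on different cores at the same time.
    """
    return load_n * length_m ** 3 / (3.0 * e_pa * i_m4)


def sweep(loads, max_workers=None):
    """Same code, different data: evaluate every load case in parallel."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tip_deflection, loads))


if __name__ == "__main__":  # guard needed where worker processes are spawned
    loads = [1_000.0 * k for k in range(1, 9)]
    for load, delta in zip(loads, sweep(loads)):
        print(f"{load:7.0f} N -> {delta * 1_000:.3f} mm")
```

Because no case depends on another, adding cores (or nodes) shortens the sweep without changing the analysis code itself.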

According to Silvina Grad-Frelich, manager at The MathWorks, one of the most common uses of cluster computing today is Monte Carlo simulation. “Within a given amount of time,” she notes, “engineers can get a much finer level of detail over the possible strengths of their design.” Multiple Monte Carlo runs can execute simultaneously on different processors and cores, producing better simulations faster than individual workstations can. But the speed of clusters also enables engineers to follow through with entire alternative designs, and to test those designs quickly and accurately.
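The following Python sketch shows why Monte Carlo work splits so cleanly across processors. It assumes a hypothetical stress-versus-strength failure model with invented normal distributions. Each batch gets its own seeded generator, so batches can run on any core in any order, and their failure counts are simply added at the end.

```python
import random
from concurrent.futures import ProcessPoolExecutor

# Hypothetical design data (invented for illustration):
# part strength ~ Normal(300, 20) MPa, applied stress ~ Normal(240, 25) MPa.
STRENGTH_MEAN, STRENGTH_SD = 300.0, 20.0
STRESS_MEAN, STRESS_SD = 240.0, 25.0


def count_failures(seed: int, trials: int) -> int:
    """Run one independent batch of trials and return how many failed."""
    rng = random.Random(seed)  # private generator: no shared state between batches
    failures = 0
    for _ in range(trials):
        strength = rng.gauss(STRENGTH_MEAN, STRENGTH_SD)
        stress = rng.gauss(STRESS_MEAN, STRESS_SD)
        if stress > strength:
            failures += 1
    return failures


def failure_probability(total_trials: int, batches: int, base_seed: int = 2024) -> float:
    """Split the study into batches, run them in parallel, and pool the counts."""
    per_batch = total_trials // batches
    if per_batch == 0:
        raise ValueError("need at least one trial per batch")
    # Consecutive integer seeds are adequate for a sketch; production code would
    # typically use generators designed for parallel streams (e.g. NumPy SeedSequence.spawn).
    seeds = [base_seed + b for b in range(batches)]
    with ProcessPoolExecutor() as pool:
        counts = list(pool.map(count_failures, seeds, [per_batch] * batches))
    return sum(counts) / (per_batch * batches)


if __name__ == "__main__":  # guard needed where worker processes are spawned
    # For this model the analytic failure probability is roughly 0.03.
    print(failure_probability(400_000, batches=8))
```

Doubling the number of batches, and the processors running them, buys either a faster answer or, in the same time, more trials and a tighter estimate.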

The Practical Benefits of Clusters

Any 64-bit system today can be configured with enough memory and storage to eventually solve virtually any problem. But there are two practical reasons for using a cluster. First, it lets engineers complete an analysis within a reasonable time. Even the fastest individual workstation lacks the horsepower to solve some very complex problems within a fixed deadline. Performing this work on a faster set of computers can improve both cost and time to market.

[Image: MATLAB supports different types of parallel operations with clusters, including NVIDIA GPU porting.]

Second, even if time to market stays the same, risk is reduced because engineers have more data on which to base decisions. Because you can run more simulations and more detailed simulations, do more sensitivity analyses, or run more detailed dynamic analyses, you end up with a better design. Engineers say that being able to consider more design alternatives usually pays off, because within the same amount of time they can do more than simply meet specs.

For high-end designs and detailed simulations, server clusters can speed up computations or run more of them. Because simulations and many types of analyses can be broken into multiple independent parts, some problems can be solved in a fraction of the time they required on the desktop, and with higher fidelity.

Clusters aren’t new technology. Most of the fastest computers on the TOP500 Supercomputer Sites list (top500.org) are Intel-based clusters. What has changed is their acceptance as alternatives to supercomputers, along with software vendors’ ability to break computations into independent parts that can fully exploit the dozens of processor cores in a cluster.

Because many high-end servers used in clusters run Intel Xeon processors, the chip maker has an interest in ensuring that vendors’ hardware scales well and works with other components and software. Developed in conjunction with hardware and software vendors, Intel Cluster Ready lets engineering groups match engineering design applications to the fastest computers and components, including servers from Appro, Dell, SGI, and Super Micro, among others.

System and component providers and system integrators use the Intel Cluster Ready architecture and specification to build interoperable clusters. Software vendors test their applications with representative workloads on Intel Cluster Ready systems that have already been certified. By buying Intel Cluster Ready systems and components, engineering groups know the parts have been tested together, which makes it easier to assemble a “do-it-yourself” cluster. Designing and setting up your own cluster may be more work, but it lets engineers customize the cluster for specific types of applications and design tasks.

The Interconnect Is the Key

Most clusters use InfiniBand or similar high-speed interconnects, whose performance begins to approximate that of the proprietary buses on the motherboard. The interconnect makes a big difference in whether a cluster is feasible. Data transfer between computers is already slow relative to memory and especially to processors, so an ordinary interconnect slows computations down still further.

With a fast interconnect, clusters can communicate between systems quickly enough to deliver the needed performance between memory and processors on different machines. Typical enterprise 100 Mbit Ethernet simply doesn’t provide the bandwidth and performance cluster applications need; either gigabit Ethernet or fiber is essential to delivering on the promise of cluster computing. Gigabit Ethernet is often adequate for low-end clusters, and 10Gb Ethernet is becoming mainstream for more powerful configurations.
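The arithmetic behind the bandwidth argument is simple: transfer time is payload size in bits divided by link rate. The sketch below assumes a hypothetical 1 GB exchange between two nodes and uses nominal line rates. Real throughput is lower because of protocol overhead and latency, which the `efficiency` parameter (an illustrative assumption, default 1.0) only crudely models.

```python
# Nominal line rates in bits per second.
LINKS_BPS = {
    "100 Mbit Ethernet": 100e6,
    "Gigabit Ethernet": 1e9,
    "10Gb Ethernet": 10e9,
}


def transfer_seconds(payload_bytes: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Time to move payload_bytes over a link; efficiency < 1 models overhead."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return payload_bytes * 8 / (link_bps * efficiency)


if __name__ == "__main__":
    payload = 1e9  # hypothetical 1 GB exchanged between two nodes
    for name, rate in LINKS_BPS.items():
        print(f"{name:>18}: {transfer_seconds(payload, rate):6.1f} s")
```

At nominal rates, moving 1 GB takes about 80 seconds on 100 Mbit Ethernet, 8 seconds on gigabit Ethernet, and 0.8 seconds on 10Gb Ethernet. A job that exchanges data at every step pays that cost again and again.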

Further, clusters should be on their own network segment. There’s no reason that segment can’t be connected to the rest of the organization, but the extra traffic on a busy network is likely to slow cluster performance substantially, because congestion causes packets to be delayed or dropped and then resent.

GPU Clusters Set to Take Off

There are other ways to build clusters. An increasing number of engineering analysis and simulation applications can execute on graphics processing units (GPUs), often delivering substantially higher performance than the industry-standard CPUs produced by Intel and AMD. GPUs from NVIDIA and AMD achieve this thanks to a processor design optimized for the kinds of floating-point computations that graphics require.

GPU clusters combine high performance with generally lower cost than CPU-only systems. Individual GPU systems with more than 900 cores can be purchased for $10,000, and clusters of such systems are formidable computational engines for certain kinds of applications.

Of course, engineering applications have to be ported to run on GPUs, which have a different instruction set from industry-standard processors. The porting process for commercial applications seems to be gathering steam, with NVIDIA reporting that GPU-based applications for structural dynamics are fully in the mainstream, while those for fluid dynamics are well on their way to that point. It’s important to note that the greatest benefit comes when a problem is highly numeric, and when the problem can be split up into multiple independent streams.

Are Workstation Clusters Feasible?

For groups using workstations for both engineering design and normal office work, there is often interest in pooling the spare computational resources of these systems to tackle larger engineering problems. Most workstations have multiple processors and cores, yet most applications ignore anything beyond a single core on the first processor. For most common uses, a good bit of workstation computational capability goes unused.

The bad news is that putting this extra computing horsepower to work on other problems isn’t straightforward. Perhaps the best-known technology in this area, SETI@home, doesn’t work for engineering applications. You can’t steal cycles from in-use processors and cores without affecting the performance of active work.

However, there is another way. It’s possible to set up a virtual machine that uses Intel Virtualization Technology for Directed I/O (VT-d) along with unused processors and cores to create a high-performance shell on individual workstations. Parallels Workstation Extreme uses Intel’s hardware virtualization technology, along with a separate dual-hosted network interface.

A fast interconnect is a critical part of the solution, because without one there is no practical way to get data and code to the workstations in the cluster. Some workstation vendors, such as HP, provide gigabit Ethernet interfaces that can be used by the virtual machine.

Granted, a workstation cluster won’t perform as well as a traditional server cluster. It has less processing power, in part because it isn’t using all available processors and cores, and gigabit Ethernet can’t take the place of fiber as a fast interconnect. Running in a virtual machine can also slow execution. But if you’re looking for a way to run less-demanding jobs inexpensively, a workstation cluster may get the job done.

Software and Clusters

It goes without saying that your software has to be able to take advantage of your cluster’s parallel processing capabilities. In most cases, the computation must be split into multiple independent pathways and then re-coded to run that way.
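One standard way to quantify why splitting work into independent pathways matters is Amdahl’s law, a textbook rule of thumb rather than something the vendors quoted here cite. If a fraction p of a job’s run time can be parallelized across N cores, the best-case speedup is 1 / ((1 − p) + p/N). A short Python sketch, ignoring communication costs entirely:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Best-case speedup when only parallel_fraction of the work scales with cores.

    Ignores interconnect and synchronization costs, so real speedups are lower.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for p in (0.50, 0.90, 0.95, 0.99):
        print(f"p = {p:.2f}: 64 cores -> {amdahl_speedup(p, 64):5.1f}x")
```

With 64 cores, a job that is only half parallelized speeds up by about 2x, while one that is 95 percent parallelized reaches roughly 15x. The serial remainder, not the core count, sets the ceiling.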

The good news is that many commercial software vendors have already done this re-coding, enabling engineering users to get the most out of their clusters. MathWorks’ MATLAB, used by many engineers for custom analysis and simulation, has supported various types of parallel operations on clusters for some time. Today, MathWorks aims to let users exploit clusters at a high level of abstraction, with as few code changes as possible.

MATLAB also makes it easy to port code to NVIDIA GPUs, either through high-level instructions that direct specific code segments to run on GPUs, or with AccelerEyes Jacket, which lets users tag MATLAB code to be sent to an installed GPU.

Analysis applications such as computational fluid dynamics and many types of simulations have already been optimized for clusters, but if a cluster is a part of your future computing strategy, it’s important to check with your software provider to make sure that the applications you use have been fully parallelized.

Thanks to the relatively low cost of lower-end clusters, along with the significant computational benefits, many engineers have come to count on them to deliver performance where it makes a difference to the design process. The cluster options have expanded to include both workstations and GPUs. Now that software is rapidly catching up, any engineering group can take advantage of clusters to create better designs faster.

About Peter Varhol

Contributing Editor Peter Varhol covers the HPC and IT beat for Desktop Engineering. His expertise is software development, math systems, and systems management. You can reach him at DE-Editors@deskeng.com.