The ability to easily change, update, or add to the cluster contributes to the overall utilization rate. This requirement is where home-brew systems can fail. Very often, highly customized home-brew systems do not tolerate change, and updates made by hand can cause significant downtime. A successful cluster must be able to tolerate change.

Once the cluster is running, it is important to be able to monitor and maintain the system. Since clusters are built from disparate components, the management interface must handle multiple technologies from multiple vendors. Oftentimes this responsibility falls on the system administrators, who must create custom (and sometimes complicated) scripts that glue together information streams coming from various points in the cluster. A successful cluster should provide tools that simplify the administrator’s workload, rather than make it more complex.
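To make the "glue script" problem concrete, here is a minimal Python sketch that merges two per-node information streams into a single view. Everything in it is an assumption for illustration: the two feeds (a scheduler state and a temperature reading), the field names, and the 85 °C threshold are hypothetical and do not correspond to any particular vendor's interface.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NodeStatus:
    """One merged record per node (hypothetical fields)."""
    name: str
    scheduler_state: str = "unknown"
    temperature_c: Optional[float] = None


def merge_status(scheduler_feed: Dict[str, str],
                 sensor_feed: Dict[str, float]) -> Dict[str, NodeStatus]:
    """Combine two independent per-node feeds into one dict keyed by node name."""
    merged: Dict[str, NodeStatus] = {}
    for name, state in scheduler_feed.items():
        merged.setdefault(name, NodeStatus(name)).scheduler_state = state
    for name, temp in sensor_feed.items():
        merged.setdefault(name, NodeStatus(name)).temperature_c = temp
    return merged


def needs_attention(status: NodeStatus, max_temp_c: float = 85.0) -> bool:
    """Flag nodes that are down, unknown to the scheduler, or running hot."""
    if status.scheduler_state in ("down", "unknown"):
        return True
    return status.temperature_c is not None and status.temperature_c > max_temp_c


# Hypothetical sample data: each feed knows about a different set of nodes.
scheduler_feed = {"n01": "idle", "n02": "down", "n04": "allocated"}
sensor_feed = {"n01": 91.0, "n03": 40.0, "n04": 50.0}

merged = merge_status(scheduler_feed, sensor_feed)
flagged = sorted(name for name, s in merged.items() if needs_attention(s))
```

Even in this toy form, the script has to decide what a missing reading means and which source wins when the feeds disagree. Each additional vendor or technology adds more such decisions, which is how hand-written glue grows complicated.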

Users will request new software tools or applications, and these often have library dependency chains. New compute and storage hardware will also be added over time. Administrative practices that can facilitate change without huge disruptions are essential. Home-brew systems often operate on a “critical path” of software packages, where a single change can cause issues across a spectrum of applications. A successful cluster should accommodate users’ needs without undue downtime.
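The "critical path" effect described above can be sketched as a walk over a dependency graph: given one package that changes, find everything that directly or indirectly depends on it. The Python below is a minimal illustration only. The package names and the `DEPENDS_ON` table are hypothetical, and a real cluster would take this information from its package manager or module system.

```python
from collections import deque
from typing import Dict, List, Set


def affected_by(changed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every package that directly or indirectly depends on `changed`."""
    # Invert the graph: library -> set of packages that use it.
    used_by: Dict[str, Set[str]] = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            used_by.setdefault(dep, set()).add(pkg)

    # Breadth-first walk outward from the changed package.
    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for user in used_by.get(current, ()):
            if user not in affected:
                affected.add(user)
                queue.append(user)
    return affected


# Hypothetical dependency table: package -> libraries it needs.
DEPENDS_ON = {
    "solver_app": ["mathlib", "mpi"],
    "viz_app": ["mathlib"],
    "mathlib": ["compiler_runtime"],
    "mpi": ["compiler_runtime"],
    "report_tool": [],
}

runtime_impact = affected_by("compiler_runtime", DEPENDS_ON)
mpi_impact = affected_by("mpi", DEPENDS_ON)
```

In this made-up table, updating `compiler_runtime` touches four of the five packages, while updating `mpi` touches only `solver_app`. Knowing that reach before a change is made is one way to plan updates without undue downtime.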

Finally, a successful cluster also minimizes the administrative costs required to deliver these success factors. The true cost of operating a successful HPC cluster extends beyond the initial hardware purchase or power budget. A truly well-run and efficient cluster also minimizes the time, resources, and level of expertise administrators need to detect and mitigate issues within the cluster.
