Part 1: Day in the life of an HPC manager: shutting down a Top 4 server cluster safely

Successfully shutting down a supercomputer can be trickier than you think

In 2013, we made big changes to our server cluster, The Cosmology Machine [COSMA], expanding its infrastructure to 9856 CPU cores and 4096 GPU cores. Designed and integrated by the team at OCF, COSMA is now made up of two machines: COSMA5, which uses IBM and DDN technology, and our existing cluster, COSMA4 (originally installed in January 2011).

As Senior Computer Manager in the Department of Physics with responsibility for managing COSMA, one of my biggest challenges is safely shutting down and powering off COSMA. You might laugh, but this is not as unusual as it sounds. When certain electrical works need to be done, the power has to be removed, either for safety reasons or simply to avoid an unplanned power disruption during the works.

Managing the shutdown
Both COSMA4 and COSMA5 are managed through IBM’s Platform HPC, which is configured in high availability. We have a setup of two Platform HPC servers, which back each other up, and a Network File System [NFS] server that holds the common storage data for the management system. For easy management, the NFS server is independent.

Now, the important question: how does one shut down the server system? Should one shut down the ‘candidate’ first and then the main server? Well, here’s my view of how it should be done properly:

1. Stop the automatic failover mechanism and switch to manual, then shut down the candidate and then the main server. That’s the correct order. On startup, reverse the order, making sure that the Network File System [NFS] server is fully booted before starting the cluster servers.

2. Make sure that the network is fully operational before even attempting to bring the service back. The perils of getting this wrong were brought home to us after a recent downtime.

3. Make sure that all your switches are in a consistent state and always run the configuration that you saved, rather than waiting to be surprised by what happens when the server power is restored. Sadly, I say this through experience: the configuration of one of the switches on our server cluster had been changed in an attempt to fix a communication issue between two top-level switches. This change had been overlooked. It was not active prior to the shutdown, but came back into action after the power had been restored. The configuration created the illusion of a loop, and the HP switch that connects us to the outside world shut down the communication with the top switch of COSMA4!
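The three points above can be sketched as a small checklist generator plus a configuration comparison. This is only an illustration, not the procedure or tooling we actually run: the node labels, the step wording, the position of the NFS server at the end of the shutdown order, and the sample switch configuration lines are all assumptions made for the example, and the functions only describe steps rather than power anything off.

```python
import difflib

# Hypothetical labels for the management nodes. Putting the NFS server last
# is an assumption, inferred from it having to be first on startup.
SHUTDOWN_ORDER = ["candidate", "main", "nfs"]


def shutdown_steps():
    """Return the shutdown checklist: failover to manual, then each node."""
    steps = ["set failover to manual"]
    steps += [f"shut down {node}" for node in SHUTDOWN_ORDER]
    return steps


def startup_steps():
    """Return the startup checklist: network check, then reversed node order."""
    steps = ["check network and switch configuration"]
    steps += [f"boot {node}" for node in reversed(SHUTDOWN_ORDER)]
    steps.append("set failover to automatic")  # assumed final step
    return steps


def config_drift(saved, running):
    """Return the lines that differ between a saved and a running config."""
    diff = list(difflib.unified_diff(
        saved.splitlines(), running.splitlines(),
        fromfile="saved", tofile="running", lineterm=""))
    # The first two entries are the '---' and '+++' file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    for step in shutdown_steps() + startup_steps():
        print(step)
    # Made-up configuration text, purely for illustration.
    saved = "vlan 10\nspanning-tree enable\n"
    running = "vlan 10\nspanning-tree disable\n"
    for line in config_drift(saved, running):
        print(line)
```

An empty list from `config_drift` would mean the running configuration matches the saved one. Any other result is worth investigating before the power is removed, since a change that is pending but not yet active is exactly the kind that caught us out.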

In part 2 of this post, I will be looking specifically at how we shut down our storage systems without jeopardising nearly 2PB of live research data. If you have any questions so far, please leave me a comment below.