HPCC Happenings

The new Intel18 cluster is now available for production use in the new HPCC environment. It is less than two weeks until the October 15th upgrade of the Intel14 and Intel16 clusters to the new image. We highly recommend that users begin moving their work over to the new environment now. Intel18 will have slower access to /mnt/scratch until we move the rest of the cluster over to the new Data Center.

There will be an outage during the week of the 22nd for the Intel14 cluster and during the week of the 29th for the Intel16 cluster, when those clusters are moved to the new Data Center. Please see our Announcements blog for more details.

As you are probably aware, we are in the middle of a transition for MSU HPCC. We have taken delivery of, and are presently testing, a new 2018 cluster and a new file system at the new MSU data center. Bringing up a new cluster while still maintaining the old one gives us an opportunity to implement some other changes as well. The new cluster is running a new version of the operating system, CentOS7. CentOS6, the current version, stopped receiving full updates in May 2017. Furthermore, we are transitioning to a new scheduler, Slurm, with support from SchedMD.

The present equipment in the Engineering building will also be moved to the new Data Center. In anticipation of that move we have already started moving some of the current 2016/2014 nodes to the CentOS7/Slurm setup, and those nodes are available for testing. More nodes will be moved to this new configuration as they are rebooted or otherwise go offline and are brought back up.

However, on October 15th, 2018, our contract with Adaptive, the provider of Moab/Torque, will end. On that date, even before we move the old nodes, we will have to convert *all* of the existing nodes to the CentOS7/Slurm setup.

The staff has been working very hard to make this transition as seamless as possible. To this end, they have done extensive testing and have provided documentation on how to work through this transition.

We hope you will bear with us as these changes go forward. We are changing many things and there are bound to be some issues. However, the HPCC staff is awesome and they will help you work through them. We urge you to do some testing *now* so that when the full conversion occurs on October 15th you will be ready to work in the new system.
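For anyone testing ahead of the conversion, much of the work in moving a job from Moab/Torque to Slurm is rewriting the `#PBS` header lines of the job script as `#SBATCH` lines. The following is a minimal Python sketch of that idea, not an HPCC-provided tool. The helper names (`SIMPLE_FLAGS`, `translate_directive`, `translate_script`) are hypothetical, only a handful of common options are covered, and mapping `ppn` to `--ntasks-per-node` is an assumption that will not suit every job. Anything the sketch does not recognize is flagged for manual review rather than guessed at. Please treat the HPCC documentation as the authority for this system.

```python
import re

# Hypothetical, deliberately small mapping of Torque options to sbatch options.
SIMPLE_FLAGS = {
    "-N": "--job-name",
    "-o": "--output",
    "-e": "--error",
}


def translate_directive(line):
    """Turn one '#PBS ...' line into a list of '#SBATCH ...' lines.

    Lines that are not PBS directives are returned unchanged.
    Directives this sketch does not understand are marked with a TODO.
    """
    stripped = line.strip()
    if not stripped.startswith("#PBS"):
        return [line.rstrip("\n")]

    parts = stripped.split(None, 2)  # e.g. ['#PBS', '-l', 'walltime=01:00:00']
    if len(parts) < 3:
        return ["# TODO (untranslated): " + stripped]
    flag, value = parts[1], parts[2]

    if flag in SIMPLE_FLAGS:
        return ["#SBATCH {}={}".format(SIMPLE_FLAGS[flag], value)]

    if flag == "-l":
        out = []
        for resource in value.split(","):
            nodes = re.fullmatch(r"nodes=(\d+):ppn=(\d+)", resource)
            if nodes:
                # Assumption: one task per requested core.
                out.append("#SBATCH --nodes=" + nodes.group(1))
                out.append("#SBATCH --ntasks-per-node=" + nodes.group(2))
            elif resource.startswith("walltime="):
                out.append("#SBATCH --time=" + resource[len("walltime="):])
            else:
                out.append("# TODO (untranslated resource): " + resource)
        return out

    return ["# TODO (untranslated): " + stripped]


def translate_script(text):
    """Translate every line of a job script and return the new script text."""
    result = []
    for line in text.splitlines():
        result.extend(translate_directive(line))
    return "\n".join(result)


if __name__ == "__main__":
    example = "#!/bin/bash\n#PBS -N myjob\n#PBS -l nodes=1:ppn=4,walltime=02:00:00\n./run_analysis\n"
    print(translate_script(example))
```

Running the file prints the example script with its header rewritten. The executable lines of the script are left alone, so any Torque-specific environment variables or commands in the body still need to be checked by hand.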

Overall we think this will be a positive change in many ways and a great improvement to the overall usefulness of HPCC for the research here at MSU.

The RFP for the next cluster has been released and is being considered by vendors. This is the first step in figuring out what we might buy for the 2018 cluster. We are three months or so away from any decisions, but once we get closer, we will be able to provide accurate pricing for the new cluster and buy-ins. Stay tuned!

The new MSU Data Center is up and running. A number of servers from ITS are now housed and running there. In fact, as a result of flooding near the Hannah Administration Building and Computer Center, more servers than were expected have been moved. The new 2018 HPC cluster (and eventually the older clusters) will also be housed in this center.

Obtaining commercial GPUs such as the 1080Ti remains an issue. Mining of Bitcoin and other cryptocurrencies has gobbled up most of the supply. This is a concern to NVIDIA, as their main revenue source is gamers. With a dwindling supply, gamers are looking at alternative GPU sources, such as AMD, and this concerns NVIDIA greatly. It is not clear whether the supply problems will clear up anytime soon, but it is a concern for some of our users. Of course, the more expensive V100 chips are available now. We will continue to monitor the problem and keep you apprised.

Though we are a little behind, we expect the RFP for the new cluster to go out in the next week or so. We clearly cannot talk about any prices for new buy-in computers until we get information back from the RFP, but we are looking at a few categories of compute nodes:

Basic compute nodes: two-socket, with 128GB or so of memory. Memory is quite a bit more expensive now than in the past. Manufacturing is not keeping up with present heavy demand (phones, SSDs, graphics cards, etc.).

High-performance graphics nodes, potentially with the new V100 cards.

Less expensive graphics nodes. This is more of a challenge given the present state of consumer graphics cards. Such cards are in short supply because of cryptocurrency mining, and NVIDIA is actively discouraging resellers from providing solutions in volume with consumer cards. We will still explore this approach, but it will be challenging.

We still expect to get a new cluster in the data center by late spring or early summer. Our RFP will include purchasing a new disk and scratch system, so the new cluster can stand on its own once brought up. When the new cluster is tested and in production, we will bring down the old cluster and move its elements to the data center. In this way we hope to avoid any downtime.