Oklahoma: Supercomputing in the Sooner State

Whether they’re smashing together two-ton particles or forecasting the weather in real time, the research scientists at the University of Oklahoma (OU) require seemingly infinite computing power. With jobs producing up to a million gigabytes of data at a time, these academics demand an enormous amount of readily accessible resources and skillful IT professionals who can prioritize and monitor processing jobs.

To address this need, the high-performance computing administrators at the OU Supercomputing Center for Education & Research are always looking for new ways to maximize their cluster’s productivity and keep up with the constant changes in the supercomputing field.

Recently, this pursuit has led the center to adopt a new load sharing facility (LSF) that will allow it to share resources, prioritize jobs and increase productivity across the board. With its previous open-source software, the center struggled to manage the resources within its cluster efficiently. Even at the busiest times, administrators had to reserve some of the system’s processing power for high-priority jobs that could come in at any time. Under that system, the cluster was backed up during peak processing hours and then sat idle during off-periods.

Through OU’s new LSF, however, the center has been able to use all its resources when scheduling jobs. Henry Neeman, director of the center, said the new LSF, which was developed by Toronto-based Platform Computing, has improved the center’s productivity.

“To paint a picture, one of the things we found ourselves wanting to do was to have the ability to have jobs running on various parts of the cluster, and then if other jobs that had higher priority came along, we wanted to be able to shut down the ones that were currently running, move those resources over to the most important job, and when those important jobs finished, then the other jobs could pick up from where they had left off,” Neeman said. “And with LSF, it’s not only possible to do that, it turns out it’s essentially trivial.”

This new technique is possible because Platform LSF lets all of the center’s computing power work together as one pool, with jobs fluidly sharing resources based on their importance. This allows the center to manage its workload easily and increases OU’s supercomputing efficiency.
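The suspend-and-resume behavior Neeman describes can be illustrated with a toy model. The sketch below is not Platform LSF's interface or algorithm; the `Job` class, the `schedule` and `step` functions, the slot counts and the "larger number means more important" rule are all assumptions made for illustration. It only shows the idea of one shared pool in which a high-priority arrival suspends lower-priority work, and the suspended job later picks up where it left off.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int   # assumption: a larger number means a more important job
    slots: int      # processor slots the job needs from the shared pool
    work_left: int  # remaining time steps of work
    state: str = "pending"  # pending, running, suspended or done


def schedule(jobs, total_slots):
    """Hand out slots from one pool, most important job first."""
    free = total_slots
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.state == "done":
            continue
        if job.slots <= free:
            free -= job.slots
            job.state = "running"
        elif job.state == "running":
            # Preempted: the job stops, but work_left is kept so it can resume.
            job.state = "suspended"


def step(jobs, total_slots):
    """Advance the toy cluster by one time step."""
    schedule(jobs, total_slots)
    for job in jobs:
        if job.state == "running":
            job.work_left -= 1
            if job.work_left == 0:
                job.state = "done"


def demo():
    """A low-priority job fills the cluster, then an urgent job arrives."""
    total_slots = 8
    low = Job("low", priority=1, slots=8, work_left=3)
    jobs = [low]
    history = []

    step(jobs, total_slots)
    history.append((low.state, low.work_left))

    jobs.append(Job("urgent", priority=10, slots=8, work_left=2))
    for _ in range(4):
        step(jobs, total_slots)
        history.append((low.state, low.work_left))
    return history


if __name__ == "__main__":
    for state, work_left in demo():
        print(state, work_left)
```

In this run the low-priority job is suspended for the two steps the urgent job needs, then resumes with its remaining work intact rather than starting over.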

Neeman said the new system also has fewer innate technological problems than any others the center has used.

The new LSF allows independent organizations to share computing power across institutional lines. This capability lets two autonomous clusters access each other’s excess resources, taking advantage of downtime and increasing both organizations’ productivity. OU hopes to make use of this feature in the coming months as part of a partnership with the University of Arkansas.

“The idea is that a user at OU can submit a job, and if all of OU’s resources are in use by other jobs, but there are some empty resources at the University of Arkansas, then their job can migrate to Arkansas and run there instead of running at OU,” Neeman said.
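The placement rule in that scenario can be sketched in a few lines. This is a simplified illustration, not how Platform LSF decides where a job runs; the function name `place_job`, the cluster labels and the free-slot numbers are hypothetical. The sketch assumes a job runs at its home cluster whenever it fits there and moves to a partner only when the home cluster is full.

```python
def place_job(slots_needed, free_slots, home="OU"):
    """Return the cluster where the job should run, or None to keep it queued.

    free_slots maps a cluster name to its currently idle processor slots.
    """
    # Prefer the home cluster whenever it has room.
    if free_slots.get(home, 0) >= slots_needed:
        return home
    # Otherwise look for a partner cluster with enough idle capacity.
    for name, free in free_slots.items():
        if name != home and free >= slots_needed:
            return name
    # Nobody has room right now, so the job waits.
    return None


if __name__ == "__main__":
    print(place_job(16, {"OU": 0, "Arkansas": 64}))
    print(place_job(16, {"OU": 32, "Arkansas": 64}))
    print(place_job(128, {"OU": 32, "Arkansas": 64}))
```

The three calls cover the cases in the quote: a full home cluster with idle capacity at the partner, a home cluster with room, and a job too large for either site at the moment.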

This new system will require the center to make some adjustments, but Neeman said his department is up to the task. He also said the principal challenges he and his colleagues face are setting up the system and educating OU’s users to take advantage of it.

OU’s users generally are researchers who come to Neeman with a high level of knowledge about the problems they need to solve but almost no experience with supercomputing. Using a series of workshops called “Supercomputing in Plain English,” he teaches them the basics of the process and helps them understand how high-performance computing can enhance their research.

“The principle is to teach the concepts of supercomputing without going into a lot of detail about the technology of supercomputing,” he said. “And, of course, in seven one-hour workshops, I can’t teach you enough to do anything useful. But in seven hours, you can learn enough to learn enough. After you’ve completed those workshops, then we get together regularly, typically once or twice a week for an hour or two, and we work together on the computing part of the problem that you’re trying to solve.”

While Neeman is busy teaching users the fundamentals of supercomputing, the other three IT professionals in his department are continually educating themselves about the latest developments in the field. Because of the highly specialized nature of supercomputing (high-performance computing is only about 4 percent of the total computing marketplace), Neeman said there is little formalized training available on the subject. He and his colleagues must be diligent about keeping their skill sets current without the luxury of study guides or seminars.

“It’s very, very difficult to find organizations that do training and certifications in cluster administration and high-performance computing administration for two reasons,” he said. “One, it’s a small percentage of the total marketplace, and two, it changes so quickly that it’s really, really difficult to develop a certification program. The biggest certifications that we’re actually able to get are the ones that are for the operating systems. So, the Red Hat Certified Engineer is a very good thing for someone in this business to have, simply because so much of this business is driven by the relationship between the operating system and everything else.”

Because certification plays such a small part in educating these specialists, they must turn to other information sources to stay on top of the game. Neeman said professional networking plays a big part in helping them learn everything they need to know, from how to find the best combinations of software and firmware to the best way to restructure their system.

“It is a very small town, and when people are looking for information, they’re not afraid to pick up the phone and call their counterpart at another institution and say, ‘Here’s an issue that we’re dealing with. How are you guys handling this?’” Neeman said.

These phone calls happen frequently, he said, because supercomputing technology changes much faster than technology in other IT fields. This means that as soon as one set of resources is working well, it’s probably time to switch to something new.

“When you’re at the high end, there are really two choices in technology,” Neeman said. “You can have established technology, which means obsolete, or you can have new technology, which means broken. On the high-end computing side, we have to have the new technology, so we know we’re in the broken-technology business. That means that a lot of what we do is look for ways to get these new technologies that are poorly understood to work well for our users.”

The transient nature of supercomputing skills and technologies can make this field an acquired taste, Neeman said. Finding people who are independent learners and problem solvers is more important than finding someone with any particular certification or expertise, he said.

“New technologies come online very quickly, and you have to be able to adjust to them, both software technologies and hardware technologies, so we need people who are very flexible and learn very quickly,” he said. “Honestly, the major way of dealing with (the lack of formal training) is finding the right people who have a desire to be constantly updating their skill sets. And I don’t mean they update them every year; they update them every day.”

Despite all the challenges, Neeman said the supercomputing industry is thriving with the talent it has. Although it would be nice to have more certifications that could help newbies get started or guides to help high-end IT pros navigate “broken” technologies, the total immersion approach has worked for them thus far. Plus, the challenge keeps the job exciting, he said.