Seven AI habits IT can learn from HPC

James Reinders likes fast computers and the software tools to make them speedy. With over 30 years in high-performance computing (HPC) and parallel computing, including 27 years at Intel Corporation (retired June 2016), he is also the author of nine books in the HPC field, as well as numerous papers and blogs.

An effective IT organization can lead the way for organizations seeking broader use of AI, by adopting seven lessons from the HPC focus on system-level thinking.

August 16, 2019

Every enterprise will be using Artificial Intelligence (AI) in the future – or should be. The potential positive impact on the bottom line, and the opportunities for competitive advantage that AI can yield, are simply too great to ignore.

Today, high-performance computing (HPC) centers are experts in supporting large-scale, high-performance applications, including large-scale AI. Whether you are already implementing AI, or in the early phases of exploration and contemplation, read on to learn a few lessons from effective HPC organizations without having to become an HPC house yourself.

The 2019 Digital Trends survey found that there has been a 50 percent increase since last year in the proportion of larger organizations stating they are already using AI, up from 24 percent in 2018 to 36 percent in 2019. Only 26 percent of organizations reported having no plans to invest in AI (down from 35 percent the year prior).

Perhaps this is not surprising, given the current management-level thinking about AI. According to a PwC report, 72 percent of business executives believe that AI will be the business advantage of the future. If customers matter to your business, you might take heed: the 2019 Digital Trends survey reported that leaders in Customer Experience (CX) are nearly twice as likely as other companies to be using AI within their organizations.

When we find ourselves encouraged (pushed) into evaluating and/or deploying AI projects, we need to help steer away from “shiny object syndrome” to a mode of system-level thinking.


HPC - borrowing effective habits without falling in

High-performance computing (HPC) is the world of supercomputers: computing that aggregates very high levels of computing power to deliver enormous performance to a single application, much more than the most powerful desktop computer or workstation, in order to solve large problems in science, engineering, or business.

We can learn a few things from the experiences of the HPC community to make all our systems better. Sure, HPC has a certain mystique about it, and a definite curmudgeon culture. However, every business can benefit from having an educated approach to navigating AI, machine learning, and HPC needs and opportunities.

What differentiates HPC from a huge data center is the concept of “scaling of an interrelated computation.” If real estate is all about location, location, location… then HPC is about scaling, scaling, scaling.

A common concern for parallel programming, especially in HPC, is measuring scaling efficiency (commonly referred to as scalability) of an application. Such a measurement indicates how efficient an application is when using increasing numbers of parallel processing units (processors, GPUs, ASICs, FPGAs, etc.).
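To make this concrete, here is a tiny sketch (with invented timings) of how scaling efficiency is typically computed: the speedup over a serial run, divided by the number of processing units used.

```python
def scaling_efficiency(t_serial, t_parallel, n_units):
    """Strong-scaling efficiency: speedup relative to a serial run,
    divided by the number of parallel processing units used."""
    speedup = t_serial / t_parallel
    return speedup / n_units

# Hypothetical timings: a job takes 100 s serially,
# 30 s on 4 units, and 20 s on 8 units.
print(scaling_efficiency(100.0, 30.0, 4))  # ~0.83: scaling well
print(scaling_efficiency(100.0, 20.0, 8))  # ~0.63: efficiency eroding
```

An efficiency near 1.0 means adding units keeps paying off; a falling number is the early warning that an application is hitting the limits of its scalability.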

Similarly, what really matters for the best AI deployments is scaling, scaling, and scaling. The answer is not an exact copy of HPC systems, but something that rhymes.


The most important common ground for consulting with HPC experts is this: system-level thinking matters. Since this is a bit broad and vague, we can dig deeper and refine it into seven key lessons from effective HPC organizations.

Invest substantially in procurement activities

I’m honestly amazed at how many systems are bought because they looked interesting, but I’m also amazed at how often a good opportunity goes unused because there is no time to evaluate it seriously. We can gain many insights from studying HPC centers that have navigated the opportunities, complexity, and risks of investing in a new supercomputer. A few years ago, I wrote a piece titled “How the Best HPC Managers Make the Best Procurement Decisions,” which focused on efforts aimed at “de-risking procurement.” By the way, this implicitly includes the need to still look smart a few years after a procurement is made!

Why would anyone hire a consultant to help in procurement? Andrew Jones, VP of HPC Business at the Numerical Algorithms Group (NAG), explains why it makes sense to augment internal capabilities: “Many organizations do have the ability to do this in house. We help with capacity and experience. We augment their team, acting as a temporary enhancement of their capacity and experience. Most customers are only buying a new machine every couple of years, whereas we are involved in HPC planning and procurement projects on a continuous basis. They are gaining years of diverse experience, not just the few days or weeks we spend with the customer.”

The value of investing serious time in procurement discussions with internal stakeholders and vendors, and in detailed technical investigations, should not be overlooked. Even if you don’t invest in hiring outside help, how much can you invest above and beyond your normal job? What is the downside if you cannot?

My conversations with industry experts repeatedly came back to insisting on the need for a deep and honest competitive assessment (an organization’s own capabilities and shortcomings), requirements/benchmarks, total cost of ownership, and timing. Let’s discuss the importance of benchmarking and timing as the next two lessons from HPC.

Invest in developing and using impartial requirements/benchmarks

It is of paramount importance that investment decisions are solidly linked to an organization’s needs and goals. One critical way of accomplishing this is with “benchmarks.” I’m not talking about industry standard benchmarks, I’m talking about benchmarks that are representative of the actual workloads you expect to run on the procured machine. I shouldn’t care how fast my machine will run applications that vendors like to show off — I care about applications that are important in my organization!

Engaging vendors to test proposed systems with your own benchmarks is a shared effort. The larger your potential purchase, the more effort you can require of a vendor. Providing machine access, and assistance, is a common request that potential customers make of vendors - do not be shy about asking. However, deciding what the benchmarks should be, and how to interpret them, is on you: a non-trivial amount of work.

It is important to remember that benchmarks are only ever an approximation for real workloads. But, when properly used, they can give valuable data on likely performance for the workloads that are important to you, and on how much difficulty is involved in getting that performance.

The 2019 Digital Trends survey highlighted that 55 percent of the uses of AI in organizations today are primarily focused on the analysis of data. When benchmarking, we need to match the benchmark weights to what we actually do. That is harder than it sounds! People experienced in HPC procurement can share their thoughts on how to navigate this.
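One simple way to encode those weights, sketched below with invented workload names and numbers, is a weighted geometric mean of per-benchmark speedups: the dominant workload carries the dominant weight, so a flashy result on a minor workload cannot carry the overall score.

```python
import math

def weighted_geomean(speedups, weights):
    """Weighted geometric mean of per-benchmark speedups vs. a baseline.
    The geometric mean keeps one outlier benchmark from dominating."""
    total = sum(weights.values())
    return math.exp(
        sum(weights[k] * math.log(speedups[k]) for k in speedups) / total
    )

# Hypothetical speedups of a candidate system over the incumbent,
# weighted by how much each workload matters to this organization.
weights = {"data_analysis": 0.55, "training": 0.30, "reporting": 0.15}
speedups = {"data_analysis": 1.4, "training": 2.5, "reporting": 0.9}
print(round(weighted_geomean(speedups, weights), 2))  # ~1.56 overall
```

Note how the 2.5x win on “training” is tempered by the modest gain on the heavily weighted “data_analysis” workload, and by the regression on “reporting”: exactly the kind of trade-off a single headline number from a vendor hides.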

Andrew Jones shared that “we avoid labeling procurement options as binary good or bad. As important as the performance figures themselves, is qualifying the effort required to get that performance, and an understanding of the architectural reasons behind the performance. In particular, we work to find information that connects a buying decision to actual needs and risks of meeting those needs.”

This is profoundly important in my experience: just because code could run that fast doesn’t mean it will. An honest assessment of what will be run on a machine is more important than dreaming about what could be run on it. I’ll revisit this with a different twist in a bit, when I say more about “modernization efforts.”

Since no organization runs only a single code, a system evaluation needs to weigh the performance possibilities of one option against the potential performance losses of another. The best choice for an organization is often a system that does “good enough” on the majority of applications, excels on a few key workloads, and may be slow on a few applications of low importance. For this effort we need our best critical thinkers, and we need to put them to work.

Excessive attention to how a “shiny object” may boost one benchmark should not prevent us from seeing the bigger picture, especially if the boost comes at an additional cost for acquisition, deployment, and support. Could the additional cost be used to boost performance more broadly?

Think carefully about timing, have an informed plan

The timing of availability for various technologies can affect capabilities and competition. Being too early, or too late, can substantially affect competitiveness. Phased delivery can be a powerful option, upgrading a system to use new technologies as they become available. Stockbrokers can tell us about cost and value averaging, and the same applies to computing: there is power in making steady incremental investments, letting what you learn along the way guide future steps. Knowledge of vendors’ longer-term roadmaps can be important for managing risk.

Waiting can be important. Nicole Hemsoth, a well-known HPC reporter, wrote that the National Oceanic and Atmospheric Administration (NOAA) is keenly aware that AI can help, but requires careful thought. She also noted that “this process of evaluation is not different from large companies that see AI benefit but need to think carefully about how and where it fits - and whether it is hardened and stable enough to qualify on critical systems.”

Tractica forecasts software implementations falling under the AI umbrella will generate $105.8 billion in annual worldwide revenue by 2025 (versus only $8.1 billion in 2018). They forecast that telecoms, consumer, advertising, business services, healthcare, and retail industries will make up the six largest adopters. This tells us a little about how much pressure there is to “jump in” now. Developing a multi-year plan for evolving could be an advantage.

Support your applications, learn from your users

I’m not saying IT departments do not support their users. But I will say that many IT organizations lack the funding or charter to support emerging usages such as AI. That creates a gap that is much less common inside the HPC world.

If AI is important to your organization, then the first step should be to partner with users and vendors to find a way to support your needs on the systems you have. You will probably be surprised how well the systems you already have can work. A huge bonus is the ability to learn from that and grow from there. Surprisingly, this is often overlooked as a resource and a proving ground. Even when learning is going on, it is often done with a disconnect between IT and the users. Aggressive IT organizations, like most HPC organizations, are intimately involved in supporting and learning from the heaviest workloads on their systems. If Python or TensorFlow are important to your users, do you understand how to get the version most optimized for your platforms deployed?

Orchestrate real plans to modernize code

Whenever techniques and machines evolve quickly, code needs to evolve too. Code modernization is a way of writing scalable code that uses multi-level parallelism to take full advantage of the modern hardware performance capabilities. I have been delighted to see how much code modernization continues to be talked about, and promoted, within the HPC community - and the positive effect that has!

HPC practitioners invest heavily in open source code, and many companies contribute to improving open source code for new systems. Several years ago, I worked at Intel helping with the Intel Parallel Computing Centers (Intel PCCs), which were funded to update open source projects for multicore processors. I also helped edit two “Pearls” books that walked through work by world-renowned teams to modernize open source code.

On this journey, I have learned that code modernization is much more important than it might first appear - and this is an important lesson we can carry into IT organizations, whether the actual modernization work is done internally, hoped for in open source, paid for externally, insisted on with a vendor, or, most likely, a blend of all the above.

With these insights, we know code modernization will be important for AI as well. Not in a decade, but already today. HPC experiences suggest that failure to invest in code (especially when technologies are changing rapidly) often reinforces vendor lock-in. Money is probably better spent on improving your own code than paying for vendor lock-in!

Consider cloud versus no cloud as balancing act, not a choice

The concept of “HPC in the cloud,” despite much touting by some vendors, has not stopped investments in HPC hardware. Intersect360 Research reported that in 2018, most HPC budgets either increased (46 percent) or remained the same (38 percent) from the prior year - with commercial sites reporting the strongest growth. This reinforces the reality that having expertise in computing infrastructure is a must-have.

Cloud-based services, including AWS, Google, Azure, and others, offer a variety of platforms to experiment upon and do early deployments. This can delay the need to have infrastructure expertise, and give such expertise a chance to grow within your organization. While cloud-based AI is certainly an important home for incubation, organizations find themselves needing to build and maintain infrastructure as AI initiatives expand. I will dare to say that this is unsurprising to HPC experts.

When cost, performance, and lots of data matter, nothing beats having your own expertise in computing infrastructure. Ignore this need at your own peril: without such in-house expertise, you cannot strike an informed balance between cloud and on-premises computing.

Total Cost of Ownership (TCO) – hardly a lesson only from HPC

I’ve touched on total cost of ownership a little already when I mentioned paying attention to the cost of getting performance (when evaluating benchmarks), timing (what benefits will you get now vs. waiting), and investing in procurement and modernization for a truly balanced approach. I will also acknowledge (without offering any advice) that part of a total system concern needs to be security, and that’s not an HPC specific issue either (although all HPC centers think about it a lot).

I conclude with TCO as the seventh lesson, and even though TCO is certainly not unique to HPC it is indeed very important to HPC. Nothing says “system approach” more than thinking about the whole picture - the hardware, software, applications, security, and people. The value of a system is the net benefit of what we get out of it above the commitment of capital and expenses (TCO) to make it happen.
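As a back-of-the-envelope illustration (every figure below is invented), the arithmetic of TCO is simple even though gathering honest inputs is not: capital outlay plus recurring costs over the system’s life, compared against the value it delivers.

```python
def total_cost_of_ownership(hardware, facility_per_year, staff_per_year,
                            software_per_year, years):
    """Capital outlay plus recurring operating costs over the system's life."""
    yearly = facility_per_year + staff_per_year + software_per_year
    return hardware + years * yearly

# Hypothetical 4-year deployment, all figures in thousands of dollars.
tco = total_cost_of_ownership(hardware=900, facility_per_year=120,
                              staff_per_year=250, software_per_year=80,
                              years=4)
annual_value = 1200  # estimated business value delivered per year
print(tco, 4 * annual_value - tco)  # TCO 2700, net value 2100
```

If that second number is not comfortably positive under pessimistic assumptions, the “shiny object” is not worth buying - which is the system-level point of the whole exercise.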

Seven Lessons, centered on a Systems Approach

HPC centers have long experience in realizing a net win from the procurement and operation of large high-performance systems. Time has clearly shown that an effective systems approach is key to their success. Their habits become key tips for anyone venturing into supporting AI at a large scale.

A systems approach is realized when we take all seven lessons to heart: investing in procurement activities, developing and using impartial benchmarks, carefully considering timing, investing heavily in support for applications and the user community, modernizing code with a plan, balancing with the cloud, and managing total cost of ownership.

These lessons from HPC friends could help. But, no need to become an HPC zealot… just learn a bit, maybe share some stories over a beer, and then get back to work. Before the robots take over.