What would you do with 100,000 cores? - Big compute at global scale

So, if you had 100,000 cores at your disposal (or, let’s say, 25,000 computers with four cores each), what would you do? How about calculating the cost of providing life insurance coverage to every person in the world?

Well, by running a specialized insurance model on Azure, that’s exactly what Willis Towers Watson did in collaboration with the HPC and Big Compute team at Microsoft. The whole effort took less than 12 hours, from provisioning 100,000 cores in 14 different regions worldwide, to the final downloading of results.

About Willis Towers Watson

Willis Towers Watson is among the top global providers of consulting and technology solutions for insurance companies, especially in the areas of people, risk and financial management. Over the decades, Willis Towers Watson has brought a range of innovations to the industry, including new ways to analyze risk, price auto insurance and evaluate solvency requirements.

Willis Towers Watson has been working closely with Microsoft for many years, most recently on RiskAgility FM, a hybrid solution to model different types of financial risk (learn more in this customer story).

“This exercise not only demonstrates how far we have come but what could be possible in the future as we move towards a world where technologies can fundamentally shift many of the previously held paradigms that have restrained the way insurers do business. When used together, innovative technology solutions like vGrid, can greatly enhance speed, reliability, control and accuracy in the risk modeling process by leveraging the power of the cloud.”

-- Stephen Hollands, Director, Software as a Service at Willis Towers Watson

Why run a model on that many cores?

To answer that (probably obvious) question, it’s important to mention first that RiskAgility FM has been designed to run extremely complex financial models and hyperscale jobs on Azure by leveraging Azure Batch. Now, going back to the original question: why would you run a model of that size and complexity? Because both Willis Towers Watson and Microsoft wanted to know how far RiskAgility FM could scale and how well Azure (and the Batch service) could handle that scale.

To determine this, we needed a very large question for RiskAgility FM to answer. What question can be bigger to an insurance company than figuring out the cost to insure the entire world's population?

Exactly how big is this question? Well, the model that can answer it would take approximately 19 years (roughly 166,000 hours, or 10,000,000 minutes) to complete on a computer with a single core (and if the insurer *only* wants to run 500,000 iterations to gain enough confidence in the result). However, if the insurer can have access to an enormous number of cores and a modeling solution able to handle that number of resources, it would presumably only take a fraction of the time. That’s exactly what Willis Towers Watson and Microsoft tested, and a couple of weeks ago, proved.
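As a quick sanity check on the figures above, the conversion from years to hours and minutes can be reproduced in a few lines of Python. This is only an arithmetic sketch; the function name and the choice of 365.25 days per year are assumptions made for illustration, not part of the original exercise.

```python
# Arithmetic check of the single-core runtime quoted in the article.
# Assumption: a year is averaged to 365.25 days to account for leap years.
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours


def single_core_runtime(years: float) -> tuple[float, float]:
    """Convert a runtime in years to (hours, minutes)."""
    hours = years * HOURS_PER_YEAR
    return hours, hours * 60


hours, minutes = single_core_runtime(19)
print(f"{hours:,.0f} hours, {minutes:,.0f} minutes")
# prints "166,554 hours, 9,993,240 minutes"
```

Both values round to the 166,000 hours and 10,000,000 minutes quoted in the text.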

How?

First, Willis Towers Watson prepared a customized model used to calculate the total cost of providing life insurance to the whole population of the world (from an insurer’s perspective). Due to the scale intended for this project, this model had to be reconfigured to read and write to storage more efficiently. Otherwise, it was kept as close as possible to its original out-of-the-box version because we wanted this exercise to reflect a real-world scenario.

The insurance model itself performed a stochastic analysis of the insurance cost of providing the 7.3 billion people on Earth with a $100,000 whole-of-life insurance policy. The model confirmed the cost would be approximately 2.5 times the global gross domestic product (GDP), with a standard deviation of roughly 15 percent of global GDP.

Figure 1. RiskAgility FM portal showing the running status.

When the model was ready (and after a few small-scale tests), the HPC and Big Compute team helped to allocate 100,000 cores in Azure by deploying VMs with the Batch service across 14 different Azure regions. The largest set of VMs (called a “pool” in Batch) comprised 60,000 cores, while the smallest had 1,000 cores.
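To give a feel for what spreading one job over pools of very different sizes involves, here is a hypothetical sketch that divides the 500,000 iterations among pools in proportion to their core counts. The article does not say how RiskAgility FM or Batch actually distributed the work, and it gives only the largest and smallest pool sizes. The one-pool-per-region layout, the pool names, and the twelve 3,250-core pools are therefore placeholders chosen so the total comes to 100,000 cores.

```python
# Hypothetical sketch: split iterations across pools in proportion to cores.
# This is NOT the documented scheduling logic of RiskAgility FM or Azure Batch.


def split_iterations(total_iterations: int, pool_cores: dict[str, int]) -> dict[str, int]:
    """Return an integer share of iterations per pool that sums to the total."""
    total_cores = sum(pool_cores.values())
    # Start with the rounded-down proportional share for every pool.
    shares = {
        name: total_iterations * cores // total_cores
        for name, cores in pool_cores.items()
    }
    # Hand any leftover iterations to the pools with the largest remainders.
    leftover = total_iterations - sum(shares.values())
    by_remainder = sorted(
        pool_cores,
        key=lambda name: (total_iterations * pool_cores[name]) % total_cores,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Only 60,000 and 1,000 come from the article; the other sizes are made up.
pools = {"pool-largest": 60_000, "pool-smallest": 1_000}
pools.update({f"pool-{i:02d}": 3_250 for i in range(1, 13)})

shares = split_iterations(500_000, pools)
print(shares["pool-largest"], shares["pool-smallest"], sum(shares.values()))
# prints "300000 5000 500000"
```

Under these assumed sizes, the largest pool would handle 60 percent of the iterations and the smallest 1 percent.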

In this day and age, 100,000 cores might not sound like much to some, but what is notable here is that it took less than 12 hours for Batch to secure all resources, bring up the VMs, prepare them for job execution, run the job, and obtain the results, all using services and functionality available to anyone on Azure today.

Just as impressively, after the job had finished, Batch immediately released the resources back to each of the Azure regions, so they were available to other customers (and had this been an external account, no further costs would have accrued). Such is the elasticity that Azure Batch provides!

Of course, of the 12 hours it took to complete this project, only a portion was used for the actual calculation. With the cores in place, all 500,000 iterations of the model finished in less than two hours (approximately 100 minutes), an almost linear speed-up over the single-core equivalent. That is an impressive result considering the scale of this calculation!
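The "almost linear" claim can be checked against the rounded figures quoted in this article: roughly 10,000,000 minutes on a single core versus roughly 100 minutes on 100,000 cores. The sketch below computes the speed-up and the parallel efficiency (speed-up divided by core count). The function names are illustrative, and because both timings are approximations, the result is indicative only.

```python
# Speed-up and parallel efficiency from the article's rounded figures.


def speedup(serial_minutes: float, parallel_minutes: float) -> float:
    """How many times faster the parallel run was than the serial estimate."""
    return serial_minutes / parallel_minutes


def efficiency(observed_speedup: float, cores: int) -> float:
    """Fraction of ideal (linear) scaling achieved; 1.0 means perfectly linear."""
    return observed_speedup / cores


SERIAL_MINUTES = 10_000_000  # approximate single-core estimate from the article
PARALLEL_MINUTES = 100       # approximate wall-clock time on the full allocation
CORES = 100_000

s = speedup(SERIAL_MINUTES, PARALLEL_MINUTES)
e = efficiency(s, CORES)
print(f"speed-up: {s:,.0f}x, efficiency: {e:.0%}")
# prints "speed-up: 100,000x, efficiency: 100%"
```

With these rounded inputs the efficiency comes out at 100 percent. The true figure depends on the exact timings, which the article does not give, so "almost linear" is the safer reading.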

What does this demonstrate?

First of all, this proved RiskAgility FM can successfully run the same job on tens of thousands of cores spread over multiple sets of VMs geographically distributed across the planet. In other words, it demonstrated that true hyperscale is possible on Azure, not only in the number of cores utilized but also in the number of regions where those cores were provisioned. This gives an enormous amount of flexibility to Willis Towers Watson customers and opens up an interesting new dimension of possibilities for modelers and developers.

Secondly, while RiskAgility FM jobs typically require a few thousand cores, being able to scale a job to 100,000 cores without changing the programming model and resource management code is quite an accomplishment. This is possible because the Batch service takes away the complexity of deploying, managing and running jobs on tens, hundreds, or thousands of cores simultaneously. It confirms developers can use a single programming interface, the Batch API, to run jobs on Azure regardless of the number of cores, VMs, or regions involved.

Lastly, although this was an experiment, it was one that ran with services available to every Azure customer today. Based on that, it’s not too far-fetched to say it shows any Azure customer can achieve a level of scalability that was previously possible only through complex coding and management efforts. This is true HPC, on the cloud, for everyone.