How to Greatly Reduce HPC Costs for Engineering Simulation on Cloud

Most simulation engineers with a hunger for high performance computing (HPC) have looked longingly to the cloud. Cloud computing has the potential to provide virtually unlimited access to HPC, enabling larger simulations and more design variations to be done in less time, since many machines working in parallel can solve even very large problems quickly. While the cloud offers much more than unlimited computing power, it’s those HPC resources that provide the strongest pull to the cloud. The question we seek to answer here is, “is it possible to get cloud-based HPC at very low cost?”

Before we dig into the cost equation, it’s worth noting that to take full advantage of cloud HPC, you have to consider more than just the compute-intensive solution phase. In his blog describing cloud computing best practices for engineering simulation, the first best practice highlighted by Wim Slagter is “don’t move the data more than you have to.” The “burst to the cloud” HPC model sounds appealing until you consider transferring huge results files back on premises for post-processing. That’s why the ANSYS Enterprise Cloud (AEC) solution is engineered to enable end-to-end simulation to be performed entirely in the cloud. While AEC provides a complete virtual simulation data center, it’s still true that HPC is at the heart of the system. ANSYS has partnered with Cycle Computing, a company that specializes in enabling HPC workloads on public cloud infrastructure, and it is their CycleCloud software that powers the auto-scaling clusters that allow AEC to deliver HPC on demand. The entire solution runs on Amazon Web Services (AWS) Elastic Compute Cloud (EC2).

You are probably already aware that AWS has data centers available worldwide that deliver computing infrastructure at scale on a pay-as-you-go basis. The business flexibility provided by this on-demand pricing has significant advantages, but often organizations looking to the cloud are hoping not just for more computing power, but also cost savings when compared with provisioning on-premises infrastructure. Leading companies have shown that cloud migrations can lead to significant cost savings when the total cost of owning on-premises infrastructure is considered, but customers in the early stages of cloud adoption sometimes find the cloud-to-on-premises comparison challenging.

The topics of cloud migration and cloud cost optimization merit a more detailed discussion than is possible in a blog post, so let’s get to the punchline. For intermittent or highly variable use cases like HPC simulation jobs, AWS offers two pricing models that are particularly suitable:

On-demand. This is the most flexible model where you provision the instance you need when you need it (hence “on-demand”) and pay a fixed hourly rate while you use it, with no long-term commitment. If you stop the instance, you are no longer paying for it. While this pricing model is flexible, it is also AWS’ most expensive pricing model.

Spot. This is a very interesting model for HPC. Spot instances allow you to bid on spare EC2 computing capacity. This is a market-driven pricing model that varies depending on the amount of spare capacity in the cloud data center and the demand for that capacity. You establish the price you’re willing to pay for a given instance (your bid price) and if the market price is below your bid price, you pay the market price. Again, it’s a market price, so it varies, but for the machines we use for HPC in ANSYS Enterprise Cloud, we’ve observed that the spot market price is typically about one quarter of the on-demand price. The catch is that while you’re using your instance, if the market price goes up and exceeds your bid price, you lose your machine.
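The spot pricing rules described above can be captured in a few lines. The sketch below is a hypothetical helper for reasoning about cost, not part of any AWS or ANSYS API, and the dollar figures in the example are illustrative, assuming a market price around one quarter of a nominal on-demand rate:

```python
# Sketch of the spot pricing rule: each hour you pay the market
# price (not your bid) as long as the market stays at or below
# your bid; if the market rises above your bid, the instance is lost.

def simulate_spot(bid_price, hourly_market_prices):
    """Return (hourly_charges, terminated) for a spot instance."""
    charges = []
    for market in hourly_market_prices:
        if market > bid_price:
            # Market exceeded the bid: the instance is reclaimed.
            return charges, True
        # You are charged the lower market price, not your bid.
        charges.append(market)
    return charges, False

# Illustrative numbers: bid at a nominal on-demand rate of $1.68/hr
# while the market hovers around a quarter of that.
charges, lost = simulate_spot(1.68, [0.42, 0.40, 0.45])
print(sum(charges), lost)  # total spot cost for 3 hours, and whether the instance was lost
```

Note that the bid is a ceiling, not the price you pay: as long as the market stays below it, your hourly cost tracks the (much lower) market price.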

So let’s get practical. How can you use AWS EC2 Spot instances in ANSYS Enterprise Cloud?

The first thing to note is that, when running HPC jobs in ANSYS Enterprise Cloud, you explicitly choose whether to use on-demand instances or spot instances. You do this by choosing which queue you submit the job to.

If you check the Cluster Monitor on the Jobs page of the ANSYS Cloud Gateway (an example is shown below), you should observe several queues. We use different queues for different solvers, but for a given solver we configure two queues: one that uses on-demand instances, and another that uses spot instances. For example, CFD jobs by default use a queue labeled “normal,” which uses high-performance compute-optimized instances at on-demand pricing. There is also a queue labeled “spot” which uses the same instances, but at spot pricing.

When running a batch job or submitting to the cluster from an interactive session, you'll need to explicitly choose one of those queues when you submit your job.

However, if the market is variable, how will you know if you’ll actually GET your compute instance if you submit to a spot queue? Here it helps to know how the spot market price has been behaving for the instance type you're interested in. It's best to check this BEFORE you submit your job. To do that, go to the root of the Shared Data view on the Data page. There you'll see a link called "Spot Price Monitor."

Click that and (after a short delay while we grab price history data from AWS) you'll see the recent price history, as shown in the example below:

You can use the queue drop-down menu to select the queue of interest. In the example above (which shows real data from May 2016), you can see that the spot price has been consistently below our bid price for several days (here we've chosen to set the bid price equal to the on-demand price, but this is configurable). As noted in the on-screen message, the risk of spot termination is therefore quite low: you could safely submit jobs to this queue without fear that your instance would be lost to market volatility, and you'd be saving money on your job.
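The judgment the Spot Price Monitor supports — "has the market stayed comfortably below my bid recently?" — can be sketched as a simple check. The function name, the 90% safety margin, and the sample prices below are illustrative assumptions, not the actual AEC logic:

```python
# Rough sketch of a spot-termination risk check: flag the risk as
# low only if every recent market price stayed below a safety
# margin (here 90%) of the bid price.

def spot_risk_is_low(price_history, bid_price, margin=0.9):
    """True if all recent prices stayed below margin * bid_price."""
    return max(price_history) < margin * bid_price

# Hypothetical recent market prices ($/hr) over the last few days,
# against a bid set at a nominal on-demand rate of $1.68/hr.
recent = [0.42, 0.44, 0.40, 0.43]
print(spot_risk_is_low(recent, bid_price=1.68))  # prices well under the bid
```

A margin below 100% reflects the idea that you want headroom, not just prices that have barely stayed under the bid so far.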

So, with the knowledge that the spot market has been stable, we can proceed to submit our job to the HPC queue of interest. The queue selection is done at the time the HPC job is launched, either using a batch job template in the web user interface (an example of the Fluent batch job template is shown below), or when submitting the solution to RSM from interactive sessions.

Once the jobs are submitted, spot instances behave the same as on-demand instances; new virtual machines are provisioned to service the job (this automated machine provisioning is the part of the AEC solution that Cycle Computing provides) and the job runs when the machines are available, without the long queue wait times that are typical of on-premises HPC clusters.

It is always a good idea to follow simulation best practices and configure solver checkpointing so you write out intermediate results that can be used as a restart point in the event of a job failure, as would happen if the spot market price spiked and the compute node was terminated. Intermediate data are written to AWS Elastic Block Store (EBS) storage that is NOT attached to the compute nodes, so there need not be any data loss even when a compute node is lost. You would simply restart your solution from the last solver checkpoint (using on-demand instances, if necessary).
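The restart step amounts to finding the most recent checkpoint on that surviving storage. The sketch below assumes a hypothetical checkpoint naming scheme (`case-NNNN.cdat`), not the actual file layout of any ANSYS solver:

```python
import re

# Illustrative recovery helper: after a spot termination, pick the
# checkpoint with the highest iteration number so the solve can
# resume from the latest saved state. The "case-NNNN.cdat" pattern
# is a made-up naming convention for the sake of the example.

def latest_checkpoint(filenames):
    """Return the checkpoint file with the highest iteration, or None."""
    best, best_iter = None, -1
    for name in filenames:
        m = re.match(r"case-(\d+)\.cdat$", name)
        if m and int(m.group(1)) > best_iter:
            best, best_iter = name, int(m.group(1))
    return best

files = ["case-0100.cdat", "case-0200.cdat", "case-0150.cdat"]
print(latest_checkpoint(files))  # resume from case-0200.cdat
```

Because the checkpoints live on storage that outlives the compute nodes, this recovery works the same whether the restarted job runs on spot or on-demand instances.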

Remember that spot pricing is not something that you MUST use. If you are running a time-critical job and can’t run the risk that the job might terminate due to a spike in the spot market price, you can avoid the risk by choosing to submit to the on-demand queue.

So while using spot instances carries some risk (risk that can be mitigated by following a few best practices), the potential reward is high. You could be running your HPC jobs at approximately ¼ the price, or performing four times as much simulation within your budget.