Intro to High Performance Computing in the AWS Cloud

High Performance Computing (HPC) allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, low latency networking, and very high compute capabilities. AWS allows you to increase the speed of research by running high performance computing in the cloud and to reduce costs by providing Cluster Compute or Cluster GPU servers on-demand without large capital investments. You have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications.

Credit to: David Pellerin, Dougal Ballantyne, Angel Pizarro, and Ryan Shuttleworth for a lot of the content for these slides. Amy Sun for many of the AWS graphics. Ian Meyers for the financial grid computing architecture.

Increased collaboration – clusters and data can be accessed from anywhere with an internet connection

Faster time to results – focus on your business/science and increase the efficiency of your people by not burdening them with IT. While you can benchmark a cluster on the performance of a single running job, it is more meaningful to benchmark the total time it takes to provision and use the cluster end to end

Concurrent Clusters on demand – no more waiting in a queue, run multiple jobs simultaneously with an API call

Other use cases:
• Science-as-a-Service
• Large-scale HTC (100,000+ core clusters)
• Small to medium-scale MPI clusters (hundreds of nodes)
• Many small MPI clusters working in parallel to explore parameter space
• GPGPU workloads
• Dev/test of MPI workloads prior to submitting to supercomputing centers
• Collaborative research environments
• On-demand academic training/lab environments
• Ephemeral clusters, custom tailored to the task at hand, created for various stages of a pipeline

Need design graphics here. Slot in customer references here. Customer logos.

*BEN: check with David Pellerin, Jamie and Dougal on the best verticals with the best customer reference for each vertical

About Harvard Medical School
The Laboratory for Personalized Medicine (LPM), part of the Center for Biomedical Informatics at Harvard Medical School and run by Dr. Peter Tonellato, harnessed the power of high-throughput sequencing and biomedical data collection technologies and the flexibility of Amazon Web Services (AWS) to develop innovative whole genome analysis testing models in record time. “The combination of our approach to biomedical computing and AWS allowed us to focus our time and energy on simulation development, rather than technology, to get results quickly,” said Tonellato. “Without the benefits of AWS, we certainly would not be as far along as we are.”

The Challenge
Tonellato’s lab focuses on personalized medicine—preventive healthcare for individuals based on their genetic characteristics—by creating models and simulations to assess the clinical value of new genetic tests.

Other projects include simulating large patient populations to aid in clinical trial simulations and predictions. To overcome the difficulty of finding enough real patient data for modeling, LPM creates patient avatars—literally “virtual” patients. The lab can create different sets of avatars for different genetic tests and then replicate huge numbers of them based on the characteristics of hospital populations.

Tonellato needed to find an efficient way to manipulate many avatars, sometimes as many as 100 million at a time. “In addition to being able to handle enormous amounts of data,” he said, “I wanted to devise a system where postdoctoral researchers can scope a genetic risk situation, determine the appropriate simulation and analysis to create the avatars, and then quickly build web applications to run the simulations, rather than spend their time troubleshooting computing technology.”

Why Amazon Web Services
In 2006, Tonellato turned to cloud computing to address these complex and highly variable computational needs. “I evaluated several alternatives but found nothing as flexible and robust as Amazon Web Services,” he said. Having built datacenters previously, Tonellato could not afford the time he knew would be required to set up servers and then write code. Instead, he decided to conduct a test to see how fast his team could put together a series of custom Amazon Machine Images (AMIs) that would reflect the optimal development environment for researchers’ web applications.

Now, Tonellato’s lab has extended its efforts to integrate Spot Instances into its workflows to stretch grant money even further. According to Tonellato, “We leverage Spot Instances when running Amazon Elastic Compute Cloud (Amazon EC2) clusters to analyze entire genomes. We have the potential to run even more worker nodes at less cost when using Spot Instances, so it is a huge saving in both time and cost for us. Taking advantage of these savings took us just a day of engineering, and we saw roughly 50% savings in cost.” Tonellato’s lab leverages MIT’s StarCluster toolkit, which has built-in capabilities to manage an Oracle Grid Engine cluster on Spot Instances. Erik Gafni, a programmer in Tonellato’s lab, performed the integration of StarCluster into the lab’s workflow. According to Gafni, “Using StarCluster, it was incredibly easy to configure, launch, and start using a running Spot cluster in less than 10 minutes.”
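The lab drove this through StarCluster rather than raw API calls, but as a minimal sketch of the underlying idea, a boto3 request for Spot worker nodes might look like the following. The AMI ID, bid price, instance type, key pair, and security group are placeholders, not values from the Harvard case study.

```python
# Hypothetical sketch: requesting Spot worker nodes with boto3.
# All identifiers below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.25",              # maximum hourly bid, USD
    InstanceCount=10,              # number of worker nodes
    LaunchSpecification={
        "ImageId": "ami-12345678",        # placeholder cluster AMI
        "InstanceType": "c3.8xlarge",
        "KeyName": "my-keypair",          # placeholder key pair
        "SecurityGroups": ["grid-workers"],
    },
)

# Print the request IDs so they can be tracked until fulfilled.
for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```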

In addition, the LPM recognized the need for published resources about how to effectively use cloud computing in an academic environment and published an educational primer in PLoS Computational Biology to address this need. “We believe this article clearly shows how an academic lab can effectively use AWS to manage their computing needs. It also demonstrates how to think about computational problems in relation to AWS costs and computing resources,” says Vincent Fusaro, lead author and senior research fellow in the LPM.

Tonellato runs his simulations on Amazon EC2, which provides customers with scalable compute capacity in the cloud. Designed to make web-scale computing easier for developers, Amazon EC2 makes it possible to create and provision compute capacity in the cloud within minutes.

Tonellato’s lab is thrilled with their AWS solution. “The number of genetic tests available to doctors and hospitals is constantly increasing,” Tonellato explained, “and they can be very expensive. We’re interested in determining which tests will result in better patient care and better results.” He added, “We believe our models may dramatically reduce the time it usually takes to identify the tests, protocols, and trials that are worth pursuing aggressively for both FDA approval and clinical use.”

Next Step
To learn more about how AWS can help with your big data needs, visit our Big Data details page: http://aws.amazon.com/big-data/.

Feature – Details
Flexible – Run Windows or Linux distributions
Scalable – Wide range of instance types, from micro to cluster compute
Machine images – Configurations can be saved as Amazon Machine Images (AMIs) from which new instances can be created
Full control – Full root or administrator rights
Secure – Full firewall control via security groups
Monitoring – Publishes metrics to Amazon CloudWatch
Inexpensive – On-Demand, Reserved, and Spot pricing
VM Import/Export – Import and export VM images to transfer configurations in and out of EC2

Auto Scaling is the tool that allows just-in-time provisioning of compute resources based on the policies and events that you specify
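As a hedged sketch of that idea (the group name, launch configuration, sizes, and adjustment value are placeholders), creating a bounded worker group and a simple scale-out policy with boto3 might look like this:

```python
# Hypothetical sketch: an Auto Scaling group plus a simple scaling
# policy for just-in-time capacity. All names are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Group of worker instances, bounded between 0 and 100 nodes.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="hpc-workers",
    LaunchConfigurationName="hpc-worker-config",   # created beforehand
    MinSize=0,
    MaxSize=100,
    AvailabilityZones=["us-east-1a"],
)

# Policy that adds 10 instances when triggered, e.g. by a
# CloudWatch alarm on job queue depth.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="hpc-workers",
    PolicyName="scale-out-on-queue-depth",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=10,
)
```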

This family includes the C1, CC2, and C3 instance types and is optimized for applications that benefit from high compute power. Compute-optimized instances have a higher ratio of vCPUs to memory than other families and the lowest cost per vCPU of all Amazon EC2 instance types. We recommend compute-optimized instances for running compute-bound, scale-out applications. Examples of such applications include high performance front-end fleets and web servers, on-demand batch processing, distributed analytics, and high performance science and engineering applications. C3 instances are the latest generation of compute-optimized instances, providing customers with the highest performing processors and the best price/compute performance currently available in EC2.

Each virtual CPU (vCPU) on C3 instances is a hardware hyperthread of a 2.8 GHz Intel Xeon E5-2680v2 (Ivy Bridge) processor, allowing you to benefit from the latest generation of Intel processors.

C3 instances support Enhanced Networking, which delivers improved inter-instance latencies, lower network jitter, and significantly higher packet-per-second (PPS) performance. Clusters of C3 instances built on this capability can deliver hundreds of teraflops of aggregate floating-point performance.
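Enhanced Networking is exposed through the instance's sriovNetSupport attribute. As a hedged illustration (the instance ID is a placeholder), checking whether it is enabled with boto3 might look like this:

```python
# Hypothetical sketch: checking whether an instance has enhanced
# networking (SR-IOV) enabled. The instance ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

attr = ec2.describe_instance_attribute(
    InstanceId="i-0123456789abcdef0",
    Attribute="sriovNetSupport",
)
# A value of "simple" indicates SR-IOV enhanced networking is enabled.
print(attr.get("SriovNetSupport", {}).get("Value", "not enabled"))
```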

Compared to C1 instances, C3 instances provide faster processors, approximately double the memory per vCPU, support for clustering and SSD-based instance storage.

About National Taiwan University
Fast Crypto Lab is a research group within National Taiwan University, in Taiwan. The group’s research activities focus on the design and analysis of efficient algorithms to solve important mathematical problems, as well as the development and implementation of these algorithms on massively parallel computers.

Why Amazon Web Services
Prior to signing on with Amazon Web Services (AWS), the group used a private cloud and ran Hadoop on their own machines. Prof. Chen-Mou Cheng, the Principal Investigator of Fast Crypto Lab, explains why the research group made the switch to AWS: “It is quite easy to get started with AWS with its clear and flexible interface. Amazon Elastic Compute Cloud (Amazon EC2) provides a common measure of cost across problems of a different nature. For problems that are the same or similar, Amazon EC2 can also be used as a metric for comparing alternative or competing algorithms and their implementations.” Chen-Mou adds, “When using Amazon EC2 as a metric, the parallelizability of the algorithm or the parallelization of the implementation is explicitly taken into account, as opposed to being assumed or unspecified. The Amazon EC2 metric is thus practical and easy to use.” The group now uses Hadoop Streaming in their architecture, and runs their programs with Amazon Elastic MapReduce (Amazon EMR) and Cluster GPU Instances for Amazon EC2. “Our purpose is to break the record of solving the shortest vector problem (SVP) in Euclidean lattices,” Chen-Mou says. “The problem plays an important role in the field of information science. We estimated that we would need 1,000 cg1.4xlarge instance-hours. We ended up using 50 cg1.4xlarge instances for about 10 hours to solve our problem. Now, the vectors we found are considered the hardest SVP anyone has solved so far. We only spent $2,300 for using the 100 Tesla M2050 GPUs for 10 hours, which is quite a good deal.”

The Benefits
Since switching to AWS, the group reports that machine maintenance costs have been reduced and that they have experienced more stable and scalable computational power. The group’s favorite component of AWS is Amazon CloudWatch, which it uses to monitor utilization while improving their programs. Looking to the future, Chen-Mou says, “We want to increase our GPU cluster quota and solve a higher-dimension SVP. We are also considering renting an AWS machine for setting up an SVN server.”

Customer reference here?

Speaker Notes

SEC’s Market Information Data Analytics System (MIDAS), provided by AWS partner Tradeworx
• Powerful AWS-based system for big data market analytics
• 2M transaction messages/sec; 20B records, 1TB of data/day
• From RFP to contract award to production in ~12 months
• See http://sec.gov/marketstructure
“For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data.” – Elisse B. Walter, Chairman of the SEC

“This basically propels the SEC from zero to 60 in one fell swoop, going from being way behind even the most basic market participant to being on par if not ahead of the vast majority of market participants, in terms of their system and analytical capabilities.” – Gregg E. Berman, Associate Director, Office of Analytics and Research, SEC

To upload large data sets into AWS, it is critical to make the most of the available bandwidth. You can do so by uploading data into Amazon Simple Storage Service (Amazon S3) in parallel from multiple clients, each using multithreading to enable concurrent uploads, or multipart uploads for further parallelization. TCP settings like window scaling and selective acknowledgement can be adjusted to further enhance throughput. With the proper optimizations, uploads of several terabytes a day are possible. Another alternative for huge data sets is AWS Import/Export, which supports sending storage devices to AWS and inserting their contents directly into Amazon S3 or Amazon EBS volumes.
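As a rough sketch of parallelized uploads (the bucket, key, file name, and tuning values are illustrative placeholders, not recommendations), boto3's transfer manager handles the multipart splitting and thread concurrency described above:

```python
# Hypothetical sketch: multipart, multithreaded upload to S3 using
# boto3's transfer manager. Bucket and file names are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
    max_concurrency=16,                    # parallel upload threads
)

s3.upload_file(
    "genome-run-001.tar", "my-input-bucket", "runs/genome-run-001.tar",
    Config=config,
)
```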

Parallel processing of large-scale jobs is critical, and existing parallel applications can typically be run on multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. A parallel application may assume a large scratch area that all nodes can efficiently read from and write to. Amazon S3 can be used as such a scratch area, either directly over HTTP or through a FUSE layer (for example, s3fs or SubCloud) if the application expects a POSIX-style file system.

Once the job has completed and the result data is stored in Amazon S3, the Amazon EC2 instances can be shut down and the result data set can be downloaded. The output data can be shared with others, either by granting read permissions to select users or to everyone, or by using time-limited URLs.
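A minimal sketch of generating a time-limited URL with boto3; the bucket and key are placeholders:

```python
# Hypothetical sketch: a time-limited URL for sharing a result object.
# Bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-results-bucket", "Key": "runs/output.tar.gz"},
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)
```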

Instead of using Amazon S3, you can use Amazon EBS to stage the input set, act as a temporary storage area, and/or capture the output set. During the upload, the same parallel-stream and TCP tuning techniques apply; in addition, transfers that use UDP may increase speed further. The result data set can be written to EBS volumes, at which point snapshots of the volumes can be taken for sharing.
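A hedged boto3 sketch of the snapshot-and-share step; the volume ID, description, and account ID are placeholders:

```python
# Hypothetical sketch: snapshotting a result volume and sharing the
# snapshot with another AWS account. IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="simulation results, run 42",
)

# Wait until the snapshot is complete before sharing it.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Grant another account permission to create volumes from the snapshot.
ec2.modify_snapshot_attribute(
    SnapshotId=snap["SnapshotId"],
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=["123456789012"],  # placeholder account ID
)
```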

Data for market, trade, and counterparties is loaded on startup from on-premises data sources or from Amazon Simple Storage Service (Amazon S3).

AWS Direct Connect can be used to establish a low-latency, reliable connection between the corporate data center and AWS, at 1 Gbps or 10 Gbps. For situations with lower bandwidth requirements, a VPN connection to the VPC gateway can be established.

Private subnetworks are specifically created for customer source data, compute grid clients, and the grid controller and engines.

Application and corporate data can be securely stored in the cloud using the Amazon Relational Database Service (Amazon RDS).

Grid controllers and grid engines run on Amazon Elastic Compute Cloud (Amazon EC2) instances started on demand from Amazon Machine Images (AMIs) that contain the operating system and grid software.

Static data such as holiday calendars and QA libraries and additional gridlib bootstrapping data can be downloaded on startup by grid engines from Amazon S3.

Results in Amazon DynamoDB are aggregated using a map/reduce job in Amazon Elastic MapReduce (Amazon EMR) and final output is stored in Amazon S3.

The compute grid client collects aggregate results from Amazon S3.

Aggregate results can be archived using Amazon Glacier, a low-cost, secure, and durable storage service.
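One way to automate that archival, sketched with boto3 under the assumption that aggregates land under a known S3 prefix (the bucket name, prefix, and 30-day window are placeholders):

```python
# Hypothetical sketch: a lifecycle rule that archives aggregate
# results to Glacier after 30 days. Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-results-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-aggregates",
            "Status": "Enabled",
            "Filter": {"Prefix": "aggregates/"},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```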

http://aws.amazon.com/solutions/case-studies/bankinter/

About Bankinter
Bankinter was founded in June, 1965 as a Spanish industrial bank through a joint venture by Banco de Santander and Bank of America. It is currently listed among the top ten banks in Spain. Bankinter has provided online banking services since 1996, when they pioneered the offering of real-time stock market operations. More than 60% of Bankinter transactions are performed through remote channels; 46% of those transactions are through the Internet. Today Bankinter.com and Bankinter brokerage services continue to lead the European banking industry in online financial operations.

The Challenge
Bankinter uses Amazon Web Services (AWS) as an integral part of their credit-risk simulation application, developing complex algorithms to simulate diverse scenarios in order to evaluate the financial health of Bankinter clients. “This requires high computational power,” says Bankinter Director of Technological Innovation, Javier Roldán. “We need to perform at least 5,000,000 simulations to get realistic results.”

Why Amazon Web Services
Bankinter uses the flexibility and power of Amazon Elastic Compute Cloud (Amazon EC2) to perform these simulations, subdividing processes across a grid of Amazon EC2 instances and running simulations in parallel to obtain results in a very short time. Bankinter used Java to develop their application and the Amazon Software Development Kit (SDK) to automate the provisioning of AWS resources. Through the use of AWS, Bankinter decreased the average time-to-solution from 23 hours to 20 minutes, dramatically reducing processing time with the ability to reduce it even further when required. Amazon EC2 also allowed Bankinter to move from a big batch process to a parallel paradigm, which was not previously possible. Costs were also dramatically reduced with this cloud-based approach.

The Benefits
Bankinter plans to expand their use of AWS for future applications and business units. “The AWS platform, with its unlimited and flexible computational power, is a good fit for our risk-simulation process requirements,” says Roldán. “With AWS, we now have the power to decide how fast we want to obtain simulation results. More important, we have the ability to run simulations that were not possible before due to the large amount of infrastructure required.”

Users interact with the Job Manager application, which is deployed on an Amazon Elastic Compute Cloud (Amazon EC2) instance. This component controls the process of accepting, scheduling, starting, managing, and completing batch jobs. It also provides access to the final results, job and worker statistics, and job progress information. Raw job data is uploaded to Amazon Simple Storage Service (Amazon S3), a highly available and persistent data store.

Individual job tasks are inserted by the Job Manager in an Amazon Simple Queue Service (SQS) input queue on the user’s behalf.

Worker nodes are Amazon EC2 instances deployed in an Auto Scaling group. This group is a container that ensures the health and scalability of the worker nodes. Worker nodes pick up job parts from the input queue automatically and perform single tasks that are part of the list of batch processing steps.
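A minimal sketch of this queue pattern with boto3; the queue name and the task-processing function are placeholders, not part of the reference architecture:

```python
# Hypothetical sketch: job manager enqueues task messages; workers
# poll, process, and delete them. Names are placeholders.
import boto3

sqs = boto3.resource("sqs", region_name="us-east-1")
queue = sqs.get_queue_by_name(QueueName="batch-input-queue")


def process_task(body):
    # Placeholder: a real worker would run one batch step here.
    print("processing", body)


# Job manager side: enqueue one message per task.
for task_id in range(100):
    queue.send_message(MessageBody=f"task-{task_id}")

# Worker side: poll, process, delete (runs on each worker node).
while True:
    messages = queue.receive_messages(
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,   # long polling reduces empty receives
    )
    if not messages:
        break
    for msg in messages:
        process_task(msg.body)
        msg.delete()          # remove only after successful processing
```

Deleting a message only after the task succeeds means a worker that dies mid-task lets the message reappear for another worker, which is what makes the fleet resilient.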

Interim results from worker nodes are stored in Amazon S3.

Progress information and statistics are stored in the analytics store. This component can be either an Amazon DynamoDB table or a relational database such as an Amazon Relational Database Service (Amazon RDS) instance. Optionally, completed tasks can be inserted into an Amazon SQS queue for chaining to a second processing stage.
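If DynamoDB is used as the analytics store, recording per-task progress can be a single put_item call; the table name and attributes below are illustrative placeholders:

```python
# Hypothetical sketch: recording per-task progress in a DynamoDB
# table. Table name and attributes are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("job-progress")   # assumed partition key: task_id

table.put_item(Item={
    "task_id": "task-42",
    "status": "COMPLETED",
    "worker": "i-0123456789abcdef0",
    "duration_seconds": 137,
})
```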

Engage Sales and Solutions Architects

• You have long queues to use your cluster
• Your jobs are varied enough that you spend more time optimizing for the cluster you have
• You are benchmarking your cluster
Find out how your HPC workloads can run on AWS. We can find the right partner to help manage your HPC workload for you.

A centralized repository of public datasets
• Seamless integration with cloud-based applications
• No charge to the community
Some of the datasets available today:
• 1000 Genomes Project
• Ensembl
• GenBank
• Illumina – Jay Flatley Human Genome Dataset
• YRI Trio Dataset
• The Cannabis Sativa Genome
• UniGene
• Influenza Virus
• PubChem

Time-accurate fluid dynamics
• SBIR-funded project for the US Air Force Research Laboratory (AFRL)
• SAS 70 Type II certification and VPN-level access required
• Additional security measures: uploaded and downloaded data was encrypted; dedicated EC2 cluster instances were provisioned; data was purged upon completion of the run

“The results of this case were impressive. Using Amazon EC2 the large-scale, time accurate simulation was turned around in just 72 hours with computing infrastructure costs well below $1,000.”
http://aws.amazon.com/solutions/case-studies/aerodynamic-solutions/

About Pfizer
Pfizer, Inc. applies science and global resources to improve health and well-being at every stage of life. The company strives to set the standard for quality, safety, and value in the discovery, development, and manufacturing of medicines for people and animals.

The Challenge
Pfizer’s high performance computing (HPC) software and systems for worldwide research and development (WRD) support large-scale data analysis, research projects, clinical analytics, and modeling. Pfizer’s HPC services are used across the spectrum of WRD efforts, from the deep biological understanding of disease to the design of safe, efficacious therapeutic agents.

Why Amazon Web Services
Dr. Michael Miller, Head of HPC for R&D at Pfizer, explains why Pfizer initially considered using Amazon Web Services (AWS) to handle its peak computing needs: “The Amazon Virtual Private Cloud (Amazon VPC) was a unique option that offered an additional level of security and an ability to integrate with other aspects of our infrastructure.” Pfizer has now set up an instance of the Amazon VPC to provide a secure environment in which to carry out computations for WRD. They say, “We accomplished this by customizing the ‘job scheduler’ in our HPC environment to recognize VPC workload, and start and stop instances as needed to address the workflow. Research can be unpredictable, especially as the on-going science raises new questions.” The VPC has enabled Pfizer to respond to these challenges by providing the means to compute beyond the capacity of its dedicated HPC systems, delivering answers in a timely manner. Pfizer’s solution was written in C and is based on the Amazon Elastic Compute Cloud (Amazon EC2) command line tools. They say, “We are currently migrating this solution over to a commercial API that will enable additional provisioning and usage tracking capabilities.”

The Benefits
The primary savings have come from cost avoidance: “Pfizer did not have to invest in additional hardware and software, which is only used during peak loads; that savings allowed for investments in other WRD activities.” For Pfizer, AWS is a fit-for-purpose solution. Dr. Miller explains, “It is not a replacement for, but rather an addition to, our capabilities for HPC WRD activities, providing a unique augmentation to our computing capabilities.” Overall, “AWS enables Pfizer’s WRD to explore specific difficult or deep scientific questions in a timely, scalable manner and helps Pfizer make better decisions more quickly.” Looking ahead, Pfizer is interested in exploring Amazon Simple Storage Service (Amazon S3) for storing reference data to expand the type of computational problems they can address.

18. Unilever: augmenting existing HPC capacity
“The key advantage that AWS has over running this workflow on Unilever’s existing cluster is the ability to scale up to a much larger number of parallel compute nodes on demand.” – Pete Keeley, Unilever Research eScience IT Lead for Cloud Solutions
• Unilever’s digital data program now processes genetic sequences twenty times faster

23. Schrodinger & Cycle Computing: computational chemistry
Simulation by Mark Thompson of the University of Southern California to see which of 205,000 organic compounds could be used for photovoltaic cells for solar panel material. Estimated computation time of 264 years completed in 18 hours.
• 156,314-core cluster across 8 regions
• 1.21 petaFLOPS (Rpeak)
• $33,000, or 16¢ per molecule

24. Cost Benefits of HPC in the Cloud
Pay-as-you-go model – use only what you need; multiple pricing models
On-premises capital expense model – high upfront capital cost; high cost of ongoing support

25. Many pricing models to support different workloads
Reserved – make a low, one-time payment and receive a significant discount on the hourly charge; for committed utilization
Free Tier – get started on AWS with free usage and no commitment; for POCs and getting started
On-Demand – pay for compute capacity by the hour with no long-term commitments; for spiky workloads, or to define needs
Spot – bid for unused capacity, charged at a Spot Price that fluctuates based on supply and demand; for time-insensitive or transient workloads
Dedicated – launch instances within Amazon VPC that run on hardware dedicated to a single customer; for highly sensitive or compliance-related workloads

27. Harvard Medical School: simulation development
“The combination of our approach to biomedical computing and AWS allowed us to focus our time and energy on simulation development, rather than technology, to get results quickly. Without the benefits of AWS, we certainly would not be as far along as we are.” – Dr. Peter Tonellato, LPM, Center for Biomedical Informatics, Harvard Medical School
• Leveraged EC2 Spot Instances in workflows
• One day of effort resulted in 50% cost savings

39. National Taiwan University: shortest vector problem
“Our purpose is to break the record of solving the shortest vector problem (SVP) in Euclidean lattices…the vectors we found are considered the hardest SVP anyone has solved so far.” – Prof. Chen-Mou Cheng, Principal Investigator of Fast Crypto Lab
• $2,300 for using 100x Tesla M2050 for ten hours

44. Tradeworx: Market Information Data Analytics System
“For the growing team of quant types now employed at the SEC, MIDAS is becoming the world’s greatest data sandbox. And the staff is planning to use it to make the SEC a leader in its use of market data.” – Elisse B. Walter, Chairman of the SEC
• Powerful AWS-based system for market analytics
• 2M transaction messages/sec; 20B records and 1TB/day

48. NYU School of Medicine: transferring large data sets
“Transferring data is a large bottleneck; our datasets are extremely large, and it often takes more time to move the data than to generate it. Since our collaborators are all over the world, if we can’t move it they can’t use it.” – Dr. Stratos Efstathiadis, Technical Director of the HPC facility, NYU
• Uses Globus Online
• Data transfer speeds of up to 50MB/s

50. Bankinter: credit-risk simulation
“With AWS, we now have the power to decide how fast we want to obtain simulation results. More important, we have the ability to run simulations that were not possible before due to the large amount of infrastructure required.” – Javier Roldán, Director of Technological Innovation, Bankinter
• Reduced processing time of 5,000,000 simulations from 23 hours to 20 minutes

60. So What Does Scale Mean on AWS?
Accelerating materials science: 2.3M core-hours for $33K

61. Cyclopic energy: computational fluid dynamics
“AWS makes it possible for us to deliver state-of-the-art technologies to clients within timeframes that allow us to be dynamic, without having to make large investments in physical hardware.” – Rick Morgans, Technical Director (CTO), cyclopic energy
• Two months’ worth of simulations finished in two days

63. AeroDynamic Solutions: turbine engine simulation
“We’re delighted to be working closely with the U.S. Air Force and AWS to make time accurate simulation a reality for designers large and small.” – George Fan, CEO, AeroDynamic Solutions
• Time-accurate simulation was turned around in 72 hours with infrastructure costs well below $1,000

65. Pfizer: large-scale data analytics and modeling
“AWS enables Pfizer’s Worldwide Research and Development to explore specific difficult or deep scientific questions in a timely, scalable manner and helps Pfizer make better decisions more quickly.” – Dr. Michael Miller, Head of HPC for R&D, Pfizer
• Pfizer avoided having to procure new HPC hardware by using AWS for peak workloads