
We launched Amazon EC2 Container Service last fall at AWS re:Invent and made it available in production form this past April. We’ve made a lot of enhancements since then and I thought it would be a good time to recap them for you. AWS customers are already making good use of EC2 Container Service and I’d like to share some of their stories with you as well. They love the fact that they can get high performance and flexible resource scheduling without having to build, run, scale, or maintain their own clusters.

Latest Features We have received a remarkable amount of feedback and feature requests for this product. You’ve let us know what you need by way of tweets, emails, posts in the Amazon ECS forum, blog posts, private meetings, and at the AWS Summits. We love all of this feedback and do our best to understand it and to make sure that our roadmap reflects what you tell us. Here’s what we’ve done in the first half of 2015 as we work our way along our roadmap:

July – We added support for the use of the UDP protocol in container port task definitions.

From Our Customers Our customers are making great use of EC2 Container Service, often running large clusters that host production applications. Here’s a sample of what they have shared with us:

Coursera runs large-scale batch jobs on the service. They had their prototype up and running in under two months, and can now deploy software changes in minutes instead of hours, with the ability to scale up to handle dynamic loads. To learn more, read the Coursera Case Study.

Hailo hosts their smartphone app for taxi hailing on AWS using EC2 Container Service as their cluster manager for their microservice-based architecture. They use a custom scheduler driven by a combination of service priority and runtime metrics to drive high resource utilization within their resource pool. To learn more, page through Microservices and Elastic Resource Pools with Amazon EC2 Container Service.

In the Community Finally, I would like to share a few other community-oriented tidbits with you.

Open Source – We recently announced that we will participate in the Open Container Project, with the goal of creating a set of common specifications for software containers. The ECS Container Agent is available on GitHub and we accept pull requests from potential contributors.

Learn More If you would like to learn more about container computing, Amazon ECS, and Docker, here are some resources to get you started:

As I was thinking about this post, I thought it would be fun to deconstruct Auto Scaling to ensure that I (and you) have a full understanding of how it works and how it makes use of other parts of AWS (in practice most of our customers use Auto Scaling to launch and terminate instances on their behalf; this is more of a look behind the scenes and an illustration of how different parts of AWS depend upon and build on each other). Here are the moving parts:

Resource Creation – In order to be able to implement Auto Scaling, we need to have the ability to launch and terminate EC2 instances as needed. Of course, AWS is API-driven and these operations are handled by the RunInstances and TerminateInstances actions, assisted by DescribeInstances:

Resource Monitoring – We need to measure and track how busy (in terms of CPU utilization, network traffic, or other metrics) our instances are (both individually and collectively) in order to be able to make informed scaling decisions. This is handled by Amazon CloudWatch:

Alarms – Now that we are tracking resource utilization, we need to know when the operating conditions dictate a scale-out or scale-in operation. This is also handled by CloudWatch:

Scaling Actions – The final step is to actually take action when an alarm is raised. This is handled by Auto Scaling, as directed by a CloudWatch Alarm:

The actions are defined within a particular Auto Scaling Group, and can add or remove a specific number of instances. They can also adjust the instance count by a percentage (add 20% more instances) or set it to an absolute value.
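The three kinds of adjustment described above can be sketched in a few lines of Python. This is an illustrative model only: the type names mirror the adjustment types used by the Auto Scaling API (ChangeInCapacity, PercentChangeInCapacity, ExactCapacity), while the function name and the rounding rule for percentages are assumptions made for the sketch.

```python
import math


def apply_adjustment(current: int, adjustment_type: str, value: int) -> int:
    """Return the new desired instance count after one scaling action."""
    if adjustment_type == "ChangeInCapacity":
        # Add (or, with a negative value, remove) a specific number of instances.
        new = current + value
    elif adjustment_type == "PercentChangeInCapacity":
        # Assumption: round the change toward zero, but always move by
        # at least one instance when a non-zero change was requested.
        delta = current * value / 100
        step = math.trunc(delta)
        if step == 0 and delta != 0:
            step = 1 if delta > 0 else -1
        new = current + step
    elif adjustment_type == "ExactCapacity":
        # Set the group to an absolute size.
        new = value
    else:
        raise ValueError(f"unknown adjustment type: {adjustment_type}")
    return max(new, 0)
```

For a group of 10 instances, `apply_adjustment(10, "PercentChangeInCapacity", 20)` gives 12, matching the "add 20% more instances" case.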

New Scaling Policies With Steps Today we are making Auto Scaling even more flexible with the addition of new scaling policies with steps.

Our goal is to allow you to create systems that can do an even better job of responding to rapid and dramatic changes in load. You can now define a scaling policy that will respond to the magnitude of the alarm breach in a proportionate and appropriate way. For example, if you try to keep your average CPU utilization below 50% you can have a standard response for a modest breach (50% to 60%), two more for somewhat bigger breaches (60% to 70% and 70% to 80%), and a super-aggressive one for utilization that exceeds 80%.

Here’s how I set this up for my Auto Scaling group:

In this example I added a fixed number (1, 2, 4, or 8) of instances to the group. I could have chosen to define the policies on a percentage basis, increasing the instance count by (say) 50%, 100%, 150%, and 200% at the respective steps. The empty upper bound in the final step is effectively positive infinity. You can also define a similar set of increasingly aggressive policies for scaling down.
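To make the step idea concrete, here is a minimal sketch of how a breach could be mapped to one of the four steps in this example. The table of bounds and the function name are illustrative, the steps are written as absolute CPU percentages for readability, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption of the sketch.

```python
from typing import List, Optional, Tuple

# (lower bound %, upper bound %, instances to add); None means no upper bound.
STEPS: List[Tuple[float, Optional[float], int]] = [
    (50, 60, 1),
    (60, 70, 2),
    (70, 80, 4),
    (80, None, 8),  # empty upper bound: effectively positive infinity
]


def instances_to_add(cpu_percent: float, steps=STEPS) -> int:
    """Pick the adjustment for the step that contains the measured CPU value."""
    for lower, upper, adjustment in steps:
        if cpu_percent >= lower and (upper is None or cpu_percent < upper):
            return adjustment
    return 0  # below the first step: the alarm is not breached
```

A reading of 75% lands in the third step and adds 4 instances, while anything at or above 80% adds 8.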

As you can see from the example above, you can also tell Auto Scaling how long it should take for an instance to warm up and be ready to start sharing the load. While this waiting period is in effect, Auto Scaling will include the newly launched instances when it computes the current size of the group. However, during this warm-up time, the instances are not factored into the CloudWatch metrics for the group. This avoids unnecessary scaling while the new instances prepare themselves to take on their share of the load.

Step policies continuously evaluate the alarms during a scaling activity and while unhealthy instances are being replaced with new ones. This allows for faster response to changes in demand. Let’s say the CPU load increases and the first step in the policy is activated. During the specified warm-up period (300 seconds in this example), the load might continue to increase and a more aggressive response might be appropriate. Fortunately, Auto Scaling is in violent agreement with this sentiment and will switch into high gear (and use one of the higher steps) automatically. If you create multiple step scaling policies for the same resource (perhaps based on CPU utilization and inbound network traffic) and both of them fire at approximately the same time, Auto Scaling will look at both policies and choose the one that results in the change of the highest magnitude.
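The "highest magnitude wins" rule for policies that fire together can be expressed very compactly. This is only a sketch of the selection rule as described here; the function name and the policy names in the example are made up, and ties are resolved in favor of whichever policy is listed first.

```python
def choose_policy(proposed_changes: dict) -> str:
    """Return the name of the policy that asks for the largest change.

    proposed_changes maps a policy name to the instance-count change
    that the policy would make if it were applied on its own.
    """
    return max(proposed_changes, key=lambda name: abs(proposed_changes[name]))
```

If a CPU-based policy asks for 2 more instances and a network-based policy asks for 4, `choose_policy({"cpu": 2, "network-in": 4})` picks the network-based one.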

EC2’s R3 instances are designed to provide you with the best price per GiB of RAM, along with high memory performance. I am happy to be able to announce that they are now available in the South America (São Paulo) region, in two sizes.

Here are the specs:

| Instance Name | vCPU Count | RAM | SSD Storage (GB) | Hourly On-Demand (Linux) | RI Upfront (Linux, 3 Year) | RI Price / Hour (Linux, 3 Year) |
|---|---|---|---|---|---|---|
| r3.4xlarge | 16 | 122 GiB | 1 x 320 | $2.946 | $17,345 | $0.660 |
| r3.8xlarge | 32 | 244 GiB | 2 x 320 | $5.892 | $34,690 | $1.320 |

Here are some of the other notable features and characteristics of these instances:

Intel Xeon (Ivy Bridge) processors.

Support for Enhanced Networking, which provides low latency, low jitter, and high packet-per-second performance.

Sustained memory bandwidth of up to 63 GBps.

Fast I/O performance – up to 150,000 4 KB random reads per second.

You can use these instances for in-memory analytics (SAP HANA springs to mind), high performance relational and NoSQL databases, data warehouses, and memory-resident caches.

I often point to EC2 Spot Instances as a feature that is only truly useful when it is implemented at world scale.

Unless you have a massive amount of compute power and a multitude of customers spread across every time zone in the world, with a wide variety of workloads, you simply won’t have the ever-changing shifts in supply and demand (and the resulting price changes) that are needed to create a genuine market. As a quick reminder, Spot Instances allow you to save up to 90% (when compared to On-Demand pricing) by placing bids for EC2 capacity. Instances will run whenever your bid exceeds the current Spot Price and can be terminated (with a two-minute warning) when there are higher bids for the same capacity (as determined by region, availability zone, and instance type).

Because Spot Instances come and go, you need to pay attention to your bidding strategy and to your persistence model in order to maximize the value that you derive from them. Looked at another way, by structuring your application in the right way you can be in a position to save up to 90% (or, if you have a flat budget, you can get 10x as much computing done). This is a really interesting spot for you, as the cloud architect for your organization. You can exercise your technical skills to drive the cost of compute power toward zero, while making applications that are price aware and more fault-tolerant. Master the ins and outs of Spot Instances and you (and your organization) will win!
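The "save 90% or get 10x as much computing done" equivalence is simple arithmetic: at a discount d, every dollar buys 1 / (1 - d) times as many instance-hours. A tiny sketch (the function name is illustrative):

```python
def compute_multiplier(discount: float) -> float:
    """How many times more instance-hours a fixed budget buys at a discount.

    discount is a fraction of the On-Demand price, e.g. 0.9 for 90% off.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    return 1 / (1 - discount)
```

With `discount=0.9` the multiplier works out to 10 (up to floating-point rounding), and a more modest 50% saving still doubles the work that a flat budget can pay for.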

The Trend is Clear As I look back at the history of EC2 — from launching individual instances on demand, then on to Spot Instances, Containers, and Spot Fleets — the trend is pretty clear. Where you once had to pay attention to individual, long-running instances and to list prices, you can now think about collections of instances with an indeterminate lifetime, running at the best possible price, as determined by supply and demand within individual capacity pools (groups of instances that share the same attributes). This new way of thinking can liberate you from some older thought patterns and can open the door to some new and intriguing ways to obtain massive amounts of compute capacity quickly and cheaply, so you can build really cool applications at a price you can afford.

I should point out that there’s a win-win situation when it comes to Spot. You (and your customers) win by getting compute power at the most economical price possible at a given point in time. Amazon wins because our fleet of servers (see the AWS Global Infrastructure page for a list of locations) is kept busy doing productive work. High utilization improves our cost structure, and also has an environmental benefit.

Spot Best Practices Over the next few months, with a lot of help from the EC2 Spot Team, I am planning to share some best practices for the use of Spot Instances. Many of these practices will be backed up with real-world examples that our customers have shared with us; these are not theoretical or academic exercises. Today I would like to kick off the series by briefly outlining some best practices.

Let’s define the concept of a capacity pool in a bit more detail. As I alluded to above, a capacity pool is a set of available EC2 instances that share the same region, availability zone, operating system (Linux/Unix or Windows), and instance type. Each EC2 capacity pool has its own availability (the number of instances that can be launched at any particular moment in time) and its own price, as determined by supply and demand. As you will see, applications that can run across more than one capacity pool are in the best position to consistently access the most economical compute power. Note that capacity in a pool is shared between On-Demand and Spot instances, so Spot prices can rise from either more demand for Spot instances or an increase in requests for On-Demand instances.

Here are some best practices to get you started.

Build Price-Aware Applications – I’ve said it before: cloud computing is a combination of a business model and a technology. You can write code and design systems that are price-aware, and that have the potential to make your organization’s cloud budget go a lot further. This is a new area for a lot of technologists; my advice to you is to stretch your job description (and your internal model of who you are and what your job entails) to include designing for cost savings.

You can start by spending some time investigating (or by building some tools using the EC2 API or the AWS Command Line Interface (CLI)) the full range of capacity pools that are available to you within the region(s) that you use to run your app. High prices and a high degree of price variance over time indicate that many of your competitors are bidding for capacity in the same pool. Seek out pools that have lower prices and more stable prices (both current and historic) to find bargains and lower interruption rates.

Check the Price History – You can access historical prices on a per-pool basis going back 90 days (3 months). Instances that are currently very popular with our customers (the R3s as I write this) tend to have Spot prices that are somewhat more volatile. Older generations (including c1.8xlarge, m1.small, cr1.8xlarge, and cc2.8xlarge) tend to be much more stable. In general, picking older generations of instances will result in lower net prices and fewer interruptions.
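As a starting point for this kind of investigation, the sketch below ranks capacity pools by how stable and how cheap their recent prices have been. The sample prices are invented for illustration; in practice the history could come from the EC2 DescribeSpotPriceHistory action. Ordering by relative price variation first and average price second is just one reasonable choice, not an official recommendation.

```python
from statistics import mean, pstdev

# Hypothetical price history: (availability zone, instance type) -> prices in $/hour.
SAMPLE_HISTORY = {
    ("us-east-1a", "m1.small"): [0.010, 0.010, 0.011, 0.010],
    ("us-east-1a", "r3.8xlarge"): [0.30, 0.90, 0.45, 1.50],
}


def rank_pools(history: dict) -> list:
    """Return (pool, average price, relative variation), most stable first."""
    stats = []
    for pool, prices in history.items():
        average = mean(prices)
        # Coefficient of variation: price spread relative to the average price.
        variation = pstdev(prices) / average
        stats.append((pool, average, variation))
    return sorted(stats, key=lambda item: (item[2], item[1]))
```

With the sample data, the older m1.small pool sorts ahead of the more volatile r3.8xlarge pool.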

Use Multiple Capacity Pools – Many types of applications can run (or can be easily adapted to run) across multiple capacity pools. By having the ability to run across multiple pools, you reduce your application’s sensitivity to price spikes that affect a pool or two (in general, there is very little correlation between prices in different capacity pools). For example, if you run in five different pools your price swings and interruptions can be cut by 80%.

A high-quality approach to this best practice can result in multiple dimensions of flexibility, and access to many capacity pools. You can run across multiple availability zones (fairly easy in conjunction with Auto Scaling and the Spot Fleet API) or you can run across different sizes of instances within the same family (Amazon EMR takes this approach). For example, your app might figure out how many vCPUs it is running on, and then launch enough worker threads to keep all of them occupied.
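Sizing the worker pool to the instance, rather than hard-coding it, is what lets the same code run on any size in a family. Here is a minimal Python sketch of that idea; the function name is illustrative and the task is whatever unit of work your application processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_on_all_vcpus(task, items):
    """Run task over items with one worker thread per vCPU on this instance."""
    # os.cpu_count() can return None if the count cannot be determined.
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the order of the inputs in its results.
        return list(pool.map(task, items))
```

Note that in CPython, threads suit I/O-bound work; for CPU-bound work the same pattern applies with `concurrent.futures.ProcessPoolExecutor` in place of the thread pool.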

Adherence to this best practice also implies that you should strive to use roughly equal amounts of capacity in each pool; this will tend to minimize the impact of changes to Spot capacity and Spot prices.
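One way to read the 80% figure: if capacity is spread evenly across five pools, a price spike in any single pool touches only one fifth of your instances. The sketch below splits a target capacity evenly and reports the worst-case share lost when one pool is interrupted. The function names are illustrative, and the model assumes that pools are interrupted independently.

```python
def spread_capacity(target: int, pools: list) -> dict:
    """Split target instances across pools as evenly as possible."""
    base, extra = divmod(target, len(pools))
    # The first `extra` pools take one additional instance each.
    return {pool: base + (1 if i < extra else 0) for i, pool in enumerate(pools)}


def worst_single_pool_loss(allocation: dict) -> float:
    """Fraction of the fleet lost if the largest pool is interrupted."""
    return max(allocation.values()) / sum(allocation.values())
```

For 100 instances in five pools the allocation is 20 per pool and the worst-case loss is 0.2, compared with 1.0 when everything runs in a single pool.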

Stay Tuned As I mentioned, this is an introductory post and we have a lot more ideas and code in store for you! If you have feedback, or if you would like to contribute your own Spot tips to this series, please send me (awseditor@amazon.com) a note.

We launched the T2 instances last summer (see my post, New Low Cost EC2 Instances with Burstable Performance for more information). These instances give you a generous amount of baseline capacity and the ability to automatically and transparently scale up to full-core processing power on an as-needed basis. The bursting model is based on “CPU Credits” that accumulate during quiet periods for spending when things get busy.

Today we are adding the t2.large instance based on customer feedback and on our own usage data. Our customers told us that the burst-based model gave them plenty of CPU power to run applications that consumed large amounts of memory. The new size provides double the amount of memory, along with a higher baseline level of CPU power.

Many AWS customers are running development environments, small databases, application servers, and web servers on their T2 instances. These applications generally don’t need the full CPU very often, but they do need to burst to higher CPU performance from time to time.

Here are the specs for all of the sizes of T2 instances:

| Name | vCPUs | Baseline Performance | Platform | RAM (GiB) | CPU Credits / Hour | Price / Hour (Linux) | Price / Month (Linux) |
|---|---|---|---|---|---|---|---|
| t2.micro | 1 | 10% | 32-bit or 64-bit | 1 | 6 | $0.013 | $9.50 |
| t2.small | 1 | 20% | 32-bit or 64-bit | 2 | 12 | $0.026 | $19.00 |
| t2.medium | 2 | 40% | 32-bit or 64-bit | 4 | 24 | $0.052 | $38.00 |
| t2.large | 2 | 60% | 64-bit | 8 | 36 | $0.104 | $76.00 |
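The Baseline Performance and CPU Credits / Hour columns are two views of the same number. One CPU credit pays for one vCPU running at 100% for one minute (this definition comes from the EC2 documentation rather than from this post), so an instance earns 60 times its baseline fraction in credits every hour. A short sketch that checks the table and estimates burst time (the function names are illustrative):

```python
# instance size -> (baseline % of one core, CPU credits earned per hour), from the table
T2_SIZES = {
    "t2.micro": (10, 6),
    "t2.small": (20, 12),
    "t2.medium": (40, 24),
    "t2.large": (60, 36),
}


def credits_per_hour(baseline_percent: float) -> float:
    """Credits earned per hour: 60 minutes times the baseline fraction."""
    return 60 * baseline_percent / 100


def burst_minutes(credit_balance: float, vcpus_busy: int = 1) -> float:
    """Minutes of full-speed bursting that a balance pays for.

    Simplification: ignores the credits that keep accruing during the burst.
    """
    return credit_balance / vcpus_busy
```

For example, an hour of accumulated credits on a t2.large (36) covers 18 minutes with both vCPUs running flat out.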

AWS customer GoSquared (“People Analytics”) has been making good use of the T2 instances. Here’s what they have to say:

“The best part about T2 instances is that, so long as you don’t spend all your CPU credits, you enjoy the performance and all the power of a much larger instance, but at a fraction of the cost. As far as the services on your instance are concerned, they’re running on a fixed performance instance.”

These instances are available in the US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), EU (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), South America (São Paulo), and AWS GovCloud (US) regions in On-Demand and Reserved form.

This edition of SQL Server is also more scalable. In contrast to the Standard Edition of SQL Server which maxes out at 16 cores and 128 GiB of memory, the Enterprise Edition is able to take advantage of the 32 cores and 244 GiB of memory provided by the r3.8xlarge instance.

You can run this new AMI on r3.2xlarge, r3.4xlarge, and r3.8xlarge instances in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions (visit the AWS Marketplace for more information):

We launched Amazon Elastic Compute Cloud (EC2) with a single instance type (m1.small) way back in 2006! Since then, we have added many new types in response to customer demand, enabled by improvements in memory and processor technology (see my recent post, EC2 Instance History, for a look back in time).

Today we are adding new M4 instances in five sizes. These are General Purpose instances, with a balance of compute, memory, and network resources.

Let’s take a closer look!

New M4 Instances The new M4 instances feature a custom Intel Xeon E5-2676 v3 Haswell processor optimized specifically for EC2. They run at a base clock rate of 2.4 GHz and can go as high as 3.0 GHz with Intel Turbo Boost. Here are the specs:

| Instance Name | vCPU Count | RAM | Instance Storage | Network Performance | EBS-Optimized |
|---|---|---|---|---|---|
| m4.large | 2 | 8 GiB | EBS Only | Moderate | 450 Mbps |
| m4.xlarge | 4 | 16 GiB | EBS Only | High | 750 Mbps |
| m4.2xlarge | 8 | 32 GiB | EBS Only | High | 1,000 Mbps |
| m4.4xlarge | 16 | 64 GiB | EBS Only | High | 2,000 Mbps |
| m4.10xlarge | 40 | 160 GiB | EBS Only | 10 Gbps | 4,000 Mbps |

If you are running Linux on an m4.10xlarge instance, you can also control the C states and the P states (see my post on the New C4 Instances to learn more about this). The supersized core count on this instance will be great for applications that use multiple processes to achieve a high degree of concurrency.

These instances also offer Enhanced Networking which delivers up to 4 times the packet rate of instances without Enhanced Networking, while ensuring consistent latency, even when under high network I/O. Within placement groups, Enhanced Networking also reduces average latencies between instances by 50% or more. The M4 instances are EBS-Optimized by default, with additional, dedicated network capacity for I/O operations. The instances support 64-bit HVM AMIs launched within a VPC.

The M4 instances are available today in the US East (Northern Virginia), US West (Northern California), US West (Oregon), EU (Ireland), EU (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo) regions. You can launch them in On-Demand or Spot form, and you can also purchase Reserved Instances.

Price Reductions on M3 and C4 Instances As part of today’s launch we are lowering the On-Demand and One Year Reserved Instances prices for the M3 and C4 instances by 5% in the US East (Northern Virginia), US West (Oregon), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), and Asia Pacific (Sydney) regions.

On-Demand Instance price reductions are effective June 1, 2015. Reserved Instance price reductions will apply to purchases after June 11, 2015. For more information, see the EC2 Pricing page.

New Spot Fleet API Today we are making EC2’s Spot Instance model even more useful with the addition of a new API that allows you to launch and manage an entire fleet of Spot Instances with one request. A fleet is a collection of Spot Instances that are all working together as part of a distributed application; it could be a batch processing job, a Hadoop workflow, or an HPC grid computing job. Many AWS customers launch fleets of Spot Instances (in sizes ranging from one instance up to thousands), using custom-written code that is responsible for discovering capacity, monitoring market prices across instance types and availability zones, and managing bids, all with the goal of running their workloads (ranging from large scale molecular dynamics simulations to continuous integration environments) at the lowest possible cost.

With today’s launch, this custom code is no longer necessary! Instead, a single API function (RequestSpotFleet) does all of the work on your behalf. You simply specify the fleet’s target capacity, a bid price per hour, and tell Spot what instance types you would like to launch. Spot will find the lowest priced spare EC2 capacity available, and work to achieve and maintain the fleet’s target capacity. One call does it all, as they say…

Making the Request You can have up to 1,000 active Spot fleets per region, with a per-fleet and a per-region limit of 3,000 instances (the usual EC2 per-account and per-region limits are still in effect and will govern the number of instances that you can launch, the number of Amazon Elastic Block Store (EBS) volumes that you can create, and so forth).

Each request (via the API or the CLI) must include the following values:

Target Capacity – The number of EC2 instances that you want in your fleet.

Maximum Bid Price – The maximum bid price that you are willing to pay.

Launch Specifications – The quantities and types of instances that you would like to launch, and how you want them to be configured (AMI Id, VPC, subnets or availability zones, security groups, block device mappings, user data, and so forth). In general, launch specifications that do not target a particular subnet or availability zone are more economical.

IAM Fleet Role – The name of an IAM role. It must allow EC2 to terminate instances on your behalf.

Each request can also include any or all of the following optional values:

Client Token – A unique, case-sensitive identifier for the request. You can use this to ensure idempotency for your Spot fleet requests.

Valid From – The start date and time of the request.

Valid Until – The end date and time of the request.

Terminate on Expiration – If set to TRUE, all Spot instances in the fleet will be terminated when the Valid Until time is reached. If set to FALSE (the default), running Spot instances will be left as-is, but no new ones will be launched.
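To show how the required and optional values fit together, here is a sketch that assembles them into a single request configuration. The keys follow the field names of the EC2 API's Spot fleet request configuration; the helper function, the AMI ID, and the role ARN used in the example are hypothetical placeholders. If you use boto3, a dictionary like this could be passed as `SpotFleetRequestConfig` to the EC2 client's `request_spot_fleet` call.

```python
import uuid
from datetime import datetime, timedelta, timezone

REQUIRED_KEYS = ("TargetCapacity", "SpotPrice", "LaunchSpecifications", "IamFleetRole")


def build_spot_fleet_config(target_capacity, max_bid, launch_specs, iam_fleet_role,
                            valid_for_hours=None, terminate_on_expiration=False):
    """Assemble the values for one Spot fleet request (hypothetical helper)."""
    config = {
        "TargetCapacity": target_capacity,         # number of instances wanted
        "SpotPrice": str(max_bid),                 # maximum bid per hour, as a string
        "LaunchSpecifications": launch_specs,      # AMI, instance type, subnet, ...
        "IamFleetRole": iam_fleet_role,            # lets EC2 terminate instances for you
        "ClientToken": str(uuid.uuid4()),          # makes retries idempotent
        "TerminateInstancesWithExpiration": terminate_on_expiration,
    }
    if valid_for_hours is not None:
        now = datetime.now(timezone.utc)
        config["ValidFrom"] = now
        config["ValidUntil"] = now + timedelta(hours=valid_for_hours)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"missing required values: {missing}")
    return config
```

Reusing the same `ClientToken` when retrying a request is what provides the idempotency mentioned above, so real code would generate the token once and keep it with the request.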

The RequestSpotFleet function will return a Spot Fleet Request Id if all goes well, or an error if the request is malformed. You will also receive an error if you ask for instance types that are not available in Spot form. You can use the Id to call other Spot fleet functions including DescribeSpotFleetRequests, DescribeSpotFleetInstances, DescribeSpotFleetRequestHistory, and CancelSpotFleetRequests (there are also command-line equivalents to each of these functions).

Behind the Scenes Once your request has been accepted and the start date and time has been reached, EC2 will attempt to reach and then maintain the desired target capacity, even as Spot prices change. It will start by looking up the current Spot price for each launch specification in your request. Then it will launch Spot Instances using the launch specification(s) that result in the lowest price, until capacity, Spot limits, or bid price limits are reached. As instances in the fleet are terminated due to rising prices, replacement instances will be launched using whatever specification(s) result in the lowest price at that point in time.
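The fulfillment logic described above amounts to a greedy, lowest-price-first allocation. The toy model below illustrates that idea only; it is not the service's actual implementation, and the `available` input (how many instances each launch specification can currently supply) is a stand-in for capacity information that only EC2 has.

```python
def plan_launches(target_capacity, max_bid, spot_prices, available):
    """Fill the target from the cheapest launch specifications first.

    spot_prices: launch specification name -> current Spot price per hour
    available:   launch specification name -> instances obtainable right now
    """
    plan = {}
    remaining = target_capacity
    # Visit the specifications from lowest to highest current price.
    for spec in sorted(spot_prices, key=spot_prices.get):
        if remaining == 0 or spot_prices[spec] > max_bid:
            break  # target reached, or everything left is above the bid
        count = min(remaining, available.get(spec, 0))
        if count:
            plan[spec] = count
            remaining -= count
    return plan
```

Re-running the same function with fresh prices after some instances have been terminated gives the replacement behavior: the shortfall is filled from whatever is cheapest at that moment.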

The request remains active until it expires or you cancel it. The Spot Instances in the fleet will remain running unless you indicated that you wanted them to be terminated. As I mentioned earlier, you need to include an IAM role so that EC2 can terminate instances that are running on your behalf.

Things to Know As is often the case with new AWS features, this is an initial release and we have a healthy backlog of features in the queue. For example, we plan to add a weighting system. It will allow you to express the relative power of each of your launch specifications in numeric form. The target capacity will also be expressed in these units; this will allow you to indicate that you need a certain amount of “horsepower” in a fleet.

Each fleet is run within a particular AWS region. In the future we would like to support fleets that span two or more regions.

Available Now You can launch Spot fleets today in all public AWS regions where Spot is available. There is no charge for the Spot fleet; you pay Spot prices for the EC2 instances that you launch and any other resources that they consume.

My colleague Mingxue Zhao sent me a guest post designed to make sure that you are aware of an important time / clock issue.

Note: This post was first published on May 18, 2015. We made some important additions and corrections on May 25, 2015.

— Jeff;

The International Earth Rotation and Reference Systems Service (IERS) recently announced that an extra second will be injected into civil time at the end of June 30th, 2015. This means that the last minute of June 30th, 2015 will have 61 seconds. If a clock is synchronized to the standard civil time, it should show an extra second 23:59:60 on that day between 23:59:59 and 00:00:00. This extra second is called a leap second. There have been 25 such leap seconds since 1972. The last one took place on June 30th, 2012.

Clocks in IT systems do not always follow the standard above and can behave in many different ways. For example:

Some organizations, including Amazon Web Services, plan to spread the extra second over many hours surrounding the leap second by making every second slightly longer.

If a clock doesn’t connect to a time synchronization system, it drifts on its own and will not implement any leap second or an adjustment for it.

If you want to know whether your applications and systems can properly handle the leap second, contact your providers. If you run time-sensitive workloads and need to know how AWS clocks will behave, read this document carefully. In general, there are three affected parts:

The AWS Management Console and backend systems

Amazon EC2 instances

Other AWS managed resources

For more information about comparing AWS clocks to UTC, see the AWS Adjusted Time section of this post.

AWS Management Console and Backend Systems The AWS Management Console and backend systems will NOT implement the leap second. Instead, we will spread the one extra second over a 24-hour period surrounding the leap second by making each second slightly longer. During these 24 hours, AWS clocks may be up to 0.5 second behind or ahead of the standard civil time (see the AWS Adjusted Time section for more information).

You can see adjusted times in consoles (including resource creation timestamps), metering records, billing records, Amazon CloudFront logs, and AWS CloudTrail logs. You will not see a “:60” second in these places and your usage will be billed according to the adjusted time.

Amazon EC2 Instances Each EC2 instance has its own clock and is fully under your control; AWS does not manage instance clocks. An instance clock can have any of the behaviors listed at the beginning of this post. Contact your OS provider to understand the expected behavior of your operating system.

If you use the Amazon Linux AMI, your instance will implement the one-second backwards jump and will see “23:59:59” twice. You may find the following information useful:

Other AWS Managed Resources Other AWS resources may also have their own clocks. Unlike EC2 instances, these resources are fully or partially managed by AWS.

The following resources will implement the one-second backwards jump and will see “23:59:59” twice:

Amazon CloudSearch clusters

Amazon EC2 Container Service instances

Amazon EMR Clusters

Amazon RDS instances

Amazon Redshift instances

To enable time synchronization on EMR clusters, your VPC has to allow access to NTP. Make sure that your EMR clusters have access to the Internet, and that your security groups and network ACLs allow outbound UDP traffic on port 123.

AWS Adjusted Time This section provides specific details on how clocks will behave in the AWS Management Console and backend systems.

Starting at 12:00:00 PM on June 30th, 2015, we will slow down AWS clocks by 1/86400. Every second on AWS clocks will take 1+1/86400 seconds of “real” time, until 12:00:00 PM on July 1st, 2015, when AWS clocks will be behind by a full second. Meanwhile, the standard civil time (UTC) will implement the leap second at the end of June 30th, 2015 and fall behind by a full second, too. Therefore, at 12:00:00 PM July 1st, 2015, AWS clocks will be synchronized to UTC again. The table below illustrates these changes.
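The arithmetic behind this adjustment can be checked with a short sketch. Time is measured as t real seconds since 12:00:00 PM UTC on June 30th, for t between 0 and 86,401 (24 hours plus the leap second). The function names are illustrative, and the sign convention (negative means that AWS clocks are behind UTC) is a choice made for this sketch.

```python
LEAP_SECOND_START = 43200  # real seconds after noon at which 23:59:60 begins
SMEAR_LENGTH = 86401       # real seconds in the whole adjustment window


def aws_elapsed(t: float) -> float:
    """Seconds shown as elapsed on AWS clocks after t real seconds.

    Each AWS second takes 1 + 1/86400 real seconds.
    """
    return t * 86400 / 86401


def utc_elapsed(t: float) -> float:
    """Seconds elapsed according to UTC clock labels.

    UTC counts straight through 23:59:60; once 00:00:00 arrives, its
    labels are one second behind the count of real seconds.
    """
    return t if t < LEAP_SECOND_START + 1 else t - 1


def aws_minus_utc(t: float) -> float:
    """Offset of AWS clocks from UTC; negative means AWS is behind."""
    return aws_elapsed(t) - utc_elapsed(t)
```

Just before the leap second the offset is about -0.5 seconds, just after it is about +0.5 seconds, and at t = 86,401 the two clocks agree again.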

UTC

AWS Adjusted Clock

AWS vs. UTC

Notes

11:59:59 AM June 30th, 2015

11:59:59 AM June 30th, 2015

+0

AWS clocks are synchronized to UTC.

12:00:00 PM

12:00:00 PM

+0

12:00:01

Each second is 1/86400 longer and AWS clocks fall behind UTC. The gap gradually increases to up to 1/2 second.