AWS re:Invent 2017 and the Application Performance Imperative

Another year, another huge gathering of cloud enthusiasts in Las Vegas, all keen to see and hear what new features Amazon has added to the AWS platform.

As expected, this year's event saw a raft of announcements related to the core platform, machine learning, security and IoT. In this piece, I review the most eye-catching announcements with one eye firmly on the application performance imperative ('thou shalt have lightning-quick applications that delight your users').

Elastic Kubernetes Service (EKS)

It's now clear that Kubernetes has won the container orchestration war. Thankfully, AWS is happy to admit this and has made Kubernetes a first-class citizen for container management.

Given the popularity of Kubernetes, this creates an interesting opportunity: optimising the cluster itself, not just the workloads running on it.

Any container-based application will benefit from optimisation of what's running inside the container - be that Java, Ruby, Node or something else. Beyond this, EKS is interesting because it gives API-based access to cluster make-up - what instances, how many, where they are - opening up another avenue for optimisation.
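To make the 'what instances, how many, where they are' point concrete, here is a minimal sketch in Python. It assumes you have already fetched each node's labels from the Kubernetes API (for example with kubectl or a client library) and that the nodes carry the well-known instance-type and zone labels; older clusters may use beta-prefixed variants instead. The function name is my own, not part of any AWS or Kubernetes API.

```python
from collections import Counter

# Well-known Kubernetes node labels (assumption: your nodes carry these;
# older clusters may expose beta-prefixed equivalents).
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"
ZONE_LABEL = "topology.kubernetes.io/zone"


def summarise_cluster(node_labels):
    """Count nodes per (instance type, zone).

    node_labels is a list of dicts, one per node, mapping label name to value.
    Nodes missing a label are grouped under "unknown".
    """
    counts = Counter()
    for labels in node_labels:
        key = (
            labels.get(INSTANCE_TYPE_LABEL, "unknown"),
            labels.get(ZONE_LABEL, "unknown"),
        )
        counts[key] += 1
    return dict(counts)
```

A summary like this is the starting point for cluster optimisation: it tells you which instance types and zones are in play before you start varying them.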

Fargate

Fargate is an extension to ECS (with support for EKS announced to follow) that removes the need to manually manage the cluster infrastructure.

As a container consumer this is simply fantastic. Why bother with instance management if all you want to do is run some containers?

Fargate shifts the configuration burden from cluster configuration to container configuration - this is something SKIPJAQ can help with.

Each container that runs on Fargate needs to know how much memory, CPU, I/O and storage it needs. All of these parameters are subject to optimisation and all of them vary in tandem with parameters that exist inside the container.
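As a rough illustration of what 'container configuration' means here, the sketch below builds an ECS task definition for Fargate and rejects CPU/memory pairs that Fargate would not accept. It only covers CPU and memory. The table of valid combinations reflects my understanding of the limits at launch, so treat it as an assumption and check the current documentation. The function and constant names are hypothetical.

```python
# CPU units -> allowed memory values in MiB (assumption: launch-era limits;
# verify against the current Fargate documentation before relying on them).
FARGATE_MEMORY_BY_CPU = {
    256: (512, 1024, 2048),
    512: tuple(range(1024, 4096 + 1, 1024)),
    1024: tuple(range(2048, 8192 + 1, 1024)),
    2048: tuple(range(4096, 16384 + 1, 1024)),
    4096: tuple(range(8192, 30720 + 1, 1024)),
}


def fargate_task_definition(family, image, cpu, memory):
    """Return keyword arguments suitable for ECS register_task_definition."""
    if memory not in FARGATE_MEMORY_BY_CPU.get(cpu, ()):
        raise ValueError(f"unsupported Fargate size: cpu={cpu}, memory={memory}")
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # Fargate tasks use awsvpc networking
        "cpu": str(cpu),          # task-level sizes are passed as strings
        "memory": str(memory),
        "containerDefinitions": [
            {"name": family, "image": image, "essential": True}
        ],
    }
```

With boto3 installed and credentials configured, the result could be passed as ecs_client.register_task_definition(**fargate_task_definition("web", "nginx:latest", 256, 512)). The interesting part for optimisation is that cpu and memory become two more knobs to tune alongside the settings inside the container.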

SageMaker

As expected, Amazon announced a lot of new machine learning products, but SageMaker is the one I want to focus on here.

In essence, SageMaker is a managed ML environment suitable for model development, training and hosting.

SageMaker provides a hosted Jupyter environment with instances that are correctly configured to take advantage of the GPU for computation.

SageMaker also provides a managed service for scheduling training jobs on demand, again with instances that are optimised for the GPU. The job system will certainly appeal to some AWS customers, but they'll need to consider carefully whether it will be as cost-effective in the long run as using something like spot or Fargate.

S3 Select

This is a small but incredibly useful new feature: it's now possible to select subsets of data from objects stored in S3. To me this looks like a lightweight version of Athena that's available on-demand for any S3 object.

If you store a lot of CSV data in S3, being able to select from that data without having to load the full object is certainly of interest.
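Here is a minimal sketch of what that looks like from Python, assuming a boto3 S3 client and a CSV object with a header row. The helper name, bucket, key and column names are placeholders of mine; the call itself is boto3's select_object_content.

```python
def select_csv_rows(s3_client, bucket, key, sql):
    """Run an S3 Select query against a CSV object and return the matching rows as text.

    s3_client is expected to be a boto3 S3 client (for example boto3.client("s3")).
    """
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        # Treat the first line as a header so columns can be referenced by name.
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The payload is an event stream; only "Records" events carry row data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")
```

A call might look like select_csv_rows(boto3.client("s3"), "my-bucket", "orders.csv", "SELECT s.order_id FROM S3Object s WHERE s.country = 'UK'"). Only the selected rows travel over the network, which is where the performance win comes from.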

PrivateLink for Third Parties

For some time now, VPC has supported private endpoints - PrivateLink - for accessing other AWS services. These private endpoints ensure that all traffic between the VPC and the service in question stays on AWS internal fibre, never hitting the public Internet.

Now it's possible for anybody running a service on AWS to publish this service on the marketplace as a VPC-ready private endpoint. Which is nice.

Serverless Application Repository

Lambda functions are easy to secure and review, and there's no cost when the function sits idle.

The new Serverless Application Repository provides a much-needed means for customers to consume functions without having to dig into the details of how Lambda works.

Instance Family Evolution

In addition to the recently released C5 instance family, we now have the new M5 and H1 instances. One of the disadvantages of this development is that selecting the right instance type is now harder than ever. C5 instances run on an entirely different hypervisor and expose even more detail on the inner workings of system performance.

Much of the benefit of newer instance types will come from configuring them correctly, and ideally optimally. Configurations do not port across instance families, and getting the best performance per dollar involves adding instance type into the optimisation mix.

Bare Metal Instances

At the top end of the newly-available compute infrastructure is the new i3.metal bare metal instance. With a whopping 512GB RAM and 15TB of NVMe local storage, this is a machine for the most serious workloads.

With the introduction of bare metal instances, we now have instance families stretched across three different backplanes. The bare metal instances have no virtualisation at all, relying on traditional out-of-band control planes to provide an EC2 API on top of real metal.

C5 instances use the new Nitro hypervisor which, when configured correctly, reduces virtualisation overhead to around 3%.

All other instance families are stretched across various generations of the original Xen-based hypervisor. With correct configuration, some of these instance families can unlock faster networking and I/O.

All of this is to say: the matrix of options for instance types just got a whole lot larger - and we are here to help you make the right choices in this area!

Bursting Improvements

Burstable instances have long been an interesting way to save money for applications subject to spiky traffic. These same instances have also been tricky to optimise against because of their inability to sustain high performance under load.

With the introduction of unlimited bursting, some burstable instances will no longer step down their performance when burst credits are exhausted. Instead, the cost of these instances varies over time depending on how much bursting is actually used.

This opens up an interesting opportunity: we can now optimise bursting for the best price/response time trade-off. To do this we optimise against traffic that exhibits the same peaks and troughs as seen in production. I can imagine a world where burstable instance configurations are selected to always maximise return on investment, optimising the trade-off between improved latency and increased cost.
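The trade-off described above can be sketched as a small selection problem. This is an illustration under stated assumptions, not how any particular product does it: each candidate configuration is assumed to have been measured against production-shaped traffic, and the cost model (a fixed hourly price plus a per-vCPU-hour charge for bursting beyond earned credits) is a simplification. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstCandidate:
    name: str
    base_hourly_cost: float    # fixed instance price per hour
    surplus_vcpu_hours: float  # bursting beyond earned credits, per hour of traffic
    p95_latency_ms: float      # measured under production-shaped load


def hourly_cost(candidate, surplus_price_per_vcpu_hour):
    """Fixed price plus the variable charge for surplus bursting."""
    return (
        candidate.base_hourly_cost
        + candidate.surplus_vcpu_hours * surplus_price_per_vcpu_hour
    )


def cheapest_within_latency(candidates, latency_target_ms, surplus_price_per_vcpu_hour):
    """Return the cheapest candidate that meets the latency target, or None."""
    eligible = [c for c in candidates if c.p95_latency_ms <= latency_target_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda c: hourly_cost(c, surplus_price_per_vcpu_hour))
```

Tightening the latency target pushes the choice towards configurations that burst less but cost more up front; relaxing it does the opposite. That is the price/response time curve we would want to explore.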

Spot Improvements

Spot instances can now be launched synchronously using the standard EC2 APIs, making it much easier for AWS users to control the spin up of temporary compute power.
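In practice this means a spot launch is just a regular RunInstances call with an extra market option. The sketch below builds the arguments for boto3's run_instances; the helper name is mine, and the AMI ID and price in the usage note are placeholders.

```python
def spot_run_instances_kwargs(image_id, instance_type, max_price=None):
    """Build keyword arguments for EC2 run_instances that request a one-time spot instance."""
    spot_options = {"SpotInstanceType": "one-time"}
    if max_price is not None:
        # The API expects the maximum hourly price as a string.
        spot_options["MaxPrice"] = str(max_price)
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": spot_options,
        },
    }
```

With boto3 installed, boto3.client("ec2").run_instances(**spot_run_instances_kwargs("ami-12345678", "c4.large", max_price=0.05)) would return the instance details directly, with no separate spot request to poll.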

Spot instances on certain families (C4 in particular) can now be hibernated, rather than terminated, when AWS reclaims their capacity. For ML jobs, this is a great way of saving money while avoiding the need to restart state-heavy jobs.

Do you want to know more about how the news from AWS re:Invent 2017 is likely to affect the performance of your company’s applications? Write to me at rob@skipjaq.com with your performance-related questions and I’ll do my best to get back to you all with the answers you need to boost your business.