Friday, March 18, 2011

Understanding and using Amazon EBS - Elastic Block Store

The benchmarking Netflix did when we started on AWS highlighted some inconsistent behavior in EBS. The conclusion we reached is a rule of thumb for EBS - If you sustain less than 100 iops (input+output per second) long term average it works fine. Short term bursts can be 1000 iops. By short term I mean less than a minute, long term more than 10 minutes. YMMV.

If you are doing benchmarks like this, collect response time and throughput and plot your data over time. You need to run long enough that the performance shows steady state behavior. The problem with EBS is that it doesn't have a particularly steady state. To explain why we need to look at the underlying architecture. I don't know the details of how EBS is implemented, but there is enough information available to explain how it behaves.

EC2

The AWS EC2 architecture is built out of commodity low cost servers, they have a single 1 Gbit network, a few CPUs, a few disks and a few GBytes of RAM. Over time the models have changed, and EC2 does have a 10Gbit network option now, but for the purposes of this discussion, we will concentrate on the 1Gbit network models. Individual servers are virtualized into the familiar EC2 models by slicing up the RAM, CPUs and disk space, and sharing the network bandwidth and disk iops. When EC2 instances break or are de-configured any data on the internal disks is lost.

The AWS EBS service provides a reliable place to store data that doesn't go away when EC2 instances are dropped, but it provides the same mounted filesystem capability as the internal disks. If you need more disk space or iops you can mount more EBS volumes on a single EC2 instance and spread out the load. The EBS volume is connected to the EC2 instance over the same 1Gbit network as everything else. In a datacenter this would normally be built using commercially available high end storage from NetApp, EMC or whoever, it would be quite expensive (cost much more than the EC2 instance itself) and be fast and reliable up to the limits of the network. To build a low cost cloud, the alternative is to use RAIN (Redundant Array of Inexpensive Nodes) which could be based on standard EC2 instances, or variants that have more disks per CPU. Software is then used to coordinate the RAIN systems and provide an EBS service that will be slower than high end storage, but still be very reliable and be limited by the 1Gbit network.

S3 and Availability Zones

AWS also has an S3 storage service that behaves like a key/value store accessed via http requests and a REST API rather than a directly mounted filesystem. It is possible to rapidly snapshot an EBS volume to and from S3, including incremental backups and restores that fill as they go so you don't have to wait before using them. This implies to me that they share a common back-end infrastructure to some extent. The primary additional difference is that EBS volumes only exist in a single AWS Availability Zone, and S3 data is replicated across two or three Availability Zones. It takes longer to replicate the data for S3, so it is slower, but it is very robust and it is almost impossible to lose data. You can think of an Availability Zone as a complete datacenter. All the zones in a region are separate datacenters that are close enough together to support a high bandwidth and low latency network between them, but they have separate power sources and connections to the Internet.

Multi-Tenancy

The most efficient chunk of compute and storage resource to buy and deploy when building a cloud is either too big or too small for the actual use cases of real applications. Virtualization is used to sub-divide the chunks, but then each individual machine is supporting several independent tenants. For local disks, the space is divided between the tenants, and for network, everyone is sharing the same 1Gbit interface. This works well on average, because most use cases aren't network or disk bound, but you cannot control who you are sharing with and some of the time you will be impacted by the other tenants, increasing variance within each EC2 instance. You can minimize the variance by running on the biggest instance type, e.g. m1.xlarge, or m2.4xlarge. In this case there isn't room for another big tenant, so you get as much as possible of the disk space and network bandwidth to yourself. The virtualization layer reserves some of the capacity. It's possible to tell that another tenant is keeping the CPU busy by looking at the "stolen time", but there are no metrics for stolen iops or network bandwidth.

The EBS service is also multi-tenant. Many clients mount disk space from a common backend pool of EBS disks. You don't get to see how the disk space is allocated, or how data is replicated over more than one disk or instance for durability, but it is limited to that availability zone. A busy client can slow down other clients that share the same EBS service resources. EBS volumes are between 1GB and 1TB in size. If you allocate a 1TB volume, you reduce the amount of multi-tenant sharing that is going on for the resources you use, and you get more consistent performance. Netflix uses this technique, our high traffic EBS volumes are mostly 1TB, although we don't need that much space.

This is actually no different in principle to large shared storage area network (SAN) backends (from companies like EMC or NetApp) that are in common datacenter use. Those also have unpredictable performance when pushed hard, and they mask this issue with lots of battery backed memory. The difference is cost. EBS is 10c per Gbyte per month. If you build a competing public cloud service using high end storage, you could get better performance but your cost base would be far higher.

Visualizing Multi-Tenant Disk Access

I have come up with some diagrams to help show what happens. I'm basing them on a simplified view of AWS where the only instance type family is m1 and everything they have is made out of one underlying building block. This consists of a fairly old specification system, 8 cores, 16GB RAM, four 500GB disks and a single 1Gbit network. In reality, AWS is much more complex than this, but the principles are the same.

Starting with internal disks, this is what an m1.xlarge looks like, it takes up the whole system apart from a small amount of memory, disk space and network traffic for the VM and AWS configuration/management information. You can expect to have minimal multi-tenant contention for network or disk access.

The m1.large instance type halves the system, each instance has two disks rather than four, so it shares the network and some of the disk controller bandwidth, but it should have minimal iops contention with the other tenant.

The low cost m1.small instance type has 160GB of disk per instance, so we can fit three per disk for a total of 12 instances per machine. (Note that the memory for a real m1.small is 1.7GB, so only 9 would fit in 16GB RAM, however the c1.medium instance has 1.7GB, 350GB disk, and more CPU, so six m1.small and three c1.medium fits). You can see the multi-tenancy problem here, any of the instances could generate enough traffic to fill the network and make one of the disks busy, and that is going to affect other instances in an unpredictable and random manner.

Here's an analogy, you can rent a whole house, rent a room in a house, or rent a couch to sleep on, you get what you pay for.

If you ever see public benchmarks of AWS that only use m1.small, they are useless, it shows that the people running the benchmark either didn't know what they were doing or are deliberately trying to make some other system look better. You cannot expect to get consistent measurements of a system that has a very high probability of multi-tenant interference.

EBS Multi-Tenancy

The next few diagrams show the flow of traffic from an instance to the EBS service, which makes two copies of the data on disks connected to separate instances. I don't know if this is how EBS works, but if we wanted to build an EBS-like system using the same building block it could look like this. In practice it would make sense to have specialized back-end building blocks with much more disk space.

The first diagram shows how Netflix runs EBS, we start with an instance that has the maximum network bandwidth with no other tenants, we allocate maximum size 1TB volumes (we stripe many of them together) and the service has to use most of the disk space in the back-end to support us, so we have less chance of another tenant making the EBS disks busy. The performance of EBS in this simplified case would be higher latency than local disk, but otherwise similar. I suspect that in reality the EBS volume is spread over more disks in the backend which gives higher throughput but with higher variance.

If we drop down to a more typical m1.large configuration with 100GB of EBS each, two instances are sharing network bandwidth, the EBS service is servicing two sets of requests, and the EBS back end has many more tenants per disk, so we would expect better peak performance than the two internal disks in the m1.large but more variance.

For the case where we have many m1.small instances each accessing a 10GB EBS volume, it is clear that the peak performance is going to be far better than a share of a local disk, but the contention for network, EBS service and backend disks will be extremely variable, so performance will be very inconsistent.

How To Measure Disk and Network Performance

Someone should write a book on that (I already did, but for Solaris), however there is a useful AWS forum post that explains how to interpret Linux iostat. This blog post is too long already, so Linux iostat will have to wait for another time.

Best Practices for Cloud Storage with Cassandra

There are two basic patterns for Cassandra, one is a persistent memory cache, where we size the data to fit in memory so that all reads are fast, and writes go to disk. The m2.4xl instance type with 68GB RAM and two 850GB disks is best. The second pattern is where there is a much larger data set than memory, and m1.xlarge with 16GB RAM and four 420GB disks will have the best iops for reads, and a much lower overall cost per GB for storage. In both cases, we get all the network bandwidth for servicing clients and the inter-node replication traffic, and minimal multi-tenant variance.

12 comments:

One thing that complicates the EBS picture is the presence of delta (COW) volumes based on a snapshot. I've always assumed that to avoid the latency of taking a snapshot, EBS simply marks the existing volume as RO and writes changed blocks into a new delta. This would explain the limitation on the number of outstanding snapshots: we don't want to search an arbitrary delta list (actually a tree) to find a block, so by limiting the number of snapshots EBS can lazily collapse multiple deltas.

You describe the EC2 "The AWS EC2 architecture is built out of commodity low cost servers, they have a single 1 Gbit network, a few CPUs, a few disks and a few GBytes of RAM. Over time the models have changed, and EC2 does have a 10Gbit network option now, but for the purposes of this discussion, we will concentrate on the 1Gbit network models."

So will a 10G Ethernet help? Most newer storage devices require 10g Ethernet to work properly.

You hint "..."but the contention for network, EBS service and backend disks will be extremely variable, so performance will be very inconsistent."

You also state: "You cannot expect to get consistent measurements of a system that has a very high probability of multi-tenant interference."

It appears though that 10g Ethernet may help. I wonder whether AWS tried this (according to reddit blog, AWS worked on EBS failures for over a a year).

This info is useful, to design the new datacenters for IaaS providers, which, free from putting up with EBS can use some proven elastic storage solutions

- With reddit's issue of writes in the PostgreSQL master not making it to disk but the slaves committing the writes, would you recommend mirroring two EBS volumes in Linux?- Do you think the OS could have detected the failure of the writes?- Would you consider using Solaris or OpenSolaris on EC2 to get ZFS? Would ZFS even help with the problem that reddit has with PostgreSQL with its consistency checking?

No internal knowledge of how EBS handles snapshots, but I have dealt with Linux LVM snapshots, which may be similar. Searching for the current block in a snapshot'd volume in that case is always a constant-time operation, no tree to search. This is achieved by writing an inverse delta of changed blocks to the snapshot volume whenever the original volume is changed. The snapshot volume itself can be much smaller than the original (10% is pretty common) due to storing only the changes since the snapshot was made. The limiting factor is that when the original volume receives many writes and the snapshot's inverse block log fills up.

If, as I suspect, EBS uses Linux (hunch based on use of commodity hardware, likely x86), this would also explain why replicating to another zone is too slow; given the Linux LVM scheme for storing changes, one can easily see how thr computational expense for retrieving the snapshot increases over time.

@Geoff, there is definitely some clever integration between EBS and S3 to provide snapshot capability. That makes it even more complex and will add to latency in some cases. However I was mostly trying to explain the multi-tenancy issues...

@my-inner-voice most applications are CPU and Memory constrained, and don't push the network that hard, so 10Gbit networks aren't needed, they would just make it more expensive. I am more interested in getting some instances with solid state disk in them for high performance database work.

The AWS 10Gbit systems are very nice for HPC workloads, but I think they are currently overkill for most others. While it is possible that some EBS based workloads could be limited by the network on a single large instance (with tens of EBS volumes), I think the multi-tenancy variance is a bigger problem, and that doesn't go away with more network bandwidth.

@_phred EBS runs within one zone, and S3 is replicated across all the zones, so snapshot from EBS to S3 must put all the data across all zones, but I think there are only three copies of the data in four zones. So restore of an EBS volume from S3 in any zone may involve some cross zone traffic, which adds a few milliseconds and more opportunities for contention in the network. Have you seen a big difference when restoring from S3 to EBS in different zones?