AWS IO is expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster, but most importantly they have a lower variance (i.e. tighter 90th-99.9th percentile latency). This is critical for some workloads (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive, and they are pathetic compared to what any drive can do nowadays ($720/month for 10k P-IOPS, in addition to $0.14 per GB).

Local SSD storage

Local SSD storage is only available via the i2 instance family, which contains the most expensive instances on AWS (and across all clouds).

There is no granularity possible. CPU, memory and SSD storage amounts all DOUBLE between the few i2.xxx instance types available. They grow in multiples of a base unit of 4 CPUs + 30 GB memory + 800 GB SSD, and each multiple adds $765/month.

These limitations make local SSD storage expensive to use and awkward to manage.

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).

Handling spikes in traffic

ELBs cannot handle sudden spikes in traffic. They need to be pre-warmed manually by support beforehand.

An unplanned event means a guaranteed 5 minutes of an unreachable site serving 503 errors.

Handling limits

All resources are artificially limited by hardcoded quotas, which are very low by default. Limits can only be increased manually, one by one, by filing a support ticket.

I cannot fully express the frustration of trying to spawn two c4.large instances (we already had 15) only to fail because of “limit exhaustion: 15 c4.large in eu-central region”. Message support, wait through a day of back-and-forth emails, try again, and fail again because of “limit exhaustion: 5TB of EBS GP2 in eu-central region”.

This circus repeats every few weeks, sometimes hitting 3 limits in a row. There are limits for all resources: by region, by availability zone, by resource type and by resource-specific criteria.

Paying guarantees a 24h SLA for a reply to a limit ticket. Free-tier users might have to wait a week (maybe more), unable to work in the meantime. It is an absurd yet very real reason to buy premium support.

Handling failures on the AWS side

There are NO logs and NO indication of what’s going on in the infrastructure. Support is required whenever something goes wrong.

For example: an ELB started dropping requests erratically. After we contacted support, they acknowledged having no idea what was going on and took action: “Thank you for your request. One of the ELBs was acting weird, we stopped it and replaced it with a new one.”

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and for planning around future failures.

Note: We are barring further managed services from being introduced into our stack. They were initially tried because they were easy to set up (read: limited human time and a bit of curiosity). They soon proved to cause periodic issues while being impossible to debug and troubleshoot.

ELBs are unsuitable for many workloads

[updated paragraph after comments on HN]

ELBs are only accessible via a hostname. The underlying IPs have a TTL of 60s and can change at any moment.

This makes ELBs unsuitable for any service requiring a fixed IP and for any service that resolves the IP only once at startup.
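Clients that cache the first DNS answer forever will eventually send traffic to an IP the ELB no longer owns. A minimal sketch of the safe pattern, re-resolving once the TTL expires; the class and its parameters are our own illustration, not any AWS API:

```python
import socket
import time

def resolve_all(hostname):
    """Return the current set of IPv4 addresses for a hostname."""
    infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    return {info[4][0] for info in infos}

class TTLResolver:
    """Re-resolve a hostname after `ttl` seconds instead of pinning
    the first answer for the life of the process."""
    def __init__(self, ttl=60, resolver=resolve_all, clock=time.monotonic):
        self.ttl = ttl
        self._resolver = resolver
        self._clock = clock
        self._cache = {}  # hostname -> (expiry, set of IPs)

    def lookup(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is None or now >= entry[0]:
            ips = self._resolver(hostname)
            self._cache[hostname] = (now + self.ttl, ips)
            return ips
        return entry[1]
```

Any service that instead calls `resolve_all` once at boot and keeps the result is exactly the kind of service that breaks behind an ELB.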

ELBs are impossible to debug when they fail (and they do fail), they can’t handle sudden spikes, and the CloudWatch graphs are terrible. (Truth be told, we are paying Datadog $18/month per node to entirely replace CloudWatch.)

Load balancing is a core aspect of high-availability and scalable design. Redundant load balancing is the next one. ELBs are not up to the task.

The alternative to ELBs is to deploy our own HAProxy pairs with VRRP/keepalived. It takes multiple weeks to set up properly and deploy to production.

By comparison, we can achieve the same with Google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k to 10k requests/s instantly without losing traffic. It just works.

Note: Today, we saw one service in production go from 500 requests/s to 15,000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that.

Dedicated Instances

“Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.”

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements, or simply not having neighbours.

We have to comply with a few regulations, so we have a few dedicated options here and there. It’s +10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great detail what “dedicated” entails and doesn’t commit to anything clear. Strangely, no regulator has pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for them. The trick is that regulators and engineers don’t complain about missing something that doesn’t exist; they just live without it, and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, availability zone, instance type, tenancy, and more. In theory the reservation can be edited; in practice it depends on what you want to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try; there is no room for error. Every hour of a reservation will be billed over the year, whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as a casino game: a right reservation is -20% on the bill and a wrong one is +80%. You have to be right MORE than 4 out of 5 times to save any money.
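The casino arithmetic can be written down in a few lines. A quick sketch of the expected outcome (the -20%/+80% figures are the ones from this post, not official AWS numbers):

```python
def expected_bill_delta(p_right, save_if_right=0.20, loss_if_wrong=0.80):
    """Expected change of the bill, as a fraction of the on-demand price,
    when betting on a reservation: negative means the bet saves money."""
    return -p_right * save_if_right + (1.0 - p_right) * loss_if_wrong

def breakeven_hit_rate(save_if_right=0.20, loss_if_wrong=0.80):
    """Hit rate at which reservations neither save nor cost money."""
    return loss_if_wrong / (save_if_right + loss_if_wrong)
```

With these numbers the break-even hit rate is 0.8: be right 4 times out of 5 and you merely break even; anything less and reservations cost you money overall.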

Keep in mind that reserved instances will NOT benefit from the regular price drops happening every 6-12 months. If a price drop comes early on, you’re automatically losing money.

Critical Safety Notice: a 3-year reservation is the most dramatic way to lose money on AWS. We’re talking a potential 5-digit loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning.

What GCE does by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instance hours are counted at the end of every month and the discount is applied automatically (e.g. 30% for instances running 24/7). The algorithm also accounts for instances that were started, stopped and renewed, in a way that is STRONGLY in your favour.
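For the curious, GCE’s sustained-use discount at the time worked by billing each successive quarter of the month at a lower rate (100%, 80%, 60%, 40% of the base price). A sketch of that published schedule; check the current GCP pricing docs, as the tiers may have changed since:

```python
def sustained_use_multiplier(fraction_of_month):
    """Effective price multiplier under GCE's sustained-use tiers:
    each successive quarter of the month is billed at 100%, 80%,
    60% and 40% of the base rate."""
    tiers = [1.0, 0.8, 0.6, 0.4]
    remaining = min(max(fraction_of_month, 0.0), 1.0)
    billed = 0.0
    for rate in tiers:
        used = min(remaining, 0.25)   # at most a quarter of the month per tier
        billed += used * rate
        remaining -= used
    return billed / fraction_of_month if fraction_of_month else 0.0
```

A full month comes out at a 0.70 multiplier, i.e. the 30% discount mentioned above, with no reservation, no planning and no risk.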

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

The 1-2 core instances peak around 100-200 Mbps. This is very little in an ever more connected world where so many things rely on the network.

Typical things slowed down by the rate-limited networking:

Instance provisioning, OS install and upgrade

Docker/Vagrant image deployment

sync/sftp/ftp file copying

Backups and snapshots

Load balancers and gateways

General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to copy from the production host to another site. Half the time is spent saturating the network bandwidth (130 Mbps cap); the other half saturates the EBS volume on the receiving host (the file is buffered in memory during the initial transfer, then it’s 100% iowait against the EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.
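The arithmetic is easy to sanity-check. A small helper (the ~0.75 GiB backup size used below is inferred from the 97-second figure, not something we measured directly):

```python
def transfer_seconds(size_gib, link_mbps):
    """Seconds needed to push `size_gib` GiB through a `link_mbps`
    megabit-per-second cap, ignoring protocol overhead."""
    bits = size_gib * 1024**3 * 8          # GiB -> bits
    return bits / (link_mbps * 1_000_000)  # Mbit/s -> bit/s
```

At the 130 Mbps cap, a ~0.75 GiB file alone spends about 50 seconds on the wire, which matches the network half of the 97 seconds; on a gigabit link the same copy would take under 7 seconds.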

Cost Comparison

This post wouldn’t be complete without an instance-to-instance price comparison.

Capacity planning and day to day operations

Every time we have to add an instance, we have to re-read the instances page, the pricing page and the EBS page. There are way too many choices, some of which are hard to change later. Printed out, it would cover a 4x7-foot table. By comparison, it takes only one double-sided page to pick an appropriate instance from Google.

Optimizing usage is doomed to fail

The time spent optimizing reserved instances costs about as much as the savings it delivers.

Between CPU count, memory size, EBS volume size, IOPS and P-IOPS, everything is over-provisioned on AWS. Partly because there are too many dimensions for a human being to track and optimize, partly as a workaround for the inconsistent capabilities, and partly because some things are hard to fix once instances are live in production.

All these issues trace back to the underlying AWS platform itself, which cannot scale cleanly: neither in hardware options, nor in hardware capabilities, nor cost-wise.

Every time we think about changing something to reduce costs, it usually turns out to be more expensive than NOT doing anything (once engineering time is accounted for).

Conclusion

AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without having to do any thinking or optimization. Last but not least it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought. Measured by revenue per employee, we would rank among the top 10 companies. We are stuck with AWS for the near future, and the issues will have to be worked around with lots of money. The company can cover the expenses, and cost optimisation ain’t a top priority at the moment.

There’s a saying about “throwing money at a problem”. We shall say “throwing houses at the problem” from now on, as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon” 😀

This article is amateurish, really inaccurate and a bit immature.
AWS has its cons, but it also has a lot of advantages over other cloud platforms. The article does not mention those and many other important facts. Very, very biased.

If you are only saving 20% on reservations you are doing something very very wrong. (like doing a no upfront, 1 year term). We save 62+% on our reservations. We have standardized on instance classes (C4 and M4 for 95% of our workload) and make extensive use of spot instances where we can (which saves us upwards of 90% in many cases). We use orchestration solutions so we don’t need to manage individual hosts and simply deploy a fleet of m4.4xl instances. Your argument that a 3 year reservation is a terrible idea is incredibly myopic. Annual price reductions tend to be around 5% which means you are sacrificing 62% every year and hoping to make up for it with a 5% reduction … does that make any sense?

In addition, AWS now offers convertible and regional reservations if you are worried about switching instance types or don’t want to manage reservations on a per-AZ basis.

Furthermore, your argument that the lack of dedicated hardware with GCE means no one cares would be laughed out of the room at the companies I work for. “They don’t offer it” means they don’t get used, not “we don’t have to worry about it”.

There are plenty of other problems with GCE too, e.g. the Google VPN BGP solution is kludgy and GCE doesn’t offer nearly as many services as AWS does.

Now, having said all that, they both have their pros and cons, and companies need to go into the selection process with their eyes open. You need to evaluate your workload against each cloud’s offerings and choose the one that makes the most sense for you.

Bullshit! There’s no reservation that saves 62% on any C4 or M4, even if you reserve 3 years all upfront.

If you have a perfect orchestration solution and you can move things around however you like, good for you. That doesn’t change the fact that reserving is a huge risk for the majority of people out there, especially 3 years ahead.

I won’t comment on convertible reservations; I haven’t seen them yet. They’re just one more obstacle to understanding reservations and a higher bar to entry. AWS should lower the complexity of its offering, not increase it.

ALL c4 instances get up to MORE than 60% discount when reserved. You talk about “loosing” money (it’s “losing”, by the way, only one “o”), but come on, why are you so mad at them, actually?

Also, I did hit some limits too, and I am a “free tier” client — as in, I don’t pay for support: I always got all my tickets answered in a matter of HOURS at most. Not a week or more, as you said. But then again, you clearly don’t know, since you have never been a non-premium user. Why do you need to spew your venom at them, without knowing? What horrible thing did they do?

I’d rather you let us know right away that they actually mass-murder people for their “virgin blood”-fueled data centres, at least you wouldn’t sound mad, albeit crazy wouldn’t be out of the question.

Now then, maybe there are cheaper options out there, I’m not opposed to that. I actually want to know about that, but you’re just not credible.
Also, would you mind mentioning things other than the straight-up pricing of the instances? The free SSL certificates and stuff?

And ELBs seem to fail for you (a quick Google search led me to believe it only happens to you), but… is it any different with any other provider? Oh that’s right, you don’t know, ’cause you’re still using AWS and haven’t used anything else.

Too bad we won’t get any educated comments on this website apparently.

I can’t reply to comment 133 because of a CSS rendering issue (it blocks the reply button for comments that are more than 2 deep!), so I’m replying here.

As a heads up, fully-upfront standard 3 year reservations do actually give you ~60% discount off of the on-demand price. Specifically, 61% for C4 and 63% for M4 instances. When you get to the point where you both have a predictable base load AND have a few hundred to few thousand instances that make up your base load, reservations make a TON of sense.

I agree that the GCP pricing model is less confusing for people. As a product, GCP has gotten considerably better and more competitive over the past few years (BigQuery is the shit) and I would use GCP for anything new. A few years back, that was DEFINITELY not the case.

As far as the future goes, I really think that serverless (a la GCF/Lambda) is going to be huge.

Hi, there are plenty more instance types on AWS offering local SSD storage. We run a bunch of workloads on them too. This is a good site for checking those out: http://www.ec2instances.info/ . There are the M3, C3, C1, … families, which offer local SSD storage. So the granularity is not that bad, although you make some good points about GCE vs AWS.

It’s rather uncommon to hit the S3 limits, and they are not well known.
From what we remember, the way to hit the hard limit is to create a fresh new bucket and start pushing data into it at a fast, sustained pace. Bucket storage is meant to scale up slowly over time and cannot do so under these circumstances, thus becoming unavailable.

There was a discussion on HN where some experienced AWS users explained a few edge cases that hit S3 limits. I can’t seem to find it again.
It’s somewhat hard to trigger though; most users should be fine with S3 (not so much with ELB).

S3, under the hood, is just really efficient sharding: it uses the beginning of the filename, or “key”, of the object to decide which shard to place the object in.
For example, if you have 2 files in S3 named “test/some/file.png” and “test/my/file.png”, they share a prefix, so they hit the same shard and you’ll be throttled at roughly 300 rps (I can’t remember the exact figure). If you were to name them by their md5 hashes instead of using S3 as a folder structure, they’d likely start with different characters, land on different shards, and double the achievable rps.
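A sketch of that renaming trick in a few lines; the 4-character prefix length is an arbitrary choice for illustration, and S3’s real partitioning internals are not public:

```python
import hashlib

def hashed_key(key, prefix_len=4):
    """Prepend a hash-derived prefix so that keys sharing a path prefix
    like 'test/' no longer sort (and shard) together."""
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}/{key}"
```

With path-style keys, “test/some/file.png” and “test/my/file.png” both start with “test/”; their hashed forms start with effectively random characters, spreading them across partitions.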

Hey I’m not an AWS fanboy or anything but we do use it heavily. You are either misinformed or purposefully writing a hit piece. This article is 25% totally wrong, 50% somewhat true but exaggerated, and 25% true in my opinion.

And your comment is 100% useless Antonio if you have no real details to offer. This article is full of facts that show exactly how the pricing and performance differ and can be confirmed by several other reports.

It’s 80% because of simple maths, 20% because of hardware limitations and AWS complexity.

You need time to understand the documentation + time to analyse what to optimize (e.g. reservations) + potential downtime (most things can’t be changed on the fly) + extra time for specifics (e.g. I’ve seen volume snapshots taking 10 hours).

It’s basic maths. The usual optimization may take some days to perform but it’s only gonna save $100… that’s not worth it. Not doing it.

Bonus: We’re handling trading systems. Generally speaking, for any company that’s profitable and doing well, it’s risky to change anything in production that is currently working well. If you factor-in the risks, cost savings is [almost] never a good justification for changes.

Thanks, I completely agree with your article. Same story here: AWS IOPS, network, logging, ELB and instance consistency all suck. We invested heavily in AWS three years ago and evaluated GCE too late to convince our management to migrate. I wouldn’t complain about AWS support too much; it works for us in most cases, though GCE support felt immature when we tried it.

Thanks for the response. I’ve personally been using Azure for nearly a year and they are constantly making updates, which is great: plugging holes and creating some more :-). You are right on the portal point; it can be quite difficult to work out how things are done and what the relevant options actually do and their impact. It does take a while to find the right route/option if you are starting out.

What would be nice to see is an actual comparison between the 3 main players in terms of cost, performance, etc. We mainly use it for Web Apps at the moment but are looking at other parts as well. The main barrier to full adoption at work is that cost is always mentioned (cloud is seen as very expensive compared to in-house), and then security concerns. It would be nice to see which ones come out on top.

Agree. Azure (and Google) are releasing and improving very quickly. It’s challenging to come up with comparisons that won’t be outdated in a couple of months (let alone years).

Between us, I have more cost and performance benchmarks coming… but I just got poached by a major financial institution over Christmas. That will delay my plans until I can get a $1M budget approved for cloud testing.

What annoys me the most are the hidden costs of Amazon; GCP is more transparent in my opinion. But AWS has more services, and that can be a strong point.
Everything depends on the use case of the application you are going to develop; there are more details here: https://goo.gl/nfmknl

Low Utilization Amazon EC2 Instances (an AWS Trusted Advisor check):
“Checks the Amazon Elastic Compute Cloud (Amazon EC2) instances that were running at any time during the last 14 days and alerts you if the daily CPU utilization was 10% or less and network I/O was 5 MB or less on 4 or more days. Running instances generate hourly usage charges. Although some scenarios can result in low utilization by design, you can often lower your costs by managing the number and size of your instances.”

There are many more cost optimizations it will look at – not just the above.

“Note: Amazon doesn’t explain in great details what dedicated entails and doesn’t commit to anything clear.”

A dedicated instance as implemented means that you have no neighbors on that hardware. Of course, you can still be your own neighbor. This is usually fine, if you asked for 4 m4.mediums, you’re getting 4 m4.mediums, just as with shared instances.

Shared instances are distributed over many different physical machines. EC2’s placement algorithm actually tries pretty hard to spread you out over different machines or even racks.

It won’t do that with dedicated instances. If you thought you were doing a great job making your application fault tolerant by splitting it over 32 dedicated t2.micros, they’re all on the same hardware, the box goes down and they all die with it.

Your instances are still on separate physical hardware if they’re in separate AZs, though. And many instance types can’t share hardware.

My company runs mainly batch oriented, compute intensive numerical computation using compute clusters (“HPC”). The main workload is bursty I/O intensive reads of input, heavy number crunching followed by writes of output and massive amounts of debug data. We also have a slew of “ordinary” web services (request-response pattern), but most of them are not mission critical.

We’ve been running AWS and GCE for a few years, and GCE fits us much better, because.. (ranked in order of importance to us):

1. Able to specify capacity arbitrarily.
No need to be forced into playing an endless game of bin-packing using permutations of shapes (instance sizes) of which not a single one fits our respective workloads. At GCE we can, as should be the norm, just specify the *compute resources* we want for the VMs. For example; the other week we discovered one of our workloads was memory constrained. A click of a button revised the memory spec for the machines from 32 to 44 GiB, which was exactly the fix our workload needed. Same goes for # VCPUs. At AWS we always have to consider trade-offs such as wasting money on CPUs when all we want is more RAM or vice versa. Not so at GCE. Also, at AWS we need to constantly consider IOPS. On GCE we hardly have to think about it. A MASSIVE simplification for the entire organization.

Any cloud vendor which, in 2018, doesn’t allow customers to specify the RAM, CPU and disk capacity of their nodes arbitrarily should be ashamed, in our opinion.

2. Pricing model is *much* easier at GCE.
We *just use resources*, and discounts are *automatically* applied. The only pricing related factor we actively need to consider is whether or not to use pre-emptive instances (GCE’s variant of AWS “spot instances”). Exporting the on-going billing for all our environments to BQ means we have an up-to-date view of exactly what is costing us money and allows us to take *immediate action* if we see some costs spiking (e.g. some twat forgot to spin down a test cluster). All automated after clicking *one button*.

3. Better “spot instance” availability per zone.
As discussed in the article, we’ve so far _never_ run into a situation at GCE where a zone didn’t have the number of spot (“pre-emptive”) instances we needed, at the time we provisioned. Granted, we don’t use that many machines; a hundred or so (a few thousand cores) per zone. It seems to us that Google over-provisions their centers to a larger extent (has more slack / margin added to their capacity planning) than AWS. This leads us to rely more and more on spot instances, since on GCE the discount is massive; around 80% ! and availability seems abundant. This in turn has made running on GCE even cheaper than running our own DCs, and with the peak scaling limit inherent when running your own DCs removed to boot, due to “cloud”. Our AWS infra is way overpriced compared to the capacity and throughput we get when we make our monthly throughput / cost summaries.

4. I/O (networking and disks).
Just blows AWS out of the water. We’re not very I/O sensitive at large, as we don’t have critical OLTP workloads. But being able to dump N GiB on disks, in a few seconds instead of minutes, sure is a nice improvement compared to AWS. If I was at a more traditional “web” company, then this attribute alone would likely be a top priority. As for networking, we have alerts going off every day about some service not being able to reach some other service due to “networking” (DNS ..) at AWS. We’ve had zero network related alerts at GCE. Data transfer is also way snappier from / to GCE instances and between them, compared to what we got at AWS. So this category is a massive win for GCE in our own experience, and for our way of using compute resources.

5. Use cases vs technical building blocks.
Google seems to invest more in solving classes of problems from the perspective of “What is the customer trying to achieve”, and “What are the typical problems around running IT”. AWS to us always feels like a disparate set of technical building blocks, without an over-arching vision to guide those blocks, and investment to tie them all together into a *simplified* (convenient) management experience.

To avoid sounding like a fanboi, here are some GCP criticisms (rants).
One part of GCP that sucks is the logging service. The vendor bought Stackdriver and that service is just horrible. We’ve had endless problems with it, from massive data loss regarding log ingestion, to the most serious one which G doesn’t seem to want to fix; God awful usability. Takes forever to search through logs. Zero configurability regarding what from the log records to present via the service’s main UI (web). No way to link records to get a contextual view for trouble shooting etc. etc. Building our own log viewer couldn’t fix the slow search, but some of the other issues. If they just acquire Loggly tomorrow and start performing better indexing, then they’d make this blemish of a service usable. {/end rant #1}

IP-tables-like network segmentation. Here we think AWS wins (or won?) hands down. Security groups (“SE-linux labels for distributed computing”) is something we missed at GCE. But with label matching being core to google’s way of thinking, we see rapid progress in the right direction. Still, we missed the security group concept when we migrated.

Support at GCP has gone down the drain. When we started using GCE, support was in Mountain View. That meant support reps were at least somewhat skilled in the products they supported. And not uncommonly you could get a product team member involved, so ticket resolution became “one description, one fix” (one iteration), since you communicated with a person at Google with the requisite knowledge to *understand* your question and the skills to remedy the issue. Since last year Google has started either outsourcing or offshoring GCP support, which means you now have to wade through an army of clueless newbies, tasked, it seems, with preventing you from solving production problems effectively. You can no longer address issues with a vendor person of at least the same, or preferably higher, skill in Google tech than what you have yourself. This means that time to resolution for us has exploded regarding GCP issues over the last year. In Google’s “business defense”, you get the same incompetence from most large vendors (IBM, Oracle …), which sucks the life out of you and makes you start hating the vendor, since every…single…ticket is a war of attrition.

I understand the business reason for having cheap and shoddy 1st lines; most customers ask stupid and / or lazy questions, of the kind that RTFM should have addressed. However, google priding itself on being a “data-driven company”, should fairly quickly be able to learn which customer does its homework before contacting support, and which is sloppy (deserving of sh*tty 1st line). They should be able to make such distinctions even within customer orgs as well, since they have full ticket history to analyze (all the required data available for prioritizing access to competent support). {/end rant #2}

As for AWS, I personally have not dealt with their support, so can’t compare it to google’s. At present we’re only having old stable bits and bobs there, things on life support until we have time to migrate that last bit off of it to GCE.

All in all, GCE has been a big win for us. Our workload runs faster, management is easier, we’ve had fewer operational issues, and our finance folk love the massively reduced spending.