Is There a Landmine Hidden in Amazon's Glacier?

Photo: Lapis Ruber/Flickr

On Tuesday, Amazon unveiled a new online storage service known as Glacier. It's called Glacier because it deals in "cold storage" – i.e., the long-term storage of things like medical records or financial documents that you may need to archive for regulatory purposes.

This storage is "cold" because you don't access it very often – or very quickly. It's the stuff you might normally put on tape in a vault somewhere.

Amazon has long offered other storage services as part of an ever-growing "cloud" empire – the company's S3 storage service was actually the first of the Amazon Web Services – but Glacier aims to tackle a very different problem. It's much cheaper than other storage services, but it's also much slower.

How much cheaper? That's a very good question.

Storage costs 1 cent per gigabyte per month, which works out to $10.24 per terabyte per month, and uploading data is free. "You can store any amount of data with high durability at a cost that will allow you to get rid of your tape libraries and robots and all the operational complexity and overhead that have been part and parcel of data archiving for decades," Amazon says in a blog post announcing the new service.

Amazon was not immediately available for comment, but in all likelihood, Glacier is based on older hardware that the company wants to wring some extra dollars from before retiring.

That said, Glacier's pricing model has some people worrying. The cost of retrieving data is quite different from the cost of storing it.

Because the service is designed for long-term archiving, not active use, it's understandable that retrieval fees are high relative to storage fees: the gap discourages people from using Glacier as general-purpose storage. It also takes three to five hours to prepare an archive for downloading, which should further deter misuse of the service. Presumably, Amazon powers off the hardware until it's needed.

But the retrieval fees are confusing. According to Amazon's pricing chart, you can request up to 5 percent of the data stored in Glacier for free each month, but it's prorated by the day. The FAQ explains: "If on a given day you have 12 terabytes of data stored in Glacier, you can retrieve up to 20.5 gigabytes of data for free that day (12 terabytes x 5% / 30 days = 20.5 gigabytes, assuming it is a 30 day month)." Elsewhere in the FAQ it explains that this is about 0.17 percent a day ("5% / 30 days = 0.17% per day").
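To make the proration concrete, here is a minimal Python sketch of the allowance arithmetic the FAQ describes. It assumes a terabyte means 1,024 gigabytes, which is what reproduces Amazon's 20.5 gigabyte figure; the function name is ours, not Amazon's.

```python
GB_PER_TB = 1024  # assumption: binary terabytes, matching the FAQ's 20.5 GB result


def daily_free_allowance_gb(stored_gb, days_in_month=30, monthly_free_fraction=0.05):
    """Free retrieval per day: 5% of stored data, spread evenly over the month."""
    return stored_gb * monthly_free_fraction / days_in_month


allowance = daily_free_allowance_gb(12 * GB_PER_TB)
print(f"{allowance:.2f} GB free per day")            # 20.48, which the FAQ rounds to 20.5
print(f"{0.05 / 30 * 100:.2f}% of stored data per day")  # 0.17
```

The allowance scales with how much you store, so a small archive earns only a tiny free daily quota.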

It gets more convoluted if you go over that limit. "You are charged a retrieval fee when your retrievals exceed your daily allowance," says Amazon's FAQ. "If, during a given month, you do exceed your daily allowance, we calculate your fee based upon the peak hourly usage from the days in which you exceeded your allowance." And it gets worse from there.

Take a deep breath. Here is the worked example from Amazon's FAQ:

As we saw above, if you store 12 terabytes of data in Amazon Glacier, you can retrieve up to 20.5 gigabytes for free each day. If you exceed 20.5 gigabytes during a given day (or days) over the course of the month, we determine the hour during those days in which you retrieved the most amount of data for the month. In this example, let’s say your peak hourly retrieval rate is 1 gigabyte per hour, and the amount you retrieved that day is 24 gigabytes.

Peak hourly retrieval for the month = 1 gigabyte per hour

Next we subtract your free allowance from the peak hourly retrieval for the month. To determine the amount of data you get for free, we look at the amount of data retrieved during your peak day and calculate the percentage of data that was retrieved during your peak hour. We then multiply that percentage by your free daily allowance. In this example, you retrieved 24 gigabytes during the day and 1 gigabyte at the peak hour, which is 1/24 or ~4% of your data during your peak hour. We multiply 4% by your daily free allowance, which is 20.5 gigabytes each day. This equals 0.82 gigabytes. We then subtract your free allowance from your peak usage to determine your billable peak.

The amount you pay is your billable peak, multiplied by the number of hours in the month, multiplied by the retrieval fee. If we assume the data is stored in US East Region and that this is a 30 day month your retrieval fee for the month is:

Retrieval fee = 0.18 x 720 x $0.01 = $1.30
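The FAQ's arithmetic can be replayed in a few lines of Python. This is only a sketch of the formula as quoted: the function and parameter names are illustrative, and the `share_decimals` option exists solely to mimic the FAQ's rounding of 1/24 to 4 percent. The quoted text skips one step, which the code makes explicit: the billable peak is the 1 gigabyte peak minus the 0.82 gigabyte free portion, or 0.18 gigabytes per hour.

```python
def monthly_retrieval_fee(peak_gb_per_hour, peak_day_total_gb, daily_allowance_gb,
                          hours_in_month=720, fee_per_gb=0.01, share_decimals=None):
    """Retrieval fee per the quoted FAQ formula (illustrative, not Amazon's code)."""
    # Share of the peak day's retrievals that happened in the peak hour.
    share = peak_gb_per_hour / peak_day_total_gb
    if share_decimals is not None:
        share = round(share, share_decimals)  # mimic the FAQ's "~4%"
    # The same share of the daily free allowance is credited against the peak hour.
    free_gb_per_hour = share * daily_allowance_gb
    billable_peak = max(0.0, peak_gb_per_hour - free_gb_per_hour)
    return billable_peak * hours_in_month * fee_per_gb


faq_fee = monthly_retrieval_fee(1, 24, 20.5, share_decimals=2)
exact_fee = monthly_retrieval_fee(1, 24, 20.5)
print(f"${faq_fee:.2f}")    # $1.30, matching the FAQ
print(f"${exact_fee:.2f}")  # $1.05 when 1/24 is not rounded
```

With the FAQ's rounding the fee comes to $1.30; carrying 1/24 at full precision gives about $1.05, which suggests the published figure depends on that rounding.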

That still sounds pretty cheap, but as a commenter on Hacker News points out, the way the peak hourly retrieval is calculated is a mystery. If the price is based on how long it takes you to download the archives, then the cost is limited by download speeds. But if the cost is based on how much you request in an hour and you request a large file that can't be broken into chunks, the costs could skyrocket.

For example, a 3 terabyte archive that can't be split into smaller chunks could lead to a retrieval fee as high as $22,082 if the peak usage is determined to be 3 terabytes per hour. That retrieval fee is separate from the cost of the bandwidth used to download the data, which has its own pricing table.
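The $22,082 figure can be reproduced with the FAQ's formula under two assumptions that are not spelled out here: the 3 terabyte archive is the only data stored (so the free daily allowance is 5 percent of 3,072 gigabytes, spread over 30 days), and the whole retrieval is billed as a single peak hour. A rough Python sketch, with hypothetical names:

```python
GB_PER_TB = 1024  # assumption: binary terabytes, as in the FAQ's own example


def worst_case_fee(archive_gb, days_in_month=30, fee_per_gb=0.01):
    """Fee if the entire archive counts as one hour's retrieval (assumed worst case)."""
    # Assumption: the archive is the only data stored, so it alone sets the allowance.
    daily_allowance_gb = archive_gb * 0.05 / days_in_month
    # The peak hour holds 100% of the day's retrievals, so the full allowance applies.
    billable_peak = archive_gb - daily_allowance_gb
    hours_in_month = 24 * days_in_month
    return billable_peak * hours_in_month * fee_per_gb


fee = worst_case_fee(3 * GB_PER_TB)
print(f"${fee:,.2f}")  # $22,081.54, i.e. roughly $22,082
```

The free allowance barely dents the bill in this scenario: about 5 gigabytes against a 3,072 gigabyte peak.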

Update: An Amazon spokesperson says "For a single request the billable peak rate is the size of the archive, divided by four hours, minus the pro-rated 5% free tier."
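Applied to the same 3 terabyte archive, the spokesperson's rule gives a much smaller bill. The sketch below is one reading of that statement, not Amazon's published method: it assumes the archive is the only data stored, a 30-day month at $0.01 per gigabyte, and that the "pro-rated 5% free tier" means the daily allowance split evenly across the four retrieval hours. All names are hypothetical.

```python
GB_PER_TB = 1024  # assumption: binary terabytes, as in the FAQ's own example


def single_request_fee(archive_gb, retrieval_hours=4, days_in_month=30, fee_per_gb=0.01):
    """Fee for one request under the four-hour rule (our interpretation)."""
    peak_gb_per_hour = archive_gb / retrieval_hours
    # Assumption: only this archive is stored, so it alone sets the free allowance.
    daily_allowance_gb = archive_gb * 0.05 / days_in_month
    # Assumption: the allowance is pro-rated evenly over the retrieval hours.
    free_gb_per_hour = daily_allowance_gb / retrieval_hours
    billable_peak = max(0.0, peak_gb_per_hour - free_gb_per_hour)
    return billable_peak * 24 * days_in_month * fee_per_gb


fee = single_request_fee(3 * GB_PER_TB)
print(f"${fee:,.2f}")  # $5,520.38
```

Under these assumptions the bill is about a quarter of the $22,082 worst case, which is still a steep price for pulling a single archive.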

It doesn't seem likely that Amazon intends retrievals to produce bills of that size, but the ambiguity has made some developers nervous. "If you wrote an automated script to safely pull a full archive, a simple coding mistake, pulling all data at once, would lead you to be charged up to 720 times what you should be charged," one Hacker News commenter wrote.