Sunday, December 30, 2012

Amazon S3 and Glacier: A Cheap Solution for Long Term Storage Needs

In the last few years, lots of cloud-based storage services have begun providing relatively cheap solutions to many classes of storage needs. Many of them, especially consumer-oriented ones such as Dropbox, Google Drive and Microsoft SkyDrive, try to appeal to their users with free tiers and collaborative and social features. Google Drive is a clear case of this trend, having "absorbed" many of the features of the well-known Google Docs applications and seamlessly integrated them into easy-to-use applications for many platforms, both mobile and desktop-oriented.

I've been using these services for a long time now, and although I'm really happy with them, I've been looking for alternative solutions for other kinds of storage needs. As an amateur photographer, for example, I generate a lot of files every month, and my long-term backup needs currently grow by tens of gigabytes per month. If I used Google Drive to satisfy them, supposing I'm already in the terabyte range, I'd pay almost $50 per month! Competitors don't offer seriously cheaper plans either. At that price, one could argue that a decent home-based storage setup would be a better answer to the problem.

The Backup Problem

The problem is that many consumer cloud storage services are not really meant for backup: you're paying for a service that keeps your files always online. Typical backup strategies, on the other hand, store files on media that are kept offline, which usually reduces the total cost of the solution. At home, you could store your files on DVDs, or on spare hard drives, and keep your working disk space available for other tasks. We're not considering management issues here (DVDs and hard drives can fail over time, even if powered off and properly stored), but the important thing to grasp is that different storage needs can be satisfied by different storage classes, minimizing the long-term cost of keeping assets whose size is most probably only going to grow over time.

This class of problems has been addressed by Amazon, which recently rolled out a new service for low-cost, long-term storage: Amazon Glacier.

What Glacier Is Not

As soon as Glacier was announced, there was a lot of talk about it. At $0.01 per gigabyte per month, it clearly seemed an affordable solution to this kind of problem. One terabyte would cost $10 per month: one fifth the price of Google Drive and one tenth the price of Dropbox (at the time of writing).

But Glacier is a different kind of beast. For starters, Glacier requires you to keep track of a Glacier-generated archive identifier every time you upload a new file. Basically, it acts like a gigantic database where you store your files and retrieve them by key. No fancy user interface, and none of the typical file system hierarchies, such as folders, to organize your content.

Glacier's design philosophy is great for system integrators and enterprise applications using the Glacier API to meet their storage needs, but it certainly keeps the average user away from it.

Glacier Can Be Used as a New Storage Class in S3

Even though Glacier was designed and rolled out with enterprise users in mind, at the time of release the Glacier documentation already stated that it would be seamlessly integrated with S3 in the near future.

S3 is the web service that pioneered cloud storage offerings, and it's as easy to use as any other consumer-oriented cloud storage service. In fact, if you're not willing to use the S3 web interface, plenty of S3 clients exist for almost every platform. Many of them even let you mount an S3 bucket as if it were a hard disk.

In the past, the downside of S3 for backup scenarios has always been its price, which was much higher than that of its competitors: 1 terabyte costs approximately $95 per month (for standard redundancy storage).

The great news is that now that Glacier has been integrated with S3, you can have the best of both worlds:

- You can use S3 as your primary user interface to manage your storage. This means that you can keep on using your favourite S3 clients to manage the service.

- You can configure S3 to transparently move content to Glacier using lifecycle policies.

- You will pay Glacier's fees for content that's been moved to Glacier.

The integration is completely transparent and seamless: you won't need to perform any other kind of operation. Your content will be transitioned to Glacier according to your rules, and it will always remain visible in your S3 bucket.

The only important thing to keep in mind is that files hosted on Glacier are kept offline and can be downloaded only if you request a "restore" job. A restore job can take up to 5 hours to be executed, but that's certainly acceptable in a non-critical backup/restore scenario.

How To Configure S3 and Use the Glacier Storage Class

The Glacier storage class cannot be used directly when uploading files to S3. Instead, transitions to Glacier are managed by a bucket's lifecycle rules. If you select one of your S3 buckets, you can use the Lifecycle properties to configure seamless file transitions to Glacier:

S3 Bucket Lifecycle Properties

In the previous image you can see a lifecycle rule of one of my buckets, which moves content to Glacier according to the rules I defined. You can create as many rules as you need, and rules can contain both transitions and expirations. In this use case, we're interested in transitions:

S3 Lifecycle Rule - Transition to Glacier

As you can see in the previous image, the aforementioned lifecycle rule instructs S3 to migrate all content in the images/ folder to Glacier after just 1 day (the minimum amount of time you can select). All files uploaded into the images directory will automatically be transitioned to Glacier by S3.
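The same kind of rule can also be defined programmatically rather than through the web console. The following is a minimal sketch of an equivalent lifecycle configuration using boto3, the modern AWS SDK for Python (which postdates this post); the bucket name is hypothetical:

```python
# Sketch: an S3 lifecycle rule that transitions every object under the
# "images/" prefix to the Glacier storage class one day after creation
# (one day is the minimum transition delay).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-images-to-glacier",
            "Filter": {"Prefix": "images/"},   # only objects under images/
            "Status": "Enabled",
            "Transitions": [
                {"Days": 1, "StorageClass": "GLACIER"}
            ],
        }
    ]
}

# Applying it requires AWS credentials, so the call is shown commented out:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-backup-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

Whether set through the console or through the API, the resulting rule is the same: S3 stores it on the bucket and applies it to matching objects in the background.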

As previously stated, the integration is transparent, and you'll keep on seeing your content in your S3 bucket even after it's been transitioned to Glacier:

S3 Bucket Showing Glacier Content

Requesting a Restore Job

The seamless integration between the two services doesn't end here. Glacier files are kept offline, and if you try to download one you'll get an error instructing you to initiate a restore job.

You can initiate a restore job from within the S3 user interface using a new Action menu item:

S3 Actions Menu - Initiate Restore

When you initiate a restore job for part of your content (of course you can select only the files you need), you can specify the amount of time the content will be kept online, before being automatically migrated to Glacier again:

S3 Initiating a Restore Job on Glacier Content

This is great since you won't need to remember to transition content to Glacier again: you simply ask S3 to bring your content online for the specified amount of time.
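A restore can be requested through the API as well as through the console. Here is a small sketch, again using boto3 (the modern AWS SDK for Python, an assumption since it postdates this post); the bucket and key names are hypothetical:

```python
# Sketch: build the payload for an S3 restore-object request. The "Days"
# value is how long the restored copy stays online before S3 sends it
# back to Glacier automatically.
def build_restore_request(days: int) -> dict:
    """Return the RestoreRequest payload for an S3 restore_object call."""
    if days < 1:
        raise ValueError("a restored copy must stay online at least one day")
    return {"Days": days}

# With credentials configured, the actual call would be:
# import boto3
# s3 = boto3.client("s3")
# s3.restore_object(
#     Bucket="my-backup-bucket",               # hypothetical bucket
#     Key="images/2012/12/IMG_0001.CR2",       # hypothetical key
#     RestoreRequest=build_restore_request(7),  # keep online for 7 days
# )
```

Once the job completes (which, as noted above, can take hours), the object becomes downloadable again for the number of days you specified.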

Conclusions

This post quickly outlines the benefits of storing a backup copy of your important content on Amazon Glacier, taking advantage of the ease of use and affordable price of this service. Glacier integration in S3 lets any user take advantage of it without even changing their existing S3 workflow. And if you're new to S3, it's just as easy to use as any other cloud storage service out there. Maybe its applications are not as fancy as Google's, but its offer is unmatched today, and there are lots of easy-to-use S3 clients, either free or commercial (such as Cyberduck and Transmit if you're a Mac user), and even browser-based S3 clients such as plugins for Firefox and Google Chrome.

Everybody has files to back up, and many people are unfortunately unaware of the intrinsic fragility of typical home-based backup strategies, let alone the users who never perform any kind of backup. Hard disks fail; that's just a fact, and you just don't know when it's going to happen. Besides outright disk failures, other problems may appear over time, such as undetected data corruption, which can only be addressed by dedicated storage technologies (such as the ZFS file system), all of which are usually out of reach for many users, either because of their cost or the skills required to set them up and manage them.

In the last 6 years, I've been running a dedicated Solaris server for my storage needs, and I've bought at least 10 hard drives. When I projected the total cost of ownership of this solution, I realised how much money Glacier would allow me to save. And it did.

Of course I'm still keeping a local copy of everything, because I sometimes need quick access to it, but I've reduced the redundancy of my disk pools to the bare minimum, and I still sleep well at night because I know that, whatever happens, my data is safe on Amazon's premises. If a disk breaks (it happened a few days ago), I'm not worried about array reconstruction any longer, and I just use two-way mirrors instead of costlier solutions. I could even give up mirrors altogether, but I'm not willing to reconstruct the content from Glacier every time a disk fails (and that's going to happen at least once every two to three years, according to my personal statistics).

So far I've never needed to restore anything from Glacier, but I'm sure that day will eventually come. And I want to be prepared. You should want to be, too.

P.S.: Ted Forbes has cited this blog post in Episode 118 (Photo Storage with Amazon Glacier and S3) of The Art of Photography, his excellent podcast about photography. If you don't know it yet, you should check it out. Ted is an amazing guy and his podcast is awesome, with content that ranges from tips and techniques to interesting digressions on the art of photography. I've learnt a lot from him, and I bet you will, too.

Well, in fact neither S3 nor Glacier has the concept of a folder hierarchy. The folder hierarchy you usually see in S3 is an "illusion" rendered using object names (keys) that mimic paths on a file system. To make a long story short, there's no way to do that using a single API call or the S3 web interface.
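The "illusion" is easy to demonstrate with a few lines of code. The sketch below mimics, purely in memory, what the S3 list API does with its Prefix and Delimiter parameters: the keys are flat strings, and "folders" are just computed groupings (the sample keys are made up for illustration):

```python
# Sketch: how S3 fakes folders. Object keys are flat strings; a folder
# view is produced by grouping keys on a delimiter, much like the S3
# list-objects call does with Prefix and Delimiter.
def list_folder(keys, prefix="", delimiter="/"):
    """Return (subfolders, files) under `prefix`, mimicking an S3 listing."""
    folders, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue  # outside the requested "folder"
        rest = key[len(prefix):]
        if delimiter in rest:
            # everything up to the next delimiter is a "subfolder"
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(key)
    return sorted(folders), files

keys = ["images/2012/a.jpg", "images/2012/b.jpg", "images/c.jpg", "readme.txt"]
# list_folder(keys, "images/") -> (["images/2012/"], ["images/c.jpg"])
```

Nothing in the bucket actually changes between the two views; the hierarchy exists only in how the client chooses to slice the key names.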

I'll consider the suggestion, because it's pretty simple to use the S3 GET Bucket API call to list bucket contents and then initiate the required requests, navigating through the hierarchy as needed. I cannot promise any kind of deliverable, but I'll try to find some time to write a command line client for that.

I came across your blog post via Ted Forbes' podcast (The Art of Photography). Having coped with the backup dilemma myself, I wondered if you ever came across CrashPlan as an alternative to S3/Glacier?

They offer unlimited storage for a fixed fee of $8/month for 10 computers (Windows, Mac OS X, Linux and Solaris are supported). The way it works is that you have a CrashPlan Java application running on your machine, and it syncs your data to their storage cloud.

Previously I had been using S3 with duplicity on my home-built NAS machine. However, it required a lot of maintenance to keep running, and it wasn't as reliable as I would have expected. The data is stored in S3, but in chunks (large files) with indexes (also large files), which makes the backup and verification process slow.

Now I've switched to CrashPlan+ Family Unlimited, and I keep backups not only of my NAS (which runs Linux), but also of my laptops. With CrashPlan you can also back up to friends' machines, so relatives can back up to my NAS. The backup is instant, and it keeps versions of files.

Regarding Glacier, I think AWS is cool! S3 + Glacier is great because it reduces the price of cold storage. However, at $10/month for 1TB compared to $8/month for CrashPlan unlimited, I think CrashPlan has an even better offering.

Some of the disadvantages of Glacier don't apply to CrashPlan:

- Folder structures are preserved (on a per-computer basis)
- A history of (deleted) files is kept (this can be done on S3 with object versioning enabled, but is it also possible with Glacier?)
- No penalties for retrieval within x days
- No waiting for files to become retrievable: all data in CrashPlan is accessible right away (also via the web interface and mobile app)

Please take a look at CrashPlan and let me know what you think. If you consider installing it on a headless (Solaris) machine, take a look at their guide here: http://support.crashplan.com/doku.php/how_to/configure_a_headless_client

Thanks for the comment, and sorry for the delay: I'm still on holiday and haven't had much time or internet connectivity available.

I've written a followup to this post with some thoughts that may answer your question. To make a long story short, I know CrashPlan, and I think it's a great service. However, I also think it just doesn't fit my needs. I don't need such a backup solution, since I just don't keep that much data on my workstations: just the data I'm working on. For this reason, I don't even want any process synchronising my data. My workflow is simple: get some data, load it locally (and back it up on premises in case I need it in the near future). Once I've finished working on it, I just offload it to Glacier and free the workstations. Ted Forbes has suggested a simple use case of his own: the video footage of his podcast. As with the photos I'm done working on, I just don't want it sitting on my local disks. Hence, there's no point in having a client backing it up, because I already do that as soon as I get it.

No, I haven't tried Zoolz, but it looks like a cool service if it fits your needs. In fact, I think it's great that it features distinct tiers for online and cold storage.

Anyway, I've written a short followup to this post (thegreyblog.blogspot.com/2013/01/backups-using-amazon-s3-and-glacier.html) with some thoughts on this kind of service. Hopefully, it will answer your question more fully and explain why I'm not willing to rely on such services, at least for the time being. On the other hand, I'm sure they're just great in more general backup/restore use cases. In my experience, however, it's much more critical to reduce to a minimum the time required to bring a workstation back to a working state, restoring only the data you need from a reliable backup location. That's why I tailored my workflow not to rely on a "restore from backup" to be up and running again.

You could argue that both Zoolz and CrashPlan let you finely tune what's backed up from your disks (either local or not), and I certainly agree. But the point then would be: why would I pay for their services without taking full advantage of their features? At least for my use cases, I believe S3 and Glacier are simpler and cheaper (but that's just my opinion).

Thank you very much for the article, Enrico. I'm currently searching for a good all-around onsite and offsite backup solution, and your article just gave me an idea. Maybe others will be interested too. My OS is Windows 7.