Five Things You Probably Didn’t Know About Amazon S3

Here are five things about S3 that I think are not well known in the AWS developer community, in spite of S3 being one of the most widely used AWS services:

Data Consistency

S3 provides read-after-write consistency for PUTs of new objects. To understand what that means, it helps to know that S3 achieves High Availability (HA) by replicating data across multiple servers, which may even span multiple data centers. Until you get back a 200 OK response to the PUT call, you cannot be sure that the new object was created successfully, and an immediate GET or HEAD call for the same object, or a listing of the keys within the bucket, might not show it. Once the PUT has returned 200 OK, any subsequent GET call for the new object is guaranteed to return it, because 200 OK signifies that the data is stored safely in S3.

S3 provides eventual consistency for overwrite PUTs and DELETEs of existing objects. This follows from the premise above: until the change (PUT or DELETE) has propagated to all copies of the data in S3, anyone else requesting the same object may get the previous data or the deleted object. Note that this describes S3's original consistency model; since December 2020, S3 provides strong read-after-write consistency for overwrite PUTs and DELETEs as well.
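
A client that needs to see its own overwrite under an eventually consistent model can poll until a read reflects the new version. Here is a minimal sketch of that idea. It is not an S3 API: `read_etag` is a hypothetical callable standing in for something like a HEAD request that returns the ETag currently visible for the object, or None if nothing is visible yet.

```python
import time
from typing import Callable, Optional


def wait_for_version(
    read_etag: Callable[[], Optional[str]],
    expected_etag: str,
    attempts: int = 5,
    delay_seconds: float = 0.2,
) -> bool:
    """Poll until a read returns the expected ETag or attempts run out.

    read_etag is a hypothetical stand-in for a HEAD/GET on the object; it
    returns the ETag currently visible, or None if nothing is visible.
    """
    for attempt in range(attempts):
        if read_etag() == expected_etag:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # pause before the next read
    return False


# Simulated reads: two stale responses, then the overwritten object.
responses = iter(["old-etag", "old-etag", "new-etag"])
print(wait_for_version(lambda: next(responses), "new-etag", delay_seconds=0))
```

The simulated reader returns stale data twice before the new version shows up, so the call succeeds on the third attempt. In real code the attempt count and delay would depend on how long you are willing to wait.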

S3 Select

With S3 Select, Amazon lets you query data in place on the very large datasets you might have stored in S3, without having to download, decompress, and process the entire dataset and then filter out the data you need for further analysis. With S3 Select you retrieve only the data you are interested in, which can also reduce cost substantially in some cases. There are some limitations, though: the data in S3 must be in either CSV or JSON format, and only a subset of SQL is supported. For more involved datasets one could always use Amazon Athena, but in many cases S3 Select can be used directly.
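
To make this concrete, here is a rough sketch of an S3 Select request using boto3's `select_object_content`. The bucket, key, and column names are made up for illustration, and the sketch assumes a gzip-compressed CSV object with a header row. The request builder is kept separate from the network call so it can be inspected without AWS credentials.

```python
from typing import Any, Dict


def build_select_request(bucket: str, key: str, sql: str) -> Dict[str, Any]:
    """Build keyword arguments for the S3 client's select_object_content.

    Assumes a gzip-compressed CSV object whose first row is a header.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # refer to columns by header name
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"CSV": {}},
    }


def run_select(request: Dict[str, Any]) -> str:
    """Run the query and join the streamed record chunks into one string."""
    import boto3  # imported lazily so the builder works without boto3

    response = boto3.client("s3").select_object_content(**request)
    chunks = [
        event["Records"]["Payload"]
        for event in response["Payload"]  # event stream of result pieces
        if "Records" in event
    ]
    return b"".join(chunks).decode("utf-8")


# Hypothetical bucket, key, and column names, for illustration only.
request = build_select_request(
    "example-bucket",
    "logs/requests.csv.gz",
    "SELECT s.status, s.path FROM S3Object s WHERE s.status = '500'",
)
print(request["Expression"])
```

Only the rows and columns matched by the expression come back over the wire, which is where the savings come from.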

Transfer Acceleration

Since S3 buckets share a universal namespace, it is possible that your users end up uploading tons of data to a bucket located in Sydney from different parts of the world. Some will get a good upload speed depending on the distance, while others may not. To rectify this, you can enable Transfer Acceleration on your S3 bucket. The end user then uploads to a CloudFront edge location instead, and the data is copied over to the original S3 bucket on a network-optimized path, completely transparently to the end user. The end user just needs to use a common URI (bucketname.s3-accelerate.amazonaws.com). Just make sure that the bucket name is DNS compliant and does not contain periods (.).
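
The naming rule is easy to check up front. The helper below is a small sketch of my own, not part of any SDK: it validates a bucket name against the restrictions mentioned above (DNS compliant, no periods) and returns the accelerate host name.

```python
import re

# Names usable with Transfer Acceleration: 3-63 characters, lowercase
# letters, digits, and hyphens, starting and ending with a letter or digit,
# and no periods.
_ACCELERATE_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def accelerate_endpoint(bucket: str) -> str:
    """Return the Transfer Acceleration host name for a bucket.

    Raises ValueError if the name cannot be used with acceleration.
    """
    if not _ACCELERATE_NAME.fullmatch(bucket):
        raise ValueError(
            f"bucket name {bucket!r} is not DNS compliant or contains periods"
        )
    return f"{bucket}.s3-accelerate.amazonaws.com"


# Hypothetical bucket name, for illustration only.
print(accelerate_endpoint("my-uploads"))
```

With boto3, acceleration itself is switched on with `put_bucket_accelerate_configuration` (passing `AccelerateConfiguration={"Status": "Enabled"}`), and a client can opt in to the accelerate endpoint through botocore's `Config(s3={"use_accelerate_endpoint": True})`.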

Cross Region Replication

Cross-region replication (CRR) is a bucket-level feature that enables automatic, asynchronous copying of objects across buckets in different AWS Regions. Both the source and destination buckets must have versioning enabled before you can use CRR. You can either replicate all objects from source to destination or specify a key name prefix so that only objects with that prefix are replicated (folder-level replication). You can also choose a different storage class for the replicated objects: if you are replicating to create a backup that will not be accessed frequently, it is beneficial to use the S3-IA storage class instead of the default one. The source and destination buckets can even belong to different AWS accounts. If you enable replication on a bucket that already contains data, the existing objects are not replicated to the destination bucket; only new objects are. To help customers monitor the replication status of their Amazon S3 objects more proactively, AWS offers the Cross-Region Replication Monitor (CRR Monitor) solution.
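
As a sketch of what such a setup looks like, the function below builds a replication configuration in the shape boto3's `put_bucket_replication` accepts, with a prefix filter and S3-IA as the destination storage class. The role ARN, bucket name, and rule ID are placeholders I made up, and the sketch assumes both buckets already have versioning enabled.

```python
import json
from typing import Any, Dict


def build_replication_config(
    role_arn: str,
    destination_bucket: str,
    prefix: str = "",
    storage_class: str = "STANDARD_IA",
    rule_id: str = "crr-rule",
) -> Dict[str, Any]:
    """Build a ReplicationConfiguration with a single prefix-based rule.

    An empty prefix replicates every new object in the source bucket.
    """
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Prefix": prefix,
                "Status": "Enabled",
                "Destination": {
                    "Bucket": f"arn:aws:s3:::{destination_bucket}",
                    "StorageClass": storage_class,
                },
            }
        ],
    }


# Placeholder role ARN and bucket name, for illustration only.
config = build_replication_config(
    "arn:aws:iam::123456789012:role/example-replication-role",
    "example-backup-bucket",
    prefix="reports/",
)
print(json.dumps(config, indent=2))
```

The result would be passed as `ReplicationConfiguration=config`, together with `Bucket=<source bucket>`, to `put_bucket_replication` on a boto3 S3 client.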

Lifecycle Rules

S3 provides different tiers of storage for storing data. The default one provides four 9’s of availability (99.99%) and eleven 9’s of durability (99.999999999%) and can sustain the concurrent loss of two data center facilities, making it highly durable. If you want to store infrequently accessed data that should still be readily available when needed, S3-IA (Infrequent Access) storage is a better choice, as it has a lower storage cost but a higher retrieval cost than the default tier. There is also Reduced Redundancy Storage, which provides four 9’s of availability and four 9’s of durability, making it less durable than the default tier but also cheaper. It is typically used for data that can easily be generated again. Another option is Glacier, typically used for data archival, as it needs 3-5 hours to restore data.
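
To get a feel for what those 9’s mean, here is a bit of back-of-the-envelope arithmetic. It assumes the figures can be read as simple annual averages, which is a simplification: they are design targets, not measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def nines_to_fraction(nines: int) -> float:
    """Convert a count of nines (4 means 99.99%) to a failure fraction."""
    return 10.0 ** -nines


def expected_downtime_minutes(availability_nines: int) -> float:
    """Average minutes per year a tier may be unavailable."""
    return MINUTES_PER_YEAR * nines_to_fraction(availability_nines)


def expected_objects_lost(object_count: int, durability_nines: int) -> float:
    """Average number of objects lost per year out of object_count."""
    return object_count * nines_to_fraction(durability_nines)


print(f"4 nines availability: {expected_downtime_minutes(4):.2f} minutes/year")
print(f"11 nines, 10M objects: {expected_objects_lost(10_000_000, 11):.6f} lost/year")
print(f"4 nines, 10K objects: {expected_objects_lost(10_000, 4):.2f} lost/year")
```

So four 9’s of availability works out to a little under an hour of downtime a year. With eleven 9’s of durability, storing ten million objects means losing one about every 10,000 years on average, while with four 9’s you would expect to lose about one object a year out of every 10,000 stored.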

Lifecycle rules let you manage the lifecycle of an object across storage tiers by defining transition and expiration actions. For example, you might choose to transition objects to the S3-IA storage class 30 days after you create them, or archive objects to the Glacier storage class one year after creating them. For more details on how to define these rules, refer to the AWS documentation.
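
The example above could be written as a lifecycle configuration like the one in this sketch, in the shape boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID and prefix are placeholders of mine, and the optional expiration is an extra I added for illustration.

```python
import json
from typing import Any, Dict, Optional


def build_lifecycle_config(
    prefix: str = "",
    ia_after_days: int = 30,
    glacier_after_days: int = 365,
    expire_after_days: Optional[int] = None,
    rule_id: str = "tiering-rule",
) -> Dict[str, Any]:
    """Build a LifecycleConfiguration: S3-IA first, then Glacier.

    An empty prefix applies the rule to every object in the bucket.
    """
    if ia_after_days >= glacier_after_days:
        raise ValueError("the S3-IA transition must come before Glacier")
    rule: Dict[str, Any] = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}  # delete after N days
    return {"Rules": [rule]}


# Hypothetical prefix, for illustration only.
print(json.dumps(build_lifecycle_config(prefix="logs/"), indent=2))
```

The result would be passed as `LifecycleConfiguration=config`, together with `Bucket=<bucket name>`, to `put_bucket_lifecycle_configuration` on a boto3 S3 client.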

IMO, these are some of the lesser-known features of S3. What do you think? How many did you already know?