It is time to expand the AWS footprint once again, with a new Region in Sydney, Australia. AWS customers in Australia can now enjoy fast, low-latency access to the suite of AWS infrastructure services.

Customers

Over 10,000 organizations in Australia and New Zealand are already making use of AWS. Here's a very small sample:

The Commonwealth Bank of Australia runs customer-facing web applications on AWS as part of a cloud strategy that has been underway for the past five years. The seamless scaling enabled by AWS has allowed their IT department to focus on innovation.

Brandscreen, a fast-growing Australian start-up, has developed a real-time advertising trading platform for the media industry. They use Elastic MapReduce to process vast amounts of data to test out machine learning algorithms. They store well over 1 PB of data in Amazon S3 and add another 10 TB every day.

MYOB uses AWS to host the MYOB Atlas, a simple website builder that enables businesses to be online within 15 minutes. They currently have more than 40,000 small and medium-sized businesses using Atlas on the AWS cloud.

Halfbrick Studios hosts the highly acclaimed Fruit Ninja game on AWS. They use DynamoDB and multiple Availability Zones to host tens of millions of regular players.

AWS Partner Network

A number of members of the AWS Partner Network have been preparing for the launch of the new Region. Here's a sampling (send me email with launch day updates):

Canonical is working to bring the official Ubuntu AMIs to our new Region. The latest supported images for Ubuntu Server 10.04 LTS, 11.10, 12.04 LTS, and 12.10 have been migrated over. Daily images have been turned on for the new Region, and the Amazon Quickstart list is populated with the proper image IDs.

Acquia provides hosted Drupal (again, see my interview with Acquia's Tom Erickson to learn more) to over 2,400 customers. They are working to ensure that their service will be available to customers in the new Region.

ESRI is the leading provider of Geographic Information Systems, with over one million users in more than 350,000 organizations. They are making their ArcGIS platform available in the new Region.

MetaCDN provides global cloud-based content delivery, video encoding and streaming services. They are working to ensure that their video encoding, persistent storage and delivery services will be available to customers in the new Region.

On the Ground

In order to serve enterprises, government agencies, academic institutions, small-to-mid size companies, startups, and developers, we now have offices in Sydney, Melbourne, and Perth. We will be adding a local technical support operation in 2013 as part of our global network of support centers, all accessible through AWS Support.

HBase is formally part of the Apache Hadoop project, and runs within Amazon Elastic MapReduce. You can launch HBase jobs (version 0.92.0) from the command line or the AWS Management Console.

HBase in Action

HBase has been optimized for low-latency lookups and range scans, with efficient updates and deletions of individual records. Here are some of the things that you can do with it:

Reference Data for Hadoop Analytics - Because HBase is integrated into Hadoop and Hive and provides rapid access to stored data, it is a great way to store reference data that will be used by one or more Hadoop jobs on a single cluster or across multiple Hadoop clusters.

Log Ingestion and Batch Analytics - HBase can handle real-time ingestion of log data with ease, thanks to its high write throughput and efficient storage of sparse data. Combine this with Hadoop's ability to handle sequential reads and scans in a highly optimized fashion, and you have a powerful tool for log analysis.

Storage for High Frequency Counters and Summary Data - HBase supports high update rates (the classic read-modify-write) along with strictly consistent reads and writes. These features make it ideal for storing counters and summary data. Complex aggregations such as max-min, sum, average, and group-by can be run as Hadoop jobs and the results can be piped back into an HBase table.

I should point out that HBase on EMR runs in a single Availability Zone and does not guarantee data durability; data stored in an HBase cluster can be lost if the master node in the cluster fails. Hence, HBase should be used for summarization or secondary data or you should make use of the backup feature described below.

You can do all of this (and a lot more) by running HBase on AWS. You'll get all sorts of benefits when you do so:

Freedom from Drudgery - You can focus on your business and on your customers. You don't have to set up, manage, or tune your HBase clusters. Elastic MapReduce will handle provisioning of EC2 instances, security settings, HBase configuration, log collection, health monitoring, and replacement of faulty instances. You can even expand the size of your HBase cluster with a single API call.

Backup and Recovery - You can schedule full and incremental backups of your HBase data to Amazon S3. You can rollback to an old backup on an existing cluster or you can restore a backup to a newly launched cluster.

Seamless AWS Integration - HBase on Elastic MapReduce was designed to work smoothly and seamlessly with other AWS services such as S3, DynamoDB, EC2, and CloudWatch.

Getting Started

You can start HBase from the command line by launching your Elastic MapReduce cluster with the --hbase flag:
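Here's a sketch of what that command might look like using the elastic-mapreduce Ruby CLI; the cluster name, instance count, and instance type are illustrative placeholders:

# launch a small, long-running cluster with HBase installed (values are examples)
./elastic-mapreduce --create --alive --hbase --name "My HBase Cluster" --num-instances 3 --instance-type m1.large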

When you create your HBase Job Flow from the console you can restore from an existing backup, and you can also schedule future backups:

Beyond the Basics

Here are a couple of advanced features and options that might be of interest to you:

You can modify your HBase configuration at launch time by using an EMR bootstrap action. For example, you can alter the maximum file size (hbase.hregion.max.filesize) or the maximum size of the memstore (hbase.regionserver.global.memstore.upperLimit).
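For example, a launch command along these lines would apply the configure-hbase bootstrap action (the action's S3 path and the -s key=value argument form follow the EMR documentation of the time; the value shown is just an illustration):

# override hbase.hregion.max.filesize at cluster launch (value is an example)
./elastic-mapreduce --create --alive --hbase --name "Tuned HBase Cluster" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hbase \
  --args -s,hbase.hregion.max.filesize=52428800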

You can monitor your cluster with the standard CloudWatch metrics that are generated for all Elastic MapReduce job flows. You can also install Ganglia at startup time by invoking a pair of predefined bootstrap actions (install-ganglia and configure-hbase-for-ganglia). We plan to add additional metrics, specific to HBase, over time.

You can run Apache Hive on the same cluster, or you can install it on a separate cluster. Hive will run queries transparently against HBase and Hive tables. We do advise you to proceed with care when running both on the same cluster; HBase is CPU and memory intensive, while most other MapReduce jobs are I/O bound, with fixed memory requirements and sporadic CPU usage.

HBase job flows are always launched with EC2 Termination Protection enabled. You will need to confirm your intent to terminate the job flow.

I hope you enjoy this powerful new feature!

-- Jeff;

PS - There is no extra charge to run HBase. You pay the usual rates for Elastic MapReduce and EC2.

AWS works hard to lower our costs so that we can pass those savings back to our customers. We look to reduce hardware costs, improve operational efficiencies, lower power consumption and innovate in many other areas of our business so we can be more efficient. The history of AWS bears this out -- in the past six years, we’ve lowered pricing 18 times, and today we’re doing it again. We’re lowering pricing for the 19th time with a significant price decrease for Amazon EC2, Amazon RDS, Amazon ElastiCache and Amazon Elastic MapReduce.

Amazon EC2 Price Drop

First, a quick refresher. You can buy EC2 instances by the hour. You have no commitment beyond an hour and can come or go as you please. That is our “On-Demand” model.

If you have predictable, steady-state workloads, you can save a significant amount of money by buying EC2 instances for a term (one year or three years). In this model, you purchase your instance for a set period of time and get a lower price. These are called “Reserved Instances,” and this model is equivalent to buying or leasing servers, as folks have done for years, except that EC2 passes the benefits of its substantial scale on to its customers in the form of low prices. When people try to compare EC2 costs to doing it themselves, the apples-to-apples comparison is to Reserved Instances (although with EC2, you don't have to staff all the people to build, grow, and manage the infrastructure, and instead get to focus your scarce resources on what really differentiates your business or mission).

Today’s Amazon EC2 price reduction varies by instance type and by Region, with Reserved Instance prices dropping by as much as 37%, and On-Demand instance prices dropping up to 10%. In 2006, the cost of running a small website with Amazon EC2 on an m1.small instance was $876 per year. Today with a High Utilization Reserved Instance, you can run that same website for less than 1/3 of the cost at just $250 per year - an effective price of less than 3 cents per hour. As you can see below, we are lowering both On-Demand and Reserved Instances prices for our Standard, High-Memory and High-CPU instance families. The chart below highlights the price decreases for Linux instances in our US-EAST Region, but we are lowering prices in nearly every Region for both Linux and Windows instances.

We have a few flavors of Reserved Instances that allow you to optimize your cost for the usage profile of your application. If you run your instances steady state, Heavy Utilization Reserved Instances are the least expensive on a per hour basis. Other variants cost a little more per hour in exchange for the flexibility of being able to turn them off and save on the usage costs when you are not using them. This can save you money if you don’t need to run your instance all of the time. For more details on which type of Reserved Instances are best for you, see the EC2 Reserved Instances page.

Save Even More on EC2 as You Get Bigger

One misperception we sometimes hear is that while EC2 is a phenomenal deal for smaller businesses, the cost benefit may diminish for large customers who achieve scale. We have lots of customers of all sizes, and those who take the time to rigorously run the numbers see significant cost advantages in using EC2 regardless of the size of their operations.

Today, we’re enabling customers to save even more as they scale -- by introducing Reserved Instance volume tiers. In order to determine what tier you qualify for, you add up all of the upfront Reserved Instance payments for any Reserved Instances that you own. If you own more than $250,000 of Reserved Instances, you qualify for a 10% discount on any additional Reserved Instances you buy (that discount applies to both the upfront and the usage prices). If you own more than $2 Million of Reserved Instances, you qualify for a 20% discount on any new Reserved Instances you buy. Once you cross $5 Million in Reserved Instance purchases, give us a call and we will see what we can do to reduce prices for you even further – we look forward to speaking with you!

Price Reductions for Amazon RDS, Amazon Elastic MapReduce and Amazon ElastiCache

These price reductions don’t just apply to EC2: Amazon Elastic MapReduce customers will also benefit from lower prices on the EC2 instances they use. In addition, we are also lowering prices for Amazon Relational Database Service (Amazon RDS). Prices for new RDS Reserved Instances will decrease by up to 42%, with On-Demand Instances for RDS and ElastiCache decreasing by up to 10%.

Here’s a quick example of how these price reductions will help customers save money. If you are a game developer using a Quadruple Extra Large RDS MySQL 1-year Heavy Utilization Reserved Instance to power a new game, the new pricing will save you over $550 per month (or 39%) for each new database instance you run. If you run an e-commerce application on AWS using an Extra Large multi-AZ RDS MySQL instance for your always-on database you will save more than $445 per month (or 37%) by using a 3-year Heavy Utilization Reserved Database Instance. If you added a two node Extra Large ElastiCache cluster for better performance, you will save an additional $80 per month (or 10%). For a full list of the new prices, go to the Amazon RDS pricing page, Amazon ElastiCache pricing page, and the Amazon EMR pricing page.

Real Customer Savings

Let’s put these cost savings into context. One of our fast-growing customers was primarily running Amazon EC2 On-Demand instances, running 360,000 hours last month using a mix of m1.xlarge, m1.large, m2.2xlarge and m2.4xlarge instances. Without this customer changing a thing, with our new EC2 pricing, their bill will drop by over $25,000 next month, or $300,000 per year – an 8.6% savings in their On-Demand spend. This customer was in the process of switching to 3-year Heavy Utilization Reserved Instances (since most of their instances run steady state) for a whopping savings of 55%. Now, with the new EC2 price drop we're announcing today, this customer will save another 37% on these Reserved Instances. Additionally, with the introduction of our new volume tiers, this customer will add another 10% discount on top of all that. In all, this price reduction, the new volume discount tiers, and the move to Reserved Instances will save the customer over $215,000 per month, or $2.6 million per year over what they are paying today, reducing their bill by 76%!

Many of our customers were already saving significant amounts of money before this price drop, simply by running on AWS. Samsung uses AWS to power its “smart hub” application – which powers the apps you can use through their TVs – and they recently shared with us that by using AWS they are saving $34 million in capital expenses over 2 years and reducing their operating expenses by 85%. According to their team, with AWS, they met reliability and performance objectives at a fraction of the cost they would have otherwise incurred.

Another customer example is foursquare Labs, Inc. They use AWS to perform analytics across more than 5 million daily check-ins. foursquare runs Amazon Elastic MapReduce clusters for their data analytics platform, using a mix of High Memory and High CPU instances. Previously, this EMR analytics cluster was running On-Demand EC2 Instances, but just recently, foursquare decided they would buy over $1 million of 1-year Heavy Utilization Reserved Instances, reducing their costs by 35% while still using some On-Demand instances to provide them with the flexibility to scale up or shed instances as needed. However, the new EC2 price drop lowers their costs even further. This price reduction will help foursquare save another 22%, and their overall EC2 Reserved Instance usage for their EMR cluster qualifies them for the additional 10% volume tier discount on top of that. This price drop combined with the move to Reserved Instances will help foursquare reduce their EC2 instance costs by over 53% from last month without sacrificing any of the scaling provided by EC2 and Elastic MapReduce.

As we continue to find ways to lower our own cost structure, we will continue to pass these savings back to our customers in the form of lower prices. Some companies work hard to lower their costs so they can pocket more margin. That’s a strategy that a lot of the traditional technology companies have employed for years, and it’s a reasonable business model. It’s just not ours. We want customers of all sizes, from start-ups to enterprises to government agencies, to be able to use AWS to lower their technology infrastructure costs and focus their scarce engineering resources on work that actually differentiates their businesses and moves their missions forward. We hope this is another helpful step in that direction.

It’s always exciting to find out that an app that has changed how I consume news and blog content on my mobile devices is using AWS to power some of their most engaging features. Such is the case with Pulse, a visual news reading app for iPhone, iPad and Android. Pulse uses Amazon Elastic MapReduce, our hosted Hadoop product, to analyze data from over 11 million users and to deliver the best news stories from a variety of different content publishers. Born out of a Stanford launchpad class and awarded for its elegant design by Apple at WWDC 2011, the Pulse app blends a strong high-tech backend with great visual appeal to conquer the eyes of mobile news readers everywhere.

Pulse backend team members from left to right: Simon, Lili, Greg, Leonard

The December 2011 update included a new feature called Smart Dock, which uses Hadoop and a tool called mrjob, developed by Yelp, to analyze users’ reading preferences and continuously recommend other articles or sources they might enjoy.

To understand the level of engineering that goes behind such rich customer features, I spoke to Greg Bayer, Backend Engineering Lead at Pulse:

How big is the “big data” that Pulse analyzes every day?

Our application relies on accurately analyzing client event logs (as opposed to web logs) to extract trends and enable other rich features for our users. To give you a sense of the scale at which we run these analyses, we literally go through millions of events per hour, which translates to as many as 250+ Amazon Elastic MapReduce nodes on any given day. Since we are dealing with event logs, generated by our users from the various platforms on which they access our app (Android, iPhone, iPad, etc.), our logs grow in proportion to our user base. For example, the recent influx of new users from Kindle Fire (Android) means we now have a lot more logs coming in from those devices. Also, since the logs are big, we’ve found that it is very efficient to write them to disk as fast as possible - directly from devices to Amazon EC2 (see my tandem article on the logging architecture we use and the graph below, which highlights some of our numbers).

Much of our backend is built on industry standard systems such as Hadoop. The innovation happens in how we leverage these systems to create value. For us, it’s all about how we can make the app more fun to use and provide rich features that our users will love. For techies, you can read about many of these features in the backend section of the Pulse engineering blog and learn about all the details.

The Right Choice for Big Data

I joined the team here pretty early on as the first backend engineer. I came to Pulse after working at Sandia National Labs, where I built and managed an in-house 70-node Hadoop cluster. That cluster required an investment of over $100,000, plus ongoing operational support and more than six months of work to get it fully tuned. Needless to say, I was fully aware of the cost and resources needed to run something at the scale that Pulse would need to accommodate.

AWS was and still is the only feasible solution for us. I love the flexibility to quickly stand up a cluster of hundreds of nodes and the added flexibility of choosing the pricing scheme that’s needed for a job. If I need a job done faster, I can always spin up a very large cluster and get results in minutes, or take advantage of smaller instances and the spot marketplace for Amazon Elastic MapReduce if I’m looking to complete a job that’s not time-sensitive. Since an Amazon Elastic MapReduce cluster can simply be turned off when we are done, the cost to run big queries is usually quite reasonable. Consider a cluster of 100 m1.large machines: a set of queries that takes 45 minutes to run on this cluster could cost us approximately $11 - $34 (depending on whether we bid on spot instances or use regular on-demand instances).

Lessons Learned (the bold formatting below is our doing :) )

It is important to consider the trade-offs and choose the right tool for the job. In our experience, AWS provides an exceptional capability to build systems as close to the metal as you like, while still avoiding the burden and inelasticity of owning your own hardware. It also provides some useful abstraction layers and services above the machine level.

By allowing virtual machines (Amazon EC2 instances) to be provisioned quickly and inexpensively, a small engineering team can stay more focused on the development of key product features. Since stopping and starting these instances is painless, it’s easy to quickly adapt to changing engineering or business needs — perhaps scaling up to support 10x more users or shutting down a feature after pivoting a business model.

AWS also provides many other useful services that help save engineering time. Many standard systems, such as load balancers or Hadoop clusters, that normally require significant time and specialized knowledge to deploy, can be deployed automatically on Amazon EC2 for almost no setup or maintenance cost.

Simple, but powerful services like Amazon S3 and the newly released Amazon DynamoDB make building complex features on AWS even easier. Because bandwidth is fast and free between all AWS services, plugging together several of these services is a great way to bootstrap a scalable infrastructure.

Today's guest blogger is Adam Gray. Adam is a Product Manager on the Elastic MapReduce Team.

-- Jeff;

We’re always excited when we can bring features to our customers that make it easier for them to derive value from their data—so it’s been a fun month for the EMR team. Here is a sampling of the things we’ve been working on.

Free CloudWatch Metrics

Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:

Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.

Further, through the CloudWatch Console, API, or SDK you can set alarms to be notified via SNS if any of these metrics go outside of specified thresholds. For example, you can receive an email notification whenever a job flow is idle for more than 30 minutes, HDFS Utilization goes above 80%, or there are five times as many remaining map tasks as there are map slots, indicating that you may want to expand your cluster size.

Please watch this video to see how to set EMR alarms through the CloudWatch Console:

Hadoop 0.20.205, Pig 0.9.1, and AMI Versioning

EMR now supports running your job flows using Hadoop 0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also introduced the concept of AMI versions. You can now provide a specific AMI version to use at job flow launch or specify that you would like to use our “latest” AMI, ensuring that you are always using our most up-to-date features. The following AMI versions are now available:

Please visit the AMI Versioning section of the Elastic MapReduce Developer Guide for more information.

S3DistCp for Efficient Copy between S3 and HDFS

We have also made available S3DistCp, an extension of the open source Apache DistCp distributed copy tool that has been optimized to work with Amazon S3. Using S3DistCp, you can efficiently copy large amounts of data between Amazon S3 and HDFS on your Amazon EMR job flow or copy files between Amazon S3 buckets. During data copy you can also optimize your files for Hadoop processing. This includes modifying compression schemes, concatenating small files, and creating partitions.

For example, you can load Amazon CloudFront logs from S3 into HDFS for processing while simultaneously modifying the compression format from Gzip (the Amazon CloudFront default) to LZO and combining all the logs for a given hour into a single file. Because Hadoop is more efficient at processing a few large, LZO-compressed files than many small, Gzip-compressed files, this can improve performance significantly.
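As a rough sketch, such a copy can be added as a step on a running job flow with the Ruby CLI; the job flow ID, bucket names, and grouping regex below are placeholders, and the jar path is where S3DistCp was typically installed on EMR AMIs at the time:

# copy CloudFront logs from S3 into HDFS, re-compressing to LZO and grouping by hour
./elastic-mapreduce --jobflow j-EXAMPLEJOBFLOW \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://my-log-bucket/cloudfront/,--dest,hdfs:///cloudfront/,--groupBy,.*\.([0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2})\..*,--targetSize,128,--outputCodec,lzo'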

cc2.8xlarge Support

Amazon Elastic MapReduce also now supports the new Amazon EC2 Cluster Compute instance, Cluster Compute Eight Extra Large (cc2.8xlarge). Like other Cluster Compute instances, cc2.8xlarge instances are optimized for high performance computing, giving customers very high CPU capabilities and the ability to launch instances within a high bandwidth, low latency, full bisection bandwidth network. cc2.8xlarge instances provide customers with more than 2.5 times the CPU performance of the first Cluster Compute instance (cc1.4xlarge), more memory, and more local storage at a very compelling cost. Please visit the Instance Types section of the Amazon Elastic MapReduce detail page for more details.

In addition, we are pleased to announce an 18% reduction in Amazon Elastic MapReduce pricing for cc1.4xlarge instances, dropping the total per hour cost to $1.57. Please visit the Amazon Elastic MapReduce Pricing Page for more details.

VPC Support

Finally, we are excited to announce support for running job flows in an Amazon Virtual Private Cloud (Amazon VPC), making it easier for customers to:

Process sensitive data - Launching a job flow on Amazon VPC is similar to launching the job flow on a private network and provides additional tools, such as routing tables and Network ACLs, for defining who has access to the network. If you are processing sensitive data in your job flow, you may find these additional access control tools useful.

Access resources on an internal network - If your data is located on a private network, it may be impractical or undesirable to regularly upload that data into AWS for import into Amazon Elastic MapReduce, either because of the volume of data or because of its sensitive nature. Now you can launch your job flow on an Amazon VPC and connect to your data center directly through a VPN connection.

You can launch Amazon Elastic MapReduce job flows into your VPC through the Ruby CLI by using the --subnet argument and specifying the subnet address (note that you will have to download the latest version of the Ruby CLI):
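Here's a sketch of such a launch; the subnet ID, instance count, and instance type are placeholders:

# launch a job flow into a VPC subnet (values are examples)
./elastic-mapreduce --create --alive --subnet subnet-1a2b3c4d --num-instances 5 --instance-type m1.large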

Today's guest blogger is Adam Gray. Adam is a Product Manager on the Elastic MapReduce Team.

-- Jeff;

Apache Hadoop and NoSQL databases are complementary technologies that together provide a powerful toolbox for managing, analyzing, and monetizing Big Data. That’s why we were so excited to provide out-of-the-box Amazon Elastic MapReduce (Amazon EMR) integration with Amazon DynamoDB, providing customers an integrated solution that eliminates the often prohibitive costs of administration, maintenance, and upfront hardware. Customers can now move vast amounts of data into and out of DynamoDB, as well as perform sophisticated analytics on that data, using EMR’s highly parallelized environment to distribute the work across the number of servers of their choice. Further, as EMR uses a SQL-based engine for Hadoop called Hive, you need only know basic SQL while we handle distributed application complexities such as estimating ideal data splits based on hash keys, pushing appropriate filters down to DynamoDB, and distributing tasks across all the instances in your EMR cluster.

In this article, I’ll demonstrate how EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.

We will also use sample product order data stored in S3 to demonstrate how you can keep current data in DynamoDB while storing older, less frequently accessed data in S3. By exporting your rarely used data to Amazon S3 you can reduce your storage costs while preserving the low latency access required for your high velocity data. Further, exported data in S3 is still directly queryable via EMR (and you can even join your exported tables with current DynamoDB tables).

The sample order data uses the schema below. This includes Order ID as its primary key, a Customer ID field, an Order Date stored as the number of seconds since epoch, and Total representing the total amount spent by the customer on that order. The data also has folder-based partitioning by both year and month, and you’ll see why in a bit.

Creating a DynamoDB Table

Let’s create a DynamoDB table for January 2012 named Orders-2012-01. We will specify Order ID as the Primary Key. By using a table for each month, it is much easier to export data and delete tables over time when they no longer require low latency access.

For this sample, a read capacity and a write capacity of 100 units should be more than sufficient. When setting these values you should keep in mind that the larger the EMR cluster the more capacity it will be able to take advantage of. Further, you will be sharing this capacity with any other applications utilizing your DynamoDB table.

At the Hadoop command prompt on the master node, type hive. You should see a Hive prompt: hive>

As no other applications will be using our DynamoDB table, let’s tell EMR to use 100% of the available read throughput (by default it will use 50%). Note that this can adversely affect the performance of other applications simultaneously using your DynamoDB table and should be set cautiously.

SET dynamodb.throughput.read.percent=1.0;

Creating Hive Tables

Outside data sources are referenced in your Hive cluster by creating an EXTERNAL TABLE. First let’s create an EXTERNAL TABLE for the exported order data in S3. Note that this simply creates a reference to the data; no data is moved yet.
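A sketch of that DDL, assuming comma-delimited export files under a placeholder bucket and the schema described earlier:

-- external table over the exported order data in S3 (bucket and path are placeholders)
CREATE EXTERNAL TABLE orders_s3_export (order_id string, customer_id string, order_date bigint, total double)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/order-exports/';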

Next, let’s create an EXTERNAL TABLE that references our DynamoDB table. This is a bit more complex. We need to specify the DynamoDB table name, the DynamoDB storage handler, the ordered fields, and a mapping between the EXTERNAL TABLE fields (which can’t include spaces) and the actual DynamoDB fields.
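A sketch of that table definition, using the DynamoDB storage handler and table properties documented for the EMR/DynamoDB integration:

-- external table backed by the Orders-2012-01 DynamoDB table
CREATE EXTERNAL TABLE orders_ddb_2012_01 (order_id string, customer_id string, order_date bigint, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders-2012-01",
  "dynamodb.column.mapping" = "order_id:Order ID,customer_id:Customer ID,order_date:Order Date,total:Total"
);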

Now we’re ready to start moving some data!

Importing Data into DynamoDB

In order to access the data in our S3 EXTERNAL TABLE, we first need to specify which partitions we want in our working set via the ADD PARTITION command. Let’s start with the data for January 2012.
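Given the year/month folder partitioning described above, that looks like this:

ALTER TABLE orders_s3_export ADD PARTITION (year='2012', month='01');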

Now if we query our S3 EXTERNAL TABLE, only this partition will be included in the results. Let’s load all of the January 2012 order data into our external DynamoDB Table. Note that this may take several minutes.
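A sketch of that load, writing from the S3-backed table into the DynamoDB-backed table defined above:

-- copy the January 2012 partition from S3 into DynamoDB
INSERT OVERWRITE TABLE orders_ddb_2012_01
SELECT order_id, customer_id, order_date, total
FROM orders_s3_export
WHERE year = '2012' AND month = '01';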

Querying Data in DynamoDB Using SQL

Now let’s find the top 5 customers by spend over the first week of January. Note the use of unix_timestamp, as order_date is stored as the number of seconds since the epoch.
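One way to write that query (the date range and LIMIT follow the example above):

-- top 5 customers by spend, January 1-7, 2012
SELECT customer_id, sum(total) AS spend
FROM orders_ddb_2012_01
WHERE order_date >= unix_timestamp('2012-01-01', 'yyyy-MM-dd')
  AND order_date < unix_timestamp('2012-01-08', 'yyyy-MM-dd')
GROUP BY customer_id
ORDER BY spend DESC
LIMIT 5;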

Querying Exported Data in S3

It looks like customer ‘c-2cC5fF1bB’ was the biggest spender for that week. Now let’s query our historical data in S3 to see what that customer spent in each of the final 6 months of 2011. First, though, we will have to include the additional data in our working set. The RECOVER PARTITIONS command makes it easy to add all of that table’s existing S3 partitions to the working set at once:

ALTER TABLE orders_s3_export RECOVER PARTITIONS;

We will now query the 2011 exported data for customer ‘c-2cC5fF1bB’ from S3. Note that the partition fields, both month and year, can be used in your Hive query.
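A sketch of that query, grouping monthly spend over the second half of 2011:

-- monthly spend for customer c-2cC5fF1bB, July through December 2011
SELECT year, month, sum(total) AS spend
FROM orders_s3_export
WHERE customer_id = 'c-2cC5fF1bB'
  AND year = '2011' AND month >= '07'
GROUP BY year, month;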

Exporting Data to S3

Now let’s export the January 2012 DynamoDB table data to a different S3 bucket owned by you (denoted by YOUR BUCKET in the command). We’ll first need to create an EXTERNAL TABLE for that S3 bucket. Note that we again partition the data by year and month.
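A sketch of both steps, with YOUR-BUCKET standing in for the bucket you own (the YOUR BUCKET placeholder mentioned above):

-- external table over the export location in your own bucket
CREATE EXTERNAL TABLE orders_s3_new_export (order_id string, customer_id string, order_date bigint, total double)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://YOUR-BUCKET/orders-export/';

-- write the January 2012 data from DynamoDB out to S3
INSERT OVERWRITE TABLE orders_s3_new_export PARTITION (year='2012', month='01')
SELECT order_id, customer_id, order_date, total
FROM orders_ddb_2012_01;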

Note that if this was the end of a month and you no longer needed low latency access to that table’s data, you could also delete the table in DynamoDB. You may also now want to terminate your job flow from the EMR console to ensure you do not continue being charged.

That’s it for now. Please visit our documentation for more examples, including how to specify the format and compression scheme for your exported files.

Although Summer is starting to ebb into Autumn in the northern hemisphere, it's just getting going south of the equator, so there is still time to profile another start-up in our on-going series of profiles!

Introducing Mendeley

Today I'm very happy to introduce you to Mendeley, a London-based startup that harnesses cloud computing to help the academic community manage existing libraries of research, discover new research and collaborate with researchers around the world. They are simultaneously building the world’s largest crowd-sourced database of research covering all disciplines from Arts to Zoology. Mendeley’s software also anonymously aggregates all usage data in the cloud and tracks what articles are being read, by whom, when and how often.

Like a lot of great ideas, the founders of Mendeley set out to solve their own problem, and came up with the concept for Mendeley while studying for higher degrees in business, psychology and machine learning. The team includes many people with backgrounds in software development, academia and publishing.

I spoke to Dan Harvey, a Data Mining Engineer at Mendeley, about how they came to use AWS:

"We started out buying our own hardware 3–4 years ago. Initially our main reasons for using AWS were due to being able to scale up far more quickly and cheaply than we could ourselves for document storage. Over time this is still true with regard to cost and scaling, but the elastic properties of EC2 mean we only have to pay for resources when we are using them. More recently we're finding that AWS gives our developers more flexibility to have the resources they need to test out new code and ideas, rather than stepping on one another's toes on shared servers"

Mendeley are using a wide collection of AWS services to power their fast-growing business, which now manages over 100 million papers.

"We wanted to produce previews of these documents for use on our article pages on the web. This was done using a combination of Elastic Beanstalk to host a Java app to render PDFs into raw images, S3 to store the data, CloudFront to serve the images to end users, and SQS to glue this all together", said Dan.

Data driven

With such a rich collection of documents and data, Mendeley also provides tailored recommendations to its users, making use of Elastic MapReduce, and Mahout. Dan Harvey continues:

"Our latest use of AWS is with the Apache Mahout project. This is distributed collaborative filtering on top of the Hadoop framework; we use it to provide tailored recommendations for our users. We have our own Hadoop cluster internally but chose EMR for this because Mahout requires a different task granularity to our existing workload; we can optimise Hadoop on EMR for the specific recommendation task. It also allows us have a simple way of calculating the daily cost of recommendations based on the on-demand EC2 instances EMR uses with each run – with a multi-use Hadoop cluster it is very hard to allocate costs between the different tasks that run on the shared infrastructure. Finally, when we're done running recommendations, we can shut the cluster down and it costs us nothing."

Introduction to AWS

Dan will join us to talk about Mendeley's use of AWS in more detail at our upcoming Introduction to AWS event in London, where newcomers to the cloud can learn about how to build scalable, elastic applications on AWS. Attendance is free, but you'll need to register.

More information

Mendeley have their own API, with which developers can build applications... for science! The Mendeley Binary Battle, an API competition judged by Amazon CTO Werner Vogels and others, runs until the end of September.

If you're a start-up running on AWS, don't forget that there is still time to enter this year's AWS Start-up Challenge, a worldwide competition with prizes at all levels including $100,000 in cash and AWS credits for the grand prize winner. Learn more, and enter today.

Over the summer months, we'd like to share a few stories from startups around the world: what they are working on and how they are using the cloud to get things done. Today, we're profiling Classle, from Chennai, India!

I recently read Mark Suster’s blog on Avoiding Monoculture - which is why I’m happy to share with you what I’ve learned about Classle, a startup from India, focused on solving education problems for areas of the world that experience serious resource constraints. Classle has the big goal of changing the world around them by encouraging students and experts to share knowledge and expertise, and using the AWS cloud to facilitate this exchange.

Classle is a Social Learning infrastructure company with a specific focus on Education, Learning and Knowledge communities. Using its main product, the Cloud Campus platform, Classle creates and manages private and public social learning environments and offers services based on it.

Classle helps rural students access higher education and reach opportunities unavailable before. Our company partners with a wide network of colleges throughout India, which act as internet-connected "learning nodes" that distribute educational materials to students. When the student goes home for the day with their downloaded lectures and other materials from the library, Classle makes use of mobile technology and SMS-based quizzes to keep students engaged and actively learning. The entire system was designed to work with simple, $10 phones, not smartphones, and the students are entirely addicted to these quizzes - they can’t get enough of them.

All these services are provided free of charge to both students and colleges. Classle monetizes by partnering with companies who are looking to hire top talent from among the students, and by selling their cloud-based learning platform for training purposes within companies.

Starting Small and Growing with Business

We have been using AWS since our inception in early 2009. Our first steps involved two small Amazon EC2 instances and Amazon EBS to store our database. Over the years, our use has expanded to match our business growth. Our selection criteria covered tactical as well as strategic points. From a tactical perspective, we wanted a quicker time for provisioning, which AWS on-demand instances enabled, and the option to secure our resource needs through Reserved Instances.

At a strategic level, we wanted to provide the best experience for our customers and it was key to build Classle on top of services, products, and infrastructure designed for growth and scale. To date, we have established relationships with over 30 educational organizations and that list is constantly growing. Thanks to AWS, we are effectively competing with some large and strong players in the e-learning space.

Starting a company is always hard, whether you’re from India or anywhere else. However, it’s worth keeping in mind that it’s never been easier to go out there and try things out - with Open Source for robust software and cloud service providers like AWS for infrastructure, you can test your ideas and run a business at very low cost.

Being in India, where we don’t have a strong start-up mentality like in the U.S., certainly poses some unique challenges. There are many more problems to solve, and it is exciting to try and translate the existing limitations into innovations, solutions and hence opportunities.

If I had to boil down my advice, I would say to my fellow entrepreneurs: venture with confidence, design for scale, start small, and architect for growth.

We've combined two popular AWS features — Spot Instances and Elastic MapReduce — to allow you to launch managed Hadoop clusters using unused EC2 capacity. You will be able to run long-running jobs, cost-driven workloads, data-critical workloads, and application testing at a discount that has historically ranged between 50% and 66%.

What

The EC2 instances used to run an Elastic MapReduce job flow fall into one of three categories or instance groups:

Master - The Master instance group contains a single EC2 instance. This instance schedules Hadoop tasks on the Core and Task nodes.

Core - The Core instance group contains one or more EC2 instances. These instances use HDFS to store the data for the job flow. They also run mapper and reducer tasks as specified in the job flow. This group can be expanded in order to accelerate a running job flow.

Task - The Task instance group contains zero or more EC2 instances and runs mapper and reducer tasks. Since they don't store any data, this group can expand or contract during the course of a job flow.

You can choose to use either On-Demand or Spot Instances for each of your job flows. If you run your Master or Core groups on Spot Instances, these instances will be terminated if the market price rises above your bid price, and the entire job flow will fail. If you run your Task group on Spot Instances, the unfinished work running on those instances will be returned to the processing queue.

If you have purchased one or more EC2 Reserved Instances, Elastic MapReduce will also take advantage of them (this is not new but I wanted to make sure that you knew about it).

When

Here are some guidelines to get you started with Elastic MapReduce on Spot Instances:

Long-running Job Flows and Data Warehouses - If you maintain a long-running Elastic MapReduce cluster with some predictable variations in load, you can handle peak demand at lower cost using Spot Instances. Run the Master and Core instance groups on On-Demand instances and supplement the cluster with Spot Instances in a Task instance group at peak times.

Cost-Driven Workloads - If your jobs are relatively short-lived (generally several hours or less), the time to completion is less important than the cost, and losing partial work is acceptable, run the entire job flow on Spot Instances for the largest potential cost savings.

Data-Critical Workloads - If the overall cost is more important than the time to completion and you don't want to lose any partial work, run the Master and Core instance groups on On-Demand instances, making sure that you run enough Core instances to hold all of your data in HDFS. Add Spot Instances as needed to reduce the overall processing time and the total cost.

Application Testing - If you want to test an entire application before moving it to production, run the entire job (Master and Core instance groups) on Spot Instances.

How

You can start to use Spot Instances for all or part of a job flow by specifying a bid price for one or more of the flow's instance groups. You can do this from the AWS Management Console, the command line, or the Elastic MapReduce APIs. To determine how that maximum price compares to past Spot Prices, the Spot Price history for the past 90 days is available via the EC2 API and the AWS Management Console. Here's a screen shot of the AWS Management Console. As you can see, all you need to do is to check "Request Spot Instances" and enter a Spot Bid Price to benefit from Spot Instances:

You can also add additional TASK instance groups to a running job flow and you can specify a bid price for the instances as you add each group. You could use this feature to create a layered set of bids if you'd like. As you probably know, each job flow is limited to 20 EC2 instances by default. If you would like to run larger job flows, you need to fill out the instance request form.
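As a sketch, here is how a Task instance group might be added to a running job flow with a bid price using the Ruby CLI; the job flow ID, instance count, instance type, and bid are placeholders:

# add a Spot-based Task group to a running job flow (values are examples)
./elastic-mapreduce --jobflow j-EXAMPLEJOBFLOW --add-instance-group task \
  --instance-type m1.small --instance-count 5 --bid-price 0.05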

Who

We expect that Elastic MapReduce users with several types of job flows will really enjoy and make good use of Spot Instances. Two areas that come to mind are:

Batch-processing workloads that are not particularly time-sensitive such as image and video processing, data processing for scientific research, financial modeling, and financial analysis.

Data warehouses that have a recurring workload variance at peak times.

Our customers have been using Elastic MapReduce to process large volumes of data quickly and economically. For example:

Fliptop (full case study) helps brands convert email lists into social media profiles. They are able to do this using Spot Instances and have realized a cost savings of over 50%.

Foursquare (full case study) performs analytics across more than 3 million daily check-ins using Elastic MapReduce, Spot Instances, Amazon S3, MongoDB, and Apache Flume. This is what Matthew Rathbone of Foursquare told us:

Elastic MapReduce had already significantly reduced the time, effort, and cost of using Hadoop to generate customer insights. Now, by expanding our clusters with Spot Instances, we have reduced our analytics costs by over 50% while decreasing processing time for urgent data-analysis, all without requiring additional application development or adding risk to our analytics.

Watch

We have put together a new video, Using EC2 Spot Instances with EMR, to show you how to run an Elastic MapReduce job using a combination of On-Demand and Spot Instances.

Finally

I am a big fan of our Spot Instances and I am really looking forward to hearing about new and interesting ways that our customers put them to use. You now have the opportunity to fine-tune your business processes to reduce your costs, and you can now make some very explicit tradeoffs between cost, time to completion, and what happens if the market price rises above your bid. If you are an IT professional, you have some shiny new tools that will allow you to reduce costs while getting work done more quickly.

Our customers use AWS in many creative and innovative ways, continuously introducing new use cases and driving us to solve unexpected and complex problems. We are constantly improving our capabilities to make sure that we support a very wide variety of use cases and access patterns.

In particular, we want to make sure that developers at any level of experience and sophistication (from a student in a dorm room to an employee of a multinational corporation) have complete control over access to their AWS resources.

AWS Identity and Access Management (IAM) lets you manage users, groups of users, and access permissions for AWS services and resources. You can also use IAM to centrally manage security credentials such as access keys, passwords, and MFA devices. Effective immediately, IAM is now a Generally Available (GA) service!

Using IAM you can create users (representing a person, an organization, or an application, as desired) within an existing AWS Account. You can also group users to apply the same set of permissions. The groups can represent functional boundaries (development vs. test), organizational boundaries (main office vs. branch office), or job function (manager, tester, developer, or system administrator). Each user can be a member of multiple groups (branch office, manager). For maximum security, newly created users have no permissions. All permission control is accomplished using policy documents containing policy statements which grant or deny access to AWS service actions or resources.

Here are some examples of the IAM command line interface in action. Let's create a user that can create and manage other users and then use this user to create a couple of additional users. Then we'll give one user the ability to access Amazon S3.

The iam-userlistbypath command lists all or some of the users in the account:

C:\> iam-userlistbypathC:\>

There are no default users. Let's create a user "jeff" using the iam-usercreate command ("/family" is a path that further qualifies the names):
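The command itself looks like this (the access key ID and secret access key that it prints are omitted here):

C:\> iam-usercreate -u jeff -p /family -k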

The -k argument causes iam-usercreate to create an AWS access key (both the access key ID and the secret access key) for each user. These keys are the credentials needed to access data controlled by the account. They can be inserted into any application or tool that currently accepts an access key ID and a secret access key. Note: It is important to capture and save the secret access key at this point; there's no way to retrieve it if you lose it (you can create a new set of credentials if necessary).

We can use iam-userlistbypath to verify that we now have one user:

C:\> iam-userlistbypath arn:aws:iam::889279108296:user/family/jeff

However, user "jeff" has no access because we have not granted him any permissions. The iam-useraddpolicy command is used to add permissions to a user. The iam-groupaddpolicy command can be used to do the same for a group. Let's add a policy that gives me (user "jeff") permission to use the IAM APIs on users under the "/app" path. I might not be the only user in my account that should have this permission so I'll start by creating a group and granting the permissions to the group and then add "jeff" to the group.

I (identifying myself as user "jeff" using the credentials that I just created) can now create and manage users under the "/app" path. Let's create users for two of my applications ("syndic8" and "backup") using "/app" as the path. I can use the same command that I used to create user "jeff":

C:\> iam-usercreate -u backup -p /app -k

AKIAI7LTROW2TTCLIFCH kgRiohPeBGyY6iDx7qzqSzCyrang6YUo67etcGat

C:\> iam-usercreate -u syndic8 -p /app -k

AKIAUIEGOSESA354WS2A iXdFDaA15VUImTo2MrmErSvTloTeK4ERNIESw78R

I can list only the application users I created by providing a path argument to iam-userlistbypath:
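For example (assuming the path prefix is passed with the -p option):

C:\> iam-userlistbypath -p /app
arn:aws:iam::889279108296:user/app/backup
arn:aws:iam::889279108296:user/app/syndic8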

So, by giving my user ("jeff") the appropriate privileges, I can minimize the use of my AWS Account credentials for access to AWS services.

You can think of the AWS Account as you would think about the Unix root (superuser) account. To get full value from IAM you should start using it when you are the only developer and you only have one application, adding users, groups, and policies as your environment becomes more complex. You can protect the AWS Account using an MFA device, and you should always sign your AWS calls using the access keys from a particular user. Once you have fully adopted IAM there should be no reason to use the AWS Account's credentials to make a call to AWS.

There are a number of other commands (fully documented in the IAM CLI Reference). Like all of the other AWS command-line tools, the IAM tools make use of the IAM APIs, all of which are documented in the IAM API Reference.

The AWS Policy Generator can be used to create policies for use with the IAM command line tools. After the policy is created it must be uploaded -- use iam-useruploadpolicy instead of iam-useraddpolicy:

C:\> iam-useruploadpolicy -u jeff -p ec2 -f \temp\ec2_iam_policy.txt

IAM controls access to each service in an appropriate way. You can control access to the actions (API functions) of any supported service. You can also control access to IAM, SimpleDB, SQS, S3, SNS, and Route 53 resources. The integration is done in a seamless fashion; all of the existing APIs continue to work as expected (subject, of course, to the permissions established by the use of IAM) and there is no need to change any of the application code. You may decide to create a unique set of credentials for each application using IAM. If you do this, you'll need to embed the new credentials in each such application.

The AWS Account retains control of all of the data. Also, all accounting still takes place at the AWS Account level, so all usage within the account will be rolled up into a single bill.

We have seen a wide variety of third-party tools and toolkits add support for IAM already. For example, the newest version of CloudBerry Explorer already supports IAM. Here's a screen shot of their Policy Editor:

The release of AWS Identity and Access Management alleviates one of the biggest concerns that security-conscious folks have had about AWS: until now, getting started meant a single key that gave complete access to, and control over, all resources. Now the control is entirely in your hands.

The features that I have described above represent our first steps toward our long-term goals for IAM. However, we have a long (and very scenic) journey ahead of us and we are looking for additional software engineers, data engineers, development managers, technical program managers, and product managers to help us get there. If you are interested in a full-time position on the Seattle-based IAM team, please send your resume to aws-platform-jobs@amazon.com.

I think you'll agree that IAM makes AWS an even better choice for any type of deployment. As always, please feel free to leave me a comment or to send us some email.