Two of the most frequent feature requests for Amazon DynamoDB involve backup/restore and cross-Region data transfer.

Today we are addressing both of these requests with the introduction of a pair of scalable tools (export and import) that you can use to move data between a DynamoDB table and an Amazon S3 bucket. The export and import tools use the AWS Data Pipeline to schedule and supervise the data transfer process. The actual data transfer is run on an Elastic MapReduce cluster that is launched, supervised, and terminated as part of the import or export operation.

In other words, you simply set up the export (either one-shot or every day, at a time that you choose) or import (one-shot) operation, and the combination of AWS Data Pipeline and Elastic MapReduce will take care of the rest. You can even supply an email address that will be used to notify you of the status of each operation.

Because the source bucket (for imports) and the destination bucket (for exports) can be in any AWS Region, you can use this feature for data migration and for disaster recovery.

Export and Import TourLet's take a quick tour of the export and import features, both of which can be accessed from the DynamoDB tab of the AWS Management Console. Start by clicking on the Export/Import button:

At this point you have two options: You can select multiple tables and click Export from DynamoDB, or you can select one table and click Import into DynamoDB.

If you click Export from DynamoDB, you can specify the desired S3 buckets for the data and for the log files.

As you can see, you can decide how much of the table's provisioned throughput to allocate to the export process (10% to 100% in 5% increments). You can run an immediate, one-time export or you can choose to start it every day at the time of your choice. You can also choose the IAM role to be used for the pipeline and for the compute resources that it provisions on your behalf.

I selected one of my tables for immediate export, and watched as the MapReduce job was started up:

The export operation was finished within a few minutes and my data was in S3:

Because the file's key includes the date and the time as well as a unique identifier, exports that are run on a daily basis will accumulate in S3. You can use S3's lifecycle management features to control what happens after that.

I downloaded the file and verified that my DynamoDB records were inside:

Although you can't see them in this screen shot, the attribute names are surrounded by the STX and ETX ASCII characters. Refer to the documentation section titled Verify Data File Export for more information on the file format.

The import process is just as simple. You can create as many one-shot import jobs as you need, one table at a time:

Again, S3 plays an important role here, and you can control how much throughput you'd like to devote to the import process. You will need to point to a specific "folder" for the input data when you set up the import. Although the most common use case for this feature is to import data that was previously exported, you can also export data from an existing relational or NoSQL database, transform it into the structure described here, and import the resulting file into DynamoDB.

AWS Data Pipeline (see my introductory blog post for more information) is a web service that helps you to integrate and process data across compute and storage services at specified intervals. You can transform and process data that is stored in the cloud or on-premises in a highly scalable fashion without having to worry about resource availability, inter-task dependencies, transient failures, or timeouts.

Amazon Redshift (there's a blog post for that one too) is a fast, fully managed, petabyte-scale data warehouse optimized for datasets that range from a few hundred gigabytes to a petabyte or more, and costs less than $1,000 per terabyte per year (about a tenth the cost of most traditional data warehousing solutions). As you can see from this post, we recently expanded the footprint and feature set of Redshift.

Pipeline, Say Hello to RedshiftToday we are connecting this pair of powerful AWS services; Amazon Redshift is now natively supported within the AWS Data Pipeline. This support is implemented using two new activities:

The RedshiftCopyActivity is used to bulk copy data from Amazon DynamoDB or Amazon S3 to a new or existing Redshift table. You can use this new power in a variety of different ways. If you are using Amazon RDS to store relational data or Amazon Elastic MapReduce to do Hadoop-style parallel processing, you can stage data in S3 before loading it into Redshift.

The SqlActivity is used to run SQL queries on data stored in Redshift. You specify the input and output tables, along with the query to be run. You can create a new table for the output, or you can merge the results of the query into an existing table.

Putting it TogetherLet's take a look at a representative use case. Suppose you run an ecommerce website and you push your clickstream logs into Amazon S3 every 15 minutes. Every hour you use Hive to clean the logs and combine them with customer data residing in a SQL database, load the data into Redshift, and perform SQL queries to compute statistics such as sales by region and customer segment on a daily basis. Finally, you store the daily results in Redshift for long-term analysis.

Here is how you would define this processing pipeline using the AWS Management Console:

Here is how you define the activity that copies data from S3 to Redshift in the pipeline shown above:

And here is how you compute the statistics:

Start NowThe AWS Data Pipeline runs in the US East (Northern Virginia) Region. It supports access to Redshift in that region, along with cross-region workflows for Elastic MapReduce and DynamoDB. We plan to add cross-region access to Redshift in the future.

Today we are releasing a new feature that enables periodic copying of the data in a DynamoDB table to another table in the region of your choice. You can use the copy for disaster recovery (DR) in the event that an error in your code damages the original table, or to federate DynamoDB data across regions to support a multi-region application.

You can access this feature by using a new template that you can access through the AWS Data Pipeline:

Select the template and then set up the parameters for the copy:

Enter the table names for the source and destination, along
with their respective regions. The Read and Write Percentage Allocation is the
percentage of the table's total capacity units allocated to this copy. Then, set up the frequency of
this copy, the start time for the first copy, and optionally the end time.

You can use the Data Filtering option to choose between four different copy operations:

Full Table Copy - If you do not set any of the Data Filtering parameters, the entire table -- all items and all of their attributes -- will be copied on every execution of the pipeline.

Incremental Copy - If each of the items in the table has a timestamp attribute, you can specify it in the Filter SQL field along with the timestamp value to select against. Note that this will not delete any items in the destination table. If you need this functionality, you can implement a logical delete (usig an attribute to indicate that an item has been deleted) instead of a physical deletion.

Selected Attribute Copy - If you would like to copy only a subset of the attributes of each item, specify the desired attributes in the Columns field.

Incremental Selected Attribute Copy - You can combine Incremental and Selected Attribute Copy operations and incrementally copy a subset of the attributes of each item.

Once you have filled out the form, you will see a confirmation
window that summarizes how the copy is going to be accomplished. Data Pipeline
uses Amazon Elastic Map Reduce (EMR) to perform a parallel copy of data directly
from one DynamoDB table to the other, with no intermediate staging involved:

The usual AWS data transfer charges apply when you use this facility to copy data between DynamoDB tables.

Note that the AWS Data Pipeline is currently available in the US East (Northern Virginia) Region and you'll have to launch it from there. It can, however, copy data between tables located in any of the public AWS Regions.

As I described in my initial blog post, the AWS Data Pipeline gives you the power to automate the movement and processing of any amount of data using data-driven workflows and built-in dependency checking. You can access it from the command line, the APIs, or the AWS Management Console.

Today l'd like to show you how to use the AWS Management Console to create your own pipeline definition. Start by opening up the console and chosing Data Pipeline from the Services menu:

You'll see the main page of the Data Pipeline console:

Click on Create Pipeline to get started, then fill in the form:

With that out of the way, you now have access to the actual Pipeline Editor:

At this point you have two options. You can build the entire pipeline scratch or you can use one of the pre-defined templates as a starting point:

I'm going to use the first template, Export DynamoDB to S3. The pipeline is shown on the left side of the screen:

Clicking on an item to select it will show its attributes on the right side:

The ExportSchedule (an item of type Schedule) specifies how often the pipeline should be run, and over what time interval. Here I've specified that it should run every 12 hours for the first 6 months of 2013:

The ExportCluster (a Resource) specifies that an Elastic MapReduce cluster will be used to move the data:

Data. Information. Big Data. Business Intelligence. It's all the rage these days. Companies of all sizes are realizing that managing data is a lot more complicated and time consuming than in the past, despite the fact that the cost of the underlying storage continues to decline.

Buried deep within this mountain of data is the "captive intelligence" that you
could be using to expand and improve your business. Your need to move,
sort, filter, reformat, analyze, and report on this data in order to
make use of it. To make matters more challenging, you need to do this
quickly (so that you can respond in real time) and you need to do it
repetitively (you might need fresh reports every hour, day, or week).

Data Issues and ChallengesHere are some of the issues that we are hearing about from our customers when we ask them about their data processing challenges:

Increasing Size - There's simply a lot of raw and processed data floating around these days. There are log files, data collected from sensors, transaction histories, public data sets, and lots more.

Variety of Formats - There are so many ways to store data: CSV files, Apache logs, flat files, rows in a relational database, tuples in a NoSQL database, XML, JSON to name a few.

Distributed, Scalable Processing - There are lots of ways to process the data: On-Demand or Spot EC2 instances, an Elastic MapReduce cluster, or physical hardware. Or some combination of any and all of the above just to make it challenging!

Hello, AWS Data Pipeline
Our new AWS Data Pipeline product will help you to deal with all of these issues in a scalable fashion. You can now automate the movement and processing of any amount of data using data-driven workflows and built-in dependency checking.

Let's start by taking a look at the basic concepts:

A Pipeline is composed of a set of data sources, preconditions, destinations, processing steps, and an operational schedule, all defined in a Pipeline Definition.

The definition specifies where the data comes from, what to do with it, and where to store it. You can create a Pipeline Definition in the AWS Management Console or externally, in text form.

Once you define and activate a pipeline, it will run according to a regular schedule. You could, for example, arrange to copy log files from a cluster of Amazon EC2 instances to an S3 bucket every day, and then launch a massively parallel data analysis job on an Elastic MapReduce cluster once a week. All internal and external data references (e.g. file names and S3 URLs) in the Pipeline Definition can be computed on the fly so you can use convenient naming conventions like raw_log_YYYY_MM_DD.txt for your input, intermediate, and output files.

Your Pipeline Definition can include a precondition. Think of a precondition as an assertion that must hold in order for processing to begin. For example, you could use a precondition to assert that an input file is present.

AWS Data Pipeline will take care of all of the details for you. It will wait until any preconditions are satisfied and will then schedule and manage the tasks per the Pipeline Definition. For example, you can wait until a particular input file is present.

Processing tasks can run on EC2 instances, Elastic MapReduce clusters, or physical hardware. AWS Data Pipeline can launch and manage EC2 instances and EMR clusters as needed. To take advantage of long-running EC2 instances and physical hardware, we also provide an open source tool called the Task Runner. Each running instance of a Task Runner polls the AWS Data Pipeline in pursuit of jobs of a specific type and executes them as they become available.

When a pipeline completes, a message will be sent to the Amazon SNS topic of your choice. You can also arrange to send messages when a processing step fails to complete after a specified number of retries or if it takes longer than a configurable amount of time to complete.

From the ConsoleYou will be able to design, monitor, and manage your pipelines from within the AWS Management Console:

API and Command Line AccessIn addition to the AWS Management Console access, you will also be able to access the AWS Data Pipeline through a set of APIs and from the command line.

You can create a Pipeline Definition in a text file in JSON format; here's a snippet that will copy data from one Amazon S3 location to another: