data science, software, and other nonsense

Copying lots of data between S3 buckets

So you’ve filled an S3 bucket with hundreds of millions of objects and now, for one reason or another, you need to copy that data into another S3 bucket. How do you do that efficiently? If you’ve worked with S3 for a while, you’ve probably seen how painfully slow it is to list all of the keys in a bucket. Even getting a list of keys is a pain, and it’s difficult to parallelize.

Well, the good news is there’s a tool for that. It’s called S3DistCp, and it runs on Elastic MapReduce (EMR), distributing the network load of the copy across a cluster so data moves efficiently from one bucket to another.

It takes just a little bit of work to get this tool running. We’ll go through it step by step using the AWS console, but of course you can also do this from the command line with the AWS CLI. We’ll also show you how to do this on the cheap using spot instances!

Step 1: Make sure you are in the same region as your source bucket!

You need to launch S3DistCp on an EMR cluster in the same region as your source bucket. So use the region dropdown to select the right region:
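If you are not sure which region the source bucket lives in, you can look it up before picking one. Here is a minimal sketch using boto3's `get_bucket_location`, assuming boto3 is installed and your AWS credentials are configured. The helper names `region_from_location_constraint` and `bucket_region` are made up for this example. The one wrinkle is that S3 reports buckets in us-east-1 with an empty location, and some older buckets report the legacy value "EU".

```python
from typing import Optional


def region_from_location_constraint(constraint: Optional[str]) -> str:
    """Map the LocationConstraint value S3 returns to a region name."""
    # S3 reports buckets in us-east-1 with an empty (None) constraint.
    if not constraint:
        return "us-east-1"
    # Some older buckets report the legacy value "EU" for eu-west-1.
    if constraint == "EU":
        return "eu-west-1"
    return constraint


def bucket_region(bucket: str) -> str:
    """Look up a bucket's region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    response = boto3.client("s3").get_bucket_location(Bucket=bucket)
    return region_from_location_constraint(response.get("LocationConstraint"))
```

Calling `bucket_region("my-source-bucket")` should then tell you which region to select in the console.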

Step 2: EMR Configuration

From the AWS home screen, navigate to EMR:

Click on “Create Cluster” to launch a new cluster.

From the new menu, select “Advanced Options”:

From the advanced menu, you’ll walk through 4 steps to create an EMR cluster with the copy step. For the most part, the default configuration will be fine. So in the steps below, I’ll just indicate what you need to change to get your copy job launched.

Step 3: Adding the copy step

Under the “Step 1: Software and Steps” menu, find the “Add steps” menu and select “Custom JAR” from the dropdown:

Press “Configure”.

The key things to do here are to enter “command-runner.jar” into the JAR location field and to enter the following into the Arguments field:

s3-dist-cp --src s3://my-source-bucket/ --dest s3://my-dest-bucket/

Important: you need the “/” at the end of your S3 paths! Once done, click the Add button and then Next to move on to hardware configuration.

Also, if you want to launch your job and forget it, select Auto-terminate so that your resources spin down automatically once all the steps are complete.
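If you script this instead of using the console, the custom JAR step above maps onto a step definition like the ones boto3's `run_job_flow` accepts in its `Steps` list. The sketch below only builds that dictionary and adds the trailing “/” if you forgot it. The function names, the step name, and the choice of `TERMINATE_CLUSTER` on failure are my own assumptions for illustration, not something the console sets for you.

```python
def with_trailing_slash(s3_path: str) -> str:
    """Return the S3 path with the trailing '/' that the copy step needs."""
    if not s3_path.startswith("s3://"):
        raise ValueError(f"expected an s3:// path, got {s3_path!r}")
    return s3_path if s3_path.endswith("/") else s3_path + "/"


def s3distcp_step(src: str, dest: str) -> dict:
    """Build a step definition that runs s3-dist-cp via command-runner.jar."""
    return {
        "Name": "S3DistCp copy",  # hypothetical label, pick your own
        "ActionOnFailure": "TERMINATE_CLUSTER",  # assumption: stop paying if the copy fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # Same arguments as typed into the console's Arguments field.
            "Args": [
                "s3-dist-cp",
                "--src", with_trailing_slash(src),
                "--dest", with_trailing_slash(dest),
            ],
        },
    }


step = s3distcp_step("s3://my-source-bucket", "s3://my-dest-bucket/")
print(" ".join(step["HadoopJarStep"]["Args"]))
```

Passing each argument as its own list item also sidesteps the easy mistake of dropping a space between the command and its flags.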

Step 4: Configure the hardware

Now we’ll set up the EC2 instances that will be the nodes in our cluster for the copy job.

The key thing to configure here is the instance type for the Core nodes. To make the job go faster, add more nodes and select EC2 instance types with good network performance. To keep the job cheap, use spot instances, but note that there’s always a risk your job could get yanked if there’s a spike in EC2 usage.
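For a scripted launch, the same hardware choices can be written as the `Instances` argument of boto3's `run_job_flow`. This is a rough sketch under a few assumptions of mine: the function name is hypothetical, `m5.xlarge` is just a placeholder instance type, and I keep the master node on-demand while only the Core nodes use spot. Setting `KeepJobFlowAliveWhenNoSteps` to false is the API counterpart of the Auto-terminate option.

```python
def instances_config(
    core_count: int,
    instance_type: str = "m5.xlarge",  # placeholder, choose a type with good network performance
    use_spot: bool = True,
    auto_terminate: bool = True,
) -> dict:
    """Build an Instances config with one master node and `core_count` Core nodes."""
    if core_count < 1:
        raise ValueError("need at least one Core node")
    return {
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "Market": "ON_DEMAND",  # assumption: keep the master off spot
                "InstanceType": instance_type,
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                # Spot is cheaper but the nodes can be reclaimed mid-job.
                "Market": "SPOT" if use_spot else "ON_DEMAND",
                "InstanceType": instance_type,
                "InstanceCount": core_count,
            },
        ],
        # False means the cluster shuts down once all steps have finished.
        "KeepJobFlowAliveWhenNoSteps": not auto_terminate,
    }


print(instances_config(core_count=4)["InstanceGroups"][1])
```

More Core nodes means more machines pulling from S3 in parallel, which is where the speedup comes from.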

Step 5: Launch and monitor

Click through the remaining steps to launch your copy job! Monitor the job in the EMR console.
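If you would rather not keep refreshing the console, you can poll the step status instead. Below is a minimal sketch built on the EMR client's `list_steps` call. The function name `wait_for_steps` and the 30 second polling interval are my own choices, and the sketch only reads the first page of results, which is plenty for a cluster with a single copy step.

```python
import time

# Step states after which EMR will not change the step any further.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_steps(emr_client, cluster_id: str, poll_seconds: float = 30.0) -> dict:
    """Poll until every step on the cluster is finished; return {step name: state}."""
    while True:
        steps = emr_client.list_steps(ClusterId=cluster_id)["Steps"]
        states = {step["Name"]: step["Status"]["State"] for step in steps}
        if all(state in TERMINAL_STATES for state in states.values()):
            return states
        time.sleep(poll_seconds)
```

With boto3 installed, you would pass in `boto3.client("emr", region_name=...)` along with the cluster ID shown in the EMR console, and check that the copy step comes back as `COMPLETED`.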