Every Amazon S3 library I can lay my hands on (for Python, at least) seems to read the entire file into memory before uploading it. This might be alright when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.

I managed to bolt a solution together using urllib2 and poster, and it has been working reliably for me for the past few months. I'm going to show you a little about how S3 works and how to use poster to stream your uploads.

A little about how S3 works

S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.

The area you put your files into is called a bucket. Bucket names (which have some restrictions) are globally unique; that is, if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. That might sound weird, but it has its advantages: you can now access your files from http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files via http://holiday_photos.s3.amazonaws.com/.
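So if the bucket really is public, reading a file back out of it is nothing more than an HTTP GET on that URL. A quick sketch (the object name here is made up, just for illustration):

    import urllib2

    # Fetch a publicly readable object straight from its bucket URL.
    # "somephoto.jpg" is a made-up key, purely for illustration.
    data = urllib2.urlopen("http://holiday_photos.s3.amazonaws.com/somephoto.jpg").read()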

The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", using HTTP PUT; in this case, PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.

When uploading to S3, we need to provide a few HTTP headers along with our file data when we PUT.

Date - The current date and time in a specific format, e.g. Wed, 01 Mar 2006 12:00:00 GMT. I generate it with time.strftime("%a, %d %b %Y %X GMT", time.gmtime())

Content-Type - The MIME type of the file being uploaded, e.g. text/html. Python's mimetypes module does a good job of guessing this for any given file based on its extension: mimetypes.guess_type(filename)[0]

Content-Length - The length of the data to be uploaded, as per RFC 2616. If you are uploading the file from disk you can get this with the os module's stat function: os.stat(filename).st_size

x-amz-acl - Optional; this tells S3 which access control policy to use. By default the file will be readable only by the owner of the bucket; to make it publicly readable, set this header to public-read.

Authorization - This is the tricky one. S3 requires that your PUT request be accompanied by an authorization string in the following format: AWS AWS_ACCESS_KEY_ID:SIGNATURE. The AWS_ACCESS_KEY_ID is the one provided to you when you signed up for S3.

The signature is a string consisting of several of the headers you are sending, concatenated together with the resource you are putting, and signed (HMAC-SHA1) with your AWS secret access key. Constructing the signature is quite complicated in the general case, so I am going to show a method of generating it for the specific type of upload request we will be making (see the sketch below); if you need to send headers that we are not using here, see Amazon's documentation for how to create the Authentication header.
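Here is a minimal sketch of building those headers and the signature for the kind of PUT described above. The access key, secret key, bucket and file names are placeholders; I'm assuming the standard S3 string-to-sign for a PUT that carries a Content-Type, Date and x-amz-acl header, so double-check it against Amazon's documentation before relying on it.

    import base64
    import hashlib
    import hmac
    import mimetypes
    import os
    import time

    # Placeholders - fill these in with your own values.
    AWS_ACCESS_KEY_ID = 'YOUR-ACCESS-KEY-ID'
    AWS_SECRET_ACCESS_KEY = 'YOUR-SECRET-ACCESS-KEY'
    bucket = 'holiday_photos'
    key = 'meonthebeach.jpg'
    filename = '/path/to/meonthebeach.jpg'

    date = time.strftime("%a, %d %b %Y %X GMT", time.gmtime())
    content_type = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
    content_length = os.stat(filename).st_size
    acl = 'public-read'

    # The string to sign for this particular request: the HTTP verb, an empty
    # Content-MD5, the Content-Type, the Date, the x-amz-acl header and the
    # resource (/bucket/key), each separated by newlines.
    string_to_sign = "PUT\n\n%s\n%s\nx-amz-acl:%s\n/%s/%s" % (
        content_type, date, acl, bucket, key)
    signature = base64.b64encode(
        hmac.new(AWS_SECRET_ACCESS_KEY, string_to_sign, hashlib.sha1).digest())

    headers = {
        'Date': date,
        'Content-Type': content_type,
        'Content-Length': str(content_length),
        'x-amz-acl': acl,
        'Authorization': "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature),
    }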

How to use Poster

Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers poster's custom URL openers with urllib2, and you are good to go.
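Putting it together with the headers built above, a streaming PUT looks roughly like this. This is a sketch under the assumption that poster's register_openers() has been called so urllib2 can stream a plain file object as the request body; the bucket, key, filename and headers variables carry over from the earlier snippet.

    import urllib2
    from poster.streaminghttp import register_openers

    # Register poster's streaming handlers with urllib2 so the file object
    # below is streamed from disk instead of being read into memory.
    register_openers()

    url = "http://%s.s3.amazonaws.com/%s" % (bucket, key)
    fileobj = open(filename, 'rb')

    # urllib2 defaults to POST when data is supplied, so force the method to PUT.
    request = urllib2.Request(url, data=fileobj, headers=headers)
    request.get_method = lambda: 'PUT'

    response = urllib2.urlopen(request).read()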

Thanks for this post! I've got a question: do you know how to make it work with Python 3?

With approximately the same code (and poster modified with 2to3 and a few fixes for Python 3) I get this error: "TypeError: 'generator' does not support the buffer interface" on the line "urllib.request.urlopen(request).read()", and I haven't found a solution...