Codelab 2: AWS + S3

Due Date

This codelab is due on Thursday, February 15th at 11:59:59PM. It may look long, but most of it is just setup paired with a casual walkthrough.

Goal

In this codelab, you'll get to play around with S3.

You'll get up and running on AWS with a Free Tier account and AWS Educate.

You'll test out S3 from the AWS CLI.

Setting Up

Before starting this codelab, run git pull in the 389Lspring18 directory to update your local copy of the class repository.

Getting Started with AWS

The first step is going to be to get an AWS account. If you already have an Amazon.com account with your @umd.edu email, then you can go ahead and sign in with the same account. Sign in or Sign up here.

Free Tier Accounts

You'll be using a Free Tier account on AWS. This tier is meant to give developers like you access to AWS to test drive some of its features.

You can learn more about the specifics of what is available on this page.

AWS Educate via MLH

Now that you have signed up for an AWS account, register for AWS Educate here. As university students, you can get access through MLH's partnership with AWS Educate.

You'll need your Account ID during the sign-up process, which you can access from the "My Account" page on AWS.

After you complete this process, you will receive a promotional credit via email. You should redeem this in the Billing > Credits section via AWS's GUI.

If you're concerned about how much you're using, there's more information here.

AWS Educate grants students access to $100 in credit to pay for services not covered under the free tier.

AWS GUI

This class will use a combination of the CLI and the GUI (referred to as the AWS Management Console, or just the AWS console). You can do everything you can do in the GUI via the CLI, but the goal is to get you comfortable with each as we'll use both throughout the class.

The two key components to understand here are how to access different services and how to switch between regions.

You can see the full list of services from the "Services" dropdown in the top-left corner:

To see which region you are currently in, look for the region tab in the top-right corner:

Keep in mind that we will be using us-east-1, which is in Northern Virginia.

As mentioned above, we will be using S3 in this codelab, so I would recommend that you go ahead and take a peek at S3 via the AWS Console. Come back throughout the codelab and look at how things change as you create buckets, upload files, etc.

AWS CLI

IAM User

To work with the CLI, you'll need to first create an IAM user via the AWS Console. Open the "IAM" service and navigate to "Users" (tab on left) > "Add user".

Pick a name for your user, and check "Programmatic Access". You won't need to sign in to the AWS console with your user.

You need to specify what permissions your new user has. To do this, we will create an IAM group for admin users. Give this group a name and select the "AdministratorAccess" policy.

Click through to the "Complete" page, where AWS will provide you with an access key ID and secret access key. Keep these safe, perhaps saving them in a text file on your machine, because AWS will not show the secret access key to you again. These keys are used to sign the programmatic requests that you make to AWS. If you lose them, you must start over and create a new IAM user.

Testing the CLI

If you haven't already done so, set up your local development environment by following the instructions here: Environment Setup. Upon entering the final command, you will be prompted for the keys you just obtained. Set us-east-1 as the default region and json as the default output format.
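For reference, configuring the CLI looks roughly like the following; the bracketed values are placeholders for your own keys, not real credentials:

```shell
$ aws configure
AWS Access Key ID [None]: <your access key id>
AWS Secret Access Key [None]: <your secret access key>
Default region name [None]: us-east-1
Default output format [None]: json
```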

Make sure the account ID matches and you're good to go. We'll use the CLI more below.
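One way to check the account ID is the sts get-caller-identity command, which echoes back the account and user that your CLI credentials belong to (the output shown here is illustrative):

```shell
$ aws sts get-caller-identity
{
    "UserId": "AIDAEXAMPLEID",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/<your IAM user>"
}
```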

Root User

To clarify the difference between the user you have been using to log in to the AWS Console and the IAM user you just created:

The credentials you've been using to log in with are the root user credentials. Essentially, you have two users for your account: the root user, which you use to sign in to the AWS Console, and the IAM user (which has an access key ID and secret access key), which you use to access AWS programmatically. You can't access the API programmatically with the root user, because it does not have associated access keys. In general, teams avoid using the root account, since its usage is hard to audit; instead, access is federated out to IAM users, preferably with each team member getting an IAM user with a limited set of permissions. However, in this class, we will continue to use the root account for simplicity.

S3 Basics

There are two main ways that you will programmatically interact with AWS. The first is through the CLI and the second is through the boto3 SDK. The CLI is great for one-time changes to your AWS environment. However, if you want to create dynamic scripts, then you will want to use something else -- in this class, that will be Python with Amazon's Python SDK, boto3.

For codelabs, I'll be showing you how to interact with AWS via the CLI, since the CLI is easier to experiment with compared to Python code. I won't go into as much detail with the boto3 library, since I don't believe that explaining both adds much value. The APIs for both are fairly similar, so you should be able to easily pick up the relevant boto3 as long as you understand each codelab.

Feel free to ask on Piazza if you have any trouble or questions about boto3 APIs.

References

Do reference the aws s3api documentation while working through this codelab. You will need it for successful completion.

Buckets

Creating a Bucket

The s3api create-bucket command allows you to create S3 buckets from the CLI. Go ahead and create a bucket to use in this codelab:

$ aws s3api create-bucket --bucket cmsc389l-<your directory id>

Remember that these names must be globally unique. What happens if you try to create a bucket that has already been claimed? Try it against the cmsc389l bucket that I created for class:

$ aws s3api create-bucket --bucket cmsc389l

Audit Trail

Throughout the rest of the codelab, you will be playing around with the bucket you just created. For me to grade this codelab, I ask that you enable CloudTrail auditing on this bucket. CloudTrail is a tool that allows you to perform audit logging on different services, which can be useful for debugging problems or for enabling compliance.

For this codelab, you'll submit the audit trail at the end of the assignment. I'm not going to look at the logs in depth; I'll only skim them to verify that you took the time to go through this codelab. So feel free to explore beyond the walkthrough below.

We will enable the audit log via the AWS Console. Open the CloudTrail service and go to "Trails" > "Create Trail". Create a new trail on the bucket you just created (cmsc389l-<directory id>) and save it into a newly created bucket.

The new trail will take a few seconds to appear in the CloudTrail list, but once it does, you are good to go with the remainder of the codelab.

Important: You must enable the audit logging before continuing (above).

Listing Buckets

To check which buckets are owned by your account, use s3api list-buckets:

$ aws s3api list-buckets

You should see both buckets: the first bucket you created, and the one containing audit logs.

Writing Data

Now that we have an S3 bucket, we are ready to upload content into it.
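As a sketch of the basic flow (the file and key names here are just examples), you can upload an object with put-object, read it back with get-object, and list the bucket's keys with list-objects-v2:

```shell
$ echo "hello, world" > hello.txt
$ aws s3api put-object --bucket cmsc389l-<your directory id> --key hello.txt --body hello.txt
$ aws s3api get-object --bucket cmsc389l-<your directory id> --key hello.txt hello-copy.txt
$ aws s3api list-objects-v2 --bucket cmsc389l-<your directory id>
```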

Run the list-objects-v2 command again, and you will see all of these files. However, if you wanted to see only the files in one of these directories, you could limit the list operation to those keys with a given prefix:
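For example, if you had uploaded files under a hypothetical images/ directory, you could list just those keys with the --prefix flag:

```shell
$ aws s3api list-objects-v2 --bucket cmsc389l-<your directory id> --prefix images/
```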

Wrapping Up

Great, you should now be familiar with the basic operations on S3. You can programmatically create buckets, upload objects, read back data from objects, list keys in a bucket, and delete these objects by key.

Storage Types

In class, we talked about four storage classes: Default, Reduced Redundancy Storage (RRS), Infrequently Accessed (IA), and Glacier. We're not going to cover Glacier here, since its API is fairly complicated. However, let's experiment with moving objects between the other three storage types.

We can toggle between the first three storage classes of an S3 object using the --storage-class flag for the s3api put-object command. Note that moving to Glacier is a separate process. Why? Objects in the first three classes are all accessible in real-time, while objects in Glacier are not.

Perform the s3api list-objects command again to double-check that it modified the storage class.

Alternatively, we can use the s3api copy-object command to avoid the extra bandwidth charges that arise from uploading the object all over again. In this case, it will only change the storage class on the object.
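A sketch of that copy-object invocation, copying an object onto itself while changing its storage class (the key name here is just an example):

```shell
$ aws s3api copy-object \
    --bucket cmsc389l-<your directory id> \
    --key hello.txt \
    --copy-source cmsc389l-<your directory id>/hello.txt \
    --storage-class STANDARD_IA
```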

If you were to move these files into Glacier instead, then reading them back would require triggering a retrieval job to move the files back into an S3 bucket. This process usually takes hours.

S3 CLI

The tool that you've used so far, aws s3api, is one of two AWS CLI tools for interacting with S3. The other, aws s3, is a higher-level wrapper on top of s3api that provides typical directory traversal operations.

I recommend that you play around with this tool a bit more to see how it compares with s3api. As you will see, it is much more limited than s3api, but it is quite useful for when you need to move files between your local computer and an S3 bucket.
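For example, the aws s3 tool supports familiar directory-style commands like ls, cp, and sync (the paths here are illustrative):

```shell
$ aws s3 ls s3://cmsc389l-<your directory id>/
$ aws s3 cp hello.txt s3://cmsc389l-<your directory id>/hello.txt
$ aws s3 sync ./my-local-directory s3://cmsc389l-<your directory id>/my-local-directory
```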

Assignment

Your assignment for this codelab is to re-create part of the s3 sync command using boto3 in Python. You'll be able to use this script to quickly copy files and directories into S3.

Note: You must use boto3, not the s3 sync command. You will not get full credit if you just write a wrapper on top of the sync command.

As an example of how this script works, say you had a directory like so:

Additionally, we have provided you with a Pipfile that contains all the packages we believe you will need to complete the assignment. Pipfiles work just like package.json files in the world of Node and JavaScript, or Gemfiles in the world of Ruby and Rails. When you run pipenv install, the Pipfile will be parsed and all of the dependencies inside of it will be downloaded and installed locally for your project. You can even pin the specific version of a dependency that you are importing. Feel free to add to the Pipfile any packages you feel you need to implement your solution.
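To get you started with boto3, here is a minimal sketch of uploading a file and listing a bucket's keys. It is not a solution to the sync assignment; the bucket and file names are hypothetical, and it assumes your CLI credentials are already configured:

```python
import boto3

# Create an S3 client using the credentials configured via `aws configure`.
s3 = boto3.client("s3")

# Upload a single local file to a key in the bucket.
s3.upload_file("hello.txt", "cmsc389l-<your directory id>", "hello.txt")

# List the keys currently in the bucket.
response = s3.list_objects_v2(Bucket="cmsc389l-<your directory id>")
for obj in response.get("Contents", []):
    print(obj["Key"])
```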

Host a Static Website

As mentioned in class, you can use S3 to host static content, like static websites. Let's do that for the STICs website!

At this point in the assignment, you have a tool that you could use to upload static content for a website. However, the tool does not yet set the content type of each file when it is uploaded (files default to a generic byte-stream type). If you tried to open an HTML file hosted on S3 with this content type, your browser would just download the file instead of rendering it. This is because the browser doesn't use the file extension to determine the type of a file, as you might expect; it uses the Content-Type header in the response instead.

So, what we want to do is let S3 know what type of content it's getting during upload. You will do this by updating your tool to detect and set the content type header. S3 will then be able to pass that information to a browser, enabling it to display our content as desired. Take a look at the documentation again and note that path, bucket, acl, and s3_dest are not the only parameters defined for the upload function.

How do we know what to pass as arguments? Why, by using a library of course! It's just 2-3 lines with a package like mimetypes to detect and set the content type.

Note: A previous version of this codelab recommended python-magic; however, it fails to identify a handful of content types, such as CSS. Use an alternative, like the mimetypes package.
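As a sketch of the content-type detection (the function name here is my own, and it uses the standard-library mimetypes module):

```python
import mimetypes

def guess_content_type(path):
    """Guess a file's MIME type from its name, falling back to a byte stream."""
    content_type, _ = mimetypes.guess_type(path)
    return content_type or "application/octet-stream"

print(guess_content_type("index.html"))   # text/html
print(guess_content_type("css/main.css")) # text/css
```

When uploading with boto3, the detected type can typically be passed through the upload call's ExtraArgs parameter, e.g. ExtraArgs={"ContentType": guess_content_type(path)}.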

You will want to clone the source code for the STICs website from its GitHub repository here.

$ git clone https://github.com/UMD-CS-STICs/UMD-CS-STICs.github.io

Next, create a new bucket specifically for this site, such as cmsc389l-<directory id>-website. Then, enable website hosting on that bucket:
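Assuming you have uploaded the site's files into that bucket with your tool, enabling website hosting from the CLI looks roughly like this:

```shell
$ aws s3 website s3://cmsc389l-<your directory id>-website/ --index-document index.html
```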

View the site!

To view the site, browse to the S3 URL of index.html and you will see the full STICs site!

https://s3.amazonaws.com/cmsc389l-colink-website/index.html

Submit a screenshot of the front page of the STICs site hosted on your S3 bucket. Make sure to include the URL in the screenshot (don't just submit a screenshot of sticsumd.com!).

Submission

You will submit a zipped directory containing your script, upload.py, plus your Pipfile. You will also include the screenshot of the STICs homepage and your audit log.

Audit logs are stored as gzipped JSON files in the bucket that you created. Within this bucket, the logs are organized in a directory hierarchy by date, region, and a few other factors (see this documentation page). You'll need to download all of your CloudTrail logs from this bucket for the us-east-1 region. You can use the aws s3 command to sync this directory to your local filesystem.
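A sketch of that sync, assuming CloudTrail's usual key layout (your logging bucket name and account ID will differ):

```shell
$ aws s3 sync s3://<your logging bucket>/AWSLogs/<your account id>/CloudTrail/us-east-1/ ./logs
```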

In your submission, include the logs directory that you just downloaded.

Now that you have these log files locally, you can inspect their contents. To access a log, you'll need to unzip it:

$ gunzip <log file name>.json.gz

Then, you can inspect the audit log with jq (a JSON pretty-printer):

$ cat <log file name>.json | jq . | less

Submit this assignment to codelab2 on the submit server. Upload a zipped directory containing the following files: