
Prelude

Disclaimer: This is my first blog post. That’s right, I’m a complete rookie at this. Why now? Well, since starting my faculty position four years ago and working to help build and direct the Center for Host-Microbial Interactions, I increasingly find myself involved in various collaborations or projects where I’m learning something I find really interesting and that I think would be useful to share with either my lab or with the broader scientific community, but which doesn’t easily translate into a traditional publication. Basically, I’m learning some cool stuff, and it’s not always evident in my publications, so here we are.

Case in point: I recently attended the Microbiome Analysis in the Cloud workshop, held at the Institute for Genome Sciences (IGS) at the University of Maryland, Baltimore. The two-day workshop had a lot of high points, including excellent planning and preparation on the part of the organizers, a highly skilled staff that worked the room to help troubleshoot, and a program that covered a lot of ground. While covering that much ground was an explicit goal of the workshop, it also meant that I left feeling I wouldn’t be able to work through a full dataset on my own. Now that I’ve had a chance to review the workshop materials, I feel a bit more comfortable and want to put down my workflow in this blog post. Expect updates in the coming months as I marinate on this.

I also want to acknowledge Joshua Orvis, a bioinformatician at IGS and one of the workshop instructors. Without his one-on-one help and his development of Chiron, this tutorial wouldn’t be possible.

Why use the cloud?

I’m not here to sell you on the idea of cloud computing. In fact, maybe you should ignore all the buzz about ‘the cloud’. After all, you accrue charges for as long as your cloud instance runs, so forgetting to shut down an instance could result in some hefty charges. Some people complain that it’s a nuisance to move all your data to the cloud to begin working, only to then have to pull your results off the cloud before you shut it down. But the same data gymnastics come into play when you use any remote computing resource. Similarly, folks will often point out that you have to install all the programs and dependencies needed for your work before your cloud instance is useful. The arrival of Docker has largely made this a non-issue, in my opinion. Yes, I’m aware that Docker is viewed with trepidation by some in the bioinformatics community (see here, here, and here), but it seems to me that Docker and cloud computing make good bedfellows.

Still reluctant to dive in? Well, an alternative to using cloud computing resources is to simply invest in your own compute cluster or leverage compute resources at your institution. I have to say, both of these alternatives have some non-trivial downsides as well. Running your own in-house compute cluster gives you tons of control, but with great power comes great responsibility: you’ll have to maintain it, which requires sysadmin skills. Acquiring your fancy new computer will also take a serious chunk of change (think $10K or more), and the hardware quickly becomes outdated. My university has a pretty awesome compute cluster, but if your favorite program isn’t available, you likely won’t have the access privileges to install it yourself, and it can take some time for the powers-that-be to get it installed. We also experience frequent interruptions and server downtime on our university cluster. Here’s the bottom line: just like any other resource, cloud computing has its pros and cons, and should be thought of not as the only solution to your problems, but rather as one tool in your bioinformatics toolbox. So, let’s get started!

Fire up your cloud computer

The two most popular cloud computing services are Amazon Web Services (AWS) and Google Cloud Platform. Amazon, although the better known of the two, feels cumbersome to me: the first hour of the workshop and 36 slides were devoted just to getting our AWS instance up and running. I prefer Google. If you’re still undecided, I’d also point out that Google gives you a $300 credit, good for 1 year from the time you activate your cloud account! This is more than enough cash to work your way through this tutorial and still have plenty left for some of your own analyses.

I put together the video tutorial below to walk you through the following steps:

setting up your Google Cloud compute instance

installing Docker on this instance

installing Chiron for quick and easy access to a bunch of dockerized programs for metagenomics

installing the Google Cloud SDK software on your own computer (not the cloud) so you can easily connect to your new cloud resources

connecting an FTP client to the instance so you can easily transfer files back and forth

tearing it all down when you’re done

Below the video you’ll find all the commands to work through these steps on your own.

Install your programs

Once your gcloud instance is running, click on the ‘ssh’ button next to the instance to open a terminal window. This fast and easy way to connect to your cloud instance is one nice feature of the way gcloud is set up. We’ll now use this ssh connection to install Docker and Chiron.
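For readers who want something to copy and paste, here is a minimal sketch of what the install step might look like on a fresh Debian or Ubuntu instance. Treat it as an assumption-laden outline rather than the exact commands from the video: it assumes Docker's get.docker.com convenience script, and that Chiron is cloned from https://github.com/IGS/Chiron into your home directory. The `run` helper and the `DRY_RUN` variable are hypothetical names introduced only for this example. Check the Chiron README for any additional dependencies it may need.

```shell
#!/usr/bin/env bash
# Sketch only: install Docker and fetch Chiron on a fresh Debian/Ubuntu instance.
# Assumptions (not from the original post): the get.docker.com convenience
# script, and the Chiron repository living at https://github.com/IGS/Chiron.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

run() {
  # Print the command, and execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

install_docker_and_chiron() {
  # Each step runs only if the previous one succeeded.
  run sudo apt-get update &&
  run sudo apt-get install -y git curl &&
  run curl -fsSL https://get.docker.com -o get-docker.sh &&
  run sudo sh get-docker.sh &&
  run git clone https://github.com/IGS/Chiron.git "$HOME/Chiron"
}

install_docker_and_chiron
```

By default the script only prints the commands so you can review them first. Running it with `DRY_RUN=0` set in the environment executes them for real.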

Any Docker image could be put on your instance at this point. Take a look here to see if your favorite bioinformatics program has been dockerized.

Look around your working directory. In particular, take note of all the cool metagenomics tools that are now available in Chiron/bin.

ls Chiron/bin

Although the ssh terminal available right on your instance is very convenient, it does not establish a connection between our local computer and the cloud instance (which we must do in order to move files back and forth). In order to do that, we’ll want to connect to our instance from the terminal app on our local computer. Go ahead and launch your Terminal app. Before we do anything else, let’s execute a command in the terminal that will allow us to see hidden files in our directory. We need access to a few of these hidden files for the purposes of this tutorial.

defaults write com.apple.finder AppleShowAllFiles true
#then restart the Finder to see these changes
killall Finder
#after this tutorial you can hide these files again by replacing 'true' with 'false'

Now install the Google Cloud SDK.

curl https://sdk.cloud.google.com | bash

In my experience, if you encounter any issues with this entire tutorial, it will be with getting the SDK installed and connecting to your instance. For example, you may notice that the installation fails with the following error

This error has to do with the IPv6 settings on your computer preventing you from connecting to a Google server to download the SDK command line tools.

If you encounter this error, the fix is simple. Begin by temporarily turning off IPv6 support for either Wi-Fi or Ethernet, depending on which one you are using to connect to the internet. If you’re using a Wi-Fi connection, then you would turn it off with:

networksetup -setv6off Wi-Fi #if you're using ethernet, replace 'Wi-Fi' with 'Ethernet' in this line

Now reattempt the installation as you did above

curl https://sdk.cloud.google.com | bash

Once you have the Google Cloud SDK installed, be sure to turn IPv6 back on:

networksetup -setv6automatic Wi-Fi
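Since it is easy to forget that last step, the off, install, on sequence above can also be wrapped in one small function that restores IPv6 even if the installer fails. This is only a sketch for macOS. The function name `install_sdk_without_ipv6`, the `run` helper, and the `DRY_RUN` variable are hypothetical names made up for this example, while the `networksetup` and `curl` commands are the same ones used above.

```shell
#!/usr/bin/env bash
# Sketch only (macOS): disable IPv6, run the SDK installer, then re-enable
# IPv6 no matter how the installer exits.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

run() {
  # Print the command, and execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

install_sdk_without_ipv6() {
  local service="${1:-Wi-Fi}"   # pass 'Ethernet' if that is your connection
  (
    # The EXIT trap fires when this subshell ends, on success or on failure,
    # so IPv6 is always switched back to automatic.
    trap 'run networksetup -setv6automatic "$service"' EXIT
    run networksetup -setv6off "$service"
    run bash -c 'curl https://sdk.cloud.google.com | bash'
  )
}

install_sdk_without_ipv6 "Wi-Fi"
```

As written, this only prints the three commands in order. Set `DRY_RUN=0` to actually run them, and pass `Ethernet` instead of `Wi-Fi` if you are on a wired connection.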

Connect to your instance

Now we’ll connect to the instance from within our Terminal.

gcloud compute ssh instance-1
#if your instance is not called 'instance-1', be sure to modify this line accordingly
#be patient here, as this may take a moment to connect

If the above command failed with an authentication error, it’s because this is the first time you’ve run the SDK and it isn’t sure that you should have access to your Google account from the terminal. Take a moment to authenticate:

gcloud auth login
#this will open a browser window for you to select and sign in to your Google account
#after doing this, return to your terminal window and you should be good to go
#if this doesn't work, you may need to go through the gcloud initialization process by executing 'gcloud init'
#either way, once you have authenticated your account you will need to reattempt connecting with 'gcloud compute ssh instance-1'
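If you find yourself repeating the authenticate-then-connect routine, it can be folded into a small helper. The sketch below is one possible way to do it, not part of the SDK: `connect_to_instance` is a hypothetical function name, and the check assumes that `gcloud auth list` with the `status:ACTIVE` filter prints nothing when no account is logged in.

```shell
#!/usr/bin/env bash
# Sketch only: log in if no gcloud account is active, then ssh to the instance.

connect_to_instance() {
  local instance="${1:-instance-1}"   # change if your instance has another name
  local account

  # Ask gcloud which account, if any, is currently active.
  account="$(gcloud auth list --filter=status:ACTIVE --format='value(account)' 2>/dev/null)"

  if [ -z "$account" ]; then
    # No active account: open the browser-based login first.
    gcloud auth login || return 1
  fi

  gcloud compute ssh "$instance"
}

# Usage (uncomment to run):
# connect_to_instance instance-1
```

If the login step still fails, running `gcloud init` once by hand, as described above, is the fallback.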

Launch an interactive session with Chiron

Chiron gives you access to QIIME for processing marker gene sequence data, as well as the bioBakery suite of tools from Curtis Huttenhower’s lab for handling shotgun metagenomic sequencing data. One of the first steps in the bioBakery workflow is using MetaPhlAn2 to get species- and strain-level composition information from raw sequence files. This is a logical place for us to start as well.

Launch the MetaPhlAn2 interactive session

sudo ./Chiron/bin/phlan_interactive -l ~/data
#the -l option tells the interactive to create a new folder in our home directory called 'data', and sets this folder as the default from which data will be read and to which outputs will be saved
#this is where we'll put all our raw sequence data for analysis