Looking into a 'remote future' with R

Instead of using the term “computer nerd”, I always preferred to describe myself as a “computer jock” growing up1. After all, if there had been a programming team in high school, I felt like I probably would have made varsity.

But sometimes, athletes are limited by the equipment they have on hand. Cyclists can’t ride the Tour de France with tricycles, and my lab can’t run hundreds of thousands of simulations on our 2014 Macbooks. We’ve been feeling the crunch as of late, which is why we’ve begun exploring the various computing clusters available to us on campus.

This post is meant for two purposes: the first is to document an example of using remote clusters with R, and the second is to serve as instructions/reference for my lab members at Rochester.

Requirements

Here’s what you’ll need, and what I’ll be covering.

a remote computing cluster

Here, I’ll be using the University of Rochester’s CS cluster

a non-remote computer (duh)

Since everybody in our lab uses Macs (yay UNIX!), this walkthrough will focus on the travails of using OS X

R, and the packages future, parallel, and if you don’t have tidyverse already installed, shame on you

Step 1: Get ssh-askpass working on your Mac

Long story short, when you try to connect to a remote computer via SSH, Mac OS X does not prompt for a password outside of Terminal (for security reasons). But you often need a password to log in, so in those cases that you do, you’ll be screwed. This is a way around that.

First, download and install ssh-askpass. If you don’t already have brew installed, you probably should. It’s super useful. After you install them, open up a Terminal and type the following:

sudo brew services start theseal/ssh-askpass/ssh-askpass

After that, restart your computer.

Get some keys

You want to be able to authenticate yourself when you connect to a remote computer, and you need the magic of modern cryptology to do that. Make sure you have a place to store your keys first:

ls ~/.ssh

If there are keys, they are probably listed as id_rsa_<something>. Whatever the case, a key should generally have two files: a file without an extension and another file with the same name but with an extension .pub. If there isn’t a directory, create one with mkdir ~/.ssh and move into it with cd ~/.ssh.

If you don’t have any keys, generate one with:

ssh-keygen -t rsa -b 2048

If this is your first key, don’t give it a name, just leave it as the default (which should be id_rsa). Give it a password to protect it in case someone else gets their mitts on it. If there’s already a file called id_rsa.pub, you’re fine.

This is where we started running into some problems with ssh-askpass on some of our Macs. Run the command below to add the default key to your authentication agent. Add -c as an argument if you want to be prompted for the password to it each time you use it.

ssh-add

If doing this gives you an error about connecting to an agent, first the following and then attempt it again:

If you don’t want to keep getting prompted for your password every time you log in, after completing Step 2 below, follow these instructions.

Step 2: Log in to the remote cluster

For this, you’ll need access to the cluster, i.e., an account that can log in. @Labmates: get your PI to talk with the CS folks to get you an account. After that’s done, go pick up your login password from Jim Roche in the Wegmans building on campus.

After you have this, open up that Terminal again and connect with the following command:

ssh <your_username_for_the_cluster>@<the_clusters_address>

For me, it would be ssh zachburchill@cycle1.cs.rochester.edu. If the username on your computer is the same as the cluster account username, I don’t think you need to add the <username>@ bit.

Type in your password when prompted, and log in. Make sure you say “yes” when it asks if you want to add the computer to the list of known hosts. @Labmates: I believe you have to be connected to the school’s network to log in to the cluster. Either connect via school wifi or use the VPN. Nope, it looks like you don’t!

Note: while you’re here, make sure the future package is installed on whatever remote computer you’re trying to log into.

Step 3: Setting up R and ssh-askpass

Now open up some RStudio or R in the Terminal. If you’re using R in the Terminal, you can probably skip the next subsection. If you’re using the “R Console” version of R (not in an actual terminal), welllllp… for some reason it always crashes on my computer when I try to make a remote connection, so good luck.

Using RStudio

The exact workings of all this quickly went beyond my need to know, so I’ll just tell you what we did to get this to work with RStudio. Do the following chunk, making changes where needed.

library(future)# We don't 100% understand this, but it's probably necessary!# Make sure that the path to rpostback-askpass is included in your path variable:Sys.setenv(PATH=paste0(Sys.getenv('PATH'),':/Applications/RStudio.app/Contents/MacOS/postback/'))# Change this is your username on the CS accounts:user_name<-"zachburchill"server_name<-"cycle1.cs.rochester.edu"# Now let's see how lucky you are. Try this command:plan(remote,workers=paste0(user_name,"@",server_name))

Did that last line prompt you for your password? If so, hooray! It wasn’t that easy for me!

After some research, the version of ssh-askpass we installed seemed like it was supposed to deal with this problem. I’m running pretty old versions of OS X, R, and RStudio, however, so who knows. After much frustration, we found the following solution:

# If that's the case, then make sure the "ask_pass" path is somehow right# Check to see if you have one first. If not, it should be: ""Sys.getenv('SSH_ASKPASS')# Then, add it to R.# Make sure that this path is correct first though. You can try:# `which ssh-askpass`# in the Terminal to see where it's installedSys.setenv(SSH_ASKPASS='/usr/local/bin/ssh-askpass')# Now it should work. Try it yourself:plan(remote,workers=paste0(user_name,"@",server_name))

Hopefully that will get things working! You’ll probably have to set these paths every time you start a new R session with RStudio.

Remote processing with future

When you’re running a lot of jobs, you want to have as many operate in parallel as possible. Also, when you’re actively interacting with data in R, you don’t want to have to wait until the last command is finished running. You can have things complete asynchronously with the future package.

This thing is huge (both in terms of functionality and how much it can help) so read more about it here. In an incredibly small nutshell, instead of using <-, you can make something a future (something that will hold a value in the “future”) by using %<-% instead.

Let’s try it out:

foo<-function(i){# makes R pause for that many secondsSys.sleep(i)# returns the name of the active directorygetwd()}x%<-%foo(10)

Now let’s see what the value of x is:

x

## [1] "/home/bcs-serve/u12/zachburchill"

Doing that, we notice two things:

Asynchronous

First, if we tried to access the value of x right after we assigned it, R made us wait a while. That’s because making x a future means that its value isn’t guaranteed to have been made yet. Our foo function sleeps for 10 seconds before returning a value; so when we try to access the value of x before that time, R makes us wait until it exists. But if we don’t care about accessing it immediately, we can go on and do other things, so long as those things don’t rely on knowing the value of x.

Remote

Second, you’ll notice that the value of x is not your home directory on your local computer. If everything went right, it should be R’s default directory on your remote machine. This is because we did plan(remote, workers = ...) earlier. Specifying plan as “remote” tells R to make these futures be processed remotely on whatever computer we defined.

If you do it this way, you can run computationally intensive models and simulations on a beefier remote server, while you use your own personal computer to orchestrate everything. Wooo!

But you can also harness multiple remote computers to do your dirty work for you. If you’re running a bunch of simulations, you can make each run a certain subset. Remote, parallel computing!

Remote AND parallel!

This is where the parellel package comes in. There’s a lot there, so look over the docs and do a bit of Googling to get a gist of how to use it. Theoretically you could do the following and you’d basically be done with the setup:

Unfortunately, this isn’t going to fly for what my labmates and I have to work with. First, the “cycle” hostnames are basically just gateways to the cluster and shouldn’t be used for intensive jobs, and secondly, we call these commands from our local computers, and–long story short–this sort of remote clustering only really works well when the remote clusters can connect back to you (you have to set up a socket connection)2.

Fortunately, we’re not screwed. @Labmates: there are two options you can do here. You can either ssh into the cycle hosts, boot up R from the Terminal on these machines, and then run the cluster, or you can call: ~~NESTED FUTURES!~~

But that’s kind of a whole ‘nother deal, so check that shizz out in my next blog post about it! (Gotta get those extra page-clicks or whatever.)

Also, special thanks to my advisor Florian Jaeger, who introduced me to the future package, got the ball rolling, and got us connected to the cluster.

Footnotes

My mommy told me that joke was funny when I first came up with it when I was 11, so I’m assuming it’s still good. ↩