Cluster

We have a small Hadoop cluster for this course, based on Cloudera Express.

Connecting Remotely

The goal here is to connect to gateway.sfucloud.ca by SSH. Since you can't connect directly from the outside world, it's not completely straightforward.

Option 1: the right way

If you don't already have one, create an SSH key so you can log in without a password. The command will look like this:

ssh-keygen -t rsa -b 4096 -N ""

Then copy your public key to the server:

ssh-copy-id <USERID>@gateway.sfucloud.ca

Create or add to ~/.ssh/config (on your local computer, not the cluster gateway) a configuration that lets you connect to the cluster by SSH. Then you can simply run ssh gateway.sfucloud.ca to connect.
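The exact configuration depends on your setup, so treat this as a sketch: <JUMPHOST> is a placeholder for whatever SSH host you can reach directly from the outside world, and the LocalForward line assumes the YARN web UI is reachable on port 8088 from the gateway (this forwarding is what makes the http://localhost:8088/ address in the Job Logs section work).

```
Host gateway.sfucloud.ca
    User <USERID>
    # Placeholder: replace with an SSH host you can reach from outside
    ProxyJump <USERID>@<JUMPHOST>
    # Forward the YARN web UI to your local port 8088
    # (assumes the UI is reachable on the gateway's own port 8088;
    # adjust the destination host if it runs elsewhere)
    LocalForward 8088 localhost:8088
```

With something like this in place, ssh gateway.sfucloud.ca connects through the jump host and sets up the port forward automatically.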

Job Logs

If you have set up your SSH config file as in the Cluster instructions, you can see the list of jobs that have run on the cluster at http://localhost:8088/.

Then at the command line, use the application ID from that list to get the logs like this:

yarn logs -applicationId application_1234567890123_0001 | less

Cleaning Up

If you have unnecessary files sitting around (especially large files created as part of an assignment), please clean them up with a command like this:

hdfs dfs -rm -r /user/<USERID>/output*

It is possible that you have jobs running and consuming resources without knowing it: maybe you created an infinite loop or otherwise have a job burning memory or CPU. You can list the jobs running on the cluster like this:

yarn application -list