Bosco Installation

Introduction

BOSCO is a job
submission manager designed to help researchers manage large numbers
(~1000s) of job submissions to the different resources that they can
access on a campus. This is release 1.2 of BOSCO. If you find any
problems or need help installing or running BOSCO, please email bosco-discuss@opensciencegrid.org.

It offers the following capabilities:

Jobs are automatically resubmitted when they fail. The researcher
does not need to babysit their jobs.

Job submissions can be throttled to meet batch scheduler
settings (e.g. only 10 jobs running concurrently). The researcher
does not need to make multiple submissions. BOSCO handles that
for them.

BOSCO is designed to be flexible and allows jobs to be submitted to
multiple clusters, with different job schedulers (e.g. PBS, LSF,
SGE, HTCondor, SLURM*).

The primary advantage for the researcher is that they only need to learn
one job scheduler environment even if the clusters utilize different
native environments.

* SLURM support is via its PBS emulation

This document explains how to install, configure and use BOSCO for a
single user. If you plan to install BOSCO only for yourself (single user)
and to connect it to only one cluster, we recommend the Bosco Quick Start
guide, which is less flexible but offers an easier, guided setup.
Bosco Quick Start gives you a full installation, but it does not explain
how to connect to multiple resources; for that you have to read this
document.

Requirements

BOSCO Submit-node
: This is the system that the researcher uses to submit jobs. In
general it can be the user's laptop, workstation, or it can be
another system that the user logs into for submitting jobs to
the cluster. There cannot be any Condor collector running on the
submit node; otherwise it will conflict with Bosco.

BOSCO resource (aka Cluster) submit-node
: This is the node that you normally login to on the PBS, SGE, LSF,
SLURM or HTCondor cluster (the BOSCO resource that you'd like
to add).

BOSCO resource (aka Cluster)
: This is the remote cluster that jobs will execute on (the
BOSCO resource). The Cluster submit-node is a node belonging to
this cluster. The nodes where jobs run are referred to as worker nodes.
The cluster needs:

Shared Filesystem
: The Cluster needs a shared home filesystem (if the Cluster has no shared filesystem only Grid universe jobs can be sent to it)

Network Access
: The worker nodes need to have access to the submit host. There can be a NAT between the worker nodes and the submit host.

Note

The BOSCO resource requirements
just listed (shared file system and outbound network access) are
required only if you submit jobs using the HTCondor vanilla universe.
If you submit jobs using the grid universe, submitting to one
resource at a time, then both requirements can be relaxed: there
is no need for a shared file system or for access to the BOSCO submit host
from the BOSCO resource.

BOSCO can be used as part of a more complex Condor setup (with flocking
or multiple pools). Whatever the setup:

the BOSCO host needs connectivity to the cluster submit nodes of the
BOSCO resources

the worker nodes of the BOSCO resources (running the jobs, e.g. the
nodes in the PBS cluster) must have network connectivity to the jobs
submit node (the BOSCO host or a different Condor schedd flocking
into it)

Networking

BOSCO submit host

Port 11000 is required
ONLY if "vanilla" jobs are submitted. "vanilla" jobs also require
that all worker nodes be able to reach the submit host: this means that
the submit host must have a public IP address or at least an address
reachable by all BOSCO resources that are going to be added. If you need to
use a port other than 11000, you can edit the BOSCO configuration.

BOSCO Resources

The cluster login node must accept connections on port 22 from the BOSCO submit host. BOSCO
uses the SSH protocol to communicate with the login node.

The second requirement is for the BOSCO worker nodes (wherever the jobs
will run) and it is needed only if "vanilla" jobs are used. The worker nodes must
be able to connect to the BOSCO submit host. Only outgoing connectivity from the worker nodes to the
BOSCO submit host is required.

How to Install

Installation Procedure

Download Bosco from
the bosco website. We recommend the Multi-Platform installer. It
is a bit more complex than the quick start
installation but allows more options.

Setup environment before using

BOSCO is not installed in the system path, so an environment file must be sourced every time you use BOSCO (start/stop, job submission or query, anything):

$ source ~/bosco/bosco_setenv
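
If you prefer not to type this in every session, the sourcing can be wrapped in a small function and called from a shell startup file. The following is only a sketch: the function name load_bosco_env is hypothetical, and the default path simply repeats the one used above.

```shell
# Hypothetical helper for a shell startup file such as ~/.bashrc.
# It sources the BOSCO environment file only when it is readable,
# and reports an error otherwise instead of failing silently.
load_bosco_env() {
    # Default path is the one used in this document; pass another as $1.
    env_file=${1:-$HOME/bosco/bosco_setenv}
    if [ -r "$env_file" ]; then
        . "$env_file"
    else
        echo "BOSCO environment file not found: $env_file" >&2
        return 1
    fi
}
```

Calling load_bosco_env with no argument then has the same effect as the source command above, provided the file exists.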

Starting BOSCO

BOSCO has some persistent services that must be running. You'll have to start BOSCO at the beginning and probably after each reboot of your host.
You should stop BOSCO before an upgrade and possibly before a shutdown of your host.
If you no longer need BOSCO, bosco_uninstall will remove it from your system.

To start BOSCO:

$ bosco_start

Add a cluster to BOSCO

To add a new cluster to the resources you will be using through BOSCO:

To add the cluster mycluster, with user username, submit host mycluster-submit.mydomain.org (the Fully Qualified Domain Name, i.e. the full hostname including the domain name) and Local Resource Management System LRMS (valid values are pbs, condor, sge or lsf), run the following command, replacing username, the host name and LRMS with your own values:

$ bosco_cluster --add username@mycluster-submit.mydomain.org LRMS

When you add your first cluster, BOSCO will prompt you for a password that will be used to store the SSH keys used by BOSCO to access all your clusters (Enter password for bosco ssh key:). Select a random string. It is preferable if you do not use the password you use to access the cluster or to unlock your SSH keys.

Then, if you don't have a ssh key agent with that cluster enabled, you will be prompted for the password that you use to access the cluster you are adding to BOSCO (Enter password to copy ssh keys to ...). This may be followed by a confirmation of the RSA key fingerprint, if it is your first ssh connection from this host, where you have to answer yes.

Important

You must be able to login to the remote cluster. If password authentication is OK, the script will ask you for your password. If key only login is allowed, then you must load your key in the ssh-agent. Here is an example adding the key and testing the login:
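
A minimal sketch of such a check, using the standard OpenSSH tools ssh-agent, ssh-add and ssh. The function name check_cluster_login is hypothetical, and the default key path ~/.ssh/id_rsa is an assumption; use the key that your cluster accepts.

```shell
# Hypothetical helper: load a key into ssh-agent and test the login.
# $1: user@host of the cluster submit node
# $2: private key path (default ~/.ssh/id_rsa is an assumption)
check_cluster_login() {
    target=$1
    key=${2:-$HOME/.ssh/id_rsa}
    # Start an agent for this shell only if none is already available.
    [ -n "$SSH_AUTH_SOCK" ] || eval "$(ssh-agent -s)" >/dev/null
    # Add the key; you may be asked for its passphrase.
    ssh-add "$key" || return 1
    # BatchMode disables password prompts, so this succeeds only
    # if key-based login works.
    ssh -o BatchMode=yes "$target" hostname
}
```

For example, check_cluster_login username@mycluster-submit.mydomain.org should print the host name of the login node if the key is accepted.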

If the last message printed when adding the cluster is Done!, your cluster has been added successfully.

You can see a list of the current clusters in BOSCO by typing:
$ bosco_cluster --list

Submitting a test job

You can send a simple test job to verify that the cluster added is working correctly.

To send a BOSCO test job to the host username@mycluster-submit.mydomain (the name as listed in the output of bosco_cluster --list):
1. Set up the environment appropriate for your shell as described in the setup environment section (above).
2. Run the test, replacing the username and cluster name with your own:

$ bosco_cluster --test username@mycluster-submit.mydomain

For example:

$ bosco_cluster -t dweitzel@ff-grid.unl.edu
dweitzel@ff-grid.unl.edu
Testing ssh to dweitzel@ff-grid.unl.edu...Passed!
Testing bosco submission...Passed!
Checking for submission to remote pbs cluster (could take ~30 seconds)...Passed!
Submission files for these jobs are in /home/dweitzel/bosco/local.localhocentos56/bosco-test
Execution on the remote cluster could take a while...Exiting

How to Stop and Remove

To stop BOSCO:
$ bosco_stop

To uninstall BOSCO:
1. If you want to remove remote clusters, get the list with bosco_cluster --list and remove them one by one.

Uninstalling BOSCO removes the software but leaves the files in your =.bosco= and =.ssh= directories with all the information about the added clusters and the SSH keys. Files installed on the remote clusters are not removed either.
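
The one-by-one removal of clusters mentioned above could be scripted as follows. This is a minimal sketch, not part of BOSCO: the function name remove_all_clusters is hypothetical, and it assumes that each line printed by bosco_cluster --list starts with a user@host entry, possibly followed by a slash and the LRMS name, and that bosco_cluster --remove accepts that entry. Check both against your installation before relying on it.

```shell
# Hypothetical helper: remove every cluster known to BOSCO.
remove_all_clusters() {
    bosco_cluster --list | while IFS= read -r entry; do
        # Skip lines that do not look like user@host (e.g. messages).
        case $entry in
            *@*) ;;
            *) continue ;;
        esac
        # Assumption: drop an optional "/lrms" suffix from the entry.
        cluster=${entry%%/*}
        echo "Removing $cluster"
        # Redirect stdin so the command cannot consume the rest of the list.
        bosco_cluster --remove "$cluster" </dev/null
    done
}
```

Run it before uninstalling, after sourcing the BOSCO environment.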

---## How to Update BOSCO
If you want to update BOSCO to a new version you have to:
1. set up BOSCO:

$ source ~/bosco/bosco_setenv

2. stop BOSCO:

$ bosco_stop

3. remove the old BOSCO:

$ bosco_uninstall

4. download and install the new BOSCO (see install section above) and re-add all the clusters in your setup: for each installed cluster (the list is returned by =bosco_cluster --list=), remove the cluster and then add it again with =bosco_cluster --add=.

This will update the local installation and the software on the remote clusters.
%ENDSECTION{"BoscoSetup"}%

%STARTSECTION{"BoscoJob"}%
---# Job submission example

You can submit a regular Condor vanilla job to BOSCO. The Campus Factory, a component within BOSCO, will take care to run it on the different clusters that you added and to transfer the input and output files as needed.
Here is a simple example. The Condor team provides [[http://research.cs.wisc.edu/condor/tutorials/][many great tutorials]] to learn more.

---## Configuring Executable
You may wrap your job in a script (e.g. using your favorite shell or Python) so that you can invoke more executables and do some tests.

This is a simple bash script, called =myjob.sh=:

#!/bin/bash

# Prepare for the execution

# Run the actual applications

hostname
date
id
whoami

# Final steps

---## Example Submission File
With BOSCO you can submit directly to the cluster, using the grid universe, or use glideins so that regular (vanilla) HTCondor jobs can be used.
The small difference between the two options is in the submit file (see below); vanilla jobs also have some additional [[BoscoInstall#FirewallReq][Firewall and Network requirements]] because they use glideins.
All the other steps (job file creation, job submission and checking the jobs) are the same.

Use one or the other.

---### Direct Job submission example

Here is an example submission file for direct submission. Copy it to a file, =example.condor=
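
A minimal sketch of such a file, written from the shell with a here-document. It assumes the =myjob.sh= script from the previous section and a PBS cluster; username@mycluster-submit.mydomain.org and the output file names are placeholders, so replace them with the name listed by =bosco_cluster --list= and with names of your choice.

```shell
# Write a grid-universe submit file for direct submission.
# The quoted 'EOF' keeps the content literal (no shell expansion).
cat > example.condor <<'EOF'
# Direct submission through the grid universe
universe = grid
# Placeholder resource: LRMS type and user@host of the BOSCO resource
grid_resource = batch pbs username@mycluster-submit.mydomain.org
executable = myjob.sh
output = myjob.out
error = myjob.err
log = logfile
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue 1
EOF
```

The job can then be submitted with =condor_submit example.condor=.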

NOTE: =condor_q= will also show the glidein jobs, the auxiliary jobs that BOSCO uses to run your actual job. In the example above, job 11 was the one submitted.

The job may show as idle if it is currently idle at the remote cluster. When the job is being executed on the remote cluster, the =ST= (State) will change to =R=, and the =RUN_TIME= will grow.

Another method of monitoring a job is to check the job's =log=, a (mostly) human-readable timeline of actions relating to the job. The log file is specified in the submission script, for example =logfile= in the example above. You can view the log file by using =cat=:

$ cat logfile

---## Job output
Once the job completes BOSCO will transfer back standard output, standard error and the output files (if specified in the submit file), e.g. the job above will create stdout and stderr files (unique for each submission) and a file =myout-file-10= in the directory where the =condor_submit= command was executed.
%ENDSECTION{"BoscoJob"}%

In order to use Multi-Cluster Bosco, you must make one configuration change. Multi-cluster use also requires a public IP address.

---## Changing the Bosco Configuration for Multi-Cluster
By default BOSCO uses the loopback IP address. You must change the configuration to listen on the public interface. You can do this by editing the configuration file =$HOME/bosco/local.bosco/config/condor_config.factory= and adding the following line anywhere in it:

NETWORK_INTERFACE =

By setting this, you are enabling Bosco's smart interface detection which will automatically choose and listen on the public interface.

---## Glidein Job submission example

You can submit a regular HTCondor vanilla job to BOSCO. The Campus Factory, a component within BOSCO, will take care to run it on the different clusters that you added and to transfer the input and output files as needed.
Here follows a simple example. The Condor team provides [[http://research.cs.wisc.edu/condor/tutorials/][many great tutorials]] to learn more.

Here is an example of a vanilla submission file (using glideins). Copy it to a file, =example.condor=
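
A minimal sketch of such a file, written from the shell with a here-document. It assumes the =myjob.sh= script shown earlier; the output file names are placeholders. The =$(Cluster)= and =$(Process)= macros are expanded by HTCondor, not by the shell, which is why the here-document delimiter is quoted.

```shell
# Write a vanilla-universe submit file (run through glideins).
# The quoted 'EOF' keeps $(Cluster) and $(Process) literal.
cat > example.condor <<'EOF'
# Regular HTCondor vanilla job
universe = vanilla
executable = myjob.sh
# One stdout/stderr file per submission and per job
output = myjob.out.$(Cluster)-$(Process)
error = myjob.err.$(Cluster)-$(Process)
log = logfile
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue 1
EOF
```

The job can then be submitted with =condor_submit example.condor=.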

NOTE: The BOSCO submit host needs to satisfy these additional [[BoscoInstall#FirewallReq][Firewall and Network requirements]] to be able to submit and run vanilla jobs. Those requirements include being reachable by all BOSCO resources.

%STARTSECTION{"BoscoAdvancedUse"}%
---# Advanced use

%STARTSECTION{"BoscoAdvancedUseInstContent"}%

---## Changing the BOSCO port
BOSCO uses the HTCondor [[http://research.cs.wisc.edu/htcondor/manual/latest/3_7Networking_includes.html#SECTION00472000000000000000][Shared port daemon]]. This means that all communication goes through the same port, by default 11000. If that port is taken (already bound), the [[BoscoQuickStart][quick start installer]] will select the first available port. You can check and manually edit the port used by BOSCO in the file =$HOME/bosco/local.bosco/config/condor_config.factory=, where you can change the port passed to the shared port daemon.
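
As a minimal sketch, the edit could be done from the shell. The function name set_bosco_port is hypothetical, and the sketch assumes that the port is set in the configuration file by a line of the form =SHARED_PORT_ARGS = -p 11000=; check your file first, since this assumption is not confirmed here.

```shell
# Hypothetical helper: change the port passed to the shared port daemon.
# $1: configuration file, $2: new port number
# Assumes a line of the form "SHARED_PORT_ARGS = -p <port>" exists.
set_bosco_port() {
    config=$1
    port=$2
    tmp=$(mktemp) || return 1
    # Rewrite the matching line into a temporary file, then copy the
    # result back so the original file keeps its permissions.
    sed "s/^SHARED_PORT_ARGS[[:space:]]*=.*/SHARED_PORT_ARGS = -p $port/" \
        "$config" > "$tmp" && cat "$tmp" > "$config"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, =set_bosco_port "$HOME/bosco/local.bosco/config/condor_config.factory" 11001= would switch to port 11001.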

NOTE You need to restart BOSCO after you change the configuration (=bosco_stop; bosco_start=).

If you are referring to this BOSCO pool (e.g. for flocking) you'll need to use a string like: =your_host.domain:11000?sock=collector= .
Replace host and port with the correct ones.

---## Multi homed hosts
Multi homed hosts are hosts with multiple Network Interfaces (aka dual-homed when they have 2 NICs).
BOSCO configuration is tricky on multi-homed hosts. BOSCO requires the submit host to be able to connect back to the BOSCO host, so it must advertise an interface that is reachable from all the chosen submit hosts. E.g. a host with a NIC on a private network and one with a public IP address must advertise the public address if the submit hosts are outside of the private network.
In order to do that you have to:
* make sure that the name returned by the command =/bin/hostname -f= resolves to the public address (e.g. =host $(hostname -f)= should return the public address). If not, you should change it.
* edit =~/bosco/local.$HOST/condor_config.local= (HOST is the short host name) and add a line like =NETWORK_INTERFACE = xxx.xxx.xxx.xxx= , substituting xxx.xxx.xxx.xxx with the public IP address. This will tell BOSCO to use that address.

---## Modifying maximum number of submitted jobs to a resource

Many clusters limit the number of jobs that can be submitted to the scheduler. For PBS, we are able to detect this limit. For SGE and LSF, we are not able to detect this limit. In the cases where we cannot find the limit, we set the maximum number of jobs very conservatively, to a maximum of 10. This includes both the number of idle and running jobs to the cluster.

The limit is specified in the condor config file =~/bosco/local.bosco/condor_config.local=, at the bottom. Edit the value of the configuration variable =GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10

---## Custom submit properties
Bosco has the ability to add custom submit properties to every job submitted to a cluster. On the cluster's login node (the BOSCO resource, the host you used at the end of the line when typing the =bosco_cluster --add= command), create the custom submit attributes script (=pbs_local_submit_attributes.sh= for PBS, as in the example below).

IMPORTANT: This file is executed and its output is inserted into the submit script. That is, the file is not copied verbatim: use echo or cat statements in the script to print the lines you want to add.

Below is an example =pbs_local_submit_attributes.sh= script which will cause every job submitted to this cluster through Bosco to request 1 node with 8 cores:

#!/bin/sh

echo "#PBS -l nodes=1:ppn=8"

---### Passing parameters to the custom submit properties.
You may also pass parameters to the custom scripts by adding a special parameter to the Bosco submit script.

For example, in your Bosco submit script, add:

...
+remote_cerequirements = NumJobs == 100
...
queue

After you submit this job to Bosco, it will execute the [[#CustomScriptLocations][custom scripts]] with, in this example, =NumJobs= set in the environment equal to =100=. The custom script can take advantage of these values. For example, a PBS script can use =NumJobs=:

#!/bin/sh

echo "#PBS -l select=$NumJobs"

This will set the number of requested cores from PBS to the =NumJobs= value specified in the original Bosco submit file.
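
If a submit file does not set =NumJobs=, the variable would be empty in the script. A minimal sketch of a variant that falls back to a default; the function name emit_pbs_attributes and the default of 1 are illustrative assumptions, not part of Bosco:

```shell
#!/bin/sh
# Variant of the custom script above with a fallback value.
# ${NumJobs:-1} expands to $NumJobs, or to 1 when it is unset or empty
# (the default of 1 is an assumption for illustration).
emit_pbs_attributes() {
    echo "#PBS -l select=${NumJobs:-1}"
}

# The output of the script is what gets inserted into the submit script.
emit_pbs_attributes
```

With =NumJobs= set to 100 in the environment the script prints the same line as the example above.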

---## Flocking to a BOSCO installation
In some special cases you may want to flock to your BOSCO installation. If you are not familiar with flocking, skip this section.

In order to enable flocking you must use an IP address that all the hosts you are flocking from can use to communicate with the BOSCO host.
Then you must set up FLOCK_FROM and the security configuration so that the communications are authorized.

BOSCO has strong security settings. Here are two examples:
1. Using GSI authentication (a strong authentication method): you must provide and install X509 certificates and change the configuration accordingly. Remember to enable and configure GSI authentication also on the host you are flocking from.
2. Relaxing the BOSCO security settings to allow CLAIMTOBE authentication. This is not very secure. Use it only if you can trust all the machines on the network, and remember to enable CLAIMTOBE also on the host you are flocking from.

In both cases, if you have not done so already, remember to configure BOSCO not to use the loopback address.

After copying from the examples or editing your configuration file, save it as =$HOME/bosco/local.bosco/config/zzz_condor_config.flocking=.
Other names are OK as long as the definitions in the file override BOSCO's default ones (check with =condor_config_val -config=).

Please try manually to ssh from the BOSCO host to the cluster submit node. The ability to connect is required in order to install BOSCO.

---## Make sure that BOSCO is running
BOSCO may not survive after you log out, for example if the BOSCO node was restarted while you were logged out.
When you log back in, after sourcing the setup as described in the [[#SetupEnvironment][setup environment section]], you should start BOSCO as described in the [[#BoscoStart][BOSCO start section]], especially if the command =condor_q= is failing.

---## Errors due to leftover files
Bosco files on the submit host are in:
* =~/bosco/= - the release directory
* =~/.bosco/= - some service files
* =~/.ssh/= - the ssh key used by BOSCO

If you used =bosco_uninstall=, it will remove all BOSCO related files. If you removed BOSCO by hand, you must pay attention to what is left behind.
If the service key is still in =.ssh= but the other files are missing, BOSCO commands will fail with unclear errors like
"IOError: [Errno 2] No such file or directory: '/home/marco/.bosco/.pass'" , "OSError: [Errno 5] Input/output error" , all followed by:

Password-less ssh to marco@itb2.uchicago.edu did NOT work, even after adding the ssh-keys.
Does the remote resource allow password-less ssh?

If that happens you can remove the service files and the keys using:

rm -r ~/.bosco
rm ~/.ssh/bosco_key.rsa*

and then re-add all the clusters with =bosco_cluster --add=.

---## Unable to download and prepare BOSCO for remote installation.
BOSCO can return this error:
1. Because the BOSCO submit host is unable to download BOSCO for the resource installation, e.g. a firewall is blocking the download or the server is down
2. More commonly because there are problems with the login host of the BOSCO resource, e.g. the disk is full or there are multiple login nodes
You can check 1 by downloading BOSCO on your BOSCO submit host.
To check 2 you have to log in on the BOSCO resource: =df= will tell you if some disks are full, and with =hostname -f= you can check if the name is different from the one that you used to log in with ssh. If the name differs, you are probably using a cluster with multiple login nodes and you must use only one for BOSCO. See the second "IMPORTANT" note in the [[#AddResourceSection][section to add a cluster to BOSCO]] (above).
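
The two checks for case 2 can be combined in a small function to run on the login node of the BOSCO resource. This is only a sketch: the function name check_login_node is hypothetical, and it only compares names and prints disk usage for your home file system.

```shell
# Hypothetical helper, to be run on the login node of the BOSCO resource.
# $1: the host name you used with bosco_cluster --add
check_login_node() {
    expected=$1
    # Show free space in the home file system (a full disk breaks the install).
    df -h "$HOME"
    # Compare the node's own fully qualified name with the one you used.
    actual=$(hostname -f)
    if [ "$actual" = "$expected" ]; then
        echo "OK: host name matches $expected"
    else
        echo "MISMATCH: logged in to $actual, expected $expected"
    fi
}
```

A MISMATCH line suggests that the name you used is a generic alias for several login nodes.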

If you see errors similar to the one below while executing ==bosco_cluster --add==:

Downloading for USER@RESOURCE
Unpacking.tar: Cannot save working directory
tar: Error is not recoverable: exiting now
ls: /tmp/tmp.qeIJ9139/condor*: No such file or directory
Unable to download and prepare BOSCO for remote installation.

then you are most likely using the generic name of a multi-login cluster and you should use the name of one of the nodes as suggested in the [[#AddResourceSection][note above]].

%ENDSECTION{"BoscoTroubleshootingItems"}%

---# Get Help/Support
To get assistance you can send an email to bosco-discuss@opensciencegrid.org