Why have a login node?

A login node is like a regular node, but runs a software image that is configured to handle login tasks. Typically, it also has an extra interface connected to the external network from which the logins come in.

Logins alone place hardly any load on a head node. So, unless the requirements are special, a login node is not actually needed in the first place. However, if users want to log in and do tasks beyond simply running jobs on the regular nodes, then it can make sense to offload logins to separate login nodes, and to custom-configure these off-head login nodes to handle the extra tasks.

What do we need to consider for login nodes?

The following items can be considered during login node configuration:

workload management (WLM) roles are best removed, for administrative convenience, so that login nodes are not used as compute nodes. While all the nodes (head, login, and regular nodes) run WLM components, login nodes are essentially submission nodes, and are not assigned CMDaemon WLM roles. The head node WLM server manages the resources that the regular (compute) node WLM clients consume. A login node submits jobs, but the WLM component it uses to do the submission is outside of CMDaemon, and is also not managed by CMDaemon. For example, for submitting jobs to Torque 4 and above, the component that has to be installed on a login node is trqauthd.

login nodes typically have an external interface facing the outside world, in addition to the internal one facing the regular nodes. If so, then after configuring the external interface, configuring a firewall for that external interface is sensible.

the logins for users via the external interface can be load balanced across the login nodes, for example using a round-robin DNS. How to deal with login nodes that may be down should be thought about.

restricting ssh access to the head node to just the administrators is normally desirable, since users should now be logging into the login nodes.

restricting ssh access to the login nodes is usually best customized to the requirements of the users using the ssh options

if the default image is cloned to provide a basis for login nodes, then extra software should probably be added to the login nodes for the convenience of users. For example, the default image has no X11, GUI web browser, or emacs.

Based on these considerations: 1. we will take the default image and make a login image from it, making some suitable adjustments to the cluster along the way. 2. We will follow up by configuring various WLMs across the cluster, and running a job with these WLMs.

The recipes that follow can be used as a basis for your own requirements.
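If the image has not been cloned yet, a sketch of how it might be done in cmsh follows (the prompt and image names match this recipe; details may differ on your cluster):

```
[bright60]% softwareimage
[bright60->softwareimage]% clone default-image login-image
[bright60->softwareimage*[login-image*]]% commit
```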

It is safest to wait for the cloning to be done. It typically takes about 10 minutes, and then you see a message like:

Copied: /cm/images/default-image->/cm/images/login-image

followed by:

Initial ramdisk for image login-image is being generated

You must wait for ramdisk generation to be done before a reboot is carried out on the node. When it is done, typically after about 3 minutes, the message is:

Initial ramdisk for image login-image was regenerated successfully.

You can then get on with the next step.

Node category creation

We make a new node category for the image, called login, like this:

cmgui: create a new Node Category (clone from "default"). This makes a clone with the name default1. Rename category default1 to login. Then open the login category, and assign the new image "login-image" to the category by selecting it from the menu in the Settings tab. Save. After that, click on the Roles tab and check the box beside Login Role. Save.
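A sketch of the equivalent in cmsh (the role is assumed to be named login in cmsh, matching the Login Role checkbox in cmgui):

```
[bright60]% category
[bright60->category]% clone default login
[bright60->category*[login*]]% set softwareimage login-image
[bright60->category*[login*]]% roles
[bright60->category*[login*]->roles]% assign login
[bright60->category*[login*]->roles*[login*]]% exit; exit
[bright60->category*[login*]]% commit
```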

Running yum --installroot=/cm/images/login-image list installed lists the installed packages, so that you can figure out whether you want to try removing a package.

However, the original default-image on which login-image is based is quite lean, so the chances are high that packages listed as being in login-image are needed or depended upon by others. So, keep an eye on the package dependencies that yum shows you before you agree to remove a package. This will help you avoid lobotomizing the image and making it unworkable. If you are unfamiliar with this kind of stripdown, don't bother with removing packages.

Now, in order to add emacs, gnuplot, and other choices (eg, eclipse-cdt) to the software image, you can run:

# yum --installroot=/cm/images/login-image install emacs gnuplot eclipse-cdt

The above would install about 400MB of packages to the login-image. (If installing eclipse-cdt, yum may, due to a glitch, also install a lot of core files in the image. In that case, just remove the core files from the image before provisioning a node with the image, or provisioning will take longer than needed, amongst other issues.)

Remove workload manager roles

Remove all workload manager (WLM) roles from the login category, since the head node and regular nodes do the WLM server and client roles:

cmgui:category->role, deselect WLM roles and save.

cmsh:

[bright60->category[login]->roles]% unassign sgeclient; commit

The sgeclient entry here is just an example. You should remove the roles that you have. The list command at the prompt level above would show you any roles assigned to the login category.

Adding login nodes to CMDaemon

If you haven't created the login nodes in CMDaemon yet, you can do it with:

cmgui:

Nodes->Create Nodes button. Start with, say, login001 and end with, say, login008, and set the category to login.

Warning: The Ethernet switch settings were not cloned, and have to be set manually

...

[bright60->device*[login001*]]% commit

So, here we are building 8 login nodes.

You need to set the category as suggested, ie login, in the preceding, because otherwise the login nodes default to having the same attributes as default regular nodes.
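In cmsh, creating one such node might be sketched as follows (the internal IP address here is only an example, and the MAC and interface settings still need to be set per node):

```
[bright60]% device
[bright60->device]% add physicalnode login001 10.141.0.250
[bright60->device*[login001*]]% set category login
[bright60->device*[login001*]]% commit
```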

Configure extra interface:

Login nodes usually have an extra external interface compared with regular nodes. If the BOOTIF default interface uses eth0, then you can add the eth1 physical interface name to each login node like this:
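(A sketch; the external addresses starting at 1.2.0.1 are examples, and the addinterface syntax should be checked against the help text of your cmsh version. The IP addresses increment across the node range from the starting address.)

```
[bright60]% device
[bright60->device]% addinterface -n login001..login008 physical eth1 externalnet 1.2.0.1
[bright60->device*]% commit
```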

addinterface can work with ranges of nodes and IP addresses. See the cmsh help text or manual for more on this.

Make sure that the externalnet IP address is given a static value or assigned from the DHCP server on the externalnet (if it is a DHCP-supplied address, then the server supplying it is rarely the head node).

The login nodes should normally be served by DNS outside the cluster (ie not served from the head nodes). Remember to configure that. Load-balancing via DNS may be a good idea.

Set up load balancing across the login nodes

The DNS to the login nodes is normally going to be an external DNS (ie not served by the head node). A simple and convenient option to distribute login sessions across the nodes is for the administrator to modify the DNS zone file, and tell users to use a generic login name that will be distributed across the login hosts. For example, if the login nodes login001 to login008 are reachable via IP addresses 1.2.0.1 to 1.2.0.8, then cluster.zone file may have these name resolution entries:
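A sketch of what such round-robin entries might look like in the zone file, using the example names and addresses above:

```
login    IN  A  1.2.0.1
login    IN  A  1.2.0.2
login    IN  A  1.2.0.3
login    IN  A  1.2.0.4
login    IN  A  1.2.0.5
login    IN  A  1.2.0.6
login    IN  A  1.2.0.7
login    IN  A  1.2.0.8
```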

The first time the IP address is looked up for login.cm.cluster.com, the first IP address is given; the next time the lookup is done, the next IP address is given, and so on. If a login node goes down, then the system administrator should adjust the records accordingly. Adjusting these automatically makes sense for larger numbers of login nodes.

Restrict ssh access to the login nodes with a firewall on the login nodes

You can restrict ssh access on the login nodes, to allow access only from a restricted part of the big bad internet, with custom iptables rules, or perhaps with something like the iptables-based shorewall (discussed next). Remember, letting users access the net means they can access websites run by evildoers. The administrator therefore has to exercise due care, because users generally do not.

To install and configure shorewall:

Shorewall is an iptables rules generator. You configure it to taste, run it, and the rules are implemented in iptables. Install shorewall into the login image with the package manager, eg:

# yum --installroot=/cm/images/login-image install shorewall

The policy file sets the general shorewall policy across the zones: accept packets from the firewall, and from the loc zone (internalnet), to everywhere; drop traffic from the net zone to any zone; reject all else.
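A sketch of an /etc/shorewall/policy implementing that (the zone names fw, loc, and net are assumed to be defined in the zones file):

```
#SOURCE   DEST   POLICY
fw        all    ACCEPT
loc       all    ACCEPT
net       all    DROP
all       all    REJECT
```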

The rules file is where all the final tweaks can go into a newly created NEW section.

I like my pings to work. I also want some ssh connections from trusted networks to work. I put the least critical ones last, because priority matters during a DoS. And I would like CMDaemon to have its connectivity. So I may end up with something like:
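For example, a sketch of /etc/shorewall/rules along those lines (the trusted network 1.2.3.0/24 and the CMDaemon port 8081 are assumptions to adapt to your setup):

```
SECTION NEW
# pings
Ping(ACCEPT)    all             all
# CMDaemon connectivity from the internal network
ACCEPT          loc             fw      tcp     8081
# ssh from a trusted network first, then anything less critical
ACCEPT          net:1.2.3.0/24  fw      tcp     22
```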

Make sure nobody else has their keys on the head node under /root/.ssh/authorized_keys.

Modify ssh access on the head node to allow only root authentication. Ie, adjust /etc/ssh/sshd_config with

AllowUsers root

and restart the sshd service:

# service sshd restart

If running LDAP, then modifying /etc/security/access.conf is another way to restrict users.

Reboot all the login nodes so that they pick up their images and configurations properly

Just to make sure:

# cmsh -c "device; pexec -n=login001..login008 reboot"

After we have the login nodes up and running, we should make sure they do what they are supposed to, by running a job on them with a workload manager (WLM). What follows are walkthroughs to do that. They assume no WLM is already installed.

2. Set up the workload manager and run a job

If you are already running Slurm, you presumably have the roles and queues assigned the way you want. If not, then set them up as you require; the Administrator Manual covers this in detail.

WLM Software and environment needed on the login nodes

Don't try to assign a WLM role to the login nodes. Just make sure the munge service is available in the login node image. A

yum --installroot=/cm/images/login-image install munge

would install it. However, if the login image was cloned from the default image provided by Bright, munge is already going to be on the image, so there is no need to yum install it like that.

Set up munge as a service on the login node.

Slurm uses munge to authenticate to the rest of the cluster when submitting jobs, so it is important enough to warrant the following extra configuration steps:

monitor the munge service on the login nodes, so that you know if it crashes

set up the autostart functionality for it, so that it restarts if it crashes on the login nodes
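A sketch of how both steps might be done in cmsh for the login category (the monitored and autostart parameters are per the services submode):

```
[bright60]% category use login
[bright60->category[login]]% services
[bright60->category[login]->services]% add munge
[bright60->category[login]->services*[munge*]]% set monitored yes
[bright60->category[login]->services*[munge*]]% set autostart yes
[bright60->category[login]->services*[munge*]]% commit
```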

See the section on Managing And Configuring Services in the Admin manual for more background.

munge.key file sync

Make sure the /etc/munge/munge.key is the same across all nodes that use munge. For a Slurm setup that is already active, you can just copy the key into the new login-image. For example, on the head node, copy it over, preserving permissions, like this:

# cp -p /etc/munge/munge.key /cm/images/login-image/etc/munge/munge.key

Make sure you have an ordinary user account to test this with (See User Management chapter in the Administrator Manual if you need to check how to add a user. Basically add a user via cmgui/cmsh and set a password).

Log in as a regular user (freddy here) to a login node, eg login001. Add the Slurm and your favorite MPI compiler module environments (see the sections on the module environment in the Administrator and User manuals for more background):

[freddy@login001 ~]$ module add openmpi/gcc slurm

Build a sample hello world executable:

[freddy@login001 ~]$ mpicc hello.c -o hello

You can use the hello.c from the Using MPI chapter of the User Manual.
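The job can then be submitted from the login node. A sketch with a minimal batch script (the script name and the number of tasks are examples):

```
[freddy@login001 ~]$ cat hello.job
#!/bin/bash
#SBATCH --ntasks=4
mpirun hello
[freddy@login001 ~]$ sbatch hello.job
```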

There you are - you built a login node that submitted a job using Slurm.

Torque 4 setup and job submission:

We will use the Slurm submission section as the guideline, since it is pretty similar.

Make sure the Torque roles and queues are already configured the way you want. If not, then set them up as you require; the Administrator Manual covers this in detail.

Notes

It's essential to add the submission host to the /etc/hosts.equiv file to be able to submit jobs. The following error message indicates that the submission host hasn't been added to the /etc/hosts.equiv file:

Set up trqauthd on the login node as a service. This is done just like for munge with Slurm: install it into the image and configure it as a service. Then make sure Torque knows which hosts are submit hosts, by using qmgr to put these hosts in its database.
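A sketch of registering the login nodes as submit hosts with qmgr, run on the Torque server host (typically the head node), with host names as in this recipe:

```
# qmgr -c "set server submit_hosts = login001"
# qmgr -c "set server submit_hosts += login002"
```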