Posted: Fri Jul 26, 2013 6:06 pm Post subject: How do I use a cluster nicely?

I have overloaded the math department's cluster and have been asked not to do that anymore. The cluster runs Gentoo Linux, and I sign in through a character terminal (PuTTY and Cygwin Terminal). Since I don't have the graphical interface that some other students have, I cannot get to the web page they use to see which nodes are in use. The system does not have any cluster queuing software installed (and won't any time soon).

I have two questions, which really both involve "How do I use this cluster nicely?". How do I use the nice command and how can I find out which nodes are not being used?

First, I have been told I can use the nice command to make sure my long-running, processor-hogging jobs give way to other users and don't crash them. However, man nice (and nice --help) does not say whether a nice command applied to a bash script also applies to the commands within it, nor what happens if one of those commands starts an MPI program that runs on several other nodes.

In other words, I actually run my processes using an executable bash script called submitJOB which submits many jobs, sort of like this:

Code:

for ((parm1=0; parm1<=4; parm1++))
do
    for ((parm2=0; parm2<=30; parm2+=6))
    do
        # <num_processors> and ... are placeholders for the real arguments.
        echo "mpirun -np <num_processors> ..."  # Log the command about to run
        mpirun -np <num_processors> ...         # This line submits a job to run on many processors
    done
done

I (used to) run my job like this:

Quote:

>submitJOB

Should I now run it as

Quote:

nice -n 15 submitJOB

or do I modify the script file so that the command inside the loop reads

Quote:

nice -n 15 mpirun ...

(or both)?

Second, I need to find out which nodes are being used. I have been told about the top command, but it has an interactive output (and its man page doesn't indicate a way to redirect that). That means I have to rsh to each node and run the top command. We do have a bash script called rcom (and rcom-nodes) which will rsh to every node and issue a command. I am looking for a command-line command that tells me who the biggest users, or biggest processes, on a node are and how much CPU and memory they have used, and that gives me text output, so that I can run it as rcom <that_command> and get a quick read on which processors are in heavy use.
_________________
Desperately needs help learning Gentoo Linux in order to use a 32-node cluster for my master's thesis in mathematics.

On a distributed computing cluster, there should be a master node that distributes jobs to nodes in a cluster. But you mention you don't have a computing cluster software system installed, so this means you are scheduling jobs by rsh'ing.

ps(1) will give you most of what you want in terms of memory utilization. You also should refer to free(1).
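As a rough illustration of the "quick read" asked for above, here is a minimal sketch of a one-line, text-only node summary. It assumes a Linux node with /proc mounted; the function name node_summary is made up for this example, and how rcom passes or quotes its argument is not known from this thread.

```shell
#!/bin/bash
# node_summary: hypothetical helper, prints one line of plain text per node.
node_summary() {
    local load1 load5 load15 rest total free
    # /proc/loadavg starts with the 1-, 5- and 15-minute load averages.
    read -r load1 load5 load15 rest < /proc/loadavg
    # MemTotal and MemFree are reported in kB in /proc/meminfo.
    total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    free=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    printf '%s load=%s/%s/%s cpus=%s mem_free_MB=%d/%d\n' \
        "$(uname -n)" "$load1" "$load5" "$load15" "$(nproc)" \
        $((free / 1024)) $((total / 1024))
}
node_summary
```

A 1-minute load close to or above the cpus value suggests the node is already busy. MemFree does not count memory used for disk cache, so the figure is pessimistic.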

And nice(1) lowers the priority of the command it starts, and every child process of that command inherits the lower priority. But if the job hops to another machine it loses that property. You said you don't have a queuing system, so that means your program will rsh to another machine (bad practice IMHO without a centralized queuing system), which means you will have to nice it again when it hops to another machine...
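The inheritance part can be checked locally with a small experiment. This is only a sketch: it relies on nice printing the current niceness when called with no arguments, and it says nothing about processes started on other nodes.

```shell
#!/bin/bash
# With no arguments, nice prints the niceness of the current process.
base=$(nice)
# Start a child shell 5 steps nicer; the "nice" it runs in turn
# reports the value it inherited from that shell.
child=$(nice -n 5 sh -c 'nice')
echo "this shell: $base, command started under 'nice -n 5': $child"
```

So on the local machine, nice -n 15 submitJOB is enough to cover every command the script starts; adding nice inside the loop as well would stack the increments (up to the maximum of 19).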

If a lot of people are doing stuff like this, investing in a queuing system would be highly recommended...
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?

Are we using the same OS? The command ps(1) returns "-bash: syntax error near unexpected token `1'". I must be misunderstanding your notation.

I am not manually scheduling jobs by rsh'ing. MPI may be doing that automatically in the background. Did you mean that I am manually scheduling jobs using rsh? Did I forget to mention that I'm using MPI? Specifically, I'm using MPICH version 1.2.7.
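If the MPI launcher does start the remote ranks through rsh, those processes are not children of the local nice, so one hedged option is to lower their priority after they have started, using renice. The sketch below demonstrates this on a local stand-in process (sleep). On the cluster the same renice call would have to run on each node, for example through rcom, whose calling convention is an assumption here.

```shell
#!/bin/bash
# Stand-in for a long-running compute process.
sleep 30 &
pid=$!
# Lower its priority after the fact. An unprivileged user can raise the
# nice value but, by default, cannot lower it again.
renice -n 10 -p "$pid" > /dev/null
# ps -o ni= prints only the nice value of that process, without a header.
ni=$(ps -o ni= -p "$pid" | tr -d ' ')
echo "pid $pid now runs at niceness $ni"
# Clean up the stand-in process.
kill "$pid"
wait "$pid" 2>/dev/null
```

Depending on the renice implementation, the value after -n is treated as an absolute niceness or as an increment; for a process starting at 0 the result is the same. renice also accepts -u with a user name, which applies the change to all of that user's processes on the node.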

Cygwin can install an Xorg server and give you an xterm into the cluster, or let you do "X -query" if the cluster members are providing XDMCP logins. It and a number of other packages are not installed by default when you run the Cygwin setup.exe, though. It pains me to see people still shelling out big bucks at corporate for Exceed when they can just install a Cygwin X server for free.

Other stupid tricks that people don't know it can do are NFS server, TFTP server, and Qt4 and PyQt4. However, I hate the way that Red Hat bastardizes the packaging of the latter two. Someone needs to get them to cut the crap and finally drop Qt3 in favor of Qt4 support while they are at it, so they stop all of the crappy command renaming.

ps(1), free(1) and nice(1) are how the commands appear in the man pages; the number in parentheses is the manual section they belong to (general commands in this case). At the shell prompt you just type ps, free and nice.
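For the "biggest users and processes" question, a minimal sketch with ps could look like this. It assumes the procps version of ps found on Linux (for --sort); the function name top_procs is made up for illustration.

```shell
#!/bin/bash
# top_procs: hypothetical helper, lists the N busiest processes as plain text.
top_procs() {
    local n=${1:-5}
    # --sort=-pcpu orders by CPU share, highest first (procps ps on Linux).
    # head keeps the header line plus the first n processes.
    ps -eo user,pid,ni,pcpu,pmem,etime,comm --sort=-pcpu | head -n "$((n + 1))"
}
top_procs 5
```

The columns are owner, process ID, niceness, CPU share, memory share, elapsed time and command name. The CPU share reported by ps is averaged over the lifetime of the process; for a snapshot of current activity, top can also write plain text when run in batch mode as top -b -n 1.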

I think if I can figure out the right ps command (and understand its output), I will be on track.

Last edited by odeSolver on Sat Jul 27, 2013 4:06 pm; edited 1 time in total

Thanks. OK, I installed the Xorg server (at least I think so), but I don't know where to go from here. What is an xterm, how do you do an X -query, and how does that help me?

toralf wrote:

This

Code:

ssh -Y user@system

should give you a forwarded X11 port. Furthermore, with

Code:

-L lport:localhost:rport

you can forward any port on the remote system (rport) to a port on your local system (lport).

Thanks, but I'm not sure what I can do with an X11 port. I have installed (I think) Cygwin/X and started the terminal. I tried

Code:

ssh -Y user@system

, but it just signed me into the same system just like before. I also tried

Code:

ssh -X user@system

. How does that help me?

What does forwarding everything to my port mean/do?

ssh -X and ssh -Y point your X display back to your local desktop when you log into the cluster. Any application you run on the remote node will have its graphics windows displayed on your desktop. That should allow you to use the web tools that the others were using to keep track of the cluster.

If you want to have the full desktop login experience of the remote node, that's where XDMCP comes in. However, the remote node must have its login manager set up to enable remote XDMCP logins before this can happen. Check with your sysadmins.

Assuming that's enabled, don't start the local Cygwin X server. Instead bring up a standard Cygwin terminal shell and do

Code:

X -query system

where system is the remote system's node name or IP address. That will start a Cygwin/X server, but it will use the remote system for session management. You should then see the remote system's login window just as if you had sat down at its local console.

It seems we're on two different tracks here, and there is more about my setup that I didn't mention. I'm not actually logging directly into the cluster in question - I have to first sign into a passthrough Linux machine, then from there ssh into the cluster.

I used the X command you gave me, it opened a new window, but I never got the logon prompt in the new window. I suspect the passthrough does not have xdmcp enabled.

I also notice a new Cygwin-X program group in my start menu - but there are no programs in it. I appreciate the help you've given so far. Got any more for me?

If you just got a blank screen from the X -query, then that proxy linux box probably doesn't have xdmcp enabled. So we're back to the local X session and xterm.

When you start your local cygwin x server, you should get an xterm window popping up with a cygwin bash shell. Do an

Code:

ipconfig /all

and note the IP address for your local desktop (e.g. 192.168.1.15). An insecure but easy way to allow others to open X windows on your local desktop is to disable access control with

Code:

xhost +

Then ssh to get into the proxy system or wherever you eventually log in. On that final system you are going to do

Code:

export DISPLAY=192.168.1.15:0

You should now be able to get remote windows to pop up on your desktop from the linux applications. If the ssh -X and ssh -Y stuff weren't allowing windows to come back to your desktop in the first place, it's probably because your sysadmin never enabled X forwarding in the remote linux box's /etc/ssh/sshd_config file.
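Before changing DISPLAY by hand, it can help to see what the login already set. The sketch below is a hypothetical helper (describe_display is not a standard command) that interprets a DISPLAY value. It assumes the common sshd behaviour of setting DISPLAY to something like localhost:10.0 when X11Forwarding is enabled in sshd_config and the session was started with ssh -X or ssh -Y.

```shell
#!/bin/bash
# describe_display: hypothetical helper that interprets a DISPLAY value.
describe_display() {
    case "$1" in
        "")          echo "unset: X programs have nowhere to draw" ;;
        localhost:*) echo "ssh-forwarded: traffic is tunneled through ssh" ;;
        :*|unix:*)   echo "local: an X server on this machine" ;;
        *:*)         echo "direct TCP: unencrypted connection to ${1%%:*}" ;;
        *)           echo "unrecognized value: $1" ;;
    esac
}
# Report on the DISPLAY of the current session.
describe_display "$DISPLAY"
```

If the value is already of the localhost:N form after logging in with ssh -Y, forwarding is working and the manual export and xhost steps should not be needed on that hop.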

Thanks. I probably won't be able to try this until at least Monday. But I'm pretty sure X forwarding is enabled, because other students are doing this or something similar.

Doesn't the department provide any guidelines/documentation on how to use the clusters responsibly and how to use them in general?

I don't really see why you "Desperately needs help learning Gentoo Linux in order to use a 32-node cluster for my master's thesis in mathematics."
There shouldn't be any Gentoo-specific things that you have to learn if you are not a system admin/maintainer. It would rather be Linux things in general.

vaxbrat wrote:

An insecure but easy way to allow others to open X windows on your local desktop is to disable access control with

Code:

xhost +

Insecure indeed: it basically means anyone on your network can record your keystrokes, screen, everything.

At least use

Code:

xhost +login.node.hostname

But this wouldn't be very secure either, since
1) The connection won't be encrypted.
2) Any user on the login node could connect to your X server.

You're right, I don't need anything Gentoo-specific. But this is a Gentoo forum, and one of the most helpful forums around, so I ask here (and I didn't think anyone really read the footers). The server is brand new and there are no guidelines and no MPI queue setup. The administrator is my adviser, a math professor, who knows more Linux than I do - but not much more. We are figuring these things out together.

I've built Beowulf clusters on Gentoo, and am currently running a multi-core workstation as a mini compute cluster. Having a large server where a number of users run whatever they want from the commandline, as you describe your situation, is begging for problems that no amount of nice'ing and shell script trickery will solve.

Tell your advisor to look into the Torque package (in Portage as sys-cluster/torque). It is a resource manager based on the old PBS system. It works beautifully under Gentoo, and will allow your system admins to precisely allocate resources where needed and avoid the kinds of problems you describe.