
If you must roll a machine back to an earlier RHEL point release (e.g. the latest X.Y introduces a bug and you want to revert to X.(Y-1)), there is no easy way to drop an update.

In Fedora (and copied to RHEL, though somewhat useless there by default), you can use something to the effect of:

yum --releasever=X-1 distro-sync

Unfortunately, this only works for major version shifts.

If you want to do this in RHEL to hit a specific point release, you must play with some repos.

These instructions assume you have a Satellite server and can create kickstart profiles on it.

Edit a kickstart profile and set it back to the release you want. Also select any extra repositories you have access to (e.g. supplementary, optional), then look at the kickstart script it generates. Find the URLs for the install tree and for any repos you’ve added, then create a .repo file in /etc/yum.repos.d/ for those repositories. Name each repo in a globbable format, e.g. rhel-downgrade-X-Y, rhel-downgrade-X-Y-optional, …
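As a rough sketch of what one of those .repo files could contain, the helper below prints a stanza for a single repository. The function name and the example URL are hypothetical; the real base URL has to come from your generated kickstart. Setting enabled=0 is a design choice, not a requirement: it keeps the repo dormant unless it is explicitly enabled on the yum command line. gpgcheck=1 assumes the Red Hat signing key is already imported on the machine.

```shell
# Print a yum .repo stanza for one downgrade repository.
# $1 = repo id (e.g. rhel-downgrade-X-Y)
# $2 = base URL copied from the generated kickstart
make_downgrade_repo() {
    cat <<EOF
[$1]
name=$1
baseurl=$2
enabled=0
gpgcheck=1
EOF
}

# Example with a placeholder URL (hypothetical; substitute your own).
# As root you would redirect this into /etc/yum.repos.d/rhel-downgrade-X-Y.repo
make_downgrade_repo rhel-downgrade-X-Y "http://satellite.example.com/path/from/kickstart"
```

Run it once per repository, keeping the common rhel-downgrade-X-Y prefix so that a single glob matches all of them.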

Now the moment of truth.

yum --disablerepo \* --enablerepo rhel-downgrade-X-Y\* distro-sync

You can also specify specific yum groups and individual rpms on the command line to just downgrade a particular subsystem, as long as it doesn’t have firm dependencies into the rest of the OS.

Although our data center requires remote management capability for all servers, occasionally a machine arrives that lacks it for whatever reason. Rather than driving down to the data center every time we (and by we, I mean the researcher and I) want to test a new kernel, we decided that we needed a remote management solution.

Fortunately, we happened to have some old Linux Networx ICE BOX node management devices (power control, temperature monitoring, and serial console over LAN) languishing in storage after the cluster they managed was retired. Unfortunately, Linux Networx appears to be out of business, there is no Google-able documentation for the ICE BOX online, and the serial console adapters were all thrown out when the cluster was retired.

However, thanks to an RS-232 break out box picked up from EPO, and a few hours of playing with an RJ-45 to DB9 adapter, I have a working pin-out for an ICEBOX serial terminal, and will record it here for future poor schmucks who find themselves in my position.

I only worked out the TD, RD, and SG lines; if you need flow control (and know how to make the ICEBOX use it!) you’re on your own.

From the RJ-45 end, I used wires 2, 3, and 5 (following the pin numbering of, and using the same RJ-45 to DB9 adapter as, Michael Ossmann’s website, http://www.ossmann.com/5-in-1.html).

RJ-45        Signal   DB-9   Signal
2 (orange)   RD       3      TD
3 (black)    TD       2      RD
5 (green)    SG       5      SG

Update: A very kind soul in the Los Alamos HPC group has posted scans of ICEBOX documentation: http://institute.lanl.gov/data/linuxnetworxinfo/ Thanks to Christian Ritter for the link.

Advanced Optimization Topics for the Blue Biou Power7 Cluster

Chandler Wilkerson <chwilk@rice.edu>
Academic and Research Computing
Information Technology
Rice University

Intended Audience and Abstract

This technical report is intended as a users' guide for the Rice Blue Biou Power7 Cluster. Though most users could benefit from at least some of the content of this report, it is expected that the reader is familiar with moderate to advanced OS and programming topics, including multi-threading, SMP, memory allocation, and debugging.

Because Blue Biou combines what is, to some, an unfamiliar architecture with an InfiniBand-connected cluster of large SMP nodes, porting applications to it can be a challenge. Even if a code compiles and runs, it may not be able to take advantage of the number of processors, threads, and large memory space on the Biou nodes. Unless the code is very well understood, it may be necessary to experiment with different settings to find one that takes advantage of the Power7's multithreading. With small changes to job submission scripts and, in some cases, a simple recompile with new library flags, the performance of the average code can be increased significantly.

SMT Intro

The first target for optimization is processor/thread affinity. Incorrect usage of the Power7 multithreading can have the largest detrimental effect on application performance, so the largest gains may be had here.

Each compute node in the Blue Biou cluster consists of four Power7 chips. When we refer to sockets, we are talking about an entire Power7 chip. Each socket contains eight cores, and each core supports up to four simultaneous multithreading (SMT) threads. The threads show up in the operating system as individual processors. Each socket connects to its own region of system RAM; on a node with 256GB, each region is 64GB. Cores on a socket share an L3 cache and a memory controller that accesses the socket's RAM region. Threads on a core share its execution units, registers, cache, and VSX unit (double precision vector processor).
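To make the layout concrete, the sketch below works out how many logical processors a node exposes and where a given processor number sits. The four-threads-per-core numbering follows the description in this guide; that cores are numbered contiguously within each socket is an assumption, and the function name is illustrative.

```shell
# Layout from the text: 4 sockets x 8 cores x 4 SMT threads per node.
SOCKETS=4
CORES_PER_SOCKET=8
THREADS_PER_CORE=4

# Map a logical CPU number to "socket core thread".
# Assumes contiguous numbering: threads within a core, cores within a socket.
cpu_location() {
    cpu=$1
    thread=$(( cpu % THREADS_PER_CORE ))
    core=$(( cpu / THREADS_PER_CORE ))
    socket=$(( core / CORES_PER_SOCKET ))
    echo "$socket $core $thread"
}

echo "logical CPUs per node: $(( SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE ))"
cpu_location 37   # prints "1 9 1": socket 1, core 9, thread 1
```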

The SMT mode of the cores can be changed for applications that gain no advantage from having multiple threads per core. Highly CPU-bound and highly memory-bound applications are examples of codes that may not benefit from SMT. It is worth testing the different modes to determine which best suits your particular application. The mode with all four threads active is SMT=4 or SMT=on. An intermediate mode, SMT=2, enables only two threads on each core. SMT=off or SMT=1 corresponds to one active thread per core. The per-thread performance benefit of changing SMT modes can be pronounced in an idealized application. A single thread in SMT=4 mode typically performs at 45% of the speed of the single thread per core in SMT=1 mode, while a single thread in SMT=2 mode can perform at 75% of the speed of an SMT=1 thread.
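Multiplying those per-thread figures by the number of threads per core gives a rough idea of why SMT can still win on aggregate throughput. The sketch below does only that arithmetic. It assumes every thread is kept busy and that the percentages quoted above apply uniformly, which real codes will not match exactly. The function names are illustrative.

```shell
# Per-thread speed relative to an SMT=1 thread, in percent (figures from the text).
per_thread_pct() {
    case "$1" in
        1) echo 100 ;;
        2) echo 75 ;;
        4) echo 45 ;;
        *) return 1 ;;
    esac
}

# Idealized aggregate throughput of one fully loaded core, in percent of SMT=1.
core_throughput_pct() {
    pct=$(per_thread_pct "$1") || return 1
    echo $(( $1 * pct ))
}

for mode in 1 2 4; do
    echo "SMT=$mode: $(core_throughput_pct "$mode")% of one SMT=1 core"
done
```

Under these assumptions a code that scales to all threads gets the most out of SMT=4, while a code limited to 32 tasks per node runs each task fastest in SMT=1.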

Controlling Processor Affinity on Blue Biou

The take-home point of the above SMT discussion is that it matters which particular threads your application uses. By default, the Linux kernel assigns tasks to idle "processors" seemingly at random. Because it is NUMA-aware, it tries to keep running processes in the same memory region to avoid inter-processor communication where possible, but within that constraint the task-to-thread assignment is unpredictable. For example, an OpenMPI run with np=32 will not place one task per core; in many cases it will use multiple threads on some cores and skip other cores entirely.

We recommend one of three methods for creating a task-to-CPU mapping, typically referred to as CPU affinity. For OpenMPI programs, a rank file sets an explicit task-to-thread affinity; it is the most specific of the three methods and also the most complex to implement. A slightly more general method uses task sets, pools of threads that restrict the processes of a program to a subset of threads according to a bitmask you supply. For programs that scale well at 32 and 64 threads (or perhaps slightly fewer), setting the per-node SMT mode is the easiest of the three to implement.

Setting SMT Modes on Blue Biou

On Blue Biou, we use prologue and epilogue scripts within the Torque scheduler to let users control SMT settings on a per-job basis.

Please note that any job that changes the SMT settings should be run with the SINGLEJOB node access policy within the job submission script:

#PBS -W x=NACCESSPOLICY:SINGLEJOB

To set the node to SMT=2 mode, use the following directive in the job submission script:

#PBS -T set_ppc64_smt2

To set the node to SMT=1 mode, use the following directive in the job submission script:

#PBS -T set_ppc64_smt1

Each of these prologue scripts has an associated epilogue script that will return the node's setting to SMT=4 when your job finishes, even in the case where it dies abnormally.
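Putting the directives together, a job script might start like the sketch below. Only the two #PBS lines come from this guide; the CPU-count check and the commented application name are illustrative assumptions, and any resource requests your job needs are omitted.

```shell
#!/bin/bash
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -T set_ppc64_smt2

# Sanity check: with SMT=2 on a 32-core node one would expect
# 64 online logical CPUs once the prologue has run.
online=$(getconf _NPROCESSORS_ONLN)
echo "online logical CPUs: $online"

# Launch the real workload here, e.g. (hypothetical binary name):
# ./my_threaded_app
```

Logging the online CPU count at the top of the job output makes it easy to confirm afterwards which SMT mode a given run actually used.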

OpenMPI Rank Files

In lieu of SMT settings, there is a method for setting CPU affinity on an MPI task-by-task basis using a rankfile. Essentially, you run in the default SMT=4 mode and assign consecutive MPI tasks to the first thread on every core (up to 32 cores, counting by four). In other words, the MPI task of rank i is assigned to CPU slot 4i. To emulate SMT=2 behavior, you assign tasks to the first two threads on each core, e.g. 0, 1, 4, 5, …, so rank i goes to slot 4*floor(i/2) + (i mod 2), or equivalently 2i - (i mod 2). OpenMPI rank files are discussed in the OpenMPI FAQ here: http://www.open-mpi.de/faq/?category=tuning#using-paffinity
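The slot arithmetic can be sketched as a small generator that prints rankfile lines in the "rank N=host slot=M" form described in the OpenMPI FAQ. The function name and the host name node01 are hypothetical, and how a bare slot number is interpreted can depend on the OpenMPI version, so check the output against the FAQ for your installation.

```shell
# Print an OpenMPI rankfile for one node.
# $1 = host name, $2 = number of ranks, $3 = threads to use per core (1 or 2)
make_rankfile() {
    host=$1
    nranks=$2
    per_core=$3
    i=0
    while [ "$i" -lt "$nranks" ]; do
        # core index times 4, plus the thread index within that core
        slot=$(( (i / per_core) * 4 + (i % per_core) ))
        echo "rank $i=$host slot=$slot"
        i=$(( i + 1 ))
    done
}

make_rankfile node01 4 1   # slots 0, 4, 8, 12 (one thread per core)
make_rankfile node01 4 2   # slots 0, 1, 4, 5  (two threads per core)
```

Redirect the output to a file and pass it to mpirun with its rankfile option (-rf or --rankfile in the OpenMPI releases that support rank files).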

Task Sets

The taskset utility is a basic Linux utility that allows a user to retrieve or set a process's CPU affinity. It can be used to change the affinity of a running process, but it is more useful to us in its default mode, where you give it a command line and arguments to run with a certain affinity mask. The affinity mask is a hexadecimal bitmask representing which processors (with processor 0 at the least significant bit) should be enabled for the process. On Blue Biou, each core is conveniently represented by an individual hexadecimal digit, and the entire mask should be 32 digits long. Useful digit values, corresponding to our SMT modes, are 1 (SMT=1), 3 (SMT=2, threads 0 and 1), and f (SMT=4). An example of using a taskset for a multi-threaded program follows: