If an application aborts unexpectedly, it is useful to monitor its execution in more detail, i.e. to check which branches of the code are actually executed, which values the variables actually hold, which parts of the memory are used, etc.

The simplest way to debug is to add print statements to the code in order to get the desired information. However, this is tedious: each time a print or write statement is added, the source needs to be recompiled and rerun. Furthermore, since the code is modified, the runtime conditions change and may influence the behavior of the application. Therefore, this way of debugging is not recommended.

Instead, the compiler itself can check for certain errors. This requires special compiler flags, which are described in more detail in the next section. It is recommended to try this first when debugging is necessary, because it is easy to use and does not require any additional software.

But not all errors can be detected this way since some occur only at run time. In this case debuggers need to be employed. Debuggers are powerful tools to analyse the executions of applications on the fly, i.e. while they are running. In general, the corresponding applications need to be recompiled once using appropriate compiler flags and are then executed under the control of the debugger.

For an overview, see the slides "Module Setup and Compiler" from the last Supercomputer Usage class.

Compiler flags

Debugging options of the compilers

In the following, useful debugging options for the XL compilers are listed and explained. Simply add them to the compile command you usually use for your application. The information is taken from the man pages of the XL compilers; for further information about compiler flags, type man bgxlf or man bgxlc.

-O0

With this option all optimizations performed by the compiler are switched off. Sometimes errors occur due to overly aggressive compiler optimizations (rounding of floating point numbers, rearrangement of loops and/or operations etc.). If you encounter problems that might be connected to such issues (for example, wrong or inaccurate numeric results), try this option and check whether the problem persists. If it does not, increase the optimization level step by step.

-qcheck[=<suboptions_list>]

For Fortran this option is identical to the -C option (see list of flags for Fortran codes below). For C/C++ codes this option enables different runtime checks, depending on the suboptions_list (colon-separated list, see below) specified, and raises a runtime exception (SIGTRAP signal) if a violation is encountered.

all

Enables all suboptions.

bounds

Performs runtime checking of addresses when subscripting within an object of known size.

divzero

Performs runtime checking of integer division. A trap will occur if an attempt is made to divide by zero.

nullptr

Performs runtime checking of addresses contained in pointer variables used to reference storage.
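As an illustration of how the suboptions are combined, the following sketch assembles a compile command for a C source with optimizations switched off and selected runtime checks enabled. The file names prog.c and prog.x and the shell variable names are placeholders chosen for this example; bgxlc stands for whatever compile command you normally use.

```shell
#!/bin/sh
# Sketch: build a C source with selected -qcheck runtime checks.
# Assumed names: prog.c (source) and prog.x (executable) are placeholders.
XLC=bgxlc
CHECKS="bounds:divzero:nullptr"        # colon-separated suboption list
CMD="$XLC -O0 -qcheck=$CHECKS -o prog.x prog.c"

echo "$CMD"                            # show the command that would be used

# Only invoke the compiler where it exists and the source file is present.
if command -v "$XLC" >/dev/null 2>&1 && [ -f prog.c ]; then
    $CMD
fi
```

If one of the enabled checks is violated while prog.x runs, the program is stopped with a SIGTRAP signal as described above.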

-qhalt=<sev>

Stops the compiler after the first phase if the severity level of errors detected equals or exceeds the specified level <sev>. The severity levels in increasing order of severity are:

i

informational messages

l

language-level messages (Fortran only)

w

warning messages

e

error messages

s

severe error messages

u

unrecoverable error messages (Fortran only)

-qinitauto[=<hex_value>]

Initializes each byte or word of storage for automatic variables to the specified hexadecimal value <hex_value>. This generates extra code and should only be used for error determination. If you specify -qinitauto without a <hex_value>, the compiler initializes the value of each byte of automatic storage to zero.

The following flags can be used only with Fortran codes:

-C

Checks each reference to an array element, array section, or character substring for correctness. This way some array-bound violations can be detected.

-qinit=f90ptr

Makes the initial association status of pointers disassociated instead of undefined. This option applies to Fortran 90 and above. The default association status of pointers is undefined.

-qsigtrap[=<trap_handler>]

Sets up the specified trap handler to catch SIGTRAP exceptions when compiling a file that contains a main program. This option enables you to install a handler for SIGTRAP signals without calling the SIGNAL subprogram in the program.

The following flags apply only to C/C++ codes:

-qformat=[<options_list>]

Warns of possible problems with string input and output format specifications. Functions diagnosed are printf, scanf, strftime, strfmon family functions and functions marked with format attributes. <options_list> is a comma-separated list of one or more of the following suboptions:

Compiler flags for using debuggers

In order to run your code under the control of a debugger, you need to recompile your application including the following compiler flags (XL compilers):

-g -qfullpath

Additionally, the flag

-qkeepparm

may be useful. When specified, it ensures that function parameters are stored on the stack even if the application is optimized. As a result, parameters remain in the expected memory location, providing access to the values of these incoming parameters to debuggers.
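Put together, a debugger-ready build of an optimized code might look like the following sketch. The source file name application.c and the optimization level -O2 are assumptions made for this example only; replace bgxlc with the compile command you usually use.

```shell
#!/bin/sh
# Sketch: rebuild an optimized application so it can run under a debugger.
# Assumed names: application.c and the -O2 level are examples only.
XLC=bgxlc
DEBUG_FLAGS="-g -qfullpath -qkeepparm"
CMD="$XLC -O2 $DEBUG_FLAGS -o application.x application.c"

echo "$CMD"                            # show the command that would be used

# Only invoke the compiler where it exists and the source file is present.
if command -v "$XLC" >/dev/null 2>&1 && [ -f application.c ]; then
    $CMD
fi
```

Keeping the debug flags in one variable makes it easy to switch them on and off without touching the rest of the build command.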

Available debuggers

Once you have compiled your application with the correct compiler flags you can run your application under the control of a debugger and monitor the behavior on the fly in detail.

Running DDT on JUGENE

Important: In order to be able to use the graphical user interface, please make sure you are logged in with ssh -X.
If you are not directly connected to JUGENE, make sure you use the -X option for all ssh connections and that your local system (laptop, PC) has a running X server!

In order to debug your program, load the UNITE and ddt modules first:

module load UNITE ddt

Then start the DDT debugger by typing:

ddt

Using Totalview interactively

Important: In order to be able to use the graphical user interface, please make sure you are logged in with ssh -X.
If you are not directly connected to JUGENE, make sure you use the -X option for all ssh connections and that your local system (laptop, PC) has a running X server!

In order to debug your program with Totalview, load the UNITE and Totalview modules first:

module load UNITE totalview

The most common way to use Totalview (like any other debugger) is interactively with a graphical user interface. To do so, start your application (after compilation with the appropriate compiler flags) with llrun using the option -tv.

For example:

llrun -np <ntasks> -mode VN -tv [ -env OMP_NUM_THREADS=<nthreads> ] application.x

This will start the program application.x with <ntasks> tasks and <nthreads> threads per task in VN mode. At most 2048 tasks can be viewed in VN mode. If your application is a pure MPI code, you can omit the -env option. After the corresponding partition is booted, Totalview will launch three windows: the root window, the startup-parameter window and the process window.

In the startup-parameter window, you have the four tabs Debugging Options, Arguments, Standard I/O and Parallel. If you wish to activate memory debugging, check the corresponding box in the tab Debugging Options. If you would like to change or add arguments that are passed to your application or to mpirun, you can do so under Arguments. Please do not change anything in Parallel. Once you have made all changes needed, click on OK.

Click on GO in the process window of Totalview. Totalview will proceed executing the mpirun command and launch your application. This may take several minutes depending on the size of the partition you have requested (i.e. the number of tasks you would like to run).
A dialog window appears after clicking on GO.

Click on YES and after a few seconds the source code of the main program of your application appears in the process window and you can start debugging your code.

For a detailed description of the usage of Totalview, please refer to the Totalview Documentation (Rogue Wave Software) for a user's guide and further information about Totalview.

Using Totalview in batch mode

Sometimes using the interactive GUI for debugging is not straightforward, for example in cases where the error occurs after several hours of execution. In this case it would be very cumbersome to wait until the code has reached the corresponding spot.
In such cases Totalview can be executed in batch mode. Prepare a job command file and launch your application with tvscript instead of mpirun.

Here [options] are tvscript options, <filename> is the name of the executable to debug (must be the first of the starter_args) and -args is followed by the arguments which are usually specified with the same option of the mpirun command. The last command must be mpirun.

The executable to debug is application.x and should run with 4 tasks in VN mode. At the beginning of the function named functionA an action point is created. When tvscript reaches that action point, it logs a backtrace and the method’s arguments.

Running this job script, two log files are created by tvscript:

mpirun-<date>_<time>.slog
mpirun-<date>_<time>.log

The slog file (Summary Log File) contains a summary of which events occurred. In the example above, this file contains four lines (one for each task):

Actionpoint function hit, performing action display_backtrace with options -show_arguments
Actionpoint function hit, performing action display_backtrace with options -show_arguments
Actionpoint function hit, performing action display_backtrace with options -show_arguments
Actionpoint function hit, performing action display_backtrace with options -show_arguments

This indicates that all tasks reached the defined action point and performed the corresponding action (show the arguments of the function functionA).
The log file contains more detailed information. In this case it lists (for each task) the names and values of the arguments of the function functionA.
For further information about tvscript and a complete list of options, please see the Totalview Documentation.

Analyzing core dumps

If an application aborts due to an error, the current state of the application's memory can be written to disk (core dump) before the execution stops. Because writing core files from thousands of nodes takes (too) much time, the generation of core dumps is suppressed by default. However, you can enable it by setting the environment variable BG_COREDUMPDISABLED to 0 in your job command file:

mpirun -env BG_COREDUMPDISABLED=0 [ <other mpirun options> ] application.x

where application.x is your application. Please use the -g option when compiling your application in case you would like to analyze core dumps.

Important: Use this option with care, because a core dump for each process is generated, i.e. running with 16000 MPI tasks means that 16000 core dump files are generated! Before using this option try to reproduce the error with the least number of tasks possible!

Core dump analysis using addr2line

Core dumps are plain text files that include traceback information in hexadecimal. To read and convert the hexadecimal addresses, the tool addr2line can be used. Assuming your application is called application.x, you can convert a hexadecimal address <hexaddr> from the core dump to readable text in the following way:

addr2line -e application.x <hexaddr>

For further information about addr2line please use

man addr2line
addr2line -h
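Converting addresses one at a time quickly becomes tedious, so the lookup can be wrapped in a small script. The following is only a sketch: it assumes that the addresses appear in the core file as 0x-prefixed hexadecimal numbers, and the function names extract_addresses and resolve_core as well as the core file name used in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch: translate all addresses found in a core file into source locations.
# Assumption: addresses are written as 0x-prefixed hexadecimal numbers;
# adapt the pattern if your core files look different.

# Print each distinct 0x... token of the given file once,
# in order of first appearance.
extract_addresses() {
    grep -oE '0x[0-9a-fA-F]+' "$1" | awk '!seen[$0]++'
}

# Usage: resolve_core <executable> <corefile>
# Prints every address followed by the file:line reported by addr2line.
resolve_core() {
    for addr in $(extract_addresses "$2"); do
        printf '%s  ' "$addr"
        addr2line -e "$1" "$addr"
    done
}
```

It could then be called as resolve_core application.x core.0. Addresses that do not belong to the code of the executable (stack addresses, for example) cannot be resolved, and addr2line reports them as unknown (??).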

Core dump analysis with DDT

To debug using core files, start DDT. Then click the Open Core Files button on the welcome screen. This opens the Open Core Files window, which allows you to select an executable and a set of core files. Click OK to open the core dump files and start debugging them. While DDT is in this mode, you cannot play, pause or step (because there is no process active). You are, however, able to evaluate expressions and browse the variables and stack frames saved in the core dump files. The End Session menu option will return DDT to its normal mode of operation.

Core dump analysis with Totalview

Start totalview. After the source code of your application appears in the process window, go to the menu File and select New Program. Select Open a core file in the dialog box which appears and choose a core file. The process window displays the core file, with the Stack Trace, Stack Frame, and Source Panes showing the state of the process when it dumped core. The title bar of the process window names the signal that caused the core dump. The right arrow in the line number area of the Source Pane indicates the value of the program counter (PC) when the process encountered the error.