Resolved: Reports of Hanging Jobs on Hopper

March 1, 2012
by Katie Antypas

Issue:

A number of users have reported that large jobs intermittently hang on Hopper. A job appears to start, then hangs shortly afterward without producing any output. The job ends only when its wall clock limit is reached.

Status:

Cray has identified a few bad nodes in the system. These nodes have been rebooted, and no new hung jobs have been reported since Mar 12. A new version, xt-mpich2/5.4.4, has been installed and set as the default, and a system-wide MPI environment setting is now in place so that a job is aborted if it is detected to be hung. A kernel patch was installed on Apr 3 to fully address the issue.