Everyone,
We have released a new official version of TORQUE: 2.3.3. This release was made so soon after 2.3.2
(which was released only two weeks ago) because we discovered a serious bug that could prevent pbs_mom
daemons from reconnecting properly to pbs_server after a network disruption. For some users,
this left large numbers of compute nodes in the "down" state, and the only way to recover
them was to manually restart the pbs_mom daemon.
If you are concerned about this issue, or run into it with TORQUE 2.3.2, but cannot yet upgrade to
2.3.3, restarting pbs_mom on the affected compute node should bring it back to full health.
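For sites stuck on 2.3.2 for a while, the workaround could be scripted roughly as follows. This is only a sketch, not something shipped with TORQUE: it assumes that `pbsnodes -l` prints one "name state" pair per line for nodes that are down, offline, or unknown, and that pbs_mom can be restarted over ssh with an init script. The function names and the `/etc/init.d/pbs_mom restart` path are hypothetical placeholders, so adjust them to your installation.

```python
import subprocess


def parse_down_nodes(pbsnodes_output):
    """Return the names of nodes whose state list includes 'down'.

    Assumes each non-empty line looks like 'node01   down' or
    'node02   down,offline' (the assumed layout of `pbsnodes -l`).
    """
    down = []
    for line in pbsnodes_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, states = fields[0], fields[1].split(",")
        if "down" in states:
            down.append(name)
    return down


def list_down_nodes():
    """Run `pbsnodes -l` on the server host and parse its output."""
    result = subprocess.run(
        ["pbsnodes", "-l"], capture_output=True, text=True, check=False
    )
    return parse_down_nodes(result.stdout)


def restart_commands(nodes, restart_cmd="/etc/init.d/pbs_mom restart"):
    """Build (but do not run) one ssh command per node.

    restart_cmd is a site-specific assumption, not a TORQUE default.
    """
    return [["ssh", node, restart_cmd] for node in nodes]
```

Calling `restart_commands(list_down_nodes())` returns the command lists without executing anything, so they can be reviewed first. Nodes that are down for unrelated reasons (hardware failure, for example) will also appear, so a blind restart loop is not recommended.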
A list of changes in 2.3.3 follows:
c - crash   b - bug fix   e - enhancement   f - new feature
b - fixed bug where pbs_mom would sometimes not connect properly with pbs_server after network
    failures
b - changed so run_pelog opens correct stdout/stderr when join is used
b - corrected pbs_server man page for SIGUSR1 and SIGUSR2
f - added new pbs_track command, which may be used to launch an external process; a pbs_mom will
    then track the resource usage of that process and attach it to a specified job (experimental)
    (special thanks to David Singleton and David Houlder from APAC)
e - added alternate method for sending cluster addresses to mom (compile with -DALT_CLSTR_ADDR)
Thanks again to all who have been helping out with TORQUE development, submitting bugs, answering
questions, and giving feedback about 2.3.2!
Regards,
Josh Butikofer
Cluster Resources, Inc.