We would like to inform you that the user service on the IBM cluster (walton) will be suspended on Thursday 29th September from 9:00 until 13:00.

This downtime is required following the failure of the GPFS tests carried out during the planned maintenance session of 26th September. IBM engineers will be carrying out a number of reconfigurations and tests.

These changes will improve the overall reliability and performance of the cluster.

Note that you will be unable to log on to walton during this period. Any jobs still running on Thursday at 9:00 will have to be killed. Normal service may resume earlier than 13:00.

The shared memory system (hamilton) will *not* be affected by this downtime, so a normal service will be provided on this system. See http://www.ichec.ie/status

Apologies for the inconvenience.

------------------------------------------------------------------------
2 - Update on the scheduling system (walton)
------------------------------------------------------------------------

Users who have submitted jobs since last weekend may have noticed that their jobs failed to start, along with related problems such as being unable to delete jobs they had themselves submitted to the queueing system. Our system administrators have since identified the source of the problem and are taking steps to address it.

This problem has been traced to an automated update which overwrote the maui user and group, and possibly to the installation of a new service pack (SP1) on our scheduler node. The knock-on effect was that communication between the torque batch system and the maui scheduler failed, resulting in no jobs running and jobs being left in an invalid state. We had to purge the torque queue by deleting these jobs to clear them from the system. When we restarted torque and maui after the purge, newly submitted jobs would not start, citing a lack of resources despite the availability of a pool of 900+ free CPUs.

We have since managed to fix these problems, although the scheduler crashed once more during the day. The current status would best be described as "functioning but in need of constant monitoring". We are currently considering re-installing, and possibly completely rebuilding, torque and maui on the scheduler node to see if we can remove this unreliability, which may have been introduced by the application of SP1.

PIs have recently been contacted and asked to indicate which funding body primarily funded the work described in their application. We would like to thank all those who have already supplied this information, and ask other PIs to contact us with this information as soon as possible.

As a reminder, the original request was as follows:

Dear ICHEC PIs,

We have recently been asked by our funding agency to collate statistics on the primary sources of funding supporting researchers applying for HPC resources at ICHEC.

We would appreciate it if you could let us know which of the following funding bodies is primarily funding the work described in your project proposal: