Wednesday, 17 December 2008

* Enabled pilot users for ATLAS and LHCb. LHCb is currently running a lot of jobs and, although most are production, many come from their generic pilot users. ATLAS, by contrast, seems to have almost disappeared.

* Enabled the NGS VO and passed the first tests. We are currently in the conformance test week.

* Enabled one shared queue and completely phased out the VO queues. This required a transition period to give some VOs time to clear their jobs from the old queues and/or reconfigure their tools. It has greatly simplified the maintenance.

* Installed a top-level BDII and reconfigured the nodes to query the local top-level BDII instead of the RAL one. This was actually quite easy and we should have done it earlier.
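
In YAIM terms this is mostly a matter of pointing the information-system variables at the local machine rather than RAL; something like the following (the hostname is made up, 2170 is the standard top-level BDII port):

```
# site-info.def fragment (hypothetical hostname)
BDII_HOST=bdii.example.ac.uk
# clients query the local top-level BDII on port 2170
LCG_GFAL_INFOSYS=bdii.example.ac.uk:2170
```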

* Cleaned up old parts of cfengine that were causing the servers to be overloaded, to not serve the nodes correctly, and to fire off thousands of emails a day. Mostly this was due to an overlap in the way cfexecd was run, both as a cron job and as a daemon. However we also increased the TimeOut and SplayTime values and explicitly introduced the schedule parameter in cfagent.conf. Since then cfengine hasn't had any more problems.
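
The relevant cfagent.conf settings look roughly like this in cfengine 2 syntax (the values below are only illustrative, not necessarily the ones we chose):

```
# cfagent.conf control section (illustrative values)
control:
   # run the agent on the daemon's own schedule instead of an external cron job
   schedule = ( Min00_05 )
   # spread agent runs out so all nodes don't hit the server at once
   SplayTime = ( 10 )
```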

* Increased usage of YAIM local/post functions to apply local overrides or minor corrections to yaim default. Compared to inserting the changes in cfengine this method has the benefit of being integrated and predictable. When we run yaim the changes are applied immediately and don't get overridden.
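
If I remember the convention correctly, a post function is just a shell function dropped into /opt/glite/yaim/functions/local/ that YAIM runs right after the standard one. The function name, path and tweak below are hypothetical, purely to show the shape:

```
# /opt/glite/yaim/functions/local/config_gip_post (hypothetical example)
function config_gip_post () {
  # small local override applied after the standard config_gip has run;
  # the attribute and ldif path are illustrative only
  echo "GlueSiteOtherInfo: GRID=EXAMPLE" >> \
    ${INSTALL_ROOT}/glite/etc/gip/ldif/static-file-Site.ldif
}
```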

* New storage: our room is full, and when the clusters are loaded we hit the power/cooling limit and risk drawing power from other rooms. Because of this, the CC people don't want us to switch on new equipment without switching off some old equipment to keep the power consumption balanced. So eventually we bought 96 TB of raw space to get us going. The kit arrived yesterday; it needs to be installed in the rack we have, and power measurements need to be taken to avoid switching off more nodes than necessary. Luckily it will not be many anyway (taking the nominal value on the back of the new machines it would be 8 nodes, but with better power measurements it could be as few as 4) because the new machines consume much less than the DELL nodes, which are now 4 years old. However, buying new CPUs/storage cannot be done without switching off a significant fraction of the current CPUs before switching on the new kit, and it requires working in tight cooperation with the CCS people; this has now been agreed after a meeting I had with them and their management last week.

Tuesday, 16 December 2008

I've started to phase out VO queues and to create shared queues. The plan is eventually to have 4 queues called, with a leap of imagination, short, medium, long and test, with the following characteristics:

test: 3h/4h; all local VOs
short: 6h/12h; ops, lhcbsgm, atlasgm
medium: 12h/24h; all VOs and roles but those in the short queue and production
long: 24h/48h; all VOs and roles but those that can access the short queue

Installing the queues, adding the group ACLs and publishing them is not difficult. YAIM (glite-yaim-core-4.0.4-1 and glite-yaim-lcg-ce-4.0.4-2 or higher) can do it for you. Otherwise it can be done by hand, which is still easy but harder to maintain (the risk of the changes being overridden is always high, and the files need to be maintained in cfengine, CVS or elsewhere).
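
Done by hand, the Torque side is a handful of qmgr commands per queue; roughly like this (the limits and group names are just an illustration of the scheme above):

```
# create and configure the short queue by hand (illustrative values)
qmgr -c "create queue short queue_type=execution"
qmgr -c "set queue short resources_max.cput = 06:00:00"
qmgr -c "set queue short resources_max.walltime = 12:00:00"
qmgr -c "set queue short acl_group_enable = true"
qmgr -c "set queue short acl_groups = ops"
qmgr -c "set queue short acl_groups += lhcbsgm"
qmgr -c "set queue short enabled = true"
qmgr -c "set queue short started = true"
```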

The problem for me is that this scheme works only if users select the correct ACLs and a queue of suitable length for their jobs in their JDL. If they don't, the queue chosen by the WMS is effectively random, with a high probability of jobs failing because they end up in a queue that is too short or in one that doesn't have the right ACLs. So I'm not sure it's really a good idea, even though it is much easier to maintain and allows slightly more sophisticated setups.

Anyway, if you do it with YAIM, all you have to do is add the queue to

QUEUES="my-new-queue other-queues"

and add the right VO/FQAN to the new queue's _GROUP_ENABLE variable (remember to convert . and - into _ when forming the variable name).
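
Put together in site-info.def, it looks something like this (the queue name, VO and FQAN below are made up for illustration):

```
# site-info.def fragment (hypothetical names)
QUEUES="short medium long test my-new-queue"
# variable name = queue name uppercased, with . and - turned into _
MY_NEW_QUEUE_GROUP_ENABLE="ops /VO=lhcb/GROUP=/lhcb/ROLE=lcgadmin"
```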

Monday, 1 December 2008

ATLAS analysis jobs running at a site using DPM use POSIX access through the RFIO interface. ROOT (since v5.16, IIRC) has support for RFIO access and uses the buffered access mode READBUF, which allocates a static buffer on the client for files read via RFIO. By default this buffer is 128kB.

Initial tests with this default buffer size showed low CPU efficiency and high bandwidth usage, far more than the size of the files being accessed. The buffer size can be altered by creating a file on the client called /etc/shift.conf containing

RFIO IOBUFSIZE XXX

where XXX is the size in bytes. Altering this setting gave the results described below.

This was on a test data set with file sizes of ~1.5GB and using athena 14.2.10.
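
For example, to request a 128 MB buffer (the value is in bytes: 128 * 1024 * 1024 = 134217728):

```
# /etc/shift.conf on the client/worker node
RFIO IOBUFSIZE 134217728
```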

Using buffer sizes of 64MB and above gives gains in efficiency and required bandwidth. A 128MB buffer is a significant chunk of a worker node's RAM, but as the files are not being cached in the Linux file cache, the RAM usage is likely similar to accessing the file from local disk, and the gains are large.

For comparison, the same test was run from a copy of the files on local disk. This gave a CPU efficiency of ~50%, but the event rate was ~8 times slower than when using RFIO.

My conclusions are that RFIO buffering is significantly more efficient than standard Linux file caching. The default buffer size is insufficient, and increasing it by small amounts greatly reduces efficiency. Increasing the buffer to 64-128MB gives big gains without impacting available RAM too much.

My guess is that only a big buffer gives gains because of the random access pattern of the analysis job on the file. Reading in a small chunk, e.g. 1MB, may buffer a whole event, but the next event is unlikely to be in that buffered 1MB, so another 1MB has to be read in for the next event. Similarly for 10MB: the amount read in each time is 10x as much, but with a less than 10x increase in the probability of the event being in the buffer. When the buffer reaches 64MB, the probability of an event being in the buffered area is high enough to offset the extra data being read in.

Another possibility is that the buffering only buffers the first xMB of the file, hence a bigger buffer means more of the file is in RAM and there's a higher probability of the event being in the buffer. Neither of these hypotheses has been investigated further yet.

Large block reads are also more efficient when reading in the data than lots of small random reads. The efficiency effectively becomes 100% if the buffer size is >= the dataset file size; the first reads pull in the whole file, and all reads from then on are from local RAM.

This makes no difference to the impact on the head node for e.g. SURL/TURL requests; it only affects the efficiency of the analysis job accessing the data from the pool nodes and the required bandwidth (our local tests simply used the rfio:///dpm/... path directly). If there are enough jobs there will still be bottlenecks on the network, either at the switch or at the pool node. We have given all our pool nodes at least 3Gb/s connectivity to the LAN backbone.

The buffer size setting will give different efficiency gains for different file sizes (i.e. the smaller the file size, the better the efficiency); e.g. the first ATLAS analysis test had smaller file sizes than our tests and showed much higher efficiencies. The impact of the IOBUFSIZE setting on other VOs' analysis jobs that use RFIO hasn't been tested.