Thursday, 15 October 2009

I installed the squid cache for atlas on an SL5 32bit machine. The project provides no 32bit rpms. There is a default OS squid rpm, but it is apparently buggy and the request is to install a 2.7-STABLE7 version. So I got the source rpm from here

Wednesday, 19 August 2009

I fixed the site bdii problem, i.e. the site static information 'disappeared'. It didn't actually disappear; it was just declared under mds-vo-name=resource instead of mds-vo-name=UKI-NORTHGRID-MAN-HEP, and therefore gstat couldn't find it. This was due to a conflict between the rgma bdii and the site bdii. The rgma bdii (which didn't exist in very old versions) needs to be declared in BDII_REGIONS in YAIM. I knew this, but had completely forgotten that I had already fixed it when I reinstalled the machine a few months ago, so I spent a delightful afternoon parsing ldif files and ldap output, hacked the ldif, sort of fixed it and then asked for a proper solution. So... here we go, I'm writing it down this time so I can google for myself. On the positive side, I have now upgraded to the latest version of both the site and top bdii, and of the resource bdii on the CEs. So we now have shiny new attributes like Spec2006 & Co.

I also upgraded the CEs to try to fix the random instability problem that afflicts us. However, I upgraded online without reinstalling everything, and it makes me a bit nervous to think that some files that needed changing might not have been edited because they already existed. So I will completely reinstall the CEs, starting with ce01 today.

Wednesday, 29 April 2009

As mentioned previously ( http://northgrid-tech.blogspot.com/2009/03/replaced-nfs-servers.html ) we have recently upgraded our NFS servers and they now run on SL5. Shortly after going into production all LHCb jobs stalled at Manchester and we were blacklisted by the VO.

We were advised that it may be a lockd error, and asked to use the following python code to diagnose this:

The code did not give any errors and we therefore discounted this as the problem. Wind the clock on a fortnight (including a week's holiday over Easter) and we still had not found the problem, so I tried the above code again, and bingo, lockd was the problem. A quick search of the SL mailing list pointed me to this kernel bug: https://bugzilla.redhat.com/show_bug.cgi?id=459083
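For anyone wanting to run a similar check, a lock test of this kind boils down to taking a POSIX lock on a file that lives on the NFS mount. What follows is a minimal sketch of the idea and not the code we were sent; the function name `can_lock` and the path in the usage note are made up for illustration.

```python
import fcntl


def can_lock(path):
    """Return True if an exclusive POSIX lock can be taken on `path`.

    Hypothetical helper: point it at a file on the NFS mount under test.
    """
    # Append mode gives a writable descriptor (needed for LOCK_EX)
    # and creates the file if it does not exist yet.
    with open(path, "a") as handle:
        try:
            # LOCK_NB makes a lock held by someone else fail immediately
            # instead of blocking.
            fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as err:
            print("lock failed on %s: %s" % (path, err))
            return False
        # Release the lock before the file is closed.
        fcntl.lockf(handle, fcntl.LOCK_UN)
        return True
```

Calling something like `can_lock("/path/on/nfs/lock_test")` from a worker node exercises the lock path to the server. As we found, a single clean run does not prove much, so it is worth repeating the test over time.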

Friday, 3 April 2009

We never really tested it though until now. We have found a few problems with YAIM:

YAIM creates an mpirun script that assumes ./ is in the path, so the job was landing on the WN but mpirun couldn't find the user script/executable. I corrected it by prepending `pwd`/ to the script arguments at the end of the script, so it runs `pwd`/$@ instead of $@. I added this using the YAIM post functionality.

The if/else statement used to build MPIEXEC_PATH is written in a contorted way and needs to be corrected. For example:

1) MPI_MPIEXEC_PATH is used in the if, but YAIM doesn't write it to any system file that sets environment variables, such as grid-env.sh, where the other MPI_* variables are set.

2) In the else statement there is a hardcoded path, which is actually obtained by splitting the mpiexec executable that MPI_MPICH_MPIEXEC points to from its directory.

3) YAIM doesn't rewrite mpirun once it's written, so the hardcoded path can't be changed by reconfiguring the node without manually deleting mpirun first. This makes it difficult to update or correct mistakes.

4) The existence of MPIEXEC_PATH is not checked, and it should be.
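To make the points above concrete, here is a rough sketch of the logic we would like the wrapper to follow. The real mpirun is a shell script generated by YAIM; this is only a Python illustration, and the function names `resolve_mpiexec_dir` and `absolute_command` are made up for the example.

```python
import os


def resolve_mpiexec_dir(env):
    """Pick the directory holding mpiexec and check that it exists.

    `env` is a dict of environment variables (e.g. os.environ).
    """
    explicit = env.get("MPI_MPIEXEC_PATH")
    if explicit:
        candidate = explicit
    else:
        mpiexec = env.get("MPI_MPICH_MPIEXEC")
        if not mpiexec:
            raise RuntimeError(
                "neither MPI_MPIEXEC_PATH nor MPI_MPICH_MPIEXEC is set")
        # Derive the directory at run time instead of hardcoding it.
        candidate = os.path.dirname(mpiexec)
    # Point 4: fail early if the directory is not there.
    if not os.path.isdir(candidate):
        raise RuntimeError("mpiexec directory does not exist: %s" % candidate)
    return candidate


def absolute_command(args, cwd):
    """Rough equivalent of `pwd`/$@: prefix the first argument with cwd."""
    if not args:
        return []
    # Note: os.path.join leaves an already absolute first argument as it is.
    return [os.path.join(cwd, args[0])] + list(args[1:])
```

For instance, `absolute_command(["job.sh", "-n", "4"], "/home/user")` gives `["/home/user/job.sh", "-n", "4"]`, so the wrapper no longer depends on ./ being in the path.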

Anyway, we eventually managed to run mpi jobs, and we reported what we had done to the new TMB MPI working group because another site was experiencing the same problems. Hopefully they will correct these problems. Special thanks go to Chris Glasman, who hunted down the initial problem with the path and patiently tested the changes we applied.

Wednesday, 25 March 2009

We have finally installed all the units. They provide ~84TB of usable space: 42TB are dedicated to atlas space tokens, and the other 42TB are shared for now but will be moved into atlas space tokens when we see more usage.

We also have finally enabled all the space tokens requested by atlas. They are waiting to be inserted in Tier Of Atlas but below I report what we publish in the BDII.

Tuesday, 24 March 2009

The NFS servers in Manchester have been replaced with two more powerful machines and two 1TB raided SATA disks. This should hopefully put a stop to the space problems we have suffered in the past few months with both atlas and lhcb, and should also allow us to keep a few more releases than before.

We also now have nice nagios graphs to monitor the space, as well as cfengine alerts.

Thursday, 5 March 2009

After some sweet-talking we managed to get two extra air-con units installed in our old machine room. This room houses our 2005 cluster and our more recent CPU and storage purchased last year. The extra cooling was noticeable and allowed us to switch on a couple of racks which were otherwise offline.

In other news, the new data centre is coming along nicely and will be ready for handover in 3 to 4 months from now. If you're ever racing past Lancaster on the M6 you'll get a good view of the Borg mothership on the hill; the sleek black cladding is going up now...

Friday, 13 February 2009

We've had an interesting time this week in Lancaster, a tale of Gremlins, Greedy Daemons and Magical Faeries who come in the night and fix your DPM problems.

On Tuesday evening, when we had all gone home for the night, the DPM srmv1 daemon (and to a lesser extent the srmv2.2 and dpm daemons) started gobbling up system resources, sending our headnode into a swapping frenzy. There are known memory leak problems in the DPM code, and we've been victims of them before, but in those instances we've always been saved by a swift restart of the affected services and the worst that happened was a sluggish DPM. This time the DPM services completely froze up, and around 7 pm we started failing tests.

So coming into this disaster on Wednesday morning we leaped into action. Restarting the services fixed the load on the headnode, but the DPM still wouldn't work. Checking the logs showed that all requests were being queued, apparently forever. The trail led to some error messages in the mysqld.log;

The oracle Google suggested that these kinds of errors were indicative of a mysql server in a bad state after suddenly losing connection to a client but not accounting for this. Various restarts, reboots and threats were used, but nothing would get the dpm working and we had to go into downtime.

Rather than dive blindly into the bowels of the DPM backend mysql, we got in contact with the DPM developers on the DPM support list. They were really quick to respond, and after receiving 40MB of (zipped!) log files from us, set to work developing a strategy to fix us. It appears that our mysql had grown much larger than it should have, "bloating" with historical data, which contributed to it getting into a bad state and made the task of repairing the database harder, partly because we simply couldn't restore from backups as these too would be "bloated".

After a while of bashing our heads, scouring logs and waiting for news from the DPM chaps, we decided to make use of the downtime and upgrade the RAM on our headnode to 4 GB (from 2), a task we had been saving for the scheduled downtime when we finally upgrade to the Holy Grail that is DPM 1.7.X. So we slapped in the RAM, brought the machine up cleanly, and left it.

A bit over an hour after it came back up following the upgrade, the headnode started working again. As if by magic. Nothing notable in the logs; it just started working again. The theory is that the added RAM allowed the mysql to chug through a backlog of requests and start working again. But that's just speculation. The dpm chaps are still puzzling over what happened, and our databases are still bloated, but the crisis is over (for now).

So there are 2 morals to this tale:

1) I wouldn't advise running a busy DPM headnode with less than 4GB of RAM; it leads to unpredictable behaviour.

2) If you get stuck in an Unscheduled Downtime you might as well make use of it to do any pending work; you never know when something magical might happen!

Monday, 9 February 2009

Spent last week tracking down a problem where jobs were finishing in the batch system but the jobmanager wasn't recognizing this. This meant that jobs never 'completed', which had two major impacts: 1. Steve's test jobs all failed through timeouts and 2. Atlas production stopped because it looked like the pilots never completed, so no further pilots were sent.

Some serious detective work was undertaken by Maarten and Andrey, and it turned out the pbscache wasn't being updated due to a stale lock file in the ~/.lcgjm/ directory. The lock files can be found with this on the CE:

We had 6 users affected (alas, our important ones!), all with lock files dated Dec 22. Apparently the lcgpbs script Helper.pm would produce these whenever hostname returned 'localhost'. Yes, on December 22 we had maintenance work with DHCP unavailable, and for some brief period the CE hostname was 'localhost'. Note this is lcg-CE under glite-3.1. Happy days are here again!
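For the record, hunting for stale lock files can also be scripted. The following is a sketch and not the command we ran on the CE: the `*lock*` filename pattern, the `/home` root in the usage note and the function name `find_stale_locks` are all assumptions made for illustration.

```python
import glob
import os
import time


def find_stale_locks(home_root, max_age_days=1, pattern="*lock*"):
    """List files in <home_root>/<user>/.lcgjm/ older than max_age_days.

    `pattern` is a guess at how the lock files are named; adjust it to
    whatever actually appears in ~/.lcgjm/ on your CE. Note that a glob
    "*" does not match names starting with a dot.
    """
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for path in glob.glob(os.path.join(home_root, "*", ".lcgjm", pattern)):
        # Compare the modification time against the cutoff.
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            stale.append(path)
    return sorted(stale)
```

Running `find_stale_locks("/home", max_age_days=7)` would list candidates for inspection. The sketch deliberately only reports files and removes nothing, since a recent lock file may belong to a job that is still being processed.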