Tuesday, 31 August 2010

August has seen a really a notable increase of Atlas user pilot jobs. Over 34000 jobs of which more than 12000 just in the last 4 days. Plotting the number of jobs since the beginning of the year there has been an inversion between production and users pilots.

The trend in August was probably helped by moving all the space to the DATADISK space token and attracting more interesting data. LOCALGROUP is also heavily used in Manchester.

In the past 4 days we also have applied the XFS file system tuning suggested by John that solves the load on the data servers experienced since upgrading to SL5. The tweak has increased notably the data throughput and reduced the load on the data servers practically to zero allowing us to increase the number of concurrent jobs. This has allowed a bigger job throughput and has had a clear improvement on the job efficiency isolating as most inefficient the very short ones (<10 mins CPU time) and even then the improvement is also notable as it is possible to see from the plots below.

Before applying the tweak

After applying the tweak

This also means we can keep on using XFS for the data servers which has currently more flexibility as far as partition sizes are concerned.

Sites (including Liverpool) running DPM on pool nodes running SL5 with XFS file systems have been experiencing very high (up to multiple 100s Load Average and close to 100% CPU IO WAIT) load when a number of analysis jobs were accessing data simultaneously with rfcp.The exact same hardware and file systems under SL4 had shown no excessive load, and the SL5 systems had shown no problems under system stress testing/burn-in. Also, the problem was occurring from a relatively small number of parallel transfers (about 5 or more on Liverpool's systems were enough to show an increased load compared to SL4).Some admins have found that using ext4 at least alleviates the problem although apparently it still occurs under enough load. Migrating production servers with TBs of live data from one FS to another isn't hard but would be a drawn out process for many sites.The fundamental problem for either FS appears to be IOPS overload on the arrays rather than sheer throughput, although why this is occurring so much under SL5 and not under SL4 is still a bit of a mystery. There may be changes in controller drivers, XFS, kernel block access, DPM access patterns or default parameters.When faced with an IOPS overload (that's resulting well below the theoretical throughput of the array) one solution is to make each IO operation access more bits from the storage device so that you need to make fewer but larger read requests.This leads to the actual fix (we have been doing this by default on our 3ware systems but we just assumed the Areca defaults were already optimal).

blockdev --setra 16384 /dev/$RAIDDEVICEThis sets the block device read ahead to (16384/2)kB (8MB). We have previously (on 3ware controllers) had to do this to get the full throughput from the controller. The default on our Areca 1280MLs is 128 (64kB read ahead). So when lots of parallel transfers are occurring our arrays have been thrashing spindles pulling off small 64kB chunks from each different file. These files are usually many hundreds or thousands of MB where reading MBs at a time would be much more efficient.The mystery for us is more why the SL4 systems *don't* overload rather than why SL5 does, as the SL4 systems use the exact same default values.Here is a ganglia plot of our pool nodes under about as much load as we can put on them at the moment. Note that previously our SL5 nodes would have LAs in the 10s or 100s under this load or less.http://hep.ph.liv.ac.uk/~jbland/xfs-fix.htmlAny time the systems go above 1LA now is when they're also having data written at a high rate. On that note we also hadn't configured our Arecas to have their block max sector size aligned with the RAID chunk size withecho "64" > /sys/block/$RAIDDEVICE/queue/max_sectors_kbalthough we don't think this had any bearing on the overloading and might not be necessary.

We expect the tweak to also work for systems running ext4 as the underlying hardware access would still be a bottle neck, just at a different level of access.Note that this 'fix' doesn't fix the even more fundamental problem as pointed out by others that DPM doesn't rate limit connections to pool nodes. All this fix does is (hopefully) push the current limit where overload occurs above the point that our WNs can pull data.There is also a concern that using a big read ahead may affect small random (RFIO) access although the sites can tune this parameter very quickly to get optimum access. 8MB is slightly arbitrary but 64kB is certainly too small for any sensible access I can envisage to LHC data. Most access is via full file copy (rfcp) reads at the moment.

Wednesday, 18 August 2010

I started yesterday to look at the ext4 vs ext3 performance with iozone. I installed two old dell WNs with the same file system layout, same raid level 0, but one with ext3 and one with ext4 on / and on /scratch the directories used by the jobs. Both machines have the default mount values and the kernel is 2.6.18-194.8.1.el5.

I performed the tests on /scratch partition writing the log in /. I did it twice one mounting and unmounting the fs at each test so to delete any trace of information from the buffer cache and one leaving the fs mounted between tests. Tests were automatically repeated for sizes from 64kB to 4GB and record length between 4kB - 16384kB. Iozone automatically doubles the previous sizes at each test (4GB is the smallest multiples smaller than the 5GB file size limit I set).

From the numbers ext4 performs much better in writing while reading is basically the same if not slightly worst for smaller files. There is a big drop in performance for both file systems for the 4GB size.

What however I find confusing is that I did the tests again setting the max size of the file at 100M and doing only write tests and ext3 takes less time despite (22 secs vs 44s in this case) despite the numbers saying that writing is almost 40% faster there is something that slows the tests down (deleting?). Speed of tests become similar for sizes >500MB they both decrease steadily until they finally drop at 4GB for any record length in both file systems.

Below some results mostly with the buffer cache because not having it affects mostly ext3 for small sizes of file and rec length as shown in the first graph.

Wednesday, 4 August 2010

Biomed is opening GGUS tickets for non working sites. Apparently they are getting organised and they have someone to do some sort of support now.

They opened one for Manchester too - actually we were slightly flooded with tickets we must have some decent data on the storage.

The problem turned out to be that on the 18/6/2010 the biomed VOMS server CA DN has changed. If you find these messages (if you google for them you get only source code entries) on your DPM srmv2.2 log files: