SSD in RAID failures

SSDs seem like an answer to I/O bottlenecks – and rightly so.

As the initial concerns over the limited write endurance of the NAND cells that sit behind the storage are pushed aside in the clamor for raw speed, SSDs are finding their way into production environments.

SO – you look at the pricing and wonder why there is such a gap between the enterprise-grade hardware and the consumer-grade… a question that, it would appear, is likely to rear its head again some time later.

Some tests I did comparing Constellation NL-SAS drives on a PERC H310 against SSDs on the same controller showed a noticeable increase in speed: around 600MB/sec from the NL-SAS (cache dependent, and dropping off) against 1100MB/sec sustained from the SSDs, with latency figures at either end of the scale – barely measurable for the SSDs versus 5 to 10 milliseconds for the mechanical drives.

Sure – the PCIe enterprise-class SSDs you get with configurations like Dell Fluid Cache will happily sustain 40MB/sec of writes for 5 years… but the lower-end stuff?

On the whole we use the Crucial (Micron) MX100 in 512GB. These feel more substantial physically, and are certainly faster than the Lite-On 512 that comes Dell-branded with servers such as the PE220 and above. We have yet to see any of these fail.

HOWEVER – and this is a big however – I have heard reports from others within the group that cheaper options have been failing. Let us not forget that SSDs tend to just stop – dead. If you are lucky you get a light; otherwise it is just dead – nothing – silence. Unlike mechanical disks, which will start generating SMART errors before the filesystem errors or the distinctive sounds of a drive in pain, SSDs simply stop working. From personal experience this can happen either in use or on restart.
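Many SSDs do at least expose a vendor-specific wear indicator through SMART before that point, even if they give no warning of outright controller death. A minimal sketch of pulling that figure out of `smartctl -A` output – the sample text and values below are made up for illustration, and the attribute name varies by vendor (Micron/Crucial report Wear_Leveling_Count, Intel uses Media_Wearout_Indicator):

```python
import re

# Hypothetical `smartctl -A /dev/sdX` output, trimmed for illustration.
SAMPLE = """
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
173 Wear_Leveling_Count     0x0032   094   094   000    Old_age   Always       -       187
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
"""

def wear_remaining(smart_output: str):
    """Return the normalised wear value (100 = new, 0 = rated endurance used up),
    or None if the drive does not report a recognised wear attribute."""
    for line in smart_output.splitlines():
        m = re.match(
            r"\s*\d+\s+(Wear_Leveling_Count|Media_Wearout_Indicator)\s+\S+\s+(\d+)",
            line,
        )
        if m:
            return int(m.group(2))
    return None

print(wear_remaining(SAMPLE))  # 94 -> roughly 94% of rated endurance left
```

Run that periodically from cron and alert below a threshold, and at least the wear-out class of failure stops being a surprise.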

So let's assume we have two givens: that, barring manufacturing defects, there is a finite number of write operations in an SSD; and that we are going to mitigate the threat of 'mechanical' failure with a RAID1 mirror. Great.

Sure. Right up to the point you realise that the two devices are now along for exactly the same ride, burning write operations at exactly the same rate.

So – they are likely to wear out at the same time, in my book. No?
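The arithmetic is simple enough. A back-of-envelope sketch – the endurance rating and daily write volume below are illustrative assumptions, not measured figures:

```python
# Back-of-envelope endurance estimate for a RAID1 pair.
# Assumed figures for illustration: a consumer SSD rated for
# 72 TB total bytes written (TBW), seeing 40 GB of host writes per day.
TBW_RATING_GB = 72_000   # rated endurance, in GB written
DAILY_WRITES_GB = 40     # host writes per day

# RAID1 mirrors every write, so BOTH members absorb the full 40 GB/day.
days_to_exhaustion = TBW_RATING_GB / DAILY_WRITES_GB

for member in ("sda", "sdb"):
    print(f"/dev/{member}: ~{days_to_exhaustion:.0f} days "
          f"(~{days_to_exhaustion / 365:.1f} years) to rated endurance")
```

Both members land on the same date. The mirror protects you against a random defect, but not against the deterministic wear-out you bought the pair to survive.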

Hence plans are afoot to introduce a policy of cycling the slave drive in RAID1 arrays on a periodic basis.
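On Linux software RAID the cycle itself is a handful of mdadm manage-mode commands. A sketch that just generates the command sequence – the array and device names are placeholders, and nothing here touches real disks:

```python
def rotation_commands(array: str, old: str, spare: str) -> list:
    """mdadm steps to swap out an aging RAID1 member for a fresher drive."""
    return [
        f"mdadm {array} --fail {old}",      # mark the aging member failed
        f"mdadm {array} --remove {old}",    # pull it from the array
        f"mdadm {array} --add {spare}",     # add the replacement; resync starts
        "watch cat /proc/mdstat",           # monitor the rebuild
    ]

for cmd in rotation_commands("/dev/md0", "/dev/sdb1", "/dev/sdc1"):
    print(cmd)
```

The swapped-out drive goes on the shelf with most of its endurance intact, ready to go back in when its sibling is the tired one – so the two members are never again on identical wear clocks.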

Better solutions in an ideal world:

– Software-level support for using SSDs as a cache in front of higher-capacity drives;

– Better still, introducing into the kernel the ability to mix drive types in a hierarchy, more akin to the likes of EMC's solutions: NL-SAS for long-term storage, SAS for live data, and SSDs for hot cached items – with live migration and consistency between them.

– RAID controllers, or SSD SMART support, that can anticipate and manage this kind of abrupt failure.
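The first two wishes are essentially a hot/cold tiering problem. A toy model of the idea – an LRU-managed SSD read-cache in front of a big slow disk; purely illustrative, the block contents and sizes are made up:

```python
from collections import OrderedDict

class TieredStore:
    """Toy model of a small fast tier (SSD) in front of a large slow one (HDD).
    Illustrative only -- not a model of any particular real implementation."""

    def __init__(self, ssd_blocks: int):
        self.ssd = OrderedDict()      # block -> data, kept in LRU order
        self.ssd_blocks = ssd_blocks
        self.hits = self.misses = 0

    def read(self, block: int) -> str:
        if block in self.ssd:                  # hot: served from the SSD tier
            self.ssd.move_to_end(block)
            self.hits += 1
        else:                                  # cold: fetch from HDD, promote
            self.misses += 1
            self.ssd[block] = f"data-{block}"
            if len(self.ssd) > self.ssd_blocks:
                self.ssd.popitem(last=False)   # evict least-recently-used block
        return self.ssd[block]

store = TieredStore(ssd_blocks=2)
for b in [1, 2, 1, 3, 1]:   # block 1 stays hot; block 2 is evicted
    store.read(b)
print(store.hits, store.misses)   # 2 hits (block 1), 3 misses
```

Hot items keep getting served at SSD speed while the bulk of the data lives on cheap spindles – which is exactly the EMC-style hierarchy described above, just in fifteen lines instead of a SAN.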