Really good article, Dave, and something we've spoken about in depth a lot over the last few years.

Something worthwhile pointing out on the cost front, PCIe storage is the cheaper option compared to a SAN if doing a new build (where you have to factor in the SAN cost). I've just recently looked at the costs for a new SAN setup with 1.2TB as the basic requirement of storage capacity. On the SAN side I've gone for 4x600GB disks (RAID 10) and a dozen servers. I've looked at dedicated spindles per DB server as always (I would have preferred 300GB disks but it's the better balance for the example I was working on). The SAN is a Clarion VNX5.

For PCIe I went with the FusionIO ioDrive2 Mono MLC card as an OEM bit of kit shipped from Dell. The rest of the server spec is identical to the above. It's more than fast enough, but there are much faster cards available.

The cost of PCIe is well under half the SAN-based costs and will deliver a lot more performance (I upped the RAM spec on both sides as I found the requirements you mentioned from FusionIO, and it's not that much more expensive to allow for at this stage).

On the lifespan front you can use the program / erase cycles to calculate the theoretical lifespan.

The lowest number you normally see quoted is 10,000 p/e cycles. Using that value we can calculate (simplified theoretical version) :

500GB written per day = 536,870,912,000 bytes (for me this is pretty close as TempDB takes a hammering in our estate)

1,318,554,959,872 bytes of capacity x 10,000 p/e cycles = 13,185,549,598,720,000 bytes that can be written in total

13,185,549,598,720,000 bytes / 536,870,912,000 bytes per day = 24,560 days of writing at 500GB per day

24,560 days = 67 years or 589,440 hours (admittedly lower than half of SATA or SAS, but when you up the capacity to 2.4TB with the same write rate it almost matches the usual MTBF rates on mechanical storage [my preferred way of describing SAN storage without being offensive]).
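
For anyone who wants to plug in their own numbers, the arithmetic above can be sketched in a few lines of Python. This is only the same simplified theoretical model: it assumes every cell gets the full quoted p/e cycle count and ignores write amplification, over-provisioning and wear levelling, so treat the result as a rough upper bound. The function name is just for illustration.

```python
GIB = 1024 ** 3  # the figures above use binary units (500GB = 536,870,912,000 bytes)


def lifespan_days(capacity_bytes, pe_cycles, bytes_written_per_day):
    """Simplified theoretical lifespan: total writable bytes / daily writes."""
    total_writable_bytes = capacity_bytes * pe_cycles
    return total_writable_bytes / bytes_written_per_day


capacity = 1_318_554_959_872   # roughly 1.2TB card, as in the example above
daily_writes = 500 * GIB       # 500GB written per day
pe_cycles = 10_000             # lowest p/e figure normally quoted

days = lifespan_days(capacity, pe_cycles, daily_writes)
years = days / 365
hours = days * 24

# Same write rate against double the capacity (the 2.4TB case)
days_double = lifespan_days(2 * capacity, pe_cycles, daily_writes)

print(f"{days:,.0f} days, {years:.0f} years, {hours:,.0f} hours")
print(f"double capacity: {days_double * 24:,.0f} hours")
```

Doubling the capacity at the same write rate simply doubles the figure, which is where the comparison with mechanical MTBF numbers comes from.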

It's a bit of an unfair calculation if I'm honest as we are comparing the amount of times we can theoretically write to something vs a potential hardware failure rate with mechanical parts. However, since the end result is something being kaput it's probably not too wide of the mark. Adding more component parts increases the probability of failure so that is something else to consider with mechanical storage. If we add spindles for speed we increase the likelihood of something breaking.

Oh, and once you see a 1TB database restore go from 4 hours to 5 minutes simply with non-mechanical storage, it's very hard to get it out of your head.

Missing, not for the first time in such essays, is discussion of normal forms, particularly for the operational data. If one moves to SSD, the response time factor changes significantly, even compared to short-stroking. But doing so with the typical flat-file datastore is cost prohibitive. In order to get maximum user data back and forth with the available IOPS, one needs a high NF datastore, which also happens to have the minimum footprint on storage.

Coders just love to refactor code, but they (all too often in control of database schemas) refuse to refactor data. Since their schemas start life as byte dumps manipulated by their wondrous code (just like their granddaddies' COBOL/VSAM apps), refactoring data means re-writing code; well, mostly discarding lots of code. The lifetime employment assurance disappears.

IOW, the problem isn't technical, but spiritual. Much the same thing happened when the 360 appeared with DASD. Rather than code to Direct Access, coders continued to do what was comfortable, code to Sequential Batch. Who said there's something new under the sun?

In addition there are more sectors in the outside tracks than there are in the innter [sic] tracks.

My understanding of disk sectors has always been that the number of sectors per track is constant for a given disk, and that each sector stores the same amount of data as any other sector.

Because the sectors on the outer tracks cover a bigger surface area on the physical platter, the storage density for those outer tracks is correspondingly lower. The included angle subtended by any sector is the same, which allows the head to read the same amount of data per partial rotation, no matter where on the disk it's reading from.

Correct me if I'm wrong, but wouldn't 40 disks in a RAID 1 give you 1/2 the capacity you stated? It's a mirror, so your array size is still only 6 TB, not 12. This works out to about 5.5 TB of usable space. More to the point, aren't we really talking about RAID 0+1 here?
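
To show where the 6 TB and 5.5 TB figures come from, here is a rough Python sketch. The 300 GB per-disk size is my assumption, inferred from 12 TB spread across 40 disks, and the 5.5 figure is taken to be 6 decimal TB expressed in binary units; neither is stated outright.

```python
def mirrored_capacity_bytes(disk_count, disk_size_bytes):
    """Capacity of a mirrored array: half the disks hold copies of the other half."""
    return (disk_count // 2) * disk_size_bytes


disk_count = 40
disk_size = 300 * 10 ** 9      # assumed 300 GB (decimal) per disk

raw = disk_count * disk_size                               # 12 TB of raw disk
mirrored = mirrored_capacity_bytes(disk_count, disk_size)  # 6 TB after mirroring
mirrored_binary_tb = mirrored / 1024 ** 4                  # as reported in binary units

print(f"raw: {raw / 10 ** 12:.0f} TB, mirrored: {mirrored / 10 ** 12:.0f} TB")
print(f"mirrored in binary units: {mirrored_binary_tb:.1f} TB")
```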