The 512/520 Issue

by Administrator on April 20, 2006

I know that this post will get me flames from the usual suspects, but I have to make it anyway. For the last several weeks my friend Mike at Zerowait has been asking questions in his blog that merit some attention here.

It has long been a question of mine why you couldn’t buy the exact drive used by the frame seller directly from the drive manufacturer, usually at a substantially lower cost than what the OEMs sell them for (a 300% mark-up is common).

This has not just been an issue with NetApp. IBM, EMC and a rash of other vendors do the same thing. They mark up the drives in their cabinets, but prohibit their customers from simply buying replacement drives from Seagate, Hitachi, Fujitsu, Western Digital or whichever vendor manufactures them.

The response I get when I ask this question of an OEM is always the same (and I paraphrase): we format the drive in our own way, using a certain number of sectors, to ensure drive resiliency or to ensure that they will work with some voodoo that we do with our storage in microcode.

Mike, who watches NetApp like a hawk, has seen another trend, which he emailed me about today. NetApp’s recent flip-flop from 520-byte sector formatting (which they once told me was required for resiliency and compatibility with WAFL and RAID 4) back to 512-byte sector formatting is viewed by Mike as a sham from a technical perspective. Here is a quote from Mike’s email.

You know this 520 to 512 is a big thing.

1) NetApp tells everyone that 520 Sector (BCS) is better

2) NetApp forces every one of their customers to buy new drives that support BCS

3) NetApp switches back to 512 sector ( ZCS)

4) NetApp forces new ZCS customers to upgrade to dual parity – wasting disk space because they are using less reliable technology.

I guess they will force all of their customers to switch to BCS drives again soon… increase sales again

Storage is now sold as fashion!

When is the proprietary block format no longer done for technical reasons, but instead used as a method to cajole users into upgrading their platforms? I have to agree with Mike and with the fellow from Hitachi who he quotes in his blog that this smacks of marketecture, not architecture.

I invite NetApp to explain the move in a response here, which I will promote to a full column topic. I need some clarification from Sunnyvale about why, exactly and technically, this is happening.

I have always believed that this forum represents a good educational medium. I think the whole NetApp BCS/ZCS criticism is unfair. I do not represent NetApp in any shape or form, but I am familiar with their systems and have done consulting work for NetApp customers before.

As the HDS engineer alluded, SCSI and FC drives can be formatted with 520 bytes per sector. He made no reference to PATA/SATA. PATA/SATA drives have a fixed format of 512 bytes per sector.

A few years back NetApp was using ZCS; however, because of the performance overhead, NetApp abandoned ZCS and started using BCS. NetApp made no secret of that, and in fact if one looks at the technical reports for the V-Series, there’s a section that discusses BCS vs. ZCS and what’s recommended (BCS).

But how is it possible to use BCS when the format size is fixed at 512 bytes per sector? A WAFL block is 4096 bytes. NetApp uses 512 bytes for the checksum and stores that with the metadata. When RAID scrub kicks in and 4K blocks are read, a checksum is calculated and compared against the original value. If they match, everything’s fine. If they don’t, the data is reconstructed from parity and a new checksum is calculated.
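The scrub check described above can be sketched roughly as follows. This is only an illustration, not ONTAP code: the CRC-32 checksum, the single-parity XOR layout and the function names (`checksum`, `xor_blocks`, `scrub_block`) are all assumptions made for the example.

```python
import zlib
from functools import reduce

BLOCK_SIZE = 4096  # WAFL block size mentioned above


def checksum(block: bytes) -> int:
    # CRC-32 is a stand-in; the actual checksum algorithm is not stated here.
    return zlib.crc32(block)


def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks (single-parity arithmetic).
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_block(stripe, parity, index, stored_checksum):
    """Return (block, checksum), rebuilding the block from parity on mismatch."""
    block = stripe[index]
    if checksum(block) == stored_checksum:
        return block, stored_checksum
    # Mismatch: rebuild from the other data blocks plus the parity block.
    others = [b for i, b in enumerate(stripe) if i != index]
    rebuilt = xor_blocks(others + [parity])
    return rebuilt, checksum(rebuilt)
```

A block that matches its stored checksum is returned untouched. One that does not is recomputed from the rest of the stripe and given a fresh checksum.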

However, NetApp has gone further than that and also protects against “silent write failures”, where a bad or misbehaving drive signals that a write has completed when in fact it failed; on reads, ONTAP will verify the checksum. HDS uses something similar to protect their customers against “silent write failures”. They call it read-after-write and they market it on the foils for their intermix option.

But NetApp didn’t stop there and continued to add features, such as Rapid RAID Recovery. With this, when a drive hits a predefined threshold of media errors, either from the same location or from different locations, a prediction is made that the drive is about to fail. ONTAP puts the drive in pre-fail mode, brings in a spare, copies the data in the background to the new drive and then marks the original drive as bad. No reconstruction occurs.
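As a rough sketch of that pre-fail flow (not actual ONTAP logic): the threshold value, the `Drive` class and the state names below are all hypothetical, chosen only to show the sequence of count, pre-fail, copy to spare, then mark bad.

```python
from dataclasses import dataclass, field

MEDIA_ERROR_THRESHOLD = 3  # hypothetical value; the real threshold is not given


@dataclass
class Drive:
    name: str
    blocks: dict = field(default_factory=dict)
    media_errors: int = 0
    state: str = "active"  # active -> prefail -> failed


def record_media_error(drive: Drive, spare: Drive) -> Drive:
    """Count a media error; at the threshold, copy to the spare and retire the drive."""
    drive.media_errors += 1
    if drive.media_errors < MEDIA_ERROR_THRESHOLD:
        return drive
    drive.state = "prefail"
    spare.blocks = dict(drive.blocks)  # straight copy, no parity reconstruction
    spare.state = "active"
    drive.state = "failed"
    return spare
```

The point of the design is that the data is copied while the suspect drive is still readable, so the rest of the RAID group is never asked to rebuild it.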

But they didn’t stop there either, and with 7.0 or 7.1 (not sure which version) they introduced Maintenance Center. If you ask the drive mfgs, most drives are marked bad because of transient errors. So what NetApp did was put into ONTAP all the low-level drive diags their drive mfgs are using. When a drive gets marked as bad, the option is available to run low-level drive diags. If the diags pass, the drive can be put back into the spare pool. If not, it can get replaced.

So how many other subsystem vendors, in addition to double-failure protection, have provided similar capabilities in an attempt to protect their customers’ data?

“It has long been a question of mine why you couldn’t buy the exact drive used by the frame seller directly from the drive manufacturer, usually at a substantially lower cost than what the OEMs sell them for (a 300% mark-up is common)”

There are valid reasons as to why vendors don’t want to allow this to occur.

Storage vendors typically source drives from different drive mfgs in order to provide continuity of supply as well as to have alternatives should quality issues occur with a drive type or with a drive mfg.

Vendors do invest resources and time into qualifying these drives as well as closely monitoring production and quality at the mfg facilities and their disk enclosure integrators. If drive quality issues occur, vendors are the first to know and can take subsequent action. Customers do not and can not until it’s too late.

Allowing a customer to source the exact drive type directly from a mfg does not guarantee quality, and there are no assurances of drive supply when needed. While the up-front cost may be less in following such a practice, the total cost may end up being more…

This would be no different than building a DR site. You hope you never need it but if you do, it’s there. So is buying drives from the storage vendors. You can always get a drive when you need one, but you hope you never need one.

I find PQ65’s comments very interesting, but I don’t understand the reasoning behind his comments. If BCS (520) is better than ZCS (512) and NetApp uses ZCS on their NearStore products, doesn’t this mean that customers’ D/R and backup drives are more vulnerable to corruption? It would seem so, because NetApp recommends Dual Parity on ZCS drives. This seems to leave customers relying on a less resilient technology for their backups. How much less reliable are ZCS systems than BCS systems, and is it worth the risk? That is what my customers and I are trying to find out.

Can NetApp provide reliable, repeatable and verifiable data to show their consumers that the Nearstore products that use ZCS drives are as reliable as NetApp’s products that use BCS technology? Does NetApp keep its financial data on ZCS drives or BCS drives? Why not allow consumers to judge their cost to risk ratio by disclosing test results that can be duplicated and verified?

Clearly there are performance and cost advantages to each technology and drive type. NetApp could easily disclose accurate and repeatable test results; consumers could then make informed and economical decisions on where to store their D/R and backup data. And everyone would be a winner.

The reasoning behind my comment is your post and a very specific comment that reads as follows:

“A few years ago NetApp switched to 520 sector drives because they were more resilient. But on their backup and archiving units they currently use 512 sector drives. Are the Maxtor ATA drives less error prone than the Seagate FC drives?”

To anyone familiar, the above comment shows ignorance for someone who sells NetApp gear and also claims in his blogs that he can “enhance” a filer’s performance. You can’t enhance what you don’t understand…

Secondly, most, if not all, NetApp systems have a choice in formatting: BCS or ZCS. The OS level plays a role as well. If you run at very old levels then I could see how one can get stuck with no option. There’s no documentation that I’ve seen that explicitly states that ZCS is better than BCS. But there is documentation that states the pros and cons and makes a recommendation, which is BCS. If you read the V-Series TR you’ll see that that stuff is available.

BTW…NetApp recommends RAID-DP on every system with SATA drives. I said recommends, not enforces…

Sorry for the reaction but I believe a lot of your comments are B(C)S…

I’m getting a bit lost here. I think that, on the one hand, Pq65 is making some valid points.

1. The NetApp guys have gone to a SATA drive rather than an FC drive for some of their gear. What is unclear is why. If this is a way to reduce the costs of the overall gear and an acknowledgement of the vastly improved resiliency of the SATA drive, then I’m all for it.

2. Doubling up or otherwise increasing the number of SATA disks over the former lesser number of FC drives, and adding a new RAID scheme to provide the same “dual ported performance and resiliency” story of the former FC drives, is another architectural decision by the manufacturer.

3. For large lumbering SATA drives, you need a better RAIDing scheme, since rebuilding 250+ GB drives following a RAID 5 (or RAID 4 in the case of older NetApp gear) failure is abysmally slow.

From a consumer standpoint, increasing the number of drives increases power costs of the product. It also means, by virtue of Annual Failure Rates, that you will be replacing drives more frequently. (There are more drives to fail, simply.)
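The failure-rate point is simple arithmetic: the expected number of replacements per year scales with the drive count. A minimal sketch follows, where the function name is made up for illustration and any rate or shelf size passed in (for example 3% and 14 drives) is purely hypothetical, not a vendor figure.

```python
def expected_annual_replacements(drive_count: int, afr: float) -> float:
    """Expected drive failures per year, given an annual failure rate (0.0-1.0)."""
    return drive_count * afr
```

With a hypothetical 3% rate, 14 drives would average about 0.42 replacements a year and 28 drives about 0.84, so doubling the spindle count doubles the expected swaps.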

Now, am I understanding correctly, that this comes down to the purposing of the array — that the SATA units are aimed at a disk-based backup/snapshot role and the FC units are aimed at the primary or capture-storage role? Is this the distinction between ZCS and BCS in NetApp parlance? Clearly, the load on the backup system would be less than the load on the production system…until, that is, you need to switch to it for disaster recovery purposes.

Pq65 writes that you hope you never need to activate your DR capability. True enough. But the point of having the backup capability is to be able to depend on it when and if you need to shift to your backup system.

So, since I am not steeped in NetApp nomenclature or the nuances of its most recent gear, and since they have not briefed me on price/performance/intent of their various models, it would really be helpful for someone from NetApp (and I know those guys read this blog) to set the record straight so that we are not speculating.

No John. There’s no distinction. All NetApp gear that I’ve seen can be formatted for BCS or ZCS. The typical case is BCS. Below is the output of an R200 spare drive with older PATA drives and ONTAP 7.1…Check out the 1st line of the output:

Let me take a shot at this. I asked one of our engineers to take a look at this thread as well, so if I mess up the details, hopefully he can set me right. (Hi Steve.)

Reformatting the disk drives from 512-byte blocks to 520-byte blocks and putting the checksum right in each individual block is the best solution, because it doesn’t take any extra seeks or reads to get the checksum data you need. This is called BCS or Block Checksum. (Most high-end storage vendors have something similar. EMC and Hitachi certainly do.)
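To picture why no extra read is needed, here is a minimal sketch of a 520-byte sector that carries its own checksum. How the 8 extra bytes are really laid out is not described here, so the CRC-32-plus-padding trailer and the function names are assumptions for illustration only.

```python
import struct
import zlib

DATA_BYTES = 512
SECTOR_BYTES = 520  # 512 data bytes + 8 bytes of checksum area


def format_sector(data: bytes) -> bytes:
    # Assumed trailer layout: a 4-byte CRC-32 followed by 4 zero pad bytes.
    assert len(data) == DATA_BYTES
    return data + struct.pack(">II", zlib.crc32(data), 0)


def read_sector(sector: bytes) -> bytes:
    """Return the data if its inline checksum matches, else raise ValueError."""
    data, trailer = sector[:DATA_BYTES], sector[DATA_BYTES:]
    stored, _pad = struct.unpack(">II", trailer)
    if zlib.crc32(data) != stored:
        raise ValueError("checksum mismatch")
    return data
```

Because the checksum travels in the same sector as the data, a single read returns everything needed to verify it.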

Unfortunately, we aren’t able to format ATA drives with 520 byte blocks. Maybe someday, but not yet. So with ATA we use a different technology called Zoned Checksum (or ZCS) where we steal every Nth block on the disk and use it for the checksums. (I think N is 64, but can’t remember for sure.) This is less efficient because you have to read extra data, but it allows you to get the reliability benefits of checksums even with ATA drives, which is important because ATA drives are less reliable.

And what about the RAID-DP (DP = “double parity”)? I think that RAID-DP is a wise choice for all drives, Fibre Channel or ATA, but given that ATA drives are less reliable we make RAID-DP the default there. I’m wondering if it’s time to make it the default for Fibre Channel drives as well, but as far as I know, we haven’t done that yet.

Why sell less reliable drives? ATA drives are cheaper! If you’ve got the money, then by all means keep buying Fibre Channel drives and keep using block checksums.

On the other hand, if you want to save money, and your application can get by with a bit less performance, then the combination of RAID-DP and Zoned Checksums can make ATA drives very safe. We used to recommend ATA only for disk-based backup or for archival storage, but now that we have RAID-DP and ZCS, we see lots of customers using it for primary storage, which is why we are starting to support ATA through the entire product line, and not just in the R-Series.

First, let me try to clear up the confusion about BCS vs. ZCS, and provide a little history. As Dave says, ZCS works by taking every 64th 4K block in the filesystem and using it to store a checksum on the preceding 63 4K blocks. We originally did it this way so we could do on-the-fly upgrades of WAFL volumes (from not-checksum-protected to checksum-protected). Clearly, reformatting each drive from 512 sectors to 520 would not make for an easy, on-line upgrade.
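The ZCS layout just described (63 data blocks followed by one checksum block) can be expressed as a small address calculation. This is a sketch of the arithmetic only. The function name and the assumption that zones start at block 0 are mine, not taken from ONTAP.

```python
ZONE_BLOCKS = 64                 # every 64th 4K block holds checksums
DATA_PER_ZONE = ZONE_BLOCKS - 1  # the 63 data blocks it covers


def zcs_locate(data_index: int) -> tuple[int, int]:
    """Map the n-th data block to (physical block, checksum block)."""
    zone, offset = divmod(data_index, DATA_PER_ZONE)
    physical = zone * ZONE_BLOCKS + offset
    checksum_block = zone * ZONE_BLOCKS + DATA_PER_ZONE
    return physical, checksum_block
```

For example, `zcs_locate(0)` gives `(0, 63)`: the first data block sits 63 blocks away from its checksum, which is why the two usually cannot be fetched in one I/O.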

As Dave says above, the primary drawback to ZCS is performance, particularly on reads. Since the data does not always live adjacent to its checksum, a 4K read from WAFL often turns into two I/O requests to the disk. Thus was born the NetApp 520-byte-formatted drive and Block Checksums (BCS). For newly-created volumes, this is the preferred checksum method. Note that a volume cannot use a combination of both methods — a volume is either ZCS or BCS.

Pq65 provides some spare-disk output from a filer running ONTAP 7.x showing spares that could be used in either a BCS or a ZCS. The FC drive shown here is formatted with 520-byte sectors. If it is used in a ZCS volume, ONTAP will simply not use those extra 8 bytes in each sector.

When ATA drives came along, we were stuck with 512-byte sectors. But we wanted to use BCS for performance reasons. So rather than going back to using ZCS, we use what we call an “8/9ths” scheme down in the storage layer of the software stack (underneath RAID). Every 9th 512-byte sector is deemed a checksum sector that contains checksums for each of the previous 8 512-byte sectors (which together form a single 4K WAFL block). This scheme allows RAID to treat the disk as if it were formatted with 520-byte sectors, and therefore they are considered BCS drives. And because the checksum data lives adjacent to the data it protects, a single disk I/O can read both the data and checksum, so it really does perform similarly to a 520-byte sector FC drive (modulo the fact that ATA drives have slower seek times and data transfer/rotational speeds).
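The “8/9ths” layout reduces to the following sector arithmetic. It is a sketch under the assumption that groups of 9 sectors start at sector 0, and the function names are invented for the example.

```python
SECTORS_PER_BLOCK = 8  # 8 x 512 bytes = one 4K WAFL block
GROUP_SECTORS = 9      # 8 data sectors + 1 checksum sector


def locate_block(block_number: int) -> tuple[range, int]:
    """Return (data sector range, checksum sector) for a 4K block."""
    first = block_number * GROUP_SECTORS
    data = range(first, first + SECTORS_PER_BLOCK)
    return data, first + SECTORS_PER_BLOCK


def usable_fraction() -> float:
    # Share of raw sectors that hold data rather than checksums.
    return SECTORS_PER_BLOCK / GROUP_SECTORS
```

Block 2, for instance, occupies sectors 18 through 25 with its checksum in sector 26, so one contiguous 9-sector read returns both. The price is that one ninth of the raw capacity goes to checksums.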

Starting in ONTAP 7.0, the default RAID type for aggregates is RAID-DP, regardless of disk type. For traditional volumes, the default is still RAID-4 for FC drives, but RAID-DP for ATA drives. You cannot mix FC drives and ATA drives in the same traditional volume or aggregate.

The default RAID group size for RAID-DP is typically double that for RAID-4, so if you are deploying large aggregates, the cost of parity is quite similar for either RAID type. But the ability to protect you from a single media error during a reconstruct is of course far superior with RAID-DP (the topic of one of Dave’s recent blogs on the NetApp website).
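The parity-cost claim is easy to check with fractions. The group sizes used in the note below (8 and 16 disks) are hypothetical round numbers chosen to show the doubling, not the actual ONTAP defaults.

```python
from fractions import Fraction


def parity_fraction(group_size: int, parity_disks: int) -> Fraction:
    """Share of a RAID group's disks that is spent on parity."""
    return Fraction(parity_disks, group_size)
```

With those sizes, `parity_fraction(8, 1)` for RAID-4 and `parity_fraction(16, 2)` for RAID-DP both come to 1/8, which is the sense in which doubling the group size keeps the parity cost the same.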

You can easily upgrade a RAID-4 aggregate to RAID-DP, or downgrade a RAID-DP aggregate to RAID-4. But you cannot shrink a RAID group, so you do want to be careful about how you configure your RAID groups before you populate them with data (assuming you don’t like the defaults).

There was an implication earlier in this blog that we used to use RAID 4, but on newer systems we use RAID 5. That’s not the case — we do not use RAID 5 on any of our systems (though an HDS system sitting behind a V-series gateway might use it internally). This is a whole topic in itself, but the reason, stated briefly, is that RAID-4 is more flexible when it comes to adding drives to a RAID group, and because of WAFL, RAID-4 does not present a performance penalty for us, as it does for most other storage vendors. RAID-DP looks much like RAID-4, but with a second parity drive.

Our “lost-writes” protection capability was also mentioned. Though it is rare, disk drives occasionally indicate that they have written a block (or series of blocks) of data, when in fact they have not. Or, they have written it in the wrong place! Because we control both the filesystem and RAID, we have a unique ability to catch these errors when the blocks are subsequently read. In addition to the checksum of the data, we also store some WAFL metadata in each checksum block, which can help us determine if the block we are reading is valid. For example, we might store the inode number of the file containing the block, along with the offset of that block in the file, in the checksum block. If it doesn’t match what WAFL was expecting, RAID can reconstruct the data from the other drives and see if that result is what is expected. With RAID-DP, this can be done even if a disk is currently missing!
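A minimal sketch of that identity check is shown below. It is not the real on-disk format: the `ChecksumEntry` fields, the CRC-32 stand-in and the result strings are assumptions that simply follow the inode-and-offset example given above.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecksumEntry:
    crc: int          # checksum of the data (CRC-32 as a stand-in)
    inode: int        # file that owns the block
    file_offset: int  # block's offset within that file


def make_entry(data: bytes, inode: int, file_offset: int) -> ChecksumEntry:
    return ChecksumEntry(zlib.crc32(data), inode, file_offset)


def classify_read(data: bytes, entry: ChecksumEntry,
                  expected_inode: int, expected_offset: int) -> str:
    """Tell apart a corrupt block from a valid block that is the wrong one."""
    if zlib.crc32(data) != entry.crc:
        return "corrupt"
    if (entry.inode, entry.file_offset) != (expected_inode, expected_offset):
        return "lost-or-misplaced-write"
    return "ok"
```

A stale block left behind by a lost write still passes its own checksum, so only the identity comparison catches it. At that point the caller would fall back to reconstructing the block from the other drives.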

We’re constantly looking for opportunities for adding features to ONTAP RAID and WAFL that can hide some of the deficiencies and quirks of disk drives from clients. I think NetApp is in a unique position to be able to do this sort of thing. It’s great to see that you guys are noticing!