Considerations for very large (~100TB) NAS

My organization is considering a NAS to store a mixture of data, ranging from short-term working backups to large, long-term archival data for preservation. While I'm reasonably aware of trends in personal NASes, the size of the array being considered - 100 TB - is dramatically different from what I had worked out for myself.

As it is, the working idea is a very large off-the-shelf NAS (e.g. a Synology DS3617xs) filled with 8 TB NAS disks. The budget is accordingly in the ~$6k range. I'd like to see if I'm aware of the main considerations here, and if this overall makes a reasonable amount of sense. My biggest worries are with respect to bit rot / data integrity and the disk/pool arrangement.

Usable space: The estimated actual working space needed is ~50TB; the 100TB raw figure assumes a factor of 2 for parity overhead. I think that factor is overly pessimistic if we do use a RAIDZ2-like configuration.

Disk/Pool arrangement: To be determined, but potentially 2 x RAIDZ2 of 6 x 8TB, for 64TB usable. 8 TB NAS disks seem sensible, and all three major vendors (WD, HGST, and Seagate) seem to be in the ~$270/drive range. I prefer HGST based on their track record, both in personal use and from Backblaze's data.
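A quick back-of-the-envelope on that layout (decimal TB, ignoring ZFS metadata/slop, so treat it as an upper bound):

    # Rough capacity check for 2 x RAIDZ2 vdevs of 6 x 8TB each (the tentative layout above).
    DRIVE_TB = 8
    VDEVS = 2
    DRIVES_PER_VDEV = 6
    PARITY_PER_VDEV = 2  # RAIDZ2

    raw = VDEVS * DRIVES_PER_VDEV * DRIVE_TB                          # 96 TB raw
    usable = VDEVS * (DRIVES_PER_VDEV - PARITY_PER_VDEV) * DRIVE_TB   # 64 TB before overhead
    usable_tib = usable * 1e12 / 2**40                                 # ~58 TiB as the OS reports it

    print(f"raw: {raw} TB, usable: {usable} TB (~{usable_tib:.0f} TiB before ZFS overhead)")

Which also suggests the factor-of-2 estimate above is on the conservative side.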

DIY vs. Off-the-shelf: No one, including myself, has previous experience building NASes. Moreover, the expected service life will be long, and the system needs to be reasonably maintainable in the long term. Because of that, I'm inclined to go with an off-the-shelf unit.

Bit rot: I am modestly concerned about whether this will be an issue at this scale. Ideally I'd use ZFS, but that seems in conflict with going off-the-shelf. I don't exactly trust btrfs, even with Synology's official support. QNAP seems to have a ZFS enterprise product, but it's a rackmount, and we don't have racks.

Off-site backups: No real plan has taken shape yet. I understand that off-the-shelf NASes can sync to services like Backblaze, but the potential volume concerns me here. We have not yet decided if everything needs to be off-site.

If possible, references, or even just suggestions on what sorts of directions I should research, would be very helpful to justify my thinking to decision-makers.

DIY vs. Off-the-shelf: No one, including myself, has previous experience building NASes. Moreover, the expected service life will be long, and the system needs to be reasonably maintainable in the long term. Because of that, I'm inclined to go with an off-the-shelf unit.

I'd say you're thinking logically here. DIY is pretty effective, but it sounds like maintaining this thing will be your headache, so you want something that doesn't need constant tweaking.

100TB raw or 100TB usable? The latter is more expensive, even with 12TB disks, although you could do 96TB raw with just 8x 12TB disks so you might be able to get away with a smaller enclosure/cheaper NAS. I haven’t looked up the price differences lately between off the shelf 8 bay vs 12 bay NAS units so I really can’t say...

The best you could do is about US$0.035/GB for capacity. Assuming that 100TB is usable capacity, you'll need a minimum of 125 TB (raw) to set up a RAID 5. That may not be the best choice either. You may need to mirror or use RAID 6. It would also be prudent to include one or more hot spares. That's going to be about US$4500 just for storage.
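Showing my work, in case anyone wants to check it (assuming 5-wide RAID 5 groups and ~US$0.035/GB, roughly what 8 TB NAS drives run per GB):

    # Back-of-the-envelope: raw capacity and drive cost for 100 TB usable (assumed figures).
    usable_tb = 100
    group_width = 5          # 4 data + 1 parity per RAID 5 group (assumption)
    price_per_gb = 0.035     # ~US$270 / 8 TB drive

    raw_tb = usable_tb * group_width / (group_width - 1)   # 125 TB raw
    cost = raw_tb * 1000 * price_per_gb                    # ~US$4,375

    print(f"raw needed: {raw_tb:.0f} TB, drive cost: ~${cost:,.0f}")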

I'd say you're thinking logically here. DIY is pretty effective, but it sounds like maintaining this thing will be your headache, so you want something that doesn't need constant tweaking.

Exactly. While we don't have infinite financial resources, we also don't have infinite people. Money is somewhat easier to get than people, so I'm inclined to support paying the premium, so long as it's a reasonable one.

100TB raw or 100TB usable? The latter is more expensive, even with 12TB disks, although you could do 96TB raw with just 8x 12TB disks so you might be able to get away with a smaller enclosure/cheaper NAS. I haven’t looked up the price differences lately between off the shelf 8 bay vs 12 bay NAS units so I really can’t say...

I should clarify - 100 TB raw. Any thoughts on getting expansion units for a NAS if we end up with not enough bays down the line?

The best you could do is about US$0.035/GB for capacity. Assuming that 100TB is usable capacity, you'll need a minimum of 125 TB (raw) to set up a RAID 5. That may not be the best choice either. You may need to mirror or use RAID 6. It would also be prudent to include one or more hot spares. That's going to be about US$4500 just for storage.

Black Jacque, I think the proposed budget and your math are concordant if it's actually 100 TB raw / ~50-60 TB usable, correct? That said, the Oracle pricing I see is $0.003/GB-month for Archive Cloud Classic, which is a full factor of 3 lower than what you said. Is this correct? I do like the free 10TB of retrieval vs. Glacier - does it have the same latency? A likely problem is that we may reference the data often enough to make the latency painful.
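For my own planning, if that $0.003/GB-month figure holds, the recurring cost looks roughly like this (decimal GB, ignoring retrieval/egress fees):

    # Rough recurring cost at an assumed $0.003/GB-month archive rate.
    rate_per_gb_month = 0.003
    for tb in (50, 100):
        gb = tb * 1000
        monthly = gb * rate_per_gb_month
        print(f"{tb} TB: ${monthly:,.0f}/month (~${monthly * 12:,.0f}/yr)")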

ranging from short-term working backups to large, long-term archival data for preservation.

These should be separate systems.

Indeed, any "long-term archive" that doesn't have an offsite component is no such thing.

I'll answer these together. Ideally, yes, we would have separate systems for short-term backups and long-term archives, but I don't think we have the budget, and more critically the people, to manage separate systems. What specific compromises do you foresee from trying to do both? I'm aware that the most cost-effective archival solutions (e.g. tapes, Amazon Glacier, etc.) perform badly for frequent writes. One other thing I should point out is that the short-term backups will likely be much smaller in size than the actual archival storage (e.g. TBs for archives; 100s of GBs for workstation backups).

We are also looking at offsite components, but we're treating it more as a nice-to-have than a core requirement. In the event that our physical site is devastated, we may have bigger issues to worry about than our data. (I personally disagree, but it's hard to justify spending $$$ for paranoia...)

One thing I've run into in the past when building large (or effectively unbounded) storage systems is the need to account for the cost of new, larger storage becoming available over time.

100TB sounds really large right now, but with drives already pushing 12 and 14TB, I think it's almost safe to say that in 5 years, there will be 100TB of storage available on a single device. At that point, a $500 drive (just a wild guess based on the current 'large' drives' selling price) could replace an entire $6k investment inside of 5 years. So that means you're losing roughly $1k/yr on this investment. Could other solutions be found for $1k/yr? Maybe.
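Rough math on that, using my wild-guess numbers above:

    # Hypothetical depreciation: $6k array today vs. a guessed $500 single-drive replacement in 5 years.
    cost_now = 6000
    guessed_replacement_in_5yr = 500
    years = 5
    effective_loss_per_year = (cost_now - guessed_replacement_in_5yr) / years
    print(f"~${effective_loss_per_year:,.0f}/yr")   # ~$1,100/yr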

I'll answer these together. Ideally, yes, we would have separate systems for short-term backups and long-term archives, but I don't think we have the budget, and more critically the people, to manage separate systems. What specific compromises do you foresee from trying to do both?

It will do both jobs poorly. Sure, you can do OLTP and OLAP with the same database instance, but it will do both jobs poorly.

How much administrative work do you think these systems will take? Installing 2 systems at the same time has a much smaller marginal cost (people effort) than 2x the cost of 1 system.

We are also looking at offsite components, but we're treating it more as a nice-to-have than a core requirement. In the event that our physical site is devastated, we may have bigger issues to worry about than our data. (I personally disagree, but it's hard to justify spending $$$ for paranoia...)

If your org can survive loss of all its data, then why is it even a "nice-to-have"?

Joking aside, a massive site disaster isn't the only (or most likely) threat. Someone might come in and steal whatever hardware you're using as your backup solution.

I should clarify - 100 TB raw. Any thoughts on getting expansion units for a NAS if we end up with not enough bays down the line?

Maybe, but that sort of thing usually is not cheap. Depending on your needs and timeframe, especially if it's more than 2 or 3 years down the line, it may be cheaper to buy new.

We just walked a customer through this; the cost difference between upgrading/expanding the existing storage subsystem vs. replacing it entirely was less than 20%... so I hate to be handwavy/non-specific (I apologize for not being able to be more specific on my part), but thinking further out into your timeframe about budgets/storage needs is something that's probably worth doing, even if it's only back-of-the-envelope calculations.

At the rate hard disks are getting cheaper, the math, even done casually, is probably worth doing.

Supermicro makes some decent server chassis that take a large number of drives for an acceptable price. I would grab something along those lines, and budget in replacing the HBAs with some LSI 9211-8i cards. I recommend those specifically because they have excellent driver support from both the open source and commercial community and are in general very popular, so any issue you'd encounter should be well documented.

The server itself should have ECC memory and dual power supplies. Crashes are unacceptable at this scale, so take all of the high availability features you can get.

For the operating system, I really am liking Storage Spaces on Windows Server. In Server 2016 it has more capability than ever. Right now I'm particularly digging the directed I/O feature: I can initiate a file copy at my workstation from my NAS to another workstation or server, and the transfer happens directly between the two points without being relayed through my system.

Storage Spaces has a minor learning curve, but once you dig in it's intuitive enough. Deduplication on Windows Server is really good; I've yet to be able to break it badly enough that I couldn't fix it, despite my best efforts.

I'd recommend against using a copy on write file system at this scale if you don't have any experience with one. This includes ZFS and ReFS. Both have caveats that I'd suggest you explore at a smaller scale first.

My biggest recommendation for your first foray into this is to limit your volume size to something you can reasonably copy somewhere else. Otherwise you will build something, want to change the configuration, and have no possible migration route without building another 100TB system. Never mind a loss of integrity that requires you to image a full volume. 16-24TB seems reasonable as a recommendation; you could overnight $400-600 in hardware and have the tools you need for a recovery in short order.
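To put numbers on why I cap volume sizes (assuming you can actually sustain wire speed, which is optimistic):

    # Rough copy/restore times for a full volume at assumed sustained throughputs.
    def days(tb, mb_per_s):
        return tb * 1e12 / (mb_per_s * 1e6) / 86400

    for vol_tb in (24, 100):
        print(f"{vol_tb} TB volume: "
              f"~{days(vol_tb, 110):.1f} days over 1 GbE, "
              f"~{days(vol_tb, 1000):.1f} days over 10 GbE")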

I'll answer these together. Ideally, yes, we would have separate systems for short-term backups and long-term archives, but I don't think we have the budget, and more critically the people, to manage separate systems. What specific compromises do you foresee from trying to do both?

It will do both jobs poorly. Sure, you can do OLTP and OLAP with the same database instance, but it will do both jobs poorly.

How much administrative work do you think these systems will take? Installing 2 systems at the same time has a much smaller marginal cost (people effort) than 2x the cost of 1 system.

Further clarification - we are an academic biological research group. A quick google of your acronyms suggests a couple things: 1) I think the workload is really more along the side of pure "OLAP", as data for archiving will happen infrequently and in a big batch (i.e. a batch of samples are processed at once). We won't have anything remotely resembling an OLTP-like workload with frequent latency-sensitive read/writes.

Moreover, I think we're going off into the weeds about putting short-term backups on the archival system - the thinking here was more or less an opportunistic "hey - if we have $BIGNUMBER terabytes of storage, why not just put workstation backups on it too!" If this is actually a phenomenally bad idea, we can simply look at Backblaze / Crashplan / etc. options, but it's not a priori obvious to me that it really is that bad.

We are also looking at offsite components, but we're treating it more as a nice-to-have than a core requirement. In the event that our physical site is devastated, we may have bigger issues to worry about than our data. (I personally disagree, but it's hard to justify spending $$$ for paranoia...)

If your org can survive loss of all its data, then why is it even a "nice-to-have"?

Joking aside, a massive site disaster isn't the only (or most likely) threat. Someone might come in and steal whatever hardware you're using as your backup solution.

Not so much that it can survive loss of all data, but rather that a disaster KOing the campus 1) may have destroyed the group as well; and 2) is probably a 24/7 CNN-level disaster (we're smack in a major metropolitan area). I still disagree for the record. Theft is a valid concern, but we have campus security officers, and the NAS isn't the most expensive thing we have anyways...

Is this correct? I do like the free 10TB of retrieval vs. Glacier - does it have the same latency? A likely problem is that we may reference the data often enough to make the latency painful.

Correct. I misquoted.

Latency is 90% of your problem. The SLA for the Archive Cloud is 3 hours, though. Oracle has another, more expensive product with a 1 ms SLA.

I ran throughput tests with the Test Account you can get to 'kick the tires'. I recommend you do this too. It may be that you need to get a 'fatter pipe' to meet your SLA.

My download times were completely dependent on my connection to the Internet. That is, during business hours it crawled as business activity hogged the connection. At 4:00 AM I had the full 1 Gb pipe.

In addition, I had no measurable delay during testing in starting a download stream. YMMV.

HTH

Thanks again - I think the 3 hour SLA might actually be good for an off-site option, as that would be fine for disaster recovery. I think relying solely on the cloud product at 1ms SLA might be too expensive for us though. I will check it out in more detail though.

continuum and gusgizmo (and others) - this discussion has more generally made me wonder about requirements. It sounds to me like we may want to back off and think more carefully about our requirements first and whether we really need 100TB now vs. building something for ~25 TB initially, and then looking into a larger scale-up as we're more experienced and drives get bigger...

Theft is a valid concern, but we have campus security officers, and the NAS isn't the most expensive thing we have anyways...

Having worked at a .edu, I'd say your expectations of your campus security officers are vastly overrated. Try an experiment: go into your office building on a Sunday (assuming that you don't normally do that and the staff on-site are people you don't usually see), put a bunch of computers on a cart, wheel them out the front door and put them in your car. Will anyone even bat an eye at this?

Further clarification - we are an academic biological research group. A quick google of your acronyms suggests a couple things: 1) I think the workload is really more along the side of pure "OLAP", as data for archiving will happen infrequently and in a big batch (i.e. a batch of samples are processed at once). We won't have anything remotely resembling an OLTP-like workload with frequent latency-sensitive read/writes.

The example was trying to have a single appliance do two different kinds of workloads that shouldn't be done together. Long-term archives should go on large, slow disk. "Short term" backups should land on faster disk so (a) the backup can be completed faster (this is mitigated somewhat by the fact that the backup should be a large-ish sequential I/O, and that should be easy for a large slow drive to do - making many assumptions about the backup software being used and how it operates) and (b) restores are mostly done from last night's backup, and file-level restores are (in general) not large sequential reads and will totally suck on big, slow drives.

Yar, sounds like it's time to think more about your group's needs. In an academic setting I can definitely understand that funding may be available today but not tomorrow (or 3 years down the road), but that still leads to some... potentially sub-optimal decisions in the long term, given the rate that technology is advancing.

Also does your host institution have any local cloud storage available? That may also be worth looking at.

Depending on where the funding for that "local cloud storage" comes from, they may not be able to use it. The research grants I had the "pleasure" of working with typically stated the money could be used only to support that specific grant. So money could not be allocated to the IT department to beef up existing infrastructure for the research group to share; they had to buy a whole new stack. Once the grant expired the equipment could be used for anything, but, in general, the specific "schools/departments" are pretty territorial about their stuff. That is why a large .edu has N+ IT departments, where N is the number of "schools/departments".

OK, well, I guess my next suggestion is: so long as the backups/archive are managed by some sysadmin (you?) and the users don't have any direct access to it, so they can't put any actual workloads on it, then your Synology should be fine. Make a 12-disk RAID-6 with your 8TB drives, make one volume and just use that.
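Rough capacity for that, before filesystem overhead:

    # 12 x 8TB in RAID-6: two drives' worth of parity.
    disks, drive_tb, parity = 12, 8, 2
    usable_tb = (disks - parity) * drive_tb          # 80 TB
    print(f"{usable_tb} TB (~{usable_tb * 1e12 / 2**40:.0f} TiB as reported)")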

You mentioned that your main concerns are bitrot and data integrity. As far as I know, only ZFS has end-to-end checksumming to combat bitrot (when used with ECC RAM).

With hardware RAID arrays, the danger is that if the RAID controller barfs a few years down the road, the array is generally unrecoverable unless you can find the exact same controller with a similar firmware vintage to replace it. This can be almost impossible, especially if your RAID controller is an integrated appliance-like device such as a Synology. Your appliance is your single point of failure.

Another thing to consider is RAID rebuild times. If a drive dies, it can sometimes take multiple days to rebuild the array. While that's happening, your remaining drives are under severe stress, which can be enough to push another drive over the edge, leading to total loss of redundancy or even data. Generally, smaller and faster drives (and more of them) are better than fewer, bigger, slower drives, since any one drive failure means less data to rebuild in an array. I'd stay away from the 6+ TB "helium" drives, especially since they use SMR (I think all 8TB+ drives do) - these have extremely slow random read times, and RAID rebuilds on these can be miserably long.
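A rough sanity check on rebuild times (assuming ~150 MB/s sustained and a best-case, purely sequential rebuild; real rebuilds under load are usually several times longer):

    # Best-case time to read/write one full drive during a rebuild.
    drive_tb = 8
    mb_per_s = 150                      # assumed sustained throughput
    hours = drive_tb * 1e12 / (mb_per_s * 1e6) / 3600
    print(f"~{hours:.0f} hours minimum per {drive_tb} TB drive")   # ~15 hours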

When building your system, it's important to consider not just how fast and easy it is to use, but how fast and easy it is to replace failed drives (and even correctly identify which exact drive has failed) and how resilient the system is to drive failure. I'm not sure filling a Synology with 8TB drives will be your best bet.

Regardless of what system you use, you absolutely must take into account total system failures. It could be a power surge frying the power supply and circuit boards of multiple drives at once (I've seen this happen), it could be a computer that's caught fire (happened to me, personally, twice!), it could be firmware bugs borking data on drives (this has happened to SSDs; can't say that will never happen to other drives), etc. You could have a bad piece of software corrupt or delete your data, or a malicious employee, or even a stupid typo on the command line. You could be doing everything perfectly and lose data. If you don't back up the data on that NAS, that means that you don't really value that data.

Regarding data storage and backups - you could set up a ZFS system (Linux + FUSE or FreeNAS), and have multiple vdevs with different configurations set up for different purposes, using different drive hardware. That way, your CPU/RAM/motherboard/case cost (which can be >$1k right there) will be used for both usage/storage scenarios. With ZFS, you also don't need to pay for licenses unless you pay for professional support, and the fact that it's a different OS (FreeBSD) from your Linux/Windows stuff gives it some immunity towards viruses, and makes it a less tempting server to run "other stuff" on that could compromise its main mission (storing files).

My experience is supporting about 20 or so locations with Synology and Buffalo NAS devices (anywhere from 1TB to about 30TB or so). I've had 3 NASes fail with multiple drive failures - one was when idiot users ignored the flashing lights and email alerts until it was too late; the other two failed when a second drive failed during rebuilds. One had Seagates, the other WDs (Reds, I think). I sent one of the RAID arrays in for professional recovery (the one where the array had mostly rebuilt when another drive blew up), and recovery was NOT successful. About 50-60% of the files were recovered, but the 40% or so that were missing were actually the most critical (HR info). The site refused to pay for backups or a backup strategy, so that was a VERY expensive lesson for them when they had to settle a court case a few years later where that data was critical.

At home, I have an 18TB raw/12TB usable (soon to be 36TB raw/24TB usable) FreeNAS box that has been running for about 4 years pretty much continuously, and has survived 3 drive failures/replacements so far with absolutely no data loss. I'm also backing the system up to cloud storage (Crashplan). Their cost is $10/mo, unlimited data, so it's cheap when you have larger backups.

Setting up a 25TB system now (and spending time beating it up and testing it before using it with real data) sounds like the way to go. I would buy the hardware myself and test/use it with multiple OSes (Windows Storage Spaces, FreeNAS, etc.) and play with them for a bit, but that's assuming you have the time and knowledge to do that thoroughly. The advantage, though, is that you'll be better prepared to deal with issues when they (inevitably) do come up.

I'd also budget for a test/lab system (a cheap desktop with some cheap drives) to keep around to practice or plan things like data migrations and drive recoveries, so that you won't have to do your first drive replacement on your "real" system. In fact, just get the lab machine now to play with the different NAS/storage software out there.

I'd stay away from the 6+ TB "helium" drives, especially since they use SMR (I think all 8TB+ drives do) - these have extremely slow random read times, and RAID rebuilds on these can be miserably long.

Not currently true, thankfully. Only a specific few use SMR-- HGST Ultrastar Hs14, Seagate Archive series drives, and I think not too many more.

Most other drives are conventional PMR up to 8TB or so, at 10TB and whatnot you see modern helium-filled drives, and I think the only 14TB drive currently available is the rare combination of helium and host-managed SMR.

Note that if you are buying external drives, especially from Seagate, at 8TB and up, you do have to watch out that they're not a Seagate Archive SMR drive inside, but that's not too relevant to the OP.

You mentioned that your main concerns are bitrot and data integrity. As far as I know, only ZFS has end-to-end checksumming to combat bitrot (when used with ECC RAM).

Really great post with lots of great info on data loss. So many good points. Securing data from loss is kinda up there with preventing a meltdown at a nuclear plant - lots of little things to make sure of, because it can go downhill really fast if something slips.

One thing I'd like to add in regards to backups--test your restore! So many times people back up but then have no idea how slow, tedious, or hard it is to actually get their data back where it needs to be.

@kperrier - good point on whether they'd bat an eye, and thanks for the details on partitioning workflows. I think ultimately splitting the two workloads will have to come behind the off-site backup in terms of priorities, but I think we can live with the compromises.

Yar, sounds like it's time to think more about your group's needs. In an academic setting I can definitely understand that funding may be available today but not tomorrow (or 3 years down the road), but that still leads to some... potentially sub-optimal decisions in the long term, given the rate that technology is advancing.

Also does your host institution have any local cloud storage available? That may also be worth looking at.

I returned your PM - thanks! We do have some flavors of cloud storage available, but kperrier is close with respect to funding concerns - we'd have to pay for the storage ourselves from our own funding, and the rate is $300-400/TB-yr, depending on exact implementation. This... seems a bit pricey, especially since it comes with features we don't necessarily need, and that kind of continuing expense is substantial.

You mentioned that your main concerns are bitrot and data integrity. As far as I know, only ZFS has end-to-end checksumming to combat bitrot (when used with ECC RAM).

A super helpful post - thanks! I was personally going to go the ZFS+ECC route a while back for personal files (you can find this if you go back a couple years in my post history here). My concern is that I don't have personal experience with this, and this is much more significant than personal backups. If I (or the OS, etc.) screw up and need to do a restore from backup, my personal files can wait until it's convenient, whereas this would be a pants-on-fire moment. In your experience, do you think amateur sysadmin + ZFS+ECC is a lesser risk than that of RAID failure on a Synology, etc?

Most of that is sysadmin salary, so to do an apples-to-apples comparison, include the percentage of your salary that would go towards this project. But you are right that it may have more features than you need, e.g. off-hours coverage.

A super helpful post - thanks! I was personally going to go the ZFS+ECC route a while back for personal files (you can find this if you go back a couple years in my post history here). My concern is that I don't have personal experience with this, and this is much more significant than personal backups. If I (or the OS, etc.) screw up and need to do a restore from backup, my personal files can wait until it's convenient, whereas this would be a pants-on-fire moment. In your experience, do you think amateur sysadmin + ZFS+ECC is a lesser risk than that of RAID failure on a Synology, etc?

This is a tough call. I think if you use a well-designed FreeNAS setup and you don't muck with it (beyond carefully updating it), how is that different from a Synology appliance? After all, a Synology is just a mini server with a NAS OS, just like FreeNAS can be, except that it's locked into a proprietary case/enclosure. You arguably have a better chance of recovering data in an oh-shit situation with FreeNAS because you can leverage the FreeNAS forums (cyberjock notwithstanding - guy seems to know a lot but he's a raging asshole), the forums here on Ars, and FreeNAS is really just ZFS on BSD with a pretty skin, so almost any other current ZFS implementation can read your array.

ZFS can be very simple if you stick to the basics. You also get the flexibility of setting up multiple vdevs, more than what you can get with a Synology, and you can expand the system by adding vdevs as needed without having to replace the whole appliance when you run out of drive slots, like what happens with a Synology. If you stay away from ZIL drives and attempts to make it do more than just serve files, it's just another NAS appliance.

I'd stay away from the 6+ TB "helium" drives, especially since they use SMR (I think all 8TB+ drives do) - these have extremely slow random read times, and RAID rebuilds on these can be miserably long.

Not currently true, thankfully. Only a specific few use SMR-- HGST Ultrastar Hs14, Seagate Archive series drives, and I think not too many more.

Awesome - appreciate you stepping in to set the record straight. Thanks! I've fallen behind on my drive tech. Haven't had time to keep up with reviews.

I work as a storage admin in higher-ed. At petabyte scale my acquisition costs for the sort of thing you are looking for would be ~$150-175/TB on a five year model. That's storage hardware and maintenance only.

I would be very concerned that your current budget would only afford you a 'solution' that would periodically eat your data, suck your time, and perform inadequately...

The closest things to your budget that I would look at would be something like what I see in the Dell return/refurb channel. You could get something that wasn't brand new, but came with support, for maybe $8k-10k with three years of support.

But it sounds like part of your workload is cloud-suitable. I'd suggest looking at a smaller local storage unit, plus cloud services to back it up and serve as archive storage. That gets you out of bandwidth charges for your more active data, and affords you some sort of backup for that data.

I work as a storage admin in higher-ed. At petabyte scale my acquisition costs for the sort of thing you are looking for would be ~$150-175/TB on a five year model. That's storage hardware and maintenance only.

Totally different end of the scale. My 18TB (raw) / 12TB (usable) NAS/server was about $1,000 for the case/RAM/CPU/motherboard/misc and another $700 in storage (6 x 3TB WD Red drives @ $110 ea). This was back in 2014 or 2015, when 3TB was the sweet spot. My server pulls double duty as my ESXi host running 4 VMs, so the CPU, mobo and RAM costs are probably a bit higher than if this had been a pure storage appliance, but I think the difference would have been maybe $200 at most. I've since replaced 3 of those drives (not a fan of WD Red drives now, BTW). Since the replacements were refurbs, I've relegated them to secondary purposes and bought new replacements at full price (I don't trust refurb drives for my critical data).

So - initial cost was about $1800, or about $150/TB of usable space over 3-4 years, and hopefully with no more costs for another 1-2 years. If I factor buying the replacements at full price, then it's closer to $2100, so closer to $175/TB usable over about 4-5 years. And this is with a very nice server mobo (dual NIC, IPMI, on-board LSI controller etc), ECC RAM, etc but not redundant power, and a fairly inexpensive case. My maintenance costs were the replacement drives, and assorted shipping costs.

It's interesting how closely the costs track at both ends of the scale - $150 - $175 / TB usable.

I work as a storage admin in higher-ed. At petabyte scale my acquisition costs for the sort of thing you are looking for would be ~$150-175/TB on a five year model. That's storage hardware and maintenance only.

It's interesting how closely the costs track at both ends of the scale - $150 - $175 / TB usable.

And what's really interesting is that I just ran the numbers for our 2TB mirrored 3x setup x2, and it's just basically a bunch of 2TB drives in external enclosures connected to a computer running a robocopy script. $300 for each set of 3 drives and enclosures (enterprise-class HGST or WDC RE drives with 5yr warranty) x 2 = $600 for 4TB usable = $150/TB.

So it seems like if you're not near $150/TB you're either cutting corners or super way overkill.

Sorry for the late response. I've been away for a few years, and just dropped by to see what was happening.

I recently delivered a high redundancy 500TB storage server (single head-node but redundant data paths everywhere, including SAS disks) to a customer. This post is based on that experience, plus over 25 years of supporting "Big Data" services.

First, you should consider backing up your "long-term archive" in the Cloud. Be sure to encrypt the data first, and DON'T LOSE THE KEY.
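A minimal sketch of what "encrypt the data first" can look like, assuming the Python 'cryptography' package is available; the file names here are hypothetical, and the key file is the thing you must not lose:

    # Hypothetical client-side encryption before pushing an archive to cloud storage.
    from cryptography.fernet import Fernet

    def encrypt_file(src_path, dst_path, key_path):
        key = Fernet.generate_key()                  # store this somewhere safe and SEPARATE from the backup
        with open(key_path, "wb") as kf:
            kf.write(key)
        with open(src_path, "rb") as f:
            ciphertext = Fernet(key).encrypt(f.read())   # reads the whole file; chunk/stream big archives instead
        with open(dst_path, "wb") as f:
            f.write(ciphertext)

    # encrypt_file("archive-2018-01.tar", "archive-2018-01.tar.enc", "archive-2018-01.key")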

As I read this, you're looking to minimize cost while maximizing capacity. In general, using a commercial product will be contrary to that goal. If cost is a big factor, you should consider a DIY product, probably based on a Supermicro or Intel head node, with Supermicro disk trays. Since your requirement is only 100TB, it would be possible to put it all into a single chassis, but the 5-year costs, which should include a head-node replacement, are likely to be higher. Also, long-term storage growth should be considered, which means that disk trays make more sense.

The only file-system I would suggest for this is ZFS. Run it with either Linux or FreeBSD, whichever your shop has more experience with. If there's no experience with either, a commercial product might be a better choice.

If you're concerned about the manpower to manage the systems, then you're probably doing something wrong. Get a CM tool like Puppet or Chef and use it. You'll be surprised at how much it simplifies your life.

The head node should have a pair of small (cheap) SSDs for the OS, and at least two high-write SSDs for the ZIL/LOG and for caching (L2ARC). It could also be used to hold Hot-Spare drives.

The disk trays can hold the main storage arrays. The 6-drive RAID-Z2 choice is a good one, but you might have some performance issues with only two arrays. Be sure to leave space to add arrays. Also, please remember that there's a performance hit for a while after an array is added, due to the way ZFS allocates new writes.
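To put rough numbers on "leave space to add arrays", here's how usable capacity grows if you keep adding 6-drive RAID-Z2 vdevs of 8 TB disks (before ZFS overhead):

    # Capacity growth from adding 6-wide RAID-Z2 vdevs (8 TB drives assumed).
    drive_tb, width, parity = 8, 6, 2
    for vdevs in range(2, 6):
        usable = vdevs * (width - parity) * drive_tb
        print(f"{vdevs} vdevs ({vdevs * width} drives): ~{usable} TB usable")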

Theft is a valid concern, but we have campus security officers, and the NAS isn't the most expensive thing we have anyways...

Having worked at a .edu, I'd say your expectations of your campus security officers are vastly overrated. Try an experiment: go into your office building on a Sunday (assuming that you don't normally do that and the staff on-site are people you don't usually see), put a bunch of computers on a cart, wheel them out the front door and put them in your car. Will anyone even bat an eye at this?

We're a big enough .edu that we have campus security and an actual police department as well as local PD in two cities (we're kinda big) and I rolled out a 7' tall server rack. The security guy held the door for me. KP's right about things walking out. It happens.

After some discussions both in-house and also with the ever-useful continuum, I think trying to actively protect against bitrot may not be within our means, and it's not clear that it would be an issue in practice.

We'll be going commercial, with a rackmount Synology model. DIY is still an attractive option, especially with the advice given here, but fundamentally it's easier to spend somewhat more money vs. going out of our area of expertise to build something critical ourselves. We need to be wise with our spending, but we're not that cash-strapped that it's worth potentially diverting someone for a sustained period of time to save a couple thousand dollars.

Cloud storage looks pretty great for off-site backups, but latency is too much of an issue to actually use that as primary storage.

We'll be going commercial, with a rackmount Synology model. DIY is still an attractive option, especially with the advice given here, but fundamentally it's easier to spend somewhat more money vs. going out of our area of expertise to build something critical ourselves.

There's a third option that you may not have investigated. If you know of a reputable system builder in your area, they could build and maintain the system for you, probably including the OS, and still cost less than the Synology. If you're in the San Diego area, I'd be willing to work with you on this (I don't support remote customers - it doesn't work well enough). After all, it's ONLY 100TB.
