We are facing the challenge of storing a lot of data in a create-once-read-many (99.9% read) environment; initial reports point at 200 to 2000TB created over a three-year time frame. Assuming quite a lot, we will most likely need to store around 0.5PB, and by the time I know exactly, the basic design or tender needs to be mostly done. So I thought I'd ask around a bit and see what others do, and how.

The data will be static and mainly images. The format is unknown and might be JPEG, TIFF, raw or some completely different image format. There will also be other forms of data; everything will be addressed through some sort of database.

We have been thinking along the lines of checksums, n+2 RAID, 25TB filesystems and 100TB nodes. Backup is done at the filesystem level at creation, and a full backup is then done once a year, also at the filesystem level. SLAs are decent.
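For the checksum piece, a minimal sketch of the sort of thing we have in mind (paths and the manifest format are made up for illustration; assuming SHA-256 is computed once at ingest and re-verified before the yearly full backup):

Code:

import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large images never have to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(root: Path, manifest: Path) -> None:
    """Record '<sha256>  <relative path>' for every file under root at ingest time."""
    with manifest.open("w") as out:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                out.write(f"{sha256_file(p)}  {p.relative_to(root)}\n")

def verify_manifest(root: Path, manifest: Path) -> list[str]:
    """Return the relative paths whose current digest no longer matches the manifest."""
    mismatches = []
    with manifest.open() as m:
        for line in m:
            digest, rel = line.rstrip("\n").split("  ", 1)
            if sha256_file(root / rel) != digest:
                mismatches.append(rel)
    return mismatches

# Hypothetical layout: one manifest per 25TB filesystem.
# write_manifest(Path("/srv/fs01"), Path("/srv/fs01.manifest"))
# print(verify_manifest(Path("/srv/fs01"), Path("/srv/fs01.manifest")))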

Since funding might be scarce and it's a do-or-die situation, we might end up building this ourselves from off-the-shelf parts (the proper disclaimers are being written).

Any suggestions on how to do this:

A) In a best-practice way, enterprise style.
B) In a reasonably cheap way, but still with vendor support.
C) DIY, a.k.a. FreeNAS and similar.

Actually, any help at all is nice =)

We have reasonable experience in DIY and good Unix and storage know-how, but we live in the tens of TB and are actually a little scared about going into the hundreds without vendor backing. We are a former Sun/Linux, current NetApp/Linux shop in regards to storage.

What sort of budget, both initial and sustaining, will go towards this effort?

Just how hard will you be reading the data?

Isilon certainly fits the use case... with a couple of caveats. I haven't seen EMC price it aggressively enough to beat out more traditional alternatives, and if they did, I suppose I'd wonder if that pricing would hold through your upgrades, because the best way to make use of Isilon is to buy just what you need, just when you need it, rather than acquiring lots of resources up front.

If the I/O to the units isn't going to be too intensive, I can think of a few solutions that would price out below $1k/TB of usable space (some well below that) and still get you a comfortable level of support (EMC, IBM, Dell, ...).

I would be reluctant to DIY it, as that would mean taking on the responsibility for engineering and maintenance, and it would put you in a position where you may only become aware of issues when you actually experience them, rather than when some other poor SOB does (and your vendor supplies you a fix for something that hasn't even happened to you yet). So if data integrity and uptime have any value, the DIY option has some significant downsides.

Enterprise: EMC, NetApp, IBM and others with proven track records and solid data integrity. Cheap: Fujitsu/Dell/whatever known brand of RAID or JBOD boxes, preferably under some sort of software umbrella; servers are standard known-brand-name stuff. DIY: Supermicro or similar with off-the-shelf nearline drives or cheap JBODs, preferably running Linux or Solaris, but *BSD will do.

Reads from a subset will be low to moderate; this we can easily break off and put on enterprise-level stuff. Reads on the bulk of the data will be low, sequential and fed out over network connections.

Along with the usage, you have to ask yourself something more along the lines of your business: what do you exist for?

If you are primarily a group that munches data, and that's where you make your value from, then having reliable access to that data is worth your while.

If you are a group that needs a lot of data but any individual piece of that data is of limited value, then going open source with baling wire and duc[k|t] tape may be the preferred method of operation.

If you are trying to be a new cloud provider, and cheap, large-scale storage is key to your business success, investing internally in a true home-grown solution may be a better option, even if it is slightly riskier (there are many papers on large-scale distributed file systems; implementing one of them isn't impossible).

A couple of things we've acquired recently that aren't too far out for you:

VNX 5300 with 120 x 2TB disks; they cost a little over $100k each, with five years of maintenance, and they net ~175TB of usable space (RAID6, reasonable hot spares, ignoring vault drives). We got these to use as part of the production storage farm for a VMware cluster, with the idea that we would only load them until the IOPS/latencies suggested that they were 'full' rather than until they were out of space. We are relying (at least somewhat) on VMware Storage DRS to keep watch on them and keep them balanced. Compared to other alternatives, we only need to get 30-40% space utilization out of them to 'break even' on that concept. We could have gotten them with 3TB disks instead for around 30% more, but we felt we would never use the extra space due to the IOPS/latency constraint; your workload may be different.

Dell MD3660f with 60 x 3TB disks; they cost about $45k each, again with five years of maintenance, and they will net ~130TB each (RAID6, reasonable hot spares). These are to be used as backup-to-disk targets in a D2D2T design. They don't need to multi-task too well, as most of their I/O will be streaming in and out.
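To sanity-check those usable figures, a rough back-of-the-envelope calculation (the RAID group sizes, spare counts and formatting overhead below are my own guesses for illustration, not our exact layouts):

Code:

def usable_tb(disks: int, disk_tb: float, raid_group: int, hot_spares: int,
              reserved: int = 0, fs_overhead: float = 0.10) -> float:
    """RAID6 loses two disks per group; subtract spares/reserved disks, then filesystem overhead."""
    data_disks = disks - hot_spares - reserved
    groups = data_disks // raid_group
    return groups * (raid_group - 2) * disk_tb * (1 - fs_overhead)

# VNX 5300: 120 x 2TB, assuming 14-disk RAID6 groups, 4 hot spares, 4 vault drives.
print(usable_tb(120, 2.0, raid_group=14, hot_spares=4, reserved=4))  # ~173TB, close to the quoted ~175TB

# MD3660f: 60 x 3TB, assuming 14-disk RAID6 groups and 2 hot spares.
print(usable_tb(60, 3.0, raid_group=14, hot_spares=2))               # ~130TB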

For reference on the pricing: we are a .edu, so we do get a decent discount, and for both of those items we were buying a quantity of several. But I believe any customer can get to a similar price range with just a little vendor management. Key to that is flexibility; EMC pitched Isilon, and IBM, NetApp, Dell, Nimble, and some others offered products in that same size/performance neighborhood, but most of them were dropped from consideration before we got to a serious price discussion. No one will (IME) just lead with their rock-bottom pricing; I prefer to narrow the field to several reasonable options based on technical merits, then beat them against each other until an acceptable price appears. It can take some time and patience.

That's a fair question. Even with the relatively dense 60-disks-in-4U MD3660, you are looking at three full-size racks, with power and cooling. More traditional densities would put you closer to 8-10 racks for 2000 x 3.5" spindles.

But some vendors are starting to offer 4TB units today, and of course you probably don't want (for many reasons) to just go out today and acquire the full maximum space required. So you don't need all of that up front, and you may never need 8-10 racks; but if it would be completely impossible to get there, you should keep that in mind now so that you don't get down the road and discover it later.
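The rough arithmetic behind those rack counts (enclosure sizes and usable rack units per rack are assumptions):

Code:

import math

def racks_needed(spindles: int, disks_per_enclosure: int, enclosure_u: int,
                 usable_u_per_rack: int = 40) -> int:
    """Pure geometry: ignores servers, controllers and switches sharing the racks."""
    enclosures = math.ceil(spindles / disks_per_enclosure)
    return math.ceil(enclosures * enclosure_u / usable_u_per_rack)

print(racks_needed(2000, 60, 4))   # dense 60-in-4U enclosures: ~3-4 racks
print(racks_needed(2000, 12, 2))   # traditional 12-in-2U shelves: ~9 racks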

It needs to be online. HSM is unlikely but might be doable; the access time would probably be at the very limit of how far it can be stretched.

With HSM it is "online" just the older stuff is on tape. There is no user intervention in pulling the file off of tape for access. Depending on the size of the file, and your configuration of the tape layer and the number of requests being serviced, you will be able to get to the files on tape within a couple of minutes.

I am just throwing it out there as an option. IMO, unless you have an access requirement that precludes an HSM solution, it should be looked at.

I've mentioned this before in some of these "large storage" threads. Perhaps stepping outside the normal box for a second could be of use. What about the clustered filesystems? Or even HDFS (you can mount the HDFS space via FUSE)?
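To illustrate the FUSE point: once the HDFS namespace is mounted, the application side is just ordinary file I/O, no HDFS-specific client code needed (the mount point and path below are hypothetical):

Code:

from pathlib import Path

# Hypothetical FUSE mount point for the HDFS namespace.
HDFS_MOUNT = Path("/mnt/hdfs")

def read_image(relative_path: str) -> bytes:
    """Fetch one stored image through the FUSE mount as a plain file read."""
    return (HDFS_MOUNT / relative_path).read_bytes()

# data = read_image("archive/2013/scan_000123.tif")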

What are your write/read loads like? Who is actually going to be getting at the data, and how are they going to access it?

My recommendation: Isilon cluster with NL-series nodes. You can probably get down to about $1k/TB and you can buy in increments.

My former employer went this route. When they added a node for more capacity (another NL node I believe) the whole cluster ground to a halt while it rebalanced the data. This happened after I left so I don't know how they fixed that.

Quote:

My recommendation: Isilon cluster with NL-series nodes. You can probably get down to about $1k/TB and you can buy in increments.

My former employer went this route. When they added a node for more capacity (another NL node I believe) the whole cluster ground to a halt while it rebalanced the data. This happened after I left so I don't know how they fixed that.

You don't really have to rebalance if you know you don't need to. But that is an issue with any system that allows expansion. Either you rebalance to spread out the data or you don't and then get different read and write performance depending on which data you're reading. Since the OP has 99% reads, I don't see a need to rebalance.

A .edu here too. We went with two Isilon clusters using NL nodes. We're in the process of migrating off of our old Sun Thumpers and our DFS environment.

We looked at a gamut of options out there, and in the end it made more sense to go with a fully supported product with plenty of real-world usage and something that any administrator can pick up and understand.

Another question -- do you need a large amount of computation capacity to go with this data you are storing? In that case, getting 1000 nodes to each have 2 drives and running a Hadoop cluster isn't unreasonable.
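For scale, some rough arithmetic on that idea (the drive size is an assumption; 3x is the HDFS default replication factor):

Code:

# Back-of-the-envelope for a 1000-node Hadoop cluster with 2 drives per node.
nodes = 1000
drives_per_node = 2
drive_tb = 3          # assumed nearline drive size
replication = 3       # HDFS default replication factor

raw_tb = nodes * drives_per_node * drive_tb
usable_tb = raw_tb / replication
print(raw_tb, usable_tb)   # 6000TB raw, ~2000TB usable before any other overhead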

I'd favor the "roll your own" route if money's tight, you have someone competent enough to roll it, and you really need it. The problems with that approach in general should be considered, though, and are a different beast.

So, could you maybe add a little detail? Can the huge mass of data be broken into smaller chunks semantically, like a couple of sets below 500TB each? How many files do you expect: mostly big ones above 10MB/100MB/1GB, or also a buttload of small ones? You mention a database at some point; could you elaborate?

99.9% read is good to know, but not enough to know. How random will read access be? 99% of reads hitting the last couple of TB written? How big is an average read, how many reads do you get per unit of time, and what is an acceptable latency?

What software is accessing it and on what level? Is it your own software you could modify? How does it access metadata?

There's a multitude of ways to handle this, and I would choose to do whatever is easiest. That absolutely depends on what you actually do with that data and how it is accessed.

Quote:

Another question -- do you need a large amount of computation capacity to go with this data you are storing? In that case, getting 1000 nodes to each have 2 drives and running a Hadoop cluster isn't unreasonable.

That was more or less my suggestion earlier (running a job tracker/task trackers is optional, obviously). The ratio of I/O to compute power can be adjusted via more or fewer drives per node.

If you want to cut out the middleman and use your own hardware vendor, Dell OEMs Caringo's CAStor product, which was written by the guys who wrote the EMC Centera. That being said, we don't do any object-based storage, but we looked at Caringo to support our Enterprise Vault setup.

The only thing that bothers me about object-based storage is that none of them explain how you can back up the whole environment to tape. It's always replication to another instance.

At two different employers I have had horrible experiences with Centera storage. I would not touch that, or anything developed by anyone associated with the platform, with a 10-foot pole. You would be better off with white-box servers and a Gluster cluster.

Quote:

If you want to cut out the middleman and use your own hardware vendor, Dell OEMs Caringo's CAStor product, which was written by the guys who wrote the EMC Centera. That being said, we don't do any object-based storage, but we looked at Caringo to support our Enterprise Vault setup.

As far as the software goes, you are right, but the hardware with support is Dell, which I prefer over any small vendor. Instead of dealing with two vendors and possible finger-pointing, I think it's smarter to go with one and get full support in one place, in this case from Dell.

Quote:

The only thing that bothers me about object-based storage is that none of them explain how you can back up the whole environment to tape. It's always replication to another instance.

It's easy: you just access it via CIFS or NFS, just like you do with any other file shares. Plus, both of Dell's backup offerings support object-based storage directly, as far as I know (I *strongly* suggest CommVault over that bugfest Backup Exec, though).

As an aside from what others have mentioned, a big advantage of Isilon is that you buy the capacity you need now and add single nodes (of the same type) as your dataset grows, without being limited to the model or design the way you can be with some traditional storage plays. My second point, and the one I really like from a lifecycle standpoint, is that once you're on the platform, retiring nodes as they age and replacing them with newer, and probably cheaper/denser, nodes down the road will not require a migration project or a change of namespace. Moving 2PB of data can be a challenge, and going with a build-your-own may be a risky venture if the business depends on this data. At least with the auto-balance issue mentioned here you have a path to resolution; with FreeNAS, maybe not so much. Incidentally, I've just rebalanced nodes in production and it did not grind to a halt, so perhaps there was another issue, as that's not normally a problem and the rebalance can be scheduled, as others have mentioned.