February 22, 2011

I wouldn't have picked up on this story if Storagezilla hadn't retweeted it and I'm not going to blame him but sometimes journalists and marketeers drive me mad; perhaps I'm a little naive to expect some kind of accuracy and sense but I still do!

That sounded a really interesting little story and as it was tweeted by 'Zilla, I knew it was going to be about Isilon. And it was about space! How cool!

Then I read the story and really it's making something out of not a lot. Two Isilon clusters with 11 nodes and 700 Terabytes of disk each; okay, that's a reasonable size but it's not petabytes; it's 1.4 petabytes, over a petabyte indeed but petabytes? I expect to see at least two petabytes. And actually, they are a mirror pair, so less than a petabyte of unique data (and there's also no mention of whether that is usable as opposed to raw).

Of course then you pick out the detail; 100 Terabytes of data to start with, growing at approximately 170 Terabytes a year. So it could end up being petabytes eventually...maybe!

However, there is another, even less positive spin to put on this: Isilon have managed to sell about 3-4 years' capacity which will sit there spinning and depreciating.
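As a rough check on that 3-4 year figure, here is a back-of-the-envelope sketch. It assumes the 700 Terabytes per cluster is usable capacity and that growth is a steady 170 Terabytes a year; the story confirms neither, and the function name is purely illustrative.

```python
def years_of_headroom(capacity_tb, initial_tb, growth_tb_per_year):
    """Years until steady linear growth fills the given capacity.

    Assumes capacity_tb is usable (not raw) and growth is constant,
    neither of which the original story confirms.
    """
    return (capacity_tb - initial_tb) / growth_tb_per_year


# Figures quoted in the story: 700 TB per cluster, 100 TB to start,
# growing at roughly 170 TB a year.
years = years_of_headroom(700, 100, 170)
print(f"{years:.1f} years of capacity bought up front")
```

Because the two clusters are a mirror pair, the sum is done against one cluster rather than the combined 1.4 petabytes.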

Great job by the salesman, but Isilon have great technology which allows capacity to be added non-disruptively and smoothly as and when required. That makes selling all this capacity up front pretty unnecessary; it simply gets the customer to pay up front for capacity that they don't yet need.

That's the sort of behaviour that as an end-user drives me nuts! I understand why the vendor does so but don't we keep talking about partnership and don't vendors keep talking about efficiency?

Of course, I could just be channelling my friend Ian but actually I think it's just my own grumpiness this time!

As we seek to constrain and control the explosion in data growth, is deletion of data and reclamation of storage an economically viable methodology?

I’ve seen a few articles over the past 18 months which calculate that this does not really make sense; if the cost of the work required to reclaim that storage exceeds the value of the storage reclaimed, then it does not make sense to do so. I remember looking at the cost of SanScreen when it was an independent company; their big sell was that it paid for itself in identifying orphaned storage and reclaiming that. Unfortunately, it didn’t.
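The purely economic test those articles apply can be sketched in a few lines. Everything here is hypothetical: the function, the parameter names and the figures are illustrative assumptions, not numbers from any of those articles or from any vendor.

```python
def reclamation_pays_for_itself(reclaimed_tb, value_per_tb,
                                effort_hours, hourly_cost,
                                tooling_cost=0.0):
    """Return True if the value of the storage reclaimed exceeds
    the cost of the work (plus any tooling) needed to reclaim it.

    All inputs are hypothetical estimates supplied by the caller.
    """
    saving = reclaimed_tb * value_per_tb
    cost = effort_hours * hourly_cost + tooling_cost
    return saving > cost


# Illustrative figures only: 20 TB reclaimed at 500 per TB does not
# cover 120 hours of work at 100 an hour, but 50 TB does.
small_exercise = reclamation_pays_for_itself(20, 500, 120, 100)
large_exercise = reclamation_pays_for_itself(50, 500, 120, 100)
print(small_exercise, large_exercise)
```

This deliberately captures only the short-term cost argument; it puts no value on legal exposure, back-up windows or migration effort.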

But does that mean that carrying out this sort of exercise is not worth doing? My answer to that is No! The benefits of good data management stretch beyond the economic benefits of reclaiming storage and more effective use of your storage estate.

If you never carry out this sort of exercise, you have resigned yourself to uncontrolled data growth; you have given up. Giving up is never a good idea even in the face of what feels like an unstemmable tide; you do not need to sit like Cnut and try to stop the tide coming in but you can slow it and take a greater degree of control.

This sort of exercise can be important in understanding the data that you are storing and understanding its value. And interestingly enough, you might actually want to delete valuable data for a whole variety of reasons.

You need to understand the legal status and value of the data; email in a legal discovery situation is the classic example: if you have the data, you can be asked to produce it. This can be extremely costly and can be even more costly if you discover at a later date that you can produce data when you have said you can’t.

Those orphaned LUNs in your SAN, do you know whether or not they contain legally sensitive data? Those home-directories of ex-employees, is there sensitive data stored there? The unmounted file-system on a server which has never been destroyed?

It is also important to understand the impact on the entire estate of keeping everything for ever; what is the impact on your back-up/recovery strategies? What is the impact on the system refresh and data migration in five years' time? Do you only carry out this exercise when you are refreshing? If so, you are probably going to put your migration strategy back a number of months and you could end up paying additional maintenance for longer.

There are many other consequences to a laissez-faire approach to data management; don’t just accept that data grows forever without bounds. Don’t listen to storage vendors who claim that it is cheaper to simply grow the estate but understand it is more than a short-term cost issue.

No, good data management including storage reclamation needs to become part of the day-to-day workload of the Data Management team.

February 20, 2011

Big Data like Cloud Computing is going to be one of those phrases which like a bar of soap is hard to grasp and get hold of; the firmer the hold you feel you have on the definition, the more likely you are to be wrong. The thing about Big Data is that it does not have to be big; the storage vendors want you to think that it is all about size but it isn’t necessarily.

Many people reading this blog might think that I am extremely comfortable with Big Data; hey, it’s part of my day-to-day work isn’t it? Dealing with large HD files, that’s Big Data isn’t it? Well, they are certainly large but are they Big Data? The answer could be yes but the answer generally today is no. As I say, Big Data is not about size or general bigness.

But if it’s not big, what is Big Data? Okay in my mind, Big Data is about data-points and analysing these data-points to produce some kind of meaningful information; in my mind, I have a little mantra which I repeat to myself when thinking about Big Data; ‘Big Data becomes Little Information’.

The number of data-points that we now collect about an interaction of some sort is huge; we are massively increasing the resolution of data collection for pretty much every interaction we make. Retail web-sites can analyse your whole path through a web-site; not just the clicks you make but the time you hover over a particular option. This results in hundreds of data-points per visit; these data-points are individually quite small and collectively may actually result in a relatively small data-set.

Take a social media web-site like Twitter for example; a tweet being 140 characters, so even if we allow a 50% overhead for other information about the tweet, it could be stored in 210 bytes and I suspect possibly even less; a billion tweets (an American billion) would take up about 200 Gigabytes by my calculations. But to process these data-points into useful information will need something considerably more powerful than a 200 Gigabyte disk.
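For anyone who wants to check the sums, here is the calculation as a small sketch. It assumes one byte per character, which is a simplification, and the constant names are just illustrative.

```python
TWEET_CHARS = 140            # maximum tweet length quoted above
OVERHEAD_PERCENT = 50        # allowance for other information about the tweet
TWEETS = 10**9               # an American billion

# Assumption: one byte per character (real encodings may need more).
bytes_per_tweet = TWEET_CHARS * (100 + OVERHEAD_PERCENT) // 100

total_bytes = bytes_per_tweet * TWEETS
total_gb = total_bytes / 10**9     # decimal Gigabytes
total_gib = total_bytes / 2**30    # binary Gibibytes

print(bytes_per_tweet, "bytes per tweet")
print(f"{total_gb:.0f} GB or {total_gib:.1f} GiB for a billion tweets")
```

Whichever definition of a Gigabyte you prefer, the answer lands at roughly 200.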

A soccer match for instance could be analysed in a lot more detail than it is at the moment and could generate Big Data; so those HD files that we are storing could be used to produce Big Data to then produce Little Information. The Big Data will probably be much smaller than the original Data-set and the resulting information will almost certainly be much smaller.

And then of course there is everyone’s favourite Big Data; the Large Hadron Collider, now that certainly does produce Big Big Data but let’s be honest, there aren’t that many Large Hadron Colliders out there. Actually, a couple of years ago I attended a talk by one of the scientists involved with the LHC and CERN all about their data storage strategies and some of the things they do. Let me tell you, they do some insane things including writing bespoke tape-drivers to use the redundant tracks on some tape formats and he also admitted that they could probably get away with losing nearly of their data and still derive useful results.

That latter point may actually be true of much of the Big Data out there and that is going to be something interesting to deal with; your Big Data is important but not that important.

So I think the biggest point to consider about Big Data is that it doesn’t have to be large but it’s probably different to anything you’ve dealt with so far.

February 16, 2011

You will note that I have turned blog comments off on this blog (or at least tried to); I'm not trying to curtail discussion but now that www.storagebod.com is live and appears to be working okay, I'd like the debate to happen there!

So if you can update your bookmarks and head on over there, I'd really appreciate it!!

2011 appears to be the year where everyone bunches up as they try to climb the mountain of storage efficiency and effectiveness. Premium features which were defining unique selling points will become commonplace and this will lead to some desperate measures to define uniqueness and market superiority.

I'd like to take a smaller and relatively unknown player in the storage market as an example of how features which even last year were company defining could rapidly become commonplace; certainly if you are outside the world of media and working in a more traditional enterprise, I would be surprised if you had come across a company called Infortrend.

Infortrend make low to mid-range storage arrays which seem to turn up fairly often in media, often packaged as part of a vertical solution; it is not a company you would really expect to tick all of the boxes with regards to the latest features. Yet if you look at their latest press release, you will find that they offer or are planning to offer over the next year

Thin provisioning

Deduplication

Automated Storage Tiering

Replication

Snapshots

etc, etc...

So if even the smaller vendors are offering these features; what are the big boys going to have to do to try to differentiate their offerings? Vertical integration and partnerships with the other enterprise vendors such as VMware and Cisco is going to be one area where they can differentiate, their size makes the levels of investment required in these partnerships a lot easier. However sometimes, these smaller vendors such as Infortrend plough an interesting furrow by partnering with smaller niche application vendors who do not have the clout to get time with the bigger vendors. And before we count this out as a strategy; Isilon managed to grow at first as a niche company.

Management tools and automation are one place which needs continuing innovation and investment but interestingly enough, often the smaller vendors excel in ease-of-use out of necessity. Smaller sales-forces, smaller technical support teams and a channel-focused approach to market mean that their systems must be easy to use, although they do often fall down on the scalable management and automation front.

Yet at the end of the day, it could well come down to a marketing war and marketing budgets.

So EMC have had to bow to the inevitable and join the Storage Performance Council; as Chuck mentions in his reply to Chris' article, there are public agencies now mandating SPC membership for RFP submission; I am also aware that there are some other large storage users who are starting to do similar. But will we see benchmark wars and will it be a phoney war?

EMC have been selectively cherry-picking the SPECsfs benchmarks for some time; they are inordinately proud of their benchmarks around SMB performance and I suspect such cherry-picking will continue. Certainly, many of the other vendors cherry-pick; for example, IBM won't submit XIV (unless something has changed) because it will crash and burn.

The configurations benchmarked are very often completely out of kilter with any real world configuration but at least any records EMC break in this area will have more relevance than someone jumping over a number of arrays on a motor-bike.

And I wonder if EMC's volte-face will fit very nicely into their current 'Breaking Records' campaign; what better way to announce your 'embracing' of a benchmark than smashing it out of sight for the time being? And if they fail to do so, I do however have it on good authority that EMC are going to submit the number of people you can fit in a Mini and array jumping as a standard to SPC.

Is it a case that we can finally beat them, so we'll join them? As NetApp continue to slowly transmogrify into EMC, I wonder if EMC are going to meet them halfway.

February 14, 2011

As we continue to create more and more data, it is somehow ironic and fitting that the technology that we use to store that data is becoming less and less robust. It does seem to be the way that as civilisation progresses, the more that we have to say, the less chance that in a millennium's time it will still be around to be enjoyed and discovered.

The oldest European cave paintings date to 32,000 years ago with the more well known and sophisticated paintings from Lascaux being estimated to be 17,300 years old; there are various schools of thought as to what they mean but we can still enjoy them as artwork and get some kind of message from them. Yes, many have deteriorated and many could continue to deteriorate unless access to them is controlled but they still exist.

The first writing emerges some 5000+ years ago in the form of cuneiform; we know this because we have discovered clay and stone tablets; hieroglyphs arrived possibly a little later than this with papyrus appearing around the same time followed by parchment. Both papyrus and parchment are much more fragile than stone and clay; yet we have examples going back into the millennia B.C.E.

Then along came paper; first made from pulped rags and then from pulped wood; mass produced in paper mills, this and printing allowed the first mass explosion in information storage and dissemination, and yet paper is generally a lot less stable than parchment, papyrus and certainly stone and clay tablets.

Still paper is incredibly versatile and indeed was the storage medium for the earliest computers in the form of punch cards and paper-tape. And it is at this point that life becomes interesting; the representation of information on the storage medium is no longer human readable and needs a machine to decode it.

So we have moved to an information storage medium which is both less permanent than its predecessors and needs a tool to read it and decode it.

And still progress continues, to magnetic media and optical media. Who can forget the earliest demonstrations of CDs on programmes such as Tomorrow's World in the UK which implied that these were somehow indestructible and everlasting? And the subsequent disclosures that they are neither.

Will any of the media developed today have anything like the longevity of the media from our history? And will any of them be understandable and usable in a millennium's time? It seems that the half-life of media as both useful and usable is ever decreasing. So perhaps the industry needs to think about more than the sheer amount of data that we can store and more about how we preserve the records of the future.

February 10, 2011

I have been intending to do this for some time but have only just got round to starting this; this blog will now be hosted at www.storagebod.com. However this is early days for the move and for the time being I will be updating both sites but in the long-term, I expect that www.storagebod.com will become the main site and indeed it will fairly rapidly have some additional content.

But as I say, until I am completely happy with the stability and functionality of the new site; the blog here will continue to be updated.

February 09, 2011

Inspired by Preston De Guise's blog entry on the perils of deduplication, I began hypothesising whether there is a constant for the maximum physical utilisation of the capacity in a storage array that can be safely utilised; I have decided to call this figure 'Storagebod's Precipice'. If you haven't read Preston's blog entry, can I humbly suggest that you go read it and then come back.

The decoupling of logical storage utilisation from physical utilisation, which allows a logical capacity/utilisation far in excess of the physical capacity, is both awfully attractive and terribly dangerous.

It is tempting to sit upon one's laurels and exclaim 'What a clever boy am I!' and in one's exuberance forget that one still has to manage physical capacity. The removal of the 1:1 mapping between physical capacity and logical capacity needs careful management and arguably reduces the maximum physical capacity that one can allocate.

Many storage management best practices are no more than rules of thumb and should be treated with extreme caution; these rules may no longer apply in the future.

1) It is assumed that on average data has a known volatility; this impacts any calculation around the amount of space that needs to be reserved for snapshots. If the data is more volatile than one expects, snapshot capacity can be utilised a lot faster than expected. In fact, one can imagine an upgrade scenario which changes almost every block of data, completely blows the snapshot capacity and destroys your ability to quickly and easily return to a known state, let alone one's ability to maintain the number of snapshots agreed in the business SLA.

2) Deduplication ratios when dealing with virtual machines can be huge. As Preston points out, reclaiming space may not be immediate or indeed be simple. For example, often the reaction to capacity issues is to move a server from one array to another, something which VMware makes relatively simple but this might not buy you anything. Moving hundreds of machines might not even be very effective. Understand your data and understand which data can be moved with maximum impact on capacity. Deduplicated data is not always your friend!

3) Automated tiering, active archives etc; all potentially allow a small amount of fast storage medium to act as a much larger logical space but certain behaviours could cause this to be depleted very quickly and lead to an array thrashing as it tries to manage the space and move data about.

4) Thin provisioning and over-commitment ratios; this works on the assumption that users ask for more storage than they really need and that average file-system utilisations are much lower than provisioned. Be prepared to experience that this assumption makes an 'ass out of u & me'.
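The deduplication point in 2) is easy to see with a toy model. This is only a sketch: it treats each virtual machine as a set of block fingerprints and the array as storing each distinct block once, which is an assumption about deduplication in general and not a description of any particular array. The VM names and block counts are made up.

```python
def physical_blocks(vms):
    """Blocks actually stored after deduplication: each distinct block once."""
    return len(set().union(*vms.values()))


def logical_blocks(vms):
    """Blocks the hosts believe they are using, before deduplication."""
    return sum(len(blocks) for blocks in vms.values())


def blocks_freed_by_moving(vms, name):
    """Physical blocks released if one VM is moved to another array."""
    remaining = {k: v for k, v in vms.items() if k != name}
    return physical_blocks(vms) - physical_blocks(remaining)


# Hypothetical estate: three VMs built from the same 1000-block OS image,
# each with only a handful of blocks of its own.
os_image = set(range(1000))
vms = {
    "vm-a": os_image | {"a1", "a2"},
    "vm-b": os_image | {"b1"},
    "vm-c": os_image | {"c1", "c2", "c3"},
}

print(logical_blocks(vms), "logical blocks")
print(physical_blocks(vms), "physical blocks")
print(blocks_freed_by_moving(vms, "vm-a"), "blocks freed by moving vm-a")
```

In this toy estate, moving vm-a away shifts over a thousand logical blocks but hands back only the two blocks that nothing else shares.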

All of these technologies mean that one has to be vigilant and rely greatly on good storage management tools; they also rely on processes that are agile enough to cope with an environment that could ebb & flow. To be honest, I suspect that the maximum safe physical utilisation of capacity is at most 80% and these technologies may actually reduce this figure. It is ironic that logical efficiencies may well impact the physical efficiency that we have so long strived for!
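To make the precipice a little more concrete, here is a minimal sketch of the check I have in mind for a thin-provisioned pool. The 80% threshold is the suspicion voiced above and nothing more scientific than that; the function names and pool figures are hypothetical.

```python
SAFE_PHYSICAL_UTILISATION = 0.80  # the 'at most 80%' figure suggested above


def overcommit_ratio(provisioned_tb, physical_tb):
    """Logical capacity promised to users per unit of physical capacity."""
    return provisioned_tb / physical_tb


def past_precipice(used_tb, physical_tb, threshold=SAFE_PHYSICAL_UTILISATION):
    """True once physical utilisation exceeds the chosen safe threshold."""
    return used_tb / physical_tb > threshold


# Hypothetical pool: 100 TB physical with 300 TB provisioned against it.
physical_tb = 100
provisioned_tb = 300
ratio = overcommit_ratio(provisioned_tb, physical_tb)

# If users really fill only a quarter of what they asked for, 75 TB is
# written and the pool is fine; at 30% it is 90 TB and over the edge.
comfortable = past_precipice(75, physical_tb)
in_trouble = past_precipice(90, physical_tb)
print(ratio, comfortable, in_trouble)
```

At a 3:1 over-commitment, a five point rise in average file-system utilisation is enough to take this pool from comfortable to past the precipice.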