Henry Newman's Storage Blog Archives for December 2011

I got some interesting feedback from a few people on my article on teaching performance analysis in school. Those who responded agreed the topic was not well covered at the schools they were familiar with. This got me thinking about what else, besides performance analysis, should be taught and why.

Today, I am sure there are many classes on GPU programming, Java and, to a lesser degree, C. The operations side is also needed to ensure a well-rounded education. The key thing I see is problem-solving ability, and specifically problem solving on real-world hardware and software. In the large data centers that I work with, it would be great if people coming out of school had familiarity with Infiniband, 10 Gb Ethernet, Fibre Channel, SAS and storage in general. Additionally, from what I can see, there is very little understanding of I/O from the application I/O request (C library or system call) to the disk. Add in the issues with SSDs, both good and bad, and PCIe SSDs, and there is a lack of understanding in the critical areas that impact the design and performance of a system.
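As one small illustration of the layers an I/O request passes through, here is a sketch in Python (the file name is arbitrary, and the comments map each step to its rough C-library or system-call analogue):

```python
import os

# Buffered "library" write: f.write() puts data in a user-space buffer,
# roughly analogous to fwrite() in C -- nothing has reached the kernel yet.
with open("demo.dat", "wb") as f:
    f.write(b"archive block")  # user-space buffering (C library layer)
    f.flush()                  # write() system call: data moves to the kernel page cache
    os.fsync(f.fileno())       # fsync(): ask the kernel to commit the page to the media

# Read the data back to confirm it is on stable storage, then clean up.
with open("demo.dat", "rb") as f:
    data = f.read()
os.remove("demo.dat")
```

Even this tiny example crosses three distinct layers (library buffer, page cache, media), and each layer has its own performance and failure characteristics, which is exactly the kind of end-to-end picture that is rarely taught.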

So who is at fault? Here in the United States, there always has to be fault. In my opinion, there must be a better partnership between the vendor community and colleges and universities. Vendors must offer their time and limited amounts of hardware to cover the critical issues that can impact the understanding of storage and I/O, so we can graduate students who better understand the problems. A small number of universities in the United States have curricula in these areas, but broader understanding is needed.

I have attended every Supercomputing conference since 1992 in Minneapolis, and boy, have things changed. First of all, the conference has gotten so big that for many years now, it would not have even fit in the Minneapolis Convention Center. It has been about a month since the show, and I have some thoughts about what I saw and heard.

I think much of how I feel can be summed up in a single phrase: nothing disruptive. Yes, vendors showed new technologies, from storage appliances to interconnects, faster CPUs, new SSDs and so on. I do not think any of these technologies qualify as something that will change the face of high performance computing. Infiniband was one such technology when it came out. It allowed the industry to go from large SMP machines to commodity clusters, and it was truly innovative in my opinion. I am not sure the industry has done a whole lot to change the face of HPC since then.

This is not to say there has not been technological progress in almost every area (e.g., CPUs, memory bandwidth, file systems and SSDs), but progress and massive changes are two different things. Major changes do not happen overnight, of course, nor do they happen every year, and they almost never happen when there has been limited investment in technologies because of a recession. Additionally, there might be a cycle to innovation in this field that repeats every 10 to 15 years, looking back from the 1960s to today. But that is a different story for another day. My hope is that some time in the next two to three years something comes out that changes things again in a very positive way. I am not sure what that will be, as there are many possibilities, but I hope it is soon.

This is nothing against Avere specifically--it is just another example of the benchmark arms race that is out of control and really does not provide much value to buyers. Just so everyone knows, the company I run does not resell any hardware or software, and we will not. We are vendor-neutral. Remember that when you are reading articles from other writers. But back to the facts.

The benchmark arms race will continue, and I believe it will continue to provide buyers no actually useful information. Somehow the buyer community must change the discussion. The problem is that the vendors control the benchmarks and the definition of the benchmarks. Being the cynical conspiracy theorist I am, I believe that just as with the POSIX I/O standards, which vendors control and do not want to change, the vendors that control the I/O benchmarks do not want them to change. Reasons that come to mind include:

Vendors know how to run the benchmarks and do not want to change anything, given the costs. Oftentimes, these complex benchmarks involve a large learning curve to run and to get the needed results on the vendor's hardware.

Vendors do not want to show scalability, as I believe it is in our interest but not in theirs. When buying a storage appliance and looking at performance data, what really matters is that the product scales to meet the performance and density requirements that you have for the long run. Given the testing we have done, I know this is true for some products; remember, the vendors control the benchmarks.

We need a buyers' manifesto that says we care about scalability, not absolute performance. No one is going to buy the exact configuration used by any vendor running SPECSFS, so why should we care about the results?

A number of my customers continue to ask questions about data reliability for archives. We all know that archives cannot be 100 percent reliable forever, but how reliable can they be? The answer is that no one knows. How can anyone figure out what the reliability is, given all the hardware and software involved in the archive? At least the people I talk with do not even try, as there is no basis for discussing the topic.

I think reliability must be discussed in terms of a count of 9s. Does your data have 99.99999999 percent reliability, also called 10 9s, or does it have 15 9s? No one can really tell, as there is no common way to discuss the problem. The 9 count could be calculated from the media reliability, but even that information is not used to frame a common discussion. What I am thinking is that there must be some standard way to discuss reliability so the community and others can have thoughtful discussions, and vendors can sell well-defined products. Vendors are not required to discuss data integrity, even at the media level for an archive, as the user community does not require it. It is time for a change in the thought process for those of us responsible for large archives of data.
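As a minimal sketch of what that common vocabulary could look like, the count of 9s converts directly to and from a probability of loss (the function names here are my own, purely for illustration; reliability is expressed as a fraction, so 99.99999999 percent is 0.9999999999):

```python
import math

def count_of_nines(reliability: float) -> float:
    """Express a reliability fraction as a count of 9s.

    The count of 9s is -log10 of the probability of loss,
    so 0.999 -> 3 nines and 0.9999999999 -> 10 nines.
    """
    return -math.log10(1.0 - reliability)

def reliability_from_nines(nines: float) -> float:
    """Inverse conversion: 10 nines -> 0.9999999999."""
    return 1.0 - 10.0 ** (-nines)
```

With a convention this simple, a vendor could quote a 9 count for the media and a buyer could compare archives on a single scale rather than on incomparable marketing claims.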

I know it is possible to calculate the level of media data integrity, as my team has done it. I think that if we ask for this, we will begin a bigger discussion on the integrity of data across the data path, including all the checksums that have not been updated in 20+ years and that are no longer robust enough for the amount of data and the speed of the channel. It is time for a new discussion.
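To illustrate the kind of media-level calculation I mean, here is a hedged sketch: given a quoted unrecoverable bit error rate (UBER, errors per bit read) and an archive size, you can estimate the expected number of unrecoverable errors in one full read of the archive. The figures below are illustrative assumptions, not any vendor's real numbers.

```python
def expected_uncorrected_errors(archive_bytes: float, uber: float) -> float:
    """Expected unrecoverable bit errors when reading the whole archive
    once, given the media's quoted UBER (errors per bit read)."""
    bits_read = archive_bytes * 8
    return bits_read * uber

# Illustrative figures: a 10 PB archive on media quoted at 1 error
# per 10^15 bits read gives an expectation of 80 unrecoverable errors
# per full scan -- far from "reliable forever."
errors_per_scan = expected_uncorrected_errors(10e15, 1e-15)
```

Even this back-of-the-envelope arithmetic shows why the topic matters at archive scale, and why checksums designed decades ago for much smaller data volumes and slower channels deserve a fresh look.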