Henry Newman's Storage Blog Archives for November 2011

The annual Supercomputing show earlier this month highlighted high-performance computing technology. This field has always been very competitive, yet for decades it has never delivered significant long-term profit to vendors. Companies have come and gone, and the number that have entered and exited this market is astounding. Many entered with revolutionary ideas, only to find that the market rejected those ideas or had moved on to something more interesting. This happened to so many companies, and of course there were the consolidations, with big companies buying smaller ones. Long-term profit has eluded most companies focused solely on HPC, but HPC is still important to the future of all of us. Not to be U.S.-centric, but I live here, and I want us to be successful.

HPC technology is the basis for the design and implementation of new drugs, planes, cars and lots of other things we use every day. The world made a big deal out of Steve Jobs dying. He was a great man with amazing ideas that people saw every day, but a man named Seymour Cray died on that same date in 1996. His computer systems had more impact on our lives from the 1960s to the 1990s than Steve Jobs' consumer products have had. iPhones and iPads are cool, but the 747, 757, 767, DC-9, DC-10, cars, medicine, and everything else that got us to today was designed and engineered on systems that Seymour built over 30 years.

I think we need to put things in perspective; consumer products are great, but they do not drive the nation's industrial technology. Supercomputing, even though it has not been very profitable, is a critical part of the infrastructure of our nation and we should remember that.

For those of you who have not seen it, Japan is on top of the supercomputing list, with the fastest machine in the world for running Linpack. Those of us in HPC know that Linpack is not a good measure of what type of science can be run on a machine, given that it does not measure things like interconnect bandwidth, interconnect latency, and I/O performance.

One thing that seems to be forgotten in the supercomputer race to have the most Linpack FLOPS is I/O performance. Now, some reading this might say, who cares about supercomputers? But let me remind you that every drug you take, every plane you fly and every car you drive was designed on a supercomputer. So supercomputers impact many aspects of our daily lives, and do not forget the financial industry using them for trading.

System balance is not a consideration for the industry, and that is hurting the science that can be done, as the arms race for many organizations is to see where they fall on the Top500 list. This is not good for the scientists who actually have science to do and need more than just FLOPS. Nodes fail, so it is critical for jobs to checkpoint themselves so they can restart in the event of a node failure. For example, the K machine has 22,032 four-socket blade servers with either 32 GB or 64 GB of memory. Let's say a big job runs on one quarter of the machine: 5,508 nodes with 32 GB of memory per node, or 176,256 GB of memory. Let's say you had a 100 GB/sec I/O rate. Checkpointing would take 1,762 seconds, or over 29 minutes. Clearly checkpointing a job every hour or two is not tractable; nor is every three or four. We need to start looking at balance rather than a FLOPS arms race.
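The arithmetic is worth spelling out. A quick sketch, using the node count and memory size quoted above and the assumed 100 GB/sec aggregate I/O rate:

```python
# Checkpoint-time estimate for a job on one quarter of a K-class machine.
# Node count and memory are the figures from the text; 100 GB/sec is an
# assumed aggregate file system bandwidth, not a measured number.
nodes = 22032 // 4                 # 5,508 nodes
mem_per_node_gb = 32               # GB of memory dumped per node
io_rate_gb_s = 100                 # aggregate I/O rate, GB/sec

total_gb = nodes * mem_per_node_gb            # 176,256 GB to write
seconds = total_gb / io_rate_gb_s             # ~1,762 seconds
print(f"{total_gb:,} GB at {io_rate_gb_s} GB/s -> "
      f"{seconds:,.0f} s ({seconds / 60:.0f} minutes)")
```

Half an hour of pure I/O per checkpoint is why an hourly checkpoint interval is intractable on a machine of this scale.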

I believe that we need to update what we are teaching college computer science students. We must start teaching students how hardware and software interact. This curriculum should include areas such as hardware memory allocation, page coloring, how data gets moved to and from the PCIe bus, and how that bus works. For example, we should teach things like 8b/10b encoding. How does a 10 Gb Ethernet NIC or a SAS adapter work? What is a CRC error on a channel, or SECDED? If you do not know these terms, Google them, as there are a number of good explanations on the net. The class I envision would start as a one-semester required class for sophomore CS students; it is important that this type of information be taught early on. The next required class would be a senior-level class that would address some of the same issues in more detail and look at areas such as reliability engineering for silent data corruption, along with some of the standards bodies and the hardware they control: for example, SAS disk drives, Ethernet and other hardware technologies that have well-defined standards.
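To give a flavor of what the sophomore class might cover: a CRC is a checksum a channel uses to detect corrupted frames, and Python's standard library is enough to sketch the idea. Flipping even a single bit in a frame changes its CRC, which is exactly what the receiving end of a SAS or Ethernet link reports as a CRC error:

```python
import binascii

# A hypothetical frame crossing a SAS or Ethernet channel.
frame = bytearray(b"payload crossing a SAS or Ethernet channel")
good_crc = binascii.crc32(frame)   # CRC computed by the sender

frame[3] ^= 0x01                   # a single bit flips in transit
bad_crc = binascii.crc32(frame)    # CRC recomputed by the receiver

# Mismatch -> the receiver flags a CRC error and the frame is retried.
print(good_crc != bad_crc)
```

This is only a toy illustration; real channels compute the CRC in hardware on every frame, but the principle is the same.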

I have actually suggested this to a number of computer science professors I know, and I even offered to do an overview lecture on the topic. I was told that no one in the computer science department really cares about these topics. This rejection came from two computer science departments in pretty large state universities (unnamed of course). There must be a way for the industry to partner with universities to help them understand what we need from graduates and get ideas on the curriculum.

I am only a single voice, but I am loud and I rant a lot. I need some help here, folks.

In my last blog post I commented that performance analysis is becoming a lost art and suggested that it will not be long before we go back to the future and begin working on performance analysis and code optimization instead of just buying hardware and more hardware. I do think we are going to have to go back and teach people how to optimize codes for the architecture being used.

The problem is, how do we develop those skills? Back when people were doing lots of code and system optimization, many of us were writing in assembly language. That taught us tough lessons about the hardware and how to efficiently use it. We need to teach people in our industry about hardware and how it works. Long ago, I was lucky enough to have been able to attend a class called Hardware Training for On-Site Analysts when I was working for Cray Research. It was taught by an EE who actually worked on the hardware design of the machine. I liked the class at the time, but over the years I came to realize how much that class meant to me and my career, giving me the tools to understand there is more than just software when doing performance analysis.

So how would we teach today's young people to understand things like cache lines, memory bandwidth, NUMA and memory placement, SAS CRC errors, T10 DIF/PI and disk error recovery, just to name a few? Honestly, I do not have a good answer. First of all, young people need to believe that these kinds of things are important to learn, while almost everyone today believes that you do not need to understand the hardware. We need to find a way to make understanding the hardware important again, but that will not happen as long as the answer to every performance problem is to buy more hardware.

I am beginning to think that the whole concept of performance analysis and performance optimization is becoming a lost art. How many people understand everything from the application down to the storage device, whether it is a hard drive, tape drive or an SSD? I would think not too many people know the end-to-end picture. I learned years ago to approach things by first understanding what the application does. What type of I/O calls does it make? What I/O libraries does it use? How does the I/O interact with the operating system? This was before the time when Unix and the C library were common. Then we had to understand how the I/O moved to the device.

Devices today are far more complex than the original devices that I worked on in the early 1980s. For example, all of the error recovery back then was done in the operating system. Things are different today. You cannot change much to tune I/O in Java, except not to use Java for I/O. I think the current state of things, where applications cannot really be tuned and the solution to all performance problems is to buy more hardware, is going to run out of gas. Almost everything in our industry is cyclical. For many years, tuning applications and the storage they used was commonplace for both industry and the staff of most vendors. Today, that is sadly not the case. Some of the cause is the recession, and some is that users have found, at least in the short term, that buying hardware is cheaper than fixing applications. I believe that in the long term this will not hold, but everyone seems to be living for today and not planning for the future.

I have ranted (a regular occurrence) for years over the fact that standard RAID will not last forever. I last seriously ranted over two years ago. The article was widely read and widely commented on, as I got many emails. Two years later and most agree with me, but where are the changes? We now have 3 TB drives with 4 TB drives on the horizon.

The time bomb of RAID and data loss is nearing. I have heard of a number of cases where multiple failures happened on large file systems during rebuild. When 4 TB drives arrive with the typical 20 percent increase in performance but a 33 percent increase in density (3 TB to 4 TB), the problem is going to happen more and more often. Declustered RAID is the solution to the problem, but it is slow to appear in the market.
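A rough rebuild-time sketch shows why density outrunning performance hurts. The ~130 MB/sec sustained rate below is my assumption for illustration only; real rebuilds under production load usually run far slower:

```python
# Best-case rebuild time for a single failed drive in a traditional RAID
# group.  The 130 MB/s sustained rate is an illustrative assumption.
def rebuild_hours(capacity_tb, rate_mb_s):
    capacity_mb = capacity_tb * 1000 * 1000   # decimal TB -> MB
    return capacity_mb / rate_mb_s / 3600

for tb in (3, 4):
    print(f"{tb} TB drive: ~{rebuild_hours(tb, 130):.1f} hours best case")
```

Even in this optimistic case the 4 TB drive adds a third again to the rebuild window, which is exactly the window in which a second failure kills the array.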

Some vendors have addressed the problem. Since I try not to mention vendor names, I will let you ask the questions of the vendor community yourselves. I think market pressure is the only way to get the vendor community to make these kinds of major technology changes. All of the vendors I have talked to over the years understand and agree with my conjecture that things must change. However, change is expensive, and we had a recession in which vendors did not invest in technology for that very good reason. The economic situation is seemingly getting better, and it is time for the user community to demand that the vendor community meet our reliability requirements. With 4 TB drives just around the corner, and hard error rates still at the same 1 sector in 10^15 bits that they have been for more than six years, the time for change is now.
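That hard error rate is easy to turn into an expectation. The sketch below assumes an illustrative RAID group in which nine surviving 4 TB drives must all be read in full to rebuild a failed one; the group size is my example, not a vendor figure:

```python
# Expected unrecoverable read errors while reading the surviving drives of a
# RAID group during rebuild, at the quoted rate of 1 error per 10^15 bits.
def expected_ures(drives, capacity_tb, error_rate_per_bit=1e-15):
    bits_read = drives * capacity_tb * 1e12 * 8   # decimal TB -> bits
    return bits_read * error_rate_per_bit

# Illustrative group: nine surviving 4 TB drives read end to end.
print(f"{expected_ures(9, 4):.3f} expected hard errors per rebuild")
```

About 0.29 expected hard errors per rebuild, and the number grows linearly with both drive count and capacity while the error rate stays frozen.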

The Supercomputing 2011 show is well under way, and I have been thinking about Infiniband. With Mellanox buying Voltaire a while back, we are now down to two vendors in the Infiniband market space: Mellanox and QLogic. Are any other expanding markets served by only two suppliers? Of course, you might answer that there are only two suppliers of CPUs, but that count ignores ARM and other cell phone processors, as well as PowerPC. Infiniband is a needed technology for applications that require low-latency, high-performance communication. The current 40 Gbit/sec QDR Infiniband has been available for a number of years, and the industry is rumored to be moving to FDR at 56 Gbit/sec with a far more efficient encoding, which allows the realization of higher usable performance. Infiniband also supports RDMA (remote direct memory access), which significantly reduces latency for applications and libraries that can take advantage of it. The questions I have are:

Is the HPC market, including traditional HPC and the financial trading community, big enough to support the R&D needed for the technology? The HPC market is small but growing, and almost all of the large clusters use Infiniband.

Do two vendors provide enough competition in the market? Not too many years ago there were four vendors.
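On the encoding point above: QDR uses 8b/10b encoding while FDR moves to 64b/66b, so the usable bandwidth grows faster than the raw signaling rate suggests:

```python
# Effective data rate = raw signaling rate x encoding efficiency.
links = {
    "QDR 4x": (40, 8 / 10),    # 40 Gbit/s raw, 8b/10b encoding
    "FDR 4x": (56, 64 / 66),   # 56 Gbit/s raw, 64b/66b encoding
}
for name, (raw_gbit, eff) in links.items():
    print(f"{name}: {raw_gbit} Gbit/s raw -> "
          f"{raw_gbit * eff:.1f} Gbit/s usable")
```

A 40 percent raw-rate bump becomes roughly a 70 percent usable-bandwidth bump, which is why the encoding change matters as much as the signaling rate.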

One other interesting point, on which I recently realized I had the timing wrong: 10 Gb Ethernet ports are finally seeing major price drops. I thought this would happen in 2008 or 2009, but given the economic downturn, the volume just did not materialize.

Finally, we are going to see 10 Gb Ethernet become common on higher-end motherboards. As the price for this technology drops, will Infiniband become the next HiPPI? Infiniband has a much bigger market than HiPPI ever did, but there are many parallels. Forgive the bad pun.

As more and more data gets created, I think we are going to see more people looking at archiving data. I am at SC11, and the supercomputing field has always been at the forefront of archival software; it has driven the archival community. Of course, today many believe that Google, Amazon and even Mozy and Carbonite can support large archives, and to some degree they can, but the issue is getting the data in and out. Today, some of the archives I am aware of are over 20 PB. Think about that: take an OC-192 channel at around 10 Gbits per second. For rounding's sake, consider:

20*1024*1024*1024*1024*1024 / (10/8*1024*1024*1024) = 16,777,216 seconds, or about 194 days. Now, first of all, no one gets 100 percent channel utilization, and the values I used are higher than a real OC-192 channel, but you get the picture. Today, just about all the archives that I am aware of in high performance computing are local to the site or organization. Even with high-speed networking, users moving large files around is not tractable unless there is some way to know a priori when you need a file and a way of scheduling the transfer.
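The same back-of-the-envelope calculation, spelled out (binary multiples throughout, and a channel rate rounded up from real OC-192, as noted above):

```python
# Time to move a 20 PB archive over a ~10 Gbit/s channel at 100% utilization.
archive_bytes = 20 * 1024**5              # 20 PB in binary units
channel_bytes_s = (10 / 8) * 1024**3      # 10 Gbit/s expressed as bytes/sec

seconds = archive_bytes / channel_bytes_s # 16,777,216 seconds
days = seconds / 86400                    # ~194 days
print(f"{seconds:,.0f} s -> {days:.0f} days at full channel utilization")
```

And that is the best case; at a more realistic utilization the transfer stretches well past a year.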

Of course, there are tools that can provide scheduled transfers, but users still need to schedule the transfer and therefore need to know they will need the data. In some cases this will work; in others it will not. File sizes are growing faster than network performance from what I can see, and that means if you want data quickly from an archive, the archive had better not be located over the WAN. Small-file access over the WAN will work for some applications, given their file sizes, but of course not for everything. I do not see network performance increasing at the rate of data growth. It never has and likely never will.

I am hearing a lot these days about using cheap technology and transforming it into enterprise reliable storage. I honestly have not done a technology review of any of the multiple vendors and their claims, but I am sure I will at some point soon.

Here are the types of things that I plan on evaluating to better understand the claims.

First, I want to look at the reliability of the underlying disks and connections. Will they be dual-pathed? Will the RAID algorithm rebuild in a reasonable time? Are the disk AFR and the hard error rates considered? How does the vendor address silent data corruption, which is more prevalent on lower-end disk storage? How is the whole unit tested compared to enterprise and midrange storage? That would start the discussion.

The second set of questions I would ask would center around failover and failure issues within the controller. Questions such as: How do you access a LUN if part of the controller has failed? How is caching accomplished with and without a failure and most importantly during a failure of something in the hardware path?

The third and last set of questions I will be asking are in the area of performance of the system during normal operations and during various failure scenarios.

As I said, I have not had any exposure to some of the new vendors in this area as of yet. There were a number of vendors in the space 5 and 10 years ago that tried the same thing, and over time they realized they had to make the storage controller more robust, as the methods and techniques chosen did not work. Of course, everything is different now, but as I have said, lots of things are the same. We shall see.

What does that world look like? What does the interconnect look like, and what does the switch look like? Right now, high performance computing is dominated by programs written to communicate via MPI (Message Passing Interface). Since it took over 10 years to write many of the complex codes, changing them over in the next few years is out of the question. So most, if not all, of the scientific and engineering codes will continue to be written with MPI. Some codes, and parts of other codes, must communicate node-to-node, while others need to communicate across the whole network of nodes.

This, of course, means that the network design needs to be different for different codes, or for parts of different codes, and that is a costly undertaking. It is far easier to design for fast communication with nearby nodes, and therefore cheaper, than it is to communicate quickly with faraway nodes. It is almost like you need two networks, and that is what might happen.

What if one network interface was developed that allowed higher-speed, low-latency communication to nearby nodes, connected to a specialized switch for local communication, and another interface was designed for more global communication with its own specialized switch? This likely could even be done today with multiple InfiniBand connections to different switches. It would, however, require some modification to how the topology is addressed. I would not be surprised if some time before 2015 this type of technology becomes widely available on standard x86 hardware.

Sooner or later, PCIe is going to run out of gas for moving data, as its performance has not kept up with requirements in some parts of the market, especially technical computing. We started with PCIe 1.1, which was originally developed to give the graphics industry a common card slot but was always planned to be used by all peripherals. The original performance of 250 MB/sec per lane in 2004 was very fast compared to CPU speeds and memory bandwidth at the time, but only about 4x that performance in 2012 means PCIe is not scaling with either CPU or memory bandwidth.
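The per-lane numbers behind that claim: each generation's usable bandwidth is its transfer rate times its encoding efficiency, and PCIe 3.0 works out to roughly 4x PCIe 1.x per lane:

```python
# Per-lane PCIe bandwidth by generation: transfer rate x encoding efficiency.
gens = {
    "PCIe 1.x": (2.5, 8 / 10),     # 2.5 GT/s, 8b/10b encoding
    "PCIe 2.0": (5.0, 8 / 10),     # 5 GT/s, 8b/10b encoding
    "PCIe 3.0": (8.0, 128 / 130),  # 8 GT/s, 128b/130b encoding
}
for name, (gt_s, eff) in gens.items():
    mb_s = gt_s * eff * 1000 / 8   # GT/s -> MB/s per lane (decimal units)
    print(f"{name}: ~{mb_s:.0f} MB/s per lane")
```

Two doublings in eight years, while core counts and memory bandwidth have grown far faster over the same period.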

My belief is that at some point in the not-too-distant future, some vendor is going to place InfiniBand chips on the CPU board, bypassing the PCIe bus. These chips would be connected to an HT or QPI channel. The other, more likely possibility is that some vendor will develop its own proprietary interconnect. You ask, why create a proprietary interconnect? I think the answer is clear: PCIe is not meeting the market needs for technical computing, which requires high-speed communication among thousands of nodes.

If we are going to address the problem, it is not going to be with PCIe. IBM recently announced the P775, which has a proprietary interconnect. Is this the first of a whole series of machines from the vendor community? Of course, only time will tell, but doing critical science with PCIe 4.0, which will offer a 2x improvement over PCIe 3.0 and arrive sometime in 2016 or so, is not going to work for the science community. There needs to be a much more significant improvement in communication performance, combined with new algorithms that reduce the amount of communication, or there will not be significant advances.
