Future Storage Systems: Part 3a – Node Expansion Overview

by dave on October 9, 2008

In the previous two articles on the Future Storage System (FSS), I took a general look at a basic storage system architecture (Part 1) and then went a bit deeper into some of the more interesting bits of that system from a platform standpoint (Part 2). In this article, I want to dive a bit deeper into how I envision nodes serving as building blocks for additional capabilities and processing. I will be referencing the image below throughout this article.

HyperTransport Node Expansion (detailed)

We’ll approach this system overview the same way as before: starting from the lower left-hand corner and working clockwise.

This diagram shows the layout of a single node (Node A) and an expansion node (A prime). In this diagram, we’re looking at a computational node add, not an I/O node (that’ll be covered in Part 3b 😉 ). The basic architecture breaks down into the following:

Base Node with I/O

Interconnect (physical/electrical)

Expansion Node (computational or I/O)

Starting with the Base Node, we can see that the architecture is the same as covered in Parts 1 & 2. It’s a simple dual-physical-processor system with dedicated memory banks per processor, a Southbridge controlling PCIe connectivity and system-level I/O, and both coherent and non-coherent HyperTransport links between processors. Although memory specifications are provided as part of the diagram, they’re inconsequential to the overall architecture. The Southbridge, as previously noted, would be utilized as the PCIe resource manager for system and storage I/O. The coherent (cHT) and non-coherent (ncHT) HyperTransport links would be dedicated to inter-processor communication, including facilities for NUMA (non-uniform memory access) and processor data requests. This facility is already implemented in current B-series AMD Opteron processors as part of the Socket 1207 specification.
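The NUMA arrangement described above can be sketched as a toy model: memory attached to a processor's own bank is cheaper to reach than memory one hop away over the coherent HT link, so a NUMA-aware allocator prefers the local bank. All latency numbers below are invented for illustration, not taken from any datasheet.

```python
# Toy model of a two-socket NUMA system: each CPU has a local memory bank,
# and reaching the other socket's bank costs an extra hop over the coherent
# HyperTransport (cHT) link. Latencies are hypothetical placeholders.

LOCAL_LATENCY_NS = 60   # hypothetical: access to the CPU's own bank
CHT_HOP_NS = 40         # hypothetical: one cHT hop to the remote bank

def access_latency(cpu_node: int, memory_node: int) -> int:
    """Modeled access latency for cpu_node touching memory_node."""
    hops = 0 if cpu_node == memory_node else 1
    return LOCAL_LATENCY_NS + hops * CHT_HOP_NS

def best_node_for(cpu_node: int) -> int:
    """A NUMA-aware allocator prefers the bank closest to the CPU."""
    return min((0, 1), key=lambda mem: access_latency(cpu_node, mem))

if __name__ == "__main__":
    print(access_latency(0, 0))  # local access:  60
    print(access_latency(0, 1))  # remote access: 100
    print(best_node_for(1))      # node 1 allocates from its own bank: 1
```

The point of the sketch is simply that placement matters: the same read costs more when it crosses the cHT fabric, which is why the OS (and, here, the storage OS) wants to keep a processor's working set in its dedicated banks.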

The interconnect portion of the diagram is both a mechanical and an electrical “device.” Essentially, it “glues” processors together into a larger “grid,” doubling I/O bandwidth within the system and allowing expansion beyond two physical processors (at least in this design). Mechanically, it can be compared to the HTX physical interface originally specified by AMD. Some HCA designs from QLogic and others utilized this connector as a method of direct I/O injection into the processing path, lowering latency even further than comparable PCI-X or PCIe models. (It was only about 1 ns of difference, but still, every bit helps.) Again, while HTX is the archetype for HyperTransport’s physical implementation, a slot-based architecture wouldn’t be as effective as a more custom packaged model (I’m thinking specifically of blade-server-esque edge connectors). Electrically, the HT 3.x protocol would be signaled to the expansion node as part of a larger mesh fabric to engage additional processing resources (CPUs or, as noted further down, Torrenza-compatible co-processors) or I/O resources.
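To put rough numbers on the bandwidth being glued together, here's a back-of-the-envelope calculation for a single HT 3.0 link at its published maximums (2.6 GHz link clock, double-pumped, 16 bits wide per direction):

```python
# Back-of-the-envelope bandwidth for one HyperTransport 3.0 link.
# HT 3.0 tops out at a 2.6 GHz link clock, double-pumped (DDR),
# over a 16-bit-wide path in each direction.

CLOCK_HZ = 2.6e9          # HT 3.0 maximum link clock
TRANSFERS_PER_CLOCK = 2   # double data rate
LINK_WIDTH_BYTES = 2      # 16-bit link

per_direction = CLOCK_HZ * TRANSFERS_PER_CLOCK * LINK_WIDTH_BYTES  # bytes/s
aggregate = 2 * per_direction  # links are full duplex

print(per_direction / 1e9)  # 10.4 GB/s each way
print(aggregate / 1e9)      # 20.8 GB/s aggregate per link
```

Every additional link the interconnect brings into the mesh adds another increment of that order, which is where the "doubling I/O bandwidth" claim comes from when a second pair of links joins the grid.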

In an ideal world, you’d be able to simply plug the expansion node (especially the computational version) into the chassis while the system was running and have the OS dynamically add and allocate the resources to the rest of the system.

Now the fun begins….

First, I’d like to touch on the reason for looking at expansion nodes at all. In the storage world, there appear to be two basic approaches to adding processing power to head units: clustering (à la EqualLogic, NetApp, XIV, Permabit) or multiple storage processors (à la HDS USP-V, Symmetrix, etc.). The advantage of clustering really trickles down to n-way ownership of data and tolerance of multi-level hardware failures. The disadvantage is having to maintain some level of heartbeat mechanism (CMI on EMC, for example) between nodes, whose failure can split-brain I/O and/or other system-level processes. Notice I said CAN, not WILL.

The advantage of using multiple storage processors (directors) ties into complete hyper or LUN awareness and quick ownership failover in case of hardware meltdowns. Additionally, tying multiple SPs together on a common bus can add to overall system performance (the principles of HyperTransport, for example) via aggregated memory bandwidth and I/O. The general disadvantages are that any sort of SP failure could ultimately impact other SPs in the system (I’m thinking specifically of some sort of EMI burst or surge) and that double hardware failures can cause DU/DL (data unavailability or data loss). In my mind, there are obviously reasons to choose one over the other, but I think that ultimately you could combine both (another article on that later). Since we’re just evaluating a simple node add, we’re going to look at this as multiple SPs on a common backplane (HT).
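The split-brain risk mentioned above can be shown with a minimal sketch. All names and the missed-beat threshold are invented for illustration; the point is that a heartbeat alone can't distinguish "peer died" from "heartbeat path died," which is why real clusters add a tie-breaker (quorum disk, third-party witness) on top.

```python
# Minimal sketch of a two-node heartbeat monitor. If only the heartbeat
# path fails while both nodes stay healthy, BOTH sides conclude the peer
# is dead -- the classic split-brain. Threshold is a hypothetical value.

MISSED_BEATS_LIMIT = 3  # beats missed before declaring the peer dead

class HeartbeatMonitor:
    def __init__(self):
        self.missed = 0

    def beat_received(self):
        self.missed = 0

    def beat_missed(self):
        self.missed += 1

    def peer_looks_dead(self) -> bool:
        return self.missed >= MISSED_BEATS_LIMIT

# Simulate a severed heartbeat link between two perfectly healthy nodes:
node_a, node_b = HeartbeatMonitor(), HeartbeatMonitor()
for _ in range(MISSED_BEATS_LIMIT):
    node_a.beat_missed()
    node_b.beat_missed()

# Both nodes now believe the other has failed; without an external
# tie-breaker, each may try to take ownership of the same LUNs.
print(node_a.peer_looks_dead(), node_b.peer_looks_dead())  # True True
```

That symmetric false positive is exactly the CAN-not-WILL scenario: the mechanism doesn't guarantee split-brain, it merely leaves the door open when the heartbeat transport is the thing that breaks.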

Secondly, if you look at the diagram, you’ll see a note about Torrenza. There’s a link elsewhere to the Torrenza wiki article, but to distill it down for consumption, here’s the quick-n’-dirty version: Torrenza is an AMD initiative to give dedicated co-processors access to the same HT and I/O stream as the CPU. So, you want dedicated processing for a given type of application? Install a co-processor into an available 1207 socket in the system. Systems using Cell processors, for example, have been demonstrated behind closed doors (not commercially available, to the best of my knowledge). The ultimate goal here would be to allow specialized co-processors for applications (RSA disk encryption, for example) to offload work from the general storage I/O processors. The application set is really endless. Want to do data encryption in-band or at rest? Install an RSA encryption co-processor. Want to do compression or de-dupe? Install a compliant DSP or co-processor that performs that task. When we look at the operating system for this Future Storage System, you’ll see even more applicability.
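In software terms, the offload model Torrenza enables looks like a dispatch table: the storage OS routes a data service to a specialized engine when one is present in a socket, and falls back to the general-purpose CPUs otherwise. Everything below (class and function names, the "compress" service) is a hypothetical illustration, not a real Torrenza API; the simulated "co-processor" is just the same CPU routine registered under a different path.

```python
import zlib

# Hypothetical sketch of co-processor offload dispatch. A real Torrenza
# part would sit in a Socket 1207 and share the HT fabric; here we only
# model the routing decision.

def cpu_compress(data: bytes) -> bytes:
    # Fallback path: do the work on a general-purpose core.
    return zlib.compress(data)

class OffloadManager:
    def __init__(self):
        self._engines = {}  # service name -> dedicated handler

    def register(self, service: str, handler):
        """Called when an expansion node / co-processor is discovered."""
        self._engines[service] = handler

    def run(self, service: str, data: bytes, fallback) -> bytes:
        # Prefer the dedicated engine; otherwise burn CPU cycles.
        return self._engines.get(service, fallback)(data)

mgr = OffloadManager()
payload = b"storage " * 100

# No DSP installed yet: compression lands on the CPU.
out_cpu = mgr.run("compress", payload, fallback=cpu_compress)

# "Install" a compression co-processor (simulated here by the same routine).
mgr.register("compress", cpu_compress)
out_dsp = mgr.run("compress", payload, fallback=cpu_compress)
print(out_cpu == out_dsp)  # True: same result either way
```

The design point is that the caller never changes: installing an engine only swaps which silicon answers the request, which is what makes per-application co-processors (encryption, de-dupe, compression) attractive as drop-in node additions.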

Closing Thoughts

Tomorrow, we’ll take a look at what an I/O-based expansion node would look like and explore the implications there.