Switched Networks Vs. Dedicated Links

Direct links run much faster than traditional switched networks. Using software-defined networking with dedicated links can help in the quest for storage bandwidth.

The search for adequate bandwidth has us scrambling for ways to maximize box-to-box intercommunication. Just recently, EMC bought the startup DSSD, whose main claim to fame is a PCIe interface to the host, as a way to maximize speed and minimize latency for an all-flash array.

Line speed has increased from 1 Gigabit Ethernet to 10GbE and then 40GbE in maybe four years, but 40GbE is expensive. Hidden within all this improvement is a major issue: Ethernet is a collision system. It's designed to allow multiple senders to try to reach the same host, and when a collision occurs, the losers have to retry.

Many benchmarks show good, efficient 10GbE operation, but this is an artifact of traffic patterns with essentially synchronized workloads. Real-world operation can set caps for efficient operation at less than 25% of nominal bandwidth.

Ethernet is like a traffic system where, if you arrive at an exit and it is blocked, you have to go around, get in the queue again, and hope it isn't blocked when you get back to the exit. Fundamentally, it isn't efficient.

Can we do better? One solution is to move to "Converged" Ethernet. This approach adds buffers in the switches so that any collisions result in data being temporarily stored until the receiving port clears. Buffer sizes are limited, and there has to be a throttling mechanism to make Converged Ethernet work. This allows transmissions to be paused by the receiving end of the connection for a time, allowing the traffic jam to clear.
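To make the buffering and throttling idea concrete, here is a toy model of a single switch output port. It is a sketch under my own assumptions, not a model of any real switch: the function name `simulate_port`, the tick-based timing, and the `pause_at` and `resume_at` thresholds are all illustrative. The sender pushes up to `line_rate` frames per tick, the receiver drains `drain_rate` frames per tick, and the port asks the sender to pause once the queue reaches `pause_at`.

```python
def simulate_port(offered, line_rate, drain_rate, buffer_size,
                  pause_at=None, resume_at=0):
    """Toy model: one sender, one buffered switch port, one slower receiver.

    offered     -- frames the application hands to the sender on each tick
    pause_at    -- queue depth that triggers a pause (None disables pausing)
    resume_at   -- queue depth at or below which the sender may resume
    """
    queue = backlog = 0
    delivered = dropped = paused_ticks = 0
    paused = False

    for frames in offered:
        backlog += frames                      # sender holds unsent frames
        if paused:
            paused_ticks += 1                  # sender stays quiet this tick
        else:
            sent = min(backlog, line_rate)
            backlog -= sent
            accepted = min(sent, buffer_size - queue)
            dropped += sent - accepted         # overflow is lost
            queue += accepted

        drained = min(queue, drain_rate)       # receiver empties the buffer
        queue -= drained
        delivered += drained

        if pause_at is not None:
            if queue >= pause_at:
                paused = True
            elif queue <= resume_at:
                paused = False

    return {"delivered": delivered, "dropped": dropped,
            "paused_ticks": paused_ticks, "pending": queue + backlog}


offered = [4] * 20
with_pause = simulate_port(offered, line_rate=4, drain_rate=2,
                           buffer_size=8, pause_at=4, resume_at=2)
without_pause = simulate_port(offered, line_rate=4, drain_rate=2,
                              buffer_size=8)
print("with pause:   ", with_pause)
print("without pause:", without_pause)
```

In this toy setup the paused run loses no frames because the pause threshold leaves one tick of headroom in the buffer, while the unpaused run overflows and drops. Either way the receiver's drain rate still caps delivered throughput; pausing only moves the waiting back to the sender.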

Remote Direct Memory Access (RDMA) doesn't much affect line performance, whether Converged or not, since it functions mainly to reduce system CPU overhead in working the links. A variety of streaming protocols have helped a bit, especially those that require no receipt confirmation for blocks.

Blade systems offer an alternative to the collision problem: There are direct connections between the blades, and each node has one or more dedicated links to the other nodes in a cluster. This allows a very simple (UDP) protocol and full duplex operation without any collisions, bringing realized bandwidth close to theoretical limits.

One downside of a blade star fabric system is that, when fully implemented, the number of links grows roughly as the square of the number of nodes. This has generally limited its use to small clusters, such as the 12 blades of a single blade server. Moving outside the box requires some refinement.
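The growth is easy to check with a little arithmetic. Assuming every node gets one dedicated link to every other node (a full mesh, which is my reading of "fully implemented"), the link count is n(n-1)/2. The helper name `full_mesh_links` below is just for illustration.

```python
def full_mesh_links(nodes: int) -> int:
    """Point-to-point links needed so each node has one dedicated
    link to every other node: n * (n - 1) / 2."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


for n in (4, 12, 48):
    print(f"{n} nodes -> {full_mesh_links(n)} links")
```

For the 12-blade case that works out to 66 links, which is manageable on a backplane, while 48 nodes would need 1,128.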

EMC's DSSD acquisition addresses the out-of-the-box need, though clumsily. Dedicated PCIe links connect the array to the host servers, but PCIe suffers from being an internal protocol, and it's quite possible that older-generation link speeds will be the norm. The interconnects also are dedicated point-to-point links, since PCIe switching is in its infancy. Ethernet appears to be racing ahead of any other link mechanism, with 56GbE shipping and 100GbE in the design labs.

I would postulate we have the way to resolve the loss of performance due to collisions already in hand. Standard Ethernet switches are smart enough that we can define two-node VLANs that essentially give us direct connection via a switch. Having the switch allows us a great deal of configuration flexibility, and the fabric can be software-defined.

We need a fast low-overhead protocol to take advantage of the high-quality dedicated connection. RoCE and iWARP are candidates, but RoCE implies a converged environment, while iWARP will run out of the box. There are protocols that don't require RDMA support, including Google's QUIC.

Because this is software-defined networking, we can build and tear down VLANs as needed to cope with load changes. Booting a new in-memory instance can get a lot of resources until it completes and then drop back the number of connections to the level required for normal operation.
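As a rough sketch of the bookkeeping a controller would need, the class below hands out one VLAN ID per pair of ports and reclaims it on teardown. Everything here is hypothetical: `PairVlanPool`, the port names, and the ID range are my own placeholders, and the code only tracks assignments in memory. Pushing the configuration to an actual switch is left out, since that depends on the vendor and controller.

```python
class PairVlanPool:
    """In-memory tracker for two-node VLANs (hypothetical helper).

    Each VLAN holds exactly two ports, and a port may belong to only
    one such VLAN at a time, so every VLAN acts as a dedicated link.
    """

    def __init__(self, first_id=100, last_id=199):
        # Reversed so that pop() hands out the lowest free ID first.
        self._free = list(range(last_id, first_id - 1, -1))
        self._by_pair = {}
        self._busy_ports = set()

    def connect(self, port_a, port_b):
        pair = frozenset((port_a, port_b))
        if len(pair) != 2:
            raise ValueError("a dedicated VLAN needs two distinct ports")
        if pair & self._busy_ports:
            raise RuntimeError("port already belongs to a dedicated VLAN")
        if not self._free:
            raise RuntimeError("no VLAN IDs left in the pool")
        vlan_id = self._free.pop()
        self._by_pair[pair] = vlan_id
        self._busy_ports |= pair
        return vlan_id

    def disconnect(self, port_a, port_b):
        pair = frozenset((port_a, port_b))
        vlan_id = self._by_pair.pop(pair)   # KeyError if never connected
        self._busy_ports -= pair
        self._free.append(vlan_id)          # ID becomes reusable
        return vlan_id

    def active(self):
        return {tuple(sorted(pair)): vid for pair, vid in self._by_pair.items()}


pool = PairVlanPool()
boot_links = [pool.connect(f"server1/eth{i}", f"storage1/eth{i}") for i in range(4)]
print("during boot:", pool.active())

# Boot finished: drop back to a single dedicated link.
for i in range(1, 4):
    pool.disconnect(f"server1/eth{i}", f"storage1/eth{i}")
print("steady state:", pool.active())
```

The usage at the bottom mirrors the boot scenario: four dedicated links while the instance loads, then three are torn down and their ports and IDs return to the pool for other servers.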

One downside of this approach is that the total number of connections increases, but in a real system, allowing the dedicated links to be configured on the fly by software permits enough flexibility to cope. Remember that this system supports ordinary shared Ethernet links, as well, though a protocol shift may be needed.

Using dedicated links means that servers will need more than the two Ethernet links typical of most designs. The tradeoff is that these servers won't need lots of storage interfaces, so the SAS/SATA port count can drop way down. I suspect six to eight 10GbE ports would be a typical buildout. The storage boxes would also need more channels.

Obviating collisions should allow a much faster storage connection, and running it via switches allows SDN flexibility of configuration. How this is structured together needs a use-case analysis, but the impact on in-memory and VDI applications, in terms of efficient booting, could be dramatic.

Jim O'Reilly was Vice President of Engineering at Germane Systems, where he created ruggedized servers and storage for the US submarine fleet. He has also held senior management positions at SGI/Rackable and Verari; was CEO at startups Scalant and CDS; headed operations at PC ...

Interestingly, this is somewhat solved by some of the optical technologies. The ability to use L1 paths across infrastructure helps to get around some of the static nature of bandwidth. And if you combine dynamic pathing with an SDN controller, you can do things like load balancing, traffic engineering, and dynamic pathing to meet application requirements.

I suspect that the real change that is happening (and has been for a while) is that traffic is far more east-west than north-south at this point. The interconnect is almost more important than the uplinks. We should expect to see this interconnect happen at the rack level before long (big optic pipes between racks of resources). In this world, the traditional architectures get disrupted, along with the ecosystem of suppliers around those.

The change in infrastructure you describe would impact the design of storage appliances in a big way. They'll either need to be local in the racks, or have many really fast ports to move data to the inter-rack level.

I agree that would be the implication. I suspect we end up in a place with lots of compute and storage in a rack with high-speed interconnects within the rack and then fast pipes between racks. It's certainly what Intel would want.

This may cause us to rethink the rack concept a bit. How about the double rack (back to back) that Verari sold, or tying adjacent racks in a container together as if they were one entity? This would increase local cluster size to make room for networked storage.

There are no sidewalls on these types of rack, so cabling can go round the front or through the cutouts on the frames. A cluster could be two, four or even more racks in size, with just a single switch hop to connect. Direct server-server links could be added too, but server vendors need to face up to needing more than two Ethernet ports.

So, when talking on a FIOS home phone, is the conversation real time or is it data? Not sure how to phrase this, but can a circuit-switched conversation be captured and stored as data as opposed to recording it? I am thinking of that software that stores phone calls and then lets you search them for keywords. Does it act the same for circuit-switched and packet-switched? This answer by fiberstore

I agree with Jim here: Ethernet is not the most efficient system, especially as traffic collisions increase. It seems as though a more scientific approach may be necessary. Such fancy terms as regression analysis, normally relegated to statisticians, may be needed in the search for a solution here.

That's why Fibre Channel was introduced to carry SCSI commands. The lessons were learned from Ethernet.

Actually, Ethernet has pause frames, but once congestion occurs, all traffic is stopped regardless of its importance.

If you want to make this operation efficient, you can separate traffic into classes and treat them based on their importance. That is why Data Center Bridging (DCB) was invented. Priority Flow Control (PFC) gives us the ability to put traffic into classes and control the flows based on their priority.

Efficient queuing is also possible with Enhanced Transmission Selection (ETS), which can be thought of as a subcategory of DCB.

All these protocols help Ethernet become lossless, so for storage traffic you can carry SCSI over FC over Ethernet or IP networks.

Setting all this up can increase configuration complexity, and buffer management might also be a concern, but these are all our design tools.

I assume you are thinking of speech to text conversion. It's possible to process any spoken words into text automatically, but the quality of translation for general speech from an occasional speaker is still not good.

However, for deaf people, there is a wealth of real-time speech-to-text software. Just Google "speech to text deaf."