Intel’s Betting the Storage I/O Farm on the CPU

I had the privilege of attending Tech Field Day 4 in San Jose this week as a delegate thanks to Stephen Foskett and Gestalt IT. It was a great event and a lot of information was covered in two days of presentations. I’ll be discussing the products and vendors that sponsored the event over the next few blogs starting with this one on Intel. Check out the official page to view all of the delegates and find links to the recordings etc. http://gestaltit.com/field-day/2010-san-jose/.

Intel presented both their Ethernet NIC and storage I/O strategy as well as a processor update and public road map. This post will focus on the Ethernet and I/O presentation.

Intel began the presentation with an overview of the data center landscape and a description of the move towards converged I/O infrastructure, meaning storage, traditional LAN and potentially High Performance Computing (HPC) on the same switches and cables. Anyone familiar with me or this site knows that I am a fan and supporter of converging the network infrastructure to reduce overall cost and complexity as well as provide more flexibility to data center I/O so I definitely liked this messaging. Next was a discussion of iSCSI and its tradition of being used as a consolidation tool.

iSCSI:

iSCSI has been used for years to provide consolidated block storage without the need for a separate physical network. Most commonly iSCSI has been deployed as a low-cost alternative to Fibre Channel. It’s typically been used in the SMB space and for select applications in larger data centers. iSCSI was previously limited to 1 Gigabit pipes (prior to the 10GE ratification) and it also suffers from higher latency and lower throughput than Fibre Channel. The beauty of iSCSI is the ability to use existing LAN infrastructure and traditional NICs to provide block access to shared disk; the Achilles heel is performance. Because of this, cost has always been the primary deciding factor to use iSCSI. For more information on iSCSI see my post on storage protocols: http://www.definethecloud.net/storage-protocols.

In order to increase the performance of iSCSI and decrease the overhead on the system processor(s), the industry developed iSCSI Host Bus Adapters (HBAs), which offload the protocol overhead to the I/O card hardware. These were not widely adopted due to the cost of the cards, which means that a great many iSCSI implementations rely on a protocol stack in the operating system (OS).

Intel then drew parallels to doing the same with FCoE via the FCoE software stack available for Windows and included in current Linux kernels. The issue with drawing this parallel is that iSCSI is a mid-market technology that sacrifices some performance and reliability for cost, whereas FCoE is intended to match/increase the performance and reliability of FC while utilizing Ethernet as the transport. This means that when looking at FCoE implementations the additional cost of specialized I/O hardware makes sense to gain the additional performance and reduce the CPU overhead.

Intel also showed some performance testing of the FCoE software stack versus hardware offload using a CNA. The IOPS they showed were quite impressive for a software stack, but IOPS isn’t the only issue. The other issue is protocol overhead on the processor. Their testing showed an average of about 6% overhead for the software stack. 6% is low, but we were only being shown one set of test criteria for a specific workload. Additionally, we were not provided the details of the testing criteria. Other tests I’ve seen of the software stack are about 2 years old and show very comparable CPU utilization for the FCoE software stack and Generation I CNAs for 8 KB reads, but a large disparity as the block size increased (CPU overhead became worse and worse for the software stack). In order to really understand the implications of utilizing a software stack, Intel will need to publish test numbers under multiple test conditions:

Sequential and random

Various read and write combinations

Various block sizes

Mixed workloads of FCoE and other Ethernet based traffic
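To give a sense of how much ground those four conditions cover, here is a minimal sketch that enumerates them as a test matrix. The specific values on each axis (read percentages, block sizes and so on) are my own illustrative assumptions, not figures from Intel or Demartek.

```python
from itertools import product

# Hypothetical axis values chosen for illustration only.
ACCESS_PATTERNS = ["sequential", "random"]
READ_PERCENTAGES = [100, 70, 50, 0]        # the remainder of each mix is writes
BLOCK_SIZES_KB = [4, 8, 64, 256, 1024]
MIXED_TRAFFIC = [False, True]              # FCoE alone vs. FCoE sharing the link with other Ethernet traffic


def build_test_matrix():
    """Return every combination of the four axes as a list of dicts."""
    return [
        {"pattern": p, "read_pct": r, "block_kb": b, "mixed_traffic": m}
        for p, r, b, m in product(
            ACCESS_PATTERNS, READ_PERCENTAGES, BLOCK_SIZES_KB, MIXED_TRAFFIC
        )
    ]


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(len(matrix), "test cases")
    print(matrix[0])
```

Even with these modest axes the matrix is far larger than a single workload, which is why one CPU utilization figure says very little on its own.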

I’ve since located the test Intel referenced from Demartek. It can be obtained here: http://www.demartek.com/Reports_Free/Demartek_Intel_10GbE_FCoE_iSCSI_Adapter_Performance_Evaluation_2010-09.pdf. Notice that in the foreword Demartek states the importance of CPU utilization data and stresses that they don’t cherry-pick data, then provides CPU utilization data only for the Microsoft Exchange simulation through JetStress, not for the SQLIO simulation at various block sizes. I find that you can learn more from the data not shown in vendor-sponsored testing than from the data shown.

Even if we were to make two big assumptions, that software stack IOPS are comparable to CNA hardware and that additional CPU utilization is less than or equal to 6%, would you want to add an additional 6% CPU overhead to your virtual hosts? The purpose of virtualization is to come as close as possible to full hardware utilization by placing multiple workloads on a single server. In that scenario adding additional processor overhead seems short-sighted.

The technical argument for doing this is twofold:

Saving cost on specialized I/O hardware

Processing capacity evolves faster than I/O offload capacity and speeds, mainly due to economies of scale, so with a software stack your I/O performance will increase with each processor refresh

If you’re looking to save cost and are comfortable with the processor and performance overhead then there is no major issue with using the software stack. That being said, if you’re really trying to maximize performance and/or virtualization ratios you want to squeeze every drop you can out of the processor for the virtual machines. As far as the second point of processor capacity goes, it most definitely rings true, but with each newer, faster processor you buy you’re losing that assumed 6% off the top for protocol overhead. That isn’t acceptable to me.
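To put that 6% in virtualization terms, here is a rough back-of-the-envelope sketch. The host size, the average CPU share per virtual machine and the fleet size are hypothetical numbers I picked for illustration; only the 6% figure comes from the presentation.

```python
import math


def vms_per_host(cores, cores_per_vm, stack_overhead):
    """VMs that fit on a host after the protocol stack takes its share of CPU."""
    usable_cores = cores * (1.0 - stack_overhead)
    return math.floor(usable_cores / cores_per_vm)


def cores_lost(cores, stack_overhead):
    """Core-equivalents consumed by protocol processing on one host."""
    return cores * stack_overhead


if __name__ == "__main__":
    # Assumed values: 16-core hosts, 0.5 core per VM on average, 100 hosts.
    offloaded = vms_per_host(16, 0.5, 0.0)
    software = vms_per_host(16, 0.5, 0.06)
    print("VMs per host with offload:       ", offloaded)
    print("VMs per host with software stack:", software)
    print("VMs lost across 100 hosts:       ", (offloaded - software) * 100)
    # A processor refresh to 32 cores still gives up the same fraction.
    print("Core-equivalents lost on a 32-core host:", cores_lost(32, 0.06))
```

The absolute numbers will differ in any real environment, but the shape of the result holds: the overhead scales with the processor, so a faster processor does not make it go away.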

The Other Problem:

FC and FCoE have been designed to carry native SCSI commands and data and treat them as SCSI expects; most importantly, frames are not dropped (a lossless network). The flow control mechanism FC uses for this is called buffer-to-buffer credits (B2B). This is a hop-by-hop mechanism implemented in hardware on HBAs/CNAs and FC switches. When two ports initialize a link, each tells the other how many buffer spaces it has dedicated to the device on the other side of the link, based on the agreed frame size. When any device sends a frame it is responsible for keeping track of the buffer space available on the receiving device based on these credits. When a device receives a frame and has processed it (removing it from the buffer), it returns an R_RDY, similar to a TCP ACK, which lets the sending device know that a buffer has been freed. For more information on this see the buffer credits section of my previous post: http://www.definethecloud.net/whats-the-deal-with-quantized-congestion-notification-qcn. This mechanism ensures that a device never sends a frame that the receiving device does not have sufficient buffer space for, and it is implemented in hardware.
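The credit accounting described above can be sketched as a toy model. This is only an illustration of the bookkeeping, with class and function names I made up; real B2B credits live in HBA and switch hardware and are not implemented this way.

```python
from collections import deque


class Receiver:
    """Toy receiving port with a fixed number of frame buffers."""

    def __init__(self, buffers):
        self.buffers = buffers
        self.queue = deque()

    def accept(self, frame):
        # With correct credit accounting this branch is never reached.
        if len(self.queue) >= self.buffers:
            raise OverflowError("frame arrived with no free buffer")
        self.queue.append(frame)

    def process_one(self):
        """Free one buffer. Returns True when an R_RDY should be sent."""
        if self.queue:
            self.queue.popleft()
            return True
        return False


class Sender:
    """Toy sending port that tracks the receiver's buffers as credits."""

    def __init__(self, credits):
        self.credits = credits  # learned from the peer at link initialization

    def try_send(self, frame, receiver):
        if self.credits == 0:
            return False        # no credit: hold the frame until an R_RDY arrives
        self.credits -= 1
        receiver.accept(frame)
        return True

    def on_r_rdy(self):
        self.credits += 1       # one buffer was freed on the far side


def simulate(frames=10, buffers=3):
    """Send `frames` frames; return (frames sent, times the sender had to wait)."""
    rx = Receiver(buffers)
    tx = Sender(credits=buffers)
    sent = stalled = 0
    while sent < frames:
        if tx.try_send(sent, rx):
            sent += 1
        else:
            stalled += 1
            if rx.process_one():
                tx.on_r_rdy()
    return sent, stalled


if __name__ == "__main__":
    print(simulate())
```

The point to notice is that the sender waits instead of transmitting, so the receiver's overflow branch is never taken and no frame is ever dropped.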

On FCoE networks we’re relying on Ethernet as the transport so B2B credits don’t exist. Instead we utilize Priority Flow Control (PFC) which is a priority based implementation of 802.3x pause. For more information on DCB see my previous post: http://www.definethecloud.net/data-center-bridging-exchange. PFC is handled by DCB capable NICs and will handle sending a pause before the NIC buffers overflow. This provides for a lossless mechanism that can be translated back into B2B credits at the FC edge.
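For comparison with credits, here is a minimal sketch of pause-style flow control on a single priority. It is a toy model under my own assumptions: the thresholds (labeled XOFF and XON here), the buffer size and the traffic rates are invented, and nothing in it reflects the actual 802.1Qbb frame format or any vendor's implementation.

```python
class PfcReceiveQueue:
    """Toy receive buffer for one priority, measured in frames."""

    def __init__(self, capacity, xoff, xon, pfc_enabled=True):
        if not (0 <= xon < xoff <= capacity):
            raise ValueError("need 0 <= xon < xoff <= capacity")
        self.capacity = capacity
        self.xoff = xoff            # depth at which a pause is signaled
        self.xon = xon              # depth at which the sender may resume
        self.pfc_enabled = pfc_enabled
        self.depth = 0
        self.max_depth = 0
        self.dropped = 0
        self.paused = False

    def enqueue(self):
        if self.depth >= self.capacity:
            self.dropped += 1       # buffer overflow: the frame is lost
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if self.pfc_enabled and self.depth >= self.xoff:
            self.paused = True      # tell the sender to pause this priority

    def dequeue(self):
        if self.depth > 0:
            self.depth -= 1
        if self.paused and self.depth <= self.xon:
            self.paused = False     # enough room again: let the sender resume


def run(ticks, pfc_enabled):
    """Sender offers 2 frames per tick, receiver drains 1. Returns (dropped, max depth)."""
    q = PfcReceiveQueue(capacity=10, xoff=6, xon=3, pfc_enabled=pfc_enabled)
    for _ in range(ticks):
        if not q.paused:            # the sender honors the pause at the start of each tick
            q.enqueue()
            q.enqueue()
        q.dequeue()
    return q.dropped, q.max_depth


if __name__ == "__main__":
    print("with PFC:   ", run(50, True))
    print("without PFC:", run(50, False))
```

The gap between the pause threshold and the full buffer is headroom for frames already on the wire when the pause is signaled. In this toy model the sender reacts within one tick, so very little headroom is needed.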

The issue here with the software stack is that while the DCB capable NIC ensures the frame is not dropped on the wire via PFC it has to pass processing across the PCIe bus to the processor and allow the protocol to be handled by the OS kernel. This adds layers in which the data could be lost or corrupted that don’t exist with a traditional HBA or CNA.

Summary:

The FCoE software stack is not a sufficient replacement for a CNA. Emulex, Broadcom, QLogic and Brocade are all offloading protocol to the card to decrease CPU utilization and increase performance. HP has recently announced embedding Emulex OneConnect adapters, which offload iSCSI, TCP and FCoE, on the system board. That’s a lot of backing for protocol offload with only Intel standing on the other side of the fence. My guess is that Intel’s end goal is to sell more processors, and utilizing more cycles for protocol processing makes sense. Additionally, Intel doesn’t have a proven FC stack to embed on a card and the R&D costs would be significant, so throwing it in the kernel and selling their standard NIC makes sense to the business. Lastly, don’t forget storage vendor qualification: Intel has an uphill battle getting an FCoE software stack on the approved list for the major storage vendors.

Full Disclosure: Tech Field Day is organized by the folks at Gestalt IT and paid for by the presenters of the event. My travel, meals and accommodations were paid for by the event but my opinions, negative or positive, are all mine.

12 Replies to “Intel’s Betting the Storage I/O Farm on the CPU”

What are your thoughts on software stack for RAID for local storage? It’s a similar situation to software stack for I/O: you trade a single-digit slice of CPU utilization to eliminate ‘specialized’ hardware. Intel has done firmware RAID stacks for awhile with SATA, and they’ll extend that to SAS with Sandy Bridge ( http://www.theregister.co.uk/2010/09/15/sas_in_patsburg/ ).

Latency for software raid is a common concern. Hardware raid also provides several enhancements for better reliability (specifically related to faster write times and caching) that work out better in the long run.

Greg read my mind. Disk is currently the slowest link in the chain even when using hardware RAID and SAS/FC disks. We’re always trying to speed up disk access and I wouldn’t be interested in sacrificing even single digit performance. Additionally, as Greg states, HW RAID provides better reliability and caching.

It’s worth noting that PFC has some serious limitations relating to Network Size. A credit based flow control mechanism is subject to latency problems where the protocol must be lossless and because the propagation delays must be less than the receiver buffers over time, this limits the maximum diameter of PFC as an effective tool.

At the moment, it looks like FCoE will scale to just a few switches in diameter. That’s a very crystalline design feature that isn’t going to play well when the imposed cost of FCoE is already very high.

iSCSI solves this problem, plus has the advantage in WAN or DC-DC requirements and works OK with existing software stacks.

Interesting to see Intel point out that SMB will work just fine for iSCSI or FCoE workloads though. Most of them already know this and are using it. Elitist waffle about specific use cases (such as SQL) isn’t all that important for the bulk of the market.

I agree with you that credit-based flow control poses issues when moving over several hops. That being said, FCoE should not be designed any differently than FC, meaning 2-3 hops maximum, where credit-based flow control is perfect. We have no need to push FCoE storage traffic through every switch on our LAN and WAN, just from the storage to an aggregation layer acting as the SAN core and then to the access layer acting as the SAN edge. This scales appropriately to connect as many servers as needed even in large data centers, especially when you consider that virtualization is reducing the physical footprint anyway.

iSCSI definitely has its uses and place, as does FCoE, but both are problematic bandaids for the underlying problem of networking SCSI. The real goal should be stopping our reliance on block data and SCSI itself and moving to native network-based implementations that can utilize the full suite of TCP/IP tools. NFS and HTTP-based storage are far more attractive options to me.

As far as iSCSI for DC to DC or WAN communications I’m not sure what you’re smoking on that side of the pond but I’d love to hear where that ugly baby is being used in production and how it’s going.

Partially wrong. If you use the design Cisco is pushing, every major Ethernet switch is a full-blown FCoE Forwarder (FCF … having FC domain ID and running FSPF) and every FCoE leg spans a single point-to-point link or has at most one DCB-enabled switch in the middle. PFC diameter thus becomes somewhat irrelevant within the DC and BB_credits work hop-by-hop between Ethernet+FCoE switches.

It is true, though, that due to latency long-distance PFC is mission impossible … which also answers the question whether it makes sense to bridge FCoE between Data Centers 😉 Thanks for that insight!

I do find it “odd” that Intel is swimming upstream with a software solution when the industry has clearly chosen hardware offloading. This is evident from the OEM products that contain the latter solution.

As you saw at Tech Field Day test data can be tweaked to show favorable data. However, what a vendor can’t do is replicate that favorable data in a “true” test environment with different/random read/write data.

I’m glad you reported that it’s definitely a push for Intel to sell more processors. Because what they will not tell you is that in order to have more virtualization in your infrastructure with that solution your servers are going to need bigger, faster and of course more processors.

In my opinion, this will not lead to vendor lock-in but it will give Intel an opportunity to get a beachhead in something (convergence) where they are not currently on the radar with their current HBA cards.