Welcome to the NutanixVerse

NVMe-oF 101

I have been told this FMS presentation from August 2018 is still a useful primer for NVMe-oF, so I will leave it here untouched apart from a few edits.

Many people are confused about what NVMe-oF is exactly. So let's start with the basics.

NVMe is an interface specification, like SATA, which allows fast access to direct-attached solid-state devices.

It is a specification optimized for NAND flash and next-generation solid-state storage technologies.

Unlike SATA and SAS, NVMe devices communicate with the system CPU using the high-speed PCIe bus directly.

A separate storage controller (HBA) is not required, but note that the new SM2264 SSD controller will become a de facto silicon add-on for PCIe 4.0 that also works with the CPU.

You will see SM2264 pairings with new Xeon Processors become mainstream in 2019.

NVMe devices come in a variety of form factors – add-in cards, U.2 (aka 2.5 inch) drives and M.2 modules.

NVM Express devices communicate directly with the system CPU.

They can achieve roughly 1 million IOPS, 3 microseconds latency and low CPU usage rates with this method.

This makes NVMe SSDs a lot faster than SATA SSDs.

A low-cost NVMe SSD can be about 2X faster than a similarly priced SATA SSD.

At the high end they deliver approx 10 times more IOPS and 10 times lower latency.

These technical capabilities result in a broad swathe of business benefits.

For example, with NVMe customers receive much faster response times but can the applications themselves deal with these fast response times?

Companies should be able to process a higher volume of business transactions per second when all the glitches and roadblocks in the various areas are sorted out.

And there are various issues to sort out, by the way. Intel SPDK and Linux AIO are both a challenge as things stand right now.

Higher IOPS and lower latency deliver better end user experience and theoretically enhance the business they are being used for.

A recent comparison made by Intel showed that an NVMe SSD has 4.7X higher performance than a SATA SSD.

Even though NVMe has this performance advantage, you will notice that some companies' SATA All Flash arrays have the same IOPS performance numbers as NVMe Flash Arrays.

Why is that?

One-on-one comparisons stop being meaningful beyond a single drive because of the bus bandwidth available today on PCIe 3.0.

One NVMe SSD vs one SATA SSD has one clear winner, but put 24 into one system and it's a completely different story.

Unless you need the low latency offered by NVMe, a good SATA AFA is still going to perform as well as an NVMe AFA IOPS-wise.

This will change with PCIe 4.0.

PCIe 4.0 is going to be like rocket fuel for NVMe systems...

NVMe is the buzzword for Non-Volatile Memory Express, the PCI Express based interface that the super fast SSD in your state of the art PC or laptop typically uses for storage through an M.2 connector.

I have a few AMD Ryzen and AMD Threadripper workstations in my home office and each machine has two of these supposedly super fast M.2 slots on the motherboard.

However, even though all these motherboards have two M.2 slots, only one can run a high-speed NVMe M.2 SSD at full throttle. The other is filled with a SATA SSD with an M.2 connector, which cannot deliver the full NVMe experience of the first slot.

This is because PCIe 3.0 has bandwidth limitations: you cannot have a high performance PCIe graphics card and two M.2 NVMe SSDs plugged into the PCIe bus and expect them all to get maximum performance from it.

You will find you have to get a slower SATA SSD with the M.2 interface to work with the second M.2 connector port.

Servers used for All Flash storage arrays don't usually have high end graphics cards by the way so there is more bandwidth for the storage devices using NVMe, but they all still have bandwidth limitations imposed by PCIe 3.0.

Some servers designed specifically for NVMe storage share up to 24 NVMe SSDs on their PCIe bus, and you can do the math with the PCIe 3.0 bandwidth to figure out how much any one of them gets at any given time.

This bandwidth by the way is 8 GigaTransfers per second (GT/s), per lane.

There is also encoding overhead to account for. The PCIe 1.0 and 2.0 specifications use 8b/10b encoding, the same as SATA.

This was not very efficient and took a 20% overhead from the available bandwidth.

PCIe 3.0 switched to 128b/130b encoding with an overhead of 1.54%.

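This means that each lane can carry a theoretical 985 MB/s in each direction. A x16 link therefore tops out at about 15.75 GB/s each way, which is usually published as a 32 GB/s maximum counting both directions.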
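To make the encoding arithmetic concrete, here is a rough Python sketch. The 8 GT/s rate and the two encodings come from the text above; the 2.5 and 5 GT/s rates for PCIe 1.0 and 2.0 are the published line rates, added for comparison. The function names are my own and purely illustrative.

```python
# (line rate in GT/s per lane, payload bits, bits on the wire)
GENERATIONS = {
    "PCIe 1.0": (2.5, 8, 10),      # 8b/10b encoding
    "PCIe 2.0": (5.0, 8, 10),      # 8b/10b encoding
    "PCIe 3.0": (8.0, 128, 130),   # 128b/130b encoding
}


def encoding_overhead_pct(payload_bits, coded_bits):
    """Share of the raw line rate spent on encoding, in percent."""
    return 100 * (coded_bits - payload_bits) / coded_bits


def lane_mb_per_s(gt_per_s, payload_bits, coded_bits):
    """Usable megabytes per second for one lane in one direction."""
    usable_bits_per_s = gt_per_s * 1e9 * payload_bits / coded_bits
    return usable_bits_per_s / 8 / 1e6


for name, (rate, payload, coded) in GENERATIONS.items():
    per_lane = lane_mb_per_s(rate, payload, coded)
    print(f"{name}: {encoding_overhead_pct(payload, coded):.2f}% overhead, "
          f"{per_lane:.0f} MB/s per lane, "
          f"{per_lane * 16 / 1000:.2f} GB/s for x16 one way")
```

Doubling the x16 one-way figure for PCIe 3.0 gives the number usually rounded up to 32 GB/s in the marketing material.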

Not only that, but there are only so many x16, x8 and x4 connectors on each motherboard for the SSDs to share. An Intel storage controller is, after all, just a server with disk connector interfaces.

PCIe 4.0 is baked and yesterday they started working on PCIe 5.0. These offer huge leaps in bandwidth per lane capability.
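Here is a back-of-the-envelope sketch of what those leaps mean for a 24-drive box. It assumes, purely for illustration, that all 24 SSDs sit behind a single x16 uplink; real servers spread drives over several CPU root ports and switches, so treat the layout as hypothetical. The 16 and 32 GT/s rates are the published per-lane figures for PCIe 4.0 and 5.0, both of which keep 128b/130b encoding.

```python
GT_PER_S = {"PCIe 3.0": 8.0, "PCIe 4.0": 16.0, "PCIe 5.0": 32.0}
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b, used by all three generations


def link_gb_per_s(gt_per_s, lanes):
    """Usable GB/s in one direction for a link of the given width."""
    return gt_per_s * ENCODING_EFFICIENCY / 8 * lanes


def per_drive_share(gt_per_s, drives=24, uplink_lanes=16):
    """GB/s each drive gets if every drive is busy at the same time.

    Assumes one shared uplink, which is a simplification.
    """
    return link_gb_per_s(gt_per_s, uplink_lanes) / drives


for name, rate in GT_PER_S.items():
    print(f"{name}: x4 drive link {link_gb_per_s(rate, 4):.2f} GB/s, "
          f"fair share of a x16 uplink across 24 drives "
          f"{per_drive_share(rate):.2f} GB/s")
```

Under that assumption each drive can only be fed a fraction of what its own x4 link could carry, and each new PCIe generation doubles the share.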

Now that we have a better idea about the NVMe interface, let’s take a closer look at NVMe over Fabrics, or NVMe-oF.

NVMe over Fabrics is an extension to NVM Express that goes beyond PCIe, allowing the NVM Express command set to be used over various additional network interfaces.

This brings the benefits of the efficient NVMe storage into even more data centers and enterprises by allowing the same protocol to extend over a wider range of heterogeneous devices and interfaces.

NVMe-oF or NVMe over Fabrics is a network protocol, like iSCSI, used to communicate between a host and a storage system over a network (aka fabric).

It is most often deployed over RDMA, and NVMe over Fabrics can use any of the RDMA technologies, including InfiniBand, RoCE and iWARP. Fibre Channel is also a defined transport: NetApp have elected to use 32Gb FC instead of 100GbE. It is debatable which is better, though they do have FCoE on offer as well.
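For a feel of what attaching a host to an RDMA target looks like, here is a small Python sketch that only builds the command lines for the Linux nvme-cli tool; it does not run them. The `-t`, `-a`, `-s` and `-n` options (transport, target address, service ID, subsystem NQN) are nvme-cli options, and 4420 is the registered NVMe-oF port. The IP address and NQN are made-up placeholders, and the helper name is my own.

```python
import shlex

NVME_OF_PORT = "4420"   # registered default port for NVMe-oF


def nvme_cmd(action, transport, traddr, nqn=None, trsvcid=NVME_OF_PORT):
    """Build (but do not run) an nvme-cli discover or connect command."""
    if action not in ("discover", "connect"):
        raise ValueError("action must be 'discover' or 'connect'")
    if action == "connect" and nqn is None:
        raise ValueError("connect needs the subsystem NQN")
    argv = ["nvme", action, "-t", transport, "-a", traddr, "-s", trsvcid]
    if nqn is not None:
        argv += ["-n", nqn]
    return argv


# Placeholder target: documentation-range IP and an example NQN.
target_ip = "192.0.2.10"
subsystem = "nqn.2018-08.com.example:subsys1"

print(shlex.join(nvme_cmd("discover", "rdma", target_ip)))
print(shlex.join(nvme_cmd("connect", "rdma", target_ip, nqn=subsystem)))
```

Once connected, the remote namespace shows up on the host as an ordinary NVMe block device, which is why the iSCSI comparison comes so naturally.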

NVMe-oF, compared to iSCSI, has much lower latency, in practice adding just a few microseconds to cross the network.

This makes the difference between local storage and remote storage very small.

As with just about any network protocol, it can be used to access a simple feature-less storage box (JBOF) as well as a feature-rich block storage system (SAN). It can be used to access a storage system built with previous generation (SATA) devices.

However, it is strongly associated with NVMe devices due to the performance benefits of that particular combination offering very low latency.

NVMe over Fabrics (NVMe-oF) is an emerging technology, but it is really just a glorified high speed iSCSI setup if you boil it down to its basics.

It gives data centers unprecedented access to NVMe SSD storage.

To summarize – NVMe-oF enables faster access between hosts and storage systems and this drives new levels of business agility and competitiveness.

As NVMe-oF gains popularity, many storage vendors are jumping on the bandwagon to deliver very fast, low latency storage systems.

Often these systems are marketed as NVMe-oF but in fact use a proprietary native protocol for the specific solution.

There’s nothing wrong with this approach, and it might even be superior in many ways; it’s just not true NVMe-oF.

NVMe-oF itself has some drawbacks which might hinder its adoption:

Firstly, NVMe-oF is a standard still in its infancy. Each vendor has its own way of implementing this standard in their solutions. This leads to different implementations of NVMe-oF, which are potentially incompatible.

If you want it to work in your deployment, you’ll have to put extra diligence into designing and implementing an NVMe-oF solution: get a reference architecture and follow it to the letter.

The IEEE is taking control of NVMe-oF standards soon by the way.

Secondly, as stated, NVMe-oF is in essence a glorified iSCSI protocol, except it is so much faster.

NVMe-oF copies the outdated architecture and concepts of the traditional SAN model, which was developed some 40 years ago.

It does not play well with current technologies and concepts such as fully automated API control, software defined storage (SDS), hyper-converged infrastructure and distributed storage (DS).

This means NVMe-oF is a point-to-point link between one initiator host and one target.

This architecture conflicts with implementing high availability and scale-out of a generic storage system.

Not that it is impossible, but NVMe-oF with HA and scale-out is inferior in many ways to a solution engineered for these requirements from the ground up.

Figure: NVMe-oF host components

Since NVMe-oF assumes connection to a single target, it needs to make extra hops over the network when deployed with a modern software-defined or distributed storage solution.

This increases latency and reduces IOPS capability.

Thirdly, most implementations of NVMe-oF lack support for end-to-end data integrity.

Figure: NVMe-oF target components

End-to-end data integrity checking is critical.

Especially when considering the vast amounts of data which these storage systems are expected to process at blazing fast speeds.

Without data integrity built in, data corruption is very likely to happen.

Not that it is impossible to implement in NVMe-oF, but in practice it generally isn’t baked yet.
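The idea behind end-to-end integrity is simple enough to sketch in a few lines of Python: attach a guard value to each block when it is written and check it again when the block is read back on the far side. This is only a stand-in. It uses the CRC-32 from Python's zlib module for convenience, whereas real storage protection schemes define their own guard and tag fields, and the function names are my own.

```python
import zlib
from typing import Tuple


def protect(block: bytes) -> Tuple[bytes, int]:
    """Return the block together with a guard value computed at the source."""
    return block, zlib.crc32(block)


def verify(block: bytes, guard: int) -> bool:
    """Recompute the guard at the destination and compare."""
    return zlib.crc32(block) == guard


# A 4 KiB block leaves the host with its guard attached.
data, guard = protect(bytes(4096))

# Somewhere along the path a single bit gets flipped.
corrupted = bytearray(data)
corrupted[100] ^= 0x01

print("clean block passes:    ", verify(data, guard))
print("corrupted block passes:", verify(bytes(corrupted), guard))
```

Without the guard travelling with the data, the flipped bit would simply be stored and served back as if nothing had happened.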

Overall, there’s a tendency to use NVMe-oF to gain performance at the expense of everything else that is expected of a storage system. That includes compromises in system availability, data durability, storage system capabilities and the Data Services we have come to expect and love as standard.

NVMe and NVMe-oF are promising an order of magnitude speed improvement to current storage systems and while these are early days for these technologies, there is a strong driver for their early adoption in the Data Center.

Some vendors are already taking the lead on NVMe storage systems, which are running out there in production already.

Still, potential users of these technologies have to read between the lines in order to get the 10x improvements that this technology offers on paper.

There are two storage players out there that are doing a pretty good job of implementing NVMe-oF by means of RDMA over 100GbE Converged Ethernet (RoCE).

They both use virtualization strategies and true virtualization methodologies to overcome the point to point issues NVMe-oF brings to the table.

This means that the available storage space on each of the NVMe SSDs can be placed into a vast pool of resources. By placing the storage controllers into virtual machines running the Flash storage operating system services from a normal storage array, we can expect to enjoy all the storage related Data Services they serve up, such as Snapshots, Clones, inline deduplication, inline compression and Encryption.

HA becomes easier to implement because the RDMA over 100GbE transport network (RoCE) eliminates the need for expensive InfiniBand links between the Virtual Storage controllers. Those links carry other HA services, such as the background Sync between Storage Controllers found in a typical HA schema.

This means you can build virtual storage controllers in HA format to exact and precise specs and add compute power as required without the expensive add-on storage hardware we are forced to use for those tasks today.

You can also further customize the Virtual All Flash Array (vAFA) to have high IOPS on smaller faster SSDs in the pool, low IOPS with bigger SSDs or build FAST highly scalable vAFA units for any job you can dream of.

The beauty of this Architecture is you just need JBOF systems plugged into the RoCE Media to scale and add CPU and IOPS power as required.

This cuts costs dramatically at scale.

Not only that but you get a choice of HA variants. Typical HA today uses two controllers and one pool of disks.

For True Active-Active HA, you need a shared-nothing architecture in which each controller has its own set of SSDs in its assigned pool of storage and syncs with its partner Active controller in the background over the RoCE network.
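As a toy model of that arrangement, the Python sketch below gives each controller its own private pool and mirrors every write to its partner. It is my own illustration, not any vendor's design: the class and method names are invented, and the sync is done inline for brevity where a real system would do it in the background over the fabric.

```python
class Controller:
    """One active controller with its own private pool of SSDs."""

    def __init__(self, name):
        self.name = name
        self.pool = {}     # logical block address -> data, owned by this controller
        self.peer = None   # partner controller, set by pair()

    def write(self, lba, data):
        """Accept a host write locally, then sync it to the partner."""
        self.pool[lba] = data
        if self.peer is not None:
            self.peer.replicate(lba, data)

    def replicate(self, lba, data):
        """Apply a write forwarded by the partner (no further forwarding)."""
        self.pool[lba] = data

    def read(self, lba):
        return self.pool[lba]


def pair(a, b):
    """Link two controllers as an active-active pair."""
    a.peer, b.peer = b, a


ctrl_a, ctrl_b = Controller("A"), Controller("B")
pair(ctrl_a, ctrl_b)

ctrl_a.write(7, b"written via A")
ctrl_b.write(9, b"written via B")

# Either controller can serve either block from its own pool.
print(ctrl_b.read(7), ctrl_a.read(9))
```

Because neither controller depends on the other's disks, losing one controller or one pool still leaves a complete copy behind the survivor.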

This all works much better when this is all virtualized.

AccelStor Solutions are one such Vendor of this emerging technology nirvana, and their award winning FlexiSuite™ software, which includes the award winning FlexiRemap™ RAID replacement technology, is worth waxing lyrical about for a few moments.

RAID was designed for spinning magnetic disks. The term itself dates from the late 1980s, when arrays of small disks were pitched as an alternative to the big single disks of the mainframe era.

RAID is actually one of the first implementations of storage virtualization we ever used at scale.

Put 5 disks into a storage pool using RAID 5 Data protection, present the usable space from each disk as a LUN or volume, and you have 5 disks working as one logical unit that a server can read and write to.

This was our first level of virtualization to come into common use.
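For anyone who has not looked under the hood, the 5-disk RAID 5 example boils down to XOR parity. The Python sketch below uses one stripe of four tiny data blocks plus one parity block; the block contents and function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_fraction(disks):
    """RAID 5 gives up one disk's worth of capacity to parity."""
    return (disks - 1) / disks


# One stripe across 5 disks: 4 data blocks and 1 parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Lose any one disk and its block can be rebuilt from the survivors.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)

print("rebuilt block matches:", rebuilt == data[lost])
print("usable capacity with 5 disks:", usable_fraction(5))
```

The server never sees any of this; it just sees one volume with four disks' worth of space.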

Try that same RAID schema on flash memory in SSDs and you have some interesting problems to deal with!!

First off, not all SSDs out there are the same class. There are myriad types of SSD using MLC, TLC and 3D NAND memory technologies, and many classes of memory device with components of different quality.

Some are enterprise class and some are inexpensive home use type memory chips with different performance levels and life expectancy.

In a typical RAID 5 write cycle, RAID actually writes to the disk media several times. SSDs do not like this, as flash can only take a finite number of writes in its lifespan.

Also, the Parity segment of each SSD will be subject to many writes and those segments will wear out rapidly so there will be no even wear across the SSD set.

Some more expensive memory can be written to many more times than the cheaper class of device, and you have to factor this into the durability and life expectancy of the components, which unsurprisingly come with hefty cost considerations.

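Also, where parity lives on a dedicated drive, as in RAID 4, that SSD has much more data written to it than any other SSD in the set. RAID 5 rotates parity across the drives to spread that load, but every small write still costs an extra parity write somewhere in the set.

This means RAID adds wear, and that wear will not necessarily be even across all the SSDs in the RAID set. That is just a fact when using RAID on SSD media.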
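The extra writes are easy to count. The Python sketch below models the classic RAID 5 small-write path, where changing one data block means reading the old data and old parity, then writing the new data and new parity. The class and counter names are my own, and it ignores caching and full-stripe writes, which real arrays use to soften the blow.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class Raid5Stripe:
    """One stripe: several data blocks plus their XOR parity block."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = reduce(xor, self.data)
        self.device_reads = 0
        self.device_writes = 0

    def small_write(self, index, new_block):
        """Read-modify-write update of a single data block."""
        old_block = self.data[index]
        self.device_reads += 1          # read old data
        old_parity = self.parity
        self.device_reads += 1          # read old parity
        self.data[index] = new_block
        self.device_writes += 1         # write new data
        self.parity = xor(xor(old_parity, old_block), new_block)
        self.device_writes += 1         # write new parity


stripe = Raid5Stripe([b"AAAA", b"BBBB", b"CCCC", b"DDDD"])
stripe.small_write(1, b"bbbb")

print("device reads: ", stripe.device_reads)
print("device writes:", stripe.device_writes)
```

One host write turns into two writes to flash, which is exactly the kind of amplification SSD endurance ratings do not appreciate.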

Something needs to replace RAID that turns random data WRITES into one sequential write event while offering the same or better level of Data Protection.
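To show the general idea of turning random writes into sequential ones, here is a generic append-only log with a remapping table, sketched in Python. To be clear, this is a textbook log-structured approach and not a description of how any vendor's product works; it also leaves out data protection and garbage collection entirely. All names are my own.

```python
class AppendLog:
    """Random logical writes land at the next sequential physical slot."""

    def __init__(self):
        self.log = []        # physical blocks, only ever appended to
        self.remap = {}      # logical block address -> index in the log

    def write(self, lba, data):
        """Append the data and point the logical address at the new slot."""
        physical = len(self.log)
        self.log.append(data)
        self.remap[lba] = physical
        return physical

    def read(self, lba):
        return self.log[self.remap[lba]]


store = AppendLog()

# The host writes to scattered addresses, and rewrites one of them.
placements = [store.write(lba, data) for lba, data in [
    (9041, b"first"),
    (17, b"second"),
    (530000, b"third"),
    (17, b"second, updated"),
]]

print("physical slots used, in order:", placements)
print("latest data at LBA 17:", store.read(17))
```

However scattered the logical addresses are, the media only ever sees writes landing one after another.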

Pure Storage do this with RAID 3D, but AccelStor's FlexiRemap™ does it with zero performance loss!!

In fact, their SATA SSD arrays can achieve a staggering 732,000 Random WRITE IOPS @ 4K and over 1.1 Million READ IOPS, courtesy of the technology magic they call FlexiRemap™.
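To put those IOPS figures into throughput terms, here is a quick conversion in Python. It assumes 4K means 4,096 bytes and reports decimal gigabytes per second; the function name is my own.

```python
BLOCK_BYTES = 4096   # assuming "4K" means 4 KiB


def iops_to_gb_per_s(iops, block_bytes=BLOCK_BYTES):
    """Convert an IOPS figure at a fixed block size to GB/s (decimal)."""
    return iops * block_bytes / 1e9


write_gbps = iops_to_gb_per_s(732_000)
read_gbps = iops_to_gb_per_s(1_100_000)

print(f"732,000 write IOPS @ 4K is about {write_gbps:.2f} GB/s")
print(f"1.1 million read IOPS @ 4K is about {read_gbps:.2f} GB/s")
```

That is several gigabytes per second of small random I/O, which is why the host connection matters as much as the drives.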

It turns out that any storage vendor using RAID on Flash will sacrifice a lot of performance to RAID being well, RAID....

I tested this myself with unbelieving eyes using a Load Dynamix testing rig in 2017. Load Dynamix are now a part of Virtual Instruments by the way and the product is now called Workload Wisdom.

If you are in the business of measuring real Storage performance this is THE tool for the job!!

Anyway, as stated, I verified for myself that the AccelStor P710 platform achieves a sustained 732,000 WRITE IOPS. From SATA SSD!!

A great many traditional storage gurus out there have pointed out that there is no real need for this level of performance today, but this is a short sighted view of what can be done, and it ignores the systems that are changing to adapt to this new performance paradigm right under our very noses, right now.

Look at what Hyperconverged and AI are doing to the Data Center environment, with high performance Databases and the complex AI machines that are being built today.

This has placed us at the level of a brand new industrial revolution, akin to the one that brought the first industrialized factories and railways and swept mankind into a new era well over two hundred years ago.

The only difference is this new technology revolution will impact us all and it will change the way we live our lives in a very dramatic way.

I am not here to solve the challenging and complex socio-economic puzzle this will present us with, however. I am here for the technology that will bring it to your IT doorstep.

The next step to this new wave revolution is going to be via NVMe-oF with all of the services and performance we require precisely tailored to do the job at hand.

Pooling Virtual Machines with the storage technology software running on the Virtual Controllers will deliver us into the era of a virtual environment with guaranteed service, performance and reliability features that are also perfectly isolated between the organizations using them.

This brings guaranteed security to the table as well.

NVMe-oF is a game changer. The potential and the possibilities at the right price are steeped in feature combinations and performance variants that will deliver a pool of resources that can be carved up by millions of companies, all running in the same massive data center at guaranteed performance levels us geeks only dream about at night.

The machine timeline since we first used rocks to crack nuts cannot fit on the page so I did one from 1784 instead....

Welcome to the rise of the machines peeples of the Earth!

chaanbeard.com, IT Tech-Talk Blog focusing on AMD and Nutanix with Cloudy things