iSCSI before and after

When the Sun Storage 7000 series first launched, iSCSI was provided by a
user-land process: iscsitgtd. While it worked, performance wasn't as good as
the kernel-based services such as NFS. iSCSI performance was to be improved by
making it a kernel service as part of the COMSTAR project. This was delivered
in the 7000 series in the 2009 Q3 software release, and iSCSI performance was
indeed greatly improved.

Here I'll show iSCSI performance test results which highlight the before and
after effects of both COMSTAR and the 7410 hardware update. It would of course
be better to split the results showing the COMSTAR and hardware update effects
separately - but I only have the time to blog this right now (and I wouldn't
be blogging anything without Joel's help to collect results.) I would guess
that the hardware update is responsible for up to a 2x improvement, as it was
for NFS. The rest is COMSTAR.

OLD: Sun Storage 7410 (Barcelona, 2008 Q4 software)

To gather this data I powered up an original 7410 with the original software
release from the product launch.

Testing max throughput

To do this I created 20 luns of 100 Gbytes each, which should entirely cache
on the 7410. The point of this test is not to measure disk performance, but
instead the performance of the iSCSI codepath. To optimize for max throughput,
each lun used a 128 Kbyte block size. Since iSCSI is implemented as a
user-land daemon in this release, performance is expected to suffer due to the
extra work of copying data between the kernel and the iscsitgtd process.

I stepped up clients, luns, and thread count to find a maximum where adding
more stopped improving throughput. I reached this with only 8 clients,
2 sequential read threads per client (128 Kbyte reads), and 1 lun per
client.
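The original post doesn't name the load-generation tool, but the per-client workload amounts to a couple of threads each streaming large sequential reads. A minimal sketch in Python (using a scratch file as a stand-in for the iSCSI lun's block device, whose path would vary):

```python
import os
import tempfile
import threading

BS = 128 * 1024  # 128 Kbyte reads, matching the throughput test

def seq_read(path, bs=BS):
    """One client thread: sequential reads until EOF; returns bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bs)
            if not buf:
                break
            total += len(buf)
    return total

# Stand-in target: a scratch file. On a real client this would be the
# block device backed by the iSCSI lun.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (4 * BS))
    target = tmp.name

results = []
threads = [threading.Thread(target=lambda: results.append(seq_read(target)))
           for _ in range(2)]  # 2 sequential read threads per client
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both threads read the full 512 Kbytes: [524288, 524288]
os.unlink(target)
```

Scaling this across 8 clients against cached luns keeps the server's iSCSI code path, not the disks, as the component under test.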

Network throughput was 311 Mbytes/sec. Actual iSCSI payload throughput will
be a little lower due to network and iSCSI headers (jumbo frames were used, so
the headers shouldn't add too much.)

This 311 Mbytes/sec result is the "bad" result - before COMSTAR. However,
is it really that bad? How many people are still on 1 Gbit Ethernet? 311
Mbytes/sec is plenty to saturate a 1 GbE link, which may be all you have
connected.
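For scale, the raw line rate of a 1 GbE link works out to well under half of that result:

```python
# Raw 1 GbE line rate, converted to Mbytes/sec (1 Mbyte = 1024 * 1024 bytes)
gbe_mbytes = 1e9 / 8 / (1024 * 1024)
print(round(gbe_mbytes))           # ~119 Mbytes/sec, before header overhead
print(round(311 / gbe_mbytes, 1))  # the old result fills ~2.6 such links
```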

The CPU utilization for this test was 69%, suggesting that more headroom may
be available. I wasn't able to consume it with my client test farm or
workload.

Testing max IOPS

To test max IOPS, I repeated a similar test but used a 512 byte read size
for the client threads instead of 128 Kbyte. This time 10 threads per client
were run, on 8 clients with 1 lun per client.

iSCSI IOPS were 37,056. For IOPS I only count served reads (and writes):
the I/O part of IOPS.
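As a sanity check on what this result means: at 512 bytes per read, that many IOPS moves only about 18 Mbytes/sec of payload, which is why a max-IOPS test stresses the per-operation cost of the iSCSI code path rather than bandwidth:

```python
iops = 37056          # measured 512 byte read IOPS
payload = iops * 512  # bytes/sec of served payload
print(round(payload / (1024 * 1024), 1))  # ~18.1 Mbytes/sec
```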

NEW: Sun Storage 7410 (Istanbul, 2009 Q3+ software)

This isn't quite the 2009 Q3 release; it's our current development version.
While it may be a little faster or slower than the actual 2009 Q3 release,
it still reflects the magnitude of the improvement that COMSTAR and the
Istanbul CPUs have made.

Testing max throughput

Over a 60 second average, network throughput was 2.75 Gbytes/sec -
impressive, and close to the 3.06 Gbytes/sec I measured for NFSv3 on the
same software and hardware. For this software version, Analytics included
iSCSI payload bytes, which showed it actually moved 2.70 Gbytes/sec (the
extra 0.05 Gbytes/sec was iSCSI and TCP/IP headers.)
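That gap between wire and payload throughput is small, as expected with jumbo frames; the header overhead works out to under 2%:

```python
wire, payload = 2.75, 2.70  # Gbytes/sec: network vs iSCSI payload
overhead = (wire - payload) / wire
print(round(overhead * 100, 1))  # ~1.8% of wire throughput was headers
```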

That's over 9x faster throughput. I would guess that up to 2x of
this is due to the Istanbul CPUs, which still leaves over 5x due to
COMSTAR.

Since this version of iSCSI could handle much more load, 47 clients were
used with 5 luns and 5 threads per client. Four 10 GbE ports on the 7410 were configured
to serve the data.

Testing max IOPS

The average here was over 318,000 read IOPS - over 8x the original
iSCSI performance.
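Both improvement factors check out with quick arithmetic (taking 1 Gbyte = 1024 Mbytes, as elsewhere in these results):

```python
tput_ratio = 2.75 * 1024 / 311  # 2.75 Gbytes/sec new vs 311 Mbytes/sec old
iops_ratio = 318000 / 37056     # new vs old read IOPS
print(round(tput_ratio, 1), round(iops_ratio, 1))  # ~9.1 and ~8.6
```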

Configuration

Both 7410s were max head node configurations with max DRAM (128 and 256 Gbytes),
and max CPUs (4 sockets of Barcelona, and 4 sockets of Istanbul.) Tests were
performed over 10 GbE ports from two cards. This is the performance from single
head nodes - they were not part of a cluster.

Writes

The above tests showed iSCSI read performance. Writes are processed
differently: they are made synchronous unless the write cache enabled property
is set on the lun (the setting is in Shares->lun->Protocols.) The description for this setting is:

This setting controls whether the LUN caches writes. With this setting
off, all writes are synchronous and if no log device is available, write
performance suffers significantly. Turning this setting on can therefore
dramatically improve write performance, but can also result in data corruption
on unexpected shutdown unless the client application understands the semantics
of a volatile write cache and properly flushes the cache when necessary.
Consult your client application documentation before turning this on.

For this reason, it's recommended to use Logzillas with iSCSI to improve
write performance instead.
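On the client side, "understanding the semantics of a volatile write cache" means the application forces data to stable storage at the points where it must survive an unexpected shutdown, typically via fsync(2). A minimal sketch (the scratch-file path is a stand-in for a file on an iSCSI-backed filesystem):

```python
import os
import tempfile

def durable_write(path, data):
    """Write data and force it through volatile caches before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's user-land buffer
        os.fsync(f.fileno())   # ask the OS to commit the data to stable storage

# Demonstration against a scratch file; on a real client this would be a
# file on the iSCSI lun, flushed at each transaction boundary.
path = os.path.join(tempfile.gettempdir(), "iscsi_demo.dat")
durable_write(path, b"transaction record\n")
print(os.path.getsize(path))  # 19
os.remove(path)
```

Whether fsync also flushes the lun's write cache depends on the client OS and driver honoring cache-flush commands, which is exactly the behavior the appliance's warning asks you to verify.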

Conclusion

As a kernel service, iSCSI performance is similar to that of other kernel-based protocols on the current 7410: for cached reads it reaches 2.70 Gbytes/sec throughput, and over 300,000 IOPS.