tag:blogger.com,1999:blog-11113282813121161712018-03-05T15:31:55.402-08:00Ramblings from Richard's RanchRamblings and views from my ranch in Southern California.Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.comBlogger57125tag:blogger.com,1999:blog-1111328281312116171.post-8694066468991961922017-06-18T19:00:00.002-07:002017-06-18T19:00:54.017-07:00Interrupts Visualized using CPU SwimlanesWho is running what, where? This is one of the questions that keeps performance analysts awake at night. When optimizing system performance, running the right code at the right time on the right CPU can make a big difference. But how can we see this information?&nbsp;<a href="http://blog.richardelling.com/2017/04/cpu-swimlanes-are-mashup-of-percentage.html" target="_blank">CPU swimlanes</a> are a useful way to explore and understand deep kernel behaviors such as scheduling and interrupt handling. My <a href="http://blog.richardelling.com/2017/04/cpu-swimlanes-are-mashup-of-percentage.html" target="_blank">previous post on CPU swimlanes</a> showed a simple example of a MacOS laptop with 4 processors. In this post we'll go into more detail and explore how the Linux interrupt balancer, <a href="https://github.com/Irqbalance/irqbalance" target="_blank">irqbalance</a>, works to spread the interrupt load over many processors. This analysis is more intuitive and visually recognizable than hundreds of lines of <a href="https://linux.die.net/man/1/mpstat" target="_blank">mpstat</a> data scrolling across a terminal window.<br /><br /><h2>Background</h2><br />The system under test (SUT) is running a Linux kernel and has 32 processors as seen by the OS. The kernel is NUMA-aware, so it also knows that there are two sockets, each with an 8-core Intel CPU. Each core has two hyperthreads. The workload is a heavy server workload with millions of I/Os flowing between disks and network.<br />For this workload and environment we make changes to the dashboard to better represent the server's tasks:<br /><br /><ul><li>Per-CPU metrics for Linux are more comprehensive than those for MacOS:</li><ul><li><i>user</i> time accounts for user programs running; we expect very little user time on this SUT</li><li><i>idle</i> time occurs when the CPU has no code to run</li><li><i>iowait</i>&nbsp;occurs when there is no code to run, but there is an outstanding I/O in progress</li><li><i>irq</i>&nbsp;refers to the time the kernel is handling hardware interrupts</li><li><i>nice</i>&nbsp;refers to the time spent running user programs at nice priority</li><li><i>softirq</i>&nbsp;refers to the time the kernel is handling software interrupts</li><li><i>steal</i> is the time spent in involuntary wait for a virtual CPU while a hypervisor is servicing another virtual CPU</li><li><i>system</i> is time spent running kernel code that is not servicing hardware or software interrupts</li></ul><li>For this SUT, we expect to be running our kernel code and this isn't a bad thing. System time is colored as a different shade of green than user time. This satisfies the tenet that green is good, red is bad.</li><li>Data is collected in <a href="https://www.influxdata.com/" target="_blank">influxdb</a> using <a href="https://www.influxdata.com/" target="_blank">telegraf's</a> CPU agent without modification</li><li>Detailed CPU data is broken out by socket</li><li>The relationship between two hyperthreads that share a core is consistent for this SUT, but not immediately obvious to the casual observer. 
This relationship changes for different processor families and is not recorded in the telemetry stream, by default. To illustrate the shared core, the graph is annotated.</li></ul><h2>The Experiment</h2>The goal of the experiment is to determine the optimal assignment of interrupts and services to the CPUs and whether irqbalance can handle the task automatically. There are five timestamps where irqbalance is requested to re-balance the system, with a settling time of approximately two minutes. Though not displayed here, the total system throughput increases in proportion to the amount of system time used. So a good result is when we have more system time and thus more transactions being processed.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-7fDo0rvedP8/WUcepELYCiI/AAAAAAAAAZA/aXtb2fqbfus9fBLgYUwXoqM0pOP13FGewCLcBGAs/s1600/CPU-swimlanes-irq.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="814" data-original-width="749" height="640" src="https://1.bp.blogspot.com/-7fDo0rvedP8/WUcepELYCiI/AAAAAAAAAZA/aXtb2fqbfus9fBLgYUwXoqM0pOP13FGewCLcBGAs/s640/CPU-swimlanes-irq.png" width="587" /></a></div><br /><h2>Analysis</h2><div>The results clearly show how observing the overall system CPU usage is inadequate for analyzing use of the kernel's resources.&nbsp;</div><div><ol><li>Workload is started and it is immediately obvious that cpu21 is completely consumed.</li><li>First irq rebalancing shifted some work from socket 0 to socket 1 CPUs, but poor cpu21 is still getting hammered and cpu7 is now also being hammered.</li><li>Second irq rebalancing reduced the load on cpu7 and cpu21, with visibly better all-around throughput.</li><li>Third irq rebalancing looks fine: the system time usage is spreading well and no CPU is getting hammered by system time.</li><li>Fourth irq rebalance confirms the SUT has an optimal balance.</li></ol></div><div>Many programs, such as iostat or vmstat, show summary CPU statistics. They try to describe in numbers what is shown in the top-most summary of all CPUs. It is not surprising that something can be consuming 100% of one CPU while many others are idle. For this 32-CPU system, the summary data for this condition would show only about 3% used by softirq, while all of the softirq usage is confined to cpu11. In the detailed CPU swimlanes, it is immediately obvious that cpu11 is getting hammered. Fear not! We know what this is and what we can do about it. Interestingly, cpu27 shares a core with cpu11 and irqbalance does not try to schedule more work there. IMHO this is a deficiency in irqbalance, but I can only say that with deep knowledge of what is running on cpu11 vs the rest of the test workload. In any case, there is significant idle time on almost all of the other CPUs and subsequent experiments can show if more work gets scheduled to cpu27 as the others become busier.</div><div><br /></div><div>In this test, irqbalance did a respectable job adjusting the system to deliver better performance. However, that is not always the case. We do see instances where irqbalance de-tunes a system for a short while and then rebalances nicely. As the system nears saturation, any de-tuning can dramatically affect overall performance. 
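<br /><br />If you want to spot-check the raw per-CPU data that feeds these swimlanes, the same telegraf measurements can be queried directly. A minimal sketch, assuming the default telegraf CPU plugin field names and a database named "telegraf" (adjust both to match your setup):<br /><br /><span style="font-family: Courier New, Courier, monospace;"># influx -database telegraf -execute 'SELECT mean("usage_softirq"), mean("usage_system") FROM "cpu" WHERE time &gt; now() - 10m GROUP BY "cpu", time(1m)'</span><br /><br />A single CPU that is saturated by interrupt handling jumps out of this output just as it does in the swimlanes, only with more squinting.<br />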
With detailed CPU swimlanes the balance can be quickly understood and correlated to other changes in system behavior.</div><div><br /></div><h2>Conclusion</h2>CPU swimlanes are an excellent approach to understanding how work is spread across multiple CPUs in a system. If you love mpstat and systems with many CPUs, you'll love detailed CPU swimlanes.<br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-41523897929848824342017-04-08T16:21:00.000-07:002017-04-08T16:27:35.395-07:00CPU Swimlanes<div class="separator" style="clear: both; text-align: center;"><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif; margin-left: 1em; margin-right: 1em;"><a href="https://4.bp.blogspot.com/-itHc7P9R07s/WOlo1cQBZNI/AAAAAAAAAW8/7c4cpVVXRkQWay1a5-OTcArkZSBDOZVowCLcB/s1600/cpu-swimlanes-mac.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="297" src="https://4.bp.blogspot.com/-itHc7P9R07s/WOlo1cQBZNI/AAAAAAAAAW8/7c4cpVVXRkQWay1a5-OTcArkZSBDOZVowCLcB/s640/cpu-swimlanes-mac.png" width="640" /></a></span></div><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;">CPU swimlanes are a mashup of a percentage plots and <a href="https://en.wikipedia.org/wiki/Swim_lane">swimlane diagram.</a> The goal is to graphically show the usage of each processor and its temporal relation to usage of other processors. The swimlane diagram is often used in process flow diagrams to show functional or organizational processes and how they operate in parallel, when possible. The result is a visualization of the data available in tools such as the command-line oriented <a href="https://en.wikipedia.org/wiki/Mpstat">mpstat</a>, commonly found in UNIX/Linux/MacOS distros.<br /><br />Performance engineers and capacity planners often use CPU usage for systems analysis. For multiprocessor machines, summary data as often seen on CPU usage dashboard or tools can hide system bottlenecks. For example, a 10-CPU system where one CPU is 100% busy running a single-threaded application will appear to be only 10% busy in aggregate. Tools like mpstat are useful for analyzing per-processor usage, but quickly become unwieldy when there are many processors and are not well suited to show trends over time. Also, when a process is migrated to another CPU, mpstat is not well suited to correlate this movement with other temporal changes to the system. This is the perfect job for a nice dashboard.</span><br /><h2><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;">Dashboard Design</span></h2><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;">The CPU percentage plots are designed to allow the observer to quickly differentiate between "user" usage by applications versus the "system" usage by the kernel. This is accomplished by layering the user, idle, and system usage metrics from bottom to top. This is visually effective because as user usage often causes system usage. As the CPUs become busier, the idle time in the middle gets squeezed and can disappear entirely. The balance of user to system time is readily discernible.</span><br /><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;"><br />For many systems, user usage is good and follows the tenets of good things go up and to the right and good things are green. 
Similarly, system usage is overhead and grows down from the ceiling, reinforcing the tenets that bad things go down and to the right and worrisome things are amber. Idleness is a wide open blue sky.<br /><br />Another design element to the dashboard is that the per-CPU swimlanes are not encumbered by axes. This allows the dashboard to scale to dozens of CPUs without becoming cluttered by redundant text.</span><br /><div><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;"><br />A test of good dashboard design is if you can glance at the images and instantly conclude whether all is well or action is required. By choosing good color schemes and showing meaningful data consistently, a dashboard can speed systems analysis. CPU swimlanes can instantly show imbalances in CPU use in two dimensions: balance across CPUs and balance of user vs kernel resources.</span><br /><h2><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;">Scaling</span></h2><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;">For large systems, it can be useful to modify this basic dashboard:</span><br /><ul><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;"><li>reorganize the per-CPU enumeration to reflect NUMA associations</li><li>for dozens of CPUs, the row sizes can be reduced to show hundreds of CPUs on a relatively small screen</li></span></ul><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;"></span><br /><h2><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;">Sharing</span></h2><span style="font-family: &quot;arial&quot; , &quot;helvetica&quot; , sans-serif;">The CPU swimlane dashboard screenshot above is developed for MacOS systems using <a href="https://grafana.com/">grafana v4.1</a>, <a href="https://github.com/influxdata/influxdb">influxdb</a>, and the <a href="https://github.com/influxdata/telegraf">telegraf</a> metrics aggregator where the CPU usage details are available for only user, system, and idle usage.<br /><br />You can get a copy of the grafana dashboard for the CPU swimlanes above from my git repo <a href="https://github.com/richardelling/grafana-dashboards">https://github.com/richardelling/grafana-dashboards</a> Share and enjoy</span></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-57536471720122037072016-11-14T10:59:00.002-08:002016-11-14T10:59:59.581-08:00How Would You Use 10 Million IOPS?This week <a href="http://www.prnewswire.com/news-releases/newisys-and-kazan-networks-demo-nvme-over-fabric-flash-array-300361015.html" target="_blank">we proudly announced a 2u storage server that delivers 10 million IOPS @ 4k over the network</a>. If you are visiting Salt Lake City and the <a href="http://sc16.supercomputing.org/" target="_blank">Supercomputing 2016 Conference</a>, drop by the exhibition booth and take a look.<br /><br />What I find more exciting than the IOPS (10 million) or bandwidth (40 GB/sec) is the latency penalty for the network is an impressive 7 µsec over Ethernet. Yes, microseconds! Over Ethernet! Amazing!<br /><br />If you've never heard us discussing microseconds in the context of network storage, it is because disks were so slow we could only grumble on about milliseconds. To give some perspective here, a high-end SAS SSD have response times on the order of 50 - 100 µsec @ 4k. 
If you run "iostat -x" to see latency on a typical Unix/Linux/OSX distro, you only get 10 µsec resolution, today. This is truly an breakthrough in enabling technology for building large, scalable, and fast computing solutions.<br /><br />So, how will you use 10 million IOPS?<br /><br />Kudos to the <a href="http://www.newisys.com/" target="_blank">Newisys</a>&nbsp;/ <a href="http://www.sanmina.com/" target="_blank">Sanmina</a>, and <a href="http://www.kazan-networks.com/" target="_blank">Kazan Networks</a> team for a job well done! &nbsp;Very impressive!<br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-68631790245950969672016-08-28T15:58:00.000-07:002016-08-28T15:58:18.191-07:00On ZFS Copies<div class="p1"><span class="s1">JRS Systems blogs about <a href="http://open-zfs.org/" target="_blank">ZFS</a> copies at&nbsp;<a href="http://jrs-s.net/2016/05/09/testing-copies-equals-n-resiliency/">http://jrs-s.net/2016/05/09/testing-copies-equals-n-resiliency/</a></span></div><div class="p1"><span class="s1">I tried to reply there, but the website wouldn't accept my reply, complaining about cookies and contact the site administrator. So I'll reply here. Internet Hurrah!</span></div><div class="p1"><span class="s1"><br /></span></div><div class="p1"><span class="s1">For a single device pool, copies=2 places the redundant copies approximately 1/3 (copies=2) and 2/3 (copies=3) into the LBA range of the single device. Assuming devices allocate with some diversity by LBA, this allows recovery from a range of LBA failures. For HDDs, think head-contacts-media type of failures. For a random failure case, you get random failures.</span></div><div class="p2"><span class="s1"></span><br /></div><div class="p1"><span class="s1">By contrast, if the pool has two top-level vdevs, such as a simple 2-drive stripe, then the copies are placed on separate drives, if possible. In this case, copies=2|3 provides protection more similar to mirroring, where the copies are on diverse devices. It is not identical to mirroring, because the pool itself depends on all top-level vdevs functioning. On the other hand, you can have different sized devices, with some data diversely stored.</span></div><div class="p1"><span class="s1"><br /></span></div><div class="p1"><span class="s1"></span></div><div class="p1"><span class="s1">In summary, copies is useful for specifying different redundancy policies for datasets, but it is not a replacement for proper mirroring or raidz. This is why <a href="https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection">https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection</a>&nbsp;(apologies, in the acquisition, the new regime blew the image links) and&nbsp;<a href="http://jrs-s.net/2016/05/02/zfs-copies-equals-n/">http://jrs-s.net/2016/05/02/zfs-copies-equals-n/</a></span></div><div><br /></div><div class="p1">For ZFS enthusiasts, you can see where the copies of blocks of your data are allocated using zdb's dataset option to show the data virtual addresses (DVAs) assigned to each copy. Here's how to do it.</div><div class="p1"><br /></div><div class="p1">1. First, create a test dataset with copies=2 and create a file with enough data to be interesting. 
Since we know the default recordsize is 128k, we'll write 2x128k or two ZFS blocks in size.</div><div class="p1"><br /></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"># <b>zfs create -o copies=2 zwimming/copies-example</b></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"># <b>dd if=/dev/urandom of=/zwimming/copies-example/data bs=128k count=2</b></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">2+0 records in</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">2+0 records out</span></span></div><div class="p1"> </div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">262144 bytes (262 kB, 256 KiB) copied, 0.0243667 s, 10.8 MB/s</span></span></div><div class="p2"><br /><span class="s1"></span></div><div class="p2">2. Locate the object number of the file, cleverly the same as the inode number, 7 in this case.</div><div class="p2"><br /></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"># <b>ls -li /zwimming/copies-example/data</b></span></span></div><div class="p2"> </div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">7 -rw-r--r-- 1 root root 262144 Aug 28 15:20 /zwimming/copies-example/data</span></span></div><div class="p1"><span class="s1"><br /></span></div><div class="p1"><span class="s1">3. Ask zdb to show the dataset information with details about the block allocations for object 7 in dataset zwimming/copies-example</span></div><div class="p1"><span class="s1"><br /></span></div><div class="p1"><span class="s1">&nbsp;</span># <b>zdb -dddddd zwimming/copies-example 7</b></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">Dataset zwimming/copies-example [ZPL], ID 49, cr_txg 287, 537K, 7 objects, rootbp DVA[0]=<0:8000:200> DVA[1]=<0:300c200:200> DVA[2]=<0:600bc00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique triple size=800L/200P birth=291L/291P fill=7 cksum=dbfb763cf:52582f00d6d:fed1f02d1ce1:21f172427f4d22</0:600bc00:200></0:300c200:200></0:8000:200></span></span></div><div class="p2"><span style="font-family: Courier New, Courier, monospace;"><span class="s1"></span><br /></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">&nbsp; &nbsp; Object&nbsp; lvl &nbsp; iblk &nbsp; dblk&nbsp; dsize&nbsp; lsize &nbsp; %full&nbsp; type</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; 7&nbsp; &nbsp; 2&nbsp; &nbsp; 16K &nbsp; 128K &nbsp; 514K &nbsp; 256K&nbsp; 100.00&nbsp; ZFS plain file (K=inherit) (Z=inherit)</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><br /></span></span></div><div class="p1">Here we verify the logical size (lsize) is 256k and the data block size (dsize) is, nominally, 2x the logical block size. 
Recall we wrote random, non-compressible data, so no compression tricks here.</div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><br /></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 168 &nbsp; bonus&nbsp; System attributes</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>dnode flags: USED_BYTES USERUSED_ACCOUNTED&nbsp;</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>dnode maxblkid: 1</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>path<span class="Apple-tab-span"> </span>/data</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><br /></span></span></div><div class="p1">Verify the object 7 is our file named "data"</div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><br /></span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>uid &nbsp; &nbsp; 0</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>gid &nbsp; &nbsp; 0</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>atime<span class="Apple-tab-span"> </span>Sun Aug 28 15:20:43 2016</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>mtime<span class="Apple-tab-span"> </span>Sun Aug 28 15:20:43 2016</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>ctime<span class="Apple-tab-span"> </span>Sun Aug 28 15:20:43 2016</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>crtime<span class="Apple-tab-span"> </span>Sun Aug 28 15:20:43 2016</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>gen<span class="Apple-tab-span"> </span>291</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>mode<span class="Apple-tab-span"> </span>100644</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>size<span class="Apple-tab-span"> </span>262144</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>parent<span class="Apple-tab-span"> </span>4</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>links<span class="Apple-tab-span"> </span>1</span></span></div><div 
class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"> </span>pflags<span class="Apple-tab-span"> </span>40800000004</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">Indirect blocks:</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0 L1&nbsp; 0:be00:200 0:300aa00:200 0:6003a00:200 4000L/200P F=2 B=291/291</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0&nbsp; L0 0:24000:20000 0:3020800:20000 20000L/20000P F=1 B=291/291</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 20000&nbsp; L0 0:44000:20000 0:3040800:20000 20000L/20000P F=1 B=291/291</span></span></div><div class="p1"><span class="s1"><br /></span></div><div class="p1"><span class="s1">Here's the meat of the example. This file has one level-1 (L1) indirect block (metadata), with 3 DVAs. Why 3? Because, by default, the number of copies of the metadata is 1+copies, up to 3. With copies=2, the number of metadata copies=3, hence the three DVAs. These DVAs consume 0x200 physical bytes each, or 1.5k. This explains why the accounting for the dsize above is 514k rather than 512k.</span></div><div class="p1"><span class="s1"><br /></span></div><div class="p1"><span class="s1">Each DVA is a tuple of vdev-index:offset:size. Thus a DVA of 0:be00:200 is 512 bytes allocated to vdev-0 (there is only one vdev in this pool) at offset 0xbe00. You can see that the 3 DVAs are offset further into the vdev at 0x300aa00 and 0x6003a00. If this pool had more than one vdev, and there was enough space on them, then we expect the diversity to be across vdevs.</span></div><div class="p1"><span class="s1"><br /></span></div><div class="p1">Looking at the two level-0 (L0) data blocks, we see our actual data. Each block is 128K (0x20000) and the logical (20000L) size is the same as the physical size (20000P) showing no compression. Again we see all blocks allocated to vdev-0 and the offset for the second copy is&nbsp;0x2FFC800 or&nbsp;50,317,312 sectors away (here, sectors=512 bytes).</div><div class="p1"><br /></div><div class="p1">Referring back to JRS System's test, randomly corrupting data will give predictable results. Simply calculate the probability of corrupting two L0 blocks of a given size for a given LBA range.</div><div class="p1"><br /></div><div class="p1">But storage doesn't tend to fail randomly, failures tend to be spatially clustered. Thus copies is a reasonable use of redundancy techniques even when the device is not redundant. 
Indeed, copies is routinely used for the precious metadata.</div><div class="p1"><br /></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-51033965124920693862016-07-15T20:02:00.000-07:002016-07-15T20:02:31.435-07:00Happy 10th Birthday Snapshot!On this date, 10 years ago, I made my first ZFS snapshot.<br /><br /><blockquote class="tr_bq"> <div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;"># zfs get creation stuff/home@20060715&nbsp;</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; PROPERTY&nbsp; VALUE&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; SOURCE</span></span></div><div class="p1"><span class="s1"><span style="font-family: Courier New, Courier, monospace;">stuff/home@20060715&nbsp; creation&nbsp; Sat Jul 15 20:20 2006&nbsp; -</span></span></div></blockquote>The stuff pool itself has changed dramatically over the decade. Originally, it was a spare 40G IDE drive I had laying around the hamshack. &nbsp;Today it is a mirror of 4T/6T drives from different vendors, for diversity. Over the years the pool has been upgraded, expanded, and had its drives replaced numerous times. This is a true testament to the long-term planning and management built into <a href="http://open-zfs.org/wiki/Main_Page" target="_blank">OpenZFS</a>.<br /><br />The original size of the stuff/home file system was 9GB. Today, it is 1.6TB, which I'll blame mostly on backups of media files. Ten years ago I had a 1.6M pixel camera, today 16M pixel plus HD video and phones.<br /><br />What was I working on back then? &nbsp;SATA controllers, Sun X4500, <a href="http://www.roars.net/" target="_blank">ROARS</a> Field Day, ...<br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-56995298484342830202016-06-01T12:35:00.000-07:002016-06-01T12:35:29.106-07:00As we're getting ready for summer at the ranch...<div class="separator" style="clear: both; text-align: center;">Lion's tails reach for the sky!</div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-qg1NOqK5OYw/V084rkOyOOI/AAAAAAAAAVQ/OZW9MJTcZa4keoie3QisQou6KBd0NPYCACLcB/s1600/lion-tails.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://1.bp.blogspot.com/-qg1NOqK5OYw/V084rkOyOOI/AAAAAAAAAVQ/OZW9MJTcZa4keoie3QisQou6KBd0NPYCACLcB/s400/lion-tails.jpg" width="300" /></a></div><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-19478768017641993342016-05-03T11:24:00.000-07:002016-05-03T11:24:42.847-07:00Observing Failover of Busy PoolsWhile looking on failover tests under load, we can easily see the system-level effects of failover in a single chart.<br /><br />But first, some background. At <a href="http://www.intermodaldata.com/" target="_blank">InterModal Data</a>, we've built an easy-to-manage system of many nodes that can provide scalable NFS and iSCSI shares in the Petabytes range. This software defined storage system scales nicely with great hardware, such as the <a href="http://www.hpe.com/" target="_blank">HPE</a> systems shown here. 
An important part of the system is the site-wide analytics where we measure many aspects of performance, usage, and environmental data. This data from both clients and servers is stored in an <a href="http://www.influxdata.com/" target="_blank">influxdb</a> time-series database for analysis.<br /><br />For this test, the NFS clients are throwing a sustained, mixed read/write, mixed size, mixed randomness workload at multiple shares on two pools (p115 and p116) served by two Data Nodes (s115 and s116). At the start of the sample, both pools are served by sut116. At 01:34 pool put115 is migrated (failed over, in cluster terminology) to sut115. The samples are taken every minute, but the actual failover time for pool p115 is 11 seconds under an initial load of 11.5k VOPS (VFS layer operations per second). After the migration, both Data Nodes are serving the workload, thus the per-pool performance increases to 16.5k VOPS. The system changes from serving an aggregate of 23k VOPS to 33k VOPS -- a valid reason for re-balancing the load.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-vTIkRwbzleY/VyjmElsdO5I/AAAAAAAAAU0/boxH8xybUpQ9ChjGgUaW5fIQj1pemua_wCLcB/s1600/failover-example.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://1.bp.blogspot.com/-vTIkRwbzleY/VyjmElsdO5I/AAAAAAAAAU0/boxH8xybUpQ9ChjGgUaW5fIQj1pemua_wCLcB/s1600/failover-example.png" /></a></div><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com2tag:blogger.com,1999:blog-1111328281312116171.post-83916270963819408992016-01-10T16:27:00.003-08:002016-01-10T16:27:31.123-08:00Observing cache hit/miss ratesAt <a href="http://www.intermodaldata.com/" target="_blank">InterModal Data</a> we build large systems with many components running in highly available configurations 24x7x365. For such systems, understanding how the components are working is very important. Our analytics system measures and records thousands of metrics from all components and makes these measurements readily available for performance analysis, capacity planning, and trouble shooting. Alas, having access to the records of hundreds of thousands of metrics is not enough, we need good, concise methods of showing that in meaningful ways. In this post, we'll look at the cache hit/miss data for a storage system and a few methods of observing the data.<br /><br />In general, caches exist to optimize the cost vs performance of a system. For storage systems in particular, we often see RAM working as cache for drives. Drives are slow, relatively inexpensive ($/bit) and persistently store data even when powered off. By contrast, RAM is fast, relatively expensive, and volatile. Component and systems designers balance the relatively high cost of RAM against the lower cost of drives while managing performance and volatility. For the large systems we design at <a href="http://www.intermodaldata.com/" target="_blank">InterModal Data</a>, the cache designs are very important to overall system scalability and performance.<br /><br />Once we have a cache in the system, we're always interested to know how well it is working. If over-designed, adding expensive caches just raises the system cost, adding little benefit. One metric often used for this analysis is the cache hit/miss ratio. Hits are good, misses are bad. But it is impossible to always have 100% hits when volatile RAM is used. 
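<br /><br />The raw counters behind such a ratio are usually easy to get at. For example, on an illumos-based storage node the ZFS ARC exposes cumulative hit and miss counters via kstat; a small sketch (the counters are cumulative since boot, so difference two samples if you want a rate rather than a lifetime ratio):<br /><br /><span style="font-family: Courier New, Courier, monospace;"># kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses | awk '/:hits/ {h=$2} /:misses/ {m=$2} END {printf("lifetime hit ratio %.1f%%\n", 100*h/(h+m))}'</span><br /><br />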
We can easily plot this over time as our workload varies.<br /><br />In the following graphs, the data backing the graph is identical. The workload varies over approximately 30 hours.<br /><br />Traditionally, this is tracked as the hit/miss ratio easily represented as a ratio.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-1m1dD785KzM/VpLoSDstEgI/AAAAAAAAATY/FHDfWUROuUs/s1600/cache-hit-miss-ratio.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-1m1dD785KzM/VpLoSDstEgI/AAAAAAAAATY/FHDfWUROuUs/s1600/cache-hit-miss-ratio.png" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Here we see lots of hits (green = good) with a few cases where the misses (red = bad) seem to rear their ugly heads. Should we be worried? We can't really tell from this graph because there is only the ratio, no magnitude. Perhaps the system is really idle and a handful of misses are measured. When presented with only hit/miss ratio, it is impractical to make any analysis, the magnitude is also needed. Many analysis systems then show you the magnitudes stacked as below.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-xb9OheH1nTo/VpLpE5KyPWI/AAAAAAAAATo/RzXnguUG6EY/s1600/cache-hit-miss-rates-stacked.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-xb9OheH1nTo/VpLpE5KyPWI/AAAAAAAAATo/RzXnguUG6EY/s1600/cache-hit-miss-rates-stacked.png" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">In this view, the number of accesses are the top of the stacked lines. Under each access point we see the ratio of hits/misses expressed as magnitude. This is better than the ratio graph. Now we can see that the magnitudes are changing from a few thousand accesses/second to approximately 170,000 accesses/second. We can also see that there were times where we saw misses, but during those times the number of accesses was relatively small. If the ratio graph caused some concern, this graph removes almost all of that concern.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">However, in this graph we also lose the ability to discern the hit/miss ratio because of the stacking. Consider if we had two or more levels of cache and wanted to see the overall cache effectiveness, we could quickly lose the details in the stacking.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Recall that hits are good (green) and misses are bad (red). Also consider that Wall Street has trained us to like graphs that go "up and to the right" (good). 
We can use this to our advantage and more easily separate the good from the bad.</div><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-F0Lp60UrJ1A/VpLoSMhQ4XI/AAAAAAAAATQ/C8phLLJuvLE/s1600/cache-hit-miss-rates.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-F0Lp60UrJ1A/VpLoSMhQ4XI/AAAAAAAAATQ/C8phLLJuvLE/s1600/cache-hit-miss-rates.png" /></a></div><br />Here we've graphed the misses as negative values. Hits go up to the top and are green (all good things). Misses go down and are red (all bad things). The number of accesses is the spread between the good and the bad, so as the spread increases, more work is being asked of the system. In this case we can still see that the cache misses are a relatively small portion of the overall access and, more importantly, occur early in time. As time progresses the hit ratio and accesses both increase for this workload. This is a much better view of the data.<br /><br />Here is another example of the SSD read cache for this same experiment. First, the hit/miss ratio graph.<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-CsROsV0Lh2Y/VpL0kJiXESI/AAAAAAAAAUA/xCKzQV1yNfs/s1600/ssd-read-cache-hits-misses-ratio.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-CsROsV0Lh2Y/VpL0kJiXESI/AAAAAAAAAUA/xCKzQV1yNfs/s1600/ssd-read-cache-hits-misses-ratio.png" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">If this was the only view you see, you should be horrified: too much red and red is bad! Don't panic.</div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-g3mJX5O1IXI/VpL0kBZWLuI/AAAAAAAAAUE/LDrzKt5-xCc/s1600/ssd-read-cache-hits-misses.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-g3mJX5O1IXI/VpL0kBZWLuI/AAAAAAAAAUE/LDrzKt5-xCc/s1600/ssd-read-cache-hits-misses.png" /></a></div><br />This graph clearly shows the story in the appropriate context. There are some misses and hits, but the overall magnitude is very low, especially when compared to the RAM cache graph of the same system. No need to panic, the SSD cache is doing its job, though it is not especially busy compared to the RAM cache.<br /><br />This method scales to multiple cache levels and systems -- very useful for the large, scalable systems we design at <a href="http://www.intermodaldata.com/" target="_blank">InterModal Data</a>.<br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-29374099593181012772015-04-15T22:32:00.001-07:002015-04-15T22:32:32.598-07:00Spring at the Ranch<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-uM1S58tKBm0/VS9GfJGo4FI/AAAAAAAAASM/z9BDPSt4eds/s1600/ranch-flower-2015.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://4.bp.blogspot.com/-uM1S58tKBm0/VS9GfJGo4FI/AAAAAAAAASM/z9BDPSt4eds/s1600/ranch-flower-2015.jpg" height="320" width="314" /></a></div>This spring is bringing new changes to the ranch. 
This vine surprised us with lavender-colored flowers. Other surprises include an early change in scenery as the grasses and weeds have already reached their summer hue. Long time readers of my blog might recognize the meaning of flowers... its all good.Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-54631820577864813642014-10-04T15:06:00.002-07:002014-10-04T15:07:04.339-07:00On ZFS use at home...The other day someone asked on the #zfs IRC (irc.freenode.net) chat about using ZFS at home. As one of the early adopters, I can say it is a great idea! I've been running ZFS at home since late 2005. The first pool of "stuff" I created has been upgraded, expanded, and had its drives replaced. In 2008 I created the latest version of "stuff" as a simple mirrored pair of HDDs. The prior version of "stuff" was transferred to the 2008 pool which is still in use. Therefore I do not have the actual creation date of the original "stuff," but since I used ZFS send/receive to transfer the datasets, I can definitively say the oldest snapshot was created in July 2006.<br /><div><br /></div><div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"># zfs get creation stuff/home@20060715</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">NAME &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; PROPERTY&nbsp; VALUE&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; SOURCE</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">stuff/home@20060715&nbsp; creation&nbsp; Sat Jul 15 20:20 2006&nbsp; -</span></div><div class="p1"><br /></div><div class="p1">I've made many snapshots since and it seems quite impressive to know that I can roll back in time over 8 years to see how my "stuff" has evolved. Let's hear it for long-lived data!</div><div class="p1"><br /></div></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-11545668148127465512014-08-05T18:13:00.000-07:002014-08-05T18:13:26.423-07:00kstat changes in illumos<span style="font-family: Verdana, sans-serif;">One of the nice changes to the kstat (kernel statistics) command in <a href="http://www.illumos.org/" target="_blank">illumos</a> is its conversion to C from perl. There were several areas in the illumos (nee OpenSolaris) code where perl had been used. But these were too few to maintain critical mass and it is difficult for interpreted runtimes to change at the pace of an OS, so keeping the two in lockstep is simply not worthwhile. So the few places where parts of illumos were written in perl have been replaced by native C implementations.</span><br /><span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">The kstat(1m) command rewritten in C was contributed by David Höppner, an active member of the illumos community. It is fast and efficient at filtering and printing kstats. By contrast, the old perl version had to start perl (an interpreter), find and load the kstat-to-perl module, and then filter and print the kstats. Internal to the kernel, kstats are stored as a name-value list (nvlist) containing strongly-typed data. Many of these are 64-bit integers. 
This poses a problem for the version of perl used (5.12) as the 64-bit support is dependent on the compiled version and illumos can be compiled for both 32 and 64 bit processors. To compensate for this mismatch, the following was added to the man page for kstat(3perl):</span><br /><br /><div class="p1"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">Several of the statistics provided by the kstat facility are stored as 64-bit integer values. Perl 5 does not yet internally support 64-bit integers, so these values are approximated in this module. There are two classes of 64-bit value to be dealt with: 64-bit intervals and times</span></div><div class="p1"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">These are the crtime and snaptime fields of all the statistics hashes, and the wtime, wlentime, wlastupdate, rtime, rlentime and rlastupdate fields of the kstat I/O statistics structures. These are measured by the kstat facility in nanoseconds, meaning that a 32-bit value would represent approximately 4 seconds. The alternative is to store the values as floating-point numbers, which offer approximately 53 bits of precision on present hardware. 64-bit intervals and timers as floating point values expressed in seconds, meaning that time-related kstats are being rounded to approximately microsecond resolution.</span></div><br /><div class="p1"><span style="color: blue; font-family: Arial, Helvetica, sans-serif;">It is not useful to store these values as 32-bit values. As noted above, floating-point values offer 53 bitsof precision. Accordingly, all 64-bit counters are stored as floating-point values.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Verdana, sans-serif;">For consumers of the kstat(1m) command output, this means that kstat I/O statistics are stored as seconds (floating point) instead of nanoseconds. 
For example, with formatting adjusted for readability:</span></div><div class="p1"><span style="font-family: Verdana, sans-serif;">Perl-based kstat(1m)</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"># kstat -p sd:0:sd0</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:class &nbsp; &nbsp; &nbsp; &nbsp;disk</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:crtime &nbsp; &nbsp; &nbsp; 1855326.99995062</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:nread &nbsp; &nbsp; &nbsp; &nbsp;380919301</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:nwritten &nbsp; &nbsp; 1984175104</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rcnt &nbsp; &nbsp; &nbsp; &nbsp; 0</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:reads &nbsp; &nbsp; &nbsp; &nbsp;18455</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rlastupdate &nbsp;2371703.49260763</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rlentime &nbsp; &nbsp; 147.154123471</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rtime &nbsp; &nbsp; &nbsp; &nbsp;49.399890683</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:snaptime &nbsp; &nbsp; 2371828.77138052</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wcnt &nbsp; &nbsp; &nbsp; &nbsp; 0</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wlastupdate &nbsp;2371703.49174494</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wlentime &nbsp; &nbsp; 2.425675727</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:writes &nbsp; &nbsp; &nbsp; 103602</span></div><div class="p1"> </div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wtime &nbsp; &nbsp; &nbsp; &nbsp;1.43643661</span></div><div class="p1"><br /></div><div class="p1"><span style="font-family: Verdana, sans-serif;">C-based kstat(1m)</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"># kstat -p sd:0:sd0</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:class<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp; &nbsp;disk</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:crtime<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp;&nbsp;244.271312204</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:nread<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp; &nbsp;25549493</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:nwritten<span class="Apple-tab-span"> </span>&nbsp; &nbsp;&nbsp;1698218496</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rcnt<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;0</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:reads<span class="Apple-tab-span"> 
</span>&nbsp; &nbsp; &nbsp; &nbsp;4043</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rlastupdate<span class="Apple-tab-span"> </span>&nbsp;104543293563241</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rlentime<span class="Apple-tab-span"> </span>&nbsp; &nbsp;&nbsp;68750036336</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:rtime<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp; &nbsp;64365048052</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:snaptime<span class="Apple-tab-span"> </span>&nbsp; &nbsp;&nbsp;104509.092582653</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wcnt<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;0</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wlastupdate<span class="Apple-tab-span"> </span>&nbsp;104543293482995</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wlentime<span class="Apple-tab-span"> </span>&nbsp; &nbsp;&nbsp;4569934990</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:writes<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp;&nbsp;289766</span></div><div class="p1"> </div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">sd:0:sd0:wtime<span class="Apple-tab-span"> </span>&nbsp; &nbsp; &nbsp; &nbsp;4551425719</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><br /></span></div><div class="p1"><span style="font-family: Verdana, sans-serif;">I find kstat(1m) output to be very convenient for historical tracking and use it often. If you do too, then be aware of this conversion.</span></div><div class="p1"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Verdana, sans-serif;">One of the best features of the new C-based kstat(1m) is the ability to get kstats as JSON. 
This is even more useful than the "parseable" output shown previously.</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"># kstat -j sd:0:sd0</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">[{</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; "module": "sd",</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; "instance": 0,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; "name": "sd0",</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; "class": "disk",</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; "type": 3,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; "snaptime": 104547.013504692,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; "data": {</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "crtime": 244.271312204,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "nread": 25549493,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "nwritten": 1700980224,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "rcnt": 0,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "reads": 4043,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "rlastupdate": 104733296813446,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "rlentime": 68901598866,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "rtime": 64513785819,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "snaptime": 104547.013504692,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "wcnt": 0,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "wlastupdate": 104733296708770,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span 
class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "wlentime": 4579560895,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "writes": 290404,</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span><span class="Apple-tab-span"></span>&nbsp; &nbsp; "wtime": 4561051625</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span"></span>&nbsp; }</span></div><div class="p1"> </div><div class="p1"><span style="font-family: Courier New, Courier, monospace;">}]</span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><br /></span></div><div class="p1"><span style="font-family: Verdana, sans-serif;">Using JSON has the added advantage of being easy to parse without making assumptions about the data. For example, did you know that some kernel modules use ':' in the kstat module, instance, name, or statistic? This makes using the parseable output a little bit tricky. The JSON output is not affected and is readily and consistently readable or storable in the many tools that support JSON.</span></div><div class="p1"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Verdana, sans-serif;">Now you can see how to take advantage of the kstat(1m) command and how it has evolved under illumos to be more friendly to building tools and taking quick measurements. Go forth and kstat!</span></div><div class="p1"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Courier New, Courier, monospace;"><br /></span></div><div id="src" style="background-color: white;"><div id="man"><div class="rs" style="margin-bottom: 1em; margin-left: 3em;"><div style="margin-top: 0.5em;"><br /></div></div></div></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-54237911018343132142014-08-04T22:37:00.001-07:002014-08-04T22:37:08.012-07:00ZIO Scheduling and Resilver Workloads<div><span style="font-family: Verdana, sans-serif;">Latency and performance problems in storage subsystems can be tricky to understand and tune. If you've ever been stuck in a traffic jam or waited in line to get into a concert, you know that queues can be frustrating to understand and trying on your patience. In modern computing systems, there are many different queues and any time we must share a constrained resource, one or more queues will magically appear in the architecture. 
Thus in order to understand the impacts of the constrained resource, you need to understand the queues.</span><br /><span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;"><b>The Good Old ZFS I/O Scheduler</b></span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">Looking at the queues</span><span style="font-family: Verdana, sans-serif;">&nbsp;the <i>old</i>&nbsp;<a href="http://www.open-zfs.org/" target="_blank"><i>ZFS</i></a> I/O (ZIO) scheduler deserves investigation first, because it still represents the vast majority of the installed base, including all Solaris installations. The <a href="http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/" target="_blank">new OpenZFS write throttle</a> changes this area of the code and deserves its own treatment in a separate post.&nbsp;</span><span style="font-family: Verdana, sans-serif;">Adding a resilvering workload into the mix shows how overall performance is affected by the queues and constrained resources.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">The ZIO scheduler sends I/O requests to the virtual devices (vdevs) based upon weighted priority. For <i>old</i> <i>ZFS</i> implementations, the priority table maps an I/O type to a value useful for sorting incoming I/O workload requests. The priority table has grown over time, but the general approach is to favor the "I need it now!" demands from the "process this when you get a chance" workloads. The table usually looked like this, with higher priority entries having lower numbers:</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;"><br /></span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">0 - Priority NOW</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">0 - Sync read</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">0 - Sync write</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">0 - Log write</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">1 - Cache fill</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">2 - Dedup Table Prefetch</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">4 - Free</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">4 - Async write</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">6 - Async read</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">10 - Resilver</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;">20 - Scrub</span></div><div style="text-align: left;"><span style="font-family: Courier New, Courier, monospace;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">Background I/Os such as resilver and scrub are scheduled with a lower priority so they do not interfere with more immediate needs, such as sync reads or writes.&nbsp;</span><span style="font-family: Verdana, sans-serif;">The 
ZIO scheduler does a good job scheduling high priority I/Os over resilver and scrub I/Os.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">But the ZIO scheduler only works for one queue -- and there are many queues in the system -- some of which don't have priority-based schedulers. And this is where I begin to grumble about queues... grumble, grumble, grumble.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">Consider the following, simplified model of the ZIO scheduler: a resilver workload, a normal (I need it now!) workload, and a pool built of mirrors.</span></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-xIjgXGSuXSw/U82tgC4eNpI/AAAAAAAAAQg/99MzR0cqKJI/s1600/vdev+queues.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-xIjgXGSuXSw/U82tgC4eNpI/AAAAAAAAAQg/99MzR0cqKJI/s1600/vdev+queues.png" height="281" width="400" /></a></div><div><br /></div><div><span style="font-family: Verdana, sans-serif;">The ZIO scheduler prioritizes work in the ZIO pipeline. But once the I/Os have been passed along to the vdevs or disks, the ZIO scheduler can do nothing. Also, the typical queue wrapped around a constrained physical storage device, such as a HDD, doesn't have an easy-to-use, priority-based scheduler.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">The depth of the queues down the stack can be quite large. It is not uncommon to see HBAs capable of queuing thousands of I/Os. Modern enterprise-grade SSDs can also have thousands of concurrent I/Os. Even modest HDDs can handle multiple concurrent I/Os.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">To maintain some control over the number of I/Os passed along to the disks, <i>old</i> <i>ZFS</i>&nbsp;has a kernel tunable named&nbsp;</span><span style="font-family: Courier New, Courier, monospace;">zfs_vdev_max_pending</span><span style="font-family: Verdana, sans-serif;">. In <i>very old ZFS,</i>&nbsp;this was set to 35. This turned out to be a bad default choice. In <i>old ZFS,</i> it is 10, a better choice than 35 for many systems. In modern OpenZFS, the ZIO scheduler is replaced with a better algorithm altogether. Concentrating on <i>old ZFS</i>, this means 10 I/Os will be passed along to the disks and the remainder will hang around in a queue and be scheduled by the ZIO scheduler. In the previous queue diagram, this is represented by the sigma (Σ) as the sum of the depth of the queues.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">The two-queue model for the disks shown in the diagram also works well for analysis in Solaris-based OSes, where the </span><span style="font-family: Courier New, Courier, monospace;">iostat</span><span style="font-family: Verdana, sans-serif;"> command has a two-queue model. The internal kstats implement a Riemann algorithm with two queues and this can give some insight to the workings of the I/O subsystem. 
In </span><span style="font-family: Courier New, Courier, monospace;">iostat</span><span style="font-family: Verdana, sans-serif;">, these are known as the wait and active queues, for historical reasons.&nbsp;</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">An important assumption in this model is the service time for handling the I/Os is constant. If you actually measure this, you will find a very different service time for SSDs vs HDDs. For example, most Flash-based SSDs really like many concurrent I/Os and have write pipelines especially well suited for coalescing many write I/Os. For these devices, deeper queues can result in lower per-I/O response time, a very good thing. By contrast, HDDs perform poorly as the queue depth increases, but in an opposite manner than Flash SSDs. By default, HDDs buffer writes so their service time tends to be more consistent. But HDD reads do require physical movement and media rotation leading to response times less predictable than writes.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;"><b>This <a href="http://www.bbc.co.uk/programmes/p00hbfjw" target="_blank">Elevator goes to Eleven</a></b></span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">The biggest issue with queuing concurrent I/Os to HDDs is the <a href="https://www.youtube.com/watch?v=inb1NxdoKNc" target="_blank">elevator algorithms</a> used tend to result in a few I/Os getting starved, while the HDD tries to optimize for the majority. Further, these I/Os are not prioritized, so it is quite possible for a ZFS sync read I/O to be starved while the disk optimizes for the resilver I/Os. This starvation is not under control of the OS or ZFS.&nbsp;</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">What might this starvation look like? Benchmarks and statistics will clearly show us. We know there is a scheduler in the disk device and we hope it will work for us. To find out, we'll create a full-stroke, totally random,&nbsp;</span><span style="font-family: Verdana, sans-serif;">4KB-sized,</span><span style="font-family: Verdana, sans-serif;">&nbsp;100% read or 100% write workload. Our workload includes:</span></div><div><ul><li><span style="font-family: Verdana, sans-serif;">Full-stroke gives us the average worst case workload, but also one validating the vendor's datasheet for average seek and average rotational delay</span></li><li><span style="font-family: Verdana, sans-serif;">4KB isn't particularly magic, it is just small enough that we won't reach bandwidth saturation for any component in the system: memory, CPU, PCIexpress, HBA, SAS/SATA link, or media bandwidth</span></li><li><span style="font-family: Verdana, sans-serif;">100% read workload because these workloads are difficult to cache, thus providing better insight into where the real performance limits lie</span></li><li><span style="font-family: Verdana, sans-serif;">100% write workload to show how deferred writes can lull you into a false sense of performance, due to caching</span></li></ul><span style="font-family: Verdana, sans-serif;">In this workload experiment, we're measuring latency. Bandwidth isn't saturated and IOPS are meaningless. 
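</span><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><div><span style="font-family: Verdana, sans-serif;">If you want to generate this kind of measurement yourself, the idea fits in a few lines of Python. Treat the sketch below as exactly that: the device path and size are placeholders, it needs raw access to an idle and expendable disk, and a real harness would add threads, a write mode, and warm-up.</span></div><pre style="font-family: Courier New, Courier, monospace;">
#!/usr/bin/env python3
# Sketch: single-threaded, full-stroke, 4KB random reads with per-I/O latency.
# DEV and DEV_SIZE are placeholders -- point them at an idle, expendable disk.
# Use a raw/character device so the page cache does not hide the media latency.
import os, random, time

DEV = "/dev/rdsk/c0t1d0"      # placeholder path; requires privileges
DEV_SIZE = 1 * 1024**4        # placeholder size in bytes; set this by hand
IO_SIZE = 4096                # small enough to stay far from bandwidth limits
COUNT = 10000

fd = os.open(DEV, os.O_RDONLY)
latencies = []
for _ in range(COUNT):
    offset = random.randrange(DEV_SIZE // IO_SIZE) * IO_SIZE  # aligned, full-stroke
    start = time.monotonic()
    os.pread(fd, IO_SIZE, offset)
    latencies.append((time.monotonic() - start) * 1e6)        # microseconds
os.close(fd)

latencies.sort()
n = len(latencies)
print("mean %.0f us  median %.0f us  90th %.0f us  max %.0f us" % (
    sum(latencies) / n, latencies[n // 2],
    latencies[int(n * 0.9)], latencies[-1]))
</pre><div><span style="font-family: Verdana, sans-serif;">A purpose-built load generator does the same thing with more rigor and more knobs, but the shape of the measurement is the same.</span></div><div><span style="font-family: Verdana, sans-serif;"><br /></span></div><span style="font-family: Verdana, sans-serif;">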
To prove both, we'll take a look at the workload running with a typical enterprise-grade 7,200 rpm HDD.</span><br /><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-PKuzD2xCkBU/U9CQCslUpNI/AAAAAAAAARA/Jet-DPirsEs/s1600/iops-vs-latency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-PKuzD2xCkBU/U9CQCslUpNI/AAAAAAAAARA/Jet-DPirsEs/s1600/iops-vs-latency.png" height="276" width="400" /></a></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">For this device, the average rotational delay is 4.167ms (7,200 rpm) and the datasheet says the average read seek time is just over 8ms. We expect the average 4KB full-stroke random read to be about 12.5ms or 80 IOPS. And there it is! For a single thread (single concurrent I/O) 100% read workload, the measured response time is close to 12.5 ms and 80 IOPS. Woohoo!</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">The bandwidth consumed by servicing 80 IOPS at 4KB is a whopping 320 KB/sec or around 2.5 Mbits/sec. This is, at best, about 1/1000th of the bandwidth of the disk interconnect, HBA, media, or memory. Clearly, there is no bandwidth constraint for this experiment.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">Unfortunately, vendor datasheets are not sufficient for explaining the constraints demonstrated by the experiment. The 100% write workload has both lower latency and higher IOPS than the read for all measured numbers of concurrent operations (threads). The writes are also buffered and only committed to permanent media when we issue a synchronize cache command. In this test, we don't issue synchronize cache commands, so the decision to put on media is left to the HDD. We know there is a scheduler ultimately trying to make the best decision on writes to the media, so the latency is a measure of the efficiency of the HDD scheduler with the constraints placed by a continuous stream of small, random writes. The benefits of the scheduler and deferred commits to permanent media for writes are shown by the lower latency and higher IOPS than the 100% read workload.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">The clue that there is trouble in paradise is not in the number of measured IOPS. In all cases, the IOPS increases as the concurrency increases -- a good thing. 
Unfortunately, the latency also increases rather significantly -- a very bad thing.&nbsp;</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><b>All IOPS are Not Created Equal</b></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">The measurements shown in the previous graph are averages and do not show a linear relationship between IOPS and latency. Also, if we were to try a linear regression, we would have a problem fitting the mixed workload cases. This is evident from the data, in which a linear relationship could fit the curve well, but the read and write cases have very different slopes.&nbsp;</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">We need to dig further into the latency and queus to understand what lies beneath. For this, we examine the latency of each and every I/O operation. I've long been a fan of iosnoop and will draw your attention to <a href="http://www.brendangregg.com/blog/2014-07-16/iosnoop-for-linux.html" target="_blank">Brendan's porting of iosnoop to Linux</a>. Now you can run iosnoop on Mac OSX, illumos and Solaris distros, and now Linux. Toss in your favorite stats graphing program and you no longer have any excuse for not understanding the distribution of the latency of your storage systems. Let's take a look at our data distributions by read vs write and one vs ten concurrent I/Os. We'll measure in units of microseconds, which is useful for storage devices.</span></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-zKpEULwS6ak/U9CLQfifgGI/AAAAAAAAAQw/71F6-4KrwuM/s1600/HDD-4K-Random-Latency-Distribution.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-zKpEULwS6ak/U9CLQfifgGI/AAAAAAAAAQw/71F6-4KrwuM/s1600/HDD-4K-Random-Latency-Distribution.png" height="640" width="435" /></a></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">We begin by looking at the Concurrent I/Os=1 cases. For both reads and writes, we see a really nice example of the <a href="http://en.wikipedia.org/wiki/Normal_distribution" target="_blank">normal distribution</a>; this is what your stats teacher was trying to drill into your head while you were daydreaming in class. Rejoice when you see these in real life; in computer systems, they are few and far between. 
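</span><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">Turning an iosnoop capture into distributions like these takes only a few lines, by the way. The sketch below assumes you have already pulled the per-I/O latency column out of the capture into a file of microsecond values, one per line; which column that is depends on the iosnoop options you ran with.</span></div><pre style="font-family: Courier New, Courier, monospace;">
#!/usr/bin/env python3
# Sketch: summarize per-I/O latencies (microseconds, one value per line) and
# print a crude histogram -- enough to tell a normal hump from an exponential tail.
import sys

lat = sorted(float(line) for line in open(sys.argv[1]) if line.strip())
n = len(lat)
print("count %d  mean %.0f  median %.0f  90th %.0f  max %.0f (us)" % (
    n, sum(lat) / n, lat[n // 2], lat[int(n * 0.9)], lat[-1]))

buckets = 20
width = (lat[-1] - lat[0]) / buckets or 1.0
counts = [0] * buckets
for x in lat:
    idx = min(int((x - lat[0]) / width), buckets - 1)   # clamp the max value
    counts[idx] += 1
for b, count in enumerate(counts):
    print("%10.0f us | %s" % (lat[0] + b * width, "#" * (60 * count // n)))
</pre><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">A single symmetric hump in that histogram is the normal distribution we just celebrated; a long tail off to the right is the shape to watch for below.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><span style="font-family: Verdana, sans-serif;">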
From a practical perspective, this means the median and averages are almost identical and common intuition about response time will apply.&nbsp;</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">The mean of the reads is 11.8 ms, very close to the datasheet prediction of 12.5ms. Writes show a mean of 5.1ms, due primarily to the deferred writes and device scheduler optimizations. The 90th percentile for writes is also much better than the reads, further affirming the benefits of the device's scheduler.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">By contrast, issuing 10 concurrent I/Os shows a different picture of the latency. The read workload looks strikingly similar to an <a href="http://en.wikipedia.org/wiki/Exponential_distribution" target="_blank">exponential distribution</a>. The large difference between mean and median latency is an indicator that the device's scheduling algorithms are starving some reads. The 90th percentile has ballooned to 148 ms, and some poor reads took more than 300 ms.&nbsp;</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">The write workload still seems normal, but not quite the bell curve we expect from normal distributions. Writes fare better than the reads, with the worst-case write taking slightly more than 100 ms.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><b>Device Queues and the ZIO Scheduler</b></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">Now that we understand how HDDs react to concurrent workloads, we can revisit the ZIO scheduler. The resilver workload is typically described as one or more reads for each write:</span></div><div class="separator" style="clear: both; text-align: left;"></div><ul><li><span style="font-family: Verdana, sans-serif;">Resilver I/Os are constantly running in the background, injecting their workload into the ZIO pipeline along with the "I need it now!" 
workload</span></li><li><span style="font-family: Verdana, sans-serif;">Once issued to the device, the ZIO scheduler cannot adjust the priority</span></li><li><span style="font-family: Verdana, sans-serif;">At the device, more concurrent I/Os leads to longer latency, with extreme latency 10x longer than the single-I/O case</span></li></ul><span style="font-family: Verdana, sans-serif;">To prevent normal workload I/Os from getting stuck behind resilver I/Os after being issued to the device, we need to reduce the depth of the queue at the device by reducing </span><span style="font-family: Courier New, Courier, monospace;">zfs_vdev_max_pending</span><span style="font-family: Verdana, sans-serif;">. The perfect value is based on the expected workload. For latency-sensitive workloads, a smaller number results in more consistency and less variance. For bandwidth-intensive workloads, having a few more I/Os can help.</span><br /><span style="font-family: Verdana, sans-serif;"><br /></span><span style="font-family: Verdana, sans-serif;">To quiet my grumbling, here are my rules-of-thumb for&nbsp;</span><span style="font-family: 'Courier New', Courier, monospace;">zfs_vdev_max_pending</span><span style="font-family: Verdana, sans-serif;">:</span><br /><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">1 = the lower bound, but exposes all of the serialization in the stack</span></div><span style="font-family: Verdana, sans-serif;">2 = a good alternative to 1 for HDDs -- offers the disk an opportunity to optimize, but doesn't overwhelm</span><br /><span style="font-family: Verdana, sans-serif;">3 = not a bad choice</span><br /><span style="font-family: Verdana, sans-serif;">4 = upper bound for HDDs, IMHO</span><br /><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">16 = good for many Flash SSDs</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">30+ = good for RAID arrays</span></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">Herein lies the dilemma: HDDs like fewer concurrent I/Os, but SSDs like more concurrent I/Os, especially for write-intensive workloads. But there is only one tunable for all disks in all pools on the system. Fortunately, you can change this on the fly, without rebooting, and measure the impact on your workload. 
For Solaris or older illumos-based distributions, this example sets the value to 2:</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Courier New, Courier, monospace;">echo zfs_vdev_max_pending/W0t2 | mdb -kw</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">to return to the modern default:</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Courier New, Courier, monospace;">echo zfs_vdev_max_pending/W0t10 | mdb -kw</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">The effects will take place almost immediately. Monitor the latency changes using </span><span style="font-family: Courier New, Courier, monospace;">iostat -x</span><span style="font-family: Verdana, sans-serif;">&nbsp;or another disk latency measure. For direct measurements of all I/Os, as shown previously, consider using </span><span style="font-family: Courier New, Courier, monospace;">iosnoop</span><span style="font-family: Verdana, sans-serif;"> and a stats package.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">In some storage appliances based on ZFS, there are options for setting this value. Consult your documentation for further information. In some cases, the term </span><span style="font-family: Courier New, Courier, monospace;">zfs_vdev_max_pending</span><span style="font-family: Verdana, sans-serif;"> might be disguised as another name, so look for a tunable related to the number of concurrent I/Os sent to disks.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">Understanding queues and their impact on constrained resources can be tricky. Not all of the behaviors follow normal distributions and simple math. Measuring the latency of constrained resources while varying the workload allows you to make adjustments, improving the overall performance of the system. 
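</span><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">If you want to be systematic about the experiments, the sweep itself is easy to script. The sketch below just feeds the mdb one-liner shown above a series of candidate values; it assumes a Solaris or older illumos system where this tunable exists, and the measurement hook is deliberately left as a placeholder for whatever iostat or iosnoop summary you prefer.</span></div><pre style="font-family: Courier New, Courier, monospace;">
#!/usr/bin/env python3
# Sketch: sweep zfs_vdev_max_pending and measure the latency at each setting.
# Assumes Solaris / older illumos where the mdb one-liner in this post applies.
# measure_latency() is a placeholder -- plug in your own iostat/iosnoop summary.
import subprocess, time

def set_max_pending(value):
    # same as: echo zfs_vdev_max_pending/W0t2 | mdb -kw  (here built for any value)
    cmd = "zfs_vdev_max_pending/W0t%d\n" % value
    subprocess.run(["mdb", "-kw"], input=cmd.encode(), check=True)

def measure_latency():
    raise NotImplementedError("collect and summarize latency here")

for value in (1, 2, 3, 4, 10):
    set_max_pending(value)
    time.sleep(120)               # let the workload settle on the new value
    print("max_pending=%d: %s" % (value, measure_latency()))
</pre><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><span style="font-family: Verdana, sans-serif;">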
Tuning ZFS device queues is easier when using my tips and proven techniques.</span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;"><br /></span></div><div class="separator" style="clear: both; text-align: left;"><span style="font-family: Verdana, sans-serif;">May your traffic jams be short, may you be in the front of the line at the next concert, and may all of your I/O have low latency variance!</span></div><div class="separator" style="clear: both; text-align: left;"><br /></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-74566869841101159312012-09-24T15:40:00.001-07:002012-09-24T15:40:44.593-07:00Designing ZFS at Datacenter ScaleHere is the abstract for my talk next week at <a href="http://www.zfsday.com/" target="_blank">zfsday</a>!<br /><br /><span style="font-family: Arial, Helvetica, sans-serif;">The ZFS hybrid storage pool model is very flexible and allows many different combinations of storage technology to be used. This presents a dilemma to the systems architect: what is the best way to build and configure a pool to meet business requirements? We'll discuss modeling ZFS systems and hybrid storage pools at a datacenter scale. The models consider space, performance, dependability, and cost of the storage devices and any interconnecting networks (including SANs). We will also discuss methods for measuring the performance of the system.</span><br /><span style="background-color: white; font-family: arial, sans, sans-serif; font-size: 13px; white-space: pre-wrap;"><br /></span>I hope to see your smiling face in the audience, or you can register to see the live, streaming video. <a href="http://www.eventbrite.com/event/4260025852?ref=ebtnebregn" target="_blank">Sign up today!</a><br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-76106787861180196982012-09-21T10:46:00.004-07:002012-09-21T10:46:59.552-07:00OmniTI adds weight behind illumos/ZFS dayTheo and the crew at <a href="http://www.omniti.com/" target="_blank">OmniTI</a> have added their support to the i<a href="http://www.zfsday.com/" target="_blank">llumos/ZFS day event in San Francisco next month</a>. Thanks guys! We look forward to hearing more about the <a href="http://omnios.omniti.com/" target="_blank">OmniOS distribution!</a><br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-65323898246588585472012-09-18T14:08:00.005-07:002012-09-18T14:08:55.838-07:00ZFS day is coming soon!<a href="http://www.dey-sys.com/" target="_blank">We</a> are hosting illumos and ZFS day events in San Francisco October 1 - 3, 2012. Our good friends from DDRdrive, Delphix, Joyent, and Nexenta are also sponsoring the event. I will be talking about how to optimize the design of ZFS-based systems and explain how to get the best bang for your buck. 
<a href="http://www.dey-sys.com/pioneers.php" target="_blank">Jason and Garrett</a> are also on the speakers list, talking about how illumos has really taken hold as a foundation for building modern businesses.<br /><br />For all of the Solaris fans and alumni, there is also a <a href="http://2ndsolaris.eventbrite.com/" target="_blank">Solaris Family Reunion</a> on Monday evening.<br /><br />There is no cover charge, just register at <a href="http://www.zfsday.com/">www.zfsday.com</a> and join us.<br /><br />We look forward to seeing you there... even if you have to sneak out of a boring Oracle Open World session!<br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-55449334342474955312012-09-12T18:11:00.002-07:002012-09-12T18:11:44.276-07:00cifssvrtop updatedWhen I originally wrote cifssvrtop (top for CIFS servers), all of the systems I tested with had one thing in common: the workstations (clients) had names. Interestingly, I recently found a case where the workstations are not named, so the results were less useful than normal.<br /><br /><br /><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">2012 Sep 11 23:50:48, load: 3.11, read: 0 &nbsp; &nbsp; &nbsp; &nbsp;KB, write: 176448 &nbsp; KB</span><br /><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Client &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;CIFSOPS &nbsp; Reads &nbsp;Writes &nbsp; Rd_bw &nbsp; Wr_bw &nbsp; &nbsp;Rd_t &nbsp; &nbsp;Wr_t &nbsp;Align%</span><br /><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;3391 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp;3033 &nbsp; &nbsp; &nbsp; 0 &nbsp;192408 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; &nbsp;85 &nbsp; &nbsp; 100</span><br /><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">all &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;3391 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp;3033 &nbsp; &nbsp; &nbsp; 0 &nbsp;192408 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; &nbsp;85 &nbsp; &nbsp; 100</span><br /><div><br /></div><div>In this case, there are supposed to be 5 clients. But none have workstation names, so they all get lumped together under "".&nbsp;</div><div><br /></div><div>The fix is, of course, easy and obvious: make an option to discern clients by IPv4 address instead of workstation name. This is more consistent with nfssvrtop and iscsisvrtop, a good thing. 
Now the output looks like:</div><div><br /></div><div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">2012 Sep 12 19:52:23, load: 2.50, read: 0 &nbsp; &nbsp; &nbsp; &nbsp;KB, write: 1766632 &nbsp;KB</span></div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Client &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;CIFSOPS &nbsp; Reads &nbsp;Writes &nbsp; Rd_bw &nbsp; Wr_bw &nbsp; &nbsp;Rd_t &nbsp; &nbsp;Wr_t &nbsp;Align%</span></div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">172.60.0.101 &nbsp; &nbsp; &nbsp; &nbsp;452 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 441 &nbsp; &nbsp; &nbsp; 0 &nbsp; 27984 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 108 &nbsp; &nbsp; 100</span></div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">172.60.0.104 &nbsp; &nbsp; &nbsp; &nbsp;488 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 473 &nbsp; &nbsp; &nbsp; 0 &nbsp; 30072 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 101 &nbsp; &nbsp; 100</span></div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">172.60.0.103 &nbsp; &nbsp; &nbsp; &nbsp;505 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 490 &nbsp; &nbsp; &nbsp; 0 &nbsp; 31068 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 849 &nbsp; &nbsp; &nbsp;99</span></div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">172.60.0.102 &nbsp; &nbsp; &nbsp; &nbsp;625 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 614 &nbsp; &nbsp; &nbsp; 0 &nbsp; 38979 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp;2710 &nbsp; &nbsp; &nbsp;99</span></div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">172.60.0.105 &nbsp; &nbsp; &nbsp; &nbsp;792 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; 773 &nbsp; &nbsp; &nbsp; 0 &nbsp; 49002 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp;4548 &nbsp; &nbsp; &nbsp;99</span></div><div><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">all &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;2864 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp;2792 &nbsp; &nbsp; &nbsp; 0 &nbsp;177106 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp;2030 &nbsp; &nbsp; &nbsp;99</span></div></div><div><br /></div><div>Here we can clearly see the clients separated by IPv4 address. The sorting is by CIFSOPS, which is the easiest way to deal with dtrace aggregations.</div><div><br /></div><div>To implement this change, I added a new "-w" flag that will print the workstation name instead of the IPv4 address. If you prefer the previous defaults, then feel free to <a href="https://github.com/richardelling/tools" target="_blank">fork it on github</a>.</div><div><br /></div><div>I've updated the <a href="https://github.com/richardelling/tools/blob/master/cifssvrtop" target="_blank">cifssvrtop</a> sources in <a href="https://github.com/richardelling/tools" target="_blank">github, check it out</a>. The code has details, the "-h" option shows usage, and there is a <a href="https://github.com/richardelling/tools/blob/master/Toptools.pdf" target="_blank">PDF presentation to accompany the top tools</a> there. 
Finally, feedback and bug reports are always welcome!</div><div><br /></div><div><br /></div><div><br /></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-13361999281134351712012-08-10T18:46:00.000-07:002012-08-10T18:46:14.402-07:00Hello DEY Storage Systems!Many of my friends have been asking where I've been lately and why they haven't seen me lurking around in the usual haunts. In January of this year, Jason Yoho, Garrett D'Amore, and I started a new company, <a href="http://www.dey-sys.com/" target="_blank">DEY Storage Systems</a>. I'm the <span style="font-family: Verdana, sans-serif;">E</span>. And I'll take full responsibility for the uncleverness of the name, though it does lead to some fun puns (where are you when I need you, Simonton?)<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-dAnK0CpuGy4/UCW3CMv9-bI/AAAAAAAAAL8/_kH0nv-sCV8/s1600/FB+Cover3.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="118" src="http://2.bp.blogspot.com/-dAnK0CpuGy4/UCW3CMv9-bI/AAAAAAAAAL8/_kH0nv-sCV8/s320/FB+Cover3.jpeg" width="320" /></a></div><div>We're currently heads-down, working hard on building the product. We've got <a href="http://www.dey-sys.com/news.php" target="_blank">terrific backing</a> from some <a href="http://www.dey-sys.com/investors.php" target="_blank">truly exceptional entrepreneurs</a>, a cadre of experienced advisors, a great vision, and innovative ideas. We are creating some really cool stuff, and I'm eagerly awaiting the first product launch.</div><div><br /></div><div>One last thought, <a href="http://www.dey-sys.com/dey-jobs.php" target="_blank">we're hiring too</a>!</div><div><br /></div><div><br /></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com1tag:blogger.com,1999:blog-1111328281312116171.post-24862495692726185502012-04-21T15:03:00.000-07:002012-04-21T15:03:06.594-07:00Performability Analysis<br /><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">Modern systems are continuing to evolve and become more tolerant to failures. For many systems today, a simple performance or availability analysis does not reveal how well a system will operate when in a degraded mode. A performability analysis can help answer these questions for complex systems. In this blog, an <a href="https://blogs.oracle.com/relling/entry/introduction_to_performability_analysis" target="_blank">updated version of an old blog post on performability</a>, I'll show one of the methods we use for performability analysis.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">But before we dive into the analysis method, I need to introduce you to a scary word, <b>DEGRADED</b>. A simple operating definition of <b>degraded</b>&nbsp;is, something isn't working nominally. For large systems with many parts, such as the Internet, there is also a large number of components that are, at any given time, not operating nominally. The beauty of the Internet is that it was <a href="http://www.netvalley.com/archives/mirrors/cerf-how-inet.html" target="_blank">designed to continue to work while being degraded</a>. Degraded is very different than being <b>UP</b> or <b>DOWN</b>. 
We often describe the <a href="http://www.wikipedia.org/wiki/High_availability" target="_blank">availability of a system</a> as the ratio of time UP versus total time, expressed as a percentage of a time interval, with 99.999% (five-nines) being the gold standard. Availability analysis is totally useless when describing how large systems work. As a systems engineer or architect, operation in the degraded mode is where all of the interesting work occurs.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">With performability analysis, we pay close attention to how the system works when degraded so that we can improve the design and decrease the <b>frequency, duration,</b> or <b>impact</b> of operating in degraded mode. As more people are building or using cloudy systems, you can see how the move from system design focused on outage frequency (MTBF) and duration (MTTR) can be very different in design and economics than a system designed to reduce the impact (performability) of degradation.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">We often begin with a small set of components for test and analysis. Traditional benchmarking or performance characterization is a good starting point. For this example, we will analyze a RAID storage array. We begin with an understanding of the performance characteristics of our desired workload, which can vary widely for storage subsystems. In our case, we will create a performance workload which includes a mix of reads and writes, with a consistent I/O distribution, size, and a desired performance metric expressed in IOPS. Storage arrays tend to have many possible RAID configurations which will have different performance and data protection trade-offs, so we will pick a RAID configuration which we think will best suit our requirements. If it sounds like we're making a lot of choices early, it is because we are. We know that some choices are clearly bad, some are clearly good, and there are a whole bunch of choices in between. If we can't meet our design targets after the performability analysis, then we might have to go back to the beginning and start again - such is the life of a systems engineer.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"> </div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">Once we have a reasonable starting point, we will setup a baseline benchmark to determine the best performance for a fully functional system. We will then use fault injection to measure the system performance characteristics under the various failure modes expected in the system. For most cases, we are concerned with hardware failures. Often the impact on the performance of a system under failure conditions is not constant. There may be a fault diagnosis and isolation phase, a degraded phase, and a repair phase. There may be several different system performance behaviors during these phases. The transient diagram below shows the performance measurements of a RAID array with dual redundant controllers configured in a fully redundant, active/active operating mode. 
We bring the system to a steady state and then inject a fault into one of the controllers.</span></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-F5ZxAtO7V9M/T5MgFWNcncI/AAAAAAAAAK4/VTWjLNeF0vk/s1600/controller-example.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" height="241" src="http://2.bp.blogspot.com/-F5ZxAtO7V9M/T5MgFWNcncI/AAAAAAAAAK4/VTWjLNeF0vk/s400/controller-example.jpg" width="400" /></span></a></div><div class="separator" style="clear: both; text-align: left;"> </div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">This analysis is interesting for several different reasons. We see that when the fault was injected, there was a short period where the array serviced no I/O operations. Once the fault was isolated, then a recovery phase was started during which the array was operating at approximately half of its peak performance. Once recovery was completed, the performance returned to normal, even though the <b>system remains in the degraded state</b>. Next we repaired the fault. After the system reconfigured itself, performance returned to normal and the system is operating nominally.&nbsp;</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">You'll note that during the post-repair reconfiguration the array stopped servicing I/O operations and this outage was longer than the outage in the original fault. Sometimes, a trade-off is made such that the impact of the unscheduled fault is minimized at the expense of the repair activity. This is usually a good trade-off because the repair activity is usually a scheduled event, so we can limit the impact via procedures and planning. If you have ever waited for a file system check (fsck or chkdsk) to finish when booting a system, then you've felt the impact of such decisions and understand why modern file systems have attempted to minimize the performance costs of fsck,&nbsp;<a href="http://www.opensolaris.org/os/community/zfs/"><span class="s1">or eliminated the need for fsck altogether.</span></a></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">Modeling the system in this way means that we will consider both the unscheduled faults as well as the planned repair, though we usually make the simplifying assumption that there will be one repair action for each unscheduled fault. The astute operations expert will notice that this simplifying assumption is not appropriate for the well-managed systems, where even better performability is possible.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">If this sort of characterization sounds tedious, well it is. But it is the best way for us to measure the performance of a subsystem under faulted conditions. Trying to measure the performance of a more complex system with multiple servers, switches, and arrays under a comprehensive set of fault conditions would be untenable. 
We do gain some reduction of the test matrix because we know that some components (eg most, but not all, power supplies) have no impact on performance when they fail.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">Once we know how the system performs while degraded, we can build a Markov model that can be used to examine trade-offs and design decisions. Solving the performability Markov model provides us with the average staying time per year in each of the states.</span></div><div class="separator" style="clear: both; text-align: left;"> </div><div class="p5"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p5"><span style="font-family: Arial, Helvetica, sans-serif;">So now we have the performance for each state, and the average staying time per year. These are two variables, so lets graph them on an X-Y plot. To make it easier to compare different systems, we sort by the performance (in the Y-axis). We call the resulting graph a&nbsp;<i>performability graph</i>&nbsp;or&nbsp;<i>P-Graph</i>&nbsp;for short. Here is an example of a performability graph showing the results for three different RAID array configurations.</span></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-MoHrgD4oCdE/T5MhL-tnBwI/AAAAAAAAALI/Z8XFPw_1GR8/s1600/performability-graph.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" height="201" src="http://1.bp.blogspot.com/-MoHrgD4oCdE/T5MhL-tnBwI/AAAAAAAAALI/Z8XFPw_1GR8/s320/performability-graph.jpg" width="320" /></span></a></div><div class="p5"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><br /><div class="p1"> </div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">We usually label availability targets across the top as an alternate X-axis label because many people are more comfortable with availability targets represented as "nines" than seconds or minutes. In order to show the typically small staying time, we use a log scale on the X-axis. The Y-axis shows the performance metric. I refer to the system's performability curve as a&nbsp;<i>performability envelope&nbsp;</i></span><span style="font-family: Arial, Helvetica, sans-serif;">because it represents the boundaries of performance and availability, where we can expect the actual use to fall below the curve for any interval.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">In the example above, there are 3 products: A, B, and C. Each has a different performance capacity, redundancy, and cost. As much as engineers enjoy optimizing for performance or availability, we cannot dismiss the actual cost of the system. With performability analysis, we can help determine if a lower-cost system that tolerates degradation is better than a higher-cost system that delivers less downtime.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">Suppose you have a requirement for an array that delivers 1,500 IOPS with "four-nines" availability. 
You can see from the performability graph that Product A and C can deliver 1,500 IOPS, Product C can deliver "four-nines" availability, but only Product A can deliver&nbsp;<b>both</b>&nbsp;1,500 IOPS and "four-nines" availability.&nbsp;</span><span style="font-family: Arial, Helvetica, sans-serif;">To help you understand the composition of the graph, I colored some of the states which have longer staying times.</span></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-3wXtufnENjA/T5MjrlKs0hI/AAAAAAAAALY/YdPxzlDEEQ0/s1600/p-graph-explained.001.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" height="211" src="http://1.bp.blogspot.com/-3wXtufnENjA/T5MjrlKs0hI/AAAAAAAAALY/YdPxzlDEEQ0/s320/p-graph-explained.001.jpg" width="320" /></span></a></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><br /><br /><div class="p1"> </div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">You can see that some of the failure states have little impact on performance, whereas others will have a significant impact on performance. You can also clearly see that this system is <i><b>expected</b></i> to operate in a degraded mode for approximately 2 hours per year, on average. While degraded, the performance can be the same as the nominal system. See, degraded isn't really such a bad word, it is just a fact of life, or <a href="http://www.wikipedia.org/wiki/Black_Knight_(Monty_Python)" target="_blank">good comedy as in the case of the Black Knight</a>.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">For this array, when a power supply/battery unit fails, the write cache is placed in write through mode, which has a significant performance impact. Also, when a disk fails and is being reconstructed, the overall performance is impacted. Now we have a clearer picture of what performance we can expect from this array per year.</span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div><div class="p1"><span style="font-family: Arial, Helvetica, sans-serif;">This composition view is particularly useful for product engineers, but is less useful to systems engineers. For complex systems, there are many products, many failure modes, and many more trade-offs to consider. More on that later...</span></div><div class="p1"><br /></div><br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-36337490344249269812012-04-16T00:16:00.000-07:002012-04-16T00:16:01.447-07:00Latency and I/O Size: Cars vs Trains<span style="font-family: Arial, Helvetica, sans-serif;">A legacy view of system performance is that bigger I/O is better than smaller I/O. This has led many to worry about things like "jumbo" frames for Ethernet or setting the maximum I/O size for SANs. Is this worry justified? Let's take a look...<br /><br /><i>This post is the second in a series looking at the use and misuse of IOPS for storage system performance analysis or specification.</i><br /><br />In this experiment, the latency and bandwidth of random NFS writes is examined. 
Conventional wisdom says, jumbo frames and large I/Os is better than default frame size or small I/Os. If that is the case, then we expect to see a correlation between I/O size and latency. Remember, latency is what we care about for performance, not operations per second (OPS). The test case is a typical VM workload where the client is generating lots of small random write I/Os, as generated by the iozone benchmark. The operations are measured at the NFS server along with their size, internal latency, and bandwidth. The internal latency is the time required for the NFS server to respond to the NFS operation request. The NFS client will see the internal latency plus the transport latency.<br /><br />If the large I/O theory holds, we expect that we will see better performance with larger I/Os. By default, the NFSv3 I/O size for the server and client in this case is 1MB. It can be tuned to something smaller, so for comparison, we also measured when the I/O size was 32KB (the NFSv2 default).<br /><br />Toss the results into <a href="http://www.jmp.com/">JMP</a> and we get this nice chart that shows two consecutive iozone benchmark runs - the first with NFS I/O size limited to 32KB, the second with NFS I/O size the default 1MB:</span><br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both;"><a href="http://2.bp.blogspot.com/-ulgeabTPc1o/T4u93FMElpI/AAAAAAAAAKw/06mS6hD53zA/s1600/large-writes-vs-latency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="315" src="http://2.bp.blogspot.com/-ulgeabTPc1o/T4u93FMElpI/AAAAAAAAAKw/06mS6hD53zA/s400/large-writes-vs-latency.png" width="400" /></a></div><span style="font-family: Arial, Helvetica, sans-serif;"><br />The results are not as expected. What is expected is that the larger I/Os are more efficient and therefore offer better effective bandwidth while reducing overall latency. What we see is that we get higher bandwidth and significantly lower latency with the smaller I/O size! The small I/O size configuration on the left clearly outperforms the same system using large I/O sizes.<br /><br />The way I like to describe this is using the cars vs trains analogy. Trains are much more efficient at moving people from one place to another. Hundreds or thousands of people can be carried on a train at high speed (except in the US, where high speed trains are unknown, but that is a different topic). By contrast cars can carry only a few people at a time, but can move about without regard to the train schedules and without having to wait as hundreds of people load or unload from the train. On the other hand, if a car and train approach a crossing at the same time, the car must wait for the train to pass. And that can take some time. The same thing happens on a network where small packets must wait until large packets pass through the interface. Hence, there is no correlation between the size of the packets and how quickly they move through the network because when large packets are moving, the small packets can be blocked - cars wait at the crossing for the train to pass.<br /><br />This notion leads to a design choice that is counter to the conventional wisdom. To improve overall performance of the system, smaller I/O sizes can be better. 
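</span><br /><div><span style="font-family: Arial, Helvetica, sans-serif;">The back-of-the-envelope arithmetic behind the analogy is easy to check. The sketch below assumes a 10 Gbit/s link purely for illustration -- that is my assumption, not the configuration of the system above -- and counts only serialization time, ignoring protocol overhead.</span></div><pre style="font-family: Courier New, Courier, monospace;">
#!/usr/bin/env python3
# Back-of-the-envelope: how long does a small I/O wait behind a big one on the wire?
# The 10 Gbit/s link speed is an assumption for illustration only.
LINK_BITS_PER_SEC = 10e9

def serialization_us(nbytes):
    """Time to push nbytes through the link, in microseconds."""
    return nbytes * 8 / LINK_BITS_PER_SEC * 1e6

for nbytes, label in [(4 * 1024, "4KB write"),
                      (32 * 1024, "32KB NFS transfer"),
                      (1024 * 1024, "1MB NFS transfer")]:
    print("%-18s %8.1f us on the wire" % (label, serialization_us(nbytes)))

# A 4KB "car" arriving just behind a 1MB "train" waits roughly this long before
# its first bit even reaches the link:
print("worst-case wait behind one 1MB transfer: %.1f us"
      % serialization_us(1024 * 1024))
</pre><div><span style="font-family: Arial, Helvetica, sans-serif;">Most of a millisecond of queuing delay per train adds up quickly when you are chasing sub-millisecond response times.</span></div><br /><span style="font-family: Arial, Helvetica, sans-serif;">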
As usual, for performance issues, there are many factors involved in performance constraints, but consider that there can be positive improvement when the I/O sizes are more like cars than trains.</span>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com3tag:blogger.com,1999:blog-1111328281312116171.post-90672861285300495352012-04-08T17:54:00.003-07:002012-04-08T17:54:59.155-07:00DTrace Conference Aura Chart DemoIn case you missed the <a href="http://wiki.smartos.org/display/DOC/dtrace.conf+Schedule" target="_blank">DTrace conference on April 3, 2012</a>, Dierdre recorded all of the sessions and is publishing the videos. I had a few minutes to discuss the Aura Graph work that was demonstrated in Nexenta's booth at VMworld 2011. The <a href="http://smartos.org/2012/04/08/dtrace-conf-2012-more-visualizations/" target="_blank">short video</a> explains what we were visualizing and why it is useful for operators.<br /><br /><a href="http://smartos.org/2012/04/08/dtrace-conf-2012-more-visualizations/">http://smartos.org/2012/04/08/dtrace-conf-2012-more-visualizations/</a><br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-14856465670620569792012-03-31T18:01:00.000-07:002012-03-31T18:01:46.432-07:00IOPS and latency are not related - HDD performance exploredToday, we routinely hear people carrying on about IOPS-this and IOPS-that. Mostly this seems to come from marketing people: 1.5 million IOPS-this, billion IOPS-that. Right off the bat, a billion IOPS is not hard to do, the metric lends itself rather well to parallelization...<br /><i><br /></i><br /><i>This post is the first in a series looking at the use and misuse of IOPS for storage system performance analysis or specification.</i><br /><br />Let's do some simple math. We all want low latency -- the holy grail of performance. In the bad old days, many computer systems were bandwidth constrained in the I/O data path, so it was very easy to measure the effect of bandwidth constraints on latency. For example, fast/wide parallel SCSI and UltraSCSI was the rage when the dot-com bubble was bubbling, capped out at 20 MB/sec. Suppose we had to move 100 MB of data, then the latency is easily calculated:<br /><div style="text-align: center;"><br /></div><div style="text-align: center;">Bad Old Latency = amount of stuff / bandwidth = 100 MB / 20MB/sec = 5 sec</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Well, there it is, 5 seconds later. &nbsp;For modern computers, 6Gbps SAS or SATA is prevalent. So a modern system's disk channel bandwidth is around 750 MB/sec:</div><div style="text-align: left;"><br /></div><div style="text-align: center;">Good New Latency = 100 MB / 750 MB/sec = 0.133 sec</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Sweet! But that is just channel bandwidth, for HDDs there is another limiting factor, the media bandwidth. For bulk HDDs, you can guess 150 MB/sec for media bandwidth and you'll be in the ballpark. Consult the datasheet or detailed design docs for your drive to see its rating.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The effect of the elimination of latency in the data path has led to an interesting change in systems thinking. The good news is that bandwidth generally isn't an issue. The bad news is that the marketing folks need something else to spew about. 
Something that goes<i><b> up and to the right</b></i> when you graph it over time, just like your stock portfolio. In other words, latency doesn't work because you are happier when it is smaller. The solution to this marketing dilemma: talk about IOPS! They go up and to the right, bigger is better, my product has more than yours, and the <a href="http://dilbert.com/strips/comic/1992-04-06/" target="_blank">2-drink minimum</a> ensures a party well into the night.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Let's revisit our scenario. Assume that we have a fixed, 4KB I/O size, which is reasonably common for many workloads, especially those running on Intel x86-based systems today.</div><div style="text-align: left;"><br /></div><div style="text-align: center;">IOPS = number of I/Os / elapsed time = (100MB / 4KB) / 5 = 5,000 IOPS</div><div style="text-align: left;">or better yet...</div><div style="text-align: center;">IOPS = (100MB / 4KB) / 0.133 = 187,500 IOPS</div><div style="text-align: left;"><br /></div><div style="text-align: left;">How about them apples! A million IOPS cannot be far away! Billions will follow! Joy and happiness will overtake the legions of struggling performance geeks and all will be good in the universe!</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Hold it right there, fella! Unfortunately, it doesn't quite work like that. The physics behind the technology says you can maintain 100MB/sec (or whatever your drive's media bandwidth is) as long as you don't have to seek. It turns out that most real-world workloads are not of the streaming bandwidth type. Just a little seek to an adjacent track and you blow the whole equation. What's worse, for HDDs it gets blown in unpredictable ways. To deal with unpredictable systems, we resort to our good old friends, measurement and statistics.&nbsp;So let's take a look at how a HDD reacts to a random workload -- where one wants to see high IOPS.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The workload of choice is a full-stroke 4KB random workload. The victim under test is a typical 7,200 rpm 3.5" HDD (most vendors have similar performance specs). If you look at the datasheet you might see specifications like:</div><div style="text-align: center;">Average seek time: 8.5 ms</div><div style="text-align: center;">Average rotational delay: 4.17 ms</div><div style="text-align: center;"><br /></div><div style="text-align: left;">The average rotational delay is based on the rotation speed, 7,200 rpm. Pay close attention to these specs for HDDs and be aware that many of the new, low power "green" drives have variable rotational delay (read as: random I/O performance will suck even worse). From the specs, we can expect that an average random read will take around 12.5 ms. We test the drive, and sure enough we get 12.5 ms for the single-thread case. This is good, because it means that the datasheets don't lie and I can rely on them for system sizing without ever actually testing the HDD. Now that we know the average latency, it is a simple matter of math to get the IOPS.</div><div style="text-align: left;"><br /></div><div style="text-align: center;">IOPS = 1/avg latency = 1/0.0125 = 80 IOPS</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Well, there it is. If you know nothing at all about a drive, you can guess that it should be able to deliver 80 IOPS, give or take a few.
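<br /><br />Before moving on to faster drives and deeper queues, here is the same arithmetic as a small Python sketch -- nothing new, just the numbers from above, with illustrative helper names, so you can substitute your own drive's datasheet values:<br /><pre>
# The same back-of-the-envelope math as above, in Python (decimal MB and KB).

MB = 1000 * 1000
KB = 1000

def transfer_time_sec(total_bytes, bytes_per_sec):
    return total_bytes / bytes_per_sec

def iops_from_bulk_move(total_bytes, io_size, seconds):
    return (total_bytes / io_size) / seconds

slow = transfer_time_sec(100 * MB, 20 * MB)     # fast/wide SCSI era: 5.0 sec
fast = transfer_time_sec(100 * MB, 750 * MB)    # 6Gbps SAS/SATA channel: ~0.133 sec
print(iops_from_bulk_move(100 * MB, 4 * KB, slow))   # 5,000 "IOPS"
print(iops_from_bulk_move(100 * MB, 4 * KB, fast))   # ~187,500 "IOPS"

# Datasheet guess for a 7,200 rpm drive, single-threaded random reads.
avg_seek_ms = 8.5
avg_rotational_ms = 60 * 1000 / 7200 / 2    # half a revolution: ~4.17 ms
avg_latency_sec = (avg_seek_ms + avg_rotational_ms) / 1000
print(1 / avg_latency_sec)                  # ~79 IOPS -- call it 80
</pre>
Run it and the same 5,000, roughly 187,500, and roughly 80 IOPS figures fall out.<br /><br />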
Drives with faster rotational speed, such as 15krpm, reduce the rotational delay, and the really fast, enterprise-grade HDDs can also reduce the average seek time to a few ms. Do the math and you'll find them around 80 to 200 IOPS.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">But Richard, this is a long way from a billion IOPS. Yes, it is. But before we go there, we need to take another look at HDDs under concurrent load. The above case is for a single I/O operation at a time. Remember when I said IOPS can be easily tamed by parallelization? Let's give it a try. For the next test, let's increase the number of concurrent I/O operations. Modern HDDs have either <a href="http://en.wikipedia.org/wiki/TCQ" target="_blank">Tagged Command Queuing (TCQ) for SCSI or Native Command Queuing (NCQ) for SATA</a>. The idea is that you can submit multiple I/O operations to a HDD and it will optimize the head movements to give you better performance. We adjust our test to measure 100% writes and 100% reads for 4KB random I/Os with 1, 2, 4, 6, 8, or 10 threads. <a href="http://en.wikipedia.org/wiki/Zfs" target="_blank">ZFS</a> fans will know that 10 is a magic number because it is the default I/O concurrency for disks. Since these tests result in multiple answers, it is best to graph them.</div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-dPCkMozZars/T3ecVdW761I/AAAAAAAAAKA/XdQt30zUH7o/s1600/HDD-IOPS-vs-AvgLatency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="287" src="http://2.bp.blogspot.com/-dPCkMozZars/T3ecVdW761I/AAAAAAAAAKA/XdQt30zUH7o/s400/HDD-IOPS-vs-AvgLatency.png" width="400" /></a></div><div style="text-align: left;"><br /></div><div style="text-align: left;">There is our 80 IOPS for 100% reads with a single thread, so our tests look legitimate. But wait... the rest of the measurements are slightly unexpected. We can attribute the better performance for writes to the write buffer cache in the drive (and the workload does not issue <a href="http://en.wikipedia.org/wiki/SCSI_command" target="_blank">SYNCHRONIZE_CACHE</a> commands). So ok, good, we can get 180 write IOPS to the drive, a bonus over the 80 IOPS for reads.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">But, hold it right there again, fella! Remember when we said that IOPS is related to latency? This data shows that <b>there is no correlation between IOPS and latency for concurrent workloads on HDDs! </b>Latency is on the X axis and IOPS is on the Y axis, and the data clearly shows that IOPS tends to remain constant around 180 to 190 IOPS for the write case even though the average latency ranges from around 5.5 ms up to more than 55 ms. Reads are even worse: at around 137 IOPS, the average latency is more than 70 ms. Going back to our math:</div><div style="text-align: center;">IOPS = 1/avg latency = 1/0.075 sec = 13.3 IOPS</div><div style="text-align: center;">13.3 IOPS != 137 IOPS</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Clearly, <b>there is no correlation between IOPS and latency for concurrent workloads on HDDs!</b>&nbsp;What the data shows is that some I/Os are efficiently handled by the drive's elevator algorithm, but other I/Os get penalized rather badly. Some of the maximum measurements, not shown in the graphs, were in the 900+ ms range.
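<br /><br />One way to reconcile the numbers -- offered here as a back-of-the-envelope inference, not as part of the measured data -- is Little's Law: with N I/Os outstanding, IOPS is roughly N divided by the average latency, so 1/latency only predicts IOPS when a single I/O is in flight. The (IOPS, latency) pairs below are approximate readings from the measurements quoted above:<br /><pre>
# Little's Law sanity check: outstanding I/Os = IOPS x average latency.
# The (IOPS, latency) pairs are approximate readings quoted in the text;
# the implied concurrency is an inference, not additional measured data.

def implied_outstanding_ios(iops, avg_latency_ms):
    return iops * (avg_latency_ms / 1000.0)

samples = [
    ("reads,  1 thread     ",  80, 12.5),
    ("writes, low latency  ", 185,  5.5),
    ("writes, high latency ", 185, 55.0),
    ("reads,  high latency ", 137, 70.0),
]
for label, iops, latency_ms in samples:
    print(f"{label}: ~{implied_outstanding_ios(iops, latency_ms):.0f} outstanding I/Os")
</pre>
The implied concurrency comes out near 1 at the low-latency end and near 10 at the high-latency end, which lines up with the 1-to-10 thread sweep. The drive delivers roughly the same IOPS no matter how deep the queue is; what changes is how long each individual request waits.<br /><br />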
Perhaps these measurements need to include reporting of the standard deviation in addition to the mean? Pity the poor application that has to wait 900 ms because all of the other I/Os were being serviced out of order.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Back to the marketing department... a billion IOPS should look like:</div><div style="text-align: left;"><br /></div><div style="text-align: center;">Latency = 1/IOPS = 1/1,000,000,000 sec = 1 nanosecond (ns)</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Can you reasonably expect a 1 ns response time for any I/Os on any modern computer system? Absolutely not! You can't even get 1 ns response time from memory, let alone across the PCIe interconnect, through an HBA, down the wire to the disk and back again. Clearly, the games the marketeers are playing have one or more of the following caveats:</div><div style="text-align: left;"></div><ol><li>Big I/Os are being divided ex post facto into smaller I/Os, as shown in the UltraSCSI example above.</li><li>I/Os are a strange or useless size. We've seen this recently from FusionIO (who should know better!) saying <a href="http://www.fusionio.com/press-releases/fusion-io-breaks-one-billion-iops-barrier/" target="_blank">1 billion IOPS where each I/O was 64 bytes</a>. A more accurate statement is perhaps that they passed 1 billion PCIe transactions per second, except that they didn't say how big the PCIe transfers were, so they could be confusing the public with #1, too.</li><li>Parallel storage systems are prevalent, but very often do not deliver better latency. The analogy I typically use here is: nine women can deliver nine babies in nine months (IOPS), but nine women cannot deliver a baby in one month (latency).</li></ol><div style="text-align: left;">I've painted a bleak picture for HDDs here, and indeed their role in high performance systems is over. SSDs won, game over. There are some very good SSDs that have very consistent, low latency and are amongst my favorite choices for cases where latency is important. Even more are being developed as I type, and it is an interesting time to be in the storage business. If you can, please help squash the crazy marketeers who are spewing drivel by understanding your system and how latency matters.</div><br />As we deliver better tools for observing latency and its effects on your storage workload, we will necessarily have to discourage use of meaningless or confusing measures. Bandwidth is already buried; IOPS will be next. Stay tuned...<br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com7tag:blogger.com,1999:blog-1111328281312116171.post-39899723541888626862012-01-02T10:20:00.000-08:002012-01-02T10:20:29.753-08:00Gotta love the winter<div class="p1">In this time of great changes, I thought I might share a view from the ranch this morning.</div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-rg9xIXkXllc/TwH08tpxzII/AAAAAAAAAJk/VAmQQqKfcjw/s1600/Cherry+blossom.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-rg9xIXkXllc/TwH08tpxzII/AAAAAAAAAJk/VAmQQqKfcjw/s320/Cherry+blossom.png" width="318" /></a></div><div class="p1"> </div><div class="p1">Our cherry trees begin blooming after the rains start in November or December. They will continue to bloom for another month or two.
For all of those people who are huddled against the cold, know that soon the changes will bring spring-like weather and the cherry blossoms will bloom.</div><div class="p1"><br /></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-88557678074729447912011-08-28T17:03:00.000-07:002011-08-28T17:03:20.002-07:00Interesting new things shown at VMworld in Las VegasIf you're traveling to Las Vegas for VMworld 2011, be sure to stop by the Nexenta booth. We've got some interesting new things to show... including something you've never seen before...<br /><br />Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-81073472903623252512011-07-29T18:07:00.000-07:002011-07-29T18:07:31.356-07:00NexentaStor 3.1 ReleasedToday NexentaStor 3.1 was released to the world. <a href="http://www.nexenta.com/corp/free-trial-download">Download a free trial copy today!</a> We've been working hard on this release for some time and it offers significant improvements in stability and performance. Here is a small sample of the changes that I think are cool.<br /><br /><ol><li>ZFS baseline<br />zpool version 28 is the default pool version. This version includes a bunch of features that ZFS geeks know and love, such as readonly imports and rollback. Other Nexenta add-ons are also still there, such as WORM.</li><li>JBOD enumeration support<br />One of the most difficult tasks of a general-purpose, x86-based storage appliance is sorting through the plethora of JBODs on the market and being able to reliably map a disk to the JBOD disk slot. This is the first release of this mapping capability in the appliance. Detailed support for a few JBODs is included (LSI DE1600, Xyratex HB 1234 and HB 2435), with generic support for other JBODs. The pipeline is full of JBODs waiting to have detailed support.</li><li>iSCSI performance improvements<br />Several improvements in iSCSI performance, including zero-copy.</li><li><a href="http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;cmd=displayKC&amp;externalId=1021976">VMware vStorage APIs for Array Integration</a> support<br />VAAI support has been added to block storage protocols.</li><li>Smarter fault triggers<br />Many of the fault triggers have been improved to include more checks and better condition messages.</li><li>HA-Cluster feature enhancements<br />Many enhancements to HA-Cluster to improve failover time and robustness. Also, you can now have multiple pools (Nexenta volumes) or virtual IP (VIP) addresses per cluster service. My favorite feature is that the requirement for a dedicated heartbeat disk has been removed; we now use shared pool disks for heartbeats, which makes perfect sense.</li><li>Auto-sync replication improvements<br />The auto-sync asynchronous replication service has been redesigned to be more robust and offer better performance. A new one-to-many replication transport has also been added.</li><li>More devices supported<br />Device drivers of note include: LSI 9205 family, Areca 1880. In the pipeline are driver updates from Intel and Emulex.</li><li>Better upgrade management<br />You can now upgrade to specific version numbers. In the past you would always be upgraded to the latest version; now you can specify the major version.
For example, to upgrade to release 3.1, the NMC command is "setup appliance upgrade -r 3.1".</li><li>SMB improvements<br />The SMB (aka CIFS) share service has improved performance, scalability, and reliability.&nbsp;</li><li>Active Directory improvements<br />AD integration has been significantly improved. AD also handles Domain Controller failover better.&nbsp;</li><li>Virtual Machine Data Center improvements<br />VMDC has better integration with VMware vMotion.</li></ol>I tried to keep the list to the top-10, but that obviously didn't work. I hope you enjoy the release anyway!Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0tag:blogger.com,1999:blog-1111328281312116171.post-30601113238787655322010-11-16T04:21:00.000-08:002010-11-16T04:21:18.897-08:00NexentaStor 3.0.4 releasedThis week we've launched <a href="http://www.nexenta.com/">NexentaStor 3.0.4</a>. In many ways this is a significant milestone, far beyond what may be immediately obvious. For existing <a href="http://www.nexenta.com/">Nexenta</a> customers, the feature list will look largely unchanged -- many of the same, great universal storage features available since 3.0.0 earlier this year. But for one who studies how organizations grow and mature, the release represents the best quality, stability, and maturity ever. We have been working hard to earn your trust for protecting your data. I think you will be pleased with the <a href="http://www.nexenta.com/corp/free-trial-download">result</a>.<div><br /></div><div>From a ZFS perspective, there have been many bug fixes and stability improvements.&nbsp;This release also deprecates the global ZIL disable feature. If you've ever attended my training courses or research sessions on ZFS, you know that I'm not a big fan of disabling the ZIL or the ease with which disabling the ZIL can be accomplished in NexentaStor. But there are a few, special cases where the performance trade-off is justified. In <a href="http://www.nexenta.com/">NexentaStor 3.0.4</a> you can set the ZIL use policy on a per-dataset basis. In ZFS terms, this is called "ZIL synchronicity" and is three parts <i>really good stuff</i> and one part pun. Special thanks to&nbsp;<a href="http://milek.blogspot.com/">Robert Milkowski</a>&nbsp;and the ZFS community for the contribution.</div><div><br /></div><div>In keeping with <a href="http://www.nexenta.com/">Nexenta</a> tradition, a free, temporarily licensed edition can be downloaded and installed from the <a href="http://www.nexenta.com/corp/free-trial-download">main Nexenta site</a> or the <a href="http://www.nexenta.com/corp/component/remository/products/NexentaStor-Enterprise-Edition-(MIRROR)/">European mirror</a> (thanks, Jacco!). Give it a try and see why we think it is the best NexentaStor release yet. Of course, if you already have NexentaStor installed, the NMC command is "setup appliance upgrade". When you do the upgrade, a checkpoint will be made of your current system, so you can roll back if you are unhappy -- appliances are such wonderful things, when implemented properly.</div><div><br /></div>Richard Ellinghttp://www.blogger.com/profile/15596339461577430423noreply@blogger.com0