Archive

Introduction and Goals

The goal of ioh.sh is to measure both the throughput and latency of the different code layers when using NFS mounts on a ZFS appliance. The ZFS appliance code layers inspected with the script are I/O from the disks, ZFS layer and the NFS layer. For each of these layers the script measures the throughput, latency and average I/O size. Some of the layers are further broken down into other layers. For example NFS writes are broken down into data sync, file sync and non-sync operations and NFS reads are broken down into cached data reads and reads that have to go to disk.

The primary three questions ioh is used to answer are

Is I/O latency from the I/O subsystem to ZFS appliance sufficiently fast?

Is NFS latency from ZFS appliance to the consumer sufficiently fast?

Is ZFS adding unusual latency

One: If the latency from the I/O subsystem is not adequate then look into supplying better performing I/O subsystem for ZFS appliance. For example if the goal is 3ms write times per 1K redo write but the underlying I/O subsystem is taking 6ms, then it will be impossible for ZFS appliance to meet those expectations.

Two: If the latency for NFS response from ZFS appliance is adequate and yet the NFS client reports latencies as much slower (more than 2ms slower) then one should look instead at problems in the NIC, network or NFS client host, see network tracing, example http://datavirtualizer.com/tcp-trace-analysis-for-nfs/

Three: If the I/O latency is sufficiently fast but ZFS latency is slow, then this could indicate a problem in the ZFS layer.

The answer to the question “what is adequate I/O latency” depends. In general a single random 8 Kb block read on Oracle is expected to take 3-12 ms on average, thus the typical latency is around 7.5 ms. NOTE: when measuring I/O latency on the source system it’s important to use a tool like “iostat” that will show the actually I/Os to the subsystem. The I/O measured by the Oracle database will include both I/Os satisfied from the host file system cache as well as the I/O subsystem unless the database is running with direct I/O

The ioh tool can also give insight into other useful information such as

Are IOPs getting near the supported IOPs of the underlying I/O subsystem

is NFS throughput getting near the maximum bandwidth of the NIC?”

For example if the NIC is 1GbE then the maximum bandwidth is about 115MB/s, and generally 100MB/s is a good rule of thumb for the max. If throughput is consistently near the NIC maximum, then demand is probably going above maximum and thus increasing latency

For writes, the non stable storage writes are separated from the writes to stable storage which are marked as “sync” writes. For NFS the sync writes are further broken down into “data” and “file” sync writes.

examples:

The following will refresh the display every 10 seconds and display an extra four sections of data

Sections

First line is the date Second line is TCP MB per second in,out and retransmitted. The last value is “ip discards”

The three parts are all related and are a drill down starting with course grain data at the top to finer grain data at the bottom.

averages – default

histograms [hist]

histograms by size reads [histsz] writes [histszw] for NFS

The first section is a quick overview.

The second section breaks out the latency into a histogram so one can get an indication of amount of I/O from memory (ie those in microsecond ranges) as well as how far out the outliers are. (are the outliers on the VDBs matching up to the outliers seen on ZFS appliance?)

The third section differentiates between latency of single block random (typically the 8K size) and latency of multi-block sequential reads (32K and higher). The differentiation is important when comparing to Oracle stats which are grouped by single block random reads (called “db file sequential read” ) and sequential multi-block read (called “db file scattered read”).

The final section

top files read and write [topfile]

is sort of a sanity check as there are periods where there is suppose to be little to no NFS I/O and yet there is, so the top file sections tells which file and which host the NFS I/O is coming from.

The last line after all the sections is total IOPs for reads plus writes. (Note these IOPs could get converted to higher values at the storage layer if using RAID5 which will cause each write to be two reads plus two writes.)

The first section, shows up by default. The other sections require command line arguments.

To see just the first section, which is the default, run ioh.sh without any arguments:

sudo ./ioh.sh

To show non-default sections, add them to the command line

sudo ./ioh.sh -d "hist histsz histszw topfile"

A shortcut for all sections is “all”

sudo ./ioh.sh -d all

Collecting in the background

nohup sudo ./ioh.sh -c 60 -t 86400 &

Runs the collection for 1 day (86400 seconds) collecting every 60 seconds and put raw output into default file name “ioh.out”. The default output file name can be changed with “-o filename” option.

1. Averages:

The displays I/O, ZFS and NFS data for both reads and writes. The data is grouped to try and help easily correlate these different layers First line is date in epoch format

zfssync – these are synchronous writes. THese should mainly be Oracle redo writes.

nfs – unstable storage writes

nfssyncD – data sync writes

nfssyncF – file sync writes

DTrace probes used

io:::start/done check for read or write

nfs:::op-read-start/op-read-done , nfs:::op-write-start/op-write-done

zfs_read:entry/return, zfs_write:entry/return

2. Histograms

latency distribution for i/o, zfs, nfs for reads and writes. These distributions are not normalized by time, ie if ioh.d is outputs once a second then these counts will be equal to the counts in the first section. If ioh.d outputs every 10 seconds, then these values will be 10x higher

Histograms by size for reads and writes

The first column is the average latency for the size of I/O for this line. The second column is the size. The size includes this size and every size lower up till the previous bucket. The goal here is to show the sizes of I/Os and the different latency for different sizes. For an Oralce database with 8k block size, 8k reads will tend to be random where as higher read sizes say will be multiblock requests and represent sequential reads. It’s common to see the 8K reads running slower than the larger reads.

4. Top files

shows the top 5 files for reads and writes. First column is MB/s, then R or W, then IP, then port then filename

Examples and Usage

Idle system

First thing to look at is the MB in and out which answers

“how busy is the system?”

“is NFS throughput approaching the limits of the NIC?”

In the following example, there is only less than 50KB/s total NFS throughput ( in plus out) thus the system isn’t doing much, and there must be no database activity other than the regular maintenance processes which are always running on a database. To confirm this, one can look at the top files at the bottom and see that the only activity is on the control files which are read and written to as part of database system maintenance. Otherwise there is no activity to speak of, so no reason look at I/O latency in this case. Additionally, all majority of what little I/O is in 16K sizes which is typical of control file activity, where as the default database data block activity is in 8K sizes. Most read I/O is coming from ZFS appliance cache as its 64 micro seconds.

Active System Below is an example of an active system. Looking at TCP bytes in and out, there is a fair bit 3MB/s out and 2MB/s in. These rates are a long way from saturating 1GbE but there is activity going on.

READs all reads are coming out of the cache. How do we know? For one the average ms latency is 0.07, or 70 micro seconds. Does thhis 70us include slower reads that might be off disk? Looking at the histogram, one can see that the slowest zfs I/O is under 100us and looking just above at the I/O histogram there are no I/Os thus all the I/O is coming from cache.

Writes Writes are pretty slow. Oracle Redo writes on good systems are typically 3ms or liess for small redo. Unfortunately most of the I/O is coming from datafile writes so it’s difficult to tell what the redo write times are. (maybe worth enhancing ioh.d to show average latency by file) Typically the redo does “nfssyncD” writes and datafile writes are simply unstable storage writes “nfs” writes that get sync at a later date. This particular database is using the Oracle parameter “filesystemio_options=setall” which implements direct I/O. Direct I/O can work without sync writes but the implementation depends on the OS. This O/S implementation, OpenSolaris, causes all Direct I/O writes to by sync writes.

In this case the ZFS I/O 19MB/s is higher than NFS at 16MB/s. Now that could because some thing is accessing the file system locally on ZFS appliance or that ZFS is doing read ahead, so there are possible explanations, but it’s interesting. Second subsystem I/O at 88MB/s is much greater than ZFS I/O at 19MB/s. Again that is notable. Could because there is a scrub going on. (to check for a scrub, run “spool status”, to turn off scrub run “zpool scrub -s domain0″ though the scrub has to be run at some point). Both interesting observations.

Now the more interesting parts. The NFS response time 22ms is almost 3x the average ZFS response time 8ms. On the other hand the average size of NFS I/O is 2.5x the average ZFS I/O size so that might be understandable. The hard part to understand is that the ZFS latency 8ms is twice the latency of subsystem I/O at 4ms yet the average size of the I/O sub-system reads is bigger than the average ZFS read. This doesn’t make any sense.

In this case to hone in the data a bit, it would be worth turning off a scrub if it was running and see what the stats are to eliminate a factor that could be muddying the waters.

But in this case, even without a scrub going, the ZFS latency was 2-3x slower than the I/O subsystem latency.

It turns out ZFS wasn’t caching and spending a lot of time trying to keep the ARC clean.

NFS is 0 MB/s because this was from http traffic. The current version of ioh would show the TCP MB/s. This version also mixed up zfs sync and non-sync writes into one bucket, but much of the ZFS writes have to be non-sync because the write rate is 127MB/s where as the I/O subsystem writes are only 10MB/s thus at least 117MB/s is not sync and if they are not sync they are just memory writes so should be blindingly fast, but they aren’t. The average latency for the ZFS writes is 37ms. All the more shockingly the average size is only 0.199K where as the I/O subsystem writes 32K in 23ms. The case here was that because of disk errors, the ZFS layer was self throttling way to much. This was a bug

How do you determine if the Oracle SQL optimizer has created a sub-optimal execution plan? re-run statistics and see what happens? wait for Oracle to find other execution plans? What if neither method is helping? Do you read the execution plan? What do you look at? Differences in actual vs estimated? How successful is that? Look for full table scans? Do you look at the 10053 trace? How much time and effort does that take? What do you look at in the 10053 trace. Do you have a systematic methodology that works in almost all cases?

The method is tedious as it requires a lot of manual work to draw join trees, identify constraints and relationships, manual decomposition and execution of every 2 table join in the statement. It can add up to a lot of work but it is systematic and dependable.

If you cost your company $50/hour then in 8 hours of saved work it’s paid for its self. In my experience as a DBA I have serveral SQL a year that take me over a day to optimize manually but that I can get done in a few minutes with DB Optimizer. Thus with just one hard SQL the tool has paid for itself. The DB Optimizer analysis might run for a couple hours, but afterwords with the data it collects and presents, I can find better tuning path in minutes if it exists.

The Oak Table members will be discussing their latest technical obsessions and research on Monday and Tuesday, (Oct. 2nd and 3rd, 2017). The truth is, folks- The Oak Table experts are an AWESOME group, (if I don’t say so myself! :)) as we could have easily done another day of incredible sessions, but alas, two days is all we have available for this year’s event.

While at Delphix, we did a lot of storage benchmarking. The I/O response times of Delphix depends, as one would logically imagine, heavily on the underlying disks. Sure Delphix can cache a lot ( with 1 TB of ram and 3x compression that’s 3TB and that 3TB can be shared by 10 or a 100 copies being the equivalent to 30TB or 300TB of databases) but really there will always be important I/O coming from the storage subsystem.

Now Delphix mainly runs databases loads, so the best test for storage that is hooked up to Delphix is to benchmark the storage I/O for a database workload. Two questions arise

What tool can benchmark the I/O?

What is a typical database I/O workload?

For the tool, the clear answer seems to be fio which not only is quite flexible but has an active community and updates and was started by the renown Linux I/O master Jens Axboe and still is actively maintained by him.

Now for database workloads there are 4 types of I/O

Random single block reads (typically lookups by index)

Large multi block reads (typically full table scans)

Small sequential I/Os to transaction log files

Large sequential I/Os totransaction log files (generally when commits are infrequent and change rate is high)

Now in a database, all of these can be done concurrently by multiple processes concurrently and thus we need to benchmark for concurrency as well.

Thus the matrix is the 4 types of I/Os to be tested by different number of concurrent users.

Now fio doesn’t come with such prepackaged I/O benchmarks, so I created a script fio.sh to run configurable database I/O benchmarks.

This summer I bought a used Mac laptop that had something called Hybrid SSD. I had been using a Mac with preinstalled SSD disks and thought the Hybrid would be similar response wise, but once I started using it, there was something cleary wrong, but before sending it back I wanted some empirical proof, so I ran fio.sh.

On my Linux desktop you can see how the MB/sec throughput maxes out about 26 MB/sec and after that latency just goes down proportionally as we add more concurrency.

The github repository all has r scripts to visualize the data (see the readme in github for details on how to generate the graphics)

Here is an explanation of the graphics.

There are a number of factors that are important when benchmarking I/O such as whether using Direct I/O or not, what the size of caching is on the host running fio, what the back end storage cache size is, what the size is of the file used to test I/O, how that file is initialized other with 0’s, or patterned data, or random data, whether the file system compresses or not, etc. Check out this blog post for some anomalies and surprises: http://datavirtualizer.com/lies-damned-lies-and-io-statistics/

When there are databases for which I’m responsible, I like to have powerful performance analysis database tools. In the Oracle world I use tools such as Lab128, DB Optimizer and OEM. There a number of other tools that look good from companies such Quest, Solarwinds and Orachrome. Often though when I’m dealing with a new situation, a new database or a database that’s temporary, then I just want a quick and easy way to see basic stats easily and quickly. For Oracle I have a number of easy monitoring scripts.

For monitoring databases other than Oracle, I have less experience. I did do work on SQL Server, Sybase and DB2 when working on the development of DB Optimizer but monitoring popular open source databases such as MySQL and Postgres is relatively new to me.

Right now I’m looking into Postgres.

For Postgres I wanted some quick and easy way to monitor in realtime various stats to start iterating and prototyping the type of data I wanted to look at, so I put together a shell script. The advantage of a shell script is it should work most anywhere and the only requirement is psql which should be readily available in a PosgreSQL environment.

I took a script from my Oracle experiences, oramon.sh, and retrofitted it for Postgres. It is a script that sets up a named pipe then has psql (or for Oracle , SQL*Plus) read from that named pipe. The script procedurally feeds SQL statements into the named pipe. The output from psql goes to files and the script manages the output to the screen. The script is a little less optimal for PostgreSQL than Oracle as I had to use a temporary table to store variable values. Using psql has variables but they look to be static can not modifiable (seem to act like defines).

The script pulls out some basic stats and outputs them every 5 seconds on the command line, sort of like vmstat. The first stat is AAS or average active sessions which is the count of sessions actively running queries, followed by a few basic database stats.

I’ve only used it a little bit but it does seem a bit touchy about somethings (like spaces in input fields) and the errors are often less than obvious and I’m not finding that much out there on google for the errors.

Today I ran into the error

Variable Name must not be null in JDBC Request

and Googling it didn’t turn up anything.

I’m pretty sure I ran into this same error a few weeks ago when I was first starting with Jmeter, so blogging here to document it.

I was trying something new – running a procedure – instead of a regular sql statement and I think that threw me off.

The error sounded to me like I needed to define an input or output variable.

I tried both of those, until finally I saw its not the input or output variable but the name of the JDBC connection pool that was missing

Turns out I had set “Parameter values” and “Parameter types”. When I took them out it worked. What confuses me, and what I’ll have to look back into, is the whole reason I added the Parameters was because the nextval wasn’t work. Forgot what that original error was.

This blog post is just a start at documenting some of my experiences with jmeter. As far as load testing tools go, jmeter looks the most promising to me. It has an active community, supports many different databases and looks quite flexible as far as architecting different work loads goes.

The flexibility of jmeter also makes it hard to use. One can use jmeter for many other things besides databases so the initial set up is a bit oblique and there look to be many paths to similar results. As such, my understand and method for doing things will probably change considerably as I start to use jmeter more and more.

3. Sampler of type JDBC Request

Make sure and include the name of the thread pool created above. In my case it’s called “orapool”

add a SQL statement to run

4. Listener of type “View Results Tree”

create a widget to see the output

Final setup looks like

run your load and look at the output

Now you hit the run button, the green triangle.

Then click on “View Results Tree” to see the output.

I clicked on “View Results Tree” and then clicked on “JDBC Request” in red.

Then I’ll see some output. I choose “Response data” because it’s a bit more succinct and see the error. In this case there is an extra space ” ” at the end of “oracle.jdbc.OracleDriver “. Jmeter is sensitive to spaces. I’ve gotten a lot of errors because of spaces in fields such as variable names and such.

Correcting that it runs

All the setup might sound like a bit of a pain but once it’s set up, it’s easy to click through and make modifications.

All the setup is available in a text .jmx file and if you are brave you can edit directly there.

The above example is more or less pointless – sort of a “Hello World”.

From here though you can increase the number of threads, increase the number of loops, add more SQL statements.

Jmeter allows a lot of customization so you can add .cvs files for input values, capture output values into variables and use them in input values, have different types of loops with different users running concurrently etc.

More to come.

Christian Antognini gave a presentation at Oaktable World SF in Sept 2016. He was gracious enough to send along his functionally rich .jmx file and I’ll blog on that soon.