Monitoring Tempdb I/O Performance

Troubleshooting SQL Server implies a reactive
activity; an issue has occurred that now needs to be fixed. That may be
true but one of the differences that separates an average SQL Server
professional from a good one is knowing about a problem before it has an impact on a live system.

You should be aware by now of tempdb’s importance
to the overall health of an entire instance, so it shouldn’t be a hard
sell to realize the benefits of being proactive and monitoring tempdb
to get early warning of potential problems before they affect a
production system. This section covers the specifics of monitoring
tempdb I/O: What you should be looking at and what thresholds should
prompt you to do something.

The speed at which requests to store and retrieve
data are processed against tempdb is important to the overall
performance of any SQL Server instance and can even be critical where
tempdb is either heavily used or part of an important business process.

Whether you have tempdb on local storage or a SAN
(storage area network), on a RAID10 volume or RAID1, the simplest way
to check I/O system performance is to look at the latency of I/O
requests.

There are two methods for measuring disk latency:
using Performance Monitor and using SQL Server DMVs.
Which one you should choose depends on how you want to monitor
performance and how accurate you need it to be.

Performance Monitor

The PerfMon counters that you should be interested in are as follows:

Avg. Disk sec/Transfer

Avg. Disk sec/Read

Avg. Disk sec/Write

You’ll find these grouped under Logical
Disk, which shows the logical drives and drive letters presented in
Windows as you would see them in Explorer; and Physical Disk, which
shows the drives as Windows sees them internally. Which group you get
the counters from won’t matter in most cases — I tend to use the
Logical Disk counters because it’s easier to work with drive letters.

The counters themselves all provide the average
latency in milliseconds for I/O requests. “Avg. Disk sec/Transfer” is
the combined average for both reads and writes to a drive. This counter
provides the simplest measurement for regular long-term monitoring.

“Avg. Disk sec/Read” and “Avg. Disk sec/Write”
separate the requests into read and write measurements, respectively,
that can be useful for determining how to configure disk controller
cache . For example, if you’re seeing poor read
performance and excellent write performance, you might want to optimize
the cache for reads.

SQL Server DMVs

Monitoring the performance of a disk
volume using Performance Monitor is a useful indicator of a potential
storage performance issue, but you can get a further level of
granularity from SQL Server itself. The following script (code file: Ch8_5ReadAndWriteLatency.sql) uses the sys.dm_io_virtual_file_stats
DMV to calculate the read and write latency for all database files that
have been used since the SQL Server service was last started.

You can see part of the output from running the script on a busy production SQL Server in Figure 4. Tempdb has four data files with file_id’s 1, 3, 4, and 5, and a transaction log with file_id
2. All the data files have the same read and write latency, which is a
positive indicator that the I/O is balanced across all the files, and
all the results indicate good performance from tempdb.

You should use these thresholds for
guidance only because some systems will never be able to achieve the
target latency. If you don’t have any performance issues with your
application and you’re seeing latency of 20ms, then it’s not so
important.

Using 20ms is a good rule of thumb target on most systems unless SQL Server is spending a lot of time waiting for I/O requests.

Troubleshooting Space Issues

All the code in this section uses the Ch8_6TempDBSpaceIssues.sql code file.

sys.dm_db_file_space_usage

This DMV provides a view of the number
and types of pages that are allocated in tempdb by file, allowing you
to see the distribution of pages across your data files.

You can also use this DMV to total the values
across all the files to get a single view of the breakdown of tempdb
usage, which can help you narrow down the scope of the problem in the
event of unexpected usage. Here is an example script for this:

Example results from the preceding script are shown in Figure 5. Note that user, internal and version pages are from uniform extents only.

FIGURE 5

sys.dm_db_task_space_usage

This DMV provides details of tempdb usage for currently running tasks. The values are set to 0
at the start of the task and deleted when the task completes, so it’s
useful for troubleshooting live issues with currently executing tasks.
For example, the following script will give you the top five sessions
currently using space in tempdb, ordered by the total amount of space
in use:

The output won’t include any currently executing tasks, so it’s not very useful for a live issue; but you can look up the session_id in sys.dm_exec_requests to gather information about who’s using that session_id, such as their login details; the server they’re connecting from; and the application they are using.