Leveraging DB2 Snapshot Administrative Views

Native DB2 LUW database snapshot utilities identify performance characteristics of the database at the moment they are run, but they are not easily used to identify performance trends over days, weeks, and months. There are obvious advantages to being able to compare performance trends over time; furthermore, many snapshot data elements are not meaningful unless seen changing over time. With the advent of snapshot administrative views in DB2 version 9, automating historical performance capture has become easy for Unix database administrators.

This article shows the administrator how to use Unix shell scripts to build a home-grown database performance history system without resorting to third-party tools. Shell scripts are used to automate performance capture and provide 24×7 monitoring support. The author also provides 40 views representing calculated common metrics over the resulting performance history data.

The target audience for this article is Unix DB2 database administrators and system support staff tasked with DB2 database monitoring in Unix environments. Solaris, AIX, and Linux DB2 DBAs running DB2 version 9 and later will benefit.

Many levels of granularity are possible: look at performance characteristics over an hour, a day, a week, or a month.

Quick access to hundreds of snapshot data elements.

Understand overnight/weekend performance problems.

Retrospective debugging of database problems. How many times have you found out about a problem hours after the problem actually manifested, too late to run snapshots and left to speculate about the source of the problem? DB2 issues often manifest in the middle of the night or on weekends. By the time you’re on the scene, the source of the crash is long gone, leaving little besides db2diag.log for backtracking. Having historical snapshot performance data on hand will allow you to more easily track down the gremlin.

Visualization potential using graphing software.

This longitudinal capability is what made DB2 monitoring tools such as Database Guys1 so powerful. You don’t need to purchase a third-party tool to gain this capability; you just need to build your own capture system. It is not difficult.

Prior to DB2 version 8, the only way to obtain snapshot data was to generate a text file from GET SNAPSHOT output and use shell scripts, Perl, or other parsing tools to aggregate the data. IBM introduced snapshot table functions in version 8.1. These represented a huge leap forward by giving administrators access to snapshot data via SQL. In DB2 version 9 IBM introduced snapshot administrative views, which are now the preferred means of accessing snapshot data and are easier to write against than table functions. Snapshot administrative views present snapshot data in much the same way as system catalog views.

Both snapshot table functions and administrative views are very useful, providing snapshot data about current database performance mediated by a powerful SQL reporting language. The behavior of the database at this moment in time might be all you need to know. But the ugliest problems seem to happen outside of the 9-to-5 window, and single-time-slice snapshots don’t let us perceive patterns such as trends, recurring performance spikes, or slow degradation in database performance over a span of weeks. Snapshot data becomes vastly more useful when stored over time in performance monitoring tables.

Here’s how.

Assumptions

Version: DB2 LUW Version 9.

Unix OS is used for scripting examples, but the system could be adapted for Windows.

Either SYSADM authority or a combination of SYSMON and other authority [for table creation].

My performance view DDL and related queries were written assuming a single database per instance, because that was the case in the shop where I developed this version of the capture system. If you have multiple databases per instance, modify my examples to include DB_NAME in the primary key, the metric view SELECT columns, and the predicates and GROUP BY clauses of the report views. If you are a DPF shop, you will also need to make a similar change for column DBPARTITIONNUM.

Create snapshot historical tables

This article provides CREATE DDL for only five snapshot capture tables. These are the only tables you will need for the specific performance reports provided in this series (actually you won’t need SNAP_TBS_CFG_T either, but I had it, so I decided to include it).

If you want to store the full set of columns from an administrative view, the quickest way to create your historical system administrative table is:
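The statements below are a minimal sketch of that quickest path, assuming the DBA schema and a SNAPDB_T table name consistent with the article's later examples; adjust names, and verify that SNAPSHOT_TIMESTAMP is NOT NULL in your environment before adding the primary key (if it is nullable, alter the column first or use a unique index instead):

```sql
-- Step 1: clone the administrative view's full column layout into a history table
CREATE TABLE DBA.SNAPDB_T LIKE SYSIBMADM.SNAPDB;

-- Step 2: key the history rows by capture time so repeated inserts stay unique.
-- Include DB_NAME (and DBPARTITIONNUM for DPF) here if you run multiple databases.
ALTER TABLE DBA.SNAPDB_T
  ADD CONSTRAINT SNAPDB_T_PK PRIMARY KEY (SNAPSHOT_TIMESTAMP);
```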

If you do not want to store the full set of columns from the SYSIBMADM view, you can create custom DDL for the administrative view. Unfortunately, db2look will not generate DDL directly from an administrative view, because SYSIBMADM objects are not true user views. But to make quick work of editing your own DDL, follow Step #1 above (CREATE TABLE ... LIKE ...) and then run db2look against the resulting object. Drop the table you created after you have produced the DDL from db2look. Massage the DDL to include only those columns you wish to keep for history. Create the table from your edited DDL, then create your primary key.3

If you choose to capture a smaller subset of administrative view columns, change your populate-table INSERT subselect to use an explicitly named set of columns (instead of SELECT *).
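For example, a sketch of such an INSERT against a trimmed-down history table; the specific columns chosen here are illustrative, not a recommendation:

```sql
-- Populate a reduced-column history table from the administrative view.
-- All columns named are standard SYSIBMADM.SNAPDB elements.
INSERT INTO DBA.SNAPDB_T
       (SNAPSHOT_TIMESTAMP, APPLS_CUR_CONS, APPLS_IN_DB2,
        COMMIT_SQL_STMTS, ROWS_READ)
SELECT SNAPSHOT_TIMESTAMP, APPLS_CUR_CONS, APPLS_IN_DB2,
       COMMIT_SQL_STMTS, ROWS_READ
FROM   SYSIBMADM.SNAPDB;
```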

The snapshot administrative views I use in the remainder of my examples are:

Populate snapshot historical tables

These script examples use cron and the Korn shell, but you can substitute your own scheduler and scripting tool. For the performance granularity I sought, I had the populate script run every five minutes. The executing user ID must have at least SYSMON authority and the INSERT privilege on the target DBA schema tables.
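A minimal sketch of such a populate script follows. The script path, table names, and the list of capture tables are assumptions to adapt to your own implementation:

```sh
#!/usr/bin/ksh
# capture_snapshots.ksh -- append current snapshot rows to the DBA history tables.
#
# Sample crontab entry (classic cron syntax, every five minutes):
# 0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/db2inst1/scripts/capture_snapshots.ksh MYDB

DBNAME=$1

# Connect once; bail out if the database is unavailable
db2 connect to ${DBNAME} >/dev/null || exit 1

# One INSERT per capture table; adjust the list to the tables you created
for VIEW in SNAPDB SNAPAPPL SNAPSTMT SNAPBP
do
    db2 "INSERT INTO DBA.${VIEW}_T SELECT * FROM SYSIBMADM.${VIEW}"
done

db2 connect reset >/dev/null
```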

Performance Metric Views

Included in this article are about forty views of database performance metrics. The two-column views all feature an initial SNAPSHOT_TIMESTAMP column, with the second column being the metric. These views are ideally suited for two-axis time-series charting and visualization.

We then present a few views (three or more columns) with a more complex presentation of performance metrics, not suitable for an X-Y graph but useful for diagnosing specific performance issues.

Please see my References for sources and credits. All SQL in this article is original and is protected under a Creative Commons 3.0 license.4

I wrote these views for an environment with only a single database per DB2 instance, so users with multiple databases per instance will have to make small modifications to some views. With multiple databases per instance, any views that read tables DBA.SNAPDB, DBA.SNAPAPPL, or DBA.SNAPSTMT may need DB_NAME added to their SELECT clauses, predicates, and GROUP BY clauses (if you are a DPF shop, you will also need to make a similar change for column DBPARTITIONNUM).

To download DDL for an individual view, go to the view’s entry below and click the “[DOWNLOAD VIEW DDL]” link next to the view link. To download all DB2 view DDL at once as a ZIP archive, right-click on this link.

Performance Metrics Two-Column Views

Application and Connection Metrics

Applications Connected

Description: Number of connections (logical application sessions, application IDs) to the database, by snapshot timestamp.

Usage example:

select APPLS_CUR_CONS from dba.v_appls_cur_cons
order by snapshot_timestamp desc fetch first 1 row only
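The downloadable DDL for each two-column view follows one pattern: SNAPSHOT_TIMESTAMP paired with a single metric from a history table. A minimal sketch of that shape, using this view's name and assuming the DBA.SNAPDB_T capture table:

```sql
-- Two-column metric view pattern: timestamp plus one metric,
-- ready for X-Y time-series charting
CREATE VIEW DBA.V_APPLS_CUR_CONS AS
SELECT SNAPSHOT_TIMESTAMP,
       APPLS_CUR_CONS
FROM   DBA.SNAPDB_T;
```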

Applications Executing

[DOWNLOAD VIEW DDL] Description: Applications currently executing (within the last 5 minutes). Number of connected AND EXECUTING logical session threads, or applications, at the time of the snapshot. Most connections will not be executing at any given snapshot point in time.

Usage example:

select APPLS_IN_DB2 from dba.v_appls_in_db2
order by snapshot_timestamp desc fetch first 1 row only

New Connects Per Minute

select DHR_PCT from DBA.V_DATA_HIT_RATIO order by snapshot_timestamp desc fetch first 1 row only

Physical (tablespace I/O) Metrics

Asynchronous Write Percentage (AWP)

[DOWNLOAD VIEW DDL] Description: Asynchronous data and index writes to tablespace (“physical”) as a percentage of total index and data physical writes. It is ALWAYS desirable to do a high percentage of asynchronous write I/O. Not only do asynchronous write I/Os avoid delaying the application connection, but they also tend to be 3 to 10 times faster, depending on your disks and storage system.

It is desirable to have AWP greater than 90%. However, if your database has very little Insert, Update, or Delete activity, then achieving a high AWP will be difficult, if not impossible, to do. On the other hand, if your database has DMLTX (update/insert/delete per transaction) greater than or equal to 1, then achieving a high AWP value is very important. To achieve a high AWP, you need to have a sufficient number of I/O Cleaners (DB CFG NUM_IOCLEANERS) and not have the Changed Pages Threshold set too high (DB CFG CHNGPGS_THRESH).5
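Assuming the DBA.SNAPDB_T history table described earlier, the AWP view can be sketched from the standard SNAPDB write counters. The DECIMAL cast and the +1 divide-by-zero guard are my illustrative choices, not necessarily what the downloadable DDL does:

```sql
-- Async writes as a percentage of all physical data and index writes
CREATE VIEW DBA.V_ASYNC_WRITE_RATIO AS
SELECT SNAPSHOT_TIMESTAMP,
       DECIMAL((POOL_ASYNC_DATA_WRITES + POOL_ASYNC_INDEX_WRITES) * 100.0
               / (POOL_DATA_WRITES + POOL_INDEX_WRITES + 1), 5, 2)
         AS ASYNC_WRITE_PCT
FROM   DBA.SNAPDB_T;
```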

select ASYNC_WRITE_PCT from DBA.V_ASYNC_WRITE_RATIO
order by snapshot_timestamp desc fetch first 1 row only

Physical Synchronous Read Percentage (SRP)

[DOWNLOAD VIEW DDL] Description: Indicates the number of data pages read in from the tablespace containers (physical) through synchronous processes, by database manager agents, as a percentage of total physical reads. Synchronous reads are in contrast to asynchronous prefetch reads performed by engine dispatchable units (EDUs). In OLTP environments, synchronous physical reads are generally preferable to asynchronous reads.

Dirty Page Threshold Cleaners Ratio Percentage (TCR)

How often a bufferpool page cleaner was invoked because a buffer pool had reached the dirty page threshold criterion for the database (dirty pages contain data that has been changed in the bufferpool, but not yet written to disk). The threshold is set by the chngpgs_thresh configuration parameter. It is a percentage applied to the buffer pool size. When the number of dirty pages in the pool exceeds this value, the cleaners are triggered.

If chngpgs_thresh is set too low, pages might be written out too early, requiring them to be read back in. If set too high, then too many pages may accumulate, requiring users to write out pages synchronously in a long burst, tying up other processing (with poor resulting performance).

If this ratio is low, it may indicate that you have defined too many page cleaners, or that the CHNGPGS_THRESH database configuration parameter is too high. If this ratio is high, it may indicate that you have too few page cleaners defined. Either way, overall performance is impacted. Bufferpool size is also a factor.
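A sketch of the TCR calculation, expressing threshold-triggered cleans as a percentage of all page-cleaner triggers recorded in the SNAPDB history (the three trigger counters are standard SNAPDB elements; the +1 guard and DECIMAL cast are illustrative):

```sql
-- Threshold cleans as a share of all cleaner triggers
-- (steal/victim + threshold + LSN gap)
CREATE VIEW DBA.V_THRESHOLD_CLEANS_RATIO AS
SELECT SNAPSHOT_TIMESTAMP,
       DECIMAL(POOL_DRTY_PG_THRSH_CLNS * 100.0
               / (POOL_DRTY_PG_STEAL_CLNS + POOL_DRTY_PG_THRSH_CLNS
                  + POOL_LSN_GAP_CLNS + 1), 5, 2)
         AS THRESHOLD_CLEANS_RATIO
FROM   DBA.SNAPDB_T;
```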

select THRESHOLD_CLEANS_RATIO from DBA.V_THRESHOLD_CLEANS_RATIO
order by snapshot_timestamp desc fetch first 1 row only

Victim Cleans Ratio Percentage (VCR)

[DOWNLOAD VIEW DDL] Description: Expressed as a percentage of total page cleans, the number of times a page cleaner was invoked because a synchronous write was needed during victim buffer replacement for the database. From a Quest document: “A victim page is a clean or dirty page in the buffer pool that is removed simply because DB2 needs to make room for incoming pages. If a victim page is a dirty page then the information must be written out to disk.”

You don’t want this over 40%, but you also don’t want it too low. If this ratio is low, it may indicate that you have defined too many page cleaners. If your chngpgs_thresh is set too low, you may be writing out pages that you will dirty later; aggressive cleaning defeats one purpose of the buffer pool, which is to defer writing to the last possible moment. If this ratio is high, it may indicate that you have too few page cleaners defined, or that your bufferpool is too small. Too few page cleaners will also increase recovery time after failures.

Usage example:

select VICTIM_CLEANS_RATIO from DBA.V_VICTIM_CLEANS_RATIO order by snapshot_timestamp desc fetch first 1 row only

Select Statements per Transaction (SELECT_STMTS_TX)

select SELECT_STMTS_TX from DBA.V_SELECT_STMTS_TX order by snapshot_timestamp desc fetch first 1 row only

Transactions Per Minute (TPM)

[DOWNLOAD VIEW DDL] Description: Transactions per minute. Your indication of whether or not this is a high-transaction-volume database.

Usage example:

select UOW_PER_MINUTE from DBA.V_UOW_PERMINUTE order by snapshot_timestamp desc fetch first 1 row only

Locking Metrics

Lock Escalations since last snapshot

[DOWNLOAD VIEW DDL] Description: Number of lock escalations since the last snapshot (last 5 minutes). I was dealing with a database with very rare lock escalations, so I decided not to do a per-minute calculation but simply to show the number of lock escalations since the last snapshot.

Usage example:

select LOCKESCALS from DBA.V_LOCK_ESCAL order by snapshot_timestamp desc fetch first 1 row only

Exclusive (X) Lock Escalations Since Last Snapshot

[DOWNLOAD VIEW DDL] Description: An exclusive (X) lock escalation can cause locking contention, so seeing how frequently this happens helps us tune both the application and the table locking parameters.

Usage example:

select X_LOCKESCALS from DBA.V_XLOCK_ESCAL order by snapshot_timestamp desc fetch first 1 row only

Lock Timeouts since last snapshot

[DOWNLOAD VIEW DDL] Description: Number of lock timeouts (SQLCODE -911, return code 68) since the last snapshot (last 5 minutes). As with lock escalations, I was dealing with a database with very rare lock timeouts, so I decided not to do a per-minute calculation but simply to show any that had occurred; any non-zero value was of concern.

Usage example:

select LOCK_TIMEOUTS from DBA.V_LOCK_TIMEOUTS order by snapshot_timestamp desc fetch first 1 row only

Logging Metrics

Log Cleans Ratio Percentage (LCR)

When the amount of log space between the log record that updated the oldest page in the buffer pool and the current log position exceeds what the softmax parameter allows, the database is said to be in an ‘LSN gap’ situation. Under the default method of page cleaning, when the logger detects that an LSN gap has occurred, it triggers the page cleaners to write out all the pages contributing to the LSN gap situation; that is, it writes out those pages which are older than what the softmax parameter allows. Page cleaners will be idle for some period of time while no LSN gap is occurring. Then, once an LSN gap occurs, the page cleaners are activated to write a large number of pages before going back to sleep. This can result in saturation of the I/O subsystem, which then affects other agents that are reading or writing pages.

When the LSN gap cleaner triggers frequently (LCR approaches 100%), check for performance issues on the database. If you have configured very large logs, softmax may need to be lowered successively until the LSN gap cleaner triggers more frequently. If there is a strong correlation, it might be necessary to use the newer DB2 proactive page-cleaning feature, the DB2_USE_ALTERNATE_PAGE_CLEANING registry setting.

select LOG_CLEANS_RATIO from DBA.V_LOG_CLEANS_RATIO order by snapshot_timestamp desc fetch first 1 row only

Log Write IO per Minute

[DOWNLOAD VIEW DDL] Description: The number of I/O requests issued by the logger for writing log data to disk, per minute. This is an indicator of logging activity, and can be used in conjunction with other graphs to determine whether logging is responsible for application performance issues. It can also point to heavy update periods.

Usage example:

select NUM_LOG_WRITE_IO_PER_MINUTE from DBA.V_LOG_WRITE_IO_PERMINUTE order by snapshot_timestamp desc fetch first 1 row only

Total Log Used (MB)

[DOWNLOAD VIEW DDL] Description: Total log space used, in megabytes. This can reveal problematic or dangerously large units of work, and can also alert you that you are about to exhaust log space. It also lets you better visualize the growth and contraction of log usage over a period of time.
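A sketch of this view's likely shape: SNAPDB reports TOTAL_LOG_USED in bytes, so the conversion to megabytes is a simple division (view and table names follow the article's conventions):

```sql
-- Convert the byte counter to megabytes for readable charting
CREATE VIEW DBA.V_TOTAL_LOG_USED_MB AS
SELECT SNAPSHOT_TIMESTAMP,
       TOTAL_LOG_USED / (1024 * 1024) AS TOTAL_LOG_USED_MB
FROM   DBA.SNAPDB_T;
```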

Usage example:

select TOTAL_LOG_USED_MB from DBA.V_TOTAL_LOG_USED_MB order by snapshot_timestamp desc fetch first 1 row only

Miscellaneous 2-Column Metrics

SQL Execution Time Seconds Per Minute

select ELAPSED_EXEC_TIME_S_PER_MINUTE from DBA.V_SQL_EXEC_TIME_PERMINUTE
order by snapshot_timestamp desc fetch first 1 row only

Rows Read Per Minute

[DOWNLOAD VIEW DDL] Description: Rows read from all tables, per minute. Not the same as rows selected. For example, a poorly performing SQL statement might have to read 10,000 rows of a table in order to return only one row in a SELECT statement.

Usage example:

select ROWS_READ_PER_MINUTE from DBA.v_rows_read_perminute order by snapshot_timestamp desc fetch first 1 row only

3+ Column Performance Views

RR/RS Ratio (RRRSR)

[DOWNLOAD VIEW DDL] Description: Rows Read / Rows Selected Ratio (RRRSR) by agent ID. RRRSR indicates how many table rows were read in total, as a ratio to the number of rows returned by all SELECT statements executed by that agent during its lifetime. This can be an indicator of poor SQL performance.

RR/RS Ratio (RRRSR) Aggregates

[DOWNLOAD VIEW DDL] Description: Maximum and average values of the Rows Read to Rows Selected Ratio (RRRSR) for all agents sampled at this interval. RRRSR indicates how many table rows were read in total, as a ratio to the number of rows returned by all SELECT statements executed by that agent during its lifetime. This metric is an aggregate over all agents alive at that snapshot. The maximum is a good indicator of performance spikes.

Statements with High RR/RS Ratio (RRRSR)

[DOWNLOAD VIEW DDL] Description: All captured statements whose applid/timestamp matches at least 80% of the maximum RR/RS ratio for that snapshot interval. The view should show you some of the worst-performing SQL captured at your snapshot interval. Because snapshot microseconds might not match between administrative views SNAPSTMT and SNAPAPPL, the SQL joins the two capture tables with a 2-second variance, which can be adjusted for your own implementation. The 80%-of-max value can also be easily modified for your use.

Prune snapshot historical tables

I recommend you create the performance tables in their own DMS tablespace, so that growth in these tables does not endanger space requirements for production data. In any case, you will need to prune data, probably on a weekly basis. The following script can be used as a template for your own prune process. Remember also to REORG and RUNSTATS the tables periodically.

Example crontab entry for database dbname, purging data more than 90 days old, run on Monday morning after midnight:
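A minimal sketch of such a prune template; the script path, table list, and schedule are assumptions to adapt:

```sh
#!/usr/bin/ksh
# prune_snapshots.ksh -- delete history rows older than a retention window.
#
# Sample crontab entry (Monday 00:30, keep 90 days):
# 30 0 * * 1 /home/db2inst1/scripts/prune_snapshots.ksh MYDB 90

DBNAME=$1
KEEP_DAYS=${2:-90}   # default retention: 90 days

db2 connect to ${DBNAME} >/dev/null || exit 1

# One DELETE per history table; adjust the list to the tables you created
for TABLE in SNAPDB_T SNAPAPPL_T SNAPSTMT_T SNAPBP_T
do
    db2 "DELETE FROM DBA.${TABLE}
         WHERE SNAPSHOT_TIMESTAMP < CURRENT TIMESTAMP - ${KEEP_DAYS} DAYS"
done

db2 connect reset >/dev/null
```

Run REORG and RUNSTATS against the history tables after large deletes, as noted above.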

DB2 LUW Performance: Table Read I/O and Overflows
“TBRRTX should be computed for each table by dividing the number of Rows Read by the number of database Transactions . . . TBROVP – The percentage of Overflows over Rows Read = Overflows * 100 / Rows Read (add 1 to the denominator if you want to avoid division by zero).”

Footnotes

You will notice that my CREATE INDEX DDL omits DB_NAME, because I developed the DDL for a single-database instance. You will want to modify the DDL for a multiple-database instance (if you are a DPF shop, you will also need to make a similar change for column DBPARTITIONNUM). I chose to use table schema ‘DBA’ for the performance tables; you can use whatever name you want, but edit the performance views to reflect your changes.

Thank you to Frederik Engelen for help simplifying the process of creating administrative view history.

I can also help you to adapt these database monitoring tools to your environment; see my consulting website for contact information.

Let’s say you have an index on columns A and B. If you have an SQL query that says WHERE B = ‘some value’ but does not include search criteria for column A, then it is highly likely that DB2 will asynchronously prefetch the leaf pages of the index to scan for occurrences of B = ‘some value’. If there are thousands of leaf pages in a 3- or 4-level index, this can be a very CPU-expensive, and possibly I/O-expensive, operation.

To get an idea of your ‘best case goal’ for BPLRTX, we need to look at a few data points:

The average number of SELECTS per Transaction (SELTX)

The average number of Insert, Update, and Deletes per Transaction (DMLTX)

The average number of Statements per Transaction (STMTTX) = SELTX + DMLTX

The average number of Index NLEVELS in the database: Select avg(NLEVELS) from SYSCAT.INDEXES

Add 1 to the average NLEVELS giving AVGNLEVEL

Finally, your ‘best case goal’ for BPLRTX is computed by the formula:

BPLRTX GOAL = STMTTX x AVGNLEVEL

This goal provides the minimum possible, best case, value for BPLRTX by multiplying the average number of statements per transaction by the (rounded up) average Index NLEVELS plus 1 to cover a logical read to the data page. By way of example, if an average transaction does 8 SELECT statements and 2 DML statements, and if the average NLEVEL is 3, then best case BPLRTX GOAL = (8+2) x (3+1) = 40 Bufferpool Logical Reads per Transaction.

Now, let’s be honest with each other. The world isn’t perfect, the application’s SQL is probably far from perfect, and the database’s indexes aren’t perfect either. So a realistic goal would add a 50% slop factor to the ideal goal, which would give a BPLRTX goal of 60 in the previous example.

