This is the blog of Robert Catterall, an IBM Db2 for z/OS specialist. The opinions expressed herein are the author's, and should not be construed as reflecting official positions of the IBM Corporation.

Friday, September 26, 2014

DB2 for z/OS: Avoiding zIIP Engine Contention Issues

In an entry posted a few months ago to this blog, I touched on several matters pertaining to mainframe zIIP engines and the use of these processors by DB2 for z/OS. Based on recent experience, I feel that it's important to further highlight a particular zIIP-related issue: zIIP engine contention. Through this post, I want to explain what zIIP contention is, why you want to avoid it, how you can spot it, and what you can do about it.

As most mainframers probably know by now, zIIPs (System z Integrated Information Processors) are "specialty engines" aimed at reducing the cost of mainframe computing. They do this in two ways: 1) they cost less than general-purpose engines, and 2) they do not factor into the pricing of mainframe software. Over time, as IBM has made more of the work done on System z servers zIIP-eligible, and as existing zIIP-eligible workloads have grown (particularly those associated with applications that access DB2 for z/OS data through network connections -- commonly referred to as DDF or DRDA workloads), zIIP engine utilization rates have climbed at many sites. That's a good thing, but only up to a point. If zIIP engines become too busy, DB2 performance can be negatively impacted. I'll first expand on that point.

What zIIP contention is, and why you want to avoid it

Consider a z/OS LPAR configured with one or more zIIPs. If a zIIP engine is not available when an item of zIIP-eligible work in the system is ready for dispatch (because the zIIPs are busy at that time with other work), that piece of work will be dispatched to a general-purpose engine. There is a slight delay involved in dispatching zIIP-eligible work to a general-purpose engine. That doesn't matter much if only a small amount of zIIP-eligible work ends up being redirected to general-purpose engines. If, however, the degree of what I call "zIIP spill-over" gets too high, application performance can degrade.

Indicators of performance degradation caused by zIIP contention

zIIPs have been around for quite a few years now (since 2006, I believe). Why is the issue of zIIP contention only now coming to the fore? Two reasons: 1) it took a while for zIIP utilization at many sites to reach a level at which contention can occur, and 2) the type of work that can utilize zIIP resources recently changed in an important way. This latter point refers to the introduction, with DB2 10 for z/OS, of zIIP eligibility for prefetch read and database write I/Os. Here's why that's important: this processing can be a pretty good chunk of work (often accounting for the large majority of CPU time charged to the DB2 database services address space, aka DBM1), and it's a type of system task that is very important for the performance of some DB2 applications. If zIIP contention causes delays in the dispatching of prefetch read tasks, run times for prefetch-intensive workloads (such as batch jobs and data warehousing/business intelligence applications) can become elongated. If that were to happen, you might see the effect as a significant increase in a class 3 wait time (so called because the information comes from records generated when DB2 accounting trace class 3 is active) that is labeled as suspension due to "other read I/O" in an IBM OMEGAMON for DB2 accounting long report (the field might be somewhat differently labeled in accounting reports generated by other vendors' DB2 monitor products).

Prefetch slowdown isn't the only thing to look out for. Lately, I've seen data that suggests a link between higher levels of "zIIP spill-over" and higher-than-desired in-DB2 not-accounted-for time for DB2-accessing applications. Not-accounted-for time is a DB2 class 2 time (it's captured in DB2 accounting trace class 2 records) that is probably calculated for you in an accounting long report generated by your DB2 monitor. If your monitor does not calculate this number for you, you can easily figure it yourself: just take average in-DB2 (i.e., class 2) elapsed time for an application, and subtract from that the average in-DB2 CPU time (keeping in mind that this is two numbers: in-DB2 general-purpose CPU time and in-DB2 specialty engine CPU time). Next, subtract out total class 3 suspend time. What's left is in-DB2 not-accounted-for time. Generally speaking, a not-accounted-for time that is less than 10% of total in-DB2 elapsed time for a higher-priority transactional workload (such as many DDF and CICS workloads) does not concern me (a higher level of not-accounted-for time for a batch workload is usually not such a big deal). In times past, a higher-than-desired in-DB2 not-accounted-for time (i.e., more than 10% of in-DB2 elapsed time for a higher-priority transactional workload) was usually an indication that a z/OS LPAR's general-purpose engines were over-committed (as might be the case if they are routinely running at utilization levels of 95% or more). Nowadays, it appears that higher-than-desired in-DB2 not-accounted-for time might also be caused by elevated levels of zIIP contention.
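For those who would rather script the arithmetic than do it by hand, here is a minimal Python sketch of the not-accounted-for calculation described above. The function and parameter names are my own inventions for illustration (they are not field names from any monitor's report), and the sample values are hypothetical averages in seconds, not figures from a real system.

```python
def not_accounted_for_time(class2_elapsed, class2_gp_cpu, class2_se_cpu, class3_suspend):
    """Return in-DB2 not-accounted-for time, in the same units as the inputs.

    class2_elapsed : average in-DB2 (class 2) elapsed time
    class2_gp_cpu  : average in-DB2 general-purpose CPU time
    class2_se_cpu  : average in-DB2 specialty engine (zIIP) CPU time
    class3_suspend : total class 3 suspend time
    """
    in_db2_cpu = class2_gp_cpu + class2_se_cpu  # in-DB2 CPU time is two numbers
    return class2_elapsed - in_db2_cpu - class3_suspend


def not_accounted_for_share(class2_elapsed, class2_gp_cpu, class2_se_cpu, class3_suspend):
    """Return not-accounted-for time as a fraction of in-DB2 elapsed time."""
    if class2_elapsed <= 0:
        raise ValueError("class 2 elapsed time must be positive")
    naf = not_accounted_for_time(class2_elapsed, class2_gp_cpu, class2_se_cpu, class3_suspend)
    return naf / class2_elapsed


# Hypothetical averages (seconds) for a transactional workload
share = not_accounted_for_share(0.050, 0.012, 0.008, 0.026)
# The 10% figure is the rule of thumb from this post for
# higher-priority transactional work
needs_a_closer_look = share > 0.10
print(f"not-accounted-for share: {share:.1%}, closer look needed: {needs_a_closer_look}")
```

Remember that the 10% comparison applies to higher-priority transactional workloads; a higher share for a batch workload is usually less of a concern.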

An early-warning indicator

Obviously, you'd like to spot and head off a potential zIIP contention-related performance problem. How could you do that? First and foremost, I'd keep an eye on what I call the zIIP spill-over rate. That's the percentage of zIIP-eligible work that ends up running on general-purpose engines (as mentioned previously, this spill-over occurs when zIIP engines are busy when zIIP-eligible work is ready for dispatch). Depending on the DB2 monitor that you use, this can be easily done. In an IBM OMEGAMON for DB2 accounting long report, you'll see these three class 1 CPU times for a workload (I've condensed them here -- they wouldn't appear one on top of the other in an actual report):

The top-most number is the average general-purpose CPU time for the workload. The bottom number is the average zIIP engine CPU time. The middle number is the average amount of general-purpose CPU time consumed in doing zIIP-eligible work. If you label the middle and bottom numbers A and B, respectively, the zIIP spill-over rate is calculated as follows:

A / (A + B)

The figures above show a zIIP spill-over rate of less than 1%. Such a small non-zero value would not cause me concern. What would cause me some concern is a rate that is over 5%. With that much zIIP-eligible work running on general-purpose engines, you could see the beginnings of a negative impact on application performance.
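If you want to track this number over time, the A / (A + B) calculation is easy to script. What follows is a minimal Python sketch, not anything produced by a DB2 monitor: the function name, the constant name, and the sample inputs are hypothetical. The 5% threshold is the level of concern mentioned above.

```python
CONCERN_THRESHOLD = 0.05  # a spill-over rate above 5% would cause some concern


def ziip_spillover_rate(a_eligible_on_gp, b_ziip_cpu):
    """Return the zIIP spill-over rate, A / (A + B).

    a_eligible_on_gp : general-purpose CPU time consumed by zIIP-eligible work (A)
    b_ziip_cpu       : zIIP engine CPU time (B)
    """
    if a_eligible_on_gp < 0 or b_ziip_cpu < 0:
        raise ValueError("CPU times cannot be negative")
    total_eligible = a_eligible_on_gp + b_ziip_cpu
    if total_eligible == 0:
        return 0.0  # no zIIP-eligible work at all, so nothing spilled over
    return a_eligible_on_gp / total_eligible


# Hypothetical class 1 CPU times (seconds) for one workload
rate = ziip_spillover_rate(a_eligible_on_gp=0.004, b_ziip_cpu=0.036)
print(f"zIIP spill-over rate: {rate:.1%}, concern: {rate > CONCERN_THRESHOLD}")
```

Both inputs must be in the same units and cover the same workload and reporting interval for the ratio to mean anything.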

So, at what level of zIIP engine utilization would you see a potentially performance-impacting zIIP spill-over rate? That depends in large part on the number of zIIP engines configured in a z/OS LPAR. As a very rough rule of thumb, I'd say that for an LPAR with two or three zIIP engines, my preference would be to keep the zIIP engine utilization rate below 60%. You could probably go somewhat higher than that for an LPAR with a larger number of zIIP engines. If an LPAR has only a single zIIP engine, I'd want the utilization rate of the zIIP to be 50% or less.
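To make that rule of thumb concrete, here is a small Python sketch that encodes it. Treat it as an illustration of my very rough guideline and nothing more: the function names are hypothetical, and because I gave no specific figure for LPARs with more than three zIIP engines, the code returns None in that case rather than inventing one.

```python
from typing import Optional


def guideline_ceiling(n_ziips: int) -> Optional[float]:
    """Return the rough zIIP utilization ceiling for an LPAR, as a fraction.

    Returns None for more than three zIIPs, where the only guidance is
    "somewhat higher" than 60%, with no specific figure.
    """
    if n_ziips < 1:
        raise ValueError("an LPAR needs at least one zIIP for this to apply")
    if n_ziips == 1:
        return 0.50  # single zIIP: 50% or less
    if n_ziips <= 3:
        return 0.60  # two or three zIIPs: keep utilization below 60%
    return None


def over_guideline(n_ziips: int, utilization: float) -> Optional[bool]:
    """Return True if utilization is past the rough guideline, None if no figure applies."""
    ceiling = guideline_ceiling(n_ziips)
    if ceiling is None:
        return None
    if n_ziips == 1:
        return utilization > ceiling  # exactly 50% is still acceptable
    return utilization >= ceiling  # the preference is strictly below 60%


print(over_guideline(1, 0.55))
print(over_guideline(3, 0.55))
print(over_guideline(8, 0.70))
```

The spill-over rate remains the better early-warning number; a utilization check like this is only a coarse first screen.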

Now, some of you may have heard that zIIP contention can occur at a zIIP utilization rate of about 30%. What that may mean is that zIIP spill-over has been observed on systems with zIIP utilization rates as low as 30% or so. First, I'm thinking that such observations may be associated with systems that have a single zIIP engine. Second, keep in mind that a small but non-zero zIIP spill-over rate may not materially affect application performance (as stated previously, I do not see a zIIP spill-over rate of less than 5% as being a cause for concern).

Doing something about zIIP contention

If you observe a higher-than-desired level of zIIP spill-over in one of your z/OS LPARs, what can you do about it? One obvious response would be to add zIIP capacity to the system (keep in mind that EC12 and BC12 servers can have up to two zIIPs for every general-purpose engine). If that is not feasible in the near term, see if you can take some steps to reduce zIIP utilization. An example of such a step would be a significant enlargement of one or more DB2 buffer pools (assuming that you have sufficient memory to back larger pools). This could reduce zIIP utilization by reducing synchronous reads associated with a DDF workload, and by reducing prefetch read activity (something that became zIIP-relevant starting with DB2 10). Using high-performance DBATs could reduce the overall CPU cost (and thus the zIIP consumption) of DDF-connected applications.

Summing it up

The moral of this story is pretty simple: do NOT think that you can run zIIPs as "hot" as you can general-purpose mainframe engines without encountering application performance issues. In doing capacity planning, anticipate growth in zIIP-eligible work on your z/OS systems (not only because of overall workload growth, but also because new types of work become zIIP-eligible through new releases -- and sometimes new maintenance levels -- of subsystem and operating system software), and aim to keep zIIP capacity ahead of zIIP demand so as to keep the rate of zIIP spill-over low. Calculate the zIIP spill-over rate for your production DB2 for z/OS systems, and track that number -- make it one of your key metrics for monitoring the performance of mainframe DB2 data servers.

Driving up zIIP utilization can be an effective way of enhancing the price performance of a z/OS system -- just don't take it so far as to be counterproductive.