<h2>Mainframe Performance Topics with Martin Packer</h2>
<p>I'm a well-known mainframe performance guy, with almost 30 years of experience helping customers manage systems. I also dabble in lots of other technology. I've sought to widen the Performance role, incorporating aspects of infrastructural architecture.</p>
<h2>WLM Response Time Distribution Reporting With RMF</h2>
<p>By MartinPacker - 2013-03-20</p>
<p>If you’re running a workload with WLM Percentile Response Time goals
take a look at the RMF Service Class Period Response Time Goal Attainment instrumentation.
It’s in the Workload Activity Report but this post is about using the raw data to tell the
story better than a single snapshot (or long-term “munging”) can.</p>
<p>(An example of a percentile response time goal is “90% of transactions must end in 0.2 seconds or less”.)</p>
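<p>A percentile goal of that shape is easy to check mechanically. Here is a minimal Python sketch, for illustration only - the response times are invented sample data, and in practice the attainment numbers come from RMF rather than from timing transactions yourself:</p>

```python
# Illustrative check of a percentile response time goal such as
# "90% of transactions must end in 0.2 seconds or less".
# The sample response times below are made up, not from SMF.

def goal_met(response_times, percentile=90, goal_seconds=0.2):
    """True if at least `percentile`% of transactions ended
    within `goal_seconds`."""
    within = sum(1 for t in response_times if t <= goal_seconds)
    return 100.0 * within / len(response_times) >= percentile

sample = [0.05, 0.10, 0.15, 0.18, 0.19, 0.21, 0.12, 0.08, 0.25, 0.11]
print(goal_met(sample))  # 8 of 10 within 0.2s -> 80% < 90% -> False
```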
<p>The raw data is in the SMF 72 Subtype 3 Response Time Distribution Data Section.
For each Service Class Period an array of values is given:
Each value represents a count of the number of transactions that ended within the response time constraints
of that bucket.
Here are some examples:</p>
<ul>
<li>Bucket 0 contains those transactions whose response time was less than 50% of the goal.</li>
<li>Bucket 1 contains those that ended in more than 50% but less than 60% of the goal.</li>
<li>The last but one bucket contains those that ended with a response time between 200% of the goal and 400% of it.</li>
<li>the last bucket contains those that had a response time more than 400% of the goal.</li>
</ul>
<p>I’ve omitted the middle buckets for brevity but note there’s one that’s up to 100% of the goal response time - a handy characteristic.</p>
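<p>As an illustration, the bucket scheme might be modelled like this. The full boundary list below is an assumption pieced together from the buckets mentioned above - check the SMF 72.3 record mapping for the definitive layout:</p>

```python
# A sketch of the SMF 72.3 response time distribution buckets as
# described above. The exact 14-entry boundary list is an assumption;
# the definitive layout is in the RMF record mapping.

# Upper boundary of each bucket, as a percentage of the goal.
# float("inf") stands for the "more than 400% of goal" bucket.
BUCKET_UPPER_PCT = [50, 60, 70, 80, 90, 100, 110, 120,
                    130, 140, 150, 200, 400, float("inf")]

def bucket_for(response_time, goal):
    """Index of the bucket a transaction's response time falls into."""
    pct = 100.0 * response_time / goal
    for i, upper in enumerate(BUCKET_UPPER_PCT):
        if pct <= upper:
            return i

print(bucket_for(0.1, 0.2))  # 50% of goal -> bucket 0
print(bucket_for(0.9, 0.2))  # 450% of goal -> last bucket (13)
```

Note the 100% boundary sits at index 5, which is what makes the "within goal" versus "not within goal" split described later straightforward.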
<p>This “response time bucket” data is clearly a lot more use than just knowing the average response time achieved (or even the standard deviation).</p>
<p>My first implementation stacked up the buckets as percentages, and here’s an example:</p>
<img alt="" src="https://dw1.s81c.com/developerworks/mydeveloperworks/blogs/MartinPacker/resource/oldest_dist.png" />
<p>Isn’t it “busy”? <img src="https://www.ibm.com/developerworks/community/blogs/images/smileys/smile.gif" class="smiley" alt=":-)" title=":-)" />
And what was the goal?
And the legend is pretty cruddy, too.
(This is explained by the reporting tool (SLR) insisting on using table column names as series names.)</p>
<p>Because I couldn’t see the wood for the trees I refurbished this graph a couple of years ago:</p>
<ul>
<li>The graph title states the goal.</li>
<li>I only show the “within goal” and “not within goal” percentages.
(Obviously I do this by summing up the appropriate buckets - and that’s where the “100% of goal” bucket boundary is needed.)</li>
<li>When the goal is invariant I draw a datum line at the % number in the goal.</li>
<li>I stopped letting SLR drive GDDM to create the graph and used the REXX GDDM interface to draw the graph instead.
This meant I could label the series whatever I wanted, including using spaces.
(This is considerably more fiddly programming - but I use the code on a frequent basis so that’s tolerable.)</li>
</ul>
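<p>That “within goal” summation can be sketched like so. The bucket counts here are invented sample data, and the assumption that the first six buckets lie at or below 100% of the goal follows the boundary scheme described earlier - in practice the counts come from the SMF 72.3 Response Time Distribution Data Section:</p>

```python
# Sketch of the "within goal" percentage calculation: sum the bucket
# counts up to the bucket whose upper boundary is 100% of the goal,
# then divide by the total. Bucket counts are invented sample data.

def within_goal_pct(bucket_counts, buckets_within_goal=6):
    """Percentage of transactions ending within the goal.

    `buckets_within_goal` is how many leading buckets lie at or below
    100% of the goal (6, assuming boundaries 50..100% in 10% steps).
    """
    total = sum(bucket_counts)
    if total == 0:
        return 100.0  # no transactions: trivially within goal
    return 100.0 * sum(bucket_counts[:buckets_within_goal]) / total

counts = [400, 120, 90, 80, 70, 60, 50, 40, 30, 20, 15, 10, 10, 5]
print(within_goal_pct(counts))  # 820 of 1000 -> 82.0
```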
<p>The result looks like:
<img alt="" src="https://dw1.s81c.com/developerworks/mydeveloperworks/blogs/MartinPacker/resource/current_dist.png" /></p>
<p>(This is actually from a customer <a href="https://www.ibm.com/developerworks/mydeveloperworks/blogs/MartinPacker/entry/testing_is_trying?lang=en">performance test</a> so don’t be put off by the repetitive hour labels on the x axis. One day I’ll get round to tidying up fractional hour labels - when I get sufficiently disgusted.) <img src="https://www.ibm.com/developerworks/community/blogs/images/smileys/smile.gif" class="smiley" alt=":-)" title=":-)" /></p>
<p>This is much cleaner than the old version:</p>
<ul>
<li>For most of the time more than 90% of the transactions ended within the goal (0.5 seconds) - so the goal was met, sometimes comfortably.</li>
<li>There were times when the goal was only just met.</li>
<li>There was a protracted period when fewer than 90% of transactions ended quickly enough.</li>
</ul>
<p>So, this has served me well for a while.</p>
<h3 id="thoughtsforthefuture">Thoughts For The Future</h3>
<p>I think I might’ve gone too far in the direction of simplification with this:
I’d like to add the “just made it” and “almost made it” buckets back in.
(Whether I use shading or different colours for these is still up for debate.)
The buckets I’m tempted to break out are 90% to 100% and 100% to 110%.
The data I see, though, might drive me to use 80% to 100% and 100% to 120%.
We’ll see.</p>
<p>I also can’t see how goal attainment relates to transaction rate:</p>
<ul>
<li>You might expect there to be a positive correlation though you’d hope for a neutral one.</li>
<li>No correlation would mean something external was going on.</li>
<li>Missing the goal for all transaction rates - “unsafe at any speed” <img src="https://www.ibm.com/developerworks/community/blogs/images/smileys/smile.gif" class="smiley" alt=":-)" title=":-)" /> - is also significant:
Either the goal is unrealistic or something that WLM can’t affect dominates response times.</li>
</ul>
<p>So, adding a second y axis and plotting transaction rate against it would tell that part of the story.</p>
<p>I’d like to understand how the percentage of transactions ending in Period 1, Period 2, etc. varies:
Just today I had a situation where - over a weekend - the percentage of transactions ending in Period 1 dropped, as transactions got suddenly more CPU-heavy.</p>
<p>At present the code treats each Service Class Period independently, though it does print shift-average transaction rates ending in each period, along with the average CPU (not per transaction but totalled).</p>
<p>One thing I consider a very long stretch would be to make this a 3D chart - with the bucket boundaries considered to be “contour lines”.
That would be very pretty <img src="https://www.ibm.com/developerworks/community/blogs/images/smileys/smile.gif" class="smiley" alt=":-)" title=":-)" /> but hard to draw and even harder to explain:
While I <strong>love</strong> pretty charts I actually want them to tell the story as clearly as possible.</p>
<h3 id="conclusions">Conclusions</h3>
<p>I hope you’ll agree there’s lots you can usefully do with Response Time Distribution statistics.
Most particularly if you have significant workloads with percentile goals - which would be almost 100% true for DDF, and true of quite a few CICS workloads.</p>
<p>I also hope you’ve found the evolution of a chart interesting:
It’s been occasioned by lots of customer interactions over a number of years.
I can’t say either of the two charts I’ve shown actually <strong>caused</strong> an evolution, but I think they’re interesting examples.</p>
<p>We’ll see if I actually get to make the changes I’m contemplating:
My hunch is I will - but I wouldn’t expect me to supply 3D glasses any time soon. <img src="https://www.ibm.com/developerworks/community/blogs/images/smileys/smile.gif" class="smiley" alt=":-)" title=":-)" /></p>
<h2>What's In A Name?</h2>
<p>By MartinPacker - 2011-12-16</p>
<div>This is the post I was going to write before the discussion that led to <a href="https://www.ibm.com/developerworks/mydeveloperworks/blogs/MartinPacker/entry/cics_vsam_buffering14?lang=en_us">CICS VSAM Buffering</a> arose. It's about getting more insight into how WLM is set up and performing than RMF Workload Activity Report data alone allows.</div><div> </div><div>I recognise some of this can be done with the WLM policy in hand. But this is about an SMF-based approach. (The piece you can't do with SMF is discerning the WLM classification rules.) And the policy can't answer questions about how systems actually behave.<br /></div><div> </div><div>There are two distinct problems I've worked on solving (relatively) recently. I share the outline of my solution to each of these with you here.<br /> <br /><ul><li>In RMF you can't tell how Report Classes and Service Classes relate to each other: In some cases Report Classes break down Service Class data - often to the address space level. In some cases Report Classes coalesce information from multiple Service Classes. But you can't see this linkage in RMF.<br /> <br /></li><li>In RMF you can't necessarily tell what runs in each Service Class. I say &quot;necessarily&quot; because you <b>can</b> tell some things about the nature of the work in a Service Class.</li></ul> </div><div>The &quot;What's In A Name?&quot; in the title refers to the fact a Workload, Service Class or Report Class <b>name</b> is just a string of characters: Rhetorically it might be a &quot;promise&quot; but it's not a mechanistic guarantee. 
So - to me at least - it's worth knowing rather more.<br /></div><div> </div><div><u><b>Report And Service Class Relationships</b></u><br /> <br />SMF 72 Subtype 3 RMF Workload Activity Report data describes how Service Class Periods and Report Classes perform.<br /> <br />Type 30 Interval records (Subtypes 2 and 3) describe how <b>address spaces</b> perform. (Actually so do Subtypes 4 and 5, which are step-end and job-end records.) These records contain, amongst other things, WLM Workload, Service Class and Report Class names - for the address space. You can therefore use Type 30 to relate Workload and Service Class to Report Class. My code's done this for some time.<br /></div><div> </div><div>Type 30 does not apply to Service Classes that don't own address spaces. Two examples of this are DDF Transaction Service Classes and CICS Transaction Service Classes.<br /></div><div> </div><div>A related topic is which Service Classes are serving other Service Classes. For example, CICS Region Service Classes and transaction Service Classes. Now this you <b>can</b> readily discern from SMF 72 alone. (And of course my code does that.)<br /></div><div> </div><div><b><u>What Work In A Service Class Is</u></b><br /></div><div> <br /></div><div>(This piece relates equally to Report Classes.)<br /></div><div> </div><div>As I said, you can't tell much about what a WLM Service Class covers from Type 72. So, as well as the correlation described above, my code uses Type 30 to flesh out what a Service Class is for. The key to this is the Program Name. For example, CICS regions have PGM=DFHSIP. So a Service Class with just PGM=DFHSIP address spaces is just a CICS Region Service Class. Simple enough. 
Some are more complicated than others - perhaps necessitating the 16-character program name field which, for Unix, includes the last portion of the Unix program name.</div><div> </div><div>You can play other games, too: The job name for a DB2 address space can be decoded to glean the subsystem it belongs to. Certain System address spaces have mnemonic Procedure names. And so on. </div><div> </div><div>From SMF 72 you can obtain the number of address spaces for a Service Class - 0 suggesting the Service Class doesn't own any (see above). 1 suggests this class (possibly a Report Class) is there to provide more granularity. You can also get the number of address spaces In and the number Out-And-Ready. This can help you form a picture of e.g. &quot;low use&quot; address spaces in the Service Class.<br /> <br /> </div><div> </div><div>This post is about sharing some of my experience of trying to extend the value that can be got out of SMF - beyond the obvious. Some of this will probably appear in my <a href="https://www.ibm.com/developerworks/mydeveloperworks/blogs/MartinPacker/entry/i_know_what_you_did_last_summer7">I Know What You Did Last Summer</a> presentation - which I'm <b>still</b> hoping to complete soon. This also, by the way, explains why I'm so keen to get Type 30 data from you when you're sending me RMF data. There really is a huge amount of value to be had.<br /></div>
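<p>As a sketch of the Report Class / Service Class correlation described above - the records here are pre-parsed dicts with invented names standing in for decoded Type 30 fields, and the key names are placeholders, not the actual SMF30 field names:</p>

```python
# Relating Service Classes to Report Classes via SMF Type 30 interval
# records, per the approach described above. Real code would decode
# the binary SMF 30 record; these dicts and names are illustrative.

from collections import defaultdict

type30_records = [
    {"jobname": "CICSPRD1", "service_class": "CICSRGN", "report_class": "RCCICS1"},
    {"jobname": "CICSPRD2", "service_class": "CICSRGN", "report_class": "RCCICS2"},
    {"jobname": "BATCHJOB", "service_class": "BATLOW",  "report_class": "RCBATCH"},
    {"jobname": "BATCHJB2", "service_class": "BATHI",   "report_class": "RCBATCH"},
]

# Report Class -> set of Service Classes it draws from
rc_to_sc = defaultdict(set)
# Service Class -> set of Report Classes breaking it down
sc_to_rc = defaultdict(set)

for rec in type30_records:
    rc_to_sc[rec["report_class"]].add(rec["service_class"])
    sc_to_rc[rec["service_class"]].add(rec["report_class"])

# RCBATCH coalesces two Service Classes; CICSRGN is broken down into
# two Report Classes - the two cases distinguished above.
print(dict(rc_to_sc))
print(dict(sc_to_rc))
```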