Thursday, June 30, 2005

Instrumentation

Talking about instrumentation of code and introducing two more blogs I've been reading on a regular basis. One of them describes a really nice utopia today.

Code instrumentation, what is it? To me, it is the fine art of making approximately every other line of your developed code “debug” of some sort. Trace information. With timestamps. And meaningful information.
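As an illustration only (the blog is about Oracle and PL/SQL, but the idea is language-neutral), here is a minimal trace helper in Python with the two properties named above, timestamps and meaningful, module-tagged information. Every name here is hypothetical:

```python
from datetime import datetime, timezone

def trace(module: str, msg: str) -> str:
    """Emit one timestamped, module-tagged trace line and return it."""
    line = f"{datetime.now(timezone.utc).isoformat()} [{module}] {msg}"
    print(line)
    return line

# roughly "every other line" of real code gets a call like this:
trace("order_entry", "begin validate_order id=42")
```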

Code instrumentation, why is it? Because poor performance is not always a “database problem”. It is many times an application issue and when the application is spread over 14 tiers of complexity, tracking down the bottleneck is grievously hard. If you just whip together an application and throw it out there without any thought to monitoring it over time, be prepared to have poor performance and no clue as to why or where.

Commonly requested information – Tom, can you tell me what my average transaction response time is? Answer: nope, no clue, not a single clue. The only thing I can tell you from the database is on average how long individual bits of SQL might have taken. I don’t know what a transaction is to you and besides, I would tell you only about the database component – no network, no application time, just the database.

And you know what, the end users are the ones that care and they need to have all of the time accounted for. You know what the only thing is that can really give you good transaction response times over time? The application. Why? Because it is the thing that knows what a transaction is, what a transaction does.

I’ve met many a developer that refuses to put this into their code. “It’ll make it run slower”, “This is extra stuff I don’t need”, “It takes me longer to write”. I have yet in 18 years to hear a valid reason why instrumentation should not be done. I have only heard extremely compelling reasons why it must be done.

End users want to know: is the system going slower over time, and if so, by how much? Management wants to know: what are my transaction response times, how many transactions do we do, when is the busiest time, and so on. People responsible for administering the system need to be able to identify where bottlenecks are, who needs to be brought in to look at something, who is responsible.

Without code instrumentation, you cannot answer any of those questions – not a single one. Not accurately anyway. (Well, maybe you can if you live in utopia!)

To the developers that say "this is extra code that will just make my code run slower" I respond "well fine, we will take away V$ views, there will be no SQL_TRACE, no 10046 level 12 traces, in fact – that entire events subsystem in Oracle, it is gone". Would Oracle run faster without this stuff? Undoubtedly – not. It would run many times slower, perhaps hundreds of times slower. Why? Because you would have no clue where to look to find performance related issues. You would have nothing to go on. Without this "overhead" (air quotes intentionally used to denote sarcasm there), Oracle would not have a chance of performing as well as it does. Because you would not have a chance to make it perform well. Because you would not know even where to begin.

So, a plea to all developers: get on the instrumentation bandwagon. You'll find your code easier to debug (note how Oracle doesn't fly a developer to your site to debug the kernel; there is enough instrumentation to do it remotely). You'll find your code easier to tune. You'll find your code easier to maintain over time. Also, make this instrumentation part of the production code, don't leave it out! Why? Because, funny thing about production – you are not allowed to drop in "debug" code at the drop of a hat, but you are allowed to update a row in a configuration table, or in a configuration file! Your trace code, like Oracle's, should always be there, just waiting to be enabled.
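A sketch of that always-shipped, waiting-to-be-enabled pattern (Python purely for illustration; a plain dict stands in for the configuration table or file, and every name is hypothetical):

```python
import time

CONFIG = {"trace_level": 0}          # imagine: one row in a config table

def set_trace_level(level: int) -> None:
    # in production this would be an update of the config row -- no code change
    CONFIG["trace_level"] = level

def trace(level: int, msg: str, sink=print) -> bool:
    """A nearly free no-op unless the configured level enables it."""
    if CONFIG["trace_level"] >= level:
        sink(f"{time.time():.3f} {msg}")
        return True
    return False

trace(1, "entering slow_report")     # silent: production default is off
set_trace_level(1)                   # someone updates the config row...
trace(1, "entering slow_report")     # ...and the trace wakes up
```

The trace calls ship with the production code; only the configuration changes.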

42 Comments:

Raj Jamadagni said....

Tom,

I like the instrumentation part, I do it in my code, but I haven't (yet) found an instrumentation technique that would work in an enterprise environment. I give second preference to utl_file based approaches, since for every line that you output, you have to go through an open-write-close cycle. In a high frequency output mode, this could be a problem.

Maybe AQ is an option to capture everything and write it out once on process completion, successful or otherwise.

Are there any other examples or techniques about instrumentation?

I did, however, hack your debug.f (it didn't work cleanly for me in the 10g environment, related to the call stack), referred to Cary's presentation (from Hotsos 2005) about generating a standard trace file, and have something in alpha version that works in my environment.

But I'd like to hear from others who have used this or similar approach towards code instrumentation.

Correct me if I am wrong, but except for the alert.log, I don't think Oracle does open-write-close for trace files. They do open-write-write-...-close at the end of the session (or until asked to stop). This is slightly different behavior from when utl_file is used.

This open/close overhead is what I am talking about in a high-write environment.
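To make the difference concrete, here is a sketch (hypothetical Python, not UTL_FILE) of the open-once, write-many, close-at-end pattern the commenter attributes to Oracle's trace files:

```python
import os, tempfile

class SessionTrace:
    """Open once per session; write many lines; close (and flush) once at the end."""
    def __init__(self, path: str):
        self.fh = open(path, "a", buffering=8192)   # a single open for the session
    def write(self, msg: str) -> None:
        self.fh.write(msg + "\n")                   # no per-line open/close cycle
    def close(self) -> None:
        self.fh.close()

path = os.path.join(tempfile.mkdtemp(), "session.trc")
trc = SessionTrace(path)
for step in range(3):
    trc.write(f"step {step} done")
trc.close()
```

The per-line cost is a buffered write; the expensive open and close happen once per session rather than once per trace line.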

Thanks Tom, this is *such* an important subject. It's incredibly frustrating to regularly face the situation as a DBA where I have to say 'I can tell you what the database is doing, but ...'. Sometimes you can help people, sometimes not and it's soooo frustrating trying to guess what's going on inside a "black box"

I think another crucial part of the equation is the knowledge that such tracing exists. For example, I've come across systems with network issues or application problems where the information is available, but the people I'm depending on don't know how to use it.

I suppose if *everything* was instrumented it would become standard practice to look at the results.

Raj Jamadagni: The instrumentation isn't running all the time (unless you want it to). You only use it to diagnose a problem. With this in mind, I don't think the performance overhead is that much of an issue.

Think of it like level 12 SQL trace: you'd be mad to switch it on all the time, but you wouldn't want to remove the feature in production.

I played around with DBMS_PROFILER and DBMS_TRACE hoping to find a way to replace my instrumentation. Great tools, but not a replacement for instrumentation.

Tom: Thanks for the link. Got to think of something impressive to write. The pressure is on.

I always log in tables, using an autonomous transaction to get the message logged even after rollbacks.

Why use files? It's so convenient to use SQL to investigate: ... where msg like '%ORA-%', where msg like 'start%my_procedure%', up to group bys and analytics to find elapsed times between the start and end of procedures, etc.

I know, but I also know the rationale behind Tom's aversion to atrans, namely that inexperienced people often abuse them. "No feature is totally evil", and in the case of logging, I've found no problems using them (perhaps performance, undo & redo, but it is just a matter of measuring whether it is acceptable or not - a function of how much info you log).
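The effect being defended here can be demonstrated with two connections (sqlite3 standing in for an Oracle autonomous transaction; all table names are made up). Because sqlite allows only one writer at a time, the log row is committed before the business work starts, but the point carries over: a log committed on its own transaction survives a rollback of the main one, whereas a plain insert inside the business transaction lives and dies with it:

```python
import os, sqlite3, tempfile

db = os.path.join(tempfile.mkdtemp(), "t.db")
main = sqlite3.connect(db)               # the business transaction
log  = sqlite3.connect(db)               # a second connection ~ autonomous txn
main.execute("create table orders (id int)")
main.execute("create table app_log (msg text)")
main.commit()

log.execute("insert into app_log values ('attempting order 1')")
log.commit()                             # committed independently of 'main'

main.execute("insert into orders values (1)")   # business work...
main.rollback()                                  # ...rolled back

orders = main.execute("select count(*) from orders").fetchone()[0]
logged = main.execute("select count(*) from app_log").fetchone()[0]
# orders is 0, logged is 1: the trace row survived the rollback
```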

Personally, for tracing - I typically prefer files (and will load them if I need to SQL them).

For things like average transaction response times (auditing stuff) -- I'm all about tables (but no need for an autonomous transaction on that, just make a commit be:

insert into ... (the info);commit;

I know this is not asktom, but I think I am confused. Won't the commit also commit the parent transaction? Which is why I thought autonomous transactions were used in logging? I agree with files being better - especially since you can do a real time "tail" to see additional output instead of repeating a select.

Safe to assume you'll write a whole chapter about this in volume 2 (if not volume 1) of your new book(s)? I started thinking seriously about this issue when I read your sections in Effective Oracle By Design.

Perhaps some additional examples of how to instrument, showing many different methods (app_info, atrans, no atrans, files, some other way, etc).

I think most people instrument their code when they are debugging, rather than doing it as part of coding itself. I used to do that myself. Now I am a big fan of instrumentation. Write trace messages while coding, and when something is not working, just go through the trace logs and find out what could have gone wrong. You feel good when you guess something based on the messages, go back to the code, and find that your guess was correct :-) And if the instrumentation is proper, sometimes you don't have to guess, you know what is wrong. Tom, maybe you should make instrumentation part of your standard slides, like bind variables and question authority..

and hit enter (there are other parameters as well to turn on sql_trace for each page)

b) database table as a configuration tool. Update a row in a table and all of a sudden trace is on for a module/user/session/whatever. The DEBUG.F package I use a lot has this.

c) configuration file, much like a database table but outside the database. I used this with owarepl -- something that was downloaded and used by a lot of people -- and something where I had to telnet into a system exactly once to figure out what was going wrong (and that was a compilation issue!)

And remember, some instrumentation might always be "on" (like the v$ tables are always "on"). Every page on my site generates at least one row - so I know what you did, and how long it took to do it. These are my transaction metrics.

Kalita said...

Maybe you should make instrumentation part of your standard slides, like bind variables and question authority..

It is part of it :) In the "tools I use" presentation I have I close up with this and go through the various techniques.

In your book "Expert", in the latter part, you talked about "creating a report that you can show to upper management". Can you direct me to, or show me, how to produce such a report? Do you have any report like that?

Management might want to know uptime of system, average transaction response times by business transaction and how that is trending over the last week, month, six months, year. Number of transactions processed, how many failed.

What would be useful to YOUR management? On asktom, I don't care how many physical IOs I did (well, "I" do, but "I the manager" do not). I care how many page views, avg/min/max response times, number of unique visitors, return visitors, month over month comparisons, day over day (by week - how is this Monday compared to last Monday and the Monday before, and so on). When do people come most, and least?

That is a short list of what is important to *me*. My manager, however, just cares "how many people come, wow that's a lot".
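Metrics like those fall straight out of an audit table with one group-by. A sketch (sqlite3 for illustration; the page_audit table and its columns are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table page_audit (page text, elapsed_ms real)")
db.executemany("insert into page_audit values (?, ?)",
               [("home", 120.0), ("home", 80.0), ("search", 300.0)])

# the kind of summary a manager actually reads: hits and avg/min/max times
report = db.execute("""
    select page, count(*), round(avg(elapsed_ms), 1),
           min(elapsed_ms), max(elapsed_ms)
    from page_audit group by page order by page
""").fetchall()
# report -> [('home', 2, 100.0, 80.0, 120.0), ('search', 1, 300.0, 300.0, 300.0)]
```

One row per page view is cheap to write, and the reporting cost is paid only when someone asks the question.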

One other comment on instrumentation from a tester. If you properly instrument code it can assist functional testers in researching potential issues and help lead them towards a more thorough understanding of the system. The understanding usually seems to lead to less frustration in communication between testers and developers.

Concretely, I've seen instrumentation lead to fewer "by design" and "duplicate" bug reports. Once the tester understands the instrumentation and how to read it, they can isolate the real issue and also attach the logs to the bug for all to review.

Hate to be the one to ask the newbie-sounding question, but could you give me a short definition of what instrumentation is? It's easy to get an idea from your post but a definition will be useful (for me, at least).

Thank you very much, Tom. I'd come to this entry via a link on another site (Confessions of an IT Hitman). I've only just noticed that it's really old! I'd assumed it was from this past week or something. Thanks for taking the time to reply tho.

I'll try using instrumentation on any new code I write and I might write it in whenever I'm debugging existing stuff. I guess it comes down to that eternal developer battle between "quick" and "complete": your common sense tells you one thing, but your tired mind and fingers tell you something else!

"With timestamps" - this is what I like the most in your posting. I believe timestamping your log is very important. But unfortunately, some people feel it is nasty to have a lot of files in their log folder. It's not easy to search for the latest log.

Tom and others, have you come across such comments? If so, can you share your experience, or some real-life mess-up that was caused because of this?

I completely agree with you. I have had to debug, test and justify the execution time of my programs so many times that I created my own instrumentation. I just released it on SourceForge: http://dbuglibraryora.sourceforge.net/ I tried to balance information, speed, and ease of use.

About Me

The views expressed are my own and not necessarily those of Oracle and its affiliates. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.
I've been using Oracle since 1988. I've been working at Oracle since 1993 (version 7.0). I spend way too much time working on asktom.oracle.com...