Saturday, March 21, 2015

Generating full stack traces for exceptions in Python.

My last few blog posts have been about monkey patching. I will be continuing with more posts on that topic, but I am going to take a slight detour away from that topic in this post to talk about a problem related to capturing stack trace information for exceptions in Python applications.

This does still have some relevance to monkey patching, in as much as one of the reasons you may want to monkey patch a Python application is to add in wrappers which will intercept the the details of an exception raised deep with an application. You might want to do this for the case where the exception would otherwise be captured by an application and translated into a different exception, with loss of information about the inner exception, or in the case of a web application result in a generic HTTP 500 web page response with no useful information captured. So monkey patching can be a useful debugging tool where it may not be convenient to modify the original source code of the application.

In this case we have an exception occurring which was never actually caught within the script itself and is propagated all the way up to the top level, causing the script to be terminated and the traceback printed.

When the traceback was printed, it showed all stack frames from the top level all the way down to the point where the exception occurred.

In this script we have a 'try/except' block half way down the sequence of calls. The call to 'function2()' is made within the 'try' block and when the exception is raised, it is handled within the 'except' block. At that point we use the 'traceback.print_exc()' function to output the details of the exception, but then let the script continue on normally to completion.

What you see here though is that we loose information about the outer stack frames for the sequence of calls that led down to the point where the 'try/except' block existed.

When we want to capture details of an exception for logging purposes so as to later debug an issue, this loss of information can make it harder to debug a problem if the function containing the 'try/except' block could be called from multiple places.

How then can we capture the outer stack frames so we have that additional context?

Capturing the current stack

There are a number of ways of obtaining information about the current stack. If we are just wanting to dump out the current stack to a log then we can use 'traceback.print_stack()'.

So we now have the inner stack frames corresponding to the exception traceback, as well as those outer stack frames corresponding to the current stack. From this we can presumably now join these two sets of stack frames together and get a complete stack trace for where the exception occurred.

If you look closely though you may notice something. That is that there is actually an overlap in the stack frames which are shown for each, plus that the function we have called to print the current stack is also shown.

In the case of the overlap the issue is that in the inner stack frames from the traceback, it shows an execution point in 'function3()' of line 11. This corresponds to the point where 'function2()' was called within the 'try' block and in which the exception occurred.

At the same time, the outer stack frames from the current execution stack show line 13 in 'function3()', which is the point within the 'except' block where we called 'traceback.print_stack()' to display the current stack.

So the top most stack frame from the traceback is actually want we want and we need to ignore the bottom most two stack frames from the current stack if we were to join these together.

Now although the output of these two functions can be directed to any file like object and thus an instance of 'StringIO' could be used to capture the output, we would still need to break apart the formatted text output, drop certain parts and rearrange others to get the final desired result.

Dealing with such pre formatted output could therefore be a pain, especially if what we really wanted was the raw information about the filename, line number, function and potentially the code snippet. What other options therefore exist for getting such raw information?

Using the inspect module

When needing to perform introspection or otherwise derive information about Python objects, the module you want to use is the 'inspect' module. For the case of getting information about the current exception and current stack, the two functions you can use are 'inspect.trace()' and 'inspect.stack()'. Using these we can rewrite our Python script as:

So these functions provide us with the raw information rather than pre formatted text, thus making it easier to process. For each stack frame we also get a reference to the frame object itself, but since we didn't care about that we skipped it when displaying each frame.

Because though we might want to generate such a combined stack trace in multiple places we obviously separate this out into a function of its own.

Using the exception traceback

In this case we have relied on functions from the 'inspect' module that rely on being called directly from within the 'except' block.

That is, for generating the outer stack frames for the current stack we always assume that we need to drop two stack frames from the result of calling 'inspect.stack()'.

For the inner stack frames from the exception, the 'inspect.trace()' function relies on there being an exception which is currently being handled.

That we are assuming we should skip two stack frames for the current stack is a little bit fragile. For example, consider the case where we don't actually call 'print_full_stack()' within the 'except' block itself.

As can be seen, we actually end up with an additional stack frame being inserted corresponding to 'function3a()' which we called within the 'except' block and which in turn called 'print_full_stack()'.

To ensure we do the right thing here we need to look at what 'inspect.stack()' and 'inspect.trace()' actually do.

def trace(context=1): """Return a list of records for the stack below the current exception.""" return getinnerframes(sys.exc_info()[2], context)

So the problem we have with the extra stack frame is that 'inspect.stack()' uses 'sys._getframe()' to grab the current stack. This is correct and what it is intended to do, but not really what we want. What we instead want is the outer stack frames corresponding to where the exception was caught.

As it turns out this is available as an attribute on the traceback object for the exception called 'tb_frame'. Learning from how these two functions are implemented, we can therefore change our function to print the full stack.

Using a saved traceback

In making this last modification we actually implemented 'print_full_stack()' to optionally accept an existing traceback. If none was supplied then we would instead use the traceback for the current exception being handled.

It is likely a rare situation where it would be required, but this allows one to pass in a traceback object which had been saved away and retained beyond the life of the 'try/except' block which generated it.

The problem here is that in 'function4()' rather than seeing the line where the call to 'function3()' was made, we see where the call to 'print_full_stack()' is made.

The reason for this is that although the traceback contains a snapshot of information from the current stack at the time of the exception, this only extends back as far as the 'try/except' block.

When we are accessing 'tb.tb_frame' and getting the outer frames, it is still accessing potentially active stack frames for any currently executing code.

So what has happened is that in looking at the stack frame for 'function4()' it is picking up the current execution line number at that point in time, which has shifted from the time when the original exception occurred. That is, control returned back into 'function4()' and execution progressed to the next line and the call 'print_full_stack()'.

Although the ability to print the full stack trace for the exception is useful, it is only reliable if called within the same function that the 'try/except' block existed. If you return the traceback object to an outer function and then try and produce the full stack, line number information in the outer stack frames can be wrong, due to the code execution point within those functions shifting if subsequent code in those functions had been executed since the exception occurred.

If anyone knows a way around this, beyond creating a snapshot of the full stack within the same function as where the 'try/except' occurred, I would be interested to hear how. My current understanding is that there isn't any way and it is just a limitation one has to live with in that case.

7 comments:

I have a library that provides hooks via user-supplied callbacks; and if the callback raises an exception, it gets repackaged in my own exception type with a chain to the original exception. To allow the library to work in both Python 2 and 3 (even with 2to3) I manually construct the chain, such as this in Python 2 syntax:

With chaining of exceptions, because the traceback of initial exceptions would be retained beyond the life of the try/except for those when re-raising a different exception, I would imaging the issue described with line numbers not then being accurate for the outer stack frames would apply.

It seems creating a snapshot of the full stack within the same function as where the 'try/except' occurred is really the only way to provide full context of exception. The question is why Python itself does not do this and attach this snapshot to exception? I guess this might be to avoid cost of taking snapshot. If this is the case then Python should provide hook called every time a catch clause is entered. Related is http://stackoverflow.com/questions/1029318/calling-a-hook-function-every-time-an-exception-is-raised. I encourage you to take this problem to python-ideas to see if there's a chance of improving this aspect of language. Stack trace is crucial information and Python should really provide easy way to get it.

I'm also wondering if (contrary to Alex Martelli's statement in above SO thread) you found a way to monkey patch Python itself, obtaining missing hook I described above as result...

One could possibly argue the case for a sys.setexcept()/sys.getexpect() hooking mechanism in the same vain as those for profile and trace call backs, but would want to see the good use case for it.

A problem noted on SO is how you distinguish an internal implementation detail where the exception is caught and suppressed automatically such as StopIteration, AttributeError on getattr() failure etc etc.

Having a callback could easily lead to making it easy for users to cause problems if used poorly, but then that situation exists with the profile and trace hooks now.