Monday, March 28, 2011

For the last couple of days I've been debugging a fun problem in the Ubuntu tool called Jockey. Jockey is a tool for managing device drivers on Ubuntu. It actually contains both a command-line and a graphical front-end, and a dbus backend service that does all the work (with proper authentication, since it modifies your system). None of that is terribly relevant to the problem, although the dbus bit will come back to haunt us later.

What is important is that Jockey is a Python application, written using many Python modules interfacing to low-level tools such as apt and dbus. The original bug report was mighty confusing. Aside from not being reproducible by myself and others, the actual exception made no fricken sense! Basically, it was code like this that was throwing a TypeError:

_actions = []

# _actions gets appended to at various times and later...

for item in _actions[:]:

# do something

Everyone who reported the problem said the TypeError was getting thrown on the for-statement line. The exception message indicated that Python was getting some object that it was trying to convert to an integer, but was failing. How could you possible get that exception when either making a copy of a list or iterating over that copy? Was the list corrupted? Was it not actually a list but some list-like object that was somehow returning non-integers for its min and max indexes?

To make matters worse, this little code snippet was in Python's standard library, in the subprocess module. A quick search of Python's bug database did reveal some recent threads about changes here, made to ensure that popen objects got properly cleaned up by the garbage collector if they weren't cleaned up explicitly by the program. Note that we're using Python 2.7 here, and after some reading of the tracker issues and the Python subprocess.py code, I just couldn't see any problem here, or at least none that could possibly lead to the error being seen.

When I originally looked at the Launchpad bug report, that was about as far as I got. I couldn't see any possible way this bug could happen, and I couldn't reproduce it, so I set the bug to Incomplete. Unfortunately, it kept hitting Ubuntu Natty beta testers so it wasn't going to go away. Fortunately, Martin Pitt found a test recipe, with which I could reproduce the bug 100% of the time. Yay! At least it probably wasn't going to be a race condition.

How to debug this though? Normally, I'd just attach gdb to the thing and start tracing, but the problem was that when I set the Jockey dbus backend to use the debug version of Python, the error went away (or rather morphed into something that was really unrelated). Here's where I started to sharpen up my favorite yak shaving blade.

The parable about shaving yaks is so entirely appropriate to problems like this that I'll take a quick detour. One of my (and my brother's) favorite shows in the 90's was Ren and Stimpy. That show was groundbreaking, and you can see elements of it in just about every NickToons cartoon on cable today. Some of the episodes are brilliant and others are horrible but as a whole, Ren and Stimpy are an undeniable classic of American animation. In one particular awesome episode, Ren and Stimpy celebrate Kilted Yaks-eve where a Yak comes up from the tub drain and shaves his stubble, leaving behind a present of the highly desired shaving scum for Stimpy in the morning. Based on this episode (most likely!) the programming term "yak shaving" typically means having to take detour after meaningless detour before you can actually solve the problem you're having.

In the context of this bug, the attempt to use the debug build of Python in the Jockey dbus backend was the first yak to shave. Because I wanted to get more information out of the process, I tried to attach to a running backend process, but this proved to be quite difficult. I was debugging this in a 64bit virtual machine, and gdb+debug-python was just not cooperating.

Now the question is: do you debug that problem (or at least get to a place where you can decide to file a bug or not), or do you punt and try a different approach? The former is most definitely shaving the yak; it gets you no closer to fixing the original problem, but certain wastes enough of your time to make you seem productive. Of course, it's inevitable that if you do follow that secondary bug, it'll lead to a third one, and a fourth one, and so on until you're deep in the shaving scum and no closer to a solution to the original problem. You need to be constantly on guard against these types of yak, er, rabbit holes.

I'll spare you the gory details about hitting bugs in my text editor, bugs in building Python on a multiarch system, and other paths that all lead to smooth chinned yaks, but not to happy Pythons. When I punted on trying to attach to the Jockey backend, I actually decided to try to figure out exactly where the TypeError was occurring by using a twist on old-fashioned print statements. By grepping the Python source I found about a dozen occurrences of the error message "an integer is required". Which one was getting tripped? I added a little marker text to every such occurrence and re-built a Python package to aid in debugging.

Here's where I did detour to yak shaving. The Ubuntu source package for Python itself, runs the full test suite on every build. That's great for ensuring a high quality Python package, but it's terrible for turnaround time when experimentally hacking Python. Building the package with DEB_BUILD_OPTIONS="nocheck nobench" is supposed to work, but for some reason did not with my sbuild environment. I guess debugging that would have been like shaving the yak's left check, but instead it was so much easier to shave his right check. So I spent some time with my razor chopping out huge sections of the debian/rules file so that Python 2.7 would build as fast as possible. While a seemingly meaningless task, this did actually help quite a bit since I was able to try out ideas with a much shorter turnaround.

Anyway, with many iterations on the exception marking idea, I finally nailed down the culprit. It was in the Python C API function PyInt_AsLong(), but even here I was unsure which arm of the conditional was getting triggered. Another round of hack-build-scp-reset-test and I found what I suspected: PyInt_AsLong() was getting passed an object that could not be turned into an integer. But what was that object?

So now back to the original problem with gdb. To solve this I downloaded and built a 32bit virtual machine, and was able to reproduce the bug there. Fortunately, in the 32bit environment I was much more successful in attaching to a running backend Jockey dbus process, and even though I didn't have the Python source available (and no, I won't talk about where that particular yak shaving detour led), I could fairly easily print the objects in the debugger at the offending code, where I learned that PyInt_AsLong() was getting called with None as its argument. And yeah, you can't turn None into an integer in Python!

But, how was this function getting called with None? Popping up the call stack led me eventually to PyArg_Parse(), an older C API function that parses a Python tuple into a set of objects based on some format flags. This is used when implementing Python functions in C, to parse an argument list. Popping up the stack again led me into some python-apt code. python-apt is a C++ library that exposes the APT system to Python programs. It's fairly mature and robust, but I wasn't as familiar with all its dark corners as I was with Python.

Now, my first instinct is never that I've found a bug in Python. That's not to say they don't exist, but just that Python has been around so long, is so well tested, and used so extensively that I'm always suspicious in cases like this (i.e. strange, inexplicable bugs that make no sense) of extension modules and third party code. And indeed, my sleuthing has led me to python-apt, exactly the kind of complex Python extension module that can have odd lurking bugs in it. Still, the problem I was now faced with was this: the call stack led me into a code path that had nothing to do with iterating over a list, or copying that list. So, what gives?

Well, it's useful to know how Python exceptions work at the C level. In general terms, when some C code raises an exception, it basically sets some global state, and then returns error codes up the stack until either something catches it and deals with it, or it percolates up to Python's top eval loop. The key thing here is that there are usually two states that exist: the global exception value currently in effect, and an error code that gets returned up the C call stack. Usually this return code is a zero or one, but it can also be NULL or -1 in certain cases. The Python C API documentation is very good at describing these.

So now that I looked closely at the python-apt code, I could see what what going on and it all began to make sense! The python-apt package was pulsing a progress meter, set up as a callback by the client of the python-apt code. Meaning, python-apt couldn't really control what this callback would return. python-apt expected the callback to return either Python's True or False, but it could return anything, including nothing! In Python, True can be coerced into the integer 1 and False into the integer 0, and python-apt wanted an integer, so indeed the call stack lead right to a call to PyArg_Parse() to turn the callback's return object into an integer. What happens if the callback didn't return something that could be turned into an integer, or worse, didn't return anything at all?

In Python, a function always returns something, even if there is no explicit return statement. In those cases, None is implicitly returned. Yes, you see it now. And if not, there was this clue in the python-apt code: "most of the time the user who subclasses the pulse() method forgot to add a return {True,False} so we just assume he wants a True." Translated: Hey guy! You forgot to add a "return True" or "return False" to your pulse() method, and it probably fell of the end, giving us None, which we dutifully passed to PyArg_Parse().

PyArg_Parse() did its thing when getting None, by correctly setting the global exception state to the TypeError, and returning a zero code to indicate an error occurred. But looking at the python-apt code, it recognizes the error code, but forgets that some global exception state was set! Meaning, even though python-apt was ignoring the exception, Python still knew about it. But because control wasn't being passed on up to the Python eval loop, the exception was just lurking there like a angry unshaven yak, waiting to be discovered. And in fact, the next time Python itself checked the exception state was in, yep, the for-loop iterating over the perfectly fine list object. Python hits the for-loop, finds this orphaned TypeError lurking there and raises it, in a place that has literally nothing to do with the original exception.

The fix is a one-liner. In python-apt, where its ignoring any exception returned by PyArg_Parse(), it must both swallow the error code (which it was doing), and clear the global exception state (which it was notdoing). By adding a call to PyErr_Clear(), python-apt was keeping the interpreter's state consistent, and properly ignoring the argument parsing error, thus fixing the bug.

As I mentioned to my colleague Colin Watson, this did turn out to be a fun one to debug, though not as "fun" as one he recently worked on.