A few days ago, a user e-mailed me a bug report for
LibraryHippo. For the previous 55 days, she'd been
receiving the same (erroneous) morning e-mail about what library
materials she had due. However, when she checked her family's account
summary on the web page, it was correct.

(The exact cause of the problem won't be of interest to many people,
but I thought the way it presented, and how I diagnosed it, might
help someone out. Read on!)

I checked the logs, and sure enough, there was an error. Sort of like this:
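(A reconstruction; paths and line numbers are elided, but the repeating pickle frames and the final error line are the point.)

    Traceback (most recent call last):
      ...
      File ".../pickle.py", line ..., in save
      File ".../pickle.py", line ..., in save
      File ".../pickle.py", line ..., in save
      ... (screenfuls more pickle frames) ...
    RuntimeError: maximum recursion depth exceeded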

Whenever LibraryHippo checks a patron's library card, it saves the
results to the datastore to be used to construct the next day's
e-mails, but an inability to save doesn't keep the results from being
displayed on the web page. So that part made sense.

The next step was to figure out what was going wrong with the save.
The logs indicated that pickle was using more stack frames
than were available.

I try to debug it

I added a copy of the offending card to my family's account on the live site. Same problem.
Then I fired up the dev environment at home and did the same thing. Everything worked like a charm.

That was unexpected. So, as a last resort, I started thinking.

Why so deep?

The structure that was being pickled (CardStatus) was quite
flat. Here are the involved classes' definitions. Unless noted
otherwise, everything is a string or datetime:
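(A sketch of the shape; names other than CardStatus, Thing, Hold, Item, pickup, and status_notes are illustrative, and class-level defaults stand in for the real initializers.)

    class Thing(object):
        title = ''           # string
        author = ''          # string

    class Hold(object):
        thing = None         # a Thing
        pickup = ''          # pickup location (string)
        expires = None       # datetime
        status_notes = []    # list of strings

    class Item(object):
        thing = None         # a Thing
        due_date = None      # datetime
        status_notes = []    # list of strings

    class CardStatus(object):
        patron_name = ''     # string
        expires = None       # datetime
        holds = []           # list of Hold
        items = []           # list of Item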

So, one level for the CardStatus, one for Thing, one for Hold
(or Item), one for a list, and one for the status_notes strings in
the list. That's 5 levels. And pickle probably encodes a dict for a
few of the complex types. Let's be generous and say 10 levels,
each of which takes maybe 5 nested pickle calls. That's about
50. Plus however deep we are in the stack before the pickling
happens. That shouldn't be more than 100.

How deep is too deep?

Popular wisdom on the web seems to be to increase the recursion limit
when pickle runs into these kinds of problems. I was loath to do this,
as I'd be constantly worrying about what depth to allow and whether
it'd be enough and so on.

While I dithered over that, my wife called to me from the television
room, "Just set the limit higher. That should help your user for now
and it will give you more time to work on the problem."

So I did. I was worried that the App Engine runtime wouldn't let me
change the recursion limit, so I used
sys.getrecursionlimit to log the existing limit,
sys.setrecursionlimit to raise it, and
sys.getrecursionlimit again to verify the change.
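It went roughly like this; all three are standard-library sys calls:

    import logging
    import sys

    logging.info('recursion limit was %d', sys.getrecursionlimit())
    sys.setrecursionlimit(20000)
    logging.info('recursion limit is now %d', sys.getrecursionlimit())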

Turns out that:

- The default recursion limit in App Engine's production environment is 800.
- You can change the limit. I went up to 20000.
- In the dev environment, the limit is 1000.

The higher limit on the production server fixed things. I relaxed a
little, and started thinking.

Maybe the 800/1000 difference accounted for things working at home,
but not in production. I used sys.setrecursionlimit to lower the
limit at home to match production's 800, and reproduced the error.
Huzzah! Now I could move more quickly.

Inserting diagnostics into pickle

Back to the question of why pickle was recursing so deeply. I don't
routinely run Python under a debugger, relying instead on the power
of logging statements. Thus, I decided to provide a custom pickler
that did the normal pickling things, but that also, for every object
pickled, logged the stack depth, the object type, and its representation:
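(A sketch rather than the original code: under the Python 2 runtime App Engine used at the time, the pure-Python pickle.Pickler funnels every object through its save method, so that's the hook.)

    import logging
    import pickle
    import sys

    def stack_depth():
        # Count frames by walking up from the current one.
        depth, frame = 0, sys._getframe()
        while frame is not None:
            depth += 1
            frame = frame.f_back
        return depth

    class LoggingPickler(pickle.Pickler):
        def save(self, obj):
            # One log line per pickled object: depth, type, and repr.
            logging.info('depth %d: pickling a %s: %r',
                         stack_depth(), type(obj).__name__, obj)
            pickle.Pickler.save(self, obj)

    # Used just like the stock Pickler, e.g.:
    #   LoggingPickler(output_file, pickle.HIGHEST_PROTOCOL).dump(card_status)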

The log output showed that we were already quite deep in the stack
before the first object was pickled, but the real surprise was the
BeautifulSoup.NavigableString. LibraryHippo uses
BeautifulSoup to scrape the libraries' web pages. A
NavigableString is basically a Unicode string
plus navigation. To support the navigation, the instance points at
what looks to be a DOM model of the entire parsed HTML page, so
pickling one little string drags in the whole page. That
explains the deep recursion.

The offending objects were in the pickup field of the Hold
class. A quick fix to ensure we're storing a plain string, and the
problem was resolved.
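The fix amounted to something like this (pickup_cell is an illustrative name for the scraped table cell):

    # before: store whatever BeautifulSoup handed back -- a NavigableString
    # whose navigation links drag the whole parse tree into the pickle
    hold.pickup = pickup_cell.string

    # after: coerce to a plain unicode string, severing ties to the tree
    hold.pickup = unicode(pickup_cell.string)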

What a difference

Aside from the obvious effect of correct e-mails, I was curious to see what other differences this change made.

Before the fix, the serialization descended to a stack depth of
892 and produced a 546221-byte blob. After, the maximum
depth was 52, and the blob size was 10779 bytes. So everyone
should benefit a little from lower resource usage and quicker
card-checking. Fortunately, LibraryHippo isn't popular enough to exceed
the free quotas, so I wasn't paying extra for compute or storage. Then
again, if I had been, maybe I would've noticed the problem before a user
did.