‘davisagli’

David Glick – Plone developer

This is the story of a Friday night spent tracking down a particularly pesky bug in my Plone site. In gory detail.

Crap, it's not working...

I had just finished moving one of our larger Plone deployments to a new host. Everything had gone, with one or two minor exceptions, exceedingly smoothly, and I was looking forward to setting aside the laptop for the rest of the evening. (Yes, that does happen occasionally ;) ...But then I got an IM...

[our client]: 22:06:50
i am getting:

Bad gateway
The proxy server received an invalid response from an upstream server

Well, shit. That's weird... Normally I might have expected this 502 error from squid when Zope is restarting, but in this case Zope is directly behind apache, and is running fine. And the error's only occurring on one particular page...what the heck? Also... why is this a problem now when it definitely wasn't happening on the old server? My entire software stack is installed from a buildout with pinned versions, and I copied the apache configuration, so what variables could be left?

I could probably have already known at this point that the evening was shot. But what to do?

Time for some inspection...

Okay, well obviously Apache doesn't like something it's getting from Zope. Maybe a Zope bug of some sort? Let's check the apache error log and see what it tells us:

[FriJan2322:12:052009][warn]proxy:badHTTP/1.1 header returned by /@@sub_trial(POST)

Gee, thanks for all the detail, apache. Well, at least this confirms my suspicions that some header from Zope is wrong. But what could it be? Time to hit Zope directly and see what we can see in the headers. Open up the port in the firewall, enter the Virtual Host Monster URL to the problem page, view the headers in Firebug...what have we got?

Well, the page loads fine when I hit the site directly in Zope, and those headers look pretty innocuous. At this point my only guess was that the gzipping was screwing things up somehow, and thus followed a half-hour-long digression of confirming the gzip settings in CacheFu, confirming the mod_deflate settings in apache, and finally disabling gzipping entirely in Firefox. All of which served only to tell me in the end that the problem had nothing to do with gzipping.

Pull out the big guns: Google

At this point I did what any self-respecting software professional would do: I typed the error message into Google. This is where things really start to get interesting...

A query of zope "bad HTTP/1.1 header" netted me a few irrelevant-looking results. But aha, also an old e-mail thread that describes a similar error, and that has a link to a Plone trac ticket that...well, "Plone Hotfix 20071106 breaks long status messages" doesn't exactly sound relevant, but look, there's a comment that mentions the bug results in a 502 gateway error when running behind apache -- which matches my experience -- and it even says the bug is still present in Plone 3.0.3, the version I'm using on this particular site. And the description of the issue -- a problem with the statusmessage cookie header -- sounds reasonably close to the diagnosis I was guessing at. Hey, maybe we're on to something!

But something is still not right. The comments on the ticket also say that a newer version of the hotfix that introduced the bug, PloneHotfix20071106, was released to fix the issue all the way back in November 2007. And my buildout already includes this updated version -- it's right there as:

(Note the "-2"). Hmm, now that I think about it, maybe I have a vague recollection of encountering this same issue sometime in the distant past, and updating that hotfix in the buildout. But, obviously the fix is not working now. So what's going on?

The epiphany

And then it hit me.

David Glick: 23:14:50 holy crap, I think the real problem here is related to the recent plone.org upgrade
David Glick: 23:14:56 no shit :)

Let me explain. During the recent plone.org upgrade, all the product tarball files stored in PloneSoftwareCenter were migrated from being stored in the ZODB to being stored in a big directory on the filesystem using FileSystemStorage and collective.psc.externalstorage. (This is part of the efforts to make PSC function as a pypi-style repository, a cool ability which I hope Alex Clark and Tarek Ziad?? will publicize before too long.) Now, the thing I noticed that made me think "aha!" is that the filename listed for the hotfix tarball in my buildout has no version suffix. And it didn't take me much longer to check the older release of that hotfix and see that it has the exact same tarball filename. And then, knowing what I did about the plone.org upgrade, it didn't take much imagination to think that the two releases must now be accessing the same file on the filesystem, and the "-2" release was getting the wrong data. And this was quickly confirmed by inspection of the actual Hotfix code that had been fetched into my buildout.

To express my feelings at this point, echoing the immortal words of our Zope Pope:

"WAAAA!"

Case closed? You betcha. I stopped using the broken tarball and checked out the correct version of the hotfix from the svn tag instead. After restarting Zope, no more 502 error when I hit the form. And I followed up soon thereafter by replacing the hotfix tarball with the most recent release so that others shouldn't run into this same issue.

Note, however, that there's no guarantee at this point that other packages aren't experiencing a similar issue, so keep your wits about you when installing old-style product tarballs from plone.org. However, I have filed a report and trust that Alex or Tarek will be able to work on it before too long.

Of course, it goes without saying that I was rather lucky to have the familiarity with the plone.org upgrade needed to correctly diagnose the cause of this issue. But otherwise, just another day in the life of a Plone hacker...

Over the past couple days Steve McMahon has been with us at the ONE/Northwest Seattle office, and we've been working on a number of improvements to PloneFormGen. With the involvement of several other Seattle-based Plone developers who often join us for open source work on Friday afternoons anyway, we had a small army working on PFG.

Faced with a long list of good ideas, we decided to focus on the relatively easy changes. We'll try to get these through beta to a final release fairly quickly so that we can move on to more interesting challenges like a revamped form editing UI, multi-page forms, and no longer storing fields as Archetypes objects.

Sprint accomplishments:

Andrew Burkhalter finished converting the installation over to GenericSetup (work begun by Jesse Snyder at the sprint in DC)

I merged a branch I've been working on which adds support for rendering PFG forms from another template, viewlet, or portlet.

Steve McMahon cleaned out some compatibility code for Plone 2.1 (which will no longer be supported except in the 1.2.x maintenance series)

I created a proof-of-concept of a recaptcha integration which we plan to ship with PFG. This needs more work though.

Steve added an option to the mailer to select which fields will be included in the e-mail, as well as an option to exclude empty fields.

Steve removed the restriction on which fields can be used as the mailer recipient.

Jon Baldivieso added a link spam validator to ensure that a field doesn't contain links.

Steve switched the default view of form fields to the edit view, to save a few clicks when building forms.

Steve added CSS ids to the rich label field.

I fixed up the test infrastructure to work with roadrunner, and added some basic integration tests.

Andrew made progress on his branch which will add a form import/export feature.

Alex Tokar and Fulvio Casali almost finished implementing an option to make a field "server side" only, which is like a hidden field that doesn't actually render for the client (to avoid the chance of hacking hidden fields).

Steve added a BCC field override setting.

Jesse Snyder fixed a bug where copied-and-pasted action adapters were not automatically activated.

Michael Dunlap removed the isUrl validator from the form action override, so that it's possible to use relative URLs.

Elliot Cohen joined the fun even though he's new to Plone, and reviewed PFG documentation.

I added a 'force SSL' option for form folders which will redirect to an https:// version of the form.

Thanks to everyone who participated, including those who responded to my previous request for voting on features!

ONE/Northwest is gearing up for building various "civic engagement features"--things like online fundraising, online advocacy, and event RSVP--on top of Plone and Salesforce.com. (You'll be hearing more about this.) PloneFormGen will play a significant role in implementing several of these features, so we've invited its creator, the illustrious Steve McMahon, to come up to Seattle for a little PloneFormGen mini-sprint at the beginning of December.

As part of planning for this sprint and the future of PloneFormGen development, we'd like to get a sense of which outstanding feature requests are most desired by the community. So if you use PloneFormGen, please take a minute to go to http://ploneformgen.uservoice.com and vote for the improvements that would most ease your life when working with PloneFormGen. (The list is as of this writing heavily influenced by my own sense of what's important -- please feel free to add new items as well.)

Of course, our primary interest during the mini-sprint will be in scratching our own itches, but this may help us choose which of a number of minor itches to scratch if we have time, and I imagine the rating of priorities will come in handy in the future. (This is also a bit of a case study to see how well +/- voting works for prioritizing feature requests, to see if it's something we should also consider doing for Plone itself.)