As I rebooted this morning as a result of RH bug 473347 [1], two serious questions popped into my head:

Do any of the major core X contributors [2] employ a full-time X QA person? As far as I know the answer is ‘no’ but I’d love to be wrong.

Would a full-time X QA person funded fractionally by the major X contributors, reporting to the development managers for each of those contributors, but formally employed by freedesktop.org, make even more sense?

My sense is that this is the kind of position that may be hard for any one contributor to justify, but that is probably necessary for a complex piece of software to succeed, so a position with costs shared across the various contributors might make sense.

(This is only partially inspired by Owen’s recent call on behalf of Friends of GNOME and the sysadmin team, but I’ve always thought a full-time GNOME QA manager would make sense; it really is vastly more efficient for everyone involved if much of this sort of stuff is done upstream. And it just struck me today that the same is probably true for X.)

[1] This was today’s first reboot, but recent experience suggests I’ll reboot at least one more time, and probably at least twice more, today.

[2] RH, Intel, Novell, as far as I know?

Matej is great (I’m totally amused that RH’s desktop QA guy is… an ex-lawyer) but IMHO the problem is big enough to merit a full-time expert.

foo: I think you may be giving too much credit to GNOME’s buildbot, sadly; AFAIK it does not get enough love to make it a critical part of the development process like Moz’s buildbot is. But I may be wrong about that.

[…] (This is only partially inspired by Owen’s recent call on behalf of Friends of GNOME and the sysadmin team, but I’ve always thought a full-time GNOME QA manager would make sense- it really is vastly more efficient for everyone involved if much of this sort of stuff is done upstream. And it just struck me today that probably the same is true for X.) this was today’s first reboot, but recent experience suggests I’ll reboot at least one more time and probably at least twice more today. RH, Intel, Novell, as far as I know? Syndicated 2009-04-22 13:25:26 from Luis Villa’s Blog […]

Shouldn’t you be asking if RH have QA people testing updates before they send them out?

In comparison to my previous rumblings about Ubuntu’s X server errors, this is not the kind of bug that would easily be caught by any sort of manual testing: besides apparently being specific to particular hardware, it appears to be triggered only by interaction with one particular application (Firefox), and then only once every several hours of use. So the appropriate question is ‘once discovered, how was it prioritized and dealt with?’

Relatedly, I guess that means that when I say ‘QA guy’ I mean someone who does triage of community-related bugs as well as traditional testing. I really think that helps a lot with the ‘various HW problems’ case- if you can’t reasonably test everything before it goes out the door, you can at least highlight the problems after the fact and fix them quickly/efficiently.

(For what it is worth, as best as I can tell RH has prioritized fixing the new (completely re-written?) driver in F11 over fixing bugs in F10’s old driver that won’t help F11. This makes sense, and is what I would do in their shoes, but is still frustrating as a user.)

Yes, one full time at Intel, plus a few more fractions adding to maybe one or two more. We’re working on fixing that, as a lot of what we need done is QA-bound, and so the burden falls on developers and we spend our time writing, fixing, and running tests instead of writing, fixing, and running the software that is our jobs.

We also have a severe lack of automated tests that are useful to developers. The only relevant test we have for desktop X is cairo’s xlib results. For input, there still is nothing; Peter Hutterer is just building some pieces of input testing now, rather than building the singing/dancing input system we wish he was working on. And in the land of GL, oglconform is closed source and produces results so meaningless you’re better off replicating the code path on your own than trying to interpret oglconform. The open-source GL test suites (glean, the original open-source test suite, which is a C++ mess and moderately useful, and piglit, a new, beautifully simple and incredibly useful test suite) are written entirely by driver developers trying to get their jobs done, and have very limited coverage: bugs that we’ve investigated before.

If I were to put someone on “do things to make sure that X doesn’t suck”, I’d want someone that I could describe tests to and have them go implement them. Today what I needed to write was “Create an FBO for a cube map, draw solid colors to each face in turn, then draw views of each face of that cube map to the window system backbuffer and verify that the pixels drawn are expected. For extra credit, draw some simple pattern on the faces so I’d know if I got orientation wrong.” I’ll spend a couple of days writing that test, before I get to fixing our driver for cube mapped FBO rendering (which likely affects many games that people wish worked on our hardware). A couple of weeks ago, it was “Talking directly to the DRM, allocate moderate-sized, tiled buffer objects until you hit swap, then render to/from them a bunch of times, and finally read each one and make sure that it’s still got the expected contents.” I think that was the one that exposed 7 bugs in code I was working on, each of which would have hurt different aspects of the Linux desktop.
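
To illustrate why the “extra credit” pattern in the cube map test matters, here is a minimal CPU-only sketch in Python. It does not touch GL or any driver: the face size, the colors, and the helpers `make_face`, `mirror`, and `undetected_flips` are all hypothetical names chosen for illustration, and a mirrored face stands in for a driver that gets orientation wrong.

```python
# CPU-only stand-in for the cube map test idea; no GL or driver calls involved.
# All names and values here are hypothetical, chosen for illustration.

SIZE = 4  # assumed face size in pixels

# One solid color per cube face (six faces).
FACE_COLORS = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255),
    (255, 255, 0), (255, 0, 255), (0, 255, 255),
]
MARKER = (255, 255, 255)  # a color that is not used by any face


def make_face(color, with_marker):
    """Build a SIZE x SIZE face of one solid color, optionally with a corner marker."""
    face = [[color for _ in range(SIZE)] for _ in range(SIZE)]
    if with_marker:
        face[0][0] = MARKER  # asymmetric: only the top-left pixel differs
    return face


def mirror(face):
    """Simulate an orientation bug by flipping the face left to right."""
    return [list(reversed(row)) for row in face]


def undetected_flips(with_marker):
    """Count faces whose mirrored rendering still passes the pixel comparison."""
    missed = 0
    for color in FACE_COLORS:
        expected = make_face(color, with_marker)
        rendered = mirror(expected)  # pretend the driver drew the face mirrored
        if rendered == expected:
            missed += 1
    return missed


if __name__ == "__main__":
    print("solid faces, flips missed:", undetected_flips(with_marker=False))
    print("marked faces, flips missed:", undetected_flips(with_marker=True))
```

With solid colors alone, a mirrored face compares equal to the expected one, so the comparison cannot see the orientation error; the single asymmetric marker pixel is enough to make every flipped face fail.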

But if I’m spending 2 days a week building testcases, and a bunch of time analyzing bug reports, that honestly leaves maybe a day a week of actual coding to improve things. And that sounds a lot like failure, when you look at the TODO list we’ve got in X.

As far as your particular bug, there are a few important things:
1) File your bug at freedesktop.org so the developers actually see it, since distros are bad at forwarding bugs upstream.
2) Since it’s a GPU hang, ignore backtraces and run intel_gpu_dump while reproducing the problem on a current kernel so we can see what the GPU is doing, not what the CPU is doing. Admittedly, we just wrote this tool in the last couple of weeks for debugging this class of failures. We’re also still spending a lot of time polishing the tool so it gets more useful information, so expect some iterations (sorry).
3) Do whatever you can to narrow down what it takes to cause your problem. If my laptops (I’ve only got 6 running within 5 feet of me, and only one I use for my desktop activity on a daily basis) were experiencing the problem, I’d have fixed it, so I need to know how to experience your problem on my laptop.

To be clear about how nasty the interactions often are for hangs, I fixed a bug that started out as “things are unstable in our distro.” We complained that that wasn’t enough information. “OK, things are unstable when I run clutter apps”. OK, closer. But I run clutter apps and it’s OK. So I’m in the guy’s cube one day talking about other stuff, and he reproduces the bug trying to show me something else. Whoa. So I have him reboot, and do the same thing we did last time: start a terminal, run the app, click over here a bunch of times, and eventually it hangs again. I go back to my machine, start the same app, click a bunch of times, no hang. I go back... wait a second, I did it in a terminal there because I can’t ssh to his box. Do it from ssh, no reproducing. I start it from xterm instead of gnome-terminal: no reproducing. I go back to my cube, and start a gnome-terminal, run the test, click, and bam, failure. I say aha, text versus 3D failure, classic. So I run ls -lR / in gnome-terminal while running the test so I can reproduce it faster and more reliably. Doesn’t work. Only the particular message output from clicking on the 3D app, output to a gnome-terminal, interrupting the 3D app at the right time, would trigger the particular wrap condition that could race with 3D and hang.

That bug started out as “X hangs seemingly at random.” And our bugzillas are full of “seemingly at random,” and they’ve usually been open for months or years. Computers aren’t random, as far as I’ve been able to tell. So instead of telling me that Firefox hangs, tell me what exactly you were doing; presumably you were visiting websites. Which ones? I want URLs here. What else is running on your desktop? Names of apps, numbers of windows, what window manager, any other details you can come up with that might make you different from me. An Ubuntu bug we’re working on right now triggers only when you get the aperture full, but it started out as “compiz hangs after a bit of use.” Well, no: you don’t need to use it for a while; if you open up enough windows and do a few operations you’ll reliably trigger it, apparently. And if you can subtract details and still make it happen, even better: can you reproduce your bug if Firefox is the only thing running, without even a desktop environment?
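
The “subtract details and still make it happen” advice can be sketched as a simple greedy reduction. This is only an illustration in Python, not a tool anyone on the X side uses as far as this thread says: `minimize`, `example_reproduces`, and the condition names are hypothetical, and in real life the `reproduces` check is a person rerunning the scenario by hand rather than a function call.

```python
# Hypothetical sketch: shrink a set of conditions while the bug still reproduces.
from typing import Callable, FrozenSet, Iterable


def minimize(conditions: Iterable[str],
             reproduces: Callable[[FrozenSet[str]], bool]) -> FrozenSet[str]:
    """Drop conditions one at a time, keeping each removal that still reproduces."""
    current = frozenset(conditions)
    if not reproduces(current):
        raise ValueError("the full set of conditions does not reproduce the bug")
    changed = True
    while changed:  # repeat passes until no single condition can be removed
        changed = False
        for cond in sorted(current):
            trial = current - {cond}
            if reproduces(trial):
                current = trial
                changed = True
    return current


# Made-up example loosely modelled on the gnome-terminal story in this comment:
# the pretend hang needs exactly these three things to be present.
REQUIRED = frozenset({"gnome-terminal", "3D app", "clicking in the app"})
ALL_CONDITIONS = REQUIRED | {"compiz", "firefox", "second monitor"}


def example_reproduces(present: FrozenSet[str]) -> bool:
    """Stand-in oracle; a real check means rerunning the scenario by hand."""
    return REQUIRED <= present


if __name__ == "__main__":
    print(sorted(minimize(ALL_CONDITIONS, example_reproduces)))
```

The smaller the surviving set, the more likely a developer can recreate it on their own machine; a report that lists the result of this kind of narrowing is far more useful than “seemingly at random.”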

Sorry for the rantiness, but I’m hoping it’s useful as an xorg developer’s perspective. We’re actually working on removing a bunch of options right now to make sure that your laptop looks more like my laptop, so we can all look at the same bugs together. Luckily, F11 looks a lot like my laptop, so RH is already in good shape if they’d just get X bugs that aren’t their fault over to us faster.

Thanks for the detailed response, Eric. It deserves a fuller response on my end which I hope to get to in the next few days. Probably the most important part is that I know you guys are working hard and I know Software Is Hard, so while frustrated I can’t blame you guys directly.