Eating my own bird food

With the next version of Yocto (1.1, codenamed edison) coming up in a few weeks, and with a new cycle (1.2) beginning immediately following that, I thought it might be a good time to reflect a little bit on the development and release process from my perspective as the maintainer of most of the BSPs in the meta-intel repo, having now been through two complete cycles.

Mostly it's as you'd expect - the Yocto codebase is extremely fast-moving, and just maintaining the current set of BSPs against the changes to the core metadata can at times be a full-time job in itself. During the active development cycle, and sometimes even during stabilization phases of the project, builds break out of the blue, things randomly break at run-time, and most of the time the causes aren't immediately apparent - some amount of investigation or bisection is necessary to get things back to a stable state. This is a completely expected state of affairs in an active and growing project, and in most cases a healthy sign - for it to be otherwise would actually be somewhat worrying.

Still, it can be frustrating and requires constant vigilance - the longer problems go undetected, the more difficult it is to find the culprit. For example, at the moment I'm bisecting a problem where the desktop icons have disappeared again due to some commit of the past couple of days - had I not built and boot-tested this machine yesterday, it would have taken much longer and been more difficult to locate the problem, as it has in the past.

Being vigilant about the six meta-intel BSPs I currently maintain (crownbay, crownbay-noemgd, emenlow, fishriver, sugarbay, and jasperforest) has been manageable so far, but just barely. In addition to making sure everything still works most of the time, during each release cycle I also need to go through the exercise of upgrading all the BSPs to work with new package versions, and upgrading some of those packages themselves - the main one for BSPs being the kernel. There's actually quite a bit of work involved in moving the BSPs to a new kernel: things like porting out-of-tree drivers (mainly graphics) to the new kernel version, keeping on top of changes to kernel options or additional options we should be taking advantage of, and in general some amount of movement between what we keep in our kernel branches vs. what's gone upstream since the last cycle. Most of the time it goes smoothly, but every cycle there are always one or two BSPs that give me real problems - for 1.1 it was emenlow graphics; for 1.0, the crownbay gigabit Ethernet driver.

My guess is that I could add two or three more BSPs to the list (which is actually what I expect to be doing during the next cycle) and still be able to maintain both the BSPs and my sanity, but it's apparent to me that beyond that, things aren't going to scale without some kind of help. Obviously, one form that help could take would be to simply throw more people at the problem, but even if there were a budget for that (there typically isn't), you'd never do that without first maxing out what you can do by working more efficiently. Thankfully, at least half of the problem should be very amenable to being taken off the table by a bit of automation. You can't automate the tasks mentioned above, which require digging into problems, looking at options, or forward-porting code, but you really should be able to automate most of the rest - the rote things that point you to the tasks that do need intervention. Doing those things manually takes a non-trivial amount of time and attention and for the most part really shouldn't be done by a human: building and booting images for all of the supported BSPs, at least to the point where you can point a human tester (me, to start with) at the box and say, 'this BSP is ready for testing, switch the monitor over and see if all the icons are still there'.

It just so happens, not coincidentally, that I also have a lot of idle hardware lying around, which pretty much exactly matches the platforms I need to test on. ;-) It also happens that I'm a big proponent of eating my own bird food, killing multiple birds with one stone, and whatever other ornithological metaphors make sense, so what I'm planning to do is take all of these machines, some of which are actually first-class build nodes (e.g. Sandy Bridge and Jasper Forest), put them all on a remote power strip and KVM switch, and have them coordinate pulling down and building the latest Yocto source, iterating through each image and machine type. Once each image is ready, it will automatically get 'installed' on the appropriate test machine (using gPXE network boot support or some such) and booted, at which point a message will be sent to the 'operator' (me, to start with) that image X on machine Y is ready for testing. Once I know that, I can switch over and do as little or as much testing on the machine as necessary (much of the time just looking at the screen and seeing icons would suffice). While it doesn't sound like much, automating all of this would save me the trouble of doing it manually, which is tedious, forces me to context-switch, and generally makes me spend more time than I'd like on tasks that don't actually require a human. It would also serve to more consistently detect problems as early as possible, keeping the potential bisection distance between breakages as small and manageable as possible.
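To make the build-boot-notify loop concrete, here's a minimal Python sketch of the coordination I have in mind. All of the helper functions are hypothetical stand-ins of my own invention - in a real setup the build step would invoke bitbake, the deploy step would stage the image on a gPXE/TFTP server, and the notify step would send an email or IRC ping:

```python
# Hypothetical sketch of the planned build/boot/notify loop. The three
# helpers are stand-ins for bitbake, gPXE/TFTP staging, and whatever
# notification mechanism ends up being used.

BSPS = ["crownbay", "crownbay-noemgd", "emenlow", "fishriver",
        "sugarbay", "jasperforest"]

def build_image(machine):
    # stand-in for: MACHINE=<machine> bitbake core-image-sato
    return "core-image-sato-%s.hddimg" % machine

def deploy_via_netboot(machine, image):
    # stand-in for staging <image> where <machine> can network-boot it
    return "%s netboot -> %s" % (machine, image)

def notify_operator(machine, image):
    # stand-in for the 'ready for testing' message to the operator
    return "image %s on %s is ready for testing" % (image, machine)

def run_cycle(machines):
    """For every machine: build its image, deploy it, notify the operator."""
    messages = []
    for machine in machines:
        image = build_image(machine)
        deploy_via_netboot(machine, image)
        messages.append(notify_operator(machine, image))
    return messages
```

The point isn't the code, which is trivial, but that nothing in the loop needs a human until the notification arrives.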

As mentioned, many of these idle machines are heavy-hitters and obviously cut out for the hard-core task of building Yocto images. Many of the others, however - e.g. the Atom machines - aren't quite up to such heavy lifting, but being the taskmaster that I am, in this economy I expect even the babies to work, and I'm hoping in a later iteration to be able to split up the work of building and testing an image into smaller pieces that can be farmed off to whatever machine is available, each according to its abilities. How to determine those abilities is an interesting question - what I'm actually hoping to do is tie into performance data that we can make available on each machine, using the various performance tools we ship in the Yocto images all of these machines will be running. These performance metrics would be used in some sort of simple 'cloud protocol' to partition and farm out work to the set of available machines in the grid. It might even be useful to anticipate this and start out with a trivially simple protocol that could be used to bootstrap the basic build coordination described above, but that's a stretch goal as I see it at this point - for now it would suffice to simply have the work taken off my hands.
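As a rough illustration of the 'each according to its abilities' idea, here's a small Python sketch that splits a list of build tasks across machines in proportion to a per-machine performance score. The scores and the largest-remainder apportionment scheme are my own assumptions for the sake of the example - in practice the scores would be derived from the performance data mentioned above:

```python
# Hypothetical sketch: assign each machine a share of the task list
# proportional to its performance score, using largest-remainder
# apportionment so every task is assigned exactly once.

def partition_tasks(tasks, scores):
    total = float(sum(scores.values()))
    # ideal fractional quota per machine
    quotas = dict((m, len(tasks) * s / total) for m, s in scores.items())
    counts = dict((m, int(quotas[m])) for m in scores)
    # hand leftover tasks to the machines with the largest remainders
    leftover = len(tasks) - sum(counts.values())
    by_remainder = sorted(scores, key=lambda m: quotas[m] - counts[m],
                          reverse=True)
    for m in by_remainder[:leftover]:
        counts[m] += 1
    # deal the tasks out according to the computed counts
    it = iter(tasks)
    return dict((m, [next(it) for _ in range(counts[m])]) for m in scores)
```

So with scores of, say, 3 for a Sandy Bridge box, 2 for a Jasper Forest box, and 1 for an Atom box, twelve tasks would split 6/4/2 - the babies still work, just less.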

At this point in the development of the Yocto Project, I'm probably the only person faced with these problems related to the scalability of BSP maintenance, but there are also a couple of other things I'll be working on in the near future that I expect will create similar problems for other people, and which I'll describe in a future blog post. Without going into too much detail now, it's basically a plan to make creating and maintaining new BSPs very simple and easy to do. If successful, there should soon be other people facing these problems too (which of course are the kind of problems you don't mind having), and hopefully by the time that happens, we'll have something like what I described above in place to make BSP maintenance more scalable to go along with it.