Posted by CmdrTaco on Sunday June 05, 2005 @11:45AM
from the just-like-a-real-project dept.

An anonymous reader writes "The Linux kernel is now being automatically tested within 15 minutes of each new version's release, across a variety of hardware, and the results are published for all to see. Martin Bligh
announced this yesterday; the tests run on top of IBM's internal test automation system. Maybe this will help the kernel developers keep up with the 2.6 kernel's rapid pace of change. It looks like the system already caught one new problem with last night's build ..."

Code generation is good for repetitive stuff, especially if your language doesn't have much in the way of a built-in preprocessor.

Say, for example, producing similar load-on-demand wrappers for a load of functions in a dynamic library.
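A minimal sketch of that idea, assuming an invented libfoo with a load_sym() helper that dlopens the library on first use (the library, the function list, and load_sym() are all made up for illustration); the generator is Python, emitting C:

    #!/usr/bin/env python3
    # Hypothetical generator: emit near-identical C lazy-load wrappers
    # for a dynamic library instead of writing them all by hand.
    # libfoo, the function list, and load_sym() are invented.

    FUNCS = [
        # (return type, name, parameter list, argument names)
        ("int",  "foo_open",  "const char *path",           "path"),
        ("int",  "foo_read",  "int fd, void *buf, int len", "fd, buf, len"),
        ("void", "foo_close", "int fd",                     "fd"),
    ]

    TEMPLATE = """\
    {ret} {name}({args})
    {{
        static {ret} (*real)({args});
        if (!real)
            real = ({ret} (*)({args})) load_sym("{name}");  /* dlopen()s libfoo on first call */
        {maybe_return}real({argnames});
    }}
    """

    def main():
        print('#include "lazy_load.h"  /* assumed to provide load_sym() */\n')
        for ret, name, args, argnames in FUNCS:
            print(TEMPLATE.format(ret=ret, name=name, args=args,
                                  argnames=argnames,
                                  maybe_return="" if ret == "void" else "return "))

    if __name__ == "__main__":
        main()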

P.S. /. seems to be restricting me to one post every 15 mins right now, dunno why (the error says Slashdot requires you to wait 2 minutes between each successful posting of a comment, to allow everyone a fair chance at posting a comment).

Code generation is good for repetitive stuff, especially if your language doesn't have much in the way of a built-in preprocessor.

There's a fair bit of repetitive code in the kernel. I had to do some hacking to make some RS-422 cards we had work properly, and found that a lot of the char drivers in particular contain very similar code and structure. Code generation might help with older drivers that nobody cares about until they break; they tend to rot, from the looks of things.

Holy crap, remind me never to hire you. If you know exactly what the string is going to look like, why didn't you just write it out? If you expect it to change, why did you hard-code the length values into the buffers?

I think the kick-in-the-butt humor in this is the idea of the computer auto-generating the code, auto-compiling it, auto-testing it, and regenerating the code for improvement based on the test results... loop it, Johnny Mnemonic... uhhh, er, Neo...

I'm fully aware of the possibilities of auto-coding and code generators, both of which exist today in one form or another; they're just not widely available on much of any user/consumer platform.

Please don't play this card all the time. We hear it way too often in the Free Software/Open Source communities, and it's really quite silly.

The grandparent post asked if it would make more sense to do it another way. That's a perfectly valid and logical question. Either he's right, and it does make more sense, or he's wrong (for a variety of reasons), and it's best to keep it the way it is. Neither case requires one person to be doing it incorrectly and another to be doing it properly.

Great idea. You should ask IBM to integrate their test platform into Linus' processes. He might be dubious, after BitKeeper (that idiot), about another company helping him, but in this case I think it's a great idea.

There may be (and probably are) other test beds out there testing releases. It would be better for Linus (and the world) if he could release already-tested code, instead of having the world duplicate all the testing effort, and IBM seems like a perfect fit.

I automatically test every nightly -git snapshot release, so it's fairly well tied in anyway. This also means my heaviest usage of our machines is at night, when most of the (US) developers are asleep.

So it's fairly well tied in already... and the whole -rc cycle should enable us to catch a lot of stuff.
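A toy version of that polling loop, for the curious; the URL, the naive lexical sort, the 15-minute interval, and the run_harness() stub are all assumptions, and the real IBM harness is far more involved:

    #!/usr/bin/env python3
    # Sketch: watch a kernel.org snapshot directory and queue tests for
    # anything new.  Everything here is a placeholder, not IBM's system.
    import re
    import time
    import urllib.request

    SNAPSHOTS = "https://www.kernel.org/pub/linux/kernel/v2.6/snapshots/"

    def latest_snapshot():
        html = urllib.request.urlopen(SNAPSHOTS).read().decode("utf-8", "replace")
        # Naive: a lexical sort is good enough for an illustration.
        names = sorted(set(re.findall(r"patch-2\.6\.\S+?\.bz2", html)))
        return names[-1] if names else None

    def run_harness(name):
        # Placeholder: download, patch, build, boot, run dbench/tbench/etc.
        print("would queue tests for", name)

    seen = None
    while True:
        current = latest_snapshot()
        if current and current != seen:
            run_harness(current)
            seen = current
        time.sleep(15 * 60)  # matches the "within 15 minutes" in the story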

In any case, most people, especially in mission-critical environments, don't compile a new kernel as soon as it's released. Myself, I try kernels after a while, once no major issues have been found. Even then, I try them out first on different test machines. So 15 minutes before, 15 minutes after, it's all the same.

"Release" in the open source world has a broader sense than in commercial software. In open source not all "released" versions are meant for general public consumption; they include unstable versions targeted mostly at developers, so that severe isues can be detected and patched quickly.

Taking this into account, I believe this is meant to catch bugs mainly in nightly (unstable) builds and release candidates, not in "final" versions (those should, at least in theory, have no serious bugs left, as they have already been through that testing cycle).

This is good, and long overdue (I'm surprised it hasn't been around for years), but just how much testing is being done? Compiling? Booting? Or are there actual functional and reliability tests which are being performed?

Compiles, boots, runs dbench, tbench, kernbench, reaim, fsx. If one test fails, it'll highlight it in yellow rather than green or red. I have a few of those in the internal tests, but not the external set.

This is only the tip of the iceberg as to what can be done. We're already running LTP and several other tests internally. Some have licensing restrictions on releasing results (SPEC)... LTP is a pain because some tests always fail, so I have to work out the differential against a baseline. That will come later.
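The baseline-differential idea is simple enough to sketch; note the one-"test PASS/FAIL"-per-line file format below is an assumption, not LTP's real output:

    #!/usr/bin/env python3
    # Only *new* failures should count: LTP has tests that always fail,
    # so compare the current run against a known-good baseline.

    def failures(path):
        with open(path) as f:
            return {line.split()[0] for line in f
                    if line.split()[1:] == ["FAIL"]}

    baseline = failures("ltp-baseline.txt")  # failures on a known-good kernel
    current  = failures("ltp-current.txt")   # failures on the kernel under test

    new_failures = current - baseline
    if new_failures:
        print("REGRESSIONS:", ", ".join(sorted(new_failures)))
    else:
        print("no new failures (%d known, %d fixed)"
              % (len(baseline & current), len(baseline - current)))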

...the cross-platform, cross-hardware part? Setting up one machine to build automatically is easy. Setting up a whole bunch of them (all unique, read: administration nightmare) and tying them together into one system, that's quite a bit of work.

Indeed. The automation system I wrote is just a wrapper around an internal harness called ABAT, which has a massive amount of work behind it. If systems crash, it can detect that, power-cycle them, etc.

Going from 90% working to 99.9% working is frigging hard. I had all this working 3-6 months ago, but the results weren't of high enough quality to publish. Several people internally put a massive amount of work into improving the quality and stability of the harness.

It's magic [netbsd.org]! A single script and I can build a complete operating system for a big-endian 64-bit architecture on a 32-bit little-endian one, or any of the other 48 supported archs. More than that, I can build a complete NetBSD for any arch on any halfway POSIXish system.

build.sh bootstraps its own self-contained build utils (compiler, binutils et al.) and builds the system with those. You can even build the complete system as non-root and get tarballs that you can unpack onto the target machine.

http://aegis.sf.net/ [sf.net] - and it can do a lot of other things too, like making sure that each change has an accompanying test, and that all tests pass before anybody else is bothered with that change.

The biggest downside of aegis (as I see it) is that it needs to run on a central development server; it is not client/server based like CVS and the others (it has a CVS-like interface for reading). But OTOH, would it be so hard to have the kernel developers log into a central compile farm where the Linux kernel is built and tested?

The results are all there if anyone wants to play with them. Go to the results matrix, and click on the numerical part of the green box. Pick a test, and drill down to the results directory.

The numbers are there; it's just a question of drawing graphs, etc. I have some for kernbench already, but I'm not finished automating them. If anyone wants to email me code to generate them from the directory structure published there, feel free ;-) Preferably python or perl into gnuplot.
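In that spirit, a rough Python-into-gnuplot starting point; the directory layout assumed below (version/kernbench/elapsed) is a guess and would need adjusting to match the actual published structure:

    #!/usr/bin/env python3
    # Walk a local mirror of the results tree and plot kernbench elapsed
    # time per kernel version.  The layout is assumed, not confirmed.
    import os
    import subprocess

    ROOT = "results"  # local copy of the published results tree

    points = []
    for version in sorted(os.listdir(ROOT)):
        path = os.path.join(ROOT, version, "kernbench", "elapsed")
        if os.path.exists(path):
            points.append((version, float(open(path).read().strip())))

    with open("kernbench.dat", "w") as f:
        for i, (version, secs) in enumerate(points):
            f.write("%d %f # %s\n" % (i, secs, version))

    # Feed a minimal script to gnuplot on stdin.
    script = """set terminal png
    set output 'kernbench.png'
    set ylabel 'kernbench elapsed (s)'
    plot 'kernbench.dat' using 1:2 with linespoints title 'kernbench'
    """
    subprocess.run(["gnuplot"], input=script.encode(), check=True)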

But that is not the point of automated testing. As a member of a QA team developing automated tests, I get comments like that every day.

Automated tests are not intended to catch everything or to test strange permutations of pre-conditions. Their purpose is to provide a mechanism for verifying that a build satisfies the basic requirements of the project.

More exotic configs need to be tested manually as usual, but automated tests can provide a "failsafe" just in case a basic part of the build is broken.

Reliable, repeatable testing is a great way to prevent fixes in one area from causing bugs in another. When I fix A, I generally only test A manually. I don't test every other conceivable code path, even though my fix for A might well impact them.

An automated test for B will catch regressions caused by my fix in A, making it harder to backslide. Backsliding is very expensive because bugs are far removed from their cause. If an automated test sees that changes in A caused a regression in B, the cause is immediately obvious.
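A toy illustration (in Python, and entirely made up) of how that plays out: a "fix" made for caller A changes a shared helper, and only B's automated test notices the fallout:

    import unittest

    def parse_num(s):
        # The "fix" for A: accept hex, so int(s) became int(s, 0).
        # Side effect: B's zero-padded decimal "010" no longer parses
        # as ten (modern Pythons raise ValueError; old ones read octal 8).
        return int(s, 0)

    class TestB(unittest.TestCase):          # B's pre-existing test
        def test_zero_padded_decimal(self):
            self.assertEqual(parse_num("010"), 10)  # trips after the "fix"

    if __name__ == "__main__":
        unittest.main()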

Automated tests are not intended to catch everything or to test strange permutations of pre-conditions. Their purpose is to provide a mechanism for verifying that a build satisfies the basic requirements of the project.

I agree with jnelson4765; new builds would be well served by being tested on a great many machines with a wide variety of hardware setups.

Who should map the hardware testing platforms? I don't know, but I do know that if new kernel builds were tested on a generic group of hardware and released, and other testers then reported on their tests using hardware X, you would end up fairly quickly with a listing of each new build against many variants of hardware. Published correctly, it would let people search for results matching their own hardware.

Unfortunately, organizing that kind of oddball testing would be a management nightmare, unless you want to go out and collect all of the hardware. Remember, some people post patches and whole driver releases without ever stepping inside the kernel team's realm.

The only real way to automate something like that would be a dummy-load facility: software that emulates the hardware being in place, or something conceptually similar to that effect, anyway.

Wouldn't it make more sense to package these tools for someone to install on their collection of oddball equipment, and assist in the debugging/testing?

That's how the PostgreSQL build farm [pgbuildfarm.org] works. People with weird hardware [onlamp.com] apply to be added to the automated test farm. ARM, MIPS, PA-RISC, Alpha, PowerPC, SPARC, etc. are all well represented in the PostgreSQL automated tests.

Red Hat (and probably Novell/SuSE, since they use over one thousand kernel patches) runs a myriad of tests on each of its own kernel builds nightly - and has been doing so for years. And on more than just the 3 architectures covered by this test.

That said, pushing tests upstream is a great idea. Just not revolutionary or anything.

Man, I wish they'd test Fedora kernel releases on their test farm. Of a dozen different machines I've run 2.6 Fedora kernel releases on, I've lost 1394 on one, USB on another, the hardware clock on a third, parallel port probing on a fourth, serial ports on a fifth, and the Compaq Smart Array on a sixth.

The other six machines seem OK. But that's a 50% buggered rate from various flavors of 2.6 upgrades, mostly from nightly 'yum update's. These are all IBM, Compaq, HP, and Dell machines, so somebody's not testing on very mainstream hardware.

This is a very smart system. The Samba team uses something very similar. The key to finding regressions with this method is to create tests for every piece of functionality, and to integrate them with the rest of the testing suite, so that each function of the kernel is continuously tested. For new features, it is preferable to create these tests as the features are being coded. For the existing millions of lines of code, some brave souls will need to go in and create the tests.

I hope they are using code from the Linux Test Project suite; that piece of work has already built up a nice set of tests. Also, I hope that the kernel is automatically built with many different combinations of options, and that with time this will get better still. The more tests, hardware configurations, kernel configurations, and types of input data (including many imaginative forms of incorrect input, to verify that the kernel handles them gracefully and thwarts attacks based on such methods), the better the quality of the kernel will be; Linux may well end up unmatched in quality, stability, efficiency (well, maybe not necessarily efficiency), and long uptimes.
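One cheap way to get broad option coverage is the kernel's own randconfig make target; a sketch of the loop, where the tree path and file naming are placeholders:

    #!/usr/bin/env python3
    # Build the kernel repeatedly with random configs and keep each
    # .config so any build failure can be reproduced later.
    import shutil
    import subprocess

    TREE = "/usr/src/linux"  # placeholder path to a kernel tree

    def make(*args):
        """Run make in the tree; return True on success."""
        return subprocess.run(("make",) + args, cwd=TREE,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL).returncode == 0

    for i in range(10):      # ten random configs per run
        make("mrproper")
        make("randconfig")   # the kernel's own random-config target
        ok = make("-j4")
        shutil.copy(TREE + "/.config",
                    "config-%02d.%s" % (i, "ok" if ok else "FAIL"))
        print("config %02d:" % i, "built" if ok else "BUILD FAILED")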

With an automated test suite, what happens when a class of bug is discovered to be untested-for? Presumably, the suite is modified to detect it. Then, is the resulting new suite itself subjected to an automated test suite? And then... [divide-by-zero error...]

There is indeed an internal self-test suite on the harness. It's not desperately sophisticated, and I wouldn't dare show it to anyone ;-) However, it does catch a lot of stupid bugs. It requires some manual intervention/inspection to work.

Plus, there's a separate development grid where we test new test-harness code before it's put onto the production grid.

I got to work on part of this system, which IBM calls Autobench, for my senior project at PSU. The system is a highly configurable framework that can download, compile, and run various benchmarks and profilers (for example, while compiling a kernel). It's all centrally administered, so IBM can run a battery of tests on a variety of different machines at once.

I think Martin Bligh said that IBM has been using this for a while now, automatically downloading kernels upon release and testing them. The new thing is that the results are now being published for everyone to see.

I'd expect the community to start advocating unit testing, an agile development practice, at some point, to increase the reliability of code before it is even merged into the nightly builds.

I realize that this is not the same as testing the entire package on dissimilar hardware as he is doing here; for instance, there are bound to be a few issues when the developers of some code and of its underlying code base both submit updates the same evening. IMHO, it'd especially help new developers if there existed unit tests they could run against their changes before submitting them.

Individual distros have been doing this for years. Red Hat is one company that is known for its extensive testing of the kernel (as well as many other OSS projects). Don't use a vanilla kernel if you're running a production environment. Regards, Steve

Firstly, because both RHEL and SLES pull their base from the mainline kernel. I'm damned if I'm going to fix bugs 3 times - in RHEL, in SLES, and back in mainline. Let's fix it once, before it spreads.

Secondly, it's MUCH, MUCH easier to fix a bug the night after it went in than 3 months later. Everyone has the context of what's going on fresh in their minds, and the change hasn't been buried under 7 tons of other crap.

One of the main goals appears to be checking whether the kernel builds or not. I shouldn't have to tell Slashdot that build errors are among the most trivial of OS programming errors. They certainly exist, as the chart shows, but whoever is in charge of this project has a long way to go in adding real tests of functionality. Consider it job security ;)

For one, did you actually bother to look at the results at all, and at what tests are being run and published?

For another, this is only the tip of the iceberg as to what can be done, but I'm not going to lock whatever I have now in some dingy dungeon until it's "finished". What's there is useful, albeit incomplete. Testing is *never* complete.

The main goal, as you put it, is to improve the quality of the Linux kernel. If we can ensure the kernel builds, boots, and runs basic tests... in a fully automated way, that frees people up for the more sophisticated testing.

Well, there is testing. It's just done by firms like Red Hat, and it's not made publicly available. There's also a very public and fairly thorough testing procedure done by Debian, but they don't specifically target the kernel the way this particular system does.

I think the PostgreSQL build farm [pgbuildfarm.org] is one of the coolest I've seen. It's distributed across a bunch of volunteer-run machines representing a broader selection of architectures than almost any other automated-test project I'm aware of. A nice article on it can be found here [onlamp.com].

Any other projects out there with similar transparency in their automated testing?

The problem is that I have used Linux 2.6 extensively and exclusively for the whole of the last year, since I migrated from 2.4 because it lacked the features I needed. Back in the good old days, Linux had two branches: the development one - odd numbers - like 2.1, and the stable one - even numbers - like 2.0. Now the development happens in the "stable" branch - 2.6. And this results in big problems. You can get used to them on the desktop, but you go mad having to drive to the server room because your 2.6.x server has locked up.

Now really, dear sir, how am I to use 2.0, 2.2 or 2.4, since all the new features I need are only in 2.6? Because everybody believes that 2.6 is a stable release (which in theory is indicated by the even release subnumber), features get added to 2.6 and nobody bothers backporting them to 2.4. And this is really a big problem. If there were a development version, then 2.6 would be rock-solid stable, I would get no freezes, panics and other nasty events, and the features would trickle down to the stable kernel - like they used to.

Windows '95, '98 and ME are descended from DOS and Windows 3.x, and contain significant portions of old 16-bit legacy code. These Windows versions are essentially DOS-based, with 32-bit extensions. Process and resource management, memory protection and security were added as an afterthought and are rudimentary at best. This Windows product line is totally unsuited for applications where stability and security are essential.