A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.

Monday, 18 October 2010

Other schedulers? Illumos?

I've been asked in past interviews what I thought about schedulers from other operating system kernels that I may have studied, and whether I had any interest in hacking on them. Perhaps it's somewhat of a surprise to learn that until recently, I hadn't really looked at any other operating system's scheduler. Most of what I've come up with for scheduling is really pretty simple stuff to manage the most common scheduling policy, partly thought up myself and partly based on accounts of various approaches to scheduling I've encountered in my travels. So my approach to things is pretty much "well this hack makes sense", based on some pretty basic principles to do with latency, fairness and CPU throughput.

A couple of days ago, I was asked what I thought it would take to port BFS to the opensolaris spinoff thingy Illumos. I don't pretend to even begin to understand the politics behind things, so I have zero informed opinion about the whole Sun thingy. I pretty much had no idea how much effort it would take, but knowing BFS is simple code, I figured that once the kernel interfaces were understood, it should be doable on any other operating system. Of course it would be much easier if it were done by someone who already knew the kernel they were working with rather than starting from scratch as I would be, but then, few people are likely to truly understand my intent with BFS unless they've studied it, no matter how simple the overall policy in BFS is.

With any existing modern kernel scheduler, there's a very good chance that an awful lot of effort has gone into making the scheduler scalable on SMP, and one of the very first things done there is to give the scheduler multiple runqueues, one for each CPU, or possibly one per physical package, shared amongst cores (much less common). Then, as CPU architectures become more complex, with non-uniform memory (NUMA), shared physical packages (multicore), multiple threads per core (SMT) and combinations thereof, special cases are made for balancing where and when tasks go to make the most of these, but usually the special casing is only about improving throughput.

BFS in concept is two distinct entities. The first is avoiding all heuristics for determining what is interactive and what is not, using the simplest possible foreground-background virtual deadline to determine who goes next, and using a deterministic round robin system for sharing CPU time between all tasks. This is the overall policy for managing SCHED_NORMAL, otherwise known as SCHED_OTHER in the past, and it is responsible for the good interactivity, low latency and responsiveness. (I've explained in many articles previously why I believe all heuristics are bad.) The second entity in BFS is to throw out complex separate runqueue designs and special balancing, and have one runqueue to rule them all. This is a step backwards in possible scalability and, depending on the workload, becomes more and more of an issue the more CPUs you add, but it is not relevant on any machine a normal human being can afford for a home desktop, even an extremely powerful one (note: whether this is true 10 years from now remains to be seen, as I still think BFS will be fine up to many threads on multicore).

So to port BFS to another kernel would be porting two different concepts. The first is simply the policy for managing SCHED_OTHER tasks. This is a fairly portable concept and likely can be leveraged into just about any other scheduler without too much effort, provided someone knows the kernel they're working with. The second part, however, would involve carving out large chunks of the existing complex system and replacing them with a rather simple design. As it is, if you remove the CGROUPS code as well as the CFS code that BFS replaces in the mainline linux kernel, you'd end up with almost 20,000 fewer lines of code. This is pretty drastic by anyone's standards, since it replaces the scheduler en bloc. The current BFS patch does not actually remove these lines of code, though, as many people requested I make it a configurable patch at build time. So it basically compiles out all that code instead of removing it.

Now one could make BFS plug into a "pluggable scheduler", and some of you may remember I posted patches for just such a feature for the linux kernel about 4 years ago. However, "pluggable" is a relative concept, depending on just how much of the scheduler you replace, and BFS replaces most of it. The current mainline kernel has pluggable policies, but isn't really pluggable on such a low level that BFS can happily live beside it (and the truth is that most of us don't "choose" a policy when we run our apps, they all run SCHED_OTHER unless we explicitly set them otherwise). The same goes for any other kernel, and the Illumos one is a perfect example since it has 4 different schedulers "plugged" into it, but probably not pluggable in the sense of "replaceable".

So anyway, where was this post going? Today I decided I'd have a look through the opensolaris code. Not because I particularly want to port BFS to it, but because someone else said they want to do so, and out of morbid curiosity of my own to see what it was like.

The summary of my impression was that I was... surprised. Now I don't claim to be any kind of expert on code per se. I most certainly have ideas, but I just hack together my ideas however I can dream up that they work, and I have basically zero traditional teaching, so you should really take whatever I say about someone else's code with a grain of salt. Well, anyway, the code, as I saw it, was neat. Real neat. Extremely neat. In fact, I found it painful to read after a while. It was so neatly laid out that I found myself admiring it. It seems to have been built like an aircraft. It has everything that opens and shuts, has code for just about everything I've ever seen considered in a scheduler, and it's all neatly laid out in clean code and even comments. It also appears to have been coded with an awful lot of effort to ensure it's robust and measurable, with checking and tracing elements at every corner. I started to feel a little embarrassed by what we have as our own kernel. The more I looked at the code, the more it felt like it pretty much did everything the Linux kernel has been trying to do for ages. Not only that, but it's built like an aircraft, whereas ours looks like a garage job with duct tape by comparison.

As an aside, I did google a few terms they used which I hadn't seen before, and I was more than a little disappointed to find patents regarding the names... Sigh.

Now this would be a great time to take my comments out of context without reading on. The problem is that here was a scheduler that did exactly what I hate about what the Linux kernel scheduler is becoming. It's a monstrosity of epic proportions, and as far as an aircraft goes, it's like taking an Airbus A380 on a short joyride if you're running it on a desktop. It looks like a good, nay, great design for a massive airliner. By looking at it alone, I haven't got the foggiest what it might run like on a desktop. Now since I'm full of opinion and rhetoric and don't ever come through with any substance (maybe writing my own scheduler invalidates that?), I'm going to also say I can't even be bothered trying it, for you to confirm your suspicions about me.

So what do I think of it now? It looks like an excellent design for a completely different purpose. It's built like a commercial design for commercial purposes that have very different requirements than what most of us use Linux for, but it does appear to have been done so very well. It looks like a goddamn Star Destroyer, and the Linux kernel (scheduler) suddenly looks like the Millennium Falcon. Real fast, but held together with duct tape, and ready to explode at any minute. Or to put it another way, my wife used to work for a local automotive manufacturer (in Australia), and brought home the latest (American) Cadillac they were playing with at her work about 10 years ago. When I stepped into it, it looked and felt like a lounge room! It didn't feel at all like stepping into a car. It had everything that opens and shuts, and every control you'd imagine when sitting in your lounge room with your dining setting and entertainment system. Then I drove it around the block, and it had the performance, handling and braking of a lounge room too.

11 comments:

The one thing I admire in Solaris is its uniformity. You can actually target it as a platform without much hassle. Its kernel even has a binary driver ABI! That means hardware vendors offer installable drivers that actually work, like for Windows and OS X.

I once tried it on my desktop machine. It doesn't run too bad. So I guess the scheduler is useful on normal machines too.

What was a complete show stopper for desktops though is the lack of support for modern desktop hardware capabilities (meaning sound cards and GPUs).

Then tell me why Solaris is nearly dead? It's living only thanks to zfs. It seems you can't stand the fact your code was wiped out from the kernel and Ingo took your role. Furthermore, you're a bit of a hypocrite, because it's your scheduler that seems to be held together with duct tape compared to CFS.

To be fair, CFS tries to do much more than Solaris or BFS. And Linux actually succeeds, mostly.

Besides, while Solaris' scheduler might look nice, it's been very well known for its crappiness (the 'Slowlaris' nickname didn't appear just by chance). And it also has some pathological behavior when Zones are used.

The nickname "Slowlaris" emerged because the TCP/IP stack of the old Solaris was slow, as explained by Sun CEO Jonathan Schwartz. The "Slowlaris" performance problem had nothing to do with CPU performance or the scalability of Solaris, only with the network stack. The Solaris 11 stack has been rewritten and it beats Linux in benchmarks - on the same hardware.

Actually the TCP/IP stack was re-written for Solaris 10, released in 2005. The "Slowlaris" nickname is from the period before that. And, yes, I have the same impression as you do: on similar standard x86 hardware I can extract more server juice from Solaris than I can from, say, Red Hat 6. YMMV.