A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.

Sunday, 3 October 2010

Of jiffies, gjiffies, miffies, niffies and BFS 350

So we like to think that our software is perfect if there are no bug reports. Truth be told it's been a very long time since anyone's reported anything on BFS, which could be both good and bad. It means either no one is using it, or no one is having any bugs with it, or no one is inclined to report anything. Since I get about 10,000 downloads a month of it, I'm pretty sure that someone is using it, but I don't really know for certain how much of the other two it is. Anyway the point is that I hadn't really hacked much on it for a while now and had only been porting it with each stable kernel release as they came out, until I started working on 330. So for the next version, I planned to try and commit some more changes along the "hope it fixes android" lines and while I was at it, to try and find a few minor optimisations, and audit the code in the process.

The main change in 350 was changing the deadline system to stop using jiffies. I had already done so in 330 by using a local 64 bit jiffy counter I called gjiffies (for global runqueue jiffies), which was just a wrapper on 64 bit builds, but had locking and a u64 variable on 32 bit builds. Currently, BFS uses microsecond accounting for timeslices. The reason for doing so is that we don't have hardware that can do a context switch in less than a microsecond (some take up to 30microseconds) and to avoid overflowing on 32 bits. If I were to use nanoseconds, a 32 bit signed counter would overflow in just 2.1 seconds. So anyway to try and improve the resolution of the deadlines and maintain some deadline difference on lower Hz builds (specifically 100Hz), I changed the local jiffy counter I introduced in 330 to a monotonically incrementing microsecond counter I called miffies. This worked off the dreaded sched_clock interface we have that gets time from the TSC nanosecond timer on each CPU. This timer sounds great in principle, but often returns bogus values, looking like it goes backwards or jumping forwards dramatically, fluctuates with CPU frequency scaling, and the timers are not aligned between CPUs on SMP. Numerous sanity checks exist in the sched_clock_cpu interface to make the number meaningful, but still is only useful as used by one CPU. I used every CPU to update the monotonic counter by just adding their difference to the time whenever they were called, and scaled it to microseconds. Thus I ended up with a relatively monotonic microsecond counter in 64 bits that could be used everywhere. Serge Belyshev quickly pointed out that there wasn't much point it being microsecond since it was 64 bit anyway, and I moved from miffies to niffies, since nanoseconds won't wrap in 64 bits for 500 years. So in 350, all timing of deadlines is now done comparing times to the monotonic 64 bit nanosecond niffies timer.

Anyway, the point of all this, is that deadlines used to be limited by jiffy resolution, which means whatever Hz the build was compiled as. That meant that on 100Hz, it's possible for many tasks to have the same deadline, dramatically decreasing the utility of having the deadlines in the first place. It also means that the deadline differentiation should be valid, even long after the "virtual deadline" has passed (see the discussion on the BFS 330 update).

The next thing I added to improve management of low Hz builds was to test for whether a task was starting in the first half or 2nd half of a tick (jiffy), and then use a "dither" flag to determine whether we should expire that task early once the tick fires. The reason for this is to minimise the extra time a task can get running beyond its allocated timeslice when we're limited by Hz resolution. So if we have say our rr_interval set to 10 milliseconds, and our jiffies are 10 milliseconds long (as is the case at 100Hz), then it's possible, if a task starts just after a tick, that it can run for just under 20 milliseconds. With the dither flag, since it starts in the first half of that tick, when it gets to the next tick, it will determine that it has less than half a tick left, and expire it early. Thus, the longest a task can overrun its quota is only half a tick instead of one tick, meaning 5 ms in this example. This is not an expensive operation, but of course it would be much better if BFS wasn't Hz limited in this way at all by using high precision event timers to expire tasks at the right time. However, there are 2 reasons I haven't done this. First, hpet isn't available on all hardware and does have a slight performance hit. But second, and most important, I've never bothered to figure them out and do it properly. That's why I just suggest people just raise their Hz.

Apart from those major changes, a whole swag of things I had been considering doing I also committed. I use close shifts (right shift by 20 instead of divide by 1000000000 for example) where possible instead of division since it's cheaper. ISO calculations which are done during a scheduler tick now have their own lock to avoid grabbing the grq lock. I found some mistakes in the test for preemption when there are tasks of different scheduling policies, so SCHED_BATCH behaviour wasn't working properly, and sometimes the not-lowest priority CPU was being chosen to test against when there may have been a lower priority task running elsewhere. That's been fixed. I decreased the maximum rr_interval possible to 1000 (1 second) since new changes mean that 5 seconds would overflow, and higher values don't really help anyway. I added a local last_task variable per CPU, which could tell if that task was cache warm still on that CPU, even if it had run elsewhere. I dropped the first_time_slice flag as it was adding overhead and was not helping as it was possible to retrieve timeslice from children long long after they were forked if they ran only very little. I made sure timeslice was split properly between parent and child on fork() and that a reschedule is forced if the relative loss of timeslice meant the parent ran out. I changed reschedule to use 100 microseconds as the cutoff for expiration of timeslices since rescheduling with less time available than this will either greatly overrun its quota or reschedule very quickly. SCHED_BATCH tasks now refill timeslice and reset their deadline every time they're rescheduled which means they're lower weight than previously, but since they've been flagged as latency insensitive, it means they're likely to be fully CPU bound anyway and will only be rescheduled when they run out of timeslice or are preempted. I bypass the test for earliest_deadline_task now in schedule() when there is nothing else running and the previous task will simply run again. Numerous other minor cleanups and fixes were committed.

So, after that long-winded discussion, what's the advantage of BFS 350 over 318 onwards, and what should you see? Hopefully... nothing really obvious. There may be some lower latencies, and possibly very minor improvements in throughput and behaviour when many different scheduling policies are used at once. So far the feedback from people on 2.6.35.X has been that they haven't noticed a difference. Unfortunately, that is not the case for those older kernels! I probably should never have started backporting these changes to the older kernels since I haven't been extensively testing the backports. Ironically, though, it's those stuck on those older kernels that made me start making these changes in the first place! Furthermore, I now have no idea what's going on in android land. So I'm in a bit of a bind now, and I'd love to get feedback from those who are on those older kernels, and comparison with newer kernels with 350 and old. But if you're not adventurous and are on an older kernel and just want stable, stick to BFS 318. It may be that the whole deadline expiry thing is STILL needed, yet I honestly don't know why :(

Oh, I also noticed that despite quite a few people reading this blog, there have been virtually no comments. I know that many many more people read than comment, and that people may not feel qualified to comment, but a simple yes/no, good/bad would be greatly appreciated.

UPDATE: It may well be that the behavioural problems are just 32 bit overflows after all. Texstar of pclinuxos has been testing 32 bit kernels and found them to be problematic. More on this as he helps me sort it out.

8 comments:

I guess this is one of the cases where one is longing for the old days, with the stable vs dev branches. Nowadays Linux is changing too fast and is turning into something that resembles Fedora development :-/

Indeed, it's hard to not see it that way, RealNC. It's hard to even think of the "stable" 4 point release kernels as stable when you look at some of the patches going in. I'm glad I stuck to my policy of only releasing -ck for 3 point release kernels as it would have been a madhouse of updates if I tried to stay in sync with them.

I'm was very happy to see you begin this blog. I will stop in to read what you are doing, regularly.

I occasionally create packages for PCLinuxOS and was glad to see Tex include your scheduler as the default. I am also a musician and hope to get the time to start doing some real music production with the computer some day.

I'm posting because you ask for comments. I don't often compile kernels or do any sophisticated benchmarking, so I can't comment on many of the specifics. But, I am curious if a newer kernel and your scheduler with tweaks can achieve high-quality music production or if real-time patches still are required for the best results?

I may install one bfs-enabled kernel on my N900. It's frustrating when I'm using the GPS, the media player and the phone rings....it's just impossible to get the call.Also, It's interesting how big cell phone vendors stick with mainline and do nothing about interactivity when it's proven that if you do, you get to see the results. It's freaking phone!!!Keep it up mate!

I wasn't aware that PCLinuxOS has a BFS-patched kernel as its default. I just looked that up and confirmed it. That distro has quite a big userbase. Is there actually any feedback from the distro devs?

Yes, I get quite positive feedback from them. On the 357 post you'll see they're responsible for spotting a regression in 350 as well. There are some other distros that user BFS as well, and a lot more that offer a BFS/-ck optional kernel.