It is not clear whether the bug I posted about here is still present; since it is extremely rare and hard to reproduce, it is impossible to know for sure. A number of minor issues were addressed in the code which will hopefully have fixed it (and hardly anyone was affected anyway).

The changes since BFS version 0.416 include a fairly large architectural change just to bring the codebase in sync with 3.3, though none of it should be noticeable in any way. One change that may be user-visible is that high resolution IRQ accounting now appears to be on by default for x86 architectures. System time accounting is wrong in BFS without this feature enabled, so this should correct that problem.
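For those wanting to check their own builds: the option referred to here is presumably the kernel's fine-grained IRQ time accounting (the exact option name below is my assumption, not stated in the post), which can be verified in the kernel config:

```
# Assumed option: fine granularity IRQ time accounting, which samples
# IRQ entry/exit times rather than relying on coarse tick accounting.
CONFIG_IRQ_TIME_ACCOUNTING=y
```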

Other changes:

416-417: A number of ints were changed to bool, which, though unlikely to have any performance impact, makes the code cleaner, and the compiled code often comes out different. rq_running_iso was converted from a function to a macro to avoid a separate function call, with its attendant overhead, when compiled in. requeue_task within the scheduler tick is now done under lock, which may prevent rare races. test_ret_isorefractory() was optimised. set_rq_task() was not being called on tasks being requeued within schedule(), which could possibly have led to issues if the task ran out of timeslice during that requeue and should have had its deadline offset. The need_resched() check at the end of schedule() was changed to unlikely(), since it really is that. The scheduler version print function was moved to bfs.c to avoid recompiling the entire kernel whenever the version number changes.

417-418: Fixed a problem with the accounting resync for linux 3.3.

418-419: There was a small possibility of an unnecessary resched in try_preempt: if a task had changed affinity, it could call try_preempt with its ->cpu still set to the old CPU it could no longer run on, so try_preempt was reworked slightly. Reintroduced the deadline offset based on CPU cache locality for sticky tasks, in a way that is cheaper than how the deadline is currently offset.

419-420: Finally rewrote the earliest_deadline_task code. This has long been one of the hottest code paths in the scheduler, and small changes here that made it look nicer would often slow it down. I spent quite a few hours reworking it to use fewer gotos, disassembling the code to make sure it was actually getting smaller with every change. Then I wrote a scheduler-specific version of find_next_bit that could be inlined into this code, avoiding another function call in the hot path. The overall behaviour is unchanged from previous BFS versions, but initial benchmarking confirms slight improvements in throughput.

Now I'll leave it open for wider testing to confirm it's all good, and then I have to think about what to do with the full -ck patchset. As I've said in numerous posts before, I'm no longer sure about the validity of some of the patches in the set, given all the changes to the virtual memory subsystem in the mainline kernel.

No. I only suggest that the outliers indicate the scheduler stumbles once in a blue moon. That is not good, and it is important to watch and understand them. You seem to suggest that they have something to do with scalability to a large number of CPUs. So, they may or may not be relevant to PR > 41.

OK... here is the standard make benchmark on 3.3.0 / 3.3.0+bfs v0.418 / 3.3.0+bfs v0.420, run a total of 9 times. There is a statistically significant difference between both bfs-patched kernels and the native kernel.

The 'make benchmark' is compiling linux 3.3.0 via 'make -j4 bzImage' and timing the result, then repeating 9 times. This is done via a bash script. The results are for my X3360 workstation (4 cores, no HT). I will repeat on the dual quad tomorrow and post again.

GREAT JOB CK!

P.S. I can now post a link to an image but how can I post the image embedded in my reply? The blogspot server tells me I cannot use an img tag? If anyone knows, please post the html code for me.

@Ralph - It means that the difference in the median values of the 9 runs (461 ms) IS statistically significant for bfs v0.420 vs cfs, but the differences between v0.418 and v0.420 are not.

This to me is an artificial endpoint, used only to see whether the code has retained its efficiency with regard to this endpoint. This data set says nothing about interactivity. The fact that the two BFS kernels are no different from each other is good news. The fact that both are measurably better than CFS is also good news. What it means under real world computing is not the point of this experiment.

There is a statistically significant difference between the two bfs patched kernels compared to the cfs in mainline, but the two bfs kernels do not differentiate themselves from each other with n=9 on this machine.

There is a statistically significant difference between both bfs patched kernels compared to the cfs in mainline, AND for the first time, a difference between the two bfs's with v0.420 differentiating itself from v0.418!

Again, great job, CK. Seems as though the refinements to the code you outlined scale very well! I don't understand why the difference on the quad machine. I only ~doubled the n value as you can see (8 vs 19). Perhaps I'll power the statistics with a higher n value on the quad and see if there is a differentiation between the two bfs kernels...

The 'make benchmark' is compiling the linux 3.3.0 via 'make -jn bzImage' and timing the result. The "n" corresponds to the number of cores on the box. In this case, n=4 for the quad and n=16 for the dual quad. This is done via a bash script.

Excellent, thanks again. The changes will be bigger on more core machines as they are scalability improvements. I suggested that earlier when I said there was a trend but not statistically significant difference in the quad core and I expected it would be greater on the 16x machine.

It was only cosmetic code anyway that is being deleted with that patch, so it will be part of the stable release. Otherwise, BFS is pretty much complete for 3.3. Coincidentally, I'm working on a cut-down -ck at this moment. I'm removing a few of the more invasive, unproven patches.

I do not buy many of Mike Galbraith's arguments. For example, who cares that a top snapshot shows some fairness imbalance at a given moment, as long as its fluctuations are not too large or too long-lived.

It is, however, alarming that heterogeneous load is problematic for BFS. The homogeneous make -j does not seem to capture that. A mixture of make -j and x264 encoding would be an interesting case. Please comment.