A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, BFS and -ck.

Thursday, 16 August 2012

3.5-ck1, BFS 424 for linux-3.5

Thanks to those who have been providing interim patches porting BFS to linux 3.5 while I've been busy!
Finally I found some downtime from my current coding contract work to port BFS and -ck to linux 3.5, and here is the announcement below:

A little question: By diffing the old 3.4 bfs-424 with this new one I saw you have done a lot regarding NUMA. Is this about managing multiple CPUs at a virtual level?
I have a system: x86_64 Intel(R) Core(TM)2 Duo
Is the following NUMA .config acceptable, does it make sense?

CONFIG_NUMA=y
# CONFIG_AMD_NUMA is not set
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_ACPI_NUMA=y
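For anyone wanting to check a .config for these options without reading it by eye, here is a minimal sketch in Python. The function names `parse_kconfig` and `numa_enabled` are made up for this illustration, and the sample text is simply the fragment quoted above; it only understands the two line shapes shown there (`CONFIG_X=value` and `# CONFIG_X is not set`).

```python
def parse_kconfig(text):
    """Return {option: value} for kernel .config text.

    Options written as "# CONFIG_X is not set" map to None.
    """
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Drop the leading "# " and the trailing " is not set".
            options[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def numa_enabled(options):
    """True if CONFIG_NUMA is built in (hypothetical helper)."""
    return options.get("CONFIG_NUMA") == "y"


# The fragment quoted in the comment above, one option per line.
SAMPLE = """\
CONFIG_NUMA=y
# CONFIG_AMD_NUMA is not set
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_ACPI_NUMA=y
"""

if __name__ == "__main__":
    opts = parse_kconfig(SAMPLE)
    print("NUMA enabled:", numa_enabled(opts))
```

To use it on a real tree, read the kernel's .config file into a string and pass that to `parse_kconfig` instead of `SAMPLE`.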

linux-3.5.2-bfs-with-NUMA just started, works! Thank you for the suggestion to turn NUMA off. What motivated me at first with NUMA is this help text:
---
CONFIG_NUMA: Enable NUMA (Non Uniform Memory Access) support.
...
For 64-bit this is recommended if the system is Intel Core i7 (or later), AMD Opteron, or EM64T NUMA
---
I think my processor has EM64T. Now I will try a compile session without NUMA config!
Ralph Ulrich

Thanks CK! As usual, here are the results of my make benchmark on my dual quad machine. CK1 clearly differentiates itself from mainline with n=27 runs doing a `make -j16 bzImage` on v3.5.2.

http://s19.postimage.org/hxdw0papd/big_anova.jpg

Here is a link to my benchmark script: https://github.com/graysky2/bin/blob/master/bench

Details:
1) It is a non-latency based measure.
2) Compilation benchmark using gcc to “make bzImage” for a preconfigured linux 3.4.4 build.
3) Runs the benchmark 28 times in total to get a decent number of observations for a statistical comparison. In all cases, the first run is omitted, leaving n=27.
4) Results are how many seconds it took to compile on a dual Intel E5620 (2x hyperthreaded quad-core CPUs on a single board) @ 2.40 GHz.
5) Make is run with 16 threads (8 physical cores and 8 HT cores).
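The procedure in points 3 and 4 (repeat the build, time each run in seconds, drop the first run) can be sketched roughly as follows. This is not the linked bench script, just an illustration in Python; the function name `bench` and its parameters are assumptions, and a real run would also need to clean the build tree between runs, which is left out here.

```python
import subprocess
import sys
import time


def bench(cmd, runs=28, discard=1):
    """Run cmd `runs` times and return the wall-clock seconds of each run,
    dropping the first `discard` runs as warm-up."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings[discard:]


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere. A real run would pass
    # something like ["make", "-j16", "bzImage"] from inside a prepared tree.
    samples = bench([sys.executable, "-c", "pass"], runs=4)
    print(len(samples), "timed runs kept")
```

With the defaults, 28 runs minus the discarded first one leave the 27 observations mentioned above.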

Just to correct a mistake I made before at this url: http://ck-hack.blogspot.com/2012/07/bfs-424-test.html?showComment=1341259994610#c1478189585724857286

I benchmarked mainline against mainline through a mistake of not uncommenting the lines in my script that patch the kernel source with your patchset. So, bfs still reigns supreme. Rock on.

@kelvin - n=27 means that I repeated the make benchmark 27 times to get a good number of observations on which to base the statistics. Without a larger set, we cannot say that kernel A is faster than kernel B, for example.

I use a -j16 switch because the machine has 8 physical cores and 8 hyperthreaded cores. It was selected to maximize throughput.

Thanks for your reply. The resolution of the benchmark for your new Ivy Bridge CPU (wow, a great CPU which is quite expensive!) is too low.
It seems that the faster the CPU, the greater the difference between the mainline kernel and ck-patched kernel.

So, the above mentioned setup, just with 3.4.9 instead of 3.5.2, is running fine. (But only 3h of uptime so far.)

Thank you, Ralph! At least for the news that there may be coming something hopefully better for radeon. I have no experience with git so far and don't know if I need to learn it. (Me = being very lazy atm) ;-) Don't mind.

BTW. ... I kept/altered some of your recently proposed kernel config settings:
CONFIG_RCU_BOOST_PRIO=1 <--- giving a "-2" in top
CONFIG_RCU_BOOST_DELAY=440
and left CONFIG_HZ_1000=y <---------- following Con's advice for interactivity
(Some of your options are not relevant for UP single core systems so: not mentioned, just to be complete.)

Too sad that these parameters including scheduler choice still aren't runtime-configurable.

Seems like with either 3.4.9 or the BFQ-addon patch
https://groups.google.com/forum/?fromgroups#!topic/bfq-iosched/hPe2jFW55Is[1-25]
I do not need to further fiddle with schedtool to fight video/audio playback hiccups on my weak old system.

Of course, never "needed" by design, as Con wrote.
But it's still a nice and handy tool to adjust priorities of certain processes within the BFS schema when the kernel / application & combination shows glitches.

E.g., I recently used these two on kernel 3.4.7:
schedtool -R -p 1 -n -10 `pidofproc pulseaudio`
schedtool -I -n -5 `pidofproc Xorg`
These haven't solved the issue of sound stalls with pulseaudio while playing video, but eased them at least. (On my subjective scale from -5 to +5: changed from -5 to -1.)
(And, please, also read the schedtool manpage for detailed understanding.)

I can confirm too that disabling NUMA is beneficial for desktop systems. Just look at the results of the make benchmark that I wrote about earlier in this blog. Here you see the mainline "3.5.2-1-ARCH" vs. two different BFS patched kernels. One has NUMA enabled per the Arch Linux defaults and the other has it disabled:

http://s19.postimage.org/a8mk5gxgz/3770k.jpg

There is a clear and statistically significant difference in compile times (n=28) with the median gain through disabling NUMA being 344 ms. From my research, unless your hardware has >1 PHYSICAL CPUs -- not cores but physical processors -- it is advantageous to disable NUMA as measured by this non-latency endpoint.
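For readers who want to run the same kind of comparison on their own timings, here is a rough sketch in Python. It is not the analysis behind the chart above (the earlier image name suggests an ANOVA was used there); it swaps in a simple permutation test on the difference of medians. The function names are invented for this example and the numbers are made-up placeholders, not the measured results.

```python
import random
import statistics


def median_gain(baseline, candidate):
    """Median of baseline minus median of candidate, in the same units
    as the inputs (positive means the candidate was faster)."""
    return statistics.median(baseline) - statistics.median(candidate)


def permutation_p_value(baseline, candidate, rounds=2000, seed=0):
    """Two-sided permutation test on the difference of medians.

    Shuffles the pooled timings and counts how often a random split
    shows a median difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(median_gain(baseline, candidate))
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if abs(median_gain(pooled[:n], pooled[n:])) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)


# Placeholder compile times in seconds; NOT the data from the benchmark.
numa_on = [60.40, 60.35, 60.42, 60.38, 60.41, 60.37]
numa_off = [60.05, 60.02, 60.07, 60.03, 60.06, 60.04]

if __name__ == "__main__":
    print("median gain (s):", round(median_gain(numa_on, numa_off), 3))
    print("p-value:", permutation_p_value(numa_on, numa_off))
```

A small p-value only says the two sets of timings are unlikely to come from the same distribution; whether a gain of a few hundred milliseconds matters is a separate judgement.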

If all major distros do something because their peer group did it, this is a shameful reason to perpetuate it -- particularly considering the data showing that it is a performance regression for those users with only one physical CPU. In other words, if 99.99999 % of distro users have only one physical CPU (home users/laptop users) and thus are impacted detrimentally by this option, why in the world would we enable the option in the kernel package that benefits the 0.00001 % of users that do? Just because our peer groups do it is no justification that it is a data-driven and sound decision.

I made this very point to the Arch Linux kernel devs in a feature request. Let's see if they agree... https://bugs.archlinux.org/task/31187

Last weekend I applied BFS for 3.5 on a machine where I happened to have some kernel hacking options enabled in the kernel config, and I got a WARNING and a suspicious RCU usage message in dmesg. I then traced back to kernel 3.3 and also tested on another machine; the issue is still there, so I believe it is a long-standing issue. I post the dmesg here and have attached one of the kernel configs of my machine.

Just a note about systemd: Although it is mentioned that BFS doesn't work with cgroups, this just means you are unable to manipulate the scheduler using cgroups. But

Systemd plays well with .config having:
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED=y
# CONFIG_CGROUP_MEM_RES_CTLR_KMEM is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CFQ_GROUP_IOSCHED is not set
CONFIG_NETFILTER_XT_MATCH_DEVGROUP=m
# CONFIG_NETPRIO_CGROUP is not set

I am just trying systemd-188
https://forums.gentoo.org/viewtopic-p-7120752.html#7120752
Ralph Ulrich

Using OpenSuse 12.1 with systemd and zen Kernel (with BFS and BFQ). Was much simpler than your way on gentoo ;)

No problem here, my .config differs:
# CONFIG_CGROUP_MEM_RES_CTLR is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_BFQIO=y

Don't want to miss systemd; it's nice to see how simply you can view the boot time graph for the different daemons, and how quickly you can tweak the boot process. Maybe some day BFS won't work together with systemd (as I read, systemd tries to expand to a kind of super user space daemon, big brother for all tasks, with resource management), but at the moment there is no problem so far.

Btw. On OpenSuse my shutdown time was dramatically reduced with systemd too.

@Mike, I am pretty sure
- you run systemd-44 with consolekit
- BFS enabled linux kernel does not affect systemd
My tries are about newest systemd-189 with advanced features. openSUSE-12.2 will not update to newer systemd ...
Ralph Ulrich

@Ralph Ulrich, et al.,
some weeks ago you suggested settings for the RCU subsystem. Some of them for non-UP systems (so: not for my machine). Some of them as a workaround for you in those days.
That was:
CONFIG_RCU_BOOST_PRIO=14 (workaround)
CONFIG_RCU_BOOST_DELAY=440

Do you have insights or experience with the last one to share with us?
I've already googled the world but did not find a clue. This setting does have an important influence on I/O vs. interactivity/responsiveness.
Currently I'm running 3.5.3 with BFS + BFQ + BFQ-Addon @ 1000HZ & CONFIG_RCU_BOOST_DELAY=333. ATM, I can only say that this favours disk I/O a bit with unchanged responsiveness compared to default (500). Worst cases to test with: writing files to NTFS partitions via ntfs-3g and playing video in parallel; and watch out for stuttering audio/video.

Please excuse me for this off-topic posting. But I expect Linux specialists to come back here who possibly know how to help me.

Issue: I've upgraded from openSUSE 12.1 to 12.2 3 days ago. With that change I got from gcc 4.6.2 to 4.7.1. But with that I don't get the same kernel compiled in the same way. It does not call the PM resume sequence any more after suspend to disk & later poweron.

If I use the old 3.5.3-BFS kernel compiled with 12.1/4.6.2 everything works fine.

in case anyone wants to apply BFS 424 to 3.6.0-rc5: i made the following patch (without intrinsic knowledge of the code, so it may not be perfect). It compiles, boots, and has been stable since then, but is otherwise untested. ;)

A queued fix for the coming linux-3.5.5 will break bfs-424 in include/linux/sched.h:
queue-3.5/sched-fix-race-in-task-group.patch

Because the new code concerned is never used with BFS, it will be an easy fix to do: just reorder the patch. As Martin pointed out above, besides the change with cpuset_update_active_cpus, this is also the only thing to change for linux-3.6-rc. So we wait for a new common ground: BFS-425

Dear Con,
I'm really sorry. If I had known that before, I wouldn't have written this initial posting.
I do not depend on your patchset, but originally and generally wanted to let you know that we appreciate your ongoing work.

Your raise-max patch seems to be going in the wrong direction for desktop. I have found 90hz to give optimal jitter in OpenGL, which pretty much translates to a well-running system. You can try my low-jitter kernel here: http://paradoxuncreated.com/Blog/wordpress/?p=2268

PS: This is not -ck. CFS gave less jitter in OpenGL. I like CFS granularity also, and have set it to a suitable value.

Excellent post, Ralph Ulrich. Actually you are the first poster in twenty who actually understands this. Of course on servers many understand this. On desktop, there seem to be few who understand this, and higher hz and suboptimal configs are much more common there (for instance 250hz, no preempt, like a standard ubuntu kernel - mindless). It seems to be similar people that argue for many services/drivers on windows, and tubeamp/vinyl in audio, or similar things.

I played with kernel 3.6 today and I found that the BFS 424 patch for 3.6-rc5 I posted previously still applies to 3.6. I have also made a ck1 patch available here:

http://www.filefactory.com/file/6wddd3pfr2mf/n/patch-3_6-ck1_bz2

The usual big disclaimer: I have simply adapted the existing coding so that it compiles and runs on 3.6 (posting this from a running 3.6-ck1 kernel). I am not in a position to maintain the actual semantics of the coding. That is Con's prerogative.

@Martin
Thank you for providing interim patches for us in the meantime!
I want to bug-report that (at least) the latest BFS-only patch you made is not uniprocessor friendly. Only capable of SMP.
What I've found so far is: It's related to the changes you've made in the patch regarding include/linux/sched.h (HUNK 3 @@ -1239,17 +1244,36 @@)
ATM I don't really know how to fix this. Any advice?

Thank you in advance,
Manuel Krause

Appendix: Compiling error output:
  CC      kernel/sched/bfs.o
kernel/sched/bfs.c: In function ‘task_running’:
kernel/sched/bfs.c:467:10: error: ‘struct task_struct’ has no member named ‘on_cpu’
kernel/sched/bfs.c: In function ‘sched_fork’:
kernel/sched/bfs.c:1742:3: error: ‘struct task_struct’ has no member named ‘on_cpu’
kernel/sched/bfs.c: In function ‘schedule’:
kernel/sched/bfs.c:3314:7: error: ‘struct task_struct’ has no member named ‘on_cpu’
kernel/sched/bfs.c:3315:7: error: ‘struct task_struct’ has no member named ‘on_cpu’
kernel/sched/bfs.c: In function ‘init_idle’:
kernel/sched/bfs.c:5010:6: error: ‘struct task_struct’ has no member named ‘on_cpu’
kernel/sched/bfs.c: In function ‘task_running’:
kernel/sched/bfs.c:468:1: warning: control reaches end of non-void function [-Wreturn-type]
make[2]: *** [kernel/sched/bfs.o] Error 1
make[1]: *** [kernel/sched] Error 2
make: *** [kernel] Error 2

hmm. thanks for the feedback. The on_cpu error is easy enough to fix (and, i have to admit, was unnecessarily introduced by myself). However, when testing UP configurations I hit on another problem in the context of urwlocks which I can't get my head around so easily. I'm afraid that one is left to the man himself...
In other words: for SMP configurations my previous patch works fine, but for UP I have no real solution yet.

@Martin - Thanks for working on the unofficial ck1 and bfs patchsets. Filefactory and captcha are lame. I would be glad to mirror your patches on http://repo-ck.com which houses the unofficial Arch Linux-CK packages.

Here is a link to my benchmark script: https://github.com/graysky2/bin/blob/master/bench

Details:
1) It is a non-latency based measure.
2) Compilation benchmark using gcc to “make bzImage” for a preconfigured linux 3.6.1 build.
3) Runs the benchmark 28 times in total to get a decent number of observations for a statistical comparison.
4) Results are how many seconds it took to compile on a 3770K @ 4.5 GHz.
5) Make is run with 8 threads (4 physical cores and 4 HT cores).

thx for benchmarking the patch. That's a good indication that nothing got horribly broken. Btw, I am still looking for a benchmark measuring the "desktop fluidity" or -- to translate a word invented by German c't magazine -- the "swoopdicity" of a system. At least that's where I feel the benefits of BFS.

Thanks also for pointing out O. Natalenko's site. cool stuff.

And yes, from my side there is no problem hosting the patches elsewhere. I just used the first file hoster Google would find. It's all in the cloud anyway. ;)

@Micron:
I use the opensource radeon for my graphics. When going to a new kernel I often have to reboot at least twice.
First reboot may lead to a blank/black/striped screen. But keyboard control is active after 40s. So I usually can login blindly and do the reboot (avoiding disk loss compared to RESET-button).

So, I had been running openSUSE kernel 3.6.1 with my old setup (most recent BFS-only from Martin + mm-drop_swap_cache_aggressively.patch + most recent BFQ, UP-system) for over 2 days of uptime without any problems. {Please keep in mind, that the BFS-patch always needs minor adjustments for openSUSE kernel-sources.}

Maybe something like that has hit Micron? (I always watch the patching output before make and compilation output before rebooting^^ !)

Then, I tested the ck1 provided by Martin and it failed compiling after a few minutes. Error output at the end. I then reverted bfs424-grq_urwlocks.patch from Con's broken-out (http://ck.kolivas.org/patches/3.0/3.5/3.5-ck1/patches/) and it compiled fine and is up and running since this afternoon without issues.

I have now tried both BFS and CFS, with as many low-jitter tweaks as possible. My impression is that on jitter-sensitive applications like Doom 3, they can perform very similarly. On additional compatibility layers like wine, which are even more jitter sensitive, BFS jitter extremes seem higher. Meaning average jitter is lower, but wine has some 1 second jitters with BFS. CFS has higher average jitter, but no 1 second jitters.

Both tested with high_res_timers off, 90hz timer, and a fast granularity setting for a psychovisual jitter-profile of natural.