A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.

Monday, 20 February 2017

linux-4.10-ck1, MuQSS version 0.152 for linux-4.10

Announcing a new -ck release, 4.10-ck1, with a new version of the Multiple
Queue Skiplist Scheduler, version 0.152. These are patches designed to
improve system responsiveness and interactivity with specific
emphasis on the desktop, but configurable for any workload.

As with the 4.9.0 -ck, I am getting huge spikes in several CPU monitors while the system is actually idle process-wise. `top` sees my CPUs at '100% si' almost constantly, `xosview` displays 100% "SYS" spikes at intervals of a second or less, and XFCE4's xfce4-systemload-plugin shows the CPU at 100% constantly. I have CONFIG_HZ=300 set. Any hints on how to get usable CPU load monitoring again?

It's an accounting error (it's not actually using extra CPU.) Unless you can hack the code and fix it, there's nothing more you can do until I find time to investigate and fix it (which alas won't be any time soon.)
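For what it's worth, `top` derives the '% si' figure from /proc/stat, so one way to confirm it is only an accounting artefact is to watch the raw counter directly (a quick sketch; field position per proc(5)):

```shell
# top's '% si' column comes from the softirq tick counter in /proc/stat:
# on the aggregate 'cpu' line, softirq time is the 8th field (see proc(5)).
awk '/^cpu /{print "softirq ticks:", $8}' /proc/stat
sleep 1
awk '/^cpu /{print "softirq ticks:", $8}' /proc/stat
# If this counter barely moves between the two samples while the monitors
# claim 100% si, the monitors' percentage math (accounting) is what's wrong.
```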

Great work once more on updating MuQSS. Personally I think it's a great scheduler. I've been getting very impressive results from it when combined with the schedutil governor and using yield_type 2, interactive 1 and rr_interval 1.

Not only is the system incredibly responsive, but performance seems to be the best as well. Mileage may vary for other people, but I could not be happier.

Astonishing. And this doesn't hurt throughput in any way? In my earlier testing, some years and kernels ago, setting it to 1 not only affected disk I/O negatively, but also left graphics and audio unable to stay in time.

Actually, I'm running rr_interval 1, interactive 1 and yield_type 2 and whereas one might expect that to hurt throughput, from some testing (both synthetic as well as real world) I've actually found that throughput seems to be BETTER than with, for example, rr_interval 6, interactive 0 and yield_type 1 (or 0).
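For anyone wanting to try this combination: MuQSS exposes these tunables under /proc/sys/kernel/, so they can be persisted with an ordinary sysctl fragment (the file path is only an example, and the keys exist only on MuQSS/BFS kernels):

```
# /etc/sysctl.d/99-muqss.conf -- example path; requires a MuQSS kernel
kernel.rr_interval = 1
kernel.interactive = 1
kernel.yield_type = 2
```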

I suspect this has to do with more and more applications as well as OS subsystems becoming increasingly multithreaded and the overhead of the context switching (yield type 2 and rr_interval 1) being less than the overhead of threads simply waiting for other threads to complete their tasks.

Please specify whether (in your tests) you use the performance / ondemand / powersave governor, and whether you actually use cpufreq or p-state. As mentioned in different threads here and there, ondemand vs. performance is itself a big win, at least on non-p-state-capable hardware; if you get 12% out of performance, that's neat and worth a try :)

Schedutil. Been a fan of that one since it was first implemented. Tried ondemand as well, and even that was a performance degradation. The performance governor might be on par with schedutil, but I'd hardly wager it being better.

Could you re-run the latency tests (interbench)? I am wondering if it's even worth running MuQSS, because throughput is probably better with CFS, but I am not sure about the latencies. I've been using both schedulers and I can't find any difference latency-wise. My workload consists of compiling large projects like LLVM/Chromium while programming, and I haven't noticed anything slowing down even with CFS.

Running any WINE application hard-locks my system (either on execution or after a period of time). Before, my workaround was to use SCHED_ISO for pulseaudio, jackdbus, and osu!.exe, with SCHED_NORMAL for wine and wineserver. However, simply tuning yield_type to 0 fixes this.
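As a sketch of that earlier workaround (schedtool ships in the package of the same name; SCHED_ISO is only honoured on MuQSS/BFS kernels, and retuning other users' processes needs root), this dry run prints the commands rather than executing them:

```shell
#!/bin/sh
# Dry run: print the schedtool invocations for the workaround above.
# -I = SCHED_ISO, -N = SCHED_NORMAL; pgrep -x matches exact process names.
for p in pulseaudio jackdbus 'osu!.exe'; do
    echo "schedtool -I \$(pgrep -x '$p')"
done
for p in wine wineserver; do
    echo "schedtool -N \$(pgrep -x '$p')"
done
```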

Running CS:GO with yield_type 0 or 1 both showed hard stuttering when loading player or bot threads with multicore rendering enabled. Tuning yield_type to 2 removes this stutter, and I am assuming this applies to Source games generally. Thank god you made it tunable.

After using the system for a considerable while, or playing osu! enough to reach this error, I still come across NOHZ: local_softirq_pending 202. Usually on CFS, the warning goes away with no apparent problems on the system. On MuQSS, when the warning appears, the entire system lags. Specifically, the display is not always updated, the mouse stutters and does not poll correctly, and keyboard input is delayed and occasionally does not poll correctly. This issue goes away when I am able to set or run any program with policy SCHED_ISO (and keep it running), or when I restart with nohz=off.

All of this was tested on ck-ivybridge from the repo-ck repository with an Intel i5-3317U. The minimal workarounds are really stable, with the only thing worrying me being idle power consumption from disabling idle dynticks. Apart from that, the kernel is awesome to work and play with.

And I spoke too soon. The workarounds and tunables above still significantly delay the hard lockup, but do not prevent it.

The only use case I have found that guarantees a hard lockup on my system is using yield_type=2 and rr_interval=1, running a wine program that uses GL/EGL/GLES, and opening 1-20 terminals at once.

I'm pretty confident it's the realtime scheduling issue mentioned before, and that all programs that use or bridge to OpenGL on wine coincidentally demand realtime scheduling. htop shows this, but schedtool says otherwise. I'm beginning to think wine is coded to shit.

In the event that the wine program becomes a zombie, wineserver -k plus schedtool -I on the parent/child process related to the zombie kills the process (??). Using schedtool -R on the same process hard-locks the system OR puts the CPU into an unworkable idle state with softirq warnings.

I've also stopped rtkit-daemon to see if it helps, but to no avail. I really don't want to maintain a huge list of programs to set non-RR in schedtool on this incredibly responsive kernel.

Looks like SCHED_BATCH for wine, wineserver, and wine-preloader is the most stable setup for me. Realtime priority programs have no problem running for extended periods of time and wine mostly spawns children at SCHED_NORMAL. Not sure what to make of it.

Apart from my longstanding issues, latency-wise:

- primusrun and nvidia-xrun with intel_cpufreq schedutil make all Valve games I've played open several seconds faster, and leave me with unbelievably low mouse latency on an Optimus system compared to mainline and Windows.

- I/O detection for my external keyboard and mouse is really fast and never fails to register compared to the few times that happened on mainline.

- Dota 2 on CFS caps at 30 FPS after reaching a specific load from multiple unit selection (even though it can run well above this on Pause). MuQSS does not have this issue.

Throughput-wise: from the PTS results, there is no clear winner; it depends on the workload. From the spreadsheet, I would say the best MuQSS settings are "interactive=0" and "interactive=1 & yield_type=0 or 1". The -ck patchset is slower.

A suggestion if you're having lockups with -ck: it might be worth building the kernel without threadirqs enabled. It could be a subtle driver priority-inversion bug that only shows up with threaded IRQs, and since they're off by default in mainline it wouldn't be picked up there.

Oh, I did not notice this comment, along with an old mailing list concerning wine(server) priority inversion. I did notice less jack2 xruns with this off, but never used it long enough to reach a conclusion. I will test this out.

Without threadirqs, it's pretty stable. I haven't seen any xruns reported from jack2 despite leaving Cadence on for a few hours with moderate workloads, compared to the occasional pops with threadirqs.

SCHED_BATCH nice 19 wineserver is still the most stable policy. It also solves the freezing issues I had on CFS with Ragnarok Online under wine-staging CSMT. The only lockup I've reached so far is with hibernation from low battery, which is a very rare use case for me.
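Concretely, that policy maps to schedtool's -B (SCHED_BATCH) and -n (nice level) switches; another dry run, printing the commands for the wine process names mentioned earlier:

```shell
#!/bin/sh
# Dry run: print commands that would move the wine processes to
# SCHED_BATCH at nice 19 (schedtool -B = batch policy, -n = nice level).
for p in wine wineserver wine-preloader; do
    echo "schedtool -B -n 19 \$(pgrep -x '$p')"
done
```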

It's still a mystery why I didn't find your mailing list post on wineserver priority inversion sooner, but at least I reached the same conclusion.

I am playing with the -ck1 patch together with an Nvidia video card (Optimus, if it matters) under 4.10.10. Unfortunately, it freezes the GUI after some time. All processes continue to work; only the display stops refreshing. Is this a known issue?

@kernelOfTruth: I also experience GUI freezes on 4.10.x series, but not on 4.9.x. I have Asus laptop with Optimus (UX303UB) and those freezes are with 375.39, 378.13 and 381.09 drivers (Gentoo Linux here).

I think that MuQSS is incompatible with CONFIG_CPUMASK_OFFSTACK, which is implied by CONFIG_MAXSMP ("Configure maximum number of SMP processors and NUMA Nodes"). Mainline's get_user_cpu_mask() in sched/core.c bounds the copy length to cpumask_size(), which is a runtime value with CONFIG_CPUMASK_OFFSTACK, but MuQSS's version bounds it to sizeof(cpumask_t), which will, in this case, probably be larger than the actual target buffer. Refer to Linux commit 96f874e26428a (from 2008). I think MuQSS needs to either handle this case or require !CONFIG_CPUMASK_OFFSTACK.

Have you already tried comparing your same setup with Alfred Chen's VRQ patch applied instead of MuQSS? I don't want to advertise it, but it may be worth a try. On my system, Alfred's patch results in much better responsiveness overall, without negative effects. No gaming tested on my machine. http://cchalpha.blogspot.de/2017/04/vrq-095-release.html

No, I don't have these kinds of workloads here, having no need for them. Kernel compilation, severe swapping and additional I/O are usual here, though. BTW, I also use the most recent BFQ I/O scheduler.

I think my issues with wine (which I have narrowed down to mostly wineserver) might be a priority inversion issue. Applications zombify when audio is out of sync, and I'm assuming they deadlock when something isn't rendered in time.

osu! with a SCHED_BATCH wineserver will lock up the system when running make -j8 under SCHED_IDLEPRIO, given some time. A SCHED_BATCH nice 19 wineserver delays the lockup much longer under the same stress, but the application will still occasionally zombify. I have also tried this with Zero Escape: The Nonary Games and reached similar results. Apart from this test case, wineserver is relatively stable with these policies under moderate stress.

What I discovered along the way was that when compiling DKMS modules, CFS would sometimes terminate it with SIGPIPE during context switch. This occurs more frequently with linux-zen and linux-rt-bfq when BFQ is enabled. I have not seen this happen once on the ck-patchset and my test kernels with MuQSS in the past 3 months.

I don't have the time, inclination, intestinal fortitude nor psychological disturbance required for attempting something so futile. Linus' position against multiple CPU schedulers in the kernel has been a hard line for over a decade. Additionally, maintaining a patch this size in mainline would be a full-time job of responding to issues and keeping it current. I spend a few days every few months on this patch and it's fun; why would I want to make it torture?

Using MuQSS I get lag playing CPU-intensive wine games with CSMT (command stream), like WoW; lag that is not present with CFS on the stock Arch kernel. Using renice helps, however. I'm using yield_type 0. The lag occurs especially when the scene changes.