A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, MuQSS, BFS and -ck.

Monday, 12 December 2016

linux-4.9-ck1, MuQSS version 0.150

Announcing a new -ck release, 4.9-ck1, with a new version of the Multiple Queue Skiplist Scheduler, version 0.150. These are patches designed to improve system responsiveness and interactivity with specific emphasis on the desktop, but configurable for any workload.

MuQSS

MuQSS 0.150 updates

Regarding MuQSS, apart from a resync to linux-4.9, which has numerous hotplug and cpufreq changes (again!), I've cleaned up the patch to not include any Hz changes of its own, leaving Hz changes up to users to choose, unless they use the -ck patchset.
Additionally, I've modified sched_yield yet again. Since expected behaviour is different for different (inappropriate) users out there of sched_yield, I've made it tunable in /proc/sys/kernel/yield_type and changed the default to what I believe should happen. From the documentation I added in Documentation/sysctl/kernel.txt:

Previous versions of MuQSS defaulted to type 2 above. If you find behavioural regressions with any of your workloads try switching it back to 2.
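For readers who want to script the switch, here is a minimal sketch of reading and setting the tunable. The path is the one named above; the helper names are made up for illustration, and the accepted range 0 to 2 is an assumption based only on the values mentioned in this post and its comments. The file only exists on a MuQSS-patched kernel, and writing to it needs root.

```python
from pathlib import Path

# Path named in the post; only present on kernels patched with MuQSS.
YIELD_TYPE_PATH = Path("/proc/sys/kernel/yield_type")


def read_yield_type(path=YIELD_TYPE_PATH):
    """Return the current yield_type as an int, or None if the tunable is absent."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None


def write_yield_type(value, path=YIELD_TYPE_PATH):
    """Write a new yield_type (hypothetical helper; needs root on a real system)."""
    # Assumption: 0, 1 and 2 are the only values, as those are the ones
    # mentioned in the post and comments.
    if value not in (0, 1, 2):
        raise ValueError("yield_type must be 0, 1 or 2")
    Path(path).write_text(f"{value}\n")
```

Calling `write_yield_type(2)` as root would be the scripted equivalent of switching back to the previous default.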

4.9-ck1 updates

Apart from resyncing with the latest trees from linux-bfq and wb-buf-throttling:
- Added a new kernel configuration option to enable threaded IRQs and set it by default
- Changed Hz to default to the safe 100 value, removing 128 which caused spurious issues and had no real world advantage.
- Fixed a build for muqss disabled (why would you use -ck and do that I don't know)
- Made hrtimers not be used if we know we're in suspend, which may have caused suspend failures for drivers that did not use correct freezable vs normal timeouts
- Enabled bfq and set it to default
- Enabled writeback throttling by default

In general, because there's tons of old hardware (like my Eee 701 netbook) that is revitalized by Linux, and especially w/ -ck! :-P

But I presume you're speaking to 'why would one use 32bit on a 64bit-capable CPU?' And that is usually due to the Windows mindset to use 32bit if you have less than 4GB memory, which does not apply to Linux. wiki.archlinux.org/index.php/Frequently_asked_questions#64-bit

And in the ARM world: www.cnx-software.com/2016/03/01/64-bit-arm-aarch64-instructions-boost-performance-by-15-to-30-compared-to-32-bit-arm-aarch32-instructions/

Obviously offtopic and wrong place to ask -- but maybe someone of you knows how to help: What can I do against these warnings, like e.g.: WARNING: "phys_base" [sound/drivers/snd-dummy.ko] has no CRC! Many of them occur at compilation time and I don't know if that leads to further problems. Kernel is vanilla 4.9.0 from opensuse src rpm +ck1.

OMG... That last link took me quite some time to read up to the end of that LKML thread. I've taken the third provided patch from there, and got rid of the compile warnings. As it's dated 21st November, I'm somewhat disappointed that it wasn't taken into vanilla 4.9.0.

Thx ck for the new yield_type configuration. I'm getting very good results when set to 'No yield' in xonotic. The game feels very responsive and input is very consistent. I had already set __GL_YIELD="NOTHING" previously but still it's much better if I also set yield_type to 0. Not sure why this is the case.

This means then that other code besides the GPU driver is also using sched_yield. It is arguably the most misused syscall on linux today and should not even exist any more. Setting it to zero basically makes it do nothing, which is why I added it as a feature :)

@ck: I also want to thank you for making yield_type a tunable! After trying to do my humble port of the old and unmaintained TuxOnIce to 4.9.0 and failing to resume from disk every time, and after investigating your code changes for 4.9.0 (-ck1), I coincidentally tried yield_type=2 -- and it works again, for many cycles now. You've added the new default =1 for some rational reason, be it interactiveness/performance/both. I've then read some of your code comments regarding yield() -- so how can I debug and change possibly faulty code in order to make it work well with your yield_type=1? In the TOI code there is one yield() call, for example, but there may be more sources of error in other drivers. I don't know what to search for and what to change to what, but I mainly want to do it for TOI. If you find some time to explain at least a little bit, I would really appreciate it.

Maybe someone wants to see the non-official TOI code for 4.9.0: http://workupload.com/file/sVqjhDZ
* checksumming does not work, don't configure it
* with 4.9.0 + MuQSS/ck use "echo 2 > /proc/sys/kernel/yield_type"
* possible other bugs I haven't encountered, use at your own risk

I did some benchmarks and it seems that yield=1 has slower overall performance, at least in gaming, on average. Yield 2 and Yield 0 performed about the same: http://openbenchmarking.org/result/1701176-TA-CKYIELD0V56

Sadly, in the test I have above, only 1 test (OpenArena) has per frame analysis. According to that test though, yield 0 had the smallest lag spikes, with a max of 17ms per frame vs. 27ms and 28ms for yield 2 and yield 1 respectively.

Doing the test with some popular multicore CPU benchmarks, it seems that yield=1 is the same or slightly better in most cases. I wonder why games tend to perform better but CPU bound tests don't... http://openbenchmarking.org/result/1701170-TA-CKYIELDMU34

@ck: You've defined the yield_type as runtime configurable. Thank you for the choice! My question: When does it take effect after changing the value? Is there a difference to be expected for all old tasks running, for newly started tasks or other unnamed conditions, and then: when? I'm currently re-testing yield_type=2 after one day of =0, uptime with ck1: ~8 days.

Thank you Con for your reply! My main reason for asking was a weird behaviour of the sound system via headphones that confused me yesterday. After switching the yield_type back and forth several times, the stereo sound was suddenly changing from left to right and back to normal without pattern but continuously over time, and I wondered whether the system may get confused by switching yield_type too often and if you could imagine such a case. Atm. I'm considering just a simple cable issue and am sorry for bothering you with this, but your info above is valuable anyway.

@ck: I don't believe in speeches for copper cable healing or such, but the issue went away all of a sudden for two days. And came back (same unchanged cable and system setup). The only way to solve the stereo audio waving around (headphones) was to pin the pulseaudio process to the first of my two cpus via schedtool. Sidenote: I've set the HZ value to 512 atm.

BFQv8r6 for Linux 4.9 is out. After reverting patch 0017 from ck1 and applying the new BFQ manually I noticed wbt.h was deleted when reverting. I think wbt shouldn't be in patch 0017 together with BFQ, or wasn't meant to be there in the first place. Merry Christmas and best regards, Peter

@ck In a comment probably above this one, you said that the kernel shouldn't even include sched_yield() anymore because it's mostly not used correctly. This makes me curious: if there were no sched_yield() in userspace, wouldn't it be quite inefficient (a waste of cpu cycles) for a user-space implemented hybrid mutex to do spinning when the lock is only held for a small amount of time or when the lock is uncontended? After spinning for a constant time while atomically checking for a state change, it changes its locking strategy to a futex-based mutex one.

It's not that sched_yield is used incorrectly at all, it's that no such ill-defined function should exist in the first place. It's an ancient concept that was implemented in unix that "I want to yield a little", but never really defined beyond that. The problem with sched_yield is that there is absolutely no definition of the semantics of what it is meant to do. There is no definition of exactly what is to be yielded, to what else, how to yield, for how long to yield, and why. One does not need to invent a userspace equivalent for a function that has absolutely no defined semantics in the first place. Userspace should be sleeping for defined durations, reasons and defined wake conditions. There are plenty of syscalls that do exactly what is asked and expected of them that should be used instead.

My next question really has nothing to do with MuQSS, but since I am not that familiar with linux's internals myself, I'd like to ask for your personal opinion on my use case. As you may have noticed, I am implementing a more efficient mutex/lock for a performance sensitive application and I am running MuQSS as my main kernel with yield_type set to 0, to improve some of my applications.

Now, my lock implementation uses a hybrid approach where it first uses a spinning lock, then a futex-based lock. The thing is that I have a loop with an iteration count of 30 which does atomic operations on the lock variable, which has some bits recording whether the mutex is locked or not. If it is still locked, it basically calls sched_yield and then repeats the loop.

So the question is now, do I simply remove the sched_yield since it's basically a non-operation or can you suggest me something else which might be more efficient?

The real question is: what are you waiting on happening during the sched_yield you were calling? If setting yield to zero improved the behaviour then you're not really doing anything at all or waiting on anything at all during that yield call, you're just spinning. If you have a defined wake condition, use a callback from that wake condition instead.
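To make the spin-then-block pattern from this exchange concrete, here is a rough sketch using Python's `threading.Lock` as a stand-in for the commenter's atomic lock word and futex path, neither of which is shown in the thread. The function name is hypothetical; the count of 30 comes from the comment above. The point it illustrates is the one in the reply: after a bounded spin, fall back to a blocking wait with a defined wake condition rather than calling sched_yield in a loop.

```python
import threading

SPIN_ATTEMPTS = 30  # iteration count mentioned in the comment above


def acquire_spin_then_block(lock, spin_attempts=SPIN_ATTEMPTS):
    """Try a bounded number of non-blocking acquires, then block.

    Returns the number of failed spin attempts before the lock was taken.
    """
    for attempt in range(spin_attempts):
        # Non-blocking attempt: returns immediately with True or False.
        if lock.acquire(blocking=False):
            return attempt
    # Blocking acquire: the thread sleeps until the holder releases the
    # lock, which is a defined wake condition, unlike a yield loop.
    lock.acquire()
    return spin_attempts
```

An uncontended lock is taken on the first attempt, so the blocking path is only paid for under real contention.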

I haven't measured the actual difference in performance yet as I have noticed a possible performance increase in other applications so I decided to leave it set to zero (Some people here I guess have found the same behavior though).

When calling sched_yield I was expecting a context switch to other threads, so that eventually the other thread which is holding the mutex (assuming it's micro-contention, so a spinning lock will be sufficient, otherwise it will take the futex-based path) will unlock the mutex, and when it context-switches back to our spinning thread that thread will eventually lock the mutex. So I am more like waiting for "one context-switch", which somewhat contradicts your question because the time to wait is basically unknown.

Well, that's kinda my problem: sched_yield has no effect, so I was trying to ask you what other approach I could try. Also I think I haven't explained the "one context-switch" correctly, since one context-switch doesn't make much sense, but actually this is looped 30 times. So I guess there is a chance that the other thread will eventually continue (or not, which would mean I don't understand SMP correctly).

It is very noticeable on old machines. Apart from that I use the same stripped config (disabled "everything", so the kernel will still run but not more) and only use video/network/audio drivers needed for the particular machine.

Hi, I forgot to post the updated benchmarks of MuQSS150 I ran some time ago. They are here as usual: https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

I've put some colors to make the results more readable (hopefully). The reference kernel is the one in the first column. Depending on the realtime difference between the tested kernel and the reference kernel, the colors are:
- blue if the difference is within 'realtime of reference kernel +/- maximum standard deviation'
- green if the difference is lower than 'realtime of reference kernel - maximum standard deviation'
- red if the difference is higher than 'realtime of reference kernel + maximum standard deviation'

Overall best and worst are also shown, if not in between +/- std dev.
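The colouring rule described above can be written down as a small function. This is only a sketch of the rule as stated in the comment, not the spreadsheet's actual formula; the function and parameter names are made up for illustration.

```python
def classify(reference_realtime, tested_realtime, max_std_dev):
    """Return the colour for a tested kernel relative to the reference kernel.

    Lower realtime means a faster run, so "green" marks an improvement
    beyond the noise band and "red" a regression beyond it.
    """
    difference = tested_realtime - reference_realtime
    if difference < -max_std_dev:
        return "green"
    if difference > max_std_dev:
        return "red"
    return "blue"
```

For example, with a reference run of 100.0 seconds and a maximum standard deviation of 0.5, a tested run of 99.0 would be green, 101.0 red, and 100.3 blue.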

I know a standard deviation computed of 3 runs is not very significant, but it's all I've got.

The benchmark seems quite underwhelming as it probably has the same throughput as cfs. Also, if I understood the interbench benchmark correctly, it doesn't seem like cfs (300Hz) is that bad compared to MuQSS (100Hz). It even looks like CFS has a better max latency compared to MuQSS, which makes me wonder: shouldn't MuQSS improve responsiveness (lower latency)?

In interbench results the order of importance is deadlines met, followed by desired cpu, followed by max latency. In terms of throughput remember that muqss is still primarily designed with responsiveness and interactivity in mind. The difference between bfs and muqss is that muqss will scale to any number of CPUs without breaking down in its throughput performance, which is currently not of great importance until 16 or more CPUs. That number is not common at the moment but will become more so since the only direction manufacturers have to scale now is outwards instead of up in speed - that phones may have up to 8 cores now is evidence of that.

As I see it, this is a throughput benchmark only. You can't have both high throughput and low latency at the same time since they are mutually exclusive. In my experience MUQSS is in a different league compared to CFS latency-wise. And then keeping about the same throughput rather speaks for MUQSS I would say. :)

When I play Counter Strike: Go I experience random pauses. I tried yield_type 0 and 2, in addition I am using schedtool -I -e. This kind of a behavior is not reproducible with cfq. On the other hand with yield_type = 0 the game does not suffer from jitters like cfq.

The pauses are almost certainly hitting the threshold in CPU for realtime for isochronous scheduling - it is not designed for fully cpu bound applications like games, but for things like video and audio. Also, you mean cfs, not cfq, but everyone makes the same mistake since the names are so similar (just like bfs and bfq but no problem with muqss now.) In summary, don't run it with schedtool -I and I'm pretty sure your pauses will go away.

Works fine now, but there is a jitter now like in cfs, it's very small and rare, but still. With SCHED_ISO only pauses.

I took it from man of schedtool:

SCHED_ISO was designed to give users a SCHED_RR-similar class. To quote Con Kolivas: "This is a non-expiring scheduler policy designed to guarantee a timeslice within a reasonable latency while preventing starvation. Good for gaming, video at the limits of hardware, video capture etc."

I don't know if the difference in timer interrupt count is of any importance, but I have issues with input latency in games. The behavior varies, but most of the time input is very responsive after rebooting and gets very laggy after some time.

So after some time the situation with timer interrupt counts changed completely. Now CPU0 has a much lower count compared to CPU1; it was the other way around and the relative difference is about 9%.

Does somebody have more information about this behavior? Am I missing some timer interrupts? Maybe because of regions with disabled interrupts? Do you have similar behavior on your PCs? I can't find any information about this.
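For anyone wanting to compare their own numbers, here is a small sketch that pulls the per-CPU local timer counts out of the text of /proc/interrupts. It assumes the x86 layout, where the row is labelled `LOC:`; the function names and the definition of "relative difference" (as a fraction of the larger count) are assumptions, since the comment does not say how its 9% was computed.

```python
def local_timer_counts(interrupts_text):
    """Return the per-CPU counts from the 'LOC:' row of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "LOC:":
            counts = []
            for field in fields[1:]:
                # The numeric columns end where the description starts.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []


def relative_difference(a, b):
    """Difference between two counts as a fraction of the larger one."""
    larger = max(a, b)
    return abs(a - b) / larger if larger else 0.0
```

On a live system the input would be the contents of /proc/interrupts, read once after boot and again once the lag appears, so the two snapshots can be compared.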

@duud: Thx for the added info. I was asking as I had seen increasing input lags with earlier BFQ releases, e.g. even in Firefox, the longer it was up and running. Luckily, in my case, this doesn't happen any more / is not noticeable since Con's MuQSS+ck rework. But honestly, I don't know whether it is due to timer, scheduler or mainline related progress.

Mmmmh... And those two lines effectively help against mouse pointer lags? Unfortunately the first line only affects "usbhid", whereas I'd need something for a PS/2 trackman via adapter on the serial port. I remember fiddling around some years ago with serial polling in xorg.conf and with setserial. In those days it was due to immature scheduling in BFS and my fiddling was without success -- only improved BFS releases made it fade away. What effects does the second line have?

How can the nvidia driver be already registered when I have to use a kernel module? Did anything significant change since linux kernel 4.9 (upgraded from 4.8.17-1-ck) concerning video drivers? I never had this before and did not change anything with the nvidia kernel module.

I don't have any specific advice regarding swappiness. It is a two edged sword and a very blunt tool (pun) for dealing with hitting swap at the same time. Lower than the default 60 does seem like a good idea, but nothing beats disabling swap entirely...

For my needs to suspend to disk (to swap) and using /dev/shm that is backed by more swap, also meaning me not having enough RAM, I can't turn swap off.

Swap -- or its interaction -- has been a severe bottleneck for linux for many years now. And I was always wishing that someone would take it over to ease the heavy slowdowns that occur sometimes when reclaims(?) are needed. Recently, after heavy /dev/shm & swap activity, I then had to wait ~15 minutes before firefox got back to responding to input. A complete reboot and FF reloading all 170 tabs would have been faster.

The default 60 and your 70 make it so the kernel starts swapping when just half of your RAM is in use. It does this by deciding to rather keep disk cache contents in RAM instead of programs.

When you use zero, it will only start to swap if your RAM is fully in use by programs.

The suggestion to use one instead of zero comes from an old report about a program that behaved bad with the zero value but behaved normal with a one.

To use zero (or one) is a suggestion for when you really have enough RAM for everything at all times. In that case it never makes sense to use swap because there's just no way to avoid choppiness on the desktop. It will always happen when you do something with a program that had its data swapped out.

If your programs actually use more memory than you have RAM, there will always be choppiness and perhaps something like 20 might be interesting to experiment with to make the kernel not reduce cache sizes to a minimum.

The default 60 is a value that's for things like web servers. Over there, you don't care for interactivity and you would be fine with just the programs involved in serving the web stuff in RAM while other rarely used programs thrown out into swap to have more RAM for larger disk caches.

@Anonymous: Thank you for your suggestions and explanations. Unfortunately, only setting swappiness may not be sufficient on my system. Side question: Can it be related to shared memory, integrated intel laptop graphics? At least, setting swappiness to 1, 10 or 20 on here led to severe knockouts of my system for many minutes, until it recovered for only short periods of time. Seems like I'd need to readjust the other settings as well, but I have no idea where to begin this odyssey. Atm. I'm going downwards from my former swappiness value, and 50 is the current known good step that doesn't affect interactiveness (w.i.p.).

Thank you very much! Without your patchset, and later BFQ, Linux always felt broken. I began using them aeons ago on a P3@933MHz which I bought refurbished, and I am using them now on an old Thinkpad T60P with CoreDuo T2600 and 3GB RAM. Yes! 32 bits! Why? No money! Anyway, right now everything runs very smoothly at 4.9.4, which Greysky kindly supplies via his repo for Archlinux. That couldn't be said for the whole of 4.8, which forced me to grudgingly fall back to default upstream and to experiment with ZEN, which was less buggy but not flawless. But the pain is gone now.

After downgrading from 4.9.4-ck1 to 4.9.0-ck1 (git) because of latency I downgraded to 4.8-ck (git). Feels much faster than the 4.9... bunch. It seems the kernel gets more and more bloated and slower every release.

Like a previous poster, I also got a process hard freeze when using the silver searcher while compiling the Linux kernel in the background. It was at a point where 'killall -9 ag' would not kill the process

Hi, ck. Some time after the first MuQSS release I have tried your patch again, but I still have problems.

Wine is not usable since no application can be executed due to the error: "kernel: usercopy: kernel memory overwrite attempt detected". With an Atom Z520 I still have some intermittent boot panics. When boot goes well, then everything runs smoothly for many days.

Another one, probably unrelated to the first one. Unlike the other one, which always occurs within five minutes of booting, this one took over 15 hours to occur. I like the irony of the comment in line 4285.

I've looked into the first issue (line 3230) a bit more because it seems more serious.

The comment above the no_iso_tick() function says rq->iso_ticks should be decreased. If I read line 3232 correctly, that means that the 'ticks' argument must be positive. As simple printk debugging shows, ticks is often negative. In fact, it is negative about a third of the time. This causes rq->iso_ticks to grow until an overflow happens.

I logged all ticks < -5 and with a counter for positive, negative and zero ticks. You can find the log here: sendspace.com/file/ccz4c8

My kernel tick rate is 100 Hz.
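The reasoning in this report can be illustrated with a toy model. The sketch below is not the MuQSS source: the decay formula, the function name and the `ISO_PERIOD` value are all assumptions chosen only to show why a decay step that expects a positive `ticks` argument turns into growth when `ticks` is negative.

```python
ISO_PERIOD = 500  # hypothetical value, for illustration only


def decay_iso_ticks(iso_ticks, ticks, period=ISO_PERIOD):
    """Hypothetical decay step: shrink iso_ticks in proportion to elapsed ticks.

    With ticks > 0 the factor (period - ticks) / period is below 1, so the
    counter decays. With ticks < 0 the factor exceeds 1 and it grows.
    """
    return iso_ticks * (period - ticks) // period


def run(iso_ticks, tick_sequence):
    """Apply the decay step for each entry in tick_sequence."""
    for ticks in tick_sequence:
        iso_ticks = decay_iso_ticks(iso_ticks, ticks)
    return iso_ticks
```

In this model, a run of positive ticks drives the counter down, while a run of negative ticks makes it climb without bound, which in a fixed-width kernel integer would eventually overflow as the comment describes.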

Thank you people, I read it as there being no known issues like in the early gcc5 days. Unfortunately the upgrade process in openSUSE is a little lengthy until one is on the safe side, which is also my fault for keeping an old 13.1 that was only freshly updated. It means it'll take some more days before I'm able to test your suggestions.

I've been running linux 4.9.7 with muqss for quite some time now without any issue. But today I wanted to try golang and by simply issuing one command, the application segfaulted. Well, I thought this must be a golang error, but before reporting it I tried the same with the stock archlinux vanilla kernel and it didn't segfault, which means that it is somehow muqss's fault. I also tried comparing both kernel configs and they are equivalent with some obvious exceptions like bfq.

Sad story. 4.4.14 vanilla kernel (~180k config) feels more responsive than a custom 4.9.9-ck1 kernel (~70k config). The kernel is getting too bloated. Seems like no one cares about speed/latency/efficiency anymore. Or it is by intent to sell more new cpus.

Oh, yes, that one. Thank you for clarifying. On my side, I'm also not completely done with (mis)configuration hassles after upgrading through 3 openSUSE major releases. What I hate most about it are the unpredictable automatic installer decisions that still happen (though I always choose the manual adjustments' way). That is the reason why I had put it off for so long.

Quite a seductive proposal. But I've been with openSUSE for almost 2 decades now and always managed the good times and the bad times so far. I'm not _as_ upset with "shitstemd" ;-), and luckily I've seen that it kept and keeps improving over the years. What really worries me is the co-existence of plasma5 and old kde4 and the related Qt libs needed for each of them, severely filling up the partition. No one wants/accepts to repartition disks for incomplete software reasons, except for Windows users.

I've made some scaling tests with CFS and MuQSS, to see why MuQSS is performing poorly under half load. They are here: https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing in the '4.9.9 Scaling test' sheet.

I remember that you said that it might be related to load balancing and Intel turbo boost. However, I've found that my motherboard sets the CPU to its max turbo boost frequency when the XMP memory profile is enabled (and XMP is always enabled on my computer). So my 4770k CPU always runs at a maximum frequency of 3.9, whether 1, 2, 3 or more cores are loaded. I've checked that with turbostat. So I believe it's not a turbo boost issue. I've also done some tests with XMP disabled and turbo boost working as intended.

The only thing I've found, using 'turbostat make -jN', is that with CFS, load is distributed evenly across physical cores and logical cores, whereas MuQSS puts more load on physical cpu. I don't know if it's intended or if it can cause this performance issue. I just write this to let you know.

Hi, when I use an external usb wifi in the 4.9-CK kernel the system hangs/freezes. It only happens with the ck kernel; when I use 4.9 vanilla or Zen it doesn't happen. syslog doesn't show anything and I had to press the power button to restart.