A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, BFS and -ck.

Friday, 1 August 2014

SMT/Hyperthreading, nice and scheduling policies

The concept of symmetric multi-threading, which Intel called "Hyperthreading" and introduced into their commodity CPUs first around 2001, is not remotely a new one and goes back a long way before Intel introduced it into the mainstream market. I suspect the introduction of it back then by Intel was them easing the concept of increasing threads and cores for marketing reasons with the imminent walls they'd soon hit with CPU heat and power requirements that would stop the pursuit for higher and higher single CPU frequencies. The idea is that, since a lot of the CPU sits unused even when something is running as fast as it can on part of it, with a bit of extra logic and architecture, you could throw another "virtual core" at some of the unused execution units and behave like 2 (or more) CPUs, putting more of the CPU to good use. These days the vast majority of CPUs sold by Intel have hyperthreading on them, thus doubling the virtual or "logical" cores the CPU has, including even their low power atom offerings.

There have been numerous benchmarks, in-field tests, workloads etc., where people have tried to find whether hyperthreading is better or not. With a bit of knowledge of the workings of hyperthreading, it's pretty easy to know what the answer is, and not surprisingly, it's the frustrating answer of "it depends". And that's the most accurate answer by far, but I'd go further than that and say that if you have any kind of mixed workload, hyperthreading is always going to be better, whereas if you have precisely one workload , then you have to define exactly how it's going to work and whether hyperthreading will be better or not. Which means that in my opinion at least, hyperthreading is advantageous on a desktop, laptop, tablet and even phone since by design they're nothing but mixed workloads. I won't spend much longer on this discussion, but suffice to say that I think about 4 threads (at the moment) is about optimal for most real world desktop(y) workloads.

Imagine for a moment you have a single core CPU which you can run as is, or enable hyperthreading to run as a 2 thread CPU. If you were to run your CPU in single core only mode, then when you run one task at a time it will always use the full power of the CPU, but if you run two tasks, each task runs at 50% the speed and completes in double the time. If you enable hyperthreading, then if you have two mixed workloads that actually use different parts of the CPU, you can actually get effectively (at best) about 140% of the performance of running the CPU in single core mode. This means that instead of the two tasks running at 50% speed when run concurrently, they run at 70% speed. In practice, the actual performance benefit is rarely 40% but it is often on the order of 25%, so each task tends to run about 60% speed instead of 50% speed. Still a nice speedup for "free".

One thing has always troubled me about hyperthreading, though, and that is the way it tends to break priority support in the scheduler. By priority support, I refer to the use of 'nice' and other scheduling policies, such as realtime, sched idleprio etc.

If you have a single core CPU and run a nice 0 task concurrently with a nice +19 task, the nice 0 task will get about 98% of the CPU time and the nice +19 task only about 2%. The scheduler does this by serialising and metering out the time each task gets to spend on the CPU. Now if you enable hyperthreading on that CPU, the scheduler no longer serialises access to the CPU, but gives each of those tasks one logical "core" on the CPU, and you get an overall 25% increase in throughput. However of the total throughput, both the nice 0 and nice +19 task get precisely half. This would be fine if we had two real cores, but they're not, and the performance of both tasks is sacrificed to ~60% to achieve this. Which means that for this contrived but simple example, enabling hyperthreading slows down the overall execution speed of your nice 0 task when you run a nice +19 task much more than on a single core - it runs at 60% speed instead of 98%.

An even more dramatic example is what happens with realtime tasks, which these days most audio backends on linux use (usually through pulseaudio). Running a realtime task concurrently with a SCHED_NORMAL nice 0 task on a single core means the realtime task will get 100% CPU and the nice 0 task will get zero CPU time. Enable hyperthreading and suddenly the realtime task only runs at 60% of its normal speed even with a heavily niced +19 task running in the background.

Enter SMT-nice as I call it. This is not a new idea, and in fact my first iteration of it was for mainline 10(!) years ago. See here: SMT Nice 2.6.4-rc1-mm1

I actually had the patch removed myself from mainline for criticism regarding throughput reasons, though I still argue that worrying about the last percentage points of throughput are not relevant if you break a mechanism as valuable as nice and scheduling policies, but I had lost the energy for defending it which is why I pushed it be removed myself. Note that although throughput overall may be slightly decreased, the throughput of higher priority tasks is not only fairer with respect to low priority tasks, but enhanced because the low priority tasks will have less cache trashing effects.

What this does is it examines all hyperthread "siblings" to see what is running on them, and then decides whether the currently running or next running task should actually have access to the sibling or allow the sibling to go idle completely, allowing a higher priority task to have the actual true core and all its execution units to itself. I'd been meaning to create an equivalent patch for BFS for the longest time but CPUs got faster, cheaper, more cores, I got lazy etc... though I recently found more enthusiasm for hacking.

So here is a reincarnation of the SMT-nice concept for BFS, improved to work across multiple scheduling policies from realtime, iso down to idleprio, and I've made it a compile time option in case people feel they don't wish to sacrifice any throughput:

The TL;DR is: On Intel hyperthreaded CPUs, 'nice', realtime and sched idleprio works better, and background tasks interfere much less with the foreground tasks. Note: This patch does nothing if you don't have a hyperthreaded CPU.

If you wish to do testing to see how this works, try running with and without the patch and running two benchmarks concurrently, one at nice 0 and one at nice +19 (such as 'make -j2' on one kernel and 'nice -19 make -j2' on another kernel on a machine with 2 cores/4 threads) and compare times. Or run some jackd benchmarks of your choice to see what it takes to get xruns etc.

This patch will almost certainly make its way into the next BFS in some form.

EDIT: It seems people have missed the point of this patch. It improves the performance of foreground applications at the expense of background ones. So your desktop/gui/applications will remain fast even if you run folding@home, mprime, seti@home etc., but those background tasks will slow down more. If you don't want it doing that, disable it in your build config.

your kernel(-config) sounds interesting and the wikis really fine.Is there a git tree for the kernel so that I could use it for other distros too or is your kernel project only for Mandriva/Rosa?Are there any plans to integrate the exfat kernel driver?

the Wiki is currently missing an important chapter, that I'll write soon, about the so called ABF Kernel Farm

ABF is our common (Conectiva, OpenMandriva, ROSA, Unity linux, and different spinoff like MDK, MagOS, ecc.) build server that allow us to work remotely, it builds and hosts all our projects sources (srpms) and the resulting binary files (rpms)

We can split our work in many project folders, then rebuild for different Linux platforms, something about is here:http://forum.rosalab.ru/viewtopic.php?f=29&t=1223

ABF is Open, and its contents are open for all people that want to see, download, ecc..., to have a full access needs a subscription

ExFat: I have not been interested in it because I was not asked, I only know that we have in our repository: fuse-exfat & exfat utils

If you want to know more and/or discuss, you can contact me via email (abitrules AT yahoo.it), or into my forum (http://mib.pianetalinux.org/forum), or in the IRC that I often attend (freenode > #openmandriva-cooker)

I am still working on multiple queue locking in 3.15.x. The early stage throughput testing shows there is slight improvement again the baseline code on a 4 cores machine. Hopefully there will be a call-for-test in 3.16.

Btw, sorry, off topic, but does we really need all of the "ifdef CONFIG_SCHED_BFS" and "ifndef CONFIG_SCHED_BFS"? I mean, if i want to use bfs, then i will patch the kernel and if i don´t want to use bfs, i will not patch it. But i don´t know if it is possible to remove them, i am not a kernel hacker.xxx

isn't meant to be used, therefore I created all of the other patches - it's basically the reference implementation of the changes for Con or other devs to easily reproduce what I did & check whether anything was missed

not sure what you dislike about the ifdef's and ifndef's, that's the same way BFS is added to the kernel and other patchsets,

if you don't like the look of it & want to break compatibility with 3.16 and the new locking unification changes of it, feel free to remove it

/scratches head

everything's running fine for me so far - so more people could try it out but have backups available - like everything in the community/opensource: no guarantees

If compiling, causes browser windows to not draw, Chromium will just stay a white page until I stop the compile, sometimes it will draw after 30s or so into the compile, but I guess this is counted as a background operation? Windows instantly draw again if compile stops and also it's just fine without the smtnice option selected at compile-time.

g'day mr. kolivas,I think the patch doesn't behaves well with zfs & spl modules from the daily ubuntu ppa's because I get a lot of kernel hangs and they don't show up when using btrfs as the root filesystem. anyhow, great work.

Not getting anything I can understand in dmesg just a whole lot of numbers. Im using linux-stable.git, on branch origin/linux-3.16.y patched up to ck. Config is here, http://sprunge.us/iXiM?py#n-7dmesg is here,http://sprunge.us/NbeK?py#n-7

Also if someone show me the commad they use to patch their kernels. Im a little confused. From inside the kernel dir I used `patch -p1 < /home/patchdir/3.16-sched-bfs-450.patch` and then again with the next patch and so on.