A development blog of what Con Kolivas is doing with code at the moment, with the emphasis on the Linux kernel, BFS and -ck.

Monday, 30 May 2011

2.6.39 BFS progress

TL;DR: 2.6.39 BFS fixed maybe?

After walking away from the code for a while, annoyed at the bug I couldn't track down, I had another good look at what might be happening. It appears that while the grq lock is dropped in schedule() to perform the block plug flush, a call to the task via try_to_wake_up may be missed entirely, leaving the task deactivated when it should actually keep running. Anyway, first tests from the people on these blog comments are reassuring.
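To make the suspected race concrete, here is a minimal userspace sketch in C. It is a toy model, not kernel code and not the patch itself: the struct, the function names and the `recheck` flag are all made up for illustration, and the grq lock is represented only by comments. It assumes the wake-up arrives in the window where the lock is dropped for the plug flush, and shows how rechecking the task state after retaking the lock would avoid deactivating a task that has just been woken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task model; not the kernel's task_struct. */
enum task_state { TASK_RUNNING, TASK_SLEEPING };

struct task {
    enum task_state state;
    bool queued;            /* true while the task is on the run queue */
};

/* Stand-in for a wake-up: a task that is still queued only has its
 * state reset, so nothing re-queues it later if it gets dequeued. */
static void wake_up_task(struct task *t)
{
    assert(t != NULL);
    t->state = TASK_RUNNING;
}

/* Models one pass through a schedule()-like function for a task that
 * has marked itself as sleeping. Returns true if the task ends up
 * "lost": marked runnable but no longer on the run queue. */
static bool simulate_schedule(bool wake_during_flush, bool recheck)
{
    struct task t = { TASK_SLEEPING, true };

    /* (lock held) decide to deactivate based on the current state */
    bool deactivate = (t.state != TASK_RUNNING);

    if (deactivate) {
        /* (lock dropped) flush the block plug; a wake-up may land here */
        if (wake_during_flush)
            wake_up_task(&t);

        /* (lock retaken) optionally look at the state again */
        if (recheck && t.state == TASK_RUNNING)
            deactivate = false;
    }

    if (deactivate)
        t.queued = false;

    return t.state == TASK_RUNNING && !t.queued;
}
```

In this model `simulate_schedule(true, false)` leaves the task lost, while `simulate_schedule(true, true)` keeps it on the run queue. With no wake-up in the window, the task is simply put to sleep as intended in both cases.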

Here is a cleaned up and slightly modified version of the "test8" patch that has so far been stable and appears to have fixed the problem for a handful of people:

28 comments:

Seems good to me. test8 made it through the night with Deluge running and I didn't have any problems building with the updated patch when booted into test8. Running the updated patch now and I have no problems to report.

Considering how fast the earlier patches crashed and burned for me, I think the probability of this actually being fixed is quite high now.

You've done an excellent job figuring out a problem that couldn't be reproduced by you locally!

Fuck, I just hit some variant of the bug again, resulting in a dpkg deadlock when doing an apt-get upgrade.

What information from /proc/<pid> is the most useful to you? I notice in /proc/<pid>/status it says "State: D (disk sleep)" and the process is unkillable. The timestamp on status also hasn't changed for 15 minutes so I'm guessing that's when it deadlocked.

You're definitely on the right track with your latest attempt at fixing it though. There must be some additional corner/edge case you're not seeing yet.

@terminx: yes, once something that should be flushing data to disk is blocked, one by one everything that wants to write to disk will also block. Reading the proc entries may itself cause further hangs in the tasks trying to read from them. The output of sysrq-p and sysrq-t can be helpful.
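For anyone who wants to check for stuck tasks without eyeballing each status file, here is a rough sketch in C of pulling the state letter out of /proc/<pid>/status. The helper names are made up for illustration, and since reading the proc entries of a hung task can itself block, treat this as a sketch rather than a safe diagnostic tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Return the one-letter state code from a line such as
 * "State:\tD (disk sleep)", or '?' if this is not a State line. */
static char parse_state_line(const char *line)
{
    char code;

    assert(line != NULL);
    if (sscanf(line, "State: %c", &code) == 1)
        return code;
    return '?';
}

/* Read /proc/<pid>/status and return the state code, or '?' if the
 * file cannot be opened or contains no State line. */
static char task_state_code(int pid)
{
    char path[64];
    char line[256];
    char code = '?';
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '?';

    while (fgets(line, sizeof line, f) != NULL) {
        code = parse_state_line(line);
        if (code != '?')
            break;
    }
    fclose(f);
    return code;
}
```

A task reported as 'D' for a long stretch, like the dpkg process described above, is the kind worth capturing in a sysrq-t dump.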

Thanks everyone for your comments and testing! There's no doubt that adding this patch makes things a lot better with no regressions, but it's still not as stable as 2.6.38 given the one report of regressions. There is a small possibility that it's yet another problem, but this patch posted on this thread definitely fixes a real problem. I may release a ck2 anyway but not remove the unstable tag.

As anonymous, I use BFS with BFQv2. And with your new patch, kernel 2.6.39 seems as good as my "gold 2.6.38.7-zen kernel". No drawbacks to see at the moment; even under heavy IO it looks good. I don't get the high load values >8 as before. So thanks again for your bug hunting ;)

If there's an edge case I haven't hit it yet. Chrom{e,ium} hangage reported earlier appears cleared up. I've been up all night hacking away on various things without troubles. Previously the issue showed up within a minute.

test8 and recheck_unplugged should give the same results. I haven't been able to find anything else to blame in that particular part of the code so I'm tempted to release it as ck2 anyway just so that ck1 which is very unstable is not in the wild any more, pending further enlightenment.

Stable after 24 hours on my coreduo2, with wine and emule always open, 1 mkv movie watched, 2 hours of browsing, music played in audacious, and some I/O traffic to a USB pen. If there is some specific stress test we should run on our patched systems, tell us, ck! Thank you for your great work!

The bfs404-recheck_unplugged patch completely fixes the problem with processes locking up for me and has been working well for two days now. My system is very responsive. I was able to run Handbrake to encode some videos in the background and play World of Warcraft. While playing, it was as if there was nothing running in the background. I think this is one of the best releases to date if not the best. Thanks for all the work making Linux more suitable on the desktop. I'm looking forward to ck2.

By the way, if it helps with hunting down any remaining bugs, here is my config file: http://tux9656.no-ip.biz/config-2.6.39-ck1.bz2

So things are mostly better, but there are still 2 reports of problems. Hrm. I've tried everything I can think of to reproduce it locally and failed. Well, here's a test9 patch which may well be a regression but it's worth a shot: