
While fixing io_context creation / task exit race condition,
6e736be7f2 "block: make ioc get/put interface more conventional and
fix race on alloction" also prevented an exiting (%PF_EXITING) task
from creating its own io_context. This is incorrect as exit path may
issue IOs, e.g. from exit_files(), and if those IOs are the first ones
issued by the task, io_context needs to be created to process the IOs.

Combined with the existing problem of io_context / io_cq creation
failure having the possibility of stalling IO, this problem results in
deterministic full IO lockup with certain workloads.

Fix it by allowing io_context creation regardless of %PF_EXITING for
%current.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Hugh Dickins <hughd@google.com>
---
Thanks a lot for the hint, Hugh. My testing stuff (fio, dd and some
ad-hoc rawio testing programs) was issuing IOs before exiting, so I
didn't hit the problem. I suspect the reason I didn't see the boot
failure Andrew was seeing is systemd: the boot process used to be
dominated by lots of short-lived programs, many of which touched or
modified files, and with Andrew's old userland they triggered the
first IO on exit paths. With systemd, most of those are gone, so...
Thanks.
block/blk-ioc.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
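The rule the patch restores can be modelled outside the kernel. What follows is a minimal userspace sketch, not the actual blk-ioc.c change. The structures, the SKETCH_PF_EXITING constant and both functions are hypothetical stand-ins, and the real code makes the decision under task_lock(). The sketch shows the intended condition: a new io_context is installed only if the task has none yet, and an exiting task is refused unless it is creating the io_context for itself (%current).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's PF_EXITING task flag. */
#define SKETCH_PF_EXITING 0x4u

/* Hypothetical, heavily reduced versions of struct io_context and
 * struct task_struct. */
struct sketch_ioc {
	int refcount;
};

struct sketch_task {
	unsigned int flags;
	struct sketch_ioc *io_context;
};

/*
 * May a freshly allocated io_context be installed on @task when the
 * request comes from @current_task?  An exiting task is refused unless
 * it is the one asking, because its own exit path (e.g. exit_files())
 * may still need to issue IO.
 */
static bool sketch_may_install(const struct sketch_task *task,
			       const struct sketch_task *current_task)
{
	if (task->io_context)
		return false;	/* already has one; keep the existing ioc */
	if (task == current_task)
		return true;	/* current is allowed even while exiting */
	return !(task->flags & SKETCH_PF_EXITING);
}

/* Returns the task's io_context afterwards, or NULL if it has none. */
static struct sketch_ioc *sketch_create_ioc(struct sketch_task *task,
					    const struct sketch_task *current_task)
{
	struct sketch_ioc *ioc;

	assert(task != NULL && current_task != NULL);

	ioc = malloc(sizeof(*ioc));
	if (!ioc)
		return task->io_context;
	ioc->refcount = 1;

	/* The kernel makes this decision under task_lock(); omitted here. */
	if (sketch_may_install(task, current_task))
		task->io_context = ioc;
	else
		free(ioc);	/* lost the race, or refused */

	return task->io_context;
}
```

For example, an exiting task that asks for its own io_context gets one, while a different caller acting on behalf of an exiting task gets NULL. A task that already has an io_context keeps it; the second allocation is simply freed.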
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On 2011-12-25 02:02, Tejun Heo wrote:
> While fixing io_context creation / task exit race condition,
> 6e736be7f2 "block: make ioc get/put interface more conventional and
> fix race on alloction" also prevented an exiting (%PF_EXITING) task
> from creating its own io_context. This is incorrect as exit path may
> issue IOs, e.g. from exit_files(), and if those IOs are the first ones
> issued by the task, io_context needs to be created to process the IOs.
>
> Combined with the existing problem of io_context / io_cq creation
> failure having the possibility of stalling IO, this problem results in
> deterministic full IO lockup with certain workloads.
>
> Fix it by allowing io_context creation regardless of %PF_EXITING for
> %current.
Thanks, applied.

On Sun, 25 Dec 2011 14:29:34 +0100
Jens Axboe <axboe@kernel.dk> wrote:
> On 2011-12-25 02:02, Tejun Heo wrote:
> > While fixing io_context creation / task exit race condition,
> > 6e736be7f2 "block: make ioc get/put interface more conventional and
> > fix race on alloction" also prevented an exiting (%PF_EXITING) task
> > from creating its own io_context. This is incorrect as exit path may
> > issue IOs, e.g. from exit_files(), and if those IOs are the first ones
> > issued by the task, io_context needs to be created to process the IOs.
> >
> > Combined with the existing problem of io_context / io_cq creation
> > failure having the possibility of stalling IO, this problem results in
> > deterministic full IO lockup with certain workloads.
> >
> > Fix it by allowing io_context creation regardless of %PF_EXITING for
> > %current.
The patch works for me.
> Thanks, applied.
So we get another great big bisection hole in mainline. I feel
duty bound to rewhine about this :(

Hello,
On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> "It's the tmpfs swapping test that I've been running, with variations,
> for years. System booted with mem=700M and 1.5G swap, two repetitious
> make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> the balance of built to unbuilt source grows smaller with later kernels),
> one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
>
> I doubt much of that (quoted from an older mail to someone else about
> one of the many other bugs it's found) is relevant: maybe just plenty
> of file I/O and swapping.
Plain -j4 build isn't triggering anything. I'll try to replicate the condition.
Thanks.

On Wed, Dec 28, 2011 at 09:55:12AM -0800, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> > "It's the tmpfs swapping test that I've been running, with variations,
> > for years. System booted with mem=700M and 1.5G swap, two repetitious
> > make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> > the balance of built to unbuilt source grows smaller with later kernels),
> > one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> > ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
> >
> > I doubt much of that (quoted from an older mail to someone else about
> > one of the many other bugs it's found) is relevant: maybe just plenty
> > of file I/O and swapping.
>
> Plain -j4 build isn't triggering anything. I'll try to replicate the condition.
It's not too reliable but I can reproduce it with -j 22 allmodconfig
build inside qemu w/ 512M of memory. I'll try to find out what's
going on.
Thanks.

Happy new year, guys.
On Wed, Dec 28, 2011 at 01:19:18PM -0800, Tejun Heo wrote:
> > On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> > > "It's the tmpfs swapping test that I've been running, with variations,
> > > for years. System booted with mem=700M and 1.5G swap, two repetitious
> > > make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> > > the balance of built to unbuilt source grows smaller with later kernels),
> > > one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> > > ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
> > >
> > > I doubt much of that (quoted from an older mail to someone else about
> > > one of the many other bugs it's found) is relevant: maybe just plenty
> > > of file I/O and swapping.
> >
> > Plain -j4 build isn't triggering anything. I'll try to replicate the condition.
>
> It's not too reliable but I can reproduce it with -j 22 allmodconfig
> build inside qemu w/ 512M of memory. I'll try to find out what's
> going on.
I misread the code; the problem is an empty cfqq on the cfq prio tree. I
don't think this is caused by recent io_context changes. It looks
like somebody is forgetting to remove cfqq from the dispatch prio tree
after emptying a cfqq by removing a request from it. Jens, any ideas?
Thanks.
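The invariant Tejun suspects is being violated (in his follow-up he corrects "dispatch prio tree" to "service tree") can be illustrated with a deliberately simplified model. This is a hypothetical sketch, not cfq's code: sketch_cfqq and its helpers are invented names, and the real cfq bookkeeping is considerably more involved. In this model, removing a queue's last request must also unlink the queue from the tree the picker scans. Skipping that step leaves the kind of empty queue that cfq_get_next_queue() is later seen returning in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, drastically simplified cfq queue: just a count of queued
 * requests and a flag saying whether the queue is linked on the service
 * tree that the dispatch picker scans.
 */
struct sketch_cfqq {
	unsigned int nr_requests;
	bool on_service_tree;
};

/* Queuing a request makes the queue eligible for dispatch. */
static void sketch_add_request(struct sketch_cfqq *q)
{
	q->nr_requests++;
	q->on_service_tree = true;
}

/*
 * Removing a request (dispatch, merge, ...).  In this model the queue
 * must leave the service tree as soon as it becomes empty; forgetting
 * this step is the bug pattern discussed in the thread.
 */
static void sketch_remove_request(struct sketch_cfqq *q)
{
	assert(q->nr_requests > 0);
	q->nr_requests--;
	if (q->nr_requests == 0)
		q->on_service_tree = false;
}

/* Invariant the picker relies on: linked implies non-empty. */
static bool sketch_invariant_holds(const struct sketch_cfqq *q)
{
	return !q->on_service_tree || q->nr_requests > 0;
}

/*
 * Picker over a single-queue "tree".  The assert catches an empty queue
 * that was left linked, i.e. the symptom reported in this thread.
 */
static struct sketch_cfqq *sketch_pick(struct sketch_cfqq *q)
{
	if (!q->on_service_tree)
		return NULL;
	assert(sketch_invariant_holds(q));
	return q;
}
```

Keeping the unlink in the same function that drops the last request makes the invariant local and easy to audit. A missing removal elsewhere, such as the merge path Shaohua's 4a0b75c7d0 fixed, breaks the invariant without tripping anything until the picker runs.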

Hello, again.
Adding Shaohua Li as he fixed a similar issue in 4a0b75c7d0 "block,
cfq: fix empty queue crash caused by request merge". The original
thread can be read from
http://thread.gmane.org/gmane.linux.kernel.next/20064/focus=20159
On Tue, Jan 03, 2012 at 09:35:00AM -0800, Tejun Heo wrote:
> Happy new year, guys.
>
> On Wed, Dec 28, 2011 at 01:19:18PM -0800, Tejun Heo wrote:
> > > On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> > > > "It's the tmpfs swapping test that I've been running, with variations,
> > > > for years. System booted with mem=700M and 1.5G swap, two repetitious
> > > > make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> > > > the balance of built to unbuilt source grows smaller with later kernels),
> > > > one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> > > > ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
> > > >
> > > > I doubt much of that (quoted from an older mail to someone else about
> > > > one of the many other bugs it's found) is relevant: maybe just plenty
> > > > of file I/O and swapping.
> > >
> > > Plain -j4 build isn't triggering anything. I'll try to replicate the condition.
> >
> > It's not too reliable but I can reproduce it with -j 22 allmodconfig
> > build inside qemu w/ 512M of memory. I'll try to find out what's
> > going on.
>
> I misread the code, the problem is empty cfqq on the cfq prio tree. I
> don't think this is caused by recent io_context changes. It looks
> like somebody is forgetting to remove cfqq from the dispatch prio tree
> after emptying a cfqq by removing a request from it. Jens, any ideas?
That should have been "service tree". I couldn't find any missing
removals other than the one Shaohua's patch already fixed. Close
cooperator selection in cfq_select_queue() seems suspicious, though; I
can't see what prevents it from returning an empty cooperator cfqq.
I'm trying to verify whether that's the case. Will update when I know
more.
Thanks.

On Tue, Jan 03, 2012 at 02:13:01PM -0800, Tejun Heo wrote:
> > That's pretty odd. Given Hugh's report as well, it sure does sound like
> > we now have some life time issues with cfqq's.
>
> Hmmm... I disabled cfqq merge logic (commented out
> cfq_close_cooperator() and the following cfq_setup_merge() calls) in
> cfq_select_queue() and neither is triggering for quite a while now.
> Maybe cfqq refcnt is getting borked over cfqq merging / splitting? It
> would also explain the low frequency of the issue too. I'll try to
> further isolate it but it would be awesome if someone more familiar
> with the logic can go over that part.
Scrap that. It triggered, and yeah, cfq_get_next_queue() is retrieving
an empty cfqq from the service tree.
Thanks.