David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-06T17:14:57Zurn:uuid:36215513-d149-795b-5b0a-cc3f5fdefcb2

[-- Attachment #1: Type: text/plain, Size: 702 bytes --]
On Fri, 2018-07-06 at 10:11 -0700, Paul E. McKenney wrote:
> > The preempt state is alread a bit complicated and shadowed in the
> > preempt_count (on some architectures) adding additional bits to it like
> > this is just asking for trouble.
>
> How about a separate need_resched_rcu() that includes the extra cache
> miss? Or open-coding the rcu_urgent_qs_requested()?
Peter said "touch two cachelines". He didn't say it was a cache miss.
Given that every single cond_resched() call touches the same cache
line, and every single rcu_all_qs() and similar will also touch it,
it's fairly much guaranteed *not* to be a miss...
... which is why I didn't really understand why he cared.
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Fri, Jul 06, 2018 at 06:14:44PM +0100, David Woodhouse wrote:
> On Fri, 2018-07-06 at 10:11 -0700, Paul E. McKenney wrote:
> > > The preempt state is alread a bit complicated and shadowed in the
> > > preempt_count (on some architectures) adding additional bits to it like
> > > this is just asking for trouble.
> >
> > How about a separate need_resched_rcu() that includes the extra cache
> > miss? Or open-coding the rcu_urgent_qs_requested()?
>
> Peter said "touch two cachelines". He didn't say it was a cache miss.
"... that includes the extra cache touch", then.
> Given that every single cond_resched() call touches the same cache
> line, and every single rcu_all_qs() and similar will also touch it,
> it's fairly much guaranteed *not* to be a miss...
>
> ... which is why I didn't really understand why he cared.
Let's see what he says. ;-)
Thanx, Paul

On Fri, Jul 06, 2018 at 06:14:44PM +0100, David Woodhouse wrote:
> On Fri, 2018-07-06 at 10:11 -0700, Paul E. McKenney wrote:
> > > The preempt state is alread a bit complicated and shadowed in the
> > > preempt_count (on some architectures) adding additional bits to it like
> > > this is just asking for trouble.
> >
> > How about a separate need_resched_rcu() that includes the extra cache
> > miss? Or open-coding the rcu_urgent_qs_requested()?
>
> Peter said "touch two cachelines". He didn't say it was a cache miss.
>
> Given that every single cond_resched() call touches the same cache
> line, and every single rcu_all_qs() and similar will also touch it,
> it's fairly much guaranteed *not* to be a miss...
cond_resched() is a no-op for PREEMPT=y, but then you're still
sprinkling that read all across.
> ... which is why I didn't really understand why he cared.
Well it also complicated the whole preemption state, and like I wrote,
TIF_NEED_RESCHED is not the only place we track preemption state in.
Stuff like preempt_enable() will (on x86) only ever look at the
preempt_count value.

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-09T09:19:04Zurn:uuid:8ac448ca-0842-6f55-1b5f-3872b85e893d

[-- Attachment #1: Type: text/plain, Size: 1934 bytes --]
On Mon, 2018-07-09 at 10:53 +0200, Peter Zijlstra wrote:
> On Fri, Jul 06, 2018 at 10:11:50AM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 06, 2018 at 06:29:05PM +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 06, 2018 at 03:53:30PM +0100, David Woodhouse wrote:
> > > >
> > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > index e4d4e60..89f5814 100644
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -1616,7 +1616,8 @@ static inline int spin_needbreak(spinlock_t *lock)
> > > >
> > > > static __always_inline bool need_resched(void)
> > > > {
> > > > - return unlikely(tif_need_resched());
> > > > + return unlikely(tif_need_resched()) ||
> > > > + rcu_urgent_qs_requested();
> > > > }
> > > Instead of making need_resched() touch two cachelines, I think I would
> > > prefer adding resched_cpu() to rcu_request_urgent_qs_task().
>
> > I used to do something like this, but decided that whacking each holdout
> > CPU over the head ten times a second was a bit much.
>
> This is only called from the !list_empty(rcu_tasks_holdout) loop in
> rcu_tasks_kthread afaict, and that has a
> schedule_timeout_interruptible(HZ) in it, which I read as once a second.
>
> Which seems like an entirely reasonable amount of time to kick a task.
> Not scheduling for a second is like an eternity.
If that is our only "fix" for KVM, then wouldn't that mean that things
like expand_fdtable() would be *expected* to take "an eternity" when
another CPU happens to be in the guest? Because vcpu_run() would still
loop until the task gets kicked after a second?
Of course, we can explicitly put a check into the KVM loop, but that
brings me back to my original concern — why is it OK to do it there as
a special case and not for the general case construct of
if (need_resched) { drop_local_locks(); cond_resched(); get_local_locks(); }
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Mon, Jul 09, 2018 at 10:18:55AM +0100, David Woodhouse wrote:
> > Which seems like an entirely reasonable amount of time to kick a task.
> > Not scheduling for a second is like an eternity.
>
> If that is our only "fix" for KVM, then wouldn't that mean that things
> like expand_fdtable() would be *expected* to take "an eternity" when
> another CPU happens to be in the guest? Because vcpu_run() would still
> loop until the task gets kicked after a second?
But either proposal is exactly the same in this respect. The whole
rcu_urgent_qs thing won't be set any earlier either.
> Of course, we can explicitly put a check into the KVM loop, but that
> brings me back to my original concern — why is it OK to do it there as
> a special case and not for the general case construct of
> if (need_resched) { drop_local_locks(); cond_resched(); get_local_locks(); }
I'm not proposing anything that would differentiate between KVM and
anything else.
I just want to keep my preemption state sane-ish, and adding random
conditions to part of it just doesn't look attractive.

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-09T10:56:55Zurn:uuid:3b7faf91-d9f1-dcc4-6116-6ace4a98d0b1

[-- Attachment #1: Type: text/plain, Size: 1000 bytes --]
On Mon, 2018-07-09 at 12:44 +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 10:18:55AM +0100, David Woodhouse wrote:
> >
> > >
> > > Which seems like an entirely reasonable amount of time to kick a task.
> > > Not scheduling for a second is like an eternity.
> >
> > If that is our only "fix" for KVM, then wouldn't that mean that things
> > like expand_fdtable() would be *expected* to take "an eternity" when
> > another CPU happens to be in the guest? Because vcpu_run() would still
> > loop until the task gets kicked after a second?
>
> But either proposal is exactly the same in this respect. The whole
> rcu_urgent_qs thing won't be set any earlier either.
Er.... Marius, our latencies in expand_fdtable() definitely went from
~10s to well below one second when we just added the rcu_all_qs() into
the loop, didn't they? And that does nothing if !rcu_urgent_qs.
Doesn't rcu_implicit_dynticks_qs() set it after jiffies_till_sched_qs
(default HZ/10) has expired?
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Mon, Jul 09, 2018 at 11:56:41AM +0100, David Woodhouse wrote:
> > But either proposal is exactly the same in this respect. The whole
> > rcu_urgent_qs thing won't be set any earlier either.
>
> Er.... Marius, our latencies in expand_fdtable() definitely went from
> ~10s to well below one second when we just added the rcu_all_qs() into
> the loop, didn't they? And that does nothing if !rcu_urgent_qs.
Argh I never found that, because obfuscation:
ruqp = per_cpu_ptr(&rcu_dynticks.rcu_urgent_qs, rdp->cpu);
...
smp_store_release(ruqp, true);
I, using git grep "rcu_urgent_qs.*true" only found
rcu_request_urgent_qs_task() and sync_sched_exp_handler().
But how come KVM even triggers that case; rcu_implicit_dynticks_qs() is
for NOHZ and offline CPUs.

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-09T11:12:25Zurn:uuid:59290570-9083-bfcb-5c4f-c11a2328e8dc

[-- Attachment #1: Type: text/plain, Size: 1291 bytes --]
On Mon, 2018-07-09 at 13:06 +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 11:56:41AM +0100, David Woodhouse wrote:
>
> >
> > >
> > > But either proposal is exactly the same in this respect. The whole
> > > rcu_urgent_qs thing won't be set any earlier either.
> > Er.... Marius, our latencies in expand_fdtable() definitely went from
> > ~10s to well below one second when we just added the rcu_all_qs() into
> > the loop, didn't they? And that does nothing if !rcu_urgent_qs.
> Argh I never found that, because obfuscation:
>
> ruqp = per_cpu_ptr(&rcu_dynticks.rcu_urgent_qs, rdp->cpu);
> ...
> smp_store_release(ruqp, true);
>
> I, using git grep "rcu_urgent_qs.*true" only found
> rcu_request_urgent_qs_task() and sync_sched_exp_handler().
>
> But how come KVM even triggers that case; rcu_implicit_dynticks_qs() is
> for NOHZ and offline CPUs.
I don't know that it is; I'm merely going by the empirical observation
that with a check for rcu_urgent_qs in the vcpu_run() loop, KVM is no
longer screwing over synchronize_sched() for 10 seconds at a time. Or
even 1 second at a time.
I'm all for considering a CPU in guest mode to be quiescent, and not
waiting for it at all. But we don't do that without full NOHZ even for
CPUs in userspace.
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Mon, Jul 09, 2018 at 12:12:15PM +0100, David Woodhouse wrote:
> On Mon, 2018-07-09 at 13:06 +0200, Peter Zijlstra wrote:
> > On Mon, Jul 09, 2018 at 11:56:41AM +0100, David Woodhouse wrote:
> > > > But either proposal is exactly the same in this respect. The whole
> > > > rcu_urgent_qs thing won't be set any earlier either.
> > > Er.... Marius, our latencies in expand_fdtable() definitely went from
> > > ~10s to well below one second when we just added the rcu_all_qs() into
> > > the loop, didn't they? And that does nothing if !rcu_urgent_qs.
> > Argh I never found that, because obfuscation:
> >
> > ruqp = per_cpu_ptr(&rcu_dynticks.rcu_urgent_qs, rdp->cpu);
> > ...
> > smp_store_release(ruqp, true);
> >
> > I, using git grep "rcu_urgent_qs.*true" only found
> > rcu_request_urgent_qs_task() and sync_sched_exp_handler().
> >
> > But how come KVM even triggers that case; rcu_implicit_dynticks_qs() is
> > for NOHZ and offline CPUs.
>
> I don't know that it is; I'm merely going by the empirical observation
> that with a check for rcu_urgent_qs in the vcpu_run() loop, KVM is no
> longer screwing over synchronize_sched() for 10 seconds at a time. Or
> even 1 second at a time.
It would be good to know what exactly sets that variable in your case.
> I'm all for considering a CPU in guest mode to be quiescent, and not
> waiting for it at all. But we don't do that without full NOHZ even for
> CPUs in userspace.
Doing it for guests should be easier than for userspace, since
vmenter/vmexit are (afaik) _much_ more expensive than sysenter/sysexit.

On Mon, Jul 09, 2018 at 01:06:57PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 11:56:41AM +0100, David Woodhouse wrote:
>
> > > But either proposal is exactly the same in this respect. The whole
> > > rcu_urgent_qs thing won't be set any earlier either.
> >
> > Er.... Marius, our latencies in expand_fdtable() definitely went from
> > ~10s to well below one second when we just added the rcu_all_qs() into
> > the loop, didn't they? And that does nothing if !rcu_urgent_qs.
>
> Argh I never found that, because obfuscation:
>
> ruqp = per_cpu_ptr(&rcu_dynticks.rcu_urgent_qs, rdp->cpu);
> ...
> smp_store_release(ruqp, true);
>
> I, using git grep "rcu_urgent_qs.*true" only found
> rcu_request_urgent_qs_task() and sync_sched_exp_handler().
Yeah, got tired of typing that long string too many times, so made a
short-named pointer...
> But how come KVM even triggers that case; rcu_implicit_dynticks_qs() is
> for NOHZ and offline CPUs.
Mostly, yes. But it also takes measures when CPUs take too long to
check in.
The reason that David's latencies went from 100ms to one second is
because I made this code less aggressive about invoking resched_cpu().
The reason I did that was to allow cond_resched_rcu_qs() to be used less
without performance regressions. And just plain cond_resched() on
!PREEMPT is intended to handle the faster checks. But KVM defeats
this by checking need_resched() before invoking cond_resched().
For PREEMPT, either the scheduling-clock interrupt sees that there
is no RCU-read-side critical section or we have either idle or
nohz_full userspace execution.
Of course, if there really is a huge RCU read-side critical section that
really does take 15 seconds to execute, there is of course nothing that
RCU can do about that. But as you say later, even a one-second critical
section is huge and needs to be broken up somehow. Which should
introduce (at the very least) a cond_resched() for !PREEMPT or an
rcu_read_unlock() and thus rcu_read_unlock_special() for PREEMPT.
Thanx, Paul

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-09T12:47:26Zurn:uuid:3a666046-2f1b-efde-75ed-cae6d8ca5cf0

[-- Attachment #1: Type: text/plain, Size: 954 bytes --]
On Mon, 2018-07-09 at 05:34 -0700, Paul E. McKenney wrote:
> The reason that David's latencies went from 100ms to one second is
> because I made this code less aggressive about invoking resched_cpu().
Ten seconds. We saw synchronize_sched() take ten seconds in 4.15. We
wouldn't have been happy with one second, but ten seconds was
considered particularly suboptimal.
> The reason I did that was to allow cond_resched_rcu_qs() to be used less
> without performance regressions. And just plain cond_resched() on
> !PREEMPT is intended to handle the faster checks. But KVM defeats
> this by checking need_resched() before invoking cond_resched().
It isn't just KVM. It's a relatively common construct to use
need_resched(), then drop any local locks around cond_resched().
A bare cond_resched() will call rcu_all_qs() unconditionally, and it is
kind of inconsistent that need_resched() doesn't include the
corresponding condition.
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Mon, Jul 09, 2018 at 02:55:16PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 05:34:57AM -0700, Paul E. McKenney wrote:
> > But KVM defeats this by checking need_resched() before invoking
> > cond_resched().
>
> That's not wrong or even uncommon I think.
In fact, I think we recently put that pattern in crypto code in order to
break up very long kernel_fpu sections.
Note that you also 'broke' cond_resched_lock() as that no longer matches
cond_resched().

On Mon, Jul 09, 2018 at 03:02:27PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 02:55:16PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 09, 2018 at 05:34:57AM -0700, Paul E. McKenney wrote:
> > > But KVM defeats this by checking need_resched() before invoking
> > > cond_resched().
> >
> > That's not wrong or even uncommon I think.
>
> In fact, I think we recently put that pattern in crypto code in order to
> break up very long kernel_fpu sections.
OK, so here are our options:
1. Add the RCU conditional to need_resched(), as David suggests.
Peter has concerns about overhead.
2. Create a new need_resched_rcu_qs() that is to be used when
deciding whether or not to do cond_resched(). This would
exact the overhead only where it is needed, but is one more
thing for people to get wrong.
3. Revert my changes to de-emphasize cond_resched_rcu_qs(),
and go back to sprinkling cond_resched_rcu_qs() throughout
the code. This also is one more thing for people to get wrong,
and might well eventually convert all cond_resched() calls to
cond_resched_rcu_qs(), which sure seems like a failure mode to me.
4. Others?
> Note that you also 'broke' cond_resched_lock() as that no longer matches
> cond_resched().
Given that cond_resched_lock() was there first, I believe that you can
just say "broke" without the quote marks. :-/
Given that this code is releasing and acquiring a lock, I believe that
the patch below should cure this, aside from also needing to check
whether RCU needs a quiescent state. Any other similar gotchas out there?
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 537bced8f4bc..b559b556f464 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5017,6 +5017,7 @@ int __cond_resched_lock(spinlock_t *lock)
preempt_schedule_common();
else
cpu_relax();
+ rcu_all_qs();
ret = 1;
spin_lock(lock);
}

On Mon, Jul 09, 2018 at 01:47:14PM +0100, David Woodhouse wrote:
> On Mon, 2018-07-09 at 05:34 -0700, Paul E. McKenney wrote:
> > The reason that David's latencies went from 100ms to one second is
> > because I made this code less aggressive about invoking resched_cpu().
>
> Ten seconds. We saw synchronize_sched() take ten seconds in 4.15. We
> wouldn't have been happy with one second, but ten seconds was
> considered particularly suboptimal.
Yes, ten seconds. Please accept my apologies for my early morning
confusion.
Thanx, Paul
> > The reason I did that was to allow cond_resched_rcu_qs() to be used less
> > without performance regressions. And just plain cond_resched() on
> > !PREEMPT is intended to handle the faster checks. But KVM defeats
> > this by checking need_resched() before invoking cond_resched().
>
> It isn't just KVM. It's a relatively common construct to use
> need_resched(), then drop any local locks around cond_resched().
>
> A bare cond_resched() will call rcu_all_qs() unconditionally, and it is
> kind of inconsistent that need_resched() doesn't include the
> corresponding condition.

On Mon, Jul 09, 2018 at 07:29:32AM -0700, Paul E. McKenney wrote:
> OK, so here are our options:
>
> 1. Add the RCU conditional to need_resched(), as David suggests.
> Peter has concerns about overhead.
Not only overhead, its plain broken, because:
1) we keep preemption state in other places too
2) we check need_resched() in asm

On Mon, Jul 09, 2018 at 04:43:38PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 07:29:32AM -0700, Paul E. McKenney wrote:
>
> > OK, so here are our options:
> >
> > 1. Add the RCU conditional to need_resched(), as David suggests.
> > Peter has concerns about overhead.
>
> Not only overhead, its plain broken, because:
>
> 1) we keep preemption state in other places too
> 2) we check need_resched() in asm
Fair points. And yes, the various asm checks and transitions have
made their presence known to me in other contexts of late. :-/
So what would you suggest?
Thanx, Paul

On Mon, Jul 09, 2018 at 07:29:32AM -0700, Paul E. McKenney wrote:
> OK, so here are our options:
>
> 1. Add the RCU conditional to need_resched(), as David suggests.
> Peter has concerns about overhead.
>
> 2. Create a new need_resched_rcu_qs() that is to be used when
> deciding whether or not to do cond_resched(). This would
> exact the overhead only where it is needed, but is one more
> thing for people to get wrong.
Also, with the crypto guys checking need_resched() in asm that won't
really work either.
> 3. Revert my changes to de-emphasize cond_resched_rcu_qs(),
> and go back to sprinkling cond_resched_rcu_qs() throughout
> the code. This also is one more thing for people to get wrong,
> and might well eventually convert all cond_resched() calls to
> cond_resched_rcu_qs(), which sure seems like a failure mode to me.
4a. use resched_cpu() more agressive
4b. use the tick to set TIF_NEED_RESCHED when it finds rcu_urgent_qs
(avoids the IPI at the 'cost' of a slight delay in processing)
5. make guest mode a quiescent state (like supposedly already done
for NOHZ_FULL) (but this would not help the crypto guys).
6. ....
ok I ran out of ideas here I think.
So for PREEMPT the tick can check preempt_count() == 0 and if so, know
it _could_ have rescheduled and advance the qs, right? But since we
don't have a preempt count for !PREEMPT_COUNT kernels this doesn't work.
And thus we need to invoke actual scheduling events and then through the
schedule() callback RCU knows things happened.
4b seems like something worth trying for !PREEMPT kernels I suppose

On Mon, Jul 09, 2018 at 05:26:32PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 07:29:32AM -0700, Paul E. McKenney wrote:
> > OK, so here are our options:
> >
> > 1. Add the RCU conditional to need_resched(), as David suggests.
> > Peter has concerns about overhead.
> >
> > 2. Create a new need_resched_rcu_qs() that is to be used when
> > deciding whether or not to do cond_resched(). This would
> > exact the overhead only where it is needed, but is one more
> > thing for people to get wrong.
>
> Also, with the crypto guys checking need_resched() in asm that won't
> really work either.
Fair point! Ease of use is a good thing, even within the Linux kernel.
Or maybe especially within the Linux kernel...
> > 3. Revert my changes to de-emphasize cond_resched_rcu_qs(),
> > and go back to sprinkling cond_resched_rcu_qs() throughout
> > the code. This also is one more thing for people to get wrong,
> > and might well eventually convert all cond_resched() calls to
> > cond_resched_rcu_qs(), which sure seems like a failure mode to me.
>
> 4a. use resched_cpu() more agressive
> 4b. use the tick to set TIF_NEED_RESCHED when it finds rcu_urgent_qs
> (avoids the IPI at the 'cost' of a slight delay in processing)
4b sounds eminently reasonable to me! Something like the (untested,
probably doesn't even build) patch below?
David, any reason why this wouldn't work? Seems to me that this would
make need_resched() respond to RCU's need for quiescent states in a
timely manner without need_resched() having to become heavier weight,
but figured I should ask.
> 5. make guest mode a quiescent state (like supposedly already done
> for NOHZ_FULL) (but this would not help the crypto guys).
>
> 6. ....
>
> ok I ran out of ideas here I think.
>
>
> So for PREEMPT the tick can check preempt_count() == 0 and if so, know
> it _could_ have rescheduled and advance the qs, right? But since we
> don't have a preempt count for !PREEMPT_COUNT kernels this doesn't work.
>
> And thus we need to invoke actual scheduling events and then through the
> schedule() callback RCU knows things happened.
>
> 4b seems like something worth trying for !PREEMPT kernels I suppose
David is running a !PREEMPT kernel.
For PREEMPT kernels, the patch below results in a quiescent state for
the CPU, and the forced schedule queues the task. This queuing enables
later RCU priority boosting (if enabled) once all other CPUs sharing
the same leaf rcu_node structure have passed through quiescent states.
And yes, for PREEMPT kernels the scheduling-clock interrupt handler
already checks for a quiescent state using a combination of
preempt_count() (as you say, but ignoring the hardirq bits because
we are in an interrupt handler) and current->rcu_read_lock_nesting.
So I believe that this will cover it.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51919985f6cf..33b0a1ec0536 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2496,6 +2496,10 @@ void rcu_check_callbacks(int user)
{
trace_rcu_utilization(TPS("Start scheduler-tick"));
raw_cpu_inc(rcu_data.ticks_this_gp);
+ if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs)) &&
+ !is_idle_task(current))
+ set_tsk_need_resched(current);
+ __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
rcu_flavor_check_callbacks(user);
if (rcu_pending())
invoke_rcu_core();

On Mon, Jul 09, 2018 at 09:34:32AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 05:26:32PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 09, 2018 at 07:29:32AM -0700, Paul E. McKenney wrote:
> > > OK, so here are our options:
> > >
> > > 1. Add the RCU conditional to need_resched(), as David suggests.
> > > Peter has concerns about overhead.
> > >
> > > 2. Create a new need_resched_rcu_qs() that is to be used when
> > > deciding whether or not to do cond_resched(). This would
> > > exact the overhead only where it is needed, but is one more
> > > thing for people to get wrong.
> >
> > Also, with the crypto guys checking need_resched() in asm that won't
> > really work either.
>
> Fair point! Ease of use is a good thing, even within the Linux kernel.
> Or maybe especially within the Linux kernel...
>
> > > 3. Revert my changes to de-emphasize cond_resched_rcu_qs(),
> > > and go back to sprinkling cond_resched_rcu_qs() throughout
> > > the code. This also is one more thing for people to get wrong,
> > > and might well eventually convert all cond_resched() calls to
> > > cond_resched_rcu_qs(), which sure seems like a failure mode to me.
> >
> > 4a. use resched_cpu() more agressive
> > 4b. use the tick to set TIF_NEED_RESCHED when it finds rcu_urgent_qs
> > (avoids the IPI at the 'cost' of a slight delay in processing)
>
> 4b sounds eminently reasonable to me! Something like the (untested,
> probably doesn't even build) patch below?
>
> David, any reason why this wouldn't work? Seems to me that this would
> make need_resched() respond to RCU's need for quiescent states in a
> timely manner without need_resched() having to become heavier weight,
> but figured I should ask.
>
> > 5. make guest mode a quiescent state (like supposedly already done
> > for NOHZ_FULL) (but this would not help the crypto guys).
> >
> > 6. ....
> >
> > ok I ran out of ideas here I think.
> >
> >
> > So for PREEMPT the tick can check preempt_count() == 0 and if so, know
> > it _could_ have rescheduled and advance the qs, right? But since we
> > don't have a preempt count for !PREEMPT_COUNT kernels this doesn't work.
> >
> > And thus we need to invoke actual scheduling events and then through the
> > schedule() callback RCU knows things happened.
> >
> > 4b seems like something worth trying for !PREEMPT kernels I suppose
>
> David is running a !PREEMPT kernel.
>
> For PREEMPT kernels, the patch below results in a quiescent state for
> the CPU, and the forced schedule queues the task. This queuing enables
> later RCU priority boosting (if enabled) once all other CPUs sharing
> the same leaf rcu_node structure have passed through quiescent states.
>
> And yes, for PREEMPT kernels the scheduling-clock interrupt handler
> already checks for a quiescent state using a combination of
> preempt_count() (as you say, but ignoring the hardirq bits because
> we are in an interrupt handler) and current->rcu_read_lock_nesting.
>
> So I believe that this will cover it.
>
> Thoughts?
Updated per Peter's feedback on IRC.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51919985f6cf..ccde5f8aff61 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2496,6 +2496,12 @@ void rcu_check_callbacks(int user)
{
trace_rcu_utilization(TPS("Start scheduler-tick"));
raw_cpu_inc(rcu_data.ticks_this_gp);
+ if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs)) &&
+ !is_idle_task(current)) {
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+ }
+ __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
rcu_flavor_check_callbacks(user);
if (rcu_pending())
invoke_rcu_core();

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-09T18:51:05Zurn:uuid:3fb413a6-b235-37a2-26e9-3686c8150c49

On Mon, Jul 09, 2018 at 07:50:54PM +0100, David Woodhouse wrote:
>
>
> On Mon, 2018-07-09 at 09:34 -0700, Paul E. McKenney wrote:
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 51919985f6cf..33b0a1ec0536 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2496,6 +2496,10 @@ void rcu_check_callbacks(int user)
> > {
> > trace_rcu_utilization(TPS("Start scheduler-tick"));
> > raw_cpu_inc(rcu_data.ticks_this_gp);
> > + if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs)) &&
> > + !is_idle_task(current))
> > + set_tsk_need_resched(current);
>
> OK, so this will make KVM (and various other code) see that
> need_resched() is true, and they'll call cond_resched() or something
> else that might not actually schedule another task, but will at least
> end up in rcu_all_qs()...
>
> > + __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
>
> ... which bails out immediately and does nothing, because that's set to
> false?
>
> Am I missing something?
If this is the idle task, RCU will detect that as a quiescent state via
its dyntick-idle mechanism. In which case, there is no point in leaving
.rcu_urgent_qs being true.
If this is not the idle task, the scheduler will invoke
rcu_note_context_switch(), which will in turn invoke rcu_sched_qs(),
rcu_preempt_qs(), or rcu_qs(), depending on kernel version and
configuration. This will happen independently of .rcu_urgent_qs, so it
is OK to set .rcu_urgent_qs to false. And doing so reduces the overhead
of the next cond_resched().
I may end up using rcu_is_cpu_rrupt_from_idle() instead of is_idle_task()
at some point, but the former eases backporting. And the only difference
is if someone has a long loop within an _rcuidle tracepoint used in the
idle loop, and where that loop check need_resched(). Which currently
seems to be the empty set.
And I should treat interruption of a usermode task the same as that of an
idle task. In the PREEMPT case, ->rcu_read_lock_nesting better be zero
(lockdep would have complained), so the quiescent state will be reported.
In the !PREEMPT case, user=1 directly causes reporting of a quiescent
state.
So here are the possible code paths when .rcu_urgent_qs is set to true:
1. A context switch will record the quiescent state and clear
.rcu_urgent_qs. (The failure to do the clearing in current -rcu
for PREEMPT builds is a performance bug that I need to fix.)
2. A cond_resched() will cause rcu_all_qs() to be invoked, which
will record a quiescent state and clear .rcu_urgent_qs.
3. With the patch below, a scheduling-clock interrupt of a
non-idle non-userspace task will force a reschedule, which
will result in #1 above happening.
However, I should avoid setting .rcu_urgent_qs to false when it is
already false, shouldn't I?
So how about the following instead?
I am doing some light testing and will let you know how that goes.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51919985f6cf..c3b688c7127a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2496,6 +2496,15 @@ void rcu_check_callbacks(int user)
{
trace_rcu_utilization(TPS("Start scheduler-tick"));
raw_cpu_inc(rcu_data.ticks_this_gp);
+ /* The load-acquire pairs with the store-release setting to true. */
+ if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
+ /* Idle already is a quiescent state. */
+ if (!is_idle_task(current) && !user) {
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+ }
+ __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
+ }
rcu_flavor_check_callbacks(user);
if (rcu_pending())
invoke_rcu_core();

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-09T20:35:47Zurn:uuid:31ce81ad-1eb0-6376-8fc5-831a19333c83

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]
On Mon, 2018-07-09 at 13:34 -0700, Paul E. McKenney wrote:
>
> So here are the possible code paths when .rcu_urgent_qs is set to true:
>
> 1. A context switch will record the quiescent state and clear
> .rcu_urgent_qs. (The failure to do the clearing in current -rcu
> for PREEMPT builds is a performance bug that I need to fix.)
What if there's nothing else runnable and there is no actual context
switch?
> 2. A cond_resched() will cause rcu_all_qs() to be invoked, which
> will record a quiescent state and clear .rcu_urgent_qs.
>
> 3. With the patch below, a scheduling-clock interrupt of a
> non-idle non-userspace task will force a reschedule, which
> will result in #1 above happening.[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Mon, Jul 09, 2018 at 09:35:38PM +0100, David Woodhouse wrote:
>
>
> On Mon, 2018-07-09 at 13:34 -0700, Paul E. McKenney wrote:
> >
> > So here are the possible code paths when .rcu_urgent_qs is set to true:
> >
> > 1. A context switch will record the quiescent state and clear
> > .rcu_urgent_qs. (The failure to do the clearing in current -rcu
> > for PREEMPT builds is a performance bug that I need to fix.)
>
> What if there's nothing else runnable and there is no actual context
> switch?
The scheduler invokes rcu_note_context_switch() before looking to see
if there really will or won't be a context switch.
I am sure that Peter will correct me if I am confused on this point. ;-)
Thanx, Paul
> > 2. A cond_resched() will cause rcu_all_qs() to be invoked, which
> > will record a quiescent state and clear .rcu_urgent_qs.
> >
> > 3. With the patch below, a scheduling-clock interrupt of a
> > non-idle non-userspace task will force a reschedule, which
> > will result in #1 above happening.

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-09T20:45:54Zurn:uuid:a7c8cfb0-a4aa-e23e-650a-fab18935b92f

On Mon, Jul 09, 2018 at 09:45:45PM +0100, David Woodhouse wrote:
> On Mon, 2018-07-09 at 13:42 -0700, Paul E. McKenney wrote:
> > On Mon, Jul 09, 2018 at 09:35:38PM +0100, David Woodhouse wrote:
> > >
> > >
> > > On Mon, 2018-07-09 at 13:34 -0700, Paul E. McKenney wrote:
> > > >
> > > > So here are the possible code paths when .rcu_urgent_qs is set to true:
> > > >
> > > > 1. A context switch will record the quiescent state and clear
> > > > .rcu_urgent_qs. (The failure to do the clearing in current -rcu
> > > > for PREEMPT builds is a performance bug that I need to fix.)
> > >
> > > What if there's nothing else runnable and there is no actual context
> > > switch?
> >
> > The scheduler invokes rcu_note_context_switch() before looking to see
> > if there really will or won't be a context switch.
> >
> > I am sure that Peter will correct me if I am confused on this point. ;-)
>
> Ah, OK. Yes, that looks correct. Thanks.
Here is hoping!
> I'll give your patch a spin tomorrow, unless Marius beats me to it.
Please see below for the version that I eventually queued. Should Marius
have a Reported-by? If so, please tell me his full name so I can add that.
Thanx, Paul
------------------------------------------------------------------------
commit e1e91fd0796dc544a77ddfaa5afbfc1f5fb42ecd
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Mon Jul 9 13:47:30 2018 -0700
rcu: Make need_resched() respond to urgent RCU-QS needs
The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
need for an RCU quiescent state from the force-quiescent-state processing
within the grace-period kthread to context switches and to cond_resched().
Unfortunately, such urgent needs are not communicated to need_resched(),
which is sometimes used to decide when to invoke cond_resched(), for
but one example, within the KVM vcpu_run() function. As of v4.15, this
can result in synchronize_sched() being delayed by up to ten seconds,
which can be problematic, to say nothing of annoying.
This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
rcu_check_callbacks(), which is invoked from the scheduling-clock
interrupt handler. If the current task is not an idle task and is
not executing in usermode, a context switch is forced, and either way,
the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
task is an idle task, then RCU's dyntick-idle code will detect the
quiescent state, so no further action is required. Similarly, if the
task is executing in usermode, other code in rcu_check_callbacks() and
its called functions will report the corresponding quiescent state.
Reported-by: David Woodhouse <dwmw2@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51919985f6cf..ea756bb64eb3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2496,6 +2496,15 @@ void rcu_check_callbacks(int user)
{
trace_rcu_utilization(TPS("Start scheduler-tick"));
raw_cpu_inc(rcu_data.ticks_this_gp);
+ /* The load-acquire pairs with the store-release setting to true. */
+ if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
+ /* Idle and userspace execution already are quiescent states. */
+ if (!is_idle_task(current) && !user) {
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+ }
+ __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
+ }
rcu_flavor_check_callbacks(user);
if (rcu_pending())
invoke_rcu_core();

On Mon, Jul 09, 2018 at 02:05:32PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 09:45:45PM +0100, David Woodhouse wrote:
> > On Mon, 2018-07-09 at 13:42 -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 09, 2018 at 09:35:38PM +0100, David Woodhouse wrote:
> > > >
> > > >
> > > > On Mon, 2018-07-09 at 13:34 -0700, Paul E. McKenney wrote:
> > > > >
> > > > > So here are the possible code paths when .rcu_urgent_qs is set to true:
> > > > >
> > > > > 1. A context switch will record the quiescent state and clear
> > > > > .rcu_urgent_qs. (The failure to do the clearing in current -rcu
> > > > > for PREEMPT builds is a performance bug that I need to fix.)
> > > >
> > > > What if there's nothing else runnable and there is no actual context
> > > > switch?
> > >
> > > The scheduler invokes rcu_note_context_switch() before looking to see
> > > if there really will or won't be a context switch.
> > >
> > > I am sure that Peter will correct me if I am confused on this point. ;-)
> >
> > Ah, OK. Yes, that looks correct. Thanks.
>
> Here is hoping!
>
> > I'll give your patch a spin tomorrow, unless Marius beats me to it.
>
> Please see below for the version that I eventually queued. Should Marius
> have a Reported-by? If so, please tell me his full name so I can add that.
And the earlier patch was against my -rcu tree, which won't be all that
helpful for v4.15. Please see below for a lightly tested backport to v4.15.
It should apply to all the releases of interest. If other backports
are needed, please remind me of my woodhouse.v4.15.2018.07.09a tag.
Thanx, Paul
------------------------------------------------------------------------
commit 6361b81827a8f93f582124da385258fc04a38a7f
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Mon Jul 9 13:47:30 2018 -0700
rcu: Make need_resched() respond to urgent RCU-QS needs
The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
need for an RCU quiescent state from the force-quiescent-state processing
within the grace-period kthread to context switches and to cond_resched().
Unfortunately, such urgent needs are not communicated to need_resched(),
which is sometimes used to decide when to invoke cond_resched(), for
but one example, within the KVM vcpu_run() function. As of v4.15, this
can result in synchronize_sched() being delayed by up to ten seconds,
which can be problematic, to say nothing of annoying.
This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
rcu_check_callbacks(), which is invoked from the scheduling-clock
interrupt handler. If the current task is not an idle task and is
not executing in usermode, a context switch is forced, and either way,
the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
task is an idle task, then RCU's dyntick-idle code will detect the
quiescent state, so no further action is required. Similarly, if the
task is executing in usermode, other code in rcu_check_callbacks() and
its called functions will report the corresponding quiescent state.
Reported-by: David Woodhouse <dwmw2@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Backported to v4.15. Probably applies elsewhere. ]
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f9c0ca2ccf0c..3350ece366ab 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2839,6 +2839,15 @@ void rcu_check_callbacks(int user)
rcu_bh_qs();
}
rcu_preempt_check_callbacks();
+ /* The load-acquire pairs with the store-release setting to true. */
+ if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
+ /* Idle and userspace execution already are quiescent states. */
+ if (rcu_is_cpu_rrupt_from_idle() && !user) {
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+ }
+ __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
+ }
if (rcu_pending())
invoke_rcu_core();
if (user)

On Mon, Jul 09, 2018 at 01:42:48PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 09:35:38PM +0100, David Woodhouse wrote:
> >
> >
> > On Mon, 2018-07-09 at 13:34 -0700, Paul E. McKenney wrote:
> > >
> > > So here are the possible code paths when .rcu_urgent_qs is set to true:
> > >
> > > 1. A context switch will record the quiescent state and clear
> > > .rcu_urgent_qs. (The failure to do the clearing in current -rcu
> > > for PREEMPT builds is a performance bug that I need to fix.)
> >
> > What if there's nothing else runnable and there is no actual context
> > switch?
>
> The scheduler invokes rcu_note_context_switch() before looking to see
> if there really will or won't be a context switch.
Correct. Just getting to __schedule() means we can schedule and thus is
a valid point for RCU to progress. Even if we then end up selecting the
very same task and not switching at all.

On Tue, Jul 10, 2018 at 11:24:26AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 09, 2018 at 01:42:48PM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 09, 2018 at 09:35:38PM +0100, David Woodhouse wrote:
> > >
> > >
> > > On Mon, 2018-07-09 at 13:34 -0700, Paul E. McKenney wrote:
> > > >
> > > > So here are the possible code paths when .rcu_urgent_qs is set to true:
> > > >
> > > > 1. A context switch will record the quiescent state and clear
> > > > .rcu_urgent_qs. (The failure to do the clearing in current -rcu
> > > > for PREEMPT builds is a performance bug that I need to fix.)
> > >
> > > What if there's nothing else runnable and there is no actual context
> > > switch?
> >
> > The scheduler invokes rcu_note_context_switch() before looking to see
> > if there really will or won't be a context switch.
>
> Correct. Just getting to __schedule() means we can schedule and thus is
> a valid point for RCU to progress. Even if we then end up selecting the
> very same task and not switching at all.
Thank you for the confirmation!
Thanx, Paul

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-11T10:57:51Zurn:uuid:2fb18077-9b14-c51a-382d-fdda2955697a

[-- Attachment #1: Type: text/plain, Size: 4059 bytes --]
On Mon, 2018-07-09 at 15:08 -0700, Paul E. McKenney wrote:
>
> And the earlier patch was against my -rcu tree, which won't be all that
> helpful for v4.15. Please see below for a lightly tested backport to v4.15.
>
> It should apply to all the releases of interest. If other backports
> are needed, please remind me of my woodhouse.v4.15.2018.07.09a tag.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 6361b81827a8f93f582124da385258fc04a38a7f
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date: Mon Jul 9 13:47:30 2018 -0700
>
> rcu: Make need_resched() respond to urgent RCU-QS needs
>
> The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
> need for an RCU quiescent state from the force-quiescent-state processing
> within the grace-period kthread to context switches and to cond_resched().
> Unfortunately, such urgent needs are not communicated to need_resched(),
> which is sometimes used to decide when to invoke cond_resched(), for
> but one example, within the KVM vcpu_run() function. As of v4.15, this
> can result in synchronize_sched() being delayed by up to ten seconds,
> which can be problematic, to say nothing of annoying.
>
> This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
> rcu_check_callbacks(), which is invoked from the scheduling-clock
> interrupt handler. If the current task is not an idle task and is
> not executing in usermode, a context switch is forced, and either way,
> the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
> task is an idle task, then RCU's dyntick-idle code will detect the
> quiescent state, so no further action is required. Similarly, if the
> task is executing in usermode, other code in rcu_check_callbacks() and
> its called functions will report the corresponding quiescent state.
>
> Reported-by: David Woodhouse <dwmw2@infradead.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> [ paulmck: Backported to v4.15. Probably applies elsewhere. ]
Hm, this doesn't appear to work. I'm still seeing latencies of 4-5
seconds in my testing. In fact, even our old workaround of adding
rcu_all_qs() into vcpu_enter_guest() didn't properly fix it AFAICT.
I'm just creating a VM with lots of CPUs, then attaching new devices to
it to cause the VMM to open more file descriptors, until it hits a
power of two and invokes expand_fdtable().
expand_fdtable (512) sync took 10472394964 cycles (3500000 µs).
expand_fdtable (512) sync took 15298908072 cycles (5100000 µs).
--- a/fs/file.c
+++ b/fs/file.c
@@ -162,8 +162,16 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
/* make sure all __fd_install() have seen resize_in_progress
* or have finished their rcu_read_lock_sched() section.
*/
- if (atomic_read(&files->count) > 1)
+ if (atomic_read(&files->count) > 1) {
+ unsigned long sync_start, sync_end;
+ unsigned long j_start, j_end;
+ j_start = jiffies;
+ sync_start = get_cycles();
synchronize_sched();
+ sync_end = get_cycles();
+ j_end = jiffies;
+ printk("expand_fdtable (%d) sync took %ld cycles (%ld µs).\n", nr, sync_end - sync_start, jiffies_to_usecs(j_end - j_start));
+ }
spin_lock(&files->file_lock);
if (!new_fdt)
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Wed, Jul 11, 2018 at 11:57:43AM +0100, David Woodhouse wrote:
> On Mon, 2018-07-09 at 15:08 -0700, Paul E. McKenney wrote:
>
> >
> > And the earlier patch was against my -rcu tree, which won't be all that
> > helpful for v4.15. Please see below for a lightly tested backport to v4.15.
> >
> > It should apply to all the releases of interest. If other backports
> > are needed, please remind me of my woodhouse.v4.15.2018.07.09a tag.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 6361b81827a8f93f582124da385258fc04a38a7f
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date: Mon Jul 9 13:47:30 2018 -0700
> >
> > rcu: Make need_resched() respond to urgent RCU-QS needs
> >
> > The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
> > need for an RCU quiescent state from the force-quiescent-state processing
> > within the grace-period kthread to context switches and to cond_resched().
> > Unfortunately, such urgent needs are not communicated to need_resched(),
> > which is sometimes used to decide when to invoke cond_resched(), for
> > but one example, within the KVM vcpu_run() function. As of v4.15, this
> > can result in synchronize_sched() being delayed by up to ten seconds,
> > which can be problematic, to say nothing of annoying.
> >
> > This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
> > rcu_check_callbacks(), which is invoked from the scheduling-clock
> > interrupt handler. If the current task is not an idle task and is
> > not executing in usermode, a context switch is forced, and either way,
> > the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
> > task is an idle task, then RCU's dyntick-idle code will detect the
> > quiescent state, so no further action is required. Similarly, if the
> > task is executing in usermode, other code in rcu_check_callbacks() and
> > its called functions will report the corresponding quiescent state.
> >
> > Reported-by: David Woodhouse <dwmw2@infradead.org>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > [ paulmck: Backported to v4.15. Probably applies elsewhere. ]
>
> Hm, this doesn't appear to work. I'm still seeing latencies of 4-5
> seconds in my testing. In fact, even our old workaround of adding
> rcu_all_qs() into vcpu_enter_guest() didn't properly fix it AFAICT.
>
> I'm just creating a VM with lots of CPUs, then attaching new devices to
> it to cause the VMM to open more file descriptors, until it hits a
> power of two and invokes expand_fdtable().
>
> expand_fdtable (512) sync took 10472394964 cycles (3500000 µs).
> expand_fdtable (512) sync took 15298908072 cycles (5100000 µs).
Interesting. (I am assuming that the guest is printing these messages,
not the host, but please let me know if my assumption is incorrect.)
Are the CPUs saturated? If so, could you please try booting with
rcutree.kthread_prio=2? If that prevents the messages from happening,
then I need to put some work into guaranteeing forward progress.
Otherwise, I need to figure out why the setting of rcu_urgent_qs is
being ignored.
I will assume the latter for the moment and see if I can spot the
problem.
Thanx, Paul
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -162,8 +162,16 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
> /* make sure all __fd_install() have seen resize_in_progress
> * or have finished their rcu_read_lock_sched() section.
> */
> - if (atomic_read(&files->count) > 1)
> + if (atomic_read(&files->count) > 1) {
> + unsigned long sync_start, sync_end;
> + unsigned long j_start, j_end;
> + j_start = jiffies;
> + sync_start = get_cycles();
> synchronize_sched();
> + sync_end = get_cycles();
> + j_end = jiffies;
> + printk("expand_fdtable (%d) sync took %ld cycles (%ld µs).\n", nr, sync_end - sync_start, jiffies_to_usecs(j_end - j_start));
> + }
>
> spin_lock(&files->file_lock);
> if (!new_fdt)

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-11T12:58:27Zurn:uuid:91d0283f-1b95-59e0-026a-ddf6be30cb80

[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]
On Wed, 2018-07-11 at 05:51 -0700, Paul E. McKenney wrote:
>
> Interesting. (I am assuming that the guest is printing these messages,
> not the host, but please let me know if my assumption is incorrect.)
No, this is all in the host. When the VMM (qemu, etc.) opens more files
and has to expand its fd_table, the threads which are currently in
KVM's vcpu_run() are making synchronize_sched() take multiple seconds.
> Are the CPUs saturated? If so, could you please try booting with
> rcutree.kthread_prio=2? If that prevents the messages from happening,
> then I need to put some work into guaranteeing forward progress.
> Otherwise, I need to figure out why the setting of rcu_urgent_qs is
> being ignored.
The CPUs shouldn't be saturated. The guest is fairly much idle. I can
best reproduce this by starting up the guest and then assigning a new
PCI device. At that point fairly much nothing is happening at all.
> I will assume the latter for the moment and see if I can spot the
> problem.[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-11T14:23:54Zurn:uuid:aa4bb2e4-87c2-3128-be60-128e5773dd17

On Wed, Jul 11, 2018 at 01:58:22PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 05:51 -0700, Paul E. McKenney wrote:
> >
> > Interesting. (I am assuming that the guest is printing these messages,
> > not the host, but please let me know if my assumption is incorrect.)
>
> No, this is all in the host. When the VMM (qemu, etc.) opens more files
> and has to expand its fd_table, the threads which are currently in
> KVM's vcpu_run() are making synchronize_sched() take multiple seconds.
>
> > Are the CPUs saturated? If so, could you please try booting with
> > rcutree.kthread_prio=2? If that prevents the messages from happening,
> > then I need to put some work into guaranteeing forward progress.
> > Otherwise, I need to figure out why the setting of rcu_urgent_qs is
> > being ignored.
>
> The CPUs shouldn't be saturated. The guest is fairly much idle. I can
> best reproduce this by starting up the guest and then assigning a new
> PCI device. At that point fairly much nothing is happening at all.
OK, thank you for the information and again apologies for the hassle.
I will do what I should have done long ago and make the relevant addition
to rcutorture.
In the meantime, one workaround is to export rcu_momentary_dyntick_idle()
and to invoke it from within your loop, for example, as enabled by the
(untested, probably does not even build) patch below.
This approach is quite a bit heavier weight than the hoped-for eventual
fix, but it should get this out of the way to allow you to find other
problems in your testing. ;-)
Thanx, Paul
> > I will assume the latter for the moment and see if I can spot the
> > problem.
------------------------------------------------------------------------
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index b3dbf9502fd0..bbf23e1318a9 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -115,6 +115,7 @@ static inline bool rcu_irq_enter_disabled(void) { return false; }
static inline void rcu_irq_exit_irqson(void) { }
static inline void rcu_irq_enter_irqson(void) { }
static inline void rcu_irq_exit(void) { }
+static inline void rcu_momentary_dyntick_idle(void) { }
static inline void exit_rcu(void) { }
#ifdef CONFIG_SRCU
void rcu_scheduler_starting(void);
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 37d6fd3b7ff8..1bec142720dd 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -86,6 +86,7 @@ void rcu_irq_exit(void);
void rcu_irq_enter_irqson(void);
void rcu_irq_exit_irqson(void);
bool rcu_irq_enter_disabled(void);
+void rcu_momentary_dyntick_idle(void);
void exit_rcu(void);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f9c0ca2ccf0c..da06a52e5e60 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -439,11 +439,12 @@ bool rcu_eqs_special_set(int cpu)
*
* The caller must have disabled interrupts.
*/
-static void rcu_momentary_dyntick_idle(void)
+void rcu_momentary_dyntick_idle(void)
{
raw_cpu_write(rcu_dynticks.rcu_need_heavy_qs, false);
rcu_dynticks_momentary_idle();
}
+EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
/*
* Note a context switch. This is a quiescent state for RCU-sched,

On Wed, Jul 11, 2018 at 03:23:45PM +0100, David Woodhouse wrote:
>
>
> On Mon, 2018-07-09 at 15:08 -0700, Paul E. McKenney wrote:
> > index f9c0ca2ccf0c..3350ece366ab 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2839,6 +2839,15 @@ void rcu_check_callbacks(int user)
> > rcu_bh_qs();
> > }
> > rcu_preempt_check_callbacks();
> > + /* The load-acquire pairs with the store-release setting to true. */
> > + if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
> > + /* Idle and userspace execution already are quiescent states. */
> > + if (rcu_is_cpu_rrupt_from_idle() && !user) {
>
> if (idle && !user) seems tautological and... illogical.
>
> If I make it 'if (!rcu_is_cpu_rrput_from_idle() && !user)' it seems to
> work better. Ripping out my debugging printks now to check that's still
> true...
Right you are! I will step away for a bit to put a paper bag over
my head...
> (Also, isn't userspace execution only a quiescent state if NO_HZ_FULL?)
Userspace execution is a quiescent state in all cases. However, you
are quite right that NO_HZ_FULL makes a difference, namely, it allows
one CPU to reliably determine whether or not some other CPU is
currently executing either in userspace or in idle.
Without NO_HZ_FULL, CPUs can only detect their own userspace execution.
Which is what is happening here because rcu_check_callbacks() is being
invoked from the scheduling-clock interrupt, which is where the "user"
parameter comes from.
So the above code can reliably detect the usermode-execution quiescent
state because it is always running on the CPU in question.
Thanx, Paul
> > + set_tsk_need_resched(current);
> > + set_preempt_need_resched();
> > + }
> > + __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
> > + }

On Wed, Jul 11, 2018 at 07:43:03AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 11, 2018 at 03:23:45PM +0100, David Woodhouse wrote:
> >
> >
> > On Mon, 2018-07-09 at 15:08 -0700, Paul E. McKenney wrote:
> > > index f9c0ca2ccf0c..3350ece366ab 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2839,6 +2839,15 @@ void rcu_check_callbacks(int user)
> > > rcu_bh_qs();
> > > }
> > > rcu_preempt_check_callbacks();
> > > + /* The load-acquire pairs with the store-release setting to true. */
> > > + if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
> > > + /* Idle and userspace execution already are quiescent states. */
> > > + if (rcu_is_cpu_rrupt_from_idle() && !user) {
> >
> > if (idle && !user) seems tautological and... illogical.
> >
> > If I make it 'if (!rcu_is_cpu_rrput_from_idle() && !user)' it seems to
> > work better. Ripping out my debugging printks now to check that's still
> > true...
>
> Right you are! I will step away for a bit to put a paper bag over
> my head...
>
> > (Also, isn't userspace execution only a quiescent state if NO_HZ_FULL?)
>
> Userspace execution is a quiescent state in all cases. However, you
> are quite right that NO_HZ_FULL makes a difference, namely, it allows
> one CPU to reliably determine whether or not some other CPU is
> currently executing either in userspace or in idle.
>
> Without NO_HZ_FULL, CPUs can only detect their own userspace execution.
> Which is what is happening here because rcu_check_callbacks() is being
> invoked from the scheduling-clock interrupt, which is where the "user"
> parameter comes from.
>
> So the above code can reliably detect the usermode-execution quiescent
> state because it is always running on the CPU in question.
And here is an updated v4.15 patch with Marius's Reported-by and David's
fix to my lost exclamation point.
Thanx, Paul
------------------------------------------------------------------------
commit 83c4beae36f2a2b38c1a0fa80538af7ce2477823
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Mon Jul 9 13:47:30 2018 -0700
rcu: Make need_resched() respond to urgent RCU-QS needs
The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
need for an RCU quiescent state from the force-quiescent-state processing
within the grace-period kthread to context switches and to cond_resched().
Unfortunately, such urgent needs are not communicated to need_resched(),
which is sometimes used to decide when to invoke cond_resched(), for
but one example, within the KVM vcpu_run() function. As of v4.15, this
can result in synchronize_sched() being delayed by up to ten seconds,
which can be problematic, to say nothing of annoying.
This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
rcu_check_callbacks(), which is invoked from the scheduling-clock
interrupt handler. If the current task is not an idle task and is
not executing in usermode, a context switch is forced, and either way,
the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current
task is an idle task, then RCU's dyntick-idle code will detect the
quiescent state, so no further action is required. Similarly, if the
task is executing in usermode, other code in rcu_check_callbacks() and
its called functions will report the corresponding quiescent state.
Reported-by: Marius Hillenbrand <mhillenb@amazon.de>
Reported-by: David Woodhouse <dwmw2@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Backported to v4.15. Probably applies elsewhere. ]
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f9c0ca2ccf0c..de2f91cb2a0c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2839,6 +2839,15 @@ void rcu_check_callbacks(int user)
rcu_bh_qs();
}
rcu_preempt_check_callbacks();
+ /* The load-acquire pairs with the store-release setting to true. */
+ if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
+ /* Idle and userspace execution already are quiescent states. */
+ if (!rcu_is_cpu_rrupt_from_idle() && !user) {
+ set_tsk_need_resched(current);
+ set_preempt_need_resched();
+ }
+ __this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
+ }
if (rcu_pending())
invoke_rcu_core();
if (user)

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-11T17:03:51Zurn:uuid:21389ddd-19bf-7e35-29d4-41a6e4b0beba

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-11T20:19:51Zurn:uuid:665ed143-7c73-4e73-f6b7-1bd339ccb44f

[-- Attachment #1: Type: text/plain, Size: 984 bytes --]
On Wed, 2018-07-11 at 13:17 -0700, Paul E. McKenney wrote:
> As I understand it, they would like to have their guest run uninterrupted
> for extended times. Because rcu_virt_note_context_switch() is a
> point-in-time quiescent state, it cannot tell RCU about the extended
> quiescent state.
>
> Should we replace the current calls to rcu_virt_note_context_switch()
> with rcu_kvm_enter() and rcu_kvm_exit()? Would that be better
> than the below architecture-by-architecture approach?
Yes it would. I was already starting to mutter about needing the same
for ARM and POWER. I'll do a v3 (incorporating your fixes too) in the
morning.
Thanks.
Also... why in $DEITY's name was the existing
rcu_virt_note_context_switch() not actually sufficient? If we had that
there, why did we need an additional explicit calls to rcu_all_qs() in
the KVM loop, or the more complex fixes to need_resched() which
ultimately had the same effect, to avoid ten-second latencies?
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Wed, Jul 11, 2018 at 09:19:44PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 13:17 -0700, Paul E. McKenney wrote:
> > As I understand it, they would like to have their guest run uninterrupted
> > for extended times. Because rcu_virt_note_context_switch() is a
> > point-in-time quiescent state, it cannot tell RCU about the extended
> > quiescent state.
> >
> > Should we replace the current calls to rcu_virt_note_context_switch()
> > with rcu_kvm_enter() and rcu_kvm_exit()? Would that be better
> > than the below architecture-by-architecture approach?
>
> Yes it would. I was already starting to mutter about needing the same
> for ARM and POWER. I'll do a v3 (incorporating your fixes too) in the
> morning.
>
> Thanks.
>
> Also... why in $DEITY's name was the existing
> rcu_virt_note_context_switch() not actually sufficient? If we had that
> there, why did we need an additional explicit calls to rcu_all_qs() in
> the KVM loop, or the more complex fixes to need_resched() which
> ultimately had the same effect, to avoid ten-second latencies?
My guess is that this was because control passed through the
rcu_virt_note_context_switch() only once, and then subsequent
scheduling-clock interrupts bypassed this code. But that is just a guess.
I need to defer to someone who understands the KVM code better than I do.
Thanx, Paul

On Wed, Jul 11, 2018 at 09:54:52PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 13:27 -0700, Paul E. McKenney wrote:
> >
> > Something like this, on top of the earlier patch? I am not at all
> > confident of this patch because there might be other entry/exit
> > paths I am missing. Plus there might be RCU uses on the arch-
> > specific patch to and from the guest OS.
>
> That was going to be my starting point, yes. With some additional
> manual review and testing to cover the "not at all confident" part. :)
Please feel free to change as needed, up to and including throwing
my patch away and starting over. ;-)
Thanx, Paul

On 07/11/2018 10:27 PM, Paul E. McKenney wrote:
> On Wed, Jul 11, 2018 at 08:39:36PM +0200, Christian Borntraeger wrote:
>>
>>
>> On 07/11/2018 08:36 PM, Paul E. McKenney wrote:
>>> On Wed, Jul 11, 2018 at 11:20:53AM -0700, Paul E. McKenney wrote:
>>>> On Wed, Jul 11, 2018 at 07:01:01PM +0100, David Woodhouse wrote:
>>>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>>>
>>>>> RCU can spend long periods of time waiting for a CPU which is actually in
>>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
>>>>> modes, and don't wait for it.
>>>>>
>>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>>>
>>>> And idiot here forgot about some of the debugging code in RCU's dyntick-idle
>>>> code. I will reply with a fixed patch.
>>>>
>>>> The code below works just fine as long as you don't enable CONFIG_RCU_EQS_DEBUG,
>>>> so should be OK for testing, just not for mainline.
>>>
>>> And here is the updated code that allegedly avoids splatting when run with
>>> CONFIG_RCU_EQS_DEBUG.
>>>
>>> Thoughts?
>>>
>>> Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> commit 12cd59e49cf734f907f44b696e2c6e4b46a291c3
>>> Author: David Woodhouse <dwmw@amazon.co.uk>
>>> Date: Wed Jul 11 19:01:01 2018 +0100
>>>
>>> kvm/x86: Inform RCU of quiescent state when entering guest mode
>>>
>>> RCU can spend long periods of time waiting for a CPU which is actually in
>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
>>> modes, and don't wait for it.
>>>
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>> [ paulmck: Adjust to avoid bad advice I gave to dwmw, avoid WARN_ON()s. ]
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 0046aa70205a..b0c82f70afa7 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>> vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
>>> }
>>>
>>> + rcu_kvm_enter();
>>> kvm_x86_ops->run(vcpu);
>>> + rcu_kvm_exit();
>>
>> As indicated in my other mail. This is supposed to be handled in the guest_enter|exit_ calls around
>> the run function. This would also handle other architectures. So if the guest_enter_irqoff code is
>> not good enough, we should rather fix that instead of adding another rcu hint.
>
> Something like this, on top of the earlier patch? I am not at all
> confident of this patch because there might be other entry/exit
> paths I am missing. Plus there might be RCU uses on the arch-specific
> patch to and from the guest OS.
>
> Thoughts?
>
If you instrment guest_enter/exit, you should cover all cases and all architectures as far
as I can tell. FWIW, we did this rcu_note thing back then actually handling this particular
case of long running guests blocking rcu for many seconds. And I am pretty sure that
this did help back then.

On Wed, Jul 11, 2018 at 11:11:19PM +0200, Christian Borntraeger wrote:
>
>
> On 07/11/2018 10:27 PM, Paul E. McKenney wrote:
> > On Wed, Jul 11, 2018 at 08:39:36PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 07/11/2018 08:36 PM, Paul E. McKenney wrote:
> >>> On Wed, Jul 11, 2018 at 11:20:53AM -0700, Paul E. McKenney wrote:
> >>>> On Wed, Jul 11, 2018 at 07:01:01PM +0100, David Woodhouse wrote:
> >>>>> From: David Woodhouse <dwmw@amazon.co.uk>
> >>>>>
> >>>>> RCU can spend long periods of time waiting for a CPU which is actually in
> >>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> >>>>> modes, and don't wait for it.
> >>>>>
> >>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> >>>>
> >>>> And idiot here forgot about some of the debugging code in RCU's dyntick-idle
> >>>> code. I will reply with a fixed patch.
> >>>>
> >>>> The code below works just fine as long as you don't enable CONFIG_RCU_EQS_DEBUG,
> >>>> so should be OK for testing, just not for mainline.
> >>>
> >>> And here is the updated code that allegedly avoids splatting when run with
> >>> CONFIG_RCU_EQS_DEBUG.
> >>>
> >>> Thoughts?
> >>>
> >>> Thanx, Paul
> >>>
> >>> ------------------------------------------------------------------------
> >>>
> >>> commit 12cd59e49cf734f907f44b696e2c6e4b46a291c3
> >>> Author: David Woodhouse <dwmw@amazon.co.uk>
> >>> Date: Wed Jul 11 19:01:01 2018 +0100
> >>>
> >>> kvm/x86: Inform RCU of quiescent state when entering guest mode
> >>>
> >>> RCU can spend long periods of time waiting for a CPU which is actually in
> >>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> >>> modes, and don't wait for it.
> >>>
> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> >>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >>> [ paulmck: Adjust to avoid bad advice I gave to dwmw, avoid WARN_ON()s. ]
> >>>
> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>> index 0046aa70205a..b0c82f70afa7 100644
> >>> --- a/arch/x86/kvm/x86.c
> >>> +++ b/arch/x86/kvm/x86.c
> >>> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>> vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
> >>> }
> >>>
> >>> + rcu_kvm_enter();
> >>> kvm_x86_ops->run(vcpu);
> >>> + rcu_kvm_exit();
> >>
> >> As indicated in my other mail. This is supposed to be handled in the guest_enter|exit_ calls around
> >> the run function. This would also handle other architectures. So if the guest_enter_irqoff code is
> >> not good enough, we should rather fix that instead of adding another rcu hint.
> >
> > Something like this, on top of the earlier patch? I am not at all
> > confident of this patch because there might be other entry/exit
> > paths I am missing. Plus there might be RCU uses on the arch-specific
> > patch to and from the guest OS.
> >
> > Thoughts?
> >
>
> If you instrment guest_enter/exit, you should cover all cases and all architectures as far
> as I can tell. FWIW, we did this rcu_note thing back then actually handling this particular
> case of long running guests blocking rcu for many seconds. And I am pretty sure that
> this did help back then.
And my second patch on the email you replied to replaced the only call
to rcu_virt_note_context_switch(). So maybe it covers what it needs to,
but yes, there might well be things I missed. Let's see what David
comes up with.
What changed was RCU's reactions to longish grace periods. It used to
be very aggressive about forcing the scheduler to do otherwise-unneeded
context switches, which became a problem somewhere between v4.9 and v4.15.
I therefore reduced the number of such context switches, which in turn
caused KVM to tell RCU about quiescent states way too infrequently.
The advantage of the rcu_kvm_enter()/rcu_kvm_exit() approach is that
it tells RCU of an extended duration in the guest, which means that
RCU can ignore the corresponding CPU, which in turn allows the guest
to proceed without any RCU-induced interruptions.
Does that make sense, or am I missing something? I freely admit to
much ignorance of both kvm and s390! ;-)
Thanx, Paul

On 07/11/2018 11:32 PM, Paul E. McKenney wrote:
> On Wed, Jul 11, 2018 at 11:11:19PM +0200, Christian Borntraeger wrote:
>>
>>
>> On 07/11/2018 10:27 PM, Paul E. McKenney wrote:
>>> On Wed, Jul 11, 2018 at 08:39:36PM +0200, Christian Borntraeger wrote:
>>>>
>>>>
>>>> On 07/11/2018 08:36 PM, Paul E. McKenney wrote:
>>>>> On Wed, Jul 11, 2018 at 11:20:53AM -0700, Paul E. McKenney wrote:
>>>>>> On Wed, Jul 11, 2018 at 07:01:01PM +0100, David Woodhouse wrote:
>>>>>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>>>>>
>>>>>>> RCU can spend long periods of time waiting for a CPU which is actually in
>>>>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
>>>>>>> modes, and don't wait for it.
>>>>>>>
>>>>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>>>>>
>>>>>> And idiot here forgot about some of the debugging code in RCU's dyntick-idle
>>>>>> code. I will reply with a fixed patch.
>>>>>>
>>>>>> The code below works just fine as long as you don't enable CONFIG_RCU_EQS_DEBUG,
>>>>>> so should be OK for testing, just not for mainline.
>>>>>
>>>>> And here is the updated code that allegedly avoids splatting when run with
>>>>> CONFIG_RCU_EQS_DEBUG.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Thanx, Paul
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> commit 12cd59e49cf734f907f44b696e2c6e4b46a291c3
>>>>> Author: David Woodhouse <dwmw@amazon.co.uk>
>>>>> Date: Wed Jul 11 19:01:01 2018 +0100
>>>>>
>>>>> kvm/x86: Inform RCU of quiescent state when entering guest mode
>>>>>
>>>>> RCU can spend long periods of time waiting for a CPU which is actually in
>>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
>>>>> modes, and don't wait for it.
>>>>>
>>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>>>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>>>> [ paulmck: Adjust to avoid bad advice I gave to dwmw, avoid WARN_ON()s. ]
>>>>>
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index 0046aa70205a..b0c82f70afa7 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>>>> vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
>>>>> }
>>>>>
>>>>> + rcu_kvm_enter();
>>>>> kvm_x86_ops->run(vcpu);
>>>>> + rcu_kvm_exit();
>>>>
>>>> As indicated in my other mail. This is supposed to be handled in the guest_enter|exit_ calls around
>>>> the run function. This would also handle other architectures. So if the guest_enter_irqoff code is
>>>> not good enough, we should rather fix that instead of adding another rcu hint.
>>>
>>> Something like this, on top of the earlier patch? I am not at all
>>> confident of this patch because there might be other entry/exit
>>> paths I am missing. Plus there might be RCU uses on the arch-specific
>>> patch to and from the guest OS.
>>>
>>> Thoughts?
>>>
>>
>> If you instrment guest_enter/exit, you should cover all cases and all architectures as far
>> as I can tell. FWIW, we did this rcu_note thing back then actually handling this particular
>> case of long running guests blocking rcu for many seconds. And I am pretty sure that
>> this did help back then.
>
> And my second patch on the email you replied to replaced the only call
> to rcu_virt_note_context_switch(). So maybe it covers what it needs to,
> but yes, there might well be things I missed. Let's see what David
> comes up with.
>
> What changed was RCU's reactions to longish grace periods. It used to
> be very aggressive about forcing the scheduler to do otherwise-unneeded
> context switches, which became a problem somewhere between v4.9 and v4.15.
> I therefore reduced the number of such context switches, which in turn
> caused KVM to tell RCU about quiescent states way too infrequently.
You talk about
commit bcbfdd01dce5556a952fae84ef16fd0f12525e7b
rcu: Make non-preemptive schedule be Tasks RCU quiescent state
correct? In fact, then whatever (properly sent) patch comes up should contain
a fixes tag.
>
> The advantage of the rcu_kvm_enter()/rcu_kvm_exit() approach is that
> it tells RCU of an extended duration in the guest, which means that
> RCU can ignore the corresponding CPU, which in turn allows the guest
> to proceed without any RCU-induced interruptions.
>
> Does that make sense, or am I missing something? I freely admit to
> much ignorance of both kvm and s390! ;-)
WIth that explanation it makes perfect sense to replace
rcu_virt_note_context_switch with rcu_kvm_enter/exit from an rcu performance
perspective. I assume that rcu_kvm_enter is not much slower than
rcu_virt_note_context_switch? Because we do call it on every guest entry/exit
which we might have plenty for ping pong I/O workload.

On Wed, Jul 11, 2018 at 02:32:59PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 11, 2018 at 11:11:19PM +0200, Christian Borntraeger wrote:
> >
> >
> > On 07/11/2018 10:27 PM, Paul E. McKenney wrote:
> > > On Wed, Jul 11, 2018 at 08:39:36PM +0200, Christian Borntraeger wrote:
> > >>
> > >>
> > >> On 07/11/2018 08:36 PM, Paul E. McKenney wrote:
> > >>> On Wed, Jul 11, 2018 at 11:20:53AM -0700, Paul E. McKenney wrote:
> > >>>> On Wed, Jul 11, 2018 at 07:01:01PM +0100, David Woodhouse wrote:
> > >>>>> From: David Woodhouse <dwmw@amazon.co.uk>
> > >>>>>
> > >>>>> RCU can spend long periods of time waiting for a CPU which is actually in
> > >>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> > >>>>> modes, and don't wait for it.
> > >>>>>
> > >>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > >>>>
> > >>>> And idiot here forgot about some of the debugging code in RCU's dyntick-idle
> > >>>> code. I will reply with a fixed patch.
> > >>>>
> > >>>> The code below works just fine as long as you don't enable CONFIG_RCU_EQS_DEBUG,
> > >>>> so should be OK for testing, just not for mainline.
> > >>>
> > >>> And here is the updated code that allegedly avoids splatting when run with
> > >>> CONFIG_RCU_EQS_DEBUG.
> > >>>
> > >>> Thoughts?
> > >>>
> > >>> Thanx, Paul
> > >>>
> > >>> ------------------------------------------------------------------------
> > >>>
> > >>> commit 12cd59e49cf734f907f44b696e2c6e4b46a291c3
> > >>> Author: David Woodhouse <dwmw@amazon.co.uk>
> > >>> Date: Wed Jul 11 19:01:01 2018 +0100
> > >>>
> > >>> kvm/x86: Inform RCU of quiescent state when entering guest mode
> > >>>
> > >>> RCU can spend long periods of time waiting for a CPU which is actually in
> > >>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> > >>> modes, and don't wait for it.
> > >>>
> > >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > >>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > >>> [ paulmck: Adjust to avoid bad advice I gave to dwmw, avoid WARN_ON()s. ]
> > >>>
> > >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > >>> index 0046aa70205a..b0c82f70afa7 100644
> > >>> --- a/arch/x86/kvm/x86.c
> > >>> +++ b/arch/x86/kvm/x86.c
> > >>> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >>> vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
> > >>> }
> > >>>
> > >>> + rcu_kvm_enter();
> > >>> kvm_x86_ops->run(vcpu);
> > >>> + rcu_kvm_exit();
> > >>
> > >> As indicated in my other mail. This is supposed to be handled in the guest_enter|exit_ calls around
> > >> the run function. This would also handle other architectures. So if the guest_enter_irqoff code is
> > >> not good enough, we should rather fix that instead of adding another rcu hint.
> > >
> > > Something like this, on top of the earlier patch? I am not at all
> > > confident of this patch because there might be other entry/exit
> > > paths I am missing. Plus there might be RCU uses on the arch-specific
> > > patch to and from the guest OS.
> > >
> > > Thoughts?
> > >
> >
> > If you instrment guest_enter/exit, you should cover all cases and all architectures as far
> > as I can tell. FWIW, we did this rcu_note thing back then actually handling this particular
> > case of long running guests blocking rcu for many seconds. And I am pretty sure that
> > this did help back then.
>
> And my second patch on the email you replied to replaced the only call
> to rcu_virt_note_context_switch(). So maybe it covers what it needs to,
> but yes, there might well be things I missed. Let's see what David
> comes up with.
>
> What changed was RCU's reactions to longish grace periods. It used to
> be very aggressive about forcing the scheduler to do otherwise-unneeded
> context switches, which became a problem somewhere between v4.9 and v4.15.
> I therefore reduced the number of such context switches, which in turn
> caused KVM to tell RCU about quiescent states way too infrequently.
>
> The advantage of the rcu_kvm_enter()/rcu_kvm_exit() approach is that
> it tells RCU of an extended duration in the guest, which means that
> RCU can ignore the corresponding CPU, which in turn allows the guest
> to proceed without any RCU-induced interruptions.
>
> Does that make sense, or am I missing something? I freely admit to
> much ignorance of both kvm and s390! ;-)
But I am getting some rcutorture near misses on the commit that
introduces rcu_kvm_enter() and rcu_kvm_exit() to the x86 arch-specific
vcpu_enter_guest() function. These near misses occur when running
rcutorture scenarios TREE01 and TREE03, and in my -rcu tree rather
than the v4.15 version of this patch.
Given that I am making pervasive changes to the way that RCU works,
it might well be that this commit is an innocent bystander. I will
run tests overnight and let you know what comes up.
Thanx, Paul

On Wed, Jul 11, 2018 at 11:39:10PM +0200, Christian Borntraeger wrote:
>
>
> On 07/11/2018 11:32 PM, Paul E. McKenney wrote:
> > On Wed, Jul 11, 2018 at 11:11:19PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 07/11/2018 10:27 PM, Paul E. McKenney wrote:
> >>> On Wed, Jul 11, 2018 at 08:39:36PM +0200, Christian Borntraeger wrote:
> >>>>
> >>>>
> >>>> On 07/11/2018 08:36 PM, Paul E. McKenney wrote:
> >>>>> On Wed, Jul 11, 2018 at 11:20:53AM -0700, Paul E. McKenney wrote:
> >>>>>> On Wed, Jul 11, 2018 at 07:01:01PM +0100, David Woodhouse wrote:
> >>>>>>> From: David Woodhouse <dwmw@amazon.co.uk>
> >>>>>>>
> >>>>>>> RCU can spend long periods of time waiting for a CPU which is actually in
> >>>>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> >>>>>>> modes, and don't wait for it.
> >>>>>>>
> >>>>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> >>>>>>
> >>>>>> And idiot here forgot about some of the debugging code in RCU's dyntick-idle
> >>>>>> code. I will reply with a fixed patch.
> >>>>>>
> >>>>>> The code below works just fine as long as you don't enable CONFIG_RCU_EQS_DEBUG,
> >>>>>> so should be OK for testing, just not for mainline.
> >>>>>
> >>>>> And here is the updated code that allegedly avoids splatting when run with
> >>>>> CONFIG_RCU_EQS_DEBUG.
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>> Thanx, Paul
> >>>>>
> >>>>> ------------------------------------------------------------------------
> >>>>>
> >>>>> commit 12cd59e49cf734f907f44b696e2c6e4b46a291c3
> >>>>> Author: David Woodhouse <dwmw@amazon.co.uk>
> >>>>> Date: Wed Jul 11 19:01:01 2018 +0100
> >>>>>
> >>>>> kvm/x86: Inform RCU of quiescent state when entering guest mode
> >>>>>
> >>>>> RCU can spend long periods of time waiting for a CPU which is actually in
> >>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
> >>>>> modes, and don't wait for it.
> >>>>>
> >>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> >>>>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >>>>> [ paulmck: Adjust to avoid bad advice I gave to dwmw, avoid WARN_ON()s. ]
> >>>>>
> >>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>>> index 0046aa70205a..b0c82f70afa7 100644
> >>>>> --- a/arch/x86/kvm/x86.c
> >>>>> +++ b/arch/x86/kvm/x86.c
> >>>>> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>>>> vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
> >>>>> }
> >>>>>
> >>>>> + rcu_kvm_enter();
> >>>>> kvm_x86_ops->run(vcpu);
> >>>>> + rcu_kvm_exit();
> >>>>
> >>>> As indicated in my other mail. This is supposed to be handled in the guest_enter|exit_ calls around
> >>>> the run function. This would also handle other architectures. So if the guest_enter_irqoff code is
> >>>> not good enough, we should rather fix that instead of adding another rcu hint.
> >>>
> >>> Something like this, on top of the earlier patch? I am not at all
> >>> confident of this patch because there might be other entry/exit
> >>> paths I am missing. Plus there might be RCU uses on the arch-specific
> >>> patch to and from the guest OS.
> >>>
> >>> Thoughts?
> >>>
> >>
> >> If you instrment guest_enter/exit, you should cover all cases and all architectures as far
> >> as I can tell. FWIW, we did this rcu_note thing back then actually handling this particular
> >> case of long running guests blocking rcu for many seconds. And I am pretty sure that
> >> this did help back then.
> >
> > And my second patch on the email you replied to replaced the only call
> > to rcu_virt_note_context_switch(). So maybe it covers what it needs to,
> > but yes, there might well be things I missed. Let's see what David
> > comes up with.
> >
> > What changed was RCU's reactions to longish grace periods. It used to
> > be very aggressive about forcing the scheduler to do otherwise-unneeded
> > context switches, which became a problem somewhere between v4.9 and v4.15.
> > I therefore reduced the number of such context switches, which in turn
> > caused KVM to tell RCU about quiescent states way too infrequently.
>
> You talk about
> commit bcbfdd01dce5556a952fae84ef16fd0f12525e7b
> rcu: Make non-preemptive schedule be Tasks RCU quiescent state
>
> correct? In fact, then whatever (properly sent) patch comes up should contain
> a fixes tag.
Not that one, but this one is at least part of the "team":
28053bc72c0e5 ("rcu: Add long-term CPU kicking"). I might need to use
"git bisect" to find the most relevant commit... :-/
> > The advantage of the rcu_kvm_enter()/rcu_kvm_exit() approach is that
> > it tells RCU of an extended duration in the guest, which means that
> > RCU can ignore the corresponding CPU, which in turn allows the guest
> > to proceed without any RCU-induced interruptions.
> >
> > Does that make sense, or am I missing something? I freely admit to
> > much ignorance of both kvm and s390! ;-)
>
> WIth that explanation it makes perfect sense to replace
> rcu_virt_note_context_switch with rcu_kvm_enter/exit from an rcu performance
> perspective. I assume that rcu_kvm_enter is not much slower than
> rcu_virt_note_context_switch? Because we do call it on every guest entry/exit
> which we might have plenty for ping pong I/O workload.
But is there any way for a guest OS to sneak back out to the hypervisor
without executing one of the rcu_kvm_exit() calls? If there is, RCU
is broken.
Thanx, Paul

On 07/12/2018 01:37 AM, Paul E. McKenney wrote:
> On Wed, Jul 11, 2018 at 02:32:59PM -0700, Paul E. McKenney wrote:
>> On Wed, Jul 11, 2018 at 11:11:19PM +0200, Christian Borntraeger wrote:
>>>
>>>
>>> On 07/11/2018 10:27 PM, Paul E. McKenney wrote:
>>>> On Wed, Jul 11, 2018 at 08:39:36PM +0200, Christian Borntraeger wrote:
>>>>>
>>>>>
>>>>> On 07/11/2018 08:36 PM, Paul E. McKenney wrote:
>>>>>> On Wed, Jul 11, 2018 at 11:20:53AM -0700, Paul E. McKenney wrote:
>>>>>>> On Wed, Jul 11, 2018 at 07:01:01PM +0100, David Woodhouse wrote:
>>>>>>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>>>>>>
>>>>>>>> RCU can spend long periods of time waiting for a CPU which is actually in
>>>>>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
>>>>>>>> modes, and don't wait for it.
>>>>>>>>
>>>>>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>>>>>>
>>>>>>> And idiot here forgot about some of the debugging code in RCU's dyntick-idle
>>>>>>> code. I will reply with a fixed patch.
>>>>>>>
>>>>>>> The code below works just fine as long as you don't enable CONFIG_RCU_EQS_DEBUG,
>>>>>>> so should be OK for testing, just not for mainline.
>>>>>>
>>>>>> And here is the updated code that allegedly avoids splatting when run with
>>>>>> CONFIG_RCU_EQS_DEBUG.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Thanx, Paul
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>> commit 12cd59e49cf734f907f44b696e2c6e4b46a291c3
>>>>>> Author: David Woodhouse <dwmw@amazon.co.uk>
>>>>>> Date: Wed Jul 11 19:01:01 2018 +0100
>>>>>>
>>>>>> kvm/x86: Inform RCU of quiescent state when entering guest mode
>>>>>>
>>>>>> RCU can spend long periods of time waiting for a CPU which is actually in
>>>>>> KVM guest mode, entirely pointlessly. Treat it like the idle and userspace
>>>>>> modes, and don't wait for it.
>>>>>>
>>>>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>>>>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>>>>> [ paulmck: Adjust to avoid bad advice I gave to dwmw, avoid WARN_ON()s. ]
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index 0046aa70205a..b0c82f70afa7 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -7458,7 +7458,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>>>>> vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
>>>>>> }
>>>>>>
>>>>>> + rcu_kvm_enter();
>>>>>> kvm_x86_ops->run(vcpu);
>>>>>> + rcu_kvm_exit();
>>>>>
>>>>> As indicated in my other mail. This is supposed to be handled in the guest_enter|exit_ calls around
>>>>> the run function. This would also handle other architectures. So if the guest_enter_irqoff code is
>>>>> not good enough, we should rather fix that instead of adding another rcu hint.
>>>>
>>>> Something like this, on top of the earlier patch? I am not at all
>>>> confident of this patch because there might be other entry/exit
>>>> paths I am missing. Plus there might be RCU uses on the arch-specific
>>>> patch to and from the guest OS.
>>>>
>>>> Thoughts?
>>>>
>>>
>>> If you instrment guest_enter/exit, you should cover all cases and all architectures as far
>>> as I can tell. FWIW, we did this rcu_note thing back then actually handling this particular
>>> case of long running guests blocking rcu for many seconds. And I am pretty sure that
>>> this did help back then.
>>
>> And my second patch on the email you replied to replaced the only call
>> to rcu_virt_note_context_switch(). So maybe it covers what it needs to,
>> but yes, there might well be things I missed. Let's see what David
>> comes up with.
>>
>> What changed was RCU's reactions to longish grace periods. It used to
>> be very aggressive about forcing the scheduler to do otherwise-unneeded
>> context switches, which became a problem somewhere between v4.9 and v4.15.
>> I therefore reduced the number of such context switches, which in turn
>> caused KVM to tell RCU about quiescent states way too infrequently.
>>
>> The advantage of the rcu_kvm_enter()/rcu_kvm_exit() approach is that
>> it tells RCU of an extended duration in the guest, which means that
>> RCU can ignore the corresponding CPU, which in turn allows the guest
>> to proceed without any RCU-induced interruptions.
>>
>> Does that make sense, or am I missing something? I freely admit to
>> much ignorance of both kvm and s390! ;-)
>
> But I am getting some rcutorture near misses on the commit that
> introduces rcu_kvm_enter() and rcu_kvm_exit() to the x86 arch-specific
> vcpu_enter_guest() function. These near misses occur when running
> rcutorture scenarios TREE01 and TREE03, and in my -rcu tree rather
> than the v4.15 version of this patch.
>
> Given that I am making pervasive changes to the way that RCU works,
> it might well be that this commit is an innocent bystander. I will
> run tests overnight and let you know what comes up.
Is there a single patch that that I can test or do I have to combine all the pieces
that are sprinkled in this mail thread?

[-- Attachment #1: Type: text/plain, Size: 1326 bytes --]
On Wed, 2018-07-11 at 16:47 -0700, Paul E. McKenney wrote:
>
> > > What changed was RCU's reactions to longish grace periods. It used to
> > > be very aggressive about forcing the scheduler to do otherwise-unneeded
> > > context switches, which became a problem somewhere between v4.9 and v4.15.
> > > I therefore reduced the number of such context switches, which in turn
> > > caused KVM to tell RCU about quiescent states way too infrequently.
> >
> > You talk about
> > commit bcbfdd01dce5556a952fae84ef16fd0f12525e7b
> > rcu: Make non-preemptive schedule be Tasks RCU quiescent state
> >
> > correct? In fact, then whatever (properly sent) patch comes up should contain
> > a fixes tag.
>
> Not that one, but this one is at least part of the "team":
>
> 28053bc72c0e5 ("rcu: Add long-term CPU kicking"). I might need to use
> "git bisect" to find the most relevant commit... :-/
Whichever commit we blame, I suspect the Fixes: tag wants to go on
Paul's earlier 'Make need_resched() respond to urgent RCU-QS needs'
patch.
The rcu_kvm_enter() thing we're talking about now is actually
addressing a problem which has existed for much longer — that with
NO_HZ_FULL, a CPU might end up in guest mode for an indefinite period
of time, with RCU waiting for it to return.
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On 07/12/2018 10:31 AM, David Woodhouse wrote:
>
>
> On Wed, 2018-07-11 at 16:47 -0700, Paul E. McKenney wrote:
>>
>>>> What changed was RCU's reactions to longish grace periods. It used to
>>>> be very aggressive about forcing the scheduler to do otherwise-unneeded
>>>> context switches, which became a problem somewhere between v4.9 and v4.15.
>>>> I therefore reduced the number of such context switches, which in turn
>>>> caused KVM to tell RCU about quiescent states way too infrequently.
>>>
>>> You talk about
>>> commit bcbfdd01dce5556a952fae84ef16fd0f12525e7b
>>> rcu: Make non-preemptive schedule be Tasks RCU quiescent state
>>>
>>> correct? In fact, then whatever (properly sent) patch comes up should contain
>>> a fixes tag.
>>
>> Not that one, but this one is at least part of the "team":
>>
>> 28053bc72c0e5 ("rcu: Add long-term CPU kicking"). I might need to use
>> "git bisect" to find the most relevant commit... :-/
>
> Whichever commit we blame, I suspect the Fixes: tag wants to go on
> Paul's earlier 'Make need_resched() respond to urgent RCU-QS needs'
> patch.
>
> The rcu_kvm_enter() thing we're talking about now is actually
> addressing a problem which has existed for much longer — that with
> NO_HZ_FULL, a CPU might end up in guest mode for an indefinite period
> of time, with RCU waiting for it to return.
On s390 this seems to add about 10ns (~3%) for a guest exit/rentry
microbenchmark mostly due to rcu_eqs_enter and rcu_eqs_exit now being
visible in perf samples. The older interface was cheaper.

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]
On Thu, 2018-07-12 at 13:00 +0200, Christian Borntraeger wrote:
>
> On s390 this seems to add about 10ns (~3%) for a guest exit/rentry
> microbenchmark mostly due to rcu_eqs_enter and rcu_eqs_exit now being
> visible in perf samples. The older interface was cheaper.
Well, the older interface wasn't actually working, which made it
moderately suboptimal :)
But that is fixed by the first patch of the two, so perhaps the second
ought to be conditional on CONFIG_NO_HZ_FULL since it's only really
fixing a real bug there?
Unless we can contrive some way to do the rcu_eqs_enter/exit only on
the paths to/from userspace, and *not* when we loop in the kernel to
handle interrupts (etc.) and immediately re-enter the guest without
returning? That's somewhat non-trivial though...
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On 07/12/2018 01:10 PM, David Woodhouse wrote:
> On Thu, 2018-07-12 at 13:00 +0200, Christian Borntraeger wrote:
>>
>> On s390 this seems to add about 10ns (~3%) for a guest exit/rentry
>> microbenchmark mostly due to rcu_eqs_enter and rcu_eqs_exit now being
>> visible in perf samples. The older interface was cheaper.
>
> Well, the older interface wasn't actually working, which made it
> moderately suboptimal :)
>
> But that is fixed by the first patch of the two, so perhaps the second
> ought to be conditional on CONFIG_NO_HZ_FULL since it's only really
> fixing a real bug there?
>
> Unless we can contrive some way to do the rcu_eqs_enter/exit only on
> the paths to/from userspace, and *not* when we loop in the kernel to
> handle interrupts (etc.) and immediately re-enter the guest without
> returning? That's somewhat non-trivial though...
Assuming that patch1 really fixes the issue (and all KVM hot loops are
supposed to have something like
if (need_resched())
schedule();
this could work out given that patch1 does what it is supposed to do.
So you are thinking of something like
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index d8ee9f949ce3..cac5de8a887c 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -72,8 +72,17 @@ void cond_synchronize_sched(unsigned long oldstate);
void rcu_idle_enter(void);
void rcu_idle_exit(void);
+#ifdef CONFIG_NO_HZ_FULL
void rcu_kvm_enter(void);
void rcu_kvm_exit(void);
+#else
+static inline void rcu_kvm_enter(void)
+{
+}
+static inline void rcu_kvm_exit(void)
+{
+}
+#endif
void rcu_irq_enter(void);
void rcu_irq_exit(void);
void rcu_irq_enter_irqson(void);
?

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-12T12:00:52Zurn:uuid:25855c75-ef74-2324-7c87-7575ddf561c5

[-- Attachment #1: Type: text/plain, Size: 1593 bytes --]
On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
>
> > Also... why in $DEITY's name was the existing
> > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > there, why did we need an additional explicit calls to rcu_all_qs() in
> > the KVM loop, or the more complex fixes to need_resched() which
> > ultimately had the same effect, to avoid ten-second latencies?
>
> My guess is that this was because control passed through the
> rcu_virt_note_context_switch() only once, and then subsequent
> scheduling-clock interrupts bypassed this code. But that is just a guess.
> I need to defer to someone who understands the KVM code better than I do.
I think it's more likely that we just never happened at all. It's
conditional. From the latest patch iteration (see it being removed):
@@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
* one time slice). Lets treat guest mode as quiescent state, just like
* we do with user-mode execution.
*/
- if (!context_tracking_cpu_is_enabled())
- rcu_virt_note_context_switch(smp_processor_id());
+ rcu_kvm_enter();
}
Given the vmexit overhead, I don't think we can do the currently-
proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's
really necessary. I'll make that conditional, but probably on the RCU
side.
Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and
rcu_kvm_enter() can do rcu_virt_note_context_switch().
OK?
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote:
>
>
> On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
> >
> > > Also... why in $DEITY's name was the existing
> > > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > > there, why did we need an additional explicit calls to rcu_all_qs() in
> > > the KVM loop, or the more complex fixes to need_resched() which
> > > ultimately had the same effect, to avoid ten-second latencies?
> >
> > My guess is that this was because control passed through the
> > rcu_virt_note_context_switch() only once, and then subsequent
> > scheduling-clock interrupts bypassed this code.
Gah! My guess was instead that the code did a rcu_kvm_enter() going in,
but somehow managed to miss the rcu_kvm_exit() going out. But that makes
absolutely no sense -- had that happened, rcutorture would likely have
screamed bloody murder, loudly and often. No mere near misses!
And besides, thus far, -ENOREPRODUCE. :-/
Which indicates that I have an opportunity to improve rcutorture and
that this patch was with high probability an innocent bystander.
> > But that is just a guess.
> > I need to defer to someone who understands the KVM code better than I do.
>
> I think it's more likely that we just never happened at all. It's
> conditional. From the latest patch iteration (see it being removed):
>
> @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
> * one time slice). Lets treat guest mode as quiescent state, just like
> * we do with user-mode execution.
> */
> - if (!context_tracking_cpu_is_enabled())
> - rcu_virt_note_context_switch(smp_processor_id());
> + rcu_kvm_enter();
> }
>
>
> Given the vmexit overhead, I don't think we can do the currently-
> proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's
> really necessary. I'll make that conditional, but probably on the RCU
> side.
>
> Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and
> rcu_kvm_enter() can do rcu_virt_note_context_switch().
>
> OK?
Makes sense to me! And a big "thank you!" to Christian for testing
and analyzing this in a timely fashion!!!
Thanx, Paul

On Thu, Jul 12, 2018 at 05:53:51AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote:
> >
> >
> > On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
> > >
> > > > Also... why in $DEITY's name was the existing
> > > > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > > > there, why did we need an additional explicit calls to rcu_all_qs() in
> > > > the KVM loop, or the more complex fixes to need_resched() which
> > > > ultimately had the same effect, to avoid ten-second latencies?
> > >
> > > My guess is that this was because control passed through the
> > > rcu_virt_note_context_switch() only once, and then subsequent
> > > scheduling-clock interrupts bypassed this code.
>
> Gah! My guess was instead that the code did a rcu_kvm_enter() going in,
> but somehow managed to miss the rcu_kvm_exit() going out. But that makes
> absolutely no sense -- had that happened, rcutorture would likely have
> screamed bloody murder, loudly and often. No mere near misses!
>
> And besides, thus far, -ENOREPRODUCE. :-/
OK, one close call in 63 hours of rcutorture, this one on scenario TREE03
(yesterday hit TREE01 and TREE03). Time for probabilitistic long-runtime
bisection. Plus thought about how to get more information out of the near
misses. Fun! ;-)
Thanx, Paul
> Which indicates that I have an opportunity to improve rcutorture and
> that this patch was with high probability an innocent bystander.
>
> > > But that is just a guess.
> > > I need to defer to someone who understands the KVM code better than I do.
> >
> > I think it's more likely that we just never happened at all. It's
> > conditional. From the latest patch iteration (see it being removed):
> >
> > @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
> > * one time slice). Lets treat guest mode as quiescent state, just like
> > * we do with user-mode execution.
> > */
> > - if (!context_tracking_cpu_is_enabled())
> > - rcu_virt_note_context_switch(smp_processor_id());
> > + rcu_kvm_enter();
> > }
> >
> >
> > Given the vmexit overhead, I don't think we can do the currently-
> > proposed rcu_kvm_enter() thing except for CONFIG_NO_HZ_FULL where it's
> > really necessary. I'll make that conditional, but probably on the RCU
> > side.
> >
> > Without CONFIG_NO_HZ_FULL, rcu_kvm_exit() can do nothing, and
> > rcu_kvm_enter() can do rcu_virt_note_context_switch().
> >
> > OK?
>
> Makes sense to me! And a big "thank you!" to Christian for testing
> and analyzing this in a timely fashion!!!
>
> Thanx, Paul

On Thu, Jul 12, 2018 at 09:17:04AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 12, 2018 at 05:53:51AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 12, 2018 at 01:00:42PM +0100, David Woodhouse wrote:
> > >
> > >
> > > On Wed, 2018-07-11 at 14:08 -0700, Paul E. McKenney wrote:
> > > >
> > > > > Also... why in $DEITY's name was the existing
> > > > > rcu_virt_note_context_switch() not actually sufficient? If we had that
> > > > > there, why did we need an additional explicit calls to rcu_all_qs() in
> > > > > the KVM loop, or the more complex fixes to need_resched() which
> > > > > ultimately had the same effect, to avoid ten-second latencies?
> > > >
> > > > My guess is that this was because control passed through the
> > > > rcu_virt_note_context_switch() only once, and then subsequent
> > > > scheduling-clock interrupts bypassed this code.
> >
> > Gah! My guess was instead that the code did a rcu_kvm_enter() going in,
> > but somehow managed to miss the rcu_kvm_exit() going out. But that makes
> > absolutely no sense -- had that happened, rcutorture would likely have
> > screamed bloody murder, loudly and often. No mere near misses!
> >
> > And besides, thus far, -ENOREPRODUCE. :-/
>
> OK, one close call in 63 hours of rcutorture, this one on scenario TREE03
> (yesterday hit TREE01 and TREE03). Time for probabilitistic long-runtime
> bisection. Plus thought about how to get more information out of the near
> misses. Fun! ;-)
Most of the weekend was devoted to testing today's upcoming pull request,
but I did get a bit more testing done on this.
I was able to make this happen more often by tweaking rcutorture a
bit, but I still do not yet have statistically significant results.
Nevertheless, I have thus far only seen failures with David's patch or
with both David's and my patch. And I actually got a full-up rcutorture
failure (a too-short grace period) in addition to the aforementioned
close calls.
Over this coming week I expect to devote significant testing time to
the commit just prior to David's in my stack. If I don't see failures
on that commit, we will need to spent some quality time with the KVM
folks on whether or not kvm_x86_ops->run() and friends have the option of
failing to return, but instead causing control to pop up somewhere else.
Or someone could tell me how I am being blind to some obvious bug in
the two commits that allow RCU to treat KVM guest-OS execution as an
extended quiescent state. ;-)
Thanx, Paul

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-17T08:19:27Zurn:uuid:c0dba689-891c-add5-a307-38b81b55a74e

[-- Attachment #1: Type: text/plain, Size: 1745 bytes --]
On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> Most of the weekend was devoted to testing today's upcoming pull request,
> but I did get a bit more testing done on this.
>
> I was able to make this happen more often by tweaking rcutorture a
> bit, but I still do not yet have statistically significant results.
> Nevertheless, I have thus far only seen failures with David's patch or
> with both David's and my patch. And I actually got a full-up rcutorture
> failure (a too-short grace period) in addition to the aforementioned
> close calls.
>
> Over this coming week I expect to devote significant testing time to
> the commit just prior to David's in my stack. If I don't see failures
> on that commit, we will need to spent some quality time with the KVM
> folks on whether or not kvm_x86_ops->run() and friends have the option of
> failing to return, but instead causing control to pop up somewhere else.
> Or someone could tell me how I am being blind to some obvious bug in
> the two commits that allow RCU to treat KVM guest-OS execution as an
> extended quiescent state. ;-)
One thing we can try, if my patch is implicated, is moving the calls to
rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
for example. If that fixes it, then we know we've missed something else
interesting that's happening in the middle.
Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
with this patch, so in the next iteration it definitely needs to be
ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
(AFAICT) and it's too expensive otherwise as Christian pointed out.
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Tue, Jul 17, 2018 at 10:19:08AM +0200, David Woodhouse wrote:
> On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> > Most of the weekend was devoted to testing today's upcoming pull request,
> > but I did get a bit more testing done on this.
> >
> > I was able to make this happen more often by tweaking rcutorture a
> > bit, but I still do not yet have statistically significant results.
> > Nevertheless, I have thus far only seen failures with David's patch or
> > with both David's and my patch. And I actually got a full-up rcutorture
> > failure (a too-short grace period) in addition to the aforementioned
> > close calls.
> >
> > Over this coming week I expect to devote significant testing time to
> > the commit just prior to David's in my stack. If I don't see failures
> > on that commit, we will need to spent some quality time with the KVM
> > folks on whether or not kvm_x86_ops->run() and friends have the option of
> > failing to return, but instead causing control to pop up somewhere else.
> > Or someone could tell me how I am being blind to some obvious bug in
> > the two commits that allow RCU to treat KVM guest-OS execution as an
> > extended quiescent state. ;-)
>
> One thing we can try, if my patch is implicated, is moving the calls to
> rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
> them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
> for example. If that fixes it, then we know we've missed something else
> interesting that's happening in the middle.
I don't have enough data to say anything with too much certainty, but
my patch has rcu_kvm_en{ter,xit}() quite a bit farther apart than yours
does, and I am not seeing massive increases in error rate in my patch
compared to yours. Which again might or might not mean anything.
Plus I haven't proven that your patch isn't an innocent bystander yet.
If it isn't just an innocent bystander, that will take most of this
week do demonstrate given current failure rates.
I am also working on improving rcutorture diagnostics which should help
me work out how to change rcutorture so as to find this more quickly.
> Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
> with this patch, so in the next iteration it definitely needs to be
> ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
> (AFAICT) and it's too expensive otherwise as Christian pointed out.
Makes sense!
Thanx, Paul

On Tue, Jul 17, 2018 at 05:56:53AM -0700, Paul E. McKenney wrote:
> On Tue, Jul 17, 2018 at 10:19:08AM +0200, David Woodhouse wrote:
> > On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> > > Most of the weekend was devoted to testing today's upcoming pull request,
> > > but I did get a bit more testing done on this.
> > >
> > > I was able to make this happen more often by tweaking rcutorture a
> > > bit, but I still do not yet have statistically significant results.
> > > Nevertheless, I have thus far only seen failures with David's patch or
> > > with both David's and my patch. And I actually got a full-up rcutorture
> > > failure (a too-short grace period) in addition to the aforementioned
> > > close calls.
> > >
> > > Over this coming week I expect to devote significant testing time to
> > > the commit just prior to David's in my stack. If I don't see failures
> > > on that commit, we will need to spent some quality time with the KVM
> > > folks on whether or not kvm_x86_ops->run() and friends have the option of
> > > failing to return, but instead causing control to pop up somewhere else.
> > > Or someone could tell me how I am being blind to some obvious bug in
> > > the two commits that allow RCU to treat KVM guest-OS execution as an
> > > extended quiescent state. ;-)
> >
> > One thing we can try, if my patch is implicated, is moving the calls to
> > rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
> > them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
> > for example. If that fixes it, then we know we've missed something else
> > interesting that's happening in the middle.
>
> I don't have enough data to say anything with too much certainty, but
> my patch has rcu_kvm_en{ter,xit}() quite a bit farther apart than yours
> does, and I am not seeing massive increases in error rate in my patch
> compared to yours. Which again might or might not mean anything.
>
> Plus I haven't proven that your patch isn't an innocent bystander yet.
> If it isn't just an innocent bystander, that will take most of this
> week do demonstrate given current failure rates.
And I finally did get some near misses from an earlier commit, so we
should consider your patch to be officially off the hook.
Thanx, Paul
> I am also working on improving rcutorture diagnostics which should help
> me work out how to change rcutorture so as to find this more quickly.
>
> > Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
> > with this patch, so in the next iteration it definitely needs to be
> > ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
> > (AFAICT) and it's too expensive otherwise as Christian pointed out.
>
> Makes sense!
>
> Thanx, Paul

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-18T16:02:02Zurn:uuid:1a41758f-df48-4c9b-defe-c2ff2cd92d8c

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-18T19:44:11Zurn:uuid:92fc7519-17fc-23a6-40da-8cc772eebcf9

[-- Attachment #1: Type: text/plain, Size: 2177 bytes --]
On Wed, 2018-07-18 at 09:37 -0700, Paul E. McKenney wrote:
> On Wed, Jul 18, 2018 at 06:01:51PM +0200, David Woodhouse wrote:
> >
> > On Wed, 2018-07-18 at 08:36 -0700, Paul E. McKenney wrote:
> > >
> > > And I finally did get some near misses from an earlier commit, so we
> > > should consider your patch to be officially off the hook.
> >
> > Yay, I like it when it's not my fault. I'll redo it with the ifdef
> > CONFIG_NO_HZ_FULL.
>
> Hey, I didn't say it wasn't your fault, only that it -officially- wasn't
> your fault. ;-)
I can live with being innocent until proven guilty.
> >
> > What should it do for the !CONFIG_NO_HZ_FULL case? The existing call in
> > guest_enter_irqoff() clearly wasn't actually doing the right thing
> > anyway, hence the need for the need_resched() patch in $SUBJECT... so
> > should I just leave it doing nothing in guest_enter_irqoff()?
>
> One starting point would be the combination of your patch and my
> patch, with -rcu commit IDs and diff below. But yes, it needs to be
> !CONFIG_NO_HZ_FULL. And no, I am not at all confident that I actually
> found all the places needing change in the core code, so this needs some
> serious review both by the KVM guys and the NO_HZ_FULL guys.
Right, that looks fairly much like the version I'd ended up with. So my
question was...
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
> * one time slice). Lets treat guest mode as quiescent state, just like
> * we do with user-mode execution.
> */
...if we change this to something like...
#ifdef CONFIG_NO_HZ_FULL
> + rcu_kvm_enter();
#else
> if (!context_tracking_cpu_is_enabled())
> rcu_virt_note_context_switch(smp_processor_id());
#endif
... do you actually want me to keep the #else case there? It blatantly
wasn't working anyway for us, perhaps because the condition was false?
That's why I started fixing need_resched() in the first place, and that
fix ought to cover whatever this call to rcu_virt_note_context_switch()
was supposed to be doing?
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Wed, Jul 18, 2018 at 09:41:05PM +0200, David Woodhouse wrote:
>
>
> On Wed, 2018-07-18 at 09:37 -0700, Paul E. McKenney wrote:
> > On Wed, Jul 18, 2018 at 06:01:51PM +0200, David Woodhouse wrote:
> > >
> > > On Wed, 2018-07-18 at 08:36 -0700, Paul E. McKenney wrote:
> > > >
> > > > And I finally did get some near misses from an earlier commit, so we
> > > > should consider your patch to be officially off the hook.
> > >
> > > Yay, I like it when it's not my fault. I'll redo it with the ifdef
> > > CONFIG_NO_HZ_FULL.
> >
> > Hey, I didn't say it wasn't your fault, only that it -officially- wasn't
> > your fault. ;-)
>
> I can live with being innocent until proven guilty.
>
> > >
> > > What should it do for the !CONFIG_NO_HZ_FULL case? The existing call in
> > > guest_enter_irqoff() clearly wasn't actually doing the right thing
> > > anyway, hence the need for the need_resched() patch in $SUBJECT... so
> > > should I just leave it doing nothing in guest_enter_irqoff()?
> >
> > One starting point would be the combination of your patch and my
> > patch, with -rcu commit IDs and diff below. But yes, it needs to be
> > !CONFIG_NO_HZ_FULL. And no, I am not at all confident that I actually
> > found all the places needing change in the core code, so this needs some
> > serious review both by the KVM guys and the NO_HZ_FULL guys.
>
> Right, that looks fairly much like the version I'd ended up with. So my
> question was...
>
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
> > * one time slice). Lets treat guest mode as quiescent state, just like
> > * we do with user-mode execution.
> > */
>
> ...if we change this to something like...
>
> #ifdef CONFIG_NO_HZ_FULL
> > + rcu_kvm_enter();
> #else
> > if (!context_tracking_cpu_is_enabled())
> > rcu_virt_note_context_switch(smp_processor_id());
> #endif
>
> ... do you actually want me to keep the #else case there? It blatantly
> wasn't working anyway for us, perhaps because the condition was false?
> That's why I started fixing need_resched() in the first place, and that
> fix ought to cover whatever this call to rcu_virt_note_context_switch()
> was supposed to be doing?
My thought would be something like this:
if (context_tracking_cpu_is_enabled())
rcu_kvm_enter();
else
rcu_virt_note_context_switch(smp_processor_id());
The reason I believe that this is the right approach is that even when
you have CONFIG_NO_HZ_FULL=y, some CPUs will still be nohz_full=n CPUs,
so you don't want to take the extra overhead on those CPUs.
But I could easily be confused here, so I am adding Frederic for his
thoughts.
Thanx, Paul

On Wed, Jul 18, 2018 at 01:17:00PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 18, 2018 at 09:41:05PM +0200, David Woodhouse wrote:
> >
> >
> > On Wed, 2018-07-18 at 09:37 -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 18, 2018 at 06:01:51PM +0200, David Woodhouse wrote:
> > > >
> > > > On Wed, 2018-07-18 at 08:36 -0700, Paul E. McKenney wrote:
> > > > >
> > > > > And I finally did get some near misses from an earlier commit, so we
> > > > > should consider your patch to be officially off the hook.
> > > >
> > > > Yay, I like it when it's not my fault. I'll redo it with the ifdef
> > > > CONFIG_NO_HZ_FULL.
> > >
> > > Hey, I didn't say it wasn't your fault, only that it -officially- wasn't
> > > your fault. ;-)
> >
> > I can live with being innocent until proven guilty.
> >
> > > >
> > > > What should it do for the !CONFIG_NO_HZ_FULL case? The existing call in
> > > > guest_enter_irqoff() clearly wasn't actually doing the right thing
> > > > anyway, hence the need for the need_resched() patch in $SUBJECT... so
> > > > should I just leave it doing nothing in guest_enter_irqoff()?
> > >
> > > One starting point would be the combination of your patch and my
> > > patch, with -rcu commit IDs and diff below. But yes, it needs to be
> > > !CONFIG_NO_HZ_FULL. And no, I am not at all confident that I actually
> > > found all the places needing change in the core code, so this needs some
> > > serious review both by the KVM guys and the NO_HZ_FULL guys.
> >
> > Right, that looks fairly much like the version I'd ended up with. So my
> > question was...
> >
> > > --- a/include/linux/context_tracking.h
> > > +++ b/include/linux/context_tracking.h
> > > @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
> > > * one time slice). Lets treat guest mode as quiescent state, just like
> > > * we do with user-mode execution.
> > > */
> >
> > ...if we change this to something like...
> >
> > #ifdef CONFIG_NO_HZ_FULL
> > > + rcu_kvm_enter();
> > #else
> > > if (!context_tracking_cpu_is_enabled())
> > > rcu_virt_note_context_switch(smp_processor_id());
> > #endif
> >
> > ... do you actually want me to keep the #else case there? It blatantly
> > wasn't working anyway for us, perhaps because the condition was false?
> > That's why I started fixing need_resched() in the first place, and that
> > fix ought to cover whatever this call to rcu_virt_note_context_switch()
> > was supposed to be doing?
>
> My thought would be something like this:
>
> if (context_tracking_cpu_is_enabled())
> rcu_kvm_enter();
> else
> rcu_virt_note_context_switch(smp_processor_id());
>
> The reason I believe that this is the right approach is that even when
> you have CONFIG_NO_HZ_FULL=y, some CPUs will still be nohz_full=n CPUs,
> so you don't want to take the extra overhead on those CPUs.
>
> But I could easily be confused here, so I am adding Frederic for his
> thoughts.
Hmm, actually rcu_user_enter()/rcu_user_exit() are already called upon guest
entry/exit :-)
Now I must confess the code leading there in __context_tracking_enter/exit is
not very obvious but that part applies to both CONTEXT_USER and CONTEXT_GUEST cases.
I should probably add a few comments to clarify.

On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > fix to my lost exclamation point.
>
> Thanks. Are you sending the original version of that to Linus? It'd be
> useful to have the commit ID so that we can watch for it landing, and
> chase this one up to Greg.
>
> As discussed on IRC, this patch reduces synchronize_sched() latency for
> us from ~4600s to ~160ms, which is nice.
>
> However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> that you want a patch like the one below, which happily reduces the
> latency in our (!NO_HZ_FULL) case still further to ~40ms.
That is interesting. As I replied to Paul, we are already calling
rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
you're seeing such an optimization by repeating those calls.
Perhaps the rcu_user_* somehow aren't actually called from
__context_tracking_enter()...? Some bug in context tracking?
Otherwise it's a curious side effect.
Thanks.

On Thu, Jul 19, 2018 at 02:32:06AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > > fix to my lost exclamation point.
> >
> > Thanks. Are you sending the original version of that to Linus? It'd be
> > useful to have the commit ID so that we can watch for it landing, and
> > chase this one up to Greg.
> >
> > As discussed on IRC, this patch reduces synchronize_sched() latency for
> > us from ~4600s to ~160ms, which is nice.
> >
> > However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> > that you want a patch like the one below, which happily reduces the
> > latency in our (!NO_HZ_FULL) case still further to ~40ms.
>
> That is interesting. As I replied to Paul, we are already calling
> rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
> you're seeing such an optimization by repeating those calls.
>
> Perhaps the rcu_user_* somehow aren't actually called from
> __context_tracking_enter()...? Some bug in context tracking?
> Otherwise it's a curious side effect.
David is working with v4.15. Is this maybe something that has changed
since then?
Thanx, Paul

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-19T07:36:14Zurn:uuid:c5190a9d-9f0d-0950-cbdc-8e82fc8a80ae

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]
On Wed, 2018-07-18 at 20:11 -0700, Paul E. McKenney wrote:
>
> > That is interesting. As I replied to Paul, we are already calling
> > rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
> > you're seeing such an optimization by repeating those calls.
> >
> > Perhaps the rcu_user_* somehow aren't actually called from
> > __context_tracking_enter()...? Some bug in context tracking?
> > Otherwise it's a curious side effect.
>
> David is working with v4.15. Is this maybe something that has changed
> since then?
To clarify: in 4.15 without CONFIG_PREEMPT and without NO_HZ_FULL I was
seeing RCU stalls because a thread in vcpu_run() was *never* seen to go
through a quiescent state. Hence the change to need_resched() in the
first patch in this thread, which fixed the problem at hand and seemed
to address the general case.
It then seemed by *inspection* that the NO_HZ_FULL case was probably
broken, because we'd failed to spot the rcu_user_* calls. But
rcu_user_enter() does nothing in the !NO_HZ_FULL case, so wouldn't have
helped in the testing that we were doing anyway.
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On 07/18/2018 10:17 PM, Paul E. McKenney wrote:
> On Wed, Jul 18, 2018 at 09:41:05PM +0200, David Woodhouse wrote:
>>
>>
>> On Wed, 2018-07-18 at 09:37 -0700, Paul E. McKenney wrote:
>>> On Wed, Jul 18, 2018 at 06:01:51PM +0200, David Woodhouse wrote:
>>>>
>>>> On Wed, 2018-07-18 at 08:36 -0700, Paul E. McKenney wrote:
>>>>>
>>>>> And I finally did get some near misses from an earlier commit, so we
>>>>> should consider your patch to be officially off the hook.
>>>>
>>>> Yay, I like it when it's not my fault. I'll redo it with the ifdef
>>>> CONFIG_NO_HZ_FULL.
>>>
>>> Hey, I didn't say it wasn't your fault, only that it -officially- wasn't
>>> your fault. ;-)
>>
>> I can live with being innocent until proven guilty.
>>
>>>>
>>>> What should it do for the !CONFIG_NO_HZ_FULL case? The existing call in
>>>> guest_enter_irqoff() clearly wasn't actually doing the right thing
>>>> anyway, hence the need for the need_resched() patch in $SUBJECT... so
>>>> should I just leave it doing nothing in guest_enter_irqoff()?
>>>
>>> One starting point would be the combination of your patch and my
>>> patch, with -rcu commit IDs and diff below. But yes, it needs to be
>>> !CONFIG_NO_HZ_FULL. And no, I am not at all confident that I actually
>>> found all the places needing change in the core code, so this needs some
>>> serious review both by the KVM guys and the NO_HZ_FULL guys.
>>
>> Right, that looks fairly much like the version I'd ended up with. So my
>> question was...
>>
>>> --- a/include/linux/context_tracking.h
>>> +++ b/include/linux/context_tracking.h
>>> @@ -118,12 +118,12 @@ static inline void guest_enter_irqoff(void)
>>> * one time slice). Lets treat guest mode as quiescent state, just like
>>> * we do with user-mode execution.
>>> */
>>
>> ...if we change this to something like...
>>
>> #ifdef CONFIG_NO_HZ_FULL
>>> + rcu_kvm_enter();
>> #else
>>> if (!context_tracking_cpu_is_enabled())
>>> rcu_virt_note_context_switch(smp_processor_id());
>> #endif
>>
>> ... do you actually want me to keep the #else case there? It blatantly
>> wasn't working anyway for us, perhaps because the condition was false?
>> That's why I started fixing need_resched() in the first place, and that
>> fix ought to cover whatever this call to rcu_virt_note_context_switch()
>> was supposed to be doing?
>
> My thought would be something like this:
>
> if (context_tracking_cpu_is_enabled())
> rcu_kvm_enter();
> else
> rcu_virt_note_context_switch(smp_processor_id());
In the past we needed that (when we introduced that). At least with every
host interrupt we called this making an rcu event at least every HZ.
Will the changes in need_resched make this part unnecessary?

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-19T07:36:17Zurn:uuid:9b2f7893-cf8e-449f-a83a-6e91d6a97558

[-- Attachment #1: Type: text/plain, Size: 1163 bytes --]
On Thu, 2018-07-19 at 08:45 +0200, Christian Borntraeger wrote:
>
> > My thought would be something like this:
> >
> > if (context_tracking_cpu_is_enabled())
> > rcu_kvm_enter();
> > else
> > rcu_virt_note_context_switch(smp_processor_id());
>
> In the past we needed that (when we introduced that). At least with every
> host interrupt we called this making an rcu event at least every HZ.
> Will the changes in need_resched make this part unnecessary?
Yes, the change in need_resched() should make this part unnecessary.
Unless your architecture's version of the vcpu_run() loop just loops
forever even when TIF_NEED_RESCHED is set? :)
I'm not sure about the context tracking condition in the code snippet
cited above, though. I think that's what caused my problem in the first
place — I have CONTEXT_TRACKING_FORCE && !NO_HZ_FULL. So in 4.15, that
means rcu_user_enter() did nothing and rcu_virt_note_context_switch()
wasn't called. Hence the observed stalls.
Should rcu_user_enter() itself be conditional on CONTEXT_TRACKING not
NO_HZ_FULL?
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On 07/19/2018 09:20 AM, David Woodhouse wrote:
> On Thu, 2018-07-19 at 08:45 +0200, Christian Borntraeger wrote:
>>
>>> My thought would be something like this:
>>>
>>> if (context_tracking_cpu_is_enabled())
>>> rcu_kvm_enter();
>>> else
>>> rcu_virt_note_context_switch(smp_processor_id());
>>
>> In the past we needed that (when we introduced that). At least with every
>> host interrupt we called this making an rcu event at least every HZ.
>> Will the changes in need_resched make this part unnecessary?
>
> Yes, the change in need_resched() should make this part unnecessary.
> Unless your architecture's version of the vcpu_run() loop just loops
> forever even when TIF_NEED_RESCHED is set? :)
Very early versions did that. The SIE instruction is interruptible
so you can continue to run the guest by simply returning from an host
interrupt. All sane versions of KVM on s390 now make sure to make a
short trip into C after a host interrupt. There we check for
need_resched signals and machine checks so we are good.
>
> I'm not sure about the context tracking condition in the code snippet
> cited above, though. I think that's what caused my problem in the first
> place — I have CONTEXT_TRACKING_FORCE && !NO_HZ_FULL. So in 4.15, that
> means rcu_user_enter() did nothing and rcu_virt_note_context_switch()
> wasn't called. Hence the observed stalls.
>
> Should rcu_user_enter() itself be conditional on CONTEXT_TRACKING not
> NO_HZ_FULL?
>

On Thu, Jul 19, 2018 at 12:23:34PM +0200, Christian Borntraeger wrote:
>
>
> On 07/19/2018 09:20 AM, David Woodhouse wrote:
> > On Thu, 2018-07-19 at 08:45 +0200, Christian Borntraeger wrote:
> >>
> >>> My thought would be something like this:
> >>>
> >>> if (context_tracking_cpu_is_enabled())
> >>> rcu_kvm_enter();
> >>> else
> >>> rcu_virt_note_context_switch(smp_processor_id());
> >>
> >> In the past we needed that (when we introduced that). At least with every
> >> host interrupt we called this making an rcu event at least every HZ.
> >> Will the changes in need_resched make this part unnecessary?
> >
> > Yes, the change in need_resched() should make this part unnecessary.
> > Unless your architecture's version of the vcpu_run() loop just loops
> > forever even when TIF_NEED_RESCHED is set? :)
>
> Very early versions did that. The SIE instruction is interruptible
> so you can continue to run the guest by simply returning from an host
> interrupt. All sane versions of KVM on s390 now make sure to make a
> short trip into C after a host interrupt. There we check for
> need_resched signals and machine checks so we are good.
OK, thank you all! I will drop the two patches that add the
rcu_kvm_enter() and rcu_kvm_exit() calls. Two less things to
worry about! ;-)
Thanx, Paul
> > I'm not sure about the context tracking condition in the code snippet
> > cited above, though. I think that's what caused my problem in the first
> > place — I have CONTEXT_TRACKING_FORCE && !NO_HZ_FULL. So in 4.15, that
> > means rcu_user_enter() did nothing and rcu_virt_note_context_switch()
> > wasn't called. Hence the observed stalls.
> >
> > Should rcu_user_enter() itself be conditional on CONTEXT_TRACKING not
> > NO_HZ_FULL?
> >

On Thu, Jul 19, 2018 at 09:20:33AM +0200, David Woodhouse wrote:
> On Thu, 2018-07-19 at 08:45 +0200, Christian Borntraeger wrote:
> >
> > > My thought would be something like this:
> > >
> > > if (context_tracking_cpu_is_enabled())
> > > rcu_kvm_enter();
> > > else
> > > rcu_virt_note_context_switch(smp_processor_id());
> >
> > In the past we needed that (when we introduced that). At least with every
> > host interrupt we called this making an rcu event at least every HZ.
> > Will the changes in need_resched make this part unnecessary?
>
> Yes, the change in need_resched() should make this part unnecessary.
> Unless your architecture's version of the vcpu_run() loop just loops
> forever even when TIF_NEED_RESCHED is set? :)
>
> I'm not sure about the context tracking condition in the code snippet
> cited above, though. I think that's what caused my problem in the first
> place — I have CONTEXT_TRACKING_FORCE && !NO_HZ_FULL. So in 4.15, that
> means rcu_user_enter() did nothing and rcu_virt_note_context_switch()
> wasn't called. Hence the observed stalls.
>
> Should rcu_user_enter() itself be conditional on CONTEXT_TRACKING not
> NO_HZ_FULL?
Ah, CONTEXT_TRACKING_FORCE is only for testing purpose, you should not select
it, it's going to introduce overhead. Actually I should remove that. Although
since we have removed CONFIG_NOHZ_FULL_ALL it's the last way we have to test
NOHZ_FULL from config alone.

On Wed, Jul 18, 2018 at 08:11:52PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 19, 2018 at 02:32:06AM +0200, Frederic Weisbecker wrote:
> > On Wed, Jul 11, 2018 at 06:03:42PM +0100, David Woodhouse wrote:
> > > On Wed, 2018-07-11 at 09:49 -0700, Paul E. McKenney wrote:
> > > > And here is an updated v4.15 patch with Marius's Reported-by and David's
> > > > fix to my lost exclamation point.
> > >
> > > Thanks. Are you sending the original version of that to Linus? It'd be
> > > useful to have the commit ID so that we can watch for it landing, and
> > > chase this one up to Greg.
> > >
> > > As discussed on IRC, this patch reduces synchronize_sched() latency for
> > > us from ~4600s to ~160ms, which is nice.
> > >
> > > However, it isn't going to be sufficient in the NO_HZ_FULL case. For
> > > that you want a patch like the one below, which happily reduces the
> > > latency in our (!NO_HZ_FULL) case still further to ~40ms.
> >
> > That is interesting. As I replied to Paul, we are already calling
> > rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
> > you're seeing such an optimization by repeating those calls.
> >
> > Perhaps the rcu_user_* somehow aren't actually called from
> > __context_tracking_enter()...? Some bug in context tracking?
> > Otherwise it's a curious side effect.
>
> David is working with v4.15. Is this maybe something that has changed
> since then?
Hmm, nope I think the context tracking code hasn't changed for a while.

On Thu, Jul 19, 2018 at 08:16:47AM +0200, David Woodhouse wrote:
>
>
> On Wed, 2018-07-18 at 20:11 -0700, Paul E. McKenney wrote:
> >
> > > That is interesting. As I replied to Paul, we are already calling
> > > rcu_user_enter/exit() on guest_enter/exit_irqsoff(). So I'm wondering why
> > > you're seeing such an optimization by repeating those calls.
> > >
> > > Perhaps the rcu_user_* somehow aren't actually called from
> > > __context_tracking_enter()...? Some bug in context tracking?
> > > Otherwise it's a curious side effect.
> >
> > David is working with v4.15. Is this maybe something that has changed
> > since then?
>
> To clarify: in 4.15 without CONFIG_PREEMPT and without NO_HZ_FULL I was
> seeing RCU stalls because a thread in vcpu_run() was *never* seen to go
> through a quiescent state. Hence the change to need_resched() in the
> first patch in this thread, which fixed the problem at hand and seemed
> to address the general case.
>
> It then seemed by *inspection* that the NO_HZ_FULL case was probably
> broken, because we'd failed to spot the rcu_user_* calls. But
> rcu_user_enter() does nothing in the !NO_HZ_FULL case, so wouldn't have
> helped in the testing that we were doing anyway.
Oh ok, so the optimization you saw is likely unrelated to the rcu_user* things.

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-19T13:36:54Zurn:uuid:3e463153-129e-d3d4-6087-0aca8ba9fc1a

[-- Attachment #1: Type: text/plain, Size: 883 bytes --]
On Thu, 2018-07-19 at 15:14 +0200, Frederic Weisbecker wrote:
> > I'm not sure about the context tracking condition in the code snippet
> > cited above, though. I think that's what caused my problem in the first
> > place — I have CONTEXT_TRACKING_FORCE && !NO_HZ_FULL. So in 4.15, that
> > means rcu_user_enter() did nothing and rcu_virt_note_context_switch()
> > wasn't called. Hence the observed stalls.
> >
> > Should rcu_user_enter() itself be conditional on CONTEXT_TRACKING not
> > NO_HZ_FULL?
>
> Ah, CONTEXT_TRACKING_FORCE is only for testing purpose, you should not select
> it, it's going to introduce overhead. Actually I should remove that. Although
> since we have removed CONFIG_NOHZ_FULL_ALL it's the last way we have to test
> NOHZ_FULL from config alone.
I didn't select it. It defaults to y if !NO_HZ_FULL.
I'm turning it off now...
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Wed, Jul 18, 2018 at 09:37:12AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 18, 2018 at 06:01:51PM +0200, David Woodhouse wrote:
> > On Wed, 2018-07-18 at 08:36 -0700, Paul E. McKenney wrote:
> > > And I finally did get some near misses from an earlier commit, so we
> > > should consider your patch to be officially off the hook.
> >
> > Yay, I like it when it's not my fault. I'll redo it with the ifdef
> > CONFIG_NO_HZ_FULL.
>
> Hey, I didn't say it wasn't your fault, only that it -officially- wasn't
> your fault. ;-)
And I believe that I found my bug in this commit (lots more testing still
required, but the preponderance of evidence and all that):
2cc0c7f07143 ("rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structure")
So it really isn't your fault.
Of course, the real reason for the lack of fault on your part will not
because I believe I found the bug elsewhere, but instead because I will
be dropping your patch (and mine as well) on Frederic's advice. ;-)
Thanx, Paul

David Woodhousedwmw2@infradead.orgRe: [RFC] Make need_resched() return true when rcu_urgent_qs requested2018-07-23T08:09:08Zurn:uuid:4b11ca18-518d-4598-ef82-72621d8ff237

[-- Attachment #1: Type: text/plain, Size: 685 bytes --]
On Thu, 2018-07-19 at 10:09 -0700, Paul E. McKenney wrote:
>
> Of course, the real reason for the lack of fault on your part will not
> because I believe I found the bug elsewhere, but instead because I will
> be dropping your patch (and mine as well) on Frederic's advice. ;-)
You're keeping the need_resched() one though?
And we are still left with the fact that CONTEXT_TRACKING_FORCE is
making the existing code in guest_enter_irqoff() do the wrong thing for
!NO_HZ_FULL. But in fact the rcu_virt_note_context_switch() there is
completely redundant now we fixed need_resched(), so can be dropped,
leaving only the rcu_user_enter/exit calls for the NO_HZ_FULL case?
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

On Mon, Jul 23, 2018 at 10:08:59AM +0200, David Woodhouse wrote:
> On Thu, 2018-07-19 at 10:09 -0700, Paul E. McKenney wrote:
> >
> > Of course, the real reason for the lack of fault on your part will not
> > because I believe I found the bug elsewhere, but instead because I will
> > be dropping your patch (and mine as well) on Frederic's advice. ;-)
>
> You're keeping the need_resched() one though?
Yes. This the current commit in -rcu (which will change when I rebase
onto v4.19-rc1, if not earlier):
fcf0407e6e63 ("rcu: Make need_resched() respond to urgent RCU-QS needs")
> And we are still left with the fact that CONTEXT_TRACKING_FORCE is
> making the existing code in guest_enter_irqoff() do the wrong thing for
> !NO_HZ_FULL. But in fact the rcu_virt_note_context_switch() there is
> completely redundant now we fixed need_resched(), so can be dropped,
> leaving only the rcu_user_enter/exit calls for the NO_HZ_FULL case?
I am not yet convinced that we know exactly the right thing to be
doing for guest OSes for either value of NO_HZ_FULL, much less that
we are actually doing it. ;-)
But what does your testing say?
Thanx, Paul