This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism.

Ticket locks have an inherent problem in a virtualized case, because the vCPUs are scheduled rather than running concurrently (ignoring gang scheduled vCPUs). This can result in catastrophic performance collapses when the vCPU scheduler doesn't schedule the correct "next" vCPU, and ends up scheduling a vCPU which burns its entire timeslice spinning. (Note that this is not the same problem as lock-holder preemption, which this series also addresses; that's also a problem, but not catastrophic).

Currently we deal with this by having PV spinlocks, which adds a layer of indirection in front of all the spinlock functions, and defining a completely new implementation for Xen (and for other pvops users, but there are none at present).

PV ticketlocks keep the existing ticketlock implementation (fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations, then call out to the __ticket_lock_spinning() pvop, which allows a backend to block the vCPU rather than spinning. This pvop can set the lock into "slowpath state".

- When releasing a lock, if it is in "slowpath state", call __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock is no longer contended, it also clears the slowpath flag.

The "slowpath state" is stored in the LSB of the within the lock tail ticket. This has the effect of reducing the max number of CPUs by half (so, a "small ticket" can deal with 128 CPUs, and "large ticket" 32768).

This series provides a Xen implementation, but it should be straightforward to add a KVM implementation as well.

Overall, it results in a large reduction in code, it makes the native and virtualized cases closer, and it removes a layer of indirection around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in "slowpath" state) is optimal, identical to the non-paravirtualized case.

[assembly listings elided] With CONFIG_PARAVIRT_SPINLOCKS=n, the generated code changes only slightly: the fastpath case (taking the lock without contention) runs straight through, and the spin loop is out of line.

The unlock code is complicated by the need to both add to the lock's "head" and fetch the slowpath flag from "tail". This version of the patch uses a locked add to do this, followed by a test to see if the slowpath flag is set. The lock prefix acts as a full memory barrier, so we can be sure that other CPUs will have seen the unlock before we read the flag (without the barrier, the flag could be read before the unlock store has left the store queue and become visible to other CPUs, which could result in a missed kick and a deadlock).
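
To illustrate the ordering argument, here is a small C11 sketch (a model of the issue, not the kernel code) contrasting a plain store-then-load unlock, which permits the problematic StoreLoad reordering, with a full-barrier read-modify-write followed by the flag test:

#include <stdatomic.h>
#include <stdint.h>

#define TICKET_SLOWPATH_FLAG 0x01u
#define TICKET_INC           0x02u

struct ticketlock {
    _Atomic uint16_t head;
    _Atomic uint16_t tail;
};

/* BROKEN (for illustration): the store of the new head and the following load
 * of tail may be reordered, so the flag can be read before other CPUs observe
 * the unlock, and a waiter that blocks in that window misses its kick. */
static int unlock_then_check_broken(struct ticketlock *lock)
{
    uint16_t h = atomic_load_explicit(&lock->head, memory_order_relaxed);

    atomic_store_explicit(&lock->head, h + TICKET_INC, memory_order_release);
    return atomic_load_explicit(&lock->tail, memory_order_relaxed)
           & TICKET_SLOWPATH_FLAG;
}

/* OK: a sequentially consistent fetch-and-add (a locked add on x86) acts as a
 * full barrier, so the unlock is globally visible before the flag is read. */
static int unlock_then_check(struct ticketlock *lock)
{
    atomic_fetch_add(&lock->head, TICKET_INC);
    return atomic_load(&lock->tail) & TICKET_SLOWPATH_FLAG;
}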

Since this is all unnecessary complication if you're not using PV ticket locks, the patch also uses the jump-label machinery to fall back to the standard "add"-based unlock in the non-PV case.
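
The jump-label code itself isn't shown in this excerpt, but its rough shape would be something like the kernel-style sketch below. The key name paravirt_ticketlocks_enabled and the two __ticket_unlock_* helpers are assumptions made for illustration; static_key_false() is the jump-label API referred to later in the thread.

/* Sketch only: gate the PV unlock slow path behind a jump label so that, with
 * the key in its default (false) state, the unlock compiles down to a straight
 * path through the plain add-based release. Enabling PV ticketlocks flips the
 * key and patches in a jump to the slow-path variant at runtime. */
static struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;

static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
    if (static_key_false(&paravirt_ticketlocks_enabled))
        __ticket_unlock_slowpath(lock);   /* assumed helper: locked add + kick */
    else
        __ticket_unlock_release(lock);    /* assumed helper: plain add of head */
}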

On 03/21/2012 12:20 PM, Raghavendra K T wrote: > From: Jeremy Fitzhardinge <jeremy.fitzhardinge [at] citrix> > > Changes since last posting: (Raghavendra K T) > [. > - Rebased to linux-3.3-rc6. > - used function+enum in place of macro (better type checking) > - use cmpxchg while resetting zero status for possible race > [suggested by Dave Hansen for KVM patches ] > ] > > This series replaces the existing paravirtualized spinlock mechanism > with a paravirtualized ticketlock mechanism. > > Ticket locks have an inherent problem in a virtualized case, because > the vCPUs are scheduled rather than running concurrently (ignoring > gang scheduled vCPUs). This can result in catastrophic performance > collapses when the vCPU scheduler doesn't schedule the correct "next" > vCPU, and ends up scheduling a vCPU which burns its entire timeslice > spinning. (Note that this is not the same problem as lock-holder > preemption, which this series also addresses; that's also a problem, > but not catastrophic). > > (See Thomas Friebel's talk "Prevent Guests from Spinning Around" > http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.) > > Currently we deal with this by having PV spinlocks, which adds a layer > of indirection in front of all the spinlock functions, and defining a > completely new implementation for Xen (and for other pvops users, but > there are none at present). > > PV ticketlocks keeps the existing ticketlock implemenentation > (fastpath) as-is, but adds a couple of pvops for the slow paths: > > - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD > iterations, then call out to the __ticket_lock_spinning() pvop, > which allows a backend to block the vCPU rather than spinning. This > pvop can set the lock into "slowpath state". > > - When releasing a lock, if it is in "slowpath state", the call > __ticket_unlock_kick() to kick the next vCPU in line awake. If the > lock is no longer in contention, it also clears the slowpath flag. > > The "slowpath state" is stored in the LSB of the within the lock tail > ticket. This has the effect of reducing the max number of CPUs by > half (so, a "small ticket" can deal with 128 CPUs, and "large ticket" > 32768). > > This series provides a Xen implementation, but it should be > straightforward to add a KVM implementation as well. >

Looks like a good baseline on which to build the KVM implementation. We might need some handshake to prevent interference on the host side with the PLE code.

On 03/26/2012 07:55 PM, Avi Kivity wrote: > On 03/21/2012 12:20 PM, Raghavendra K T wrote: >> From: Jeremy Fitzhardinge<jeremy.fitzhardinge [at] citrix> [...] >> >> This series provides a Xen implementation, but it should be >> straightforward to add a KVM implementation as well. >> > > Looks like a good baseline on which to build the KVM implementation. We > might need some handshake to prevent interference on the host side with > the PLE code. >

Avi, thanks for reviewing. True, it is sort of equivalent to PLE on a non-PLE machine.

Ingo, Peter, can you please let us know if this series can be considered for the next merge window? Or do you still have some concerns that need addressing?

I shall rebase the patches to 3.3 and resend. (The main differences would be UNINLINE_SPIN_UNLOCK and jump label changes to use static_key_true/false() instead of static_branch().)

I am happy to see this issue receiving some attention and second the wish to see these patches be considered for further review and inclusion in an upcoming release.

Overcommit is not as common in enterprise and single-tenant virtualized environments as it is in multi-tenant environments, and frankly we have been suffering.

We have been running an early copy of these patches in our lab and in a small production node sample set both on 3.2.0-rc4 and 3.3.0-rc6 for over two weeks now with great success. With the heavy level of vCPU:pCPU overcommit required for our situation, the patches are increasing performance by an _order of magnitude_ on our E5645 and E5620 systems.

> On 03/26/2012 07:55 PM, Avi Kivity wrote: > >> On 03/21/2012 12:20 PM, Raghavendra K T wrote: >> >>> From: Jeremy Fitzhardinge<jeremy.**fitzhardinge [at] citrix<jeremy.fitzhardinge [at] citrix> >>> > >>> >> [...] > > >>> This series provides a Xen implementation, but it should be >>> straightforward to add a KVM implementation as well. >>> >>> >> Looks like a good baseline on which to build the KVM implementation. We >> might need some handshake to prevent interference on the host side with >> the PLE code. >> >> > Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on non > PLE machine. > > Ingo, Peter, > Can you please let us know if this series can be considered for next merge > window? > OR do you still have some concerns that needs addressing. > > I shall rebase patches to 3.3 and resend. (main difference would be > UNINLINE_SPIN_UNLOCK and jump label changes to use static_key_true/false() > usage instead of static_branch.) > > Thanks, > Raghu > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo [at] vger > More majordomo info at http://vger.kernel.org/**majordomo-info.html<http://vger.kernel.org/majordomo-info.html> > Please read the FAQ at http://www.tux.org/lkml/>

On 03/28/2012 09:39 PM, Alan Meadows wrote: > I am happy to see this issue receiving some attention and second the > wish to see these patches be considered for further review and inclusion > in an upcoming release. > > Overcommit is not as common in enterprise and single-tenant virtualized > environments as it is in multi-tenant environments, and frankly we have > been suffering. > > We have been running an early copy of these patches in our lab and in a > small production node sample set both on3.2.0-rc4 and 3.3.0-rc6 for over > two weeks now with great success. With the heavy level of vCPU:pCPU > overcommit required for our situation, the patches are increasing > performance by an _order of magnitude_ on our E5645 and E5620 systems. >

Thanks, Alan, for the support. I feel the timing of this patch was a little bad, though (merge window).

> > > Looks like a good baseline on which to build the KVM > implementation. We > might need some handshake to prevent interference on the host > side with > the PLE code. >

I think I still missed some point in Avi's comment. I agree that PLE may be interfering with the above patches (resulting in smaller performance advantages), but we have not seen performance degradation with the patches in earlier benchmarks. [Theoretically, the patch has a slight advantage over PLE in that it at least knows who should run next.]

So the TODO list on this is:
1. More analysis of performance on PLE machines.
2. Seeing how to implement a handshake to increase performance (if the PLE + patch combination has a slight negative effect).

Sorry that I could not do more analysis on PLE (as promised last time) because of machine availability.

I'll do some work on this and come back. But in the meantime, I do not see it as blocking for the next merge window.

> > Avi, Thanks for reviewing. True, it is sort of equivalent to PLE on > non PLE machine. > > Ingo, Peter, > Can you please let us know if this series can be considered for next > merge window? > OR do you still have some concerns that needs addressing. > > I shall rebase patches to 3.3 and resend. (main difference would be > UNINLINE_SPIN_UNLOCK and jump label changes to use > static_key_true/false() usage instead of static_branch.)

On 03/28/2012 08:21 PM, Raghavendra K T wrote: > >> >> >> Looks like a good baseline on which to build the KVM >> implementation. We >> might need some handshake to prevent interference on the host >> side with >> the PLE code. >> > > I think I still missed some point in Avi's comment. I agree that PLE > may be interfering with above patches (resulting in less performance > advantages). but we have not seen performance degradation with the > patches in earlier benchmarks. [. theoretically since patch has very > slight advantage over PLE that atleast it knows who should run next ].

The advantage grows with the vcpu counts and overcommit ratio. If you have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus while your patch knows who to wake up.

I can think of two options:

- from the PLE handler, don't wake up a vcpu that is sleeping because it is waiting for a kick
- look at other sources of pause loops (I guess smp_call_function() is the significant source) and adjust them to use the same mechanism, and ask the host to disable PLE exiting.

This can be done incrementally later.

> > Sorry that, I could not do more analysis on PLE (as promised last time) > because of machine availability. > > I 'll do some work on this and comeback. But in the meantime, I do not > see it as blocking for next merge window.

On 03/29/2012 03:28 PM, Avi Kivity wrote: > On 03/28/2012 08:21 PM, Raghavendra K T wrote: >> >>> >>> >>> Looks like a good baseline on which to build the KVM >>> implementation. We >>> might need some handshake to prevent interference on the host >>> side with >>> the PLE code. >>> >> >> I think I still missed some point in Avi's comment. I agree that PLE >> may be interfering with above patches (resulting in less performance >> advantages). but we have not seen performance degradation with the >> patches in earlier benchmarks. [. theoretically since patch has very >> slight advantage over PLE that atleast it knows who should run next ]. > > The advantage grows with the vcpu counts and overcommit ratio. If you > have N vcpus and M:1 overcommit, PLE has to guess from N/M queued vcpus > while your patch knows who to wake up. >

Yes. I agree.

>> >> So TODO in my list on this is: >> 1. More analysis of performance on PLE mc. >> 2. Seeing how to implement handshake to increase performance (if PLE + >> patch combination have slight negative effect). > > I can think of two options:

I really like the ideas below. Thanks for that!

> - from the PLE handler, don't wake up a vcpu that is sleeping because it > is waiting for a kick

How about adding another pass at the beginning of kvm_vcpu_on_spin() to check if any vcpu has already been kicked? This would almost result in yield_to(kicked_vcpu). IMO this is also worth trying. (A rough sketch of what such a pass might look like follows.)
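
Sketch only, under assumptions: kvm_for_each_vcpu() and yield_to() are existing kernel interfaces, but the pv_kicked flag and the vcpu_to_task() helper are hypothetical placeholders for however the PV backend would record and look up a pending kick.

/* Hypothetical extra pass for kvm_vcpu_on_spin(): before guessing among all
 * runnable vcpus, prefer one that the PV unlock path has already kicked,
 * since that vcpu is known to be next in line for the contended lock. */
static bool yield_to_kicked_vcpu(struct kvm_vcpu *me)
{
    struct kvm_vcpu *vcpu;
    int i;

    kvm_for_each_vcpu(i, vcpu, me->kvm) {
        if (vcpu == me)
            continue;
        if (!vcpu->pv_kicked)                     /* hypothetical per-vcpu flag */
            continue;
        if (yield_to(vcpu_to_task(vcpu), true))   /* vcpu_to_task() is hypothetical */
            return true;                          /* donated the rest of our slice */
    }
    return false;                                 /* fall back to the PLE heuristic */
}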

I will try the above ideas soon.

> - look at other sources of pause loops (I guess smp_call_function() is > the significant source) and adjust them to use the same mechanism, and > ask the host to disable PLE exiting. > > This can be done incrementally later. >

On 03/29/2012 11:33 PM, Raghavendra K T wrote: > On 03/29/2012 03:28 PM, Avi Kivity wrote: >> On 03/28/2012 08:21 PM, Raghavendra K T wrote: > I really like below ideas. Thanks for that!. > >> - from the PLE handler, don't wake up a vcpu that is sleeping because it >> is waiting for a kick > > How about, adding another pass in the beginning of kvm_vcpu_on_spin() > to check if any vcpu is already kicked. This would almost result in > yield_to(kicked_vcpu). IMO this is also worth trying. > > will try above ideas soon. >

> What is the current status of this patchset? I haven't looked at it too > closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach or not.

If I look at it from the non-virtualized kernel side, then this is ass backwards. We already know that we are holding a spinlock which might cause other (v)cpus to go into an eternal spin. The non-virtualized kernel solves this by disabling preemption and thereby getting out of the critical section as fast as possible.

The virtualization problem reminds me a lot of the problem which RT kernels are observing, where non-raw spinlocks are turned into "sleeping spinlocks" and can therefore cause throughput issues for non-RT workloads.

The virtualized situation is even worse, though. Any preempted guest section which holds a spinlock is prone to cause unbounded delays.

The paravirt ticketlock solution can only mitigate the problem, not solve it. With massive overcommit there is always a way to trigger worst-case scenarios unless you educate the scheduler to cope with that.

So if we need to fiddle with the scheduler anyway (and frankly that's the only way to get a real gain; the numbers achieved by these patches are not that impressive), then the question arises whether we should turn the whole thing around.

I know that Peter is going to go berserk on me, but if we are running a paravirt guest then it's simple to provide a mechanism which allows the host (aka hypervisor) to check, just by looking at some global state, whether the guest is inside a spinlock-held section.

So if a guest exits due to an external event, it's easy to inspect the state of that guest and avoid scheduling it away when it was interrupted in a spinlock-held section. That guest/host shared state needs to be modified to tell the guest to invoke an exit when the last nested lock has been released.

Of course this needs to be time bound, so a rogue guest cannot monopolize the cpu forever, but that's the least worrying problem, simply because a guest which does not get out of a spinlocked region within a certain amount of time is borked and eligible for killing anyway.
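
Nothing like this exists in the series; purely as a sketch of the proposal, the shared state could be a small per-vcpu structure along the following lines (all names hypothetical), updated by the guest around lock sections and inspected by the host when deciding whether to preempt:

#include <stdint.h>

/* Hypothetical guest/host shared per-vcpu state for lock-holder tracking.
 * Volatile qualifiers and barriers are omitted for brevity. */
struct pv_lock_state {
    uint32_t lock_depth;      /* spinlocks currently held by this vcpu */
    uint32_t exit_requested;  /* host asks for an exit on the last unlock */
};

extern void hypercall_yield(void);    /* hypothetical: voluntarily exit to host */

/* Guest side: bracket every spinlock critical section. */
static inline void pv_lock_enter(struct pv_lock_state *st)
{
    st->lock_depth++;
}

static inline void pv_lock_exit(struct pv_lock_state *st)
{
    if (--st->lock_depth == 0 && st->exit_requested) {
        st->exit_requested = 0;
        hypercall_yield();
    }
}

/* Host side (informal): on an exit caused by an external event, if lock_depth
 * is non-zero, let the vcpu run for a bounded extra period instead of
 * preempting it, and set exit_requested so the guest exits voluntarily once
 * the last nested lock has been released. */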

> So if a guest exits due to an external event it's easy to inspect the > state of that guest and avoid to schedule away when it was interrupted > in a spinlock held section. That guest/host shared state needs to be

On a large system under high contention, sleeping can perform surprisingly well. pthread mutexes have a tendency to beat kernel spinlocks there. So completely avoiding sleeping locks (which is essentially what pv locks are) is not necessarily that good.

Your proposal is probably only a good idea on low contention and relatively small systems.

> > So if a guest exits due to an external event it's easy to inspect the > > state of that guest and avoid to schedule away when it was interrupted > > in a spinlock held section. That guest/host shared state needs to be > > On a large system under high contention sleeping can perform surprisingly > well. pthread mutexes have a tendency to beat kernel spinlocks there. > So avoiding sleeping locks completely (that is what pv locks are > essentially) is not necessarily that good.

Care to back that up with numbers and proper trace evidence instead of handwaving?

I've stared at RT traces and throughput problems on _LARGE_ machines long enough to know what I'm talking about and I can provide evidence in a split second.

> Your proposal is probably only a good idea on low contention > and relatively small systems.

Sigh, you have really no fcking clue what you are talking about.

On RT we observed scalability problems way before hardware was available to expose them. So what's your point?

On Sat, Mar 31, 2012 at 01:04:41AM +0200, Thomas "Kubys" Gleixner wrote: > On Sat, 31 Mar 2012, Andi Kleen wrote: > > > > So if a guest exits due to an external event it's easy to inspect the > > > state of that guest and avoid to schedule away when it was interrupted > > > in a spinlock held section. That guest/host shared state needs to be > > > > On a large system under high contention sleeping can perform surprisingly > > well. pthread mutexes have a tendency to beat kernel spinlocks there. > > So avoiding sleeping locks completely (that is what pv locks are > > essentially) is not necessarily that good. > > Care to back that up with numbers and proper trace evidence instead of > handwaving?

E.g. my Plumbers presentations on lock and mm scalability from last year have some graphs that show this very clearly, plus some additional data on the mutexes. This compares to the glibc futex locks, which perform much better than the kernel mutex locks on larger systems under higher contention.

Given your tone I will not supply a URL. I'm sure you can find it if you need it.

On 03/31/2012 01:56 AM, H. Peter Anvin wrote: > What is the current status of this patchset? I haven't looked at it too > closely because I have been focused on 3.4 up until now... >

Thanks Peter,

Currently targeting the patchset for the next merge window. IMO these patches are in good shape now. I'll rebase them and send them ASAP. They need the "jumplabel split patch" (from Andrew Jones) as a dependency; I will fold that patch into the series as well (got an OK from him for that).

> I know that Peter is going to go berserk on me, but if we are running > a paravirt guest then it's simple to provide a mechanism which allows > the host (aka hypervisor) to check that in the guest just by looking > at some global state. > > So if a guest exits due to an external event it's easy to inspect the > state of that guest and avoid to schedule away when it was interrupted > in a spinlock held section. That guest/host shared state needs to be > modified to indicate the guest to invoke an exit when the last nested > lock has been released.

The issue is with ticketlocks, though. VCPUs could go into a spin w/o a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it) and releases the lock. VCPU1 is next eligible to take the lock. If that is not scheduled early enough by the host, then the remaining vcpus would keep spinning (even though the lock is technically not held by anybody) w/o making forward progress.

In that situation, what we really need is for the guest to hint to the host scheduler to schedule VCPU1 early (via yield_to or something similar).

The current pv-spinlock patches, however, do not track which vcpu is spinning on which ticket of the lock. I suppose we can consider that optimization in the future and see how much benefit it provides (over plain yield/sleep the way it's done now).

Do you see any issues if we take in what we have today and address the finer-grained optimization as a next step?

> The issue is with ticketlocks though. VCPUs could go into a spin w/o > a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in > that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it) > and releases the lock. VCPU1 is next eligible to take the lock. If

Sorry I meant to say "VCPU2 is next eligible ..."

> that is not scheduled early enough by host, then remaining vcpus would keep > spinning (even though lock is technically not held by anybody) w/o making > forward progress. > > In that situation, what we really need is for the guest to hint to host > scheduler to schedule VCPU1 early (via yield_to or something similar).

> > Care to back that up with numbers and proper trace evidence > > instead of handwaving? > > E.g. my plumbers presentations on lock and mm scalability from > last year has some graphs that show this very clearly, plus > some additional data on the mutexes. This compares to the > glibc futex locks, which perform much better than the kernel > mutex locks on larger systems under higher contention

On 03/31/2012 01:07 AM, Thomas Gleixner wrote: > On Fri, 30 Mar 2012, H. Peter Anvin wrote: > > > What is the current status of this patchset? I haven't looked at it too > > closely because I have been focused on 3.4 up until now... > > The real question is whether these heuristics are the correct approach > or not. > > If I look at it from the non virtualized kernel side then this is ass > backwards. We know already that we are holding a spinlock which might > cause other (v)cpus going into eternal spin. The non virtualized > kernel solves this by disabling preemption and therefor getting out of > the critical section as fast as possible, > > The virtualization problem reminds me a lot of the problem which RT > kernels are observing where non raw spinlocks are turned into > "sleeping spinlocks" and therefor can cause throughput issues for non > RT workloads. > > Though the virtualized situation is even worse. Any preempted guest > section which holds a spinlock is prone to cause unbound delays. > > The paravirt ticketlock solution can only mitigate the problem, but > not solve it. With massive overcommit there is always a way to trigger > worst case scenarious unless you are educating the scheduler to cope > with that. > > So if we need to fiddle with the scheduler and frankly that's the only > way to get a real gain (the numbers, which are achieved by this > patches, are not that impressive) then the question arises whether we > should turn the whole thing around. > > I know that Peter is going to go berserk on me, but if we are running > a paravirt guest then it's simple to provide a mechanism which allows > the host (aka hypervisor) to check that in the guest just by looking > at some global state. > > So if a guest exits due to an external event it's easy to inspect the > state of that guest and avoid to schedule away when it was interrupted > in a spinlock held section. That guest/host shared state needs to be > modified to indicate the guest to invoke an exit when the last nested > lock has been released.

Interesting idea (I think it has been raised before btw, don't recall by who).

One thing about it is that it can give many false positives. Consider a fine-grained spinlock that is being accessed by many threads. That is, the lock is taken and released with high frequency, but there is no contention, because each vcpu is accessing a different instance. So the host scheduler will needlessly delay preemption of vcpus that happen to be holding a lock, even though this gains nothing.

A second issue may happen with a lock that is taken and released with high frequency, with a high hold percentage. The host scheduler may always sample the guest in a held state, leading it to conclude that it's exceeding its timeout when in fact the lock is held for a short time only.

> Of course this needs to be time bound, so a rogue guest cannot > monopolize the cpu forever, but that's the least to worry about > problem simply because a guest which does not get out of a spinlocked > region within a certain amount of time is borked and elegible to > killing anyway.

Hopefully not killing! Just because a guest doesn't scale well, or even if it's deadlocked, doesn't mean it should be killed. Just preempt it.

> Thoughts ?

It's certainly interesting. Maybe a combination is worthwhile - prevent lockholder preemption for a short period of time AND put waiters to sleep in case that period is insufficient to release the lock.

On 03/31/2012 12:07 AM, Thomas Gleixner wrote: > On Fri, 30 Mar 2012, H. Peter Anvin wrote: > >> What is the current status of this patchset? I haven't looked at it too >> closely because I have been focused on 3.4 up until now... > The real question is whether these heuristics are the correct approach > or not. > > If I look at it from the non virtualized kernel side then this is ass > backwards. We know already that we are holding a spinlock which might > cause other (v)cpus going into eternal spin. The non virtualized > kernel solves this by disabling preemption and therefor getting out of > the critical section as fast as possible, > > The virtualization problem reminds me a lot of the problem which RT > kernels are observing where non raw spinlocks are turned into > "sleeping spinlocks" and therefor can cause throughput issues for non > RT workloads. > > Though the virtualized situation is even worse. Any preempted guest > section which holds a spinlock is prone to cause unbound delays. > > The paravirt ticketlock solution can only mitigate the problem, but > not solve it. With massive overcommit there is always a way to trigger > worst case scenarious unless you are educating the scheduler to cope > with that. > > So if we need to fiddle with the scheduler and frankly that's the only > way to get a real gain (the numbers, which are achieved by this > patches, are not that impressive) then the question arises whether we > should turn the whole thing around. > > I know that Peter is going to go berserk on me, but if we are running > a paravirt guest then it's simple to provide a mechanism which allows > the host (aka hypervisor) to check that in the guest just by looking > at some global state. > > So if a guest exits due to an external event it's easy to inspect the > state of that guest and avoid to schedule away when it was interrupted > in a spinlock held section. That guest/host shared state needs to be > modified to indicate the guest to invoke an exit when the last nested > lock has been released. > > Of course this needs to be time bound, so a rogue guest cannot > monopolize the cpu forever, but that's the least to worry about > problem simply because a guest which does not get out of a spinlocked > region within a certain amount of time is borked and elegible to > killing anyway. > > Thoughts ?

On Sun, 1 Apr 2012, Avi Kivity wrote: > On 03/31/2012 01:07 AM, Thomas Gleixner wrote: > > On Fri, 30 Mar 2012, H. Peter Anvin wrote: > > > > > What is the current status of this patchset? I haven't looked at it too > > > closely because I have been focused on 3.4 up until now... > > > > The real question is whether these heuristics are the correct approach > > or not. > > > > If I look at it from the non virtualized kernel side then this is ass > > backwards. We know already that we are holding a spinlock which might > > cause other (v)cpus going into eternal spin. The non virtualized > > kernel solves this by disabling preemption and therefor getting out of > > the critical section as fast as possible, > > > > The virtualization problem reminds me a lot of the problem which RT > > kernels are observing where non raw spinlocks are turned into > > "sleeping spinlocks" and therefor can cause throughput issues for non > > RT workloads. > > > > Though the virtualized situation is even worse. Any preempted guest > > section which holds a spinlock is prone to cause unbound delays. > > > > The paravirt ticketlock solution can only mitigate the problem, but > > not solve it. With massive overcommit there is always a way to trigger > > worst case scenarious unless you are educating the scheduler to cope > > with that. > > > > So if we need to fiddle with the scheduler and frankly that's the only > > way to get a real gain (the numbers, which are achieved by this > > patches, are not that impressive) then the question arises whether we > > should turn the whole thing around. > > > > I know that Peter is going to go berserk on me, but if we are running > > a paravirt guest then it's simple to provide a mechanism which allows > > the host (aka hypervisor) to check that in the guest just by looking > > at some global state. > > > > So if a guest exits due to an external event it's easy to inspect the > > state of that guest and avoid to schedule away when it was interrupted > > in a spinlock held section. That guest/host shared state needs to be > > modified to indicate the guest to invoke an exit when the last nested > > lock has been released. > > Interesting idea (I think it has been raised before btw, don't recall by > who).

Someone posted a pointer to that old thread.

> One thing about it is that it can give many false positives. Consider a > fine-grained spinlock that is being accessed by many threads. That is, > the lock is taken and released with high frequency, but there is no > contention, because each vcpu is accessing a different instance. So the > host scheduler will needlessly delay preemption of vcpus that happen to > be holding a lock, even though this gains nothing.

You're talking about per cpu locks, right? I can see the point there, but per cpu stuff might be worth annotating to avoid this.

> A second issue may happen with a lock that is taken and released with > high frequency, with a high hold percentage. The host scheduler may > always sample the guest in a held state, leading it to conclude that > it's exceeding its timeout when in fact the lock is held for a short > time only.

So when an exit occurs in the locked section, the host can mark the global state to tell the guest to issue a trap on unlock.

> > Of course this needs to be time bound, so a rogue guest cannot > > monopolize the cpu forever, but that's the least to worry about > > problem simply because a guest which does not get out of a spinlocked > > region within a certain amount of time is borked and elegible to > > killing anyway. > > Hopefully not killing! Just because a guest doesn't scale well, or even > if it's deadlocked, doesn't mean it should be killed. Just preempt it.

:)

> > Thoughts ? > > It's certainly interesting. Maybe a combination is worthwhile - prevent > lockholder preemption for a short period of time AND put waiters to > sleep in case that period is insufficient to release the lock.

Right, but as Srivatsa pointed out, this still needs the ticket lock ordering support to avoid guests being completely starved.