* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Jan 10, 2010 at 11:30:16PM -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> > 
> > It aims at greatly simplifying and enhancing the current signal-based
> > liburcu userspace RCU synchronize_rcu() implementation.
> > (found at http://lttng.org/urcu)
> 
> I didn't expect quite this comprehensive of an implementation from the
> outset, but I guess I cannot complain. ;-)
> 
> Overall, good stuff.
> 
> Interestingly enough, what you have implemented is analogous to
> synchronize_rcu_expedited() and friends that have recently been added
> to the in-kernel RCU API. By this analogy, my earlier semi-suggestion
> of synchronize_rcu() would be a candidate non-expedited implementation:
> long latency, but extremely low CPU consumption, full batching of
> concurrent requests (even unrelated ones), and so on.

Yes, the main difference, I think, is that the sys_membarrier
infrastructure focuses on IPI-ing only the running threads of the
current process.

> 
> A few questions interspersed below.
> 
> > Changelog since v1:
> > 
> > - Only perform the IPI in CONFIG_SMP.
> > - Only perform the IPI if the process has more than one thread.
> > - Only send IPIs to CPUs involved with threads belonging to our process.
> > - Adaptive IPI scheme (single vs many IPIs with threshold).
> > - Issue smp_mb() at the beginning and end of the system call.
> > 
> > Changelog since v2:
> > 
> > - Iteration on min(num_online_cpus(), nr threads in the process),
> >   taking runqueue spinlocks, allocating a cpumask, IPI-to-many to the
> >   cpumask. Does not allocate the cpumask if only a single IPI is needed.
> > 
> > Both the signal-based and the sys_membarrier userspace RCU schemes
> > permit us to remove the memory barrier from the userspace RCU
> > rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
> > accelerating them. These memory barriers are replaced by compiler
> > barriers on the read-side, and all matching memory barriers on the
> > write-side are turned into an invocation of a memory barrier on all
> > active threads in the process. By letting the kernel perform this
> > synchronization rather than dumbly sending a signal to every process
> > thread (as we currently do), we diminish the number of unnecessary
> > wakeups and only issue the memory barriers on active threads.
> > Non-running threads do not need to execute such barriers anyway,
> > because these are implied by the scheduler context switches.
> > 
> > To explain the benefit of this scheme, let's introduce two example threads:
> > 
> > Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
> > Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())
> > 
> > In a scheme where all smp_mb() in thread A's synchronize_rcu() are
> > ordering memory accesses with respect to the smp_mb() present in
> > rcu_read_lock/unlock(), we can change all smp_mb() from
> > synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
> > rcu_read_lock/unlock() into compiler barriers "barrier()".
> > 
> > Before the change, we had, for each smp_mb() pair:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > smp_mb()                    smp_mb()
> > follow mem accesses         follow mem accesses
> > 
> > After the change, these pairs become:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > As we can see, there are two possible scenarios: either Thread B memory
> > accesses do not happen concurrently with Thread A accesses (1), or they
> > do (2).
> > 
> > 1) Non-concurrent Thread A vs Thread B accesses:
> > 
> > Thread A                    Thread B
> > prev mem accesses
> > sys_membarrier()
> > follow mem accesses
> >                             prev mem accesses
> >                             barrier()
> >                             follow mem accesses
> > 
> > In this case, thread B accesses will be weakly ordered. This is OK,
> > because at that point, thread A is not particularly interested in
> > ordering them with respect to its own accesses.
> > 
> > 2) Concurrent Thread A vs Thread B accesses:
> > 
> > Thread A                    Thread B
> > prev mem accesses           prev mem accesses
> > sys_membarrier()            barrier()
> > follow mem accesses         follow mem accesses
> > 
> > In this case, thread B accesses, which are ensured to be in program
> > order thanks to the compiler barrier, will be "upgraded" to full
> > smp_mb() thanks to the IPIs executing memory barriers on each active
> > system thread.
> > Each non-running process thread is intrinsically
> > serialized by the scheduler.
> > 
> > Just tried with a cache-hot kernel compilation using 6/8 CPUs.
> > 
> > Normally:                                               real 2m41.852s
> > With the sys_membarrier+1 busy-looping thread running:  real 5m41.830s
> > 
> > So... 2x slower. That hurts.
> > 
> > So let's try allocating a cpu mask for PeterZ's scheme. I prefer to have
> > a small allocation overhead and benefit from cpumask broadcast if
> > possible, so we scale better. But that all depends on how big the
> > allocation overhead is.
> > 
> > Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
> > calls; one thread is doing the sys_membarrier, the others are busy
> > looping). It costs almost half as much to perform the cpumask
> > allocation as to send a single IPI. We iterate on the CPUs until we
> > find more than N matches or have iterated over all cpus. If we only
> > have N matches or fewer, we send single IPIs. If we need more than
> > that, we switch to the cpumask allocation and send a broadcast IPI to
> > the cpumask we construct for the matching CPUs. Let's call it the
> > "adaptive IPI scheme".
> > 
> > For my Intel Xeon E5405
> > 
> > *This is calibration only, not taking the runqueue locks*
> > 
> > Just doing local mb()+single IPI to T other threads:
> > 
> > T=1: 0m18.801s
> > T=2: 0m29.086s
> > T=3: 0m46.841s
> > T=4: 0m53.758s
> > T=5: 1m10.856s
> > T=6: 1m21.142s
> > T=7: 1m38.362s
> > 
> > Just doing cpumask alloc+IPI-many to T other threads:
> > 
> > T=1: 0m21.778s
> > T=2: 0m22.741s
> > T=3: 0m22.185s
> > T=4: 0m24.660s
> > T=5: 0m26.855s
> > T=6: 0m30.841s
> > T=7: 0m29.551s
> > 
> > So I think the right threshold should be 1 thread (assuming other
> > architectures will behave like mine). So starting with 2 threads, we
> > allocate the cpumask before sending IPIs.
> > 
> > *end of calibration*
> > 
> > Resulting adaptive scheme, with runqueue locks:
> > 
> > T=1: 0m20.990s
> > T=2: 0m22.588s
> > T=3: 0m27.028s
> > T=4: 0m29.027s
> > T=5: 0m32.592s
> > T=6: 0m36.556s
> > T=7: 0m33.093s
> > 
> > The expected "top" pattern, when using 1 CPU for a thread doing
> > sys_membarrier() in a loop and other threads busy-waiting in user-space
> > on a variable, shows that the thread doing sys_membarrier is doing
> > mostly system calls, and other threads are mostly running in
> > user-space. Side-note: in this test, it's important to check that
> > individual threads are not always fully at 100% user-space time (they
> > range between ~95% and 100%), because when some thread in the test is
> > always at 100% on the same CPU, this means it does not get the IPI at
> > all. (I actually found out about a bug in my own code while developing
> > it with this test.)
> 
> The below data is for how many threads in the process?

8 threads: one doing sys_membarrier() in a loop, 7 others busy-waiting
on a variable.
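
To make the pairing described in the changelog above concrete, here is a
rough sketch of the intended fast paths. This is illustrative only, not
the actual liburcu sources: membarrier() stands in for a wrapper around
the new system call, and the per-thread reader bookkeeping is elided.

	/* Read-side: memory barriers demoted to compiler barriers. */
	static inline void rcu_read_lock(void)
	{
		/* ... mark this thread as inside a read-side C.S. ... */
		barrier();		/* was: smp_mb() */
	}

	static inline void rcu_read_unlock(void)
	{
		barrier();		/* was: smp_mb() */
		/* ... clear this thread's read-side C.S. marker ... */
	}

	/* Write-side: each former smp_mb() becomes a call that executes
	 * smp_mb() on all running threads of the process. */
	void synchronize_rcu(void)
	{
		membarrier();		/* was: smp_mb() */
		/* ... wait for pre-existing readers to complete ... */
		membarrier();		/* was: smp_mb() */
	}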

> Also, is "top" accurate given that the IPI handler will have
> interrupts disabled?

Probably not, AFAIK. "top" does not really take interrupts into its
accounting, so better take this top output with a grain of salt or two.

Absolutely. And it's of no use to add a check within the IPI handler to
verify whether it was indeed needed, because all we would skip is a
simple smp_mb(), which is relatively minor in terms of overhead compared
to the IPI itself.
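
For reference, the handler executed by those IPIs would reduce to the
barrier itself (a sketch; membarrier_ipi is an illustrative name):

	/*
	 * Runs on each target CPU via IPI; the smp_mb() is the whole
	 * point of the interrupt, so a "was it needed?" test inside the
	 * handler could only ever save this one instruction.
	 */
	static void membarrier_ipi(void *unused)
	{
		smp_mb();
	}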

+	if (unlikely(thread_group_empty(current)))
+		return 0;

in the caller below. The "if" you present here simply ensures that we
don't do a superfluous function call for the current thread. It's
probably not really worth it for a slow path, though.
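
To show where such a check would sit, here is a rough sketch of the
overall system call, combining the early return with the adaptive IPI
scheme and the membarrier_ipi() handler sketched above (the calibration
above suggests a threshold of 1). MEMBARRIER_IPI_THRESHOLD,
num_running_threads() and the two cpumask helpers are illustrative
placeholders, not the actual patch code, and the runqueue locking needed
to find which CPUs run our threads is elided.

	SYSCALL_DEFINE0(membarrier)
	{
	#ifdef CONFIG_SMP
		if (unlikely(thread_group_empty(current)))
			return 0;

		smp_mb();	/* order prior accesses before the IPIs */
		if (num_running_threads(current) <= MEMBARRIER_IPI_THRESHOLD) {
			int cpu;

			/* Few targets: cheaper to send individual IPIs. */
			for_each_cpu_running_our_thread(cpu)
				smp_call_function_single(cpu, membarrier_ipi,
							 NULL, 1);
		} else {
			cpumask_var_t tmpmask;

			/* Many targets: pay the allocation, broadcast once. */
			if (alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
				fill_cpumask_with_our_running_threads(tmpmask);
				smp_call_function_many(tmpmask, membarrier_ipi,
						       NULL, 1);
				free_cpumask_var(tmpmask);
			}
		}
		smp_mb();	/* order the IPIs before later accesses */
	#endif /* CONFIG_SMP */
		return 0;
	}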

> The UP-kernel case is handled by the #ifdef in sys_membarrier(), though
> with a bit larger code footprint than the embedded guys would probably
> prefer. (Or is the compiler smart enough to omit these functions given
> no calls to them? If not, I recommend putting them under a CONFIG_SMP
> #ifdef.)

Hrm, that's a bit odd. I agree that UP systems could simply return
-ENOSYS for sys_membarrier(), but then I wonder how userland could
distinguish between:

- an old kernel not supporting sys_membarrier() -> in this case we need
  to use the smp_mb() fallback on the read-side and in synchronize_rcu().
- a recent kernel supporting sys_membarrier(), CONFIG_SMP -> we can use
  barrier() on the read-side and call sys_membarrier() upon update.
- a recent kernel supporting sys_membarrier(), !CONFIG_SMP -> calls to
  sys_membarrier() are not required, nor is barrier().
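
For illustration, runtime detection at library init could look like the
following sketch, which also shows why the -ENOSYS ambiguity matters
(SYS_membarrier is a placeholder, since no syscall number has been
assigned, and urcu_detect_membarrier is an illustrative name):

	#include <unistd.h>
	#include <sys/syscall.h>

	static int has_sys_membarrier;

	static void urcu_detect_membarrier(void)
	{
		if (syscall(SYS_membarrier) == 0)	/* placeholder number */
			has_sys_membarrier = 1;
		/*
		 * On failure (errno == ENOSYS) we cannot tell an old kernel
		 * from a !CONFIG_SMP one, so we must conservatively fall
		 * back to smp_mb() even where no barrier would be needed.
		 */
	}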

Or maybe we just postpone the userland smp_mb() question to another
thread. This will eventually need to be addressed anyway, maybe with a
vgetmaxcpu() vsyscall.
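
If such a vsyscall existed, the library could pick its mode once at init,
along these lines (everything here is hypothetical: vgetmaxcpu(), the
mode enum, and the helper name):

	extern int vgetmaxcpu(void);	/* hypothetical vsyscall */

	enum urcu_barrier_mode {
		MODE_MEMBARRIER,	/* barrier() readers + sys_membarrier() */
		MODE_SMP_MB,		/* conservative smp_mb() fallback */
		MODE_UP,		/* single-CPU: no barriers needed */
	};

	static enum urcu_barrier_mode urcu_pick_mode(void)
	{
		if (has_sys_membarrier)
			return MODE_MEMBARRIER;
		if (vgetmaxcpu() == 1)
			return MODE_UP;
		return MODE_SMP_MB;
	}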