Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead.

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating a large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were

The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop.
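The randomising side of that reproduction can be sketched as the loop below. The cpuset path is an assumption (a legacy cpuset mount with a two-node machine); point CPUSET_MEMS at the real file, e.g. a testset's mems file, and run as root. It defaults to /dev/null so the sketch is a harmless dry run, and it runs five iterations where the real test loops forever.

```shell
# Hypothetical nodemask-randomising loop; CPUSET_MEMS and the node count
# are assumptions to adjust for the machine under test.
CPUSET_MEMS=${CPUSET_MEMS:-/dev/null}	# set to the real cpuset mems file
i=0
while [ $i -lt 5 ]; do			# the real test loops forever
	node=$(( RANDOM % 2 ))		# pick node 0 or 1 at random
	echo "mems <- $node"
	echo "$node" > "$CPUSET_MEMS"	# rewrite the cpuset's nodemask
	i=$((i + 1))
done
```

With the faulting program running in the other terminal, each write forces a nodemask update to race against in-flight allocations.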

Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent.

 /*
- * reading current mems_allowed and mempolicy in the fastpath must protected
- * by get_mems_allowed()
+ * get_mems_allowed is required when making decisions involving mems_allowed
+ * such as during page allocation. mems_allowed can be updated in parallel
+ * and depending on the new value an operation can fail potentially causing
+ * process failure. A retry loop with get_mems_allowed and put_mems_allowed
+ * prevents these artificial failures.
  */
-static inline void get_mems_allowed(void)
+static inline unsigned int get_mems_allowed(void)
 {
-	current->mems_allowed_change_disable++;
+	return read_seqcount_begin(&current->mems_allowed_seq);
+}

-	/*
-	 * ensure that reading mems_allowed and mempolicy happens after the
-	 * update of ->mems_allowed_change_disable.
-	 *
-	 * the write-side task finds ->mems_allowed_change_disable is not 0,
-	 * and knows the read-side task is reading mems_allowed or mempolicy,
-	 * so it will clear old bits lazily.
-	 */
-	smp_mb();
-}
-
-static inline void put_mems_allowed(void)
-{
-	/*
-	 * ensure that reading mems_allowed and mempolicy before reducing
-	 * mems_allowed_change_disable.
-	 *
-	 * the write-side task will know that the read-side task is still
-	 * reading mems_allowed or mempolicy, don't clears old bits in the
-	 * nodemask.
-	 */
-	smp_mb();
-	--ACCESS_ONCE(current->mems_allowed_change_disable);
+/*
+ * If this returns false, the operation that took place after get_mems_allowed
+ * may have failed. It is up to the caller to retry the operation if
+ * appropriate.
+ */
+static inline bool put_mems_allowed(unsigned int seq)
+{
+	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
 }

-	/*
-	 * ensure checking ->mems_allowed_change_disable after setting all new
-	 * allowed nodes.
-	 *
-	 * the read-side task can see an nodemask with new allowed nodes and
-	 * old allowed nodes. and if it allocates page when cpuset clears newly
-	 * disallowed ones continuous, it can see the new allowed bits.
-	 *
-	 * And if setting all new allowed nodes is after the checking, setting
-	 * all new allowed nodes and clearing newly disallowed ones will be done
-	 * continuous, and the read-side task may find no node to alloc page.
-	 */
-	smp_mb();
-
-	/*
-	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.
-	 */
-	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
-		task_unlock(tsk);
-		if (!task_curr(tsk))
-			yield();
-		goto repeat;
-	}
+	if (need_loop)
+		write_seqcount_begin(&tsk->mems_allowed_seq);
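Callers consume the new pair as a retry loop. The following is a schematic kernel-style pseudocode sketch of that pattern, not a verbatim hunk from the patch; do_allocation() is a placeholder for the actual allocation attempt.

```c
	struct page *page;
	unsigned int cpuset_mems_cookie;

retry_cpuset:
	cpuset_mems_cookie = get_mems_allowed();

	/* attempt the allocation against the nodemask just sampled */
	page = do_allocation();

	/*
	 * If the nodemask changed while the allocation was in flight, a
	 * failure may be a false one caused by the racing update: retry
	 * with the new nodemask.
	 */
	if (!put_mems_allowed(cpuset_mems_cookie) && !page)
		goto retry_cpuset;
```

A successful allocation never retries; only a failure that raced with a nodemask update goes around the loop again, which is what makes the fast path cheap.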