> 1)
>
> This patch might solve the remapping
> (remove_migration_ptes()), but does not solve the anon-vma
> locking done in the first, unmapping step of pte-migration -
> which is done via try_to_unmap(): which is a generic VM
> function used by swapout too, so callers do not necessarily
> hold the mmap_sem.
>
> A new TTU flag might solve it although I detest flag-driven
> locking semantics with a passion:
>
> Splitting out unlocked versions of try_to_unmap_anon(),
> try_to_unmap_ksm(), try_to_unmap_file() and constructing an
> unlocked try_to_unmap() out of them, to be used by the
> migration code, would be the cleaner option.
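
To make the locked/unlocked split concrete, here is a minimal
userspace sketch of the pattern. It is not kernel code and not
the patch itself: struct region, the region_*() helpers and the
pthread mutex are hypothetical stand-ins for the anon-vma and
its lock, used purely for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for an anon-vma: a lock plus the state it protects. */
struct region {
	pthread_mutex_t lock;
	int lock_held;	/* debug aid: nonzero while 'lock' is held */
	int nr_mapped;	/* number of mappings still in place */
};

static void region_init(struct region *r, int nr_mapped)
{
	pthread_mutex_init(&r->lock, NULL);
	r->lock_held = 0;
	r->nr_mapped = nr_mapped;
}

static void region_lock(struct region *r)
{
	pthread_mutex_lock(&r->lock);
	r->lock_held = 1;
}

static void region_unlock(struct region *r)
{
	r->lock_held = 0;
	pthread_mutex_unlock(&r->lock);
}

/*
 * Unlocked variant: the caller must already hold r->lock.
 * Returns the number of mappings that were removed.
 */
static int region_unmap_unlocked(struct region *r)
{
	int removed = r->nr_mapped;

	assert(r->lock_held);
	r->nr_mapped = 0;
	return removed;
}

/* Locked variant for generic callers that do not hold the lock. */
static int region_unmap(struct region *r)
{
	int removed;

	region_lock(r);
	removed = region_unmap_unlocked(r);
	region_unlock(r);
	return removed;
}

/*
 * Migration-style caller: takes the lock once and performs both
 * the unmap and the (simulated) remap step under it.
 */
static int region_migrate(struct region *r)
{
	int moved;

	region_lock(r);
	moved = region_unmap_unlocked(r);
	r->nr_mapped = moved;	/* simulated remap */
	region_unlock(r);
	return moved;
}
```

The point of the split is that the locking rule lives in the
function name rather than in a flag: region_unmap() locks for
itself, while region_migrate() holds the lock across both steps
and calls the _unlocked variant directly.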

So as a quick concept hack I wrote the patch attached below.
(It's not signed off, see the patch description text for the
reason.)

So this is roughly as good as it can get without hard binding -
and according to my limited testing the numa02 workload runs
20-30% faster than on the AutoNUMA or balancenuma kernels, on the
same hardware/kernel combo. The above numa02 result now also
gets reasonably close to the numa/core +THP numa02 numbers (to
within 10%).

As expected there's a lot of TLB flushing going on, but, and
this was unexpected to me, even maximally pushing the migration
code does not trigger anything pathological on this 4-node
system - so while the TLB optimization will be a welcome
enhancement, it's not a must-have at this stage.

I'll do a cleaner version of this patch and I'll test on a
larger system with a large NUMA factor too to make sure we don't
need the TLB optimization on !THP.

So I think (assuming that I have not overlooked something
critical in these patches!) that with these two fixes, all the
difficult known regressions in numa/core are fixed.

I'll do more testing with broader workloads and on more systems
to confirm this.