I think this is papering over the problem. Basically this check works because after page table sharing, the parent and child are pointing to the same data page even if they are not sharing page tables. As it's during fork(), we cannot have faulted in parallel so the populated PTE must be due to page table sharing. If the parent has not faulted the page, then sharing is not attempted and again the problem is avoided. It would be a new instance though of hugetlbfs just happening to work because of its limitations - in this case, it works because we only share page tables for MAP_SHARED.

Fundamentally I think the problem is that we are not correctly detecting that page table sharing took place during huge_pte_alloc(). This patch is longer and makes an API change but if I'm right, it addresses the underlying problem. The first VM_MAYSHARE patch is still necessary but would you mind testing this on top please?

---8<---
mm: hugetlbfs: Correctly detect if page tables have just been shared

Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straightforward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram.

These two processes are now pointing to the same data page but are not sharing page tables because the opportunity was missed. When either process later forks, the src_pte == dst_pte check is potentially insufficient. As the check falls through, the wrong PTE information is copied in (harmless but wrong) and the mapcount is bumped for a page mapped by a shared page table, leading to the BUG_ON.
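The failure mode described above can be sketched as a user-space toy model. This is not kernel code: toy_page, toy_pmd_page, toy_pte_alloc() and toy_fork_mapcount() are hypothetical names, and the "shared" out-parameter is only an assumption about what an allocator that reports sharing might look like. The child shares a page table page with an unrelated process, so comparing dst_pte with src_pte misses the sharing while an explicit flag does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a data page and a PMD-level page table page. */
struct toy_page { int mapcount; };
struct toy_pmd_page { int refcount; struct toy_page *pte; };

/*
 * Hypothetical allocator: always shares the candidate page table page,
 * takes a reference on it and reports the sharing through *shared.
 */
static struct toy_page **toy_pte_alloc(struct toy_pmd_page *candidate,
				       bool *shared)
{
	assert(candidate != NULL);
	candidate->refcount++;
	*shared = true;
	return &candidate->pte;
}

/*
 * Model one fork() where the child ends up sharing page tables with an
 * unrelated process ("other") instead of with its parent. Returns the
 * mapcount of the data page afterwards.
 */
static int toy_fork_mapcount(bool use_flag)
{
	/* Parent and other map the same page via separate page tables. */
	struct toy_page data = { .mapcount = 2 };
	struct toy_pmd_page parent = { .refcount = 1, .pte = &data };
	struct toy_pmd_page other = { .refcount = 1, .pte = &data };

	struct toy_page **src_pte = &parent.pte;
	bool shared = false;
	struct toy_page **dst_pte = toy_pte_alloc(&other, &shared);

	/* Either trust the explicit flag or compare the PTE pointers. */
	if (use_flag ? shared : (dst_pte == src_pte))
		return data.mapcount;

	/* Fall-through path: copy the entry and bump the mapcount. */
	*dst_pte = *src_pte;
	data.mapcount++;
	return data.mapcount;
}
```

In this model toy_fork_mapcount(false) leaves the mapcount one higher than toy_fork_mapcount(true), even though the child only took a reference on an already populated page table page.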

+	/* If the pagetable is shared, no other work is necessary */
+	if (shared)
+		return 0;
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate