On Mon, Apr 27, 2009 at 5:05 PM, Daisuke Nishimura
<d-nishimura@mtf.biglobe.ne.jp> wrote:
> On Mon, 27 Apr 2009 15:43:23 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-27 18:12:59]:
>>
>> > Works very well under my test as following.
>> >   prepare a program which does malloc, touch pages repeatedly.
>> >
>> >   # echo 2M > /cgroup/A/memory.limit_in_bytes  # set limit to 2M.
>> >   # echo 0 > /cgroup/A/tasks.                  # add shell to the group.
>> >
>> >   while true; do
>> >     malloc_and_touch 1M &                      # run malloc and touch program.
>> >     malloc_and_touch 1M &
>> >     malloc_and_touch 1M &
>> >     sleep 3
>> >     pkill malloc_and_touch                     # kill them
>> >   done
>> >
>> > Then, you can see memory.memsw.usage_in_bytes increase gradually and exceeds 3M bytes.
>> > This means account for swp_entry is not reclaimed at kill -> exit-> zap_pte()
>> > because of race with swap-ops and zap_pte() under memcg.
>> >
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > Because free_swap_and_cache() function is called under spinlocks,
>> > it can't sleep and use trylock_page() instead of lock_page().
>> > By this, swp_entry which is not used after zap_xx can exists as
>> > SwapCache, which will be never used.
>> > This kind of SwapCache is reclaimed by global LRU when it's found
>> > at LRU rotation. Typical case is following.
>> >
>>
>> The changelog is not clear, this is the typical case for?
>>
> Okey, let me summarise the problem.
>
> First of all, what I think is problematic is "!PageCgroupUsed
> swap cache without the owner process".
> Those swap caches cannot be reclaimed by memcg's reclaim
> because they are not on memcg's LRU(!PageCgroupUsed pages are not
> linked to memcg's LRU).
> Moreover, the owner prcess has already gone, only global LRU scanning
> can free those swap caches.
>
> Those swap caches causes some problems like:
> (1) pressure the memsw.usage(only when MEM_RES_CTLR_SWAP).
> (2) make struct mem_cgroup unfreeable even after rmdir, because
>     we call mem_cgroup_get() when a page is swaped out(only when MEM_RES_CTLR_SWAP).
> (3) pressure the usage of swap entry.
>
> Those swap caches can be created in paths like:
>
> Type-1) race between exit and swap-in path
>   Assume processA is exiting and pte has swap entry of swaped out page.
>   And processB is trying to swap in the entry by readahead.
>   This entry holds memsw.usage and refcnt to struct mem_cgroup.
>
> Type-1.1)
>               processA               |           processB
> -------------------------------------+-------------------------------------
>   (free_swap_and_cache())            |  (read_swap_cache_async())
>                                      |    swap_duplicate()
>                                      |    __set_page_locked()
>                                      |    add_to_swap_cache()
>     swap_entry_free() == 1           |
>     find_get_page() -> found         |
>     try_lock_page() -> fail & return |
>                                      |    lru_cache_add_anon()
>                                      |      doesn't link this page to memcg's
>                                      |      LRU, because of !PageCgroupUsed.
>
> Type-1.2)
>               processA               |           processB
> -------------------------------------+-------------------------------------
>   (free_swap_and_cache())            |  (read_swap_cache_async())
>                                      |    swap_duplicate()
>     swap_entry_free() == 1           |
>     find_get_page() -> not found     |
>                          & return    |    __set_page_locked()
>                                      |    add_to_swap_cache()
>                                      |    lru_cache_add_anon()
>                                      |      doesn't link this page to memcg's
>                                      |      LRU, because of !PageCgroupUsed.
>
> Type-2) race between exit and swap-out path
>   Assume processA is exiting and pte points to a page(!PageSwapCache).
>   And processB is trying reclaim the page.
>
>               processA               |           processB
> -------------------------------------+-------------------------------------
>   (page_remove_rmap())               |  (shrink_page_list())
>     mem_cgroup_uncharge_page()       |
>       ->uncharged because it's not   |
>         PageSwapCache yet.           |
>         So, both mem/memsw.usage     |
>         are decremented.             |
>                                      |    add_to_swap() -> added to swap cache.
>
> If this page goes thorough without being freed for some reason, this page
> doesn't goes back to memcg's LRU because of !PageCgroupUsed.

Thanks for the detailed explanation of the possible race conditions. I am
beginning to wonder why we don't have any hooks in add_to_swap.* for
charging a page. If the page is already charged and if it is a
context issue (charging it to the right cgroup) that is already
handled from what I see. Won't that help us solve the !PageCgroupUsed
issue?
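
To make the question concrete, here is a toy model of the Type-2 ordering
quoted above. It is only a sketch and not kernel code: Page, Memcg and the
charge_on_add_to_swap flag are hypothetical names made up for illustration,
and the model ignores locking, memsw accounting and the global LRU entirely.
It only tracks whether the page ends up charged and on the memcg's LRU once
it has entered swap cache.

```python
# Toy model of the Type-2 race. NOT kernel code: every name here is a
# hypothetical stand-in chosen for illustration.
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare pages by identity, like distinct struct page
class Page:
    cgroup_used: bool = False  # stands in for PageCgroupUsed
    swap_cache: bool = False   # stands in for PageSwapCache


@dataclass
class Memcg:
    usage: int = 0
    lru: list = field(default_factory=list)

    def charge(self, page):
        # Charging marks the page as used and links it to this memcg's LRU.
        if not page.cgroup_used:
            page.cgroup_used = True
            self.usage += 1
            self.lru.append(page)

    def uncharge(self, page):
        # As in the quoted diagram: the page is uncharged here only because
        # it is not in swap cache yet.
        if page.cgroup_used and not page.swap_cache:
            page.cgroup_used = False
            self.usage -= 1
            self.lru.remove(page)


def type2_race(charge_on_add_to_swap):
    memcg = Memcg()
    page = Page()
    memcg.charge(page)          # page is mapped and charged

    memcg.uncharge(page)        # processA: page_remove_rmap() path
    page.swap_cache = True      # processB: add_to_swap()
    if charge_on_add_to_swap:
        memcg.charge(page)      # hypothetical hook asked about above
    return page, memcg


if __name__ == "__main__":
    for hook in (False, True):
        page, memcg = type2_race(hook)
        print(hook, page.cgroup_used, page in memcg.lru, memcg.usage)
```

Without the hypothetical hook the page ends up in swap cache but uncharged
and off the memcg's LRU, which is the situation described in the quoted
mail. With the hook it is charged again in this model. Whether a real hook
could charge the right cgroup once the owner process has exited is exactly
the open question, and the model says nothing about that.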