On Wed 03-04-19 19:00:21, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> Thanks to Dave Hansen's patches, PMEM can now be used as memory via NUMA nodes.
> How to use PMEM along with normal DRAM remains an open problem. There are
> several patchsets posted on the mailing list, proposing to use page migration to
> move pages between PMEM and DRAM using Linux page replacement policy [1,2,3].
> There are some important problems not addressed in these patches:
> 1. The page migration in Linux does not provide high enough throughput for us to
> fully exploit PMEM or other use cases.
> 2. Linux page replacement runs too infrequently to distinguish hot and cold
> pages.
[...]
> 33 files changed, 4261 insertions(+), 162 deletions(-)
For a patch _this_ large you should really start with a real world
use case hitting bottlenecks with the current implementation. Sure,
microbenchmarks can trigger bottlenecks much more easily, but do real
applications do the same? Please give us some numbers.
--
Michal Hocko
SUSE Labs

>> Infrequent page list update problem
>> ====
>>
>> Current page lists are updated by calling shrink_list() when memory pressure
>> comes, which might not be frequent enough to keep track of hot and cold pages.
>> This is because all pages are on the active lists the first time shrink_list() is
>> called, and the reference bits on the pages might not reflect their up-to-date
>> access status. But we also do not want to periodically shrink the global page
>> lists, which adds unnecessary overhead to the whole system. So I propose to
>> actively shrink page lists on the memcg we are interested in.
>>
>> Patches 18 to 25 add a new system call to shrink page lists on a given application's
>> memcg and migrate pages between two NUMA nodes. It isolates the impact from the
>> rest of the system. To share DRAM among different applications, Patches 18 and 19
>> add a per-node memcg size limit, so you can limit the memory usage for particular
>> NUMA node(s).
>
> This sounds a little bit confusing to me. Is it entirely the user's decision when
> to call the syscall to shrink page lists? But how would the user know when is a
> good time? Could you please elaborate on the use case?
Sure. We would set up a daemon that monitors user applications and calls the syscall
to shuffle the page lists for the user applications, although the daemon’s concrete
action plan is still under exploration. It might not be ideal, but the page access information
could be refreshed periodically and page migration would happen in the background of
application execution.
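
As a rough illustration of the daemon idea above, the loop could look like the sketch below. Everything named here is hypothetical: `run_monitor`, the `shrink_and_migrate` callback (standing in for whatever wrapper ends up invoking the new syscall), and the polling interval are assumptions for illustration, not part of the patchset.

```python
import time
from typing import Callable, Iterable


def run_monitor(
    memcgs: Iterable[str],
    shrink_and_migrate: Callable[[str], None],
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Periodically refresh page lists for each monitored memcg.

    shrink_and_migrate is a hypothetical placeholder for a wrapper around
    the proposed syscall; it is NOT a real kernel or libc interface.
    Returns the number of times the callback was invoked.
    """
    memcgs = list(memcgs)
    calls = 0
    for _ in range(rounds):
        for memcg in memcgs:
            # Off the application's critical path: refresh access
            # information and migrate pages in the background.
            shrink_and_migrate(memcg)
            calls += 1
        sleep(interval_s)
    return calls
```

How often to poll, and which memcgs to visit, is exactly the part described above as still under exploration.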
On the other hand, if we wait until DRAM is full and use page migration to make room in DRAM
for either page promotion or new page allocation, page migration sits on the critical path
of application execution. Considering that the bandwidth and access latency gaps between
DRAM and PMEM are not as large as the gaps between DRAM and SSD, the cost of page migration
(4KB/0.312GB/s = 12us or 2MB/2.387GB/s = 818us) might defeat the benefit of using DRAM over PMEM.
I just wonder which would be better: waiting for 12us or 818us and then reading 4KB or 2MB of data
in DRAM, or directly accessing the data in PMEM without waiting.
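
The migration cost figures above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the throughput numbers are in binary units (KiB, MiB, GiB), which is what makes the results come out to roughly 12us and 818us, and the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def migration_latency_us(page_bytes: int, throughput_gib_per_s: float) -> float:
    """Time to migrate one page at the given throughput, in microseconds.

    Assumes throughput is expressed in GiB/s (binary units).
    """
    return page_bytes / (throughput_gib_per_s * GIB) * 1e6


# Throughput figures quoted in the message above.
base_page_us = migration_latency_us(4 * KIB, 0.312)   # 4KB base page
huge_page_us = migration_latency_us(2 * MIB, 2.387)   # 2MB huge page

print(f"4KB page: ~{base_page_us:.0f} us, 2MB page: ~{huge_page_us:.0f} us")
```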
Let me know if this makes sense to you.
Thanks.
--
Best Regards,
Yan Zi