*[RFC PATCH 0/4] kvm: Report unused guest pages to host@ 2019-02-04 18:15 Alexander Duyck
8 replies; 55+ messages in thread
From: Alexander Duyck @ 2019-02-04 18:15 UTC (permalink / raw)
To: linux-mm, linux-kernel, kvm
Cc: rkrcmar, alexander.h.duyck, x86, mingo, bp, hpa, pbonzini, tglx, akpm
This patch set provides a mechanism by which guests can notify the host of
pages that are not currently in use. Using this data a KVM host can more
easily balance memory workloads between guests and improve overall system
performance by avoiding unnecessary writing of unused pages to swap.
In order to support this I have added a new hypercall to provide unused
page hints and made use of mechanisms currently used by the PowerPC and s390
architectures to provide those hints. To reduce the overhead of this call
I am only issuing it per huge page instead of doing a notification per 4K
page. By doing this we can avoid the expense of fragmenting higher order
pages, and reduce the overall cost of the hypercall as it will only be
performed once per huge page.
Because we are limiting this to huge pages it was necessary to add a
secondary location where we make the call, as the buddy allocator can merge
smaller pages into a higher order huge page.
This approach is not usable in all cases. Specifically, when KVM direct
device assignment is used, the memory for a guest is permanently assigned
to physical pages in order to support DMA from the assigned device. In
this case we cannot give the pages back, so the hypercall is disabled by
the host.
Another situation that can lead to issues is if a page is accessed
immediately after being freed. For example, if page poisoning is enabled the
guest will populate the page *after* freeing it. In this case it does not
make sense to provide a hint about the page being freed, so we do not
perform the hypercalls from the guest when this functionality is enabled.
My testing up to now has consisted of setting up four 8GB VMs on a system
with 32GB of memory and 4GB of swap. To stress the memory on the system I
ran "memhog 8G" sequentially on each of the guests and observed how
long it took to complete the run. The observed behavior is that on
systems with these patches applied in both the guest and the host I was
able to complete the test in 5 to 7 seconds per guest. On a
system without these patches the time ranged from 7 to 49 seconds per
guest. I am assuming the variability is due to time being spent writing
pages out to disk in order to free up space for the guests.
---
Alexander Duyck (4):
madvise: Expose ability to set dontneed from kernel
kvm: Add host side support for free memory hints
kvm: Add guest side support for free memory hints
mm: Add merge page notifier
Documentation/virtual/kvm/cpuid.txt | 4 ++
Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
arch/x86/include/asm/page.h | 25 +++++++++++++++
arch/x86/include/uapi/asm/kvm_para.h | 3 ++
arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
arch/x86/kvm/cpuid.c | 6 +++-
arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
include/linux/gfp.h | 4 ++
include/linux/mm.h | 2 +
include/uapi/linux/kvm_para.h | 1 +
mm/madvise.c | 13 +++++++-
mm/page_alloc.c | 2 +
12 files changed, 158 insertions(+), 2 deletions(-)
--

*Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints
From: Michael S. Tsirkin @ 2019-02-11 17:48 UTC (permalink / raw)
To: Dave Hansen
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar,
alexander.h.duyck, x86, mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 09:41:19AM -0800, Dave Hansen wrote:
> On 2/9/19 4:44 PM, Michael S. Tsirkin wrote:
> > So the policy should not leak into host/guest interface.
> > Instead it is better to just keep the pages pinned and
> > ignore the hint for now.
>
> It does seem a bit silly to have guests forever hinting about freed
> memory when the host never has a hope of doing anything about it.
>
> Is that part fixable?
Yes just not with existing IOMMU APIs.
It's in the paragraph just above that you cut out:
Yes right now assignment is not smart enough but generally
you can protect the unused page in the IOMMU and that's it,
it's safe.
So e.g.:
    extern int iommu_remap(struct iommu_domain *domain, unsigned long iova,
                           phys_addr_t paddr, size_t size, int prot);
I can elaborate if you like but generally we would need an API that
allows you to atomically update a mapping for a specific page without
perturbing the mapping for other pages.
--
MST

*Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints
From: Alexander Duyck @ 2019-02-11 18:30 UTC (permalink / raw)
To: Michael S. Tsirkin, Dave Hansen
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, 2019-02-11 at 12:48 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 09:41:19AM -0800, Dave Hansen wrote:
> > On 2/9/19 4:44 PM, Michael S. Tsirkin wrote:
> > > So the policy should not leak into host/guest interface.
> > > Instead it is better to just keep the pages pinned and
> > > ignore the hint for now.
> >
> > It does seem a bit silly to have guests forever hinting about freed
> > memory when the host never has a hope of doing anything about it.
> >
> > Is that part fixable?
>
>
> Yes just not with existing IOMMU APIs.
>
> It's in the paragraph just above that you cut out:
> Yes right now assignment is not smart enough but generally
> you can protect the unused page in the IOMMU and that's it,
> it's safe.
>
> So e.g.
> extern int iommu_remap(struct iommu_domain *domain, unsigned long iova,
> phys_addr_t paddr, size_t size, int prot);
>
>
> I can elaborate if you like but generally we would need an API that
> allows you to atomically update a mapping for a specific page without
> perturbing the mapping for other pages.
>
I still don't see how this would solve anything unless you have the
guest somehow hinting on what pages it is providing to the devices.
You would have to have the host invalidating the pages when the hint is
provided, and have a new hint tied to arch_alloc_page that would
rebuild the IOMMU mapping when a page is allocated.
I'm pretty certain that the added cost of that would make the hinting
pretty pointless, as my experience has been that the IOMMU is too much
of a bottleneck to have multiple CPUs trying to create and invalidate
mappings simultaneously.

*Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints
From: Michael S. Tsirkin @ 2019-02-11 19:24 UTC (permalink / raw)
To: Alexander Duyck
Cc: Dave Hansen, Alexander Duyck, linux-mm, linux-kernel, kvm,
rkrcmar, x86, mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 10:30:10AM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 12:48 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 09:41:19AM -0800, Dave Hansen wrote:
> > > On 2/9/19 4:44 PM, Michael S. Tsirkin wrote:
> > > > So the policy should not leak into host/guest interface.
> > > > Instead it is better to just keep the pages pinned and
> > > > ignore the hint for now.
> > >
> > > It does seem a bit silly to have guests forever hinting about freed
> > > memory when the host never has a hope of doing anything about it.
> > >
> > > Is that part fixable?
> >
> >
> > Yes just not with existing IOMMU APIs.
> >
> > It's in the paragraph just above that you cut out:
> > Yes right now assignment is not smart enough but generally
> > you can protect the unused page in the IOMMU and that's it,
> > it's safe.
> >
> > So e.g.
> > extern int iommu_remap(struct iommu_domain *domain, unsigned long iova,
> > phys_addr_t paddr, size_t size, int prot);
> >
> >
> > I can elaborate if you like but generally we would need an API that
> > allows you to atomically update a mapping for a specific page without
> > perturbing the mapping for other pages.
> >
>
> I still don't see how this would solve anything unless you have the
> guest somehow hinting on what pages it is providing to the devices.
>
> You would have to have the host invalidating the pages when the hint is
> provided, and have a new hint tied to arch_alloc_page that would
> rebuild the IOMMU mapping when a page is allocated.
>
> I'm pretty certain that the added cost of that would make the hinting
> pretty pointless as my experience has been that the IOMMU is too much
> of a bottleneck to have multiple CPUs trying to create and invalidate
> mappings simultaneously.
I agree it's a concern.
Another option would involve passing these hints in the DMA API.
How about the option of removing the device by hotplug when the
host needs to overcommit? That would involve either buffering
on the host, or requesting free pages after the device is removed,
along the lines of the existing balloon code. That, btw, seems to
be an argument for making this hinting part of the balloon.
--
MST

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Alexander Duyck @ 2019-02-04 23:37 UTC (permalink / raw)
To: Nadav Amit, Alexander Duyck
Cc: Linux-MM, LKML, kvm list, Radim Krcmar, X86 ML, Ingo Molnar, bp,
hpa, pbonzini, tglx, akpm
On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
> > On Feb 4, 2019, at 10:15 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >
> > Add guest support for providing free memory hints to the KVM hypervisor for
> > freed pages huge TLB size or larger. I am restricting the size to
> > huge TLB order and larger because the hypercalls are too expensive to be
> > performing one per 4K page. Using the huge TLB order became the obvious
> > choice for the order to use as it allows us to avoid fragmentation of higher
> > order memory on the host.
> >
> > I have limited the functionality so that it doesn't work when page
> > poisoning is enabled. I did this because a write to the page after doing an
> > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > cycles to do so.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> > arch/x86/include/asm/page.h | 13 +++++++++++++
> > arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> > 2 files changed, 36 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > index 7555b48803a8..4487ad7a3385 100644
> > --- a/arch/x86/include/asm/page.h
> > +++ b/arch/x86/include/asm/page.h
> > @@ -18,6 +18,19 @@
> >
> > struct page;
> >
> > +#ifdef CONFIG_KVM_GUEST
> > +#include <linux/jump_label.h>
> > +extern struct static_key_false pv_free_page_hint_enabled;
> > +
> > +#define HAVE_ARCH_FREE_PAGE
> > +void __arch_free_page(struct page *page, unsigned int order);
> > +static inline void arch_free_page(struct page *page, unsigned int order)
> > +{
> > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > + __arch_free_page(page, order);
> > +}
> > +#endif
>
> This patch and the following one assume that only KVM should be able to hook
> to these events. I do not think it is appropriate for __arch_free_page() to
> effectively mean “kvm_guest_free_page()”.
>
> Is it possible to use the paravirt infrastructure for this feature,
> similarly to other PV features? It is not the best infrastructure, but at least
> it is hypervisor-neutral.
I could probably tie this into the paravirt infrastructure, but if I
did so I would probably want to pull the checks for the page order out
of the KVM specific bits and make it something we handle in the inline.
Doing that I would probably make it a paravirtual hint that only
operates at the PMD level. That way we wouldn't incur the cost of the
paravirt infrastructure at the per 4K page level.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Nadav Amit @ 2019-02-05 0:03 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, Linux-MM, LKML, kvm list, Radim Krcmar, X86 ML,
Ingo Molnar, bp, hpa, pbonzini, tglx, akpm
> On Feb 4, 2019, at 3:37 PM, Alexander Duyck <alexander.h.duyck@linux.intel.com> wrote:
>
> On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
>>> On Feb 4, 2019, at 10:15 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>>
>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> Add guest support for providing free memory hints to the KVM hypervisor for
>>> freed pages huge TLB size or larger. I am restricting the size to
>>> huge TLB order and larger because the hypercalls are too expensive to be
>>> performing one per 4K page. Using the huge TLB order became the obvious
>>> choice for the order to use as it allows us to avoid fragmentation of higher
>>> order memory on the host.
>>>
>>> I have limited the functionality so that it doesn't work when page
>>> poisoning is enabled. I did this because a write to the page after doing an
>>> MADV_DONTNEED would effectively negate the hint, so it would be wasting
>>> cycles to do so.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>> ---
>>> arch/x86/include/asm/page.h | 13 +++++++++++++
>>> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
>>> 2 files changed, 36 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>> index 7555b48803a8..4487ad7a3385 100644
>>> --- a/arch/x86/include/asm/page.h
>>> +++ b/arch/x86/include/asm/page.h
>>> @@ -18,6 +18,19 @@
>>>
>>> struct page;
>>>
>>> +#ifdef CONFIG_KVM_GUEST
>>> +#include <linux/jump_label.h>
>>> +extern struct static_key_false pv_free_page_hint_enabled;
>>> +
>>> +#define HAVE_ARCH_FREE_PAGE
>>> +void __arch_free_page(struct page *page, unsigned int order);
>>> +static inline void arch_free_page(struct page *page, unsigned int order)
>>> +{
>>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> + __arch_free_page(page, order);
>>> +}
>>> +#endif
>>
>> This patch and the following one assume that only KVM should be able to hook
>> to these events. I do not think it is appropriate for __arch_free_page() to
>> effectively mean “kvm_guest_free_page()”.
>>
>> Is it possible to use the paravirt infrastructure for this feature,
>> similarly to other PV features? It is not the best infrastructure, but at least
>> it is hypervisor-neutral.
>
> I could probably tie this into the paravirt infrastructure, but if I
> did so I would probably want to pull the checks for the page order out
> of the KVM specific bits and make it something we handle in the inline.
> Doing that I would probably make it a paravirtual hint that only
> operates at the PMD level. That way we wouldn't incur the cost of the
> paravirt infrastructure at the per 4K page level.
If I understand you correctly, you “complain” that this would affect
performance.
While it might be, you may want to check whether the already available
tools can solve the problem:
1. You can use a combination of static-key and pv-ops - see for example
steal_account_process_time()
2. You can use callee-saved pv-ops.
The latter might anyhow be necessary since, IIUC, you change a very hot
path. So you may want to have a look at the assembly code of free_pcp_prepare()
(or at least its code size) before and after your changes. If it grows too
big, a callee-saved function might be necessary.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Alexander Duyck @ 2019-02-05 0:16 UTC (permalink / raw)
To: Nadav Amit
Cc: Alexander Duyck, Linux-MM, LKML, kvm list, Radim Krcmar, X86 ML,
Ingo Molnar, bp, Peter Anvin, Paolo Bonzini, Thomas Gleixner,
Andrew Morton
On Mon, Feb 4, 2019 at 4:03 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On Feb 4, 2019, at 3:37 PM, Alexander Duyck <alexander.h.duyck@linux.intel.com> wrote:
> >
> > On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
> >>> On Feb 4, 2019, at 10:15 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >>>
> >>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >>>
> >>> Add guest support for providing free memory hints to the KVM hypervisor for
> >>> freed pages huge TLB size or larger. I am restricting the size to
> >>> huge TLB order and larger because the hypercalls are too expensive to be
> >>> performing one per 4K page. Using the huge TLB order became the obvious
> >>> choice for the order to use as it allows us to avoid fragmentation of higher
> >>> order memory on the host.
> >>>
> >>> I have limited the functionality so that it doesn't work when page
> >>> poisoning is enabled. I did this because a write to the page after doing an
> >>> MADV_DONTNEED would effectively negate the hint, so it would be wasting
> >>> cycles to do so.
> >>>
> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >>> ---
> >>> arch/x86/include/asm/page.h | 13 +++++++++++++
> >>> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> >>> 2 files changed, 36 insertions(+)
> >>>
> >>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> >>> index 7555b48803a8..4487ad7a3385 100644
> >>> --- a/arch/x86/include/asm/page.h
> >>> +++ b/arch/x86/include/asm/page.h
> >>> @@ -18,6 +18,19 @@
> >>>
> >>> struct page;
> >>>
> >>> +#ifdef CONFIG_KVM_GUEST
> >>> +#include <linux/jump_label.h>
> >>> +extern struct static_key_false pv_free_page_hint_enabled;
> >>> +
> >>> +#define HAVE_ARCH_FREE_PAGE
> >>> +void __arch_free_page(struct page *page, unsigned int order);
> >>> +static inline void arch_free_page(struct page *page, unsigned int order)
> >>> +{
> >>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> >>> + __arch_free_page(page, order);
> >>> +}
> >>> +#endif
> >>
> >> This patch and the following one assume that only KVM should be able to hook
> >> to these events. I do not think it is appropriate for __arch_free_page() to
> >> effectively mean “kvm_guest_free_page()”.
> >>
> >> Is it possible to use the paravirt infrastructure for this feature,
> >> similarly to other PV features? It is not the best infrastructure, but at least
> >> it is hypervisor-neutral.
> >
> > I could probably tie this into the paravirt infrastructure, but if I
> > did so I would probably want to pull the checks for the page order out
> > of the KVM specific bits and make it something we handle in the inline.
> > Doing that I would probably make it a paravirtual hint that only
> > operates at the PMD level. That way we wouldn't incur the cost of the
> > paravirt infrastructure at the per 4K page level.
>
> If I understand you correctly, you “complain” that this would affect
> performance.
It wasn't so much a "complaint" as an "observation". What I was
getting at is that if I am going to make it a PV operation I might set
a hard limit on it so that it will specifically only apply to huge
pages and larger. By doing that I can justify performing the screening
based on page order in the inline path and avoid any PV infrastructure
overhead unless I have to incur it.
> While it might be, you may want to check whether the already available
> tools can solve the problem:
>
> 1. You can use a combination of static-key and pv-ops - see for example
> steal_account_process_time()
Okay, I was kind of already heading in this direction. The static key
I am using now would probably stay put.
> 2. You can use callee-saved pv-ops.
>
> The latter might anyhow be necessary since, IIUC, you change a very hot
> path. So you may want to have a look at the assembly code of free_pcp_prepare()
> (or at least its code size) before and after your changes. If it grows too
> big, a callee-saved function might be necessary.
I'll have to take a look. I will spend the next couple of days
familiarizing myself with the pv-ops infrastructure.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Nadav Amit @ 2019-02-05 1:46 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, Linux-MM, LKML, kvm list, Radim Krcmar, X86 ML,
Ingo Molnar, Borislav Petkov, Peter Anvin, Paolo Bonzini,
Thomas Gleixner, Andrew Morton
> On Feb 4, 2019, at 4:16 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
> On Mon, Feb 4, 2019 at 4:03 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> On Feb 4, 2019, at 3:37 PM, Alexander Duyck <alexander.h.duyck@linux.intel.com> wrote:
>>>
>>> On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
>>>>> On Feb 4, 2019, at 10:15 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>>>>
>>>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>>>
>>>>> Add guest support for providing free memory hints to the KVM hypervisor for
>>>>> freed pages huge TLB size or larger. I am restricting the size to
>>>>> huge TLB order and larger because the hypercalls are too expensive to be
>>>>> performing one per 4K page. Using the huge TLB order became the obvious
>>>>> choice for the order to use as it allows us to avoid fragmentation of higher
>>>>> order memory on the host.
>>>>>
>>>>> I have limited the functionality so that it doesn't work when page
>>>>> poisoning is enabled. I did this because a write to the page after doing an
>>>>> MADV_DONTNEED would effectively negate the hint, so it would be wasting
>>>>> cycles to do so.
>>>>>
>>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>>> ---
>>>>> arch/x86/include/asm/page.h | 13 +++++++++++++
>>>>> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
>>>>> 2 files changed, 36 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>>>> index 7555b48803a8..4487ad7a3385 100644
>>>>> --- a/arch/x86/include/asm/page.h
>>>>> +++ b/arch/x86/include/asm/page.h
>>>>> @@ -18,6 +18,19 @@
>>>>>
>>>>> struct page;
>>>>>
>>>>> +#ifdef CONFIG_KVM_GUEST
>>>>> +#include <linux/jump_label.h>
>>>>> +extern struct static_key_false pv_free_page_hint_enabled;
>>>>> +
>>>>> +#define HAVE_ARCH_FREE_PAGE
>>>>> +void __arch_free_page(struct page *page, unsigned int order);
>>>>> +static inline void arch_free_page(struct page *page, unsigned int order)
>>>>> +{
>>>>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>>>> + __arch_free_page(page, order);
>>>>> +}
>>>>> +#endif
>>>>
>>>> This patch and the following one assume that only KVM should be able to hook
>>>> to these events. I do not think it is appropriate for __arch_free_page() to
>>>> effectively mean “kvm_guest_free_page()”.
>>>>
>>>> Is it possible to use the paravirt infrastructure for this feature,
>>>> similarly to other PV features? It is not the best infrastructure, but at least
>>>> it is hypervisor-neutral.
>>>
>>> I could probably tie this into the paravirt infrastructure, but if I
>>> did so I would probably want to pull the checks for the page order out
>>> of the KVM specific bits and make it something we handle in the inline.
>>> Doing that I would probably make it a paravirtual hint that only
>>> operates at the PMD level. That way we wouldn't incur the cost of the
>>> paravirt infrastructure at the per 4K page level.
>>
>> If I understand you correctly, you “complain” that this would affect
>> performance.
>
> It wasn't so much a "complaint" as an "observation". What I was
> getting at is that if I am going to make it a PV operation I might set
> a hard limit on it so that it will specifically only apply to huge
> pages and larger. By doing that I can justify performing the screening
> based on page order in the inline path and avoid any PV infrastructure
> overhead unless I have to incur it.
I understood. I guess my use of “double quotes” was lost in translation. ;-)
One more point regarding [2/4] - you may want to consider using madvise_free
instead of madvise_dontneed to avoid unnecessary EPT violations.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Alexander Duyck @ 2019-02-11 16:31 UTC (permalink / raw)
To: Michael S. Tsirkin, Alexander Duyck
Cc: linux-mm, linux-kernel, kvm, rkrcmar, x86, mingo, bp, hpa,
pbonzini, tglx, akpm
On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >
> > Add guest support for providing free memory hints to the KVM hypervisor for
> > freed pages huge TLB size or larger. I am restricting the size to
> > huge TLB order and larger because the hypercalls are too expensive to be
> > performing one per 4K page.
>
> Even 2M pages start to get expensive with a TB guest.
Agreed.
> Really it seems we want a virtio ring so we can pass a batch of these.
> E.g. 256 entries, 2M each - that's more like it.
The only issue I see with doing that is that we then have to defer the
freeing. Doing that is going to introduce issues in the guest as we are
going to have pages going unused for some period of time while we wait
for the hint to complete, and we cannot just pull said pages back. I'm
not really a fan of the asynchronous nature of Nitesh's patches for
this reason.
> > Using the huge TLB order became the obvious
> > choice for the order to use as it allows us to avoid fragmentation of higher
> > order memory on the host.
> >
> > I have limited the functionality so that it doesn't work when page
> > poisoning is enabled. I did this because a write to the page after doing an
> > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > cycles to do so.
>
> Again that's leaking host implementation detail into guest interface.
>
> We are giving guest page hints to host that makes sense,
> weird interactions with other features due to host
> implementation details should be handled by host.
I don't view this as a host implementation detail; this is a guest
feature making use of all pages for debugging. If we are placing poison
values in the page then I wouldn't consider it an unused page; it is
being actively used to store the poison value. If we can achieve this
and free the page back to the host then even better, but until the
features can coexist we should not use the page hinting while page
poisoning is enabled.
This is one of the reasons why I was opposed to just disabling page
poisoning when this feature was enabled in Nitesh's patches. If the
guest has page poisoning enabled it is doing something with the page.
It shouldn't be prevented from doing that because the host wants to
have the option to free the pages.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Michael S. Tsirkin @ 2019-02-11 17:36 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > >
> > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > freed pages huge TLB size or larger. I am restricting the size to
> > > huge TLB order and larger because the hypercalls are too expensive to be
> > > performing one per 4K page.
> >
> > Even 2M pages start to get expensive with a TB guest.
>
> Agreed.
>
> > Really it seems we want a virtio ring so we can pass a batch of these.
> > E.g. 256 entries, 2M each - that's more like it.
>
> The only issue I see with doing that is that we then have to defer the
> freeing. Doing that is going to introduce issues in the guest as we are
> going to have pages going unused for some period of time while we wait
> for the hint to complete, and we cannot just pull said pages back. I'm
> not really a fan of the asynchronous nature of Nitesh's patches for
> this reason.
Well nothing prevents us from doing an extra exit to the hypervisor if
we want. The asynchronous nature is there as an optimization
to allow the hypervisor to do its thing on a separate CPU.
Why not proceed doing other things meanwhile?
And if the reason is that we are short on memory, then
maybe we should be less aggressive in hinting?
E.g. if we just have 2 pages:
    hint page 1
    page 1 hint processed?
        yes - proceed to page 2
        no  - wait for interrupt
    get interrupt that page 1 hint is processed
    hint page 2
If the hypervisor happens to be running on the same CPU it
can process things synchronously and we never enter
the no branch.
> > > Using the huge TLB order became the obvious
> > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > order memory on the host.
> > >
> > > I have limited the functionality so that it doesn't work when page
> > > poisoning is enabled. I did this because a write to the page after doing an
> > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > cycles to do so.
> >
> > Again that's leaking host implementation detail into guest interface.
> >
> > We are giving guest page hints to host that makes sense,
> > weird interactions with other features due to host
> > implementation details should be handled by host.
>
> I don't view this as a host implementation detail; this is a guest
> feature making use of all pages for debugging. If we are placing poison
> values in the page then I wouldn't consider it an unused page; it is
> being actively used to store the poison value.
Well I guess it's a valid point of view for a kernel hacker, but the pages
are unused from the application's point of view.
However poisoning is transparent to users, and most distro users
are not aware of it going on. They just know that debug kernels
are slower.
A user loading a debug kernel and immediately breaking overcommit
is an unpleasant experience.
> If we can achieve this
> and free the page back to the host then even better, but until the
> features can coexist we should not use the page hinting while page
> poisoning is enabled.
The existing hinting in the balloon allows them to coexist, so I think we
need to set the bar just as high for any new variant.
> This is one of the reasons why I was opposed to just disabling page
> poisoning when this feature was enabled in Nitesh's patches. If the
> guest has page poisoning enabled it is doing something with the page.
> It shouldn't be prevented from doing that because the host wants to
> have the option to free the pages.
I agree but I think the decision belongs on the host. I.e.
hint the page but tell the host it needs to be careful
about the poison value. It might also mean we
need to make sure poisoning happens after the hinting, not before.
--
MST
^ permalink raw reply [flat|nested] 55+ messages in thread

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
2019-02-11 17:36 ` Michael S. Tsirkin
@ 2019-02-11 18:10 ` Alexander Duyck
2019-02-11 19:54 ` Michael S. Tsirkin
0 siblings, 1 reply; 55+ messages in thread
From: Alexander Duyck @ 2019-02-11 18:10 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > >
> > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > performing one per 4K page.
> > >
> > > Even 2M pages start to get expensive with a TB guest.
> >
> > Agreed.
> >
> > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > E.g. 256 entries, 2M each - that's more like it.
> >
> > The only issue I see with doing that is that we then have to defer the
> > freeing. Doing that is going to introduce issues in the guest as we are
> > going to have pages going unused for some period of time while we wait
> > for the hint to complete, and we cannot just pull said pages back. I'm
> > not really a fan of the asynchronous nature of Nitesh's patches for
> > this reason.
>
> Well nothing prevents us from doing an extra exit to the hypervisor if
> we want. The asynchronous nature is there as an optimization
> to allow hypervisor to do its thing on a separate CPU.
> Why not proceed doing other things meanwhile?
> And if the reason is that we are short on memory, then
> maybe we should be less aggressive in hinting?
>
> E.g. if we just have 2 pages:
>
> hint page 1
> page 1 hint processed?
> yes - proceed to page 2
> no - wait for interrupt
>
> get interrupt that page 1 hint is processed
> hint page 2
>
>
> If hypervisor happens to be running on same CPU it
> can process things synchronously and we never enter
> the no branch.
>
Another concern I would have about processing this asynchronously is
that we have the potential for multiple guest CPUs to become
bottlenecked by a single host CPU. I am not sure if that is something
that would be desirable.
> > > > Using the huge TLB order became the obvious
> > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > order memory on the host.
> > > >
> > > > I have limited the functionality so that it doesn't work when page
> > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > cycles to do so.
> > >
> > > Again that's leaking host implementation detail into guest interface.
> > >
> > > We are giving guest page hints to host that makes sense,
> > > weird interactions with other features due to host
> > > implementation details should be handled by host.
> >
> > I don't view this as a host implementation detail, this is guest
> > feature making use of all pages for debugging. If we are placing poison
> > values in the page then I wouldn't consider them an unused page, it is
> > being actively used to store the poison value.
>
> Well I guess it's a valid point of view for a kernel hacker, but they are
> unused from application's point of view.
> However poisoning is transparent to users and most distro users
> are not aware of it going on. They just know that debug kernels
> are slower.
> User loading a debug kernel and immediately breaking overcommit
> is an unpleasant experience.
How would that be any different than a user loading an older kernel
that doesn't have this feature and breaking overcommit as a result?
I still think it would be better if we left the poisoning enabled in
such a case and, if nothing else, displayed a warning message that
hinting is disabled because of page poisoning.
One other thought I had on this is that one side effect of page
poisoning is probably that KSM would be able to merge all of the poison
pages together into a single page since they are all set to the same
values. So even with the poisoned pages it would be possible to reduce
total memory overhead.
> > If we can achieve this
> > and free the page back to the host then even better, but until the
> > features can coexist we should not use the page hinting while page
> > poisoning is enabled.
>
> Existing hinting in balloon allows them to coexist so I think we
> need to set the bar just as high for any new variant.
That is what I heard. I will have to look into this.
> > This is one of the reasons why I was opposed to just disabling page
> > poisoning when this feature was enabled in Nitesh's patches. If the
> > guest has page poisoning enabled it is doing something with the page.
> > It shouldn't be prevented from doing that because the host wants to
> > have the option to free the pages.
>
> I agree but I think the decision belongs on the host. I.e.
> hint the page but tell the host it needs to be careful
> about the poison value. It might also mean we
> need to make sure poisoning happens after the hinting, not before.
The only issue with poisoning after instead of before is that the hint
is negated and we end up triggering a page fault and a zero-filled page
as a result. It might make more sense to have an architecture-specific
call that can be paravirtualized to handle the case of poisoning the
page for us if we have the unused page hint enabled. Otherwise the
write to the page is guaranteed to invalidate the hint.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
2019-02-11 18:10 ` Alexander Duyck
@ 2019-02-11 19:54 ` Michael S. Tsirkin
2019-02-11 21:00 ` Alexander Duyck
0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2019-02-11 19:54 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > >
> > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > performing one per 4K page.
> > > >
> > > > Even 2M pages start to get expensive with a TB guest.
> > >
> > > Agreed.
> > >
> > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > E.g. 256 entries, 2M each - that's more like it.
> > >
> > > The only issue I see with doing that is that we then have to defer the
> > > freeing. Doing that is going to introduce issues in the guest as we are
> > > going to have pages going unused for some period of time while we wait
> > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > this reason.
> >
> > Well nothing prevents us from doing an extra exit to the hypervisor if
> > we want. The asynchronous nature is there as an optimization
> > to allow hypervisor to do its thing on a separate CPU.
> > Why not proceed doing other things meanwhile?
> > And if the reason is that we are short on memory, then
> > maybe we should be less aggressive in hinting?
> >
> > E.g. if we just have 2 pages:
> >
> > hint page 1
> > page 1 hint processed?
> > yes - proceed to page 2
> > no - wait for interrupt
> >
> > get interrupt that page 1 hint is processed
> > hint page 2
> >
> >
> > If hypervisor happens to be running on same CPU it
> > can process things synchronously and we never enter
> > the no branch.
> >
>
> Another concern I would have about processing this asynchronously is
> that we have the potential for multiple guest CPUs to become
> bottlenecked by a single host CPU. I am not sure if that is something
> that would be desirable.
Well, with a hypercall per page the fix is to block the vCPU
completely, which is also not for everyone.
If you can't push a free page hint to host, then
ideally you just won't. That's a nice property of
hinting we have upstream right now.
Host too busy - hinting is just skipped.
> > > > > Using the huge TLB order became the obvious
> > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > order memory on the host.
> > > > >
> > > > > I have limited the functionality so that it doesn't work when page
> > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > cycles to do so.
> > > >
> > > > Again that's leaking host implementation detail into guest interface.
> > > >
> > > > We are giving guest page hints to host that makes sense,
> > > > weird interactions with other features due to host
> > > > implementation details should be handled by host.
> > >
> > > I don't view this as a host implementation detail, this is guest
> > > feature making use of all pages for debugging. If we are placing poison
> > > values in the page then I wouldn't consider them an unused page, it is
> > > being actively used to store the poison value.
> >
> > Well I guess it's a valid point of view for a kernel hacker, but they are
> > unused from application's point of view.
> > However poisoning is transparent to users and most distro users
> > are not aware of it going on. They just know that debug kernels
> > are slower.
> > User loading a debug kernel and immediately breaking overcommit
> > is an unpleasant experience.
>
> How would that be any different then a user loading an older kernel
> that doesn't have this feature and breaking overcommit as a result?
Well old kernel does not have the feature so nothing to debug.
When we have a new feature that goes away in the debug kernel,
that's a big support problem since this leads to heisenbugs.
> I still think it would be better if we left the poisoning enabled in
> such a case and just displayed a warning message if nothing else that
> hinting is disabled because of page poisoning.
>
> One other thought I had on this is that one side effect of page
> poisoning is probably that KSM would be able to merge all of the poison
> pages together into a single page since they are all set to the same
> values. So even with the poisoned pages it would be possible to reduce
> total memory overhead.
Right. And BTW one thing that host can do is pass
the hinted area to KSM for merging.
That requires an alloc hook to free it though.
Or we could add a per-VMA byte with the poison
value and use that on host to populate pages on fault.
> > > If we can achieve this
> > > and free the page back to the host then even better, but until the
> > > features can coexist we should not use the page hinting while page
> > > poisoning is enabled.
> >
> > Existing hinting in balloon allows them to coexist so I think we
> > need to set the bar just as high for any new variant.
>
> That is what I heard. I will have to look into this.
It's not doing anything smart right now, just checks
that poison == 0 and skips freeing if not.
But it can be enhanced transparently to guests.
> > > This is one of the reasons why I was opposed to just disabling page
> > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > guest has page poisoning enabled it is doing something with the page.
> > > It shouldn't be prevented from doing that because the host wants to
> > > have the option to free the pages.
> >
> > I agree but I think the decision belongs on the host. I.e.
> > hint the page but tell the host it needs to be careful
> > about the poison value. It might also mean we
> > need to make sure poisoning happens after the hinting, not before.
>
> The only issue with poisoning after instead of before is that the hint
> is ignored and we end up triggering a page fault and zero as a result.
> It might make more sense to have an architecture specific call that can
> be paravirtualized to handle the case of poisoning the page for us if
> we have the unused page hint enabled. Otherwise the write to the page
> is a given to invalidate the hint.
Sounds interesting. So the arch hook will first poison and
then pass the page to the host?
Or we could also ask the host to poison for us; the problem is that
this forces the host to either always write into the page or call
MADV_DONTNEED, whereas without it the host could use MADV_FREE. Maybe
that is not a big issue.
--
MST

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
2019-02-11 19:54 ` Michael S. Tsirkin
@ 2019-02-11 21:00 ` Alexander Duyck
2019-02-11 22:52 ` Michael S. Tsirkin
0 siblings, 1 reply; 55+ messages in thread
From: Alexander Duyck @ 2019-02-11 21:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, 2019-02-11 at 14:54 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> > On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > >
> > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > performing one per 4K page.
> > > > >
> > > > > Even 2M pages start to get expensive with a TB guest.
> > > >
> > > > Agreed.
> > > >
> > > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > > E.g. 256 entries, 2M each - that's more like it.
> > > >
> > > > The only issue I see with doing that is that we then have to defer the
> > > > freeing. Doing that is going to introduce issues in the guest as we are
> > > > going to have pages going unused for some period of time while we wait
> > > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > > this reason.
> > >
> > > Well nothing prevents us from doing an extra exit to the hypervisor if
> > > we want. The asynchronous nature is there as an optimization
> > > to allow hypervisor to do its thing on a separate CPU.
> > > Why not proceed doing other things meanwhile?
> > > And if the reason is that we are short on memory, then
> > > maybe we should be less aggressive in hinting?
> > >
> > > E.g. if we just have 2 pages:
> > >
> > > hint page 1
> > > page 1 hint processed?
> > > yes - proceed to page 2
> > > no - wait for interrupt
> > >
> > > get interrupt that page 1 hint is processed
> > > hint page 2
> > >
> > >
> > > If hypervisor happens to be running on same CPU it
> > > can process things synchronously and we never enter
> > > the no branch.
> > >
> >
> > Another concern I would have about processing this asynchronously is
> > that we have the potential for multiple guest CPUs to become
> > bottlenecked by a single host CPU. I am not sure if that is something
> > that would be desirable.
>
> Well with a hypercall per page the fix is to block VCPU
> completely which is also not for everyone.
>
> If you can't push a free page hint to host, then
> ideally you just won't. That's a nice property of
> hinting we have upstream right now.
> Host too busy - hinting is just skipped.
Right, but if you do that then there is a potential to end up missing
hints for a large portion of memory. It seems like you would end up
with even bigger issues, since at that point you have essentially
leaked memory.
I would think you would need a way to resync the host and the guest
after something like that. Otherwise you can have memory that will just
go unused for an extended period if a guest just goes idle.
> > > > > > Using the huge TLB order became the obvious
> > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > order memory on the host.
> > > > > >
> > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > cycles to do so.
> > > > >
> > > > > Again that's leaking host implementation detail into guest interface.
> > > > >
> > > > > We are giving guest page hints to host that makes sense,
> > > > > weird interactions with other features due to host
> > > > > implementation details should be handled by host.
> > > >
> > > > I don't view this as a host implementation detail, this is guest
> > > > feature making use of all pages for debugging. If we are placing poison
> > > > values in the page then I wouldn't consider them an unused page, it is
> > > > being actively used to store the poison value.
> > >
> > > Well I guess it's a valid point of view for a kernel hacker, but they are
> > > unused from application's point of view.
> > > However poisoning is transparent to users and most distro users
> > > are not aware of it going on. They just know that debug kernels
> > > are slower.
> > > User loading a debug kernel and immediately breaking overcommit
> > > is an unpleasant experience.
> >
> > How would that be any different then a user loading an older kernel
> > that doesn't have this feature and breaking overcommit as a result?
>
> Well old kernel does not have the feature so nothing to debug.
> When we have a new feature that goes away in the debug kernel,
> that's a big support problem since this leads to heisenbugs.
Trying to debug host features from the guest would be a pain anyway,
as a guest shouldn't even really know what its underlying setup is
supposed to be.
> > I still think it would be better if we left the poisoning enabled in
> > such a case and just displayed a warning message if nothing else that
> > hinting is disabled because of page poisoning.
> >
> > One other thought I had on this is that one side effect of page
> > poisoning is probably that KSM would be able to merge all of the poison
> > pages together into a single page since they are all set to the same
> > values. So even with the poisoned pages it would be possible to reduce
> > total memory overhead.
>
> Right. And BTW one thing that host can do is pass
> the hinted area to KSM for merging.
> That requires an alloc hook to free it though.
>
> Or we could add a per-VMA byte with the poison
> value and use that on host to populate pages on fault.
>
>
> > > > If we can achieve this
> > > > and free the page back to the host then even better, but until the
> > > > features can coexist we should not use the page hinting while page
> > > > poisoning is enabled.
> > >
> > > Existing hinting in balloon allows them to coexist so I think we
> > > need to set the bar just as high for any new variant.
> >
> > That is what I heard. I will have to look into this.
>
> It's not doing anything smart right now, just checks
> that poison == 0 and skips freeing if not.
> But it can be enhanced transparently to guests.
Okay, so it probably should be extended to add something like a
poison page that could replace the zero page for reads from a page
that has been unmapped.
> > > > This is one of the reasons why I was opposed to just disabling page
> > > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > > guest has page poisoning enabled it is doing something with the page.
> > > > It shouldn't be prevented from doing that because the host wants to
> > > > have the option to free the pages.
> > >
> > > I agree but I think the decision belongs on the host. I.e.
> > > hint the page but tell the host it needs to be careful
> > > about the poison value. It might also mean we
> > > need to make sure poisoning happens after the hinting, not before.
> >
> > The only issue with poisoning after instead of before is that the hint
> > is ignored and we end up triggering a page fault and zero as a result.
> > It might make more sense to have an architecture specific call that can
> > be paravirtualized to handle the case of poisoning the page for us if
> > we have the unused page hint enabled. Otherwise the write to the page
> > is a given to invalidate the hint.
>
> Sounds interesting. So the arch hook will first poison and
> then pass the page to the host?
>
> Or we can also ask the host to poison for us, problem is this forces
> host to either always write into page, or call MADV_DONTNEED,
> without it could do MADV_FREE. Maybe that is not a big issue.
I would think we would ask the host to poison for us. If I am not
mistaken both solutions right now are using MADV_DONTNEED. I would tend
to lean that way if we are doing page poisoning since the cost for
zeroing/poisoning the page on the host could be canceled out by
dropping the page poisoning on the guest.
Then again, since we are doing higher-order pages only, and the
poisoning is supposed to happen before we get into __free_one_page, we
would probably have to do both the poisoning and the poison on fault.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
2019-02-11 21:00 ` Alexander Duyck
@ 2019-02-11 22:52 ` Michael S. Tsirkin
[not found] ` <94462313ccd927d25675f69de459456cf066c1a2.camel@linux.intel.com>
0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2019-02-11 22:52 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 01:00:53PM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 14:54 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> > > On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > > >
> > > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > > performing one per 4K page.
> > > > > >
> > > > > > Even 2M pages start to get expensive with a TB guest.
> > > > >
> > > > > Agreed.
> > > > >
> > > > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > > > E.g. 256 entries, 2M each - that's more like it.
> > > > >
> > > > > The only issue I see with doing that is that we then have to defer the
> > > > > freeing. Doing that is going to introduce issues in the guest as we are
> > > > > going to have pages going unused for some period of time while we wait
> > > > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > > > this reason.
> > > >
> > > > Well nothing prevents us from doing an extra exit to the hypervisor if
> > > > we want. The asynchronous nature is there as an optimization
> > > > to allow hypervisor to do its thing on a separate CPU.
> > > > Why not proceed doing other things meanwhile?
> > > > And if the reason is that we are short on memory, then
> > > > maybe we should be less aggressive in hinting?
> > > >
> > > > E.g. if we just have 2 pages:
> > > >
> > > > hint page 1
> > > > page 1 hint processed?
> > > > yes - proceed to page 2
> > > > no - wait for interrupt
> > > >
> > > > get interrupt that page 1 hint is processed
> > > > hint page 2
> > > >
> > > >
> > > > If hypervisor happens to be running on same CPU it
> > > > can process things synchronously and we never enter
> > > > the no branch.
> > > >
> > >
> > > Another concern I would have about processing this asynchronously is
> > > that we have the potential for multiple guest CPUs to become
> > > bottlenecked by a single host CPU. I am not sure if that is something
> > > that would be desirable.
> >
> > Well with a hypercall per page the fix is to block VCPU
> > completely which is also not for everyone.
> >
> > If you can't push a free page hint to host, then
> > ideally you just won't. That's a nice property of
> > hinting we have upstream right now.
> > Host too busy - hinting is just skipped.
>
> Right, but if you do that then there is a potential to end up missing
> hints for a large portion of memory. It seems like you would end up
> with even bigger issues since then at that point you have essentially
> leaked memory.
> I would think you would need a way to resync the host and the guest
> after something like that. Otherwise you can have memory that will just
> go unused for an extended period if a guest just goes idle.
Yes, and that is my point. The existing hinting code will just take a
page off the free list in that case, so it resyncs using the free list.
Something like this could work, then: mark hinted pages with a flag
(it's easy to find unused flags for free pages); then, when you get an
interrupt because outstanding hints have been consumed, get
unflagged/unhinted pages from the buddy allocator and pass them to the
host.
>
> > > > > > > Using the huge TLB order became the obvious
> > > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > > order memory on the host.
> > > > > > >
> > > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > > cycles to do so.
> > > > > >
> > > > > > Again that's leaking host implementation detail into guest interface.
> > > > > >
> > > > > > We are giving guest page hints to host that makes sense,
> > > > > > weird interactions with other features due to host
> > > > > > implementation details should be handled by host.
> > > > >
> > > > > I don't view this as a host implementation detail, this is guest
> > > > > feature making use of all pages for debugging. If we are placing poison
> > > > > values in the page then I wouldn't consider them an unused page, it is
> > > > > being actively used to store the poison value.
> > > >
> > > > Well I guess it's a valid point of view for a kernel hacker, but they are
> > > > unused from application's point of view.
> > > > However poisoning is transparent to users and most distro users
> > > > are not aware of it going on. They just know that debug kernels
> > > > are slower.
> > > > User loading a debug kernel and immediately breaking overcommit
> > > > is an unpleasant experience.
> > >
> > > How would that be any different then a user loading an older kernel
> > > that doesn't have this feature and breaking overcommit as a result?
> >
> > Well old kernel does not have the feature so nothing to debug.
> > When we have a new feature that goes away in the debug kernel,
> > that's a big support problem since this leads to heisenbugs.
>
> Trying to debug host features from the guest would be a pain anyway as
> a guest shouldn't even really know what the underlying setup of the
> guest is supposed to be.
I'm talking about debugging the guest though.
> > > I still think it would be better if we left the poisoning enabled in
> > > such a case and just displayed a warning message if nothing else that
> > > hinting is disabled because of page poisoning.
> > >
> > > One other thought I had on this is that one side effect of page
> > > poisoning is probably that KSM would be able to merge all of the poison
> > > pages together into a single page since they are all set to the same
> > > values. So even with the poisoned pages it would be possible to reduce
> > > total memory overhead.
> >
> > Right. And BTW one thing that host can do is pass
> > the hinted area to KSM for merging.
> > That requires an alloc hook to free it though.
> >
> > Or we could add a per-VMA byte with the poison
> > value and use that on host to populate pages on fault.
> >
> >
> > > > > If we can achieve this
> > > > > and free the page back to the host then even better, but until the
> > > > > features can coexist we should not use the page hinting while page
> > > > > poisoning is enabled.
> > > >
> > > > Existing hinting in balloon allows them to coexist so I think we
> > > > need to set the bar just as high for any new variant.
> > >
> > > That is what I heard. I will have to look into this.
> >
> > It's not doing anything smart right now, just checks
> > that poison == 0 and skips freeing if not.
> > But it can be enhanced transparently to guests.
>
> Okay, so it probably should be extended to add something like poison
> page that could replace the zero page for reads to a page that has been
> unmapped.
>
> > > > > This is one of the reasons why I was opposed to just disabling page
> > > > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > > > guest has page poisoning enabled it is doing something with the page.
> > > > > It shouldn't be prevented from doing that because the host wants to
> > > > > have the option to free the pages.
> > > >
> > > > I agree but I think the decision belongs on the host. I.e.
> > > > hint the page but tell the host it needs to be careful
> > > > about the poison value. It might also mean we
> > > > need to make sure poisoning happens after the hinting, not before.
> > >
> > > The only issue with poisoning after instead of before is that the hint
> > > is ignored and we end up triggering a page fault and zero as a result.
> > > It might make more sense to have an architecture specific call that can
> > > be paravirtualized to handle the case of poisoning the page for us if
> > > we have the unused page hint enabled. Otherwise the write to the page
> > > is a given to invalidate the hint.
> >
> > Sounds interesting. So the arch hook will first poison and
> > then pass the page to the host?
> >
> > Or we can also ask the host to poison for us, problem is this forces
> > host to either always write into page, or call MADV_DONTNEED,
> > without it could do MADV_FREE. Maybe that is not a big issue.
>
> I would think we would ask the host to poison for us. If I am not
> mistaken both solutions right now are using MADV_DONTNEED. I would tend
> to lean that way if we are doing page poisoning since the cost for
> zeroing/poisoning the page on the host could be canceled out by
> dropping the page poisoning on the guest.
>
> Then again since we are doing higher order pages only, and the
> poisoning is supposed to happen before we get into __free_one_page we
> would probably have to do both the poisoning, and the poison on fault.
Oh that's a nice trick. So in fact if we just make sure
we never report PAGE_SIZE pages then poisoning will
automatically happen before reporting?
So we just need to teach host to poison on fault.
Sounds cool and we can always optimize further later.
--
MST

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Michael S. Tsirkin @ 2019-02-12 0:34 UTC (permalink / raw)
To: Alexander Duyck
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 04:09:53PM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 17:52 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 01:00:53PM -0800, Alexander Duyck wrote:
> > > On Mon, 2019-02-11 at 14:54 -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> > > > > On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > > > > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > > > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > > > > >
> > > > > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > > > > performing one per 4K page.
> > > > > > > >
> > > > > > > > Even 2M pages start to get expensive with a TB guest.
> > > > > > >
> > > > > > > Agreed.
> > > > > > >
> > > > > > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > > > > > E.g. 256 entries, 2M each - that's more like it.
> > > > > > >
> > > > > > > The only issue I see with doing that is that we then have to defer the
> > > > > > > freeing. Doing that is going to introduce issues in the guest as we are
> > > > > > > going to have pages going unused for some period of time while we wait
> > > > > > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > > > > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > > > > > this reason.
> > > > > >
> > > > > > Well nothing prevents us from doing an extra exit to the hypervisor if
> > > > > > we want. The asynchronous nature is there as an optimization
> > > > > > to allow hypervisor to do its thing on a separate CPU.
> > > > > > Why not proceed doing other things meanwhile?
> > > > > > And if the reason is that we are short on memory, then
> > > > > > maybe we should be less aggressive in hinting?
> > > > > >
> > > > > > E.g. if we just have 2 pages:
> > > > > >
> > > > > > hint page 1
> > > > > > page 1 hint processed?
> > > > > > yes - proceed to page 2
> > > > > > no - wait for interrupt
> > > > > >
> > > > > > get interrupt that page 1 hint is processed
> > > > > > hint page 2
> > > > > >
> > > > > >
> > > > > > If hypervisor happens to be running on same CPU it
> > > > > > can process things synchronously and we never enter
> > > > > > the no branch.
> > > > > >
> > > > >
> > > > > Another concern I would have about processing this asynchronously is
> > > > > that we have the potential for multiple guest CPUs to become
> > > > > bottlenecked by a single host CPU. I am not sure if that is something
> > > > > that would be desirable.
> > > >
> > > > Well, with a hypercall per page the fix is to block the VCPU
> > > > completely, which is also not for everyone.
> > > >
> > > > If you can't push a free page hint to host, then
> > > > ideally you just won't. That's a nice property of
> > > > hinting we have upstream right now.
> > > > Host too busy - hinting is just skipped.
> > >
> > > Right, but if you do that then there is a potential to end up missing
> > > hints for a large portion of memory. It seems like you would end up
> > > with even bigger issues since at that point you have essentially
> > > leaked memory.
> > > I would think you would need a way to resync the host and the guest
> > > after something like that. Otherwise you can have memory that will just
> > > go unused for an extended period if a guest just goes idle.
> >
> > Yes and that is my point. Existing hints code will just take a page off
> > the free list in that case so it resyncs using the free list.
> >
> > Something like this could work then: mark up
> > hinted pages with a flag (it's easy to find unused
> > flags for free pages) then when you get an interrupt
> > because outstanding hints have been consumed,
> > get unflagged/unhinted pages from buddy and pass
> > them to host.
>
> Ugh. This is beginning to sound like yet another daemon that will have
> to be running to handle missed sync events.
Why a daemon? Not at all. You get an interrupt, you schedule
a wq immediately or just do it from the interrupt handler.
> I really think that taking an async approach for this will be nothing
> but trouble. You are going to have a difficult time maintaining any
> sort of coherency on the freelist without the daemon having to take the
> zone lock and then notify the host of what is free and what isn't.
We seem to be doing fine without zone lock for now.
Just plain alloc_pages.
> > >
> > > > > > > > > Using the huge TLB order became the obvious
> > > > > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > > > > order memory on the host.
> > > > > > > > >
> > > > > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > > > > cycles to do so.
> > > > > > > >
> > > > > > > > Again that's leaking host implementation detail into guest interface.
> > > > > > > >
> > > > > > > > We are giving guest page hints to host that makes sense,
> > > > > > > > weird interactions with other features due to host
> > > > > > > > implementation details should be handled by host.
> > > > > > >
> > > > > > > I don't view this as a host implementation detail, this is guest
> > > > > > > feature making use of all pages for debugging. If we are placing poison
> > > > > > > values in the page then I wouldn't consider them an unused page, it is
> > > > > > > being actively used to store the poison value.
> > > > > >
> > > > > > Well I guess it's a valid point of view for a kernel hacker, but they are
> > > > > > unused from the application's point of view.
> > > > > > However poisoning is transparent to users and most distro users
> > > > > > are not aware of it going on. They just know that debug kernels
> > > > > > are slower.
> > > > > > A user loading a debug kernel and immediately breaking overcommit
> > > > > > is an unpleasant experience.
> > > > >
> > > > > How would that be any different than a user loading an older kernel
> > > > > that doesn't have this feature and breaking overcommit as a result?
> > > >
> > > > Well old kernel does not have the feature so nothing to debug.
> > > > When we have a new feature that goes away in the debug kernel,
> > > > that's a big support problem since this leads to heisenbugs.
> > >
> > > Trying to debug host features from the guest would be a pain anyway as
> > > a guest shouldn't even really know what the underlying setup of the
> > > host is supposed to be.
> >
> > I'm talking about debugging the guest though.
>
> Right. But my point is if it is a guest feature related to memory that
> you are debugging, then disabling the page hinting would probably be an
> advisable step anyway since it would have the potential for memory
> corruptions itself due to its nature.
Oh absolutely. So that's why I wanted debug kernel to be
as close as possible to non-debug one in that respect.
If one gets a corruption we want it reproducible on debug too.
> > > > > I still think it would be better if we left the poisoning enabled in
> > > > > such a case and just displayed a warning message if nothing else that
> > > > > hinting is disabled because of page poisoning.
> > > > >
> > > > > One other thought I had on this is that one side effect of page
> > > > > poisoning is probably that KSM would be able to merge all of the poison
> > > > > pages together into a single page since they are all set to the same
> > > > > values. So even with the poisoned pages it would be possible to reduce
> > > > > total memory overhead.
> > > >
> > > > Right. And BTW one thing that host can do is pass
> > > > the hinted area to KSM for merging.
> > > > That requires an alloc hook to free it though.
> > > >
> > > > Or we could add a per-VMA byte with the poison
> > > > value and use that on host to populate pages on fault.
> > > >
> > > >
> > > > > > > If we can achieve this
> > > > > > > and free the page back to the host then even better, but until the
> > > > > > > features can coexist we should not use the page hinting while page
> > > > > > > poisoning is enabled.
> > > > > >
> > > > > > Existing hinting in balloon allows them to coexist so I think we
> > > > > > need to set the bar just as high for any new variant.
> > > > >
> > > > > That is what I heard. I will have to look into this.
> > > >
> > > > It's not doing anything smart right now, just checks
> > > > that poison == 0 and skips freeing if not.
> > > > But it can be enhanced transparently to guests.
> > >
> > > Okay, so it probably should be extended to add something like poison
> > > page that could replace the zero page for reads to a page that has been
> > > unmapped.
> > >
> > > > > > > This is one of the reasons why I was opposed to just disabling page
> > > > > > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > > > > > guest has page poisoning enabled it is doing something with the page.
> > > > > > > It shouldn't be prevented from doing that because the host wants to
> > > > > > > have the option to free the pages.
> > > > > >
> > > > > > I agree but I think the decision belongs on the host. I.e.
> > > > > > hint the page but tell the host it needs to be careful
> > > > > > about the poison value. It might also mean we
> > > > > > need to make sure poisoning happens after the hinting, not before.
> > > > >
> > > > > The only issue with poisoning after instead of before is that the hint
> > > > > is ignored and we end up triggering a page fault and a zeroed page as a
> > > > > result. It might make more sense to have an architecture specific call
> > > > > that can be paravirtualized to handle the case of poisoning the page for
> > > > > us if we have the unused page hint enabled. Otherwise the write to the
> > > > > page is guaranteed to invalidate the hint.
> > > >
> > > > Sounds interesting. So the arch hook will first poison and
> > > > then pass the page to the host?
> > > >
> > > > Or we can also ask the host to poison for us; the problem is this forces
> > > > the host to either always write into the page or call MADV_DONTNEED,
> > > > whereas without it the host could do MADV_FREE. Maybe that is not a big issue.
> > >
> > > I would think we would ask the host to poison for us. If I am not
> > > mistaken both solutions right now are using MADV_DONTNEED. I would tend
> > > to lean that way if we are doing page poisoning since the cost for
> > > zeroing/poisoning the page on the host could be canceled out by
> > > dropping the page poisoning on the guest.
> > >
> > > Then again since we are doing higher order pages only, and the
> > > poisoning is supposed to happen before we get into __free_one_page we
> > > would probably have to do both the poisoning, and the poison on fault.
> >
> >
> > Oh that's a nice trick. So in fact if we just make sure
> > we never report PAGE_SIZE pages then poisoning will
> > automatically happen before reporting?
> > So we just need to teach host to poison on fault.
> > Sounds cool and we can always optimize further later.
>
> That is kind of what I was thinking. In the grand scheme of things I
> figure most of the expense is in the fault and page zeroing bits of the
> code path. I have done a bit of testing today with the patch that just
> drops the messages if a device is assigned, and just the hypercall bits
> are only causing about a 2.5% regression in performance on a will-it-
> scale/page-fault1 test. However if I commit to the full setup with the
> madvise, page fault, and zeroing then I am seeing an 11.5% drop in
> performance.
>
> I think in order to really make this pay off we may need to look into
> avoiding zeroing or poisoning the page in both the host and the guest.
> I will have to look into some things as it looks like somebody from
> Intel may have been working on addressing that, based on the
> presentation I found at the link below:
>
> https://www.lfasiallc.com/wp-content/uploads/2017/11/Use-Hyper-V-Enlightenments-to-Increase-KVM-VM-Performance_Density_Chao-Peng.pdf
>

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Michael S. Tsirkin @ 2019-02-11 17:58 UTC (permalink / raw)
To: Dave Hansen
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar,
alexander.h.duyck, x86, mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 09:48:11AM -0800, Dave Hansen wrote:
> On 2/9/19 4:49 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> >> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >>
> >> Add guest support for providing free memory hints to the KVM hypervisor for
> >> freed pages huge TLB size or larger. I am restricting the size to
> >> huge TLB order and larger because the hypercalls are too expensive to be
> >> performing one per 4K page.
> > Even 2M pages start to get expensive with a TB guest.
>
> Yeah, but we don't allocate and free TB's of memory at a high frequency.
>
> > Really it seems we want a virtio ring so we can pass a batch of these.
> > E.g. 256 entries, 2M each - that's more like it.
>
> That only makes sense for a system that's doing high-frequency,
> discontiguous frees of 2M pages. Right now, a 2M free/realloc cycle
> (THP or hugetlb) is *not* super-high frequency just because of the
> latency for zeroing the page.
Heh but with a ton of free memory, and a thread zeroing some of
it out in the background, will this still be the case?
It could be that we'll be able to find clean pages
at all times.
> A virtio ring seems like an overblown solution to a non-existent problem.
It would be nice to see some traces to help us decide one way or the other.
--
MST

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Dave Hansen @ 2019-02-11 18:19 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar,
alexander.h.duyck, x86, mingo, bp, hpa, pbonzini, tglx, akpm
On 2/11/19 9:58 AM, Michael S. Tsirkin wrote:
>>> Really it seems we want a virtio ring so we can pass a batch of these.
>>> E.g. 256 entries, 2M each - that's more like it.
>> That only makes sense for a system that's doing high-frequency,
>> discontiguous frees of 2M pages. Right now, a 2M free/realloc cycle
>> (THP or hugetlb) is *not* super-high frequency just because of the
>> latency for zeroing the page.
> Heh but with a ton of free memory, and a thread zeroing some of
> it out in the background, will this still be the case?
> It could be that we'll be able to find clean pages
> at all times.
In a system where we have some asynchronous zeroing of memory, where
freed, non-zeroed memory is sequestered out of the allocator, yeah, that
could make sense.
But, that's not what we have today.
>> A virtio ring seems like an overblown solution to a non-existent problem.
> It would be nice to see some traces to help us decide one way or the other.
Yeah, agreed. Sounds like we need some more testing to see if these
approaches hit bottlenecks anywhere.

*Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
From: Michael S. Tsirkin @ 2019-02-11 19:56 UTC (permalink / raw)
To: Dave Hansen
Cc: Alexander Duyck, linux-mm, linux-kernel, kvm, rkrcmar,
alexander.h.duyck, x86, mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 11, 2019 at 10:19:17AM -0800, Dave Hansen wrote:
> On 2/11/19 9:58 AM, Michael S. Tsirkin wrote:
> >>> Really it seems we want a virtio ring so we can pass a batch of these.
> >>> E.g. 256 entries, 2M each - that's more like it.
> >> That only makes sense for a system that's doing high-frequency,
> >> discontiguous frees of 2M pages. Right now, a 2M free/realloc cycle
> >> (THP or hugetlb) is *not* super-high frequency just because of the
> >> latency for zeroing the page.
> > Heh but with a ton of free memory, and a thread zeroing some of
> > it out in the background, will this still be the case?
> > It could be that we'll be able to find clean pages
> > at all times.
>
> In a system where we have some asynchronous zeroing of memory, where
> freed, non-zeroed memory is sequestered out of the allocator, yeah, that
> could make sense.
>
> But, that's not what we have today.
Right. I wonder whether it's smart to build this assumption
into a host/guest interface though.
> >> A virtio ring seems like an overblown solution to a non-existent problem.
> > It would be nice to see some traces to help us decide one way or the other.
>
> Yeah, agreed. Sounds like we need some more testing to see if these
> approaches hit bottlenecks anywhere.

*Re: [RFC PATCH 4/4] mm: Add merge page notifier
From: Alexander Duyck @ 2019-02-04 19:51 UTC (permalink / raw)
To: Dave Hansen, Alexander Duyck, linux-mm, linux-kernel, kvm
Cc: rkrcmar, x86, mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, 2019-02-04 at 11:40 -0800, Dave Hansen wrote:
> > +void __arch_merge_page(struct zone *zone, struct page *page,
> > + unsigned int order)
> > +{
> > + /*
> > + * The merging logic has merged a set of buddies up to the
> > + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> > + * advantage of this moment to notify the hypervisor of the free
> > + * memory.
> > + */
> > + if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > + return;
> > +
> > + /*
> > + * Drop zone lock while processing the hypercall. This
> > + * should be safe as the page has not yet been added
> > + * to the buddy list as of yet and all the pages that
> > + * were merged have had their buddy/guard flags cleared
> > + * and their order reset to 0.
> > + */
> > + spin_unlock(&zone->lock);
> > +
> > + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > + PAGE_SIZE << order);
> > +
> > + /* reacquire lock and resume freeing memory */
> > + spin_lock(&zone->lock);
> > +}
>
> Why do the lock-dropping on merge but not free? What's the difference?
The lock has not yet been acquired in the free path. The arch_free_page
call is made from free_pages_prepare, whereas the arch_merge_page call
is made from within __free_one_page which has the requirement that the
zone lock be taken before calling the function.
> This makes me really nervous. You at *least* want to document this at
> the arch_merge_page() call-site, and perhaps even the __free_one_page()
> call-sites because they're near where the zone lock is taken.
Okay, that makes sense. I would probably look at adding the
documentation to the arch_merge_page call-site.
> The place you are calling arch_merge_page() looks OK to me, today. But,
> it can't get moved around without careful consideration. That also
> needs to be documented to warn off folks who might move code around.
Agreed.
> The interaction between the free and merge hooks is also really
> implementation-specific. If an architecture is getting order-0
> arch_free_page() notifications, it's probably worth at least documenting
> that they'll *also* get arch_merge_page() notifications.
If an architecture is getting order-0 notifications then the merge
notifications would be pointless since all the pages would be already
hinted.
I can add documentation that explains that in the case where we are
only hinting on non-zero order pages then arch_merge_page should
provide hints for when a page is merged above that threshold.
> The reason x86 doesn't double-hypercall on those is not broached in the
> descriptions. That seems to be problematic.
I will add more documentation to address that.

*Re: [RFC PATCH 4/4] mm: Add merge page notifier
From: Aaron Lu @ 2019-02-12 2:09 UTC (permalink / raw)
To: Alexander Duyck, Alexander Duyck, linux-mm, linux-kernel, kvm
Cc: rkrcmar, x86, mingo, bp, hpa, pbonzini, tglx, akpm
On 2019/2/11 23:58, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 14:40 +0800, Aaron Lu wrote:
>> On 2019/2/5 2:15, Alexander Duyck wrote:
>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> Because the implementation was limiting itself to only providing hints on
>>> pages huge TLB order sized or larger we introduced the possibility for free
>>> pages to slip past us because they are freed as something less than
>>> huge TLB in size and aggregated with buddies later.
>>>
>>> To address that I am adding a new call arch_merge_page which is called
>>> after __free_one_page has merged a pair of pages to create a higher order
>>> page. By doing this I am able to fill the gap and provide full coverage for
>>> all of the pages huge TLB order or larger.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>> ---
>>> arch/x86/include/asm/page.h | 12 ++++++++++++
>>> arch/x86/kernel/kvm.c | 28 ++++++++++++++++++++++++++++
>>> include/linux/gfp.h | 4 ++++
>>> mm/page_alloc.c | 2 ++
>>> 4 files changed, 46 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>> index 4487ad7a3385..9540a97c9997 100644
>>> --- a/arch/x86/include/asm/page.h
>>> +++ b/arch/x86/include/asm/page.h
>>> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>>> if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> __arch_free_page(page, order);
>>> }
>>> +
>>> +struct zone;
>>> +
>>> +#define HAVE_ARCH_MERGE_PAGE
>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>> + unsigned int order);
>>> +static inline void arch_merge_page(struct zone *zone, struct page *page,
>>> + unsigned int order)
>>> +{
>>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> + __arch_merge_page(zone, page, order);
>>> +}
>>> #endif
>>>
>>> #include <linux/range.h>
>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>> index 09c91641c36c..957bb4f427bb 100644
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>>> PAGE_SIZE << order);
>>> }
>>>
>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>> + unsigned int order)
>>> +{
>>> + /*
>>> + * The merging logic has merged a set of buddies up to the
>>> + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
>>> + * advantage of this moment to notify the hypervisor of the free
>>> + * memory.
>>> + */
>>> + if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
>>> + return;
>>> +
>>> + /*
>>> + * Drop zone lock while processing the hypercall. This
>>> + * should be safe as the page has not yet been added
>>> + * to the buddy list as of yet and all the pages that
>>> + * were merged have had their buddy/guard flags cleared
>>> + * and their order reset to 0.
>>> + */
>>> + spin_unlock(&zone->lock);
>>> +
>>> + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
>>> + PAGE_SIZE << order);
>>> +
>>> + /* reacquire lock and resume freeing memory */
>>> + spin_lock(&zone->lock);
>>> +}
>>> +
>>> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>
>>> /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index fdab7de7490d..4746d5560193 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>> #ifndef HAVE_ARCH_FREE_PAGE
>>> static inline void arch_free_page(struct page *page, int order) { }
>>> #endif
>>> +#ifndef HAVE_ARCH_MERGE_PAGE
>>> +static inline void
>>> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
>>> +#endif
>>> #ifndef HAVE_ARCH_ALLOC_PAGE
>>> static inline void arch_alloc_page(struct page *page, int order) { }
>>> #endif
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c954f8c1fbc4..7a1309b0b7c5 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>>> page = page + (combined_pfn - pfn);
>>> pfn = combined_pfn;
>>> order++;
>>> +
>>> + arch_merge_page(zone, page, order);
>>
>> Not a proper place AFAICS.
>>
>> Assume we have an order-8 page being sent here for merge and its order-8
>> buddy is also free, then order++ became 9 and arch_merge_page() will do
>> the hint to host on this page as an order-9 page, no problem so far.
>> Then the next round, assume the now order-9 page's buddy is also free,
>> order++ will become 10 and arch_merge_page() will again hint to host on
>> this page as an order-10 page. The first hint to host became redundant.
>
> Actually the problem is even worse the other way around. My concern was
> pages being incrementally freed.
>
> With this setup I can catch when we have crossed the threshold from
> order 8 to 9, and specifically for that case provide the hint. This
> allows me to ignore orders above and below 9.
OK, I see, you are now only hinting for pages with order 9, not above.
> If I move the hint to the spot after the merging I have no way of
> telling if I have hinted the page as a lower order or not. As such I
> will hint if it is merged up to orders 9 or greater. So for example if
> it merges up to order 9 and stops there then done_merging will report
> an order 9 page, then if another page is freed and merged with this up
> to order 10 you would be hinting on order 10. By placing the function
> here I can guarantee that no more than 1 hint is provided per 2MB page.
So what's the downside of hinting the page as order-10 after merge
compared to as order-9 before the merge? I can see the same physical
range can be hinted multiple times, but the total hint number is the
same: both are 2 - in your current implementation, we hint twice for
each of the 2 order-9 pages; alternatively, we can provide hint for one
order-9 page and the merged order-10 page. I think the cost of the
hypercalls is the same? Is it that we want to ease the host side
madvise(DONTNEED) since we can avoid operating the same range multiple
times?
The reason I asked is, if we can move the arch_merge_page() after
done_merging tag, we can theoretically make fewer function calls on free
path for the guest. Maybe not a big deal, I don't know...
>> I think the proper place is after the done_merging tag.
>>
>> BTW, with arch_merge_page() at the proper place, I don't think patch3/4
>> is necessary - any freed page will go through merge anyway, we won't
>> lose any hint opportunity. Or do I miss anything?
>
> You can refer to my comment above. What I want to avoid is us hinting a
> page multiple times if we aren't using MAX_ORDER - 1 as the limit. What
Yeah that's a good point. But is this going to happen?
> I am avoiding by placing this where I did is us doing a hint on orders
> greater than our target hint order. So with this way I only perform one
> hint per 2MB page, otherwise I would be performing multiple hints per
> 2MB page as every order above that would also trigger hints.
>

*Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host
From: Nitesh Narayan Lal @ 2019-02-05 17:25 UTC (permalink / raw)
To: Alexander Duyck, linux-mm, linux-kernel, kvm
Cc: rkrcmar, alexander.h.duyck, x86, mingo, bp, hpa, pbonzini, tglx,
akpm, Luiz Capitulino, David Hildenbrand, Pankaj Gupta
On 2/4/19 1:15 PM, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.
>
> In order to support this I have added a new hypercall to provide unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.
Hi Alexander,
Can you share the host memory usage before and after your run (in both
cases, with and without your patch set)?
>
> ---
>
> Alexander Duyck (4):
> madvise: Expose ability to set dontneed from kernel
> kvm: Add host side support for free memory hints
> kvm: Add guest side support for free memory hints
> mm: Add merge page notifier
>
>
> Documentation/virtual/kvm/cpuid.txt | 4 ++
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
> arch/x86/include/asm/page.h | 25 +++++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 3 ++
> arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
> arch/x86/kvm/cpuid.c | 6 +++-
> arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
> include/linux/gfp.h | 4 ++
> include/linux/mm.h | 2 +
> include/uapi/linux/kvm_para.h | 1 +
> mm/madvise.c | 13 +++++++-
> mm/page_alloc.c | 2 +
> 12 files changed, 158 insertions(+), 2 deletions(-)
>
> --
--
Regards
Nitesh
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

*Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host
2019-02-05 17:25 ` [RFC PATCH 0/4] kvm: Report unused guest pages to host Nitesh Narayan Lal
@ 2019-02-05 18:43 ` Alexander Duyck
0 siblings, 0 replies; 55+ messages in thread
From: Alexander Duyck @ 2019-02-05 18:43 UTC (permalink / raw)
To: Nitesh Narayan Lal, Alexander Duyck, linux-mm, linux-kernel, kvm
Cc: rkrcmar, x86, mingo, bp, hpa, pbonzini, tglx, akpm,
Luiz Capitulino, David Hildenbrand, Pankaj Gupta
On Tue, 2019-02-05 at 12:25 -0500, Nitesh Narayan Lal wrote:
> On 2/4/19 1:15 PM, Alexander Duyck wrote:
> > This patch set provides a mechanism by which guests can notify the host of
> > pages that are not currently in use. Using this data a KVM host can more
> > easily balance memory workloads between guests and improve overall system
> > performance by avoiding unnecessary writing of unused pages to swap.
> >
> > In order to support this I have added a new hypercall to provide unused
> > page hints and made use of mechanisms currently used by PowerPC and s390
> > architectures to provide those hints. To reduce the overhead of this call
> > I am only using it per huge page instead of doing a notification per 4K
> > page. By doing this we can avoid the expense of fragmenting higher order
> > pages, and reduce overall cost for the hypercall as it will only be
> > performed once per huge page.
> >
> > Because we are limiting this to huge pages it was necessary to add a
> > secondary location where we make the call as the buddy allocator can merge
> > smaller pages into a higher order huge page.
> >
> > This approach is not usable in all cases. Specifically, when KVM direct
> > device assignment is used, the memory for a guest is permanently assigned
> > to physical pages in order to support DMA from the assigned device. In
> > this case we cannot give the pages back, so the hypercall is disabled by
> > the host.
> >
> > Another situation that can lead to issues is if the page were accessed
> > immediately after free. For example, if page poisoning is enabled the
> > guest will populate the page *after* freeing it. In this case it does not
> > make sense to provide a hint about the page being freed so we do not
> > perform the hypercalls from the guest if this functionality is enabled.
> >
> > My testing up till now has consisted of setting up 4 8GB VMs on a system
> > with 32GB of memory and 4GB of swap. To stress the memory on the system I
> > would run "memhog 8G" sequentially on each of the guests and observe how
> > long it took to complete the run. The observed behavior is that on the
> > systems with these patches applied in both the guest and on the host I was
> > able to complete the test with a time of 5 to 7 seconds per guest. On a
> > system without these patches the time ranged from 7 to 49 seconds per
> > guest. I am assuming the variability is due to time being spent writing
> > pages out to disk in order to free up space for the guest.
>
> Hi Alexander,
>
> Can you share the host memory usage before and after your run (in both
> cases, with and without your patch set)?
Here are some snippets from /proc/meminfo for the system, both
before and after the test.
W/O patch
-- Before --
MemTotal: 32881396 kB
MemFree: 21363724 kB
MemAvailable: 25891228 kB
Buffers: 2276 kB
Cached: 4760280 kB
SwapCached: 0 kB
Active: 7166952 kB
Inactive: 1474980 kB
Active(anon): 3893308 kB
Inactive(anon): 8776 kB
Active(file): 3273644 kB
Inactive(file): 1466204 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 4194300 kB
Dirty: 29812 kB
Writeback: 0 kB
AnonPages: 3896540 kB
Mapped: 75568 kB
Shmem: 10044 kB
-- After --
MemTotal: 32881396 kB
MemFree: 194668 kB
MemAvailable: 51356 kB
Buffers: 24 kB
Cached: 129036 kB
SwapCached: 224396 kB
Active: 27223304 kB
Inactive: 2589736 kB
Active(anon): 27220360 kB
Inactive(anon): 2481592 kB
Active(file): 2944 kB
Inactive(file): 108144 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 35616 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 29476628 kB
Mapped: 22820 kB
Shmem: 5516 kB
W/ patch
-- Before --
MemTotal: 32881396 kB
MemFree: 26618880 kB
MemAvailable: 27056004 kB
Buffers: 2276 kB
Cached: 781496 kB
SwapCached: 0 kB
Active: 3309056 kB
Inactive: 393796 kB
Active(anon): 2932728 kB
Inactive(anon): 8776 kB
Active(file): 376328 kB
Inactive(file): 385020 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 4194300 kB
Dirty: 96 kB
Writeback: 0 kB
AnonPages: 2935964 kB
Mapped: 75428 kB
Shmem: 10048 kB
-- After --
MemTotal: 32881396 kB
MemFree: 22677904 kB
MemAvailable: 26543092 kB
Buffers: 2276 kB
Cached: 4205908 kB
SwapCached: 0 kB
Active: 3863016 kB
Inactive: 3768596 kB
Active(anon): 3437368 kB
Inactive(anon): 8772 kB
Active(file): 425648 kB
Inactive(file): 3759824 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 4194300 kB
Dirty: 1336180 kB
Writeback: 0 kB
AnonPages: 3440528 kB
Mapped: 74992 kB
Shmem: 10044 kB

*Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host
2019-02-04 18:15 [RFC PATCH 0/4] kvm: Report unused guest pages to host Alexander Duyck
` (5 preceding siblings ...)
2019-02-05 17:25 ` [RFC PATCH 0/4] kvm: Report unused guest pages to host Nitesh Narayan Lal
@ 2019-02-07 14:48 ` Nitesh Narayan Lal
2019-02-07 16:56 ` Alexander Duyck
2019-02-10 0:51 ` Michael S. Tsirkin
7 siblings, 1 reply; 55+ messages in thread
From: Nitesh Narayan Lal @ 2019-02-07 14:48 UTC (permalink / raw)
To: Alexander Duyck, linux-mm, linux-kernel, kvm
Cc: rkrcmar, alexander.h.duyck, x86, mingo, bp, hpa, pbonzini, tglx,
akpm, Luiz Capitulino, David Hildenbrand
[-- Attachment #1.1: Type: text/plain, Size: 3764 bytes --]
On 2/4/19 1:15 PM, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.
>
> In order to support this I have added a new hypercall to provide unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
Hi Alexander,
Did you get a chance to look at my v8 posting of Guest Free Page Hinting
[1]?
Considering both solutions are trying to solve the same problem, it
would be great if we could collaborate and come up with a unified solution.
[1] https://lkml.org/lkml/2019/2/4/993
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.
>
> ---
>
> Alexander Duyck (4):
> madvise: Expose ability to set dontneed from kernel
> kvm: Add host side support for free memory hints
> kvm: Add guest side support for free memory hints
> mm: Add merge page notifier
>
>
> Documentation/virtual/kvm/cpuid.txt | 4 ++
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
> arch/x86/include/asm/page.h | 25 +++++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 3 ++
> arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
> arch/x86/kvm/cpuid.c | 6 +++-
> arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
> include/linux/gfp.h | 4 ++
> include/linux/mm.h | 2 +
> include/uapi/linux/kvm_para.h | 1 +
> mm/madvise.c | 13 +++++++-
> mm/page_alloc.c | 2 +
> 12 files changed, 158 insertions(+), 2 deletions(-)
>
> --
--
Regards
Nitesh
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

*Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host
2019-02-07 14:48 ` Nitesh Narayan Lal
@ 2019-02-07 16:56 ` Alexander Duyck
0 siblings, 0 replies; 55+ messages in thread
From: Alexander Duyck @ 2019-02-07 16:56 UTC (permalink / raw)
To: Nitesh Narayan Lal, Alexander Duyck, linux-mm, linux-kernel, kvm
Cc: rkrcmar, x86, mingo, bp, hpa, pbonzini, tglx, akpm,
Luiz Capitulino, David Hildenbrand
On Thu, 2019-02-07 at 09:48 -0500, Nitesh Narayan Lal wrote:
> On 2/4/19 1:15 PM, Alexander Duyck wrote:
> > This patch set provides a mechanism by which guests can notify the host of
> > pages that are not currently in use. Using this data a KVM host can more
> > easily balance memory workloads between guests and improve overall system
> > performance by avoiding unnecessary writing of unused pages to swap.
> >
> > In order to support this I have added a new hypercall to provide unused
> > page hints and made use of mechanisms currently used by PowerPC and s390
> > architectures to provide those hints. To reduce the overhead of this call
> > I am only using it per huge page instead of doing a notification per 4K
> > page. By doing this we can avoid the expense of fragmenting higher order
> > pages, and reduce overall cost for the hypercall as it will only be
> > performed once per huge page.
> >
> > Because we are limiting this to huge pages it was necessary to add a
> > secondary location where we make the call as the buddy allocator can merge
> > smaller pages into a higher order huge page.
> >
> > This approach is not usable in all cases. Specifically, when KVM direct
> > device assignment is used, the memory for a guest is permanently assigned
> > to physical pages in order to support DMA from the assigned device. In
> > this case we cannot give the pages back, so the hypercall is disabled by
> > the host.
> >
> > Another situation that can lead to issues is if the page were accessed
> > immediately after free. For example, if page poisoning is enabled the
> > guest will populate the page *after* freeing it. In this case it does not
> > make sense to provide a hint about the page being freed so we do not
> > perform the hypercalls from the guest if this functionality is enabled.
>
> Hi Alexander,
>
> Did you get a chance to look at my v8 posting of Guest Free Page Hinting
> [1]?
> Considering both solutions are trying to solve the same problem, it
> would be great if we could collaborate and come up with a unified solution.
>
> [1] https://lkml.org/lkml/2019/2/4/993
I haven't had a chance to review these yet.
I'll try to take a look later today and provide review notes based on
what I find.
Thanks.
- Alex

*Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host
2019-02-04 18:15 [RFC PATCH 0/4] kvm: Report unused guest pages to host Alexander Duyck
` (6 preceding siblings ...)
2019-02-07 14:48 ` Nitesh Narayan Lal
@ 2019-02-10 0:51 ` Michael S. Tsirkin
7 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2019-02-10 0:51 UTC (permalink / raw)
To: Alexander Duyck
Cc: linux-mm, linux-kernel, kvm, rkrcmar, alexander.h.duyck, x86,
mingo, bp, hpa, pbonzini, tglx, akpm
On Mon, Feb 04, 2019 at 10:15:33AM -0800, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.
There's an obvious overlap with Nitesh's work and with Wei's
already-merged work here. So please Cc the people reviewing Nitesh's and
Wei's patches.
> In order to support this I have added a new hypercall to provide unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.
>
> ---
>
> Alexander Duyck (4):
> madvise: Expose ability to set dontneed from kernel
> kvm: Add host side support for free memory hints
> kvm: Add guest side support for free memory hints
> mm: Add merge page notifier
>
>
> Documentation/virtual/kvm/cpuid.txt | 4 ++
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
> arch/x86/include/asm/page.h | 25 +++++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 3 ++
> arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
> arch/x86/kvm/cpuid.c | 6 +++-
> arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
> include/linux/gfp.h | 4 ++
> include/linux/mm.h | 2 +
> include/uapi/linux/kvm_para.h | 1 +
> mm/madvise.c | 13 +++++++-
> mm/page_alloc.c | 2 +
> 12 files changed, 158 insertions(+), 2 deletions(-)
>
> --