Commit Message

The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "it has a way to store and load the contents of
the L2 cache, or the L2 plus the MRU way of the L3 cache". We initiate
storing of the log (a list of addresses) using the logmpp instruction,
and start the restore by writing to an SPR.
The logmpp instruction takes its parameters in a single 64-bit register:
- starting address of the table in which to log the L2/L2+L3 cache
  contents
  - 32kB for L2
  - 128kB for L2+L3
  - must be aligned to the maximum size of the table (32kB or 128kB)
- log control (no-op, L2 only, L2 and L3, abort logout)
We should abort any ongoing logging before initiating a new one.
To initiate a restore, we write to the MPPR SPR. The format of the value
written to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, or until end of table)
- prefetch rate (from fastest possible to slower: roughly one request
  every 8, 16, 24 or 32 cycles)
The idea behind saving and restoring the contents of the L2/L3 cache is
to reduce memory latency on a system that is frequently swapping vcores
on a physical CPU.
The best-case scenario for doing this is when some vcores are running very
cache-heavy workloads. The worst case is when they get close to zero cache
hits, so we just generate needless memory operations.
This implementation just does L2 store/load. In my benchmarks this proves
to be useful.
Benchmark 1:
- 16 core POWER8
- 3x Ubuntu 14.04 LTS guests (LE) with 8 VCPUs each
- no split core/SMT
- two guests running the sysbench memory test:
    sysbench --test=memory --num-threads=8 run
- one guest running apache bench (against the default HTML page):
    ab -n 490000 -c 400 http://localhost/
This benchmark aims to measure the performance of a real-world application
(apache) while the other guests are cache-hot with their own workloads. The
sysbench memory benchmark does pointer-sized writes to a (small) memory
buffer in a loop.
In this benchmark, with this patch, I can see an improvement both in
requests per second (~5%) and in mean and median response times (again,
about 5%). The spread of minimum and maximum response times was largely
unchanged.
Benchmark 2:
- Same VM config as benchmark 1
- all three guests running sysbench memory benchmark
This benchmark aims to see whether there is a positive or negative effect
on this cache-heavy benchmark. Due to the nature of the benchmark (stores),
we may not see a difference in raw performance, but rather, hopefully, an
improvement in consistency of performance (when a vcore is switched in, it
doesn't have to wait repeatedly for cache lines to be pulled in).
The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.
Benchmark 3:
- same 3 guests and CPU configuration as benchmarks 1 and 2.
- two idle guests
- 1 guest running STREAM benchmark
This scenario also saw a performance improvement with this patch. On the
Copy and Scale workloads from STREAM, I got a 5-6% improvement with this
patch. For Add and Triad, it was around 10% (or more).
Benchmark 4:
- same 3 guests as the previous benchmarks
- two guests running sysbench --memory, a distinctly different cache-heavy
  workload
- one guest running the STREAM benchmark.
Similar improvements to benchmark 3.
Benchmark 5:
- 1 guest, 8 VCPUs, Ubuntu 14.04
- Host configured with split core (SMT8, subcores-per-core=4)
- STREAM benchmark
In this benchmark, we see a 10-20% performance improvement across the
board in STREAM benchmark results with this patch.
Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
--
changes since v1:
- s/mppe/mpp_buffer/
- add MPP_BUFFER_ORDER define.
---
arch/powerpc/include/asm/kvm_host.h | 1 +
arch/powerpc/include/asm/ppc-opcode.h | 10 ++++++
arch/powerpc/include/asm/reg.h | 1 +
arch/powerpc/kvm/book3s_hv.c | 54 ++++++++++++++++++++++++++++++++-
4 files changed, 65 insertions(+), 1 deletion(-)

On 09.07.14 00:59, Stewart Smith wrote:
> Hi!
>
> Thanks for review, much appreciated!
>
> Alexander Graf <agraf@suse.de> writes:
>> On 08.07.14 07:06, Stewart Smith wrote:
>>> @@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>  	int i, need_vpa_update;
>>>  	int srcu_idx;
>>>  	struct kvm_vcpu *vcpus_to_update[threads_per_core];
>>> +	phys_addr_t phy_addr, tmp;
>> Please put the variable declarations into the if () branch so that the
>> compiler can catch potential leaks :)
> ack. will fix.
>
>>> @@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
>>>
>>>  	srcu_idx = srcu_read_lock(&vc->kvm->srcu);
>>>
>>> +	/* If we have a saved list of L2/L3, restore it */
>>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
>>> +		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
>>> +#if defined(CONFIG_PPC_4K_PAGES)
>>> +		phy_addr = (phy_addr + 8*4096) & ~(8*4096);
>> get_free_pages() is automatically aligned to the order, no?
> That's what Paul reckoned too, and then we've attempted to find anywhere
> that documents that behaviour. Happen to be able to point to docs/source
> that say this is part of API?
Phew - it's probably buried somewhere. I could only find this document
saying that we always get order-aligned allocations:
http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
Mel, do you happen to have any pointer to something that explicitly (or
even properly implicitly) says that get_free_pages() returns
order-aligned memory?
>>> +#endif
>>> +		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
>>> +		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
>>> +
>>> +		/* For sanity, abort any 'save' requests in progress */
>>> +		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
>>> +
>>> +		/* Inititate a cache-load request */
>>> +		mtspr(SPRN_MPPR, tmp);
>>> +	}
>> In fact, this whole block up here could be a function, no?
> It could, perfectly happy for it to be one. Will fix.
>
>>> +
>>> +	/* Allocate memory before switching out of guest so we don't
>>> +	   trash L2/L3 with memory allocation stuff */
>>> +	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mpp_buffer) {
>>> +		vc->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
>>> +						  MPP_BUFFER_ORDER);
>> get_order(64 * 1024)?
>>
>> Also, why allocate it here and not on vcore creation?
> There's also the possibility of saving/restorting part of the L3 cache
> as well, and I was envisioning a future patch to this which checks a
> flag in vcore (maybe exposed via sysfs or whatever mechanism is
> applicable) if it should save/restore L2 or L2/L3, so thus it makes a
> bit more sense allocating it there rather than elsewhere.
>
> There's also no real reason to fail to create a vcore if we can't
> allocate a buffer for L2/L3 cache contents - retrying later is perfectly
> harmless.
If we failed during core creation, just don't save/restore L2 cache
contents at all. I really prefer to have allocation and deallocation all
at init time - and such low order allocations will most likely succeed.
Let's leave the L3 cache bits for later when we know whether it actually
has an impact. I personally doubt it :).
Alex

On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:
> [...]
>
> Phew - it's probably buried somewhere. I could only find this
> document saying that we always get order-aligned allocations:
>
> http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html
>
> Mel, do you happen to have any pointer to something that explicitly
> (or even properly implicitly) says that get_free_pages() returns
> order-aligned memory?
I did not read the whole thread so I lack context and will just answer
this part.
There is no guarantee that pages are returned in PFN order for multiple
requests to the page allocator. This is the relevant comment in
rmqueue_bulk:
/*
* Split buddy pages returned by expand() are received here
* in physical page order. The page is added to the callers and
* list and the list head then moves forward. From the callers
* perspective, the linked list is ordered by page number in
* some conditions. This is useful for IO devices that can
* merge IO requests if the physical pages are ordered
* properly.
*/
It will probably be true early in the lifetime of the system but the mileage
will vary on systems with a lot of uptime. If you depend on this behaviour
for correctness then you will have a bad day.
High-order page requests to the page allocator are guaranteed to be in physical
order. However, this does not apply to vmalloc() where allocations are
only guaranteed to be virtually contiguous.

On 10.07.14 15:07, Mel Gorman wrote:
> [...]
>
> It will probably be true early in the lifetime of the system but the
> mileage will vary on systems with a lot of uptime. If you depend on this
> behaviour for correctness then you will have a bad day.
>
> High-order page requests to the page allocator are guaranteed to be in
> physical order. However, this does not apply to vmalloc() where
> allocations are only guaranteed to be virtually contiguous.
Hrm, ok to be very concrete:
Does __get_free_pages(..., 4); on a 4k page size system give me a 64k
aligned pointer? :)
Alex

On Thu, Jul 10, 2014 at 03:17:16PM +0200, Alexander Graf wrote:
> [...]
>
> Hrm, ok to be very concrete:
>
> Does __get_free_pages(..., 4); on a 4k page size system give me a
> 64k aligned pointer? :)
Yes.

On 10.07.14 15:30, Mel Gorman wrote:
> [...]
>>
>> Hrm, ok to be very concrete:
>>
>> Does __get_free_pages(..., 4); on a 4k page size system give me a
>> 64k aligned pointer? :)
> Yes.
Awesome - thanks a lot! :)
Alex