Commit Message

Commit a6f177efaa58 ("KVM: Reenter guest after emulation failure if
due to access to non-mmio address") added reexecute_instruction() to
handle the scenario where two (or more) vCPUS race to write a shadowed
page, i.e. reexecute_instruction() is intended to return true if and
only if the instruction being emulated was accessing a shadowed page.
As L0 is only explicitly shadowing L1 tables, an emulation failure of
a nested VM instruction cannot be due to a race to write a shadowed
page and so should never be re-executed.
This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
but guaranteed to result in an infinite loop when TDP is enabled.
Because "cr2" is actually an L2 GPA when TDP is enabled, calling
kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
and cause reexecute_instruction() to immediately return true.
Way back when, commit 68be0803456b ("KVM: x86: never re-execute
instruction with enabled tdp") changed reexecute_instruction() to
always return false when using TDP under the assumption that KVM would
only get into the emulator for MMIO. Commit 95b3cf69bdf8 ("KVM: x86:
let reexecute_instruction work for tdp") effectively reverted that
behavior in order to handle the scenario where emulation failed due to
an access from L1 to the shadow page tables for L2, but it didn't
account for the case where emulation failed in L2 with TDP enabled.
All of the above logic also applies to retry_instruction(), added by
commit 1cb3f3ae5a38 ("KVM: x86: retry non-page-table writing
instructions"). An indefinite loop in retry_instruction() should be
impossible as it protects against retrying the same instruction over
and over, but it's still correct to not retry an L2 instruction in
the first place.
[1] This issue was encountered after commit 3a2936dedd20 ("kvm: mmu:
Don't expose private memslots to L2") changed the page fault path
to return KVM_PFN_NOSLOT when translating an L2 access to a
prive memslot. Returning KVM_PFN_NOSLOT is semantically correct
when we want to hide a memslot from L2, i.e. there effectively is
no defined memory region for L2, but it has the unfortunate side
effect of making KVM think the GFN is a MMIO page, thus triggering
emulation. The failure occurred with in-development code that
deliberately exposed a private memslot to L2, which L2 accessed
with an instruction that is not emulated by KVM.
Fixes: 95b3cf69bdf8 ("KVM: x86: let reexecute_instruction work for tdp")
Fixes: 1cb3f3ae5a38 ("KVM: x86: retry non-page-table writing instructions")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
---
arch/x86/kvm/x86.c | 6 ++++++
1 file changed, 6 insertions(+)

Comments

On 22/08/2018 23:52, Sean Christopherson wrote:
> This fixes an issue where an "MMIO" emulation failure[1] in L2 is all> but guaranteed to result in an infinite loop when TDP is enabled.> Because "cr2" is actually an L2 GPA when TDP is enabled, calling> kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped> case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA> and cause reexecute_instruction() to immediately return true.
The only way to get here would be if vcpu->arch.mmu.page_fault
returned RET_PF_EMULATE while in guest mode. Should it be
kvm_mmu_page_fault instead that does:
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index af38644cf042..c9b734be685a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5261,7 +5261,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
return 1;
}
- if (mmio_info_in_cache(vcpu, cr2, direct))
+ /*
+ * vcpu->arch.mmu.page_fault returned RET_PF_EMULATE, but we
+ * can still optimistically try to just unprotect the page and let
+ * the processor execute the instruction that caused the page fault.
+ * Of course that's pointless if we know it's MMIO, but it also does
+ * not work when running a nested guest, because...
+ */
+ if (mmio_info_in_cache(vcpu, cr2, direct) || is_guest_mode(vcpu))
emulation_type = EMULTYPE_NO_REEXECUTE;
emulate:
/*
So I guess the issue can arise only if L0 is using shadow page tables?
With nested EPT, mmu_is_nested(vcpu) should be true, and therefore
mmio_info_in_cache should return false. That also deserves a mention
in the commit message, if it's indeed the case.
By the way, if we follow the suggestion in my review of patch 2,
we also should put that change first, and initialize emulation_type to
zero in kvm_mmu_page_fault at the same time, because the sole case that
needs EMULTYPE_RETRY is
- int r, emulation_type = EMULTYPE_RETRY;
+ int r, emulation_type = 0;
...
- if (mmio_info_in_cache(vcpu, cr2, direct))
+ /*
+ * vcpu->arch.mmu.page_fault returned RET_PF_EMULATE, but we
+ * can still optimistically try to just unprotect the page and let
+ * the processor execute the instruction that caused the page fault.
+ * Of course that's pointless if we know it's MMIO.
+ */
+ if (!mmio_info_in_cache(vcpu, cr2, direct)
emulation_type = EMULTYPE_RETRY;
which then becomes in this patch.
if (!mmio_info_in_cache(vcpu, cr2, direct) && !is_guest_mode(vcpu))
emulation_type = EMULTYPE_RETRY;
> [1] This issue was encountered after commit 3a2936dedd20 ("kvm: mmu:> Don't expose private memslots to L2") changed the page fault path> to return KVM_PFN_NOSLOT when translating an L2 access to a> prive memslot. Returning KVM_PFN_NOSLOT is semantically correct> when we want to hide a memslot from L2, i.e. there effectively is> no defined memory region for L2, but it has the unfortunate side> effect of making KVM think the GFN is a MMIO page, thus triggering> emulation. The failure occurred with in-development code that> deliberately exposed a private memslot to L2, which L2 accessed> with an instruction that is not emulated by KVM.
Can you write a testcase for tools/testing/selftests/kvm, or alternatively
for kvm-unit-tests? It's okay if it only fails for shadow page tables in L0.
But the priority for me is to understand the bugfix and refactor the EMULTYPE
mess.
Thanks,
Paolo

On Thu, Aug 23, 2018 at 11:25:38AM +0200, Paolo Bonzini wrote:
> On 22/08/2018 23:52, Sean Christopherson wrote:> > This fixes an issue where an "MMIO" emulation failure[1] in L2 is all> > but guaranteed to result in an infinite loop when TDP is enabled.> > Because "cr2" is actually an L2 GPA when TDP is enabled, calling> > kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped> > case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA> > and cause reexecute_instruction() to immediately return true.> > The only way to get here would be if vcpu->arch.mmu.page_fault> returned RET_PF_EMULATE while in guest mode. Should it be> kvm_mmu_page_fault instead that does:> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c> index af38644cf042..c9b734be685a 100644> --- a/arch/x86/kvm/mmu.c> +++ b/arch/x86/kvm/mmu.c> @@ -5261,7 +5261,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,> return 1;> }> > - if (mmio_info_in_cache(vcpu, cr2, direct))> + /*> + * vcpu->arch.mmu.page_fault returned RET_PF_EMULATE, but we> + * can still optimistically try to just unprotect the page and let> + * the processor execute the instruction that caused the page fault.> + * Of course that's pointless if we know it's MMIO, but it also does> + * not work when running a nested guest, because...> + */> + if (mmio_info_in_cache(vcpu, cr2, direct) || is_guest_mode(vcpu))> emulation_type = EMULTYPE_NO_REEXECUTE;> emulate:> /*> > So I guess the issue can arise only if L0 is using shadow page tables?
Yes, and I think only if L0 is also using EPT as that's the case where
cr2 is actually a GPA and thus kvm_mmu_gva_to_gpa_write() fails.
> With nested EPT, mmu_is_nested(vcpu) should be true, and therefore> mmio_info_in_cache should return false. That also deserves a mention> in the commit message, if it's indeed the case.
Ha, I had that at some point but removed it because I thought that
it'd be redundant/obvious information.
> By the way, if we follow the suggestion in my review of patch 2,> we also should put that change first, and initialize emulation_type to> zero in kvm_mmu_page_fault at the same time, because the sole case that> needs EMULTYPE_RETRY is> > - int r, emulation_type = EMULTYPE_RETRY;> + int r, emulation_type = 0;> ...> - if (mmio_info_in_cache(vcpu, cr2, direct))> + /*> + * vcpu->arch.mmu.page_fault returned RET_PF_EMULATE, but we> + * can still optimistically try to just unprotect the page and let> + * the processor execute the instruction that caused the page fault.> + * Of course that's pointless if we know it's MMIO.> + */> + if (!mmio_info_in_cache(vcpu, cr2, direct)> emulation_type = EMULTYPE_RETRY;> > which then becomes in this patch.> > if (!mmio_info_in_cache(vcpu, cr2, direct) && !is_guest_mode(vcpu))> emulation_type = EMULTYPE_RETRY;
I considered this approach, but ultimately re{try,execute}_instruction
simply do not handle is_guest_mode() with EPT. Part of the reason I
dug deep into the history was to try and understand whether or not we
can ever legitimately retry an L2 instruction, EPT aside. I think the
answer is no, so I put the check in re{try,execute}_instruction().
That being said, I agree that adding the check in kvm_mmu_page_fault()
makes the desired behavior more obvious. And I also think removing
EMULTYPE_NO_REEXECUTE (or at least making re-execute opt-in) is a very
good thing for robustness and readability.
So, what about adding !is_guest_mode() to kvm_mmu_page_fault(), and
also add a WARN_ON_ONCE(is_guest_mode()) check in the retry functions?
> > [1] This issue was encountered after commit 3a2936dedd20 ("kvm: mmu:> > Don't expose private memslots to L2") changed the page fault path> > to return KVM_PFN_NOSLOT when translating an L2 access to a> > prive memslot. Returning KVM_PFN_NOSLOT is semantically correct> > when we want to hide a memslot from L2, i.e. there effectively is> > no defined memory region for L2, but it has the unfortunate side> > effect of making KVM think the GFN is a MMIO page, thus triggering> > emulation. The failure occurred with in-development code that> > deliberately exposed a private memslot to L2, which L2 accessed> > with an instruction that is not emulated by KVM.> > Can you write a testcase for tools/testing/selftests/kvm, or alternatively> for kvm-unit-tests? It's okay if it only fails for shadow page tables in L0.> But the priority for me is to understand the bugfix and refactor the EMULTYPE> mess.
Will do.
> Thanks,> > Paolo

On 23/08/2018 17:57, Sean Christopherson wrote:
> That being said, I agree that adding the check in kvm_mmu_page_fault()> makes the desired behavior more obvious. And I also think removing> EMULTYPE_NO_REEXECUTE (or at least making re-execute opt-in) is a very> good thing for robustness and readability.
Yeah, retry and reexecute are really the same thing, and your patch 2
was already a baby step towards fixing that. reexecute is really more
like retry_failed_decode and need not have a separate emul_type bit.
> So, what about adding !is_guest_mode() to kvm_mmu_page_fault(), and> also add a WARN_ON_ONCE(is_guest_mode()) check in the retry functions?
Sure! Actually I wanted to propose the same, but not before checking
that doing it in kvm_mmu_page_fault() made sense...
Thanks,
Paolo