OVERLAY NETWORKS WITH MY SDN CONTROLLER

2018-01-06

Intro

The objective with this project is to be able to create virtual networks
that span across several switches. For example, I have 5 VMs that run in
a 10.0.0.0/24 subnet but those VMs are on different hypervisors. I want my controller
to create an overlay network for those, and only those, VMs. It is possible
to have many overlays with the same addressing space because the switches
will only forward traffic between hosts part of the same overlay

For such a topology, OVS bridges end up looking like this:

The controller

On the controller, I configure 4 types of objects: Bridges, Networks, Routers and Hosts.
A host is a VM or a container (or anything using a TAP interface in the OVS bridge).
Each host is part of a bridge. In general, you'd setup one bridge per hypervisor.
So because I define where bridges are (on which hypervisor they run), the controller
knows that 2 hosts run on different hypervisors and they should communicate through a tunnel.
Each host is part of a network. Those are virtual network that I define in the controller.
The controller will allow communication between hosts that are on the same virtual network.
A virtual network can span across many bridges. From the controller's point of view, all
the configuration is static. For each host, I define the bridge it is on, the port number
it is using, it's IP address and mac address. The mac,port,bridge could be given to
the controller by some kind of agent that would run on the HV but in my case, I have
just hardcoded those values. The IP address of the VM can be set to anything.

Flows

DHCP

Since I pre-configure the controller with the list of hosts that will attach to it with their MAC, IP
and networkID, gateway IP etc., the controller has all the information it needs to act as a DHCP server.
So the controller adds a flow, 1 for every host that is local to the hypervisor, to match the source MAC/port
and UDP port 67 with an action to send the packet to the controller. The controller then crafts a DHCP reply
packet and sends it back on the ingress port.

ARP

Again, since the controller has all information about the hosts, it is also capable of responding to ARP queries.
Flows are installed to intercept any ARP query and the controller will craft a response. If a host makes
an ARP query, that query will never be flooded (not even on the vxlan tunnel) and will never be seen by
anyone else than the controller. Technically, no hosts should ever receive ARP requests.

Table 0

Table 0 contains the flows to match ARP and DHCP requests. If a packet does not match any flow in Table 0,
a table-miss flow will send the packet to Table 1.

Table 1

Table 1 is about matching the source of a packet. For each host on the current bridge, a flow
that matches the in_port,src_mac,src_ip will be installed. The actions are to set the
virtual network ID in the tunnel_id metadata (will be used in Table 3) and then goto Table 2
Another flow that matches the tunnel in_port is installed with an action to goto Table 3.
That tunnel flow doesn't set the tunnel_id metadata because it was already set when
the packet came through the tunnel. The tunnel_id of a packet coming in from a VXLAN
tunnel is the VNI.

Table 2

Table 2 has flows that match dst_mac set to gateway addresses as defined in Router objects.
These flows are used for routing between virtual networks. If the dst_mac is not for
a virtual router, then a table-miss flow sends the packet to table 3.

Table 3

Table 3 is about forwarding to the correct host. When a packet enters Table 3, it is
guaranteed to have a tunnel_id (set as the virtual network ID).
Table 3 contains 1 flow for each host known across all virtual networks.
All flows match the tunnel_id,dst_mac,dst_ip. The action depends on if the host
is local or resides on another hypervisor. If it is local, then the action
is simply to forward to the port of that host. If it is remote, then 2 actions
are executed: Set-Field for remote tunnel endpoint IP and output to the tunnel
port. Note that since the tunnel_id is set (in table 1), the VXLAN packet created
by OVS will have that ID as the VNI. So when entering the other HV, a flow will
match that to the virtual network ID of the host.

So since the controller knows about all VMs and virtual networks, it is able to install
those flows that will make traffic between hosts part of the same virtual network possible,
even across hypervisors, thanks to VXLAN tunnels.

Example flows

Here is an example of the flows created on Hypervisor 1 from the above example setup

First, we see that in table 0, we intercept all ARP queries from all local hosts and send them to the controller.
The controller will respond if the target is a local or a remote host

Now in table 1, we match the source of all known local hosts and set the tun_id field (VNI), which is a way for us to tag the packet with its
network ID. The last flow matches the in_port 65279, which is the tunnel port. It will make us jump to table 2. The tun_id is already
set for all packets comming in from the tunnel

These last flows, in Table 3 again, match the destination of all remote hosts and forwards the packet out of the tunnel port
after having set the tunnel destination address (address of the HV running that host)

VXLAN tunnels

I had a hard understanding how to create the tunnels at first.
I know that with OVS, we can create VXLAN tunnel ports and we just need
to forward traffic on those ports. But this requires configuring (N-1) tunnel ports
on each hypervisors if we have a total of N hypervisors. So instead I found that
there is an openflow extension (so not part of the official spec) that allows
to set the destination of the tunnel dynamically with a flow. This is done with
a "set-field" action, with a OXM class of 0x001 and field ID 32 (this is a Nicira extension).
Then you only need to create 1 tunnel port per hypervisor like this:

Using "flow" as the remote_ip allows us to set the destination IP using flows. And setting key=flow allows us to set the VNI with flows.

Routing

Routing is a very simple process. When a host sends a packet to another host,
on another network, it simply uses the mac address of the gateway as the destination
mac. So from the switch's point of view, it's just a matter of detecting that "special"
mac address and allowing to forward on a port that is on a different virtual network.

In my controller, this done by configuring
a Router object in the controller and adding networks it. The router
will allow communication between networks associated to it. When a host
needs to send traffic to another network, it needs to go through a gateway.
in the configuration shown above, we can see that an IP has been assigned as
the gateway ip in each network. The controller will respond to arp queries
for those gw IP with the mac adress of the virtual router that those networks are
associated to. So hosts will be sending their inter-network traffic using the
virtual router's mac as the destination mac.

So the controller can install flows to match traffic with that mac and
the dst_ip (with a mask) to change the network id on that packet. the packet
can then be sent to table 3 for proper port forwarding.

Example flows

Here is a representation of table 1,2,3. In each scenario, the yellow flows are the ones
that would be matched

Configuration

Like I've mentionned above, right now, the configuration is very static. We have
to pre-configure the controller and tell it exactly where (on which HV) each VMs runs,
the port number in the OVS they are attached to and their MAC address. Although the controller
could deduct some of that information, I prefer to tell it exactly what I am expecting so
I can exercise better control over my network.

So right now, if I create a VM, I need to tell the controller about that new VM and provide the following information:

MAC address

Network ID (that will be used as the VNI)

Hypervisor where the VM runs

IP Address I want assigned for the VM

Port number on the OVS bridge

Note that the chosen IP address can be configured manually on the VM, but if
choosing DHCP, the controller will intercept the request and assign that IP.
So I would need a tool to do this for me instead of doing that step manually
after creating a VM. So what could be done next is to create an orchestrator that allows me to spawn VMs on any
of the HV (transparently maybe?). At the time of creation, I would need to specify in
which network I want my VM to be part of. Then the orchestrator would choose a HV,
choose a and IP from the network that was selected and spawn the VM. After spawning
the VM, the orchestrator would know the MAC address and the OVS port number that
the VM was started with. At that point, the orchestrator has all the information
needed to tell the controller how to establish proper networking for that new VM.

What next?

I guess that a cool feature I could do next would be to have a way for those virtual networks to access an external network such as the internet. This would involve NAT and a way route to the underlay. I'll have to think about that one.

SIMPLE LEARNING SWITCH WITH OPENFLOW

2017-12-16

Introduction

I recently wanted to learn more about Openflow so I decided to write a controller that acts as
a simple L2 learning switch. The controller is written in c++. I couldn't find a good c++ library
for openflow so I ended up writing my own from scratch. The openflow code is far from being spec1.3-compliant
because it is missing a lot of things, but it does enough to accomplish my goals. The code for the
controller, including the Openflow code, is available on github. https://github.com/pdumais/OpenflowController

Openflow Controller

The general idea behind my implementation of the openflow controller, is to program the switch
as the controller discovers the network. The controller is made aware of the vlans on which ports
should belong to through a configuration file.

When the switch connects to the controller, the controller will delete all flows in tables 0,1,2.
It will then install a table-miss rule (match all, priority 0) in table 0 and 1 to send packets
the controller. table 2 will stay empty; the default behaviour of this will be to drop packets.
So at this point, any packets comming in on the switch will trigger a table-miss and be sent to
the controller. The controller will then make decisions on how the packet should be forwarded
based on the port type (access or trunk) and vlan. Whenever a packet gets sent to the
controller, this is done in a PacketIn message. The controller then send a PacketOut message
to tell the switch on which port(s) to forward that packet and to maybe add/delete a vlan tag.
In order to avoid having to do this processing everytime, the controller also adds a flow
on the switch with a specific match criteria and a set of action so that the logic
of the packetOut can be applied by the switch automatically the next time instead of sending
the packet to the controller.

Learning Switch

Many examples of a learning switch I saw were a bit naive. For example, those implementation would
fail if more than one host would be connected behind a port (for example, another switch could be
connected on that port). So my implementation is an attempt to be a bit more robust.

My learning switch uses two tables because there are two decisions that need to be taken
eveytime a packet comes in: 1) Should the controller learn about that mac/in_port? 2) What
port needs to be used to forward the packet?

When the first packet will be sent to the switch, it will be sent to the controller, because there
are only table-miss rules defined. The controller will learn the src_mac/in_port of that packet
and install a rule in table 0 to match that src_mac/in_port and forward to table 1. The next time
the same host sends a packet, it will not hit the table-miss rule in table 0 and will be forwarded
to table 1. After adding that rule, the contoller will look at the dst_mac of the packet
and will not have an entry in its FIB so it will send a PacketOut to the switch with a flood
action (if vlans are used, this is a bit different).

As packets come in, table0 will get populated with entries to avoid sending already
learned sources to the controller. But when hitting table 1, packets will hit the table-miss
rule. So the packets will be sent to the controller. The controller will look at the dst_mac and
decide to flood (as described above) or not. If the dst_mac was found in the FIB (because
it was previously learned through another packet's table-miss on table 0), the a packetOut
is sent with only the port found in the FIB. Also, a rule will be installed in table 1
to match the dst_mac.

Example: At time 0, the bridge only has table-miss flows installed, so anything going in the switch
will hit that and be sent to the controller.

first packet sent from A on port 1, destined to B, on port 2

table0-miss

learn(A is on port 1)

installRule(in t0, if from A and on port 1, goto t1)

check destination

not found!

PacketOut(flood)

Then B responds on port 2

table-0-miss

learn(B is on port 2)

installRule(in t0, if from B and on port 2, goto t1)

check destination

Found! A is on port 1.

PacketOut(1)

installRule(in t1, if for A, forward to 1)

Then A sends something else to B

table-0 Pass, jump to t1

table1-miss (no rule for destination B)

check destination

Found! B is on port 2.

PacketOut(2)

installRule(in t1, if for B, forward to 2)

And B replies to A again

table-0 Pass, jump to t1

table-1 Pass, forward according to rule.

Vlans

When vlans are involved, things changes a bit. The switch is configured
with a set of port in different vlans. The switch determines the vlan
that the incomming packet belongs to by looking at the vlanID in the packet if
it came in through a port that is configured with a "trunk" personality. If
the traffic is untagged and came in on a trunk, then a "default vlan" config
is also available. Note that these configs are set on my controller when I define
the virtual topology of the switch, and not on openvswitch itself. If the port
is an access port, then traffic is assumed to be for the vlan configured on the port.
The controller maintains a forwarding table on a per-vlan basis. So it would be
technically possible to have the same MAC twice but on different vlans.
Flooding then happen only on ports part of the same vlan. Also, when flooding
out of trunk ports, a vlan tag must be added. This is all doable with flows.

Test network

This is the test network I am using:

I am using veth pairs with 1 end in the bridge and the other end in its own network
namespace. This is easier than to create 6 VMs.
The controller is configured to recognize interfaces 1,2 as access ports for vlan 100.
Interfaces 3,4 as access ports for vlan 200 and interfaces 5,6 as trunk ports for both vlans.

To better understand what the controller does, here is a series of openflow rules installed
after the controller has seen some packets go through

After a packet (ARP query) was sent from port1, we learned the mac, so we can skip table 0

At this point, these are the rules for table1. If a packet makes it to here, it means that the source_port/mac
was already learned before and we need to check for a destination mac flow to forward it to the correct port.
When trying to send a packet, it triggered a table miss but the controller found that the destination mac
was in it's FIB. So it installed a flow matching the source port,vlanID and destination. The actions
here are to pop the vlan tag because in_port 5 is a trunk and we need to send the packet to port 1, which is an access port.

This is the table-miss flow. If none of the above rules match a packet in table1, we send it to the controller.
This would happen because there are no flows installed to forward the destination MAC to the correct port.
The controller will either have that destination in its FIB and will install a flow in table 1 for next time
Or it will flood on all ports of the same VLAN if the destination is not in the FIB.

Final thoughts

There are many other things that should be considered here by that I haven't implemented.
For example, if port is removed from the switch or if the port goes down, flows that
forwards to that port should be removed, and flows that match that incoming port should
also be removed. Broadcast flows should be modified. And if changing the vlan config
on a port, within the controller configuration, flows also need to be removed/modified.

INSTALLING KUBERNETES MANUALLY

2017-09-13

The procedure described here applies to centos but the same recipe can obviously be adapted for other distritions with some
minor tweaks.

I am deploying a 3 nodes cluster and all servers are master/nodes at the same time. This is not recommended by
the k8s team but for a lab environment it is perfectly fine. If you understand this procedure well, then you will find
that deploying nodes and masters separately can be just as easy. Let's assume I have 3 servers with IPs 192.168.1.100,192.168.1.101
and 192.168.1.102.

System preparation

First thing we need to do is disable selinux, disable the firewall, NetworkManager and setup a yum repo.
This is probably a bad idea, but then again, it makes things easier in a lab environment.

If you are behind a corporate proxy, also add this line to that last repo file: proxy=http://yourproxy_address

Install docker

yum install -y docker
systemctl enable docker
systemctl start docker

If you are behind a corporate proxy, add this line to /etc/sysconfig/docker: HTTP_PROXY=http://yourproxy_address
and then restart docker

Install etcd

etcd should be installed on all masters. You actually have to install etcd on the same server where the k8s binaries
will be found, but you should definitely install several instances of it and cluster them. Some people also
suggest to install at least one instance somewhere in the cloud. This is because etcd will store all the persistant
admin stuff that k8s needs. So it would be painful to lose that data. Bringing back (or adding) a k8s node is very easy
and transparent as long as your etcd cluster is intact.

The ETCD_INITIAL_CLUSTER line should list all the hosts where etcd will be running. The other lines where you see 192.168.1.100
should be modified to match the IP address of the server you are currently installing etcd on. and ETCD_NAME should also match
the server you are installing on (see the first line where these names are used). The ETCD_NAME can be any arbitrary name (as long
as you properly match them in ETCD_INITIAL_CLUSTER) but most people try to use the server's hostname.

After having installed etcd on all your servers, wait for the cluster to be up checking the output of "etcdctl cluster-health"
Make sure the cluster is up and healthy before continuing.

Install Flannel

Your k8s cluster will need an overlay network so that all pods appear to be on the same layer 2 segment. This is nice because even though
you cluster runs on different servers, each pod will be able to talk to each other using a virtual network that sits on top of your real
network and will span accross all k8s nodes. It's basically a virtual distributed switch, just like openvswitch does. Flannel is
one such driver that enables that for kubernetes. It is the most basic overly driver and works just fine. For more advanced stuff, Nuage is
an option and there are many other options. If you are new to this (like I was), then this is the SDN stuff that all the cool kids
are talking about.

Kubernetes can be downloaded as a binary package from github. What I really like about these binaries is that they are simple standalone
applications. You don't need to install a RPM and a whole bunch of libraries. Simply copy the executable and run it. You will need a total
of 6 process to run. So the first thing to do is to unpack the binaries. Download the version you want from
https://github.com/kubernetes/kubernetes/releases

Each of the daemons can simply be started from the CLI with a whole bunch of command line arguments, you don't even need any
configuration files. This is beautiful because it is so easy.
So technically, you could just do:

And that's it, kubernetes is running. Simple. But let's get systemd to manage those process. If you are not using systemd (slackware?) then
you can setup a bunch of rc scripts to launch those. But for this example, let's create some systemd unit files that will launch this for us.

Note how I did not hard code all command line arguments in the systemd unit files. Instead I will store them in separate environment files
under /etc/kubernetes. So earlier I was praising about how nice it was to just be able to launch the processes on the command line with not
extra config files or no service files but here I am create service files and config files. I know... I just like the fact that I can
customize it any way I want. So here are the config files needed. But as I said, you could just write one bash script that invokes
all those 5 process with all their command line arguments and you would have zero config file and no need for systemd. just 5 binaries
that you copied on your server.

Then you can enable and start all those new services. You have a k8s cluster running. You can test by invoking "kubectl get nodes"

Some notes about the k8s processes

kubectl can be copied on any machine. It will try to communicate to kubernetes through the apiserver process on the localhost. If
you are running this tool on a server where apiserver is not running, then you need to specify --server=http://api-server-address:8080.
In our case, we have installed the apiserver on all k8s nodes. So you can connect to any instances in the cluster.

The apiserver process needs to run on all masters. This is so you can control each masters remotely. I've configured each
apiserver to only talk to it's local etcd because we have installed etcd on all nodes. But would could configure it to talk to
all etcd servers, even remote ones.

the kube-proxy should run on all worker nodes (in our cae, workers and masters are the same). Let's say you have a pod running a web
server on port 242. You don't know on which node your pod will run. But you want to be able to access it using the IP of any of the nodes
in the cluster. That is what kube-proxy does. So in you go to http://192.168.1.100:242 but your pod runs on 192.168.1.102, then kube-proxy
will handle it. It will, as you guessed it, proxy the request. And this is at the TCP/UDP level, it is not just a http proxy.

I am not entirely sure which of these processes are considered as "master" vs "worker nodes" processes. But I believe
that the nodes should only be running kube-proxy and kubelet. The other processes should only run on what would be considered master servers.
So since it is just a matter of copying the binaries over and changing some addresses in the environment files under /etc/kubernetes, it
would be easy to tweak the procedure above to get a different infrastructure.

Ansible

WRITING A HYPERVISOR WITH INTEL VT-X

2017-06-21

Introduction

Before virtualization support was available on CPUs, virtual machines were emulating most of the code that a guest OS would run.
With VMX, the CPU handles the virtualization. If you are familiar with the hardware context-switches that were available
on ia32, it's a bit similar. The hypervisor manages many VMs. When executing code of a VM, there is a hardware context switch
that occurs. The context switch saves the current CPU state (but not the general purpose registers) and loads another
state that would be the one of the VM. Then code can execute from there. So the VM ends up running "bare-metal". The CPU
has changed it's entire context and runs your VM just as it would if the guest was running bare-metal. With one exception:
Some operations causes VMExits. For example, when an interrupt occurs, the CPU automatically switches back to the hypervisor
context (what they call the VMM). The VMM can then handle the interrupt and change the Guest's state data so that on the next
entry it will think it has an interrupt to handle (or not). This is basically how you would handle the timer interrupt
on the guest and on the host so that multithreading works. VMExits occur for many other reasons. You could have a VMExit
occur when the VM tries to access unmapped memory, when executing "hlt" or many other reasons.

Goal

The virtual environment we'll be setting up is for a 16bit real mode VM because that's how a real CPU would start.
Control will initially be transfered to address 0x00000000. Normally, an entry point to BIOS code should be located there
and then the BIOS would eventually load the first sector of the primary HDD and transfer control to 0x7c00. But in my case,
I am not writting a BIOS, so only test code will be executed in the VM.

Memory

In a VMX environment, there exists several layers of memory:

Guest virtual memory: The guest's virtual memory if paging is enabled in the Guest. That memory translates
to guest physical memory using the guest's paging structures, entirely controlled by the guest.

Guest physical memory: This is the physical memory from the VM's point of view. From the hypervisor's point
of view, this is just virtual memory that has to be mapped to physical memory. This memory would be
allocated lazily as the VM uses more and more memory. This memory is mapped to physical memory
by the hypervisor.

Physical memory: This is the real hardware physical memory as seen by the host's CPU.

For more details about how paging works, please refer to Memory Paging

For the VM to have memory, some of the host's physical memory must be assigned to the VM. The guest's physical memory is seen as
virtual memory by the host. To assign memory to the guest, a feature provided by the CPU called "EPT" can be used. EPT works
Just like the regular paging system. It is a way to map guest-physical addresses to real physical addresses using a paging
structure with a PML4 table on top.

So a guest would prepare it's own internal paging system and have it's own PML4 table. When an a virtual address inside the VM
needs to be resolved, it will be resolved to the guest's physical address using its PML4 table. Then, the guest-physical address
will be resolved using the EPT's PML4 table. So 2 translations would be done.

EPT is enabled by setting the "Enable EPT" bit in the VM-Execution controls. When enabled, an EPT PML4 structure
must be prepared and and the base address of that structure must be stored in the VMCS EPT pointer

Lazy allocation of guest physical memory

When a guest is launched with, for example, 16gb of ram, it would not be desirable to reserve that entire
memory on the hypervisor immediately. Instead, it would be preferable to allocate that memory in a lazy allocation
fashion. A guest might never use it's entire RAM so reserving it on the hypervisor would be a waste.

Reading an unallocated guest physical page will return zeros. Writing to it will trigger a EPT violation. The VMM can then reserve
a physical page and map it there instead of the dummy page.

When the guest is launched, a single page of write-protected physical memory (filled with zeros) should be mapped to all the guest
physical memory. If the VM's BIOS starts probing memory by reading all of it, every 4k would be mapped to that single page so the BIOS would read zeros and think it is
reading valid memory. If the BIOS (or the guest OS) writes into
memory, a VMexit would occur because writing to any address would map to this write-protected physical page.
The hypervisor can then allocate a new R/W page to the EPT's paging structure for the guest physical memory. Of
course, if the guest does a write/read-back kind of algorithm the probe the memory, then all the guest physical
memory will have been allocated because all pages will have been written to, so all bets are off.

Reclaiming freed-up memory is a different story. The only way the hypervisor can
know about memory that can be reclaimed is by using memory balooning. The virtio
specification describes a memory baloon device that can be implemented for that
purpose.

When the guest will access guest-physical memory that is unmapped to physical memory, a VMExit will occur.
The hypervisor needs to evaluate if the guest-physical address falls into the possible addressing space of the VM
(i.e. if it is trying to access memory beyond the last byte of the virtual RAM). If the memory is unmapped
because of the lazy-loading scheme, then a new page can be allocated. If it is because that memory address should
not be mapped, then an exception will be injected back in the VM and the guest will need to handle that.

Translation Lookaside Buffer (TLB)

When the process translate a virtual address to a physical address, the translation is cached in the TLB so that
it wont have to walk through the page tables again next time. In a virtual environment, the CPU only has one TLB
in which is caches translations for:

All host virtual/physical mappings

All guest-physical/physical mappings (of all VMs)

All guest-virtual/guest-physical mappings (of all VMs).

This could create collisions in the TLB. For example, 2 VMs could have a mapping of virtual address 0x00000000. A simpler
case would be two processes with different virtual memory mappings running on the host. For this reason, as I described in
another article (Process Context ID and the TLB), the TLB supports processID-tagging. With VMX, two new taggings exist: VPID and EP4TA.

VPID is the tagging of a TLB entry for a specific virtual CPU. This is only tru if VPID is enabled in the VMCS and if a non-zero
VPID has been set in the VMCS. In such a case, every guest-virtual to guest physical translations that are inserted in the TLB
will be tagged with the current VPID. This means that another VCPU can insert a similar entry in the TLB and it will be tagged
with a different VPID. When the MMU looks up a mapping in the TLB, it will only consider those tagged with the current VPID.

EP4TA tagging is done when EPT is enabled. Every guest-virtual to host-physical (going through EPT tables) are cached in the TLB
with a tagging of the current EPT,PCID and VPID. The EPT ID (EP4TA) is derived from the current EPTP, so there is no specific
ID to set, unlike PCID and VPID.

If VPID is not enabled, then the VPID tag is always 0 and is always ignored during lookups. A lookup while EPT
is active (during non-root operation) will only consider entries with a matching EP4TA.

I'm a bit confused by what the Intel documentation says about EPT and the TLB. From what I can understand,
it seems like the TLB will cache guest-virtual mappings to host-physical addresses. Those are called "combined mappings"
When EPT is enabled, these entries will be tagged with the current EP4TA,VPID,PCID.
If VPID is not enabled:

entries will be tagged with VPID=0.

On vmexit/vmentry, all entries with VPID==0 will be invalidated, regardless of EP4TA tag.

For this reason, I am enabling VPID with a tag of "1" for all my VCPUs. The documentation clearly states that having the same VPID
for different EPT is acceptable.

When the guest will invoke "invlpg" or "invpcid", these instructions should be trapped because they will invalidate all
requested TLB entries regardless of the current EP4TA. So the VMM should set a VMExit on those operations and emulate them
using "invept. That's what I am understanding from section 28.3.3.1, but section 28.3.3.3 says that when EPT is enabled, there is
no need to emulate these instruction.

When to invalidate

When a VMExit occurs because of an EPT violation, the faulting page will always be invalidated in the TLB. In my OS,
I use the EPT violation handler to allocate more memory to the VM. When the VM resumes, the TLB will be up to date the VM
will be writing data in the correct physical page. If the VM is rescheduled on another core where the stale TLB entry exists
for that address, then it will be accessing the wrong physical page. The TLB was only invalidated on the CPU that detected
the EPT violation. For this reason, the host needs to do a TLB shootdown on all host CPUs. This done by sending a IPI to
all other CPUS. The CPUS will then execute "invept" to invalidate TLB entries associated with the EPT in question.

The "invept" can be used to invalidate ALL tlb entries tagged with a specific EPT, or all EPT.

Initialization

The initialization of vmx mode is well described. There are a couple of web sites with examples that can be found using a simple

google search. But the Intel SDM #3 describes an algorithm to do so. Basically, this is what needs to happen:

detect support for VMX

create a VMCS for vmxon

execute vmxon

create a VMCS for a VM and write all required data in it

run vmlaunch

It's important to note that VMXON should be executed on all CPUs and this should be done when the hypervisor starts.
In my OS, I do this as soon as I switch to long mode on all cores and I leave it on all the time.
The tricky part is to get the VMCS for the guest initialized correctly. I had to spend quite some time
fine-tunning the fields. If VMLAUNCH fails, the zero flag will be set, then you can do a VMREAD to VM_INSTRUCTION_ERROR
to get the error code. Then, the SDM (chapter 26) describes all the checkings that the CPU does so you can
walk through those and make sure your code is compliant.

Also, it seems like access rights and segment limits for guest segment descriptors still need to be properly set even if
the VM will run in real mode. If I recall correctly, it's because they will be checked even in real mode and
that is how "unreal mode" works.

I am not going in the details about this because everyone talks about that and the SDM has very good documentation
about it.

Multi-tasking (in a multi-core environment)

Running a VM in a multitasking environment requires a bit of logic that is beyond that the Intel manuals describe. This is because
a VM really is just a process getting a time slice with the scheduler. So when the process is scheduled out, the entire VM is paused.

Every VCPU is managed by one and only one thread in the OS. So each thread has a VMCS (if it is running a VCPU)
Before the task get's scheduled out, vmclear is executed so that any cached data gets flushed in the VMCS.
At least that's what Intel recommends to do. When a task is scheduled back in, VMPTRLD is executed to load the VMCS (if any) that
this task manages.

When a timer interrupt occurs and it's time to do a context switch on the host, since the VM runs as part of
a process on the host, it is normal that it gets interrupted and that time is given to another process.
That process might want to run a VM also, or maybe it's just a normal application. When the timer interrupt occurs,
this will trigger a VMExit and the CPU will be back at executing code in the context of the host. The thread will change.
When getting back to this thread, we will do a VMRESUME to continue executing the VM. We will also reload the VMCS
with VMPTRLD because another process might have loaded another VMCS. We could even be running on another processor
in a multi-core system. And remember, loading a VMCS is only loaded on the current core. So with 4 cores,
you could have 4 different VMs running

When the VMM is running in root-operations because of a VMexit, it could have been scheduled out at any moment before doing the
VMRESUME. So when VMRESUME is executed, it is possible that VMCLEAR was executed before because of a task switch. For this
reason, the OS checks if VMRESUME fails and executes VMLAUNCH instead.

The OS should avoid scheduling VCPUs from cores to cores because moving a VCPU to another core makes it mandatory to do a VMCLEAR.
My OS does not make such effort and blindly does a VMCLEAR on every task switch. So performance could be improved there.

And interesting point here is that after executing VMLAUNCH, control never returns to the calling function (unless vmlaunch fails).
VMLAUNCH and VMRESUME will jump in the guest's code without pushing a return address. A VMExit handler has been defined in the VMCS,
so from the host's point of view. control skips from vmlaunch to the handler. Also, since the stack is setup in the host's section
of the VMCS, the VMM gets a brand new stack everytime the VMExit handler gets called. This means that there is no need to pop the
stack before executing VMLAUNCH/VMRESUME

VM Exits

When the cpu is running in non-root operation (running VM code), it's pretty much as if the VM
was running bare-metal. But Several events can make the VM exit back to root-operation (the hypervisor).
A VM exit will restore the CPU state back to what it was set in the VMCS before vmlaunch. The hypervisor
must determine what caused the VMExit and take proper action. VMExits are similar to interrupts because
the general purpose registers are not automatically saved. A VMExit handler must save the registers and
restore them before resuming the VM otherwise the VM will be corrupted. As per the host section of the
VMCS though, a separate stack is being set. The VMM could alter the VM state by changing data in the
VMCS before resuming the VM. For example, if the VMExit occured because of an exception of type #UD,
the VMM could emulate the invalid instruction and change general purpose registers and flags in the VM.

Interrupts

When an interrupt needs to be delivered to the host, a VMExit will occur and control will be immediately
given to the VMM's exit handler. At that moment interrupts will be disabled (flags.IF=0).
When bit 15 (Acknowledge interrupt on exit) is cleared on the VM Exit control, the interrupt is not acknowleged by the processor. This means that as soon as the "IF" flag becomes 1 (ie, when executing "sti"),
the interrupt handler will get called.

In my hypervisor, if such a VM exit is detected, the kernel executes "sti" and finally
"vmresume". This will let the host execute the handler as soon as interrupts are re-enabled, then
control will be handed back to the VMM exit handler to execute "vmresume"

It is worth noting that it doesn't seem to matter if interrupts were cleared before VMLAUNCH.
Interrupts will still trigger a VMExit and the interrupt will be pending until the VMM enables them (using "sti")

Multiple VCPU

supporting multiple VCPUs for a VM requires having 1 VMCS per VCPU (just as if they were all used in independent VMs).
But the hypervisor has to make those CPUs dormant until a "Startup IPI" is sent. From there, the hypervisor can configure
the VCPU to start executing code at the desired address. But the main thing here is that only one VCPU would be started
when the VM is launched. My OS does not yet support this. I am focussing on 1 VCPU for now

One thing to consider when running multiple VCPUs is that one of them could trigger a VMExit because of a memory
allocation requirement (as described in a previous section of this article). In such a case, another VCPU could be running,
or the current thread could get scheduled out with another VCPU one being scheduled in. So extra care
should be taken to ensure that memory does not get corrupted.

Other thoughts

There are many other things to consider to improve performances and to make the design robust
Things to consider are:

Trapping "hlt" with a vmexit so we can schedule out an idle VCPU

Trapping TLB/MMU operations to better manage memory

Trapping IO operations so that the hypervisor can retain control over hardware.

Setup a virtual apic so that VCPUs can receive interrupts (and make a scheduler).

The code

For reference, this is a part of my VMX code:

What next?

There is still some work to be done for implementing a full hypervisor. The work done here only creates
virtual CPUs. The hypervisor needs to emulate other hardware such as a hard drive, a network card, a video card, keyboard etc.
I would probably write virtio devices since they are easy to code and very efficient.
Another big part that is missing is a BIOS. The hypervisor has to provide a BIOS for your guests since they rely on that to boot.