Azure Virtual Machine Internals – Part 1

Introduction

The Azure cloud services are composed of elements from Compute, Storage, and Networking. The compute building block is a Virtual Machine (VM), which is the subject of discussion in this post. Web search will yield large amounts of documentation regarding the commands, APIs and UX for creating and managing VMs. This is not a 101 or ‘How to’ and the reader is for the most part expected to already be familiar with the topics of VM creation and management. The goal of this series is to look at what is happening under the covers as a VM goes thru its various states.

Azure provides IaaS and PaaS VMs; in this post when we refer to a VM we mean the IaaS VM. There are two control plane stacks in Azure, Azure Service Management (ASM) and Azure Resource Manager (ARM). We will be limiting ourselves to ARM since it is the forward looking control plane.

Getting Started

For most of the customers, their first experience creating a VM is in the Azure Portal. I did the same and created a VM of size ‘Standard DS1 v2’ in the West US region. I mostly stayed with the defaults that the UI presented but chose to add a ‘CustomScript’ extension. When prompted I provided a local file ‘Sample.ps’ as the PowerShell script for the ‘CustomScript’ extension. The PS script itself is a single line Get-Process.

The VM provisioned successfully but the overall ARM template deployment failed (bright red on my Portal dashboard). Couple clicks showed that the ‘CustomScript’ extension had failed and the Portal showed this message:

It wasn’t immediately clear what had gone wrong. We can dig from here and as is often true, failures teach us more than successes.

I RDPed to the just provisioned VM. The logs for the VM Agent are in C:\WindowsAzure\Logs. The VM Agent is a system agent that runs in all IaaS VMs (customers can opt out if they would like). The VM Agent is necessary to run extensions. Let’s peek into the logs for the CustomScript Extension:

The fact that the failure logs are cryptic hinted that something catastrophic had happened. So I re-looked at my input and realized that I had the file extension for the PS script wrong. I had it as Sample.ps when it should have been Sample.ps1. I updated the VM this time specifying the script file with the right extension. This succeeded as shown by more records appended to the log file mentioned above.

The CustomScript extension takes a script file which can be provided as a file in a Storage blob. Portal offers a convenience where it accepts a file from the local machine. I had provided Simple.ps1 which was in my \temp folder. Behind the scenes Portal uploads the file to a blob, generates a shared access signature (SAS) and passes it on to CRP. From the logs above you can see that URI.

This URI is worth understanding. It is a Storage blob SAS with the following attributes for an account in West US (which is the same region where my VM is deployed):

se=2016-08-15T08:41:30Z means that the SAS is valid until that time (UTC). Comparing it to the timestamp on the corresponding record in log (08/14/2016 08:42:24.05) it is clear that the SAS is being generated for a period of 24 hours.

I asserted above that this is a storage account in West US. That may be apparent from the naming of the storage account (iaasv2tempstorewestus) but is not a guarantee. So how can you verify that this storage account (or any other storage account) is in the region it claims to be in?

The blob URL is a CNAME to a canonical DNS blob.by4prdstr03a.store.core.windows.net. Experimentation will show that more than one storage accounts maps to a single canonical DNS URL. The ‘by4’ in the name gives a hint to what region it is located. As per the Azure Regions page, the West US region is in California. Looking up the geo location of the IP address (40.78.112.72) indicates a more specific area within California.

Understanding the VM

Now that we have a healthy VM, let’s understand it more. As per the Azure VM Sizes page, this is the VM that I just created:

The interesting field here is the version. Publishers can have multiple versions of the same image at any point of time. Popular images are revved typically on a monthly frequency with the security patches. Major new versions are released as new SKUs. The Portal has defaulted me to the latest version. As a customer, I can chose to pick a specific version as well, whether I deploy thru Portal or thru an ARM template using the CLI or REST API; the latter being the preferred method for automated scenarios. The problem with specifying a particular version is that it can render the ARM template fragile. The deployment will break if the publisher unpublishes that specific version in one or more regions, as a publisher can do. So unless there is a good reason not to, the preferred value for the version setting is latest. As an example, the following images of the SKU 2012-R2-Datacenter are currently in the WestUS region, as returned by the CLI command azure vm image list.

The OS disk is a page blob and starts out as a copy of the source image that the Publisher has published. Looking at the meta data of this blob and correlating it to what the VM itself has is instructive. Using the Cloud Explorer in Microsoft Visual Studio the blob’s property window:

This is a regular page blob that is functioning as an OS disk over the network. You will observe that the Last Modified date pretty much stays with NOW() most of the time – the reason being as long as the VM is running there are some flushes happening to the disk regularly. The size of the OS disk is 127 GB; the max allowed OS disk in Azure is 1 TB.

The interesting properties are the Lease properties. It shows the blob as leased with an infinite duration. Internally to VM creation, when a page blob is configured to be an OS/data disk for a VM, we take a lease on that blob before attaching it to the VM. This is so that the blob for a running VM is not deleted out of band. If you see a disk-backing blob has no lease while it shows as attached to a VM then that would be an inconsistent state and will need to be repaired.

RDPing in the VM itself, we can see two drives mounted and the OS drive is about the same size as the page blob in Storage. The pagefile is on D drive; so that faulted pages are fetched locally rather than over the network from Blob Storage. The temporary storage can be lost in case of events that case a VM to be relocated to a different node.

},
"caching": "ReadWrite"
},
"dataDisks": []

there are no data disks yet but we will add some soon

},
"osProfile": {
"computerName": "BlogWindowsVM",

The name we chose for the VM in Portal is the hostname as well. The VM is DHCP enabled and gains its DIP address thru DHCP. The VM is registered in an internal DNS and has a generated FQDN.

This is a hint to install a guest agent that does a bunch of config and runs the extensions. The guest agent binaries are here - C:\WindowsAzure\Packages

"enableAutomaticUpdates": true

Windows VMs by default are set to receive auto updates from Windows Update Service. There is a nuance to grasp here regarding availability and auto updates. If you have an Availability Set with multiple VMs with the purpose of getting high SLA against unexpected faults, then you do not want to have correlated actions (like Windows Updates) that can take down VMs across the Availability Set.

The boot screenshot can be viewed in Portal. However, the URL for the screenshot bmp file does not render in a browser.

What gives? It is due to the authentication on the storage account which blocks anonymous access. For any blob or container in Azure Storage it is possible to configure anonymous read access. Please do this with caution and only in cases where secrets will not be exposed. It is a useful capability for sharing data that is not confidential without having to generate SAS signatures. Once anonymous access is enabled on the container the screenshot renders in any browser outside of the portal.