Azure Automation: Monitoring and Troubleshooting Your Runbook Jobs

Introduction

As a savvy customer who automates the management and maintenance of your Microsoft Azure environment, you are of course getting very familiar with the new Automation service of Azure. The Azure Automation service enables you to automate the time-consuming, manual, repetitive, and complex tasks of keeping your cloud services up and running, and to integrate with other systems, so that you save time and money and reduce errors. Azure Automation provides you with a complete solution for creating and managing the runbooks (PowerShell Workflows) that power your cloud orchestration.

Once you start creating runbooks and associated modules and assets, and you start using them on a daily basis, you will want to have an operator’s view into what resources are in the system: when the runbooks are running, what their current states are, how they are performing, and how much money they’re saving you. Simply put, you will want to know how all of your automation is doing at all times. And if some unexpected issue arises, you will want to be able to quickly troubleshoot and debug the issue and get your automation back online.

Azure Automation helps you out here. Automation provides features for both monitoring the state of your runbooks and for troubleshooting your runbook jobs. In this post, I’ll take you on a tour of these features of Azure Automation.

Automation Dashboard

When you first open the Azure Portal, navigate to the Automation service, and select an Automation Account, you will be presented with the all-up Automation Dashboard.

In this dashboard, there are four key sections that inform you about the system:

At the top of the dashboard is a chart that shows you the status of each runbook job during the time period you choose (from the last hour up to the last 30 days). In this chart you can quickly see how many jobs are running right now, how many have completed, and, most importantly, which jobs require your attention: those that are suspended or failed. At the top of the chart is a row of icons that represent each possible job state. You can click these icons to toggle particular status lines on and off, so that you can focus on the job states of particular interest.

Just below the chart is the usage overview. This section shows your current usage against your quotas for job execution time, number of runbooks, and module size. You can use it to see how much more you can use the system before you’ll have to start paying for Azure Automation (post preview), or if you’re already paying, how much more you’ll be able to use the service before having to move to an upgraded plan.

Below the usage section is the jobs table. This table contains an entry for each job that was started in the last 30 days, and shows the name of the runbook (workflow), the time the job was last updated, and the current status of the job. Thus, if there are any recent runbook jobs that require your attention, you can use this table to quickly identify the exact jobs and then drill in to troubleshoot.

Just to the right of the usage overview and jobs table is the quick glance section. This section contains useful static information about your Automation Account, such as the number of runbooks, modules, and assets. It also indicates how many runbooks are currently in an authoring state.

Now while the above graph looks fine, let’s say you go to your dashboard and see something that looks more like this:

Well, that’s no good! As you can see above, we have a number of suspended runbook jobs (light blue in the screenshot), and nothing like this has happened earlier in the selected time span. Because Automation jobs are expected to run to completion (unless the runbook author intended them to suspend), this is a problem, and you will want to troubleshoot the issue.

When you look at the jobs table, you notice that the same runbook, Update-AzureVM, is now being suspended each time it runs:

At this point you could click on one of the Job Last Update links in the jobs table and drill directly to the job details. In this case, however, you want to see when the jobs started suspending, so in the jobs table you click the Update-AzureVM runbook name and navigate to the dashboard for that runbook to get a historical view of that runbook’s execution.
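The question you are asking of the jobs table here (which runbook keeps suspending, and since when?) can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a call into Azure Automation: the rows, the dates, and the Backup-Database runbook are made up, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows as they might appear in the jobs table:
# (runbook name, job last update, status). All values are invented.
jobs = [
    ("Update-AzureVM", datetime(2014, 6, 1, 9, 0), "Completed"),
    ("Update-AzureVM", datetime(2014, 6, 2, 9, 0), "Suspended"),
    ("Update-AzureVM", datetime(2014, 6, 3, 9, 0), "Suspended"),
    ("Update-AzureVM", datetime(2014, 6, 4, 9, 0), "Suspended"),
    ("Backup-Database", datetime(2014, 6, 4, 10, 0), "Completed"),
]


def current_suspension_streaks(rows):
    """Map each runbook whose latest job is suspended to the time
    its current run of suspended jobs began."""
    by_runbook = defaultdict(list)
    for runbook, updated, status in rows:
        by_runbook[runbook].append((updated, status))

    streaks = {}
    for runbook, history in by_runbook.items():
        history.sort()  # oldest job first
        streak_start = None
        # Walk backwards from the newest job until a non-suspended one.
        for updated, status in reversed(history):
            if status != "Suspended":
                break
            streak_start = updated
        if streak_start is not None:
            streaks[runbook] = streak_start
    return streaks


streaks = current_suspension_streaks(jobs)
print(streaks)
```

With the sample rows, only Update-AzureVM is reported, along with the time of the first job in its current run of suspensions. That is the same answer the runbook’s historical chart gives you at a glance.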

Runbook Dashboard

The Runbook Dashboard looks very similar to the Automation Dashboard, with a chart, a jobs table, and a quick glance section. However, the information in the runbook dashboard is scoped entirely to a single runbook.

In the quick glance section, you can see when this runbook was last published and who published it, as well as other information like how much job execution time this runbook has taken this month. You can also see its authoring status (whether it is currently being edited or is done and published), and whether the runbook has been configured to run on any schedules.

Because you are trying to figure out why this runbook has been suspended the last three times it ran, you need more information, so in the jobs table you click the Job Last Update time for each job to navigate to the associated Job Summary.

Job Summary

In the Job Summary you are presented with a summary of the particular job. You can see the name of the runbook and the current job status, plus who started the job, when it started, and when it was last updated (when it became suspended, in this case). You can also see the names and values of any input parameters to the runbook, and any output or exceptions from the runbook.

This is useful, but it hasn’t given us enough information to get to the bottom of the issue. It appears some variable in the runbook ended up as null, but why did that happen? To find out, let’s click the History tab to view detailed information about each step of the runbook’s execution.

Job History

For every PowerShell Workflow that runs, the workflow engine emits several streams that contain useful information. These streams are the Progress, Output, Warning, Error, and Verbose streams. By default in Azure Automation the Progress and Verbose streams are not stored for each job (because the data storage can become large, especially for the Progress stream); however, you can enable logging for these streams in the runbook configuration page if you need this information for debugging and troubleshooting.

The History page for a job contains a list of all the stream records that were stored, sorted by the creation time of each record. Thus, you can use this page to quickly drill down and see what happened in each step of the runbook as it ran. Because Automation can store this information and retrieve it for you, you should author your runbooks with troubleshooting in mind: if a runbook’s jobs start suspending later, useful stream messages will make the root cause much easier to find.
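As a rough mental model of what the History page shows, the Python sketch below filters a job’s stream records down to the streams that are stored and sorts them by creation time. It assumes only what is described above (Progress and Verbose are not stored unless you enable them); the record layout, the message text, and the function name are hypothetical.

```python
from datetime import datetime

# Per the post, Progress and Verbose records are not stored by default.
STORED_BY_DEFAULT = {"Output", "Warning", "Error"}


def job_history(records, extra_streams=()):
    """Return the stored stream records, oldest first.

    extra_streams models logging enabled in the runbook configuration,
    for example ("Verbose",).
    """
    stored = STORED_BY_DEFAULT | set(extra_streams)
    kept = [r for r in records if r["stream"] in stored]
    return sorted(kept, key=lambda r: r["created"])


# Invented records, deliberately out of order.
records = [
    {"created": datetime(2014, 6, 4, 9, 0, 5), "stream": "Error",
     "text": "Remote connection failed"},
    {"created": datetime(2014, 6, 4, 9, 0, 1), "stream": "Verbose",
     "text": "Loading credential asset"},
    {"created": datetime(2014, 6, 4, 9, 0, 3), "stream": "Progress",
     "text": "Activity started"},
    {"created": datetime(2014, 6, 4, 9, 0, 2), "stream": "Output",
     "text": "Starting patch run"},
]

default_view = [r["stream"] for r in job_history(records)]
verbose_view = [r["stream"] for r in job_history(records, ("Verbose",))]
print(default_view)
print(verbose_view)
```

The second call shows why enabling Verbose logging can pay off while debugging: the extra record appears in the timeline just before the step that failed.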

As you can see from above, it looks like in the job history we have an error – the runbook seems to have been unable to remote into the Azure virtual machine it was meant to patch.

If you want even more detailed information you can choose to view the details of any stream record or view the source code of this job to remind yourself what may be happening in the PowerShell Workflow code to cause the problem. Even if the runbook has changed since this job was run, the source code used in this specific job will be shown.

As a savvy Automation user, you have gone into the History for each of the suspended jobs and have viewed the details on the error that was produced. From above you can see the details show that this error is due to an authentication failure to the Azure virtual machine the runbook was trying to patch. You know from a quick view of the code that the Update-AzureVM runbook uses the “joeAzureVMCred” Automation credential asset as the credential to use to remote into the VM.

And then you remember that the password for this virtual machine was recently changed. Now you understand what the problem is! The fix is easy – edit the joeAzureVMCred credential asset, updating the password to the correct value.

With that fix in place, you can resume each of the suspended jobs, and they will start from the last checkpoint (more on checkpointing in a future blog post) and finish the work they were doing. We’re back up and running!
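To see why resuming is cheaper than starting over, here is a toy Python model of suspend and resume. It is not how Azure Automation or PowerShell Workflow implements checkpoints; it assumes, for simplicity, that a checkpoint is taken after every successful step, and the step names and the credential flag are invented.

```python
def run_job(steps, start_at=0):
    """Run steps from index start_at; return (status, checkpoint index)."""
    checkpoint = start_at
    for index in range(start_at, len(steps)):
        _name, action = steps[index]
        try:
            action()
        except Exception:
            # The job stops here and remembers the last checkpoint.
            return "Suspended", checkpoint
        checkpoint = index + 1  # assumed: checkpoint after every step
    return "Completed", checkpoint


# Stand-in for the credential asset whose password went stale.
credential = {"password_is_current": False}
log = []


def collect_updates():
    log.append("collect updates")


def connect_to_vm():
    if not credential["password_is_current"]:
        raise PermissionError("authentication failed")
    log.append("connect to VM")


def install_updates():
    log.append("install updates")


steps = [
    ("collect updates", collect_updates),
    ("connect to VM", connect_to_vm),
    ("install updates", install_updates),
]

first_status, checkpoint = run_job(steps)
credential["password_is_current"] = True  # the fix: update the password
final_status, _ = run_job(steps, start_at=checkpoint)
print(first_status, final_status, log)
```

In this model the first step runs only once: the resumed job picks up at the step that failed and carries on from there.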

Summary

In Azure Automation, the tools are in place for you to completely automate your Azure cloud environments. From creating new runbooks, to running jobs, to monitoring and troubleshooting issues, you can control the entire automation experience.