Logical Disk Free Space Monitor

Throughout my years working with MOM and Operations Manager 2007, I have periodically heard complaints that Operations Manager does not alert on low disk space conditions, or that administrators are receiving false alerts. Just about every time I’ve been called upon for this type of issue, it turned out that thresholds had not been adjusted properly, not that Operations Manager didn’t do its job correctly.

Before I get into this deeply, I want to stress the importance of having a good disk free space monitoring definition in place. I have seen so many companies struggle with disk free space monitoring when they really don’t need to. The problem almost always starts with skipping three steps: having a good discussion around your free space requirements, defining the thresholds for server roles and types, and then executing on the design.

This is a basic requirement for monitoring operational health of every server role in your infrastructure. Whether we’re talking about file servers, database servers, web servers or application servers, it is a mistake to put this on the back-burner and not define your requirements as soon as possible for each server role.

Two types of monitoring

My standpoint from a disk space monitoring perspective is simple, and it is aligned with the intent and purpose of Operations Manager. It’s two-fold.

Reactive and Proactive

Although it may seem elementary, let me explain the difference between reactive and proactive monitoring, and how it relates to the Logical Disk Free Space Monitor.

There are two scenarios when it comes to state changes in monitors, and each of these can be paired up with either reactive or proactive type monitoring.

Two-State Monitor = Reactive Only. This monitor has only two states. Healthy is required for one of the states. The other state can be warning or critical. In my opinion, a two-state monitor almost always defines some type of reactive monitoring scenario. In other words, a component being monitored by a two-state monitor is either healthy, or an administrator needs to take immediate action in order to correct the problem. This is analogous to ON and OFF. There is no period of time where this component is in a degraded state, but still functioning, that allows an administrator to take remediation actions to correct the issue before it worsens.

Three-State Monitor = Reactive and Proactive. This monitor has three states: Healthy, Warning and Critical. The rules are similar to the Two-State monitor, as far as Healthy and Critical states are concerned. However, there is an additional state that connotes a degraded condition. In a degraded condition, the service or component is still functioning, but there are problems on the horizon if the administrator doesn’t plan to take remediation actions at the earliest convenience.

With this additional Warning (or degraded) state, we add another type of monitoring to our operational monitoring: proactive. Although this borders on both reactive and proactive, it is still very much proactive, in my opinion, because the administrator is informed of a degraded condition before it turns critical.

How does this relate to the Logical Disk Free Space monitor? Well, this is a Three-State monitor. Hence, we are provided with the best of both worlds from an operational standpoint: both proactive and reactive.

Another part of Proactive monitoring is provided by the reporting feature in Operations Manager. This goes above and beyond the capabilities of having a monitor warn your staff of a degraded state. This arms you with the capability to perform trend analysis of your applications and hardware, allowing your company to use this information for planning and provisioning resources in your infrastructure.

My argument

I have been in my share of arguments around monitoring disk space, usually relating to general recommendations for the threshold types used in this monitor. One of the most heated arguments I’ve heard around these thresholds, is to only use one type of threshold; either the MB threshold or the Percentage threshold. My argument has always been to use both these threshold types, and not to generalize an entire IT infrastructure based on a single threshold type.

I don’t see how anyone could cover the array of disk sizes and the different types of server roles in an environment with a disk free space monitoring solution that uses only one threshold type. In my opinion, using only one threshold type generalizes away all the unique attributes that make up the infrastructure as a whole. All I ask is that you read this article before making a decision as to how you’re going to use this monitor.

The problem

I’ve done my time going through the ranks of systems administration. And this includes carrying a pager, and reacting to alerts from that pager, 24/7. This being the case, I know one thing for sure. And that is…

I do not want to be stirred out of a deep sleep, pulled away from my family or have my golf game interrupted, in order to check on an alert that was triggered, only to find there was plenty of free space on the server I was alerted on.

Sound familiar? I bet it does.

If you answer yes to any of the below questions, your reactive thresholds are not adjusted correctly.

1. At the earliest convenience, do you adjust the threshold for that instance? Or do you just disable monitoring for that drive and be done with it? (I have seen this done.)

2. Do you have a routine down, and you know exactly when that alert will trigger, so you auto-respond to that alert without actually checking it? Or have you started ignoring alerts altogether?

3. Do you end up just checking on that server every day when you come in and when you leave, and see that it’s grown by 100MB each day, just waiting to bring it up in a meeting to allocate more drive space?

Whatever the case may be, you know that this drive is not in a critical state and there is no need to be alarmed yet. Growth of that particular disk has always averaged around 100MB a day, and you know the SAN group will not allocate more space until it’s down to 10GB free.

Make your case

To the on-call admin wearing the pager, listen up. I’m offering this argument to you, so you can then present your ideas to the operations monitoring group.

The first thing you’ll want to do is download the Logical Disk Free Space Monitor Calculator (attached to bottom of article). Also grab this query, to help map out what your current disk sizes look like. A method I often use is to plug the largest disk size, the smallest disk size, and the average disk size into the calculator. Then start playing with the thresholds in the calculator to determine your unique threshold requirements for both System and Non-System drives.
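As a rough illustration of picking those three calculator inputs, here is a minimal Python sketch. The list of volume sizes is hypothetical; in practice it would come from whatever disk size query you run against your own environment, and the function name is my own.

```python
from statistics import mean

# Hypothetical volume sizes in GB, standing in for the output of a disk size query.
disk_sizes_gb = [20, 40, 40, 80, 80, 300, 300, 800]


def calculator_inputs(sizes_gb):
    """Return the smallest, largest, and average size to try in the calculator."""
    return {
        "smallest_gb": min(sizes_gb),
        "largest_gb": max(sizes_gb),
        "average_gb": round(mean(sizes_gb), 1),
    }


inputs = calculator_inputs(disk_sizes_gb)
print(inputs)
```

Running this per server role (file servers, database servers, and so on) rather than across the whole environment keeps the three numbers meaningful for that role.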

First things first. How does the Logical Disk Free Space monitor work, when using both the MB and % threshold types? Here’s how.

The moment BOTH thresholds are exceeded, the state of that monitor will change.

Some basics of the monitor. This monitor is targeted to each type of Windows Server (2000, 2003 and 2008). Just keep that in mind when adjusting thresholds.

This is a double-threshold, three-state monitor. However, because there are two types of thresholds (MB and %), there are actually four thresholds that need to be set for this monitor.
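The "both thresholds" rule can be sketched in a few lines of Python. This is only a model of the behavior described above, not the monitor's actual implementation: the function name is made up, and treating the boundary as strictly "less than" is an assumption. The threshold values in the example are the system drive figures quoted in this article (200MB/10% warning, 100MB/5% critical).

```python
def free_space_state(free_mb, free_pct, warn_mb, warn_pct, crit_mb, crit_pct):
    """Return 'Healthy', 'Warning', or 'Critical' for one logical disk.

    A state is reached only when free space is below BOTH the MB threshold
    and the % threshold for that state (strict '<' is an assumption).
    """
    if free_mb < crit_mb and free_pct < crit_pct:
        return "Critical"
    if free_mb < warn_mb and free_pct < warn_pct:
        return "Warning"
    return "Healthy"


SIZE_MB = 40 * 1024  # a 40GB drive


def pct_free(free_mb):
    """Free space as a percentage of the 40GB drive."""
    return free_mb / SIZE_MB * 100


# 3GB free is under 10% but well above 200MB, so the state does not change.
print(free_space_state(3072, pct_free(3072), 200, 10, 100, 5))
# 150MB free is under both warning thresholds.
print(free_space_state(150, pct_free(150), 200, 10, 100, 5))
# 90MB free is under both critical thresholds.
print(free_space_state(90, pct_free(90), 200, 10, 100, 5))
```

The first call is the interesting one: a drive can sit below its % threshold for a long time without changing state, because the MB threshold has not been crossed yet.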

Go ahead and open up the monitor properties and take a peek at the thresholds. To do this, go to the Authoring space.

Click on Monitors, then click Scope.

Type Logical Disk in the Look for input box, and check all three targets (for each type). Then click okay.

If you expand each of the types, as shown in the image below for 2003 type, you’ll find the monitor. Do not confuse the Free Space monitor with the Availability monitor.

Open the properties of the monitor.

As you’ll see, these thresholds are also split into two types of drives: System and non-System. This may sound confusing, but it’s really quite simple and there is good reason for it. As you might expect, System type drives host the operating system. Non-System type drives are all other drives.

And here are the tabs showing the properties of the monitor.

The reason for the two types of drives is that drives hosting the operating system are usually well-defined with specific volume sizes. These drives usually do not fluctuate in free space. And if they do, we monitor that. But the monitoring is generally much stricter and will match as closely as possible a true warning or critical state for the operating system to function properly.

In other words, a System type drive with 500MB of free space is okay. This drive doesn’t need to generate an alert unless it drops below, for example, 200MB. That’s when we would actually do something to free up some space. That’s when we need to be paged. That truly warrants an alarm.

Out of the box, the System type drive thresholds are as follows: warning at 200MB and 10%, and critical at 100MB and 5%.

Also by default, this monitor generates an alert when it changes to critical. What this means to you is that you’ll see a state change in the Operations Console when the drive hosting the operating system drops below 200MB. This state will persist until the drive reaches critical state or someone moves some files off and creates more free space, which gives you time to catch the warning state in the console.

There is a state view specifically for monitoring Logical Disk free space in the Microsoft Windows Server node of the monitoring pane in the Operations Console. You can also create a view in My Workspace to spot check a specific set of servers for drives in a Warning state once each day. This is part of the proactive monitoring I mentioned.

So, when the drive hosting the operating system drops below 100MB, you’ll get a page and an alert in the Operations Console. Again, this is when action must be taken with urgency. Hence, critical or reactive.

Out of the box, the non-System type drive thresholds are as follows.

Non-System type drives are usually where the tricky thresholds are, and these need to be discussed with your operations team. This is when you can put my disk space calculator to use.

I’m not going to get into semantics about all the different server roles and make recommendations for types of server roles. I’ll just note that the type of server is an important factor in determining disk space monitoring requirements. For instance, database servers will usually have different disk space monitoring thresholds than file servers.

I will, however, be using a file share server role in an example. This is only to get you thinking in the right direction, and is not intended to be a recommendation.

Scenario:

The company has 40 Windows Server 2003 File Share Servers. The majority of these servers have a 40GB system drive, hosting the operating system, with the exception of a handful of servers that were installed in 2003. At the time, the standard build was a 20GB system drive.

For the file shares, most later model servers have one 800GB volume. There are quite a few servers with two 300GB volumes. Then there are a few older model servers, which have two or four 80GB volumes.

The questions that need to be answered are:

What is a warning state? This is the state in which your administrators need to be informed of a degraded situation. At this state of the monitor, there is time to take action to resolve the issue before it turns into a critical state. In other words, this is the proactive threshold.

What is a critical state? This is the state in which your administrators need to be alerted of a critical situation. In this state, an alert will be raised in the Operations Console and a page will be sent to your on-call administrator. This state connotes an urgent issue, and action must be taken at once. In other words, this is the reactive threshold.

These questions need to be answered for both types of drives.

System Drives

In your meeting with the operations monitoring team, these thresholds and states were discussed, and everyone agreed upon the following. Regardless of the size of the system drive, 20GB or 40GB, and considering that the operating system drive usually doesn’t fluctuate and that nobody should be storing data on those drives anyway, a warning should be raised when free space drops to 500MB.

This should give administrators adequate elbow room to proactively monitor for warning conditions and take remediation actions at the soonest opportunity.

Everyone also agreed that we only need an on-call admin to be paged if a drive hosting the operating system drops below 100MB. This is considered critical, as this will affect operating system performance and render it unresponsive soon, and we want someone paged to move files off that drive immediately.

Using the calculator, you determine that the thresholds for the system drive should be adjusted as follows.

Note that only a single threshold needed to be adjusted. The critical MB threshold, by default, meets our requirements. And both the warning and critical % thresholds, by default, meet our requirements. We need to create an override, for the file share servers, only for the warning MB threshold.

Here’s what it looks like in the calculator.

Remember, our decision was based on MB thresholds only. We did not even care about % free space.

Given that 10% and 5%, for warning and critical, are well over our defined 500MB and 100MB, respectively, given our drive sizes, we don’t need to play with the % thresholds. Technically, these % thresholds will be exceeded on our 40GB drives at 4GB and 2GB, for warning and critical.

Remember that both MB and % need to be exceeded, in order for a state change to occur. So, again, we only need to create an override for the warning MB threshold. And that override setting is 500MB.
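To double-check that reasoning, here is a small Python sketch. Because both thresholds must be exceeded, the state effectively changes at whichever threshold is lower once the % is converted to MB for the volume size. The function name is my own, and 1GB = 1024MB is assumed.

```python
def trip_point_mb(size_gb, mb_threshold, pct_threshold):
    """Free space (in MB) below which BOTH thresholds are exceeded."""
    pct_as_mb = size_gb * 1024 * pct_threshold / 100
    return min(mb_threshold, pct_as_mb)


for size_gb in (20, 40):
    warning = trip_point_mb(size_gb, 500, 10)   # 500MB override, default 10%
    critical = trip_point_mb(size_gb, 100, 5)   # default 100MB, default 5%
    print(f"{size_gb}GB system drive: warning below {warning:.0f}MB, "
          f"critical below {critical:.0f}MB")
```

For both the 20GB and the 40GB system drives the MB values are the lower ones, which is why the % thresholds can be left at their defaults in this scenario.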

Non-System Drives

Remember, most later model servers have one 800GB volume. There are a few with two 300GB volumes. Then there are a few older model servers, which have two or four 80GB volumes.

As I mentioned earlier, it is usually a bit trickier to find a good balance for these non-system drives. This is because there is a vast difference in volume sizes, and we’re trying to wrap our heads around a happy medium.

In the meeting with the operations monitoring team, we discussed only using the % threshold, and setting it at 10% and 5% for warning and critical, respectively. This didn’t go over very well because, again, we don’t want to wake our on-call admin up in the middle of the night because there was only 40GB left on a file share. That’s not exactly an urgent issue. Plus, we already know about that server and we’re expecting additional drive space to be allocated on Wednesday. We knew this because we saw the state change in the Operations Console when that volume dropped to 80GB two weeks ago.

We discussed only using the MB thresholds, adjusting them to 20GB and 4GB, for warning and critical, respectively. This didn’t go over well, because we really don’t want to wake the on-call admin again when one of the smaller 80GB drives drops to 4GB free space. These are not high volume drives, and when they are out of space we plan to move that data off to a larger volume anyway.

Rather than juggling these numbers, you break out the calculator, plug in the volume sizes (800, 300 and 80GB), and start plugging in some threshold values. After a few iterations, everyone liked the following thresholds.

Notice in the middle columns in the calculator, that the 800GB drive changes state for both warning and critical on only the MB threshold value. The 80GB drive changes state for both warning and critical on only the % threshold. The 300GB actually will use the % threshold value for the warning state change, and the MB threshold value for the critical state change.
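The pattern described above can be reproduced with a short Python sketch. The threshold values below (40GB/10% warning, 10GB/5% critical) are hypothetical, picked only because they produce the same pattern; they are not the values the team in the scenario settled on. The function name is also my own.

```python
# Hypothetical thresholds, chosen only to reproduce the pattern described
# above. They are NOT the values agreed on in the article's scenario.
WARNING = {"mb": 40 * 1024, "pct": 10}
CRITICAL = {"mb": 10 * 1024, "pct": 5}


def binding_threshold(size_gb, threshold):
    """Return (trip point in GB, threshold type that decides it).

    Both thresholds must be exceeded, so the lower of the two, expressed
    in GB for this volume size, is the one that triggers the state change.
    """
    mb_as_gb = threshold["mb"] / 1024
    pct_as_gb = size_gb * threshold["pct"] / 100
    if mb_as_gb < pct_as_gb:
        return mb_as_gb, "MB"
    return pct_as_gb, "%"


for size_gb in (800, 300, 80):
    warn_gb, warn_type = binding_threshold(size_gb, WARNING)
    crit_gb, crit_type = binding_threshold(size_gb, CRITICAL)
    print(f"{size_gb}GB volume: warning below {warn_gb:g}GB ({warn_type}), "
          f"critical below {crit_gb:g}GB ({crit_type})")
```

With these illustrative numbers the 800GB volume is decided by MB for both states, the 80GB volume by % for both, and the 300GB volume by % for warning and MB for critical.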

This is a great balance for these file share servers. Each size volume has an adequate warning threshold, to allow plenty of time to proactively monitor these warning states and take action at the earliest convenience.

This also generates a critical state, subsequently generating an alert in the Operations Console and paging the on-call admin. These are all truly critical states, that require immediate action.

This meets all our requirements to expedite warning and critical states appropriately. And, most importantly, your on-call admin will appreciate that we have a good definition around monitoring disk space. Now he’s taking these pages seriously, and isn’t bothered for non-critical conditions.

Using Views for Proactive Monitoring

With well defined thresholds around disk free space monitoring, allowing for ample time to take action without urgency, we can use the Logical Disk state view in the Operations Console to proactively monitor free disk space. Checking this state view once per day will be a part of the daily routine.

You can find this state view here.

What we’re looking for here are servers in a warning state. If you have hundreds or thousands of servers, you can make this easier to look at by sorting on the State column header.

If you want a more targeted view, containing only file share servers in a warning state, you can create a new state view in My Workspace. Here’s an example of such a view.

So, not only are we monitoring for reactive conditions, we are also proactively monitoring disk space by means of establishing well defined thresholds for the Logical Disk Free Space monitor.

Again, as I mentioned earlier, another important piece of proactive monitoring is the report feature in Operations Manager. We can take proactive measures much further by using the reporting component. This will give us even richer information, like trend analysis for future planning and provisioning of resources.

I hope now you have a good understanding of how this monitor works. Along with the given example, and the free space calculator, you should now be armed and ready to tackle these disk free space alerts that have been so troubling for so many…especially for those on-call administrators.

Initially I thought this was a subscription criteria question, and I wanted to say that you would need to update your notification subscription to pick up both Warning and Critical severity alerts, in order to receive notifications about alerts in both states. But then I thought about how monitors work: a monitor generates an alert only when it leaves the healthy state. Even though moving between Warning and Critical is a state change, we never return to a healthy state in this case. If we do not return to a healthy state, the original alert is not resolved and a new alert will not be generated. So, we could essentially go back and forth between Warning and Critical without ever generating a new alert. In fact, what happens is the original alert instance will be updated to match the state of the monitor. When you think about this, and you see the Warning and then the Critical alert, you are actually looking at the same alert. Only the severity of that alert has changed. No new alert was actually generated.

Going back to your problem with not receiving notifications when an alert changes from Warning to Critical (and vice versa), this will not work as expected because in actuality no new alert was generated. This is a fundamental piece that is missing from the current subscription criteria UI, in which we cannot specify to also notify on alerts that have changed state between Warning and Critical.

Bottom line:

If you decide to match alert severity to monitor health state, this works as designed as far as seeing the alert severity match the monitor state in the console. BUT, if you're using notifications while also overriding the monitor to match alert severity to monitor state, you will not receive notifications when the monitor changes state between Warning and Critical, because the subscription workflow cannot be configured to pick up alert severity changes. Because of this missing criteria in the notifications workflow, I do not recommend matching alert severity with monitor state if you are using a notification channel, because you'll miss the Critical alerts in your email or pager, etc.
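The alert behavior described in this reply can be pictured with a toy Python model. It is only a simplified sketch of the explanation above, not Operations Manager code: the class and attribute names are invented, and it assumes the alert severity is overridden to match monitor state and that the alert resolves when the monitor returns to healthy.

```python
class MonitorAlertModel:
    """Toy model of how a monitor's alert follows its health state."""

    def __init__(self):
        self.state = "Healthy"
        self.open_alert = None    # severity of the open alert, if any
        self.alerts_created = 0   # each new alert is a chance to notify

    def change_state(self, new_state):
        if new_state == "Healthy":
            self.open_alert = None         # the alert is resolved
        elif self.open_alert is None:
            self.open_alert = new_state    # a new alert is generated
            self.alerts_created += 1
        else:
            self.open_alert = new_state    # same alert, severity updated
        self.state = new_state


monitor = MonitorAlertModel()
for state in ("Warning", "Critical", "Warning", "Critical"):
    monitor.change_state(state)

# Four state changes, but only one alert was ever created.
print(monitor.alerts_created, monitor.open_alert)
```

In this model a second alert only appears after the monitor has gone back to Healthy and then degraded again, which matches the notification gap described above.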

@Krish – you can create groups of logical disk instances, and apply different overrides for each of those groups. So, you'll have 3 groups, each containing disks that match the size criteria. Then you would set the MB threshold for the monitor very high, like 10240000. Then set the % threshold to what you want for both warning and critical states. The groups will dynamically populate when you change disk sizes, as long as your group memberships are dynamic (not explicit).

First, to answer your question…no. I do not recommend configuring thresholds for each disk size. This isn’t manageable. The whole idea behind this post is that we do NOT need to configure thresholds for each disk size.

The calculator tool here is to help identify a "happy balance" for ALL your disk sizes. The intent is to configure acceptable default thresholds (both % and MB) for all disk drives in your environment.

As noted in the post, I recommend setting these thresholds for TYPES of servers, as disk drive space thresholds on particular server types may be different than other server types. Other than specifying for server type, the idea is to set the default thresholds on the monitor that offers a good balance for ALL disk drive sizes.

Please use the disks sizes query I supplied in the post to determine what are the disk sizes in your environment, as this will help determine these thresholds. I recommend plugging in the smallest and largest disk size, as well as the average disk size, into the calculator. This has worked well with my customers.

What we’re trying to determine here is a good default threshold for both threshold types (% and MB). This is very specific to your needs. That’s why I do not give recommendations here. Instead, I give you the knowledge and tools for you to determine what is best.

Hi Dom – I know there are some problems with disks being undiscovered/rediscovered in some failover situations, but I don't recall hearing of your specific issue before. Without ranting, I personally think there should be a dedicated disk monitoring MP, and should have better workflows built around different types of disks to cover situations like this. Would love to write this, but would take time…

I see that you answered a question previously about the possibility of multiple notifications being triggered from a severity change on an alert. I understand that this won't work by simply creating 2 different subscriptions, but is there another way to achieve this? I want to send out emails when a warning alert is generated, and send out a message to our pagers when an alert moved to critical.

I too need to create a view for servers with disk space in non-healthy state. But targeting Logical Disk also will get disk performance states, which I do not want included in the View that is meant for disk space state only. How do I get around this? thanks.

Suresh – If you want to use only the % threshold type, you could simply set the MB threshold on the sealed monitor discussed in the post to a very high value. If you set the MB threshold to 1024000, then the MB threshold would be exceeded whenever the logical disk has less than roughly 1TB of free space. This condition is usually very easy to meet, so the state change is then decided by the % threshold, which would be the value that you want to use. There is no need to create another custom monitor for this.
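A quick Python sketch of why this works, using the 1024000 value from the reply. The function name, the 300GB volume, and the 10% threshold are illustrative assumptions only.

```python
HIGH_MB_THRESHOLD = 1024000  # the very high MB value suggested above


def threshold_breached(free_mb, size_mb, mb_threshold, pct_threshold):
    """True when free space is below BOTH the MB and the % threshold."""
    free_pct = free_mb / size_mb * 100
    return free_mb < mb_threshold and free_pct < pct_threshold


size_mb = 300 * 1024  # hypothetical 300GB volume

# 12% free: the MB condition is met, but the % condition is not.
print(threshold_breached(36864, size_mb, HIGH_MB_THRESHOLD, 10))
# 8% free: both conditions are met, so the % threshold decided the outcome.
print(threshold_breached(24576, size_mb, HIGH_MB_THRESHOLD, 10))
```

The one caveat is a volume with more free space than the MB value itself: there the MB condition is not met, so the % threshold alone no longer decides the state.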

Hey Paul – That's one way to go about it, but in my experience the method you describe causes a lot of additional override management for the SCOM admin. One of the reasons we have the two threshold types is to reduce the number of overrides we need to create to cover all disk sizes. Thanks for the input!

@Mark67 – if you are savvy with authoring MPs, you could "forklift" the logical disk free space monitor and put it into your own custom MP, twice. So, name one Free Space Critical and the other Free Space Warning. Set up the monitors just as they are in the vendor MP, and set your thresholds. Now subscriptions will generate notifications on both warning and critical.

There are probably other ways to do it, I'm sure, but this is the route I would take if I needed to tackle your problem. In my opinion, it's better to change monitoring than it is to change other moving parts, like notification channels and getting fancy with extensibility.

We can include columns in state views for data that is discovered. Disk free space is not discovered data. It is, however, collected as performance data. So you could create a performance view that shows, for example, LogicalDisk\% Free Space.

I have a question and maybe you can help me with this as well. I successfully configured Override for all objects of class Windows Server Logical Disk. I changed the default values of Warning and Error % Threshold, Warning and Error MBytes Threshold for System drives. I put it into new management pack.

But when I open the properties of the Logical Disk Free Space monitor, and then go to the System Drive % tab or the System Drive MBytes tab, I see the same default threshold values. It seems that nothing has changed.

I have the issue that when overriding the alert on state to "warning or critical", no notifications are sent when the alert changes from warning to critical. Is this by design when you use the override functionality, or can this be prevented in some way?

Is there a way to create a view that actually SHOWS the free space (MB free) for each of the servers? It would be great to be able to add that as a column and be able to sort on that data (whether in "My Workspace" or in the Logical Disk State view).

Nice work – problem we had was a whole bunch of different sized disks and would have been a nightmare to override all of them (250+). Our need was to fix a percentage value which I found was possible by fixing a very high MB free value. This effectively makes that trigger value always true and allows you to manipulate the percentage free value to what you want. This same method can be used to isolate the MB free trigger (maybe by setting it to 100%).

Great to see detailed information regarding Logical Disk Free Space. You gave many scenarios and explained in detail about the positive and negative of using Available MB and % Free space.

But in my environment, I want to receive an alert only by checking % Free space and NOT Available MB. Hence I created a "Static Double Threshold Performance Monitor" targeting Windows 2003 Logical Disk. After creating this I could see two challenges:

1. Drive letter coming in "Path:server.xx.comDriveLetter" field is wrong but the server name is correct.

I have a C: drive on a cluster node where all active drives and services have failed over to the other node. This C: drive now seems to be seen as a non-system drive, as alerts for the disk space % and MB thresholds are being sent according to the non-system monitor settings. Is this expected?