Mean Time Between Failure (MTBF) and Power BI

Introduction

Mean Time Between Failure (MTBF) is a common term and concept used in equipment and plant maintenance contexts. In addition, MTBF is an important consideration in the development of products. MTBF, along with other maintenance, repair and reliability information, can be extremely valuable to organizations to help identify problematic systems, predict system outages, improve product designs and improve overall operational efficiency and effectiveness. As such, it would be beneficial to understand how to implement this metric within Power BI.

Background

MTBF is essentially the arithmetic mean time between failures of a system. Now, the definition of a failure will depend greatly on the circumstance in which this metric is applied but in general, failures that do not take the system out of service are generally not considered failures. In addition, routine maintenance tasks that are scheduled are also generally not considered failures.

The below image from this Wikipedia Page probably does about as good a job as anything in depicting MTBF.

In formal math language, MTBF is defined as:

Let's Build It!

OK, enough history and background, let's build this puppy!

Step 1: Prepare Your Data

So, I generated some fake data that you can download here. The data has a little over 4,000 rows and represents 3 years of repair data for about a dozen "machines". This data is pretty bare bones but has the essential elements that we will need to demonstrate the technique. Each row represents a single repair. Here is what the data looks like:

So, we essentially have an identifier for the machine, when the repair started, when the repair ended, the type of repair (a repair or preventative maintenance (PM)) and the cause of the repair.

Step 2: Load into Power BI and Create Calculated Fields

This next step is easy, just fire up Power BI Desktop and from the Home tab of the ribbon, choose Get Data | Excel. Point to the data file you just downloaded (above) and choose either the Sheet (MTBF) or the Table (MTBF1) displayed in the navigation and choose Load.

Next, we are going to create two calculated columns. The first one is how long the repair took to complete. The formula for this is:

Once you create this column, on the Modeling tab of the ribbon, make sure that the Data type is set to Decimal Number and that you specify at least 2 decimal places.

Obviously, all we are doing is subtracting the time the repair started from the time the repair completed so that we end up with how long the repair took to complete. Perhaps less obvious is the reason we specified SECOND instead of HOUR in our DATEDIFF function even though we want this metric in hours and thus the division by 3600 seconds/hour. The reason is that if we had specified HOUR, then DATEDIFF would have truncated the calculation to whole hours and we would have lost fractions of an hour.

The second calculated column is a bit more complex:

Uptime =

VAR next = MINX(FILTER(Repairs,

Repairs[MachineName]=EARLIER(Repairs[MachineName]) &&

Repairs[RepairStarted]>EARLIER(Repairs[RepairStarted]) &&

Repairs[RepairType]<>"PM"

),Repairs[RepairStarted])

RETURN IF([RepairType]="PM", 0,IF(ISBLANK(next),

DATEDIFF([RepairCompleted],NOW(),SECOND),

DATEDIFF([RepairCompleted],next,SECOND)

)

)

OK, let me explain what is going on here. Remember back to our definition of MTBF, we need to find the uptime between a machine going into an "up" state and that same machine going into a "down" state. But our data isn't structured that way, the data is focused on collecting repair information.

Thus, the first part of the formula, the VAR portion is the calculation of the variable next. This is about finding the next repair after the current repair. We do this by finding the MIN of the RepairStarted column after filtering our Repair table for machines that are equal to the current machine in the row as well as having a RepairStarted that is after the current repair in the row and we are excluding "PM" rows as those are preventative maintenance and not an actual failure.

The second part of the formula, the RETURN portion is really three pieces. If the current row is a maintenance task (PM), then we return 0 for uptime as we do not want those tasks included in our MTBF calculation. If our calculation for next is BLANK, then we know that this is the most recent failure in our data set. So, we return the difference in seconds between NOW and the completion of the current failure. Finally, if neither of those are the case, then we simply calculate the difference in seconds between when our next repair started and the current repair ended.

Step 3: Create some Measures

OK, now that we have our data loaded and have created some extra calculated columns, we can now create the measures that we will use in our report. Create the following measures:

Repairs: This is the total number of repairs that have occurred excluding preventative maintenance (PM) repairs.

MTBF: This is our MTBF calculation that takes the sum of our uptime in seconds, divides by the number of repairs to get an average and then converts the number to hours by dividing by 3600 seconds/hour.

MDT: This is a measure that is often associated with MTBF, MDT is Mean Down Time or the average amount of time that a repair takes to complete. Therefore, we take the sum of our Repair Hours (which does not exclude preventative maintenance tasks) and divides it by the count of all repairs. Note, we do not use the Repairs measure in this case since we want to include PM repairs as well.

Last Repair: This is simply the date/time of our last repair

Next Expected Repair: By knowing when our last repair occurred, we can add our MTBF calculation converted to days (/24) to determine when we expect our next failure to occur.

Step 4: Build the Report

Now that we have all our measures created, it is mere child's play to create a report similar to the following:

I have left everything default so that the visuals themselves show what columns and measures are being displayed. Even this highly simplified report shows valuable information about each of our machines, the types of failures they experience and because we used measures it is all interactive. For example, if we click on "Worn Component" in the middle bar chart:

We can see that failures caused by worn components occur much less often than the average (a MTBF of 300 versus 199 for all failure types) and they take about half the time to complete (MDT of 1.23 versus 2.19 for all failure types).

Conclusion

Other than one potentially complex DAX calculated column, it is extremely easy and straight-forward to use Power BI to create reports that summarize repair information, including MTBF. And, remember that the definition of "failure" can be subjective and can be creatively applied to other subject domains. This example here really is just based on having a start time and an end time for "something". So, could this be applied to something like HR data and start dates and termination dates? Sure, consider each termination and "failure" and MTBF could shed light on the average time it takes to backfill a position perhaps. The possibilities are endless so start putting this technique into use in your own situations in Power BI!

So a repair started after the first, but was resolved before the second.

Messed around with a few things, but am still getting a lot of negative values for MTBF and 'Uptime'.I tried making a new row key by combing Category, Location, Machine. But its not gotten me what I am after, so I would appreciate some input if anyone sees this and might understand whats going on a bit better than I do.