Systematic approach for eliminating failures

Eliminating equipment failures may be every plant engineer's dream, but relatively few have a clear roadmap for accomplishing it. This system, or a close derivative of it, has enabled numerous manufacturing plants to achieve significant continuous improvements in plant availability, run rates, safety, and production quality. Benchmarking efforts focused on these metrics have shown use of this or a similar system to be a global best practice.

The system provides predictable positive results for the following reasons:

Clear accountabilities are established throughout the organization

Decisions are data driven without bureaucracy

Continuous improvement is systematic because raising actual performance and expectations becomes ingrained in the plant's standard operating procedures and, eventually, its culture

ISO, QS, and OSHA corrective action requirements are met as a natural course of doing business

The flow chart in Fig. 1 shows each of the process elements in the system. Reference numbers in the chart correspond to the points in the following text.

Fig. 1. A formalized failure elimination process like this one can ensure continuous improvement in many areas of plant operations and maintenance.

Publication of performance expectations and goals by the leadership team (1)

Typically, the first step of each improvement loop is for the plant or corporate leadership team to challenge those who operate a department or facility to reach specific potentials for availability, run rates, quality, etc. The potential is based on performance benchmarks achieved elsewhere or on projected results of specific, planned improvement action items. The longer the failure elimination system has been used, the more accurately the expectations and goals can be projected.

Establishment of operating plan to achieve expectations and goals (2)

Assembling a clear scheme for how a department or plant can actually achieve the performance potential in the coming year is part of the operating plan. When this planning process shows that a change in expectations is warranted, the proposed change goes back to the leadership team for consideration. This iterative exchange of goal setting and feasibility evaluation is called "catchball." The end result is a set of challenging goals confirmed to be realistic through definitive action plans.

Variance analysis of actual vs. plan (3)

The variance analysis compares actual performance with what was promised in the operating plan. The analysis is typically completed or finalized by departmental leadership.

Performance that is poorer than planned or projected triggers an analysis of the cause of the variance. Cumulative annual and period (i.e., daily, weekly, monthly, quarterly) statistics are normally tracked. As experience with the failure elimination process grows, so does the level of detail in this analysis.

Initially, the variance review may focus on total downtime, run rates, defect levels, mean times between failures, mean times to restore, and accident rates. Later, the focus is often down to the level of specific downtime causes, defects, or rate losses.

Over time, the analysis becomes an audit of actual failure rate statistics versus the rates assumed in failure modes and effects analysis, reliability centered maintenance, HAZOPS, or other risk analyses.
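The comparison of actual results against the operating plan can be sketched in a few lines of Python. This is only an illustration: the `PeriodStats` fields, the function names, and the sample figures are assumptions made for the example and do not come from any particular plant system.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    scheduled_hours: float  # hours the line was scheduled to run
    downtime_hours: float   # unplanned downtime within those hours
    failures: int           # number of failure events in the period


def availability(s: PeriodStats) -> float:
    return (s.scheduled_hours - s.downtime_hours) / s.scheduled_hours


def mtbf(s: PeriodStats) -> float:
    # Mean time between failures: running hours per failure event.
    if s.failures == 0:
        raise ValueError("no failures in period; MTBF is undefined")
    return (s.scheduled_hours - s.downtime_hours) / s.failures


def mttr(s: PeriodStats) -> float:
    # Mean time to restore: downtime hours per failure event.
    if s.failures == 0:
        raise ValueError("no failures in period; MTTR is undefined")
    return s.downtime_hours / s.failures


def variance_report(plan: PeriodStats, actual: PeriodStats) -> dict:
    # Actual minus plan. A negative availability or MTBF variance, or a
    # positive MTTR variance, means performance is poorer than planned.
    return {
        "availability": availability(actual) - availability(plan),
        "mtbf_hours": mtbf(actual) - mtbf(plan),
        "mttr_hours": mttr(actual) - mttr(plan),
    }


# Hypothetical figures for one 160-hour period.
plan = PeriodStats(scheduled_hours=160.0, downtime_hours=8.0, failures=4)
actual = PeriodStats(scheduled_hours=160.0, downtime_hours=16.0, failures=6)
report = variance_report(plan, actual)
```

With these sample numbers, availability falls about five points short of plan and MTBF is 14 hours below plan, which is the kind of variance that would trigger a closer look at causes.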

Reports of major incidents and root cause analyses (4)

An extremely important part of rapid failure elimination is eliminating significant incidents. An initial step is to define what a significant failure is. Typically, that definition is in terms of downtime minutes, accident/near miss severity, production rate deterioration, and/or units of defective product.

Next, the responsibility for reporting when such an incident has occurred and for analyzing the root cause of that incident becomes standard operating procedure.
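One way to make the definition of a significant failure unambiguous is to write it down as explicit thresholds. The sketch below assumes four criteria that mirror the terms listed above. The threshold values, the severity scale, and all names are hypothetical and would be set by each plant.

```python
from dataclasses import dataclass

# Hypothetical thresholds; each plant defines its own.
DOWNTIME_MINUTES_LIMIT = 60
DEFECTIVE_UNITS_LIMIT = 500
RATE_LOSS_PCT_LIMIT = 10.0
SAFETY_SEVERITY_LIMIT = 2  # assumed scale: 0 = none, higher = more severe


@dataclass
class Incident:
    downtime_minutes: float = 0.0
    defective_units: int = 0
    rate_loss_pct: float = 0.0
    safety_severity: int = 0


def significant_reasons(inc: Incident) -> list:
    """Return the criteria this incident meets; an empty list means
    it does not qualify as a significant failure."""
    reasons = []
    if inc.downtime_minutes >= DOWNTIME_MINUTES_LIMIT:
        reasons.append("downtime")
    if inc.defective_units >= DEFECTIVE_UNITS_LIMIT:
        reasons.append("defects")
    if inc.rate_loss_pct >= RATE_LOSS_PCT_LIMIT:
        reasons.append("rate loss")
    if inc.safety_severity >= SAFETY_SEVERITY_LIMIT:
        reasons.append("safety")
    return reasons


motor_trip = Incident(downtime_minutes=95)
minor_jam = Incident(downtime_minutes=5, defective_units=12)
motor_trip_reasons = significant_reasons(motor_trip)
minor_jam_reasons = significant_reasons(minor_jam)
```

Returning the list of criteria met, rather than a bare yes or no, gives the person filing the report a starting point for the incident description.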

A form for reporting significant failures and capturing information related to a root cause failure analysis is provided in Fig. 2.

Fig. 2. A failure analysis report should capture all data and activities related to a failure, including root cause and correction verification.

Decisions on whether to proceed with the suggested preventive or mitigation action plans should be made by the analyst or departmental leadership team in order to maximize the speed of implementation, solution ownership, and accountability.

Variance and incidents preventable by known actions? (5)

Many times, the appropriate response to performance variances or major incident reports is not obvious. A standing or ad hoc team is needed to deal with these situations. When the failure elimination system has been in place for less than a year or two, external support typically helps departmental teams to address these issues. Later, when uptime, run rates, and quality are under control, departmental leadership teams decide how to proceed if the required solution is not obvious.

Prioritization of loss reasons (6)

When it is not clear what to do to address losses of availability, run speed, or quality throughput, there is a need for more detailed causal data and analysis. This need exists because failure reasons are usually codified at a summary level in order to simplify reporting and performance statistics.

Pareto charts reveal which of the failure reasons codified at a summary level deserve the extra work necessary to eliminate the underlying root causes that drive the losses. Figure 3 illustrates a typical downtime Pareto chart. This activity is often performed by an ad hoc team when the elimination process is first implemented. Later, it becomes the responsibility of departmental leadership or reliability and/or quality teams.

Fig. 3. A Pareto chart summarizing production downtime attributed to various causes quickly reveals where priorities should be set for improving performance.
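The table behind a downtime Pareto chart is simple to compute: sort the loss reasons by size and accumulate their share of the total. The following is a minimal sketch. The reason codes and minute values are invented for illustration.

```python
def pareto(loss_by_reason: dict) -> list:
    """Return (reason, loss, cumulative percent) rows, largest loss first."""
    total = sum(loss_by_reason.values())
    if total == 0:
        return []
    rows = []
    cumulative = 0.0
    ranked = sorted(loss_by_reason.items(), key=lambda kv: kv[1], reverse=True)
    for reason, loss in ranked:
        cumulative += loss
        rows.append((reason, loss, 100.0 * cumulative / total))
    return rows


# Hypothetical downtime minutes logged against summary-level reason codes.
downtime = {
    "Drive": 420,
    "Jam": 180,
    "Changeover": 250,
    "Electrical": 90,
    "Other": 60,
}
rows = pareto(downtime)
for reason, minutes, cum_pct in rows:
    print(f"{reason:<12}{minutes:>6}{cum_pct:>8.1f}%")
```

In this sample, the top two reasons account for about two thirds of the downtime, so they are the ones that would justify deeper root cause work.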

Loss contributions and action plans developed for root causes (7)

The reasons for downtime, product defects, and run speed losses are typically codified by the equipment component or type of operator error involved rather than the root cause of the problem. For example, downtime caused by a motor failure will be logged against the drive rather than against "improper lubrication," "bad rewinding practice," "excessive start frequency by operator," or another potential root cause. This situation arises because there is insufficient time (and sometimes insufficient resources) to perform root cause analyses of all incidents and because the root cause is usually determined some time after the incident is recorded.

Activities to remove failures must be directed at specific root causes to be effective. Consequently, when developing action plans to reduce losses, one must apportion the total downtime, defects, etc., associated with a loss reason among the likely root causes. Figure 4 is a form for accomplishing this apportionment. Although the process is not completely objective or data driven, it beats the alternatives.
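The apportionment itself is a weighted split of the loss logged against a reason code. The sketch below uses the motor failure example from the text. The 420 minutes and the relative weights are hypothetical estimates of the kind a team might agree on, not measured data.

```python
def apportion(total_loss: float, weights: dict) -> dict:
    """Split a total loss among root causes in proportion to the
    team's estimated weights (the weights need not sum to 100)."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one root cause needs a positive weight")
    return {cause: total_loss * w / weight_sum for cause, w in weights.items()}


# Hypothetical: 420 minutes of downtime logged against the drive,
# split by the team's judgment of how much each root cause contributes.
drive_losses = apportion(
    420.0,
    {
        "improper lubrication": 50,
        "bad rewinding practice": 30,
        "excessive start frequency by operator": 20,
    },
)
```

Normalizing by the sum of the weights means the team can work in percentages, points, or incident counts without changing the result.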

Once root causes are identified and their impact quantified, potential action plans must be postulated with estimates of time and cost to complete the plans. This step compiles the information necessary to select the action plans that make the best business sense.
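One simple way to compare candidate action plans on business sense is payback period: cost divided by the annual value of the loss removed. The sketch below assumes a single dollar value per downtime minute. That figure, the plan names, and the estimates are all hypothetical, and a plant may well prefer a different financial measure.

```python
from dataclasses import dataclass


@dataclass
class ActionPlan:
    name: str
    minutes_saved_per_year: float  # estimated downtime reduction
    cost: float                    # estimated cost to complete
    weeks_to_complete: int         # estimated duration


def payback_years(plan: ActionPlan, value_per_minute: float) -> float:
    annual_benefit = plan.minutes_saved_per_year * value_per_minute
    if annual_benefit <= 0:
        return float("inf")  # a plan with no benefit never pays back
    return plan.cost / annual_benefit


def rank_plans(plans: list, value_per_minute: float) -> list:
    """Return the plans ordered from shortest to longest payback."""
    return sorted(plans, key=lambda p: payback_years(p, value_per_minute))


VALUE_PER_MINUTE = 50.0  # hypothetical value of one downtime minute

candidates = [
    ActionPlan("Qualify new rewind shop", 90, 9000.0, 12),
    ActionPlan("Revise lubrication route", 150, 2000.0, 3),
    ActionPlan("Add start-frequency interlock", 60, 1500.0, 6),
]
ranked = rank_plans(candidates, VALUE_PER_MINUTE)
```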

Action items added to action database (8)

In plants that suffer from numerous reliability and quality problems, it is common for plant personnel to dance from one crisis to another without staying focused on a problem or a plan until it is resolved. Keeping action plans current in a database with priorities, target dates, and accountabilities viewable by the general plant population promotes the discipline needed to systematically eliminate the failures. The responsibility for completing this activity usually falls to operating and maintenance leads, engineers, or standing reliability teams.
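The action database does not need to be elaborate. As a rough sketch, the example below uses Python's built-in `sqlite3` module with an in-memory database. The table layout, column names, status values, and sample rows are assumptions chosen to hold the priorities, target dates, and accountabilities described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE action_items (
        id          INTEGER PRIMARY KEY,
        description TEXT NOT NULL,
        root_cause  TEXT,
        priority    INTEGER NOT NULL,   -- 1 = highest
        owner       TEXT NOT NULL,      -- single accountable individual
        target_date TEXT NOT NULL,      -- ISO 8601, so text order is date order
        status      TEXT NOT NULL DEFAULT 'open',  -- open, delayed, complete
        status_note TEXT                -- e.g. reason an item was delayed
    )
    """
)

# Hypothetical action items.
conn.executemany(
    "INSERT INTO action_items "
    "(description, root_cause, priority, owner, target_date, status, status_note) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Revise lubrication route", "improper lubrication", 1,
         "Maintenance lead", "2024-05-15", "open", None),
        ("Add start-frequency interlock", "excessive start frequency by operator", 2,
         "Controls engineer", "2024-07-01", "open", None),
        ("Qualify new rewind shop", "bad rewinding practice", 3,
         "Reliability engineer", "2024-04-30", "delayed", "Awaiting capital approval"),
    ],
)


def overdue_items(connection: sqlite3.Connection, today: str) -> list:
    """Return open items whose target date has passed, highest priority first."""
    cursor = connection.execute(
        "SELECT description, owner, target_date FROM action_items "
        "WHERE status = 'open' AND target_date < ? "
        "ORDER BY priority, target_date",
        (today,),
    )
    return cursor.fetchall()


late = overdue_items(conn, "2024-06-01")
```

A query like `overdue_items` is the kind of view that could feed a routine status report, while the `status_note` column keeps the reason for any delay alongside the item.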

Action plans approved? (9)

Many of the action plans developed will involve minor changes in operating or maintenance practices. These changes call for minimal red tape in approvals so that the rate of improvement is maximized. Some actions, however, require capital or collaboration across teams. These plans require a simple, well understood structure for getting the appropriate people involved in deciding how to proceed.

Action item delay noted in database (10)

When an action item cannot or should not be implemented immediately, the reason and logic for not approving it must be communicated to those involved in the previous analysis and planning. Similarly, the status of that action item should be maintained in the database so it can be reconsidered easily when appropriate.

Accountability assigned for action item (11)

Accountability for completing the action plan in a timely, cost-effective fashion is crucial. Although teams of resources may provide support, accountability with sufficient resources must be assigned to a single individual. Frequently, this assignment is automatic, because the responsibility is already part of someone's job description.

Action item completion (12)

The failure elimination process does not end with the completion of action items. Effectiveness of the solution must be validated, including assurance that the fix did not create new problems.

Action item status reports (13)

Simple and routine status reports by those accountable for action items maintain focus and promote schedule and budget discipline.

Effectiveness audits (14)

If the ultimate actual impact of each action item isn't quantified, work teams lose many valuable lessons learned from their successes and misfires. In addition, a trial-and-error mentality evolves and errors are repeated by successors.

Once successes are validated, they are easily extrapolated to other items that are susceptible to the same or similar failure modes.
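An effectiveness audit can be reduced to two questions: how much of the expected loss reduction was actually achieved, and did any other loss reason get worse after the fix. The sketch below answers both from before and after loss totals. The function name, reason codes, and figures are hypothetical.

```python
def audit_action(before: dict, after: dict, target: str, expected_reduction: float) -> dict:
    """Compare losses before and after an action item was completed."""
    if expected_reduction <= 0:
        raise ValueError("expected_reduction must be positive")
    actual_reduction = before[target] - after.get(target, 0.0)
    # Any other reason whose loss increased may be a side effect of the fix
    # and deserves a second look.
    worsened = sorted(
        reason
        for reason, loss in after.items()
        if reason != target and loss > before.get(reason, 0.0)
    )
    return {
        "actual_reduction": actual_reduction,
        "fraction_of_expected": actual_reduction / expected_reduction,
        "worsened_reasons": worsened,
    }


# Hypothetical downtime minutes for comparable periods before and after the fix.
before = {"Drive": 420.0, "Jam": 180.0, "Electrical": 90.0}
after = {"Drive": 300.0, "Jam": 175.0, "Electrical": 110.0}
result = audit_action(before, after, target="Drive", expected_reduction=150.0)
```

Here the fix delivered 80 percent of the expected reduction, and the rise in electrical downtime is flagged for review. An increase in another reason does not prove the fix caused it, only that the question should be asked.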

Database updates (15)

Using the action item database to track and communicate the status of each action item provides a useful reference when a failure recurs or is not completely eliminated as planned.

Routine performance and action item review by leadership team (16)

By religiously reviewing performance statistics and the status of key items in the action database, plant and departmental leaders can properly allocate resources, adjust priorities, focus improvement, and raise the bar for expected performance.