How to develop an effective root cause failure analysis process

April 25, 2013
Adopt a reliability-centric organization structure to improve reliability, maintenance, and operations.

A reliability-centric organization makes reliability the focus of the maintenance and operations departments. The company must have a strong, independent reliability leader who not only looks at traditional reliability improvements, but also influences and leads the organization’s operations, maintenance, capital, and turnaround functions to improve overall corporate performance, with the motto: “Engineer it right, keep it running, and repair it right.” This is the final installment of a multi-part series.


Developing a root cause failure analysis (RCFA) process and culture is the primary value the reliability department brings to the asset. Once the organization understands the value of driving out poorly performing equipment, incorrect behaviors, and outdated processes, the organization will improve in more areas than just equipment MTBR.

The first steps in designing an effective RCFA culture are to develop effective procedures for when and how to conduct these analyses and a process to identify when equipment is reaching the end of its useful life (Figure 1). Additionally, the reliability department must have a process to organize the asset into an effective defect-eliminating organization.

Figure 1. Preliminary work is required before starting an RCFA program.

RCFAs must be performed on the equipment and processes that are adversely impacting the asset’s reliability. However, where do you begin? Do you perform them on all equipment failures? One method is to develop a set of definitions for your asset’s top 10 worst actors, other bad actors, and repetitive failures. For example, reliability can develop the following lists:

  • top 10 list of defective equipment — based on maintenance and lost opportunities
  • bad actors — equipment with three or more failures in a two-year period
  • repeat offenders — equipment with repeated failures in the past six months.

The number of failures and time frames in the latter two definitions can be modified to limit the lists to workable numbers. Without developing these target lists, the organization will not know where to focus its efforts.
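
The three definitions above can be sketched in code. This is a minimal illustration, not a prescribed method: the function name, the input shapes, and the threshold of two failures for a "repeat offender" are assumptions, and all thresholds are parameters so they can be tuned to keep the lists workable.

```python
from collections import Counter
from datetime import date, timedelta


def build_target_lists(failures, costs, today,
                       top_n=10,
                       bad_actor_min=3, bad_actor_days=730,
                       repeat_min=2, repeat_days=182):
    """Build the three target lists from failure history.

    failures: list of (equipment_id, failure_date) tuples (hypothetical shape)
    costs: dict of equipment_id -> maintenance cost plus lost opportunity
    repeat_min=2 is an assumption; the article only says "repeated failures".
    """
    # Top N: rank equipment by combined maintenance and lost-opportunity cost.
    top = sorted(costs, key=costs.get, reverse=True)[:top_n]

    def counts_since(days):
        # Count failures per equipment item inside the look-back window.
        cutoff = today - timedelta(days=days)
        return Counter(eq for eq, when in failures if when >= cutoff)

    two_year = counts_since(bad_actor_days)
    six_month = counts_since(repeat_days)

    bad_actors = sorted(eq for eq, n in two_year.items() if n >= bad_actor_min)
    repeat_offenders = sorted(eq for eq, n in six_month.items() if n >= repeat_min)
    return top, bad_actors, repeat_offenders


# Illustrative data only; the tag numbers and costs are invented.
history = [
    ("P-101", date(2011, 9, 1)),
    ("P-101", date(2012, 12, 10)),
    ("P-101", date(2013, 3, 2)),
    ("P-102", date(2013, 1, 15)),
    ("C-201", date(2010, 1, 1)),
]
cost_by_item = {"P-101": 50_000, "P-102": 120_000, "C-201": 10_000}

top, bad, repeat = build_target_lists(
    history, cost_by_item, today=date(2013, 4, 25), top_n=2
)
print(top, bad, repeat)
```

Raising `bad_actor_min` or shortening the look-back windows shrinks the lists, which is the adjustment described above.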

From experience, the most effective approach is to go after the top 10 equipment items first, followed by the bad actors, and later the repeat offenders. One plant study indicated that the highest returns on investment are achieved by addressing the top 10 list.

The plant’s RCFA process must be aggressive and is essential for addressing the bad actors and repeat offenders. The chart shows the results of driving out the bad actors (Figure 2). In this chart, only one of the original bad actors remained on the list after 3.5 years and was slated for replacement during a turnaround.

Figure 2. When bad actors were tracked over a period of three years, only one of the original bad actors remained, and it was slated for replacement during a turnaround.

These results show that by finding solutions to all of the bad actors and implementing those solutions, maintenance was not fixing the same equipment over and over again. However, the chart also shows that new bad actors came into the mix. Without solving the existing bad actor problems, the list would have increased, thus increasing the organization’s unnecessary, repetitive work.

One company set a goal for its reliability engineers to perform one to two RCFAs per month in their assigned units. This may sound rather low, but the engineers didn’t think so, and they struggled to meet that modest goal, along with all of their other goals and activities. However, a system was put in place to review all work notifications against the lists of top 10, bad actors, and repeat offenders, and RCFAs were assigned to all applicable notifications. RCFAs were not completed on all of them, but reliability personnel focused on the most important equipment. The MTBR improved over time and that enabled the engineers to begin work on other reliability initiatives.
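
A notification review of this kind might look like the sketch below. It is only an illustration under stated assumptions: the function name, the tuple layout of a notification, and the rule that a notification takes the highest-priority list it matches are all hypothetical, not the company's actual system.

```python
def screen_notifications(notifications, top10, bad_actors, repeat_offenders):
    """Assign an RCFA to each notification whose equipment is on a target list.

    notifications: list of (notification_id, equipment_id) tuples (assumed shape)
    Returns (notification_id, equipment_id, list_name) rows, ordered so that
    top 10 items come first, then bad actors, then repeat offenders.
    """
    priority = [
        ("top 10", set(top10)),
        ("bad actor", set(bad_actors)),
        ("repeat offender", set(repeat_offenders)),
    ]
    assigned = []
    for notif_id, equipment_id in notifications:
        for label, members in priority:
            if equipment_id in members:
                # Record only the highest-priority list the equipment is on.
                assigned.append((notif_id, equipment_id, label))
                break

    rank = {label: i for i, (label, _) in enumerate(priority)}
    # Stable sort: notifications keep their original order within each list.
    assigned.sort(key=lambda row: rank[row[2]])
    return assigned


# Invented example data.
incoming = [(1001, "P-205"), (1002, "P-101"), (1003, "K-300"), (1004, "E-410")]
queue = screen_notifications(
    incoming,
    top10=["P-101"],
    bad_actors=["P-205"],
    repeat_offenders=["E-410"],
)
for row in queue:
    print(row)
```

Ordering the queue by list priority reflects the point that not every assigned RCFA gets completed, so the most important equipment should be worked first.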

Performing RCFAs is essential, but having a system to implement the RCFA recommendations is even more important. The reliability department cannot implement the recommendations alone; it must collaborate and cooperate with the operations, maintenance, and capital/turnaround departments. A cross-functional reliability improvement team (RIT) process is an excellent method for developing and solving top 10 reliability issues. An RIT process also solves other non-machinery reliability issues, integrates the operations, maintenance, and capital/turnaround departments into the process, and gets their support and buy-in. This process should define the team structure, the team members’ roles and responsibilities, objectives, top 10 list workflow, budget requirements, key performance indicators, and reporting responsibilities. Of all the processes in the plant concerning reliability, this is the most important, as it will ensure the asset’s resources are continuously focused on the reliability issues and on driving the recommendations from the RCFA to completion. Lessons learned from these teams should also be reviewed to determine if the findings could be applied in other parts of the company (Figure 3).

Figure 3. The reliability team should review lessons learned and share findings with other parts of the company.

Predictive maintenance (PdM) programs such as vibration analysis, lubrication oil analysis, and pressure-volume analysis are important, but it isn’t enough to set up the systems, purchase the equipment, and train the personnel to obtain the maximum benefit. Far more benefit can be realized if you have qualified reliability engineers who understand the data trends and the theoretical potential of the equipment and focus solely on putting the data and theory into practice.

That is why PdM programs, including collecting and reviewing the data and entering action items into the CMMS, belong under reliability engineering, and not the maintenance department. On any given day, the focus of the maintenance department is squarely on what needs to be fixed; reviewing lab reports and collecting vibration data are not high on the priority list, and that is how it should be. But what often happens is maintenance has a good lubrication sampling program but is so busy repairing equipment that nothing is done with the data, resulting in damage to critical equipment from running contaminated oil for a prolonged period.


The best practice is for the reliability department to take the data or obtain it from the lubrication service company, review the data, and enter deficiencies into the CMMS, indicating what corrective action is required or what action was taken, and then saving that knowledge in the CMMS for historical purposes. The software or process should include a review of outstanding notifications to ensure the organization responds in a timely manner. This type of work is not a priority for the maintenance department, which is why the reliability department should own the PdM programs.
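
The review of outstanding notifications could be as simple as the sketch below. It assumes the open notifications can be exported from the CMMS as records with an entry date; the field names and the 30-day response target are hypothetical and would need to match the plant's own system and expectations.

```python
from datetime import date


def overdue_notifications(open_notifications, today, max_age_days=30):
    """Return open notifications older than max_age_days, oldest first.

    open_notifications: list of dicts with 'id', 'equipment', and 'entered'
    (a date). These field names are assumptions, not a real CMMS schema.
    """
    overdue = [
        n for n in open_notifications
        if (today - n["entered"]).days > max_age_days
    ]
    return sorted(overdue, key=lambda n: n["entered"])


# Invented example records.
open_items = [
    {"id": 501, "equipment": "P-101", "entered": date(2013, 4, 10)},
    {"id": 502, "equipment": "C-201", "entered": date(2013, 2, 1)},
    {"id": 503, "equipment": "P-205", "entered": date(2013, 3, 1)},
]
late = overdue_notifications(open_items, today=date(2013, 4, 25))
print([n["id"] for n in late])
```

Running a check like this on a fixed schedule gives the reliability department a short list to follow up on, rather than relying on someone noticing that an oil analysis finding was never acted on.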

There are several benefits to validating all machinery notifications. First, if the quality of the notifications is low, this process will allow the machinery specialist to review and update the notification with more specific information. At the same time, he will be able to review notifications for the entire train and add all necessary repairs to the scope given to the planner, who can then develop a high-quality repair plan even though he might not have a machinery background.

The greatest benefit of notification validation is the money saved by not working on equipment that has not failed. Experience has shown that around 10% of machinery notifications are not valid or the asset is not in actual need of repair. This savings alone can justify the cost of personnel assigned to perform the machinery specialist role.
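
The arithmetic behind that claim can be checked with a short calculation. The 10% rate comes from the paragraph above; the annual notification count and the average repair cost are invented inputs for illustration, not data from any plant.

```python
def avoided_repair_cost(notifications_per_year, invalid_percent, avg_repair_cost):
    """Estimate annual spending avoided by screening out invalid notifications."""
    invalid_count = notifications_per_year * invalid_percent / 100
    return invalid_count * avg_repair_cost


# Hypothetical plant: 400 machinery notifications a year at $5,000 per repair.
savings = avoided_repair_cost(
    notifications_per_year=400,
    invalid_percent=10,      # rate cited in the text
    avg_repair_cost=5_000,   # assumed average cost per repair
)
print(f"Estimated avoided cost: ${savings:,.0f} per year")
```

Comparing this figure with the fully loaded cost of a machinery specialist shows whether the role pays for itself at a given plant.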

Capital procurement

A reliability-centric organization also assigns a machinery engineer to the capital procurement organization to develop standards and procedures to "buy the right things." The impact of developing quality equipment standards and enforcing their use in the capital procurement process was exemplified in a Gulf Coast plant. A gas-oil hydrotreater was designed using the machinery equipment standards reviewed, modified, and approved by the machinery department over a period of several years. At the same time, another unit was installed and paid for by a partner company that built the unit to avoid signing a gas-treating contract. The partner built it cheaply, although not to the same standards, and the unit ended up with multiple equipment failures in the first few years, and four pumps ended up on the plant’s top 10 list. Solving these issues took a lot of money and effort. In contrast, the other unit, which was designed to the company's standards, had no machinery failures in the first three years of operation. This example shows the importance of up-front engineering and its impact on the lifecycle savings. If you spend more money up front for reliable installations, you will save more money overall.

Maintaining reliability

As reliability improves and reaches a new plateau, company executives will become comfortable with that level of reliability and will start to look for new ways to cut costs by targeting the very programs and people you have spent years developing. These cuts provide an immediate reduction in costs, which validates their decision, at first. However, the long-term effect is reduced reliability and increased costs. To prevent this, reliability leaders must communicate to executive management, especially new ones, that reliability is fleeting and will disappear without dedicated personnel and efforts. A reliability-centric organization must continue to improve by having a clear vision at the top and communicating to all levels the need and value of these efforts. If not given the proper focus, reliability will morph back into a series of unpleasant surprises.

For example, one plant improved its pump MTBR significantly, but then operations and maintenance became complacent about reliability/availability and “controlled” their costs by delaying repairs. In reality, delaying repairs did nothing to control costs because once a piece of equipment has failed, it will have to be repaired, resulting in sunk costs. Delaying the repairs, however, negatively impacted the availability of the equipment and increased the operational risk. The lesson here: don't let improved equipment reliability lead to complacency, cost cutting, and poor decision making, which will only derail the reliability improvements in your train.


Some companies take reliability engineering to the other extreme. One plant spent considerable money on reliability and maintenance but failed to improve tool time. The process developed and implemented for planning and scheduling was sound and had worked at other companies, so why did they fail? They failed because the leadership didn’t engage with the workforce regarding work execution. The maintenance managers spent all of their time developing the process, implementing the process, and studying metrics on every aspect, averaging four meetings per day. The maintenance leadership had lists of reported issues and discussed them to death, but they failed to solve the problems in the field with the workforce as they occurred. Supervisors spent less than 30 minutes a day in the plant, rarely setting foot in the plant until three or four hours into the workday. This is not "engaged leadership" and was the main reason for not showing any improvement in tool time.

Craig D. Cotter, PE, CMRP, is a maintenance specialist with 21 years of experience in refining, chemical, and E&P organizations in the areas of reliability engineering and maintenance management. He’s a member of SMRP and the Vibration Institute with a Category III certification. Contact him at (281) 413-9475 or [email protected].

Engaged leadership is leading from the front. Employees will follow a leader in the field, not a metric or a KPI or a general back in the office. Engage leadership at all levels, including first line supervisors, middle management, and senior executives. Leaders run toward problems. They look for and welcome problems and go to the location to help solve the problems to make the organization better. Engaged leaders understand the difference between problems and failures. Your people will fail; encourage them to try again.

Engaged leaders care for their teams. They develop them for succession by training them, providing vision, providing structure and room to succeed, delegating authority, and getting rid of bad performers. Engaged leadership is flexible, as situations and people are different, and gives credit to the team for success and takes the blame for failures.

By adopting a reliability-centric organization structure and dedicating the people and developing the processes, any organization can achieve top quartile performance in reliability, maintenance, and operations, no matter what the company manufactures. It may take several years to build the organization and develop processes, and three to four more years to achieve the desired outcomes.

The key programs to focus on first, one for each leg of the stool, are:

  • keep it running with operator-driven reliability (ODR)
  • fix it right with precision repairs
  • engineer it right with an RCFA/RIT process.

With the proper leadership and vision at the senior levels, top performance and profitability will be achieved, with additional safety and environmental benefits.


About the Author

Craig Cotter | P.E., CMRP

Craig Cotter, P.E., CMRP, is a mechanical engineer. He has more than 30 years of experience in reliability engineering and maintenance management. Cotter has a B.S. in mechanical engineering as well as an MBA. He is a retired U.S. Army Colonel. Contact him at [email protected].
