Advanced failure analysis methodologies and techniques

Last month’s column examined basic failure analysis techniques and risk mitigation strategies. This month I want to build on it by covering more advanced methodologies such as failure mode and effects analysis (FMEA), reliability-centered maintenance (RCM) and root cause analysis (RCA).

Be forewarned that unlike the approaches described last month, these advanced tools require a substantial investment of time and effort. A surprising number of companies have sent their eager staff for training in these techniques, only to abandon implementation after the process started looking a bit onerous. Thus, it’s imperative that you adhere to some simple principles before adopting any of the advanced techniques:

Start with your most critical equipment and components. This increases the likelihood of your maintenance initiative capturing management’s interest and support, ensures you get the greatest value for money spent, and builds the necessary track record and momentum for expanding the program to include less critical assets.
Spend generously on quality training. If you don’t have employees who are fully trained and comfortable with the techniques, it’ll be a frustrating exercise and nearly impossible to claim any net benefits.
Have patience. It might take a year or more before internal resources are fully trained, the techniques are widely accepted and in production, supporting information systems are in place, and results are clearly visible.
This isn’t a replacement for the basics. It’s important to use different techniques for different purposes, so continue to expand your program of basic methodologies in support of the advanced techniques.
Don’t work in isolation. Wherever possible, integrate the advanced techniques into your current processes and CMMS, including data collection, analysis and follow-up action.

FMEA

This technique applies to the design, manufacturing, operation or maintenance of a component, equipment or overall system. It’s used to determine potential reliability problems through identification of:

What might go wrong (failure mode such as cracking or shorting)
Possible results of that failure mode (effect such as rupture or sparking)
What action is, therefore, desirable.

Failure modes, effects and criticality analysis (FMECA) builds on FMEA by adding a criticality ranking of the failure effect to quantify risk and prioritize actions. The terms FMEA and FMECA are often used interchangeably. Criticality or risk usually refers to the effect on customer satisfaction, health and safety, environment, regulatory compliance and cost. It varies depending on the asset, process or product involved.

Another key component of risk is the likelihood of occurrence: the probability that a given failure mode will occur and cause a certain failure effect. Usually, the likelihood of occurrence is expressed as a ranking from very unlikely to frequent. A third concept is how likely it is that the failure mode will be detected. For example, if cracking is the failure mode, how difficult, on a scale of one to 10, is it to detect the crack? The more difficult it is to detect, the greater the risk.

Sometimes the three rankings are multiplied together to form the Risk Priority Number (RPN). The RPN = O x S x D, where O is the likelihood that the failure mode will occur, S is the severity or criticality ranking of the failure effect, and D is the difficulty of detecting the failure. Each variable has a value between one and 10, where the greater the number, the greater the risk, so a failure that is hard to detect earns a high D. Understanding the risk level helps prioritize the review team’s work. The team should develop corrective actions that minimize the likelihood of failure and its severity, and make it easier to detect or predict failure so that preventive action can be taken.
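To make the arithmetic concrete, here is a minimal Python sketch of the RPN calculation and ranking. The class and function names, the third failure mode ("seal leak") and all of the rankings are hypothetical values chosen for illustration. The sketch also assumes that D is scored so that a higher number means the failure is harder to detect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    occurrence: int  # O: 1 (very unlikely) to 10 (frequent)
    severity: int    # S: 1 (minor effect) to 10 (most severe effect)
    detection: int   # D: 1 (easy to detect) to 10 (very hard to detect)

    def __post_init__(self):
        # Reject rankings outside the one-to-10 scale described in the text.
        for label in ("occurrence", "severity", "detection"):
            value = getattr(self, label)
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        # RPN = O x S x D
        return self.occurrence * self.severity * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical rankings, for illustration only.
modes = [
    FailureMode("shorting", occurrence=4, severity=6, detection=2),   # 4 x 6 x 2 = 48
    FailureMode("cracking", occurrence=3, severity=8, detection=7),   # 3 x 8 x 7 = 168
    FailureMode("seal leak", occurrence=6, severity=4, detection=3),  # 6 x 4 x 3 = 72
]

for mode in rank_by_rpn(modes):
    print(f"{mode.name}: RPN = {mode.rpn}")
```

With these made-up rankings, cracking comes out on top even though it is the least likely of the three to occur, because it is both severe and hard to detect. That is the kind of result the RPN is meant to surface for the review team.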

RCM

Reliability-centered maintenance is an advanced technique for determining what preventive maintenance is required to keep an asset operating in accordance with its original design and the operational requirements of its users. FMEA/FMECA is a key subset of the RCM implementation process, as can be seen implicitly in the following steps:

For each asset (component or equipment within a system or subsystem) under review, define its desired function and performance expectations. Ensure that the asset is actually capable of delivering these expectations. For example, a pump should move fluid with a specific discharge pressure, flow, etc.
Define how the asset might fail to meet performance expectations (functional failure). For example, the pump’s flow rate is less than expected.
Identify what might cause a functional failure (failure mode). Examples include a crack in the pump casing, a broken seal, a seized bearing or an open motor winding.
Determine what will occur as a consequence of a given failure mode (failure effect). For example, the crack, broken seal or seized bearing would cause a reduction in flow, and the open winding would result in a total loss of flow.
Quantify the criticality of failure, likelihood of occurrence and probability of detection to determine the risk, consequence and priority of each failure. In this example, suppose a total loss of cooling water flow causes the system to overheat, which leads to an explosion that puts many lives at risk.
Starting with the worst failures, analyze and document ways to prevent, detect and predict failure, choosing a maintenance approach that costs less than the consequence it avoids. An example is using the condition-based maintenance (CBM) functionality of a CMMS to monitor vibration readings.
If detecting, preventing or predicting failure is impractical or impossible, but failure criticality is high, look for ways to change the asset, process, product or environment to bring the risk into line with an acceptable cost/impact. If detecting, preventing or predicting failure would cost more than the consequence of the failure itself, a run-to-failure approach may be warranted.
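The decision logic in the last two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the strategy labels and the dollar figures are hypothetical, costs are assumed to be expressed on the same basis (say, expected annual cost), and a maintenance cost of None stands for "no practical way to prevent, detect or predict the failure."

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    MAINTAIN = "prevent, detect or predict the failure (e.g., CBM)"
    REDESIGN = "change the asset, process, product or environment"
    RUN_TO_FAILURE = "run to failure"


def choose_strategy(consequence_cost: float,
                    maintenance_cost: Optional[float],
                    high_criticality: bool) -> Strategy:
    """Pick a maintenance strategy for one failure mode.

    maintenance_cost is None when no practical approach exists
    (an assumption of this sketch, not a convention from the column).
    """
    # Step 6: maintain when the approach costs less than the consequence.
    if maintenance_cost is not None and maintenance_cost < consequence_cost:
        return Strategy.MAINTAIN
    # Step 7: nothing practical is available, but the failure is critical.
    if maintenance_cost is None and high_criticality:
        return Strategy.REDESIGN
    # Otherwise maintenance is more onerous than the consequence.
    return Strategy.RUN_TO_FAILURE


# Hypothetical figures for three failure modes.
print(choose_strategy(50_000, 8_000, high_criticality=True).name)
print(choose_strategy(1_000_000, None, high_criticality=True).name)
print(choose_strategy(2_000, 8_000, high_criticality=False).name)
```

The column does not say what to do when no practical approach exists and criticality is low. The sketch falls back to run-to-failure in that case, which is a judgment call rather than a rule.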

RCA

Root cause analysis (RCA) identifies and eliminates the root cause of a failure, ensuring that it doesn’t recur. By comparison, some argue that RCM and FMEA simply treat the symptoms of failures rather than the root cause. For example, RCM might be used to detect an oncoming failure using CBM and take steps to prevent it, whereas RCA is used to determine the root cause, thus negating the need for detection in the first place.

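RCA begins much the same way as RCM and FMEA. However, after step 5 of the RCM process described above (quantifying criticality, likelihood of occurrence and probability of detection), one conducts root cause analysis for high-criticality failures using methods such as: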

Ishikawa, fishbone or cause-and-effect diagrams, which map the possible causes and sub-causes for a given failure mode/effect.
Five whys, which, as with lean thinking, keeps the analyst asking “why did this failure mode/effect occur?” until the questioning drills down to the root cause.
Fault-tree or root-cause maps, which show the failure mode/effect at the top of a tree diagram and all possible causes and sub-causes displayed beneath it in a hierarchical fashion, complete with failure probabilities, where possible.
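For readers who want to see how failure probabilities roll up through a fault tree, here is a minimal Python sketch. It assumes the causes are statistically independent and uses only two gate types: OR (any one cause produces the event) and AND (all causes must occur together). The class name, the tree structure and every probability below are hypothetical and for illustration only.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Event:
    name: str
    probability: float = 0.0  # used only for basic events (no causes)
    gate: str = "OR"          # how the causes combine: "OR" or "AND"
    causes: list = field(default_factory=list)

    def failure_probability(self) -> float:
        # A basic event carries its own estimated probability.
        if not self.causes:
            return self.probability
        child = [cause.failure_probability() for cause in self.causes]
        if self.gate == "AND":
            # All independent causes must occur together.
            return prod(child)
        if self.gate == "OR":
            # At least one independent cause occurs.
            return 1.0 - prod(1.0 - p for p in child)
        raise ValueError(f"unknown gate type: {self.gate}")


# Hypothetical tree for the pump example: reduced flow at the top.
seized_bearing = Event("seized bearing", gate="AND", causes=[
    Event("lubrication missed", probability=0.10),
    Event("vibration alarm ignored", probability=0.20),
])
reduced_flow = Event("reduced flow", gate="OR", causes=[
    Event("cracked casing", probability=0.01),
    Event("broken seal", probability=0.05),
    seized_bearing,
])

print(f"{reduced_flow.name}: {reduced_flow.failure_probability():.4f}")
```

Real causes are often not independent, in which case these simple products overstate or understate the risk. Treat the numbers as a way to compare branches of the tree, not as precise predictions.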

Once you determine a root cause using one or more of these tools, make a change to prevent a repeat failure. For example, if two key causes are found to be, say, improper equipment installation and operator error, the equipment can be reinstalled and the operator trained. Multiple iterations of analysis and action might be necessary to minimize or even eliminate a given failure.

E-mail Contributing Editor David Berger, P.Eng., partner, Western Management Consultants, at [email protected].