How to build a better failure mode and effects analysis

April 11, 2024
"You can do an FMEA without criticality, but that just means you're going to be over-maintaining your assets or you're going to be misguided in where you're applying your efforts."

Brian Hronchek is a principal trainer and consultant at Eruditio. Over the years, Brian has worked as a maintenance manager, reliability engineer, and planning manager, and his background includes stints in the U.S. Marine Corps and Purdue Aerospace Engineering. Brian recently spoke with Plant Services editor in chief Thomas Wilk about how to fill out the FMEA and then use it to build robust equipment maintenance plans.

Listen to Brian Hronchek on Great Question: A Manufacturing Podcast

PS: You really outline the nuts and bolts of what it takes to know what you've got in your system and then know exactly what to evaluate in the FMEA, what's the more important failure. So after you start collecting information, what's the next step?

BH: So this is the part where I think we need to pause just for a minute and talk failure modes, failure mechanisms, and failure indicators. A lot of these things are a little bit confusing, right? This is where I really struggled the first time I did one. It was like, what's the failure mode, you know? Tell me, I mean, in your mind, what's the failure mode?

PS: Anything which would cause the asset to break down or not operate correctly.

BH: OK, so give me an example of what would cause the asset to break down. Be specific if you want.

PS: Using the example of your shower in the morning, let's say you've got a washer that's bent or loose, and there's some sort of leak where the fluid isn’t moving where it's supposed to go. It's leaking out the side. That could be one thing, right?

BH: Perfect. OK, so what you described to me was a whole bunch of these terms lumped together into one, right? So let's talk about a failure mode. A failure mode, if we could define it, I would call the deviant behavior of your component.

Now we're talking about an asset, we're talking about a pump skid, but the pump skid is made up of components. So even though we're not going below the pump skid in the hierarchy, we have to determine the failure modes based on the component and how it fails. So, a leak from a seal, right, the leak is the deviant behavior. It's not supposed to leak. It's supposed to seal. That's the deviant behavior. So that is the failure mode. Let me ask you this, do you want to do an inspection to find a failure mode?

PS: That's a good question. I would say it depends.

BH: OK it does. Tell me why.

PS: Let's say the failure mode is a cracked pump housing. You don't need to do an inspection every day to make sure the housing is correct, that's a less likely event. But you can see a leak, you can see the fluid pooling on the ground.

BH: You described the crack, which is not a behavior, that's a condition, right? But the leak is a behavior. So do I want to inspect for the leak, or do I want to find it before the leak happens? If a bearing seizing is the deviant behavior, do I want to wait until it seizes before I find it?

A lot of us talk about failure mode based asset strategies, and I'm going to challenge that a little bit and say if you have a failure mode based asset strategy, looking for the failure mode is reactive. Looking for the failure mechanism would be what we actually do, and it’s all semantics, right? We all know that we're looking for a condition before it actually turns into something bad, but I'm going to rename this during our conversation, and say this is “failure mechanism based” and we'll talk about failure indicators too. The failure mechanism would be the deviant state. That is the state that ends up leading to the deviant behavior.

PS: OK.

BH: So if the state is misalignment, and that misalignment results in excessive heat or excessive wear, eventually it's going to result in the leak or the seizure or something else like that. So the deviant state is misalignment. I want to find the misalignment before it does damage.

Now we're looking at failure mechanism. So when you look at these things, if you get a template that says “What's the failure mode?” and that's the only question it asks, put in another column and say “What's the failure mechanism?” Answer that question, because if you say the failure mode is seizure, the failure mechanism is misalignment. But you know what? There’s another contributor to seizures: it could be lack of lubrication, it could be over-lubrication, it could be contamination, it could be any number of different things that lead to the seizure, and now I know what to look for. “We frequently have seizures because of lack of lubrication. So we're going to attack lack of lubrication.”

How often does that happen? It happens all the time. So now our score as we get into the RPN scores is going to be able to differentiate between what causes the seizure, so that we can look for that specific cause. Because if it's misalignment, I need to go train everybody on how to do alignments. If it's lack of lubrication, I need to set up a lubrication route.

So we actually have a third one too, and the third one is called the failure indicator. Why do you go to the doctor?

PS: Normally when you're sick or you have a symptom of being sick.

BH: Right, right. You got a symptom of being sick. And what's the symptom?

PS: Sore throat, cough, sneezing.

BH: And when you go and say, “hey Doc, I got a sore throat,” does the doctor come and say “This is what it is!” He doesn't, right? Yeah, he goes, “hmm, well, it could be this, it could be bronchitis, it could be you have a cold, it could be a virus, it could be bacteria.” He's got a lot of questions, but, there's an indicator that something is wrong. Now we're going to run some tests and dig in further to find the failure mechanism, to find the virus, to find the bacteria, to find the imbalance, to find the hormones, whatever it is. 

We do the same thing in maintenance. If I take thermography, I can tell you that the bearing is overheating, but I can't tell you why. So failure mode is the deviant behavior; the failure indicator tells us there's a problem; the failure mechanism tells us exactly what that problem is.
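The three terms can be kept apart in an FMEA template by giving each its own field. The sketch below is illustrative only: the class name `FailureEntry`, the helper `fmea_rows`, and the example values are assumptions, not part of any template mentioned in the conversation. It shows one failure mode (seizure) fanned out into one row per failure mechanism so that each cause can later be scored separately.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FailureEntry:
    """Hypothetical record for one component in an FMEA worksheet."""
    component: str
    failure_mode: str              # deviant behavior, e.g. "seizure"
    failure_mechanisms: List[str]  # deviant states that lead to that behavior
    failure_indicators: List[str]  # symptoms that say something is wrong


def fmea_rows(entry: FailureEntry) -> List[Tuple[str, str, str]]:
    """Expand an entry into one (component, mode, mechanism) row per mechanism."""
    return [
        (entry.component, entry.failure_mode, mechanism)
        for mechanism in entry.failure_mechanisms
    ]


# Example values taken from the bearing discussion above.
bearing = FailureEntry(
    component="pump bearing",
    failure_mode="seizure",
    failure_mechanisms=[
        "misalignment",
        "lack of lubrication",
        "over-lubrication",
        "contamination",
    ],
    failure_indicators=["overheating seen with thermography"],
)

for row in fmea_rows(bearing):
    print(row)
```

Splitting the rows this way means a misalignment row can point to alignment training while a lubrication row can point to a lubrication route.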

So why do I go through all that right? It's because it costs a lot of money to inspect for the failure mechanism, and sometimes that's important. If I've got a big critical asset and I need to make sure this thing never fails, I don't mind doing a semi-annual alignment check and I don't mind doing a lubrication check, and doing precision lubrication and checking to make sure that's right. I don't mind changing filters and testing my oil to make sure that there's no contamination because I want to find those mechanisms as early as I can to remove them so that no damage is done to my asset. 

But when I get down to my pump bank, to do that same level of inspection is a bit painful. I'm going to take a step back and I'm going to look for the failure indicator. I might do thermography because I can do everything all in one big picture, or very quickly by going asset to asset and taking pictures. And if there's a problem, then I'll go back to that asset and I'll start digging in. Is it lubrication? Is it alignment? But I can very quickly get a bigger picture. And then back to your original point, maybe sometimes we would inspect for the failure mode; if the criticality of that asset is low enough, then maybe we let the operator tell us when there's an actual degradation in performance. And that's OK, that's OK.
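The trade-off described here, inspecting for the mechanism on critical assets, for the indicator on mid-level assets, and for the failure mode itself on low-criticality assets, can be written down as a simple rule. This is a minimal sketch under stated assumptions: the 1-to-10 criticality scale and the cut-off values 8 and 4 are hypothetical and would need to be agreed on by each site.

```python
def inspection_target(criticality: int) -> str:
    """Pick what to inspect for, given a criticality rank from 1 (low) to 10 (high).

    The thresholds are illustrative assumptions, not values from the interview.
    """
    if not 1 <= criticality <= 10:
        raise ValueError("criticality must be between 1 and 10")
    if criticality >= 8:
        # Big critical asset: pay for alignment checks, oil analysis, and so on.
        return "failure mechanism"
    if criticality >= 4:
        # Pump bank: quick survey (e.g. thermography), dig in only if flagged.
        return "failure indicator"
    # Low criticality: let the operator report degraded performance.
    return "failure mode"


for rank in (9, 5, 2):
    print(rank, "->", inspection_target(rank))
```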

PS: And indicators could be something like product quality. For example, if you're on the frozen pizza line, the pizzas are off center or there's not enough ingredients on one, then something in the machine is not operating as it should be.

BH: The next question we have to answer is, what are the failure effects? I have seen templates of this where the failure effects are “the machine stopped running” or “it affects quality” all the way to “it blows up.” And you know, it's a big explosion and somebody gets hurt. 

But I want to tie this to something that helps us downstream. We already calculated, we already developed our scale for severity. I want you to write down the answer to failure effects in terms of that severity scale: the money (which is profitability or revenues or downtime or, you know, uptime, or however you measure that aspect), the safety (you know, how safe people are, how much damage it does to people or equipment), and then the customer (does this hit the EPA, and everybody says we can't do business with them anymore because they spilled a bunch of oil).

So we want failure effects in those specific terms because in a minute we're going to ask you, what's the severity? Say you wrote, “the failure effect was that it causes six hours of downtime or loss of $20,000 in production, or it's a minor injury, not OSHA recordable, or a minor release into the environment.” If you stated it in those terms, then when you go to severity and you say, hey, team, what's the severity of this, they go directly to that spot and they say, “oh it's a 6.” You've already defined it.
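Writing the failure effect in the same words as the severity scale turns scoring into a lookup. The sketch below assumes a made-up three-category scale (money, safety, customer); only the "6" entries echo the examples in the conversation, and every other description and score is a placeholder. Taking the worst category as the overall severity is also an assumption, since the interview does not say how to combine categories.

```python
from typing import Dict

# Hypothetical severity scale. Only the entries scored 6 come from the
# examples above; the rest are placeholders for illustration.
SEVERITY_SCALE: Dict[str, Dict[str, int]] = {
    "money": {
        "under 1 hour of downtime": 2,
        "6 hours of downtime or $20,000 lost production": 6,
        "plant shut down for months": 10,
    },
    "safety": {
        "first aid only": 3,
        "minor injury, not OSHA recordable": 6,
        "fatality": 10,
    },
    "customer": {
        "no external impact": 1,
        "minor release into the environment": 6,
        "customers refuse to do business with us": 10,
    },
}


def severity(effects: Dict[str, str]) -> int:
    """Return the highest score among the categories the failure effect touches."""
    return max(SEVERITY_SCALE[category][text] for category, text in effects.items())


effect = {"money": "6 hours of downtime or $20,000 lost production"}
print(severity(effect))  # the team goes straight to that spot on the scale
```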

PS: If I could take just a step backwards and look at the bigger picture, this is the connection that, from what I've been able to gather, it takes some plant teams a while to make: connecting what they're doing with the cost to the business. And this is what we all often preach as an industry, which is to understand the value of your work to the business. Understand the connection between the scale you developed and severity, and then connect the effect of the failure to it.

Have you seen a lot of light bulbs go on when you get to this part of the FMEA and people start making the connection and saying, OK, the effect is linked back in this way. Wait, we're costing this much money, where it could hurt this many people?

BH: Yeah, it definitely does. When you start breaking this down and going through and you coach them like this is how you build your severity scale. This is how you answer failure effects. Then it's like, oh, I get it now. “Why are you asking me how this affects the business? That doesn't matter to me, that's too big, I'm just talking about my area in the middle.” No, your area in the middle is for a bigger purpose. We're all here for something bigger. You know what I mean? They didn't hire us just to go inside of our little box and do our own thing. This is for somebody else.

PS: I just wanted to do a gut check and make sure that that made sense with what you've experienced.

BH: Yeah, for sure. I want to hit one more thing before we're done, right? So we answer severity, answer occurrence, how often does it happen? Answer detection, how early can you find it? We get our RPN score, ok, great. 

And then we all know the next thing to do is to sort our RPN scores and bring the biggest ones up to the top, and then we're going to recommend some different actions in order to bring that RPN score down. But do we have a strategy for bringing them down? So what am I going to do? Well, let's throw PdM at it, OK? Does that actually fix it? Does PdM lower the score, or is the detection already very early? Because if detection is a very low score, throwing predictive technologies at it and going from a 2 to a 1 really doesn't do a whole lot for the score.

Going from a 10 to a 1, that would be awesome, right, because we're multiplying together severity × occurrence × detection. The max score is about 1,000, so if I've got a very high detection score, meaning it's very hard to detect, and I throw something at it that brings that score way down, that makes a big difference. Look at your RPN score, the components of it, which one is the highest? Is it severity? Is it occurrence or is it detection? Let's say that severity is the highest. What can we do to reduce the severity of a failure? What are the tools we have? What are the levers we can pull, right? Any thoughts? Any ideas?
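The arithmetic behind that point is easy to check. The sketch below assumes each factor is scored on a 1-to-10 scale, which is what the "10 to a 1" example and the 1,000 maximum imply; the function names and the sample scores are hypothetical.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: severity x occurrence x detection, each scored 1-10."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("each score must be between 1 and 10")
    return severity * occurrence * detection


def highest_component(severity: int, occurrence: int, detection: int) -> str:
    """Name the factor with the largest score (first one listed wins a tie)."""
    scores = {"severity": severity, "occurrence": occurrence, "detection": detection}
    return max(scores, key=scores.get)


# Detection is already early (2): better PdM only moves the score by 40 points.
print(rpn(8, 5, 2), "->", rpn(8, 5, 1))    # 80 -> 40

# Detection is very late (10): the same kind of action removes 360 points.
print(rpn(8, 5, 10), "->", rpn(8, 5, 1))   # 400 -> 40

print(rpn(10, 10, 10))                     # 1000, the maximum
print(highest_component(8, 5, 2))          # severity
```

Because the factors multiply, the same one-step or nine-step change matters far more when it is applied to the largest factor.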

PS: Again, not having been hands-on in a plant like this, uh, I would think some of the levers are #1 establish better routes or #2 see which routes are not helping you, so you can free up time to do more specific work.

BH: Well, if we think about a bearing, let's say the bearing seizure on this one motor that is the single point failure in this big process, right? So if it seizes, you know, we're pretty much done, and the motor takes six months to build and it's a million and a half dollars. So basically the whole plant is going to be shut down for six months. This is a severe, severe, severe situation. So you have a couple of levers that you can pull to reduce the severity. 

One of those things is reengineering. I can reengineer this so that the severity is less. I can install a second one right next to it. Now if I have two of them and one of them goes down, the other one runs while we go spend 6 months buying the backup. Another is policies and procedures that protect the asset. Having a critical spare on standby, that's a warehousing policy. Well, if this thing takes six months, then we're going to get it built right now and we're going to put it in the warehouse. And that way if it happens, then it's not going to be 6 months, it's going to be 6 days to bring the whole crew together, lift this thing out with the crane, rip off the roof, put the new one in. Getting back in six days is way better than six months. Now we've come from a 10 to a 7, or whatever the score is.

So severity can be affected by a few different things. You want to remove the problem altogether, but you probably can't do it. So reengineer it out, or use policies or procedures that protect the people or the assets that are driving the severity. Something blowing up and killing somebody, that's pretty severe. So can you remove the people from that? Hey, we're going to do this a little different, and we're going to put guarding between us and the problem. Well, that lowers the severity. Now if a tire blows up inside of the cage, it doesn't blow somebody clear across the parking lot, right? (We were just looking at tires being filled up, so that one was in my head.)

So severity has certain things you can do, but occurrence is different. Occurrence. What could you do to reduce the frequency in occurrence? How do you make it happen less?

PS: Well, again, that's where I think the PM would come in, where you take a look and see what would make it happen, and you find out what mechanically you can do on a regular basis to reduce the impact of that. Whether it's greasing it regularly, whether it's taking predictive readings once a month, whether it's putting a wireless automatic lubricator on there, where it's monitoring the frequency and lubing when necessary, that sort of thing.

BH: Yep, totally, occurrence can be affected through the things that we do to reduce how often it happens. We're going to get in the precision realm, right, precision maintenance, maintenance procedures. So having a good job library a lot of times can reduce that. Training operators for how to operate the assets so they don't break it. Adding a lubrication route, adding that PM inspection or something along those lines. By adding the things that are going to make sure that you don't get to that failure to begin with, that's the precision. That's the lubrication. That's the training and how we operate, right? 

And then detection is the same thing. If the detection score is high, that means we can't see it coming. So then we want to throw in the things that help us see it coming. PM inspection activity, predictive technologies, increased inspection frequency. Maybe even using data to alarm us of a condition as it's coming, like if we have the data available, why aren't we using it? So those types of things. 

When you get to the very end of this thing, making a recommended action, recommend an action that's going to lower the score based on what component of your RPN score is the highest. Maybe put two things in place and bring it way down, right? But that's the strategy.
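The strategy described above (sort by RPN, find the largest factor, then pick a lever that targets that factor) can be sketched as a small routine. The lever lists below paraphrase the options named in the conversation; the function name `recommend`, the row layout, and all of the scores are hypothetical.

```python
from typing import Dict, List, Tuple

# Levers named in the conversation, grouped by the RPN factor they act on.
LEVERS: Dict[str, List[str]] = {
    "severity": [
        "reengineer (e.g. install a redundant unit)",
        "hold a critical spare on standby",
        "add guarding or remove people from the hazard",
    ],
    "occurrence": [
        "precision maintenance and a job library",
        "operator training",
        "lubrication route or PM task",
    ],
    "detection": [
        "PM inspection activity",
        "predictive technologies",
        "increased inspection frequency",
        "alarms on data already being collected",
    ],
}


def recommend(rows: List[Dict]) -> List[Tuple[str, int, str, List[str]]]:
    """Rank rows by RPN (highest first) and attach levers for the largest factor."""
    plan = []
    for row in rows:
        scores = {k: row[k] for k in ("severity", "occurrence", "detection")}
        score = scores["severity"] * scores["occurrence"] * scores["detection"]
        driver = max(scores, key=scores.get)  # first factor listed wins a tie
        plan.append((row["name"], score, driver, LEVERS[driver]))
    return sorted(plan, key=lambda item: item[1], reverse=True)


# Hypothetical worksheet rows with made-up scores.
worksheet = [
    {"name": "motor bearing seizure", "severity": 10, "occurrence": 2, "detection": 3},
    {"name": "pump seal leak", "severity": 4, "occurrence": 6, "detection": 8},
    {"name": "conveyor belt mistracking", "severity": 3, "occurrence": 7, "detection": 2},
]

for name, score, driver, levers in recommend(worksheet):
    print(f"{name}: RPN {score}, driven by {driver}; consider {levers[0]}")
```

A real worksheet might apply two levers to one row, as suggested above, rather than only the first item in the list.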

PS: Just to bring it back around to what comes next, this exercise is a tool that's been developed to help plant teams develop stronger equipment maintenance plans. In essence, to identify the kinds of work that need to be done, and again the FMEA itself is not the recipe for work, it's the analysis to determine what is the best work for this asset. Sorry if that sounds like I'm restating the obvious, but again, when it comes to all the different kinds of maintenance tools out there, it can sometimes be confusing as to what leads to what. So I just wanted to make sure that we're building that connection in there.

BH: Yeah, for sure. Alright.

PS: You know, you’re reminding me too, I had a conversation at MARCON with Lee McClish, who I think you know, he's working with a data center company. He's a reliability engineer there. And he was saying that interestingly, data centers have such a high degree of redundancy built into the hardware that the reliability approach changes. When you talked about, what can you do to mitigate the severity of these things, building in redundancy, building in, say, a critical spare, in Lee's world, the data center engineers have built in a lot of critical spares on the hardware side. What he's there to evaluate, he says, is what's the HVAC system like? Do we have enough filters? Do we have enough replacement parts to make sure that all the redundant hardware in the data center actually stays cool?

BH: Yeah, for sure, and with data you're working with, you know, close to 100% availability and that's hard to achieve, which is why they’ve got to use redundancy. For keeping that stuff cool, hopefully you're shooting for pretty close to 100% too, otherwise the data goes away.

PS: I think we're at the end here. We’ll put some information on the podcast notes about what an FMEA looks like, and a couple of tip sheets that Brian's got on how to build these things out. Just to wrap it up, if someone was directed to build an FMEA or was told, OK, we need these things, because some manager heard about these at a conference, what's the first thing you do as a reliability point person to start building these things?

BH: Get other people involved. A lot of times we take our jobs like it's The Lone Ranger, the lone guy in the corner. It's my job to be the reliability guy. I'm going to take care of all this and everybody kind of expects that because they don't really know what we do. The challenge that brings in is that we don't have enough data to go by to build these things. You have to have those mechanics in the room to be able to answer questions. The electricians, the operators. You need the other managers to help you understand the context of how this affects the business. Like, what's most important to them? And then you have to be able to go the other direction and educate them on, OK, this is what we're going to do, and this is why, and it's because of the concerns that you've given to me. 

Our job as reliability engineers is to solve other people's problems, right? So if we're not solving other people's problems, or if we don't even know what those problems are, we can't do anything. So get other people in the room, build these tools out, agree on the definitions so that everybody will agree to take action on whatever you output, and then you've got a better start already.

PS: Excellent, that's terrific advice. And I know team building is a whole separate conversation, but if you're asked to be doing this, you're not alone. Draw your whole team in.

About the Author

Thomas Wilk | editor in chief

Thomas Wilk joined Plant Services as editor in chief in 2014. Previously, Wilk was content strategist / mobile media manager at Panduit. Prior to Panduit, Tom was lead editor for Battelle Memorial Institute's Environmental Restoration team, and taught business and technical writing at Ohio State University for eight years. Tom holds a BA from the University of Illinois and an MA from Ohio State University.
