Assets Anonymous is a 12-step podcast series designed to help you get grounded in reliability basics and create a culture of continuous improvement with your team. This series will feature interviews with George Williams and Joe Anderson of ReliabilityX. ReliabilityX aims to bridge the gap between operations and maintenance through holistic reliability focused on plant performance. In this episode, George and Joe help you understand how your facility's critical assets fail.
PS: You know, in the last episode we talked about “What do I own and how critical is it?” which is the foundation of criticality analysis, and sorting out assets that run to fail from assets for which more care should be taken.
This episode is going to cover “If they're critical, how does it fail?” and we talked a little bit before we started recording about what this might mean. We'll talk a little bit about the ones that do run to fail, and then focus on the strategy you have to have in place for the ones you've identified as critical. Can you start by talking about that run to fail aspect, the strategy behind those? How many get to run to fail?
GW: It depends on the output of that criticality analysis, but typically speaking, somewhere between 20 and 25% of your assets end up being just not really critical to the business. That doesn't always mean the strategy there will be run to fail; in some cases, the replacement cost of that piece of equipment still may be substantial, even though it's not really providing significant business value, and so maybe you do some PM.
But there also may be assets where they do supply some value to the business but they're not really practical to work on, and you're just going to swap them out and call them spare parts. And so even though by definition they may have been an asset because of some regulatory nature, you're just going to toss them and replace them. Maybe you have a backflow preventer that you track in the system as an asset because you have to do the testing, but realistically there's no preventive maintenance. You're just proving out the test that the states require. So there are assets that fall kind of in all those categories. But most generally speaking, run to failure is because it's just not worth it. It's not worth the time, effort, and cost to have a preventive maintenance strategy.
JA: Yeah. You have a liability piece too, where you'll be doing some maintenance because of insurance or something, and it's still run to failure. But you have to do your IR scans, you've got to do some of that type of stuff, but you're still basically run to failure, although you're doing a few tasks related to it, based on some other requirement.
PS: Let's say you've got your criticality list, you've built the criteria for what is and isn't critical using your cross functional team, as we talked about in episode eight, and the team signed off on it. And Joe, as you mentioned, sometimes you got a piece of paper that everyone does sign at the end of the process to say, yes, we agree. These are the criteria, right? So what's the next step, when it comes to building that strategy, that really is what this episode of the podcast is all about, understanding the failure modes of the critical assets.
JA: As a rule of thumb, they say about the top 20% of your assets are the most critical. And again, that varies, but that would drive either the need for FMEA or an RCM analysis depending on the company, those types of things. You're using those tools to determine the most robust maintenance strategy that you could possibly have. Where people get into trouble is that they may not be using any PdM technology at the time, and so for detectability and for some of those other things, the use of PdM technologies will be required to have the most robust strategy you could possibly have for understanding that asset's health.
So there's this dilemma that comes up that says, “Well, you know, if we use certain technologies, we could mitigate a lot of the risk.” So that's kind of a watch out, I guess you could say, as you're determining your maintenance strategy. Typically you're going to do FMEA or RCM to determine the maintenance strategy.
GW: At this point when it's critical, you're trying to understand, and the title of our episode here, “how does it fail?” We're trying to understand exactly how it fails, whether you use those formalized approaches and call them failure modes, or whatever the case may be. We're trying to understand in what ways, what component failures, and what will cause failure of this asset that I have to make sure I mitigate. Again, it's all about risk mitigation. So which of those failure modes are likely to occur?
Even these scenarios are not perfect, and I'm going to probably get lots of feedback on this, but when you do an FMEA or an RCM, and you're getting down to those component levels in deciding what the failure modes are and assessing those failure modes, and then coming up with a risk priority number and all that fun stuff that you follow in the process. We've had a failure where we tried to convince a manager of a plant that they needed training in how to properly torque and use precision tools. There was a pump whose electrical connections were on a ceramic block for the motor, and they almost lit this thing on fire because the ceramic block was cracked because they over tightened it. No PM; no FMEA; it’s unlikely that the ceramic block terminal connection being over tightened and causing the crack – which eventually creates an arc and heat and melts all my insulation and creates a fire – is being identified in the FMEA.
So it's not only the formal process, it has to be what happens in the field feeding back into that system. Your failure modes have to be a living entity that include not only what you brainstorm in a room, and hopefully you have all the right people and identify a bunch of stuff, but it's unrealistic to think you will identify 100% of the potential failure modes when you sit in a room and do this exercise. If anybody does believe that, it's probably foolish, you know, the probability is slim and none.
It has to be a living list of failure modes that are recurringly looked at and that your CMMS feeds back. And in a previous episode you asked about, you know, what's the value of the CMMS? This is it right here, because in a filing cabinet, finding all the failures and identifying what failure modes you have a strategy against and which ones are new is not an easy task. And a CMMS, it should be an easy task if you've set it up properly.
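As a rough illustration of that feedback loop, the sketch below compares failure modes recorded on closed work orders against the modes a strategy already covers, and reports the ones that are new. The record layout, field names, and sample data are all hypothetical; they are not taken from any particular CMMS.

```python
from collections import Counter

# Hypothetical failure-mode library: (asset class, failure mode) pairs
# that already have a maintenance or operating strategy against them.
covered_modes = {
    ("pump", "bearing wear"),
    ("pump", "seal leak"),
    ("motor", "winding insulation breakdown"),
}

# Hypothetical closed work orders exported from a CMMS.
work_orders = [
    {"asset_class": "pump", "failure_mode": "seal leak"},
    {"asset_class": "motor", "failure_mode": "cracked terminal block"},
    {"asset_class": "pump", "failure_mode": "bearing wear"},
    {"asset_class": "motor", "failure_mode": "cracked terminal block"},
]


def find_uncovered_modes(orders, covered):
    """Count observed failure modes that have no strategy in the library yet."""
    observed = Counter((o["asset_class"], o["failure_mode"]) for o in orders)
    return {mode: n for mode, n in observed.items() if mode not in covered}


new_modes = find_uncovered_modes(work_orders, covered_modes)
for (asset_class, mode), count in sorted(new_modes.items()):
    print(f"{asset_class}: {mode} (seen {count}x, no strategy yet)")
```

In this made-up data set, the cracked terminal block is the only mode that surfaces, which is the kind of field finding George describes as unlikely to come out of a brainstorming session.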
JA: The other variable with the robust maintenance strategy is how knowledgeable are your people? You could have a lot of green guys and you bring in a great facilitator. They sit down and they start probing, trying to understand failure modes, and people are just deer in the headlights, because they don't understand this stuff. So that's part of the dilemma as well, especially when you've always been 100% reactive and you don't know any better. You get a bunch of people in the room and they don't know either. And then it depends on the facilitator, right? So there are a lot of weaknesses in using a formal process to develop it, because it requires at least a general level of knowledge and some understanding of the process.
GW: What I love about this conversation is that it's not about “what's the process of FMEA or RCM?” So when we think about Episodes 10 and 11, we're talking about what will maintenance do about failure modes, and what will operations do about failure modes.
When you conduct an FMEA, you draw a box around the asset, and you basically say, OK, utilities are there, raw materials are coming the way they're supposed to, but that's not the reality of life. The reality of life is supply chain decided to save a penny and bought cheaper corrugate, and now it's causing box jams and it'll never be identified as a failure mode, and so no one is going to address it until it's a nightmare at the line, hopefully. But in not being rigid, in being more fluid, that gets identified as a failure mode and an operational standard and an inspection gets put in place because you have a more fluid, holistic approach to reliability versus the rigid, “well, you must draw a box around the asset and assume all the stuff is here.” That's not the way life is; you don't have perfection coming into your asset on a regular basis.
And so the question becomes, if the raw material is not right…reliability is basically mission success, right? Performing a stated function over a stated period of time, under stated conditions. But if those conditions are that the corrugate must be perfect and that's not what happens, did the equipment fail? The equipment didn't fail, there's no failure of the equipment. The process failed, but more broadly, are we now operating unreliably? The answer to that question is yes. And so, you know, when we talk about failure modes being a living entity, it extends far past the machine itself or the asset’s inherent reliability and failure mode.
Our job is reliability, it’s to derive value from an asset. It's not to make it perfect in terms of its maintenance. So there's a big difference there, and when we're in an episode like this, that for me is more interesting to explore than how to identify a failure mode.
JA: Right. Especially given that equipment-related failures are very minimal. Typically, it's always some outside circumstance that causes the machine to go down. Typically it's people.
PS: Joe, in one of our first conversations, I want to say going back to 2015, you told the story about how a vendor had brought in a different kind of set screw for a bearing.
JA: Yeah, the bearing failures.
PS: And no one had noticed the set screw change, because the assumption was that the set screws coming in from his inventory would be the right ones. And that your team had traced that back down to a change in set screw, which really impacted bearing performance. It wasn't the bearing itself.
JA: Yeah, we were used to using cups, so on the set screw, you have different tips and they have different functions. And we typically used cups, and that's what we required to lock down our bearings. And they brought in a new guy, it was vendor managed on the back end, I won't mention the name because we eventually got rid of them out of the plant. He just started going, “okay, it's a 1/4-20 set screw” and dumping every type of set screw known to man in there just to make sure that the bin was full. And my guys are just grabbing set screws, not paying attention, so there was an awareness piece on our end that we were at fault for, that required me to bring some attention to the education of my folks on set screws. But they were just grabbing them and putting them in, and they kept backing out, and then the bearings were riding on the shaft and eating out the shaft.
It came back that it was a set screw failure, and those are failure modes again that you don't plan for. It's human-induced error, and a lot of times we don't look at the human inducing of defects into the system to address failure modes. We just go: piece of equipment, motor bearing over lubrication, under lubrication. It's very robotic, and very textbook, and those typically aren't the main failure modes that crash your machine all the time.
GW: In defense of those approaches, how many human induced errors will you list like that? Who knows, somebody takes that crap on a conveyor belt, there's a thousand things that could be identified. Right? It's millions, it's just endless, the sabotage that you could list as a failure mode.
JA: Yeah but sporadic versus chronic issues are completely different. If I crap on the conveyor belt, that's a sporadic one-off thing (unless the dude's a habitual crapper). A lot of these are the same mistakes being made over and over and over again, and those are the ones that become the main drivers. They're the main causes behind the machine failures. All your minor stops. Corrugate is a good one. The fight with corrugate in just about every food manufacturing plant you can think of, there's a fight between operations, maintenance, quality, and supply chain over the quality of the corrugate that they get in, or it not being within specification. But I don't think I've ever really seen that in an FMEA, although it's very common.
GW: You'll have it in a process FMEA, it'll say box jam, shut the machine down. And that's the response, right? Identify a box jam, shut the machine down. So the engineers who design the line, they don't identify, well, is it in spec or out of spec? Because the theory behind the FMEA says draw the box, and everything coming in is there and in the right quantity and all that stuff. Even when you read the RCM standard, it tells you all those things. Everything is present in the right quantity and right quality.
But that's not the reality of life, and so we have to have a living failure mode library that includes what actually happens at the plant floor level. We're not knocking those approaches. They're going to identify the physical entity failures associated with the equipment for sure. Now, what they're not going to do is identify the reality of life of what happens when you have to operate the equipment. An FMEA on the car isn't going to identify that the user put in diesel in their gasoline engine, right?
PS: For the record, this is my favorite kind of mistake on The Amazing Race TV show. In the early seasons, that happened to teams driving through a desert, and I'd say half the teams put diesel in the car, and didn't take a look at the labels in the car and just assumed that any gas was good gas.
GW: When you get an American to fly anywhere outside of America, the first thing they do is put the wrong fuel in.
PS: Let me ask you this question then too. George and Joe both, I like how you're tying this into the notion of continuous improvement, which really in the end is kind of what this whole series of podcasts is driving towards. One of the key parts of getting out of reactive is to stop thinking of things as non-continuous, as isolated. Once you engage in reliability, it is this continuous journey. The library you're talking about building here in failure modes, it's a living library, it's got to be; people have to record things that it sounds like they might not otherwise have considered recording. Honestly, when it comes to people interacting with the machines, there are infinite ways to do it. So you try and figure out which are those ways that are adding flaws and failures to the machine.
JA: I think the biggest mistake you can make, especially as a maintenance manager, is completing a criticality analysis and going, “yay, I made it!” Acting like you never have to go back and revisit it again. It's just one of those things that seems to happen quite a bit. They'll even do FMEA, do all the stuff, but 20 minutes after completing it, the business need changes, which completely upsets everything you just spent all this time doing, but we never go back and revisit it. And now another line has all the volume and another line is the driver behind the business.
GW: They deal with that a lot in pharmaceuticals because your patent exclusivity runs out in 17 years and it takes you seven to 10 to even get to manufacturing. So you buy these assets with 20-year lives and 7-year expectations of use, and so consistently you are shifting your focus to the newer lines.
The lines where the volume has been cut to 25 to 50% because you lost patent exclusivity, should you still be using predictive technologies on those lines? It depends on the value that they bring, and some lines you're just going to shut down, they're mothballed, they're run to failure until the next line comes in and you have to continually update that strategy accordingly. To Joe's point, that happens not just in pharma, that happens all over the place, and so you have to be able to shift your focus to what's important. All of this boils down to risk identification and risk mitigation.
PS: Let me ask you a question about online asset / failure mode libraries, because that's part of what the Digital Age has brought us, the ability for plant teams and especially vendors to share knowledge about the assets. I'm guessing that this is part of this process of building the failure mode library. We talked a lot about the ones that were unpredictable on the mechanical side, you know, the human errors. What's the role that you think vendors have to play going forward, vendors who collect data on their digital machines and understand where the failure modes might be?
GW: So being a vendor, that's a double-edged sword, but I'll take a stab at just saying, yeah, they've done a lot of this work. I think it's a crime when they come in and pretend they have to do this from scratch and they write failure modes for a pump with a client, when they've probably done it for 50 different clients by that point, and so they already know the failure modes of a pump. They should just be reviewing it and seeing if there's anything unique about the design of the pump they're talking about, but instead they start from scratch and go through the exercise. I like to think we're not an organization that operates like that, and maybe others don't.
But you're also seeing things like online subscription stuff, and you can buy my failure code library for X. I think they're all great starting points. If you have nothing, then something is better than nothing, and then going through those and identifying what's different about the context of your asset is certainly a time-saving exercise. I think that adds value. At this point in time though, we should have a massive failure mode library everywhere, and everybody should have access to it, it should pretty much just be free.
JA: The watch out for me is, I've bought one before by a very reputable company in an Access database, and it was blank. That pissed me off because then they wanted to facilitate some exercises with me and start building it. The way that they sold it to me was the assumption that there's all these failure modes in here for all these pieces of equipment, and I paid 10 grand for this thing, and it was a blank Access database, which on the back end I think they had all the failure modes, but it was hidden through password protection. They wanted me to take time to fill out some information to give me certain failure modes.
GW: Oh, you want access to the data!
JA: And that was not the assumption that I had when I paid for it. I was so mad, I lost so much respect for this company, because of the way you're selling it versus the way everything else went down. I'm a loyal guy, so once you ding me, it's very hard to recover my trust, which is a shortcoming of my own. It’s like I won't do business with you anymore and I won't recommend you to anyone, you know? It was horrible.
You have to be careful because you have this idea of what you think you're getting versus actuality when you go to purchase some of those things, so just be aware of that. Not to say that everybody does that. I haven't really seen anyone that really has one, outside of this one company, but I'm sure they're out there.
GW: I do know other companies that have like an online database, but they're not really selling it, they just use it on the back end, on the backside of things as they facilitate the next FMEA. But I think, you know, if you are going to vet those things out, things to be aware of are the level of detail that you need versus the level of detail that that particular failure mode library is going to. Because you can drill down way into component level failure modes, right? Like, how does the bearing fail? What causes an inner race defect or an outer race defect, or a cage failure, versus a pump bearing failure, and then stop there. Do you need the next layer down? I guess that depends on your business and the critical nature of that bearing, and that's what should be driving that. But if you don't, then you're not looking for a library that's going down to that level of detail. And if you do, then you want to vet that library to make sure it is giving you that level of detail.
PS: Well, one final question then, and I was going to ask at one point, how long does this take? But I think that's the wrong question, and my question really is, would I be right in assuming that the question really should be, once you start, when can I stop? And the answer is really, never. It's continuous. Right?
JA: There comes a point where you have to have something developed for all your assets, so that is one. You continue to improve it, and like I said, the business needs change and stuff, so you've got to constantly be on top of it. But if you have nothing, the first goal is figuring out, again, what's critical, but then take one asset at a time. Take the one that's been identified as the most critical and just say, “hey, we need to have a robust strategy that can mitigate some of these failure modes,” and then go out and execute on it. If you don't have time or resources or whatever, especially given the circumstances we're in today where it seems no one wants to work, and so you're short 32 people in a facility of a hundred. How are you going to create that strategy? How do you have time and resources?
GW: And it is extensive, right? This is not a minor amount of work, to get the proper maintenance strategy in place is not a minor amount of work, and there are approaches that can help streamline, that can get you, for lack of a better way to describe it, or accuracy, 80% of the value for 20% of the effort, right?
You can look at centrifugal pumps and identify all the failure modes, and then from a context perspective, see what applies to your most critical pumps and what doesn't apply, and what failure modes you want to address and not address, and not necessarily get into all the weeds of individualized components down to make and model of every pump. And that can help reduce significantly the amount of time and effort that you put into this. And there's lots of different approaches. There's RCM, RCM II, RCM Blitz, 50 different ways that you can slice this up and gain some efficiency in lieu of value, but there's always a trade-off. So the faster you're going to go through this process, the less you should expect from a quality perspective. But do you need it to that level of detail really is the question.
There's other tips for people to be aware of, like you can save some time by doing this in context and by asset class and then criticality of that asset, because criticality should have vetted the value already, at least generally speaking. And so instead of maybe getting way down into individualized RPNs, let's do this once for a pump, ignore RPN in terms of severity, and try to focus on detectability and occurrence, and then look across all of my pumps. Which ones, what's different? Insert my severity scores and then output my RPNs. What are the highest RPNs, and what do I have to do? Our next podcast is called “What are we going to do about it?”
JA: But you could take your top 50 components and do that. Everyone has a gearbox, everyone has a motor, everyone has a pump, everyone has a bearing. There's these commonalities in the industry, they all have those components. Like George said, you just take the top failure modes for most of those components, and you can assess whether they apply to everything else across your asset base. So really you're only doing 50 pieces of equipment or 50 components.
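The shortcut described above can be sketched roughly as follows: score occurrence and detectability once per asset class, plug in each asset's severity from the criticality analysis, and rank the resulting risk priority numbers (RPN = severity × occurrence × detection, the usual FMEA formula). The asset tags, failure modes, and 1-to-10 scores here are invented purely for illustration.

```python
# Class-level scores for a generic centrifugal pump on a 1-10 scale.
# occurrence: how often the mode shows up; detection: how hard it is to
# catch before failure (10 = hardest). All values are hypothetical.
PUMP_FAILURE_MODES = {
    "bearing failure": {"occurrence": 6, "detection": 4},
    "mechanical seal leak": {"occurrence": 5, "detection": 3},
    "impeller erosion": {"occurrence": 3, "detection": 7},
}

# Severity per individual pump, assumed to come from the criticality analysis.
PUMP_SEVERITY = {"P-101": 9, "P-205": 4}


def rank_rpns(failure_modes, severity_by_asset):
    """Return (rpn, asset, mode) tuples sorted from highest to lowest RPN."""
    rows = []
    for asset, severity in severity_by_asset.items():
        for mode, scores in failure_modes.items():
            rpn = severity * scores["occurrence"] * scores["detection"]
            rows.append((rpn, asset, mode))
    return sorted(rows, reverse=True)


ranked = rank_rpns(PUMP_FAILURE_MODES, PUMP_SEVERITY)
for rpn, asset, mode in ranked[:3]:
    print(f"{asset}  {mode:<22} RPN={rpn}")
```

With these made-up numbers, every mode on the high-severity pump outranks every mode on the low-severity one, which is the point of letting criticality carry the severity score rather than re-deriving it per asset.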
GW: I just want to say really quick for all of the RCM purists out there, please drive safely while listening to this portion of the podcast. They are banging the steering wheel and screaming, but the reality is the reality. You can get value without going to that level of detail.
JA: I didn't need it in the food industry. And this is where people get irritated, because the purists are in oil and gas, they're in chemical processing, and in these very risk-intensive industries. In food, I just needed to know, okay, I don't need to over-lubricate the bearing, got it. I didn't need a whole lot of detail in there because the consequence to the business wasn't nearly as severe as it would be in a different industry, so it wasn't a big deal to me, you know? I never did RCM, I wouldn't ever do an RCM in a food plant, I just wouldn't do it. It's not worth the time and the value.
GW: Unless they cut a PO.
JA: Even then you're basically doing a bastardized version.
PS: But that goes back to your point, which is when it comes to this kind of effort, it's going to be a boatload of effort, so make sure you know which of your work is going to add value to extracting value from the asset. Back to your first point, George, this isn't about making sure the asset never breaks or never dies. It's about extracting the value you want from these assets and knowing which ones are most critical to your operation.
GW: That's our job.