Assets Anonymous is a 12-step podcast series designed to help you get grounded in reliability basics and create a culture of continuous improvement with your team. This series will feature interviews with George Williams and Joe Anderson of ReliabilityX. ReliabilityX aims to bridge the gap between operations and maintenance through holistic reliability focused on plant performance. The first two steps in this podcast series were centered on understanding reactivity and proactivity. Episode three focuses on the circle of fire and how you can break the reactive cycle.
PS: So who wants to tackle what circle of fire means?
GW: I'll take a quick stab at this. The circle of fire is an analogy for the reactive state. Most folks in a reactive mode have no idea where to start, so they're in a reactive mode because they don't have the answers of how to not be in that reactive state. In large part it's because they see it as this gigantic mountain. Right? There's too many steps to become reliable, and we don't have the resources, the time, or capability to do that, and the reasoning behind that is the circle of fire.
So imagine this circle. The circle is: we can't PM it because operations won't give us the equipment; operations won't give us the equipment because we're behind schedule; we're behind schedule because this conveyor failed earlier in the week; the conveyor failed earlier in the week because we couldn't do the PM. And then the circle just keeps going and going: we couldn't do that PM because something else on the line wasn't right and they had to run a line.
So we've got this continuous reactivity that takes place. What is interesting to me is that the circle of fire is made up of smaller circles. And so inside “why we didn't do the PM” is another circle. It's not just the fact that the conveyor broke down, but the conveyor has been breaking down once every six months because we don't do certain things appropriately to stop that from happening. And so why the conveyor breaks down is not only the PM wasn't appropriate, but we've got the wrong type of gearbox, we don't have a breather on it, we don't align things appropriately, the shaft has been bent for three years and the whole gearbox wobbles while the conveyor is going around.
There's all these other things inside of this that cause that fire, and our approach to trying to put out the fire is symptomatic. We throw a cup of water on it, but we can't do anything more than that because we have to go put out another fire. We put a little cup of water on it, we run away, but there are still embers and it comes back. So in trying to move from a reactive organization to one that is more proactive – remember that's a sliding scale, it's not a state – we've got to start putting out the smaller fires permanently. Someone has to make a fire go away and reduce its likelihood of coming back.
JA: We do that through a series of defect elimination techniques and, you know, it's like eating an elephant, right, with a spoon. You take one bite at a time, eventually you'll get through it. Right? We mentioned in the earlier episodes that you take one problem and make it go away forever. Just one, right? But if you can do that every day, those things start adding up, and now a week later, you've put away five different problems. By the end of the year, all that stuff starts adding up and you start putting out these little circles of fire and you start to break that cycle. Now you can introduce a new cycle which is on the more proactive side of the scale, in which you're addressing issues before they lead to catastrophic failure.
Which comes to another point: I think people misunderstand the term “failure,” when there are actually three different types. There's a “potential to failure,” which means a defect has been identified, but everything still seems to be running normally. That would be picking up a vibration defect, or a misalignment defect, or something. You know that that defect is there, but that is a potential to failure and it needs to be addressed. Then you have the “functional failure.” For example, with a pump, if it's designed to pump 30 gallons per minute and it now drops below that threshold, say it's at 29.9 gallons per minute, you've reached a functionally failed state. Now you can overcome that by turning up a drive or something to compensate, maybe the impeller is wearing or whatever, but it's functionally failed. And then you have the “catastrophic failure,” which means it doesn't work anymore, and most people only associate failure with catastrophic failure.
You have to understand, to get out of these cycles, there's other types of failures that have to become more important than the catastrophic failure. The potential to failures are your opportunity to address things at the cheapest possible point of overcoming a defect. You've now moved into the reactive realm because you have a defect, you have to react to it, and that's the cheapest point in that reactive realm of things that you can possibly solve a problem. Whereas the catastrophic failure leads to a loss of production, all this downtime, possibly rushing parts in, and the costs start adding up. You're looking at 10 to 40 times the cost to do maintenance in the catastrophically failed state versus, "I found a bearing defect, and I need to replace a $300 bearing." Understanding those types of failures and addressing them with as much urgency, if not more, as you would a catastrophic failure is how you start to break those little cycles down.
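As a rough illustration of the three failure types Joe describes, here is a minimal Python sketch. The names `FailureState`, `classify_pump`, and `catastrophic_cost_range` are hypothetical, treating "zero flow" as the catastrophic state is a simplifying assumption, and using $300 as the full cost of the early repair is also an assumption (the interview gives it only as the price of the bearing).

```python
from enum import Enum


class FailureState(Enum):
    HEALTHY = "healthy"
    POTENTIAL = "potential to failure"
    FUNCTIONAL = "functional failure"
    CATASTROPHIC = "catastrophic failure"


def classify_pump(flow_gpm, required_gpm, defect_detected):
    """Classify a pump using the three failure types from the interview."""
    if flow_gpm <= 0:
        # Assumption: no flow at all stands in for "it doesn't work anymore".
        return FailureState.CATASTROPHIC
    if flow_gpm < required_gpm:
        # Still running, but below the duty it was designed for.
        return FailureState.FUNCTIONAL
    if defect_detected:
        # Meets its duty, but a defect (vibration, misalignment) is known.
        return FailureState.POTENTIAL
    return FailureState.HEALTHY


def catastrophic_cost_range(early_repair_cost):
    """Apply the 10x to 40x multiplier quoted in the interview."""
    return 10 * early_repair_cost, 40 * early_repair_cost


# The 30 GPM pump from the interview, now delivering 29.9 GPM.
print(classify_pump(29.9, 30.0, defect_detected=False))
# Illustrative only: cost range if a 300-dollar early repair is deferred.
print(catastrophic_cost_range(300))
```

The point of the ordering in `classify_pump` is that the cheapest state to act on, the potential to failure, is only visible if someone is looking for defects while the asset still meets its duty.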
PS: Does it matter where in the larger circle you start when it comes to trying to break out of this habit, which step do you start at? George, you mentioned there's operators involved, Joe, you mentioned there's certain kinds of failure to look for. Is there one good place to start or does it more matter that people recognize the circle exists, and you can start identifying what is in your power to manage and control better?
JA: I would say as a maintenance manager, what's in your power is working with your operations group and getting them to understand the importance of cleaning equipment. Again, I come back to this all the time: lubrication management, cleaning equipment, and proper tightening techniques, and getting them to do those things. Those are the root causes of most breakdowns.
For example, one tenth of an inch of dust on a motor reduces the motor's life by half. All I have to do to extend the life of that motor is go wipe it with the ShamWow or whatever, right, and get the dirt off of the motor. And that's something that operators can do, you know, fairly simple. Instead, we wait until the motor overheats, and now it starts tripping out and we call maintenance, and now this falls in the lap of maintenance when operations should have been wiping with their ShamWow all over the place. It's not really that hard. We make it hard in our heads. It's not really that hard, go wipe with your ShamWow and just sit back and watch.
GW: I think there's two ways to answer that question. One is holistically, like Joe is mentioning. You've got to create an environment where you're doing what you can do proactively to minimize defects coming into your system as early as possible. I think another way to answer your question is, well, if there's so many fires going on right now, what can I do today while I get all that in place?
The short answer is it doesn't matter, just make a fire go away. The more strategic approach would be to look at areas that are repetitive, because those fires keep happening, and go after those first while you're working with operations to try to cut things off as early as possible. There's the defects that are very early that will ultimately result in catastrophic failure, and that's where you want to work. But between those two points there's a whole bunch of stuff that sits there today. And I think if you do a Pareto analysis of what's repetitive in your plant, which assets fail the most frequently, those are ideal candidates to go put fires out.
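The Pareto analysis George mentions can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the asset names and failure counts are invented for illustration, `pareto_assets` is a hypothetical helper, and the 80 percent cutoff is the conventional Pareto figure rather than something from the interview.

```python
from collections import Counter


def pareto_assets(failure_log, cutoff=0.8):
    """Return the most frequently failing assets that together account
    for at least `cutoff` of all logged failures.

    Each item in failure_log is the name of the asset on one work order.
    Returns a list of (asset, failure_count, cumulative_share) tuples.
    """
    counts = Counter(failure_log)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for asset, count in counts.most_common():
        cumulative += count
        ranked.append((asset, count, cumulative / total))
        if cumulative / total >= cutoff:
            break
    return ranked


# Hypothetical work-order history: one entry per failure.
log = (
    ["conveyor 3"] * 8
    + ["filler"] * 5
    + ["palletizer"] * 3
    + ["labeler"] * 2
    + ["wrapper"] * 2
)

for asset, count, share in pareto_assets(log):
    print(f"{asset}: {count} failures, cumulative {share:.0%}")
```

The input is just a list of asset names, so the same function works whether the list comes from a CMMS export or from notes taken by hand.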
PS: To tie that back into something that we covered in the first episode of the podcast, make sure you have a list of the work that you're doing. As you both mentioned, there's a lot of plants that don't have that understanding of what work is actually being done, so you need a list in order to understand what are the repetitive failures, and what are the repetitive tasks that are occurring which could be eliminated.
GW: A hundred percent. Even if it's a work order that says I had to go help operations because they didn't have adjustments right, that stuff needs to be documented if you want to do it from an analysis perspective. It is not a requirement to get better though. You can talk to people, talk to your maintenance staff. If you're the maintenance manager, sit them all in a room and have a complaint session. What are the things you keep doing repetitively that we should make not happen?
And they'll give you a list, they'll give you a big, long list. Go make those things go away, and that does two things for you as the maintenance manager. One, it gives you credibility with your staff. They'll actually respect you more because they've given you what they need to be more successful and you've made it happen. So you've gotta take the action on the list. You can't collect the list and let it sit there stagnant. You have to go do something with it. And the second piece is it's actually going to improve your plant because you're making fires go away.
JA: And then on the back side of making lists, something that doesn't get done a lot is this: when maintenance goes out to execute work, that list should be posted visibly and communicated to everyone so that they know what is being worked on. I think a lot of times we don't share with anyone, we just go out and do work. And then they come in on Monday morning and stuff doesn't work, and they just go, "Maintenance worked on stuff, stuff doesn't work," and you start losing credibility. When sometimes, a lot of the times, I didn't even touch that piece of equipment. Right? And so if I'm posting what I'm doing and sharing it with everyone, I don't lose as much credibility when stuff doesn't work on Monday morning after I had the weekend to work on things.
And what we started to find was that they weren't shutting down machines properly. That was on operations, and they would come in on Monday and nothing would work, and I'm like, "I didn't even touch those assets, and you're sitting here trying to blame me for this." Right? So it kind of helps you as a CYA, but it's also a great way to communicate to people the work that's being done so that they can see you're actually going out and doing stuff, and not eating donuts and talking in the shop all day.
GW: The credibility piece is really important, Joe, and I'm glad you brought it up because it's not just credibility of what work is getting done, but part of the reason you're not getting the asset to do the PM is the poor credibility of the maintenance staff. They don't trust that you will return the asset in a timely manner. So that's why as complicated as asset management is, part of it is planning and scheduling appropriately, so when you say I need a two-hour window on the asset, you only need a two-hour window on the asset.
JA: And the other piece is they don't wanna give you the equipment because it doesn't work as well after you've done work on it. And so that's another way that shows your PM effectiveness, you know, and your corrective maintenance work practices aren't where they should be, which is a lot of times why you lose credibility as well.
PS: I'm struck by your story of the operators working the machines. I heard one a while back about an operator who was known as kind of the rockstar operator, where they got the most out of their asset and got the most production done for those days, but every now and then, they'd have to have a maintenance tech go in and clean out a pipe that was getting clogged. And this was a chemical processing company. It turns out that in order to help achieve those super production goals, this operator was starting up the machine improperly, and not letting enough warmup time happen, where that warmup time was required to help keep the pipes from getting blocked because of the kind of chemical reactions that were going on.
JA: But skipping that warmup time allowed them to produce more. So...
GW: We've seen those issues. Joe and I were at a plant where there was a specific shift and specific operator that kept beating the numbers, kept beating the numbers, and they were doing a lot of things to improve the numbers, but the actions that were being taken to do that were typically detrimental to the next shift.
Towards the end of their shift, they threw away all the things that would take care of the line, thinking, "I'll survive the next two hours producing without doing these things, and it'll be on the next shift to figure it out." But ultimately the plant suffers, right?
JA: Yeah, and it happens a lot, that and taking shortcuts. But what that tells you about their culture is no matter what any leader tries to tell you about the organization, they're production driven. They'll try to tell you, "We're a quality organization, we care about safety" but you know by the attitude of the operator, the culture of the plant.
If they're trying to drive numbers by taking shortcuts, because it's a few things, right? You're trying to beat operators over the head because the other ones aren't hitting the numbers. And so now you start driving that culture, or it's the fact that everyone's getting ringed because we're not hitting production numbers. So instead of communicating effectively as a leader to say, "What are the issues we need to solve, so that we can get our throughputs back to where they need to be," they just start blaming people. What that shows is a lack of processes and a lack of systems. It's wild but it is what it is.
PS: When it comes to the circle of fire, let me ask the reverse question, which is: when do you start understanding that the circle is getting broken? And I'm guessing one of those is going to be reduction in repetitive work. But are there other signs that as the maintenance tech or the maintenance manager that you can see that, "Okay, there's an impact being had here and that we're slowing the circle down or starting to break that into chunks."
GW: Yeah, your mean time between failures will start to grow; production output, if you're a manufacturing facility, will go up because your available utilization goes up, or your availability of the equipment goes up; and your maintenance cost will go down because you're not getting to the catastrophic event. It's like, if you're a smoker, you don't care about it until you get cancer, like you don't understand the effects of it until it's too late, right? So you live with it and you don't feel the risk exists until it does.
I remember early in our pathway to predictive maintenance at a previous company we had an air-handler in a clinical supply area. They manufactured pharmaceuticals that sat in a clinical supply cooling environment, and those drugs are incredibly critical, and they have to be precisely controlled in terms of temperature, because the outcome of your studies could be affected if they're not. This area was significantly critical and was not allowed to survive any more than like a 30-minute outage of any air-handler.
We took vibration readings that said we had a misalignment issue on a fan and we were told, "Well, you can't redo the alignment because we're afraid it'll take 30 minutes, and we can't move all the products, so you're not getting the asset." A few months later, vibration report now says, "The bearing's going bad." Okay. Well, now we need to replace the bearing. "Oh, that takes way more than 30 minutes, you can't have the asset, and I don't hear the bearing so it must not be a problem."
And this goes back and forth for a couple of months, and then finally the fan fails. But when the fan fails, the bearing blows apart, the shaft drops on the pillow block and puts a nice oval hole in the pillow block and scores the shaft, to the point of falling inwardly and bending the fins on the squirrel cage. Now the entire fan needs to be replaced, and miraculously enough, they were able to get somebody into the area to move all the products so the product didn't get destroyed.
But the entire time their excuse for not giving you the asset was, "We can't, it's too risky to move the product, we can't possibly do it." And so they made a business decision not understanding the risk and not understanding the cost to the company holistically, versus just their impact as the customer or as operations.
And when Joe talks about partnering with operations to get a better understanding of things, that's in part what he means. It's a business understanding. You're one company and everyone's job is product out the door, including maintenance, including operations, but it's product out the door at the highest possible quality and at the right cost. I don't want to say the lowest cost. At the right cost.
And so understanding holistically that you're not just a customer, maintenance is a critical role in ensuring product delivery just as operations is, and understanding how the information that the maintenance organization provides you helps you make good business decisions. That's a challenge for most maintenance managers, and it's a skill that they can learn, but it is definitely a challenge and part of the reason that that circle exists.
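The indicators George lists at the start of this answer, mean time between failures and equipment availability, can be tracked with very simple arithmetic. The sketch below assumes the common definitions (MTBF as uptime divided by number of failures, availability as uptime divided by uptime plus downtime); the function names and the two periods of data are hypothetical and only illustrate the comparison.

```python
def mtbf_hours(uptime_hours, failure_count):
    """Mean time between failures: running hours per failure."""
    if failure_count == 0:
        # No failures in the period, so MTBF is unbounded for that window.
        return float("inf")
    return uptime_hours / failure_count


def availability(uptime_hours, downtime_hours):
    """Share of the period in which the equipment was able to run."""
    total = uptime_hours + downtime_hours
    if total == 0:
        raise ValueError("period has no recorded hours")
    return uptime_hours / total


# Hypothetical figures for one line over two 720-hour periods.
periods = {
    "before": {"uptime": 600.0, "downtime": 120.0, "failures": 12},
    "after": {"uptime": 680.0, "downtime": 40.0, "failures": 4},
}

for label, p in periods.items():
    print(
        label,
        f"MTBF={mtbf_hours(p['uptime'], p['failures']):.0f} h",
        f"availability={availability(p['uptime'], p['downtime']):.1%}",
    )
```

If MTBF and availability both trend upward from one period to the next, that is the kind of sign he describes that the circle is slowing down.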
JA: And then results, I said in the previous podcast, but when you start seeing the results, you end up with more time. Okay?
GW: You can actually go get the donuts without having breakdowns.
JA: Right. Exactly. You have more time, which means that you can plan and schedule more work and do it correctly. You can start doing your PdM tasks while you're running instead of out firefighting, right? It gives you a chance to start moving your needle towards that more proactive side and doing all those tasks. You go solve a few problems, and then next week you solve a few more problems, the next week after that you solve a few more. But to be honest, on each line and each plant, speaking from a manufacturing perspective, there's only two or three things that are really eating their lunch. If you can solve those in a week on one line and then move to the next line, it will free up an amazing amount of time to start doing stuff more proactively.