8 steps to improve reliability and regain your sanity

Sept. 5, 2017
What to do when maintenance gets blamed when things go wrong.
Question: Jeff, we struggle with constant reactivity due to unreliable assets and practices. As our group has a lot of pride, the daily production meetings are tough. It feels like maintenance gets blamed all the time when things go wrong. Can you suggest methods that will enable us to improve our asset reliability and end the pain?

Chip, maintenance supervisor, Ohio

Answer: Chip, I understand the feeling when things don’t seem to go right after you have poured your heart and soul into your work. As a practitioner, I lived that, too. The following approach is how we overcame the problems and drove significant improvements.

First, most likely you don’t have a true partnership with the other stakeholders in the organization, such as the engineering and operations teams. Note I said a true partnership, one where everyone is headed in the same direction: improving asset reliability to ensure we meet business and customer objectives. Asset reliability is NOT a maintenance thing; it’s an everybody thing. Engineering and operations have more influence on asset reliability than the maintenance group does. Consider how you can improve these relationships and treat the operations team not as a customer but as a partner. We began with a biweekly meeting of managers and supervisors to discuss how both groups could drive improvements together. This meeting gave both groups time and a venue to educate each other on best practices and get agreement on next steps.

Most important is to implement a critical-events improvement program. We can segment asset downtime into two buckets. The first bucket is short stops and nagging items, such as incorrect adjustments, that allow you to run but force you to stop the machine frequently. (Sometimes the fix is simply to reset the machine back to “zero” every week as part of a PM task list.) The second bucket is downtime that exceeds a threshold value and can add costs from scrap or rework. These events are where you are hemorrhaging, and they often involve critical assets. This second bucket is the one I want us to focus on for the critical-events program.

  1. Pick a pilot line or process.
  2. Establish a time and cost threshold to classify downtime as a “critical event.” We focused on anything that caused the process line to be down for two hours or more. The process area fed the packaging area, and we had redundant assets there. The threshold in the packaging area was four hours. Raise or lower the thresholds to adjust the number of events requiring analysis so that it is manageable. If too few events meet the threshold, we'll miss opportunities to add value. Too many, and we'll get overwhelmed.
  3. Develop a simple set of questions that will form the basis for a “5-whys” or “8-step” template. The template should include information such as date and time of the event, who was involved, a statement of the problem, a statement of the impact, and what caused the event. The operator should be able to complete the form when the critical event strikes. Other questions belong on the template as well, but our initial focus is to collect information on what happened. I address the remaining template questions in step 6.
  4. As a team with the operations partners, set an expectation that the form will be completed immediately after a critical event by those involved.
  5. Determine a champion to administer the program and to whom the completed form will be submitted. It can be an administrative clerk, but a better option would be a reliability engineer or a reliability technician. When a critical event occurs and the form is submitted, the individual will pull together a team to review the event and make a plan for improvement.
  6. With the team, review the event, evaluate your options, and recommend actions. The team should have cross-functional representation depending on the type and location of the event. The remaining questions on the template address root-cause identification, the financial benefits/value of recommended actions, the resolution to the problem, prevention activities, and finally, the validation or conclusion. From an analysis perspective, depending on the severity of the event, you can use the “5 whys,” “8 step,” or root-cause analysis methods to get to the actions.
  7. Log the recommended actions in a spreadsheet or database. Assign names and due dates for improvement actions to be completed.
  8. Celebrate the wins with the teams and publicize them to gain more support and buy-in.
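
If you do keep the critical-events log in a simple script rather than a spreadsheet, steps 2, 3, and 7 might be sketched as follows. This is only an illustration, not the tooling we used: the class, field, and function names are hypothetical, the area keys are made up, and the check covers only the time threshold (two hours for the process line, four for packaging), not the cost side.

```python
from dataclasses import dataclass, field
from datetime import date, datetime

# Time thresholds in hours, taken from step 2. The area names are
# illustrative keys; raise or lower the values to keep the number of
# events requiring analysis manageable.
THRESHOLD_HOURS = {"process": 2.0, "packaging": 4.0}


@dataclass
class Action:
    """A recommended action with a name and due date assigned (step 7)."""
    description: str
    owner: str
    due: date
    done: bool = False


@dataclass
class DowntimeEvent:
    """The initial template fields the operator completes (step 3)."""
    area: str
    start: datetime
    end: datetime
    people_involved: list[str]
    problem: str
    impact: str
    cause: str
    actions: list[Action] = field(default_factory=list)

    @property
    def hours_down(self) -> float:
        """Duration of the stop in hours."""
        return (self.end - self.start).total_seconds() / 3600


def is_critical(event: DowntimeEvent, thresholds=THRESHOLD_HOURS) -> bool:
    """True when downtime meets or exceeds the threshold for the event's area."""
    return event.hours_down >= thresholds[event.area]


def overdue_actions(events, today: date) -> list[Action]:
    """Open actions past their due date, for the program champion to chase."""
    return [a for e in events for a in e.actions if not a.done and a.due < today]
```

Under these assumptions, a stop of exactly two hours counts as a critical event on the process line but not in packaging, and `overdue_actions` gives the champion a short follow-up list for the next review meeting.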

When we began this activity, we were struggling with a single process line’s overall reliability. It became our pilot. In the first month of tracking, we had 14 critical events of two hours or more. One or two of those events lasted longer than 12 hours. Within the first year, we had dropped this number to fewer than two critical events per month on average. Those remaining events still exceeded two hours, but each lasted less than four hours. Asset availability increased significantly, and we saved millions via reduced costs.

What the critical-events improvement program did was to draw focus to the “bad actors” in specific events. While maintenance and operations collaborated to address issues originating with equipment, in some cases the problem (and attendant fix) was related to how operators ran the equipment. Standardized work and training were employed to address those concerns.

I’ll add that companies spend considerable effort implementing downtime reporting systems. In many organizations, I find that the data reported is garbage in, garbage out. Often, there is no auditing or accountability for the data entered. I prefer to keep it simple. It’s better to spend money on eliminating defects and potential failures than to spend money on systems that will incorrectly report these issues.

Do you have a process like this one to drive improved reliability? Do you find the data reported in downtime reporting systems to be accurate? What can we do differently? Please share your thoughts and questions in the comments below.

Talk soon,
Jeff Shiver, CMRP

If you have questions about maintenance, reliability, planning and scheduling, MRO storerooms, or leadership, for example, please contact Jeff Shiver.

About the Author

Jeff Shiver | Founder and managing principal at People and Processes, Inc.

Jeff Shiver, CMRP, is a founder and managing principal at People and Processes, Inc. Jeff guides people to achieve success in maintenance and reliability practices using common-sense approaches.
