Many of us are inclined to ignore the time honored adage, “If it ain't broke, don't fix it.” Sometimes our handyman instincts can't leave well enough alone. Yet that same intuition also encourages us to believe, "If it ain't broke, don't test it." This reluctance is especially evident when it comes to testing the redundant and backup features of our critical control systems. At best, failures resulting from inadequate testing will only cost you lost production. At worst, insufficient testing can cost you your job.
Get hip to R&B
Although the terms redundant and backup (R&B) are often interchanged, each represents a different aspect of reliability design. A redundant system uses multiple similar components in a configuration that permits simultaneous performance of the same (or similar) function. A redundancy failure causes no reduction of system operation or capability. Simple examples include parallel power supplies and series shutdown valves. A more sophisticated example is a redundant PLC system: a microchip fails, a warning light comes on and production continues normally. A key aspect of redundant systems is that multiple components do the same job at the same time.
A backup system takes a different approach to reliability by providing an independent means of performing all or part of the overall control function, usually in a primary and standby configuration. Manual or automatic transfer mechanisms determine which component takes the lead. For increased reliability, backup systems can use alternate configurations and technologies to improve resistance to single-point and common-mode failures. For example, a simple local controller that can operate without assistance from a plant-wide control system is a common instance of backup technology. The local system may lack bells and whistles, but at least it can maintain safe production should the primary system go offline.
Backups also can be found within the control system's support structure. A frequent example is an uninterruptible power supply (UPS) that delivers reliable energy to many electronic control systems. If the primary power goes away, the UPS instantly takes over to maintain essential control functions — at least as long as the batteries hold out.
Note that redundancy and backup are not mutually exclusive. Many control systems have separate elements of both, and some even combine them into so-called redundant backup systems that have high levels of fault tolerance. Such systems include two or more similar control entities, each having full capability, but based on different technologies. Having two independent and diverse control systems is often considered the best protection against unanticipated failures.
In addition to improving reliability, R&B controls can simplify routine operating facility maintenance and avoid downtime. R&B concepts allow portions of the control system to be repaired offline while the controlled process remains in service. Special operating modes such as manual supervision may be required, but the ability to perform online testing of items like relief valves and meter runs is a valuable benefit of high-reliability systems.
Test to ensure safety
Some industries, such as aerospace and nuclear, routinely test redundant and backup system because reliable technology is essential to their high-risk business. But less-risky industrial users don't always adopt a mission-critical approach to testing R&B performance. Everyone in industry has heard war stories of redundant and backup systems that failed to do their job, explained with statements such as,
"The UPS should have kept us going," or, "the redundant processor had an outdated program." The subsequent diagnosis is often performed through a rear-view mirror, with perhaps some adjustment to future maintenance procedures. But in reality, proper testing of R&B systems remains on the back burner of many maintenance programs.
Of course, the less exotic elements of R&B systems, such as inputs and outputs, are often tested during routine control system maintenance. However, such checks are often limited to calibration and physical care. This level of maintenance may test the heart of an R&B system, but not its soul. A true test requires simulation of the specific transient conditions for which the R&B systems are designed. Proper R&B testing requires more than simply faking a process fault to verify that the system performs its normal role. R&B testing also should include challenging their unique nonstop features to verify reliable performance even while partially disabled.
Further, the requirement to routinely verify R&B operation is becoming increasingly important because of safety-related standards such as IEC 61511 and ANSI/ISA S84-2004. These internationally accepted guidelines define Safety Integrity Levels (SIL) and Safety Instrumented Systems (SIS) that generally rely on redundant and backup systems. Merely designing controls to meet those standards isn't sufficient to satisfy existing and pending regulations. Proper testing and verification of specific redundant and backup features is essential for meeting both the spirit and letter of those standards.
Test under operating conditions
A proper test of redundancy and backup requires producing operating conditions that mimic failures of the control system and its various support systems. These tests must go beyond the manual or automatic diagnostics built into many R&B systems (i.e. the UPS "test" button). Those diagnostics are generally local to the device and might not adequately test responses to external problems. Although built-in tests help verify operation of an R&B component, they can’t verify reliable system operation for situations that involve interconnected units.
So how can the R&B functions best be tested? There's no easy answer. Every redundant and backup system has its own requirements. But a common theme is to simulate fault conditions that are unrelated to the controlled machine or process. A significant goal is testing the redundant or backup system's ability to maintain operation during and after a transient condition that interrupts normal operation, including loss of the primary control system. Therefore, testing one part of an R&B system usually requires disabling other parts under conditions that simulate real-world failures.