Common cause failures (CCF): what they are and how they are mitigated
Common cause failures are events in which multiple failures occur within a short period of time due to a single shared cause. This is often called the underlying effect. They are especially relevant in redundant systems, because they can completely cancel out the advantages of this kind of architecture. Today, virtually all functional safety standards require common cause failures to be taken into account, regardless of industry domain or application area. The usual aim of studying them is to remove them: in safety-related applications, the normal goal is to analyse common cause failures and eliminate them as far as possible.
It is well known that redundant systems tend to be expensive, because the number of elements is typically doubled. A common cause failure makes the redundant equipment fail together with the primary equipment, so the investment no longer pays off. Redundant design strategies, whether intended to increase reliability or safety, are not effective against a common cause failure. It is therefore a type of failure that must be minimized or eliminated.
Common cause failures normally originate in one of two ways: (a) an instantaneous, significant event acting on the system (for example, an impact), which causes multiple failures at the same time; or (b) an undesirable condition that builds up or persists over time, such as increasing vibration in the equipment or an elevated temperature sustained for too long. In the second case, failures typically appear one after another, spread out in time, for as long as the condition persists.
When studying the underlying causes of common cause failures, it is recommended to divide the analysis into two elements: the root cause and the coupling factor. The root cause is the cause that, if corrected or absent, would prevent the multiple failures from occurring. The coupling factor, in contrast, is the property or characteristic that makes several elements susceptible to failure from the same shared cause.
In this context, it is important to distinguish the coupling factor from the cascading failure (or domino effect failure). Although cascading failures should also be analysed and mitigated where it makes sense, they should not be treated as common cause failures. In a common cause failure, as mentioned above, the coupled elements fail because of the same root cause. Cascading failures, however, are based on the appearance of new root causes that generate new effects on the system.
The root causes can usually be identified with a thorough study during the specification and design phase:
- Specification error: missing or incorrect specification. This includes operating the product, system or installation outside the margins and situations for which it was designed.
- Implementation error: design, mechanical, chemical, electronic hardware, or software errors.
- Product or system installation faults.
- Commissioning errors.
In the worst-case scenario, any of these errors that is not identified in its corresponding stage will remain in the system. It will then surface during the operating and service phases in the form of operational errors, exposure to operating conditions beyond the design limits, or maintenance errors.
The human factor is usually a common denominator in common cause failures. It is important to pay attention to human error when analysing this particular type of failure.
Regarding the coupling factor, the following situations are typical:
- Using the same design principle.
- Using the same hardware or software.
- Using the same operating or maintenance personnel.
- Using the same processes.
- Using the same environment or location.
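The coupling factors listed above can be screened for in a simple way: describe each redundant channel by a set of attributes and look for the attributes that all channels have in common. The following is only a minimal sketch of that idea. The function name, the attribute names and the example values are hypothetical and serve purely as an illustration; a real analysis would rely on the project's own design data.

```python
def shared_attributes(channels):
    """Return the attributes whose value is identical in every channel.

    Each channel is a dict mapping an attribute name (hypothetical names,
    e.g. "location") to its value. Any attribute shared by all channels
    is a candidate coupling factor.
    """
    if len(channels) < 2:
        return {}  # a single channel cannot be coupled to anything
    first, *rest = channels
    shared = {}
    for key, value in first.items():
        if all(key in other and other[key] == value for other in rest):
            shared[key] = value
    return shared


# Illustrative data only: two redundant channels of an imaginary system.
channel_a = {
    "design_principle": "relay logic",
    "software": "fw-1.4",
    "maintenance_team": "team-1",
    "location": "cabinet-3",
}
channel_b = {
    "design_principle": "microcontroller",
    "software": "fw-2.0",
    "maintenance_team": "team-1",
    "location": "cabinet-3",
}

if __name__ == "__main__":
    for name, value in shared_attributes([channel_a, channel_b]).items():
        print(f"Candidate coupling factor: {name} = {value}")
```

In this example the two channels use diverse design principles and software, but they share the same maintenance team and location, so those two attributes are the ones that would deserve attention.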
Strategies to reduce or eliminate the probability of the occurrence of a common cause failure.
As mentioned above, two basic elements define a common cause failure: the root cause and the coupling factor. The strategies to eliminate or reduce common cause failures are therefore aimed at eliminating or reducing the root cause, the coupling factor, or both. Typical strategies are:
- Improving design so that the root cause will have no effect on our system. That is to say, applying "shields" against external effects.
- Increasing intrinsic reliability of each element, that is, using more reliable and robust components.
- Ensuring that the operating environment stays within the design limits. This refers to environmental variables such as temperature, and to mechanical stress such as shocks and vibrations. These variables commonly exceed the design conditions and cause problems, because of divergences between the requirements definition stage and the stage in which the equipment is operated and used.
- Designing control and test points into preventive maintenance that target common cause failures (usually their root causes).
- Introducing diversity in electronic hardware and software.
- Applying changes in technology that completely eliminate the coupling factor, while ensuring that a new one is not created. For example, moving from electrical communication to optical communication.
- Physically separating redundant systems so that they are
- Avoiding couplings in electronic hardware architectures and designs by simplifying them.
- Performing an FMECA to detect vulnerabilities in designs and architectures.
Within the state of the art of RAMS engineering, there are different techniques for modelling common cause failures, as well as specific software for such models, which we have in xxxx. The beta-factor model, for example, is defined in the IEC 61508 standard. IEC 61508 also proposes a kind of checklist based on 37 questions; analysing and answering them helps to reduce common cause failures.
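To give an idea of how the beta-factor model is used, the sketch below splits the failure rate of a channel into an independent part and a common cause part, where beta is the fraction of failures assumed to affect all redundant channels at once. It then applies a simplified approximation, often quoted in reliability literature, for the average probability of failure on demand of a two-channel (1oo2) system with periodic proof tests. The formula, the function names and the numerical values are assumptions chosen for illustration; they are not taken from this article and do not replace the full treatment in the standard.

```python
def split_failure_rate(lambda_du, beta):
    """Split a failure rate into independent and common cause parts.

    lambda_du: dangerous undetected failure rate of one channel (per hour).
    beta: assumed fraction of failures that are common cause (0 to 1).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    lambda_independent = (1.0 - beta) * lambda_du
    lambda_ccf = beta * lambda_du
    return lambda_independent, lambda_ccf


def pfd_avg_1oo2(lambda_du, beta, test_interval_h):
    """Simplified average PFD of a 1oo2 system, returned as two terms.

    Assumed approximation (not from the article):
      independent term:  ((1 - beta) * lambda_du * T) ** 2 / 3
      common cause term: beta * lambda_du * T / 2
    """
    lambda_independent, lambda_ccf = split_failure_rate(lambda_du, beta)
    independent_term = (lambda_independent * test_interval_h) ** 2 / 3.0
    common_cause_term = lambda_ccf * test_interval_h / 2.0
    return independent_term, common_cause_term


if __name__ == "__main__":
    # Illustrative values only: 1e-6 failures/hour, beta = 10 %,
    # proof test once a year (8760 hours).
    independent, common_cause = pfd_avg_1oo2(1e-6, 0.10, 8760.0)
    print(f"Independent contribution:  {independent:.3e}")
    print(f"Common cause contribution: {common_cause:.3e}")
    print(f"Total PFD (approx.):       {independent + common_cause:.3e}")
```

With values of this order, the common cause term is much larger than the independent term, which illustrates why common cause failures can cancel out most of the benefit expected from redundancy.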
In the CENELEC EN 50126 standard, common cause failure analysis is considered recommended (R) for SIL-1 and SIL-2 systems and highly recommended (HR) for SIL-3 and SIL-4 systems, where HR in practice means that it is treated as mandatory unless its omission is justified.