What does it mean for a system to be fail-safe or intrinsically safe?

16/10/2020

A fail-safe system is one that, because of the characteristics of its equipment and components and the way they are integrated, is guaranteed to go to a safe state whenever any fault appears. This normally affects availability but never, under any circumstances, safety.

The concept of a safe state, moreover, does not have to be "self-contained". That is to say, conventions or rules for using or operating the equipment or system may need to be established, so that all parties accept that entering the safe state will, most of the time, affect availability. Here is an interesting example to illustrate this concept...

The textbook example in railway signalling: the light bulb

Because of its simplicity and its single failure mode, the light bulb has been used as a fail-safe element for many years, and it allows railway signals to be built as fail-safe equipment. Before LED technology existed, and as a replacement for mechanical signals, the light bulb was the main element in railway signals throughout the world. It remains highly relevant precisely because it is fail-safe at an extremely low cost. To use the light bulb as the main element of a railway signal, a very simple convention was adopted to exploit its single failure mode: whenever a signal shows no illuminated aspect, it shall be considered to show the most restrictive aspect, that is, the red aspect.

It is well known that the only failure mode that can occur in a light bulb is that its filament breaks. The bulb then stops glowing and stops consuming energy, since no current passes through it. It can therefore happen, and quite frequently (about every 10 000 hours of operation), that the filament breaks, or is already broken, just when a red aspect has to be shown to stop a train. In that case the information could not be given to the driver. For this reason, the railway industry established the concept that "a signal that is off is equal to a signal showing the most restrictive aspect". In this way, a driver faced with a switched-off signal will always stop and apply degraded modes of operation in order to continue. As we can see, safety is not affected, but availability is, because the degraded modes will normally increase operating times.
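The convention described above can be expressed as a small sketch. It is only an illustration: the names `Aspect`, `effective_aspect` and `must_stop` are hypothetical and do not come from any signalling standard or product, and a three-aspect signal is assumed for simplicity.

```python
from enum import Enum
from typing import Optional


class Aspect(Enum):
    """Aspects of a hypothetical three-aspect signal."""
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


def effective_aspect(lit_aspect: Optional[Aspect]) -> Aspect:
    """Return the aspect the driver must obey.

    `lit_aspect` is None when no lamp is lit, for example because the
    filament has broken. By convention a dark signal is read as the
    most restrictive aspect.
    """
    if lit_aspect is None:
        return Aspect.RED
    return lit_aspect


def must_stop(lit_aspect: Optional[Aspect]) -> bool:
    """True when the driver has to stop at the signal."""
    return effective_aspect(lit_aspect) is Aspect.RED
```

With this rule, a broken filament can only turn a permissive aspect into a stop: `must_stop(None)` is true, so the failure costs availability, not safety.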

The example shown here is greatly simplified, since there can be double aspects, with several bulbs lit at once. The interlocking's detection of blown lamps (hot and cold filament checks) also plays an important role in this whole process, but we will not go into the details in this article.

The crucial point is that safe states and degraded modes can sometimes be activated through conventions built into the operation and use of the system in order to ensure its safety. This is not always the case, but it can be a strategic resource for solving a safety problem.

Going back to the concept of a fail-safe or intrinsically safe system, it is based on using components with well-established and limited failure modes, so that a safe operating condition is maintained in the event of a failure. That is, any of the possible failures may have an impact on availability, but never on safety.

From the point of view of RAMS engineering, the fail-safe concept can be understood in the following way: a system can have two types of failures, systematic and random. Both random and systematic failures can result in one of the following three behaviours at the system level:

The failure has no effect on the operation or use of the system.
The failure affects the availability of the system for operation or use.
The failure leads to conditions of operation or use of the system that compromise safety.

A fail-safe system is one that ensures that, in the event of a random or systematic failure, it will never be operated or used in a way that compromises safety. In fact, the concept of random failure no longer makes sense when we are talking about a fail-safe system, since the statistical treatment of random failures serves, in most cases, to determine the SIL (Safety Integrity Level) of a product, system or installation.
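The three behaviours and the fail-safe definition can be sketched as a simple check over a list of failure modes. This is a minimal illustration, not a RAMS tool: `Effect`, `is_fail_safe` and both example tables are hypothetical, and the second table is invented purely to show a system that fails the check.

```python
from enum import Enum
from typing import Mapping


class Effect(Enum):
    """System-level effect of a failure, following the three cases above."""
    NO_EFFECT = 1
    AVAILABILITY = 2
    SAFETY = 3


def is_fail_safe(failure_modes: Mapping[str, Effect]) -> bool:
    """A system is fail-safe only if no failure mode compromises safety."""
    return all(effect is not Effect.SAFETY for effect in failure_modes.values())


# The light-bulb signal, read under the "dark means red" convention.
bulb_signal = {"filament broken": Effect.AVAILABILITY}

# Hypothetical counter-example with one failure mode against safety.
hypothetical_device = {
    "power supply lost": Effect.AVAILABILITY,
    "output stuck at permissive": Effect.SAFETY,
}

print(is_fail_safe(bulb_signal))          # True
print(is_fail_safe(hypothetical_device))  # False
```

The check is deterministic: a single failure mode against safety is enough to reject the system, whatever its probability.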

In this sense, when we talk about fail-safe systems, the SIL concept no longer applies, since there is no probability that a failure will compromise safety. Therefore, the THR (Tolerable Hazard Rate) associated with the discrete SIL levels (for example SIL 1, SIL 2, SIL 3, SIL 4) will always be 0.
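To make the relation between THR and the discrete SIL levels concrete, here is a small sketch. The bands are the ones commonly quoted from EN 50129 (THR per hour and per function) and are reproduced here as an assumption; the standard itself should be checked before relying on them. The function name `sil_for_thr` is hypothetical.

```python
from typing import Optional

# Assumed THR bands (per hour and per function), as commonly quoted
# from EN 50129. Each entry is (lower bound inclusive, upper bound
# exclusive, SIL). Verify against the standard before use.
SIL_BANDS = [
    (1e-9, 1e-8, 4),
    (1e-8, 1e-7, 3),
    (1e-7, 1e-6, 2),
    (1e-6, 1e-5, 1),
]


def sil_for_thr(thr_per_hour: float) -> Optional[int]:
    """Return the SIL whose band contains the THR, or None if no band does."""
    for lower, upper, sil in SIL_BANDS:
        if lower <= thr_per_hour < upper:
            return sil
    return None


print(sil_for_thr(5e-9))  # 4
print(sil_for_thr(5e-7))  # 2
print(sil_for_thr(0.0))   # None
```

A THR of exactly 0, as in the ideal fail-safe case, falls outside every band in this table, which is one way to see why the SIL classification stops being meaningful there.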

Why has the fail-safe concept fallen out of use in favour of the SIL concept?

Technical progress, especially the introduction of electronic systems, has allowed the development of extraordinarily complex systems. They have an extraordinary capacity to meet the requirements of a product, system or installation in little space and at exceptionally low cost; in other words, with high integration. With solutions this complex, it is no longer considered possible to rely on equipment and components whose failure modes are guaranteed always to lead to a safe state, so the industry has moved on to approximations and statistical models of safety integrity, that is, to SIL. In the electronics industry, the probabilistic approach becomes the strategy for measuring failure rate (with the MTBF, for example) and production quality (the Six Sigma method, for instance). The electronics industry thus creates a new reference framework in which all parties accept a probability of failure in the quality and manufacture of the equipment. This probability is obviously extremely low, but it must be taken into account.

Therefore, we can conclude that, given the state of the art, the deterministic approach is not viable for complex systems and technologies. In these cases, the probabilistic approach comes into play.
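As a rough illustration of the probabilistic view, the sketch below converts an MTBF into a failure rate and a probability of failure over a mission time. It assumes a constant failure rate (the exponential model), which is a common simplification and not something the article states; the function names are hypothetical and the 10 000 hour figure is borrowed from the light-bulb example only as sample input.

```python
import math


def failure_rate(mtbf_hours: float) -> float:
    """Failure rate in failures per hour, assuming a constant rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


def probability_of_failure(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of at least one failure within mission_hours.

    Uses the exponential model: P = 1 - exp(-t / MTBF).
    """
    return 1.0 - math.exp(-mission_hours * failure_rate(mtbf_hours))


# Sample input: an MTBF of 10 000 hours observed over 1 000 hours.
print(failure_rate(10_000))
print(round(probability_of_failure(10_000, 1_000), 4))
```

Under this model the probability of failure is never exactly zero for any positive mission time, which is why a probabilistic framework needs a tolerable rate rather than a guarantee.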

Why is the fail-safe concept still used in systems that require probabilistic rates or approaches with a SIL?

The fail-safe concept is still used today, from a solution-design point of view, for complex systems whose failure rate is calculated statistically. In other words, the system designer is asked to take into account every failure mode the system may have and to design a response to each of them, so that the system does not move towards unsafe situations in its use or operation. As we have seen, the fail-safe concept also requires an analysis of the environment and of the strategies for operating or using the system when failures occur. Hence, the study of fail-safe strategies within these complex systems also incorporates conditioning factors that lie outside the scope of the system itself.

Which system would be better: SIL-4 or a fail-safe system?

This is the typical question that a customer might ask you after a RAMS engineering session or after a training with xxxx. The answer is difficult to give, but we could say that, theoretically, a fail-safe system is safe in all cases and in all situations. However, there is no doubt that a SIL-4 system guarantees a rate of failures against safety so low that we have an extremely safe and reliable system. Moreover, as we have seen in this article, if we want to build systems capable of solving great technological challenges in an economical, fast, compact and integrated way, we will not be able to build fail-safe systems; the statistical SIL approach therefore gives us access to technology ready to solve the great challenges of our society.

At Leedeo Engineering, we are specialists in the development of railway RAMS projects, applying the CENELEC standards EN 50126, EN 50128 and EN 50129 and EU Implementing Regulation 402/2013 with its Common Safety Method for Risk Assessment (CSM-RA), and supporting RAM and Safety tasks at any level required in the development and certification of safety products and applications.