Safety-Critical Software Design Techniques
In this article, we describe a set of strategies used when developing safety-critical software, which is typical of railway applications up to SIL 4 or of safety-related electronic equipment on board trains. These applications, in addition to complying with the EN 50128 standard (and the EN 50155 standard for electronic applications on trains, or on-board applications), must be developed to be robust against the internal and external failures that they will inevitably suffer over their service life.
The safety of an electronic system built around a microcontroller depends not only on its hardware architecture but also on its software, as laid down in the CENELEC EN 50129 and EN 50128 standards. Regarding software, safety management can be addressed with the techniques we present in this article. The aim of these techniques is to manage incorrect states of a software application, that is, to detect them and, in some cases, when possible, to return the system to a correct operating mode, either with all features operational or in a degraded mode.
When it comes to developing safe software applications, the following software development techniques are normally used:
- Error detection.
- Error recovery.
- Defensive programming.
- Double execution of the software application.
- Data redundancy.
We will now give a detailed description of each of these techniques. It is important to keep in mind that applying them involves adding a large number of lines of code, roughly as many as are needed to fulfil the initial functional requirements. Consequently, the design and the source code can end up twice the size projected for the initial system. The opportunities for failures and bugs in the code therefore potentially double, and adding these strategies may, paradoxically, worsen safety. It is thus important to take care of the development of this software, to apply the necessary quality processes, and to validate and verify with the effort this new scenario demands. Finally, these techniques must be rationalized as much as possible, since their complexity could end up damaging the overall performance of the system instead of contributing to it.
ERROR DETECTION
The purpose of error detection is not to ignore the appearance of an error during the execution of the code; its main mission is to be able to indicate that the code being executed is facing an error or an exceptional circumstance. Under normal circumstances, an operation (a function, a procedure or a code segment) returns a result bounded by the definition of the type it returns. When exceptional circumstances arise, however, the code cannot return a result bounded by that type, because of the erroneous environment in which it is running. The most common solution is to extend the type with a special "undefined" constant: when facing an anomaly, the code returns the "undefined" constant as its result. As an alternative to this solution, some programmers prefer to return compound types, with the initial variable(s) plus an extra variable, for example a boolean, which indicates whether the primary returned value has been calculated by a correct procedure. Another mechanism widely used for error processing, for example in C++, is the exception mechanism. Exception handling provides clear and standardized coding; however, it relocates error processing and, therefore, increases its complexity and complicates traceability. In any case, it is vital to write software so that it detects the errors that may appear during its execution.
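As a minimal sketch of the compound-return approach, here in C: the result is paired with a validity flag, so the "undefined" case is signalled explicitly instead of being squeezed into the value range. All names (checked_result_t, checked_div) are illustrative, not from any standard library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Result type extended with a validity flag: the exceptional case is
 * reported explicitly alongside the value. */
typedef struct {
    int32_t value;
    bool    valid;   /* false => exceptional circumstance, value undefined */
} checked_result_t;

/* Division that detects the exceptional case instead of ignoring it. */
checked_result_t checked_div(int32_t num, int32_t den)
{
    checked_result_t r = { 0, false };   /* default: undefined */
    if (den != 0) {
        r.value = num / den;
        r.valid = true;
    }
    return r;
}
```

The caller is forced to inspect the flag before using the value, which keeps error processing local and traceable.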
ERROR RECOVERY
The recovery of software execution errors aims to return the entire system to a correct state after an error has been detected. Two possibilities are typically defined for error recovery:
- Error recovery via retries.
"Backward" recovery implies having, and executing, a mechanism within the software that restores the system to a safe state. To do so, the system state must be stored regularly, and the saved states must allow "reloading". The system state is understood as the parameters, variables and states of the system as a whole, which change along the execution thread. When an erroneous situation is detected, one of the previous states (in most cases the most recent) can be reloaded and execution restarted from that point. If the failure originated in the environment or in a transient fault, the system can adopt a correct operating mode from then on, since the failure situation, being theoretically transient, will not happen again. Obviously, if the fault is systematic (in hardware or software), the system will return to the fault mode, since the same conditions will occur again. Faced with these systematic errors, some very advanced systems may keep several alternative implementations of the software application and activate a different replica when they detect a failure that repeats systematically.
The simplest and most typical mechanism used in error recovery via retries is to reset the microcontroller, returning it to a 100% known and controlled state, although, as we have seen, this need not be the only strategy used in a retry process. If a good periodic mechanism for storing key data and an intelligent reload system are used, it is not necessary to force a hardware reset. As might be expected, one of the keys to these designs is a correct definition of the execution points where the autosave takes place.
- Error recovery via continuation.
"Forward" recovery implies, upon fault detection, continuing to execute the application from the erroneous state, trying to make selective corrections to the system status. This is where mechanisms such as the C++ exceptions already mentioned come into play.
"Forward" recovery is a technique that is complex to implement, and validating its effectiveness is also complex. In essence, "forward" recovery requires a complete list of errors and a strategy for correcting each of them. Adding error detection points and operation verification points increases the complexity of the code and multiplies the combinations of tests that must be performed to validate the application.
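The "backward" recovery just described, periodic saving of the system state plus reloading on error detection, can be sketched in C as follows. The state structure and its fields are hypothetical; a real design would carefully choose which variables define the execution context and where the autosave points sit.

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical "system state": the parameters and variables that define
 * the execution context at a given point. */
typedef struct {
    uint32_t position;
    uint32_t speed;
} sys_state_t;

static sys_state_t checkpoint;          /* last known-good copy */

/* Called regularly at the defined autosave points. */
void save_checkpoint(const sys_state_t *live)
{
    memcpy(&checkpoint, live, sizeof *live);
}

/* On error detection: reload the most recent saved state and restart
 * execution from that point (backward recovery). */
void restore_checkpoint(sys_state_t *live)
{
    memcpy(live, &checkpoint, sizeof *live);
}
```

If the fault was transient, execution resumed from the checkpoint proceeds normally; if it is systematic, the same conditions will reproduce it, which is why some designs keep several replicas to switch between.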
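A "forward" recovery, by contrast, corrects the erroneous state selectively and continues execution. A minimal sketch, assuming a speed reading with a known physical limit (the limit, names and safe substitute are illustrative; a real design enumerates every known error and its correction):

```c
#include <stdint.h>
#include <stdbool.h>

#define SPEED_MAX 300u   /* illustrative physical limit */

/* Forward recovery: on detecting an invalid value, do not roll back;
 * apply a selective correction (here, a safe-side substitute) and
 * continue executing. */
uint32_t recover_speed(uint32_t raw, bool *corrected)
{
    if (raw > SPEED_MAX) {
        *corrected = true;
        return SPEED_MAX;   /* known error, predefined correction */
    }
    *corrected = false;
    return raw;
}
```

Every such detection and correction point adds code paths, which is exactly why the validation effort grows with this technique.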
DEFENSIVE PROGRAMMING
Defensive programming is based on producing programs able to detect erroneous control flows, data or data values during execution, and on reacting to these errors in a predefined and acceptable manner, as follows:
- The first rule of defensive programming, although obvious, is imperative: the systematic initialization of all variables (global and local). In addition, and very significantly, the values chosen for initialization must be related to the notion of a safe state.
- Checksum and memory tests. During system startup, and periodically during execution, tests are carried out to check that program and data memory have not been modified in any way by any effect not caused by the system itself. In the same vein, it is also typical to "fill" unused memory in order to check that it is not being misinterpreted as instructions. Memory verification aims to ensure that every memory location used by the system can be read and written correctly. This operation too is carried out at system startup and periodically during execution.
- The second rule is to check the coherence of the inputs, to prevent an erroneous input from spreading through the software application. It is therefore essential to systematically verify the inputs (parameters of a function call, global variables, function return values, etc.). For example, if it is impossible for a function to receive a negative value at its input, defensive programming will check that the input value is never negative, always applying restrictions as tight as possible.
- The third approach is related to data management during processing. To do this, the context of each use must be taken into account for each programming structure. An IF instruction will systematically have an ELSE branch; a SWITCH instruction will systematically have a DEFAULT case. In other words, there is always a programmed, controlled path for execution to follow, even in the event of a failure.
- Plausibility of data. Do the values of the variables seem plausible, credible and probable, and do they make sense, given what we know about the program? Regarding the coherence of input data, it is customary to implement redundancy in time, that is, to capture the same value two or more times and require the same result (or to vote among n captures), or to use two different input generators for the same source: for instance, two sensors in parallel, sensing the same source, must report the same value. For some systems, physical conditions help verify the consistency of input data. For example, the angle sensor of a rotating part must pass through all the angles degree by degree; it cannot jump directly by 90°. Full use should be made of such properties: values that imply physically impossible jumps in angle must not be accepted.
- Plausibility of the operation. Techniques related to plausibility of the operation revolve around the following question: does program execution follow a predictable and expected flow? Consistency of operation helps to assess whether a given software application follows a previously validated route or not. To do this, the execution of the software application must be traceable, by marking the steps that have been carried out. An execution consists of a set of traces, each one a series of control points (an execution path).
- Watchdog. The watchdog is one of the most widely used techniques to ensure that the software is at least "alive" enough to periodically refresh a flag that keeps the watchdog from resetting the microcontroller.
- Refreshing configuration bits. It is very common to periodically refresh the configuration bits of the microcontroller which, without a defensive programming orientation, would only be configured once at the beginning of the program. Moreover, in order to detect possible problems, the configuration is read back, it is checked that it still has the adequate value, and the configuration value is then overwritten again. On some occasions, such as the read and write processes of input and output ports, these same processes are used to reconfigure the ports.
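The first three defensive rules above (safe-state initialization, input coherence, controlled default paths) can be sketched together in C. All names, modes and limits are hypothetical examples, not from any standard:

```c
#include <stdint.h>
#include <stdbool.h>

/* Rule 1: initialize everything, and to the SAFE state (brakes applied,
 * traction off), not merely to zero by convention. */
typedef struct {
    bool    brake_applied;
    bool    traction_enabled;
    int32_t speed_setpoint;
} train_ctrl_t;

void ctrl_init_safe(train_ctrl_t *c)
{
    c->brake_applied    = true;   /* safe: braking      */
    c->traction_enabled = false;  /* safe: no traction  */
    c->speed_setpoint   = 0;      /* safe: stopped      */
}

/* Rule 2: check input coherence at the boundary so an erroneous value
 * cannot propagate (a negative speed is impossible here). */
bool set_speed(train_ctrl_t *c, int32_t speed)
{
    if (speed < 0 || speed > 300)   /* 300: illustrative limit */
        return false;               /* rejected, state unchanged */
    c->speed_setpoint = speed;
    return true;
}

/* Rule 3: every SWITCH has a DEFAULT (and every IF an ELSE), so even an
 * "impossible" value lands on a programmed, controlled path. */
int dispatch_mode(int mode)
{
    switch (mode) {
    case 0:  return 1;              /* run         */
    case 1:  return 2;              /* standby     */
    case 2:  return 3;              /* maintenance */
    default: return 0;              /* safe fallback for a corrupt mode */
    }
}
```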
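The periodic memory check described above reduces to comparing a checksum of a region against a reference computed at startup. A minimal sketch; a real design would typically use a CRC rather than this simple additive sum:

```c
#include <stdint.h>
#include <stddef.h>

/* Checksum over a memory region (e.g. program or constant data), to be
 * compared periodically against the value computed at startup. */
uint32_t mem_checksum(const uint8_t *region, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += region[i];
    return sum;
}
```

Any deviation from the startup reference indicates that memory was modified by an effect not caused by the system itself.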
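The data plausibility checks above (redundant captures that must agree, and physically impossible jumps that must be rejected) can be sketched as follows; tolerances and the maximum angular step are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* Two redundant captures of the same source must agree (within a
 * tolerance) before the value is accepted. */
bool captures_agree(int32_t a, int32_t b, int32_t tol)
{
    int32_t d = a - b;
    if (d < 0) d = -d;
    return d <= tol;
}

/* Angle plausibility: the rotating part cannot physically jump, so
 * successive readings may differ by at most MAX_STEP_DEG per cycle. */
#define MAX_STEP_DEG 2
bool angle_plausible(int32_t prev_deg, int32_t new_deg)
{
    int32_t d = new_deg - prev_deg;
    if (d < 0) d = -d;
    if (d > 180) d = 360 - d;     /* wrap-around at 0/360 degrees */
    return d <= MAX_STEP_DEG;
}
```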
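Plausibility of the operation, checking that execution follows a previously validated path, can be sketched with control-point tracing. Each code section marks its passage, and at the end of a cycle the recorded trace is compared against the expected path (checkpoint identifiers and the buffer size are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACE 8
static int    trace[MAX_TRACE];     /* control points hit this cycle */
static size_t trace_len;

void trace_reset(void) { trace_len = 0; }

void trace_mark(int checkpoint)     /* called at each control point */
{
    if (trace_len < MAX_TRACE)
        trace[trace_len++] = checkpoint;
}

/* Compare the recorded trace against a previously validated path. */
bool trace_matches(const int *expected, size_t n)
{
    if (n != trace_len)
        return false;
    for (size_t i = 0; i < n; i++)
        if (trace[i] != expected[i])
            return false;
    return true;
}
```

A mismatch means execution took a route that was never validated, and the system can react (for example, by entering a safe state).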
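The watchdog and configuration-refresh bullets can be illustrated with a simulated sketch. On a real microcontroller both are hardware peripherals with their own memory-mapped registers; every name, timeout and register value below is illustrative, and the "reset" is modeled as a flag:

```c
#include <stdint.h>
#include <stdbool.h>

/* --- Simulated watchdog: the main loop must call wdg_kick() before
 * the counter reaches the timeout, or a reset fires. --- */
#define WDG_TIMEOUT 1000u
static uint32_t wdg_counter;
static bool     wdg_reset_fired;

void wdg_kick(void) { wdg_counter = 0; }    /* the "alive" refresh */

void wdg_tick(void)                         /* periodic timer hook */
{
    if (++wdg_counter >= WDG_TIMEOUT)
        wdg_reset_fired = true;             /* hardware would reset here */
}

/* --- Simulated configuration register refresh: read back, flag any
 * deviation, then rewrite the known-good value unconditionally. --- */
#define PORT_CONFIG_EXPECTED 0xA5A5u
static volatile uint32_t port_config = PORT_CONFIG_EXPECTED;

bool port_config_refresh(void)
{
    bool was_ok = (port_config == PORT_CONFIG_EXPECTED);
    port_config = PORT_CONFIG_EXPECTED;     /* rewrite in either case */
    return was_ok;
}
```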
DOUBLE EXECUTION OF THE SOFTWARE APPLICATION
Operational redundancy executes the same application twice on the same treatment unit (processor, etc.). The results, generated at two different times, are compared by a device internal or external to the processor, and any inconsistency causes entry into a degraded system mode. This operational redundancy technique can help detect memory failures, for instance: a single program is loaded into two different zones of the storage medium (two different zones of the address space, two different memory devices, etc.). Memory failures (RAM, ROM, EPROM, etc.) can thus be detected, together with random failures of the processing unit. It should be noted that some failures of shared hardware devices (comparison unit, treatment unit) are not detected and, therefore, remain dormant.
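A minimal sketch of this double execution with comparison, in C. The treatment functions are illustrative stand-ins; "flaky" injects a transient fault on its second call purely to demonstrate the degraded-mode path:

```c
#include <stdint.h>
#include <stdbool.h>

typedef int32_t (*treatment_fn)(int32_t);

/* Execute the same treatment twice at two different times and compare;
 * release the result only if both runs agree, otherwise signal entry
 * into a degraded mode. */
bool double_execute(treatment_fn f, int32_t input,
                    int32_t *out, bool *degraded)
{
    int32_t r1 = f(input);        /* first execution  */
    int32_t r2 = f(input);        /* second execution */
    if (r1 != r2) {
        *degraded = true;         /* inconsistency detected */
        return false;
    }
    *out = r1;
    *degraded = false;
    return true;
}

/* Example treatment. */
static int32_t square(int32_t x) { return x * x; }

/* Fault-injected treatment: returns a different result on its second
 * call, as a transient fault affecting one execution would. */
static int32_t flaky(int32_t x)
{
    static int calls;
    return x + (calls++ == 1 ? 1 : 0);
}
```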
An extension of the level of redundancy consists in introducing code diversification. This diversification can be "light", imposing the use of two different sequences of instructions to program the application: for example, one of the programs would compute A + B, while the other would compute -(-A - B). On the other hand, we can talk about "complex" diversification, with full redundancy, where two versions of the entire application are designed and executed independently of each other.
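The "light" diversification example from the text, sketched directly: the same computation coded with two different instruction sequences, so a fault affecting one sequence is less likely to corrupt both identically.

```c
#include <stdint.h>
#include <stdbool.h>

/* Two diversified codings of the same addition. */
int32_t add_v1(int32_t a, int32_t b) { return a + b; }
int32_t add_v2(int32_t a, int32_t b) { return -(-a - b); }

/* Run both codings and release the result only if they agree. */
bool diversified_add(int32_t a, int32_t b, int32_t *out)
{
    int32_t r1 = add_v1(a, b);
    int32_t r2 = add_v2(a, b);
    if (r1 != r2)
        return false;   /* divergence: do not use the result */
    *out = r1;
    return true;
}
```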
Software diversification can be complemented with hardware diversification. That is, the same software or different software diversified in a "light" or "complex" way is executed at the same time, in two hardware units. The result of each hardware unit is exchanged towards the other to assess consistency of the results or, a third unit will be implemented that will evaluate if both systems reach the same results. This type of configuration is called 2 out of 2, 2oo2. It is also very frequent, in order to increase the availability of equipment, to use strategies 2 of 3 (2 out of 3, 2oo3), where a voting system will be established by executing an action while at least 2 of the 3 hardware sub-systems will provide the same result.
DATA REDUNDANCY
In its simplest form, data redundancy doubles stored variables and compares their contents to ensure that they have been initialized correctly, updated correctly, read correctly and, for as long as they have been held in memory, have not been changed involuntarily. Data redundancy is also understood, this time with a higher level of complexity, as the use of data integrity techniques with redundancy: CRC systems, Hamming codes, etc. are well known. The particularity of the latter, for example, is their ability to correct the failure, not just detect it.
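The simplest form, doubled storage, is often implemented by keeping each critical variable together with its bitwise complement; any involuntary change makes the pair disagree. A minimal sketch with illustrative names:

```c
#include <stdint.h>
#include <stdbool.h>

/* A critical variable stored twice: plain and bitwise-complemented. */
typedef struct {
    uint32_t value;
    uint32_t value_inv;    /* redundant copy, stored complemented */
} red_u32_t;

void red_write(red_u32_t *v, uint32_t x)
{
    v->value     = x;
    v->value_inv = ~x;
}

/* Read with integrity check: the pair must still be complementary. */
bool red_read(const red_u32_t *v, uint32_t *out)
{
    if (v->value != (uint32_t)~v->value_inv)
        return false;      /* involuntary change detected */
    *out = v->value;
    return true;
}
```

This scheme only detects corruption; CRC or Hamming codes, mentioned above, add the ability to locate and correct it.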