Design
Safety-critical MCUs: what if things go wrong?
Microcontrollers used in vehicles’ safety-critical systems need to meet stringent reliability standards. Sally Ward-Foxton takes a look at some of the safety features used in today’s automotive MCUs.
At the heart of most automotive electronic systems is the ubiquitous, but increasingly complex, microcontroller. MCUs in safety-critical systems use a number of different architectures and techniques to prevent and detect random and systematic hardware errors. These safety features mitigate the three main types of fault: single point faults, latent faults and common cause faults.
Single point faults
Single point faults can be immediately critical for system safety and require fast detection and reaction. Typically, single point faults are detected using replication – performing an operation twice in two physically separated instances of the same circuit and comparing the results. “Since the compare may also go wrong, this is also implemented twice,” explains Mark O’Donnell, Product Manager for Chassis and Safety Operation at Freescale.
One of the most common techniques for implementing replication in automotive MCUs is called ‘lockstep’. Lockstep means that two processors are running the same set of operations in parallel. The two sets of results are compared by a checker unit or comparator to look for single point faults.
“In the event of a fault, an error signal is forwarded to a separate hardware block and triggers the according reaction to prevent propagation of the fault and put the device into a fail-safe mode,” O’Donnell says.
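The lockstep principle described above can be modelled in a few lines of software. The sketch below is purely illustrative and is not how any particular device implements it: real lockstep compares are done in hardware every clock cycle, and the names lockstep_step, healthy_alu and faulty_alu are hypothetical. It runs the same operation on two "channels", checks the results with two independently written comparisons (reflecting O’Donnell’s point that the compare is also implemented twice), and raises a fault flag if either check sees a mismatch.

```python
from typing import Callable, Tuple


def healthy_alu(x: int) -> int:
    """Stand-in for one replicated channel (hypothetical 32-bit operation)."""
    return (x * 3 + 1) & 0xFFFFFFFF


def faulty_alu(x: int) -> int:
    """Same operation with a simulated single bit flip in the result."""
    return healthy_alu(x) ^ (1 << 7)


def lockstep_step(
    channel_a: Callable[[int], int],
    channel_b: Callable[[int], int],
    operand: int,
) -> Tuple[int, bool]:
    """Run one operation on both channels and compare the results twice.

    Returns (result from channel A, fault flag).
    """
    result_a = channel_a(operand)
    result_b = channel_b(operand)

    # Two differently written comparators, so that a defect in one
    # comparison path does not silently mask a mismatch.
    mismatch_1 = result_a != result_b
    mismatch_2 = (result_a ^ result_b) != 0

    # Any reported mismatch is treated as a fault; in a real device this
    # signal would go to a separate hardware block that forces a
    # fail-safe state.
    fault = mismatch_1 or mismatch_2
    return result_a, fault
```

Calling lockstep_step(healthy_alu, healthy_alu, 5) reports no fault, while lockstep_step(healthy_alu, faulty_alu, 5) sets the fault flag because the two channels disagree in one bit.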
Freescale has taken the idea of replication a step further. With its newly introduced Qorivva MPC564xL automotive microcontroller family, the sphere of replication is extended from the cores to also include the crossbar, MPUs, interrupt controller, DMA units and software watchdog timer (see Figure 1). The devices effectively contain two ‘channels’, each consisting of a core, bus, interrupt controller, memory controller and other core-related modules.
“In the past the principle of replication and lockstep was predominantly used for the cores only,” O’Donnell says. “The main benefit of this extended sphere of replication is the capability to detect single point failures, which are typically higher rate soft errors not only in the cores but also in key sub-modules of the microcontroller.”
With the development of modern multi-core architectures, dual-core devices are now commonly used for lockstep techniques. Dual core lockstep architectures are thought to be fast and reliable while offering performance advantages over two single cores. “With a single core architecture as the alternative, multicore architectures reduce the risk from the hardware perspective,” O’Donnell says.
“A dual core lockstep architecture significantly reduces the software overhead versus a dual MCU solution,” adds Frank Forster, Safety Microcontroller Marketing & Systems Manager EMEA Sales & Marketing at TI. He explains that when two single-core MCUs are used, additional software has to be written, tested and certified, and that this software can consume up to 30% of CPU performance. A dual-core lockstep architecture avoids this overhead.
Latent faults
Latent faults are ‘hidden’, that is, they have occurred but are not yet compromising system safety functions. Hardware built-in self-test (BIST) mechanisms are key for the detection of latent faults. Both CPUs and memories use self-test mechanisms.
TI’s ARM Cortex-R4F dual core MCU for automotive safety-critical applications, the TMS570 series, features both memory and CPU self-test (see Figure 2). Its hardware logic built-in self-test (LBIST) function can check the function of the CPU’s logic down to the gate level, providing a high level of diagnostic coverage. According to Forster, it is typically executed at the system start, but can also be executed at application runtime in 32 time slices of 2k cycles each. Doing this in hardware avoids complex safety software and code-size overheads.
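The idea of spreading a self-test over runtime in slices can be sketched as a simple scheduler. This is a minimal model under stated assumptions, not TI’s implementation: the class SlicedSelfTest and the run_slice callback are hypothetical, and "2k" is read here as roughly 2,000 cycles. With 32 slices, one full pass would then cost in the region of 64,000 cycles, but the application only gives up one slice at a time.

```python
from typing import Callable

NUM_SLICES = 32          # number of time slices quoted in the article
CYCLES_PER_SLICE = 2000  # "2k cycles" read as about 2,000 (assumption)


class SlicedSelfTest:
    """Runs one self-test slice per call, so a full pass is spread over time."""

    def __init__(self, run_slice: Callable[[int], bool],
                 num_slices: int = NUM_SLICES) -> None:
        self.run_slice = run_slice      # hypothetical hook: True means slice passed
        self.num_slices = num_slices
        self.next_slice = 0
        self.passes_completed = 0
        self.fault = False              # latched once any slice fails

    def step(self) -> None:
        """Execute the next slice, for example from an idle slot in the application."""
        if self.fault:
            return                      # stay in the faulted state
        if not self.run_slice(self.next_slice):
            self.fault = True           # latent fault found; stop advancing
            return
        self.next_slice += 1
        if self.next_slice == self.num_slices:
            self.next_slice = 0
            self.passes_completed += 1  # whole test pattern set has been covered
```

An application loop would call step() whenever it has a spare slot; after 32 successful calls the counter passes_completed increases by one and the sequence starts again.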
The TMS570 also self-tests its memory using techniques including MBIST (memory built-in self test) and ECC (error correcting code).
“As the Flash and RAM occupy quite some space on the device, they have to be specifically protected against random failures,” says Forster.
ECC is actually performed by the CPUs. The type used on the TMS570, for example, can correct single-bit errors and detect double-bit errors (SECDED). This is hardware based too – the ECC is evaluated in parallel with the application’s processing, so there is no latency or performance impact.
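To show what SECDED means in practice, here is a toy software model using an extended Hamming code on a 4-bit value. It is only a sketch of the principle: real MCU memories protect much wider words in dedicated hardware, and the function names secded_encode and secded_decode are hypothetical. Three parity bits locate a single flipped bit, and one extra overall parity bit distinguishes a correctable single-bit error from an uncorrectable double-bit error.

```python
from typing import Optional, Tuple


def secded_encode(data: int) -> int:
    """Encode a 4-bit value into an 8-bit extended Hamming codeword.

    Bit i of the codeword is "position i": positions 1, 2 and 4 hold
    Hamming parity, positions 3, 5, 6 and 7 hold data, and position 0
    holds an overall parity bit.
    """
    bits = [0] * 8
    bits[3] = (data >> 0) & 1
    bits[5] = (data >> 1) & 1
    bits[6] = (data >> 2) & 1
    bits[7] = (data >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1  # makes the whole codeword even parity
    return sum(bit << i for i, bit in enumerate(bits))


def secded_decode(code: int) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(code >> i) & 1 for i in range(8)]

    # The syndrome is the XOR of the positions of all set bits; it is 0
    # for a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) & 1  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[syndrome] ^= 1  # single-bit error: flip it back
        status = "corrected"
    else:
        return None, "uncorrectable"  # double-bit error: detect only

    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status
```

Flipping any one bit of an encoded word still returns the original data with the status "corrected"; flipping any two bits returns the status "uncorrectable" rather than wrong data, which is the behaviour a safety application needs.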
Common cause faults
Common cause faults result when redundant elements of an MCU are not completely independent, perhaps because they use a common die as in dual core lockstep architectures. The same error happens on both cores, so they provide identical but erroneous data. Typical causes of common cause faults are system clock or power supply problems which affect both cores in the same way. In the Freescale Qorivva MPC564xL family, hardware blocks are provided to detect clock deviations and monitor important voltages such as the internal core voltage and Flash supply voltage, and flag them up if necessary.
Other measures taken to avoid common cause faults include introducing diversity by varying the physical layout of the dual cores on the device. TI’s TMS570 features dual ARM Cortex-R4F cores, with the second core located at least 100μm from the first, mirrored and rotated, to reduce the risk of common cause failures.
As microcontroller devices become more and more complex, hardware error detection and correction have become more complex too. While the advent of dual core technology has made lockstep techniques more efficient, the risk of common cause faults has become more important since the cores share the same die. Today’s mechanisms for preventing and detecting the three main types of hardware faults must work together to keep microcontrollers running safely, despite development time and cost pressures. Solutions available on the market today are specifically designed to be ASIL-D compliant, with error detection and correction techniques built in to help designers meet functional safety requirements without compromising on development time or bill of materials.
CAPTIONS
Figure 1. Freescale’s Qorivva MPC564xL. The sphere of replication is extended beyond the cores to many other features
Figure 2. TI’s Hercules TMS570 family. This ARM Cortex-R4F dual core MCU for automotive safety-critical applications features both memory and CPU self-test
BOXOUT
Adapted standard
With vehicle safety functions increasingly dependent on electronics, safety standards are increasingly important. IEC 61508 was ratified in 1998, and since then various industries have developed their own adaptations based on it. The automotive industry published ISO 26262 in 2011, which takes the safety process and requirements for automotive electronics to a deeper level.
A key concept in both IEC 61508 and ISO 26262 is Safety Integrity Levels (SILs). IEC 61508 defines SILs 1 to 4, with 4 being the most stringent. ISO 26262 defines Automotive SILs (ASILs) A to D, with D being the most stringent. It’s important to note that the SILs and ASILs do not directly relate to each other (see Figure 3). Both the Freescale Qorivva MPC564xL and TI Hercules TMS570 mentioned in this article are specifically designed to comply with ASIL-D.
CAPTION FOR FIGURE IN BOXOUT
Figure 3. Details of functional safety standards for automotive electronics. (Image courtesy of Freescale Semiconductor)