Generalized Formula for the Calculation of a Probabilistic Metric for Random Hardware Failures in Redundant Systems

Editor's Note: The paper on which this article is based was originally presented at the 2017 IEEE Product Safety Engineering Society Symposium, where it received recognition as the Best Symposium Paper. It is reprinted here, with permission, from the proceedings of the 2017 IEEE Product Safety Engineering Society International Symposium on Product Compliance Engineering. Copyright 2017 IEEE.

Since its introduction in 2011, International Organization for Standardization (ISO) standard 26262 [1], a functional safety standard for automotive electrical and/or electronic (E/E) systems, has addressed two categories of failure. The first is systematic failures, which are prevented by safety measures in the verification and validation development processes. The second is hardware failures, whose safety impact is prevented by safety mechanisms. These safety measures provide logical and quantitative assurance that is not achievable with traditional quality management (QM) strategies.

ISO 26262 defines the probabilistic metric for random hardware failures (PMHF), which is a metric related to the probability of a safety goal violation caused by a random hardware failure, and the architecture metrics that are discussed in papers such as [2] and [3].

Target Values for Hardware

The architecture metrics provide average diagnostic coverages that are relative values [2]. On the other hand, the PMHF is an absolute value of the average failure rate of an item. Although some formulas are shown in Part 10 8.3.3 of [1], a detailed explanation of the PMHF is not provided [4].

Before we start the PMHF discussion, we focus on the definition of the probability of failure (PoF) and related equations shown in [5] as follows:

$$PoF_{item,t} \equiv \Pr\{X_{item} \le t\} = F_{item}(t) = 1 - e^{-\lambda_{item} t}, \qquad f_{item}(t) = \lambda_{item} e^{-\lambda_{item} t} \tag{1}$$

where

• $PoF_{item,t}$: the probability of failure of an item until time $t$;
• $X_{item}$: the random variable that represents the failure-free operating time of an item, which follows an exponential distribution;
• $F_{item}(t)$: the unreliability, that is, the cumulative distribution function (CDF) of the failure of an item until time $t$;
• $\lambda_{item}$: the failure rate of an item; and
• $f_{item}(t)$: the probability density function (PDF) of the failure of an item at time $t$.
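As a minimal sketch of (1), the two functions below evaluate the unreliability and the density for a constant failure rate. The function names and the example values (100 FIT, 10,000 hours) are illustrative assumptions, not values from the paper.

```python
import math

def failure_cdf(failure_rate: float, t: float) -> float:
    """Unreliability F(t) = 1 - exp(-lambda * t) from equation (1)."""
    # expm1 keeps precision when lambda * t is very small
    return -math.expm1(-failure_rate * t)

def failure_pdf(failure_rate: float, t: float) -> float:
    """Density f(t) = lambda * exp(-lambda * t) from equation (1)."""
    return failure_rate * math.exp(-failure_rate * t)

# Hypothetical values: 100 FIT (1e-7 failures per hour) over 10,000 hours
lam_item = 100e-9
t_hours = 1.0e4
print(failure_cdf(lam_item, t_hours))
print(failure_pdf(lam_item, t_hours))
```

Using `math.expm1` rather than `1 - math.exp(...)` is a design choice for numerical accuracy, because the product of failure rate and time is typically far below 1 in this setting.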

The PMHF is related to the probability of a violation of a safety goal due to a random hardware failure within a vehicle lifetime. It is expressed as the average PoF over a vehicle lifetime, as in (2), using (1) and the Taylor expansion of the exponential function under the assumption $\lambda_{item} T_{lifetime} \ll 1$, which is applied throughout the discussion:

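$$M_{PMHF} \equiv \frac{1}{T_{lifetime}} PoF_{item,T_{lifetime}} = \frac{1}{T_{lifetime}} \left(1 - e^{-\lambda_{item} T_{lifetime}}\right) \approx \lambda_{item} \tag{2}$$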

where

• $M_{PMHF}$: the PMHF of an item; and
• $T_{lifetime}$: the vehicle lifetime.

The PMHF can therefore be regarded as the average failure rate of an item, based on (2).
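The following sketch illustrates why the lifetime-averaged PoF is close to the failure rate itself when the product of failure rate and lifetime is much smaller than 1. The function name and the numeric values (10 FIT, 100,000 hours) are assumptions chosen for illustration.

```python
import math

def pmhf_average(failure_rate: float, lifetime: float) -> float:
    """Average PoF over the lifetime: (1 - exp(-lambda * T)) / T."""
    return -math.expm1(-failure_rate * lifetime) / lifetime

lam_item = 10e-9     # hypothetical: 10 FIT, in failures per hour
t_lifetime = 1.0e5   # hypothetical vehicle lifetime in hours

pmhf = pmhf_average(lam_item, t_lifetime)
# The leading error term of the first-order Taylor expansion is lambda * T / 2
relative_error = abs(pmhf - lam_item) / lam_item
print(pmhf, relative_error)
```

With these values the product of failure rate and lifetime is 0.001, so the averaged PoF and the failure rate differ only by a fraction of a percent.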

Related Work

Although detailed methods for calculating a PMHF are not available in the literature, some papers focus on the PMHF metric ([3], [6], and [7]). The authors of [4] derive the PMHF using an expression of PoF based on a probability calculation tool, such as fault tree analysis (FTA). Although the PoF of each basic event can be calculated, we believe that using observable parameters, such as failure rates and diagnostic coverages (DCs), which are employed in Part 10 8.3.3 of [1], is more effective for improving the understanding of these parameters and for creating an FTA model.

In [4], the authors discriminate between the following two cases with respect to the latent fault of a safety mechanism:

a) failures of a safety mechanism that are "latent" (i.e., not detectable); and
b) failures of a safety mechanism that are detectable but occur within the diagnostic test interval.

However, unlike [8], the discussion in [4] is not based on conditional probability. Although the authors of [8] introduce conditional probability, their latent fault calculation has some issues, which are detailed in the section "PoF Calculation of Dual Point Failure." In addition, the opposite situation, in which the mission function is in the latent fault state, is not addressed. This paper aims to clarify that a dual point failure (DPF) is the failure caused by a second fault when the first fault is latent, where the first faulty element may be either a mission function or a safety mechanism.

Target Subsystem

Figure 1 shows the target subsystem, named "SUBS." SUBS includes a mission function "M" and a safety mechanism "SM" that supervises M; SM is the primary safety mechanism.¹ The notations M and SM are from Part 10 8.3.3 of [1]. Two secondary safety mechanisms² exist, one each for M and SM: "SM2M" and "SM2SM." Here, SM and SM2M can be the same element. We can observe three parameters, namely the failure rate ($\lambda$), the diagnostic coverage ($DC$), and the diagnostic period ($\tau$), for each element according to Part 10 8.1.7 of [1], as shown in Figure 1. For the element M, let $\lambda$ be $\lambda_M$; $DC$ and $\tau$ do not exist because the mission function has no diagnostic capability.

¹ Primary safety mechanism: a safety mechanism that prevents faults from violating a safety goal.

² Secondary safety mechanism: a safety mechanism that prevents latent faults.

Figure 1: Target subsystem "SUBS"

For the element SM, let $\lambda$ be $\lambda_{SM}$, $DC$ be $K_{M,FMC,RF}$, and $\tau$ be zero because SM works within the fault tolerant time interval (FTTI). For SM2M, let $\lambda$ be zero (the lemma that proves this follows), $DC$ be $K_{M,FMC,MPF}$, and $\tau$ be $\tau_M$. For SM2SM, let $\lambda$ be zero, $DC$ be $K_{SM,FMC,MPF}$, and $\tau$ be $\tau_{SM}$, according to Part 10 8.3.3 of [1].

Lemma: Secondary Safety Mechanisms Never Fail

Before we prove that the failure rate of a secondary safety mechanism is zero, we introduce the notation "A → B," which indicates that element B receives a fault when element A is in a latent fault state. The small dual-point PoF for SM → M in the period $(t, t + dt]$ is defined as follows:

$$dPoF_{SM \to M,t} \equiv \Pr\{\text{SM is in a latent fault state at } t \,\cap\, \text{M receives a fault in } (t, t + dt]\} \tag{3}$$

We do not have to consider the fault of SM2M (the secondary safety mechanism for M) in this case because it is considered after time $t + dt$. Considering the two cases in which SM2SM (the secondary safety mechanism for SM) is or is not in a fault state, $dPoF_{SM \to M,t}$, as defined in (3), can be expressed as follows:

$$\begin{aligned}
dPoF_{SM \to M,t} ={}& \Pr\{\text{SM2SM is in a fault state at } t \,\cap\, \text{SM is in a latent fault state at } t \,\cap\, \text{M receives a fault in } (t, t + dt]\} \\
&+ \Pr\{\text{SM2SM is not in a fault state at } t \,\cap\, \text{SM is in a latent fault state at } t \,\cap\, \text{M receives a fault in } (t, t + dt]\}
\end{aligned} \tag{4}$$

The first term in (4) assumes that faults occur in SM2SM, SM, and M; failures caused by three or more faults are classified as safe faults according to [1]. As a result, we can calculate the PMHF by considering only the second term of (4), that is, by assuming that SM2SM is never in a fault state. The same argument applies to SM2M in the M → SM case; thus, we assume that neither secondary safety mechanism is ever in a fault state.

Calculation of the PMHF

PoF Calculation of Single Point Failure

The PMHF is expressed as the sum of a single point failure (SPF) term and DPF terms because a failure caused by three or more faults is classified as a safe fault, as previously explained. Because the fault of a safety mechanism does not cause a violation of a safety goal by definition, we obtain $PoF_{SPF,t}$ based on the conditional probability as

$$\begin{aligned}
PoF_{SPF,t} &= \Pr\{\text{M is in a non-prevention state at } t\} \\
&= \Pr\{\text{M is in a fault state at } t \,\cap\, \text{the fault is not prevented by SM}\} \\
&= \Pr\{\text{the fault is not prevented} \mid \text{M is in a fault state at } t\} \cdot \Pr\{\text{M is in a fault state at } t\} \\
&= (1 - K_{M,FMC,RF}) \Pr\{X_M \le t\} \\
&= (1 - K_{M,FMC,RF}) F_M(t)
\end{aligned} \tag{5}$$

where

• $K_{M,FMC,RF}$: the failure mode coverage of M with respect to residual faults (Part 10 8.3.3 of [1]);
• $X_M$: the random variable that represents the failure-free operating time of M; and
• $F_M(t)$: the CDF of the failure of M.

Therefore, applying (5) to (2) with the approximation $F_M(t) \approx \lambda_M t$, we obtain (6):

$$M_{PMHF,SPF} = (1 - K_{M,FMC,RF}) \lambda_M = \lambda_{RF} \tag{6}$$

where

• $\lambda_{RF}$: the residual failure rate of M.
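The SPF term is a single multiplication, sketched below. The function name and the example values (a 100 FIT mission function with 99% residual-fault coverage) are hypothetical and serve only to show the arithmetic.

```python
def residual_failure_rate(lambda_m: float, k_m_fmc_rf: float) -> float:
    """SPF contribution to the PMHF: (1 - K_M,FMC,RF) * lambda_M, as in (6)."""
    if not 0.0 <= k_m_fmc_rf <= 1.0:
        raise ValueError("coverage must lie between 0 and 1")
    return (1.0 - k_m_fmc_rf) * lambda_m

lambda_m = 100e-9   # hypothetical: 100 FIT, in failures per hour
k_m_fmc_rf = 0.99   # hypothetical: 99% coverage by the primary SM
lambda_rf = residual_failure_rate(lambda_m, k_m_fmc_rf)
print(lambda_rf)
```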

PoF Calculation of Dual Point Failure

SM → M case

Equation (7) in [8] gives the probability of a DPF caused by "Logic" when "Monitor" is in a fault state (Monitor → Logic in our notation):

$$\Pr\{\text{subsystem gets a failure during time } (t, t + dt)\} = \Pr\{\text{Monitor is in a fault at time } t \,\cap\, \text{Logic gets a failure during time } (t, t + dt)\} \tag{7}$$

Here, we assume that the Monitor fault does not cause a dependent failure; that is, the event {Monitor is in a fault at time $t$} and the event {Logic gets a failure during time $(t, t + dt)$} are independent. To use the same notation as in the previous sections, we rewrite Logic as M, Monitor as SM, and Subsystem as SUBS. We rewrite (7) as follows:

$$\Pr\{\text{SUBS receives a failure in } (t, t + dt]\} = \Pr\{\text{SM is in a fault state at time } t\} \cdot \Pr\{\text{M receives a fault in } (t, t + dt]\} \tag{8}$$

Then, the left-hand side of (8) can be rewritten as

$$\Pr\{\text{SUBS receives a failure in } (t, t + dt]\} = \Pr\{t < X_{SUBS,SM \to M} \le t + dt\} = f_{SUBS,SM \to M}(t)\,dt \tag{9}$$

where

• $X_{SUBS,SM \to M}$: the random variable that represents the failure-free operating time of SUBS for SM → M; and
• $f_{SUBS,SM \to M}(t)$: the PDF of the failure of SUBS at time $t$ for SM → M.

The fault of SM must be a latent fault to cause the DPF; thus, the first term on the right-hand side of (8) (and of (7) in [8]) should be written as $\Pr\{\text{SM is in a latent fault state at time } t\}$, as explained in (3). This term can consequently be expressed using the following conditional probability:

$$\begin{aligned}
\Pr\{\text{SM is in a latent fault state at } t\} &= \Pr\{\text{SM is in a fault state at } t \,\cap\, \text{the fault of SM is not detected}\} \\
&= \Pr\{\text{the fault of SM is not detected} \mid \text{SM is in a fault state at } t\} \cdot \Pr\{\text{SM is in a fault state at } t\} \\
&= (1 - K_{SM,FMC,MPF}) \Pr\{X_{SM} \le t\} \\
&= (1 - K_{SM,FMC,MPF}) F_{SM}(t)
\end{aligned} \tag{10}$$

where

• $K_{SM,FMC,MPF}$: the failure mode coverage of SM with respect to multi-point faults (Part 10 8.3.3 of [1]);
• $X_{SM}$: the random variable that represents the failure-free operating time of SM; and
• $F_{SM}(t)$: the CDF of the failure of SM.

The second term on the right-hand side of (8) can be expressed as follows:

$$\Pr\{\text{M receives a fault in } (t, t + dt]\} = \Pr\{t < X_M \le t + dt\} = f_M(t)\,dt \tag{11}$$

where

• $f_M(t)$: the PDF of the failure of M.

Applying (3), (9), (10), and (11) to (8) yields the following expression:

$$dPoF_{SM \to M,t} = f_{SUBS,SM \to M}(t)\,dt = (1 - K_{SM,FMC,MPF}) F_{SM}(t) \cdot f_M(t)\,dt \tag{12}$$

Then, equation (12) produces the following integral form:

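$$PoF_{SM \to M,T_{lifetime}} = \int_0^{T_{lifetime}} (1 - K_{SM,FMC,MPF}) F_{SM}(t) f_M(t)\,dt \tag{13}$$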

Thus, applying (1) to (13) with the approximations $F_{SM}(t) \approx \lambda_{SM} t$ and $f_M(t) \approx \lambda_M$ yields the following expression:

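$$PoF_{SM \to M,T_{lifetime}} \approx \frac{1}{2} (1 - K_{SM,FMC,MPF}) \lambda_{SM} \lambda_M T_{lifetime}^2 = \frac{1}{2} \lambda_M \lambda_{SM,MPF,lat} T_{lifetime}^2 \tag{14}$$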

where

• $\lambda_{SM,MPF,lat}$: the failure rate of SM with respect to latent multi-point faults.
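The integration step can be checked numerically. The sketch below integrates the right-hand side of (12) over the lifetime with a simple midpoint rule and compares it with the small-rate approximation, one half of the latent fraction times both failure rates times the lifetime squared. All function names and numeric values are assumptions for illustration.

```python
import math

def latent_dpf_pof(lambda_m, lambda_sm, k_sm_fmc_mpf, lifetime, steps=10_000):
    """Midpoint-rule integral of (1 - K) * F_SM(t) * f_M(t) over (0, T]."""
    dt = lifetime / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        cdf_sm = -math.expm1(-lambda_sm * t)          # F_SM(t)
        pdf_m = lambda_m * math.exp(-lambda_m * t)    # f_M(t)
        total += cdf_sm * pdf_m * dt
    return (1.0 - k_sm_fmc_mpf) * total

def latent_dpf_pof_approx(lambda_m, lambda_sm, k_sm_fmc_mpf, lifetime):
    """Small-rate approximation: 0.5 * (1 - K) * lambda_SM * lambda_M * T^2."""
    return 0.5 * (1.0 - k_sm_fmc_mpf) * lambda_sm * lambda_m * lifetime ** 2

# Hypothetical parameters (rates in failures per hour, lifetime in hours)
lambda_m, lambda_sm, k_sm_fmc_mpf, t_lifetime = 1e-7, 2e-7, 0.9, 1.0e5

exact = latent_dpf_pof(lambda_m, lambda_sm, k_sm_fmc_mpf, t_lifetime)
approx = latent_dpf_pof_approx(lambda_m, lambda_sm, k_sm_fmc_mpf, t_lifetime)
print(exact, approx, (approx - exact) / approx)
```

With these values the approximation overestimates the integral by roughly one percent, which reflects the neglected higher-order terms of the Taylor expansion.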

We consequently obtain the PMHF of SM → M by applying (14) to (2) as follows:

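$$M_{PMHF,SM \to M} = \frac{1}{T_{lifetime}} PoF_{SM \to M,T_{lifetime}} \approx \frac{1}{2} \lambda_M \lambda_{SM,MPF,lat} T_{lifetime} \tag{15}$$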

According to [4], we can classify the failure scenario into the two categories listed as a) and b) in the Related Work section. Equation (15) corresponds to case a), so its left-hand side should be rewritten as $M_{PMHF,SM \to M,lat}$.

For case b), we assume a Markov process, i.e., that a fault of SM that is detected by SM2SM will be perfectly repaired (as good as new) and that the repair time will be ignored. Thus, we obtain (16) as

$$M_{PMHF,SM \to M,det} \approx \frac{1}{2} \lambda_M \lambda_{SM,MPF,det} \tau_{SM} \tag{16}$$

where

• $\lambda_{SM,MPF,det}$: the failure rate of SM with respect to detected multi-point faults.
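One way to read case b) is that the exposure to a detectable SM fault restarts at every diagnostic period. The sketch below sums the small-rate dual-point PoF over all periods and averages it over the lifetime. It assumes, beyond what the text states, that the lifetime is a whole multiple of the diagnostic period; names and values are hypothetical.

```python
import math

def detected_dpf_pmhf(lambda_m, lambda_sm_det, tau_sm, lifetime):
    """Lifetime-averaged dual-point PoF when SM is renewed every tau_sm."""
    # Assumption: the lifetime is a whole number of diagnostic periods
    n_intervals = round(lifetime / tau_sm)
    # Within one period, exposure grows as in the latent case but over
    # tau_sm instead of the lifetime: 0.5 * lambda_M * lambda_det * tau^2
    pof_per_interval = 0.5 * lambda_m * lambda_sm_det * tau_sm ** 2
    return n_intervals * pof_per_interval / lifetime

# Hypothetical parameters (rates per hour, times in hours)
lambda_m = 1e-7
lambda_sm_det = 0.9 * 5e-8   # detected part of a 50 FIT safety mechanism
tau_sm = 10.0
t_lifetime = 1.0e5

value = detected_dpf_pmhf(lambda_m, lambda_sm_det, tau_sm, t_lifetime)
closed_form = 0.5 * lambda_m * lambda_sm_det * tau_sm
print(value, closed_form)
```

The sum collapses to one half of the product of the two failure rates and the diagnostic period, so the lifetime drops out of the detected-fault contribution.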

Therefore, combining (15) and (16) to consider both cases yields the following expression:

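$$M_{PMHF,SM \to M} = M_{PMHF,SM \to M,lat} + M_{PMHF,SM \to M,det} \approx \frac{1}{2} \lambda_M \left( \lambda_{SM,MPF,lat} T_{lifetime} + \lambda_{SM,MPF,det} \tau_{SM} \right) \tag{17}$$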

Although Part 10 8.3.3 of [1] provides and describes only the equation for SM → M, it advises multiplying by two when both cases, "SM → M" and "M → SM," are considered. However, because this approach gives an inexact result, we derive the exact result in the next section.

M → SM case

A typical example of a redundant subsystem is a body control module (BCM), which is a type of electronic control unit (ECU). It may include headlight drivers driven by a microcomputer, with backup hardware circuitry to maintain visibility if the microcomputer stops. Generalizing the subsystem described in the previous section, we assume the general redundant subsystem shown in Figure 2. In this situation, a latent fault exists even in the M → SM case. For convenience, we introduce the probability coefficient $K_{M,FMC,det}$, which will eventually be removed. It is defined in (18) and refers to the ratio of faults detected by the primary SM. This coefficient is 100% in a non-redundant subsystem and 0% in a redundant subsystem (e.g., 1oo2). Conversely, we define an intrinsic redundant subsystem as one whose $K_{M,FMC,det}$ is 0%. On the other hand, we refer to a subsystem that includes a dual-core lock-step (DCLS) processor as an extrinsic redundant subsystem: although it has redundant processor cores, its $K_{M,FMC,det}$ is 100%.

Figure 2: Block diagram of a redundant subsystem

We define $K_{M,FMC,det}$ as

$$K_{M,FMC,det} \equiv \Pr\{\text{a fault is detected by SM} \mid \text{M is in a prevention state at } t\} \tag{18}$$

We obtain the small dual-point PoF for the M β SM case using (3), (5), (11), and (18):

$$\begin{aligned}
dPoF_{M \to SM,t} &= \Pr\{\text{M is in a latent fault state at } t\} \cdot \Pr\{\text{SM receives a fault in } (t, t + dt]\} \\
&= \Pr\{\text{M is in a prevention state at } t \,\cap\, \text{the fault is not detected by SM}\} \cdot \Pr\{\text{SM receives a fault in } (t, t + dt]\} \\
&= \Pr\{\text{the fault is not detected by SM} \mid \text{M is in a prevention state at } t\} \cdot \Pr\{\text{M is in a prevention state at } t\} \cdot \Pr\{\text{SM receives a fault in } (t, t + dt]\} \\
&= (1 - K_{M,FMC,det}) K_{M,FMC,RF} F_M(t) f_{SM}(t)\,dt
\end{aligned} \tag{19}$$
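To see how the detection coefficient separates non-redundant from redundant subsystems, the sketch below averages (19) over the lifetime using the small-rate approximations (the CDF of M taken as its rate times time, the PDF of SM taken as its rate). It deliberately ignores any detection by SM2M, which is a simplifying assumption of this sketch; names and values are hypothetical.

```python
def m_to_sm_latent_term(lambda_m, lambda_sm, k_m_fmc_rf, k_m_fmc_det, lifetime):
    """Lifetime average of (19) with F_M(t) ~ lambda_M * t, f_SM(t) ~ lambda_SM.

    Assumption of this sketch: no detection by SM2M is taken into account.
    """
    # Fraction of M faults that are prevented but not detected by SM
    latent_fraction = (1.0 - k_m_fmc_det) * k_m_fmc_rf
    # Integral of lambda_M * t * lambda_SM over (0, T] is 0.5 * ... * T^2;
    # dividing by T gives the lifetime average
    return 0.5 * latent_fraction * lambda_m * lambda_sm * lifetime

# Hypothetical parameters (rates per hour, lifetime in hours)
lambda_m, lambda_sm, t_lifetime = 1e-7, 5e-8, 1.0e5

non_redundant = m_to_sm_latent_term(lambda_m, lambda_sm, 0.99, 1.0, t_lifetime)
intrinsic_redundant = m_to_sm_latent_term(lambda_m, lambda_sm, 1.0, 0.0, t_lifetime)
print(non_redundant, intrinsic_redundant)
```

When the detection coefficient is 1, the term vanishes, so only the SM → M order contributes; when it is 0, every prevented fault of M stays latent until SM fails.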

Applying the same argument as for cases a) and b) in the Related Work section, we obtain the following expression:

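$$M_{PMHF,M \to SM} \approx \frac{1}{2} (1 - K_{M,FMC,det}) K_{M,FMC,RF} \lambda_M \lambda_{SM} \left[ (1 - K_{M,FMC,MPF}) T_{lifetime} + K_{M,FMC,MPF} \tau_M \right] \tag{20}$$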

Adding (6), (17), and (20), we finally obtain the following generalized equation:

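$$\begin{aligned}
M_{PMHF} \approx{}& (1 - K_{M,FMC,RF}) \lambda_M \\
&+ \frac{1}{2} \lambda_M \lambda_{SM} \left[ (1 - K_{SM,FMC,MPF}) T_{lifetime} + K_{SM,FMC,MPF} \tau_{SM} \right] \\
&+ \frac{1}{2} (1 - K_{M,FMC,det}) K_{M,FMC,RF} \lambda_M \lambda_{SM} \left[ (1 - K_{M,FMC,MPF}) T_{lifetime} + K_{M,FMC,MPF} \tau_M \right]
\end{aligned} \tag{21}$$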

Applying the PMHF Equation to Non-Redundant and Redundant Subsystems

Non-Redundant Subsystem

For an item with non-redundant and redundant subsystems, we can apply (21) to both types of subsystem. In a non-redundant subsystem, because both fault detection and prevention of the safety goal violation are performed by SM (the primary safety mechanism), $K_{M,FMC,det}$ equals 1 (100%). Applying this relationship to (21) yields the following expression:

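$$M_{PMHF} \approx \lambda_{RF} + \frac{1}{2} \lambda_M \left( \lambda_{SM,MPF,lat} T_{lifetime} + \lambda_{SM,MPF,det} \tau_{SM} \right) \tag{22}$$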

This is the same formula that is described in Part 10 8.3.3 of [1].
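For a non-redundant subsystem, the result is the SPF term of (6) plus the two SM → M terms for latent and detected faults of SM. The sketch below evaluates that sum. It assumes that the latent and detected failure rates of SM are the uncovered and covered fractions of its failure rate with respect to SM2SM; the function name and all numeric values are hypothetical.

```python
import math

def pmhf_non_redundant(lambda_m, k_m_fmc_rf, lambda_sm, k_sm_fmc_mpf,
                       tau_sm, lifetime):
    """Residual term plus SM -> M dual-point terms (latent and detected)."""
    lambda_rf = (1.0 - k_m_fmc_rf) * lambda_m
    # Assumption: split of the SM failure rate by the SM2SM coverage
    lambda_sm_lat = (1.0 - k_sm_fmc_mpf) * lambda_sm
    lambda_sm_det = k_sm_fmc_mpf * lambda_sm
    dual_point = 0.5 * lambda_m * (lambda_sm_lat * lifetime
                                   + lambda_sm_det * tau_sm)
    return lambda_rf + dual_point

# Hypothetical parameters (rates per hour, times in hours)
pmhf = pmhf_non_redundant(
    lambda_m=1e-7, k_m_fmc_rf=0.99,
    lambda_sm=5e-8, k_sm_fmc_mpf=0.9,
    tau_sm=10.0, lifetime=1.0e5,
)
print(pmhf)
```

In this example the residual term dominates; the dual-point terms add only a few percent, which is typical when the coverage of the primary safety mechanism is the limiting factor.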

Intrinsic Redundant Subsystem

Because an intrinsic redundant subsystem does not have a residual fault, we assume that $K_{M,FMC,det} = 0$ and $K_{M,FMC,RF} = 1$ (100%). Applying these relationships to (21) generates the following expression:

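$$M_{PMHF} \approx \frac{1}{2} \lambda_M \lambda_{SM} \left[ (2 - K_{SM,FMC,MPF} - K_{M,FMC,MPF}) T_{lifetime} + K_{SM,FMC,MPF} \tau_{SM} + K_{M,FMC,MPF} \tau_M \right] \tag{23}$$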

Conclusion and Future Work

In this study, we have presented generalized formulas for calculating the PMHF of non-redundant and redundant subsystems, expanding the scope of application under ISO 26262. The formulas use observable parameters: the failure rates of a mission function and a safety mechanism, the diagnostic coverages of the primary and secondary safety mechanisms, and the diagnostic periods of the secondary safety mechanisms. Because the PMHF of an item can be quantitatively calculated using FTA, we plan to prepare FTA models based on the methodology described in this paper.

References

1. ISO/TC 22/SC 3, “ISO 26262-5:2011(E),” ISO, 2011.
2. S. H. Jeon, J. H. Cho, Y. Jung, S. Park, and T. M. Han, “Automotive hardware development according to ISO 26262,” in 13th Int. Conf. Advanced Commun. Technol. (ICACT2011), Seoul, 2011, pp. 588–592.
3. N. Adler, S. Otten, M. Mohrhard, and K. D. Müller-Glaser, “Rapid safety evaluation of hardware architectural designs compliant with ISO 26262,” in Int. Symp. Rapid Syst. Prototyping (RSP), Montreal, QC, 2013, pp. 66–72.
4. N. Das and W. Taylor, “Quantified fault tree techniques for calculating hardware fault metrics according to ISO 26262,” in IEEE Symp. Product Compliance Eng. (ISPCE), Anaheim, CA, 2016, pp. 1–8.
5. A. Birolini, Quality and Reliability of Technical Systems: Theory, Practice, Management, 2nd Edition, Springer, 1997, p. 365.
6. V. Rupanov, C. Buckl, L. Fiege, M. Armbruster, A. Knoll, and G. Spiegelberg, “Early safety evaluation of design decisions in E/E architecture according to ISO 26262,” in Proc. 3rd Int. ACM SIGSOFT Symp. Architecting Critical Syst., Bertinoro, Italy, 2012, pp. 1–10.
7. K. L. Leu, H. Huang, Y. Y. Chen, L. R. Huang, and K. M. Ji, “An intelligent brake-by-wire system design and analysis in accordance with ISO-26262 functional safety standard,” in Int. Conf. Connected Veh. Expo (ICCVE), Shenzhen, 2015, pp. 150–156.
8. M. Takeichi, Y. Sato, K. Suyama, and T. Kawahara, “Failure rate calculation with priority FTA method for functional safety of complex automotive subsystems,” in Int. Conf. Quality, Rel., Risk, Maintenance, Safety Eng., Xi’an, 2011, pp. 55–58.

Atsushi Sakurai is the Chief Executive Officer and Chief Technology Officer of FS-Micro Corporation. He can be reached at sakurai@fs-micro.com.