Generalized Formula for the Calculation of a Probabilistic Metric for Random Hardware Failures in Redundant Systems

Editor’s Note—The paper on which this article is based was originally presented at the 2017 IEEE Product Safety Engineering Society Symposium, where it received recognition as the Best Symposium Paper. It is reprinted here, with permission, from the proceedings of the 2017 IEEE Product Safety Engineering Society International Symposium on Product Compliance Engineering. Copyright 2017 IEEE.

Since its introduction in 2011, International Organization for Standardization (ISO) standard 26262 [1], a functional safety standard for automotive electrical and/or electronic (E/E) systems, has effectively mitigated two categories of failure. The first category is systematic failures, which are prevented by safety measures in the verification and validation development processes. The second category is safety issues caused by hardware failures, which are prevented by safety mechanisms. These safety measures provide logical and quantitative assurance that is not achievable by traditional quality management (QM) strategies.

ISO 26262 defines the probabilistic metric for random hardware failures (PMHF), which is a metric related to the probability of a safety goal violation caused by a random hardware failure, and the architecture metrics that are discussed in papers such as [2] and [3].


Target Values for Hardware

The architecture metrics provide average diagnostic coverages that are relative values [2]. On the other hand, the PMHF is an absolute value of the average failure rate of an item. Although some formulas are shown in Part 10 8.3.3 of [1], a detailed explanation of the PMHF is not provided [4].

Before we start the PMHF discussion, we focus on the definition of the probability of failure (PoF) and related equations shown in [5] as follows:

$$ PoF_{item,t} \equiv \Pr\{X_{item} \le t\} = F_{item}(t) = 1 - e^{-\lambda_{item} t}, \quad f_{item}(t) = \lambda_{item} e^{-\lambda_{item} t} \tag{1} $$

where

  • 𝑃𝑜𝐹𝑖𝑡𝑒𝑚,𝑡: the probability of failure of an item until time t;
  • 𝑋𝑖𝑡𝑒𝑚: the random variable that represents the failure-free operating time of an item, and has an exponential distribution;
  • 𝐹𝑖𝑡𝑒𝑚(𝑡): the un-reliability, or the cumulative distribution function (CDF) in terms of the failure of an item until time t;
  • 𝜆𝑖𝑡𝑒𝑚: the failure rate of an item; and
  • 𝑓𝑖𝑡𝑒𝑚(𝑡): the probability density function (PDF) in terms of the failure of an item at time t.
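
As a minimal sketch of the quantities defined in (1), the following Python functions evaluate the unreliability and the failure density of an exponentially distributed failure time. The function names and the numeric values are assumptions chosen for illustration; they do not come from the paper.

```python
import math


def unreliability(failure_rate: float, t: float) -> float:
    """F(t) = 1 - exp(-lambda * t): probability of failure until time t."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-failure_rate * t)


def failure_density(failure_rate: float, t: float) -> float:
    """f(t) = lambda * exp(-lambda * t): failure density at time t."""
    return failure_rate * math.exp(-failure_rate * t)


# Illustrative values only (assumed): 100 FIT expressed in 1/h, 10,000 h.
LAMBDA_ITEM = 100e-9
T_HOURS = 1.0e4

pof = unreliability(LAMBDA_ITEM, T_HOURS)
density = failure_density(LAMBDA_ITEM, T_HOURS)
print(f"PoF until t: {pof:.6e}, density at t: {density:.6e}")
```

Using `math.expm1` instead of `1 - math.exp(...)` avoids cancellation when the product of failure rate and time is very small, which is the regime assumed throughout this discussion.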

The PMHF is related to the probability of a violation of a safety goal due to a random hardware failure within a vehicle lifetime. It is expressed as an average PoF within a vehicle lifetime, as in (2), using (1) and the Taylor expansion of the exponential function under the assumption 𝜆𝑖𝑡𝑒𝑚𝑇𝑙𝑖𝑓𝑒𝑡𝑖𝑚𝑒 ≪ 1, which is applied throughout the discussion:

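$$ M_{PMHF} \equiv \frac{1}{T_{lifetime}} PoF_{item,T_{lifetime}} = \frac{1}{T_{lifetime}} \left(1 - e^{-\lambda_{item} T_{lifetime}}\right) \approx \lambda_{item} \tag{2} $$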

where

  • 𝑀𝑃𝑀𝐻𝐹: the PMHF of an item; and
  • 𝑇𝑙𝑖𝑓𝑒𝑡𝑖𝑚𝑒: the vehicle lifetime.

The PMHF can eventually be considered as an average failure rate of an item based on (2).
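
The claim that the lifetime-averaged PoF reduces to a failure rate can be checked numerically. The sketch below reads "average PoF within a vehicle lifetime" as F(T_lifetime) / T_lifetime, which is an interpretation of the text, and uses assumed numbers for the failure rate and lifetime.

```python
import math


def pmhf_average(failure_rate: float, lifetime: float) -> float:
    """Lifetime-averaged PoF: (1 - exp(-lambda * T)) / T."""
    return -math.expm1(-failure_rate * lifetime) / lifetime


# Assumed illustrative values: 100 FIT in 1/h and a 10,000 h lifetime,
# so that lambda * T = 1e-3, which is much smaller than 1.
lam = 1.0e-7
lifetime = 1.0e4

approx = pmhf_average(lam, lifetime)
# First-order Taylor expansion predicts a relative deviation near lam * T / 2.
rel_error = abs(approx - lam) / lam
print(f"average PoF per hour: {approx:.6e}, relative deviation: {rel_error:.3e}")
```

The relative deviation from the plain failure rate is on the order of half the product of failure rate and lifetime, so the approximation tightens as that product shrinks.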


Related Work

Although detailed methods regarding the calculation of a PMHF are not available in the literature, some papers focus on the PMHF metric ([3], [6], and [7]). The authors of [4] derive the PMHF using an expression of PoF based on a probability calculation tool, such as fault tree analysis (FTA). Although each PoF of the basic events can be calculated, we believe that the use of observable parameters, such as failure rates and diagnostic coverages (DCs), which are employed in Part 10 8.3.3 of [1], is more effective for improving the understanding of these parameters and creating an FTA model.

In [4], the authors discriminate between the following two cases with respect to the latent fault of a safety mechanism:

  a) failures of a safety mechanism that are “latent” (e.g., not detectable); and
  b) failures of a safety mechanism that are detectable but occur within the diagnostic test interval.

However, unlike the discussion in [8], the discussion in [4] is not based on conditional probability. Although the authors of [8] introduce conditional probability, their latent fault calculation has some issues, which are detailed in the subsection “PoF Calculation of Dual Point Failure.” The opposite situation, in which the mission function is in the latent fault state, is not addressed. This paper aims to clarify that the dual point failure (DPF) should be the failure caused by the second fault when the first fault is a latent fault, where the first faulty element may be a mission function or a safety mechanism.


Target Subsystem

Figure 1 shows the target subsystem, named “SUBS.” SUBS includes a mission function “M” and a safety mechanism “SM” that supervises M; SM is the primary safety mechanism.¹ The notations M and SM are from Part 10 8.3.3 of [1]. Two secondary safety mechanisms² exist, one each for M and SM: “SM2M” and “SM2SM.” Here, SM and SM2M can be the same element. We can observe three parameters, namely failure rate (λ), diagnostic coverage (𝐷𝐶), and diagnostic period (𝜏), for each element according to Part 10 8.1.7 of [1], as shown in Figure 1. With respect to the element M, let λ be 𝜆𝑀; 𝐷𝐶 and 𝜏 do not exist because the mission function does not have diagnostic capability.

¹ Primary safety mechanism: a safety mechanism that prevents faults from violating a safety goal.

² Secondary safety mechanism: a safety mechanism that prevents latent faults.

Figure 1: Target subsystem “SUBS”

 

With respect to the element SM, let λ be 𝜆𝑆𝑀, 𝐷𝐶 be 𝐾𝑀,𝐹𝑀𝐶,𝑅𝐹, and 𝜏 be zero because SM is working within the fault tolerant time interval (FTTI). In terms of SM2M, let λ be zero (the lemma that proves this follows), 𝐷𝐶 be 𝐾𝑀,𝐹𝑀𝐶,𝑀𝑃𝐹, and 𝜏 be 𝜏𝑀. For SM2SM, let λ be zero, 𝐷𝐶 be 𝐾𝑆𝑀,𝐹𝑀𝐶,𝑀𝑃𝐹, and 𝜏 be 𝜏𝑆𝑀 according to Part 10 8.3.3 of [1].
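
The parameter assignment described above can be summarized in a small data structure. The following Python sketch is only one possible representation: the class name, field names, and all numeric values are hypothetical placeholders, while the pattern of which parameter is zero or non-existent follows the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    """Observable parameters of one SUBS element (illustrative naming)."""
    failure_rate: float                   # lambda, in 1/h
    diagnostic_coverage: Optional[float]  # DC in [0, 1]; None if non-existent
    diagnostic_period: Optional[float]    # tau, in h; None if non-existent


# Placeholder numbers, assumed only for illustration.
LAMBDA_M, LAMBDA_SM = 100e-9, 50e-9
K_M_FMC_RF, K_M_FMC_MPF, K_SM_FMC_MPF = 0.99, 0.90, 0.90
TAU_M, TAU_SM = 8.0, 8.0

SUBS = {
    # Mission function: no diagnostic capability of its own.
    "M": Element(LAMBDA_M, None, None),
    # Primary safety mechanism: works within the FTTI, so tau is zero.
    "SM": Element(LAMBDA_SM, K_M_FMC_RF, 0.0),
    # Secondary safety mechanisms: failure rate taken as zero.
    "SM2M": Element(0.0, K_M_FMC_MPF, TAU_M),
    "SM2SM": Element(0.0, K_SM_FMC_MPF, TAU_SM),
}

for name, element in SUBS.items():
    print(name, element)
```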

Lemma – Secondary Safety Mechanisms Never Fail

Before we begin to prove that the failure rate of a secondary safety mechanism is zero, we introduce the notation “A ⇒ B,” which indicates that element “B” receives a fault when element “A” is in a latent fault state. The small dual-point PoF for SM ⇒ M in the period (t, t + dt] can be defined as follows:

$$ \Delta PoF_{SM \Rightarrow M,t} \equiv \Pr\{\text{SM is in a latent fault state at } t \cap \text{M receives a fault in } (t, t+dt]\} \tag{3} $$

We do not have to consider the fault of SM2M (the secondary safety mechanism for M) in this case because it is considered after time 𝑡 + 𝑑𝑡. When we consider the two cases in which SM2SM (the secondary safety mechanism for SM) is in a fault or non-fault state, ∆𝑃𝑜𝐹𝑆𝑀⇒𝑀,𝑡, as shown in (3), can be expressed as follows:

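$$\begin{aligned}
\Delta PoF_{SM \Rightarrow M,t} ={} & \Pr\{\text{SM2SM is in a fault state at } t \cap \text{SM is in a latent fault state at } t \cap \text{M receives a fault in } (t, t+dt]\} \\
& + \Pr\{\text{SM2SM is not in a fault state at } t \cap \text{SM is in a latent fault state at } t \cap \text{M receives a fault in } (t, t+dt]\}
\end{aligned} \tag{4}$$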

The first term in (4) assumes that faults occur in three elements (SM2SM, SM, and M); a failure caused by three or more faults is classified as a safe fault according to [1]. As a result, we can calculate the PMHF by considering only the second term of (4), that is, by assuming that SM2SM is never in a fault state. The same scenario is valid for SM2M when M ⇒ SM; thus, we assume that both secondary safety mechanisms are never in fault states.


Calculation of the PMHF

PoF Calculation of Single Point Failure

The PMHF shall be expressed as the sum of a single point failure (SPF) term and DPF terms because the failure that is caused by three or more faults is classified as a safe fault, as previously explained. Because the fault of a safety mechanism does not cause a violation of a safety goal by definition, we obtain 𝑃𝑜𝐹𝑆𝑃𝐹 based on the conditional probability as

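$$\begin{aligned}
PoF_{SPF,t} &= \Pr\{\text{M is in a non-prevention state at } t\} \\
&= \Pr\{\text{M is in a fault state at } t \cap \text{the fault is not prevented by SM}\} \\
&= \Pr\{\text{the fault is not prevented} \mid \text{M is in a fault state at } t\} \cdot \Pr\{\text{M is in a fault state at } t\} \\
&= (1 - K_{M,FMC,RF}) \Pr\{X_M \le t\} \\
&= (1 - K_{M,FMC,RF}) F_M(t)
\end{aligned} \tag{5}$$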

where

  • 𝐾𝑀,𝐹𝑀𝐶,𝑅𝐹: the failure mode coverage of M with respect to residual faults (Part 10 8.3.3 of [1]);
  • 𝑋𝑀: the random variable that represents the failure-free operating time of M; and
  • 𝐹𝑀(𝑡): the CDF in terms of the failure of M.

Therefore, applying (5) to (2), we obtain (6):

𝑀𝑃𝑀𝐻𝐹,𝑆𝑃𝐹 = (1 − 𝐾𝑀,𝐹𝑀𝐶,𝑅𝐹)𝜆𝑀 = 𝜆𝑅𝐹  (6)

where

𝜆𝑅𝐹: the residual failure rate of M.
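
As a worked instance of (6), the short Python sketch below computes the single point failure contribution from the failure rate of M and its residual-fault coverage. The function name and the example numbers (100 FIT and 99% coverage) are assumptions made for illustration.

```python
import math


def pmhf_spf(lambda_m: float, k_m_fmc_rf: float) -> float:
    """Residual failure rate of M: (1 - K_M,FMC,RF) * lambda_M, as in (6)."""
    if not 0.0 <= k_m_fmc_rf <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    return (1.0 - k_m_fmc_rf) * lambda_m


# Assumed example: lambda_M = 100 FIT (1e-7 per hour), coverage of 99%.
lambda_rf = pmhf_spf(100e-9, 0.99)
print(f"lambda_RF = {lambda_rf:.3e} per hour")  # about 1 FIT
```

With these assumed inputs, the residual failure rate is about 1 FIT, that is, 1% of the failure rate of M.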

PoF Calculation of Dual Point Failure

SM ⇒ M case

In equation (7) of [8], the authors calculate the probability of a DPF caused by “Logic” when “Monitor” is in a fault state (𝑀𝑜𝑛𝑖𝑡𝑜𝑟 ⇒ 𝐿𝑜𝑔𝑖𝑐 in our notation):

$$ \Pr\{\text{subsystem gets a failure during time } (t, t+dt)\} = \Pr\{\text{Monitor is in a fault at time } t \cap \text{Logic gets a failure during time } (t, t+dt)\} \tag{7} $$

Here, we assume that the Monitor fault does not cause a dependent failure; that is, the event {Monitor is in a fault at time 𝑡} and the event {Logic gets a failure during time (𝑡, 𝑡 + 𝑑𝑡)} are independent. To match the notation of the previous sections, we rewrite Logic as M, Monitor as SM, and Subsystem as SUBS. We rewrite (7) as follows:

Pr{SUBS receives a failure in (𝑡, 𝑡 + 𝑑𝑡] } = Pr {SM is in a fault state at time 𝑡} ∙ Pr {M receives a fault in (𝑡, 𝑡 + 𝑑𝑡]}   (8)

Then, the left-hand side of (8) can be rewritten as

$$ \Pr\{\text{SUBS receives a failure in } (t, t+dt]\} = \Pr\{t < X_{SUBS,SM \Rightarrow M} \le t+dt\} = f_{SUBS,SM \Rightarrow M}(t)\,dt \tag{9} $$

where

  • 𝑋𝑆𝑈𝐵𝑆,𝑆𝑀⇒𝑀: the random variable that represents the failure-free operating time of SUBS for SM ⇒ M; and
  • 𝑓𝑆𝑈𝐵𝑆,𝑆𝑀⇒𝑀(𝑡): the PDF in terms of the failure of SUBS at time 𝑡 for SM ⇒ M.

The fault of SM must be a latent fault to cause the DPF; thus, the first term on the right-hand side of (8) (and of (7) in [8]) should be written as Pr{SM is in a 𝑙𝑎𝑡𝑒𝑛𝑡 fault state at time 𝑡}, as explained in (3). This term can consequently be expressed using the following conditional probability:

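$$\begin{aligned}
\Pr\{\text{SM is in a latent fault state at } t\} &= \Pr\{\text{SM is in a fault state at } t \cap \text{the fault of SM is not detected}\} \\
&= \Pr\{\text{the fault of SM is not detected} \mid \text{SM is in a fault state at } t\} \cdot \Pr\{\text{SM is in a fault state at } t\} \\
&= (1 - K_{SM,FMC,MPF}) \Pr\{X_{SM} \le t\} \\
&= (1 - K_{SM,FMC,MPF}) F_{SM}(t)
\end{aligned} \tag{10}$$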

where

  • 𝐾𝑆𝑀,𝐹𝑀𝐶,𝑀𝑃𝐹: the failure mode coverage of SM with respect to multi-point faults (Part 10 8.3.3 of [1]);
  • 𝑋𝑆𝑀: the random variable that represents the failure-free operating time of SM; and
  • 𝐹𝑆𝑀(𝑡): the CDF in terms of the failure of SM.

The second term on the right-hand side of (8) can be expressed as follows:

$$ \Pr\{\text{M receives a fault in } (t, t+dt]\} = \Pr\{t < X_M \le t+dt\} = f_M(t)\,dt \tag{11} $$

where

  • 𝑓𝑀(𝑡): the PDF in terms of the failure of M.

Applying (3), (9), (10), and (11) to (8) yields the following expression:

$$ \Delta PoF_{SM \Rightarrow M,t} = f_{SUBS,SM \Rightarrow M}(t)\,dt = (1 - K_{SM,FMC,MPF}) F_{SM}(t) \cdot f_M(t)\,dt \tag{12} $$

Then, equation (12) produces the following integral form:

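$$ PoF_{SM \Rightarrow M,T_{lifetime}} = \int_0^{T_{lifetime}} f_{SUBS,SM \Rightarrow M}(t)\,dt = (1 - K_{SM,FMC,MPF}) \int_0^{T_{lifetime}} F_{SM}(t) f_M(t)\,dt \tag{13} $$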

Thus, applying (1) to (13) produces the following expression:

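$$ PoF_{SM \Rightarrow M,T_{lifetime}} \approx (1 - K_{SM,FMC,MPF}) \int_0^{T_{lifetime}} \lambda_{SM} t \cdot \lambda_M \,dt = \frac{1}{2} (1 - K_{SM,FMC,MPF}) \lambda_{SM} \lambda_M T_{lifetime}^2 = \frac{1}{2} \lambda_M \lambda_{SM,MPF,lat} T_{lifetime}^2 \tag{14} $$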

where

  • 𝜆𝑆𝑀,𝑀𝑃𝐹,𝑙𝑎𝑡: the failure rate of SM with respect to latent multi-point faults.

We consequently obtain the PMHF of SM ⇒ M by applying (14) to (2) as follows:

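$$ M_{PMHF,SM \Rightarrow M} = \frac{1}{T_{lifetime}} PoF_{SM \Rightarrow M,T_{lifetime}} \approx \frac{1}{2} \lambda_M \lambda_{SM,MPF,lat} T_{lifetime} \tag{15} $$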
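
The integration of (12) over the vehicle lifetime can be checked numerically. The sketch below integrates (1 − K_SM,FMC,MPF) · F_SM(t) · f_M(t) with the midpoint rule and compares the result with the second-order expression ½ (1 − K_SM,FMC,MPF) λ_SM λ_M T², which follows from the small-λT assumption. The function names, the integration scheme, and all numeric values are assumptions made for illustration.

```python
import math


def dpf_latent_pof(lambda_m: float, lambda_sm: float, k_sm_fmc_mpf: float,
                   lifetime: float, steps: int = 10_000) -> float:
    """Midpoint-rule integral of (1 - K) * F_SM(t) * f_M(t) over (0, lifetime]."""
    dt = lifetime / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        cdf_sm = -math.expm1(-lambda_sm * t)           # F_SM(t)
        pdf_m = lambda_m * math.exp(-lambda_m * t)     # f_M(t)
        total += cdf_sm * pdf_m * dt
    return (1.0 - k_sm_fmc_mpf) * total


def dpf_latent_pof_approx(lambda_m: float, lambda_sm: float,
                          k_sm_fmc_mpf: float, lifetime: float) -> float:
    """Small-lambda*T approximation: 0.5 * (1 - K) * lambda_SM * lambda_M * T^2."""
    return 0.5 * (1.0 - k_sm_fmc_mpf) * lambda_sm * lambda_m * lifetime ** 2


# Assumed illustrative values (1/h and h).
LAM_M, LAM_SM, K_SM_MPF, T_LIFE = 1.0e-7, 2.0e-7, 0.9, 1.0e4

exact = dpf_latent_pof(LAM_M, LAM_SM, K_SM_MPF, T_LIFE)
approx = dpf_latent_pof_approx(LAM_M, LAM_SM, K_SM_MPF, T_LIFE)
# Dividing by the lifetime gives the averaged contribution per hour.
print(f"integrated: {exact:.6e}, approximation: {approx:.6e}")
print(f"per-hour contribution: {approx / T_LIFE:.3e}")
```

For these assumed inputs the two results differ by roughly a tenth of a percent, consistent with the neglected higher-order terms in λT.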

According to [4], we can classify the failure scenario into the two categories listed as a) and b) in the “Related Work” section. Equation (15) corresponds to case a) and should therefore be rewritten as 𝑀𝑃𝑀𝐻𝐹,𝑆𝑀⇒𝑀,𝑙𝑎𝑡.

For case b), we assume a Markov process, i.e., that a fault of SM that is detected by SM2SM will be perfectly repaired (as good as new) and that the repair time will be ignored. Thus, we obtain (16) as

  (16)

where

𝜆𝑆𝑀,𝑀𝑃𝐹,𝑑𝑒𝑡: the failure rate of SM with respect to the detected multi-point faults.

Therefore, combining (15) and (16) to consider both cases yields the following expression:

  (17)

Although Part 10 8.3.3 of [1] provides and describes only the equation for SM ⇒ M, it is advised to multiply the result by two when both cases, “SM ⇒ M” and “M ⇒ SM,” are considered. However, because this approach provides an inexact result, we derive the exact result in the next section.


M ⇒ SM case

A typical example of a redundant subsystem is a body control module (BCM), which is a type of electronic control unit (ECU). It may have circuitry that includes headlight drivers driven by a microcomputer, with backup circuitry hardware to maintain visibility if the microcomputer stops. According to the generalization explained in the previous section, we assume the general redundant subsystem shown in Figure 2. A latent fault exists in this situation even for the M ⇒ SM case. For convenience, we introduce the probability coefficient 𝐾𝑀,𝐹𝑀𝐶,𝑑𝑒𝑡, which will eventually be removed. It is defined in (18) and refers to the detection ratio by the primary SM. This coefficient is 100% in a non-redundant subsystem and 0% in a redundant subsystem (e.g., 1oo2). Conversely, we define an intrinsic redundant subsystem as one whose 𝐾𝑀,𝐹𝑀𝐶,𝑑𝑒𝑡 is 0%. On the other hand, we may refer to a subsystem that includes dual-core lock-step (DCLS) as an extrinsic redundant subsystem: although it has redundant processor cores, its 𝐾𝑀,𝐹𝑀𝐶,𝑑𝑒𝑡 is 100%.

Figure 2: Block diagram of a redundant subsystem

We define 𝐾𝑀,𝐹𝑀𝐶,𝑑𝑒𝑡 as

$$ K_{M,FMC,det} \equiv \Pr\{\text{a fault is detected by SM} \mid \text{M is in a prevention state at } t\} \tag{18} $$

We obtain the small dual-point PoF for the M ⇒ SM case using (3), (5), (11), and (18):

$$\begin{aligned}
\Delta PoF_{M \Rightarrow SM,t} &= \Pr\{\text{M is in a latent fault state at } t \cap \text{SM receives a fault in } (t, t+dt]\} \\
&= \Pr\{\text{M is in a prevention state at } t \cap \text{the fault is not detected by SM}\} \cdot \Pr\{\text{SM receives a fault in } (t, t+dt]\} \\
&= \Pr\{\text{the fault is not detected by SM} \mid \text{M is in a prevention state at } t\} \cdot \Pr\{\text{M is in a prevention state at } t\} \cdot \Pr\{\text{SM receives a fault in } (t, t+dt]\} \\
&= (1 - K_{M,FMC,det}) K_{M,FMC,RF} F_M(t) f_{SM}(t)\,dt
\end{aligned} \tag{19}$$

Applying the same argument to cases a) and b) of the “Related Work” section, we obtain the following expression:

  (20)

Adding (6), (17), and (20), we finally obtain the following generalized equation:

  (21)


Applying the PMHF Equation to Non-Redundant and Redundant Subsystems

Non-Redundant Subsystem

For an item with non-redundant and redundant subsystems, we can apply (21) to both subsystems. In a non-redundant subsystem, because the fault detection and prevention of the safety goal violation are performed by SM (primary safety mechanism), 𝐾𝑀,𝐹𝑀𝐶,𝑑𝑒𝑡 equals 1 (100%). Applying this relationship to (21) yields the following expression:

  (22)

This is the same formula that is described in Part 10 8.3.3 of [1].

Intrinsic Redundant Subsystem

Because an intrinsic redundant subsystem does not have a residual fault, we assume that 𝐾𝑀,𝐹𝑀𝐶,𝑑𝑒𝑡 = 0 and 𝐾𝑀,𝐹𝑀𝐶,𝑅𝐹 = 1 (100%). Applying these relationships to (21) generates the following expression:

  (23)


Conclusion and Future Work

In this study, we have presented generalized formulas for the calculation of the PMHF in non-redundant and redundant subsystems, expanding the scope of application according to ISO 26262. The formulas use observable parameters: the failure rates of a mission function and a safety mechanism, the diagnostic coverages of the primary and secondary safety mechanisms, and the diagnostic periods of the secondary safety mechanisms. Because the PMHF of an item can be quantitatively calculated using FTA, we plan to prepare FTA models based on the methodology described in this paper.


References

  1. ISO/TC 22/SC 3, “ISO 26262-5:2011(E),” ISO, 2011.
  2. S. H. Jeon, J. H. Cho, Y. Jung, S. Park, and T. M. Han, “Automotive hardware development according to ISO 26262,” in 13th Int. Conf. Advanced Commun. Technol. (ICACT2011), Seoul, 2011, pp. 588–592.
  3. N. Adler, S. Otten, M. Mohrhard, and K. D. Müller-Glaser, “Rapid safety evaluation of hardware architectural designs compliant with ISO 26262,” in Int. Symp. Rapid Syst. Prototyping (RSP), Montreal, QC, 2013, pp. 66–72.
  4. N. Das and W. Taylor, “Quantified fault tree techniques for calculating hardware fault metrics according to ISO 26262,” in IEEE Symp. Product Compliance Eng. (ISPCE), Anaheim, CA, 2016, pp. 1–8.
  5. A. Birolini, Quality and Reliability of Technical Systems: Theory, Practice, Management, 2nd ed., Springer, 1997, p. 365.
  6. V. Rupanov, C. Buckl, L. Fiege, M. Armbruster, A. Knoll, and G. Spiegelberg, “Early safety evaluation of design decisions in E/E architecture according to ISO 26262,” in Proc. 3rd Int. ACM SIGSOFT Symp. Architecting Critical Syst., Bertinoro, Italy, 2012, pp. 1–10.
  7. K. L. Leu, H. Huang, Y. Y. Chen, L. R. Huang, and K. M. Ji, “An intelligent brake-by-wire system design and analysis in accordance with ISO-26262 functional safety standard,” in Int. Conf. Connected Veh. Expo (ICCVE), Shenzhen, 2015, pp. 150–156.
  8. M. Takeichi, Y. Sato, K. Suyama, and T. Kawahara, “Failure rate calculation with priority FTA method for functional safety of complex automotive subsystems,” in Int. Conf. Quality, Rel., Risk, Maintenance, Safety Eng., Xi’an, 2011, pp. 55–58.

Atsushi Sakurai is the Chief Executive Officer and Chief Technology Officer of FS-Micro Corporation. He can be reached at sakurai@fs-micro.com.
