Implementing Robust Watchdog Timers for Embedded Systems

Design Principles, Configuration Strategies, and Fault Recovery Methods Using Watchdogs for Modern Systems

Watchdogs have long been a standard, if slightly esoteric, element of system design, often receiving only secondary consideration after the primary application has been planned out. At their core, they serve a straightforward purpose: providing a graceful means of recovery in the event of abnormal system behavior. Early watchdog timer architectures were simple: a dedicated application-specific integrated circuit (ASIC) positioned adjacent to the system’s processor (see Figure 1).

Figure 1: Basic ASIC watchdog architecture (top); basic internal watchdog architecture (bottom)

In that early form, interaction with the watchdog was typically limited to a simple general-purpose input/output (GPIO) pin. However, as systems have grown in complexity and adopted stricter safety requirements, watchdog implementations have also evolved. In more modern setups (see Figure 2), a watchdog may be integrated into the microcontroller itself, reside in a voltage monitor or supervisory device, or be part of a power management integrated circuit (PMIC), often refreshable through GPIO, I²C, or SPI.

Figure 2: Advanced watchdog architecture with power management device

Two major standards, ISO 26262 and IEC 61508, which outline functional safety requirements, are now driving the design for these watchdog architectures. This has led to additional complexity in an already mature device category, introducing new mandates such as:

  • The system watchdog must be able to operate independently, mitigating the risk of dependent failures between the device being protected and the device doing the protecting and
  • The system watchdog must be capable of accurately monitoring the timing of individual tasks and reporting a hung task to a higher-level application, all while remaining fault tolerant, allowing for maximum uptime.

With these new requirements, watchdogs may include advanced features such as challenge-response mechanisms and, in other system designs, multiple watchdogs that feed into a “master” watchdog to present an overall health status.

In the following sections, we’ll first trace the evolution of watchdog timers, starting with the basic refresh mechanism, moving on to windowed watchdogs, and finally examining challenge-response (Q&A) watchdogs, complete with real-world application examples. We’ll then delve into safety analyses, explore various system architectures showing how multiple watchdogs can coexist in a robust design, and ultimately highlight the key reporting features to look for when selecting a watchdog solution. This holistic approach will equip you with a comprehensive understanding of modern watchdog systems and how they’re meeting the demands of functional safety in increasingly complex, multicore environments.

Evolution of the Watchdog Timer

Before diving into the complexities of modern watchdogs designed for high-safety systems, meaning those requiring Automotive Safety Integrity Level (ASIL) B or higher under ISO 26262 or Safety Integrity Level (SIL) 2 or higher under IEC 61508, it helps to first examine the fundamental watchdog timer and its inner workings. This section begins with the simplest form of watchdog refresh, progresses through window-based watchdogs, and culminates in challenge-response (Q&A) watchdogs.

The Basic Watchdog Refresh Driven by Pin

In its classic incarnation, a watchdog timer is often implemented as a separate ASIC sitting alongside the main processor (see Figure 3).

Figure 3: Example internals of a basic WDT ASIC

Because it’s relatively straightforward, this type of watchdog can also be integrated directly into the system microcontroller (MCU). In its simplest form, it uses a GPIO pin (or similar) to detect if the system is “alive.” If a refresh signal isn’t received within a predefined interval, the watchdog times out and triggers a reset or error condition.

Key features of this basic watchdog include:

  • Edge-triggered input: Responds to rising edges, falling edges, or both
  • Programmable timeout: Defines how much time may pass between refresh events and
  • Reset delay (or grace period): Specifies the time from a missed refresh to an actual reset.

These parameters may be configured via special function registers (SFRs) if the watchdog is onboard the MCU, or through one-time programmable (OTP) memory if it’s an external device. Typically, engineers choose a maximum refresh interval that comfortably accommodates the longest atomic task, or that aligns with the system’s fault-tolerant time interval, plus some margin. While that broad coverage is helpful for catching major system stalls, a basic pin-driven watchdog does have drawbacks: it can’t determine whether individual tasks are running within their expected timeframes, and it detects only a complete system hang that results in a missed refresh.
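
The interplay between the programmable timeout and the reset delay can be sketched as a small behavioral model in C. This is only an illustration, not the register interface of any particular device: the type and function names (`basic_wdt_t`, `wdt_refresh`, `wdt_tick`) and the millisecond time base are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical software model of a basic pin-refreshed watchdog. */
typedef enum { WDT_OK, WDT_GRACE, WDT_RESET } wdt_state_t;

typedef struct {
    uint32_t timeout_ms;      /* maximum time allowed between refreshes */
    uint32_t reset_delay_ms;  /* grace period between timeout and reset */
    uint32_t elapsed_ms;      /* time since the last refresh */
} basic_wdt_t;

/* A valid edge on the refresh pin restarts the count. */
void wdt_refresh(basic_wdt_t *w)
{
    w->elapsed_ms = 0;
}

/* Advance the model by dt_ms and report the resulting state. */
wdt_state_t wdt_tick(basic_wdt_t *w, uint32_t dt_ms)
{
    w->elapsed_ms += dt_ms;
    if (w->elapsed_ms < w->timeout_ms)
        return WDT_OK;
    if (w->elapsed_ms < w->timeout_ms + w->reset_delay_ms)
        return WDT_GRACE;   /* timed out, reset not yet asserted */
    return WDT_RESET;
}
```

With a 100 ms timeout and a 20 ms reset delay, this model reports a timeout at 100 ms without a refresh and a reset at 120 ms. In practice the timeout would be set to the longest atomic task plus margin, as described above.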

To address the need for more precise timing checks, the next evolution introduces the window-based watchdog.

The Window-Based Watchdog

As systems grew more complex, the desire to pinpoint abnormal task durations led to the development of window-based watchdogs (WWDT, see Figure 4). These watchdogs monitor refresh signals within a defined “window” of acceptable timing. Conceptually, there are three critical calibratable time limits:

  1. Lower time limit: If the refresh comes early, the watchdog flags an error
  2. Upper time limit: If the refresh comes late, the watchdog flags an error and
  3. Terminal limit: Much like the basic watchdog’s timeout, this defines the point at which the system will be reset if no valid refresh is seen; this is often referred to as too late or non-responsive time limit.
Figure 4: Window watchdog window depiction
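
The three limits can be expressed as a simple classification of when a refresh arrives, measured from the start of the window. The sketch below is a minimal model under assumed names (`wwdt_limits_t`, `wwdt_classify`), and it assumes the limits are ordered lower, then upper, then terminal; real devices define their own boundary behavior.

```c
#include <stdint.h>

typedef enum {
    WWDT_EARLY,     /* refresh arrived before the lower limit */
    WWDT_VALID,     /* refresh arrived inside the window */
    WWDT_LATE,      /* refresh arrived after the upper limit */
    WWDT_TERMINAL   /* nothing arrived before the terminal limit: reset */
} wwdt_result_t;

typedef struct {
    uint32_t lower_ms;
    uint32_t upper_ms;
    uint32_t terminal_ms;
} wwdt_limits_t;

/* Classify a refresh seen t_ms after the window opened. */
wwdt_result_t wwdt_classify(const wwdt_limits_t *lim, uint32_t t_ms)
{
    if (t_ms >= lim->terminal_ms) return WWDT_TERMINAL;
    if (t_ms < lim->lower_ms)     return WWDT_EARLY;
    if (t_ms > lim->upper_ms)     return WWDT_LATE;
    return WWDT_VALID;
}
```

Whether a refresh landing exactly on a limit counts as valid is a device-specific detail; this model treats both window edges as inclusive.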

By imposing both lower- and upper-time constraints, a window watchdog offers two primary benefits:

  1. Granular fault tolerance: You can configure different potential responses based on whether the refresh was too early or too late, potentially allowing the system to log errors and continue running for minor timing violations and
  2. Monitoring task health: Some designs allow you to keep track of how often these individual limits are breached. A higher-level supervisor could read an “error count” register and spot if certain tasks are chronically missing their timing. Over time, this data helps diagnose performance bottlenecks or failing components.

User-defined features in a window-based watchdog typically include:

  • Setting the three time limits: Specified in base clock counts or milliseconds and
  • Defining an error tolerance: An “error accumulator” or similar mechanism (Figure 5) decides how many errors trigger a reset.
Figure 5: Fault accumulator depiction

By returning to correct timing within the defined window, the system can “self-heal” from minor hiccups while still triggering a reset for persistent or major faults (see Figure 6). This improves coverage in functionally safe designs and allows tighter timing constraints compared to the basic watchdog.

Figure 6: Flow of a window watchdog accumulating and clearing errors
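
One way to picture the accumulate-and-clear behavior is a counter that rises on each bad refresh, falls on each good one, and requests a reset once it reaches a threshold. The step sizes, the threshold, and the names below are assumptions chosen for illustration; actual devices may weight errors differently.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t count;       /* current accumulated fault count */
    uint8_t step_up;     /* added on each bad refresh */
    uint8_t step_down;   /* subtracted on each good refresh */
    uint8_t threshold;   /* reset is requested at or above this value */
} fault_acc_t;

/* Record one refresh event; returns true when a reset should be issued. */
bool fault_acc_update(fault_acc_t *a, bool refresh_ok)
{
    if (refresh_ok) {
        /* Self-heal: count down, but never below zero. */
        a->count = (a->count > a->step_down)
                       ? (uint8_t)(a->count - a->step_down) : 0;
    } else {
        /* Saturate rather than wrap around. */
        uint16_t next = (uint16_t)(a->count + a->step_up);
        a->count = (next > UINT8_MAX) ? UINT8_MAX : (uint8_t)next;
    }
    return a->count >= a->threshold;
}
```

Making the step up larger than the step down means a burst of errors reaches the threshold quickly, while an isolated error is forgiven after a few good refreshes.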

However, while the window watchdog catches tasks that run too long or finish too quickly, it still doesn’t confirm that each task is performing the correct operations. That gap is filled by a more advanced mechanism: the challenge-response, or Q&A, watchdog.

The QA Watchdog, 4QA and 16QA

For systems demanding the highest levels of safety, such as those requiring ASIL-D or SIL 4, engineers often use a challenge-response watchdog (also called a question-and-answer watchdog). This design (see Figure 7) builds upon the window concept by actively verifying that the MCU or SoC is not only responding within the correct timeframe but also providing the correct response to a given challenge.

Figure 7: Example of the process a multiple challenge-response watchdog follows

In this scheme, the watchdog issues a “token” or “challenge” that the monitored device (MCU or SoC) must process before returning the result, either by doing arithmetic or applying a bitwise transformation. By expecting a specific and sequential response, the watchdog effectively tests whether the system is running the right code in the right order. Below are three common variants:

  • 4-question-and-answer (4QA): The watchdog provides four sequential challenges (often simple operations on an 8-bit value), and each must be answered correctly in the right time window (see Figure 8).
  • 16-question-and-answer (16QA): The watchdog uses a seed token that defines the next four responses. Additional seeds can chain together to create longer challenge sequences (see Figure 9). This allows for more in-depth program-flow monitoring across multiple tasks.
  • LFSR (linear feedback shift register)–based: A polynomial is used to generate a pseudo-random challenge (Figure 10). The monitored device must compute the correct response for each step in the sequence. This approach can create a large number of possible challenge-response pairs, further increasing system robustness.
Figure 8: Example of a 4QA response table and sequence

Figure 9: Example of a 16QA response table and sequence

Figure 10: Digital representation of a watchdog implemented as an LFSR polynomial, along with the expected sequential response from the processor
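
To make the LFSR variant concrete, the sketch below models both sides of the exchange in C. Several details are assumptions for illustration only: the 8-bit Galois LFSR with tap mask 0xB8, the answer rule (bitwise inversion of the token), and the choice to keep the same token until it is answered correctly. A real device documents its own polynomial and answer table.

```c
#include <stdbool.h>
#include <stdint.h>

#define LFSR_TAPS 0xB8u  /* x^8 + x^6 + x^5 + x^4 + 1, illustrative choice */

/* Advance an 8-bit Galois LFSR by one step; the state must be nonzero. */
uint8_t lfsr_next(uint8_t state)
{
    uint8_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= LFSR_TAPS;
    return state;
}

/* Assumed answer rule shared by both sides: invert the token. */
uint8_t qa_expected_answer(uint8_t token)
{
    return (uint8_t)~token;
}

typedef struct {
    uint8_t token;    /* challenge currently presented to the MCU */
    uint8_t errors;   /* count of wrong answers */
} qa_wdt_t;

/* Watchdog side: check an answer; a new challenge follows only on success. */
bool qa_wdt_answer(qa_wdt_t *w, uint8_t answer)
{
    if (answer != qa_expected_answer(w->token)) {
        w->errors++;
        return false;
    }
    w->token = lfsr_next(w->token);
    return true;
}
```

On the MCU side, firmware would read the token over I²C or SPI, compute `qa_expected_answer(token)`, and write the result back within the timing window. The error count could feed an accumulator instead of triggering a reset directly.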

Typically implemented in advanced ASICs, voltage supervisors, or PMICs, these watchdogs have moved away from simple pin refreshes, instead relying on I²C, SPI, or dedicated serial interfaces to send and receive challenge tokens. This more sophisticated interaction allows for:

  • Program flow verification: Ensuring tasks and subroutines execute in the proper sequence and
  • Advanced error accounting: An error accumulator or register can keep track of how many challenges were missed or answered incorrectly, letting the system react gracefully to minor issues while still triggering a hard reset when conditions warrant it.
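
Program flow verification is often built on a running signature that each task updates at a checkpoint, so that the final answer is correct only if every task ran in the expected order. The following is a minimal sketch of that idea; the checkpoint IDs, the rotate-and-XOR update, and the way the signature is combined with the token are all assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical checkpoint IDs, one per monitored task. */
enum { CP_SENSE = 0x11, CP_COMPUTE = 0x25, CP_ACTUATE = 0x3C };

/* Fold a checkpoint into the running signature (order-sensitive). */
uint8_t flow_checkpoint(uint8_t sig, uint8_t id)
{
    /* Rotate left by one bit, then XOR in the checkpoint ID. */
    sig = (uint8_t)((sig << 1) | (sig >> 7));
    return (uint8_t)(sig ^ id);
}

/* Combine the challenge token with the signature to form the answer. */
uint8_t flow_answer(uint8_t token, uint8_t sig)
{
    return (uint8_t)(token ^ sig);
}
```

Starting from a signature of zero, calling `flow_checkpoint` for sense, compute, and actuate in that order yields 0x32; skipping or reordering a task yields a different value, so the watchdog receives a wrong answer.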

Each type of watchdog (basic pin-driven, window-based, and challenge-response) has its place in the design of reliable, safe, and functionally robust systems. Selecting the right one depends on your application’s criticality, performance requirements, and tolerance for fault conditions. Understanding where your system stands in terms of safety (e.g., ASIL-B vs. ASIL-D, SIL-2 vs. SIL-4) is often the first step in deciding which watchdog functionality you’ll need. Now that the function of these different watchdog timers has been investigated, the next step is focusing on their proper implementation into your system architecture.

System Requirements 

When choosing a watchdog to integrate into an application, it is essential to identify the specific requirements driving its selection. Such requirements often include understanding the types of faults the watchdog is expected to detect, the hardware or software resources the watchdog might share with the device it protects, and the nature of the reporting mechanisms the system must support. All of these considerations can be distilled into four major areas:

  • The safety analysis of relevant fault types
  • The avoidance of undesirable system dependencies
  • The proper integration of multiple watchdogs (if necessary) and
  • The selection of reporting features that align with the application’s diagnostic needs

Watchdog and Functional Safety

At its core, a watchdog must provide a graceful means of recovery to the system if the application stops responding. This task generally protects against two principal failure modes.

  • The first relates to random hardware faults, such as an oscillator becoming stuck or running at an incorrect frequency or issues arising from miscounted edges on a communication interface like I²C or SPI.
  • The second concerns systematic failures, where tasks fail to execute correctly because of software defects—examples include memory corruption due to an errant pointer or semaphore mismanagement that leaves a task waiting indefinitely.

One common example of a random hardware fault involves an I²C or SPI interface that becomes stuck because unexpected clock pulses are generated in the presence of high-power transients or noise (see Figure 11). In a system requiring low-voltage interfacing, transient electrical noise can shift signal or reference levels past VIH/VIL limits, causing the digital state machine inside an ASIC or MCU to miscount edges and wait indefinitely.

Figure 11: An example of a stuck I²C line, halting system processing

A widespread systematic issue, usually caught in design reviews, involves pointers that exceed array bounds or function pointers that jump to invalid memory locations, leaving the program “off in the weeds.”

Although these two categories illustrate typical random and systematic errors, a thorough failure modes, effects, and diagnostic analysis (FMEDA) is recommended to examine the system’s fault modes. The diagnostic analysis portion of this activity often identifies coverage the watchdog can provide to a specific subsystem. For example, compared with a simple window watchdog, a challenge-response watchdog can address more sophisticated problems by verifying correct arithmetic operations, monitoring individual cores in a multicore architecture, and detecting clock-frequency irregularities or hardware communication issues. In general, the watchdog is expected to catch single-point faults that might otherwise violate the system’s safety goals.

Achieving this level of protection requires that the watchdog remain free from dependencies that could prevent it from detecting application errors. A dependent failure analysis (DFA) helps uncover situations where the watchdog relies on shared resources or design elements that might also be subject to failure. In Figure 12, the core oscillator and clock divider provide one clock domain to the peripherals, core, and watchdog timer. Additionally, all system memory is shared, and everything is powered from a single power domain. A failure in any one of these areas would prevent the WDT from functioning properly, in addition to impairing the device’s ability to achieve its safety goal.

Figure 12: A watchdog diagram, full of dependencies (colored in dark yellow). Clock, memory/registers, and VCC are all identified here.

Often, this analysis shows the watchdog must:

  1. Use an independent clock source that remains unaffected if the main system clock fails
  2. Maintain reliable power or voltage supervision so that it does not lose state information during dips or brownouts and
  3. Operate offboard, which keeps it isolated from defects in the MCU pipeline or CPU core and ensures that it continues running even when the MCU is compromised.

The need for watchdog independence is widely recognized, and achieving it typically involves selecting suitable reporting features and designing the system so that multiple watchdogs can operate without compromising each other’s functions (see Figure 13).

Figure 13: A distributed domain with two clocks, a clock monitor, and two separate VCC domains to remove dependencies
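
A clock monitor of the kind shown in Figure 13 typically counts ticks of the monitored clock during a window timed by the independent clock and checks the count against an expected value. The function below is a hedged sketch of that comparison; the name `clock_in_tolerance` and the parts-per-million tolerance parameter are assumptions, not part of any specific device.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare the main-clock ticks counted during one reference window with the
 * expected count. The window is assumed to be timed by an independent
 * oscillator, so a stuck or drifting main clock shows up as a bad count.
 */
bool clock_in_tolerance(uint32_t counted_ticks, uint32_t expected_ticks,
                        uint32_t tolerance_ppm)
{
    uint64_t allowed = ((uint64_t)expected_ticks * tolerance_ppm) / 1000000u;
    uint64_t diff = (counted_ticks > expected_ticks)
                        ? (uint64_t)(counted_ticks - expected_ticks)
                        : (uint64_t)(expected_ticks - counted_ticks);
    return diff <= allowed;
}
```

For example, a 16 MHz clock measured over a 1 ms window should yield about 16,000 ticks; with a 2% tolerance (20,000 ppm), counts between 15,680 and 16,320 pass, and a stuck clock reporting zero ticks fails.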

Examining Watchdog Architectures

In the previous section, the DFA was introduced as an analysis that helps ensure a watchdog is free from dependencies. In this section, we apply that thinking to three general use cases and examine the advantages of each.

Hierarchical Watchdogs for Dependency Avoidance

When a system has multiple watchdogs available, it is up to the designer to structure them so that each watchdog operates at a different level of the system. This allows the designer to provide evidence that each watchdog is free from dependency on the domain it monitors and will operate predictably, avoiding a boot lock.

This arrangement is most often found in applications with one watchdog per core on a multicore processor plus an offboard watchdog on an associated PMIC or voltage monitor integrated circuit. To start this design, the watchdogs are separated into two groups: the subordinate watchdogs found on a per-core basis and the master watchdog found offboard (see Figure 14).

Figure 14: A block diagram of a multicore MCU, with WDT from each core reporting to a system supervisor

At its base function, each subordinate watchdog watches its own core, providing localized monitoring, usually in the form of a basic single-refresh watchdog or a window-based one. Generally, if that watchdog is not refreshed under the correct parameters, it raises an interrupt on its core that vectors to a handler at a specific memory location, and that handler often restarts only the affected core, leaving the others untouched. While this works well for individual tasks that have hung, such as a stalled I²C or SPI transaction, it does not address clock and voltage dependencies, because it relies on a functioning program counter and CPU bus to take the interrupt and load the program counter with the correct address. By adding a master watchdog that is refreshed along with the individual ones, you’re able to address dependencies such as:

  • Clock independence: The external master watchdog relies on its own clock source. If any single core’s clock fails, that core stops issuing refreshes, and the master watchdog eventually times out. This separation ensures that one domain’s fault does not compromise the entire watchdog infrastructure.
  • Fault isolation: Each subordinate watchdog can trigger a localized reset or interrupt for its core. If that fails or the core remains hung, the master watchdog takes further action, like forcing a full power cycle.
  • Resource separation: Subordinate watchdogs rely on the MCU’s internal registers and clock domains, while the master watchdog uses an offboard resource (such as a separate oscillator). This avoids the problem of a single clock or memory bus dominating the entire safety mechanism.

The main advantage of this layered approach is simplified debugging: each core issues its own refresh and logs irregularities specific to that core, while the master watchdog timer can still issue a system-wide reset should errors continue. The downside is that the fault limits must be chosen so that they fit inside the module’s fault-tolerant time interval while still allowing some fault tolerance for each core’s watchdog. This approach is best suited to cases where an SPI fault, an I²C fault, or a systematic error causes an individual core to hang.
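
A common way to tie the subordinate watchdogs to the master is to refresh the master only when every core has checked in during the current period. The sketch below models that gating with a bitmask. The names, the three-core count, and the single shared mask are assumptions for illustration; on real multicore hardware the mask update would need to be atomic or held in per-core flags.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES      3u
#define ALL_CORES_MASK ((1u << NUM_CORES) - 1u)

typedef struct {
    uint32_t alive_mask;   /* bit n is set once core n has checked in */
} supervisor_t;

/* Called by each core after it has refreshed its own watchdog. */
void supervisor_check_in(supervisor_t *s, unsigned core)
{
    if (core < NUM_CORES)
        s->alive_mask |= (1u << core);
}

/*
 * Called once per master-watchdog period. Returns true only if every core
 * checked in, and clears the flags for the next period either way.
 */
bool supervisor_may_refresh_master(supervisor_t *s)
{
    bool all_alive = (s->alive_mask & ALL_CORES_MASK) == ALL_CORES_MASK;
    s->alive_mask = 0;
    return all_alive;
}
```

If one core hangs, its local watchdog gets the first chance to recover it; if the core stays silent, the master refresh is withheld and the offboard watchdog eventually escalates.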

Dual-Output Watchdog: Separating Peripheral Resets from Full System Reboots

In some systems, it’s useful for a watchdog to be in control of both a hardware as well as a software reset, passing the responsibility of resetting the module or communications interface to the watchdog timer if the system processor is unable to respond, even after a software reset (see Figure 15). This type of watchdog architecture requires the watchdog to have two output pins, referred to below as WDO1 and WDO2.

Figure 15: Previously depicted PMIC circuit with two reset lines (in green)

In architectures like these, the MCU communicates with the offboard watchdog. In the event of a missed refresh, the watchdog can be programmed to issue a reset to the system controller through a non-maskable interrupt pin. If the offboard watchdog then does not receive a signal from the processor indicating a successful recovery, it can toggle the second output, which can reset either the power or the communications to the device (see Figure 16). This two-tiered approach allows for:

  • Reduced downtime: Resetting a single peripheral requires less re-initialization time than rebooting the entire system, improving availability in real-time or mission-critical applications and
  • Dual timeout domains: The shorter timeout for WDO1 (e.g., 100 ms) might be enough to catch process stalls, while the longer WDO2 timeout (e.g., 500 ms) covers system-level lockups. Because the watchdog remains decoupled from the processor clock domain, this also reduces the chance of a single clock failure compromising both outputs.

Figure 16: Labeled timing diagram of both reset lines

One use case for this architecture is a motor-control system that communicates with multiple sensor ICs via SPI. Occasional electrical noise can cause the SPI bus to freeze due to miscounted clock edges, leaving the controller and peripheral out of sync. The watchdog toggles WDO1, wired to a hardware reset line, if the bus remains idle or locked for 50 ms, allowing immediate recovery. If the system fails to refresh the watchdog for 500 ms (perhaps because the main control loop is hung), the watchdog asserts WDO2 and triggers a full system reset by toggling either power to the module or the communications interface to the processor.

The key advantage here is that both the clock and the path to a safe system state are physically isolated from the processor’s main reset domain, while the first-stage output still allows some fault tolerance in the case of a non-critical error.
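
The two-stage behavior can be modeled as a single elapsed-time counter compared against two thresholds. This is a simplified sketch under assumed names (`dual_wdt_t`, `dual_wdt_tick`); it treats any refresh as confirmation of recovery, whereas a real device may require a specific acknowledgment after the first stage fires.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t wdo1_timeout_ms;   /* first stage: peripheral or soft reset */
    uint32_t wdo2_timeout_ms;   /* second stage: full system reset */
    uint32_t elapsed_ms;        /* time since the last refresh */
} dual_wdt_t;

typedef struct {
    bool wdo1;
    bool wdo2;
} dual_wdt_out_t;

void dual_wdt_refresh(dual_wdt_t *w)
{
    w->elapsed_ms = 0;
}

/* Advance the model by dt_ms and report which outputs are asserted. */
dual_wdt_out_t dual_wdt_tick(dual_wdt_t *w, uint32_t dt_ms)
{
    w->elapsed_ms += dt_ms;
    dual_wdt_out_t out = {
        .wdo1 = w->elapsed_ms >= w->wdo1_timeout_ms,
        .wdo2 = w->elapsed_ms >= w->wdo2_timeout_ms,
    };
    return out;
}
```

With thresholds of 100 ms and 500 ms, a stall that clears after the first-stage reset never reaches the second stage, while a persistent hang asserts both outputs.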

Cross-Monitoring with a Co-Processor

While an offboard watchdog integrated into a supervisor chip or PMIC is quickly becoming popular due to its low cost, some systems still rely on a co-processor, either a simpler MCU or a field-programmable gate array (FPGA). This architecture offers the most flexibility, albeit at the highest cost, as it allows a user to create a custom response via a companion MCU or FPGA (see Figure 17).

Figure 17: Example system focusing on a custom FPGA that monitors the health of two logical domains, with a master reset with little to no dependencies

In this architecture, it is best to separate the system into a local domain and a companion-processor domain. The local domain is where each core on the MCU is watched internally. However, instead of being given the authority to toggle a hardware reset line or preempt the CPU with an interrupt, each local watchdog reports to the neighboring processor through a GPIO signal or an SPI message.

At the co-processor layer, the device collects these signals, tracks their failures, and applies custom behavior: simple pin-based refreshing for lower-level tasks and challenge-response logic for higher-level, critical tasks.

Additionally, in this scenario, we gain the benefits of a dual watchdog output in addition to customizable reset requirements. Examples of these custom requirements are:

  • Custom error thresholds. The system can log a custom number of errors in different situations, resulting in increased uptime and visibility into which tasks have stalled or are otherwise causing system instability.
  • Custom responses to reset specific physical interfaces. In an automotive module, there might be multiple interfaces (e.g., CAN, LIN, and an Ethernet PHY). Depending upon the task that is stalled, you may want to either cycle power to or otherwise disable a specific function while still offering reduced functionality (this is often referred to as failing functional).
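
On a companion MCU, these custom requirements often reduce to a policy table that maps each monitored task to an error limit and a response. The table below is purely illustrative: the task names, limits, and actions are invented for the example and would come from the system’s safety analysis in practice.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ACT_NONE,             /* nothing to do */
    ACT_LOG,              /* record the error and keep running */
    ACT_RESET_INTERFACE,  /* reset or disable one physical interface */
    ACT_POWER_CYCLE       /* hard reset of the monitored module */
} action_t;

typedef struct {
    const char *task;        /* hypothetical task name */
    uint8_t     max_errors;  /* errors tolerated before acting */
    action_t    on_exceed;   /* response once the limit is exceeded */
} task_policy_t;

static const task_policy_t POLICY[] = {
    { "can_rx",     3, ACT_RESET_INTERFACE },
    { "lin_poll",   5, ACT_RESET_INTERFACE },
    { "motor_loop", 1, ACT_POWER_CYCLE },
};

/* Decide the response for a task given its current error count. */
action_t policy_decide(size_t task_index, uint8_t error_count)
{
    if (task_index >= sizeof POLICY / sizeof POLICY[0])
        return ACT_NONE;
    if (error_count == 0)
        return ACT_NONE;
    if (error_count <= POLICY[task_index].max_errors)
        return ACT_LOG;
    return POLICY[task_index].on_exceed;
}
```

Keeping the policy in a constant table makes the tolerated error counts easy to review against the safety requirements and easy to adjust per interface.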

Overall, this type of architecture allows the system designer to decide which tasks and domains warrant a hard reset, where power must be cycled, and which need only a soft reset, where just the program counter in the CPU is modified. Additionally, this scheme offers physical separation of function, which is desirable for failure mode analysis because a single physical failure no longer guarantees a failure in the monitoring hardware. However, care must be taken to avoid a voltage dependency: if the same voltage source feeds both the co-device and the main MCU, a voltage anomaly could cause the watchdog aggregator and the MCU to malfunction in the same way, negating the benefit of this more complex architecture.

Reporting and Error Handling

A final, often overlooked topic involves how complex watchdogs handle faults and generate reports. In many designs, once the watchdog resets the main controller or toggles power to the module, the watchdog’s internal memory (often a form of RAM) preserves status flags indicating the nature of the reset. These flags are often cleared when read or when written to (see Figure 18). At startup, the system bootloader can read this information and decide whether to log an event, increment a counter in non-volatile memory, or alter its initialization routine.

Figure 18: Flow diagram of a system reset, boot loader, and error reporting

Additionally, systems that require more comprehensive tracking benefit from watchdogs capable of distinguishing between different causes for a reset. For instance, some devices maintain separate flags for time-based refresh failures, invalid challenge-response answers, or total non-responsiveness. Placing these flags in an easily accessible register allows a bootloader to implement custom strategies, such as reloading a firmware image or initiating a safe-mode startup if too many consecutive resets occur.
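
A bootloader’s handling of these flags can be sketched as a small decision function. The bit layout, the threshold of three consecutive resets, and the names below are hypothetical; a real watchdog defines its own status register, and the counter would live in non-volatile memory.

```c
#include <stdint.h>

/* Hypothetical status-register bits; a real device defines its own layout. */
#define WD_FLAG_WINDOW   (1u << 0)  /* refresh outside the timing window */
#define WD_FLAG_BAD_ANS  (1u << 1)  /* wrong challenge-response answer */
#define WD_FLAG_NO_RESP  (1u << 2)  /* no refresh before the terminal limit */

#define MAX_CONSECUTIVE_RESETS 3u   /* illustrative threshold */

typedef enum { BOOT_NORMAL, BOOT_LOG_AND_CONTINUE, BOOT_SAFE_MODE } boot_mode_t;

/*
 * Decide how to boot from the watchdog status read at startup.
 * consecutive_resets stands in for a counter kept in non-volatile memory.
 */
boot_mode_t boot_decide(uint8_t wd_status, uint8_t *consecutive_resets)
{
    const uint8_t any = WD_FLAG_WINDOW | WD_FLAG_BAD_ANS | WD_FLAG_NO_RESP;

    if ((wd_status & any) == 0) {
        *consecutive_resets = 0;     /* clean boot clears the streak */
        return BOOT_NORMAL;
    }
    if (*consecutive_resets < UINT8_MAX)
        (*consecutive_resets)++;
    return (*consecutive_resets >= MAX_CONSECUTIVE_RESETS)
               ? BOOT_SAFE_MODE : BOOT_LOG_AND_CONTINUE;
}
```

Because the individual flags are kept separate, the same function could be extended to choose different strategies per cause, for example reloading a firmware image after repeated wrong answers.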

Ideally, the boot loader would also offer a simple communication channel (for example, CAN or LIN in automotive applications) to facilitate external queries about the reset cause. Done properly, this helps engineers find the cause of a reset by highlighting persistent errors.

Conclusion 

Overall, today’s watchdogs can do more than simply catch hung processes; they can precisely monitor task timing, validate code execution, and accumulate valuable diagnostic data. And as safety standards continue to evolve, so will watchdog timers and their architectures, ensuring they remain informative, independent, and, most importantly, reliable in today’s functionally safe systems. 
