Title: Watchdog Timers
1. Watchdog Timers
- Jeffrey Schwentner
- EEL6897, Fall 2007
2. Software Reliability
- Embedded systems must cope with both hardware and software anomalies to be truly robust.
- In many cases, embedded devices operate in total isolation and are not accessible to an operator.
- Manually resetting such a device when its software hangs is not possible.
- In extreme cases, a hang can damage hardware, cost lives, or incur significant expense.
3. The Clementine
- In 1994, a deep space probe, Clementine, was launched to make observations of the Moon and a large asteroid (1620 Geographos).
- After months of operation, a software exception caused a control thruster to fire for 11 minutes, which depleted most of the remaining fuel and left the probe rotating at 80 RPM.
- Control was eventually regained, but too late to complete the mission successfully.
4. Watchdog Timers
- While it is not possible to cope with every hardware and software anomaly, the developer can use watchdog timers to help mitigate the risks.
- A watchdog timer is a hardware timing device that triggers a system reset, or a similar operation, after a designated amount of time has elapsed.
- A watchdog timer can be either a stand-alone hardware component or built into the processor itself.
- To avoid a reset, the application must periodically reset the watchdog timer before this interval elapses. This is also known as "kicking the dog."
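The countdown-and-kick behavior described above can be sketched as a small host-side model. This is only an illustration: the type and function names (sim_watchdog_t, sim_wdt_init, sim_wdt_kick, sim_wdt_tick) are hypothetical, and real watchdogs implement this logic in hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog timer (assumes timeout_ticks > 0). */
typedef struct {
    uint32_t timeout_ticks;   /* interval before the watchdog fires */
    uint32_t remaining_ticks; /* ticks left in the current interval */
    bool     reset_fired;     /* latched once the interval elapses  */
} sim_watchdog_t;

void sim_wdt_init(sim_watchdog_t *wdt, uint32_t timeout_ticks)
{
    wdt->timeout_ticks = timeout_ticks;
    wdt->remaining_ticks = timeout_ticks;
    wdt->reset_fired = false;
}

/* "Kick the dog": restart the countdown from the full interval. */
void sim_wdt_kick(sim_watchdog_t *wdt)
{
    wdt->remaining_ticks = wdt->timeout_ticks;
}

/* Advance one clock tick; returns true once the watchdog has fired. */
bool sim_wdt_tick(sim_watchdog_t *wdt)
{
    if (!wdt->reset_fired && wdt->remaining_ticks > 0) {
        wdt->remaining_ticks--;
        if (wdt->remaining_ticks == 0) {
            wdt->reset_fired = true; /* real hardware would reset the system here */
        }
    }
    return wdt->reset_fired;
}
```

As long as sim_wdt_kick() is called more often than every timeout_ticks ticks, sim_wdt_tick() never reports a reset; once the kicks stop, the reset follows one interval later.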
5. External Watchdogs
- External watchdog timers are integrated circuits that physically assert the reset pin of the processor.
- The processor must assert an output pin in some fashion to reset the timing mechanism of the watchdog.
- This type of watchdog is generally considered the most appropriate because the watchdog is completely independent of the processor.
- Some external watchdogs feature a "windowed" reset.
  - Enforces timing constraints for a proper watchdog reset.
  - Minimizes the likelihood of errant software resetting the watchdog.
6. External Watchdog Schematic
7. Windowed Watchdog Operation
[Figure: Maxim MAX6323]
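A windowed watchdog accepts a kick only between a minimum and a maximum time since the previous kick. The sketch below shows that check in isolation. The names (wdt_window_t, classify_kick) and the idea of passing the limits in milliseconds are assumptions for illustration; an actual part defines its own window in its datasheet.

```c
#include <stdint.h>

typedef enum { KICK_OK, KICK_TOO_EARLY, KICK_TOO_LATE } kick_result_t;

/* Hypothetical window limits, measured from the previous kick. */
typedef struct {
    uint32_t window_open_ms;  /* earliest acceptable kick */
    uint32_t window_close_ms; /* latest acceptable kick   */
} wdt_window_t;

/* Classify a kick that arrives elapsed_ms after the previous one. */
kick_result_t classify_kick(const wdt_window_t *w, uint32_t elapsed_ms)
{
    if (elapsed_ms < w->window_open_ms) {
        return KICK_TOO_EARLY; /* e.g. runaway code kicking in a tight loop */
    }
    if (elapsed_ms > w->window_close_ms) {
        return KICK_TOO_LATE;  /* e.g. hung or stalled software */
    }
    return KICK_OK;
}
```

In a windowed part, both failure cases lead to a reset, which is why a stray loop that kicks too quickly is caught as well as code that stops kicking.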
8. Internal Watchdogs
- Many processors and microcontrollers have built-in watchdog circuitry available to the programmer.
- This typically consists of a memory-mapped counter that triggers a non-maskable interrupt (NMI), or a reset, when the counter reaches a predefined value.
- Instead of kicking the watchdog via an I/O pin assertion, software resets an internal counter to its initial value.
- Watchdog configuration is controlled by user software.
- The watchdog may even be used as a general-purpose timer in some cases.
9. Internal Watchdog Considerations
- Internal watchdogs are not as safe as watchdog circuits external to the processor.
- Watchdogs that issue an NMI instead of a reset may not properly reinitialize the system.
- Watchdog control registers may be inadvertently overwritten by runaway code, disabling the watchdog altogether.
- The reset is limited to the processor itself (no outside peripherals).
- To circumvent these issues, most built-in watchdogs have extra safety steps designed to prevent errant code from interfering with the operation of the watchdog timer.
- On-chip solutions have a significant cost and space advantage over their external counterparts.
10. MSP430 Watchdog
- The Texas Instruments MSP430 family of microcontrollers has a built-in 16-bit watchdog timer featuring:
  - Configurable clock source and prescaler
  - Two interrupt options (reset or NMI)
  - Isolated watchdog counter
- Access to the watchdog requires a unique binary code, or password.
- The password must be included in the same write that resets the watchdog timer.
- A write with an invalid password causes a security key violation, which forces a system reset.
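The password check can be illustrated with a host-side model of the watchdog control register. In TI's MSP430 headers the register is WDTCTL and the constants WDTPW, WDTHOLD, and WDTCNTCL have the values shown. The sim_wdtctl_t type and sim_wdtctl_write() function are hypothetical stand-ins, written only so the behavior can be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in TI's MSP430 headers. */
#define WDTPW    0x5A00u /* password, required in the upper byte of a write */
#define WDTHOLD  0x0080u /* stop the watchdog counter                       */
#define WDTCNTCL 0x0008u /* clear (kick) the watchdog counter               */

/* Hypothetical host-side model of the WDTCTL register. */
typedef struct {
    uint16_t control;       /* lower-byte configuration bits */
    uint16_t counter;       /* watchdog counter              */
    bool     key_violation; /* set by a bad-password write   */
} sim_wdtctl_t;

void sim_wdtctl_write(sim_wdtctl_t *reg, uint16_t value)
{
    if ((value & 0xFF00u) != WDTPW) {
        reg->key_violation = true; /* real hardware would reset here */
        return;                    /* the write is otherwise ignored */
    }
    if (value & WDTCNTCL) {
        reg->counter = 0;          /* kick: counter restarts from zero */
    }
    /* WDTCNTCL clears itself, so it is not stored with the configuration. */
    reg->control = (uint16_t)(value & 0x00FFu & ~WDTCNTCL);
}
```

On a real MSP430 the equivalent kick is a single word write such as `WDTCTL = WDTPW | WDTCNTCL;`. Because the whole lower byte is written at once, any configuration bits that should stay set need to be included in that same write.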
11. MSP430 Watchdog Timer Block Diagram
12. Design Considerations
- The effectiveness of the watchdog depends on how it is used within the application software.
- Simply issuing a watchdog reset in every iteration of the program loop may be insufficient.
- Take a more proactive approach.
- Periodically assess the state and health of the system. Only issue a reset if all processes are deemed normal.
- Employ a state-based approach when resetting the watchdog timer.
- Should a watchdog failure occur, provide an indication and/or capture debugging information.
13. System Health Assessment
- As the size and complexity of software increase, so does the likelihood of introducing code that may be detrimental to the system.
- Software may not be the only cause of system corruption. A spike in the power supply, for example, may corrupt data in memory, or even system registers (program counter, stack pointer, etc.).
- Check for things like stack overflows and validate memory wherever possible.
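One way to act on these checks is a guard word at the end of the stack plus a checksum over data that should not change. The sketch below is a minimal illustration: STACK_GUARD_PATTERN, block_checksum(), and health_check() are hypothetical names, and the additive checksum is deliberately simple (a CRC would catch more kinds of corruption).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_GUARD_PATTERN 0xDEADBEEFu /* arbitrary marker value (assumption) */

/* Simple additive checksum over a block of memory. */
uint16_t block_checksum(const uint8_t *data, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint16_t)(sum + data[i]);
    }
    return sum;
}

/* Returns true only if the stack guard is intact and the data block
 * still matches the checksum recorded when it was last written. */
bool health_check(const uint32_t *stack_guard,
                  const uint8_t *block, size_t block_len,
                  uint16_t expected_checksum)
{
    if (*stack_guard != STACK_GUARD_PATTERN) {
        return false; /* stack has grown into the guard word */
    }
    if (block_checksum(block, block_len) != expected_checksum) {
        return false; /* protected data has been corrupted */
    }
    return true;
}
```

The guard word would be written once at startup at the far end of the stack region; where that is depends on the toolchain and linker script, which this sketch does not cover.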
14. System Health Assessment
- If the state of the system is compromised, let the watchdog timer perform the reset. This is a better approach than an application "pseudo-reset."
- Watchdog timers themselves can also adversely affect the system.
- Setting the watchdog interval too short will generate a premature reset.
- If a critical section of code takes 80 milliseconds to complete, do not set the watchdog interval to 60 milliseconds.
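The 80 ms versus 60 ms example amounts to a simple inequality: the watchdog interval must exceed the worst-case execution time, preferably with some margin. The following sketch encodes that check. The 25% margin and the name wdt_timeout_is_safe() are assumptions chosen for illustration, not values from any datasheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical safety margin: the timeout must cover the worst case plus 25%. */
#define WDT_MARGIN_PERCENT 25u

/* Returns true if timeout_ms leaves the required margin over worst_case_ms. */
bool wdt_timeout_is_safe(uint32_t worst_case_ms, uint32_t timeout_ms)
{
    /* 64-bit intermediate avoids overflow for large worst-case values. */
    uint64_t required_ms =
        (uint64_t)worst_case_ms * (100u + WDT_MARGIN_PERCENT) / 100u;
    return timeout_ms >= required_ms;
}
```

With an 80 ms critical section, this check rejects a 60 ms interval and requires at least 100 ms under the assumed margin.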
15. State-based Watchdog
- To help ensure that the software executes as intended, incorporate a simple state machine.
- This involves adjusting a state variable at the beginning of a program iteration.
- Prior to resetting the watchdog timer at the end of the program iteration, verify that the state is correct.
  - Prevents random code from wandering into the main loop and kicking the dog.
  - Enforces a constraint on program sequence.
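16. State-based Watchdog Example
    unsigned short g_usWatchdogState = 0;
    unsigned short g_usWatchdogStatePrev = 0;

    /* Advance the state at the start of each program iteration. */
    void watchdog_state_advance(void)
    {
        g_usWatchdogState += 0x1111;
    }

    /* Kick the dog only if the state advanced exactly once since the last check. */
    void watchdog_state_validate(void)
    {
        g_usWatchdogStatePrev += 0x1111;
        if (g_usWatchdogState != g_usWatchdogStatePrev)
        {
            // State is invalid, allow watchdog to reset.
            SLEEP();
        }
        else
        {
            // Reset the watchdog timer.
            WDT_RESET();
        }
    }
Note: Repeated calls to the validate function without an intervening advance will cause a watchdog reset, because the two state variables no longer match.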
17. Debugging Information
- If software detects a fault condition, log the error information prior to allowing the watchdog to reset the system.
  - Allows the cause of the failure to be addressed.
- A report of the error should be attempted when the system resets (perhaps as part of initialization).
- In addition to reporting errors after reset, it is a good idea to indicate that the device has been reset.
- If the software was unable to catch the error, it will still attempt to notify of the reset event.
- Systems that appear sluggish may actually be experiencing frequent watchdog resets.
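Logging before the reset and reporting after it can be sketched with a small record that survives the reset. This sketch assumes the record lives in a RAM region that startup code does not clear; how to place it there is toolchain-specific and not shown. The names fault_log_t, fault_log_record(), fault_log_fetch(), and FAULT_LOG_MAGIC are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAULT_LOG_MAGIC 0x57444721u /* arbitrary marker: record is valid */

typedef struct {
    uint32_t magic;       /* FAULT_LOG_MAGIC once the record is initialized */
    uint32_t error_code;  /* application-defined fault identifier           */
    uint32_t reset_count; /* number of faults recorded so far               */
    uint32_t pending;     /* nonzero until the fault has been reported      */
} fault_log_t;

/* Call when a fault is detected, just before letting the watchdog expire. */
void fault_log_record(fault_log_t *log, uint32_t error_code)
{
    if (log->magic != FAULT_LOG_MAGIC) {
        log->reset_count = 0; /* first use: memory held no valid record */
    }
    log->magic = FAULT_LOG_MAGIC;
    log->error_code = error_code;
    log->reset_count++;
    log->pending = 1;
}

/* Call during initialization; returns true if an unreported fault exists. */
bool fault_log_fetch(fault_log_t *log, uint32_t *error_code,
                     uint32_t *reset_count)
{
    if (log->magic != FAULT_LOG_MAGIC || !log->pending) {
        return false;
    }
    *error_code = log->error_code;
    *reset_count = log->reset_count;
    log->pending = 0; /* report each fault only once */
    return true;
}
```

The magic value guards against reading uninitialized memory after a cold power-up, and the running reset_count makes frequent watchdog resets visible even when each one is brief.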
18. Single-threaded Implementation
- Single-threaded implementations should reset the watchdog timer in the main software loop.
- To determine the proper watchdog timeout duration, the programmer must determine how long the code takes to execute under worst-case conditions.
- Many systems do not require tight timing.
  - In these cases, setting the timeout to a very large, safe value may be acceptable, just to provide protection against deadlocks.
- Prior to resetting the watchdog, verify that the state of the system is valid and system health is normal.
19. Single-threaded Example
    int main(void)
    {
        hwinit();
        for (;;)
        {
            watchdog_state_advance();
            read_sensors();
            control_motor();
            display_status();
            if (system_check() == S_OK)
            {
                // Kick the dog.
                watchdog_state_validate();
            }
            else
            {
                flash_led();
            }
        }
    }
20. Multi-threaded Implementation
- The same concepts used in a single-threaded design also apply to multi-threaded implementations.
- Avoid creating a thread that simply resets the watchdog timer at regular intervals.
  - Other threads could fail, and the watchdog thread would keep kicking the dog.
- Generate a set of flags or data from each thread that can be validated in a monitoring thread.
- The monitoring thread should reset the watchdog at regular intervals only if the data produced by the other threads is acceptable.
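The flag-based variant can be sketched as follows: each worker sets its own bit, and the monitoring thread kicks the watchdog only if every bit was set since the last check. The flag names and the thread_monitor_t type are hypothetical, and C11 atomics are assumed to be available. On a target without them, the flag updates would need a critical section instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical check-in bits, one per worker thread. */
#define ALIVE_SENSOR  (1u << 0)
#define ALIVE_MOTOR   (1u << 1)
#define ALIVE_DISPLAY (1u << 2)
#define ALIVE_ALL     (ALIVE_SENSOR | ALIVE_MOTOR | ALIVE_DISPLAY)

typedef struct {
    atomic_uint alive_flags; /* bits set by workers, cleared by the monitor */
} thread_monitor_t;

/* Called by a worker thread once per iteration. */
void monitor_check_in(thread_monitor_t *m, unsigned flag)
{
    atomic_fetch_or(&m->alive_flags, flag);
}

/* Called by the monitoring thread once per period. Returns true only if
 * every worker checked in since the previous call, then clears the flags. */
bool monitor_all_alive(thread_monitor_t *m)
{
    unsigned seen = atomic_exchange(&m->alive_flags, 0u);
    return (seen & ALIVE_ALL) == ALIVE_ALL;
}
```

The monitoring thread would kick the watchdog only when monitor_all_alive() returns true, so a single stalled worker is enough to let the watchdog expire.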
21. Multi-threaded Monitoring
[Figure: Monitoring Task, System Tasks]
22. Multi-threaded Frequency
- An important criterion for validating the health of the system is the execution frequency of the worker threads.
- This can be accomplished by incorporating a simple counter that is incremented on each iteration of a worker thread.
- These counters are then monitored and compared with threshold values by the monitoring thread.
- If the period of the monitoring task is significantly longer than that of the worker threads, the monitoring task can perform a threshold check on each counter.
- This allows the software to validate timing constraints.
23. Multi-threaded Example: System Threads
    /* sleep() takes a duration in milliseconds. */
    void thread_read_sensor(void)
    {
        for (;;)
        {
            read_sensors();
            thread_sensor_cnt++;
            sleep(50);
        }
    }

    void thread_control_motor(void)
    {
        for (;;)
        {
            control_motor();
            thread_motor_cnt++;
            sleep(100);
        }
    }

    void thread_display_status(void)
    {
        for (;;)
        {
            display_status();
            thread_display_cnt++;
            sleep(125);
        }
    }
Note: Each thread maintains a unique execution counter.
24. Multi-threaded Example: Monitoring Thread
    int main(void)
    {
        hwinit();
        launch_threads();
        for (;;)
        {
            watchdog_state_advance();
            // Each count must reach its threshold within one monitoring period.
            if (system_check() == S_OK &&
                thread_sensor_cnt >= 18 &&
                thread_motor_cnt >= 8 &&
                thread_display_cnt >= 6)
            {
                // Kick the dog.
                watchdog_state_validate();
            }
            else
            {
                flash_led();
                report_error(E_FAIL);
            }
            // Restart the counts for the next monitoring period.
            thread_sensor_cnt = 0;
            thread_motor_cnt = 0;
            thread_display_cnt = 0;
            // Sleep monitoring task for 1 sec.
            sleep(1000);
        }
    }
25. Mars Pathfinder
- In July 1997, a priority inversion occurred on the Mars Pathfinder mission after the craft had landed on the Martian surface.
- A high-priority communications task was forced to wait on a mutex held by a lower-priority "science" task.
- The timing of the software was compromised, and a system reset issued by its watchdog timer brought the system back to normal operating conditions.
- On Earth, scientists were able to identify the problem and upload new code to fix it.
- Thus, the rest of the $265 million mission could be completed successfully.
26. Conclusion
- Watchdog timers can add a great deal of reliability to embedded systems if used properly.
- To do so requires a good overall approach. Resetting the watchdog timer must be part of the overall design.
- Verify the operational integrity of the system, and use this as a criterion for resetting the watchdog timer.
- In addition to validating that the software does the right thing, verify that it does so in the time expected.
- Assume the software will experience a hardware malfunction or software fault. Add enough debugging information to help debug the situation.
27. Questions?
28. References
- [1] Barr, M. (2001). Introduction to Watchdog Timers, http://www.netrino.com/Publications/Glossary/WatchdogTimer.php
- [2] Barr, M. (2002). Introduction to Priority Inversion, http://www.netrino.com/Publications/Glossary/PriorityInversion.php
- [3] Ganssle, J. (2004, January). Great Watchdogs, http://darwin.bio.uci.edu/sustain/bio65/Titlpage.htm
- [4] Murphy, N. Watchdog Timers, Embedded Systems Programming, http://www.embedded.com/2000/0011/0011feat4.htm
- [5] Maxim Integrated Products, Inc. (2005, December). Supervisory Circuits with Windowed (Min/Max) Watchdog and Manual Reset, http://datasheets.maxim-ic.com/en/ds/MAX6323-MAX6324.pdf
- [6] Texas Instruments, Inc. (2006). MSP430x1xx Family User's Guide, http://focus.ti.com/lit/ug/slau049f/slau049f.pdf