Highly Accelerated Life Testing (HALT) and Highly
Accelerated Stress Screening (HASS) are often used to detect faults in
electronic systems, testing the extremes and determining the outer limits of
system margins. All hardware components and systems will eventually fail under
environmental stress. But can these techniques be used to improve software
reliability as well?
Temperature cycling, repetitive shock, power margining,
and power cycling are the most common forms of failure acceleration for
electronic equipment in Accelerated Stress Testing. These methodologies subject
a product to a series of overstresses, finding defects in design and
manufacturing before they become expensive field issues.
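One common way to structure such a series of overstresses is to widen the stress window a little on every cycle until something fails. The sketch below shows the idea for temperature cycling only. It is a minimal illustration, not a procedure from any particular lab or from the paper discussed here: the function name `step_stress_profile`, the starting limits, the step size, and the dwell time are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One temperature cycle in a hypothetical step-stress plan."""
    cycle: int
    low_c: float      # cold setpoint in degrees Celsius
    high_c: float     # hot setpoint in degrees Celsius
    dwell_min: int    # minutes held at each setpoint


def step_stress_profile(start_low_c=0.0, start_high_c=40.0,
                        step_c=10.0, cycles=5, dwell_min=10):
    """Return a list of cycles whose temperature window widens by
    step_c on both the cold and hot side with every cycle.

    All default values are illustrative assumptions, not recommended limits.
    """
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    if step_c <= 0:
        raise ValueError("step_c must be positive")
    return [
        Step(cycle=i + 1,
             low_c=start_low_c - i * step_c,
             high_c=start_high_c + i * step_c,
             dwell_min=dwell_min)
        for i in range(cycles)
    ]


if __name__ == "__main__":
    for step in step_stress_profile():
        print(f"cycle {step.cycle}: {step.low_c:.0f} C to {step.high_c:.0f} C, "
              f"dwell {step.dwell_min} min")
```

In practice the unit under test would be exercised at each setpoint, and the cycle at which it first misbehaves gives a rough indication of its operating margin.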
An excellent paper by Allied Telesis, entitled "Software Fault Isolation Using HALT and HASS," describes that
company's experience in isolating software bugs. Interestingly, of all the
failures found during HALT testing, almost one-third were attributed to
software.
A sample of the software bugs found is covered in the paper,
and it makes for interesting reading. Some of the symptoms found that were
attributable to software defects included abnormal LED activity, invalid packet
routing due to a bad I/O driver on a Marvell Prestera 98EX115 part, CPLD board
power-up sequencing that failed at cold temperatures, and other system crashes
and silent reboots.
The system crash example was particularly interesting. As it
turns out, the crashes were caused by an incorrect register setting inside
memory initialization code, in conjunction with a problem with the memory
controller silicon.
In all of these cases, the root cause lay at the firmware,
driver, or operating system kernel level of the system. A hardware-assisted
debugger is an extremely useful tool in these situations. For more information
on the power of these tools on Intel designs, see our white paper, Faster Firmware Debug with Intel Embedded Trace Tools.