Catastrophic events within an Intel processor often manifest as a “hung” or “wedged” system. What causes these events, and how can they be resolved?
Errors in electronic systems fall into two broad categories: detectable and undetectable.
Undetectable errors are not necessarily system-critical. For example, flawed branch logic in code may have no effect on the system's computations. On the other hand, undetected errors that result in Silent Data Corruption (SDC) are system-critical, because by their nature they cannot be detected and passed to error-handling routines.
Detectable errors can themselves be classified into two categories: recoverable and unrecoverable. Recoverable errors are detected and handled by standard error-handling routines; examples include single-bit memory errors corrected by ECC handlers and link-layer retries on PCI Express links.
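The classification above can be sketched as a small decision function. This is only an illustration of the taxonomy described in the text; the names ErrorClass and classify, and the three boolean inputs, are hypothetical and do not come from any Intel tool or specification.

```python
from enum import Enum


class ErrorClass(Enum):
    """Hypothetical labels for the four leaves of the error taxonomy."""
    BENIGN_UNDETECTED = "undetected, no effect on computations"
    SILENT_DATA_CORRUPTION = "undetected, corrupts results (SDC)"
    RECOVERABLE = "detected and handled by an error-handling routine"
    UNRECOVERABLE = "detected but not recovered"


def classify(detected: bool, recovered: bool = False,
             corrupts_data: bool = False) -> ErrorClass:
    """Map the properties of an error onto the taxonomy.

    detected      -- hardware or software noticed the error
    recovered     -- an error handler repaired it (e.g. ECC, link retry)
    corrupts_data -- the error changed a computed result
    """
    if detected:
        # Detectable branch: recoverable vs. unrecoverable.
        return ErrorClass.RECOVERABLE if recovered else ErrorClass.UNRECOVERABLE
    # Undetectable branch: only harmful if results are actually corrupted.
    if corrupts_data:
        return ErrorClass.SILENT_DATA_CORRUPTION
    return ErrorClass.BENIGN_UNDETECTED


if __name__ == "__main__":
    # A single-bit memory error fixed by ECC.
    print(classify(detected=True, recovered=True).name)
    # A flipped bit nobody noticed that changed a result.
    print(classify(detected=False, corrupts_data=True).name)
```

The point of the sketch is that only the undetected branch depends on whether data was corrupted; once an error is detected, what matters is whether a handler could recover from it.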
A general taxonomy of error classifications can be seen in the article “Autonomic Foundation for Fault Diagnosis” in the Intel Technology Journal, Volume 16, Issue 2, 2012. See below:
Detectable but Uncorrected Errors (DUE) can manifest as blue screens or other system hangs and crashes. In Intel designs, internal processor errors, such as a processor instruction retirement watchdog timeout (also known as a three-strike timeout), cause a CATERR assertion and can only be recovered from by a system reset. Identifying the root cause of such events is notoriously difficult, as the system is effectively wedged and cannot be put into probe mode by JTAG-assisted hardware debuggers. In such extreme cases the machine check exception handler at vector 18 (0x12) does not execute correctly and no register information is captured.
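For contrast, in the less extreme cases where machine check information is captured, it is reported through per-bank status registers. The following is a minimal sketch of decoding the flag bits of a raw 64-bit IA32_MCi_STATUS value, assuming the architectural bit layout documented in the Intel Software Developer's Manual. The function name decode_mci_status and the sample value are hypothetical, and model-specific fields are not interpreted.

```python
# Assumed architectural bit positions in IA32_MCi_STATUS (per the Intel SDM).
VAL, OVER, UC, EN, MISCV, ADDRV, PCC = 63, 62, 61, 60, 59, 58, 57


def decode_mci_status(status: int) -> dict:
    """Split a raw 64-bit machine check status value into named fields."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "valid": bit(VAL),                  # register holds a logged error
        "overflow": bit(OVER),              # an earlier error was overwritten
        "uncorrected": bit(UC),             # hardware could not correct it
        "enabled": bit(EN),                 # error reporting was enabled
        "misc_valid": bit(MISCV),           # companion MISC register is valid
        "addr_valid": bit(ADDRV),           # companion ADDR register is valid
        "context_corrupt": bit(PCC),        # processor state may be corrupted
        "mca_error_code": status & 0xFFFF,  # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Hypothetical sample: valid, uncorrected, context corrupt, arbitrary code.
    sample = (1 << VAL) | (1 << UC) | (1 << PCC) | 0x0400
    for name, value in decode_mci_status(sample).items():
        print(f"{name}: {value}")
```

A record with both the uncorrected and context-corrupt flags set corresponds to the unrecoverable category described earlier. In a true CATERR hang, of course, the difficulty is that no such record may be available to decode.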
A very good Intel reference on processor instruction retirement watchdog timeouts can be found here: Processor Reorder Buffer (ROB) Timeout Debug Guide. Keep in mind that ROB timeouts are only one of many types of internal, catastrophic errors.
Clearly, it is desirable to capture root-cause information from such a failure event before the system is disrupted and probe mode can no longer be accessed. This can be accomplished by setting a breakpoint within the offending code (once identified) and tracing execution backwards in time. The methodology is described in our whitepaper here (requires registration): Intel Trace Hub | Faster Software Debug | Finding Root Cause.