In my last blog, I wrote about the iNEMI report on how testing memories is one of the top three problems engineers face today. Why is memory test so important?
In my last blog, “How to test DDR3, DDR4 and other memory buses”, I wrote about how memory clock, data and address signals are vulnerable to temperature, jitter, noise, voltage aberrations and environmental conditions, as well as to manufacturing variations in resistance, capacitance, inductance and other parameters. I also pointed to our recent whitepaper on this subject: http://www.asset-intertech.com/News/White-Papers. But when I went back and reread both the blog and the whitepaper, I realized that we hadn’t explained why this is important. Or, to put it another way, why do these vulnerabilities and sensitivities matter?
It’s a good question, and one that is possibly more important in higher-end systems than in low-end ones. For example, servers used in financial or scientific applications, and higher-reliability/availability telecom systems, use ECC (Error Correcting Code) memory. ECC allows single-bit errors to be detected, logged by the OS, and invisibly corrected. (These are often loosely called “parity” errors, although plain parity can only detect an error, not correct it.) Double-bit errors will be detected by ECC but are in general not recoverable, in which case the system must halt in order to prevent data corruption. In the paper “DRAM Errors in the Wild: A Large-Scale Field Study”, Google, in conjunction with the University of Toronto, found that memory errors on its servers were more frequent than previously believed.
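To make the single-bit versus double-bit distinction concrete, here is a minimal Python sketch of a SECDED (single-error-correct, double-error-detect) code. It is an extended Hamming code that protects 4 data bits with 4 check bits. Treat it as an illustration only, not as how any particular memory controller implements ECC: ECC DIMMs typically protect 64 data bits with 8 check bits, and the names `encode` and `decode` are my own.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (a list of bits).

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming check bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Check bit p makes the parity even over all positions whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # The overall parity bit makes the whole 8-bit word even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status), where status is "ok", "corrected" or "uncorrectable"."""
    code = list(code)
    # The syndrome is the XOR of the indices of all set bits; it is 0 for a valid word.
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Exactly one bit flipped, and the syndrome is its index
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status
```

Flipping any one bit of a codeword still returns the original data with the status "corrected". Flipping any two bits returns "uncorrectable", which is the case where a real system has to stop rather than risk using bad data.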
Lower-end systems, such as notebooks, generally omit ECC to reduce cost. Even when ECC is supported, it’s often turned off in the BIOS to accommodate the possible use of non-ECC memory. The random variances listed above can cause bit-flips, which may crash the machine: an inconvenience, but maybe not terribly critical if you’re browsing Facebook or watching YouTube. It’s also worth noting that any sort of parity checking and bit correction requires some system overhead.
System administrators for high-availability systems have ways of monitoring system logs for bursts of parity errors. Any DIMM that becomes a high-runner (a frequent source of errors) will be re-seated and, if the problem persists, replaced. But that often does not address the root cause, which can be due to design, manufacturing or environmental issues. Design and field service engineers spend a huge amount of time debugging memory errors. Memory scrubbers are useful for detecting errors, but not necessarily for diagnosing them. Typically, margining technologies are needed to detect such things as strobe/data variances and to trace them to specific data bits or lanes.
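As a rough illustration of the log-monitoring side, the sketch below flags DIMMs that log a burst of corrected errors within a sliding time window. Everything specific here is an assumption made for the example: the event format (a timestamp in seconds plus a DIMM label), the function name `flag_high_runners`, and the default window and threshold. Real systems expose these counts in platform-specific ways.

```python
from collections import defaultdict, deque


def flag_high_runners(events, window_s=3600, threshold=10):
    """Return the set of DIMM labels with at least `threshold` corrected
    errors inside any `window_s`-second window.

    `events` is an iterable of (timestamp_seconds, dimm_label) pairs,
    a hypothetical format chosen for this sketch.
    """
    recent = defaultdict(deque)  # per-DIMM timestamps still inside the window
    flagged = set()
    for ts, dimm in sorted(events):
        window = recent[dimm]
        window.append(ts)
        # Drop errors that have aged out of the window.
        while ts - window[0] > window_s:
            window.popleft()
        if len(window) >= threshold:
            flagged.add(dimm)
    return flagged
```

A check like this only says which DIMM is misbehaving. It says nothing about why, which is the gap that margining is meant to fill.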