What happens during DDR memory training/initialization, and how does it differ from what is needed for manufacturing testing or system marginality validation?
On Intel platforms, the BIOS Memory Reference Code (MRC) initializes the memory controller and tunes read/write timing and voltage for optimal performance. The MRC is very complex: its job is to take multiple parallel buses operating at 2 GT/s and beyond and get them to act as “one system”. It does this using sophisticated methods including on-die termination (ODT), read/write leveling (compensating for the flight-time skew that the “fly-by” topology deliberately introduces to avoid simultaneous switching noise), Vref tuning, CMD/CTL/ADDR timing training, and other methods (a simplified sketch of one such training step appears below). It is widely recognized that DDR4 is likely the last parallel bus for interfacing to SDRAM; the physics of wide, parallel, single-ended buses precludes sufficient margin for such devices to operate at higher speeds. The future (actually, the present) involves technologies such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) built on through-silicon vias (TSVs).
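To make the idea concrete, here is a minimal sketch of a read-strobe training sweep of the kind the MRC performs per byte lane: sweep the delay taps, find the passing window (the data eye), and center the strobe within it. The helpers set_read_dqs_delay() and read_burst_ok() are hypothetical placeholders; real MRC code programs controller- and PHY-specific registers.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical PHY access helpers -- real MRC code programs
 * controller-specific registers for each byte lane. */
extern void set_read_dqs_delay(int lane, int tap);  /* one tap ~ a few ps */
extern bool read_burst_ok(int lane);                /* write/read-back a test burst */

#define NUM_TAPS 64

/* Sweep the read DQS delay across all taps, find the passing
 * window (the "data eye"), and center the strobe within it.
 * Returns the chosen tap, or -1 if no passing window was found. */
int train_read_dqs(int lane)
{
    int first_pass = -1, last_pass = -1;

    for (int tap = 0; tap < NUM_TAPS; tap++) {
        set_read_dqs_delay(lane, tap);
        if (read_burst_ok(lane)) {
            if (first_pass < 0)
                first_pass = tap;
            last_pass = tap;
        }
    }

    if (first_pass < 0)
        return -1;                    /* no eye: lane cannot be trained */

    int center = (first_pass + last_pass) / 2;
    set_read_dqs_delay(lane, center); /* maximum margin in both directions */
    return center;
}
```

Centering the strobe in the eye maximizes margin in both directions; this is the “sweet spot” behavior referred to below.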
The MRC’s goal is “booting at all costs”: it brings the memories up as efficiently as possible, finding the “sweet spots” of timing and voltage quickly so that the memories are up and running and the system can continue booting. Speed matters because users want minimal boot times and “always-on” performance. Its fault behavior is dictated by the platform on which it is running: on a consumer device, a major fault on a DIMM will “blue screen” the system, because a laptop user needs to know something catastrophic has happened. On an enterprise system like a server, by contrast, the MRC will quietly disable the channel containing the defective DIMM and continue bringing the rest of the system up, preserving five-nines (99.999%) availability.
And therein lies the major challenge of the MRC: the trade-off between boot time, defect coverage, and fault isolation. In general, the MRC will minimize boot time, which means its defect coverage and fault isolation can be low. To ensure the robustness of fielded systems, more sophisticated testing and even margining methods are needed.
On enterprise systems, in the case of an uncorrectable error, such as a short or open circuit on a DQ (data) bit, the MRC will disable the affected channel or rank. Since there may be up to three DIMMs per channel, the test technician has to do a lot of legwork to identify the failing DIMM. In a system where the memory is soldered down, a disabled rank can leave eight or more suspect devices, which is clearly very difficult to debug.
For other types of defects, such as a short to ground or an open circuit on one leg of a strobe (DQS) signal, or opens/shorts on DIMM GND pins, there may be sufficient margin for the MRC to train the affected memory, but less margin will remain on that SDRAM device or DIMM. The impact may be severe enough that the part later fails in the field, when environmental conditions (temperature, humidity, etc.) push it outside its margin envelope.
So other means should be employed to provide higher defect coverage and better diagnostic resolution than the MRC normally delivers. It is certainly possible to enhance the MRC to exercise the memories more thoroughly, and thereby provide bit-level diagnostics, but this would come at the expense of boot time.
In terms of memory testing for structural faults, such as for manufacturing test, the desired solution is independent of the BIOS (because if the system won’t boot, or the channel is disabled, it is impossible to access and diagnose the affected bit). Ideally, this methodology performs such tests as a byte-enable check, address check, data integrity check, walking 1s, walking 0s, etc. (two of these are sketched in code below). The best way to deliver on this requirement is to spoof the memory controller into thinking that the memory has been fully initialized (analogous to what is done in a “fast boot”), and then run an off-the-shelf memory test script. More detailed material on this approach can be found in the following two blogs: DDR Testing using Spoofed Memory Reference Code Part 1 and DDR Testing using Spoofed Memory Reference Code Part 2.
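As an illustration, here is a minimal sketch of two of the tests named above, written as if the spoofed controller has already mapped the memory under test into the address space; the pointer arguments are assumptions for the sketch, not part of any particular product.

```c
#include <stdint.h>
#include <stddef.h>

/* Walking-1s data-bus test: detects DQ bits that are stuck, open,
 * or shorted together. 'addr' points into the memory under test. */
int test_data_bus(volatile uint64_t *addr)
{
    for (uint64_t pattern = 1; pattern != 0; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern)
            return -1;      /* 'pattern' identifies the failing DQ bit */
    }
    return 0;
}

/* Simplified address-line test: write a marker at each power-of-two
 * offset, disturb offset 0, and verify nothing aliased. A full test
 * repeats the disturb at every tested offset to catch shorts between
 * any two address lines. 'words' must be a power of two. */
int test_address_bus(volatile uint64_t *base, size_t words)
{
    const uint64_t pat  = 0xAAAAAAAAAAAAAAAAull;
    const uint64_t anti = ~pat;

    for (size_t off = 1; off < words; off <<= 1)
        base[off] = pat;

    base[0] = anti;         /* disturb offset 0, then check for aliasing */
    for (size_t off = 1; off < words; off <<= 1)
        if (base[off] != pat)
            return -1;      /* address line corresponding to 'off' is suspect */
    return 0;
}
```

In practice such scripts report exactly which bit or address line failed, giving the bit-level diagnostic resolution the MRC does not provide.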
In terms of memory margining, for example to detect and isolate defective strobes or power signals, the approach should allow the BIOS MRC to run to completion, then exercise the interface with “killer patterns” that stimulate inter-symbol interference (ISI), crosstalk, and simultaneous switching output (SSO) effects, and then observe any timing or voltage margin impact on the affected byte lanes or ranks. This requires an understanding of the baseline margins of a given configured system in terms of memory vendor, memory population, frequency, and so on, as well as a quantitative determination of the degree of margin impact expected from a given defect.
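The exact killer patterns are tuned to a given board’s routing and byte-lane topology, but the sketch below illustrates two common ingredients: an SSO stressor that switches every DQ bit simultaneously, and a pseudo-random (LFSR) stream whose irregular run lengths expose ISI. Both functions are illustrative assumptions, not a specific product’s pattern set.

```c
#include <stdint.h>
#include <stddef.h>

/* SSO stress: every DQ bit switches 0->1->0 together on consecutive
 * bursts, maximizing simultaneous switching current in the I/O buffers. */
void write_sso_stress(volatile uint64_t *buf, size_t words)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = (i & 1) ? 0xFFFFFFFFFFFFFFFFull : 0x0000000000000000ull;
}

/* ISI stress: a pseudo-random bit stream creates the irregular run
 * lengths that expose inter-symbol interference on each lane. */
void write_isi_stress(volatile uint64_t *buf, size_t words)
{
    uint64_t lfsr = 0xACE1ACE1ACE1ACE1ull;  /* arbitrary nonzero seed */
    for (size_t i = 0; i < words; i++) {
        /* 64-bit maximal-length Galois LFSR step (taps 64,63,61,60) */
        lfsr = (lfsr >> 1) ^ (-(lfsr & 1) & 0xD800000000000000ull);
        buf[i] = lfsr;
    }
}
```

After writing and reading back such patterns, the margining step re-sweeps timing or Vref per byte lane and compares the resulting eye widths against the known-good baseline.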
More sophisticated DRAM device cell or array testing uses these patterns, possibly combined with voltage and timing margining, to detect stuck-at faults (SAF), transition faults (TF), coupling faults (CF), neighborhood pattern sensitive faults (NPSF), and address decoder faults (AF). The stimulus can of course be generated in software, but that approach is subject to interruption by maintenance transactions (refresh, for example) and carries a large performance penalty, which reduces its determinism and thoroughness; a hardware engine (BIST within the memory controller itself) is therefore needed.
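The classic algorithms such an engine implements are March tests. For reference, here is the well-known March C- sequence in illustrative software form (using all-0s/all-1s words in place of per-cell 0/1); a BIST engine runs the same six march elements at full bus speed with deterministic timing.

```c
#include <stdint.h>
#include <stddef.h>

/* March C-: up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); up(r0).
 * Software form for illustration only; 'n' is the region size in words. */
int march_c_minus(volatile uint64_t *m, size_t n)
{
    const uint64_t Z = 0, O = ~(uint64_t)0;
    size_t i;

    for (i = 0; i < n; i++) m[i] = Z;                               /* up(w0)      */
    for (i = 0; i < n; i++) { if (m[i] != Z) return -1; m[i] = O; } /* up(r0,w1)   */
    for (i = 0; i < n; i++) { if (m[i] != O) return -1; m[i] = Z; } /* up(r1,w0)   */
    for (i = n; i-- > 0; )  { if (m[i] != Z) return -1; m[i] = O; } /* down(r0,w1) */
    for (i = n; i-- > 0; )  { if (m[i] != O) return -1; m[i] = Z; } /* down(r1,w0) */
    for (i = 0; i < n; i++) { if (m[i] != Z) return -1; }           /* up(r0)      */
    return 0;
}
```

March C- detects SAFs, TFs, AFs, and unlinked CFs; NPSF coverage requires longer, neighborhood-aware sequences.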
For some technical insight into structural testing of memories, see our eBook Testing DDR4 Memory with JTAG (note: requires registration).
To learn about the "tuning" necessary to optimize memory performance, see for example DDR Tuning and Calibration Guide on the Zynq-7000.