In Part 1 of this blog, we discussed some of the limitations of ICT, boundary-scan, and conventional functional test for testing memories. This Part 2 reviews methodologies that overcome these constraints.
The memory controller is responsible for bringing up the memories and, as such, performs many complex and necessary checks to validate that the memory channels are functioning correctly. The process of finding the best operational point for each line of the memory bus is broadly known as “training”. Training involves hundreds of thousands of read/write transactions and the configuration of memory controller registers; from the measured variances, adjustments are calculated for effects such as skew, jitter, temperature, and clocking. If the training process fails, the memory controller disables the affected channel or hangs the boot process. The affected memory thus becomes unavailable, or the system never finishes the boot sequence and never runs the OS, and little diagnostic information results either way. This breaks many traditional functional test methods for diagnosing a faulty memory net.
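To make the training step concrete, here is a minimal sketch of the per-lane “eye centering” such a sequence performs: sweep a delay tap across its range, record which taps pass a write/read comparison, and settle on the center of the longest passing window. The hooks set_dqs_delay and write_read_compare and the tap count are hypothetical placeholders; real MRC code programs controller-specific registers.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical hardware hooks -- real training code would program
 * controller-specific delay registers and issue real DRAM bursts. */
extern void set_dqs_delay(int lane, int tap);   /* program a delay tap   */
extern bool write_read_compare(int lane);       /* burst write/read/compare */

#define NUM_TAPS 64

/* Sweep the delay taps on one byte lane, find the longest run of
 * passing taps (the data "eye"), and settle on its midpoint. */
int train_lane(int lane)
{
    int best_start = -1, best_len = 0;
    int run_start = -1, run_len = 0;

    for (int tap = 0; tap < NUM_TAPS; tap++) {
        set_dqs_delay(lane, tap);
        if (write_read_compare(lane)) {
            if (run_len == 0)
                run_start = tap;
            run_len++;
            if (run_len > best_len) {
                best_len = run_len;
                best_start = run_start;
            }
        } else {
            run_len = 0;
        }
    }

    if (best_len == 0)
        return -1;                       /* no eye found: training fails */

    int center = best_start + best_len / 2;
    set_dqs_delay(lane, center);         /* lock in the eye center */
    return center;
}
```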
Two broad methodologies can address these constraints. Both involve using controller on-chip memory (OCM) or processor cache to run an enhanced set of diagnostic routines on bare metal, since the test algorithms must run independently of the main memory under test. This alone may put the solution beyond the scope of most test engineers: it is rarely a simple matter of taking some code and recompiling it to run in cache, and external access to OCM is typically via JTAG. Fortunately, some off-the-shelf tools provide this capability inherently.
The first methodology involves “instrumenting” the memory reference code (MRC) training sequence to add both debug code and diagnostic algorithms. This can be quite challenging: a deep understanding of the MRC is needed, and the code must be pared down to fit in the available OCM or cache. Scanning large code blocks in via JTAG and then executing them is also fairly time-intensive, while a key attribute of production-level memory testing is that it must run quickly. Instrumenting the MRC is not for the faint of heart.
The second methodology involves “spoofing” the memory controller. On a “golden board” of a specific configuration, the memory controller initialization sequence can be recorded, parsed, pared down in size, and stored as an algorithm. This spoofed MRC is then used to force the memory controller setup on a production board, bypassing the issues that would arise if the training sequence failed. A conventional memory test routine, such as walking 0s and 1s, can then be run from OCM to detect structural and functional defects with bit-level diagnostics.
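As an illustration of such a routine, below is a minimal sketch of a walking-1s/0s data bus test with byte-lane diagnostics. The 64-bit bus width, the test address, and the printf-style reporting are assumptions; production code running from OCM would report over a debug channel instead.

```c
#include <stdint.h>
#include <stdio.h>

/* Walk a single 1 (and its complement, a single 0) across the data bus
 * at one test address. A stuck, open, or shorted data line shows up as
 * a mismatch in the corresponding bit position. */
uint64_t walking_bits_test(volatile uint64_t *addr)
{
    uint64_t failed_bits = 0;

    for (int bit = 0; bit < 64; bit++) {
        uint64_t pattern = 1ULL << bit;

        *addr = pattern;                 /* walking 1 */
        if (*addr != pattern)
            failed_bits |= pattern;

        *addr = ~pattern;                /* walking 0 */
        if (*addr != ~pattern)
            failed_bits |= pattern;
    }
    return failed_bits;                  /* nonzero = failing DQ lines */
}

/* Map failing bits to byte lanes for diagnostics: a dead strobe
 * typically takes out a whole byte lane (DQ56-63 for byte 7). */
void report_failures(uint64_t failed_bits)
{
    for (int lane = 0; lane < 8; lane++) {
        uint8_t lane_bits = (failed_bits >> (lane * 8)) & 0xFF;
        if (lane_bits)
            printf("byte lane %d failed, bits 0x%02X\n", lane, lane_bits);
    }
}
```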
Both methodologies yield similar results in empirical testing. As an example, consider an SO-DIMM with pins 186 and 188 shorted (a short between DQS7 and DQS7#). In this case, shorting the positive and negative sides of the strobe disables the associated byte.
A sample test routine output flags the failed byte.
It is important to note that a basic data bus integrity check like the one sketched above will not be able to distinguish between a DQS7 strobe failure and a failure of the DM7 data mask. The DM is an input mask signal for write data, so it could conceivably be open, stuck, or shorted. The memory test chosen must be able to distinguish faults on as many of the DQ, DQS, DM, Ax, BAx, RASx, CASx, etc. signals as possible.
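For the address lines, a classic complement is the power-of-two address bus test: a stuck or shorted Ax line aliases two test locations onto the same physical cell, which corrupts a marker value. The sketch below assumes a region of 64-bit words and is illustrative only, not the algorithm of any particular vendor tool.

```c
#include <stdint.h>

/* Power-of-two address bus test: write a marker at the base address and
 * at every address that differs from it by exactly one address bit, then
 * disturb each of those locations in turn and verify no other location
 * changes. An aliased (stuck or shorted) address line corrupts a marker. */
int address_bus_test(volatile uint64_t *base, uint64_t size_bytes)
{
    const uint64_t PATTERN = 0xAAAAAAAAAAAAAAAAULL;
    const uint64_t ANTI    = 0x5555555555555555ULL;
    uint64_t words = size_bytes / sizeof(uint64_t);

    /* Seed the base address and every power-of-two offset. */
    base[0] = PATTERN;
    for (uint64_t off = 1; off < words; off <<= 1)
        base[off] = PATTERN;

    /* Disturb one offset at a time with the anti-pattern. */
    for (uint64_t test = 1; test < words; test <<= 1) {
        base[test] = ANTI;
        if (base[0] != PATTERN)
            return -1;                    /* address bit stuck low */
        for (uint64_t off = 1; off < words; off <<= 1) {
            if (off != test && base[off] != PATTERN)
                return (int)test;         /* two address lines aliased */
        }
        base[test] = PATTERN;             /* restore before next pass */
    }
    return 0;                             /* address lines OK */
}
```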
For more information on the first methodology described above, see our eBook, Cache-as-RAM to bring up non-booting boards.