Hardware validation test suites often initiate repeated PCIe link retrains to look for hardware design issues or LTSSM RTL bugs in the device under test. These test suites take a very long time to run. Is there a way to speed them up?
Serial protocols like PCI Express (PCIe) provide very high speed and throughput. That performance comes at the cost of considerable complexity within the physical layer, and with this complexity comes a higher probability of bugs in the upstream and downstream devices, which might only surface intermittently over extended periods of testing. Ferreting these bugs out is often part of the hardware qualification process, and repeated link retraining is one commonly used mechanism.
Physical layer link initialization and training is a very complex process. In PCIe devices, this process performs many important tasks, such as link width negotiation, link data rate negotiation, bit lock per lane, symbol lock/block alignment per lane, and decision-feedback equalization (DFE), among others. All of these functions are carried out by the Link Training and Status State Machine (LTSSM), which observes the stimulus from the remote link partner as well as the current state of the link, and responds accordingly.
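As a rough illustration of how the LTSSM reacts to stimulus, the Python sketch below models a heavily simplified subset of its top-level states. The state names follow the PCIe specification, but the event names are hypothetical labels chosen for this example, and the real state machine has many more states, substates, and timeout rules than shown here.

```python
from enum import Enum


class State(Enum):
    DETECT = "Detect"
    POLLING = "Polling"
    CONFIGURATION = "Configuration"
    L0 = "L0"
    RECOVERY = "Recovery"


# Simplified transition table: (current state, event) -> next state.
# Event names are illustrative assumptions, not specification terms.
TRANSITIONS = {
    (State.DETECT, "receiver_detected"): State.POLLING,
    (State.POLLING, "bit_and_symbol_lock"): State.CONFIGURATION,
    (State.POLLING, "timeout"): State.DETECT,
    (State.CONFIGURATION, "link_configured"): State.L0,
    (State.CONFIGURATION, "timeout"): State.DETECT,
    (State.L0, "retrain_requested"): State.RECOVERY,
    (State.RECOVERY, "lock_reacquired"): State.L0,
    (State.RECOVERY, "timeout"): State.DETECT,
}


def step(state, event):
    """Return the next state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def walk(state, events):
    """Apply a sequence of events and return every state visited."""
    path = [state]
    for event in events:
        state = step(state, event)
        path.append(state)
    return path


if __name__ == "__main__":
    boot = walk(State.DETECT,
                ["receiver_detected", "bit_and_symbol_lock", "link_configured"])
    retrain = walk(State.L0, ["retrain_requested", "lock_reacquired"])
    print(" -> ".join(s.value for s in boot))
    print(" -> ".join(s.value for s in retrain))
```

In this simplified view, a retrain takes a link that is already up in L0 through Recovery and back, which is far shorter than the full bring-up path from Detect.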
Links are trained automatically during the boot-up process, so it is possible to validate PCIe links by rebooting a platform numerous times and looking for training errors. The number of reboots required is related to the acceptable failure rate of PCIe over the lifespan of the product: the lower the acceptable failure rate, the more retrains are needed to demonstrate it. Of course, this could take a huge amount of time if one is using the operating system to do so; climbing all the layers of the stack (physical, data link, network, transport, session, etc.) on every iteration would take far too long. There are shortcuts that avoid going all the way up the stack, but the best solution uses embedded instrumentation within the silicon itself.
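To make the relationship between acceptable failure rate and test count concrete, here is a rough sketch in Python. It assumes each retrain is an independent trial and that the campaign passes only if no failure is observed. The function names and the example numbers (a 1-in-100,000 failure rate, 120 seconds per reboot, 50 ms per register-driven retrain) are illustrative assumptions, not measurements from any real platform.

```python
import math


def retrains_needed(max_failure_rate, confidence=0.95):
    """Smallest n such that n failure-free retrains show, at the given
    confidence, that the per-retrain failure rate is below max_failure_rate.

    Solves (1 - p)^n <= 1 - confidence for n.
    """
    if not 0.0 < max_failure_rate < 1.0:
        raise ValueError("max_failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-max_failure_rate))


def campaign_hours(retrains, seconds_per_retrain):
    """Wall-clock hours for a campaign run back to back on one platform."""
    return retrains * seconds_per_retrain / 3600.0


if __name__ == "__main__":
    n = retrains_needed(1e-5)  # assumed target: 1 failure in 100,000 retrains
    print(f"retrains needed: {n}")
    print(f"via full reboot (assumed 120 s each): {campaign_hours(n, 120):.0f} h")
    print(f"via register access (assumed 50 ms each): {campaign_hours(n, 0.05):.1f} h")
```

Under these assumptions the campaign needs about 300,000 failure-free retrains. At two minutes per reboot that is roughly 10,000 hours on a single platform, while at 50 ms per retrain it is a little over four hours.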
For PCIe hardware validation, the best solution involves embedded run-control running on a baseboard management controller (BMC) within the platform. The BMC uses JTAG to initiate the register reads and writes necessary to exercise the LTSSM without having to climb up and down the stack. This way, hundreds of thousands of link retrains can be exercised on multiple platforms within a reasonable period of time. The embedded run-control looks for speed errors, width errors, upstream component (USC) and downstream component (DSC) correctable and uncorrectable errors, and other failures.
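The retrain-and-check loop described above might look something like the following Python sketch. The register offsets and bit positions (Link Control at 0x10 with Retrain Link in bit 5, Link Status at 0x12 with Link Training in bit 11, current speed in bits 3:0, and negotiated width in bits 9:4) come from the PCI Express Capability structure. Everything else is an assumption for illustration: `FakeLinkPort` is a hypothetical simulated port standing in for whatever register access the run-control actually provides over JTAG, and the sketch checks only speed and width, not correctable or uncorrectable error status.

```python
from dataclasses import dataclass

# Offsets within the PCI Express Capability structure.
LINK_CONTROL = 0x10
LINK_STATUS = 0x12
RETRAIN_LINK = 1 << 5    # Link Control: write 1 to request a retrain
LINK_TRAINING = 1 << 11  # Link Status: set while training is in progress


class FakeLinkPort:
    """Hypothetical simulated port; a real agent would issue these
    reads and writes through its own run-control interface."""

    def __init__(self, speed, width, degrade_on=()):
        self.width = width
        self.cur_speed, self.cur_width = speed, width
        self.degrade_on = set(degrade_on)  # 1-based retrains that come up x1
        self.retrains = 0
        self.busy_polls = 0

    def read16(self, offset):
        if offset != LINK_STATUS:
            return 0
        status = self.cur_speed | (self.cur_width << 4)
        if self.busy_polls > 0:  # pretend training takes a few polls
            self.busy_polls -= 1
            status |= LINK_TRAINING
        return status

    def write16(self, offset, value):
        if offset == LINK_CONTROL and value & RETRAIN_LINK:
            self.retrains += 1
            self.busy_polls = 2
            degraded = self.retrains in self.degrade_on
            self.cur_width = 1 if degraded else self.width


@dataclass
class RetrainResult:
    iterations: int = 0
    timeouts: int = 0
    speed_errors: int = 0
    width_errors: int = 0


def run_retrains(port, count, expected_speed, expected_width, max_polls=1000):
    """Retrain the link `count` times and tally speed/width mismatches."""
    result = RetrainResult()
    for _ in range(count):
        result.iterations += 1
        control = port.read16(LINK_CONTROL)
        port.write16(LINK_CONTROL, control | RETRAIN_LINK)
        for _ in range(max_polls):
            status = port.read16(LINK_STATUS)
            if not status & LINK_TRAINING:
                break
        else:
            result.timeouts += 1  # training never completed
            continue
        if status & 0xF != expected_speed:
            result.speed_errors += 1
        if (status >> 4) & 0x3F != expected_width:
            result.width_errors += 1
    return result


if __name__ == "__main__":
    port = FakeLinkPort(speed=3, width=8, degrade_on={2})
    print(run_retrains(port, count=5, expected_speed=3, expected_width=8))
```

Because each iteration is only a handful of register accesses, the loop never touches the operating system, which is what makes campaigns of hundreds of thousands of retrains practical.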
Want to learn more about this application? Review our free eBook, ScanWorks Embedded Diagnostics (note: requires registration).