Last week I published a blog post on detecting defects on PCI
Express. Is testing Intel QuickPath Interconnect (QPI) any different?
In the blog post Testing PCI Express, I mentioned that PCIe uses
interference-canceling differential signaling and jitter-canceling embedded
clocking. Intel QuickPath Interconnect (QPI) operates a little differently.
Instead of using an embedded clock, QPI uses a separate forwarded clock lane
for every 20 data lanes. The bus is differential, but unlike PCIe it is not
AC-coupled. It is capable of both half-width and quarter-width operation,
similar (but not identical) to PCIe, and in addition supports both data lane
and clock lane failover. For example, if a QPI data lane fails, the quadrant
that lane is in is marked as unavailable, and the link drops back to x10 or x5
operation as needed. With clock lane failover, an existing data lane can
substitute for a failed clock lane, and in turn the link width falls back as
above. There are numerous other differences. An excellent description of the
reliability mechanisms of QPI is in the Intel Press book, Weaving High
Performance Multiprocessor Fabric.
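To make the width fallback concrete, here is a minimal sketch in Python of the behavior described above. It is a simplified model, not Intel's implementation. It assumes 20 data lanes numbered 0 to 19 and grouped into four quadrants of five lanes, that any two healthy quadrants can carry an x10 link, and that a data lane taking over for a failed clock removes its quadrant from data use. The function name `link_width` and the `substitute_lane` parameter are hypothetical.

```python
LANES_PER_QUADRANT = 5   # assumption: 20 data lanes split into 4 quadrants
QUADRANTS = 4


def link_width(failed_data_lanes, clock_failed=False, substitute_lane=0):
    """Return the usable link width (20, 10, 5, or 0) in this simplified model.

    failed_data_lanes: iterable of failed data lane numbers (0..19).
    clock_failed:      True if the forwarded clock lane has failed.
    substitute_lane:   hypothetical choice of the data lane that takes over
                       as the clock; real hardware chooses this itself.
    """
    unavailable = set(failed_data_lanes)
    total_lanes = LANES_PER_QUADRANT * QUADRANTS
    for lane in unavailable | {substitute_lane}:
        if not 0 <= lane < total_lanes:
            raise ValueError(f"lane {lane} is out of range")

    if clock_failed:
        if substitute_lane in unavailable:
            return 0  # assumption: no clock can be recovered, link is down
        # The substitute lane now carries the clock, so it is lost for data.
        unavailable.add(substitute_lane)

    # A quadrant with any unavailable lane is marked unavailable as a whole.
    bad_quadrants = {lane // LANES_PER_QUADRANT for lane in unavailable}
    good = QUADRANTS - len(bad_quadrants)

    if good == QUADRANTS:
        return 20   # full width
    if good >= 2:
        return 10   # half width
    if good == 1:
        return 5    # quarter width
    return 0        # no usable quadrant


if __name__ == "__main__":
    print(link_width([]))                    # no failures
    print(link_width([7]))                   # one bad lane in one quadrant
    print(link_width([0, 7, 12]))            # three quadrants affected
    print(link_width([], clock_failed=True)) # clock failover only
```

In this model a single failed lane costs half the link width, which is why a structural defect on one QPI lane can show up as a change in link width rather than as a dead link.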
From a failure detection point of view, the technologies that
can detect and diagnose QPI failures are as follows:
Boundary-scan test (JTAG) can detect any structural defect on QPI,
such as short and open circuits.
Processor-controlled test can detect some structural
short/open defects that result in a change in link width or speed. In addition,
it can detect at-speed faults, such as clock failures, that result in changes
in link width or speed.
HSIO for Intel Architecture (Intel IBIST) will discover some of the above structural
and at-speed defects, and in addition uses margining and bit error rate testing
to gauge the performance of the bus. Thus, it can detect more exotic failures,
such as bad device lots, power droops that affect lane performance but not
link speed or width, and component or system aging. For more information on
these, see my blog post Pollution, Power Margins, and SerDes Problems.