Out-of-band access between on-board BMCs and CPU JTAG chains is now a de facto standard on hyperscale servers built by the ODMs. This article describes the BMC firmware library within our ScanWorks Embedded Diagnostics (SED) product, which is used to run diagnostics at scale.
As described in the article Microsoft Project Olympus Schematics and Embedded JTAG Run-Control, Microsoft’s disclosure of embedded JTAG between the ASPEED Baseboard Management Controller (BMC) and the Intel, AMD, and ARMv8 (Cavium and Qualcomm) CPUs within its Project Olympus servers established this hyperscale giant’s requirement for run-control functionality in-situ within its data centers. JTAG and run-control form the foundation for hardware-level diagnostics that cannot be executed in any other way: for example, retrieval of machine-check and uncore register contents in a wedged system. The topology looks like this (block diagram courtesy of Microsoft):
Note that the red oval in the diagram shows that GPIOs from the BMC have access to the traditional XDP (eXtended Debug Port) connections used to initiate debug and trace functions, which are normally performed using a benchtop debugger like SourcePoint. This is a new hardware modification made on Project Olympus server designs to enable debug forensics initiated via JTAG and the sideband signals necessary for run-control.
A unique aspect of run-control is that, for some operations, it runs several orders of magnitude faster than OS-based utilities. Examples are described in Using Embedded Run-Control for PCIe Link Training Testing Part 1 and Part 2. Initiating link training directly in hardware, without having to run all the way up and down the OS stack, is a key performance advantage.
Run-control achieves its true potential when it is embedded within a server’s BMC. ASSET provides this as a value-added alternative to running it remotely on a host PC. Debug can then truly scale across tens, hundreds or even thousands of servers without the need for a remote host connection. In this case JTAG acts as a true “agent-based” debugger, free of the encumbrances of cables, hardware interfaces, and bloated remote workstation software applications. This approach also avoids the overhead of the hundreds or even thousands of separate transactions between the remote PC and the target BMC needed to execute the run-control primitives. Metaphorically, the two contrast like this:
ASSET’s embedded SED Library comprises a set of APIs that execute on the BMC of, for example, an Intel-based design. The SED Library uses Intel In-Target Probe (ITP) run-control to perform register, memory and I/O reads/writes via the Intel JTAG/XDP interface. SED provides very powerful out-of-band (independent of the operating system) access to critical system state information, particularly in the event of a system crash or hang. Because the embedded ITP functionality is available in-situ, run-control capabilities can be accessed remotely (i.e. over TCP/IP), without the constraints of physical hardware probes, cables, and physical access to the board.
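To make the crash-forensics use case concrete, here is a minimal C sketch of how BMC-side code might walk the machine-check banks of a hung CPU. It is not the SED API: the `msr_read_fn` callback, the `mc_error` struct, `collect_mc_errors` and `fake_read_msr` are all hypothetical names invented for illustration. Only the MSR addresses and the IA32_MCi_STATUS bit positions are taken from Intel's public architecture documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural MSR addresses (Intel SDM). */
#define IA32_MCG_CAP    0x179u  /* bits 7:0 hold the number of MC banks */
#define IA32_MC0_STATUS 0x401u  /* status of bank i is at 0x401 + 4*i   */

/* IA32_MCi_STATUS flag bits. */
#define MCI_STATUS_VAL (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61)  /* error was not corrected      */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt    */

/* HYPOTHETICAL stand-in for an out-of-band MSR read through run-control.
 * ctx would identify the target (e.g. which hardware thread).
 * Returns 0 on success. This is an assumption, not the SED signature. */
typedef int (*msr_read_fn)(void *ctx, uint32_t msr, uint64_t *value);

typedef struct {
    unsigned bank;         /* bank index                         */
    uint64_t status;       /* raw IA32_MCi_STATUS value          */
    uint16_t mca_code;     /* MCA error code, status bits 15:0   */
    int uncorrected;       /* 1 if the UC bit is set             */
    int context_corrupt;   /* 1 if the PCC bit is set            */
} mc_error;

/* Scans every bank and stores those with a valid logged error.
 * Returns the number of entries written, or -1 if a read fails. */
int collect_mc_errors(msr_read_fn read_msr, void *ctx,
                      mc_error *out, size_t max_out)
{
    uint64_t cap = 0;
    size_t found = 0;

    assert(read_msr != NULL && out != NULL);
    if (read_msr(ctx, IA32_MCG_CAP, &cap) != 0)
        return -1;

    unsigned banks = (unsigned)(cap & 0xFFu);
    for (unsigned i = 0; i < banks && found < max_out; i++) {
        uint64_t status = 0;
        if (read_msr(ctx, IA32_MC0_STATUS + 4u * i, &status) != 0)
            return -1;
        if ((status & MCI_STATUS_VAL) == 0)
            continue;  /* nothing logged in this bank */
        out[found].bank = i;
        out[found].status = status;
        out[found].mca_code = (uint16_t)(status & 0xFFFFu);
        out[found].uncorrected = (status & MCI_STATUS_UC) != 0;
        out[found].context_corrupt = (status & MCI_STATUS_PCC) != 0;
        found++;
    }
    return (int)found;
}

/* Fake target used only to exercise the sketch: three banks,
 * with an uncorrected error logged in bank 1. */
int fake_read_msr(void *ctx, uint32_t msr, uint64_t *value)
{
    (void)ctx;
    if (msr == IA32_MCG_CAP)
        *value = 3;
    else if (msr == IA32_MC0_STATUS + 4u)
        *value = MCI_STATUS_VAL | MCI_STATUS_UC | 0x0135u;
    else
        *value = 0;
    return 0;
}
```

Passing the read routine as a callback keeps the decoding logic independent of the transport, so the same loop could be exercised against a fake target on a workstation before being pointed at real hardware.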
Programming the ASSET SED library is fairly straightforward for engineers with a good knowledge of x86 architecture. The library itself is ultra-compact, with a small memory and flash footprint suited to embedded systems, and is fully documented with many code samples. As an example, consider the ReadMSR function, whose use was referenced in the article Spectre, Meltdown, and Embedded MSR Access, and which is documented via man page:
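The man page defines the real interface. Purely as an illustration of what a caller might do with values returned by an MSR read in the Spectre and Meltdown context, the C sketch below decodes two speculation-related registers. The MSR addresses and bit positions come from Intel's public documentation; the `spec_state` struct and `decode_spec_state` function are hypothetical names and are not part of the SED library. The sketch takes already-read 64-bit values as input, so it makes no assumption about the ReadMSR signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural MSR addresses (Intel SDM). IA32_ARCH_CAPABILITIES is
 * only present on processors that enumerate it through CPUID. */
#define IA32_SPEC_CTRL         0x48u
#define IA32_ARCH_CAPABILITIES 0x10Au

/* IA32_ARCH_CAPABILITIES bits. */
#define ARCH_CAP_RDCL_NO  (1ULL << 0)  /* not susceptible to Meltdown */
#define ARCH_CAP_IBRS_ALL (1ULL << 1)  /* enhanced IBRS supported     */

/* IA32_SPEC_CTRL bits. */
#define SPEC_CTRL_IBRS  (1ULL << 0)    /* IBRS currently enabled      */
#define SPEC_CTRL_STIBP (1ULL << 1)    /* STIBP currently enabled     */

/* HYPOTHETICAL summary type, for illustration only. */
typedef struct {
    int meltdown_immune;  /* RDCL_NO reported by the processor */
    int enhanced_ibrs;    /* IBRS_ALL reported                 */
    int ibrs_enabled;     /* IBRS bit set in IA32_SPEC_CTRL    */
    int stibp_enabled;    /* STIBP bit set in IA32_SPEC_CTRL   */
} spec_state;

/* Decodes raw register values that were read out-of-band.
 * arch_caps: value of IA32_ARCH_CAPABILITIES (0x10A)
 * spec_ctrl: value of IA32_SPEC_CTRL (0x48) */
void decode_spec_state(uint64_t arch_caps, uint64_t spec_ctrl,
                       spec_state *out)
{
    assert(out != NULL);
    out->meltdown_immune = (arch_caps & ARCH_CAP_RDCL_NO) != 0;
    out->enhanced_ibrs   = (arch_caps & ARCH_CAP_IBRS_ALL) != 0;
    out->ibrs_enabled    = (spec_ctrl & SPEC_CTRL_IBRS) != 0;
    out->stibp_enabled   = (spec_ctrl & SPEC_CTRL_STIBP) != 0;
}
```

Because MSRs are per hardware thread, a fleet-wide audit would presumably repeat the two reads for each thread of interest and feed each pair of values through the decoder.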
For more information on ScanWorks Embedded Diagnostics, please register for our eBook, SED Technical Overview.