Out of the frying pan and into the fire: I beat the #&@! out of my new CPU from AMD, and the segmentation faults have gone away. But now the new system is crashing!
In the MinnowBoard Chronicles Episodes 27 and 28, I wrote about my discovery that I had acquired an older AMD Ryzen 7 1700X CPU from Amazon. The older production runs of these chips exhibited cache-coherency problems that manifest themselves only rarely, when you're cranking all 16 threads simultaneously. And that was just what I was doing with my Yocto Linux builds for the MinnowBoard Turbot and Portwell Neptune Alpha boards: the compilation process maxes out CPU utilization. This past month, I RMA'ed my older CPU (which had a datestamp of work week 09) to AMD, and very promptly got a replacement in the mail (a nod to AMD for responding so quickly and efficiently). When I unwrapped it, I was delighted to see that it was a much more current production run:
If you look carefully, you'll see that the datestamp is "1737", versus the "1709" on my previous device. From my research on the web (via, for example, New Ryzen Is Running Solid Under Linux, No Compiler Segmentation Fault Issue), we know that any CPU with a datestamp after work week 25 should be good. So, I carefully installed the new CPU into my home-built PC (always a nerve-wracking process, by the way, since this is my money and I don't want to mess anything up), and put it to the test by running the Kill Ryzen script again:
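For anyone unfamiliar with it, Kill Ryzen hammers every hardware thread with parallel GCC builds in a loop until one of them dies. The actual script does considerably more (RAM disks, timestamped logging, and so on, as I recall), but the gist is something like this minimal sketch, where the ~/gcc-build tree and the log paths are purely illustrative:

#!/bin/bash
# Sketch of the Kill Ryzen idea: one endless compile loop per hardware
# thread; stop as soon as any loop fails (e.g. with a segfault).
for i in $(seq 1 "$(nproc)"); do
  (
    while true; do
      # Rebuild something compile-heavy; any failure exits this loop.
      make -C ~/gcc-build clean all > "/tmp/loop-$i.log" 2>&1 || exit 1
    done
  ) &
done
wait -n   # returns when the first loop dies (bash 4.3+)
echo "A build loop failed -- check /tmp/loop-*.log"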
It ran for several hours, and it was getting late, so I let it run overnight. When I got up in the morning, this is what I saw:
In the middle of the night, it crashed the system, with a kernel dump! This bears further investigation. Is it some new flaw in the new chip? Some incompatibility with Ubuntu? Did I re-install the new CPU correctly, with the right amount of thermal paste? Or maybe a bug in the Kill Ryzen script? So many interesting avenues to explore, so little time…
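When I circle back to this, the first stop will be the kernel log from that boot. Ubuntu 16.04 runs systemd, so something along these lines should recover the messages, assuming persistent journaling is enabled (it isn't by default; it needs Storage=persistent in /etc/systemd/journald.conf, or an existing /var/log/journal directory):

journalctl -k -b -1 | tail -n 100    # kernel messages from the previous boot

grep -i -B 2 -A 20 'panic\|oops' /var/log/kern.log    # traditional syslog fallback (check rotated files too)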
It's easy to get distracted while on a mission, but I decided to put this issue in the parking lot for now and focus on the main goal: doing a Linux build for the MinnowBoard Turbot. Something is apparently still suspect with my system, but it can at least run for several hours without the segmentation faults manifesting, which leads me to believe that I can probably get through a Yocto build consistently, without any failures. It was time to put that to the test.
Since it had been a while since my last Yocto build, I decided to do a QEMU emulator run from a fresh environment, following the directions in the Yocto Project Quick Start Guide. QEMU images, I've found, are easier to build, with less chance of user error on my part. I used the same approach documented in The MinnowBoard Chronicles Episode 22: Project Yocto success!, and it fired up right away and started running. Normally, a QEMU image takes about 45 minutes to build on my new PC using all 16 threads (when the old CPU wasn't crashing with segmentation faults, that is), so I stepped away for a coffee and let it run:
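For reference, the Quick Start recipe boils down to just a handful of commands; this is roughly what I ran (the rocko branch corresponds to the 2.4 release that the guide covers):

git clone git://git.yoctoproject.org/poky
cd poky
git checkout -b rocko origin/rocko
source oe-init-build-env     # creates and enters the build directory
bitbake core-image-sato      # builds a QEMU-bootable image
runqemu qemux86              # boots the result under QEMU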
Alas, when I returned, the system was sitting at the Ubuntu login screen! Somehow, something bad was happening during the build: it would never complete, but would instead reboot and drop me at the login screen. I tried this numerous times, and always got the same result. I looked into the logs in ~/poky/build/tmp/log/cooker/qemux86 and saw that the build got to task 4283 of 6148, but then the log simply ended, with no failure information. The last lines looked like:
NOTE: Running task 4281 of 6148 (/home/alan/poky/meta/recipes-core/dbus/dbus-glib_0.108.bb:do_package)
NOTE: recipe libxt-1_1.1.5-r0: task do_package: Started
NOTE: recipe dbus-glib-0.108-r0: task do_package: Started
NOTE: recipe eudev-3.2.2-r0: task do_compile: Succeeded
NOTE: Running task 4282 of 6148 (/home/alan/poky/meta/recipes-core/udev/eudev_3.2.2.bb:do_install)
NOTE: recipe eudev-3.2.2-r0: task do_install: Started
NOTE: recipe libtirpc-1.0.2-r0: task do_package: Started
NOTE: recipe cairo-1.14.10-r0: task do_configure: Succeeded
NOTE: Running task 4283 of 6148 (/home/alan/poky/meta/recipes-graphics/cairo/cairo_1.14.10.bb:do_compile)
NOTE: recipe cairo-1.14.10-r0: task do_compile: Started
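The cooker log only records task scheduling, so the next place to look is the per-task log that BitBake keeps in each recipe's work directory, under temp/log.do_<task>. If the system reset mid-write, that log may be truncated too, but it's still worth checking. The exact work path depends on the machine and tune settings, so the second path below is illustrative:

cd ~/poky/build
find tmp/work -path '*cairo*' -name 'log.do_compile*'
tail -n 50 tmp/work/i586-poky-linux/cairo/1.14.10-r0/temp/log.do_compile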
Somewhat frustrated at this point, I elected to try a different approach, with a completely fresh environment. I had had some earlier success with Yocto using VirtualBox on my old machine, so I installed VirtualBox on my new machine and clicked on "New" to begin the new installation. What I found was that only 32-bit operating systems were supported! It took a little digging around on Google, but I finally found out that, by default, my AMI BIOS did not have hardware virtualization (AMD-V) enabled. I had to boot into the UEFI setup and enable it first. It was really hard to find: in the AMI UEFI BIOS Utility, it's buried under the Advanced menu, labeled "SVM Mode". After this was done, I was finally able to create a new 64-bit virtual machine.
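Incidentally, there's a quicker way to confirm this from inside Linux before rebooting into the BIOS; these are stock checks, nothing Ryzen-specific:

egrep -c '(vmx|svm)' /proc/cpuinfo    # non-zero means the CPU supports AMD-V/VT-x
                                      # (the flag can still show when the BIOS has it disabled)

sudo apt install cpu-checker          # on Ubuntu, kvm-ok gives a definitive answer:
sudo kvm-ok                           # prints "KVM acceleration can be used" when enabled

dmesg | grep -i kvm                   # when SVM is off, the kernel typically logs "disabled by bios"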
I decided to go ahead and install Ubuntu 16.04.3 LTS desktop, the same as I had on my separate Linux partition, to see if running it in a VM made any difference. I could always install Debian or any of the other distributions later.
So, once again following the instructions in the Yocto Project Quick Start 2.4 document, I kicked off another QEMU bitbake. The VM defaults to a single virtual CPU, so BitBake used only one thread, not the 16 that I have on my AMD CPU, but I decided to let it run anyway and see what happened. And it ran to completion!
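(If I want the VM to see more cores for the next experiment, VirtualBox lets me change that while the VM is powered off, either in the GUI under Settings > System > Processor, or from the command line; "yocto-build" below is just a placeholder for the VM's actual name:)

VBoxManage modifyvm "yocto-build" --cpus 8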
Now, that is a real clue. On my separate Ubuntu partition, the build was blasting away with all 16 threads and never finishing. Cut it back to one thread and run it in a VM, and it finished (albeit in almost 7 hours, compared to the 45 minutes it took when it managed to run to completion on the older AMD CPU with 16 threads). There is a BB_NUMBER_THREADS variable that I can set in my project's local.conf configuration file to adjust the thread count. Maybe I should step it up to higher numbers and see when it starts to crash? Stay tuned!
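If I go down that road, the experiment is a two-line change in build/conf/local.conf; the values below are just a starting guess:

BB_NUMBER_THREADS = "4"    # how many BitBake tasks run in parallel
PARALLEL_MAKE = "-j 4"     # the -j value passed to make inside each task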