I’m never building a new PC again [without a retailer warranty.]
— Me, literally 2 months ago

FUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUCK.

I helped Maus spec a new high-end AM5 rig, consisting of:

  • ASUS ProArt Creator
  • AMD Ryzen 7950X3D
  • G.Skill 4x16GB DDR5 @ 6400MT/s
  • Samsung 990 Pro (main drive)
  • WD SN850X 4TB (backup/Steam drive)

We were going to reuse:

  • Old Corsair full-tower case (Ancestor of the Obsidian 800D.)
  • EVGA SuperNOVA 1200 P2 power supply
  • EVGA RTX 2080 Super

We were going to retire:

  • Samsung 850 EVO (SATA SSD)
  • 1TB WD Black (SATA HDD)
  • 150GB WD Velociraptor (SATA HDD)

Tuesday Evening

We get the machine assembled and I run it through MemTest86+; immediately we notice the RAM won’t hold its XMP profile at 6400MT/s. There are errors at basically every address. After some research I conclude that AMD seems to prefer DDR5-6000 kits, and this kit was not designed for “AMD EXPO”, so I manually downtune the RAM and it completes a pass satisfactorily.

I write this off as inexperience w/ the AM5 platform, make a mental note that the QVL is perhaps not just a suggestion, and move on to installing Windows. We get into Windows and the installation is going fairly smoothly until we hit a small hiccup: the NVIDIA driver installation is hanging.

I use a driver uninstaller, which works perfectly, so I write this off as not being patient enough with Windows Update. I run some basic benchmarks and see that the SSD is performing nominally, the CPU is stable w/ Prime95, etc. Satisfied with the build, and realizing it is now pretty late in the evening, I bolt the PC up and let Maus set it up in its rightful place. (aka: not the living room.)

A short while later Maus informs me the PC crashed almost immediately after installing and booting the game Darktide. I make a token effort to troubleshoot and the PC is now starting to crash with BSODs. They look eerily similar to the types of errors I was seeing building MINAKO, including:

  • memory / paging related crashes (unexpected store, memory cache, etc.)
  • filesystem filter crashes
  • driver IRQL errors
  • critical process crashes / watchdog timeouts

I suggest that we try swapping in my known good power supply in the daylight to rule out a bad power supply, since the weird failures of MINAKO are fresh in my mind. (Also a storm the weekend prior killed one monitor, one hard drive in a server I administer, and created 6 uncorrectable errors on all 6 of the drives in my workstation. — In other words I’m not totally convinced that the power supply is still good just because it was good a few days ago.)

Wednesday

We swap in my EVGA 850 T2 power supply, which thankfully is quite easy due to two factors in our favor:

  1. The modular pinout is the same, gods bless EVGA.
  2. Maus’ full-tower case makes it trivial to fit the new power supply from the front.

The PC seems somewhat stable for a bit, but begins crashing again. After much troubleshooting we eventually try a memory benchmark by PassMark that runs inside Windows: it almost immediately shows errors.

After lots of fucking about with different DIMM configurations we conclude that potentially one of the RAM kits was bad. We reinstall the OS, due to it being mondo corrupted by all the crashes, and Maus plays CP2077 for a while. The PC seems to be reasonably stable. Somewhere in all this troubleshooting we end up getting one of the 2x16GB RAM kits to no-POST, so we set it aside and run with less memory for now.

Around this time I decide to order a memory kit from the QVL just to rule out it being a compatibility issue with the kit. After running for a bit CP2077 eventually crashes, hard. I admit defeat and go to bed.

Thursday

Maus swaps in the AMD EXPO DDR5-6000 2x32GB RAM kit I purchased, and it immediately shows errors in the PassMark test. We swap back to the “known good” RAM kit and it is now also showing errors.

Thoroughly weirded out by this I decide to try one last thing: I use a different SSD, basically the only component we haven’t changed at this point. The system basically immediately crashes upon getting to the Windows desktop.

At this point we’re basically down to it being a bad motherboard or a bad CPU. Given my general lack of experience with “marginally bad” CPUs, and empowered by seeing a YouTuber struggle with basically an identical issue on the same platform, I make the call to RMA the board along with the one RAM kit we have that is now behaving as a no-POST.

Friday

We unbuild Maus’ PC and box everything up. There’s not much to say here; we’re both pretty frustrated and disappointed in AM5’s showing. I know bad parts happen, but it’s immensely frustrating when it happens to a lone system builder who doesn’t have a stockroom full of spare parts to play with.

I notice an error in the case: there is an extra standoff. This has probably been in the case since it was originally built. It is stamped “eATX/ATX”, but it is not a standard ATX pattern bolt. Upon telling Maus this we rebuild the PC just to make sure this was not shorting something out. (Though I see no thru-hole components in the vicinity, and no obvious damage to the board.)

At this point I’ve basically built four computers (Mr. Saturday, ACHILLES, Mr. Saturday again, and then ACHILLES again.) I’m pretty much fucking fed up with consumer electronics. Which is why I decide to double down on the insanity and purchase my own AM5 build to aid in future troubleshooting. (spoiler alert: NAGATO2 is getting replaced for 2024, after all.)

Saturday

We return the parts and ruminate on our poor financial decisions. Maus decides, after being similarly frustrated with the misadventure, that ACHILLES will be retired as-is, as Mr. Saturday is going to get a new case and ATX3 power supply.

Sunday

My (much cheaper) ASUS TUF X670 board shows up, along with a 7950X (not X3D), and an extra AM5 cooler for myself, since Maus’ Noctua was pressed back into service for his old rig. Some observations from the bench:

  1. The remaining 2x16GB kit that was “known good” holds its XMP OC of 6400MT/s with absolutely no problems. Passes MemTest86+ completely.

  2. The replacement 2x32GB “AMD EXPO” kit I bought also passes MemTest86+ fine.

  3. After booting to Windows both kits run Cinebench and PassMark’s memory test just fine. (Remember: the 2x32 kit almost immediately failed the PassMark test when it was first installed.)

At this point I’m thinking “I guess it was the board after all”, but I remember there is one part that hasn’t been swapped: the CPU.

  • I install Maus’ 7950X3D: immediate errors in the PassMark test.
  • I put my 7950X back in to make sure I’m not insane: it’s fine.
  • I put Maus’ X3D part back in (a) to make sure it wasn’t a fitment issue and (b) to show him the insanity: it fucking BSODs before I even get into the test.

At this point I’m mostly just in shock: I have never seen a CPU behave marginally. Remember this thing (a) installed Windows multiple times, (b) ran CP2077 with some success for like an hour or two, and (c) never once crashed or returned a detectable error under a full-on Prime95 torture test. I will also add that this thing only seemed to fault under Windows; it ran “fine” in preboot diagnostic environments. So I suspect the issue only becomes apparent when a “real OS” is doing power-state regulation and frequency scaling. (Perhaps that is why Prime95 was stable, since it most likely pushed the voltage to the maximum allowable value?)

Yet I cannot deny that I have a test bench that only behaves differently upon swapping out the CPU. At this point I feel somewhat bad for RMA’ing a mobo that was probably fine, but I’m also not sure how the average system builder could have been expected to figure this out. I have a lot of disposable income and enjoy playing with new PC hardware, which is the only reason I was able to catch this. I reckon most individual system builders are not in a position to have enough spare parts, and spare money, to build a whole second PC just to troubleshoot a bad motherboard/CPU.
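
The elimination logic behind that conclusion is simple enough to sketch. What follows is a minimal Python illustration, not a tool used on the bench: the `runs` log, the slot names, and the `suspects` helper are all hypothetical, and which RAM kit was installed during the CPU swaps is an assumption made for the example. The idea is to flag any part that shows up in failing runs but never in a passing one.

```python
# Hypothetical bench log: (parts installed, did the in-Windows memory test pass?).
# Labels are illustrative; the RAM kit used during the CPU swaps is assumed.
runs = [
    ({"board": "TUF X670", "cpu": "7950X", "ram": "2x16 XMP"}, True),
    ({"board": "TUF X670", "cpu": "7950X", "ram": "2x32 EXPO"}, True),
    ({"board": "TUF X670", "cpu": "7950X3D", "ram": "2x32 EXPO"}, False),
    ({"board": "TUF X670", "cpu": "7950X", "ram": "2x32 EXPO"}, True),
    ({"board": "TUF X670", "cpu": "7950X3D", "ram": "2x32 EXPO"}, False),
]


def suspects(runs):
    """Return (slot, part) pairs seen in failing runs but never in a passing run."""
    passed, failed = set(), set()
    for parts, ok in runs:
        for item in parts.items():
            (passed if ok else failed).add(item)
    return sorted(failed - passed)


print(suspects(runs))
```

With only one bench this can't clear the board in absolute terms, but it does show the one thing that matters: every part except the X3D has at least one passing run to its name.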

Lessons learned

  1. I’m never building a PC [inside the case] again. Most of the suffering was a self-imposed time pressure because we unbuilt ACHILLES and Maus had no PC. This should have been bench tested before replacing a battle-proven rig.

  2. More diversity in your test suite is always a good thing. The PassMark memory diagnostic inside Windows proved more valuable than MemTest86+, in this instance, and I would have never expected that. (My assumption has always been that memory diagnostics while an OS is doing full-blown virtual memory management are essentially worthless.)

  3. Do not write off weird errors as “chance” or “operator error.” (The NVIDIA driver install hang should have probably been our first clue to reassess.)

  4. Remember “what they say” about assumptions. I should have caught the extra standoff before I even installed the board, but I assumed the original builder used their eyes and brain, instead of the incorrect stampings from a factory in Taiwan.

  5. I am going to be a lot more thorough about testing this generation, and future generations, of PC hardware. This shit is apparently getting quite complex and fussy. Again this is the first bad CPU I’ve seen that wasn’t straight up dead. (To that end: most of the dead CPUs I have seen are a result of mishandling and/or power excursions, not manufacturing defects.) My toolkit needs to evolve to identify potentially untrustworthy CPUs.
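
One way to make lessons 2 and 5 concrete is a sign-off checklist that refuses to call a build good until each subsystem has been tested in each environment. The sketch below is only an illustration in Python: the `Result` record, the `REQUIRED` coverage set, and the `verdict` function are hypothetical names, and the coverage rule itself is an assumption about what "thorough" should mean, not an established procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    tool: str
    subsystem: str    # "memory" or "cpu"
    environment: str  # "preboot" or "os"
    passed: bool


# Hypothetical minimum coverage before a build leaves the bench.
REQUIRED = {("memory", "preboot"), ("memory", "os"), ("cpu", "os")}


def verdict(results):
    """Any failure makes the build suspect; missing coverage means not yet trusted."""
    failed = [r.tool for r in results if not r.passed]
    if failed:
        return "suspect: failed " + ", ".join(failed)
    covered = {(r.subsystem, r.environment) for r in results}
    missing = sorted(REQUIRED - covered)
    if missing:
        return "untested: " + ", ".join(f"{s}/{e}" for s, e in missing)
    return "trusted"


# Roughly what was run on Tuesday evening before the PC left the bench.
tuesday = [
    Result("MemTest86+", "memory", "preboot", True),
    Result("Prime95", "cpu", "os", True),
]
print(verdict(tuesday))
```

Under this rule Tuesday's build would have been held back for lacking an in-OS memory test, which is exactly the test that eventually exposed the problem.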