What would you do if you were simply browsing the internet and your computer froze? You try to move the mouse, but the pointer doesn't budge. Keypresses aren't registered. You can still see all the apps on the screen exactly as they were, but nothing responds to input. You are left with no option other than to reboot, after which everything works normally, which leaves you puzzled.
When this happened to me, I checked the dmesg log, the kern log, and the sys log, and found nothing that indicated that the computer had crashed. However, when I ran last -x, I saw that a crash had been registered: tty7 20:36 - crash (00:19). Such a crash can be caused by faulty RAM, so I ran Memtest86+, and sure enough, the test failed.
- With XMP switched off, the tests passed.
- With just one RAM stick, the tests passed even with XMP.
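The last -x check can be scripted if you want to look for crashes regularly. The following is a minimal sketch, assuming the output marks an unclean session end with the word "crash" in the logout column, as in the entry quoted above. The function names and the sample text are made up for illustration.

```python
import subprocess


def find_crash_entries(last_output: str) -> list:
    """Return the lines of `last -x` output that record a crash."""
    crashes = []
    for line in last_output.splitlines():
        # Match the standalone token "crash" rather than a substring,
        # so a host or user name that merely contains "crash" is ignored.
        if "crash" in line.split():
            crashes.append(line.strip())
    return crashes


def read_last_output() -> str:
    """Run `last -x` and return its output (needs a Linux system)."""
    result = subprocess.run(
        ["last", "-x"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample, modelled on the entry quoted in the text.
sample = """\
user     tty7         :0               Mon Jan  1 20:36 - crash  (00:19)
user     tty7         :0               Mon Jan  1 18:02 - 20:30  (02:28)
"""
for entry in find_crash_entries(sample):
    print(entry)
```

On a real system, find_crash_entries(read_last_output()) would scan the actual login records instead of the sample.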
While considering the long cycle of handing the RAM over to the service center and waiting for them to test and replace it, I decided to ask a few questions on the motherboard company's forum, because it seemed like a power distribution issue.
As the conversation progressed, I realized that the problem wasn't with the RAM. The problem was with XMP. The helpful person in the forum advised updating the BIOS and trying a higher, but safe, CPU NB/SoC voltage (it's usually used for tuning memory frequency, but can raise CPU temperature if too high). Memtest still failed with the higher voltage alone. After the BIOS update, though, Memtest succeeded during the first pass and failed only in the second pass, which told him that I was on the edge of success. So he advised trying 3000MHz instead of 3200MHz.
This worked perfectly. With 3000MHz, there was no crash. This was far better than the 2133MHz that the RAM would have run at if XMP were switched off. Now that everything worked fine, there was no need to go to the service centre either, since I knew the problem was with the RAM frequency, and not with the RAM itself.
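To put those three speeds in perspective, here is a rough calculation. It assumes the "MHz" figures are the DDR4 transfer rates in MT/s, that each memory channel is 64 bits (8 bytes) wide, and that the two sticks run in dual-channel mode. The result is a theoretical peak only, and real-world differences are usually smaller. The function name is made up for illustration.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: int, channels: int = 2) -> float:
    """Theoretical peak bandwidth in GB/s, assuming 64-bit-wide channels."""
    bytes_per_transfer = 8  # one 64-bit channel moves 8 bytes per transfer
    return transfer_rate_mt_s * bytes_per_transfer * channels / 1000


for rate in (2133, 3000, 3200):
    print(f"DDR4-{rate}: {peak_bandwidth_gb_s(rate):.1f} GB/s")

# Relative difference of the stable 3000 setting against the other two.
gain_over_default = (3000 / 2133 - 1) * 100
loss_vs_xmp = (1 - 3000 / 3200) * 100
print(f"3000 vs 2133: +{gain_over_default:.1f}%")
print(f"3000 vs 3200: -{loss_vs_xmp:.2f}%")
```

Under these assumptions, dropping from 3200 to 3000 gives up only a small slice of the theoretical bandwidth, while staying well ahead of the 2133 default.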
Some things I learnt:
- Updating the BIOS is important. Some issues got fixed when I did the update.
- Memtest86+ is a fork of Memtest86, and the guys developing Memtest86 are continuing to maintain it, so it may be a better option than Memtest86+.
- Plenty of other factors can affect RAM: the mainboard model (PCB layer count, PCB trace optimization, RAM slot topology and slot count, component selection, RAM VRM etc.); the mainboard's BIOS optimizations and the BIOS settings; the CPU's integrated memory controller (IMC), whose quality depends on the CPU generation as well as on the individual CPU (silicon lottery); and the properties of the RAM modules themselves. It's also possible that the board/BIOS lacks proper optimization for the RAM kit.
- Problems often appear when you increase the stress on the memory system. With one module, everything is ok; with two modules, you double the stress on the memory system (and again when going from two to four modules). The same happens when you go from the safe default speeds to the XMP speeds that the RAM is advertised with.
- About that XMP speed: The RAM maker will have verified it in-house, so there's rarely a doubt that the RAM (in isolation) can achieve that speed. For example, the highest-spec DDR4 you can buy has an XMP profile of DDR4-5333 at a staggering 1.6V. Is this RAM capable of that speed under the right circumstances? Yes. Will it work at that speed in a run-of-the-mill board model and CPU that is not handpicked? Unlikely.
- If any one of the components of the memory system is not entirely happy with the speed and timings you're trying, the whole thing will fall down like a house of cards. Often, the RAM kit itself is not the culprit; it's just the combination of everything that doesn't quite work well together for some reason. The first order of business is to look at the important voltages, and try some tweaks there. It might be a matter of BIOS optimization, even to the point where you have to open a ticket with the motherboard manufacturer.
- The voltage stability of the PSU's output will not be affected by most peripherals. It can take a hit when you install a graphics card with a high power draw, as their power draw nowadays can be higher than that of any CPU you can buy. But for entry-level or mid-range graphics cards with <200W power draw, there should be no concern. Fans, DVD drives and USB devices don't even cause a blip; they are often in the single-digit watt range.
- About XMP Profile 1 and Profile 2 in the BIOS, I learnt from this page: "Profile 1 is the profile that offers better stability but might come at the cost of slightly looser timings", whereas "Profile 2 loads a memory’s complete default XMP profile, including all the advanced timing parameters. Timings would be tighter in this case, but stability mileage would vary from motherboard to motherboard".
- Faulty RAM could have corrupted many files on your system, so it's advisable to reinstall the operating system and all other software. Unfortunately, even your personal files may be affected.
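For personal files, one way to gauge the damage is to compare them against a backup made before the trouble started. The following is a minimal sketch, assuming such a backup exists and mirrors the same directory layout. The function names are hypothetical, and it only makes sense to run this once the RAM is stable again.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(current_dir: Path, backup_dir: Path) -> list:
    """Return relative paths of files present in both trees whose contents differ."""
    mismatches = []
    for backup_file in sorted(backup_dir.rglob("*")):
        if not backup_file.is_file():
            continue
        relative = backup_file.relative_to(backup_dir)
        current_file = current_dir / relative
        # Files missing from the current tree are skipped, not reported.
        if current_file.is_file() and sha256_of(current_file) != sha256_of(backup_file):
            mismatches.append(relative)
    return mismatches
```

A mismatch does not prove corruption, since files you edited after the backup will differ too, but it narrows down which files are worth inspecting by hand.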