Hardware error - really?

    • Hardware error - really?

      Hi,

      today I noticed that my shared folders weren't accessible, so I went down to my server and found it switched off. =O

      When I tried to switch it back on, it wouldn't boot, and after I opened it and disconnected all HDDs, I got this message:

      Sep 20 22:26:57 orion kernel: [ 0.496079] mce: [Hardware Error]: Machine check events logged
      Sep 20 22:26:57 orion kernel: [ 0.496083] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: e600000000020408
      Sep 20 22:26:57 orion kernel: [ 0.496160] mce: [Hardware Error]: TSC 0 ADDR fef5b780
      Sep 20 22:26:57 orion kernel: [ 0.496231] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1569011177 SOCKET 0 APIC
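
      The status word on the "Bank 4" line can be split into its architectural fields. The following is a minimal sketch, assuming the IA32_MCi_STATUS layout from the Intel SDM (flag bits 57 to 63, MCA error code in bits 0 to 15, model-specific code in bits 16 to 31). The function name and dictionary keys are made up for illustration, and it only names the fields; it does not tell you which component failed.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed: Intel SDM layout).
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check status word into named fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


# Status value copied from the "Bank 4" log line above.
decoded = decode_mci_status(0xE600000000020408)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]), hex(decoded["model_specific_code"]))
```

      For the value in the log, the set flags are VAL, OVER, UC, ADDRV and PCC, so the CPU recorded an uncorrected error and marked its context as possibly corrupt. That says the event was serious, not where it came from.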

      Strangely enough, after reconnecting everything, it boots again, but I don't feel my files are really safe. Does this really mean my CPU is broken?

      My syslog also contains:

      Sep 20 22:26:57 orion kernel: [ 0.509893] mtrr: your CPUs had inconsistent variable MTRR settings
      Sep 20 22:26:57 orion kernel: [ 0.509894] mtrr: probably your BIOS does not setup all CPUs.

      Does that mean the BIOS disabled the faulty CPU? So I'm running at reduced power, but safe?

      Any help or suggestion appreciated!
    • I concur with Adoby. An actual hardware fault could potentially lead to corruption of your files. Back up ASAP.

      _______________________________________________________

      Where to go from there is a matter of opinion. If it were me:

      First, a rebuild is in order. Software can be corrupted and the kernel is no exception, but that doesn't mean the errors or the shutdown event should be ignored. Even if the errors were never seen again, I wouldn't be comfortable with hardware errors related to a CPU. (Does your rig have 2 CPUs?) If you don't have a backup server, as in another independent platform, I'd be looking for one. The new platform would become my primary server.

      With the replacement server on the job, "if" I kept a server with hardware errors, it would only be after extensive checks and testing of all components with a live distro (Memtest86 and others), checking the power supply for AC ripple, and so on.
      To trust the hardware again, I would need to find an actual cause for the problem, eliminate it and retest.
    • I don't want to make more out of this (I'm assuming) one-time event than it is. It could have been a flipped bit in non-ECC RAM caused by a "cosmic ray" (as they say :) ). That could lead to a kernel panic or other one-time issues. (A long shot, to be sure.) But without a repeat, who knows?
      What concerns me is that the issue didn't self-correct as a flipped bit would. The server wouldn't boot the first time with a simple power-on.

      I believe in Murphy's Law. Being a true believer, and since a warning has been given, I'd have a difficult time trusting the server. First, I believe I'd do a clean rebuild. (It's time-consuming, but it's the easiest thing to do.)
      Then, with a good, solid, checked and tested backup, at a minimum I'd watch it closely. I'd consider demoting it to backup server status, so it's convenient to run off-line tests, then watch it for a few weeks to a couple of months.
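
      One way to "watch it closely" is to count machine check lines in the log from time to time and see whether the number grows. This is a rough sketch, assuming the log lines look like the ones quoted in the first post; the regular expression and the function name are my own, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
# "... mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: e600000000020408"
MCE_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)"
)


def count_mce_events(lines):
    """Count machine check records per (cpu, bank, status) in log lines."""
    counts = Counter()
    for line in lines:
        match = MCE_RE.search(line)
        if match:
            cpu, bank, status = match.groups()
            counts[(int(cpu), int(bank), status)] += 1
    return counts


# Sample lines copied from the first post.
sample = [
    "Sep 20 22:26:57 orion kernel: [ 0.496079] mce: [Hardware Error]: Machine check events logged",
    "Sep 20 22:26:57 orion kernel: [ 0.496083] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: e600000000020408",
    "Sep 20 22:26:57 orion kernel: [ 0.496160] mce: [Hardware Error]: TSC 0 ADDR fef5b780",
]
events = count_mce_events(sample)
for (cpu, bank, status), n in events.items():
    print(f"CPU {cpu} bank {bank} status {status}: {n} event(s)")
```

      To run it against a real log, pass an open file instead of the sample list, for example `count_mce_events(open("/var/log/syslog"))`; that path is an assumption and depends on the distro. A count that stays at one supports the one-time-event theory; new entries would not.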