ZFS Zpool Status Degraded with no ZFS errors or SMART Failures

  • Hey Everyone,


    Over the last couple of weeks I refreshed my OMV box with a fresh OS install and also transitioned from mergerfs + snapraid to ZFS using the ZFS plugin. Everything went very smoothly and it was working fine until a day or two ago, when I noticed one of the zpools had a degraded disk. All of the pools are set up with mirrored vdevs, two vdevs per pool.
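    For reference, the layout described above (two mirrored vdevs per pool) is what you would get from a zpool create along these lines; this is just a sketch, and the pool name and disk names are placeholders rather than my actual devices:

      # Sketch of a two-vdev mirror layout (placeholder pool and disk names)
      zpool create tank \
          mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
          mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4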


    A single vdev started showing this degraded drive, yet there were no reported read, write, or checksum errors. I am a little stumped, as the two drives in the degraded vdev are new and neither shows any SMART errors.


    I have attached pictures of the pool (wingclipper in the pictures), the pool status, and the SMART info for the two drives.
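    In case the screenshots are hard to read, the same information came from roughly these commands (the device names are placeholders for the two drives in the degraded vdev):

      # Pool layout, per-device error counters, and any degraded/faulted members
      zpool status -v wingclipper

      # SMART attributes and overall health for each mirror member
      smartctl -a /dev/sdX
      smartctl -a /dev/sdY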


    Also, I am currently trying to run a long SMART test on the first degraded drive to see if any further errors actually show up. Any ideas or help on what might be causing this would be wonderful. Could it be a cable, a backplane (as this is a Norco 4224 server case), or something else? The drives are connected from the backplane to an LSI HBA. Once the SMART test finishes I might power off the box and reseat the drives, though I am not sure that will help.
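    For anyone following along, the long test is just the standard smartctl invocation (the device name is a placeholder):

      # Start the long self-test; it runs in the background on the drive itself
      smartctl -t long /dev/sdX

      # "Self-test execution status" in the output shows the remaining percentage
      smartctl -a /dev/sdX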

  • I would say the first device ("faulted"), with SN 8CJYBX7E, has a serious problem, although the SMART data of both drives looks good.


    To replace the faulted drive have a look at this thread: Replacing a defective disc in a ZFS pool


    Edit: And this: ZFS: zpool offline fails to change state of faulty drive
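    Roughly, the replacement goes like this (the device names are placeholders, and as the second thread shows, the offline step can fail on a faulted drive):

      # Take the faulted disk out of the pool (may fail, see the thread above)
      zpool offline wingclipper ata-OLD_DISK

      # After physically swapping the drive, resilver onto the replacement
      zpool replace wingclipper ata-OLD_DISK ata-NEW_DISK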

    OMV 3.0.100 (Gray style)

    ASRock Rack C2550D4I C0-stepping - 16GB ECC - 6x WD RED 3TB (ZFS 2x3 Striped RaidZ1) - Fractal Design Node 304 -

    3x WD80EMAZ Snapraid / MergerFS-pool via eSATA - 4-Bay ICYCube MB561U3S-4S with fan-mod

  • Thanks for being interested, cabrio_leo!


    In fact it is not fixed yet, but I am narrowing it down. I swapped the locations of the drives in that zpool with two drives from a different zpool to figure out whether the problem followed the hard drives or stayed with the other hardware in the computer.


    After rebooting and scrubbing the pools to fix the checksum issues, more checksum errors showed up, but this time on the pool the swapped drives had been moved into.
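    For clarity, the scrub/clear cycle I keep repeating is nothing more than this (using wingclipper as the example pool):

      # Re-read all data in the pool and verify checksums
      zpool scrub wingclipper

      # Watch the scrub and see which devices accumulate errors
      zpool status -v wingclipper

      # Reset the error counters once the scrub has finished
      zpool clear wingclipper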


    Next, I ran memtest86 on the RAM, because something seems to be going on with the hardware, yet no files are ever actually affected or listed as corrupted in the zpools when a pool gets "degraded". No errors showed up on the ECC RAM from the memory testing, so I am on to the next step.


    The HBAs look good, but I might reseat them to make sure they are OK. Next on the list are the backplanes and the SAS connectors between the HBAs and the backplanes. At the same time I want to look at the PSU, as I am getting some really weird kernel messages: print_req_error: I/O error, dev "hard drive ID", sector #######.
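    Those messages come straight from the kernel log; I am watching for them with something like this:

      # Look for block-layer I/O errors in the kernel ring buffer
      dmesg -T | grep -iE 'print_req_error|i/o error'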


    I am also unable to complete a long SMART test, as it keeps failing with: Interrupted (host reset).
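    The interrupted attempts show up in the drive's self-test log, which is where I am seeing the message (the device name is a placeholder):

      # Each attempted self-test is listed with its status, e.g. "Interrupted (host reset)"
      smartctl -l selftest /dev/sdX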


    I believe most of these oddities come down to the same thing, some sort of inconsistent power or data connection to the drives, but I need to troubleshoot further, as the original drives that had the checksum problem are brand new with no recorded SMART issues. Once I get further along I will update this post so there is some sort of resolution.

  • Alright, I wanted to give an update on this as I have had some success in troubleshooting and the pools seem to be holding steady right now.


    These are the troubleshooting steps I have taken so far, in case anyone needs to follow the same path:

    • Switched the locations of drives - this caused more drives in random pools to show checksum errors, so no consistency or answers here
    • Scrubbed and cleared the pools in question - afterwards more checksum errors showed up randomly, but never an I/O error, which points to something in the data or power path
    • Checked the HBAs - two Dell PERC cards and the motherboard's internal HBA. All worked fine with the correct firmware
    • Switched the mini SAS 8087 cables to the backplanes - this was to see if certain cables were causing the problem. Nothing found, as the checksum errors afterwards were random across cables
    • Mapped out the drives in a grid according to their slot locations in the Norco 4224 case to see if errors were occurring in certain bays (see the sketch after this list). Nothing initially conclusive here, but more info further down.
    • Replaced the PSU - this was to see if power was the issue. A PSU tester did show the old PSU not delivering enough power, so this may have been part of the cause
    • Finally, looked at the grid of drives (17 total) in relation to the backplanes - this, I believe, was the issue. All the drives in question were on the bottom 3 backplanes of the case.

      • The top 3 backplanes in the case go directly to the motherboard's internal HBA, while the bottom 3 go to the two HBA cards
      • Given the notoriety of Norco and their backplanes, I had some spares on hand and swapped them in
      • So far, after two days (which includes scrubbing the pools and clearing the ZFS errors), there have been no more checksum errors
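    For the drive-to-bay grid mentioned above, I did not use anything fancier than matching serial numbers to the bay labels; something along these lines works (exact output columns may vary):

      # Serial number, model, and size for every drive, to match against the bays
      lsblk -d -o NAME,SERIAL,MODEL,SIZE

      # Physical path (HBA and port) each disk is attached through
      ls -l /dev/disk/by-path/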

    I am going to keep checking on the drives, running a couple of SMART tests and transferring some data onto them to see if anything else pops up, but it sure seems to be pointing to the backplanes given the testing. It is really frustrating to try to solve these issues when there are so many moving parts in these server cases, and this, along with many other threads, makes me want to switch to a Supermicro server. I can't do it right now, but it is definitely on the upgrade path for the future.
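    For the ongoing checks, I am mostly just watching the pool health summary and kicking off periodic short self-tests, roughly like this (the device name is a placeholder):

      # Prints "all pools are healthy" when nothing is wrong
      zpool status -x

      # Short self-test on each of the suspect drives
      smartctl -t short /dev/sdX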


    Thank you for the help on this, and if anything changes I will post updates to this.
