Multiple disk failures

  • I think there is some other reason for such a huge fail rate in one location.

    You're right.

    I deleted the RAID array and replaced the broken disks with 3 new 12 TB drives.

    I reset everything using mdadm --zero-superblock /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh.

    In the "SMART -> Devices" section, all the disks were reported as "in good condition".

    I created a new RAID 6 array, added the 8 disks, and it started syncing.
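
    (For context, creating an array like this looks roughly as follows; the device names are illustrative, not my exact command:)

    Code
    # Sketch: create an 8-disk RAID 6 array (devices sdb..sdi are placeholders).
    mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]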

    After a few minutes, I checked and the array was marked as "clean, degraded," and two of the disks were shown as Unknown (ZJV3Z851 and ZJV2TTQH).

    I rebooted the NAS and the drives reappeared with "Good" status.

    At that point, to try and understand whether it was an issue with the controller, SATA port, or cable, I powered off the NAS, changed the order of the drives, and turned it back on.

    Once again, all the drives showed a "Good" status.

    I created the new RAID 6 array, but once again the sync failed; this time it seems that drives ZJV3YELM and ZJV65XX1 are having issues.
    I removed one drive (unplugged it from the caddy), recreated the array, and now it seems to be syncing without problems.

    Maybe a power supply problem? I'm using the Corsair CX750M.

    The UPS?
    Like Queen, I'm going slightly mad.

    • Official Post

    This thread is puzzling.

    The fact that simultaneous failures (or failures within a short period of time) occur on multiple hard drives could suggest a faulty batch, but I find andrzejls's theory about the hardware surrounding the hard drives more plausible.

    However, it's very rare for the same thing to happen on two different servers at the same time, with different hard drive models in both servers. This invalidates any theory related to batches of hard drives with manufacturing defects. It also invalidates the theory that some server component, such as the power supply, could have caused this, since it has happened on two servers with two power supplies and different hardware.


    So, analyzing everything that's been said, this seems more like a mystery movie than anything else. I can think of a few theories:


    1. That you're pulling our leg. If you were a relatively new forum user, that would be a possibility, but considering you've been a forum member since 2018, it's unlikely, so I'm ruling it out.


    2. You're using similar hardware on both servers, and there's been a recent update that harmfully affects that hardware. For example, the same SATA port expansion card. It could be, but I'd be surprised if it only happens to you. So I'd rule out this option.


    3. Sabotage. You say these servers are in a company. Has it occurred to you that a disgruntled employee might be kicking the servers when no one is looking? That would explain it. I think I'll stick with that option; I can't think of anything else.

  • Could be OMV7 :)

    I had a squeaky clean disk the other week. Got an email notification about a missing mountpoint. I checked the drive: it was clean. No partition table, no data. So I had to set it up again for the array, and after that I recreated the data, fixing the array.


    No problems whatsoever with OMV5 or OMV6. Now this :) It happened with no explanation. I suppose the drive's internal low-level format routine kicked in for some reason. But usually all disks are frozen for that kind of ATA routine. It's a mystery too.


    So far in this case, SMART has reported some unrecoverable bad sectors, which a later manual bad-sector test showed as 100% passed, with no errors.

    • Official Post

    Could be OMV7

    I'd look elsewhere for the cause, for the same reason as in point 2 of my argument. I'd be surprised if this happened to just one person. OMV is used in countless systems. We'd have the forum flooded with complaints.

  • With regard to reusing drives, I do a full wipe of a reused drive before using it. I've found that this has picked up some errors in a few cases and triggers a sector relocation and/or clears any pending sector issues. My reasoning is that I'd rather force an issue before use than wait till the drive is partially full of data.

    Certainly not a cure-all, but for the sake of a few hours spent wiping the drive, it's another step that might save me from installing a potential dud.
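
    For anyone doing the same, the wipe can be done with badblocks in destructive write mode (a sketch; /dev/sdX is a placeholder, and this erases everything on the drive):

    Code
    # Destructive read-write surface test; wipes the entire drive.
    badblocks -wsv -b 4096 /dev/sdX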

    • Official Post

    I fixed it and launched postqueue -f and tada! The "A DegradedArray event had been detected on md device /dev/md0." mail arrived.
    I don't know why the Test button on the OMV GUI works (the email is sent).

    Excellent. Notifications will keep you on top of things and out of harm's way.


    Thanks for the SnapRAID info, I'll check it out when this mess is fixed.

    We have docs for -> SnapRAID and for mergerFS. While they're fully independent packages, they work well together.

    In your use case, you might consider ZFS. Along with a number of good features for business use, a "ZFS scrub" can detect drive issues before SMART begins to kick out bad sectors and other errors. Currently, I'm working on a ZFS user doc, but it might be a while before it's finished.
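
    As a taste of what that looks like, a scrub is a one-liner (assuming a pool named tank):

    Code
    # Start a scrub, then check its progress and any errors found.
    zpool scrub tank
    zpool status tank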

    • Official Post

    So, analyzing everything that's been said, this seems more like a mystery movie than anything else. I can think of a few theories:


    1. That you're pulling our leg. If you were a relatively new forum user, that would be a possibility, but considering you've been a forum member since 2018, it's unlikely, so I'm ruling it out.


    2. You're using similar hardware on both servers, and there's been a recent update that harmfully affects that hardware. For example, the same SATA port expansion card. It could be, but I'd be surprised if it only happens to you. So I'd rule out this option.


    3. Sabotage. You say these servers are in a company. Has it occurred to you that a disgruntled employee might be kicking the servers when no one is looking? That would explain it. I think I'll stick with that option; I can't think of anything else.

    Interesting analysis. I have to admit, I wouldn't have thought of the first or third possibilities. :)

  • However, it's very rare for the same thing to happen on two different servers at the same time, with different hard drive models in both servers

    Consider that on one of the servers I had an authentication problem sending emails: /etc/postfix/sasl_passwd contained the wrong password while the password in the GUI was correct, so the GUI check worked but the emails did not go out (and there were a thousand of them).
    I did a test and saw that when you change the password in the GUI, the password in /etc/postfix/sasl_passwd is also updated, so I don't know why they were different. Maybe I changed it by hand years ago for some insane reason.
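
    (For anyone hitting the same thing: the stuck queue can be inspected, and flushed once the credentials are fixed:)

    Code
    # List the deferred messages sitting in the Postfix queue.
    postqueue -p
    # Retry delivery of everything in the queue.
    postqueue -f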

    So most likely the failures occurred at different times.

    The NAS did its rsync backup every night and notified me that everything was OK, so I assumed that there were no problems of any kind on the machine.

    "Mea culpa" Leo XIV would say.

    1. That you're pulling our leg.

    Unfortunately for me no, it's not a joke.

    You're using similar hardware on both servers

    They are virtually identical, except for the disk sets.

    One has 7 x 12TB disks, the other has 8 x 10TB disks.

    3. Sabotage.

    As specified, the NAS are not located in accessible areas.

    The head of the company is the only one (along with me) who can access the server rooms. If he enjoys kicking the NAS and then paying to fix them... I don't know, it would be stupid.

    I think it's an electrical problem.

    Maybe one of the UPSs is damaged and did not protect the NAS properly.
    But how do I check if the problem is the Corsair CX750M? I think the only way is to replace it with a new one.

    So far in this case, SMART has reported some unrecoverable bad sectors, which a later manual bad-sector test showed as 100% passed, with no errors.

    This is really worrying me.

    It takes 26 hours to do a full check of a 12 TB disk, and the 3 tests I've run all report "0 errors".

    I don't know how to handle this discrepancy with the SMART status.

    Do I trust SMART and change disks or do I trust the badblocks test and keep them?

    When in doubt, I'll obviously change the disks but that's a lot of money and that doesn't make anyone happy.
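
    For reference, the SMART counters in question can be read directly (/dev/sdX is a placeholder):

    Code
    # Show raw SMART attributes; Reallocated_Sector_Ct (ID 5) and
    # Current_Pending_Sector (ID 197) are the bad-sector indicators.
    smartctl -A /dev/sdX
    # Queue a long self-test (takes many hours on a 12 TB drive).
    smartctl -t long /dev/sdX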

    I've found that this has picked up some errors in a few cases and triggers a sector relocation and/or clears any pending sector issues.

    In the company, this is not an option I feel like pursuing: since I am responsible for the data and its backup, every little anomaly must be resolved ASAP.

    In your use case, you might consider ZFS. Along with a number of good features for business use, a "ZFS scrub" can detect drive issues before SMART begins to kick out bad sectors and other errors. Currently, I'm working on a ZFS user doc, but it might be a while before it's finished.

    ZFS would be a great alternative and we used it on the main rack mounted NAS.

    If I remember correctly, the problem was the RAM needed (too much for small systems).

    They asked me to make the configuration of the various NAS as similar as possible so I switched everything to ext4.


    Some updates.

    During the night, the sync continued on the NAS with the 3 new disks.

    Code
    cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [raid0] [raid1] [raid10]
    md0 : active raid6 sde[2] sdh[4] sdb[1] sdc[0] sdd[3] sdf[5] sdg[6]
          70312519680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/6] [UU_UUUU_]
          [====>................]  recovery = 22.1% (2593674484/11718753280) finish=2453.7min speed=61980K/sec
          bitmap: 88/88 pages [352KB], 65536KB chunk

    unused devices: <none>


    So it seems that the 8th disk is the one messing everything up.

    Even though I'm not at ease, for now I'm keeping this server in production until we have the money to buy a new NAS.

    • Official Post

    Consider that on one of the servers I had an authentication problem sending emails: /etc/postfix/sasl_passwd contained the wrong password while the password in the GUI was correct, so the GUI check worked but the emails did not go out (and there were a thousand of them).
    I did a test and saw that when you change the password in the GUI, the password in /etc/postfix/sasl_passwd is also updated, so I don't know why they were different. Maybe I changed it by hand years ago for some insane reason.

    So most likely the failures occurred at different times.

    That really invalidates the time coincidence. If you mentioned it before in this thread, I missed it, sorry.

    Unfortunately for me no, it's not a joke.

    Yes, I already ruled out that possibility. It crossed my mind only because it wouldn't be the first time in this forum. But I'm sure it's not the case this time, unfortunately for you.

    They are virtually identical, except for the disk sets.

    Can you give a brief description of the hardware? Motherboard, CPU, etc.

    As specified, the NAS are not located in accessible areas.

    The head of the company is the only one (along with me) who can access the server rooms. If he enjoys kicking the NAS and then paying to fix them... I don't know, it would be stupid.

    Perhaps inadvertent sabotage? I'm guessing someone entered those rooms to sweep and mop the floors. You say they're located on the floor. Perhaps someone moved them while they were running to sweep them, which could cause these hard drive errors.

    ZFS would be a great alternative and we used it on the main rack mounted NAS.

    If I remember correctly, the problem was the RAM needed (too much for small systems).

    They asked me to make the configuration of the various NAS as similar as possible so I switched everything to ext4.

    You're wrong about that. The need for large amounts of RAM to use ZFS is an urban myth. Read the OpenZFS documentation and you'll see that a system with 2GB of RAM can run ZFS without problems if you don't use deduplication. In this case, I agree with crashtest; I would recommend using ZFS for many reasons.
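
    On small-RAM systems, the ARC can even be capped explicitly if you want a hard limit (a sketch; the value is in bytes, here 1 GiB):

    Code
    # Limit the ZFS ARC size on Linux; takes effect immediately.
    echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max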


    _________________________________________________________


    Regarding letting hard drives spin up and spin down, if you search on Google you'll find opinions for and against. It's not proven, but stopping and restarting hard drives can often cause them to fail more quickly. I've had mine running 24/7 since I got a server; I've never considered stopping them, and they've lasted for many, many years. But I've always used dedicated NAS drives.
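
    If you want to be sure drives never spin down regardless of firmware defaults, hdparm can disable the standby timer (a sketch; /dev/sdX is a placeholder, and not every drive honors it):

    Code
    # Disable the drive's standby (spin-down) timer.
    hdparm -S 0 /dev/sdX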

    • Official Post

    You're wrong about that. The need for large amounts of RAM to use ZFS is an urban myth. Read the OpenZFS documentation and you'll see that a system with 2GB of RAM can run ZFS without problems if you don't use deduplication. In this case, I agree with crashtest; I would recommend using ZFS for many reasons.

    Read here

    FAQ — OpenZFS documentation


    And regarding ECC memory, that's another myth. It's not necessary, even though it's recommended. It's also recommended on any other file system; the difference is that OpenZFS clearly states this, while the documentation for other systems isn't so clear. But if you don't use ECC, you'll most likely never have problems, just like on any other system.

    In any case, you're using it for professional purposes, so it wouldn't hurt to consider using ECC RAM.

  • Can you give a brief description of the hardware? Motherboard, CPU, etc.

    This is the lshw -short | cut -c 20- output for the first NAS (the one with 3 broken disks).


    The two NAS were identical in terms of components, apart from the disks.
    After the doubts about the hardware, I mounted the eight disks of the other NAS (the one with 4 yellow warnings) inside a ProLiant DL180 G6 that had been decommissioned (by the old sysadmin) because it had given some problems in the past.
    Day and night dismantling machines and disks... soon I'll go crazy and start greeting my dog with "hi mom".

    (If I also add this description, the message will exceed 10,000 characters, but if you need it, I'll add it later.)

    The need for large amounts of RAM to use ZFS is an urban myth. Read the OpenZFS documentation and you'll see that a system with 2GB of RAM can run ZFS without problems if you don't use deduplication.

    In general, I try to install as much RAM as possible, out of my own obsession: the more powerful the machine, the less stress it will be under.


    On ProLiant



    On J3355M


    When the resync is completed, I'll look for ECC RAM and try ZFS.
    If I don't see any particular problems, I will propose to the boss that we change the setup, at least for the NAS that do the backups.

    Thanks for the tips.

    • Official Post
    Code
    ASM1062 Serial ATA Controller

    That's what I wanted to know. That chip shouldn't cause any problems in RAID configurations, so everything seems to be in order. I can't think of any other ideas at the moment.


    With 16GB, you have more than enough RAM to use ZFS without any problems.

    • Official Post

    This is really worrying me.

    It takes 26 hours to do a full check of a 12 TB disk, and the 3 tests I've run all report "0 errors".

    I don't know how to handle this discrepancy with the SMART status.

    Do I trust SMART and change disks or do I trust the badblocks test and keep them?

    I can't explain two (2) different servers displaying somewhat similar (not identical) drive issues. However, I can say this much: one (1) bad drive can create fault conditions on other drives, simply by being plugged into the same controller or motherboard. They're all sharing the same controller and, by extension, they're on the same bus.

    A drive with serious issues (intermittently shorting 12V or 5V input power, for example), simply by being connected to the motherboard, can cause all kinds of seemingly crazy behavior. If this is the case, the only way to find the source of the problem is the process of elimination.

    I've heard of scenarios where a user places a motherboard on a table, with a known good PSU and one stick of RAM, and begins running hardware stress tests. Then they add back additional components, one at a time. Testing to such lengths is time consuming and, to my way of thinking, not practical.

    The practical approach, in my opinion, is to eliminate the most likely source of the problem and go from there. In your case, on one of the servers, that's the drive with 96 bad sectors.

    Obviously, I can't discount the drive "bad batch" theory, but it's just that: a theory. In my opinion, the best course of action is to work on what can be worked on and draw conclusions later.

  • Wow, this has turned into quite an interesting read. I think it's worth considering other "environmental" causes as well.


    External Vibrations: Even if mounted properly, the disks could still experience vibrations from things like earthquakes, construction, passing trains, etc. I used to work in an office right next to a railroad track, and the whole building would shake as a train passed by. I wonder if the servers in that building had higher-than-average failure rates as well 🤔


    Temperature/Humidity: What's the normal temperature and humidity in the room? What's the value of the SMART "max lifetime temperature"?


    Also, AFAIK, badblocks won't report any errors for sectors that have already been reallocated in SMART. It only tests the currently visible (healthy) blocks. Those reallocated sectors are still bad, and the number will likely grow over time.


    Can you post the full `smartctl -x /dev/sd*` output for a couple of the drives?

  • External Vibrations: Even if mounted properly, the disks could still experience vibrations from things like earthquakes, construction, passing trains, etc. I used to work in an office right next to a railroad track, and the whole building would shake as a train passed by. I wonder if the servers in that building had higher-than-average failure rates as well 🤔

    No trains in the area or anything else that might create vibrations in the ground... unless a graboid pops out at night.


    Temperature/Humidity: What's the normal temperature and humidity in the room? What's the value of the SMART "max lifetime temperature"?

    One NAS is in a refrigerated room: constant temp at 15°C and humidity around 40%.

    The other in a normal room, average temp 20-25°C and 50% humidity.


    Can you post the full `smartctl -x /dev/sd*` output for a couple of the drives?

    Of course, see attachment.

  • Personalities : [raid6] [raid5] [raid4] [raid0] [raid1] [raid10]
    md0 : active raid6 sdb[1] sde[4] sdg[6] sdh[3] sdd[2] sdc[0] sdf[5]
          70312519680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [UUUUUUU_]
          bitmap: 88/88 pages [352KB], 65536KB chunk

    You are one disk short in your md RAID array. Your md0 device shows 7 disks, but mdstat reports [8/7] [UUUUUUU_] ... so you're one disk short (degraded).
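
    To see exactly which slot is missing, something like this would help:

    Code
    # Show the per-device state of the array, including the empty slot.
    mdadm --detail /dev/md0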


    Your lshw -short output from post #31 reports only one disk attached to the scsi5 interface:

    Code
    scsi5      storage        ASM1062 Serial ATA Controller
    .0.0  /dev/sde   disk           12TB ST12000NM0127

    While the others always have 2 disks attached to them. Is that correct? Is there only one disk there?

    Also, it shows .0.0 at the beginning of that line, which doesn't look good to me.


    You could switch the SATA controllers and see if the errors move too. It could be a faulty controller that makes SMART report all that stuff.


  • On the NAS with 12 TB disks I chose to try ZFS.

    From what I understand, with ZFS you should never use an mdadm RAID underneath it.

    So I removed the RAID and created a RAID-Z2 with 7 disks, and inside the pool I created 3 independent filesystems (which roughly reflects the folder division that was there before).
    I enabled notifications for ZFS and ran rsync to create a new copy of the primary NAS.
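
    (Roughly, the CLI equivalent of what I set up; the pool, filesystem, and device names here are illustrative:)

    Code
    # 7-disk RAIDZ2 pool; ashift=12 suits 4K-sector drives.
    zpool create -o ashift=12 tank raidz2 sdb sdc sdd sde sdf sdg sdh
    # Three independent filesystems inside the pool.
    zfs create tank/share1
    zfs create tank/share2
    zfs create tank/share3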

    In theory, if everything goes as I hope, I should have solved the problem with the first NAS and in a little while I will be able to redo everything on the second NAS that is giving problems.
    I hope I don't mess up ZFS.

    • Official Post

    I hope I don't mess up ZFS.

    If you've never used ZFS before, you'll want to read a little about its use. This document is old, but everything it says is valid. It's one of the recommended documents in the OpenZFS documentation. https://web.archive.org/web/20…l-zfs-on-debian-gnulinux/

    • Official Post

    I'm going to assume that your OMV install has the kernel plugin and you're setting up ZFS with the Proxmox Kernel.

    I hope I don't mess up ZFS.


    A few things about setting up ZFS:

    - When creating your pool, for best performance, "click" the ashift box. This will set your ashift value to 12. (If you don't "click" the ashift box, it will default to "0", which is not good for performance.) If you're running spinning drives, ashift should most likely be 12. If you're using SSDs, it might be 13. (More info -> here.) The bottom line: to be sure, check your drive model with the OEM to find what sector size your drives are using. (One way to check from the OS is sketched below.)
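
    Code
    # Report the drive's logical/physical sector sizes (/dev/sdX is a placeholder).
    smartctl -i /dev/sdX | grep -i 'sector size'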


    - For the pool layout, given your current config, I'm assuming you'll go with RAIDZ1 or RAIDZ2 - the rough equivalents of RAID5 and RAID6 respectively. (RAIDZ3 is triple parity, requiring 3 parity disks.)

    - After the pool is created, you'll need to create "filesystems". A filesystem is created under Storage > ZFS > Pools: click the + icon and select Add filesystem.

    A filesystem is the rough equivalent of a Linux folder at the root of the pool, but each has its own assignable ZFS properties AND can have its own independent snapshots.

    Within a filesystem you should group "like" datasets: pictures, documents, videos; or in your use case perhaps sales, engineering, admin, accounting, etc.


    - Once you've created all of your filesystems, but before you copy data in, you might consider running the following on the CLI. (This could be done in the GUI but it's easier on the command line.)

    SSH in, as root, and paste the following into the command line:
    (Replace ZFS1 with the name of your pool.)


    Code
    zfs set aclmode=passthrough ZFS1
    zfs set aclinherit=passthrough ZFS1
    zfs set acltype=posixacl ZFS1
    zfs set xattr=sa ZFS1
    zfs set compression=lz4 ZFS1



    Obviously, the last line, "compression", is optional, but the potential is there to save some disk space.

    If you ever decide to use ACLs (versus standard Linux permissions), the above will support it. It's important to make this change before copying data into the pool. Otherwise, if these attributes are changed later, existing files and folders will not have these properties. The change applies to new files but is not retroactive to old files.
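
    As a quick sanity check after running those, the properties can be read back (again using ZFS1 as the example pool name):

    Code
    # Verify the pool-level properties were applied.
    zfs get aclmode,aclinherit,acltype,xattr,compression ZFS1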

  • I tried to prepare the server for ZFS but I noticed that none of the USB ports seem to work.

    At this point I would say that this machine needs to be replaced.

    Too many anomalies and damages, perhaps caused by electrical problems.
