Multiple disk failures

  • Hello,
    I have an OMV box with 8 × 12 TB disks in a RAID 5 config, and today I found 3 disks with problems.


    sde and sdi are marked "Device has a few bad sector" (yellow alert) and sdf is marked "Device is being used outside design parameters" (red alert).

    I'm so so so happy...
    /dev/md0 is down and I need to fix this mess.
    This is a full backup NAS, so the data is not a problem, but I need to understand the correct procedure for replacing multiple disks.

    1) remove sdf -> mdadm --manage /dev/md0 --remove /dev/sdf

    2) shutdown omv
    3) replace broken disk with a new one
    4) start omv and add new disk -> mdadm --manage /dev/md0 --add /dev/sdX
    5) wait for OMV to resync the RAID array
    6) go back to 1 and repeat for each broken disk

    Is this correct?
    Or is there a faster procedure?
    I need a huge hug ;( ;( ;(

  • RAID 5 allows for 1 disk to fail without data loss, assuming the rest of the disks don't have a problem. If you have more than one disk with problems other than outright failure, you may not be able to recover all the data.


    I have never encountered such a "mess" so I can't tell you exactly what to do aside from what you have started, meaning replace the failed disk and hope it can recover the array, then repeat one at a time for the other disks.


    All that said, I can lay out the general recommendations for using drives in a RAID, since you did not mention what the drives were, and so that someone else reading this who is not aware of these recommendations can hopefully use the information.


    1) all drives should be the same size

    2) all drives should be NAS/RAID rated drives or Enterprise/RAID rated drives, and by extension should be CMR drives and not SMR drives

    3) all drives should ideally be from the same manufacturer and if at all possible the same model with the same firmware version.

    4) RAID is not a backup. RAID is about high availability and distributed workload to build a large volume from smaller drives. Important data should still be backed up.


    My experience has been that I have had better luck with Seagate IronWolf/IronWolf Pro/Exos drives than with WD Red/Red Pro/Gold drives, although some people will say different. (I have also seen some Hitachi/HGST drives perform very well, but I have not used any since WD bought them, and I am not sure whether the Hitachi/HGST technology has made its way into the newer drives in the Gold line.)


    I have several large RAID systems at the office (some on Areca RAID controllers and some using Linux mdadm), in the area of 768TB, 288TB, 128TB and two 80TB (all total/raw storage amounts, for reference). With the exception of the 128TB one, they all currently use Seagate Exos drives, while the 128TB one is using WD Gold drives, and it is only the 128TB/Gold array that seems to have drive problems. That one has failed several times with drives just dropping out of the array while still being perfectly healthy, to the point where I have reconstructed it as a RAID 60 (2 RAID 6 arrays in a RAID 0 stripe) so that I can lose 2 drives from each array before data loss, but I still only use it for temporary storage due to its questionable reliability. The 768TB is a RAID 60 made from 4 RAID 6 chassis, while the 288TB and 80TB are RAID 5.
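
    In case the term is unfamiliar, RAID 60 under mdadm is simply nested arrays: two RAID 6 sets striped together with RAID 0. Roughly like this (device names and counts are made up for illustration, not my actual layout):

    Code
    # Two RAID 6 member arrays (drive names are only examples)
    mdadm --create /dev/md1 --level=6 --raid-devices=8 /dev/sd[b-i]
    mdadm --create /dev/md2 --level=6 --raid-devices=8 /dev/sd[j-q]

    # Stripe the two RAID 6 arrays together -> RAID 60
    mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/md1 /dev/md2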


    At home I am running a 6-drive RAID 5 and a 2-drive RAID 1, all Seagate IronWolf/IronWolf Pro, using mdadm.


  • One of the problems with using traditional RAID at home is the lack of monitoring AND unrealistic expectations of hardware. For example, at 5 years of age, it's not safe to use spinning hard drives in a primary (24x7) server. My older hard drives become a cold backup in a switched-off (lights-out) server. I have two cold servers of this type. One comes on and backs up a couple of times a month. The other I run manually and back up every two months or so.


    Most commercial or pro managed servers have their spinning media swapped out on a schedule that, usually, doesn't exceed 5 years. Once drives get to the 5 year mark, they're becoming "geriatric", meaning if one drive fails in a RAID array another may not be far behind. The RAID-drive rebuilding process can EASILY send a second or even a third disk into failure.


    For drives of any age, RAID 5 is hopelessly inadequate for 8 disks. You'd need something like RAID6 to be prepared for the chance failure of a second drive.

    Lastly, automated notifications would report drive issues in the event that a critical SMART stat took an error count. (Usually, certain SMART stats begin to increment before a drive actually fails.) This might have helped you avoid the situation you're currently in.

    ______________________________________________________________


    If your drives are old or are incrementing in the following SMART stat counts, you should consider starting over with new drives.


    SMART 5 – Reallocated_Sector_Count.

    SMART 187 – Reported_Uncorrectable_Errors.

    SMART 188 – Command_Timeout.

    SMART 197 – Current_Pending_Sector_Count.

    SMART 198 – Offline_Uncorrectable.


    If you think some of the drives are savable, run short SMART tests on the green drives and long tests on the yellow or red drives. At the end of the tests, compare your SMART stat results with the above. 1 or 2, in the raw counts, of any of the above would be a concern. 3 to 5 counts or more, especially if you get reports that they are regularly incrementing, means (to my way of thinking) the drive should be replaced ASAP.
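
    If it helps, this is roughly how I'd check those counters from the shell (assuming smartmontools is installed; /dev/sdf is just an example device name):

    Code
    # Start a long self-test; it runs inside the drive, smartctl returns immediately
    smartctl -t long /dev/sdf

    # Once the test has finished, pull the attribute table and look at the critical raw counts
    smartctl -A /dev/sdf | grep -E 'Reallocated_Sector|Reported_Uncorrect|Command_Timeout|Current_Pending|Offline_Uncorrectable'

    # Self-test results and any logged errors
    smartctl -l selftest -l error /dev/sdf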


    Bottom line: it's good that you have a backup.

  • If that many disks are showing errors at the same time, I'd also be checking power supplies, cables and interface cards. You don't know if the drives are the root failures or if it's another common component. That doesn't negate any of the previous replies, but it's worth ruling out.

  • The main NAS is used to serve files and directories; the other NASes are just for backing up data.
    I wanted to use a 3-2-1 strategy, but it is hard to find different media to handle 42 TB of files (the company does not have much money).

    So I decided to increase the number of copies of the main NAS across 5 different machines: 2 NAS on the same network, 1 NAS in a DMZ in the same company building but in a different room, and 2 NAS in remote locations.

    I mounted disks bought from different vendors on the same machine so as to try not to have disks from the same batch.


    The NAS with the red warning has 8 × Seagate Enterprise ST12000NM0127 12 TB (CMR).

    The "smaller" NAS has 8 × Seagate Exos 7E8 ST8000NM0055 (CMR).

    After reading a Google report with stats about various disks getting damaged, I had considered and tried HGST disks but they did not prove to be superior to other brands.

    But, considering the bad luck that hits me from time to time, I would say that my experience has rather insignificant value.
    In Italy we say, “Luck is blindfolded, misfortune sees very well.”

    Due to money issues I started with a few disks, so the choice of RAID 5 was forced, but now that all 8 slots are full I think I will reinstall everything using RAID 6.


    The notification system is active (I tested it and it works) and all checkboxes are checked, but I don't remember receiving an alert email about SMART problems.

    I'll install a CPU stress tool and try to see if the notifications start.


    About the wiring, I had never considered it. I tried replacing the SATA cables and nothing changed.

    Changing the RAID controller is a tad more complex, but I tried moving the drives to different ports on it and nothing changed.

    So I think the problem is really the disks.

    Maybe a failed batch?

    About SMART stat counts, this is the result for /dev/sdf (8 TB, yellow warning with "Device has a few bad sector"):

    Code
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       96
    187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0


    smartctl confirms the problem reported in the warning, and I have already ordered the new disks.

    About "replace the failed disk and hope it can recover the array, then repeat one at a time for the other disks": is my procedure correct?
    Are the commands correct?
    The NAS with all yellow warnings is still working: /dev/md0 is mounted and I can see files and dirs. But on the main NAS with the red warning I can't mount /dev/md0 and its state is inactive.
    In theory a "bad sector" is not serious enough to block a RAID 5, and mounting should still be possible with only a single broken disk (the red-warning disk).
    Is there a way to try to activate and mount it? (A possible approach is sketched below; corrections welcome.)
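
    From what I have read, a forced assembly might bring the inactive array back online, something like this (device names are only placeholders and I have not run it yet, so please correct me if it is wrong or dangerous):

    Code
    # Stop the inactive array
    mdadm --stop /dev/md0

    # Try to force-assemble it from its members
    # (the real member disks can be listed with: mdadm --examine --scan)
    mdadm --assemble --force /dev/md0 /dev/sd[b-i]

    # Check the result
    cat /proc/mdstat
    mdadm --detail /dev/md0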

    Thank you all, at this time any help makes me feel better.


  • I don't understand why OMV says "Device has a few bad sector".


  • About SMART stat counts, this is the result for /dev/sdf (8 TB, yellow warning with "Device has a few bad sector"):

    96 is more than "a few" bad sectors (by definition, "a few" is 3 to 4). Something must be wrong with the reporting process, or there's an issue with the e-mail provider rejecting the server as a source. (Part of setting up notifications is a test e-mail.) I've received e-mail notifications for non-critical SMART counts.

    ______________________________________________________________________

    - This is a "one drive at a time" process, which means, with an array in the size range you have, it's going to take a while, during which another drive failure is a possibility.

    About "replace the failed disk and hope it can recover the array, then repeat one at a time for the other disks": is my procedure correct?
    Are the commands correct?

    (Before starting, read the Bolded Note below. Also, you can get status updates in OMV's GUI under Storage, Multiple Device.)


    If you have the extra port connection, I would go with the following:

    ____________________________________________________________


     mdadm /dev/md0 --add /dev/sdx

    Adds a drive as a spare


    mdadm /dev/md0 --replace /dev/sdf --with /dev/sdx

    Swaps in sdx in place of sdf (which is the bad drive). Wait for the sync to complete.


    mdadm /dev/md0 --remove /dev/sdf
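
    Removes sdf from the array once the replacement has finished syncing.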



    ___________________________________________________________________________________________

    Assuming you don't have an extra port.

    1) remove sdf -> mdadm --manage /dev/md0 --remove /dev/sdf

    2) shutdown omv
    3) replace broken disk with a new one
    4) start omv and add new disk -> mdadm --manage /dev/md0 --add /dev/sdX

    (Again before starting, read the Note below.)

    Since you can't remove an active device I would go with:

    mdadm /dev/md0 --fail /dev/sdf
    mdadm /dev/md0 --remove /dev/sdf

    Shutdown, remove the drive, add a new drive, boot up.

    mdadm /dev/md0 --add /dev/sdx


    Wait for the sync to complete.
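
    You can also watch the resync from the shell, if you prefer that to the GUI (just a suggestion):

    Code
    # Rebuild/resync progress, refreshed every 30 seconds
    watch -n 30 cat /proc/mdstat

    # Or a one-shot detailed view of the array state
    mdadm --detail /dev/md0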
    ___________________________________________________________________________________

    **Note:**
    Depending on your BIOS, drive device names can (and often do) change when drives are physically changed. If you remove a drive, the remaining drives may change their device name order. When physically adding or removing a drive, after booting up, look under Storage, Disks and confirm the drive's device name against its serial number (in the model column), to ensure that the commands you're using are being applied to the correct drive.
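
    From the shell, a quick way to cross-check device names against serial numbers (an alternative to the GUI check above; the device name is an example):

    Code
    # Device names with model and serial number, whole disks only
    lsblk -d -o NAME,SIZE,MODEL,SERIAL

    # Or query a single drive directly
    smartctl -i /dev/sdx | grep -i serial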

    ___________________________________________________________________________________

    I would start with the bad drive with numerous failed sectors. For the others, I'd compare their SMART stats with the critical list posted above and replace as needed. (With 4 or 5 in the raw counts, it's just a matter of time.) Other stats are not as important or may be informational.
    Well... Stat 199, CRC errors, can't be ignored. They're usually cable or drive/mobo interface related.

    Finally, I'd run SMART tests on all remaining drives to see if critical errors increment. If they do, I'd replace the affected drives ASAP.

  • Can you describe or show pictures of how those drives are mounted in the computer case, and what that case is? IMHO it is uncommon to see that many drives fail in just a few years. It could be persistent vibration of the drives due to the mounting method.


  • Hello there.

    To be really sure about those drives, you should check them with the manufacturer's tools. SMART values differ between vendors and there is no way to tell what they really mean. And you are right to build your array with different drives from different manufacturers; people's opinions often differ on that issue, but drives can, and in fact often do, fail all at the same time when they are used the same way. Having a mix of different models with the same/identical specs is a wise strategy. Also, always follow the rule of adding 1 parity disk for every 3 to 4 data disks, and keep 1 disk as a cold/hot spare. Ordering disks only once you already have a failure is a mistake.

    My disks are about 10 years old. They have less than 10,000 spinning hours. I use consumer drives with spindown. Yes, sometimes I also get weird values, but when I check the drive with the manufacturer's tools, things are 100% OK. Before I use/install a drive for the first time, I always check it and do a low-level format. Disks that spin 24/7 should be replaced on a schedule, with a proper additional risk assessment. Last rule: have backups. In the old days we used tapes. Those things were indestructible; they always worked :)

  • I'm also curious to see how these are mounted and whether they're properly cooled.


    It's not normal to get this many failures this close to each other. However, it's worth considering that those Seagate 12TB drives have an abysmal failure rate according to the most recent Backblaze drive stats for Q1 2025. At 9.47% AFR this quarter, it's one of the highest failure rates I've seen. The stats also suggest their failures tend to be clustered into certain periods of their lifetime (bad), which is much worse than if the failures were spread out more over time.


    Also, were these drives purchased new or used? I've been exclusively running used eBay drives in my arrays, and every single one is considered "geriatric" by crashtest's standards. Some are so old that the SMART self-test log times always report the maximum value of 65,535 hours (7.5 years). It felt like watching a car hit 999,999 on the odometer LOL.

    Code
    SMART Self-test log
    Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                              number   (hours)
    # 1  Background short  Completed                   -   65535                 - [-   -    -]
    # 2  Background long   Completed                   -   65535                 - [-   -    -]
    # 3  Background short  Completed                   -   65535                 - [-   -    -]
    # 4  Background short  Completed                   -   65535                 - [-   -    -]
    # 5  Background long   Completed                   -   62335                 - [-   -    -]
    # 6  Background short  Completed                   -   62328                 - [-   -    -]

    Out of 19x 3TB enterprise drives (mixed Seagate and HGST) running across 3 arrays over ~4 years, I've had a single drive report a single URE. It still passes a SMART long test and badblocks, but I removed it from rotation anyway.


    Disclaimer: I don't advocate for anyone else to do this for anything important. This is for my home lab, and I have the redundancy and backups in place to recover within a day.


    And yes, I know they're old. I'm in the process of refreshing the drives with "new" used 6TB drives. I ordered a bulk lot off eBay like I've done so many times in the past, but this batch was awful. Out of eleven drives, one was DOA, three reported a SMART unhealthy status, and three more failed a SMART long test despite reporting an initial "okay" SMART status. These were low-hour drives too, with an average of around 2k hours. I returned the whole batch and bought from a different vendor, and they're clean as a whistle.


    So here's the moral of the story: your drives may have had bad blocks for a long time - maybe even before you put them in the NAS. The drives won't report the errors until they try to read from those blocks. I've also heard it's possible for vendors to clear some of the SMART logs to try and hide the drive's true health.

  • 96 is more than "a few" bad sectors (by definition, "a few" is 3 to 4).

    I was reporting the text that appears in the OMV badge tooltip. I agree with you that 96 is a lot.

    Something must be wrong with the reporting process, or there's an issue with the e-mail provider rejecting the server as a source. (Part of setting up notifications is a test e-mail.) I've received e-mail notifications for non-critical SMART counts.

    The “Test” email button works and I get the test email.

    I don't know if there is a way on OMV to check if there are emails in the send queue. I'll do other tests.


    Tnx crashtest for the detailed procedure, very helpful.

    Can you describe or show pictures of how those drives are mounted in the computer case, and what that case is? IMHO it is uncommon to see that many drives fail in just a few years. It could be persistent vibration of the drives due to the mounting method.

    These two NAS need to be “transportable” so I chose the Silverstone SST-CS380 case.

    Disks are mounted by inserting them into the front caddies of the case (with the final serial number of the disk).



    Right now the NAS is in my office but, under normal conditions, the first NAS (the one with 4 yellow warnings) is in a room cooled to 15°C and the second is in a normal room that people cannot enter.

    Both are positioned on the floor.


    Inside, I tried to keep everything as clean as possible to improve air flow.



    I know that compared to a rackmount NAS these solutions are less professional, but 7 disks with problems out of 16 is too many.


    Also, always follow the rule of adding 1 parity disk for every 3 to 4 data disks, and keep 1 disk as a cold/hot spare.

    Is this possible with OMV?

    I am not a RAID expert at all and have always just installed OMV, connected the disks, and used the GUI to configure the new array.


    I use consumer drives with spindown.

    Do you have any brand/model to recommend?

    I know that with OMV it is possible to set spindown but I have never set it.

    On the main NAS, the rack-mounted one that handles file shares, I can't set spindown because many services use the data, but on the other NASes it is definitely a good thing.



    Considering that this NAS does an rsync once a day, which of these settings do you think is best?


    Seagate 12TB drives have an abysmal failure rate according to the most recent Backblaze drive stats for Q1 2025.

    Thank you very much for the link, extremely helpful.

    Also, were these drives purchased new or used?

    All new.

    So here's the moral of the story: your drives may have had bad blocks for a long time - maybe even before you put them in the NAS. The drives won't report the errors until they try to read from those blocks. I've also heard it's possible for vendors to clear some of the SMART logs to try and hide the drive's true health.

    I hadn't heard that one before; however, I will format the drives as advised to see if any bad sectors pop up. Thanks for the info.


    And as always, thanks to everyone for the valuable answers.

  • Considering the new information, a safe procedure could be:


    1) Insert the new drive

    2) Write to all sectors, for example with the command dd if=/dev/urandom of=/dev/sdX bs=1M status=progress

    3) Perform a bad-sector check with the command badblocks -b 4096 -v -s /dev/sdX (an alternative that combines steps 2 and 3 is sketched after this list)

    4) If everything is fine, mark the damaged disk as failed with mdadm /dev/md0 --fail /dev/sdf and remove it with mdadm /dev/md0 --remove /dev/sdf

    5) Shut down the NAS, remove the disk and boot the NAS.

    6) Add the new disk to the array with mdadm /dev/md0 --add /dev/sdX

    7) Wait for the sync to complete
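
    As an alternative for steps 2 and 3, from what I understand badblocks can do the write and the verify in a single destructive pass (only on the new, empty disk, obviously):

    Code
    # -w: destructive write-mode test (writes patterns and reads them back)
    # -s: show progress, -v: verbose, -b 4096: block size
    badblocks -b 4096 -wsv /dev/sdX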


    If disks keep failing in a short time, stop bothering the community and resolve the issue once and for all with a flamethrower.

  • Hello again,

    to check OMV's mail queue you simply have to execute mailq in a shell.
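
    For example (OMV uses postfix for its notifications, so the standard postfix tools apply):

    Code
    # List deferred mails and the reason they are stuck
    mailq

    # Try to deliver everything in the queue again
    postqueue -f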

    Is this possible with OMV?

    I am not a RAID expert at all and have always just installed OMV, connected the disks, and used the GUI to configure the new array.

    With Linux, anything is possible :) ... For data shares I now use SnapRAID to get the redundancy I need for the disks. SnapRAID lets you configure your RAID the way you want: you can add as many parity drives as you wish. It also has other features, like scrubbing to detect failing sectors, and more. As for the hot spare, I simply keep disks attached to the controller but unused in the GUI. It is convenient when you get an email about a failing or failed drive and you can act instantly via VPN/SSH. You save time, and you can check the failed drive later and reuse it if there was no problem with the hardware at all.
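
    To give an idea, a stripped-down snapraid.conf with two parity drives looks roughly like this (paths are only placeholders, yours will differ):

    Code
    # /etc/snapraid.conf (minimal example)
    parity   /srv/dev-disk-by-uuid-PARITY1/snapraid.parity
    2-parity /srv/dev-disk-by-uuid-PARITY2/snapraid.2-parity

    content  /var/snapraid.content
    content  /srv/dev-disk-by-uuid-DATA1/snapraid.content

    data d1  /srv/dev-disk-by-uuid-DATA1/
    data d2  /srv/dev-disk-by-uuid-DATA2/
    data d3  /srv/dev-disk-by-uuid-DATA3/

    After that it is basically snapraid sync after changes and snapraid scrub on a schedule.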

    Do you have any brand/model to recommend?

    I know that with OMV it is possible to set spindown but I have never set it.

    On the main NAS, the rack-mounted one that handles file shares, I can't set spindown because many services use the data, but on the other NASes it is definitely a good thing.

    I use whatever is common in consumer hardware at the time; usually that means Seagate or Western Digital products. I use the green series for spin-down. Everything with a higher workload I now do with SSDs. HDDs are configured for SMART checks once a month; anything unusual and I get a notification. The same goes for scrubbing, for which I get a full report.

    As for the spin-down settings, I usually go for the lowest value, "1", and I set the timer to 20 minutes. This is due to data caching on some devices, for example when watching movies. Also, with SnapRAID only the disk that contains the data is awake, so the other drives in that array won't necessarily spin up; that's a good thing. But if you really want to have spin-down, you have to go for "1" anyway. For private usage I would recommend it anyway: you get a cooler system, you save money on your electric bill, and you also prolong the lifespan of those consumer drives.

    So if your rsync takes only a short time to complete and the shares are not used every day, you can spin them down. Unless they are 24/7 drives, which I think are not good to spin down a lot.
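
    If I translate those settings into a shell command, it should be roughly the following hdparm call (OMV sets this from the GUI, so this is only for illustration, and not every drive supports APM):

    Code
    # -B 1   : lowest APM value (maximum power saving, allows spin-down)
    # -S 240 : standby timeout, 240 * 5 s = 1200 s = 20 minutes
    hdparm -B 1 -S 240 /dev/sdX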

  • The Silverstone SST-CS380 is a great-looking case. I picked the Fractal Design Define R5 case because its drive caddies use rubber grommets to mount 3.5" HDDs, providing damping and not transferring drive vibration onto the case. I have 8 × 4TB Seagate Barracuda ST4000DMZ04/DM004 drives (this is my home NAS) that are just over 5 years old. My NAS runs 24/7 with no spin-down in an 18°C under-the-staircase closet in my house. No issues with the drives, so far.

    You are right that having 7 out of 16 drives with problems is troubling. The only thing I can think of, and I could be wrong, is the case. The probability that 7 out of 16 same-brand drives have a manufacturing defect is astronomically small, so there must be some other condition contributing to the disk failures.


    Note.

    I do not use RAID at all. I use rsync for incremental backups (RAID is not backup) from one drive to another.


  • You are right that having 7 out of 16 drives with problems is troubling. The only thing I can think of, and I could be wrong, is the case. The probability that 7 out of 16 same-brand drives have a manufacturing defect is astronomically small, so there must be some other condition contributing to the disk failures.

    Not necessarily. If the serial numbers of the failing drives are very similar, then you can in fact have multiple failures because of manufacturing defects. That is essentially what SMART is about: SMART does not detect failures. SMART is like a huge database, and when certain changes in a disk's condition are detected, it can with a certain probability foretell the remaining lifespan, based on similar cases. SMART is a good tool to warn admins to take action and avoid later problems, even when we can't understand it and everything seems to be OK. But the best approach is always to use the manufacturer's tools to check the drive.

  • Not necessarily. If the serial numbers of the failing drives are very similar, then you can in fact have multiple failures because of manufacturing defects. That is essentially what SMART is about: SMART does not detect failures. SMART is like a huge database, and when certain changes in a disk's condition are detected, it can with a certain probability foretell the remaining lifespan, based on similar cases. SMART is a good tool to warn admins to take action and avoid later problems, even when we can't understand it and everything seems to be OK. But the best approach is always to use the manufacturer's tools to check the drive.

    Are you suggesting the possibility of a ~45% failure rate on NAS/Enterprise drives from that manufacturer?


  • Are you suggesting the possibility of a ~45% failure rate on NAS/Enterprise drives from that manufacturer?

    That's what they call a bad batch ... you know, similar to when a car manufacturer does a recall for certain vehicles built between this and that month. It is not uncommon. Manufacturers produce many revisions of the same product because they change some part, usually to improve the product or to lower manufacturing costs, and sometimes this backfires on them. And when a bad batch of drives fails after 5 years, that is not unusual, since the drives are mostly out of warranty by then anyway. It happened to me about 20 years ago. One batch of drives failed in a span of about 3 days. Basically every day we were looking for yellow and red lights on the racks. I can't recall the percentage, but it was one batch with similar serial numbers.

  • to check OMV's mail queue you simply have to execute mailq in a shell.

    I've found the problem in /etc/postfix/sasl_passwd.
    I fixed it, launched postqueue -f, and tada! The "A DegradedArray event had been detected on md device /dev/md0." mail arrived.
    I don't know why the Test button in the OMV GUI works (the email is sent) while the alerts were stuck.
    Thanks for the SnapRAID info, I'll check it out when this mess is fixed.
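
    For anyone who hits the same problem later, the usual postfix steps after fixing that file are something like this (noting them here mostly as a reminder to myself):

    Code
    # Rebuild the hashed lookup table from the corrected plain-text file
    postmap /etc/postfix/sasl_passwd

    # Reload postfix so it picks up the change
    systemctl reload postfix

    # Flush the deferred queue
    postqueue -f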


    One batch of drives failed in a span of about 3 days. Basically every day we were looking for yellow and red lights on the racks. I can't recall the percentage but it was one batch with similar serial numbers.

    My serials are:

    Code
    ZJV28E6W 
    ZJV2GJCT
    ZJV2TTQH
    ZJV3YELM
    ZJV3Z851
    ZJV3Z93E
    ZJV4WQB7
    ZJV65XX1


    Z should be "made in Thailand", JV the model or some component, and the rest of the code the incremental part.

    They don't look that similar but I'm not sure.

  • That's what they call a bad batch ... you know, similar to when a car manufacturer does a recall for certain vehicles built between this and that month. It is not uncommon. Manufacturers produce many revisions of the same product because they change some part, usually to improve the product or to lower manufacturing costs, and sometimes this backfires on them. And when a bad batch of drives fails after 5 years, that is not unusual, since the drives are mostly out of warranty by then anyway. It happened to me about 20 years ago. One batch of drives failed in a span of about 3 days. Basically every day we were looking for yellow and red lights on the racks. I can't recall the percentage, but it was one batch with similar serial numbers.

    Hard to believe a failure rate of ~45% in an extremely small sample of 16. I have been in a manufacturing environment (retired mechanical engineer) for more than 45 years. I am not familiar with hard-drive manufacturing, but normally a "batch" run is several thousand units. No manufacturer is going to produce a few years of projected sales just to store it in warehouses/DC centers; it's cost prohibitive. So if this "batch" had a 45%+ failure rate, you would hear about it everywhere, and I have not heard anything about it. Considering that, I think there is some other reason for such a huge failure rate in one location.

