Failing Hard Drive - Almost



    First and foremost: Do you have a backup? If not, you might consider getting a big external drive and backing up the array first. Why? Along with the possibility of making "fat finger" mistakes, working with a RAID array has its hazards.
    _________________________________________________________________________

    The following GUI process applies if you have the physical room and an extra port connection to add a new drive before removing the drive with issues.

    - First, physically install the new drive.
    - Under Storage, Disks, find the new drive and wipe it.
    - Go to Storage, Multiple Device.
    - Highlight your md? array, and click the Recover icon.
    - Add the new drive as a spare.
    - Then you can physically remove the disk with issues, noting the following:

    **Note: it's necessary to know the device name of the drive you want to remove. Rebooting can change drive device names, so the /dev/sd? name must be rechecked after every reboot. This can be done under Storage, Multiple Device and cross-verified under Storage, Disks.
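
    Device names can also be cross-checked from the command line. A minimal sketch, assuming the array is /dev/md0 (substitute your own md number):

    lsblk -o NAME,SIZE,MODEL,SERIAL # match each drive's size/model/serial to its /dev/sd? name
    mdadm --detail /dev/md0 # list the /dev/sd? members of the array and their state

    Matching the failing drive's serial number to its /dev/sd? name is the safest way to avoid pulling the wrong disk.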

    If you want to do this on the command line, the following is a short summary of the commands:


    mdadm --add /dev/md? /dev/sd? # add the new drive as a spare

    mdadm --fail /dev/md? /dev/sd? # mark the drive with issues as failed

    mdadm --remove /dev/md? /dev/sd? # remove the failed drive

    A more complete explanation for the command line process can be found -> here.
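
    As a worked example with hypothetical device names - assuming the array is /dev/md0, the new drive is /dev/sdd, and the drive with issues is /dev/sdb (substitute your own devices):

    mdadm --add /dev/md0 /dev/sdd # add the new drive as a spare
    mdadm --fail /dev/md0 /dev/sdb # mark the drive with issues as failed
    mdadm --remove /dev/md0 /dev/sdb # remove the failed drive from the array
    cat /proc/mdstat # confirm that the rebuild onto the new drive is running

    The rebuild can take hours on large drives; the array remains usable, but degraded, while it runs.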


    if I pull the failing drive out, does the array not start up showing degraded

    You cannot just physically 'pull' a drive from an array. OMV's RAID configuration is controlled by mdadm (software RAID), so unless you give mdadm instructions, the pulled drive simply appears to it as a failure.


    If you pull a drive from an active array, the array becomes inactive upon reboot. You, the user, don't know that; the array simply does not start and therefore does not display in the GUI. This can be confirmed by running cat /proc/mdstat from the CLI as root.
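
    As a rough sketch of the recovery (assuming the inactive array is /dev/md0 - adjust to your own setup), such an array can usually be started again, degraded, from the CLI:

    cat /proc/mdstat # the array shows as inactive
    mdadm --stop /dev/md0 # stop the inactive array so it can be reassembled
    mdadm --assemble --scan --run # reassemble from the remaining members and start it despite being degraded

    Once it's running again, the GUI shows the array as degraded and the replacement steps above apply.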


    The instructions given to you by crashtest are the correct way to remove/pull a drive from a working array.


    Electrically, it is the same. If the drive dies, what distinguishes it from a missing drive? It's the same thing in terms of just disappearing.

    There are two major components of a spinning hard drive - the media (heads and platters) and the interface board. A missing drive would be registered if the drive's interface is not connected to the port. It is possible (but unlikely) that a drive can still be connected and appear to be missing. (In such a case, the interface board would not respond to the motherboard's queries.)

    Drives can fail in a great many ways. Rare is the event where a drive dies in a "lights out" fashion, but it can happen that way. Unfortunately, drives tend to fail slowly, with an increasing potential to corrupt ever greater amounts of data as they fade out.


    Apparently OMV doesn't care about read errors until the SMART count gets to a certain level?

    First, this is not about "OMV". The SMART package (smartctl) and Debian read and handle SMART stats. SMART was implemented to give users a bit more visibility into the health of a drive and/or to help determine if a drive is failing.


    Second, you'd need to post the actual SMART stats for the drive before anyone can offer a guess or an interpretation.
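
    For reference, the full SMART data can be pulled from the CLI - a minimal sketch, assuming the drive in question is /dev/sdb (substitute your own device):

    smartctl -x /dev/sdb # extended SMART output, including the attribute table and error logs

    Posting that output is what makes an interpretation possible.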

    Hard drive read errors happen - there are literally billions (trillions?) of operations as bits and bytes are read and written. To counter that, there's error-correcting code in the interface board that corrects read and write errors on the fly. Issues come into play when errors can no longer be corrected.


    Lastly, drive OEMs (Seagate, WD, HGST, etc.) implement SMART in the manner they want to. Interpreting SMART stats can differ significantly between brands.


    If you're interested in what's going on with your drive - that is, whether there's a good chance it's on the verge of failure - consider the following. Backblaze did a study of the SMART stats most likely to indicate a future failure. The following was the result:


    SMART 5 – Reallocated_Sector_Count.

    SMART 187 – Reported_Uncorrectable_Errors.

    SMART 188 – Command_Timeout.

    SMART 197 – Current_Pending_Sector_Count.

    SMART 198 – Offline_Uncorrectable.


    Any single count, in the raw values of the above, is not a real concern. If they start to increment upward, it's time to worry a bit. Somewhere around a raw count of 3 to 5, to be safe, I'd be ordering another drive.
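
    A quick way to check just those five attributes from the CLI (again assuming the drive is /dev/sdb):

    smartctl -A /dev/sdb | grep -E '^ *(5|187|188|197|198) ' # the raw count is in the last column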

    Finally, there's the following:
    SMART 199 - UltraDMA CRC errors


    Usually 199 is hardware and cable related, but it can be something to do with the drive or mobo interfaces as well.

    The rest of the spinning-drive SMART stats - their meaning, impact, etc. - would have to be looked up on the drive OEMs' websites. They might be serious, informational, or nothing at all.


    There's very little point in debating, anecdotally, how drives fail. There are plenty of opinions (volumes) on the net about this topic.
    -> How hard drives fail.


    I think it is ultimately about OMV not being the one that "marked" it as dead, therefore it never happened as far as it's concerned.

    Let me reemphasize that OMV is a NAS "application". For the most part, when it comes to core packages like smartctl, OMV is simply "the messenger". It creates command lines for actions, hands them to the OS, and relays command line messages and output to the GUI. The core app (OMV) and plugins (like md) spare users the command line drudgery and fat finger mistakes, while guiding new users toward productive ends. That's what OMV does in a nutshell. OMV helps users make sense of Debian Linux - that's it.
    ______________________________________________________________________________

    While you've provided your opinions of what's happening, you haven't provided hard information (SMART stats, syslogs, etc.). Without sitting in front of your console, I can't even begin to explain which of the myriad (potentially infinite) hard drive failure scenarios applies, or how it's being interpreted by OMV. The best thing to do (for your data) is to acknowledge that the drive has issues, remove the crippled or bad hard drive from the array before it causes serious damage, and move on.

    Along those lines, I think you have the information needed to replace the drive.
