HDD failed/about to fail? (mdadm status "clean, degraded", SMART self-test reports errors)

  • Hello to all,


    today I noticed that my RAID5 (6x4TB Western Digital Red) had gone into status "clean, degraded".
    I then checked in mdadm and saw that one of the drives had been removed from the RAID by mdadm.


    I proceeded to run short self-tests using smartctl -t short /dev/sd<x> on all drives, and they came back as


    Code
    smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.9.0-0.bpo.4-amd64] (local build)
    Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
    
    
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      6770         -

    on all the drives except for the one that has been removed from the RAID5:

    Code
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       70%      5158         9
    # 2  Short offline       Completed: read failure       70%      5158         9


    I think the drive is either dead already or about to fail, so it probably does not make sense to try to "fix" the RAID using mdadm right now, does it?
    The drive only has 5158 hrs (214 days) on it, so I assume it's a hardware failure of the drive?


    My plan of action would be:
    - get a replacement for the drive (I still got warranty on it)
    - shutdown the NAS and replace the faulty drive
    - recover the RAID5


    Could you guys please give me your thoughts on the drive / plan of action above?


    Thank you!

  • Hello tkaiser,


    full output of smartctl -x /dev/sde is listed here.


    Running dmesg | grep sde shows a lot of repeated I/O error messages for that drive (the full output is now on sprunge.us).



  • I don't much like reading 'redacted' logs (though I should've pointed out that using an online pasteboard service, as @macom suggested, would be a good idea -- my fault), but a quick check seems to indicate that there's really only a problem with one sector (I/O error, dev sde, sector 9). So if you know your RAID is working correctly (are you scrubbing regularly?), I would simply force the sector to be remapped and then let a RAID scrub repair the lost sector.
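
    If you don't want to wait for the monthly cron job, a scrub can also be kicked off by hand, roughly like this (md<x> being your array):

    Code
    echo check > /sys/block/md<x>/md/sync_action    # start a scrub (or use /usr/share/mdadm/checkarray /dev/md<x>)
    cat /proc/mdstat                                # shows the progress of the check
    cat /sys/block/md<x>/md/sync_action             # 'check' while running, back to 'idle' when finished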

  • According to /etc/cron.d/mdadm, checkarray should be executed on the first Sunday of each month, and I have not noticed any issues with the data on the NAS so far.
    I suspect the issue has been there for at least a few weeks, since I remember wondering about the "small size" of the RAID a few weeks ago when checking something on the webif.


    I am currently running smartctl -t long /dev/sde to see if this sheds more light on the issue with the disk (I'll post the output once it's done).


    Some more information on my RAID setup (I don't know if this complicates things further): I built an encrypted RAID5 as documented in this OMV thread (basically mdadm -> LUKS -> ext4).
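
    The layering was created roughly like this (simplified from memory, device names are just examples and not the exact commands from the linked thread):

    Code
    mdadm --create /dev/md<x> --level=5 --raid-devices=6 /dev/sd[a-f]   # RAID5 across the six disks
    cryptsetup luksFormat /dev/md<x>                                     # LUKS container on top of the array
    cryptsetup open /dev/md<x> cryptraid                                 # unlock it as /dev/mapper/cryptraid
    mkfs.ext4 /dev/mapper/cryptraid                                      # ext4 filesystem inside the container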


    Could you please elaborate a little bit more on
    - how to remap the defective sector
    - the steps to take in order to a) take the RAID offline, b) re-add the disk that was removed by mdadm and c) resync the RAID (I guess that should be the sequence)
    since I am not an expert on those things ;)


    Thank you in advance!

  • Could you please elaborate a little bit more on

    Unfortunately not, since I try to avoid these problems (not an mdraid user but a btrfs/ZFS/RAIDZ fanboi). The problematic sector is not even shown as pending yet, so all I would do in such a situation is a web search for 'smart sector remap how hdparm' or something like that, to force writing $something to exactly this sector, which should finally convince the disk controller to remap it (Reallocated_Sector_Ct should then increase by 1, which is perfectly fine as long as your setup uses redundancy in a reasonable way... which is the case with mdraid's RAID5 mode).
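
    The recipe those searches usually turn up looks roughly like this (untested by me, sector 9 taken from your dmesg line; the write wipes whatever was stored in that sector, which the RAID can rebuild afterwards anyway):

    Code
    hdparm --read-sector 9 /dev/sde                                  # confirm the sector really fails to read
    hdparm --yes-i-know-what-i-am-doing --write-sector 9 /dev/sde    # overwrite it with zeros, forcing the remap
    smartctl -A /dev/sde | grep -i reallocated                       # Reallocated_Sector_Ct should then increase by 1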


    But to prevent telling you BS I now call for @Sc0rp (Mr. RAID himself :) )

  • Re,


    Mr. RAID ... 8| ... well ... really?


    Anyway ... relocating the sector can help, but if SMART, or rather the HDD controller, does not detect the media error itself, then you'll have a bigger problem. Theoretically the controller should remap this sector "transparently" ...


    I know an article which describes the procedure to "remap" a sector within an mdraid array, but I can't remember it and currently have too little time to google-fu it myself ... sry.


    Sc0rp

  • UPDATE:


    With the help of @Sc0rp I was able to fix my issue:


    As recommended by Sc0rp: run all the commands in the CLI, since the webif might not always run the correct commands with all the required switches.


    1. Mark the defective drive as faulty: mdadm /dev/md<x> -f /dev/sd<Y> or mdadm --manage /dev/md<x> --fail /dev/sd<Y>

    2. Remove the defective drive from the RAID: mdadm /dev/md<x> -r /dev/sd<Y> or mdadm --manage /dev/md<x> --remove /dev/sd<Y>

    3. Shut down the server and remove the drive
    ...send the HDD to the seller (still had warranty on it) and wait for a few weeks :)
    The seller forwarded the drive to the manufacturer, and they confirmed it was indeed a hardware issue in the drive
    4. Install new replacement drive and boot the server
    5. Add the new drive to the RAID by issuing: mdadm /dev/md<x> -a /dev/sd<Z> or mdadm --manage /dev/md<x> --add /dev/sd<Z>
    6. cat /proc/mdstat should now show that mdadm is rebuilding the RAID (a way to watch the progress is sketched right after these steps)
    ...wait for another 5-6h for the rebuild to complete...


    7. Check the RAID status in the OMV webif; it changed back to clean
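
    The rebuild progress mentioned in step 6 can be watched with something like:

    Code
    watch -n 60 cat /proc/mdstat    # refreshes the rebuild progress every minute
    mdadm --detail /dev/md<x>       # shows the array state and the rebuild percentage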



    To be absolutely sure everything was fine with the RAID, I also ran


    /usr/share/mdadm/checkarray /dev/md<x>


    afterwards.



    Before running checkarray


    cat /sys/block/md<x>/md/sync_action


    would output idle


    which changes to check while the check is running



    The check can also be paused using /usr/share/mdadm/checkarray -x /dev/md<x> and continued by using /usr/share/mdadm/checkarray -a /dev/md<x>



    Once again thanks for the great help here in the OMV Forum and especially to "Mr. RAID" @Sc0rp ;)


    Update 20200605:

    Added the section about removing the drive and also added the long versions of the commands
