Tons of unrecoverable read errors

  • So I've got an OMV 3.x server that I built out of an old Dell PE 840, with four 2TB drives in the Dell SAS RAID enclosure. Over the weekend a VM that I have on it failed. When I checked it out, I found tons of 'medium errors' and 'unrecoverable read errors'. I assume it's 'bad blocks', so first and foremost, how do I run a 'scandisk' or something on this to recover what it can?


    Second would be, is there a way to get notified of errors like this? I think if I could have moved this before it got bad I wouldn't be in a pinch now, so I figured I'd check for my other systems and future reference...


    Thanks!

    • Official post

    Your disk seems to be corrupted; I suggest replacing it. Check the status of the disk with SMART.


    If you've enabled email notifications, you'll get a notification when your system is under high load, for example:


    loadavg(5min) of 4.1 matches resource limit [loadavg(5min)>4.0]
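The alert above boils down to comparing the 5-minute load average against a limit of 4.0. A minimal Python sketch of that comparison, for illustration only: the function name and the default limit are assumptions taken from the message text, not OMV's actual monitoring code.

```python
import os


def load_exceeds_limit(loadavg_5min, limit=4.0):
    """Return True when the 5-minute load average is strictly above the limit."""
    return loadavg_5min > limit


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages (Unix only).
    _, five_min, _ = os.getloadavg()
    print(f"loadavg(5min) = {five_min:.1f}, over limit: {load_exceeds_limit(five_min)}")
```

With the value from the message, `load_exceeds_limit(4.1)` is true, while a load of exactly 4.0 would not trigger.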


    If you've enabled SMART, then you'll get notified about read/write errors.

  • The system is an old Dell PowerEdge 840; OMV4 doesn't work on it.
    The array is RAID 5, controlled by the Dell LSI RAID controller, so OMV sees it all as one drive and I don't know which drive has the bad blocks. That's why I was hoping to scan and repair it, if that was possible. I guess not?


    Thanks for your input.

    • Official post

    The raid is a RAID 5 array controlled by the dell LSI raid controller, so OMV sees it all as one drive, so I don't know which drive has the bad blocks, which is why I was hoping to scan and repair it if that was possible. I guess not?

    You should be able to turn off the RAID option within the BIOS. Dell, HP and Intel machines allow you to turn off RAID and enable AHCI. If you cannot do this, then you will need to access the RAID controller during boot.

    • Official post

    Ok, so there's no way to run a chkdsk on it?

    By that I take it you are predominantly a Windows user :) In answer to your question: no. You will need to reboot the machine with a monitor and keyboard attached. There may be a warning message during boot, but either way there should be an option to access the RAID card's menu/setup; it will be Ctrl + another key.

    • Official post

    Yep, I know how to get into the raid controller, but there's no utility to check the disk there, so I hoped there was in OMV or the underlying OS...


    Thanks though.

    The only other option is to shut down the machine, connect the drive to something else, and use the manufacturer's software or SpinRite. chkdsk won't work anyway, as Windows will not recognise the drive.


    If this was me I would back up, then remove each drive, erase it (shred or write zeros to the drive), test each one, disable the hardware RAID in the BIOS and start again.
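One possible way to do the "test each one" step, assuming smartmontools is installed and each drive is visible on its own (not hidden behind the hardware RAID): the sketch below only builds and prints `smartctl` command lines so they can be reviewed before running. The device paths are placeholders and the helper name is hypothetical.

```python
import shlex


def smart_test_commands(device):
    """Build smartctl command lines for one drive (nothing is executed here)."""
    return [
        ["smartctl", "-t", "long", device],      # start an extended self-test
        ["smartctl", "-l", "selftest", device],  # show the self-test log afterwards
        ["smartctl", "-H", device],              # print the overall health assessment
    ]


if __name__ == "__main__":
    # Placeholder device names; substitute the drives actually present.
    for dev in ["/dev/sda", "/dev/sdb"]:
        for cmd in smart_test_commands(dev):
            print(shlex.join(cmd))
```

An extended self-test runs in the background on the drive and can take hours on a 2TB disk, so the log and health commands are only meaningful once it has finished.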

  • Thanks, yeah, that's the plan. I think I'll need to find an alternative to OMV though, if there's no way to do simple tasks like check a drive. I was just hoping to use the old Dell 840 server, and OMV worked (where FreeNAS wouldn't), so that's where I landed.


    Guess I'll get back to looking for a good option that might work on the old hardware


    Thanks again.

    • Official post

    I think I'll need to find an alternative to OMV though, if there's no way to do simple tasks like check a drive.

    You simply need to turn off the RAID in the BIOS; that way it will be a SATA controller and each disk will be viewable.

    • Official post

    (After turning off hardware RAID)


    To see the status of your disks, go to Storage, SMART, and enable SMART. Then, in the Devices tab, enable SMART monitoring on each device. At that point, you can click on a drive, then the Information button, then the Attributes tab. If you see nonzero counts in any of the following, you may have a drive that's failing. (There may also be red Fail flags.)


    SMART 5 – Reallocated_Sector_Count.
    SMART 187 – Reported_Uncorrectable_Errors.
    SMART 188 – Command_Timeout.
    SMART 197 – Current_Pending_Sector_Count.
    SMART 198 – Offline_Uncorrectable.
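The same check can be done on the command line from the attribute table printed by `smartctl -A`. Below is a minimal Python sketch that flags the five attributes listed above when their raw value is nonzero. It assumes the usual ten-column ATA attribute layout; the sample text is invented for illustration and is not from the poster's drives.

```python
WATCHED_IDS = {5, 187, 188, 197, 198}


def failing_attributes(smartctl_output):
    """Return {id: (name, raw)} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        try:
            raw = int(fields[9])  # RAW_VALUE column
        except ValueError:
            continue  # skip raw values that are not plain integers
        if raw > 0:
            flagged[attr_id] = (fields[1], raw)
    return flagged


# Made-up sample in the shape of `smartctl -A` output (illustration only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21000
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
"""

if __name__ == "__main__":
    print(failing_attributes(SAMPLE))
```

On a real system the text would come from running `smartctl -A` against one disk at a time. Some vendors pack several numbers into one raw value, so treat a flagged attribute as a prompt to look closer rather than a verdict.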
    _____________________________________________________


    When it comes to notifications:
    First, set up email notifications (instructions -> here, on page 38 in the current version).
    As for the items to be notified about, the following is what I have selected on one server.



    __


    I had a single CRC error (SMART 199 - usually cable related) and that triggered these E-mails:
    _________________________________________________


    On 12/9/2018 6:09 AM, root wrote:
    > This message was generated by the smartd daemon running on:
    >
    > host name: omv-server
    > DNS domain: homenet
    >
    > The following warning/error was logged by the smartd daemon:
    >
    > Device: /dev/disk/by-id/ata-TOSHIBA_HDWQ140_47TEK0HYFPBE [SAT], ATA error count increased from 0 to 1
    >
    > Device info:
    > TOSHIBA HDWQ140, S/N:47TEK0HYFPBE, WWN:5-000039-7bbd83de4, FW:FJ1M, 4.00 TB
    >
    > For details see host's SYSLOG.
    >
    > You can also use the smartctl utility for further investigation.
    > Another message will be sent in 24 hours if the problem persists.
    ________________________________________________
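If these emails get filed or forwarded automatically, the key line is easy to pick apart. Here is a minimal Python sketch that extracts the device path and the old and new error counts from the warning line quoted above. The pattern is written against this one message only; other smartd warnings are worded differently, and the function name is hypothetical.

```python
import re

# Matches only the "ATA error count increased" warning shown in the email above.
PATTERN = re.compile(
    r"Device: (?P<device>\S+) \[SAT\], ATA error count increased "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_ata_error_line(line):
    """Return (device, old_count, new_count), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("old")), int(match.group("new"))


if __name__ == "__main__":
    line = (
        "> Device: /dev/disk/by-id/ata-TOSHIBA_HDWQ140_47TEK0HYFPBE [SAT], "
        "ATA error count increased from 0 to 1"
    )
    print(parse_ata_error_line(line))
```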


    On 12/9/2018 4:40 AM, root wrote:
    > ZFS has detected an io error:
    >
    > eid: 6
    > class: io
    > host: omv-server
    > time: 2018-12-09 04:40:53-0500
    > vtype: disk
    > vpath: /dev/disk/by-id/ata-TOSHIBA_HDWQ140_47TEK0HYFPBE-part1
    > vguid: 0x0E89A65A0517260C
    > cksum: 0
    > read: 0
    > write: 0
    > pool: ZFS1
    >
    ____________________________________________________
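The ZFS event mail is a flat list of `key: value` lines, so it can be turned into a dictionary with very little code. A minimal Python sketch, assuming the quoted-email layout shown above (leading `>` markers, one field per line); the function names are hypothetical.

```python
def parse_zfs_event(body):
    """Parse 'key: value' lines from a quoted ZFS event email into a dict."""
    fields = {}
    for line in body.splitlines():
        line = line.lstrip("> ").strip()  # drop email quote markers
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        if value.strip():  # skip the headline, which ends in a bare colon
            fields[key.strip()] = value.strip()
    return fields


def error_counts(fields):
    """Return the checksum/read/write counters as integers."""
    return {name: int(fields[name]) for name in ("cksum", "read", "write")}


# Excerpt of the event quoted above.
EVENT = """\
> ZFS has detected an io error:
>
> eid: 6
> class: io
> host: omv-server
> time: 2018-12-09 04:40:53-0500
> cksum: 0
> read: 0
> write: 0
> pool: ZFS1
"""

if __name__ == "__main__":
    event = parse_zfs_event(EVENT)
    print(event["pool"], error_counts(event))
```

In the event quoted above, all three counters read 0.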


    Is this what you're looking for? :)

  • OK, so it sounds like what you're saying (and what I was missing with geaves) is that if we let OMV manage the drives and the RAID, IT can then MONITOR each drive, thereby eliminating the NEED to scan or check the drive?


    Thanks (and thanks for your patience, geaves!)

    • Official post

    Absolutely. Let OMV manage your drives and, with the right setup, if a drive passes gas you'll be notified.

    The raid is a RAID 5 array controlled by the dell LSI raid controller, so OMV sees it all as one drive, so I don't know which drive has the bad blocks, which is why I was hoping to scan and repair it if that was possible. I guess not?

    I looked over your syslog. /dev/sdb is toast. (Whatever "/dev/sdb" may be, as it's presented to the host by a RAID controller.) Multiple sector failures.


    There's no "repairing" a hard drive, per se. They have a life of 4 or 5 years (give or take) if running 24x7, and that's it. Once a drive starts reallocating sectors (attribute 5 begins to increment), it's just a matter of time.


    RAID5 on any controller is bad news. In most cases those drives are permanently married to that controller (or family of controllers). @geaves is correct - getting those drives out in the open will shed some light on the actual problem(s).


    I ran into a similar problem with an older server, using its stock Adaptec controller. While the controller would do JBOD, it wouldn't pass SMART attributes transparently to the host. That was a "No-Go". I ended up getting a PERC H200 ($26) and flashing it to IT mode (see this -> thread), which made it a simple JBOD controller that passes SMART transparently.
