RAID5 with unexplained error

  • Hello,


    I have a RAID 5 array with three 3TB drives.


    I keep receiving the message below. I have run fsck and it fixes things, but the message keeps reappearing.
    Is there a way to understand the cause for it and to fix it?


    Please let me know if there is additional data that I need to provide.

    Code
    Nov  7 12:23:45 openmediavault kernel: [310678.492008] EXT4-fs (md127): error count since last fsck: 4
    Nov  7 12:23:45 openmediavault kernel: [310678.492010] EXT4-fs (md127): initial error at time 1509773902: ext4_iget:4641: inode 98893575: block 11
    Nov  7 12:23:45 openmediavault kernel: [310678.492013] EXT4-fs (md127): last error at time 1510032951: ext4_iget:4641: inode 98893575: block 11
    • Official Post

    I am not a SMART expert, but to me the reports look fine. Just the load cycle count is quite high, which is probably a result of aggressive APM settings. Your drives are around 300k. For my drives, 600k are specified. I had rapidly increasing load cycle counts in the past and solved it by disabling hdparm and using hd-idle instead.


    Concerning your first post, I found this with Google. Maybe it helps you.


    Hopefully somebody else can give you a better response.

  • Thank you for the quick reply. I will read through the link you sent and will also be glad to receive additional feedback on the SMART results.


    Could you explain the APM parameter you mentioned in more detail? Attached is my current configuration.




    • Official Post

    APM 1 is the most aggressive setting: maximum power saving, but this results in many load cycles. I would try APM 127 and check whether the number of load cycles per operating hour decreases.


    You can check the datasheet of your drive for the specified number of load cycles. You have probably reached around 50% of the specified value.
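
    If you want to experiment with this, hdparm can read and set the APM level and smartctl shows the current load cycle count. A minimal sketch (sda is a placeholder for your disk; a value set this way may not survive a reboot, and in OMV this is normally configured under the disk's Power Management settings):

    Code
    hdparm -B /dev/sda                            # show the current APM level (sda is a placeholder)
    hdparm -B 127 /dev/sda                        # set APM to 127 as a test
    smartctl -A /dev/sda | grep -i load_cycle     # watch how fast Load_Cycle_Count grows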

    • Official Post

    For reference, if a drive is heading toward failure, the following SMART stats start incrementing:


    • SMART 5 – Reallocated_Sector_Count.
    • SMART 187 – Reported_Uncorrectable_Errors.
    • SMART 188 – Command_Timeout.
    • SMART 197 – Current_Pending_Sector_Count.
    • SMART 198 – Offline_Uncorrectable.

    Where the remainder of the stat categories are concerned, some of the counts and their meanings (raw values) can vary between drive OEMs.
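
    If you just want to spot-check those five attributes on a single drive, something like the following should work (a sketch; it assumes smartmontools is installed and sda is a placeholder for one of your disks):

    Code
    smartctl -A /dev/sda | grep -E '^ *(5|187|188|197|198) '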


    If you want to get a complete picture of each drive:


    On the command line, do:

    Code
    apt-get install curl

    After the curl install finishes, run the line below. (The /dev/sd? wildcard covers all of your disks; replace the "?" with a single drive letter if you only want one disk.)

    Code
    for disk in /dev/sd? ; do smartctl -x $disk ; done | curl -F 'sprunge=<-' sprunge.us


    The above line returns a URL. Copy and paste the URL into the address bar of a web browser.
    ___________________________________________________________________________


    You could copy and post the URLs generated from the above into this thread, but I'm guessing that your drives have some age on them. Note that RAID is not kind to the remaining drives in the array if a failed drive is replaced: "resilvering" a new drive can cause another drive failure during the process. Two failures and it's over.


    If you don't have a backup of the data stored in your array, I'd encourage you to give it some thought.

    • Official Post

    Your drives appear to be OK, if a bit on the warm side. The drives are about a year old and it appears you leave your server on most of the time. The seek error rate can be related to platter thermal expansion (cooling, then heating up). As macom noted, using "spin down" for power savings might be hard on your drives. They're designed to be NAS drives (on 24x7), so take his recommendation for APM 127.


    So, from your code box, you're using mdadm RAID and ext4. (There's another forum member who is having a similar but slightly different problem: he's getting file system errors and the array is resyncing once in a while.)
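
    Before digging into the ext4 side, it's worth confirming that the md array itself is healthy. A quick check (md127 taken from your kernel messages):

    Code
    cat /proc/mdstat                 # all drives present, no resync in progress?
    mdadm --detail /dev/md127        # state should be clean, with no failed devices
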
    ________________________
    (The command line below was provided to me and others by one of the forum moderators, tkaiser.)


    The place to start is in your log files.
    You can narrow down the vast amount of info in ALL of your syslog files in /var/log, in one shot, using the following on the CLI.



    Code
    zgrep -A2 -B2 -E "EXT4-fs|(md127)" /var/log/syslog* |more >/ext4logs.txt

    Between the quotes are the search terms (pulled from your code block above), separated by a pipe, with the output redirected to a file deposited at the root: /ext4logs.txt


    (After you look over your results:)
    You could modify the same command line to search within the output file and narrow the search further, using additional terms of interest. (The term "warning" has no special significance - it's used only as an example. You'll need to review the output and make an appropriate choice.)


    Code
    zgrep -A2 -B2 -E "warning" /ext4logs.txt |more >/ext4logs-2.txt

    The idea in the above is to look at the error entries and at the associated events around the time frames you're interested in. Things to look for are patterns: can it be isolated to a specific drive, is it always the same block, is there some event that triggers an error, etc.
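
    (Side note: the "at time 1509773902" values in your kernel messages are Unix timestamps. GNU date can convert them, which makes it easier to line the errors up with other events in the logs:)

    Code
    date -d @1509773902    # when the initial error was recorded
    date -d @1510032951    # when the last error was recorded
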
    _____________________________________________________________________________


    As far as a real fix goes, can I talk you into using ZFS? :)


    OMV has a ZFS plugin, for the web GUI, that will allow you to create a raidz1 array, which is the functional equivalent of RAID5 but with a lot of extra benefits, including checksums for "self healing" files.


    If you have a backup, so your data can go back onto a newly created ZFS array, give it some thought.
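
    For a sense of what the "self healing" part looks like in practice: on an existing pool, a scrub walks every block and verifies its checksum, and the status output reports anything that had to be repaired. A sketch (the pool name "tank" is just a placeholder):

    Code
    zpool scrub tank         # verify checksums on every block in the pool
    zpool status -v tank     # shows scrub progress and any repaired/unrecoverable errors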


    Happy Hunting!

  • Is there a way to understand the cause for it and to fix it?

    Sure, but that would require getting a bit into storage details. And especially if you don't have a backup and only rely on RAID (RAID5 in 2017 -- insane), you can't sleep well later (do you really want this?)


    Anyway: there's a problem with one specific inode (reported daily), so the first step would be to search for the affected file. Check 'df' output for the mountpoint of your md device (e.g. /srv/foo/bar) and then run:


    Code
    find /srv/foo/bar -inum 98893575

    If you have a backup, simply try to delete the file, restore it from backup and see what happens. If you don't have a backup you're doing something seriously wrong and can't be supported anyway.
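
    If you want to inspect the inode's metadata before deleting anything, debugfs (part of e2fsprogs) can show it. A sketch, assuming the filesystem sits on /dev/md127 as in your kernel messages (run as root; debugfs opens the device read-only by default):

    Code
    debugfs -R 'stat <98893575>' /dev/md127    # print the metadata of that inode number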

  • Hello,


    Thank you very much for the inputs, I really appreciate it.
    As I understand from the responses, I need to upgrade my file system.


    I would really much like to do that and would appreciate your guidance.
    How do I know what is the best FS for me?


    Regarding the current issue, I have received the following:


    Code
    /srv/dev-disk-by-label-RAID/lost+found/#98893575': Structure needs cleaning

    Can I just delete it?
    I also got this message during boot; I tried to run the find command on these and got the same "Structure needs cleaning" output.

    • Official Post

    Hi,


    I'm ready with all the needed backups to my data.
    Please advise what are the next steps to improve my File System.


    Thank you

    If you have a backup, you're way ahead of the game. It's amazing how many users don't bother, until it's too late. (I hope you'll maintain a regularly scheduled backup going forward.) While ZFS is good, it is NOT backup.


    The safest bet among the advanced file systems available, at this point in time, is a ZFS mirror, the rough equivalent of RAID1. (That's just my opinion, but there are others who agree.) To get an understanding of ZFS, here's a good primer that provides an overview -> ZFS. On the other hand, with a good backup that you know can be restored (read: tested), raidz1 is fine. (raidz1 is the rough equivalent of RAID5.) Read the provided link and think it over.


    With the disks you have (3x3TB) you can have a 6TB raidz1 array, or a 3TB ZFS mirror with a hot spare. Setting up either option can be done in OMV's WEB GUI. I can walk you through the process.
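
    For reference, behind the GUI the two layouts come down to something like the following on the CLI. This is only a sketch: the pool name "tank" and the /dev/sdX names are placeholders (whole-disk /dev/disk/by-id paths are generally preferred), and the plugin handles these details for you:

    Code
    # 6TB usable: raidz1 across the three disks
    zpool create -o ashift=12 tank raidz1 /dev/sda /dev/sdb /dev/sdc
    # 3TB usable: two-disk mirror with the third disk as a hot spare
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb spare /dev/sdc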


    To do this cleanly, rebuilding OMV from scratch might be the best way to go. (It doesn't take that long.) Or, if you want to keep your current build, realize that the process will mean deleting Samba shares, base shares, file systems, etc. I can give you a note or two on that as well.


    Again, give it some thought.

  • Hi,


    Thank you for all the details. I will need to take some time and learn it carefully, but my preference is the 6TB raidz1 array.
    I have just recently installed OMV from scratch due to the upgrade from OMV 2 to 3.


    In the meantime, I would appreciate input regarding the current FS issue I mentioned above: is it safe to delete that inode in the lost+found?

    • Official Post


    In the meantime, I would appreciate input regarding the current FS issue I mentioned above: is it safe to delete that inode in the lost+found?

    I haven't done it before; I'd have to refer you to tkaiser's post on this above. It all comes back to the backup. With a current backup you have options and can take the risk, because the file can be replaced/repaired if there is an ill effect.


    On the other hand, if you plan to upgrade to ZFS, I'd endure the error messages if they're not critical. (And they don't appear to be.)

  • /srv/dev-disk-by-label-RAID/lost+found/#98893575

    It's already where it belongs: fsck sent it to this directory, and somewhere else this data is now missing (please search the web for »what is 'lost+found'«). So maybe the above syslog message is just telling you that there is a file you should keep an eye on (no idea; I haven't used ext4 for valuable data in years, even though it can be considered one of the most robust filesystems around).
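
    If you want to see what fsck has parked there, a quick look is enough (the path is taken from your earlier message; the stat call may itself report "Structure needs cleaning" if the inode is still damaged):

    Code
    ls -la /srv/dev-disk-by-label-RAID/lost+found/
    stat '/srv/dev-disk-by-label-RAID/lost+found/#98893575'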

  • Thank you, I will have a look.


    So what is the recommendation for recreating my system?


    I have a 240GB SSD for the OMV system; what FS should it be?


    And what is the recommendation for the 3x3TB drives I have? I need at least 6 TB of storage.


    Thank you
