Issues with data consistency with OMV 1.19, how to diagnose?

  • Hi,


    my NAS with OMV 1.19 has been running for a few years without any issues. In the last few months, however, something would go wrong now and again. Sometimes I would watch something off the NAS and after a few hours there would be an error, or I would dump 50 GB onto the NAS and at, say, 70% there would be a "transmission" error. In case of an error the affected filesystem would go into "resync" --> 8 hours --> "active" --> 8 hours --> "clean", and then I could use it again.


    I don't think there is anything wrong with my PC, because USB hard drives work perfectly and nothing else is wrong with it.


    I also don't think there is anything wrong with my network. It is really simple and everything else works fine. I have already swapped the cable going to the NAS running OMV.


    How do I start diagnosing my NAS running OMV 1.19? SMART data doesn't show an obviously defective drive, and a RAM test was also good. Where do I look next?

    • Official Post

    From your post, it's not clear what the affected file system is or whether you're using RAID.


    For potential drive failure, these are the SMART attributes to look at. If one or more of the following are incrementing, it's just a matter of time:

    • SMART 5 – Reallocated_Sector_Count.
    • SMART 187 – Reported_Uncorrectable_Errors.
    • SMART 188 – Command_Timeout.
    • SMART 197 – Current_Pending_Sector_Count.
    • SMART 198 – Offline_Uncorrectable.

    Where the remaining attribute categories are concerned, some of the counts and their meanings (raw values) can vary between drive OEMs. A quick way to pull just the attributes above is sketched below.
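
    For illustration, something like this would print only those attributes for one drive (/dev/sda is an example device; the exact attribute names can differ slightly between OEMs):

    Code
    # print only the failure-predicting SMART attributes for one drive
    smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Reported_Uncorrect|Command_Timeout|Current_Pending|Offline_Uncorrect'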
    ___________________________________________


    If you want to get a complete picture of each drive:


    On the command line, do:
    apt-get install curl


    After the curl install finishes:
    (The "?" in the line below is a shell wildcard that matches every /dev/sdX device, so the loop covers all drives; to check a single disk, replace it with that disk's letter.)
    for disk in /dev/sd? ; do smartctl -x $disk ; done | curl -F 'sprunge=<-' sprunge.us


    The above line returns a URL. Copy and paste the URL into the address bar of a web browser.
    ___________________________________________________________________________


    You could copy and post the URLs generated from the above into this thread, but I'm guessing that your drives are older. If you're a RAID user and a failed drive is replaced, note that RAID is not kind to the remaining drives in an array. "Resilvering" a new drive can cause another older drive to fail during the process.


    In any case, it might be time to think about backing up your data.

  • the affected filesystem would go into "resync" --> 8 hours --> "active" --> 8 hours --> "clean", and then I could use it again.

    Filesystems resyncing? Are you playing RAID? If so, I would first check SMART attribute 199 on all the disks in use (CRC errors that indicate cabling/connector issues).
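
    A quick sketch of that check, assuming the disks show up as /dev/sda ... /dev/sdg:

    Code
    # dump attribute 199 (UDMA_CRC_Error_Count) for every SATA disk
    for disk in /dev/sd? ; do echo "== $disk" ; smartctl -A $disk | grep -i 'CRC' ; done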


    Wrt diagnosing: by looking into log files, of course. I don't know which OS this ancient OMV version is running on (I only know 3 and 4), but if it was already Debian back then, /var/log/syslog* would most probably be interesting. If others are to help interpret the events that lead to problems, you might want to provide at least a little bit of information about your setup.

  • I'm running two RAID 5 arrays on Debian 7.11 with OMV 1.19 (a complete OMV release)


    lsblk: http://sprunge.us/TRHE


    sda: http://sprunge.us/AMja



    sdb: http://sprunge.us/XKQi



    sdc: http://sprunge.us/CXCQ



    sdd: http://sprunge.us/eWIX



    sde: http://sprunge.us/aHVW



    sdf: http://sprunge.us/BPab


    sdg: http://sprunge.us/KfKT


    My SATA cables have retaining connectors and I already made sure they are all the way in.


    The errors are intermittent. The system will work for a few weeks without incidents and then, when I read or write data, there will be an error; the NAS usually becomes unresponsive and I have to do a hard reset. After that the affected filesystem goes through the "resync" --> "active" --> "clean" cycle.
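
    (For reference, the "resync"/"active"/"clean" states I'm quoting are what the md layer reports; a minimal way to watch them, with md0 as an example:)

    Code
    cat /proc/mdstat            # shows resync progress per array
    mdadm --detail /dev/md0     # shows the array state (clean/active/resyncing)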

    • Official Post

    As it seems, your drives are not ready to die, at least not soon. (They all die; it's just a question of when.)
    ________________________________________


    Are you having the intermittent issue with the array that contains sdc? If so, tkaiser may be pointing in the right direction on a potential cable/connector or associated hardware issue. sdc has logged 16 attribute 199 events over roughly a year - last year - where there should be 0. (There's one other drive with a single count but, given the uptime on your system, I'm ignoring that.)


    RAID5 stripes data across drives. If there's a glitch in one drive (or cable) during a read/write, a stripe is corrupted; the file system might register a read/write error and check for consistency. So, even if these events have been far apart, think about it: have there been roughly 16 events in the last year?
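
    If you want to test that theory, the md layer can be asked to verify a whole array; a sketch, with md0 as an example name:

    Code
    # trigger a consistency check of an md array (reads all stripes, compares parity)
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                      # watch the check progress
    cat /sys/block/md0/md/mismatch_cnt    # non-zero afterwards hints at silent corruption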
    ________________________________________


    Other thoughts...
    - Given what you're describing, your data is probably being corrupted. Really, it's not a question of "if"; it's more a question of severity. mdadm RAID will not prevent that, and your downloaded transmission files are showing the real-world effects. (Unfortunately, the files in the rest of this array are not being monitored for consistency.)
    - It's time to update. Many forum members have had little exposure to your version of OMV and, of those that have, they've probably forgotten what the 1.X quirks are. As for the OMV pros: the devs and moderators have, understandably, moved on.
    - If you want to run a RAID5 equivalent, give ZFS some consideration. With file checksums and error correction (on the fly, with active files), you'll be far better off than with mdadm RAID. Get OMV 3.X and the ZFS plugin, and with a few easy tweaks you'd be ready to go. (A minimal command-line sketch follows below this list.)
    - Before doing any hardware surgery, back up. Before you even begin to seriously look at this problem, back up and hope that another "event" doesn't interrupt the process.
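
    The ZFS route, roughly, on the command line; the pool name and disks below are just examples, and the OMV plugin wraps most of this in the UI:

    Code
    # create a RAID5-equivalent (raidz1) pool from four example disks
    zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
    zpool scrub tank       # reads everything, verifies checksums, repairs from parity
    zpool status -v tank   # lists any files with unrecoverable errors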

  • I don't recall md1 having issues and I also don't think there have already been 16 events - probably more than 4, but definitely fewer than 8. All but one of the hard drives had previous lives before being used in this system, so one can't say the recorded SMART errors must be tied to the current problem.


    I didn't upgrade because I really don't like to touch a "running system" and I didn't miss any features. Finding support for a legacy system is of course a downside of this policy. I do make an exception for security updates of the underlying OS, though.


    But I'm also not comfortable updating until this problem is resolved; that's just asking for trouble and a world of hurt. Reading from the affected RAID should not lead to any additional data corruption, so the backup process should be fine.


    Tomorrow I'm going to check the hardware itself again.

    • Official Post

    On the count, there is a difference between an "event" and an event that you noticed, or something that would trigger a resync. While it's speculation on my part - that's all anyone can really do without direct access - the other unnoticed events may have resulted in minor corruption. After all, a CRC is a calculated error.


    In any case, mdadm RAID is not known for error detection. It can detect some types of disk faults, but there have been instances where mdadm RAID did not fail a dead disk and instances where it failed disks without a fault.


    On the upgrade, I agree with you. That's down the road. It would be pointless to upgrade software before figuring out what may be a hardware issue. However, where the array is concerned, if you really want the data, I'd back it up.


    If you want to look at hardware in general, here's a link to The Ultimate Boot CD. But given the intermittent nature of the problem, you might not find anything.


    It wouldn't hurt to swap out the cable on sdc and watch it for a while; if there's another event and/or attribute 199 increments again, swap out the drive.
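
    One low-effort way to "watch it" is to log the CRC count periodically, so an increment is easy to spot; the log path below is just an example (this could be dropped into /etc/cron.daily/):

    Code
    # append the current date and CRC error count for sdc to a history log
    ( date ; smartctl -A /dev/sdc | grep -i 'CRC' ) >> /var/log/sdc-crc-history.log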

  • I don't recall md1 having issues and I also don't think there have already been 16 events - probably more than 4, but definitely fewer than 8

    The 'events' I was talking about are the issues that really happened: your RAID entering resync state for a reason (again: what do you think /var/log/syslog* is for?). While you can ignore those SMART values, especially if they've not been monitored (so you don't know when the numbers increased), ignoring the only source of information (your log files) adds to the 'unsupportable' situation (RAID 5, no backup, no responsible upgrade policy).

    • Official Post

    most probably /var/log/syslog* could be interesting.

    While loosely related:


    Log files can be enormous, which complicates looking for an odd entry back through several months. To parse log files, I've pasted them into a word processor. When I know all or part of a string, finding it is easy: "Edit, Find". But if the string is unknown, normal entries can obscure the odd "once in a few weeks" error entry.


    Other than scanning them line by line, do you have some sort of method or utility for searching log files?


    Thanks

  • Other than scanning them line by line, do you have some sort of method or utility for searching log files?

    Hmm... in this case most probably starting with the following lines:


    Code
    zgrep -A2 -B2 -E "raid|mdadm" /var/log/syslog*

    Then copy the contents of lines that look relevant, use a pager like 'less' or 'more', and navigate to the occurrence in the respective logfile. If something interesting is mentioned in /var/log/syslog.6 (e.g. 'Rebuild25 event detected on md device'), I would copy 'Rebuild25' to the clipboard, open the logfile with 'less /var/log/syslog.6' and enter '/Rebuild25' to jump directly to the line in question.
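
    Spelled out as commands, under the assumption that a hit turned up in /var/log/syslog.6 ('Rebuild25' is just the example string from above):

    Code
    # step 1: grep across all rotated syslogs, with 2 lines of context around each hit
    zgrep -A2 -B2 -E "raid|mdadm" /var/log/syslog*
    # step 2: open the file a hit came from and jump to the string
    less /var/log/syslog.6    # inside less: type /Rebuild25 and Enter; press n for the next match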


    Way too complicated for average users, and that's why I constantly whine that a better 'support data collection tool' is missing in OMV.

    • Official Post

    Hmm... in this case most probably starting with the following lines:

    Code
    zgrep -A2 -B2 -E "raid|mdadm" /var/log/syslog*

    Then copy the contents of lines that look relevant, use a pager like 'less' or 'more', and navigate to the occurrence in the respective logfile. If something interesting is mentioned in /var/log/syslog.6 (e.g. 'Rebuild25 event detected on md device'), I would copy 'Rebuild25' to the clipboard, open the logfile with 'less /var/log/syslog.6' and enter '/Rebuild25' to jump directly to the line in question.


    Way too complicated for average users, and that's why I constantly whine that a better 'support data collection tool' is missing in OMV.

    I would wholeheartedly agree but, as a user, I couldn't possibly complain, given the "free + open source" consideration.

    • Official Post

    How do I start diagnosing my NAS running OMV 1.19? SMART data doesn't show an obviously defective drive, and a RAM test was also good. Where do I look next?

    tkaiser has given you (and me) something with which you can narrow down the vast amount of info in ALL of your syslog files in one shot.

    Code
    zgrep -A2 -B2 -E "raid|mdadm" /var/log/syslog* |more >/logsearch.txt


    (If I have it right.) Between the quotes are search terms, each separated by a pipe, with a pipe add-on at the end directing output to a file deposited at the root.


    (After you look over your results:)
    While it's nowhere near as polished as tkaiser's approach for sorting, you could modify the same command line to search within the output file and narrow the search further, using terms of interest.



    Code
    zgrep -A2 -B2 -E "Rebuild" /logsearch.txt |more >/logsearch2.txt




  • If I have it right

    Unfortunately doesn't seem so.


    I tried to describe a two-step process. The first is to grep through all available log files (if you know what you're doing and when log rotation happens, simply search only in the file where the events you're interested in are collected). The second step is to copy the search string to the clipboard in the terminal, then open the respective logfile with a so-called pager (not using a pipe) and search for the string you're interested in, to read through what happened at the same time (the '/' is used to search in pagers like less, which is better than more).


    As already said: that's command-line stuff average users shouldn't have to deal with. But it's fast and efficient if you're used to it and need to diagnose what has happened on a failing/failed server (usually that's not even necessary, since systems are monitored - we use the logmatch directive to let snmpd parse the relevant logfiles every minute, and if something important happens, an alert is generated automatically, so you already know what you're searching for when you log in for troubleshooting).
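
    For reference, such a logmatch line in snmpd.conf looks roughly like this (the name, file and interval below are illustrative, not something OMV ships):

    Code
    # check /var/log/syslog every 60 seconds for md/RAID messages
    logmatch raid-events /var/log/syslog 60 (raid|mdadm)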


    What's IMO missing in OMV is a data collection tool that at least contains dmesg output and maybe also log excerpts as above.
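
    Even something as small as this sketch would cover that (the output path is an example):

    Code
    # collect kernel messages plus md/RAID-related syslog excerpts in one file
    ( dmesg ; zgrep -h -A2 -B2 -E "raid|mdadm" /var/log/syslog* ) > /tmp/support-data.txt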

    • Official Post

    Unfortunately doesn't seem so.
    I tried to describe a two-step process. The first is to grep through all available log files (if you know what you're doing and when log rotation happens, simply search only in the file where the events you're interested in are collected). The second step is to copy the search string to the clipboard in the terminal, then open.....


    /--- :) Bla, bla.. Bla bla bla... :) ---/

    Since your process is beyond the average user, by your own admission, I was attempting to use an element or two of your more polished approach in a simplified search process (in the event that Markx didn't fully understand what you proposed).
    What I posted would be far less revealing, but might still yield a clue.
    ________________________


    While a bit off topic:
    (Excuse the bother.) I couldn't find a pattern example with a pipe in it,
    e.g.: zgrep -A2 -B2 -E "one|two|three" /filename
    In this example, all 3 terms in the pattern must match a line, in a boolean "and" operation. Correct? (Thanks)
