Parity on a cheap NAS - victim of silent corruption

  • Hi,


    Background

    I built a NAS a while back on a budget - $20 for a Rock64 + SD card + PSU from eBay - with an 8TB HDD and a second 8TB HDD that I rsync to via the USB backup plugin whenever I feel like plugging in the backup drive. Pretty solid setup - it could reliably hit 100+ MB/s on transfers.


    Recently (about two years after the initial setup) I discovered that some of my movies wouldn't play, or had artifacts that I knew weren't present in the past. Some of the rsynced backup data is fine, since I believe rsync only syncs based on timestamps rather than checksums, but I have no idea when the primary drive started corrupting data or exactly which files are corrupted.... Thankfully I did generate an MD5 of every file a long time back, so that's a start... I'll need to write some code to dig into this.
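    (In practice, if the old list was produced in standard md5sum format, it can simply be replayed against the current data - no new code needed. The mount point and file names below are placeholders.)

    Code
    # Replay an old md5sum-format manifest against the current files.
    cd /srv/dev-disk-by-label-data || exit 1
    md5sum -c /root/old-checksums.md5 2>/dev/null | grep -v ': OK$' > /root/mismatched-or-missing.txt
    wc -l < /root/mismatched-or-missing.txt   # files that no longer match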


    I started digging into SnapRAID, as I could still use it with a Pi 4 or something else I have on hand, and I could just buy another HDD to use as parity. After playing around I ran into some concerns.


    Questions/Concerns


    SnapRAID works with USB drives, but the guides suggest setting up a scheduled 'snapraid sync'. It seems SnapRAID has no knowledge of 'file actions' and will blindly sync and re-compute checksums on whatever the storage drives spit out. The only way to detect silent corruption or bit rot is to run a diff, review all the file differences, and then either sync or rebuild the data.


    I'm looking for something that runs 'live', where parity is updated per file, per file action. So if I transfer a file to the NAS, the NAS generates parity for that file rather than rebuilding parity for the whole drive. I'd also like something that runs transparently (though obviously I'd want to know if a drive is corrupting data).


    Is there anything that fits this description? I'm thinking the RAID 5 option would fit? Does OMV's 'RAID 5' use software RAID? That would require a full computer with SATA ports and a couple more disks, which I was hoping to avoid... At this point I'm thinking I need to get something more professional.


    To help someone in a similar predicament:

    Word to the wise - it's a very good idea to run a find and log an MD5 of every file on your NAS. Even with RAID 5, I've heard horror stories from way back (2005) of non-protocol-conforming hardware RAID 5 doing wonky stuff.
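    A minimal sketch of that idea - the mount point and output path are placeholders; keep a copy of the manifest somewhere off the NAS:

    Code
    # Walk the data share and record an MD5 for every file.
    find /srv/dev-disk-by-label-data -type f -print0 \
      | xargs -0 md5sum > /root/checksums-$(date +%F).md5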


    Finding problems with video files - let's say you don't know what the MD5 should be; you can still run a decode check and see whether the video file has 'problems'. The catch is that even good video files can have missing frames here and there ;(. But if you get 100+ errors in a single video file, that's a red flag.
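    One common way to do this (assuming ffmpeg is installed) is to decode the whole file to a null output and count the errors:

    Code
    # Decode the file without writing output and log every decoder error.
    ffmpeg -v error -i movie.mkv -f null - 2> movie-errors.log
    wc -l < movie-errors.log   # a handful is normal, hundreds is a red flag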

    • Official Post

    You would have to use inotify or something like that to generate checksums on write. You would be better off having multiple backups. BorgBackup would let you do that with the same setup (it dedupes and compresses) and would detect checksum issues when backing up.
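    A rough sketch of the inotify idea, for illustration only (it needs the inotify-tools package; paths are placeholders) - though as said above, multiple backups are the better answer:

    Code
    # Watch the share and append an MD5 for every file that finishes writing.
    WATCH_DIR=/srv/dev-disk-by-label-data
    MANIFEST=/root/live-checksums.md5
    inotifywait -m -r -e close_write --format '%w%f' "$WATCH_DIR" |
    while read -r file; do
        md5sum "$file" >> "$MANIFEST"
    done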

    omv 7.0.5-1 sandworm | 64 bit | 6.8 proxmox kernel

    plugins :: omvextrasorg 7.0 | kvm 7.0.13 | compose 7.1.6 | k8s 7.1.0-3 | cputemp 7.0.1 | mergerfs 7.0.4


    omv-extras.org plugins source code and issue tracker - github - changelogs


    Please try ctrl-shift-R and read this before posting a question.

    Please put your OMV system details in your signature.
    Please don't PM for support... Too many PMs!

  • Hello,


    As far as I know, RAID 5 checks for data corruption, so yes, this would help prevent it by flagging the faulty drive as early as possible.


    In your present case, I would check the SMART data for each disk and remove the faulty one from the RAID (assuming it's RAID 1 or similar?), which might resolve the integrity issue, provided one of the drives failed while the other one works OK, wrote proper data, and didn't sync from the wrong disk.


    Best of luck

    • Official Post

    Would Raid5 have prevented this from happening?

    RAID 5 does not have a checksum checking / testing capability that will repair bit rot. It can only rebuild a failed or failing drive, under certain conditions. Further, running RAID 5 on SBCs with USB-connected drives is a really bad idea.
    ______________________________________________________

    SMART is the best tool to detect when a drive is beginning to go south.

    First, set up User Notifications. That's covered starting on page 35 of the OMV5 guide. With that done and tested, you'll get an E-mail heads-up if SMART errors or other file system issues are detected.

    Then, under Storage, SMART, enable SMART monitoring in the Settings tab.
    Under the Devices tab, consider running an after hours short test, once a week. I've found the short test to be enough, but others use the long test once a month. These tests are recorded and can be referred to for future use.

    The following are the SMART stats to keep an eye on for spinning drives (a quick command-line check follows the list):

    SMART 5 – Reallocated_Sector_Count.

    SMART 187 – Reported_Uncorrectable_Errors.

    SMART 188 – Command_Timeout.

    SMART 197 – Current_Pending_Sector_Count.

    SMART 198 – Offline_Uncorrectable.

    (One or two counts of the above may not mean anything, but if they start to steadily increment, a drive failure may be in progress.)
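    For reference, the same attributes can be pulled manually with smartctl (part of smartmontools); /dev/sda is a placeholder for your device:

    Code
    # Print only the attributes worth watching on a spinning drive.
    smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Reported_Uncorrect|Command_Timeout|Current_Pending|Offline_Uncorrectable'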

    ______________________________________________________


    SNAPRAID:
    I believe the problem you're experiencing may be related to how you're implementing SNAPRAID.
    Consider the following operations, which can be set up to run in Scheduled Tasks; note that the sequence is important.


    First note that, when I used SNAPRAID on a backup server, I ran a SYNC operation once every two weeks. That gave me some time to receive the E-mail stats and intervene if a problem was detected.

    (The following is the way I did it. There are many other ways this can be done. Perhaps others will chime in with their routine and rationale.)


    snapraid -p 100 -o 13 scrub
    **scrubs all files that have not been scrubbed in two weeks. This operation is done two days before a sync.**

    snapraid -e fix
    **with the output of the above, SNAPRAID fixes corrupted files found, using their checksums and parity info. This operation is done the day before a SYNC**

    snapraid touch; snapraid sync -l snapsync.log
    **This syncs the new files added during the last two weeks, along with new checksums and parity for the files corrected above. The touch command fixes files with a zero sub-second timestamp (a typical annoyance), and -l logs the output to the file name shown.**


    If notifications are set up and turned on in Scheduled Tasks, all of the output of the above is E-mailed to you for review and consideration.
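    As a rough illustration only - OMV's Scheduled Tasks are cron jobs under the hood, so the sequence above could be approximated with entries like these (days, times and log path are arbitrary examples):

    Code
    # /etc/cron.d/snapraid - sketch of the two-week cycle described above
    # m  h  dom   mon dow user  command
    0    3  1,15  *   *   root  snapraid -p 100 -o 13 scrub
    0    3  2,16  *   *   root  snapraid -e fix
    0    3  3,17  *   *   root  snapraid touch; snapraid sync -l snapsync.log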


    Optional:

    snapraid --force-zero sync

    **This command forces a sync to proceed when files of zero-byte length are detected - an annoying habit of Windows clients.**


    SNAPRAID "diff" scripts are also something to consider.
    Basically, a diff script can be set up to stop a SYNC operation if more than a designated number of files have changed or disappeared since the last SYNC. This is an important safeguard if a drive begins to fail and a large number of files no longer match: it keeps the last good SYNC available for drive recovery (a bare-bones sketch follows below).
     
    (gderf and others know more about customizing a diff script than I do.)
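    A bare-bones sketch of that idea (the threshold, grep pattern and log name are arbitrary examples; the community scripts do this far more thoroughly):

    Code
    #!/bin/sh
    # Skip the sync if snapraid diff reports an unusually large number of
    # updated or removed files - a possible sign of a failing drive.
    THRESHOLD=50
    CHANGED=$(snapraid diff | grep -c -E '^(update|remove) ')
    if [ "$CHANGED" -gt "$THRESHOLD" ]; then
        echo "Too many changed files ($CHANGED); sync skipped - investigate first." >&2
        exit 1
    fi
    snapraid touch; snapraid sync -l snapsync.log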

    Bottom line: SNAPRAID is one of the very few "easy" choices for error detection AND correction on SBCs. Combined with notifications and SMART drive testing, that covers a lot of bases. Nothing is perfect, but the combination (SNAPRAID and SMART) is very good. Of course, a 100% backup of all data is still highly recommended.

    _________________________________________

    At this point I'm thinking I need to get something more professional.

    I started out with a pair of RPi 2Bs several years ago and rapidly came to the same conclusion. There's nothing wrong with SBCs and, for many users, they're enough. With two SBCs, even the backup function can be taken care of at low cost.

    However, in my case I wanted something a bit more robust for handling LAN client backups and ironclad bit-rot protection, to include ECC RAM. I bought a small SOHO server as my primary box, set up OMV with ZFS, and haven't looked back.

  • SnapRAID might be an option on a low-powered system like this one.

    It has a feature (pre-hash) that reads data twice in order to minimize bit rot when writing parity, and the scrub feature scans the array to check for silent errors or corrupted data.

    It is more flexible than any RAID system, at the cost that parity is not computed live.
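    For anyone starting from scratch, a minimal snapraid.conf for a couple of data disks plus a parity disk might look like this (mount paths are placeholders):

    Code
    # /etc/snapraid.conf - minimal sketch
    parity  /srv/dev-disk-by-label-parity/snapraid.parity
    content /srv/dev-disk-by-label-parity/snapraid.content
    content /srv/dev-disk-by-label-data1/snapraid.content
    data d1 /srv/dev-disk-by-label-data1/
    data d2 /srv/dev-disk-by-label-data2/

    With that in place, 'snapraid sync -h' runs a sync with the pre-hash double read, and 'snapraid scrub' does the periodic silent-error check.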


    Ah, I can see crashtest has already recommended SnapRAID.


    crashtest I love how SnapRAID gives you control over everything.

    My daily all-in-one SnapRAID script also prints SnapRAID's internal disk-health report, which is quite useful.

    Coupled with OMV's SMART monitoring and regular offline backups, it covers most home users' needs.
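    (The internal disk-health report mentioned above is presumably the output of SnapRAID's own smart command, which prints per-disk SMART data and an estimated failure probability:)

    Code
    snapraid smart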


    I came up with a very good script, so feel free to ask.

    OMV BUILD - MY NAS KILLER - OMV 6.x + omvextrasorg (updated automatically every week)

    NAS Specs: Core i3-8300 - ASRock H370M-ITX/ac - 16GB RAM - Sandisk Ultra Flair 32GB (OMV), 256GB NVME SSD (Docker Apps), 2x16TB HDDs w/ SnapRAID - Fractal Design Node 304 - Be quiet! Pure Power 11 350W


    My all-in-one SnapRAID script!


  • BTW, to help anyone searching on Google: installing RTL8125B drivers on the Odroid H2+, which for now requires manual installation. Plug in a USB-to-Ethernet adapter temporarily to download and install everything:


    Code
    # Grab the driver source and the build prerequisites.
    wget https://github.com/awesometic/realtek-r8125-dkms/archive/master.zip -O realtek-8125-dkms.zip
    sudo apt install make build-essential unzip dkms linux-headers-$(uname -r)
    # The GitHub archive unpacks to a -master directory.
    unzip realtek-8125-dkms.zip
    cd realtek-r8125-dkms-master
    sudo ./autorun.sh


    Then it should show up in 'ip a' - go to the web UI (still using the USB NIC) and set up network profiles for the newly installed NICs.

  • BTW, to help anyone searching on Google: installing RTL8125B drivers on the Odroid H2+

    Thanks for sharing your experience. Nevertheless, it would be nice if you opened a new thread for this with a more appropriate title.

    OMV 3.0.100 (Gray style)

    ASRock Rack C2550D4I C0-stepping - 16GB ECC - 6x WD RED 3TB (ZFS 2x3 Striped RaidZ1) - Fractal Design Node 304 -

    3x WD80EMAZ Snapraid / MergerFS-pool via eSATA - 4-Bay ICYCube MB561U3S-4S with fan-mod

  • Then, under Storage, SMART, enable SMART monitoring in the Settings tab.
    Under the Devices tab, consider running an after hours short test, once a week. I've found the short test to be enough, but others use the long test once a month. These tests are recorded and can be referred to for future use.

    Hi crashtest, I have enabled SMART monitoring and configured an after-hours short test to run once a week (as you and the OMV5 guide recommended) on my Helios4 setup.

    I was wondering, though, if there are any recommendations for the SMART settings?

    For example, when I set it up I just left the default settings as they were.


    Reading the descriptions of the different options under Power mode, should I be setting this to Standby?

    • Official Post

    On this particular page, I'd leave the defaults as they are, but you can do as you like. (Which feeds into the following.)

    When it comes to Disks and Power Management, the power profiles may or may not work with your drive(s) and / or whatever the Helios is using to interface its SATA ports. Given that the Helios is already power efficient, saving a couple of watt-hours by spinning down drives may come at the expense of excessive wear and tear when spinning them up again.

    There are a number of arguments for and against spinning down drives. For example, starting a drive is the moment of the greatest stress and the highest consumption of power. If that happens too frequently, more power may be consumed than would be by simply letting it spin, along with creating extra wear and tear. NAS drives, if that's what you have, are designed to run 24x7.

    With that said, if you've been successful at spinning down drives and want to continue to do so, using the STANDBY setting in this screen would make sense.

  • Thanks, appreciate the advice. I'll stick with the defaults.

    When it comes to Disks and Power Management, the power profiles may or may not work with your drive(s) and / or whatever the Helios is using to interface its SATA ports. Given that the Helios is already power efficient, saving a couple of watt-hours by spinning down drives may come at the expense of excessive wear and tear when spinning them up again.

    There are a number of arguments for and against spinning down drives. For example, starting a drive is the moment of the greatest stress and the highest consumption of power. If that happens too frequently, more power may be consumed than would be by simply letting it spin, along with creating a lot of wear and tear. NAS drives, if that's what you have, are designed to run 24x7.

    You have already answered my next question. I don't currently have any settings to spin down drives and was going to ask if I should ;).

    I'm happy with the thought of NOT letting them spin down; all four drives in my Helios are NAS drives, either IronWolf or Red. If they're designed to run 24x7, then I'll let them.


    Thanks, appreciate the advice and help!
