Parity on a cheap NAS - victim of silent corruption

  • Hi,


    Background

    I built a NAS a while back on a budget - $20 for a Rock64 + SD card + PSU from eBay - with an 8TB HDD and a second 8TB HDD that I rsync to via the USB backup plugin whenever I feel like plugging in the backup drive. Pretty solid setup - it could reliably hit 100+ MB/s on transfers.


    Recently (about two years after the initial setup) I discovered that some of my movies wouldn't play, or had artifacts that I knew weren't present in the past. Some of the rsynced backup data is fine, since I believe rsync only syncs based on timestamps rather than checksums, but I have no idea when the primary drive started corrupting data or exactly which files are corrupted.... Thankfully I did generate an MD5 of every file a long time back, so that's a start... I'll need to write some code to dig into this.
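    (In practice, if the old list was produced in standard md5sum format, it can simply be replayed against the current data - no new code needed. The mount point and file names below are placeholders.)

    Code
    # Replay an old md5sum-format manifest against the current files.
    cd /srv/dev-disk-by-label-data || exit 1
    md5sum -c /root/old-checksums.md5 2>/dev/null | grep -v ': OK$' > /root/mismatched-or-missing.txt
    wc -l < /root/mismatched-or-missing.txt   # files that no longer match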


    I started digging into SnapRAID, as I could still use it with a Pi 4 or something else I have on hand, and I could just buy another HDD to use as parity. After playing around I ran into some concerns.


    Questions/Concerns


    SnapRAID works with USB drives, but the guides suggest setting up a scheduled 'snapraid sync'. It seems SnapRAID has no knowledge of 'file actions' and will blindly sync and re-compute checksums on whatever the storage drives spit out. The only way to detect silent corruption or bit rot is to run a diff, review all the file differences, and then either sync or rebuild the data.


    I'm looking for something that runs 'live', where parity is updated per file, per file action. So if I transfer a file to the NAS, the NAS generates parity for that file rather than rebuilding parity for the whole drive. I'd also like something that runs transparently (though obviously I'd want to know if a drive is corrupting data).


    Is there anything that fits this description? I'm thinking the RAID 5 option would fit? Does OMV's 'RAID 5' use software RAID? That would require a full computer with SATA ports and a couple more disks, which I was hoping to avoid... At this point I'm thinking I need to get something more professional.


    To help someone in a similar predicament:

    Word to the wise - it's a very good idea to run a find and log an MD5 of every file on your NAS. Even with RAID 5, I've heard horror stories from way back (2005) of non-protocol-conforming hardware RAID 5 doing wonky stuff.
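    A minimal sketch of that idea - the mount point and output path are placeholders; keep a copy of the manifest somewhere off the NAS:

    Code
    # Walk the data share and record an MD5 for every file.
    find /srv/dev-disk-by-label-data -type f -print0 \
      | xargs -0 md5sum > /root/checksums-$(date +%F).md5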


    Finding problems with video files - let's say you don't know what the MD5 should be; you can still run a decode check and see whether the video file has 'problems'. The catch is that even good video files can have missing frames here and there ;(. But if you get 100+ errors in a single video file, that's a red flag.
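    One common way to do this (assuming ffmpeg is installed) is to decode the whole file to a null output and count the errors:

    Code
    # Decode the file without writing output and log every decoder error.
    ffmpeg -v error -i movie.mkv -f null - 2> movie-errors.log
    wc -l < movie-errors.log   # a handful is normal, hundreds is a red flag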

    • Official Post

    You would have to use inotify or something like that to generate checksums on write. You would be better off having multiple backups. BorgBackup would let you do that with the same setup (it dedupes and compresses) and would detect checksum issues when backing up.
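    A rough sketch of the inotify idea, for illustration only (it needs the inotify-tools package; paths are placeholders) - though as said above, multiple backups are the better answer:

    Code
    # Watch the share and append an MD5 for every file that finishes writing.
    WATCH_DIR=/srv/dev-disk-by-label-data
    MANIFEST=/root/live-checksums.md5
    inotifywait -m -r -e close_write --format '%w%f' "$WATCH_DIR" |
    while read -r file; do
        md5sum "$file" >> "$MANIFEST"
    done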

    omv 7.0.5-1 sandworm | 64 bit | 6.8 proxmox kernel

    plugins :: omvextrasorg 7.0 | kvm 7.0.13 | compose 7.1.6 | k8s 7.1.0-3 | cputemp 7.0.1 | mergerfs 7.0.4


    omv-extras.org plugins source code and issue tracker - github - changelogs


    Please try ctrl-shift-R and read this before posting a question.

    Please put your OMV system details in your signature.
    Please don't PM for support... Too many PMs!

  • Hello,


    As far as I know, RAID 5 checks for data corruption, so yes, this would help prevent it by flagging the faulty drive as early as possible.


    In your present case, I would check the SMART data for each disk and remove the faulty one from the RAID (assuming it's RAID 1 or similar?), which might resolve the integrity issue, provided one of the drives failed while the other one works OK, wrote proper data, and didn't sync from the wrong disk.


    Best of luck

    • Official Post

    Would Raid5 have prevented this from happening?

    RAID 5 does not have a checksum checking / testing capability that will repair bit rot. It can only rebuild a failed or failing drive, under certain conditions. Further, running RAID 5 on SBCs with USB-connected drives is a really bad idea.
    ______________________________________________________

    SMART is the best tool to detect when a drive is beginning to go south.

    First, set up User Notifications. That's covered starting on page 35 of the OMV5 guide. With that done and tested, you'll get an E-mail heads-up if SMART errors or other file system issues are detected.

    Then, under Storage, SMART, enable SMART monitoring in the Settings tab.
    Under the Devices tab, consider running an after hours short test, once a week. I've found the short test to be enough, but others use the long test once a month. These tests are recorded and can be referred to for future use.

    The following are the SMART stats to keep an eye on for spinning drives (a quick command-line check follows the list):

    SMART 5 – Reallocated_Sector_Count.

    SMART 187 – Reported_Uncorrectable_Errors.

    SMART 188 – Command_Timeout.

    SMART 197 – Current_Pending_Sector_Count.

    SMART 198 – Offline_Uncorrectable.

    (One or two counts of the above may not mean anything, but if they start to steadily increment, a drive failure may be in progress.)
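    For reference, the same attributes can be pulled manually with smartctl (part of smartmontools); /dev/sda is a placeholder for your device:

    Code
    # Print only the attributes worth watching on a spinning drive.
    smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Reported_Uncorrect|Command_Timeout|Current_Pending|Offline_Uncorrectable'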

    ______________________________________________________


    SNAPRAID:
    I believe the problem you're experiencing may be related to how you're implementing SNAPRAID.
    Consider the following operations, which can be set up to run in Scheduled Tasks; note that the sequence is important.


    First note that, when I used SNAPRAID on a backup server, I ran a SYNC operation once every two weeks. That gave me some time to receive the E-mail stats and intervene if a problem was detected.

    (The following is the way I did it. There are many other ways this can be done. Perhaps others will chime in with their routine and rationale.)


    snapraid -p 100 -o 13 scrub
    **scrubs all files that have not been scrubbed in two weeks. This operation is done two days before a sync.**

    snapraid -e fix
    **with the output of the above, SNAPRAID fixes corrupted files found, using their checksums and parity info. This operation is done the day before a SYNC**

    snapraid touch; snapraid sync -l snapsync.log
    **This syncs the new files added during the last two weeks, along with new checksums and parity for the files corrected above. The touch command fixes files with a zero sub-second timestamp (a typical annoyance), and -l logs the output to the file name shown.**


    If notifications are set up and turned on in Scheduled Tasks, all of the output of the above is E-mailed to you for review and consideration.
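    As a rough illustration only - OMV's Scheduled Tasks are cron jobs under the hood, so the sequence above could be approximated with entries like these (days, times and log path are arbitrary examples):

    Code
    # /etc/cron.d/snapraid - sketch of the two-week cycle described above
    # m  h  dom   mon dow user  command
    0    3  1,15  *   *   root  snapraid -p 100 -o 13 scrub
    0    3  2,16  *   *   root  snapraid -e fix
    0    3  3,17  *   *   root  snapraid touch; snapraid sync -l snapsync.log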


    Optional:

    snapraid --force-zero sync

    **This command forces a sync to proceed when files of zero-byte length are detected - an annoying habit of Windows clients.**


    SNAPRAID "diff" scripts are also something to consider.
    Basically, a diff script can be set up to stop a SYNC operation if more than a designated number of files have changed or disappeared since the last SYNC. This is an important safeguard if a drive begins to fail and a large number of files no longer match: it keeps the last good SYNC available for drive recovery (a bare-bones sketch follows below).
     
    (gderf and others know more about customizing a diff script than I do.)
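    A bare-bones sketch of that idea (the threshold, grep pattern and log name are arbitrary examples; the community scripts do this far more thoroughly):

    Code
    #!/bin/sh
    # Skip the sync if snapraid diff reports an unusually large number of
    # updated or removed files - a possible sign of a failing drive.
    THRESHOLD=50
    CHANGED=$(snapraid diff | grep -c -E '^(update|remove) ')
    if [ "$CHANGED" -gt "$THRESHOLD" ]; then
        echo "Too many changed files ($CHANGED); sync skipped - investigate first." >&2
        exit 1
    fi
    snapraid touch; snapraid sync -l snapsync.log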

    Bottom line: SNAPRAID is one of the very few "easy" choices for error detection AND correction on SBCs. Combined with notifications and SMART drive testing, that covers a lot of bases. Nothing is perfect, but the combination (SNAPRAID and SMART) is very good. Of course, a 100% backup of all data is still highly recommended.

    _________________________________________

    At this point I'm thinking I need to get something more professional.

    I started out with a pair of RPi 2Bs several years ago and rapidly came to the same conclusion. There's nothing wrong with SBCs and, for many users, they're enough. With two SBCs, even the backup function can be taken care of at low cost.

    However, in my case I wanted something a bit more robust for handling LAN client backups and ironclad bit-rot protection, to include ECC RAM. I bought a small SOHO server as my primary box, set up OMV with ZFS, and haven't looked back.

  • SnapRAID might be an option on a low-powered system like this one.

    It has a feature (pre-hash) that reads data twice in order to minimize bit rot when writing parity, and the scrub feature scans the array to check for silent errors or corrupted data.

    It is more flexible than any RAID system, at the cost that parity is not computed live.
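    For anyone starting from scratch, a minimal snapraid.conf for a couple of data disks plus a parity disk might look like this (mount paths are placeholders):

    Code
    # /etc/snapraid.conf - minimal sketch
    parity  /srv/dev-disk-by-label-parity/snapraid.parity
    content /srv/dev-disk-by-label-parity/snapraid.content
    content /srv/dev-disk-by-label-data1/snapraid.content
    data d1 /srv/dev-disk-by-label-data1/
    data d2 /srv/dev-disk-by-label-data2/

    With that in place, 'snapraid sync -h' runs a sync with the pre-hash double read, and 'snapraid scrub' does the periodic silent-error check.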


    Ah, I can see crashtest has already recommended SnapRAID.


    crashtest I love how SnapRAID gives you control over everything.

    My daily all-in-one SnapRAID script also prints SnapRAID's internal disk-health report, which is quite useful.

    Coupled with OMV's SMART monitoring and regular offline backups, it covers most home users' needs.
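    (The internal disk-health report mentioned above is presumably the output of SnapRAID's own smart command, which prints per-disk SMART data and an estimated failure probability:)

    Code
    snapraid smart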


    I came up with a very good script, so feel free to ask.

    OMV BUILD - MY NAS KILLER - OMV 6.x + omvextrasorg (updated automatically every week)

    NAS Specs: Core i3-8300 - ASRock H370M-ITX/ac - 16GB RAM - Sandisk Ultra Flair 32GB (OMV), 256GB NVME SSD (Docker Apps), 2x16TB HDDs w/ SnapRAID - Fractal Design Node 304 - Be quiet! Pure Power 11 350W


    My all-in-one SnapRAID script!


  • BTW, to help anyone searching on Google: installing RTL8125B drivers on the Odroid H2+, which for now requires manual installation. Plug in a USB-to-Ethernet adapter temporarily to download and install everything:


    Code
    # Grab the driver source and the build prerequisites.
    wget https://github.com/awesometic/realtek-r8125-dkms/archive/master.zip -O realtek-8125-dkms.zip
    sudo apt install make build-essential unzip dkms linux-headers-$(uname -r)
    # The GitHub archive unpacks to a -master directory.
    unzip realtek-8125-dkms.zip
    cd realtek-r8125-dkms-master
    sudo ./autorun.sh


    Then it should show up in 'ip a' - go to the web UI (still using the USB NIC) and set up network profiles for the newly installed NICs.

  • BTW, to help anyone searching on Google: installing RTL8125B drivers on the Odroid H2+

    Thanks for sharing your experience. Nevertheless, it would be nice if you opened a new thread for this with a more appropriate title.

    OMV 3.0.100 (Gray style)

    ASRock Rack C2550D4I C0-stepping - 16GB ECC - 6x WD RED 3TB (ZFS 2x3 Striped RaidZ1) - Fractal Design Node 304 -

    3x WD80EMAZ Snapraid / MergerFS-pool via eSATA - 4-Bay ICYCube MB561U3S-4S with fan-mod

  • Then, under Storage, SMART, enable SMART monitoring in the Settings tab.
    Under the Devices tab, consider running an after hours short test, once a week. I've found the short test to be enough, but others use the long test once a month. These tests are recorded and can be referred to for future use.

    Hi crashtest, I have enabled SMART monitoring and configured an after-hours short test to run once a week (as you and the OMV5 guide recommended) on my Helios4 setup.

    I was wondering, though, if there are any recommendations for the SMART settings?

    For example, when I set it up I just left the default settings as they were.


    Reading the descriptions of the different options under Power mode, should I be setting this to Standby?

    • Official Post

    On this particular page, I'd leave the defaults as they are, but you can do as you like. (Which feeds into the following.)

    When it comes to Disks and Power Management, the power profiles may or may not work with your drive(s) and / or whatever the Helios is using to interface its SATA ports. Given that the Helios is already power efficient, saving a couple of watt-hours by spinning down drives may come at the expense of excessive wear and tear when spinning them up again.

    There are a number of arguments for and against spinning down drives. For example, starting a drive is the moment of the greatest stress and the highest consumption of power. If that happens too frequently, more power may be consumed than would be by simply letting it spin, along with creating extra wear and tear. NAS drives, if that's what you have, are designed to run 24x7.

    With that said, if you've been successful at spinning down drives and want to continue to do so, using the STANDBY setting in this screen would make sense.

  • Thanks, appreciate the advice. I'll stick with the defaults.

    When it comes to Disks and Power Management, the power profiles may or may not work with your drive(s) and / or whatever the Helios is using to interface its SATA ports. Given that the Helios is already power efficient, saving a couple of watt-hours by spinning down drives may come at the expense of excessive wear and tear when spinning them up again.

    There are a number of arguments for and against spinning down drives. For example, starting a drive is the moment of the greatest stress and the highest consumption of power. If that happens too frequently, more power may be consumed than would be by simply letting it spin, along with creating a lot of wear and tear. NAS drives, if that's what you have, are designed to run 24x7.

    You have already answered my next question. I don't currently have any settings to spin down drives and was going to ask if I should ;).

    I'm happy with the thought of NOT letting them spin down; all four drives in my Helios are NAS drives, either IronWolf or Red. If they're designed to run 24x7, then I'll let them.


    Thanks, appreciate the advice and help!
