Bitrot protection light

    • Official Post

    So I have plenty of storage and good versioned backups. Now I can worry about bitrot. Maybe?


    Very rare random errors creep into the data over time. Very rare, but increasing the amount of storage increases the probability of random errors.


    I use versioned backups on the folder level using rsync snapshots over the local network.


    Ideally I would like a system with checksums on a file level that can be used to find files with bitrot. If the checksum has changed but the modification time has not changed, then we have bitrot. If bitrot is detected, the file is restored from backup. Perhaps with an extra check of the checksum on the backup copy, and optionally using an older, error-free copy of the file if available.


    So I would have to quickly create new checksums and update snapshots often. Otherwise bitrot errors may migrate into the backups.


    This seems like something that might already exist? It could be integrated into an rsync backup system. Rsync uses checksums and copies files back and forth. And rsync knows where the copies of the file are. It would be pretty lightweight, except for all the checksum calculations, and would need very little extra storage. Just a database for the filenames, checksums and file modification times.
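
    Roughly what I have in mind, as a minimal Python sketch (the root folder and the database filename are just placeholders): walk the tree, keep path, mtime and SHA-256 in SQLite, and flag files whose checksum changed while the modification time did not.

    import hashlib
    import os
    import sqlite3

    ROOT = "/srv/data"            # placeholder: folder to protect
    DB = "checksums.sqlite"       # placeholder: checksum database

    def sha256(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return h.hexdigest()

    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, mtime REAL, hash TEXT)")

    for dirpath, _, names in os.walk(ROOT):
        for name in names:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            digest = sha256(path)
            row = con.execute("SELECT mtime, hash FROM files WHERE path = ?", (path,)).fetchone()
            if row and row[0] == mtime and row[1] != digest:
                # checksum changed but modification time did not: suspected bitrot,
                # so keep the old (good) checksum in the database for a later restore
                print("BITROT?", path)
            else:
                con.execute("REPLACE INTO files (path, mtime, hash) VALUES (?, ?, ?)",
                            (path, mtime, digest))

    con.commit()
    con.close()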


    Does anyone know if something like this already exists? Or something better and even simpler?

  • Look at SnapRAID, but it requires considerable extra storage for the parity data which allows recovery.


    • Official Post

    Yes. I would prefer to use my already existing rsync backups for redundancy, rather than introduce more redundancy. Perhaps run the bitrot detection as part of a weekly rsync snapshot, spread out over the week between different backup targets (subfolders): check e-books for bitrot on Monday, TV shows A-G on Tuesday, and so on.


    Perhaps I could run rsync twice: first as a dry run with checksums, then based on modification times. Files flagged by the first run but not the second (content changed, modification time unchanged) have bitrot. But I don't think rsync saves checksums? With checksums saved between runs, the recalculation work would be halved.
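
    As a rough sketch, the two passes could be compared like this (the source and destination paths are placeholders, and note that this alone cannot tell whether the source or the backup copy is the corrupted one):

    import subprocess

    SRC = "/srv/data/"              # placeholder: source tree
    DST = "backup:/srv/backup/"     # placeholder: rsync destination

    def would_transfer(extra_args):
        # dry run; print one relative name per file rsync would update
        cmd = ["rsync", "-a", "-n", "--out-format=%n"] + extra_args + [SRC, DST]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return {line for line in out.splitlines() if line and not line.endswith("/")}

    by_checksum = would_transfer(["-c"])   # differs in content
    by_mtime = would_transfer([])          # differs in size/modification time

    # content differs although size and mtime look unchanged: suspected bitrot
    for path in sorted(by_checksum - by_mtime):
        print("BITROT?", path)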

    • Official Post

    So I found this: github.com/ambv/bitrot

    Wouldn't that be the same as using btrfs and a scrub with the checksum being stored in sqlite3 instead?


    And if you want to go the utility route, trapexit (author of mergerfs) wrote scorch - https://github.com/trapexit/scorch


    • Official Post

    Wouldn't that be the same as using btrfs and a scrub with the checksum being stored in sqlite3 instead?
    And if you want to go the utility route, trapexit (author of mergerfs) wrote scorch - https://github.com/trapexit/scorch

    Yes, exactly the same. But with the utility approach you can limit the checksum calculations to the folders you want to protect, though on a NAS that will most likely be the bulk of the files anyway.


    But I would like to be able to use EXT4 and my existing rsync snapshots for redundancy. Otherwise I might go with snapraid or raid6 or something.


    Scorch also looks interesting.


    But the automatic restore from a backup is missing. Detection is nice, correction is even nicer. ;) Most likely the bitrot correction would have to be integrated into an rsync backup script, where the location of the bad files and their replacements is known. But the detection bit might run asynchronously from the backup bit, to avoid extremely time-consuming backups.
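
    A rough sketch of what that correction step could look like, assuming date-named rsync snapshot directories (so that lexical order is chronological) and a known-good checksum from a database like the one sketched above; all paths are placeholders:

    import hashlib
    import shutil
    from pathlib import Path

    SNAPSHOT_ROOT = Path("/srv/backup/snapshots")   # placeholder: rsync snapshot folders
    LIVE_ROOT = Path("/srv/data")                   # placeholder: the live data tree

    def sha256(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return h.hexdigest()

    def restore(rel_path, good_hash):
        # walk the snapshots from newest to oldest and copy back the first
        # version whose checksum still matches the known-good value
        for snap in sorted(SNAPSHOT_ROOT.iterdir(), reverse=True):
            candidate = snap / rel_path
            if candidate.is_file() and sha256(candidate) == good_hash:
                shutil.copy2(candidate, LIVE_ROOT / rel_path)
                return snap.name
        return None   # no clean copy found; manual intervention needed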


    It seems that there are plenty of tools for detection of bitrot.

    • Official Post

    Ideally I would like a system with checksums on a file level that can be used to find files with bitrot. If the checksum has changed but the modification time has not changed, then we have bitrot. If bitrot is detected, the file is restored from backup. Perhaps with an extra check of the checksum on the backup copy, and optionally using an older, error-free copy of the file if available.

    If you can come up with a manual process that corrects bitrot without ZFS or BTRFS in RAID1 / RAID10 (or SnapRAID), I'm interested. When I started to look for bitrot protection, that inevitably led to ZFS and BTRFS. Believe me when I say, with its kernel integration and lighter hardware requirements, I really wanted BTRFS to be the answer. But there are "mostly OK" and other warnings on the BTRFS Status page that seem to be perpetual. With that considered, combined with personal experience running BTRFS on a single drive, I've ruled BTRFS out for the next few years.


    For effective bitrot protection in a mature file system, I haven't found anything better than ZFS. Add in native support for automated, self-rotating and self-purging snapshots, where a virus infection or even deliberate data deletion can be contained, and we're talking about "data preservation".


    Coming to that realization guided my hardware choices from that point forward.

    So I have plenty of storage and good versioned backups. Now I can worry about bitrot. Maybe?

    I do, because it's a "silent" phenomenon that goes undetected in the vast majority of cases. Curiously, there's little research on the extent of the problem, but I've read more than one horror story about unsuspecting users losing irreplaceable photos, documents, etc., that they "thought" were fine. Time is also an issue: the longer these files sit on a drive, the more likely they are to become corrupt from random events, platform hardware problems, media degradation, hard drives that fail slowly, etc.


    Going further, simply detecting bitrot does nothing other than suggest that one should be "concerned". Again, without "correction" that's known to work, detection is of very limited value. (The damage is done.)


    On Rsync:
    Rsync can be set to use checksums (which really slows it down) but that's for keeping transfers clean, not for ongoing protection. If source files are corrupted, Rsync, with checksums or not, will happily and accurately replicate the corrupted files to the destination.
    ___________________________________________________________________________


    My approach:
    At the top of my storage stack, on the main server, I'm running a ZFS mirror on a platform with ECC. From there, the known "clean store" is Rsync'ed out to backup devices/platforms. Putting a zmirror at the top made sense to me because it's pointless to replicate files that are corrupted at the source. The down side? 1/2 of drive real-estate is taken solely for bitrot protection, but it is effective.


    One destination platform is running a smaller zmirror for sensitive network shares (pics, docs, user files, classic rock music, you know, the important stuff :) ). While the full data store is replicated to this box, some of it is outside the zmirror and is unprotected. This particular box is also "cold storage". I use etherwake and scheduled tasks (cron) to start it twice a month for a day. During those days it replicates changed files and shuts down at midnight. (A scrub runs during one of those days.)


    Another destination is MergerFS+SnapRAID, which provides a RAID5-like array where dissimilarly sized drives can be aggregated AND protected. (That's the beauty of SnapRAID: buy a drive, any drive, throw it in the mix, and run a SYNC.)
    If SnapRAID works as advertised, the entire store is protected on this platform. (As mentioned in another thread, I haven't tested SnapRAID for bitrot correction yet, and I won't be thoroughly convinced until I do.)
    _____________________________________________________________________________


    Again, I'm looking for improved data preservation techniques so, if you find something potentially useful, please post it.

    • Official Post

    It really shouldn't be too difficult to write a set of programs/scripts that can do this. Actually I'm surprised that scripts that do this don't already exist.


    I'm likely to at least try to see if I can cobble together something simple. I think my setup with several single-disk SBCs is suitable: NFS and EXT4. SSH and remote execution would have been more general, but I'll save that for later. There is also plenty of example Python code available for checksum calculation, storage and bitrot detection.


    However, it is possible that the danger of permanent data loss is much higher from bugs in scripts that attempt to fix bitrot than from bitrot itself. :X

  • However, it is possible that the danger of permanent data loss is much higher from bugs in scripts that attempt to fix bitrot than from bitrot itself

    That's why I would not reinvent the wheel but rely on proven projects for the whole data safety and integrity problem: either ZFS (together with znapzend) or btrfs (with btrbk). Snapshots transferred via zfs/btrfs send/receive are far less stressful than rsync jobs (which can produce an inconsistent copy anyway when done without snapshots), and the disk stress saved can then be spent on regular scrubs (at both the source and the backup destination).
