Bitrot protection light

    • Bitrot protection light

      So I have plenty of storage and good versioned backups. Now I can worry about bitrot. Maybe?

      Very rare random errors creep into the data over time. They are rare, but adding more storage increases the probability of random errors.

      I use versioned backups on the folder level using rsync snapshots over the local network.

      Ideally I would like a system with checksums on a file level that can be used to find files with bitrot. If the checksum has changed but the modification time has not, then we have bitrot. If bitrot is detected, the file is restored from backup, perhaps with an extra check of the checksum on the backup copy, and optionally using an older, error-free copy of the file if available.
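
      The detection rule above can be sketched in a few lines of Python (a minimal sketch; the function names and the way the recorded mtime/checksum are supplied are just illustrative):

```python
import hashlib
import os

def sha256_of(path):
    """Stream the file through SHA-256 so large files don't fill RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def classify(path, stored_mtime, stored_sum):
    """Compare a file against its recorded mtime and checksum.

    mtime changed                 -> legitimate edit, re-record it
    mtime unchanged, sum changed  -> bitrot, restore from backup
    both unchanged                -> healthy
    """
    if os.stat(path).st_mtime != stored_mtime:
        return "modified"
    if sha256_of(path) != stored_sum:
        return "bitrot"
    return "healthy"
```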

      So I would have to quickly create new checksums and update snapshots often. Otherwise bitrot errors may migrate into the backups.

      This seems like something that might already exist? It could be integrated into an rsync backup system. Rsync uses checksums and copies files back and forth, and rsync knows where the copies of a file are. It would be pretty lightweight, except for all the checksum calculations, and would need very little extra storage: just a database of filenames, checksums and file modification times.

      Does anyone know if something like this already exists? Or something better and even simpler?
      OMV 4, 5 x ODROID HC2, 2 x 12TB, 2 x 8 TB, 1 x 500 GB SSD, GbE, WiFi mesh
    • Yes. I would prefer to use my already existing rsync backups for redundancy instead. Not introduce more redundancy. Perhaps run the bitrot detection as part of a weekly rsync snapshot. Spread out over the week between different backup targets (subfolders). Check E-Books for bitrot on Monday and check TV shows A-G on Tuesdays. And so on.

      Perhaps I could run rsync twice: first as a dry run with checksums, then for real based on modification times. Extra files from the first run, that didn't get synced by the second, have bitrot. But I don't think rsync saves checksums? With checksums saved between runs, the need for recalculation would be halved.
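
      The two-pass idea boils down to a set difference. Given the two file lists from the dry runs (plain Python lists here; in practice they would be parsed from `rsync --dry-run --itemize-changes` output, with `-c` added for the first pass), the bitrot candidates are:

```python
def bitrot_candidates(checksum_flagged, mtime_flagged):
    """Files that an `rsync --dry-run -c` pass would copy but a plain
    `rsync --dry-run` pass would not: their content differs while
    size/mtime look unchanged, which is exactly the bitrot signature."""
    return sorted(set(checksum_flagged) - set(mtime_flagged))
```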
    • Adoby wrote:

      So I found this: github.com/ambv/bitrot
      Wouldn't that be the same as using btrfs and a scrub with the checksum being stored in sqlite3 instead?

      And if you want to go the utility route, trapexit (author of mergerfs) wrote scorch - github.com/trapexit/scorch
      omv 4.1.17 arrakis | 64 bit | 4.15 proxmox kernel | omvextrasorg 4.1.13
      omv-extras.org plugins source code and issue tracker - github



    • ryecoaaron wrote:

      Wouldn't that be the same as using btrfs and a scrub with the checksum being stored in sqlite3 instead?
      And if you want to go the utility route, trapexit (author of mergerfs) wrote scorch - github.com/trapexit/scorch
      Yes, exactly the same. But with the utility approach you can limit the checksum calculations to the folders you want to protect. Then again, that will most likely be the bulk of the files on a NAS.

      But I would like to be able to use EXT4 and my existing rsync snapshots for redundancy. Otherwise I might go with snapraid or raid6 or something.

      Scorch also looks interesting.

      But the automatic restore from a backup is missing. Detection is nice, correction is even nicer. ;) Most likely the bitrot correction would have to be integrated into an rsync backup script, since that is where the location of the bad files and their replacements is known. But the detection part might run asynchronously from the backup part, to avoid extremely time-consuming backups.
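
      The correction step might look roughly like this (a sketch; the function and argument names are hypothetical, and the "known good" checksum is assumed to come from a pass made before the damage occurred):

```python
import hashlib
import shutil

def file_sum(path):
    """SHA-256 of a file, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def restore_from_backup(damaged, backup, known_good_sum):
    """Overwrite the damaged file only if the backup copy still matches
    the checksum recorded before the damage; otherwise the rot may have
    already migrated into the backup and an older snapshot is needed."""
    if file_sum(backup) != known_good_sum:
        return False  # backup is bad too: fall back to an older snapshot
    shutil.copy2(backup, damaged)  # copy2 also preserves the mtime
    return True
```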

      It seems that there are plenty of tools for detection of bitrot.
    • Adoby wrote:

      Ideally I would like a system with checksums on a file level. That can be used to find files with bitrot. If the checksum has changed but the modification time has not changed, then we have bitrot. If bitrot is detected then the file is restored from backup. Perhaps with an extra check of checksum on the backup copy and optionally use an older, error free, copy of the file if available.
      If you can come up with a manual process that corrects bitrot, without ZFS or BTRFS in RAID1 / RAID10 (or SNAPRAID), I'm interested. When I started to look for bitrot protection, that inevitably led to ZFS and BTRFS. Believe me when I say, with its kernel integration and lighter hardware requirements, I really wanted BTRFS to be the answer. But there are "mostly OK" and other warnings on the BTRFS Status page that seem to be perpetual. With that considered, combined with personal experience running BTRFS on a single drive, I've ruled BTRFS out for the next few years.

      For effective bitrot protection, in a mature file system, I haven't found anything better than ZFS. Add in native support for automated, self-rotating and self-purging SNAPSHOTs, where a virus infection or even deliberate data deletion can be contained, and we're talking about "data preservation".

      Coming to that realization guided my hardware choices from that point forward.

      Adoby wrote:

      So I have plenty of storage and good versioned backups. Now I can worry about bitrot. Maybe?
      I do, because it's a "silent" phenomenon that goes undetected in the vast majority of cases. Curiously, there's little research on the extent of the problem, but I've read more than one horror story about unsuspecting users losing irreplaceable photos, documents, etc., that they "thought" were fine. Time is also an issue: the longer these files sit on a drive, the more likely they are to become corrupt from random events, platform hardware problems, media degradation, hard drives that fail slowly, etc.

      Going further, simply detecting bitrot does nothing other than to suggest that one should be "concerned". Again, without "correction" that's known to work, detection is very limited in value. (The damage is done.)

      On Rsync:
      Rsync can be set to use checksums (which really slows it down), but that's for keeping transfers clean, not for ongoing protection. If source files are corrupted, rsync, with checksums or not, will happily and accurately replicate the corrupted files to the destination.
      ___________________________________________________________________________

      My approach:
      At the top of my storage stack, on the main server, I'm running a ZFS mirror on a platform with ECC. From there, the known "clean store" is Rsync'ed out to backup devices/platforms. Putting a zmirror at the top made sense to me because it's pointless to replicate files that are corrupted at the source. The down side? 1/2 of drive real-estate is taken solely for bitrot protection, but it is effective.

      One destination platform is running a smaller zmirror for sensitive network shares (pics, docs, user files, classic rock music, you know, the important stuff :) ). While the full data store is replicated to this box, some of it is outside the zmirror and is unprotected. This particular box is also "cold storage": I use etherwake and scheduled tasks (cron) to start it twice a month for a day. During those days it replicates changed files and shuts down at midnight. (A scrub runs during one of those days.)

      Another destination is MergerFS+SNAPRAID which provides a RAID5 like array where dissimilar sized drives can be aggregated AND protected. ((That's the beauty of SNAPRAID. Buy a drive, any drive, throw it in the mix, and run a SYNC.))
      If SNAPRAID works as advertised, the entire store is protected on this platform. (As mentioned in another thread, I haven't tested SNAPRAID for bitrot correction yet, and I won't be thoroughly convinced until I do.)
      _____________________________________________________________________________

      Again, I'm looking for improved data preservation techniques so, if you find something potentially useful, please post it.

      Video Guides :!: New User Guide :!: Docker Guides :!: Pi-hole in Docker
      Good backup takes the "drama" out of computing.
      ____________________________________
      Primary: OMV 3.0.99, ThinkServer TS140, 12GB ECC, 32GB USB boot, 4TB+4TB zmirror, 3TB client backup.
      OMV 4.1.13, Intel Server SC5650HCBRP, 32GB ECC, 16GB USB boot, UnionFS+SNAPRAID
      Backup: OMV 4.1.9, Acer RC-111, 4GB, 32GB USB boot, 3TB+3TB zmirror, 4TB Rsync'ed disk


      It really shouldn't be too difficult to write a set of programs/scripts that can do this. Actually, I'm surprised that scripts that do this don't already exist.

      I'm likely to at least try to see if I can cobble together something simple. I think my setup with several single-disk SBCs is suitable: NFS and EXT4. SSH and remote execution would have been more general, but I'll save that for later. There is also plenty of example Python code available for checksum calculation and storage, and for bitrot detection.
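
      The database part really is small. A minimal store for the filenames, checksums and modification times fits in a few lines of sqlite3 (table and column names here are just illustrative):

```python
import hashlib
import os
import sqlite3

def open_db(db_path):
    """Open (or create) the checksum database."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS files
                  (path TEXT PRIMARY KEY, mtime REAL, sha256 TEXT)""")
    return db

def record(db, path):
    """Insert or refresh the checksum and mtime for one file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    db.execute("REPLACE INTO files VALUES (?, ?, ?)",
               (path, os.stat(path).st_mtime, h.hexdigest()))
    db.commit()

def lookup(db, path):
    """Return (mtime, sha256) for a file, or None if not recorded."""
    return db.execute("SELECT mtime, sha256 FROM files WHERE path = ?",
                      (path,)).fetchone()
```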

      However, it is possible that the danger of permanent data loss is much higher from bugs in scripts that attempt to fix bitrot than from bitrot itself. :X
    • Adoby wrote:

      However it is possible that the danger for permanent data loss is much higher from bugs in scripts that attempt to fix bitrot, than from bitrot itself
      That's why I would not reinvent the wheel but rely on great projects for the whole data safety and integrity thing: use either ZFS (together with znapzend) or btrfs (with btrbk). Snapshots transferred via zfs|btrfs send/receive are way less stressful than rsync jobs (which might produce an inconsistent copy when done without snapshots anyway), and the disk stress saved can then be put toward regular scrubs (at both source and backup destination).