SMART Failed Drive Replacement w/ MergerFS & SnapRAID

    • I'm looking for validation of my plan for the setup and situation described below.

      Current Setup:
      • SnapRAID pool
        • 3x 4TB Data Disks
        • 1x 4TB Parity Disk
      • MergerFS
        • /storage with multiple sub-folders and content
      • SMART Test
        • Running every Wednesday night (corrected to Wed)
        • The most recent report flagged 3 bad sectors; within the last 3 days or so that rose to 7 bad sectors
        • Already filed RMA with WD and new drive is here
      Plan:
      • Install the new physical disk alongside the others
      • Reboot into Clonezilla
      • Clone the failing disk to the new disk
      • Shut down and remove the failing drive
      • Start up and give the new disk the same name as the failed disk
      • Add the new (replacement) disk to the SnapRAID/MergerFS pools under that same name

      Does this make sense logically, and will it actually achieve what I want? Cloning seems like the quickest and easiest way to get this done, and it avoids stupid snafus by me on the command line with permissions and other potential problems. My main concerns are that SnapRAID carries on as if nothing happened, that the MergerFS volume /storage isn't affected, and that cloning a drive with bad sectors doesn't copy some other kind of problem across.

      Honestly, if I lost the data in the 7 bad sectors, I'd be OK with it. I'd rather salvage the rest of the data and then be able to run a SnapRAID fix operation.
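
For keeping an eye on the sector count between the weekly tests, here is a minimal sketch that pulls one attribute's raw value out of `smartctl -A` output. It assumes the usual ATA attribute table printed by smartmontools, and it assumes the "bad sectors" in the report correspond to `Reallocated_Sector_Ct` (they could just as well be `Current_Pending_Sector`). The function name `smart_raw_value` and the device `/dev/sdX` are illustrative placeholders, not anything from the setup above.

```shell
#!/bin/sh
# Print the raw value of one SMART attribute, reading `smartctl -A` output on stdin.
# Assumption: standard ATA attribute table, where the attribute name is field 2
# and the raw value starts at field 10 (numeric for the sector counters).
smart_raw_value() {
    awk -v name="$1" '
        $2 == name { print $10; found = 1 }
        END        { exit !found }   # non-zero exit if the attribute was not listed
    '
}

# Typical use (needs root and smartmontools; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | smart_raw_value Reallocated_Sector_Ct
#   smartctl -A /dev/sdX | smart_raw_value Current_Pending_Sector
```

A count that keeps climbing from one check to the next, as it did here from 3 to 7, is the signal to replace the disk sooner rather than later.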
    • So I've started down this path with the non-destructive steps. Specifically, I've rebooted into Clonezilla. The bad sectors are being detected there as well: early in the clone process I received an error caused by bad sectors. As suggested, I've restarted in expert mode and enabled the -rescue option.

      I assume that at this point I may be missing data blocks, but the FS should be intact? If so, a SnapRAID fix should restore my missing files, assuming I haven't overwritten anything or re-run a SnapRAID sync that could no longer read the blocks in the bad sectors.
    • Well, I figured that using Clonezilla to get an actual replica of my drive would be faster and more efficient, assuming the SMART-detected errors are not far advanced (i.e., they just appeared and I reacted right away) and I can still get full copies of the data. Even if not, I'd assume it should skip through those bad sectors pretty quickly and leave me with an almost perfect replica. In my case I haven't lost the disk; I just got the SMART errors and am proactively replacing the disk.

      If this seems foolish, perhaps I will go that route. At this point the copy rate has been declining steadily since the start of the clone: in 1hr15min it has only made it to 8.8% of Data Block and 2.11% of Total Block processing, and the rate has dropped from 9.25GB/min to 886MB/min.
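
To put those progress figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the clone only needs to finish the Data Block portion (the used blocks), that the roughly 800GB of content mentioned later in the thread is the amount to copy, and that the GB and MB figures are decimal. The function names `eta_hours` and `hours_at_rate` are made up for this example.

```shell
#!/bin/sh
# eta_hours ELAPSED_MIN PERCENT_DONE
# Total hours for the whole job if the average rate so far holds.
eta_hours() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", (m / (p / 100)) / 60 }'
}

# hours_at_rate SIZE_MB RATE_MB_PER_MIN
# Hours needed to copy SIZE_MB at a steady rate.
hours_at_rate() {
    awk -v s="$1" -v r="$2" 'BEGIN { printf "%.1f\n", (s / r) / 60 }'
}

eta_hours 75 8.8            # 75 min elapsed, 8.8% of data blocks done
hours_at_rate 800000 9250   # ~800 GB at the starting rate of 9.25 GB/min
hours_at_rate 800000 886    # ~800 GB at the current rate of 886 MB/min
```

This prints 14.2, 1.4 and 15.0 hours respectively. Since the rate is still falling, the real finish time is likely to be later than any of these estimates.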
    • Nothing is going to go fast on drives of this size.

      The point of using SnapRAID to recover an entire disk is that the disk you are replacing isn't involved in the recovery process at all, so any corruption on it isn't going to get carried over to the new disk.

      As you have seen, the cloning isn't going very fast. Who knows how long it will take to complete? And then you still have to use SnapRAID to "fix" it. That's going to be slow too if all the data has to be hashed and compared to the previously calculated hashes.
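
For reference, the whole-disk recovery described above can be sketched as the sequence below, following the recovery steps in the SnapRAID manual. It assumes the replacement disk is already mounted and configured in snapraid.conf under the failed disk's name (`disk3` is the name used later in this thread). The `run` and `rebuild_disk` helpers and the `DRY_RUN` switch are illustrative additions that only print the commands by default.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# rebuild_disk NAME: rebuild one data disk from parity, then verify it.
rebuild_disk() {
    disk="$1"
    # Recreate the disk's files from parity and the other data disks.
    run snapraid -d "$disk" -l fix.log fix || return 1
    # Audit only: re-hash the rebuilt files and compare with the stored hashes.
    run snapraid -d "$disk" -a check || return 1
    # Bring parity up to date once the rebuilt disk checks out.
    run snapraid sync
}

rebuild_disk disk3
```

Run it with `DRY_RUN=0` as root to actually execute the commands; the failing disk is never read during this sequence.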

      At least you are in a position to try whatever you want as many times as you want, if you have the time, that is.
      OMV 3.x - ASRock Rack C2550D4I - 16GB ECC - Silverstone DS380
    • Interesting, that makes me think about restore speed from a different angle. The disk is 4TB, but it only holds about 800GB of content. For that reason alone, a SnapRAID fix could be faster, since it only has to reconstruct less than 25% of the disk, whereas a clone will also try to read and write empty blocks (give or take, since I know newer tools avoid copying 100% of the empty blocks). And, as you mentioned, it avoids having to read the bad-sector data at all.

      Perhaps since I haven't gotten very far, I'll attempt stopping it now and testing that method instead.
    • Well, now it seems I can't get the snapraid command to run locally on the system. This was my concern about running SnapRAID manually: when I recently used it to restore a few pictures I had deleted by accident, I ended up with wacky permissions, likely because I hit this same permission denied error when running the command. That time I just used sudo and it ran without issue, but the end result was mangled permissions that wouldn't let me access the content.

      >> snapraid -d disk3 -l fix.log fix
      Self test...
      Error creating the lock file '/media/97974d61-46e4-43fa-a535-54a31b4faec2/snapraid.content.lock'. Permission denied.

      The media mount mentioned in the error is not the MergerFS mount; it is the individual local disk mount that is part of the pool.

      After a bit of reading, am I right to understand that SnapRAID will unfortunately NOT restore the permissions for this content? If that's the case, I need to rethink my WHOLE strategy, as I have staggered permissions on content. Having to sort out all those permissions may be a dealbreaker for relying on SnapRAID.

      Since the disk isn't dead yet, is there an easy alternative for moving the content onto the new disk? Perhaps just using rsync, then letting SnapRAID (assuming I keep using it) build its index again from the fresh content?
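
A minimal sketch of what such an rsync copy could look like, assuming it is run as root so that ownership can be preserved, and assuming both filesystems support ACLs and extended attributes. The mount points `/mnt/old-disk` and `/mnt/new-disk` are placeholders, and the `run` and `copy_disk` helpers plus the `DRY_RUN` switch are illustrative; by default the command is only printed.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# copy_disk SRC DST: copy the contents of SRC into DST, keeping attributes.
copy_disk() {
    src="${1%/}/"   # trailing slash: copy the directory's contents, not the directory itself
    dst="${2%/}/"
    # -a  archive mode (permissions, owners, groups, times, symlinks, recursion)
    # -H  preserve hard links   -A  preserve ACLs   -X  preserve extended attributes
    # --numeric-ids     keep uid/gid numbers as they are
    # --info=progress2  overall progress (needs rsync 3.1 or newer)
    run rsync -aHAX --numeric-ids --info=progress2 "$src" "$dst"
}

copy_disk /mnt/old-disk /mnt/new-disk
```

Copying at the file level like this only reads the space actually in use, and any file that sits on a bad sector shows up as a read error for that file rather than silently ending up in the copy.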
    • You must run snapraid as root. Permissions and ownerships are not restored. In my use case I do not have differing permissions and ownerships on the data so this is not a problem for me. So long as you copy the content with a tool that can maintain attributes you shouldn't have any problems.
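
If you do end up restoring with SnapRAID and need the old modes and owners back, one possible workaround is to record them while the disk is still readable and reapply them afterwards. This is only a sketch, not something SnapRAID provides: it assumes GNU find (for `-printf`), it does not handle file names containing newlines or leading/trailing spaces, and the helper names `save_perms` and `restore_perms` are made up for this example.

```shell
#!/bin/sh
# save_perms DIR > LISTFILE
# Record "octal-mode uid gid path" for everything under DIR (GNU find only).
save_perms() {
    find "$1" -printf '%m %U %G %p\n'
}

# restore_perms < LISTFILE
# Reapply the recorded owner, group and mode to every path that still exists.
restore_perms() {
    while read -r mode uid gid path; do
        [ -L "$path" ] && continue          # skip symlinks; chmod would follow them
        [ -e "$path" ] || continue          # skip entries that were not restored
        chown "$uid:$gid" "$path" && chmod "$mode" "$path"   # chown first, it can clear setuid bits
    done
}

# Typical use (as root; the paths are placeholders):
#   save_perms /mnt/old-disk > /root/disk3-perms.txt
#   ... restore the files onto the new disk at the same path ...
#   restore_perms < /root/disk3-perms.txt
```

The list stores absolute paths as find printed them, so the restored data has to live under the same path for the entries to match.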

      The post was edited 1 time, last by gderf ().

    • Yeah, so it looks like I will need to find a different solution. I'm glad I discovered this now, with a disk that has less stuff on it and while I still have a working copy to move from. I'm in the middle of a long rsync right now; it's almost halfway through after about 3 hours, so I'd say it's in good shape to finish fairly quickly, considering.

      Thanks for helping me discover this, glad I know now.