SnapRAID I/O errors during sync on a single drive

  • I posted this in the SnapRAID forum, but that doesn't seem to get much traffic.


    I tried to run a sync and it failed with I/O errors limited to a single disk. So I removed the disk, put it in a dock on my desktop, added a new disk to the server with the same disk label, then copied all of the files from the removed disk back over NFS. I run MergerFS on this pool of disks with a fill-the-emptiest-disk-first policy, so it pretty much just copied everything from the removed disk onto the new disk. I copied everything rather than running fix since the disk is still completely accessible and it's been a while since I synced; I figured my safest bet was to just copy and resync. The files it is complaining about haven't been touched in years, but I think if it had the chance it would error on every file on the disk. I've tested the RAM and moved the disk into a different SATA bay, and it's the same thing.
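
    For context, the MergerFS behavior comes from its "most free space" (mfs) create policy, which routes new files to the branch with the most free space. A minimal fstab sketch of that kind of pool; the branch glob, mount point, and options are illustrative, not my exact config:

    Code
    # Illustrative mergerfs pool: new files go to the branch with the most free space.
    /srv/dev-disk-by-label-PoolDisk* /srv/pool fuse.mergerfs defaults,allow_other,category.create=mfs 0 0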


    There is pretty much zero indication that there was anything wrong with the first disk. Running sync now with the new disk throws the same errors starting with the same files, so I moved the directory that was throwing the errors outside of the array, and it's still throwing errors on the next files on this disk. I had no I/O errors or anything while copying these files, and there seems to be no actual problem with them. I think the parity for this disk is bad or something. I also did not copy the .content file to the new disk since I wasn't sure how that would be handled.
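
    For what it's worth, the content file locations are just lines in snapraid.conf, and SnapRAID rewrites those files itself at the end of each sync (the "Saving state" lines in the output below). A hypothetical fragment with placeholder paths:

    Code
    # Hypothetical snapraid.conf fragment; one line per content file copy.
    content /srv/dev-disk-by-label-PoolDisk1/snapraid.content
    content /srv/dev-disk-by-label-PoolDisk7/snapraid.content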


    This was the first attempt at a sync since I moved 3 more partially filled disks into the array, hence the warning recommending 3 parity levels for 15 disks. These disks were already in the server, just outside of the SnapRAID array. They are quite a bit smaller than the 6TB parity disks, so I figured it wouldn't be an issue. What do I do to get sync running clean again?
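
    For reference, extra parity levels are just additional lines in snapraid.conf. A hypothetical fragment (the parity disk paths are placeholders, not my actual layout):

    Code
    # Hypothetical snapraid.conf fragment declaring three parity levels.
    parity   /srv/dev-disk-by-label-Parity1/snapraid.parity
    2-parity /srv/dev-disk-by-label-Parity2/snapraid.2-parity
    3-parity /srv/dev-disk-by-label-Parity3/snapraid.3-parity

    Here is the failing sync output: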


    Syncing...

    Using 136 MiB of memory for 32 cached blocks.

    Error reading file '/srv/dev-disk-by-label-PoolDisk7/YYY' at offset 1085014016 for size 262144. Input/output error.

    Input/Output error in file '/srv/dev-disk-by-label-PoolDisk7/YYY' at position '4139'

    DANGER! Unexpected input/output read error in a data disk, it isn't possible to sync.

    Ensure that disk '/srv/dev-disk-by-label-PoolDisk7/' is sane and that file '' can be read.

    Stopping at block 51245

    Saving state to /srv/dev-disk-by-label-PoolDisk1/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk2/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk3/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk4/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk5/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk6/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk7/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk8/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk9/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk10/snapraid.content...

    Saving state to /srv/dev-disk-by-label-NasDisk1/snapraid.content...

    Saving state to /srv/dev-disk-by-label-NasDisk2/snapraid.content...

    Saving state to /srv/dev-disk-by-label-SSD1/snapraid.content...

    Saving state to /srv/dev-disk-by-label-NVME2/snapraid.content...

    Saving state to /srv/dev-disk-by-label-PoolDisk11/snapraid.content...

    Verifying /srv/dev-disk-by-label-PoolDisk1/snapraid.content...

    Verifying /srv/dev-disk-by-label-PoolDisk2/snapraid.content...

    Verifying /srv/dev-disk-by-label-PoolDisk3/snapraid.content...

    Verifying /srv/dev-disk-by-label-PoolDisk4/snapraid.content...

    Verifying /srv/dev-disk-by-label-PoolDisk5/snapraid.content...

    Verifying /srv/dev-disk-by-label-PoolDisk6/snapraid.content...

    Verifying /srv/dev-disk-by-label-PoolDisk7/snapraid.content...

    Error reopening the temporary content file '/srv/dev-disk-by-label-PoolDisk7/snapraid.content.tmp'. No such file or directory.

  • Have you run fsck on the drive?

    On which drive?


    The original disk:

    Code
    $ sudo fsck /dev/sdd1
    fsck from util-linux 2.37.1
    e2fsck 1.46.2 (28-Feb-2021)
    PoolDisk7: clean, 5944/244195328 files, 848717657/976754385 blocks


    The new disk:

    Code
    $ sudo fsck /dev/sdd1
    fsck from util-linux 2.37.1
    e2fsck 1.46.2 (28-Feb-2021)
    PoolDisk7: clean, 4592/183144448 files, 713659279/1465130385 blocks
  • First, 15 disks is a LOT of disks. 15 disks must be getting into use scenarios that have not been thoroughly tested by the DEV.

    ___________________________________________

    *It wouldn't hurt to look at the SMART stats for the boot drive first. If something is corrupted on the boot drive, anything is possible.*


    A few ideas and things to look at:

    - Have you checked the SMART stats for the affected disk(s)?

    - Run long SMART tests to thoroughly check out the disk(s) with issues (see the command sketch below).

    - Have you checked whether this is related to a specific port? I/O errors can come from the SATA/SAS port itself. (Hardware can fail, and SATA/SAS cables may need to be reseated at both ends; on rare occasions a cable goes bad. SMART attribute 199, CRC errors, indicates hardware link and/or interface issues.)


    Where the port is concerned, I believe you can swap two disks between ports without consequence. (I don't know for sure; I haven't done it.) If the issue follows the port, you might have an answer.
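
    For the SMART checks, something along these lines (a generic sketch; substitute the affected device for /dev/sdd, and note that drives behind some SAS controllers may need extra smartctl options):

    Code
    # Full SMART report; attribute 199 (UDMA_CRC_Error_Count) points at link/cable issues.
    sudo smartctl -a /dev/sdd
    # Start an extended (long) self-test; it runs in the background on the drive.
    sudo smartctl -t long /dev/sdd
    # Review the self-test log once the test completes.
    sudo smartctl -l selftest /dev/sdd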


    - If you're convinced nothing is wrong with the disks, you could wipe your SnapRAID installation and start again without disturbing your data.
    - I'd consider breaking up the 15-disk group into 3 groups of 5 disks, with one parity drive per group (see the sketch below).
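
    If you went the multiple-groups route, each group would be its own SnapRAID array with its own config file, selected with the -c option. A hypothetical sketch (the config file names are made up):

    Code
    # Three independent SnapRAID arrays, one config per group of disks.
    sudo snapraid -c /etc/snapraid-group1.conf sync
    sudo snapraid -c /etc/snapraid-group2.conf sync
    sudo snapraid -c /etc/snapraid-group3.conf sync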

  • Quote

    First, 15 disks is a LOT of disks. 15 disks must be getting into use scenarios that have not been thoroughly tested by the DEV.

    I wouldn't say 15 disks is a "lot". Between trying to keep free space on each disk and the different use cases for each, the disks add up quickly.


    That said, you might be right about "scenarios that have not been thoroughly tested by the DEV", not in regard to the number of disks, but in the way I added the last few and the fact that they're on different interfaces. My previous SnapRAID use and syncs were just for the MergerFS pool of spinning SATA disks. I have 2 more SATA spinners, a SATA SSD, and an NVMe drive that had no redundancy, so I added them to the same existing SnapRAID pool. Nothing says this can't be done, you can add partially filled disks to SnapRAID, though maybe there are just too many differences here.


    Quote

    - Have you checked whether this is related to a specific port? I/O errors can come from the SATA/SAS port itself. (Hardware can fail, and SATA/SAS cables may need to be reseated at both ends; on rare occasions a cable goes bad. SMART attribute 199, CRC errors, indicates hardware link and/or interface issues.)


    As I mentioned above, I've swapped the disks between bays. They're in hotswap cages, and each has its own SATA port going to a SAS card. This isn't a problem for SnapRAID; it doesn't care which port is used. The disk that originally had the errors has no SMART fail/prefail attributes, and neither does its replacement.


    Quote

    - If you're convinced nothing is wrong with the disks, you could wipe your SnapRAID installation and start again without disturbing your data.


    Well, there doesn't appear to be a simple way to actually do that in SnapRAID, other than maybe manually wiping the parity disks plus the content and config files. That's pretty much what I was asking about here: how to do it. "snapraid sync -R" would seem to be it, but it wants all disks present, so I wouldn't be able to remove the added disks from the array at the same time. That seems messy, since all I've had are partial syncs of all 15 disks since then.


    According to the manual, you have to edit the config file to point the disk you want to remove at an empty directory and remove its .content reference from the config. So I did that for all of the newly added disks and left the original pool intact, with the replacement disk for the one that was throwing errors just as it was. I'm now running "snapraid sync -E", so it'll use the original parity for the disks that remain. If that goes without error, I'll take them out of the config entirely and see what I can do about setting up a separate parity for those disks.
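
    Concretely, the edit looked something like this for each removed disk (names and paths here are illustrative, not my exact config):

    Code
    # Illustrative snapraid.conf edit for one removed disk.
    # Before:
    #   content /srv/dev-disk-by-label-SSD1/snapraid.content
    #   disk d13 /srv/dev-disk-by-label-SSD1/
    # After: the content line is deleted and the disk points at an empty directory.
    disk d13 /srv/empty/

    Then "snapraid sync -E" (--force-empty) tells SnapRAID to proceed even though those disks now look empty.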

  • So I did exactly as I said above, and the sync ran with no errors at all. Then I completely removed the 4 recently added disks from the SnapRAID config and synced again; as expected, the sync didn't have much to do and didn't complain about anything. It's an odd bug that adding 4 partially full disks to the pool caused I/O errors with a disk that's been there the whole time, but good that it's back to normal.
