RAID Disappeared - need help to rebuild

brifletch · 27. November 2017

Quote from Sc0rp

Re,

this looks bad - do any of the drives show SATA errors too? From the plain data given, I would replace "sdf" first, then "sdc", then "sdb" ... and maybe use another vendor/type of hard drive ... btw, which drives are you currently using?

Sc0rp

Thanks for the guidance. Can't check for SATA errors right now (none that I'm aware of), as the server is off at home (figured it would be best to switch it off until I could start drive replacements). They're all Seagate Barracuda 3TB drives. I've already replaced one of the others previously, and I have a new spare ready and waiting. I was wondering whether I'd be better off switching to a different model, for the next swaps. I'd read that Barracuda weren't intended/good for always on servers, and had seen Seagate IronWolf drives recommended. Any thoughts on that?

Sc0rp · 27. November 2017

Re,

IronWolfs are relatively "young" drives, but it currently seems that they are worth a try (but they operate at 7.2k rpm ... more performance for more temp!) ... WD Red are the usual/common recommendation for NAS setups - they are really well suited for private/home office use (you can change to WD Red Pro if you need more performance ...).

Sc0rp

brifletch · 27. November 2017

Thanks for that. Any issue with mixing Reds with Barracudas for a short while? And if I was to add 4TB drives, the NAS would (for now) use 3TB but I could grow the array at a later date (once all drives are upgraded to 4TB) to maintain data and expand capacity?

Sc0rp · 27. November 2017

Re,

Quote from brifletch

Any issue with mixing Reds with Barracudas for a short while?

No.

Quote from brifletch

And if I was to add 4TB drives, the NAS would (for now) use 3TB but I could grow the array at a later date (once all drives are upgraded to 4TB) to maintain data and expand capacity?

Yep.
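For anyone following along, the capacity side of this can be sketched in a few lines. This is only a rough illustration, assuming the usual RAID5 rule that usable space is (number of drives - 1) x the smallest member; the function name `raid5_usable_tb` is made up for this example and the sizes are nominal TB, not exact sector counts.

```python
def raid5_usable_tb(drive_sizes_tb):
    """Usable RAID5 capacity: (n - 1) x smallest member (one drive's worth of parity)."""
    if len(drive_sizes_tb) < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (len(drive_sizes_tb) - 1) * min(drive_sizes_tb)


# Six 3TB drives, as in this thread.
current = raid5_usable_tb([3, 3, 3, 3, 3, 3])
# Two drives already swapped for 4TB: the extra space stays unused for now.
mixed = raid5_usable_tb([3, 4, 3, 4, 3, 3])
# All six drives upgraded to 4TB, after growing the array and the filesystem.
upgraded = raid5_usable_tb([4, 4, 4, 4, 4, 4])

print(current, mixed, upgraded)  # prints: 15 15 20
```

The 15 TB figure lines up with the "Array Size" of roughly 15002.96 GB that mdadm reports for this array further down the thread.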

Sc0rp

brifletch · 2. December 2017

Hey guys ... work's been crazy this week, and I've only just got round to trying to progress this. The first thing I did was start backing up some of the files that I knew would be a ball-ache to replace. All was going well for a good hour or so, and then a read error was reported and the NAS disappeared from the network (I assume it was unmounted). I rebooted the NAS, and was struggling to even get OMV to boot. Next, I removed all drives from the NAS, and OMV booted no issue. So, I added the drives back in and swapped out what was sdb (the drive with the increasingly high number of pending/offline uncorrectable sectors). I'm still waiting for another 2 drives to replace sdf and sdc (in that order).

After swapping out sdb, repairing the degraded RAID seemed to go fine, and it completed with no reported issues. The RAID was mounted, and all folders/files were visible to the network again. However, I went to delete some files, and immediately faced a read error (Windows dialogue: Location is not available. W:\ is not accessible. The request could not be performed because of an I/O error).

The OMV GUI is still showing the RAID as active/clean/mounted. mdadm --examine /dev/sd[abcdef] reports:

Code

/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
           Name : openmediavault:MEDIAVAULT  (local to host openmediavault)
  Creation Time : Mon Sep 17 01:03:50 2012
     Raid Level : raid5
   Raid Devices : 6


 Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
     Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
  Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 7a6cb4e9:902f3852:3c8a0119:6f74dcea


    Update Time : Sat Dec  2 07:20:36 2017
       Checksum : 2b916fbb - correct
         Events : 289173


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 5
   Array State : AAAAAA ('A' == active, '.' == missing)
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
           Name : openmediavault:MEDIAVAULT  (local to host openmediavault)
  Creation Time : Mon Sep 17 01:03:50 2012
     Raid Level : raid5
   Raid Devices : 6


 Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
     Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
  Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 2ec934c5:f798aa47:7218e531:9d493298


    Update Time : Sat Dec  2 07:20:36 2017
       Checksum : c2bd99cb - correct
         Events : 289173


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 4
   Array State : AAAAAA ('A' == active, '.' == missing)
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
           Name : openmediavault:MEDIAVAULT  (local to host openmediavault)
  Creation Time : Mon Sep 17 01:03:50 2012
     Raid Level : raid5
   Raid Devices : 6


 Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
     Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
  Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 5fdfd228:cccdf51e:0ee5e8e6:eaf3d87c


    Update Time : Sat Dec  2 07:20:36 2017
       Checksum : 97515b29 - correct
         Events : 289173


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 0
   Array State : AAAAAA ('A' == active, '.' == missing)
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
           Name : openmediavault:MEDIAVAULT  (local to host openmediavault)
  Creation Time : Mon Sep 17 01:03:50 2012
     Raid Level : raid5
   Raid Devices : 6


 Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
     Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
  Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : ef7151df:f61fce6e:16124dbf:c0e9d1cf


    Update Time : Sat Dec  2 07:20:36 2017
       Checksum : c905634c - correct
         Events : 289173


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 1
   Array State : AAAAAA ('A' == active, '.' == missing)
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
           Name : openmediavault:MEDIAVAULT  (local to host openmediavault)
  Creation Time : Mon Sep 17 01:03:50 2012
     Raid Level : raid5
   Raid Devices : 6


 Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
     Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
  Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 1e70980d:77996a00:97862be2:e6522558


    Update Time : Sat Dec  2 07:20:36 2017
       Checksum : 341aba6b - correct
         Events : 289173


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 2
   Array State : AAAAAA ('A' == active, '.' == missing)
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
           Name : openmediavault:MEDIAVAULT  (local to host openmediavault)
  Creation Time : Mon Sep 17 01:03:50 2012
     Raid Level : raid5
   Raid Devices : 6


 Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
     Array Size : 14651325440 (13972.59 GiB 15002.96 GB)
  Used Dev Size : 5860530176 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 55ad5625:0b737d46:248eaedc:cfc05460


    Update Time : Sat Dec  2 07:20:36 2017
       Checksum : 949e4b25 - correct
         Events : 289173


         Layout : left-symmetric
     Chunk Size : 512K


   Device Role : Active device 3
   Array State : AAAAAA ('A' == active, '.' == missing)


Nothing jumps out as being odd, and all details seem to match from one HDD to the next.
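Comparing six of these blocks by eye is error-prone, so here is a minimal sketch of how the comparison could be scripted. It assumes the text layout shown above (a `/dev/sdX:` header followed by `Field : value` lines); the helper names `parse_examine` and `mismatched_fields` are invented for this example and are not part of mdadm.

```python
import re


def parse_examine(text):
    """Split `mdadm --examine` output into {device: {field: value}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(/dev/\S+):\s*$", line)
        if header:
            current = devices.setdefault(header.group(1), {})
            continue
        if current is not None and " : " in line:
            key, value = line.split(" : ", 1)
            current[key.strip()] = value.strip()
    return devices


def mismatched_fields(devices, fields=("Array UUID", "Events", "Update Time", "Array State")):
    """Return only the fields whose values differ between member devices."""
    return {
        field: {dev: info.get(field) for dev, info in devices.items()}
        for field in fields
        if len({info.get(field) for info in devices.values()}) > 1
    }


# Shortened excerpt of the output above (two of the six members).
sample = """/dev/sda:
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
         Events : 289173
    Array State : AAAAAA ('A' == active, '.' == missing)
/dev/sdb:
     Array UUID : 6ff00f35:b3aa6d29:25ac1d7e:2a2b2007
         Events : 289173
    Array State : AAAAAA ('A' == active, '.' == missing)
"""

devices = parse_examine(sample)
print(mismatched_fields(devices))  # prints: {}
```

An empty result only says the superblocks agree with each other. It says nothing about the health of the filesystem sitting on top of the array.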

I've pasted updated logs and messages here:

http://sprunge.us/UCLE
http://sprunge.us/PWVK

I'm really not sure what to do next (aside from swapping out the other 2 HDDs that are showing SMART errors). I did notice something in the logs about xfs_repair - should I be attempting this now (and basic question, is it run at HDD level, or at RAID/device level?).
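On the basic question: xfs_repair works on the block device that holds the filesystem, which here would be the assembled md array rather than an individual member disk, and the filesystem has to be unmounted first. A small sketch of that lookup follows. The `/proc/mdstat` text and the `md0` name are hypothetical examples, not taken from this server, and the helpers `md_arrays` and `repair_target` are invented for illustration.

```python
import re


def md_arrays(mdstat_text):
    """Map each md array found in /proc/mdstat text to its member disks."""
    arrays = {}
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if match:
            members = re.findall(r"(\w+)\[\d+\]", match.group(2))
            arrays["/dev/" + match.group(1)] = ["/dev/" + disk for disk in members]
    return arrays


def repair_target(device, arrays):
    """Return the array device to check when `device` is an array or one of its members."""
    for array, members in arrays.items():
        if device == array or device in members:
            return array
    raise ValueError(f"{device} is not part of a known md array")


# Hypothetical /proc/mdstat content; the real array name may differ (md0 is an assumption).
sample = """Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc[0] sda[5] sdb[4] sdf[3] sde[2] sdd[1]
      14651325440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [6/6] [UUUUUU]
"""

print(repair_target("/dev/sdb", md_arrays(sample)))  # prints: /dev/md0
```

With that result, a read-only check would be aimed at the array device, for example `xfs_repair -n /dev/md0` under the naming assumption above.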

As always, any insight and recommendations would be great. Thanks in advance,
Brian

Sc0rp · 4. December 2017

Re,

I have too little time currently to dig deeper into this, but you can issue a
touch /forcefsck
in the terminal and then reboot - this forces filesystem checks at the next boot (even XFS) ... maybe this will help.

Sc0rp

brifletch · 4. December 2017

Hi Sc0rp, Appreciate your assistance, especially when you're so busy. I've done as suggested, but the NAS seems to hang on reboot and the array isn't unmounted. I connected a monitor directly to the device so I can see system messages. The following look relevant:

Turning off quotas ... quotaoff: Cannot resolve mountpoint path /media/folder ID: Input/output error (the same is repeated for several shared folders)
...
Unmounting local filesystems ...
[213192.927608] INFO task umount:3569 blocked for more than 120 seconds.
[213912.932051] Not tainted 3.16.0-0.bpo.4-amd64 #1
[213192.936445] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message

The same 3 messages are now repeating every 120 secs (with different numeric values at the start of the message). What now? Leave it running? Force shutdown?

brifletch · 4. December 2017

I've managed to do as you suggested now (not sure how I got there, but fsck ran). Also ran xfs_repair -n, which comes back with loads of issues, and eventually: Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

I guess if the RAID has been rebuilt after disks had dropped out there's going to be inconsistencies all over the place.

Will swap out the other 2 discs when they arrive, and then maybe run xfs_repair and see what's salvageable, but I'm thinking this might just have to be given up as a bad job, and I'll rebuild my media collection from scratch (and of course, create a backup next time!).

Sc0rp · 5. December 2017

Re,

Quote from brifletch

I guess if the RAID has been rebuilt after disks had dropped out there's going to be inconsistencies all over the place.

From my experience: the inconsistencies occurred WHILE the drive was dying ... but the result remains the same ...

Quote from brifletch

Will swap out the other 2 discs when they arrive, and then maybe run xfs_repair and see what's salvageable, but I'm thinking this might just have to be given up as a bad job, and I'll rebuild my media collection from scratch (and of course, create a backup next time!).

Yeah, maybe you can finally "force" xfs_repair to get a clean state (at least you can zero the journal ...), just search the internet for "man xfs_repair".

Btw, I also had many problems while using XFS over an old Areca hardware RAID controller, due to a bug in the driver (kernel module), but I never lost data ...

If you ever have the chance to rebuild your current array from scratch, consider using ZFS RAID-Z1 or RAID-Z2 instead, it's more convenient nowadays ... and as a special benefit for me, @tkaiser is then in charge ... uhm, just kidding ... a bit.

Sc0rp

brifletch · 6. December 2017

Thanks again for taking the time, @Sc0rp. Once I have the replacement drives in hand and I have some time, I'll try to force xfs_repair and let you know how I get on. And hopefully I'll never need to darken your, or @tkaiser's, doors again!!

brifletch · 20. December 2017

Hey guys. Wanted to give you an update. I swapped out the 2 damaged drives and rebuilt the array successfully, both times. After a lot of deliberation, I then tried xfs_repair -L, and held my breath.

The good news is that it would appear that only files saved to the NAS in November were corrupted. In my testing, at least, all files I’ve watched that were saved prior to early November are playing with no issues.

It’s a Christmas miracle!

I’ve now got all alerts active so I’ll know of any similar issues in future. And when finances allow, I’ll setup a proper backup solution.

Thank you both, @Sc0rp and @tkaiser, for all your help and advice. I wouldn’t be sitting here watching our Christmas movies right now without you!

Wishing you both an awesome holiday season!

tkaiser · 20. December 2017

Quote from brifletch

The good news is that it would appear that only files saved to the NAS in November were corrupted. In my testing, at least, all files I’ve watched that were saved prior to early November are playing with no issues.

And that's a strong argument for filesystems developed in this century and not the last. With modern approaches like ZFS or btrfs you would now simply run a scrub overnight, and the scrub would list each and every corrupted file, without you having to waste any time on this or fear that you now have to live with a lot of corrupted files you simply don't know about yet (a very common problem in a lot of installations where I had to help with data recovery).
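To make the scrub idea concrete, here is a file-level analogue as a minimal sketch. It is not how ZFS or btrfs work internally (they keep checksums per block inside the filesystem); it only shows the principle of recording checksums while the data is known to be good and comparing against them later. The function names `file_digest`, `build_manifest` and `scrub` are made up for this example.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large media files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under `root` (run while the data is known good)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def scrub(root, manifest):
    """Return files that are missing, unreadable, or no longer match the manifest."""
    damaged = []
    for relative_path, expected in sorted(manifest.items()):
        try:
            if file_digest(os.path.join(root, relative_path)) != expected:
                damaged.append(relative_path)
        except OSError:
            # Missing files and I/O errors both count as damage.
            damaged.append(relative_path)
    return damaged
```

On a plain XFS volume, something like this would have to be run by hand or from a scheduled job, and it cannot tell corruption apart from a deliberate edit, which is part of why checksumming inside the filesystem is the more convenient approach.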

brifletch · 21. December 2017

Unfortunately, that's a fear I'm going to have to give in to, and just replace any corrupted files as I come across them. The end result, as it stands, is much more positive than I first expected, so for now I'm a lot happier. And be sure that when I get the chance to set up a proper backup solution, I'll first upgrade the filesystem and migrate the existing content as per your suggestion. Thanks again for your help, and Merry Christmas!
