RAID Rebuild

  • It appears that I have a failed drive in my RAID10 (it consists of 4 drives). I was in the process of copying its contents to a new RAID10 I've built when the old array appears to have ground to a halt.


    Issuing cat /proc/mdstat gives


    Code
    Personalities : [raid10] 
    md127 : active raid10 sdg[0] sdj[3] sdi[2] sdh[1]
          15627790336 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
    
    md0 : inactive sda[0] sdb[5](S) sdd[4](S) sdc[1]
          15628070240 blocks super 1.2
    
    unused devices: <none>


    which shows that two of the drives have been loaded as spares, so I then decided to check which drives were up to date, with mdadm -E /dev/sd[abcd] | egrep 'Event|/dev'


    Code
    /dev/sda:
             Events : 5839245
    /dev/sdb:
             Events : 5839245
    /dev/sdc:
             Events : 5839245
    /dev/sdd:
             Events : 0


    To me this looks like sdd has failed, but I can't understand why sdb has also been marked as a spare. I do have a spare disk that I can add to the array; can anybody advise how best to rebuild it?
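For a larger array the same event-count comparison can be scripted. A minimal sketch (GNU awk assumed; check_events is a hypothetical helper, not an mdadm feature) that flags any member lagging behind the newest event count:

```shell
# Flag members whose mdadm event count lags the newest one.
# Feed it the output of: mdadm -E /dev/sd[abcd] | egrep 'Event|/dev'
check_events() {
  awk '
    /^\/dev/ { dev = $1; sub(/:$/, "", dev) }       # remember current device
    /Events/ { ev[dev] = $3 + 0                     # record its event count
               if ($3 + 0 > max) max = $3 + 0 }     # track the newest count
    END      { for (d in ev)
                 if (ev[d] < max)
                   print d " is behind by " max - ev[d] " events" }'
}

# With the figures from this thread:
printf '/dev/sda:\n    Events : 5839245\n/dev/sdd:\n    Events : 0\n' | check_events
# -> /dev/sdd is behind by 5839245 events
```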


    (I have the spare in a USB3 enclosure so that I can add it without swapping cabling etc. if required.)


    Thanks Dave

    • Official Post

    Try removing sdd using the Remove button on the RAID Management tab first. Then you could add your spare with the Grow button (the connection type shouldn't matter, but SATA would be faster than USB). If it doesn't work in the web interface, we can use the command line.
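For reference, a command-line sketch of roughly what those buttons do. Every command is echoed rather than executed (RUN=echo), because the device names here are assumptions — /dev/sdd as the failed member and /dev/sdk as the spare; drop the RUN=echo line only once you have verified them:

```shell
# Dry-run sketch of the Remove/Grow steps; RUN=echo prints each command
# instead of executing it. /dev/sdd (failed) and /dev/sdk (spare) are
# assumptions - check your own device names first.
RUN=echo
$RUN mdadm /dev/md0 --fail /dev/sdd     # mark the suspect member as failed
$RUN mdadm /dev/md0 --remove /dev/sdd   # detach it from the array
$RUN mdadm /dev/md0 --add /dev/sdk      # add the spare; md rebuilds onto it
```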

    omv 7.0.5-1 sandworm | 64 bit | 6.8 proxmox kernel

    plugins :: omvextrasorg 7.0 | kvm 7.0.13 | compose 7.1.4 | k8s 7.1.0-3 | cputemp 7.0.1 | mergerfs 7.0.4


    omv-extras.org plugins source code and issue tracker - github - changelogs


    Please try ctrl-shift-R and read this before posting a question.

    Please put your OMV system details in your signature.
    Please don't PM for support... Too many PMs!

    • Official Post

    mdadm --stop /dev/md0
    mdadm --assemble /dev/md0 /dev/sd[abcd] --verbose --force


  • mdadm --stop /dev/md0
    mdadm --assemble /dev/md0 /dev/sd[abcd] --verbose --force


    Tried those. The stop responds as expected, but the assemble gives errors.


    • Official Post

    What is the output of: cat /proc/mdstat


    You can try zeroing the superblock on ONE drive and assembling the array again. I'm guessing sdd would be the best one to zero.


    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/sdd
    mdadm --assemble /dev/md0 /dev/sd[abcd] --verbose --force


  • What is the output of: cat /proc/mdstat


    Code
    Personalities : [raid10] 
    md0 : inactive sda[0](S) sdd[4](S) sdb[5](S) sdc[1](S)
          15628070240 blocks super 1.2
    
    md127 : active raid10 sdg[0] sdj[3] sdi[2] sdh[1]
          15627790336 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]


    Obviously md0 is the one we are concerned with.


    Quote


    You can try zeroing the superblock on ONE drive and assemble the array again. I'm guessing sdd would be the best to zero.


    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/sdd
    mdadm --assemble /dev/md0 /dev/sd[abcd] --verbose --force


    Is zeroing the superblock potentially destructive? I know it sounds like it is, but I guess sdd is already dead.

    • Official Post

    Yes, it is destructive. That is why I said only one drive. RAID 10 can recover from the loss of one drive (or two, if they are the right two).
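The "right two" can be spelled out: with md's default near-2 layout, consecutive raid slots hold the two copies of each block, so slots 0/1 and 2/3 form mirror pairs. A bash sketch (the slot-to-device mapping below is an assumption for illustration, not read from this array):

```shell
# Enumerate which double-drive failures a 4-disk near-2 RAID10 survives,
# assuming sda/sdb occupy slots 0/1 and sdc/sdd slots 2/3 (illustrative only).
pair() { case $1 in sda|sdb) echo 0 ;; sdc|sdd) echo 1 ;; esac; }

for x in sda sdb sdc sdd; do
  for y in sda sdb sdc sdd; do
    [[ $x < $y ]] || continue            # visit each unordered pair once
    if [ "$(pair $x)" = "$(pair $y)" ]; then
      echo "$x+$y: data loss (both copies of one mirror gone)"
    else
      echo "$x+$y: survivable"
    fi
  done
done
```

Of the six possible double failures, four are survivable; only losing both halves of one mirror pair is fatal.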


  • Tried it, but I now get the following errors on the assemble:


    Code
    mdadm: looking for devices for /dev/md0
    mdadm: no RAID superblock on /dev/sdd
    mdadm: /dev/sdd has no superblock - assembly aborted
    • Official Post

    I would say that is bad. I wouldn't zero another superblock. I hope you have a backup.


  • I have most of the stuff that was on it, but I was in the process of taking a copy when it failed. Given that it thinks the events are up to date on the remaining 3 disks, can anything be done to build them into something so I can get the last few items off?

  • Interestingly, it still lets me build it with sdk (my new disk) and still shows a, b, c as having the same Events figures, but when I try to assemble it thinks sdb is not up to date. Can this flag be reset so that it treats sdb as current and then rebuilds using sdk?
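Rather than resetting the counter by hand, mdadm can be asked to force-start the array degraded from the consistent members and then rebuild onto the new disk. A dry-run sketch only (commands echoed, not executed) — whether this can work here depends on which raid slots sdb and sdc actually occupy, which the (S) spare markers put in doubt:

```shell
# Echoed only - remove RUN=echo to actually run. Assembles degraded from the
# three members with matching event counts, then rebuilds onto sdk (new disk).
RUN=echo
$RUN mdadm --stop /dev/md0
$RUN mdadm --assemble --force --run /dev/md0 /dev/sda /dev/sdb /dev/sdc
$RUN mdadm /dev/md0 --add /dev/sdk
```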

    • Official Post

    Because it is marking two as spares and won't start (not the right two disks to start from), I don't think there is anything you can do that I know of. Maybe the newer versions of the utilities included with SystemRescueCD might be able to fix it.


  • Do you think the following command may dig me out of this?


    mdadm --create /dev/md0 --assume-clean --metadata=1.2 --level=10 --raid-devices=4 /dev/sd[abcd]

    • Official Post

    Never tried it.


  • Well, I tried it, and the RAID looks like it has built, but when I try to mount it (read-only, to be safe), I get


    mount: Structure needs cleaning


    which I believe means the XFS filesystem is corrupt. It's possible I've got the disks in the wrong order, but I don't think I have, so it's not looking good. I suppose I could try all 24 combinations, but I don't hold out much hope; I think it's probably toast.
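If the ordering were the only problem, the 24 candidate --create commands could at least be generated rather than typed by hand. A sketch that just prints them — nothing here touches the disks, and each line would still need to be run (and the filesystem mount-tested read-only) one at a time:

```shell
# Print all 24 member orderings for the forced re-create tried above.
# Nothing is executed; this only generates the command lines.
devs="sda sdb sdc sdd"
cmds=$(
  for a in $devs; do for b in $devs; do for c in $devs; do for d in $devs; do
    # skip orderings that repeat a device
    [ "$(printf '%s\n' $a $b $c $d | sort -u | wc -l)" -eq 4 ] || continue
    echo "mdadm --create /dev/md0 --assume-clean --metadata=1.2 --level=10 --raid-devices=4 /dev/$a /dev/$b /dev/$c /dev/$d"
  done; done; done; done
)
echo "$cmds"
```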

    • Official Post

    Nice try, and good to know. Sorry that didn't work. I assume you tried xfs_repair? You could try photorec on the drives to see if anything can be recovered, but it usually recovers too much and doesn't preserve filenames or folder structure.


  • I ran it with the -n option so that it wouldn't destroy things if I'd got the wrong order, but the list of errors was just continuous, which suggests to me that something is still wrong and I'd probably do more harm than good.
