RAID 5 gone after reboot

  • Code
    cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md127 : inactive sdb[1] sda[4](S) sdd[3]
          2929894536 blocks super 1.2
    
    
    unused devices: <none>


    blkid

    Code
    blkid
    /dev/sdc: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5366aea8-9876-f78d-a23e-99e6ad6fea31" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdb: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="175ea7f3-8e75-caaa-a728-75435266c481" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sda: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5c0b0f88-e883-6acd-ce5d-bb2b95a4bb9b" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdd: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="cec25e6e-e6f8-bc06-f4c7-d33148c0f9bd" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sde1: UUID="5e284973-7730-436e-aeb5-491c3cfe6446" TYPE="ext4"
    /dev/sde5: UUID="0f12fb13-59f7-456c-a1e7-628b8f4375a2" TYPE="swap"
    Code
    mdadm --detail --scan --verbose
    mdadm: cannot open /dev/md/Datengrab: No such file or directory

    That is all the info I can give you. My RAID vanished after it started to rebuild itself. Before the reboot the status showed the message "clean, failed".


    Can I somehow get the data back? It is also strange that sdb, sda and sdd are there but sdc isn't. Maybe that could be the faulty drive?

  • Well, the funny thing is that I used the mdadm command to force the RAID back together, and it always starts recovering; after a while I get "clean, failed" again. While it recovers I can see all my files and they are OK, but after the recovery I cannot open these files anymore.


    I used these commands


    Code
    mdadm --stop /dev/md127
    mdadm --assemble /dev/md127 /dev/sd[abcd] --verbose --force


    Now it recovers and I can get at my files. But after it fails, the data seems to still be there but cannot be opened anymore.


    If that happens again, can I somehow stop the recovery process?

  • Hi,


    it seems your RAID did suffer from multiple failures:
    md127 : inactive sdb[1] sda[4](S) sdd[3] <- (S) indicates a spare drive, while sdc is completely missing ...


    So it seems to me that sda has lost some DCB information and sdc is not reliable enough for RAID ...


    Anyway, you forced the drives to work in a RAID - and so you got two issues:
    - the RAID will now try to recover itself (to get back to a synced state at all), starting to recover/resync immediately
    - blocks (data) with wrong CRCs will be deleted when no rescue is possible at all


    If that happens again, can I somehow stop the recovery process?

    It's possible, see https://serverfault.com/questi…rupt-software-raid-resync for all the options ;)
    (I use the "speed down" option too, because it works in every environment)


    Sc0rp


    EDIT: *OMG* I totally forgot to mention that you have to check your /var/log/messages and /var/log/syslog for the errors that occurred!

  • Code
    mdadm --detail --scan --verbose
    ARRAY /dev/md/Datengrab level=raid5 num-devices=4 metadata=1.2 spares=1 name=NAS:Datengrab UUID=79d94d6c:7030af53:7c7b1956:c4564987
       devices=/dev/sdb,/dev/sdd,/dev/sdc,/dev/sda
    root@NAS:~#

    Here is the syslog after a fresh restart


    https://pastebin.com/k4aWp8qG


    and here is the messages log


    https://pastebin.com/9rYDHNMm


    Code
    cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md127 : inactive sdb[1] sda[4](S) sdd[3]
          2929894536 blocks super 1.2


    I find it strange that when I type in these commands I can access all the data on the RAID, but when it fails after the automatic repair, the data is not usable anymore.

    Code
    blkid
    /dev/sdc: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5366aea8-9876-f78d-a23e-99e6ad6fea31" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdb: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="175ea7f3-8e75-caaa-a728-75435266c481" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sda: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="5c0b0f88-e883-6acd-ce5d-bb2b95a4bb9b" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sdd: UUID="79d94d6c-7030-af53-7c7b-1956c4564987" UUID_SUB="cec25e6e-e6f8-bc06-f4c7-d33148c0f9bd" LABEL="NAS:Datengrab" TYPE="linux_raid_member"
    /dev/sde1: UUID="5e284973-7730-436e-aeb5-491c3cfe6446" TYPE="ext4"
    /dev/sde5: UUID="0f12fb13-59f7-456c-a1e7-628b8f4375a2" TYPE="swap"

    fdisk output
    https://pastebin.com/raw/RCB3q1K2


    mdadm.conf
    https://pastebin.com/a1niQBda


    mdadm
    https://pastebin.com/h73Dx9A8


    That's what I get now. But now the data is not there anymore (for example, a video file won't play). When I restart the RAID with the mdadm commands and it recovers, all the files are readable again. How can I find out which HDD is faulty? When I restart the NAS I get the error that 2 of the 4 HDDs are not found.



    Is drive sdc corrupt? How can I identify the drive so that I don't pull the wrong one out?


    I did the following after the reboot



  • Re,


    Since the log was taken after the reboot, I find only these lines:


    syslog/messages:
    Sep 19 10:41:02 NAS kernel: [ 4.553638] md: kicking non-fresh sdc from array! *1)
    Sep 19 10:41:02 NAS kernel: [ 4.553660] md: unbind<sdc>


    ... there is nothing in these logs that points at the drive failures ...


    And this line from your commands after the reboot says the same:
    mdadm: forcing event count in /dev/sdc(2) from 46700 upto 46712 *1)


    *1) every member of an array has an (individual) event counter - if this counter is the same on all members, the array is in sync. If a drive deviates from the others, it is marked as out of sync and kicked out of the array, because it is outdated. This normally forces the RAID to a resync, but in your case I assume it will destroy your data completely, because sda is marked as a spare ...
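
    If you want to compare the event counters yourself, a small sketch (assuming mdadm can still read the 1.2 superblocks your blkid output shows):

    Code
    # print the event counter from each member's superblock;
    # the member with a lower count is the outdated one
    for d in /dev/sd[abcd]; do
        echo "=== $d ==="
        mdadm --examine "$d" | grep -i 'events'
    done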


    ToDo (just a recommendation):
    1st: try to get as much data off the array as you can (do whatever it takes to back it up!)
    2nd: do non-invasive searching:


    Please check the SMART attributes on your drives:
    smartctl -a /dev/sda
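
    To check all four members in one go, a small sketch (the attribute names below are the common ones; your drives may label them differently):

    Code
    # dump the most telling SMART attributes for every RAID member
    for d in sda sdb sdc sdd; do
        echo "=== /dev/$d ==="
        smartctl -A "/dev/$d" | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
    done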


    Then you need to find the cause of the failure (aka the root cause) - do more searching like this:
    zcat /var/log/syslog.1.gz | grep sdc
    zcat /var/log/syslog.1.gz | grep sda
    zcat /var/log/messages.1.gz | grep sdc
    zcat /var/log/messages.1.gz | grep sda
    - adjust the digit yourself (possibly 1...x, depending on your logrotate); check the directory with ls -la /var/log | grep syslog and | grep messages
    - adjust the search string yourself (instead of sdc/sda you can use disk, scsi, mdadm, or any other related keyword, just try it)
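
    To search the current and all rotated logs in one go, a small sketch (assuming the usual logrotate naming; swap in your own keywords as suggested above):

    Code
    # grep the live logs and every rotated copy for drive-related messages
    for f in /var/log/syslog* /var/log/messages*; do
        [ -e "$f" ] || continue
        echo "=== $f ==="
        zcat -f "$f" | grep -iE 'sdc|sda|mdadm|ata|i/o error'
    done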


    Good luck!


    Sc0rp

  • First of all, thank you very much for these hints. I will try to recover the data whilst OMV is recovering.


    After that I will create a new RAID 5 with no spare. Is there a way to scan for HDD errors whilst creating the RAID?

  • Re,

    Is there a way to scan for HDD errors whilst creating the RAID?

    More than one, I assume ...


    - smartctl should be running daemonized (ps -ef | grep smart to check)
    - smartctl -a /dev/sda (change "a" to b, c, d, e, ...) if you have a suspect (the output is very long)
    - tail -f /var/log/messages (on a second console)
    - maybe configure email reporting on your box too - it works for me (and it's fast); one more active option is sketched below
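
    If you want to actively scan a disk rather than only watch the logs, one more option (assuming smartmontools is installed, which the daemonized smartctl implies) is a SMART self-test:

    Code
    # start a long (full surface) self-test; it runs inside the drive's firmware
    smartctl -t long /dev/sda

    # later: read back the self-test result and the drive's error log
    smartctl -l selftest /dev/sda
    smartctl -l error /dev/sda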


    While building (rebuilding, syncing, resyncing) an array, mdadm takes care of faulty sectors (it's like formatting), and will log to syslog and/or messages ... so track these files via tail -f ...
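
    A filtered variant of that tail, in case the logs get noisy (the keywords are only a guess at what is relevant here):

    Code
    # follow both logs and show only md/ata/error related lines
    tail -f /var/log/syslog /var/log/messages | grep -iE 'md[0-9]|ata|error'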


    Also check the SMART status of your drives to see whether some attributes are (possibly) rising - the daemonized smartctl does this for you at a 180 s interval (standard setting), and that's enough. Take a look at the smartmontools documentation.


    Sc0rp
