RAID 6 drive removal test yielded degraded RAID and 2 split RAID arrays - Don't know what to do

  • What I had before was:


    Code
    root@CHOMEOMV:~# lsscsi -d
    [1:0:0:0]    disk    ATA      WDC WD1600BEKX-0 01.0  /dev/sda [8:96]
    [2:0:0:0]    disk    ATA      WDC WD1600BEKX-0 01.0  /dev/sdb [8:112]
    [0:1:0:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdc [8:0]
    [0:1:1:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdd [8:16]
    [0:1:2:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sde [8:32]
    [0:1:3:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdf [8:48]
    [0:1:4:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdg [8:64]
    [0:1:6:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdh [8:80]


    sda is the main OS boot drive, and sdb is an unmounted Clonezilla clone of my boot drive, kept as a backup.


    I created a RAID 6 array, /dev/md0, consisting of sdc - sdh, and it came up active.
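
    For anyone following along, creating the array looked roughly like this (I may have passed slightly different options; chunk size and metadata were left at the defaults):

    Code
    # build a 6-disk RAID 6 from sdc through sdh (as they were named at the time)
    root@CHOMEOMV:~# mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[cdefgh]
    # watch the initial sync
    root@CHOMEOMV:~# cat /proc/mdstat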


    It had completed its sync overnight and was active.


    I used LVM and created a physical volume, volume group, and logical volume, but had not created a filesystem yet.
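
    Roughly the following; "vg_storage" and "lv_storage" are just placeholders here for whatever names I actually used:

    Code
    # mark the array as an LVM physical volume, then carve a VG and LV out of it (placeholder names)
    root@CHOMEOMV:~# pvcreate /dev/md0
    root@CHOMEOMV:~# vgcreate vg_storage /dev/md0
    root@CHOMEOMV:~# lvcreate -l 100%FREE -n lv_storage vg_storage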


    This morning I noticed that the LED on one drive's bay was much dimmer than the others. Instead of powering down and yanking the drive to take a look, I decided to try a failure test anyhow: I hot-pulled sdc for a few minutes, inspected the drive, and then put it back in the same slot. The RAID array showed clean, degraded.
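
    (For the record, the software-side way to simulate a drive failure, rather than hot-pulling, would have been something along these lines:)

    Code
    # mark the member as failed, then remove it from the array
    root@CHOMEOMV:~# mdadm --manage /dev/md0 --fail /dev/sdc
    root@CHOMEOMV:~# mdadm --manage /dev/md0 --remove /dev/sdc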


    cat /proc/mdstat showed the other drives as fine but sdc as failed. (I cannot show you this because that SSH session was closed.)


    I tried re-adding the drive to the array, but I was getting:


    Code
    root@CHOMEOMV:~# mdadm --manage /dev/md0 --re-add /dev/sdc
    mdadm: Cannot open /dev/sdc: Device or resource busy
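
    In hindsight, a few read-only checks might have shown what was still holding the disk before I rebooted; something like:

    Code
    # is the kernel still tracking sdc as an md member?
    root@CHOMEOMV:~# cat /proc/mdstat
    # does the disk still carry the old array superblock?
    root@CHOMEOMV:~# mdadm --examine /dev/sdc
    # is anything else (partitions, holders) sitting on top of it?
    root@CHOMEOMV:~# lsblk /dev/sdc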


    So I rebooted my NAS box. However, what I have now is very puzzling:


    lsscsi now reports:



    and mdadm --detail shows:




    So my questions are:


    1. Why couldn't I re-add my drive back?
    2. Why did all my drives change device pointers after the reboot?
    3. Why do I now have 2 arrays (md devices)?
    4. Where do I go from here?


    Perhaps #3 happened because the drive I failed was the first drive in the array, and re-inserting the non-failed device somehow caused it to register as a new md array. I don't believe that should have happened, though, and I don't know why it would have, nor what the repercussions are.
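
    If it helps with diagnosis, the member superblocks should show which array each disk thinks it belongs to; something like the following (device letters here are the pre-reboot ones):

    Code
    # compare what each member disk records in its superblock
    root@CHOMEOMV:~# mdadm --examine /dev/sd[cdefgh] | grep -E 'Array UUID|Events|Device Role'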


    Again, no filesystem was created yet, so none of the RAID storage has been mounted.


    Also, this isn't a critical issue, because I have nothing on the array. I just wanted to start learning about the process before I have to deal with it in a real online-data scenario.


    Any/all help would be appreciated. I am not going to touch it for the moment, in the hope that you folks have answers on what I should do next, what I should have done before, or whatever.


    I am really liking this setup so far. I came from a QNAP 469L (4x3TB RAID 5) and wanted to expand. I am pretty happy with my setup, but I am in learning mode and it's pretty fun so far. This system is giving me the control and flexibility I wanted, as well as user-level recovery, rather than having to send a device back to a vendor for repair if I got into that situation.


    Thanks in advance.

  • I cannot edit my main post, but I had email alerting on, and the initial alert actually included the old mdstat in my inbox:



    Code
    P.S. The /proc/mdstat file currently contains the following:
    
    
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid6 sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0](F)
          15623215104 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [_UUUUU]
    
    
    unused devices: <none>



  • So I can't sit on my hands....


    Rebooted the machine; still the same status.


    Went into /etc/mdadm/mdadm.conf and saw this:


    Code
    .
    .
    
    
    # definitions of existing MD arrays
    ARRAY /dev/md0 metadata=1.2 name=CHOMEOMV:Storage001 UUID=7ff2875d:f8466166:ed83b1d8:5d486d37
    .
    .
    .


    That was the original definition of the array, so I suspect mdadm was picking up the first disk (the one I pulled) and trying to assemble an array out of it, and it also subsequently saw the other, *actual* RAID array, to which it assigned md127. So I changed that mdadm.conf entry to md127 and rebooted.
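
    If anyone else lands here: rather than hand-editing the ARRAY line, it is probably safer to regenerate it from the arrays that are actually assembled and then refresh the copy inside the initramfs, something like:

    Code
    # print ARRAY line(s) for the currently assembled array(s), for pasting into mdadm.conf
    root@CHOMEOMV:~# mdadm --detail --scan
    # after editing /etc/mdadm/mdadm.conf, rebuild the initramfs so it carries the same config
    root@CHOMEOMV:~# update-initramfs -u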


    So now I only have one RAID array showing up in /proc/mdstat, and my sda drive is no longer busy. I tried to --re-add it, but it said that was not possible, so I just added it back to the array, and now it is clean, degraded, but rebuilding.
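
    For completeness, that sequence was roughly the following (md127 and sda are simply what things ended up being called on my box after the reboot):

    Code
    # --re-add was refused, so add the disk back as a member to be rebuilt instead
    root@CHOMEOMV:~# mdadm --manage /dev/md127 --re-add /dev/sda
    root@CHOMEOMV:~# mdadm --manage /dev/md127 --add /dev/sda
    # watch the rebuild progress
    root@CHOMEOMV:~# cat /proc/mdstat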


    I guess I thought that if you hot-pulled a disk from the array and then re-inserted it, everything would just merrily carry on after a little re-sync. That doesn't appear to be the case (at least in my test).


    Still not sure why my system did the curly shuffle with all my device assignments, but I guess that doesn't really matter either, since it appears to handle it all nicely anyhow.
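
    For what it's worth, the sdX letters depend on detection order and aren't guaranteed to be stable across reboots anyway; md assembles arrays by the UUID in the superblock, which is presumably why it coped. If stable per-disk names are ever needed for scripting, something like:

    Code
    # persistent per-disk names, independent of sdX ordering
    root@CHOMEOMV:~# ls -l /dev/disk/by-id/
    # filesystem / array member UUIDs
    root@CHOMEOMV:~# blkid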


    Hopefully this is of value to anyone else who faces something similar.

  • I'm glad you got it resolved. Perhaps your controller doesn't truly support hot-swapping? Do you know if the disk came back as the same device name or as a new device? I think this behavior depends on the controller and/or backplane. If your controller doesn't report a disconnect, for example, then Linux doesn't know the drive is gone. This works on my box: if I hot-swap "sdb", the new one will come up as "sdb". On my old box, "sdb" would just be failed forever until I rebooted, and I'd get a new device for the newly inserted disk.
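
    One way to tell whether the controller actually reports the unplug/replug is to watch the kernel log while doing it, something like:

    Code
    # follow kernel messages while pulling and re-inserting the disk
    dmesg --follow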


    I'm still confused how your mdadm.conf got so confused though.

  • I bought the Silverstone DS380 chassis and it supports hot-swapping.


    I also have the Adaptec RAID 6805E which supports hot-swapping.


    The drive, after being reinserted, came back with the previous block device name.


    I worked with md many, many moons ago (15+ years), but it's a pretty reliable and stable beast these days, from what I gather from forums and general opinion. I am not using my controller for hardware RAID; it's just configured as JBOD. I wanted to have all the control within the OS.


    Anyhoo, no harm, no foul, I guess. I am going to burn in my RAM and maybe put my box under some load for a bit. Not a bad idea to give my drives a little run too, to increase my chances of a solid, stable system.


    Cheers
