RAID 6 drive removal test yielded degraded RAID and 2 split RAID arrays - Don't know what to do

  • What I had before was:


    Code
    root@CHOMEOMV:~# lsscsi -d
    [1:0:0:0]    disk    ATA      WDC WD1600BEKX-0 01.0  /dev/sda [8:96]
    [2:0:0:0]    disk    ATA      WDC WD1600BEKX-0 01.0  /dev/sdb [8:112]
    [0:1:0:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdc [8:0]
    [0:1:1:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdd [8:16]
    [0:1:2:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sde [8:32]
    [0:1:3:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdf [8:48]
    [0:1:4:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdg [8:64]
    [0:1:6:0]    disk    WDC      WD40EFRX-68WT0N0 82.0  /dev/sdh [8:80]


    sda is the main OS boot drive, and sdb is an unmounted Clonezilla clone of my boot drive, kept as a backup.


    I created a RAID 6 array, /dev/md0, consisting of sdc - sdh, and it came up active.
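
    For anyone following along, creating the array looked roughly like this (I may have passed slightly different options; chunk size and metadata were left at the defaults):

    Code
    # build a 6-disk RAID 6 from sdc through sdh (as they were named at the time)
    root@CHOMEOMV:~# mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[cdefgh]
    # watch the initial sync
    root@CHOMEOMV:~# cat /proc/mdstat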


    It had completed its sync overnight and was active.


    I used LVM and created a physical volume, volume group, and logical volume, but had not created a filesystem yet.
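
    Roughly the following; "vg_storage" and "lv_storage" are just placeholders here for whatever names I actually used:

    Code
    # mark the array as an LVM physical volume, then carve a VG and LV out of it (placeholder names)
    root@CHOMEOMV:~# pvcreate /dev/md0
    root@CHOMEOMV:~# vgcreate vg_storage /dev/md0
    root@CHOMEOMV:~# lvcreate -l 100%FREE -n lv_storage vg_storage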


    This morning I noticed that the LED on one drive's bay was much dimmer than the others. Instead of powering down and yanking the drive to take a look, I decided to try a failure test anyhow: I hot-pulled sdc for a few minutes, inspected the drive, and then put it back in the same slot. The RAID array showed clean, degraded.
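
    (For the record, the software-side way to simulate a drive failure, rather than hot-pulling, would have been something along these lines:)

    Code
    # mark the member as failed, then remove it from the array
    root@CHOMEOMV:~# mdadm --manage /dev/md0 --fail /dev/sdc
    root@CHOMEOMV:~# mdadm --manage /dev/md0 --remove /dev/sdc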


    cat /proc/mdstat showed the other drives as fine but sdc as failed. (I cannot show you this because that SSH session was closed.)


    I tried re-adding the drive to the array, but I was getting:


    Code
    root@CHOMEOMV:~# mdadm --manage /dev/md0 --re-add /dev/sdc
    mdadm: Cannot open /dev/sdc: Device or resource busy
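
    In hindsight, a few read-only checks might have shown what was still holding the disk before I rebooted; something like:

    Code
    # is the kernel still tracking sdc as an md member?
    root@CHOMEOMV:~# cat /proc/mdstat
    # does the disk still carry the old array superblock?
    root@CHOMEOMV:~# mdadm --examine /dev/sdc
    # is anything else (partitions, holders) sitting on top of it?
    root@CHOMEOMV:~# lsblk /dev/sdc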


    So I rebooted my NAS box. However, what I have now is very puzzling:


    lsscsi now reports:



    and mdadm --detail shows:




    So my questions are:


    1. Why couldn't I re-add my drive back?
    2. Why did all my drives change device pointers after the reboot?
    3. Why do I now have 2 arrays (md devices)?
    4. Where do I go from here?


    Perhaps #3 happened because the drive I failed was the first drive in the array, and re-inserting the non-failed device somehow caused it to register as a new md array. I don't believe that should have happened, though, and I don't know why it would have, nor what the repercussions are.
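
    If it helps with diagnosis, the member superblocks should show which array each disk thinks it belongs to; something like the following (device letters here are the pre-reboot ones):

    Code
    # compare what each member disk records in its superblock
    root@CHOMEOMV:~# mdadm --examine /dev/sd[cdefgh] | grep -E 'Array UUID|Events|Device Role'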


    Again, no filesystem was created yet, so none of the RAID storage has been mounted.


    Also, this isn't a critical issue, because I have nothing on the array. I just wanted to start learning about the process before I have to deal with it in a real online-data scenario.


    Any/all help would be appreciated. I am not going to touch it for the moment, in the hope that you folks have answers on what I should do next, what I should have done before, or whatever.


    I am really liking this setup so far. I came from a QNAP 469L (4x3TB RAID 5) and wanted to expand. I am pretty happy with my setup, but I am in learning mode and it's pretty fun so far. This system is giving me the control and flexibility I wanted, as well as user-level recovery, rather than having to send a device back to a vendor for repair if I got into that situation.


    Thanks in advance.

  • I cannot edit my main post, but I had email alerting on, and the initial alert actually included the old mdstat in my inbox:



    Code
    P.S. The /proc/mdstat file currently contains the following:
    
    
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid6 sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0](F)
          15623215104 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [_UUUUU]
    
    
    unused devices: <none>



  • So I can't sit on my hands....


    Rebooted the machine; still the same status.


    Went into /etc/mdadm/mdadm.conf and saw this:


    Code
    .
    .
    
    
    # definitions of existing MD arrays
    ARRAY /dev/md0 metadata=1.2 name=CHOMEOMV:Storage001 UUID=7ff2875d:f8466166:ed83b1d8:5d486d37
    .
    .
    .


    That was the original definition of the array, so I suspect mdadm was picking up the first disk (the one I pulled) and trying to assemble an array out of it, and it also subsequently saw the other, *actual* RAID array, to which it assigned md127. So I changed that mdadm.conf entry to md127 and rebooted.
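
    If anyone else lands here: rather than hand-editing the ARRAY line, it is probably safer to regenerate it from the arrays that are actually assembled and then refresh the copy inside the initramfs, something like:

    Code
    # print ARRAY line(s) for the currently assembled array(s), for pasting into mdadm.conf
    root@CHOMEOMV:~# mdadm --detail --scan
    # after editing /etc/mdadm/mdadm.conf, rebuild the initramfs so it carries the same config
    root@CHOMEOMV:~# update-initramfs -u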


    So now I only have one RAID array showing up in /proc/mdstat, and my sda drive is no longer busy. I tried to --re-add it, but it said that was not possible, so I just added it back to the array, and now it is clean, degraded, but rebuilding.
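
    For completeness, that sequence was roughly the following (md127 and sda are simply what things ended up being called on my box after the reboot):

    Code
    # --re-add was refused, so add the disk back as a member to be rebuilt instead
    root@CHOMEOMV:~# mdadm --manage /dev/md127 --re-add /dev/sda
    root@CHOMEOMV:~# mdadm --manage /dev/md127 --add /dev/sda
    # watch the rebuild progress
    root@CHOMEOMV:~# cat /proc/mdstat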


    I guess I thought that if you hot-pulled a disk from the array and then re-inserted it, everything would just merrily carry on after a little re-sync. That doesn't appear to be the case (at least in my test).


    Still not sure why my system did the curly shuffle with all my device assignments, but I guess that doesn't really matter either, since it appears to handle it all nicely anyhow.
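
    For what it's worth, the sdX letters depend on detection order and aren't guaranteed to be stable across reboots anyway; md assembles arrays by the UUID in the superblock, which is presumably why it coped. If stable per-disk names are ever needed for scripting, something like:

    Code
    # persistent per-disk names, independent of sdX ordering
    root@CHOMEOMV:~# ls -l /dev/disk/by-id/
    # filesystem / array member UUIDs
    root@CHOMEOMV:~# blkid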


    Hopefully this is of value to anyone else who faces something similar.

  • I'm glad you got it resolved. Perhaps your controller doesn't truly support hot-swapping? Do you know if the disk came back as the same device name or as a new device? I think this behavior depends on the controller and/or backplane. If your controller doesn't report a disconnect, for example, then Linux doesn't know the drive is gone. This works on my box: if I hot-swap "sdb", the new one will come up as "sdb". On my old box, "sdb" would just be failed forever until I rebooted, and I'd get a new device for the newly inserted disk.
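
    One way to tell whether the controller actually reports the unplug/replug is to watch the kernel log while doing it, something like:

    Code
    # follow kernel messages while pulling and re-inserting the disk
    dmesg --follow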


    I'm still confused how your mdadm.conf got so confused though.

  • I bought the Silverstone DS380 chassis and it supports hot-swapping.


    I also have the Adaptec RAID 6805E which supports hot-swapping.


    The drive, after being reinserted, came back with the previous block device name.


    I worked with md many, many moons ago (15+ years), but it's a pretty reliable and stable beast these days, from what I gather from forums and general opinion. I am not using my controller for hardware RAID; it's just configured as JBOD. I wanted to have all the control within the OS.


    Anyhoo, no harm, no foul, I guess. I am going to burn in my RAM and maybe put my box under some load for a bit. Not a bad idea to give my drives a little run too, to increase my chances of a solid, stable system.


    Cheers
