RAID 6 gone, physical drives visible

  • @Wabun


    Thanks. How can I remove the faulty spare disc 8 and the removed disc 7, reformat them, and add them back to the raid one by one, recovering after each?


    I first want to add disc 8 and recover, then add disc 7 and recover again.

  • Hey, good news: mdadm seems to have enough information to start the rebuild. But I see a faulty spare, was there one before?


    I would let it rebuild without further action, then make a backup of all your data and start from scratch with a fresh installation of the whole box. Also check all the hardware for other problems; a thunderstorm like that can do a lot of not-so-funny things. That's why I have a UPS for all my boxes.


    Even while the raid is rebuilding you should be able to see your shared folders from a Windows box. Do you?

    Homebox: Bitfenix Prodigy Case, ASUS E45M1-I DELUXE ITX, 8GB RAM, 5x 4TB HGST Raid-5 Data, 1x 320GB 2,5" WD Bootdrive via eSATA from the backside
    Companybox 1: Standard Midi-Tower, Intel S3420 MoBo, Xeon 3450 CPU, 16GB RAM, 5x 2TB Seagate Data, 1x 80GB Samsung Bootdrive - testing for iSCSI to ESXi-Hosts
    Companybox 2: 19" Rackservercase 4HE, Intel S975XBX2 MoBo, C2D@2200MHz, 8GB RAM, HP P212 Raidcontroller, 4x 1TB WD Raid-0 Data, 80GB Samsung Bootdrive, Intel 1000Pro DualPort (Bonded in a VLAN) - Temp-NFS-storage for ESXi-Hosts

  • The problem is that the raid is not rebuilding any longer:


    Code
    root@OMV:~# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md127 : active raid6 sdm[13](F) sdb[0] sdl[11] sdk[10] sdj[9] sdi[12] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
          29301350400 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/11] [UUUUUUU_UUUU]


    I will try to mount the file system with only 10 discs ....
    And I am sorry, I don't have the capacity to back up all my files :/
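The status line above already says what is missing: `[12/11]` means 12 slots with 11 active, each `_` in `[UUUUUUU_UUUU]` is an empty slot, and `(F)` marks a member flagged as faulty. A minimal sketch for pulling that out of `/proc/mdstat`; the function name `md_summary` is made up for illustration, and it reads from stdin so it can also be tried on saved output:

```sh
#!/bin/sh
# md_summary: hypothetical helper, reads /proc/mdstat-style text on stdin.
# Prints one line per array: name, faulty members, and slots total/active.
md_summary() {
    awk '
        # Array line, e.g. "md127 : active raid6 sdm[13](F) sdb[0] ..."
        /^md[0-9]+ :/ {
            name = $1
            faulty = ""
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)      # "sdm[13](F)" -> "sdm"
                    faulty = faulty (faulty == "" ? "" : ",") dev
                }
            }
            next
        }
        # Status line, e.g. "... algorithm 2 [12/11] [UUUUUUU_UUUU]"
        name != "" && match($0, "\\[[0-9]+/[0-9]+\\]") {
            slots = substr($0, RSTART + 1, RLENGTH - 2)
            printf "%s faulty=%s slots=%s\n", name, (faulty == "" ? "none" : faulty), slots
            name = ""
        }
    '
}
```

Called as `md_summary < /proc/mdstat`, the output quoted above would come out as `md127 faulty=sdm slots=12/11`.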

  • Hmm... looks like we opened the champagne too early...
    Is the raid visible in the web-ui? What is the result of blkid? Any reasons in the logs why mdadm has stopped the rebuild?


  • @ahab666


    For the last time, and then I give up: the disk that started all the problems should be zero-wiped, or you should remove its superblock. Given your experience I suggest using either the WD tool or DBAN. I am not going to repeat myself a third time, just read back.


    Once you have wiped the disk clean and checked that all is fine, add it back and then start again.
    The raid will simply fail with that corrupt disk. If that disk and only THAT disk is zero-wiped, the raid will start to rebuild.
    The danger now is that 7 & 8 are in one set and you will break the raid, so only wipe 7 and start again.


    Best of luck.

    DISCLAIMER: I'm not a native English speaker, I'm really sorry if I don't explain as good as you would like... :)


  • @Wabun okay, DBAN will need 10 hours, so I will be quiet for now.


    @datadigger - the raid shows clean but degraded, and restore shows no additional disks. I will do as Wabun suggested: full wipe of disc 7, then a rebuild, then full wipe of disc 8, then another rebuild ...


    cheers

  • If these two disks are a part of the raid don't forget to set them as failed and remove them before you wipe.
    mdadm --manage /dev/md0 --fail /dev/sdi
    mdadm --manage /dev/md0 --remove /dev/sdi
    and so on (substitute your own array and disk names), otherwise the whole raid can break.


    Afterwards try to re-add them again.
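The sequence described here (fail, remove, clear the old metadata, add back) can be sketched as a small dry-run helper. The function name `replace_member` and the `RUN` switch are made up for illustration, and `/dev/md127` and `/dev/sdX` are placeholders. `mdadm --zero-superblock` stands in for the lighter "remove the super-block" option mentioned earlier in the thread; a full DBAN wipe would happen at that point instead.

```sh
#!/bin/sh
# replace_member: hypothetical helper. Prints the mdadm steps for taking one
# member out of an array and putting it back; runs them only when RUN=1.
# Usage: replace_member /dev/md127 /dev/sdX   (both names are placeholders)
replace_member() {
    array=$1
    disk=$2
    if [ -z "$array" ] || [ -z "$disk" ]; then
        echo "usage: replace_member ARRAY DISK" >&2
        return 2
    fi
    # Order matters: mark failed, detach, clear old metadata, then add back.
    for step in \
        "mdadm --manage $array --fail $disk" \
        "mdadm --manage $array --remove $disk" \
        "mdadm --zero-superblock $disk" \
        "mdadm --manage $array --add $disk"
    do
        if [ "${RUN:-0}" = "1" ]; then
            # Stop at the first failing step so a later step never runs blind.
            $step || return 1
        else
            echo "$step"
        fi
    done
}
```

By default this only prints the four commands so they can be checked against `cat /proc/mdstat` first. Treating one disk at a time, and waiting for the rebuild to finish before touching the next, matches the advice given above.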


  • @Wabun


    Not too bad, but a little overwhelmed and confused :-> It seems I nuked one wrong HDD, but the raid rebuild worked with 11 of the 12 HDDs. And there is another minor problem:
    as I was curious whether my data survived, I created one shared folder, started samba, created a share and voila, I was able to check the files in there ...
    Now that shared folder is marked as referenced, so I cannot unmount the filesystem or remove the share entry ... :/


    any ideas ?

  • @ahab666


    Oh dear, why couldn't you just wait until the whole raid was rebuilt?
    I have no clue yet. Try to stop samba and any other services like ftp and rsync; stop all of them.


    Then try a reboot.


    Then DBAN the other disk and don't do anything else until the raid is 100% rebuilt.


  • @Wabun


    Well, the raid rebuild had finished, and yes, I stopped all services except for the ssh/telnet one. I deleted all samba shares before I stopped the samba service.
    There is one detail I missed in the last message: when I look at the shared folder's "in use" tab I see that it is used by
    config/system/omvextrasorg/resetperms and
    config/system/shared/sharedfolder
    If you know any telnet command-line options to force services to stop, I would be very happy to learn about them ...


    cheers and thx for your patience

  • DBAN'ed 2 HDDs, one seems to be unrecoverable ... I will DBAN the failed one again with multiple passes and verification ...
    Another question: any idea how to identify the faulty HDD within my HDD cages? It seems OMV changes the
    device names sdb, sdc etc. randomly with every reboot ...


    cheers

  • @ahab666


    Run DBAN as follows: M = method, choose quick erase; V = verify, choose all passes; F10 = start. But before you do this, take a screenshot of the SMART data and also write down the re-allocated sector count. After one run, check SMART again: if the re-allocated count has increased, you have a bad disc. Most discs can handle a few re-allocations reliably, but beyond that you're gambling!!!


    Is the failed disc still under warranty? Get it RMA'd; I wouldn't risk putting it back in the raid then.
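The before/after SMART comparison can also be done from saved text instead of screenshots. A minimal sketch, assuming the usual `smartctl -A` attribute table in which the raw value is the 10th column; the function names `smart_raw` and `realloc_grew` are made up for illustration:

```sh
#!/bin/sh
# smart_raw: hypothetical helper. Reads `smartctl -A` output on stdin and
# prints the raw value (10th column) of the attribute named in $1.
smart_raw() {
    awk -v attr="$1" '$2 == attr { print $10; exit }'
}

# realloc_grew: succeeds (exit 0) when the reallocated-sector count in the
# "after" file is higher than in the "before" file.
realloc_grew() {
    before=$(smart_raw Reallocated_Sector_Ct < "$1")
    after=$(smart_raw Reallocated_Sector_Ct < "$2")
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo "Reallocated_Sector_Ct not found in one of the files" >&2
        return 2
    fi
    echo "Reallocated_Sector_Ct: before=$before after=$after"
    [ "$after" -gt "$before" ]
}

# Intended use (device name is a placeholder):
#   smartctl -A /dev/sdX > before.txt     # before the wipe
#   ... run DBAN ...
#   smartctl -A /dev/sdX > after.txt      # after the wipe
#   realloc_grew before.txt after.txt && echo "count increased"
```

Keeping the two text files alongside the disk's serial number also gives a record to attach to an RMA request.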


  • @Wabun :


    Will do so, but at the moment I have trouble identifying the culprit physically. I labelled the disks with their LUN number from my controller BIOS, like 0 to 11 ...
    Now I need to find a way to re-label them with the OMV naming scheme like sdb, sdc, and so on.


    What is the easiest way to do so without causing a "raid break-up"? So my question is: are these names sdb, sdc and so on somehow constant, like written to the physical disk?


    The raid is back to 11 usable disks, but I am afraid that I might DBAN another of the working ones rather than the potentially defective one. I already got a new HDD as a replacement ...


    I'd like to suggest a utility (program or script) within OMV that makes it easy to identify the hard discs by simply removing and reconnecting one after the other without irritating the raid.


    Is there any free text editor for Windows that is able to read and display Unix log files correctly?



    cheers - ahab666 aka alex

  • @ahab666


    Alex, the first thing you always do when you build a raid is label the disks and write down each serial and label number in a notepad file or spreadsheet, whatever you prefer. You also check the disks' SMART details and make a note or a screenshot of them. At the moment you can identify the disks with the blkid command, so it should not be that difficult to work out which physical disk is which. SMART in OMV also gives such information; there is no need for any other tools in OMV, you have all you need. Another option is mdadm --detail /dev/mdx [x=127 or whatever it is]; with cat /proc/mdstat you can find out your mdx number. Regarding the log files, on Windows you can use the Notepad++ editor.


    I suggest that before you do anything else, you get your most important data off the raid!!!


    Then zero-wipe the new disk to make sure it goes in without any problems. Don't take any risks!


    Once you have your most important data you can DBAN every disk if you like. You have 12, so taking one out at a time to check or replace it should not be a problem. If you bought all 12 in one order, you may in time run into what I call a batch issue: all disks the same age might die at the same time!
    Make sure you monitor your raid and that you have enabled the email warnings, and check the SMART status of the raid disks yourself; checking it yourself is the only reliable way. Don't forget that new disks can die fast as well. Like humans, some get very old and some sadly die very young!


    Now the most important part is that you didn't say what was wrong with the disk: did it die completely, does it have a lot of bad sectors, is it still in warranty?
    Anyway, you had better get to the bottom of whether it was down to the surge spike, just old disks [who to blame], or anything else. But as has been suggested, get at least a UPS, or better, combine it with a surge protector in your consumer unit, which costs as little as £60 here in the UK. We have this one: http://www.dehn-usa.com/pdbRes…mmer-pdf/31307/952070.pdf But it might be different in your country.


    Time for a coffee.


  • @Wabun


    Well, the raid worked perfectly well for nearly 2 years .... The problem is that I still have no idea how to identify the culprit defective HDD, and I do not want to do up to 11 DBANs or rebuilds if possible ....
    On my controller the HDDs are hot-swappable, so I can unplug and replug any HDD and watch it in the BIOS .... It would be cool to have a utility, external or in the OMV GUI, that could do the same ....
    Like the physical device page, but with the purpose of identifying all or any HDDs .... For a good programmer that should be a 10 to 30 line script, I guess ..... I could even write you a flow chart or block diagram for that.


    cheers - ahab666

  • @ahab666


    Try these.


    Code
    hdparm -i /dev/sdx


    Code
    smartctl -a /dev/sdx | more


    Code
    lsblk


    Code
    lsblk -o name,kname,label,uuid,state


    x is the drive letter.
    | more lets you scroll through the output page by page.

    lsblk will show the mdx.
    Look at the label on the disk, it should show the serial...
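On the earlier question of whether names like sdb and sdc are constant: they are assigned by the kernel in detection order and are not written to the disk, so they can change between boots. The symlinks under `/dev/disk/by-id` usually carry the model and serial number and point at whatever kernel name the disk currently has. A minimal sketch that lists that mapping; the function name `list_by_id` is made up for illustration, and depending on the controller the entries may appear under prefixes other than the `ata-` and `scsi-` ones assumed here:

```sh
#!/bin/sh
# list_by_id: hypothetical helper. Walks a by-id style directory of symlinks
# (default /dev/disk/by-id) and prints "kernel-name  id" for whole disks.
list_by_id() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/ata-* "$dir"/scsi-*; do
        [ -L "$link" ] || continue                    # unmatched glob, or not a symlink
        case $link in *-part[0-9]*) continue ;; esac  # skip partition entries
        target=$(readlink "$link")                    # e.g. ../../sdb
        printf '%s  %s\n' "${target##*/}" "${link##*/}"
    done | sort
}
```

Running `list_by_id` after each reboot and comparing the serial part of the id with the sticker on the drive gives a mapping that does not depend on the sdX order.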


    Edit: do you have WD drives?


    You'd better check in SMART what the load cycle count is. If it is very high, you need to patch your drives with wdidle3.
    This applies to many of their products in all colour ranges; I myself have found it on red, green, blue and black drives.
    There are rumours that this is killing drives in a raid, so you'd better check whether each drive is affected.


    A good tool is the UBCD Ultimate boot CD, it has this little tool on the CD-ROM, boot from it and check each disk.
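Checking the load cycle count on twelve drives by hand is tedious, so here is a minimal sketch that reads it from saved `smartctl -A` output, one file per drive. The function name `load_cycle_report` is made up for illustration, and the `LIMIT` default is an arbitrary example value, not a figure from WD:

```sh
#!/bin/sh
# load_cycle_report: hypothetical helper. For each file of saved
# `smartctl -A` output, prints the file name and the raw Load_Cycle_Count,
# flagging values above LIMIT (an arbitrary illustration value, not a WD spec).
LIMIT=${LIMIT:-300000}
load_cycle_report() {
    for f in "$@"; do
        count=$(awk '$2 == "Load_Cycle_Count" { print $10; exit }' "$f")
        if [ -z "$count" ]; then
            echo "$f: Load_Cycle_Count not reported"
        elif [ "$count" -gt "$LIMIT" ]; then
            echo "$f: $count (above $LIMIT)"
        else
            echo "$f: $count"
        fi
    done
}
```

One way to feed it is to save `smartctl -A /dev/sdX` for each drive into its own file and then pass all the files to `load_cycle_report`. A count that climbs quickly between two runs is the thing to look for.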

  • @Wabun


    I am using the red ones, and WD has a utility for patching the BIOS for the TLER parameter. I did that with all discs before I built my raid ....
    I will check that too ..... it may take some time though ....


    Still, a utility that allows physically identifying each HDD by disconnecting it (on the fly) would be great ;)



    cheers - ahab666
