RAID 6 gone, physical drives visible

ahab666 · 24. August 2015

Thanks. How can I remove the faulty spare (disc 8) and the removed disc 7, reformat them, and add them back to the raid one by one, recovering after each?

I want to add disc 8 first and recover, then add disc 7 and recover again.

datadigger · 24. August 2015

Hey, good news: mdadm seems to have enough information to start the rebuild. But I see a faulty spare. Was there one before?

I would let it rebuild without further action, then make a backup of all your data and start from scratch with a fresh installation of the whole box. And check all the hardware for other problems; a thunderstorm like that can do a lot of not-so-funny things. That's why I have a UPS for all my boxes.

Even while the raid is rebuilding you should see your shared folders from a Windows box. Do you?

ahab666 · 24. August 2015

The problem is that the raid is no longer rebuilding:

Code

root@OMV:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdm[13](F) sdb[0] sdl[11] sdk[10] sdj[9] sdi[12] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
      29301350400 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/11] [UUUUUUU_UUUU]

I will try to mount the file system with only 10 discs.
And I am sorry, I don't have the capacity to back up all my files.
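The `/proc/mdstat` output above already names the member that dropped out: the `(F)` after `sdm[13]` marks it as faulty, and `[12/11]` means 11 of 12 members are active. A minimal sketch for pulling those two facts out of mdstat text, assuming a POSIX shell with grep, sed and awk. `failed_members` and `array_state` are hypothetical helper names, not part of mdadm or OMV:

```shell
# Hypothetical helper: read /proc/mdstat text on stdin and print every
# member device that the kernel has flagged as faulty with "(F)".
failed_members() {
    # Keep only the "mdN : ..." lines, split them into words, keep the
    # words ending in "(F)" and strip the "[slot](F)" suffix.
    grep -E '^md[0-9]+ :' | tr ' ' '\n' | grep '(F)$' | sed 's/\[[0-9]*](F)$//'
}

# Hypothetical helper: read /proc/mdstat text on stdin and print
# "degraded" or "complete" for each array, based on the [total/active] counter.
array_state() {
    awk 'match($0, /\[[0-9]+\/[0-9]+\]/) {
        split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
        if (n[2] + 0 < n[1] + 0) print "degraded"; else print "complete"
    }'
}
```

Both helpers only read text, so they are safe to run at any time, for example `failed_members < /proc/mdstat`.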

datadigger · 24. August 2015

Hmm... looks like we opened the champagne too early...
Is the raid visible in the web-ui? What is the result of blkid? Any reasons in the logs why mdadm has stopped the rebuild?

Wabun · 24. August 2015

@ahab666

For the last time, and then I give up: the disk that started all the problems should be zero-wiped, or you remove its superblock. Given your level of experience I suggest using either the WD tool or DBAN. I am not going to repeat myself a third time, just read back.

Once you have wiped the disk clean and checked that all is fine, add it back and then start again.
The raid will simply fail with that corrupt disk. If that disk, and only THAT disk, is zero-wiped, the raid will start to rebuild.
The danger now is that 7 & 8 are in one set and you will break the raid, so only wipe 7 and start again.

Best of luck.

ahab666 · 24. August 2015

@Wabun okay - DBAN will need 10 hours, so I will be quiet for now.

@datadigger - the raid shows clean but degraded, and restore shows no additional disks. I will do as Wabun suggested: full wipe of disc 7, then a rebuild, then full wipe of disc 8, then another rebuild.

cheers

datadigger · 24. August 2015

If these two disks are part of the raid, don't forget to set them as failed and remove them before you wipe (/dev/md0 and /dev/sdi are only examples; use your own array and disk names):
mdadm --manage /dev/md0 --fail /dev/sdi
mdadm --manage /dev/md0 --remove /dev/sdi
and so on... otherwise the whole raid can break.

Afterwards try to re-add them again.
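The fail / remove / wipe / add-back order described in this post can be written down as a dry run. The sketch below only prints the commands so they can be checked against the disk's serial number before anything is typed as root. `replace_member_cmds` is a hypothetical helper name; note that in this thread the array shows up as `md127` in `/proc/mdstat`, and that `--zero-superblock` is shown here as the lighter alternative to a full wipe:

```shell
# Hypothetical dry-run helper: prints, but does not execute, the mdadm
# commands for taking one member out of an array and adding it back.
replace_member_cmds() {
    array=$1   # e.g. /dev/md127, as listed in /proc/mdstat
    disk=$2    # member to take out, e.g. /dev/sdX (verify it first!)
    if [ -z "$array" ] || [ -z "$disk" ]; then
        echo "usage: replace_member_cmds ARRAY DISK" >&2
        return 1
    fi
    echo "mdadm --manage $array --fail $disk"     # mark the member as failed
    echo "mdadm --manage $array --remove $disk"   # detach it from the array
    echo "mdadm --zero-superblock $disk"          # erase old md metadata (or do a full wipe)
    echo "mdadm --manage $array --add $disk"      # add it back; the rebuild starts
}
```

For example, `replace_member_cmds /dev/md127 /dev/sdX` prints four lines to review. Running the printed commands one at a time, and checking `cat /proc/mdstat` between them, keeps each step reversible for as long as possible.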

Wabun · 25. August 2015

@ahab666

Alex, how are things going?

ahab666 · 25. August 2015

@Wabun

Not too bad, but I'm a little overwhelmed and confused :-> It seems I nuked one wrong HDD, but the raid rebuild worked with 11 of the 12 HDDs. And there is another minor problem:
as I was curious whether my data survived, I created one shared folder, started samba, created a share and voila - I was able to check the files in there.
Now that shared folder is marked as referenced, so I cannot unmount the filesystem or remove the share entry.

Any ideas?

Wabun · 25. August 2015

@ahab666

Oh dear, why couldn't you just wait till the whole raid was rebuilt?
I have no clue yet. Try to stop samba and any other services like ftp and rsync; stop all of them.

Then try a reboot.

Then DBAN the other disk, and don't do anything else till the raid is rebuilt 100%.

ahab666 · 26. August 2015

@Wabun

Well, the raid rebuild has finished, and yes, I stopped all services except for the ssh/telnet one. I deleted all samba shares before I stopped the samba service.
There is one detail I missed in the last msg: when I look at the shared folder's "in use" tab, I see that it is used by
config/system/omvextrasorg/resetperms and
config/system/shared/sharedfolder
If you know any command-line options to force services to stop, I would be very happy to learn about them.

Cheers, and thx for your patience

ahab666 · 26. August 2015

@datadigger et al

Solved the share problem from the last msg - will continue the DBAN routine on the failed spare drive.

cheers - ahab

Wabun · 27. August 2015

@ahab666

Any news?

ahab666 · 27. August 2015

DBAN'ed 2 HDDs, one seems to be unrecoverable. I will DBAN the failed one again with multipass and verification.
Another question: any idea how to identify the faulty HDD within my HDD cages? It seems like OMV changes the
device names sdb, sdc etc. randomly with every reboot.

cheers

Wabun · 27. August 2015

@ahab666

Run DBAN as follows: M = method, choose quick erase; V = verify, choose ALL passes; F10 = start. But before you do this, make a screenshot of the SMART data and also write down the re-allocated sector count. After one run, check SMART again. If the re-allocated count has increased, you have a bad disc. Most discs can cope reliably with a few re-allocations, but beyond that you're gambling!!!
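Instead of a screenshot, the before/after comparison can be done with numbers. A minimal sketch, assuming smartmontools is installed and that `smartctl -A` prints its usual ten-column attribute table with the raw value in the last column. `smart_raw` is a hypothetical helper name:

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value (10th column) of the attribute whose name is given as $1.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage (as root, /dev/sdX being the disk under test):
#   smartctl -A /dev/sdX | smart_raw Reallocated_Sector_Ct   # note the value
#   ...run the wipe...
#   smartctl -A /dev/sdX | smart_raw Reallocated_Sector_Ct   # compare
```

The same helper works for other attributes from that table, such as `Load_Cycle_Count`.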

Is the failed disc still under warranty? If so, get it RMA'd. I wouldn't risk putting it back in the raid.

ahab666 · 28. August 2015

@Wabun :

Will do so, but at the moment I have trouble identifying the culprit physically. I labelled the disks with their LUN numbers from my controller BIOS, i.e. 0 to 11.
Now I need to find a way to re-label them with the OMV naming scheme: sdb, sdc, and so on.

What is the easiest way to do so without causing a "raid break-up"? So my question is: are these names (sdb, sdc, and so on) somehow constant, like written to the physical disk?

The raid is back to 11 usable disks, but I am afraid that I might DBAN another of the working ones rather than the potentially defective one. I already got a new HDD as a replacement.

I'd like to suggest a utility (proggie or script) within OMV that makes it easy to identify the hard discs by simply removing and reconnecting one after the other without irritating the raid.

Is there any free text editor for Windows that is able to read and display Unix log files correctly?

Cheers - ahab666 aka alex

Wabun · 28. August 2015

@ahab666

Alex, the first thing you always do when you build a raid is label each disk and write down its serial and label number in a notepad file or spreadsheet, whatever you prefer. You also check each disk's SMART details and make a note or a screenshot of them.

For now, you can identify the disks with the blkid command, so it should not be that difficult to work out which physical disk is which. SMART in OMV also gives such information; there is no need for any other tools, OMV has all you need. Another option is mdadm --detail /dev/mdX (X = 127 or whatever yours is); cat /proc/mdstat tells you your mdX number.

As for the log files, on Window$ you can use the Notepad++ editor.
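On the question of whether sdb, sdc and so on are constant: kernel names are handed out in detection order and can change between boots, while the symlinks udev creates under `/dev/disk/by-id` embed the model and serial number and point at whatever kernel name the disk currently has. A minimal sketch that prints that mapping so it can be matched against the serial printed on each drive label. It assumes a Linux box with udev and GNU `readlink`; `list_disk_ids` is a hypothetical helper name:

```shell
# Hypothetical helper: walk a by-id directory (default /dev/disk/by-id)
# and print "kernel-name<TAB>id" for every whole disk, sorted by kernel name.
list_disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/ata-* "$dir"/scsi-*; do
        [ -L "$link" ] || continue                     # skip unmatched globs
        case $link in *-part[0-9]*) continue ;; esac   # skip partition links
        printf '%s\t%s\n' \
            "$(basename "$(readlink -f "$link")")" \
            "$(basename "$link")"
    done | sort
}
```

Running `list_disk_ids` after each reboot gives the current sdX name for every serial, so the labels on the trays can stay keyed to the serial rather than to sdX.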

I suggest that before you do anything else, you get your most important data off the raid!!!

Then zero-wipe the new disk to make sure it goes in without any problems. Don't take any risks!

Once you have your most important data, you can DBAN every disk if you like. You have 12, so taking one out at a time to check or replace it should not be a problem. If you bought all 12 in one order, you might in time run into what I call a batch issue: all disks are the same age and might die at the same time!
Make sure you monitor your raid and have enabled the email warnings, and check the SMART status of the raid disks yourself; checking it yourself is the only reliable way. Don't forget new disks can die fast as well. Like humans, some get very old and some sadly die very young!

Now the most important part is that you didn't say what was wrong with the disk. Did it die completely? Does it have a lot of bad sectors? Is it still under warranty?
Anyway, you had better get to the bottom of whether it was down to the surge spike, just old disks [who to blame], or anything else. But as has been suggested, get at least a UPS, or better, a UPS in combination with a surge protector in your consumer unit, which costs as little as £60 here in the UK. We have this one: http://www.dehn-usa.com/pdbRes…mmer-pdf/31307/952070.pdf But it might be different in your country.

Time for a coffee.

ahab666 · 28. August 2015

@Wabun

Well, the raid worked perfectly for nearly 2 years. The problem is that I still have no idea how to identify the defective HDD, and I do not want to do up to 11 DBANs or rebuilds if possible.
On my controller the HDDs are hot-swappable, so I can unplug and replug any HDD and watch it in the BIOS. It would be cool to have a utility, external or in the OMV GUI, that could do the same,
like the physical device page but with the purpose of identifying all or any HDDs. For a good programmer that should be a 10 to 30 line script, I guess. I could even write you a flow chart or block diagram for it.
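The unplug-and-watch idea can be approximated with two snapshots of `/dev/disk/by-id` and a comparison between them. This is only a sketch under assumptions: it needs udev, it is not an OMV feature, and `snapshot_ids` and `vanished_ids` are hypothetical helper names. Pulling a member of a running array makes md mark it as failed, so this is better done with the array stopped or on a disk that has already been removed from it:

```shell
# Hypothetical helper: print the sorted whole-disk IDs (ata-*/scsi-*,
# no partition links) found in a by-id directory.
snapshot_ids() {
    ls "${1:-/dev/disk/by-id}" | grep -E '^(ata|scsi)-' | grep -v -- '-part[0-9]' | sort
}

# Hypothetical helper: given two snapshot files, print the IDs that are
# in the first ("before") but no longer in the second ("after").
vanished_ids() {
    comm -23 "$1" "$2"
}

# Usage:
#   snapshot_ids > /tmp/before        # all trays inserted
#   ...pull exactly one tray and wait a few seconds...
#   snapshot_ids > /tmp/after
#   vanished_ids /tmp/before /tmp/after   # ID (model + serial) of the pulled disk
```

Because the ID contains the serial number, one pass per tray is enough to label every bay.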

cheers - ahab666

Wabun · 28. August 2015

@ahab666

Try these.

Code

hdparm -i /dev/sdx

Code

smartctl -a /dev/sdx | more

Code

lsblk

Code

lsblk -o name,kname,label,uuid,state

x is the drive letter.
| more lets you scroll through the output page by page.
lsblk will show the mdX device.
Look at the label on the disk itself; it should show the serial number.

Edit: do you have WD drives?

You had better check in SMART what the load cycle count is. If it is very high, you need to patch your drives with wdidle3.
This applies to many of their products in all colour ranges; I myself have found it on red, green, blue and black drives.
There are rumours that this is killing drives in a raid, so you had better check each drive to see whether it is affected.

A good tool is the UBCD (Ultimate Boot CD). It has this little tool on the CD-ROM; boot from it and check each disk.

ahab666 · 28. August 2015

@Wabun

I am using the red ones, and WD has a utility for patching the BIOS for the TLER parameter. I did that with all discs before I built my raid.
I will check that as well. It may take some time though.

Still, a utility that allows physically identifying each HDD by disconnecting it (on the fly) would be great.

cheers - ahab666
