Raid5 inactive- Need help please

    • OMV 2.x
    • Resolved

      Hi,

I have a RAID5 with 4 disks. It seems that 2 disks have SMART errors (sda and sdd), so I wanted to replace them with new disks. First I tried to replace sdd: I swapped the disk and added it to the array via the OMV web interface, but then I got a read error on sda. Next I put the old sdd back, replaced sda with another new disk, and the array was no longer visible. I got the array back thanks to mdadm --assemble --force /dev/md127 /dev/sd[a-d]

What is the right way to replace my disks without losing my data?

Here is the system information:


      cat /proc/mdstat
      Personalities : [raid6] [raid5] [raid4]
      md127 : active (auto-read-only) raid5 sda[0] sdc[4] sdb[1]
      5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

      blkid
      /dev/sdb: UUID="f32f2b91-1c43-ee2a-f2ec-e6be62c39a89" UUID_SUB="a09edadb-246f-7a15-f5ec-b5138b98dfb2" LABEL="Amelia:NasRaid5" TYPE="linux_raid_member"
      /dev/sdc: UUID="f32f2b91-1c43-ee2a-f2ec-e6be62c39a89" UUID_SUB="d67bf01d-6c82-af37-3a4a-5b8a2c4de855" LABEL="Amelia:NasRaid5" TYPE="linux_raid_member"
      /dev/sdd: UUID="f32f2b91-1c43-ee2a-f2ec-e6be62c39a89" UUID_SUB="417ab984-bece-f637-a267-fe7817430ba8" LABEL="Amelia:NasRaid5" TYPE="linux_raid_member"
      /dev/sda: UUID="f32f2b91-1c43-ee2a-f2ec-e6be62c39a89" UUID_SUB="c2041b86-fc04-258d-d907-40281f45c82a" LABEL="Amelia:NasRaid5" TYPE="linux_raid_member"
      /dev/md127: LABEL="data" UUID="aeef4793-13f1-40ef-8704-03cc7f4bed77" TYPE="ext4"

      fdisk -l | grep "Disk "
      Disk /dev/sda doesn't contain a valid partition table
      Disk /dev/sdb doesn't contain a valid partition table
      Disk /dev/sdc doesn't contain a valid partition table
      Disk /dev/sdd doesn't contain a valid partition table
      Disk /dev/md127 doesn't contain a valid partition table
      Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
      Disk identifier: 0x00000000
      Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
      Disk identifier: 0x00000000
      Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
      Disk identifier: 0x00000000
      Disk /dev/sdd: 2000.4 GB, 2000398934016 bytes
      Disk identifier: 0x00000000
      Disk /dev/sde: 40.0 GB, 40020664320 bytes
      Disk identifier: 0x00081194
      Disk /dev/md127: 6001.2 GB, 6001191813120 bytes
      Disk identifier: 0x00000000

      cat /etc/mdadm/mdadm.conf
      # mdadm.conf
      #
      # Please refer to mdadm.conf(5) for information about this file.
      #


      # by default, scan all partitions (/proc/partitions) for MD superblocks.
      # alternatively, specify devices to scan, using wildcards if desired.
      # Note, if no DEVICE line is present, then "DEVICE partitions" is assumed.
      # To avoid the auto-assembly of RAID devices a pattern that CAN'T match is
      # used if no RAID devices are configured.
      DEVICE partitions


      # auto-create devices with Debian standard permissions
      CREATE owner=root group=disk mode=0660 auto=yes


      # automatically tag new arrays as belonging to the local system
      HOMEHOST <system>


      # definitions of existing MD arrays
      ARRAY /dev/md/NasRaid5 metadata=1.2 spares=1 name=Amelia:NasRaid5 UUID=f32f2b91:1c43ee2a:f2ece6be:62c39a89


      mdadm --detail --scan --verbose
      ARRAY /dev/md127 level=raid5 num-devices=4 metadata=1.2 name=Amelia:NasRaid5 UUID=f32f2b91:1c43ee2a:f2ece6be:62c39a89
      devices=/dev/sda,/dev/sdb,/dev/sdc
    • It's time to prioritize.

1. The first thing you should do is copy the data you don't want to lose off of the array (to an external USB drive or another LAN host). This is the first priority because you're in a high-risk scenario: RAID5 + 2 disk failures = total loss. If you want to save your data, don't waste any time getting it copied to another location. By the way, I wouldn't reboot the NAS until your data is safe.

With two weak disks, I wouldn't attempt a rebuild, because rebuilds are drive torture tests. Rebuilding one drive may push the second drive into failure. If the second disk fails completely during the rebuild, it's all over.

If you have 2 disks to add to a 4-disk array, you may have the total capacity needed to copy your data. So the question is: what's more important, your data or the array?

      - After the above is done -

      2. Then it would be safe to look at ways to "attempt" to fix the array.
      Good backup takes the "drama" out of computing
      ____________________________________
      OMV 3.0.94 Erasmus
      ThinkServer TS140, 12GB ECC / 32GB USB3.0
      4TB SG+4TB TS ZFS mirror/ 3TB TS

      OMV 3.0.94 Erasmus - Rsync'ed Backup
      R-PI 2 $29 / 16GB SD Card $8 / Real Time Clock $1.86
      4TB WD My Passport $119
    • Hi,

Thanks for your answer. That's what I'm doing. I don't want to lose my data, but I don't have enough space to back up everything, and the backup is very slow because the small CPU is working hard to reconstruct the missing parts of the files. Once all my precious preeeeciiiousss data is saved somewhere else, I'll be back to ask for the best way to replace my disks.
    • kaly wrote:

I can't back up anymore; too many IO errors.

I will start by replacing sdd, because it is not visible to the array. I will replace the disk and run
mdadm /dev/md127 -a /dev/sdd
Is it the best way to replace the disk? What should I do if recovery fails?
      (Sorry - I didn't get notified of your last two posts, for some unknown reason.)
While the array may not have died completely, if you can't pull data from it, functionally it's dead. Further, if recovery fails, the remaining data on the array will no longer be available. At that point it would be time to rebuild, but this time around make sure you have a full (100%) backup. As this event should tell you, "RAID is not backup".

      flmaxey wrote:

With two weak disks, I wouldn't attempt a rebuild, because rebuilds are drive torture tests. Rebuilding one drive may push the second drive into failure. If the second disk fails completely during the rebuild, it's all over.
Reading data for a backup is far easier for failing disks to withstand than a recovery/rebuild would be. If you can't back up files anymore, I doubt that it's possible to successfully recover the disk.
      (But I'll cross my fingers for you.)
      ________________________________________________________________

If you don't have hot-swap hardware and you shut down your NAS to remove or add disks, note that the array may fail. Keep that in mind: powering off may be a one-way street.

Line 1: Adds a disk (this assumes that the new disk is /dev/sde)
Line 2: Swaps sdd out of the array, puts sde in the array, and starts recovery. When the recovery is complete, /dev/sdd will be marked "failed".

      (You can watch recovery progress in the GUI, in RAID management.)
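From the shell, `watch -n5 cat /proc/mdstat` shows the same progress. The recovery line carries a completion percentage and a finish estimate; the snippet below pulls both out of a sample line (the sample values are invented for the demo; on a live system you would read the real /proc/mdstat):

```shell
# A sample /proc/mdstat recovery line (values invented for the demo);
# on the NAS you would look at the real file, e.g.:
#   grep -A2 md127 /proc/mdstat
line='[=>...........]  recovery =  7.9% (163840/2047744) finish=512.3min speed=68K/sec'

# Extract the completion percentage and the estimated time remaining.
pct=$(echo "$line" | grep -o '[0-9.]\+%')
eta=$(echo "$line" | grep -o 'finish=[^ ]*')
echo "$pct $eta"
```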

      Source Code

      1. mdadm /dev/md127 --add /dev/sde
      2. mdadm /dev/md127 --replace /dev/sdd --with /dev/sde

      The following is the final result of the above commands, from a test run:

      Version : 1.2
      Creation Time : Fri Dec 15 23:53:17 2017
      Raid Level : raid5
      Array Size : 10477568 (9.99 GiB 10.73 GB)
      Used Dev Size : 5238784 (5.00 GiB 5.36 GB)
      Raid Devices : 3
      Total Devices : 4
      Persistence : Superblock is persistent

      Update Time : Sat Dec 16 00:07:43 2017
      State : clean
      Active Devices : 3
      Working Devices : 3
      Failed Devices : 1
      Spare Devices : 0

      Layout : left-symmetric
      Chunk Size : 512K

      Name : OMV:RAID5 (local to host OMV-VM)
      UUID : 46f55d2b:fc489d7e:d807092e:0b438c47
      Events : 25

      Number Major Minor RaidDevice State
      0 8 16 0 active sync /dev/sdb
      1 8 32 1 active sync /dev/sdc
      3 8 64 2 active sync /dev/sde


      2 8 48 - faulty /dev/sdd
    • One last thing:

      The following will remove sdd from the array:

      Source Code

      1. mdadm --manage /dev/md127 --remove /dev/sdd
    • flmaxey wrote:

Reading data for a backup is far easier for failing disks to withstand than a recovery/rebuild would be. If you can't back up files anymore, I doubt that it's possible to successfully recover the disk.
      (But I'll cross my fingers for you.)
      Thanks for your help and for crossing your fingers for me ;-).

I had to power off to replace the disk, and as expected, at startup the array didn't start: "not enough drives to start the array".
      Then I tried to run

      Source Code

      1. mdadm --assemble --force /dev/md127 /dev/sd[b-d]
      2. mdadm: forcing event count in /dev/sdd(3) from 3594801 upto 3676860
      3. mdadm: clearing FAULTY flag for device 2 in /dev/md127 for /dev/sdd
      4. mdadm: Marking array /dev/md127 as 'clean'
      5. mdadm: /dev/md127 assembled from 3 drives - not enough to start the array.
Is there a way to rebuild the array while accepting some data loss? Or is it really over, and do I have to create a new array?
Even if I see the possibility of recovery as remote, I truly hope this works for you. 512 minutes is close to 9 hours; let's hope the bad drives hang on. Losing a LOT of data can be PAINFUL.

      Roughly how much data do you have on the array?
I decided to repair the filesystem before changing the 2nd disk.

The RAID array is clean now, but the partition is not mounted.
When I mount it I get this error:

      Source Code

      1. #mount -a
      2. mount: wrong fs type, bad option, bad superblock on /dev/md127,

dmesg gives me the real error:

      Source Code

      1. JBD2: no valid journal superblock found

      I repaired this error with


      Source Code

      1. mke2fs -t ext4 -O ^has_journal /dev/md127

Then the mount is OK, but there is still no data. Currently I am checking the filesystem with

      Source Code

      1. #fsck.ext4 -Dcf -C 0 /dev/md127
      2. e2fsck 1.42.5 (29-Jul-2012)
3. Checking for bad blocks (read-only test): 11.55% done, 47:26 elapsed. (0/0/0 errors)
      I hope it will give good results....
Well, to be honest, it doesn't look good. The underlying RAID array is really nothing more than a "simulated" single disk: RAID gives the file system the appearance that it's storing files on one big drive.

So what I'm getting at is: if you can't access your files (stored by the file system at the top), repairing the RAID array (the file container at the bottom) probably won't change anything. You may end up with a repaired and healthy array containing corrupted files, or no files at all.

      At this point there's certainly no harm in trying, but don't be disappointed. You gave it your best shot.
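For what it's worth, when ext4 complains about a bad superblock there are backup superblocks to try before anything destructive. A sketch, with the caveat that the demo below runs against a throwaway image file so it's safe anywhere; on the NAS the device would be /dev/md127, and mke2fs/e2fsck come from the e2fsprogs package:

```shell
# Build a small throwaway ext4 image so the demo touches no real disk.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=64 status=none
mke2fs -q -F -t ext4 -b 1024 "$img"   # -b 1024 pins the block size for the demo

# 'mke2fs -n' only *prints* what it would do -- including where the backup
# superblocks live -- without writing anything to the target.
backups=$(mke2fs -n -F -t ext4 -b 1024 "$img" | grep -A1 -i 'superblock backups')
echo "$backups"

# With a backup block number in hand, e2fsck can repair from it, e.g.:
#   e2fsck -b 32768 /dev/md127
rm -f "$img"
```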


(I'm guessing that you now have a healthy RAID array with no files on it.) At least getting files off while you could prevented a total loss.

      My sincere regrets.

In the following, I'm not trying to make matters worse; losing data that has been compiled over time hurts. It's just offered as food for thought.

Some suggestions:
      In provisioning for storage needs, on my primary server I started at a 25% fill rate. Since I'm running a ZFS mirror (a rough RAID1 equivalent) with a little over 1TB of data, I have 8TB of hard disks on the server. **Note: I'm not using a zmirror thinking that it's protecting me. As we know, RAID is not backup. I'm using a mirror for bitrot protection and self healing files.** I also have a backup device (an R-PI), with a single 4TB disk, for 100% file backup. (Also, I have two other obsolete PC's with full copies on roughly 1.75TB of disks each.) When I reach 75% fill (3TB), I'll be looking to expand the primary and the backup.

      So, for solid backup and good data safety for 1TB+ of data, I have 12TB+ of disks on fully independent devices. While that seems like a lot (and it is) any one item could go up in smoke and I'd be fine.
      _______________________________________________

Regarding your current array, since your important data is backed up, I'd suggest that you consider rebuilding the array from scratch and, this time, use ZFS. A basic ZFS vdev/pool is not complicated, despite what some suggest. (At least, it didn't seem complicated to me.) The OMV plugin made getting started with ZFS easy enough.

A nice ZFS benefit is that data is "scrubbed" and compared against precalculated checksums. After a scheduled scrub is complete, I've configured OMV to run the command zpool status -v ZFS1, which reports the findings of that last scrub, and to e-mail the report to me. (This can be done in the GUI.) This is what the report shows:

      Source Code

      1. root@omv-server:~# zpool status -v ZFS1
      2. pool: ZFS1
      3. state: ONLINE
      4. scan: scrub repaired 0 in 2h10m with 0 errors on Sun Dec 10 02:34:24 2017
      5. config:
      6. NAME STATE READ WRITE CKSUM
      7. ZFS1 ONLINE 0 0 0
      8. mirror-0 ONLINE 0 0 0
      9. ata-TOSHIBA_HDWQ140_47TEK0HYFPBE ONLINE 0 0 0
      10. ata-ST4000DM005-2DP166_ZDH18HML ONLINE 0 0 0
      11. errors: No known data errors

If errors are noted and repaired, it would be an alert to check SMART attributes, watch the hard drives closely, and double-check your backup. Since this feature is sensitive, it has the potential to warn of developing drive problems before SMART triggers alerts.
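For reference, the scheduled scrub plus the status report can be as simple as two cron entries. This is only a sketch: the pool name ZFS1 comes from the post above, while the schedule, the file path, and the mail address are assumptions; in practice OMV's GUI scheduler is the usual way to set this up.

```shell
# /etc/cron.d/zfs-maintenance -- hypothetical schedule and address.
MAILTO=admin@example.com
# Scrub the pool early Sunday morning:
0 1 * * 0  root  /sbin/zpool scrub ZFS1
# Mail the result later that morning (cron mails stdout to MAILTO):
0 8 * * 0  root  /sbin/zpool status -v ZFS1
```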

      Anyway, just some thoughts on the matter. If you need more info, let me know.

      Regards and,

      Merry Christmas

