Posts by LukeR1886

    Here is the exact sequence of commands and outputs that put the array back "online" in degraded mode with 3 of 4 drives. Please skim this thread first to confirm your failure is the same corrupt-superblock type of failure that I had before proceeding, and use these commands at your own risk. I am not an expert; this worked for me, and I am not responsible for any data loss that may occur because of the improper (or proper) use of these commands.


    Stopped the array:

    Code
    mdadm --stop /dev/md127


    Then created the array anew. It is my understanding that the --assume-clean flag is critical to keeping your data intact. And you'll need to type your drives in the exact sequence that the mdadm --detail /dev/md127 output showed, including where the "missing" drive is supposed to be in the array.
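    As a sketch (the full command is further down this post), the create step looked like the following. The device list and the slot where "missing" goes are from MY array, not yours; the command is only echoed for review here, because running mdadm --create against the wrong devices destroys data.

```shell
# Sketch of the re-create step, NOT a ready-to-run command: the device
# order and the position of "missing" must match YOUR `mdadm --detail`
# output. Echoed for review instead of executed, because `mdadm --create`
# on the wrong devices destroys data.
CREATE_CMD='mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=4 missing /dev/sdb /dev/sdd /dev/sde'
echo "REVIEW BEFORE RUNNING: $CREATE_CMD"
```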



    Then we run the fsck -v /dev/md0 command. You could add a -y after the verbose flag (-v) to automatically answer yes to every repair, but I was paranoid at this point :D so I hit enter for each repair instead.


    Then we mount the file system with mount -a.
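    Put together, the check-then-mount steps look like this sketch. It assumes the re-created array came back as /dev/md0 and that its filesystem has an /etc/fstab entry, and it is guarded so it does nothing on a machine without that device.

```shell
# Sketch of the check-then-mount sequence. Assumes the re-created array is
# /dev/md0 and its filesystem is listed in /etc/fstab; guarded so it does
# nothing on a machine without that block device.
MD_DEV=/dev/md0
if [ -b "$MD_DEV" ]; then
    fsck -v "$MD_DEV"   # add -y to auto-answer "yes" to every repair prompt
    mount -a            # mount everything listed in /etc/fstab
else
    echo "$MD_DEV not present -- adjust the device name first"
fi
```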


    Thanks very much to geaves, Krisbee and votdev! Hope this helps someone in the future!

    CORRECTION :


    Obviously the physical correction to the root of the problem was a higher-output PSU. This solved every single R/W error that was happening to the disks. Terrifying. ^^


    I am not a RAID or mdadm expert, but I think this is how the commands I used repaired the array. The array was stuck in a catch-22: it had a failed/removed drive, yet it was also unable to accept more than the originally specified number of drives. I believe that means the superblock journal data was corrupt. So instead of trying to reassemble the existing array with the existing corrupt superblock journal data, I rebuilt the array with (very carefully selected) commands that told the new superblock journal entry exactly what was in the "new" array.

    It consisted of the drives that were still "up to date", entered in the exact order they were previously listed in the mdadm --detail /dev/md127 output, plus the "missing" drive in the correct position. The command I used was mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=4 missing /dev/sdb /dev/sdd /dev/sde ; however, you'll need to assess your own array and edit the drive list and "missing" position as they apply to your scenario.

    My understanding is that this command re-created the array's superblock journal data and used the existing filesystem information it found on the drives. The IMPORTANT part of the command is the --assume-clean flag, which stops the degraded array from trying to sync when it is created (because if it tries to sync, it can potentially overwrite the existing data).

    This process completed successfully. Then I ran fsck -v /dev/md0, and the Check Filesystem process completed successfully. Finally I ran mount -a, and the volume mounted in degraded mode with 3 of 4 drives. All the data was accessible again.


    I copied all the data that wasn't in my previous backup to my other Windows Server NTFS array. Then I successfully rebuilt the array with the original 4 drives. For good measure I created a new array with some eBay drives and made a clone copy of the successfully rebuilt array to my new EXT4 array.


    All RAID volumes are now happy, and I have a backup array that is unplugged which contains a January 2024 backup.

    CAUSE :


    It took me a while to figure out the true cause, and it was not a disk failure (directly). I recently upgraded my motherboard from a SuperMicro unit to a nearly identical SuperMicro unit that has 6x onboard 10GbE ports. The existing power supply worked just fine with the old motherboard for 2 years with no issues, but the 6x onboard 10GbE ports (3x 2-port chipsets) took the power supply over its maximum capacity.

    TL;DR


    The PSU was maxed out, and the drives were intermittently powering down and then powering back up.

    So I just wanted to follow this up with the complete three C's (Concern, Cause, Correction) for anyone that runs into this nightmare.

    CONCERN :


    The concern was that a fully functional RAID 5 array had a disk that went "out of date". When I physically probed the machine while it was running, it seemed that moving the cables made the drives spin up, but I now realize this was false. Thinking the cables were the cause, I shut down the machine and replaced the cables, and additionally the SAS HBA (with a known-good unit) for good measure. Something was still never right, and when OMV booted up and I ran the mdadm --assemble --force --verbose /dev/md127 /dev/sdb /dev/sdc /dev/sdd /dev/sde command, the array just never correctly reassembled.


    You can read through the thread to see everything I tried, but what I eventually determined is that the superblock journal must have been corrupt. I kept getting error messages that contradicted the state of the array. The mdadm --detail /dev/md127 command showed that the "out of date" drive was successfully removed, but the mdadm --add /dev/md127 /dev/sdc command resulted in the error mdadm: add new device failed for /dev/sdc as 4: Invalid argument. So the device was already removed, but the replacement device couldn't be added: a catch-22.

    The short answer is that the link to https://www.linuxquestions.org…ing-a-failed-disk-909577/ had the solution that got the array running.

    geaves  Krisbee  votdev Thank you all so much for bearing with an amateur!


    HA! I was able to mount the drive and see all of my data!!!! I am not a file system or RAID expert by any means, so I do not fully understand the commands I used to make it happen. But the short answer is that I followed the directions in the LinuxQuestions.org link I posted, then ran fsck (which completed successfully), then simply typed mount -a like Krisbee said.

    I have a replacement array coming from eBay, so at this point I am not going to try to rebuild the array until my replacement array is here, and I'm going to use this opportunity to pull the most important files off my degraded array.

    I will post the exact commands I used to get this array back online, so if someone else runs into the issue of an array with a "missing" drive that can be neither added nor removed, they will know what to try to get it mounted and functional again!

    I will also need to cruise this forum to learn how to set up an rsync job to keep data automatically backed up between this array and my newly arriving array.
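    For anyone searching later, here is a minimal sketch of the kind of rsync job I mean. The paths are placeholders I made up, not real mount points, and --dry-run keeps the command harmless until it has been reviewed.

```shell
# Hypothetical rsync backup job between two arrays. SRC and DST are
# placeholder paths -- substitute your own mount points. --dry-run makes
# rsync only report what it would copy; drop it once the paths are right.
SRC=/srv/main-array/
DST=/srv/backup-array/
RSYNC_CMD="rsync -a --delete --dry-run $SRC $DST"
echo "REVIEW BEFORE RUNNING: $RSYNC_CMD"
```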

    Thanks again for having patience and putting up with my ignorance!

    Krisbee and here is the dmesg error log I've generated:


    Run the command as fsck -yv /dev/md127 and let fsck do its thing. Hopefully not all superblocks are damaged.


    Krisbee So I've been busy with work, but here is part of the output from when I ran fsck. I'm no file system expert, but it doesn't look very successful. Wondering if I should run it again...? I think the entire output is TOO long to post (more than 10,000 characters), but I can always upload *.txt files with the entire dump. The important parts are the beginning and end. So...


    Beginning:



    Some different repairs in the middle:




    Ending:


    Krisbee Do I go ahead and hit yes to all of THESE?

    and the other two





    Krisbee Awesome. I did order some eBay 6TB drives, to decide whether I'm going to copy each drive and try the repair on the copies, or just bite the bullet and run fsck on /dev/md127. I'll report back with my decision and findings.


    2 more questions if you'd be so kind, but no pressure to answer. I appreciate all of the assistance you provided so far VERY much!


    1. Should I run the fsck from the original VM... will the OS container that originally created the array and file system have any data that would improve the mounting/repair chances? Or should I NOT power the array on/off anymore and just run from the current OS I've been operating on?
    2. This might be a big ask, but from what I can tell, these smartctl outputs suggest that I have 4 healthy (high-hour) drives, eh? (Which should hopefully survive the stress of an fsck and array repair?)


    EDIT - The message is too long... sorry I'll split it up over several.




    Krisbee Thank you for taking the time to reply. The SAS HBA is in full passthrough mode. So the VM container can access the HBA and drives directly. ;) I can always switch back to the original VM in less than 2 minutes flat if you think there may be critical data regarding the array saved in that OS container.


    I have posted this problem in another forum and gotten a few ideas, but thus far I have only followed the guidance received in this thread. An amazing suggestion I received was to raw-copy each drive to another drive and use the copies to attempt to repair and remount the file system. The original drives are very high mileage, so I have some worries about raw-copying the data without the original drives failing. They've been healthy for a very long time, but they're past their expected lifetime hours by A LOT! I'd love to hear some input on this method.

    And thank you for the digitalocean and loggly links, I'll check them out when I'm home from work!

    Krisbee I'll do my best to answer as concisely as possible, to avoid this thread getting long and cumbersome.



    1. This is my home NAS, and luckily not super-sensitive data. Unfortunately I don't have an automated RSYNC to another server, so the last time this volume was backed up was about one year ago, and a few small backups have been made. I'm not missing a TON of data since the last backup, but it would be a bummer to lose it. Plus it will be a great learning experience if I can recover it or even if not.
    2. We'll skip to your #3 because your #2 and #4 have related info. I've attempted to mount the filesystem in the WebUI. I cannot find the kernel log in the System Log page you mention, and may need help locating exactly what you're looking for in the CLI.
    3. Response to your #2 and #4. The MD RAID appears in the WebUI as clean, degraded. The filesystem takes a bit of explanation. It is not listed at all in the WebUI, because I transferred the array to a new OMV VM. This was for a few reasons. I had to shut down the bare-metal machine to replace the SAS card and cabling. I suspected the root failure was the cabling, because when I touched the cables while it was powered up, I heard the drives turn on... :/ ... They then powered off shortly after, and I knew I couldn't let them keep doing that. So I powered it off, replaced the SAS card and cable, and powered it up. The array did not reassemble on its own, and there were LOTS of error messages clouding everything, many just related to mounts that were unavailable. So I spun up a new VM on the same version of OMV and have been attempting the recovery in a fresh install. The original VM container still exists; I just haven't used it since the first try after the SAS card and cable replacement. Long story long, that VM's WebUI says the array is missing.
    4. Response to your #5. I used the command mdadm --stop /dev/md127 before trying to reassemble, first with mdadm --assemble --verbose /dev/md127 /dev/sdb /dev/sdc /dev/sdd /dev/sde then with mdadm --assemble --force --verbose /dev/md127 /dev/sdb /dev/sdc /dev/sdd /dev/sde .

      Sorry for the long response, but just trying to answer your questions fully.
      Here is the current fstab info:

    Krisbee Thank you for the reassurance. We'll see how this goes. Do you think it would be wise to run fsck at this time, so it can run while I'm at work? Or shall we just wait for further approval before doing anything of the sort?

    votdev I don't think the cat /proc/mdstat or mdadm --detail /dev/md127 output has changed much (if at all) after the few reassemble commands since the initial post, but here is a fresh run of each command this morning:

    Code
    root@openmediavault:~# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md127 : active raid5 sdb[1] sde[3] sdd[2]
          17581171200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
          bitmap: 1/44 pages [4KB], 65536KB chunk
    
    unused devices: <none>
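    For anyone reading that mdstat output, the [4/3] [_UUU] markers are the quick way to see the damage: 4 configured slots, 3 active members, and the underscore marks the missing one. A small runnable sketch against that sample line:

```shell
# Pick the degraded-array markers out of a sample /proc/mdstat line.
# [4/3] = 4 configured slots, 3 active; the "_" in [_UUU] is the missing member.
line='17581171200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]'
slots=$(echo "$line" | grep -Eo '\[[0-9]+/[0-9]+\]')
members=$(echo "$line" | grep -Eo '\[_?U+\]')
echo "$slots $members"   # prints: [4/3] [_UUU]
```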

    UPDATE 1/30
    The /dev/sdc (original) and the /dev/sdf (a spare I have from a different machine) have both finished their "Secure Wipe" routines. When I try to add /dev/sdc to the array as a replacement using the WebUI, it returns this error message:


    Does anyone have any further ideas? I found a thread on another forum where someone battled a very similar issue using only mdadm, since whatever system he was using was NOT OMV. I'll add a screenshot of his solution and the URL to that forum below the code box. Is his solution safe for me at all?


    Code
    OMV\ExecException: Failed to execute command 'export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin; export LANG=C.UTF-8; export LANGUAGE=; mdadm --manage '/dev/md127' --add /dev/sdc 2>&1' with exit code '1': mdadm: add new device failed for /dev/sdc as 4: Invalid argument in /usr/share/php/openmediavault/system/process.inc:197
    Stack trace:
    #0 /usr/share/openmediavault/engined/rpc/raidmgmt.inc(419): OMV\System\Process->execute()
    #1 [internal function]: Engined\Rpc\RaidMgmt->add(Array, Array)
    #2 /usr/share/php/openmediavault/rpc/serviceabstract.inc(123): call_user_func_array(Array, Array)
    #3 /usr/share/php/openmediavault/rpc/rpc.inc(86): OMV\Rpc\ServiceAbstract->callMethod('add', Array, Array)
    #4 /usr/sbin/omv-engined(537): OMV\Rpc\Rpc::call('RaidMgmt', 'add', Array, Array, 1)
    #5 {main}

    The solution to another RAID5 disaster (like this one), and the URL

    Basically it looks like he carefully and manually rebuilt the array and superblock data with a few (precise) commands.


    https://www.linuxquestions.org/questions/linux-server-73/mdadm-error-replacing-a-failed-disk-909577/


    Krisbee and geaves thank you both, kindly, for the input you've provided thus far.

    I'm preparing to rebuild the array, but I have a few questions, as I'm not new to tech but this is my first time repairing a completely missing drive.

    Can I mount the filesystem before attempting a RAID rebuild, to recover some of the stuff that I know isn't backed up anywhere else? This is just a precaution in case the array completely dies during recovery. It's probably 10% or 15% of the entire array's volume that I'd be focused on retrieving. When I tried to mount it in the WebUI, I got an error that I'll post at the bottom...

    Here is my preparation to start the recovery, anything else I should know/ do ?

    • I have the original /dev/sdc (a Seagate ST6000NM0014) reaching the end of its secure wipe routine.
    • I installed a "spare" drive as /dev/sdf (a Seagate ST6000NM0034) currently being secure-wiped too, just in case the original drive won't work.
    • Both are pulled out of the actual chassis and arranged so that extra fans can be placed on them, which dropped the temperature from 38C to 30C. I'm sure rebuilding the array will warm things up a little and... well, the drives have been healthy for a long time, but they're high-mileage units.


    "clean, degraded" array mount error: Looks like the superblock is unreadable. Is that normal or repairable?

    UPDATE:

    I haven't heard anything back yet, but I've made some progress and figured out some of this headache on my own. I could still REALLY use some help.

    I learned that when the array says active (auto-read-only), the host OS will eventually work itself out of read-only mode without any further input.
    So now:

    1. I've allowed the array to finish syncing and the  active (auto-read-only)  property has disappeared.
    2. I've run the "wipe drive" tool in secure mode on /dev/sdc (the drive that was out of sync) until it reached 25%.
    3. The commands mdadm /dev/md127 --fail /dev/sdc and mdadm /dev/md127 --remove /dev/sdc had already been run previously, but I ran them again to ensure the drive was removed.
    4. When I run the command mdadm --add /dev/md127 /dev/sdc I cannot add the drive with the following error:


    Code
    root@openmediavault:~# mdadm --add /dev/md127 /dev/sdc
    mdadm: add new device failed for /dev/sdc as 4: Invalid argument


    The mdadm --detail /dev/md127 output is as follows, and it shows that the superblock data knows there are 4 devices but only 3 are active. I cannot figure out the command to remove the "removed" device from the superblock data, so I'm stuck unable to add a replacement device, with the error message mdadm: add new device failed for /dev/sdc as 4: Invalid argument
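    In hindsight, one way to look at this mismatch directly would have been to dump each remaining member's superblock and compare the event counts and device roles. A hypothetical diagnostic sketch (it needs root and the real member devices, so it's guarded to do nothing elsewhere):

```shell
# Hypothetical diagnostic: dump each member's md superblock so the event
# counts and device roles can be compared across drives. Needs root and
# the real member devices; the -b guard makes it a no-op elsewhere.
MEMBERS='/dev/sdb /dev/sdd /dev/sde'
for d in $MEMBERS; do
    if [ -b "$d" ]; then
        echo "== $d =="
        mdadm --examine "$d" | grep -E 'Events|Device Role|Array State'
    fi
done
```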


    If anyone can chime in on the process of removing the "removed" drive and adding a new one, I would greatly appreciate any help! Thanks in advance to anyone who can assist!

    Krisbee  geaves

    Post the output of cat /proc/mdstat in a code box please; this symbol </> on the forum bar makes it easier to read


    The output from #4 above shows the array as (auto-read-only). Also, to re-add /dev/sdc after the 'Possibly out of date' error, the drive will have to be securely wiped; this can usually be run to 25% and then stopped, then try re-adding the drive to the array. Do not add the drive until the rebuild has finished and the (auto-read-only) state is corrected.


    geaves So, I've already attempted to add the drive back, got errors, and performed a --fail /dev/sdc and --remove /dev/sdc to remove it. Now it's back to the previous, non-functional state. I hope nothing was damaged. :/

    How do I get the machine to rebuild the array with only 3 drives... to get the array out of auto-read-only?

    Here's the cat /proc/mdstat


    Code
    root@openmediavault:~# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md127 : active (auto-read-only) raid5 sdb[1] sde[3] sdd[2]
          17581171200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
          bitmap: 0/44 pages [0KB], 65536KB chunk
    
    unused devices: <none>

    OK, and is this to be run with the array stopped?
    e.g. run this first: mdadm --stop /dev/md127