UPDATE: RESOLVED. Another case of missing RAID5 and file system. Please help!

  • LukeR1886 I'd wait for a comment from geaves. I'm happy to defer to others and wouldn't want to be responsible for any data loss.


    But before that, could you confirm that the md RAID does appear in the WebUI as clean, degraded? Secondly, is the filesystem on /dev/md127 listed as missing in the WebUI, or not listed at all? For good measure, can you post the output of cat /etc/fstab?


    Thirdly, in #16 above, how were you attempting to mount your RAID array (via the CLI or WebUI), and what messages were generated in the logs? Either look at the kernel log in the WebUI Diagnostics System Logs section, or via the CLI.
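    To illustrate the CLI route, the filtering can be sketched like this. The log lines below are hypothetical stand-ins for real kernel-log output, just to show what to look for:

```shell
# Hypothetical kernel-log excerpt, standing in for `journalctl -k` output
log='Feb 01 09:18:09 omv kernel: md/raid:md127: not enough operational devices (2/4 failed)
Feb 01 09:18:09 omv kernel: EXT4-fs (md127): unable to read superblock
Feb 01 09:18:10 omv kernel: usb 1-1: new high-speed USB device number 2'

# Keep only the messages that mention the array device,
# the same effect as filtering the live kernel log for "md127":
printf '%s\n' "$log" | grep 'md127'
```

    On a live system you would run journalctl -k -g md127 (or dmesg | grep md127) instead of filtering a saved string.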


    Lastly, have you left your server up all the time since this incident?


    Going back to your very first post, as you cannot execute a "mdadm --assemble ... " command unless the md raid is stopped, what did you do to stop the array at that stage?

  • Krisbee I'll do my best to answer as concisely as possible, to avoid this thread getting long and cumbersome.



    1. This is my home NAS, and luckily not super-sensitive data. Unfortunately I don't have an automated rsync to another server, so the last full backup of this volume was about one year ago, with a few small backups since. I'm not missing a TON of data since the last backup, but it would be a bummer to lose it. Plus it will be a great learning experience whether or not I can recover it.
    2. We'll skip to your #3, because your #2 and #4 have related info. I've attempted to mount the filesystem in the WebUI. I cannot find the kernel log in the System Log page you mention, and may need help locating exactly what you're looking for in the CLI.
    3. Response to your #2 and #4. The MD RAID appears in the WebUI as clean, degraded. The filesystem needs a bit of explanation: it is not listed at all in the WebUI, because I transferred the array to a new OMV VM. This was for a few reasons. I had to shut down the metal-box machine to replace the SAS card and cabling. I suspected the root failure was the cabling, because when I touched the cables while it was powered up, I heard the drives spin up... :/ ... They then powered off shortly after, and I knew I couldn't have them keep doing that. So I powered it off, replaced the SAS card and cable, and powered it up. The array did not reassemble on its own, and there were LOTS of error messages clouding everything, many just related to mounts that were unavailable. So I spun up a new VM on the same version of OMV and have been attempting to recover in a fresh install. The original VM container still exists; I just haven't used it since the first try after the SAS card and cable replacement. Long story long, that VM's WebUI says the array is missing.
    4. Response to your #5. I used the command mdadm --stop /dev/md127 before trying to reassemble, first with mdadm --assemble --verbose /dev/md127 /dev/sdb /dev/sdc /dev/sdd /dev/sde, then with mdadm --assemble --force --verbose /dev/md127 /dev/sdb /dev/sdc /dev/sdd /dev/sde.

      Sorry for the long response, but just trying to answer your questions fully.
      Here is the current fstab info:
  • LukeR1886


    In this confused situation, what is certain is that you have a clean, degraded array with a filesystem that will not mount. Clearly, you cannot hope to retrieve any data unless and until that filesystem mounts. I suspect there has been an unclean unmount of the filesystem on the array at some stage.


    Now that it's clear OMV is running in a VM (in ESXi/Proxmox or other?), how are your disks passed to the virtual machine? The proper way to do this is to pass through the whole controller, so OMV has full control over the disks and can monitor the SMART data, etc.


    It's your choice whether you do the following. I cannot predict whether you will be able to access your data or whether the filesystem will mount; understand that you proceed at your own risk.


    1. First step is to do a filesystem check on the array device: fsck /dev/md127


    If you see a message like this example, just keep typing "y" until the check completes.


    Code
    root@omv:~# fsck /dev/md0
    fsck from util-linux 2.38.1
    e2fsck 1.47.0 (5-Feb-2023)
    /dev/md0 was not cleanly unmounted, check forced.
    Resize inode not valid.  Recreate<y>? 

    Successful completion message:


    Code
    /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
    /dev/md0: 19/655360 files (5.3% non-contiguous), 81226/2618880 blocks
    root@omv:~# 

    A further fsck should return clean, e.g.:

    Code
    root@omv:~# fsck /dev/md0
    fsck from util-linux 2.38.1
    e2fsck 1.47.0 (5-Feb-2023)
    /dev/md0: clean, 19/655360 files, 81226/2618880 blocks
    root@omv:~#


    2. Now test if the filesystem on your array /dev/md127 will mount.


    Assuming your fstab contains the correct reference to the array, just type mount -a at the CLI. The filesystem should appear in the WebUI, and an associated message should be found in the logs using journalctl -f -k -g md127. The latter means follow (-f) the current kernel (-k) log, filtering (-g) for the string "md127". Example output extract:

    Code
    Feb 01 09:18:09 omv kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Quota mode: journalled.
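    For reference, the kind of fstab line OMV creates for an array filesystem generally looks like the hypothetical example below. The UUID and mount-point path here are made up; OMV generates and manages these entries itself, so there should be no need to write one by hand:

```
# Hypothetical fstab entry for an ext4 filesystem on an md array
UUID=0a1b2c3d-1111-2222-3333-444455556666 /srv/dev-disk-by-uuid-0a1b2c3d ext4 defaults,nofail 0 2
```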


    (If you want to know how to make best use of the system logs, this is good starting point: https://www.digitalocean.com/c…d-manipulate-systemd-logs and https://www.loggly.com/ultimate-guide/using-journalctl/)


    If your array filesystem does mount, the 1st priority is to retrieve any data not previously backed up. Whether your replacement disk can be added to the array comes later. Treat the original /dev/sdc as unusable until it's tested outside of the array.

  • Krisbee Thank you for taking the time to reply. The SAS HBA is in full passthrough mode, so the VM container can access the HBA and drives directly. ;) I can always switch back to the original VM in less than 2 minutes flat if you think there may be critical data regarding the array saved in that OS container.


    I have posted this problem in another forum and gotten a few ideas, but thus far I have only followed the guidance received in this thread. An amazing suggestion I received was to raw-copy each drive to another drive, and use the copies to attempt to repair and remount the filesystem. The original drives are very high mileage, so I have some worries about successfully raw-copying the data without the original drives failing. They've been healthy for a very long time, but they're past the expected lifetime hours by A LOT! I'd love to hear some input about using this method.

    And thank you for the digitalocean and loggly links, I'll check them out when I'm home from work!

  • LukeR1886 If by raw-copy you mean using dd or ddrescue in some way, I can't comment, as it's not something I've ever had to do. If your drive power-on hours are at 70,000+, they're near EOL. For SAS drives, the "grown defect" count is the thing to watch for in SMART data.


    Personally, I'd just bite the bullet and do a fsck on /dev/md127 and go from there. Of course it's your choice & your risk.

  • Krisbee Awesome. I did order some eBay 6TB drives to decide whether I'm going to copy each drive and try to repair on the copies, or if I'm just gonna bite the bullet and run the fsck on /dev/md127. I'll report back with my decision and findings.


    2 more questions if you'd be so kind, but no pressure to answer. I appreciate all of the assistance you provided so far VERY much!


    1. Should I use the original VM to run the fsck on... will the OS container that originally created the array and filesystem have any data that would make the mounting/repair chances better? Or should I NOT power the array on/off anymore and just run the current OS I've been operating on?
    2. This might be a big ask, but from what I can tell, these smartctl outputs suggest that I have 4 healthy (high-hour) drives, eh? (Which should hopefully have the potential to survive the stress of a fsck and array repair?)


    EDIT - The message is too long... sorry, I'll split it up over several.




  • and the other two





  • Krisbee Do I go ahead and hit yes to all of THESE?

  • Krisbee Awesome. I did order some eBay 6TB drives to decide whether I'm going to copy each drive and try to repair on the copies, or if I'm just gonna bite the bullet and run the fsck on /dev/md127. I'll report back with my decision and findings.


    2 more questions if you'd be so kind, but no pressure to answer. I appreciate all of the assistance you provided so far VERY much!


    1. Should I use the original VM to run the fsck on... will the OS container that originally created the array and filesystem have any data that would make the mounting/repair chances better? Or should I NOT power the array on/off anymore and just run the current OS I've been operating on?
    2. This might be a big ask, but from what I can tell, these smartctl outputs suggest that I have 4 healthy (high-hour) drives, eh? (Which should hopefully have the potential to survive the stress of a fsck and array repair?)

    1. Leave disks in situ.

    2. SMART data looks OK - power-on hours circa 50k and no grown defects. The error-count formatting is poorly aligned, but looks to be zero in each case.

  • Krisbee Do I go ahead and hit yes to all of THESE?


    Run the command as fsck -yv /dev/md127 and let fsck do its thing. Hopefully not all superblocks are damaged.

  • Run the command as fsck -yv /dev/md127 and let fsck do its thing. Hopefully not all superblocks are damaged.


    Krisbee So I've been busy with work, but here is part of the output from when I ran fsck. I'm no filesystem expert, but it doesn't look very successful. Wondering if I should run it again...? I think the entire output is TOO long to post (more than 10,000 characters), but I can always upload *.txt files with the entire dump... The important parts are the beginning and end. So...


    Beginning:



    Some different repairs in the middle:




    Ending:


  • Krisbee and here is the dmesg error log I've generated:


  • It's the final success or failure message from fsck that's key here. It looks like the filesystem on /dev/md127 was damaged early on.

    Can't offer much hope here. It emphasizes the importance of heeding the mantra "RAID is NOT a BACKUP", a harsh lesson.


    Whether something like ddrescue could help to recover any data, I wouldn't really know. Here's a ref to the basic idea if you want to try your luck: https://linuxconfig.org/how-to…-clone-disk-with-ddrescue


    Other than that, it's starting from scratch, I'm afraid, and reviewing how you use RAID.

  • geaves  Krisbee  votdev Thank you all so much for bearing with an amateur!


    HA! I was able to mount the drive and see all of my data!!!! I am not a file system or RAID expert by any means, so I do not fully understand the commands which I used to make it happen. But the short answer is that I followed the directions of the LinuxQuestions.org link I posted. Then I ran fsck (which successfully completed), then I just typed mount -a like Krisbee said.

    I have a replacement array coming from eBay, so at this point I am not going to try to rebuild the array until the replacement is here, and I'm going to use this opportunity to pull the most important files from my degraded array.

    I will post the commands which I used exactly to get this array back online, so if someone else runs into the issue of their array with a "missing drive" that cannot be added or removed, they will know what to try to get it back mounted and functional!

    I will also need to cruise this forum to learn how to make an rsync job to keep data backed-up automatically on this array and my newly arriving array.

    Thanks again for having patience and putting up with my ignorance!

  • I'm not sure why recreating the array with the --assume-clean switch allowed fsck to succeed on the rebuilt array when fsck had previously failed. But great news that you can access your data now.


    Apart from the "RAID is NOT a BACKUP" mantra, I'd make sure you were comfortable with the procedure to replace a failed drive in an array. If you didn't set up routine SMART data checks in OMV together with notifications then it's time to do it.


    If you've still got the original OMV VM and it boots, then you can go back and look at the logs to see how they reflect the array failure. For example, journalctl -g kick might show when an "out of date" drive was first kicked out of the array. Filtering on "md127" and "EXT4-fs err" will help pinpoint filesystem corruption.
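    As a sketch of what those filters can turn up, here is the sort of thing the journal might contain around such a failure (hypothetical lines; exact kernel strings vary by version):

```shell
# Hypothetical journal excerpt around an array failure
journal='Jan 28 03:11:45 omv kernel: md: kicking non-fresh sdc from array!
Jan 28 03:12:02 omv kernel: EXT4-fs error (device md127): ext4_find_entry:1463: inode #2: comm smbd: reading directory lblock 0
Jan 28 03:12:05 omv kernel: random: crng init done'

# Apply both suggested filters in one pass (case-insensitive),
# the same idea as journalctl -g kick and journalctl -g 'EXT4-fs err':
printf '%s\n' "$journal" | grep -Ei 'kick|EXT4-fs err'
```

    On the VM itself, journalctl -g kick and journalctl -g 'EXT4-fs err' do the equivalent search over the stored journal.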

  • So I just wanted to follow this up with the complete three C's (Concern, Cause, Correction) for anyone that runs into this nightmare.

    CONCERN :


    The concern was that a fully functional RAID 5 array had a disk that went "out of date". When I physically probed the machine while it was running, it seemed that moving the cables made the drives spin up. But I now realize this was false. Thinking the cables were the cause, I shut down the machine and replaced the cables, and additionally the SAS HBA (with a known good unit) for good measure. Something was still not right, and when OMV booted up and I ran the mdadm --assemble --force --verbose /dev/md127 /dev/sdb /dev/sdc /dev/sdd /dev/sde command, the array just never correctly reassembled.


    You can read through the thread to see all of the things that I tried. But what I eventually determined is that the superblock journal must have been corrupt. I continued to get error messages that contradicted the state of the array. The mdadm --detail /dev/md127 command showed that the "out of date" drive was successfully removed, but the mdadm --add /dev/md127 /dev/sdc command resulted in the error mdadm: add new device failed for /dev/sdc as 4: Invalid argument. So the device was already removed, but the replacement device couldn't be added; it was stuck in a catch-22.

    A short answer is that the link to https://www.linuxquestions.org…ing-a-failed-disk-909577/ had the solution that got the array running.

  • CAUSE :


    This took me a while to figure out the actual, true cause. And it was not a disk failure (directly). I recently upgraded my motherboard from a SuperMicro unit to a nearly identical SuperMicro unit that has 6x onboard 10GbE ports built in. The existing power supply worked just fine on the old motherboard for 2 years with no issue, but the 6x onboard 10GbE ports (3x 2-port chipsets) took the power supply over its maximum capacity.

    TL;DR


    The PSU was maxed out, and the drives were intermittently powering down, then powering back up.

  • CORRECTION :


    Obviously the physical root of the problem was resolved by a higher-output PSU. This solved every single R/W error that was happening to the disks. Terrifying. ^^


    I am not a RAID or mdadm expert, but I think this is how the commands that I used repaired the array. The array was stuck in a catch-22: it had a failed/removed drive, but was also unable to add more than the originally specified number of drives. I believe that means the superblock journal data was corrupt. So instead of trying to reassemble the existing array with the existing corrupt superblock journal data, I rebuilt the array with (very carefully selected) commands to tell the new superblock journal entry exactly what was in the "new" array.

    It consisted of the drives that were still "up to date", entered in the exact order that they were previously listed in the mdadm --detail /dev/md127 output, plus the "missing" drive in the correct position. The command I used was mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=4 missing /dev/sdb /dev/sdd /dev/sde ; however, you'll need to assess your array and edit the drives and the "missing" placeholder as they apply to your scenario.

    My understanding is that this command re-created the array superblock journal data, and used the existing filesystem information it found on the drives. And the IMPORTANT part of the command is the --assume-clean tag, which stops the degraded array from trying to sync when it is created (because if it tries to sync, it can potentially overwrite the existing data).

    This process completed successfully. Then I ran the fsck -v /dev/md0 command, and the Check Filesystem process completed successfully. And finally I ran mount -a and the volume mounted in degraded mode with 3 out of 4 drives. And the data was now all accessible.


    I copied all the data that wasn't in my previous backup to my other Windows Server NTFS array. Then I successfully rebuilt the array with the original 4 drives. For good measure I created a new array with some eBay drives and made a clone copy of the successfully rebuilt array to my new EXT4 array.


    All RAID volumes are now happy, and I have a backup array that is unplugged which contains a January 2024 date backup.

  • Here is the exact sequence of commands and outputs that put the array back "online" in degraded mode with 3 out of 4 drives. Please skim this thread to see if your failure is the same corrupt-superblock type of failure that I had before proceeding. And use these commands at your own risk. I am not an expert; this worked for me. I am not responsible for any data loss that may occur because of the improper (or proper) use of these commands.


    Stopped the array:

    Code
    mdadm --stop /dev/md127


    Then created the new array. It is my understanding that the --assume-clean flag is critical to keeping your data intact. And you'll need to type your array in the exact sequence that the mdadm --detail /dev/md127 output showed, including where the "missing" drive is supposed to be in the array. In my case it was:

    Code
    mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=4 missing /dev/sdb /dev/sdd /dev/sde

    Then we run the fsck -v /dev/md0 command. You could include -y after the verbose flag (-v) to automatically answer yes to all repairs, but I was paranoid at this point :D so I hit enter for each repair instead.


    Then we mount the filesystem with mount -a.


    Thanks very much to geaves, Krisbee and votdev! Hope this helps someone in the future!

  • LukeR1886

    Changed the title of the thread from “Another case of missing RAID5 and file syetem. Please help!” to “UPDATE: RESOLVED. Another case of missing RAID5 and file system. Please help!”.
