Posts by UltimateByte

    Hi,


    Just if anyone is curious, I thought I'd share my latest success story with OMV.


    Long story short: My NAS initially had a RAID 5 of 3x16TB Seagate Exos X16 drives (32TB usable, mdadm + ext4), and I've upgraded it to 4x16TB flawlessly. That said, it took about a week of somewhat stressful tasks.


    I'm using OMV to back up my datacenter dedicated servers, and for production work.
    I'm lucky enough to have a great internet connection, with 8Gbit/s download and 2Gbit/s upload, and I've equipped my home/office network with full 10Gbit/s, which allows for fast, comfortable and interesting uses.

    I love it!


    My most critical data lives on my PCs (desktop and laptop) and is synced to my off-site web server's Nextcloud, which itself is backed up to the NAS nightly.

    For the rest, I don't currently have an off-site or offline backup of this NAS, as that is logistically quite expensive and complex to set up.


    Over the years of using OMV (since 2017), I've had to replace 2 failing drives: one around 2018, and one around 2022 I think (just after upgrading to the Exos array, one drive died about 2 weeks after creation; when a drive is going to fail, it usually fails early).


    Recently, one drive started to fail softly, this time due to age (SMART error, invalid sector, warning sent by mail by OMV since I have SMART monitoring enabled).
    This time I wouldn't take risks, as the NAS now holds important data.
    So I bought a replacement disk, did the RAID rebuild, and sent the failing drive to RMA.
    The rebuild went flawlessly and took around 24h, just like the first initialization, during which the NAS was obviously slower than usual.


    When the replacement drive arrived, I was surprised to see it was a 16TB Exos X18. Fortunately, this is equal to or better than the X16, so it's fully compatible with my RAID array. Then I thought: why not extend the RAID?


    So I proceeded to rent a temporary server at Hetzner in order to back up all the important data.
    The backup took almost 3 days at close to 1Gbit/s.

    Here is the NAS activity during this time (the download at the beginning is unrelated).

    Attached: upload backup.png


    I should have done this before and should keep this server, but the plan is to add an OMV NAS at a family member's house with a daily backup from my NAS. Over time this will be far less expensive.


    Once the backup was done, I used OMV's Grow feature.

    To do so, I shut down the NAS (it's a classic PC case), then added the drive.
    After rebooting, within OMV, I went to Storage > Multiple Devices > Grow.
    Then I selected the device and confirmed. The weird thing is that it started the mdadm process right away (visible with 'cat /proc/mdstat'), but it still asked me to apply the changes, which I did.
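
    For reference, this is roughly what the Grow feature does under the hood with mdadm; a minimal sketch, assuming the array is /dev/md0 and the new disk is /dev/sde (adjust to your setup):

    Code
    # Add the new disk to the array, then grow it to 4 active devices (placeholder names):
    mdadm --add /dev/md0 /dev/sde
    mdadm --grow /dev/md0 --raid-devices=4
    # Watch the reshape progress:
    watch cat /proc/mdstat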


    The grow process is far slower, with intense reads/writes during the first two thirds of the process; the remaining third runs about as fast as a rebuild.
    Overall, it took around 2.5 days.


    Attached: raid extend 4x16TB.png


    Once done, I shut down the NAS and booted a GParted Live USB stick.


    Interestingly, GParted asked to check the device instead of offering to grow it as I had expected, so I did that.

    After the initial check, it ran resize2fs by itself.
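
    For those who prefer the command line, the same can be done without GParted; a minimal sketch, assuming the ext4 filesystem sits directly on /dev/md0 and is unmounted:

    Code
    # Forced filesystem check first (required before an offline resize):
    e2fsck -f /dev/md0
    # Grow ext4 to fill the enlarged device:
    resize2fs /dev/md0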

    Attached: IMG_20240404_112439384.jpg


    This whole GParted process took around 1 hour. It was the most stressful part, as I was really in the unknown at this point.


    Once done, I rebooted into OMV, and everything worked flawlessly, with 48TB (43.4TiB) now available.


    Hope this is useful to anyone wanting to do the same.

    If you attempt this, don't forget to back up first: if a single drive fails during the grow process, I would assume you'd lose the array.


    Anyway, I want to express my deep thanks to the OMV team, as your tools help me do my personal as well as professional work on a daily basis, free of any limitation, thanks to Debian (rather than a limited proprietary all-in-one solution).


    Thank you all <3

    Recently, I took care of my NAS with an upgrade to OMV7 (from 6) and a swap of the SFP+ PCIe card for a better one. I'm unsure whether the hardware or the software change caused this issue, but services took far longer than usual to start, triggering monitoring alerts.

    The issue is a Linux/systemd issue rather than an OpenMediaVault one. However, I didn't experience it on a manually installed Debian 12 system on which I also swapped the interface, so there might also be something on the OMV side that could be improved.


    After swapping the network interfaces, I noticed that upon reboot the monitoring complained about many services not starting properly at boot.


    After some research, I found some interesting commands to run right after a reboot, and once the boot has fully completed, over here: https://ubuntuforums.org/showthread.php?t=2490962

    PiHole also has issues related to this: https://forums.raspberrypi.com/viewtopic.php?t=362055


    Right at boot time, before the OMV panel is accessible, you can run: systemctl list-jobs

    As you can see there, systemd-networkd-wait-online.service is the only job running, and the other ones are waiting for it to complete.


    It would appear that, because of the interface change, this task waits for an interface that no longer exists and therefore takes a (relatively) tremendous amount of time to complete (around 2m30s) before letting other tasks go through. In the meantime, OMV's monitoring feature generates alerts, which are all sent at once as soon as the mailing system is up.


    After boot, you can also use the following commands to get some info about what takes time to start:

    systemd-analyze critical-chain


    Code
    root@nas:~# systemd-analyze critical-chain
    The time when unit became active or started is printed after the "@" character.
    The time the unit took to start is printed after the "+" character.
    
    graphical.target @2min 3.037s
    └─multi-user.target @2min 3.036s
      └─getty.target @2min 3.036s
        └─openmediavault-issue.service @2min 546ms +2.489s
          └─network-online.target @2min 506ms
            └─network.target @2.026s
              └─wpa_supplicant.service @2.009s +16ms
                └─dbus.service @1.994s +12ms
                  └─basic.target @1.993s
                    └─sockets.target @1.993s
                      └─systemd-journald@netdata.socket @1.993s
                        └─sysinit.target @1.992s
                          └─systemd-resolved.service @1.934s +57ms
                            └─systemd-tmpfiles-setup.service @1.926s +6ms
                              └─local-fs.target @1.922s
                                └─run-credentials-systemd\x2dtmpfiles\x2dsetup.service.mount @1.927s
                                  └─local-fs-pre.target @384ms
                                    └─lvm2-monitor.service @216ms +134ms
                                      └─systemd-journald.socket @212ms
                                        └─-.mount @168ms
                                          └─-.slice @168m

    You have to read this output from bottom to top to get the sequential order. In this output, you can see that 'network-online.target' takes a lot of time while everything else starts almost instantly.


    Another useful command (I no longer have the "before" output for this one) is:

    systemd-analyze blame


    Ultimately, you can confirm that this service is the issue by restarting it manually. If your command prompt stalls for minutes, that is likely your problem. The command to restart the service is the following:

    Code
    systemctl restart systemd-networkd-wait-online.service



    So we've confirmed the problem, what's the fix?


    The solution is to tell systemd-networkd-wait-online to wait for any interface to be up, rather than for whichever interface it remembers.

    For that, edit the service start options with:

    Code
    systemctl edit systemd-networkd-wait-online.service

    We have to clear the ExecStart variable first, otherwise it won't work, which is why you see 'ExecStart=' twice in the fix below (the empty assignment resets the list before the new command is set). That is not a mistake: I tried without it, and it doesn't work. Systemd mysteries.


    Add this:

    Code
    [Service]
    ExecStart=
    ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --any

    Make sure you put it in the right place, between the lines at the top, otherwise it will be discarded and won't work. (I tried that too; I didn't read the conf file properly at first.)


    So full beginning of file looks like this:

    Code
    ### Editing /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
    ### Anything between here and the comment below will become the new contents of the file
    
    [Service]
    ExecStart=
    ExecStart=/usr/lib/systemd/systemd-networkd-wait-online --any
    
    ### Lines below this comment will be discarded
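
    If you prefer, the same drop-in can be created non-interactively; a minimal sketch, assuming the stock paths of Debian 12 / OMV 7:

    Code
    # Create the override directory and drop-in file by hand, then reload systemd:
    mkdir -p /etc/systemd/system/systemd-networkd-wait-online.service.d
    printf '[Service]\nExecStart=\nExecStart=/usr/lib/systemd/systemd-networkd-wait-online --any\n' \
      > /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
    systemctl daemon-reload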

    Then you can try to restart the service.


    Code
    systemctl restart systemd-networkd-wait-online.service


    If it takes less than a second, you're probably good to go. If not, then you need to search further.


    Some say they have to specify the interface name for this to work. This is achieved with the following syntax:


    Code
    ExecStart=/lib/systemd/systemd-networkd-wait-online --interface=eth0


    I hope this helps fellow OpenMediaVaulters who maybe had this issue but didn't bother fixing it, and Googlers.

    Also, maybe our awesome OMV devs, with their systemd knowledge, have simple ideas to mitigate the issue natively.


    All the best, and never forget: RAID is not a backup.

    Here's to hoping it all works out.


    For future ref; Operating System backup is easy and, as you're finding, a real good idea.

    Thank you!

    Well, in that case, I assumed re-installing the whole system would take about the same amount of time as restoring a backup, so I've got no backup of the OMV drive.


    I'm very familiar with R1Soft (deploying and using it professionally) but it's paid and not cheap!

    If you have a free (or cheap) and good way to back up the whole distro, I'd gladly take it!

    Hello,


    As far as I know, RAID 5 checks for data corruption, so yes, this would help prevent it by noticing the faulty drive as early as possible.


    In your present case, I would check the SMART data for each disk and remove the faulty one from the RAID (assuming it's RAID 1 or similar?). That might resolve the integrity issue, assuming one of the drives failed while the other one works fine and wrote proper data, and that the array didn't sync from the wrong disk.
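
    For illustration, removing a faulty member from an mdadm array typically looks like this; a sketch with placeholder names (/dev/md0 for the array, /dev/sdb for the failing disk, /dev/sdc for its replacement):

    Code
    # Mark the failing member as faulty and remove it from the array:
    mdadm --manage /dev/md0 --fail /dev/sdb
    mdadm --manage /dev/md0 --remove /dev/sdb
    # Once the replacement disk is installed, add it and let the array rebuild:
    mdadm --manage /dev/md0 --add /dev/sdc
    cat /proc/mdstat    # monitor the rebuild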


    Best of luck

    Hello,


    Thanks a lot for your insight. :thumbup:


    It's weird that OMV is a dependency for NTP and not the other way around! I wouldn't have expected that.




    Anyways, running apt-get purge openmediavault-omvextrasorg showed a few warnings:


    OMV Login shows:

    Code
    Error #0:
    OMV\Rpc\Exception: RPC service 'VirtualBox' not found. in /usr/share/php/openmediavault/rpc/rpc.inc:99
    Stack trace:
    #0 /usr/sbin/omv-engined(537): OMV\Rpc\Rpc::call('VirtualBox', 'getMachines', Array, Array, 1)
    #1 {main}



    The OMV-Extras page still shows:

    Code
    RPC service 'OmvExtras' not found.


    And of course, the package list is empty.


    I've cleared my browser cache and even tried private browsing to make sure it's not cache-related: same issue.


    This error shows up with basically any command involving apt:

    Code
    Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7f5691bf1510>
    Traceback (most recent call last):
    File "/usr/lib/python3.5/weakref.py", line 117, in remove
    TypeError: 'NoneType' object is not callable
    Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7f5691bf1510>
    Traceback (most recent call last):
    File "/usr/lib/python3.5/weakref.py", line 117, in remove
    TypeError: 'NoneType' object is not callable

    Found this: RE: Upgrade Debian 9 and 4.x

    I tried the fix, but it only changed the error to a different one, so I reverted the change.


    If anyone is willing to provide any more insight to fix this, that would be lovely! :love:


    Maybe this would be a great time to upgrade to OMV 5...? :S It depends on whether mdadm RAID is still supported and whether my VirtualBox machines would still work (I'd be willing to switch to KVM if available, though; it seems to have better performance, after using it for work on RHEL). :/

    Hello,


    Yesterday, I noticed my NAS wasn't at the correct time so I thought it would be a great idea to install chrony.


    The server was at the correct time, sure... but the OMV web UI wasn't accessible at all.

    /var/www/openmediavault/ was basically empty.


    After searching through Google and the forum, I found I could do this:

    Code
    apt-get install --reinstall openmediavault   # would not work because of dependency issues
    apt remove chrony
    apt-get install --reinstall openmediavault   # worked
    omv-engined -f -d                            # stalled


    Now I can display the OMV login screen, but it seems to be missing the files from the phpvirtualbox extra:

    Code
    # cat /var/log/nginx/openmediavault-webgui_error.log
    2020/09/25 08:20:52 [error] 15103#15103: *63 open() "/var/www/openmediavault/virtualbox/favicon.svg" failed (2: No such file or directory), client: 192.168.1.254, server: openmediavault-webgui, request: "GET //virtualbox/favicon.svg HTTP/1.1", host: "192.168.1.10"


    I'm kinda worried since I've got active VMs that I need up and running. They are still running at the moment, but chances are they won't start properly if I reboot the host.


    On login attempt, OMV showed:

    Failed to connect to socket: No such file or directory

    Error #0:
    OMV\Rpc\Exception: Failed to connect to socket: No such file or directory in /usr/share/php/openmediavault/rpc/rpc.inc:140
    Stack trace:
    #0 /var/www/openmediavault/rpc/session.inc(56): OMV\Rpc\Rpc::call('UserMgmt', 'authUser', Array, Array, 2, true)
    #1 [internal function]: OMVRpcServiceSession->login(Array, Array)
    #2 /usr/share/php/openmediavault/rpc/serviceabstract.inc(123): call_user_func_array(Array, Array)
    #3 /usr/share/php/openmediavault/rpc/rpc.inc(86): OMV\Rpc\ServiceAbstract->callMethod('login', Array, Array)
    #4 /usr/share/php/openmediavault/rpc/proxy/json.inc(97): OMV\Rpc\Rpc::call('Session', 'login', Array, Array, 3)
    #5 /var/www/openmediavault/rpc.php(45): OMV\Rpc\Proxy\Json->handle()

    I've run:

    omv-engined


    After that, login was possible.


    Of course, all my extras are missing now.


    I've then installed OMV extras with:

    wget -O - https://github.com/OpenMediaVa…ckages/raw/master/install | bash


    Now the OMV-Extras section of OMV shows:

    RPC service 'OmvExtras' not found.

    Error #0:
    OMV\Rpc\Exception: RPC service 'OmvExtras' not found. in /usr/share/php/openmediavault/rpc/rpc.inc:99
    Stack trace:
    #0 /usr/sbin/omv-engined(537): OMV\Rpc\Rpc::call('OmvExtras', 'getKernelList', Array, Array, 1)
    #1 {main}


    Now that's where I'm stuck...


    Any idea or help for me? Thank you! :*

    So, I've got many pending sectors emails today, which means more data!


    6 new for the disk CT500MX500SSD1_1911E1F0E132 which makes a total of 9
    And 3 new for the disk CT500MX500SSD1_1911E1F0F25B which makes a total of 5.


    First one has 45 Unused_Reserve_NAND_Blk
    Second one has 46 Unused_Reserve_NAND_Blk


    So those values are unchanged and don't seem to depend on the current pending sector activity, or at least it's not correlated at a 1:1 ratio, which is good.


    That said, these emails are still freaking me out... I never like seeing this kind of error.
    The last time it happened, I had a 4TB out-of-warranty drive fail and it took me 1 week to download my data back from Hubic... Not that my connection was slow (I've got gigabit at home), but their servers suck (and they've since discontinued Hubic...). A solution might be to apply a filter in Thunderbird to mark them as "read". Anyhow, I'll back up the data on it more frequently.

    Thank you :)


    Yes, it is likely a coincidence (confirmation bias spotted). I've checked again and in fact there are only 4 mails in total. (The 5th was another mail containing the word "Pending", my bad.)
    That said, it's not impossible that an alert was sent while I didn't yet have email set up on the NAS. (Syslogs aren't kept from before June 2nd, so we can't know.)


    I will try to report back once I have more data which will tell us more about the subject than my ravings.


    The advantage is we have my previous values for the record.


    Attached: screenshot of error detection times and dates.

    The order of the disks for the errors is: A B A B.
    I see no obvious time pattern for now, but from the little data available, the errors seem to be getting rarer and rarer, which would be great if it keeps going like that :D

    I had a pretty great tech support chat with Crucial, with some interesting conclusions that I'll share.


    Since I was running Linux and no Crucial tool exists for the OS, I was informed that Micron owns Crucial, so most things are valid for both Micron and Crucial.
    They asked me to install a Micron GUI diagnostic tool, which was a bit complicated since I only use the CLI on my OMV box. Then they pointed out there is also a CLI version.
    That's the tool: https://www.micron.com/product…torage-executive-software
    But then I pointed out "I'm not alone in having this, checking my specific SMART is not relevant", and they more or less accepted that my smartctl output and logs were relevant enough to answer my worries.


    The tech's conclusion is the following:


    It is perfectly fine, normal and expected to have pending sectors sometimes on an SSD, due to the nature of NAND memory.



    Therefore, your SMART value Current_Pending_Sector might reach non-zero values from time to time. I think what would be worrying is if it didn't go back to 0 afterwards; that would mean there is no room left to move data to spare blocks.


    What you should check is the value: Unused_Reserve_NAND_Blk
    You don't want this close to 0.
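
    If you want to check these attributes yourself, something like this should work; a sketch, assuming /dev/sda is one of the SSDs and the attribute names match my MX500s:

    Code
    # Print the SMART attribute table and pick out the two relevant counters:
    smartctl -A /dev/sda | grep -E 'Current_Pending_Sector|Unused_Reserve_NAND_Blk'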


    For reference, I have a value of 45 on one drive and 46 on the other, for drives that are, I think, a bit less than 1 month old. Upon checking, I have other emails for this warning on both drives: 2 for one drive and 3 for the other, so the default value for my 500GB MX500 is Unused_Reserve_NAND_Blk: 48. At that rate (hopefully it slows down), the drives might be dead in no time... And I'm not even writing much to them (just a VM with two game servers on it).


    They also provided me with a doc explaining how the SMART attributes are calculated. The tech was unsure whether it was shareable, but I found it publicly on Micron's website, so... here's the link: https://www.micron.com/-/media…_ssd_smart_attributes.pdf


    In conclusion: nothing to worry about, at least until the values grow. And Crucial support rocks :)

    Hi,


    One of my RAID 1 SSDs (which contains a VM image) triggered an alert from the smartd daemon tonight at 1:27 AM:


    Quote from smartd daemon

    This message was generated by the smartd daemon running on:
      host name: nas
      DNS domain: hidden.tld
    The following warning/error was logged by the smartd daemon:
    Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], 1 Currently unreadable (pending) sectors
    Device info:
    CT500MX500SSD1, S/N:1911E1F0E132, WWN:5-00a075-1e1f0e132, FW:M3CR023, 500 GB
    For details see host's SYSLOG.
    You can also use the smartctl utility for further investigation.
    Another message will be sent in 24 hours if the problem persists.


    Syslog shows:

    Code
    root@nas:~# cat /var/log/syslog | grep smart
    Jun  6 01:27:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], 1 Currently unreadable (pending) sectors
    Jun  6 01:27:31 nas smartd[810]: Sending warning via /usr/share/smartmontools/smartd-runner to admin@hidden.tld ...
    Jun  6 01:27:31 nas smartd[810]: Warning via /usr/share/smartmontools/smartd-runner to admin@terageek.org: successful
    Jun  6 01:57:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 em


    It shows as resolved at 1:57 AM. (I assume there is a check every 30 minutes.)
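
    For what it's worth, that 30-minute gap matches smartd's default polling interval (1800 seconds) rather than a cron job; a sketch of where this can typically be tuned on a Debian-based system like OMV:

    Code
    # /etc/default/smartmontools -- extra options passed to the smartd daemon
    # -i / --interval sets the poll interval in seconds (1800 = 30 minutes, the default)
    smartd_opts="--interval=1800"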


    smartctl output shows:



    I'm quite used to reading SMART data and I can't see any error here... which is comforting. But at the same time, there was still an error, and I don't like that.


    So my questions are:
    Do you have any idea what this error would indicate on a SATA SSD?
    Is there anything to worry about?
    How can there be such an error yet nothing in SMART?


    Any enlightenment appreciated.
    Best regards

    Exactly. Since it's impossible due to the whole concept of data integrity being unknown to primitive RAID-1 variants like mdraid's. If you don't believe me maybe you believe the authors: https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives -- the difference between parity raid and a primitive mirror is well explained there.

    Thanks for your answer and for pointing out this useful doc. I certainly have a lot to learn about RAID; I'm just now really digging into it.


    I've just learned on that wiki that you can convert RAID 1 to RAID 5 https://raid.wiki.kernel.org/i…ror_raid_to_a_parity_raid
    That's kind of revolutionary info to me. :o))


    I assume the interesting part for our discussion is the following:

    Quote from raid wiki

    If the array is a mirror, as it can't calculate the correct data, it will take the data from the first (available) drive and write it back to the dodgy drive.

    So as I understand it, if it finds a dodgy drive, it's either because it couldn't write to it in the first place, or because there are 3 or more drives in the mirror and it found one holding different data.
    The real question is: if there is no way to tell which drive is sane and which is dodgy (e.g. a 2-drive RAID 1), then what does it do?
    Does it report "RAID status = fail", so that human interaction is required and you check the SMART data and quickly find out which drive to replace; or does it blindly consider the first disk of the array to be accurate and possibly write garbage data to the sane disk?
    Since I've never seen it happen, I assume with some confidence that it has almost no chance of happening: that's the point that would need to be invalidated for me to change my mind.


    It might be my near-Asperger's, but I find this wiki lacks a bit of precision, which leaves room for doubts like that.


    They also say:

    Quote from raid wiki

    Drives are designed to handle this - if necessary the disk sector will be re-allocated and moved. SMART should be used to track how many sectors have been moved, as this can be a sign of a failing disk that will need replacing.

    To reformulate: upon rewriting the data, SMART will help you know whether there are any errors. In fact, in practice there should have been a SMART error since the first (unrecoverable) error. Unless perhaps the sector was writable but then didn't read back as expected, in which case mdadm detects an inconsistency; if so, there should also be other SMART errors on that drive, because a drive usually starts failing globally, not on a single sector.
    That has been my whole point since the beginning:
    SMART data is usually accurate and alarming quickly enough if you don't buy poorly made drives. Therefore, if you monitor SMART errors, you'll find out that a drive is failing before a RAID check even runs, and therefore garbage won't be written to your whole array.



    Using single redundancy with those drive sizes and such commodity hardware is a bit risky and once you add more redundancy (RAID6) rebuilds take even longer. And as soon as the RAID will be used while rebuilding the times needed explode: https://blog.shi.com/hardware/best-raid-configuration-no-raid-configuration

    Well, they don't detail their protocol well enough, but seriously, 1.5 hours to rebuild at idle versus 134 hours under load? That's almost 100 times longer, which is unrealistic on properly sized servers...
    That kind of delay would mean your I/O was 100% saturated before the rebuild and during the whole rebuild, which basically means there was a problem on your server in the first place, and you should resolve that first.
    I do server management, so I would detect such a load before it's too late and advise anyone to change hardware (go for SSDs, add RAM to favor caching and prevent swapping, spread the load across multiple servers, add a drive array to spread I/O, or whatever) before reaching such a critical state, and I hope any sysadmin would do the same on their own.


    Also, I don't see how they came up with a 77h rebuild time for an array of 10TB drives... If I compare with my numbers (8 hours for 4TB drives), proportionally to the disk size it should take around 20h. And since a 10TB drive should be faster than a 4TB drive, it should be even quicker than that.


    In the end, the author's main point is "it takes too long to rebuild": I can say with pretty high confidence that it's not a problem unless you were over-using your I/O in the first place.


    Now their solutions:
    - Application redundancy: Nope. Not nearly as convenient as RAID, and it costs twice the server price... I don't know any business that would go for that solution at the moment. A daily (or more frequent) backup + RAID 1 is the solution of choice right now.
    - Erasure coding/software-defined storage: The requirements are likely met by ZFS/BTRFS RAID, so that's valid.
    - Solid-state storage: Well, that was my point from the beginning: if your I/O is saturated, there is one obvious solution... get storage that multiplies I/O speeds by 4 to 50.


    There is a ZFS plugin and creating a zmirror is really easy. And by just using a storage implementation that has been developed in this century (ZFS/btrfs) and not the last (mdraid/RAID1) you get the following for free:

    • bitrot protection
    • flexible dataset/subvolume handling overcoming partition limitations
    • transparent filesystem compression
    • snapshots (freezing a filesystem in a consistent state)
    • ability to efficiently send snapshots to another disk/host (backup functionality)
    • immune to power failures due to CoW (Copy on Write)

    (though the latter requires disks with correct write barrier semantics -- if not you need an UPS -- and large arrays require server grade hardware in general)

    Found this: https://docs.oracle.com/cd/E19…819-5461/gaynr/index.html


    There is also the ability to create cache devices, which I had heard about (probably from the LinusTechTips petabyte project or something like that); that's cool. Hopefully you can add a cache later (I assume you can, otherwise it would be a bummer).


    ZFS seems very interesting for sure; I am 100% convinced. I will dig into it at work and run experiments. If it proves to be stable, understandable and monitorable, and everyone is convinced, then we might use it for new servers.


    As for my home use of ZFS, I would have two questions before going for it:
    Could I create a pool with 1 drive, then add a second one for mirroring later, without needing to move all the data?
    Also, is it possible to convert a ZFS RAID 1 to RAID 5 (or equivalent), as with mdadm?
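
    For the first question, from what I've read ZFS can turn a single-disk pool into a mirror later by attaching a second disk to the existing vdev; a sketch with placeholder pool/device names, which I haven't tested myself:

    Code
    # Create a single-disk pool now:
    zpool create tank /dev/sdb
    # Later, attach a second disk to the same vdev to turn it into a mirror:
    zpool attach tank /dev/sdb /dev/sdc
    zpool status tank    # shows the resilver progress

    As far as I know, the answer to the second one is no: converting a mirror to a raidz in place isn't supported, unlike mdadm's level migration.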


    Thank you

    You're confusing RAID5 and RAID1 here. What I mentioned only applied to RAID1 (that's the RAID mode I call 'almost useless').
    With RAID5 there's parity information and as such corrupted data could be reconstructed (but you need to fear the 'parity RAID write hole' in such situations and rebuilds take ages that's why RAID5 with modern TB drives is insane as well).


    With mdraid's RAID1 there's nothing. And that's the real problem with wasting disks for an mdraid1: no data integrity protection and not even checking. You simply don't know whether a failure somewhere rendered your data corrupt. And since better alternatives like a zmirror or btrfs RAID1 exist, IMO we should really stop to waste disks for almost nothing.

    So in RAID 1, if there is an mdadm check and a drive wrote garbage, you're saying it won't be detected?
    My understanding is that mdadm would show an error, and you could then easily identify the failed drive by reading the SMART attributes of both drives and finding the one with unrecoverable errors or failed sectors.
    Therefore, if I'm right, there is still a way to recover from a failing drive in an mdadm RAID 1 without losing anything or requiring a backup restore.
    OK, the detection is just not at the filesystem level, and it's a bit delayed since you need to wait for the next check (usually every week to month), but it still works.
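
    For reference, the periodic check I'm talking about can also be triggered and inspected by hand; a sketch, assuming the array is /dev/md0 (Debian normally schedules this via mdadm's checkarray cron job):

    Code
    # Start a consistency check of the array:
    echo check > /sys/block/md0/md/sync_action
    # Once it finishes, a non-zero value here means mismatched copies were found:
    cat /sys/block/md0/md/mismatch_cnt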


    About rebuild time, for reference, my 3x4TB RAID 5 (two ST4000DM000-1F21 and one new ST4000VN008-2DRA) took around 8 hours to rebuild (the same duration as the initialization, since I wasn't really using the array while rebuilding).
    That's about the time required to write 4TB at an average of 140MB/s.
    For me, that's fast enough, but it's true it could have been faster if only the actually used space were written back, as seems to be the case with other RAID implementations.
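
    (Rough math: 4TB is about 4,000,000MB, and 4,000,000MB / 140MB/s ≈ 28,600s ≈ 8 hours.)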


    I don't think so. On x86 I would most probably go with a zmirror while on ARM due to different kernel support situation I would use a btrfs RAID1 and then create snapshots regularly (snapshot as in 'almost backup'). The respective tools aren't integrated in OMV at all: https://www.znapzend.org (ZFS) or https://digint.ch/btrbk/ (btrfs)

    I'm on x86_64 here, so I would rather go with ZFS/zmirror. However, since it's not implemented in OMV and I want to keep things simple, I'm not sure I'm ready for that just yet. Probably a next step in my geek life, but for now I don't have the time and energy to really dig into it. Thanks for pointing it out, though, and for offering new perspectives; it will surely be useful at some point!

    @tkaiser Unfortunately the "RAID-1 vs. data integrity" link doesn't work for me (access denied).


    I get what you dislike about RAID 1, but that still doesn't make it worthless in most cases. I've never seen the described scenarios in practice. Unless the user is totally careless for months, has no monitoring, uses a cheap-ass SATA controller and/or cables, and never looks at their SMART data, they will notice a SATA/bus/drive error (or whatever else could happen) before it's too late, simply by seeing a SMART error or something similar.
    IMO, a better approach than "mdraid1 is useless" would be "hey, with ZFS you get checksums/data integrity checks, you should have a look, here is some documentation".


    While looking for info about OMV 5, I've seen that OMV 6 might only support ZFS and the like, breaking compatibility with existing formats?
    While it's probably a good idea to promote newer (and likely better) systems, it would benefit from very good documentation and some cohabitation with existing filesystems and RAID while people migrate their data. I don't see ext4 being abandoned that fast.


    "(rm -rf is a good illustration why RAID is 100% useless when we're talking about data safety)."
    Well, you can technically recover files from that. I've already done such recovery for a virtual machine that had no backups after a human mistake. The VM had an ext4 image on a ext4 host partition in a RAID 1 :D
    You can always blame the user for not having a backup, but shit happens to everyone, and I'm not sure if I would have been able to find a tool to recover deleted files from a directory with another file system. That time, it saved the day for the customer to be in ext4 and that a tool existed.


    "mdraid users here in the forum complain that all their data has gone every other day..."
    > Well, I feel you, in LinuxGSM users complain that their linux user cannot write files because they created directories as root and crap like that all the time, or that they cannot download the files but don't have curl or wget installed... That's why good documentation is important, that way the user can find and read it, and if they didn't then you can just send the link instead of getting mad every time (speaking from experience) :D
    For my part, I am very careful with my RAID, also I have coworkers I can ask for help in the event of an issue, so I'm not too worried about this case scenario. I'm also planning a full deported backup server for important data redundancy and better automated backup in the near future.


    But you forgot a huge point in favor of RAID 1: if someone messes up one disk of a RAID 1, they still have a second chance not to mess up the second one! :D


    I think I'm starting to get the advantages of ZFS/BTRFS, but you're right, I have no idea how they work in the real world, and they will require some new knowledge and tools. Any documentation or relevant links would be welcome.
    Does OMV 4 (or the 5 beta) offer any kind of tools to help manage these? I mean, you can create these filesystems, but is there anything to help with redundancy on them, like the great mdadm integration that's already provided?


    Like you said, the level of trust is still not very high, especially for BTRFS with all the bugs and problems that have existed. Even if most are solved (not for RAID 5, it seems), the scars remain and people are not ready to take it seriously.
    And ultimately, people (me included) prefer working with tools they're more at ease with, with proper documentation when needed.


    About RAID 10:
    Some datacenter guy once told me there was some kind of checksumming possible with RAID 10; maybe he was wrong, or maybe he was talking about a different RAID level or filesystem, I can't recall.
    Since you know RAID better than I do, maybe you can tell what he might have meant? RAID 5/6 perhaps, since they have parity; are they more suitable for data integrity?



    In the end, back to the actual goal of this topic: you know my case well now. Do you have a better solution for my 8TB drive than the one I came up with in this topic, to get some kind of local automated redundancy, without performance loss, that I'm not forced to set up right now, and that won't require emptying the drive first?


    Thanks


    Edit:
    Sorry for the loooong posts; that's the price of interesting, in-depth conversations, I guess.

    EDIT: Crap, you answered while I was writing!
    Reading your answer now! :D


    So, I've read your post.


    "RAID is only about availability (who needs this at home?)"
    Well, I do.
    One single good reason (but I have way more since my NAS is heavily multi-purpose) is that I'm running daily backup of my web server on my OMV NAS. I don't want this backup interrupted and I also don't want to lose these in the event of a single disk failure. Also it would be unreasonable to spend more money on other drives to backup this backup.


    "It allows to continue operation without interruption if a single disk COMPLETELY fails. It doesn't deal good or at all with other failures that happen way more often."
    It also allows to continue operation if a drive just starts failing and writes garbage: you just eject it, replace it, rebuild and voilà.
    Speaking of that, I shall mention I've tried the operation with my RAID 5 before putting it in production, and OMV handled that perfectly.


    "with cabling/connector problems that are undetecte there's a chance that different data is written to the 2 disks"
    Well, who cares? If that happened to you, it's the same as a failed drive, you remove it from the RAID and rebuild.


    "Since mdraid only reads from one disk (not checking for mismatches)"
    Well, it doesn't do on the fly, but you can still check the RAID, which will find mismatches. By default, there's a monthly cron for this. If not detected upon writing or with a SMART error, then the cron will help. Even in the worst case scenario if you waited one month to discover that, chances for the other drive to write garbage as well are insanely low.



    I'm not sure if you've been traumatized by a bad experience with RAID 1 or something, but it seems to me you're underrating this RAID level more than I would consider reasonable. I don't see any valid argument against RAID 1 here. A valid argument would be an alternative offering the same capabilities, which I haven't come across yet.


    I've also read the link about RAID 10, though the article is really poorly structured and hard to follow.
    What I get from it is: if you use 4 partitions, you can't be sure they're spread correctly for redundancy.
    With mechanical drives, it degrades performance, so that's not viable.
    If you run just 2 plain drives in RAID 10, as you would for a RAID 1, then I only have questions.
    It's written everywhere that the minimum requirement for RAID 10 is 4 drives, so can you even create it? And then you only have mirroring and no striping, so what's the difference compared to RAID 1? Is it considered a degraded RAID 10, or normally operating?
    Any enlightenment on this is appreciated.
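
    For what it's worth, mdadm itself does seem to accept a 2-drive RAID 10; a sketch with placeholder device names, which I haven't tried myself:

    Code
    # Linux md allows RAID10 with only two members, e.g. using the 'far 2' layout:
    mdadm --create /dev/md0 --level=10 --raid-devices=2 --layout=f2 /dev/sda1 /dev/sdb1
    cat /proc/mdstat    # should show an active (not degraded) raid10 array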

    Since many users (including myself, indirectly) have had success with RAID 1, it is useful, and therefore it cannot be "useless".
    I cannot agree with that wording, since it goes against logic.
    If you have relevant and common examples where RAID 1 does not work or is useless (and I doubt they're that common, since I've never experienced them), please provide them so that I can perhaps understand your point of view a little better.


    Thinking that there are better options is totally acceptable, but that doesn't make the other options immediately useless.
    However, if these options are so much better than traditional mdadm RAID, why don't we see them in OMV's RAID section? Or at least a warning?


    That said, I'm 100% willing to learn more about these options, and if you have a better proposal that suits my scenario (1 disk with non-critical data, to which I'd later add some kind of automated, low-performance-cost redundancy to protect against a disk failure), then you're 100% welcome, mate.


    Edit:
    PS1: I hadn't seen the link; I'm reading it at the moment.
    PS2: I'm not doing business work at home, but it's still work and I don't want downtime, which is why I'm interested in RAID. It's also a way to have fresher data in the event of a disk failure: you usually don't need to fall back to a backup.
    PS3: I mentioned RAID 10 because it's known to be safer.