RAID, SMART timeouts (SCT ERC) and drives. How to set them correctly in OMV?

  • This is not RAID specific, but they usually go together

    I came across the articles below about RAID, Linux kernel timeouts, and setting drive timeouts with SMART:

    Udev rules and helper scripts for setting safe disk scterc and drive controller timeout values for mdraid arrays

    Linux Software RAID and drive timeouts…-raid-and-drive-timeouts/

    And this one that looks older..
    Many (long) HDD default timeouts cause data loss or corruption (silent controller resets)

    I am especially concerned that this will fail just when you are trying to recover the system from an error, for example when resynchronizing after replacing a disk in a RAID.

    Is there anything similar in OMV?
    Is there a way to do this, or something equivalent?
    I have to admit I don't fully understand how to do what they propose, and it seems to me you also have to adapt the timeouts of drives that are not in RAID1-6 arrays, so that the kernel timeout is always higher than the drive's SCT ERC timeout.

    Any experience with this or any suggestion of how to do it ?

    The general idea, as I would do it:

    - The kernel has a default timeout of 30 seconds for each disk individually.

    - The SMART (SCT ERC) timeouts must be set again on each restart, because they do not persist.

    - On disks where all the information is redundant, set a low SCT ERC timeout: the data can be rebuilt from other sources. I'm not sure what value to use.

    - If part of the information is not redundant, set a high SCT ERC timeout: the data cannot be rebuilt from other sources. I'm not sure what value to use.

    - If I don't know what value to set, or the drive rejects it, change the kernel timeout to 180 seconds for that disk instead.

    It would also be very interesting to use a higher value while an array is degraded, because its data is no longer redundant.
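    The steps above could be sketched as a small boot script. To be clear, this is just my own rough attempt, not an OMV feature: `pick_kernel_timeout` is a name I made up, the 7 s / 30 s / 180 s values come from the linked articles, and the `--apply` part writes to real hardware, so treat it as illustration only.

```shell
#!/bin/sh
# Rough sketch of the policy above (my own naming; not an OMV feature).
# pick_kernel_timeout: given "yes" if the drive accepted an SCT ERC
# setting, print the kernel timeout (in seconds) to write to sysfs.
pick_kernel_timeout() {
  if [ "$1" = "yes" ]; then
    echo 30    # drive gives up after ~7 s, well inside the 30 s default
  else
    echo 180   # drive may retry internally for minutes; kernel must outwait it
  fi
}

# Applying it at boot (e.g. from /etc/rc.local) might look like this;
# guarded behind --apply because it writes to real hardware:
if [ "${1:-}" = "--apply" ]; then
  for disk in /sys/block/sd?; do
    dev=/dev/${disk##*/}
    if smartctl -q errorsonly -l scterc,70,70 "$dev"; then ok=yes; else ok=no; fi
    pick_kernel_timeout "$ok" > "$disk/device/timeout"
  done
fi
```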

    Thanks in advance.

  • Hellooooo :):) !!
    Is somebody there ?

    I'm learning a lot from this forum.. RAID, redundancy, availability, backups. ..

    .. RAID is not a backup (I already knew this ) , RAID is not about "redundancy", it's about "availability" ..

    It's amazing that there are many recent threads about the safety of our data .. how to go from RAID5 to RAID6?, btrfs, zfs, mergerfs / overlayfs / unionfs, using SnapRAID, etc.

    .. and yet nobody is interested in how to configure disk timeouts so we are in a good position when one fails, because although nobody wants it, our disks are physical devices and will end up failing. :(

    I think it's a topic of interest to everyone. Has anyone thought about this? ... I can't find anything about it in the forum, or it's something so well known that it's no longer even mentioned.

    Am I missing something?

    Or.. am I not asking the right question? :):)

  • geaves, thank you so much for answering.
    Manufacturers are dividing the disks between the desktop and the RAID, somewhat arbitrarily in my opinion.

    But I'm sorry, I don't agree with what you say. This is not about using desktop disks vs RAID disks, but about RAID disks having to behave like desktop disks when errors appear.

    Of course you have to make backups (better if they are kept off-site; it is the only way to recover from physical theft of the equipment, loss by fire, or any other accident or natural disaster). But leaving that aside, having to restore a full backup of, say, 10 TB or even 100 TB (in addition to the time lost understanding what is going on and trying to mount the RAID manually) is not the same as simply replacing a disk, or not having to do anything at all because the machine recovers by itself.

    And that's what this is all about: setting the SMART SCT ERC timeout correctly so that this functionality works in our favour.

    The point to emphasize here is that the disks of a RAID, when they are working normally, should have a low timeout, because the data is redundant. But if one of the disks fails (a sector cannot be read, or the RAID is already degraded), the drive must make more effort (and therefore needs more time) to read the data; in that situation RAID disks must behave like desktop disks, that is, have larger timeouts.

    It is precisely for the previous paragraph that I am asking for help in this forum. Is there any way to do this in OMV?

    I am surprised that something that seems obvious to me has not been included in mainline Linux tooling, for example in smartmontools.

    Again thank you very much geaves for replying and to everyone else for reading this, I hope someone can shed some light on the matter.

  • Manufacturers are dividing the disks between the desktop and the RAID, somewhat arbitrarily in my opinion.

    That's your opinion, and one that HD manufacturers would disagree with; following most of your links to their further links, this is about using desktop drives in a raid.
    Further to that, the timeouts set in the drive's firmware are different from those set when using a hardware raid controller.
    I have a link that also confirms the above, and using that link I have output from two of my drives;

    this is a desktop drive

    smartctl -l scterc /dev/sda
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-0.bpo.3-amd64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke,
    SCT Error Recovery Control:
    Read: Disabled
    Write: Disabled

    this is a nas drive

    smartctl -l scterc /dev/sdc
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-0.bpo.3-amd64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke,
    SCT Error Recovery Control:
    Read: 70 (7.0 seconds)
    Write: 70 (7.0 seconds)
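    For reference (my addition, not from the linked article): smartctl expects the scterc value in units of 100 ms, so 7 seconds is written as 70. A "desktop" drive like the first one above can sometimes (assuming its firmware supports SCT ERC at all, which many desktop drives don't) have the feature enabled for the current power cycle. The tiny `erc_arg` helper below is my own, just to make the conversion explicit:

```shell
#!/bin/sh
# smartctl takes scterc values in tenths of a second: 7 s -> 70.
# erc_arg is a tiny helper of my own to make that conversion explicit.
erc_arg() { echo $(( $1 * 10 )); }

# Assuming /dev/sda supports SCT ERC, this would enable 7 s read/write
# error recovery until the next power cycle (the setting is volatile):
#   smartctl -l scterc,$(erc_arg 7),$(erc_arg 7) /dev/sda
```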

    Does this mean that I should be concerned about my Raid 5?

    To therefore answer your question, "Is there any way to do this in OMV?": no, because OMV is based on Debian, and your links refer to Debian from 3 years ago; if it hasn't been implemented upstream, it's not possible downstream.

    Raid is not a backup! Would you go skydiving without a parachute?

  • Hi Rizos,

    If I understood right, rather than being concerned about using consumer drives in a RAID array (which is what almost all the info around this covers), you're concerned about using a NAS drive as a standalone storage device? Because SCT ERC on NAS drives seems to default to 7 s, rather than spending a lengthy amount of time trying to correct an error itself like a consumer drive would, the NAS drive will very quickly give up and report an error, whereas if it had kept trying it might eventually have read the data. This 7 s default is ideal when running in a RAID array, where you have redundancy and don't want to waste time, but not ideal when running as a standalone device with no parity.

    So, if I understand the general premise, you're asking if it's possible to have the system set all drives without parity protection to a "consumer style" SCT ERC timeout of 2 minutes, all drives in a RAID array to 7 s (seems to be the NAS drive default), and raise the SCSI reset timeout accordingly for the 2-minute drives on startup?

    The short answer, as geaves said above, is "no". There isn't currently an easy way in omv/linux, to my knowledge, to do what you ask.

    While it would be technically possible to script this behaviour; to be honest, at least in my opinion, writing a script to trigger when an array drops into a degraded state would be quite a lot of hassle for a minimal return. The single biggest hurdle to me would seem to be "how do we write something that can consistently detect a degraded state across all the different forms of RAID implemented in omv/linux?". On top of this, increasing the timeout doesn't actually guarantee data recovery - you're stacking the odds a little in your favour, but in the grand scheme of things if a degraded RAID starts encountering errors the odds of you successfully rebuilding it once the failed drive has been replaced would seem to have reduced sharply anyway.

    TL;DR: if you care that much about your data you should be backing it up 100% anyway. In which case you don't care if the other drive starts to fail, as you can just replace it and restore from backup :)

    All of the above said, perhaps a workaround to accomplish what you ask would be a simple script that you trigger manually when one of your arrays drops into a degraded state, setting the SCT ERC on the remaining disk(s) to 120 s and the SCSI timeout to 180 s. You could also do this manually by running the following commands against each remaining drive in the array in turn (replacing "sda" with the device name of your drive):

    smartctl -q errorsonly -l scterc,120,120 /dev/sda
    echo 180 > /sys/block/sda/device/timeout

    This wouldn't persist through reboots though.
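    If you did want values like these to survive a reboot, one option (a sketch only - the file name is arbitrary and nothing like this ships with OMV) would be a udev rule that reapplies them whenever the disk is detected:

```
# Hypothetical /etc/udev/rules.d/60-degraded-timeouts.rules
# Reapply the relaxed timeouts every time sda is (re)detected.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sda", \
  RUN+="/bin/sh -c '/usr/sbin/smartctl -q errorsonly -l scterc,120,120 /dev/sda; echo 180 > /sys/block/sda/device/timeout'"
```

    You'd want to remove the rule again once the array is healthy, since a 120 s ERC is the wrong setting for a disk whose data is redundant once more.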

    Also, just as an aside and for future reference: from all the links and info you provided, if you do use consumer drives in a raid array you ideally want either to set SCT ERC to 7 s on them (the NAS drive standard) or, if that's not supported, to increase the SCSI timeout to 180 s. This may or may not be necessary depending on the kind of raid you're using - zfs apparently writes over the failing sector immediately and ignores the SCT ERC figure, for example - but I'd set this anyway, better safe than sorry.
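    A quick way to audit this on an existing box (my own sketch; `classify_timeouts` is a made-up helper, and the `--scan` part assumes smartctl is installed) is to flag the dangerous combination - SCT ERC disabled together with the default 30 s kernel timeout:

```shell
#!/bin/sh
# classify_timeouts ERC_STATE KERNEL_TIMEOUT
# Prints "danger" for the risky combination (ERC disabled while the
# kernel would reset the link after only 30 s), "ok" otherwise.
classify_timeouts() {
  if [ "$1" = "disabled" ] && [ "$2" -lt 180 ]; then
    echo danger
  else
    echo ok
  fi
}

# Hardware scan, guarded behind --scan so the logic above is safe to run
# anywhere; assumes smartctl reports "Read: <n>" when ERC is enabled.
if [ "${1:-}" = "--scan" ]; then
  for disk in /sys/block/sd?; do
    dev=/dev/${disk##*/}
    t=$(cat "$disk/device/timeout")
    if smartctl -l scterc "$dev" | grep -q 'Read: *[0-9]'; then
      erc=enabled
    else
      erc=disabled
    fi
    printf '%s: SCT ERC %s, kernel timeout %ss -> %s\n' \
      "$dev" "$erc" "$t" "$(classify_timeouts "$erc" "$t")"
  done
fi
```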

    Some references about how SCT ERC works with different kinds of raid:
    btrfs -
    zfs + mdadm + hw raid -
