Clicking on "SMART > Devices > Information" increases ATA error count!

  • This console output is produced while clicking the "Information" button of /dev/sda (Error count increases from 17 to 18):

    Had to shorten some lines [...] to fit into this post.


    (However, there were no entries in syslog during this time.)

  • no entries in Syslog


    Yes, @votdev confirmed that this time the output goes to stdout, since the daemon had to be started in the foreground. Essentially, all that gets called is the following, one command directly after the other:


    Code
    export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin; export LANG=C; udevadm info --query=property --name='/dev/sda' 2>&1 ; smartctl -x '/dev/sda' 2>&1

    Can you reproduce increased error count when executing this from a shell?

  • No, error count stays the same when executing this.
    To verify, I clicked on Info after that and ta-da ... the error count jumped up by one ... reproducible! :thumbdown:


    edit: at least I saw that there are no weird parameters of smartctl responsible for this.
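For anyone repeating this check, here is a minimal sketch for reading the counter before and after a command instead of comparing by eye. It assumes that `smartctl -l error` prints a summary line of the form `ATA Error Count: N`; the helper names `parse_error_count` and `count_delta` are made up for this sketch and are not part of smartmontools or OMV.

```shell
# Extract the number from a line such as "ATA Error Count: 18 (device log ...)".
# Reads smartctl output on stdin and prints 0 when no such line is present.
parse_error_count() {
    awk '/^ATA Error Count:/ { print $4; found = 1; exit }
         END { if (!found) print 0 }'
}

# Report the error count before and after running a command.
# $1 is the device; the remaining arguments are the command to run.
count_delta() {
    local dev=$1; shift
    local before after
    before=$(smartctl -l error "$dev" | parse_error_count)
    "$@" >/dev/null 2>&1
    after=$(smartctl -l error "$dev" | parse_error_count)
    echo "before=$before after=$after"
}
```

Run as root, for example with `count_delta /dev/sda smartctl -x /dev/sda`. A difference between the two numbers means the command itself (or something running in parallel) triggered a new log entry.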

  • Another example of clicking the "Information" button (error count jumps from 26 to 27):

    Any ideas?


    btw: Thanks for your support! :)

    • Official post

    It seems smartctl behaves differently when it is executed via PHP, and your HDD's firmware does not like this. Because this behavior does not occur on all hardware, it is really hard to track down. I think that ultimately this cannot be fixed by the OMV project, because smartmontools and PHP are not maintained by the project, nor can we fix firmware issues.


    In the end you have to live with this behavior, either by not using the SMART UI or by switching to different HDDs.

  • The four new HDDs were quite expensive and I cannot return them anymore, so switching hdds is not an option.


    It bugs me that there seems to be no way to reproduce the error other than through the OMV SMART GUI. There must be something "special" going on right before or after the smartctl call. You say that it could be something between PHP and smartctl.


    One more idea:
    In the console output above it seems that smartctl and udevadm are executed several times in one second. Is this observation right? If yes, why? If this happens, the ATA error could be a timing problem caused by concurrent attempts to read the SMART values ...


    I'd really like to open issues in the right bug trackers, but sadly I don't have the PHP skills to write a minimal example that triggers the issue. Could someone please paste a "working" minimal PHP file that executes just the problematic code that runs when clicking the "Information" button? (Or even better, a bash command/script that triggers the error.)
    Or provide me with an improved/patched version of the SMART GUI to test. I'm willing to help.


    I think it is important to follow up on these errors because so many users with WD drives are affected.


    Examples:
    https://forum.openmediavault.o…creasing-ATA-error-count/ (OMV, WD Red)
    https://forum.openmediavault.o…a-SI-PEX40064-SATA3-Card/ (OMV, WD Red - mentioned in other thread)
    https://debianforum.de/forum/viewtopic.php?t=153705#p1025789 (Debian Wheezy but highly likely OMV as the username also posted in omv forum around that time, WD Red)
    https://www.hardwareluxx.de/co…1084178.html#post23773315 (OMV - see hostname, WD Red)
    https://www.hardwareluxx.de/co…1043116.html#post22821883 (OMV, HDD unknown)
    https://www.technikaffe.de/for…t-error-errorcount-hilfe/ (OMV, WD Red)
    https://www.technikaffe.de/for…-m-a-r-t-meldung-was-nun/ (OMV, WD Red, also other errors but ATA errors increase too)
    https://forum.ubuntuusers.de/t…he-s-m-a-r-t-werte-nicht/ (Ubuntu forum but OMV machine affected, WD Red, user asked for other problem but posted smartctl output revealed the ATA error problem by the way)
    ...


    I could list many more. And if you google this exact error, most forum threads you find reveal sooner or later that the affected person is using OMV or OMV packages on top of Debian. I agree that the core of this problem might be that the WD firmware is a bit picky regarding ATA communication, but OMV is the software that somehow triggers this bug. So the investigation should start here.

  • In the console output above it seems that smartctl and udevadm are executed several times in one second

    If you suspect that's the problem simply simulate this in a bash shell:


    Code
    for in 1 2 3 ; do
        (export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin; export LANG=C; udevadm info --query=property --name='/dev/sda' 2>&1 ; smartctl -x '/dev/sda' 2>&1) &
    done

    Just curious: Why do you still care about this WD firmware behaviour?

  • @tkaiser: I had to put an "i" after "for", but now it works! 8o ...
    This code hangs somewhere and has to be aborted with CTRL+C. After that, the ATA error count of sda has increased by one!

    Code
    for i in 1 2 3 ; do
        (export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin; export LANG=C; udevadm info --query=property --name='/dev/sda' 2>&1 ; smartctl -x '/dev/sda' 2>&1) & 
    done

    I've tested it several times and every execution increased the count. Even "for i in 1 2 ;" does the trick. But "for i in 1 ;" does NOT increase the counter, as expected. I'm convinced we found the problem. :thumbup:
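The reproduction loop can be wrapped so that the number of parallel runs is a parameter and the call returns on its own. This is only a sketch: `run_concurrent` is a made-up helper name, and the idea that the apparent hang is just the prompt being buried under background output (rather than a real hang) is an assumption the thread does not confirm.

```shell
# Start $1 parallel copies of the given command, then wait for all of them,
# so the shell only returns once every background job has finished.
run_concurrent() {
    local n=$1; shift
    local i=0
    while [ "$i" -lt "$n" ]; do
        "$@" &
        i=$((i + 1))
    done
    wait
}
```

For example, `run_concurrent 2 smartctl -x /dev/sda` corresponds to the two-iteration loop, and `run_concurrent 1 smartctl -x /dev/sda` to the single run that left the counter unchanged.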


    @votdev: Would you please have a look at the OMV code to see why this is executed more than once?


    @tkaiser: I care because when this error occurred I was worried about the health of my new drives. And in all the threads I found on the internet, people were given wrong advice (swapping cables, even dumping the drives or mainboards(!) because there must be some kind of loose connection, and so on), but no one saw the connection between the increasing ATA error count and clicking the OMV SMART Information button. I (and all the other users) lost so much time because of that... :thumbdown:


    I want to use my disks to store irreplaceable family memories. So data integrity is very important to me. (Yes, I have backups, but better safe than sorry!) And I also like OMV for its super easy and well-working SMART functionality. I bought those new drives because last week OMV sent me a SMART warning mail about one of my old drives slowly dying. So I was able to save all my data before a fatal drive failure. That is why I want to keep SMART mails active. But with my new drives I got SMART warning mails virtually all day and I had no idea why. I think you'll agree that simply switching off SMART warnings is not so clever.


    So please help me to sort this thing out (for me and for all the other affected users)! :)

  • OK, yesterday evening I looked into this file to understand what's going on: https://github.com/openmediava…rage/smartinformation.inc (Please excuse my very basic approach. I have absolutely no idea how PHP/JS works.)


    It seems that when you click on "Information" those three functions for the first three window tabs are executed and all three call getData() which itself executes smartctl:



    getInformation()
    getAttributes()
    getSelfTestLogs()



    I didn't find the place where this is actually executed. -> I found this, but it's all Greek to me: https://github.com/openmediava…/storage/smart/Devices.js


    My assumption is based on the fact that the Information, Attributes and Selftest Logs tabs are instantly there when you click on them, but the Extended Information tab needs a bit of time to come up. This explains the three concurrent executions of smartctl.


    @votdev: It should be easy for you to fix this bug (or let's call it an incompatibility with WD drives) by adding a small pause between the executions. The user would not even notice, as the SMART Information window opens on the Information tab. Or maybe change the tab behavior from "preloading" to "loading only when the user clicks", same as for the Extended Information tab. Or use one call of getData() instead of three to fill all three tabs at once.
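The "one call instead of three" idea can be illustrated in shell: query the drive once, keep the output, and cut the individual views out of the saved text. This is not OMV code (OMV does this in PHP), and it assumes that the smartctl output marks its sections with header lines starting with `=== START OF`; `smart_section` is a made-up helper name.

```shell
# Print the lines of the section whose header contains the string in $1.
# Reads previously saved smartctl output on stdin, so the drive is not
# queried again for each view.
smart_section() {
    awk -v name="$1" '
        /^=== START OF / { in_section = (index($0, name) > 0); next }
        in_section { print }
    '
}
```

A possible use: `out=$(smartctl -x /dev/sda)` once, then `printf '%s\n' "$out" | smart_section 'INFORMATION SECTION'` and `printf '%s\n' "$out" | smart_section 'SMART DATA SECTION'` for the separate views. The drive then sees a single request no matter how many tabs are filled.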


    Thanks for your help!


    Mr Smile

  • Hmm. Hard to say. 0.5 seconds? Please give me a patched file and tell me where I can tune it. Then I'll test. Thanks for your support.


    And yes. I agree with you that the buggy WD firmware is the problem. But this triple execution of smartctl seems to be very uncommon. So if it is not really needed ... :)

  • Hard to say. 0.5 seconds?

    You have those buggy WD drives at hand. So you would need to test:


    Code
    for i in 1 2 3 ; do
        (export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin; export LANG=C; udevadm info --query=property --name='/dev/sda' 2>&1 ; smartctl -x '/dev/sda' 2>&1) &
        sleep 0.5
    done

    Will you open up a bug report at WD too?
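As an alternative to guessing a sleep value, the calls could be queued behind a lock so that they never overlap, however long each one takes. A minimal sketch, assuming `flock` from util-linux is installed; the lock file path and the helper name `serialized` are made up for illustration.

```shell
# Hypothetical lock file; any path writable by the caller works.
LOCKFILE=${LOCKFILE:-/tmp/smart-sda.lock}

# Run a command while holding an exclusive lock, so that parallel callers
# wait for each other instead of talking to the drive at the same time.
serialized() {
    flock "$LOCKFILE" "$@"
}
```

Testing it looks like `for i in 1 2 3 ; do serialized smartctl -x /dev/sda & done; wait`. If the counter stays the same with this loop but increases without the lock, that supports the concurrency theory. Note that this only helps for callers that go through the wrapper; the OMV GUI's own calls are not affected by it.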

  • It's impossible to fix a HDD firmware bug somewhere else. Please reach out to WD and let them fix their bug. Remember first page of this thread where you can read how they violate specs?

    I'll do this later, but how long do you think it will take to get new firmware for all those WD Red drives? I don't think WD ever releases one, as this call behavior of smartctl is so uncommon. Most affected users (in fact all I found on the internet) seem to have OMV running on their machines. I absolutely agree that WD is to blame for their firmware (and I'll report this behavior to smartmontools and WD), but for now let's try to make OMV's behavior "more common".


    And to be honest: I don't understand why smartctl has to be executed three times at the same time ... :)

  • It is needed because they are executed async in different threads.

    Ah OK, this explains why in roughly one of ten cases the ATA error count stays the same.


    Ok, just to understand: Why is it needed to process three tabs (of which only one can be visible at the same time) concurrently with multithreading? Why is that done?


    Using three concurrent threads to ask a hard drive for the same SMART information seems to me like an intentional firmware race condition test. :D

  • I don't think WD ever releases one, as this call behavior of smartctl is so uncommon

    They violate the specs. Specifications exist for a reason.


    My personal 'solution' for the various issues with WD drives is to not buy them and to educate people to do the same. Maybe we should emphasize this more, since most probably you're absolutely right: they don't give a sh*t about being spec compliant and won't fix the issue in their firmware.

  • Your WD refusal is not very helpful. I've spent 500 bucks on the drives, and when I noticed this problem I just found other affected OMV users but no solution. In other words you say: Dump OMV and use one of the other operating systems that don't trigger this WD bug.


    The reality is that a lot of OMV users have WD Red drives because they are cheap, quiet, consume little energy, are recommended for NAS/RAID and so on. But a lot of them also don't give a shit about email notifications. So they don't know that they are affected by this problem.


    You gave me 414 pages of specifications. But I'm not a programmer and this is all Greek to me. Would you please point me to the violated paragraph so that I can file a bug with WD and/or smartmontools?


    edit: Damn, I rebooted the server this morning and now even the triple loop doesn't trigger the firmware bug consistently ... maybe 1 in 7 times. :-/


    edit2: This also goes for the Information button. Will reboot now ... :S
