SMART Errors - driving me nuts ! (updated)!

  • Hi


    I am having a serious headache with multiple SMART errors.


    So here is the deal:


    Asrock E3C226D2I
    Core i3
    16 GB ECC RAM
    System disk Samsung EVO 840
    Storage 5x WD RED 5TB



    I have tried the standard Wheezy kernel 3.2 and the backports kernel 3.16 - no difference.


    I get tons of the following errors on all of my ATA ports EXCEPT the one with the Samsung SSD:




    Sometimes the error is WRITE FPDMA instead of READ - but only sometimes.
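
    To see how the READ and WRITE failures are spread over the ports, the kernel log can be tallied. This is only a minimal sketch: it assumes libata reports failures as "ataN.MM: failed command: <COMMAND>", and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed libata line shape: "ata3.00: failed command: READ FPDMA QUEUED"
FAILED_CMD = re.compile(r"(ata\d+)\.\d+: failed command: (.+?)\s*$")


def count_failed_commands(log_text):
    """Return a Counter keyed by (port, command) for every failed command."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_CMD.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

    Feeding it the text of the kernel log (for example the output of dmesg) shows at a glance whether one port or one command type dominates.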


    It seems that the drives boot up with 6.0 Gbps links and then get degraded to 1.5 Gbps when the errors appear.
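
    The downgrade can be tracked per port from the kernel log. A rough sketch, assuming the link negotiation lines look like "ataN: SATA link up 6.0 Gbps (...)"; both function names are hypothetical.

```python
import re

# Assumed libata line shape: "ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)"
LINK_UP = re.compile(r"(ata\d+): SATA link up ([\d.]+) Gbps")


def link_speed_history(log_text):
    """Map each port to the list of link speeds it negotiated, in log order."""
    history = {}
    for line in log_text.splitlines():
        match = LINK_UP.search(line)
        if match:
            history.setdefault(match.group(1), []).append(float(match.group(2)))
    return history


def degraded_ports(history):
    """Return the ports whose latest speed is lower than their first one."""
    return sorted(port for port, speeds in history.items() if speeds[-1] < speeds[0])
```

    A port that shows up in degraded_ports() started at a higher speed than it is running at now, which matches the 6.0 to 1.5 Gbps fallback described above.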




    My actions so far:
    replaced cables
    replaced ATX PSU
    removed backplanes to directly attach the drives to the SATA controller
    replaced 2 out of 5 drives due to a different and real drive FAULT - the new drives showed the same message directly after plugging them in

    That did NOT solve the issues!



    When I disable the SATA PM (which I enabled in rc.local for each port) the errors go away !? So it seems to be a power / wakeup related issue.
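
    To check which link power management policy is active on each port, the sysfs attribute can be read directly. A minimal sketch, assuming the kernel exposes it as /sys/class/scsi_host/hostN/link_power_management_policy; the function name and the sysfs_root parameter are only there for illustration and testing.

```python
import glob
import os


def read_alpm_policies(sysfs_root="/sys/class/scsi_host"):
    """Return {host name: current link power management policy}."""
    policies = {}
    pattern = os.path.join(sysfs_root, "host*", "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        host = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            policies[host] = handle.read().strip()
    return policies
```

    Hosts that do not support the attribute simply do not appear in the result.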



    However I encountered 3 different errors like this one (only 3 in 12h):




    The last errors are different because they are UNC read errors, suggesting a failed read from the drive itself and not a "communication" problem like all the other errors.
    But in regard to the UNC errors the drive does not show reallocated or failed sectors - SMART info for each drive is in the attached TXT document.
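
    The sector counters can be pulled out of the SMART attribute table automatically. This is a sketch only: it assumes the usual ten-column layout printed by "smartctl -A" and the common attribute names below, which not every drive reports under exactly these names.

```python
# Attribute names assumed to be the ones relevant for bad-sector checks.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_sector_counters(smartctl_output):
    """Return {attribute name: raw value} for the watched SMART attributes."""
    counters = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                counters[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, skip it
    return counters
```

    The input would be the text printed by "smartctl -A" for one drive. If all three counters are 0, the UNC errors have at least not led to reallocated or pending sectors.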



    Is the "resetting link" a known Linux or WD RED problem when PM for SATA is enabled?


    Asrock suggested replacing the MB, but that was before I tried disabling the PM - so I am waiting for a new answer.
    I am also waiting for WD support to answer me.



    So here is my question for the pros:


    Is this:


    A) A real hardware bug and I should replace either the drives OR the MB !?
    B) A Linux Kernel bug related to SATA PM?
    C) A drive issue/incompatibility and I should replace the drives with other brand?


    I think I can ignore the UNC read errors as long as I don't have any bad sectors - right?


    thx !!!! :)

  • I suggest you crosspost this in the Debian forum too.


    Greetings
    David

    "Well... lately this forum has become support for everything except omv" [...] "And is like someone is banning Google from their browsers"


    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former.

    Upload Logfile via WebGUI/CLI
    #openmediavault on freenode IRC | German & English | GMT+1
    Absolutely no Support via PM!

  • Yup. Or this one if you're German:


    https://debianforum.de/forum/


    Greetings
    David

  • OK I think it is solved.


    When the SATA ALPM setting is "min" I get the errors; when it is set to "medium" all is good.


    When on "min" the data link goes to deep sleep and won't wake up fast enough, hence the error.
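
    Switching the policy on all ports at once could look roughly like this. It is a sketch under assumptions: that "min" and "medium" correspond to the kernel values min_power and medium_power, that the attribute lives under /sys/class/scsi_host/hostN/, and that the script runs as root. The function name is hypothetical.

```python
import glob
import os

# Policy names assumed for kernels of this era; newer kernels may accept more.
KNOWN_POLICIES = ("min_power", "medium_power", "max_performance")


def set_alpm_policy(policy, sysfs_root="/sys/class/scsi_host"):
    """Write the policy to every host that exposes the attribute; return the paths."""
    if policy not in KNOWN_POLICIES:
        raise ValueError("unknown ALPM policy: %r" % (policy,))
    changed = []
    pattern = os.path.join(sysfs_root, "host*", "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as handle:
            handle.write(policy)
        changed.append(path)
    return changed
```

    Calling set_alpm_policy("medium_power") would then replace the per-port lines in rc.local. The setting is not persistent, so it has to be applied again after every boot.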


    However I don't know if this is a kernel bug or WD firmware issue.


    I wonder why there is no safeguard in place or broad warning on wiki websites when this is such a delicate setting!?

  • Maybe the usage of this setting is very rare in combination with 24/7 drives? I don't even use spindown.


    Greetings
    David

  • Hey


    Look at that. UNC errors on all of my WD RED drives at the same time!? WTF!?


    How can that still be a drive failure!? Looks more like a broken SATA controller or an incompatibility with WD RED - what do you guys think?


    BTW: long self test is reported to be "ok" but takes 12h !?


  • Is your board still under 2 year warranty? I would check it... maybe get a large sata controller to crosscheck it.


    Greetings
    David

  • OK.


    Now I decided to go the whole mile.
    I got 5 HGST Deskstar NAS drives and replaced the 1st drive yesterday. The mainboard is still the same.


    After replacing the 1st WD Red with the HGST I did not get any error in 12h (OK not that much time I admit, but still strange).


    How could one, maybe faulty, drive be responsible for UNC media errors on 4 other drives - vibration issues!!?? If so why didn't I feel the vibrations?


    I am really reluctant to exchange the other 4 drives now because I could still return them and get a refund (it is an expensive experiment).


    Btw: the HGST is running really hot in comparison, 45 vs 33 degrees Celsius, wow...

  • 12 degrees more? I doubt that a drive in the same environment with similar airflow would be that much hotter (even if 5k vs 7k drives!) unless it sits in the top 12th HDD slot and is compared to the first slot...


    Greetings
    David

  • Hi


    So here is the follow up on this topic.


    The numerous errors continued until ALL WD REDs had left the chassis of my NAS. I have been running the HGST NAS drives for 1.5 weeks and have not had any errors since. I ran WD Data Lifeguard on all of the WD REDs (first zeroing, then a full self test) - they all tested fine, mostly with 1 re-allocated sector (but that 1 sector doesn't explain the multitude of errors while accessing totally different files).


    I brought the drives back to my retailer for testing and they also tested the drives without finding any errors. Though I only had the log files of my problem, they were very kind and gave me a refund.


    In the end it had to be an incompatibility between the WD RED firmware and either the SATA chipset and/or the mainboard itself - just a pain in the a** !!!!
