SMART Errors - driving me nuts ! (updated)!


    • Hi

      I am having a serious headache with multiple SMART errors.

      So here is the deal:

      Asrock E3C226D2I
      Core i3
      16 GB ECC RAM
      System disk Samsung EVO 840
      Storage 5x WD RED 5TB


      I have tried the standard Wheezy kernel 3.2 and the backports kernel 3.16 - no difference.

      I get tons of the following errors on all of my ATA ports EXCEPT the one with the Samsung SSD:

      Source Code

      Dec 17 07:46:57 StoreME kernel: [129393.578460] ata6.00: exception Emask 0x50 SAct 0x400000 SErr 0x48c0800 action 0xe frozen
      Dec 17 07:46:57 StoreME kernel: [129393.580786] ata6.00: irq_stat 0x04000040, connection status changed
      Dec 17 07:46:57 StoreME kernel: [129393.583295] ata6: SError: { HostInt CommWake 10B8B LinkSeq DevExch }
      Dec 17 07:46:58 StoreME kernel: [129393.585494] ata6.00: failed command: READ FPDMA QUEUED
      Dec 17 07:46:58 StoreME kernel: [129393.587529] ata6.00: cmd 60/58:b0:b0:cf:4a/00:00:3d:01:00/40 tag 22 ncq 45056 in
      Dec 17 07:46:58 StoreME kernel: [129393.587529] res 40/00:a8:a8:cf:4a/00:00:3d:01:00/40 Emask 0x50 (ATA bus error)
      Dec 17 07:46:58 StoreME kernel: [129393.591696] ata6.00: status: { DRDY }
      Dec 17 07:46:58 StoreME kernel: [129393.593769] ata6: hard resetting link
      Dec 17 07:46:58 StoreME kernel: [129394.313871] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      Dec 17 07:46:58 StoreME kernel: [129394.315194] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
      Dec 17 07:46:58 StoreME kernel: [129394.315204] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
      Dec 17 07:46:58 StoreME kernel: [129394.315210] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
      Dec 17 07:46:58 StoreME kernel: [129394.317175] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
      Dec 17 07:46:58 StoreME kernel: [129394.317184] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
      Dec 17 07:46:58 StoreME kernel: [129394.317190] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
      Dec 17 07:46:58 StoreME kernel: [129394.317769] ata6.00: configured for UDMA/133
      Dec 17 07:46:58 StoreME kernel: [129394.317945] ata6: EH complete



      Sometimes the error is WRITE FPDMA instead of READ - but only sometimes.

      It seems that the drives boot up with 6.0 Gbps links and then get degraded to 1.5 Gbps when the errors appear.
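
      The downgrade can be read straight from the "SATA link up" line in the log above. A minimal Python sketch, assuming the usual SStatus register layout from the SATA/AHCI specs (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and that the kernel prints the register in hex without a 0x prefix. The function names are made up for illustration:

```python
import re

# SPD field values of the SATA SStatus register (assumed from the SATA/AHCI specs).
SPEEDS = {0: "no link", 1: "1.5 Gbps (Gen1)", 2: "3.0 Gbps (Gen2)", 3: "6.0 Gbps (Gen3)"}

# Matches e.g. "ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)"
LINK_UP = re.compile(
    r"(ata\d+): SATA link up .*\(SStatus ([0-9A-Fa-f]+) SControl ([0-9A-Fa-f]+)\)"
)


def decode_sstatus(value):
    """Split an SStatus register value into its three 4-bit fields."""
    return {
        "det": value & 0xF,         # device detection state
        "spd": (value >> 4) & 0xF,  # negotiated link speed
        "ipm": (value >> 8) & 0xF,  # interface power management state
    }


def link_speed(line):
    """Return (port, speed description) for a 'SATA link up' log line, else None."""
    match = LINK_UP.search(line)
    if match is None:
        return None
    fields = decode_sstatus(int(match.group(2), 16))
    return match.group(1), SPEEDS.get(fields["spd"], "unknown")


line = ("Dec 17 07:46:58 StoreME kernel: [129394.313871] "
        "ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)")
print(link_speed(line))
```

      For the line from the log, SStatus 113 decodes to SPD = 1, i.e. the link came back up at Gen1 speed after the hard reset.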



      My actions so far:
      replaced cables
      replaced ATX PSU
      removed backplanes to directly attach the drive to SATA controller
      replaced 2 out of 5 drives due to a different and real drive FAULT - the new drives showed the same message directly after plugging them in

      That did NOT solve the issues!


      When I disable SATA PM (which I had enabled in rc.local for each port), the errors go away!? So it seems to be a power/wakeup-related issue.
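
      To see which power management policy each port is currently using, something like the following could help. This is only a sketch: it assumes the rc.local setting in question is the kernel's `link_power_management_policy` attribute under `/sys/class/scsi_host/hostN/` (the post does not show the actual rc.local lines), and the function name is hypothetical:

```python
from pathlib import Path


def read_alpm_policies(root="/sys/class/scsi_host"):
    """Return {host name: current link power management policy}.

    Hosts without the attribute (e.g. non-AHCI controllers) are skipped.
    """
    policies = {}
    for host in sorted(Path(root).glob("host*")):
        policy_file = host / "link_power_management_policy"
        if policy_file.is_file():
            policies[host.name] = policy_file.read_text().strip()
    return policies


if __name__ == "__main__":
    for host, policy in read_alpm_policies().items():
        print(host, policy)
```

      Reading the attribute does not need root; changing it does.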


      However I encountered 3 different errors like this one (only 3 in 12h):

      Source Code

      StoreME kernel: [17199.585854] ata3.00: exception Emask 0x0 SAct 0x6000000 SErr 0x0 action 0x0
      Dec 18 22:54:41 StoreME kernel: [17199.585925] ata3.00: irq_stat 0x40000008
      Dec 18 22:54:41 StoreME kernel: [17199.585966] ata3.00: failed command: READ FPDMA QUEUED
      Dec 18 22:54:41 StoreME kernel: [17199.586020] ata3.00: cmd 60/08:d0:e0:28:05/00:00:a8:00:00/40 tag 26 ncq 4096 in
      Dec 18 22:54:41 StoreME kernel: [17199.586020] res 41/40:00:e0:28:05/00:00:a8:00:00/40 Emask 0x409 (media error) <F>
      Dec 18 22:54:41 StoreME kernel: [17199.586159] ata3.00: status: { DRDY ERR }
      Dec 18 22:54:41 StoreME kernel: [17199.586197] ata3.00: error: { UNC }
      Dec 18 22:54:41 StoreME kernel: [17199.598261] ata3.00: configured for UDMA/133
      Dec 18 22:54:41 StoreME kernel: [17199.598279] sd 2:0:0:0: [sdc] Unhandled sense code
      Dec 18 22:54:41 StoreME kernel: [17199.598281] sd 2:0:0:0: [sdc]
      Dec 18 22:54:41 StoreME kernel: [17199.598283] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      Dec 18 22:54:41 StoreME kernel: [17199.598285] sd 2:0:0:0: [sdc]
      Dec 18 22:54:41 StoreME kernel: [17199.598287] Sense Key : Medium Error [current] [descriptor]
      Dec 18 22:54:41 StoreME kernel: [17199.598290] Descriptor sense data with sense descriptors (in hex):
      Dec 18 22:54:41 StoreME kernel: [17199.598291] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
      Dec 18 22:54:41 StoreME kernel: [17199.598299] a8 05 28 e0
      Dec 18 22:54:41 StoreME kernel: [17199.598303] sd 2:0:0:0: [sdc]
      Dec 18 22:54:41 StoreME kernel: [17199.598306] Add. Sense: Unrecovered read error - auto reallocate failed
      Dec 18 22:54:41 StoreME kernel: [17199.598308] sd 2:0:0:0: [sdc] CDB:
      Dec 18 22:54:41 StoreME kernel: [17199.598309] Read(16): 88 00 00 00 00 00 a8 05 28 e0 00 00 00 08 00 00
      Dec 18 22:54:41 StoreME kernel: [17199.598319] end_request: I/O error, dev sdc, sector 2818910432
      Dec 18 22:54:41 StoreME kernel: [17199.598359] ata3: EH complete



      The last errors are different because they are UNC read errors, suggesting a failed read from the drive itself and not a "communication" problem like all the other errors.
      But despite the UNC errors, the drives do not show reallocated or failed sectors - SMART info for each drive is in the attached TXT document.
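
      The sector that failed can be recovered from the `res` line of the log above. A small sketch, assuming the libata taskfile print order (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); the helper name is invented for illustration:

```python
def lba48_from_taskfile(taskfile):
    """Reassemble the 48-bit LBA from a libata 'res' (or 'cmd') taskfile string.

    Assumed field order, all hex bytes:
      status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    _status, low, high, _device = taskfile.split("/")
    _error, _nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    return (
        hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
        | lbah << 16 | lbam << 8 | lbal
    )


# 'res' field of the ata3 media error in the log above
print(lba48_from_taskfile("41/40:00:e0:28:05/00:00:a8:00:00/40"))  # 2818910432
```

      The result, 0xa80528e0 = 2818910432, is the same sector the kernel reports in the `end_request: I/O error, dev sdc` line, so the drive itself flagged that specific LBA as unreadable.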


      Is the "resetting link" a known Linux or WD RED problem when SATA PM is enabled?

      Asrock suggested replacing the MB, but that was before I tried disabling the PM - so I am waiting for a new answer.
      I am also waiting for WD support to answer me.


      So here is my question for the pros:

      Is this:

      A) A real hardware bug, so I should replace either the drives OR the MB!?
      B) A Linux Kernel bug related to SATA PM?
      C) A drive issue/incompatibility and I should replace the drives with other brand?

      I think I can ignore the UNC read errors as long as I don't have any bad sectors - right?
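
      One way to keep an eye on this is to watch the sector-related SMART counters over time, not just the reallocated count. A rough sketch that pulls attributes 5, 197 and 198 out of `smartctl -A` style text. The sample input below is purely illustrative (it is NOT taken from the drives in this thread), and the function name is made up:

```python
# SMART attribute IDs that relate to unreadable or remapped sectors.
WATCHED_IDS = {5, 197, 198}


def sector_counters(smartctl_output):
    """Return {attribute name: raw value} for the watched attributes.

    Expects the attribute table printed by `smartctl -A`, where the
    raw value is the tenth whitespace-separated column.
    """
    counters = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED_IDS:
            counters[fields[1]] = int(fields[9])
    return counters


# Illustrative sample only, not real data from this thread.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""
print(sector_counters(sample))
```

      A sector that could not be read usually shows up as pending (197) first and is only counted as reallocated (5) after it has been rewritten, so a zero reallocated count alone does not rule out a media problem.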

      thx !!!! :)
    • I suggest you crosspost this in the Debian forum too.

      Greetings
      David
      "Well... lately this forum has become support for everything except omv" [...] "And is like someone is banning Google from their browsers"

      Only two things are infinite, the universe and human stupidity, and I'm not sure about the former.


      Upload Logfile via WebGUI/CLI
      #openmediavault on freenode IRC | German & English | GMT+1
      Absolutely no Support via PM!

      I host parts of the omv-extras.org Repository, the OpenMediaVault Live Demo and the pre-built PXE Images. If you want you can take part and help covering the costs by having a look at my profile page.
    • Yup. Or this one if you're German:

      debianforum.de/forum/

      Greetings
      David
    • OK I think it is solved.

      When the SATA ALPM setting is "min" I get the errors; when it is set to "medium" all is good.

      On "min" the data link goes into deep sleep and apparently won't wake up fast enough, hence the errors.

      However I don't know if this is a kernel bug or WD firmware issue.

      I wonder why there is no safeguard in place or broad warning on wiki websites when this is such a delicate setting!?
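
      For anyone wanting to apply the same change with a bit of a safety net, here is a minimal sketch. It assumes that "min" and "medium" correspond to the kernel's `min_power` and `medium_power` values of the `link_power_management_policy` sysfs attribute (the names accepted by kernels of this era; newer kernels know additional values). The function name and the dry-run flag are hypothetical:

```python
from pathlib import Path

# Policy names accepted by the 3.x kernels discussed in this thread (assumption).
VALID_POLICIES = ("min_power", "medium_power", "max_performance")


def set_alpm_policy(policy, root="/sys/class/scsi_host", dry_run=True):
    """Write `policy` to every host that exposes the ALPM attribute.

    Returns the list of host names that were (or, with dry_run, would be)
    changed. Needs root privileges when dry_run is False.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy {policy!r}, expected one of {VALID_POLICIES}")
    changed = []
    for host in sorted(Path(root).glob("host*")):
        policy_file = host / "link_power_management_policy"
        if not policy_file.is_file():
            continue  # controller without link power management support
        if not dry_run:
            policy_file.write_text(policy + "\n")
        changed.append(host.name)
    return changed


if __name__ == "__main__":
    # Dry run by default: only lists the hosts that would be touched.
    print(set_alpm_policy("medium_power"))
```

      Rejecting unknown policy names up front avoids silently writing a typo to sysfs from a boot script.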
    • Maybe the usage of this setting is very rare in combination with 24/7 drives? I don't even use spindown.

      Greetings
      David
    • Hey

      Look at that: UNC errors on all of my WD RED drives at the same time!? WTF!?

      How can that still be a drive failure!? Looks more like a broken SATA controller or an incompatibility with WD RED - what do you guys think?

      BTW: long self test is reported to be "ok" but takes 12h !?
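
      As a rough sanity check on the 12h figure: the long self test reads the whole surface, so the duration implies an average read rate. A quick sketch, assuming the full 5 TB (decimal) is scanned once; the function name is made up:

```python
def implied_read_rate_mb_s(capacity_tb, hours):
    """Average throughput (MB/s, decimal units) needed to scan the drive once."""
    return capacity_tb * 1e12 / (hours * 3600) / 1e6


print(round(implied_read_rate_mb_s(5, 12), 1))  # 115.7
```

      Roughly 116 MB/s averaged over the whole surface does not look unusual for a drive of this size, so the 12h duration by itself is probably not a sign of trouble.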

      Source Code

      kernel: [88202.517609] ata2.00: exception Emask 0x0 SAct 0xe00000 SErr 0x0 action 0x0
      Dec 21 22:22:22 StoreME kernel: [88202.517664] ata2.00: irq_stat 0x40000008
      Dec 21 22:22:22 StoreME kernel: [88202.517711] ata2.00: failed command: READ FPDMA QUEUED
      Dec 21 22:22:22 StoreME kernel: [88202.517759] ata2.00: cmd 60/08:b8:88:ed:af/00:00:93:00:00/40 tag 23 ncq 4096 in
      Dec 21 22:22:22 StoreME kernel: [88202.517761] res 41/40:00:88:ed:af/00:00:93:00:00/40 Emask 0x409 (media error) <F>
      Dec 21 22:22:22 StoreME kernel: [88202.517902] ata2.00: status: { DRDY ERR }
      Dec 21 22:22:22 StoreME kernel: [88202.517944] ata2.00: error: { UNC }
      Dec 21 22:22:22 StoreME kernel: [88202.522758] ata2.00: configured for UDMA/133
      Dec 21 22:22:22 StoreME kernel: [88202.522773] ata2: EH complete
      Dec 21 22:22:22 StoreME kernel: [88202.546627] ata6.00: exception Emask 0x0 SAct 0x700 SErr 0x0 action 0x0
      Dec 21 22:22:22 StoreME kernel: [88202.546677] ata6.00: irq_stat 0x40000008
      Dec 21 22:22:22 StoreME kernel: [88202.546716] ata6.00: failed command: READ FPDMA QUEUED
      Dec 21 22:22:22 StoreME kernel: [88202.546759] ata6.00: cmd 60/08:50:58:01:b0/00:00:93:00:00/40 tag 10 ncq 4096 in
      Dec 21 22:22:22 StoreME kernel: [88202.546760] res 41/40:00:58:01:b0/00:00:93:00:00/40 Emask 0x409 (media error) <F>
      Dec 21 22:22:22 StoreME kernel: [88202.546897] ata6.00: status: { DRDY ERR }
      Dec 21 22:22:22 StoreME kernel: [88202.546933] ata6.00: error: { UNC }
      Dec 21 22:22:22 StoreME kernel: [88202.660223] ata3.00: exception Emask 0x0 SAct 0x38000 SErr 0x0 action 0x0
      Dec 21 22:22:22 StoreME kernel: [88202.660299] ata3.00: irq_stat 0x40000008
      Dec 21 22:22:22 StoreME kernel: [88202.660351] ata3.00: failed command: READ FPDMA QUEUED
      Dec 21 22:22:22 StoreME kernel: [88202.660408] ata3.00: cmd 60/08:88:48:00:b0/00:00:93:00:00/40 tag 17 ncq 4096 in
      Dec 21 22:22:22 StoreME kernel: [88202.660410] res 41/40:00:48:00:b0/00:00:93:00:00/40 Emask 0x409 (media error) <F>
      Dec 21 22:22:22 StoreME kernel: [88202.660554] ata3.00: status: { DRDY ERR }
      Dec 21 22:22:22 StoreME kernel: [88202.660598] ata3.00: error: { UNC }
      Dec 21 22:22:22 StoreME kernel: [88202.665674] ata5.00: exception Emask 0x0 SAct 0x38000000 SErr 0x0 action 0x0
      Dec 21 22:22:22 StoreME kernel: [88202.665731] ata5.00: irq_stat 0x40000008
      Dec 21 22:22:22 StoreME kernel: [88202.665785] ata5.00: failed command: READ FPDMA QUEUED
      Dec 21 22:22:22 StoreME kernel: [88202.665842] ata5.00: cmd 60/08:e8:c8:01:b0/00:00:93:00:00/40 tag 29 ncq 4096 in
      Dec 21 22:22:22 StoreME kernel: [88202.665845] res 41/40:00:c8:01:b0/00:00:93:00:00/40 Emask 0x409 (media error) <F>
      Dec 21 22:22:22 StoreME kernel: [88202.665984] ata5.00: status: { DRDY ERR }
      Dec 21 22:22:22 StoreME kernel: [88202.666024] ata5.00: error: { UNC }
      Dec 21 22:22:22 StoreME kernel: [88202.672103] ata3.00: configured for UDMA/133
      Dec 21 22:22:22 StoreME kernel: [88202.672128] ata3: EH complete
      Dec 21 22:22:22 StoreME kernel: [88202.673926] ata5.00: configured for UDMA/133
      Dec 21 22:22:22 StoreME kernel: [88202.673950] ata5: EH complete
      Dec 21 22:22:22 StoreME kernel: [88203.081363] ata4.00: exception Emask 0x0 SAct 0xe000000 SErr 0x0 action 0x0
      Dec 21 22:22:22 StoreME kernel: [88203.081435] ata4.00: irq_stat 0x40000008
      Dec 21 22:22:22 StoreME kernel: [88203.081489] ata4.00: failed command: READ FPDMA QUEUED
      Dec 21 22:22:22 StoreME kernel: [88203.081545] ata4.00: cmd 60/08:d8:58:ee:af/00:00:93:00:00/40 tag 27 ncq 4096 in
      Dec 21 22:22:22 StoreME kernel: [88203.081547] res 41/40:00:58:ee:af/00:00:93:00:00/40 Emask 0x409 (media error) <F>
      Dec 21 22:22:22 StoreME kernel: [88203.081691] ata4.00: status: { DRDY ERR }
      Dec 21 22:22:22 StoreME kernel: [88203.081735] ata4.00: error: { UNC }
      Dec 21 22:22:22 StoreME kernel: [88203.090630] ata4.00: configured for UDMA/133
      Dec 21 22:22:22 StoreME kernel: [88203.090653] ata4: EH complete
      Dec 21 22:22:22 StoreME kernel: [88203.160921] ata6.00: configured for UDMA/133
      Dec 21 22:22:22 StoreME kernel: [88203.160945] ata6: EH complete

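      To make the "all at the same time" observation concrete, the UNC lines can be pulled out of the log and compared by kernel timestamp. A small sketch using only the five `error: { UNC }` lines from the log above; the regex and function name are my own and not part of any tool:

```python
import re

# Matches e.g. "[88202.517944] ata2.00: error: { UNC }"
UNC_LINE = re.compile(r"\[\s*(\d+\.\d+)\] (ata\d+)\.\d+: error: \{ UNC \}")


def unc_events(lines):
    """Return a time-sorted list of (kernel timestamp in seconds, port) per UNC error."""
    events = []
    for line in lines:
        match = UNC_LINE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return sorted(events)


log = """\
Dec 21 22:22:22 StoreME kernel: [88202.517944] ata2.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88202.546933] ata6.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88202.660598] ata3.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88202.666024] ata5.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88203.081735] ata4.00: error: { UNC }
""".splitlines()

events = unc_events(log)
ports = sorted({port for _, port in events})
spread = events[-1][0] - events[0][0]
print(ports, round(spread, 3))
```

      For this excerpt it reports five different ports (ata2 to ata6) failing within about 0.56 seconds of each other, which is hard to explain with five independent media defects.
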

    • Is your board still under 2 year warranty? I would check it... maybe get a large sata controller to crosscheck it.

      Greetings
      David
    • OK.

      Now I am going the whole mile.
      I got 5 HGST Deskstar NAS and replaced the 1st drive yesterday. Mainboard is still the same.

      After replacing the 1st WD Red with the HGST I did not get any error in 12h (OK not that much time I admit, but still strange).

      How could one, maybe faulty, drive be responsible for UNC media errors on 4 other drives - vibration issues!!?? If so why didn't I feel the vibrations?

      I am really reluctant to exchange the other 4 drives now because I could still return them and get a refund (it is an expensive experiment).

      Btw: the HGST is running really hot in comparison: 45 vs 33 degrees Celsius, wow...
    • 12 degrees more? I doubt that a drive in the same environment with similar airflow would be that much hotter (even with 5k vs 7k drives!), unless it sits in the top 12th HDD slot and is being compared to the first slot...

      Greetings
      David
    • Hi

      So here is the follow up on this topic.

      The numerous errors continued until ALL WD REDs had left the chassis of my NAS. I have been running the HGST NAS drives for 1.5 weeks and have not had any errors since. I ran WD Data Lifeguard on all of the WD REDs (first zeroing, then a full self test) - they all tested fine, mostly with 1 re-allocated sector (but that 1 doesn't explain the multitude of errors while accessing totally different files).

      I brought the drives back to my retailer for testing and they also tested the drives without finding any errors. Though I only had log files of my problem, they were very kind and gave me a refund.

      In the end it had to be an incompatibility between the WD RED firmware and either the SATA chipset and/or the mainboard itself - just a pain in the a** !!!!