SMART Errors - driving me nuts ! (updated)!

blublub · 19 December 2014

Hi

I am having a serious headache with multiple SMART errors.

So here is the deal:

Asrock E3C226D2I
Core i3
16 GB ECC RAM
System disk Samsung EVO 840
Storage 5x WD RED 5TB

I have tried the standard Wheezy kernel 3.2 and the backports kernel 3.16 - no difference.

I get tons of the following errors on all of my ATA ports EXCEPT the one with the Samsung SSD:

Code

Dec 17 07:46:57 StoreME kernel: [129393.578460] ata6.00: exception Emask 0x50 SAct 0x400000 SErr 0x48c0800 action 0xe frozen
Dec 17 07:46:57 StoreME kernel: [129393.580786] ata6.00: irq_stat 0x04000040, connection status changed
Dec 17 07:46:57 StoreME kernel: [129393.583295] ata6: SError: { HostInt CommWake 10B8B LinkSeq DevExch }
Dec 17 07:46:58 StoreME kernel: [129393.585494] ata6.00: failed command: READ FPDMA QUEUED
Dec 17 07:46:58 StoreME kernel: [129393.587529] ata6.00: cmd 60/58:b0:b0:cf:4a/00:00:3d:01:00/40 tag 22 ncq 45056 in
Dec 17 07:46:58 StoreME kernel: [129393.587529]          res 40/00:a8:a8:cf:4a/00:00:3d:01:00/40 Emask 0x50 (ATA bus error)
Dec 17 07:46:58 StoreME kernel: [129393.591696] ata6.00: status: { DRDY }
Dec 17 07:46:58 StoreME kernel: [129393.593769] ata6: hard resetting link
Dec 17 07:46:58 StoreME kernel: [129394.313871] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 17 07:46:58 StoreME kernel: [129394.315194] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Dec 17 07:46:58 StoreME kernel: [129394.315204] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Dec 17 07:46:58 StoreME kernel: [129394.315210] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Dec 17 07:46:58 StoreME kernel: [129394.317175] ata6.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Dec 17 07:46:58 StoreME kernel: [129394.317184] ata6.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Dec 17 07:46:58 StoreME kernel: [129394.317190] ata6.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Dec 17 07:46:58 StoreME kernel: [129394.317769] ata6.00: configured for UDMA/133
Dec 17 07:46:58 StoreME kernel: [129394.317945] ata6: EH complete

Sometimes the error is WRITE FPDMA instead of READ - but only sometimes.

It seems that the drives boot up with 6.0 Gbps links and then get degraded to 1.5 Gbps when the errors appear.
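One way to confirm the downgrade is to collect every "SATA link up" line per port and compare the first negotiated speed with the latest one. The following is only a minimal sketch: the function names are made up for illustration, and the regular expression assumes the message format shown in the log above.

```python
import re

# Matches libata lines such as:
#   "ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)"
LINK_UP = re.compile(r"\b(ata\d+): SATA link up (\d+(?:\.\d+)?) Gbps")


def link_speed_history(log_lines):
    """Return {port: [speed_gbps, ...]} in the order the lines appear."""
    history = {}
    for line in log_lines:
        match = LINK_UP.search(line)
        if match:
            port, speed = match.group(1), float(match.group(2))
            history.setdefault(port, []).append(speed)
    return history


def downgraded_ports(history):
    """Ports whose most recent negotiated speed is below their first one."""
    return sorted(p for p, speeds in history.items() if speeds[-1] < speeds[0])
```

Feeding it the lines of the kernel log (for example an open file object) and printing `downgraded_ports(...)` should list the ports that fell back to a slower link.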

My actions so far:
replaced cables
replaced ATX PSU
removed backplanes to directly attach the drive to SATA controller
replaced 2 out of 5 drives due to a different and real drive FAULT - the new drives showed the same message directly after plugging them in

That did NOT solve the issues!

When I disable SATA PM (which I had enabled in rc.local for each port), the errors go away!? So it seems to be a power/wake-up related issue.

However I encountered 3 different errors like this one (only 3 in 12h):

Code

StoreME kernel: [17199.585854] ata3.00: exception Emask 0x0 SAct 0x6000000 SErr 0x0 action 0x0
Dec 18 22:54:41 StoreME kernel: [17199.585925] ata3.00: irq_stat 0x40000008
Dec 18 22:54:41 StoreME kernel: [17199.585966] ata3.00: failed command: READ FPDMA QUEUED
Dec 18 22:54:41 StoreME kernel: [17199.586020] ata3.00: cmd 60/08:d0:e0:28:05/00:00:a8:00:00/40 tag 26 ncq 4096 in
Dec 18 22:54:41 StoreME kernel: [17199.586020]          res 41/40:00:e0:28:05/00:00:a8:00:00/40 Emask 0x409 (media error) <F>
Dec 18 22:54:41 StoreME kernel: [17199.586159] ata3.00: status: { DRDY ERR }
Dec 18 22:54:41 StoreME kernel: [17199.586197] ata3.00: error: { UNC }
Dec 18 22:54:41 StoreME kernel: [17199.598261] ata3.00: configured for UDMA/133
Dec 18 22:54:41 StoreME kernel: [17199.598279] sd 2:0:0:0: [sdc] Unhandled sense code
Dec 18 22:54:41 StoreME kernel: [17199.598281] sd 2:0:0:0: [sdc]  
Dec 18 22:54:41 StoreME kernel: [17199.598283] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 18 22:54:41 StoreME kernel: [17199.598285] sd 2:0:0:0: [sdc]  
Dec 18 22:54:41 StoreME kernel: [17199.598287] Sense Key : Medium Error [current] [descriptor]
Dec 18 22:54:41 StoreME kernel: [17199.598290] Descriptor sense data with sense descriptors (in hex):
Dec 18 22:54:41 StoreME kernel: [17199.598291]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 18 22:54:41 StoreME kernel: [17199.598299]         a8 05 28 e0 
Dec 18 22:54:41 StoreME kernel: [17199.598303] sd 2:0:0:0: [sdc]  
Dec 18 22:54:41 StoreME kernel: [17199.598306] Add. Sense: Unrecovered read error - auto reallocate failed
Dec 18 22:54:41 StoreME kernel: [17199.598308] sd 2:0:0:0: [sdc] CDB: 
Dec 18 22:54:41 StoreME kernel: [17199.598309] Read(16): 88 00 00 00 00 00 a8 05 28 e0 00 00 00 08 00 00
Dec 18 22:54:41 StoreME kernel: [17199.598319] end_request: I/O error, dev sdc, sector 2818910432
Dec 18 22:54:41 StoreME kernel: [17199.598359] ata3: EH complete

The last errors are different because they are UNC read errors, suggesting a failed read from the drive itself and not a "communication" problem like all the other errors.
But despite the UNC errors, the drive does not show reallocated or failed sectors - SMART info for each drive is in the attached TXT document.
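To see which sector a failed command refers to, the LBA can be decoded from the `cmd`/`res` taskfile string in the log. This is a small sketch, not an official tool: `decode_taskfile_lba` is a name made up here, and the field order is an assumption about how libata prints the taskfile. It does agree with the log above, where the decoded value equals the sector in the `end_request` line.

```python
def decode_taskfile_lba(taskfile):
    """Decode the 48-bit LBA from a libata 'cmd' or 'res' taskfile string.

    Assumed field layout (as printed in the kernel log):
      command/feature:nsect:lbal:lbam:lbah/
      hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    fields = taskfile.strip().split("/")
    if len(fields) != 4:
        raise ValueError("expected 4 '/'-separated groups: %r" % taskfile)

    low = [int(byte, 16) for byte in fields[1].split(":")]
    high = [int(byte, 16) for byte in fields[2].split(":")]
    if len(low) != 5 or len(high) != 5:
        raise ValueError("expected 5 bytes in each middle group: %r" % taskfile)

    _feature, _nsect, lbal, lbam, lbah = low
    _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = high

    # The "hob" (high order byte) registers carry LBA bits 24..47.
    return (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )


# Taskfile of the failed read on ata3 from the log above.
print(decode_taskfile_lba("60/08:d0:e0:28:05/00:00:a8:00:00/40"))  # 2818910432
```

The printed value matches "I/O error, dev sdc, sector 2818910432" from the same log.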

Is the "resetting link" a known Linux or WD Red problem when SATA PM is enabled?

Asrock suggested replacing the MB, but that was before I tried disabling the PM - so I am waiting for a new answer.
I am also waiting for WD support to answer me.

So here is my question for the pros:

Is this:

A) A real hardware bug, so I should replace either the drives OR the MB?
B) A Linux kernel bug related to SATA PM?
C) A drive issue/incompatibility, so I should replace the drives with another brand?

I think I can ignore the UNC read errors as long as I don't have any bad sectors - right?

thx !!!!

davidh2k · 19 December 2014

I suggest you cross-post this in the Debian forum too.

Greetings
David

blublub · 19 December 2014

This one?
http://forums.debian.net/

davidh2k · 19 December 2014

Yup. Or this one if you're German:

https://debianforum.de/forum/

Greetings
David

blublub · 19 December 2014

OK, done:

http://forums.debian.net/viewt…t=119534&p=563976#p563976

blublub · 19 December 2014

OK I think it is solved.

When the SATA ALPM setting is "min" I get the errors; when it is set to "medium" all is good.

On "min" the data link goes into deep sleep and won't wake up fast enough, hence the error.
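For reference, the ALPM policy can be inspected and changed per SATA host through sysfs. The sketch below is written under a few assumptions: that the setting lives in `/sys/class/scsi_host/hostN/link_power_management_policy`, that "min" and "medium" correspond to the values `min_power` and `medium_power`, and that it runs as root. The function names are made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the per-host ALPM policy files.
SYSFS_SCSI_HOST = Path("/sys/class/scsi_host")
POLICY_FILE = "link_power_management_policy"


def read_alpm_policies(base=SYSFS_SCSI_HOST):
    """Return {host_name: current_policy} for every host exposing ALPM."""
    policies = {}
    for policy_path in sorted(Path(base).glob("host*/" + POLICY_FILE)):
        policies[policy_path.parent.name] = policy_path.read_text().strip()
    return policies


def set_alpm_policy(policy, base=SYSFS_SCSI_HOST):
    """Write `policy` to every host's policy file; return the hosts touched."""
    changed = []
    for policy_path in sorted(Path(base).glob("host*/" + POLICY_FILE)):
        policy_path.write_text(policy + "\n")
        changed.append(policy_path.parent.name)
    return changed
```

Calling `set_alpm_policy("medium_power")` would then correspond to the "medium" setting described above; the change does not survive a reboot unless it is repeated from something like rc.local.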

However I don't know if this is a kernel bug or WD firmware issue.

I wonder why there is no safeguard in place or broad warning on wiki websites when this is such a delicate setting!?

davidh2k · 19 December 2014

Maybe the usage of this setting is very rare in combination with 24/7 drives? I don't even use spindown.

Greetings
David

blublub · 22 December 2014

Hey

Look at that. UNC errors on all of my WD RED drives at the same time!? WTF!?

How can that still be a drive failure!? Looks more like a broken SATA controller or an incompatibility with WD RED - what do you guys think?

BTW: the long self-test is reported to be "ok" but takes 12h!?

Code

kernel: [88202.517609] ata2.00: exception Emask 0x0 SAct 0xe00000 SErr 0x0 action 0x0
Dec 21 22:22:22 StoreME kernel: [88202.517664] ata2.00: irq_stat 0x40000008
Dec 21 22:22:22 StoreME kernel: [88202.517711] ata2.00: failed command: READ FPDMA QUEUED
Dec 21 22:22:22 StoreME kernel: [88202.517759] ata2.00: cmd 60/08:b8:88:ed:af/00:00:93:00:00/40 tag 23 ncq 4096 in
Dec 21 22:22:22 StoreME kernel: [88202.517761]          res 41/40:00:88:ed:af/00:00:93:00:00/40 Emask 0x409 (media error) <F>
Dec 21 22:22:22 StoreME kernel: [88202.517902] ata2.00: status: { DRDY ERR }
Dec 21 22:22:22 StoreME kernel: [88202.517944] ata2.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88202.522758] ata2.00: configured for UDMA/133
Dec 21 22:22:22 StoreME kernel: [88202.522773] ata2: EH complete
Dec 21 22:22:22 StoreME kernel: [88202.546627] ata6.00: exception Emask 0x0 SAct 0x700 SErr 0x0 action 0x0
Dec 21 22:22:22 StoreME kernel: [88202.546677] ata6.00: irq_stat 0x40000008
Dec 21 22:22:22 StoreME kernel: [88202.546716] ata6.00: failed command: READ FPDMA QUEUED
Dec 21 22:22:22 StoreME kernel: [88202.546759] ata6.00: cmd 60/08:50:58:01:b0/00:00:93:00:00/40 tag 10 ncq 4096 in
Dec 21 22:22:22 StoreME kernel: [88202.546760]          res 41/40:00:58:01:b0/00:00:93:00:00/40 Emask 0x409 (media error) <F>
Dec 21 22:22:22 StoreME kernel: [88202.546897] ata6.00: status: { DRDY ERR }
Dec 21 22:22:22 StoreME kernel: [88202.546933] ata6.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88202.660223] ata3.00: exception Emask 0x0 SAct 0x38000 SErr 0x0 action 0x0
Dec 21 22:22:22 StoreME kernel: [88202.660299] ata3.00: irq_stat 0x40000008
Dec 21 22:22:22 StoreME kernel: [88202.660351] ata3.00: failed command: READ FPDMA QUEUED
Dec 21 22:22:22 StoreME kernel: [88202.660408] ata3.00: cmd 60/08:88:48:00:b0/00:00:93:00:00/40 tag 17 ncq 4096 in
Dec 21 22:22:22 StoreME kernel: [88202.660410]          res 41/40:00:48:00:b0/00:00:93:00:00/40 Emask 0x409 (media error) <F>
Dec 21 22:22:22 StoreME kernel: [88202.660554] ata3.00: status: { DRDY ERR }
Dec 21 22:22:22 StoreME kernel: [88202.660598] ata3.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88202.665674] ata5.00: exception Emask 0x0 SAct 0x38000000 SErr 0x0 action 0x0
Dec 21 22:22:22 StoreME kernel: [88202.665731] ata5.00: irq_stat 0x40000008
Dec 21 22:22:22 StoreME kernel: [88202.665785] ata5.00: failed command: READ FPDMA QUEUED
Dec 21 22:22:22 StoreME kernel: [88202.665842] ata5.00: cmd 60/08:e8:c8:01:b0/00:00:93:00:00/40 tag 29 ncq 4096 in
Dec 21 22:22:22 StoreME kernel: [88202.665845]          res 41/40:00:c8:01:b0/00:00:93:00:00/40 Emask 0x409 (media error) <F>
Dec 21 22:22:22 StoreME kernel: [88202.665984] ata5.00: status: { DRDY ERR }
Dec 21 22:22:22 StoreME kernel: [88202.666024] ata5.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88202.672103] ata3.00: configured for UDMA/133
Dec 21 22:22:22 StoreME kernel: [88202.672128] ata3: EH complete
Dec 21 22:22:22 StoreME kernel: [88202.673926] ata5.00: configured for UDMA/133
Dec 21 22:22:22 StoreME kernel: [88202.673950] ata5: EH complete
Dec 21 22:22:22 StoreME kernel: [88203.081363] ata4.00: exception Emask 0x0 SAct 0xe000000 SErr 0x0 action 0x0
Dec 21 22:22:22 StoreME kernel: [88203.081435] ata4.00: irq_stat 0x40000008
Dec 21 22:22:22 StoreME kernel: [88203.081489] ata4.00: failed command: READ FPDMA QUEUED
Dec 21 22:22:22 StoreME kernel: [88203.081545] ata4.00: cmd 60/08:d8:58:ee:af/00:00:93:00:00/40 tag 27 ncq 4096 in
Dec 21 22:22:22 StoreME kernel: [88203.081547]          res 41/40:00:58:ee:af/00:00:93:00:00/40 Emask 0x409 (media error) <F>
Dec 21 22:22:22 StoreME kernel: [88203.081691] ata4.00: status: { DRDY ERR }
Dec 21 22:22:22 StoreME kernel: [88203.081735] ata4.00: error: { UNC }
Dec 21 22:22:22 StoreME kernel: [88203.090630] ata4.00: configured for UDMA/133
Dec 21 22:22:22 StoreME kernel: [88203.090653] ata4: EH complete
Dec 21 22:22:22 StoreME kernel: [88203.160921] ata6.00: configured for UDMA/133
Dec 21 22:22:22 StoreME kernel: [88203.160945] ata6: EH complete
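To make the "all drives at the same time" observation concrete, the UNC lines can be grouped by port and the kernel timestamps compared. This is just a sketch with made-up function names; the regular expression assumes the log format shown above.

```python
import re

# Matches lines such as:
#   "... kernel: [88202.517944] ata2.00: error: { UNC }"
UNC_LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(ata\d+)\.\d+: error: \{ UNC \}")


def unc_events(log_lines):
    """Return a list of (kernel_timestamp_seconds, port) for each UNC error."""
    events = []
    for line in log_lines:
        match = UNC_LINE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def summarize(events):
    """Return (sorted affected ports, seconds between first and last event)."""
    ports = sorted({port for _, port in events})
    times = [timestamp for timestamp, _ in events]
    spread = max(times) - min(times) if times else 0.0
    return ports, spread
```

Run against the log above, this should report ata2 through ata6 with all five UNC errors falling within well under one second of each other.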

davidh2k · 22 December 2014

Is your board still under the 2-year warranty? I would check it... maybe get a large SATA controller to cross-check it.

Greetings
David

blublub · 22 December 2014

That thing is brand new.
I have a new one coming before Xmas - I hope. I would RMA the old one if that fixes it.

blublub · 23 December 2014

OK.

Now I decided to go the whole way.
I got 5 HGST Deskstar NAS drives and replaced the 1st drive yesterday. The mainboard is still the same.

After replacing the 1st WD Red with the HGST I did not get any error in 12h (OK not that much time I admit, but still strange).

How could one, maybe faulty, drive be responsible for UNC media errors on 4 other drives - vibration issues!!?? If so why didn't I feel the vibrations?

I am really reluctant to exchange the other 4 drives now because I could still return them and get a refund (it is an expensive experiment).

BTW: the HGST is running really hot in comparison: 45 vs 33 degrees Celsius, wow...

davidh2k · 23 December 2014

12 degrees more? I doubt that a drive in the same environment with similar airflow would be that much hotter (even with 5k vs 7k drives!) unless it sits in the top 12th HDD slot and is compared to the first slot...

Greetings
David

blublub · 23 December 2014

It is in exactly the same spot, with the same airflow. The WD never got hotter than 38 under load, so that would still be at least 6 degrees.

blublub · 5 January 2015

Hi

So here is the follow-up on this topic.

The numerous errors continued until ALL WD REDs had left the chassis of my NAS. I have been running the HGST NAS drives for 1.5 weeks and have not had any errors since. I ran WD Data Lifeguard on all of the WD drives (first zeroing, then the full self-test) - they all tested fine, mostly with 1 reallocated sector (but that 1 doesn't explain the multitude of errors while accessing totally different files).

I brought the drives back to my retailer for testing and they also tested the drives without any errors. Even so, since I had log files of my problem, they were very kind and gave me a refund.

In the end it had to be an incompatibility between the WD RED firmware and either the SATA chipset and/or the mainboard itself - just a pain in the a** !!!!
