SATA errors


      Hi,

      I finally got all the HDDs for my NAS, but now I see these errors:

      [ 380.551488] ata3.00: exception Emask 0x0 SAct 0x2000 SErr 0x400000 action 0x6 frozen
      [ 380.551578] ata3: SError: { Handshk }
      [ 380.551653] ata3.00: failed command: READ FPDMA QUEUED
      [ 380.551740] ata3.00: cmd 60/08:68:58:09:00/00:00:00:00:00/40 tag 13 ncq 4096 in
      [ 380.551744] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
      [ 380.551921] ata3.00: status: { DRDY }
      [ 380.552012] ata3: hard resetting link
      [ 381.042887] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      [ 381.050615] ata3.00: configured for UDMA/133
      [ 381.050630] ata3.00: device reported invalid CHS sector 0
      [ 381.050651] ata3: EH complete
      [ 1644.680645] ata3.00: exception Emask 0x10 SAct 0x4000000 SErr 0x400000 action 0x6 frozen
      [ 1644.680759] ata3.00: irq_stat 0x08000000, interface fatal error
      [ 1644.680852] ata3: SError: { Handshk }
      [ 1644.680940] ata3.00: failed command: READ FPDMA QUEUED
      [ 1644.681037] ata3.00: cmd 60/08:d0:40:0b:00/00:00:00:00:00/40 tag 26 ncq 4096 in
      [ 1644.681041] res 40/00:00:38:0b:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
      [ 1644.681244] ata3.00: status: { DRDY }
      [ 1644.681333] ata3: hard resetting link
      [ 1645.171018] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      [ 1645.178782] ata3.00: configured for UDMA/133
      [ 1645.178811] ata3: EH complete
      [ 1714.695456] ata3.00: exception Emask 0x10 SAct 0x20 SErr 0x400000 action 0x6 frozen
      [ 1714.695571] ata3.00: irq_stat 0x08000000, interface fatal error
      [ 1714.695660] ata3: SError: { Handshk }
      [ 1714.695747] ata3.00: failed command: READ FPDMA QUEUED
      [ 1714.695844] ata3.00: cmd 60/08:28:88:09:00/00:00:00:00:00/40 tag 5 ncq 4096 in
      [ 1714.695847] res 40/00:00:80:09:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
      [ 1714.696053] ata3.00: status: { DRDY }
      [ 1714.696142] ata3: hard resetting link
      [ 1715.185041] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      [ 1715.192794] ata3.00: configured for UDMA/133
      [ 1715.192823] ata3: EH complete
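
      The SAct and SErr values in the log above are bitmasks, so they can be decoded by hand. Here is a minimal Python sketch, assuming the bit names match the abbreviations the Linux libata driver prints inside "SError: { ... }" (the table is reproduced from memory, so check it against your kernel source before relying on it). The helper names are made up for illustration.

```python
# Assumed mapping of SError register bits to the short names libata prints.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all SError bits set in `value`."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


def active_tags(sact):
    """Return the NCQ tags (0..31) whose bit is set in the SAct bitmask."""
    return [tag for tag in range(32) if sact & (1 << tag)]


# (SAct, SErr) pairs copied from the three exceptions in the log above.
for sact, serr in [(0x2000, 0x400000), (0x4000000, 0x400000), (0x20, 0x400000)]:
    print(f"SAct {sact:#x} -> tags {active_tags(sact)}, SErr {serr:#x} -> {decode_serror(serr)}")
```

      For the three exceptions above this yields tags 13, 26 and 5, which matches the "tag" field of each failed command, and a single SError flag, Handshk.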

      When trying to format one of the drives through the web interface, I also sometimes got long delays, followed by a "communication problem" message.

      I think the problem might be related to my SATA controller. It's this one:

      amazon.de/gp/product/B00AZ9T3O…age_o00_s00?ie=UTF8&psc=1

      EDIT: According to the manual, the controller should be supported by Linux kernel versions >= 2.6.

      This is my mainboard:

      amazon.de/Asrock-J3455M-Hauptp…-1&keywords=j3455m+asrock


      The HDDs are HGST NAS 8TB.


      I am using OMV 2.2. Could the problem be fixed in 3.1 thanks to the newer Linux kernel?


    • Sean wrote:

      I think the problem might be related to my SATA controller
      Well, while I would never use a combination like yours, I would first check for cabling/connectivity problems. Check the 'smartctl -a -x' output for each drive and watch SMART attribute 199 (usually listed as UDMA_CRC_Error_Count).

      And maybe use a second host with a well-known SATA implementation to do individual burn-in tests (at least a few heavy iozone runs, checking the SMART output afterwards).
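
      Checking attribute 199 on several drives by eye gets tedious, so here is a small Python sketch that pulls the raw value out of attribute-table text as printed by 'smartctl -A'. The function name and the sample table are made up for illustration; the sample is NOT output from the drives in this thread, and the column layout is assumed to be the usual ten-column smartmontools table.

```python
import re


def smart_raw_value(smartctl_text, attribute_id):
    """Return the RAW_VALUE of one SMART attribute as int, or None if absent."""
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0] == str(attribute_id):
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


# Illustrative sample only (hypothetical values, not from this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       42
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       7
"""

crc_errors = smart_raw_value(SAMPLE, 199)
print("attribute 199 raw value:", crc_errors)
```

      Record the value before a benchmark run and compare it afterwards; a counter that grows during the run points at the link (cable, connector, port) rather than at the platters.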
    • Sean wrote:

      Can a faulty cable cause this kind of error?
      Of course. This counter reports CRC errors. ATA/SATA uses primitive checksumming to identify data corruption on the wire: the sender calculates a CRC and sends it together with the data to the receiver. There the CRC of the data gets calculated again, and if a mismatch occurs the receiver asks for a retransmit. If the disk controller is not crappy, it also increments SMART attribute 199 to inform you that something is wrong (some disks don't).
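
      The check-and-retransmit idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: zlib.crc32 stands in for the CRC the SATA link layer actually computes, the function names are invented, and the "retransmits" counter only plays the role that attribute 199 plays on a real disk.

```python
import zlib


def send(payload):
    """Sender side: return the payload together with its CRC."""
    return payload, zlib.crc32(payload)


def receive(payload, crc):
    """Receiver side: recompute the CRC and report whether it matches."""
    return zlib.crc32(payload) == crc


def flip_bit(payload, bit_index):
    """Simulate a bad cable by flipping one bit of the payload."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def transfer(payload, corrupt_first_attempts=0, max_attempts=5):
    """Deliver `payload`, retrying on CRC mismatch. Returns (data, retransmits)."""
    retransmits = 0
    for attempt in range(max_attempts):
        frame, crc = send(payload)
        if attempt < corrupt_first_attempts:
            frame = flip_bit(frame, 3)  # corruption happens "on the wire"
        if receive(frame, crc):
            return frame, retransmits
        retransmits += 1  # a real disk would bump attribute 199 here
    raise RuntimeError("link too unreliable, giving up")


data = b"x" * 4096
delivered, retries = transfer(data, corrupt_first_attempts=2)
print("delivered intact:", delivered == data, "retransmits:", retries)
```

      Every retransmit costs time, which is why a bad cable shows up as poor sequential throughput even when the counter itself is not reported.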

      SMART attribute 199 is the first thing to check when attaching any disk to a host, after running a small benchmark. On crappy disks that don't increment attribute 199, the retransmits still slow sequential performance down, so even in such cases you're most probably able to identify the problem.

      BTW: Internal SATA connectors/cables are rated for a maximum of 50 matings, and the cheap crap from Aliexpress/eBay/whatever dies way earlier.
    • BTW: ATA/SATA checksumming is rather primitive, so if cables/contacts corrupt data, not every occurrence will lead to a CRC mismatch and trigger a retransmit; some corrupted data ends up on the storage. That's why you ALWAYS do a quick per-disk 'burn-in' using a benchmark like iozone after adding a disk to a host, followed by checking SMART attribute 199.
    • tkaiser wrote:

      BTW: ATA/SATA checksumming is rather primitive, so if cables/contacts corrupt data, not every occurrence will lead to a CRC mismatch and trigger a retransmit; some corrupted data ends up on the storage. That's why you ALWAYS do a quick per-disk 'burn-in' using a benchmark like iozone after adding a disk to a host, followed by checking SMART attribute 199.
      Is iozone included in some OMV plugin?

      BTW, I looked a bit closer at the "dmesg" output and found something that seems to indicate communication problems between the system and the SATA controller:

      (of course the controller should be capable of 6 Gbps)


      [ 11.578750] ata3: softreset failed (1st FIS failed)
      [ 21.567615] ata3: softreset failed (1st FIS failed)
      [ 22.059085] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 27.053459] ata3.00: qc timeout (cmd 0xec)
      [ 27.053470] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
      [ 27.544967] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 37.533861] ata3.00: qc timeout (cmd 0xec)
      [ 37.533877] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
      [ 37.533886] ata3: limiting SATA link speed to 1.5 Gbps
      [ 38.025288] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      [ 38.027858] ata3.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 38.027869] ata3.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 38.032831] ata3.00: configured for UDMA/133
      [ 38.033058] scsi 2:0:0:0: Direct-Access ATA HGST HDN728080AL A4GN PQ: 0 ANSI: 5
      [ 38.033403] scsi 3:0:0:0: Direct-Access ATA HGST HDN728080AL A4GN PQ: 0 ANSI: 5
      [ 38.033805] scsi 4:0:0:0: Direct-Access ATA HGST HDN728080AL A4GN PQ: 0 ANSI: 5
      [ 38.034228] scsi 5:0:0:0: Direct-Access ATA HGST HDN728080AL A4GN PQ: 0 ANSI: 5
      [ 38.037029] sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/465 GiB)
      [ 38.037140] sd 0:0:0:0: [sda] Write Protect is off
      [ 38.037145] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
      [ 38.037192] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [ 38.037442] sd 1:0:0:0: [sdb] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
      [ 38.037447] sd 1:0:0:0: [sdb] 4096-byte physical blocks
      [ 38.037549] sd 1:0:0:0: [sdb] Write Protect is off
      [ 38.037553] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
      [ 38.037599] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [ 38.037793] sd 2:0:0:0: [sdc] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
      [ 38.037798] sd 2:0:0:0: [sdc] 4096-byte physical blocks
      [ 38.037905] sd 2:0:0:0: [sdc] Write Protect is off
      [ 38.037909] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
      [ 38.038036] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [ 38.038821] sd 3:0:0:0: [sdd] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
      [ 38.038827] sd 3:0:0:0: [sdd] 4096-byte physical blocks
      [ 38.038922] sd 3:0:0:0: [sdd] Write Protect is off
      [ 38.038927] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
      [ 38.038968] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [ 38.039725] sd 4:0:0:0: [sde] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
      [ 38.039731] sd 4:0:0:0: [sde] 4096-byte physical blocks
      [ 38.039796] sd 5:0:0:0: [sdf] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
      [ 38.039801] sd 5:0:0:0: [sdf] 4096-byte physical blocks
      [ 38.039832] sd 4:0:0:0: [sde] Write Protect is off
      [ 38.039836] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
      [ 38.039880] sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [ 38.039902] sd 5:0:0:0: [sdf] Write Protect is off
      [ 38.039907] sd 5:0:0:0: [sdf] Mode Sense: 00 3a 00 00
      [ 38.039947] sd 5:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [ 38.044754] sd 0:0:0:0: Attached scsi generic sg0 type 0
      [ 38.044840] sd 1:0:0:0: Attached scsi generic sg1 type 0
      [ 38.044908] sd 2:0:0:0: Attached scsi generic sg2 type 0
      [ 38.044976] sd 3:0:0:0: Attached scsi generic sg3 type 0
      [ 38.045036] sd 4:0:0:0: Attached scsi generic sg4 type 0
      [ 38.045098] sd 5:0:0:0: Attached scsi generic sg5 type 0
      [ 38.079307] sdb: sdb1
      [ 38.079806] sd 1:0:0:0: [sdb] Attached SCSI disk
      [ 38.088526] sdc: sdc1
      [ 38.089022] sd 2:0:0:0: [sdc] Attached SCSI disk
      [ 38.090383] sdd: sdd1
      [ 38.090919] sd 3:0:0:0: [sdd] Attached SCSI disk
      [ 38.091203] sdf: sdf1
      [ 38.092009] sd 5:0:0:0: [sdf] Attached SCSI disk
      [ 38.093874] sde: sde1
      [ 38.094249] sd 4:0:0:0: [sde] Attached SCSI disk
      [ 38.096647] sda: sda1 sda2 < sda5 >
      [ 38.097295] sd 0:0:0:0: [sda] Attached SCSI disk
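
      In a long dmesg it is easy to miss which port fell back to a lower speed. The following Python sketch collects the last negotiated link speed per port and decodes the SStatus value (printed in hex: low nibble DET, next nibble SPD, next IPM). The function names are made up for illustration, and the sample lines are copied from the log above.

```python
import re

LINK_UP = re.compile(
    r"(ata\d+): SATA link up ([\d.]+) Gbps \(SStatus ([0-9A-Fa-f]+) SControl ([0-9A-Fa-f]+)\)"
)
LIMITED = re.compile(r"(ata\d+): limiting SATA link speed to ([\d.]+) Gbps")

# SPD field of SStatus: SATA generation -> nominal link speed in Gbps.
GEN_GBPS = {1: 1.5, 2: 3.0, 3: 6.0}


def decode_sstatus(hex_text):
    """Split an SStatus value (hex string as printed by the kernel) into fields."""
    value = int(hex_text, 16)
    return {"det": value & 0xF, "spd": (value >> 4) & 0xF, "ipm": (value >> 8) & 0xF}


def summarize(dmesg_text):
    """Return {port: info} with the last link speed and whether it was limited."""
    ports = {}
    for line in dmesg_text.splitlines():
        up = LINK_UP.search(line)
        limited = LIMITED.search(line)
        if up:
            info = ports.setdefault(up.group(1), {"limited": False})
            info["gbps"] = float(up.group(2))
            info["sstatus"] = decode_sstatus(up.group(3))
        elif limited:
            ports.setdefault(limited.group(1), {"limited": False})["limited"] = True
    return ports


SAMPLE = """\
[ 22.059085] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 37.533886] ata3: limiting SATA link speed to 1.5 Gbps
[ 38.025288] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
"""

for port, info in summarize(SAMPLE).items():
    print(port, info)
```

      Feeding it the full output of 'dmesg' shows at a glance which ports the kernel had to throttle.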
    • Sean wrote:

      Is iozone included in some OMV plugin?
      It's a simple 'sudo apt install iozone3'. Then I would recommend changing to the mountpoint in question and running:

      iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

      The test size is just 100 MB, but it will be transferred 6 times (write, rewrite, read, reread, and random write and read), so you end up with 600 MB of data transmitted. If you suffer from bad cabling, this will trigger enough retransmits for you to identify problems.

      BTW: Your dmesg output is about negotiation problems between the SATA controller and one disk, not between 'system and the SATA controller' (that would be PCIe and looks different).
    • You're using Debian Wheezy? Hmm... no idea, I've no Wheezy around anymore (even Jessie is already outdated as hell). Using iozone isn't that important, though testing both sequential and random I/O can be useful in case you run into trouble and have to provide logs and test results.

      To push around some data even dd is sufficient.
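
      dd only moves data around. If you also want to notice corruption that slipped past the link CRC, a write-and-read-back check is one option. Below is a minimal Python sketch under stated assumptions: the function name and sizes are arbitrary, and the posix_fadvise call is only an advisory hint to drop cached pages, so part of the read-back may still come from RAM rather than from the disk.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # 1 MiB per write


def write_and_verify(directory, total_mib=16):
    """Write random data into `directory`, read it back, compare SHA-256 digests."""
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, prefix="burnin-")
    try:
        with os.fdopen(fd, "wb") as handle:
            for _ in range(total_mib):
                chunk = os.urandom(CHUNK)
                written.update(chunk)
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the device
            if hasattr(os, "posix_fadvise"):
                # Advisory only: ask the kernel to forget the cached pages.
                os.posix_fadvise(handle.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        read_back = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(CHUNK), b""):
                read_back.update(chunk)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)  # always clean up the test file
```

      Call it with the mountpoint of the disk under test, for example write_and_verify("/path/to/mountpoint", total_mib=1024) (the path is a placeholder), and check SMART attribute 199 afterwards as described above.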
    • In the meantime, I managed to install the backport kernel 3.16 (as gderf suggested), which didn't resolve the timeout problem.

      But I also thought about what tkaiser said:

      tkaiser wrote:

      BTW: Your dmesg output is about negotiation problems between the SATA controller and one disk, not between 'system and the SATA controller' (that would be PCIe and looks different).
      If that's the case, then it's easy to find out if it's the cable, the disk, or the controller. So I swapped ata3 and ata4, and moved the controller to a different PCI slot to be completely sure. Result: the problem is still with ata3, which means it can't be the cable or the disk.

      Here is the output before the change:

      [ 28.397697] ata1: SATA max UDMA/133 abar m2048@0x91213000 port 0x91213100 irq 141
      [ 28.397703] ata2: SATA max UDMA/133 abar m2048@0x91213000 port 0x91213180 irq 141
      [ 28.415522] ata3: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010100 irq 142
      [ 28.415527] ata4: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010180 irq 142
      [ 28.415531] ata5: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010200 irq 142
      [ 28.415535] ata6: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010280 irq 142
      [ 28.894034] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.896618] ata2.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.896627] ata2.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
      [ 28.901540] ata2.00: configured for UDMA/133
      [ 28.902004] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.904572] ata6.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.904579] ata6.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 28.906003] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.908585] ata5.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.908588] ata5.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 28.909461] ata6.00: configured for UDMA/133
      [ 28.913470] ata5.00: configured for UDMA/133
      [ 28.938020] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.940615] ata4.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.940623] ata4.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 28.945508] ata4.00: configured for UDMA/133
      [ 28.945554] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 28.958235] ata1.00: ATA-8: SAMSUNG HD501LJ, CR100-10, max UDMA7
      [ 28.958239] ata1.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
      [ 28.960292] ata1.00: configured for UDMA/133
      [ 40.172471] ata3: softreset failed (1st FIS failed)
      [ 40.662135] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 45.663860] ata3.00: qc timeout (cmd 0xec)
      [ 45.663871] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
      [ 56.178433] ata3: softreset failed (1st FIS failed)
      [ 56.671706] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 56.671808] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100)
      [ 56.671813] ata3: limiting SATA link speed to 1.5 Gbps
      [ 62.177633] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      [ 62.204092] ata3.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 62.204100] ata3.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 62.209022] ata3.00: configured for UDMA/133

      And here after:

      [ 28.397697] ata1: SATA max UDMA/133 abar m2048@0x91213000 port 0x91213100 irq 141
      [ 28.397703] ata2: SATA max UDMA/133 abar m2048@0x91213000 port 0x91213180 irq 141
      [ 28.415522] ata3: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010100 irq 142
      [ 28.415527] ata4: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010180 irq 142
      [ 28.415531] ata5: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010200 irq 142
      [ 28.415535] ata6: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010280 irq 142
      [ 28.894034] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.896618] ata2.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.896627] ata2.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
      [ 28.901540] ata2.00: configured for UDMA/133
      [ 28.902004] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.904572] ata6.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.904579] ata6.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 28.906003] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.908585] ata5.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.908588] ata5.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 28.909461] ata6.00: configured for UDMA/133
      [ 28.913470] ata5.00: configured for UDMA/133
      [ 28.938020] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      [ 28.940615] ata4.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 28.940623] ata4.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 28.945508] ata4.00: configured for UDMA/133
      [ 28.945554] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 28.958235] ata1.00: ATA-8: SAMSUNG HD501LJ, CR100-10, max UDMA7
      [ 28.958239] ata1.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
      [ 28.960292] ata1.00: configured for UDMA/133
      [ 40.172471] ata3: softreset failed (1st FIS failed)
      [ 40.662135] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 45.663860] ata3.00: qc timeout (cmd 0xec)
      [ 45.663871] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
      [ 56.178433] ata3: softreset failed (1st FIS failed)
      [ 56.671706] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
      [ 56.671808] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100)
      [ 56.671813] ata3: limiting SATA link speed to 1.5 Gbps
      [ 62.177633] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
      [ 62.204092] ata3.00: ATA-9: HGST HDN728080ALE604, A4GNW91X, max UDMA/133
      [ 62.204100] ata3.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
      [ 62.209022] ata3.00: configured for UDMA/133
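
      To see which controller and which port an ataN name belongs to, the "abar ... port ..." lines can be decoded. In AHCI the per-port register blocks start at offset 0x100 from the controller base (abar) and are 0x80 bytes apart, so the port index follows from the two addresses. A small Python sketch with a made-up function name, using lines copied from the log above:

```python
import re

PORT_LINE = re.compile(
    r"(ata\d+): SATA max \S+ abar m\d+@(0x[0-9a-fA-F]+) port (0x[0-9a-fA-F]+)"
)


def map_ports(dmesg_text):
    """Return {ata_name: (controller_base_hex, ahci_port_index)}."""
    mapping = {}
    for match in PORT_LINE.finditer(dmesg_text):
        name = match.group(1)
        abar = int(match.group(2), 16)
        port = int(match.group(3), 16)
        # Port register blocks: base + 0x100 + index * 0x80.
        mapping[name] = (hex(abar), (port - abar - 0x100) // 0x80)
    return mapping


SAMPLE = """\
[ 28.397697] ata1: SATA max UDMA/133 abar m2048@0x91213000 port 0x91213100 irq 141
[ 28.397703] ata2: SATA max UDMA/133 abar m2048@0x91213000 port 0x91213180 irq 141
[ 28.415522] ata3: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010100 irq 142
[ 28.415527] ata4: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010180 irq 142
[ 28.415531] ata5: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010200 irq 142
[ 28.415535] ata6: SATA max UDMA/133 abar m2048@0x91010000 port 0x91010280 irq 142
"""

for name, (controller, index) in map_ports(SAMPLE).items():
    print(f"{name}: controller at {controller}, AHCI port {index}")
```

      For this log, ata1 and ata2 sit on the controller at 0x91213000, and ata3 to ata6 are ports 0 to 3 of the controller at 0x91010000. The index is the AHCI register index; it need not match the numbering printed on the card.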

      Is there any explanation other than a faulty controller?

      And the next question: Is it just a weird coincidence that I bought faulty cables and a faulty controller (I'm sure the cables were faulty because I haven't had CRC errors since I removed them), or could the cables actually have broken the controller?

      And a final question for tkaiser:

      tkaiser wrote:

      Well, while I would never use a combination like yours
      What would you recommend for a system that is capable of handling >= 10 disks? If I have to return the controller, I might as well buy a different one...
    • Sean wrote:

      Is there any explanation other than a faulty controller?
      I kept writing about 'connectors/cables' and '50 matings max' above for a reason: cheap stuff is unreliable. It seems the 4th port of one of your two cards (ata3) has negotiation problems. Since I'm an electrics noob and never looked deeper into the electrical aspects of SATA, I can only guess: maybe a SATA PHY needs some additional components (resistors, whatever), or maybe it's just a corroded contact (your SATA controllers are quite old already).

      Can't help with the other question since I would never try to deal with more than 4 disks at home (I learned to hate RAIDs over the last two decades, but that comes from dealing mostly with failures all the time). Anyway: if I were to set something up with that many disks, I would choose an enclosure featuring a backplane to insert disks directly (less hassle) and controllers featuring at least mini-SAS sockets (SFF-8087).