SATA errors

  • Hi,


    I finally got all the HDDs for my NAS, but now I see these errors:



    When trying to format one of the drives through the web interface, I also sometimes got long delays, followed by "communication problem".


    I think the problem might be related to my SATA controller. It's this one:


    https://www.amazon.de/gp/produ…age_o00_s00?ie=UTF8&psc=1


    EDIT: According to the manual, the controller should be supported by Linux kernel versions >= 2.6.


    This is my mainboard:


    https://www.amazon.de/Asrock-J…-1&keywords=j3455m+asrock



    The HDDs are HGST NAS 8TB.



    I am using OMV 2.2. Could the problem be fixed in 3.1 thanks to its newer Linux kernel?

  • I think the problem might be related to my SATA controller

    Well, while I would never use a combination like yours, I would first check for cabling/connectivity problems. Check the 'smartctl -a -x' output for each drive and watch out for SMART attribute 199.


    And maybe use a 2nd host with a well-known SATA implementation to do individual burn-in tests (at least a few heavy iozone runs, checking the SMART output afterwards).
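
    The per-drive check of attribute 199 could be scripted roughly like this. It is only a sketch: it assumes smartmontools is installed, that the drives show up as /dev/sdX, and that the raw value sits in the 10th column of the `smartctl -A` attribute table (the usual layout). The function names `crc_error_count` and `report_crc_errors` are made up for this example.

    ```shell
    # Print the raw value (10th column) of SMART attribute 199
    # from `smartctl -A` text supplied on stdin.
    crc_error_count() {
        awk '$1 == "199" { print $10 }'
    }

    # Print "<device> <raw value of attribute 199>" for every /dev/sdX.
    # Needs root and smartmontools; nothing runs until the function is called.
    report_crc_errors() {
        for dev in /dev/sd?; do
            [ -b "$dev" ] || continue
            printf '%s %s\n' "$dev" "$(smartctl -A "$dev" | crc_error_count)"
        done
    }
    ```

    Calling `report_crc_errors` as root prints one line per disk; any value other than 0 is worth a closer look at the cable and connectors of that disk.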

  • Thanks for all the replies! The SMART attribute 199 raw value for one of the disks was 34; all the others had 0.


    I replaced the SATA cable and haven't had any trouble since (that was about an hour ago).


    Can a faulty cable cause this kind of error?

  • Can a faulty cable cause this kind of error?

    Of course. This counter reports CRC errors. ATA/SATA uses primitive checksumming to detect data corruption on the wire: the sender calculates a CRC and sends it together with the data to the receiver. There the CRC of the data gets calculated again, and if a mismatch occurs the receiver asks for a retransmit. If the disk controller is not crappy, it also increments SMART attribute 199 to inform you there's something wrong (some disks don't).
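
    To make the sender/receiver idea concrete, here is a toy illustration using the POSIX `cksum` tool. It is not the CRC that SATA hardware actually computes on a frame; it only shows the principle of recalculating a checksum on the receiving side and comparing. `crc_of` and `frame_ok` are names invented for this sketch.

    ```shell
    # CRC of a string, as computed by POSIX cksum (first output field).
    crc_of() {
        printf '%s' "$1" | cksum | cut -d' ' -f1
    }

    # Receiver side: recompute the CRC of the received data ($1) and
    # compare it with the CRC the sender transmitted ($2).
    # Exit status 0 means "accept", non-zero means "ask for a retransmit".
    frame_ok() {
        [ "$(crc_of "$1")" = "$2" ]
    }
    ```

    For example, with `sent=$(crc_of "hello")`, the call `frame_ok "hello" "$sent"` succeeds, while `frame_ok "hellp" "$sent"` fails and would trigger a retransmit.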


    SMART attribute 199 is the first thing to check after attaching any disk to a host and running a small benchmark. On crappy disks that don't increment attribute 199, the retransmits still slow down sequential performance, so even in such cases you're most probably able to identify the problem.


    BTW: Internal SATA connectors/cables are rated for a maximum of 50 matings, and the cheap crap from Aliexpress/eBay/whatever dies way earlier.

  • BTW: ATA/SATA checksumming is rather primitive, so if cables/contacts corrupt data, not every occurrence will lead to a CRC mismatch and trigger a retransmit; you then end up with corrupted data on the storage. That's why you ALWAYS do a quick per-disk 'burn-in' using a benchmark like iozone after adding a disk to a host, followed by checking SMART attribute 199.

  • BTW: ATA/SATA checksumming is rather primitive, so if cables/contacts corrupt data, not every occurrence will lead to a CRC mismatch and trigger a retransmit; you then end up with corrupted data on the storage. That's why you ALWAYS do a quick per-disk 'burn-in' using a benchmark like iozone after adding a disk to a host, followed by checking SMART attribute 199.

    Is iozone included in some OMV plugin?


    BTW, I looked a bit closer at the "dmesg" output and found something that seems to indicate communication problems between the system and the SATA controller:


    (of course the controller should be capable of 6Gbps)


  • Is iozone included in some OMV plugin?

    It's a simple 'sudo apt install iozone3'. Then I would recommend changing to the mountpoint in question and running:


    Code
    iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

    The test size is just 100 MB, but it will be transferred 6 times (write, rewrite, read, reread, and random read and write) for each record size, so you end up with 600 MB transferred per record size. If you suffer from bad cabling, this will trigger enough retransmits for you to identify problems.
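
    The whole burn-in (read attribute 199, run the benchmark, read attribute 199 again) could be wrapped up roughly as follows. This is a sketch under assumptions: smartmontools and iozone are installed, the script runs as root, and the raw value is the 10th column of the `smartctl -A` table. The function names and the example paths are hypothetical.

    ```shell
    # Raw value of SMART attribute 199 for one device (needs root).
    read_crc() {
        smartctl -A "$1" | awk '$1 == "199" { print $10 }'
    }

    # Compare two readings of attribute 199.
    # Succeeds only if the counter did not grow.
    crc_unchanged() {
        old=$1
        new=$2
        echo "attribute 199: $old -> $new"
        [ "$new" -le "$old" ]
    }

    # burn_in DEVICE MOUNTPOINT
    # e.g. burn_in /dev/sdb /srv/disk1   (hypothetical paths)
    burn_in() {
        before=$(read_crc "$1")
        ( cd "$2" && iozone -e -I -a -s 100M -r 4k -r 16k -r 512k \
              -r 1024k -r 16384k -i 0 -i 1 -i 2 )
        after=$(read_crc "$1")
        crc_unchanged "$before" "$after"
    }
    ```

    A non-zero exit status from `burn_in` means the counter increased during the run, which points at the cable or connectors of that disk.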


    BTW: Your dmesg output is about negotiation problems between the SATA controller and one disk, not between 'system and the SATA controller' (that would be PCIe and looks different).
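
    The dmesg output itself is not preserved in this thread, so the following is only a sketch of how the negotiated link speed per port could be pulled out of the kernel log. It assumes the usual libata message of the form "ataN: SATA link up X Gbps (SStatus ... SControl ...)"; the function name `sata_link_speeds` is made up.

    ```shell
    # Read kernel log text on stdin and print "<port> <speed>" for
    # every "SATA link up" line; all other lines are ignored.
    sata_link_speeds() {
        sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]* Gbps\).*/\1 \2/p'
    }
    ```

    Used as `dmesg | sata_link_speeds`, this lists one line per link-up event. A port that shows up repeatedly, or at a lower speed than its neighbours, is the one having negotiation problems with its disk.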

  • I hope you won't lose your patience...


    Quote


    -bash: apt: command not found


    and if I do "apt-get install iozone3", I get:



    Quote


    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    E: Unable to locate package iozone3

  • You're using Debian Wheezy? Hmm... no idea, I've no Wheezy around anymore (even Jessie is already outdated as hell). Using iozone is not that important, though testing both sequential and random IO can be interesting in case you run into trouble and need to provide logs and test results.


    To push around some data even dd is sufficient.
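
    A minimal dd-based sketch of "pushing data around" could look like this. It assumes GNU dd (as shipped with Debian); the function name `push_data`, the scratch file name and the `USE_DIRECT` switch are invented for this example. Direct I/O is used by default so that the read really goes over the SATA link instead of coming from the page cache.

    ```shell
    # push_data DIR SIZE_MB
    # Write SIZE_MB MiB of zeros into DIR, read them back, then clean up.
    # Set USE_DIRECT=0 on filesystems that reject O_DIRECT (e.g. tmpfs).
    push_data() {
        dir=$1
        size_mb=$2
        file="$dir/dd-burnin.tmp"   # hypothetical scratch file name
        if [ "${USE_DIRECT:-1}" = 1 ]; then
            wflag="oflag=direct"
            rflag="iflag=direct"
        else
            wflag="conv=fsync"
            rflag=""
        fi
        dd if=/dev/zero of="$file" bs=1M count="$size_mb" $wflag &&
            dd if="$file" of=/dev/null bs=1M $rflag
        status=$?
        rm -f "$file"
        return $status
    }
    ```

    Something like `push_data /path/to/mountpoint 1024` moves 1 GiB in each direction; afterwards check SMART attribute 199 of the disk behind that mountpoint.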

  • In the meantime, I managed to install the backport kernel 3.16 (as gderf suggested), which didn't resolve the timeout problem.


    But I also thought about what tkaiser said:

    BTW: Your dmesg output is about negotiation problems between the SATA controller and one disk, not between 'system and the SATA controller' (that would be PCIe and looks different).

    If that's the case, then it's easy to find out whether it's the cable, the disk, or the controller. So I exchanged ata3 and ata4, and moved the controller to a different PCI slot to be completely sure. Result: the problem is still with ata3, which means it can't be the cable or the disk.


    Here is the output before the change:



    And here after:



    Is there any explanation other than a faulty controller?


    And the next question: Is it just a weird coincidence that I bought faulty cables and a faulty controller (I'm sure the cables are faulty because I haven't had any CRC errors since I removed them), or could it be that the cable actually broke the controller?


    And a final question for tkaiser:

    Well, while I would never use a combination like yours

    What would you recommend for a system that is capable of handling >= 10 disks? If I have to return the controller, I might as well buy a different one...
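
    A swap test like the one in this post depends on knowing which disk currently sits on which ataN port. On reasonably recent kernels the resolved sysfs path of a block device contains the port name, so a sketch like the following can list the mapping. The function names are made up, and the assumption that the path contains an "ataN" component does not hold for every driver or kernel version.

    ```shell
    # Extract the "ataN" component from a resolved sysfs path.
    # Prints nothing if the path has no such component (e.g. USB disks).
    ata_port_of_path() {
        printf '%s\n' "$1" | sed -n 's#.*/\(ata[0-9][0-9]*\)/.*#\1#p'
    }

    # Print "<disk> <ata port>" for every sdX block device.
    list_ata_ports() {
        for blk in /sys/block/sd*; do
            [ -e "$blk" ] || continue
            printf '%s %s\n' "${blk##*/}" \
                "$(ata_port_of_path "$(readlink -f "$blk")")"
        done
    }
    ```

    Running `list_ata_ports` before and after moving a cable shows whether the error messages follow the disk or stay with the port.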

  • OK, I did one more test: Moved the controller and one of the disks to my desktop computer.


    If I connect the disk to the first port ("ata3" in the OMV system), I get "unknown device" in the device manager. If I connect it to the second port ("ata4"), I get "HGST HDN..."


    I guess that settles it.

  • Is there any explanation other than a faulty controller?

    I kept writing about 'connectors/cables' and '50 matings max' above for a reason: cheap stuff is unreliable. It seems the 4th port of one of your two cards (ata3) has negotiation problems. I'm an electronics noob and never looked into the electrical aspects of SATA, so I can only guess: maybe a SATA PHY needs additional components (resistors, whatever), or maybe it's just a corroded contact (your SATA controllers are quite old already).


    Can't help with the other question since I would never try to deal with more than 4 disks at home (I learned to hate RAIDs over the last two decades, but that's because I mostly deal with failures all the time). Anyway: if I were to set something up with that many disks, I would choose an enclosure with a backplane to insert disks directly (less hassle) and controllers with at least mini SAS sockets (SFF-8087).

  • Hi @tkaiser


    Coming back to this thread.


    I did the suggested


    Command line used: iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2


    on my new WD Red 6 TB (no RAID).


    Copying is slow and I am looking into issues.


    Can you help to interpret if


    Code
    Processor cache size set to 1024 kBytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
                                                              random   random     bkwd   record   stride
           kB   reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
       102400        4    58512    71277    78700    76532      944     2681
       102400       16    71268   104658   134803   146002     3800    10897
       102400      512   119992   131178   149356   178384    60359   107270
       102400     1024   110612   127789   143063   178356    86468   110789
       102400    16384   111621   141307   158813   184863   138158   131632



    indicates obvious (e.g. SATA or cable) errors?
    THX! Nico

    OMV: 4.1.19-1
    HW: Athlon 200GE / Gigabyte Aorus M (B450) / 16 GB RAM
    Boot Drive: Kingston 240GB nVME
    Data Drives: 3 HDD
