SATA errors

  • Hi,


    I finally got all the HDDs for my NAS, but now I see these errors:



    When trying to format one of the drives through the web interface, I also sometimes got long delays, followed by "communication problem".


    I think the problem might be related to my SATA controller; it's this one:


    https://www.amazon.de/gp/produ…age_o00_s00?ie=UTF8&psc=1


    EDIT: According to the manual, the controller should be supported by Linux kernel versions >= 2.6.


    This is my mainboard:


    https://www.amazon.de/Asrock-J…-1&keywords=j3455m+asrock



    The HDDs are HGST NAS 8TB.



    I am using OMV 2.2. Could the problem be fixed in 3.1 thanks to the newer Linux kernel?

  • Do you have the backport kernel installed? If not, you should try it.

    --
    Google is your friend and Bob's your uncle!


    OMV AMD64 7.x on headless Chenbro NR12000 1U 1x 8m Quad Core E3-1220 3.1GHz 32GB ECC RAM.

  • I think the problem might be related to my SATA controller

    Well, while I would never use a combination like yours, I would first check for cabling/connectivity problems. Check the 'smartctl -a -x' output for each drive and watch out for SMART attribute 199.


    And maybe use a second host with a well-known SATA implementation to do individual burn-in tests (at least running a few heavy iozone runs and checking the SMART output afterwards).
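
    A minimal sketch of that check, assuming smartmontools is installed and the drives print the usual `smartctl -A` attribute table (ID in the first column, raw value in the tenth). The helper names `crc_raw_value` and `report_crc` are made up for illustration:

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value of
# SMART attribute 199 (usually named UDMA_CRC_Error_Count).
crc_raw_value() {
    awk '$1 == 199 { print $10 }'
}

# Print "<device>: <raw value>" for every device given as an argument.
# Needs root; a device that does not report attribute 199 prints an
# empty value.
report_crc() {
    for dev in "$@"; do
        printf '%s: %s\n' "$dev" "$(smartctl -A "$dev" | crc_raw_value)"
    done
}
```

    Run it as root, for example `report_crc /dev/sda /dev/sdb`. Any non-zero raw value is a reason to take a closer look at that drive's cable and connectors.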

  • Thanks for all the replies! The raw value of SMART attribute 199 was 34 for one of the disks; all the others had 0.


    I replaced the SATA cable, and haven't had any trouble since (about an hour ago).


    Can a faulty cable cause this kind of error?

  • Can a faulty cable cause this kind of error?

    Of course. This counter reports CRC errors. ATA/SATA uses primitive checksumming to identify data corruption on the wire: the sender calculates a CRC and sends it together with the data to the receiver. There the CRC of the data is calculated again; if a mismatch occurs, the receiver asks for a retransmit and, if the disk controller is not crappy, increments SMART attribute 199 to inform you that something is wrong (some disks don't).


    SMART attribute 199 is the first thing to check when attaching any disk to a host, after running a small benchmark. On crappy disks that don't increment attribute 199, the number of retransmits still slows sequential performance down, so even in such cases you're most probably able to identify the problem.


    BTW: Internal SATA connectors/cables are rated for a maximum of 50 matings, and the cheap crap from Aliexpress/eBay/whatever dies way earlier.

  • BTW: ATA/SATA checksumming is rather primitive, so if cables/contacts corrupt data, not every occurrence will lead to a CRC mismatch and trigger a retransmit; some corrupted data ends up on the storage unnoticed. That's why you ALWAYS do a quick per-disk 'burn-in' using a benchmark like iozone after you add a disk to a host, followed by checking SMART attribute 199.
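
    One way to make the "benchmark, then check 199" step less error-prone is to compare the counter before and after the run. The sketch below assumes two saved `smartctl -A` outputs in the usual table layout; the function name `crc_delta` and the file names in the usage note are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: given two saved `smartctl -A` outputs (before and
# after the burn-in), print how much the raw value of attribute 199 grew.
# A missing attribute is treated as 0.
crc_delta() {
    before=$(awk '$1 == 199 { print $10 }' "$1")
    after=$(awk '$1 == 199 { print $10 }' "$2")
    echo $(( ${after:-0} - ${before:-0} ))
}
```

    Save `smartctl -A /dev/sdX` to a file before the benchmark and again afterwards, then run `crc_delta before.txt after.txt`. A result above 0 means the link produced CRC errors during the run.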

  • BTW: ATA/SATA checksumming is rather primitive, so if cables/contacts corrupt data, not every occurrence will lead to a CRC mismatch and trigger a retransmit; some corrupted data ends up on the storage unnoticed. That's why you ALWAYS do a quick per-disk 'burn-in' using a benchmark like iozone after you add a disk to a host, followed by checking SMART attribute 199.

    Is iozone included in some omv plugin?


    BTW, I looked a bit closer at the "dmesg" output and found something that seems to indicate communication problems between the system and the SATA controller:


    (of course the controller should be capable of 6Gbps)


  • Is iozone included in some omv plugin?

    It's a simple 'sudo apt install iozone3' and then I would recommend doing a chdir to the mountpoint in question and then


    Code
    iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2

    Test size is just 100 MB, but for each record size it will be transferred 6 times (write, rewrite, read, reread, and random read and write), so you end up with 600 MB of data transmitted per record size. If you suffer from bad cabling, this will trigger enough retransmits so you're able to identify problems.


    BTW: Your dmesg output is about negotiation problems between SATA controller and one disk but not 'system and the SATA controller' (that would be PCIe and looks different)
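
    Since libata prefixes its kernel messages with the port name (`ata3:`, or `ata3.00:` for the attached device), the negotiation messages for a single port can be pulled out of the log with a small filter. This is only a sketch; the function name `ata_port_lines` is invented here:

```shell
#!/bin/sh
# Read kernel log lines on stdin and keep only those for one ATA port.
# $1 is the port number, e.g. 3 for ata3 (this also matches devices on
# that port such as ata3.00, but not ata30 or ata13).
ata_port_lines() {
    grep -E "ata$1(\.[0-9]+)?:"
}
```

    Usage: `dmesg | ata_port_lines 3` shows everything the kernel logged for ata3, which makes it easier to see whether the trouble follows the port, the cable, or the disk when you swap them.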

  • I hope you won't lose your patience...


    Quote


    -bash: apt: command not found


    and if I do " apt-get install iozone3 ", I get:



    Quote


    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    E: Unable to locate package iozone3

  • You're using Debian Wheezy? Hmm... no idea, I've no Wheezy around anymore (even Jessie is already outdated as hell). It's not that important to use iozone, though testing both sequential and random IO can be interesting in case you run into trouble and need to provide logs and test results.


    To push around some data even dd is sufficient.
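
    A rough sketch of such a dd run, assuming GNU dd (for `bs=1M` and `conv=fsync`). The function name `push_data` and the file name `ddtest.bin` are made up for this example:

```shell
#!/bin/sh
# Write a test file of $2 MiB into directory $1, read it back, print the
# number of bytes found on disk, then delete the file again.
# dd's own throughput statistics go to stderr.
push_data() {
    file="$1/ddtest.bin"
    # conv=fsync makes dd flush the data to the disk before it exits.
    dd if=/dev/zero of="$file" bs=1M count="$2" conv=fsync || return 1
    # The read may be served from the page cache unless the file is
    # larger than RAM or the cache has been dropped beforehand.
    dd if="$file" of=/dev/null bs=1M || return 1
    echo $(( $(wc -c < "$file") ))
    rm -f "$file"
}
```

    Call it with the mountpoint of the disk under test and a size in MiB, for example `push_data /path/to/mountpoint 1000`, and check SMART attribute 199 afterwards.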

  • In the meantime, I managed to install the backport kernel 3.16 (as gderf suggested), which didn't resolve the timeout problem.


    But I also thought about what tkaiser said:

    BTW: Your dmesg output is about negotiation problems between SATA controller and one disk but not 'system and the SATA controller' (that would be PCIe and looks different)

    If that's the case, then it's easy to find out if it's the cable, the disk, or the controller. So I exchanged ata3 and ata4, and moved the controller to a different PCI slot to be completely sure. Result: the problem is still with ata3, which means it can't be the cable or the disk.


    Here is the output before the change:



    And here after:



    Is there any explanation other than a faulty controller?


    And the next question: Is it just a weird coincidence that I bought faulty cables and a faulty controller (I'm sure the cables are faulty because I haven't had any CRC errors since I removed them), or could it be that the cable actually broke the controller?


    And a final question for tkaiser:

    Well, while I would never use such a combination like yours

    What would you recommend for a system that is capable of handling >= 10 disks? If I have to return the controller, I might as well buy a different one...

  • OK, I did one more test: Moved the controller and one of the disks to my desktop computer.


    If I connect the disk to the first port ("ata3" in the omv system), I get "unknown device" in the device manager. If I connect it to the second port ("ata4"), I get "HGST HDN..."


    I guess that settles it.

  • Is there any explanation other than a faulty controller?

    I kept pointing out 'connectors/cables' and '50 matings max' above for a reason: cheap stuff is unreliable. It seems the 4th port of one of your two cards (ata3) has negotiation problems. Since I'm an electrics noob and never looked more into the electrical aspects of SATA, I can only guess: maybe a SATA PHY needs additional components (resistors, whatever), or maybe it's just a corroded contact (your SATA controllers are quite old already).


    Can't help with the other question since I would never try to deal with more than 4 disks at home (I learned to hate RAIDs in the last two decades, but that's due to dealing mostly with failures all the time). Anyway: If I were to set something up with that many disks, I would choose an enclosure featuring a backplane to insert disks directly (less hassle) and controllers featuring at least mini SAS sockets (SFF-8087).

  • Hi @tkaiser


    Coming back to this thread.


    I did the suggested


    Command line used: iozone -e -I -a -s 100M -r 4k -r 16k -r 512k -r 1024k -r 16384k -i 0 -i 1 -i 2


    on my new WD Red 6 TB (no RAID).


    Copying is slow and I am looking into issues.


    Can you help me interpret whether


    Code
    Processor cache size set to 1024 kBytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
                                                             random   random     bkwd   record   stride
           kB   reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
       102400        4    58512    71277    78700    76532      944     2681
       102400       16    71268   104658   134803   146002     3800    10897
       102400      512   119992   131178   149356   178384    60359   107270
       102400     1024   110612   127789   143063   178356    86468   110789
       102400    16384   111621   141307   158813   184863   138158   131632



    indicates obvious (e.g. SATA or cable) errors?
    Thanks! Nico

    OMV: 4.1.19-1
    HW: Athlon 200GE / Gigabyte Aorus M (B450) / 16 GB RAM
    Boot Drive: Kingston 240GB nVME
    Data Drives: 3 HDD
