SMART status RED, but is it really an issue?

  • Hi folks,


    the SMART status for one disk of my NAS is red. The reason is attribute 5, Reallocated_Sector_Ct. I guess this is because of a non-zero raw value. But why is it marked red when the reallocated sectors are hidden from the operating system? Moreover, the normalized value is 100 and the threshold is 5.


    I think this is an error and it should remain green in this case.


    What do you think? Is there a dev list where I can raise this kind of remark?


    Laurent

  • It is a sign of a dying disk.


    1. Backup data on that disk.
    2. Buy a replacement.
    3. Replace it.


    The disk may last five minutes. Or five weeks. Or whatever. But you should stop trusting it with anything you don't have backed up.


    It's a little like the joke about the drowning man. Ignore the warning at your own peril.

    Be smart - be lazy. Clone your rootfs. This help is Grateful™.
    OMV 4: 9 x Odroid HC2 + 1 x Odroid HC1 + 1 x Raspberry Pi 4

  • Here is the smartctl -x output:



    root@nas:~/SavOVH# smartctl -x /dev/sdb
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.4-amd64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, http://www.smartmontools.org



    === START OF INFORMATION SECTION ===
    Model Family: Hitachi Ultrastar 7K3000
    Device Model: Hitachi HUA723020ALA640
    Serial Number: MK0A31YVGT4YKK
    LU WWN Device Id: 5 000cca 234cafb80
    Firmware Version: MK7OA5C0
    User Capacity: 2 000 398 934 016 bytes [2,00 TB]
    Sector Size: 512 bytes logical/physical
    Rotation Rate: 7200 rpm
    Form Factor: 3.5 inches
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: ATA8-ACS T13/1699-D revision 4
    SATA Version is: SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is: Tue May 7 08:38:25 2019 CEST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    AAM feature is: Unavailable
    APM feature is: Disabled
    Rd look-ahead is: Enabled
    Write cache is: Disabled
    ATA Security is: Disabled, NOT FROZEN [SEC1]
    Wt Cache Reorder: Enabled



    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED



    General SMART Values:
    Offline data collection status: (0x82) Offline data collection activity
    was completed without error.
    Auto Offline Data Collection: Enabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline
    data collection: ( 28) seconds.
    Offline data collection
    capabilities: (0x5b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine
    recommended polling time: ( 1) minutes.
    Extended self-test routine
    recommended polling time: ( 327) minutes.
    SCT capabilities: (0x003d) SCT Status supported.
    SCT Error Recovery Control supported.
    SCT Feature Control supported.
    SCT Data Table supported.



  • Error 2 [1] occurred at disk power-on lifetime: 10010 hours (417 days + 2 hours)
    When the command that caused the error occurred, the device was active or idle.



    After command completion occurred, registers were:
    ER -- ST COUNT LBA_48 LH LM LL DV DC
    -- -- -- == -- == == == -- -- -- -- --
    40 -- 51 00 5d 00 00 18 fa da a3 08 00 Error: UNC 93 sectors at LBA = 0x18fadaa3 = 419093155



    Commands leading to the command that caused the error were:
    CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
    -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
    25 d0 d0 01 00 00 00 18 fa da 00 40 00 2d+20:15:46.645 READ DMA EXT
    25 d0 d0 01 00 00 00 18 fa d9 00 40 00 2d+20:15:46.644 READ DMA EXT
    25 d0 d0 01 00 00 00 18 fa d8 00 40 00 2d+20:15:46.643 READ DMA EXT
    25 d0 d0 01 00 00 00 18 fa d7 00 40 00 2d+20:15:46.642 READ DMA EXT
    25 d0 d0 01 00 00 00 18 fa d6 00 40 00 2d+20:15:46.641 READ DMA EXT



    Error 1 [0] occurred at disk power-on lifetime: 10010 hours (417 days + 2 hours)
    When the command that caused the error occurred, the device was active or idle.



    After command completion occurred, registers were:
    ER -- ST COUNT LBA_48 LH LM LL DV DC
    -- -- -- == -- == == == -- -- -- -- --
    40 -- 51 00 8f 00 00 08 f0 66 71 08 00 Error: UNC 143 sectors at LBA = 0x08f06671 = 149972593



    Commands leading to the command that caused the error were:
    CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
    -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
    25 03 d0 01 00 00 00 08 f0 66 00 40 00 2d+19:51:57.760 READ DMA EXT
    25 03 d0 01 00 00 00 08 f0 65 00 40 00 2d+19:51:57.759 READ DMA EXT
    25 03 d0 01 00 00 00 08 f0 64 00 40 00 2d+19:51:57.758 READ DMA EXT
    25 03 d0 01 00 00 00 08 f0 63 00 40 00 2d+19:51:57.757 READ DMA EXT
    25 03 d0 01 00 00 00 08 f0 62 00 40 00 2d+19:51:57.756 READ DMA EXT



    SMART Extended Self-test Log Version: 1 (1 sectors)
    Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    # 1 Short offline Completed without error 00% 13445 -
    # 2 Extended offline Completed without error 00% 13439 -
    # 3 Short offline Completed without error 00% 13415 -
    # 4 Short offline Completed without error 00% 13396 -
    # 5 Short offline Completed without error 00% 13388 -
    # 6 Short offline Completed without error 00% 13377 -
    # 7 Short offline Completed without error 00% 13373 -
    # 8 Short offline Completed without error 00% 13364 -
    # 9 Short offline Completed without error 00% 11385 -



    SMART Selective self-test log data structure revision number 1
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.


    SCT Status Version: 3
    SCT Version (vendor specific): 256 (0x0100)
    SCT Support Level: 1
    Device State: Active (0)
    Current Temperature: 30 Celsius
    Power Cycle Min/Max Temperature: 24/38 Celsius
    Lifetime Min/Max Temperature: 18/53 Celsius
    Under/Over Temperature Limit Count: 0/0



    SCT Temperature History Version: 2
    Temperature Sampling Period: 1 minute
    Temperature Logging Interval: 1 minute
    Min/Max recommended Temperature: 0/60 Celsius
    Min/Max Temperature Limit: -40/70 Celsius
    Temperature History Size (Index): 128 (17)

  • Index Estimated Time Temperature Celsius
    18 2019-05-07 06:31 25 ******
    ... ..( 28 skipped). .. ******
    47 2019-05-07 07:00 25 ******
    48 2019-05-07 07:01 24 *****
    49 2019-05-07 07:02 25 ******
    50 2019-05-07 07:03 24 *****
    51 2019-05-07 07:04 24 *****
    52 2019-05-07 07:05 25 ******
    53 2019-05-07 07:06 24 *****
    ... ..( 54 skipped). .. *****
    108 2019-05-07 08:01 24 *****
    109 2019-05-07 08:02 25 ******
    110 2019-05-07 08:03 26 *******
    111 2019-05-07 08:04 26 *******
    112 2019-05-07 08:05 27 ********
    113 2019-05-07 08:06 28 *********
    114 2019-05-07 08:07 28 *********
    115 2019-05-07 08:08 28 *********
    116 2019-05-07 08:09 29 **********
    117 2019-05-07 08:10 30 ***********
    118 2019-05-07 08:11 30 ***********
    119 2019-05-07 08:12 31 ************
    ... ..( 2 skipped). .. ************
    122 2019-05-07 08:15 31 ************
    123 2019-05-07 08:16 30 ***********
    124 2019-05-07 08:17 30 ***********
    125 2019-05-07 08:18 31 ************
    126 2019-05-07 08:19 31 ************
    127 2019-05-07 08:20 31 ************
    0 2019-05-07 08:21 32 *************
    1 2019-05-07 08:22 31 ************
    2 2019-05-07 08:23 31 ************
    3 2019-05-07 08:24 31 ************
    4 2019-05-07 08:25 30 ***********
    ... ..( 2 skipped). .. ***********
    7 2019-05-07 08:28 30 ***********
    8 2019-05-07 08:29 29 **********
    ... ..( 3 skipped). .. **********
    12 2019-05-07 08:33 29 **********
    13 2019-05-07 08:34 28 *********
    14 2019-05-07 08:35 28 *********
    15 2019-05-07 08:36 29 **********
    16 2019-05-07 08:37 30 ***********
    17 2019-05-07 08:38 30 ***********



    SCT Error Recovery Control:
    Read: Disabled
    Write: Disabled



    Device Statistics (GP Log 0x04)
    Page Offset Size Value Flags Description
    0x01 ===== = = === == General Statistics (rev 1) ==
    0x01 0x008 4 502 --- Lifetime Power-On Resets
    0x01 0x010 4 13466 --- Power-on Hours
    0x01 0x018 6 49845523561 --- Logical Sectors Written
    0x01 0x020 6 447958117 --- Number of Write Commands
    0x01 0x028 6 755837325616 --- Logical Sectors Read
    0x01 0x030 6 3199198841 --- Number of Read Commands
    0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
    0x03 0x008 4 11525 --- Spindle Motor Power-on Hours
    0x03 0x010 4 11525 --- Head Flying Hours
    0x03 0x018 4 864 --- Head Load Events
    0x03 0x020 4 90 --- Number of Reallocated Logical Sectors
    0x03 0x028 4 4679 --- Read Recovery Attempts
    0x03 0x030 4 0 --- Number of Mechanical Start Failures
    0x04 ===== = = === == General Errors Statistics (rev 1) ==
    0x04 0x008 4 3 --- Number of Reported Uncorrectable Errors
    0x04 0x010 4 15 --- Resets Between Cmd Acceptance and Completion
    0x05 ===== = = === == Temperature Statistics (rev 1) ==
    0x05 0x008 1 30 --- Current Temperature
    0x05 0x010 1 31 N-- Average Short Term Temperature
    0x05 0x018 1 37 N-- Average Long Term Temperature
    0x05 0x020 1 53 --- Highest Temperature
    0x05 0x028 1 18 --- Lowest Temperature
    0x05 0x030 1 47 N-- Highest Average Short Term Temperature
    0x05 0x038 1 23 N-- Lowest Average Short Term Temperature
    0x05 0x040 1 38 N-- Highest Average Long Term Temperature
    0x05 0x048 1 24 N-- Lowest Average Long Term Temperature
    0x05 0x050 4 0 --- Time in Over-Temperature
    0x05 0x058 1 60 --- Specified Maximum Operating Temperature
    0x05 0x060 4 0 --- Time in Under-Temperature
    0x05 0x068 1 0 --- Specified Minimum Operating Temperature
    0x06 ===== = = === == Transport Statistics (rev 1) ==
    0x06 0x008 4 1999 --- Number of Hardware Resets
    0x06 0x010 4 1050 --- Number of ASR Events
    0x06 0x018 4 0 --- Number of Interface CRC Errors
    |||_ C monitored condition met
    ||__ D supports DSN
    |___ N normalized value



    SATA Phy Event Counters (GP Log 0x11)
    ID Size Value Description
    0x0001 2 0 Command failed due to ICRC error
    0x0002 2 0 R_ERR response for data FIS
    0x0003 2 0 R_ERR response for device-to-host data FIS
    0x0004 2 0 R_ERR response for host-to-device data FIS
    0x0005 2 0 R_ERR response for non-data FIS
    0x0006 2 0 R_ERR response for device-to-host non-data FIS
    0x0007 2 0 R_ERR response for host-to-device non-data FIS
    0x0009 2 5 Transition from drive PhyRdy to drive PhyNRdy
    0x000a 2 4 Device-to-host register FISes sent due to a COMRESET
    0x000b 2 0 CRC errors within host-to-device FIS
    0x000d 2 0 Non-CRC errors within host-to-device FIS

  • Here it is: ix.io/1IhJ

    Your disk logged at least several uncorrectable errors at '10010 hours (417 days + 2 hours)', which was 144 days ago: (13466-10010)/24.
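    For reference, the day count follows directly from the two power-on-hour readings in the smartctl output above:

```shell
# 13466 = current Power-on Hours, 10010 = hours at the last logged error
hours_now=13466
hours_at_error=10010
echo $(( (hours_now - hours_at_error) / 24 ))   # prints 144
```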


    Did you check the SMART health status within those last 144 days?


    As for the values: you should watch attributes 5 and 196 (they are related). If these increase over time, then I would agree this is a sign of a dying disk. If the values remain stable, I would treat this disk like any other disk with no 'abnormal' SMART values: be prepared that it can die at any time.
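    A quick way to pull just those two attributes out of the table (a sketch assuming the usual smartctl -A column layout, with the raw value in the last column):

```shell
# Demo on two lines in the shape of a typical `smartctl -A` attribute table;
# on the live system you would pipe `smartctl -A /dev/sdb` in instead.
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       90' \
  '196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       90' |
awk '$1 == 5 || $1 == 196 { print $1, $NF }'
```

    Logging that one-liner's output once a day gives you exactly the "stable vs. increasing" trend that matters here.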


    Ah, that's important: those UNC (uncorrectable) messages indicate data corruption, so if you're using an old filesystem, be prepared that some data is corrupted without you knowing which (with a modern filesystem you could run a scrub, identify the corrupted files, and hopefully restore them intact from a backup).

  • Thanks for your answer. I bought this NAS (HP MicroServer Gen7) last week with this disk and a second one just like it. Both had the same ATA issue at the same time. Since I bought it, OMV has performed a short test every day and a long test every month, and these values (5 & 196) remain the same.


    In fact, that was the point of my question: for some attributes (like 5, 196, or 199), why isn't the color state driven by the variation between two checks instead of by the direct raw values? And let the direct raw value drive the color state for attributes 197 & 198, or when normalized values fall below their thresholds?


    Laurent

  • for some attributes (like 5, 196, or 199), why isn't the color state driven by the variation between two checks instead of by the direct raw values?

    No idea. Maybe @votdev has a reason for doing it this way, maybe it's just a simple implementation. But I agree that a more sophisticated model for dealing with these 3 attributes (checking for changes over time) would be a great feature request.
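    A minimal sketch of what such change-based checking could look like, as a shell helper (the function name and message here are made up for illustration):

```shell
#!/bin/sh
# Alert only when the raw value moved between two checks,
# not on its absolute value.
report_change() {   # report_change ATTR_ID PREV_RAW CUR_RAW
    if [ "$2" != "$3" ]; then
        echo "SMART attr $1 raw value changed: $2 -> $3"
    fi
}

# Demo with the values from this thread (attribute 5 stuck at raw value 90):
report_change 5 90 90   # prints nothing: stable, keep the status green
report_change 5 90 93   # prints an alert: reallocations still happening
```

    In a real cron job, PREV_RAW would come from a state file written on the previous run and CUR_RAW from smartctl -A.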

  • That's true for every HDD regardless of any SMART values and that's why backups are important. :)

    I certainly agree about backups. However, I would be much more wary of a drive with SMART values like these. If it were a used drive I was buying, I would haggle the price down VERY significantly, and if unsuccessful, refuse to buy it. If it were a new drive I had just bought, I would return it at once.


    So while backups are always important, it may be prudent to double-check that you really have them and that they are good. And perhaps get a replacement drive ordered.

    Be smart - be lazy. Clone your rootfs. This help is Grateful™.
    OMV 4: 9 x Odroid HC2 + 1 x Odroid HC1 + 1 x Raspberry Pi 4

  • However, I would be much more wary of a drive with SMART values like these. If it were a used drive I was buying, I would haggle the price down VERY significantly, and if unsuccessful, refuse to buy it. If it were a new drive I had just bought, I would return it at once.

    Sure. But the information that this drive had been bought used wasn't available back then. And if the values of both attributes don't increase over time, IMO there's nothing wrong with using such a disk. Same with 197 Current_Pending_Sector, BTW: if we encounter drives with this value not being zero, we remove the drive from the array, overwrite it, and check again. We have had several occasions where the 197 value decreased again, and statistical analysis showed a simple vibration problem in a specific type of JBOD (the affected HDDs also had significantly elevated High Fly Writes values).
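    The overwrite-and-recheck procedure could look roughly like this (a destructive sketch: the drive must already be out of the array, /dev/sdX is a placeholder, and everything on the disk is erased):

```shell
# A write-mode badblocks pass overwrites every sector, forcing the drive to
# either fix or reallocate anything pending; then re-read attribute 197.
badblocks -wsv /dev/sdX
smartctl -A /dev/sdX | awk '$1 == 197 { print "pending sectors now:", $NF }'
```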

  • Hey folks,


    here is my HDD's health tonight :-)


    Code
    [59548.234798] sd 2:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    [59548.234808] sd 2:0:0:0: [sdb] tag#17 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
    [60047.173999] sd 2:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    [60047.174009] sd 2:0:0:0: [sdb] tag#19 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
    [60047.174038] sd 2:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK


    and then:


    Code
    [60103.184211] sd 2:0:0:0: [sdb] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    [60103.184217] sd 2:0:0:0: [sdb] tag#29 CDB: Write(10) 2a 00 00 00 00 10 00 00 08 00
    [60103.184220] print_req_error: I/O error, dev sdb, sector 16
    [60103.184266] md: super_written gets error=10
    [60103.184312] md/raid:md0: Disk failure on sdb, disabling device.
    md/raid:md0: Operation continuing on 3 devices.


    Good, the operations are continuing on three devices. The failing disk (I hope it's that one) is emitting a very nasty noise. I don't want to shut down the NAS twice: once to remove it and again to install the new one.


    I'm going to buy a disk and see if the rebuild goes OK ... I hope its old companion won't follow it into this state ...


    Laurent

  • So, I don't know if anyone is still interested in this thread :-) Anyway.


    I've just bought a new disk.


    I am preparing for the reboot.


    First, set the disk as faulty (already done by mdadm, but anyway):


    Code
    root@nas:~# mdadm --manage /dev/md0 --fail /dev/sdb
    mdadm: set /dev/sdb faulty in /dev/md0




    Second: remove the disk:


    Code
    root@nas:~# mdadm --manage /dev/md0 --remove /dev/sdb
    mdadm: hot removed /dev/sdb from /dev/md0

    Third: back up the RAID configuration in case of problems during reboot (e.g., the RAID is unable to assemble):


    Code
    root@nas:~# mdadm --detail --scan --verbose > /root/mdadm.conf
    root@nas:~# cat /root/mdadm.conf
    ARRAY /dev/md0 level=raid5 num-devices=4 metadata=1.2 name=naslaurent:grappe UUID=eedf6d31:3c6c7ae4:af47f642:fe164bea
    devices=/dev/sda,/dev/sdc,/dev/sdd


    I save it under /root since /root is on a USB stick :-)


    As you can see, mdadm records /dev/sd* device names instead of UUIDs. I prefer to save those too:


    Code
    root@nas:~# blkid /dev/sda > /root/sda
    root@nas:~# blkid /dev/sdc > /root/sdc
    root@nas:~# blkid /dev/sdd > /root/sdd
    root@nas:~# cat /root/sd*
    /dev/sda: UUID="eedf6d31-3c6c-7ae4-af47-f642fe164bea" UUID_SUB="ad99400c-afc6-886a-45f1-5e6b5fac921d" LABEL="naslaurent:grappe" TYPE="linux_raid_member"
    /dev/sdc: UUID="eedf6d31-3c6c-7ae4-af47-f642fe164bea" UUID_SUB="c246ed46-598c-0fab-e794-7f08c07984ad" LABEL="naslaurent:grappe" TYPE="linux_raid_member"
    /dev/sdd: UUID="eedf6d31-3c6c-7ae4-af47-f642fe164bea" UUID_SUB="5020dc3e-5528-f32f-c584-b3ded78f6e7d" LABEL="naslaurent:grappe" TYPE="linux_raid_member"
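    Since /dev/sd* names can shift between boots, the saved UUIDs also give a device-name-independent way to reassemble, e.g. (a sketch using the array UUID shown above):

```shell
# Assemble by array UUID instead of by /dev/sd* names
mdadm --assemble /dev/md0 --uuid=eedf6d31:3c6c7ae4:af47f642:fe164bea
```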


    Awesome. We can reboot now:


    Code
    root@nas:~# halt
  • Well. I changed the disk, restarted the NAS, and a few minutes later I received an email:


    Great, this is exactly what I was expecting.


    Now, do a conveyance test on the new disk:
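    Starting it with smartctl would look something like this (an assumption on my part; smartmontools' -t option with the device used throughout this thread):

```shell
smartctl -t conveyance /dev/sdb   # kick off the conveyance self-test
```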

    Wait 5 minutes and check the results:

    Code
    root@nas:~# smartctl -a /dev/sdb
    blah blah blah
    SMART Self-test log structure revision number 1
    Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    # 1 Conveyance offline Completed without error 00% 0 -

    Great, the self-test found no problems caused by transport.


    Now create the partition table:

    Code
    root@nas:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
    sfdisk: /dev/sda: does not contain a recognized partition table

    OK. Let's take a closer look:


    Assuming that the RAID occupies the whole disk, I guess the signature will be copied as part of rebuilding the array. So let's go ahead!


    Code
    root@nas:~# mdadm --manage /dev/md0 --add /dev/sdb
    mdadm: added /dev/sdb

    And now, we just have to wait!

    Code
    root@nas:~# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md0 : active raid5 sdb[4] sdd[3] sda[0] sdc[2]
    5860150272 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [U_UU]
    [>....................] recovery = 0.1% (2964856/1953383424) finish=241.2min speed=134766K/sec
    bitmap: 7/15 pages [28KB], 65536KB chunk
    unused devices: <none>

    Just verifying that I was right about the RAID signature (and assuming the rebuild begins at the first sector):

    :-)
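    One way to check this is mdadm --examine on the new member, which reads the md superblock directly from the disk (a sketch, not necessarily the exact check used):

```shell
# Shows metadata version, array UUID and superblock/data offsets on /dev/sdb
mdadm --examine /dev/sdb
```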
