smartd detected "1 Currently unreadable (pending) sectors" while SMART is OK

    • smartd detected "1 Currently unreadable (pending) sectors" while SMART is OK

      Hi,

      One of my RAID 1 SSDs (which holds a VM image) triggered an alert from the smartd daemon tonight at 1:27 AM:

      smartd daemon wrote:

      This message was generated by the smartd daemon running on:

         host name:  nas
         DNS domain: hidden.tld

      The following warning/error was logged by the smartd daemon:

      Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], 1 Currently unreadable (pending) sectors

      Device info:
      CT500MX500SSD1, S/N:1911E1F0E132, WWN:5-00a075-1e1f0e132, FW:M3CR023, 500 GB

      For details see host's SYSLOG.

      You can also use the smartctl utility for further investigation.
      Another message will be sent in 24 hours if the problem persists.

      Syslog shows:

      Source Code

      root@nas:~# cat /var/log/syslog | grep smart
      Jun 6 01:27:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], 1 Currently unreadable (pending) sectors
      Jun 6 01:27:31 nas smartd[810]: Sending warning via /usr/share/smartmontools/smartd-runner to admin@hidden.tld ...
      Jun 6 01:27:31 nas smartd[810]: Warning via /usr/share/smartmontools/smartd-runner to admin@terageek.org: successful
      Jun 6 01:57:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 em

      It shows as resolved at 1:57 AM. (I assume there is a check every 30 minutes.)
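
      For what it's worth, the 30-minute gap matches smartd's own polling interval rather than a cron job: smartd checks devices every 1800 seconds by default, and that can be changed with its -i / --interval option. A minimal sketch of the arithmetic, using the two timestamps from the log above (gap_seconds is a hypothetical helper name, not part of smartmontools):

```shell
# gap_seconds START END
# Hypothetical helper: seconds between two HH:MM:SS timestamps
# that fall on the same day.
gap_seconds() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, ":"); split(b, y, ":")
        print (y[1] * 3600 + y[2] * 60 + y[3]) - (x[1] * 3600 + x[2] * 60 + x[3])
    }'
}

# Warning logged at 01:27:31, reset logged at 01:57:31.
gap_seconds 01:27:31 01:57:31   # prints 1800
```

      So the "reset" line only means that the next poll, half an hour later, read a pending count of 0 again.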

      smartctl output shows:

      Source Code

      root@nas:~# smartctl -a /dev/sdb | grep pending
      If Selective self-test is pending on power-up, resume after 0 minute delay.
      root@nas:~# smartctl -a /dev/sdb
      smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.4-amd64] (local build)
      Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

      === START OF INFORMATION SECTION ===
      Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
      Device Model:     CT500MX500SSD1
      Serial Number:    1911E1F0E132
      LU WWN Device Id: 5 00a075 1e1f0e132
      Firmware Version: M3CR023
      User Capacity:    500,107,862,016 bytes [500 GB]
      Sector Sizes:     512 bytes logical, 4096 bytes physical
      Rotation Rate:    Solid State Device
      Form Factor:      2.5 inches
      Device is:        In smartctl database [for details use: -P show]
      ATA Version is:   ACS-3 T13/2161-D revision 5
      SATA Version is:  SATA >3.2 (0x1ff), 6.0 Gb/s (current: 6.0 Gb/s)
      Local Time is:    Thu Jun 6 14:35:29 2019 CEST
      SMART support is: Available - device has SMART capability.
      SMART support is: Enabled

      === START OF READ SMART DATA SECTION ===
      SMART overall-health self-assessment test result: PASSED

      General SMART Values:
      Offline data collection status:  (0x80) Offline data collection activity
                                              was never started.
                                              Auto Offline Data Collection: Enabled.
      Self-test execution status:      (   0) The previous self-test routine completed
                                              without error or no self-test has ever
                                              been run.
      Total time to complete Offline
      data collection:                 (    0) seconds.
      Offline data collection
      capabilities:                    (0x7b) SMART execute Offline immediate.
                                              Auto Offline data collection on/off support.
                                              Suspend Offline collection upon new
                                              command.
                                              Offline surface scan supported.
                                              Self-test supported.
                                              Conveyance Self-test supported.
                                              Selective Self-test supported.
      SMART capabilities:            (0x0003) Saves SMART data before entering
                                              power-saving mode.
                                              Supports SMART auto save timer.
      Error logging capability:        (0x01) Error logging supported.
                                              General Purpose Logging supported.
      Short self-test routine
      recommended polling time:        (   2) minutes.
      Extended self-test routine
      recommended polling time:        (  30) minutes.
      Conveyance self-test routine
      recommended polling time:        (   2) minutes.
      SCT capabilities:              (0x0031) SCT Status supported.
                                              SCT Feature Control supported.
                                              SCT Data Table supported.

      SMART Attributes Data Structure revision number: 16
      Vendor Specific SMART Attributes with Thresholds:
      ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
        1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
        5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
        9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       379
       12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
      171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
      172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
      173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       8
      174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
      180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       46
      183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
      184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
      187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
      194 Temperature_Celsius     0x0022   067   049   000    Old_age   Always       -       33 (Min/Max 0/51)
      196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
      197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
      198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
      199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
      202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
      206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
      210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
      246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       1399882358
      247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       25302175
      248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       72234345

      SMART Error Log Version: 1
      No Errors Logged

      SMART Self-test log structure revision number 1
      No self-tests have been logged. [To run self-tests, use: smartctl -t]

      SMART Selective self-test log data structure revision number 1
       SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
          1        0        0  Not_testing
          2        0        0  Not_testing
          3        0        0  Not_testing
          4        0        0  Not_testing
          5        0        0  Completed [00% left] (0-65535)
      Selective self-test flags (0x0):
        After scanning selected spans, do NOT read-scan remainder of disk.
      If Selective self-test is pending on power-up, resume after 0 minute delay.
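
      A side note on the first command in that listing: grep is case-sensitive, so "pending" never matches "Current_Pending_Sector", and only the selective self-test line comes back. A minimal sketch of a case-insensitive filter, demonstrated on the attribute line from the output (find_pending is a hypothetical helper name):

```shell
# find_pending: hypothetical helper that filters smartctl output on stdin
# for "pending" regardless of case, so "Current_Pending_Sector" matches.
find_pending() {
    grep -i 'pending'
}

# Demo on the attribute line taken from the smartctl output.
# On a live system this would be: smartctl -A /dev/sdb | find_pending
printf '%s\n' '197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0' | find_pending
```

      Running smartctl with -A instead of -a limits the output to the attribute table, which keeps the filtered result short.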

      I'm quite used to reading SMART output and I can't see any error here, which is comforting. But at the same time, there was still an error, and I don't like that.

      So my questions are:
      Do you have any idea what this error would indicate on a SATA SSD?
      Is there anything to worry about?
      How can there be such an error but nothing in SMART?

      Any enlightenment appreciated.
      Best regards
    • I had a pretty great tech support chat with Crucial, with some interesting conclusions that I'll share.

      Since I'm running Linux and no Crucial tool exists for that OS, I was informed that Micron owns Crucial, so most things are valid for both Micron and Crucial.
      They asked me to install a Micron GUI diagnostic tool, which was a bit complicated since I only use the CLI on my OMV. Then they pointed out that there is also a CLI version.
      That's the tool: micron.com/products/solid-stat…torage-executive-software
      I then pointed out that I'm not the only one seeing this, so checking my specific SMART data was not really the point. They more or less accepted that my smartctl output and logs were enough to address my worries.

      The tech's conclusion is the following:

      It is perfectly fine, normal and expected to have pending sectors sometimes on an SSD, due to the nature of NAND memory.

      Therefore, your SMART value Current_Pending_Sector might go to non-zero values from time to time. I think what would be worrying is if it didn't go back to 0 afterwards. That would mean there is no room left to move the data to available blocks.

      What you should check is the value: Unused_Reserve_NAND_Blk
      You don't want this close to 0.
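
      A minimal sketch of how that value could be pulled out of the attribute table for periodic logging. It assumes the column layout shown in the smartctl output earlier in this thread (RAW_VALUE in the 10th column); reserve_blocks and reserve_ok are hypothetical helper names, and the floor of 10 is an arbitrary illustration, not a Micron or Crucial recommendation:

```shell
# reserve_blocks: print the RAW_VALUE (10th column) of attribute 180
# (Unused_Reserve_NAND_Blk) from a smartctl attribute table on stdin.
reserve_blocks() {
    awk '$1 == 180 { print $10 }'
}

# reserve_ok MIN: succeed only if the raw value read from stdin is above MIN.
# MIN is an illustrative floor chosen by the caller.
reserve_ok() {
    value=$(reserve_blocks)
    [ -n "$value" ] && [ "$value" -gt "$1" ]
}

# Demo on the attribute line from this thread.
# On a live system this would be: smartctl -A /dev/sdb | reserve_blocks
sample='180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 46'
printf '%s\n' "$sample" | reserve_blocks
if printf '%s\n' "$sample" | reserve_ok 10; then
    echo "reserve OK"
else
    echo "reserve low"
fi
```

      Logging the number with a date stamp once a day would show whether it really drops over time.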

      For reference, I have a value of 45 on one drive and 46 on the other, for drives that are, I think, a bit less than 1 month old. Upon checking, I have other emails for this warning on both drives: 2 on one drive and 3 on the other. If each warning cost one reserve block, that gives 45 + 3 = 46 + 2 = 48, so the default value for my 500GB MX500 would be Unused_Reserve_NAND_Blk: 48. At that rate (hopefully it slows down), the drives might be dead in no time... And I'm not even writing a lot to them (just a VM with two game servers on it).

      They also provided me a doc with an explanation of how SMART attributes are calculated. The tech was unsure if this was shareable or not, but I found it publicly on Micron's website so... Here's the link. micron.com/-/media/documents/p…_ssd_smart_attributes.pdf

      In conclusion: nothing to worry about, at least until the pending sector count stops going back to 0 or the reserve runs low. And Crucial support rocks :)
    • UltimateByte wrote:

      I have a value of 45 on one drive and 46 on the other ... I have other emails for this warning on both drives: 2 on one drive, and 3 on the other, so the default value for my 500GB MX500 is Unused_Reserve_NAND_Blk: 48
      I really hope (for you) that this is just a coincidence and not correct math :) Please keep us informed whether new occurrences of pending sectors correlate with a decrease of the 180 attribute.
    • Thank you :)

      Yes, it is likely a coincidence (confirmation bias spotted). I've checked again: in fact there are only 4 mails in total. (The 5th was another mail containing the word "Pending", my bad.)
      That said, it is not impossible that an alert was sent before I had email set up on the NAS. (Syslog entries from before June 2nd are not kept, so we can't know.)

      I will try to report back once I have more data which will tell us more about the subject than my ravings.

      The advantage is we have my previous values for the record.

      Attached: Screenshot of error detection times and dates.

      Order of disks for errors is: A B A B
      I see no obvious time pattern for now, but from the little data available, errors seem to be getting rarer, which would be great if it carries on like that :D
    • So, I've got many pending-sector emails today, which means more data!

      6 new for the disk CT500MX500SSD1_1911E1F0E132 which makes a total of 9
      And 3 new for the disk CT500MX500SSD1_1911E1F0F25B which makes a total of 5.

      First one has 45 Unused_Reserve_NAND_Blk
      Second one has 46 Unused_Reserve_NAND_Blk

      So the values are unchanged and don't seem to depend on current pending-sector activity, or at least they are not correlated at a 1:1 ratio, which is good.
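
      For anyone wanting to repeat this comparison, a minimal sketch of counting the warnings per disk from syslog rather than from emails. It assumes the smartd log format shown in the first post; pending_warnings_by_device is a hypothetical helper name:

```shell
# pending_warnings_by_device: hypothetical helper that counts smartd
# "Currently unreadable (pending) sectors" warnings per device from log
# lines on stdin. The "No more ..." reset lines are skipped.
pending_warnings_by_device() {
    awk '/Currently unreadable \(pending\) sectors/ && !/No more/ {
        for (i = 1; i < NF; i++)
            if ($i == "Device:") count[$(i + 1)]++
    }
    END { for (dev in count) print count[dev], dev }'
}

# Demo on the two log lines from the first post (one warning, one reset).
# On a live system this would be:
#   grep smartd /var/log/syslog | pending_warnings_by_device
pending_warnings_by_device <<'EOF'
Jun 6 01:27:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], 1 Currently unreadable (pending) sectors
Jun 6 01:57:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 em
EOF
```

      Comparing these counts with the Unused_Reserve_NAND_Blk readings over several weeks would show whether the two move together at all.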

      That said, these emails are still freaking me out... I never like seeing this kind of error.
      Last time it happened, I had an out-of-warranty 4TB drive failing, and it took me 1 week to download my data back from Hubic... Not that my connection was slow (I've got gigabit at home), but their servers suck (and they have since discontinued Hubic...). A solution might be to apply a filter in Thunderbird to mark these emails as read. Anyhow, I'll back up the data on these drives more frequently.