smartd detected "1 Currently unreadable (pending) sectors" while SMART is OK

UltimateByte · June 6, 2019

Hi,

One of my RAID 1 SSDs (which contains a VM image) triggered an alert from the smartd daemon tonight at 1:27 AM:

Quote from smartd daemon

This message was generated by the smartd daemon running on:
host name: nas
DNS domain: hidden.tld

The following warning/error was logged by the smartd daemon:
Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], 1 Currently unreadable (pending) sectors

Device info:
CT500MX500SSD1, S/N:1911E1F0E132, WWN:5-00a075-1e1f0e132, FW:M3CR023, 500 GB

For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

Syslog shows:

Code

root@nas:~# cat /var/log/syslog | grep smart
Jun  6 01:27:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], 1 Currently unreadable (pending) sectors
Jun  6 01:27:31 nas smartd[810]: Sending warning via /usr/share/smartmontools/smartd-runner to admin@hidden.tld ...
Jun  6 01:27:31 nas smartd[810]: Warning via /usr/share/smartmontools/smartd-runner to admin@terageek.org: successful
Jun  6 01:57:31 nas smartd[810]: Device: /dev/disk/by-id/ata-CT500MX500SSD1_1911E1F0E132 [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 em

It shows as resolved at 1:57 AM (I assume there is a cron job running every 30 minutes).
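To see how long each pending-sector warning lasted, the smartd lines can be paired up programmatically. The following is only a minimal Python sketch: the function name `pending_events` and the regular expression are illustrative assumptions based on the two log lines shown above, not part of smartmontools, and syslog timestamps carry no year, so the year has to be supplied by the caller.

```python
import re
from datetime import datetime

# Pattern modelled on the smartd syslog lines quoted in this post (assumption:
# other smartd versions or syslog formats may need a different pattern).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+smartd\[\d+\]:\s+"
    r"Device:\s+(?P<dev>\S+)(?:\s+\[[^\]]*\])?,\s+(?P<msg>.*)$"
)
RAISED_RE = re.compile(r"(\d+) Currently unreadable \(pending\) sectors")


def pending_events(lines, year):
    """Yield (timestamp, device, kind, count) for pending-sector messages.

    kind is "raised" when smartd reports pending sectors and "cleared"
    when it reports that there are no more of them.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        # Collapse the padding syslog uses for single-digit days ("Jun  6").
        ts_text = " ".join(m.group("ts").split())
        ts = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
        msg = m.group("msg")
        raised = RAISED_RE.match(msg)
        if raised:
            yield ts, m.group("dev"), "raised", int(raised.group(1))
        elif msg.startswith("No more Currently unreadable (pending) sectors"):
            yield ts, m.group("dev"), "cleared", 0


if __name__ == "__main__":
    # Assumes the log lives at /var/log/syslog, as in the command above.
    with open("/var/log/syslog", errors="replace") as log:
        for ts, dev, kind, count in pending_events(log, year=2019):
            print(ts.isoformat(), dev, kind, count)
```

Subtracting the timestamp of a "raised" event from the following "cleared" event for the same device gives the duration of the warning.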

smartctl output shows:

Code

root@nas:~# smartctl -a /dev/sdb | grep pending
If Selective self-test is pending on power-up, resume after 0 minute delay.
root@nas:~# smartctl -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.4-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT500MX500SSD1
Serial Number:    1911E1F0E132
LU WWN Device Id: 5 00a075 1e1f0e132
Firmware Version: M3CR023
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA >3.2 (0x1ff), 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jun  6 14:35:29 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0031)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       379
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       8
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       46
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   067   049   000    Old_age   Always       -       33 (Min/Max 0/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       1399882358
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       25302175
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       72234345


SMART Error Log Version: 1
No Errors Logged


SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Completed [00% left] (0-65535)
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


I'm quite used to reading SMART output and I can't see any error here, which is comforting. But at the same time, there was still an error, and I don't like that.

So my questions are:
Do you have any idea what this error would indicate on a SATA SSD?
Is there anything to worry about?
How can there be such an error but nothing in SMART?

Any enlightenment appreciated.
Best regards

tkaiser · June 6, 2019

Quote from UltimateByte

Is there anything to worry about?

That's something you need to ask Crucial. At least it's a known 'problem' with your SSD: Current_Pending_Sector mx500

UltimateByte · June 6, 2019

Well, it didn't occur to me that it could possibly be a specific MX500 thing. That's very enlightening, thank you!
We can see some folks having the same issue on multiple systems, so at least we know it's not a Debian-based or OMV-specific issue.

I'll contact Crucial about that and report back if I get any useful information.

UltimateByte · June 6, 2019

I've had pretty great tech support over chat at Crucial, with some interesting conclusions that I'll share.

Since I was running Linux and no Crucial tool exists for that OS, I was informed that Micron owns Crucial, so most things are valid for both Micron and Crucial.
So they asked me to install a Micron GUI diagnostic tool, which was a bit complicated since I'm only using the CLI on my OMV. Then they pointed out there was also a CLI version.
That's the tool: https://www.micron.com/product…torage-executive-software
But then I pointed out: "I'm not alone in having this; checking my specific SMART data is not relevant." They then more or less accepted that my smartctl output and logs were enough to address my worries.

The tech's conclusion is the following:

It is perfectly fine, normal and expected to have pending sectors sometimes on an SSD, due to the nature of NAND memory.

Therefore, your SMART value Current_Pending_Sector might show non-zero values from time to time. I think what would be worrying is if it didn't go back to 0 afterwards. That would mean there is no room left for moving data to available blocks.

What you should check is the value: Unused_Reserve_NAND_Blk
You don't want this close to 0.
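A small script can keep an eye on that value. The following is a minimal Python sketch, not an official tool: the helper names `parse_attributes`, `raw_int` and `read_attributes` are made up for illustration, and the parser assumes the ten-column attribute table layout printed by `smartctl -A`, as in the output earlier in this thread.

```python
import subprocess


def parse_attributes(smartctl_output):
    """Return {attribute_id: (name, raw_value_text)} from `smartctl -A` text."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9].strip())
    return attrs


def raw_int(attrs, attr_id):
    """Leading integer of a raw value, e.g. '33 (Min/Max 0/51)' gives 33."""
    return int(attrs[attr_id][1].split()[0])


def read_attributes(device):
    """Run `smartctl -A` on a device (usually needs root) and parse the table.

    smartctl's exit status is a bit mask, so a non-zero status is not
    treated as a failure here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_attributes(result.stdout)


if __name__ == "__main__":
    attrs = read_attributes("/dev/sdb")  # device path taken from this thread
    print("Unused_Reserve_NAND_Blk (180):", raw_int(attrs, 180))
    print("Current_Pending_Sector (197):", raw_int(attrs, 197))
```

Logging both numbers regularly makes it possible to check later whether pending-sector alerts coincide with a drop in attribute 180.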

For reference, I have a value of 45 on one drive and 46 on the other, on drives that are, I think, a bit less than 1 month old. Upon checking, I have other emails for this warning on both drives: 2 on one drive and 3 on the other. So, if each alert cost one reserve block, the default value for my 500GB MX500 is Unused_Reserve_NAND_Blk: 48. At that speed (hopefully it slows down), the drives might be dead in no time... And I'm not even writing a lot to them (just a VM with two game servers on it).

They also provided me a doc with an explanation of how SMART attributes are calculated. The tech was unsure if this was shareable or not, but I found it publicly on Micron's website so... Here's the link. https://www.micron.com/-/media…_ssd_smart_attributes.pdf

In conclusion: nothing to worry about, at least until the values grow. And Crucial support rocks.

tkaiser · June 6, 2019

Quote from UltimateByte

I have a value of 45 on one drive and 46 on the other ... I have other emails for this warning on both drives: 2 on one drive, and 3 on the other, so the default value for my 500GB MX500 is Unused_Reserve_NAND_Blk: 48

I really hope (for you) that this is just a coincidence and not correct math. Please keep us informed whether new occurrences of pending sectors correlate with a decrease of the 180 attribute.

UltimateByte · June 7, 2019

Thank you

Yes, it is likely a coincidence (confirmation bias spotted). I've checked again: in fact there are only 4 mails in total. (The 5th was another mail containing the word "Pending", my bad.)
That said, it is not impossible that an alert was sent before I had email set up on the NAS. (Syslog entries from before June 2nd are not kept, so we can't know.)

I will try to report back once I have more data, which will tell us more about the subject than my ravings.

The advantage is we have my previous values for the record.

Attached: Screenshot of error detection times and dates.

The order of disks for the errors is: A B A B
I see no obvious time pattern for now, but from the little data available, errors seem to be getting rarer, which would be great if it continues.

UltimateByte · June 11, 2019

So, I've got many pending-sector emails today, which means more data!

6 new for the disk CT500MX500SSD1_1911E1F0E132, which makes a total of 9.
And 3 new for the disk CT500MX500SSD1_1911E1F0F25B, which makes a total of 5.

First one has 45 Unused_Reserve_NAND_Blk
Second one has 46 Unused_Reserve_NAND_Blk

So the values are unchanged and don't seem to depend on current pending sector activity, or at least they are not correlated at a 1:1 ratio, which is good.
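The 1:1 hypothesis can be checked against these numbers with a few lines of arithmetic. This is only a Python sketch: the starting value of 48 is the guess made earlier in this thread, not a figure documented by Crucial, and the names used are illustrative.

```python
# Guess from earlier in the thread, not a documented factory value.
ASSUMED_INITIAL_RESERVE = 48

# Alert totals and Unused_Reserve_NAND_Blk readings reported on June 11.
observations = {
    "CT500MX500SSD1_1911E1F0E132": {"alerts": 9, "reserve": 45},
    "CT500MX500SSD1_1911E1F0F25B": {"alerts": 5, "reserve": 46},
}


def predicted_reserve(alerts, initial=ASSUMED_INITIAL_RESERVE):
    """Reserve blocks left if every alert consumed exactly one block."""
    return initial - alerts


for disk, seen in observations.items():
    expected = predicted_reserve(seen["alerts"])
    verdict = "matches" if expected == seen["reserve"] else "does not match"
    print(f"{disk}: predicted {expected}, observed {seen['reserve']} ({verdict})")
```

If one alert cost one reserve block, the drives would be at 39 and 43 by now. They are still at 45 and 46, so the one-block-per-alert reading does not hold for this data.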

That said, these emails are still freaking me out... I never like seeing this kind of error.
Last time it happened, I had a 4TB out-of-warranty drive failing, and it took me 1 week to download my data back from Hubic. Not that my connection was slow (I've got gigabit at home), but their servers suck (and Hubic has been shut down since). A solution might be to apply a filter in Thunderbird to mark the emails as "read". Anyhow, I'll back up the data on it more frequently.

tkaiser · June 12, 2019

Quote from UltimateByte

I never like seeing this kind of error.

If you're scared by notifications, simply disable or filter them.

Nas-turally · December 9, 2019

@UltimateByte : any news about your SMART alert with a Crucial MX500 drive?
I also own a Crucial MX500 (a 2 TB SSD) and I get the very same email on OMV 4, with the same SMART results. What happened next for you after mid-June?
Thank you!
