Hi all,
my problem started with nearly daily smartd alert mails like this:
This email was generated by the smartd daemon running on:
host name: omv
DNS domain: xxxx.yyy
NIS domain: (none)
The following warning/error was logged by the smartd daemon:
Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-...F1V6 [SAT], ATA error count increased from 3 to 5
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original email about this issue was sent at Fri Sep 8 20:12:42 2017 CEST
Another email message will be sent in 24 hours if the problem persists.
At first, when I checked the extended SMART information, I found the latest error log entries to be empty. That changed once I scheduled daily short self-tests and waited about a week. After that, the extended information looked like this (different drive):
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4-GP
Device Model: WDC WD2002FYPS-01U1B1
Serial Number: WD-...2160
LU WWN Device Id: 5 0014ee 2043b8a1b
Firmware Version: 04.05G05
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 11 16:47:36 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
<<not really interesting - snip>>
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 149 147 021 Pre-fail Always - 9541
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1185
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 8539
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 241
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 234
193 Load_Cycle_Count 0x0032 134 134 000 Old_age Always - 200934
194 Temperature_Celsius 0x0022 123 104 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 2
SMART Error Log Version: 1
Warning: ATA error count 10 inconsistent with error log pointer 4
ATA Error Count: 10 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 10 occurred at disk power-on lifetime: 8528 hours (355 days + 8 hours)
When the command that caused the error occurred, the device was in standby mode.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 00 1d+12:44:22.982 SMART READ LOG
b0 d5 01 e1 4f c2 00 00 1d+12:44:22.982 SMART READ LOG
b0 d6 01 e0 4f c2 00 00 1d+12:44:22.981 SMART WRITE LOG
b0 d6 01 e0 4f c2 00 00 1d+12:44:22.980 SMART WRITE LOG
b0 d5 01 e0 4f c2 00 00 1d+12:44:22.979 SMART READ LOG
Error 9 occurred at disk power-on lifetime: 8498 hours (354 days + 2 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 00 06:43:32.607 SMART READ LOG
b0 d5 01 e0 4f c2 00 00 06:43:32.606 SMART READ LOG
b0 d5 01 e1 4f c2 00 00 06:43:32.605 SMART READ LOG
b0 d6 01 e0 4f c2 00 00 06:43:32.604 SMART WRITE LOG
b0 d5 01 e0 4f c2 00 00 06:43:32.603 SMART READ LOG
Error 8 occurred at disk power-on lifetime: 8494 hours (353 days + 22 hours)
When the command that caused the error occurred, the device was in standby mode.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 00 03:07:02.701 SMART READ LOG
b0 d5 01 e0 4f c2 00 00 03:07:02.700 SMART READ LOG
b0 d6 01 e0 4f c2 00 00 03:07:02.699 SMART WRITE LOG
b0 d6 01 e0 4f c2 00 00 03:07:02.698 SMART WRITE LOG
b0 d5 01 e0 4f c2 00 00 03:07:02.697 SMART READ LOG
Error 7 occurred at disk power-on lifetime: 8480 hours (353 days + 8 hours)
When the command that caused the error occurred, the device was in standby mode.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 00 02:09:51.204 SMART READ LOG
b0 d5 01 e0 4f c2 00 00 02:09:51.203 SMART READ LOG
b0 d5 01 e1 4f c2 00 00 02:09:51.202 SMART READ LOG
b0 d6 01 e0 4f c2 00 00 02:09:51.201 SMART WRITE LOG
b0 d5 01 e1 4f c2 00 00 02:09:51.200 SMART READ LOG
Error 6 occurred at disk power-on lifetime: 5408 hours (225 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 00 03:39:36.265 SMART READ LOG
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8505 -
# 2 Extended offline Aborted by host 50% 8498 -
# 3 Extended offline Aborted by host 90% 8490 -
# 4 Short offline Completed without error 00% 8489 -
# 5 Short offline Completed without error 00% 5393 -
# 6 Extended offline Completed without error 00% 5376 -
<<not really meaningful - snip>>
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The error log entries are no longer empty, and of the earlier state only the "Warning: ATA error count 10 inconsistent with error log pointer 4" remains. OK, but what could be the reason for the alerts that keep coming in? My guess is the "Error: ABRT", which seems to be the drive's reaction when smartd's last action (whatever it was) gets terminated by autoshutdown.
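To check that guess against the whole log instead of reading it by eye, a small sketch like the following could summarise the error entries from saved `smartctl` output. It assumes the text layout shown in the output above; the function names `summarize_error_log` and `only_aborted_smart_commands` are made up for this illustration and are not part of smartmontools.

```python
import re

# Patterns follow the smartctl text layout pasted above (an assumption:
# other smartctl versions may format these lines differently).
ERROR_HEADER = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")
DEVICE_STATE = re.compile(r"When the command that caused the error occurred, the device was (.+?)\.$")
ERROR_TYPE = re.compile(r"Error: (\w+)")
# Eight lowercase hex register bytes, a timestamp, then the command name.
COMMAND_ROW = re.compile(r"(?:[0-9a-f]{2}\s+){8}\S+\s+(.+)$")


def summarize_error_log(text):
    """Return one dict per logged error found in smartctl's text output."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = ERROR_HEADER.match(line)
        if header:
            current = {
                "number": int(header.group(1)),
                "hours": int(header.group(2)),
                "state": None,
                "error": None,
                "first_command": None,
            }
            entries.append(current)
            continue
        if current is None:
            continue  # still in the part before the first error entry
        state = DEVICE_STATE.match(line)
        if state:
            current["state"] = state.group(1)
            continue
        error = ERROR_TYPE.search(line)
        if error:
            current["error"] = error.group(1)
            continue
        command = COMMAND_ROW.match(line)
        if command and current["first_command"] is None:
            # Keep only the first listed row of each entry.
            current["first_command"] = command.group(1).strip()
    return entries


def only_aborted_smart_commands(entries):
    """True if every entry is an ABRT whose first listed command is a SMART one."""
    return bool(entries) and all(
        e["error"] == "ABRT" and (e["first_command"] or "").startswith("SMART")
        for e in entries
    )
```

If `only_aborted_smart_commands` returns True for every affected drive, that would at least be consistent with the guess that the alerts come from interrupted SMART log commands and not from read or write failures. It does not prove that autoshutdown is the cause.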
To confirm that connection, I disabled autoshutdown and voilà: there have been no more smartd alerts for a couple of days now. But leaving the OMV machine running all the time is not really what I want to do.
So my question: how can I make sure that autoshutdown waits until any smartd activity has finished? Is it possible just by tuning the basic parameters (HDDIO?), or must I add a dedicated lock/unlock mechanism like here?
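To make clear what I mean by a lock/unlock mechanism, here is a minimal sketch using an advisory file lock. Everything in it is an assumption for illustration: the lock path, the names `smart_activity` and `shutdown_allowed`, and above all the idea that autoshutdown can be made to run such a check before powering down. smartd itself would not take this lock, so it only covers SMART jobs that are wrapped explicitly (for example a scheduled `smartctl` call).

```python
import fcntl
import os
from contextlib import contextmanager

# Hypothetical lock file; any path writable by both sides would do.
LOCK_PATH = "/run/lock/smart-activity.lock"


@contextmanager
def smart_activity(path=LOCK_PATH):
    """Hold an exclusive advisory lock while a SMART job is running."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def shutdown_allowed(path=LOCK_PATH):
    """Return False while some process holds the SMART activity lock."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Non-blocking attempt: fails immediately if the lock is held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False
    else:
        return True
    finally:
        os.close(fd)
```

The SMART job would run inside `with smart_activity():`, and the shutdown side would call `shutdown_allowed()` and postpone the shutdown while it returns False. Because the kernel drops the lock when the holding process exits, a crashed job cannot block shutdown forever.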
By the way, another question: is a manual shutdown, i.e. via the web GUI or by pushing the power button, known to be safe, or can it disturb the drive's SMART activity in the same way? I just haven't watched for this yet.
Best regards