Hello OMV community,
I'd like to report an issue which I have somehow "fixed" by using a workaround. First of all some information about my hardware. My server is a Fujitsu PRIMERGY MX130 S2 and I have recently bought 3 WD Red 5TB (WD50EFRX). I have set up a RAID 5 array with those 3 disks using the GUI of the latest OpenMediaVault 2. Everything went fine so far and I didn't have any problems setting up the system.
I have set up S.M.A.R.T. monitoring for all 3 drives, and at some point I started to receive emails saying that "ATA errors" had occurred (for all 3 disks):
This email was generated by the smartd daemon running on:
host name: HomeNAS
DNS domain: [Unknown]
NIS domain: (none)
The following warning/error was logged by the smartd daemon:
Device: /dev/disk/by-id/ata-WDC_WD50EFRX-XXXXXXX_WD-XXXXXXXXXXXX [SAT], ATA error count increased from 4 to 5
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original email about this issue was sent at Sat Mar 12 19:20:06 2016 CET
Another email message will be sent in 24 hours if the problem persists.
So I immediately checked the SMART values using "smartctl -ia /dev/sdX" and found multiple entries about ATA errors at the bottom of the output (the SMART values themselves were good):
Error 5 occurred at disk power-on lifetime: 28 hours (1 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 0b 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 08 00:06:31.694 SMART READ LOG
b0 d5 01 e1 4f c2 00 08 00:06:31.694 SMART READ LOG
b0 d6 01 e0 4f c2 00 08 00:06:31.693 SMART WRITE LOG
b0 d6 01 e0 4f c2 00 08 00:06:31.692 SMART WRITE LOG
b0 d5 01 e0 4f c2 00 08 00:06:31.692 SMART READ LOG
Error 4 occurred at disk power-on lifetime: 28 hours (1 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 0b 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 08 00:03:58.408 SMART READ LOG
b0 d5 01 e1 4f c2 00 08 00:03:58.407 SMART READ LOG
b0 d6 01 e0 4f c2 00 08 00:03:58.406 SMART WRITE LOG
b0 d6 01 e0 4f c2 00 08 00:03:58.405 SMART WRITE LOG
b0 d5 01 e0 4f c2 00 08 00:03:58.405 SMART READ LOG
Error 3 occurred at disk power-on lifetime: 22 hours (0 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 0b 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 08 22:53:03.082 SMART READ LOG
b0 d5 01 e1 4f c2 00 08 22:53:03.082 SMART READ LOG
b0 d6 01 e0 4f c2 00 08 22:53:03.081 SMART WRITE LOG
b0 d6 01 e0 4f c2 00 08 22:53:03.080 SMART WRITE LOG
b0 d5 01 e0 4f c2 00 08 22:53:03.080 SMART READ LOG
Error 2 occurred at disk power-on lifetime: 5 hours (0 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 0b 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 08 05:08:35.716 SMART READ LOG
Error 1 occurred at disk power-on lifetime: 5 hours (0 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 0b 00 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d5 01 e1 4f c2 00 08 05:08:35.662 SMART READ LOG
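In case anyone wants to compare their own drives, here is a minimal Python sketch that summarises an error log like the one above. It only assumes the line layout shown in my paste (an "Error N occurred at disk power-on lifetime" header followed by command lines with eight hex bytes, a timestamp and a command name). The function names `summarize_error_log` and `command_counts` are my own and not part of smartmontools.

```python
import re
from collections import Counter

# Header line, e.g. "Error 5 occurred at disk power-on lifetime: 28 hours (...)"
ERROR_HEADER = re.compile(r"^Error (\d+) occurred at disk power-on lifetime: (\d+) hours")
# Command line: eight lowercase hex bytes, a timestamp, then the command name,
# e.g. "b0 d5 01 e1 4f c2 00 08 00:06:31.694 SMART READ LOG"
COMMAND_LINE = re.compile(r"^(?:[0-9a-f]{2} ){8}\s*(\S+)\s+(.+)$")


def summarize_error_log(text):
    """Return one dict per logged error: number, power-on hours, preceding commands."""
    errors = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = ERROR_HEADER.match(line)
        if header:
            current = {
                "number": int(header.group(1)),
                "hours": int(header.group(2)),
                "commands": [],
            }
            errors.append(current)
            continue
        command = COMMAND_LINE.match(line)
        if command and current is not None:
            current["commands"].append(command.group(2).strip())
    return errors


def command_counts(errors):
    """Count how often each command name appears before an error."""
    return Counter(name for err in errors for name in err["commands"])
```

Feeding it the text of the log above should list five errors, all preceded only by SMART READ LOG / SMART WRITE LOG commands, which fits the picture that the aborts came from SMART queries and not from normal reads or writes.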
It seems like there have been multiple SMART requests at almost the same time. After noticing those ATA errors on all devices, I decided to run a SMART short self-test on my drives (smartctl -t short /dev/sdX). Unfortunately, the test did not finish within 2 minutes as it was supposed to. It just hung at 90% on all drives. I then installed the backports kernel because I thought the "outdated" kernel might be causing this behaviour.
OMV 2 with backports kernel
root@HomeNAS:~# uname -a
Linux HomeNAS 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3~bpo70+1 (2016-01-19) x86_64 GNU/Linux
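For anyone who wants to detect a stalled self-test like mine without watching it by hand, here is a rough Python sketch. It assumes that "smartctl -c /dev/sdX" reports a running test with a line containing "NN% of test remaining" (that is how I remember the smartmontools output; please check it against your version). The helper names `remaining_percent`, `read_capabilities` and `wait_for_self_test` are made up for this example, and the timeout values are arbitrary.

```python
import re
import subprocess
import time

# Assumed wording of the progress line printed while a self-test is running.
REMAINING = re.compile(r"(\d+)% of test remaining")


def remaining_percent(capabilities_text):
    """Return the remaining percentage of a running self-test, or None if none is reported."""
    match = REMAINING.search(capabilities_text)
    return int(match.group(1)) if match else None


def read_capabilities(device):
    """Run 'smartctl -c' on the device and return its output (needs root)."""
    # smartctl uses its exit status as a bit mask, so a non-zero code is not treated as fatal here.
    result = subprocess.run(["smartctl", "-c", device], capture_output=True, text=True)
    return result.stdout


def wait_for_self_test(device, timeout_s=600, poll_s=30, read_status=read_capabilities):
    """Poll until no test is reported as running. Returns False if the timeout is hit first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if remaining_percent(read_status(device)) is None:
            return True
        time.sleep(poll_s)
    return False
```

The idea would be to start the test with "smartctl -t short /dev/sdX", call `wait_for_self_test("/dev/sdX")`, and abort the test with "smartctl -X /dev/sdX" if it returns False. If my assumption about the wording is right, the 90% I saw means 90% of the test was still remaining, so the test barely got started before it stalled.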
While running the system and reading SMART values, new ATA errors still occurred. In addition, the SMART self-test again stopped at 90% and didn't finish (I had to abort it manually). I was really annoyed by that and decided to put the disks one by one into my desktop PC and boot an Ubuntu live CD (14.04). The self-test finished properly for all 3 disks. The next step was to put the drives back into the server and boot the Ubuntu live CD there to rule out any hardware issues. Fortunately, the SMART self-test also worked fine when running Ubuntu on my server. I was really happy to have ruled out a hardware issue or an incompatibility of my board/controller.
Now I just had to find out how to get OpenMediaVault to work with my server and hard drives. So I decided to install the OMV 3 beta on a USB drive, update it to the latest version and give it another try.
OMV 3.0.13
root@HomeNAS:~# uname -a
Linux HomeNAS 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4 (2016-02-29) x86_64 GNU/Linux
I couldn't have been happier when I saw that the SMART self-test finished without any errors. Now I hope that the ATA errors only occurred because of some incompatibility with the software/kernel and that they did not damage my (expensive) drives. I have just tried to document my experience. Maybe some of the devs/mods/pros know what the problem was, or maybe this helps to detect or fix similar issues in the future. If you need any logs or complete SMART outputs, I'll post them here.