Sudden Mount Point Errors - Read Only File System, SMART Not Responding (One Drive)

bbddpp · July 21, 2017

I've been running my Dell C2100 rack server, built from the ground up, since November. The drives have run cooler than ever and have shown nothing but green lights on the SMART screen, yet I am now on what looks like the second or maybe even the third drive starting to fail out of the blue. Looking for some help.

First of all, I stubbornly do not run RAID; I like to get the maximum space out of a file system, and I run desktop drives in here. The server is just for serving media to my home and running some apps to search for media, the usual stuff. It's not getting hammered and lives in a cool basement. Drive temps run 26-29 C.

Anyway, here's what happened this time.

On a routine SSH session I realized that one of my mountpoints (a 3TB Toshiba drive) had gone "read only", which is a red flag to me that OMV found something wrong in its file system. The SMART screen shows a green dot next to the drive. I'm afraid to reboot OMV because the last time I did that, I was never able to re-mount the drive to get the media off. I tried an rsync command to back up the files from the bad drive to another empty drive, and it froze. I'm now trying a basic cp command in the shell, a few files at a time, and still getting Input/Output errors on every file. I tried a short self-test on the device and got this:

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

What am I doing wrong here? Why are drives going read-only out of the blue? Am I missing something obvious the way I have things set up? Is there anything I can do to rescue this drive before I reboot OMV or unmount the drive and probably lose all my data on an entire volume AGAIN?

My drives are all set up as:

Advanced Power Management: 128 - Minimum power usage without standby (no spindown)
Automatic Acoustic Management: Maximum performance, maximum acoustic output
Spindown time: Disabled
Write cache: off

I do not run routine SMART self-tests and maybe I should be doing that moving forward (anyone have any suggestions on what tests they run?). All that said, I don't get how drives are just failing in months inside of this thing, without any warning signs.

Anything else I can do here or am I basically just going to have to expect to lose a drive every few months out of nowhere because I'm not running RAID or using network-level drives?

Happy to provide any additional detail from logs, fstab, etc.

Thanks for anyone who can help with this frustration.

tkaiser · July 21, 2017

Quote from bbddpp

Happy to provide any additional detail from logs, fstab, etc.

Code

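# Collect extended SMART data for every disk and upload it as a paste
for disk in /dev/sd? ; do smartctl -x "$disk" ; done | curl -F 'sprunge=<-' http://sprunge.us
# Upload the kernel log as well
dmesg | curl -F 'sprunge=<-' http://sprunge.us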

bbddpp · July 22, 2017

This sprunge is awesome!

http://sprunge.us/VDTO

http://sprunge.us/jiLV

See anything weird?

Does anyone have any insights on spindown settings and the like for a media server whose drives I don't use much?

tkaiser · July 22, 2017

Quote from bbddpp

sprunge.us/VDTO

Something went wrong there, since this is only the SMART data from a single disk. But this one suffers from cable/connector problems:

Code

199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    24686

You should check the other drives manually and keep an eye on SMART attribute 199, since this counter increases when checksum mismatches occur between host and disk (usually the result of crappy/broken cables/connectors -- most people don't know that the standard internal SATA connectors are rated for 50 matings max).
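
A minimal sketch of checking attribute 199 on every drive by hand, assuming smartmontools is installed and the disks show up as /dev/sd?. The helper names crc_count and report_crc are made up for illustration; crc_count only picks the raw value (the last column) out of the attribute table that smartctl prints.

```shell
# Hypothetical helper: read `smartctl -A` (or `-x`) output on stdin and
# print the raw value (last column) of attribute 199, UDMA_CRC_Error_Count.
crc_count() {
    awk '$1 == 199 { print $NF; exit }'
}

# Hypothetical helper: print "device raw_value" for every /dev/sd? disk.
# Needs root and smartmontools; disks that do not answer show "n/a".
report_crc() {
    for disk in /dev/sd?; do
        [ -b "$disk" ] || continue
        count=$(smartctl -A "$disk" 2>/dev/null | crc_count)
        echo "$disk ${count:-n/a}"
    done
}
```

Calling report_crc as root prints one line per disk. A raw value above zero means the drive has seen CRC errors on the link at some point in its life.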

bbddpp · July 24, 2017

Thank you so much for this; you have no idea how helpful it has been. I re-ran the first command for all drives (I had run it just for the one drive that was giving me issues).

Let me know what you think.

This server is old, and of course I bought it second-hand and haven't changed any of the internal cabling at all. If you think investing in new backplane cables etc. would make life easier, I'm all for it (I would just have to figure out exactly what I need, since this sucker was plug and play and I never messed with the guts). It would make total sense that the SATA cabling inside was causing my data errors rather than the drives themselves failing (though I remain curious about the best settings for spindown, etc., for a media server that is asleep most of the time).

New sprunge, all drives!

http://sprunge.us/GOeE

tkaiser · July 24, 2017

Well, you have at least 3 drives with reported CRC errors. And the box is stuffed with 'Desktop HDD' -- don't know whether that is such a great idea. But I'm of no further help here since I would never do RAID at home anyway.

bbddpp · July 25, 2017

No worries, I appreciate all you have helped so far.

Are there any regular SMART tests I can have scheduled to run so I actually get notified of these errors? I'm a little confused on how the drives show green lights in SMART inside OMV but are throwing errors.

These drives are all relatively new, like a year or less old and have been running at good temps in this server. Is it possible that it's the cabling inside causing the CRC errors?

I certainly have read the pitfalls of using gutted desktop HDDs vs. network drives, but never thought it would be that severe to have drives dying in less than a year with little to no usage, just sitting there storing media.

tkaiser · July 25, 2017

Quote from bbddpp

Is it possible that it's the cabling inside causing the CRC errors?

SMART attribute 199 is about CRC errors. That's cabling/contacts. I don't know which SMART attributes OMV uses (maybe just the general 'health attribute' most disks support, which never includes cabling problems but only internal 'health parameters').

I would be less concerned about temperatures than about vibrations (the reasons are in this comment thread). But these are different things. You should monitor SMART attribute 199 while running heavy loads: if it is not increasing, the counter is simply reporting problems from the past.
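
As a rough sketch of that before/after check: note the raw value of attribute 199 (the last column of its line in the smartctl -A output), put the disk under heavy load, read the value again, and compare the two numbers. The helper name crc_verdict is hypothetical and only does the comparison; it does not talk to the disk itself.

```shell
# Hypothetical helper: given the raw attribute-199 value before and after a
# heavy-load run, say whether the CRC counter is still climbing.
crc_verdict() {
    before=$1
    after=$2
    if [ "$after" -gt "$before" ]; then
        echo "increasing by $((after - before)): errors are still happening"
    else
        echo "unchanged: counter only reports problems in the past"
    fi
}
```

For example, crc_verdict 24686 24690 would report an increase of 4, which points at a cable or connector that is still misbehaving.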
