4 disc Raid5: S.M.A.R.T attributes show errors - S.M.A.R.T. Test PASSED!

PAPPL · 26 January 2019

Hi,
I'm running a 4-disc mdadm RAID5:

sda (3TB)
sdb (3TB)
sdc (3TB)
sdd (3TB)

One drive (sdb) shows a red dot in the OMV SMART settings (popup message: Drive has few bad blocks).
Here is a log of the SMART attributes:

Code

root@Server:~# sudo smartctl -A /dev/sdb
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE 
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       8791      
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       6208      
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3283      
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       3         
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0         
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7197      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0         
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0         
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1263      
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       66        
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3216      
194 Temperature_Celsius     0x0022   117   105   000    Old_age   Always       -       33        
196 Reallocated_Event_Count 0x0032   197   197   000    Old_age   Always       -       3         
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       142       
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0         
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0         
200 Multi_Zone_Error_Rate   0x0008   200   187   000    Old_age   Offline      -       0


Not so good:
Raw_Read_Error_Rate: 8791 (other discs show 0)
Reallocated_Sector_Ct: 3 (other discs show 0)
Current_Pending_Sector: 142 (other discs show 0)
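
The same comparison across all four members can be repeated with a small filter. This is only a sketch: `smart_summary` is a hypothetical helper name, and it assumes the ten-column `smartctl -A` layout shown in the log above.

```shell
# Print the raw values of attributes 1, 5, 197 and 198.
# Reads `smartctl -A` output on stdin; $1 is a label for the disk.
smart_summary() {
    awk -v disk="$1" '
        $1 == 1 || $1 == 5 || $1 == 197 || $1 == 198 {
            # $2 = attribute name, $10 = RAW_VALUE column
            printf "%s %s %s\n", disk, $2, $10
        }'
}

# Usage (as root, smartmontools installed; disk names taken from the post):
#   for d in sda sdb sdc sdd; do smartctl -A "/dev/$d" | smart_summary "$d"; done
```

Run regularly, this makes it easy to see whether the pending and reallocated counts on sdb are growing or stable.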

Should I be concerned? The RAID status is clean and everything is working fine, except for the strange SMART log.

sdb has a few bad sectors which are pending. I tried to get the bad block numbers to reallocate them, but the SMART health test PASSED!

Code

root@Server:~# smartctl -H /dev/sdb
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

I ran self-tests; they completed without errors and do not show any bad blocks in LBA_of_first_error! Why? There are pending sectors!

Code

root@Server:~# smartctl -l selftest /dev/sdb
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error  
# 1  Short offline       Completed without error       00%      7195         -                   
# 2  Extended offline    Completed without error       00%      7180         -                   
# 3  Short offline       Completed without error       00%      7172         -

tune2fs and fdisk need the exact block numbers to reallocate/overwrite pending blocks, but no faulty blocks are listed under LBA_of_first_error.
How can I fix the pending sectors without losing the RAID5 or any data?

Please help
pappl
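
One route sometimes suggested for pending sectors on an md member is to let md rewrite them itself. During a `check` scrub md reads every stripe, and when a member returns a read error it recomputes the data from the other members and writes it back, which gives the drive the chance to remap the sector. A minimal sketch, assuming the array is called md0 (check /proc/mdstat for the real name); `start_scrub` and the `SYSFS_ROOT` override are hypothetical names, the latter added only so the function can be tried against a scratch directory. This is not a guaranteed fix, and a scrub puts load on all four disks.

```shell
# Start a consistency check ("scrub") on an md array.
# $1 = array name without /dev/, e.g. md0 (assumed name).
start_scrub() {
    action="${SYSFS_ROOT:-/sys}/block/$1/md/sync_action"
    if [ ! -w "$action" ]; then
        echo "cannot write $action (wrong array name, or not root?)" >&2
        return 1
    fi
    state=$(cat "$action")
    if [ "$state" != idle ]; then
        # A resync, recovery or check is already running.
        echo "$1 is busy: $state" >&2
        return 1
    fi
    echo check > "$action"
}

# Usage, as root:
#   start_scrub md0
#   cat /proc/mdstat                          # progress of the check
#   smartctl -A /dev/sdb | grep -i pending    # compare attribute 197 afterwards
```

Checking that backups are current before starting seems prudent, since the scrub reads the suspect drive from end to end.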

Adoby · 27 January 2019

I have very little experience with RAID and SMART errors.

But I would do this:

1. Order a new replacement HDD at once!
2. Check that your backups are safe.
3. Consider not using the NAS until you have replaced the HDD. Creating extra backups may push the disk over the edge, perhaps along with more disks, resulting in total loss of all data. But you have backups, don't you? So that is not a big problem.

raulfg3 · 27 January 2019

5 Reallocated_Sector_Ct should be 0. You still have 140 sectors left before data loss, so you have time to order a new disk or return your defective disk under warranty.

tkaiser · 27 January 2019

Quote from PAPPL

Raw_Read_Error_Rate: 8791

Which drive vendor? In case it's Seagate you might want to educate yourself about 'numbers without meaning' (at least when interpreted as decimal numbers, which is wrong): http://www.users.on.net/~fzabk…Seagate_SER_RRER_HEC.html

PAPPL · 27 January 2019

Thanks for your help!

Quote from raulfg3

5 Reallocated_Sector_Ct should be 0. You still have 140 sectors left before data loss, so you have time to order a new disk or return your defective disk under warranty.

Good news!

Quote from tkaiser

Which drive vendor? In case it's Seagate you might want to educate yourself about 'numbers without meaning' (at least when interpreted as decimal numbers, which is wrong): http://www.users.on.net/~fzabk…Seagate_SER_RRER_HEC.html

WD Red 3TB, thanks for the information.

Quote from Adoby

I have very little experience with RAID and SMART errors.

But I would do this:

1. Order a new replacement HDD at once!
2. Check that your backups are safe.
3. Consider not using the NAS until you have replaced the HDD. Creating extra backups may push the disk over the edge, perhaps along with more disks, resulting in total loss of all data. But you have backups, don't you? So that is not a big problem.

Backups exist for all shares I don't want to lose.

How do I replace one drive of the 4-disc RAID5 system?

- Buy a new 3TB drive
- Power off the server (OMV 4.1.17)
- Replace the faulty hdd
- Power on and log into OMV web administration
- Will the RAID5 then resync automatically for hours? Or do I need to keep additional settings in mind?

pappl

geaves · 27 January 2019

Quote from PAPPL

- Power off the server (OMV 4.1.17)
- Replace the faulty hdd
- Power on and log into OMV web administration
- Will the RAID5 then resync automatically for hours? Or do I need to keep additional settings in mind?

No, all this will need to be completed from the CLI.

If you search for replacing a hard drive in an mdadm RAID 5 you'll get some answers. This is just one; the only thing I'm not sure about is whether the array has to be stopped first.
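
For orientation, the CLI replacement usually comes down to three mdadm manage-mode commands. As far as I know these act on a running array, so it should not need to be stopped. A dry-run sketch, assuming the array is /dev/md0 and the members are whole disks (both are assumptions; check `cat /proc/mdstat` and `mdadm --detail`). `run`, `remove_member`, `add_member` and the `DRY_RUN` variable are hypothetical helper names; by default the commands are only printed so they can be reviewed first.

```shell
# Print the commands by default; set DRY_RUN=0 to execute them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Phase 1, before shutdown: mark the old member failed, then remove it.
remove_member() {    # $1 = array, $2 = old member
    run mdadm --manage "$1" --fail "$2" &&
    run mdadm --manage "$1" --remove "$2"
}

# Phase 2, after the new disk is installed: add it; md starts the rebuild.
add_member() {       # $1 = array, $2 = new member
    run mdadm --manage "$1" --add "$2" &&
    run cat /proc/mdstat
}

# Usage:
#   remove_member /dev/md0 /dev/sdb    # review the printed commands
#   DRY_RUN=0 remove_member /dev/md0 /dev/sdb
#   (power off, swap the disk, power on, confirm the new device name)
#   DRY_RUN=0 add_member /dev/md0 /dev/sdb
```

Device names can change after a reboot, so it is worth confirming which name the new disk received before adding it. If a spare port is available, newer mdadm versions also offer `--replace` with `--with`, which copies onto the new disk while the old one is still in the array.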

PAPPL · 28 January 2019

After reading about RAID5 array data loss, I don't want to wait until the suspicious drive gets more errors.
If a second drive fails all data is lost. I should have set up a more secure RAID6 array. I have a backup of all important data, but not of all data, due to the high cost.

I'm a little afraid of swapping the drive and of the rebuild process, because some users run into problems after swapping a still-functioning drive in a clean RAID5 array (rebuild not possible, array no longer visible, array cannot be mounted, data lost, ...).
So the drive to be replaced must first be removed from the array via the CLI.
Every tutorial seems to use different command options (--force, ...).
Interesting that there is no noob-proof sticky tutorial thread about swapping a still-working or faulty drive and rebuilding a RAID5 after all these years.

pappl

Adoby · 28 January 2019

Perhaps it is assumed that "noobs" don't fiddle with RAID?

I wouldn't like to. It seems very easy to set up, but when something goes wrong it seems very scary.

geaves · 28 January 2019

Quote from PAPPL

Interesting that there is no noob-proof sticky tutorial thread about swapping a still-working or faulty drive and rebuilding a RAID5 after all these years.

Nothing is noob-proof. Here is a thread you may find interesting. For the average home user a RAID setup makes no sense; @Adoby has an excellent thread on here describing his own setup, which is a much better option than using RAID. RAID is easy to set up but can be a PITA to recover.

tkaiser · 28 January 2019

Quote from PAPPL

If a second drive fails all data is lost

It's worse than that with most RAID-5 implementations. A single URE (unrecoverable read error) on one of the remaining disks occurring when rebuilding a RAID-5 can stop the whole rebuild and your whole array is lost.
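
To put a rough number on that risk for this array: a rebuild has to read the three remaining 3TB disks in full. The sketch below assumes the often-quoted consumer spec of one unrecoverable error per 1e14 bits read (an assumption, not a measured value for these drives) and uses the Poisson approximation P = 1 - exp(-bits/rate). `ure_risk` is a hypothetical helper name.

```shell
# Probability of at least one URE while reading $1 disks of $2 TB each,
# assuming one unrecoverable error per $3 bits read.
ure_risk() {
    awk -v disks="$1" -v tb="$2" -v rate="$3" 'BEGIN {
        bits = disks * tb * 1e12 * 8        # total bits read during rebuild
        printf "%.2f\n", 1 - exp(-bits / rate)
    }'
}

ure_risk 3 3 1e14    # prints 0.51
```

This treats the spec-sheet figure as the actual error rate, which is likely pessimistic, but it illustrates why a rebuild on large disks is not a sure thing.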

Traditional RAID is not about data safety but only about data availability and nobody at home needs this. Unfortunately almost everyone at home playing RAID forgets about backup.
