System hangs again

shadowcast · 13. August 2014

Hello,

my start with a new OMV-based NAS did not go well:
http://phpbb.openmediavault.or…966a4f0bd09d558b435b583a3

A few months ago the cause finally turned out to be the CPU.

For a few days now I have been getting exactly the same error again. The whole system freezes and the screen shows colored dots, as in the topic above. The system is then unreachable, neither via the web interface (WI) nor via SSH.
On one disk I saw a SMART error count, so I decided to replace it.

But why would the whole system freeze when an HDD fails?
My OS is on a 32GB SSD. The HDDs are 4x 2TB WD Red, set up as two RAID 1 arrays with 2 HDDs each.

I could also see that the RAID 1 was resyncing. The freeze came between 10 and 30%.

Now I have added a new disk to the RAID, but the freezes still occur.
My question: can the system freeze because an HDD is damaged?
Should I check the SSD for failures? The SMART quick tests are fine. Only one RAID HDD shows failures.
Could the CPU be damaged again? That would be very suspicious.

My mainboard, as in the topic above:
http://www.amazon.de/dp/B00J0D…171_37038021_TE_M3T1_dp_1
The CPU:
http://www.amazon.de/gp/produc…age_o05_s00?ie=UTF8&psc=1

Again so much time lost....

Greets

davidh2k · 13. August 2014

Yes, your system can freeze because of hard disk errors. If the filesystem is damaged you can get a kernel error, which can definitely freeze your system.

Does your SSD support TRIM? Please post your syslog and kern.log from a time when your system froze. If you have a kernel error it will, in most cases but not always, be written to the logs.

Greetings
David

shadowcast · 13. August 2014

Hello,

I really hope it is not the CPU again.
Yesterday I replaced the defective disk. Since then I have had no further SMART errors. I removed all partitions on the new disk and was then able to select it again for my RAID 1.
The status is "clean,degraded, recovering (unknown)". Currently, if I double-click this entry, I see 45%.
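
The same progress figure can also be read outside the web interface, since the kernel reports it in /proc/mdstat. A minimal Python sketch, assuming the usual mdstat layout; the sample text, device names and the function name `recovery_progress` are made up for illustration and are not taken from this NAS:

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (not from this NAS).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb[2] sda[0]
      1953383360 blocks super 1.2 [2/1] [U_]
      [=========>...........]  recovery = 45.0% (879022512/1953383360) finish=120.5min speed=148000K/sec

md1 : active raid1 sdd[1] sdc[0]
      1953383360 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""


def recovery_progress(mdstat_text):
    """Return {array: (action, percent)} for arrays that are resyncing or recovering."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # A new array section starts here.
            current = header.group(1)
            continue
        status = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", line)
        if status and current:
            progress[current] = (status.group(1), float(status.group(2)))
    return progress


if __name__ == "__main__":
    # On the NAS itself the real file would be read instead:
    #   with open("/proc/mdstat") as f: text = f.read()
    for array, (action, percent) in recovery_progress(SAMPLE_MDSTAT).items():
        print(f"{array}: {action} at {percent:.1f}%")
```

Arrays that are fully in sync (like md1 in the sample) simply do not appear in the result.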

BUT:
During this recovery, yesterday and today, I also got freezes???

The last one was today at 6:14.
The autoshutdown script normally works well: once my computer is off, the NAS shuts itself down after 1 minute. I saw that this did not happen, but I had to go to work, so I left it running.
At 16:11 I came back and found, as expected, a frozen system.
Only a hard shutdown and restart helped. Since then it has been continuing the recovery, so far without further freezes???

The log files you asked for, covering the time from 6:14 until 16:11, are attached to this thread.
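
For anyone who wants to cut such a window out of syslog or kern.log themselves, here is a rough Python sketch. It assumes the classic syslog timestamp format ("Aug 13 06:21:53") and an English month abbreviation; the function name `lines_in_window` is hypothetical, and apart from the mountd line quoted in this thread the sample lines are invented:

```python
from datetime import datetime

# Sample lines in classic syslog format. Only the mountd line comes from
# this thread; the other two are invented for illustration.
SAMPLE_LOG = [
    "Aug 13 06:10:02 sdev-nas-omv CRON[2101]: (root) CMD (example job)",
    "Aug 13 06:21:53 sdev-nas-omv mountd[1590]: authenticated unmount request "
    "from 192.168.178.11:720 for /export/data (/export/data)",
    "Aug 13 16:30:00 sdev-nas-omv kernel: example line after the reboot",
]


def lines_in_window(lines, start, end, year=2014):
    """Yield the lines whose timestamp lies between start and end (inclusive).

    Classic syslog timestamps carry no year, so one has to be supplied.
    """
    for line in lines:
        stamp = line[:15]  # e.g. "Aug 13 06:21:53"
        try:
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if start <= when <= end:
            yield line


if __name__ == "__main__":
    freeze_start = datetime(2014, 8, 13, 6, 14)
    freeze_end = datetime(2014, 8, 13, 16, 11)
    for line in lines_in_window(SAMPLE_LOG, freeze_start, freeze_end):
        print(line)
```

With the real files, the list would be replaced by the lines of /var/log/syslog or /var/log/kern.log.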

Thank you...

Solo0815 · 13. August 2014

I can't see any noticeable errors besides the degraded RAID ones, but I'm not that familiar with RAID and mdadm.
Maybe a newer kernel (3.2 for 0.5, 3.13 for 1.0.x) will help if the HW is brand new.

davidh2k · 13. August 2014

Code

Aug 13 06:21:53 sdev-nas-omv mountd[1590]: authenticated unmount request from 192.168.178.11:720 for /export/data (/export/data)

That's what I spotted... it seems to be NFS and/or AutoFS related.

Greetings
David

shadowcast · 13. August 2014

Hi,

do you think it's the NFS connection to my main computer? How can you tell that I use AutoFS?

I have used AutoFS since the first day with the same config, which I will post now. Everything ran well for the last 3 months:

Code

/etc/auto.master


/media/omv /etc/auto.omv --timeout=5 --ghost

and

Code

/etc/auto.omv


data -fstype=nfs,rw,soft,tcp,rsize=32768,wsize=32768 sdev-nas-omv:/export/data

In the meantime the resync is at 75%. When I came back an hour ago, the system was frozen again.
But when I left, a file on the AutoFS mount was still open.

Now I will try to get the second HDD fully synchronized without any mount.
If I understand your idea correctly, this should then work.

Greets

davidh2k · 13. August 2014

Quote from shadowcast

Hi,

do you think it's the NFS connection to my main computer? How can you tell that I use AutoFS?

[...]

Because if you google that error, the first hit shows that it's about AutoFS.

Greetings
David

shadowcast · 13. August 2014

Okay. In the last half hour the system restarted by itself more than 3 times. No AutoFS; I was only logged in to the WI.

:-(((

Solo0815 · 13. August 2014

It restarted by itself? Or did it freeze and you started it again?

tekkb · 14. August 2014

Is this a new build and did you run Memtest on it before you installed OMV?

http://www.memtest.org/

shadowcast · 14. August 2014

I'm really desperate. I let the system run overnight. Every device that connects to the NAS was turned off. No SMB, no NFS, nothing else.
I noticed one or two spontaneous restarts, as in my last post. I could hear this from the HDDs during boot, because the system runs a quota check when a restart follows a hard shutdown or failure.

Yes, I now have two problems: spontaneous restarts and the freezes. These are the same errors as a few months ago, when the CPU was damaged. This matches the graphical glitches on the screen described in the posts linked above.

In the meantime, my RAID 1 is still not ready; syncing is at 89%. My second RAID now also needs a resync, perhaps because of all the failures and restarts. That sync is at 2%. Once the system has been running for a few minutes, it freezes again.

The hard disk that had SMART errors has now been removed. I will send it back.
I think the best way, as last time, is to send the CPU back. Last time I ran a few tests, memtest and so on. All good. I also changed the mainboard, without effect. Only the CPU change was successful last time.

Is this hardware configuration not recommendable? Does anyone have the same mainboard and CPU as in the first post running? This is the second time the CPU has broken down within a few weeks.

davidh2k · 14. August 2014

Maybe you should try to replace your PSU. It sounds like it could be the reason for your errors.

A faulty PSU can damage all of your system components.

Greetings
David

shadowcast · 14. August 2014

I checked the PSU again with my tester, without any fault.

What do you make of this log:

Code

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 2
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.


Error 2 [1] occurred at disk power-on lifetime: 50 hours (2 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.


  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 00 9f 00 40 00


  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 03 00 00 00 00 00 00 00 9f 00 40 08     01:45:46.184  READ FPDMA QUEUED
  60 04 00 00 18 00 00 00 00 9b 00 40 08     01:45:46.184  READ FPDMA QUEUED
  60 04 00 00 10 00 00 00 00 97 00 40 08     01:45:46.181  READ FPDMA QUEUED
  60 00 80 00 08 00 00 00 00 96 80 40 08     01:45:46.178  READ FPDMA QUEUED
  60 04 00 00 00 00 00 00 00 92 80 40 08     01:45:46.175  READ FPDMA QUEUED


Error 1 [0] occurred at disk power-on lifetime: 28 hours (1 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.


  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 88 c9 96 00 40 00


  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 58 00 00 88 c9 a1 00 40 08     02:39:05.223  READ FPDMA QUEUED
  60 01 80 00 50 00 00 88 c9 9f 80 40 08     02:39:05.222  READ FPDMA QUEUED
  60 01 00 00 48 00 00 88 c9 9e 80 40 08     02:39:05.220  READ FPDMA QUEUED
  60 00 80 00 40 00 00 88 c9 9e 00 40 08     02:39:05.219  READ FPDMA QUEUED
  60 01 00 00 38 00 00 88 c9 9d 00 40 08     02:39:05.217  READ FPDMA QUEUED


SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Three of my 4 WD Reds have such logs and no self-test log.

It seems the freezes depend on the RAIDs? If I remove one disk from each RAID 1, the system runs for hours...
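
As a reading aid for the log above: in both entries the register row after command completion starts with ER = 40 and ST = 51. The following Python sketch decodes those two bytes using the bit names from the ATA command set (partly historical ones for bits that later standards mark obsolete). The helper name `decode` and the selection of bits are my own illustration, not smartctl output. It only names the flags the drive reported; it cannot tell whether the drive or the unstable system is the root cause.

```python
# Error register bits (ER). Only the commonly reported ones are listed.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # ID (sector address) not found
    0x04: "ABRT",  # command aborted
}

# Status register bits (ST). DSC is a historical name (device seek complete).
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",
    0x08: "DRQ",   # data request
    0x01: "ERR",   # an error occurred, details are in the error register
}


def decode(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True) if value & bit]


if __name__ == "__main__":
    # Register row of "Error 2" from the log above: ER is the first field,
    # ST the third (the "--" in between is just a separator column).
    row = "40 -- 51 00 00 00 00 00 00 9f 00 40 00"
    fields = row.split()
    er = int(fields[0], 16)
    st = int(fields[2], 16)
    print("ER:", decode(er, ERROR_BITS))
    print("ST:", decode(st, STATUS_BITS))
```

Running a self-test with smartctl -t, as the last line of the log suggests, would give the drive's own view of its state in addition to these entries.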

tekkb · 15. August 2014

Again, run memtest with no hard drives attached. See if you get errors.

shadowcast · 15. August 2014

Okay now. I installed memtest86 and started the system with it.

At 5% the system hangs, but without the graphical glitches.
With the first RAM module removed, the test runs for a while, then the system reboots (4x).
With the first RAM module inserted and the second removed: same result (3x).

With only 1 drive in each RAID, everything runs well.

In the meantime, every one of my HDDs has one or more log entries like the ones above:

Code

Error 2 [1] occurred at disk power-on lifetime: 50 hours (2 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

I have now run the system for hours with only 1 HDD in each RAID. Then I plugged in the other HDDs and rebuilt the RAIDs... Same failure as in the first post.

I think it would be best to wait for the new CPU.
But what about all the SMART logs? On every disk?
They only say that when the command came, the device was active or idle. Since these are only read commands, this should not be a disk failure, or??? Should I send back all 4 devices???

What have I done to be punished like this with this NAS???

tekkb · 15. August 2014

You were getting errors, hangs and reboots with no hard drives connected???? Your hard drives will not work right if you have bad ram modules. You should have 0 errors when you run memtest. There is no point in connecting drives if you have errors.

shadowcast · 15. August 2014

Yes. Hangs and reboots also without hard disks. Two 4GB RAM modules are installed.
With both installed, Memtest hangs every time at 5%.
With only one of the two installed, the system restarts.

I don't know what to do. Okay, waiting for the new CPU. Based on Google results, I think Memtest also runs the CPU at 100%?

tekkb · 15. August 2014

The RAM is bad or it is not for that motherboard. Everything else is probably OK. You cannot run your system until the RAM runs stable. It will corrupt any data on the hard drives. Tell us your motherboard and the RAM that you have. We can verify whether it is the right kind of RAM for your motherboard. If it is the right kind, you need to return the RAM for new sticks.

davidh2k · 15. August 2014

Did you try to swap the RAM to another RAM Bank/another Channel?

Greetings
David

shadowcast · 17. August 2014

Hello all,

here are the latest facts about my NAS, which is running again without trouble.

First, my configuration:
The mainboard
The CPU
and the RAM

Yesterday the new CPU arrived, because this was the same failure as about 3 months ago. The CPU was changed, and not 10 minutes later the same error occurred.

With memtest86 the test hangs at 5%. With single RAM modules the test reset on its own.
But with memtest86+ the tests ran to completion (see memtest410.jpg in the attachment).

No errors were found, but the settings shown in the image are very strange:
CAS 1-3-3-3 with DDR2??? According to the RAM specification it should be 9-9-9-24, and it is DDR3.

Okay. I swapped the 2x 4GB modules for 2x 4GB modules from my main computer, which uses the following RAM:
RAM from main computer
According to the specifications, the RAM modules should be identical.

But with these RAM modules the NAS now works correctly, without freezes and reboots.

Memtest86+ shows the same settings as with the other modules. In the BIOS everything is set up correctly. Okay, but it works.

I inserted the NAS RAM modules into the main computer, so that it has 16GB RAM again. The ASUS mainboard there showed me with an LED that the RAM is incompatible. After pressing the button as described in the manual, all is now good.

So the result of this failure is really that the RAM was not compatible with the mainboard.

I hope this thread helps someone else. I spent many hours solving this problem.
I also hope the NAS now runs longer than 3 months.

My final question: what should I do with all 4 HDDs (2TB WD Red)?
Each of them has one or more SMART error log entries:

Code

Error 2 [1] occurred at disk power-on lifetime: 50 hours (2 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

Should I replace them? Or are these entries a result of the unstable system and RAM modules? Can I remove these entries from the logs?

Greets
