System Crash, nothing shows up in Syslog

davidh2k · 30 December 2013

Hi Forum,

Today I found that my NAS was unreachable roughly an hour after I had started it and copied a file onto it.

While I was writing this post and collecting the logs, I found this error:

kern.log.1

Code

[...]
Dec 29 04:02:29 chap kernel: [   91.155474] ip6_tables: (C) 2000-2006 Netfilter Core Team
Dec 29 04:06:02 chap kernel: [  304.067797] EXT4-fs (sda1): error count: 1
Dec 29 04:06:02 chap kernel: [  304.067802] EXT4-fs (sda1): initial error at 1385238319: ext4_lookup:1050: inode 1587518
Dec 29 04:06:02 chap kernel: [  304.067807] EXT4-fs (sda1): last error at 1385238319: ext4_lookup:1050: inode 1587518

kern.log

Code

[...]
Dec 30 03:56:49 chap kernel: [  304.071969] EXT4-fs (sda1): error count: 1
Dec 30 03:56:49 chap kernel: [  304.071976] EXT4-fs (sda1): initial error at 1385238319: ext4_lookup:1050: inode 1587518
Dec 30 03:56:49 chap kernel: [  304.071982] EXT4-fs (sda1): last error at 1385238319: ext4_lookup:1050: inode 1587518
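
A side note on reading these lines: the number after "initial error at" and "last error at" looks like a Unix timestamp (seconds since 1970-01-01 UTC), not an uptime value. A minimal Python sketch to convert it, assuming that interpretation is right (the function name is made up for illustration):

```python
from datetime import datetime, timezone


def ext4_error_time(epoch_seconds: int) -> str:
    """Format a Unix timestamp as a readable UTC date string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# Timestamp copied from the "initial error at" / "last error at" lines above.
print(ext4_error_time(1385238319))  # 2013-11-23 20:25:19 UTC
```

If that reading is correct, the error was recorded on 23 November, weeks before the freeze. Both log excerpts then show the same single stored error being reported again about five minutes after each boot, not a new error per crash.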

Code

root@chap:/var/log# blkid
/dev/sdb1: LABEL="RAID5_9TB" UUID="d3c21c66-af4b-41d4-b098-462e83fa641d" TYPE="xfs"
/dev/sda1: UUID="1eed4e13-6904-4af5-9b0d-8d01093f9c38" TYPE="ext4"
/dev/sda3: UUID="d0c9a972-fcd3-4980-929b-2d1bfda52076" TYPE="xfs"
/dev/sda5: UUID="2b60b699-b2e1-49f1-b9f0-b1f38300ba75" TYPE="swap"
/dev/sdc1: LABEL="Sammy1" UUID="ca425484-1be3-47f1-b7bb-8f0785c9ea5b" TYPE="xfs"
root@chap:/var/log# df -h
Dateisystem           Size  Used Avail Use% Eingehängt auf
/dev/sda1              29G  7,0G   21G  26% /
tmpfs                 2,0G   20K  2,0G   1% /lib/init/rw
udev                  1,9G  208K  1,9G   1% /dev
tmpfs                 2,0G  4,0K  2,0G   1% /dev/shm
/dev/sdb1              11T  8,2T  2,8T  75% /media/d3c21c66-af4b-41d4-b098-462e83fa641d
/dev/sda3             429G  5,6G  424G   2% /media/d0c9a972-fcd3-4980-929b-2d1bfda52076
/dev/sdc1             1,4T  471G  926G  34% /media/ca425484-1be3-47f1-b7bb-8f0785c9ea5b
root@chap:/var/log#


So obviously the filesystem on my OS drive is in fact damaged, although SMART doesn't show anything suggesting that sectors were reallocated.

Code

root@chap:/var/log# smartctl /dev/sda -a
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net


=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3 EG series
Device Model:     SAMSUNG HD503HI
Serial Number:    S23CJ9DZ602665
Firmware Version: 1AJ10001
User Capacity:    500.107.862.016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Mon Dec 30 05:17:29 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (6000) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 100) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   085   084   025    Pre-fail  Always       -       4742
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       875
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6200
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       890
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       8251
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   047   000    Old_age   Always       -       32 (Lifetime Min/Max 8/53)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       62
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       909


SMART Error Log Version: 1
No Errors Logged


SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]




Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Things that stand out:
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 1
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 8251
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 62

Google told me that none of these values are anything to worry about as long as the normalized value isn't heading towards the threshold.
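
To make that "value versus threshold" check concrete, here is a small Python sketch that parses attribute rows in the format smartctl printed above and reports how far VALUE is from THRESH. The function names and the idea of a "margin" are my own for illustration, not part of smartmontools. A threshold of 000 is treated as informational, on the assumption that such an attribute can never trip on its threshold.

```python
SAMPLE = """\
  3 Spin_Up_Time            0x0023   085   084   025    Pre-fail  Always       -       4742
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       8251
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       62
"""


def parse_smart_attributes(text):
    """Extract attribute rows from 'smartctl -a' style output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have a numeric ID, a hex flag and at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": " ".join(fields[9:]),
        })
    return rows


def margins(rows):
    """Map attribute name to VALUE - THRESH, or None when THRESH is 0."""
    return {
        r["name"]: (r["value"] - r["thresh"]) if r["thresh"] > 0 else None
        for r in rows
    }


for name, margin in margins(parse_smart_attributes(SAMPLE)).items():
    print(name, "informational" if margin is None else f"margin {margin}")
```

For the three attributes listed above the threshold is 000, so by this logic only the raw counts carry any information, and watching whether they grow over time is probably more useful than the normalized value.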

What would you check for this type of error?

Greetings
David

---- you can ignore the stuff beneath here ----

What I did: I started my NAS at '03:53:03' and copied a file over via SMB at approximately '03:59'. I then left the NAS alone and went to bed, intending to watch the file there. At 4:35 my Pi couldn't reach the NAS. Even after I attached a USB keyboard and my TV, no console appeared on the screen, so I reset the machine.

Judging from my temperature script for the NAS drives, the NAS went down between '04:00:02' and '04:03:02', because the '04:00:02' entry is the last one written to the log.

The last messages in the syslog are nothing I would worry about. They show that the system was running at least until '04:01:25'; there is nothing after that until the hard reboot at '04:42:47'.

Code

Dec 30 04:00:01 chap /USR/SBIN/CRON[4395]: (root) CMD (/var/lib/openmediavault/cron.d/userdefined-311f8b5e-9be4-4dfa-b4ea-a1bc50d2e37f >/dev/null 2>&1)
Dec 30 04:00:01 chap /USR/SBIN/CRON[4396]: (root) CMD (test -x /usr/sbin/cron-apt && /usr/sbin/cron-apt)
Dec 30 04:00:01 chap /USR/SBIN/CRON[4397]: (root) CMD (/usr/sbin/omv-mkgraph >/dev/null 2>&1)
Dec 30 04:00:24 chap dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 5
Dec 30 04:00:29 chap dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 5
Dec 30 04:00:34 chap dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 9
Dec 30 04:00:43 chap dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 11
Dec 30 04:00:54 chap dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 19
Dec 30 04:01:13 chap dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 12
Dec 30 04:01:25 chap dhclient: No DHCPOFFERS received.
Dec 30 04:01:25 chap dhclient: No working leases in persistent database - sleeping.
Dec 30 04:42:47 chap kernel: imklog 4.6.4, log source = /proc/kmsg started.
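
For anyone who wants to find such a silent window in a longer log without scrolling, a rough Python sketch follows. It assumes the classic syslog timestamp format shown above, which carries no year, so the year is passed in; it also assumes an English/C locale so that "Dec" parses. The function name is hypothetical.

```python
from datetime import datetime

SAMPLE_LINES = [
    "Dec 30 04:01:13 chap dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 12",
    "Dec 30 04:01:25 chap dhclient: No DHCPOFFERS received.",
    "Dec 30 04:01:25 chap dhclient: No working leases in persistent database - sleeping.",
    "Dec 30 04:42:47 chap kernel: imklog 4.6.4, log source = /proc/kmsg started.",
]


def largest_gap(lines, year=2013):
    """Return (earlier, later) for the widest gap between consecutive entries."""
    stamps = []
    for line in lines:
        try:
            # The first 15 characters hold "Mon DD HH:MM:SS".
            stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue  # skip lines that do not start with a timestamp
    best = None
    for earlier, later in zip(stamps, stamps[1:]):
        if best is None or later - earlier > best[1] - best[0]:
            best = (earlier, later)
    return best


gap = largest_gap(SAMPLE_LINES)
print(gap[0], "->", gap[1], "=", gap[1] - gap[0])
```

On the excerpt above this points at the window from 04:01:25 to 04:42:47, about 41 minutes with no log entries at all.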


Full Syslog on pastebin: http://pastebin.com/BGaAcMjL

I think this already happened one or two days ago, but I'm not sure. Because I don't see anything in the syslog, I doubt that it is a kernel panic or something similar; it seems the system freezes completely, without being able to log anything after that point of no return.

Any ideas where I should look for errors? Samba cores do not have any logs. The Samba log for my desktop computer is empty.

log.smbd

Code

[2013/12/30 03:53:04,  0] smbd/server.c:1123(main)
  smbd version 3.5.6 started.
  Copyright Andrew Tridgell and the Samba Team 1992-2010
[2013/12/30 04:42:48,  0] smbd/server.c:1123(main)
  smbd version 3.5.6 started.
  Copyright Andrew Tridgell and the Samba Team 1992-2010

log.nmbd

Code

[2013/12/30 03:53:02,  0] nmbd/nmbd.c:857(main)
  nmbd version 3.5.6 started.
  Copyright Andrew Tridgell and the Samba Team 1992-2010
[2013/12/30 04:42:48,  0] nmbd/nmbd.c:857(main)
  nmbd version 3.5.6 started.
  Copyright Andrew Tridgell and the Samba Team 1992-2010
