nginx fails, then succeeds, then fails again

tannaroo · 27 January 2022

Although I haven't touched my OMV5 setup, about a week ago I started to get monitoring emails reading 'connection failed nginx' and 'execution failed nginx', followed by 'execution succeeded nginx' and 'connection succeeded nginx'.

I thought it would resolve itself, but over the past couple of days the nginx failed/succeeded alerts have become more frequent (about every hour or so). I'm not sure what is happening. Is this a reason to be concerned?

Zoki · 27 January 2022

Are you running some heavy load jobs on a weak system?

tannaroo · 27 January 2022

I don't think so. I'm using a Dell i5 computer running the same Docker containers (Plex, UrBackup, LMS) I've been using for a couple of years without any issues.

Zoki · 27 January 2022

So check the logs: cat /var/log/nginx/error.log

tannaroo · 27 January 2022

This is the output - not sure what it means

Code

root@omv:~# cat /var/log/nginx/error.log
2022/01/27 00:23:03 [alert] 5563#5563: worker process 5565 exited on signal 9
2022/01/27 00:23:03 [alert] 5563#5563: worker process 5566 exited on signal 9
2022/01/27 00:23:03 [alert] 5563#5563: worker process 5567 exited on signal 9
2022/01/27 01:29:20 [alert] 21632#21632: worker process 21633 exited on signal 9
2022/01/27 01:29:20 [alert] 21632#21632: worker process 21634 exited on signal 9
2022/01/27 01:29:20 [alert] 21632#21632: worker process 21635 exited on signal 9
2022/01/27 01:29:20 [alert] 21632#21632: worker process 21636 exited on signal 9
2022/01/27 07:01:29 [alert] 23808#23808: worker process 23810 exited on signal 9
2022/01/27 07:01:29 [alert] 23808#23808: worker process 23811 exited on signal 9
2022/01/27 07:01:29 [alert] 23808#23808: worker process 23812 exited on signal 9
2022/01/27 09:15:08 [alert] 7536#7536: worker process 7537 exited on signal 9
2022/01/27 09:15:08 [alert] 7536#7536: worker process 7538 exited on signal 9
2022/01/27 09:15:08 [alert] 7536#7536: worker process 7539 exited on signal 9
2022/01/27 10:30:23 [alert] 8779#8779: worker process 8783 exited on signal 9
2022/01/27 10:30:23 [alert] 8779#8779: worker process 8781 exited on signal 9
2022/01/27 10:30:23 [alert] 8779#8779: worker process 8780 exited on signal 9
2022/01/27 10:30:23 [alert] 8779#8779: worker process 8782 exited on signal 9
root@omv:~#


Zoki · 27 January 2022

Something is killing your nginx processes. I have only seen that with OOM (out of memory). Do you have something in the syslog / dmesg?
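To see how often the workers are being killed, the "exited on signal" lines in the nginx error log can be grouped by timestamp and signal number. The following is only a sketch: the function name summarize_worker_kills is made up for illustration, and it assumes the log line layout shown in the output above.

```shell
# Hypothetical helper (not part of nginx or OMV).
# Reads nginx error-log lines on stdin and prints one summary line per
# timestamp/signal pair, in the order they first appear.
summarize_worker_kills() {
    awk '/worker process [0-9]+ exited on signal/ {
        sig = "?"
        # The signal number is the field right after the word "signal".
        for (i = 1; i < NF; i++) if ($i == "signal") sig = $(i + 1)
        key = $1 " " $2 " signal=" sig      # date, time, signal
        if (!(key in count)) order[++n] = key
        count[key]++
    }
    END {
        for (i = 1; i <= n; i++) print order[i] " workers=" count[order[i]]
    }'
}
```

It could be run as: summarize_worker_kills < /var/log/nginx/error.log. Signal 9 is SIGKILL, which a process cannot catch, so the log only shows that something outside nginx ended the workers, not what did it.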

tannaroo · 28 January 2022

Quote from Zoki

Something is killing your nginx processes. I have only seen that with OOM (out of memory). Do you have something in the syslog / dmesg?

I'm not good with Linux. How do I look into syslog/dmesg?

Soma · 28 January 2022

cat /var/log/syslog | grep nginx

dmesg | grep error (or fail instead of error)
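A plain grep for "error" is case-sensitive and would not match the kernel's OOM-killer messages, which typically contain phrases such as "Out of memory" or "invoked oom-killer". A slightly broader filter might look like the sketch below. The function name scan_kernel_log and the exact pattern list are assumptions chosen for illustration, not a complete set of kernel messages.

```shell
# Hypothetical helper: keep only log lines that hint at the OOM killer
# or at disk I/O trouble. Reads log text on stdin, matches case-insensitively.
scan_kernel_log() {
    grep -iE 'out of memory|oom-killer|killed process|I/O error|medium error'
}
```

Possible usage: dmesg | scan_kernel_log, or scan_kernel_log < /var/log/syslog. No output means none of these patterns were found, which by itself does not rule out other causes.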

tannaroo · 28 January 2022

Thanks, I see a lot of I/O errors. Would this mean I have a failing hard drive (sdc), and that this is causing the nginx failures?

root@omv:~# dmesg | grep error
[157379.641395] blk_update_request: I/O error, dev sdc, sector 19127064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[157572.140009] blk_update_request: I/O error, dev sdc, sector 19127064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[161358.723272] blk_update_request: I/O error, dev sdc, sector 19127064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[161545.161932] blk_update_request: I/O error, dev sdc, sector 19127064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[165331.790198] blk_update_request: I/O error, dev sdc, sector 19127064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[165518.152838] blk_update_request: I/O error, dev sdc, sector 19127064 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[167459.557361] blk_update_request: I/O error, dev sdc, sector 14424904 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
res 51/40:28:b8:59:3e/00:00:b7:27:74/e1 Emask 0x9 (media error)
[167639.271268] ata1.00: error: { UNC }
[167639.276138] sd 0:0:0:0: [sdc] tag#2 Add. Sense: Unrecovered read error - auto reallocate failed
[167639.276145] blk_update_request: I/O error, dev sdc, sector 20863416 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
res 51/40:08:b8:59:3e/00:00:2f:27:74/e1 Emask 0x9 (media error)
[167643.595139] ata1.00: error: { UNC }
[167643.600005] sd 0:0:0:0: [sdc] tag#18 Add. Sense: Unrecovered read error - auto reallocate failed
[167643.600013] blk_update_request: I/O error, dev sdc, sector 20863416 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
res 51/40:08:b8:59:3e/00:00:27:eb:48/e1 Emask 0x9 (media error)

This is only part of it, as the whole output is more than 10000 characters, so I couldn't paste it all here.

Soma · 28 January 2022

Check the SMART status of that drive and look at attributes 5, 197, and 198 under Information -> Attributes.

tannaroo · 28 January 2022

Quote from Soma

Check the SMART status of that drive and look at attributes 5, 197, and 198 under Information -> Attributes.

Unfortunately, I can't log in now. It says 'Failed to connect to socket' with this error message.

Code

Error #0:
OMV\Rpc\Exception: Failed to connect to socket: Connection refused in /usr/share/php/openmediavault/rpc/rpc.inc:141
Stack trace:
#0 /var/www/openmediavault/rpc/session.inc(57): OMV\Rpc\Rpc::call('UserMgmt', 'authUser', Array, Array, 2, true)
#1 [internal function]: OMVRpcServiceSession->login(Array, Array)
#2 /usr/share/php/openmediavault/rpc/serviceabstract.inc(123): call_user_func_array(Array, Array)
#3 /usr/share/php/openmediavault/rpc/rpc.inc(86): OMV\Rpc\ServiceAbstract->callMethod('login', Array, Array)
#4 /usr/share/php/openmediavault/rpc/proxy/json.inc(97): OMV\Rpc\Rpc::call('Session', 'login', Array, Array, 3)
#5 /var/www/openmediavault/rpc.php(45): OMV\Rpc\Proxy\Json->handle()
#6 {main}

Soma · 28 January 2022

Run on the CLI: smartctl -a /dev/sdX (X is the letter of the drive; in your case it's c)

How to check an hard drive health from the command line using smartctl - Linux Tutorials - Learn Linux Configuration
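Since the full smartctl -a report is long, the three attributes mentioned earlier (5, 197, 198) can be picked out of the attribute table alone. This is a rough sketch: the function name smart_key_attrs is invented here, and it assumes the usual ten-column layout of the smartctl -A table, where the raw value is the tenth field.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# ID, name and raw value of attributes 5, 197 and 198.
smart_key_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $1, $2, $10 }'
}
```

Possible usage (as root): smartctl -A /dev/sdc | smart_key_attrs. Non-zero raw values for these attributes are generally taken as a sign that the drive has sectors it cannot read or has had to remap.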

tannaroo · 28 January 2022

Quote from Soma

Run on the CLI: smartctl -a /dev/sdX (X is the letter of the drive; in your case it's c)

How to check an hard drive health from the command line using smartctl - Linux Tutorials - Learn Linux Configuration

I've lost access via SSH, so I need to attach a monitor and figure out what's going on.

tannaroo · 29 January 2022

I've attached a monitor and disconnected all external drives. Booting gets to a certain point and goes no further, so I guess the system drive is toast.

I tried booting into recovery mode to run fsck -p, but it stops below.

Is there a way I can repair the drive somehow without a full re-install?

Zoki · 29 January 2022

You could try to do so. Boot from different media and run fsck as stated. I would not trust the disk any more.

tannaroo · 29 January 2022

Quote from Zoki

You could try to do so. Boot from different media and run fsck as stated. I would not trust the disk any more.

OK, I don't have another Linux system, so it looks like I'm going for a full rebuild. I will probably use a thumb drive as my system drive going forward.

Zoki · 29 January 2022

Why don't you boot from a USB stick and take a look? A Linux USB stick can be made from any live CD.

tannaroo · 29 January 2022

These are the errors (abridged, as I can't copy all of it), but the output shows the attributes you were after. I tried running fsck -p /dev/sda1 but it doesn't seem to have fixed the error.

Is the drive toast or can it be recovered?

root@OMV:~# smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-0.bpo.8-amd64] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     IBM Travelstar 48GH, 30GN, and 15GN
Device Model:     IC25N020ATDA04-0
Serial Number:    63A63J51410
Firmware Version: DA3OA70A
User Capacity:    20,003,880,960 bytes [20.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-5 T13/1321D revision 3
Local Time is:    Sat Jan 29 19:14:18 2022 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 645) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  27) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   081   081   062    Pre-fail  Always       -       27459770
  2 Throughput_Performance  0x0005   103   103   040    Pre-fail  Offline      -       6012
  3 Spin_Up_Time            0x0007   126   126   033    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1250
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   120   120   040    Pre-fail  Offline      -       36
  9 Power_On_Hours          0x0012   048   048   000    Old_age   Always       -       22951
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       849
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       214
193 Load_Cycle_Count        0x0012   094   094   000    Old_age   Always       -       69894
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       35 (Min/Max 13/66)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       219
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       66
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

Code

root@OMV:~# fsck -y /dev/sda
fsck from util-linux 2.33.1
e2fsck 1.46.2 (28-Feb-2021)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sda

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

Found a dos partition table in /dev/sda
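Note that the command in this paste targets the whole disk (/dev/sda) rather than a partition. A whole-disk device normally holds a partition table, not a filesystem, which is consistent with the "Bad magic number" message and the "Found a dos partition table" hint. One way to find the partitions that actually carry a filesystem is sketched below. The function name list_fs_partitions is made up for illustration, and it assumes input produced by lsblk -ln -o NAME,TYPE,FSTYPE.

```shell
# Hypothetical helper: from `lsblk -ln -o NAME,TYPE,FSTYPE <disk>` output on
# stdin, print each partition with a non-swap filesystem as "/dev/NAME FSTYPE".
list_fs_partitions() {
    awk '$2 == "part" && $3 != "" && $3 != "swap" { print "/dev/" $1, $3 }'
}
```

Possible usage from a live system: lsblk -ln -o NAME,TYPE,FSTYPE /dev/sda | list_fs_partitions, followed by a read-only check such as fsck -n /dev/sda1 (the partition name is taken from the post above) before attempting any repair. On a disk with pending unreadable sectors, a repair may still fail or make things worse, so copying data off first is the safer order.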


chente · 29 January 2022

Quote from tannaroo

196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 219
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 66

That disk is not reliable. See here for the meaning of attributes 196 and 197.

https://en.wikipedia.org/wiki/…ATA_S.M.A.R.T._attributes

tannaroo · 29 January 2022

Quote from chente

That disk is not reliable. See here for the meaning of attributes 196 and 197.
https://en.wikipedia.org/wiki/…ATA_S.M.A.R.T._attributes

OK, sounds as though it's a full install on a new drive.

Is there any way I can get this partially working, namely the web GUI, so that I can note all of my configurations etc. to copy over when I set up the new install?
