CPU wrong values?

  • Hello guys,


    I just did a fresh install of OMV 3 and every time I run small, simple tasks like a "cp" or "rsync", I receive e-mails from the system:
    Monit alert -- loadavg(5min) check succeeded [current loadavg(5min)=2.0] (this is from the screenshot below)


    I have an HP ProLiant MicroServer Gen8. I know it's a Celeron, but still... it shouldn't be hit this hard by a "cp"... should it?


    The thing is... I have compared the "htop" values with the values shown in the GUI and they don't match...


    Can someone bring some clarity?


    I am wondering if I simply disabled the "CPU alerting" e-mails back when I was running OMV 2 to avoid the spam... :-S



  • Can someone bring some clarity?

    Sure: you're confusing 'load average' with CPU utilization. On Linux, time spent waiting for I/O counts towards the load average for whatever reason, so you have simply spotted a storage bottleneck.


    Easy to test:


    Code
    sudo apt install -f sysstat    # installs the sysstat package, which provides iostat
    sudo iostat 5                  # print CPU and device statistics every 5 seconds

    You'll see high %iowait values; that's an indication of low-performing storage and also the reason why the load average increases when you do something I/O related.
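
    If you want to narrow down which drive is responsible, the extended per-device view of iostat helps as well (a small sketch; the 5-second interval is just an example):


    Code
    sudo iostat -x 5    # per-device statistics: a consistently high %util on one disk points to the bottleneck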

  • Sure: you're confusing 'load average' with CPU utilization. On Linux, time spent waiting for I/O counts towards the load average for whatever reason, so you have simply spotted a storage bottleneck.
    Easy to test:


    Code
    sudo apt install -f sysstat    # installs the sysstat package, which provides iostat
    sudo iostat 5                  # print CPU and device statistics every 5 seconds

    You'll see high %iowait values; that's an indication of low-performing storage and also the reason why the load average increases when you do something I/O related.


    Hello @tkaiser,


    thanks for your response.


    Could you then explain to me what is happening here, please?
    I understand %iowait is high because my storage is causing some limitation (a 2.5" USB3 HDD), but why? And why does the CPU usage increase in the GUI?
    I just re-did the "cp" command (at the bottom). I'm afraid I am not familiar with this kind of data...



  • Could you then explain to me what is happening here, please?

    No, since I'm not sitting in front of your OMV host. High %iowait is an indication of slow storage (or related problems; check the log files and maybe also the SMART output for attribute 199: if that value keeps increasing you have cable/contact problems and a lot of retransmissions).
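
    To check attribute 199 (the UDMA CRC error counter) directly, something like this should do:


    Code
    sudo smartctl -A /dev/sdX | grep -E '^(ID#|199)'    # sdX is a placeholder for the disk in question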


    Most probably sdc is the problem. If this is a SATA drive then more than 50% iowait looks wrong; if it's a USB3 disk that is not UAS capable, this looks right. Anyway: you're still not affected by a 'wrong CPU usage display', but you have a storage bottleneck for reasons unknown to me (you need to check that yourself).
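
    A quick way to see whether a USB disk is actually using UAS is to check which driver the kernel bound to it (just a generic hint, nothing OMV specific):


    Code
    lsusb -t    # 'Driver=uas' means UAS is in use, 'Driver=usb-storage' means plain BOT mode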

  • Your system IS under heavy load, so the warning emails are correct. See
    blog.scoutapp.com/articles/200…derstanding-load-averages

    That's a pretty bad explanation since it totally misses the issue @siulman is running into and what's special about Linux: iowait counting towards the load average for whatever reason. His system is virtually idle (%user + %system below 15%) but simply blocked by some I/O bottleneck.


    I would rather recommend http://www.brendangregg.com/bl…/linux-load-averages.html (for the TL;DR version, scroll down to 'Making sense of Linux load averages').

  • @tkaiser,


    Both sdf1 (the 2.5" USB3 HDD) and sdc (a 3.5" WD Red SATA) are OK in terms of attribute 199. The value is 0 and not increasing. The copy was being done from sdf1 --> sdc.
    However, sdf1 seems to have some reallocated sectors with a value of "16". This one does not seem to increase either; I noticed it quite a long time ago.




    That being said, I redid the test from sda (SSD) to sdc (the 3.5" WD Red SATA) and saw the difference; here are the results... That was even worse in terms of CPU than before... Could it just be that the source disk is much quicker than the destination disk, so the destination disk is simply not able to keep up?
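
    (I guess one way to rule out the destination disk itself would be to measure its raw sequential write speed directly, e.g. with something like the sketch below; the path is just a placeholder for a folder on the destination disk and the test file should be removed afterwards.)


    Code
    sudo dd if=/dev/zero of=/srv/DESTINATION/ddtest.bin bs=1M count=2048 oflag=direct status=progress   # path is a placeholder
    sudo rm /srv/DESTINATION/ddtest.bin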



  • And last test...


    from sdc to sdb, a different direction. They are both WD Reds, the same model of disk.


    What I still don't understand is the "CPU usage" in the GUI that goes up when it apparently stays fine in htop... I am talking about the "CPU usage" green bar.









    I've just received this e-mail. Could this explain something?




    This message was generated by the smartd daemon running on:

    host name: omv

    DNS domain: lupilan.com

    The following warning/error was logged by the smartd daemon:



    Device: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6LU7AA7 [SAT], ATA error count increased from 0 to 1



    Device info:

    WDC WD30EFRX-68EUZN0, S/N:WD-WCC4N6LU7AA7, WWN:5-0014ee-2b774c2c8, FW:82.00A82, 3.00 TB



    For details see host's SYSLOG.



    You can also use the smartctl utility for further investigation.

    Another message will be sent in 24 hours if the problem persists.
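
    As the message says, smartctl can be used for further investigation, e.g. with the device id from the e-mail above (the -a output includes the ATA error log):


    Code
    sudo smartctl -a /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6LU7AA7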




    ========================================================================================



    Very last test, in order to use different disks: sda to sde1 (avoiding sdc). The symptoms are the same:



  • That was even worse in terms of CPU than before...

    Huh? Did you read what I wrote above and what my link tries to explain? Linux, for whatever reason, adds the time a CPU spends waiting for disks to the 'load average'. So while this 'load average' is a pretty questionable concept in general, on Linux it's highly misleading and should never be confused with CPU utilization.
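
    You can actually watch this happening: tasks sitting in uninterruptible sleep (state D, usually waiting for disk I/O) are what push the load average up even though the CPU is idle. A quick sketch:


    Code
    vmstat 5                               # the 'b' column counts tasks blocked in uninterruptible sleep
    ps -eo state,pid,comm | awk '$1=="D"'  # lists the tasks currently in state D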


    You still have a problem with %iowait, but I can't tell you why. I just did a quick check on a much slower platform, copying an empty 20 GB file to a 'linear' disk setup that consists of one USB3 pen drive (sda) and 3 SATA SSDs: https://pastebin.com/raw/LKbyZxcy


    As can be clearly seen, %iowait only increases when the filesystem driver decides to write to the USB3 drive, and even then those values are a lot lower than yours. In other words: I consider the high %iowait percentage on your system somewhat suspicious, but that's all. At least there's nothing OMV could or should fix, since Linux' concept of 'load average' is what it is. For the reasons see http://www.brendangregg.com/bl…/linux-load-averages.html

  • Huh? Did you read what I wrote above and what my link tries to explain? Linux, for whatever reason, adds the time a CPU spends waiting for disks to the 'load average'. So while this 'load average' is a pretty questionable concept in general, on Linux it's highly misleading and should never be confused with CPU utilization.
    You still have a problem with %iowait, but I can't tell you why. I just did a quick check on a much slower platform, copying an empty 20 GB file to a 'linear' disk setup that consists of one USB3 pen drive (sda) and 3 SATA SSDs: https://pastebin.com/raw/LKbyZxcy


    As can be clearly seen, %iowait only increases when the filesystem driver decides to write to the USB3 drive, and even then those values are a lot lower than yours. In other words: I consider the high %iowait percentage on your system somewhat suspicious, but that's all. At least there's nothing OMV could or should fix, since Linux' concept of 'load average' is what it is. For the reasons see http://www.brendangregg.com/bl…/linux-load-averages.html


    OK! I get it now. I didn't get that the "wait time" spent waiting for the disks was being added to the "load average". So nothing to worry about then regarding my system and the CPU values that are not "real", if I may use the term... and my CPU is definitely not struggling with a simple "cp"... that's good, I was worried.


    Now concerning the disks... I can't believe there is a problem with the disks, because I cross-checked by doing several copies from A to B, A to C, B to C, A to D and C to D, and it's impossible that all my disks have problems... and in all the scenarios the %iowait increased a lot. So I guess it's something else, but if you don't know what it could be, it's certainly not going to be me who figures it out... :)


    The thing is... this load average increase also happens when I use VirtualBox (running my Windows VM), for the same reasons I guess... does that impact performance?


    In any case I appreciate the explanations you provided today.

  • What I still don't understand is the "CPU usage" in the GUI that goes up when it apparently stays fine in htop... I am talking about the "CPU usage" green bar.

    Hmm... I usually never look at these bars since I monitor my systems with SNMP (quite simple with OMV). But I just had a look at OMV 4 and indeed the CPU graph looks wrong; it seems the value displayed is '100% - %idle' (which is wrong on Linux, as explained multiple times in the meantime -- iowait is not CPU utilization but the CPU doing nothing while waiting for I/O to finish).


    Maybe @votdev can explain why/how the CPU utilization is displayed here?

    • Official post

    Hmm... I usually never look at these bars since I monitor my systems with SNMP (quite simple with OMV). But I just had a look at OMV 4 and indeed the CPU graph looks wrong; it seems the value displayed is '100% - %idle' (which is wrong on Linux, as explained multiple times in the meantime -- iowait is not CPU utilization but the CPU doing nothing while waiting for I/O to finish).
    Maybe @votdev can explain why/how the CPU utilization is displayed here?

    It was always displayed there.
    The usage is calculated here: https://github.com/openmediava…iavault/system/system.inc

  • Thank you for the explanation and the code hint. But then the calculation suffers from the same problem as /proc/loadavg: when looking only at the %idle percentage on Linux, without taking %iowait into account, all the time spent waiting for disks to finish their work is wrongly added to the CPU utilization.


    IMO it would be necessary to take %iowait into account, which means adding this

    Code
    $diffIowait = $tnow[4] - $tprev[4]; // iowait delta between samples (assuming index 4 holds the iowait counter from /proc/stat)

    and calculating the usage like this:


    Code
    "usage" => (0 == $diffTotal) ? 0 : (($diffTotal - $diffIowait - $diffIdle) / $diffTotal) * 100 // exclude both idle and iowait time

    I tested it with slow USB storage and double-checked against the output of both htop and iostat running in parallel. It looks a lot better now since %iowait no longer counts towards CPU utilization.
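
    For a quick manual sanity check of that formula, the aggregated counters in /proc/stat can be used directly (this yields the utilization since boot rather than over an interval, so it's only a rough cross-check):


    Code
    awk '/^cpu /{idle=$5; iow=$6; for(i=2;i<=NF;i++) tot+=$i; printf "%.1f%%\n", (tot-idle-iow)/tot*100}' /proc/stat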

  • I use WD Red drives in my system too and was always getting e-mails about load average when doing a SnapRAID sync or scrub, so I disabled the notifications for load average.

    Thank you very much for this comment, man!
    That's reassuring!


    My 2 cents: WD Reds are disks made for NAS use, with low rotational speed and power consumption (5400 rpm). So it's not surprising that the CPU is quicker and needs to wait for the disks...
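
    (For what it's worth, the rotation rate can be confirmed per drive with smartctl, at least for disks that report it:)


    Code
    sudo smartctl -i /dev/sdX | grep -i 'rotation rate'    # sdX is a placeholder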


    I just disabled the notifications too.
