I'm diagnosing a high CPU usage event, and I found a weird difference between numbers from ps/vmstat, which show almost 0%, and sar/top, which show almost 100% (user + system):

sar 1 5
Linux 2.6.9-67.ELsmp (uxdfl712)         07/25/2020

01:48:31 PM       CPU     %user     %nice   %system   %iowait     %idle
01:48:32 PM       all     43.83      0.00     56.17      0.00      0.00
01:48:33 PM       all     42.68      0.00     57.32      0.00      0.00
01:48:34 PM       all     42.57      0.00     57.43      0.00      0.00
01:48:35 PM       all     43.18      0.00     56.82      0.00      0.00
Average:          all     43.14      0.00     56.86      0.00      0.00

vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
32  0      0 10493612 233320 4485160    0    0     0    14    0     1  0  0 100  0

ps -e hao %cpu | awk '{ sum += $1 } END { print sum }'
0.2

top -bn 1 |
sed '1,/PID USER      PR  NI %CPU/d' |
awk '{ sum += $5 } END { print sum }'
398

I searched Stack Exchange and elsewhere, but all I could find were references to virtualization (this is a physical machine) and to CPU load, which is not my issue. I also checked /proc/<PID>/stat, but found no hint there.

Why do these commands show different numbers? Are they actually querying different things? Or could the executables just be too old and buggy? (See the server details below. I'm horrified by how outdated this system is.)

Thanks!

uname -r
2.6.9-67.ELsmp

cat /etc/redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant Update 6)

yum provides `which sar` | grep installed
sysstat.i386                             5.0.5-16.rhel4         installed       

yum provides `which vmstat` | grep installed
procps.i386                              3.2.3-8.9              installed       

yum provides `which ps`
<Too many providers>
ps -V
procps version 3.2.3

yum provides `which top` | grep installed
procps.i386                              3.2.3-8.9              installed       

grep -c processor /proc/cpuinfo 
4
  • Check dmesg. The sar %system most likely stems from either a driver being stuck or broken (hence checking dmesg) or from I/O wait. – Miuku Jul 25 '20 at 18:03
  • @Miuku Thanks. I checked, but found no errors about drivers or modules. I did get useful info from top, though, and found that database processes are eating up 100% CPU. So the system part should indeed be I/O wait related to these database processes. Actually, I misread the sar output: I get 100% CPU usage, with 60% of it being system, not 60% overall CPU usage. I'll correct the question. – Emerson Prado Jul 25 '20 at 19:09
  • If the system is really old and there's a lot of I/O wait, it might be prudent to check whether the drives are starting to show their age and slowing down, or to move the entire system to newer hardware (if possible), or to virtualize it on new hardware if an OS upgrade is not a feasible option. Checking whether you can run a database optimize and/or tune its settings might also be a good idea. – Miuku Jul 26 '20 at 08:16

1 Answer

This is an intermittent, occasional load. The first line of vmstat gives averages since the last reboot, which apparently on this host is mostly idle. Subsequent lines show data for the sampling period, which will be closer to what sar is reporting.
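To make the since-boot versus per-interval distinction concrete, here is a minimal sketch that derives both figures from /proc/stat, the kernel counters that vmstat and sar read. The function name cpu_busy_pct is hypothetical, and counting iowait as non-busy time is an assumption of this sketch; the real tools may classify or round differently.

```shell
# cpu_busy_pct PREV CUR  (hypothetical helper, not part of procps or sysstat)
# Prints the busy percentage between two aggregate "cpu" lines from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal only; later guest fields are already part of user.
            n = (NF < 9 ? NF : 9)
            total = 0
            for (i = 2; i <= n; i++) total += $i
            idle = $5 + $6              # assumption: idle + iowait are "not busy"
            if (NR == 1) { t0 = total; i0 = idle; next }
            dt = total - t0
            di = idle - i0
            if (dt > 0) printf "%.1f\n", 100 * (dt - di) / dt
            else print "0.0"
        }'
}

# Only sample when /proc/stat is available (Linux).
if [ -r /proc/stat ]; then
    prev=$(grep '^cpu ' /proc/stat)
    sleep 1
    cur=$(grep '^cpu ' /proc/stat)
    # An all-zero baseline gives the average since boot, like the first vmstat line.
    echo "since boot:  $(cpu_busy_pct 'cpu 0 0 0 0 0 0 0' "$prev")%"
    # Two samples one second apart give the interval figure, like sar 1 N.
    echo "last second: $(cpu_busy_pct "$prev" "$cur")%"
fi
```

On a host that is mostly idle but busy right now, the first number stays low while the second approaches 100. Running vmstat with an interval and count (for example vmstat 1 5) and ignoring its first line gives the same per-interval view.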

0% idle for an extended period of time is generally not good. But how bad running out of CPU is really depends on the system and applications.

Evaluate how the applications are performing on this box. How is response time to user requests? Is it doing batch processing in time? If your performance expectations are not met, that is a reason to improve things.

In addition to hardware age, this is older software; RHEL 4 entered extended support 8 years ago. On a modern Linux, finding exactly what's on CPU is easy. Install debug symbols, and run perf top. And anything can be instrumented in detail. However, I don't remember how good the performance tools on RHEL 4 were.

Really, if this host is to continue to provide value, it should be upgraded. To get security updates again, if nothing else.

John Mahowald