
My server is currently running CentOS 5.2, with WHM 11.34.

Currently, our load average ranges from 6.43 to 12. The sites that we're hosting are taking a long time to respond and resolve. top doesn't show anything out of the ordinary, and iftop doesn't show a lot of traffic.

We have many resellers, and some of them are not so good at writing code. How can we find the culprit?

vmstat output:

vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  2     84  78684 154916 1021080    0    0    72   274    0   14  6  3 80 12  0

top output (ordered by %CPU)

top - 21:44:43 up 5 days, 10:39,  3 users,  load average: 3.36, 4.18, 4.73
Tasks: 222 total,   3 running, 219 sleeping,   0 stopped,   0 zombie
Cpu(s):  5.8%us,  2.3%sy,  0.2%ni, 79.6%id, 11.8%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   2074580k total,  1863044k used,   211536k free,   174828k buffers
Swap:  2040212k total,       84k used,  2040128k free,   987604k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15930 mysql     15   0  138m  46m 4380 S    4  2.3   1:45.87 mysqld
21772 igniteth  17   0 23200 7152 3932 R    4  0.3   0:00.02 php
 1586 root      10  -5     0    0    0 S    2  0.0  11:45.19 kjournald
21759 root      15   0  2416 1024  732 R    2  0.0   0:00.01 top
    1 root      15   0  2156  648  560 S    0  0.0   0:26.31 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.35 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.32 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    5 root      RT   0     0    0    0 S    0  0.0   0:02.00 migration/1
    6 root      34  19     0    0    0 S    0  0.0   0:00.11 ksoftirqd/1
    7 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    8 root      RT   0     0    0    0 S    0  0.0   0:01.29 migration/2
    9 root      34  19     0    0    0 S    0  0.0   0:00.26 ksoftirqd/2
   10 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   11 root      RT   0     0    0    0 S    0  0.0   0:00.90 migration/3
   12 root      34  19     0    0    0 R    0  0.0   0:00.20 ksoftirqd/3
   13 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/3

top output (ordered by CPU time)

top - 21:46:12 up 5 days, 10:41,  3 users,  load average: 2.88, 3.82, 4.55
Tasks: 217 total,   1 running, 216 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.7%us,  2.0%sy,  2.0%ni, 67.2%id, 25.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   2074580k total,  1959516k used,   115064k free,   183116k buffers
Swap:  2040212k total,       84k used,  2040128k free,  1090308k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+    TIME COMMAND
32367 root      16   0  215m 212m 1548 S    0 10.5  62:03.63  62:03 tailwatchd
 1586 root      10  -5     0    0    0 S    0  0.0  11:45.27  11:45 kjournald
 1576 root      10  -5     0    0    0 S    0  0.0   2:37.86   2:37 kjournald
27722 root      16   0  2556 1184  800 S    0  0.1   1:48.94   1:48 top
15930 mysql     15   0  138m  46m 4380 S    4  2.3   1:48.63   1:48 mysqld
 2932 root      34  19     0    0    0 S    0  0.0   1:41.05   1:41 kipmi0
  226 root      10  -5     0    0    0 S    0  0.0   1:34.33   1:34 kswapd0
 2671 named     25   0 74688 7400 2116 S    0  0.4   1:23.58   1:23 named
 3229 root      15   0 10300 3348 2724 S    0  0.2   0:40.85   0:40 sshd
 1580 root      10  -5     0    0    0 S    0  0.0   0:30.62   0:30 kjournald
    1 root      17   0  2156  648  560 S    0  0.0   0:26.32   0:26 init
 2616 root      15   0  1816  576  480 S    0  0.0   0:23.50   0:23 syslogd
 1584 root      10  -5     0    0    0 S    0  0.0   0:18.67   0:18 kjournald
 4342 root      34  19 27692  11m 2116 S    0  0.5   0:18.23   0:18 yum-updatesd
 8044 bollingp  15   0  3456 2036  740 S    1  0.1   0:15.56   0:15 imapd
   26 root      10  -5     0    0    0 S    0  0.0   0:14.18   0:14 kblockd/1
 7989 gmailsit  16   0  3196 1748  736 S    0  0.1   0:10.43   0:10 imapd

iostat -xtk 1 10 output

[root@server1 tmp]# iostat -xtk 1 10
Linux 2.6.18-53.el5    12/18/2012

Time: 09:51:06 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.83    0.19    2.53   11.85    0.00   79.60

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               1.37   118.83 18.70 54.27   131.47   692.72    22.59     4.90   67.19   3.10  22.59
sdb               0.35    39.33 20.33 61.43   158.79   403.22    13.75     5.23   63.93   3.77  30.80

Time: 09:51:07 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.50    0.00    0.50   24.00    0.00   74.00

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    25.00  2.00  2.00   128.00   108.00   118.00     0.03    7.25   4.00   1.60
sdb               0.00    16.00 41.00 145.00   200.00   668.00     9.33   107.92  272.72   5.38 100.10

Time: 09:51:08 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.00    0.00    1.50   29.50    0.00   67.00

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    95.00  3.00 33.00    12.00   480.00    27.33     0.07    1.72   1.31   4.70
sdb               0.00    14.00  1.00 228.00     4.00   960.00     8.42   143.49  568.01   4.37 100.10

Time: 09:51:09 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.28    0.00    2.76   21.30    0.00   62.66

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    21.00  1.00 19.00    16.00   192.00    20.80     0.06    3.55   1.30   2.60
sdb               0.00    36.00 28.00 181.00   124.00   884.00     9.65   121.16  617.31   4.79 100.10

Time: 09:51:10 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.74    0.00    1.50   25.19    0.00   68.58

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    20.00  3.00 15.00    12.00   136.00    16.44     0.17    7.11   3.11   5.60
sdb               0.00     0.00 103.00 60.00   544.00   248.00     9.72    52.35  545.23   6.14 100.10

Time: 09:51:11 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.24    0.00    1.24   25.31    0.00   72.21

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    75.00  4.00 28.00    16.00   416.00    27.00     0.08    3.72   2.03   6.50
sdb               2.00     9.00 124.00 17.00   616.00   104.00    10.21     3.73  213.73   7.10 100.10

Time: 09:51:12 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.75   24.31    0.00   73.93

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    24.00  1.00  9.00     4.00   132.00    27.20     0.01    1.20   1.10   1.10
sdb               4.00    40.00 103.00 48.00   528.00   212.00     9.80   105.21  104.32   6.64 100.20

Time: 09:51:13 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.50    0.00    1.75   23.25    0.00   72.50

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   125.74  3.96 46.53    15.84   689.11    27.92     0.20    4.06   2.41  12.18
sdb               2.97     0.00 91.09 84.16   419.80   471.29    10.17    85.85  590.78   5.66  99.11

Time: 09:51:14 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.00    0.50   24.94    0.00   73.82

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    88.00  1.00  7.00     4.00   380.00    96.00     0.04    4.38   3.00   2.40
sdb               3.00     7.00 111.00 44.00   540.00   208.00     9.65    18.58  581.79   6.46 100.10

Time: 09:51:15 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.03    0.00    3.26   26.57    0.00   59.15

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   145.00  7.00 53.00    28.00   792.00    27.33     0.15    2.50   1.55   9.30
sdb               1.00     0.00 155.00  0.00   800.00     0.00    10.32     2.85   18.63   6.46 100.10

[root@server1 tmp]#

MySQL Show Full Processlist

mysql> show full processlist;
+------+---------------+-----------+-----------------------+----------------+------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Id   | User          | Host      | db                    | Command        | Time | State                      | Info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+------+---------------+-----------+-----------------------+----------------+------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|    1 | DB_USER_ONE   | localhost | DB_ONE                | Query          |    3 | waiting for handler insert | INSERT DELAYED INTO defers (mailtime,msgid,email,transport_method,message,host,ip,router,deliveryuser,deliverydomain) VALUES(FROM_UNIXTIME('1355879748'),'1TivwL-0003y8-8l','xxxxxxxxxxxxxxxxxxxx@yahoo.com.tw','remote_smtp','SMTP error from remote mail server after initial connection: host mx1.mail.tw.yahoo.com [203.188.197.119]: 421 4.7.0 [TS01] Messages from 75.125.90.146 temporarily deferred due to user complaints - 4.16.55.1; see http://postmaster.yahoo.com/421-ts01.html','mx1.mail.tw.yahoo.com','203.188.197.119','lookuphost','','') |
|    2 | DELAYED       | localhost | DB_ONE                | Delayed insert |   52 | insert                     |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|    3 | DELAYED       | localhost | DB_ONE                | Delayed insert |   68 | insert                     |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|  911 | DELAYED       | localhost | DB_ONE                | Delayed insert |   99 | Waiting for INSERT         |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|  993 | DB_USER_TWO   | localhost | DB_TWO                | Sleep          |  832 |                            | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|  994 | DB_USER_ONE   | localhost | DB_ONE                | Query          |  185 | Locked                     | delete from failures where FROM_UNIXTIME(UNIX_TIMESTAMP(NOW())-1296000) > mailtime                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| 1102 | DB_USER_THREE | localhost | DB_THREE              | Query          |   29 | NULL                       | commit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| 1249 | DB_USER_FOUR  | localhost | DB_FOUR               | Query          |   13 | NULL                       | commit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| 1263 | root          | localhost | DB_FIVE               | Query          |    0 | NULL                       | show full  processlist                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| 1264 | DB_USER_SIX   | localhost | DB_SIX                | Query          |    3 | NULL                       | commit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+------+---------------+-----------+-----------------------+----------------+------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)
Tango Bravo
  • Can you add output of "vmstat", "mpstat -A"? In top, check/sort by %CPU, then check/sort by CPU time. – John Siu Dec 19 '12 at 03:14
  • Not to mention some basic hardware information. Please see [How can I ask better questions on Server Fault?](http://meta.serverfault.com/q/3608/126632) – Michael Hampton Dec 19 '12 at 03:26
  • Collect top -b -n 10, vmstat 1 10, iostat -xtk 1 10, cat /proc/meminfo, ps aux, ps auxH and put in pastebin. Would be worthwhile to see in apache logs or /var/log/messages to see whether anything is there. – Soham Chakraborty Dec 19 '12 at 03:27
  • @JohnSiu I added the output of `vmstat`. When I did `mpstat -A`, it seems that I'm missing some parameters (it gives the usage instructions). I'll add the sort of `top` too. – Tango Bravo Dec 19 '12 at 03:44
  • Is your sdb rebuilding?? wait time is high, utilization stay 99-100% – John Siu Dec 19 '12 at 03:54
  • @JohnSiu Not that I know of. I'll contact the person in house, to see what they say. But I don't think so. – Tango Bravo Dec 19 '12 at 03:59
  • iostat -xtk 1 10 output: it claims 25.31%, no _blocks in_ or _blocks out_ from vmstat, I would say it was **mysql** and go to _mysql_ client and enter 'show full processes'.... are you doing any...replicating on mysql ? – ArrowInTree Dec 19 '12 at 04:05
  • @ArrowInTree Okay, so I showed the output for "show full processlist", which I'm assuming you meant. Anyway, I did a little bit of "redaction" on there, but it should be fine. We're not doing any MySQL replication now. We have, however, started hosting more Drupal websites. – Tango Bravo Dec 19 '12 at 04:14
  • So what is this value _DELAYED_ that I am seeing in the _USER_ column? – ArrowInTree Dec 19 '12 at 04:20
  • @ArrowInTree I don't know. That isn't redacted – Tango Bravo Dec 19 '12 at 04:22
  • try _fuser /dev/sdb | grep mysql_ , lets see who has got open handles there... – ArrowInTree Dec 19 '12 at 04:24
  • @ArrowInTree That came up empty. The user which those queries are from is `eximstats` It looks like its something within WHM – Tango Bravo Dec 19 '12 at 04:32
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/6780/discussion-between-arrowintree-and-tim-bolton) – ArrowInTree Dec 19 '12 at 04:34

2 Answers

It's pretty clear that your disk has reached its limit. Generally, %wa (iowait) should be very low (under 1% for websites in general), and you want %util (from iostat -x) to be as low as possible (0 is possible). In your iostat output, sdb sits at about 100% %util with await in the hundreds of milliseconds.
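As a rough illustration of reading those two columns, here is a minimal Python sketch that parses device lines in the `iostat -xtk` layout shown in the question and flags any device that is both busy and slow. The function names and the thresholds (90% for %util, 100 ms for await) are assumptions chosen for illustration, not recommended limits.

```python
# Column order of the extended device lines as printed in the question.
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rkB/s", "wkB/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]


def parse_device_line(line):
    """Turn one device line into (device_name, {column: float})."""
    fields = line.split()
    return fields[0], dict(zip(COLUMNS, map(float, fields[1:])))


def saturated(lines, util_limit=90.0, await_limit=100.0):
    """Return devices at or above both thresholds (illustrative values)."""
    hits = []
    for line in lines:
        name, stats = parse_device_line(line)
        if stats["%util"] >= util_limit and stats["await"] >= await_limit:
            hits.append(name)
    return hits


# Device lines copied from the 09:51:08 PM sample in the question.
sample = [
    "sda               0.00    95.00  3.00 33.00    12.00   480.00    27.33     0.07    1.72   1.31   4.70",
    "sdb               0.00    14.00  1.00 228.00     4.00   960.00     8.42   143.49  568.01   4.37 100.10",
]
print(saturated(sample))
```

On that sample only sdb is flagged; sda is nearly idle, which suggests the problem is specific to one disk rather than the whole I/O path.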

You can use iotop to find out what process is causing all the disk usage.

If it turns out to be mysql, you should turn on the slow query log in my.cnf (and restart mysql). Then you'll be able to find out which specific queries are causing it.
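Until the slow query log has collected data, the `show full processlist` output already posted in the question can be screened in a similar spirit. The following is a minimal Python sketch, assuming the pipe-delimited table layout of the mysql client; the helper names and the 60-second threshold are hypothetical, and the sample rows are taken from the question with the column padding trimmed.

```python
KEYS = ["Id", "User", "Host", "db", "Command", "Time", "State", "Info"]


def parse_row(line):
    """Split one processlist table row into its eight columns."""
    # Drop the outer pipes, then split at most 7 times so that any
    # '|' inside the Info column stays part of Info.
    cells = [c.strip() for c in line.strip().strip("|").split("|", 7)]
    return dict(zip(KEYS, cells))


def long_running(lines, min_seconds=60):
    """Return (Id, Time, State) for active queries older than min_seconds."""
    hits = []
    for line in lines:
        row = parse_row(line)
        if row["Command"] == "Query" and int(row["Time"]) >= min_seconds:
            hits.append((int(row["Id"]), int(row["Time"]), row["State"]))
    return hits


# Rows from the question's processlist, padding trimmed.
rows = [
    "|  993 | DB_USER_TWO   | localhost | DB_TWO   | Sleep |  832 |        | NULL |",
    "|  994 | DB_USER_ONE   | localhost | DB_ONE   | Query |  185 | Locked | delete from failures where FROM_UNIXTIME(UNIX_TIMESTAMP(NOW())-1296000) > mailtime |",
    "| 1102 | DB_USER_THREE | localhost | DB_THREE | Query |   29 | NULL   | commit |",
]
print(long_running(rows))
```

Sleeping connections are skipped on purpose, since a large Time value there only means the connection has been idle.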

Alternatively, I think your sdb may be broken. Try getting the hardware checked out.

Edit: iotop (available through EPEL) is an awesome tool that shows which processes are causing the iowait.

Grumpy

Your sdb is acting unusually; the disk drive may have gone bad. If the traffic pattern on your websites is the same and this is a new problem, then that is strong evidence that you need to replace sdb.

There are two queues in the path of any I/O in Linux. One is the I/O scheduler queue, controlled by nr_requests, and the other is the queue inside the hardware. The merging of I/O happens in the scheduler layer. So, when you see that avgqu-sz (the average queue size) is small while await is large and svctm is low, it means that the storage is taking time to service those I/O requests.

Meaning, essentially slow storage or rather bad storage.

The %util column shows how many milliseconds out of every 1000 milliseconds the device spent servicing I/O. The higher it is, the harder your disk is being hammered. A high value alone doesn't necessarily mean your disk is under heavy load, but in your case it shows the disk is slow, rather slow.
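The "milliseconds out of 1000" reading can be checked against the numbers in the question. The sketch below multiplies requests per second by the service time per request to get busy milliseconds per second. As far as I understand, iostat derives svctm from utilization and throughput, so this is a consistency check on the reported columns rather than an independent measurement; the function names are made up for illustration.

```python
def busy_ms_per_second(reads_per_s, writes_per_s, svctm_ms):
    """Milliseconds of each second the device spent servicing requests."""
    return (reads_per_s + writes_per_s) * svctm_ms


def util_percent(reads_per_s, writes_per_s, svctm_ms):
    """Express the busy time as a percentage of one second (1000 ms)."""
    return busy_ms_per_second(reads_per_s, writes_per_s, svctm_ms) / 1000.0 * 100.0


# Values from the 09:51:07 PM sample in the question.
# sdb: r/s=41.00, w/s=145.00, svctm=5.38, reported %util=100.10
# sda: r/s=2.00,  w/s=2.00,   svctm=4.00, reported %util=1.60
for name, r, w, svctm in [("sdb", 41.00, 145.00, 5.38), ("sda", 2.00, 2.00, 4.00)]:
    print(name, round(busy_ms_per_second(r, w, svctm), 2),
          round(util_percent(r, w, svctm), 2))
```

For sdb this gives roughly 1000 ms of busy time per second, in line with the reported 100% utilization, while sda is busy for only about 16 ms of each second.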

Soham Chakraborty