What's going on with my server? High load, lots of idle CPU time, low disk utilization

Question

I run a web site and send a legitimate opt-in, daily email newsletter to subscribers. Both the web hosting and email sending are done by the same machine.

I have about 100,000 subscribers who have opted in to my daily email newsletter. My PHP script did a pretty good job sending mail to all of them until fairly recently, but as the list has grown I can't keep up.

When I run top, I have very high load--usually at least 6 or 7, sometimes as high as 15--even though I only have two CPUs. However, when I run sar, my CPU is idle an average of about 30% of the time. So, it seems I'm not CPU bound. When I run iostat, it seems as though I'm not disk bound because my %util for each device is very low (no more than 5%).

Given that I don't seem to be CPU bound or disk bound, why is top reporting such high load?

Additionally, since I don't seem to be CPU bound or disk bound, why is my email sending script not able to keep up?

Here's what I see when running top:

top - 11:33:28 up 74 days, 18:49,  2 users,  load average: 7.65, 8.79, 8.28
Tasks: 168 total,   5 running, 162 sleeping,   0 stopped,   1 zombie
Cpu(s): 38.9%us, 58.6%sy,  0.8%ni,  0.0%id,  0.7%wa,  0.2%hi,  0.8%si,  0.0%st
Mem:   3083012k total,  2144436k used,   938576k free,   281136k buffers
Swap:  2048248k total,    39164k used,  2009084k free,  1470412k cached

Here's what I see when running iostat -mx:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.80    1.20   55.24    0.37    0.00    8.38

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.19    71.70  1.59 29.45     0.02     0.07     5.90     0.55   17.82   1.16   3.59
sda1              0.00     0.00  0.00  0.00     0.00     0.00     7.10     0.00   13.80  13.72   0.00
sda2              0.05    50.45  1.13 24.57     0.01     0.29    24.25     0.35   13.43   1.15   2.97
sda3              0.05    10.17  0.20  2.33     0.01     0.05    43.75     0.05   20.96   2.45   0.62
sda4              0.00     0.00  0.00  0.00     0.00     0.00     2.00     0.00   70.50  70.50   0.00
sda5              0.07     0.22  0.03  0.07     0.00     0.00    32.84     0.08  856.19   8.03   0.08
sda6              0.02     5.45  0.03  0.72     0.00     0.02    67.55     0.02   26.72   5.26   0.39
sda7              0.00     1.56  0.00  0.42     0.00     0.01    38.04     0.00    8.88   5.84   0.24
sda8              0.01     3.84  0.20  1.35     0.00     0.02    28.55     0.05   31.90   4.08   0.63

Here's what I see when running sar:

09:40:02 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:50:01 AM       all     30.59      1.01     49.80      0.23      0.00     18.37
10:00:08 AM       all     31.73      0.92     51.66      0.13      0.00     15.55
10:10:06 AM       all     30.43      0.99     48.94      0.26      0.00     19.38
10:20:01 AM       all     29.58      1.00     47.76      0.25      0.00     21.42
10:30:01 AM       all     29.37      1.02     47.30      0.18      0.00     22.13
10:40:06 AM       all     32.50      1.01     52.94      0.16      0.00     13.39
10:50:01 AM       all     30.49      1.00     49.59      0.15      0.00     18.77
11:00:01 AM       all     29.43      0.99     47.71      0.17      0.00     21.71
11:10:07 AM       all     30.26      0.93     49.48      0.83      0.00     18.50
11:20:02 AM       all     29.83      0.81     48.51      1.32      0.00     19.52
11:30:06 AM       all     31.18      0.88     51.33      1.15      0.00     15.47
Average:          all     26.21      1.15     42.62      0.48      0.00     29.54

Here are the top handful of processes listed at the particular time I happened to run top -c:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                      
 8180 mysql     16   0 57448  19m 2948 S 26.6  0.7   4702:26 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/bristno.pid --skip-external-locking                          
26956 brristno  17   0     0    0    0 Z  8.0  0.0   0:00.24 [php] <defunct>                                                                                                                                                               
26958 brristno  17   0 94408  43m  37m R  5.0  1.4   0:00.15 /usr/bin/php /home/brristno/public_html/dbv.php                                                                                                                               
22852 nobody    16   0  9628 2900 1524 S  0.7  0.1   0:00.17 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                    
 8591 brristno  34  19 96896  13m 6652 S  0.3  0.4   0:29.82 /usr/local/bin/php /home/brristno/bin/mailer.php 1qwqyb6 i0gbor                                                                                                               
24469 nobody    16   0  9628 2880 1508 S  0.3  0.1   0:00.08 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                    
25495 nobody    15   0  9628 2876 1500 S  0.3  0.1   0:00.06 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                    
26149 nobody    15   0  9628 2864 1504 S  0.3  0.1   0:00.04 /usr/local/apache/bin/httpd -k start -DSSL

Thank you, Dmitri!

1) I already have a script that unsubscribes email addresses that have bounced at least five times in the past month, so hopefully that is keeping my list relatively limited to active email addresses.

2) I am using exim 4.69. My config file is at

/etc/exim.conf

and my log files are at:

/var/log/exim_mainlog
/var/log/exim_paniclog
/var/log/exim_rejectlog

Additionally, when I look in /etc/syslog.conf, I see the following:

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog

I don't know what the "-" means at the beginning of -/var/log/maillog but when I look in that file it's clear that a lot is being logged there.

Additionally, a lot is being logged in this file:

/var/log/exim_mainlog

I since added to /etc/exim.conf this line:

no_message_logs

I thought that that would disable mail logging (I did restart exim), but when I look at /var/log/maillog and at /var/log/exim_mainlog both files are still receiving new log entries.

Question: How can I disable most/all exim logging?

3) When I look in /var/log/exim_paniclog, I see a ton of entries like this one:

2010-12-19 04:03:32 1PUFB1-0006xZ-GF User 0 set for local_delivery transport is on the never_users list

After looking around for a while, it seems as though that means exim is trying to deliver to the root email address. What's the best way to handle these mail deliveries to root while using as few CPU resources as possible?

Updated question to include output of `top -c`. (It seems as though there is no `-P` argument to top.) — , Dec 25 '10 at 17:19
Your load is coming from the high `%sys`. `%sys` is time spent by the kernel. For some reason the kernel is using a **TON** of CPU time. Unfortunately there are a lot of causes for this, but the most common one I see is interrupts. I'd look at `watch -n 1 cat /proc/interrupts` and see if any of them are climbing at an insanely high rate (thousands per second). Edit: oh, this is an old question that just popped up on the front page, likely to be irrelevant now :-( — phemmer, Aug 26 '12 at 04:01
I ran into a possibly similar circumstance once: http://serverfault.com/a/524818/27813 turned out the high load average was from I/O of reading/writing large files in that case...yours looks a bit more like cpu bound though... — rogerdpack, Jul 19 '13 at 20:47
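
The `/proc/interrupts` check suggested in the comments can also be done numerically rather than by eye. The following is a minimal sketch, assuming the usual Linux layout (a header row of CPU names, then one row per IRQ with one counter per CPU). The function names and the two snapshots are made up for illustration and are not taken from the server in question.

```python
def parse_interrupts(text):
    """Map each IRQ label to its interrupt count summed over all CPUs."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row looks like: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        # The first ncpu fields are per-CPU counters; the rest is description.
        fields = rest.split()[:ncpu]
        totals[label.strip()] = sum(int(f) for f in fields if f.isdigit())
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each IRQ between two parsed snapshots."""
    return {irq: (after[irq] - before.get(irq, 0)) / seconds for irq in after}


# Made-up snapshots taken one second apart (illustration only).
SNAP_1 = """\
           CPU0       CPU1
  0:       1000        900   IO-APIC-edge      timer
 19:       5000       4000   IO-APIC-fasteoi   eth0
"""
SNAP_2 = """\
           CPU0       CPU1
  0:       1100       1000   IO-APIC-edge      timer
 19:       9000       8000   IO-APIC-fasteoi   eth0
"""

rates = interrupt_rates(parse_interrupts(SNAP_1), parse_interrupts(SNAP_2), 1.0)
for irq, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"IRQ {irq}: {rate:.0f} interrupts/s")
```

On a live system the two snapshots would come from reading `/proc/interrupts` twice with a known delay in between; an IRQ whose rate is in the thousands per second is the kind of thing the comment is pointing at.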

score 3 · Answer 1 · answered Dec 25 '10 at 19:15

As noted, load average is related to the number of processes waiting in the run queue. If each of those processes has very little work to do and frees the processor quickly, you can handle much larger load averages than the common rule of thumb of 1 per CPU.
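
To see the run queue alongside the averages, Linux exposes both in `/proc/loadavg`: the three load averages, then runnable/total scheduling entities, then the most recent PID. Here is a small sketch of parsing that line. The `LoadAvg` type and the sample line are assumptions for illustration; the sample simply reuses the numbers from the `top` output in the question.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float       # 1-minute load average
    five: float      # 5-minute load average
    fifteen: float   # 15-minute load average
    runnable: int    # scheduling entities currently runnable
    total: int       # scheduling entities that currently exist


def parse_loadavg(line: str) -> LoadAvg:
    """Parse one line in the format of Linux /proc/loadavg."""
    one, five, fifteen, entities, _last_pid = line.split()
    runnable, total = entities.split("/")
    return LoadAvg(float(one), float(five), float(fifteen),
                   int(runnable), int(total))


# Hypothetical line, assembled from the figures in the question.
sample = "7.65 8.79 8.28 5/168 26958"
info = parse_loadavg(sample)
print(info)
```

On the server itself, the input would be the contents of `/proc/loadavg` instead of the hard-coded sample.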

Mail is pretty much the perfect example of this: each process needs CPU to send a message, but very, very little. I've seen mail systems running sendmail at a load average in the 25 to 35 range, and the system is still interactive and working fine.

Mark

score 1 · Answer 2 · answered Dec 31 '10 at 10:50

System metrics (load, CPU, I/O) are often the only indicators most people have of the performance of their system. However, actual transactional performance is something quite different. These metrics can provide guidance on how performance is constrained, but really it's a lot more useful to look at how long transactions actually take.
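
One way to look at transaction times directly is to time each hand-off to the MTA and compare the result with the rate the list actually needs. This is a rough sketch only: `send` is a hypothetical stand-in for whatever call passes one message to the mail server, and the 24-hour window is an assumption based on the newsletter being daily.

```python
import time
from typing import Callable, Iterable, List


def time_sends(send: Callable[[str], None],
               recipients: Iterable[str]) -> List[float]:
    """Return the wall-clock seconds that each call to send() took."""
    durations = []
    for rcpt in recipients:
        start = time.perf_counter()
        send(rcpt)
        durations.append(time.perf_counter() - start)
    return durations


def required_rate(subscribers: int, window_seconds: float) -> float:
    """Messages per second needed to reach every subscriber in the window."""
    return subscribers / window_seconds


def fake_send(rcpt: str) -> None:
    """Placeholder for the real hand-off to the MTA."""
    time.sleep(0.01)


durations = time_sends(fake_send, ["a@example.com", "b@example.com"])
print(f"mean seconds per message: {sum(durations) / len(durations):.4f}")
print(f"needed messages/second:   {required_rate(100_000, 24 * 3600):.2f}")
```

If the measured time per message, multiplied by the list size, does not fit in the sending window, the script cannot keep up regardless of what the load average says.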

why is my email sending script not able to keep up?

Does that mean you are seeing problems with the mail queue not clearing down? Or is it the length of time the script takes to execute? Or are you inferring that there is a problem based on the high load?

As mfarver says, high load is not uncommon on email systems, particularly with the increasing number of synchronous checking done by mailservers to avoid spam.

Personally, I'm not a big fan of exim. I've had much better experiences with sendmail and postfix, although I admit that it's been several years since I did any serious testing on MTAs. Certainly you are getting into the ballpark where you need to be a lot more sophisticated about email processing.

Rather than switch off the logging, it might be a good idea to temporarily add forwarding for the root account to see exactly what all those emails which aren't getting delivered are about.

I'm guessing that the MTA is configured to send mail directly to their recipients. If you do have performance problems then you might consider using a smart relay to offload messages from your server faster. But try switching Exim to queue only to see if this resolves the load (and more importantly any performance) problem first. Also, have a look at your DNS caching and see if it could be improved.

If you're already using a smart relay, then do check it's configured correctly. IME, with a sendmail-based setup, PHP mail() calls block for a long time (but somehow messages still get delivered?) if the MTA can't connect to the smart host.

A lot of email providers now implement throttling as a method of spam blocking. While sorting the email list by domain would help reduce DNS lookups, you might end up having problems with remote systems throttling or blocking mail. Do make sure that you're doing everything practical to avoid looking like a spammer (e.g. SPF, DKIM). IIRC Exim does not directly support milters, which is a pity because there are a lot of useful milters available, notably milter-limit.
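
Grouping the list by domain, as mentioned above, could look roughly like the sketch below. The function name and the addresses are made up for illustration, and it treats everything after the last `@` as the domain without validating the address.

```python
from collections import defaultdict


def group_by_domain(addresses):
    """Group addresses by their lower-cased domain, keeping input order."""
    groups = defaultdict(list)
    for addr in addresses:
        domain = addr.rsplit("@", 1)[-1].lower()
        groups[domain].append(addr)
    return dict(groups)


# Hypothetical subscriber addresses.
subscribers = ["a@Example.com", "b@other.org", "c@example.com"]
for domain, members in group_by_domain(subscribers).items():
    print(domain, len(members))
```

Having the per-domain counts also makes it easier to spot which remote providers receive the most mail and are therefore the most likely to throttle.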

score 0 · Answer 3 · answered Dec 25 '10 at 17:06

High load is the mean size of the run queue, i.e. the processes that want to run on a CPU. It looks like your script does a lot of CPU work, so you should profile it and post its source here. How do you send the emails?

osgx

I send email newsletters using PHPMailer which is configured to simply use sendmail. – Dec 25 '10 at 17:13

score 0 · Answer 4 · answered Dec 25 '10 at 18:21

First of all, your load is not all that high. A load of 8 on 2 CPUs means a load of 4 per CPU. Also, modern CPUs are usually dual core, so each one is like 2 CPUs in one; in that case the load is really more like 2 per CPU.
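
The arithmetic above is just the load average divided by the number of CPUs the scheduler can use. A minimal sketch follows; the function names are illustrative, and the figures are the ones used in this answer.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def current_load_per_cpu() -> float:
    """1-minute load per CPU on the machine running this (Unix only)."""
    return load_per_cpu(os.getloadavg()[0], os.cpu_count() or 1)


print(load_per_cpu(8.0, 2))  # load of 8 spread over 2 CPUs
print(load_per_cpu(8.0, 4))  # same load if there are really 4 cores
```

Whether 2 or 4 is the right divisor depends on how many cores the operating system actually reports, which is worth checking rather than assuming.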

As far as processing emails is concerned, there are 2 things you can do to decrease the load:

1) Make sure you have a script that processes the bounced emails, so that you can mark an email address as 'bouncing' and stop sending to it. The usual bounce rate for a large email list, even an opt-in one, is about 20%. Bounces really bog down the server: not only does it have to send out emails that nobody sees, it also has to receive and process the bounce messages.

2) Disable logging to maillog. On a high-volume mailing list, an entry is added to maillog for every email that goes out as well as for every bounce received back. Writing to maillog is very resource intensive because it involves disk writes. By simply disabling maillog you can decrease your system load by "a lot", sometimes by as much as 50%. I don't know what email server you are using, but on Linux you would usually look in /etc/syslog.conf.

Just comment out the entry for mail then restart syslog service.

One more thing: the bounced emails usually come back to the root account. It's very common for a system to reach the mailbox limit for the root account, which is usually 100MB. Once the limit is reached you have another problem: you cannot even accept bounced emails, so your system may start sending its own bounce messages, adding even more load.

Conclusion: monitor your bounces and keep your list clean. Mark bouncing accounts and don't send mail to them anymore.
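
The list-cleaning rule can be expressed as a simple count over recent bounce records. The sketch below assumes bounces are available as (address, date) pairs, and it uses the threshold the asker mentions (five bounces within the past month) as the default. The function name, the 30-day window, and the sample data are illustrative only.

```python
from collections import Counter
from datetime import date, timedelta


def addresses_to_suppress(bounces, today, threshold=5, window_days=30):
    """Return addresses with at least `threshold` bounces in the window."""
    cutoff = today - timedelta(days=window_days)
    recent = Counter(addr for addr, when in bounces if when >= cutoff)
    return {addr for addr, count in recent.items() if count >= threshold}


# Hypothetical bounce log.
today = date(2010, 12, 25)
bounces = (
    [("dead@example.com", date(2010, 12, day)) for day in range(1, 6)]
    + [("ok@example.com", date(2010, 12, 20))]
    + [("old@example.com", date(2010, 10, 1))] * 5  # outside the window
)
print(addresses_to_suppress(bounces, today))
```

Lowering the threshold removes dead addresses sooner, at the cost of dropping subscribers whose mailbox was only temporarily unavailable.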

score 0 · Answer 5 · answered Dec 26 '10 at 04:00

Your maillog entry is marked not to flush on each entry (that is what the leading "-" means). This should help reduce the CPU overhead of writing to this log. However, as you are using Exim, this log is not used by default. Check your configuration to ensure you haven't enabled use of syslog.

To reduce what is being logged, add a log_selector specification to your configuration. Possible values are detailed in the Exim Specification (likely chapter 49). However, this is likely not your problem.

Try running exiwhat to see what deliveries are being attempted. mailq should not show a lot of messages awaiting delivery, and only a few that have been on the queue for an hour or more. A long list of messages that have been on the queue for a while indicates you are attempting deliveries which are likely to bounce.

Exim doesn't handle a lot of simultaneous delivery processes well. You should look at configuration changes which may help:

- Try increasing the time between retries, and reducing the time before you bounce messages as undelivered. This will reduce the number of attempts required to bounce undeliverable messages.
- Disable immediate delivery attempts, so that deliveries are run from the queue. You may want to use queue_only_load to do this conditionally.
- Set queue_run_max to limit the number of queue runner processes.

To resolve the attempted deliveries to root, you can use a transport or an alias. I alias root to my email address. Ubuntu uses this router to prevent deliveries running as root.

mail4root:
  debug_print = "R: mail4root for $local_part@$domain"
  driver = redirect
  domains = +local_domains
  data = /var/mail/mail
  file_transport = address_file
  local_parts = root
  user = mail
  group = mail
