I run a web site and send a legitimate opt-in, daily email newsletter to subscribers. Both the web hosting and email sending are done by the same machine.
I have about 100,000 subscribers who have opted in to my daily email newsletter. My PHP script did a pretty good job sending mail to all of them until fairly recently, but as the list has grown I can't keep up.
When I run top, I have very high load--usually at least 6 or 7, sometimes as high as 15--even though I only have two CPUs. However, when I run sar, my CPU is idle an average of about 30% of the time. So, it seems I'm not CPU bound. When I run iostat, it seems as though I'm not disk bound because my %util for each device is very low (no more than 5%).
Given that I don't seem to be CPU bound or disk bound, why is top reporting such high load?
Additionally, since I don't seem to be CPU bound or disk bound, why is my email sending script not able to keep up?
Here's what I see when running top:
top - 11:33:28 up 74 days, 18:49, 2 users, load average: 7.65, 8.79, 8.28
Tasks: 168 total, 5 running, 162 sleeping, 0 stopped, 1 zombie
Cpu(s): 38.9%us, 58.6%sy, 0.8%ni, 0.0%id, 0.7%wa, 0.2%hi, 0.8%si, 0.0%st
Mem: 3083012k total, 2144436k used, 938576k free, 281136k buffers
Swap: 2048248k total, 39164k used, 2009084k free, 1470412k cached
Here's what I see when running iostat -mx:
avg-cpu: %user %nice %system %iowait %steal %idle
34.80 1.20 55.24 0.37 0.00 8.38
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.19 71.70 1.59 29.45 0.02 0.07 5.90 0.55 17.82 1.16 3.59
sda1 0.00 0.00 0.00 0.00 0.00 0.00 7.10 0.00 13.80 13.72 0.00
sda2 0.05 50.45 1.13 24.57 0.01 0.29 24.25 0.35 13.43 1.15 2.97
sda3 0.05 10.17 0.20 2.33 0.01 0.05 43.75 0.05 20.96 2.45 0.62
sda4 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 70.50 70.50 0.00
sda5 0.07 0.22 0.03 0.07 0.00 0.00 32.84 0.08 856.19 8.03 0.08
sda6 0.02 5.45 0.03 0.72 0.00 0.02 67.55 0.02 26.72 5.26 0.39
sda7 0.00 1.56 0.00 0.42 0.00 0.01 38.04 0.00 8.88 5.84 0.24
sda8 0.01 3.84 0.20 1.35 0.00 0.02 28.55 0.05 31.90 4.08 0.63
Here's what I see when running sar:
09:40:02 AM CPU %user %nice %system %iowait %steal %idle
09:50:01 AM all 30.59 1.01 49.80 0.23 0.00 18.37
10:00:08 AM all 31.73 0.92 51.66 0.13 0.00 15.55
10:10:06 AM all 30.43 0.99 48.94 0.26 0.00 19.38
10:20:01 AM all 29.58 1.00 47.76 0.25 0.00 21.42
10:30:01 AM all 29.37 1.02 47.30 0.18 0.00 22.13
10:40:06 AM all 32.50 1.01 52.94 0.16 0.00 13.39
10:50:01 AM all 30.49 1.00 49.59 0.15 0.00 18.77
11:00:01 AM all 29.43 0.99 47.71 0.17 0.00 21.71
11:10:07 AM all 30.26 0.93 49.48 0.83 0.00 18.50
11:20:02 AM all 29.83 0.81 48.51 1.32 0.00 19.52
11:30:06 AM all 31.18 0.88 51.33 1.15 0.00 15.47
Average: all 26.21 1.15 42.62 0.48 0.00 29.54
Here are the top handful of processes listed at the particular time I happened to run top -c
:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8180 mysql 16 0 57448 19m 2948 S 26.6 0.7 4702:26 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/bristno.pid --skip-external-locking
26956 brristno 17 0 0 0 0 Z 8.0 0.0 0:00.24 [php] <defunct>
26958 brristno 17 0 94408 43m 37m R 5.0 1.4 0:00.15 /usr/bin/php /home/brristno/public_html/dbv.php
22852 nobody 16 0 9628 2900 1524 S 0.7 0.1 0:00.17 /usr/local/apache/bin/httpd -k start -DSSL
8591 brristno 34 19 96896 13m 6652 S 0.3 0.4 0:29.82 /usr/local/bin/php /home/brristno/bin/mailer.php 1qwqyb6 i0gbor
24469 nobody 16 0 9628 2880 1508 S 0.3 0.1 0:00.08 /usr/local/apache/bin/httpd -k start -DSSL
25495 nobody 15 0 9628 2876 1500 S 0.3 0.1 0:00.06 /usr/local/apache/bin/httpd -k start -DSSL
26149 nobody 15 0 9628 2864 1504 S 0.3 0.1 0:00.04 /usr/local/apache/bin/httpd -k start -DSSL
Thank you, Dmitri!
1) I already have a script that unsubscribes email addresses that have bounced at least five times in the past month, so hopefully that is keeping my list relatively limited to active email addresses.
2) I am using exim 4.69. My config file is at
/etc/exim.conf
and my log files are at:
/var/log/exim_mainlog
/var/log/exim_paniclog
/var/log/exim_rejectlog
Additionally, when I look in /etc/syslog.conf, I see the following:
# Log all the mail messages in one place.
mail.* -/var/log/maillog
I don't know what the "-" means at the beginning of -/var/log/maillog
but when I look in that file it's clear that a lot is being logged there.
Additionally, a lot is being logged in this file:
/var/log/exim_mainlog
I since added to /etc/exim.conf this line:
no_message_logs
I thought that that would disable mail logging (I did restart exim), but when I look at /var/log/maillog and at /var/log/exim_mainlog both files are still receiving new log entries.
Question: How can I disable most/all exim logging?
3) When I look in /var/log/exim_paniclog, I see a ton of entries like this one:
2010-12-19 04:03:32 1PUFB1-0006xZ-GF User 0 set for local_delivery transport is on the never_users list
After looking around for a while, it seems as though that means exim is trying to deliver to the root email address. What's the best way to handle these mail deliveries to root while using as few CPU resources as possible?