0

I'm having a hard time figure out the reason cause high iowait on my server.

top:

%Cpu(s):  : 30.0 us,  5.4 sy,  0.0 ni, 53.9 id, 10.1 wa,  0.0 hi,  0.7 si,  0.0 st

iostat:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.00    0.00    0.80     0.00     7.20    18.00     0.00    0.25    0.00    0.25   0.25   0.02

iotop:

Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 4807 be/4 daemon      0.00 B/s    0.00 B/s  0.00 % 21.66 % httpd -k restart
 4522 be/4 daemon      0.00 B/s    0.00 B/s  0.00 % 10.76 % httpd -k restart
 4547 be/4 daemon      0.00 B/s    0.00 B/s  0.00 %  7.87 % httpd -k restart
 4603 be/4 daemon      0.00 B/s    0.00 B/s  0.00 %  7.66 % httpd -k restart
 4652 be/4 daemon      0.00 B/s    0.00 B/s  0.00 %  5.36 % httpd -k restart
 4671 be/4 daemon      0.00 B/s    0.00 B/s  0.00 %  4.83 % httpd -k restart

so it looks like all IO activity is being generated by httpd but there is no activity on sda and there aren't any more hard or network drives on the server.

i have tried strace on httpd:

Process 4975 attached
accept4(4, {sa_family=AF_INET6, sin6_port=htons(60547), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28], SOCK_CLOEXEC) = 9
getsockname(9, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
fcntl(9, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(9, "GET /category/toys/ HTTP/1.0\r"..., 8000) = 287
stat("/srv/www/website/category/toys/", 0x7ffff11072d0) = -1 ENOENT (No such file or directory)
lstat("/srv", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/srv/www", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/srv/www/website", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0
lstat("/srv/www/website/category", 0x7ffff11072d0) = -1 ENOENT (No such file or directory)
stat("/srv/www/website/index.php", {st_mode=S_IFREG|0644, st_size=39274, ...}) = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={6000000, 0}}, NULL) = 0
rt_sigaction(SIGPROF, {0x7f4329cbc270, [PROF], SA_RESTORER|SA_RESTART, 0x7f432a7bd650}, {0x7f4329cbc270, [PROF], SA_RESTORER|SA_RESTART, 0x7f432a7bd650}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [PROF], NULL, 8) = 0
getcwd("/", 4095)                       = 2
chdir("/srv/www/website")               = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={1000000, 0}}, NULL) = 0
fcntl(7, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1, len=1}) = 0
getcwd("/srv/www/website", 4096)        = 17
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 10
connect(10, {sa_family=AF_LOCAL, sun_path="/tmp/memcached2.sock"}, 110) = 0
sendto(10, "get memc.sess.key.ee3e1a44ced404"..., 52, MSG_NOSIGNAL, NULL, 0) = 52
recvfrom(10, "VALUE memc.sess.key.ee3e1a44ced4"..., 8196, MSG_NOSIGNAL, NULL, NULL) = 175
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 11
connect(11, {sa_family=AF_LOCAL, sun_path="/tmp/memcached.sock"}, 110) = 0
sendto(11, "get 389768020cec2648767316c7233f"..., 38, MSG_NOSIGNAL, NULL, 0) = 38
recvfrom(11, "VALUE 389768020cec2648767316c723"..., 8196, MSG_NOSIGNAL, NULL, NULL) = 8196
recvfrom(11, "\0\"\345\22\315\00230:\374\16u\205\315\360\23\375\00050\376\0009\353\0\250\342\n\213\353Si"..., 8196, MSG_NOSIGNAL, NULL, NULL) = 8196
recvfrom(11, "px#f =\7body><tr@\v\0d \4\1ul \4\1li\304\36\2"..., 8196, MSG_NOSIGNAL, NULL, NULL) = 4629
sendto(11, "quit\r\n", 6, MSG_NOSIGNAL, NULL, 0) = 6
shutdown(11, SHUT_WR)                   = 0
shutdown(11, SHUT_RD)                   = 0
close(11)                               = 0
writev(9, [{"HTTP/1.1 200 OK\r\nDate: Mon, 11 M"..., 246}, {"<!DOCTYPE html> <html lang=\"en\">"..., 107113}], 2) = 107359
chdir("/")                              = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sendto(10, "set memc.sess.key.ee3e1a44ced404"..., 175, MSG_NOSIGNAL, NULL, 0) = 175
recvfrom(10, "STORED\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 8
sendto(10, "quit\r\n", 6, MSG_NOSIGNAL, NULL, 0) = 6
shutdown(10, SHUT_WR)                   = 0
shutdown(10, SHUT_RD)                   = 0
close(10)                               = 0
fcntl(7, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
writev(9, [{"<!-- static content | last modif"..., 60}], 1) = 60
times({tms_utime=40554, tms_stime=8627, tms_cutime=0, tms_cstime=0}) = 456014999
shutdown(9, SHUT_WR)                    = 0
poll([{fd=9, events=POLLIN}], 1, 2000)  = 1 ([{fd=9, revents=POLLIN|POLLHUP}])
read(9, "", 512)                        = 0
close(9)                                = 0
read(5, 0x7ffff1107697, 1)              = -1 EAGAIN (Resource temporarily unavailable)
accept4(4,

it is doing quite a bit lstats for whole directory structure despite of rewrite rule to direct all requests to index.php, but is it possible that these simple lstats are responsible for such huge iowait ? It also have some communication with memcached servers via unix socket but as far as i have read unix socket communication is not counted as iowait.

Server is doing about 200req/sec with only dynamic content but all this content is buffered in the RAM (there is no swap to be sure and 64G ram) and it is not reading anything from the drive so why iowait is so high ?

How can i diagnose this problem ? Is these any way to measure across multiple httpd processes calls which increase iowait when there is almost no block device usage at all ?

Update 1: Server is on 1GE network doing ~10Mbps without any errors.

Update 2: strace -c -f -p:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.39    0.006063           1     10330           poll
 26.66    0.003485           0     10212        71 recvfrom
  7.98    0.001043           1       880           sendto
  5.03    0.000658           2       294       109 connect
  2.39    0.000313           1       294           socket
  2.10    0.000274           1       338           close
  1.45    0.000189           1       341        43 read
  1.44    0.000188           4        43           accept4
  1.40    0.000183           0       509           fcntl
  1.26    0.000165           3        52           writev
  0.83    0.000109           1       193           shutdown
  0.55    0.000072           1        84           chdir
  0.42    0.000055           0       168           setitimer
  0.41    0.000053           1        73           stat
  0.33    0.000043           1        43           getsockname
  0.25    0.000033           0       116           rt_sigaction
  0.24    0.000032           0        72           getcwd
  0.21    0.000028           0       109           getsockopt
  0.21    0.000027           0        60           lstat
  0.16    0.000021           0       106           setsockopt
  0.11    0.000014           0        43           times
  0.08    0.000010           0        42           rt_sigprocmask
  0.05    0.000007           7         1           select
  0.02    0.000003           1         5           access
  0.02    0.000002           0         6           brk
  0.00    0.000000           0         1           open
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         1           munmap
------ ----------- ----------- --------- --------- ----------------
100.00    0.013070                 24417       223 total

this is result of strace -c runned for about one minute but i still can't find anything that could generate such high iowait...

Update 3:

I have changed httpd configuration and moved RewriteRules from Directory to VirtualHost scope so i can disable FollowSymLinks and saved some extra lstat calls for each request, from:

stat("/srv/www/website/category/toys/", 0x7ffff11072d0) = -1 ENOENT (No such file or directory)
lstat("/srv", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/srv/www", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/srv/www/website", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0
lstat("/srv/www/website/category", 0x7ffff11072d0) = -1 ENOENT (No such file or directory)
stat("/srv/www/website/index.php", {st_mode=S_IFREG|0644, st_size=39274, 

to just:

stat("/srv/www/website/index.php", {st_mode=S_IFREG|0644, st_size=39274, ...}) = 0
lstat("/srv/www/website/index.php", {st_mode=S_IFREG|0644, st_size=39274, ...}) = 0

but impact on IO is none. I have also noticed that about 10 seconds after httpd restart (http requests are being served all this time) iowait is on 0-1% levels and then goes up - maybe this could be clue ?

Update 4

I have changed Min spare servers from 280 to 64 and iowait has been reduced to 1%. It could be connected with scoreboard as Cameron Kerr mentioned or some parent-child communication but as far as i read "Apache httpd first attempts to create the scoreboard entirely in memory" and my apache instances are not using file based scoreboard.

Has anyone have idea why number of spare prefork processes have such large impact on iowait (20% vs 2% for 280 vs 64 spare processes) ?

forke
  • 123
  • 10
  • Your strace seems faulty, in that you don't appear to have used `-f` to follow all the child processes. I would imagine that there is logging and perhaps scoreboard (?) activity going on. – Cameron Kerr May 11 '15 at 12:12
  • If your server is connected to a 100 Mbps network, wrong duplex settings could very well cause problems similar to this one. – Janne Pikkarainen May 11 '15 at 12:22
  • @JannePikkarainen: Server is on 1GE network doing ~10Mbps without any errors. – forke May 11 '15 at 12:54
  • @CameronKerr strace -p PID and strace -p PID -f gives me the same results as i am monitoring just one of the prefork httpd instances. There is only error logging, because access logs are done by load balancer. There is also a scoreboard but how it can be related with iowait ? There are no more than 5-10 active connections between these httpd instances and load balancer. – forke May 11 '15 at 12:54
  • Use multiple -p for strace? – Cameron Kerr May 11 '15 at 20:16
  • @CameronKerr: I tried multiple -p for strace but it simply doesn't change anything - i just have more samples. I can count calls and still got something like i posted in Update 2 - it does not indicate anything that could result in up to 20% iowait per process... – forke May 11 '15 at 22:38

0 Answers0