I have a WordPress-based website running on shared hosting. Its response time is decent (around 2s to retrieve the HTML page and 5s to load all the resources).

I was planning to move it to a dedicated virtual server (Ubuntu 12.04 LTS), which should theoretically improve things and make them more consistent given it's not shared. However, I observed severe performance degradation, with the page taking 10 seconds to be generated.

I ruled out network issues by editing /etc/hosts on the server and mapping the domain to 127.0.0.1. I used the Apache load tester ab to get the HTML, so JS, CSS and images are all excluded. It still took 10 seconds.

I have Zpanel installed on the server, which also uses MySQL, and its pages come up quite fast (1.5s), as does phpMyAdmin. Queries run against the WordPress database directly through phpMyAdmin also return quickly, with query times in the 10 to 30 millisecond range.

Memory is also sufficient, with only 800 MB of the 1 GB of physical memory in use, so it doesn't seem to be a swap issue either. I have also installed APC to try to improve PHP performance, but it didn't have any effect.

What else should I look for? What could be causing this degradation in performance? Could it be some kind of I/O issue since I am running on a cloud based virtual server?

I wish to be able to raise the issue with my provider but without showing actual data from some diagnosis I am afraid he will just blame my application.

UPDATE with sar output (every second) when I did an HTTP request:

02:31:29        CPU     %user     %nice   %system   %iowait    %steal     %idle
02:31:30        all      0.00      0.00      0.00      0.00      0.00    100.00
02:31:31        all      2.22      0.00      2.22      0.00      0.00     95.56
02:31:32        all     41.67      0.00      6.25      0.00      2.08     50.00
02:31:33        all     86.36      0.00     13.64      0.00      0.00      0.00
02:31:34        all     75.00      0.00     25.00      0.00      0.00      0.00
02:31:35        all     93.18      0.00      6.82      0.00      0.00      0.00
02:31:36        all     90.70      0.00      9.30      0.00      0.00      0.00
02:31:37        all     71.05      0.00      0.00      0.00      0.00     28.95
02:31:38        all     14.89      0.00     10.64      0.00      2.13     72.34
02:31:39        all      2.56      0.00      0.00      0.00      0.00     97.44
02:31:40        all      0.00      0.00      0.00      0.00      0.00    100.00
02:31:41        all      0.00      0.00      0.00      0.00      0.00    100.00
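
A rough way to read the sample above is to count the one-second intervals in which the CPU was mostly busy. The sketch below hard-codes the %idle column from that output; the 50% cutoff is an arbitrary assumption for illustration, not a standard threshold.

```python
# %idle values copied from the sar output above, one per second
# (02:31:30 through 02:31:41).
idle = [100.00, 95.56, 50.00, 0.00, 0.00, 0.00, 0.00, 28.95,
        72.34, 97.44, 100.00, 100.00]

# Assumption: call an interval "busy" when less than half the CPU is idle.
BUSY_THRESHOLD = 50.0

busy_seconds = sum(1 for pct in idle if pct < BUSY_THRESHOLD)
saturated_seconds = sum(1 for pct in idle if pct == 0.0)

print(f"busy intervals: {busy_seconds}, fully saturated: {saturated_seconds}")
# prints "busy intervals: 5, fully saturated: 4"
```

With %iowait at 0 in every row, this pattern looks like CPU-bound work rather than time spent waiting on disk, provided the hypervisor reports iowait accurately.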

UPDATE 2 After josten's suggestions.

I/O:

iotop fails with OSError: Netlink error: No such file or directory (2), and sar -d also fails with Requested activities not available in file /var/log/sysstat/sa14. I think this is because it is a virtual machine, which is probably also why iostat fails. Could that be the reason why %iowait reported by sar 1 10 is always 0%?

CPU Load:

The process topping CPU% in htop is actually apache2. I was expecting it to be the database, but it's not. It goes up to 94% for a few seconds when I do a fresh HTTP request. This seems to be the culprit.

I have done an strace -f -t and a summary with strace -c -f. There seems to be an awful lot of lstat calls (57786), with 2455 resulting in errors. No idea if this is normal. Other than that, the topmost call was wait4, which I presume is normal (it's just waiting), followed by munmap. Top 6 below.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.06    0.124742         897       139         6 wait4
 14.90    0.036388           1     57786      2455 lstat
  9.67    0.023622          13      1857           munmap
  7.69    0.018790          37       514           brk
  6.70    0.016361         481        34           clone
  2.87    0.006999          74        94        12 select

strace itself slowed down apache by a factor of 2. I am trying to understand the long trace now to see if there is anything indicative of what was causing the CPU to spike for a few seconds.

What is a typical lstat time on a well-performing server? I wish to gather some information so that I can complain constructively to the provider if storage access is at fault.
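
As a back-of-the-envelope check on the strace summary above, the per-call cost of lstat and the total time attributed to system calls can be worked out from the table itself. This is only a sketch using the figures as strace reported them; the total is inferred from wait4's percentage share, which is an assumption about how the "% time" column adds up.

```python
# Figures copied from the strace -c summary above.
wait4_seconds = 0.124742
wait4_share = 51.06 / 100      # "% time" column for wait4
lstat_seconds = 0.036388
lstat_calls = 57786

# Total time strace attributed to system calls, inferred from wait4's share.
total_syscall_seconds = wait4_seconds / wait4_share

# Average cost of one lstat call, in microseconds.
lstat_usec_per_call = lstat_seconds / lstat_calls * 1_000_000

print(f"total syscall time: {total_syscall_seconds:.3f} s")
print(f"lstat: {lstat_usec_per_call:.2f} us per call")
```

Both numbers come out small: about a quarter of a second in total and well under a microsecond per lstat. On these figures alone, the system calls do not account for a page that takes several seconds to generate.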

UPDATE Output of fio random read test:

random-read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.59
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r] [100.0% done] [12185K/0K /s] [2975 /0  iops] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=24264
  read : io=131072KB, bw=10298KB/s, iops=2574 , runt= 12728msec
    clat (usec): min=119 , max=162219 , avg=380.34, stdev=957.37
     lat (usec): min=119 , max=162219 , avg=380.89, stdev=957.40
    bw (KB/s) : min= 7200, max=13424, per=99.89%, avg=10285.72, stdev=1608.68
  cpu          : usr=2.80%, sys=18.65%, ctx=33511, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=32768/0/0, short=0/0/0
     lat (usec): 250=45.57%, 500=37.17%, 750=3.41%, 1000=7.83%
     lat (msec): 2=5.67%, 4=0.27%, 10=0.08%, 20=0.01%, 250=0.01%

Run status group 0 (all jobs):
   READ: io=131072KB, aggrb=10297KB/s, minb=10545KB/s, maxb=10545KB/s, mint=12728msec, maxt=12728msec

The only hint I have now is that the CPU line of the fio output seems to show quite a bit of activity when compared to other systems. I ran it on my local Ubuntu machine and the output was:

cpu          : usr=0.19%, sys=0.59%, ctx=32923, majf=0, minf=23

The usr percentage on my local machine is a small fraction of what is being reported on the server.
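
To put numbers on that comparison, the ratios between the two cpu lines can be computed directly, along with a cross-check of the reported IOPS. This is a sketch based only on the figures quoted above; the two machines have different hardware, so the ratios are indicative at best.

```python
# Figures copied from the two fio "cpu" lines above.
server = {"usr": 2.80, "sys": 18.65}
local = {"usr": 0.19, "sys": 0.59}

# How many times more CPU the server spent per category for the same job.
ratios = {key: server[key] / local[key] for key in server}

# Cross-check the reported IOPS: 32768 reads issued in 12728 ms.
iops = 32768 / (12728 / 1000)

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1f}x")
print(f"iops: {iops:.0f}")
```

The server spends roughly 15 times more user CPU and roughly 32 times more system CPU than the local machine for a similar number of context switches, while the computed IOPS agrees with the 2574 that fio reported.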

UPDATE Re PHP APC. Yes it is installed. Output from phpinfo:

APC Support enabled
Version 3.1.7
APC Debugging   Disabled
MMAP Support    Enabled
MMAP File Mask  no value
Locking type    pthread mutex Locks
Serialization Support   php
Revision    $Revision: 307215 $
Build Date  May 2 2011 19:00:42

Is there any specific setting I should check for? These are the settings I have (local value, Master value):

apc.cache_by_default    On  On
apc.canonicalize    On  On
apc.coredump_unmap  Off Off
apc.enable_cli  Off Off
apc.enabled On  On
apc.file_md5    Off Off
apc.file_update_protection  2   2
apc.filters no value    no value
apc.gc_ttl  3600    3600
apc.include_once_override   Off Off
apc.lazy_classes    Off Off
apc.lazy_functions  Off Off
apc.max_file_size   1M  1M
apc.mmap_file_mask  no value    no value
apc.num_files_hint  1000    1000
apc.preload_path    no value    no value
apc.report_autofilter   Off Off
apc.rfc1867 Off Off
apc.rfc1867_freq    0   0
apc.rfc1867_name    APC_UPLOAD_PROGRESS APC_UPLOAD_PROGRESS
apc.rfc1867_prefix  upload_ upload_
apc.rfc1867_ttl 3600    3600
apc.serializer  default default
apc.shm_segments    1   1
apc.shm_size    32M 32M
apc.slam_defense    On  On
apc.stat    On  On
apc.stat_ctime  Off Off
apc.ttl 0   0
apc.use_request_time    On  On
apc.user_entries_hint   4096    4096
apc.user_ttl    0   0
apc.write_lock  On  On

UPDATE Increased apc.shm_size to 96M. Cache full count is now 0, and there are 96.5% hits to the cache after a few refreshes of the website here and there. APC reports 25.4 MB free.

It seems to have reduced the loading time by 3 seconds or so, now down to around 4 to 5 seconds if I do a pure wget from the server itself without getting any images etc. That is still more than twice as slow as the other hosting, but definitely an improvement.

I still find it strange that it was taking so long to render those pages when the server is totally idle (I don't have APC installed on my development PC and it doesn't behave that way). And it's still unclear where those remaining seconds are being wasted.

jbx
  • You can use tools like iostat, sysstat or sar to find out if it is a IO problem. Please post the output from those commands then we can help you better. – Raffael Luthiger Nov 10 '13 at 23:30
  • `iostat` just says 'Cannot find disk data'. I attached `sar` output with 1 second frequency while I did an http request. – jbx Nov 11 '13 at 01:17
  • I'd toss [Newrelic](http://newrelic.com) on the PHP application side and see what's up... – ewwhite Nov 13 '13 at 05:51
  • Note that to really test apc you should set the `apc.stat` setting to "0". – regilero Nov 14 '13 at 09:41
  • What handler are you using for PHP? Are you certain you are on dedicated or a system with a hypervisor? Disk io is often slow on VPS/cloud offerings you may want to benchmark disk writes. – jeffatrackaid Nov 14 '13 at 20:52
  • For a quick test, I usually run these 3 items: plain html/image (static content), php page with just phpinfo function, php page with a call to your database (just do a simple select). This is quick and really tease apart where you should be focusing your efforts. – jeffatrackaid Nov 14 '13 at 20:53
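
The three-page comparison suggested in the last comment could be scripted roughly as follows. This is a minimal sketch: the three URLs are hypothetical test pages that you would have to create yourself, and `fetch`, `time_call` and `run_all` are illustrative names, not part of any tool mentioned in this thread.

```python
import time
import urllib.request


def fetch(url):
    """Download a URL and discard the body (a real network call)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()


def time_call(func, arg, runs=3):
    """Return the fastest wall-clock time, in seconds, of func(arg)."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical test pages: static HTML, phpinfo() only, one simple SELECT.
PAGES = {
    "static": "http://localhost/test.html",
    "php only": "http://localhost/info.php",
    "php + db": "http://localhost/dbtest.php",
}


def run_all(pages=PAGES, func=fetch):
    """Time every page and return {label: seconds}."""
    return {name: time_call(func, url) for name, url in pages.items()}
```

Calling `run_all()` on the server and comparing the three timings should show whether the slowdown appears with static content, with PHP alone, or only once the database is involved.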

4 Answers

You first have to determine what the issue is: PHP, MySQL, I/O, load, memory, CPU, kernel, etc. sar logs the system's metrics, but you'll have to catch the problem in the act. You can configure atop to do process accounting, which definitely helps.

To determine if it's I/O

Use tools such as iotop and atop to see what the disk usage is; these tools will also tell you what is causing the IO. Generally, if iowait is sustained over 10% this could be the issue.

sar logs disk IO; so you can run sar -d to see it (look at %util column).

To determine if it's load

Use tools such as htop, top, uptime; again tie this to the process running and find out more details about what the process is doing. Note that this reports the load on the scheduler; it doesn't reflect the CPU usage.
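
Because the load average counts tasks waiting to run rather than CPU percentage, it is usually read relative to the number of CPUs. A small sketch of that comparison, using only the Python standard library (the function name is illustrative):

```python
import os


def load_per_cpu(loadavg=None, cpus=None):
    """Return the 1-minute load average divided by the CPU count.

    Both arguments are optional; when omitted they are read from the
    running system via os.getloadavg() and os.cpu_count().
    """
    one_minute = (loadavg or os.getloadavg())[0]
    return one_minute / (cpus or os.cpu_count() or 1)
```

A value that stays well above 1.0 suggests more runnable work than the CPUs can service, which is a different signal from the utilization figures that sar reports.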

To determine if it's the CPU

sar once again comes in to save the day; you can see this information with sar -P ALL. You can also use mpstat -P ALL for real-time data. Generally, the CPU is only an issue if all the CPUs are at 100%; 80%+ means they're being utilized (but not necessarily saturated).
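
If sar and mpstat are not available, per-CPU utilization can be approximated from two snapshots of /proc/stat. The sketch below assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal) and counts iowait as idle time; the function names are illustrative.

```python
def parse_cpu_line(line):
    """Split a 'cpuN ...' line from /proc/stat into (name, busy, total)."""
    fields = line.split()
    # user nice system idle iowait irq softirq steal; the guest fields are
    # skipped because guest time is already included in user time.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]          # idle + iowait
    total = sum(values)
    return fields[0], total - idle, total


def snapshot(text):
    """Map CPU name to (busy, total) jiffies for one /proc/stat dump."""
    result = {}
    for line in text.splitlines():
        if line.startswith("cpu"):
            name, busy, total = parse_cpu_line(line)
            result[name] = (busy, total)
    return result


def utilization(before, after):
    """Percent busy per CPU between two /proc/stat snapshots (as text)."""
    first, second = snapshot(before), snapshot(after)
    result = {}
    for name, (busy, total) in first.items():
        d_busy = second[name][0] - busy
        d_total = second[name][1] - total
        result[name] = 100.0 * d_busy / d_total if d_total else 0.0
    return result
```

To use it, read /proc/stat twice about a second apart and pass both strings to `utilization`; the entry named `cpu` is the aggregate across all cores.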

To determine if it's the Memory (VM)

You'll want to use vmstat; vmstat -S M 1 and observe the swap, io, and system columns. Obviously a high amount of swapping can impact performance. There is also the system section; a high amount of interrupts can also do the same.
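
When watching vmstat by eye is impractical, its output can be captured to a file and filtered for intervals where swapping occurred. A minimal sketch, assuming the default vmstat column layout with a header row that starts with `r` and includes `si` and `so`; the function names are illustrative.

```python
def parse_vmstat(text):
    """Parse captured `vmstat 1` output into a list of dicts by column name."""
    rows, header = [], None
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("procs"):
            continue                      # skip blank lines and the banner
        if parts[0] == "r":
            header = parts                # column names: r b swpd free ...
        elif header and len(parts) == len(header):
            rows.append(dict(zip(header, map(int, parts))))
    return rows


def swapping_rows(rows):
    """Return the rows where memory was swapped in (si) or out (so)."""
    return [row for row in rows if row["si"] > 0 or row["so"] > 0]
```

The same parsed rows also expose the `in` (interrupts) and `cs` (context switches) columns from the system section, which can be compared against a baseline taken while the machine is idle.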

To determine if it's interrupts

You can use vmstat -S M 1. Unfortunately, it's hard to tell if interrupts are the issue if your system doesn't have a baseline for what is normal. A high number of interrupts (which are caused by hardware requiring action from the kernel) will bring the system to a crawl. Failing NICs are notorious for doing this.

To determine if it's the kernel

This is trickier but generally requires strace, perf, or sysdig tools. One such tool is perf top. strace with a summary (-c) is nice but it doesn't break it down relative to the system resources (so the data that is provided is only speculation); it's ideal to use perf top to come to the conclusion that it's the kernel. You can also use stap (SystemTap) if your machine supports it. I should also note that strace will impact performance; you should use sysdig if the system is at all important.

To determine if it's MySQL/PHP

You basically have to follow what I posted above (perf for example can provide information on what command is causing high kernel time, iotop, atop, htop can provide information relative to system resources on what is using them); basically, you're using the above tools to determine what is causing the load.

Once you've determined it to be MySQL

It could be a query that you're running (so you'll want to use EXPLAIN on that query in MySQL). You'll also need to make sure that your database is optimized and that the queries you're executing are optimized. You'll also have to make sure that the table engine you're using is ideal for what you're doing (I've seen many large tables that are MyISAM when they should be InnoDB). If you've determined that none of the above are the issue and still suspect MySQL, you may want to archive data in the affected tables to reduce access (table scans) to that table. You may also want to verify constraint consistency, enable cache buffering, and ensure indexes are optimal.

A good tool to help in this process is mytop; but all the information that mytop provides is easily accessible in the mysql client. Some useful statements to run:

  • SHOW FULL PROCESSLIST\G to get a complete list of currently executing SQL statements as well as their status to the server.
  • SHOW ENGINE INNODB STATUS\G (InnoDB only)
  • EXPLAIN EXTENDED <QUERY> to explain a query that you see MySQL executing.
  • SHOW GLOBAL STATUS\G for a server-wide status

Once you've determined it to be PHP

You can use tools to profile your PHP code (such as xdebug) and then open the generated profile in KCacheGrind to see a performance analysis of the profiled PHP code.

If you find it's none of these you may just need to upgrade your server.

  • Thanks. I followed your suggestions and the process that seems to be creating a CPU spike is apache2. Unfortunately since this is a virtual server most i/o tools are not working. I updated my question with some `strace` outputs. – jbx Nov 14 '13 at 02:42
  • Can you elaborate a bit more on what should I do with `fio`? Re Apache well I am just using the version that comes with the Ubuntu distribution. I don't think it is forking for each request, because I saw some calls to `select`, and it tends to reuse the same processes rather than forking new ones for each request. I don't think it has any effect on the problem at hand however, because I am just doing one HTTP request, there is no other load on the server. – jbx Nov 14 '13 at 13:53
  • I have added the output from an `fio` random read test. The latency and bandwidth seem fine. However I am not sure if the CPU line is reporting too high usage. (On the first example I saw here the CPU load was much lower https://www.linux.com/learn/tutorials/442451-inspecting-disk-io-performance-with-fio/) – jbx Nov 17 '13 at 22:59

Look over the answer I gave to another question similar to this one for clues.

The thing is, if other pages outside of the WordPress area are loading fine but WordPress itself is choked, three things come to mind beyond the generic things I recommend.

  1. When you migrated your WordPress code to the new setup, did you make sure to correctly set all paths on the file system in wp-config.php? The reason being that sometimes WordPress can work despite incorrect paths if those are set in the MySQL DB for WordPress options. Making sure they are in wp-config.php forces WordPress to use the correct directories & ensures temp & cache folders work as expected.
  2. Database slowdown? That is the only other thing I can think of that would be idiosyncratic to WordPress yet allow other pages to load. Are you sure your MySQL my.cnf is working as it should for the DB needs of your site?
  3. Do you have a plugin or setting in your WordPress code that enables Gzip compression? In general Gzip compression should happen on the server side via Apache or Nginx since they can handle Gzip compression more efficiently than PHP code. So if you have Gzip compression enabled in WordPress, disable it since PHP (which is what WordPress uses) is not great at Gzip compression.

In general I have set up tons of CMS sites, more recently WordPress, on cloud servers without issue. A 10-second page load is not a symptom of the cloud host being inadequate. I would recommend looking over the stuff I recommend here & in my other answer. I would also recommend debugging by doing a clean WordPress install on the setup that has the issue & seeing how that behaves. If that works well in comparison to your full site, then it is clear there is some configuration issue in your site's specific code.

EDIT: Here is another idea. Do you have Apache authorization (htaccess) anywhere in your setup? Do you have it set to allow from localhost? See below. Sometimes this setup works, but if Allow from localhost is the first or the only entry in the list of Allow directives, it can choke on reverse DNS weirdness. I would recommend disabling that, if you can, and seeing how quickly the site loads compared to when it is enabled.

Order Deny,Allow
Deny from all
Allow from 127.0.0.1 ::1
Allow from localhost
Giacomo1968
  • Thanks for your replies. Re 1: I checked my wp_config.php but there are no paths in that config file. Re 2: What should I check for in my.cnf in particular? Re 3: No Gzip compression at all. I disabled W3 Total Cache because it was actually slowing things down even without any caching. I have another 'vanilla' wordpress with very few plugins (just Attachments, Contact Form 7, User Role Editor and W3 Total Cache) and load time is quite fast. I know the obvious answer will be 'there is a plugin slowing things down' but fact is the same setup works much faster on the shared hosting. – jbx Nov 11 '13 at 10:56
  • I know it might not make much difference, but do you suggest I install a vanilla wordpress installation, install the plugins one by one, do an export and import of the site data via the wordpress xml tool, and copy over just wp-uploads and the theme? I was afraid I will lose some settings especially for the plugins if I do that, which is why I followed the Codex recommendations of just copying everything and changing the DB username/password settings. But if you think it might be better to install fresh and add things incrementally I can try that. – jbx Nov 11 '13 at 11:00
  • Please read the original answer I linked to. If you feel you want to start from scratch, please do that. But the assessment you make of the whole system—disk, memory & CPU—is at fault is off. Also, there is not one thing in `my.cnf` that would fix this. But in general, is your MySQL tuned? If none of this makes sense, you are in over your head. – Giacomo1968 Nov 11 '13 at 16:22
  • I had a look at your answer but it was mostly addressing memory usage, which in my case is (so far) always below the physical limit. I will have a look in more detail about the MySQL tuning. I had installed the Debug Queries plugin and when it listed the query times compared to the total time it said that only 14% of the time to generate the page was spent on the queries, and the rest was by PHP, whatever that meant. I was blaming I/O just because it seems that (assuming its not a DB issue) the more files are being loaded the more it slows down. Obviously I could be wrong. – jbx Nov 11 '13 at 17:07
  • New idea. See answer. – Giacomo1968 Nov 13 '13 at 00:14
  • Cheers. I checked the `.htaccess` file and there isn't anything of that sort. The only things I have are rewrite rules for pretty URLs and calls to `mod_deflate` added by the caching plugin. I would exclude DNS issues though because since I haven't changed the domain to point to the new server (due to this slowdown issue) I have put its IP directly in `/etc/hosts` – jbx Nov 13 '13 at 03:44
  • “I would exclude DNS issues though because since I haven't changed the domain to point to the new server…” You are completely missing the point, since the DNS issue I mention is with Apache dealing with ‘localhost.’ Best of luck. – Giacomo1968 Nov 13 '13 at 11:43
  • Why am I missing the point? I checked .htaccess and there isn't any authorisation of that sort. – jbx Nov 13 '13 at 13:30

This looks like other cases I've seen where Apache is spending a lot of its time compiling PHP. Have you made sure an opcode cache (e.g. APC) is installed? It'll show as a loaded module in the output of phpinfo(), if that helps. Otherwise, to track what Apache's doing within mod_php, your best bet is going to be XHProf.

To anyone other than jbx arriving here via Google: the other answers are excellent, by the way. Go read them. But those answers, and jbx's responses to them, have helped me arrive at this conclusion.

BMDan
  • thanks for your reply. Yes APC is enabled. I have updated the answer with the output from phpinfo with the settings related to it, in case anything catches your eye. Can you elaborate a bit more on what I can do with XHProf? I assume you are referring to this? http://us2.php.net/manual/en/xhprof.examples.php Where should I put that stuff? somewhere in wordpress? – jbx Nov 18 '13 at 22:44
  • `apc.shm_size` is quite small for any significant amount of WordPress code and plugins. Grab a copy of `apc.php` (http://git.php.net/?p=pecl/caching/apc.git;a=blob_plain;f=apc.php;hb=HEAD), throw it in your DocRoot, and check the cache full count. If it's > 0, which it will be, increase `shm_size`. A good guess for a typical WordPress site might be 64 or 96 MB (corresponding to `apc.shm_size=64M` or `apc.shm_size=96M` in your apc.ini). Oh, and delete apc.php when you're done; it's not a real problem, but it ideally shouldn't just be left sitting around. – BMDan Nov 22 '13 at 19:15
  • Increased it to 96M and restarted apache. Cache full count is now 0. Page load time improved, and reduced by around 3 seconds, but still too slow. Refreshing the front page still takes around 4 seconds. The script you gave me is saying 96.4% hits, so I guess APC is being used properly now. – jbx Nov 22 '13 at 23:49
  • Although I haven't completely solved my issue, this answer was the only one that made any real significant difference. – jbx Nov 26 '13 at 23:24

One of the biggest I/O sources, for sites with heavy traffic, is /tmp I/O, which comes from:

  1. PHP session data: many CMS systems, like WordPress, read it from /tmp on every page transition to determine whether the visitor is authorized to access the new page

  2. SQL SELECTs: /tmp is written and read many times whenever the returned data, or any temporary result sets created, live in /tmp

The first thing I do when migrating a client off a slow machine to a new (hopefully faster) machine is size memory. My quick 30 second algo is:

(top memory used + swap used) * 2
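
Written out as a function, that rule of thumb looks like the sketch below. The example input uses the 800 MB figure from the question and assumes, purely for illustration, that no swap is in use; the function name is made up.

```python
def suggested_memory_mb(top_memory_used_mb, swap_used_mb):
    """Rule of thumb from above: (top memory used + swap used) * 2."""
    return (top_memory_used_mb + swap_used_mb) * 2


# 800 MB in use (from the question) and, as an assumption, no swap in use.
print(suggested_memory_mb(800, 0))  # prints 1600
```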

Then on the new machine I set up /tmp to run in tmpfs (memory) + use mysqltuner (every few days) + tune mysql (really mariadb) till mysqltuner is mainly quiet.

Sometimes this simple trick is all that's required for breathing new life into a slow performing server.

Once this is done, then if machine is still sluggish, I start looking into tuning each subsystem.

When tuning, I always start with a tool that tells me current state.

So for memory resizing, use top (memory + swap + load).

For Apache, check logs to ensure there are no messages like MaxRequestWorkers being exceeded.

For old PHP versions use an APC monitor to ensure APC is actually working + has plenty of memory to spare + hit rate is high - 90%+ is a good target.

For modern PHP versions do the same for Opcache, which replaced APC several years ago.

For MySQL, first switch to MariaDB (way faster in my experience) + use mysqltuner every few days till you have fairly quiet output.

For CMSes like WordPress, never take anyone's word for what caching plugin works. Use ab to test site speed, first without any caching + add a caching plugin + retest.

Hint: Start with ZenCache + you'll be surprised.

Lastly, I simulate a slowloris DDOS attack against every new server I set up, as DDOS behavior works differently depending on network layout + adapter speed + machine resources. I tune systems to survive DDOS attacks long enough for Apache 4xx status codes (usually 400 + 408) to show up in Apache logs + use fail2ban for blocking these IPs.

A big part of I/O tuning is to generate unusual load situations, like a slowloris DDOS, before you deploy any sites on a machine. This way you can tune at your own leisure, rather than trying to tune under real load, like an Ad Spend or getting Slashdotted... or under attack load like DDOS or a High Value Attack or just an evil scraper that ignores your robots.txt + sucks all resources out of your machine.

sebix
David Favor