The strange case of Mr. Time To First Byte

Question

I've got a webserver on a Linode 1024 VPS based on

Ubuntu 11.10
Nginx 1.0.5
PHP 5.3.6 (with PHP-FPM, APC)
Varnish 3.0.2

And a couple of blogs there based on WordPress 3.3.1. One of them is a plain blog, with the default config, theme and just the "Hello World" post, to test the server. The other one is a blog cloned from another server, with almost 10k posts and over 10k comments. This blog has around 5k uniques per day.

The server gives good numbers on an ab test for the test blog, but the same test with the cloned blog is impossible to complete: the ab test loads the server too much and I have to stop the process, which in any case makes ab report a really poor result.

htop also shows a "normal" load during normal operation, but an abnormally high load during the ab test.

There's another strange thing happening (the most important for me): the Time To First Byte is extremely high, but after that wait the site loads really fast. This can be easily tested with services such as tools.pingdom.com, which gives this result. Please pay attention to that yellow region that means "Wait time".

Why is this happening? Possible ideas:

Bad PHP-FPM config
Linode DNS response time is awful. Nonsense -the test blog resolves DNS fine, TTFB is fantastic
Bad Nginx config

In case someone needs more info,

Here you've got the current cloned blog nginx config file (/etc/nginx/sites-available/muycomputerpro.com)
Here you've got the current my.cnf config (/etc/mysql/my.cnf) (I know, no caching for the moment; this hasn't made a difference to TTFB in the past)
Here you've got the current PHP-FPM config (/etc/php5/fpm/pool.d/www.conf)

I think this may have something to do with the `if -f` directive you're using in the `location` container in the nginx config. Based on what I'm reading here http://wiki.nginx.org/Pitfalls, I have a feeling that the `-f` is doing an inefficient search for the file which could cause a Time To First Byte issue, especially if you have directories with a large number of files. — d34dh0r53, Jan 15 '12 at 11:39
Mmm. Seems there is no problem there. I've commented those three lines to avoid that and the website loads fine, but the TTFB still remains very high. — javipas, Jan 15 '12 at 12:19
A few thoughts: a) what are the differences from the original server the blog is cloned from (e.g. does it run the same stack?) b) if you can, run ab directly from the server using localhost and the port (try accessing via Varnish, and then accessing nginx directly). c) Enable MySQL and PHP-FPM slow logs. d) run mysqltuner.pl and see if you can improve your MySQL performance (that, or plugins, would be the most obvious difference between the blogs). e) The PHP-FPM config you posted doesn't seem to be the one used by nginx (/var/run/php5-fpm-tpnet.sock != /var/run/php5-fpm-www-data.sock) — cyberx86, Jan 15 '12 at 16:38
I think it's your database. Tune it and switch to `InnoDB` if not already. — SaveTheRbtz, Jan 15 '12 at 18:13
Definitely a PHP issue. Wordpress is *really* slow. You'll want a caching plugin for it to get a decent load time when you have that much content. — Martin Fjordvald, Jan 15 '12 at 18:28
Thanks for the suggestions, but as far as I know all of them are about optimizing for load under high traffic. I can run the ab test against localhost, and the results are fine (around 4k req/s), and running ab with and without Varnish should give the expected numbers as well, so I'm not too worried about this. mysqltuner should be a nice help, I'll try that. Martin, I've got the Hypercache plugin (runs fine, maybe not as good as W3 Total Cache), but again, all these tips are useful to make the server behave well under heavy load. The important question is why TTFB stays so high — javipas, Jan 15 '12 at 20:17
You said you 'can run ab on localhost and get 4k req/s' - which localhost (previous/current) are you referring to? If that value is from your current server - the one with the high TTFB - then your problem just got a lot more interesting - since you have effectively eliminated PHP, MySQL, and your web server. TTFB includes DNS, round trip time, and processing time. A long TTFB is usually due to processing (e.g. PHP/MySQL). The point of running ab directly against nginx is to eliminate the other components. Also, Varnish, if setup right, should bypass the backend, giving a very high req/s. — cyberx86, Jan 16 '12 at 02:46
cyberx86, interesting point. I've made new ab tests, and these are the results: simple blog - Varnish on (http://pastebin.com/fkjer6b1). Simple blog - Varnish off (http://pastebin.com/sX0ZpmTQ). Localhost - Varnish on (http://pastebin.com/xs0PUQKX). Localhost - Varnish off (http://pastebin.com/9uzXkGcb). The localhost is the one on the new server, the one with the high TTFB. As you've said, localhost gets around 4k reqs/s whether Varnish is on or off. One note that could be interesting: the simple blog (muylujo.com) has a fantastic TTFB (391ms on pingdom) compared to the cloned blog. Why??? — javipas, Jan 16 '12 at 12:03
Your localhost tests do not seem valid - you didn't actually retrieve your blog. Notice the difference in page size: 7500bytes when accessed from the domain, 151 bytes from localhost. Since you probably have multiple virtualhosts, you need to pass the host header to ab. `ab -n 1000 -c 100 -H 'Host: mysite.com' http://127.0.0.1/` That said - the difference in cached (Varnish) vs uncached results is enough to validate the position that the problem is unrelated to network, dns, etc. and lies in processing, as expected. — cyberx86, Jan 16 '12 at 12:37
Thanks for the suggestion, cyberx86... Any tips how to detect possible bottlenecks -apart from the ones you commented previously-? Is this the reason why the simple blog loads fast and has a nice ttfb, and the cloned, heavy blog (with lots of content) loads slow and has a horrible ttfb? — javipas, Jan 16 '12 at 12:51
It is very likely that your bottlenecks are causing your high TTFB - unless your localhost test suggests otherwise (once you redo it). I have provided some rough suggestions and diagnostics as an answer (mostly because I couldn't fit it as a comment). — cyberx86, Jan 16 '12 at 21:16

score 25 · Accepted Answer · answered Jan 16 '12 at 21:14

Firstly, this is not an answer, so much as a diagnostic approach.

This is by no means comprehensive - or even anything close, it is just a starting point.

Time to First Byte

Time to first byte (TTFB) has a number of components:

DNS Lookup: Find the IP address of the domain (possible improvement: more numerous/distributed/responsive DNS servers)
Connection time: Open a socket to the server, negotiate the connection (typical value should be around 'ping' time - a round trip is usually necessary - keepalive should help for subsequent requests)
Waiting: initial processing required before the first byte can be sent (this is where your improvement should be; it will be most significant for dynamic content)
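
One way to see these components separately is with curl's `-w` timers. The following is only a minimal sketch, assuming plain HTTP and that curl and awk are installed; the function names `phase_breakdown` and `measure_ttfb` and the example URL are made up for illustration. curl reports cumulative times since the start of the request, so the sketch subtracts them to get per-phase durations.

```shell
# Split curl's cumulative timers (seconds) into per-phase durations.
# Arguments: time_namelookup time_connect time_starttransfer
phase_breakdown() {
  LC_ALL=C awk -v dns="$1" -v conn="$2" -v ttfb="$3" 'BEGIN {
    printf "dns=%.3f connect=%.3f wait=%.3f\n", dns, conn - dns, ttfb - conn
  }'
}

# Fetch a URL, discard the body, and print the breakdown.
measure_ttfb() {
  timers=$(curl -s -o /dev/null \
    -w '%{time_namelookup} %{time_connect} %{time_starttransfer}' "$1") || return 1
  # Intentionally unquoted so the three timers become three arguments.
  phase_breakdown $timers
}

# Example (hypothetical URL):
#   measure_ttfb http://example.com/
```

If "wait" dominates while "dns" and "connect" stay small, the time is being spent on the server generating the page rather than on the network.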

When you look at an ApacheBench output, you also see:

Processing: This is the sum of waiting + complete transfer of content. If the transfer time is significantly longer than what would be expected to download the quantity of data received, further processing is occurring after the first byte is received (e.g. the page is flushing content as it becomes available).

Comparisons to Eliminate components

With few exceptions, your problem is going to lie in the backend processing, which usually comes down to overly complex/inefficient code, or poorly configured MySQL.

A good way to approach this problem is through a series of comparisons that will eliminate various aspects of your setup. A good comparison should keep as much constant as possible to help narrow down the problem. Currently, you have provided the following comparisons:

Identical (cloned) site running on old server and new server:
- Difference: Server
- Result: old server is fast; new server is slow
- Notes: What you need here is to quantify the differences between these servers - both in terms of the stack used (Nginx, etc) and the hardware (is the old server faster because it is a more powerful machine?)
- Conclusion: the code may be able to run fast on the right setup
Test site vs full site on the new server
- Difference: content, themes, plugins, etc
- Result: test site is fast, full site is slow
- Notes: in theory, this test should help you to eliminate a lot of aspects of your setup - DNS, network, even your nginx/php/mysql setup - however, it is not quite 'fair'.
- Conclusion: the extra content is having a significant impact on performance

The ideal test would have you duplicate your full site, but then delete all the content except for one article and the associated comments. The point of this test would be to conclusively determine if the large amount of content is the problem or if other aspects of your setup (wordpress plugins, theme, etc) are the cause. You would essentially compare the performance of identical sites, on the same (new) server - loading the same page (same length, etc) - with the only difference being the total site content (e.g. there is a good chance that some plugin does not scale well with increased content).

Without changing anything, there are some other comparisons you can do:

Test from a remote location vs local - this will help identify if network, latency, dns, etc is the cause
- You have already (somewhat) done this and mostly concluded that you don't have a network problem.
Test via Varnish (i.e. port 80) vs nginx directly (port 8080) - try not to change your configuration between tests - just use the correct port. This will show you the impact of Varnish. Since Varnish is a caching layer, it should serve all requests after the first one very quickly - essentially, it should bypass the backend and the processing that is needed to generate a dynamic page, and serve the cached copy very quickly.
- You have done this (although, not locally) and demonstrated that Varnish has a significant positive impact on your performance.
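
To make the two runs easy to compare, the relevant lines of each ab report can be pulled out side by side. This is a rough sketch, assuming ab is run on the server itself and reusing the `mysite.com` placeholder host from the comments; `ab_summary` is a made-up helper name. The document length is included because it confirms that the real page was fetched rather than a default virtual host.

```shell
# Print only the lines of an ab report needed for a quick comparison.
# Reads the report on stdin.
ab_summary() {
  awk -F': *' '/^(Document Length|Requests per second)/ { print $1 ": " $2 }'
}

# Example: same request through Varnish (port 80) and straight to nginx (port 8080).
#   ab -n 1000 -c 100 -H 'Host: mysite.com' http://127.0.0.1:80/   | ab_summary
#   ab -n 1000 -c 100 -H 'Host: mysite.com' http://127.0.0.1:8080/ | ab_summary
```

If both runs report the same document length but very different requests per second, the difference can be attributed to the caching layer.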

Tuning your Backend

By this point you should have either found the problem or concluded that it lies in your backend. That leaves you Nginx, PHP, or MySQL.

(I should mention here that it is always handy to know if your bottleneck is CPU, RAM, or I/O - between sar, top, iostat, vmstat, free, etc you should be able to come to some conclusion on this.)

Nginx

Nginx is just taking requests and either serving static content or shifting the requests to PHP-FPM - there usually isn't much to optimize with Nginx.

Set worker_processes to the number of CPU cores
Enable keepalive (a keepalive_timeout of 10-15 seconds is good)
Disable unneeded logging
Increase buffer sizes if needed
Avoid if statements (use static names instead of regexes where possible, eliminate unneeded extensions)
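
As a rough illustration, the points above might look like this in an nginx config. This is a fragment, not a complete file: the core count is an assumption about the machine, and the `try_files` line is the commonly used WordPress form, not something taken from the posted config.

```nginx
# Illustrative values only; adjust to the actual machine.
worker_processes  4;          # assumption: a 4-core VPS

http {
    keepalive_timeout  15;    # seconds, within the 10-15 range suggested above
    access_log  off;          # or keep logging only where it is needed

    server {
        # ... listen, server_name, root as in the existing config ...

        location / {
            # try_files replaces an "if (-f $request_filename)" style check
            try_files $uri $uri/ /index.php?$args;
        }
    }
}
```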

Ideally, your test blog and cloned blog have identical configs, in which case, you have effectively eliminated Nginx as the problem.

Application

In the case where you are trying to identify a problem in your code (for instance a slow plugin, etc) the slow logs are the place to start.

Enable the MySQL slow log and the PHP-FPM slow log, then run your benchmark and see what comes up as slow.
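
For reference, enabling both logs could look roughly like the fragment below. The directive names are the standard MySQL and PHP-FPM ones; the log paths and thresholds are assumptions chosen for illustration and should be adapted.

```ini
; /etc/mysql/my.cnf, under [mysqld]
; (log path and 1-second threshold are assumed values)
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 1

; /etc/php5/fpm/pool.d/www.conf
; (log path and 5-second threshold are assumed values)
slowlog                 = /var/log/php5-fpm-slow.log
request_slowlog_timeout = 5s
```

The PHP-FPM slow log records a stack trace for each request that exceeds the timeout, which helps point at the plugin or function responsible.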

MySQL

Increase your caches and run mysqltuner.pl to get a good starting point.

PHP

disable unneeded extensions,
disable register_globals, magic_quotes_*, expose_php, register_argc_argv, always_populate_raw_post_data
increase the memory_limit
open_basedir and safe_mode have significant performance implications, but also can provide an additional layer of defense. Test with and without them, to determine if their impact on performance is tolerable.

PHP-FPM

Adjust the pm.* values - increase them to deal with high load

It is worth noting that your htop results show php-fpm as consuming the bulk of the CPU - and your problem does appear to be directly related to this.

Caching

Once you have optimized each likely bottleneck, start caching.

You have an opcode cache (APC) already - ensure that it is working (it comes with a test file) - check your cache hit rates, and if possible have APC cache to memory instead of to disk.
Set up your code to cache (e.g. using a plugin for WordPress such as W3TC)
With nginx you can set up FastCGI caching - but since you have Varnish, this is best avoided.
Set up a caching layer, such as Varnish (which you have already done) - and ensure that it is working (e.g. use varnishstat, read Achieving a high Hitrate)
Add more caching for components of your site - e.g. Memcached if applicable
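
To check that Varnish is actually serving from cache, the hit rate can be derived from the varnishstat counters. This is a sketch under assumptions: `varnish_hit_rate` is a made-up helper, and it assumes that `varnishstat -1` prints the counter name in the first column and its value in the second, with counters named `cache_hit` and `cache_miss` (possibly prefixed, e.g. `MAIN.cache_hit`, in later versions).

```shell
# Compute the cache hit rate from `varnishstat -1` output on stdin.
# Assumes counters named cache_hit / cache_miss, optionally with a prefix.
varnish_hit_rate() {
  LC_ALL=C awk '
    $1 ~ /(^|\.)cache_hit$/  { hit = $2 }
    $1 ~ /(^|\.)cache_miss$/ { miss = $2 }
    END {
      total = hit + miss
      if (total == 0) { print "no lookups recorded"; exit }
      printf "hit rate: %.1f%%\n", 100 * hit / total
    }'
}

# Example:
#   varnishstat -1 | varnish_hit_rate
```

A low hit rate on a mostly anonymous-traffic blog usually means that cookies or headers are preventing pages from being cached.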

Sometimes, given the limitations of your application and hardware, you may not be able to improve backend performance that much - however, that is the point of caching - to minimize the use of the backend.

That's a fantastic summary of points to analyse. Thank you so much for the comment, I'll try to perform a heavy test with all these suggestions -some of them, as you've said, are already clear- and see if I can finally detect the problem. Best regards, cyberx86. — javipas, Jan 17 '12 at 12:22
About the `memory_limit`, it was pointed out in [another post](http://programmers.stackexchange.com/a/207940/202310) that it doesn't help with performance. — markphd, Oct 31 '15 at 06:37
