We have a fairly heavily loaded server hosting 6 websites on nginx and PHP-FPM. The software is all vBulletin 3.8 and WordPress. Databases are on a separate server.
Because these are highly popular websites, we normally have 7,000-8,000 visitors online at one time, and most page views hit the database. I believe this is the cause of our problems.
Because we have so many large databases on the MySQL server, and because the queries could, honestly, be a lot better in the software, I think MySQL will occasionally fail to return results to PHP in a timely manner, creating a cascade effect that eventually causes everything to just stop until we reload PHP-FPM. After we do that, things begin working fine again.
I'm having trouble troubleshooting this because I can't really discern anything from the logs. In the MySQL slow query log, I see nothing of interest when downtime occurs. In the nginx logs, I see thousands of entries saying that the read request timed out or the connection timed out (to PHP-FPM). And in the PHP-FPM logs, I see a lot of lines that say "execution timed out (31 sec), terminating".
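For reference, the PHP-FPM pool directives that relate to these log lines look roughly like the sketch below. This is not our actual pool file: the pool name, the log path, and both thresholds are assumptions for illustration. As far as I understand it, request_terminate_timeout is what produces the "execution timed out ... terminating" lines, and request_slowlog_timeout together with slowlog should make FPM write a PHP backtrace for any request that runs past the threshold, which might show where the scripts are actually stuck.

```ini
; Sketch only: pool name, path, and thresholds are assumed values
[www]
; Requests running longer than this are killed by FPM
; (assumed 30s, based on the "31 sec" in our log lines)
request_terminate_timeout = 30s
; Requests running longer than this get a PHP backtrace written to the slow log
request_slowlog_timeout = 5s
; Hypothetical path; must be writable by the FPM master process
slowlog = /var/log/php5-fpm.slow.log
```

Keeping the slow-log threshold well below the terminate timeout means a backtrace is captured before the request is killed.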
So at this point I have no idea where to look for the problem. Obviously, whatever is happening is happening because these scripts sometimes aren't executing quickly enough (normally they load in under a second, but something causes the load time to skyrocket). This happens many times a day and has become quite an issue for us.
For now I simply have a cron job that runs service php5-fpm reload every 10 minutes, which takes care of the crashing problem. Of course, when PHP reloads, nginx throws a 502 gateway error, so it's not much of a solution.
PHP is running APC cache, if that matters. I've read in a few spots that APC can cause hanging under certain circumstances.
Any pointers would be helpful. I'd really like to not have to worry about this machine all the time.
More info can be provided of course. Just let me know what you need.
Update: I just copied over apc.php to a web root and accessed it to look at our stats. Things looked good. Then I clicked the link to go to User stats and BOOM the server instantly hung. I reloaded php-fpm and then reloaded the user stats page and it went through fine. Waited a minute, reloaded again, server hung again.
So this definitely seems to be APC related. The question is: how do we fix it?
APC Config:
[apc]
apc.enabled="1"
apc.stat = "1"
apc.max_file_size = "2M"
apc.localcache = "1"
apc.localcache.size = "256"
apc.shm_segments = "1"
apc.ttl = "3600"
apc.user_ttl = "7200"
apc.gc_ttl = "3600"
apc.cache_by_default = "1"
apc.filters = ""
apc.write_lock = "1"
apc.num_files_hint= "10000"
apc.user_entries_hint="10000"
apc.shm_size = "1G"
apc.mmap_file_mask=/tmp/apc.XXXXXX
apc.include_once_override = "0"
apc.file_update_protection="2"
apc.canonicalize = "1"
apc.report_autofilter="0"
apc.stat_ctime="0"
Update 2: We've made some progress on this here. It turns out that the WordPress caching plugin (W3 Total Cache) is what was causing the crashes. We still don't know why, but with it disabled, we've been running PHP for nearly 4 hours now with no reloads, no slowdowns, no crashes. We're still using APC on the vBulletin forums and no issues there at all. Is there any way we can determine WHY APC is crashing? I'd love to use it on our WordPress installations, but not at the cost of a fragile system.