We recently migrated a long-running (and trouble-free) Ubuntu 16.04 / nginx 1.10.3 / PHP 7.0 stack to an Ubuntu 18.04 / nginx 1.14.0 / PHP 7.2 stack. The result has been instability at the PHP-FPM level that never existed before, and hours of investigation have turned up nothing on the root cause.
As an aside, both deployments are managed through AWS, and in both situations, the PHP app being run uses identical code and connects to the same MySQL RDS endpoint.
Issue
Post upgrade, the system runs as usual for several hours (sometimes more than 24) without issue. Then, within the space of a minute, FPM starts logging warnings that script execution has timed out on various children, like so:
[15-Jul-2020 19:50:22] WARNING: [pool www] child 29649, script '/path/to/app.php' (request: "POST /app.php") execution timed out (140.371526 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25847, script '/path/to/app.php' (request: "GET /app.php") execution timed out (136.178758 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (135.822576 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25401, script '/path/to/app.php' (request: "POST /app.php") execution timed out (135.927461 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25229, script '/path/to/app.php' (request: "GET /app.php") execution timed out (143.179479 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25110, script '/path/to/app.php' (request: "POST /app.php") execution timed out (141.080509 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 24904, script '/path/to/app.php' (request: "POST /app.php") execution timed out (142.441526 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 24679, script '/path/to/app.php' (request: "GET /app.php") execution timed out (138.165572 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 24652, script '/path/to/app.php' (request: "POST /app.php") execution timed out (142.414685 sec), terminating
[15-Jul-2020 19:50:22] WARNING: [pool www] child 15982, script '/path/to/app.php' (request: "GET /app.php") execution timed out (125.469362 sec), terminating
Shortly thereafter, FPM starts spawning new children to try to handle the load from nginx.
[15-Jul-2020 19:50:37] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 46 total children
[15-Jul-2020 19:50:38] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 1 idle, and 51 total children
[15-Jul-2020 19:50:39] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 55 total children
[15-Jul-2020 19:50:40] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 60 total children
[15-Jul-2020 19:50:41] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 65 total children
Then, once the pool reaches its maximum of 70 children, FPM stops accepting requests from nginx, nginx logs upstream connection failures ([error] 1072#1072: *234300 connect() to unix:/run/php/php7.2-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream), and clients start receiving 502 errors.
[15-Jul-2020 19:50:42] WARNING: [pool www] server reached pm.max_children setting (70), consider raising it
At this point, the app is effectively down until the FPM service is restarted; restarting it immediately terminates the hanging requests/processes and instantly resolves the issue, for the time being.
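A minimal sketch of that stopgap, assuming the stock Ubuntu 18.04 service and binary names:

# Bounce the service; the stuck children are killed immediately
sudo systemctl restart php7.2-fpm

# Verify the pool came back with only the configured start_servers
ps -o pid,etime,cmd -C php-fpm7.2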
Note: The app normally runs with <= 10 workers, even at peak times, so scaling out this fast because of long-running processes is certainly an abnormality.
Repeating logs for the same process ID
On the surface the logs look random, but if you filter them by child process ID it becomes clear that each child is trying to terminate the same single request over and over. What's more, the entries for a given process ID are ~40 seconds apart, which you can verify by comparing the timestamp at the start of each entry with the script time reported at the end. Here I've filtered the log for one process ID as an example (the filtering command itself is sketched after the excerpt):
[15-Jul-2020 19:40:39] NOTICE: [pool www] child 25617 started
[15-Jul-2020 19:50:22] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (135.847853 sec), terminating
[15-Jul-2020 19:51:03] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (176.182362 sec), terminating
[15-Jul-2020 19:51:43] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (216.516705 sec), terminating
[15-Jul-2020 19:52:23] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (256.849969 sec), terminating
[15-Jul-2020 19:53:04] WARNING: [pool www] child 25617, script '/path/to/app.php' (request: "GET /app.php") execution timed out (297.183892 sec), terminating
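For anyone wanting to reproduce that view, a simple filter over the FPM log is enough. A minimal sketch, assuming the default Ubuntu log location for this pool:

# All log lines for a single child PID, e.g. 25617
grep 'child 25617' /var/log/php7.2-fpm.log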
Occurs when traffic is increasing or at peak
Based on the times of day the issue crops up, it clearly correlates with increasing traffic: it hits first thing in the morning, for example, or at the peak of the day when traffic is high. We have not seen it occur when traffic is lower, such as in the middle of the night.
New Relic is Installed
A New Relic daemon is installed on both deployments via their apt repository, per the recommended installation method outlined on their website.
CGroup: /system.slice/php7.2-fpm.service
├─ 6554 php-fpm: master process (/etc/php/7.2/fpm/php-fpm.conf)
├─ 6583 /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /var/log/newrelic/newrelic-daemon.log --port @newrelic --wait-for-port 0s --define utilization.detect_aws=true --define utilization.de
├─ 6603 /usr/bin/newrelic-daemon --agent --pidfile /var/run/newrelic-daemon.pid --logfile /var/log/newrelic/newrelic-daemon.log --port @newrelic --wait-for-port 0s --define utilization.detect_aws=true --define utilization.de
├─ 7228 php-fpm: pool www
├─ 7520 php-fpm: pool www
├─ 7548 php-fpm: pool www
├─ 7569 php-fpm: pool www
├─11835 php-fpm: pool www
├─11944 php-fpm: pool www
├─11997 php-fpm: pool www
├─12106 php-fpm: pool www
├─12934 php-fpm: pool www
└─13049 php-fpm: pool www
Note: We have not yet tried disabling the New Relic agent; on the surface it doesn't appear to be a related cause.
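Ruling it out would only require disabling the PHP extension for the FPM SAPI and restarting. A rough sketch, where the ini path is an assumption and depends on how the installer registered the extension:

# Assumed location of the New Relic ini for the FPM SAPI; verify before moving it
sudo mv /etc/php/7.2/fpm/conf.d/newrelic.ini /etc/php/7.2/fpm/conf.d/newrelic.ini.disabled
sudo systemctl restart php7.2-fpm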
FPM Pool Config
Here's the pool config; it's the same one used on both stacks:
pm = dynamic
pm.max_children = 70
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 1000
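For more visibility into the pool while the problem builds, FPM's built-in status page can be enabled alongside the settings above. A sketch of what that would look like; the /fpm-status path and the localhost-only allow list are arbitrary choices here, not part of our existing config:

; addition to the www pool config
pm.status_path = /fpm-status

# matching nginx location, restricted to localhost
location = /fpm-status {
    access_log off;
    allow 127.0.0.1;
    deny all;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $fastcgi_script_name;
    fastcgi_pass unix:/run/php/php7.2-fpm.sock;
}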
Nginx Config
The same nginx configuration is used on both deployments.
user www-data;
worker_processes auto;
pid /run/nginx.pid;
events {
worker_connections 2048;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
client_max_body_size 24m;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 120;
limit_req_status 429;
limit_req_zone $http_x_forwarded_for zone=web:16m rate=10r/s;
types_hash_max_size 2048;
server_tokens off;
ssl_protocols TLSv1.2;
ssl_prefer_server_ciphers on;
gzip on;
gzip_disable "msie6";
gzip_vary on;
gzip_proxied any;
gzip_comp_level 6;
gzip_buffers 16 8k;
gzip_http_version 1.1;
gzip_min_length 256;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml text/javascript image/x-icon;
log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" - $request_time X-Forwarded-For=$http_x_forwarded_for Host=$host';
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
}
PHP Config
The same config is used on both deployments; all other settings are left at their defaults.
max_execution_time=120
max_input_vars=5000
memory_limit=256M
post_max_size=24M
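As a sanity check that the FPM SAPI actually picked these up, the effective values can be dumped straight from the FPM binary. A minimal sketch, assuming the stock 7.2 binary name:

# Effective INI values as seen by php-fpm itself
sudo php-fpm7.2 -i | grep -E 'max_execution_time|memory_limit|post_max_size|max_input_vars'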
Stuff we've tried and/or looked at
- We tried adjusting the FPM pool configuration to include request_terminate_timeout = 121 (just beyond the max_execution_time set in PHP), in the hope this would terminate the stuck processes, but it did not work; the exact addition is shown after this list. The processes keep being reported as "terminating" every 40 seconds, sometimes for over 1,000 seconds.
- The master FPM config is left as default with no alterations; the default php-fpm.conf file is exactly the same between both versions of FPM. We examined the emergency_restart_threshold, emergency_restart_interval and process_control_timeout settings, but they don't seem to be a factor in the issue we're experiencing.
- Reverting the app to Ubuntu 16.04 (PHP 7.0.33) magically solves the problem, but that's obviously not what we're trying to accomplish by upgrading.
- Restarting FPM when it "seems busy" (per the logs) solves the problem, but again, not a solution.
- We examined frontend logging for the possibility of a coincidental DOS attack, but traffic is not unusual.
- We examined nginx/PHP error logs for indications that scripts are not shutting down properly (i.e. never returning a response). Nothing found.
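For reference, the terminate-timeout experiment mentioned in the first bullet was this one-line pool addition, 121 being one second beyond PHP's max_execution_time:

; addition we tried in the www pool config
request_terminate_timeout = 121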
Conclusion
Something in PHP, nginx, or FPM is configured in a way that ultimately produces some sort of race condition in which the entire FPM service is consumed by sleeping/stuck/stale processes. These eat all available capacity until FPM rejects all traffic indefinitely, and nothing recovers until the service is manually restarted.
We are not ruling out that the issue lies within the app itself, but the fact that it works without issue on the previous OS/PHP/FPM versions would indicate an incompatibility at the PHP level, and thus far we haven't seen any logging to support this.
My goal here is to figure out a.) why this is happening, and b.) what we can do to mitigate or work around the issue.
Thanks for your eyes!