
In recent weeks we have been experiencing a problem when deploying code: the server sometimes becomes unresponsive for a few minutes.

Here's an example of the load on the server when this happened: [screenshot of the server load during such an event]

The only related log I can find is /var/log/php7.2-fpm.log, and sometimes (but not always) I see entries like this (note: these entries are from a different event than the one shown above, but the same thing happens):

[22-Mar-2019 15:33:50] WARNING: [pool api] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 231 total children
[22-Mar-2019 15:33:52] WARNING: [pool api] server reached pm.max_children setting (250), consider raising it
[22-Mar-2019 15:34:05] WARNING: [pool app] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 47 idle, and 104 total children

What happened was that we made a deployment to this server:

  • git status --porcelain to check for modifications
  • git pull origin master to update the files
  • resetting the opcache, i.e. we call an endpoint which performs opcache_reset() (see the sketch after this list)
  • clearing local cache files
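
For reference, the endpoint that performs the reset is essentially just a thin wrapper around opcache_reset(). A minimal sketch of what such an endpoint might look like (the filename/route is hypothetical; the real endpoint is not shown here):

<?php
// opcache-reset endpoint (hypothetical name/route) – must be served through php-fpm,
// because a CLI invocation would only reset the CLI's own, separate opcache.

header('Content-Type: application/json');

if (!function_exists('opcache_reset')) {
    http_response_code(500);
    echo json_encode(['ok' => false, 'error' => 'opcache extension not loaded']);
    exit;
}

$ok = opcache_reset(); // triggers a full cache restart; workers re-compile scripts on demand

echo json_encode([
    'ok'     => $ok,
    'status' => opcache_get_status(false), // same call used for the before/after dumps below
]);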

After some experimenting I was able to narrow the load problem down to this single call: opcache_reset()

Once I perform this call (independently of any prior or later deployment steps; it also happens in isolation when I just call this endpoint), there's a chance that the system load suddenly spikes.

If that happens and the load gets "too high" (from experience I would say > 200 or so), the system becomes unresponsive for seconds or minutes, depending on how long it takes for everything to calm down.
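
To observe this, I basically just trigger the endpoint and watch the load average. A rough sketch of that (the endpoint URL is a placeholder, not the real one):

<?php
// reproduce.php – trigger a manual reset and sample the 1-minute load average afterwards.
// The URL below is a placeholder; substitute the real reset endpoint.

$endpoint = 'https://example.com/internal/opcache-reset';
file_get_contents($endpoint); // trigger the reset

// Sample the load for ~5 minutes to see whether a spike develops.
for ($i = 0; $i < 60; $i++) {
    [$load1] = sys_getloadavg();
    printf("%s  load(1m) = %.2f\n", date('H:i:s'), $load1);
    sleep(5);
}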

Specs:

PHP-FPM configuration:

  • We're using 6 pools with different virtual hosts
  • the pm.start_servers values add up to 320 php-fpm processes (also confirmed via ps auxw|grep -i fpm|grep -v grep |wc -l)
  • the total of pm.max_children across all pools is around 870

Maybe these totals are too high; the intention was to cope with the spikes that individual virtual hosts sometimes experience.

Using htop, the system usually looks like this: [screenshot of htop during normal operation]

Usually the load is low, except for these spikes, which (as I only recently found out) are all related to opcache resets: [screenshot of the load graph showing such a spike]

I understand that resetting the cache is CPU-intensive, since afterwards every request has to re-compile the scripts it needs, and with hundreds of busy workers that compilation happens many times in parallel.

But what I don't understand:

  • this only started happening recently, maybe 1-2 months ago, but the unresponsiveness has only been noticeable in the last two weeks
  • it does not always happen; sometimes nothing happens at all when resetting the cache

Here's the output of opcache_get_status(false) right before the deployment:

{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 67353640,
    "free_memory": 66864088,
    "wasted_memory": 0,
    "current_wasted_percentage": 0
  },
  "interned_strings_usage": {
    "buffer_size": 8388608,
    "used_memory": 5215176,
    "free_memory": 3173432,
    "number_of_strings": 89109
  },
  "opcache_statistics": {
    "num_cached_scripts": 2873,
    "num_cached_keys": 5063,
    "max_cached_keys": 7963,
    "hits": 633581523,
    "start_time": 1553172771,
    "last_restart_time": 1553248200,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 6,
    "misses": 9512,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.9984987161316
  }
}

and here afterwards:

{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 57745856,
    "free_memory": 76471872,
    "wasted_memory": 0,
    "current_wasted_percentage": 0
  },
  "interned_strings_usage": {
    "buffer_size": 8388608,
    "used_memory": 4337168,
    "free_memory": 4051440,
    "number_of_strings": 75163
  },
  "opcache_statistics": {
    "num_cached_scripts": 2244,
    "num_cached_keys": 3925,
    "max_cached_keys": 7963,
    "hits": 5893926,
    "start_time": 1553172771,
    "last_restart_time": 1553265235,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 7,
    "misses": 4962,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.91588245106536
  }
}
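
(For completeness: dumps like the two above can be produced with a one-liner along these lines; this is a sketch, not necessarily how I captured mine. It has to run through the web SAPI, since the CLI has its own separate cache.)

<?php
// status-snapshot.php – prints the opcache status as JSON (sketch).
// Request it through php-fpm/nginx; running it via the CLI would show the CLI's own cache instead.
echo json_encode(opcache_get_status(false), JSON_PRETTY_PRINT), "\n";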

Other things I observed:

  • php-fpm stops responding fairly quickly
  • nginx keeps working unless the load gets REALLY high; I confirmed this because when php-fpm is essentially unreachable, nginx delivers the configured 500 page

What really causes these load spikes? How can I avoid them?

Update after accepting the answer:

Basically, simply not calling opcache_reset() and reverting most of my opcache setting customizations to the defaults (i.e. no longer overriding them) fixed it.

This step had been part of my deployment pipeline for literally years. I tried to figure out the original reason for it, and as far as I can tell it was related to deployment problems where classes referenced new code that had not yet been loaded/refreshed.

In hindsight, I'm not even sure that was the actual problem, but here we are.
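
In case anyone wants to double-check which values are actually in effect after reverting, here is a quick sketch (the directive names are the ones from the accepted answer; request it through php-fpm, since the CLI may read a different configuration):

<?php
// settings-check.php – print the effective opcache invalidation settings (sketch).
foreach (['opcache.validate_timestamps', 'opcache.revalidate_freq', 'opcache.revalidate_path'] as $key) {
    printf("%s = %s\n", $key, var_export(ini_get($key), true));
}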

mark
  • Why are you calling `opcache_reset()`? – Michael Hampton Mar 22 '19 at 15:23
  • So that php fpm is ensured to see all the newly deployed / changed files. – mark Mar 22 '19 at 21:15
  • Don't do that. Go back to the default opcache settings. – Michael Hampton Mar 23 '19 at 00:17
  • Oh. I thought about it since the analysis and will evaluate next week. But isn't some kind of invalidating necessary to ensure all files are up to date at the same time? – mark Mar 23 '19 at 06:27
  • By default PHP checks the file timestamp to invalidate the opcache entry. This can be turned off, and that is the only scenario I can think of in which `opcache_reset()` would be used. Of course, it also causes the problem you are having. – Michael Hampton Mar 23 '19 at 13:03
  • I'm not yet finished testing (needs some time for observation/monitoring), but I think this will be the solution. If you don't mind you can formulate this as an actual answer for me to accept, once I'm done confirming everything. – mark Mar 25 '19 at 07:42

1 Answer


By default PHP checks the file timestamp to invalidate the opcache entry. This can be turned off, and that is the only scenario I can think of in which opcache_reset() would be used. Of course, it also causes the problem you are having.

I recommend going back to the defaults:

opcache.validate_timestamps = 1
opcache.revalidate_freq = 2
opcache.revalidate_path = 0
Michael Hampton
  • Thank you very much; although quite obvious in hindsight, not so obvious to understand that it was simply unnecessary :-) – mark Mar 28 '19 at 15:30