
While debugging sporadic web application slowdowns, I noticed that all of the processes have quite high maximum scheduling delays.

# perf sched record -- sleep 10 && perf sched latency --sort max
 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  logspout:(7)          |     56.715 ms |     1607 | avg:    9.744 ms | max:  491.840 ms | max at:    270.634788 s
  kubectl:(21)          |    298.298 ms |     3085 | avg:   12.497 ms | max:  474.413 ms | max at:    267.932542 s
  node-health:(6)       |     10.035 ms |      269 | avg:   10.082 ms | max:  301.094 ms | max at:    271.033693 s
  fpm-prometheus-:(45)  |     20.314 ms |      515 | avg:   10.632 ms | max:  295.889 ms | max at:    269.832519 s
  ruby:(2)              |    297.768 ms |      162 | avg:    6.763 ms | max:   87.004 ms | max at:    270.532673 s
  :(3)                  |      0.550 ms |        9 | avg:    9.548 ms | max:   85.826 ms | max at:    271.332980 s
  gitaly-upload-p:(6)   |    142.000 ms |      178 | avg:    2.244 ms | max:   79.394 ms | max at:    271.633055 s
  runc:[2:INIT]:(112)   |     52.745 ms |     3159 | avg:    0.047 ms | max:   70.107 ms | max at:    269.332593 s
  curl:(22)             |     53.576 ms |      144 | avg:    2.325 ms | max:   67.033 ms | max at:    270.132701 s
  php-fpm:(27)          |    807.280 ms |     1925 | avg:    0.259 ms | max:   58.130 ms | max at:    273.032516 s
  [celeryd: celer:(5)   |    513.913 ms |     3569 | avg:    0.089 ms | max:   25.874 ms | max at:    272.733699 s
  ingress-monitor:(5)   |     14.067 ms |      741 | avg:    0.183 ms | max:   19.448 ms | max at:    269.075643 s
  node:(2)              |      9.363 ms |       34 | avg:    0.561 ms | max:    9.014 ms | max at:    267.289700 s
  V8 WorkerThread:(4)   |      3.330 ms |        9 | avg:    1.015 ms | max:    9.007 ms | max at:    267.289690 s
  docker-containe:(398) |    517.011 ms |    17648 | avg:    0.035 ms | max:    6.596 ms | max at:    269.068233 s
  sh:(55)               |    202.641 ms |     1773 | avg:    0.028 ms | max:    5.764 ms | max at:    267.437706 s
  kubelet:(7)           |    717.393 ms |    12067 | avg:    0.036 ms | max:    5.497 ms | max at:    269.068210 s
  dockerd:(60)          |    518.212 ms |     8723 | avg:    0.047 ms | max:    5.273 ms | max at:    273.041616 s
  sshd:(3)              |     25.612 ms |       51 | avg:    0.127 ms | max:    4.655 ms | max at:    270.432617 s
  docker-runc:(198)     |    443.346 ms |    11516 | avg:    0.024 ms | max:    4.262 ms | max at:    270.041672 s
  nginx:(13)            |     14.260 ms |      148 | avg:    0.142 ms | max:    2.483 ms | max at:    269.943108 s
  ksoftirqd/1:16        |     14.604 ms |      686 | avg:    0.032 ms | max:    2.350 ms | max at:    273.459962 s
  runc:[1:CHILD]:(34)   |     17.452 ms |      117 | avg:    0.206 ms | max:    2.257 ms | max at:    267.297665 s

Is it correct that the php-fpm process spent as much as 58.130 ms waiting for a CPU to run on? Does this translate into response latency for the php-fpm server?

I imagine that if an HTTP request arrived at any point during those ~58 ms, the php-fpm process would not handle it until it was given CPU time. Is this correct?
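
One way I've been trying to verify this is by sampling the per-process scheduler statistics directly. A minimal sketch, assuming the kernel exposes /proc/PID/schedstat (CONFIG_SCHED_INFO) and with the PID a placeholder for one php-fpm worker:

PID=1234                                       # placeholder: pick one php-fpm worker
prev=$(awk '{print $2}' /proc/$PID/schedstat)  # field 2: cumulative ns spent waiting on a runqueue
while sleep 1; do
  cur=$(awk '{print $2}' /proc/$PID/schedstat)
  echo "run-queue wait in last second: $(( (cur - prev) / 1000000 )) ms"
  prev=$cur
done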

Overall, both the average and the maximum delays seem very high to me.

This was measured on a 4.20.0-trunk-amd64 kernel (it was comparably bad on 4.18) on a 2-CPU server with load averages of 0.56, 1.34, 1.05. There was no memory pressure and little I/O.

I haven't tuned the scheduler configuration yet; I'm still using the default values:

kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_latency_ns = 12000000
kernel.sched_migration_cost_ns = 500000
kernel.sched_min_granularity_ns = 1500000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 2000000
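
If I understand the CFS tunables correctly, with these defaults CFS targets a scheduling period of 12 ms only while there are at most sched_latency_ns / sched_min_granularity_ns = 8 runnable tasks per CPU; with more, the period grows to nr_running * sched_min_granularity_ns, so waits of tens of milliseconds seem plausible on a deep run queue. A rough back-of-the-envelope sketch (with a hypothetical run-queue depth):

# rough CFS scheduling-period estimate (a sketch; ignores task weights / nice levels)
lat_ns=12000000 min_gran_ns=1500000
nr_running=20                                  # hypothetical run-queue depth
nr_latency=$(( lat_ns / min_gran_ns ))         # 8 with the defaults above
if [ "$nr_running" -gt "$nr_latency" ]; then
  period_ns=$(( nr_running * min_gran_ns ))
else
  period_ns=$lat_ns
fi
echo "target period: $(( period_ns / 1000000 )) ms"   # 30 ms for 20 runnable tasks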

1 Answer


Is it correct that the php-fpm process spent as much as 58.130 ms waiting for a CPU to run on?

It's the time from wake-up to context switch, so I'd say it's the time it took for the task to finally get running (hence "latency").
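
If you want to see the individual wake-up-to-running delays instead of the aggregate, you can replay the same recording with timehist (in perf since around v4.10, so it should be available on your kernel); the "sch delay" column is exactly that latency:

# replay the perf.data written by "perf sched record"
perf sched timehist
# or filter down to the workers in question:
perf sched timehist | grep php-fpm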

Does this translate into response latency for the php-fpm server?

As said above, it's latency. But note the number in parentheses, 27: I take it to be the number of processes grouped under that name, so the max is the worst case across all 27 php-fpm workers. Given that the average delay is a quarter of a millisecond, I'd recommend against tuning the scheduler, per the famous "premature optimisation is the root of all evil" rule.

on a 2 CPU server with a load of 0.56, 1.34, 1.05

How many cores overall are there?
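
(A quick check on the box itself:)

nproc                     # logical CPUs visible to the system
lscpu | grep '^CPU(s):'   # same number, via lscpu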

OK, so now we know it's AWS EC2, IOW running under a hypervisor with some other customers' VMs. Well, that changes a lot, actually. I recommend two things:

  1. Check what the "steal time" is when you see a slowdown, or better yet, monitor it constantly (see the sketch after this list).
  2. Check out the Redis latency write-up.
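
For the monitoring part, vmstat's "st" column or mpstat's %steal will do; here's a raw sketch straight from /proc/stat (field 9 of the aggregate cpu line is cumulative steal time, in USER_HZ ticks, usually 10 ms each):

prev=$(awk '/^cpu /{print $9}' /proc/stat)   # field 9: cumulative steal ticks
while sleep 1; do
  cur=$(awk '/^cpu /{print $9}' /proc/stat)
  echo "steal in last second: $(( cur - prev )) ticks"
  prev=$cur
done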

P.S. I'd readily add "use dedicated/bare-metal servers", but sometimes that's not an option, obviously.
