We recently upgraded our servers to Debian 10. The setup is exactly the same as with Debian 9, apart from some newer software packages, of course.

We are running the default kernel:

$ uname -a 
Linux kt105 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

We use two NVMe SSDs in a software RAID, and after upgrading to Debian 10 we encountered high disk utilization. Please scroll right to see the last column (%util):

# iostat -x 20 5
Linux 4.19.0-6-amd64 (kt105)    01/29/2020  _x86_64_    (48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    0.04    0.00    0.00   99.89

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    5.97      0.03    805.42     0.00     9.37   0.00  61.07    0.30    1.31   1.00    19.16   134.83 166.12  99.26
nvme1n1          3.14    2.91    736.61     70.75     8.41     0.96  72.83  24.80    0.15    0.84   1.00   234.74    24.34 164.20  99.26
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00    22.39     0.00   0.00   0.00
md1              0.00    0.00      0.07      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00    39.29     1.00   0.00   0.00
md2              0.07    2.34      1.89     68.62     0.00     0.00   0.00   0.00    0.00    0.00   0.00    27.00    29.35   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.02    0.00    0.00   99.95

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    1.80      0.00     18.95     0.00     0.55   0.00  23.40    0.00    0.03   0.98     0.00    10.53 541.89  97.54
nvme1n1          0.00    1.80      0.00     18.95     0.00     0.55   0.00  23.40    0.00    0.03   0.98     0.00    10.53 541.89  97.54
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md1              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md2              0.00    1.10      0.00     17.20     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00    15.64   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.02    0.00    0.00   99.95

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    2.85      0.00     11.15     0.00     1.05   0.00  26.92    0.00    0.04   0.98     0.00     3.91 345.61  98.50
nvme1n1          0.00    2.85      0.00     11.15     0.00     1.05   0.00  26.92    0.00    0.04   0.98     0.00     3.91 345.61  98.50
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md1              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md2              0.00    2.20      0.00      8.80     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.02    0.00    0.00   99.94

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00    2.10      0.00     22.50     0.00     0.55   0.00  20.75    0.00    0.02   0.99     0.00    10.71 473.71  99.48
nvme1n1          0.00    2.10      0.00     22.50     0.00     0.55   0.00  20.75    0.00    0.02   0.99     0.00    10.71 473.71  99.48
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md1              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md2              0.00    1.15      0.00     20.40     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00    17.74   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.02    0.00    0.02    0.00    0.00   99.95

Some people have reported this as a kernel bug:

https://kudzia.eu/b/2019/09/iostat-x-1-reporting-100-utilization-of-nearly-idle-nvme-drives/

I am not so sure, because the temperature of our disks is high as well:

/sbin/nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 57 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 48,847,719
data_units_written                  : 5,641,464
host_read_commands                  : 357,945,226
host_write_commands                 : 164,837,853
controller_busy_time                : 320
power_cycles                        : 6
power_on_hours                      : 17,959
unsafe_shutdowns                    : 2
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 57 C
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
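
To see whether the temperature tracks the supposed load or is simply constant, the smart log can be polled in a loop. A simple sketch using the same nvme-cli invocation as above:

# Print a timestamped temperature reading once a minute
while true; do
    date
    /sbin/nvme smart-log /dev/nvme0 | grep -i '^temperature'
    sleep 60
done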

On other servers, configured identically but still running Debian 9, we see almost 0% disk utilization and a temperature of 40 °C.

iotop does not show much disk access:

/sbin/iotop --only -ab -n5 -d5  
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s

   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
Total DISK READ:         0.00 B/s | Total DISK WRITE:        11.94 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       9.35 K/s

   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 23748 be/4 www-data      0.00 B     60.00 K  0.00 %  0.33 % nginx: worker process
   637 be/3 root          0.00 B      0.00 B  0.00 %  0.00 % [jbd2/md2-8]
Total DISK READ:         0.00 B/s | Total DISK WRITE:         7.16 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       9.35 K/s

   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 23748 be/4 www-data      0.00 B     88.00 K  0.00 %  0.16 % nginx: worker process
   637 be/3 root          0.00 B      8.00 K  0.00 %  0.00 % [jbd2/md2-8]
Total DISK READ:         0.00 B/s | Total DISK WRITE:         8.75 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       9.35 K/s

   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 23748 be/4 www-data      0.00 B    132.00 K  0.00 %  0.11 % nginx: worker process
   637 be/3 root          0.00 B      8.00 K  0.00 %  0.00 % [jbd2/md2-8]
Total DISK READ:         0.00 B/s | Total DISK WRITE:        26.26 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       9.35 K/s

   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 23748 be/4 www-data      0.00 B    252.00 K  0.00 %  0.19 % nginx: worker process
   637 be/3 root          0.00 B     12.00 K  0.00 %  0.00 % [jbd2/md2-8]
 23749 be/4 www-data      0.00 B      8.00 K  0.00 %  0.00 % nginx: worker process
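
As a cross-check against iotop, per-process I/O can also be sampled with pidstat from the sysstat package (the same package that provides iostat). pidstat reads /proc/<pid>/io, so it is independent of the block-layer counters behind %util:

# Report per-process disk I/O every 5 seconds, 5 times
pidstat -d 5 5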

By the way, the resync of the md RAID arrays finished days ago:

cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 nvme1n1p3[0] nvme0n1p3[1]
      932839744 blocks super 1.2 [2/2] [UU]
      bitmap: 3/7 pages [12KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[0] nvme0n1p2[1]
      523264 blocks super 1.2 [2/2] [UU]

md0 : active raid1 nvme1n1p1[0] nvme0n1p1[1]
      4194240 blocks super 1.0 [2/2] [UU]

unused devices: <none>
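
To confirm that no background check or resync is still running, the array state can also be queried directly:

# Show state, sync status and member devices of the array
mdadm --detail /dev/md2
# "idle" here means no check/resync/recovery is in progress
cat /sys/block/md2/md/sync_action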

Of course, I have already tried rebooting the machine.

So it looks like something is not working properly. When we used this server in production as an nginx load balancer, we encountered some outages at high peaks that could not be explained: upstream timeouts, DNS errors, and OCSP stapling errors with Let's Encrypt.

All of these problems happen within the same two-minute window and only at peak usage. The problems vanish after about 120 seconds, and after that everything runs fine until the next peak (at something like 2,000-8,000 req/s).

Or would you consider this temperature and disk utilization normal?

Is this a kernel bug? Did I miss something crucial? How can I investigate the real problem further?

Janning

1 Answer

I'm quite convinced that the 100% utilization shown for the underlying NVMe drives used to build the md RAID is a kernel bug; see the related reports below. Temporarily upgrading to kernel 5.2 made the 100% utilization go away. We have also benchmarked md-raid1 on older and newer kernels, including ones that report constant full utilization, and did not notice any significant differences in throughput or IOPS.
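
For anyone who wants to reproduce the comparison, a read-only fio run along these lines can be used to measure the md device across kernels. This is only a sketch; the device name and parameters are illustrative and should be adapted:

# Random 4k reads against the raw md device (reads only, so safe to
# run on a live array; writes would destroy the filesystem)
fio --name=md2-randread --filename=/dev/md2 --readonly \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=30 --time_based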

A temperature of 57 °C is within the acceptable range, at least for the Intel P4510 4 TB drives that we use; those have a warning threshold of 70 °C.

Full disclosure: I'm the author of the blog post you quote.

Related bug reports:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=927184

https://bugs.centos.org/view.php?id=15723

https://github.com/sysstat/sysstat/issues/187

https://github.com/munin-monitoring/munin/issues/1119

https://unix.stackexchange.com/questions/517667/nvme-disk-shows-80-io-utilization-partitions-show-0-io-utilization?noredirect=1&lq=1

https://github.com/netdata/netdata/issues/5744#issuecomment-513873791

pQd