
For the past few days I have been seeing strange I/O spikes in one virtual machine.

It's running 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux, on Red Hat Enterprise Linux Server release 6.6 (Santiago).

It has around 50 GB of memory and 24 CPUs, and runs an Elasticsearch data node.

We detected timeouts in requests going to that Elasticsearch node, and after inspecting the VM, so far we have only managed to see that disk I/O sporadically gets stuck. I used ioping on one of the disks in the virtual machine.
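The run was roughly of the form shown below (the request count is an assumption; the target device is taken from the output that follows):

ioping -c 17 /dev/sdf1    # 17 single 4 KiB read requests against the block device, at ioping's default one-second interval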

4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=1 time=3.76 ms (warmup)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=2 time=1.17 s
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=3 time=131.7 us
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=4 time=282.8 us
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=5 time=999.4 ms
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=6 time=632.7 ms
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=7 time=2.15 s (slow)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=8 time=400.2 ms
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=9 time=20.0 s (slow)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=10 time=1.10 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=11 time=1.30 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=12 time=2.20 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=13 time=2.61 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=14 time=203.6 us (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=15 time=1.09 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=16 time=319.3 us (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=17 time=249.8 us (fast)

As you can see, there was a 20-second spike at one point. The virtual machine is on a VMware ESXi blade. The datastore is used by 3 more virtual machines, but none of those shows this kind of latency problem. I tried fsck and tune2fs, and both showed no problems on the filesystem.
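To catch a spike as it happens, a continuous latency view from both sides can help; the commands below are only a sketch of that idea (the device name comes from the ioping output above, everything else is generic):

iostat -xd 1              # extended per-device stats every second; watch the await and %util columns for sdf
# on the ESXi host, esxtop's disk device view (press 'u') separates DAVG/cmd (storage latency)
# from KAVG/cmd (hypervisor queueing), which helps tell a datastore problem from a host problem
esxtop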

There were no updates on the virtual machine when this started happening. Any hint on how to debug this problem is appreciated.

Edit: here is the atop -d info. It seems the LV gets 100% busy while java (Elasticsearch) is reading at that moment:

LVM | vg00-lv_data | busy 100% | read 8904 | write 4 | KiB/r 11 | KiB/w 4 | MBr/s 10.03 | MBw/s 0.00 | avq 21.41 | avio 1.12 ms |

  PID  TID   RDDSK  WRDSK  WCANCL   DSK  CMD        1/1
 2629    -  100.3M    12K      0K  100%  java
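To see what that java process (PID 2629 above) is doing to the disk during a spike, per-process views along these lines may help (illustrative commands only; the tools come from the standard sysstat and lsof packages):

pidstat -d -p 2629 1      # per-second read/write rates for the Elasticsearch JVM
lsof -p 2629              # list its open files, to spot which index files are being read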

  • It may not be the answer to your question, but this answer of mine to my own question may be of interest: https://serverfault.com/a/660804/31475 – Halfgaar Jan 30 '20 at 08:02
  • Do you have any error logs with `journalctl -xe` or in /var/log/syslog? How is your inode count looking? You can use the following one-liner to see inode usage by folder: `sudo find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n`. Obviously those are shots in the dark, but you never know – Dexirian Jan 30 '20 at 14:54

1 Answer


In the end, it seems everything was caused by Elasticsearch itself. We excluded the node from the cluster and then added it back in, causing the shards to relocate off the machine and then back onto it. For some strange reason, that fixed the problem.
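For anyone wanting to try the same, the shard relocation can be driven with Elasticsearch's cluster allocation filtering setting, roughly like this (the host, port, and node IP are placeholders; adjust to your cluster and version):

# exclude the node so its shards relocate to the rest of the cluster
curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.exclude._ip": "10.0.0.5" } }'

# once the node is empty, clear the exclusion so shards move back
curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.exclude._ip": "" } }'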