For the past few days I have been seeing strange I/O spikes in one virtual machine.
It runs Red Hat Enterprise Linux Server release 6.6 (Santiago), kernel 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux.
It has around 50 GB of memory and 24 CPUs, and runs an Elasticsearch data node.
We detected timeouts in requests going to that Elasticsearch node. After inspecting the VM, all we have found so far is that disk I/O sporadically gets stuck. I ran ioping against one of the disks in the virtual machine:
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=1 time=3.76 ms (warmup)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=2 time=1.17 s
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=3 time=131.7 us
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=4 time=282.8 us
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=5 time=999.4 ms
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=6 time=632.7 ms
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=7 time=2.15 s (slow)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=8 time=400.2 ms
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=9 time=20.0 s (slow)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=10 time=1.10 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=11 time=1.30 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=12 time=2.20 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=13 time=2.61 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=14 time=203.6 us (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=15 time=1.09 ms (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=16 time=319.3 us (fast)
4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=17 time=249.8 us (fast)
As you can see, there was a 20-second spike at one point. The virtual machine runs on a VMware ESXi blade. The datastore is shared with 3 other virtual machines, but none of them shows this kind of latency problem. I ran fsck and tune2fs, and neither reported any problems with the filesystem.
No updates were applied to the virtual machine around the time this started. Any hint on how to debug this problem is appreciated.
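In case it helps anyone looking at longer ioping runs, here is a minimal sketch of how the output above could be parsed to pull out only the slow requests. The function names and the 0.1 s threshold are my own choices for illustration, not anything ioping provides, and the sample is just a few lines copied from the run above.

```python
import re

# Matches the "request=<n> time=<number> <unit>" part of an ioping result line.
TIME_RE = re.compile(r"request=(\d+) time=([\d.]+) (us|ms|s)\b")

UNIT_TO_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_ioping(lines):
    """Return a list of (request_number, latency_in_seconds) tuples."""
    results = []
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            number, value, unit = match.groups()
            results.append((int(number), float(value) * UNIT_TO_SECONDS[unit]))
    return results


def slow_requests(results, threshold_s=0.1):
    """Keep only requests slower than threshold_s (0.1 s is an arbitrary cut-off)."""
    return [(n, t) for n, t in results if t > threshold_s]


# A few lines copied from the ioping run above.
sample = [
    "4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=2 time=1.17 s",
    "4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=3 time=131.7 us",
    "4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=9 time=20.0 s (slow)",
    "4 KiB <<< /dev/sdf1 (block device 100.0 GiB): request=10 time=1.10 ms (fast)",
]

for number, seconds in slow_requests(parse_ioping(sample)):
    print(f"request {number}: {seconds:.2f} s")
```

Feeding it the full output together with timestamps would make it easier to line the spikes up against whatever else is happening on the host or the datastore at that moment.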
Edit: here is the atop -d output. It seems the LV gets 100% busy while java (Elasticsearch) is reading at that moment.
LVM | vg00-lv_data | busy 100% | read 8904 | write 4 | KiB/r 11 | KiB/w 4 | MBr/s 10.03 | MBw/s 0.00 | avq 21.41 | avio 1.12 ms |
PID TID RDDSK WRDSK WCANCL DSK CMD
1/12629 - 100.3M 12K 0K 100% java
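As a sanity check on those atop figures, here is a rough calculation, assuming I am reading the columns correctly (avio as the average service time per request, busy as the share of the interval spent servicing requests). The 10-second interval is an assumption based on atop's default, and the "implied interval" line further assumes java accounts for essentially all reads on the LV.

```python
# Figures copied from the atop output above.
reads = 8904
writes = 4
avio_ms = 1.12          # avio: average time per request, in milliseconds
mb_read_per_s = 10.03   # MBr/s column of the LVM line
java_read_mb = 100.3    # RDDSK column of the java process line

# Assumption: the sample covers atop's default 10-second interval.
interval_s = 10.0

requests = reads + writes
busy_s = requests * avio_ms / 1000.0      # total time spent servicing requests
busy_pct = 100.0 * busy_s / interval_s    # share of the interval the LV was busy
iops = requests / interval_s              # requests per second

# Cross-check of the interval: data read by java divided by the read rate.
implied_interval_s = java_read_mb / mb_read_per_s

print(f"time servicing requests: {busy_s:.2f} s of {interval_s:.0f} s")
print(f"busy: {busy_pct:.1f} %")
print(f"requests per second: {iops:.1f}")
print(f"implied interval: {implied_interval_s:.2f} s")
```

Under these assumptions the numbers are at least consistent with each other: roughly 890 small reads per second (about 11 KiB each) at about 1.1 ms apiece add up to nearly the whole interval, which would explain the 100% busy figure even though the average request itself is not slow.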