12

My CPU I/O wait is steady at around 50%, but when I run iostat 1 it shows little to no disk activity.

What can cause I/O wait without IOPS?

NOTE: There are no NFS or FUSE filesystems here, but the box is using Xen virtualization.


Jason Cohen

11 Answers

7

NFS can do this, and it wouldn't surprise me if other network filesystems (and even FUSE-based devices) had similar effects.
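If you want to rule those out quickly, a couple of standard commands will list any such mounts (assuming findmnt is available; older systems can grep the mount output instead):

findmnt -t nfs,nfs4,fuse,fuseblk   # empty output means no NFS/FUSE mounts
mount | grep -Ei 'nfs|fuse'        # fallback on systems without findmnt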

womble
6

Is there any chance other VMs on the server are thrashing the disk?

I know that with virtualisation you can get some strange results if the host node is overloaded.

lbft
  • True but that should be in steal% instead of io% right? Or can it cross over there too? – Jason Cohen Mar 08 '12 at 00:10
  • Steal happens when there's less CPU capacity available than requested by the VMs. If the physical disk is overloaded, your processes are going to spend a lot of time in iowait waiting for their turn at the disk even if they're not hitting the disk much. – lbft Mar 08 '12 at 00:21
  • Yeah, this. See another question with the same answer at http://serverfault.com/a/209031/57468 – mattdm Mar 08 '12 at 00:47
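To see the distinction lbft describes, vmstat prints iowait and steal side by side; a minimal illustration:

vmstat 1 5
# under ----cpu----, the last two columns are the relevant ones:
#   wa - time idle while waiting on outstanding I/O (iowait)
#   st - time stolen by the hypervisor for other guests (steal)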
3

If this is the Amazon EC2 Xen environment using instance-based storage, ask Amazon to check the health of the host containing this image.

If this is a Xen environment in which you can get access to the hypervisor, then check the iowait from outside for the disk image (file, network device, LVM slice, whatever) backing the xvda and xvdb devices. You'll also want to check the hypervisor's I/O system in general, since other disk devices might be monopolizing the system's resources.

iostat -txk 5

is usually a good starting diagnostic tool. It prints 5-second summaries of I/O for ALL devices visible to it, and is therefore useful both inside and outside the VM image.
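As a rough guide to reading that output (column names as printed by sysstat's iostat):

iostat -txk 5
# await - average time (ms) each request spends queued and serviced; high
#         await with low r/s + w/s points at a slow or stalled backend
# %util - fraction of the interval the device was busy; near 100% with few
#         IOPS also suggests a saturated backing store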

2

Check your available file descriptors / inodes. When you hit the limit, processes block in ways that mimic iowait.
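A quick way to compare current usage against the limits (standard Linux procfs paths):

cat /proc/sys/fs/file-nr   # allocated, free, and maximum file handles
df -i                      # per-filesystem inode usage
ulimit -n                  # per-process descriptor limit in this shell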

Edit

I saw you are using Xen; have a look at your current interrupts, and you might find blkif is higher than normal.
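On a Xen guest you can watch those counters directly; interrupt names vary by kernel, but something like:

watch -d -n1 "grep -iE 'blkif|xvd' /proc/interrupts"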

A bit late now, but get Munin installed; it will really help with future debugging.

Sonassi
2
sudo sysctl vm.block_dump=1

Then check dmesg to see what is performing block reads/writes or dirtying inodes.
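The messages land in the kernel log; a rough one-liner to summarise which processes show up most often (the exact message format varies by kernel):

sudo sysctl vm.block_dump=1
sleep 10
sudo dmesg | grep -E 'READ|WRITE|dirtied' | awk '{print $1}' | sort | uniq -c | sort -rn | head
# note: if your dmesg prefixes timestamps, print $2 instead of $1
sudo sysctl vm.block_dump=0   # turn it back off, it is noisy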

Also check the nofile limit in limits.conf; a process could be requesting more files than it is permitted to open.
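To check a specific process against that limit (replace <pid> with the real process id):

grep 'Max open files' /proc/<pid>/limits   # effective nofile limit
ls /proc/<pid>/fd | wc -l                  # descriptors currently open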

neal
1

WARNING: HDPARM IS DANGEROUS, ALWAYS READ ABOUT THE COMMAND YOU ARE GOING TO USE!

If no other virtual machines are stressing the hard disk(s), do

hdparm -f /dev/sdX

on the underlying physical disk(s). Possibly the disk cache isn't working correctly. This flushes the data stored in the cache; afterwards, keep monitoring the I/O to see whether the wait rises again after the flush. If it does, it is likely a cache problem.
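A cautious way to do that, substituting your real device for /dev/sdX:

sudo hdparm -f /dev/sdX   # flush the buffer cache for this device
iostat -txk 5             # then watch whether the wait climbs back up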

vakufo
0

I've seen blocked networking operations (e.g. long calls to an external DB server) drive load average up. I don't know for sure, but I'm guessing network I/O can cause CPU wait to go up? Can anyone confirm?

  • In most modern machines, no. Most, if not all, recent systems have DMA-capable NICs to prevent precisely this sort of situation. – ZaMoose Mar 07 '12 at 23:54
0

Could be loopback devices that are themselves mounted over the network.
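losetup lists each loop device and its backing file, so you can see whether any of them sit on a network mount:

losetup -a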

0

On my machines NFS is the biggest iowait "producer". I have an SSD in my laptop which is fast as hell, so "real" I/O is not the problem. Nevertheless I sometimes have lots of I/O wait due to my mounted NFS shares.

SCP sometimes also seems to lead to I/O wait, but to a far lesser extent.
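If NFS is the suspect, per-mount statistics can confirm it; a sketch, assuming the usual NFS client tools are installed:

nfsstat -c    # client-side NFS call counts
nfsiostat 5   # per-mount throughput and round-trip times, every 5 seconds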

0

This can be anything. It just means that something is waiting for the end of an I/O operation. You can figure out which process it is via ps, then attach gdb to it and check the backtrace to determine which call is hanging (usually some network-related call, or a suddenly disconnected disk). For fd info, check out /proc.
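A sketch of that workflow (<pid> is a placeholder; note that gdb may itself block while the target is stuck in uninterruptible sleep):

ps -eo state,pid,wchan:30,cmd | awk '$1 == "D"'   # D = blocked on I/O
ls -l /proc/<pid>/fd                              # open files and sockets
sudo gdb -p <pid> -batch -ex bt                   # backtrace of the hung call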

eSyr
0

I've also experienced a similar problem right before a disk in a RAID failed and some SATA cables with tight bends in them started failing.

The CPU usage was near 0%, but one or more CPUs on a 4-core system were spending 100% of their time in iowait for extended periods (found via top's multi-line CPU display), with very low IOPS and bandwidth (found via iostat) but bursty, high interrupt activity. Interactive command-line use was painful during any disk access (e.g. an auto-save from someone's Emacs session) but otherwise tolerable once the periods of iowait passed (and presumably the operations succeeded after many retries).
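For what it's worth, in a case like that the kernel log and SMART data usually show the retries; a hedged check (device name is only an example):

dmesg | grep -iE 'ata[0-9]+|reset|failed command'            # link resets and retries
sudo smartctl -a /dev/sda | grep -iE 'error|pending|realloc' # drive-side counters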

mormegil