
I understand IO wait when I see it on a server: it means the CPU is blocked while waiting for the IO to catch up [source].

I am trying to understand why a SAN's stats would show high IO wait - does this indicate that the SAN's CPU is blocked by the SAN disks, or is it something else?

Robert MacLean

4 Answers


A SAN has a much higher IO latency than a local disk due to the fundamental laws of physics. So if your application is doing lots of small writes and fsync() after each, you'll see a lot of iowait.
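To make that pattern concrete, here is a minimal sketch (the file name and counts are made up, not from the answer): each small write is followed by an fsync(), so every iteration blocks on a full round trip to the storage, and the process spends most of its wall-clock time in IO wait.

```python
import os

# Hypothetical illustration: many small writes, each followed by fsync().
# On a SAN, every fsync() waits for the full round trip to the array,
# so most of this loop's time shows up as IO wait.
fd = os.open("/var/tmp/txn.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
try:
    for i in range(10_000):
        os.write(fd, f"transaction {i}\n".encode())
        os.fsync(fd)  # block until this write is durable on the storage
finally:
    os.close(fd)
```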

As a real-world example, here are two MySQL replicas of the same dataset containing many small transactions; you'll see that the slave on the SAN spends a lot more of its time doing IO.

SAN: [graph: replication slave spending most of its time on IO]

Local: [graph: replication slave spending far less time on IO]

Dennis Kaarsemaker
  • Is that Graphite for the graphing? – Tom O'Connor Jan 28 '13 at 14:06
  • Yeah, graphite with [a plugin I wrote to graph mysql replication load](http://www.kaarsemaker.net/blog/2012/09/27/monitoring-replication-load-graphite/). – Dennis Kaarsemaker Jan 28 '13 at 14:10
  • What's the effect of SAN controller cache on this? – ewwhite Jan 28 '13 at 14:35
  • One very unsafe, very bad and very data-unfriendly way to make a program stop using `fsync` is to preload libeatmydata (http://www.flamingspork.com/projects/libeatmydata/). It will possibly make you lose data (hence the name) but can be useful in some cases. – Jens Timmerman Feb 01 '13 at 11:40

SAN wait time could mean that your storage is the bottleneck. It could also be server settings or the connection between your servers and the storage, but much more frequently, when I see wait time for a SAN disk, it's simply a busy SAN.

First, check the performance on the disks backing the volume. You're looking for spikes in IO/s or MB/s reads or writes, and potentially a spike in cache utilization. Try to look only at the hardware involved in the volume you're investigating. Also, look back and forward in time a little to see if there have been higher spikes that didn't cause issues. If so, then the storage hardware is unlikely to have been the problem. Corrective action for hardware bottlenecking on the storage could include migrating this volume to another pool or RAID, or increasing the number of spindles or cache.
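If the array-side counters are awkward to get at, a rough host-side view of IO/s and MB/s for the LUN can at least show whether the spikes line up with the wait time. A sketch, assuming a Linux host with the psutil package installed; the device name is a placeholder for whatever block device backs your SAN volume.

```python
import time
import psutil

DEVICE = "sdb"   # hypothetical device backing the SAN volume
INTERVAL = 5     # seconds between samples

prev = psutil.disk_io_counters(perdisk=True)[DEVICE]
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)[DEVICE]
    iops = (cur.read_count - prev.read_count +
            cur.write_count - prev.write_count) / INTERVAL
    mbps = (cur.read_bytes - prev.read_bytes +
            cur.write_bytes - prev.write_bytes) / INTERVAL / 1e6
    print(f"{DEVICE}: {iops:.0f} IO/s, {mbps:.1f} MB/s")
    prev = cur
```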

Secondly, check the queue depth settings on the server. If you have a very high queue depth, your server will see higher latencies during periods of heavy utilization. Queue depth is a way for the storage to tell the server to throttle its IO so the storage can catch up. 32 is a good average number that is supported by most server OSs and most storage devices I've seen. I've seen higher and lower work as well, but if it's set to 1024 or something, that could explain high wait times. In a situation where the queue depth is very high, the server queues up everything it wants to do, and the storage then works through it no faster than it would have if the queue depth were a lot lower. Since the server measures wait time from when a request enters the queue until it leaves the queue, the wait time goes up.
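On a Linux host with SCSI/FC-attached LUNs you can usually read (and lower) the per-device queue depth through sysfs. A sketch under that assumption; the exact path layout depends on your HBA driver:

```python
from pathlib import Path

# Report the queue depth for each SCSI block device that exposes one.
# Assumes a Linux host with SCSI/FC-attached LUNs.
for dev in sorted(Path("/sys/block").iterdir()):
    qd_file = dev / "device" / "queue_depth"
    if qd_file.exists():
        print(f"{dev.name}: queue_depth={qd_file.read_text().strip()}")

# Lowering it (as root) is just a matter of writing the new value back, e.g.:
#   echo 32 > /sys/block/sdb/device/queue_depth
```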

Lastly, check the error logs for the server. Ensure that there are no transfer-level issues (like disk timeouts or path failures). If there are, you'd want to look into the switch.
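On Linux, one quick way to spot those transfer-level problems is to scan the kernel log for timeout and path messages. A rough sketch; the keyword list is an assumption and should be tuned for your HBA and multipath drivers:

```python
import subprocess

# Scan the kernel ring buffer for storage transport trouble.
# Keywords are an assumption; adjust for your HBA/multipath drivers.
KEYWORDS = ("timeout", "i/o error", "path", "link down", "abort")

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if any(k in line.lower() for k in KEYWORDS):
        print(line)
```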

Basil

It's measured no differently than on a server: there are more IO requests coming in than can be dealt with by the hardware resources available.

EEAA

High IO wait as reported by the SAN management software means the SAN hardware can't keep up with the demands of your SAN clients. That's either because your hardware simply doesn't have the capacity for your load, or because something is failing and under-performing.

A slowly failing drive causing poor performance is actually pretty common, especially in RAID5 setups. Pull the SMART logs for all of your drives and I'll bet you find a drive with a very high number of corrected errors. (Correcting those errors takes time. If an individual error is corrected within a certain amount of time, then the RAID controller does not log an error. But stack up a lot of those errors and that adds up to a lot of time. And that's how you get poor performance.)
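A rough way to pull those SMART counters across all drives is with smartctl from smartmontools. A sketch, assuming Linux, root access, and /dev/sd? device names; the attribute keywords are an assumption, and SAS drives report corrected errors in a different format than SATA drives.

```python
import glob
import subprocess

# Dump SMART attributes for every disk and flag lines that mention
# corrected/reallocated errors. Requires smartmontools and root.
# Keyword list is an assumption; SAS and SATA drives report differently.
KEYWORDS = ("corrected", "reallocated", "pending", "uncorrect")

for dev in sorted(glob.glob("/dev/sd?")):
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    hits = [l for l in out.splitlines()
            if any(k in l.lower() for k in KEYWORDS)]
    if hits:
        print(f"== {dev} ==")
        print("\n".join(hits))
```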

longneck