Linux I/O bottleneck with data-movers

Question

I have a 24 core machine with 94.6GiB RAM running Ubuntu server 10.04. The box is experiencing high %iowait, unlike another server we have (4 cores) running the same types and amounts of processes. Both machines are connected to a VNX Raid fileserver, the 24-core machine via 4 FC cards, and the other via 2 gigabit ethernet cards. The 4-core machine currently outperforms the 24-core machine, has higher CPU usage and lower %iowait.

In 9 days uptime, %iowait averages at 16%, and is routinely above 30%. Most of the time CPU usage is very low, around 5% (due to the high iowait). There is ample free memory.

One thing I don't understand is why all the data appears to be going through device sdc rather than going through the data movers directly:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.11    0.39    0.75   16.01    0.00   76.74

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00       1232          0
sdb               0.00         0.00         0.00       2960          0
sdc               1.53        43.71        44.54   36726612   37425026
dm-0              0.43        27.69         0.32   23269498     268696
dm-1              1.00         1.86         7.74    1566234    6500432
dm-2              0.96         1.72         5.97    1442482    5014376
dm-3              0.49         9.57         0.18    8040490     153272
dm-4              0.00         0.00         0.00       1794         24
dm-5              0.00         0.00         0.00        296          0

Another piece of the puzzle is that tasks frequently go into uninteruptable sleep mode (in top), also probably due to the io holdup.

What can I look at to help diagnose the problem? Why is all the data going through /dev/sdc? Is that normal?

UPDATE:

The network connection and VNX read/write capacity have been ruled out as bottlenecks. We can reach speeds of 800MB/s with the 4 bonded NICs (round-robin). The fiber channel cards are not yet being used. The VNX is well able to handle the IO (RAID6, 30x2TB 7.2kRPM disks per pool in two pools (60 disks total), about 60% read).

Ignore above about dm and sdc, they are all internal disks, and not part of the problem.

We think the issue might be with the nfs mounts or TCP (we have 5 mounts to 5 partitions on the VNX), but don't know what exactly. Any advice?

One small point: In this context, `dm` stands for device mapper, not data mover. This question would probably do much better at Server Fault. — Michael Hampton, Jul 27 '12 at 22:45
Are you using NFSv4 or NFSv3? Is your iowait on NFS connections only, or do you get it when running dd to test the disk speeds (assuming you have done this)? If your waiting is on NFS and your using V4, try V3. NFSv4 has some pretty random behavior at high loads, and we have recently had to disable it throughout our network. — Erik Aronesty, Nov 30 '12 at 13:10

score 6 · Answer 1 · answered Jul 27 '12 at 21:44

First of all if your CPUs (and damn! That's a lot 24) eat data faster than what can provide the data storage, then you get iowait. That's when the kernel pause a process during a blocking io (a read that comes too slow or a sync write).
So check that the storage can provide enough throughput for 24 cores.

Example, let's assume your storage can provide 500MB/s throughput, that you are connected via 2 Gigabit Ethernet line (bond), the network will already limit the maximum throughput to something around 100-180 MB/s. If your process eat data at the speed of 50 MB/s and that you run 4 threads on your 4 core machine: 4 x 50 MB/s = 200 MB/s consumed. If the network can sustained the 180MB/s then you wil not have much latency and your CPUs will be loaded. The network here is a small bottleneck.
Now if you scale this up to 24 cores and 24 threads, you would need 1200 MB/s, even if you change the wiring to allow such throughput, your storage system does not provide more than 500 MB/s, it becomes a bottleneck.

When it comes to io wait, bottlenecks can be everywhere. Not only on the physical layers, but also in software and kernel space buffers. It really depends on the usage patterns. But as the software bottlenecks are much harder to identify, it usually is preferrable to check the theorical throughput on the hardware before investigating the software stacks.

As said, an iowait occurs when a process make a read and the data takes time to arrive, or when it makes a sync write and the data modification acknowledgment takes its time. During a sync write, the process enter uninterruptible sleep so data don't get corrupted. There is one handy tool to see which call makes a process hang: latencytop. It is not the only one of its kind, but you can give it a try.

Note: for your information, dm stands for device mapper not data movers.

I completely agree (and feel it is less well understood) that keeping a system/solution resource balanced is important. But I also want to point out that IOWait can also be caused by a high rate of randomized IO (be it one process performing lots of seeks or lots of processes demanding their data be seeked). In this case IOWait can be high without IO bandwidth being the problem factor. — Matthew Ife, Jul 28 '12 at 00:16
@MIfe You are fully right about this. I also started mentioning this aspect when I pointed to inspect the software layer. If the pipe is big enough between the hardware storage and the hardware processes, then the problem lies in the software stacks, ranging from TCP buffers (example in kernel space) to random access to data concurrently (example in the user space). And this is much harder to identify. — Huygens, Jul 28 '12 at 17:20

score 5 · Answer 2 · answered Jul 27 '12 at 17:41

First of all, holy inferno that's a lot of iron! :)

Unfortunately since your setup sounds very complex, I don't think anyone's going to be able to provide a straight-away "There's your problem!" answer, unless they've done something with an extremely similar or identical setup and encountered the same problem. So, while this text is labeled by SU as an "Answer", you should probably consider it more like a "Suggestion". And I can't put it in the comments because it's too many words. :S

Without knowledge of how your hardware is mapped to the devices, it's hard to say why the I/O is going one place and not another. How do you have the devices mounted? Are your programs accessing the sd* devices directly, or are all of your filesystems mounted on the dm devices and all file accesses occur through there?

Other things I have to ask about:

What kind of RAID is it? If you're calculating parity bits with RAID5 or RAID6, that is hopefully taken care of by the raid server hardware... if not, the processing servers are doing that.... which is suboptimal and can cause I/O latency if done in software.
You isolated one of the main differences between the two servers in your message. One is using fibre channel and one is using ethernet. The Fibre Channel should be providing better latency and bandwidth, but maybe that's also a problem: if it's providing a lot of throughput, it could be making the RAID server very busy itself... and congestion leads to buffers/caches filling up, which increases latency, which causes higher I/O waits.

It's almost as if you may have a buffer bloat problem with your disk arrays -- you know? Hardware RAID controllers normally have a great deal of on-board cache, don't they? So as I/O to the media gets queued up and the caches get full with dirty pages, eventually the whole thing is saturated (if the mechanical storage can't keep up with the load) and latency sails through the roof... surely you can produce more load with 24 cores + FC than with 4 cores + GbE :) Check the RAID server and see how busy the disks are... a lot of the "I/O" may just be control packets, etc. I'm not sure how FC works but if it's anything like TCP then you're going to see retransmissions if the latencies are too high.

Like if you ask someone a question over the phone and they don't answer for a few seconds, you say "Hello?" -- networking protocols (and FC is just a networking protocol) do the same thing, just in a shorter timescale. But of course that extra "Hello?" is expensive in the context of networking because it adds even more data to an already-congested pipe.

In closing, a general tip:

When debugging latency/IO waits/throughput issues, always measure. Measure everywhere. Measure at the wire, measure what the programs themselves are doing, measure at the processing end, measure on the RAID server, etc. Don't just look at it from one perspective -- try to consider each individual component of the system that is responsible for processing, reading or writing any of the data in the pipeline. Take apart one transaction or one discrete work unit and dissect exactly the path it takes through your hardware, and measure at each distinct component to see if there are bottlenecks or places where there is undue latency, etc. A friend of mine called this "peeling back the onion", and I've used the phrase ever since to refer to the task of debugging a data flow.

score 2 · Answer 3 · edited Apr 13 '17 at 12:13

A small addition. You may want to look at your block-level tuning and I/O schedulers in this case. I'm not as familiar with Ubuntu, but there are a good amount of storage performance knobs to tweak. This definitely applies in the case of SAN storage and databases.

Take a look at the system I/O scheduler. CFQ is default, but noop and deadline are common choices for database workloads.
See this link for some other tuning parameters that may help.
You mention NFS and block storage. If block, which filesystem(s) are in use? The I/O wait sounds like a write-blocking situation from here. Are write-barriers enabled? Remount your filesystems with nobarrier. (Hint for Ubuntu)

Some relevant Server Fault links...

Linux - real-world hardware RAID controller tuning (scsi and cciss)

score 1 · Accepted Answer · answered Aug 21 '12 at 14:53

1

Thanks to all for the ideas and input. The problem was related to a combination of non-optimal ethernet bonding configuration, combined with a defective I/O module on the VNX itself. The I/O rate is now near where we expect it. It is interesting to note that dd file writing and reading tests and iozone benchmarks were not able to detect this, and could read and write nearly as fast as expected.

answered Aug 21 '12 at 14:53

Benjamin

183
8

Did EMC provide support/analysis to help you arrive at that conslusion? – ewwhite Aug 21 '12 at 14:56
Yes. (more characters) – Benjamin Aug 21 '12 at 17:12

score 0 · Answer 5 · answered Aug 02 '12 at 17:43

I will edit with more information soon enough, but first I'd like to say that you shouldn't let iostat's dm-* output confuse you. Device-mapper is an in-kernel passthru device just like md* (md0, md1, etc.) so you really only care about your underlying devices. All data passing to your disks goes through dm/md on the way, and the actual totals (bytes, seconds, etc.) are accurate, but the util is misleading.

Also, that's a very large amount of memory. Funny things start to happen that high (I myself run 2x64s and 2x96s), especially if you have one process taking up more than half of the ram. Read this article for more information. The article mentions mysql but please note that it is not mysql specific. Every software process will incurr penalties for access memory of another physical processor - think 48gb belongs to one proc, 48 to another. The process can only belong to one proc and in order to reach the other procs memory (after it's own 48GB has run out), it must decide to either store some of it's 48 in swap or pay a huge price to get to & from the other proc's memory. The article suggests running a numactl command to force the software to not swap and instead pay the penalty. I have personally see massive improvements from this. In other words - check to see if some of your I/O is going to swap! Use free -m (or similar) for this. If you have plenty of free memory, but some non-trivial amount of swappage (say 10% plus), this may very well be your issue.

Basil · Answer 6 · 2012-08-02T18:47:17.473

0

Looking at this from the storage perspective, do you have a way to measure scsi latency? OS io wait time includes a bunch of things outside the control of the storage, but when I go into my storage box and see IO latency at 2ms, I know that regardless of what the server is getting internally, the scsi commands are being responded to quickly, and I can eliminate storage as a variable.

edited Aug 02 '12 at 18:47

answered Aug 02 '12 at 18:12

Basil

8,811
3
37
73

Linux I/O bottleneck with data-movers

6 Answers6