I have a server which exports home directories over NFS. They are on software RAID1 (/dev/sdb and /dev/sdc) and the OS is on /dev/sda. I noticed that my %iowait, as reported by top and sar, is relatively high compared to the rest of the servers. The values range between 5-10%, while on the other servers (which are more loaded than this one) the same metric is 0-1%. The so-called user experience drops when %iowait reaches values above 12%; then we experience latency.

I don't have any drive errors in the logs, and I would like to avoid playing with the drives by trial and error.

How can I find out which device (/dev/sda, /dev/sdb or /dev/sdc) is the bottleneck?

Thanks!
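For completeness, here is roughly how I have been sampling the individual disks so far (a minimal sketch, assuming the sysstat package, which provides iostat; the 5-second interval is an arbitrary choice on my part):

# extended (-x) per-device (-d) statistics, one report every 5 seconds
iostat -dx 5 sda sdb sdc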
Edit: I use Ubuntu 9.10 and already have iostat installed. I am not interested in NFS-related issues, but rather in how to find out which device slows down the system. NFS is not loaded; I have 32 threads available, and the result of

grep th /proc/net/rpc/nfsd

is:

th 32 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
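(As I read the nfsd proc interface, the first figure on that line is the thread count, the second is the number of times all threads were busy at once, and the ten decimals are a histogram of how busy the thread pool has been; all zeros means the threads are essentially idle. A crude sanity check along those lines:)

# warn if all nfsd threads were ever busy simultaneously
# (field meanings here are my reading of the nfsd proc interface)
awk '/^th/ { if ($3 > 0) print "nfsd thread pool exhausted " $3 " times"; else print "nfsd threads OK" }' /proc/net/rpc/nfsd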
Edit2: Here is part of the iostat -x 1 output (I hope I'm not violating some rules here):
avg-cpu: %user %nice %system %iowait %steal %idle
45.21 0.00 0.12 4.09 0.00 50.58
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 21.00 0.00 368.00 0.00 17.52 0.17 8.10 6.67 14.00
sdb 0.00 6.00 0.00 6.00 0.00 96.00 16.00 0.00 0.00 0.00 0.00
sdc 0.00 6.00 0.00 6.00 0.00 96.00 16.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 21.00 0.00 368.00 0.00 17.52 0.17 8.10 6.67 14.00
dm-2 0.00 0.00 0.00 12.00 0.00 96.00 8.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 0.00 12.00 0.00 96.00 8.00 5.23 99.17 65.83 79.00
avg-cpu: %user %nice %system %iowait %steal %idle
45.53 0.00 0.24 6.56 0.00 47.68
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.00 23.00 2.00 424.00 24.00 17.92 0.23 9.20 8.80 22.00
sdb 0.00 32.00 0.00 10.00 0.00 336.00 33.60 0.01 1.00 1.00 1.00
sdc 0.00 32.00 0.00 10.00 0.00 336.00 33.60 0.01 1.00 1.00 1.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 23.00 0.00 424.00 0.00 18.43 0.20 8.70 8.70 20.00
dm-2 0.00 0.00 0.00 44.00 0.00 352.00 8.00 0.30 6.82 0.45 2.00
drbd2 0.00 0.00 0.00 44.00 0.00 352.00 8.00 12.72 80.68 22.73 100.00
avg-cpu: %user %nice %system %iowait %steal %idle
44.11 0.00 1.19 10.46 0.00 44.23
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 637.00 19.00 16.00 432.00 5208.00 161.14 0.34 9.71 6.29 22.00
sdb 0.00 31.00 0.00 13.00 0.00 352.00 27.08 0.00 0.00 0.00 0.00
sdc 0.00 31.00 0.00 13.00 0.00 352.00 27.08 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 20.00 651.00 456.00 5208.00 8.44 13.14 19.58 0.33 22.00
dm-2 0.00 0.00 0.00 42.00 0.00 336.00 8.00 0.01 0.24 0.24 1.00
drbd2 0.00 0.00 0.00 42.00 0.00 336.00 8.00 4.73 73.57 18.57 78.00
avg-cpu: %user %nice %system %iowait %steal %idle
46.80 0.00 0.12 1.81 0.00 51.27
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 16.00 0.00 240.00 0.00 15.00 0.14 8.75 8.12 13.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
What are the most relevant columns to look into? What values are considered unhealthy? I suppose await and %util are the ones I am looking for. In my opinion dm-1 is the bottleneck (this is the DRBD resource's metadata device).

Double thanks!
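In case it is useful, here is a quick filter I put together to flag suspicious devices in the iostat -x output. The 20 ms await and 80% util cut-offs are my own arbitrary guesses, not established thresholds, and the field positions assume this sysstat version's 12-column layout shown above:

# flag devices whose await (field 10) or %util (field 12) look high
iostat -x 5 | awk '$1 ~ /^(sd|dm-|drbd)/ && ($10 > 20 || $12 > 80) {
    printf "%s: await=%s ms, util=%s%%\n", $1, $10, $12
}'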
Edit3: Here is my setup:

sda = OS, no RAID. The devices dm-0 and dm-1 are on it; the latter is the metadata device for the DRBD resource (see below). Both dm-0 and dm-1 are LVM volumes.

drbd2 = dm-2 = sdb + sdc -> this is the RAID1 device, which serves the user home directories over NFS. I don't think this one is the bottleneck. There is no LVM volume here.
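For completeness, this is roughly how I matched the dm-N names to the LVM and DRBD devices (standard tools, nothing exotic):

# dm names with their major:minor numbers; the minor matches /dev/dm-N
dmsetup ls
ls -l /dev/mapper/
# which physical devices back each logical volume
lvs -o +devices
# DRBD resource state and sync status
cat /proc/drbd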