12

I have a CentOS 5 VMWare server connecting to an OpenSolaris 2009.06 machine over NFS that holds the disk images. My virtual machines seem to be bound by slow IO so I'd like to do everything I can to optimize the connection.

I'm not sure of the best way to measure the throughput on a production system, but some unscientific tests using dd bs=1024k count=400 show local (OpenSolaris) writes of ~1.6GB/s and remote (CentOS) writes ~50MB/s. I imagine these are lower than what I'm actually getting since 7 VMs are currently running over the connection.

Currently, the 2 machines are direct-connected gigE with jumbo frames enabled on both NICs (MTU=9000). Other than that, no optimizations have been made. NFS mount/export is using defaults.

Where should I start turning knobs to improve the performance?

Sysadminicus
  • Throughput shouldn't matter too much. What is the underlying hardware specification on the system running OpenSolaris? How many disks/spindles do you have? How much RAM? – ewwhite Dec 09 '09 at 20:58
  • 12 disks spread across 2 raidz1 pools on one controller with 4GB of RAM. If throughput doesn't matter, what metric should I be looking at? – Sysadminicus Dec 09 '09 at 23:02
  • What does cat /proc/mounts | grep solaris_server say on the Linux client? Different versions of Linux have different default mount options :( – James Dec 16 '09 at 16:35
  • 10.10.1.1:/tank/vm /vm nfs rw,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.10.1.1 0 0 – Sysadminicus Dec 16 '09 at 20:25
  • with *some* editions of Solaris 10, nfs3 was unstable. If you can move to nfs4 you may see some improvements. But, as other commenters have said, seeing 50MB/s across a gigE link is close to the highest you *can* see – warren Dec 24 '09 at 13:29

8 Answers

4

Just to clarify, you're getting 50MB/sec with NFS over a single Gb ethernet connection?

And the host server is running CentOS with VMware Server installed, which is in turn running the 7 VMs? Is there a particular reason you've gone with CentOS and VMware Server combined, rather than VMware ESXi which is a higher performance solution?

The 50MB/sec isn't great, but it's not much below what you'd expect over a single Gb network cable. Once you've put in the NFS tweaks people have mentioned, you're going to be looking at maybe 70-80MB/sec. Options along the lines of:

"ro,hard,intr,retrans=2,rsize=32768,wsize=32768,nfsvers=3,tcp"

are probably reasonable for you at both ends of the system.
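As a rough sketch, the options above could be applied on the CentOS client like this. The server address, export and mount point are taken from the /proc/mounts line the asker posted in the comments; `rw` replaces `ro` on the assumption that VM disk images need to be writable, and the helper name `nfs_mount_cmd` is made up for illustration. It only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) an NFS mount command using the
# options suggested above. "rw" instead of "ro" is an assumption.
nfs_mount_cmd() {
    server=$1
    export_path=$2
    mount_point=$3
    opts="rw,hard,intr,retrans=2,rsize=32768,wsize=32768,nfsvers=3,tcp"
    echo "mount -t nfs -o $opts $server:$export_path $mount_point"
}

# Values from the asker's /proc/mounts output
nfs_mount_cmd 10.10.1.1 /tank/vm /vm
```

After remounting, /proc/mounts on the client shows which rsize/wsize values were actually negotiated.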

To get above that you're going to need to look at teaming the network cards into pairs, which should increase your throughput by about 90%. You might need a switch that supports 802.3ad to get the best performance with link aggregation.
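For a sense of the ceiling, here is a back-of-the-envelope calculation of the best-case TCP payload rate over gigabit Ethernet. It assumes IPv4 and TCP headers without options and counts the standard Ethernet framing overhead. NFS/RPC overhead is ignored, so real numbers will be lower. The function name is illustrative, and the second argument simply multiplies by the number of links; whether a single NFS mount can really spread across both links of a bonded pair depends on the bonding mode.

```shell
#!/bin/sh
# Best-case TCP payload rate in MB/s (10^6 bytes) on 1 Gbit/s links.
# usage: gige_payload_mbps MTU [LINKS]
gige_payload_mbps() {
    awk -v mtu="$1" -v links="${2:-1}" 'BEGIN {
        payload = mtu - 40   # minus IPv4 (20) and TCP (20) headers, no options
        wire    = mtu + 38   # plus Ethernet header 14, FCS 4, preamble 8, inter-frame gap 12
        printf "%.1f\n", links * 125 * payload / wire
    }'
}

gige_payload_mbps 1500      # standard frames, one link
gige_payload_mbps 9000      # jumbo frames, one link
gige_payload_mbps 9000 2    # jumbo frames, two aggregated links (upper bound)
```

With MTU 9000 this works out to roughly 124 MB/s for one link, which is the figure to compare the measured ~50MB/s against.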

One thing I'd point out, though, is that your IO throughput on the OpenSolaris box sounds suspiciously high. 12 disks aren't likely to sustain 1.6GB/sec of throughput, so that figure is probably heavily cached by Solaris + ZFS.

Ewan Leith
  • We are using CentOS + VMWare Server because it is free. Last I checked ESXi was pretty pricey. According to /proc/mounts, the rsize/wsize is currently 1048576. Just to confirm, you think reducing these to 32k will help increase the speed? I'll check out link aggregation. Would I do this on both ends of the connection or only one? I think you are right about the IO being cached. Bumping my dd's up over 512MB significantly drops the transfer rate (ranging between 50-120MB/s). – Sysadminicus Dec 28 '09 at 17:45
  • I no longer have the ability in the UI to accept an answer for this question, but I've upvoted this as it seems like link aggregation is going to be my best bet. – Sysadminicus Dec 31 '09 at 20:15
  • Sorry for the delayed reply, ESXi is now free in its basic form, and will give you a performance boost, but it does have limited functionality so might not be right for you. You'll need to do the link aggregation at both ends of the network link to see much of an improvement. Hope it works for you – Ewan Leith Jan 04 '10 at 14:15
2

For our RHEL/CentOS 5 machines we use the following mount flags

nfsvers=3,tcp,timeo=600,retrans=2,rsize=32768,wsize=32768,hard,intr,noatime

Newer Linux kernel versions support even larger rsize/wsize parameters, but 32k is the maximum for the 2.6.18 kernel in EL5.

On the NFS server(s), at least for Linux, no_wdelay supposedly helps if you have a disk controller with a battery-backed write cache (BBWC). Also, if you use the noatime flag on the clients, it probably makes sense to mount the filesystems on the servers with noatime as well.

And, as was already mentioned, don't bother with UDP. With higher speed networks (1GbE+) there is a small, but non-zero, chance of a sequence number wraparound causing data corruption. Also, if there is a possibility of packet loss, TCP will perform better than UDP.

If you're not worrying about data integrity that much, the "async" export option can be a major performance improvement (the problem with async is that you might lose data if the server crashes).

Also, at least for a Linux server, you need to make sure you have enough NFS server threads running. The default of 8 is just way too low.
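On RHEL/CentOS the thread count is usually set through RPCNFSDCOUNT in /etc/sysconfig/nfs. A minimal sketch of changing it follows; the helper name and the value 64 are arbitrary choices for illustration, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helper: set RPCNFSDCOUNT in a sysconfig-style file,
# whether the line is commented out, already set, or missing.
# Relies on GNU sed for in-place editing (-i).
set_nfsd_count() {
    file=$1
    count=$2
    if grep -q '^#\{0,1\}RPCNFSDCOUNT=' "$file"; then
        sed -i "s/^#\{0,1\}RPCNFSDCOUNT=.*/RPCNFSDCOUNT=$count/" "$file"
    else
        echo "RPCNFSDCOUNT=$count" >> "$file"
    fi
}

# Example (as root): set_nfsd_count /etc/sysconfig/nfs 64
```

The NFS service has to be restarted afterwards for the new count to take effect.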

janneb
2

I once did a test with a Dell R710 (1 CPU, 4 GB RAM, 6 SATA disks in RAID-10). The client was a Sun X2100. Both ran CentOS 5.3 with the NFS parameters as mentioned above

"ro,hard,intr,retrans=2,rsize=32768,wsize=32768,nfsvers=3,tcp"

mounted on both sides with noatime.

I also bumped the number of nfsd threads up to 256 and used the noop scheduler for the PERC 6 RAID controller. Another thing I did was to align the partitions to the 64K stripe size of the RAID controller.

Then I measured the NFS performance with dd. For reads I could fill the gigE pipe, but for writes I could only get slightly better results than you. With async enabled I could get 70 to 80 MB/s, but async was not an option for me.

Maybe you can't get more with nfs from a gigE link?

2

Try this: Disable the ZFS Intent Log (ZIL) temporarily on the OpenSolaris NFS server with the following two steps

  1. echo zil_disable/W0t1 | mdb -kw
  2. re-mount the test partition

Then test again. You can use zilstat to make sure that there really is no more IO to the ZIL. If the test runs faster, you know that the performance problem has something to do with the ZIL. If it still runs slowly, you know that the ZIL isn't the culprit and that using an SSD for the ZIL won't help either. See the ZFS Evil Tuning Guide for more information about the ZIL.
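Spelled out as a sequence, the test might look like the following. The dataset name tank/test is a placeholder, and the re-enable step is simply the same mdb write with the value 0, followed by another remount because the setting is read at mount time. The helper (its name is made up) only prints the steps, so nothing is changed by accident.

```shell
#!/bin/sh
# Hypothetical helper: print the ZIL on/off test sequence for a dataset.
# The dataset name is a placeholder; commands are printed, not executed.
zil_test_plan() {
    dataset=$1
    cat <<EOF
echo zil_disable/W0t1 | mdb -kw
zfs unmount $dataset
zfs mount $dataset
# ... rerun the write test from the NFS client here ...
echo zil_disable/W0t0 | mdb -kw
zfs unmount $dataset
zfs mount $dataset
EOF
}

zil_test_plan tank/test
```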

Another option would be to capture the network traffic (e.g. with Wireshark) and see if there are any problems, for example with the jumbo frames. Verify that the packets on the wire look like what you expect from your configuration. Is there any bad fragmentation going on? Are there retransmits?
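One quick check before reaching for a packet capture is to send a full-size, unfragmentable ping across the link. A sketch, assuming IPv4 and the Linux iputils ping, whose `-M do` option forbids fragmentation; the function name is illustrative and 10.10.1.1 is the server address from the asker's mount output.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

size=$(icmp_payload_for_mtu 9000)
# Print the command to run from the CentOS client
echo "ping -M do -s $size -c 3 10.10.1.1"
```

If jumbo frames work end to end, the replies come back normally; an error or silence at this size, while smaller pings succeed, points at an MTU mismatch somewhere on the path.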

knweiss
1

I came here with the same problem and solved my weak NFS performance a different way, so I want to share some details. Maybe someone else will find them useful too.

This is not meant for a production setup, but rather for a performance/dev setup.

First, kernel parameters that you can tune via sysctl (here written to a drop-in file under /etc/sysctl.d):

cat << _EOFF > /etc/sysctl.d/tengbnet.conf
# allow testing with buffers up to 64MB 
net.core.rmem_max = 67108864 
net.core.wmem_max = 67108864 
# increase Linux autotuning TCP buffer limit to 32MB
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
# recommended default congestion control is htcp 
# for newer kernels try: bbr
net.ipv4.tcp_congestion_control=htcp
# recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing=1
# recommended to enable 'fair queueing'
net.core.default_qdisc = fq
# max packets queued on the input side when the interface receives them faster than the kernel can process them
net.core.netdev_max_backlog = 30000
_EOFF

# apply without restart
sysctl --system

# increasing Worker count to 16 and restart
sed -i 's/#RPCNFSDCOUNT=16/RPCNFSDCOUNT=16/g' /etc/sysconfig/nfs
systemctl restart nfs

# switching default nfs version to v3
sed -i 's/#\ Defaultvers=4/Defaultvers=3/g' /etc/nfsmount.conf

Second, and this one is up to you: in /etc/exports, set the following instead of the default sync option:

async,no_subtree_check

1

FYI, the dd command will write to cache and not to disk, so you can get crazy numbers like 1.6GB/s because you are writing to RAM and not to disk. On Solaris you can use "oflag=sync" (a GNU dd operand, written without a leading dash) to force writes to disk.
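A minimal sketch of a write test that includes the flush in the timing, assuming GNU dd: conv=fsync makes dd call fsync on the output file before it reports the rate. The function name, target path and block count are placeholders.

```shell
#!/bin/sh
# Write COUNT blocks of 1 MiB of zeros to TARGET and flush them to stable
# storage before dd prints its throughput figure. Requires GNU dd.
sync_write_test() {
    target=$1
    count=$2
    dd if=/dev/zero of="$target" bs=1024k count="$count" conv=fsync
}

# Example: sync_write_test /vm/ddtest.bin 1024
```

Writing more data than the server can hold in its cache makes the figure more representative of what the disks can sustain.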

Kyle Hailey
1

Raising the read and write payload sizes can help. Especially in conjunction with jumbo frames.

I tend to find 32k to be optimum.

rsize=32768,wsize=32768

Switching to UDP transport is of course faster than TCP, because it saves the overhead of transmission control. But it's only applicable on reliable networks and where NFSv4 isn't in use.

Dan Carley
  • It looks like CentOS is connecting using NFSv3. Is there value in NFSv4 for our use case? I'd say the network is pretty reliable given there is just a cross-over cable between the two NICs. – Sysadminicus Dec 09 '09 at 17:32
  • 2
    UDP is seriously not worth the hassle. Stick to TCP. I wouldn't suggest trying NFSv4 til you get v3 working properly. – James Dec 16 '09 at 16:34
1

NFS performance on ZFS is greatly improved by using an SSD for the ZFS intent log (ZIL) as this reduces the latency of operations. This thread about VMWare NFS on ZFS performance on the OpenSolaris NFS and ZFS mailing lists has further information, including a benchmark tool to see if ZIL performance is the bottleneck.

TRS-80