
Recently I designed and configured a 4-node cluster for a webapp that does a lot of file handling. The cluster has been broken down into 2 main roles: webserver and storage. Each role is replicated to a second server using DRBD in active/passive mode. The webserver does an NFS mount of the data directory of the storage server, and the latter also runs a webserver to serve files to browser clients.

On the storage servers I've created a GFS2 filesystem to hold the data, sitting on top of DRBD. I chose GFS2 mainly because of its announced performance and also because the volume size has to be quite large.

Since we entered production I've been facing two problems that I think are deeply connected. First of all, the NFS mount on the webservers keeps hanging for a minute or so and then resumes normal operation. By analyzing the logs I found out that NFS stops answering for a while and outputs the following log lines:

Oct 15 18:15:42 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:44 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:46 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:47 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:47 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:47 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:48 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:48 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:51 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:52 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:52 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:55 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:55 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying
Oct 15 18:15:58 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK
Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK

In this case, the hang lasted for 16 seconds but sometimes it takes 1 or 2 minutes to resume normal operations.

My first guess was that this was happening because of heavy load on the NFS mount and that increasing RPCNFSDCOUNT to a higher value would make it stable. I've increased it several times and, after a while, the messages started appearing less often. The value is now 32.
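(A rough way to tell whether 32 threads is enough, assuming the stock in-kernel NFS server, is the th line in /proc/net/rpc/nfsd: the first number is the thread count and the second roughly counts how often all threads were busy at once.)

# Quick check of nfsd thread saturation (in-kernel NFS server assumed).
# If the second number keeps growing, all nfsd threads are regularly busy
# and raising RPCNFSDCOUNT further may still help.
grep ^th /proc/net/rpc/nfsd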

After investigating the issue further, I came across a different hang, even though the NFS messages still appear in the logs. Sometimes the GFS2 filesystem simply hangs, which causes both NFS and the storage webserver to stop serving files. Both stay hung for a while and then resume normal operation. This hang leaves no trace on the client side (and no "NFS ... not responding" messages either), and on the storage side the log system appears to be empty, even though rsyslogd is running.

The nodes are connected through a non-dedicated 10 Gbps connection, but I don't think this is the issue, because the GFS2 hang has been confirmed while connecting directly to the active storage server.

I've been trying to solve this for a while now, and I had tried different NFS configuration options before I found out that the GFS2 filesystem is also hanging.

The NFS mount is exported as such:

/srv/data/ <ip_address>(rw,async,no_root_squash,no_all_squash,fsid=25)

And the NFS client mounts with:

mount -o "async,hard,intr,wsize=8192,rsize=8192" active.storage.vlan:/srv/data /srv/data

After some testing, these were the configurations that yielded the best performance for the cluster.

I am desperate to find a solution for this, as the cluster is already in production and I need to make sure these hangs won't happen in the future, but I don't really know for sure what I should be benchmarking or how. What I can tell is that this happens under heavy load, as I tested the cluster earlier and these problems weren't happening at all.

Please tell me if you need me to provide configuration details of the cluster, and which ones you want me to post.

As a last resort I could migrate the files to a different filesystem, but I need some solid pointers on whether this would solve these problems, as the volume size is extremely large at this point.

The servers are being hosted by a third-party enterprise and I don't have physical access to them.

Best regards.

EDIT 1: The servers are physical servers and their specs are:

  • Webservers:

    • 2 × Intel Xeon E5606 (4 cores each, 2.13 GHz)
    • 24 GB DDR3
    • 2 × 120 GB Intel SSD 320 in RAID 1
  • Storage:

    • Intel i5 3550 3.3GHz
    • 16GB DDR3
    • 12 x 2TB SATA

Initially there was a vRack set up between the servers, but we upgraded one of the storage servers to have more RAM and it wasn't inside the vRack. The servers connect through a shared 10 Gbps connection between them. Please note that it is the same connection that is used for public access. They use a single IP (using IP failover) to connect to each other and to allow for a graceful failover.

NFS therefore runs over a public connection and not over any private network (it was on a private network before the upgrade, when the problem already existed).

The firewall was configured and tested thoroughly, but I disabled it for a while to see if the problem still occurred, and it did. To my knowledge, the hosting provider isn't blocking or limiting the connections between the servers or to the public domain (at least not below a given bandwidth consumption threshold that hasn't been reached yet).

Hope this helps figuring out the problem.

EDIT 2:

Relevant software versions:

CentOS 2.6.32-279.9.1.el6.x86_64  
nfs-utils-1.2.3-26.el6.x86_64  
nfs-utils-lib-1.1.5-4.el6.x86_64  
gfs2-utils-3.0.12.1-32.el6_3.1.x86_64  
kmod-drbd84-8.4.2-1.el6_3.elrepo.x86_64  
drbd84-utils-8.4.2-1.el6.elrepo.x86_64  

DRBD configuration on storage servers:

#/etc/drbd.d/storage.res
resource storage {
    protocol C;

    on <server1 fqdn> {
            device /dev/drbd0;
            disk /dev/vg_storage/LV_replicated;
            address <server1 ip>:7788;
            meta-disk internal;
    }

    on <server2 fqdn> {
            device /dev/drbd0;
            disk /dev/vg_storage/LV_replicated;
            address <server2 ip>:7788;
            meta-disk internal;
    }
}

NFS Configuration in storage servers:

#/etc/sysconfig/nfs
RPCNFSDCOUNT=32
STATD_PORT=10002
STATD_OUTGOING_PORT=10003
MOUNTD_PORT=10004
RQUOTAD_PORT=10005
LOCKD_UDPPORT=30001
LOCKD_TCPPORT=30001

(can there be any conflict in using the same port for both LOCKD_UDPPORT and LOCKD_TCPPORT?)
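(For what it's worth, a quick way to confirm that lockd, statd and mountd actually registered on these fixed ports, and that nothing else grabbed them, is rpcinfo; the grep pattern is just for readability:)

# List the ports registered with the portmapper and pick out the NFS-related
# services; lockd shows up as "nlockmgr", statd as "status", mountd as "mountd".
rpcinfo -p | egrep 'nlockmgr|status|mountd|nfs'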

GFS2 configuration:

# gfs2_tool gettune <mountpoint>
incore_log_blocks = 1024
log_flush_secs = 60
quota_warn_period = 10
quota_quantum = 60
max_readahead = 262144
complain_secs = 10
statfs_slow = 0
quota_simul_sync = 64
statfs_quantum = 30
quota_scale = 1.0000   (1, 1)
new_files_jdata = 0

Storage network environment:

eth0      Link encap:Ethernet  HWaddr <mac address>
          inet addr:<ip address>  Bcast:<bcast address>  Mask:<ip mask>
          inet6 addr: <ip address> Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:957025127 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1473338731 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2630984979622 (2.3 TiB)  TX bytes:1648430431523 (1.4 TiB)

eth0:0    Link encap:Ethernet  HWaddr <mac address>  
          inet addr:<ip failover address>  Bcast:<bcast address>  Mask:<ip mask>
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

The IP addresses are statically assigned with the given network configurations:

DEVICE="eth0"
BOOTPROTO="static"
HWADDR=<mac address>
ONBOOT="yes"
TYPE="Ethernet"
IPADDR=<ip address>
NETMASK=<net mask>

and

DEVICE="eth0:0"
BOOTPROTO="static"
HWADDR=<mac address>
IPADDR=<ip failover>
NETMASK=<net mask>
ONBOOT="yes"
BROADCAST=<bcast address>

Hosts file to allow for a graceful NFS failover in conjunction with NFS option fsid=25 set on both storage servers:

#/etc/hosts
<storage ip failover address> active.storage.vlan
<webserver ip failover address> active.service.vlan

As you can see, packet errors are down to 0. I've also run ping for a long time without any packet loss. The MTU size is the normal 1500. As there is no VLAN at the moment, this is the MTU used to communicate between the servers.

The webservers' network environment is similar.

One thing I forgot to mention is that the storage servers handle ~200 GB of new files each day through the NFS connection, which is a key reason why I think this is some kind of heavy-load problem with either NFS or GFS2.

If you need further configuration details please tell me.

EDIT 3:

Earlier today we had a major filesystem crash on the storage server. I couldn't get the details of the crash right away because the server stopped responding. After the reboot, I noticed the filesystem was extremely slow, and I was not able to serve a single file through either NFS or httpd, perhaps due to cache warming or similar. Nevertheless, I've been monitoring the server closely and the following error came up in dmesg. The source of the problem is clearly GFS2, which is waiting for a lock and ends up starving after a while.

INFO: task nfsd:3029 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000000     0  3029      2 0x00000080
 ffff8803814f79e0 0000000000000046 0000000000000000 ffffffff8109213f
 ffff880434c5e148 ffff880624508d88 ffff8803814f7960 ffffffffa037253f
 ffff8803815c1098 ffff8803814f7fd8 000000000000fb88 ffff8803815c1098
Call Trace:
 [<ffffffff8109213f>] ? wake_up_bit+0x2f/0x40
 [<ffffffffa037253f>] ? gfs2_holder_wake+0x1f/0x30 [gfs2]
 [<ffffffff814ff42e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814ff2cb>] mutex_lock+0x2b/0x50
 [<ffffffffa0379f21>] gfs2_log_reserve+0x51/0x190 [gfs2]
 [<ffffffffa0390da2>] gfs2_trans_begin+0x112/0x1d0 [gfs2]
 [<ffffffffa0369b05>] ? gfs2_dir_check+0x35/0xe0 [gfs2]
 [<ffffffffa0377943>] gfs2_createi+0x1a3/0xaa0 [gfs2]
 [<ffffffff8121aab1>] ? avc_has_perm+0x71/0x90
 [<ffffffffa0383d1e>] gfs2_create+0x7e/0x1a0 [gfs2]
 [<ffffffffa037783f>] ? gfs2_createi+0x9f/0xaa0 [gfs2]
 [<ffffffff81188cf4>] vfs_create+0xb4/0xe0
 [<ffffffffa04217d6>] nfsd_create_v3+0x366/0x4c0 [nfsd]
 [<ffffffffa0429703>] nfsd3_proc_create+0x123/0x1b0 [nfsd]
 [<ffffffffa041a43e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa025a5d4>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff810602a0>] ? default_wake_function+0x0/0x20
 [<ffffffffa025ac10>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa041ab62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa041aaa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81091de6>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81091d50>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

EDIT 4:

I've installed Munin and I have some new data coming out of it. Today there was another hang and Munin shows me the following: the inode table size is as high as 80k just before the hang and then drops suddenly to 10k. Likewise with memory: cached data also drops suddenly, from 7 GB to 500 MB. Load average also spikes during the hang, and utilization of the DRBD device spikes to values around 90%.

Compared to a previous hang, these two indicators behave identically. Can this be due to bad file management on the application side that doesn't release file handles, or perhaps to memory management issues coming from GFS2 or NFS (which I doubt)?
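(For reference, the same counters that Munin graphs can also be read directly on the storage server; slabtop should be available from procps, and the slab cache names may vary, but I'd expect gfs2_inode to dominate here:)

# Allocated vs. free inodes and dentry cache state, as sampled by Munin.
cat /proc/sys/fs/inode-nr
cat /proc/sys/fs/dentry-state
# Largest slab caches; entries such as gfs2_inode show where the inode
# memory is actually going.
slabtop -o | head -n 20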

Thanks for any possible feedback.

EDIT 5:

Inode table usage from Munin:

Memory usage from Munin:

Tiago

4 Answers


I can only provide some general pointers.

First I would get some simple benchmark metrics up and running. At least then you'll know if the changes you're making are for the best.

Munin, Cacti, and Nagios are some good choices.

Are these nodes virtual or physical servers, and what are their specs?

What kind of network connection is there between the nodes?

Is NFS set up over your hosting provider's private network?

You're not limiting packets/ports with firewalls; is your hosting provider doing this?

daxroc
  • I'll install Munin and I will report later if I find some useful info. Regarding your other 'questions' I will update my original post with more info. Thank you for the answer! – Tiago Oct 16 '12 at 14:14
  • It would be worth adding the network configuration, hardware, software, and vlan configuration packet size. Have you done a simple ping test over a large number to see if you have any loss? – daxroc Oct 16 '12 at 14:32

I think you have two problems: a bottleneck causing the issue in the first place and, more importantly, poor failure handling by GFS2. GFS2 should really be slowing the transfer down until it works, but I am not able to assist with that.

You say that the cluster handles ~200 GB of new files per day over NFS. How much data is being read from the cluster?

I would always be nervous having one network connection for the frontend and the backend as it allows the frontend to "directly" break the backend (by overloading the data connection).

If you install iperf on each of the boxes, you can test available network throughput at any given point. This may be a quickfire way of identifying if you have a network bottleneck.
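A minimal sketch of such a test, assuming iperf (version 2) is installed on both ends; the hostname is the one from your mount command:

# On the storage server: start an iperf server.
iperf -s

# On the webserver: run a 30-second test with 4 parallel streams against it.
iperf -c active.storage.vlan -t 30 -P 4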

How heavily is the network utilised? How fast are the disks on the storage server, and what RAID setup are you using? What throughput do you get from it? Assuming it is running *nix and you get a quiet moment to test, you can use hdparm:

$ hdparm -tT /dev/<device>

If you find heavy network utilisation, I would suggest putting GFS on a secondary and dedicated network connection.

Depending on how you have set up RAID across the 12 disks, you can get varying degrees of performance, and this could be the second bottleneck. It also depends on whether you are using hardware or software RAID.

The copious amount of memory you have on the box may be of little use if the data being requested is spread out over more than your total memory, which it sounds like it might be. Besides, memory can only help with reads, and even then mostly if a lot of the reads are for the same files (otherwise they get kicked out of the cache).

When running top/htop, watch iowait. A high value here is an excellent indicator that the CPU is just twiddling its thumbs waiting for something (network, disk, etc.).
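If the sysstat package is installed, iostat breaks the same picture down per device, which helps tell whether it's the DRBD device or the underlying disks that are saturated:

$ iostat -x 1    # extended per-device stats every second; watch %util and await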

In my opinion NFS is less likely to be the culprit. We have fairly extensive experience with NFS, and while it can be tuned and optimised, it tends to work pretty reliably.

I would be inclined to get the GFS2 component stable and then see if the issues with NFS go away.

Finally, OCFS2 may be an option to consider as a replacement for GFS2. When I was looking into distributed filesystems, I did a reasonable amount of research and, although I cannot remember all the reasons, I chose to try OCFS2. Perhaps it had something to do with OCFS2 being used by Oracle for their database backends, which would imply pretty high stability requirements.

Munin is your friend. But far more important is top / htop. vmstat can also give you some key numbers

$ vmstat 1

and you will get an update every second on exactly what the system is spending its time doing.

Good luck!

drone.ah
  • The amount of reading is somewhat similar to the writes because they are interconnected at the app level. To rule out network congestion I've temporarily disabled the DRBD sync and switched to Samba instead of NFS. Having done this I haven't seen any hang happen, but looking at Munin I see a huge drop in the open inodes. This value is now reaching its previous maximum, so I'm monitoring the server to see what happens. Can this be due to file mishandling at the application level, or does it have nothing to do with it? GFS2 was my choice because it is the Red Hat 'official' FS. Thank you for your suggestions! – Tiago Oct 19 '12 at 18:47
  • Forgot to say that `vmstat 1` shows me the cache memory oscillating but gradually increasing over time. The top command shows me a 90-ish% idle time and a 2-ish% iowait. Haven't had the chance to run `hdparm` because I want to wait for a less busy period to do that. Again, thanks for your helpful suggestions. – Tiago Oct 19 '12 at 19:02
  • To see whether its file mishandling from the application side, an easy test could be to check what happens when you restart the application. – drone.ah Oct 20 '12 at 09:07
  • After a reboot I sometimes get some 'zombie' apache workers locked somewhere. But since this is a production environment, I don't really have the time to do a strace to the apache workers otherwise the application would be offline too much time. Perhaps in some planned maintenance I have the chance to check that. – Tiago Oct 21 '12 at 22:36

First, put an HA proxy in front of the web servers, using either Varnish or Nginx.

Then, for the web filesystem: why not use MooseFS instead of NFS/GFS2? It's fault tolerant and fast for reads. What you lose compared to NFS/GFS2 is local locking; do you need that for your application? If not, I would switch to MooseFS and skip the NFS/GFS2 problems. You will need to use ucarp to provide HA for the MFS metadata servers.

In MFS, set the replication goal to 3:

# mfssetgoal 3 /folder

//Christian

Christian
  • Changing the file-system is my last resort to solve the problem I'm facing. After the changes I described in a comment to Shri's post the application got stable, without any hangs happening. The only thing I lack explanation now is why the open inodes keeps growing and after a while, it drops suddenly. Thanks for your cooperation. – Tiago Oct 21 '12 at 22:26

Based on your Munin graphs, the system is dropping caches. This is equivalent to running one of the following:

  1. echo 2 > /proc/sys/vm/drop_caches
     (frees dentries and inodes)
  2. echo 3 > /proc/sys/vm/drop_caches
     (frees page cache, dentries and inodes)

The question becomes why. Is there perhaps a lingering cron task?

Aside from the 01:00 -> 12:00 gap, they appear to occur at a regular interval.

It would also be worth checking, about halfway through a peak, whether running one of the above commands recreates your issue; however, always make sure you run a sync right before doing so.
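A sketch of that test, run as root on the storage server (the sync flushes dirty pages before the caches are dropped):

# Flush dirty data first, then drop dentries and inodes only (drop_caches=2).
sync
echo 2 > /proc/sys/vm/drop_caches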

Failing that, an strace of your DRBD process (again assuming this is the culprit), started around the time of an expected purge and running through to said purge, may shed some light.

Oneiroi
  • Yes, I've verified earlier that `drop_caches` has the same effect on the `inode table size` as shown in the charts. But I want to understand why that happens because when it happens the system hangs for a while. I've checked the `crontab` and it has no "lingering" tasks. Looking at the graphs I also don't believe that it is a scheduled event because the cache drop happens when it reaches a certain threshold and not at specific times. – Tiago Oct 24 '12 at 16:11
  • @Tiago nothing in `/var/spool/cron/*` ? (Eliminating user crons), failing that only thing I can think of is a bespoke process that's monitoring the usage, and is set to `drop_caches` when it reaches a threshold, nothing in `/var/log/messages` or `dmesg`? that could provide a clue? – Oneiroi Oct 25 '12 at 09:46