We run a Java application distributed across a number of servers, part of which involves the writing and reading of shared files. This side of the application is currently held together by a bunch of rsync cron jobs, so the option of replacing it with an EFS NFS mount that takes care of the high-availability problem was very appealing.
The files are written and served by Tomcat servers and are non-mission-critical, but at times we serve thousands of such requests each minute. Our infrastructure is on-premises, so the EFS is mounted over a VPN. Given the potential for network issues and the non-vital nature of the files, we decided we would much rather fail fast and return an error than risk exhausting Tomcat's thread pool waiting on unavailable IO.
For that, I've been looking at the timeo and retrans options of the mount command, with a view to setting them low enough that any network issue simply causes a bunch of IO errors (which the application is fine to handle) rather than a bunch of hanging threads. I'm aware that AWS recommends not dropping the timeo parameter below 150 (15 seconds), but the following was purely for verifying the behavior of the parameters.
My Testing
I mounted my share with the following command,
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,soft,timeo=50,retrans=1 MOUNT_IP:/ efs-test
before revoking access to the file system using the AWS security group, and tested how long it took for an IO error to occur when attempting to ls the mounted file system. Looking at the mount man page and the AWS documentation, I would expect access to the mount to fail after 10 seconds - a 5 second timeout with a single retry.
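For anyone wanting to reproduce this, a rough sketch of the test (first confirming the options the kernel actually applied, then timing repeated ls attempts against the now-unreachable mount) could look like the following, assuming the efs-test mount point from the command above:

# Check the options the kernel actually applied; they can differ from
# what was passed on the command line.
grep efs-test /proc/mounts
nfsstat -m

# Time repeated ls attempts and note how long each takes to return
# an Input/output error.
for i in $(seq 1 8); do
    time ls efs-test/
done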
Results
The timeouts for the ls command were as follows: 15s, 17s, 15s, 10s, 15s, 20s, 109s, 15s.
ls: cannot access efs-test/: Input/output error
Testing mounting with different timeouts and retry attempts gave even more unpredictable results. I've tried replicating this by mounting an on-premise NFS share (sketched below) and cannot reproduce the unpredictable behavior. We cannot push this into production whilst there's a chance of a two-minute thread hang in the event of a network issue.
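For reference, the on-premise comparison mount would look something like the following; the server name and export path are placeholders, and I'm assuming the same NFSv4.1 soft-mount options as the EFS test above:

# Same soft-mount options against an internal NFS server
# (onprem-nfs:/export is a placeholder for our on-premise share).
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,soft,timeo=50,retrans=1 onprem-nfs:/export nfs-test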
Has anyone else experienced this issue, or can anyone see where I'm going wrong? I don't understand why it only occurs when mounting from AWS, since I would have expected the IO timeout to be enforced on the client side.