We run a Java application distributed across a number of servers, part of which involves the writing and reading of shared files. This side of the application is currently held together by a bunch of rsync cron jobs, so the option of replacing it with an EFS NFS mount that takes care of the high-availability problem was very appealing.
The files are written and served by Tomcat servers and are non-mission-critical, but at times we serve thousands of such requests each minute. Our infrastructure is on-premises, so the EFS is mounted over a VPN. Given the potential for network issues and the non-vital nature of the files, we decided we would much rather fail fast and return an error than risk exhausting Tomcat's thread pool waiting on unavailable IO.
For that, I've been looking at the timeo and retrans options of the mount command, with a view to setting them low enough that any network issue simply causes a bunch of IO errors (which the application is fine to handle) rather than a bunch of hanging threads. I'm aware that AWS recommends not dropping the timeo parameter below 150 (15 seconds), but the following was purely for verifying the behavior of the parameters.
My Testing
I mounted my share with the following command,
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,soft,timeo=50,retrans=1 MOUNT_IP:/ efs-test
before revoking access to the file system using the AWS security group, and tested how long it took for an IO error to occur when attempting to ls the mounted file system. Looking at the mount man page and the AWS documentation, I would expect access to the mount to fail after 10 seconds - a 5 second timeout with a single retry.
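For anyone wanting to reproduce this, a rough sketch of the test (first confirming the options the kernel actually applied, then timing repeated ls attempts against the now-unreachable mount) could look like the following, assuming the efs-test mount point from the command above:

# Check the options the kernel actually applied; they can differ from
# what was passed on the command line.
grep efs-test /proc/mounts
nfsstat -m

# Time repeated ls attempts and note how long each takes to return
# an Input/output error.
for i in $(seq 1 8); do
    time ls efs-test/
done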
Results
The timeouts for the ls command were as follows: 15s, 17s, 15s, 10s, 15s, 20s, 109s, 15s.
ls: cannot access efs-test/: Input/output error
Testing mounting with different timeouts and retry attempts gave even more unpredictable results. I've tried replicating this by mounting an on-premise NFS share (sketched below) and cannot reproduce the unpredictable behavior. We cannot push this into production whilst there's a chance of a two-minute thread hang in the event of a network issue.
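For reference, the on-premise comparison mount would look something like the following; the server name and export path are placeholders, and I'm assuming the same NFSv4.1 soft-mount options as the EFS test above:

# Same soft-mount options against an internal NFS server
# (onprem-nfs:/export is a placeholder for our on-premise share).
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,soft,timeo=50,retrans=1 onprem-nfs:/export nfs-test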
Has anyone else experienced this issue, or can anyone see where I'm going wrong? I don't understand why it only occurs when mounting from AWS, since I would have expected the IO timeout to be enforced on the client side.