
We have four LPARs, each running one Java instance. They do a lot of read/write operations against a shared NFS server. When the NFS server goes down abruptly, all the threads that were trying to read an image on each of these four servers get into a hung state. The trace below shows this (the process is a WebSphere Application Server process).

1) While we are working on the issues in the NFS server side, is there a way to avoid this from the code side?

2) If the underlying connection is TCP based (which I assume it is), should the TCP read/connect timeout take care of this? Basically, I want the thread to be returned to the pool instead of waiting indefinitely for the other side to respond.

3) Or is this something that should be taken care of by the NFS client on the source machine? Perhaps some NFS-related config setting on the client side, since FileInputStream.open would not know whether the file it is trying to read is on the local server or on a shared folder on the NFS server.

Thanks in advance for your answers :)

We are using Java 1.6 on WAS 7.0.
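For question (1), one partial code-side workaround is to hand the blocking read to a worker thread and bound the wait with a `Future` timeout. Below is a sketch (Java 1.6 compatible, no lambdas); `readWithTimeout` and the pool name are my own illustrative choices, not an existing API. Caveat: a thread blocked inside the native `FileInputStream.open` cannot be interrupted, so the worker thread may still stay stuck until the NFS client gives up — this frees the WebContainer thread, but can leak worker threads while the server is down.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedFileRead {

    // Daemon threads so stuck readers cannot block JVM shutdown.
    private static final ExecutorService IO_POOL =
            Executors.newCachedThreadPool(new ThreadFactory() {
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r, "nfs-io");
                    t.setDaemon(true);
                    return t;
                }
            });

    /** Read a file on a worker thread; give up after timeoutMs. */
    public static byte[] readWithTimeout(final File file, long timeoutMs)
            throws IOException, TimeoutException {
        Future<byte[]> task = IO_POOL.submit(new Callable<byte[]>() {
            public byte[] call() throws IOException {
                // On a hard NFS mount, this open() is where the hang occurs.
                InputStream in = new FileInputStream(file);
                try {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return out.toByteArray();
                } finally {
                    in.close();
                }
            }
        });
        try {
            return task.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: interrupt does NOT unblock the native open().
            task.cancel(true);
            throw e;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while reading " + file, e);
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        }
    }
}
```

The servlet would then catch `TimeoutException` and return an error response instead of hanging. A smarter variant would also cap the worker pool size, so a dead NFS server cannot accumulate unbounded stuck threads.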

[8/2/15 19:52:41:219 GST] 00000023 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 77" (00003c2b) has been active for 763879 milliseconds and may be hung.  There is/are 110 thread(s) in total in the server that may be hung.
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:113)
        at java.io.FileInputStream.<init>(FileInputStream.java:73)
        at org.emarapay.presentation.common.util.ImageServlet.processRequest(Unknown Source)
        at org.emarapay.presentation.common.util.ImageServlet.doGet(Unknown Source)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:718)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:831)
        at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1694)
        at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1635)
        at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:113)
        at com.ibm.ws.webcontainer.filter.WebAppFilterChain._doFilter(WebAppFilterChain.java:80)
        at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:908)
        at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:965)
        at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:508)
        at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl
womble
    Don't assume that NFS is using TCP unless you explicitly configured it to do so. – Michael Hampton Aug 03 '15 at 07:16
  • Thanks Michael, I need to check with our sysadmin on that. I assumed TCP since it is a file transfer and the source system needs to know for sure that all the data has been transferred, but maybe I am wrong. Is there a setting on the NFS client side to configure a timeout? – Rajarajan Pudupatti Sundari Je Aug 03 '15 at 07:23

1 Answer


This probably depends on how the NFS share is mounted. By default, NFS shares are mounted with the "hard" parameter, meaning that accesses to a non-responding NFS share block indefinitely.

You can change the client-side mount options, adding one of the following parameters (I'm using the Linux man page here; your specific options may differ slightly):

  • soft: if the soft option is specified, the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.
  • intr: selects whether to allow signals to interrupt file operations on this mount point. Using the intr option is preferred to using the soft option because it is significantly less likely to result in data corruption. FYI, this option was deprecated in Linux kernel 2.6.25+.

Source: Linux nfs man page
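For illustration, a soft mount on a Linux client could look like the `/etc/fstab` entry below. The server name, export path, and timeout values are placeholders to tune for your setup (`timeo` is in tenths of a second), and note the man page's warning that soft mounts risk silent data corruption on writes:

```shell
# Illustrative fstab entry: fail I/O after ~10s x 3 retries instead of hanging.
# nfsserver:/export/images and the numbers below are example values only.
nfsserver:/export/images  /mnt/images  nfs  soft,timeo=100,retrans=3,tcp  0 0
```

On the client, `nfsstat -m` (or `mount | grep nfs`) shows the options actually in effect, including whether the transport is TCP or UDP, which also answers the question raised in the comments.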

shodanshok