Setup:
Fedora 8
Apache 2.2.8
Tomcat 5.5.8
Apache forwards requests to Tomcat using AJP.
Problem:
After a period of time (there is no consistent interval; it can be an hour or two, or one or more days) Tomcat goes down. Either it stops responding entirely, or it serves the generic 'Service Temporarily Unavailable' page.
Diagnosis:
There are two servers with the same setup. One hosts a higher-traffic website (several requests per second), the other a low-traffic one (a handful of requests every few minutes). The two sites are completely different codebases, yet they exhibit similar issues.
On the first server, when the problem occurs, all threads are slowly consumed until the limit (MaxThreads 200) is reached. At that point the server no longer responds (and eventually serves the service unavailable page).
On the second server, when the problem occurs, requests take a long time, and once they complete all you see is the service unavailable page.
Aside from noting that MaxThreads has been reached, the Tomcat logs do not point to any specific issue that could be causing this.
In the Apache logs, however, we see sporadic AJP-related messages. Here is a sample (in no particular order):
[error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[error] (104)Connection reset by peer: ajp_ilink_receive() can't receive header
[error] proxy: AJP: disabled connection for (localhost)
[error] ajp_read_header: ajp_ilink_receive failed
[error] (120006)APR does not understand this error code: proxy: read response failed from 127.0.0.1:8009 (localhost)
[error] ap_proxy_connect_backend disabling worker for (localhost)
The other odd thing we've noticed on the higher-traffic server is that right before the problem starts, database queries take much longer than usual (2,000-5,000 ms versus the normal 5-50 ms). This lasts only 2-4 seconds before the MaxThreads message appears. I'm assuming this is a result of the server suddenly dealing with too much data/traffic/threads.
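To check whether the query slowdown really precedes the thread pile-up (rather than being a side effect of it), one option is to time query execution inside the webapp and log anything above a threshold with a timestamp that can be lined up against the Apache AJP errors. This is only a diagnostic sketch; the class name, label parameter, and threshold are illustrative and not part of our codebase.

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Date;

/**
 * Diagnostic sketch: time a query and log it when it runs long, so the
 * slow-query window can be correlated with the AJP errors in the Apache logs.
 */
public class TimedQuery {
    private static final long SLOW_THRESHOLD_MS = 500; // illustrative threshold

    public static ResultSet executeTimed(PreparedStatement ps, String label)
            throws SQLException {
        long start = System.currentTimeMillis();
        try {
            return ps.executeQuery();
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > SLOW_THRESHOLD_MS) {
                // Timestamped so it can be matched against the Apache error log.
                System.err.println(new Date() + " SLOW QUERY [" + label + "]: "
                        + elapsed + " ms");
            }
        }
    }
}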
Background Information:
These two servers had been running without a problem for quite some time. During that period, each system was set up with two NICs, separating internal and external traffic. After a network upgrade, we moved the servers to single NICs (this was recommended to us for security/simplicity reasons). The problems started after that change.
Resolution:
The obvious solution would be to move back to a two-NIC setup. The problems with that are that it would complicate the network configuration, and it feels like ignoring the problem rather than fixing it; we'd prefer to get things running on a single NIC.
Googling the various error messages didn't turn up anything useful (only outdated solutions or answers unrelated to our problem).
We've tried adjusting the various timeouts, but that only made the server run slightly longer before dying.
We're not sure where to look to diagnose this further. We're still grasping at straws as to what the cause could be:
1) The AJP/Tomcat setup is incorrect or outdated (i.e. known bugs?)
2) The network setup (two NICs versus one NIC) is causing confusion or throughput problems.
3) The websites themselves (there's no common code and no shared frameworks or platforms; both are plain Java with servlets and JSP)
Update 1:
Following David Pashley's helpful advice, I took a thread dump while the issue was occurring. What I found was that all 200 threads were in one of the following states:
"TP-Processor200" daemon prio=1 tid=0x73a4dbf0 nid=0x70dd waiting for monitor entry [0x6d3ef000..0x6d3efeb0]
at oracle.jdbc.pool.OracleConnectionCacheImpl.getActiveSize(OracleConnectionCacheImpl.java:988)
- waiting to lock <0x7e3455a0> (a oracle.jdbc.pool.OracleConnectionCacheImpl)
[further stack trace removed for brevity]
"TP-Processor3" daemon prio=1 tid=0x08f142a8 nid=0x652a waiting for monitor entry [0x75c7d000..0x75c7ddb0]
at oracle.jdbc.pool.OracleConnectionCacheImpl.getConnection(OracleConnectionCacheImpl.java:268)
- waiting to lock <0x7e3455a0> (a oracle.jdbc.pool.OracleConnectionCacheImpl)
[further stack trace removed for brevity]
Curiously, only one of the 200 threads was in this state:
"TP-Processor2" daemon prio=1 tid=0x08f135a8 nid=0x6529 runnable [0x75cfe000..0x75cfef30]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Unknown Source)
at oracle.net.ns.DataPacket.receive(Unknown Source)
at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
[further stack trace removed for brevity]
It might be that the Oracle driver in this one thread is holding the connection-cache lock (<0x7e3455a0>) that all the other threads are waiting on, forcing them to wait for it to complete. For some reason it is stuck in this socket read, and the server never recovers on its own; it requires a restart.
This suggests the problem lies either in the network between the application server and the database, or in the database itself. We're continuing our diagnosis, but any tips would be helpful.
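If the single runnable thread stuck in socketRead0 turns out to be the trigger, one mitigation worth trying while the network/database cause is tracked down is to give the JDBC connections a read timeout, so a dead connection throws an exception instead of blocking forever while the other 199 threads queue up behind the connection cache. A minimal sketch, assuming the Oracle thin driver and its oracle.jdbc.ReadTimeout / oracle.net.CONNECT_TIMEOUT connection properties (both in milliseconds); the class name, URL, and credentials are placeholders, and the property names should be verified against the driver version in use.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

/**
 * Sketch: open an Oracle thin-driver connection with socket timeouts so a
 * hung read eventually fails instead of blocking its thread indefinitely.
 */
public class OracleConnectionWithTimeouts {
    static {
        try {
            // Pre-JDBC-4 drivers must be registered explicitly.
            Class.forName("oracle.jdbc.OracleDriver");
        } catch (ClassNotFoundException e) {
            throw new RuntimeException("Oracle JDBC driver not on classpath", e);
        }
    }

    public static Connection open() throws SQLException {
        Properties props = new Properties();
        props.setProperty("user", "appuser");      // placeholder
        props.setProperty("password", "secret");   // placeholder
        // Give up on a socket read after 30 s instead of waiting forever.
        props.setProperty("oracle.jdbc.ReadTimeout", "30000");
        // Give up on establishing the TCP connection after 5 s.
        props.setProperty("oracle.net.CONNECT_TIMEOUT", "5000");
        return DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", props); // placeholder URL
    }
}

A read timeout won't tell us whether the database or the network is at fault, but the resulting SQLException should at least show up in the application logs with a timestamp we can compare against the Apache errors, and it should stop one wedged connection from exhausting all 200 threads.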