
Setup:
Fedora 8
Apache 2.2.8
Tomcat 5.5.8
Apache is forwarding requests using AJP.
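
(The forwarding configuration itself isn't shown in the question; below is a minimal mod_proxy_ajp sketch of this kind of setup. The root path is an assumption; port 8009 is taken from the error logs further down.)

# Sketch only: Apache forwarding to Tomcat over AJP via mod_proxy_ajp.
# The path is an assumption; port 8009 matches the logs below.
ProxyPass / ajp://localhost:8009/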

Problem:
After a certain period of time (there is no consistent interval; it can be anywhere from an hour or two to one or more days), Tomcat goes down. It either stops responding or serves the generic 'Service Temporarily Unavailable' page.

Diagnosis:
There are two servers with the same setup. One houses a higher-traffic website (several requests per second), the other a low-traffic one (a handful of requests every few minutes). The websites are completely different codebases, but they exhibit similar issues.

On the first server, when the problem occurs, threads are slowly consumed until the limit is reached (MaxThreads 200). At that point the server no longer responds (and eventually serves the service unavailable page).

On the second server, when the problem occurs, requests take a long time, and when they complete all you see is the service unavailable page.

Other than the MaxThreads messages, the Tomcat logs do not indicate any specific issues that could be causing this.

However, in the Apache logs we're seeing sporadic messages referring to AJP. Here's a sample of the messages we see (in no particular order):

[error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[error] (104)Connection reset by peer: ajp_ilink_receive() can't receive header
[error] proxy: AJP: disabled connection for (localhost)
[error] ajp_read_header: ajp_ilink_receive failed
[error] (120006)APR does not understand this error code: proxy: read response failed from 127.0.0.1:8009 (localhost)
[error] ap_proxy_connect_backend disabling worker for (localhost)

The other odd thing we've noticed on the higher-traffic server is that right before the problem starts, database queries take much longer than before (2000-5000 ms versus the normal 5-50 ms). This only lasts for 2-4 seconds before the MaxThreads message comes up. I'm assuming this is a result of the server suddenly dealing with too much data/traffic/threads.

Background Information:
These two servers had been running without a problem for quite some time. During that time, each system was set up with two NICs, separating internal and external traffic. After a network upgrade, we moved these servers to single NICs (this was recommended to us for security/simplicity reasons). After that change, the servers started having these problems.

Resolution:
The obvious solution would be to move back to a two-NIC setup. The problems with that are that it would complicate the network setup, and it feels like ignoring the underlying problem. We'd prefer to get it running on a single-NIC setup.

Googling the various error messages didn't turn up anything useful (results were either outdated or unrelated to our problem).

We've tried adjusting the various timeouts but that just made the server run slightly longer before dying.

We're not sure where to look to diagnose the problem further. We're still grasping at straws as to what the problem could be:

1) The setup with AJP and Tomcat is incorrect or outdated (e.g., known bugs?)
2) The network setup (two NICs versus one NIC) is causing confusion or throughput problems.
3) The websites themselves (there's no common code and no shared frameworks, just basic Java code with servlets and JSPs)

Update 1:
Following David Pashley's helpful advice, I took a thread dump while the issue was occurring. I found that all 200 threads were in one of the following states:

"TP-Processor200" daemon prio=1 tid=0x73a4dbf0 nid=0x70dd waiting for monitor entry [0x6d3ef000..0x6d3efeb0]
at  oracle.jdbc.pool.OracleConnectionCacheImpl.getActiveSize(OracleConnectionCacheImpl.java:988)
- waiting to lock <0x7e3455a0> (a oracle.jdbc.pool.OracleConnectionCacheImpl)
[further stack trace removed for brevity]

"TP-Processor3" daemon prio=1 tid=0x08f142a8 nid=0x652a waiting for monitor entry [0x75c7d000..0x75c7ddb0]
at oracle.jdbc.pool.OracleConnectionCacheImpl.getConnection(OracleConnectionCacheImpl.java:268)
- waiting to lock <0x7e3455a0> (a oracle.jdbc.pool.OracleConnectionCacheImpl)
[further stack trace removed for brevity]

Curiously, only one of the 200 threads was in this state:

"TP-Processor2" daemon prio=1 tid=0x08f135a8 nid=0x6529 runnable [0x75cfe000..0x75cfef30]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Unknown Source)
at oracle.net.ns.DataPacket.receive(Unknown Source)
at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
[further stack trace removed for brevity]

It might be that the Oracle driver in this thread is forcing all other threads to wait for it to complete. For some reason it must be stuck in this reading state (the server never recovers on its own; it requires a restart).

This suggests the problem is related to either the network between the server and the database, or the database itself. We're continuing to diagnose, but any tips would be helpful.

Jordy Boom

7 Answers


It turns out that this version of the Oracle driver (classes12, quite old) had various bugs that caused a deadlock (as seen in the TP-Processor2 state quoted above). The bug didn't surface until we switched to the new environment. Upgrading to the latest version (ojdbc14) has resolved the issue on the primary server.

Jordy Boom
  • This led me to *my* correct solution: I had a lock on a DB row ... and never got any exception in the app server – cljk Jul 28 '14 at 09:15

From the description, I'd suggest the problem may be due to the database queries taking too long. If the queries take longer, requests take longer, and therefore you'll have more of them running at once. As you're seeing, you're running out of Tomcat threads. When you solve the problem with the database you should be okay.

  • Get a stack trace, either using jstack or using kill -3 $process_id, and see what your threads are doing when it dies (see the sketch after this list). If they're all waiting on the database, that's a good pointer to my theory. They might all be waiting on some lock.
  • Install LambdaProbe. It's invaluable for finding out what your Tomcat is doing.
  • Upgrade your Tomcat. 5.5.8 is incredibly old; I think they're now on 5.5.27.
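
Not part of the original answer: a minimal sketch of taking that thread dump, assuming $process_id is the Tomcat JVM's pid, that its stdout is redirected to catalina.out, and that the output path is arbitrary.

# Sketch only: capture a thread dump of the Tomcat JVM.
jstack $process_id > /tmp/tomcat-threads.txt

# Or, if jstack isn't available, send SIGQUIT; the JVM writes the dump to its
# stdout, which Tomcat typically redirects to logs/catalina.out.
kill -3 $process_id
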
David Pashley
  • David, I've updated the question (see Update 1) with new findings based on your thread dump/stack trace suggestion. – Jordy Boom Jun 05 '09 at 21:23
  • I'd suggest that your database connection pool is too small compared to your tomcat max connection value. It seems that most threads are waiting to get a database connection. – David Pashley Jun 06 '09 at 01:03
  • The only reason there are that many threads is because the threads normally in use are left waiting for that one thread attempting to read from the socket. The number of DB connections in use at any time varies between 1 and 3; there's never a need for more than that. – Jordy Boom Jun 08 '09 at 16:35

Add connectionTimeout and keepAliveTimeout to your AJP connector found in /etc/tomcat7/server.xml.

<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" 
           connectionTimeout="10000" keepAliveTimeout="10000" />

Info about the AJP connector at https://tomcat.apache.org/tomcat-7.0-doc/config/ajp.html

  • connectionTimeout = The number of milliseconds this Connector will wait, after accepting a connection, for the request URI line to be presented. The default value for AJP protocol connectors is -1 (i.e. infinite).

  • keepAliveTimeout = The number of milliseconds this Connector will wait for another AJP request before closing the connection. The default value is to use the value that has been set for the connectionTimeout attribute.

If connectionTimeout and keepAliveTimeout are not defined, AJP connections are kept alive indefinitely, leaving too many threads tied up (the default maxThreads is 200).

I recommend installing psi-probe, an advanced manager and monitor for Apache Tomcat forked from Lambda Probe: https://code.google.com/p/psi-probe/

paalfe

Because of the way AJP works, the persistent connections between Apache (using either mod_proxy_ajp or mod_jk) and Tomcat can only be safely closed by the client. In this case, the client is the Apache worker that opens, and then holds, a connection to Tomcat for the life of the worker process.

Because of this behavior, you cannot have more Apache workers than Tomcat worker threads. Doing so will cause additional HTTP workers to fail to connect to Tomcat (as the accept queue is full) and will mark your backend as DOWN!
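
Not part of the original answer: a minimal sketch of keeping the worker counts aligned, assuming the prefork MPM and mod_proxy_ajp; the path and numbers are placeholders, not recommendations.

# Sketch only: keep Apache's worker count at or below Tomcat's maxThreads (200 here).
# Assumes the prefork MPM and mod_proxy_ajp; the path and values are illustrative.
MaxClients 200

# Optionally cap the AJP connection pool per worker as well (see the max flag
# mentioned in the comment below).
ProxyPass /app ajp://localhost:8009/app max=200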

Dave Cheney
  • Sorry for the comment after all these years, but couldn't this be guaranteed by setting the max flag within the ProxyPass configuration to the MaxThreads value of the servlet container? – Horst Gutmann Jan 07 '13 at 16:23

I've had better results with mod_proxy than with mod_ajp in terms of stability, so try that solution. It's non-invasive: at best it will solve the problem, and at worst it will rule out mod_ajp.
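
Not part of the original answer: a minimal sketch of what proxying over plain HTTP instead of AJP could look like, assuming Tomcat's HTTP connector is enabled on port 8080 (the path and port are assumptions).

# Sketch only: forward over HTTP with mod_proxy instead of AJP.
# Assumes Tomcat's HTTP connector listens on 8080; adjust the path to match.
ProxyPass        / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/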

Other than that, it sounds like your Tomcats stop responding and all request threads are tied up. Have your dev team look into what's going on; taking a thread dump and delivering it to them will be useful.

Robert Munteanu
  • I was under the impression that mod_proxy has some scalability issues despite being easier to hook up. It appears that the Apache foundation recommends mod_jk (http://wiki.apache.org/tomcat/FAQ/Connectors#Q2) – Ophidian Jun 04 '09 at 23:16
  • It does not provide sticky sessions, true. But other than that I've never had problems with it. – Robert Munteanu Jun 05 '09 at 09:22

The first thing I think of when I hear that a server runs for a while, suddenly slows down, and then starts having service failures is that it's running out of RAM and thrashing swap. I'm not clear on whether the AJP failures you're seeing could be a consequence of timeouts, but it doesn't seem completely unreasonable; I don't see any obvious way it would connect to the NIC change, though. In any event, I recommend you get a picture of what's going on with your memory usage when these events occur.

If you are running out of RAM, you may need to turn down your Apache MaxClients and increase your ListenBacklog.
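
Not part of the original answer: a minimal sketch of that tuning in httpd.conf, assuming the prefork MPM; the numbers are placeholders, not recommendations.

# Sketch only: fewer concurrent Apache processes, larger pending-connection queue.
# Assumes the prefork MPM; the values are placeholders.
MaxClients    150
ListenBacklog 511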

By the way, thanks for making your question so well-organized and complete.

chaos
  • When I observe 'top' while this is happening, memory usage remains fairly consistent. At least there are no spikes. There's only a brief moment of high CPU usage. – Jordy Boom Jun 05 '09 at 19:03

I had similar log errors in a Red Hat environment with proxy_ajp and Tomcat. It was resolved by updating the httpd package:

yum update httpd

from:

  • httpd-devel-2.2.3-43.el5_5.3.x86_64
  • httpd-2.2.3-43.el5_5.3.x86_64

to:

  • httpd-2.2.3-45.el5_6.3.x86_64
  • httpd-devel-2.2.3-45.el5_6.3.x86_64

Then I restarted Apache, followed by Tomcat.

That fixed it for me!

Bass