
I'm running some benchmarks using Apache Bench (ab) against a Java app running on Tomcat.

Say I run a test like:

ab -c 10 -n 10000 http://localhost:8080/hello/world

It will run just fine. If I follow it with:

ab -c 50 -n 50000 http://localhost:8080/hello/world

Again it will run fine, but if I try again it starts to slow down after maybe 3500 completed requests.

I need help debugging the root cause of this.

I ran top, and I have a few gigabytes of unused memory, so memory doesn't seem to be the issue.

The tomcat6 process does go to 70-80% or even 107% CPU.

Restarting Tomcat seems to solve the issue, but at times a full server reboot is required.

This is a default Tomcat install with 200 threads allocated to it.

Tomcat logs are empty.

Update

So I set both tcp_tw_recycle and tcp_tw_reuse to 1, and netstat now shows a very low count.
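The change amounts to something like this (Linux sysctl key names; note that tcp_tw_recycle is known to break clients behind NAT, so treat this as a test-box tweak, not a production setting):

```shell
# Allow new outbound connections to reuse sockets stuck in TIME_WAIT,
# and enable fast recycling of them (test box only; recycle breaks
# connections from clients behind NAT).
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_tw_recycle=1
```

To make the change survive a reboot, the same keys can go in /etc/sysctl.conf.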

Before changing tcp_tw_recycle/reuse, I noticed things slowing down, ran netstat, and found 32,400 TCP connections in TIME_WAIT.

An update on the benchmarks: with the -k switch I'm seeing MUCH more throughput. But at some point things again start to slow down, though restarting Tomcat now brings things back to normal. Before, response times in ab would stay very slow even after restarting Tomcat; now, after changing tcp_tw_recycle/reuse, a Tomcat restart brings things back to normal. Running top shows Tomcat at only around 20% CPU, so the problem now seems to be within Tomcat itself, but how can I figure out what?

codecompleting

1 Answer


There may be a few things going on here. Your command above translates to 50 concurrent connections, each issuing 1,000 requests. One thing to note is that, if I recall correctly, ApacheBench does not enable keep-alive by default; it may be worth adding it (pass -k to your command above). This makes for a more realistic test anyway, since most user agents use keep-alive by default, as does Tomcat. It should also help with the issue if my theories below are correct.
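Concretely, that's the same load as your second run with keep-alive switched on:

```shell
# Same 50-connection / 50,000-request test, but with -k each connection
# is reused for many HTTP requests instead of being torn down per request.
ab -k -c 50 -n 50000 http://localhost:8080/hello/world
```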

1) I suspect you're slamming that thread pool with too many requests, since each request tears its connection down. That is a big hit on those threads, as well as on the TCP/IP stack of the system. Which leads me to...

2) You may be (OK, you probably are) running out of ephemeral ports and/or hitting TIME_WAIT sockets. If each request is indeed a new, unique connection, you're very likely to end up with thousands of sockets in TIME_WAIT (have a look at netstat -an | grep -ic TIME_WAIT for a count of them during your load). These sockets are ineligible for reuse unless you've enabled TIME_WAIT reuse on your system. The fact that you're testing against localhost only makes this worse.
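If netstat itself gets slow with tens of thousands of sockets open, a cheaper way to get the same count on Linux is to read /proc/net/tcp directly (run this in a loop during the test; field 4 is the TCP state, and 06 is TIME_WAIT):

```shell
# Count sockets currently in TIME_WAIT without shelling out to netstat.
# /proc/net/tcp: first line is a header, field 4 is the hex state (06 = TIME_WAIT).
awk 'NR > 1 && $4 == "06"' /proc/net/tcp | wc -l
```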

For more information on setting up TIME_WAIT reuse, have a look here. Also note that tweaking the fin_wait timeout, as some threads suggest, is incorrect in the context of TIME_WAIT and won't help you, so avoid that.

So have a look at, and potentially tweak, tcp_tw_recycle and tcp_tw_reuse specifically. These will help you get through your tests, as will keep-alive.
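As for figuring out what Tomcat itself is doing once it slows down: the usual next step is a few JVM thread dumps taken while the slowdown is happening, then compared against each other. A sketch (the pgrep pattern assumes a default Tomcat 6 install launched via the standard Bootstrap class; adjust for yours):

```shell
# Locate the Tomcat JVM (Bootstrap is the launcher class on a default install).
TOMCAT_PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)

# Take three thread dumps ten seconds apart during the slowdown.
# Diffing them shows whether worker threads sit BLOCKED on a single lock,
# park in a pool, or wait on socket I/O.
for i in 1 2 3; do
    jstack "$TOMCAT_PID" > "tomcat-threads-$i.txt"
    sleep 10
done

# Quick first look: how many threads are blocked in the first dump?
grep -c 'java.lang.Thread.State: BLOCKED' tomcat-threads-1.txt
```

If many dumps show the same threads blocked on the same monitor, that lock is your bottleneck; if the pool threads are all idle, the problem is upstream of Tomcat.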

mcauth
  • Keep-alive is keeping what alive? And without that flag, what is happening? – codecompleting Dec 19 '11 at 21:48
  • btw, increasing the thread count doesn't affect the number of requests from what I am seeing. i.e. if I run -n 1000, regardless of the # of threads I use, I still see 1K new rows in MySQL. – codecompleting Dec 19 '11 at 21:51
  • keep alive enables HTTP/1.1 keep-alive connections. Refer to section 8.1 in RFC 2616 which discusses the mechanism as well as the benefit to the approach. Suffice it to say it'll allow multiple requests per client down a single set of connections (as few as one per client, in your case). – mcauth Dec 20 '11 at 01:57
  • tcp_tw_recycle/reuse helped bring my open tcp TIME_WAIT connections down from like 30K to maybe 150. But tomcat still slows down, see my updated question. thanks! – codecompleting Dec 20 '11 at 14:23
  • 1
    Sounds like you need to learn a thing or two about HTTP :) FYI, ab is monstrously unsuited for doing dependable real-world benchmarking of applications you care about; switch to something a bit more "realistic", such as [**siege**](http://www.joedog.org/index/siege-home). – adaptr Dec 20 '11 at 14:25
  • @adaptr so this problem I am facing, you're saying in real-world high-traffic scenarios this wouldn't be an issue? It is more of an ab issue? – codecompleting Dec 20 '11 at 15:29
  • I am saying that ab does not yield usable real-world results when used to benchmark a combination of (possibly complex) interacting web traffic. – adaptr Dec 20 '11 at 15:31
  • @codecompleting: Based on the original question it wasn't clear that mySQL is involved here - that adds a whole other set of variables to look at. It may be worth digging into that layer at this point: DB thread pool size, settings, etc. in an effort to see what is going on there. – mcauth Dec 20 '11 at 19:30
  • @mcauth we can ignore mysql b/c even when I hit a web page that makes no mysql connections, this issue is still present. – codecompleting Dec 20 '11 at 20:37
  • Agree with adaptr but would recommend JMeter. Run a remote testing configuration (http://jmeter.apache.org/usermanual/remote-test.html) with at least two clients. – HTTP500 Dec 20 '11 at 22:26