
First off, I apologize if I didn't include enough information to properly troubleshoot this issue. This sort of thing isn't my specialty, so it is a learning process. If there's something I need to provide, please let me know and I'll be happy to do what I can. The images associated with my question are at the bottom of this post.

We are dealing with a clustered environment of four WebLogic 9.2 Java application servers. The cluster utilizes a round-robin load algorithm. Other details include:

  • Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
  • BEA JRockit(R) (build R27.4.0-90_CR352234-91983-1.5.0_12-20071115-1605-linux-x86_64, compiled mode)

Basically, I started looking at the servers' performance because our customers are seeing lots of lag at various times of the day. Our servers should easily handle the loads they are given, so it's not clear what's going on. Using HP Performance Manager, I generated some graphs that indicate that the CPU usage is completely out of whack. It seems that, at any given point, one or more of the servers has a CPU utilization of over 50%. I know this isn't particularly high, but I would say it is a red flag based on the CPU utilization of the other servers in the WebLogic cluster.

Interesting things to note:

  • The high CPU utilization was occurring only on server02 for several weeks. The server crashed (extremely rare; we are not sure if it's related to this) and upon starting it back up, the CPU utilization was normal on all 4 servers.
  • We restarted all 4 managed servers and the application server (on server01) yesterday, on 2/28. As you can see, server03 and server04 picked up the behavior that was seen on server02 before.
  • The CPU utilization is a Java process owned by the application user (appown).
  • The number of transactions is consistent across all servers. It doesn't seem like any one server is actually handling more than another.

If anyone has any ideas or can at least point me in the right direction, that would be great. Again, please let me know if there is any additional information I should post. Thanks!

(Images: HP Performance Manager CPU utilization graphs for server01, server02, server03, and server04.)

4 Answers


Is the load balancing completely round robin, or is it doing stickiness based on IP or cookie? You could have some kind of user traffic that sticks to one server and moves upon restart, especially if another one of your servers is calling an app on the cluster. So cross-check it against actual hits to each server.

You may also have a race condition in the app where certain operations get it into a loop. For that, you could take a thread dump (kill -3 pid), pull it out of your stdout log, and run something like Samurai on it to see what's up.
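
For reference, a minimal way to do that from the shell, assuming the managed server's stdout is captured in a .out log under the domain directory (the paths below are only examples; use your domain's actual log location):

    # Find the managed server's Java PID (the process is owned by appown).
    ps -u appown -o pid,args | grep java

    # Ask the JVM for a thread dump; it goes to the server's stdout,
    # not to your terminal.
    kill -3 <pid>

    # Pull the dump out of whatever file captures stdout for that server,
    # then load it into Samurai.
    less /path/to/domain/servers/server02/logs/server02.out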

I would also turn on garbage collection logging and see if GC times correlate with perceived lag times.
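
A sketch of what that could look like here, assuming JRockit's -Xverbose/-Xverboselog options (verify the flag names against your JRockit R27 docs; a HotSpot JVM would use -verbose:gc and -Xloggc instead):

    # Appended to JAVA_OPTIONS in the domain's bin/setDomainEnv.sh, or to
    # the managed server's start arguments in the console. JRockit syntax:
    JAVA_OPTIONS="${JAVA_OPTIONS} -Xverbose:memory,gcpause -Xverboselog:/tmp/server_gc.log"
    export JAVA_OPTIONS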

Ernest Mueller
  • If I go into the WebLogic console and check the number of transactions, they are all about the same. No one server has an unusually high number of transactions. As far as I know, there is no stickiness based on IP or cookies. Would that be a setting that I could locate in the WebLogic console? I'm intrigued by your idea of a race condition, though. I have thread dumps for all four servers (from when server02 had the issue) and loaded them into Samurai. I definitely see more blocked threads in server02. How does one learn how to interpret these dumps? Is it just an acquired skill? –  Mar 02 '10 at 19:11
  • Stickiness would probably be set either in WebLogic or in a load balancer if you're using one. Whatever it is that's doing your round robin would be where you'd set stickiness. And yeah, reading trace files is basically an acquired skill. If you see more blocked threads, look for where they are, and then if they all seem to be in the same code, go talk to the developer. 80% of problems like this are in the app code, not in the infrastructure. Now, there is a chance they'll be waiting on something like a db pool connection, in which case it is you... – Ernest Mueller Mar 10 '10 at 00:16
  • 2
    Oh, also, often CPU utilizations that stay at "even numbers" like 50% are because the app spins out of control on one CPU. If you have a 2 CPU system, a 50% CPU pattern can mean "100% on one CPU." Check the per CPU stats. I imagine you'll see the one the java proc's running on at 100% and the other one making that lil' turtle shell pattern you have. This definitely affects user response time for the app. – Ernest Mueller Mar 10 '10 at 00:20
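
A quick way to get the per-CPU breakdown described above, assuming the sysstat package (mpstat) and a reasonably recent procps top are installed:

    # Per-CPU utilization every 5 seconds; look for one CPU pinned near
    # 100% while the others sit mostly idle.
    mpstat -P ALL 5

    # Or run top and press "1" to split the CPU summary per processor;
    # pressing "H" shows individual Java threads and their CPU use.
    top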

I'm not an expert on clusters or BEA, but in performance analysis there is more to look at than CPU. What do the memory, disk, and network data look like? The tools to gather that data are top (CPU and memory, with plenty of detail, also per process), vmstat (memory, CPU, disk), and sar (from the sysstat package on Linux, covering all the usual metrics with historical recordings). Also, what operating system and version are these machines running?
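
For example (all standard Linux tools; sar assumes the sysstat collector is running):

    top               # live CPU and memory, per process
    vmstat 5          # memory, swap, run queue and I/O every 5 seconds
    sar -u -P ALL     # CPU history, broken out per processor
    sar -r            # memory and swap history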

twistedbrain
  • I realize that it's not just CPU, but that's the one thing that really jumps out as a problem. The memory usage and network statistics all look normal across the board. –  Mar 02 '10 at 19:15
  • Sorry, but then I don't understand what your problem is. A CPU load of 50 or 60% at peak isn't, as far as I know, a problem in itself. If your users experience lag, then a deeper analysis of all the circumstances and performance factors could be useful to identify the reasons. – twistedbrain Mar 03 '10 at 12:20
  • To clarify: you're saying that it's normal for one server in a cluster of four (all handling an equal number of transactions) to have a CPU running at 50-60% while the other three almost never exceed 10%? I'm no expert at CPU profiling, but I think my concern is legitimate. –  Mar 03 '10 at 15:49
  • Thanks for clarifying; I didn't understand the scenario because it wasn't clear to me that this was a load-sharing cluster rather than just an HA cluster. Anyway, even if that behavior isn't normal, I don't see why users should experience problems when a node's CPU is at 50%, because that's not a very high load. Can you correlate the affected users with that particular node for certain, and do you know why it is more loaded? If not, it might be useful to analyze other performance aspects of that node to understand the reasons. – twistedbrain Mar 03 '10 at 22:21

I would install a Java probe and profile the web application to investigate further where exactly that 50% of CPU is going.
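
Since this is JRockit, one starting point before installing anything is the jrcmd utility that ships with the JDK; the exact diagnostic handlers vary by release, so list them first (a sketch, assuming the handler names below exist in your R27 build):

    # List the diagnostic commands this JRockit release supports.
    jrcmd <pid> help

    # Thread stacks and a heap summary are typically available:
    jrcmd <pid> print_threads
    jrcmd <pid> print_memusage

    # R27 can also produce JRA recordings that break CPU time down by
    # method; the recording handler shows up in the "help" output above.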

fglez

Trigger a thread dump or two on each of the servers. You will likely find one of the servers has a thread running which is not running on the other servers. Also check memory utilization via the console. I have seen WebLogic get into a garbage collection loop when there isn't enough memory.
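
A rough way to compare the dumps across the four servers, assuming each server's stdout log has been copied locally (the file names and the state strings to grep for are only examples; JRockit's dump format differs a bit from HotSpot's, so adjust the patterns to what you actually see):

    # Count blocked vs. waiting threads per server and eyeball the outlier.
    for f in server01.out server02.out server03.out server04.out; do
        echo "== $f =="
        grep -ci "blocked" "$f"
        grep -ci "waiting" "$f"
    done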

BillThor