
Hello superior server gurus!

I'm running an Ubuntu server that hosts an Apache Tomcat service along with a MySQL database. The server load is always close to zero, even during the busiest hours of the week. Despite that, I am experiencing random hangups 1-2 times per week, where the entire server stops responding.

An interesting effect of this hangup is that all cron jobs seem to be executed later than scheduled, at least that is what the timestamps in various system logs indicate. It therefore appears to me that it is indeed the entire server that freezes, not only the custom software running as part of the Tomcat service. The hangup normally lasts for about 3-5 minutes, and afterwards everything jumps back to normal.

Hardware:
Model: Dell PowerEdge R720, 16 cores, 16 GB RAM
HDD configuration: RAID-1 (mirror)

Main services:
Apache Tomcat, MySQL, SSH/SFTP

#uname -a
Linux es2 2.6.24-24-server #1 SMP Tue Jul 7 19:39:36 UTC 2009 x86_64 GNU/Linux

Running sysstat I can see huge peaks in both average load and disk block waits that correspond in time exactly to when customers have reported problems with the backend system. Following is a plot of the disk usage from sar, with a very obvious peak around 12:30 pm.

My sincere apologies for putting this on an external server, but my rep is too low to include files here directly. I also had to combine them into one image since I can only post one link :S

Sar plots: http://213.115.101.5/abba/tmpdata/sardata_es.jpg

Graph 1: Block wait, notice how util% goes up to 100% at approximately 12:58.

Graph 2: Block transfer, nothing unusual here.

Graph 3: Average load, peaks together with graph 1.

Graph 4: CPU usage, still close to 0%.

Graph 5: Memory, nothing unusual here.
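
For reference, the data behind the graphs comes from the plain sar reporting commands, roughly along these lines (assuming the default /var/log/sysstat data files, where DD is the day of month):

sar -d -f /var/log/sysstat/saDD    # block device waits and %util (graphs 1-2)
sar -q -f /var/log/sysstat/saDD    # run queue / load average (graph 3)
sar -u -f /var/log/sysstat/saDD    # CPU usage (graph 4)
sar -r -f /var/log/sysstat/saDD    # memory (graph 5)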

Does anyone have any clue as to what could cause this behaviour on a system? As I explained earlier, the only software running on the server is a Tomcat server with a SOAP interface that allows users to connect to the database. Remote applications also connect to the server via SSH to pull and upload files. At busy times I'm guessing that we have about 50 concurrent SSH/SFTP connections and no more than 100-200 connections over HTTP (SOAP/Tomcat).

Googling around I found discussions about file handles and inode handles, but I think these values are normal for 2.6.x kernels. Does anyone disagree?

cat /proc/sys/fs/file-nr
1152    0       1588671
cat /proc/sys/fs/inode-state
11392   236     0       0       0       0       0
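
For reference, as far as I understand the three fields of file-nr are allocated handles, allocated-but-unused handles, and the system-wide maximum, so a quick sanity check during a hangup would be something like this (just a sketch):

awk '{printf "file handles in use: %d of max %d\n", $1 - $2, $3}' /proc/sys/fs/file-nr

With roughly 1,150 handles in use out of a maximum of about 1.5 million, I doubt we are anywhere near that limit.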

At the same time, "sar -v" shows the following values for the period of the hangup above. Note that inode-nr here is ALWAYS much higher than the value reported by /proc/sys/fs/inode-state.

12:40:01    dentunusd   file-nr  inode-nr    pty-nr
12:40:01        40542      1024     15316         0
12:45:01        40568      1152     15349         0
12:50:01        40587       768     15365         0
12:55:01        40631      1024     15422         0
13:01:02        40648       896     15482         0
13:05:01        40595       768     15430         0
13:10:01        40637      1024     15465         0

I have seen this on two independent servers running the same setup of hardware, OS, software, RAID configuration etc. Thus I want to believe that it is more software/configuration dependent than hardware dependent.

Big thanks for your time
/Ebbe

Avada Kedavra

2 Answers


The problems were related to an incompatibility issue between Ubuntu 8.04 LTS (Hardy) and the Dell PERC 6/i RAID controller, as reported in this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/607167 Upgrading to Ubuntu 10.04 LTS (Lucid, kernel 2.6.32) resolves the issue.

In case anyone else runs into the same issues.
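
If you want to check whether you are on the affected combination before upgrading, something along these lines should do (standard tools only, nothing specific to this bug):

lspci | grep -i raid    # should list the PERC 6/i controller
uname -r                # affected on the Hardy 2.6.24 kernels
lsb_release -d          # confirm the Ubuntu release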

Avada Kedavra

Maybe you are running some heavy query that is doing a full table scan. Have you checked your slow query log?

If that's the case, just add proper indexes.

PS: Sorry if you have done this already.
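
If you have not used the slow query log before, enabling it is just a couple of lines in my.cnf; a rough sketch, assuming MySQL 5.0/5.1 (later versions use slow_query_log / slow_query_log_file instead) and an example log path:

[mysqld]
# log statements slower than 2 seconds (threshold is an assumption, tune to taste)
long_query_time = 2
log_slow_queries = /var/log/mysql/mysql-slow.log

Then restart MySQL and have a look at the log (or run mysqldumpslow on it) around the next hangup.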

Zimbabao
  • The slow query log has never been enabled, and honestly I did not know about it. At the same time the database is fairly small and the only way to access it is through a Java/SOAP interface, so all queries executed are known and tested; if any of them did a full table scan I believe this would happen much more often. However, I will enable the slow log out of curiosity and in case my statement above is wrong :) Thank you for the answer! – Avada Kedavra Aug 10 '10 at 15:31
  • Actually, this could turn out quite interesting. Thank you very much Zimbabao! – Avada Kedavra Aug 10 '10 at 15:44
  • I have enabled the slow query log but cannot find anything suspicious there. Thanks for your suggestion though! Anyone else? Please? – Avada Kedavra Aug 11 '10 at 07:03