
A Hudson server used by developers is having some timeout issues.

Whenever a Java build runs (Git checks out the modules, the Java classes are compiled, tested, and copied to Nexus), something goes wrong in the process and the whole server becomes unresponsive.

I can watch iotop, for example, during the first 15-30 seconds; it peaks at 3 MB/s at most, and then, all of a sudden, top stops refreshing and I can't do anything else. The same goes for SNMP: every SNMP request results in a timeout.

I am also unable to log in over SSH, and typing a simple command like "w" in an already-open terminal takes ages (5+ minutes) to produce output.

This is a Debian server (pretty default installation, nothing odd running on it).

This is a VM running on a XenServer host. It has 4 GB of memory allocated and 8 cores (4 physical cores + Hyper-Threading).

XenCenter is reporting 13% CPU load (which is 100% on one core). Disk usage is low, and memory isn't peaking either (it stays at about 3 GB out of 4 GB used, and swap stays empty).

This is 'w' output just after it becomes responsive again:

$ w
 12:02:31 up 10 days, 19:45,  4 users,  load average: 32.13, 25.42, 15.32

So yes, the average system load is quite high... but even a load of 32 doesn't seem high enough to explain a complete freeze on an 8-core machine.

It also doesn't seem to happen each time it's building something.

My question: is there a way to sample data and see what's causing everything to hang? And even if Java is monopolising one core, the server has 7 unused cores, so why does it fail to respond to any input (I/O usage seems low too, and it's not running out of memory)?

Edit:

vmstat 1 output:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 3  0    940 725336 235388 1444280    0    0     0     0 2048  578 14  8 79  0
 5  0    940 702704 235392 1444296    0    0     0   200 3088 1691  8  8 81  3
 2  0    940 701480 235392 1444296    0    0     0     4 2116  658 13  7 80  0
 3  0    940 699480 235392 1444296    0    0     0    68 2688 1108 10  7 82  1
 2  0    940 657732 235396 1444292    0    0     0   244 4639 3412 11  5 81  2
 0  2    940 648872 235408 1451840    0    0     0  6616 5283 4573 11  3 79  6
 0  2    940 645696 235412 1455296    0    0     0  4436 1257 1134  3  1 74 22
 1  1    940 635064 235424 1465320    0    0     0  6940 3117 2653  6  1 87  6
 0  1    940 632792 235448 1466592    0    0     0  9236  610  405  2  1 81 16
 1  1    940 631464 235448 1466660    0    0     0  9524 1580  852  4  3 73 20
 1  1    940 633428 235448 1466660    0    0     0  9120 2931 1250  1  6 82 11
 3  0    940 620844 235540 1467496    0    0     0    48 2249 1368 16  4 80  0
 3  1    940 590080 235544 1467552    0    0     0  1632 2322  918  9 10 76  6
 4  0    940 538072 235544 1467552    0    0     0    28 2159 2364 15  7 77  1
 5  0    940 489588 235548 1467836    0    0     0  1412 1745  663 16  5 76  2
 7  0    940 402368 235548 1467996    0    0     0     0 2573  882 16  7 77  0
 3  0    940 371424 235560 1468128    0    0     0    20 2235  744 14  7 79  0
 4  0    940 265660 235568 1467840    0    0     0     0 2181  826 13  6 82  0
 4  0    940 257360 235604 1468588    0    0     0     0 1718  705 14  6 81  0
 3  0    940 193800 235628 1468124    0    0     0  1312 2164 1096 12  5 80  3

** hangs after this **

"ps" output from just when it became responsive again:

develop@Yong:/opt/develop/tomcat7-develop/bin$ ps -HAf
UID        PID  PPID  C STIME TTY          TIME CMD
root         2     0  0 Jan21 ?        00:00:00 [kthreadd]
root         3     2  0 Jan21 ?        00:00:00   [migration/0]
root         4     2  0 Jan21 ?        00:00:00   [ksoftirqd/0]
root         5     2  0 Jan21 ?        00:00:00   [watchdog/0]
root         6     2  0 Jan21 ?        00:00:00   [migration/1]
root         7     2  0 Jan21 ?        00:00:00   [ksoftirqd/1]
root         8     2  0 Jan21 ?        00:00:00   [watchdog/1]
root         9     2  0 Jan21 ?        00:00:00   [migration/2]
root        10     2  0 Jan21 ?        00:00:00   [ksoftirqd/2]
root        11     2  0 Jan21 ?        00:00:00   [watchdog/2]
root        12     2  0 Jan21 ?        00:00:00   [migration/3]
root        13     2  0 Jan21 ?        00:00:00   [ksoftirqd/3]
root        14     2  0 Jan21 ?        00:00:00   [watchdog/3]
root        15     2  0 Jan21 ?        00:00:01   [migration/4]
root        16     2  0 Jan21 ?        00:00:00   [ksoftirqd/4]
root        17     2  0 Jan21 ?        00:00:00   [watchdog/4]
root        18     2  0 Jan21 ?        00:00:02   [migration/5]
root        19     2  0 Jan21 ?        00:00:00   [ksoftirqd/5]
root        20     2  0 Jan21 ?        00:00:00   [watchdog/5]
root        21     2  0 Jan21 ?        00:00:01   [migration/6]
root        22     2  0 Jan21 ?        00:00:00   [ksoftirqd/6]
root        23     2  0 Jan21 ?        00:00:00   [watchdog/6]
root        24     2  0 Jan21 ?        00:00:00   [migration/7]
root        25     2  0 Jan21 ?        00:00:00   [ksoftirqd/7]
root        26     2  0 Jan21 ?        00:00:00   [watchdog/7]
root        27     2  0 Jan21 ?        00:00:00   [events/0]
root        28     2  0 Jan21 ?        00:01:04   [events/1]
root        29     2  0 Jan21 ?        00:00:00   [events/2]
root        30     2  0 Jan21 ?        00:00:00   [events/3]
root        31     2  0 Jan21 ?        00:00:00   [events/4]
root        32     2  0 Jan21 ?        00:00:00   [events/5]
root        33     2  0 Jan21 ?        00:00:00   [events/6]
root        34     2  0 Jan21 ?        00:00:00   [events/7]
root        35     2  0 Jan21 ?        00:00:00   [cpuset]
root        36     2  0 Jan21 ?        00:00:00   [khelper]
root        37     2  0 Jan21 ?        00:00:00   [netns]
root        38     2  0 Jan21 ?        00:00:00   [async/mgr]
root        39     2  0 Jan21 ?        00:00:00   [pm]
root        40     2  0 Jan21 ?        00:00:00   [xenwatch]
root        41     2  0 Jan21 ?        00:00:00   [xenbus]
root        42     2  0 Jan21 ?        00:00:00   [sync_supers]
root        43     2  0 Jan21 ?        00:00:00   [bdi-default]
root        44     2  0 Jan21 ?        00:00:00   [kintegrityd/0]
root        45     2  0 Jan21 ?        00:00:00   [kintegrityd/1]
root        46     2  0 Jan21 ?        00:00:00   [kintegrityd/2]
root        47     2  0 Jan21 ?        00:00:00   [kintegrityd/3]
root        48     2  0 Jan21 ?        00:00:00   [kintegrityd/4]
root        49     2  0 Jan21 ?        00:00:00   [kintegrityd/5]
root        50     2  0 Jan21 ?        00:00:00   [kintegrityd/6]
root        51     2  0 Jan21 ?        00:00:00   [kintegrityd/7]
root        52     2  0 Jan21 ?        00:00:00   [kblockd/0]
root        53     2  0 Jan21 ?        00:00:00   [kblockd/1]
root        54     2  0 Jan21 ?        00:00:00   [kblockd/2]
root        55     2  0 Jan21 ?        00:00:00   [kblockd/3]
root        56     2  0 Jan21 ?        00:00:00   [kblockd/4]
root        57     2  0 Jan21 ?        00:00:00   [kblockd/5]
root        58     2  0 Jan21 ?        00:00:00   [kblockd/6]
root        59     2  0 Jan21 ?        00:00:00   [kblockd/7]
root        60     2  0 Jan21 ?        00:00:00   [kseriod]
root        69     2  0 Jan21 ?        00:00:00   [kondemand/0]
root        70     2  0 Jan21 ?        00:00:00   [kondemand/1]
root        71     2  0 Jan21 ?        00:00:00   [kondemand/2]
root        72     2  0 Jan21 ?        00:00:00   [kondemand/3]
root        73     2  0 Jan21 ?        00:00:00   [kondemand/4]
root        74     2  0 Jan21 ?        00:00:00   [kondemand/5]
root        75     2  0 Jan21 ?        00:00:00   [kondemand/6]
root        76     2  0 Jan21 ?        00:00:00   [kondemand/7]
root        77     2  0 Jan21 ?        00:00:00   [khungtaskd]
root        78     2  0 Jan21 ?        00:00:00   [kswapd0]
root        79     2  0 Jan21 ?        00:00:00   [ksmd]
root        80     2  0 Jan21 ?        00:00:00   [aio/0]
root        81     2  0 Jan21 ?        00:00:00   [aio/1]
root        82     2  0 Jan21 ?        00:00:00   [aio/2]
root        83     2  0 Jan21 ?        00:00:00   [aio/3]
root        84     2  0 Jan21 ?        00:00:00   [aio/4]
root        85     2  0 Jan21 ?        00:00:00   [aio/5]
root        86     2  0 Jan21 ?        00:00:00   [aio/6]
root        87     2  0 Jan21 ?        00:00:00   [aio/7]
root        88     2  0 Jan21 ?        00:00:00   [crypto/0]
root        89     2  0 Jan21 ?        00:00:00   [crypto/1]
root        90     2  0 Jan21 ?        00:00:00   [crypto/2]
root        91     2  0 Jan21 ?        00:00:00   [crypto/3]
root        92     2  0 Jan21 ?        00:00:00   [crypto/4]
root        93     2  0 Jan21 ?        00:00:00   [crypto/5]
root        94     2  0 Jan21 ?        00:00:00   [crypto/6]
root        95     2  0 Jan21 ?        00:00:00   [crypto/7]
root        98     2  0 Jan21 ?        00:00:00   [khvcd]
root       408     2  0 Jan21 ?        00:00:06   [kjournald]
root       906     2  0 Jan21 ?        00:00:03   [flush-202:16]
root      1111     2  0 Jan21 ?        00:00:00   [kslowd000]
root      1112     2  0 Jan21 ?        00:00:00   [kslowd001]
root      1119     2  0 Jan21 ?        00:00:00   [cifsd]
root         1     0  0 Jan21 ?        00:00:00 init [2]      
root       480     1  0 Jan21 ?        00:00:00   udevd --daemon
daemon    1129     1  0 Jan21 ?        00:00:00   /sbin/portmap
statd     1140     1  0 Jan21 ?        00:00:00   /sbin/rpc.statd
root      1236     1  0 Jan21 ?        00:00:01   /usr/sbin/rsyslogd -c3
root      1283     1  0 Jan21 ?        00:00:27   /bin/bash /usr/sbin/xe-daemon
root     22213  1283  0 13:57 ?        00:00:00     sleep 60
root      1358     1  0 Jan21 ?        00:00:00   /usr/sbin/sshd
root     18625  1358  0 13:33 ?        00:00:00     sshd: yeri [priv]
yeri     18628 18625  0 13:33 ?        00:00:00       sshd: yeri@pts/0 
yeri     18629 18628  0 13:33 pts/0    00:00:00         -bash
root     18647 18629  0 13:33 pts/0    00:00:00           su -
root     18648 18647  0 13:33 pts/0    00:00:00             -su
root     19357 18648  0 13:45 pts/0    00:00:00               vmstat 1
root     19124  1358  0 13:44 ?        00:00:00     sshd: develop [priv]
develop  19127 19124  0 13:44 ?        00:00:00       sshd: develop@pts/1
develop  19128 19127  0 13:44 pts/1    00:00:00         -bash
develop  22174 19128  0 13:57 pts/1    00:00:00           ps -HAf
root     22163  1358  0 13:53 ?        00:00:00     sshd: andy [priv]
andy     22166 22163  0 13:54 ?        00:00:00       sshd: andy@pts/2 
andy     22167 22166  0 13:54 pts/2    00:00:00         -bash
andy     22216 22167  0 13:57 pts/2    00:00:00           -bash
andy     22217 22216  0 13:57 pts/2    00:00:00             [sed]
andy     22218 22216  0 13:57 pts/2    00:00:00             -bash
root      1677     1  0 Jan21 ?        00:00:00   nginx: master process /usr/sbi
www-data  1678  1677  0 Jan21 ?        00:00:00     nginx: worker process
snmp      1686     1  0 Jan21 ?        00:02:15   /usr/sbin/snmpd -Lsd -Lf /dev/
ntp       1700     1  0 Jan21 ?        00:00:00   /usr/sbin/ntpd -p /var/run/ntp
daemon    1718     1  0 Jan21 ?        00:00:00   /usr/sbin/atd
root      1739     1  0 Jan21 ?        00:00:02   /usr/sbin/cron
root     22159  1739  0 13:50 ?        00:00:00     /USR/SBIN/CRON
root     22161 22159  0 13:50 ?        00:00:00       /usr/bin/perl -w /usr/bin/
root     22214 22161  0 13:57 ?        00:00:00         [rateup]
daemon    1763     1  0 Jan21 ?        00:00:03   /usr/sbin/uptimed
root      1765     1  0 Jan21 tty2     00:00:00   /sbin/getty 38400 tty2
root      1766     1  0 Jan21 tty3     00:00:00   /sbin/getty 38400 tty3
root      1767     1  0 Jan21 tty4     00:00:00   /sbin/getty 38400 tty4
root      1768     1  0 Jan21 tty5     00:00:00   /sbin/getty 38400 tty5
root      1769     1  0 Jan21 tty6     00:00:00   /sbin/getty 38400 tty6
104       7240     1  0 Jan24 ?        00:00:00   /usr/bin/dbus-daemon --system
avahi     7269     1  0 Jan24 ?        00:00:00   avahi-daemon: running [Yong.lo
avahi     7270  7269  0 Jan24 ?        00:00:00     avahi-daemon: chroot helper
101        423     1  0 Jan31 ?        00:00:00   /usr/sbin/exim4 -bd -q30m
root      6799     1  0 Jan31 hvc0     00:00:00   /sbin/getty 38400 hvc0
root      1915     1  0 06:25 ?        00:00:01   /usr/bin/atop -a -w /var/log/a
root     19181     1  0 13:45 pts/0    00:00:00   /bin/sh /usr/bin/mysqld_safe
mysql    19218 19181  0 13:45 pts/0    00:00:03     /usr/sbin/mysqld --basedir=/
root     19219 19181  0 13:45 pts/0    00:00:00     logger -p daemon.err -t mysq
develop  19304     1 99 13:45 pts/1    66-17:02:10   /opt/develop/java/jdk/bin/j
develop  19457 19304 99 13:46 pts/1    373-12:31:17     /opt/develop/java/jdk1.6
develop  22219 19457  0 13:57 pts/1    00:00:00       [java]
develop  21193 19304 99 13:47 pts/1    26-16:29:39     /opt/develop/java/jdk1.6.
develop  22145 19304  0 13:48 pts/1    00:00:00     git fetch -t ssh://git@git.m
develop  22148 22145  0 13:48 pts/1    00:00:00       ssh git@git.miiatech.com g
develop  22215 22145  0 13:57 pts/1    00:00:00       [git]
Tuinslak

5 Answers


The output of "w" doesn't help much here. Please post the output of "vmstat 1" during the first 15-30 seconds; it gives much deeper insight. How many files are in your Java project? Is everything available locally, or is Hudson checking things out from elsewhere or copying something over the network?
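Since the terminal freezes mid-hang, it helps to write the samples to disk as they arrive. A minimal sketch (the function name and log path are my own, not part of any tool):

```shell
# Timestamp each line of a command's output so the samples survive on
# disk even if the terminal freezes during the hang.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}

# Usage during the next build (bounded to 300 samples; path is an example):
# vmstat 1 300 | stamp_lines >> /tmp/vmstat-hang.log
```

Reading the log afterwards shows exactly when the gaps (i.e. the hangs) occurred, because the timestamps stop advancing.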

shakalandy
  • The Git server is accessed over the network (same physical server, different VM). It's a small project; the Git repo is 8.4 MB in size. I'll keep vmstat running until I see it become unresponsive again. – Tuinslak Feb 01 '11 at 11:27
  • I have added the vmstat output in my original post. – Tuinslak Feb 02 '11 at 12:50
  • Deleted my first comment. New approach: in the vmstat output you can see free memory decreasing every second. It starts at ~755 MB free and drops by up to 60 MB/s. Your filesystem cache stays untouched, so it seems there's a problem with your memory. Normally the kernel should free some memory from the filesystem cache so that other processes can allocate more if needed. Try increasing the memory or reallocating the filesystem cache. – shakalandy Feb 03 '11 at 14:12

Are you running Hudson as a standalone application or inside a web container, e.g. Tomcat?

One thought that comes to mind is memory allocation for your Java process. If you did not specify memory options for the JVM (the -Xms and -Xmx options), it may be that the JVM runs with its small default heap (historically around 64 MB), garbage collection kicks in at that point, and brings your system to a halt.

If you can verify your runtime options (your "ps" output above is truncated on the right, so I cannot see for myself), check whether this is the root cause. If it is, give the process more memory and see if it still stalls in such a scenario.
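For reference, explicit heap bounds in the Tomcat environment file might look like this. The sizes here are assumptions for a 4 GB guest, not recommendations, and the file path varies by distribution and Tomcat version:

```shell
# Hypothetical fragment for /etc/default/tomcat7 (adjust sizes to taste):
# pin the minimum and maximum heap so the JVM doesn't rely on tiny defaults.
JAVA_OPTS="$JAVA_OPTS -Xms512m -Xmx1024m"
```

After changing this, restart Tomcat and confirm the flags show up in the full command line of the Java process.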

Axel Knauf
  • It's running inside Tomcat. Hudson (or Tomcat) isn't crashing, and the application compiles fine. It's just that every other service or application running on the server stops responding (ICMP replies fine, though). Java has a nice level of 0. – Tuinslak Feb 02 '11 at 13:57
  • What are your Tomcat settings in `/etc/default/tomcat` (maybe `tomcat6`)? Are there any `JAVA_OPTS` with the memory flags I mentioned? – Axel Knauf Feb 02 '11 at 14:01
  • It's Tomcat7, running a fairly default config. I can't see any JAVA_OPTS (except JAVA_OPTS="$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager" and JAVA_OPTS="$JAVA_OPTS $LOGGING_MANAGER"). Config: http://pastebin.com/tn7E8xY2. I'll increase the memory options of the JVM when testing again tomorrow. – Tuinslak Feb 02 '11 at 19:04

It really sounds like your server eats up all its memory and starts to swap. Are you certain about the memory usage while compiling? javac easily eats a few gigabytes. I would keep top running and watch memory usage throughout the build process. Once a server starts thrashing the disk because of active processes it usually won't respond, but you can often see memory use climb until the server starts swapping.

Another easy way to test this is to disable Swapping:

swapoff /dev/sda2 (or whatever your swap partition is called)

And then try the build. If you are running out of memory, the build is likely to die halfway. It is also possible that some other memory-hungry process gets killed by the Linux OOM killer.
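Since top itself freezes during the hang, logging the numbers to a file is more reliable than watching interactively. A sketch of a parser for "free -m" output (the function name and usage are mine, not standard tooling):

```shell
# Extract used memory and swap (in MB) from "free -m" output,
# producing one compact line per sample for appending to a log.
mem_line() {
    awk '/^Mem:/ {m=$3} /^Swap:/ {s=$3} END {print "mem_used_mb=" m, "swap_used_mb=" s}'
}

# Usage during a build (loop and path are examples):
# while true; do free -m | mem_line >> /tmp/mem-during-build.log; sleep 1; done
```

If the last line before a hang shows swap_used_mb climbing, swapping is the culprit; if it stays at 0, look elsewhere.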

pehrs
  • top (or any other application) stops showing any output, so keeping it open won't show much once it stops refreshing. XenCenter shows the correct memory usage, which never exceeds 3-3.5 GB (out of 4 GB), and swap stays empty (I can check the mrtg graphs). I've turned off the swap for now -- let's see if that shows anything. – Tuinslak Feb 02 '11 at 13:54

The issue seemed to be caused by the Git polling, which was set to run every minute. I changed it to every 10 minutes and the load is now close to zero.
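For reference, the SCM polling schedule in the Hudson job configuration uses cron-style syntax, so the change above corresponds to something like:

```
# "Poll SCM" schedule: every 10 minutes instead of every minute
*/10 * * * *
```

This is a sketch of the schedule field only; the exact field name and accepted syntax depend on the Hudson version.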

Tuinslak

Check if you have any processes in "D" state (Uninterruptible sleep):

# ps -eo pid,user,state,cmd | awk '$3 ~ /D/ { print $0 }'
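Because the shell is unusable during the hang, it can help to sample D-state processes continuously to a file beforehand. A sketch (function name, loop bounds, and log path are examples):

```shell
# Filter ps output for processes in uninterruptible sleep; state codes
# like "D" or "D+" both indicate a process blocked on (usually) I/O.
dstate() {
    awk '$3 ~ /^D/ { print $0 }'
}

# Usage before the next build (bounded to 600 samples; path is an example):
# for i in $(seq 1 600); do
#     date '+%H:%M:%S'
#     ps -eo pid,user,state,cmd | dstate
#     sleep 1
# done >> /tmp/dstate.log
```

Given the CIFS mount (cifsd) and the remote Git server in the process list, a D-state process blocked on network I/O would be a plausible finding.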
quanta