I have a system (the "host") that runs several containers using LXC (the "guests"). I've installed Jenkins inside the guests and the installations appear to be working as intended, except that they don't respond to requests. (I've made several successful Jenkins installations before, including in LXC.) In this case, the observed problem is that the built-in Jenkins web server (Jetty) does not respond to HTTP requests, even if those requests are made from within the very LXC guest it's running in, i.e. pointing at localhost.

I've been working to resolve this issue for several days, without success.

This is what you get when trying to contact the Jenkins web server from the localhost:

root@base:~# curl -vI http://localhost:8080/jenkins/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> HEAD /jenkins/ HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.58.0
> Accept: */*
>

On a working setup, you should get an HTTP-403 because you have not been authenticated and it shouldn't take more than a second or two to reply, but even after a few hours, there's no response. The Jenkins log file doesn't report any errors, either.
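A hang like this is easier to script against if the request carries a hard timeout. The following is only a sketch: the function names, the default URL, and the 5-second limit are my own illustrative choices. It relies on curl's documented exit codes 7 (could not connect) and 28 (operation timed out).

```shell
#!/bin/sh
# Turn a curl exit code and HTTP status into a one-line verdict.
# usage: classify_probe <curl_exit_code> <http_code>
classify_probe() {
    rc=$1
    code=$2
    case "$rc" in
        0)  echo "responded: HTTP $code" ;;
        7)  echo "refused: could not connect" ;;
        28) echo "timed out: no reply within the limit" ;;
        *)  echo "failed: curl exit code $rc" ;;
    esac
}

# Send a HEAD request that gives up after a fixed number of seconds.
# usage: probe [url] [timeout_seconds]   (both defaults are assumptions)
probe() {
    url=${1:-http://localhost:8080/jenkins/}
    limit=${2:-5}
    code=$(curl -sS -I -o /dev/null --max-time "$limit" \
                -w '%{http_code}' "$url" 2>/dev/null)
    classify_probe "$?" "$code"
}
```

Calling `probe` with no arguments targets the same URL as the transcript above. On a healthy install it should report a response (an HTTP 403 in my case) rather than a timeout.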

I need help to root-cause and resolve this problem, so that the Jenkins install works as intended and becomes accessible.

Any pointers on what/where to look for to figure out and fix this problem?


Here're some things I've already looked into:

  • Jenkins Configuration: The configuration file at /etc/default/jenkins is similar to my other working setups and has had minimal changes (e.g. binding to localhost only, and the prefix).
  • Apache Configuration: I reviewed the Apache reverse-proxy config and compared to other working systems, but that was not a problem. Also, Apache was always accessible (e.g. the "It works!" page), even from outside the LXC container, so traffic was not getting blocked by firewall rules. Apache would fail with HTTP-502 Proxy Error because Jenkins wouldn't reply to it. (That said, I've uninstalled Apache to simplify the environment.)
  • Log Files: The Jenkins log file at /var/log/jenkins/jenkins.log does not report any problems, which would usually show up as Java stack traces from exceptions.
  • Firewall Rules (iptables -S): All the chains/rules (INPUT, FORWARD, and OUTPUT) are set to ACCEPT. Still, since communication here is within the localhost, I wouldn't expect issues even if there had been other firewall rules in place.
  • Network Packets and Ports (netstat -tapon): Shows Jenkins (java process) listening on the expected port (default 8080, but I've tried others); it also shows the connection as ESTABLISHED (on both ends) after the curl client sends a request like the one shown above. This shows a successful TCP handshake.
  • Network Traffic (tcpdump -i lo): Shows the 3-way handshake being made; it explains why netstat shows connections as ESTABLISHED.
  • Comparing Against Working Setups: The other Jenkins installations I've made have similar environments and configurations (e.g. Ubuntu 18.04 hosts, same changes to Jenkins config file, installation procedures, etc).
  • Reproducing the Problem: I've tried (and failed) to reproduce the problem on other systems using the exact same environment, installation process, configs, etc. (e.g. my laptop, a separate server at work, a separate server at home, same LXC versions, matching guest OS image fingerprints); everything works as expected outside the production server in question (a Dell PowerEdge R640).
  • Nuking the System from Orbit [1]: I've destroyed/rebuilt all the containers from scratch several times (including destroying the ZFS pool where all the data is stored); it made no difference.
  • Installing in Host Directly: I have confirmed that installing Jenkins directly on the host, i.e. outside any LXC containers/guests, also shows the problem.
  • Rule out Java/JVM: I can confirm that other Java-based applications work correctly, so it does not appear to be a problem that affects any/all Java-based programs. (I tested this by setting up an Apache Tomcat server, which worked as expected.)
  • Relocate Host: To rule out potential data center environment issues, I moved the server onto my desk area, where I have another test server with a working setup. This made no difference.
  • Run Stand-alone Jetty: I got the closest-matching Jetty server version I could find to the one bundled with Jenkins. Could not reproduce the problem. The stand-alone Jetty server replied to requests as expected, even though the one bundled with Jenkins still doesn't. (Jenkins' Jetty version is reported as jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z in the log. There's no version with this .z-SNAPSHOT name on the Jetty releases page, so I used the closest match based on build date for this test: 9.4.11.v20180605)
  • Switch from OpenJRE to Oracle JRE: Installed/Set Oracle's JRE to be used (i.e. update-alternatives --config java). The same non-responsive behavior is observed.
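For anyone who wants to repeat the netstat check from the list above, here is a small sketch that counts ESTABLISHED sockets touching a given port. The function name is made up, and the column positions assume the usual `netstat -tapon` layout (local address in column 4, foreign address in column 5, state in column 6).

```shell
#!/bin/sh
# Count ESTABLISHED TCP sockets whose local or foreign address ends in :<port>.
# Reads `netstat -tapon` output on stdin.
# usage: netstat -tapon | count_established 8080
count_established() {
    awk -v port="$1" '
        $6 == "ESTABLISHED" &&
        ($4 ~ (":" port "$") || $5 ~ (":" port "$")) { n++ }
        END { print n + 0 }
    '
}
```

With a hung curl request to localhost still open, I would expect a count of 2 here (one line per end of the connection), which matches what I saw: the TCP layer is fine and only the HTTP reply is missing.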

Some of the questions I've already looked at, but weren't related or helpful:

I've read way more than these; they're only a sample.


[1] It's the only way to be sure... mostly...

code_dredd
  • Probably the reason your post hasn't gotten much attention is because it's so long. See if you can shorten it up? (Here because of your comment over on Meta.SE.) – jpmc26 Aug 24 '18 at 05:48
  • @jpmc26 I could remove a lot of info, but then I'd expect to get questions asking for the info I removed (e.g. what have you tried?, did you check ___?) or get down-voted because I didn't show "any research effort", etc. Seems like a catch-22. Is there, maybe, a better way to organize the info currently available? – code_dredd Aug 24 '18 at 06:02
  • I've removed _all_ of my prior troubleshooting steps in order to make the question appear less "intimidating"... – code_dredd Sep 05 '18 at 16:38
  • I think this is a big improvement. You could *maybe* reintroduce some of your troubleshooting info as a bullet list of possible problems you've eliminated (with each item being a sentence or two at most). The list of SE questions could probably be tied into that. – jpmc26 Sep 05 '18 at 16:50
  • Hope that looks better. – code_dredd Sep 05 '18 at 21:11
  • I have found a way around this problem, but I'll need more time to prepare a response (e.g. confirming a few other things). Stay tuned. – code_dredd Sep 20 '18 at 02:33

2 Answers

Here is a more detailed explanation of the same issue: https://issues.jenkins-ci.org/browse/JENKINS-33412

Previous Jenkins versions limit the number of threads used by the underlying Jetty to 40 (the handlerCountMax option). By default, Jetty uses Runtime.getRuntime().availableProcessors()/2 threads for selectors. If the number of CPU cores is high enough (e.g. 70), or if the Jetty SSL connector is started as well and the number of CPU cores is more than 36, then the threads are exhausted and HTTP requests just get stuck. Consider migrating to the latest Jenkins and defining the thread counts for Jetty manually; check these Jetty parameters: qtpMaxThreadsCount, jettyAcceptorsCount, jettySelectorsCount.
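To see how quickly the pool runs dry, here is a rough back-of-the-envelope sketch. It uses only the two numbers quoted above (a cap of 40 threads and availableProcessors()/2 selectors per connector); the function names are made up, and acceptor threads or other internal consumers are deliberately not modelled, so it will not predict the exact core count at which requests start to hang.

```shell
#!/bin/sh
# Thread pool cap quoted above (handlerCountMax).
POOL_MAX=40

# Selector threads per connector: availableProcessors()/2, integer division.
selectors_per_connector() {
    echo $(( $1 / 2 ))
}

# Threads left over once selectors are taken out of the pool.
# usage: remaining_threads <cpu_cores> [connectors]
# connectors: 1 for HTTP only, 2 if an SSL connector is started as well
remaining_threads() {
    cores=$1
    connectors=${2:-1}
    per_connector=$(selectors_per_connector "$cores")
    echo $(( POOL_MAX - per_connector * connectors ))
}

for cores in 8 36 69 70 72; do
    echo "$cores cores, HTTP only: $(remaining_threads "$cores" 1) threads left"
done
```

The point is only the trend: at 8 cores almost the whole pool is free for requests, while around 70 cores just a handful of threads remain before anything else takes its share.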

  • Thanks for sharing the info. I was already using the latest Jenkins version available (from their repo) when I ran into the issue. Still, I'll take a closer look at this and the options in the server and see if it makes a difference, probably next week, and post back. – code_dredd Oct 05 '18 at 19:56
  • 1
    I took a look at this today and was able to confirm that allocating 69 or less CPU cores to the container allows Jenkins to work under Ubuntu 18.04 LTS. I'll update my post a bit later. Thanks for pointing me in the right direction. – code_dredd Oct 09 '18 at 18:07

TL;DR

If there are 70 or more CPUs in the system hosting Jenkins, then Jenkins/Jetty gets stuck and won't work. Either make sure that the system/container where Jenkins will run has fewer than 70 CPUs available, or upgrade your Jenkins installation to at least 2.138.2, which was released today (2018-10-10).


Summary

It turns out that Jenkins 2.138.1 in the Ubuntu 18.04 LTS repository has a bug that causes Jenkins/Jetty to not respond when on a system with 70 or more CPUs. Jenkins 2.138.2 was released today, Oct. 10, 2018, and it includes fixes for several underlying issues, one of which caused the problem I had run into.

The changelog is here. The key fix for me is this one:

I can confirm that this fix does resolve the problem; I verified it on my server with 72 CPUs.
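If you want a quick way to tell whether a given box is likely to hit this, the two conditions (70 or more CPUs, Jenkins older than 2.138.2) can be checked with something like the sketch below. The function names are hypothetical and the version comparison is naive: it relies on GNU `sort -V` and only makes sense for LTS version numbers, not weekly releases.

```shell
#!/bin/sh
FIXED_VERSION=2.138.2   # first release where the fix worked for me
CPU_THRESHOLD=70        # CPU count at which I saw the hang

# True if version $1 sorts strictly before version $2 (GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# True if both conditions for the hang are met.
# usage: is_affected <cpu_count> <jenkins_version>
is_affected() {
    [ "$1" -ge "$CPU_THRESHOLD" ] && version_lt "$2" "$FIXED_VERSION"
}
```

For example, `is_affected "$(nproc)" 2.138.1 && echo "likely affected"` would have flagged my 72-CPU server. You have to supply the installed Jenkins version yourself.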

If you cannot (yet) upgrade your Jenkins installation, then read on for a potential workaround.


Workaround (for containers)

If you're installing Jenkins inside LXC, then you can limit the number of CPUs the container sees with the following commands:

  • lxc config set <container> limits.cpu N, where N < 70; and
  • lxc exec <container> -- systemctl restart jenkins.service

You may also need to update the profile config, which you can do as follows:

lxc profile set <container-profile> limits.cpu N

with the same constraint as above (N < 70). If you're using a virtual machine (e.g. VirtualBox, VMware, etc.), then you should still be able to set the number of CPUs available to the VM.
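The LXC steps can be wrapped in a small helper that picks N for you. This is just a sketch: the function names are my own, the cap of 69 comes from what worked in my tests, and the helper only prints the `lxc` commands so you can review them before running anything.

```shell
#!/bin/sh
# Largest CPU count that worked for me; N must stay below 70.
MAX_SAFE_CPUS=69

# Clamp a CPU count to the safe maximum.
# usage: capped_cpus <available_cpus>
capped_cpus() {
    if [ "$1" -gt "$MAX_SAFE_CPUS" ]; then
        echo "$MAX_SAFE_CPUS"
    else
        echo "$1"
    fi
}

# Print (do not run) the workaround commands for one container.
# usage: print_workaround <container> [available_cpus]
print_workaround() {
    container=$1
    cpus=${2:-$(nproc)}
    n=$(capped_cpus "$cpus")
    echo "lxc config set $container limits.cpu $n"
    echo "lxc exec $container -- systemctl restart jenkins.service"
}
```

Running `print_workaround <container>` on the host and pasting the output back into the shell should give the same result as the manual steps above.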

P.S.: Thanks to Pavel for his post, which led me in the right direction to play around with CPU/Core counts.

code_dredd