My Spring Boot application stopped working a couple of days ago and I'm trying to figure out why so I can prevent it in the future. This is the first time this has happened, so I don't really know where to start. Restarting the server solved the problem.
I'll write down everything I consider relevant, and hopefully someone can help me figure out how to go about this.
- Hosted on a Digital Ocean droplet.
- Ubuntu 16.04, 1GB RAM, 25GB SSD, 1 core.
- The HTTP requests hit a separate server (same setup) running Nginx and are passed to the upstream server running the Spring Boot application. During the failure, all HTTP requests returned a 502 and were logged by Nginx in error.log as
2019/04/20 20:06:56 [error] 14576#14576: *1161160 connect() failed (111: Connection refused) while connecting to upstream, client: xx.xxx.x.xxx, server: api.example.com, request: "OPTIONS /oauth/token HTTP/1.1", upstream: "http://xx.xxx.xx.xxx:8080/oauth/token", host: "api.example.com", referrer: "https://example.com/login"
2019/04/20 20:06:56 [error] 14576#14576: *1161160 no live upstreams while connecting to upstream, client: xx.xxx.x.xxx, server: api.example.com, request: "OPTIONS /oauth/token HTTP/1.1", upstream: "http://server_upstream/oauth/token", host: "api.example.com", referrer: "https://example.com/login"
- I was able to SSH onto the server without issue.
- I use log4j2 for logging in the Spring Boot application, but nothing was logged during the failure (see the sketch after this list for what I'm considering adding).
- A separate cron job on the same server, which periodically fetches data over HTTP, kept working fine during the failure.
- When the failure happened there was a huge drop in the server's used memory (85% -> 18%).
- I cannot find any relevant information in the syslog.
- The Spring Boot application runs as a systemd service, and (I think) it was still running during the failure.
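Since nothing ended up in the application log, one thing I'm considering is registering an uncaught-exception handler and a JVM shutdown hook at startup, so that a crashing thread or an orderly shutdown at least leaves a trace in the log4j2 log. A rough sketch (the class name is made up, not my actual main class, and I'm aware shutdown-hook ordering relative to the logging framework's own shutdown is not guaranteed):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Hypothetical application class, just to show where the hooks would be registered.
@SpringBootApplication
public class ApiApplication {

    private static final Logger log = LogManager.getLogger(ApiApplication.class);

    public static void main(String[] args) {
        // Log any exception that escapes a thread, so a crash leaves a trace in the log4j2 log.
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) ->
                log.error("Uncaught exception in thread {}", thread.getName(), throwable));

        // Log orderly JVM shutdowns (e.g. systemd stopping or restarting the service).
        // This will NOT fire if the process is killed with SIGKILL, e.g. by the kernel OOM killer.
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> log.warn("JVM shutdown hook triggered"), "shutdown-logger"));

        SpringApplication.run(ApiApplication.class, args);
    }
}
```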
Where should I start looking for the reason for the failure? Is there anything I can do to make it easier to debug this if it happens again?
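To make a future failure easier to correlate with memory pressure on this 1GB droplet, I'm also considering a small scheduled component that logs heap usage once a minute via the JVM's MemoryMXBean. A rough sketch (the class name is mine, and it needs @EnableScheduling on a configuration class):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical component; periodically records heap usage in the application log.
@Component
public class MemoryUsageLogger {

    private static final Logger log = LogManager.getLogger(MemoryUsageLogger.class);

    // Log heap usage once a minute so the next failure can be lined up against memory pressure.
    @Scheduled(fixedRate = 60_000)
    public void logHeapUsage() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        log.info("Heap used: {} MB of {} MB committed (max {} MB)",
                heap.getUsed() / (1024 * 1024),
                heap.getCommitted() / (1024 * 1024),
                heap.getMax() / (1024 * 1024));
    }
}
```

That would at least give me something in the application log to compare against the 85% -> 18% drop in used memory I saw on the server. Is this a reasonable approach, or is there a better way?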