
Our database server sometimes becomes unresponsive. It runs a fully updated Ubuntu 14.04 LTS. The notable non-vanilla software running on it is Nimbus, TSM and Oracle.

It becomes unresponsive about once a day, so far only at night, when a series of maintenance tasks such as backups are run.

Once it becomes unresponsive, it seems to stay that way forever. I'm not able to SSH into it, and it doesn't accept any database connections.

The weird thing is that the server responds to ping. If I use telnet to open port 22 (SSH) or port 1521 (Oracle), I do get a reply from the server. Port 22 even states something like "This is OpenSSH". But actually using the ssh client or opening a database connection just hangs.

I've been looking through the log files (dmesg, syslog, auth.log etc.) and found absolutely nothing. There also seems to be suspiciously little activity in the log files during the unresponsive period. After restarting the server, it works again.

My immediate reaction was to run apt-get update and apt-get dist-upgrade, and to start monitoring whether the maximum file-descriptor limit was being reached. However, the hard limit for Oracle is far from the filesystem max, so it would be strange if that were the cause. Does anyone have any other ideas about what could cause this?

EDIT: Forgot to mention that CPU, memory and disk space were far from reaching 100%. (They were already monitored, and after this happened I started monitoring open file descriptors as well, but it has yet to happen again.) I can also add that I don't expect anyone to call out the exact problem, but any ideas for additional things to monitor would be appreciated.
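
For anyone wanting to watch the same thing, here is a minimal sketch of a system-wide file descriptor check. It assumes a Linux host where `/proc/sys/fs/file-nr` holds three numbers (allocated handles, allocated-but-unused handles, and the system maximum). The function names are made up for illustration and are not part of any monitoring tool mentioned here.

```python
from pathlib import Path

# Assumed location of the kernel's file handle counters on Linux.
FILE_NR_PATH = "/proc/sys/fs/file-nr"


def parse_file_nr(text):
    """Parse the three whitespace-separated counters from file-nr.

    Returns (allocated, unused, maximum) as integers.
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def fd_usage_ratio(text):
    """Return the fraction of the system-wide maximum currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def read_fd_usage(path=FILE_NR_PATH):
    """Read the counters from the given path and return the usage ratio."""
    return fd_usage_ratio(Path(path).read_text())
```

Calling `read_fd_usage()` from a periodic job and logging the result next to a timestamp gives a trend rather than a one-shot view. Per-process limits (such as the one for the Oracle user) are a separate matter and are not covered by this sketch.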

  • Collect hardware utilization information. That's always step #1 in troubleshooting performance problems. Once you have empirical data, you can make informed decisions. – EEAA Feb 03 '16 at 14:10
  • Use tools like `htop`, `atop`, `vmstat`. – Khaled Feb 03 '16 at 14:14
  • Forgot to mention that. CPU, memory and disk space were far from reaching 100%. – Henrik Kjus Alstad Feb 03 '16 at 14:14
  • @HenrikKjusAlstad You need to collect usage **trends**; it's not enough to take a one-shot view of the resources. Look into something like [Munin](http://munin-monitoring.org/). – EEAA Feb 03 '16 at 14:19
  • @EEAA We do. Although I know Munin well, our infrastructure guys use Nimbus for that. It basically does the same (although in my opinion it's a bit more limited). But it does give us trends (graphs) of CPU use, memory usage, disk space used, network traffic etc. None of that looks alarming or out of the ordinary. – Henrik Kjus Alstad Feb 03 '16 at 14:22
  • If SSH is unable to connect, check the DNS too. – Dom Feb 03 '16 at 15:02
  • You need to employ the [Scientific Method](http://serverfault.com/questions/646852/excessive-number-of-sleeping-processes-in-centos-howto-diagnose/646876#646876), not the Shouting Halp at the internets method. – user9517 Feb 03 '16 at 15:59
  • Check the sysstat SAR files from the unresponsive periods; you can even graph them with ksar or something like that. Also check the logs, dmesg etc. – Petter H Feb 03 '16 at 16:52
  • It's hard to tell with the little information we have. The thing that stands out to me is that it tends to happen when you run your maintenance tasks. Can you try putting some more time between the different jobs (e.g. start the cron jobs further apart from each other)? Also, try running the cron jobs with a lower priority (see `man nice`) to see if that helps. Note that the jobs will take longer when using nice, so keep that in mind when (re-)scheduling them. – Oldskool Feb 03 '16 at 19:23
  • For a start I suggest you ssh to the server, start `top`, and leave it running until the next time you experience a problem. There is a good chance it will give you some indication of where the problem is. – kasperd Feb 03 '16 at 21:12
  • Can't add a comment, so: this problem can happen with high disk usage. For example, if you set a low innodb_buffer in MySQL and start restoring a huge backup, the server will become unresponsive to everything but ping, and there will be no high CPU usage. Run `iostat -x 2 > somelog.txt` and check it later. – Lev Bystritskiy Feb 04 '16 at 07:32

1 Answer


All variables looked quite normal. I wrote a cron job to output the date/time and the number of open file descriptors every minute, and found the file descriptors to be within normal values. However, at 3 am the server's clock suddenly went 2 hours back in time (it took me a while to notice that in the log file), and then the server died without any errors in the logs.
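
Since the backward jump was easy to miss by eye, a small script can scan such a log for it. This is only a sketch. It assumes a hypothetical log format where every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp followed by the file descriptor count, and the function names are illustrative rather than taken from the original cron job.

```python
from datetime import datetime

# Assumed timestamp layout at the start of each log line (19 characters).
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19


def parse_timestamp(line):
    """Parse the leading timestamp of a log line into a datetime."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)


def find_backward_jumps(lines):
    """Return (line_number, delta_seconds) for every line whose timestamp
    is earlier than the timestamp on the line before it.

    Line numbers are 1-based; delta_seconds is negative for a backward jump.
    """
    jumps = []
    previous = None
    for number, line in enumerate(lines, start=1):
        current = parse_timestamp(line)
        if previous is not None and current < previous:
            jumps.append((number, (current - previous).total_seconds()))
        previous = current
    return jumps
```

Feeding it the lines of the per-minute log returns an empty list when the clock only moves forward, and one entry per point where the wall clock stepped back.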

It turned out to be a problem at the hosting/VMware level (which is not my concern). Among other things, the VMware host had a clock that was completely off. After the infrastructure company fixed their VMware platform, the server worked fine again.