Our database server sometimes becomes unresponsive. It runs a fully updated Ubuntu 14.04 LTS. Notable non-vanilla software running on it includes Nimbus, TSM and Oracle.
This happens about once a day, so far always at night, when a series of maintenance tasks such as backups run.
Once it becomes unresponsive, it seems to stay that way indefinitely. I'm not able to SSH into it, and it doesn't accept any database connections.
The weird thing is that the server still responds to ping. If I use telnet to open port 22 (SSH) or port 1521 (Oracle), I do get a reply from the server. Port 22 even states something like "This is OpenSSH". But actually using the ssh client or opening a database connection just hangs.
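For anyone wanting to reproduce the telnet check from another machine and log it over time, here is a minimal sketch in Python. The function name `probe_banner` and the stage labels are made up for illustration. It only reports how far a plain TCP conversation gets (connection refused, connected but silent, or banner received); it does not attempt a real SSH or Oracle handshake, so a service that waits for the client to speak first would show up as "connected_no_banner" even when healthy.

```python
import socket


def probe_banner(host, port, timeout=5.0):
    """Return (stage, detail) describing how far a plain TCP conversation gets.

    Hypothetical helper, not part of any library.
    """
    try:
        # The timeout applies to the connect and to later recv calls.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return ("connect_failed", str(exc))
    with sock:
        try:
            data = sock.recv(256)
        except socket.timeout:
            # TCP handshake completed, but the service sent nothing in time.
            return ("connected_no_banner", "")
        except OSError as exc:
            return ("recv_failed", str(exc))
    if not data:
        return ("closed_by_peer", "")
    return ("banner", data.decode("ascii", errors="replace").strip())
```

Calling `probe_banner("dbserver.example", 22)` (hostname is a placeholder) once a minute from a second machine and logging the result with a timestamp would at least show when the behaviour changes.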
I've been looking through the log files (dmesg, syslog, auth.log, etc.) and found absolutely nothing. There also seems to be suspiciously little activity in the log files during the unresponsive period. After restarting the server, it works again.
My immediate reaction was to run apt-get update and apt-get dist-upgrade, and to monitor whether the max file-descriptor limit is being reached. However, the hard limit for Oracle is far from the filesystem max, so it would be weird if that's the cause. Does anyone have any ideas what could cause this?
EDIT: Forgot to mention that CPU, memory and disk space usage were far from reaching 100%. (They were already monitored, and after this happened, I started monitoring open file descriptors as well, but the problem has yet to happen again.) I can also add that I don't expect anyone to call out the exact problem, but any ideas for additional things to monitor would be appreciated.
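The system-wide file-descriptor monitoring mentioned above could look roughly like the sketch below. It assumes a Linux kernel exposing `/proc/sys/fs/file-nr`, which holds three numbers: allocated file handles, allocated-but-unused handles, and the system-wide maximum. The function names are hypothetical, and this only covers the system-wide counter, not per-process limits such as the one for the Oracle user.

```python
from pathlib import Path


def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,
        "unused": unused,
        "max": maximum,
        # Fraction of the system-wide limit currently allocated.
        "usage": allocated / maximum,
    }


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read and parse the live counter (Linux only)."""
    return parse_file_nr(Path(path).read_text())
```

Logging `read_file_nr()["usage"]` at a short interval during the nightly maintenance window would show whether the counter climbs before the hang.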