Server dying every few days - how to investigate

Question

I have an unmanaged Ubuntu 9.10 dedicated server, and it started dying a few weeks ago.

Before I request a hardware inspection, I would like to confirm that there is no software issue of some kind on the server.

The server is unmanaged, so I need to do everything myself.

The server hosts a few WP sites and one vBulletin forum.

Here is my phpinfo output: http://pastebin.com/hSQVQBMR

The server has worked *flawlessly* for about a year, without a single restart in the meantime, and now it has suddenly started to hang.

It always happens at approximately the same time (4-6 am CET), when we have the most visitors online.

The strange thing is that this never happened before; it worked very well for a year or more.

So my question is - how to investigate?

I have had Cacti set up from day one, and there is no unusual activity whatsoever. Furthermore, every time it hangs, it happens on the down slope of the load and MySQL queries charts (and all other load-related charts).

What I didn't have was a chart for the number of sockets, but I added that today.

What worries me most is that every time I requested a restart (approximately 4 times in the last 7 days), the support guy told me he was getting a black screen (so I guess this is not a case of load ~50).

What log files should I watch?

What entries in those files should I look for?

Can you be more specific about what "dying" means to you? -- Are we talking just the web server, or the whole box? Can you still SSH in? Does the box panic? See also http://serverfault.com/questions/127352/diagnosing-hardware-problem-in-linux-server-thats-kernel-panicking Re: diagnosing mysterious problems -- Bad RAM is my first guess in situations like this (and yes, RAM can go bad a year or more after it's installed) — voretaq7, Aug 19 '11 at 16:43
The whole server dies. Neither I nor the support guy can log in, and it has to be restarted manually. — kodisha, Aug 19 '11 at 16:51
OK, I'd start with the memory tests in the question I linked then. Sounds like a hard crash/kernel panic, which is often bad RAM — voretaq7, Aug 19 '11 at 17:49

score 1 · Answer 1 · answered Aug 19 '11 at 18:00

Look for memory errors and HDD errors in /var/log/messages to start with.

Is this server in a data center? With proper electricity feed? Variation in electricity can cause a server to crash and can also prevent it from booting if not enough power is available.

You can also test your hardware, especially your Memory and HDD.
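
As a rough sketch of the log check suggested above, the following Python 3 snippet flags lines that often accompany memory or disk trouble. The pattern list and the function names are my own assumptions for illustration, not a complete catalogue of kernel messages, so treat a clean result as "nothing obvious" rather than "no problem".

```python
import re

# Substrings that commonly accompany memory or disk trouble in Linux logs.
# This list is an illustrative assumption, not an exhaustive catalogue.
PATTERNS = [
    r"machine check",
    r"\bmce\b",
    r"\bedac\b",
    r"i/o error",
    r"ata\d+.*(error|exception|failed)",
    r"kernel panic",
    r"\boops\b",
    r"out of memory",
    r"segfault",
]
SUSPICIOUS = re.compile("|".join(PATTERNS), re.IGNORECASE)


def find_suspicious(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPICIOUS.search(line)
    ]


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as handle:
        return find_suspicious(handle)
```

Calling scan_file("/var/log/messages") (or the same on /var/log/kern.log) returns the matching lines with their line numbers. The entries just before each reboot are the interesting ones, so it helps to compare the timestamps of any hits with the times the server went down.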

score 1 · Answer 2 · answered Aug 20 '11 at 09:25

Set up CPU temperature monitoring, if you haven't already. If the problem is overheating then you may be able to see a sharp rise in temperature just before failure.
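
A minimal sketch of such monitoring in Python 3, assuming the kernel exposes thermal zones under /sys/class/thermal (not every board or kernel does; lm-sensors is the usual alternative). The function names are hypothetical. The sysfs thermal interface reports millidegrees Celsius, which the code converts.

```python
import glob
import os
import time


def read_temperatures(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for each readable thermal zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as handle:
                raw = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # zone missing or unreadable; skip it
        # The sysfs thermal interface reports millidegrees Celsius.
        readings[os.path.basename(zone)] = raw / 1000.0
    return readings


def format_sample(readings, now=None):
    """Render one timestamped log line suitable for appending to a file."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    body = " ".join(
        f"{name}={value:.1f}C" for name, value in sorted(readings.items())
    )
    return f"{stamp} {body}"
```

Run from cron every minute and appended to a file (or graphed in Cacti), this gives a trace to compare against the crash times. Since the box dies hard, the last sample before a crash may never reach the disk, so sending the readings to another machine is probably the safer choice.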

/var/log/kern.log would be worth a look. However if the system is crashing it may well be unable to write anything to it when it really matters.

If you can get access to the console - or perhaps better, use a serial console and leave something logging everything written to it (I use 'screen' for this) - then you may be able to see what the kernel says when it crashes.
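
If 'screen' is not an option on the logging machine, something like the following Python 3 sketch could do the same job. It assumes the crashing server sends its kernel console to a serial port (for example by booting with console=ttyS0,115200 on the kernel command line) and that a second machine reads the other end through an already configured device such as /dev/ttyS0. The device path and the function name are assumptions for illustration.

```python
import os
import time


def log_stream(source, sink, clock=time.time):
    """Copy lines from source to sink, prefixing each with a timestamp.

    Every line is flushed (and fsynced when sink is a real file) right away,
    so the last words of a crashing kernel are not lost in a buffer.
    Returns the number of lines written.
    """
    count = 0
    for line in source:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        sink.write(f"[{stamp}] {line.rstrip()}\n")
        sink.flush()
        try:
            os.fsync(sink.fileno())
        except (AttributeError, OSError, ValueError):
            pass  # in-memory or non-file sinks cannot be fsynced
        count += 1
    return count
```

On the logging machine this would be driven by opening the serial device for reading and a log file for appending, then passing both to log_stream. The timestamps make it easy to find whatever the kernel printed in the seconds before the hang.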

score 0 · Answer 3 · answered Aug 21 '11 at 20:09

Does it "die" when there is nothing to do for a while? Then power-saving may be the problem here. Try to disable it completely or at least prevent it from switching a CPU or core into C-sleep-state.

I've got a bunch of Dell servers that show the strangest errors if C-states are enabled in the BIOS power-saving settings.
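
To check whether the kernel is actually using deeper C-states, a sketch like this one (Python 3) reads the cpuidle entries in sysfs. It assumes the kernel exposes /sys/devices/system/cpu/cpu0/cpuidle, which depends on kernel version and driver; the function names are hypothetical.

```python
import glob
import os
import re


def _read_text(path):
    """Return the stripped contents of a small sysfs file, or None."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def _state_index(path):
    """Sort key: numeric suffix of a 'stateN' directory name."""
    match = re.search(r"(\d+)$", path)
    return int(match.group(1)) if match else -1


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return [(state_name, times_entered)] for one CPU's idle states."""
    states = []
    pattern = os.path.join(cpu_dir, "cpuidle", "state*")
    for state_dir in sorted(glob.glob(pattern), key=_state_index):
        name = _read_text(os.path.join(state_dir, "name"))
        usage = _read_text(os.path.join(state_dir, "usage"))
        if name is None or usage is None or not usage.isdigit():
            continue  # incomplete entry; skip it
        states.append((name, int(usage)))
    return states
```

If the deeper states show large usage counts, disabling C-states in the BIOS is the direct test. Limiting them from the kernel command line (processor.max_cstate=1 is the commonly cited parameter) may also work, but whether it takes effect depends on the kernel and idle driver in use.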

Do you know what kind of hardware is being used (make, model, CPU - probably Intel)?
