What's the first thing you check when an untouched unix server starts going berserk?

Question

So you have this neatly set up unix server. It's super fast, works swell, and everything is great for months. Then suddenly all kinds of weird errors start showing up for a variety of different services, and none of them make a lot of sense on their own, much less together.

What are cheap things you should check as soon as you get your ssh session to the machine?

I'm especially interested in trauma stories that highlight non-obvious commands and rare situations, but I guess what's obvious varies from person to person, so we can just list them all freely.

Avery Payne · Accepted Answer · 2009-06-10T08:19:01.790

First Order: Is it responsive?

If you can't log in, there are bigger problems afoot. This generally comes in two flavors: hardware failure and software failure. Both are potentially catastrophic. To prevent DFA errors, check the general hardware health first; a simple glance-over will usually suffice.

Second Order: Are the system's underlying structures in good health and order?

Check the "Golden Triad" of systems:

Enough CPU time is free for processing
Enough disk space is free for storage
Enough memory is free for workloads

In the last few decades, the triad has expanded into a "quad" which includes communications (networking):

Connectivity is functional, responsive, and has capacity

Third Order: What is the severity of the issue?

What programs or services are affected? In decreasing order of severity, is it systemic (system-wide), clustered (a group of programs), or isolated (a specific program)? Clusters of programs typically are tripping up because a specific underlying service has failed or gone unresponsive. Systemic issues are sometimes related to this (think DNS or IP conflicts) but knowing where to look is usually the key.

Fourth Order: Are diagnostic tools providing useful data relevant to the issue?

Now that you have info about the health of the system (second order) and what parts of it are experiencing issues (third order), this should make it easy to narrow down where the problem is.

Error messages or log files should be a common waypoint on this journey.

CPU issues:

uptime (for the load average)
top
strace
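
A load average only means something relative to the number of cores, so it can help to look at the ratio. Below is a minimal sketch, assuming a Linux box with /proc/loadavg and nproc available; load_per_core is a hypothetical helper name, not a standard tool.

```sh
# load_per_core (hypothetical helper): prints load average divided by cores.
#   $1 = load average, $2 = number of cores
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live values; both sources are Linux-specific, so guard for their presence.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 rest < /proc/loadavg
    echo "load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

A ratio that stays well above 1 suggests more runnable (or I/O-blocked) work than the cores can keep up with; what counts as "too high" depends on the workload.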

Disk space / I-O issues:

df
du
lsof
iostat
vmstat

Memory issues:

free
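
On Linux the "free" figure alone can look alarming because the kernel uses spare memory for caches. A rough sketch that reads the MemAvailable estimate instead, assuming a reasonably recent Linux kernel that exposes that field in /proc/meminfo; mem_available_pct is a hypothetical helper name.

```sh
# mem_available_pct (hypothetical helper): reads /proc/meminfo-style text on
# stdin and prints MemAvailable as a whole percentage of MemTotal.
# Exits non-zero if either field is missing (e.g. on older kernels).
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END {
             if (total > 0 && avail != "") printf "%d\n", avail * 100 / total
             else exit 1
         }'
}

if [ -r /proc/meminfo ]; then
    echo "memory available: $(mem_available_pct < /proc/meminfo)%"
fi
```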

Connectivity issues:

ping
route (and arp and rarp and friends)
iptables, ipchains, ipfw (for those BSD folks out there)
traceroute or mtr
host, nslookup, or dig
netstat

Most common complaint (that I hear):

Email is not being delivered fast enough (more than a minute from send to receipt by the recipient), or the mail server is rejecting my attempt to send. This usually comes down to the rate limiter in Postfix kicking in during a spam-storm, which impacts the ability to accept internal delivery.

A real-life example:

However, this is not always the case. One time, the issue persisted regardless of service restarts; so after 3 minutes it was time to start looking around. CPU was busy but under 100%, yet the load had soared to 15 on a box of just 2 cores, and was threatening to go higher. The top command revealed that the mail system was in overdrive, along with the mail scanner, but there were no amavis child processes to be seen. That was the clue - the mail queue command (mailq) showed some 150+ undelivered messages, over 80% of which were spam, in the last 20 minutes. A quick adjustment to lower the rate limiter (which reduced the intake rate of the spam storm) while increasing the number of child email scanner processes (to help process the backlog), followed by a service restart, resolved the issue and the system was able to complete deliveries in a short time.

The cause of the problem was that the amavis parent process had keeled over dead, and the child processes had eventually all run their course (they self-terminate after so many scans to prevent memory leaks). So there were SMTP processes in postfix attempting to contact...thin air...to do the spam/virus scanning that was needed. The distro I was using had out-of-date packages that would never be updated; as the installation was due to be replaced in a year or so, I manually "overrode" the install to the latest version, which included several bug fixes. I haven't had the same problem since.

score 5 · Answer 2 · answered May 18 '09 at 17:55

usually "who" followed by "last"

A heap of issues on machines I've managed over time have been because of a very loose definition of "untouched" - often someone has done something :)

Mark Regensberg

score 4 · Answer 3 · answered May 18 '09 at 08:00

Well, I'll start.

This one bit me once. I spent hours trying thousands of different things: disabling services here and there, rebooting, etc. What was the problem? Totally out of disk space.

So, here's the first thing I type when debugging a suddenly troubled server:

df -h

I never forget that now. It would have saved me lots of wasted effort. Thought I'd share.
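
To make the disk check harder to misread at 3 a.m., the df output can be filtered down to just the filesystems that are nearly full. This is only a sketch: full_filesystems is a hypothetical helper, the 90% threshold is an arbitrary assumption, and it relies on the portable `df -P` column layout (it will misparse mount points that contain spaces). On Linux, `df -i` is worth a look too, since running out of inodes produces the same "no space left" errors.

```sh
# full_filesystems (hypothetical helper): reads `df -P` output on stdin and
# prints "mountpoint usage%" for filesystems at or above the given percentage.
#   $1 = threshold in percent (defaults to 90, an arbitrary choice)
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5                 # Capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

df -P | full_filesystems 90
```

No output means nothing is at or above the threshold.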

score 2 · Answer 4 · answered May 18 '09 at 08:01

top (or htop)

Oli

or prstat on solaris. – kch May 18 '09 at 08:04

score 1 · Answer 5 · answered May 23 '09 at 06:02

Running something like (at)sar on the host is almost mandatory. The usefulness of being able to get historical snapshots of CPU, network, memory and disk I/O (amongst others) cannot be overstated.

There have been many times that I have been able to diagnose a fault by examining what the host was doing in the past 24 hours, and seeing when things started going awry.

score 1 · Answer 6 · answered May 23 '09 at 06:28

Checking dmesg for any errors - I usually start with a dmesg | tail, because chances are things are still going wrong and the server is still trying to do whatever is causing the error.
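
When the kernel log is noisy, a quick filter can pull out the lines most likely to matter. A minimal sketch follows; kernel_trouble is a hypothetical helper, and the pattern list is an illustrative assumption rather than an exhaustive set of kernel messages.

```sh
# kernel_trouble (hypothetical helper): filters kernel log text on stdin for a
# few strings that often accompany disk, memory, or crash problems.
kernel_trouble() {
    grep -i -E 'i/o error|out of memory|oom-killer|segfault|call trace|read-only'
}
```

Typical use would be `dmesg | kernel_trouble | tail -n 20`; on systems that restrict access to the kernel log, dmesg may need root.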

Andy

score 1 · Answer 7 · answered May 18 '09 at 08:59

If you can, I would always try shutting down all NICs bar the management one.

Chopper3

score 1 · Answer 8 · answered May 18 '09 at 09:01

First thing I check is 'top': are there any strange processes, or ones that hog memory or CPU time?

If nothing turns up there, I'll check 'who' to see if anyone else is on my machine for some reason.

Maybe a filesystem got unmounted; check with 'cat /etc/mtab', and then check '/etc/fstab' to make sure everything will come up right on boot.

Check uptime to make sure the # of users on the box is reasonable (should only be you) and then skim through /var/log/auth.log to see if anything is awry there.

These are catch-alls. Depending on the errors your box is throwing, you may need to examine specific processes that are causing the trouble.
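
The mtab/fstab comparison above can be done by eye, but a rough sketch like the following does it mechanically. missing_mounts is a hypothetical helper; it assumes the usual whitespace-separated fstab and mtab layouts, skips swap and other entries whose mount point is not a path, and skips entries marked noauto.

```sh
# missing_mounts (hypothetical helper): prints mount points that are listed in
# an fstab-format file but absent from a mtab-format file.
#   $1 = fstab-format file, $2 = mtab-format file
missing_mounts() {
    awk 'FILENAME == ARGV[1] { mounted[$2] = 1; next }   # first file: mtab
         $1 ~ /^#/ || NF < 4 { next }                    # comments, short lines
         $2 ~ /^\// && $4 !~ /noauto/ && !($2 in mounted) { print $2 }
    ' "$2" "$1"
}

if [ -r /etc/fstab ] && [ -r /etc/mtab ]; then
    missing_mounts /etc/fstab /etc/mtab
fi
```

Anything it prints is a filesystem that should be mounted according to fstab but currently is not.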

score 0 · Answer 9 · answered May 21 '09 at 23:40

top, df -h, and ALWAYS check /var/log to make sure that partition hasn't filled up. That has caused a total meltdown on me a few times.

Nolte

score 0 · Answer 10 · answered May 23 '09 at 04:30

df -ha

to check if hard drives are full and someone hasn't received the warnings

htop or top

to check that memory and CPU usage aren't abnormally high.

Alternatively, if the box isn't responding, I go into the VMware client and check CPU/RAM from there.

Omegatron

score 0 · Answer 11 · answered Nov 09 '09 at 16:39

On Linux, I usually check dmesg and /var/log/messages or /var/log/syslog. dmesg will indicate if it's a sudden hardware fault; quite a lot of other problems will show up in the system logs.

pjc50

score 0 · Answer 12 · answered Nov 09 '09 at 17:54

I suppose the first thing I do is a disk space check (as others have mentioned). If the simple checks don't reveal a "common" problem then I'll investigate further.

One thing I like to do is capture a snapshot of the system. I can grep these later to look for anything that has caught my eye.

lsof > /tmp/lsof.tmp &
ps auxfw > /tmp/ps.tmp &
netstat -anp > /tmp/netstat.tmp &

From there it's troubleshooting 101, but I find it a bit faster to grep the saved files. And if the condition clears while I'm logged in, I still have something to go on, or something to compare against to look for changes.
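
For the "look for changes" part, one possible approach is to take a second snapshot later and list only the lines that are new. This is a sketch, not part of the original workflow: snapshot_changes is a hypothetical helper, and output with constantly changing columns (CPU time in ps, for example) will produce a lot of noise.

```sh
# snapshot_changes (hypothetical helper): prints lines that appear in the
# newer snapshot file but not in the older one, regardless of line order.
#   $1 = older snapshot, $2 = newer snapshot
snapshot_changes() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -F -x -v -f "$1" "$2"
}
```

For example, after saving a fresh `netstat -anp` to a second file (the name /tmp/netstat.new is just an illustration), `snapshot_changes /tmp/netstat.tmp /tmp/netstat.new` would show sockets that were not in the first capture.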
