
I apologize in advance for not being a proper admin; I'm just a programmer with a server on which I installed Debian Etch plus MySQL, PHP, Apache, and ISPConfig.

So, it had an uptime of more than 900 days without a single problem (there's no significant load on it, just a couple of our services), and then it started behaving badly: it suddenly freezes (only ping works, nothing else), and when I try to restart it via the ISP's interface, it freezes completely. Then I have to request support for a manual restart. After that it works fine for a couple of days, and then the same thing happens again (it has happened three times so far).

Now I have performed a network boot and run fsck (it found 1.1% non-contiguous), and I hope that will help.

My question is: has anyone had a similar experience, and what could be causing such a problem (where only ping works)?

Also, I looked in the system log but found nothing that would indicate a problem. Is there some other log I should look into?


Thanks for all the answers!

Sorry, I haven't registered yet, so I have no option to vote up. But thanks!

First, to clarify: this is a housed server, and the ISP's support provides network boot / reset / manual reset.

It probably is an HDD issue, since after the fsck everything seemed to work fine, until I looked deeper and realized that only the front page works, while the other pages don't (they return a '403 Forbidden' error, a blank page, or a MySQL error...).

SSH also seems to work, but it actually doesn't: I can try to log in and it will refuse a wrong password, but when I enter the correct one, the connection just closes.

I will try to access the files once again through a network boot and back up as much as possible, and then I will have to replace the disk...

Is it possible to clone a disk with errors on it? Is it worth trying, anyway?

UPDATE: Today (one day after I asked the question) it turned out that the HDD is definitely defective. Once again, thanks for your time and help!

Milos
  • Long uptimes are a *bad* thing...it means it hasn't been updated! – Bart Silverstrim Jan 12 '12 at 00:22
  • 900 days of uptime? So, you haven't been patching this box? Are you sure it hasn't been compromised? – MDMarra Jan 12 '12 at 00:23
  • Clarify - you said you're restarting it via an ISP's interface. Is this a hosted system, your own system, a virtual system...? – Bart Silverstrim Jan 12 '12 at 00:24
  • What freezes? The console? Can you ssh into it? Is it doing something in Cron with a high I/O? Are there still cron logs registering or anything to indicate that it's just the console...or can you switch to another virtual console? What are the last bits in the logs being recorded...? – Bart Silverstrim Jan 12 '12 at 00:25
  • If I were you I'd use my backups rather than cloning your disk with errors. – Lucas Kauffman Jan 12 '12 at 13:37
  • It's very common to at least attempt some data recovery off the drive. I've restored from older backups and then run dd_rescue or other recovery tools to try and get the most recent versions of important files. – Brett Dikeman Jan 12 '12 at 21:28

2 Answers


Assuming this is a dedicated physical server:

The next time it freezes, you should have your hosting company plug in a "crash cart" and see what's on the screen (console), or go down yourself. The next time it starts to act up, if you're able to log in, run dmesg and look for error messages; include them by editing your question and pasting them in, or use pastebin.
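For example, something along these lines will surface the most relevant messages (the grep pattern is just a rough starting point, not an exhaustive filter):

    # show the most recent kernel messages
    dmesg | tail -n 100
    # pick out lines that usually indicate trouble
    dmesg | grep -iE 'error|fail|i/o|ata|oops'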

I've snapped photos with a digital camera or cell phone in the past for later reference or to show someone remotely. Any serious kernel messages will most likely be on screen (it depends on how logging is configured); without this information, any answers you get will be essentially wild guesses.

My wild guess is hard drive failure; bring a bootable CD (Ubuntu is probably easiest) and run smartctl -A <hard drive device path here>. You'll get a list of drive health parameters and, more importantly, a log of errors from the drive, if any.
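For illustration, assuming the disk shows up as /dev/sda in the rescue environment (adjust the device path to match yours), the smartctl invocations would look roughly like this:

    # overall health verdict plus the SMART attribute table
    smartctl -H -A /dev/sda
    # the drive's own error log and self-test log
    smartctl -l error -l selftest /dev/sda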

Also: ignore the person who suggested doing an OS upgrade. That is exceptionally dangerous advice.

Update: Yes, it is possible to clone a damaged drive if you don't have good or recent backups; look at GNU ddrescue. It's an advanced tool, though. If money is on the line, send the drive out for recovery, or at least hire a professional sysadmin with data recovery experience.
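As a rough sketch of how GNU ddrescue is typically used (the device names and map/log file are placeholders; the target must be at least as large as the failing disk, and the source should never be mounted while you copy):

    # pass 1: copy the readable areas quickly, skipping around bad spots
    ddrescue -f -n /dev/OLD_DISK /dev/NEW_DISK rescue.log
    # pass 2: go back and retry the remaining bad sectors a few times
    ddrescue -f -d -r3 /dev/OLD_DISK /dev/NEW_DISK rescue.log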

Brett Dikeman

It's possible this is a hardware issue: disk or memory errors, overheating (a clogged fan or air vents), or a network card that went bad. If there don't turn out to be any hardware errors, the first thing I would do is upgrade the system to lenny, then squeeze. It's possible that may automagically fix your problems.

I would also scan the system for bad blocks using badblocks (that's the command name). e2fsck has the following option:

-c     This option causes e2fsck to use badblocks(8) program to do a read-only scan of the device in order to find any bad 
       blocks.  If any bad blocks are found, they are added to the bad block inode to prevent them from being allocated to
       a file or directory. If this option is specified twice, then the bad block scan will be done using a 
       non-destructive read-write test.

So you will be able to avoid disk errors caused by bad blocks.
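For example, from a rescue/network boot with the filesystem unmounted (the device name below is just a placeholder for whatever your partition actually is):

    # read-only bad block scan of the filesystem on /dev/sda1
    e2fsck -c /dev/sda1
    # or, passing -c twice, a slower non-destructive read-write test
    e2fsck -cc /dev/sda1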

Also consider running a memory test using memtest86 or memtest86+. If it finds errors and you feel adventurous, you can feed memtest's output to the kernel and map out any bad memory: http://rick.vanrein.org/linux/badram/
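As a hedged illustration of what that looks like (the address/mask pair below is made up; memtest86+ prints the real BadRAM pattern for your faulty DIMM, and the badram= parameter requires the BadRAM kernel patch from the page above):

    # in the boot loader's kernel line, e.g. GRUB legacy's menu.lst
    kernel /boot/vmlinuz root=/dev/sda1 ro badram=0x01f42000,0xfffff000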

I know for a fact it works very well. I once had a bad DIMM that would predictably crash and burn the system at some point of memory allocation. After running memtest and finding the bad memory area, I used the badram kernel parameter to map it out, and the problem was solved.

aseq
  • **If any of the system hardware is suspect or the machine is unstable, the very last thing you should be doing is upgrading the OS.** That's a sure-fire recipe for ending up with a corrupted OS install. Please edit your answer to remove such suggestions. Also: if you find any bad blocks on a device with badblocks, the hard drive's own internal bad-block remapping mechanisms have failed - it should be replaced outright, not worked around. – Brett Dikeman Jan 12 '12 at 05:55
  • @BeeDee: Here, have +1. Upgrading some piece of software can possibly help if things have been unstable all along. If something has worked 900+ days in a row and suddenly starts to have problems, upgrading software is a Bad Career Move (tm). It's instead time to inspect the hardware. – Janne Pikkarainen Jan 12 '12 at 07:23
  • @BeeDee: I would personally replace the system right away and not bother testing anything. But I suspected the OP doesn't have that luxury, and I was talking from the perspective of getting the system working, no matter what, and worrying about replacement later. I agree the OS upgrade should happen after you've checked and possibly fixed any hardware issues. – aseq Jan 12 '12 at 23:51