I've got a SUSE box with 8 GB of RAM and a ReiserFS filesystem that has been running smoothly for over 4 years with no OS- or hardware-related problems. The box serves a couple of database-driven sites with low to moderate traffic, which results in low I/O, CPU and memory utilization.

Recently the machine hung 3 times within a span of 10 days. The hangs happened at irregular times (e.g. not every time at 00:00). CPU, memory and disk are heavily underutilized, and I've verified that they were also underutilized at the time of each hang, so the sites are not responsible.

Every time the box hangs it still responds to ping, but no other service is usable (ssh, www etc.). I then reboot the box and everything returns to normal (until the next hang).

What I've found in /var/log/boot.msg (possibly happening before and during the hang) in all 3 incidents is "Filesystem is NOT clean" followed by "Replaying journal", which seems to do a lot of work but never gets to 100%:

Reiserfs super block in block 16 on 0xfd03 of format 3.6 with standard journal
Blocks (total/free): 786432/540858 by 4096 bytes
Filesystem is NOT clean
Replaying journal: Trans replayed: mountid 39, transid 12424272, desc 7381, len 9, commit 7391, next trans offset 7374

Replaying journal: |                                        |  0.1%  1 trans
Trans replayed: mountid 39, transid 12424273, desc 7392, len 9, commit 7402, next trans offset 7385

Trans replayed: mountid 39, transid 12424274, desc 7403, len 9, commit 7413, next trans offset 7396
Trans replayed: mountid 39, transid 12424275, desc 7414, len 9, commit 7424, next trans offset 7407

Replaying journal: |                                        /  0.5%  4 trans
Trans replayed: mountid 39, transid 12424276, desc 7425, len 8, commit 7434, next trans offset 7417

Trans replayed: mountid 39, transid 12424277, desc 7435, len 9, commit 7445, next trans offset 7428
Trans replayed: mountid 39, transid 12424278, desc 7446, len 9, commit 7456, next trans offset 7439

Replaying journal: |                                        -  1.0%  7 trans

This went on to 33% on the first incident, and to 58% on the 3rd incident.
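
To compare the three incidents more easily, the replay lines could be summarized with a small script. This is only a rough sketch: the regular expressions are modeled on the excerpt above, and the function name `summarize_replay` and the returned keys are made up for illustration. It counts replayed transactions, reports any jump in the transid sequence, and returns the last progress figure that was logged.

```python
import re

# Modeled on the "Trans replayed" lines in the boot.msg excerpt above.
TRANS_RE = re.compile(
    r"Trans replayed: mountid (\d+), transid (\d+), desc (\d+), "
    r"len (\d+), commit (\d+), next trans offset (\d+)"
)
# Modeled on progress lines such as "...  0.5%  4 trans".
PROGRESS_RE = re.compile(r"(\d+(?:\.\d+)?)%\s+(\d+) trans")


def summarize_replay(text):
    """Summarize a journal replay log (hypothetical helper, not a reiserfs tool)."""
    transids = [int(m.group(2)) for m in TRANS_RE.finditer(text)]
    progress = [(float(pct), int(n)) for pct, n in PROGRESS_RE.findall(text)]
    # A gap is any pair of neighbouring transids that is not consecutive.
    gaps = [(a, b) for a, b in zip(transids, transids[1:]) if b != a + 1]
    return {
        "replayed": len(transids),
        "first_transid": transids[0] if transids else None,
        "last_transid": transids[-1] if transids else None,
        "gaps": gaps,
        "last_progress": progress[-1] if progress else None,
    }
```

It could be fed the file contents, e.g. `summarize_replay(open("/var/log/boot.msg", errors="replace").read())`, once per saved copy of the log.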

Could the halt of the system be reiserfs related?
Any ideas on where I should look next?

thanks a lot

cherouvim

1 Answer


It sounds like you have one or more failing hard drives. If a bad sector is hit during regular use, the system immediately tries to recover the data and marks the filesystem as unclean. At around 4 years old, the disks could very well be developing problems. Most desktop-grade drives only carry a 1- or 3-year warranty, and server-grade drives typically only 3 to 5 years. You might also want to consider running a utility like GRC's SpinRite, which does an amazing job of scanning for problems and refreshing the disks (it is amazingly good at fixing all disk problems that are not the result of physical damage to the platters).
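
One way to look for evidence of bad sectors before reaching for a repair tool is to read the drive's SMART counters. The sketch below is an assumption on my part rather than anything specific to this box: it presumes smartmontools is installed, that the disk is directly visible to the OS (behind a RAID controller smartctl may need its `-d` device-type option), and that the attribute table has the usual ten-column layout printed by `smartctl -A`. The device path and function names are hypothetical.

```python
import subprocess

# Attribute names, as printed by smartctl, that usually relate to failing sectors.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_smart_attributes(output):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value not a plain integer; skip it
    return found


def suspicious(attrs):
    """Keep only the watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items() if raw > 0}


def read_smart(device):
    """Run `smartctl -A` on a device (usually needs root); path is an assumption."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)
```

For example, `suspicious(read_smart("/dev/sda"))` would return the non-zero counters for a hypothetical `/dev/sda`. Non-zero values there would support the bad-disk theory; an empty result does not rule it out.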

TheCompWiz