1

Lately several of my servers hit a write error on an ext3 filesystem and, as a result, remounted the filesystem read-only. Understandably, on a production server this causes severe problems. On reboot the filesystems were fixed, but on large partitions this takes a lot of time. Once the check had corrected several errors, the servers ran well again.

What can I do to minimize how often this happens? I can't find much information on periodically checking the filesystem(s) on a running server. Is it possible to change the way ext3 or the system handles write errors? What would be a sane solution?

All the servers in question run CentOS Linux 5.4 or 5.5.

2 Answers

3

There shouldn't be any write errors at all with ext3 and if there are, you should check for possible hardware defects (most likely damaged disks or maybe cabling problems).

Sven
  • When I check the disk on the server where this problem most recently occurred, SMART says there aren't any problems and the disk is healthy. How else can you assess the health of the disk to determine if it has problems? Also, the servers that experienced these problems have been running fine for weeks on the same disks. – Reinoud van Santen Mar 16 '11 at 12:42
  • Run some smartctl tests; see the `-t` flag in the smartctl man page for details. Many of the tests are suitable for running on a live server. – MadHatter Mar 16 '11 at 15:53
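
A minimal sketch of what running those self-tests might look like. The helper names `start_smart_test` and `show_smart_results` and the device `/dev/sda` are assumptions made for illustration; only the `smartctl` options themselves come from smartmontools. Disks behind a hardware RAID controller may need an extra `-d` option, which is not shown here.

```shell
# Hypothetical helper names; /dev/sda in the example is an assumed device.

# Start a background self-test. "short" takes a few minutes; "long" reads
# the whole surface and can take hours. The disk stays usable meanwhile.
start_smart_test() {
    smartctl -t "${2:-short}" "$1"
}

# After the test has finished: overall health verdict, attribute table
# and the self-test log.
show_smart_results() {
    smartctl -H -A -l selftest "$1"
}

# Example (as root):
#   start_smart_test /dev/sda long
#   show_smart_results /dev/sda
```

Only one self-test runs at a time, so wait for the first to finish before starting another. A disk can report an overall "PASSED" verdict and still fail a long self-test, so the self-test log is worth reading as well.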
0

You could mount your filesystems with the `-o errors=continue` option; check `man mount` for details. However, this is not recommended, and I agree with SvenW. If you have a hardware RAID card, run some checks on it and force it to verify the integrity of your array. How about the cables? Are you sure they are intact? As for periodically checking the filesystems on a running server: they must be unmounted for that. You could pick overnight hours, if possible.
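
A rough sketch of how the error policy could be inspected and changed, and how an offline check might be run during a maintenance window. The helper names and the device `/dev/sdb1` are assumptions made for illustration, and `offline_check` assumes the device has an `/etc/fstab` entry so that `mount` can find its mount point.

```shell
# Hypothetical helper names; /dev/sdb1 in the example is an assumed device.

# Print the on-error policy stored in the superblock
# (continue, remount-ro or panic).
show_error_policy() {
    tune2fs -l "$1" | grep -i '^Errors behavior'
}

# Store a new default policy in the superblock. An "errors=" mount
# option overrides this default for that mount.
set_error_policy() {
    tune2fs -e "$2" "$1"
}

# Forced check of an unmounted filesystem. e2fsck exit codes 0 (clean)
# and 1 (errors corrected) are treated as safe to mount again.
offline_check() {
    umount "$1" || return 1
    status=0
    e2fsck -f "$1" || status=$?
    if [ "$status" -le 1 ]; then
        mount "$1"
    else
        echo "e2fsck exited with $status; leaving $1 unmounted" >&2
        return "$status"
    fi
}

# Example (as root, during a maintenance window):
#   show_error_policy /dev/sdb1
#   offline_check /dev/sdb1
```

Keeping `remount-ro` is usually the safer policy: with `continue`, the kernel keeps writing to a filesystem it already knows is inconsistent.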

grs