How to diagnose very bad and slow ext3 behavior?

Question

I'm managing an old admin server running Red Hat WS 4 Update 3, with an ext3 volume mounted on /opt that held a large (30GB) SQLite database.

Every time I run large queries or inserts against this database, I/O wait climbs so high that we can no longer log in to the server, sudo to another user, or edit a crontab file (vi never quits).

I'm replacing SQLite with MySQL, and while backing up the 19GB MySQL directory I run into the same problem.

Note that these operations are run as a regular user. The server is a ProLiant DL385 G1 running the 64-bit kernel 2.6.9-34.ELsmp.

I'm now considering remounting the volume as ext2 to see if journaling is the source of my problem, but I honestly don't know what to check next.

Every serious file copy ends up blocking the server for other users trying to log on, and the server returns to normal once the copy ends.

I need pointers on where to look next to explain this behavior (an old disk getting slower? a kernel with a known bug? a corrupt journal triggering thousands of superfluous reads/writes? etc.)

Thanks in advance.

If you do any other types of IO operations on that volume (eg, dd 5GB of /dev/zero to a file), do you see the same behavior? Is the underlying volume a hardware/software RAID array or a single disk? You should be able to use smartctl to check SMART data for the device, and run SMART tests on it. If it's a hardware RAID array, does the management software give any indication of a failed/failing drive? Does looking through dmesg or /var/log/messages give any indication that there are unreadable or unwriteable sectors? I'm sort of thinking bad drive.... — Kendall, Sep 07 '11 at 17:16
How much RAM does it have? These things often shipped with just 1GB which will cause massive thrashing with serious workloads. — David Schwartz, Sep 07 '11 at 17:55
Server has 4GB memory and 8GB swap (not used at the moment). — Baramin, Sep 08 '11 at 09:16
@Kendall: I can reproduce the problem with dd if=/dev/zero of=file, and CPU I/O wait stays high for quite some time after I Ctrl-C the dd. I just managed to give my server a lot more breathing room by modifying syslog.conf and setting all log files to buffered mode. I think a big part of my problem is that this server receives a ton of syslog entries and flushes to disk after every one, which greatly hurts the performance of my large file operations. — Baramin, Sep 08 '11 at 11:40

score 2 · Answer 1 · answered Sep 08 '11 at 12:23

Replying to my own question, as I finally found the real source of the problem.

1. syslog.conf was configured to log to files and flush immediately after every message.
2. Our proxies were recently configured to use this server's syslog to log LDAP authentication attempts. These happen at a rate of several per second because of stupid (or misconfigured) update programs, à la Adobe updater.

In the end, the server was CONSTANTLY flushing buffers to disk, and that showed every time we tried to write to big files.
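
For anyone hitting the same thing, here is a minimal sketch of the kind of change involved, assuming the stock sysklogd syntax of syslog.conf. The selector and the log file path below are placeholders, not the actual entries from this server. In that syntax, a "-" in front of the file name tells syslogd not to sync the file after every message.

```
# /etc/syslog.conf (sysklogd syntax)
# The selector and path are illustrative assumptions, not this server's real entries.

# Synchronous: the file is synced to disk after every single message.
authpriv.*                      /var/log/secure

# Buffered: the leading "-" omits the sync after each write.
authpriv.*                      -/var/log/secure
```

Only one of the two forms should be kept for a given file. syslogd has to reread its configuration (for example after a SIGHUP) before the change takes effect. The trade-off is that the most recent messages can be lost if the machine crashes right after a write.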
