2

I have a web site that uses lighttpd and CGI scripts.

After upgrading to Fedora 11 (ext4), disk access became erratic. The time taken by python -c 'import cgi' varies from 0.1 to almost 10 seconds (graph).
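A measurement like the one above can be collected with a small script. This is only a sketch: the helper name time_command and its parameters are made up for illustration, and it records wall-clock time only, so it shows that a run was slow but not why.

```python
import subprocess
import sys
import time


def time_command(argv, runs=5, pause=0.0):
    """Run argv `runs` times; return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; a non-zero exit status is ignored on purpose.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
        time.sleep(pause)
    return durations


# Same command as in the question, run with the current interpreter.
# (The cgi module was removed in Python 3.13; the run is still timed.)
samples = time_command([sys.executable, "-c", "import cgi"], runs=3)
print("min %.3fs  max %.3fs" % (min(samples), max(samples)))
```

Raising runs and pause (for example, one sample per second for a few minutes) gives data comparable to the graph.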

How can I diagnose the problem? (Tools, methods, best practices ...)

Update Jul 30, 2009:
I found out that several CGI processes were hogging the drive. After killing them, the graph is stable between 0.02 and 0.03 seconds. I still haven't gotten an answer on how to diagnose such problems.

Miki Tebeka
  • 158
  • 5
  • You'll need to provide some additional information - drive configurations, is it part of an md set, does it use LVM (and if so, are you running many snapshots or using mirroring), are you running other programs on the same system, do you have smartd enabled, etc. For all I know, the drive is simply dying. – Avery Payne Jul 29 '09 at 04:48

3 Answers

1

If it is a fresh install, then tools like makewhatis (which builds the database used by apropos and whatis) might cause heavy disk use. Wait a few days for things to stabilize (updatedb, prelink, makewhatis, etc.); the timings may then become consistent.

It also depends on what else you are doing on the server and on what the CGI script is actually doing: where it takes its input from, the size of the input, etc.

Also, if the disk is very old, use diagnostic tools (like Seagate SeaTools) to look for controller or bad-sector problems. The tools will also let you optionally repair the sector if the drive is actually from Seagate.

Saurabh Barjatiya
  • 4,643
  • 2
  • 29
  • 34
  • Server was installed more than 20 days ago so I guess it's not the apropos. Is there a way to know which process is writing to disk? – Miki Tebeka Jul 29 '09 at 15:55
  • I do not know a method of finding which process is writing to disk. But you can use the iostat command to see which partition is being written to. That may help narrow things down if you have separate partitions like /var, /etc, etc. It will also confirm whether the delay is actually due to some other process using the disk. If iostat shows no other disk activity, then you should check the code. Also run the program on other computers (ext3, other Fedora releases) and plot the response-time graph. If those are irregular too, then the problem is not with ext4 or Fedora 11. – Saurabh Barjatiya Jul 30 '09 at 04:18
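On the question of which process is writing to disk, one possible approach is to read the per-process counters in /proc/<pid>/io. This is a rough sketch, assuming a Linux kernel built with task I/O accounting; the function names parse_proc_io and top_writers are made up for illustration, and reading other users' counters generally requires root.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def top_writers(limit=5, proc_root="/proc"):
    """Return up to `limit` (write_bytes, pid, name) tuples, largest writer first."""
    rows = []
    if not os.path.isdir(proc_root):
        return rows
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                counters = parse_proc_io(f.read())
            with open(os.path.join(proc_root, entry, "cmdline"),
                      errors="replace") as f:
                # Arguments are NUL-separated; kernel threads have none.
                name = f.read().split("\0")[0] or "?"
        except OSError:
            # The process exited, or we are not allowed to read its counters.
            continue
        rows.append((counters.get("write_bytes", 0), int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


for written, pid, name in top_writers():
    print("%12d bytes written  pid %-7d %s" % (written, pid, name))
```

The counters are cumulative since each process started, so to see who is writing right now, take two samples a few seconds apart and compare them.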
0

Do you really need/want ext4 on a production server? It's still a mighty bit too green for my taste for a server.

Sven
  • 97,248
  • 13
  • 177
  • 225
0

The only way to diagnose a problem like this is with lots and lots of data. Familiarize yourself with vmstat and iostat. A tool I recently learned about in this thread is dstat, which effectively combines the two.

For problems like the one you're describing, this command would likely be useful:

$ dstat -M app -cdnygl

It will report on CPU, I/O (disk and net), interrupts and context switches, paging, and load average. As a nice little bonus, it will include the name of whatever process was "most expensive" at the time the snapshot was taken. Unfortunately, that particular command produces output too wide to paste here, so here's a somewhat more conservative version:

$ dstat -M app -cdn
--most-expensive-- ----total-cpu-usage---- -dsk/total- -net/total-
     process      |usr sys idl wai hiq siq| read  writ| recv  send
bacula-fd        0|  1   0  98   0   0   0| 426k  108k|   0     0 
bash             1|  2   2  96   0   0   0|   0    20k|1460B 1804B
apache2          8|  4   2  94   0   0   0|   0     0 |  76k   15k
                  |  1   3  96   0   0   0|   0     0 |1132B 1034B
apache2          1|  2   2  96   0   0   0|   0  8192B|  11k 3895B
                  |  2   1  96   0   0   0|   0    32k|3322B 1338B
kipmi0           1|  2   2  96   0   0   0|   0     0 |1309B 1146B
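The wai column above is the share of CPU time spent waiting on I/O. As a rough illustration of where such a number comes from (this is not dstat's own code), the sketch below computes it from two samples of the aggregate cpu line in /proc/stat. It assumes Linux; the helper names are made up for illustration.

```python
import os
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two counter samples."""
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    # Only the first eight are summed; guest time is already part of user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total > 0 else 0.0


def read_cpu(path="/proc/stat"):
    """Read the current aggregate CPU counters."""
    with open(path) as f:
        return parse_cpu_line(f.read())


if os.path.exists("/proc/stat"):
    first = read_cpu()
    time.sleep(1)  # sampling interval in seconds
    second = read_cpu()
    print("iowait over the last second: %.1f%%" % iowait_percent(first, second))
```

A consistently high value while the timings are bad suggests that processes are stalled on the disk rather than on the CPU.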
Insyte
  • 9,314
  • 2
  • 27
  • 45