
I used iostat and am trying to understand its output using this link: https://linoxide.com/linux-command/linux-iostat-command/ but it is not clearly explained.

Please find the output at different times from 18:00:00 to 18:45:00 below.
We saw a delay in the database write operation at 18:45:00.

18:00:00

Linux 4.1.0-0.Node1.1-amd64 (Node1)        05/25/2018      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.58    0.00    0.84    0.14    0.00   96.44

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              20.45       318.11       272.61    4801729    4114908
loop0            15.24        15.29         0.00     230867          0

18:10:00

Linux 4.1.0-0.Node1.1-amd64 (Node1)        05/25/2018      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.54    0.07    1.67    0.28    0.00   93.44

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              66.79       450.28      6292.63    7071585   98824332
loop0            14.65        14.70         0.00     230867          0

18:20:00

Linux 4.1.0-0.Node1.1-amd64 (Node1)        05/25/2018      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.48    0.17    2.46    0.66    0.00   90.23

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             115.61      1712.27     13585.10   27917361  221496016
loop0            14.11        14.16         0.00     230892          0

18:30:00

Linux 4.1.0-0.Node1.1-amd64 (Node1)        05/25/2018      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.17    0.30    3.11    1.02    0.00   87.40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             158.97      2682.65     19061.12   45347101  322206568
loop0            13.61        13.66         0.00     230892          0

18:40:00

Linux 4.1.0-0.Node1.1-amd64 (Node1)        05/25/2018      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.73    0.45    3.30    1.40    0.00   86.12

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             175.77      3647.19     19549.58   63840689  342197396
loop0            13.14        13.19         0.00     230930          0

18:45:00

Linux 4.1.0-0.Node1.1-amd64 (Node1)        05/25/2018      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.98    0.46    3.55    1.57    0.00   85.45

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             181.99      4116.60     19870.20   73291969  353769384
loop0            12.92        12.97         0.00     230930          0

We saw the delay in the database write at 18:45:00, so I would like to understand the I/O operations from 18:00:00 to 18:45:00. Does the output at 18:45:00 show any problem with I/O?

Harry
  • The numbers are not detailed enough to draw any meaningful conclusions (you would need to graph those and more metrics with 1 or 5 s resolution to see spikes, and you need to include service time, queue utilization and similar values). The only thing you can see is that the load is very write-heavy and ramping up. – eckes May 29 '18 at 21:09

1 Answer


I'm going to assume this is a SATA disk system, so 181 TPS is about the limit. You can try iostat -x 1 to get extended statistics (example below). Note the %util column: it tells you how much read/write load the disk is under. I would guess it is nearing 100% and is causing the database problems. Another useful metric is svctm, which tells you how long the disk takes to complete a given I/O operation; the higher the number, the worse the situation.

Investigate what else is running at this time (DB dumps running? backups? slocate/mlocate?) that may be causing I/O bottlenecks.
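As a rough sketch of how to check (assuming the sysstat package is installed for pidstat, and iotop is available), you can sample per-process I/O while the slowdown is happening:

$ pidstat -d 5         # per-process read/write rates, sampled every 5 seconds
$ sudo iotop -o        # show only processes that are currently doing I/O

Whatever sits at the top of those lists during the 18:40-18:45 window is your prime suspect.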

$ iostat  -x
Linux 4.4.0-124-generic (nebulus)         05/29/2018      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.02    0.00    0.03    0.00    0.00   99.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.35    0.18    0.40     3.23     7.03    35.07     0.00    1.17    0.71    1.38   0.29   0.02
dm-0              0.00     0.00    0.09    0.24     1.01     0.87    11.24     0.00   10.91    0.76   14.68   0.18   0.01
dm-1              0.00     0.00    0.07    0.29     1.69     4.17    32.90     0.00    2.33    0.59    2.74   0.23   0.01
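One caveat when reading the output in your question: each block looks like a single iostat invocation, and the first report from iostat shows averages since boot, so the kB_wrtn/s column understates the recent load. The rate over the last interval has to be derived from the cumulative kB_wrtn counter instead. A back-of-the-envelope sketch using your 18:40 and 18:45 samples (300 seconds apart):

$ echo $(( (353769384 - 342197396) / 300 ))    # kB written per second between the two samples
38573

That is roughly 38 MB/s of sustained writes, about double the since-boot average shown in the column, which fits the write-heavy ramp-up noted in the comment.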

If you are still nearing 100% utilization, talk to your programmer and see if the DB queries can be made more efficient. If that isn't an option, then throw money at faster disk hardware (move to RAID 0 or SSDs).

High disk utilization can also be caused by the system having too little available RAM. As soon as memory becomes tight, swapping starts and everything slows to a crawl.
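A quick way to check for that (standard tools, nothing extra needed):

$ free -h      # how much RAM and swap are currently in use
$ vmstat 5     # watch the si/so columns; sustained non-zero values mean active swapping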

Just a couple of places to start looking.
