
I'm really struggling with a performance issue at the moment: all my database queries take up to 1,000 times longer than normal when my database server is under "heavy" load.

I have a server that runs Ubuntu 12.04 and hosts my PostgreSQL 9.2 database. The server is at Hetzner.de and is an EX6S (8-core Xeon with 32 GB RAM and two 3 TB HDDs in a RAID 1 setup using Ubuntu's software RAID). When I run a high volume of queries (which I often do at night), I see that almost all CPU usage is spent on CPU I/O wait. I have New Relic monitoring installed, and I can't seem to find any other indication of what the root cause of this CPU I/O wait could be, but it is clearly a bottleneck for my performance and throughput.
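
For reference, a quick way to watch that iowait figure directly on the box (assuming the standard procps/sysstat tools are installed) is something like:

$ vmstat 1   # the "wa" column is the percentage of CPU time spent waiting for I/O
$ top        # "%wa" in the CPU summary line; press 1 for per-core figures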

So the question is, what is this CPU I/O wait and what is it waiting for?

I've added all the overview charts I can get hold of from New Relic below. What am I overlooking? There must be an obvious bottleneck somewhere. Where should I dive in?

Database server CPU usage - The evil one that tells me something is wrong https://rpm.newrelic.com/public/charts/cEdIvvoQZCr

Database server load average https://rpm.newrelic.com/public/charts/cMNdrYW51QJ

Database server physical memory https://rpm.newrelic.com/public/charts/c3dZBntNpa1

Database server disk I/O utilization - as you can see, the disk doesn't seem to be utilized very much (almost nothing) https://rpm.newrelic.com/public/charts/9YEVw6RekFG

Database server network I/O (Mb/s) - all communication goes over an internal gigabit network https://rpm.newrelic.com/public/charts/lKiZ0Szmwe7

Top 5 database operations by wall clock time https://rpm.newrelic.com/public/charts/dCt45YH12FK

Database throughput https://rpm.newrelic.com/public/charts/bIbtQ1mDzMI

Database response time https://rpm.newrelic.com/public/charts/fPcNL8WA6xx

UPDATE:

After running sudo iostat -k 1, I'm starting to get suspicious. I get a lot of output like the following that I don't see in New Relic:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc             546.00      2296.00      6808.00       2296       6808
sdd             593.00      1040.00      7416.00       1040       7416
md1               0.00         0.00         0.00          0          0
md0               0.00         0.00         0.00          0          0
md2            1398.00      3328.00     13064.00       3328      13064
md3               0.00         0.00         0.00          0          0
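
To attribute that I/O to specific processes (along the lines of the iotop approach Vlad mentions in the comments), something like this should work, assuming the iotop and sysstat packages are installed:

$ sudo iotop -o    # only list processes/threads that are actually doing I/O
$ pidstat -d 1     # per-process kB read/written per second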
Niels Kristian
  • Did you manage to find out what was happening? We're experiencing a similar situation (higher than normal CPU wait and I/O activity). Using iotop we traced it to autovacuum, but we still don't know what's going on: http://serverfault.com/questions/474227/why-is-autovacuum-using-so-much-io-when-running-on-a-partitions-master-table-af – Vlad Jan 31 '13 at 15:23

1 Answer


In short, you need faster disks.

When processes are in iowait, it means they have issued an I/O request and are waiting for the results of that request. With adequate RAM, most of your working set will be cached in RAM, so reads will not contribute much to this I/O contention; writes are typically the culprit.
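
As a rough sanity check of that assumption, you can compare logical cache hits against physical reads in PostgreSQL's own statistics (the database name below is just a placeholder):

$ psql -d yourdb -c "SELECT datname, blks_read, blks_hit FROM pg_stat_database ORDER BY blks_read DESC;"

If blks_read stays small relative to blks_hit, your reads really are being served from cache and the iowait is write-driven.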

Regarding your disk I/O graph, I suspect that New Relic is, for some reason, not collecting I/O metrics correctly. I'd recommend looking into a secondary data collection system like Munin. Short of that, you can watch disk I/O in real time by issuing the following command:

$ iostat -k 1

This will output real-time disk reads and writes, in kB/s. I suspect you'll see quite a bit of activity there that New Relic is not seeing.
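
If you want more detail, the extended statistics mode shows per-device latency and utilization, which tells you directly whether the disks are saturated:

$ iostat -dxk 1   # "await" = average ms per request; "%util" near 100% means the device is saturated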

EEAA
  • I think you might be right about NR. See my update... Would you say that that is "quite a bit of activity"? – Niels Kristian Nov 30 '12 at 02:33
  • I got the advice from someone else that I could set up a couple of SSDs and move my pg_xlog dir onto those, as this should be the write-heavy part of Postgres. Is that something you are familiar with? – Niels Kristian Nov 30 '12 at 02:38