I have a database running on GCP. Occasionally it gets very slow for a period of minutes (like average statement execution time spikes by 10x or more). The slowness is correlated with increases in the await output from iostat (system.io.await metric in the image below). Normally this is around 500µs, but during the outages it's spiked up to 20ms.

My first guess was that this indicated the disk was saturated, but {r,w}{,kb}_s were all within the normal range that the instance has gracefully handled (with normal await):


My second guess was that maybe we had a noisy neighbor on the persistent disk, but I failed over the database to a different VM and the problem persisted.

What else could be causing the spikes in await? Also, what tools or tests would be best for diagnosing this?

Ben Kuhn

2 Answers


The commands and tools you are using are well suited to debugging problems on a physical disk, but storage in the cloud is structured completely differently. A Persistent Disk on GCP is not a real disk: it is a virtual volume backed by many physical devices, and those devices rely on the Google network and other infrastructure to work. The following video explains how it works in more detail:


So, there are many factors that determine the performance of the Storage Volumes. According to the official documentation, you can review persistent disk performance metrics in Cloud Monitoring, Google Cloud's integrated monitoring solution.

This other document can also help you check your disk performance.

If your workload has a bursty I/O usage pattern, expect to see bursts in throttled bytes corresponding to bursts in read or written bytes.

Databases are a common example of bursty workloads. Databases tend to have short microbursts of I/O operations, which lead to temporary increases in queue depth. Higher queue depth can result in higher latency because more outstanding I/O operation requests are waiting in queue.
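To make the queue-depth point concrete, here is a minimal Python sketch based on Little's Law (mean outstanding requests = throughput × mean latency). This is a general queueing identity rather than anything taken from the GCP documentation, and the 2,000 IOPS figure is an assumed value for illustration; only the 500µs and 20ms latencies come from the question.

```python
def mean_outstanding_ios(iops: float, latency_s: float) -> float:
    """Little's Law: average number of I/Os in flight = arrival rate * mean latency."""
    return iops * latency_s


# Assumed, illustrative throughput; the question does not give an IOPS number.
ASSUMED_IOPS = 2000

normal = mean_outstanding_ios(ASSUMED_IOPS, 500e-6)  # await of about 500 microseconds
spike = mean_outstanding_ios(ASSUMED_IOPS, 20e-3)    # await of about 20 milliseconds

print(f"normal: ~{normal:.1f} I/Os in flight, spike: ~{spike:.1f} I/Os in flight")
```

If throughput stays flat while await grows by 40x, the average number of requests in flight grows by the same factor. That would be consistent with delay being added somewhere behind the block device rather than with the VM issuing more I/O.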

Because so many systems are involved in cloud storage, I recommend contacting GCP Support if you want more detail about what is happening with your performance. They should have more tools to troubleshoot your issue.

If you have a free trial account, you can get chat support through the Console Support Center. You can also visit the following link for more information.

Alternatively, you could purchase a support plan in order to open technical support cases by phone and chat.

  • 1
    Thanks so much, that's useful to know! Unfortunately it's actually a managed DB, so I'm stuck haranguing them to get additional metrics or file GCP tickets, but it sounds like that's our best option... – Ben Kuhn Sep 29 '20 at 00:02
  • 1
    @BenKuhn if you find my answer useful, please consider accepting it, thank you! – Jose Luis Delgadillo Sep 30 '20 at 19:49

Update: I didn't notice you were using cloud-based storage until now. What I've noted only really applies to physical disks or fabric.

Have you checked IOPS? `iostat 1` will give you the `tps` column, which is transfers per second (the man page says this is the number of transfers per second issued to the device, so it is pretty close to IOPS). Maybe the DB is throwing hundreds of ops/sec at the disk and that is causing the high await time.
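If you want to cross-check what iostat reports, here is a rough Python sketch that derives per-second read and write completions and the average await from two samples of `/proc/diskstats`. It assumes the standard Linux field layout (reads completed, ms spent reading, writes completed, ms spent writing after the device name). The names `DiskSample`, `parse_diskstats_line`, and `rates` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    name: str
    reads: int      # reads completed
    read_ms: int    # total ms spent on reads
    writes: int     # writes completed
    write_ms: int   # total ms spent on writes


def parse_diskstats_line(line: str) -> DiskSample:
    """Parse one /proc/diskstats line (assumed layout: major minor name stats...)."""
    f = line.split()
    return DiskSample(f[2], int(f[3]), int(f[6]), int(f[7]), int(f[10]))


def rates(prev: DiskSample, cur: DiskSample, interval_s: float) -> dict:
    """Per-second completions and mean await (ms) between two samples."""
    d_reads = cur.reads - prev.reads
    d_writes = cur.writes - prev.writes
    d_ms = (cur.read_ms - prev.read_ms) + (cur.write_ms - prev.write_ms)
    ios = d_reads + d_writes
    return {
        "r_s": d_reads / interval_s,
        "w_s": d_writes / interval_s,
        # Mean time per completed I/O; 0.0 if nothing completed in the interval.
        "await_ms": d_ms / ios if ios else 0.0,
    }
```

In practice you would read `/proc/diskstats` twice, a known number of seconds apart, pick the line for your device from each read, and pass both parsed samples to `rates`. Seeing await rise while `r_s` and `w_s` stay flat would match what the question describes.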

Server Fault
  • 1
    I have metrics for `r_s` and `w_s` but not `tps` (I think). Is `tps = r_s + w_s`? If not, what's the difference? (`r_s` and `w_s` were both well within normal range.) – Ben Kuhn Sep 25 '20 at 14:14
  • 1
    The man page describes `tps` as what was *issued* to the disk, whereas `r/s` and `w/s` are *completed* (after merges, whatever that means), but they might be close enough. It's probably best to pick one metric or the other and stick with it while debugging and making changes. – Server Fault Sep 28 '20 at 20:26