
Note that while this question is somewhat Redis-specific, the underlying problem is generic: one process takes so much HDD I/O write bandwidth that other processes can't write anything.

We've got an Ubuntu VM inside an Ubuntu-based Xen XCP host (installed on two HDDs in software RAID1). That VM is running a Redis server under a load of about 2K commands/s.

Problem: when that Redis server does a BGREWRITEAOF, it blocks its clients for about 10 seconds.

Details:

Only AOF persistence is used, no RDB. Redis is configured to fsync the AOF file once per second.
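
For reference, the corresponding redis.conf settings would look roughly like this (a sketch; the exact file on this box may differ):

# AOF on, fsync once per second, RDB snapshots disabled
appendonly yes
appendfsync everysec
save ""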

On BGREWRITEAOF, Redis forks and does all the disk-intensive work in the child process. Meanwhile, the main process keeps appending data to its AOF file.

BGREWRITEAOF takes about 10 seconds (1.5 GB of data at 150 MB/s disk write speed, so 1.5 GB / 150 MB/s ≈ 10 s). The child process doing the rewrite consumes all of the HDD I/O write throughput.

The parent process attempts to fsync; the fsync takes more than two seconds, Redis's data-protection logic kicks in, and a blocking write is issued, stalling the parent process until BGREWRITEAOF is finished with the disk.
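
The stall is easy to observe from a client's point of view: for example (assuming redis-cli is available on the box), watch latency in one terminal while triggering a rewrite from another:

redis-cli --latency
redis-cli BGREWRITEAOF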

Here is the detailed info and discussion that led me to the above interpretation of events.

Question: it looks fishy to me that a single process is allowed to take so much disk I/O that everything else is blocked. Is there anything I can do at the system level to fix that? I'm OK with BGREWRITEAOF taking a little more time, as long as the parent process is allowed to save its data while the rewrite is active.

Please note that I'm aware of workarounds, like moving AOF persistence to a slave, using the no-appendfsync-on-rewrite Redis config option, etc.; this question is specifically about resolving the problem, not working around it.
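
(For reference only, since it's explicitly out of scope here: the Redis-side workaround would be a single line in redis.conf.)

# trade durability during rewrites for responsiveness (the workaround I'm ruling out)
no-appendfsync-on-rewrite yes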

ewwhite
Alexander Gladysh

2 Answers


AFAICS you can try changing the I/O scheduler. Try this command:

echo cfq > /sys/block/$DEVICE/queue/scheduler

where $DEVICE is your RAID1 disk. This command installs the 'Completely Fair Queueing' (CFQ) scheduler for that device.
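
You can first check which schedulers are available for the device (the active one is shown in brackets):

cat /sys/block/$DEVICE/queue/scheduler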

apatrushev
    I'd probably stay away from the [CFQ I/O scheduler](http://en.wikipedia.org/wiki/CFQ). It can be more of a problem on server systems and virtual guests. It's also probably the default for this Ubuntu system. – ewwhite Oct 30 '12 at 09:21
  • It says `noop [deadline] cfq` on VM. – Alexander Gladysh Oct 30 '12 at 09:34

I would suggest changing your I/O scheduler and applying some light tuning. While I don't have a comprehensive tuning guide, some of the answers and suggestions detailed in this question may help you as well.

Consider changing the I/O elevator to the deadline or noop algorithm and retest. You can make this change on the fly using the technique detailed in another answer. To make it persistent across reboots, add elevator=deadline to the GRUB kernel command line.
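
On Ubuntu, a minimal sketch of the persistent change (assuming the stock GRUB2 setup; your existing kernel options may differ):

# in /etc/default/grub, append elevator=deadline to the existing options
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=deadline"

# then regenerate grub.cfg and reboot
sudo update-grub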

Perhaps some detail about the underlying hardware or host system setup would help. Is there any battery-backed or flash-backed write cache on the storage subsystem? That can make a difference.

Finally, you can try some light benchmarking/monitoring tools to see what's going on. If you have access to iostat, for instance, you can run it in another terminal window as you test your application.

E.g. `iostat -x 1` will run with 1-second samples and provide some indication of read/write speed, I/O service time, and wait time. I also like `collectl` for this purpose.
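
For instance (assuming collectl is installed), disk detail with 1-second samples:

collectl -sD -i 1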

ewwhite
  • See dstat and iostat dumps here: https://groups.google.com/forum/#!msg/redis-db/vSAvnYVtX9w/NRYlBwOrCTsJ – Alexander Gladysh Oct 30 '12 at 09:35
  • It seems that I'm using `deadline` scheduler. Will try `noop` and (for completeness) `cfq`. – Alexander Gladysh Oct 30 '12 at 09:40
  • Okay. I see... You have no write cache for your storage, so you're at a disadvantage there. You're literally waiting for rotating disks to acknowledge I/O. You also have multiple layers of I/O elevators; one in the host, one in the guest. This may be a situation where you want `deadline` in the host and `noop` in the guest. – ewwhite Oct 30 '12 at 09:40
  • No battery-backed or flash-backed cache, sadly. This is a plain and humble Hetzner box. – Alexander Gladysh Oct 30 '12 at 09:40
  • I'll add that using CFQ will allow you to starve processes with a little more granularity. It also opens up the possibility of using [ionice](http://linux.die.net/man/1/ionice) on a per-process basis; see the sketch after this thread. (`ionice` is not compatible with the other schedulers) – ewwhite Oct 30 '12 at 09:43
  • `deadline` on Xen, `noop` on the VM seems to be the best option. We still have to analyze the data properly, but other combinations trigger 500-1100 502 errors (with stable traffic load), while `deadline`-`noop` triggers 0-600 502s (with the same load). However, it still stalls. Is there anything else we can tune? – Alexander Gladysh Oct 30 '12 at 16:51
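
A minimal sketch of the ionice approach mentioned in the comments above (assumptions: CFQ is the active scheduler, and `pidof redis-server` resolves to a single PID; adjust for your setup):

# put redis-server into the best-effort class at the lowest priority;
# the I/O class is inherited by the BGREWRITEAOF child on fork
sudo ionice -c 2 -n 7 -p "$(pidof redis-server)"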