0

Both the host OS and the guest OS are Debian 8 amd64. The guest is a raw disk image stored at /srv/machines/web.img. The host has a backup script scheduled each night at 1:55am. Here is my backup script:

#!/bin/bash

BACKDIR='/srv/backups'
BACKFILENAME='web.img'

# Let's rotate the backups and delete the oldest one
for i in `seq 8 -1 0` ; do
  if [ -f "$BACKDIR"/"$BACKFILENAME"-$i ] ; then
    mv "$BACKDIR"/"$BACKFILENAME"-$i "$BACKDIR"/"$BACKFILENAME"-$(($i + 1))
  fi
done

rm -f "$BACKDIR"/"$BACKFILENAME"-9

# Hold guest writes in a temp COW file
virsh snapshot-create-as --domain web bsnap --diskspec vda,file=/srv/machines/bsnap.qcow2 --disk-only --atomic --quiesce

# Make backup of the guest RAW image
cp -a /srv/machines/"$BACKFILENAME" "$BACKDIR"/"$BACKFILENAME"-0

# commit guest writes to the usual RAW file
virsh blockcommit web vda --active --verbose --pivot

Zabbix CPU I/O wait graph shows the backup takes about 1 hour. Every night during this time period, but at different times, I get kernel oops (not sure they are actually oops) int the guest vm, like this one:

Jul  4 02:13:59 web kernel: [83400.216115] INFO: task jbd2/dm-0-8:200 blocked for more than 120 seconds.
Jul  4 02:13:59 web kernel: [83400.216485]       Not tainted 3.16.0-4-amd64 #1
Jul  4 02:13:59 web kernel: [83400.216817] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  4 02:13:59 web kernel: [83400.217427] jbd2/dm-0-8     D ffff880efef1dac8     0   200      2 0x00000000
Jul  4 02:13:59 web kernel: [83400.217433]  ffff880efef1d670 0000000000000046 0000000000012f00 ffff880efef87fd8
Jul  4 02:13:59 web kernel: [83400.217437]  0000000000012f00 ffff880efef1d670 ffff880f3fcb37b0 ffff880f3ff7a5e8
Jul  4 02:13:59 web kernel: [83400.217441]  0000000000000002 ffffffff811d7620 ffff880efef87c80 ffff8800bbba8b98
Jul  4 02:13:59 web kernel: [83400.217444] Call Trace:
Jul  4 02:13:59 web kernel: [83400.217454]  [<ffffffff811d7620>] ? generic_block_bmap+0x50/0x50
Jul  4 02:13:59 web kernel: [83400.217460]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jul  4 02:13:59 web kernel: [83400.217464]  [<ffffffff811d762a>] ? sleep_on_buffer+0xa/0x10
Jul  4 02:13:59 web kernel: [83400.217468]  [<ffffffff8151182c>] ? __wait_on_bit+0x5c/0x90
Jul  4 02:13:59 web kernel: [83400.217471]  [<ffffffff811d7620>] ? generic_block_bmap+0x50/0x50
Jul  4 02:13:59 web kernel: [83400.217474]  [<ffffffff815118d7>] ? out_of_line_wait_on_bit+0x77/0x90
Jul  4 02:13:59 web kernel: [83400.217480]  [<ffffffff810a7e90>] ? autoremove_wake_function+0x30/0x30
Jul  4 02:13:59 web kernel: [83400.217498]  [<ffffffffa00c542e>] ? jbd2_journal_commit_transaction+0x175e/0x1950 [jbd2]
Jul  4 02:13:59 web kernel: [83400.217507]  [<ffffffff810a2f21>] ? pick_next_task_fair+0x6e1/0x820
Jul  4 02:13:59 web kernel: [83400.217514]  [<ffffffffa00c8be2>] ? kjournald2+0xb2/0x240 [jbd2]
Jul  4 02:13:59 web kernel: [83400.217518]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jul  4 02:13:59 web kernel: [83400.217524]  [<ffffffffa00c8b30>] ? commit_timeout+0x10/0x10 [jbd2]
Jul  4 02:13:59 web kernel: [83400.217530]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jul  4 02:13:59 web kernel: [83400.217534]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jul  4 02:13:59 web kernel: [83400.217539]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jul  4 02:13:59 web kernel: [83400.217543]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180

As a consequence, some random process gets killed (last night was the database engine...). I'm not skilled enough to decode what actually happened to the guest kernel here, but that 120 seconds with jbd2 in the same line makes me suspect some sort of filesystem timeout is the culprit. Now, assuming my suspect goes in the right direction, I suppose that the host backup script interferes badly with the I/O performance in the guest.

Please note that the host machine has 256GB of RAM memory, the guest image is 180GB in size and nearly the whole guest image is kept into RAM (vmtouch shows figures between 90 and 98% of it cached). I can't tell about the temporary COW file, but, even if it were kept on disk, seems to me 120 seconds are way too much to be justified by software RAID1 6GB/s Enterprise level SATA disks. Moreover, the host OS logs do not show anything that could make me suspect of disk I/O performance problems. Server load, at that time, is what the backup script does only and nothing else.

Is there anything in my backup script that can cause such problems?

Lucio Crusca
  • 330
  • 2
  • 10
  • 31
  • `Jul 4 02:13:59 web kernel: [83400.217524] [] ? commit_timeout+0x10/0x10 [jbd2]` - this looks bad. Can you show us: `cat /proc/mdstat`, `mdadm --detail /dev/md*X*` and `smartctl -x /dev/sd*X*`- from both disks? Let's start from the bottom layer first. Do you have swap space there? – Michal Sokolowski Jul 05 '16 at 07:23
  • I'm not sure I understand. The error is from the guest kernel, while software RAID 1 and SMART is in the host system. Do you really need that? – Lucio Crusca Jul 05 '16 at 07:43
  • Lucio, I miss understood, I thought it was host's kernel "oops". Anyway, let's check it. I'll analyze your backup script. – Michal Sokolowski Jul 05 '16 at 07:56
  • Personally I don't use libvirtd snapshot feature, I use LVM snapshots in host to do that (with great success). If you're using Debian stable, then libvirtd, kernel, qemu-kvm are ancient. You could try to install newer `linux-image-4.6`, `libvirtd` and `qemu-x86` from jessie-backports. It works in production well enough. – Michal Sokolowski Jul 05 '16 at 08:06
  • I've already checked mdstat and smart, it's all ok. I'd prefer to avoid reporting that informations in my question because I don't think it's useful and it could mislead others. – Lucio Crusca Jul 05 '16 at 08:33
  • It's your call. – Michal Sokolowski Jul 05 '16 at 09:34

0 Answers0