
I have deployed four Ubuntu 10.04 servers. They are paired two by two in a cluster scenario. On both sides we have software RAID1 disks, DRBD8 and OCFS2, and on top of that stack some KVM machines run with qcow2 disks.

I followed this: Link

Corosync is used only for DRBD and OCFS2; the KVM machines are started "manually".

When it works it is fine: good performance, good I/O. But at some point one of the two clusters started hanging. We then tried with just one server turned on and it hangs the same way. It seems to happen when a heavy READ occurs in one of the virtual machines, i.e. during the rsync backup. When it happens the virtual machines are no longer reachable; the physical server answers pings with a noticeable delay, but there is no console output and no SSH available.

All we can do is force a shutdown (hold the power button) and restart. When it comes back up, the RAID array that DRBD relies on is resyncing. We see this every time it hangs.

After a couple of weeks of pain on one side, this morning the other cluster hung as well, even though it has a different motherboard, RAM and KVM instances. What the two have in common is the heavy-read rsync scenario and Western Digital RAID Edition disks.

Can anybody give me some input on how to solve this issue?

UPDATE: I converted all images from qcow2 to raw and mounted the file systems from within the virtual machines with noatime and nodiratime. I used ionice for rsync, but this morning it hung again while a user was reading a lot of files from a Samba share. Now I am moving the virtual machine images from OCFS2 to ext3, but it really feels like a defeat... any ideas are welcome.
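
For reference, the conversion and mount changes were along these lines (image names and device paths here are just examples, not my real ones):

```
# convert each guest image from qcow2 to raw (example file names)
qemu-img convert -f qcow2 -O raw vm1.qcow2 vm1.raw

# inside each guest, /etc/fstab line mounting the root FS with noatime/nodiratime
# /dev/vda1  /  ext3  defaults,noatime,nodiratime  0  1
```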

1 Answer


It sounds to me like you need to try another storage scheme (though if you use raw preallocated disks for the VMs you will avoid some of the overhead, and you really only need qcow2 if you're using snapshots).
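
For example, something along these lines would give you a fully preallocated raw image (size and path are just placeholders):

```
# preallocate a 50G raw image so blocks aren't allocated lazily on first write
dd if=/dev/zero of=/var/lib/libvirt/images/vm1.img bs=1M count=51200
```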

Do the VMs run stably without the clustering, using only local disks?

Have you tried using ionice to lower the rsync process's I/O priority, so that it doesn't starve everything else?
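
For instance (paths are just an example):

```
# run rsync in the idle I/O scheduling class so it only gets disk time
# when nothing else needs it
ionice -c3 rsync -a /srv/data/ backuphost:/backups/
```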

Have you tested with GFS instead of OCFS2? It might turn out better, and there is a description of it in the guide you linked to.

dyasny
  • Thanks. Concerning the RAW format, it's an activity we have planned and we will do the conversion this evening. – Stefano Annese Nov 18 '11 at 14:11
  • Concerning the use of local disks, do you mean without DRBD and OCFS2, or would it be a good test to use a single server on its own, but with the full stack? – Stefano Annese Nov 18 '11 at 15:03
  • Concerning GFS, it is a bit hard to give it a try now as the system is already in use and I've never used it, but I will think about it. – Stefano Annese Nov 18 '11 at 15:07
  • Without the storage stack: just start a VM using an image on a local disk, or a local LV (see the sketch after these comments). – dyasny Nov 18 '11 at 15:16
  • Most of the space is used by the cluster storage. Now I am going to try qcow2->raw, introducing ionice and noatime inside the virtual machines (an idea I got from googling). If these low-impact modifications don't pay off, I'll add a couple of disks and try with local storage. What about keeping at least the software RAID1? Thanks again for your precious advice. – Stefano Annese Nov 18 '11 at 16:26
  • noatime and nodiratime are indeed a good idea, and in general trimming a VM down to only the required services is always good. As for the rest, the way you describe it points to a storage-layer issue in my opinion, and I'm no DRBD expert. Most of my VMs reside on a centralised SAN, and I prefer to avoid clustered FSes wherever I can afford to. – dyasny Nov 18 '11 at 21:59
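
A minimal sketch of the local-LV test mentioned above (VG name, size and image path are placeholders):

```
# copy one guest image onto a local logical volume and boot it from there,
# bypassing DRBD/OCFS2 entirely
lvcreate -L 20G -n testvm vg0
dd if=/cluster/vm1.raw of=/dev/vg0/testvm bs=1M
kvm -m 1024 -drive file=/dev/vg0/testvm,if=virtio,cache=none -vnc :1
```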