2

I have a simple four-node Oracle VM environment: a management server running in VMware, an NFS server for shared storage, and two Oracle VM servers running the actual hypervisor.

For some reason the node running the pool master service will suddenly reboot for no obvious reason. I'm fairly sure it's a software issue, possibly a cluster watchdog of some sort. Just to be clear, it's the vm server/hypervisor that reboots, not the guest machines.

Has anyone seen similar issues, or does anyone have suggestions as to where I should start looking for the root cause?

I don't see anything suspicious in the /var/log/ovs*/ logs. Is there any other place I should look?

The documentation from Oracle leaves a little something to be desired.

Roy
  • I have the same issue on my Oracle VM installation: two hosts (Intel HP blades) using shared NFS storage, and sometimes the node running the pool master service reboots for no apparent reason. Roy, how did you solve it? –  May 24 '11 at 09:18

3 Answers

1

I'm not sure whether you have the graphs that come with the VM Management. If you do, they provide a decent amount of insight into what the memory, CPU and disks are doing. Perhaps there is some correlation? From there you can start looking at top and ps to see exactly what is running, and in use, when the server bounces.

Also, can you put the servers into debug mode? Do they support that?

I hope this helps get you started at the very least.

lilott8
  • I haven't found any fancy graphs yet, though they might be available with Enterprise Manager when support is added for OVM 2.2. Even something like the VMware web console would be nice. As for using Linux tools such as top, the reboots are rather sudden and unpredictable. I'll see if I can find some sort of debug mode to get more information in the logs. – Roy Nov 12 '09 at 18:22
  • Does the OVM server have any logging capabilities? Maybe your problem is the hardware/OS that is running the VM? – lilott8 Nov 12 '09 at 19:23
  • It's not the guest machines that are rebooting, so yes, the problem is most likely in the OS or cluster software. OVM is a bare-metal hypervisor (i.e. it runs on a rebranded and slightly stripped-down RHEL). It's probably not the hardware, as this only happens to the current pool master server. – Roy Nov 12 '09 at 21:17
1

It turns out the nodes were not communicating correctly because the node's hostname was listed on the loopback address in /etc/hosts. The cluster services would silently force a reboot to protect the shared storage.
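To check a node for this condition, something like the following could be used. It is only a rough sketch: the function names are made up for illustration, and it assumes the usual /etc/hosts layout (address first, then names, with `#` starting a comment). It reports whether a given hostname, or its short form, appears on a 127.x.x.x or ::1 line.

```python
import socket


def names_on_loopback(hosts_text):
    """Return every name that the hosts file content maps to a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, aliases = fields[0], fields[1:]
        if address.startswith("127.") or address == "::1":
            names.update(alias.lower() for alias in aliases)
    return names


def hostname_on_loopback(hosts_text, hostname):
    """True if the hostname (or its short form) is listed on a loopback line."""
    full = hostname.lower()
    short = full.split(".", 1)[0]
    names = names_on_loopback(hosts_text)
    return full in names or short in names


def check_local_host(path="/etc/hosts"):
    """Hypothetical helper: check this machine's own hostname against its hosts file."""
    with open(path) as handle:
        return hostname_on_loopback(handle.read(), socket.gethostname())
```

Running `check_local_host()` on each Oracle VM server and getting `True` would suggest the hostname needs to be moved off the loopback line and onto the server's real address.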

Roy
  • I might add that this is a bug in Oracle VM Server 2.2. The hostname specified during installation of the hypervisor is added to the hosts file entry for the loopback interface. The entry needs to be removed if you wish to use the same hostname when configuring the server in Oracle VM Manager. – Roy Jun 07 '11 at 07:38
0

Are you using OCFS2? If so, increase the OCFS2 timeout in /etc/sysconfig/o2cb.conf.
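To see what the current timeout works out to before changing it, a small sketch like this may help. It rests on assumptions not stated above: that the file uses shell-style `KEY=value` lines, that the relevant setting is `O2CB_HEARTBEAT_THRESHOLD`, and that the disk heartbeat timeout is (threshold - 1) * 2 seconds, as I understand the OCFS2 documentation. Verify all three against your version; the function names are made up for illustration.

```python
def read_o2cb_settings(config_text):
    """Parse shell-style KEY=value lines into a dict, skipping comments and blanks."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def heartbeat_timeout_seconds(threshold):
    """Assumed relationship: a node is fenced after (threshold - 1) * 2 seconds."""
    return (threshold - 1) * 2


def current_timeout(config_text):
    """Return the heartbeat timeout in seconds, or None if the setting is absent."""
    value = read_o2cb_settings(config_text).get("O2CB_HEARTBEAT_THRESHOLD")
    return heartbeat_timeout_seconds(int(value)) if value else None
```

Passing the file's contents to `current_timeout` shows how long a node may miss heartbeats before it is fenced; raising the threshold lengthens that window on every node, so the value should be kept identical across the cluster.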